1. RE: Check if files equal

Here is a routine that might be useful to someone...

---------------
include file.e

-- Returns true (1) if the two files contain the same data, otherwise it
-- returns false (0).
-- Parameters: 
--    1: sequence : The path and name of a file
--    2: sequence : The path and name of another file
function fileEqual(sequence pFileA, sequence pFileB)
    integer lhA, lhB
    integer lcA, lcB
    object  ldA, ldB
    
    -- First check that they exist and that they are the same size.
    ldA = dir(pFileA)
    ldB = dir(pFileB)
    
    if atom(ldA) or atom(ldB) or ldA[1][D_SIZE] != ldB[1][D_SIZE] then
        return 0
    end if

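    -- Now compare each byte, starting from the first byte.
    -- open() returns -1 on failure, so check both handles first.
    lhA = open(pFileA, "rb")
    if lhA = -1 then
        return 0
    end if
    lhB = open(pFileB, "rb")
    if lhB = -1 then
        close(lhA)
        return 0
    end if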
    
    lcA = 0
    lcB = 0

    -- Stop comparing as soon as a mismatch or EOF is found.    
    while lcA = lcB and lcA != -1 do
        lcA = getc(lhA)
        lcB = getc(lhB)
    end while

    close(lhA)
    close(lhB)

    -- if we end up with EOF in both files, they must be equal.        
    return (lcA = -1) and (lcB = -1)
    
end function    
-------------
Derek.

==================================================================
The information contained in this message may be confidential 
and is intended to be exclusively for the addressee. Should you 
receive this message unintentionally, please do not use the contents 
herein and notify the sender immediately by return e-mail.


==================================================================


2. RE: Check if files equal

Hi all,
in the original question put to the list by Tone, he mentioned 500MB+ files.
I'm guessing that these are databases, given the size. If so, checking the
bytes from the front of the files is probably a good idea because a lot of
database systems keep pointers and stamps near the front of the database
file. Thus, even if the file sizes haven't been changed, a short scan will
probably find a changed stamp or pointer. If you get about 25% through the
files and haven't found a mismatch, you might take a risk that they are the
same - or you might like to check the last 25% too. Just a thought.

---------
Derek.
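
The partial scan suggested above could be sketched roughly as follows. This is a Python illustration rather than Euphoria, and the function name `partial_scan_equal` and its `fraction` and `chunk_size` parameters are assumptions made up for the example. A True result only means that no difference was found in the scanned regions.

```python
import math
import os


def partial_scan_equal(path_a, path_b, fraction=0.25, chunk_size=64 * 1024):
    """Compare only the first and last `fraction` of two files.

    Returns False on a size mismatch or on any differing byte in the
    scanned regions. True means "no difference found", not "identical".
    """
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False
    # Number of bytes to scan at each end, never more than the file holds.
    span = min(size, math.ceil(size * fraction))
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        # Scan the head first, then the tail.
        for start in (0, size - span):
            fa.seek(start)
            fb.seek(start)
            remaining = span
            while remaining > 0:
                n = min(chunk_size, remaining)
                block_a = fa.read(n)
                block_b = fb.read(n)
                if block_a != block_b:
                    return False
                if not block_a:
                    break  # unexpected EOF, e.g. a file shrank mid-scan
                remaining -= n
    return True
```

With the default `fraction=0.25`, a change in the middle half of the file goes unnoticed, which is exactly the risk described in the message.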


3. RE: Check if files equal

So I guess the situation is that you have an incoming file and you want to
see if you already have that file? 

If this guess is right, then try this algorithm:

  Calculate a checksum for the new file.
  For each file you already have:
     if the new checksum is the same as an existing file's checksum then
        reject the new file as a duplicate.
     end if
  end for
 
  If no existing checksum matched the new one,
     add the new file to your storage,
     keep its checksum to compare when you get other new files.
  end if


This way, you only ever calculate a file's checksum once, and from then on,
you only compare checksums, which is fairly fast.

The trick will be to create a checksum that is truly representative of the
whole file. You'll probably need a 32-bit checksum for each 2^32 bits
(536,870,912 bytes).

-------
Derek.
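
A minimal Python sketch of the checksum scheme described above, under some assumptions: CRC-32 from the standard `zlib` module stands in for "a checksum" (the message does not name one), and `file_checksum`, `FileStore` and `add_if_new` are hypothetical names invented for the illustration.

```python
import zlib


def file_checksum(path, chunk_size=1 << 20):
    """Return the CRC-32 of the whole file, read one chunk at a time."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)  # continue from the running value
    return crc & 0xFFFFFFFF


class FileStore:
    """Hypothetical store that remembers one checksum per accepted file."""

    def __init__(self):
        self.known = {}  # checksum -> path of the file that produced it

    def add_if_new(self, path):
        """Return True if the file was accepted, False if it was rejected
        because its checksum matches one already stored."""
        checksum = file_checksum(path)
        if checksum in self.known:
            return False
        self.known[checksum] = path
        return True
```

Each file is read once, when it arrives. Every later duplicate check is a dictionary lookup on the stored checksums.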


> -----Original Message-----
> From: 10963508 at europeonline.com [mailto:10963508 at europeonline.com]
> Sent: Monday, 8 July 2002 12:32
> To: EUforum
> Subject: Re: Check if files equal
>
> Files are not databases; they are mainly .zip, .avi and .mp3 files (stuff
> coming down from satellite), so they are compressed in some way.
> Speed is more important than accuracy.
> 
> Tone Skoda

4. RE: Check if files equal

Kat's test is very reliable, but terribly slow.

I think it is enough to check the last 50 bytes of the file, the beginning,
and ten random bytes in the middle. Be sure to use seek() and where() to get
the correct file size.

jordah
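
The sampling idea above might look like this in Python. It is a sketch under assumptions: `sample_equal`, `file_size`, `edge`, `samples` and `seed` are hypothetical names, and using 50 bytes for the beginning as well as the end is a guess, since the message only gives a count for the end. The size is obtained by seeking to the end and asking for the position, which mirrors the seek()/where() advice.

```python
import os
import random


def file_size(f):
    """Size of an open binary file: seek to the end, then read the position."""
    f.seek(0, os.SEEK_END)
    return f.tell()


def sample_equal(path_a, path_b, edge=50, samples=10, seed=None):
    """Spot-check two files: first and last `edge` bytes plus `samples`
    single bytes at random offsets. True means no difference was seen."""
    rng = random.Random(seed)
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        size = file_size(fa)
        if size != file_size(fb):
            return False
        if size == 0:
            return True
        edge = min(edge, size)
        # (offset, length) pairs to probe in both files.
        probes = [(0, edge), (size - edge, edge)]
        probes += [(rng.randrange(size), 1) for _ in range(samples)]
        for offset, length in probes:
            fa.seek(offset)
            fb.seek(offset)
            if fa.read(length) != fb.read(length):
                return False
    return True
```

This touches at most a few hundred bytes per file, so it is fast, but any difference outside the probed offsets is missed.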

Kat wrote:
> On 7 Jul 2002, at 1:51, 10963508 at europeonline.com wrote:
> 
> > 
> > What is the fastest way of checking if two very large files (~500 MB)
> > are equal?
> > I was thinking about this:
> > -name
> > -size
> > -date last modified
> > -pick about 10 random positions and check if bytes at those positions in
> > both files match.
> 
> I would not trust those tests at all.
> 
> > Is there any better and faster way that I'm not aware of?
> 
> Open file
> while not eof do
> Read them in, one buffer size at a time, 
> compare, 
> if not equal { tell me it's not equal, abort}
> end while
> 
> Kat
> 
>
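
Kat's buffered loop, quoted above, can be sketched in Python as follows. The name `files_equal` and the 1 MB default for `chunk_size` are assumptions for the example. Python's standard library offers `filecmp.cmp(a, b, shallow=False)` for the same job.

```python
def files_equal(path_a, path_b, chunk_size=1 << 20):
    """Compare two files one buffer at a time, stopping at the first
    mismatch. Returns True only if every byte matches."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(chunk_size)
            block_b = fb.read(chunk_size)
            if block_a != block_b:
                return False  # differing data, or one file ended early
            if not block_a:
                return True   # both files reached EOF together
```

Reading a buffer at a time avoids one call per byte, which is where a getc()-style loop spends most of its time on large files.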


5. RE: Check if files equal

Igor:
Thanks! You saved me some effort. I was just planning to develop exactly this
program. There are some similar ones on the web, but they are either not good
enough or not free, at least the ones I know of. Surely, being programmed in
C or some other compiled language, they should be faster.
Does anyone know who programmed it? There is no information about it in the
program itself.
Regards.
----- Original Message -----
From: Igor Kachan <kinz at peterlink.ru>
Subject: Re: Check if files equal



Hi Tone,

----------
> From: 10963508 at europeonline.com
> To: EUforum <EUforum at topica.com>
> Subject: Check if files equal
> Date: 7 July 2002, 3:51
>
> What is the fastest way of checking if two very large files (~500 MB) are
> equal?
> I was thinking about this:
> -name
> -size
> -date last modified
> -pick about 10 random positions and check if bytes at those positions in
> both files match.
>
> Is there any better and faster way that I'm not aware of?
>
> Tone Škoda
>

There is a good program by RDS:

http://www.RapidEuphoria.com/dupfile.zip

Regards,
Igor Kachan
kinz at peterlink.ru


6. RE: Check if files equal


Rob:
Actually, the program finds equal files in as many directories as you want,
not only two. Since you programmed it, you should know better... ;-)
I found a problem with this good program:
It has an elusive bug. When trying it, I discovered that I had four equal
files somewhere, but the program found two groups of two equal files. After
scratching my head for a while, I traced the bug to this circumstance:
assume you have several files with the same length, and among them there is
a group of, say, 3 equal files and another group of, say, 4 equal files. When
processing the first group found, the lengths of its equal files are set to
-1. Then, when processing the second group, the 'while' can stop due to
unequal length *before* some other member of the second group is reached.
Attached you'll find a correction to this bug.
Regards.
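
To illustrate the situation described above, here is a hedged Python sketch of grouping same-length files into sets of identical files. It is not the code of dupfile.exw nor the attached correction. The name `duplicate_groups` is hypothetical, and the content comparison is delegated to the standard `filecmp.cmp` with `shallow=False`.

```python
import filecmp
import os
from collections import defaultdict


def duplicate_groups(paths):
    """Return lists of byte-identical files found among `paths`."""
    # Files of different sizes can never be equal, so bucket by size first.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        classes = []  # each class is a list of identical files
        for p in same_size:
            # Compare against one representative of every class so far,
            # so interleaved groups of the same length stay separate
            # and no member is skipped.
            for cls in classes:
                if filecmp.cmp(cls[0], p, shallow=False):
                    cls.append(p)
                    break
            else:
                classes.append([p])
        groups.extend(c for c in classes if len(c) > 1)
    return groups
```

Because every file in a size bucket is checked against every class found so far, a group of 3 and a group of 4 files of the same length are both reported in full, whatever order the files appear in.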
----- Original Message -----
From: Robert Craig <rds at RapidEuphoria.com>
To: EUforum <EUforum at topica.com>
Sent: Wednesday, July 10, 2002 2:46 AM
Subject: Re: Check if files equal


>
> Igor Kachan writes:
> > There is a good program by RDS:
> >
> > http://www.RapidEuphoria.com/dupfile.zip
> >
> Ricardo Forno writes:
> > Does someone know who programmed it? There is no information
> > about it in the program itself.
>
> Since you seem to like it, I'll step forward and take the credit.
>
> dupfile.exw finds all sets of identical files within a directory (and
> subdirectories), or between two directories and their subdirectories.
> Whenever I use it, it seems to run very fast.
> I don't think writing it in C will help much,
> and of course you can always use the E to C Translator on it.
>
> Regards,
>    Rob Craig
>    Rapid Deployment Software
>    http://www.RapidEuphoria.com
>
>
>
>

