1. Check if files equal

What is the fastest way of checking if two very large files (~500 MB) are
equal?
I was thinking about this:
-name
-size
-date last modified
-pick about 10 random positions and check if bytes at those positions in
both files match.

Is there any better and faster way that I'm not aware of?

Tone Škoda


2. Re: Check if files equal

On 7 Jul 2002, at 1:51, 10963508 at europeonline.com wrote:

> 
> What is the fastest way of checking if two very large files (~500 MB) are
> equal?
> I was thinking about this:
> -name
> -size
> -date last modified
> -pick about 10 random positions and check if bytes at those positions in
> both files match.

I would not trust those tests at all.

> Is there any better and faster way that I'm not aware of?

Open file
while not eof do
Read them in, one buffer size at a time, 
compare, 
if not equal { tell me it's not equal, abort}
end while
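
A minimal Python sketch of the loop above. It is an illustration only, not code from the thread; the name `files_equal` and the 1 MiB buffer size are assumptions.

```python
def files_equal(path_a, path_b, chunk_size=1024 * 1024):
    """Return True if both files have identical contents."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first mismatch: stop reading immediately
            if not chunk_a:
                return True   # both files ended at the same point
```

Because the function returns at the first differing buffer, unequal files are usually rejected long before the end is reached; only equal files cost a full read.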

Kat


3. Re: Check if files equal

10963508 at europeonline.com wrote:
> 
> What is the fastest way of checking if two very large files (~500 MB) are
> equal?
> I was thinking about this:
> -name
> -size
> -date last modified
> -pick about 10 random positions and check if bytes at those positions in
> both files match.
> 
> Is there any better and faster way that I'm not aware of?
>

I think the fastest way to make sure the contents are exactly equal is to
use the Windows command "FC /B filename_1 filename_2"; the /B means binary
comparison. "FC /?" will list all available parameters.
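
Where FC is not available, a ready-made comparison from a standard library does the same job. A small sketch using Python's `filecmp` module follows; the wrapper name `same_contents` is an assumption for illustration.

```python
import filecmp

def same_contents(path_a, path_b):
    # shallow=False forces a comparison of the file contents instead of
    # trusting the os.stat() signature (file type, size, modification time).
    return filecmp.cmp(path_a, path_b, shallow=False)
```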

Have a nice day, Rolf


4. Re: Check if files equal

Hi, Kat wrote:

> On 7 Jul 2002, at 1:51, 10963508 at europeonline.com wrote:
                          ^^^^^^^^
It would be nice, to see a name here. (Just my opinion.)

>> What is the fastest way of checking if two very large files (~500 MB) are
>> equal?

I assume you mean equal content, not equal name, equal date, ..

>> I was thinking about this:
>> -name

Name doesn't matter concerning the content.

>> -size
>> -date last modified

Date doesn't matter concerning the content.

>> -pick about 10 random positions and check if bytes at those positions in
>> both files match.

> I would not trust those tests at all.

First, I would compare the size of the files, this is very fast.
Whether this comparison can be trusted or not depends on its result!
If both files don't have the same size, it's 100% sure that they are
not equal. If they have the same size, further testing is needed.
The same logic goes for CRC tests and the comparison of random bytes.
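
This one-sided logic can be sketched in Python: a cheap test may prove the files different, but it can never prove them equal. The function name `definitely_different` and the `samples` and `seed` parameters are assumptions made for this illustration.

```python
import os
import random

def definitely_different(path_a, path_b, samples=10, seed=None):
    """Return True if the files certainly differ, False if still unknown."""
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return True                  # different sizes: cannot be equal
    if size == 0:
        return False                 # two empty files: nothing to sample
    rng = random.Random(seed)
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        for _ in range(samples):
            pos = rng.randrange(size)
            fa.seek(pos)
            fb.seek(pos)
            if fa.read(1) != fb.read(1):
                return True          # a sampled byte differs
    return False                     # inconclusive: a full compare is needed
```

A result of False only means that no difference was found in the size or in the sampled bytes.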

>> Is there any better and faster way that I'm not aware of?

I think it would be best to first run some _fast_ tests that will
detect unequal files with some probability. (I don't know how fast CRC
testing is.)
Then, if these tests didn't prove that the files are unequal, more
precise tests must follow. Of course, the most precise test is this:

> Open file
> while not eof do
> Read them in, one buffer size at a time, 
> compare, 
> if not equal { tell me it's not equal, abort}
> end while

> Kat

Best regards,
   Juergen


5. Re: Check if files equal

On 7 Jul 2002, at 11:21, Juergen Luethje wrote:

> 
> Hi, Kat wrote:
> 
> > On 7 Jul 2002, at 1:51, 10963508 at europeonline.com wrote:
>                           ^^^^^^^^
> It would be nice, to see a name here. (Just my opinion.)

This isn't my problem, you know. 

> The same logic goes for CRC tests and the comparison of random bytes.

To get a CRC, you need to perform math on all the bytes. How will you get 
that math done, if you don't read all the bytes?
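
To illustrate the point, here is a sketch of a streaming CRC-32 in Python using `zlib.crc32`. The function name `file_crc32` and the buffer size are assumptions. Every byte of the file passes through the loop before a result exists.

```python
import zlib

def file_crc32(path, chunk_size=1024 * 1024):
    """Compute the CRC-32 of a file, one buffer at a time."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)  # fold this buffer into the running CRC
    return crc
```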

Kat


6. Re: Check if files equal

Kat wrote:

> On 7 Jul 2002, at 11:21, Juergen Luethje wrote:

>> Hi, Kat wrote:
>> 
>> > On 7 Jul 2002, at 1:51, 10963508 at europeonline.com wrote:
>>                           ^^^^^^^^
>> It would be nice, to see a name here. (Just my opinion.)

> This isn't my problem, you know. 

I know that, of course. Sorry if it looked as if I was meaning you!
I wrote it there because I assume, that the one who starts a thread,
reads all the posts in it.


When you want to discuss about a text, please don't snip the decisive
part away! Here it is again, what I wrote in my previous post:
---------------------------------------------------------------------
If both files don't have the same size, it's 100% sure that they are
not equal. If they have the same size, further testing is needed.
---------------------------------------------------------------------

>> The same logic goes for CRC tests and the comparison of random bytes.

> To get a CRC, you need to perform math on all the bytes.

I know.

> How will you get that math done, if you don't read all the bytes?

I didn't write that. In the above text between the two lines, just
replace "size" with "CRC", and then you'll see what I mean.

> Kat

Regards,
   Juergen


7. Re: Check if files equal

On 7 Jul 2002, at 21:08, Juergen Luethje wrote:

>
> Kat wrote:
>
> > On 7 Jul 2002, at 11:21, Juergen Luethje wrote:
>
> >> Hi, Kat wrote:
> >>
> >> > On 7 Jul 2002, at 1:51, 10963508 at europeonline.com wrote:
> >>                           ^^^^^^^^
> >> It would be nice, to see a name here. (Just my opinion.)
>
> > This isn't my problem, you know.
>
> I know that, of course. Sorry if it looked as if I was meaning you!
> I wrote it there because I assume, that the one who starts a thread,
> reads all the posts in it.
>
>
> When you want to discuss about a text, please don't snip the decisive
> part away!

I replied to what i wished to reply to.

> Here it is again, what I wrote in my previous post:
> ---------------------------------------------------------------------
> If both files don't have the same size, it's 100% sure that they are
> not equal. If they have the same size, further testing is needed.
> ---------------------------------------------------------------------
>
> >> The same logic goes for CRC tests and the comparison of random bytes.
>
> > To get a CRC, you need to perform math on all the bytes.
>
> I know.
>
> > How will you get that math done, if you don't read all the bytes?
>
> I didn't write that. In the above text between the two lines, just
> replace "size" with "CRC", and then you'll see what I mean.

Ok, i don't know who wrote what now, and i deleted all the previous emails
on this thread so i can't check, and i don't feel like wasting any more
time with it, but it appeared to me as though someone was saying a CRC
check was an alternative to reading in the files. Obviously, if you read
in the files, you can abort with a "not equal" error long before you
perform the CRC calculations, which you'd do at the end of the files.

Kat


8. Re: Check if files equal

Files are not databases, they are .zip .avi and .mp3 files mainly (stuff
coming down from satellite) - so they are compressed in some way.
Speed is more important than accuracy.

Tone Skoda

----- Original Message -----
From: "Derek Parnell" <Derek.Parnell at SYD.RABOBANK.COM>
To: "EUforum" <EUforum at topica.com>
Subject: RE: Check if files equal


>
> Hi all,
> in the original question put to the list by Tone, he mentioned 500MB+
> files. I'm guessing that these are databases, given the size. If so,
> checking the bytes from the front of the files is probably a good idea
> because a lot of database systems keep pointers and stamps near the front
> of the database file. Thus, even if the file sizes haven't been changed, a
> short scan will probably find a changed stamp or pointer. If you get about
> 25% through the files and haven't found a mismatch, you might take a risk
> that they are the same - or you might like to check the last 25% too. Just
> a thought.
>
> ---------
> Derek.


9. Re: Check if files equal


You wrote on 7/7/02 5:53:05 PM:

>
>Files are not databases, they are .zip .avi and .mp3 files mainly (stuff
>coming down from satellite) - so they are compressed in some way.
>Speed is more important than accuracy.
>
>Tone Skoda
>

Some thoughts:
1) Compare long words rather than words or bytes.
2) Reduce disk latency. If possible, read one file entirely into
   memory before starting. If not possible, fill most of memory
   with one file, then read comparatively small chunks of the
   other.
3) It may be useful to use non-blocking calls to the read routine
   so you can compare one buffer while reading the next. More
   importantly, this may prevent the disk from having to do a
   full rotation between reads.
4) Have the two files on different disks!
5) Have the two files on four disks!
6) You could use assembly for the comparison routines, and optimize
   for the processors' multiple execution units, but that is likely to
   be swamped by disk traffic.
7) Inside knowledge of the format might allow comparison of just a CRCC.
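
Point 3 can be approximated in Python with a small thread pool: the next pair of buffers is requested before the current pair is compared. This is a rough sketch, not a tuned implementation. The name `files_equal_overlapped` and the 4 MiB buffer are assumptions, and whether it is actually faster depends on the disks and on operating-system caching.

```python
from concurrent.futures import ThreadPoolExecutor

def files_equal_overlapped(path_a, path_b, chunk_size=4 * 1024 * 1024):
    """Compare two files while the next buffers are read in the background."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb, \
            ThreadPoolExecutor(max_workers=2) as pool:
        # Start reading the first buffer of each file.
        next_a = pool.submit(fa.read, chunk_size)
        next_b = pool.submit(fb.read, chunk_size)
        while True:
            chunk_a, chunk_b = next_a.result(), next_b.result()
            if not chunk_a and not chunk_b:
                return True   # both files ended without a mismatch
            # Queue the following reads before comparing the current buffers.
            next_a = pool.submit(fa.read, chunk_size)
            next_b = pool.submit(fb.read, chunk_size)
            if chunk_a != chunk_b:
                return False  # mismatch; pending reads finish on exit
```

Each file is read by only one worker at a time, so the reads stay in order. With the files on different disks (point 4), the two reads can also proceed in parallel.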

Karl Bochert



10. Re: Check if files equal

Hi Tone,

----------
> From: 10963508 at europeonline.com
> To: EUforum <EUforum at topica.com>
> Subject: Check if files equal
> Date: 7 July 2002, 3:51
> 
> What is the fastest way of checking if two very large files (~500 MB) are
> equal?
> I was thinking about this:
> -name
> -size
> -date last modified
> -pick about 10 random positions and check if bytes at those positions in
> both files match.
> 
> Is there any better and faster way that I'm not aware of?
> 
> Tone Škoda
> 

There is a good program by RDS:

http://www.RapidEuphoria.com/dupfile.zip

Regards,
Igor Kachan
kinz at peterlink.ru


11. Re: Check if files equal

Igor Kachan writes:
> There is a good program by RDS:
>
> http://www.RapidEuphoria.com/dupfile.zip
>
Ricardo Forno writes:
> Does someone know who programmed it? There is no information 
> about it in the program itself.

Since you seem to like it, I'll step forward and take the credit.

dupfile.exw finds all sets of identical files within a directory (and
subdirectories), or between two directories and their subdirectories.
Whenever I use it, it seems to run very fast.
I don't think writing it in C will help much,
and of course you can always use the E to C Translator on it.
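
For readers without Euphoria, the general idea of finding sets of identical files can be sketched in Python. This is not how dupfile.exw is implemented; grouping by size first and then by MD5 digest is an assumption chosen for brevity, and the name `find_duplicate_sets` is hypothetical.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicate_sets(root):
    """Return lists of paths under root whose contents appear identical."""
    by_size = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            digest = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    digest.update(chunk)
            by_digest[digest.hexdigest()].append(path)
        duplicates.extend(g for g in by_digest.values() if len(g) > 1)
    return duplicates
```

Matching digests make equality very likely rather than certain, so a byte-by-byte comparison of each reported set would be needed for a strict guarantee.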

Regards,
   Rob Craig
   Rapid Deployment Software
   http://www.RapidEuphoria.com


12. Re: Check if files equal

On Mon, 8 Jul 2002 11:37:33 +0800, Derek Parnell <Derek.Parnell at SYD.RABOBANK.COM> wrote:

<snip>

The trick will be to create a checksum that is truly representative of the
whole file. You'll probably need a 32-bit checksum for each 2^32 bits
(536,870,912 bytes).

Just my $0.02: I'd use md5.

There's a dos/windows version here: http://www.fourmilab.ch/md5/ and linux, DLL, & more here: http://userpages.umbc.edu/mabzug1/cs/md5/md5.html
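
A digest can also be computed without an external tool. The sketch below uses Python's `hashlib`; the function name `file_md5` and the buffer size are assumptions.

```python
import hashlib

def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 digest of a file as a hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size buffers until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A digest is most useful when the two files are not on the same machine, or when one digest is stored and checked against later. For two local files, a direct comparison reads the same number of bytes and can stop at the first difference.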


13. Re: Check if files equal

Ricardo Forno writes:
> Actually, the program finds equal files on as 
> many directories you want, not only two. 
> Since you programmed it, you should know better... blink

Yes, thanks. I forgot that I generalized it.

> Then, when processing the second group, the 'while' 
> can stop due to unequal length *before* some other 
> member of the second group is reached. Attached
> you'll find a correction to this bug.

Thanks. Your fix looks essentially correct.
I'll study it / test it a bit more, then upload the corrected version.

Regards,
   Rob Craig
   Rapid Deployment Software
   http://www.RapidEuphoria.com

