1. RE: Check if files equal
Here is a routine that might be useful to someone...
---------------
include file.e  -- dir(), D_SIZE

-- Returns true (1) if the two files contain the same data, otherwise it
-- returns false (0).
-- Parameters:
--   1: sequence : The path and name of a file
--   2: sequence : The path and name of another file
function fileEqual(sequence pFileA, sequence pFileB)
    integer lhA, lhB    -- file handles
    integer lcA, lcB    -- current byte read from each file
    object ldA, ldB     -- dir() results

    -- First check that they exist and that they are the same size.
    ldA = dir(pFileA)
    ldB = dir(pFileB)
    if atom(ldA) or atom(ldB) or ldA[1][D_SIZE] != ldB[1][D_SIZE] then
        return 0
    end if

    -- Now compare each byte, starting from the first byte.
    lhA = open(pFileA, "rb")
    lhB = open(pFileB, "rb")
    if lhA = -1 or lhB = -1 then
        -- open() returns -1 on failure; treat unreadable files as unequal.
        if lhA != -1 then close(lhA) end if
        if lhB != -1 then close(lhB) end if
        return 0
    end if
    lcA = 0
    lcB = 0
    -- Stop comparing as soon as a mismatch or EOF is found.
    while lcA = lcB and lcA != -1 do
        lcA = getc(lhA)
        lcB = getc(lhB)
    end while
    close(lhA)
    close(lhB)
    -- If we end up with EOF in both files, they must be equal.
    return (lcA = -1) and (lcB = -1)
end function
-------------
Derek.
==================================================================
De informatie opgenomen in dit bericht kan vertrouwelijk zijn en
is uitsluitend bestemd voor de geadresseerde. Indien u dit bericht
onterecht ontvangt wordt u verzocht de inhoud niet te gebruiken en
de afzender direct te informeren door het bericht te retourneren.
==================================================================
The information contained in this message may be confidential
and is intended to be exclusively for the addressee. Should you
receive this message unintentionally, please do not use the contents
herein and notify the sender immediately by return e-mail.
==================================================================
2. RE: Check if files equal
Hi all,
In the original question put to the list, Tone mentioned 500MB+ files.
I'm guessing that these are databases, given the size. If so, checking the
bytes from the front of the files is probably a good idea, because a lot of
database systems keep pointers and stamps near the front of the database
file. Thus, even if the file sizes haven't changed, a short scan will
probably find a changed stamp or pointer. If you get about 25% of the way
through the files and haven't found a mismatch, you might take the risk of
assuming they are the same, or you might like to check the last 25% too.
Just a thought.
---------
Derek.
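
The partial scan suggested above could be sketched roughly as follows. This is only an illustration, not code from the thread: the names rangeEqual and edgesEqual, the 4096-byte chunk size, and the pFraction parameter are all assumptions made for the example.

```euphoria
include file.e  -- dir(), seek(), D_SIZE
include get.e   -- get_bytes()

constant CHUNK = 4096  -- bytes read per call; an arbitrary choice

-- Compare pCount bytes of two open files, starting at offset pStart.
-- (Hypothetical helper, not part of any library.)
function rangeEqual(integer fhA, integer fhB, atom pStart, atom pCount)
    sequence bufA, bufB
    integer want
    if seek(fhA, pStart) != 0 or seek(fhB, pStart) != 0 then
        return 0
    end if
    while pCount > 0 do
        want = CHUNK
        if pCount < CHUNK then
            want = floor(pCount)
        end if
        bufA = get_bytes(fhA, want)
        bufB = get_bytes(fhB, want)
        if length(bufA) != want or not equal(bufA, bufB) then
            return 0
        end if
        pCount = pCount - want
    end while
    return 1
end function

-- Returns 1 if the sizes match and the first and last pFraction of the
-- bytes match. The middle of the files is NOT examined.
function edgesEqual(sequence pFileA, sequence pFileB, atom pFraction)
    object ldA, ldB
    integer fhA, fhB, result
    atom size, span
    ldA = dir(pFileA)
    ldB = dir(pFileB)
    if atom(ldA) or atom(ldB) or ldA[1][D_SIZE] != ldB[1][D_SIZE] then
        return 0
    end if
    size = ldA[1][D_SIZE]
    span = floor(size * pFraction)
    fhA = open(pFileA, "rb")
    fhB = open(pFileB, "rb")
    result = 0
    if fhA != -1 and fhB != -1 then
        result = rangeEqual(fhA, fhB, 0, span) and
                 rangeEqual(fhA, fhB, size - span, span)
    end if
    if fhA != -1 then close(fhA) end if
    if fhB != -1 then close(fhB) end if
    return result
end function
```

A call such as edgesEqual("a.dat", "b.dat", 0.25) would check the first and last quarter only, so a difference confined to the middle half of the files goes unnoticed. That is the risk mentioned in the post.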
3. RE: Check if files equal
So I guess the situation is that you have an incoming file and you want to
see if you already have that file?
If this guess is right, then try this algorithm:

  Calculate a checksum for the new file.
  For each file you already have:
    if the new checksum is the same as an existing file's checksum then
      reject the new file as a duplicate.
    end if
  end for
  If no existing checksum matched the new one,
    add the new file to your storage,
    keep its checksum to compare when you get other new files.
  end if

This way, you only ever calculate a file's checksum once, and from then on,
you only compare checksums, which is fairly fast.
The trick will be to create a checksum that is truly representative of the
whole file. You'll probably need a 32-bit checksum for each 2^32 bits
(536,870,912 bytes).
-------
Derek.
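
As a rough sketch of the checksum bookkeeping described above: the example below uses Adler-32 as the checksum, which is a choice made here for illustration and not something the post specifies. It also keeps a single 32-bit value per file rather than one per 2^32 bits. The names fileChecksum, isDuplicate and knownSums are hypothetical.

```euphoria
include get.e   -- get_bytes()

constant CHUNK = 4096, MOD_ADLER = 65521

-- Adler-32 checksum of a whole file, or -1 if the file cannot be opened.
function fileChecksum(sequence pFile)
    integer fh, a, b
    sequence buf
    fh = open(pFile, "rb")
    if fh = -1 then
        return -1
    end if
    a = 1
    b = 0
    buf = get_bytes(fh, CHUNK)
    while length(buf) > 0 do
        for i = 1 to length(buf) do
            a = remainder(a + buf[i], MOD_ADLER)
            b = remainder(b + a, MOD_ADLER)
        end for
        buf = get_bytes(fh, CHUNK)
    end while
    close(fh)
    return b * 65536 + a
end function

-- Checksums of the files already in storage (assumed to live in memory).
sequence knownSums
knownSums = {}

-- Returns 1 if pFile's checksum is already on record (a probable
-- duplicate); otherwise records the checksum and returns 0.
-- Unreadable files are reported as 0 and nothing is recorded.
function isDuplicate(sequence pFile)
    atom cs
    cs = fileChecksum(pFile)
    if cs = -1 then
        return 0
    end if
    if find(cs, knownSums) then
        return 1
    end if
    knownSums = append(knownSums, cs)
    return 0
end function
```

Two different files can share a checksum, so a 1 from isDuplicate means "probably a duplicate", not "certainly". Where that matters, a byte-by-byte comparison of the two candidates would settle it.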
> -----Original Message-----
> From: 10963508 at europeonline.com [mailto:10963508 at europeonline.com]
> Sent: Monday, 8 July 2002 12:32
> To: EUforum
> Subject: Re: Check if files equal
>
> Files are not databases, they are .zip .avi and .mp3 files mainly (stuff
> coming down from satellite) - so they are compressed in some way.
> Speed is more important than accuracy.
>
> Tone Skoda
>
> ----- Original Message -----
> From: "Derek Parnell" <Derek.Parnell at SYD.RABOBANK.COM>
> To: "EUforum" <EUforum at topica.com>
> Sent: Monday, July 08, 2002 2:07 AM
> Subject: RE: Check if files equal
>
> > Hi all,
> > in the original question put to the list by Tone, he mentioned 500MB+
> > files. I'm guessing that these are databases, given the size. If so,
> > checking the bytes from the front of the files is probably a good idea
> > because a lot of database systems keep pointers and stamps near the
> > front of the database file. Thus, even if the file sizes haven't been
> > changed, a short scan will probably find a changed stamp or pointer.
> > If you get about 25% through the files and haven't found a mismatch,
> > you might take a risk that they are the same - or you might like to
> > check the last 25% too. Just a thought.
> >
> > ---------
> > Derek.
4. RE: Check if files equal
Kat's test is very effective, but terribly slow.
I think checking the last 50 bytes of the file, the beginning, and
ten random bytes in the middle would do. Be sure to use seek() and where()
to get the correct file size.
jordah
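
A minimal sketch of that sampling idea, under some assumptions: "the beginning" is taken to mean the first 50 bytes as well, and the names fileSize, samePiece and sampleEqual are made up for this example. The size is found with seek() and where() as suggested.

```euphoria
include file.e  -- seek(), where()
include get.e   -- get_bytes()

constant EDGE = 50, SAMPLES = 10  -- figures taken from the post above

-- Size of an open file: seek(fh, -1) moves to end of file and
-- where() then reports the position. Returns -1 if the seek fails.
function fileSize(integer fh)
    if seek(fh, -1) != 0 then
        return -1
    end if
    return where(fh)
end function

-- 1 if both files hold the same n bytes at offset pos, else 0.
function samePiece(integer fhA, integer fhB, atom pos, integer n)
    if seek(fhA, pos) != 0 or seek(fhB, pos) != 0 then
        return 0
    end if
    return equal(get_bytes(fhA, n), get_bytes(fhB, n))
end function

-- Quick, inexact test: equal sizes, equal first and last EDGE bytes, and
-- equal bytes at SAMPLES random offsets. A result of 1 only means
-- "probably equal".
function sampleEqual(sequence pFileA, sequence pFileB)
    integer fhA, fhB, ok, edge
    atom size, pos
    fhA = open(pFileA, "rb")
    fhB = open(pFileB, "rb")
    ok = 0
    if fhA != -1 and fhB != -1 then
        size = fileSize(fhA)
        if size >= 0 and size = fileSize(fhB) then
            edge = EDGE
            if size < EDGE then
                edge = floor(size)
            end if
            ok = samePiece(fhA, fhB, 0, edge) and
                 samePiece(fhA, fhB, size - edge, edge)
            for i = 1 to SAMPLES do
                if not ok then
                    exit
                end if
                -- scale a random fraction to an offset below size
                pos = floor((rand(1000000) - 1) / 1000000 * size)
                ok = samePiece(fhA, fhB, pos, 1)
            end for
        end if
    end if
    if fhA != -1 then close(fhA) end if
    if fhB != -1 then close(fhB) end if
    return ok
end function
```

Scaling a random fraction, instead of calling rand() on the size directly, is a design choice made here so that sizes beyond the range of a Euphoria integer do not break the call. The price is that only about a million distinct offsets can ever be picked.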
Kat wrote:
> On 7 Jul 2002, at 1:51, 10963508 at europeonline.com wrote:
>
> >
> > What is the fastest way of checking if two very large files (~500 MB)
> > are
> > equal?
> > I was thinking about this:
> > -name
> > -size
> > -date last modified
> > -pick about 10 random positions and check if bytes at those positions in
> > both files match.
>
> I would not trust those tests at all.
>
> > Is there any better and faster way that I'm not aware of?
>
> Open file
> while not eof do
> Read them in, one buffer size at a time,
> compare,
> if not equal { tell me it's not equal, abort}
> end while
>
> Kat
>
>
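
Kat's buffered loop, quoted above, might look something like this in Euphoria. It is a sketch only: the function name fileEqualBuffered and the 64 KB buffer size are assumptions, and the best buffer size would have to be found by measurement.

```euphoria
include get.e   -- get_bytes()

constant BUF_SIZE = 65536  -- bytes per read; an arbitrary choice

-- Returns 1 if the two files have identical contents, else 0.
-- Reads both files one buffer at a time and stops at the first mismatch.
function fileEqualBuffered(sequence pFileA, sequence pFileB)
    integer fhA, fhB, same
    sequence bufA, bufB
    fhA = open(pFileA, "rb")
    fhB = open(pFileB, "rb")
    same = (fhA != -1 and fhB != -1)
    while same do
        bufA = get_bytes(fhA, BUF_SIZE)
        bufB = get_bytes(fhB, BUF_SIZE)
        -- buffers of different length also compare as unequal
        same = equal(bufA, bufB)
        if length(bufA) < BUF_SIZE then
            exit  -- a short read means end of file A
        end if
    end while
    if fhA != -1 then close(fhA) end if
    if fhB != -1 then close(fhB) end if
    return same
end function
```

Unlike the sampling ideas, this reads every byte of identical files, so it is exact. For files that differ, it stops at the first buffer that does not match.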
5. RE: Check if files equal
- Posted by rforno at tutopia.com
Jul 09, 2002
Igor:
Thanks! You saved me some effort. I was just planning to develop exactly
this program. There are some similar ones on the web, but they are not good
enough or else they are not free, at least the ones I know of. Surely,
being programmed in C or some other compiled language, they should be faster.
Does anyone know who programmed it? There is no information about it in the
program itself.
Regards.
----- Original Message -----
From: Igor Kachan <kinz at peterlink.ru>
Subject: Re: Check if files equal
Hi Tone,
----------
> From: 10963508 at europeonline.com
> To: EUforum <EUforum at topica.com>
> Subject: Check if files equal
> Date: 7 July 2002, 3:51
>
> What is the fastest way of checking if two very large files (~500 MB) are
> equal?
> I was thinking about this:
> -name
> -size
> -date last modified
> -pick about 10 random positions and check if bytes at those positions in
> both files match.
>
> Is there any better and faster way that I'm not aware of?
>
> Tone Škoda
>
There is a good program by RDS:
http://www.RapidEuphoria.com/dupfile.zip
Regards,
Igor Kachan
kinz at peterlink.ru
6. RE: Check if files equal
- Posted by rforno at tutopia.com
Jul 12, 2002
Rob:
Actually, the program finds equal files in as many directories as you want,
not only two. Since you programmed it, you should know better...
I found a problem with this good program:
It has an elusive bug. When trying it, I discovered that I had four equal
files somewhere, but the program found two groups of two equal files. After
scratching my head for a while, I traced the bug to this circumstance:
assume you have several files with the same length, and among them there is
a group of, say, 3 equal files and another group of, say, 4 equal files. When
the first group found is processed, the lengths of its equal files are set
to -1. Then, when the second group is processed, the 'while' can stop due to
unequal length *before* some other member of the second group is reached.
Attached you'll find a correction to this bug.
Regards.
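
To illustrate the kind of early stop described above, here is a hedged sketch of one way to avoid it. It is not the attached correction and not code from dupfile.exw, whose internals are not shown in this thread. It assumes a list of file sizes sorted by size and a separate "done" flag per file, so that the sizes themselves are never overwritten. The name candidatesFor is hypothetical.

```euphoria
-- sizes: file sizes, sorted so that equal sizes are adjacent.
-- done:  parallel flags, 1 for files already placed in a group.
-- Returns the indexes after 'first' that share its size and are not done.
-- Entries that are done are skipped, so they cannot end the scan early.
function candidatesFor(sequence sizes, sequence done, integer first)
    sequence result
    integer j
    result = {}
    j = first + 1
    while j <= length(sizes) and sizes[j] = sizes[first] do
        if not done[j] then
            result = append(result, j)
        end if
        j = j + 1
    end while
    return result
end function
```

With sizes {10,10,10,10,20} and file 2 already grouped, candidatesFor(sizes, {0,1,0,0,0}, 1) still reaches files 3 and 4. A scan that compared against a size overwritten with -1 would have stopped at file 2.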
----- Original Message -----
From: Robert Craig <rds at RapidEuphoria.com>
To: EUforum <EUforum at topica.com>
Sent: Wednesday, July 10, 2002 2:46 AM
Subject: Re: Check if files equal
>
> Igor Kachan writes:
> > There is a good program by RDS:
> >
> > http://www.RapidEuphoria.com/dupfile.zip
> >
> Ricardo Forno writes:
> > Does someone know who programmed it? There is no information
> > about it in the program itself.
>
> Since you seem to like it, I'll step forward and take the credit.
>
> dupfile.exw finds all sets of identical files within a directory (and
> subdirectories), or between two directories and their subdirectories.
> Whenever I use it, it seems to run very fast.
> I don't think writing it in C will help much,
> and of course you can always use the E to C Translator on it.
>
> Regards,
> Rob Craig
> Rapid Deployment Software
> http://www.RapidEuphoria.com
>
>
>
>
Attachment: dupfile.ZIP