1. Text File Comparison
- Posted by petelomax at blueyonder.co.uk Feb 08, 2002
- 547 views
Looking for a source file comparison utility - must be written in Euphoria or a source I can translate. Thinking out loud, it seems non-trivial to report the smallest possible number of changed lines, which is what I want. At the moment I'm struggling with DOS fc utility but I'd like an output similar to: function fred() sequence result >integer i result={} < for i = 1 to 10 < if skip[i]=0 then > i=1 > while i <= 10 > if skip[i]>0 then > i+=skip[i] > else result&=i > i+=1 end if < end for > end while return result end procedure whereby ">" lines have been added & "<" removed. Hopefully someone out there in the Linux world has the source of "diff" I think it is which I suspect handles this alot better than I could starting from scratch. Using fc I get alot of false realigns on "end if" causing the output to be much larger than it ought to be. Raw performance is unlikely to be an issue. Pete
2. Re: Text File Comparison
- Posted by Martin Stachon <martin.stachon at worldonline.cz> Feb 09, 2002
- 486 views
> Looking for a source file comparison utility - must be written in > Euphoria or a source I can translate. > Thinking out loud, it seems non-trivial to report the smallest > possible number of changed lines, which is what I want. Goto www.gnu.org and grab the source of diff : "diffutils-2.7.tar.gz" , 305 kB It's in C. I tried to traslate it to Euphoria but was too difficult for me. "The algorithm is described in "An O(ND) Difference Algorithm and its Variations", Eugene W. Myers, Algorithmica Vol. 1 No. 2, 1986, pp. 251--266; and in "A File Comparison Program", Webb Miller and Eugene W. Myers, Software--Practice and Experience Vol. 15 No. 11, 1985, pp. 1025--1040. The algorithm was independently discovered as described in "Algorithms for Approximate String Matching", E. Ukkonen, Information and Control Vol. 64, 1985, pp. 100--118." Martin
3. Re: Text File Comparison
- Posted by rforno at tutopia.com Feb 09, 2002
- 499 views
I've found this to be an interestin challenge, and consequently I developed the following algorithm. I don't know if it is the same one mentioned by Martin Stachon, or if it is optimal, but I made some tests and it works OK. However, it is not fully tested. --Program to show the differences between two files --Author: R. M. Forno - Version 0.0 - 2002/02/09 function diff(sequence s, sequence t) integer i, lens, lent, max, min, a, b, z, n sequence r, x lens = length(s) lent = length(t) max = lens + lent - 1 r = {} i = 1 z = 1 while i <= lens do min = max for m = i to lens do if m - i >= min then exit end if x = s[m] --Speedwise n = min - m + i + z --Speedwise for j = z to lent do --Search for minimum slack if j >= n then exit end if if equal(x, t[j]) then min = m - i + j - z --Update minimum slack a = m b = j exit end if end for end for if min < max then for k = i to a - 1 do r = append(r, {1, k}) end for for k = z to b - 1 do r = append(r, {-1, k}) end for r = append(r, {0, a}) i = a + 1 --Update starting points z = b + 1 else for k = i to lens do --Last part r = append(r, {1, k}) end for for k = z to lent do r = append(r, {-1, k}) end for exit end if end while return r end function function read_in(sequence fn) sequence s object x integer f s = {} f = open(fn, "r") if f < 0 then puts(2, "Error - cannot open " & fn) abort(1) end if while 1 do x = gets(f) if atom(x) then exit end if s = append(s, x) end while return s end function procedure out_diff(sequence r, sequence a, sequence b) sequence x integer n, z for i = 1 to length(r) do x = r[i] n = x[1] z = x[2] if n < 0 then puts(1, "< " & b[z]) elsif n > 0 then puts(1, "> " & a[z]) else puts(1, " " & a[z]) end if end for end procedure --Example of usage sequence a1, a2, r a1 = read_in("file1") a2 = read_in("file2") r = diff(a1, a2) out_diff(r, a1, a2) ----- Original Message ----- From: <petelomax at blueyonder.co.uk> To: "EUforum" <EUforum at topica.com> Sent: Friday, February 08, 2002 9:40 PM Subject: Text File Comparison Looking for a source file comparison utility - must be written in Euphoria or a source I can translate. Thinking out loud, it seems non-trivial to report the smallest possible number of changed lines, which is what I want. At the moment I'm struggling with DOS fc utility but I'd like an output similar to: function fred() sequence result >integer i result={} < for i = 1 to 10 < if skip[i]=0 then > i=1 > while i <= 10 > if skip[i]>0 then > i+=skip[i] > else result&=i > i+=1 end if < end for > end while return result end procedure whereby ">" lines have been added & "<" removed. Hopefully someone out there in the Linux world has the source of "diff" I think it is which I suspect handles this alot better than I could starting from scratch. Using fc I get alot of false realigns on "end if" causing the output to be much larger than it ought to be. Raw performance is unlikely to be an issue. Pete
4. Re: Text File Comparison
- Posted by rforno at tutopia.com Feb 10, 2002
- 492 views
Please find below a corrected and improved version of my previous algorithm on the subject. --Program to show the differences between two files --Author: R. M. Forno - Version 0.1 - 2002/02/10 function diff(sequence s, sequence t) integer i, lens, lent, max, min, a, b, z, n, m, topi sequence r, x lens = length(s) lent = length(t) max = lens + lent r = {} i = 1 z = 1 while i <= lens do min = max topi = lens + 1 m = i while m < topi do x = s[m] --Speedwise n = min - m + i + z - 1 if n > lent then n = lent end if for j = z to n do --Search for minimum slack if equal(x, t[j]) then min = m - i + j - z --Update minimum slack topi = min + i --Update upper limit for m a = m b = j exit end if end for m += 1 end while if min < max then for k = i to a - 1 do r = append(r, {1, k}) end for for k = z to b - 1 do r = append(r, {-1, k}) end for r = append(r, {0, a}) i = a + 1 --Update starting points z = b + 1 else exit end if end while for k = i to lens do --Last part r = append(r, {1, k}) end for for k = z to lent do r = append(r, {-1, k}) end for return r end function function read_in(sequence fn) sequence s object x integer f s = {} f = open(fn, "r") if f < 0 then puts(2, "Error - cannot open " & fn) abort(1) end if while 1 do x = gets(f) if atom(x) then exit end if s = append(s, x) end while return s end function procedure out_diff(sequence r, sequence a, sequence b) sequence x integer n, z for i = 1 to length(r) do x = r[i] n = x[1] z = x[2] if n < 0 then puts(1, "< " & b[z]) elsif n > 0 then puts(1, "> " & a[z]) else puts(1, " " & a[z]) end if end for end procedure --Example of usage sequence a1, a2, r a1 = read_in("file1") a2 = read_in("file2") r = diff(a1, a2) out_diff(r, a1, a2) ----- Original Message ----- From: <petelomax at blueyonder.co.uk> To: "EUforum" <EUforum at topica.com> Sent: Friday, February 08, 2002 9:40 PM Subject: Text File Comparison Looking for a source file comparison utility - must be written in Euphoria or a source I can translate. Thinking out loud, it seems non-trivial to report the smallest possible number of changed lines, which is what I want. At the moment I'm struggling with DOS fc utility but I'd like an output similar to: function fred() sequence result >integer i result={} < for i = 1 to 10 < if skip[i]=0 then > i=1 > while i <= 10 > if skip[i]>0 then > i+=skip[i] > else result&=i > i+=1 end if < end for > end while return result end procedure whereby ">" lines have been added & "<" removed. Hopefully someone out there in the Linux world has the source of "diff" I think it is which I suspect handles this alot better than I could starting from scratch. Using fc I get alot of false realigns on "end if" causing the output to be much larger than it ought to be. Raw performance is unlikely to be an issue. Pete
5. Re: Text File Comparison
- Posted by petelomax at blueyonder.co.uk Feb 11, 2002
- 519 views
On Sun, 10 Feb 2002 22:39:21 -0300, you wrote: > >Please find below a corrected and improved version of my previous algorithm >on the subject. I'm intrigued enough to show my embarrassed ignorance. I was going to ask you what max,min, and n were actually for and what you were after with "minimum slack". I've just been sidewiped with a rather nasty flu-type bug which means I've managed little this past week but as far as I've tested your algorithm works well. Now of course I'm also intrigued as to what bugs you just fixed. Pete
6. Re: Text File Comparison
- Posted by rforno at tutopia.com Feb 12, 2002
- 538 views
My first version did not show the non-matching items of the second sequence after the last match. Regarding the names of my variables, I did not take much care for them to be significative... :) ----- Original Message ----- From: <petelomax at blueyonder.co.uk> To: "EUforum" <EUforum at topica.com> Subject: Re: Text File Comparison On Sun, 10 Feb 2002 22:39:21 -0300, you wrote: > >Please find below a corrected and improved version of my previous algorithm >on the subject. I'm intrigued enough to show my embarrassed ignorance. I was going to ask you what max,min, and n were actually for and what you were after with "minimum slack". I've just been sidewiped with a rather nasty flu-type bug which means I've managed little this past week but as far as I've tested your algorithm works well. Now of course I'm also intrigued as to what bugs you just fixed. Pete
7. Re: Text File Comparison
- Posted by petelomax at blueyonder.co.uk Feb 14, 2002
- 511 views
On Tue, 12 Feb 2002 23:22:45 -0300, you wrote: >>I was going to ask you what max,min, and n were actually for and what >>you were after with "minimum slack". >Regarding the names of my variables, I did not take much care for them to be >significative... :) That's helpful. After general success, I decided to 'cane' your code to see how stable it is, so I compared the stamped, comment stripped version of win32lib (285K) and the fulltext version (673K) while i <= lens do min = max topi = lens + 1 m = i while m < topi do x = s[m] --Speedwise -- ^^^ subscript out of bounds n = min - m + i + z - 1 if n > lent then n = lent end if for j = z to n do --Search for minimum slack if equal(x, t[j]) then min = m - i + j - z --Update minimum slack if min + i < topi then --<<< topi = min + i --Update upper limit for m end if --<<< a = m b = j exit end if end for m += 1 end while I've put the above to stop if bombing but as you won't explain what the code is trying to do I don't know, for example, if I should not update a & b at that point or reset min... Pete
8. Re: Text File Comparison
- Posted by rforno at tutopia.com Feb 16, 2002
- 505 views
Problem is I am too lazy... Well, the idea behind the algorithm is simple. Assume you have found the kth matching pair. Then, starting at this point, you search for the (k+1)th matching pair. From all the matching pairs you find, you should select the one for which the sum of distances between the element in sequence s and the starting point in this sequence, and the corresponding difference in sequence t, is minimum. The complications arise because, to decrease the time spent, one does not want to test all possibilities, but instead wants to reject some because "a priori" they cannot give a better minimum distance. This is the origin of variables n and topi. Please find below a new version of the algorithm. I did not test it with the files you used, because I don't have them. However, I am confident this time the algorithm shall work ;) --Program to show the differences between two files --Author: R. M. Forno - Version 0.2 - 2002/02/16 function diff(sequence s, sequence t) integer i, lens, lent, max, min, a, b, z, n, m, topi sequence r, x lens = length(s) lent = length(t) max = lens + lent r = {} i = 1 z = 1 while i <= lens do min = max topi = lens + 1 m = i while m < topi do x = s[m] --Speedwise n = min - m + i + z - 1 if n > lent then n = lent end if for j = z to n do --Search for minimum slack if equal(x, t[j]) then min = m - i + j - z --Update minimum slack if min + i < topi then topi = min + i --Update upper limit for m end if a = m b = j exit end if end for m += 1 end while if min < max then for k = i to a - 1 do r = append(r, {1, k}) end for for k = z to b - 1 do r = append(r, {-1, k}) end for r = append(r, {0, a}) i = a + 1 --Update starting points z = b + 1 else exit end if end while for k = i to lens do --Last part r = append(r, {1, k}) end for for k = z to lent do r = append(r, {-1, k}) end for return r end function function read_in(sequence fn) sequence s object x integer f s = {} f = open(fn, "r") if f < 0 then puts(2, "Error - cannot open " & fn) abort(1) end if while 1 do x = gets(f) if atom(x) then exit end if s = append(s, x) end while return s end function procedure out_diff(sequence r, sequence a, sequence b) sequence x integer n, z for i = 1 to length(r) do x = r[i] n = x[1] z = x[2] if n < 0 then puts(1, "< " & b[z]) elsif n > 0 then puts(1, "> " & a[z]) else puts(1, " " & a[z]) end if end for end procedure --Example of usage sequence a1, a2, r a1 = read_in("file1") a2 = read_in("file2") r = diff(a1, a2) out_diff(r, a1, a2) ----- Original Message ----- From: <petelomax at blueyonder.co.uk> To: "EUforum" <EUforum at topica.com> Sent: Thursday, February 14, 2002 11:50 PM Subject: Re: Text File Comparison On Tue, 12 Feb 2002 23:22:45 -0300, you wrote: >>I was going to ask you what max,min, and n were actually for and what >>you were after with "minimum slack". >Regarding the names of my variables, I did not take much care for them to be >significative... :) That's helpful. After general success, I decided to 'cane' your code to see how stable it is, so I compared the stamped, comment stripped version of win32lib (285K) and the fulltext version (673K) while i <= lens do min = max topi = lens + 1 m = i while m < topi do x = s[m] --Speedwise -- ^^^ subscript out of bounds n = min - m + i + z - 1 if n > lent then n = lent end if for j = z to n do --Search for minimum slack if equal(x, t[j]) then min = m - i + j - z --Update minimum slack if min + i < topi then --<<< topi = min + i --Update upper limit for m end if --<<< a = m b = j exit end if end for m += 1 end while I've put the above to stop if bombing but as you won't explain what the code is trying to do I don't know, for example, if I should not update a & b at that point or reset min... Pete