1. Text File Comparison

Looking for a source file comparison utility - must be written in
Euphoria or a source I can translate.

Thinking out loud, it seems non-trivial to report the smallest
possible number of changed lines, which is what I want.

At the moment I'm struggling with DOS fc utility but I'd like an
output similar to:

function fred()
sequence result
>integer i
	result={}
<	for i = 1 to 10
<		if skip[i]=0 then
>	i=1
>	while i <= 10
>		if skip[i]>0 then
>			i+=skip[i]
>		else
			result&=i
>			i+=1
		end if
<	end for
>	end while
	return result
end procedure

whereby ">" lines have been added & "<" removed.

Hopefully someone out there in the Linux world has the source of
"diff" I think it is which I suspect handles this alot better than I
could starting from scratch.

Using fc I get alot of false realigns on "end if" causing the output
to be much larger than it ought to be. Raw performance is unlikely to
be an issue.

Pete

new topic     » topic index » view message » categorize

2. Re: Text File Comparison

> Looking for a source file comparison utility - must be written in
> Euphoria or a source I can translate.

> Thinking out loud, it seems non-trivial to report the smallest
> possible number of changed lines, which is what I want.

Goto www.gnu.org and grab the source of diff : "diffutils-2.7.tar.gz" , 305 kB
It's in C. I tried to traslate it to Euphoria but was too difficult for me.

"The algorithm is described in "An O(ND) Difference Algorithm and its
Variations", Eugene
W. Myers, Algorithmica Vol. 1 No. 2, 1986, pp. 251--266; and in "A File
Comparison
Program", Webb Miller and Eugene W. Myers, Software--Practice and Experience
Vol. 15 No.
11, 1985, pp. 1025--1040. The algorithm was independently discovered as
described in
"Algorithms for Approximate String Matching", E. Ukkonen, Information and
Control Vol. 64,
1985, pp. 100--118."

    Martin

new topic     » goto parent     » topic index » view message » categorize

3. Re: Text File Comparison

I've found this to be an interestin challenge, and consequently I developed
the following algorithm. I don't know if it is the same one mentioned by
Martin Stachon, or if it is optimal, but I made some tests and it works OK.
However, it is not fully tested.

--Program to show the differences between two files
--Author: R. M. Forno - Version 0.0 - 2002/02/09

function diff(sequence s, sequence t)
    integer i, lens, lent, max, min, a, b, z, n
    sequence r, x
    lens = length(s)
    lent = length(t)
    max = lens + lent - 1
    r = {}
    i = 1
    z = 1
    while i <= lens do
       min = max
     for m = i to lens do
         if m - i >= min then
           exit
        end if
        x = s[m] --Speedwise
        n = min - m + i + z --Speedwise
        for j = z to lent do --Search for minimum slack
           if j >= n then
             exit
          end if
  if equal(x, t[j]) then
      min = m - i + j - z --Update minimum slack
      a = m
      b = j
      exit
  end if
     end for
 end for
 if min < max then
     for k = i to a - 1 do
  r = append(r, {1, k})
     end for
     for k = z to b - 1 do
  r = append(r, {-1, k})
     end for
     r = append(r, {0, a})
     i = a + 1 --Update starting points
     z = b + 1
 else
     for k = i to lens do --Last part
  r = append(r, {1, k})
     end for
     for k = z to lent do
  r = append(r, {-1, k})
     end for
     exit
 end if
    end while
    return r
end function

function read_in(sequence fn)
    sequence s
    object x
    integer f
    s = {}
    f = open(fn, "r")
    if f < 0 then
 puts(2, "Error - cannot open " & fn)
 abort(1)
    end if
    while 1 do
 x = gets(f)
 if atom(x) then
     exit
 end if
 s = append(s, x)
    end while
    return s
end function

procedure out_diff(sequence r, sequence a, sequence b)
    sequence x
    integer n, z
    for i = 1 to length(r) do
 x = r[i]
 n = x[1]
 z = x[2]
 if n < 0 then
     puts(1, "< " & b[z])
 elsif n > 0 then
     puts(1, "> " & a[z])
 else
     puts(1, "  " & a[z])
 end if
    end for
end procedure

--Example of usage

sequence a1, a2, r
a1 = read_in("file1")
a2 = read_in("file2")
r = diff(a1, a2)
out_diff(r, a1, a2)

----- Original Message -----
From: <petelomax at blueyonder.co.uk>
To: "EUforum" <EUforum at topica.com>
Sent: Friday, February 08, 2002 9:40 PM
Subject: Text File Comparison



Looking for a source file comparison utility - must be written in
Euphoria or a source I can translate.

Thinking out loud, it seems non-trivial to report the smallest
possible number of changed lines, which is what I want.

At the moment I'm struggling with DOS fc utility but I'd like an
output similar to:

function fred()
sequence result
>integer i
result={}
< for i = 1 to 10
< if skip[i]=0 then
> i=1
> while i <= 10
> if skip[i]>0 then
> i+=skip[i]
> else
result&=i
> i+=1
end if
< end for
> end while
return result
end procedure

whereby ">" lines have been added & "<" removed.

Hopefully someone out there in the Linux world has the source of
"diff" I think it is which I suspect handles this alot better than I
could starting from scratch.

Using fc I get alot of false realigns on "end if" causing the output
to be much larger than it ought to be. Raw performance is unlikely to
be an issue.

Pete

new topic     » goto parent     » topic index » view message » categorize

4. Re: Text File Comparison

Please find below a corrected and improved version of my previous algorithm
on the subject.

--Program to show the differences between two files
--Author: R. M. Forno - Version 0.1 - 2002/02/10

function diff(sequence s, sequence t)
    integer i, lens, lent, max, min, a, b, z, n, m, topi
    sequence r, x
    lens = length(s)
    lent = length(t)
    max = lens + lent
    r = {}
    i = 1
    z = 1
    while i <= lens do
 min = max
 topi = lens + 1
 m = i
 while m < topi do
     x = s[m] --Speedwise
     n = min - m + i + z - 1
     if n > lent then
  n = lent
     end if
     for j = z to n do --Search for minimum slack
  if equal(x, t[j]) then
      min = m - i + j - z --Update minimum slack
      topi = min + i --Update upper limit for m
      a = m
      b = j
      exit
  end if
     end for
     m += 1
 end while
 if min < max then
     for k = i to a - 1 do
  r = append(r, {1, k})
     end for
     for k = z to b - 1 do
  r = append(r, {-1, k})
     end for
     r = append(r, {0, a})
     i = a + 1 --Update starting points
     z = b + 1
 else
     exit
 end if
    end while
    for k = i to lens do --Last part
 r = append(r, {1, k})
    end for
    for k = z to lent do
 r = append(r, {-1, k})
    end for
    return r
end function

function read_in(sequence fn)
    sequence s
    object x
    integer f
    s = {}
    f = open(fn, "r")
    if f < 0 then
 puts(2, "Error - cannot open " & fn)
 abort(1)
    end if
    while 1 do
 x = gets(f)
 if atom(x) then
     exit
 end if
 s = append(s, x)
    end while
    return s
end function

procedure out_diff(sequence r, sequence a, sequence b)
    sequence x
    integer n, z
    for i = 1 to length(r) do
 x = r[i]
 n = x[1]
 z = x[2]
 if n < 0 then
     puts(1, "< " & b[z])
 elsif n > 0 then
     puts(1, "> " & a[z])
 else
     puts(1, "  " & a[z])
 end if
    end for
end procedure

--Example of usage

sequence a1, a2, r
a1 = read_in("file1")
a2 = read_in("file2")
r = diff(a1, a2)
out_diff(r, a1, a2)

----- Original Message -----
From: <petelomax at blueyonder.co.uk>
To: "EUforum" <EUforum at topica.com>
Sent: Friday, February 08, 2002 9:40 PM
Subject: Text File Comparison



Looking for a source file comparison utility - must be written in
Euphoria or a source I can translate.

Thinking out loud, it seems non-trivial to report the smallest
possible number of changed lines, which is what I want.

At the moment I'm struggling with DOS fc utility but I'd like an
output similar to:

function fred()
sequence result
>integer i
result={}
< for i = 1 to 10
< if skip[i]=0 then
> i=1
> while i <= 10
> if skip[i]>0 then
> i+=skip[i]
> else
result&=i
> i+=1
end if
< end for
> end while
return result
end procedure

whereby ">" lines have been added & "<" removed.

Hopefully someone out there in the Linux world has the source of
"diff" I think it is which I suspect handles this alot better than I
could starting from scratch.

Using fc I get alot of false realigns on "end if" causing the output
to be much larger than it ought to be. Raw performance is unlikely to
be an issue.

Pete

new topic     » goto parent     » topic index » view message » categorize

5. Re: Text File Comparison

On Sun, 10 Feb 2002 22:39:21 -0300, you wrote:

>
>Please find below a corrected and improved version of my previous algorithm
>on the subject.

I'm intrigued enough to show my embarrassed ignorance.

I was going to ask you what max,min, and n were actually for and what
you were after with "minimum slack". I've just been sidewiped with a
rather nasty flu-type bug which means I've managed little this past
week but as far as I've tested your algorithm works well. Now of
course I'm also intrigued as to what bugs you just fixed.

Pete

new topic     » goto parent     » topic index » view message » categorize

6. Re: Text File Comparison

My first version did not show the non-matching items of the second sequence
after the last match.
Regarding the names of my variables, I did not take much care for them to be
significative... :)
----- Original Message -----
From: <petelomax at blueyonder.co.uk>
To: "EUforum" <EUforum at topica.com>
Subject: Re: Text File Comparison



On Sun, 10 Feb 2002 22:39:21 -0300, you wrote:

>
>Please find below a corrected and improved version of my previous algorithm
>on the subject.

I'm intrigued enough to show my embarrassed ignorance.

I was going to ask you what max,min, and n were actually for and what
you were after with "minimum slack". I've just been sidewiped with a
rather nasty flu-type bug which means I've managed little this past
week but as far as I've tested your algorithm works well. Now of
course I'm also intrigued as to what bugs you just fixed.

Pete

new topic     » goto parent     » topic index » view message » categorize

7. Re: Text File Comparison

On Tue, 12 Feb 2002 23:22:45 -0300, you wrote:

>>I was going to ask you what max,min, and n were actually for and what
>>you were after with "minimum slack".
>Regarding the names of my variables, I did not take much care for them to be
>significative... :)
That's helpful.

After general success, I decided to 'cane' your code to see how stable
it is, so I compared the stamped, comment stripped version of win32lib
(285K) and the fulltext version (673K)

	while i <= lens do
		min = max
		topi = lens + 1
		m = i
		while m < topi do
			x = s[m] --Speedwise
--                                   ^^^ subscript out of bounds
			n = min - m + i + z - 1
			if n > lent then
				n = lent
			end if
			for j = z to n do --Search for minimum slack
				if equal(x, t[j]) then
					min = m - i + j - z --Update minimum slack
if min + i < topi then --<<<					
					topi = min + i --Update upper limit for m
end if --<<<
					a = m
					b = j
					exit
				end if
			end for
			m += 1
		end while

I've put the above to stop if bombing but as you won't explain what
the code is trying to do I don't know, for example, if I should not
update a & b at that point or reset min...

Pete

new topic     » goto parent     » topic index » view message » categorize

8. Re: Text File Comparison

Problem is I am too lazy...
Well, the idea behind the algorithm is simple. Assume you have found the kth
matching pair. Then, starting at this point, you search for the (k+1)th
matching pair. From all the matching pairs you find, you should select the
one for which the sum of distances between the element in sequence s and the
starting point in this sequence, and the corresponding difference in
sequence t, is minimum. The complications arise because, to decrease the
time spent, one does not want to test all possibilities, but instead wants
to reject some because "a priori" they cannot give a better minimum
distance. This is the origin of variables n and topi.
Please find below a new version of the algorithm. I did not test it with the
files you used, because I don't have them. However, I am confident this time
the algorithm shall work ;)

--Program to show the differences between two files
--Author: R. M. Forno - Version 0.2 - 2002/02/16

function diff(sequence s, sequence t)
    integer i, lens, lent, max, min, a, b, z, n, m, topi
    sequence r, x
    lens = length(s)
    lent = length(t)
    max = lens + lent
    r = {}
    i = 1
    z = 1
    while i <= lens do
 min = max
 topi = lens + 1
 m = i
 while m < topi do
     x = s[m] --Speedwise
     n = min - m + i + z - 1
     if n > lent then
  n = lent
     end if
     for j = z to n do --Search for minimum slack
  if equal(x, t[j]) then
      min = m - i + j - z --Update minimum slack
      if min + i < topi then
   topi = min + i --Update upper limit for m
      end if
      a = m
      b = j
      exit
  end if
     end for
     m += 1
 end while
 if min < max then
     for k = i to a - 1 do
  r = append(r, {1, k})
     end for
     for k = z to b - 1 do
  r = append(r, {-1, k})
     end for
     r = append(r, {0, a})
     i = a + 1 --Update starting points
     z = b + 1
 else
     exit
 end if
    end while
    for k = i to lens do --Last part
 r = append(r, {1, k})
    end for
    for k = z to lent do
 r = append(r, {-1, k})
    end for
    return r
end function

function read_in(sequence fn)
    sequence s
    object x
    integer f
    s = {}
    f = open(fn, "r")
    if f < 0 then
 puts(2, "Error - cannot open " & fn)
 abort(1)
    end if
    while 1 do
 x = gets(f)
 if atom(x) then
     exit
 end if
 s = append(s, x)
    end while
    return s
end function

procedure out_diff(sequence r, sequence a, sequence b)
    sequence x
    integer n, z
    for i = 1 to length(r) do
 x = r[i]
 n = x[1]
 z = x[2]
 if n < 0 then
     puts(1, "< " & b[z])
 elsif n > 0 then
     puts(1, "> " & a[z])
 else
     puts(1, "  " & a[z])
 end if
    end for
end procedure

--Example of usage

sequence a1, a2, r
a1 = read_in("file1")
a2 = read_in("file2")
r = diff(a1, a2)
out_diff(r, a1, a2)

----- Original Message -----
From: <petelomax at blueyonder.co.uk>
To: "EUforum" <EUforum at topica.com>
Sent: Thursday, February 14, 2002 11:50 PM
Subject: Re: Text File Comparison



On Tue, 12 Feb 2002 23:22:45 -0300, you wrote:

>>I was going to ask you what max,min, and n were actually for and what
>>you were after with "minimum slack".
>Regarding the names of my variables, I did not take much care for them to
be
>significative... :)
That's helpful.

After general success, I decided to 'cane' your code to see how stable
it is, so I compared the stamped, comment stripped version of win32lib
(285K) and the fulltext version (673K)

while i <= lens do
min = max
topi = lens + 1
m = i
while m < topi do
x = s[m] --Speedwise
--                                   ^^^ subscript out of bounds
n = min - m + i + z - 1
if n > lent then
n = lent
end if
for j = z to n do --Search for minimum slack
if equal(x, t[j]) then
min = m - i + j - z --Update minimum slack
if min + i < topi then --<<<
topi = min + i --Update upper limit for m
end if --<<<
a = m
b = j
exit
end if
end for
m += 1
end while

I've put the above to stop if bombing but as you won't explain what
the code is trying to do I don't know, for example, if I should not
update a & b at that point or reset min...

Pete

new topic     » goto parent     » topic index » view message » categorize

Search



Quick Links

User menu

Not signed in.

Misc Menu