Re: data analysis
- Posted by Kat <gertie at PELL.NET> Jan 17, 2001
On 17 Jan 2001, at 14:02, Kat wrote:

> On 17 Jan 2001, at 0:55, Kat wrote:
>
> > David and Graeme, all i can say is wow, and thanks. David's presentation of
> > the data is going to be easier to use first, but in the following test,
> > David's didn't resync properly, and i think it's because it is always trying
> > to resync on s2. In Sweigsdunka vs Zweigsdanka, it sync'd and found "weigsd",
> > but then didn't resync properly on the "nka". It took the first 'a' from
> > Zweigsdanka and went looking for it in Sweigsdunka, not finding it until
> > the end of the word. It thereby missed the common "nka". Swapping the words
> > around didn't help David's code's results at all, but messed up Graeme's
> > code's results in a new way. Is it possible to force which word is the
> > primary sync in your code, David, in a way i can spec while it's running?
> > Mostly, i'd be looking for the result with the fewest number of differences.
> >
> > /me is still studying the code....
>
> Changing MaxGap to 2 made it resync faster, but i'm not sure yet that
> passing maxgap to diff() along with the words is the right answer..

So i passed both a min and a max gap to it. diff() now runs once for each gap
value from MinG to MaxG and returns the unique results, joined with "\n". This
way an analysis of the returns gives the most results without too much of an
info overload.

I still don't see how to change the MaxGap in the middle of a word without
causing problems, such as making the passed specs and the results more complex.
Maybe this is not needed, so i'll put it on the back burner for now.

I found the range of MinG = 2 to MaxG = 6 works best in the following code on
the words i tested:

-- almost entirely David Cuny's code
global function diff( sequence s1, sequence s2, integer MinG, integer MaxG )

    integer at1, at2, sync1, sync2
    sequence result, bigresult

    bigresult = ""

    -- run the comparison once per gap limit and collect the distinct results
    for MaxGap = MinG to MaxG do

        result = ""
        at1 = 0
        at2 = 0

        -- process until the end of one string
        while 1 do

            -- move ahead
            at1 += 1
            at2 += 1

            -- past end of one string?
            if at1 > length( s1 ) or at2 > length( s2 ) then
                exit
            end if

            -- same?
            if s1[at1] = s2[at2] then
                result &= s1[at1]

            else
                -- attempt to resync
                while 1 do

                    -- find closest sync point for s1[at1] in the rest of s2
                    sync2 = find( s1[at1], s2[at2..length(s2)] )

                    -- too far?
                    if sync2 > 0 and sync2 < MaxGap then
                        sync2 += at2 - 1
                    else
                        sync2 = 9999    -- 9999 means "no usable sync"
                    end if

                    -- find closest sync point for s2[at2] in the rest of s1
                    sync1 = find( s2[at2], s1[at1..length(s1)] )

                    -- too far?
                    if sync1 > 0 and sync1 < MaxGap then
                        sync1 += at1 - 1
                    else
                        sync1 = 9999
                    end if

                    -- evaluate sync
                    if sync1 = 9999 and sync2 = 9999 then
                        -- no sync
                        result &= sprintf( "[%s,%s]", {s1[at1],s2[at2]} )

                        -- at end?
                        if at1 = length( s1 ) or at2 = length( s2 ) then
                            exit
                        end if

                        -- skip
                        at1 += 1
                        at2 += 1

                    elsif sync1 < sync2 then
                        -- match on sync1: the skipped chars exist only in s1
                        for i = at1 to sync1-1 do
                            result &= sprintf( "[%s,]", {s1[i]} )
                        end for

                        -- sync
                        at1 = sync1
                        result &= s1[at1]

                        -- leave loop
                        exit

                    else
                        -- match on sync2: the skipped chars exist only in s2
                        for i = at2 to sync2-1 do
                            result &= sprintf( "[,%s]", {s2[i]} )
                        end for

                        -- sync
                        at2 = sync2
                        result &= s2[at2]

                        -- leave loop
                        exit

                    end if
                end while
            end if
        end while

        -- remainder?
        if at1 <= length( s1 ) then
            for i = at1 to length(s1) do
                result &= sprintf( "[%s,]", {s1[i]} )
            end for
        elsif at2 <= length( s2 ) then
            for i = at2 to length(s2) do
                result &= sprintf( "[,%s]", {s2[i]} )
            end for
        end if

        -- keep this result only if it is not already in bigresult
        if ( match(result,bigresult) = 0 ) then
            bigresult &= "\n" & result
            -- i used "\n" just to make it easy to puts() it
        end if

    end for

    return bigresult

end function -- diff( sequence s1, sequence s2, integer MinG, integer MaxG )
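
For anyone who wants to try it, here is a minimal usage sketch. It assumes the lines are placed below diff() in the same file, and it uses the two words from the earlier test with the gap range mentioned above. The variable name "report" is made up for illustration and is not from the code above.

```euphoria
-- Usage sketch: assumes diff() is defined earlier in this same file.
-- "report" is a hypothetical variable name chosen for this example.
sequence report

-- the two test words, with the gap range that worked best in my tests
report = diff( "Sweigsdunka", "Zweigsdanka", 2, 6 )

-- every distinct result is preceded by "\n", so one puts() shows them all
puts( 1, report & "\n" )
```

Each line printed is the result for one or more gap limits. Characters outside brackets are common to both words, [x,] marks a character found only in the first word, [,y] one found only in the second, and [x,y] a mismatched pair.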