1. data analysis
- Posted by Grape_ Vine_ <g__vine at HOTMAIL.COM>
Jan 15, 2001
-
Last edited Jan 16, 2001
I am trying to look at a set amount of data and figure out a repeting data
pattern, any idea where i can get some routiens that already do this? None
of mine work =(
J Reeves
Grape Vine
ICQ# 13728824
_________________________________________________________________
Get your FREE download of MSN Explorer at http://explorer.msn.com
2. Re: data analysis
At 08:51 PM 15/01/01 -0800, you wrote:
>I am trying to look at a set amount of data and figure out a repeting data
>pattern, any idea where i can get some routiens that already do this? None
>of mine work =(
>
>
>
>J Reeves
>Grape Vine
if I understand the question correctly, this should help.
It copies dat to s then searches s for characters.
if it isn't a pattern, the found character is erased
in s (by adding one to it) then it searches again.
dat is preserved to test for patterns.
This could be optimised by shortening s for subsequent searches.
It still needs some checking, passing it the wrong sequence
will crash it with a 'slice ends past seq end' error, but
it should get you of on the right foot.
Graeme
function get_pattern(sequence dat)
integer f
sequence s,test
s=dat
s[1]+=1
while 1 do
f=find(dat[1],s)
if not f then
return "No Pattern"
end if
if equal(dat[1..f-1],dat[f..f*2-2]) then
test=dat[1..f-1]
for x=1 to floor(length(dat)/length(test))+1 do
test&=dat[1..f-1]
end for
if equal(test[1..length(dat)],dat) then
return dat[1..f-1]
end if
s[f]+=1
else
s[f]+=1
end if
end while
end function
puts(1,get_pattern("asdfasdfasdfasdf")&"\n")
puts(1,get_pattern("aaajaaajaaajaaaj")&"\n")
puts(1,get_pattern("efvineifvuhdsdfsdf")&"\n")
----------------------------------------------------
3. Re: data analysis
Speaking of such, i got started on Metaphone again, and have the same
complaint i had about Soundex: they both throw away too much data. So i
am writing other codes into the Metaphone code, to present a bit more of
the data, analyzed differently. And fine-tuning the Metaphone code. Testing
on irc now, when i am happier with it, i'll send it to RDS. If i can't use it, i
prolly won't send it. What i want to do is have Tiggr compare a few yrs of
irc logs to a 130K word dictionary, and Metaphonize those words not in the
dictionary, and see if she can find a match in the Metaphonized dictionary.
Any suggestions?
Kat
4. Re: data analysis
Graeme, since you did that amazing pattern finding code,
i was wondering what your thoughts would be on this problem...
given:
<any> !m pittsbourgh
<[Tiggr]> pittsbourgh = PITSBERG CVCCCVCC CVCVC 11312
<any> !m pittsburg
<[Tiggr]> pittsburg = PITSBERG CVCCCVCC CVCVC 11312
<any> !m pittsburgh
<[Tiggr]> pittsburgh = PITSBERG CVCCCVCC CVCVC 11312
<any> !m pittsberg
<[Tiggr]> pittsberg = PITSBERG CVCCCVCC CVCVC 11312
there is still one piece of relavant data missing: word comparisons, for the
patterns, for in the above instances, "pittsbourgh" vs "pittsburgh" might
return (s=pittsb,d=o,s=urgh) or some representation containing the data
that "pittsb" is the same in both, "o" is different, and "urgh" is the same in
both. The location within the word is not so important at the moment; and i
have made the decision that the first word is the longer of the two, just to
standardize. As you can see in the above printout, i currently have 3
pieces of info: the Metaphoneized word (essentially a rough usa phonetic
spelling), then the placement of consonants and vowels, and finally that
placement with redundancies removed, and i'd like to be able to pass any
of the three to the comparison code. The above example is possibly easy,
since there is a big section of each word at the very start that is the same,
my problem is in the resync of the comparison after that 'o'. Is this possible
without the puter being bogged down all day in the comparison?
Kat
5. Re: data analysis
At 02:14 PM 16/01/01 -0800, you wrote:
>Kat wrote:
>
>> Graeme, since you did that amazing
>> pattern finding code, i was wondering
>> what your thoughts would be on this problem...
>
>I'm obviously not Graeme, but I couldn't resist. This routine will compare
>two strings, returning the differences between them. For example:
>
> return result
>
>end function
>
Bah! Got me. mine's still about half a hour away. :)
(Just woke up), I have a feeling it might be a fair
bit quicker than that one, though.
Graeme.
----------------------------------------------------
6. Re: data analysis
Kat wrote:
> Graeme, since you did that amazing
> pattern finding code, i was wondering
> what your thoughts would be on this problem...
I'm obviously not Graeme, but I couldn't resist. This routine will compare
two strings, returning the differences between them. For example:
diff( "abczz", "dbnz" )
will return:
"[a,d]b[c,n]z[z,]"
The notation is a bit funky, but you can adjust it as you see fit:
1. 'x' means that 'x' was found in both strings.
2. '[x,]' means that 'x' was found in the first string, but not the second.
3. '[,x]' means that 'x' was found in the second string, but not the first.
4. [x,y] means that 'x' was found in the first string, but not the second
and 'y' was found in the second string but not the first.
-- David Cuny
-- compare two strings, return differences
function diff( sequence s1, sequence s2 )
integer at1, at2, len, sync1, sync2
sequence result
result = ""
-- find shortest string
if length( s1 ) > length( s2 ) then
len = length( s1 )
else
len = length( s2 )
end if
at1 = 0
at2 = 0
-- process until the end of one string
while 1 do
-- move ahead
at1 += 1
at2 += 1
-- past end of one string?
if at1 > length( s1 )
or at2 > length( s2 ) then
exit
end if
-- same?
if s1[at1] = s2[at2] then
result &= s1[at1]
else
-- attempt to resync
while 1 do
-- find closest sync
sync1 = 9999
for i = at1+1 to length(s1) do
if s2[at2] = s1[i] then
sync1 = i
exit
end if
end for
sync2 = 9999
for i = at2+1 to length(s2) do
if s1[at1] = s2[i] then
sync2 = i
exit
end if
end for
-- result of sync
if sync1 = 9999
and sync2 = 9999 then
-- no sync
result &= sprintf( "[%s,%s]", {s1[at1],s2[at2]} )
-- at end?
if at1 = length( s1 )
or at2 = length( s2 ) then
exit
end if
-- skip
at1 += 1
at2 += 1
elsif sync1 < sync2 then
-- match on sync1
for i = at1 to sync1-1 do
result &= sprintf( "[%s,]", {s1[i]} )
end for
-- sync
at1 = sync1
result &= s1[at1]
-- leave loop
exit
else
-- match on sync2
for i = at2 to sync2-1 do
result &= sprintf( "[,%s]", {s2[i]})
end for
-- sync
at2 = sync2
result &= s2[at2]
-- leave loop
exit
end if
end while
end if
end while
-- remainder?
if at1 <= length( s1 ) then
for i = at1 to length(s1) do
result &= sprintf( "[%s,]", {s1[i]} )
end for
elsif at2 <= length( s2 ) then
for i = at2 to length(s2) do
result &= sprintf( "[,%s]", {s2[i]} )
end for
end if
return result
end function
7. Re: data analysis
thanx for the help guys, Ill try them and see if one/both help.
Mainly what i am doing it bit pattern matching tho it is in sequence's
of 1/0 X 8
J Reeves
Grape Vine
ICQ# 13728824
_________________________________________________________________
Get your FREE download of MSN Explorer at http://explorer.msn.com
8. Re: data analysis
--=====================_979651966==_
At 08:44 AM 17/01/01 +1100, you wrote:
>At 02:14 PM 16/01/01 -0800, you wrote:
>>Kat wrote:
>>
>>> Graeme, since you did that amazing
>>> pattern finding code, i was wondering
>>> what your thoughts would be on this problem...
>>
>>I'm obviously not Graeme, but I couldn't resist. This routine will compare
>>two strings, returning the differences between them. For example:
>>
>
>> return result
>>
>>end function
>>
>
>Bah! Got me. mine's still about half a hour away. :)
>(Just woke up), I have a feeling it might be a fair
>bit quicker than that one, though.
>
>Graeme.
OK so here it is:
I've tested David's and my own solution with
the following test data:
{"this","that"},
{"actually","actaully"},
{"Pittsborough","Pittsburg"},
{"Pittsberg","Pittsburg"},
{"shello","hellos"},
{"fhfellos","shello"}
results as follows
( p=pass , f=fail )
DC GB
test1 p p
test2 p p
test3 p p
test4 f p
test5 p f
test6 f f
Depending on how harsh you're feeling, test3
could also be considered a failure for both.
('r' not matched)
Benchmarking indicates that my routine
is about 4.5 x faster than David's (on a P3)
( hopefully this will prompt David to write a
routine using match() that is faster and better
than both of the above )
routines & test prog attached
Graeme
--=====================_979651966==_
x-mac-type="705A4950"; x-mac-creator="705A4950"
9. Re: data analysis
Graeme wrote:
> results as follows
Thanks, I've made a couple of corrections:
1. Replaced the buggy code ('at1+1' should have read 'at+1') with find().
This corrected a bug, and made my code only about half as slow as Graeme's.
Thanks for the hint!
2. Added a MaxGap value that rejects matches past a certain length. This
allows matches text like the final example's. There are better ways to
decide this, but it was quick and dirty.
Thanks!
-- David Cuny
constant MaxGap = 6
global function diff( sequence s1, sequence s2 )
integer at1, at2, len, sync1, sync2
sequence result
result = ""
-- find shortest string
if length( s1 ) > length( s2 ) then
len = length( s1 )
else
len = length( s2 )
end if
at1 = 0
at2 = 0
-- process until the end of one string
while 1 do
-- move ahead
at1 += 1
at2 += 1
-- past end of one string?
if at1 > length( s1 )
or at2 > length( s2 ) then
exit
end if
-- same?
if s1[at1] = s2[at2] then
result &= s1[at1]
else
-- attempt to resync
while 1 do
-- find closest sync
sync1 = find( s2[at2], s1[at1..length(s1)] )
-- too far?
if sync1 > 0 and sync1 < MaxGap then
sync1 += at1 - 1
else
sync1 = 9999
end if
find closest sync point
sync2 = find( s1[at1], s2[at2..length(s2)] )
-- too far?
if sync2 > 0 and sync2 < MaxGap then
sync2 += at2 - 1
else
sync2 = 9999
end if
-- evaluate sync
if sync1 = 9999
and sync2 = 9999 then
-- no sync
result &= sprintf( "[%s,%s]", {s1[at1],s2[at2]} )
-- at end?
if at1 = length( s1 )
or at2 = length( s2 ) then
exit
end if
-- skip
at1 += 1
at2 += 1
elsif sync1 < sync2 then
-- match on sync1
for i = at1 to sync1-1 do
result &= sprintf( "[%s,]", {s1[i]} )
end for
-- sync
at1 = sync1
result &= s1[at1]
-- leave loop
exit
else
-- match on sync2
for i = at2 to sync2-1 do
result &= sprintf( "[,%s]", {s2[i]})
end for
-- sync
at2 = sync2
result &= s2[at2]
-- leave loop
exit
end if
end while
end if
end while
-- remainder?
if at1 <= length( s1 ) then
for i = at1 to length(s1) do
result &= sprintf( "[%s,]", {s1[i]} )
end for
elsif at2 <= length( s2 ) then
for i = at2 to length(s2) do
result &= sprintf( "[,%s]", {s2[i]} )
end for
end if
return result
end function
10. Re: data analysis
David and Graeme, all i can say is wow, and thanks. David's presentation
of the data is going to be easier to use first, but in the following test,
David's didn't resync properly, and i think cause it is it trying to resync
always on s2. In Sweigsdunka vs Zweigsdanka, it sync'd and found
"weigsd", but then didn't resync properly on the "nka". It took the first 'a'
from Zweigsdanka and went looking for it in Sweigsdunka, not finding it
untill the end of the word. It thereby missed the common "nka". Swaping
the words around didn't help David's code's results at all, but messed up
Graeme's code's results in a new way. Is it possible to force which word is
the primary sync in your code, David, in a way i can spec while it's
running? Mostly, i'd be looking for the result with the fewest number of
differences. /me is still studying the code....
David, a question, why do you find shortest string at the start, and then not
do anything with len?
Kat
11. Re: data analysis
On 17 Jan 2001, at 0:55, Kat wrote:
> David and Graeme, all i can say is wow, and thanks. David's presentation
> of the data is going to be easier to use first, but in the following test,
> David's didn't resync properly, and i think cause it is it trying to resync
> always on s2. In Sweigsdunka vs Zweigsdanka, it sync'd and found "weigsd",
> but then didn't resync properly on the "nka". It took the first 'a' from
> Zweigsdanka and went looking for it in Sweigsdunka, not finding it untill the
> end of the word. It thereby missed the common "nka". Swaping the words around
> didn't help David's code's results at all, but messed up Graeme's code's
> results in a new way. Is it possible to force which word is the primary sync
> in your code, David, in a way i can spec while it's running? Mostly, i'd be
> looking for the result with the fewest number of differences. /me is still
> studying the code....
Changing MaxGap to 2 made it resync faster, but i'm not sure yet that
passing maxgap to diff() along with the words is the right answer yet..
Kat
12. Re: data analysis
On 17 Jan 2001, at 14:02, Kat wrote:
> On 17 Jan 2001, at 0:55, Kat wrote:
>
> > David and Graeme, all i can say is wow, and thanks. David's presentation of
> > the data is going to be easier to use first, but in the following test,
> > David's didn't resync properly, and i think cause it is it trying to resync
> > always on s2. In Sweigsdunka vs Zweigsdanka, it sync'd and found "weigsd",
> > but then didn't resync properly on the "nka". It took the first 'a' from
> > Zweigsdanka and went looking for it in Sweigsdunka, not finding it untill
> > the end of the word. It thereby missed the common "nka". Swaping the words
> > around didn't help David's code's results at all, but messed up Graeme's
> > code's results in a new way. Is it possible to force which word is the
> > primary sync in your code, David, in a way i can spec while it's running?
> > Mostly, i'd be looking for the result with the fewest number of differences.
> > /me is still studying the code....
>
> Changing MaxGap to 2 made it resync faster, but i'm not sure yet that
> passing maxgap to diff() along with the words is the right answer yet..
Ok, passing different MaxGaps to diff() gets results, now to see if i can
change MaxGap every time sync is lost.....
Kat
13. Re: data analysis
- Posted by Kat <gertie at PELL.NET>
Jan 17, 2001
-
Last edited Jan 18, 2001
On 17 Jan 2001, at 14:02, Kat wrote:
> On 17 Jan 2001, at 0:55, Kat wrote:
>
> > David and Graeme, all i can say is wow, and thanks. David's presentation of
> > the data is going to be easier to use first, but in the following test,
> > David's didn't resync properly, and i think cause it is it trying to resync
> > always on s2. In Sweigsdunka vs Zweigsdanka, it sync'd and found "weigsd",
> > but then didn't resync properly on the "nka". It took the first 'a' from
> > Zweigsdanka and went looking for it in Sweigsdunka, not finding it untill
> > the end of the word. It thereby missed the common "nka". Swaping the words
> > around didn't help David's code's results at all, but messed up Graeme's
> > code's results in a new way. Is it possible to force which word is the
> > primary sync in your code, David, in a way i can spec while it's running?
> > Mostly, i'd be looking for the result with the fewest number of differences.
> > /me is still studying the code....
>
> Changing MaxGap to 2 made it resync faster, but i'm not sure yet that
> passing maxgap to diff() along with the words is the right answer yet..
So i passed both min and max gap to it, and diff() returns a list of unique
results, this way an analysis of the returns gives max results without too
much of an info overload. I still don't see how to change the MaxGap in the
middle of a word, without causing problems, such as making the passed
specs and results more complex, and maybe this is not needed, so i'll put
it on the back burner for now. I found the range of MinG= 2 to MaxG = 6
works best in the following code on the words i tested:
-- almost entirely David Cuny's code
global function diff( sequence s1, sequence s2, integer MinG, integer
MaxG )
integer at1, at2, sync1, sync2
sequence result, bigresult
bigresult = ""
for MaxGap = MinG to MaxG do
result = ""
at1 = 0
at2 = 0
-- process until the end of one string
while 1 do
-- move ahead
at1 += 1
at2 += 1
-- past end of one string?
if at1 > length( s1 )
or at2 > length( s2 ) then
exit
end if
-- same?
if s1[at1] = s2[at2] then
result &= s1[at1]
else
-- attempt to resync
while 1 do
-- find closest sync point
sync2 = find( s1[at1], s2[at2..length(s2)] )
-- too far?
if sync2 > 0 and sync2 < MaxGap then
sync2 += at2 - 1
else
sync2 = 9999
end if
-- find closest sync
sync1 = find( s2[at2], s1[at1..length(s1)] )
-- too far?
if sync1 > 0 and sync1 < MaxGap then
sync1 += at1 - 1
else
sync1 = 9999
end if
-- evaluate sync
if sync1 = 9999
and sync2 = 9999 then
-- no sync
result &= sprintf( "[%s,%s]", {s1[at1],s2[at2]} )
-- at end?
if at1 = length( s1 )
or at2 = length( s2 ) then
exit
end if
-- skip
at1 += 1
at2 += 1
elsif sync1 < sync2 then
-- match on sync1
for i = at1 to sync1-1 do
result &= sprintf( "[%s,]", {s1[i]} )
end for
-- sync
at1 = sync1
result &= s1[at1]
-- leave loop
exit
else
-- match on sync2
for i = at2 to sync2-1 do
result &= sprintf( "[,%s]", {s2[i]})
end for
-- sync
at2 = sync2
result &= s2[at2]
-- leave loop
exit
end if
end while
end if
end while
-- remainder?
if at1 <= length( s1 ) then
for i = at1 to length(s1) do
result &= sprintf( "[%s,]", {s1[i]} )
end for
elsif at2 <= length( s2 ) then
for i = at2 to length(s2) do
result &= sprintf( "[,%s]", {s2[i]} )
end for
end if
if ( match(result,bigresult) = 0 ) then
bigresult &= "\n" & result
-- i used "\n" just to make it easy to puts() it
end if
end for
return bigresult
end function -- diff( sequence s1, sequence s2, integer MinG, integer
MaxG )
14. Re: data analysis
- Posted by David Cuny <dcuny at LANSET.COM>
Jan 17, 2001
-
Last edited Jan 18, 2001
Here's an updated version of the pattern matcher. The output is changed a
bit (hopefully it's more clear), and I've added some more features. It now
checks for swapped letters, and prefers removal over resyncing.
Given the test data:
"this", "that"
"actually", "actaully"
"Pittsborough", "Pittsburg"
"Pittsberg", "Pittsburg"
"shello", "hellos"
"fhfellos", "shello"
"Sweigsdunka", "Zweigsdanka"
It produces:
th[is,at]
act[ua,au]lly
Pittsb[o,u]r[ou,]g[h,]
Pittsb[e,u]rg
[s,]hello[,s]
[f,s]h[f,]ello[s,]
[S,Z]weigsd[u,a]nka
-- David Cuny
constant MaxGap = 6
global function diff( sequence s1, sequence s2 )
integer at1, at2, sync1, sync2, diff1, diff2, best
sequence result,left,right
result = ""
left = ""
right = ""
at1 = 0
at2 = 0
-- process until the end of one string
while 1 do
-- move ahead
at1 += 1
at2 += 1
-- past end of one string?
if at1 > length( s1 )
or at2 > length( s2 ) then
exit
end if
-- same?
if s1[at1] = s2[at2] then
if length( left ) or length( right ) then
result &= sprintf( "[%s,%s]", {left,right})
left = ""
right = ""
end if
result &= s1[at1]
else
-- attempt to resync
while 1 do
-- swapped letters?
if at1 < length( s1 )
and at2 < length( s2 )
and s1[at1] = s2[at2+1]
and s1[at1+1] = s2[at2] then
left &= s1[at1..at1+1]
right &= s2[at2..at2+1]
at1 += 1
at2 += 1
exit
end if
-- find closest sync
diff1 = find( s2[at2], s1[at1..length(s1)] )
-- too far?
if diff1 > 0 and diff1 < MaxGap then
sync1 = at1 + diff1 - 1
else
sync1 = 9999
diff1 = 9999
end if
-- find closest sync point
diff2 = find( s1[at1], s2[at2..length(s2)] )
-- too far?
if diff2 > 0 and diff2 < MaxGap then
sync2 = at2 + diff2 - 1
else
sync2 = 9999
diff2 = 9999
end if
-- better to remove chars?
if sync1 != 9999
or sync2 != 9999 then
if diff1 < diff2 then
best = diff1
else
best = diff2
end if
for i = 1 to best do
if at1+i > length( s1 )
or at2+i > length( s2 ) then
exit
end if
if s1[at1+i-1] = s2[at2+i-1] then
-- better match
diff1 = i
diff2 = i
sync1 = at1+i-1
sync2 = at2+i-1
exit
end if
end for
end if
-- evaluate sync
if sync1 = 9999
and sync2 = 9999 then
-- no sync
left &= s1[at1]
right &= s2[at2]
-- at end?
if at1 = length( s1 )
or at2 = length( s2 ) then
exit
end if
-- skip
at1 += 1
at2 += 1
-- on a match?
if s1[at1] = s2[at2] then
if length( left ) or length( right ) then
result &= sprintf( "[%s,%s]", {left,right})
left = ""
right = ""
end if
result &= s1[at1]
exit
end if
elsif diff1 = diff2
and s1[sync1] = s2[sync2] then
-- match to both
for i = 0 to diff1-2 do
left &= s1[at1+i]
right &= s2[at2+i]
end for
-- sync
at1 = sync1
at2 = sync2
if length( left ) or length( right ) then
result &= sprintf( "[%s,%s]", {left,right})
left = ""
right = ""
end if
result &= s1[at1]
-- leave loop
exit
elsif diff1 <= diff2 then
-- match on sync1
for i = at1 to sync1-1 do
left &= s1[i]
end for
-- sync
at1 = sync1
if length( left ) or length( right ) then
result &= sprintf( "[%s,%s]", {left,right})
left = ""
right = ""
end if
result &= s1[at1]
-- leave loop
exit
else
-- match on sync2
for i = at2 to sync2-1 do
right &= s2[i]
end for
-- sync
at2 = sync2
if length( left ) or length( right ) then
result &= sprintf( "[%s,%s]", {left,right})
left = ""
right = ""
end if
result &= s2[at2]
-- leave loop
exit
end if
end while
end if
end while
-- remainder?
if at1 <= length( s1 ) then
for i = at1 to length(s1) do
left &= s1[i]
end for
elsif at2 <= length( s2 ) then
for i = at2 to length(s2) do
right &= s2[i]
end for
end if
if length( left ) or length( right ) then
result &= sprintf( "[%s,%s]", {left,right})
end if
return result
end function
15. Re: data analysis
- Posted by Kat <gertie at PELL.NET>
Jan 17, 2001
-
Last edited Jan 18, 2001
On 17 Jan 2001, at 19:15, David Cuny wrote:
> Here's an updated version of the pattern matcher. The output is changed a bit
> (hopefully it's more clear), and I've added some more features. It now checks
> for swapped letters, and prefers removal over resyncing.
>
> Given the test data:
>
> "this", "that"
> "actually", "actaully"
> "Pittsborough", "Pittsburg"
> "Pittsberg", "Pittsburg"
> "shello", "hellos"
> "fhfellos", "shello"
> "Sweigsdunka", "Zweigsdanka"
>
>
> It produces:
>
> th[is,at]
> act[ua,au]lly
> Pittsb[o,u]r[ou,]g[h,]
> Pittsb[e,u]rg
> [s,]hello[,s]
> [f,s]h[f,]ello[s,]
> [S,Z]weigsd[u,a]nka
Ok, that breaks the feeding of MinG and MaxG to it. I still don't understand
the resyncing, but maybe i will one day. Did you see how the code i
posted returns a list of results? Comments?
Kat
16. Re: data analysis
- Posted by Kat <gertie at PELL.NET>
Jan 17, 2001
-
Last edited Jan 18, 2001
On 17 Jan 2001, at 19:15, David Cuny wrote:
> Here's an updated version of the pattern matcher. The output is changed a bit
> (hopefully it's more clear), and I've added some more features. It now checks
> for swapped letters, and prefers removal over resyncing.
>
> Given the test data:
> "Pittsborough", "Pittsburg"
> It produces:
> Pittsb[o,u]r[ou,]g[h,]
I can't get it to sync on the 'u' now. Everything else is quite workable!
Kat
17. Re: data analysis
- Posted by David Cuny <dcuny at LANSET.COM>
Jan 17, 2001
-
Last edited Jan 18, 2001
Kat wrote:
> I still don't understand the resyncing,
> but maybe i will one day.
Think of the two strings as queues. The algorithm tries to make sure that
the first two items in the queue are the same. If they are, it's a match,
and they can be removed from the queue. For example:
[C]amero
[C]odas
Both strings start with 'C', so they are removed from the queue. If the
characters are different, the algorithm looks in both queues to see which
has the smallest gap:
amero = has 'o' in 5th position
odas = has 'a' in 3rd position
In this case, the gap of 3 is the smallest, so two characters are removed
from the second string to get the two in 'sync':
amero
[od]as
Finally, if there is no match with the prior test, both characters are
simply removed from the head of each queue:
[m]ero
[a]s
Eventually, both queues are empty, or one of the strings has stuff left
over.
That was the first version of the algorithm, anyway. I added a couple of
additional tests. For one thing, it looks for reversed letters:
[ca]t
[ac]t
For another, if there is a gap, it looks to see if there if it can remove
characters from the queue instead. For example:
clack
dreck
The solution of removing the first letters from both queues:
[cla]ck
[dre]ck
is preferred to removing them from only one queue, using the 'sync' routine
outlined above:
clack
[dre]ck <-- less good solution
> Did you see how the code i
> posted returns a list of results?
Changing MaxGap on the fly seems to me to be a bad idea, but if it works
better for you, go for it.
-- David Cuny
18. Re: data analysis
- Posted by Kat <gertie at PELL.NET>
Jan 17, 2001
-
Last edited Jan 18, 2001
On 17 Jan 2001, at 20:30, David Cuny wrote:
> > Did you see how the code i
> > posted returns a list of results?
>
> Changing MaxGap on the fly seems to me to be a bad idea, but if it works
> better for you, go for it.
Why is it a bad idea? What i was trying to do was to hit all possible resync
points.
Kat
19. Re: data analysis
- Posted by David Cuny <dcuny at LANSET.COM>
Jan 17, 2001
-
Last edited Jan 18, 2001
Kat wrote:
> Why is [changing MaxGap] a bad idea?
> What i was trying to do was to hit all
> possible resync points.
It's not, actually. For example, David Cope uses a similar pattern matcher
in his EMI program, only it works with musical pitches instead of letters.
In his application, the parameters are self tuning: he feeds it music by a
composer, and it automatically adjusts parameters of the pattern matcher
until the music it regenerates matches the same statistics as the source
material.
Something else you might want to consider - I recall reading that pattern
matching for speech recognition wasn't all that great, until someone decided
to use Markov chains to 'guess' what the next word might be. The utterance
would be first compared to that set, and if there were no good candidates, a
brute force match would be done. Perhaps something similar might work with
Tiggr?
-- David Cuny
20. Re: data analysis
On 17 Jan 2001, at 22:15, David Cuny wrote:
> Kat wrote:
>
> > Why is [changing MaxGap] a bad idea?
> > What i was trying to do was to hit all
> > possible resync points.
>
> It's not, actually. For example, David Cope uses a similar pattern matcher in
> his EMI program, only it works with musical pitches instead of letters. In his
> application, the parameters are self tuning: he feeds it music by a composer,
> and it automatically adjusts parameters of the pattern matcher until the music
> it regenerates matches the same statistics as the source material.
>
> Something else you might want to consider - I recall reading that pattern
> matching for speech recognition wasn't all that great, until someone decided
> to use Markov chains to 'guess' what the next word might be. The utterance
> would be first compared to that set, and if there were no good candidates, a
> brute force match would be done. Perhaps something similar might work with
> Tiggr?
I considered that many yrs ago, but the research i have run across since
has shown me that prediction works only in knowledge/theme domains the
script has knowledge of. The trick is to either know all the domains under
discussion (which is how most of the Turing tests held in Australia are
done: the script's authors get to choose the domain for that script's
interaction with the judges), or don't use much prediction. Since i use
variables for nearly everything, turning on/off the prediction on the fly isn't
a
problem, and some word pre/postdiction is already coded into the
database, as well as syntactic pre/post"diction". Even the domain is
encoded for each word on a per-use basis, altho a *lot* of data is still not
entered. What i have been concentrating on mostly, since i discovered
harddrives no longer cost $1000/megabyte, is to collect the data as sets
within a domain, with text concerning relavant words in that domain, for
Tiggr or Gertie to use as pre/postdictors in that domain, as well as
associative words. If she didn't have "nephrology" listed in the dictionary as
medical, a broad search would find it in the knowledgebase in the grouping
with other medical terms, and she could update the dictionary herself. At
least i *hope* she has these eureka moments, cause i *sure* don't want to
finish that job for her!
I don't know why some things are so difficult for me to understand. It took
me years to learn checkers, but i won my first chess game the same day i
first saw the chess board. I'm wierd like that. So i am really glad to have
this list here to help me out in difficult coding. Thanks again David.
Graeme, i simply haven't gotten to your code yet, but i bet i use David's
2nd and 3rd posting with your's in parallel threads.
Kat
21. Re: data analysis
- Posted by Graeme <graemeburke at CROSSWINDS.NET>
Jan 18, 2001
-
Last edited Jan 19, 2001
>I don't know why some things are so difficult for me to understand. It took
>me years to learn checkers, but i won my first chess game the same day i
>first saw the chess board. I'm wierd like that. So i am really glad to have
>this list here to help me out in difficult coding. Thanks again David.
>Graeme, i simply haven't gotten to your code yet, but i bet i use David's
>2nd and 3rd posting with your's in parallel threads.
Hey, I'm just kickin' back and enjoying this one. Feeling a
bit guilty that I havn't done anything else with my little
routine, but I've got pleanty to do and competeing with David
on stuff like this is a bit of a pointless exercise. If I feel
I can contribute I will, but it looks like you guys have it well
in hand.
Graeme.
(BTW: The chess/checkers thing sounds exactly like me - there's
a scary thought for you
----------------------------------------------------