1. data analysis

I am trying to look at a set amount of data and figure out a repeting data
pattern, any idea where i can get some routiens that already do this? None
of mine work =(



J Reeves
Grape Vine
ICQ# 13728824

_________________________________________________________________
Get your FREE download of MSN Explorer at http://explorer.msn.com

new topic     » topic index » view message » categorize

2. Re: data analysis

At 08:51 PM 15/01/01 -0800, you wrote:
>I am trying to look at a set amount of data and figure out a repeting data
>pattern, any idea where i can get some routiens that already do this? None
>of mine work =(
>
>
>
>J Reeves
>Grape Vine



if I understand the question correctly, this should help.


It copies dat to s then searches s for characters.
if it isn't a pattern, the found character is erased
in s (by adding one to it) then it searches again.
dat is preserved to test for patterns.

This could be optimised by shortening s for subsequent searches.

It still needs some checking, passing it the wrong sequence
will crash it with a 'slice ends past seq end' error, but
it should get you of on the right foot.

Graeme




function get_pattern(sequence dat)
    integer f
    sequence s,test

    s=dat
    s[1]+=1
    while 1 do

        f=find(dat[1],s)
        if not f then
            return "No Pattern"
        end if

        if equal(dat[1..f-1],dat[f..f*2-2]) then
            test=dat[1..f-1]
            for x=1 to floor(length(dat)/length(test))+1 do
                test&=dat[1..f-1]
            end for
            if equal(test[1..length(dat)],dat) then
                return dat[1..f-1]
            end if
            s[f]+=1
        else
            s[f]+=1
        end if

    end while
end function




puts(1,get_pattern("asdfasdfasdfasdf")&"\n")
puts(1,get_pattern("aaajaaajaaajaaaj")&"\n")
puts(1,get_pattern("efvineifvuhdsdfsdf")&"\n")


----------------------------------------------------

new topic     » goto parent     » topic index » view message » categorize

3. Re: data analysis

Speaking of such, i got started on Metaphone again, and have the same
complaint i had about Soundex: they both throw away too much data. So i
am writing other codes into the Metaphone code, to present a bit more of
the data, analyzed differently. And fine-tuning the Metaphone code. Testing
on irc now, when i am happier with it, i'll send it to RDS. If i can't use it, i
prolly won't send it. What i want to do is have Tiggr compare a few yrs of
irc logs to a 130K word dictionary, and Metaphonize those words not in the
dictionary, and see if she can find a match in the Metaphonized dictionary.

Any suggestions?

Kat

new topic     » goto parent     » topic index » view message » categorize

4. Re: data analysis

Graeme, since you did that amazing pattern finding code,
i was wondering what your thoughts would be on this problem...

given:
<any> !m pittsbourgh
<[Tiggr]> pittsbourgh = PITSBERG CVCCCVCC CVCVC 11312
<any> !m pittsburg
<[Tiggr]> pittsburg = PITSBERG CVCCCVCC CVCVC 11312
<any> !m pittsburgh
<[Tiggr]> pittsburgh = PITSBERG CVCCCVCC CVCVC 11312
<any> !m pittsberg
 <[Tiggr]> pittsberg = PITSBERG CVCCCVCC CVCVC 11312

there is still one piece of relavant data missing: word comparisons, for the
patterns, for in the above instances, "pittsbourgh" vs "pittsburgh" might
return (s=pittsb,d=o,s=urgh) or some representation containing the data
that "pittsb" is the same in both, "o" is different, and "urgh" is the same in
both. The location within the word is not so important at the moment; and i
have made the decision that the first word is the longer of the two, just to
standardize. As you can see in the above printout, i currently have 3
pieces of info: the Metaphoneized word (essentially a rough usa phonetic
spelling), then the placement of consonants and vowels, and finally that
placement with redundancies removed, and i'd like to be able to pass any
of the three to the comparison code. The above example is possibly easy,
since there is a big section of each word at the very start that is the same,
my problem is in the resync of the comparison after that 'o'. Is this possible
without the puter being bogged down all day in the comparison?

Kat

new topic     » goto parent     » topic index » view message » categorize

5. Re: data analysis

At 02:14 PM 16/01/01 -0800, you wrote:
>Kat wrote:
>
>> Graeme, since you did that amazing
>> pattern finding code, i was wondering
>> what your thoughts would be on this problem...
>
>I'm obviously not Graeme, but I couldn't resist. This routine will compare
>two strings, returning the differences between them. For example:
>

>    return result
>
>end function
>


Bah! Got me. mine's still about half a hour away. :)
(Just woke up), I have a feeling it might be a fair
bit quicker than that one, though.

Graeme.


----------------------------------------------------

new topic     » goto parent     » topic index » view message » categorize

6. Re: data analysis

Kat wrote:

> Graeme, since you did that amazing
> pattern finding code, i was wondering
> what your thoughts would be on this problem...

I'm obviously not Graeme, but I couldn't resist. This routine will compare
two strings, returning the differences between them. For example:

   diff( "abczz", "dbnz" )

will return:

   "[a,d]b[c,n]z[z,]"

The notation is a bit funky, but you can adjust it as you see fit:

1. 'x' means that 'x' was found in both strings.

2. '[x,]' means that 'x' was found in the first string, but not the second.

3. '[,x]' means that 'x' was found in the second string, but not the first.

4. [x,y] means that 'x' was found in the first string, but not the second
and 'y' was found in the second string but not the first.

-- David Cuny

-- compare two strings, return differences

function diff( sequence s1, sequence s2 )

    integer at1, at2, len, sync1, sync2
    sequence result

    result = ""

    -- find shortest string
    if length( s1 ) > length( s2 ) then
        len = length( s1 )
    else
        len = length( s2 )
    end if

    at1 = 0
    at2 = 0

    -- process until the end of one string
    while 1 do

        -- move ahead
        at1 += 1
        at2 += 1

        -- past end of one string?
        if at1 > length( s1 )
        or at2 > length( s2 ) then
            exit
        end if

        -- same?
        if s1[at1] = s2[at2] then
            result &= s1[at1]
        else
            -- attempt to resync
            while 1 do

                -- find closest sync
                sync1 = 9999
                for i = at1+1 to length(s1) do
                    if s2[at2] = s1[i] then
                        sync1 = i
                        exit
                    end if
                end for

                sync2 = 9999
                for i = at2+1 to length(s2) do
                    if s1[at1] = s2[i] then
                        sync2 = i
                        exit
                    end if
                end for

                -- result of sync
                if sync1 = 9999
                and sync2 = 9999 then
                    -- no sync
                    result &= sprintf( "[%s,%s]", {s1[at1],s2[at2]} )

                    -- at end?
                    if at1 = length( s1 )
                    or at2 = length( s2 ) then
                        exit
                    end if

                    -- skip
                    at1 += 1
                    at2 += 1


                elsif sync1 < sync2 then
                    -- match on sync1
                    for i = at1 to sync1-1 do
                        result &= sprintf( "[%s,]", {s1[i]} )
                    end for

                    -- sync
                    at1 = sync1
                    result &= s1[at1]

                    -- leave loop
                    exit


                else
                    -- match on sync2
                    for i = at2 to sync2-1 do
                        result &= sprintf( "[,%s]", {s2[i]})
                    end for

                    -- sync
                    at2 = sync2
                    result &= s2[at2]

                    -- leave loop
                    exit


                end if

            end while

        end if

    end while

    -- remainder?
    if at1 <= length( s1 ) then
        for i = at1 to length(s1) do
            result &= sprintf( "[%s,]", {s1[i]} )
        end for
    elsif at2 <= length( s2 ) then
        for i = at2 to length(s2) do
            result &= sprintf( "[,%s]", {s2[i]} )
        end for

    end if

    return result

end function

new topic     » goto parent     » topic index » view message » categorize

7. Re: data analysis

thanx for the help guys, Ill try them and see if one/both help.
Mainly what i am doing it bit pattern matching tho it is in sequence's
of 1/0 X 8

J Reeves
Grape Vine
ICQ# 13728824

_________________________________________________________________
Get your FREE download of MSN Explorer at http://explorer.msn.com

new topic     » goto parent     » topic index » view message » categorize

8. Re: data analysis

--=====================_979651966==_

At 08:44 AM 17/01/01 +1100, you wrote:
>At 02:14 PM 16/01/01 -0800, you wrote:
>>Kat wrote:
>>
>>> Graeme, since you did that amazing
>>> pattern finding code, i was wondering
>>> what your thoughts would be on this problem...
>>
>>I'm obviously not Graeme, but I couldn't resist. This routine will compare
>>two strings, returning the differences between them. For example:
>>
>
>>    return result
>>
>>end function
>>
>
>Bah! Got me. mine's still about half a hour away. :)
>(Just woke up), I have a feeling it might be a fair
>bit quicker than that one, though.
>
>Graeme.





OK so here it is:

I've tested David's and my own solution with
the following test data:

{"this","that"},
{"actually","actaully"},
{"Pittsborough","Pittsburg"},
{"Pittsberg","Pittsburg"},
{"shello","hellos"},
{"fhfellos","shello"}

results as follows

( p=pass , f=fail )

                DC      GB

test1           p       p
test2           p       p
test3           p       p
test4           f       p
test5           p       f
test6           f       f

Depending on how harsh you're feeling, test3
could also be considered a failure for both.
('r' not matched)

Benchmarking indicates that my routine
is about 4.5 x faster than David's (on a P3)

( hopefully this will prompt David to write a
routine using match() that is faster and better
than both of the above )


routines & test prog attached

Graeme















--=====================_979651966==_
 x-mac-type="705A4950"; x-mac-creator="705A4950"

new topic     » goto parent     » topic index » view message » categorize

9. Re: data analysis

Graeme wrote:

> results as follows

Thanks, I've made a couple of corrections:

1. Replaced the buggy code ('at1+1' should have read 'at+1') with find().
This corrected a bug, and made my code only about half as slow as Graeme's.
Thanks for the hint!

2. Added a MaxGap value that rejects matches past a certain length. This
allows matches text like the final example's. There are better ways to
decide this, but it was quick and dirty.

Thanks!

-- David Cuny


constant MaxGap = 6

global function diff( sequence s1, sequence s2 )

    integer at1, at2, len, sync1, sync2
    sequence result

    result = ""

    -- find shortest string
    if length( s1 ) > length( s2 ) then
        len = length( s1 )
    else
        len = length( s2 )
    end if

    at1 = 0
    at2 = 0

    -- process until the end of one string
    while 1 do

        -- move ahead
        at1 += 1
        at2 += 1

        -- past end of one string?
        if at1 > length( s1 )
        or at2 > length( s2 ) then
            exit
        end if

        -- same?
        if s1[at1] = s2[at2] then
            result &= s1[at1]
        else
            -- attempt to resync
            while 1 do

                -- find closest sync
                sync1 = find( s2[at2], s1[at1..length(s1)] )

                -- too far?
                if sync1 > 0 and sync1 < MaxGap then
                    sync1 += at1 - 1
                else
                    sync1 = 9999
                end if

                find closest sync point
                sync2 = find( s1[at1], s2[at2..length(s2)] )

                -- too far?
                if sync2 > 0 and sync2 < MaxGap then
                    sync2 += at2 - 1
                else
                    sync2 = 9999
                end if

                -- evaluate sync
                if sync1 = 9999
                and sync2 = 9999 then
                    -- no sync
                    result &= sprintf( "[%s,%s]", {s1[at1],s2[at2]} )

                    -- at end?
                    if at1 = length( s1 )
                    or at2 = length( s2 ) then
                        exit
                    end if

                    -- skip
                    at1 += 1
                    at2 += 1


                elsif sync1 < sync2 then
                    -- match on sync1
                    for i = at1 to sync1-1 do
                        result &= sprintf( "[%s,]", {s1[i]} )
                    end for

                    -- sync
                    at1 = sync1
                    result &= s1[at1]

                    -- leave loop
                    exit


                else
                    -- match on sync2
                    for i = at2 to sync2-1 do
                        result &= sprintf( "[,%s]", {s2[i]})
                    end for

                    -- sync
                    at2 = sync2
                    result &= s2[at2]

                    -- leave loop
                    exit


                end if

            end while

        end if

    end while

    -- remainder?
    if at1 <= length( s1 ) then
        for i = at1 to length(s1) do
            result &= sprintf( "[%s,]", {s1[i]} )
        end for
    elsif at2 <= length( s2 ) then
        for i = at2 to length(s2) do
            result &= sprintf( "[,%s]", {s2[i]} )
        end for

    end if

    return result

end function

new topic     » goto parent     » topic index » view message » categorize

10. Re: data analysis

David and Graeme, all i can say is wow, and thanks. David's presentation
of the data is going to be easier to use first, but in the following test,
David's didn't resync properly, and i think cause it is it trying to resync
always on s2. In Sweigsdunka vs Zweigsdanka, it sync'd and  found
"weigsd", but then didn't resync properly on the "nka". It took the first 'a'
from Zweigsdanka and went looking for it in Sweigsdunka, not finding it
untill the end of the word. It thereby missed the common "nka". Swaping
the words around didn't help David's code's results at all, but messed up
Graeme's code's results in a new way. Is it possible to force which word is
the primary sync in your code, David, in a way i can spec while it's
running? Mostly, i'd be looking for the result with the fewest number of
differences. /me is still studying the code....

David, a question, why do you find shortest string at the start, and then not
do anything with len?

Kat

new topic     » goto parent     » topic index » view message » categorize

11. Re: data analysis

On 17 Jan 2001, at 0:55, Kat wrote:

> David and Graeme, all i can say is wow, and thanks. David's presentation
> of the data is going to be easier to use first, but in the following test,
> David's didn't resync properly, and i think cause it is it trying to resync
> always on s2. In Sweigsdunka vs Zweigsdanka, it sync'd and  found "weigsd",
> but then didn't resync properly on the "nka". It took the first 'a' from
> Zweigsdanka and went looking for it in Sweigsdunka, not finding it untill the
> end of the word. It thereby missed the common "nka". Swaping the words around
> didn't help David's code's results at all, but messed up Graeme's code's
> results in a new way. Is it possible to force which word is the primary sync
> in your code, David, in a way i can spec while it's running? Mostly, i'd be
> looking for the result with the fewest number of differences. /me is still
> studying the code....

Changing MaxGap to 2 made it resync faster, but i'm not sure yet that
passing maxgap to diff() along with the words is the right answer yet..

Kat

new topic     » goto parent     » topic index » view message » categorize

12. Re: data analysis

On 17 Jan 2001, at 14:02, Kat wrote:

> On 17 Jan 2001, at 0:55, Kat wrote:
>
> > David and Graeme, all i can say is wow, and thanks. David's presentation of
> > the data is going to be easier to use first, but in the following test,
> > David's didn't resync properly, and i think cause it is it trying to resync
> > always on s2. In Sweigsdunka vs Zweigsdanka, it sync'd and  found "weigsd",
> > but then didn't resync properly on the "nka". It took the first 'a' from
> > Zweigsdanka and went looking for it in Sweigsdunka, not finding it untill
> > the end of the word. It thereby missed the common "nka". Swaping the words
> > around didn't help David's code's results at all, but messed up Graeme's
> > code's results in a new way. Is it possible to force which word is the
> > primary sync in your code, David, in a way i can spec while it's running?
> > Mostly, i'd be looking for the result with the fewest number of differences.
> > /me is still studying the code....
>
> Changing MaxGap to 2 made it resync faster, but i'm not sure yet that
> passing maxgap to diff() along with the words is the right answer yet..

Ok, passing different MaxGaps to diff() gets results, now to see if i can
change MaxGap every time sync is lost.....

Kat

new topic     » goto parent     » topic index » view message » categorize

13. Re: data analysis

On 17 Jan 2001, at 14:02, Kat wrote:

> On 17 Jan 2001, at 0:55, Kat wrote:
>
> > David and Graeme, all i can say is wow, and thanks. David's presentation of
> > the data is going to be easier to use first, but in the following test,
> > David's didn't resync properly, and i think cause it is it trying to resync
> > always on s2. In Sweigsdunka vs Zweigsdanka, it sync'd and  found "weigsd",
> > but then didn't resync properly on the "nka". It took the first 'a' from
> > Zweigsdanka and went looking for it in Sweigsdunka, not finding it untill
> > the end of the word. It thereby missed the common "nka". Swaping the words
> > around didn't help David's code's results at all, but messed up Graeme's
> > code's results in a new way. Is it possible to force which word is the
> > primary sync in your code, David, in a way i can spec while it's running?
> > Mostly, i'd be looking for the result with the fewest number of differences.
> > /me is still studying the code....
>
> Changing MaxGap to 2 made it resync faster, but i'm not sure yet that
> passing maxgap to diff() along with the words is the right answer yet..

So i passed both min and max gap to it, and diff() returns a list of unique
results, this way an analysis of the returns gives max results without too
much of an info overload. I still don't see how to change the MaxGap in the
middle of a word, without causing problems, such as making the passed
specs and results more complex, and maybe this is not needed, so i'll put
it on the back burner for now. I found the range of MinG= 2 to MaxG = 6
works best in the following code on the words i tested:

-- almost entirely David Cuny's code


global function diff( sequence s1, sequence s2, integer MinG, integer
MaxG )

    integer at1, at2, sync1, sync2
    sequence result, bigresult

    bigresult = ""


for MaxGap = MinG to MaxG do
    result = ""
    at1 = 0
    at2 = 0

    -- process until the end of one string
    while 1 do

        -- move ahead
        at1 += 1
        at2 += 1

        -- past end of one string?
        if at1 > length( s1 )
        or at2 > length( s2 ) then
            exit
        end if

        -- same?
        if s1[at1] = s2[at2] then
            result &= s1[at1]
        else
            -- attempt to resync
            while 1 do

            -- find closest sync point
                 sync2 = find( s1[at1], s2[at2..length(s2)] )

                -- too far?
                  if sync2 > 0 and sync2 < MaxGap then
                    sync2 += at2 - 1
                  else
                    sync2 = 9999
                  end if


                -- find closest sync
                sync1 = find( s2[at2], s1[at1..length(s1)] )

                -- too far?
                if sync1 > 0 and sync1 < MaxGap then
                    sync1 += at1 - 1
                else
                    sync1 = 9999
                end if


                -- evaluate sync
                if sync1 = 9999
                and sync2 = 9999 then
                    -- no sync
                    result &= sprintf( "[%s,%s]", {s1[at1],s2[at2]} )

                    -- at end?
                    if at1 = length( s1 )
                    or at2 = length( s2 ) then
                        exit
                    end if

                    -- skip
                    at1 += 1
                    at2 += 1


                elsif sync1 < sync2 then
                    -- match on sync1
                    for i = at1 to sync1-1 do
                        result &= sprintf( "[%s,]", {s1[i]} )
                    end for

                    -- sync
                    at1 = sync1
                    result &= s1[at1]

                    -- leave loop
                    exit


                else
                    -- match on sync2
                    for i = at2 to sync2-1 do
                        result &= sprintf( "[,%s]", {s2[i]})
                    end for

                    -- sync
                    at2 = sync2
                    result &= s2[at2]

                    -- leave loop
                    exit


                end if

            end while

        end if

    end while



    -- remainder?
    if at1 <= length( s1 ) then
        for i = at1 to length(s1) do
            result &= sprintf( "[%s,]", {s1[i]} )
        end for
    elsif at2 <= length( s2 ) then
        for i = at2 to length(s2) do
            result &= sprintf( "[,%s]", {s2[i]} )
        end for

    end if


    if ( match(result,bigresult) = 0 ) then
     bigresult &= "\n" & result
-- i used "\n" just to make it easy to puts() it
    end if

  end for

    return bigresult

end function -- diff( sequence s1, sequence s2, integer MinG, integer
MaxG )

new topic     » goto parent     » topic index » view message » categorize

14. Re: data analysis

Here's an updated version of the pattern matcher. The output is changed a
bit (hopefully it's more clear), and I've added some more features. It now
checks for swapped letters, and prefers removal over resyncing.

Given the test data:

   "this", "that"
   "actually", "actaully"
   "Pittsborough", "Pittsburg"
   "Pittsberg", "Pittsburg"
   "shello", "hellos"
   "fhfellos", "shello"
   "Sweigsdunka", "Zweigsdanka"


It produces:

   th[is,at]
   act[ua,au]lly
   Pittsb[o,u]r[ou,]g[h,]
   Pittsb[e,u]rg
   [s,]hello[,s]
   [f,s]h[f,]ello[s,]
   [S,Z]weigsd[u,a]nka

-- David Cuny

constant MaxGap = 6

global function diff( sequence s1, sequence s2 )

    integer at1, at2, sync1, sync2, diff1, diff2, best
    sequence result,left,right

    result = ""
    left = ""
    right = ""

    at1 = 0
    at2 = 0

    -- process until the end of one string
    while 1 do

        -- move ahead
        at1 += 1
        at2 += 1

        -- past end of one string?
        if at1 > length( s1 )
        or at2 > length( s2 ) then
            exit
        end if

        -- same?
        if s1[at1] = s2[at2] then
            if length( left ) or length( right ) then
                result &= sprintf( "[%s,%s]", {left,right})
                left = ""
                right = ""
            end if
            result &= s1[at1]
        else

            -- attempt to resync
            while 1 do

                -- swapped letters?
                if at1 < length( s1 )
                and at2 < length( s2 )
                and s1[at1] = s2[at2+1]
                and s1[at1+1] = s2[at2] then
                    left &= s1[at1..at1+1]
                    right &= s2[at2..at2+1]
                    at1 += 1
                    at2 += 1
                    exit
                end if

                -- find closest sync
                diff1 = find( s2[at2], s1[at1..length(s1)] )

                -- too far?
                if diff1 > 0 and diff1 < MaxGap then
                    sync1 = at1 + diff1 - 1
                else
                    sync1 = 9999
                    diff1 = 9999
                end if

                -- find closest sync point
                diff2 = find( s1[at1], s2[at2..length(s2)] )

                -- too far?
                if diff2 > 0 and diff2 < MaxGap then
                    sync2 = at2 + diff2 - 1
                else
                    sync2 = 9999
                    diff2 = 9999
                end if

                -- better to remove chars?
                if sync1 != 9999
                or sync2 != 9999 then

                    if diff1 < diff2 then
                        best = diff1
                    else
                        best = diff2
                    end if

                    for i = 1 to best do
                        if at1+i > length( s1 )
                        or at2+i > length( s2 ) then
                            exit
                        end if

                        if s1[at1+i-1] = s2[at2+i-1] then
                            -- better match
                            diff1 = i
                            diff2 = i
                            sync1 = at1+i-1
                            sync2 = at2+i-1
                            exit
                        end if
                    end for

                end if


                -- evaluate sync
                if sync1 = 9999
                and sync2 = 9999 then
                    -- no sync
                    left &= s1[at1]
                    right &= s2[at2]

                    -- at end?
                    if at1 = length( s1 )
                    or at2 = length( s2 ) then
                        exit
                    end if

                    -- skip
                    at1 += 1
                    at2 += 1

                    -- on a match?
                    if s1[at1] = s2[at2] then
                        if length( left ) or length( right ) then
                            result &= sprintf( "[%s,%s]", {left,right})
                            left = ""
                            right = ""
                        end if
                        result &= s1[at1]
                        exit
                    end if

                elsif diff1 = diff2
                and s1[sync1] = s2[sync2] then
                    -- match to both
                    for i = 0 to diff1-2 do
                        left &= s1[at1+i]
                        right &= s2[at2+i]
                    end for

                    -- sync
                    at1 = sync1
                    at2 = sync2

                    if length( left ) or length( right ) then
                        result &= sprintf( "[%s,%s]", {left,right})
                        left = ""
                        right = ""
                    end if
                    result &= s1[at1]

                    -- leave loop
                    exit


                elsif diff1 <= diff2 then
                    -- match on sync1
                    for i = at1 to sync1-1 do
                        left &= s1[i]
                    end for

                    -- sync
                    at1 = sync1
                    if length( left ) or length( right ) then
                        result &= sprintf( "[%s,%s]", {left,right})
                        left = ""
                        right = ""
                    end if
                    result &= s1[at1]

                    -- leave loop
                    exit


                else
                    -- match on sync2
                    for i = at2 to sync2-1 do
                        right &= s2[i]
                    end for

                    -- sync
                    at2 = sync2
                    if length( left ) or length( right ) then
                        result &= sprintf( "[%s,%s]", {left,right})
                        left = ""
                        right = ""
                    end if
                    result &= s2[at2]

                    -- leave loop
                    exit


                end if

            end while

        end if

    end while

    -- remainder?
    if at1 <= length( s1 ) then
        for i = at1 to length(s1) do
            left &= s1[i]
        end for
    elsif at2 <= length( s2 ) then
        for i = at2 to length(s2) do
            right &= s2[i]
        end for

    end if

    if length( left ) or length( right ) then
        result &= sprintf( "[%s,%s]", {left,right})
    end if

    return result

end function

new topic     » goto parent     » topic index » view message » categorize

15. Re: data analysis

On 17 Jan 2001, at 19:15, David Cuny wrote:

> Here's an updated version of the pattern matcher. The output is changed a bit
> (hopefully it's more clear), and I've added some more features. It now checks
> for swapped letters, and prefers removal over resyncing.
>
> Given the test data:
>
>    "this", "that"
>    "actually", "actaully"
>    "Pittsborough", "Pittsburg"
>    "Pittsberg", "Pittsburg"
>    "shello", "hellos"
>    "fhfellos", "shello"
>    "Sweigsdunka", "Zweigsdanka"
>
>
> It produces:
>
>    th[is,at]
>    act[ua,au]lly
>    Pittsb[o,u]r[ou,]g[h,]
>    Pittsb[e,u]rg
>    [s,]hello[,s]
>    [f,s]h[f,]ello[s,]
>    [S,Z]weigsd[u,a]nka

Ok, that breaks the feeding of MinG and MaxG to it. I still don't understand
the resyncing, but maybe i will one day. Did you see how the code i
posted returns a list of results? Comments?

Kat

new topic     » goto parent     » topic index » view message » categorize

16. Re: data analysis

On 17 Jan 2001, at 19:15, David Cuny wrote:

> Here's an updated version of the pattern matcher. The output is changed a bit
> (hopefully it's more clear), and I've added some more features. It now checks
> for swapped letters, and prefers removal over resyncing.
>
> Given the test data:

>    "Pittsborough", "Pittsburg"

> It produces:

>    Pittsb[o,u]r[ou,]g[h,]

I can't get it to sync on the 'u' now. Everything else is quite workable!

Kat

new topic     » goto parent     » topic index » view message » categorize

17. Re: data analysis

Kat wrote:

> I still don't understand the resyncing,
> but maybe i will one day.

Think of the two strings as queues. The algorithm tries to make sure that
the first two items in the queue are the same. If they are, it's a match,
and they can be removed from the queue. For example:

   [C]amero
   [C]odas

Both strings start with 'C', so they are removed from the queue. If the
characters are different, the algorithm looks in both queues to see which
has the smallest gap:

   amero = has 'o' in 5th position
   odas = has 'a' in 3rd position

In this case, the gap of 3 is the smallest, so two characters are removed
from the second string to get the two in 'sync':

   amero
   [od]as

Finally, if there is no match with the prior test, both characters are
simply removed from the head of each queue:

   [m]ero
   [a]s

Eventually, both queues are empty, or one of the strings has stuff left
over.

That was the first version of the algorithm, anyway. I added a couple of
additional tests. For one thing, it looks for reversed letters:

   [ca]t
   [ac]t

For another, if there is a gap, it looks to see if there if it can remove
characters from the queue instead. For example:

   clack
   dreck

The solution of removing the first letters from both queues:

   [cla]ck
   [dre]ck

is preferred to removing them from only one queue, using the 'sync' routine
outlined above:

   clack
   [dre]ck <-- less good solution

> Did you see how the code i
> posted returns a list of results?

Changing MaxGap on the fly seems to me to be a bad idea, but if it works
better for you, go for it.

-- David Cuny

new topic     » goto parent     » topic index » view message » categorize

18. Re: data analysis

On 17 Jan 2001, at 20:30, David Cuny wrote:


> > Did you see how the code i
> > posted returns a list of results?
>
> Changing MaxGap on the fly seems to me to be a bad idea, but if it works
> better for you, go for it.

Why is it a bad idea? What i was trying to do was to hit all possible resync
points.

Kat

new topic     » goto parent     » topic index » view message » categorize

19. Re: data analysis

Kat wrote:

> Why is [changing MaxGap] a bad idea?
> What i was trying to do was to hit all
> possible resync points.

It's not, actually. For example, David Cope uses a similar pattern matcher
in his EMI program, only it works with musical pitches instead of letters.
In his application, the parameters are self tuning: he feeds it music by a
composer, and it automatically adjusts parameters of the pattern matcher
until the music it regenerates matches the same statistics as the source
material.

Something else you might want to consider - I recall reading that pattern
matching for speech recognition wasn't all that great, until someone decided
to use Markov chains to 'guess' what the next word might be. The utterance
would be first compared to that set, and if there were no good candidates, a
brute force match would be done. Perhaps something similar might work with
Tiggr?

-- David Cuny

new topic     » goto parent     » topic index » view message » categorize

20. Re: data analysis

On 17 Jan 2001, at 22:15, David Cuny wrote:

> Kat wrote:
>
> > Why is [changing MaxGap] a bad idea?
> > What i was trying to do was to hit all
> > possible resync points.
>
> It's not, actually. For example, David Cope uses a similar pattern matcher in
> his EMI program, only it works with musical pitches instead of letters. In his
> application, the parameters are self tuning: he feeds it music by a composer,
> and it automatically adjusts parameters of the pattern matcher until the music
> it regenerates matches the same statistics as the source material.
>
> Something else you might want to consider - I recall reading that pattern
> matching for speech recognition wasn't all that great, until someone decided
> to use Markov chains to 'guess' what the next word might be. The utterance
> would be first compared to that set, and if there were no good candidates, a
> brute force match would be done. Perhaps something similar might work with
> Tiggr?

I considered that many yrs ago, but the research i have run across since
has shown me that prediction works only in knowledge/theme domains the
script has knowledge of. The trick is to either know all the domains under
discussion (which is how most of the Turing tests held in Australia are
done: the script's authors get to choose the domain for that script's
interaction with the judges), or don't use much prediction. Since i use
variables for nearly everything, turning on/off the prediction on the fly isn't
a
problem, and some word pre/postdiction is already coded into the
database, as well as syntactic pre/post"diction". Even the domain is
encoded for each word on a per-use basis, altho a *lot* of data is still not
entered. What i have been concentrating on mostly, since i discovered
harddrives no longer cost $1000/megabyte, is to collect the data as sets
within a domain, with text concerning relavant words in that domain, for
Tiggr or Gertie to use as pre/postdictors in that domain, as well as
associative words. If she didn't have "nephrology" listed in the dictionary as
medical, a broad search would find it in the knowledgebase in the grouping
with other medical terms, and she could update the dictionary herself. At
least i *hope* she has these eureka moments, cause i *sure* don't want to
finish that job for her!

I don't know why some things are so difficult for me to understand. It took
me years to learn checkers, but i won my first chess game the same day i
first saw the chess board. I'm wierd like that. So i am really glad to have
this list here to help me out in difficult coding. Thanks again David.
Graeme, i simply haven't gotten to your code yet, but i bet i use David's
2nd and 3rd posting with your's in parallel threads.

Kat

new topic     » goto parent     » topic index » view message » categorize

21. Re: data analysis

>I don't know why some things are so difficult for me to understand. It took
>me years to learn checkers, but i won my first chess game the same day i
>first saw the chess board. I'm wierd like that. So i am really glad to have
>this list here to help me out in difficult coding. Thanks again David.
>Graeme, i simply haven't gotten to your code yet, but i bet i use David's
>2nd and 3rd posting with your's in parallel threads.

Hey, I'm just kickin' back and enjoying this one. Feeling a
bit guilty that I havn't done anything else with my little
routine, but I've got pleanty to do and competeing with David
on stuff like this is a bit of a pointless exercise. If I feel
I can contribute I will, but it looks like you guys have it well
in hand.

Graeme.

(BTW: The chess/checkers thing sounds exactly like me - there's
a scary thought for you smile



----------------------------------------------------

new topic     » goto parent     » topic index » view message » categorize

Search



Quick Links

User menu

Not signed in.

Misc Menu