Re: data analysis

new topic     » goto parent     » topic index » view thread      » older message » newer message

Kat wrote:

> I still don't understand the resyncing,
> but maybe i will one day.

Think of the two strings as queues. The algorithm tries to make sure that
the first two items in the queue are the same. If they are, it's a match,
and they can be removed from the queue. For example:

   [C]amero
   [C]odas

Both strings start with 'C', so they are removed from the queue. If the
characters are different, the algorithm looks in both queues to see which
has the smallest gap:

   amero = has 'o' in 5th position
   odas = has 'a' in 3rd position

In this case, the gap of 3 is the smallest, so two characters are removed
from the second string to get the two in 'sync':

   amero
   [od]as

Finally, if there is no match with the prior test, both characters are
simply removed from the head of each queue:

   [m]ero
   [a]s

Eventually, both queues are empty, or one of the strings has stuff left
over.

That was the first version of the algorithm, anyway. I added a couple of
additional tests. For one thing, it looks for reversed letters:

   [ca]t
   [ac]t

For another, if there is a gap, it looks to see if there if it can remove
characters from the queue instead. For example:

   clack
   dreck

The solution of removing the first letters from both queues:

   [cla]ck
   [dre]ck

is preferred to removing them from only one queue, using the 'sync' routine
outlined above:

   clack
   [dre]ck <-- less good solution

> Did you see how the code i
> posted returns a list of results?

Changing MaxGap on the fly seems to me to be a bad idea, but if it works
better for you, go for it.

-- David Cuny

new topic     » goto parent     » topic index » view thread      » older message » newer message

Search



Quick Links

User menu

Not signed in.

Misc Menu