1. String comparisons

I have two strings. I wish to know how closely they match.

"Amarillo in Morning" and "Amarillo Texas in Morning"

Exact matches are easy. What I am looking for is a quality of match number, a correlation coefficient.

Does anyone out there know how to do this?
Any thoughts are appreciated.

Regards,
jd

2. Re: String comparisons

You may split both lines into words, do exact matches per word, and use the number of exact matches as a score.

This is a simple way to do the job. Something more complex would require too much computing capacity.

Jean-Marc
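
A minimal sketch of the word-by-word scoring described above, assuming Euphoria 4.x. The name word_match_score and the choice to divide by the longer word count are illustrative assumptions, not something from the post. Matching here is case-sensitive and splits on single spaces only.

```euphoria
include std/sequence.e   -- split()

-- Hypothetical helper (name is illustrative, not from the thread):
-- fraction of words the two strings share, relative to the longer word list.
function word_match_score(sequence a, sequence b)
    sequence wa = split(a)   -- default delimiter is a single space
    sequence wb = split(b)
    integer hits = 0, pos
    integer longest = length(wa)
    if length(wb) > longest then
        longest = length(wb)
    end if
    if longest = 0 then
        return 1             -- two empty strings count as identical
    end if
    for i = 1 to length(wa) do
        pos = find(wa[i], wb)
        if pos then
            hits += 1
            wb = remove(wb, pos)   -- a word in b may be matched only once
        end if
    end for
    return hits / longest
end function

? word_match_score("Amarillo in Morning", "Amarillo Texas in Morning")  -- 0.75
```

For jd's example, three of the four words match, so the score is 0.75. The cost is roughly proportional to the product of the two word counts, which is small for short strings.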

3. Re: String comparisons

https://github.com/wernsey/miscsrc/blob/master/simil.c

4. Re: String comparisons

Another way of looking at close matches is the longest common substring. Computing it is complex, but thanks to James Johnathan Cook it's just a matter of downloading his include file, https://archive.usingeuphoria.com/lcssp.zip, and calling a routine:

include lcssp.e 
sequence a = "hello good-night", b = "hello good-bye" -- two strings 
sequence common_string = LCSubstr_fast(a, b) 

The ratio length(common_string) / max(length(a), length(b)) is a measure of similarity. Its value is between 0 and 1 inclusive.
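
For readers without lcssp.e at hand, the same ratio can be sketched with a plain dynamic-programming longest common substring. This is only an illustration: longest_common_substring and lcs_similarity are hypothetical names, and this is not the LCSubstr_fast routine from the include file.

```euphoria
-- Hypothetical helper: longest common substring by dynamic programming.
-- NOT the LCSubstr_fast routine from lcssp.e, only an illustration.
function longest_common_substring(sequence a, sequence b)
    integer best_len = 0, best_end = 0
    sequence prev = repeat(0, length(b) + 1)
    sequence curr
    for i = 1 to length(a) do
        curr = repeat(0, length(b) + 1)
        for j = 1 to length(b) do
            if equal(a[i], b[j]) then
                -- extend the common run that ended at a[i-1], b[j-1]
                curr[j + 1] = prev[j] + 1
                if curr[j + 1] > best_len then
                    best_len = curr[j + 1]
                    best_end = i
                end if
            end if
        end for
        prev = curr
    end for
    return a[best_end - best_len + 1 .. best_end]
end function

-- Similarity ratio from the post: common length over the longer length.
function lcs_similarity(sequence a, sequence b)
    integer longest = length(a)
    if length(b) > longest then
        longest = length(b)
    end if
    if longest = 0 then
        return 1    -- two empty strings count as identical
    end if
    return length(longest_common_substring(a, b)) / longest
end function

sequence a = "hello good-night", b = "hello good-bye"
? lcs_similarity(a, b)   -- 11 / 16 = 0.6875
```

The common part here is "hello good-" (11 characters) and the longer string has 16, giving 0.6875. The running time grows with length(a) * length(b).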

5. Re: String comparisons

jmduro said...

This is a simple way to do the job. Something more complex would require too much computing capacity.

Not true. My WRDCOMP4.E is dated May 14 2001, and SEQCOMP5.E is dated Jan 20 2003. If i remember right, neither got contributed because of the significant pushback against Eu containing strings, so please consider that my fear of posting the code still exists. I still have the .e files, but not the test code, and not the code that actually used this day to day.

Sample test for the code included the following:

--{"c-a-t fish","catfish"}, 
--{"eeyes","eyes"}, 
--{"coookies","cookies"}, 
--{{"cookies"},{"cookies"}}, 
--{"NATO is considering allowing Yugoslav soldiers","NATO considered allowing Yugoslavian soldiers"}, 
--{"NATO is considering allowing Yugoslav soldiers","NATO considered allowing Yugoslavian soldiers"}, 
--{"NATO is considering allowing 78 Yugoslav soldiers","NATO is considering allowing 71 Yugoslav soldiers"}, 
--{{"NATO","is","considering","allowing","78","Yugoslav","soldiers"},{"NATO","is","considering","allowing","71","Yugoslav","soldiers"}}, 
--{{"the","Hook","of","Holland","gateway","to","Rotterdam",",","the","world's","biggest","port","."}, 
-- {"entrances","to","Rotterdam",",","the","world's","largest","port","."}}, 
 
--{"long",""}, 
--{"this bad english","the english muffins"}, 
--{"extrememly","extremely"}, 
--{"aaarrrggg","arrgggg"}, 
--{{"1","2"},{"1","3"}}, 
--{{"1","2","3"},{"1","2","4"}}, 
----{{"auto","-","parser"},{"autoparser"}}, 
--{"auto-parser","autoparser"}, 
--{{"the","cat","ate","a","big","dinner"},{"the","cat","aet","a","big","lunch"}}, 
 
 
  testdata[x][1] = {"Some","40","-","50","ships","blocked","the","Hook","of","Holland",",","gateway","to","Rotterdam","port","."} 
  testdata[x][2] = {"More","than","100","vessels","blocked","the","entrances","to","Rotterdam",",","the","world's","biggest","port","."} 

So, i've done it, 14 years ago; it won't take too many computing resources.

6. Re: String comparisons

This is by Jeremy Cowgar

You count the number of steps to transform s1 to s2.

 
-- Levenshtein string distance calculation 
-- 
-- Author: 
--     Jeremy Cowgar < jeremy at cowgar dot com > 
-- 
-- Created: 
--     August 16, 2008 
-- 
-- Updated: 
--     August 16, 2008 
-- 
-- Copyright: 
--     This file is in the public domain 
-- 
 
include std/math.e 
 
--** 
-- Computes the minimum number of operations needed to transform ##s1## into ##s2## 
-- 
-- Example: 
--    
--   ? levenshtein("day", "may") -- 1 
--   ? levenshtein("cat", "dog") -- 3 
--    
-- 
--   Now, those are easy examples, but take this one for example: 
-- 
--    
--   ? levenshtein("Saturday", "Sunday")  -- 3 
--    
-- 
--   Yes, 3. Delete "a" and "t" in Saturday, (Surday) then change the "r" to a 
--   "n" and you have Sunday! 
-- 
-- See Also: 
--   http://en.wikipedia.org/wiki/Levenshtein_distance 
 
export function levenshtein(sequence s1, sequence s2) 
    integer n = length(s1) + 1, m = length(s2) + 1 
 
    if n = 1  then 
        return m-1 
    elsif m = 1 then 
        return n-1 
    end if 
 
    sequence d = repeat(repeat(0, m), n) 
    for i = 1 to n do 
        d[i][1] = i-1 
    end for 
 
    for j = 1 to m do 
        d[1][j] = j-1 
    end for 
 
    for i = 2 to n do 
        for j = 2 to m do 
            d[i][j] = min({ 
                d[i-1][j] + 1, 
                d[i][j-1] + 1, 
                d[i-1][j-1] + (s1[i-1] != s2[j-1]) 
            }) 
        end for 
    end for 
 
    return d[n][m] 
end function 

_tom

7. Re: String comparisons

katsmeow said...
jmduro said...

This is a simple way to do the job. Something more complex would require too much computing capacity.

Not true. My WRDCOMP4.E is dated May 14 2001, and SEQCOMP5.E is dated Jan 20 2003. If i remember right, neither got contributed because of the significant pushback against Eu containing strings ... Sample test for the code included the following:

I wasn't on the development team back then but the idea of Eu not containing strings is rather weird. How many programs in The Archive do not contain strings? Probably zero.

8. Re: String comparisons

SDPringle said...

I wasn't on the development team back then but the idea of Eu not containing strings is rather weird. How many programs in The Archive do not contain strings? Probably zero.

It was weird by today's standards, but people kept saying Eu is about atoms and sequences, not strings. The same fights existed over my strtok lib back then. Advancing Eu has not been easy, and stopped being fun years ago.

Written in the way back of time, here is the most diplomatic discussion of strings vs sequences ever written in all of Eu: http://openeuphoria.org/forum/49769.wc . As you can imagine, it also got very intense and aggressively personal at times. Some people just could not see that sequences could be used in different ways, just like clothes can, plants can, etc.

9. Re: String comparisons

This functionality has come up about every 2 or 3 years. Perhaps the time has come that it could be part of OE (not my version, of course!).

There are three versions of this sort of code mentioned in http://openeuphoria.org/forum/m/49684.wc

It was DCuny's code i hacked on for several years, running it against megabytes of news and irc text, and "a few" webpages. I found some odd rare gaps in resyncing, and patched those. Sped up the code by adding "memory", so it could delete duplicate parses as they got started.

In all this searching, i also found my Soundex-Metaphone lib, and made a more accurate version called Tiggrphone.e. Also not contributed, but very tested.

10. Re: String comparisons

CoJaBo3 said...
katsmeow said...

Also not contributed, but very tested.

Why bother mentioning it if you aren't going to contribute it?

It was, and still is, available to those who ask me for it. When i contributed strtok, it generated too much negative feedback. It also contained two of the functions i was most proud of: getxml() and sorttok(). Getxml parsed the word tokens of html, sgml, xml, and anything custom, while sorttok ordered nested sequences of strings of unlimited depth, sequentially, by columns, as a database would.

11. Re: String comparisons

Thank you for sharing, Kat. It's useful just to see your ideas.

12. Re: String comparisons

CoJaBo3 said...
apeto3 said...

Play *nice*, CoJaBo3

I am. It's a fair question. I see no real reason to allow someone to talk about how great they are. On another person's thread. Which is a request for help.

Much like i did not complain about you telling us what you "see" in a thread asking for help comparing strings, and your opinion of my intent.

13. Re: String comparisons

katsmeow said...

It was, and still is, available to those who ask me for it.

I am asking for it.

14. Re: String comparisons

katsmeow said...

Much like i did not complain about you telling us what you "see" in a thread asking for help comparing strings, and your opinion of my intent.

Go and do it then. What the hell are you waiting for?

15. Re: String comparisons

CoJaBo3 said...
katsmeow said...

It was, and still is, available to those who ask me for it.

I am asking for it.

I sent it to the only email address i have for you. Enjoy!

16. Re: String comparisons

CoJaBo3 said...
katsmeow said...

Much like i did not complain about you telling us what you "see" in a thread asking for help comparing strings, and your opinion of my intent.

Go and do it then. What the hell are you waiting for?

This is incredible. Tom?

17. Re: String comparisons

_tom said...
katsmeow said...

Not true. My WRDCOMP4.E is dated May 14 2001, and SEQCOMP5.E is dated Jan 20 2003. If i remember right, neither got contributed because of the significant pushback against Eu containing strings, so please consider that my fear of posting the code still exists. I still have the .e files, but not the test code, and not the code that actually used this day to day.

...

If you still have these files can you please share them with us?

If you understand that i stopped almost all work on OE in 2005(?), and now have no way to run anything (except maybe Phix). I can send you files, no readme, and no guarantees. I thought i had deleted these files; i still do not see the test files, or the code that actually used these libs. To where do i send a zip of them?

18. Re: String comparisons

katsmeow said...
_tom said...
katsmeow said...

Not true. My WRDCOMP4.E is dated May 14 2001, and SEQCOMP5.E is dated Jan 20 2003. If i remember right, neither got contributed because of the significant pushback against Eu containing strings, so please consider that my fear of posting the code still exists. I still have the .e files, but not the test code, and not the code that actually used this day to day.

...

If you still have these files can you please share them with us?

If you understand that i stopped almost all work on OE in 2005(?), and now have no way to run anything (except maybe Phix). I can send you files, no readme, and no guarantees. I thought i had deleted these files; i still do not see the test files, or the code that actually used these libs. To where do i send a zip of them?

At the moment there are two options: Pete's wiki http://phix.x10.mx/pmwiki/pmwiki.php?n=Main.HomePage, and / or openeuphoria @ gmail.com. Please put in a quick note about usage permissions so everyone's clear about what they can and cannot do with your work.

Cheers

Chris

19. Re: String comparisons

Chris, i am really sure my desires, permissions, and donations go nowhere, mean nothing. I wasted my time listing out my RDS archives for you, no one asked for any of it, and when it was mentioned who was working towards a new archive, i was not mentioned.

I sent a zip to Tom, at his email address. I believe you will get a copy, and i will be suitably punished for the donation. I figure i should not read Euphorum for a few weeks.

20. Re: String comparisons

Hi Kat

If you mean EuDirList.zip, then they are preserved in your personal folder on the openeuphoria gmail account, and I am grateful for the effort you put in. However, this was happening at the same time that Greg (and subsequently me) scraped the wayback machine, so there was no requirement to ask you for any archived files that you may have had. I assumed (perhaps erroneously) that the offer still stood, and that in the unfortunate case that all the other copies of the archive were lost we could call on you to supply what you may have had.

My apologies for not making the permissions issue clearer too - if you post to a public place (Pete's Pecan for instance), then one can assume that they are available to use under common sense terms, but if you post to the gmail account, it is not immediately obvious that they are for widespread dissemination, so a quick note to say that they are is always appreciated.

On the subject of your other contributions - I have always found strtok a very useful addition to the toolset that I use, no matter what other people say about strings and sequences anyway - the end result is always the same, I don't care as long as it works and gets the job done. Unfortunately I have had no need of the other libraries you mentioned, so I cannot comment on their usefulness. But if you look at the history of other Eu contributors, then there is often very little feedback. This in itself is not necessarily a bad thing. When problems arise people are often very vocal about expressing their concerns / dismay / anger whatever at the issue, but when everything is good, even if the thing helps improve their lives enormously, then they will often say very little, if anything about it. I give as an example Irv's GTK toolkit - he has often threatened to stop development of it, or post updates, because no one gives him any feedback, and yet I am fairly confident that it is fairly well used by a (perhaps small) number of users. This is human nature, negative feedback is always louder than positive feedback. An example outside of Eu would be human MMR vaccinations - vocal negative feedback from a small number of adverse reactions far outweighed the positive reaction from the vast majority of babies who benefited from being protected against the diseases, resulting in a large reduction in the uptake of the vaccine, until the results of not being vaccinated began to become apparent, at which point the negative feedback from the results of not being vaccinated began to appear (some parents actually blamed the doctors for not advocating the vaccine strongly enough).

I'm going to end on an apology for this diatribe on this thread, but I am sure that you will not be punished for offering those two files for our perusal - who knows, even I might find them useful!

Cheers

Chris

21. Re: String comparisons

katsmeow said...

... I still have the .e files, but not the test code, and not the code that actually used this day to day.

Thanks Kat, I got the source for your files.

I have only tested the `diff.ex` example. It is found here:

The remaining files are here:

These are rescued files; that means they were written in Eu3, they have not been used for a long time, and may be incomplete.

I thank Kat for taking the effort to locate them and make them available.

Authorship goes to Kat

_tom

22. Re: String comparisons

_tom said...

I have only tested the `diff.ex` example. It is found here:

I thank Kat for taking the effort to locate them and make them available.

Authorship goes to Kat

Very cool. Thank you.

23. Re: String comparisons

jmduro said...

You may split both lines, do exact matches per word and use the the number of exact matches as a score.

This is a simple way to do the job. Something more complex would require too much computing capacity.

Jean-Marc

Yes, very good. I think the multi-comparison is a good idea. I am trying to avoid doing a cross-correlation calculation if possible.

Thx, jd

24. Re: String comparisons

_tom said...

This is by Jeremy Cowgar

You count the number of steps to transform s1 to s2.

_tom



Yes, I have used the Levenshtein distance in other projects. The problem I had was how to set the match criterion, as there are many versions for string 2 which will yield the same Levenshtein distance; however, in my current case this isn't important. I think it will work.
Thanks,
jd
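
One possible way to turn the distance into a match criterion is to normalize it by the longer string's length, so that a single threshold can be applied to strings of any size. The sketch below is an assumption about how that could look, not a method anyone in the thread proposed; edit_distance and levenshtein_similarity are hypothetical names, and edit_distance is a two-row variant of the same recurrence as the levenshtein routine posted earlier.

```euphoria
-- Hypothetical two-row Levenshtein distance (same recurrence as the
-- routine quoted earlier in the thread, but keeping only two rows).
function edit_distance(sequence s1, sequence s2)
    sequence prev = repeat(0, length(s2) + 1)
    sequence curr
    integer cost, best
    for j = 1 to length(s2) + 1 do
        prev[j] = j - 1
    end for
    for i = 1 to length(s1) do
        curr = repeat(0, length(s2) + 1)
        curr[1] = i
        for j = 1 to length(s2) do
            cost = not equal(s1[i], s2[j])
            best = prev[j] + cost                -- substitute or match
            if prev[j + 1] + 1 < best then       -- delete from s1
                best = prev[j + 1] + 1
            end if
            if curr[j] + 1 < best then           -- insert into s1
                best = curr[j] + 1
            end if
            curr[j + 1] = best
        end for
        prev = curr
    end for
    return prev[length(s2) + 1]
end function

-- 1 means identical, 0 means every position had to be edited.
function levenshtein_similarity(sequence s1, sequence s2)
    integer longest = length(s1)
    if length(s2) > longest then
        longest = length(s2)
    end if
    if longest = 0 then
        return 1
    end if
    return 1 - edit_distance(s1, s2) / longest
end function

? levenshtein_similarity("Saturday", "Sunday")   -- 1 - 3/8 = 0.625
```

A caller would then accept a pair when the similarity is above some chosen cutoff. Where to set that cutoff depends on the data and would need to be tuned.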

25. Re: String comparisons

SDPringle said...

Another way of looking at being close matches is the longest common string.

Another interesting idea. If I understand it, it could take significant computing time; however, I'll give it a try and report back.

thanks, jd

26. Re: String comparisons

This appears to be an elegant solution to my problem. I will have to translate it and run a test or 2.

Many thanks,
jd

27. Re: String comparisons

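There is a function in std/sequence.e, sim_index(). Judging by the results below, it returns 0 for identical sequences and larger values for less similar ones: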

? sim_index("sit",      "sin")      --> 0.08784 
? sim_index("sit",      "sat")      --> 0.32394 
? sim_index("sit",      "skit")     --> 0.34324 
? sim_index("sit",      "its")      --> 0.68293 
? sim_index("sit",      "kit")      --> 0.86603 
 
? sim_index("knitting", "knitting") --> 0.00000 
? sim_index("kitting",  "kitten")   --> 0.09068 
? sim_index("knitting", "knotting") --> 0.27717 
? sim_index("knitting", "kitten")   --> 0.35332 
? sim_index("abacus","zoological")  --> 0.76304 

_tom

28. Re: String comparisons

_tom said...

There is a function in std/sequence.e

_tom

This is a very interesting function. The head weighting might cause a problem: it makes it more difficult to assign a threshold. I will investigate.

Thanks much.

Regards,
jd
