1. Robert...EDS - questions/comments
- Posted by Tony Bucholtz <tony_bucholtz at HOTMAIL.COM> Feb 12, 2001
- 398 views
- Last edited Feb 13, 2001
G'day all

> ...Kat wrote...
> If one assumes every word in the sentence is a typo, then every word must
> be compared to find a tree of possible correct words for the entire
> sentence. Anything else is throwing away info.

There's a thing called a "suffix tree" that apparently is very useful for text searching and other things. I've only just started reading up on them, so I'm no expert, but the basic idea is to build a tree of strings (and substrings) and then search the tree for matching words and part-words... I think that's right...

Anyway, fire up your search engine and see if suffix trees are any use to you.

Regards
Tony
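For concreteness, here is a minimal sketch of the idea Tony describes, written in Python. It uses a naive "suffix trie" (insert every suffix of every word), not a true linear-time suffix tree; the `SuffixTrie` class and the toy word list are invented for the example.

```python
# Naive "suffix trie": insert every suffix of every dictionary word,
# so substring lookups become simple walks from the root. A toy
# illustration of the idea, not a real (linear-time) suffix tree.

class SuffixTrie:
    def __init__(self):
        self.root = {}

    def add_word(self, word):
        # Insert every suffix of `word`, tagging each node on the
        # path with the words that pass through it.
        for i in range(len(word)):
            node = self.root
            for ch in word[i:]:
                node = node.setdefault(ch, {})
                node.setdefault('$words', set()).add(word)

    def words_containing(self, fragment):
        # Walk the trie along `fragment`; the words recorded at the
        # final node all contain `fragment` as a substring.
        node = self.root
        for ch in fragment:
            if ch not in node:
                return set()
            node = node[ch]
        return node.get('$words', set())

trie = SuffixTrie()
for w in ["zebra", "algebra", "then", "they"]:
    trie.add_word(w)

print(trie.words_containing("ebra"))  # {'algebra', 'zebra'} (order may vary)
print(trie.words_containing("he"))    # {'then', 'they'}
```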
2. Re: Robert...EDS - questions/comments
- Posted by Kat <gertie at PELL.NET> Feb 12, 2001
- 427 views
- Last edited Feb 13, 2001
On 12 Feb 2001, at 19:14, Tony Bucholtz wrote:

> G'day all
>
> > ...Kat wrote...
> > If one assumes every word in the sentence is a typo, then every word must
> > be compared to find a tree of possible correct words for the entire
> > sentence. Anything else is throwing away info.
>
> There's a thing called a "suffix tree" that apparently is very useful for
> text searching and other things. I've only just started reading up on them,
> so I'm no expert, but the basic idea is to build a tree of strings (and
> substrings) and then search the tree for matching words and part-words... I
> think that's right...

No, that's an entirely different animal; i have a list of suffixes and prefixes and infixes, and a list of rules for adding them to roots and removing them from roots.

The problem is if someone says they saw a xebra at the zoo. Naturally, if this is the first time the word "xebra" is seen, it will not be found in the dictionary. Furthermore, the rule of thumb that the first letter of the misspelled word is correct is wrong in this case. To top it off, "xebra" and the closest match and correct spelling "zebra" is at the end of the dictionary, and you can't do anything about it; you cannot assume the rest of the word is error free either.

That last sentence is 39 words long. You can assume every 3rd line on irc contains a misspelling, and some web pages aren't much better, including news pages of multinational news organizations like Reuters and AP. At 82 seconds per word, that sentence would take nearly an hour (39 x 82 seconds is about 53 minutes) to "understand", a totally unacceptable time. Even running hand-tailored machine code would be too slow with those 39 words: memory accesses to run thru the list of words would take 3.5 secs with 60ns memory, and the code to execute has to be fetched and exec'd also, with no interruptions by the OS, meaning no OS more complex than dos. Plus the time to rate the likely comparisons and see if those found words work in the sentence.

Tiggr gave me a dump of unknown words she has seen one day; it was 150,000 words, more than the total of used unique known words. Most contained more than one error, of omision, adddition, letter sawpping, or were speld foneticly,, so any one method of correction would fail.

So what i am considering is the approach used to break encryption keys really fast: hardware. I can do digital hardware: a card to plug into an eisa slot, stuff it with the word list, and then shoot it the sentence and tell it to proof it. With no software overhead, the 3.5 sec can be cut with parallel accesses and wide data feeds. The alternative is throwing puters at it on a lan, which i cannot afford.

Kat
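The cost Kat describes comes from brute force: comparing the unknown word against every entry in the word list. A small Python sketch of that scan, using the standard Levenshtein edit distance (the toy dictionary and the `max_dist` cutoff are assumptions made for the example), shows how "xebra" still finds "zebra" even though the first letter is wrong:

```python
# Brute-force candidate search of the kind Kat is costing out: compare
# the unknown word against every dictionary entry with an edit-distance
# metric. "xebra" finds "zebra" despite the wrong first letter, but only
# by scanning the whole list, which is exactly why the timing blows up.

def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def candidates(unknown, dictionary, max_dist=2):
    # Rank every dictionary word within max_dist edits of the unknown word.
    scored = [(levenshtein(unknown, w), w) for w in dictionary]
    return sorted((d, w) for d, w in scored if d <= max_dist)

dictionary = ["apple", "zebra", "zero", "then", "they", "zoo"]
print(candidates("xebra", dictionary))  # [(1, 'zebra')]
```

Every unknown word forces a full pass over the dictionary, which is why the post turns to hardware parallelism to cut the memory-access time.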
3. Re: Robert...EDS - questions/comments
- Posted by "SR.Williamson" <writeneu at hotmail.com> Feb 13, 2001
- 390 views
Darn it, Kat, you went and made my brain work (such as it is) this morning.

Many, many years ago, when I was taking some Operations Research classes, we talked about search methods, especially search that may or may not have an end point. I don't remember much, unfortunately. You probably know more about this than I do, and if not, *somebody* here does, since I'm sure it's in some 200-level undergrad course somewhere.

Anyway, one thing we talked about is narrowing the scope of the search with assumptions. For example, if the word starts with a consonant as spelled, assume it really starts with a consonant. Secondly, assume that beginning consonants are phonetic clones or adjacent on the keyboard. Right there you've slashed your search quite a bit.

Another option might be to use heuristics. Replace a letter at random, using a set of rules like those above, compare the result to a dictionary of real words, and if it fits, go on. If not, try another letter. This can be optimized by deciding which letter to replace based on a table of probabilities of letter consecutiveness and word length. For example, the most common English word is "the". A combination that is 3 letters and has "he" at the end is highly likely to start with a "t", and there are a limited number of other possibilities like "then", "they", "thee", "she", "he", etc.

Of course, I'm only adding to the processor overhead, but it seems to me that there is some break-even point where the smaller dictionary size makes up for the fact that the processor has to choose dictionaries and do multiple guesses. Like I said, I'm sure you know more about this than I do. It's been over 10 years since I did this stuff, so somebody tell me if I'm just wrong.
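A rough Python sketch of the letter-replacement heuristic described above, assuming a keyboard-adjacency rule; the `QWERTY_NEIGHBOURS` table is a tiny invented fragment of a real layout, not a complete one:

```python
# Single-letter heuristic: swap each letter for its keyboard neighbours
# (a stand-in for "phonetic clones or adjacent on the keyboard") and keep
# any variant that appears in the dictionary.

QWERTY_NEIGHBOURS = {
    "x": "zsdc", "z": "asx", "t": "rfgy", "h": "gjybn", "e": "wsdr",
}

def neighbour_variants(word):
    # Yield every word formed by replacing one letter with a keyboard neighbour.
    for i, ch in enumerate(word):
        for repl in QWERTY_NEIGHBOURS.get(ch, ""):
            yield word[:i] + repl + word[i + 1:]

def correct(word, dictionary):
    # Known words pass through; otherwise try one-letter neighbour swaps.
    if word in dictionary:
        return [word]
    return sorted(set(v for v in neighbour_variants(word) if v in dictionary))

dictionary = {"zebra", "the", "then", "they"}
print(correct("xebra", dictionary))  # ['zebra']  ('x' sits next to 'z')
```

This only generates a handful of candidates per word instead of scanning the whole dictionary, which is the trade-off the post is pointing at: more per-word logic in exchange for a much smaller search.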
4. Re: Robert...EDS - questions/comments
- Posted by Kat <gertie at PELL.NET> Feb 13, 2001
- 403 views
On 13 Feb 2001, at 8:25, SR.Williamson wrote:

> Darn it, Kat, you went and made my brain work (such as it is) this
> morning.

Glad to be of service.

> Many, many years ago, when I was taking some Operations Research classes,
> we talked about search methods, especially search that may or may not
> have an end point. I don't remember much, unfortunately.
>
> You probably know more about this than I do, and if not, *somebody* here
> does, since I'm sure it's in some 200-level undergrad course somewhere.
> Anyway, one thing we talked about is narrowing the scope of the search
> with assumptions. For example, if the word starts with a consonant as
> spelled, assume it really starts with a consonant. Secondly, assume that
> beginning consonants are phonetic clones or adjacent on the keyboard.
> Right there you've slashed your search quite a bit.

Well, not really, because you are making an assumption with a degree of correctness, possibly a correctness factor of zero. If you have run down the phonetic clones list and found a possible match, you could still be in error, and should check keybd layouts, different languages, personal typoing habits, regional spelling habits, etc. And the word which is typoed isn't in the list you have. So you start a new search with a new assumption, repeatedly, until you get a match you can accept, which still may not be correct, and you will never know, cause you didn't run the other algorithms and lists.

> Another option might be to use heuristics. Replace a letter at random,
> using a set of rules like those above, compare the result to a dictionary
> of real words, and if it fits, go on. If not, try another letter. This
> can be optimized by deciding which letter to replace based on a table of
> probabilities of letter consecutiveness and word length. For example,
> the most common English word is "the". A combination that is 3 letters
> and has "he" at the end is highly likely to start with a "t", and there
> are a limited number of other possibilities like "then", "they", "thee",
> "she", "he", etc.

Are you speaking of using the suffix tree? The problem, as i see it, is that eventually you will have every letter of the alphabet followed/preceded by every other letter/sequence, so everything matches to some extent, somewhere, even if it's wrong. Finding a word already in the tree is easy, but then so is a binary search in a list. I expect you'll start with the unknown, search a sequence tree for every subsequence contained in the unknown, and end up with a multitude of results in the form Graeme's diff() code outputs? Ack! (No offence to Graeme, it's just a case of different solutions for different problems.)

> Of course, I'm only adding to the processor overhead, but it seems to me
> that there is some break-even point where the smaller dictionary size
> makes up for the fact that the processor has to choose dictionaries and
> do multiple guesses. Like I said, I'm sure you know more about this than
> I do.

There are several methods for reducing the search scope, such as knowledge (medical, physics) domain limits, language (english, spanish) limits, playing dumb (if the word isn't known, ignore it, or ask about it, or change the topic), etc., but i do not wish to impose limits, especially limits of understanding the words, even typoed words.

Kat
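For reference, the binary search Kat mentions as the easy part is a few lines with Python's standard `bisect` module (the word list here is illustrative):

```python
# Exact-match lookup in a sorted word list via bisection. Fast for known
# words, but no help at all once the word is misspelled, which is the point.

from bisect import bisect_left

def in_dictionary(sorted_words, word):
    # Binary search: O(log n) comparisons against the sorted list.
    i = bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

words = sorted(["apple", "the", "they", "zebra", "zoo"])
print(in_dictionary(words, "zebra"))  # True
print(in_dictionary(words, "xebra"))  # False -- a typo simply misses
```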
5. Re: Robert...EDS - questions/comments
- Posted by Euman <euman at bellsouth.net> Feb 13, 2001
- 433 views
Check this out: http://www.mathtools.net/MATLAB/

Good info on statistical mapping / problem solving that may or may not help in some sort of way. It has a wealth of information on just about anything that could relate to programming.

Euman