Re: Robert...EDS - questons/comments
- Posted by Kat <gertie at PELL.NET> Feb 13, 2001
On 13 Feb 2001, at 8:25, SR.Williamson wrote:

> Darn it, Kat, you went and made my brain work (such as it is) this
> morning.

Glad to be of service.

> Many many years ago, when I was taking some Operations Research classes,
> we talked about search methods, especially search that may or may not
> have an end point. I don't remember much, unfortunately.
>
> You probably know more about this than I do, and if not, *somebody* here
> does, since I'm sure it's in some 200-level undergrad course somewhere.
> Anyway, one thing we talked about is narrowing the scope of the search
> with assumptions. For example, if the word starts with a consonant as
> spelled, assume it really starts with a consonant. Secondly, assume that
> beginning consonants are phonetic clones or adjacent on the keyboard.
> Right there you've slashed your search quite a bit.

Well, not really, because you are making an assumption with a degree of
correctness, possibly a correctness factor of zero. If you have run down
the phonetic clones list and found a possible match, you could still be
in error, and should check keyboard layouts, different languages,
personal typoing habits, regional spelling habits, etc. And the typoed
word may not be in the list you have. So you start a new search with a
new assumption, repeatedly, until you get a match you can accept, which
still may not be correct, and you will never know, because you didn't
run the other algorithms and lists.

> Another option might be to use heuristics. Replace a letter at random
> from a set of rules like those above, compare it to a dictionary of
> real words, and if it fits, go on. If not, try another letter. This
> can be optimized by deciding the letter to be replaced based on a
> table of probabilities of letter consecutiveness and word length. For
> example, the most common English word is "the".
> A combination that is 3 letters and has "he" at the end is highly
> likely to be a "t", and there are a limited number of other
> possibilities like "then", "they", "thee", "she", "he", etc.

Are you speaking of using the suffix tree? The problem, as I see it, is
that eventually you will have every letter of the alphabet
followed/preceded by every other letter/sequence, so everything matches
to some extent, somewhere, even if it's wrong. Finding a word already in
the tree is easy, but then so is a binary search in a list. I expect
you'll start with the unknown, search a sequence tree for every
subsequence contained in the unknown, and end up with a multitude of
results in the form Graeme's diff() code outputs? Ack! (No offence to
Graeme, it's just a case of different solutions for different problems.)

> Of course, I'm only adding to the processor overhead, but it seems to
> me that there is some breakeven point where the smaller dictionary size
> makes up for the fact that the processor has to choose dictionaries and
> do multiple guesses. Like I said, I'm sure you know more about this
> than I do though.

There are several methods for reducing the search scope, such as
knowledge (medical, physics) domain limits, language (English, Spanish)
limits, playing dumb (if the word isn't known, ignore it, or ask about
it, or change the topic), etc., but I do not wish to impose limits,
especially limits of understanding the words, even typoed words.

Kat
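[Editor's note: the substitute-and-check heuristic discussed above can be
sketched in a few lines. This is a minimal illustration only: the
keyboard-adjacency map is a partial, assumed QWERTY neighbour table, and
the dictionary is a toy sample, not any list from the thread.]

```python
# Sketch of the heuristic: for each position in the typoed word, try
# keyboard-adjacent substitutions and keep any candidate that appears
# in the dictionary. Adjacency map and word list are illustrative.

ADJACENT = {  # partial QWERTY neighbours (assumption, not exhaustive)
    't': 'rgy', 'r': 'etf', 'h': 'gjn', 'e': 'wrd',
    's': 'adw', 'n': 'bmh', 'y': 'tuh',
}

DICTIONARY = {"the", "then", "they", "thee", "she", "he", "tree"}

def adjacent_candidates(word, dictionary=DICTIONARY):
    """Return dictionary words reachable by one adjacent-key substitution."""
    hits = set()
    if word in dictionary:  # already a real word
        hits.add(word)
    for i, ch in enumerate(word):
        for sub in ADJACENT.get(ch, ''):
            candidate = word[:i] + sub + word[i + 1:]
            if candidate in dictionary:
                hits.add(candidate)
    return hits

print(adjacent_candidates("rhe"))  # 'r' sits next to 't' -> {'the'}
```

As Kat notes, the assumption can have a "correctness factor of zero": if
the typo was not an adjacent-key slip, this pass finds nothing and a
different rule set has to be tried.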
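[Editor's note: the "table of probabilities" idea, with the `?he` example
from the quoted text, might look like this. The frequency numbers are
made up for illustration; a real system would use corpus counts.]

```python
# Fill the unknown position in a pattern like "?he" and rank the
# matching dictionary words by a frequency table, most common first.

import re

FREQUENCY = {  # invented illustrative counts, not real corpus data
    "the": 1000, "they": 200, "he": 300,
    "she": 120, "then": 90, "thee": 5,
}

def rank_pattern(pattern, frequency=FREQUENCY):
    """Return words matching the pattern ('?' = any one letter),
    most frequent first."""
    regex = re.compile('^' + pattern.replace('?', '.') + '$')
    matches = [w for w in frequency if regex.match(w)]
    return sorted(matches, key=frequency.get, reverse=True)

print(rank_pattern("?he"))  # ['the', 'she'] -- so "t" is the best guess
```

This also shows Kat's objection in miniature: a looser pattern matches
more and more of the dictionary, so without a ranking everything matches
something, somewhere.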