Re: Robert...EDS - questions/comments
- Posted by Kat <gertie at PELL.NET> Feb 12, 2001
On 12 Feb 2001, at 19:14, Tony Bucholtz wrote:

> G'day all
>
> >...Kat wrote...
> >If one assumes every word in the sentence is a typo, then every word must
> >be compared to find a tree of possible correct words for the entire
> >sentence. Anything else is throwing away info.
>
> There's a thing called a "suffix tree" that apparently is very useful for
> text searching and other things. I've only just started reading up on them,
> so I'm no expert, but the basic idea is to build a tree of strings (and
> substrings) and then search the tree for matching words and part-words... I
> think that's right...

No, that's an entirely different animal. I have a list of suffixes, prefixes and infixes, and a list of rules for adding them to roots and removing them from roots.

The problem is if someone says they saw a xebra at the zoo. Naturally, if this is the first time the word "xebra" is seen, it will not be found in the dictionary. Furthermore, the rule of thumb that the first letter of the misspelled word is correct is wrong in this case. To top it off, "xebra" and the closest match and correct spelling "zebra" is at the end of the dictionary, and you can't do anything about it, you cannot assume the rest of the word is error free either.

That last sentence is 39 words long. You can assume every 3rd line on IRC contains a misspelling, and some web pages aren't much better, including news pages of multinational news organizations like Reuters and AP. At 82 seconds per word, that sentence would take nearly an hour to "understand", a totally unacceptable time. Even running hand-tailored machine code would be too slow with those 39 words: memory accesses to run thru the list of words would take 3.5 secs with 60ns memory, and the code to execute has to be fetched and exec'd also, with no interruptions by the OS, meaning no OS more complex than DOS. Plus there is the time to rate the likely comparisons and see if those found words work in the sentence.
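To make the xebra example above concrete, here is a minimal sketch in Python (not the language or the code used in this thread). It compares a misspelled word against every entry in a word list using plain Levenshtein edit distance, once as a full scan and once with the "first letter is correct" shortcut. The word list `WORDS` and the function names `edit_distance` and `closest_words` are hypothetical, chosen only for illustration, and edit distance is just one possible way to rate candidates.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-letter
    insertions, deletions and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_words(word, dictionary, same_first_letter=False):
    """Return (best_distance, matches) for the nearest dictionary entries.

    With same_first_letter=True only entries sharing the first letter
    are compared, which is the rule of thumb discussed above.
    """
    candidates = [w for w in dictionary
                  if not same_first_letter or w[:1] == word[:1]]
    if not candidates:
        return None, []
    scored = [(edit_distance(word, w), w) for w in candidates]
    best = min(d for d, _ in scored)
    return best, sorted(w for d, w in scored if d == best)


# Hypothetical toy dictionary; a real list would be vastly longer.
WORDS = ["apple", "cobra", "xenon", "xerox", "zebra", "zero", "zoo"]

full = closest_words("xebra", WORDS)
shortcut = closest_words("xebra", WORDS, same_first_letter=True)
print(full)      # full scan of every entry
print(shortcut)  # only entries starting with "x"
```

With this toy list the full scan finds "zebra" at distance 1, while the first-letter shortcut can only offer "xenon" and "xerox" at distance 3. The price of the full scan is that every word in the list is touched for every word in the sentence. For scale, the 82 seconds per word quoted above works out to 39 × 82 = 3198 seconds, about 53 minutes, for the 39-word sentence.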
Tiggr gave me a dump of unknown words she has seen one day; it was 150,000 words, more than the total of used unique known words. Most contained more than one error, of omision, adddition, letter sawpping, or were speld foneticly,, so any one method of correction would fail.

So what I am considering is the approach used to break encryption keys really fast: hardware. I can do digital hardware: a card to plug into an EISA slot, stuff it with the word list, and then shoot it the sentence and tell it to proof it. With no software overhead, the 3.5 sec can be cut with parallel accesses and wide data feeds. The alternative is throwing puters at it on a LAN, which I cannot afford.

Kat