Re: Robert...EDS - questions/comments

On 12 Feb 2001, at 19:14, Tony Bucholtz wrote:

> G'day all
> 
> >...Kat wrote...
> >If one assumes every word in the sentence is a typo, then every word must 
> >be compared to find a tree of possible correct words for the entire 
> >sentence. Anything else is throwing away info.
> 
> There's a thing called a "suffix tree" that apparently is very useful for 
> text searching and other things. I've only just started reading up on them, 
> so I'm no expert, but the basic idea is to build a tree of strings (and 
> substrings) and then search the tree for matching words and part-words... I 
> think that's right...

No, that's an entirely different animal. I have a list of suffixes, prefixes,
and infixes, and a list of rules for adding them to roots and removing them
from roots.
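
To make the affix idea concrete, here is a minimal Python sketch of stripping prefixes and suffixes to get back to a known root. The rule tables, the root list, and the name `candidate_roots` are all made up for illustration; they are not the actual lists or rules, and infixes are left out to keep it short.

```python
# Hypothetical affix tables and roots, for illustration only.
SUFFIX_RULES = [
    ("ies", "y"),  # ponies -> pony
    ("ing", ""),   # seeing -> see
    ("ed", ""),    # walked -> walk
    ("s", ""),     # zebras -> zebra
]
PREFIXES = ["un", "re"]
ROOTS = {"zebra", "pony", "walk", "see", "do"}


def candidate_roots(word):
    """Yield (root, prefix, suffix) for every split of word with a known root."""
    # Always try the word with no prefix removed, then each matching prefix.
    prefixes = [""] + [p for p in PREFIXES if word.startswith(p)]
    for prefix in prefixes:
        stem = word[len(prefix):]
        if stem in ROOTS:
            yield stem, prefix, ""
        for suffix, replacement in SUFFIX_RULES:
            if stem.endswith(suffix) and len(stem) > len(suffix):
                # Undo the suffix rule: drop the suffix, restore the root ending.
                root = stem[: len(stem) - len(suffix)] + replacement
                if root in ROOTS:
                    yield root, prefix, suffix


if __name__ == "__main__":
    for w in ["zebras", "ponies", "redo", "xebras"]:
        print(w, list(candidate_roots(w)))
```

Note that a misspelled root such as "xebras" yields no candidates at all, so affix rules on their own do not help with typos in the root.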

The problem is if someone says they saw a xebra at the zoo. Naturally, if this
is the first time the word "xebra" is seen, it will not be found in the
dictionary. Furthermore, the rule of thumb that the first letter of a
misspelled word is correct is wrong in this case. To top it off, "xebra" and
its closest match and correct spelling, "zebra", are at the end of the
dictionary, and you can't do anything about it; you cannot assume the rest of
the word is error-free either.

That last sentence is 39 words long. You can assume every 3rd line on IRC
contains a misspelling, and some web pages aren't much better, including news
pages of multinational news organizations like Reuters and AP. At 82 seconds
per word, that sentence would take nearly an hour to "understand", a totally
unacceptable time. Even hand-tailored machine code would be too slow with
those 39 words: the memory accesses to run through the list of words would
take 3.5 secs with 60ns memory, and the code to execute has to be fetched and
exec'd also, with no interruptions by the OS, meaning no OS more complex than
DOS. Plus there is the time to rate the likely comparisons and see if those
found words work in the sentence.

Tiggr gave me a dump of unknown words she has seen one day; it was 150,000
words, more than the total of used unique known words. Most contained more
than one error, of omision, adddition, letter sawpping, or were speld
foneticly,, so any one method of correction would fail.

So what I am considering is the approach used to break encryption keys really
fast: hardware. I can do digital hardware: a card to plug into an EISA slot,
stuff it with the word list, and then shoot it the sentence and tell it to
proof it. With no software overhead, the 3.5 sec can be cut with parallel
accesses and wide data feeds. The alternative is throwing computers at it on a
LAN, which I cannot afford.
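
For anyone who wants to see the brute-force comparison in code, here is a rough Python sketch. It is not the method described above, just one common way to score errors of omission, addition, substitution, and adjacent-letter swapping at once: an edit distance of the "optimal string alignment" kind. It compares the unknown word against every dictionary entry and makes no assumption about the first letter. The names `osa_distance` and `closest_words`, the tiny word list, and the `max_distance` cutoff are all assumptions for illustration.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and swaps of adjacent letters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every letter of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every letter of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # letter omitted from b
                d[i][j - 1] + 1,         # letter added in b
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Two adjacent letters swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def closest_words(word, dictionary, max_distance=2):
    """Score word against EVERY entry; no first-letter shortcut."""
    scored = [(osa_distance(word, w), w) for w in dictionary]
    return sorted((dist, w) for dist, w in scored if dist <= max_distance)


# Hypothetical miniature dictionary.
WORDS = ["zebra", "zero", "extra", "cobra", "swapping", "addition"]

if __name__ == "__main__":
    print(closest_words("xebra", WORDS))
```

Each comparison fills a table of roughly len(a) x len(b) cells, and it is repeated for every dictionary entry and every word in the sentence, which is the cost that makes the software route so slow. It also does nothing for words spelled phonetically.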

Kat
