1. Robert...EDS - questions/comments

G'day all

>...Kat wrote...
>If one assumes every word in the sentence is a typo, then every word must 
>be compared to find a tree of possible correct words for the entire 
>sentence. Anything else is throwing away info.

There's a thing called a "suffix tree" that apparently is very useful for 
text searching and other things. I've only just started reading up on them, 
so I'm no expert, but the basic idea is to build a tree of strings (and 
substrings) and then search the tree for matching words and part-words... I 
think that's right...
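
Something like this, I think: a naive Python sketch that builds a trie of 
every suffix, so whole words and part-words can both be looked up. (A real 
suffix tree compresses the chains, but the idea is the same.)

def build_suffix_trie(words):
    # Insert every suffix of every word, one character per tree level.
    root = {}
    for word in words:
        for i in range(len(word)):
            node = root
            for ch in word[i:]:
                node = node.setdefault(ch, {})
    return root

def contains_part(root, part):
    # Walk down the tree; success means some stored word contains this part.
    node = root
    for ch in part:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie(["zebra", "then", "they"])
print(contains_part(trie, "ebr"))   # True: "ebr" occurs inside "zebra"
print(contains_part(trie, "xeb"))   # False: no stored word contains it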

Anyway, fire up your search engine and see if suffix trees are any use to 
you.

Regards
Tony


2. Re: Robert...EDS - questions/comments

On 12 Feb 2001, at 19:14, Tony Bucholtz wrote:

> G'day all
> 
> >...Kat wrote...
> >If one assumes every word in the sentence is a typo, then every word must 
> >be compared to find a tree of possible correct words for the entire 
> >sentence. Anything else is throwing away info.
> 
> There's a thing called a "suffix tree" that apparently is very useful for 
> text searching and other things. I've only just started reading up on them, 
> so I'm no expert, but the basic idea is to build a tree of strings (and 
> substrings) and then search the tree for matching words and part-words... I 
> think that's right...

No, that's an entirely different animal. I have a list of suffixes, 
prefixes, and infixes, and a list of rules for adding them to roots and 
removing them from roots.

The problem is if someone says they saw a xebra at the zoo. Naturally, if 
this is the first time the word "xebra" is seen, it will not be found in 
the dictionary. Furthermore, the rule of thumb that the first letter of the 
misspelled word is correct is wrong in this case. To top it off, "xebra" 
and its closest match and correct spelling, "zebra", are at the end of the 
dictionary, and you can't do anything about it; you cannot assume the rest 
of the word is error-free either. That last sentence is 39 words long; you 
can assume every 3rd line on IRC contains a misspelling, and some web pages 
aren't much better, including news pages of multinational news 
organizations like Reuters and AP.

At 82 seconds per word, that sentence would take nearly an hour to 
"understand", a totally unacceptable time. Even running hand-tailored 
machine code would be too slow with those 39 words: memory accesses to run 
thru the list of words would take 3.5 secs with 60ns memory, and the code 
to execute has to be fetched and exec'd also, with no interruptions by the 
OS, meaning no OS more complex than DOS. Plus the time to rate the likely 
comparisons and see if those found words work in the sentence.

Tiggr gave me a dump of unknown words she had seen one day; it was 150,000 
words, more than the total of used unique known words. Most contained more 
than one error, of omision, adddition, letter sawpping, or were speld 
foneticly, so any one method of correction would fail. So what I am 
considering is the approach used to break encryption keys really fast: 
hardware. I can do digital hardware: a card to plug into an EISA slot, 
stuff it with the word list, and then shoot it the sentence and tell it to 
proof it. With no software overhead, the 3.5 secs can be cut with parallel 
accesses and wide data feeds. The alternative is throwing puters at it on a 
LAN, which I cannot afford.
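
For concreteness, the brute-force software version of covering those error 
classes is to generate every one-error variant of a word and filter against 
the word list. A rough Python sketch (an illustration only, not Kat's 
actual method; it handles omission, addition, swapping, and wrong letters, 
but not phonetic spellings):

import string

def one_error_variants(word):
    # Every split of the word into a head and a tail.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes  = [a + b[1:] for a, b in splits if b]                 # omission
    swaps    = [a + b[1] + b[0] + b[2:]
                for a, b in splits if len(b) > 1]                  # swapping
    replaces = [a + c + b[1:] for a, b in splits if b
                for c in string.ascii_lowercase]                   # wrong letter
    inserts  = [a + c + b for a, b in splits
                for c in string.ascii_lowercase]                   # addition
    return set(deletes + swaps + replaces + inserts)

known_words = {"zebra", "the", "then"}
print(known_words & one_error_variants("xebra"))   # {'zebra'}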

Kat


3. Re: Robert...EDS - questions/comments

Darn it, Kat, you went and made my brain work (such as it is) this 
morning.

Many many years ago, when I was taking some Operations Research classes, 
we talked about search methods, especially search that may or may not 
have an end point. I don't remember much unfortunately.

You probably know more about this than I do, and if not, *somebody* here 
does, since I'm sure it's in some 200 level undergrad course somewhere. 
Anyway, one thing we talked about is narrowing the scope of the search 
with assumptions. For example, if the word starts with a consonant as 
spelled, assume it really starts with a consonant. Secondly, assume that 
beginning consonants are phonetic clones or adjacent on the keyboard. 
Right there you've slashed your search quite a bit. 
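
As a rough sketch of that narrowing step in Python (the tables below are 
made-up partial samples, not a real keyboard or phonetics map):

# Made-up partial sample tables, for illustration only.
KEYBOARD_NEIGHBOURS = {"x": "zcsd", "z": "xas", "t": "rfgy"}
PHONETIC_CLONES     = {"x": "z", "z": "xs", "c": "ks", "f": "ph"}

def candidate_first_letters(word):
    # Keep the typed first letter, plus its keyboard neighbours and
    # phonetic clones; only dictionary sections starting with one of
    # these need to be searched.
    first = word[0]
    candidates = {first}
    candidates.update(KEYBOARD_NEIGHBOURS.get(first, ""))
    candidates.update(PHONETIC_CLONES.get(first, ""))
    return candidates

print(candidate_first_letters("xebra"))   # includes 'z', so "zebra" stays in reach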

Another option might be to use heuristics. Replace a letter at random 
using a set of rules like those above, compare the result to a dictionary 
of real words, and if it fits, go on. If not, try another letter. This can 
be optimized by deciding which letter to replace based on a table of 
probabilities of letter consecutiveness and word length. For example, 
the most common English word is "the". A combination that is 3 letters 
and ends in "he" is highly likely to start with a "t", and there are a 
limited number of other possibilities like "then", "they", "thee", 
"she", "he", etc.

Of course, I'm only adding to the processor overhead, but it seems to me 
that there is some breakeven point where the smaller dictionary size 
makes up for the fact that the processor has to choose dictionaries and 
do multiple guesses. Like I said, I'm sure you know more about this than 
I do though.

It's been over 10 years on this stuff, so somebody tell me if I'm just 
wrong.

new topic     » goto parent     » topic index » view message » categorize

4. Re: Robert...EDS - questions/comments

On 13 Feb 2001, at 8:25, SR.Williamson wrote:

> Darn it, Kat, you went and made my brain work (such as it is) this 
> morning.

Glad to be of service. :)

> Many many years ago, when I was taking some Operations Research classes, 
> we talked about search methods, especially search that may or may not 
> have an end point. I don't remember much unfortunately.
> 
> You probably know more about this than I do, and if not, *somebody* here 
> does, since I'm sure it's in some 200 level undergrad course somewhere. 
> Anyway, one thing we talked about is narrowing the scope of the search 
> with assumptions. For example, if the word starts with a consonant as 
> spelled, assume it really starts with a consonant. Secondly, assume that 
> beginning consonants are phonetic clones or adjacent on the keyboard. 
> Right there you've slashed your search quite a bit. 

Well, not really, because you are making an assumption with a degree of 
correctness, possibly a correctness factor of zero. If you have run down 
the phonetic-clones list and found a possible match, you could still be in 
error, and should check keyboard layouts, different languages, personal 
typoing habits, regional spelling habits, etc. And the word which is typoed 
isn't in the list you have. So you start a new search with a new 
assumption, repeatedly, until you get a match you can accept, which still 
may not be correct, and you will never know, because you didn't run the 
other algorithms and lists.

> Another option might be to use heuristics. Replace a letter at random 
> using a set of rules like those above, compare the result to a dictionary 
> of real words, and if it fits, go on. If not, try another letter. This can 
> be optimized by deciding which letter to replace based on a table of 
> probabilities of letter consecutiveness and word length. For example, 
> the most common English word is "the". A combination that is 3 letters 
> and ends in "he" is highly likely to start with a "t", and there are a 
> limited number of other possibilities like "then", "they", "thee", 
> "she", "he", etc.

Are you speaking of using the suffix tree? The problem, as I see it, is 
that eventually you will have every letter of the alphabet 
followed/preceded by every other letter/sequence, so everything matches to 
some extent, somewhere, even if it's wrong. Finding a word already in the 
tree is easy, but then so is a binary search in a list. I expect you'll 
start with the unknown, search a sequence tree for every subsequence 
contained in the unknown, and end up with a multitude of results in the 
form Graeme's diff() code outputs? Ack! (No offence to Graeme, it's just a 
case of different solutions for different problems.)
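
The binary-search half of that comparison is easy to sketch in Python; the 
point is that exact lookups are cheap either way, and it's the inexact 
match that neither structure answers directly:

import bisect

def in_word_list(sorted_words, word):
    # Classic binary search via bisect: find the insertion point, then
    # check whether the word is actually there.
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

words = sorted(["aardvark", "the", "zebra"])
print(in_word_list(words, "zebra"))   # True
print(in_word_list(words, "xebra"))   # False: the typo is still unsolved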
 
> Of course, I'm only adding to the processor overhead, but it seems to me 
> that there is some breakeven point where the smaller dictionary size 
> makes up for the fact that the processor has to choose dictionaries and 
> do multiple guesses. Like I said, I'm sure you know more about this than 
> I do though.

There are several methods for reducing the search scope, such as knowledge 
(medical, physics) domain limits, language (English, Spanish) limits, 
playing dumb (if the word isn't known, ignore it, ask about it, or change 
the topic), etc., but I do not wish to impose limits, especially limits on 
understanding the words, even typoed words.
 
Kat


5. Re: Robert...EDS - questions/comments

Check this out.....

 http://www.mathtools.net/MATLAB/

Good info on statistical mapping / problem solving
that may or may not help in some sort of way.....

It has a wealth of information on just about anything
that could relate to programming......

Euman

----- Original Message -----
From: "Kat" <gertie at PELL.NET>
To: "EUforum" <EUforum at topica.com>
Sent: Tuesday, February 13, 2001 13:04
Subject: Re: Robert...EDS - questions/comments


