Re: Spell checker optimization


OK, here's what I would try...
Take advantage of the soundex routines provided by Mr. Sephton.
It only takes a few modifications to turn them into a good spelling
dictionary generator that emits a file that can be accessed
efficiently. First, modify create.ex to include the words
in the same file with their soundex codes, e.g.:

A250 AACHEN
A631 AARDVARK
A650 AARON
A100 ABABA
A120 ABACK
A120 ABACUS
A120 ABACUSES
A130 ABAFT
A145 ABALONE
  .
  .
  .
etc
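
Roughly, the modified create.ex step could look like the sketch below.
This is just an illustration: the soundex() here is a plain textbook
version that reproduces the codes in the listing above (the real
Sephton routines may differ in edge cases), and the file names are
made up.

-- sketch: simple soundex plus the loop that writes "CODE WORD" lines;
-- assumes the word list has one word per line, letters only
include wildcard.e   -- upper()

constant CODES = "01230120022455012623010202"  -- one digit per letter A..Z

function soundex(sequence word)
    sequence result
    integer digit, last
    word = upper(word)
    result = {word[1]}
    last = CODES[word[1] - 'A' + 1]
    for i = 2 to length(word) do
        if word[i] >= 'A' and word[i] <= 'Z' then
            digit = CODES[word[i] - 'A' + 1]
            if digit != '0' then
                if digit != last then
                    result = append(result, digit)
                end if
                last = digit
            end if
        end if
    end for
    while length(result) < 4 do
        result = append(result, '0')
    end while
    return result[1..4]
end function

procedure create_codes(sequence infile, sequence outfile)
    integer fin, fout
    object line
    fin = open(infile, "r")
    fout = open(outfile, "w")
    while 1 do
        line = gets(fin)
        if atom(line) then
            exit                -- end of file
        end if
        if length(line) and line[length(line)] = '\n' then
            line = line[1..length(line)-1]
        end if
        if length(line) then
            printf(fout, "%s %s\n", {soundex(line), upper(line)})
        end if
    end while
    close(fin)
    close(fout)
end procedure

create_codes("words.txt", "codes.txt")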

Now, you need to use a sort program to put them in order by their
codes. DOS's sort.exe won't do it because it can't handle the
memory requirements, but I think I recall seeing a Euphoria program
somewhere that can.

A100 ABABA
A120 ABACK
A120 ABACUS
A120 ABACUSES
A130 ABAFT
A145 ABALONE
A250 AACHEN
A631 AARDVARK
A650 AARON
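
If nothing turns up, a small Euphoria sort is easy to write with sort()
from sort.e. The sketch below (file names made up) loads the whole file
into memory, so it only beats sort.exe's limits as far as your own RAM
allows; the default sort() compares sequences left to right, so sorting
whole lines orders them by code.

include sort.e

function read_lines(sequence filename)
    integer fn
    object line
    sequence lines
    fn = open(filename, "r")
    lines = {}
    while 1 do
        line = gets(fn)
        if atom(line) then
            exit
        end if
        lines = append(lines, line)   -- keep the trailing '\n'
    end while
    close(fn)
    return lines
end function

procedure write_lines(sequence filename, sequence lines)
    integer fn
    fn = open(filename, "w")
    for i = 1 to length(lines) do
        puts(fn, lines[i])
    end for
    close(fn)
end procedure

write_lines("sorted.txt", sort(read_lines("codes.txt")))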

Now, write another program that puts the items into a list format,
using each code only once.

A100 ABABA
A120 ABACK ABACUS ABACUSES
A130 ABAFT
A145 ABALONE
A250 AACHEN
A631 AARDVARK
A650 AARON
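
A rough sketch of that merge step, assuming every input line is a
four-character code, a space, and one word (file names made up):

-- collapse the sorted "CODE WORD" file so each code appears once,
-- followed by every word that shares it
procedure merge_codes(sequence infile, sequence outfile)
    integer fin, fout
    object line
    sequence code, last_code
    fin = open(infile, "r")
    fout = open(outfile, "w")
    last_code = ""
    while 1 do
        line = gets(fin)
        if atom(line) then
            exit
        end if
        if line[length(line)] = '\n' then
            line = line[1..length(line)-1]
        end if
        if length(line) >= 6 then
            code = line[1..4]
            if equal(code, last_code) then
                -- same code: tack the word onto the current output line
                puts(fout, ' ' & line[6..length(line)])
            else
                -- new code: finish the previous line, start another
                if length(last_code) then
                    puts(fout, '\n')
                end if
                puts(fout, line)
                last_code = code
            end if
        end if
    end while
    if length(last_code) then
        puts(fout, '\n')
    end if
    close(fin)
    close(fout)
end procedure

merge_codes("sorted.txt", "merged.txt")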

You now have the database in the replacement list format needed for
your spell checker. More importantly, you have your codes in
ascending order in the file. This allows you to take the words
that you want to spell-check and convert them into their soundex
equivalents. You can then use a "binary search" algorithm to find
the match very quickly. The advantage of this method is that if
you choose not to load your list into memory, you can search directly
in the file. A binary search will give you the fewest disk accesses
necessary to hit the match. This may well be fast enough with
disk caching turned on.
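
The lookup itself might look like this sketch, assuming the merged list
has been read into a sequence with one record (line) per element:

-- binary search a sorted in-memory list of merged records
function find_code(sequence code, sequence dict)
    integer lo, hi, mid, c
    lo = 1
    hi = length(dict)
    while lo <= hi do
        mid = floor((lo + hi) / 2)
        c = compare(code, dict[mid][1..4])
        if c = 0 then
            return mid          -- index of the record for this code
        elsif c < 0 then
            hi = mid - 1
        else
            lo = mid + 1
        end if
    end while
    return 0                    -- no record with this code
end function

Once find_code() hands back a record, check whether the word itself
appears on that line (match(), with a space tacked on each side to
avoid partial hits, will do): if it does, the spelling is fine; if not,
the rest of the line is your replacement list.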

Some things you could do to speed it up:
You could apply your hashing algorithm to the document to be checked,
create a sorted list of its words, and eliminate redundancies by keeping
a pointer list to the locations of each occurrence (this also lets you
quickly ignore every instance of a word already deemed correct). You
then run through the list and cull out all entries that are found in
the dictionary. Once that is done, you re-sort the remaining pointer
list back into order by location (which keeps you from jumping all over
the document and confusing the user) and query the user on each
spelling mistake.
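
One way to build that word/pointer list, kept deliberately simple (the
inner scan is linear; a hash table or binary insertion would be the
faster choice on a big document):

-- build {WORD, {offset, offset, ...}} entries for a document held in
-- memory as one sequence of characters
include sort.e
include wildcard.e   -- upper()

function word_index(sequence document)
    sequence words, word
    integer c, start, found
    words = {}
    start = 0
    for i = 1 to length(document) + 1 do
        if i <= length(document) then
            c = document[i]
        else
            c = ' '              -- force the last word out at end of text
        end if
        if (c >= 'A' and c <= 'Z') or (c >= 'a' and c <= 'z') then
            if start = 0 then
                start = i        -- a new word begins here
            end if
        elsif start then
            word = upper(document[start..i-1])
            found = 0
            for j = 1 to length(words) do
                if equal(words[j][1], word) then
                    words[j][2] = append(words[j][2], start)
                    found = 1
                    exit
                end if
            end for
            if not found then
                words = append(words, {word, {start}})
            end if
            start = 0
        end if
    end for
    return sort(words)           -- sorted by word, offsets ride along
end function

Culling is then a matter of running each entry through soundex() and
the dictionary lookup and dropping the ones that check out; flatten
whatever is left back into {offset, word} pairs, sort by offset, and
you have the mistakes in document order.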

Of course, you may decide the two megs of memory are worth sacrificing
as well, since it would run much faster to check against the list in
memory.

Christopher D. Hickman

PS: One caveat on the file access method: for direct access to work,
each record has to be a fixed width, as wide as the longest list of
words for any particular code (so the resulting file would be several
megs). With the in-memory scheme, this of course would not be
necessary.
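
For what it's worth, with fixed-width records the direct file search
could look something like the sketch below, using seek() from file.e.
REC_WIDTH here is illustrative (use whatever width the records were
padded to), and the file should be opened with "rb" so the byte
offsets are exact.

-- binary search straight in a file of fixed-width records
include file.e    -- seek()

constant REC_WIDTH = 128   -- illustrative; use the real padded width

function find_code_on_disk(sequence code, integer fn, integer num_recs)
    integer lo, hi, mid, c
    object line
    lo = 1
    hi = num_recs
    while lo <= hi do
        mid = floor((lo + hi) / 2)
        if seek(fn, (mid - 1) * REC_WIDTH) then
            return -1           -- seek failed
        end if
        line = gets(fn)
        c = compare(code, line[1..4])
        if c = 0 then
            return mid          -- record number of the match
        elsif c < 0 then
            hi = mid - 1
        else
            lo = mid + 1
        end if
    end while
    return 0
end function

num_recs is just the file size divided by REC_WIDTH, so nothing beyond
the record being examined ever has to be in memory.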
