Re: Spell checker optimization
- Posted by MAP <ddhinc at ALA.NET> Mar 24, 1998
- 832 views
Ok, Here's what I would try... Take advantage of the soundex routines provided by Mr. Sephton. It only takes a few modifications to make it into a good spelling dictionary generator that emits a file that can be can be accessed in an efficient manner. First, modify create.ex to include the words in the same file with the soundex codes, eg; A250 AACHEN A631 AARDVARK A650 AARON A100 ABABA A120 ABACK A120 ABACUS A120 ABACUSES A130 ABAFT A145 ABALONE . . . etc Now, you need to use a sort program to put them in order by their codes. Dos's sort.exe wont do it because it can't handle the memory requirements, but I think I recall seeing a euphoria program somewhere that can. A100 ABABA A120 ABACK A120 ABACUS A120 ABACUSES A130 ABAFT A145 ABALONE A250 AACHEN A631 AARDVARK A650 AARON Now, write another program that puts the items into a list format using the codes only once. A100 ABABA A120 ABACK ABACUS ABACUSES A130 ABAFT A145 ABALONE A250 AACHEN A631 AARDVARK A650 AARON You now have the database in the replacement list format needed for your spell checker. More importantly, you have your codes in ascendinding order in the file. This allows you to take the words that you want to spell-check and convert them into their soundex equivalents. You can then use a "binary search" algorithm to find the match very quickly. The advantage of this method is that if you choose not to load your list into memory, you can search directly in the file. A binary sort will give you the fewest disk accesses necessary to hit the match. This may well be fast enough with disk caching turned on. Some things you could do to speed it up: You could turn your hashing algorithm onto the document to be checked, create a sorted list of the words, eliminating redundacies by making a pointer-list to locations of each occurance (This also allows you to be abled to quickly ignore all instances of a word already deemed correct). You then run thru the list and cull out all entries that are found in the dictionary. Once that is done, you resort the remaining pointer-list back into order by location (keeps from jumping all over the document, confusing the user) and query the user on each spelling mistake. Of course, you may decide the two megs of memory is worth sacrificing as well, as it would run much faster to check against the list in memory. Christopher D. Hickman ps: One caveat on the file access method; in order to accomplish the direct access method for it, each record will have to be a fixed width as large as the longest list of words for a particular code (the resulting file would be several megs). With the in memory scheme, this of course would not be necessary.