Re: Spell checker optimization
- Posted by Robert B Pilkington <bpilkington at JUNO.COM> Mar 25, 1998
- 857 views
>I ment, a sequence with 27 entries, which all have 27 entries, etc. >For example: >The word apple can be looked up: ['a']['p']['p']['l']['e'] > >Only remember to subtract 'a' and add 1 to use a more efficient tree. >The 27th element, is the end of word element, if it contains a true >value than the word ends there, if it contains a false value than it means >there is no word yet. >Because euphoria's sequence can handle this in memory, but not in >file, you need indexes. >The file would start with a record of 27 fields. All fields contain an >index in the file (maybe a relative index works nicer) >Then it goes there, reads the next record of fields. >If it has no more characters to look up, it checks the 27th field and >see if it's a false or not. If not, the word is correct, else not. >This would mean that you only need to read from disk as many times as >the word's length. >Althrough you could have the first couple of levels in memory (in one >sequence). >As many levels as it takes to make the list of corrected words less >than a 1000 or 250 or something. >Then you can read the 1000's words up in once (one hd light if you use >Jaques Deschenes DOS i/o routines, so you can have it put in an >preallocated place in one stroke) >And then you can binairy search the list of words off course. > >THe binairy tree does however take up more memory than a normal list, >but it makes lookups much faster. >You can also offcourse use hashing for every two or three characters >and put that in a simerlair tree... Let's see if I understand you right: The dictionary contains a list with indexes to the rest of the list: A B C D E . . . . * | | | | | A B C A B C A B C A B C A B C | | | | | | | | | | | | | | | etc So to look up words: (byte locations are made up and * is the 27th location) Let's say AD: A (index to byte 12300) D (index to byte 23000) * (true, this word exists) And to look up ADD: A (index to byte 12300) D (index to byte 23000) D (index to byte 30040) * (true, this word exists) Looking up AE: A (index to byte 12300) E (index to byte 24000) * (false, this word doesn't exist) This sounds good, and loading up 3 or 4 levels into memory doesn't sound too bad... I can keep a lot of words in memory like that. Just look at this paragraph and see how many words are more than four letters. But the problem: When checking for replacements, I took David Cuny's advice and this checks a whole bunch of times: ((26 * number of letters) * 2) + (number of letters * 4) 6 different checks are done, and two check each letter of the alphabet a different way (replacement and insertation). Each word needs to be checked for existance. For a 5 letter word (no longer in memory, need to look in dictionary), 280 checks need to be done. 336 checks for 6 a six letter word. That would be 280 hard drive loads (Once for each letter not in memory). (BTW, 280 checks is real fast with find() and the hash table, is 'instant' on the 486 fast enough?) I'm thinking about how I could use this information as I'm typing this so don't worry if I answer the problems I mention..... Another thing is, the index would be 0 if there are no more words in this direction, right? I assume so. (ie searching for XXQ (the second X would have an index of 0 telling the search that there aren't any words that have XX as the first two letters) The question here is, would this make the dictionary filesize bigger, stay about the same, or larger? If it makes it larger then I don't know if it would be worth the extra speed. (I got the load routine down two 15 seconds, with 14 seconds second time with SmartDrv [wow], and it's only one second on the PII-233.. With the old load algorithm. The new one assumes there aren't any duplicate words. The old one checked anyway. I left the check commented so it's easy to check if any crept in.) If the dictionary filesize is smaller OTOH.... _____________________________________________________________________ You don't need to buy Internet access to use free Internet e-mail. Get completely free e-mail from Juno at http://www.juno.com Or call Juno at (800) 654-JUNO [654-5866]