1. Spell checker optimization
- Posted by Robert B Pilkington <bpilkington at JUNO.COM> Mar 24, 1998
- 837 views
Well, I got the replacement word routine working... It was slow, but I added hashing, and with 20-200 words per bucket it goes very fast on the 486/75 I'm using. (I'm purposely using the slowest computer here so I can see how much faster it needs to go.)

Anyway, I'm using a 50000 word dictionary (the one that came with SOUNDEX, which I'm considering adding to the Get_Replace() routine for the larger words). The dictionary is formatted like:

edit
editing
editor
i
i'd
i'll
word
word's

But over 50000 words. Being over 500k, it takes 16 seconds on the 486 to load. The problem is I want it more like 5 seconds. (I can get it to less than 3 seconds if I don't store the words, just load each one over top of the last, but that wouldn't be too effective... :/ )

I'm using the hash routine from EUPHORIA\DEMO\HASH.EX, but with only one field per word. (Each word only appears once, so I don't need a counter.) The dictionary also takes up too much space in memory. Let's see: assuming 16 bit integers for 8 bit characters, we have 1 meg in use for a 500k file. With 32 bit integers, we'd have 2 megs used up for the file.

Below is the routine to load the words. (To test it, just grab the dictionary from the spell checker on the archives, or from SOUNDEX. They are uppercase while mine is lowercase, but that doesn't matter for timing the routine.)

So what I need to know is:

How do I speed up the load time?
How do I make the dictionary smaller?
How do I make the dictionary use up less memory?

Below is the code I'm using, with the major bottlenecks (over 1% of profiled time) commented just above their respective lines:

without warning
without trace
without type_check
with profile_time

constant HASH_BUCKETS = 2099  -- Larger prime, better performance?
                              -- By increasing this number, I can get the
                              -- load down to 13 seconds, but there is
                              -- memory wasted from that, and it's already
                              -- taking up over 1 meg... Increase the buckets
                              -- too much and it slows to a halt.

object junk
junk = 0

sequence words
words = repeat("", HASH_BUCKETS)

function hash_routine(sequence string)
-- This routine works well for English words.
-- It's fast to calculate and it gives a pretty even distribution.
-- These are the two properties that you want in a hash routine.
    integer len
    atom val

    len = length(string)
    val = string[len]
    if len > 4 then
        len = 4
    end if
    for i = 1 to len - 1 do
        -- profile: 3.71%
        val = val * 64 + string[i]  -- shift 6 bits and add
    end for
    -- profile: 2.56%
    return remainder(val, HASH_BUCKETS) + 1
end function

procedure hash_add(sequence string)
-- If string is not in the table already, add it.
    integer hash_val  --, found
    sequence bucket

    -- which bucket to search?
    -- profile: 1.71%
    hash_val = hash_routine(string)
    bucket = words[hash_val]

    -- TOO slow. This takes way too long.
    -- found = 0
    -- -- search this bucket for the string
    -- for i = 1 to length(bucket) do
    --     if compare(string, bucket[i]) = 0 then
    --         -- found it
    --         found = i
    --         exit
    --     end if
    -- end for
    -- if not found then

    -- This is much faster, but still a bottleneck.
    -- profile: 7.92%
    if not find(string, bucket) then
        -- profile: 18.82%
        bucket = append(bucket, string)
    end if
    words[hash_val] = bucket  -- profile: 4.45%  What???
end procedure

procedure Get_Dictionary()
    integer dict

    dict = open("SPELL.DAT", "r")
    puts(1, "Loading dictionary...\r")
    if dict = -1 then
        puts(1, "Fatal error -- Main dictionary \"SPELL.DAT\" missing.\n")
        abort(1)
    end if
    junk = {}
    -- Biiig nono: words = {}
    while 1 do
        -- profile: 54.63%
        junk = gets(dict)
        if atom(junk) then
            exit
        end if
        -- profile: 1.95%
        junk = junk[1..length(junk)-1]
        -- profile: 1.16%
        hash_add(junk)
        -- Well, that's better than 20% down here:
        -- hash_add(junk[1..length(junk)-1])
    end while
    close(dict)
    return
end procedure

atom a
a = time()
Get_Dictionary()
a = time() - a
printf(1, "\n%.2f seconds.\n", {a})
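[One cheap cut, which a later post in this thread reports making: the find() in hash_add() exists only to skip duplicate words, and a clean dictionary file has none, so the load path can drop it. A minimal sketch, assuming SPELL.DAT is duplicate-free, with hash_routine() and words as above:

procedure hash_add_nodup(sequence string)
-- no duplicate check: the dictionary file is assumed to be clean
    integer hash_val
    hash_val = hash_routine(string)
    words[hash_val] = append(words[hash_val], string)
end procedure

This removes the 7.92% find() hot spot outright, and appending straight into words[hash_val] may also spare copying the bucket out to a local and back, which appears to be where the mysterious 4.45% on the final assignment was going.]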
2. Re: Spell checker optimization
- Posted by MAP <ddhinc at ALA.NET> Mar 24, 1998
- 833 views
Ok, here's what I would try... Take advantage of the soundex routines provided by Mr. Sephton. It only takes a few modifications to make it into a good spelling dictionary generator that emits a file that can be accessed in an efficient manner.

First, modify create.ex to include the words in the same file with the soundex codes, e.g.:

A250 AACHEN
A631 AARDVARK
A650 AARON
A100 ABABA
A120 ABACK
A120 ABACUS
A120 ABACUSES
A130 ABAFT
A145 ABALONE
.
.
.
etc

Now, you need to use a sort program to put them in order by their codes. DOS's sort.exe won't do it because it can't handle the memory requirements, but I think I recall seeing a Euphoria program somewhere that can.

A100 ABABA
A120 ABACK
A120 ABACUS
A120 ABACUSES
A130 ABAFT
A145 ABALONE
A250 AACHEN
A631 AARDVARK
A650 AARON

Now, write another program that puts the items into a list format, using each code only once.

A100 ABABA
A120 ABACK ABACUS ABACUSES
A130 ABAFT
A145 ABALONE
A250 AACHEN
A631 AARDVARK
A650 AARON

You now have the database in the replacement list format needed for your spell checker. More importantly, you have your codes in ascending order in the file. This allows you to take the words that you want to spell-check and convert them into their soundex equivalents. You can then use a "binary search" algorithm to find the match very quickly. The advantage of this method is that if you choose not to load your list into memory, you can search directly in the file. A binary search will give you the fewest disk accesses necessary to hit the match. This may well be fast enough with disk caching turned on.

Some things you could do to speed it up: you could turn your hashing algorithm onto the document to be checked and create a sorted list of its words, eliminating redundancies by making a pointer-list to the locations of each occurrence (this also lets you quickly ignore all instances of a word already deemed correct). You then run through the list and cull out all entries that are found in the dictionary. Once that is done, you re-sort the remaining pointer-list back into order by location (keeps from jumping all over the document, confusing the user) and query the user on each spelling mistake.

Of course, you may decide the two megs of memory is worth sacrificing as well, as it would run much faster to check against the list in memory.

Christopher D. Hickman

ps: One caveat on the file access method; in order to accomplish direct access, each record will have to be a fixed width as large as the longest list of words for a particular code (the resulting file would be several megs). With the in-memory scheme, this of course would not be necessary.
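[A minimal Euphoria sketch of the binary search step, with illustrative names; it assumes the sorted file has been loaded into a sequence records of {code, words} pairs, e.g. {"A120", "ABACK ABACUS ABACUSES"}:

function code_lookup(sequence records, sequence code)
-- binary search: halve the remaining range on every iteration
    integer lo, hi, mid, c
    lo = 1
    hi = length(records)
    while lo <= hi do
        mid = floor((lo + hi) / 2)
        c = compare(records[mid][1], code)
        if c = 0 then
            return mid      -- index of the matching code
        elsif c < 0 then
            lo = mid + 1    -- match can only be in the upper half
        else
            hi = mid - 1    -- match can only be in the lower half
        end if
    end while
    return 0                -- code not in the list
end function

The same loop works against the fixed-width file directly: replace the records[mid] subscript with a seek to mid times the record width and a read of one record.]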
3. Re: Spell checker optimization
- Posted by Linda Nieuwenhuijsen <nieuwen at XS4ALL.NL> Mar 24, 1998
- 820 views
- Last edited Mar 25, 1998
>You now have the database in the replacement list format needed for
>your spell checker. More importantly, you have your codes in
>ascending order in the file. This allows you to take the words
>that you want to spell-check and convert them into their soundex
>equivalents. You can then use a "binary search" algorithm to find
>the match very quickly. The advantage of this method is that if
>you choose not to load your list into memory, you can search directly
>in the file. A binary search will give you the fewest disk accesses
>necessary to hit the match. This may well be fast enough with
>disk caching turned on.

You could have a binary tree of the first 2 or 3 letters in memory.. 26^5th doesn't even take too much memory. Then you could do one large read from disk, which due to precise disk caching (use Jaques' routines to load them into an allocated memory block --> much more efficient and searches faster... and you can say precisely how much you want loaded.. cached.. at once.. should go pretty fast.. one blink of the HD light), then you can hash-search the words per 2 characters, which are in a binary tree. (Each tree has 26^2 eh.. elements, which all have 26^2 elements also.) You either need to use several files, or blocks in a file per word length, or you need to add a '\n' after each word (which works slower). This is fast and memory efficient, but not that easy to code.. I guess..

>Some things you could do to speed it up:
>you could turn your hashing algorithm onto the document to be checked
>and create a sorted list of its words, eliminating redundancies by
>making a pointer-list to the locations of each occurrence (this also
>lets you quickly ignore all instances of a word already deemed
>correct). You then run through the list and cull out all entries
>that are found in the dictionary. Once that is done, you re-sort
>the remaining pointer-list back into order by location (keeps from
>jumping all over the document, confusing the user) and query the
>user on each spelling mistake.

With of course the option to change all similar.. change all the mistakes with THIS word...

>Of course, you may decide the two megs of memory is worth sacrificing
>as well, as it would run much faster to check against the list in
>memory.

Two megs.. it will use a lot more than that. Euphoria uses about 6 bytes per sequence, I believe, but the sequence length also needs to be saved, and every element takes 4 bytes. So even as one long sequence it will be about 4 times the file size: 2 meg..

>ps: One caveat on the file access method; in order to accomplish
>direct access, each record will have to be a fixed width
>as large as the longest list of words for a particular code (the
>resulting file would be several megs). With the in-memory scheme,
>this of course would not be necessary.

You can have quick indexes in the file too, or split it up into several files that contain all the words of a specified length.. or have indexed blocks within one file..

Ralf
nieuwen at xs4all.nl
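[A rough sketch of the in-memory prefix index idea, shown as a flat table rather than a tree (a tree would work too). The names and layout are illustrative: it assumes plain a-z words and a dictionary file written in prefix order, with one offset/length pair per two-letter block recorded when the file is generated:

sequence block_offset, block_size   -- 676 entries each, saved alongside
                                    -- the dictionary when it is built

function prefix_slot(sequence word)
-- map the first two letters, "aa".."zz", onto 1..676
    return (word[1] - 'a') * 26 + (word[2] - 'a') + 1
end function

A lookup then costs one seek to block_offset[slot] plus one read of block_size[slot] bytes, pulling every word with that prefix into memory in a single stroke: the "one blink of the HD light" access pattern described above.]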
4. Re: Spell checker optimization
- Posted by MAP <ddhinc at ALA.NET> Mar 25, 1998
- 799 views
Well, I'm not sure I agree with Mr. Nieuwenhuijsen's idea of the binary tree implementation, but I'll give him a chance to explain it more fully before I say it's not a better way of doing it. Some of his other ideas, like using an index file instead of fixed width records, loading the list in blocks rather than all at once, etc., I do agree with.

I should though point out the differences between a binary search and a binary tree for anyone not familiar with these two concepts. Although they are both born out of the same principles, they are implemented quite differently.

*Binary searches* are far simpler. They require that the list to be worked upon be in sequential order (exactly the case in the spell checker problem). The binary search works by eliminating half of the elements remaining to be searched with each iteration. Take for instance the following array, a simple list of letters from "A" to "L", and say we are looking for the element containing "D".

 1: A
 2: B
 3: C  <-- 2
 4: D  <-- 3
 5: E
 6: F  <-- 1
 7: G
 8: H
 9: I
10: J
11: K
12: L

At the beginning, it is assumed that the item must lie between elements 1 to 12 or not be in the list. The search begins at the center of the list, which contains "F". When "D" is found to be lesser than "F", the top half of the list, 6-12, is considered to be culled. The next iteration starts in the middle of the remaining elements 1-5, element 3. The comparison is made and elements 1-3 are found to be lesser; now 4 & 5 remain. Half the distance between 4 & 5 is 4.5 (it's usually easiest just to round down), yielding 4, which is a match.

This doesn't look very efficient for a small array, but when dealing with large arrays it's very efficient. Your spelling dictionary has about 50000 words. With the search you'll eliminate half the list with each comparison: 50000, 25000, 12500, 6250, 3125, 1562, 781, 390, 195, 97, 48, 24, 12, 6, 3, 2, 1... yields a match within a maximum of 17 comparisons, not bad for 50000 elements.

*Binary trees* are generally used when you want to be able to make changes to the list you are working on. They can be used quite effectively in situations where you need to access and update records, add and delete items, etc. The binary tree consists of nodes and branches. The first item entered is considered the "root" node. When more items are added, they are compared against the root node and added to the left (lesser) or right (greater) branch of that node. If a node already exists on that branch, the same process is repeated for that node as was for the root node. This carries on until an unused branch is found to place the item onto.

Binary trees are NOT intended to be fed items in sequential order. In fact it's the worst case scenario, because it loads all items onto the right (ascending) or left (descending) main branch, yielding the least efficient access (it's virtually the same as a linked list in this case). Therefore, binary trees are intended to be used when the list entered has a basically random distribution, and even then they require periodic "balancing" to maintain efficiency.

I hope this clears it up some.

Christopher D. Hickman
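[For concreteness, a minimal Euphoria sketch of the tree just described; this is illustrative, not from the thread. A node is held as {word, left, right}, with the atom 0 marking an unused branch:

function tree_insert(object node, sequence word)
-- walk down until an unused branch is found, then place the item there
    integer c
    if atom(node) then
        return {word, 0, 0}
    end if
    c = compare(word, node[1])
    if c < 0 then
        node[2] = tree_insert(node[2], word)   -- lesser: left branch
    elsif c > 0 then
        node[3] = tree_insert(node[3], word)   -- greater: right branch
    end if                                     -- equal: already present
    return node
end function

function tree_find(object node, sequence word)
    integer c
    while sequence(node) do
        c = compare(word, node[1])
        if c = 0 then
            return 1
        elsif c < 0 then
            node = node[2]
        else
            node = node[3]
        end if
    end while
    return 0
end function

Feed tree_insert() an already-sorted dictionary and every item lands on the right branch, the degenerate linked-list case described above, which is exactly why the sorted 50000-word file favors the plain binary search instead.]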
5. Re: Spell checker optimization
- Posted by Joe Phillips <bubba at TXWES.EDU> Mar 25, 1998
- 825 views
At 11:00 AM 3/25/98 -0600, you wrote:
>Well, I'm not sure I agree with Mr. Nieuwenhuijsen's idea of the
>binary tree implementation, but I'll give him a chance to explain
>it more fully before I say it's not a better way of doing it.

Thanks for the detailed explanations. Although I knew the concepts, the presentation was succinct and thorough. In fact, it was those explanations that lead me to think a binary tree might be the best for this implementation. Besides, what good is a dictionary to which you cannot add words? I think the b2-tree (balanced binary tree) concept works beautifully. In fact, the file can be stored as a b2-tree (with pointers) so as to not have to rebuild the tree at each load.

Although, I often use binary searches for the very reasons mentioned. (EASY) In fact, many times I will even use binary searches on un-sorted, linked lists or files. Although I have to parse through pointers, key comparisons are kept to a minimum.

--[ Joe Phillips, Systems Analyst
--[ Texas Wesleyan University 817-531-4444
--[
--[ "The tongue of the just is as choice silver:
--[  the heart of the wicked is little worth."  Proverbs 10:20
6. Re: Spell checker optimization
- Posted by Ralf Nieuwenhuijsen <nieuwen at XS4ALL.NL> Mar 25, 1998
- 870 views
I meant a sequence with 27 entries, which all have 27 entries, etc. For example, the word apple can be looked up:

['a']['p']['p']['l']['e']

Only remember to subtract 'a' and add 1 to use a more efficient tree. The 27th element is the end-of-word element: if it contains a true value then a word ends there; if it contains a false value then there is no word yet.

Because Euphoria's sequences can handle this in memory, but not in a file, you need indexes. The file would start with a record of 27 fields. All fields contain an index into the file (maybe a relative index works nicer). Then it goes there and reads the next record of fields. If it has no more characters to look up, it checks the 27th field to see whether it's false or not. If not, the word is correct; else not. This would mean that you only need to read from disk as many times as the word's length.

Although you could have the first couple of levels in memory (in one sequence): as many levels as it takes to make the list of candidate words less than 1000 or 250 or something. Then you can read the 1000-odd words in at once (one HD light if you use Jaques Deschenes' DOS i/o routines, so you can have it put in a preallocated place in one stroke). And then you can binary search the list of words, of course.

The binary tree does however take up more memory than a normal list, but it makes lookups much faster. You can also of course use hashing for every two or three characters and put that in a similar tree...

Ralf
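[An in-memory version of this 27-entry scheme is easy to sketch with nested Euphoria sequences. This is illustrative code, assuming plain a-z words; slot 27 is the end-of-word flag, and the atom 0 marks a branch with no words:

function trie_add(object node, sequence word)
-- returns the node with the word added beneath it
    integer c
    if atom(node) then
        node = repeat(0, 27)      -- grow a new 27-slot level on demand
    end if
    if length(word) = 0 then
        node[27] = 1              -- a word ends here
    else
        c = word[1] - 'a' + 1
        node[c] = trie_add(node[c], word[2..length(word)])
    end if
    return node
end function

function trie_has(object node, sequence word)
    for i = 1 to length(word) do
        if atom(node) then
            return 0              -- no words down this branch
        end if
        node = node[word[i] - 'a' + 1]
    end for
    if atom(node) then
        return 0
    end if
    return node[27]               -- true only if a word ends exactly here
end function

Loading is then just trie = trie_add(trie, word) per dictionary line (starting from trie = 0), and each lookup touches one level per letter. Words with apostrophes like "i'd" would need a 28th slot or some mapping, which is glossed over here.]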
7. Re: Spell checker optimization
- Posted by MAP <ddhinc at ALA.NET> Mar 25, 1998
- 842 views
Ralf Nieuwenhuijsen wrote:
>
> I meant a sequence with 27 entries, which all have 27 entries, etc.
> For example, the word apple
> can be looked up: ['a']['p']['p']['l']['e']
> <Snip>

I understand what you're saying now. Remember though that what I was basically proposing was that Mr. Pilkington store the soundex codes as the key for his database. These are represented as a letter followed by a 3 digit value, not as the word itself. By comparing the soundex code generated by the word being checked against the already existing keys in the database, you are able to obtain a list of alternate choices for any word not spelled correctly. Using a search algorithm based on the words themselves would re-incur the redundancy that the soundex algorithm had eliminated.

Choosing the algorithm with which to access it quickly is still in question though. Although Mr. Phillips pointed out that it would be very useful to be able to add words to the dictionary, I still am not sold on the binary tree idea. I think the volatility of the database would be too low to warrant the more complex implementation. The user would have to add 500 words to affect the 50000 word database by 1 percent... something which is more likely to occur over a great number of sessions than in one sitting. In a situation with volatility this low, I'd opt to go with the faster (and simpler) binary search algorithm, and use an alternate (even if slower than binary trees) method when adding items to the database, since the operation is "relatively" infrequent.

Also, he could very well choose to stick with the hashing algorithm if it suits his needs. Junko Miura has a spell checker listed in Euphoria's archive page that uses some variation of a hashing algorithm. It seems to run at a decent speed on my computer, although I only used it on a short document I had handy. I barely glanced at the code so I don't know the specifics of what it's doing, but it's probably worth investigating if he hasn't done so already.

Christopher D. Hickman
8. Re: Spell checker optimization
- Posted by Robert B Pilkington <bpilkington at JUNO.COM> Mar 25, 1998
- 857 views
>I meant a sequence with 27 entries, which all have 27 entries, etc.
>For example, the word apple can be looked up: ['a']['p']['p']['l']['e']
>
>Only remember to subtract 'a' and add 1 to use a more efficient tree.
>The 27th element is the end-of-word element: if it contains a true
>value then a word ends there; if it contains a false value then there
>is no word yet.
>Because Euphoria's sequences can handle this in memory, but not in a
>file, you need indexes.
>The file would start with a record of 27 fields. All fields contain an
>index into the file (maybe a relative index works nicer).
>Then it goes there and reads the next record of fields.
>If it has no more characters to look up, it checks the 27th field to
>see whether it's false or not. If not, the word is correct; else not.
>This would mean that you only need to read from disk as many times as
>the word's length.
>Although you could have the first couple of levels in memory (in one
>sequence): as many levels as it takes to make the list of candidate
>words less than 1000 or 250 or something.
>Then you can read the 1000-odd words in at once (one HD light if you
>use Jaques Deschenes' DOS i/o routines, so you can have it put in a
>preallocated place in one stroke).
>And then you can binary search the list of words, of course.
>
>The binary tree does however take up more memory than a normal list,
>but it makes lookups much faster.
>You can also of course use hashing for every two or three characters
>and put that in a similar tree...

Let's see if I understand you right: the dictionary contains a list with indexes to the rest of the list:

A       B       C       D       E   . . . .   *
|       |       |       |       |
A B C   A B C   A B C   A B C   A B C
| | |   | | |   | | |   | | |   | | |
etc

So to look up words (byte locations are made up, and * is the 27th location):

Let's say AD:
A (index to byte 12300)
D (index to byte 23000)
* (true, this word exists)

And to look up ADD:
A (index to byte 12300)
D (index to byte 23000)
D (index to byte 30040)
* (true, this word exists)

Looking up AE:
A (index to byte 12300)
E (index to byte 24000)
* (false, this word doesn't exist)

This sounds good, and loading up 3 or 4 levels into memory doesn't sound too bad... I can keep a lot of words in memory like that. Just look at this paragraph and see how many words are more than four letters.

But the problem: when checking for replacements, I took David Cuny's advice, and this checks a whole bunch of times:

((26 * number of letters) * 2) + (number of letters * 4)

Six different checks are done, and two of them check each letter of the alphabet a different way (replacement and insertion). Each word needs to be checked for existence. For a 5 letter word (no longer in memory, need to look in the dictionary), 280 checks need to be done; 336 checks for a six letter word. That would be 280 hard drive loads (once for each letter not in memory). (BTW, 280 checks is real fast with find() and the hash table; is 'instant' on the 486 fast enough?)

I'm thinking about how I could use this information as I'm typing this, so don't worry if I answer the problems I mention.....

Another thing is, the index would be 0 if there are no more words in this direction, right? I assume so. (i.e. searching for XXQ, the second X would have an index of 0, telling the search that there aren't any words that have XX as the first two letters.)

The question here is, would this make the dictionary filesize smaller, stay about the same, or get larger? If it makes it larger then I don't know if it would be worth the extra speed.
(I got the load routine down to 15 seconds, with 14 seconds the second time with SmartDrv [wow], and it's only one second on the PII-233... with the old load algorithm. The new one assumes there aren't any duplicate words. The old one checked anyway. I left the check commented out so it's easy to check if any crept in.)

If the dictionary filesize is smaller OTOH....
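[For what it's worth, the disk-resident lookup loop being discussed might look something like this in Euphoria. It's a sketch only, under assumed file conventions: the file is opened with open(..., "rb"), each record is 27 fields of 4-byte little-endian indexes, the root record sits at byte 0, and a 0 index means no words in that direction (seek() comes from file.e):

include file.e   -- seek()

function read_index(integer fn)
-- read one 4-byte little-endian index field
    atom val
    val = 0
    for i = 0 to 3 do
        val = val + getc(fn) * power(256, i)
    end for
    return val
end function

function word_exists(integer fn, sequence word)
-- one seek and one small read per letter, as described above
    atom node
    integer c
    node = 0                          -- root record at byte 0
    for i = 1 to length(word) do
        c = word[i] - 'a' + 1         -- fields 1..26 are the letters
        if seek(fn, node + (c - 1) * 4) then
            return 0                  -- seek failed
        end if
        node = read_index(fn)
        if node = 0 then
            return 0                  -- no words in this direction
        end if
    end for
    if seek(fn, node + 26 * 4) then   -- field 27: end-of-word flag
        return 0
    end if
    return read_index(fn) != 0
end function

At one seek per letter, the 280-odd replacement candidates for a five-letter word would each cost several reads, which is why keeping the first 3 or 4 levels in memory, as proposed, matters so much.]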
9. Re: Spell checker optimization
- Posted by Lmailles <Lmailles at AOL.COM> Mar 26, 1998
- 809 views
You mentioned at the beginning about the ACTUAL comparison routine, to pick the best word. I wrote a QBASIC fuzzy search routine a while ago which would do just the job. It is already pretty well optimised for typical mis-spellings (letters in wrong order like hlelo or missed out letters) and a little extra code to check for keyboard typos (s = a or d, k = j or l) I'll convert it using David Cuny's ebasic, and then tidy up if you think it might be useful. I never bothered to turn it into a spell checker because QB was SO SLOW - I mean I even had to write my own sort rountine... Daniel
10. Re: Spell checker optimization
- Posted by Robert B Pilkington <bpilkington at JUNO.COM> Mar 26, 1998
- 783 views
>You mentioned at the beginning about the ACTUAL comparison routine, to
>pick the best word.

This thread has gotten a little big.... And without quoting... where exactly are you referring to? (I've finished the comparison and replacement word routines; I just need to speed up the dictionary load and perhaps make the dictionary filesize smaller.)

>I wrote a QBASIC fuzzy search routine a while ago which would do just
>the job.
>It is already pretty well optimised for typical mis-spellings (letters
>in wrong order like hlelo or missed out letters) and a little extra code
>to check for keyboard typos (s = a or d, k = j or l)

Mine does this, but a little more generic. (With the detail David Cuny went into in the message he sent me, I figure he could use the information to create a spell checker himself if he had more time....) It checks for missed out letters and wrong order, as well as extra letters, single letter typos, and words that should be two words. It could be much more optimized, but like I said, on the 486 its response time for the check is 'instant'. I don't notice any delay. (Except that 16 second one while the 1/2 meg dictionary is loading....)

>I'll convert it using David Cuny's ebasic, and then tidy up if you
>think it might be useful. I never bothered to turn it into a spell
>checker because QB was SO SLOW - I mean I even had to write my own
>sort rountine...

The replacement routine I have now works great. Later on I might add soundex encoding to replace words that are too mangled for the current checks. (Mostly big words that sound the same to soundex but have too many messed up letters, but I've only had 2 words so far where I didn't get the correct replacement in the list.)

>own sort rountine...
          ^^^^^^^^

My spell checker would catch that and give the correct replacement, assuming of course, that "routine" is in the dictionary... :)
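[For reference, the kind of candidate generation described above can be sketched in a few loops. This is an illustrative reconstruction, not Mr. Pilkington's actual routine; it assumes plain a-z words, each candidate would then be looked up in the dictionary, and the two-words case would be handled separately by checking both halves at every split point:

function candidates(sequence word)
-- generate respellings covering: missed-out letter, wrong order,
-- extra letter, and single-letter typo
    sequence list, w
    integer len
    len = length(word)
    list = {}
    -- missed-out letter: insert each of a..z at each position
    for i = 1 to len + 1 do
        for c = 'a' to 'z' do
            list = append(list, word[1..i-1] & c & word[i..len])
        end for
    end for
    -- wrong order: swap each adjacent pair
    for i = 1 to len - 1 do
        w = word
        w[i] = word[i+1]
        w[i+1] = word[i]
        list = append(list, w)
    end for
    -- extra letter: drop each position in turn
    for i = 1 to len do
        list = append(list, word[1..i-1] & word[i+1..len])
    end for
    -- single-letter typo: replace each position with a..z
    for i = 1 to len do
        for c = 'a' to 'z' do
            w = word
            w[i] = c
            list = append(list, w)
        end for
    end for
    return list
end function

For a 5-letter word this yields 26*6 + 4 + 5 + 26*5 = 295 candidates, in the same ballpark as the ((26 * letters) * 2) + (letters * 4) = 280 figure quoted earlier in the thread.]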