Spell checker optimization
- Posted by Robert B Pilkington <bpilkington at JUNO.COM> Mar 24, 1998
- 836 views
Well, I got the replacement word routine working... It was slow but I added hashing and with 20-200 words per bucket it goes very fast on the 486/75 I'm using. (I'm purposely using the slowest computer here so I can see how much faster it needs to go.) Anyway, I'm using a 50000 word dictionary. (The one that came with SOUNDEX, which I'm considering adding to the Get_Replace() routine for the larger words) The dictionary is formatted like: edit editing editor i i'd i'll word word's But over 50000 words. Being over 500k it takes 16 seconds on the 486 to load. The problem is I want it more like 5 seconds (I can get it to less than 3 seconds if I don't store the words, just load them overtop the last word, but that wouldn't be too effective... :/ ). I'm using the hash routine from EUPHORIA\DEMO\HASH.EX, but with only one field per word. (Each word only appears once, so I don't need a counter.) The dictionary also takes up too much space in memory. Let's see, assuming 16 bit integers for 8 bit characters we have 1 meg in use for a 500k file. With 32 bit integers, we'd have 2 megs used up for the file. Below is the routine to load the words. (To test it, just grab the dictionary from the spell checker on the archives or SOUNDEX. They are uppercase while mine is lowercase, but that doesn't matter for timing the routine.) So what I need to know is: How do I speed up the load time? How do I make the dictionary smaller? How do I make the dictionary use up less memory? Below is the code I'm using, as well as the major bottlenecks (over 1% time profiled) already commented just above their respective lines: without warning without trace without type_check with profile_time constant HASH_BUCKETS = 2099 -- Larger prime, better performance? -- By increasing this number, I can get the -- load down to 13 seconds, but there is -- memory wasted from that, and it's already -- taking up over 1 meg... Increase the buckets -- too and it slows to a halt. object junk junk = 0 sequence words words = repeat("", HASH_BUCKETS) function hash_routine(sequence string) -- This routine works well for English words. -- It's fast to calculate and it gives a pretty even distribution. -- These are the two properties that you want in a hash routine. integer len atom val len = length(string) val = string[len] if len > 4 then len = 4 end if for i = 1 to len - 1 do -- profile: 3.71% val = val * 64 + string[i] -- shift 6 bits and add end for -- profile: 2.56% return remainder(val, HASH_BUCKETS) + 1 end function procedure hash_add(sequence string) -- If string is not in the table already, add it. integer hash_val --, found sequence bucket -- which bucket to search? -- profile: 1.71% hash_val = hash_routine(string) bucket = words[hash_val] -- TOO slow. This takes way too long. -- found = 0 -- -- search this bucket for the string -- for i = 1 to length(bucket) do -- if compare(string, bucket[i]) = 0 then -- -- found it -- found = i -- exit -- end if -- end for -- if not found then -- This is much faster, but still a bottleneck. -- profile 7.92% if not find(string, bucket) then -- profile: 18.82% bucket = append(bucket, string) end if words[hash_val] = bucket -- profile: 4.45% What??? end procedure procedure Get_Dictionary() integer dict dict = open("SPELL.DAT", "r") puts(1, "Loading dictionary...\r") if dict = -1 then puts(1, "Fatal error -- Main dictionary \"SPELL.DAT\" missing.\n") abort(1) end if junk = {} -- Biiig nono: words = {} while 1 do -- profile: 54.63% junk = gets(dict) if atom(junk) then exit end if -- profile 1.95% junk = junk[1..length(junk)-1] -- profile 1.16% hash_add(junk) -- Well, that's better than 20% down here: -- hash_add(junk[1..length(junk)-1]) end while close(dict) return end procedure atom a a = time() Get_Dictionary() a = time() - a printf("\n%.2f seconds.\n", {a}) _____________________________________________________________________ You don't need to buy Internet access to use free Internet e-mail. Get completely free e-mail from Juno at http://www.juno.com Or call Juno at (800) 654-JUNO [654-5866]