Spell checker optimization

new topic     » topic index » view thread      » older message » newer message

Well, I got the replacement word routine working... It was slow but I
added hashing and with 20-200 words per bucket it goes very fast on the
486/75 I'm using. (I'm purposely using the slowest computer here so I can
see how much faster it needs to go.)

Anyway, I'm using a 50000 word dictionary. (The one that came with
SOUNDEX, which I'm considering adding to the Get_Replace() routine for
the larger words)
The dictionary is formatted like:

edit
editing
editor
i
i'd
i'll
word
word's

But over 50000 words. Being over 500k it takes 16 seconds on the 486 to
load. The problem is I want it more like 5 seconds (I can get it to less
than 3 seconds if I don't store the words, just load them overtop the
last word, but that wouldn't be too effective... :/ ). I'm using the hash
routine from EUPHORIA\DEMO\HASH.EX, but with only one field per word.
(Each word only appears once, so I don't need a counter.) The dictionary
also takes up too much space in memory. Let's see, assuming 16 bit
integers for 8 bit characters we have 1 meg in use for a 500k file. With
32 bit integers, we'd have 2 megs used up for the file. Below is the
routine to load the words. (To test it, just grab the dictionary from the
spell checker on the archives or SOUNDEX. They are uppercase while mine
is lowercase, but that doesn't matter for timing the routine.)

So what I need to know is:

How do I speed up the load time?
How do I make the dictionary smaller?
How do I make the dictionary use up less memory?



Below is the code I'm using, as well as the major bottlenecks (over 1%
time profiled) already commented just above their respective lines:


without warning
without trace
without type_check

with profile_time

constant HASH_BUCKETS = 2099    -- Larger prime, better performance?
                                -- By increasing this number, I can get
the
                                -- load down to 13 seconds, but there is
                                -- memory wasted from that, and it's
already
                                -- taking up over 1 meg... Increase the
buckets
                                -- too and it slows to a halt.

object junk
junk = 0

sequence words
words = repeat("", HASH_BUCKETS)

function hash_routine(sequence string)
-- This routine works well for English words.
-- It's fast to calculate and it gives a pretty even distribution.
-- These are the two properties that you want in a hash routine.
    integer len
    atom val

    len = length(string)
    val = string[len]
    if len > 4 then
        len = 4
    end if
    for i = 1 to len - 1 do
        -- profile: 3.71%
        val = val * 64 + string[i]  -- shift 6 bits and add
    end for
    -- profile: 2.56%
    return remainder(val, HASH_BUCKETS) + 1
end function

procedure hash_add(sequence string)
-- If string is not in the table already, add it.
    integer hash_val --, found
    sequence bucket

    -- which bucket to search?
    -- profile: 1.71%
    hash_val = hash_routine(string)
    bucket = words[hash_val]

-- TOO slow. This takes way too long.
--    found = 0
--    -- search this bucket for the string
--    for i = 1 to length(bucket) do
--        if compare(string, bucket[i]) = 0 then
--            -- found it
--            found = i
--            exit
--        end if
--    end for
--    if not found then

    -- This is much faster, but still a bottleneck.
    -- profile 7.92%
    if not find(string, bucket) then
        -- profile: 18.82%
        bucket = append(bucket, string)
    end if
    words[hash_val] = bucket
-- profile: 4.45%   What???
end procedure

procedure Get_Dictionary()
    integer dict

    dict = open("SPELL.DAT", "r")
    puts(1, "Loading dictionary...\r")
    if dict = -1 then
        puts(1, "Fatal error -- Main dictionary \"SPELL.DAT\"
missing.\n")
        abort(1)
    end if
    junk = {}
    -- Biiig nono: words = {}
    while 1 do
        -- profile: 54.63%
        junk = gets(dict)
        if atom(junk) then
            exit
        end if
        -- profile 1.95%
        junk = junk[1..length(junk)-1]
        -- profile 1.16%
        hash_add(junk)
        -- Well, that's better than 20% down here:
        -- hash_add(junk[1..length(junk)-1])
    end while
    close(dict)
    return
end procedure

atom a
a = time()

Get_Dictionary()

a = time() - a
printf("\n%.2f seconds.\n", {a})


_____________________________________________________________________
You don't need to buy Internet access to use free Internet e-mail.
Get completely free e-mail from Juno at http://www.juno.com
Or call Juno at (800) 654-JUNO [654-5866]

new topic     » topic index » view thread      » older message » newer message

Search



Quick Links

User menu

Not signed in.

Misc Menu