1. Spell checker optimization

Well, I got the replacement word routine working... It was slow but I
added hashing and with 20-200 words per bucket it goes very fast on the
486/75 I'm using. (I'm purposely using the slowest computer here so I can
see how much faster it needs to go.)

Anyway, I'm using a 50000 word dictionary. (The one that came with
SOUNDEX, which I'm considering adding to the Get_Replace() routine for
the larger words)
The dictionary is formatted like:

edit
editing
editor
i
i'd
i'll
word
word's

But over 50000 words. Being over 500k it takes 16 seconds on the 486 to
load. The problem is I want it more like 5 seconds (I can get it to less
than 3 seconds if I don't store the words, just load them overtop the
last word, but that wouldn't be too effective... :/ ). I'm using the hash
routine from EUPHORIA\DEMO\HASH.EX, but with only one field per word.
(Each word only appears once, so I don't need a counter.) The dictionary
also takes up too much space in memory. Let's see, assuming 16 bit
integers for 8 bit characters we have 1 meg in use for a 500k file. With
32 bit integers, we'd have 2 megs used up for the file. Below is the
routine to load the words. (To test it, just grab the dictionary from the
spell checker on the archives or SOUNDEX. They are uppercase while mine
is lowercase, but that doesn't matter for timing the routine.)

So what I need to know is:

How do I speed up the load time?
How do I make the dictionary smaller?
How do I make the dictionary use up less memory?



Below is the code I'm using, with the major bottlenecks (anything over 1%
of profiled time) commented just above their respective lines:


without warning
without trace
without type_check

with profile_time

constant HASH_BUCKETS = 2099    -- Larger prime, better performance?
                                -- By increasing this number, I can get the
                                -- load down to 13 seconds, but there is
                                -- memory wasted from that, and it's already
                                -- taking up over 1 meg... Increase the
                                -- buckets too much and it slows to a halt.

object junk
junk = 0

sequence words
words = repeat("", HASH_BUCKETS)

function hash_routine(sequence string)
-- This routine works well for English words.
-- It's fast to calculate and it gives a pretty even distribution.
-- These are the two properties that you want in a hash routine.
    integer len
    atom val

    len = length(string)
    val = string[len]
    if len > 4 then
        len = 4
    end if
    for i = 1 to len - 1 do
        -- profile: 3.71%
        val = val * 64 + string[i]  -- shift 6 bits and add
    end for
    -- profile: 2.56%
    return remainder(val, HASH_BUCKETS) + 1
end function

procedure hash_add(sequence string)
-- If string is not in the table already, add it.
    integer hash_val --, found
    sequence bucket

    -- which bucket to search?
    -- profile: 1.71%
    hash_val = hash_routine(string)
    bucket = words[hash_val]

-- TOO slow. This takes way too long.
--    found = 0
--    -- search this bucket for the string
--    for i = 1 to length(bucket) do
--        if compare(string, bucket[i]) = 0 then
--            -- found it
--            found = i
--            exit
--        end if
--    end for
--    if not found then

    -- This is much faster, but still a bottleneck.
    -- profile 7.92%
    if not find(string, bucket) then
        -- profile: 18.82%
        bucket = append(bucket, string)
    end if
    words[hash_val] = bucket
-- profile: 4.45%   What???
end procedure

procedure Get_Dictionary()
    integer dict

    dict = open("SPELL.DAT", "r")
    puts(1, "Loading dictionary...\r")
    if dict = -1 then
        puts(1, "Fatal error -- Main dictionary \"SPELL.DAT\"
missing.\n")
        abort(1)
    end if
    junk = {}
    -- Biiig nono: words = {}
    while 1 do
        -- profile: 54.63%
        junk = gets(dict)
        if atom(junk) then
            exit
        end if
        -- profile 1.95%
        junk = junk[1..length(junk)-1]
        -- profile 1.16%
        hash_add(junk)
        -- Well, that's better than 20% down here:
        -- hash_add(junk[1..length(junk)-1])
    end while
    close(dict)
    return
end procedure

atom a
a = time()

Get_Dictionary()

a = time() - a
printf("\n%.2f seconds.\n", {a})
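
Since each word appears only once in the dictionary (see the note above about
not needing a counter), the find() check can be skipped entirely during the
load. A minimal sketch of that variant, reusing hash_routine() and words from
above (a sketch, not the final routine):

procedure hash_add_fast(sequence string)
-- add the word without checking for duplicates; safe only because
-- SPELL.DAT is known to contain each word exactly once
    integer hash_val
    hash_val = hash_routine(string)
    words[hash_val] = append(words[hash_val], string)
end procedure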



2. Re: Spell checker optimization

Ok, Here's what I would try...
Take advantage of the soundex routines provided by Mr. Sephton.
It only takes a few modifications to make it into a good spelling
dictionary generator that emits a file that can be accessed
in an efficient manner. First, modify create.ex to include the words
in the same file with the soundex codes, e.g.:

A250 AACHEN
A631 AARDVARK
A650 AARON
A100 ABABA
A120 ABACK
A120 ABACUS
A120 ABACUSES
A130 ABAFT
A145 ABALONE
  .
  .
  .
etc

Now, you need to use a sort program to put them in order by their
codes. DOS's sort.exe won't do it because it can't handle the
memory requirements, but I think I recall seeing a Euphoria program
somewhere that can.

A100 ABABA
A120 ABACK
A120 ABACUS
A120 ABACUSES
A130 ABAFT
A145 ABALONE
A250 AACHEN
A631 AARDVARK
A650 AARON

Now, write another program that puts the items into a list format
using the codes only once.

A100 ABABA
A120 ABACK ABACUS ABACUSES
A130 ABAFT
A145 ABALONE
A250 AACHEN
A631 AARDVARK
A650 AARON

You now have the database in the replacement list format needed for
your spell checker. More importantly, you have your codes in
ascending order in the file. This allows you to take the words
that you want to spell-check and convert them into their soundex
equivalents. You can then use a "binary search" algorithm to find
the match very quickly. The advantage of this method is that if
you choose not to load your list into memory, you can search directly
in the file. A binary search will give you the fewest disk accesses
necessary to hit the match. This may well be fast enough with
disk caching turned on.
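
For the merging step mentioned above, a minimal sketch (assuming the sorted
file is called SORTED.DAT, the merged output LIST.DAT, and every line is a
4-character code, a space, and a word -- the file names are made up for the
example):

procedure merge_codes()
    integer fin, fout
    object line
    sequence code, last
    fin = open("SORTED.DAT", "r")
    fout = open("LIST.DAT", "w")
    last = ""
    while 1 do
        line = gets(fin)
        if atom(line) then
            exit
        end if
        line = line[1..length(line)-1]      -- strip the '\n'
        code = line[1..4]                   -- soundex code, e.g. "A120"
        if compare(code, last) = 0 then
            -- same code as the previous line: append the word
            puts(fout, " " & line[6..length(line)])
        else
            -- new code: start a new output line
            if length(last) then
                puts(fout, "\n")
            end if
            puts(fout, line)
            last = code
        end if
    end while
    puts(fout, "\n")
    close(fin)
    close(fout)
end procedure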

Some things you could do to speed it up:
You could turn your hashing algorithm onto the document to be checked,
create a sorted list of the words, eliminating redundancies by making
a pointer-list to locations of each occurrence (this also allows you
to quickly ignore all instances of a word already
deemed correct). You then run through the list and cull out all entries
that are found in the dictionary. Once that is done, you resort
the remaining pointer-list back into order by location (keeps from
jumping all over the document, confusing the user) and query the
user on each spelling mistake.
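
A rough sketch of that pointer-list idea (not code from the thread; doc is
assumed to be a sequence holding the document's words in order, and the inner
scan is where the hashing mentioned above would go in practice):

function unique_words(sequence doc)
-- return a list of {word, {positions}} pairs, one entry per unique word
    sequence list
    integer at
    list = {}
    for i = 1 to length(doc) do
        at = 0
        for j = 1 to length(list) do
            if compare(doc[i], list[j][1]) = 0 then
                at = j
                exit
            end if
        end for
        if at then
            list[at][2] = append(list[at][2], i)   -- another occurrence
        else
            list = append(list, {doc[i], {i}})     -- first occurrence
        end if
    end for
    return list
end function

Culling the entries found in the dictionary and re-sorting the survivors by
their first position then works directly on this list.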

Of course, you may decide the two megs of memory is worth sacrificing
as well, as it would run much faster to check against the list in
memory.

Christopher D. Hickman

ps: One caveat on the file access method; in order to accomplish the
direct access method for it, each record will have to be a fixed width
as large as the longest list of words for a particular code (the
resulting file would be several megs). With the in-memory scheme, this
of course would not be necessary.


3. Re: Spell checker optimization

>You now have the database in the replacement list format needed for
>your spell checker. More importantly, you have your codes in
>ascending order in the file. This allows you to take the words
>that you want to spell-check and convert them into their soundex
>equivalents. You can then use a "binary search" algorithm to find
>the match very quickly. The advantage of this method is that if
>you choose not to load your list into memory, you can search directly
>in the file. A binary search will give you the fewest disk accesses
>necessary to hit the match. This may well be fast enough with
>disk caching turned on.

You could have a binary tree of the first 2 or 3 letters in memory..
26^5 doesn't even take too much memory. Then you could do one large read
from disk, which due to precise disk caching (use Jaques' routines to load
them to an allocated memory block --> much more efficient and searches
faster... and you can say precisely how much you want loaded..
cached.. at once.. should go pretty fast.. one blink of the HD light), then
you can hash-search the words per 2 characters which are in a binary tree.
(Each tree has 26^2 eh.. elements which all have 26^2 elements also.. you
either need to use several files or blocks in a file per word length, or you
need to add a '\n' after each word (which works slower).)

This is fast and memory efficient, but not that easy to code.. I guess..

>Some things you could do to speed it up:
>You could turn your hashing algorithm onto the document to be checked,
>create a sorted list of the words, eliminating redundancies by making
>a pointer-list to locations of each occurrence (this also allows you
>to quickly ignore all instances of a word already
>deemed correct). You then run through the list and cull out all entries
>that are found in the dictionary. Once that is done, you resort
>the remaining pointer-list back into order by location (keeps from
>jumping all over the document, confusing the user) and query the
>user on each spelling mistake.

With of course the option to change all similar.. change all the mistakes
with THIS word...

>Of course, you may decide the two megs of memory is worth sacrificing
>as well, as it would run much faster to check against the list in
>memory.

Two megs.. it will use a lot more than that..
Euphoria uses about 6 bytes for a sequence I believe, but the sequence
length also needs to be saved, and every element takes 4 bytes.
So even in one long sequence it will be about 4 times the file size -- 2 megs..
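
A rough illustration of where those bytes go, taking the figures above at face
value (an estimate, not a measurement):

    512,000 characters x 4 bytes per element        ~ 2 megs for the characters alone
    50,000 words x ~10 bytes of sequence overhead   ~ 500K on top of that
    50,000 bucket references x 4 bytes each         ~ 200K more

so the in-memory dictionary ends up well past the 2 megs estimated in the
first message.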

>ps: One caveat on the file access method; in order to accomplish the
>direct access method for it, each record will have to be a fixed width
>as large as the longest list of words for a particular code (the
>resulting file would be several megs). With the in-memory scheme, this
>of course would not be necessary.

You can have quick indexes in the file too, or split it up into several files
that contain all the words of a specified length.. or have indexed blocks
within one file..

Ralf
nieuwen at xs4all.nl


4. Re: Spell checker optimization

Well, I'm not sure I agree with Mr. Nieuwenhuijsen's idea of the
binary tree implementation, but I'll give him a chance to explain
it more fully before I say it's not a better way of doing it.
Some of his other ideas like using an index file instead of fixed
width records, loading the list in blocks rather than all at once
etc, I do agree with.

I should though point out the differences between a binary search
and a binary tree for anyone not familiar with these two concepts.
Although they are both born out of the same principles, they are
implemented quite differently. Binary searches are far simpler.
They require that the list to be worked upon be in sequential
order (exactly the case in the spell checker problem). The binary
search works by eliminating half of the elements remaining to be
searched with each iteration.

Take for instance the following array, a simple list of letters
from "A" to "L", and say we are looking for the element containing
"D".

 1: A
 2: B
 3: C  <-- 2
 4: D  <-- 3
 5: E
 6: F  <-- 1
 7: G
 8: H
 9: I
10: J
11: K
12: L

At the beginning, it is assumed that the item must lie between
elements 1 to 12 or not be in the list. The search begins at the
center of the list, which contains "F". When "D" is found to be
lesser than "F", the top half of the list 6-12 is considered to
be culled. The next iteration starts in the middle of the remaining
elements 1-5, element 3. The comparison is made and the 1-3 elements
are found to be lesser, now 4 & 5 remain. Half the distance between
4 & 5 is 4.5 (it's usually easiest just to round down) yielding 4
which is a match. This doesn't look very efficient for a small
array, but when dealing with large arrays it's very efficient.
Your spelling dictionary has about 50000 words. With the search you'll
eliminate half the list with each search.
50000, 25000, 12500, 6250, 3125, 1562, 781, 390, 195, 97, 48, 24,
12, 6, 3, 2, 1... yields a match within a maximum of 17 comparisons,
not bad for 50000 elements.
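
A minimal Euphoria sketch of that search (assuming the dictionary lines are
held in a sequence sorted in ascending order; list and key are just example
names, not code from the thread):

function binary_search(sequence list, sequence key)
    integer lo, hi, mid, c
    lo = 1
    hi = length(list)
    while lo <= hi do
        mid = floor((lo + hi) / 2)
        c = compare(key, list[mid])
        if c = 0 then
            return mid          -- found: index of the match
        elsif c < 0 then
            hi = mid - 1        -- key is lesser: drop the top half
        else
            lo = mid + 1        -- key is greater: drop the bottom half
        end if
    end while
    return 0                    -- not in the list
end function

For the file-based variant described in message 2, the same halving applies,
with seeks into the fixed-width records in place of the subscripts.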

*Binary trees* are used generally when you want to be able to make
changes to the list you are working on. They can be used quite
effectively for situations where you need to access and update
records, add and delete items etc. The binary tree consists of nodes
and branches. The first item entered is considered the "root" node.
When more items are added, they are compared against the root node
and added to the left (lesser) or right (greater) branch of that node.
If a node already exists on that branch, the same process is repeated
for that node as was on the root node. This carries on until an unused
branch is found to place the item onto. Binary Trees are NOT intended
to be fed items in sequential order. In fact it's the worst case
scenario because it loads all items to the right (ascending) or left
(descending) main branch, yielding the least efficient access (it's
virtually the same as a linked list in this case). Therefore, binary
trees are intended to be used when the list entered has a basically
random distribution, and even then require periodic "balancing" to
random distribution, and even then require periodic "balancing" to
maintain efficiency.
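
A minimal sketch of that insertion walk, using an index-based node store
rather than nested sequences (the names are made up; an illustration, not
code from the thread):

sequence tree               -- each node is {word, left, right}; 0 = no branch
tree = {}

function tree_add(sequence word)
-- walk from the root and hang the new word on the first unused branch
    integer n, c
    if length(tree) = 0 then
        tree = append(tree, {word, 0, 0})   -- first word becomes the root
        return 1
    end if
    n = 1
    while 1 do
        c = compare(word, tree[n][1])
        if c = 0 then
            return n                        -- already in the tree
        elsif c < 0 then
            if tree[n][2] = 0 then
                tree = append(tree, {word, 0, 0})
                tree[n][2] = length(tree)   -- new node on the lesser branch
                return length(tree)
            end if
            n = tree[n][2]                  -- keep walking left
        else
            if tree[n][3] = 0 then
                tree = append(tree, {word, 0, 0})
                tree[n][3] = length(tree)   -- new node on the greater branch
                return length(tree)
            end if
            n = tree[n][3]                  -- keep walking right
        end if
    end while
    return 0                                -- never reached
end function

Lookup walks the same way and stops when it hits a 0 branch; as noted above,
feeding it an already-sorted word list degenerates it into a linked list.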

I hope this clears it up some.
Christopher D. Hickman


5. Re: Spell checker optimization

At 11:00 AM 3/25/98 -0600, you wrote:
>Well, I'm not sure I agree with Mr. Nieuwenhuijsen's idea of the
>binary tree implementation, but I'll give him a chance to explain
>it more fully before I say it's not a better way of doing it.

Thanks for the detailed explanations. Although I knew the concepts, the
presentation was succinct and thorough.

In fact, it was those explanations that led me to think a Binary Tree
might be the best for this implementation.  Besides, what good is a
dictionary to which you cannot add words?  I think the b2-tree (balanced
binary tree) concept works beautifully.  In fact, the file can be stored as
a b2-tree (with pointers) so as to not have to rebuild the tree at each load.

Although, I often use binary searches for the very reasons mentioned.
(EASY) In fact, many times I will even use binary searches in un-sorted,
linked lists or files.  Although I have to parse through pointers, key
comparisons are kept at a minimum.



--[ Joe Phillips, Systems Analyst
--[ Texas Wesleyan University     817-531-4444
--[
--[  "The tongue of the just is as choice silver:
--[   the heart of the wicked is little worth."  Proverbs 10:20


6. Re: Spell checker optimization

I meant a sequence with 27 entries, which all have 27 entries, etc.
For example:
The word apple
can be looked up: ['a']['p']['p']['l']['e']

Only remember to subtract 'a' and add 1 to use a more efficient tree.
The 27th element is the end-of-word element: if it contains a true value
then the word ends there; if it contains a false value then it means there
is no word yet.
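
A minimal sketch of that structure in memory (nested sequences; only a-z is
handled here, so words like "i'd" would need the apostrophe mapped somewhere
first -- an assumption for the example, not code from the thread):

function trie_add(object node, sequence word)
-- each node has 27 slots: 1..26 hold child nodes (or 0), 27 is the
-- end-of-word flag
    integer c
    if atom(node) then
        node = repeat(0, 27)
    end if
    if length(word) = 0 then
        node[27] = 1                        -- a word ends here
    else
        c = word[1] - 'a' + 1
        node[c] = trie_add(node[c], word[2..length(word)])
    end if
    return node
end function

function trie_has(sequence root, sequence word)
    object node
    integer c
    node = root
    for i = 1 to length(word) do
        c = word[i] - 'a' + 1
        if atom(node[c]) then
            return 0                        -- no branch: no such word
        end if
        node = node[c]
    end for
    return node[27]                         -- the 27th element decides
end function

object root
root = repeat(0, 27)
root = trie_add(root, "apple")
-- trie_has(root, "apple") is now 1, trie_has(root, "appl") is 0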

Because Euphoria's sequences can handle this in memory, but not in a file,
you need indexes.
The file would start with a record of 27 fields. All fields contain an index
into the file (maybe a relative index works nicer).
Then it goes there, reads the next record of fields.
If it has no more characters to look up, it checks the 27th field and sees if
it's false or not. If not, the word is correct; otherwise it isn't.
This would mean that you only need to read from disk as many times as the
word's length.
Although you could have the first couple of levels in memory (in one
sequence).
As many levels as it takes to make the list of remaining words less than
1000 or 250 or something.
Then you can read those 1000 words in at once (one HD light blink if you use
Jaques Deschenes' DOS i/o routines, so you can have it put in a preallocated
place in one stroke).
And then you can binary search the list of words of course.

The binary tree does however take up more memory than a normal list, but it
makes lookups much faster.
You can also of course use hashing for every two or three characters and put
that in a similar tree...

Ralf


7. Re: Spell checker optimization

Ralf Nieuwenhuijsen wrote:
>
> I meant a sequence with 27 entries, which all have 27 entries, etc.
> For example:
> The word apple
> can be looked up: ['a']['p']['p']['l']['e']
>
<Snip>

I understand what you're saying now. Remember though that what I
was basically proposing was that Mr. Pilkington store the soundex
codes as the key for his database. These are represented as a
letter followed by a 3 digit value, not as the word itself. By
comparing the soundex code generated by the word being checked
against the already existing keys in the database, you are able
to obtain a list of alternate choices for any word not spelled
correctly. Using a search algorithm based on the words
themselves would re-incur the redundancy that the soundex
algorithm had eliminated.

Choosing the algorithm with which to access it quickly is still
in question though. Although Mr. Phillips pointed out that
it would be very useful to be able to add words to the
dictionary, I still am not sold on the binary tree
idea. I think the volatility of the database would be
too low to warrant the more complex implementation. The user
would have to add 500 words to affect the 50000 word database
by 1 percent... something which is more likely to occur over
a great number of sessions than in one sitting. In a situation
with volatility this low, I'd opt to go with the faster (and
simpler) binary search algorithm, and use an alternate
(even if slower than binary trees) method when adding items to
the database since the operation is "relatively" infrequent.

Also, he could very well choose to stick with the hashing
algorithm if it suits his needs. Junko Miura has a spell checker
listed in Euphoria's archive page that uses some variation of
a hashing algorithm. It seems to run at a decent speed on my
computer, although I only used it on a short document I had
handy. I barely glanced at the code so I don't know the
specifics on what it's doing, but it's probably worth
investigating if he hasn't done so already.

Christopher D. Hickman


8. Re: Spell checker optimization

>I meant a sequence with 27 entries, which all have 27 entries, etc.
>For example:
>The word apple can be looked up: ['a']['p']['p']['l']['e']
>
>Only remember to subtract 'a' and add 1 to use a more efficient tree.
>The 27th element is the end-of-word element: if it contains a true
>value then the word ends there; if it contains a false value then it
>means there is no word yet.

>Because Euphoria's sequences can handle this in memory, but not in a
>file, you need indexes.
>The file would start with a record of 27 fields. All fields contain an
>index into the file (maybe a relative index works nicer).
>Then it goes there, reads the next record of fields.
>If it has no more characters to look up, it checks the 27th field and
>sees if it's false or not. If not, the word is correct; otherwise it isn't.

>This would mean that you only need to read from disk as many times as
>the word's length.
>Although you could have the first couple of levels in memory (in one
>sequence).
>As many levels as it takes to make the list of remaining words less
>than 1000 or 250 or something.
>Then you can read those 1000 words in at once (one HD light blink if you
>use Jaques Deschenes' DOS i/o routines, so you can have it put in a
>preallocated place in one stroke).
>And then you can binary search the list of words of course.
>
>The binary tree does however take up more memory than a normal list,
>but it makes lookups much faster.
>You can also of course use hashing for every two or three characters
>and put that in a similar tree...


Let's see if I understand you right:

The dictionary contains a list with indexes to the rest of the list:

    A      B      C      D      E  . . . . *
    |      |      |      |      |
  A B C  A B C  A B C  A B C  A B C
  | | |  | | |  | | |  | | |  | | |

etc

So to look up words: (byte locations are made up and * is the 27th
location)
Let's say AD:
      A (index to byte 12300)
      D (index to byte 23000)
      * (true, this word exists)

And to look up ADD:
      A (index to byte 12300)
      D (index to byte 23000)
      D (index to byte 30040)
      * (true, this word exists)

Looking up AE:
      A (index to byte 12300)
      E (index to byte 24000)
      * (false, this word doesn't exist)
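
A minimal sketch of that walk against a file laid out as 27 four-byte fields
per record (TREE.DAT, the little-endian field format, and both routine names
are made up for the example; seek() comes from file.e):

include file.e                  -- for seek()

integer dict_fn
dict_fn = open("TREE.DAT", "rb")    -- (error check omitted in this sketch)

function read_field(atom rec, integer field)
-- read one 4-byte index from the record starting at byte offset rec
    atom val
    if seek(dict_fn, rec + (field - 1) * 4) then
        return -1               -- seek failed
    end if
    val = 0
    for i = 0 to 3 do
        val = val + getc(dict_fn) * power(256, i)
    end for
    return val
end function

function word_exists(sequence word)
-- follow one index per letter, then test the 27th field of the last record
    atom rec, nxt
    rec = 0                     -- the root record sits at the start of the file
    for i = 1 to length(word) do
        nxt = read_field(rec, word[i] - 'a' + 1)
        if nxt <= 0 then
            return 0            -- an index of 0 means no words down this branch
        end if
        rec = nxt
    end for
    return read_field(rec, 27)  -- true if a word ends exactly here
end function

That is one seek and a few byte reads per letter, matching the "as many reads
as the word's length" figure above.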


This sounds good, and loading up 3 or 4 levels into memory doesn't sound
too bad... I can keep a lot of words in memory like that. Just look at
this paragraph and see how many words are more than four letters.

But the problem: When checking for replacements, I took David Cuny's
advice and this checks a whole bunch of times:

((26 * number of letters) * 2) + (number of letters * 4)

6 different checks are done, and two check each letter of the alphabet a
different way (replacement and insertation). Each word needs to be
checked for existance. For a 5 letter word (no longer in memory, need to
look in dictionary), 280 checks need to be done. 336 checks for 6 a six
letter word. That would be 280 hard drive loads (Once for each letter not
in memory).

(BTW, 280 checks is real fast with find() and the hash table, is
'instant' on the 486 fast enough?)
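
For reference, a rough sketch of the two per-letter passes being counted here
(the replacement and insertion checks); has_word() just wraps the hash table
from the first message, and none of this is the actual Get_Replace() code:

function has_word(sequence w)
-- look the word up in the hash table built by Get_Dictionary()
    return find(w, words[hash_routine(w)])
end function

function try_letters(sequence word)
-- try every single-letter replacement and insertion: 26 * length(word)
-- checks per pass, matching the (26 * number of letters) * 2 term above
    sequence found, trial
    found = {}
    for pos = 1 to length(word) do
        for ch = 'a' to 'z' do
            trial = word
            trial[pos] = ch                 -- replace one letter
            if has_word(trial) then
                found = append(found, trial)
            end if
            -- insert one letter just before this position
            trial = word[1..pos-1] & ch & word[pos..length(word)]
            if has_word(trial) then
                found = append(found, trial)
            end if
        end for
    end for
    return found
end function

The remaining checks counted in the (number of letters * 4) term (deletions,
transpositions, two-word splits and so on) walk the word the same way but
without the inner alphabet loop.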

I'm thinking about how I could use this information as I'm typing this so
don't worry if I answer the problems I mention.....

Another thing is, the index would be 0 if there are no more words in this
direction, right? I assume so. (i.e. searching for XXQ, the second X would
have an index of 0, telling the search that there aren't any words that
have XX as the first two letters.) The question here is, would this make
the dictionary filesize bigger, stay about the same, or get smaller? If it
makes it larger then I don't know if it would be worth the extra speed.
(I got the load routine down to 15 seconds, with 14 seconds the second time
with SmartDrv [wow], and it's only one second on the PII-233.. with the
old load algorithm. The new one assumes there aren't any duplicate words.
The old one checked anyway. I left the check commented so it's easy to
check if any crept in.) If the dictionary filesize is smaller OTOH....



9. Re: Spell checker optimization

You mentioned at the beginning about the ACTUAL comparison routine, to pick
the best word.
I wrote a QBASIC fuzzy search routine a while ago which would do just the job.
It is already pretty well optimised for typical mis-spellings (letters in
wrong order like hlelo or missed out letters) and a little extra code to check
for keyboard typos (s = a or d, k = j or l)
I'll convert it using David Cuny's ebasic, and then tidy up if you think it
might be useful. I never bothered to turn it into a spell checker because QB
was SO SLOW - I mean I even had to write my own sort rountine...

Daniel


10. Re: Spell checker optimization

>You mentioned at the beginning about the ACTUAL comparison routine, to
>pick the best word.

This thread has gotten a little big.... And without quoting... Where
exactly are you referring to? (I've finished the comparison and
replacement word routines, I just need to speedup the dictionary load and
perhaps make the dictionary filesize smaller)

>I wrote a QBASIC fuzzy search routine a while ago which would do just
>the job.
>It is already pretty well optimised for typical mis-spellings (letters
>in wrong order like hlelo or missed out letters) and a little extra code
>to check for keyboard typos (s = a or d, k = j or l)

Mine does this, but a little more generic. (With the detail David Cuny
went into on the message he sent me, I figure he could use the
information to create a spell checker himself if he had more time....) It
checks for missed out letters and wrong order, as well as extra letters,
single letter typos and words that should be two words. It could be much
more optimized, but like I said, on the 486 it's response time for the
check is 'instant'. I don't notice any delay. (Except that 16 second one
while the 1/2 meg dictionary is loading....)

>I'll convert it using David Cuny's ebasic, and then tidy up if you
>think it might be useful. I never bothered to turn it into a spell checker
>because QB was SO SLOW - I mean I even had to write my own sort rountine...

The replacement routine I have now works great. Later on I might add
soundex encoding to replace words that are too mangled for the current
checks. (Mostly big words that sound the same to soundex but have too
many messed up letters, but I've only had 2 words so far for which I didn't
get the correct replacement in the list.)

>own sort rountine...
          ^^^^^^^^
My spell checker would catch that and give the correct replacement,
assuming of course, that "routine" is in the dictionary... :)

