Re: Spell checker optimization

new topic     » goto parent     » topic index » view thread      » older message » newer message

Well, I'm not sure I agree with Mr. Nieuwenhuijsen's idea of the
binary tree implementation, but I'll give him a chance to explain
it more fully before I say it's not a better way of doing it.
Some of his other ideas like using an index file instead of fixed
width records, loading the list in blocks rather than all at once
etc, I do agree with.

I should though point out the differences between a binary search
and a binary tree for anyone not familiar with these two concepts.
Although they are both born out of the same pricipals, they are
implemented quite differently. Binary searchs are far simpler.
They require that the list to be worked upon be in sequential
order (exaclty the case in the spell checker problem). The binary
search works by eliminating half of the elements remaining to be
searched with each iteration.

Take for instance the following array, a simple list of letters
from "A" to "J", and say we are looking for the element conatining
"D".

 1: A
 2: B
 3: C  <-- 2
 4: D  <-- 3
 5: E
 6: F  <-- 1
 7: G
 8: H
 9: I
10: J
11: K
12: J

At the beginning, it is assumed that the item must lie between
elements 1 to 12 or not be in the list. The search begins at the
center of the list, which contains "F". When "D" is found to be
lesser than "F", the top half of the list 6-12 is considered to
be culled. The next iteration starts in the middle of the remaining
elements 1-5, element 3. The comparison is made and the 1-3 elements
are found to be lesser, now 4 & 5 remain. Half the distance between
4 & 5 is 4.5 (it's usually easiest just to round down) yielding 4
which is a match. This doesn't look very efficient for a small
array, but when dealing with large arrays it's very efficient.
Your spelling dictionary has about 50000 words. With the search you'll
eliminate half the list with each search.
50000, 25000, 12500, 6250, 3125, 1562, 781, 390, 195, 97, 48, 24,
12, 6, 3, 2, 1... yields a match within a maximum of 17 comparisons,
not bad for 50000 elements.

*Binary trees* are used generally when you want to be abled to make
changes to the list you are working on. They can be used quite
effectively for situations where you need to access and update
records, add and delete items etc. The binary tree consists of nodes
and branches. The first item entered is considered the "root" node.
When more items are added, they are compared against the root node
and added to the left (lesser) or right (greater) branch of that node.
If a node already exists on that branch, the the same process is
repeated
for that node as was on the root node. This carries on until an unused
branch is found to place the item onto. Binary Trees are NOT intended
to be fed items in sequential order. In fact it's the worst case
scenario because it loads all items to the right (ascending) or left
(descending) main branch, yielding the least efficient access (it's
virtually the same as a linked list in this case). Therefore, binary
tree's are intended to be used when the list entered has a basically
random distribution, and even then require periodic "balancing" to
maintain efficiency.

I hope this clears it up some.
Christopher D. Hickman

new topic     » goto parent     » topic index » view thread      » older message » newer message

Search



Quick Links

User menu

Not signed in.

Misc Menu