Re: Spell checker optimization
- Posted by MAP <ddhinc at ALA.NET> Mar 25, 1998
- 799 views
Well, I'm not sure I agree with Mr. Nieuwenhuijsen's idea of the binary tree implementation, but I'll give him a chance to explain it more fully before I say it's not a better way of doing it. Some of his other ideas like using an index file instead of fixed width records, loading the list in blocks rather than all at once etc, I do agree with. I should though point out the differences between a binary search and a binary tree for anyone not familiar with these two concepts. Although they are both born out of the same pricipals, they are implemented quite differently. Binary searchs are far simpler. They require that the list to be worked upon be in sequential order (exaclty the case in the spell checker problem). The binary search works by eliminating half of the elements remaining to be searched with each iteration. Take for instance the following array, a simple list of letters from "A" to "J", and say we are looking for the element conatining "D". 1: A 2: B 3: C <-- 2 4: D <-- 3 5: E 6: F <-- 1 7: G 8: H 9: I 10: J 11: K 12: J At the beginning, it is assumed that the item must lie between elements 1 to 12 or not be in the list. The search begins at the center of the list, which contains "F". When "D" is found to be lesser than "F", the top half of the list 6-12 is considered to be culled. The next iteration starts in the middle of the remaining elements 1-5, element 3. The comparison is made and the 1-3 elements are found to be lesser, now 4 & 5 remain. Half the distance between 4 & 5 is 4.5 (it's usually easiest just to round down) yielding 4 which is a match. This doesn't look very efficient for a small array, but when dealing with large arrays it's very efficient. Your spelling dictionary has about 50000 words. With the search you'll eliminate half the list with each search. 50000, 25000, 12500, 6250, 3125, 1562, 781, 390, 195, 97, 48, 24, 12, 6, 3, 2, 1... yields a match within a maximum of 17 comparisons, not bad for 50000 elements. *Binary trees* are used generally when you want to be abled to make changes to the list you are working on. They can be used quite effectively for situations where you need to access and update records, add and delete items etc. The binary tree consists of nodes and branches. The first item entered is considered the "root" node. When more items are added, they are compared against the root node and added to the left (lesser) or right (greater) branch of that node. If a node already exists on that branch, the the same process is repeated for that node as was on the root node. This carries on until an unused branch is found to place the item onto. Binary Trees are NOT intended to be fed items in sequential order. In fact it's the worst case scenario because it loads all items to the right (ascending) or left (descending) main branch, yielding the least efficient access (it's virtually the same as a linked list in this case). Therefore, binary tree's are intended to be used when the list entered has a basically random distribution, and even then require periodic "balancing" to maintain efficiency. I hope this clears it up some. Christopher D. Hickman