1. RE: Hash Table Lookup
- Posted by "C. K. Lester" <cklester at yahoo.com> Mar 05, 2002
- 357 views
Derek Parnell wrote: > Using Euman's earlier code... You mean his earlier earlier code or his later earlier code? :) > lWord = upper("buffalo") > hashed = EumsHash(lWord) > len_text = length(lWord) > k = lData[1] - 64 It gets stuck here for me... lData hasn't been defined. I'm trying to understand his code posted under "EumsHash( ) Update". Maybe he took out the lData variable from that version...? Derek, I noticed you had a hash library in the EUPHORIA archives. Are you using that code in your program?
2. Re: RE: Hash Table Lookup
- Posted by Derek Parnell <ddparnell at bigpond.com> Mar 05, 2002
- 350 views
6/03/2002 3:21:29 PM, "C. K. Lester" <cklester at yahoo.com> wrote: > >Derek Parnell wrote: >> Using Euman's earlier code... > >You mean his earlier earlier code or his later earlier code? :) > >> lWord = upper("buffalo") >> hashed = EumsHash(lWord) >> len_text = length(lWord) >> k = lData[1] - 64 > >It gets stuck here for me... lData hasn't been defined. Ooops! That's what I get for cut&pasting without double checking! That "lData" should be "lWord". What it is trying to do is get the first character of the word and convert it to an index. It does this by subtracting 64 which is the ascii value of 'A' less one. >I'm trying to understand his code posted under "EumsHash( ) Update". >Maybe he took out the lData variable from that version...? > >Derek, I noticed you had a hash library in the EUPHORIA archives. Are >you using that code in your program? Thank you for noticing. I'm not using that library because that is a general purpose hashing library and for the competition code I'm using a highly optimised hashing algorithm. The one I'm using is nothing like Euman's either! Also, I'm not hashing the words either, but something else. The one in the archives is better suited to complex tasks such as symbol table management for compilers etc... On the other hand, maybe I should go back and see if I can optimise it some more --------- Cheers, Derek Parnell ICQ# 7647806
3. RE: Hash Table Lookup
- Posted by bensler at mail.com Mar 05, 2002
- 347 views
I think lData[1] was a typo. try lWord[1] (the first letter of the word) This is the hashtable structure: repeat(repeat({},26),26) The first dimension is the first letter of the words in the databank. This breaks the dictionary into 26 buckets. IE. hashtable[1] ('A'-64 = 1) would contain the key for "AARVARK" somewhere in that bucket of buckets. The 2nd dim is the word lengths. Using words.txt, that breaks each of the 1st buckets into 22. IE. hashtable[1][7] ( ['A'-64=1][length("AARDVARK")] ) would contain the key for "AARDVARK" somewhere in THAT bucket. The 3rd dim is the hashkeys themselves. if find(EumsHash("AARVARK"),hashtable['A'-64][length("AARDVARK")]) then the word exists in the dictionary. Chris C. K. Lester wrote: > Derek Parnell wrote: > > Using Euman's earlier code... > > You mean his earlier earlier code or his later earlier code? :) > > > lWord = upper("buffalo") > > hashed = EumsHash(lWord) > > len_text = length(lWord) > > k = lData[1] - 64 > > It gets stuck here for me... lData hasn't been defined. > > I'm trying to understand his code posted under "EumsHash( ) Update". > Maybe he took out the lData variable from that version...? > > Derek, I noticed you had a hash library in the EUPHORIA archives. Are > you using that code in your program? > >
4. RE: Hash Table Lookup
- Posted by "C. K. Lester" <cklester at yahoo.com> Mar 05, 2002
- 343 views
How degraded is the efficiency if you only make a hash table based on length of the text? For instance, hash_table = { { "a" , "I" }, { "ad" , "ah" , "am" , "an" , "as" , ... }... } I hope I'm using the right terminology! That would be a hash table, right? I can't seem to build a length-based table as quickly as the EumsBIGDHash gets built! Why is that (besides the fact that I really don't know what I'm doin' just yet).
5. RE: Hash Table Lookup
- Posted by "C. K. Lester" <cklester at yahoo.com> Mar 05, 2002
- 348 views
bensler at mail.com wrote: > > The 3rd dim is the hashkeys themselves. > if find(EumsHash("AARVARK"),hashtable['A'-64][length("AARDVARK")]) then > the word exists in the dictionary. > > > Chris Chris, thanks for the explanation! That helps a lot. Euman and Derek, thanks to you also for your willingness to teach!
6. Re: RE: Hash Table Lookup
- Posted by Derek Parnell <ddparnell at bigpond.com> Mar 05, 2002
- 375 views
The basic idea behind hash tables is to sort a large set of items into different "buckets", so that later on, you can find something quickly by working out which bucket its in. The trick is to try and limit the number of items you put into any given bucket. If we have a dictionary of some 50,000+ words and only want about 100 words per bucket, then we need at least 500 buckets. If you want less in each bucket, you need more buckets to play with. But how do you know which bucket to put a word into? Traditionally, people use the letters in the word to calculate a numeric index - ie, the number of the bucket to use. You could simply try to add up the values of the letters and use the result as an index. If the total exceeds the number of buckets you have, then divide the bucket count in to the total and use the remainder (plus one). However, with our English text, this doesn't tend to generate big enough numbers to spread the words around evenly. Below is another way that you can use to get big numbers from the letters in a word. function WordHash( sequence text) integer x x = text[1] + #FFFF for i = 2 to length(text) do x += text[i] * i * text[i-1] end for return x end function This method is fairly fast and still generates big-ish numbers. The bigger the number, the more chance you have of it landing it an empty bucket. Another hint is to use prime-numbers for the bucket count. 6/03/2002 4:06:28 PM, "C. K. Lester" <cklester at yahoo.com> wrote: > >How degraded is the efficiency if you only make a hash table based on >length of the text? For instance, > >hash_table = { > { "a" , "I" }, > { "ad" , "ah" , "am" , "an" , "as" , ... }... > } > >I hope I'm using the right terminology! That would be a hash table, >right? > >I can't seem to build a length-based table as quickly as the >EumsBIGDHash gets built! Why is that (besides the fact that I really >don't know what I'm doin' just yet). > > > > --------- Cheers, Derek Parnell ICQ# 7647806
7. Re: RE: Hash Table Lookup
- Posted by Kat <gertie at PELL.NET> Mar 05, 2002
- 357 views
On 6 Mar 2002, at 17:27, Derek Parnell wrote: > > The basic idea behind hash tables is to sort a large set of items into > different > "buckets", so that later on, you can find something quickly by working out > which > bucket its in. The trick is to try and limit the number of items you put into > any given bucket. If we have a dictionary of some 50,000+ words and only want > about 100 words per bucket, then we need at least 500 buckets. If you want > less > in each bucket, you need more buckets to play with. <snip> For non-spell checkers, how is this faster than a binary lookup with some optimisations? Kat
8. Re: RE: Hash Table Lookup
- Posted by Derek Parnell <ddparnell at bigpond.com> Mar 06, 2002
- 352 views
----- Original Message ----- From: "Kat" <gertie at PELL.NET> To: "EUforum" <EUforum at topica.com> Subject: Re: RE: Hash Table Lookup > > For non-spell checkers, how is this faster than a binary lookup with some > optimisations? > A binary search, at worst needs N-log2 compares to find the item, and it only works if the dictionary is sorted. This means that it is not the best solution for extremely large sets of symbols that are continually being added to (in non-sorted order). Such as compiler symbol tables. With a hash table, you can add symbols in any order and still be assured that lookups will be fast. In short, hash tables are best suited for dynamic lists of any size, and binary searches for static, ordered lists of about 30+ symbols. ---------- Derek.