1. RE: Hash Table Lookup

Derek Parnell wrote:
> Using Euman's earlier code...

You mean his earlier earlier code or his later earlier code? :)

>     lWord = upper("buffalo")
>     hashed = EumsHash(lWord)
>     len_text = length(lWord)
>     k = lData[1] - 64

It gets stuck here for me... lData hasn't been defined.

I'm trying to understand his code posted under "EumsHash( ) Update". 
Maybe he took out the lData variable from that version...?

Derek, I noticed you had a hash library in the EUPHORIA archives. Are 
you using that code in your program?

new topic     » topic index » view message » categorize

2. Re: RE: Hash Table Lookup

6/03/2002 3:21:29 PM, "C. K. Lester" <cklester at yahoo.com> wrote:

>
>Derek Parnell wrote:
>> Using Euman's earlier code...
>
>You mean his earlier earlier code or his later earlier code? :)
>
>>     lWord = upper("buffalo")
>>     hashed = EumsHash(lWord)
>>     len_text = length(lWord)
>>     k = lData[1] - 64
>
>It gets stuck here for me... lData hasn't been defined.

Ooops! That's what I get for cut&pasting without double checking!
That "lData" should be "lWord". What it is trying to do is get the first
character of the word and
convert it to an index. It does this by subtracting 64 which is the ascii value
of 'A' less one.

>I'm trying to understand his code posted under "EumsHash( ) Update". 
>Maybe he took out the lData variable from that version...?
>
>Derek, I noticed you had a hash library in the EUPHORIA archives. Are 
>you using that code in your program?

Thank you for noticing. I'm not using that library because that is a general
purpose hashing library
and for the competition code I'm using a highly optimised hashing algorithm. The
one I'm using is
nothing like Euman's either! Also, I'm not hashing the words either, but
something else.

The one in the archives is better suited to complex tasks such as symbol table
management for
compilers etc... On the other hand, maybe I should go back and see if I can
optimise it some more
blink

---------
Cheers,
Derek Parnell 
ICQ# 7647806

new topic     » goto parent     » topic index » view message » categorize

3. RE: Hash Table Lookup

I think lData[1] was a typo. try lWord[1] (the first letter of the word)

This is the hashtable structure: repeat(repeat({},26),26)

The first dimension is the first letter of the words in the databank. 
This breaks the dictionary into 26 buckets.

IE. hashtable[1] ('A'-64 = 1)
  would contain the key for "AARVARK" somewhere in that bucket of 
buckets.

The 2nd dim is the word lengths. Using words.txt, that breaks each of 
the 1st buckets into 22.

IE. hashtable[1][7] ( ['A'-64=1][length("AARDVARK")] )
  would contain the key for "AARDVARK" somewhere in THAT bucket.

The 3rd dim is the hashkeys themselves.
if find(EumsHash("AARVARK"),hashtable['A'-64][length("AARDVARK")]) then 
the word exists in the dictionary.


Chris

C. K. Lester wrote:
> Derek Parnell wrote:
> > Using Euman's earlier code...
> 
> You mean his earlier earlier code or his later earlier code? :)
> 
> >     lWord = upper("buffalo")
> >     hashed = EumsHash(lWord)
> >     len_text = length(lWord)
> >     k = lData[1] - 64
> 
> It gets stuck here for me... lData hasn't been defined.
> 
> I'm trying to understand his code posted under "EumsHash( ) Update". 
> Maybe he took out the lData variable from that version...?
> 
> Derek, I noticed you had a hash library in the EUPHORIA archives. Are 
> you using that code in your program?
> 
>

new topic     » goto parent     » topic index » view message » categorize

4. RE: Hash Table Lookup

How degraded is the efficiency if you only make a hash table based on 
length of the text? For instance,

hash_table = {
              { "a" , "I" },
              { "ad" , "ah" , "am" , "an" , "as" , ... }...
             }

I hope I'm using the right terminology! That would be a hash table, 
right?

I can't seem to build a length-based table as quickly as the 
EumsBIGDHash gets built! Why is that (besides the fact that I really 
don't know what I'm doin' just yet).

new topic     » goto parent     » topic index » view message » categorize

5. RE: Hash Table Lookup

bensler at mail.com wrote:
> 
> The 3rd dim is the hashkeys themselves.
> if find(EumsHash("AARVARK"),hashtable['A'-64][length("AARDVARK")]) then 
> the word exists in the dictionary.
> 
> 
> Chris

Chris, thanks for the explanation! That helps a lot.

Euman and Derek, thanks to you also for your willingness to teach!

new topic     » goto parent     » topic index » view message » categorize

6. Re: RE: Hash Table Lookup

The basic idea behind hash tables is to sort a large set of items into different
"buckets", so that
later on, you can find something quickly by working out which bucket its in. The
trick is to try and
limit the number of items you put into any given bucket. If we have a dictionary
of some 50,000+
words and only want about 100 words per bucket, then we need at least 500
buckets. If you want less
in each bucket, you need more buckets to play with. But how do you know which
bucket to put a word
into? Traditionally, people use the letters in the word to calculate a numeric
index - ie, the
number of the bucket to use. You could simply try to add up the values of the
letters and use the
result as an index. If the total exceeds the number of buckets you have, then
divide the bucket
count in to the total and use the remainder (plus one). 

However, with our English text, this doesn't tend to generate big enough numbers
to spread the words
around evenly. Below is another way that you can use to get big numbers from the
letters in a word.


function WordHash( sequence text)
    integer x
    x = text[1] + #FFFF
    for i = 2 to length(text) do
        x += text[i] * i *  text[i-1]
    end for
    return x
end function

This method is fairly fast and still generates big-ish numbers. The bigger the
number, the more
chance you have of it landing it an empty bucket. Another hint is to use
prime-numbers for the
bucket count.

6/03/2002 4:06:28 PM, "C. K. Lester" <cklester at yahoo.com> wrote:

>
>How degraded is the efficiency if you only make a hash table based on 
>length of the text? For instance,
>
>hash_table = {
>              { "a" , "I" },
>              { "ad" , "ah" , "am" , "an" , "as" , ... }...
>             }
>
>I hope I'm using the right terminology! That would be a hash table, 
>right?
>
>I can't seem to build a length-based table as quickly as the 
>EumsBIGDHash gets built! Why is that (besides the fact that I really 
>don't know what I'm doin' just yet).
>
>
>
>
---------
Cheers,
Derek Parnell 
ICQ# 7647806

new topic     » goto parent     » topic index » view message » categorize

7. Re: RE: Hash Table Lookup

On 6 Mar 2002, at 17:27, Derek Parnell wrote:

> 
> The basic idea behind hash tables is to sort a large set of items into
> different
> "buckets", so that later on, you can find something quickly by working out
> which
> bucket its in. The trick is to try and limit the number of items you put into
> any given bucket. If we have a dictionary of some 50,000+ words and only want
> about 100 words per bucket, then we need at least 500 buckets. If you want
> less
> in each bucket, you need more buckets to play with. 

<snip>

For non-spell checkers, how is this faster than a binary lookup with some 
optimisations?

Kat

new topic     » goto parent     » topic index » view message » categorize

8. Re: RE: Hash Table Lookup

----- Original Message -----
From: "Kat" <gertie at PELL.NET>
To: "EUforum" <EUforum at topica.com>
Subject: Re: RE: Hash Table Lookup


>
> For non-spell checkers, how is this faster than a binary lookup with some
> optimisations?
>

A binary search, at worst needs N-log2 compares to find the item, and it
only works if the dictionary is sorted. This means that it is not the best
solution for extremely large sets of symbols that are continually being
added to (in non-sorted order). Such as compiler symbol tables.

With a hash table, you can add symbols in any order and still be assured
that lookups will be fast.

In short, hash tables are best suited for dynamic lists of any size, and
binary searches for static, ordered lists of about 30+ symbols.
----------
Derek.

new topic     » goto parent     » topic index » view message » categorize

Search



Quick Links

User menu

Not signed in.

Misc Menu