1. OT: hashing strings
- Posted by useless Aug 05, 2009
- 1300 views
What's your opinion of the fastest hash of 10 million unique strings, each 50 to 400 characters long, hashing down to a 32-bit (or smaller) positive integer, preferably with unique hashes (or one result per bucket)? Preferably using no math.
The strings might take the form of:
M:\KB\Military\submarines\I-401\I-401 Submarine Found off of Barbers Point_files\I401_DJ_01_sm.jpg
M:\KB\animals\otter\(2-19-2001) California Sea Otters Headed Back to Endangered Species List; Sudden Rise in Death Rates Traced to Sewage-related Parasitic, Bacterial Diseases.htm
M:\KB\Music\Bands\J\Janis JoplinJanis Joplin Ball and Chain ( in album Monterey International Pop Festival ).txt
M:\KB\Plants\blah.garden.org\guides\bf\b\Stylommatophora\Helicidae\Candidula\intersecta\genus\1\3467345\index.html
C:\kb-temp\history\china\Literature\Historiography\huangqingkaiguofanglve.html
They also might not be ASCII or UTF-8 strings. They may be single tokens, like AADZKWAA8KOAAP (used in win32lib_0_70_4a\Demo\lvcol.exw).
useless
2. Re: OT: hashing strings
- Posted by mattlewis (admin) Aug 05, 2009
- 1249 views
What's your opinion of the fastest hash of 10 million unique strings, each 50 to 400 characters long, hashing down to a 32-bit (or smaller) positive integer, preferably with unique hashes (or one result per bucket)? Preferably using no math.
I have the most misgivings about "using no math." What, exactly, do you mean by that? There is a built-in hash function in eu 4.0. Of course, without advance knowledge of all the strings, it's impossible to guarantee perfect hashing.
Matt
3. Re: OT: hashing strings
- Posted by DerekParnell (admin) Aug 05, 2009
- 1239 views
What's your opinion of the fastest hash of 10 million unique strings, each 50 to 400 characters long, hashing down to a 32-bit (or smaller) positive integer, preferably with unique hashes (or one result per bucket)? Preferably using no math.
Not possible - some math is always going to be involved.
I note that one of your sample strings is less than 50 characters so I'd have to assume they can be any length.
Anyhow, try this to see if it is worth using:
include machine.e

atom mem
mem = allocate(4)

function hashstr(sequence text)
    sequence chs
    atom val
    integer j

    chs = repeat(remainder(length(text), 256), 4)
    text -= ' '
    j = 0
    for i = 1 to length(text) do
        j += 1
        if j > 4 then
            j = 1
        end if
        chs[j] += text[i] * i
    end for
    poke(mem, chs)
    val = peek4u(mem)
    return val
end function
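For readers without Euphoria handy, here is a rough Python port of the same scheme (an illustrative sketch, not Derek's code): seed four byte-slots with the length mod 256, subtract the space character from every byte, multiply each byte by its position, accumulate into the slots in rotation, and combine the low bytes little-endian as poke/peek4u would.

```python
def hashstr(text: bytes) -> int:
    # Seed all four byte-slots with len(text) mod 256, mirroring
    # repeat(remainder(length(text), 256), 4) in the Euphoria version.
    chs = [len(text) % 256] * 4
    for i, ch in enumerate(text, start=1):
        # Euphoria's `text -= ' '` subtracts 32 from every element.
        chs[(i - 1) % 4] += (ch - 32) * i
    # poke stores each slot's low byte; peek4u reads them little-endian.
    return sum((c % 256) << (8 * j) for j, c in enumerate(chs))
```

Because only the low byte of each slot survives, the result is always a 32-bit unsigned value.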
4. Re: OT: hashing strings
- Posted by prickle Aug 05, 2009
- 1194 views
What's your opinion of the fastest hash of 10 million unique strings, each 50 to 400 characters long, hashing down to a 32-bit (or smaller) positive integer, preferably with unique hashes (or one result per bucket)? Preferably using no math.
The probability of a collision given any 32-bit key with a perfectly evenly distributed hash function on random data becomes 99.9972% at 300,000 unique strings. It's pretty much certain after that. Math Rocks!
Cheers,
Nick
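Nick's 99.9972% figure can be reproduced with the standard birthday-problem approximation p ≈ 1 − exp(−n(n−1)/2d), where n is the number of strings and d the number of hash values (a quick Python sketch, used here purely for illustration):

```python
import math

# Birthday approximation: probability of at least one collision when
# hashing n random keys into d equally likely slots.
n = 300_000   # unique strings
d = 2 ** 32   # 32-bit hash space
p = 1 - math.exp(-n * (n - 1) / (2 * d))
print(round(p, 6))  # -> 0.999972
```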
5. Re: OT: hashing strings
- Posted by useless Aug 05, 2009
- 1258 views
What's your opinion of the fastest hash of 10 million unique strings, each 50 to 400 characters long, hashing down to a 32-bit (or smaller) positive integer, preferably with unique hashes (or one result per bucket)? Preferably using no math.
The probability of a collision given any 32-bit key with a perfectly evenly distributed hash function on random data becomes 99.9972% at 300,000 unique strings. It's pretty much certain after that. Math Rocks!
Cheers,
Nick
Why won't 300,000 unique strings map into a 4,294,967,296 space with no duplicates?!
Kat
6. Re: OT: hashing strings
- Posted by useless Aug 05, 2009
- 1198 views
What's your opinion of the fastest hash of 10 million unique strings, each 50 to 400 characters long, hashing down to a 32-bit (or smaller) positive integer, preferably with unique hashes (or one result per bucket)? Preferably using no math.
Not possible - some math is always going to be involved.
I note that one of your sample strings is less than 50 characters so I'd have to assume they can be any length.
Yes, I said some may be single tokens: "They may be single tokens, like AADZKWAA8KOAAP (used in win32lib_0_70_4a\Demo\lvcol.exw)."
[quote DerekParnell]
Anyhow, try this to see if it is worth using:
include machine.e

atom mem
mem = allocate(4)

function hashstr(sequence text)
    sequence chs
    atom val
    integer j

    chs = repeat(remainder(length(text), 256), 4)
    text -= ' '
    j = 0
    for i = 1 to length(text) do
        j += 1
        if j > 4 then
            j = 1
        end if
        chs[j] += text[i] * i
    end for
    poke(mem, chs)
    val = peek4u(mem)
    return val
end function
Thanks, Derek! That looks like a cunning bit of code, and it uses only addition, which i can deal with.
I have a directory listing containing 9,574,746 filenames on M:\, and will make up a routine to run the code and see if there are any duplicates. I know the absence of duplicates doesn't prove they cannot happen, but I have a feeling the code you wrote is unlikely to make many. It's going to take some time to run the test code, but I can get relevant data from it. This computer is out of RAM and swap space already, and the dir listing file is 553,831,163 bytes. There are three other drives equally loaded. Locating files has become a seriously time-consuming problem.
Kat
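Kat's duplicate test amounts to bucketing names by hash value and flagging any bucket that holds more than one distinct name. A minimal sketch in Python (zlib.crc32 stands in here for any candidate 32-bit hash, including a port of Derek's hashstr; the sample paths are made up):

```python
import zlib
from collections import defaultdict

def count_collisions(names, hashfn=lambda s: zlib.crc32(s.encode("utf-8"))):
    # Group distinct names by their 32-bit hash value.
    buckets = defaultdict(list)
    for name in set(names):
        buckets[hashfn(name)].append(name)
    # Any bucket holding more than one distinct name is a real collision.
    return {h: ns for h, ns in buckets.items() if len(ns) > 1}

# Duplicate *names* are not collisions; only distinct names sharing a hash are.
sample = [r"M:\KB\animals\otter\a.htm", r"C:\kb-temp\history\b.html"]
print(count_collisions(sample))
```

For 9.5 million names the dict fits comfortably in memory only if the values are dropped; storing just a set of seen hashes and counting repeats would halve the footprint.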
7. Re: OT: hashing strings
- Posted by mattlewis (admin) Aug 05, 2009
- 1407 views
What's your opinion of the fastest hash of 10 million unique strings, each 50 to 400 characters long, hashing down to a 32-bit (or smaller) positive integer, preferably with unique hashes (or one result per bucket)? Preferably using no math.
The probability of a collision given any 32-bit key with a perfectly evenly distributed hash function on random data becomes 99.9972% at 300,000 unique strings. It's pretty much certain after that. Math Rocks!
Cheers,
Nick
Why won't 300,000 unique strings map into a 4,294,967,296 space with no duplicates?!
They can map uniquely. That doesn't mean that it's easy to find a hash function to do it. See, for example:
- http://cmph.sourceforge.net/
- http://burtleburtle.net/bob/hash/perfect.html
- http://en.wikipedia.org/wiki/Perfect_hash_function
Matt
8. Re: OT: hashing strings
- Posted by prickle Aug 05, 2009
- 1253 views
Why won't 300,000 unique strings map into a 4,294,967,296 space with no duplicates?!
Kat
I'm not saying they can't, just that the probability is they won't if one's hash function is of a random nature.
See the birthday paradox. Say we have a 4,294,967,296-day year. We then consider a string as having a birthday (hash value) on any given day. It bails out very quickly despite the large number of keys. 32 bits actually holds out quite well on large datasets if one chooses an evenly distributed hash function. The CRC32 algorithm might be a good choice. Rolling sums can also work well. Nevertheless, you will not be able to avoid having many references in each bucket. It should still be pretty quick.
Cheers,
Nick.
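CRC32 is readily available in most environments; in Python it is zlib.crc32, which returns an unsigned 32-bit value and works on raw bytes, so non-ASCII/non-UTF-8 filenames are fine as long as they are fed in as bytes (an illustrative sketch; the path is a made-up example):

```python
import zlib

# CRC32 maps arbitrary bytes to a 32-bit unsigned integer.
path = r"M:\KB\example\made-up-name.jpg"   # hypothetical path
h = zlib.crc32(path.encode("utf-8"))
print(h)

# Raw (possibly non-UTF-8) filenames can be hashed as bytes directly.
raw = bytes([0x4D, 0xFF, 0xFE, 0x00])
h2 = zlib.crc32(raw)
```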
9. Re: OT: hashing strings
- Posted by prickle Aug 05, 2009
- 1205 views
Sorry, should have said you won't be able to avoid multiple references in _some_ buckets.
Nick
10. Re: OT: hashing strings
- Posted by useless Aug 05, 2009
- 1229 views
Why won't 300,000 unique strings map into a 4,294,967,296 space with no duplicates?!
Kat
I'm not saying they can't, just that the probability is they won't if one's hash function is of a random nature.
See the birthday paradox. Say we have a 4,294,967,296-day year. We then consider a string as having a birthday (hash value) on any given day. It bails out very quickly despite the large number of keys. 32 bits actually holds out quite well on large datasets if one chooses an evenly distributed hash function. The CRC32 algorithm might be a good choice. Rolling sums can also work well. Nevertheless, you will not be able to avoid having many references in each bucket. It should still be pretty quick.
Cheers,
Nick.
Ok, I don't understand. Seems to me there's a 0.00000000023283064365386962890625 chance of the first birthday falling on any one day of the 4,294,967,296-day year. And since there's an inverse chance of a birthday falling on another day, the chance of the 2nd birthday falling on the same day as the first is roughly 1 in 4,294,967,296 squared. No?
Ok, maybe according to the Birthday page the chances are based on pairs, so you're saying there's a 1:450 chance of even the first two birthdays being the same, in a sample of 10 million on the 4,294,967,296-day year?
This is not good.
Kat
11. Re: OT: hashing strings
- Posted by prickle Aug 05, 2009
- 1173 views
Ok, maybe according to the Birthday page, the chances are based on pairs, so you're saying there's 1:450 chance of even the first two birthdays being the same, in a sample of 10 million on the 4,294,967,296 day year?
This is not good.
Kat
Yes, we are thinking about keys as pairs in this case because we are looking for a collision. Actually, a set of two keys has very little chance of containing a colliding pair, but by the time you get to 1000 keys the probability rises to about 1:8620, or 0.0116%.
Using the Python snippet from the Birthday page with d = 4294967296 I get:
samples = 2; prob. collision = 0.000000
samples = 1000; prob. collision = 0.000116
samples = 10000; prob. collision = 0.011573
samples = 100000; prob. collision = 0.687812
samples = 200000; prob. collision = 0.990502
samples = 300000; prob. collision = 0.999972
samples = 400000; prob. collision = 1.000000
Perhaps d is too large for the algorithm and something is overflowing. I don't know Python well enough to tell.
Cheers,
Nick
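The 1.000000 at 400,000 samples need not be an overflow: computing the exact product in log space shows the collision probability there really does round to 1.000000 at six decimal places, and the same computation gives 0.000116 for n = 1000. A Python check, following the exact formula from the Birthday article (a sketch, not Nick's original snippet):

```python
import math

def p_collision(n, d=2 ** 32):
    # Exact no-collision probability is prod_{k=0}^{n-1} (d - k) / d;
    # summing logs avoids underflow for large n.
    log_p_unique = sum(math.log1p(-k / d) for k in range(n))
    return 1.0 - math.exp(log_p_unique)

for n in (1000, 300_000, 400_000):
    print(n, round(p_collision(n), 6))
```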
12. Re: OT: hashing strings
- Posted by mattlewis (admin) Aug 05, 2009
- 1200 views
Ok, I don't understand. Seems to me there's a 0.00000000023283064365386962890625 chance of the first birthday falling on any one day of the 4,294,967,296-day year. And since there's an inverse chance of a birthday falling on another day, the chance of the 2nd birthday falling on the same day as the first is roughly 1 in 4,294,967,296 squared. No?
Ok, maybe according to the Birthday page the chances are based on pairs, so you're saying there's a 1:450 chance of even the first two birthdays being the same, in a sample of 10 million on the 4,294,967,296-day year?
This is not good.
Pairs is probably the wrong way to think about it. Let's imagine that we're going through each of the 300,000 strings. We can calculate the probability of a collision with that many items by considering the number of empty slots. While the number of slots is a lot, we're also talking about a lot of items to be hashed.
Remember, you didn't ask for a good hash function, but a perfect hash function. Here is some (eu 4.0) code that calculates the probabilities, reporting at each 10,000 increment. I hard-coded in your values:
include std/console.e

-- atom space = prompt_number("How many slots?", { 1, power( 2, 51 )})
-- atom items = prompt_number("How many items?", { 1, power( 2, 51 )})
atom space = power( 2, 32 )
atom items = 300_000
atom probability = 1.0
atom empty = space

for i = 1 to items - 1 do
    empty -= 1
    probability *= empty / space
    if not remainder( i, 10_000 ) then
        printf(1, "The probability of no collisions for %d items is %0.16f\n",
            i & probability )
    end if
end for
The results are as follows:
$ eui birthday.ex
The probability of no collisions for 10000 items is 0.9884248086525800
The probability of no collisions for 20000 items is 0.9544991471483176
The probability of no collisions for 30000 items is 0.9005248436101611
The probability of no collisions for 40000 items is 0.8300496590491797
The probability of no collisions for 50000 items is 0.7474818501475397
The probability of no collisions for 60000 items is 0.6576357350322105
The probability of no collisions for 70000 items is 0.5652730868057533
The probability of no collisions for 80000 items is 0.4747001207863105
The probability of no collisions for 90000 items is 0.3894650411345447
The probability of no collisions for 100000 items is 0.3121804493495650
The probability of no collisions for 110000 items is 0.2444730439793339
The probability of no collisions for 120000 items is 0.1870442204532125
The probability of no collisions for 130000 items is 0.1398123764617983
The probability of no collisions for 140000 items is 0.1021021599584672
The probability of no collisions for 150000 items is 0.0728470896214529
The probability of no collisions for 160000 items is 0.0507782140487439
The probability of no collisions for 170000 items is 0.0345804432150361
The probability of no collisions for 180000 items is 0.0230076139686639
The probability of no collisions for 190000 items is 0.0149554842056866
The probability of no collisions for 200000 items is 0.0094976731827056
The probability of no collisions for 210000 items is 0.0058928011581831
The probability of no collisions for 220000 items is 0.0035720222185544
The probability of no collisions for 230000 items is 0.0021154087285301
The probability of no collisions for 240000 items is 0.0012239457946743
The probability of no collisions for 250000 items is 0.0006918593356940
The probability of no collisions for 260000 items is 0.0003820860057772
The probability of no collisions for 270000 items is 0.0002061541771879
The probability of no collisions for 280000 items is 0.0001086702846204
The probability of no collisions for 290000 items is 0.0000559650768343
Matt
13. Re: OT: hashing strings
- Posted by DerekParnell (admin) Aug 05, 2009
- 1226 views
- Last edited Aug 06, 2009
Ok, I don't understand. Seems to me there's a 0.00000000023283064365386962890625 chance of the first birthday falling on any one day of the 4,294,967,296-day year. And since there's an inverse chance of a birthday falling on another day, the chance of the 2nd birthday falling on the same day as the first is roughly 1 in 4,294,967,296 squared. No?
Ok, maybe according to the Birthday page the chances are based on pairs, so you're saying there's a 1:450 chance of even the first two birthdays being the same, in a sample of 10 million on the 4,294,967,296-day year?
This is not good.
Pairs is probably the wrong way to think about it.
It seems that for 10 million pure hashes, to get (probably) no collisions you would need to use 46-bit hash values. So here's an update to that function to generate 64-bit values. These are presented as pairs of 32-bit values, so each hash value is a two-element sequence.
include machine.e

atom mem
mem = allocate(8)

function hashstr(sequence text)
    sequence chs
    sequence val
    integer j

    -- Seed each byte-slot with the length, as in the 32-bit version
    chs = repeat(remainder(length(text), 256), 8)
    text -= ' '
    j = 0
    for i = 1 to length(text) do
        j += 1
        if j > length(chs) then
            j = 1
        end if
        chs[j] += text[i] * i
    end for
    poke(mem, chs)
    -- Grab the 64 bits of data from RAM
    val = peek4u({mem, 2})
    return val
end function
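As a check on the 46-bit figure: by the birthday approximation, 10 million keys in a 2^46 space gives roughly even odds of a collision, and it takes something closer to 64 bits before collisions become truly unlikely (a quick Python sketch, for illustration only):

```python
import math

n = 10_000_000  # keys
# Birthday approximation for each candidate hash width.
p = {bits: 1 - math.exp(-n * (n - 1) / (2.0 * 2 ** bits))
     for bits in (32, 46, 64)}
for bits, prob in p.items():
    print(bits, prob)
```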
14. Re: OT: hashing strings
- Posted by useless Aug 05, 2009
- 1199 views
- Last edited Aug 06, 2009
Ok, I don't understand. Seems to me there's a 0.00000000023283064365386962890625 chance of the first birthday falling on any one day of the 4,294,967,296-day year. And since there's an inverse chance of a birthday falling on another day, the chance of the 2nd birthday falling on the same day as the first is roughly 1 in 4,294,967,296 squared. No?
Ok, maybe according to the Birthday page the chances are based on pairs, so you're saying there's a 1:450 chance of even the first two birthdays being the same, in a sample of 10 million on the 4,294,967,296-day year?
This is not good.
Pairs is probably the wrong way to think about it.
It seems that for 10 million pure hashes, to get (probably) no collisions you would need to use 46-bit hash values.
I'm not sure that's true. Besides that, there's a reason to limit it to 32 bits that I don't want to divulge at this time. I think I must give up on unique hashes and find some way to deal with buckets full of hash.
Before I do, though, I am going to try the 32-bit code you wrote, Derek. Maybe there's some pre-munge I can do to slant the risk in my favor. Maybe there's a pattern to the unused hash numbers.
Kat
15. Re: OT: hashing strings
- Posted by mattlewis (admin) Aug 05, 2009
- 1208 views
- Last edited Aug 06, 2009
It seems that for 10 million pure hashes, to get (probably) no collisions you would need to use 46-bit hash values.
I'm not sure that's true.
You're right, because we're ignoring a key assumption, which is that the hash function is uniformly distributed over the hash space, which is not a given at all.
Matt
16. Re: OT: hashing strings
- Posted by euphoric (admin) Aug 06, 2009
- 1198 views
You're right, because we're ignoring a key assumption, which is that the hash function is uniformly distributed over the hash space, which is not a given at all.
I once wrote a perfect hash function that uniformly distributed any number of items but lost it when my hard drive crashed. Dang it!
17. Re: OT: hashing strings
- Posted by prickle Aug 06, 2009
- 1207 views
It seems that for 10 million pure hashes, to get (probably) no collisions you would need to use 46-bit hash values.
I'm not sure that's true.
You're right, because we're ignoring a key assumption, which is that the hash function is uniformly distributed over the hash space, which is not a given at all.
Matt
Yes, real-life performance would be somewhat worse. Even with a very large bitfield for the key, I expect it is difficult to guarantee that any given random hash function hashes a given data set without collisions anyway.
Despite all this, at 32 bits the average number of keys per hash would remain very low. Also, any lookup table it used would be quite large. It's hard to say anything more without knowing how it will be used.
Nick
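Nick's "average number of keys per hash would remain very low" can be quantified: with n keys thrown uniformly into d slots, the expected number of occupied slots is d(1 − (1 − 1/d)^n), so the expected number of keys landing in an already-occupied slot is n minus that. A Python sketch for 10 million keys in a 32-bit space:

```python
import math

n, d = 10_000_000, 2 ** 32
# Expected occupied slots under uniform hashing: d * (1 - (1 - 1/d)^n),
# with (1 - 1/d)^n computed stably as exp(n * log1p(-1/d)).
occupied = d * (1 - math.exp(n * math.log1p(-1 / d)))
colliding_keys = n - occupied  # keys landing in an already-occupied slot
print(round(colliding_keys))   # on the order of 11,600 of 10,000,000
```

That is roughly 0.12% of the keys, so buckets stay shallow even though collisions are unavoidable.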
18. Re: OT: hashing strings
- Posted by alanjohnoxley Aug 06, 2009
- 1290 views
Hi Kat. The author claims this is a fast 32-bit hash with fewer collisions. Source is included: http://www.azillionmonkeys.com/qed/hash.html
HTH! Regards, Alan