Re: OT: hashing strings
- Posted by useless Aug 05, 2009
- 1230 views
Why won't 300,000 unique strings map into a 4,294,967,296 space with no duplicates?!
Kat
I'm not saying they cant, just the probability is that they wont if one's hash function is of a random nature.
See the Birthday paradox. Say we have a 4,294,967,296 day year. We then consider a string having a birthday (hash value) on any given day. It bails out very quickly despite the large number of keys. 32 bits actually holds out quite well on large datasets if one chooses an evenly distributed hash function. The CRC32 algorithm might be a good choice. Rolling sums can also work well. Nevertheless, you will not be able to avoid having many references in each bucket. It should still be pretty quick
Cheers,
Nick.
Ok, i don't understand. Seems to me there's a 0.00000000023283064365386962890625 chance of the first birthday falling on any one day of the 4,294,967,296 day year. And since there's a inverse chance of a birthday falling on another day, the chance of the 2nd birthday falling on the same day as the first is roughly 4,294,967,296 squared. No?
Ok, maybe according to the Birthday page, the chances are based on pairs, so you're saying there's 1:450 chance of even the first two birthdays being the same, in a sample of 10 million on the 4,294,967,296 day year?
This is not good.
Kat