Re: OT: hashing strings
- Posted by mattlewis (admin) Aug 05, 2009
- 1201 views
Ok, i don't understand. Seems to me there's a 0.00000000023283064365386962890625 chance of the first birthday falling on any one day of the 4,294,967,296 day year. And since there's a inverse chance of a birthday falling on another day, the chance of the 2nd birthday falling on the same day as the first is roughly 4,294,967,296 squared. No?
Ok, maybe according to the Birthday page, the chances are based on pairs, so you're saying there's 1:450 chance of even the first two birthdays being the same, in a sample of 10 million on the 4,294,967,296 day year?
This is not good.
Pairs is probably the wrong way to think about it. Let's imagine that we're going through each of the 300,000 strings. We can calculate the probability of a collision with that many items by considering the number of empty slots. While the number of slots is a lot, we're also talking about a lot of items to be hashed.
Remember, you didn't ask for a good hash function, but a perfect hash function. Here is some (eu 4.0) code that calculates the probabilities. It reports at each 10,000 increment. I hard coded in your values:
include std/console.e -- atom space = prompt_number("How many slots?", { 1, power( 2, 51 )}) -- atom items = prompt_number("How many items?", { 1, power( 2, 51 )}) atom space = power( 2, 32 ) atom items = 300_000 atom probability = 1.0 atom empty = space for i = 1 to items - 1 do empty -= 1 probability *= empty / space if not remainder( i, 10_000 ) then printf(1, "The probability of no collisions for %d items is %0.16f\n", i & probability ) end if end for
The results are as follows:
$ eui birthday.ex The probability of no collisions for 10000 items is 0.9884248086525800 The probability of no collisions for 20000 items is 0.9544991471483176 The probability of no collisions for 30000 items is 0.9005248436101611 The probability of no collisions for 40000 items is 0.8300496590491797 The probability of no collisions for 50000 items is 0.7474818501475397 The probability of no collisions for 60000 items is 0.6576357350322105 The probability of no collisions for 70000 items is 0.5652730868057533 The probability of no collisions for 80000 items is 0.4747001207863105 The probability of no collisions for 90000 items is 0.3894650411345447 The probability of no collisions for 100000 items is 0.3121804493495650 The probability of no collisions for 110000 items is 0.2444730439793339 The probability of no collisions for 120000 items is 0.1870442204532125 The probability of no collisions for 130000 items is 0.1398123764617983 The probability of no collisions for 140000 items is 0.1021021599584672 The probability of no collisions for 150000 items is 0.0728470896214529 The probability of no collisions for 160000 items is 0.0507782140487439 The probability of no collisions for 170000 items is 0.0345804432150361 The probability of no collisions for 180000 items is 0.0230076139686639 The probability of no collisions for 190000 items is 0.0149554842056866 The probability of no collisions for 200000 items is 0.0094976731827056 The probability of no collisions for 210000 items is 0.0058928011581831 The probability of no collisions for 220000 items is 0.0035720222185544 The probability of no collisions for 230000 items is 0.0021154087285301 The probability of no collisions for 240000 items is 0.0012239457946743 The probability of no collisions for 250000 items is 0.0006918593356940 The probability of no collisions for 260000 items is 0.0003820860057772 The probability of no collisions for 270000 items is 0.0002061541771879 The probability of no collisions for 280000 items is 0.0001086702846204 The probability of no collisions for 290000 items is 0.0000559650768343Matt