1. OT: hashing strings


What's your opinion of the fastest hash of 10 million unique strings, each 50 to 400 characters long, hashing down to a 32-bit (or smaller) positive integer, preferably with unique hashes (or one result per bucket)? Preferably using no math.

The strings might take the form of:

M:\KB\Military\submarines\I-401\I-401 Submarine Found off of Barbers Point_files\I401_DJ_01_sm.jpg

M:\KB\animals\otter\(2-19-2001) California Sea Otters Headed Back to Endangered Species List; Sudden Rise in Death Rates Traced to Sewage-related Parasitic, Bacterial Diseases.htm

M:\KB\Music\Bands\J\Janis JoplinJanis Joplin Ball and Chain ( in album Monterey International Pop Festival ).txt

M:\KB\Plants\blah.garden.org\guides\bf\b\Stylommatophora\Helicidae\Candidula\intersecta\genus\1\3467345\index.html

C:\kb-temp\history\china\Literature\Historiography\huangqingkaiguofanglve.html

They also might not be ASCII or UTF-8 strings. They may be single tokens, like AADZKWAA8KOAAP (used in win32lib_0_70_4a\Demo\lvcol.exw).

useless


2. Re: OT: hashing strings

useless said...


What's your opinion of the fastest hash of 10 million unique strings, each 50 to 400 characters long, hashing down to a 32-bit (or smaller) positive integer, preferably with unique hashes (or one result per bucket)? Preferably using no math.

I have the most misgivings about "using no math." What, exactly, do you mean by that? There is a built-in hash function in eu 4.0. Of course, without advance knowledge of all the strings, it's impossible to guarantee perfect hashing.

Matt


3. Re: OT: hashing strings

kat said...

What's your opinion of the fastest hash of 10 million unique strings, each 50 to 400 characters long, hashing down to a 32-bit (or smaller) positive integer, preferably with unique hashes (or one result per bucket)? Preferably using no math.

Not possible - some math is always going to be involved.

I note that one of your sample strings is less than 50 characters so I'd have to assume they can be any length.

Anyhow, try this to see if it is worth using ..

include machine.e 
atom mem 
mem = allocate(4) 
 
function hashstr(sequence text) 
	sequence chs 
	atom val 
	integer j 
	 
	-- seed all four bytes with the string's length 
	chs = repeat( remainder(length(text), 256), 4) 
	text -= ' ' 
	 
	-- rotate through the four accumulators, weighting each 
	-- character by its position in the string 
	j = 0 
	for i = 1 to length(text) do 
		j += 1 
		if j > 4 then 
			j = 1 
		end if 
		chs[j] += text[i] * i 
	end for 
	 
	-- poke() keeps only the low byte of each accumulator; 
	-- peek4u() reads the four bytes back as one 32-bit value 
	poke(mem, chs) 
	 
	val = peek4u(mem) 
	 
	return val 
end function 
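For readers who want to poke at the algorithm outside Euphoria, here is a rough Python transliteration of the function above (my own sketch, not from the thread). It mimics poke()'s truncation of each accumulator to its low byte and peek4u()'s little-endian read:

```python
def hashstr(text: str) -> int:
    """Sketch of Derek's hash: four rotating accumulators, seeded with
    the string length, each character weighted by its 1-based position."""
    chs = [len(text) % 256] * 4              # repeat(remainder(length, 256), 4)
    j = 0
    for i, ch in enumerate(text, start=1):
        j = 1 if j >= 4 else j + 1           # cycle 1..4
        chs[j - 1] += (ord(ch) - ord(' ')) * i   # the "text -= ' '" step
    # poke() stores only the low byte of each accumulator; peek4u()
    # reads the four bytes back as one little-endian 32-bit value
    val = 0
    for k in range(4):
        val |= (chs[k] & 0xFF) << (8 * k)
    return val
```

It is deterministic and always yields a value below 2^32, but like the original it makes no uniformity guarantees.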

4. Re: OT: hashing strings

useless said...


What's your opinion of the fastest hash of 10 million unique strings, each 50 to 400 characters long, hashing down to a 32-bit (or smaller) positive integer, preferably with unique hashes (or one result per bucket)? Preferably using no math.

The probability of a collision among 32-bit keys with a perfectly evenly distributed hash function on random data reaches 99.9972% at 300,000 unique strings. It's pretty much certain after that. Math Rocks! :)

Cheers,

Nick


5. Re: OT: hashing strings

prickle said...
useless said...


What's your opinion of the fastest hash of 10 million unique strings, each 50 to 400 characters long, hashing down to a 32-bit (or smaller) positive integer, preferably with unique hashes (or one result per bucket)? Preferably using no math.

The probability of a collision among 32-bit keys with a perfectly evenly distributed hash function on random data reaches 99.9972% at 300,000 unique strings. It's pretty much certain after that. Math Rocks! :)

Cheers,

Nick

Why won't 300,000 unique strings map into a 4,294,967,296 space with no duplicates?!

Kat


6. Re: OT: hashing strings

DerekParnell said...
kat said...

What's your opinion of the fastest hash of 10 million unique strings, each 50 to 400 characters long, hashing down to a 32-bit (or smaller) positive integer, preferably with unique hashes (or one result per bucket)? Preferably using no math.

Not possible - some math is always going to be involved.

I note that one of your sample strings is less than 50 characters so I'd have to assume they can be any length.

Yes, i said some may be single tokens, like AADZKWAA8KOAAP (used in win32lib_0_70_4a\Demo\lvcol.exw).

DerekParnell said...

Anyhow, try this to see if it is worth using ..

include machine.e 
atom mem 
mem = allocate(4) 
 
function hashstr(sequence text) 
	sequence chs 
	atom val 
	integer j 
	 
	-- seed all four bytes with the string's length 
	chs = repeat( remainder(length(text), 256), 4) 
	text -= ' ' 
	 
	-- rotate through the four accumulators, weighting each 
	-- character by its position in the string 
	j = 0 
	for i = 1 to length(text) do 
		j += 1 
		if j > 4 then 
			j = 1 
		end if 
		chs[j] += text[i] * i 
	end for 
	 
	-- poke() keeps only the low byte of each accumulator; 
	-- peek4u() reads the four bytes back as one 32-bit value 
	poke(mem, chs) 
	 
	val = peek4u(mem) 
	 
	return val 
end function 

Thanks, Derek! That looks like a cunning bit of code, and it uses only addition, which i can deal with. smile

I have a directory listing containing 9,574,746 filenames on M:\, and will make up a routine to run the code and see if there are any duplicates. I know the absence of duplicates doesn't prove they cannot happen, but i have a feeling the code you wrote is unlikely to make many. It's going to take some time to run the test code, but i can get relevant data from it. This computer is out of ram and swap space already, and the dir listing file is 553,831,163 bytes. There are three other drives equally loaded. Locating files has become a seriously time-consuming problem.
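A duplicate-checking harness along the lines Kat describes can be sketched in a few lines of Python (not from the thread; zlib.crc32 is just a stand-in hash, and the commented filename is hypothetical):

```python
import zlib

def count_collisions(paths, hashfn=None):
    """Hash every path and count how many hash values are shared by
    more than one *distinct* input string (repeats of the same string
    are not collisions)."""
    if hashfn is None:
        hashfn = lambda s: zlib.crc32(s.encode('utf-8', 'surrogateescape'))
    buckets = {}
    for p in paths:
        buckets.setdefault(hashfn(p), set()).add(p)
    return sum(1 for members in buckets.values() if len(members) > 1)

# e.g. with open("dirlist.txt", errors="surrogateescape") as f:
#          print(count_collisions(line.rstrip("\n") for line in f))
```

Swapping in a Python port of Derek's function for `hashfn` would let the two hashes be compared on the same listing.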

Kat


7. Re: OT: hashing strings

useless said...
prickle said...
useless said...


What's your opinion of the fastest hash of 10 million unique strings, each 50 to 400 characters long, hashing down to a 32-bit (or smaller) positive integer, preferably with unique hashes (or one result per bucket)? Preferably using no math.

The probability of a collision among 32-bit keys with a perfectly evenly distributed hash function on random data reaches 99.9972% at 300,000 unique strings. It's pretty much certain after that. Math Rocks! :)

Cheers,

Nick

Why won't 300,000 unique strings map into a 4,294,967,296 space with no duplicates?!

They can map uniquely. That doesn't mean that it's easy to find a hash function to do it. See, for example:

Matt


8. Re: OT: hashing strings

useless said...

Why won't 300,000 unique strings map into a 4,294,967,296 space with no duplicates?!

Kat

I'm not saying they can't, just that the probability is they won't if one's hash function is of a random nature.

See the Birthday paradox. Say we have a 4,294,967,296-day year. We then consider each string as having a birthday (its hash value) on some day of that year. Uniqueness bails out very quickly despite the large number of keys. Still, 32 bits actually holds up quite well on large datasets if one chooses an evenly distributed hash function. The CRC32 algorithm might be a good choice. Rolling sums can also work well. Nevertheless, you will not be able to avoid having many references in each bucket. It should still be pretty quick.
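The buckets Nick mentions are just separate chaining. A minimal sketch (mine, not Nick's) of a chained table keyed on CRC32, where lookups compare the full string rather than trusting the 32-bit hash:

```python
import zlib

class CRCTable:
    """Tiny chained hash table: colliding keys share a bucket, so a
    lookup verifies the full key string, not just its 32-bit hash."""
    def __init__(self):
        self.buckets = {}

    def add(self, key, value):
        h = zlib.crc32(key.encode('utf-8', 'surrogateescape'))
        self.buckets.setdefault(h, []).append((key, value))

    def get(self, key):
        h = zlib.crc32(key.encode('utf-8', 'surrogateescape'))
        for k, v in self.buckets.get(h, []):
            if k == key:          # resolve any collision by full compare
                return v
        return None
```

Even with collisions present, each lookup only scans the handful of entries in one bucket.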

Cheers,

Nick.


9. Re: OT: hashing strings

Sorry, should have said you won't be able to avoid multiple references in _some_ buckets.

Nick


10. Re: OT: hashing strings

prickle said...
useless said...

Why won't 300,000 unique strings map into a 4,294,967,296 space with no duplicates?!

Kat

I'm not saying they can't, just that the probability is they won't if one's hash function is of a random nature.

See the Birthday paradox. Say we have a 4,294,967,296-day year. We then consider each string as having a birthday (its hash value) on some day of that year. Uniqueness bails out very quickly despite the large number of keys. Still, 32 bits actually holds up quite well on large datasets if one chooses an evenly distributed hash function. The CRC32 algorithm might be a good choice. Rolling sums can also work well. Nevertheless, you will not be able to avoid having many references in each bucket. It should still be pretty quick.

Cheers,

Nick.

Ok, i don't understand. Seems to me there's a 0.00000000023283064365386962890625 chance of the first birthday falling on any one day of the 4,294,967,296-day year. And since there's an inverse chance of a birthday falling on another day, the chance of the 2nd birthday falling on the same day as the first is roughly 1 in 4,294,967,296 squared. No?

Ok, maybe according to the Birthday page, the chances are based on pairs, so you're saying there's a 1:450 chance of even the first two birthdays being the same, in a sample of 10 million on the 4,294,967,296-day year?

This is not good.

Kat


11. Re: OT: hashing strings

useless said...

Ok, maybe according to the Birthday page, the chances are based on pairs, so you're saying there's a 1:450 chance of even the first two birthdays being the same, in a sample of 10 million on the 4,294,967,296-day year?

This is not good.

Kat

Yes, we are thinking about keys as pairs in this case because we are looking for a collision. Actually, a set of two keys appears to have very little chance of containing a colliding pair. But by the time you get to 1000 keys the probability appears to rise to roughly 1:8620, or 0.0116%.

Using the python snippet from the Birthday page with d=4294967296 I get:

samples = 2; prob. collision = 0.000000

samples = 1000; prob. collision = 0.000116

samples = 10000; prob. collision = 0.011573

samples = 100000; prob. collision = 0.687812

samples = 200000; prob. collision = 0.990502

samples = 300000; prob. collision = 0.999972

samples = 400000; prob. collision = 1.000000

Perhaps d is too large for the algorithm and something is overflowing. I don't know python well enough to tell.
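Nick's snippet isn't reproduced in the thread, but the same figures can be computed without overflow by keeping the running product in floating point (a sketch of the standard birthday calculation, not Nick's exact code):

```python
def p_collision(n, d=2**32):
    """Probability that n uniformly random keys drawn from a space of
    size d contain at least one duplicate (exact product form)."""
    p_unique = 1.0
    for i in range(n):
        p_unique *= (d - i) / d   # i-th key must miss the i used slots
    return 1.0 - p_unique
```

With d = 2**32 this reproduces the shape of the table above: negligible at 2 samples, near-certainty by 300,000.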

Cheers,

Nick


12. Re: OT: hashing strings

useless said...

Ok, i don't understand. Seems to me there's a 0.00000000023283064365386962890625 chance of the first birthday falling on any one day of the 4,294,967,296-day year. And since there's an inverse chance of a birthday falling on another day, the chance of the 2nd birthday falling on the same day as the first is roughly 1 in 4,294,967,296 squared. No?

Ok, maybe according to the Birthday page, the chances are based on pairs, so you're saying there's a 1:450 chance of even the first two birthdays being the same, in a sample of 10 million on the 4,294,967,296-day year?

This is not good.

Pairs is probably the wrong way to think about it. Let's imagine that we're going through each of the 300,000 strings in turn. We can calculate the probability of avoiding a collision with that many items by tracking the number of still-empty slots. While the number of slots is large, we're also talking about a lot of items to be hashed.

Remember, you didn't ask for a good hash function, but a perfect hash function. Here is some (eu 4.0) code that calculates the probabilities, reporting at each 10,000-item increment. I hard-coded in your values:

include std/console.e 
 
-- atom space = prompt_number("How many slots?", { 1, power( 2, 51 )}) 
-- atom items = prompt_number("How many items?", { 1, power( 2, 51 )}) 
atom space = power( 2, 32 ) 
atom items = 300_000 
 
atom probability = 1.0 
atom empty = space 
for i = 1 to items - 1 do 
	empty -= 1 
	probability *= empty / space 
	if not remainder( i, 10_000 ) then 
		printf(1, "The probability of no collisions for %d items is %0.16f\n", i & probability ) 
	end if 
end for 

The results are as follows:

$ eui birthday.ex 
The probability of no collisions for 10000 items is 0.9884248086525800 
The probability of no collisions for 20000 items is 0.9544991471483176 
The probability of no collisions for 30000 items is 0.9005248436101611 
The probability of no collisions for 40000 items is 0.8300496590491797 
The probability of no collisions for 50000 items is 0.7474818501475397 
The probability of no collisions for 60000 items is 0.6576357350322105 
The probability of no collisions for 70000 items is 0.5652730868057533 
The probability of no collisions for 80000 items is 0.4747001207863105 
The probability of no collisions for 90000 items is 0.3894650411345447 
The probability of no collisions for 100000 items is 0.3121804493495650 
The probability of no collisions for 110000 items is 0.2444730439793339 
The probability of no collisions for 120000 items is 0.1870442204532125 
The probability of no collisions for 130000 items is 0.1398123764617983 
The probability of no collisions for 140000 items is 0.1021021599584672 
The probability of no collisions for 150000 items is 0.0728470896214529 
The probability of no collisions for 160000 items is 0.0507782140487439 
The probability of no collisions for 170000 items is 0.0345804432150361 
The probability of no collisions for 180000 items is 0.0230076139686639 
The probability of no collisions for 190000 items is 0.0149554842056866 
The probability of no collisions for 200000 items is 0.0094976731827056 
The probability of no collisions for 210000 items is 0.0058928011581831 
The probability of no collisions for 220000 items is 0.0035720222185544 
The probability of no collisions for 230000 items is 0.0021154087285301 
The probability of no collisions for 240000 items is 0.0012239457946743 
The probability of no collisions for 250000 items is 0.0006918593356940 
The probability of no collisions for 260000 items is 0.0003820860057772 
The probability of no collisions for 270000 items is 0.0002061541771879 
The probability of no collisions for 280000 items is 0.0001086702846204 
The probability of no collisions for 290000 items is 0.0000559650768343 
Matt


13. Re: OT: hashing strings

mattlewis said...
useless said...

Ok, i don't understand. Seems to me there's a 0.00000000023283064365386962890625 chance of the first birthday falling on any one day of the 4,294,967,296-day year. And since there's an inverse chance of a birthday falling on another day, the chance of the 2nd birthday falling on the same day as the first is roughly 1 in 4,294,967,296 squared. No?

Ok, maybe according to the Birthday page, the chances are based on pairs, so you're saying there's a 1:450 chance of even the first two birthdays being the same, in a sample of 10 million on the 4,294,967,296-day year?

This is not good.

Pairs is probably the wrong way to think about it.

It seems that for 10 million pure hashes, to get (probably) no collisions you would need to use 46-bit hash values. So here's an update to that function to generate 64-bit values, presented as pairs of 32-bit values: each hash value is a two-element sequence.

include machine.e 
atom mem 
mem = allocate(8) 
 
function hashstr(sequence text) 
	sequence chs 
	sequence val 
	integer j 
	 
	-- seed all eight bytes with the string's length 
	chs = repeat( remainder(length(text), 256), 8) 
	text -= ' ' 
	 
	j = 0 
	for i = 1 to length(text) do 
		j += 1 
		if j > length(chs) then 
			j = 1 
		end if 
		chs[j] += text[i] * i 
	end for 
	poke(mem, chs) 
 
	-- Grab the 64 bits of data from RAM as two 32-bit values 
	val = peek4u({mem, 2}) 
	 
	return val 
end function 
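The 46-bit figure lines up with the usual birthday bound: n random keys have roughly even odds of a collision once the space shrinks to about n^2/(2 ln 2) slots. A back-of-envelope check (my sketch, not from the thread):

```python
import math

def birthday_bits(n):
    """Hash width (in bits, as a float) at which n uniformly random
    keys have about a 50% chance of at least one collision, via the
    approximation d ~ n^2 / (2 ln 2)."""
    return math.log2(n * n / (2 * math.log(2)))
```

For n = 10,000,000 this comes out just over 46 bits; note that is even odds, so "probably no collisions" wants a few extra bits of headroom. (As a sanity check, for n = 23 the formula gives a space of about 382 days, the classic birthday-paradox figure.)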

14. Re: OT: hashing strings

DerekParnell said...
mattlewis said...
useless said...

Ok, i don't understand. Seems to me there's a 0.00000000023283064365386962890625 chance of the first birthday falling on any one day of the 4,294,967,296-day year. And since there's an inverse chance of a birthday falling on another day, the chance of the 2nd birthday falling on the same day as the first is roughly 1 in 4,294,967,296 squared. No?

Ok, maybe according to the Birthday page, the chances are based on pairs, so you're saying there's a 1:450 chance of even the first two birthdays being the same, in a sample of 10 million on the 4,294,967,296-day year?

This is not good.

Pairs is probably the wrong way to think about it.

It seems that for 10 million pure hashes, to get (probably) no collisions you would need to use 46-bit hash values.

I'm not sure that's true. Besides that, there's a reason to limit it to 32 bits that i don't want to divulge at this time. I think i must give up on unique and make some way to deal with buckets full of hash.

Before i do tho, i am going to try the 32-bit code you wrote, Derek. Maybe there's some pre-munge i can do to slant the risk in my favor. Maybe there's a pattern to the unused hash numbers.

Kat


15. Re: OT: hashing strings

useless said...
DerekParnell said...

It seems that for 10 million pure hashes, to get (probably) no collisions you would need to use 46-bit hash values.

I'm not sure that's true.

You're right. We're ignoring a key assumption here: that the hash function is uniformly distributed over the hash space, which is not a given at all.

Matt


16. Re: OT: hashing strings

mattlewis said...

You're right. We're ignoring a key assumption here: that the hash function is uniformly distributed over the hash space, which is not a given at all.

I once wrote a perfect hash function that uniformly distributed any number of items but lost it when my hard drive crashed. Dang it!


17. Re: OT: hashing strings

mattlewis said...
useless said...
DerekParnell said...

It seems that for 10 million pure hashes, to get (probably) no collisions you would need to use 46-bit hash values.

I'm not sure that's true.

You're right. We're ignoring a key assumption here: that the hash function is uniformly distributed over the hash space, which is not a given at all.

Matt

Yes, real-life performance would be somewhat worse. Even with a very large bitfield for the key, I expect it is difficult to guarantee that any given random hash function maps a given set of data without collisions anyway.

Despite all this, at 32 bits the average number of keys per hash would remain very low. On the other hand, any lookup table used would be quite large. It's hard to say much more without knowing how it will be used.
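That "very low" average can be estimated directly: with a uniform hash, the expected number of colliding pairs among n keys in d slots is about n(n-1)/(2d) (my back-of-envelope sketch, not from the thread):

```python
def expected_colliding_pairs(n, d=2**32):
    """Expected number of pairs of keys sharing a hash value, assuming
    a uniformly distributed hash over d slots."""
    return n * (n - 1) / (2 * d)
```

For 10 million keys in a 32-bit space this is roughly 11,600 colliding pairs, so the overwhelming majority of buckets hold a single entry.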

Nick


18. Re: OT: hashing strings

Hi Kat. The author claims this is a fast 32-bit hash with fewer collisions. Source is included. http://www.azillionmonkeys.com/qed/hash.html

HTH! Regards, Alan
