OpenEuphoria: Ticket #593: Atom hashi...

Euphoria Ticket #593: Atom hashing seems inconsistent

Reported by mattlewis Jan 12, 2011
12 comments

Atoms are hashed differently based on whether they are internally a double or an integer.

include std/hash.e 
atom  
	x = 1, 
	y = 0.5 
y *= 2 
? hash( x, HSIEH32 )  -- 2236130196 
? hash( y, HSIEH32 )  -- 657194111

When hashing, we should probably check to see if the value stored as a double could be stored as a 32-bit integer (the basis for our hashing algorithms).

Values might be converted to doubles by using delete_routine() (in addition to normal . This subtle difference could also cause serious issues with std/map.e.

Details

Type:	Bug Report	Severity:	Major	Category:	Language
Assigned To:	mattlewis	Status:	Fixed	Reported Release:
Fixed in SVN #:		View VCS:	none	Milestone:	4.0.1

1. Comment by DerekParnell Jan 12, 2011

Hmmmm .... I thought that 0.5 * 2 would have been stored as an integer. If it is not then it isn't 1, thus hash sees it as a floating point. We have a problem here because how does hash or anyone know that such floating point number is meant to be 1 rather than the 0.999999998 ( or whatever ) it actual shows up as?

2. Comment by mattlewis Jan 12, 2011

We cannot know what anything is meant to be, however, we know what it is. As a user, I would have thought that euphoria was hashing on values, not an implementation detail like whether data is stored as an integer or double. After all, that's how euphoria handles equality.

So I guess the issue is that while the rest of euphoria transparently uses doubles and integers, the hashing does not.

3. Comment by jimcbrown Jan 12, 2011

This is not just some floating point oddity.

The following C code to test (compiled with gcc 4.1.2)

#include <stdio.h> 
 
int main() 
{ 
	int i; 
	double x, y, z; 
	x = 1.0; 
	y = 0.5; 
	y *= 2; 
	z = 0.0; 
	for (i = 0; i < 10;i ++) 
	{ 
		z += 0.1; 
	} 
	printf("x = %f, y = %f, z = %f\n", x, y, z); 
	printf("(x == 1.0) = %d, (y == 1.0) = %d, (z == 1.0) = %d\n", x == 1.0, y == 1.0, z == 1.0); 
	printf("x = "); 
	for (i = 0; i < 8;i ++) 
	{ 
		printf("%d ", ((char *)&x)[i]); 
	} 
	printf("\ny = "); 
	for (i = 0; i < 8;i ++) 
	{ 
		printf("%d ", ((char *)&y)[i]); 
	} 
	printf("\nz = "); 
	for (i = 0; i < 8;i ++) 
	{ 
		printf("%d ", ((char *)&z)[i]); 
	} 
	printf("\n"); 
	return 0; 
}

Produces this output:

x = 1.000000, y = 1.000000, z = 1.000000 
(x == 1.0) = 1, (y == 1.0) = 1, (z == 1.0) = 0 
x = 0 0 0 0 0 0 -16 63 
y = 0 0 0 0 0 0 -16 63 
z = -1 -1 -1 -1 -1 -1 -17 63

z is the floating point oddity here, the value that looks like 1.0f but is really different because 1/10th can't be represented in binary or something.

y, on the other hand, which is 0.5 * 2, is the same floating point value as x.

The failure to convert atom y in the example Euphoria code in this ticket into an integer automatically internally when it is created must be peculiar to the implementation of OpenEuphoria.

4. Comment by mattlewis Jan 12, 2011

For clarity, here is some code from be_runtime:calc_hsieh32:

if (IS_ATOM_INT(a)) { 
 	tf.integer = a; 
 	lHashVal = hsieh32(tf.tfc, 4, a*2 - 1); 
} 
else if (IS_ATOM_DBL(a)) { 
	tf.ieee_double = (DBL_PTR(a)->dbl); 
	lHashVal = hsieh32(tf.tfc, 8, 8); 
}

This really has less to do with floating point implementation issues than the hashing implementation of euphoria. We can't fix the known warts regarding floating point, but we can still act reasonably. Consider:

include std/eumem.e 
atom my_ptr = eumem:malloc() 
integer my_int_ptr = my_ptr 
 
? equal( my_ptr, my_int_ptr ) -- 1 
? hash( my_ptr, 0 )           -- 2850535635 
? hash( my_int_ptr, 0 )       -- 2849416540

In short, this really seems to violate the spirit (if not the actual letter) of hashing.

5. Comment by DerekParnell Jan 12, 2011

I'm not near a computer so can you try this out ...

atom a = 0.5 * 2 
? equal(a, 1)

You can see from the hash code that all it checks is how Euphoria has decided to store the value, regardless of what that value is. So it looks like 0.5 * 2 is stored as a floating point value. Is that because the value is not precisely 1?

6. Comment by mattlewis Jan 12, 2011

No, 0.5 is stored exactly, being a power of 2. But euphoria does not generally check to see if doubles can be converted to integers. And as the cleanup use case shows, there are times when it definitely shouldn't.

Either way, from the euphoria user's perspective, a double that equals an integer shouldn't have a different hash value from an equivalent integer.

7. Comment by jimcbrown Jan 12, 2011

Derek, the point of my tests was to show that 0.5 * 2 is precisely 1.0.

? equal(a, 1) prints out 1.

This is not a floating point oddity. 0.5 * 2 is 1.0 ! They are exactly the same.

Now, this euphoria code shows a true floating point oddity:

atom z 
z = 0 
for i = 1 to 10 do 
        z += 0.1 
end for 
? equal(z, 1)

That prints out 0. That is an example of a floating point oddity.

0.5 * 2 is exactly 1.0 ! This is not a floating point oddity.

If you don't believe me, there's no point in continuing this discussion.

8. Comment by DerekParnell Jan 12, 2011

Jim, it's not that I didn't believe you, but your example was in C Code so I wondered if Euphoria was different.

Matt, hash ( as you imply ) works with bit patterns as supplied by the caller without regard to the numerical value. It does this because I had to do something for Euphoria as hashing routines generally, if not exclusively, deal with byte values. I made it this way for speed and flexibility. Basically I treat integers and FP values as a set of 4 and 8 bytes respectively. Your implied suggestion that we treat FP arguments that can be stored in Euphoria integers as if they actually were integers, is an alternative, and valid, design decision. It would cost a tiny bit in performance to do the conversion and test but if you all feel that is justified then go ahead and make the change.

9. Comment by mattlewis Jan 12, 2011

Derek, I fully understand, and this wasn't meant as a criticism (at least not a personal one smile ). However, I do think that this is worth the performance hit. Certainly in Java, it's a requirement that two objects that are considered equal have the same has value. This is a pretty reasonable constraint, and makes the hash more predictable and useful to the end user.

10. Comment by jimcbrown Jan 12, 2011

Derek, you're right. Asking if the behavior might be different under Euphoria versus C is a valid (and very important) question.

As for the requirement that hash(x) = hash(y) if equal(x,y), this makes sense for the most part. There is a question of values (such as those returned by eumem or map or regex) which are meant to be opaque types, not integers or atoms but only represented as such. Additionally they have information encoded in them that's not always obvious or visible (right now just the delete_routine() information).

However, detecting and handling these special cases would require even more extra overhead, whereas simply treating them the same as 0.5 * 2 would not (or at least should not) break any code or functionality. So this might be more trouble than it's worth.

11. Comment by mattlewis Jan 13, 2011

See: hg:euphoria/rev/94f2ae7a1837

changeset: 4561:94f2ae7a1837 branch: 4.0 parent: 4498:3adb56cff9b1 user: Matt Lewis date: Thu Jan 13 06:34:55 2011 -0500 files: docs/release/4.0.1.txt source/be_runtime.c tests/t_hash.e description:

Hash doubles that store a valid euphoria integer as though they were actually the integer. This ensures that values that evaluate as equal() have the same hash value.
Fixes ticket 593

12. Comment by mattlewis Jan 13, 2011

See: hg:euphoria/rev/dce078a10f7c

changeset: 4562:dce078a10f7c tag: tip parent: 4560:3fa5fc56330a parent: 4561:94f2ae7a1837 user: Matt Lewis date: Thu Jan 13 06:37:04 2011 -0500 files: docs/release/4.0.1.txt source/be_runtime.c tests/t_hash.e description:

OpenEuphoria