1. Checksums, hashes, and digests, oh my!

I'm currently working on updating the internal hashing function to support most (hopefully all, eventually) secure hash algorithms like MD5, SHA1, SHA256, etc. Here are the issues I'm facing that I'm looking for feedback on:

Issue 1

The current code basically handles algorithms like Adler-32 and Fletcher-32, but those are really checksum algorithms, not hashing algorithms. Should hash() be renamed to checksum()? (There's already a checksum() in std/filesys.e, but that can be fixed with namespaces, e.g. hash:checksum() or filesys:checksum().) Furthermore, should hash() even be a built-in function like it is? It should probably be a machine function instead. Currently you have to include std/hash.e to get the algorithm values anyway. Then we could implement specific functions in std/hash.e like adler32() or sha256sum().
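
For reference, Adler-32 is simple enough to sketch in plain Euphoria. This is only an illustration of what a stand-alone adler32() could compute, assuming the input is a flat sequence of byte values. It is not the existing built-in implementation, and the constant name ADLER_MOD is made up for the example.

```euphoria
-- Illustrative sketch only: a pure-Euphoria Adler-32 over a flat sequence
-- of byte values (0..255). Not the interpreter's built-in implementation.
constant ADLER_MOD = 65521 -- largest prime below 2^16 (hypothetical name)

function adler32( sequence bytes )
    atom a = 1 -- running sum of the bytes, starting at 1
    atom b = 0 -- running sum of the successive values of a
    for i = 1 to length( bytes ) do
        a = remainder( a + bytes[i], ADLER_MOD )
        b = remainder( b + a, ADLER_MOD )
    end for
    return b * #10000 + a -- b is the high 16 bits, a is the low 16 bits
end function

? adler32( "Wikipedia" ) = #11E60398 -- prints 1
```

The result is a 32-bit value that catches accidental corruption but makes no attempt to resist deliberate collisions, which is the reason for calling it a checksum rather than a hash.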

Issue 2

Currently, hash() will operate on any value through some weird type punning and by traversing nested sequences recursively (though it does try to respect strings as a series of bytes). I don't particularly like this approach, but it's what makes std/map.e work on any key value so it should probably stay. The problem is that secure hash algorithms are meant to produce a message digest which assumes the input is always a series of bytes; the message digest of an integer or floating point value is undefined. Should we continue to use this type punning to coerce numbers into "messages" or should we error out when presented with anything but a sequence of bytes? We can do type punning inside std/map.e if needed.
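
If we go the "bytes only" route, the check itself could be a small user-defined type. The following is a rough sketch under that assumption; byte_string is a hypothetical name and nothing like it is claimed to exist in the standard library today.

```euphoria
-- Illustrative sketch only. byte_string is a hypothetical type name.
-- It accepts a flat sequence whose elements are all integers in 0..255
-- and rejects atoms, nested sequences, and fractional or out-of-range values.
type byte_string( object x )
    if not sequence( x ) then
        return 0
    end if
    for i = 1 to length( x ) do
        -- 'or' short-circuits inside an if condition, so the range tests
        -- only run once x[i] is known to be an integer
        if not integer( x[i] ) or x[i] < 0 or x[i] > 255 then
            return 0
        end if
    end for
    return 1
end type

? byte_string( "abc" )        -- prints 1
? byte_string( {1, 2.5, 3} )  -- prints 0
? byte_string( {1, {2}, 3} )  -- prints 0
? byte_string( 42 )           -- prints 0
```

A digest routine could declare its parameter with a type like this so that bad input fails the type check, while std/map.e would be free to convert arbitrary keys to bytes before calling it.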

Issue 3

I'm currently looking at using picohash for this, which itself is mostly a slimmed-down libtomcrypt. The problem is that it's a bit older and doesn't support the later SHA-2 algorithms (like SHA384/SHA512) or the more recent SHA-3. libtomcrypt does support those, and both libraries are in the public domain, but libtomcrypt is much bigger and it seems harder to build and use just the parts we need for hashing. Should I skip SHA384/SHA512/SHA-3 for now? Should I work on updating picohash to support those algorithms? (Shouldn't be terrible. It's quite modular, actually.) Should we use some other library altogether?

In summary:

  • Should hash() be renamed to checksum()?
  • Should hash() be a built-in or machine func?
  • Should we have algorithm functions like sha256sum()?
  • Should digest algorithms only operate on sequences of bytes?
  • Should we use picohash, libtomcrypt, or something else?
  • What digest algorithms should we bake into Euphoria?

I believe Derek wrote the original code for this so if he's around I'd really appreciate his insight.

-Greg


2. Re: Checksums, hashes, and digests, oh my!

Hi Greg

What are these for? How will they benefit an average Euser? Will a name change break a lot of stuff?

Starting questions.

Cheers

Chris


3. Re: Checksums, hashes, and digests, oh my!

ChrisB said...

What are these for?

Cryptographic hash functions are used to verify the integrity or content of messages and files. They're typically used to store a one-way hash for validating passwords, or to check the integrity of a downloaded file.

ChrisB said...

How will they benefit an average Euser?

This is one of those features that nearly anyone can and should be using when needed. Providing a standard, base-line implementation ensures most users don't have to hunt for code and re-invent the wheel.

ChrisB said...

Will a name change break a lot of stuff?

I don't think so. We can accommodate the change in the standard library so std/map.e, etc. aren't affected, mark hash() as deprecated in std/hash.e, and then remove it entirely in a later version.

So std/hash.e would end up looking something like this:

constant 
    M_CHECKSUM = 98, 
    M_CALCHASH = 99 
 
public enum ADLER32, FLETCHER32, ... 
 
deprecate -- this will issue a warning if still used 
public function hash( object data_in, integer algo ) 
    return machine_func( M_CHECKSUM, {data_in,algo} ) 
end function 
 
public function checksum( object data_in, integer algo ) 
    return machine_func( M_CHECKSUM, {data_in,algo} ) 
end function 
 
public function adler32( object data_in ) 
    return machine_func( M_CHECKSUM, {data_in,ADLER32} ) 
end function 
 
public function fletcher32( object data_in ) 
    return machine_func( M_CHECKSUM, {data_in,FLETCHER32} ) 
end function 
 
-- etc. 
 
public enum MD5, SHA1, SHA256, ... 
 
public function calc_hash( sequence data_in, integer algo ) 
    return machine_func( M_CALCHASH, {data_in,algo} ) 
end function 
 
public function md5sum( sequence data_in ) 
    return machine_func( M_CALCHASH, {data_in,MD5} ) 
end function 
 
public function sha1sum( sequence data_in ) 
    return machine_func( M_CALCHASH, {data_in,SHA1} ) 
end function 
 
-- etc. 

There are plenty more algorithms that we could implement, but I think when we go from "common message digests" to "actual data cryptography" we should focus on providing an external library.

-Greg


4. Re: Checksums, hashes, and digests, oh my!

Hi Greg

TBH, you should probably go for the one that is best for you / quick and dirty to implement, and then if anyone wants or needs a more specialist one, or if you want to spend more time on a superior one, do it as a secondary project. In my very humble opinion, I'm not sure this will add a lot to the eu ecosystem at this point. Maybe even leave hooks for alternative methods, much like the database library that is around somewhere.

Cheers

Chris


5. Re: Checksums, hashes, and digests, oh my!

Just so you know,

Phix begrudgingly(/not an exact match for OE) supports hash(x,HSIEH30), in builtins\hash.e, for reasons lost in the mists of time, and no other values for algo.
(There is also a half-baked and probably quite long dead builtins\phash.e...)

Otherwise it has (not as autoincludes)
builtins\sha256.e (and [uniquely in this list] a separate hand-crafted pwa\builtins\sha256.js)
builtins\sha512.e
builtins\sha1.e
builtins\hmac.e
builtins\md5.e
builtins\ripemd160.e
along with a few other crc32|md4|md5 bits 'n pieces in demo\rosetta.

I would say it feels wrong to me to bundle all such hash algorithms in a single file, as most apps likely only need one, and while the compiler can help a bit, it must be simpler to swap out one for another, without breaking some other unrelated program, ditto ship sources, when they are all in separate files. To my mind a builtins\hash.e might|should perhaps contain some common helper routines, but nothing else.
