Checksums, hashes, and digests, oh my!


I'm currently working on updating the internal hashing function to support most (hopefully all, eventually) secure hash algorithms like MD5, SHA1, SHA256, etc. Here are the issues I'm facing that I'm looking for feedback on:

Issue 1

The current code mostly handles algorithms like Adler-32 and Fletcher-32, but those are really checksum algorithms, not hashing algorithms. Should hash() be renamed to checksum()? (There's already a checksum() in std/filesys.e, but that can be fixed with namespaces, e.g. hash:checksum() or filesys:checksum().) Furthermore, should hash() even be a built-in function like it is now? It should probably be a machine function instead. Currently you have to include std/hash.e to get the algorithm values anyway. Then we could implement specific functions in std/hash.e like adler32() or sha256sum().
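To make the checksum-versus-hash distinction concrete, here is a minimal sketch of Adler-32 in Python (not the Euphoria backend code, which I'm not quoting here). It is just two running sums modulo a prime, which is why it is cheap but offers no resistance to deliberate collisions. The function name is only for illustration.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16


def adler32(data: bytes) -> int:
    """Adler-32 over a byte string: two running sums packed into 32 bits."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER  # sum of the bytes, plus one
        b = (b + a) % MOD_ADLER     # sum of the intermediate values of a
    return (b << 16) | a


if __name__ == "__main__":
    sample = b"Wikipedia"
    # Cross-check against zlib's implementation of the same algorithm.
    print(hex(adler32(sample)), adler32(sample) == zlib.adler32(sample))
```

Note that the input here is already a plain series of bytes; nothing about the algorithm says what to do with an integer or a nested sequence.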

Issue 2

Currently, hash() will operate on any value through some weird type punning and by traversing nested sequences recursively (though it does try to respect strings as a series of bytes). I don't particularly like this approach, but it's what makes std/map.e work on any key value, so it should probably stay. The problem is that secure hash algorithms are meant to produce a message digest, which assumes the input is always a series of bytes; the message digest of an integer or floating-point value is undefined. Should we continue to use this type punning to coerce numbers into "messages", or should we error out when presented with anything but a sequence of bytes? We can do the type punning inside std/map.e if needed.
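A rough sketch of what the split could look like, written in Python for brevity. Everything here is hypothetical: strict_digest(), encode_key(), and key_hash() are names I made up, and the tagged encoding is just one possible way to flatten a nested value, not what hash() does today. The idea is that the digest function refuses anything that isn't bytes, while the map-side helper does its own explicit flattening before checksumming.

```python
import hashlib
import struct
import zlib


def strict_digest(message) -> str:
    """Digest policy: accept only a series of bytes, reject everything else."""
    if not isinstance(message, (bytes, bytearray)):
        raise TypeError("digest input must be a sequence of bytes")
    return hashlib.sha256(bytes(message)).hexdigest()


def encode_key(value) -> bytes:
    """Hypothetical tagged encoding that flattens a nested key into bytes."""
    if isinstance(value, bool):
        raise TypeError("unsupported key type: bool")
    if isinstance(value, int):
        # Assumes the integer fits in 64 bits; struct.error is raised otherwise.
        return b"i" + struct.pack("<q", value)
    if isinstance(value, float):
        return b"f" + struct.pack("<d", value)
    if isinstance(value, (bytes, bytearray)):
        return b"b" + struct.pack("<I", len(value)) + bytes(value)
    if isinstance(value, (list, tuple)):
        body = b"".join(encode_key(item) for item in value)
        return b"s" + struct.pack("<I", len(value)) + body
    raise TypeError(f"unsupported key type: {type(value).__name__}")


def key_hash(value) -> int:
    """Map-side policy: flatten first, then apply a cheap checksum."""
    return zlib.adler32(encode_key(value))


if __name__ == "__main__":
    print(strict_digest(b"abc"))
    print(key_hash([1, [2.5, b"x"]]))
```

One thing this sketch does not settle: it encodes 1 and 1.0 differently, so keys that compare equal would need to be normalized before encoding if they are supposed to land in the same bucket.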

Issue 3

I'm currently looking at using picohash for this, which is mostly a slimmed-down libtomcrypt. The problem is that it's a bit older and doesn't support the later SHA-2 algorithms (like SHA384/SHA512) or the more recent SHA-3. libtomcrypt does support those, and both libraries are in the public domain, but libtomcrypt is much bigger and it seems harder to build and use just the parts we need for hashing. Should I skip SHA384/SHA512/SHA-3 for now? Should I work on updating picohash to support those algorithms? (Shouldn't be terrible. It's quite modular, actually.) Should we use some other library altogether?

In summary:

  • Should hash() be renamed to checksum()?
  • Should hash() be a built-in or machine func?
  • Should we have algorithm functions like sha256sum()?
  • Should digest algorithms only operate on sequences of bytes?
  • Should we use picohash, libtomcrypt, or something else?
  • What digest algorithms should we bake into Euphoria?
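To illustrate the "one generic entry point plus named wrappers" layout from Issue 1, here is a small Python sketch. The names hash_bytes(), adler32(), and sha256sum() and the string algorithm identifiers are assumptions for illustration only; in Euphoria the generic routine would be the machine function and the wrappers would live in std/hash.e.

```python
import hashlib
import zlib

# Hypothetical algorithm identifiers, standing in for the constants in std/hash.e.
ADLER32 = "adler32"
SHA1 = "sha1"
SHA256 = "sha256"


def hash_bytes(data: bytes, algorithm: str) -> bytes:
    """Generic entry point: dispatch on the algorithm, always return raw bytes."""
    if algorithm == ADLER32:
        return zlib.adler32(data).to_bytes(4, "big")
    if algorithm in (SHA1, SHA256):
        return hashlib.new(algorithm, data).digest()
    raise ValueError(f"unknown algorithm: {algorithm}")


def adler32(data: bytes) -> int:
    """Named wrapper: a checksum is most useful as an integer."""
    return int.from_bytes(hash_bytes(data, ADLER32), "big")


def sha256sum(data: bytes) -> str:
    """Named wrapper: a digest is most useful as a hex string."""
    return hash_bytes(data, SHA256).hex()


if __name__ == "__main__":
    print(adler32(b"Wikipedia"))
    print(sha256sum(b"abc"))
```

With this layout, adding an algorithm means adding one branch to the dispatcher and one thin wrapper, and callers never need to know the numeric algorithm values.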

I believe Derek wrote the original code for this so if he's around I'd really appreciate his insight.

-Greg
