www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - std.hash: More questions

reply Johannes Pfau <nospam example.com> writes:
Code:
https://github.com/D-Programming-Language/phobos/pull/646
Docs:
http://dl.dropbox.com/u/24218791/d/phobos/std_hash_hash.html
http://dl.dropbox.com/u/24218791/d/phobos/std_hash_crc.html
http://dl.dropbox.com/u/24218791/d/phobos/std_hash_md.html
http://dl.dropbox.com/u/24218791/d/phobos/std_hash_sha.html

I just had another look at my initial std.hash design, and I realized
that the API could be simplified a little:

There's a reset function that's implemented in every hash. For sha1,
md5, crc32 it only forwards to the start function though. So I'm not
sure how useful this function is or if it should be dropped.

Advantages of keeping it:
* 'reset' better documents what's done than 'start' if the hash has
  already processed data
* Are there hashes which can implement a reset function in a faster way
  than calling start again?

Cons:
* Adds an additional function which probably isn't necessary


The start function is probably not needed as well. Tango doesn't have a
start function or something similar, but it could use constructors for
this (I only looked at docs, not code). We can't use constructors, so a
start function would be necessary for advanced initialization. But do
we actually need that advanced initialization? SHA1, MD5 and CRC32 just
do a "this = typeof(this).init" so a start function isn't necessary
here.

Advantages of keeping it:
* Are there hash algorithms which need some sort of complex
  initialization which can't be done with .init / default values?
* If we drop both start and reset the only way to reset the internal
  state is calling finish. This might be a little less efficient than a
  start/reset method.

Advantages of dropping it:
* Using hashes is easier, no need to call 'start' before hashing data

I think someone more familiar with hash functions than me needs to
answer the "do we need start/reset functions" questions.


API question:

CRC32 sums are usually presented as a uint, not a ubyte[4]. To fit the
rest of the API ubyte[4] is used. Now there's a small annoying detail:
The CRC32 should be printed in LSB-first order.
When printing an uint like this, that works well:
writefln("%#x", 4157704578); //0xf7d18982
but this doesn't:
toHexString(*cast(ubyte[4]*)&4157704578); //8289D1F7

I can't change toHexString as it's used for all hashes and it's correct
for SHA1, MD5, ...
So I currently use bswap in the CRC32 finish() implementation to fix
this issue.

Now the question is should I provide an additional finishUint function
which avoids the bswap? 


Implementation issue:

The current implementation of SHA1 and MD5 uses memcpy which doesn't
work in CTFE IIRC and which also prevents the code from being pure.
I could replace those memcpy calls with array copying but I'm not
sure if memcpy was used for performance, so I'd like to keep it as long
as we have no performance tests.
Jul 04 2012
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 04-Jul-12 18:58, Johannes Pfau wrote:
 Code:
 https://github.com/D-Programming-Language/phobos/pull/646
 Docs:
 http://dl.dropbox.com/u/24218791/d/phobos/std_hash_hash.html
 http://dl.dropbox.com/u/24218791/d/phobos/std_hash_crc.html
 http://dl.dropbox.com/u/24218791/d/phobos/std_hash_md.html
 http://dl.dropbox.com/u/24218791/d/phobos/std_hash_sha.html

 I just had another look at my initial std.hash design, and I realized
 that the API could be simplified a little:

 There's a reset function that's implemented in every hash. For sha1,
 md5, crc32 it only forwards to the start function though. So I'm not
 sure how useful this function is or if it should be dropped.

 Advantages of keeping it:
 * 'reset' better documents what's done than 'start' if the hash has
    already processed data
 * Are there hashes which can implement a reset function in a faster way
    than calling start again?

 Cons:
 * Adds an additional function which probably isn't necessary


 The start function is probably not needed as well. Tango doesn't have a
 start function or something similar, but it could use constructors for
 this (I only looked at docs, not code).

The only thing I can think of that would require start function is using unconventional initial vectors.
 We can't use constructors, so a
 start function would be necessary for advanced initialization. But do
 we actually need that advanced initialization? SHA1, MD5 and CRC32 just
 do a "this = typeof(this).init" so a start function isn't necessary
 here.

 Advantages of keeping it:
 * Are there hash algorithms which need some sort of complex
    initialization which can't be done with .init / default values?
 * If we drop both start and reset the only way to reset the internal
    state is calling finish. This might be a little less efficient than a
    start/reset method.

 Advantages of dropping it:
 * Using hashes is easier, no need to call 'start' before hashing data

 I think someone more familiar with hash functions than me needs to
 answer the "do we need start/reset functions" questions.


 API question:

 CRC32 sums are usually presented as a uint, not a ubyte[4]. To fit the
 rest of the API ubyte[4] is used. Now there's a small annoying detail:
 The CRC32 should be printed in LSB-first order.

 When printing an uint like this, that works well:
 writefln("%#x", 4157704578); //0xf7d18982
 but this doesn't:
 toHexString(*cast(ubyte[4]*)&4157704578); //8289D1F7

There is no problem it's just order of printing that at fault. So I suggest to *stop* doing a bswap. It's just that printing something as an array of ubytes does it from least significant byte to most significant. You could try to add MSB/LSB first options to toHexString.
 I can't change toHexString as it's used for all hashes and it's correct
 for SHA1, MD5, ...
 So I currently use bswap in the CRC32 finish() implementation to fix
 this issue.

 Now the question is should I provide an additional finishUint function
 which avoids the bswap?


 Implementation issue:

 The current implementation of SHA1 and MD5 uses memcpy which doesn't
 work in CTFE IIRC and which also prevents the code from being pure.
 I could replace those memcpy calls with array copying but I'm not
 sure if memcpy was used for performance, so I'd like to keep it as long
 as we have no performance tests.

ptr1[x..y] = ptr2[x2..y2]; note that it's better to have them be pointers as it avoid bounds check & D runtime magic. If need be I can provide benchmarks but I'm certain from the days of optimizing std.regex that it's faster or on par with memcpy. -- Dmitry Olshansky
Jul 05 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08-Jul-12 17:09, Johannes Pfau wrote:
 Am Fri, 06 Jul 2012 01:24:04 +0400
 schrieb Dmitry Olshansky <dmitry.olsh gmail.com>:

 The only thing  I can think of that would require start function is
 using unconventional initial vectors.

Those could be done as template parameters though? (If the hash is written as a templated struct).

Well probably, but it will lead to code duplication for no real benefit. Maybe it'll be faster with constant vectors, but I'm not so sure.
 But e.g. OpenSSL has *_init functions as well, so we probably should
 keep the start function even if it's just to allow wrappers for
 OpenSSL?

 CRC32 sums are usually presented as a uint, not a ubyte[4]. To fit
 the rest of the API ubyte[4] is used. Now there's a small annoying
 detail: The CRC32 should be printed in LSB-first order.


The rosettacode.org site which I used to verify the CRC32 results said LSB-first but it seems it only describes the data layout of the uint value (Little Endian). The printf/writef result is indeed MSB-first.
 When printing an uint like this, that works well:
 writefln("%#x", 4157704578); //0xf7d18982
 but this doesn't:
 toHexString(*cast(ubyte[4]*)&4157704578); //8289D1F7

There is no problem it's just order of printing that at fault. So I suggest to *stop* doing a bswap. It's just that printing something as an array of ubytes does it from least significant byte to most significant. You could try to add MSB/LSB first options to toHexString.

Yes, but that's not very intuitive. Most people would expect the same result (by default) that other languages provide: http://rosettacode.org/wiki/CRC-32 I'll add the order option to toHexString but I think I'll also add an alias crcToHexString/crcHexString or something like that.
 I can't change toHexString as it's used for all hashes and it's
 correct for SHA1, MD5, ...
 So I currently use bswap in the CRC32 finish() implementation to fix
 this issue.

 Now the question is should I provide an additional finishUint
 function which avoids the bswap?


 Implementation issue:

 The current implementation of SHA1 and MD5 uses memcpy which doesn't
 work in CTFE IIRC and which also prevents the code from being pure.
 I could replace those memcpy calls with array copying but I'm not
 sure if memcpy was used for performance, so I'd like to keep it as
 long as we have no performance tests.

ptr1[x..y] = ptr2[x2..y2]; note that it's better to have them be pointers as it avoid bounds check & D runtime magic. If need be I can provide benchmarks but I'm certain from the days of optimizing std.regex that it's faster or on par with memcpy.

OK great, pure is working. CTFE not yet, but that can be added later. Do we want to add 'pure' as part of the functions in the Digest interface? This would require all implementations to be pure, I don't know if that's a good idea right now.

Some implementations may choose to call into kernel for respective crypto-primitives. I'd say no need to slap pure on top of it in a harry. -- Dmitry Olshansky
Jul 08 2012
prev sibling parent Johannes Pfau <nospam example.com> writes:
Am Fri, 06 Jul 2012 01:24:04 +0400
schrieb Dmitry Olshansky <dmitry.olsh gmail.com>:

 
 The only thing  I can think of that would require start function is 
 using unconventional initial vectors.
 

Those could be done as template parameters though? (If the hash is written as a templated struct). But e.g. OpenSSL has *_init functions as well, so we probably should keep the start function even if it's just to allow wrappers for OpenSSL?
 CRC32 sums are usually presented as a uint, not a ubyte[4]. To fit
 the rest of the API ubyte[4] is used. Now there's a small annoying
 detail: The CRC32 should be printed in LSB-first order.


The rosettacode.org site which I used to verify the CRC32 results said LSB-first but it seems it only describes the data layout of the uint value (Little Endian). The printf/writef result is indeed MSB-first.
 
 When printing an uint like this, that works well:
 writefln("%#x", 4157704578); //0xf7d18982
 but this doesn't:
 toHexString(*cast(ubyte[4]*)&4157704578); //8289D1F7

There is no problem it's just order of printing that at fault. So I suggest to *stop* doing a bswap. It's just that printing something as an array of ubytes does it from least significant byte to most significant. You could try to add MSB/LSB first options to toHexString.

Yes, but that's not very intuitive. Most people would expect the same result (by default) that other languages provide: http://rosettacode.org/wiki/CRC-32 I'll add the order option to toHexString but I think I'll also add an alias crcToHexString/crcHexString or something like that.
 
 I can't change toHexString as it's used for all hashes and it's
 correct for SHA1, MD5, ...
 So I currently use bswap in the CRC32 finish() implementation to fix
 this issue.

 Now the question is should I provide an additional finishUint
 function which avoids the bswap?


 Implementation issue:

 The current implementation of SHA1 and MD5 uses memcpy which doesn't
 work in CTFE IIRC and which also prevents the code from being pure.
 I could replace those memcpy calls with array copying but I'm not
 sure if memcpy was used for performance, so I'd like to keep it as
 long as we have no performance tests.

ptr1[x..y] = ptr2[x2..y2]; note that it's better to have them be pointers as it avoid bounds check & D runtime magic. If need be I can provide benchmarks but I'm certain from the days of optimizing std.regex that it's faster or on par with memcpy.

OK great, pure is working. CTFE not yet, but that can be added later. Do we want to add 'pure' as part of the functions in the Digest interface? This would require all implementations to be pure, I don't know if that's a good idea right now.
Jul 08 2012