digitalmars.D - std.hash design

Johannes Pfau <nospam example.com> writes:

Pull requests #221 and #585 both introduce modules for a new std.hash
package. As those already change API compared to the old std.crc32 and
std.md5 modules we should probably decide on a common interface for all
std.hash modules.

These are, IMHO, the most important questions:

Free function (std.crc32) vs object(std.md5) interface
-----------------------------------------------------
I think we need an object-based interface anyway, as md5, sha-1 etc. have
too much state to pass around conveniently.

Structs and templates vs. classes and interfaces
-----------------------------------------------------
It's common to use a hash in a limited scope (like functions). So
allocating on the stack is important which favors the struct approach.
However, classes could also be allocated on the stack with scoped.

Classes+interfaces have the benefit that we could provide different
_ABI_ compatible implementations. E.g. MD5 hashes could be implemented
with D/OpenSSL wrapper/windows crypto API and we could even add a
configure switch to phobos to choose the default implementation.
Doing the same with structs likely only gives us API compatibility, so
switching the default implementation in phobos could cause trouble.

Basic design:
---------------
If we implement an object-based interface (struct/class), it should
probably be an output range. Something like this:

struct/interface Hash
{
    void put(const(ubyte)[] data);
    void put(ubyte data);
    void start(); //initialize
    void reset(); //reset
    ubyte[] finish(ref ubyte[] buffer = null); //See below
    enum size_t hashLength; //optional? See below
}

The finish function signature is a little controversial. The length of
the result differs between hash implementations. For structs+templates
we could use static arrays, but for classes+interface we'd have to use
dynamic arrays.
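
To make the struct variant concrete, here is a minimal sketch of a struct that follows the output-range shape above and returns a static array from finish(). It assumes a toy 32-bit FNV-1a hash; the name Fnv1a32 is hypothetical and is not one of the proposed std.hash modules.

```d
import std.bitmanip : nativeToBigEndian;
import std.range : isOutputRange;
import std.string : representation;

// Hypothetical toy digest (32-bit FNV-1a), used only to illustrate the API shape.
struct Fnv1a32
{
    enum size_t hashLength = 4;        // digest size in bytes, known at compile time

    private uint state = 2166136261u;  // FNV offset basis

    // (Re)initialize the internal state.
    void start() { state = 2166136261u; }

    // Feed a single byte.
    void put(ubyte data)
    {
        state ^= data;
        state *= 16777619u;            // FNV prime; wraps modulo 2^32
    }

    // Feed a slice of bytes.
    void put(const(ubyte)[] data)
    {
        foreach (b; data)
            put(b);
    }

    // Return the digest as a static array (big-endian) and reset the state.
    ubyte[hashLength] finish()
    {
        auto result = nativeToBigEndian(state);
        start();
        return result;
    }
}

// The struct is usable wherever an output range of bytes is expected.
static assert(isOutputRange!(Fnv1a32, ubyte));
static assert(isOutputRange!(Fnv1a32, const(ubyte)[]));

unittest
{
    Fnv1a32 h;
    h.put("a".representation);
    ubyte[4] expected = [0xe4, 0x0c, 0x29, 0x2c];
    assert(h.finish() == expected);
}
```

Because the result length is part of the type, no allocation is needed. A class-based interface cannot express this, which is why it would fall back to ubyte[].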

toString doesn't make sense on a hash, as finish() has to be called
before a string can be generated. So a helper function could be useful.

Another open question is whether we should support 'bit hashing'.
Have a look at Piotr Szturmaj's implementation for details: 
https://github.com/pszturmaj/phobos/blob/master/std/crypto/hash/base.d

Jun 22 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 22-Jun-12 13:11, Johannes Pfau wrote:

 Pull requests #221 and #585 both introduce modules for a new std.hash
 package. As those already change API compared to the old std.crc32 and
 std.md5 modules we should probably decide on a common interface for all
 std.hash modules.

 These are the imho most important questions:

 Free function (std.crc32) vs object(std.md5) interface
 -----------------------------------------------------
 I think we need a object based interface anyway as md5, sha-1 etc have
 too much state to pass it around conveniently.

Yup.

 Structs and templates vs. classes and interfaces
 -----------------------------------------------------
 It's common to use a hash in a limited scope (like functions). So
 allocating on the stack is important which favors the struct approach.

That's useful especially if you use one type of hash always. (e.g. SHA-1 
a-la git)

 However, classes could also be allocated on the stack with scoped.

Too shitty for widespread use. +Extra indirection on each block(?).

 Classes+interfaces have the benefit that we could provide different
 _ABI_ compatible implementations. E.g. MD5 hashes could be implemented
 with D/OpenSSL wrapper/windows crypto API and we could even add a
 configure switch to phobos to choose the default implementation.
 Doing the same with structs likely only gives us API compatibility, so
 switching the default implementation in phobos could cause trouble.

3rd option. Provide an interface/class-based polymorphic wrapper on top of 
structs. Come on! It's D, we can find a reasonable compromise.

 Basic design:
 ---------------
 If we'll implement an object based interface (struct/class), it should
 probably be an output range. Something like this:

 struct/interface Hash
 {
      void put(const(ubyte)[] data);
      void put(ubyte data);
      void start(); //initialize
      void reset(); //reset
      ubyte[] finish(ref ubyte[] buffer = null); //See below
      enum size_t hashLength; //optional? See below
 }

 The finish function signature is a little controversial. The length of
 the result differs between hash implementations. For structs+templates
 we could use static arrays, but for classes+interface we'd have to use
 dynamic arrays.

 toString doesn't make sense on a hash, as finish() has to be called
 before a string can be generated. So a helper function could be useful.

 Another open question is whether we should support 'bit hashing'.
 Have a look at Piotr Szturmaj's implementation for details:
 https://github.com/pszturmaj/phobos/blob/master/std/crypto/hash/base.d


-- 
Dmitry Olshansky

Jun 22 2012

Johannes Pfau <nospam example.com> writes:

On Fri, 22 Jun 2012 13:33:41 +0400
Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 On 22-Jun-12 13:11, Johannes Pfau wrote:

 Pull requests #221 and #585 both introduce modules for a new std.hash
 package. As those already change API compared to the old std.crc32
 and std.md5 modules we should probably decide on a common interface
 for all std.hash modules.

 These are the imho most important questions:

 Free function (std.crc32) vs object(std.md5) interface
 -----------------------------------------------------
 I think we need a object based interface anyway as md5, sha-1 etc
 have too much state to pass it around conveniently.

 
 Yup.
 
 Structs and templates vs. classes and interfaces
 -----------------------------------------------------
 It's common to use a hash in a limited scope (like functions). So
 allocating on the stack is important which favors the struct
 approach.

 
 That's useful especially if you use one type of hash always. (e.g.
 SHA-1 a-la git)
 
 However, classes could also be allocated on the stack with scoped.

 
 Too shitty for widespread use. +Extra indirection on each block(?).
 
 Classes+interfaces have the benefit that we could provide different
 _ABI_ compatible implementations. E.g. MD5 hashes could be
 implemented with D/OpenSSL wrapper/windows crypto API and we could
 even add a configure switch to phobos to choose the default
 implementation. Doing the same with structs likely only gives us
 API compatibility, so switching the default implementation in
 phobos could cause trouble.

 
 3rd option. Provide interface/class based polymorphic wrapper on top
 of structs. Come on! It's D, we can find reasonable compromise.

Yeah we could indeed do that. This complicates the API a little, but
it's probably still the best solution. I think I'll implement the
structs first, the polymorphic wrapper can be added later.

Jun 22 2012

"Regan Heath" <regan netmail.co.nz> writes:

On Fri, 22 Jun 2012 10:11:10 +0100, Johannes Pfau <nospam example.com>  
wrote:


 Pull requests #221 and #585 both introduce modules for a new std.hash
 package. As those already change API compared to the old std.crc32 and
 std.md5 modules we should probably decide on a common interface for all
 std.hash modules.

 These are the imho most important questions:

 Free function (std.crc32) vs object(std.md5) interface
 -----------------------------------------------------
 I think we need a object based interface anyway as md5, sha-1 etc have
 too much state to pass it around conveniently.

 Structs and templates vs. classes and interfaces
 -----------------------------------------------------
 It's common to use a hash in a limited scope (like functions). So
 allocating on the stack is important which favors the struct approach.
 However, classes could also be allocated on the stack with scoped.

 Classes+interfaces have the benefit that we could provide different
 _ABI_ compatible implementations. E.g. MD5 hashes could be implemented
 with D/OpenSSL wrapper/windows crypto API and we could even add a
 configure switch to phobos to choose the default implementation.
 Doing the same with structs likely only gives us API compatibility, so
 switching the default implementation in phobos could cause trouble.

 Basic design:
 ---------------
 If we'll implement an object based interface (struct/class), it should
 probably be an output range. Something like this:

 struct/interface Hash
 {
     void put(const(ubyte)[] data);
     void put(ubyte data);
     void start(); //initialize
     void reset(); //reset
     ubyte[] finish(ref ubyte[] buffer = null); //See below
     enum size_t hashLength; //optional? See below
 }

 The finish function signature is a little controversial. The length of
 the result differs between hash implementations. For structs+templates
 we could use static arrays, but for classes+interface we'd have to use
 dynamic arrays.

It might help (or it might not) to have a glance at the "design" of the  
hashing routines in Tango:
http://www.dsource.org/projects/tango/docs/current/
(see tango.util.digest etc)

I contributed some of the initial code for these, though it has since  
evolved a lot.  I started with structs, mirroring the phobos MD5 code but  
used all sorts of unnecessary mixins to get the code reuse I wanted.  The  
result was ugly :p

Later someone contacted me about it and wanted a class-based approach, so  
I did some refactoring and the result was much cleaner.  I'm not trying to  
say that a struct approach cannot be clean, just that I did a bad job of  
it initially. Structs also don't lend themselves to the factory  
pattern, which is a nice way to use hashing.

As Dmitry has said, we can likely get the best of both worlds with classes  
wrapping structs or similar.
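
A rough sketch of what "classes wrapping structs" plus a factory might look like. All names here (Digest, WrapperDigest, XorDigest, Sum16, makeDigest) are hypothetical, and the two "hashes" are toy stand-ins rather than real algorithms.

```d
// Hypothetical polymorphic interface; the result length varies, so finish()
// returns a dynamic array.
interface Digest
{
    void put(const(ubyte)[] data);
    ubyte[] finish();
    @property size_t length() const; // digest size in bytes
}

// Toy struct digest: XOR of all input bytes.
struct XorDigest
{
    enum size_t hashLength = 1;
    private ubyte state;

    void put(const(ubyte)[] data) { foreach (b; data) state ^= b; }

    ubyte[hashLength] finish()
    {
        ubyte[hashLength] r = [state];
        state = 0;
        return r;
    }
}

// Toy struct digest: 16-bit sum of all input bytes, big-endian.
struct Sum16
{
    enum size_t hashLength = 2;
    private ushort state;

    void put(const(ubyte)[] data) { foreach (b; data) state += b; }

    ubyte[hashLength] finish()
    {
        ubyte[hashLength] r = [cast(ubyte)(state >> 8), cast(ubyte)(state & 0xff)];
        state = 0;
        return r;
    }
}

// Generic adapter: exposes any such struct through the Digest interface.
final class WrapperDigest(T) : Digest
{
    private T impl;

    void put(const(ubyte)[] data) { impl.put(data); }

    ubyte[] finish()
    {
        auto r = impl.finish();
        return r[].dup; // copy the static array to the heap
    }

    @property size_t length() const { return T.hashLength; }
}

// Factory: choose the implementation by name at run time.
Digest makeDigest(string name)
{
    switch (name)
    {
        case "xor": return new WrapperDigest!XorDigest;
        case "sum": return new WrapperDigest!Sum16;
        default:    throw new Exception("unknown digest: " ~ name);
    }
}

unittest
{
    auto d = makeDigest("sum");
    d.put([1, 2, 3]);
    ubyte[] expected = [0, 6];
    assert(d.length == 2);
    assert(d.finish() == expected);
}
```

The structs stay usable on their own for stack allocation; only callers that need run-time selection pay for the class and the heap copy in finish().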

 toString doesn't make sense on a hash, as finish() has to be called
 before a string can be generated. So a helper function could be useful.

toString() could output the intermediate/internal state at the time of the  
call, which if called after "finish" would be the hash result.  I can't  
recall if this has any specific usefulness, tho I have a nagging/niggling  
itch which says I did use this intermediate result for something at some  
stage.

It might be useful to have toString on a hash so that we can pass a  
completed hash object around and repeatedly obtain the string  
representation vs obtaining it once on "finish" and passing the string  
around.  However, that said, it's probably more secure to destroy and  
scrub the memory used by the hash object ASAP and only retain the  
resulting string or ubyte[] result.

I think I've talked myself round in a circle.. I think if we have a way to  
obtain the current state as ubyte[] that would satisfy the niggle I have.   
Having a separate routine for turning a ubyte[] into a hex string is  
probably better than attaching toString to a hash object.

R

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/

Jun 22 2012

Johannes Pfau <nospam example.com> writes:

On Fri, 22 Jun 2012 12:03:27 +0100
"Regan Heath" <regan netmail.co.nz> wrote:

 
 It might help (or it might not) to have a glance at the "design" of
 the hashing routines in Tango:
 http://www.dsource.org/projects/tango/docs/current/
 (see tango.util.digest etc)
 
 I contributed some of the initial code for these, though it has
 since evolved a lot.  I started with structs, mirroring the phobos
 MD5 code but used all sorts of unnecessary mixins to get the code
 reuse I wanted.  The result was ugly :p
 
 Later someone contacted me about it, and wanted a class based
 approach so I did some refactoring and the result was much cleaner.
 I'm not trying to say that a struct approach cannot be clean, just
 that I did a bad job of it initially, and also structs don't lend
 themselves to the factory pattern though which is a nice way to use
 hashing.

I had a short look at Piotr Szturmaj's sha implementations, and it
seems this kind of code would benefit a lot from inheritance. I
understand that it was probably impossible to do this in D1, but don't
you think 'alias this' could work in D2? This wouldn't solve the
problem with the factory pattern, but that can be solved by providing
wrapper classes.

 
 As Dmitry has said, we can likely get the best of both worlds with
 classes wrapping structs or similar.

Yep, although classes wrapping structs doesn't help code reuse. But
alias this should hopefully work for that.
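
As a small illustration of the 'alias this' idea, assuming a toy base digest and a truncated variant of it (loosely analogous to how one SHA variant can reuse another). Sum16 and Sum16Low8 are hypothetical names, not real algorithms.

```d
// Toy base digest: 16-bit sum of all input bytes, big-endian result.
struct Sum16
{
    enum size_t hashLength = 2;
    private ushort state;

    void put(const(ubyte)[] data) { foreach (b; data) state += b; }

    ubyte[hashLength] finish()
    {
        ubyte[hashLength] r = [cast(ubyte)(state >> 8), cast(ubyte)(state & 0xff)];
        state = 0;
        return r;
    }
}

// Truncated variant: reuses Sum16.put through alias this and replaces only
// hashLength and finish. Members declared here take precedence over the
// forwarded ones.
struct Sum16Low8
{
    enum size_t hashLength = 1;

    Sum16 base;
    alias base this;

    ubyte[hashLength] finish()
    {
        auto full = base.finish();
        ubyte[hashLength] r = [full[1]]; // keep the low byte only
        return r;
    }
}

unittest
{
    Sum16Low8 h;
    h.put([0xff, 0x02]);          // forwarded to base.put; the sum is 0x0101
    ubyte[1] expected = [0x01];
    assert(h.finish() == expected);
    static assert(Sum16Low8.hashLength == 1);
}
```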

 
 toString doesn't make sense on a hash, as finish() has to be called
 before a string can be generated. So a helper function could be
 useful.

 
 toString() could output the intermediate/internal state at the time
 of the call, which if called after "finish" would be the hash
 result.  I can't recall if this has any specific usefulness, tho I
 have a nagging/niggling itch which says I did use this intermediate
 result for something at some stage.
 
 It might be useful to have toString on a hash so that we can pass a  
 completed hash object around and repeatedly obtain the string  
 representation vs obtaining it once on "finish" and passing the
 string around.  However, that said, it's probably more secure to
 destroy and scrub the memory used by the hash object ASAP and only
 retain the resulting string or ubyte[] result.
 
 I think I've talked myself round in a circle.. I think if we have a
 way to obtain the current state as ubyte[] that would satisfy the
 niggle I have. Having a separate routine for turning a ubyte[] into a
 hex string is probably better than attaching toString to a hash
 object.

We could also provide a finishString function or something like that.
But toString returning an intermediate state would be confusing.

Tango doesn't seem to offer a way to peek at the current state. But if
it's really useful, it could be added.
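
With a struct-based design, peeking may not need dedicated API at all. A minimal sketch, assuming the digest struct is a plain value type with no indirections; XorDigest and peek are hypothetical names.

```d
// Toy struct digest: XOR of all input bytes.
struct XorDigest
{
    private ubyte state;

    void put(const(ubyte)[] data) { foreach (b; data) state ^= b; }

    ubyte[1] finish()
    {
        ubyte[1] r = [state];
        state = 0;
        return r;
    }
}

// Peek at the digest of the data seen so far: the argument is passed by
// value, so finish() runs on a copy and the caller's object is untouched.
auto peek(T)(T hash)
{
    return hash.finish();
}

unittest
{
    XorDigest h;
    h.put([0x0f]);
    ubyte[1] mid = peek(h);    // digest of the first chunk only
    h.put([0xf0]);
    ubyte[1] fin = h.finish(); // digest of both chunks

    ubyte[1] expectedMid = [0x0f];
    ubyte[1] expectedFin = [0xff];
    assert(mid == expectedMid);
    assert(fin == expectedFin);
}
```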

BTW: Do you know why digestSize is a function in Tango? Are there
digests that produce variable length hashes?

Jun 22 2012

"Regan Heath" <regan netmail.co.nz> writes:

On Fri, 22 Jun 2012 14:21:28 +0100, Johannes Pfau <nospam example.com>  
wrote:
 On Fri, 22 Jun 2012 12:03:27 +0100
 "Regan Heath" <regan netmail.co.nz> wrote:

 It might help (or it might not) to have a glance at the "design" of
 the hashing routines in Tango:
 http://www.dsource.org/projects/tango/docs/current/
 (see tango.util.digest etc)

 I contributed some of the initial code for these, though it has
 since evolved a lot.  I started with structs, mirroring the phobos
 MD5 code but used all sorts of unnecessary mixins to get the code
 reuse I wanted.  The result was ugly :p

 Later someone contacted me about it, and wanted a class based
 approach so I did some refactoring and the result was much cleaner.
 I'm not trying to say that a struct approach cannot be clean, just
 that I did a bad job of it initially, and also structs don't lend
 themselves to the factory pattern though which is a nice way to use
 hashing.

 I had a short look at Piotr Szturmaj's sha implementations, and it
 seems this kind of code would benefit a lot from inheritance. I
 understand that it was probably impossible to do this in D1, but don't
 you think 'alias this' could work in D2? This wouldn't solve the
 problem with the factory pattern, but that can be solved by providing
 wrapper classes.

My original code was D1 and I used structs and mixins.. so perhaps alias  
this will solve the code re-use problem.  I haven't done enough D2 to be  
helpful here I'm afraid.

 toString doesn't make sense on a hash, as finish() has to be called
 before a string can be generated. So a helper function could be
 useful.

 toString() could output the intermediate/internal state at the time
 of the call, which if called after "finish" would be the hash
 result.  I can't recall if this has any specific usefulness, tho I
 have a nagging/niggling itch which says I did use this intermediate
 result for something at some stage.

 It might be useful to have toString on a hash so that we can pass a
 completed hash object around and repeatedly obtain the string
 representation vs obtaining it once on "finish" and passing the
 string around.  However, that said, it's probably more secure to
 destroy and scrub the memory used by the hash object ASAP and only
 retain the resulting string or ubyte[] result.

 I think I've talked myself round in a circle.. I think if we have a
 way to obtain the current state as ubyte[] that would satisfy the
 niggle I have. Having a separate routine for turning a ubyte[] into a
 hex string is probably better than attaching toString to a hash
 object.

 We could also provide a finishString function or something like that.
 But toString returning a intermediate state would be confusing.

Agreed.  In fact I wouldn't bother with finishString either TBH, people  
can always pass the result of finish string into the method which produces  
the hex string representation.

IIRC when I wrote my Tiger implementation it was fairly new, and I had a  
different method for formatting the hex string representation.  Either  
they later changed the Tiger spec, or I was confused at the time because I  
have this niggling memory that I later "discovered" it was the same all  
along, or something.

In any case, we can probably have one static toHexString method for all  
digests.

 Tango doesn't seem to offer a way to peek at the current state. But if
 it's really useful, it could be added.

Probably just cobwebs in my memory, ignore me :p

 BTW: Do you know why digestSize is a function in tango? Are there
 digests that produce variable length hashes?

Not to my knowledge.. perhaps there is a time/place where you want to know  
the size of the digest result before calculating the digest?  Might be  
useful in generic code perhaps..
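
A sketch of the generic-code case, assuming struct digests expose the size as a compile-time constant (the hashLength enum from the proposal). Sum16 and digestOf are hypothetical names used for illustration.

```d
// Toy struct digest: 16-bit sum of all input bytes, big-endian result.
struct Sum16
{
    enum size_t hashLength = 2;
    private ushort state;

    void put(const(ubyte)[] data) { foreach (b; data) state += b; }

    ubyte[hashLength] finish()
    {
        ubyte[hashLength] r = [cast(ubyte)(state >> 8), cast(ubyte)(state & 0xff)];
        state = 0;
        return r;
    }
}

// Generic helper: works for any struct digest with put, finish and hashLength.
// The return type is sized from T.hashLength before any data is hashed.
ubyte[T.hashLength] digestOf(T)(const(ubyte)[] data)
{
    T hash;
    hash.put(data);
    return hash.finish();
}

// The size is also usable in other compile-time contexts, for example to
// compute the length of the hex representation.
static assert(Sum16.hashLength * 2 == 4);

unittest
{
    ubyte[2] expected = [0x01, 0x01];
    assert(digestOf!Sum16([0xff, 0x02]) == expected);
}
```

A class-based interface would need a run-time property or function for the same information, which may be why Tango exposes digestSize as a function.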

R


Jun 22 2012

"Regan Heath" <regan netmail.co.nz> writes:

On Fri, 22 Jun 2012 18:12:20 +0100, Regan Heath <regan netmail.co.nz>  
wrote:

 people can always pass the result of finish string into the method

Aargh! "..people can always pass the result of finish STRAIGHT into the  
method.."


Jun 22 2012

Johannes Pfau <nospam example.com> writes:

On Fri, 22 Jun 2012 18:12:20 +0100
"Regan Heath" <regan netmail.co.nz> wrote:


 
 Agreed.  In fact I wouldn't bother with finishString either TBH,
 people can always pass the result of finish string into the method
 which produces the hex string representation.
 

In any case, we can probably have one static toHexString method for
all digests.

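import std.ascii;                    // std.ascii.hexDigits
import std.exception : assumeUnique;

// Converts a digest to its (uppercase) hexadecimal string representation.
string digestToString(size_t num)(in ubyte[num] digest)
{
    auto result = new char[num*2];
    size_t i;

    foreach(u; digest)
    {
        result[i++] = std.ascii.hexDigits[u >> 4];  // high nibble
        result[i++] = std.ascii.hexDigits[u & 15];  // low nibble
    }
    return assumeUnique(result);
}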

adapted from std.md5. I don't really like the name though ;-)

Jun 22 2012

"Regan Heath" <regan netmail.co.nz> writes:

On Fri, 22 Jun 2012 18:34:10 +0100, Johannes Pfau <nospam example.com>  
wrote:

 On Fri, 22 Jun 2012 18:12:20 +0100
 "Regan Heath" <regan netmail.co.nz> wrote:


 Agreed.  In fact I wouldn't bother with finishString either TBH,
 people can always pass the result of finish string into the method
 which produces the hex string representation.


 In any case, we can probably have one static toHexString method for
 all digests.

 string digestToString(size_t num)(in ubyte[num] digest)
 {
     auto result = new char[num*2];
     size_t i;

     foreach(u; digest)
     {
         result[i++] = std.ascii.hexDigits[u >> 4];
         result[i++] = std.ascii.hexDigits[u & 15];
     }
     return assumeUnique(result);
 }

 adapted from std.md5. I don't really like the name though ;-)

Name seems fine to me.  Alternatively, just digestString or even the  
completely generic "hexString".

R


Jun 25 2012

Johannes Pfau <nospam example.com> writes:

Here's a first proposal for the API:
http://dl.dropbox.com/u/24218791/d/src/digest.html

One open question is:
What should we do if a buffer that is too small is passed to the finish
function (in the OOP API)?

Should we check for the length only in debug(assert) or in
debug+release mode (enforce) or should we use the tango way and
silently allocate? 
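
The three options can be sketched side by side. This is only an illustration under stated assumptions: digestLength, writeDigest and the three finish* functions are hypothetical, and writeDigest is a stand-in for the real digest computation.

```d
import std.exception : enforce;

enum size_t digestLength = 4; // hypothetical digest size in bytes

// Stand-in for the real computation: fills dst with 1, 2, 3, ...
private void writeDigest(ubyte[] dst)
{
    foreach (i, ref b; dst)
        b = cast(ubyte)(i + 1);
}

// Option 1: assert. Checked only when assertions are compiled in; with
// -release a short buffer is not caught by this check.
ubyte[] finishAssert(ubyte[] buffer)
{
    assert(buffer.length >= digestLength, "buffer too small");
    writeDigest(buffer[0 .. digestLength]);
    return buffer[0 .. digestLength];
}

// Option 2: enforce. Checked in every build; throws an Exception on failure.
ubyte[] finishEnforce(ubyte[] buffer)
{
    enforce(buffer.length >= digestLength, "buffer too small");
    writeDigest(buffer[0 .. digestLength]);
    return buffer[0 .. digestLength];
}

// Option 3: silently allocate a new buffer if the supplied one is too small.
ubyte[] finishAllocate(ubyte[] buffer)
{
    if (buffer.length < digestLength)
        buffer = new ubyte[digestLength];
    writeDigest(buffer[0 .. digestLength]);
    return buffer[0 .. digestLength];
}

unittest
{
    ubyte[8] big;
    ubyte[] expected = [1, 2, 3, 4];
    assert(finishAssert(big[]) == expected);
    assert(finishEnforce(big[]) == expected);

    // A short buffer is replaced by a fresh allocation in option 3.
    auto small = new ubyte[2];
    auto r = finishAllocate(small);
    assert(r == expected && r.ptr !is small.ptr);
}
```

Option 3 never fails, but the caller can no longer assume the result lives in the buffer that was passed in.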

And another question: Can Digest.length be pure (for all digests)?

Jun 22 2012
