www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - The review of std.hash package

reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
	Since the review queue has been mostly silent again I've decided to 
jump in and manage the one that's ready to go :)

	Today starts the the review of std.hash package by Johannes Pfau. We go 
with the usual cycle of two weeks for review and one week for voting. 
Thus review ends on 22th of August, followed by voting that ends on 29th 
of August.

Description:

std.hash.hash is a new module for Phobos defining an uniform interface 
for hashes and checksums. It also provides some useful helper functions 
to deal with this new API.

The std.hash package also includes:

	- MD5 implementation deprecating std.md5 (in std.hash.md, adapted from 
std.md5);

	- new SHA1 implementation by redstar (in std.hash.sha);

	- CRC32 implementation (in std.hash.crc) based on and deprecating the 
crc32 module (that's shipped with phobos but not documented).

It only covers hashes which can process data incrementally (in
smaller buffers as opposed to all data at once).

Code:
https://github.com/jpf91/phobos/tree/newHash/std/hash
https://github.com/jpf91/phobos/compare/master...newHash

Docs:
http://dl.dropbox.com/u/24218791/d/phobos/std_hash_hash.html
http://dl.dropbox.com/u/24218791/d/phobos/std_hash_md.html
http://dl.dropbox.com/u/24218791/d/phobos/std_hash_sha.html
http://dl.dropbox.com/u/24218791/d/phobos/std_hash_crc.html

-- 
Dmitry Olshansky
Aug 07 2012
next sibling parent reply David <d dav1d.de> writes:
Is the API already set in stone?

Using .start and .finish, feels like the use of OpenSSL. Why not making 
a Hash object not "restartable" (since e.g. MD5.start() sets `this` just 
to MD5.init) and making finish private and implementing a digest and 
hexdigest property which calls finish and returns an ubyte[] array (or 
string for hexdigest), this would also eliminate the need of the 
`digest` wrapper (e.g. you can mixin these properties with template 
mixins, so no code duplication)

Also I am not sure if we want to use `hash.put` instead of `hash.update` 
(which is used by python and openssl). But that's just a minor point 
(e.g. perl uses .add)

Furthermore, I don't like the `sha1Of` and `md5Of` etc. functions, why 
not having a static method? e.g. MD5.hexdigest("input"), which returns 
the hexdigest and not a MD5 object.

One more thing, `.reset` does the same as `start`? If so, why do both 
exist? (I also find the examples quite confusing, why do you want to 
reset the hash onject? I would documentate the function but wouldn't use 
it in the examples)

Well, that's it from my side :)
Aug 07 2012
next sibling parent David <d dav1d.de> writes:
 Also I am not sure if we want to use `hash.put` instead of `hash.update`
 (which is used by python and openssl). But that's just a minor point
 (e.g. perl uses .add)

Well, it's there to implement the Range interface, but still, put doesn't make too much sense for me (maybe personal preference).
Aug 07 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, August 07, 2012 20:07:07 David wrote:
 Is the API already set in stone?

No. That's the main point of the review process. The API needs to be reviewed and revised as appropriate (as does the implementation, but the API is much harder to fix later, since that tends to cause breakin changes). - Jonathan M Davis
Aug 07 2012
prev sibling next sibling parent David <d dav1d.de> writes:
Am 07.08.2012 21:53, schrieb Johannes Pfau:
 Am Tue, 07 Aug 2012 20:07:07 +0200
 schrieb David <d dav1d.de>:

 Is the API already set in stone?

part of the review. (Actually the most important part for this review, as we define one API for many implementations)
 Using .start and .finish, feels like the use of OpenSSL. Why not
 making a Hash object not "restartable" (since e.g. MD5.start() sets
 `this` just to MD5.init)

CRC32, MD5 and SHA1 and I proposed we should have a unique interface. As I had some free time and nobody else stepped up I implemented it, asking for expert help in the newsgroup. This probably also explains why this API isn't revolutionary, it's inspired by other designs (tango, .net, ...). BTW: making it possible to wrap OpenSSL and other C hash libraries was also a goal. This is one reason why a start() function is necessary. OK, that was the intro ;-) start is here as structs can't have default constructors. For all current hashes it's really mostly useless (except it can be used as a simple way to reset a hash, which is also why the hash structs don't have a reset member), but I just don't know if there are hashes which would require a more complex (runtime) initialization.

Ok this point with the one above makes sense (I implemented my OpenSSL hashing wrapper as a class, initilaization is done in the constructor), it still doesn't feel right, if you have to call .start first. What about a struct-constructor which calls .start internally, so you get a bit more of a modern API (imo) and you're still able to implement the same interface for any kind of wrapper/own implementation/whatever.
 Classes don't have a start method as they can do that initialization in
 the default constructor. But the default constructor can't be used as a
 cheap way to reset a hash, therefore the reset function was added to
 the OOP interface.

Ok. (more below)
 and making finish private and implementing a
 digest and hexdigest property

for some objects, so you can't access the property repeatedly. It'd have to be a function.

Well, you could store the result internally, property generates on the first call the digest, stores it (that's not too much, 16 byte for a md5) and returns the already stored value on the 2nd call.
 which calls finish and returns an
 ubyte[] array (or string for hexdigest)

This could be done and is probably a matter of taste. I prefer the free function in this case, it makes clear that digest() really is a generic function which can be used with any hash. It's also one function less which must be implemented by hash writers. I'd like to keep the number of members required to implement a hash as low as possible.

With digest you mean: http://dl.dropbox.com/u/24218791/d/phobos/std_hash_hash.html#digest ? You normally always want the hexdigest (you barely need "real" digest), so it's a matter of convenience. Well, thanks to UFCS, you can call it like a method.
 (This is however unrelated to the actual naming of the functions. If
 you think 'digest' is not a good name, please let me know)

I think that's a fitting name, but I am not a hashing expert.
 , this would also eliminate
 the need of the `digest` wrapper (e.g. you can mixin these properties
 with template mixins, so no code duplication)

there's actually no code duplication. 'digest' is the only implementation, sha1Of, md5Of are just aliases.

Yeah, I meant, you would have to implement these properties for each hash-type and in the end, all properties do the same (calling finish and maybe setting some internal flags, buffers), so you can put them into a template mixin and mixin them for each hash-type.
 Also I am not sure if we want to use `hash.put` instead of
 `hash.update` (which is used by python and openssl). But that's just
 a minor point (e.g. perl uses .add)

I initially called it update, I agree it's a better name. But phobos is all about ranges, so we need the put member anyway. Providing an 'update' alias just for the name isn't worth it, imho (see above, keeping the number of hash members low).

I have to agree, an alias would produce more confusion (what two methods which do the same?)
 Furthermore, I don't like the `sha1Of` and `md5Of` etc. functions,
 why not having a static method? e.g. MD5.hexdigest("input"), which
 returns the hexdigest and not a MD5 object.

Same reason as above, sha1Of, md5Of are just convenience aliases, it could be argued you don't need those at all. You can use 'digest' directly and it will provide exactly the same result. So hash writers don't have to implement anything to support digest, providing an alias is optional and just 1 line of code. A static method would have to be implemented by every hash.

Yes, you would have to implement it for every hash, but that's 3 lines: static string hexdigest(void[] data) { return toHexString(digest!(typeof(this))("yay")); }
 Yes you could use mixins to make that
 painless, but I think it's still to much work. (And mixins can't be
 documented in DDOC, but that's an unrelated compiler issue)

 BTW: it is implemented as a member for the OOP API, as we can just
 implement it in the base interface, so the actual implementations don't
 have to care about it.

I've seen that, but I am wondering why it's not a static method, the creation of the Hash Object could be hidden inside. You might say, this would allocate a class instance just for a single use, without the possibility to reuse it. Correct, but I think that's the use of such a function, if you just need a Hash, nothing else, otherwise you could use the class "directly", with all its other functionality.
 One more thing, `.reset` does the same as `start`? If so, why do both
 exist?

structs can't have default constructors, so we need start. The OOP/class api can use a default constructor. But unlike the start function it can't work as a reset function, so there's an explicit reset function.

I am not sure what do think about that. On the one side it's useful if you really need that speed, but then I think, is it really worth it resetting the state hidden in a function. If you really want to avoid another allocation you could use emplace (if I didn't misunderstand the use of emplace). To quote from the Python Zen: "Explicit is better than implicit."
 (I also find the examples quite confusing, why do you want to
 reset the hash onject?

it makes sense to use them. Maybe I should care more about that?

When I saw the documentation/examples, I was confused, why do you want to reset a hash object, so does it really reset the state? So I had to take a look into the code before I was sure, it surely does reset the whole thingy (is it called context?).
 It's more efficient than allocating a new hash with 'new' (or do you
 mean why use reset if finish resets anyway? It's useful if you put some
 data into a hash, then decide you don't want the hash (because the
 user aborts the operation, network connection is lost,...) but you still
 need the hash object later on.) You could just call finish and
 disregard the result, but the reset implementation is faster. I agree
 there's probably no _strong_ need for reset, but I think it doesn't do
 any harm.

"Explicit is better than implicit."
 Well, that's it from my side :)

Thanks for your review!

Thanks for writing std.hash, I think lots of people need it.
Aug 07 2012
prev sibling parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Johannes Pfau wrote:
 As I had some free time and nobody else stepped up I implemented it,

I tried it before but I wanted to create whole crypto package at once, and I guess it was a huge mistake. I know you have used my code in your uuid module, you can use it to add it to std.hash if you want. There are all SHA implementations, MD5 and a Tiger (both versions). They all support bit hashing and work with CTFE. I made a quick look at your proposal, but currently don't have time to do a deep investigation :). One note though. I don't think it's neccessary to maintain two different API sets. Sure, classes need heap allocation but usually that's only one allocation, it's more important to not allocate during algorithm computations. Fortunately crypto algorithms don't do this. So, struct allocation efficency is negligible here. I think separate struct based API set is not worth it (or I'm missing something not related to allocation cost).
Aug 07 2012
parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Johannes Pfau wrote:
 Am Wed, 08 Aug 2012 01:27:45 +0200
 schrieb Piotr Szturmaj <bncrbme jadamspam.pl>:

 Johannes Pfau wrote:
 As I had some free time and nobody else stepped up I implemented it,

I tried it before but I wanted to create whole crypto package at once,

'nobody stepped up' is this: We had a pull request which moved the crc code into std.crc and it was already merged in for 2.060. I complained that it had yet another API and that we should have a common API. In the end that pull request was reverted but I felt kinda guilty as that left us without a public crc module. In that discussion, nobody else had time to implement that API, so I thought we'd need a solution soon and started implementing it.

Hey, I didn't say you did :-) I'm fine with your work :-)
 and I guess it was a huge mistake.

I'd personally like to see a bcrypt implementation. There's no widespread bcrypt C library which could be used, so there's no painless way to use bcrypt in D. (Although there _is_ public domain C source code)

Yes, there should be bcrypt, scrypt and PBKDF2.
 I know you have used my code
 in your uuid module, you can use it to add it to std.hash if you
 want. There are all SHA implementations, MD5 and a Tiger (both
 versions). They all support bit hashing and work with CTFE.

a standardized hash API in phobos). BTW: How does it work in CTFE? Don't you have to do endianness conversions at some time? According to Don that's not really supported.

std.bitmanip.swapEndian() works for me
 Another problem with prevents CTFE for my proposal would be that the
 internal state is currently implemented as an array of uints, but the
 API uses ubyte[] as a return type. That sort of reinterpret cast is not
 supposed to work in CTFE though. I wonder how you avoided that issue?

There is set of functions that abstract some operations to work with CTFE and at runtime: https://github.com/pszturmaj/phobos/blob/master/std/crypto/hash/base.d#L66. Particularly memCopy().
 And another problem is that void[][] (as used in the 'digest' function)
 doesn't work in CTFE (and it isn't supposed to work). But that's a
 problem specific to this API.

Yes, that's why I use ubyte[]. I think if it's possible to do it all with CTFE by using hacks, it should be rather implemented in the compiler, assuming the endiannes of the CPU it's running on. [...cut...]
 There's one important example where you really don't want any
 allocation to happen (and especially no GC allocation): You have a
 function which always calculates the same type of hash (e.g. md5) and
 you don't care about the implementation. You only care about the
 final hash, the hash context is never used outside of that function.
 This is an perfect fit for stack allocation, especially if the
 function is used a lot (GC). With classes you'd have to do
 some caching to do it efficiently (which then gives problems with
 const).
 It would be possible to use a class API + emplace, but that was deemed
 to be too cumbersome.

I don't think std.typecons.scoped is cumbersome: auto sha = scoped!SHA1(); // allocates on the stack auto digest = sha.digest("test"); Why I think classes should be supported is the need of polymorphism. One should be able to accept Digest base class as a function parameter or a field/property. Without polymorphism we need to resort to C-like switches, ifs and so on. Templates are no go, because hash function may be determined at runtime (it may depend on security protocol handshake).
Aug 08 2012
parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 08, 2012 18:12:23 Johannes Pfau wrote:
 Am Wed, 08 Aug 2012 11:27:49 +0200
 
 schrieb Piotr Szturmaj <bncrbme jadamspam.pl>:
 BTW: How does it work in CTFE? Don't you have to do endianness
 conversions at some time? According to Don that's not really
 supported.

std.bitmanip.swapEndian() works for me

Great! I always tried the *endianToNative and nativeTo*Endian functions. So I didn't expect swapEndian to work.

What's wrong with the *endianToNative and nativeTo*Endian functions? They work just fine as far as I know. swapEndian works too if you want it to use that, but there should be nothing wrong with the endian-specific ones. - Jonathan M Davis
Aug 08 2012
prev sibling next sibling parent reply Ary Manzana <ary esperanto.org.ar> writes:
On 8/7/12 14:39 , Dmitry Olshansky wrote:
 Since the review queue has been mostly silent again I've decided to jump
 in and manage the one that's ready to go :)

 Today starts the the review of std.hash package by Johannes Pfau. We go
 with the usual cycle of two weeks for review and one week for voting.
 Thus review ends on 22th of August, followed by voting that ends on 29th
 of August.

 Description:

 std.hash.hash is a new module for Phobos defining an uniform interface
 for hashes and checksums. It also provides some useful helper functions
 to deal with this new API.

 The std.hash package also includes:

I think "std.crypto" is a better name for the package. At first I thought it contained an implementation of a Hash table. Also note these entries in wikipedia: http://en.wikipedia.org/wiki/Hash_function http://en.wikipedia.org/wiki/Cryptographic_hash_function Your package provides the later, not just any hash functions, but *crypto*graphic hash functions. :-) (and yes, I know I'm just discussing the name here, but names *are* important)
Aug 07 2012
next sibling parent deadalnix <deadalnix gmail.com> writes:
Le 07/08/2012 20:31, Ary Manzana a écrit :
 On 8/7/12 14:39 , Dmitry Olshansky wrote:
 Since the review queue has been mostly silent again I've decided to jump
 in and manage the one that's ready to go :)

 Today starts the the review of std.hash package by Johannes Pfau. We go
 with the usual cycle of two weeks for review and one week for voting.
 Thus review ends on 22th of August, followed by voting that ends on 29th
 of August.

 Description:

 std.hash.hash is a new module for Phobos defining an uniform interface
 for hashes and checksums. It also provides some useful helper functions
 to deal with this new API.

 The std.hash package also includes:

I think "std.crypto" is a better name for the package. At first I thought it contained an implementation of a Hash table. Also note these entries in wikipedia: http://en.wikipedia.org/wiki/Hash_function http://en.wikipedia.org/wiki/Cryptographic_hash_function Your package provides the later, not just any hash functions, but *crypto*graphic hash functions. :-) (and yes, I know I'm just discussing the name here, but names *are* important)

You'll find very hard to convince anyone that crc32 is a cryptographic hash function. And this API is suited for both cryptographic hash and regular hash. Many of them can be added in the future if need is met. I definitively am for std.hash .
Aug 07 2012
prev sibling next sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, August 07, 2012 15:31:57 Ary Manzana wrote:
 The std.hash package also includes:

thought it contained an implementation of a Hash table.

That doesn't fly, because crc32 is going to be in there, and while it's a hash, it's no good for cryptography. - Jonathan M Davis
Aug 07 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/8/12 8:54 AM, Regan Heath wrote:
 "Hash" has too many meanings, we should avoid it.

Yes please. Andrei
Aug 08 2012
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 07-Aug-12 22:31, Ary Manzana wrote:
 I think "std.crypto" is a better name for the package. At first I
 thought it contained an implementation of a Hash table.

There is std.container, so unambiguous for me. As for std.crypto it's been discussed to death before. Short answer - no, because it's assumed to include other useful hash functions (like crc32, later I expect adler and whatnot). Also keep in mind that cryptographic hash is more of a status then permanent property. Now that md5 was cracked, it is typically used only as normal hash (e.g. checksum). Same thing would one day happen to SHA family. -- Dmitry Olshansky
Aug 07 2012
prev sibling next sibling parent "Regan Heath" <regan netmail.co.nz> writes:
On Tue, 07 Aug 2012 19:41:12 +0100, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:

 On Tuesday, August 07, 2012 15:31:57 Ary Manzana wrote:
 The std.hash package also includes:

thought it contained an implementation of a Hash table.

That doesn't fly, because crc32 is going to be in there, and while it's a hash, it's no good for cryptography.

std.digest then? R -- Using Opera's revolutionary email client: http://www.opera.com/mail/
Aug 08 2012
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Aug 08, 2012 at 11:37:35AM +0100, Regan Heath wrote:
 On Tue, 07 Aug 2012 19:41:12 +0100, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:
 
On Tuesday, August 07, 2012 15:31:57 Ary Manzana wrote:
I think "std.crypto" is a better name for the package. At first I
thought it contained an implementation of a Hash table.

That doesn't fly, because crc32 is going to be in there, and while it's a hash, it's no good for cryptography.

std.digest then?

+1. I think std.hash is needlessly confusing (I thought it was another hashtable implementation until I read this thread more carefully). T -- Two wrongs don't make a right; but three rights do make a left...
Aug 08 2012
parent reply travert phare.normalesup.org (Christophe Travert) writes:
"Regan Heath" , dans le message (digitalmars.D:174462), a écrit :
 "Message-Digest Algorithm" is the proper term, "hash" is another, correct,  
 more general term.
 
 "hash" has other meanings, "Message-Digest Algorithm" does not.

I think the question is: is std.hash going to contain only message-digest algorithm, or could it also contain other hash functions? I think there is enough room in a package to have both message-digest algorithm and other kinds of hash functions.
Aug 08 2012
parent reply travert phare.normalesup.org (Christophe Travert) writes:
"Chris Cain" , dans le message (digitalmars.D:174466), a écrit :
 On Wednesday, 8 August 2012 at 13:38:26 UTC, 
 travert phare.normalesup.org (Christophe Travert) wrote:
 I think the question is: is std.hash going to contain only
 message-digest algorithm, or could it also contain other hash 
 functions?
 I think there is enough room in a package to have both 
 message-digest
 algorithm and other kinds of hash functions.

Even if that were the case, I'd say they should be kept separate. Cryptographic hash functions serve extremely different purposes from regular hash functions. There is no reason they should be categorized the same.

They should not be categorized the same. I don't expect a regular hash function to pass the isDigest predicate. But they have many similarities, which explains they are all called hash functions. There is enough room in a package to put several related concepts! Here, we have a package for 4 files, with a total number of line that is about one third of the single std.algorithm file (which is probably too big, I conceed). There aren't hundreds of message-digest functions to add here. If it where me, I would have the presently reviewed module std.hash.hash be called std.hash.digest, and leave room here for regular hash functions. In any case, I think regular hash HAVE to be in a std.hash module or package, because people looking for a regular hash function will look here first.
Aug 08 2012
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08-Aug-12 18:16, Christophe Travert wrote:
 "Chris Cain" , dans le message (digitalmars.D:174466), a écrit :
 On Wednesday, 8 August 2012 at 13:38:26 UTC,
 travert phare.normalesup.org (Christophe Travert) wrote:
 I think the question is: is std.hash going to contain only
 message-digest algorithm, or could it also contain other hash
 functions?
 I think there is enough room in a package to have both
 message-digest
 algorithm and other kinds of hash functions.

Even if that were the case, I'd say they should be kept separate. Cryptographic hash functions serve extremely different purposes from regular hash functions. There is no reason they should be categorized the same.

They should not be categorized the same. I don't expect a regular hash function to pass the isDigest predicate. But they have many similarities, which explains they are all called hash functions. There is enough room in a package to put several related concepts!

You still can use say crc32 as normal hash function for some binary object. The notions are not as desperate as some designers would want them to be.
 Here, we have a package for 4 files, with a total number of line that is
 about one third of the single std.algorithm file (which is probably too
 big, I conceed). There aren't hundreds of message-digest functions to
 add here.

I'd rather see clean by family separation, as importing one huge digest module only to use SHA is kind of creepy. On the other hand as all of code is templated it's not a big deal.
 If it where me, I would have the presently reviewed module std.hash.hash
 be called std.hash.digest, and leave room here for regular hash
 functions. In any case, I think regular hash HAVE to be in a std.hash
 module or package, because people looking for a regular hash function
 will look here first.

I thing concerns me: if incremental digest hashes are all in one module what are the (would be) other modules in std.hash? -- Dmitry Olshansky
Aug 08 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 08-Aug-12 21:00, Dmitry Olshansky wrote:
 On 08-Aug-12 18:16, Christophe Travert wrote:

 They should not be categorized the same. I don't expect a regular hash
 function to pass the isDigest predicate. But they have many
 similarities, which explains they are all called hash functions. There
 is enough room in a package to put several related concepts!

You still can use say crc32 as normal hash function for some binary object. The notions are not as desperate as some designers would want them to be.

Damned spellcheckers: desperate -> disparate -- Dmitry Olshansky
Aug 08 2012
prev sibling next sibling parent travert phare.normalesup.org (Christophe Travert) writes:
"Chris Cain" , dans le message (digitalmars.D:174477), a écrit :

I think you misunderstood me (and it's probably my fault, since I don't 
know much of hash functions), I was wanted to compare two kind of 
concepts:

1/ message digest functions, like md5, or sha1, used on large files,
which is what is covered by this std.hash proposal.
2/ small hash function. Like what are use in an associative array, and 
are called toHash when used a member function.

And I didn't thought of:
3/ cryptographic hash functions

My opinion was that in a module or package called hash, I expect tools 
concerning #2. But #1 and #2 can coexist in the same package. The 
proposed std.hash.hash defines a digest concept for #1. That's why I 
would rather have it named std.hash.digest, leaving room in the hash 
package to other concepts, like small hash functions that can be used in 
associative arrays (#2).

I don't know the difference between #1 and #3, so I can't tell if they 
should share a common package. In anycase, I think putting #3 be in a 
crypto package makes sense.

Having 3 different packages seems too much to me. #1 is too 
restricted to be a whole package IMHO, and should be along #2 or #3.

-- 
Christophe
Aug 08 2012
prev sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 08, 2012 18:47:34 Johannes Pfau wrote:
 Am Wed, 8 Aug 2012 14:16:40 +0000 (UTC)
 
 schrieb travert phare.normalesup.org (Christophe Travert):
 If it where me, I would have the presently reviewed module
 std.hash.hash be called std.hash.digest, and leave room here for
 regular hash functions. In any case, I think regular hash HAVE to be
 in a std.hash module or package, because people looking for a regular
 hash function will look here first.

std.hash.digest doesn't sound too bad. We could have std.hash.func (or a better named module ;-) for general hash functions later.

I say just keep at simple and leave it at std.hash. It's plenty clear IMHO. - Jonathan M Davis
Aug 08 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/8/12 4:34 PM, Jonathan M Davis wrote:
 I say just keep at simple and leave it at std.hash. It's plenty clear IMHO.

Not clear to quite a few of us. IMHO it just makes us seem (to the larger community) clever about a petty point. There's plenty of other better names, and std.digest is very adequate. Andrei
Aug 08 2012
prev sibling next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
On Wednesday, 8 August 2012 at 12:00:42 UTC, H. S. Teoh wrote:
 On Wed, Aug 08, 2012 at 11:37:35AM +0100, Regan Heath wrote:
 On Tue, 07 Aug 2012 19:41:12 +0100, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:
 
On Tuesday, August 07, 2012 15:31:57 Ary Manzana wrote:
I think "std.crypto" is a better name for the package. At 
first I
thought it contained an implementation of a Hash table.

That doesn't fly, because crc32 is going to be in there, and while it's a hash, it's no good for cryptography.

std.digest then?

+1. I think std.hash is needlessly confusing (I thought it was another hashtable implementation until I read this thread more carefully). T

-1 std.digest let's me think of http://en.wikipedia.org/wiki/Digestion digest is just not common if you mean hash in my cycles. I didn't think of an hash table implementation, maybe you are spoiled by writing one at the moment? (no offence).
Aug 08 2012
prev sibling next sibling parent "Regan Heath" <regan netmail.co.nz> writes:
On Wed, 08 Aug 2012 13:11:43 +0100, Tobias Pankrath <tobias pankrath.net>  
wrote:

 On Wednesday, 8 August 2012 at 12:00:42 UTC, H. S. Teoh wrote:
 On Wed, Aug 08, 2012 at 11:37:35AM +0100, Regan Heath wrote:
 On Tue, 07 Aug 2012 19:41:12 +0100, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:
  >On Tuesday, August 07, 2012 15:31:57 Ary Manzana wrote:
I think "std.crypto" is a better name for the package. At >>first I
thought it contained an implementation of a Hash table.

That doesn't fly, because crc32 is going to be in there, and >while it's a hash, it's no good for cryptography.


+1. I think std.hash is needlessly confusing (I thought it was another hashtable implementation until I read this thread more carefully). T

-1 std.digest let's me think of http://en.wikipedia.org/wiki/Digestion

That's exactly what it's supposed to suggest. The algorithm does digest the input (AKA message) and output .. something else :p
 digest is just not common if you mean hash in my cycles.

Like it or not, Digest is the correct term: http://en.wikipedia.org/wiki/MD5 "The MD5 Message-Digest Algorithm .."
 I didn't think of an hash table implementation, maybe you are spoiled by  
 writing one at the moment? (no offence).

"Hash" has too many meanings, we should avoid it. R -- Using Opera's revolutionary email client: http://www.opera.com/mail/
Aug 08 2012
prev sibling next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
On Wednesday, 8 August 2012 at 12:55:04 UTC, Regan Heath wrote:

 Like it or not, Digest is the correct term:
 http://en.wikipedia.org/wiki/MD5
 "The MD5 Message-Digest Algorithm .."

You could have cited the hole sentence
 The MD5 Message-Digest Algorithm is a widely used cryptographic 
 hash function

So at least this implies that hash function is the more general term here and the corresponding wiki article is named "hash function" and does not even mention digest.
 I didn't think of an hash table implementation, maybe you are 
 spoiled by writing one at the moment? (no offence).

"Hash" has too many meanings, we should avoid it.

At least hash table does not use a different meaning of the term hash. But I'm not that deep into it, I'd just say that digest is not clearly better than hash.
Aug 08 2012
prev sibling parent "Regan Heath" <regan netmail.co.nz> writes:
On Wed, 08 Aug 2012 14:03:32 +0100, Tobias Pankrath <tobias pankrath.net>  
wrote:

 On Wednesday, 8 August 2012 at 12:55:04 UTC, Regan Heath wrote:

 Like it or not, Digest is the correct term:
 http://en.wikipedia.org/wiki/MD5
 "The MD5 Message-Digest Algorithm .."

You could have cited the hole sentence

I could have, but I didn't read that far :p I knew what I was looking for and I copy/pasted it.
 The MD5 Message-Digest Algorithm is a widely used cryptographic hash  
 function

So at least this implies that hash function is the more general term here and the corresponding wiki article is named "hash function" and does not even mention digest.

"Message-Digest Algorithm" is the proper term, "hash" is another, correct, more general term. "hash" has other meanings, "Message-Digest Algorithm" does not. std.message-digest-algorithm is a bit wordy. std.digest is not. std.digest cannot be confused with anything else.
 I didn't think of an hash table implementation, maybe you are spoiled  
 by writing one at the moment? (no offence).

"Hash" has too many meanings, we should avoid it.

At least hash table does not use a different meaning of the term hash. But I'm not that deep into it, I'd just say that digest is not clearly better than hash.

I think it is. R -- Using Opera's revolutionary email client: http://www.opera.com/mail/
Aug 08 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Tue, 07 Aug 2012 20:07:07 +0200
schrieb David <d dav1d.de>:

 Is the API already set in stone?

part of the review. (Actually the most important part for this review, as we define one API for many implementations)
 Using .start and .finish, feels like the use of OpenSSL. Why not
 making a Hash object not "restartable" (since e.g. MD5.start() sets
 `this` just to MD5.init)

CRC32, MD5 and SHA1 and I proposed we should have a unique interface. As I had some free time and nobody else stepped up I implemented it, asking for expert help in the newsgroup. This probably also explains why this API isn't revolutionary, it's inspired by other designs (tango, .net, ...). BTW: making it possible to wrap OpenSSL and other C hash libraries was also a goal. This is one reason why a start() function is necessary. OK, that was the intro ;-) start is here as structs can't have default constructors. For all current hashes it's really mostly useless (except it can be used as a simple way to reset a hash, which is also why the hash structs don't have a reset member), but I just don't know if there are hashes which would require a more complex (runtime) initialization. Classes don't have a start method as they can do that initialization in the default constructor. But the default constructor can't be used as a cheap way to reset a hash, therefore the reset function was added to the OOP interface.
 and making finish private and implementing a
 digest and hexdigest property

for some objects, so you can't access the property repeatedly. It'd have to be a function.
which calls finish and returns an
 ubyte[] array (or string for hexdigest)

This could be done and is probably a matter of taste. I prefer the free function in this case, it makes clear that digest() really is a generic function which can be used with any hash. It's also one function less which must be implemented by hash writers. I'd like to keep the number of members required to implement a hash as low as possible. (This is however unrelated to the actual naming of the functions. If you think 'digest' is not a good name, please let me know)
, this would also eliminate
 the need of the `digest` wrapper (e.g. you can mixin these properties
 with template mixins, so no code duplication)

there's actually no code duplication. 'digest' is the only implementation, sha1Of, md5Of are just aliases.
 
 Also I am not sure if we want to use `hash.put` instead of
 `hash.update` (which is used by python and openssl). But that's just
 a minor point (e.g. perl uses .add)

I initially called it update, I agree it's a better name. But phobos is all about ranges, so we need the put member anyway. Providing an 'update' alias just for the name isn't worth it, imho (see above, keeping the number of hash members low).
 
 Furthermore, I don't like the `sha1Of` and `md5Of` etc. functions,
 why not having a static method? e.g. MD5.hexdigest("input"), which
 returns the hexdigest and not a MD5 object.

Same reason as above, sha1Of, md5Of are just convenience aliases, it could be argued you don't need those at all. You can use 'digest' directly and it will provide exactly the same result. So hash writers don't have to implement anything to support digest, providing an alias is optional and just 1 line of code. A static method would have to be implemented by every hash. Yes you could use mixins to make that painless, but I think it's still to much work. (And mixins can't be documented in DDOC, but that's an unrelated compiler issue) BTW: it is implemented as a member for the OOP API, as we can just implement it in the base interface, so the actual implementations don't have to care about it.
 
 One more thing, `.reset` does the same as `start`? If so, why do both 
 exist?

structs can't have default constructors, so we need start. The OOP/class api can use a default constructor. But unlike the start function it can't work as a reset function, so there's an explicit reset function.
(I also find the examples quite confusing, why do you want to 
 reset the hash onject?

it makes sense to use them. Maybe I should care more about that? It's more efficient than allocating a new hash with 'new' (or do you mean why use reset if finish resets anyway? It's useful if you put some data into a hash, then decide you don't want the hash (because the user aborts the operation, network connection is lost,...) but you still need the hash object later on.) You could just call finish and disregard the result, but the reset implementation is faster. I agree there's probably no _strong_ need for reset, but I think it doesn't do any harm.
I would documentate the function but wouldn't
 use it in the examples)
 
 Well, that's it from my side :)

Thanks for your review!
Aug 07 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Tue, 07 Aug 2012 20:46:44 +0200
schrieb deadalnix <deadalnix gmail.com>:

 You'll find very hard to convince anyone that crc32 is a
 cryptographic hash function.

BTW: I also considered splitting hashes into cryptographic and non-cryptographic/checksums. But as we also have some generic parts (currently in std.hash.hash) this would pose the question where to put the generic part? Put it in std.checksum and std.crypto users will complain, put it in std.crypto and std.checksum users won't be happy. And as was discussed previously what's considered a safe cryptographic hash might change as time goes by.
 And this API is suited for both cryptographic hash and regular hash. 
 Many of them can be added in the future if need is met. I
 definitively am for std.hash .

We had this package name discussion a few times on the newsgroup and I think on github as well. I personally don't care about the package name and I'll just choose what the majority thinks is best. Last time it seemed std.hash was the favorite (although hash and digest can probably be used interchangeably)
Aug 07 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/7/2012 10:39 AM, Dmitry Olshansky wrote:
 std.hash.hash is a new module for Phobos defining an uniform interface for
 hashes and checksums. It also provides some useful helper functions to deal
with
 this new API.

The hash functions must use a Range interface, not a file interface. This is extremely important.
Aug 07 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/8/2012 1:44 AM, Johannes Pfau wrote:
 Am Tue, 07 Aug 2012 17:39:15 -0700
 schrieb Walter Bright <newshound2 digitalmars.com>:

 On 8/7/2012 10:39 AM, Dmitry Olshansky wrote:
 std.hash.hash is a new module for Phobos defining an uniform
 interface for hashes and checksums. It also provides some useful
 helper functions to deal with this new API.

The hash functions must use a Range interface, not a file interface. This is extremely important.

I guess this is meant as a general statement and not specifically targeted at my std.hash proposal?

Both.
 I'm a little confused as all hashes already are OutputRanges in my
 proposal. It's probably not explicit enough in the documentation, but
 it's mentioned in one example and in the documentation for 'put';

It should accept an input range. But using an Output Range confuses me. A hash function is a reduce algorithm - it accepts a sequence of input values, and produces a single value. You should be able to write code like: ubyte[] data; ... auto crc = data.crc32(); For example, the hash example given is: foreach (buffer; file.byChunk(4096 * 1024)) hash.put(buffer); auto result = hash.finish(); Instead it should be something like: auto result = file.byChunk(4096 * 1025).joiner.hash(); The magic is that any input range that produces bytes could be used, and that byte producing input range can be hooked up to the input of any reducing function. The use of a member finish() is not what any other reduce algorithm has, and so the interface is not a general component interface. I know the documentation on ranges in Phobos is incomplete and confusing. I appreciate the effort and care you're putting into this.
Aug 08 2012
next sibling parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Walter Bright wrote:
    auto result = file.byChunk(4096 * 1025).joiner.hash();

 The magic is that any input range that produces bytes could be used, and
 that byte producing input range can be hooked up to the input of any
 reducing function.

Suppose you have a callback that will give you blocks of bytes to hash. Blocks of bytes come from a socket, but not a blocking one. Instead, socket uses eventing mechanism (libevent) to get notifications about its readiness. How would you use the hash API in this situation?
Aug 08 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/8/2012 3:12 AM, Piotr Szturmaj wrote:
 Walter Bright wrote:
    auto result = file.byChunk(4096 * 1025).joiner.hash();

 The magic is that any input range that produces bytes could be used, and
 that byte producing input range can be hooked up to the input of any
 reducing function.

Suppose you have a callback that will give you blocks of bytes to hash. Blocks of bytes come from a socket, but not a blocking one. Instead, socket uses eventing mechanism (libevent) to get notifications about its readiness. How would you use the hash API in this situation?

Have the callback supply a range interface to call the hash with.
Aug 08 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/8/2012 12:14 PM, Martin Nowak wrote:
 That hardly works for event based programming without using coroutines.
 It's the classical inversion-of-control dilemma of event based programming that
 forces you to save/restore your state with every event.

See the discussion on using reduce().
Aug 08 2012
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/9/2012 2:48 AM, Johannes Pfau wrote:
 Am Wed, 08 Aug 2012 12:31:29 -0700
 schrieb Walter Bright <newshound2 digitalmars.com>:

 On 8/8/2012 12:14 PM, Martin Nowak wrote:
 That hardly works for event based programming without using
 coroutines. It's the classical inversion-of-control dilemma of
 event based programming that forces you to save/restore your state
 with every event.

See the discussion on using reduce().

I just don't understand it. Let's take the example by Martin Nowak and port it to reduce: (The code added as comments is the same code for hashes, working with the current API) int state; //Hash state; void onData(void[] data) { state = reduce(state, data); //copy(data, state); //state = copy(data, state); //also valid, but not necessary //state.put(data); //simple way, doesn't work for ranges } void main() { state = 0; //state.start(); auto stream = new EventTcpStream("localhost", 80); stream.onData = &onData; //auto result = hash.finish(); } There are only 2 differences: 1: the order of the arguments passed to copy and reduce is swapped. This kinda makes sense (if copy is interpreted as copyTo). Solution: Provide a method copyInto with swapped arguments if consistency is really so important.

Consistency is important so that disparate components can fit together.
 2:
 We need an additional call to finish. I can't say it often enough, I
 don't see a sane way to avoid it. Hashes work on blocks, if you didn't
 pass enough data finish will have to fill the rest of the block with
 zeros before you can get the hash value. This operation can't be
 undone. To get a valid result with every call to copy, you'd have to
 always call finish. This is
 * inefficient, you calculate intermediate values you don't need at all
 * you have to copy the hashes state, as you can't continue hashing
    after finish has been called

 and both, the state and the result would have to fit into the one value
 (called seed for reduce). But then it's still not 100% consistent, as
 reduce will return a single value, not some struct including internal
 state.

That isn't a problem, the internal state can be private data for the struct, and the "finish value" can be the result of overloading operator() on that struct. I'm not sure if that would work, but it's worth investigating.
Aug 09 2012
prev sibling parent deadalnix <deadalnix gmail.com> writes:
Le 09/08/2012 11:48, Johannes Pfau a écrit :
 Am Wed, 08 Aug 2012 12:31:29 -0700
 schrieb Walter Bright<newshound2 digitalmars.com>:

 On 8/8/2012 12:14 PM, Martin Nowak wrote:
 That hardly works for event based programming without using
 coroutines. It's the classical inversion-of-control dilemma of
 event based programming that forces you to save/restore your state
 with every event.

See the discussion on using reduce().

I just don't understand it. Let's take the example by Martin Nowak and port it to reduce: (The code added as comments is the same code for hashes, working with the current API) int state; //Hash state; void onData(void[] data) { state = reduce(state, data); //copy(data, state); //state = copy(data, state); //also valid, but not necessary //state.put(data); //simple way, doesn't work for ranges } void main() { state = 0; //state.start(); auto stream = new EventTcpStream("localhost", 80); stream.onData =&onData; //auto result = hash.finish(); } There are only 2 differences: 1: the order of the arguments passed to copy and reduce is swapped. This kinda makes sense (if copy is interpreted as copyTo). Solution: Provide a method copyInto with swapped arguments if consistency is really so important. 2: We need an additional call to finish. I can't say it often enough, I don't see a sane way to avoid it. Hashes work on blocks, if you didn't pass enough data finish will have to fill the rest of the block with zeros before you can get the hash value. This operation can't be undone. To get a valid result with every call to copy, you'd have to always call finish. This is * inefficient, you calculate intermediate values you don't need at all * you have to copy the hashes state, as you can't continue hashing after finish has been called and both, the state and the result would have to fit into the one value (called seed for reduce). But then it's still not 100% consistent, as reduce will return a single value, not some struct including internal state.

I'm pretty sure it is possible to pad and finish when a result is required without messing up the internal state.
Aug 16 2012
prev sibling next sibling parent deadalnix <deadalnix gmail.com> writes:
Le 08/08/2012 11:49, Walter Bright a écrit :
 On 8/8/2012 1:44 AM, Johannes Pfau wrote:
 Am Tue, 07 Aug 2012 17:39:15 -0700
 schrieb Walter Bright <newshound2 digitalmars.com>:

 On 8/7/2012 10:39 AM, Dmitry Olshansky wrote:
 std.hash.hash is a new module for Phobos defining an uniform
 interface for hashes and checksums. It also provides some useful
 helper functions to deal with this new API.

The hash functions must use a Range interface, not a file interface. This is extremely important.

I guess this is meant as a general statement and not specifically targeted at my std.hash proposal?

Both.
 I'm a little confused as all hashes already are OutputRanges in my
 proposal. It's probably not explicit enough in the documentation, but
 it's mentioned in one example and in the documentation for 'put';

It should accept an input range. But using an Output Range confuses me. A hash function is a reduce algorithm - it accepts a sequence of input values, and produces a single value. You should be able to write code like: ubyte[] data; ... auto crc = data.crc32(); For example, the hash example given is: foreach (buffer; file.byChunk(4096 * 1024)) hash.put(buffer); auto result = hash.finish(); Instead it should be something like: auto result = file.byChunk(4096 * 1025).joiner.hash(); The magic is that any input range that produces bytes could be used, and that byte producing input range can be hooked up to the input of any reducing function. The use of a member finish() is not what any other reduce algorithm has, and so the interface is not a general component interface. I know the documentation on ranges in Phobos is incomplete and confusing. I appreciate the effort and care you're putting into this.

That is a really good point. +1
Aug 08 2012
prev sibling next sibling parent travert phare.normalesup.org (Christophe Travert) writes:
Johannes Pfau , dans le message (digitalmars.D:174478), a écrit :
 but I don't know how make it an overload. See thread "overloading a
 function taking a void[][]" in D.learn for details.

Don't overload the function taking a void[][]. Remplace it. void[][] is a range of void[].
Aug 08 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/8/2012 5:13 AM, Martin Nowak wrote:
 It should accept an input range. But using an Output Range confuses me. A hash
 function is a reduce algorithm - it accepts a sequence of input values, and
 produces a single value. You should be able to write code like:

    ubyte[] data;
    ...
    auto crc = data.crc32();

 For example, the hash example given is:

    foreach (buffer; file.byChunk(4096 * 1024))
        hash.put(buffer);
    auto result = hash.finish();

 Instead it should be something like:

    auto result = file.byChunk(4096 * 1025).joiner.hash();

It's also important to have a stateful hash implementation that can be updated incrementally, e.g. from a callback.

Take a look at the reduce function in http://dlang.org/phobos/std_algorithm.html#reduce It has provision for an initial state that can be the current running total.
Aug 08 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/8/2012 12:08 PM, Johannes Pfau wrote:
 No where's the difference, except that for hashes the context ('hash')
 has to be setup and finished manually?

The idea is to have hash act like a component - not with special added code the user has to write. In this case, it needs to work like a reduce algorithm, because it is a reduce algorithm. Need to find a way to make this work.
Aug 08 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/9/2012 2:59 AM, David Nadlinger wrote:
 On Wednesday, 8 August 2012 at 19:27:54 UTC, Walter Bright wrote:
 The idea is to have hash act like a component - not with special added code
 the user has to write.

Sorry, but I think this is a meaningless statement without specifying what kind of interface the component should adhere to. In my opinion, the proposed std.hash design would be a perfectly valid interface for »accumulate stuff and at some point get a result«-type components.

It is not a meaningless statement in that components have a predictable set of methods and properties. That's all a range is. Requiring extra methods means there's either an error in the component interface design or an error in the component instance design. What I'm trying to get away from is the C library style where every library lives in its own world, and when the user tries to connect them he's got a fair amount of work to do building a scaffolding between them. With component programming, the interfaces between disparate things is standardized. It does not have unique methods for different instances. For example, one component has a finish() method, another has a getResult() method, and a third has no method at all. This situation I wish to avoid.
 In this case, it needs to work like a reduce algorithm, because it is a reduce
 algorithm. Need to find a way to make this work.

Hash functions are _not_ analogous to reduce(), because the operation performed by reduce() is stateless, whereas hash functions generally have some internal state. »Continuing« a reduce() operation by repeatedly calling it with the last partial result as the starting value is only possible because there is no additional state to carry over. To make this work with hashes, you'd have to return something encapsulating the internal state from your hash function.

Wouldn't that be simply the handle to the hash?
 But then,
 you again need to obtain the actual result from that return value from that
 result somehow, which defeats the original intent of making it work like reduce
 – and incidentally is what finish() does.

I understand what finish() does. The interesting part is trying to figure a way out of needing that method. Or perhaps the reduce component design is incorrect.
Aug 09 2012
parent reply travert phare.normalesup.org (Christophe Travert) writes:
If a has is a range, it's an output range, because it's something you   
fee data to. Output range have only one method: put. Johannes used this 
method. But it's not sufficient, you need something to start and to 
finish the hash.

To bring consistency in the library, we should not remove this start and 
finish method. We should make all output range of the library use the 
same functions.

In the library, we have really few output range. We have writable input 
range and we have Appender. Is there more? There should be files, 
socket, maybe even signal, but IIRC these don't implement output range 
at the moment. What did I miss?

Appender doesn't use a finish method, but we have to 'get the 
result' of the appender, and for this we use appender.data. This name is 
not appropriate for generically getting a result or terminating an 
output range.

So We need a name that fits most output range use. start/finish sounds 
not bad. open/close fits files and socket, but many not all output 
ranges. Relying solely on constructors, opCall or alias this seems 
dangerous to me.

-- 
Christophe
Aug 09 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
See the new thread Andrei started entitled "finish function for output ranges". 
I think this discussion has clearly discovered a shortcoming in the current 
range design, and Andrei has a proposed solution.
Aug 11 2012
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 09-Aug-12 14:15, Regan Heath wrote:
 On Thu, 09 Aug 2012 10:59:47 +0100, David Nadlinger <see klickverbot.at>

 If the range/hash object stores the current state and returns this as
 the result of hashreduce, it would be chainable.  If it also had a
 "Digest" property/method which performed finish on a /temporary copy/ of
 the state it would almost be as automatic as reduce.  There would still
 be a manual step to get the result, but it would be analogous to calling
 toString on any range object to output it's "value".  The Digest
 property/method would not modify the internal state, and could be called
 at any time between (not sure there is a point to this) or after chained
 hashreduce operations.

struct ShaState { ... alias ubyte[16] getDidgest(); } Problem is: too much magic, spoils auto x = reduce(...); idiom. To be brutally honest I don't see what in std.hash doesn't fit "range component" model: - it works on input ranges via convenient adaptor (or would work soon) - it has output _ranges_. What's not to like about it? That it doesn't fit general reduce algorithm? I'm not convinced it's important. In the same vane you may try to shoehorn symmetric cyphers to follow map interface because conceptually it does 1:1 conversion. Now important thing: just look at other output ranges. They either require extra call (see Appender .data) or horrendously slow and ugly see lockingTextWriter/Reader. -- Dmitry Olshansky
Aug 09 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 09-Aug-12 20:32, Dmitry Olshansky wrote:
 On 09-Aug-12 14:15, Regan Heath wrote:
 On Thu, 09 Aug 2012 10:59:47 +0100, David Nadlinger <see klickverbot.at>

 If the range/hash object stores the current state and returns this as
 the result of hashreduce, it would be chainable.  If it also had a
 "Digest" property/method which performed finish on a /temporary copy/ of
 the state it would almost be as automatic as reduce.  There would still
 be a manual step to get the result, but it would be analogous to calling
 toString on any range object to output it's "value".  The Digest
 property/method would not modify the internal state, and could be called
 at any time between (not sure there is a point to this) or after chained
 hashreduce operations.

struct ShaState { ... alias ubyte[16] getDidgest(); }

Too fast.. should have been: ubyte[16] getDidgest(); alias getDigest this; -- Dmitry Olshansky
Aug 09 2012
parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, August 09, 2012 18:46:59 David Nadlinger wrote:
 On Thursday, 9 August 2012 at 16:37:57 UTC, Dmitry Olshansky
 
 wrote:
 Too fast.. should have been:
 ubyte[16] getDidgest();
 alias getDigest this;

I have been thinking about using AliasThis as well, but the problem is that precisely the use case this is meant to enable (i.e. »snapping together components«, like Walter said) tends to get broken in subtle ways due to the use of template functions/type inference.

Yeah. alias this can be very useful, but it's very dangerous when it comes to templated functions, because it's very easy to make assumptions when writing those functions which hold great when using the actual type but do funny things when alias this comes into play. And unfortunately, I don't think that it's something that's at all well understood. It's probably one of those things that we as a community need to get a better grip on. Just imagine how bad it would be though if D allowed as many implicit conversions as C++ does... - Jonathan M Davis
Aug 09 2012
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 15-Aug-12 11:41, Kagamin wrote:
 On Thursday, 9 August 2012 at 09:59:48 UTC, David Nadlinger wrote:
 In this case, it needs to work like a reduce algorithm, because it is
 a reduce algorithm. Need to find a way to make this work.

Hash functions are _not_ analogous to reduce(), because the operation performed by reduce() is stateless, whereas hash functions generally have some internal state.

An example of stateless hash in .net: http://msdn.microsoft.com/en-us/library/xa627k19.aspx

AFAIK it'a method of HashAlgorithm Object. http://msdn.microsoft.com/en-us/library/c06s9c55 It also includes TransformBlock & TransformFinalBlock. It does contain state of course.
Aug 15 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 15-Aug-12 12:17, Kagamin wrote:
 On Wednesday, 15 August 2012 at 08:13:27 UTC, Dmitry Olshansky wrote:
 AFAIK it'a method of HashAlgorithm Object.

It's a minor design detail, see the example: the method is called on each file without any explicit preparations and without calls to methods like TransformBlock. That's how stateless computation usually looks - it's just done and that's all.

Brrr. It's how convenience wrapper works :) And I totally expect this to call the same code and keep the same state during the work. E.g. see std.digest.digest functions digest or hexDigest you could call it stateless in the same vane.
Aug 15 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 15-Aug-12 12:45, Kagamin wrote:
 On Wednesday, 15 August 2012 at 08:25:51 UTC, Dmitry Olshansky wrote:
 Brrr. It's how convenience wrapper works :)

 And I totally expect this to call the same code and keep the same
 state during the work.

 E.g. see std.digest.digest functions digest or hexDigest you could
 call it stateless in the same vane.

Well there was a wish for stateless hash, Walter even posted the required interface: auto result = file.byChunk(4096 * 1025).joiner.hash();

auto result = file.byChunk(4096 * 1025).joiner.digest(); and is already supported in the proposal, peek at updated docs. There is no need for additional methods and whatnot. -- Olshansky Dmitry
Aug 15 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/8/2012 9:47 AM, Johannes Pfau wrote:
 Am Wed, 8 Aug 2012 17:50:33 +0200
 schrieb Johannes Pfau <nospam example.com>:

 However, I do agree digest!Hash, md5Of, sha1Of should have an
 additional overload which takes a InputRange. It would be implemented
 with copy and be a nice convenience function.

I implemented the function, it's actually quite simple: ---- digestType!Hash digestRange(Hash, Range)(Range data) if(isDigest!Hash && isInputRange!Range && __traits(compiles, digest!Hash(ElementType!(Range).init))) { Hash hash; hash.start(); copy(data, hash); return hash.finish(); } ----

The finish() should be implicit when the range ends.
 but I don't know how make it an overload. See thread "overloading a
 function taking a void[][]" in D.learn for details.

I don't know what you mean, it takes a range, not a void[][] as input.
Aug 08 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/8/2012 12:05 PM, Johannes Pfau wrote:
 So the post in D.learn for a detailed description. Yes the code I
 posted takes a range, but digest (as it is now) takes void[][] to
 accept all kind of types _without_ template bloat. The difficulty is to
 combine those two overloads without causing unnecessary template bloat.

Have the templated version with overloads simply call the single version (with a different name) with void[][].
Aug 08 2012
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/9/2012 2:05 AM, Johannes Pfau wrote:
 I guess a second function digestRange is not acceptable?

It's more the user API that matters, not how it works under the hood.
Aug 09 2012
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/9/2012 2:05 AM, Johannes Pfau wrote:
 http://dpaste.dzfl.pl/f86717f7

The Range argument - is it an InputRange, an OutputRange? While it's just a type name, the name should reflect what kind of range it is from the menagerie of ranges in std.range.
Aug 09 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/9/12 5:05 AM, Johannes Pfau wrote:
 Well that's possible, but I don't like the template bloat it causes.

What have you measured, and what is your dislike based upon? The library function must be generic. Then users worried about bloating may use it with a limited number of types. A digest function only dealing with void[void[]] is unacceptable. Andrei
Aug 09 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-08-09 15:02, Johannes Pfau wrote:

 What annoys me is that as long the function only supported arrays, it
 didn't need templates _at all_. So template bloat for arrays = 0. But
 adding range support means the version dealing with arrays now has to
 be a template as well(which is probably a bug, can't overload template
 and non template function) and will produce extra code for every array
 type. I just think adding range support shouldn't cause the array code
 to change in any way.

A workaround is to make the non-template function to a template, with no arguments. This should only cause one instantiation: void foo (T) (T t) if (/* some constraint making it not match "int" */); void foo () (int x); -- /Jacob Carlborg
Aug 09 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
 It should accept an input range. But using an Output Range confuses me.  
 A hash function is a reduce algorithm - it accepts a sequence of input  
 values, and produces a single value. You should be able to write code  
 like:

    ubyte[] data;
    ...
    auto crc = data.crc32();

 For example, the hash example given is:

    foreach (buffer; file.byChunk(4096 * 1024))
        hash.put(buffer);
    auto result = hash.finish();

 Instead it should be something like:

    auto result = file.byChunk(4096 * 1025).joiner.hash();

It's also important to have a stateful hash implementation that can be updated incrementally, e.g. from a callback.
Aug 08 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
--------
Hash hash;

void onData(void[] data)
{
     hash.put(data);
}

void main()
{
     hash.start();
     auto stream = new EventTcpStream("localhost", 80);
     stream.onData = &onData;
     hash.finish();
}
--------

 Have the callback supply a range interface to call the hash with.

That hardly works for event based programming without using coroutines. It's the classical inversion-of-control dilemma of event based programming that forces you to save/restore your state with every event.
Aug 08 2012
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Wednesday, 8 August 2012 at 19:27:54 UTC, Walter Bright wrote:
 The idea is to have hash act like a component - not with 
 special added code the user has to write.

Sorry, but I think this is a meaningless statement without specifying what kind of interface the component should adhere to. In my opinion, the proposed std.hash design would be a perfectly valid interface for »accumulate stuff and at some point get a result«-type components.
 In this case, it needs to work like a reduce algorithm, because 
 it is a reduce algorithm. Need to find a way to make this work.

Hash functions are _not_ analogous to reduce(), because the operation performed by reduce() is stateless, whereas hash functions generally have some internal state. »Continuing« a reduce() operation by repeatedly calling it with the last partial result as the starting value is only possible because there is no additional state to carry over. To make this work with hashes, you'd have to return something encapsulating the internal state from your hash function. But then, you again need to obtain the actual result from that return value from that result somehow, which defeats the original intent of making it work like reduce – and incidentally is what finish() does. David
Aug 09 2012
prev sibling next sibling parent "Regan Heath" <regan netmail.co.nz> writes:
On Thu, 09 Aug 2012 10:59:47 +0100, David Nadlinger <see klickverbot.at>=
  =

wrote:

 On Wednesday, 8 August 2012 at 19:27:54 UTC, Walter Bright wrote:
 The idea is to have hash act like a component - not with special adde=


 code the user has to write.

Sorry, but I think this is a meaningless statement without specifying =

 what kind of interface the component should adhere to. In my opinion, =

 the proposed std.hash design would be a perfectly valid interface for =

 =C2=BBaccumulate stuff and at some point get a result=C2=AB-type compo=

 In this case, it needs to work like a reduce algorithm, because it is=


 reduce algorithm. Need to find a way to make this work.

Hash functions are _not_ analogous to reduce(), because the operation =

 performed by reduce() is stateless, whereas hash functions generally  =

 have some internal state.

 =C2=BBContinuing=C2=AB a reduce() operation by repeatedly calling it w=

 partial result as the starting value is only possible because there is=

 no additional state to carry over. To make this work with hashes, you'=

 have to return something encapsulating the internal state from your ha=

 function.

This isn't necessarily a problem.
 But then, you again need to obtain the actual result from that return =

 value from that result somehow, which defeats the original intent of  =

 making it work like reduce =E2=80=93 and incidentally is what finish()=

But, this is a problem. finish in most cases pads the remaining data to= a = boundary of the internal state size, then completes one more iteration o= f = the algorithm to produce the final result. So, like David has just said, you can have 1 or the other. Either you c= an = chain hashreduce operations together, but you have to perform a manual = finish step to get the actual result, or you cannot chain hashreduce = operations together and finish is done automatically when the input rang= e = is consumed. Wild thought.. and I have to admit I've not followed the proposed or = suggested API closely, nor have I used ranges extensively so this may no= t = be possible.. If the range/hash object stores the current state and returns this as th= e = result of hashreduce, it would be chainable. If it also had a "Digest" = = property/method which performed finish on a /temporary copy/ of the stat= e = it would almost be as automatic as reduce. There would still be a manua= l = step to get the result, but it would be analogous to calling toString on= = any range object to output it's "value". The Digest property/method wou= ld = not modify the internal state, and could be called at any time between = (not sure there is a point to this) or after chained hashreduce operatio= ns. R -- = Using Opera's revolutionary email client: http://www.opera.com/mail/
Aug 09 2012
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Thursday, 9 August 2012 at 16:37:57 UTC, Dmitry Olshansky 
wrote:
 Too fast.. should have been:
 	ubyte[16] getDidgest();
 	alias getDigest this;

I have been thinking about using AliasThis as well, but the problem is that precisely the use case this is meant to enable (i.e. »snapping together components«, like Walter said) tends to get broken in subtle ways due to the use of template functions/type inference. David
Aug 09 2012
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Thursday, 9 August 2012 at 09:59:48 UTC, David Nadlinger wrote:
 In this case, it needs to work like a reduce algorithm, 
 because it is a reduce algorithm. Need to find a way to make 
 this work.

Hash functions are _not_ analogous to reduce(), because the operation performed by reduce() is stateless, whereas hash functions generally have some internal state.

An example of stateless hash in .net: http://msdn.microsoft.com/en-us/library/xa627k19.aspx
Aug 15 2012
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Wednesday, 15 August 2012 at 07:41:20 UTC, Kagamin wrote:
 On Thursday, 9 August 2012 at 09:59:48 UTC, David Nadlinger 
 wrote:
 Hash functions are _not_ analogous to reduce(), because the 
 operation performed by reduce() is stateless, whereas hash 
 functions generally have some internal state.

An example of stateless hash in .net: http://msdn.microsoft.com/en-us/library/xa627k19.aspx

http://msdn.microsoft.com/en-us/library/system.security.cryptography.hashalgorithm.state David
Aug 15 2012
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Wednesday, 15 August 2012 at 08:13:27 UTC, Dmitry Olshansky 
wrote:
 AFAIK it'a method of HashAlgorithm Object.

It's a minor design detail, see the example: the method is called on each file without any explicit preparations and without calls to methods like TransformBlock. That's how stateless computation usually looks - it's just done and that's all.
Aug 15 2012
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Wednesday, 15 August 2012 at 08:17:14 UTC, Kagamin wrote:
 On Wednesday, 15 August 2012 at 08:13:27 UTC, Dmitry Olshansky 
 wrote:
 AFAIK it'a method of HashAlgorithm Object.

It's a minor design detail, see the example […]

No, it's not a »minor design detail«, at least not regarding what has been the topic of the discussion here – you can always provide a simple wrapper function in the proposed design and call it »stateless« as well (in fact, an implementation has already been posted, IIRC). The point is that the ability to execute a hashing operation block by block is necessary, and that this operation is not analogous to reduce() because it potentially needs internal state. David
Aug 15 2012
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Wednesday, 15 August 2012 at 08:17:01 UTC, David Nadlinger 
wrote:
 On Wednesday, 15 August 2012 at 07:41:20 UTC, Kagamin wrote:
 On Thursday, 9 August 2012 at 09:59:48 UTC, David Nadlinger 
 wrote:
 Hash functions are _not_ analogous to reduce(), because the 
 operation performed by reduce() is stateless, whereas hash 
 functions generally have some internal state.

An example of stateless hash in .net: http://msdn.microsoft.com/en-us/library/xa627k19.aspx

http://msdn.microsoft.com/en-us/library/system.security.cryptography.hashalgorithm.state David

Ok, but HashAlgorithm still supports stateless interface which consists of a single method with a couple of overloads, the example in the ComputeHash article speaks for itself.
Aug 15 2012
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Wednesday, 15 August 2012 at 08:25:51 UTC, Dmitry Olshansky 
wrote:
 Brrr. It's how convenience wrapper works :)

 And I totally expect this to call the same code and keep the 
 same state during the work.

 E.g. see std.digest.digest functions digest or hexDigest you 
 could call it stateless in the same vane.

Well there was a wish for stateless hash, Walter even posted the required interface: auto result = file.byChunk(4096 * 1025).joiner.hash(); I just pointed out, that possibly stateful implementation doesn't prevent stateless interface. Can one even say that the implementation is stateful given just a stateless interface? One can even call reduce stateful because it does keep track of the result which is or a part of its state.
Aug 15 2012
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Wednesday, 15 August 2012 at 08:45:35 UTC, Kagamin wrote:
 I just pointed out, that possibly stateful implementation 
 doesn't prevent stateless interface. Can one even say that the 
 implementation is stateful given just a stateless interface?

And our point is that such an interface is trivial to implement over a »stateful« interface, and even already exists. Maybe you want to peruse some of the old posts? David
Aug 15 2012
prev sibling parent "Regan Heath" <regan netmail.co.nz> writes:
On Thu, 16 Aug 2012 21:25:55 +0100, deadalnix <deadalnix gmail.com> wrot=
e:

 Le 09/08/2012 11:48, Johannes Pfau a =E9crit :
 Am Wed, 08 Aug 2012 12:31:29 -0700
 schrieb Walter Bright<newshound2 digitalmars.com>:

 On 8/8/2012 12:14 PM, Martin Nowak wrote:
 That hardly works for event based programming without using
 coroutines. It's the classical inversion-of-control dilemma of
 event based programming that forces you to save/restore your state
 with every event.

See the discussion on using reduce().

I just don't understand it. Let's take the example by Martin Nowak an=


 port it to reduce: (The code added as comments is the same code for
 hashes, working with the current API)

 int state; //Hash state;

 void onData(void[] data)
 {
       state =3D reduce(state, data); //copy(data, state);
       //state =3D copy(data, state); //also valid, but not necessary
       //state.put(data); //simple way, doesn't work for ranges
 }

 void main()
 {
       state =3D 0; //state.start();
       auto stream =3D new EventTcpStream("localhost", 80);
       stream.onData =3D&onData;
       //auto result =3D hash.finish();
 }

 There are only 2 differences:

 1:
 the order of the arguments passed to copy and reduce is swapped. This=


 kinda makes sense (if copy is interpreted as copyTo). Solution: Provi=


 a method copyInto with swapped arguments if consistency is really so
 important.

 2:
 We need an additional call to finish. I can't say it often enough, I
 don't see a sane way to avoid it. Hashes work on blocks, if you didn'=


 pass enough data finish will have to fill the rest of the block with
 zeros before you can get the hash value. This operation can't be
 undone. To get a valid result with every call to copy, you'd have to
 always call finish. This is
 * inefficient, you calculate intermediate values you don't need at al=


 * you have to copy the hashes state, as you can't continue hashing
    after finish has been called

 and both, the state and the result would have to fit into the one val=


 (called seed for reduce). But then it's still not 100% consistent, as=


 reduce will return a single value, not some struct including internal=


 state.

I'm pretty sure it is possible to pad and finish when a result is =

 required without messing up the internal state.

Without copying it? AFAICR padding/finishing mutates the state, I mean,= = that's the whole point of it. R -- = Using Opera's revolutionary email client: http://www.opera.com/mail/
Aug 17 2012
prev sibling next sibling parent Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
The question about module names.

Is it supposed that e.g. std.hash.crc module will contain many CRC 
implementations, not only one CRC-32? If not, it will be better to call 
it std.hash.crc32 because other CRC variants are also in use. Or even 
std.hash.crc.crc32.

Same with std.hash.sha and std.hash.md modules.

-- 
Денис В. Шеломовский
Denis V. Shelomovskij
Aug 08 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 08 Aug 2012 00:55:47 +0200
schrieb David <d dav1d.de>:

 Am 07.08.2012 21:53, schrieb Johannes Pfau:
 Am Tue, 07 Aug 2012 20:07:07 +0200
 schrieb David <d dav1d.de>:

 start is here as structs can't have default constructors. For all
 current hashes it's really mostly useless (except it can be used as
 a simple way to reset a hash, which is also why the hash structs
 don't have a reset member), but I just don't know if there are
 hashes which would require a more complex (runtime) initialization.

Ok this point with the one above makes sense (I implemented my OpenSSL hashing wrapper as a class, initilaization is done in the constructor), it still doesn't feel right, if you have to call .start first. What about a struct-constructor which calls .start internally, so you get a bit more of a modern API (imo) and you're still able to implement the same interface for any kind of wrapper/own implementation/whatever.

You mean an external function which constructs the digest context and calls start? That's similar to the appender function in std.array and probably a good idea. We'd need a good name for it though. * initContext * initializeContext * context * startContext * startHash / startDigest
 property is not a good idea, as finish must reset the internal state
 for some objects, so you can't access the property repeatedly. It'd
 have to be a function.

Well, you could store the result internally, property generates on the first call the digest, stores it (that's not too much, 16 byte for a md5) and returns the already stored value on the 2nd call.

yes, that could be done but to be honest I don't think it's worth it.
 which calls finish and returns an
 ubyte[] array (or string for hexdigest)

This could be done and is probably a matter of taste. I prefer the free function in this case, it makes clear that digest() really is a generic function which can be used with any hash. It's also one function less which must be implemented by hash writers. I'd like to keep the number of members required to implement a hash as low as possible.

With digest you mean: http://dl.dropbox.com/u/24218791/d/phobos/std_hash_hash.html#digest ? You normally always want the hexdigest (you barely need "real" digest),

really? I though it's the other way round and comparing/verifying hashes is the common case?
 so it's a matter of convenience. Well, thanks to UFCS, you
 can call it like a method.
 

it's very easy to implement this in user code as well: string hexDigest(Hash, Order order = Order.increasing)(scope const(void[])[] data...) if(isDigest!Hash) { return digest!Hash(data).toHexString!order(); }
 A static
 method would have to be implemented by every hash.

Yes, you would have to implement it for every hash, but that's 3 lines: static string hexdigest(void[] data) { return toHexString(digest!(typeof(this))("yay")); }

The implementation is indeed quite simple, so it probably is a matter of taste whether hexDigest is implemented as a member or as a free function.
 BTW: it is implemented as a member for the OOP API, as we can just
 implement it in the base interface, so the actual implementations
 don't have to care about it.

I've seen that, but I am wondering why it's not a static method, the creation of the Hash Object could be hidden inside. You might say, this would allocate a class instance just for a single use, without the possibility to reuse it. Correct, but I think that's the use of such a function, if you just need a Hash, nothing else, otherwise you could use the class "directly", with all its other functionality.

I have to think about this. It's too bad we can't have static & member functions with the same name in D. I'd probably argue if you want the behavior of a static function, use the struct API: digest!MD5(data); Doesn't this do everything you need, without the GC allocation the OOP api would cause? I mean the OOP API is mostly about polymorphism (and about reference semantics). If you use a static method as proposed, you can't use polymorphism and reference semantics are not useful either, as you never get a reference to the context?
 The OOP/class api can use a default constructor. But unlike the
 start function it can't work as a reset function, so there's an
 explicit reset function.

I am not sure what do think about that. On the one side it's useful if you really need that speed, but then I think, is it really worth it resetting the state hidden in a function. If you really want to avoid another allocation you could use emplace (if I didn't misunderstand the use of emplace). To quote from the Python Zen: "Explicit is better than implicit."

Yes, it should be possible to use emplace for that. Emplace is not well-known though and I'm not sure how well it works now. Also I'm not sure if emplace is enough in this case, I guess you'd have to destroy the old instance as well (so destructors are run)? But I'm not really sure about the importance of reset either. As long as the main implementation is in the struct API and WrapperDigest is used to wrap it into the OOP API there's no extra work to implement reset. If the main implementation is in the OOP interface, the reset function probably looks a lot like the constructor, so simply calling reset from the constructor should avoid code duplication.
 
 (I also find the examples quite confusing, why do you want to
 reset the hash onject?

when it makes sense to use them. Maybe I should care more about that?

When I saw the documentation/examples, I was confused, why do you want to reset a hash object, so does it really reset the state? So I had to take a look into the code before I was sure, it surely does reset the whole thingy (is it called context?).

Yes, the whole terminology is probably a little confusing. The old API called the contexts like this: MD5_CTX, but that doesn't fit the phobos naming conventions. MD5CTX, MD5Ctx, Md5Ctx and Md5CTX all look ugly (and only the first fits our naming conventions). So MD5Context is the next best choice, but it's a little long. It'd more precise though.
 
 It's more efficient than allocating a new hash with 'new' (or do you
 mean why use reset if finish resets anyway? It's useful if you put
 some data into a hash, then decide you don't want the hash (because
 the user aborts the operation, network connection is lost,...) but
 you still need the hash object later on.) You could just call
 finish and disregard the result, but the reset implementation is
 faster. I agree there's probably no _strong_ need for reset, but I
 think it doesn't do any harm.

"Explicit is better than implicit."

I'm just not sure if reset can be implemented in a reasonable way outside of the class in all cases. But then it's probably not important enough to justify an extra member...
Aug 08 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Tue, 07 Aug 2012 17:39:15 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 8/7/2012 10:39 AM, Dmitry Olshansky wrote:
 std.hash.hash is a new module for Phobos defining an uniform
 interface for hashes and checksums. It also provides some useful
 helper functions to deal with this new API.

The hash functions must use a Range interface, not a file interface. This is extremely important.

I guess this is meant as a general statement and not specifically targeted at my std.hash proposal? I'm a little confused as all hashes already are OutputRanges in my proposal. It's probably not explicit enough in the documentation, but it's mentioned in one example and in the documentation for 'put';
Aug 08 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 08 Aug 2012 01:27:45 +0200
schrieb Piotr Szturmaj <bncrbme jadamspam.pl>:

 Johannes Pfau wrote:
 As I had some free time and nobody else stepped up I implemented it,

I tried it before but I wanted to create whole crypto package at once,

'nobody stepped up' is this: We had a pull request which moved the crc code into std.crc and it was already merged in for 2.060. I complained that it had yet another API and that we should have a common API. In the end that pull request was reverted but I felt kinda guilty as that left us without a public crc module. In that discussion, nobody else had time to implement that API, so I thought we'd need a solution soon and started implementing it.
 and I guess it was a huge mistake.

I'd personally like to see a bcrypt implementation. There's no widespread bcrypt C library which could be used, so there's no painless way to use bcrypt in D. (Although there _is_ public domain C source code)
 I know you have used my code
 in your uuid module, you can use it to add it to std.hash if you
 want. There are all SHA implementations, MD5 and a Tiger (both
 versions). They all support bit hashing and work with CTFE.

a standardized hash API in phobos). BTW: How does it work in CTFE? Don't you have to do endianness conversions at some time? According to Don that's not really supported. Another problem with prevents CTFE for my proposal would be that the internal state is currently implemented as an array of uints, but the API uses ubyte[] as a return type. That sort of reinterpret cast is not supposed to work in CTFE though. I wonder how you avoided that issue? And another problem is that void[][] (as used in the 'digest' function) doesn't work in CTFE (and it isn't supposed to work). But that's a problem specific to this API.
 I made a quick look at your proposal, but currently don't have time
 to do a deep investigation :). One note though. I don't think it's 
 neccessary to maintain two different API sets. Sure, classes need
 heap allocation but usually that's only one allocation, it's more
 important to not allocate during algorithm computations. Fortunately
 crypto algorithms don't do this. So, struct allocation efficency is
 negligible here. I think separate struct based API set is not worth
 it (or I'm missing something not related to allocation cost).

I'm not sure, I guess some people would disagree. I initially asked whether we should use a struct based API or a class based API (+emplace where necessary). At that time the struct API was favored. It's about the allocation cost (and the additional GC work) although people could probably also complain about the indirection classes bring along. There's one important example where you really don't want any allocation to happen (and especially no GC allocation): You have a function which always calculates the same type of hash (e.g. md5) and you don't care about the implementation. You only care about the final hash, the hash context is never used outside of that function. This is an perfect fit for stack allocation, especially if the function is used a lot (GC). With classes you'd have to do some caching to do it efficiently (which then gives problems with const). It would be possible to use a class API + emplace, but that was deemed to be too cumbersome.
Aug 08 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 08 Aug 2012 11:48:54 +0400
schrieb Denis Shelomovskij <verylonglogin.reg gmail.com>:

 The question about module names.
 
 Is it supposed that e.g. std.hash.crc module will contain many CRC 
 implementations, not only one CRC-32? If not, it will be better to
 call it std.hash.crc32 because other CRC variants are also in use. Or
 even std.hash.crc.crc32.
 
 Same with std.hash.sha and std.hash.md modules.
 

They're supposed to contain more implementations in the future. I basically just wrote the new API, then ported what we already had in phobos (and in a pull request by redstar) to the new API. The rest of the SHA functions could probably follow soon as Piotr Szturmaj already has a properly licensed implementation. The Boost project has a templated CRC implementation, we could port that as well. One problem is that the current MD5 and SHA implementations have been manually optimized for dmds inliner though, so those probably won't be replaced by a more generic version until we have benchmarks in phobos which ensure that there won't be performance regressions.
Aug 08 2012
prev sibling next sibling parent travert phare.normalesup.org (Christophe Travert) writes:
I'm not familiar with hash functions in general.

I think the core of std.hash is the digest function:

digestType!Hash digest(Hash)(scope const(void[])[] data...) 
if(isDigest!Hash)
{
    Hash hash;
    hash.start();
    foreach(datum; data)
        hash.put(cast(const(ubyte[]))datum);
    return hash.finish();
}

That seems to be too restrictive: you can only provide a void[][] or one 
or several void[], but you should be able to give it any range of 
void[] or of ubyte[] like:

auto dig = file.byChunk.digest!MD5;

That's the point of the range interface.

this can be done by templatizing the function, something like 
(untested):

template digest(Hash) if(isDigest!Hash)
{
  auto digest(R)(R data)
    if (isInputRange!R &&  is(ElementType!R : void[])
  {
    Hash hash;
    hash.start();
    data.copy(hash);
    return hash.finish();
  }
}

An interesting overload for range of single ubyte could be provided. 
This overload would fill a buffer of with data from this range,  
feed the hash, and start again.
Aug 08 2012
prev sibling next sibling parent "Chris Cain" <clcain uncg.edu> writes:
On Wednesday, 8 August 2012 at 13:38:26 UTC, 
travert phare.normalesup.org (Christophe Travert) wrote:
 I think the question is: is std.hash going to contain only
 message-digest algorithm, or could it also contain other hash 
 functions?
 I think there is enough room in a package to have both 
 message-digest
 algorithm and other kinds of hash functions.

Even if that were the case, I'd say they should be kept separate. Cryptographic hash functions serve extremely different purposes from regular hash functions. There is no reason they should be categorized the same.
Aug 08 2012
prev sibling next sibling parent "Regan Heath" <regan netmail.co.nz> writes:
On Wed, 08 Aug 2012 14:50:22 +0100, Chris Cain <clcain uncg.edu> wrote:

 On Wednesday, 8 August 2012 at 13:38:26 UTC,  
 travert phare.normalesup.org (Christophe Travert) wrote:
 I think the question is: is std.hash going to contain only
 message-digest algorithm, or could it also contain other hash functions?
 I think there is enough room in a package to have both message-digest
 algorithm and other kinds of hash functions.

Even if that were the case, I'd say they should be kept separate. Cryptographic hash functions serve extremely different purposes from regular hash functions. There is no reason they should be categorized the same.

I don't think there is any reason to separate them. People should know which digest algorithm they want, they're not going to pick one at random and assume it's "super secure!"(tm). And if they do, well tough, they deserve what they get. "std.digest" can encompass all message digest algorithms, whether secure or not. We could create a 2nd level below "secure" or "crypto" or similar if we really want, but I don't see much point TBH. R -- Using Opera's revolutionary email client: http://www.opera.com/mail/
Aug 08 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 08 Aug 2012 02:49:00 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 
 It should accept an input range. But using an Output Range confuses
 me. A hash function is a reduce algorithm - it accepts a sequence of
 input values, and produces a single value. You should be able to
 write code like:
 
    ubyte[] data;
    ...
    auto crc = data.crc32();

auto crc = crc32Of(data); auto crc = data.crc32Of(); //ufcs This doesn't wok with every InputRange and this needs to be fixed. That's a quite simple fix (max 10 lines of code, one new overload) and not a inherent problem of the API (see below for more).
 
 For example, the hash example given is:
 
    foreach (buffer; file.byChunk(4096 * 1024))
        hash.put(buffer);
    auto result = hash.finish();
 
 Instead it should be something like:
 
    auto result = file.byChunk(4096 * 1025).joiner.hash();

But it also says this: //As digests implement OutputRange, we could use std.algorithm.copy //Let's do it manually for now You can basically do this with a range interface in 1 line: ---- import std.algorithm : copy; auto result = copy(file.byChunk(4096 * 1024), hash).finish(); ---- or with ufcs: ---- auto result = file.byChunk(4096 * 1024).copy(hash).finish(); ---- OK, you have to initialize hash and you have to call finish. With a new overload for digest it's as simple as this: ---- auto result = file.byChunk(4096 * 1024).digest!CRC32(); auto result = file.byChunk(4096 * 1024).crc32Of(); //with alias ---- The digests are OutputRanges, you can write data to them. There's absolutely no need to make them InputRanges as they only produce 1 value, and the hash sum is produced at once, so there's no way to receive the result in a partial way. A digest is very similar to Appender and it's .data property in this regard. The put function could accept an InputRange, but I think there was a thread recently which said this is evil for OutputRanges as the same feature can be achieved with copy. There's also no big benefit in doing it that way. If your InputRange is really unbuffered you could avoid double buffering. But then you transfer data byte by byte which will be horribly slow. If your InputRange has an internal buffer copy should just copy from that internal buffer to the 64 byte buffer used inside the digest implementation. This double buffering could only be avoided if the put function accepted an InputRange and could supply a buffer for that InputRange so the InputRange could write directly into the 64 byte buffer. But there's nothing like that in phobos, so this is all speculation. (Also the internal buffer is only used for the first 64 bytes (or less) of the supplied data. The rest is processed without copying. It could probably be optimized so that there's absolutely no copying as long as the input buffer length is a multiple of 64)
 
 The magic is that any input range that produces bytes could be used,
 and that byte producing input range can be hooked up to the input of
 any reducing function.

have to use copy.
 
 The use of a member finish() is not what any other reduce algorithm
 has, and so the interface is not a general component interface.

It's a struct with state, not a simple reduce function so it needs that finish member. It works like that way in every other language (and this is not cause those languages don't have ranges; streams and iterators (as in C#) work exactly the same in this case). Let's take a real world example: You want to download a huge file with std.net.curl and hash it on the fly. Completely reading into a buffer is not possible (large file!). Now std.net.curl has a callback interface (which is forced on us by libcurl). How would you map that into an InputRange? (The byLine range in std.net.curl is eager, byLineAsync needs an additional thread). A newbie trying to do that will despair as it would work just fine in every other language, but D forces that InputRange interface. Implementing it as an OutputRange is much better. The described scenario works fine and hashing an InputRange also works fine - just use copy. OutputRange is much more universal for this usecase. However, I do agree digest!Hash, md5Of, sha1Of should have an additional overload which takes a InputRange. It would be implemented with copy and be a nice convenience function.
 
 I know the documentation on ranges in Phobos is incomplete and
 confusing.

Especially for copy, as the documentation doesn't indicate the line I posted could work in any way ;-)
Aug 08 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 08 Aug 2012 11:27:49 +0200
schrieb Piotr Szturmaj <bncrbme jadamspam.pl>:

 BTW: How does it work in CTFE? Don't you have to do endianness
 conversions at some time? According to Don that's not really
 supported.

std.bitmanip.swapEndian() works for me

Great! I always tried the *endianToNative and nativeTo*Endian functions. So I didn't expect swapEndian to work.
 
 Another problem with prevents CTFE for my proposal would be that the
 internal state is currently implemented as an array of uints, but
 the API uses ubyte[] as a return type. That sort of reinterpret
 cast is not supposed to work in CTFE though. I wonder how you
 avoided that issue?

There is set of functions that abstract some operations to work with CTFE and at runtime: https://github.com/pszturmaj/phobos/blob/master/std/crypto/hash/base.d#L66. Particularly memCopy().

I should definitely look at this later. Would be great if hashes worked in CTFE.
 And another problem is that void[][] (as used in the 'digest'
 function) doesn't work in CTFE (and it isn't supposed to work). But
 that's a problem specific to this API.

Yes, that's why I use ubyte[].

strings, but for various reasons it didn't work out in the end.
 
 I don't think std.typecons.scoped is cumbersome:
 
 auto sha = scoped!SHA1(); // allocates on the stack
 auto digest = sha.digest("test");

Yes I'm not sure about this. But a class only based interface probably hasn't high chances of being accepted into phobos. And I think the struct interface+wrappers approach isn't bad.
 
 Why I think classes should be supported is the need of polymorphism.

windows crypto) at runtime. I know it's very useful, this is why we have the OOP api. It's very easy to wrap the OOP api onto the struct api. These are the implementations of MD5Digest, CRC32Digest and SHA1Digest: alias WrapperDigest!CRC32 CRC32Digest; alias WrapperDigest!MD5 MD5Digest; alias WrapperDigest!SHA1 SHA1Digest; with the support code in std.hash.hash 1LOC is enough to implement the OOP interface if a struct interface is available, so I don't think maintaining two APIs is a problem. A bigger problem is that the real implementation must be the struct interface, so you can't use polymorphism there. I hope alias this is enough.
Aug 08 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 08 Aug 2012 11:27:49 +0200
schrieb Piotr Szturmaj <bncrbme jadamspam.pl>:

 
 Yes, there should be bcrypt, scrypt and PBKDF2.

Wow, I didn't know about scrypt. Seems to be pretty cool.
Aug 08 2012
prev sibling next sibling parent "Chris Cain" <clcain uncg.edu> writes:
On Wednesday, 8 August 2012 at 14:14:29 UTC, Regan Heath wrote:
 I don't think there is any reason to separate them.  People 
 should know which digest algorithm they want, they're not going 
 to pick one at random and assume it's "super secure!"(tm).  And 
 if they do, well tough, they deserve what they get.

In this case, I'm not suggesting keep them separate to not confuse those who don't know better. They're simply disparate in actual use. What do you use a traditional hash function for? Usually to turn a large multibyte stream into some finite size so that you can use a lookup table or maybe to decrease wasted time in comparisons. What do you use a cryptographic hash function for? Almost always it's to verify the integrity of some data (usually files) or protect the original form from prying eyes (passwords ... though, there are better approaches for that now). You'd _never_ use a cryptographic hash function in place of a traditional hash function and vice versa because they designed for completely different purposes. At a cursory glance, they bare only one similarity and that's the fact that they turn a big chunk of data into a smaller form that has a fixed size. On Wednesday, 8 August 2012 at 14:16:40 UTC, travert phare.normalesup.org (Christophe Travert) wrote:
 function to pass the isDigest predicate. But they have many
 similarities, which explains they are all called hash 
 functions. There
 is enough room in a package to put several related concepts!

Crytographic hash functions are also known as "one-way compression functions." They also have similarities to file compression algorithms. After all, both of them turn large files into smaller data. However, the actual use of them is completely different and you wouldn't use one in place of the other. I wouldn't put the Burrows-Wheeler transform in the same package. It's just my opinion of course, but I just feel it wouldn't be right to intermingle normal hash functions and cryptographic hash functions in the same package. If we had to make a compromise and group them with something else, I'd really like to see cryptographic hash functions put in the same place we'd put other cryptography (such as AES) ... in a std.crypto package. But std.digest is good if they can exist in their own package. It also occurs to me that a lot of people are confounding cryptographic hash functions and normal hash functions enough that they think that a normal hash function has a "digest" ... I'm 99% sure that's exclusive to the cryptographic hash functions (at least, I've never heard of a normal hash function producing a digest).
Aug 08 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 8 Aug 2012 17:50:33 +0200
schrieb Johannes Pfau <nospam example.com>:

 However, I do agree digest!Hash, md5Of, sha1Of should have an
 additional overload which takes a InputRange. It would be implemented
 with copy and be a nice convenience function.

I implemented the function, it's actually quite simple: ---- digestType!Hash digestRange(Hash, Range)(Range data) if(isDigest!Hash && isInputRange!Range && __traits(compiles, digest!Hash(ElementType!(Range).init))) { Hash hash; hash.start(); copy(data, hash); return hash.finish(); } ---- but I don't know how make it an overload. See thread "overloading a function taking a void[][]" in D.learn for details.
Aug 08 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 8 Aug 2012 14:16:40 +0000 (UTC)
schrieb travert phare.normalesup.org (Christophe Travert):

 If it where me, I would have the presently reviewed module
 std.hash.hash be called std.hash.digest, and leave room here for
 regular hash functions. In any case, I think regular hash HAVE to be
 in a std.hash module or package, because people looking for a regular
 hash function will look here first.
 
 

std.hash.digest doesn't sound too bad. We could have std.hash.func (or a better named module ;-) for general hash functions later.
Aug 08 2012
prev sibling next sibling parent "Regan Heath" <regan netmail.co.nz> writes:
On Wed, 08 Aug 2012 18:33:01 +0100, Christophe Travert  =

<travert phare.normalesup.org> wrote:

 "Chris Cain" , dans le message (digitalmars.D:174477), a =E9crit :

 I think you misunderstood me (and it's probably my fault, since I don'=

 know much of hash functions), I was wanted to compare two kind of
 concepts:

 1/ message digest functions, like md5, or sha1, used on large files,
 which is what is covered by this std.hash proposal.
 2/ small hash function. Like what are use in an associative array, and=

 are called toHash when used a member function.

 And I didn't thought of:
 3/ cryptographic hash functions

 My opinion was that in a module or package called hash, I expect tools=

 concerning #2. But #1 and #2 can coexist in the same package. The
 proposed std.hash.hash defines a digest concept for #1. That's why I
 would rather have it named std.hash.digest, leaving room in the hash
 package to other concepts, like small hash functions that can be used =

 associative arrays (#2).

 I don't know the difference between #1 and #3, so I can't tell if they=

 should share a common package. In anycase, I think putting #3 be in a
 crypto package makes sense.

 Having 3 different packages seems too much to me. #1 is too
 restricted to be a whole package IMHO, and should be along #2 or #3.

Here is a perfect example of why we need to avoid using "hash", it has t= oo = many meanings to different people. I suggest: std.digest <- cryptographic "hash" algorithms std.crc <- crc "hash" algorithms std.uuid <- identity "hash" algorithms This is assuming we cannot have more levels of depth in the package/modu= le = tree, otherwise you could group them all under the package "hash": std.hash.digest std.hash.crc std.hash.uuid Some people are going to argue it should be: std.crypto.digest or.. std.crypto.hash But that leads us to something like: std.crypto.hash std.crc.hash std.uuid.hash And that seems back-to-front to me, and more importantly would = assume/suggest/require we have more packages to put in std.crc and = std.uuid, which I suspect we wont. R -- = Using Opera's revolutionary email client: http://www.opera.com/mail/
Aug 08 2012
prev sibling next sibling parent "Chris Cain" <clcain uncg.edu> writes:
On Wednesday, 8 August 2012 at 17:33:01 UTC, 
travert phare.normalesup.org (Christophe Travert) wrote:
 I think you misunderstood me (and it's probably my fault, since 
 I don't
 know much of hash functions), I was wanted to compare two kind 
 of
 concepts:

 1/ message digest functions, like md5, or sha1, used on large 
 files,
 which is what is covered by this std.hash proposal.
 2/ small hash function. Like what are use in an associative 
 array, and
 are called toHash when used a member function.

 And I didn't thought of:
 3/ cryptographic hash functions

Actually, maybe I'm the one not doing a good job of explaining. 1 and 3 are the same things (what you're calling "message digest" functions are cryptographic hash functions). I'm saying that even though similar in name, cryptographic hash functions really can't (IMO, I suppose I should make clear) be put in the same place as normal hash functions because they barely have anything in common. You can't use on in the place of another nor are they really used in similar manners.
 My opinion was that in a module or package called hash, I 
 expect tools
 concerning #2.

I agree. I'd think similarly (I'd assume std.hash has something to do with hash tables or hash functions used for hash tables). If I were looking to use a cryptographic hash function like SHA1 or (eh) MD5, I'd look for std.crypto first, and probably pick std.digest if I saw that. As a last resort I'd look in std.hash and vomit profusely after seeing it grouped with the "times 33" hash.
Aug 08 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 08 Aug 2012 14:36:32 -0400
schrieb "Jonathan M Davis" <jmdavisProg gmx.com>:

 What's wrong with the *endianToNative and nativeTo*Endian functions?
 They work just fine as far as I know. swapEndian works too if you
 want it to use that, but there should be nothing wrong with the
 endian-specific ones.
 
 - Jonathan M Davis

in CTFE? http://dpaste.dzfl.pl/0503b8af According to Don reinterpret casts (even if done through unions) won't be supported in CTFE. So you can't convert from uint-->ubyte[4]
Aug 08 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 08 Aug 2012 11:47:38 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 8/8/2012 9:47 AM, Johannes Pfau wrote:
 Am Wed, 8 Aug 2012 17:50:33 +0200
 schrieb Johannes Pfau <nospam example.com>:

 However, I do agree digest!Hash, md5Of, sha1Of should have an
 additional overload which takes a InputRange. It would be
 implemented with copy and be a nice convenience function.

I implemented the function, it's actually quite simple: ---- digestType!Hash digestRange(Hash, Range)(Range data) if(isDigest!Hash && isInputRange!Range && __traits(compiles, digest!Hash(ElementType!(Range).init))) { Hash hash; hash.start(); copy(data, hash); return hash.finish(); } ----

The finish() should be implicit when the range ends.
 but I don't know how make it an overload. See thread "overloading a
 function taking a void[][]" in D.learn for details.

I don't know what you mean, it takes a range, not a void[][] as input.

So the post in D.learn for a detailed description. Yes the code I posted takes a range, but digest (as it is now) takes void[][] to accept all kind of types _without_ template bloat. The difficulty is to combine those two overloads without causing unnecessary template bloat.
Aug 08 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 08 Aug 2012 11:40:10 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 
 Take a look at the reduce function in 
 http://dlang.org/phobos/std_algorithm.html#reduce
 
 It has provision for an initial state that can be the current running
 total.
 

This can only work if the final state is valid as an initial state. This is just not true for some hash algorithms. --- auto sum = reduce!("a + b")(0, range); auto sum2 = reduce!("a + b")(sum, range2); --- --- MD5 hash; hash.start(); auto sum = copy(range, hash); auto sum2 = copy(range2, sum); auto result = hash.finish(); --- No where's the difference, except that for hashes the context ('hash') has to be setup and finished manually?
Aug 08 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 08, 2012 20:55:19 Johannes Pfau wrote:
 Am Wed, 08 Aug 2012 14:36:32 -0400
 
 schrieb "Jonathan M Davis" <jmdavisProg gmx.com>:
 What's wrong with the *endianToNative and nativeTo*Endian functions?
 They work just fine as far as I know. swapEndian works too if you
 want it to use that, but there should be nothing wrong with the
 endian-specific ones.
 
 - Jonathan M Davis

in CTFE? http://dpaste.dzfl.pl/0503b8af According to Don reinterpret casts (even if done through unions) won't be supported in CTFE. So you can't convert from uint-->ubyte[4]

No. It wouldn't work in CTFE, because it uses a union. But what it's trying to doesn't really make sense in CTFE in most cases anyway, because the endianness of the target machine may not be the same endianness as the machine doing the compilation. Any computations which cared about endianness must be in a state where they don't care about endianness anymore once CTFE has completed, or you're going to have bugs. Though if the issue is std.hash being CTFEable, I don't know why anyone would even care. It's cool if it's CTFEable, but the sorts of things that you hash pretty much always require user or file input of some kind (which you can't do with CTFE). You'd have to have a use case where something within the program itself needed to be hashed for some reason for it to matter whether std.hash was CTFEable or not, and it wouldn't surprise me at all if it were typical in hash functions to do stuff that isn't CTFEable anyway. - Jonathan M Davis
Aug 08 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 08, 2012 18:47:04 Andrei Alexandrescu wrote:
 On 8/8/12 4:34 PM, Jonathan M Davis wrote:
 I say just keep at simple and leave it at std.hash. It's plenty clear
 IMHO.

Not clear to quite a few of us. IMHO it just makes us seem (to the larger community) clever about a petty point. There's plenty of other better names, and std.digest is very adequate.

I prefer std.hash to std.digest, but I don't necessarily care all that much. What I was objecting to in particular was the suggestion to split it into std.hash.digest and std.hash.func. I think that all of the hashing algorithms should just go in the one package. Adding another layer is an unnecessary complication IMHO. - Jonathan M Davis
Aug 08 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 08 Aug 2012 16:44:03 -0400
schrieb "Jonathan M Davis" <jmdavisProg gmx.com>:

 
 in CTFE?
 http://dpaste.dzfl.pl/0503b8af
 
 According to Don reinterpret casts (even if done through unions)
 won't be supported in CTFE. So you can't convert from
 uint-->ubyte[4]

No. It wouldn't work in CTFE, because it uses a union.But what it's trying to doesn't really make sense in CTFE in most cases anyway, because the endianness of the target machine may not be the same endianness as the machine doing the compilation. Any computations which cared about endianness must be in a state where they don't care about endianness anymore once CTFE has completed, or you're going to have bugs.

I completely agree, but this is true for hashes. Once the final hash value is produced it doesn't depend on the endianness.
 
 Though if the issue is std.hash being CTFEable, I don't know why
 anyone would even care. It's cool if it's CTFEable, but the sorts of
 things that you hash pretty much always require user or file input of
 some kind (which you can't do with CTFE).

Yeah it's not that useful, that's why I didn't care about CTFE support right now. The only usecase I can think of is to hash a string in CTFE, for example UUID could use it to support name based UUID literals.
 You'd have to have a use
 case where something within the program itself needed to be hashed
 for some reason for it to matter whether std.hash was CTFEable or
 not, and it wouldn't surprise me at all if it were typical in hash
 functions to do stuff that isn't CTFEable anyway.
 
 - Jonathan M Davis

Aug 09 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 08 Aug 2012 12:30:31 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 8/8/2012 12:05 PM, Johannes Pfau wrote:
 So the post in D.learn for a detailed description. Yes the code I
 posted takes a range, but digest (as it is now) takes void[][] to
 accept all kind of types _without_ template bloat. The difficulty
 is to combine those two overloads without causing unnecessary
 template bloat.

Have the templated version with overloads simply call the single version (with a different name) with void[][].

Well that's possible, but I don't like the template bloat it causes. AFAIK a function taking a void[][] is just one instance, with that redirecting approach we'll have one instance per array type. This seems unnecessary (and maybe the compiler can merge such template instances in the future), but I can't seem to find a way to avoid it, so we'll probably have to live with that. http://dpaste.dzfl.pl/f86717f7 I guess a second function digestRange is not acceptable?
Aug 09 2012
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Tuesday, 7 August 2012 at 17:39:50 UTC, Dmitry Olshansky wrote:
 std.hash.hash is a new module for Phobos defining an uniform 
 interface for hashes and checksums. It also provides some 
 useful helper functions to deal with this new API.

Is it too late to ask to include MurmurHash 2 and/or 3? It's public domain, and great for things like hash tables. You can steal some code from here: https://github.com/CyberShadow/ae/blob/master/utils/digest.d https://github.com/CyberShadow/ae/blob/master/utils/digest_murmurhash3.d
Aug 09 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 08 Aug 2012 12:27:39 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 8/8/2012 12:08 PM, Johannes Pfau wrote:
 No where's the difference, except that for hashes the context
 ('hash') has to be setup and finished manually?

The idea is to have hash act like a component - not with special added code the user has to write.

with a call to a fictional 'hashReduce'. Why is it so important that reduce + hash API's match 100% even if the API doesn't fit hashes?
 
 In this case, it needs to work like a reduce algorithm, because it is
 a reduce algorithm. Need to find a way to make this work.
 

We could do some ugly, performance killing hacks to make it possible, but I just don't see why this is necessary. ---- struct InterHash { MD5 ctx; ubyte[16] finished; alias finished this; } InterHash hashReduce(Range)(Range data) { InterHash hash; hash.ctx.start(); return hashReduce(hash, data); } InterHash hashReduce(Range)(InterHash hash, Range data) { copy(data, hash); auto ctxCopy = hash.ctx; hash.finished = ctxCopy.finish(); return hash; } auto a = hashReduce([1,2,3]); auto b = hashReduce(a, [3,4]); ---- However, a and b are still not really valid hash values. I just don't see why we should force an interface onto hashes which just doesn't fit.
Aug 09 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 08 Aug 2012 12:31:29 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 8/8/2012 12:14 PM, Martin Nowak wrote:
 That hardly works for event based programming without using
 coroutines. It's the classical inversion-of-control dilemma of
 event based programming that forces you to save/restore your state
 with every event.

See the discussion on using reduce().

I just don't understand it. Let's take the example by Martin Nowak and port it to reduce: (The code added as comments is the same code for hashes, working with the current API) int state; //Hash state; void onData(void[] data) { state = reduce(state, data); //copy(data, state); //state = copy(data, state); //also valid, but not necessary //state.put(data); //simple way, doesn't work for ranges } void main() { state = 0; //state.start(); auto stream = new EventTcpStream("localhost", 80); stream.onData = &onData; //auto result = hash.finish(); } There are only 2 differences: 1: the order of the arguments passed to copy and reduce is swapped. This kinda makes sense (if copy is interpreted as copyTo). Solution: Provide a method copyInto with swapped arguments if consistency is really so important. 2: We need an additional call to finish. I can't say it often enough, I don't see a sane way to avoid it. Hashes work on blocks, if you didn't pass enough data finish will have to fill the rest of the block with zeros before you can get the hash value. This operation can't be undone. To get a valid result with every call to copy, you'd have to always call finish. This is * inefficient, you calculate intermediate values you don't need at all * you have to copy the hashes state, as you can't continue hashing after finish has been called and both, the state and the result would have to fit into the one value (called seed for reduce). But then it's still not 100% consistent, as reduce will return a single value, not some struct including internal state.
Aug 09 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Thu, 09 Aug 2012 02:13:10 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 8/9/2012 2:05 AM, Johannes Pfau wrote:
 http://dpaste.dzfl.pl/f86717f7

The Range argument - is it an InputRange, an OutputRange? While it's just a type name, the name should reflect what kind of range it is from the menagerie of ranges in std.range.

It's an InputRange (of bytes) or an InputRange of some byte buffer (ElementType == ubyte[] || ElementType == ubyte[num]). We get the second version for free, so I just included it ;-) The documentation would have to make that clear of course. I could also change the name, it's just a proof of concept right now.
Aug 09 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Thu, 09 Aug 2012 11:32:34 +0200
schrieb "Vladimir Panteleev" <vladimir thecybershadow.net>:

 On Tuesday, 7 August 2012 at 17:39:50 UTC, Dmitry Olshansky wrote:
 std.hash.hash is a new module for Phobos defining an uniform 
 interface for hashes and checksums. It also provides some 
 useful helper functions to deal with this new API.

Is it too late to ask to include MurmurHash 2 and/or 3? It's public domain, and great for things like hash tables. You can steal some code from here: https://github.com/CyberShadow/ae/blob/master/utils/digest.d https://github.com/CyberShadow/ae/blob/master/utils/digest_murmurhash3.d

To be honest I didn't even know that MurmurHash can be used incrementally. I could port that code soon, but I think it's best to do it after the review. After we have formalized a common API adding new hashes probably won't require a full review. It should be possible to do that as a pull request.
Aug 09 2012
prev sibling next sibling parent "Regan Heath" <regan netmail.co.nz> writes:
On Thu, 09 Aug 2012 10:58:10 +0100, Johannes Pfau <nospam example.com>  
wrote:

 Am Thu, 09 Aug 2012 11:32:34 +0200
 schrieb "Vladimir Panteleev" <vladimir thecybershadow.net>:

 On Tuesday, 7 August 2012 at 17:39:50 UTC, Dmitry Olshansky wrote:
 std.hash.hash is a new module for Phobos defining an uniform
 interface for hashes and checksums. It also provides some
 useful helper functions to deal with this new API.

Is it too late to ask to include MurmurHash 2 and/or 3? It's public domain, and great for things like hash tables. You can steal some code from here: https://github.com/CyberShadow/ae/blob/master/utils/digest.d https://github.com/CyberShadow/ae/blob/master/utils/digest_murmurhash3.d

To be honest I didn't even know that MurmurHash can be used incrementally. I could port that code soon, but I think it's best to do it after the review. After we have formalized a common API adding new hashes probably won't require a full review. It should be possible to do that as a pull request.

Once the API is formalised I can contribute the hashes I have also :) R -- Using Opera's revolutionary email client: http://www.opera.com/mail/
Aug 09 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Thu, 09 Aug 2012 08:48:37 -0400
schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:

 On 8/9/12 5:05 AM, Johannes Pfau wrote:
 Well that's possible, but I don't like the template bloat it causes.

What have you measured, and what is your dislike based upon?

What annoys me is that as long the function only supported arrays, it didn't need templates _at all_. So template bloat for arrays = 0. But adding range support means the version dealing with arrays now has to be a template as well(which is probably a bug, can't overload template and non template function) and will produce extra code for every array type. I just think adding range support shouldn't cause the array code to change in any way. But that's why I said I don't like it, not that it's a show stopper. The overhead is probably neglectable, but in theory it shouldn't be there at all.
 
 The library function must be generic. Then users worried about
 bloating may use it with a limited number of types.
 
 A digest function only dealing with void[void[]] is unacceptable.

Sure I agree that range support is necessary, I just forgot to implement it initially. I'm not against range support / ranges in general / the template instances needed for ranges. I just dislike that it affects the array implementation in this specific case.
Aug 09 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Thu, 09 Aug 2012 11:16:36 +0100
schrieb "Regan Heath" <regan netmail.co.nz>:

 
 Once the API is formalised I can contribute the hashes I have also :)
 

complete set of digests soon.
Aug 09 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
I implemented some of the suggestions, here's the list of changes:

Changelog:
* Add a new overload to the 'digest' function which accepts an
  InputRange
* Add a new convenience function 'hexDigest' which works just like
  'digest', but returns a string (also works with InputRanges)

  An open question is whether md5StringOf/md5HexOf aliases should be
  added (similar to md5Of)?
* Add a new convenience function 'startDigest' which returns an
  initialized digest
* New example for file hashing in idiomatic D
* Documented that Digests are always OutputRanges
* Added new examples using std.algorithm.copy & OutputRange interface
* Small optimization in toHexString & hexDigest: do not allocate if
  possible

TODO:
* move the package to std.digest (unless there are objections):
    std.hash.hash --> std.digest.digest
    std.hash.md   --> std.digest.md
    std.hash.sha  --> std.digest.sha
    std.hash.crc  --> std.digest.crc

* make sure the docs are consistent regarding names (digest vs. hash)


Code:
https://github.com/jpf91/phobos/tree/newHash/std/hash
https://github.com/jpf91/phobos/compare/master...newHash

Docs:
http://dl.dropbox.com/u/24218791/d/phobos/std_hash_hash.html
http://dl.dropbox.com/u/24218791/d/phobos/std_hash_md.html
http://dl.dropbox.com/u/24218791/d/phobos/std_hash_sha.html
http://dl.dropbox.com/u/24218791/d/phobos/std_hash_crc.html
Aug 10 2012
prev sibling next sibling parent "RivenTheMage" <riven-mage id.ru> writes:
On Wednesday, 8 August 2012 at 16:47:35 UTC, Johannes Pfau wrote:

 std.hash.digest doesn't sound too bad. We could have 
 std.hash.func (or
 a better named module ;-) for general hash functions later.

Three basic types of hash functions are: 1) Hash - for fast searching and indexing in data structures 2) Checksum - detects the accidental errors in files, archives, streams 3) Message digest code - prevents the intentional modification of data They should not be mixed IMHO. 1) should go into std.container or (maybe) std.algorithm 2) std.checksum 3) std.crypto.mdc or std.crypto.digest
Aug 15 2012
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Wednesday, 15 August 2012 at 08:49:26 UTC, RivenTheMage wrote:
 Three basic types of hash functions are:

 1) Hash - for fast searching and indexing in data structures
 2) Checksum - detects the accidental errors in files, archives, 
 streams
 3) Message digest code - prevents the intentional modification 
 of data

 They should not be mixed IMHO.

Why? 1) might have a different interface than the others, but 2) and 3) only differ in their cryptological properties, the interface will likely be just the same – or what are you thinking about? David
Aug 15 2012
prev sibling next sibling parent "RivenTheMage" <riven-mage id.ru> writes:
On Wednesday, 15 August 2012 at 08:55:30 UTC, David Nadlinger
wrote:

 Why? 1) might have a different interface than the others, but 
 2) and 3) only differ in their cryptological properties, the 
 interface will likely be just the same – or what are you 
 thinking about?

 David

The "only" difference between 2) and 3) is a big difference. CRC32, Adler, etc. are NOT a cryptographic hash fuctions. Their purpose is to detect accidental errors caused by malfunction of hardware or software, nothing more. For me, it's weird and confusing to mix checksums and MDCs. It's about organizing the standard library for the better usability. That is the whole point of modules. After all, you can place all the standard library in one module, why not? :-)
Aug 15 2012
prev sibling next sibling parent "RivenTheMage" <riven-mage id.ru> writes:
Another example is a systematic error-correcting codes. The
"only" difference between them and checksums is the ability to
correct errors, not just detect them. CRC or MD5 can be viewed as
systematic code with zero error-correcting ability.

Should we mix Reed-Solomon codes and MD5 in one module? I don't 
think so.
Aug 15 2012
prev sibling next sibling parent =?ISO-8859-1?Q?Jos=E9_Armando_Garc=EDa_Sancio?= <jsancio gmail.com> writes:
On Wed, Aug 15, 2012 at 2:40 AM, RivenTheMage <riven-mage id.ru> wrote:
 Another example is a systematic error-correcting codes. The
 "only" difference between them and checksums is the ability to
 correct errors, not just detect them. CRC or MD5 can be viewed as
 systematic code with zero error-correcting ability.

 Should we mix Reed-Solomon codes and MD5 in one module? I don't think so.

Some people's point is that MD5 was consider a cryptographic digest function 16 years ago. It is not consider cryptographically secure today. So why make any design assumption today on how the landscape will look tomorrow? Specially on a field that is always changing. Why not lumped them all together and explain the current situation and recommendation in the comments. Looks at Python's passlib module for example. They enumerate every password encoding scheme under the sun (except for scrypt :() and give a recommendation on the appropriate algorithm to use in the current computing landscape. http://packages.python.org/passlib/lib/passlib.hash.html#module-passlib.hash Thanks, -Jose
Aug 15 2012
prev sibling next sibling parent "ReneSac" <reneduani yahoo.com.br> writes:
On Wednesday, 15 August 2012 at 14:36:00 UTC, José Armando 
García Sancio wrote:
 Some people's point is that MD5 was consider a cryptographic 
 digest
 function 16 years ago. It is not consider cryptographically 
 secure
 today. So why make any design assumption today on how the 
 landscape
 will look tomorrow? Specially on a field that is always 
 changing. Why
 not lumped them all together and explain the current situation 
 and
 recommendation in the comments.

 Looks at Python's passlib module for example. They enumerate 
 every
 password encoding scheme under the sun (except for scrypt :() 
 and give
 a recommendation on the appropriate algorithm to use in the 
 current
 computing landscape.
 http://packages.python.org/passlib/lib/passlib.hash.html#module-passlib.hash

 Thanks,
 -Jose

I agree that MD5 isn't cryptographically secure anymore, but it was designed as a cryptographic hash algorithm, and it shows. It's statistical and performance proprieties are completely different from CRCs, and no matter how broken, it still has a little of cryptographic strength (no practical preimage attack was found till this date, for example). Note that in the Python passlib, there is no mention to CRC, FNV, ROT13, etc. Their place is different.
Aug 15 2012
prev sibling next sibling parent =?ISO-8859-1?Q?Jos=E9_Armando_Garc=EDa_Sancio?= <jsancio gmail.com> writes:
On Wed, Aug 15, 2012 at 8:11 AM, ReneSac <reneduani yahoo.com.br> wrote:
 Note that in the Python passlib, there is no mention to CRC, FNV, ROT13,
 etc. Their place is different.

Thats because it is a "password module" and nobody or a small percentage of the population uses CRC for password digest. Note that the Python passlib module also has archaic plaintext encodings mainly for interacting with legacy systems. The basic point is that std.digest/std.hash (whatever people decide) should probably just have generic digesting algorithm. The user can decided which one to use given their requirements. Also, it would be beneficial if the module also includes a section where it recommends digest based on the current landscape of computing. High-level documentation and suggestions are easy to change; APIs are not. Thanks, -Jose
Aug 15 2012
prev sibling next sibling parent "RivenTheMage" <riven-mage id.ru> writes:
On Wednesday, 15 August 2012 at 19:38:34 UTC, José Armando
García Sancio wrote:

 Thats because it is a "password module" and nobody or a small
 percentage of the population uses CRC for password digest.

In turn, that's because CRC is not not a crytographic hash and not suited for password hashing :)
 The basic point is that std.digest/std.hash (whatever people 
 decide) should probably just have generic digesting algorithm.

Generic digesting algorithm should probably go into std.algorithm. It could be used like that: ------------ import std.algorithm; import std.checksum; import std.crypto.mdc; ushort num = 1234; auto hash1 = hash!("(a >>> 20) ^ (a >>> 12) ^ (a >>> 7) ^ (a >>> 4) ^ a")(str); // indexing hash string str = "abcd"; auto hash3 = hash!(CRC32)(str); // checksum auto hash2 = hash!(MD5)(str); // crytographic hash ------------ CRC32 and MD5 are ranges and/or classes, derived from HashAlgorithm interface.
Aug 15 2012
prev sibling next sibling parent "RivenTheMage" <riven-mage id.ru> writes:
On Thursday, 16 August 2012 at 03:02:59 UTC, RivenTheMage wrote:

 ushort num = 1234;
 auto hash1 = hash!("(a >>> 20) ^ (a >>> 12) ^ (a >>> 7) ^ (a >>>
 4) ^ a")(str); // indexing hash

I forgot that this case is already covered by reduce!(...)
Aug 15 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Changelog:
* moved the package to std.digest:
    std.hash.hash --> std.digest.digest
    std.hash.md   --> std.digest.md
    std.hash.sha  --> std.digest.sha
    std.hash.crc  --> std.digest.crc

* make sure the docs are consistent regarding names (digest vs. hash)


Code: (location changed!)
https://github.com/jpf91/phobos/tree/newHash/std/digest
https://github.com/jpf91/phobos/compare/master...newHash

Docs: (location changed!)
http://dl.dropbox.com/u/24218791/d/phobos/std_digest_digest.html
http://dl.dropbox.com/u/24218791/d/phobos/std_digest_md.html
http://dl.dropbox.com/u/24218791/d/phobos/std_digest_sha.html
http://dl.dropbox.com/u/24218791/d/phobos/std_digest_crc.html
Aug 20 2012
prev sibling next sibling parent "Jesse Phillips" <jessekphillips+D gmail.com> writes:
All this discussion on the use of auto in the docs made me notice 
something else about the docs I missed.

I like how ranges are documented and think digest could do the 
same. Instead of an ExampleDigest, just write the details under 
isDigest.

I don't see a need for template the constraint example (D idiom).

This would require changing examples which use ExampleDigest, but 
maybe that should happen anyway since it doesn't exist.

I don't see a reason to change my vote because of this, its all 
documentation.
Aug 28 2012
prev sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 29 Aug 2012 04:57:32 +0200
schrieb "Jesse Phillips" <jessekphillips+D gmail.com>:

 All this discussion on the use of auto in the docs made me notice 
 something else about the docs I missed.
 
 I like how ranges are documented and think digest could do the 
 same. Instead of an ExampleDigest, just write the details under 
 isDigest.

I had a look at how std.range documents the range interfaces, but the std.digest API forces more details on the implementation ( trusted, exact parameter types for put, return type of finish,...) so I think simply writing a text paragraph could get confusing. But if someone posts a pull request which replaces the ExampleDigest with something else I'm all for it.
 
 I don't see a need for template the constraint example (D idiom).
 
 This would require changing examples which use ExampleDigest, but 
 maybe that should happen anyway since it doesn't exist.

Yes, I'll change the examples (this also makes them runnable in theory. Although I have not found any documentation about making examples runnable on dlang.org)
 
 I don't see a reason to change my vote because of this, its all 
 documentation.

Sep 04 2012