digitalmars.D.learn - "Rolling Hash computation" or "Content Defined Chunking"

notna (13/13) May 01 2017 Hi Dlander's.

Johannes Pfau (26/45) May 06 2017 Interesting concept. I'm not aware of any D implementation but it

notna (5/51) May 09 2017 Thanks for the feedback, Johannes.

9il (7/10) May 09 2017 Hello notha,

notna <notna.remove.this ist-einmalig.de> writes:

Hi Dlander's.

Found some interesting reads ([1] [2] [3]) about the $SUBJECT and 
wonder if there is anything available in the Dland?!

If yes, pls. share.
If not, how could it be done (D'ish)

[1] - 
https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/
     - 
https://github.com/moinakg/pcompress/blob/master/rabin/rabin_dedup.c

[2] - 
https://restic.github.io/blog/2015-09-12/restic-foundation1-cdc

[3] - http://www.infoarena.ro/blog/rolling-hash

Thanks & regards

May 01 2017

Johannes Pfau <nospam example.com> writes:

Am Mon, 01 May 2017 21:01:43 +0000
schrieb notna <notna.remove.this ist-einmalig.de>:

 Hi Dlander's.
 
 Found some interesting reads ([1] [2] [3]) about the $SUBJECT and 
 wonder if there is anything available in the Dland?!
 
 If yes, pls. share.
 If not, how could it be done (D'ish)
 
 [1] - 
 https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/
      - 
 https://github.com/moinakg/pcompress/blob/master/rabin/rabin_dedup.c
 
 [2] - 
 https://restic.github.io/blog/2015-09-12/restic-foundation1-cdc
 
 [3] - http://www.infoarena.ro/blog/rolling-hash
 
 Thanks & regards

Interesting concept. I'm not aware of any D implementation but it
shouldn't be difficult to implement this in D:
https://en.wikipedia.org/wiki/Rolling_hash#Cyclic_polynomial

There's a BSD licensed haskell implementation, so a BSD licensed port
would be very easy to implement:
https://hackage.haskell.org/package/optimal-blocks-0.1.0
https://hackage.haskell.org/package/optimal-blocks-0.1.0/docs/src/Algorithm-OptimalBlocks-BuzzHash.html

To make an implementation D'ish it could integrate with either
std.digest or process input ranges. If you want to use it exclusively
for chunking your code can be more efficient (process InputRange until
a boundary condition is met). When using input ranges, prefer some kind
of buffered approach, Range!ubyte[] instead of Range!ubyte for better
performance.

If you really want the rolling hash value for each byte in a sequence
this will be less efficient as you'll have to enter data byte-by-byte.
In this case it's extremely important for performance that your
function can be inlined, so use templates:

ubyte[] data;
foreach(b; data)
{
    // This needs to be inlined for performance reasons
    rollinghash.put(b);
}

-- Johannes

May 06 2017

notna <notna.remove.this ist-einmalig.de> writes:

On Saturday, 6 May 2017 at 07:21:51 UTC, Johannes Pfau wrote:
 Am Mon, 01 May 2017 21:01:43 +0000
 schrieb notna <notna.remove.this ist-einmalig.de>:

 Hi Dlander's.
 
 Found some interesting reads ([1] [2] [3]) about the $SUBJECT 
 and wonder if there is anything available in the Dland?!
 
 If yes, pls. share.
 If not, how could it be done (D'ish)
 
 [1] -
 https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/
      -
 https://github.com/moinakg/pcompress/blob/master/rabin/rabin_dedup.c
 
 [2] - 
 https://restic.github.io/blog/2015-09-12/restic-foundation1-cdc
 
 [3] - http://www.infoarena.ro/blog/rolling-hash
 
 Thanks & regards

 Interesting concept. I'm not aware of any D implementation but 
 it shouldn't be difficult to implement this in D: 
 https://en.wikipedia.org/wiki/Rolling_hash#Cyclic_polynomial

 There's a BSD licensed haskell implementation, so a BSD 
 licensed port
 would be very easy to implement:
 https://hackage.haskell.org/package/optimal-blocks-0.1.0
 https://hackage.haskell.org/package/optimal-blocks-0.1.0/docs/src/Algorithm-OptimalBlocks-BuzzHash.html

 To make an implementation D'ish it could integrate with either 
 std.digest or process input ranges. If you want to use it 
 exclusively for chunking your code can be more efficient 
 (process InputRange until a boundary condition is met). When 
 using input ranges, prefer some kind of buffered approach, 
 Range!ubyte[] instead of Range!ubyte for better performance.

 If you really want the rolling hash value for each byte in a 
 sequence this will be less efficient as you'll have to enter 
 data byte-by-byte. In this case it's extremely important for 
 performance that your function can be inlined, so use templates:

 ubyte[] data;
 foreach(b; data)
 {
     // This needs to be inlined for performance reasons
     rollinghash.put(b);
 }

 -- Johannes

Thanks for the feedback, Johannes.

I hoped there may already be something in Mir or Weka.io or 
somewhere else... Will read the Golang, C and C++ source and see 
if my Dlang is good enough for ranges and the like magic...

May 09 2017

9il <ilyayaroshenko gmail.com> writes:

On Tuesday, 9 May 2017 at 18:17:45 UTC, notna wrote:
 I hoped there may already be something in Mir or Weka.io or 
 somewhere else... Will read the Golang, C and C++ source and 
 see if my Dlang is good enough for ranges and the like magic...

Hello notha,

You may want to open a PR to mir-algorithm.  I will help to make 
code idiomatic.

https://github.com/libmir/mir-algorithm

Thanks,
Ilya

May 09 2017

D Programming

C/C++ Programming

Other

digitalmars.D.learn - "Rolling Hash computation" or "Content Defined Chunking"