digitalmars.D.announce - Chunker - Content-Defined Chunking based on Rabin Checksums

Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
Hi,

This is a D port of a Go package implementing Content-Defined 
Chunking:

https://github.com/CyberShadow/chunker

The package contains the following modules:

- chunker.polynomials - implements Pol, a type which represents a 
polynomial from F_2[X]. I'm not quite sure what those are, but they 
seem to be very useful.

- chunker.rabin - implements RabinHash, which calculates a 
rolling Rabin Fingerprint.

- chunker - implements Chunker, an adapter range which accepts 
chunks of bytes (such as from File.byChunk) and emits 
variable-size content-defined chunks, which are split when the 
local Rabin Fingerprint reaches a certain value.
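
For readers new to the technique, the three pieces above can be 
illustrated with a minimal Python sketch. This is not the API of the 
chunker package or of the Go original: the names (pol_mod, roll, 
chunks), the 16-byte window, the size limits and the split mask are 
all assumptions chosen to keep the example small. The polynomial is 
X^8 + X^4 + X^3 + X + 1, a well-known irreducible polynomial over F_2; 
a real chunker would use one of much higher degree.

```python
from collections import deque

# A polynomial over F_2 is stored as an int: bit i is the coefficient of X^i.
# Addition and subtraction are both XOR; multiplying by X^n is a left shift.
POL = 0x11B   # X^8 + X^4 + X^3 + X + 1 (illustrative choice, see lead-in)
WINDOW = 16   # bytes covered by the rolling fingerprint (assumed value)


def pol_mod(x: int, p: int) -> int:
    """Remainder of x divided by p, both polynomials over F_2."""
    deg_p = p.bit_length() - 1
    while x.bit_length() - 1 >= deg_p:
        # Cancel the leading term of x with a shifted copy of p.
        x ^= p << (x.bit_length() - 1 - deg_p)
    return x


# OUT[b] is the contribution of byte b once it is the oldest byte in the
# window, i.e. b * X^(8 * (WINDOW - 1)) mod POL.
OUT = [pol_mod(b << (8 * (WINDOW - 1)), POL) for b in range(256)]


def roll(fp: int, b_out: int, b_in: int) -> int:
    """Slide the window by one byte: drop b_out, append b_in."""
    return pol_mod(((fp ^ OUT[b_out]) << 8) | b_in, POL)


def chunks(data: bytes, min_size: int = 32, max_size: int = 256,
           mask: int = 0x1F):
    """Yield variable-size chunks of data.

    A chunk ends where the low bits of the fingerprint of the last WINDOW
    bytes are all zero (once min_size is reached), or at max_size.
    """
    window = deque([0] * WINDOW)  # an all-zero window has fingerprint 0
    fp = 0
    start = 0
    for i, b in enumerate(data):
        fp = roll(fp, window.popleft(), b)
        window.append(b)
        size = i + 1 - start
        if (size >= min_size and fp & mask == 0) or size >= max_size:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]  # trailing partial chunk


if __name__ == "__main__":
    sample = bytes((i * i * 31 + i * 7) % 251 for i in range(2000))
    for part in chunks(sample):
        print(len(part))
```

Because the split decision depends only on the last WINDOW bytes, 
inserting or deleting data early in a stream tends to move only the 
nearby chunk boundaries, which is what makes this kind of chunking 
useful for deduplication.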

Links
-----

- Wikipedia: 
https://en.wikipedia.org/wiki/Rolling_hash#Rabin_fingerprint

- Original Go version: https://github.com/restic/chunker

- Dub package: https://code.dlang.org/packages/chunker

- Documentation: https://chunker.dpldocs.info/chunker.html 
(courtesy of Adam Ruppe's dpldocs service)

- Example: 
https://github.com/cybershadow/chunker/blob/master/src/chunker/example.d

Differences from the Go version
-------------------------------

- Chunker was adapted to be a D range and accept D ranges as 
input.

- The Rabin Fingerprint implementation was extracted out of 
Chunker and into its own module. It is usable stand-alone.

- Significant refactorings and simplifications of the 
implementation. The original code sacrificed some readability to 
work around limitations of the Go language and of its compiler's 
optimizer in order to achieve reasonable performance.

- 20% faster than the Go version (LDC release build).

- Improved test coverage and symbol documentation.

The original package was written by Alexander Neumann and is used 
in the restic backup program.
Sep 20 2019
Bastiaan Veelo <Bastiaan Veelo.net> writes:
On Saturday, 21 September 2019 at 03:11:11 UTC, Vladimir 
Panteleev wrote:
 Hi,

 This is a D port of a Go package implementing Content-Defined 
 Chunking:

 https://github.com/CyberShadow/chunker
[...]
 - Significant refactorings and simplifications of the 
 implementation. The original code made some sacrifices in code 
 readability to work around limitations of the language and 
 compiler optimization to achieve reasonable performance.

 - 20% faster than the Go version (LDC release build).

Marvellous! Well done.

[...]
 The original package was written by Alexander Neumann and is 
 used in the restic backup program.

Sounds like D would have been the right language for Restic. Maybe 
this is enough to spark Alexander’s interest in D?

Cheers,
Bastiaan.
Sep 21 2019