
digitalmars.D.announce - eBay's TSV Utilities status update

Jon Degenhardt <jond noreply.com> writes:
An update on changes to this tool-set over the last year.

For those not familiar, tsv-utils are a set of command-line tools 
for manipulating large tabular data files: files of numeric and 
text data common in machine learning and data mining environments. 
The tools cover filtering, statistics, sampling, joins, and more. 
They are intended for large files, larger than ideal for loading 
in-memory in tools like R or Pandas, but not so big as to 
necessitate moving to distributed compute environments. The tools 
are quite fast, the fastest of their kind available.

Besides being real tools, tsv-utils have also provided an 
environment for exploring the D programming language and the D 
ecosystem.

In the past year there have been two main areas of work.

One area is the sampling and shuffling facilities provided by the 
tsv-sample program. New sampling methods are available and 
performance has been improved. tsv-sample is very similar to the 
excellent GNU shuf tool, but supports sampling methods not 
available in shuf. Sampling is a rich and diverse area, and the 
tsv-sample code is perhaps the most algorithmically interesting 
in the tool-set.
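
As a small illustration of the kind of algorithm involved, here is a minimal Python sketch of textbook reservoir sampling (Algorithm R), which picks a fixed-size uniform sample from a stream of unknown length in a single pass. This is not the tsv-sample code, which is written in D; the function name and parameters are made up for the example.

```python
import random

def reservoir_sample(lines, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Hypothetical helper for illustration only; not tsv-sample's code.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir
```

Only k lines are held in memory at any time, which is why this style of sampling suits files too large to load whole.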

The other main update is improved I/O read performance in many of 
the tools. This comes from developing a buffered version of byLine. 
It is especially effective for skinny files (short lines). Most 
of the tools saw performance gains of 10-40%.
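
The general idea behind buffered line reading can be sketched in a few lines of Python: read the input in large blocks and split each block into lines in memory, rather than going back to the I/O layer for every line. This is only a rough sketch of the technique, not the D implementation used in tsv-utils; the function name and block size are assumptions chosen for the example.

```python
import io

def buffered_lines(stream, block_size=128 * 1024):
    """Yield lines (without the trailing newline) from a binary stream,
    reading block_size bytes at a time.

    Hypothetical helper for illustration only.
    """
    leftover = b""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        # Split the block into lines; the last piece may be a partial
        # line, so carry it over to the next block.
        parts = (leftover + block).split(b"\n")
        leftover = parts.pop()
        yield from parts
    if leftover:
        # Final line without a terminating newline.
        yield leftover
```

With short lines, one block read serves many lines, which is consistent with skinny files seeing the largest gains.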

One of the earlier performance improvements came from buffering 
output lines. Combined, the line-by-line read-write performance 
is quite a bit faster than what is available in Phobos. The 
iopipe / std.io packages (Steve Schveighoffer, Martin Nowak) are 
faster still; these are the place to go for really high 
performance. (See the links below for a benchmark report.)
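
For readers unfamiliar with output buffering, a minimal Python sketch of the idea: collect output lines in memory and hand them to the underlying stream in large batches instead of one write per line. The class name and flush threshold are assumptions for the example; this is not the tsv-utils code.

```python
import io

class BufferedLineWriter:
    """Collect output lines in memory and write them in large batches.

    Hypothetical class for illustration only.
    """

    def __init__(self, stream, flush_size=64 * 1024):
        self.stream = stream
        self.flush_size = flush_size
        self.buffer = bytearray()

    def write_line(self, line):
        # Append the line and its newline to the in-memory buffer.
        self.buffer += line
        self.buffer += b"\n"
        # Write to the stream only once enough data has accumulated.
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        # Write out whatever is buffered; call once more at end of output.
        if self.buffer:
            self.stream.write(bytes(self.buffer))
            self.buffer.clear()
```

The caller needs to call flush() once at the end so the last partial batch is written.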

Links:
* tsv-utils repo: https://github.com/eBay/tsv-utils
* tsv-sample user docs: 
https://github.com/eBay/tsv-utils/blob/master/docs/ToolReference.md#tsv-sample-reference
* tsv-sample code docs: 
https://tsv-utils.dpldocs.info/tsv_utils.tsv_sample.html
* Performance benchmarks on line-oriented I/O facilities: 
https://github.com/jondegenhardt/dcat-perf/issues/1
Apr 29 2019
James Blachly <james.blachly gmail.com> writes:
On 4/29/19 11:23 AM, Jon Degenhardt wrote:
 An update on changes to this tool-set over the last year.
...
 The other main update is improved I/O read performance in many of the 
 tools. This is from developing a buffered version of byLine. It is 
 especially effective for skinny files (short lines). Most of the tools 
 saw performance gains of 10-40%.
 
 One of the earlier performance improvements came from buffering output 
 lines. Combined, the line-by-line read-write performance is quite a bit 
 faster than what is available in Phobos. The iopipe / std.io packages 
 (Steve Schveighoffer, Martin Nowak) are faster still; these are the place 
 to go for really high performance. (See the links below for a benchmark 
 report.)
 
 Links:
 * tsv-utils repo: https://github.com/eBay/tsv-utils
 * tsv-sample user docs: 
 https://github.com/eBay/tsv-utils/blob/master/docs/ToolReference.md#tsv-sample-reference
 * tsv-sample code docs: 
 https://tsv-utils.dpldocs.info/tsv_utils.tsv_sample.html
 * Performance benchmarks on line-oriented I/O facilities: 
 https://github.com/jondegenhardt/dcat-perf/issues/1
Jon:

Thank you for this, and thanks for your blog post of a couple of years ago, which I referred to many times while learning D and writing fast(er) CLI tools. Looking forward to trying Steve's iopipe as well as your bufferedByLineReader.

James
May 02 2019
Jon Degenhardt <jond noreply.com> writes:
On Friday, 3 May 2019 at 03:54:14 UTC, James Blachly wrote:
 On 4/29/19 11:23 AM, Jon Degenhardt wrote:
 An update on changes to this tool-set over the last year.
 ...
 Thank you for this, and thanks for your blog post of a couple of years ago, which I referred to many times while learning D and writing fast(er) CLI tools. Looking forward to trying Steve's iopipe as well as your bufferedByLineReader.
 James
Thanks for the kind words James!
May 03 2019