
digitalmars.D.announce - Command line utilities for tab-separated value files

Jon D <jond noreply.com> writes:
Hi all,

I've open sourced a set of command line utilities for 
manipulating tab-separated value files. They are complementary to 
traditional unix tools like cut, grep, etc. They're useful for 
manipulating large data files. I use them when prepping files for 
R and similar tools. These tools were part of my 'explore D' 
programming exercises.

The tools are here: https://github.com/eBay/tsv-utils-dlang

They are likely of interest primarily to people regularly working 
with large files, though others might find the performance 
benchmarks of interest as well (included in the README).

I'd welcome any feedback, either on the apps or the code. 
Intention is that the code be reasonable example programs. And, I 
may write a blog post about my D explorations at some point, 
they'd be referenced in such an article.

--Jon
Apr 11 2016
Joakim <dlang joakim.fea.st> writes:
On Tuesday, 12 April 2016 at 00:50:24 UTC, Jon D wrote:
 Hi all,

 I've open sourced a set of command line utilities for 
 manipulating tab-separated value files. They are complementary 
 to traditional unix tools like cut, grep, etc. They're useful 
 for manipulating large data files. I use them when prepping 
 files for R and similar tools. These tools were part of my 
 'explore D' programming exercises.

 [...]
Hmm, benchmarks are nice, someone post to reddit?
Apr 11 2016
Puming <zhaopuming gmail.com> writes:
On Tuesday, 12 April 2016 at 00:50:24 UTC, Jon D wrote:
 Hi all,

 I've open sourced a set of command line utilities for 
 manipulating tab-separated value files. They are complementary 
 to traditional unix tools like cut, grep, etc. They're useful 
 for manipulating large data files. I use them when prepping 
 files for R and similar tools. These tools were part of my 
 'explore D' programming exercises.

 [...]
Interesting, I have large csv files, and this lib will be useful. Can you put it onto code.dlang.org so that we could use it with dub?
Apr 11 2016
Jon D <jond noreply.com> writes:
On Tuesday, 12 April 2016 at 06:22:55 UTC, Puming wrote:
 Interesting, I have large csv files, and this lib will be useful.
 Can you put it onto code.dlang.org so that we could use it with
 dub?
I'd certainly like to make it available via dub, but I wasn't sure how to set it up. There are two issues. One is that the package builds multiple executables, which dub doesn't seem to support easily. More problematic is that quite a bit of the test suite is run against the executables, which I could automate using make, but didn't see how to do it with dub.

If there are suggestions for setting this up in dub that'd be great. An example project doing something similar would be really helpful.

--Jon
Apr 12 2016
Edwin van Leeuwen <edder tkwsping.nl> writes:
On Tuesday, 12 April 2016 at 07:17:05 UTC, Jon D wrote:
 I'd certainly like to make it available via dub, but I wasn't 
 sure how to set it up. There are two issues. One is that the 
 package builds multiple executables, which dub doesn't seem to 
 support easily. More problematic is that quite a bit of the 
 test suite is run against the executables, which I could 
 automate using make, but didn't see how to do it with dub.

 If there are suggestions for setting this up in dub that'd be 
 great. An example project doing something similar would be 
 really helpful.
Dub is indeed not ideal for building multiple executables. You can either use subConfigurations or subPackages. In your case I would probably go the subPackages route, with the root dub file depending on all the executables. Never done that before though, so not exactly sure if that would work. If it works though then I'd think dub test in the root would run the tests for each subPackage.
Apr 12 2016
Puming <zhaopuming gmail.com> writes:
On Tuesday, 12 April 2016 at 07:17:05 UTC, Jon D wrote:
 I'd certainly like to make it available via dub, but I wasn't
 sure how to set it up.

 [...]
Here is what I know of it, using subPackages:

Say you have a project named myapp, and you need three executables, app1, app2 and app3. They all depend on a common code base, which you name common.

Using dub, you can have a parent project myapp that does nothing but act as a container for the three apps and their common code.

dub.sdl in the myapp dir:

```
name "myapp"
dependency ":common" version="*"
subPackage "./common/"
dependency ":app1" version="*"
subPackage "./app1/"
dependency ":app2" version="*"
subPackage "./app2/"
dependency ":app3" version="*"
subPackage "./app3/"
```

The leading colon in the dependency name ":common" is shorthand for "myapp:common".

Now use `dub init common` and the like to create the subdirectories.

Change dub.sdl in the subdirectory common so that it becomes a library type:

```
name "common"
targetType "library"
```

Change dub.sdl in the app* subdirectories to depend on common:

```
name "app1"
targetType "executable"
dependency "myapp:common" version="*"
```

Note that here you need to include the root project name: "myapp:common".

Then you should register your whole project in the local dub repo, so that the subpackages can find their dependencies when building. In the project root directory:

    dub add-local .

Now you can build each executable with:

    dub build :app1
    dub build :app2
    dub build :app3

Unfortunately dub does not build all subpackages at once when you run dub in the root directory.

But I think there might be a better way to handle multiple executables?
Apr 12 2016
Rory McGuire via Digitalmars-d-announce writes:
On Wed, Apr 13, 2016 at 3:41 AM, Puming via Digitalmars-d-announce <
digitalmars-d-announce puremagic.com> wrote:

 Here is what I know of it, using subPackages:

 [...]
Just tried your suggestion and it works. I just added the below to the parent project to get the apps built:

    void main()
    {
        import std.process : executeShell;
        executeShell(`dub build :app1`);
        executeShell(`dub build :app2`);
        executeShell(`dub build :app3`);
    }
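One thing to watch with this approach: `executeShell` does not stop on its own when a build fails; it returns the exit status and the captured output. A minimal sketch that checks each result, assuming the same three sub-package names Puming used (the `buildCommand` helper is hypothetical, introduced only for illustration):

```d
import std.process : executeShell;
import std.stdio : stderr, writeln;

/// Hypothetical helper: the dub command line for one sub-package.
string buildCommand(string subPackage)
{
    return "dub build :" ~ subPackage;
}

int main()
{
    // Sub-package names are assumed to match the dub.sdl layout
    // described earlier in the thread.
    foreach (app; ["app1", "app2", "app3"])
    {
        // executeShell returns a tuple with `status` and `output` fields.
        auto result = executeShell(buildCommand(app));
        if (result.status != 0)
        {
            stderr.writeln("build failed for ", app, ":\n", result.output);
            return result.status;
        }
        writeln("built ", app);
    }
    return 0;
}
```

Stopping at the first failure keeps the error message close to the sub-package that caused it.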
Apr 13 2016
Jon D <jond noreply.com> writes:
On Wednesday, 13 April 2016 at 07:34:11 UTC, Rory McGuire wrote:
 Just tried your suggestion and it works.

 [...]
Thanks Rory, Puming. I'll look into this and see how best to make it fit. I'm realizing also there's one additional capability it'd be nice to have in dub for tools like this, which is an option to install the executables somewhere that can easily be put on the path. Still, even without this there'd be benefit to having them fetched via dub.

--Jon
Apr 13 2016
Dicebot <public dicebot.lv> writes:
On Wednesday, 13 April 2016 at 16:34:16 UTC, Jon D wrote:
 Thanks Rory, Puming. I'll look into this and see how best to 
 make it fit. I'm realizing also there's one additional 
 capability it'd be nice to have in dub for tools like this, 
 which in an option to install the executables somewhere that 
 can be easily be put on the path. Still, even without this 
 there'd be benefit to having them fetched via dub.
You don't need to put anything on path to run utils from dub packages. `dub run` will take care of setting up the necessary environment (without messing with the system):

    dub fetch package_with_apps
    dub run package_with_apps:app1 --flags args
Apr 13 2016
Jon D <jond noreply.com> writes:
On Wednesday, 13 April 2016 at 17:01:33 UTC, Dicebot wrote:
 You don't need to put anything on path to run utils from dub
 packages.

 [...]
These are command line utilities, along the lines of unix 'cut', 'grep', etc, intended to be used as part of a unix pipeline. It'd be less convenient to be invoking them via dub. They really should be on the path themselves.

--Jon
Apr 13 2016
Dicebot <public dicebot.lv> writes:
On Wednesday, 13 April 2016 at 17:21:58 UTC, Jon D wrote:
 These are command line utilities, along the lines of unix 'cut',
 'grep', etc, intended to be used as part of unix pipeline. It'd
 be less convenient to be invoking them via dub. They really
 should be on the path themselves.
Sure, that would be beyond dub's scope though. Making binary packages is independent of build system or source layout (and is highly platform-specific). The `dub run` feature is mostly helpful when you need to use one such tool as part of a build process for another dub package.
Apr 13 2016
Jon D <jond noreply.com> writes:
On Wednesday, 13 April 2016 at 18:22:21 UTC, Dicebot wrote:
 Sure, that would be beyond dub scope though. Making binary
 packages is independent of build system or source layout (and is
 highly platform-specific).

 [...]
Right. So, partly what I'm wondering is if during the normal dub fetch/run cycle there might be an opportunity to print a message to the user with some info to help them add the tools to their path. I haven't used dub much, so I'll have to look into it more. But there should be some way to make it reasonably easy and clear. It'll probably be a few days before I can get to this, but I would like to get them in the package registry.

--Jon
Apr 13 2016
Dicebot <public dicebot.lv> writes:
On 04/13/2016 09:48 PM, Jon D wrote:
 Right. So, partly what I'm wondering is if during the normal dub
 fetch/run cycle there might be an opportunity to print a message the
 user with some info to help them add the tools to their path. I haven't
 used dub much, so I'll have to look into it more. But there should be
 some way to make it reasonably easy and clear. It'll probably be a few
 days before I can get to this, but I would like to get them in the
 package registry.
This is the wrong direction. Users of those tools should never need to have dub installed or even know about its existence; dub is strictly a developer tool. Instead, whoever distributes the utils should use dub to build them and use the generated artifacts to prepare a distribution package.
Apr 13 2016
Puming <zhaopuming gmail.com> writes:
On Wednesday, 13 April 2016 at 17:21:58 UTC, Jon D wrote:
 These are command line utilities, along the lines of unix 'cut',
 'grep', etc, intended to be used as part of unix pipeline. It'd
 be less convenient to be invoking them via dub. They really
 should be on the path themselves.
If dub supported something like:

```
dub deploy
```

and you could specify some dir like '/usr/bin/' in the dub.sdl, it would be great.
Apr 13 2016
Puming <zhaopuming gmail.com> writes:
On Wednesday, 13 April 2016 at 16:34:16 UTC, Jon D wrote:

 Thanks Rory, Puming. I'll look into this and see how best to 
 make it fit. I'm realizing also there's one additional 
 capability it'd be nice to have in dub for tools like this, 
 which in an option to install the executables somewhere that 
 can be easily be put on the path. Still, even without this 
 there'd be benefit to having them fetched via dub.

 --Jon
Well, you can do that. In the subpackage dub.sdl, add targetPath:

```
name "app1"
targetType "executable"
targetPath "../bin/"
dependency "myapp:common" version="*"
```
Apr 13 2016
Jon D <jond noreply.com> writes:
On Wednesday, 13 April 2016 at 16:34:16 UTC, Jon D wrote:
Thanks Rory, Puming. I'll look into this ...
Available now via DUB. Setup follows the outline Rory and Puming suggested, plus a few other changes. Only useful for people already using DUB, but seems convenient and does get it included in the package registry. Works as follows:

    $ dub fetch tsv-utils-dlang
    $ dub run tsv-utils-dlang

This kicks off a build of the package. The binaries end up in the DUB package repository, and the user has to add them to the PATH.
Apr 19 2016
Ali Çehreli <acehreli yahoo.com> writes:
On 04/11/2016 05:50 PM, Jon D wrote:

 The tools are here: https://github.com/eBay/tsv-utils-dlang
 --Jon
Congratulations Jon. Really cool stuff! :)

Ali
Apr 12 2016
Dejan Lekic <dejan.lekic gmail.com> writes:
On Tuesday, 12 April 2016 at 00:50:24 UTC, Jon D wrote:
 Hi all,

 I've open sourced a set of command line utilities for 
 manipulating tab-separated value files. They are complementary 
 to traditional unix tools like cut, grep, etc. They're useful 
 for manipulating large data files. I use them when prepping 
 files for R and similar tools. These tools were part of my 
 'explore D' programming exercises.

 The tools are here: https://github.com/eBay/tsv-utils-dlang

 They are likely of interest primarily to people regularly 
 working with large files, though others might find the 
 performance benchmarks of interest as well (included in the 
 README).

 I'd welcome any feedback, either on the apps or the code. 
 Intention is that the code be reasonable example programs. And, 
 I may write a blog post about my D explorations at some point, 
 they'd be referenced in such an article.

 --Jon
I rarely need TSV files, but I deal with CSV files every day. It would be nice to test your implementation against std.csv (it can use TAB as separator). Did you try to compare the two?
Apr 13 2016
Jon D <jond noreply.com> writes:
On Wednesday, 13 April 2016 at 12:36:56 UTC, Dejan Lekic wrote:
 I rarely need TSV files, but I deal with CSV files every day. It
 would be nice to test your implementation against std.csv (it can
 use TAB as separator). Did you try to compare the two?
No, I didn't try using the std.csv library utilities. The utilities all take a delimiter, so comma can be specified, but that won't handle CSV escaping. For myself, I'd be more inclined to add TSV-CSV converters rather than adding native CSV support to each tool, but if you're working with CSV all the time that'd be a nuisance.

If you want, you can try rewriting the inner loop of one of the tools to use csvNextToken rather than algorithm.splitter. tsv-select would be the easiest of the tools to try. It'd also be necessary to replace the writeln for the output to properly add CSV escapes.

--Jon
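To make that suggestion concrete, here is a rough sketch contrasting the two ways of splitting a line. It is not code from tsv-utils-dlang; `csvFields` is a hypothetical helper name, and the sketch assumes the input is a single record with no trailing newline. `csvNextToken` comes from std.csv and `splitter` from std.algorithm.

```d
import std.algorithm : splitter;
import std.array : appender, array;
import std.csv : csvNextToken;
import std.range.primitives : empty, popFront;

/// Hypothetical helper: split one CSV record into unescaped fields.
/// Assumes `line` holds a single record without a line terminator.
string[] csvFields(string line, char sep = ',', char quote = '"')
{
    string[] fields;
    auto field = appender!(char[])();
    while (true)
    {
        field.shrinkTo(0); // reuse the buffer for the next field
        // Consumes one field from `line`, removing CSV quoting.
        csvNextToken(line, field, sep, quote);
        fields ~= field.data.idup;
        if (line.empty)
            break;
        line.popFront(); // step over the separator
    }
    return fields;
}

void main()
{
    import std.stdio : writeln;

    auto line = `a,"b,c",d`;
    // A plain split on the delimiter cuts the quoted field in two.
    writeln(line.splitter(',').array);
    // The CSV-aware version keeps the quoted field together.
    writeln(csvFields(line));
}
```

Writing the output side would need the reverse step (quoting fields that contain the delimiter or a quote character), which this sketch leaves out.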
Apr 13 2016
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 04/11/2016 08:50 PM, Jon D wrote:
 Hi all,

 I've open sourced a set of command line utilities for manipulating
 tab-separated value files. They are complementary to traditional unix
 tools like cut, grep, etc. They're useful for manipulating large data
 files. I use them when prepping files for R and similar tools. These
 tools were part of my 'explore D' programming exercises.

 The tools are here: https://github.com/eBay/tsv-utils-dlang

 They are likely of interest primarily to people regularly working with
 large files, though others might find the performance benchmarks of
 interest as well (included in the README).

 I'd welcome any feedback, either on the apps or the code. Intention is
 that the code be reasonable example programs. And, I may write a blog
 post about my D explorations at some point, they'd be referenced in such
 an article.

 --Jon
Looking great. Thanks!

https://www.facebook.com/dlang.org/posts/1275477382465940
https://twitter.com/D_Programming/status/720310640531808261
https://www.reddit.com/r/programming/comments/4ems6a/commandline_utilities_for_large_tabseparated/

Andrei
Apr 13 2016
Walter Bright <newshound2 digitalmars.com> writes:
On 4/11/2016 5:50 PM, Jon D wrote:
 I'd welcome any feedback, either on the apps or the code. Intention is that the
 code be reasonable example programs. And, I may write a blog post about my D
 explorations at some point, they'd be referenced in such an article.
You've got questions on:

https://www.reddit.com/r/programming/comments/4ems6a/commandline_utilities_for_large_tabseparated/

As the author, it'd be nice to do an AMA there.
Apr 13 2016
Jon D <jond noreply.com> writes:
On Wednesday, 13 April 2016 at 19:52:30 UTC, Walter Bright wrote:
 You've got questions on:
 https://www.reddit.com/r/programming/comments/4ems6a/commandline_utilities_for_large_tabseparated/
 As the author, it'd be nice to do an AMA there.
Thanks for posting there and letting me know. I responded and will watch the thread. What do you mean by an "AMA"?
Apr 13 2016
Ali Çehreli <acehreli yahoo.com> writes:
On 04/13/2016 01:40 PM, Jon D wrote:

 What do you mean by an "AMA"?
It means "(I'm the author), Ask Me Anything".

Ali
Apr 13 2016