digitalmars.D.announce - Faster Command Line Tools in D

Mike Parker <aldacron gmail.com> writes:

Some of you may remember Jon Degenhardt's talk from one of the 
Silicon Valley D meetups, where he described the performance 
improvements he saw when he rewrote some of eBay's command line 
tools in D. He has now put the effort into crafting a blog post 
on the same topic, where he takes a D version of a command-line 
tool written in Python and incrementally improves its performance.

The blog:
https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

Reddit:
https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/

May 24 2017

cym13 <cpicard openmailbox.org> writes:

On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
 Some of you may remember Jon Degenhardt's talk from one of the 
 Silicon Valley D meetups, where he described the performance 
 improvements he saw when he rewrote some of eBay's command line 
 tools in D. He has now put the effort into crafting a blog post 
 on the same topic, where he takes D version of a command-line 
 tool written in Python and incrementally improves its 
 performance.

 The blog:
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/

A bit off topic but I really like that we still get quality 
content such as this post on this blog. Sustained quality is a hard 
job and I thank everyone involved for that.

May 24 2017

Jon Degenhardt <jond noreply.com> writes:

On Wednesday, 24 May 2017 at 17:36:29 UTC, cym13 wrote:
 On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
 [...snip...]

 A bit off topic but I really like that we still get quality 
 content such as this post on this blog. Sustained quality is 
 hard job and I thank everyone involved for that.

The compliment to the community is well deserved; thank you for 
including this post in that company. In this case, the post 
benefited from some really excellent review feedback and Mike 
made the publication side really easy.

--Jon

May 24 2017

Walter Bright <newshound2 digitalmars.com> writes:

It's now #4 on the front page of Hacker News:

https://news.ycombinator.com/news

May 24 2017

cym13 <cpicard openmailbox.org> writes:

On Wednesday, 24 May 2017 at 21:34:08 UTC, Walter Bright wrote:


 https://news.ycombinator.com/news

The comments on HN are useless though. Everybody went for the 
"D versus Python" angle and seems to complain that it's a 
D/Python benchmark that only talks about D optimization...even 
though optimizing D is the whole point of the article. In the 
same way, they rant about so many iterations of the D script 
being shown, when the obvious purpose is to present different 
tricks while being clear about which trick gives which gain.

I am disappointed because there are so many good things to say 
about this, so many good questions or remarks to make when not 
familiar with the language, and yet all we get is "Meh, this 
benchmark shows nothing of D's speed against Python".

May 24 2017

Jon Degenhardt <jond noreply.com> writes:

On Wednesday, 24 May 2017 at 21:46:10 UTC, cym13 wrote:
 On Wednesday, 24 May 2017 at 21:34:08 UTC, Walter Bright wrote:


 https://news.ycombinator.com/news

 The comments on HN are useless though, everybody went for the 
 "D versus Python" thing and seem to complain that it's doing a 
 D/Python benchmark while only talking about D 
 optimization...even though optimizing D is the whole point of 
 the article. In the same way they rant against the fact that 
 many iterations on the D script are shown while it is obviously 
 to give different tricks while being clear on what trick gives 
 what.

 I am disappointed because there are so many good things to say 
 about this, so many good questions or remarks to make when not 
 familiar with the language, and yet all we get is "Meh, this 
 benchmark shows nothing of D's speed against Python".

It's not easy writing an article that doesn't draw some form of 
criticism. FWIW, the reason I gave a Python example is because it 
is very commonly used for this type of problem and the language 
is well suited to it. A second reason is that I've seen several 
posts where someone has tried to rewrite a Python program like 
this in D, started with `split`, and wondered how to make it faster. 
My hope is that this will clarify how to achieve this.

Another goal of the article was to describe how performance in 
the TSV Utilities had been achieved. The article is not about the 
TSV Utilities, but discussing both the benchmark results and how 
they had been achieved would be a very long article.

--Jon

May 24 2017

Walter Bright <newshound2 digitalmars.com> writes:

On 5/24/2017 3:56 PM, Jon Degenhardt wrote:
 Its not easy writing an article that doesn't draw some form of criticism.
 FWIW, the reason I gave a Python example is because it is very commonly used
 for this type of problem and the language is well suited to it. A second
 reason is that I've seen several posts where someone has tried to rewrite a
 Python program like this in D, start with `split`, and wonder how to make it
 faster. My hope is that this will clarify how to achieve this.

 Another goal of the article was to describe how performance in the TSV
 Utilities had been achieved. The article is not about the TSV Utilities, but
 discussing both the benchmark results and how they had been achieved would be
 a very long article.

Any time one writes an article comparing speed between languages X and Y, 
someone gets their ox gored and will bitterly complain about how unfair the 
article is (though I noticed that none of the complainers wrote a faster 
Python version). Even if you tried to optimize the Python program, you'll be 
inevitably accused of deliberately not doing it right.

The nadir of this for me was when I compared Digital Mars C++ code with DMD. 
Both share the same optimizer and back end, yet I was accused of "sabotaging" 
my own C++ compiler in order to make D look better!! Me, I just don't do 
public comparison benchmarking anymore. It's a waste of time arguing with 
people about it.

I thought you wrote a fine article, and the criticism about the Python code 
was unwarranted (especially since nobody suggested better code), because the 
article was about optimizing D code, not optimizing Python.

May 24 2017

Jon Degenhardt <jond noreply.com> writes:

On Thursday, 25 May 2017 at 05:17:29 UTC, Walter Bright wrote:
 Any time one writes an article comparing speed between 
 languages X and Y, someone gets their ox gored and will 
 bitterly complain about how unfair the article is (though I 
 noticed that none of the complainers wrote a faster Python 
 version). Even if you tried to optimize the Python program, 
 you'll be inevitably accused of deliberately not doing it right.

 The nadir of this for me was when I compared Digital Mars C++ 
 code with DMD. Both share the same optimizer and back end, yet 
 I was accused of "sabotaging" my own C++ compiler in order to 
 make D look better !! Me, I just don't do public comparison 
 benchmarking anymore. It's a waste of time arguing with people 
 about it.

 I thought you wrote a fine article, and the criticism about the 
 Python code was unwarranted (especially since nobody suggested 
 better code), because the article was about optimizing D code, 
 not optimizing Python.

Thanks Walter, I appreciate your comments. And correct, as 
multiple people noted, a speed comparison with other languages 
was not at all a goal of the article.

The real intent was to tell a story of how several of D's 
features play together to enable optimizations like this, without 
having to write low-level code or step outside the core language 
features and standard library.

--Jon

May 24 2017

Wulfklaue <wulfklaue wulfklaue.com> writes:

On Thursday, 25 May 2017 at 06:22:28 UTC, Jon Degenhardt wrote:
 Thanks Walter, I appreciate your comments. And correct, as 
 multiple people noted, a speed comparison with other languages 
 not at all a goal of the article.

 The real intent was to tell a story of how several of D's 
 features play together to enable optimizations like this, 
 without having to write low-level code or step outside the core 
 language features and standard library.

Maybe as a more casual observer, the article did feel more like 
Python vs D to me. I have not yet read the ycombinator comments; 
this is just my personal observation after reading the article.

My thinking was:

- Python's PyPy is surprisingly fast.

- Surprised that D was slower in version 1.

- Kind of surprised again that it took so many versions to figure 
out the best approach.

- Also wondering why one needed std.algorithm splitter, when you 
expect string split to be the fastest. Even the fact that you need 
to import std.array to split a string simply felt strange.

- So much effort for relatively little gain (after the v2 splitter). 
The time spent on finding a faster solution is, in a business sense, 
not worth it. But not finding a faster way is simply wasting 
performance, just on this simple function.

- Started to wonder if Python's PyPy is so optimized that, 
without any effort, you're even faster than D. What other 
idiomatic D functions are slow?

I am not criticizing your article, Jon, just mentioning how I felt 
when reading it yesterday. It felt like the solution was overly 
complex to find and required too much deep D knowledge. Going to 
read the ycombinator comments now.


Off-topic:

Yesterday I was struggling with split, but for a whole different 
reason. Take into account that I am new to D.

Needed to split a string. Simple, right? Search Google for "split 
string dlang". Land on the 
https://dlang.org/phobos/std_string.html page.

After seeing splitLines, I started experimenting with it. Half 
an hour later I realized that I had used the wrong function and 
needed to import the std.array split function.

Call it an issue with the documentation or my own stupidity. But 
the fact that split was only listed as an imported function, in 
this mass of text, totally sent me in the wrong direction.

As stated above, I expected split to be part of std.string, 
because I am manipulating a string, not that I needed to import 
std.array, which is the end result.

I simply find the documentation confusing with its wall of text. 
When I search for string split, I expect to arrive on the 
string.split page. Not only that, the split examples use 
split as a separate keyword, when I was looking for 
variable.split().

Veteran D programmers are probably going to laugh at me for 
this, but one does feel a bit salty after that.

May 25 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 5/25/17 6:27 AM, Wulfklaue wrote:

 - Also wondering why one needed std.algorithm splitter, when you expect
 string split to be the fasted. Even the fact that you need to import
 std.array to split a string simply felt  strange.

Because split allocates on every call. The key, in many cases in D, to 
increasing performance is avoiding allocations. Has been that way for as 
long as I can remember.
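
A minimal sketch of that difference, assuming a tab-separated line. The helper names `eagerFields` and `countFields` are made up for illustration and do not come from the article; only `std.array.split` and `std.algorithm`'s `splitter` are real library calls.

```d
import std.algorithm.iteration : splitter;
import std.array : split;
import std.stdio : writeln;

// Eager: std.array.split builds and returns a newly allocated array of slices.
string[] eagerFields(string line, char delim)
{
    return line.split(delim);
}

// Lazy: splitter yields one slice of `line` at a time; no array is built.
size_t countFields(string line, char delim)
{
    size_t n = 0;
    foreach (field; line.splitter(delim))
        ++n;
    return n;
}

void main()
{
    string line = "key\t10\tmore";
    writeln(eagerFields(line, '\t')); // allocates on every call
    writeln(countFields(line, '\t')); // walks the same fields without an array
}
```

In a loop over millions of lines, the eager form performs one array allocation per line, which is the cost the lazy form avoids.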

Another possibility to "fix" this problem is to simply use an allocator 
with split that allocates on some predefined stack space. This is very 
similar to what v3 does with the Appender. Unfortunately, allocator is 
still experimental, and so split doesn't support using it.

 - So much effort for relative little gain ( after v2 splitter ). The
 time spend on finding a faster solution is in business sense not worth
 it. But not finding a faster way is simply wasting performance, just on
 this simple function.

The answer is always "it depends". If you're processing hundreds of 
these files in tight loops, it probably makes sense to optimize the 
code. If not, then it may make sense to focus efforts elsewhere. The 
point of the article is, this is how to do it if you need performance there.

 - Started to wonder if Python its PyPy is so optimized that without any
 effort, your even faster then D. What other D idiomatic functions are slow?

split didn't actually seem that slow. I'll note that you could opt for 
just the AA optimization (the converting char[] to string only when 
storing a new hash lookup is big, and not that cumbersome) and leave the 
code for split alone, and you probably still could beat the Python code.
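
As a rough sketch of that AA optimization: the function name `addToSum` and the sample keys below are hypothetical, chosen only to show the pattern of looking up with a temporary slice and copying the key only on first insertion.

```d
import std.stdio : writeln;

// Add `value` to the running sum for `key`. The key may be a temporary
// slice (for example into a reused line buffer), so it is copied with
// .idup only when it is stored in the table for the first time.
void addToSum(ref long[string] sums, const(char)[] key, long value)
{
    if (auto p = key in sums)
        *p += value;            // existing key: no allocation
    else
        sums[key.idup] = value; // new key: one immutable copy
}

void main()
{
    long[string] sums;
    char[] buf = "apple".dup;   // stands in for a reused input buffer
    addToSum(sums, buf, 3);
    addToSum(sums, buf, 4);
    buf[] = "peach";            // overwriting the buffer leaves the stored key intact
    addToSum(sums, buf, 1);
    writeln(sums);
}
```

The allocation count then scales with the number of distinct keys rather than the number of input lines.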

 Off-topic:

 Yesterday i was struggling with split but for a whole different reason.
 Take in account that i am new at D.

 Needed to split a string. Simple right? Search Google for "split string
 dlang". Get on the https://dlang.org/phobos/std_string.html page.

 After seeing the splitLines and start experimenting with it. Half a hour
 later i realize that the wrong function was used and needed to import
 std.array split function.

 Call it a issue with the documentation or my own stupidity. But the fact
 that Split was only listed as a imported function, in this mass of text,
 totally send me on the wrong direction.

 As stated above, i expected split to be part of the std.string, because
 i am manipulating a string, not that i needed to import std.array what
 is the end result.

std.string, std.array, and std.algorithm all have cross-pollination when 
it comes to array operations. It has to do with the history of when the 
modules were introduced.

 I simply find the documentation confusing with the wall of text. When i
 search for string split, you expect to arrive on the string.split page.
 Not only that, the split example are using split as a separate keyword,
 when i was looking for variable.split().

There is a search field on the top, which helps to narrow down what 
choices are available.

 Veteran D programmers are probably going to laughing at me for this but
 one does feel a bit salty after that.

I understand your pain. I work with Swift often, and sometimes it's very 
frustrating trying to find the right tool for the job, as I'm not 
thoroughly immersed in Apple's SDK on a day-to-day basis. I don't know 
that any programming language gets this perfect.

-Steve

May 25 2017

Suliman <evermind live.ru> writes:

 std.string, std.array, and std.algorithm all have 
 cross-polination when it comes to array operations. It has to 
 do with the history of when the modules were introduced.

Is there any plan to deprecate all the splitters and keep a single one? 
Because now, as I understand it, we have 4 functions that do the same 
task.

May 25 2017

Jonathan M Davis via Digitalmars-d-announce writes:

On Thursday, May 25, 2017 14:17:27 Suliman via Digitalmars-d-announce wrote:
 std.string, std.array, and std.algorithm all have
 cross-polination when it comes to array operations. It has to
 do with the history of when the modules were introduced.

 Is there any plan to deprecate all splitters and make one single.
 Because now as I understand we have 4 functions that make same
 task.

I wouldn't expect any of the split-related functions to be going away. We
often have a function that operates on arrays or strings and another which
operates on more general ranges. It may mainly be for historical reasons,
but removing the array-based functions would break existing code, and we'd
get a whole other set of complaints about folks not understanding that you
need to slap array() on the end of a call to splitter to get the split that
they were looking for (especially if they're coming from another language
and don't understand ranges yet). And ultimately, the array-based functions
continue to serve as a way to have simpler code when you don't care about
(or you actually need) the additional memory allocations.

Also, splitLines/lineSplitter can't actually be written in terms of
split/splitter, because split/splitter does not have a way to provide
multiple delimiters (let alone multiple delimiters where one includes the
other, which is what you get with "\n" and "\r\n"). So, that distinction
isn't going away. It's also a common enough operation that having a function
for it rather than having to pass all of the delimiters to a more general
function is arguably worth it, just like having the overload of
split/splitter which takes no delimiter and then splits on whitespace is
arguably worth it over having a more general function where you have to feed
it every variation of whitespace.
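
A short sketch of the points above, using only Phobos functions named in this thread. The wrapper names `splitOnNewline` and `splitIntoLines` are illustrative, not library API.

```d
import std.algorithm.iteration : splitter;
import std.array : array, split;
import std.string : lineSplitter;

// Splitting on a single delimiter leaves the '\r' of a "\r\n" terminator behind.
string[] splitOnNewline(string text)
{
    return text.split('\n');
}

// lineSplitter understands both "\n" and "\r\n" line terminators.
string[] splitIntoLines(string text)
{
    return text.lineSplitter.array;
}

void main()
{
    string text = "one\r\ntwo\nthree";
    assert(splitOnNewline(text) == ["one\r", "two", "three"]);
    assert(splitIntoLines(text) == ["one", "two", "three"]);

    // The lazy splitter plus array() gives the same result as the eager split.
    assert("a,b,c".splitter(',').array == "a,b,c".split(','));

    // The overload without a delimiter splits on runs of whitespace.
    assert("a  b\tc".split == ["a", "b", "c"]);
}
```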

- Jonathan M Davis

May 25 2017

cym13 <cpicard openmailbox.org> writes:

On Thursday, 25 May 2017 at 16:19:16 UTC, Jonathan M Davis wrote:
 I wouldn't expect any of the split-related functions to be 
 going away. We often have a function that operates on arrays or 
 strings and another which operates on more general ranges. It 
 may mainly be for historical reasons, but removing the 
 array-based functions would break existing code, and we'd get a 
 whole other set of complaints about folks not understanding 
 that you need to slap array() on the end of a call to splitter 
 to get the split that they were looking for (especially if 
 they're coming from another language and don't understand 
 ranges yet). And ultimately, the array-based functions continue 
 to serve as a way to have simpler code when you don't care 
 about (or you actually need) the additional memory allocations.

I don't know if people coming from other languages would 
really mind. Of course it would have to be taught once, 
as everything has, but many languages (and I have Python especially 
in mind) have been lazifying their standard libraries for years 
now. I think consistency is what brings fewer questions, not 
diversity where one of the possibilities corresponds to what the 
programmer wants. He'll ask for the difference anyway.

 Also, splitLines/lineSplitter can't actually be written in 
 terms of split/splitter, because split/splitter does not have a 
 way to provide multiple delimeters (let alone multiple 
 delimeters where one includes the other, which is what you get 
 with "\n" and "\r\n"). So, that distinction isn't going away. 
 It's also a common enough operation that having a function for 
 it rather than having to pass all of the delimeters to a more 
 general function is arguably worth it, just like having the 
 overload of split/splitter which takes no delimiter and then 
 splits on whitespace is arguably worth it over having a more 
 general function where you have to feed it every variation of 
 whitespace.

 - Jonathan M Davis

May 28 2017

Jonathan M Davis via Digitalmars-d-announce writes:

On Thursday, May 25, 2017 08:46:17 Steven Schveighoffer via Digitalmars-d-announce wrote:
 std.string, std.array, and std.algorithm all have cross-polination when
 it comes to array operations. It has to do with the history of when the
 modules were introduced.

Not only that, but over time, there has been a push to generalize functions.
So, something that might have originally gotten put in std.string (because
you'd normally think of it as a string function) got moved to std.array,
because it could easily be generalized to work on arrays in general and not
just string operations (I believe that split is an example of this). And
something which was in std.array or std.string might have been generalized
for ranges in general, in which case, we ended up with a new function in
std.algorithm (hence, we have splitter in std.algorithm but split in
std.array).

The end result tends to make sense if you understand that functions that
only operate on strings go in std.string, functions that operate on dynamic
arrays in general (but not ranges) go in std.array, and functions which
could have gone in std.string or std.array except that they operate on
ranges in general go in std.algorithm. But if you don't understand that, it
tends to be quite confusing, and even if you do, it's often the case that
when you want to find a function to operate on a string, you're going to
need to look in std.string, std.array, and std.algorithm.

So, in part, it's an evolution thing, and in part, it's often just plain
hard to find stuff when you're focused on a specific use case, and the
library writer is focused on making the function that you need as general as
possible.

- Jonathan M Davis

May 25 2017

Jack Stouffer <jack jackstouffer.com> writes:

On Wednesday, 24 May 2017 at 21:46:10 UTC, cym13 wrote:
 I am disappointed because there are so many good things to say 
 about this, so many good questions or remarks to make when not 
 familiar with the language, and yet all we get is "Meh, this 
 benchmark shows nothing of D's speed against Python".

Wouldn't be the first time 
https://news.ycombinator.com/item?id=10828450

May 24 2017

xtreak <tir.karthi gmail.com> writes:

On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
 Some of you may remember Jon Degenhardt's talk from one of the 
 Silicon Valley D meetups, where he described the performance 
 improvements he saw when he rewrote some of eBay's command line 
 tools in D. He has now put the effort into crafting a blog post 
 on the same topic, where he takes D version of a command-line 
 tool written in Python and incrementally improves its 
 performance.

 The blog:
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/

There are repeated references to the use of D at Netflix for 
machine learning. It would be a very helpful boost if someone 
came up with a reference or a post on how D is used at 
Netflix, and adding Netflix to 
https://dlang.org/orgs-using-d.html would be amazing.

References :

https://news.ycombinator.com/item?id=14064012
https://news.ycombinator.com/item?id=14413546

May 25 2017

"Nick Sabalausky (Abscissa)" <SeeWebsiteToContactMe semitwist.com> writes:

On 05/25/2017 08:30 AM, xtreak wrote:
 
 There are repeated references over usage of D at Netflix for machine 
 learning. It will be a very helpful boost if someone comes up with any 
 reference or a post regarding how D is used at Netflix and addition of 
 Netflix to https://dlang.org/orgs-using-d.html will be amazing.
 

I've used Netflix. If its "suggestion" features are any indication, I'm 
not sure such a thing would be a feather in D's cap ;)

May 26 2017

Ali Çehreli <acehreli yahoo.com> writes:

On 05/24/2017 06:39 AM, Mike Parker wrote:

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/

Inspired Nim version, found on Reddit:

 
https://www.reddit.com/r/programming/comments/6dct6e/faster_command_line_tools_in_nim/

Ali

May 25 2017

Basile B. <b2.temp gmx.com> writes:

On Thursday, 25 May 2017 at 22:04:36 UTC, Ali Çehreli wrote:
 On 05/24/2017 06:39 AM, Mike Parker wrote:

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/

 Inspired Nim version, found on Reddit:


 https://www.reddit.com/r/programming/comments/6dct6e/faster_command_line_tools_in_nim/

 Ali

Wow, the D blog post opened Pandora's box.

May 25 2017

bachmeier <no spam.net> writes:

On Friday, 26 May 2017 at 06:05:11 UTC, Basile B. wrote:
 On Thursday, 25 May 2017 at 22:04:36 UTC, Ali Çehreli wrote:
 On 05/24/2017 06:39 AM, Mike Parker wrote:

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/

 Inspired Nim version, found on Reddit:


 https://www.reddit.com/r/programming/comments/6dct6e/faster_command_line_tools_in_nim/

 Ali

 Wow, the D blog post opened Pandora's box.

I guess programmers will do comparisons of language speed 
independent of whether it makes sense for that problem.

May 26 2017

John Colvin <john.loughran.colvin gmail.com> writes:

On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
 Some of you may remember Jon Degenhardt's talk from one of the 
 Silicon Valley D meetups, where he described the performance 
 improvements he saw when he rewrote some of eBay's command line 
 tools in D. He has now put the effort into crafting a blog post 
 on the same topic, where he takes D version of a command-line 
 tool written in Python and incrementally improves its 
 performance.

 The blog:
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/

I spent some time fiddling with my own manual approaches to 
making this fast, wasn't satisfied, and so decided to try using 
Steven's iopipe (https://github.com/schveiguy/iopipe) instead. 
Results were excellent.

https://gist.github.com/John-Colvin/980b11f2b7a7e23faf8dfb44bd9f1242

On my machine:
python takes a little over 20s, pypy wobbles around 3.5s, v1 from 
the blog takes about 3.9s, v4b took 1.45s, a version of my own 
that is hideous* manages 0.78s at best, the above version with 
iopipe hits below 0.67s most runs.

Not bad for a process that most people would call "IO-bound" 
(code for "I don't want to have to write fast code & it's all the 
disk's fault").

Obviously this version is a bit more code than is ideal, iopipe 
is currently quite "barebones", but I don't see why with some 
clever abstractions and wrappers it couldn't be the default thing 
that one does even for small scripts.

*using byChunk and manually managing linesplits over chunks, very 
nasty.

May 26 2017

John Colvin <john.loughran.colvin gmail.com> writes:

On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:
 I spent some time fiddling with my own manual approaches to 
 making this as fast, wasn't satisfied and so decided to try 
 using Steven's iopipe (https://github.com/schveiguy/iopipe) 
 instead. Results were excellent.

 https://gist.github.com/John-Colvin/980b11f2b7a7e23faf8dfb44bd9f1242

This version also has the advantage of being (discounting any 
bugs in iopipe) correct for arbitrary Unicode in all common UTF 
encodings.

May 26 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 5/26/17 11:20 AM, John Colvin wrote:
 On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:
 I spent some time fiddling with my own manual approaches to making
 this as fast, wasn't satisfied and so decided to try using Steven's
 iopipe (https://github.com/schveiguy/iopipe) instead. Results were
 excellent.

 https://gist.github.com/John-Colvin/980b11f2b7a7e23faf8dfb44bd9f1242

 This version also has the advantage of being (discounting any bugs in
 iopipe) correct for arbitrary unicode in all common UTF encodings.

I worked a lot on making sure this works properly. However, it's 
possible that there are some lingering issues.

I also did not spend much time optimizing these paths (whereas I spent a 
ton of time getting the utf8 line parsing as fast as it could be). 
Partly because finding things other than utf8 in the wild is rare, and 
partly because I have nothing to compare it with to know what is possible :)

-Steve

May 30 2017

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Tuesday, 30 May 2017 at 21:18:42 UTC, Steven Schveighoffer 
wrote:
 On 5/26/17 11:20 AM, John Colvin wrote:
 On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:
 [...]

 This version also has the advantage of being (discounting any 
 bugs in
 iopipe) correct for arbitrary unicode in all common UTF 
 encodings.

 I worked a lot on making sure this works properly. However, 
 it's possible that there are some lingering issues.

 I also did not spend much time optimizing these paths (whereas 
 I spent a ton of time getting the utf8 line parsing as fast as 
 it could be). Partly because finding things other than utf8 in 
 the wild is rare, and partly because I have nothing to compare 
 it with to know what is possible :)

 -Steve

If you want UCS-2 (aka UTF-16 without surrogates) data I can give 
you gigabytes of files in tmx format.

May 30 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 5/30/17 5:57 PM, Patrick Schluter wrote:
 On Tuesday, 30 May 2017 at 21:18:42 UTC, Steven Schveighoffer wrote:
 On 5/26/17 11:20 AM, John Colvin wrote:
 On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:
 [...]

 This version also has the advantage of being (discounting any bugs in
 iopipe) correct for arbitrary unicode in all common UTF encodings.

 I worked a lot on making sure this works properly. However, it's
 possible that there are some lingering issues.

 I also did not spend much time optimizing these paths (whereas I spent
 a ton of time getting the utf8 line parsing as fast as it could be).
 Partly because finding things other than utf8 in the wild is rare, and
 partly because I have nothing to compare it with to know what is
 possible :)

 If you want UCS-2 (aka UTF-16 without surrogates) data I can give you
 gigabytes of files in tmx format.

That data I can generate (and have generated) from UTF-8 data. I have tested my 
byLine parser to make sure it properly splits on "interesting" code 
points in all widths. UTF-16 data without surrogates should probably 
work fine. I haven't tuned it though like I tuned the UTF-8 version. Is 
there a memchr for wide characters? ;)

What I really haven't done is compared my line parsing code with 
multi-code-unit delimiters against one that can do the same thing. I 
know Phobos and C FILE * really can't do it. I haven't really looked at 
all in C++, so I should probably look there before giving up.

-Steve

May 30 2017

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Tuesday, 30 May 2017 at 22:31:50 UTC, Steven Schveighoffer 
wrote:
 On 5/30/17 5:57 PM, Patrick Schluter wrote:
 On Tuesday, 30 May 2017 at 21:18:42 UTC, Steven Schveighoffer 
 wrote:
 On 5/26/17 11:20 AM, John Colvin wrote:
 On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:
 [...]

 This version also has the advantage of being (discounting 
 any bugs in
 iopipe) correct for arbitrary unicode in all common UTF 
 encodings.

 I worked a lot on making sure this works properly. However, 
 it's
 possible that there are some lingering issues.

 I also did not spend much time optimizing these paths 
 (whereas I spent
 a ton of time getting the utf8 line parsing as fast as it 
 could be).
 Partly because finding things other than utf8 in the wild is 
 rare, and
 partly because I have nothing to compare it with to know what 
 is
 possible :)

 If you want UCS-2 (aka UTF-16 without surrogates) data I can 
 give you
 gigabytes of files in tmx format.

 I can generate that kind of data (and have) from UTF-8 data. I have 
 tested my byLine parser to make sure it properly splits on 
 "interesting" code points in all widths. UTF-16 data without 
 surrogates should probably work fine. I haven't tuned it though 
 like I tuned the UTF-8 version. Is there a memchr for wide 
 characters? ;)

 What I really haven't done is compared my line parsing code 
 with multi-code-unit delimiters against one that can do the 
 same thing. I know Phobos and C FILE * really can't do it. I 
 haven't really looked at all in C++, so I should probably look 
 there before giving up.

 -Steve

In any case, you can download the dataset from [1] if you like. 
There are several 100 Mb big zip files containing a collection of 
tmx files (translation memory exchange) with European 
Legislation. The files contain multi-alignment texts in up to 24 
languages. The files are encoded in UCS-2 little-endian. I know 
for a fact (because I compiled the data) that they don't contain 
characters outside of the BMP. The data is public and can be used 
freely (as in beer).
When I get some time, I will try to port the Java app that is 
distributed with it to D (partially done already).

[1]: 
https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
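
For anyone who wants to poke at such files without iopipe, here is a rough sketch of reading UCS-2 little-endian bytes in D. It relies on the point made above that the data contains nothing outside the BMP, so every 2-byte unit is a valid UTF-16 code unit. The function name decodeUtf16le is made up for this example, and loading the bytes (for instance with std.file.read) is left out.

```d
import std.conv : to;

/// Hypothetical helper: interprets raw bytes as UTF-16LE code units and
/// drops a leading byte order mark (U+FEFF) if there is one.
wstring decodeUtf16le(const(ubyte)[] raw)
{
    assert(raw.length % 2 == 0, "UTF-16 input must have an even byte count");
    auto units = new wchar[raw.length / 2];
    foreach (i, ref unit; units)
    {
        // Little-endian: low byte first, high byte second.
        unit = cast(wchar)(raw[2 * i] | (raw[2 * i + 1] << 8));
    }
    if (units.length > 0 && units[0] == '\uFEFF')
        units = units[1 .. $];
    return units.idup;
}

unittest
{
    // BOM (FF FE) followed by 'H', 'é' and a newline.
    immutable ubyte[] raw = [0xFF, 0xFE, 0x48, 0x00, 0xE9, 0x00, 0x0A, 0x00];
    assert(decodeUtf16le(raw) == "H\u00E9\n"w);

    // std.conv.to transcodes the result to UTF-8 when a string is wanted.
    assert(decodeUtf16le(raw).to!string == "H\u00E9\n");
}
```

This copies the whole buffer, which is fine for a quick look but not what one would do for gigabytes of input.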

May 30 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 5/31/17 1:09 AM, Patrick Schluter wrote:
 In any case, you can download the dataset from [1] if you like. There
 are several 100 Mb big zip files containing a collection of tmx files
 (translation memory exchange) with European Legislation. The files
 contain multi-alignment texts in up to 24 languages. The files are
 encoded in UCS-2 little-endian. I know for a fact (because I compiled
 the data) that they don't contain characters outside of the BMP. The
 data is public and can be used freely (as in beer).
 When I get some time, I will try to port the Java app that is
 distributed with it to D (partially done already).

 [1]:
 https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory

Thanks, I'll bookmark it for later use.

-Steve

May 31 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 5/26/17 10:41 AM, John Colvin wrote:
 On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
 Some of you may remember Jon Degenhardt's talk from one of the Silicon
 Valley D meetups, where he described the performance improvements he
 saw when he rewrote some of eBay's command line tools in D. He has now
 put the effort into crafting a blog post on the same topic, where he
 takes D version of a command-line tool written in Python and
 incrementally improves its performance.

 The blog:
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/

 I spent some time fiddling with my own manual approaches to making this
 as fast, wasn't satisfied and so decided to try using Steven's iopipe
 (https://github.com/schveiguy/iopipe) instead. Results were excellent.

 https://gist.github.com/John-Colvin/980b11f2b7a7e23faf8dfb44bd9f1242

nice! hm....

/** something vaguely like this should be in iopipe, users shouldn't
    need to write it */
auto ref runWithEncoding(alias process, FileT, Args...)(FileT file, auto ref Args args)

stealing for iopipe, thanks :) I'll need to dedicate another slide to you...

 On my machine:
 python takes a little over 20s, pypy wobbles around 3.5s, v1 from the
 blog takes about 3.9s, v4b took 1.45s, a version of my own that is
 hideous* manages 0.78s at best, the above version with iopipe hits below
 0.67s most runs.

 Not bad for a process that most people would call "IO-bound" (code for
 "I don't want to have to write fast code & it's all the disk's fault").

 Obviously this version is a bit more code than is ideal, iopipe is
 currently quite "barebones", but I don't see why with some clever
 abstractions and wrappers it couldn't be the default thing that one does
 even for small scripts.

The idea behind iopipe is to give you the building blocks to create 
exactly the pipeline you need, without a lot of effort. Once you have 
those blocks, then you make higher level functions out of it. Like you 
have above :)

BTW, there is a byLineRange function that handles slicing off the 
newline character inside iopipe.textpipe.

-Steve

May 30 2017

Joakim <dlang joakim.fea.st> writes:

On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
 Some of you may remember Jon Degenhardt's talk from one of the 
 Silicon Valley D meetups, where he described the performance 
 improvements he saw when he rewrote some of eBay's command line 
 tools in D. He has now put the effort into crafting a blog post 
 on the same topic, where he takes D version of a command-line 
 tool written in Python and incrementally improves its 
 performance.

 The blog:
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/

Heh, happened to notice that this blog post now has 21 comments, 
with people posting links to versions in Go, C++, and Kotlin up 
till this week, months after the post went up! :D

Aug 08 2017

bachmeier <no spam.net> writes:

On Tuesday, 8 August 2017 at 21:51:30 UTC, Joakim wrote:
 On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
 Some of you may remember Jon Degenhardt's talk from one of the 
 Silicon Valley D meetups, where he described the performance 
 improvements he saw when he rewrote some of eBay's command 
 line tools in D. He has now put the effort into crafting a 
 blog post on the same topic, where he takes D version of a 
 command-line tool written in Python and incrementally improves 
 its performance.

 The blog:
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/

 Heh, happened to notice that this blog post now has 21 
 comments, with people posting links to versions in Go, C++, and 
 Kotlin up till this week, months after the post went up! :D

There was also a Haskell version on Reddit.

Aug 08 2017
