digitalmars.D.announce - Faster Command Line Tools in D

Mike Parker <aldacron gmail.com> writes:
Some of you may remember Jon Degenhardt's talk from one of the 
Silicon Valley D meetups, where he described the performance 
improvements he saw when he rewrote some of eBay's command line 
tools in D. He has now put the effort into crafting a blog post 
on the same topic, where he takes a D version of a command-line 
tool written in Python and incrementally improves its performance.

The blog:
https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

Reddit:
https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/
May 24
cym13 <cpicard openmailbox.org> writes:
On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
 Some of you may remember Jon Degenhardt's talk from one of the 
 Silicon Valley D meetups, where he described the performance 
 improvements he saw when he rewrote some of eBay's command line 
 tools in D. He has now put the effort into crafting a blog post 
 on the same topic, where he takes D version of a command-line 
 tool written in Python and incrementally improves its 
 performance.

 The blog:
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/
A bit off topic, but I really like that we still get quality content such as this post on this blog. Sustained quality is a hard job, and I thank everyone involved for that.
May 24
Jon Degenhardt <jond noreply.com> writes:
On Wednesday, 24 May 2017 at 17:36:29 UTC, cym13 wrote:
 On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
 [...snip...]

 A bit off topic but I really like that we still get quality 
 content such as this post on this blog. Sustained quality is 
 hard job and I thank everyone involved for that.
The compliment to the community is well deserved; thank you for including this post in the company. In this case, the post benefited from some really excellent review feedback, and Mike made the publication side really easy.

--Jon
May 24
Walter Bright <newshound2 digitalmars.com> writes:
It's now #4 on the front page of Hacker News:

https://news.ycombinator.com/news
May 24
cym13 <cpicard openmailbox.org> writes:
On Wednesday, 24 May 2017 at 21:34:08 UTC, Walter Bright wrote:
 It's now #4 on the front page of Hacker News:

 https://news.ycombinator.com/news
The comments on HN are useless though: everybody went for the "D versus Python" thing and seems to complain that it's doing a D/Python benchmark while only talking about D optimization... even though optimizing D is the whole point of the article. In the same way, they rant against the fact that many iterations of the D script are shown, when the obvious purpose is to present different tricks while being clear about which trick gives what.

I am disappointed because there are so many good things to say about this, so many good questions or remarks to make when not familiar with the language, and yet all we get is "Meh, this benchmark shows nothing of D's speed against Python".
May 24
Jon Degenhardt <jond noreply.com> writes:
On Wednesday, 24 May 2017 at 21:46:10 UTC, cym13 wrote:
 On Wednesday, 24 May 2017 at 21:34:08 UTC, Walter Bright wrote:
 It's now #4 on the front page of Hacker News:

 https://news.ycombinator.com/news
 The comments on HN are useless though, everybody went for the 
 "D versus Python" thing and seem to complain that it's doing a 
 D/Python benchmark while only talking about D optimization...even 
 though optimizing D is the whole point of the article. In the same 
 way they rant against the fact that many iterations on the D 
 script are shown while it is obviously to give different tricks 
 while being clear on what trick gives what.

 I am disappointed because there are so many good things to say 
 about this, so many good questions or remarks to make when not 
 familiar with the language, and yet all we get is "Meh, this 
 benchmark shows nothing of D's speed against Python".
It's not easy writing an article that doesn't draw some form of criticism. FWIW, the reason I gave a Python example is that Python is very commonly used for this type of problem and the language is well suited to it. A second reason is that I've seen several posts where someone has tried to rewrite a Python program like this in D, started with `split`, and wondered how to make it faster. My hope is that this will clarify how to achieve this.

Another goal of the article was to describe how performance in the TSV Utilities had been achieved. The article is not about the TSV Utilities, but discussing both the benchmark results and how they had been achieved would have made for a very long article.

--Jon
May 24
Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2017 3:56 PM, Jon Degenhardt wrote:
 Its not easy writing an article that doesn't draw some form of criticism.
FWIW, 
 the reason I gave a Python example is because it is very commonly used for
this 
 type of problem and the language is well suited to it. A second reason is that 
 I've seen several posts where someone has tried to rewrite a Python program
like 
 this in D, start with `split`, and wonder how to make it faster. My hope is
that 
 this will clarify how to achieve this.
 
 Another goal of the article was to describe how performance in the TSV
Utilities 
 had been achieved. The article is not about the TSV Utilities, but discussing 
 both the benchmark results and how they had been achieved would be a very long 
 article.
Any time one writes an article comparing speed between languages X and Y, someone gets their ox gored and will bitterly complain about how unfair the article is (though I noticed that none of the complainers wrote a faster Python version). Even if you try to optimize the Python program, you'll inevitably be accused of deliberately not doing it right.

The nadir of this for me was when I compared Digital Mars C++ code with DMD. Both share the same optimizer and back end, yet I was accused of "sabotaging" my own C++ compiler in order to make D look better!! Me, I just don't do public comparison benchmarking anymore. It's a waste of time arguing with people about it.

I thought you wrote a fine article, and the criticism about the Python code was unwarranted (especially since nobody suggested better code), because the article was about optimizing D code, not optimizing Python.
May 24
Jon Degenhardt <jond noreply.com> writes:
On Thursday, 25 May 2017 at 05:17:29 UTC, Walter Bright wrote:
 Any time one writes an article comparing speed between 
 languages X and Y, someone gets their ox gored and will 
 bitterly complain about how unfair the article is (though I 
 noticed that none of the complainers wrote a faster Python 
 version). Even if you tried to optimize the Python program, 
 you'll be inevitably accused of deliberately not doing it right.

 The nadir of this for me was when I compared Digital Mars C++ 
 code with DMD. Both share the same optimizer and back end, yet 
 I was accused of "sabotaging" my own C++ compiler in order to 
 make D look better !! Me, I just don't do public comparison 
 benchmarking anymore. It's a waste of time arguing with people 
 about it.

 I thought you wrote a fine article, and the criticism about the 
 Python code was unwarranted (especially since nobody suggested 
 better code), because the article was about optimizing D code, 
 not optimizing Python.
Thanks Walter, I appreciate your comments. And correct, as multiple people noted, a speed comparison with other languages was not at all a goal of the article.

The real intent was to tell a story of how several of D's features play together to enable optimizations like this, without having to write low-level code or step outside the core language features and standard library.

--Jon
May 24
Wulfklaue <wulfklaue wulfklaue.com> writes:
On Thursday, 25 May 2017 at 06:22:28 UTC, Jon Degenhardt wrote:
 Thanks Walter, I appreciate your comments. And correct, as 
 multiple people noted, a speed comparison with other languages 
 not at all a goal of the article.

 The real intent was to tell a story of how several of D's 
 features play together to enable optimizations like this, 
 without having to write low-level code or step outside the core 
 language features and standard library.
Maybe as a more casual observer, the article did feel more like Python vs D to me. I have not yet read the ycombinator comments; this is just my personal observation after reading the article. My thinking was:

- Python's PyPy is surprisingly fast.
- Surprised that D was slower in version 1.
- Kind of surprised again that it took so many versions to figure out the best approach.
- Also wondering why one needed std.algorithm splitter, when you would expect string split to be the fastest. Even the fact that you need to import std.array to split a string simply felt strange.
- So much effort for relatively little gain (after the v2 splitter). The time spent on finding a faster solution is, in a business sense, not worth it. But not finding a faster way is simply wasting performance, just on this simple function.
- Started to wonder if Python's PyPy is so optimized that, without any effort, you're even faster than D. What other idiomatic D functions are slow?

I am not criticizing your article, Jon, just mentioning how I felt when reading it yesterday. It felt like the solution was overly complex to find and required too much deep D knowledge.

Going to read the ycombinator comments now.

Off-topic:

Yesterday I was struggling with split, but for a whole different reason. Take into account that I am new to D.

Needed to split a string. Simple, right? Search Google for "split string dlang". Land on the https://dlang.org/phobos/std_string.html page.

After seeing splitLines, I started experimenting with it. Half an hour later I realized that I was using the wrong function and needed to import the std.array split function.

Call it an issue with the documentation or my own stupidity. But the fact that split was only listed as an imported function, in this mass of text, totally sent me in the wrong direction.

As stated above, I expected split to be part of std.string, because I am manipulating a string; I did not expect to need to import std.array, which is where I ended up.

I simply find the documentation confusing, with its wall of text. When I search for string split, I expect to arrive on the string.split page. Not only that, the split examples use split as a separate keyword, when I was looking for variable.split().

Veteran D programmers are probably going to laugh at me for this, but one does feel a bit salty after that.
May 25
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/25/17 6:27 AM, Wulfklaue wrote:

 - Also wondering why one needed std.algorithm splitter, when you expect
 string split to be the fasted. Even the fact that you need to import
 std.array to split a string simply felt  strange.
Because split allocates on every call. The key, in many cases in D, to increasing performance is avoiding allocations. Has been that way for as long as I can remember. Another possibility to "fix" this problem is to simply use an allocator with split that allocates on some predefined stack space. This is very similar to what v3 does with the Appender. Unfortunately, allocator is still experimental, and so split doesn't support using it.
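To make the allocation point concrete, here is a minimal sketch (not code from the article) that picks one field out of a tab-separated line both ways. The helper names `fieldViaSplit` and `fieldViaSplitter` are made up for illustration; `split` is the std.array function and `splitter` the std.algorithm one.

```d
import std.algorithm.iteration : splitter;
import std.array : split;

/// Eager: std.array.split builds a new array of slices on every call.
string fieldViaSplit(string line, size_t index)
{
    auto fields = line.split('\t');   // allocates the array of fields
    return index < fields.length ? fields[index] : null;
}

/// Lazy: splitter hands out one slice at a time; no array is built.
string fieldViaSplitter(string line, size_t index)
{
    size_t i = 0;
    foreach (field; line.splitter('\t'))
    {
        if (i == index)
            return field;
        ++i;
    }
    return null;                      // fewer fields than requested
}

unittest
{
    assert(fieldViaSplit("a\tb\tc", 1) == "b");
    assert(fieldViaSplitter("a\tb\tc", 1) == "b");
    assert(fieldViaSplitter("a\tb", 5) is null);
}
```

Both return the same slice of the input; the difference is only that the first version allocates an array per line, which adds up when the call sits in a loop over millions of lines.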
 - So much effort for relative little gain ( after v2 splitter ). The
 time spend on finding a faster solution is in business sense not worth
 it. But not finding a faster way is simply wasting performance, just on
 this simple function.
The answer is always "it depends". If you're processing hundreds of these files in tight loops, it probably makes sense to optimize the code. If not, then it may make sense to focus efforts elsewhere. The point of the article is, this is how to do it if you need performance there.
 - Started to wonder if Python its PyPy is so optimized that without any
 effort, your even faster then D. What other D idiomatic functions are slow?
split didn't actually seem that slow. I'll note that you could opt for just the AA optimization (the converting char[] to string only when storing a new hash lookup is big, and not that cumbersome) and leave the code for split alone, and you probably still could beat the Python code.
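The AA optimization mentioned here can be sketched roughly as follows. This is an illustration under assumptions, not the article's code: the function name `addTo` and the `long[string]` table are hypothetical. The idea is that the lookup works directly on a transient `char[]` slice, and an immutable copy is made only the first time a key is seen.

```d
/// Adds `value` to the running total stored under `key`.
void addTo(ref long[string] sums, const(char)[] key, long value)
{
    if (auto total = key in sums)
        *total += value;          // existing key: no allocation needed
    else
        sums[key.idup] = value;   // new key: make one immutable copy
}

unittest
{
    long[string] sums;
    char[] buffer = "apple".dup;  // stands in for a reused line buffer
    addTo(sums, buffer, 3);
    buffer[] = "melon";           // the buffer is overwritten by the "next line"
    addTo(sums, buffer, 4);
    addTo(sums, "apple", 5);
    assert(sums == ["apple": 8L, "melon": 4L]);
}
```

Because the stored key is a copy, overwriting the buffer afterwards does not corrupt the table, and the number of allocations is bounded by the number of distinct keys rather than the number of lines.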
 Off-topic:

 Yesterday i was struggling with split but for a whole different reason.
 Take in account that i am new at D.

 Needed to split a string. Simple right? Search Google for "split string
 dlang". Get on the https://dlang.org/phobos/std_string.html page.

 After seeing the splitLines and start experimenting with it. Half a hour
 later i realize that the wrong function was used and needed to import
 std.array split function.

 Call it a issue with the documentation or my own stupidity. But the fact
 that Split was only listed as a imported function, in this mass of text,
 totally send me on the wrong direction.

 As stated above, i expected split to be part of the std.string, because
 i am manipulating a string, not that i needed to import std.array what
 is the end result.
std.string, std.array, and std.algorithm all have cross-pollination when it comes to array operations. It has to do with the history of when the modules were introduced.
 I simply find the documentation confusing with the wall of text. When i
 search for string split, you expect to arrive on the string.split page.
 Not only that, the split example are using split as a separate keyword,
 when i was looking for variable.split().
There is a search field on the top, which helps to narrow down what choices are available.
 Veteran D programmers are probably going to laughing at me for this but
 one does feel a bit salty after that.
I understand your pain. I work with Swift often, and sometimes it's very frustrating trying to find the right tool for the job, as I'm not thoroughly immersed in Apple's SDK on a day-to-day basis. I don't know that any programming language gets this perfect. -Steve
May 25
Suliman <evermind live.ru> writes:
 std.string, std.array, and std.algorithm all have 
 cross-polination when it comes to array operations. It has to 
 do with the history of when the modules were introduced.
Is there any plan to deprecate all the splitters and make a single one? Because, as I understand it, we now have 4 functions that do the same task.
May 25
Jonathan M Davis via Digitalmars-d-announce writes:
On Thursday, May 25, 2017 14:17:27 Suliman via Digitalmars-d-announce wrote:
 std.string, std.array, and std.algorithm all have
 cross-polination when it comes to array operations. It has to
 do with the history of when the modules were introduced.
 Is there any plan to deprecate all splitters and make one single.
 Because now as I understand we have 4 functions that make same task.
I wouldn't expect any of the split-related functions to be going away. We often have a function that operates on arrays or strings and another which operates on more general ranges. It may mainly be for historical reasons, but removing the array-based functions would break existing code, and we'd get a whole other set of complaints about folks not understanding that you need to slap array() on the end of a call to splitter to get the split that they were looking for (especially if they're coming from another language and don't understand ranges yet). And ultimately, the array-based functions continue to serve as a way to have simpler code when you don't care about (or you actually need) the additional memory allocations.

Also, splitLines/lineSplitter can't actually be written in terms of split/splitter, because split/splitter does not have a way to provide multiple delimiters (let alone multiple delimiters where one includes the other, which is what you get with "\n" and "\r\n"). So, that distinction isn't going away. It's also a common enough operation that having a function for it rather than having to pass all of the delimiters to a more general function is arguably worth it, just like having the overload of split/splitter which takes no delimiter and then splits on whitespace is arguably worth it over having a more general function where you have to feed it every variation of whitespace.

- Jonathan M Davis
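The relationships described above can be checked with a few assertions. This is only a small sketch using the Phobos functions named in the post; the sample strings are made up.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : splitter;
import std.array : array, split;
import std.string : lineSplitter, splitLines;

unittest
{
    // splitter is lazy; adding .array gives what split returns eagerly.
    assert("a,b,c".splitter(',').array == "a,b,c".split(','));
    assert("a,b,c".split(',') == ["a", "b", "c"]);

    // With no delimiter, split breaks on runs of whitespace.
    assert("one  two\tthree".split == ["one", "two", "three"]);

    // A single delimiter cannot cover both "\n" and "\r\n" ...
    assert("x\r\ny\nz".split('\n') == ["x\r", "y", "z"]);

    // ... which is what splitLines / lineSplitter are for.
    assert("x\r\ny\nz".splitLines == ["x", "y", "z"]);
    assert("x\r\ny\nz".lineSplitter.equal(["x", "y", "z"]));
}
```

Compiling with `dmd -unittest -main` runs the block.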
May 25
cym13 <cpicard openmailbox.org> writes:
On Thursday, 25 May 2017 at 16:19:16 UTC, Jonathan M Davis wrote:
 I wouldn't expect any of the split-related functions to be 
 going away. We often have a function that operates on arrays or 
 strings and another which operates on more general ranges. It 
 may mainly be for historical reasons, but removing the 
 array-based functions would break existing code, and we'd get a 
 whole other set of complaints about folks not understanding 
 that you need to slap array() on the end of a call to splitter 
 to get the split that they were looking for (especially if 
 they're coming from another language and don't understand 
 ranges yet). And ultimately, the array-based functions continue 
 to serve as a way to have simpler code when you don't care 
 about (or you actually need) the additional memory allocations.
I don't know if people coming from other languages would really mind. Of course it would have to be taught once, as everything has to be, but many languages (and I have Python especially in mind) have been lazifying their standard libraries for years now. I think consistency is what brings fewer questions, not diversity where one of the possibilities corresponds to what the programmer wants. He'll ask for the difference anyway.
 Also, splitLines/lineSplitter can't actually be written in 
 terms of split/splitter, because split/splitter does not have a 
 way to provide multiple delimeters (let alone multiple 
 delimeters where one includes the other, which is what you get 
 with "\n" and "\r\n"). So, that distinction isn't going away. 
 It's also a common enough operation that having a function for 
 it rather than having to pass all of the delimeters to a more 
 general function is arguably worth it, just like having the 
 overload of split/splitter which takes no delimiter and then 
 splits on whitespace is arguably worth it over having a more 
 general function where you have to feed it every variation of 
 whitespace.

 - Jonathan M Davis
May 28
Jonathan M Davis via Digitalmars-d-announce writes:
On Thursday, May 25, 2017 08:46:17 Steven Schveighoffer via Digitalmars-d-
announce wrote:
 std.string, std.array, and std.algorithm all have cross-polination when
 it comes to array operations. It has to do with the history of when the
 modules were introduced.
Not only that, but over time, there has been a push to generalize functions. So, something that might have originally gotten put in std.string (because you'd normally think of it as a string function) got moved to std.array, because it could easily be generalized to work on arrays in general and not just string operations (I believe that split is an example of this). And something which was in std.array or std.string might have been generalized for ranges in general, in which case, we ended up with a new function in std.algorithm (hence, we have splitter in std.algorithm but split in std.array).

The end result tends to make sense if you understand that functions that only operate on strings go in std.string, functions that operate on dynamic arrays in general (but not ranges) go in std.array, and functions which could have gone in std.string or std.array except that they operate on ranges in general go in std.algorithm. But if you don't understand that, it tends to be quite confusing, and even if you do, it's often the case that when you want to find a function to operate on a string, you're going to need to look in std.string, std.array, and std.algorithm.

So, in part, it's an evolution thing, and in part, it's often just plain hard to find stuff when you're focused on a specific use case, and the library writer is focused on making the function that you need as general as possible.

- Jonathan M Davis
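As a rough illustration of that layout, the sketch below imports one function from each of the three modules and shows that the free function `split` can also be called as `variable.split(...)` through UFCS. The sample data is made up.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : splitter; // lazy, works on ranges
import std.array : split;                  // eager, works on arrays in general
import std.string : splitLines;            // string-specific

unittest
{
    string s = "k1=v1;k2=v2";

    // UFCS: the free function split can be called with method syntax.
    assert(s.split(';') == split(s, ';'));
    assert(s.split(';') == ["k1=v1", "k2=v2"]);

    // std.array.split is not limited to strings.
    assert([1, 2, 0, 3, 0, 4].split(0) == [[1, 2], [3], [4]]);

    // std.algorithm's splitter yields the same pieces lazily.
    assert([1, 2, 0, 3, 0, 4].splitter(0).equal([[1, 2], [3], [4]]));

    // splitLines lives in std.string: it only makes sense for text.
    assert("a\nb".splitLines == ["a", "b"]);
}
```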
May 25
Jack Stouffer <jack jackstouffer.com> writes:
On Wednesday, 24 May 2017 at 21:46:10 UTC, cym13 wrote:
 I am disappointed because there are so many good things to say 
 about this, so many good questions or remarks to make when not 
 familiar with the language, and yet all we get is "Meh, this 
 benchmark shows nothing of D's speed against Python".
Wouldn't be the first time https://news.ycombinator.com/item?id=10828450
May 24
xtreak <tir.karthi gmail.com> writes:
On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
 Some of you may remember Jon Degenhardt's talk from one of the 
 Silicon Valley D meetups, where he described the performance 
 improvements he saw when he rewrote some of eBay's command line 
 tools in D. He has now put the effort into crafting a blog post 
 on the same topic, where he takes D version of a command-line 
 tool written in Python and incrementally improves its 
 performance.

 The blog:
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/
There are repeated references to the use of D at Netflix for machine learning. It would be a very helpful boost if someone came up with a reference or a post on how D is used at Netflix, and adding Netflix to https://dlang.org/orgs-using-d.html would be amazing.

References:
https://news.ycombinator.com/item?id=14064012
https://news.ycombinator.com/item?id=14413546
May 25
"Nick Sabalausky (Abscissa)" <SeeWebsiteToContactMe semitwist.com> writes:
On 05/25/2017 08:30 AM, xtreak wrote:
 
 There are repeated references over usage of D at Netflix for machine 
 learning. It will be a very helpful boost if someone comes up with any 
 reference or a post regarding how D is used at Netflix and addition of 
 Netflix to https://dlang.org/orgs-using-d.html will be amazing.
 
I've used netflix. If its "suggestion" features are any indication, I'm not sure such a thing would be a feather in D's cap ;)
May 26
Ali Çehreli <acehreli yahoo.com> writes:
On 05/24/2017 06:39 AM, Mike Parker wrote:

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/
Inspired Nim version, found on Reddit:
https://www.reddit.com/r/programming/comments/6dct6e/faster_command_line_tools_in_nim/

Ali
May 25
Basile B. <b2.temp gmx.com> writes:
On Thursday, 25 May 2017 at 22:04:36 UTC, Ali Çehreli wrote:
 On 05/24/2017 06:39 AM, Mike Parker wrote:

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/
 Inspired Nim version, found on Reddit:
 https://www.reddit.com/r/programming/comments/6dct6e/faster_command_line_tools_in_nim/

 Ali
Wow, the D blog post opened Pandora's box.
May 25
bachmeier <no spam.net> writes:
On Friday, 26 May 2017 at 06:05:11 UTC, Basile B. wrote:
 On Thursday, 25 May 2017 at 22:04:36 UTC, Ali Çehreli wrote:
 On 05/24/2017 06:39 AM, Mike Parker wrote:

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/
 Inspired Nim version, found on Reddit:
 https://www.reddit.com/r/programming/comments/6dct6e/faster_command_line_tools_in_nim/

 Ali

 Wow, the D blog post opened Pandora's box.
I guess programmers will do comparisons of language speed independent of whether it makes sense for that problem.
May 26
John Colvin <john.loughran.colvin gmail.com> writes:
On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
 Some of you may remember Jon Degenhardt's talk from one of the 
 Silicon Valley D meetups, where he described the performance 
 improvements he saw when he rewrote some of eBay's command line 
 tools in D. He has now put the effort into crafting a blog post 
 on the same topic, where he takes D version of a command-line 
 tool written in Python and incrementally improves its 
 performance.

 The blog:
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/
I spent some time fiddling with my own manual approaches to making this as fast, wasn't satisfied and so decided to try using Steven's iopipe (https://github.com/schveiguy/iopipe) instead. Results were excellent.

https://gist.github.com/John-Colvin/980b11f2b7a7e23faf8dfb44bd9f1242

On my machine:
python takes a little over 20s, pypy wobbles around 3.5s, v1 from the blog takes about 3.9s, v4b took 1.45s, a version of my own that is hideous* manages 0.78s at best, the above version with iopipe hits below 0.67s most runs.

Not bad for a process that most people would call "IO-bound" (code for "I don't want to have to write fast code & it's all the disk's fault").

Obviously this version is a bit more code than is ideal, iopipe is currently quite "barebones", but I don't see why with some clever abstractions and wrappers it couldn't be the default thing that one does even for small scripts.

*using byChunk and manually managing linesplits over chunks, very nasty.
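For readers wondering what "manually managing linesplits over chunks" involves, here is a minimal sketch. It is not the code from the post or the gist: the name `eachLine` and the carry-buffer approach are assumptions made for illustration, it splits on "\n" only (no "\r\n" handling), and it makes no attempt at the speed discussed above.

```d
import std.string : indexOf;

/// Calls `onLine` once per '\n'-separated line found across `chunks`.
/// A line that straddles a chunk boundary is stitched together in `carry`.
/// The slice passed to `onLine` is only valid during the call.
void eachLine(Chunks)(Chunks chunks, scope void delegate(const(char)[]) onLine)
{
    char[] carry;                          // unfinished tail of earlier chunks
    foreach (chunk; chunks)
    {
        auto data = cast(const(char)[]) chunk;
        while (true)
        {
            immutable nl = data.indexOf('\n');
            if (nl < 0)
                break;                     // no complete line left in this chunk
            if (carry.length)
            {
                carry ~= data[0 .. nl];    // finish the line started earlier
                onLine(carry);
                carry.length = 0;
            }
            else
                onLine(data[0 .. nl]);     // whole line inside the chunk
            data = data[nl + 1 .. $];
        }
        carry ~= data;                     // copy the tail; the chunk buffer may be reused
    }
    if (carry.length)
        onLine(carry);                     // last line without a trailing newline
}

unittest
{
    string[] lines;
    // Chunk boundaries fall mid-line on purpose.
    auto chunks = [cast(ubyte[]) "ab\ncd".dup, cast(ubyte[]) "e\nf".dup];
    eachLine(chunks, (const(char)[] line) { lines ~= line.idup; });
    assert(lines == ["ab", "cde", "f"]);
}
```

With a real file, the chunks would come from something like `File("data.tsv").byChunk(64 * 1024)`, where the file name and chunk size are placeholders.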
May 26
John Colvin <john.loughran.colvin gmail.com> writes:
On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:
 I spent some time fiddling with my own manual approaches to 
 making this as fast, wasn't satisfied and so decided to try 
 using Steven's iopipe (https://github.com/schveiguy/iopipe) 
 instead. Results were excellent.

 https://gist.github.com/John-Colvin/980b11f2b7a7e23faf8dfb44bd9f1242
This version also has the advantage of being (discounting any bugs in iopipe) correct for arbitrary unicode in all common UTF encodings.
May 26
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/26/17 11:20 AM, John Colvin wrote:
 On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:
 I spent some time fiddling with my own manual approaches to making
 this as fast, wasn't satisfied and so decided to try using Steven's
 iopipe (https://github.com/schveiguy/iopipe) instead. Results were
 excellent.

 https://gist.github.com/John-Colvin/980b11f2b7a7e23faf8dfb44bd9f1242
 This version also has the advantage of being (discounting any bugs in
 iopipe) correct for arbitrary unicode in all common UTF encodings.
I worked a lot on making sure this works properly. However, it's possible that there are some lingering issues. I also did not spend much time optimizing these paths (whereas I spent a ton of time getting the utf8 line parsing as fast as it could be). Partly because finding things other than utf8 in the wild is rare, and partly because I have nothing to compare it with to know what is possible :) -Steve
May 30
Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Tuesday, 30 May 2017 at 21:18:42 UTC, Steven Schveighoffer 
wrote:
 On 5/26/17 11:20 AM, John Colvin wrote:
 On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:
 [...]
 I worked a lot on making sure this works properly. However, it's 
 possible that there are some lingering issues. I also did not 
 spend much time optimizing these paths (whereas I spent a ton of 
 time getting the utf8 line parsing as fast as it could be). 
 Partly because finding things other than utf8 in the wild is 
 rare, and partly because I have nothing to compare it with to 
 know what is possible :)

 -Steve
If you want UCS-2 (aka UTF-16 without surrogates) data I can give you gigabytes of files in tmx format.
May 30
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/30/17 5:57 PM, Patrick Schluter wrote:
 On Tuesday, 30 May 2017 at 21:18:42 UTC, Steven Schveighoffer wrote:
 On 5/26/17 11:20 AM, John Colvin wrote:
 On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:
 [...]
 If you want UCS-2 (aka UTF-16 without surrogates) data I can give you
 gigabytes of files in tmx format.
The data I can (and have) generated from UTF-8 data. I have tested my byLine parser to make sure it properly splits on "interesting" code points in all widths. UTF-16 data without surrogates should probably work fine. I haven't tuned it though like I tuned the UTF-8 version. Is there a memchr for wide characters? ;)

What I really haven't done is compared my line parsing code with multi-code-unit delimiters against one that can do the same thing. I know Phobos and C FILE * really can't do it. I haven't really looked at all in C++, so I should probably look there before giving up.

-Steve
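On the memchr question: C does have `wmemchr`, but it works on `wchar_t`, whose width differs between platforms, so it is not a portable fit for 16-bit code units. As a plain, untuned sketch of the search itself, the hypothetical helper below scans the raw UTF-16 code units for a newline. A code-unit search is safe here because U+000A can never appear as half of a surrogate pair.

```d
import std.algorithm.searching : countUntil;
import std.string : representation;

/// Index, in code units, of the first '\n' in UTF-16 text, or -1 if absent.
ptrdiff_t indexOfNewline(const(wchar)[] text)
{
    // representation() exposes the raw ushort code units, so the search
    // compares 16-bit values directly instead of decoding code points.
    return text.representation.countUntil(cast(ushort) '\n');
}

unittest
{
    assert(indexOfNewline("ab\ncd"w) == 2);
    assert(indexOfNewline("abc"w) == -1);
    // A character outside the BMP takes two code units.
    assert(indexOfNewline("\U0001F600\n"w) == 2);
}
```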
May 30
Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Tuesday, 30 May 2017 at 22:31:50 UTC, Steven Schveighoffer 
wrote:
 On 5/30/17 5:57 PM, Patrick Schluter wrote:
 On Tuesday, 30 May 2017 at 21:18:42 UTC, Steven Schveighoffer 
 wrote:
 On 5/26/17 11:20 AM, John Colvin wrote:
 On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:
 [...]
 The data I can (and have) generated from UTF-8 data. I have 
 tested my byLine parser to make sure it properly splits on 
 "interesting" code points in all widths. UTF-16 data without 
 surrogates should probably work fine. I haven't tuned it though 
 like I tuned the UTF-8 version. Is there a memchr for wide 
 characters? ;)

 What I really haven't done is compared my line parsing code with 
 multi-code-unit delimiters against one that can do the same 
 thing. I know Phobos and C FILE * really can't do it. I haven't 
 really looked at all in C++, so I should probably look there 
 before giving up.

 -Steve
In any case, you can download the dataset from [1] if you like. There are several 100 MB zip files containing a collection of tmx files (translation memory exchange) with European legislation. The files contain multi-alignment texts in up to 24 languages. The files are encoded in UCS-2 little-endian. I know for a fact (because I compiled the data) that they don't contain characters outside of the BMP. The data is public and can be used freely (as in beer).

When I get some time, I will try to port the Java app that is distributed with it to D (partially done already).

[1]: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
May 30
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/31/17 1:09 AM, Patrick Schluter wrote:
 In any case, you can download the dataset from [1] if you like. There
 are several 100 Mb big zip files containing a collection of tmx files
 (translation memory exchange) with European Legislation. The files
 contain multi-alignment texts in up to 24 languages. The files are
 encoded in UCS-2 little-endian. I know for a fact (because I compiled
 the data) that they don't contain characters outside of the BMP. The
 data is public and can be used freely (as in beer).
 When I get some time, I will try to port the java app that is
 distributed with it to D (partially done yet).

 [1]:
 https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
Thanks, I'll bookmark it for later use. -Steve
May 31
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/26/17 10:41 AM, John Colvin wrote:
 On Wednesday, 24 May 2017 at 13:39:57 UTC, Mike Parker wrote:
 Some of you may remember Jon Degenhardt's talk from one of the Silicon
 Valley D meetups, where he described the performance improvements he
 saw when he rewrote some of eBay's command line tools in D. He has now
 put the effort into crafting a blog post on the same topic, where he
 takes D version of a command-line tool written in Python and
 incrementally improves its performance.

 The blog:
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

 Reddit:
 https://www.reddit.com/r/programming/comments/6d25mg/faster_command_line_tools_in_d/
 I spent some time fiddling with my own manual approaches to making
 this as fast, wasn't satisfied and so decided to try using Steven's
 iopipe (https://github.com/schveiguy/iopipe) instead. Results were
 excellent.

 https://gist.github.com/John-Colvin/980b11f2b7a7e23faf8dfb44bd9f1242
nice!

hm....

/** something vaguely like this should be in iopipe, users shouldn't need to write it */
auto ref runWithEncoding(alias process, FileT, Args...)(FileT file, auto ref Args args)

stealing for iopipe, thanks :) I'll need to dedicate another slide to you...
 On my machine:
 python takes a little over 20s, pypy wobbles around 3.5s, v1 from the
 blog takes about 3.9s, v4b took 1.45s, a version of my own that is
 hideous* manages 0.78s at best, the above version with iopipe hits below
 0.67s most runs.

 Not bad for a process that most people would call "IO-bound" (code for
 "I don't want to have to write fast code & it's all the disk's fault").

 Obviously this version is a bit more code than is ideal, iopipe is
 currently quite "barebones", but I don't see why with some clever
 abstractions and wrappers it couldn't be the default thing that one does
 even for small scripts.
The idea behind iopipe is to give you the building blocks to create exactly the pipeline you need, without a lot of effort. Once you have those blocks, then you make higher level functions out of it. Like you have above :)

BTW, there is a byLineRange function that handles slicing off the newline character inside iopipe.textpipe.

-Steve
May 30