
digitalmars.D.announce - iopipe v0.0.4 - RingBuffers!

Steven Schveighoffer <schveiguy yahoo.com> writes:
OK, so at dconf I spoke with a few very smart guys about how I can use 
mmap to make a zero-copy buffer. And I implemented this on the plane 
ride home.

However, I am struggling to find a use case for this that showcases why 
you would want to use it. While it does work, and works beautifully, it 
doesn't show any measurable difference vs. the array allocated buffer 
that copies data when it needs to extend.

If anyone has any good use cases for it, I'm open to suggestions. 
Something that is going to potentially increase performance is an 
application that needs to keep the buffer mostly full when extending 
(i.e. something like 75% full or more).

The buffer is selected by using `rbufd` instead of just `bufd`. 
Everything should be a drop-in replacement except for that.

Note: I have ONLY tested on macOS, so if you find bugs on other OSes, let 
me know. This is still a POSIX-only library for now, but more on that 
later...

As a test for Ring buffers, I implemented a simple "grep-like" search 
program that doesn't use regex, but phobos' canFind to look for lines 
that match. It also prints some lines of context, configurable on the 
command line. The lines of context I thought would show better 
performance with the RingBuffer than the standard buffer since it has to 
keep a bunch of lines in the buffer. But alas, it's roughly the same, 
even with a large number of lines for context (like 200).

However, this example *does* show the power of iopipe -- it handles all 
flavors of unicode with one template function, is quite straightforward 
(though I want to abstract the line tracking code, that stuff is really 
tricky to get right). Oh, and it's roughly 10x faster than grep, and a 
bunch faster than fgrep, at least on my machine ;) I'm tempted to add 
regex processing to see if it still beats grep.

Next up (when my bug fix for dmd is merged, see 
https://issues.dlang.org/show_bug.cgi?id=17968) I will be migrating 
iopipe to depend on https://github.com/MartinNowak/io, which should 
unlock Windows support (and I will add RingBuffer Windows support at 
that point).

Enjoy!

https://github.com/schveiguy/iopipe
https://code.dlang.org/packages/iopipe
http://schveiguy.github.io/iopipe/

-Steve
May 10
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer 
wrote:
 OK, so at dconf I spoke with a few very smart guys about how I 
 can use mmap to make a zero-copy buffer. And I implemented this 
 on the plane ride home.

 However, I am struggling to find a use case for this that 
 showcases why you would want to use it. While it does work, and 
 works beautifully, it doesn't show any measurable difference 
 vs. the array allocated buffer that copies data when it needs 
 to extend.
I’d start with something clinically synthetic. Say your record size is exactly half of the buffer + 1 byte. If you were to extend the size of the buffer, it would amortize.

Basically:
16 MB fixed buffer
vs
16 MB mmap-ed ring

where you read pieces in 8M+1 blocks. Yes, we are aiming to blow the CPU cache there. Otherwise the CPU cache is so fast that an occasional copy is zilch; once we hit primary memory, it’s not. Adjust sizes for your CPU.

The amount of work done per byte, though, has to be minimal to actually see anything.
 
 in the buffer. But alas, it's roughly the same, even with large 
 number of lines for context (like 200).

 However, this example *does* show the power of iopipe -- it 
 handles all flavors of unicode with one template function, is 
 quite straightforward (though I want to abstract the line 
 tracking code, that stuff is really tricky to get right). Oh, 
 and it's roughly 10x faster than grep, and a bunch faster than 
 fgrep, at least on my machine ;) I'm tempted to add regex 
 processing to see if it still beats grep.
Should be mostly trivial, in fact. Our first designs for iopipe are where I wanted regex to work with it. Basically: if we have started a match, extend the window until we get it or lose it, then release up to the next point of a potential start.
 Next up (when my bug fix for dmd is merged, see 
 https://issues.dlang.org/show_bug.cgi?id=17968) I will be 
 migrating iopipe to depend on 
 https://github.com/MartinNowak/io, which should unlock Windows 
 support (and I will add RingBuffer Windows support at that 
 point).

 Enjoy!

 https://github.com/schveiguy/iopipe
 https://code.dlang.org/packages/iopipe
 http://schveiguy.github.io/iopipe/

 -Steve
May 10
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/11/18 1:30 AM, Dmitry Olshansky wrote:
 On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
 OK, so at dconf I spoke with a few very smart guys about how I can use 
 mmap to make a zero-copy buffer. And I implemented this on the plane 
 ride home.

 However, I am struggling to find a use case for this that showcases 
 why you would want to use it. While it does work, and works 
 beautifully, it doesn't show any measurable difference vs. the array 
 allocated buffer that copies data when it needs to extend.
 I’d start with something clinically synthetic. Say your record size is 
 exactly half of buffer + 1 byte. If you were to extend the size of 
 buffer, it would amortize.
Hm... this wouldn't work, because the idea is to keep some of the buffer full. What will happen here is that the buffer will extend to be able to accommodate the extra byte, and then you are back to having less of the buffer full at once. Iopipe is not afraid to increase the buffer :)
 
 Basically:
 16 Mb buffer fixed
 vs
 16 Mb mmap-ed ring
 
 Where you read pieces in 8M+1 blocks. Yes, we are aiming to blow the CPU 
 cache there. Otherwise CPU cache is so fast that occasional copy is 
 zilch, once we hit primary memory it’s not. Adjust sizes for your CPU.
This isn't how it will work. The system looks at the buffer and says "oh, I can just read 8MB - 1 byte," which gives you 2 bytes less than you need. Then you need the extra 2 bytes, so it will increase the buffer to hold at least 2 records.

I do get the point of having to go outside the cache. I'll look and see if maybe specifying a 1000 line context helps ;)

Update: nope, still pretty much the same.
 The amount of work done per byte though has to be minimal to actually 
 see anything.
Right, this is another part of the problem -- if copying is so rare compared to the other operations, then the difference is going to be lost in the noise.

What I have learned here is:

1. Ring buffers are really cool (I still love how it works) and perform as well as normal buffers
2. The use cases are much smaller than I thought
3. In most real-world applications, they are a wash, and not worth the OS tricks needed to use it.
4. iopipe makes testing with a different kind of buffer really easy, which was one of my original goals. So I'm glad that works!

I'm going to (obviously) leave them there, hoping that someone finds a good use case, but I can say that my extreme excitement at getting it to work was depressed quite a bit when I found it didn't really gain much in terms of performance for the use cases I have been doing.
 in the buffer. But alas, it's roughly the same, even with large number 
 of lines for context (like 200).

 However, this example *does* show the power of iopipe -- it handles 
 all flavors of unicode with one template function, is quite 
 straightforward (though I want to abstract the line tracking code, 
 that stuff is really tricky to get right). Oh, and it's roughly 10x 
 faster than grep, and a bunch faster than fgrep, at least on my 
 machine ;) I'm tempted to add regex processing to see if it still 
 beats grep.
Should be mostly trivial in fact. I mean our first designs for IOpipe is where I wanted regex to work with it. Basically - if we started a match, extend window until we get it or lose it. Then release up to the next point of potential start.
I'm thinking it's even simpler than that. All matches are dead on a line break (it's how grep normally works), so you simply have to parse the lines and run each one via regex. What I don't know is how much it costs regex to start up and run on an individual line.

One thing I could do to amortize is keep 2N lines in the buffer, and run the regex on a whole context's worth of lines, then dump them all.

I don't get why grep is so bad at this, since it is supposedly doing the matching without line boundaries. I was actually quite shocked when iopipe was that much faster -- even when I'm not asking grep to print out line numbers (so it doesn't actually ever really have to keep track of lines).

-Steve
May 11
Ali Çehreli <acehreli yahoo.com> writes:
On 05/11/2018 06:28 AM, Steven Schveighoffer wrote:

 1. Ring buffers are really cool (I still love how it works) and perform
 as well as normal buffers
 2. The use cases are much smaller than I thought
There is the LMAX Disruptor, which was open sourced a few years ago along with a large number of articles describing its history and design in great detail.

Because of the large number of articles like this one

https://mechanitis.blogspot.com/2011/06/dissecting-disruptor-whats-so-special.html

it's impossible to find the one that had left an impression on me at the time I read it. The article was describing their story from the beginning to finally getting to their current design, starting from a simple std::map, lock contentions and other concurrency pitfalls. They finally settled on a multi-producer-single-consumer design where the consumer works on one thread. This was giving them the biggest CPU cache advantage. The producers and the consumer share a ring buffer for communication.

Perhaps the example you're looking for is in there somewhere. :)

Ali
May 11
Uknown <sireeshkodali1 gmail.com> writes:
On Friday, 11 May 2018 at 13:28:58 UTC, Steven Schveighoffer 
wrote:
 [...]
 I do get the point of having to go outside the cache. I'll look 
 and see if maybe specifying a 1000 line context helps ;)

 Update: nope, still pretty much the same.
I'm sure someone will find some good show off program.
 The amount of work done per byte though has to be minimal to 
 actually see anything.
 Right, this is another part of the problem -- if copying is so rare 
 compared to the other operations, then the difference is going to be 
 lost in the noise.

 What I have learned here is:

 1. Ring buffers are really cool (I still love how it works) and 
 perform as well as normal buffers
 2. The use cases are much smaller than I thought
 3. In most real-world applications, they are a wash, and not worth 
 the OS tricks needed to use it.
Now I need to learn all about ring-buffers. Do you have any good starting points?
 4. iopipe makes testing with a different kind of buffer really 
 easy, which was one of my original goals. So I'm glad that 
 works!
That satisfying feeling when the code works exactly the way you wanted it to!
 I'm going to (obviously) leave them there, hoping that someone 
 finds a good use case, but I can say that my extreme excitement 
 at getting it to work was depressed quite a bit when I found it 
 didn't really gain much in terms of performance for the use 
 cases I have been doing.
I'm sure someone will find a place where it's useful.
 However, this example *does* show the power of iopipe -- it 
 handles all flavors of unicode with one template function, is 
 quite straightforward (though I want to abstract the line 
 tracking code, that stuff is really tricky to get right). Oh, 
 and it's roughly 10x faster than grep, and a bunch faster 
 than fgrep, at least on my machine ;) I'm tempted to add 
 regex processing to see if it still beats grep.
Should be mostly trivial in fact. I mean our first designs for IOpipe is where I wanted regex to work with it. Basically - if we started a match, extend window until we get it or lose it. Then release up to the next point of potential start.
I'm thinking it's even simpler than that. All matches are dead on a line break (it's how grep normally works), so you simply have to parse the lines and run each one via regex. What I don't know is how much it costs regex to startup and run on an individual line. One thing I could do to amortize is keep 2N lines in the buffer, and run the regex on a whole context's worth of lines, then dump them all.
iopipe is looking like a great library!
 I don't get why grep is so bad at this, since it is supposedly 
 doing the matching without line boundaries. I was actually 
 quite shocked when iopipe was that much faster -- even when I'm 
 not asking grep to print out line numbers (so it doesn't 
 actually ever really have to keep track of lines).

 -Steve
That reminds me of this great blog post detailing grep's performance:
http://ridiculousfish.com/blog/posts/old-age-and-treachery.html

Also, one of the original authors of grep wrote about its performance optimizations, for anyone interested:
https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
May 11
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/11/18 10:04 AM, Uknown wrote:
 On Friday, 11 May 2018 at 13:28:58 UTC, Steven Schveighoffer wrote:
 What I have learned here is:

 1. Ring buffers are really cool (I still love how it works) and 
 perform as well as normal buffers
 2. The use cases are much smaller than I thought
 3. In most real-world applications, they are a wash, and not worth the 
 OS tricks needed to use it.
Now I need to learn all about ring-buffers. Do you have any good starting points?
I would start here: https://en.wikipedia.org/wiki/Circular_buffer

The point of a circular buffer is to avoid having to copy any data anywhere -- things stay in place as long as they are in-buffer.

So originally, I had intended to make a ring buffer (or circular buffer) where I have a random access range between 2 separate segments. In other words, the range would abstract the fact that the buffer was 2 segments, and give you random access to each element by checking to see which segment it's in.

I never actually made it, because I realized quickly while optimizing e.g. line processing that the huge benefit of having sequential memory far outweighs the drawback of occasionally having to copy data. In other words, you are paying for every element access to avoid paying for a rare small copy.

Consider a byline range. If you have an 8k buffer, and your lines are approximately 80 bytes in length on average, when you reach the end of the buffer and have to move whatever existing partial line to the front of the buffer to continue reading, you are really only copying 1% of the buffer, 1% of the time. But while you are searching for line endings (99% of the time), you are using a simple indexed pointer dereference.

Contrast that with a disjoint buffer where every access to an element first requires a check to see which segment you are in before dereferencing. You have moved the payment from the 1% into the 99%.

BUT, when at dconf, Dmitry and Shachar let me know about a technique to map the same memory segment to 2 consecutive address ranges. This allows you to look at the ring buffer without it ever being disjoint. Simply put, you have a 2x buffer, whereby each half looks at the same memory. Whenever your buffer start gets to the halfway point, you simply move the pointers back by half a buffer. Other than that, the code is nearly identical to a straight allocated buffer, and the memory access is just as fast.

So I decided to implement, hoping that I would magically just get a bit better performance. I should have known better :)
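For readers who have not seen the double-mapping trick, here is a minimal sketch in C using plain POSIX calls. It is an illustration under stated assumptions, not iopipe's actual code: `mirrored_alloc` and `mirrored_wrap` are hypothetical names, the backing store is an unlinked temporary file under /tmp (a real implementation might prefer shm_open or, on Linux, memfd_create), and `size` is assumed to be a multiple of the page size.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 2 * size bytes of address space in which
 * both halves are backed by the same `size` bytes of memory, or NULL on
 * failure. `size` must be a non-zero multiple of the page size. */
unsigned char *mirrored_alloc(size_t size)
{
    assert(size > 0 && size % (size_t)sysconf(_SC_PAGESIZE) == 0);

    /* Assumption: an unlinked temporary file is the shared backing store. */
    char path[] = "/tmp/ringbuf-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return NULL;
    unlink(path);
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }

    /* Reserve 2 * size bytes of contiguous address space. */
    unsigned char *base = mmap(NULL, 2 * size, PROT_NONE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    /* Map the same file pages over both halves of the reservation. */
    void *lo = mmap(base, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    void *hi = mmap(base + size, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd); /* the mappings keep the pages alive */
    if (lo == MAP_FAILED || hi == MAP_FAILED) {
        munmap(base, 2 * size);
        return NULL;
    }
    return base;
}

/* Once the start of the live window passes the first half, step it back by
 * `size`; it then refers to the same bytes through the lower mapping. */
size_t mirrored_wrap(size_t start, size_t size)
{
    return start >= size ? start - size : start;
}
```

With such a mapping, a write at offset `size` is visible at offset 0, so a window that straddles the end of the first half can still be read as one contiguous array. The region is released with `munmap(buf, 2 * size)`.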
 I'm thinking it's even simpler than that. All matches are dead on a 
 line break (it's how grep normally works), so you simply have to parse 
 the lines and run each one via regex. What I don't know is how much it 
 costs regex to startup and run on an individual line.

 One thing I could do to amortize is keep 2N lines in the buffer, and 
 run the regex on a whole context's worth of lines, then dump them all.
iopipe is looking like a great library!
Thanks! I hope to get more utility out of it. I still need to finish/publish my json parser based on it, and I'm thinking we need some parsing tools really to go on top of it to make things easier to approach.
 That reminds me of this great blog post detailing grep's performance:
 http://ridiculousfish.com/blog/posts/old-age-and-treachery.html
 
 Also, one of the original authors of grep wrote about its performance 
 optimizations, for anyone interested:
 https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
Thanks, I'll take a look at those. -Steve
May 11
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Friday, 11 May 2018 at 13:28:58 UTC, Steven Schveighoffer 
wrote:
 On 5/11/18 1:30 AM, Dmitry Olshansky wrote:
 On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer 
 wrote:
 OK, so at dconf I spoke with a few very smart guys about how 
 I can use mmap to make a zero-copy buffer. And I implemented 
 this on the plane ride home.

 However, I am struggling to find a use case for this that 
 showcases why you would want to use it. While it does work, 
 and works beautifully, it doesn't show any measurable 
 difference vs. the array allocated buffer that copies data 
 when it needs to extend.
 I’d start with something clinically synthetic. Say your record size 
 is exactly half of buffer + 1 byte. If you were to extend the size 
 of buffer, it would amortize.

 Hm... this wouldn't work, because the idea is to keep some of the 
 buffer full. What will happen here is that the buffer will extend to 
 be able to accommodate the extra byte, and then you are back to 
 having less of the buffer full at once. Iopipe is not afraid to 
 increase the buffer :)
Then you cannot test it in such a way.
 
 Basically:
 16 Mb buffer fixed
 vs
 16 Mb mmap-ed ring
 
 Where you read pieces in 8M+1 blocks. Yes, we are aiming to 
 blow the CPU cache there. Otherwise CPU cache is so fast that 
 occasional copy is zilch, once we hit primary memory it’s not. 
 Adjust sizes for your CPU.
This isn't how it will work. The system looks at the buffer and says "oh, I can just read 8MB - 1 byte," which gives you 2 bytes less than you need. Then you need the extra 2 bytes, so it will increase the buffer to hold at least 2 records. I do get the point of having to go outside the cache. I'll look and see if maybe specifying a 1000 line context helps ;)
Nope. Consider reading binary records where you know the length in advance and skip over it w/o needing to touch every byte. There it might help.

If you touch every byte and do something, the cost of copying the tail is zilch.

One example is a net string, which is:

13,Hello, world!

Basically the length in ASCII digits, then ‘,’, followed by that many UTF-8 code units. No decoding necessary. Torrent files use that, I think, maybe other files. It is a nice example that avoids scans to find delimiters.
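A length-prefixed record of this kind can be skipped without scanning its payload, which is what keeps the per-byte work minimal. The sketch below is only an illustration: `parse_record` is a hypothetical helper, and the separator is a parameter because the post writes the record with a comma, while the published netstring format is `13:Hello, world!,` (colon after the length, trailing comma) and bencode, as used in torrent files, writes strings as `13:Hello, world!`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: parse one record of the form
 *     <decimal length><sep><payload>
 * starting at buf[0], where `n` is the number of bytes available.
 * On success, stores the payload's offset and length and returns the total
 * number of bytes the record occupies. Returns 0 if the record is malformed
 * or not yet completely in the buffer.
 * Sketch only: the length is not checked for overflow. */
size_t parse_record(const char *buf, size_t n, char sep,
                    size_t *payload_off, size_t *payload_len)
{
    assert(payload_off != NULL && payload_len != NULL);

    size_t i = 0, len = 0;
    while (i < n && buf[i] >= '0' && buf[i] <= '9') {
        len = len * 10 + (size_t)(buf[i] - '0');
        i++;
    }
    if (i == 0 || i == n || buf[i] != sep)
        return 0;            /* no digits, or separator missing */
    i++;                     /* skip the separator */
    if (len > n - i)
        return 0;            /* payload not fully buffered yet */

    *payload_off = i;
    *payload_len = len;
    return i + len;          /* payload bytes are skipped, never scanned */
}
```

A caller would advance its window by the returned byte count, and ask for more data whenever 0 comes back for a record that is merely incomplete.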
 Update: nope, still pretty much the same.

 The amount of work done per byte though has to be minimal to 
 actually see anything.
Right, this is another part of the problem -- if copying is so rare compared to the other operations, then the difference is going to be lost in the noise. What I have learned here is: 1. Ring buffers are really cool (I still love how it works) and perform as well as normal buffers
This is also good. Normal ring buffers usually suck in the speed department.
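To make the cost of a conventional ring buffer concrete, here is a small C sketch of element access in a buffer whose live data may wrap around the end of its storage. The `struct ring`, `ring_at` and `window_at` names are hypothetical and exist only for this comparison; the point is that every access pays for a wrap check, whereas a contiguous window is a plain indexed load.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classic ring buffer: `cap` bytes of storage, with the live
 * data starting at `head` and possibly wrapping past the end of `data`. */
struct ring {
    unsigned char *data;
    size_t cap;   /* size of the storage */
    size_t head;  /* index of the first live element, head < cap */
    size_t len;   /* number of live elements, len <= cap */
};

/* Random access into the ring: every call checks for wrap-around before
 * the actual dereference. */
unsigned char ring_at(const struct ring *r, size_t i)
{
    assert(i < r->len);
    size_t pos = r->head + i;
    if (pos >= r->cap)   /* element lives in the second segment */
        pos -= r->cap;
    return r->data[pos];
}

/* The same access on a contiguous window is a plain indexed load. */
unsigned char window_at(const unsigned char *window, size_t i)
{
    return window[i];
}
```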
 2. The use cases are much smaller than I thought
 3. In most real-world applications, they are a wash, and not 
 worth the OS tricks needed to use it.
 4. iopipe makes testing with a different kind of buffer really 
 easy, which was one of my original goals. So I'm glad that 
 works!

 I'm going to (obviously) leave them there, hoping that someone 
 finds a good use case, but I can say that my extreme excitement 
 at getting it to work was depressed quite a bit when I found it 
 didn't really gain much in terms of performance for the use 
 cases I have been doing.
 Should be mostly trivial in fact. I mean our first designs for 
 IOpipe is where I wanted regex to work with it.
 
 Basically - if we started a match, extend window until we get 
 it or lose it. Then release up to the next point of potential 
 start.
I'm thinking it's even simpler than that. All matches are dead on a line break (it's how grep normally works), so you simply have to parse the lines and run each one via regex. What I don't know is how much it costs regex to startup and run on an individual line.
It is malloc/free/addRange/removeRange for each call. I optimized 2.080 to reuse the most recently used engine w/o these costs, but I’ll have to check if it covers all cases.
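The per-line approach under discussion (compile the pattern once, then run it on each line so that no match can span a line break) can be sketched as follows. This uses POSIX `regcomp`/`regexec` in C rather than D's std.regex, so it says nothing about std.regex's actual per-call costs; `count_matching_lines` is a hypothetical helper, and it modifies its input in place.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the lines of the NUL-terminated, writable
 * string `text` that match the POSIX extended regex `pattern`.
 * The regex is compiled once and reused for every line. Each '\n' is
 * overwritten with '\0' so that a line can be passed to regexec() as a
 * C string. Returns -1 if the pattern does not compile. */
long count_matching_lines(char *text, const char *pattern)
{
    assert(text != NULL && pattern != NULL);

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    long matches = 0;
    char *line = text;
    while (*line != '\0') {
        char *nl = strchr(line, '\n');
        if (nl != NULL)
            *nl = '\0';          /* terminate this line in place */
        if (regexec(&re, line, 0, NULL, 0) == 0)
            matches++;
        if (nl == NULL)
            break;               /* last line had no trailing newline */
        line = nl + 1;
    }
    regfree(&re);
    return matches;
}
```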
 One thing I could do to amortize is keep 2N lines in the 
 buffer, and run the regex on a whole context's worth of lines, 
 then dump them all.
I believe integrating iopipe awareness into regex will easily make it 50% faster. A guesstimate, though.
 I don't get why grep is so bad at this, since it is supposedly
grep on Mac is a piece of sheat, sadly and I don’t know why exactly (too old?). Use some 3-rd party thing like ‘sift’ written in Go.
 -Steve
May 11
Uknown <sireeshkodali1 gmail.com> writes:
On Friday, 11 May 2018 at 23:46:16 UTC, Dmitry Olshansky wrote:
 On Friday, 11 May 2018 at 13:28:58 UTC, Steven Schveighoffer 
 wrote:
 On 5/11/18 1:30 AM, Dmitry Olshansky wrote:
 On Thursday, 10 May 2018 at 23:22:02 UTC, Steven 
 Schveighoffer wrote:
grep on Mac is a piece of sheat, sadly and I don’t know why exactly (too old?). Use some 3-rd party thing like ‘sift’ written in Go.
You can always use GNU grep. The one that comes with macOS is pretty old and slow. If you have macports, it's just `port install grep`. I'm sure brew will have a similar package for GNU grep.
May 11
Kagamin <spam here.lot> writes:
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer 
wrote:
 However, I am struggling to find a use case for this that 
 showcases why you would want to use it. While it does work, and 
 works beautifully, it doesn't show any measurable difference 
 vs. the array allocated buffer that copies data when it needs 
 to extend.
Depends on OS and hardware. I would expect mmap implementation to be slower as it reads file in chunks of 4kb and relies on page faults.
May 11
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Friday, 11 May 2018 at 09:55:10 UTC, Kagamin wrote:
 On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer 
 wrote:
 However, I am struggling to find a use case for this that 
 showcases why you would want to use it. While it does work, 
 and works beautifully, it doesn't show any measurable 
 difference vs. the array allocated buffer that copies data 
 when it needs to extend.
Depends on OS and hardware. I would expect mmap implementation to be slower as it reads file in chunks of 4kb and relies on page faults.
It doesn’t. Instead it has a buffer mmaped twice side by side. Therefore you can avoid copy at the end when it wraps around. Otherwise it’s the same buffering as usual.
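For a sense of how much copying the side-by-side mapping avoids, here is a rough C model of a conventional buffer that moves its trailing partial line to the front before each refill. It is a simplification and not a measurement of iopipe: `simulate_compaction` is a hypothetical function, every line is assumed to have the same length, and the buffer never grows. The figures used in the test (an 8k buffer and 80-byte lines) are the ones Steven gives elsewhere in this thread.

```c
#include <assert.h>
#include <stddef.h>

struct copy_stats {
    unsigned long long bytes_read;    /* bytes pulled from the input */
    unsigned long long bytes_copied;  /* bytes moved to the buffer front */
};

/* Hypothetical model: a buffer of `cap` bytes is fed `total` bytes made of
 * fixed-size lines of `line_len` bytes. Before each refill, the partial
 * line left at the end of the buffer is copied to the front. */
struct copy_stats simulate_compaction(size_t cap, size_t line_len,
                                      unsigned long long total)
{
    assert(line_len > 0 && line_len < cap);

    struct copy_stats s = {0, 0};
    size_t leftover = 0;  /* bytes of the partial line at the buffer's end */
    while (s.bytes_read < total) {
        /* Move the partial line to the front of the buffer. */
        s.bytes_copied += leftover;

        /* Fill the rest of the buffer with new input. */
        unsigned long long want = cap - leftover;
        unsigned long long remaining = total - s.bytes_read;
        unsigned long long got = want < remaining ? want : remaining;
        s.bytes_read += got;

        /* Consume every complete line; the remainder is the partial line. */
        leftover = (size_t)((leftover + got) % line_len);
    }
    return s;
}
```

Under these assumptions the copied bytes are a small fraction of the bytes read, which is consistent with the observation in the thread that the ring buffer and the copying buffer perform about the same.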
May 11
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/11/18 5:55 AM, Kagamin wrote:
 On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
 However, I am struggling to find a use case for this that showcases 
 why you would want to use it. While it does work, and works 
 beautifully, it doesn't show any measurable difference vs. the array 
 allocated buffer that copies data when it needs to extend.
Depends on OS and hardware. I would expect mmap implementation to be slower as it reads file in chunks of 4kb and relies on page faults.
As Dmitry hinted at, there actually is no file involved. I'm mapping just straight memory to 2 segments. In fact, in my test application, I'm using stdin as the input, which may not even involve a file. It's just as fast as using memory, the only cool part is that you can write a buffer that wraps to the beginning as if it were a normal array.

What surprises me is that the copying for the normal buffer doesn't hurt performance that much. I suppose this should probably have been expected, as CPUs are really really good at processing consecutive memory, and the copying you end up having to do is generally small compared to the rest of your app.

-Steve
May 11
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/10/18 7:22 PM, Steven Schveighoffer wrote:

 However, this example *does* show the power of iopipe -- it handles all 
 flavors of unicode with one template function, is quite straightforward 
 (though I want to abstract the line tracking code, that stuff is really 
 tricky to get right). Oh, and it's roughly 10x faster than grep, and a 
 bunch faster than fgrep, at least on my machine ;) I'm tempted to add 
 regex processing to see if it still beats grep.
Shameful note: macOS grep is BSD grep, and is not NEARLY as fast as GNU grep, which has much better performance (and is 2x as fast as iopipe_search on my Linux VM, even when printing line numbers).

So at least there is something to strive for :)

-Steve
May 11
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/11/18 11:44 AM, Steven Schveighoffer wrote:
 On 5/10/18 7:22 PM, Steven Schveighoffer wrote:
 
 However, this example *does* show the power of iopipe -- it handles 
 all flavors of unicode with one template function, is quite 
 straightforward (though I want to abstract the line tracking code, 
 that stuff is really tricky to get right). Oh, and it's roughly 10x 
 faster than grep, and a bunch faster than fgrep, at least on my 
 machine ;) I'm tempted to add regex processing to see if it still 
 beats grep.
Shameful note: Macos grep is BSD grep, and is not NEARLY as fast as GNU grep, which has much better performance (and is 2x as fast as iopipe_search on my Linux VM, even when printing line numbers). So at least there is something to strive for :)
More testing reveals that as I increase the context lines to print, iopipe performs better than GNU grep. A shocking thing is that at 9 lines of context, grep goes up slightly, but all of a sudden at 10 lines of context, it doubles in the time taken (and is now slower than the iopipe_search).

Also noting: my Linux VM does not have ldc, so these are dmd numbers.

-Steve
May 11
Joakim <dlang joakim.fea.st> writes:
On Friday, 11 May 2018 at 16:07:26 UTC, Steven Schveighoffer 
wrote:
 On 5/11/18 11:44 AM, Steven Schveighoffer wrote:
 On 5/10/18 7:22 PM, Steven Schveighoffer wrote:
 
 [...]
Shameful note: Macos grep is BSD grep, and is not NEARLY as fast as GNU grep, which has much better performance (and is 2x as fast as iopipe_search on my Linux VM, even when printing line numbers). So at least there is something to strive for :)
More testing reveals that as I increase the context lines to print, iopipe performs better than GNU grep. A shocking thing is that at 9 lines of context, grep goes up slightly, but all of a sudden at 10 lines of context, it doubles in the time taken (and is now slower than the iopipe_search). Also noting: my Linux VM does not have ldc, so these are dmd numbers. -Steve
What stops you from downloading a linux release from here? https://github.com/ldc-developers/ldc/releases
May 11
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/11/18 5:42 PM, Joakim wrote:
 On Friday, 11 May 2018 at 16:07:26 UTC, Steven Schveighoffer wrote:
 On 5/11/18 11:44 AM, Steven Schveighoffer wrote:
 On 5/10/18 7:22 PM, Steven Schveighoffer wrote:

 [...]
Shameful note: Macos grep is BSD grep, and is not NEARLY as fast as GNU grep, which has much better performance (and is 2x as fast as iopipe_search on my Linux VM, even when printing line numbers). So at least there is something to strive for :)
More testing reveals that as I increase the context lines to print, iopipe performs better than GNU grep. A shocking thing is that at 9 lines of context, grep goes up slightly, but all of a sudden at 10 lines of context, it doubles in the time taken (and is now slower than the iopipe_search). Also noting: my Linux VM does not have ldc, so these are dmd numbers.
What stops you from downloading a linux release from here? https://github.com/ldc-developers/ldc/releases
So I did that, it's not much faster, a few milliseconds. Still about half as fast as GNU grep.

But I am not expecting any miracles here. GNU grep does pretty much everything it can to achieve performance -- including eschewing the standard library buffering system as I am doing. I can probably match the performance at some point, but I doubt it's worth worrying about. It's still really really fast without trying to do anything crazy.

I hope at some point, however, to work with Dmitry to add iopipe-based regex engine so we can see how much better we can make regex.

-Steve
May 12
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Saturday, 12 May 2018 at 12:14:28 UTC, Steven Schveighoffer 
wrote:
 On 5/11/18 5:42 PM, Joakim wrote:
 On Friday, 11 May 2018 at 16:07:26 UTC, Steven Schveighoffer 
 wrote:
[...]
What stops you from downloading a linux release from here? https://github.com/ldc-developers/ldc/releases
So I did that, it's not much faster, a few milliseconds. Still about half as fast as GNU grep. But I am not expecting any miracles here. GNU grep does pretty much everything it can to achieve performance -- including eschewing the standard library buffering system as I am doing. I can probably match the performance at some point, but I doubt it's worth worrying about. It's still really really fast without trying to do anything crazy.
I could offer a few tricks to fix that w/o getting too dirty. GNU grep is fast, but std.regex is faster than that in raw speed on a significant class of quite common patterns. But I loaded the file at once.
 I hope at some point, however, to work with Dmitry to add 
 iopipe-based regex engine so we can see how much better we can 
 make regex.
As such initiatives go, it’s either now or never. Please get in touch directly over Slack or smth, let’s make it roll. I have wanted to do a grep-like utility since 2012. Now at long last we have all the building blocks.
 -Steve
May 12
Joakim <dlang joakim.fea.st> writes:
On Saturday, 12 May 2018 at 12:45:16 UTC, Dmitry Olshansky wrote:
 On Saturday, 12 May 2018 at 12:14:28 UTC, Steven Schveighoffer 
 wrote:
[...]
I could offer a few tricks to fix that w/o getting too dirty. GNU grep is fast, but std.regex is faster than that in raw speed on a significant class of quite common patterns. But I loaded file at once.
 [...]
As such initiative goes it’s either now or never. Please get in touch directly over Slack or smth, let’s make it roll. I wanted to do grep-like utility since 2012. Now at long last we have all the building blocks.
If you're talking about writing a grep prototype in D, that's a great idea, especially for publicizing D. :)
May 12
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Saturday, 12 May 2018 at 14:48:58 UTC, Joakim wrote:
 On Saturday, 12 May 2018 at 12:45:16 UTC, Dmitry Olshansky 
 wrote:
 On Saturday, 12 May 2018 at 12:14:28 UTC, Steven Schveighoffer 
 wrote:
[...]
I could offer a few tricks to fix that w/o getting too dirty. GNU grep is fast, but std.regex is faster than that in raw speed on a significant class of quite common patterns. But I loaded file at once.
 [...]
As such initiative goes it’s either now or never. Please get in touch directly over Slack or smth, let’s make it roll. I wanted to do grep-like utility since 2012. Now at long last we have all the building blocks.
If you're talking about writing a grep prototype in D, that's a great idea, especially for publicizing D. :)
For shaming others to beat us using some other language. Making life better for everyone. Taking a DMD to a gun fight ;)
May 12
prev sibling parent "Nick Sabalausky (Abscissa)" <SeeWebsiteToContactMe semitwist.com> writes:
On 05/12/2018 08:14 AM, Steven Schveighoffer wrote:
 
 But I am not expecting any miracles here. GNU grep does pretty much 
 everything it can to achieve performance -- including eschewing the 
 standard library buffering system as I am doing. I can probably match 
 the performance at some point, but I doubt it's worth worrying about. 
I wonder if there are realistic cases where you could beat it due to being a library solution and skipping the cost of launching grep as a new process. Granted, outside of Windows, process launching is considered to be fairly cheap, but it still isn't no-cost. That would still be a nice feather in D's cap: comparable to grep for large data, faster than spawning a grep process for smaller data.
May 12
prev sibling next sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Friday, May 11, 2018 11:44:04 Steven Schveighoffer via Digitalmars-d-announce wrote:
 Shameful note: Macos grep is BSD grep, and is not NEARLY as fast as GNU
 grep, which has much better performance (and is 2x as fast as
 iopipe_search on my Linux VM, even when printing line numbers).
Curiously, the grep on FreeBSD seems to be GNU's grep with some additional patches, though I expect that it's a ways behind whatever GNU is releasing now, because while they were willing to put some GPLv2 stuff in FreeBSD, they have not been willing to have anything to do with GPLv3. FreeBSD's grep claims to be version 2.5.1-FreeBSD, whereas ports has the gnugrep package which is version 2.27, so that implies a fairly large version difference between the two. I have no idea how they compare in terms of performance.

Either way, I would have expected FreeBSD to be using their own implementation, not something from GNU, especially since they seem to be trying to purge GPL stuff from FreeBSD. So, the fact that FreeBSD is using GNU's grep is a bit surprising. If I had to guess, I would guess that they switched to the GNU version at some point in the past, because it was easier to grab it than to make what they had faster, but I don't know.

Either way, it sounds like Mac OS X either didn't take their grep from FreeBSD in this case, or they took it from an older version before FreeBSD switched to using GNU's grep.

- Jonathan M Davis
May 11
parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 11 May 2018 at 16:06:41 UTC, Jonathan M Davis wrote:
 [...]
Oh, there was an epic forum thread about the use of GNU grep for BSD. I don't remember the details, but it was long and heated (it was so epic that I even read it, although I normally don't care at all for BSD stuff).
May 12
prev sibling parent Jon Degenhardt <jond noreply.com> writes:
On Friday, 11 May 2018 at 15:44:04 UTC, Steven Schveighoffer 
wrote:
 On 5/10/18 7:22 PM, Steven Schveighoffer wrote:

 Shameful note: Macos grep is BSD grep, and is not NEARLY as 
 fast as GNU grep, which has much better performance (and is 2x 
 as fast as iopipe_search on my Linux VM, even when printing 
 line numbers).
Yeah, the MacOS default versions of the Unix text processing tools are really slow. It's worth installing the GNU versions if doing performance comparisons on MacOS, or because you work with large files. Homebrew and MacPorts both have the GNU versions. Some relevant packages: coreutils, grep, gsed (sed), gawk (awk). Most tools are in coreutils. Many will be installed with a 'g' prefix by default, leaving the existing tools in place, e.g. 'cut' will be installed as 'gcut' unless specified otherwise.

--Jon
May 11
prev sibling next sibling parent Arun Chandrasekaran <aruncxy gmail.com> writes:
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer 
wrote:
 OK, so at dconf I spoke with a few very smart guys about how I 
 can use mmap to make a zero-copy buffer. And I implemented this 
 on the plane ride home.

 [...]
Since mmap is involved, it would be interesting to see if this can be extended for interprocess communication, akin to boost::interprocess:
https://www.boost.org/doc/libs/1_67_0/doc/html/interprocess.html

boost::interprocess uses mmap[1] on an object obtained from shm_open[2] by default (unless specified to use SysV shm).

[1] https://github.com/boostorg/interprocess/blob/4f8459e868617f88ff105633a9aa82221d5e9bb1/include/boost/interprocess/mapped_region.hpp#L698
[2] https://github.com/boostorg/interprocess/blob/develop/include/boost/interprocess/shared_memory_object.hpp#L315
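
To make the shm_open + mmap idea concrete, here is a minimal sketch in D using only the POSIX bindings in druntime. It is not how iopipe or boost::interprocess is implemented; the helper names createSharedRegion and openSharedRegion and the object name "/iopipe_ipc_demo" are made up for illustration. The second mapping stands in for what another process would do to attach to the same memory.

```d
import core.sys.posix.fcntl : O_CREAT, O_EXCL, O_RDWR;
import core.sys.posix.sys.mman : MAP_FAILED, MAP_SHARED, PROT_READ, PROT_WRITE,
    mmap, munmap, shm_open, shm_unlink;
import core.sys.posix.sys.stat : S_IRUSR, S_IWUSR;
import core.sys.posix.unistd : close, ftruncate;

/// Hypothetical helper: creates a named shared-memory object and maps it.
/// Returns null on failure.
ubyte[] createSharedRegion(const(char)* name, size_t size)
{
    shm_unlink(name); // drop a stale object from an earlier run, if any
    immutable fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, S_IRUSR | S_IWUSR);
    if (fd < 0)
        return null;
    scope (exit) close(fd); // the mapping stays valid after the fd is closed
    if (ftruncate(fd, size) != 0)
        return null;
    void* p = mmap(null, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? null : (cast(ubyte*) p)[0 .. size];
}

/// Hypothetical helper: what a second process would do to attach to the
/// object created above. Returns null on failure.
ubyte[] openSharedRegion(const(char)* name, size_t size)
{
    immutable fd = shm_open(name, O_RDWR, 0);
    if (fd < 0)
        return null;
    scope (exit) close(fd);
    void* p = mmap(null, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? null : (cast(ubyte*) p)[0 .. size];
}

unittest
{
    const(char)* name = "/iopipe_ipc_demo";
    auto writer = createSharedRegion(name, 4096);
    auto reader = openSharedRegion(name, 4096);
    assert(writer !is null && reader !is null);

    writer[0] = 42;          // a write through one mapping...
    assert(reader[0] == 42); // ...is visible through the other

    munmap(writer.ptr, writer.length);
    munmap(reader.ptr, reader.length);
    shm_unlink(name);
}
```

A real IPC buffer would also need some synchronization between the two sides (read/write positions, a semaphore or similar), which this sketch leaves out.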
May 11
prev sibling next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer 
wrote:
 OK, so at dconf I spoke with a few very smart guys about how I 
 can use mmap to make a zero-copy buffer. And I implemented this 
 on the plane ride home.

 [...]
They can be problematic with some CPUs and OSes. For modern CPUs there should be no problems, but on those with exotic caches and virtual memory configurations there can be some aliasing issues. Linus Torvalds talked a little about that case in this thread on realworldtech: https://www.realworldtech.com/forum/?threadid=174426&curpostid=174731
May 12
parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/12/18 3:38 PM, Patrick Schluter wrote:
 On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
 OK, so at dconf I spoke with a few very smart guys about how I can use 
 mmap to make a zero-copy buffer. And I implemented this on the plane 
 ride home.

 [...]
They can be problematic with some CPUs and OSes. For modern CPUs there should be no problems, but on those with exotic caches and virtual memory configurations there can be some aliasing issues. Linus Torvalds talked a little about that case in this thread on realworldtech: https://www.realworldtech.com/forum/?threadid=174426&curpostid=174731
Thanks for the tip. The nice thing about iopipe is that the buffer type is completely selectable, and nothing changes, except possibly some performance. So on those arch's, I would expect people to select the normal AllocatedBuffer type.

-Steve
May 12
prev sibling next sibling parent reply bioinfornatics <bioinfornatics fedoraproject.org> writes:
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer 
wrote:
 OK, so at dconf I spoke with a few very smart guys about how I 
 can use mmap to make a zero-copy buffer. And I implemented this 
 on the plane ride home.

 However, I am struggling to find a use case for this that 
 showcases why you would want to use it. While it does work, and 
 works beautifully, it doesn't show any measurable difference 
 vs. the array allocated buffer that copies data when it needs 
 to extend.

 [...]
Hi Steve,

This is exciting work that could help in the bioinformatics area. In bioinformatics we are I/O bound and we process a lot of big files; the amount of data can be gigabytes, terabytes, and sometimes even petabytes. So processing this amount of data efficiently is critical.

Some years ago I got a request, 'How to parse fastq file format in D?', and monarch_dodra wrote a really fast parser (code: http://dpaste.dzfl.pl/37b893ed ). It could be interesting to show how fast iopipe is. You can grab a fastq file from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/ and take a look at iopipe's performance.

A fastq file is a plain text format, and it is usually a repetition of four lines:
1/ title and description: this line starts with @
2/ sequence line: this line usually contains DNA letters (ACGT)
3/ comment line: this line starts with +
4/ quality line: the quality of each base; this line has the same length as the sequence line (n°2)

Rarely, the comment section spans multiple lines. Warning: the @ and + characters can be found inside the quality line, thus I search for the two-character patterns '\n@' and '\n+'. I never split the file by line, as it is a waste of time; instead I read the content as a stream.

I hope this showcase helps you.

Good luck :-)
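
For readers who have not seen the format, the four-line layout can be sketched in D roughly as follows. This is only an illustration of the record structure: the names FastqRecord and parseFastq are made up, it uses Phobos line splitting for brevity rather than the streaming approach described above, it assumes the whole input is already in memory, and it does not handle the rare multi-line case. Because it counts lines within a record, a quality line that happens to start with @ or + is not mistaken for a new record.

```d
import std.exception : enforce;
import std.string : lineSplitter;

/// One FASTQ record; the slices point into the input, nothing is copied.
struct FastqRecord
{
    const(char)[] title;    // text after the leading '@'
    const(char)[] sequence; // bases, e.g. ACGT
    const(char)[] quality;  // one quality character per base
}

/// Parses the common four-line layout only (hypothetical helper).
FastqRecord[] parseFastq(const(char)[] text)
{
    FastqRecord[] records;
    auto lines = text.lineSplitter;

    // Fetches the next line or throws if the record is cut short.
    const(char)[] next()
    {
        enforce(!lines.empty, "truncated FASTQ record");
        auto line = lines.front;
        lines.popFront();
        return line;
    }

    while (!lines.empty)
    {
        auto title = next();
        if (title.length == 0)
            continue; // tolerate blank lines between records
        enforce(title[0] == '@', "title line must start with '@'");
        auto sequence = next();
        auto comment = next();
        enforce(comment.length > 0 && comment[0] == '+',
                "comment line must start with '+'");
        auto quality = next();
        enforce(quality.length == sequence.length,
                "quality and sequence lengths differ");
        records ~= FastqRecord(title[1 .. $], sequence, quality);
    }
    return records;
}

unittest
{
    // The first quality line deliberately starts with '@' and contains '+'.
    auto recs = parseFastq("@read1 demo\nACGT\n+\n@+!I\n@read2\nGG\n+\n##\n");
    assert(recs.length == 2);
    assert(recs[0].title == "read1 demo");
    assert(recs[0].quality == "@+!I");
    assert(recs[1].sequence == "GG");
}
```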
May 14
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/14/18 6:02 AM, bioinfornatics wrote:
 On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
 OK, so at dconf I spoke with a few very smart guys about how I can use 
 mmap to make a zero-copy buffer. And I implemented this on the plane 
 ride home.

 [...]
Hi Steve, this is exciting work that could help in the bioinformatics area. In bioinformatics we are I/O bound and we process a lot of big files; the amount of data can be gigabytes, terabytes, and sometimes even petabytes. So processing this amount of data efficiently is critical. Some years ago I got a request, 'How to parse fastq file format in D?', and monarch_dodra wrote a really fast parser (code: http://dpaste.dzfl.pl/37b893ed ). It could be interesting to show how fast iopipe is.
Yeah, I have been working on and off with Vang Le (biocyberman) on using iopipe to parse such formats. He gave a good presentation at dconf this year on using D in bioinformatics, and I think it is a great fit for D!

At dconf, I threw together a crude fasta parser (with the intention of having it be the base for parsing fastq as well) to demonstrate how iopipe can perform while parsing such things. I have no idea how fast or slow it is, as I just barely got it to work (it passes unit tests I made up based on the Wikipedia entry for fasta), but IMO, the direct buffer access makes fast parsing much more pleasant than having to deal with your own buffering (using phobos makes parsing a bit difficult, however; I still see a need for some parsing tools for iopipe).

You can find that library here: https://github.com/schveiguy/fastaq

Not being in the field of bioinformatics, I can't really say that I am likely to continue development of it, but I'm certainly willing to help with iopipe for anyone who wants to use it in this field.

-Steve
May 14
parent biocyberman <biocyberman gmail.com> writes:
On Monday, 14 May 2018 at 14:23:43 UTC, Steven Schveighoffer 
wrote:
 On 5/14/18 6:02 AM, bioinfornatics wrote:
 On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer 
 wrote:
[...]
Hi Steve, this is exciting work that could help in the bioinformatics area. In bioinformatics we are I/O bound and we process a lot of big files; the amount of data can be gigabytes, terabytes, and sometimes even petabytes. So processing this amount of data efficiently is critical. Some years ago I got a request, 'How to parse fastq file format in D?', and monarch_dodra wrote a really fast parser (code: http://dpaste.dzfl.pl/37b893ed ). It could be interesting to show how fast iopipe is.
Yeah, I have been working on and off with Vang Le (biocyberman) on using iopipe to parse such formats. He gave a good presentation at dconf this year on using D in bioinformatics, and I think it is a great fit for D! At dconf, I threw together a crude fasta parser (with the intention of having it be the base for parsing fastq as well) to demonstrate how iopipe can perform while parsing such things. I have no idea how fast or slow it is, as I just barely got it to work (it passes unit tests I made up based on the Wikipedia entry for fasta), but IMO, the direct buffer access makes fast parsing much more pleasant than having to deal with your own buffering (using phobos makes parsing a bit difficult, however; I still see a need for some parsing tools for iopipe). You can find that library here: https://github.com/schveiguy/fastaq Not being in the field of bioinformatics, I can't really say that I am likely to continue development of it, but I'm certainly willing to help with iopipe for anyone who wants to use it in this field. -Steve
Hi Steve,

Great work continuing to improve iopipe. Thank you for the example implementation of a fasta/q parser with iopipe. I will definitely continue to work on this. It still requires some more time for me to get over the beginner barriers in D. I am currently trying out some work over here: https://github.com/bioslaD

Johnathan (bioinformatics), it would be great if you could join bioslaD and offer some help to make things move faster.

Vang
May 15
prev sibling parent Claude <no no.no> writes:
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer 
wrote:
 However, I am struggling to find a use case for this that 
 showcases why you would want to use it. While it does work, and 
 works beautifully, it doesn't show any measurable difference 
 vs. the array allocated buffer that copies data when it needs 
 to extend.
I can think of a good use case: audio streaming (in an embedded environment)!

If you have something like a Bluetooth audio source, and ALSA or any audio hardware API as an audio sink to speakers, you can use the ring buffer as a FIFO between the two. The Bluetooth source has its own pace (and you cannot control it) and has a variable bit-rate, whereas the sink has a constant bit-rate, so you have to have a buffer between them. And you want to reduce the CPU cost as much as possible due to embedded system constraints (or even real-time constraints, especially for audio).
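
As a rough picture of the FIFO role described here, a plain array-backed ring in D might look like the sketch below. This is not iopipe's mmap-based RingBuffer, and the name ByteFifo and its methods are made up for illustration; a real embedded version would also avoid the GC allocation and add whatever locking the producer and consumer threads need.

```d
/// Fixed-capacity byte FIFO backed by an ordinary array (hypothetical type).
struct ByteFifo
{
    private ubyte[] storage;
    private size_t head;  // index of the oldest stored byte
    private size_t count; // number of bytes currently stored

    this(size_t capacity)
    {
        assert(capacity > 0, "capacity must be non-zero");
        storage = new ubyte[capacity];
    }

    /// Bytes waiting to be consumed.
    size_t length() const { return count; }

    /// Bytes that can still be stored.
    size_t space() const { return storage.length - count; }

    /// Producer side: stores as much of `data` as fits, returns bytes accepted.
    size_t put(const(ubyte)[] data)
    {
        immutable n = data.length < space ? data.length : space;
        foreach (i; 0 .. n)
            storage[(head + count + i) % storage.length] = data[i];
        count += n;
        return n;
    }

    /// Consumer side: copies up to dest.length bytes out, returns bytes delivered.
    size_t take(ubyte[] dest)
    {
        immutable n = dest.length < count ? dest.length : count;
        foreach (i; 0 .. n)
            dest[i] = storage[(head + i) % storage.length];
        head = (head + n) % storage.length;
        count -= n;
        return n;
    }
}

unittest
{
    auto fifo = ByteFifo(4);
    assert(fifo.put([1, 2, 3]) == 3);

    ubyte[2] firstTwo;
    assert(fifo.take(firstTwo[]) == 2);
    assert(firstTwo[0] == 1 && firstTwo[1] == 2);

    // Only three bytes fit now; the write wraps around the end of the array.
    assert(fifo.put([4, 5, 6, 7]) == 3);

    ubyte[4] rest;
    assert(fifo.take(rest[]) == 4);
    assert(rest[0] == 3 && rest[1] == 4 && rest[2] == 5 && rest[3] == 6);
    assert(fifo.length == 0);
}
```

The per-byte modulo and the copy in take are the costs a mirrored mmap ring avoids, since there the consumer can read any window as one contiguous slice.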
May 14