
digitalmars.D.learn - Help optimizing UnCompress for gzipped files

reply Christian Köstlin <christian.koestlin gmail.com> writes:
Hi all,

over the holidays, I played around with processing some gzipped JSON
data. The first version was implemented in Ruby, but it took too long, so
I tried D. This was already faster, but still not satisfyingly fast.
Then I wrote another version in Java, which was much faster.

After this I analyzed the first step of the process (gunzipping the data
from a file to memory) and found that D's UnCompress is much
slower than Java, Ruby, and plain C.

There was some discussion on the forum a while ago:
http://forum.dlang.org/thread/pihxxhjgnveulcdtadvg forum.dlang.org

The code I used and the numbers I got are here:
https://github.com/gizmomogwai/benchmarks/tree/master/gunzip

I used an i7 MacBook with macOS 10.13.2, Ruby 2.5.0 built via rvm,
Python 3 installed by Homebrew, the built-in clang compiler,
ldc-1.7.0-beta1, and Java 1.8.0_152.

Is there anything I can do to speed up the dlang stuff?

Thanks in advance,
Christian
Jan 02
next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 2 January 2018 at 10:27:11 UTC, Christian Köstlin 
wrote:
 Hi all,

 over the holidays, I played around with processing some gzipped 
 json data. First version was implemented in ruby, but took too 
 long, so I tried, dlang. This was already faster, but not 
 really satisfactory fast. Then I wrote another version in java, 
 which was much faster.

 After this I analyzed the first step of the process (gunzipping 
 the data from a file to memory), and found out, that dlangs 
 UnCompress is much slower than java, and ruby and plain c.

 There was some discussion on the forum a while ago: 
 http://forum.dlang.org/thread/pihxxhjgnveulcdtadvg forum.dlang.org

 The code I used and the numbers I got are here: 
 https://github.com/gizmomogwai/benchmarks/tree/master/gunzip

 I used an i7 macbook with os x 10.13.2, ruby 2.5.0 built via 
 rvm, python3 installed by homebrew, builtin clang compiler, 
 ldc-1.7.0-beta1, java 1.8.0_152.

 Is there anything I can do to speed up the dlang stuff?

 Thanks in advance,
 Christian
Yes indeed. You can make it much faster by using a sliced static array as a buffer. I suspect that most of the slowdown is caused by the GC, as there should otherwise be only calls to the gzip library.
Jan 02
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Tuesday, 2 January 2018 at 11:22:06 UTC, Stefan Koch wrote:
 You can make it much faster by using a sliced static array as 
 buffer.
Only if you want data corruption! It keeps a copy of your pointer internally: https://github.com/dlang/phobos/blob/master/std/zlib.d#L605

It also will always over-allocate new buffers on each call: https://github.com/dlang/phobos/blob/master/std/zlib.d#L602

There is no efficient way to use it. The implementation is substandard because the API limits the design.

If we really want a fast std.zlib, the API will need to be extended with new functions to fix these. Those new functions will probably look a LOT like the underlying C functions... which is why I say just use them right now.
 I suspect that most of the slowdown is caused by the gc.
 As there should be only calls to the gzip library
plz measure before spreading FUD about the GC.
Jan 02
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/2/18 8:57 AM, Adam D. Ruppe wrote:
 On Tuesday, 2 January 2018 at 11:22:06 UTC, Stefan Koch wrote:
 You can make it much faster by using a sliced static array as buffer.
Only if you want data corruption! It keeps a copy of your pointer internally: https://github.com/dlang/phobos/blob/master/std/zlib.d#L605 It also will always overallocate new buffers on each call <https://github.com/dlang/phobos/blob/master/std/zlib.d#L602> There is no efficient way to use it. The implementation is substandard because the API limits the design.
iopipe handles this quite well, and deals with the buffers properly (yes, it is very tricky: you have to ref-count the zstream structure, because it keeps internal pointers to *itself* as well!). And no, iopipe doesn't use std.zlib; I use the etc.zlib functions (but I poached some ideas from std.zlib when writing it).

https://github.com/schveiguy/iopipe/blob/master/source/iopipe/zip.d

I even wrote a JSON parser for iopipe. But it's far from complete, and probably needs updating since I changed some of the iopipe API.

https://github.com/schveiguy/jsoniopipe

Depending on the use case, it might be enough, and should be very fast.

-Steve
Jan 02
parent reply Christian Köstlin <christian.koestlin gmail.com> writes:
On 02.01.18 15:09, Steven Schveighoffer wrote:
 On 1/2/18 8:57 AM, Adam D. Ruppe wrote:
 On Tuesday, 2 January 2018 at 11:22:06 UTC, Stefan Koch wrote:
 You can make it much faster by using a sliced static array as buffer.
Only if you want data corruption! It keeps a copy of your pointer internally: https://github.com/dlang/phobos/blob/master/std/zlib.d#L605 It also will always overallocate new buffers on each call <https://github.com/dlang/phobos/blob/master/std/zlib.d#L602> There is no efficient way to use it. The implementation is substandard because the API limits the design.
iopipe handles this quite well. And deals with the buffers properly (yes, it is very tricky. You have to ref-count the zstream structure, because it keeps internal pointers to *itself* as well!). And no, iopipe doesn't use std.zlib, I use the etc.zlib functions (but I poached some ideas from std.zlib when writing it). https://github.com/schveiguy/iopipe/blob/master/source/iopipe/zip.d I even wrote a json parser for iopipe. But it's far from complete. And probably needs updating since I changed some of the iopipe API. https://github.com/schveiguy/jsoniopipe Depending on the use case, it might be enough, and should be very fast. -Steve
Thanks Steve for this proposal (actually I already had an iopipe version on my hard disk that I applied to this problem). It's more or less your unzip example + putting the data into an appender (I hope this is how it should be done, to get the data to RAM).

iopipe is already better than the normal D version, almost like Java, but still far from the best solution. I updated https://github.com/gizmomogwai/benchmarks/tree/master/gunzip

I will give the direct gunzip calls a try ...

In terms of JSON parsing, I had really nice results with the fast.json pull parser, but it's comparing apples with oranges a little, because I did not pull out all the data there.

--- Christian
Jan 02
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/2/18 1:01 PM, Christian Köstlin wrote:
 On 02.01.18 15:09, Steven Schveighoffer wrote:
 On 1/2/18 8:57 AM, Adam D. Ruppe wrote:
 On Tuesday, 2 January 2018 at 11:22:06 UTC, Stefan Koch wrote:
 You can make it much faster by using a sliced static array as buffer.
Only if you want data corruption! It keeps a copy of your pointer internally: https://github.com/dlang/phobos/blob/master/std/zlib.d#L605 It also will always overallocate new buffers on each call <https://github.com/dlang/phobos/blob/master/std/zlib.d#L602> There is no efficient way to use it. The implementation is substandard because the API limits the design.
iopipe handles this quite well. And deals with the buffers properly (yes, it is very tricky. You have to ref-count the zstream structure, because it keeps internal pointers to *itself* as well!). And no, iopipe doesn't use std.zlib, I use the etc.zlib functions (but I poached some ideas from std.zlib when writing it). https://github.com/schveiguy/iopipe/blob/master/source/iopipe/zip.d I even wrote a json parser for iopipe. But it's far from complete. And probably needs updating since I changed some of the iopipe API. https://github.com/schveiguy/jsoniopipe Depending on the use case, it might be enough, and should be very fast.
Thanks Steve for this proposal (actually I already had an iopipe version on my harddisk that I applied to this problem) Its more or less your unzip example + putting the data to an appender (I hope this is how it should be done, to get the data to RAM).
Well, you don't need to use appender for that (and doing so is copying a lot of the data an extra time). All you need is to extend the pipe until there isn't any more new data, and it will all be in the buffer.

   // almost the same line from your current version
   auto mypipe = openDev("../out/nist/2011.json.gz")
                     .bufd.unzip(CompressionFormat.gzip);

   // This line here will work with the current release (0.0.2):
   while(mypipe.extend(0) != 0) {}

   // But I have a fix for a bug that hasn't been released yet;
   // this would work if you use iopipe-master:
   mypipe.ensureElems();

   // getting the data is as simple as looking at the buffer.
   auto data = mypipe.window; // ubyte[] of the data
 iopipe is already better than the normal dlang version, almost like
 java, but still far from the solution. I updated
 https://github.com/gizmomogwai/benchmarks/tree/master/gunzip
 
 I will give the direct gunzip calls a try ...
 
 In terms of json parsing, I had really nice results with the fast.json
 pull parser, but its comparing a little bit apples with oranges, because
 I did not pull out all the data there.
Yeah, with jsoniopipe being very raw, I wouldn't be sure it was usable in your case. The end goal is to have something fast, but very easy to construct. I wasn't planning on focusing on the speed (yet) like other libraries do, but on ease of writing code to use it.

-Steve
Jan 02
next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/2/18 3:13 PM, Steven Schveighoffer wrote:
 // almost the same line from your current version
 auto mypipe = openDev("../out/nist/2011.json.gz")
                    .bufd.unzip(CompressionFormat.gzip);
Would you mind telling me the source of the data? When I do get around to it, I want to have a good dataset to test things against, and it would be good to use what others reach for.

-Steve
Jan 02
parent Christian Köstlin <christian.koestlin gmail.com> writes:
On 02.01.18 21:48, Steven Schveighoffer wrote:
 On 1/2/18 3:13 PM, Steven Schveighoffer wrote:
 // almost the same line from your current version
 auto mypipe = openDev("../out/nist/2011.json.gz")
                    .bufd.unzip(CompressionFormat.gzip);
Would you mind telling me the source of the data? When I do get around to it, I want to have a good dataset to test things against, and would be good to use what others reach for. -Steve
Hi Steve,

thanks for looking into this. I use data from nist.gov; the Makefile includes these download instructions:

   curl -s https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2011.json.gz > out/nist/2011.json.gz

-- Christian Köstlin
Jan 02
prev sibling parent reply Christian Köstlin <christian.koestlin gmail.com> writes:
On 02.01.18 21:13, Steven Schveighoffer wrote:
 Well, you don't need to use appender for that (and doing so is copying a
 lot of the data an extra time). All you need is to extend the pipe until
 there isn't any more new data, and it will all be in the buffer.
 
 // almost the same line from your current version
 auto mypipe = openDev("../out/nist/2011.json.gz")
                   .bufd.unzip(CompressionFormat.gzip);
 
 // This line here will work with the current release (0.0.2):
 while(mypipe.extend(0) != 0) {}
Thanks for this input, I updated the program to make use of this method and compare it to the appender thing as well.
 I will give the direct gunzip calls a try ...
I added direct gunzip calls as well... Those are really good, as long as I do not try to get the data into RAM :) then it is "bad" again. I wonder what the real difference between the low-level solution with its own appender and the C version is; to me they look almost the same (ugly, only the performance seems to be nice).

Funny thing is, if I add the clang address sanitizer flags to the C program, I get almost the same numbers as for Java :)
 Yeah, with jsoniopipe being very raw, I wouldn't be sure it was usable
 in your case. The end goal is to have something fast, but very easy to
 construct. I wasn't planning on focusing on the speed (yet) like other
 libraries do, but ease of writing code to use it.
 
 -Steve
-- Christian Köstlin
Jan 02
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/3/18 2:47 AM, Christian Köstlin wrote:
 On 02.01.18 21:13, Steven Schveighoffer wrote:
 Well, you don't need to use appender for that (and doing so is copying a
 lot of the data an extra time). All you need is to extend the pipe until
 there isn't any more new data, and it will all be in the buffer.

 // almost the same line from your current version
 auto mypipe = openDev("../out/nist/2011.json.gz")
                    .bufd.unzip(CompressionFormat.gzip);

 // This line here will work with the current release (0.0.2):
 while(mypipe.extend(0) != 0) {}
Thanks for this input, I updated the program to make use of this method and compare it to the appender thing as well.
Hm... the numbers are worse! I would have expected them to be at least comparable. I'll have to look into it. Thanks for posting this.

-Steve
Jan 03
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/3/18 9:42 AM, Steven Schveighoffer wrote:
 On 1/3/18 2:47 AM, Christian Köstlin wrote:
 On 02.01.18 21:13, Steven Schveighoffer wrote:
 Well, you don't need to use appender for that (and doing so is copying a
 lot of the data an extra time). All you need is to extend the pipe until
 there isn't any more new data, and it will all be in the buffer.

 // almost the same line from your current version
 auto mypipe = openDev("../out/nist/2011.json.gz")
                    .bufd.unzip(CompressionFormat.gzip);

 // This line here will work with the current release (0.0.2):
 while(mypipe.extend(0) != 0) {}
Thanks for this input, I updated the program to make use of this method and compare it to the appender thing as well.
Hm.. the numbers are worse! I would have expected to be at least comparable. I'll have to look into it. Thanks for posting this.
Alright. I have spent some time examining the issues, and here are my findings:

1. The major differentiator between the C and D algorithms is the use of C realloc. This one thing saves the most time. I'm going to update iopipe so you can use it (stand by). I will also be examining how to simulate using realloc when not using C malloc in iopipe. I think it's the copying of data to the new buffer that is causing issues.

2. extend(0) always attempts to read 8k more bytes. The buffer extends by 8k by going into another couple pages. Somehow this is choking the algorithm. I think the cost of extending pages actually ends up hurting performance (something we need to look at in the GC in general).

3. extendElems should allow extending all the elements in the most optimal fashion. Fixing that, iopipe performs as well as the Appender/iopipe version. This is coming out as well.

Stay tuned, there will be updates to iopipe to hopefully make it as fast in this microbenchmark as the C version :)

-Steve
Jan 03
next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/3/18 3:28 PM, Steven Schveighoffer wrote:
 1. The major differentiator between the C and D algorithms is the use of 
 C realloc. This one thing saves the most time. I'm going to update 
 iopipe so you can use it (stand by). I will also be examining how to 
 simulate using realloc when not using C malloc in iopipe. I think it's 
 the copying of data to the new buffer that is causing issues.
Looking at when C realloc actually moves the data, it appears it all of a sudden over-allocates very, very large blocks, much larger than the GC will over-allocate. This is why the GC is losing: after a certain size, the GC doesn't allocate blocks to grow into, so we start copying on every realloc.

So it's not so much the performance of the copying, but the number of times copying is required that is affecting the ultimate performance.

-Steve
Jan 03
parent reply Christian Köstlin <christian.koestlin gmail.com> writes:
On 03.01.18 22:33, Steven Schveighoffer wrote:
 On 1/3/18 3:28 PM, Steven Schveighoffer wrote:
 1. The major differentiator between the C and D algorithms is the use
 of C realloc. This one thing saves the most time. I'm going to update
 iopipe so you can use it (stand by). I will also be examining how to
 simulate using realloc when not using C malloc in iopipe. I think it's
 the copying of data to the new buffer that is causing issues.
Looking at when C realloc actually moves the data, it appears it all of a sudden over-allocates very very large blocks, much larger than the GC will over-allocate. This is why the GC is losing. After a certain size, the GC doesn't allocate blocks to grow into, so we start copying on every realloc.
That's a very interesting finding! If I find some time today, I will also try a pure malloc/memcpy/free solution in C and look at how many real allocs are done.

That's funny, because it means this whole *2 strategy for growing buffers that you learn is well taken care of by libc? Or does it still mean that you should allocate in a pattern like this for libc's algorithm to kick in?

Is there an API to see how much realloc really allocated, or is it only possible to compare pointers to see whether realloc delivers a new or the old pointer?

-- Christian Köstlin
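One way to poke at these questions from plain C (an illustrative sketch, not the benchmark code; `count_realloc_moves` is a made-up helper): grow a buffer in fixed steps with realloc and compare the pointer before and after each call to count the "real" moves.

```c
#include <stdlib.h>
#include <string.h>

/* Grow a buffer by `step` bytes `iters` times via realloc, touching the
   new tail each time, and count how often the block actually moved
   (i.e. realloc returned a different pointer). Returns -1 on OOM. */
int count_realloc_moves(int iters, size_t step)
{
    char *buf = NULL;
    size_t size = 0;
    int moves = 0;

    for (int i = 0; i < iters; i++) {
        char *old = buf;
        size += step;
        char *tmp = realloc(buf, size);
        if (!tmp) { free(buf); return -1; }
        buf = tmp;
        memset(buf + size - step, 'x', step); /* make the new pages real */
        if (buf != old)
            moves++;                          /* a "real" realloc */
    }
    free(buf);
    return moves;
}
```

How often realloc can grow in place is allocator-specific, which would match the non-deterministic sequences of "real reallocs" reported below. As for an API to see how much was really allocated: there is no standard one, but glibc offers the nonstandard malloc_usable_size() and macOS has malloc_size(); otherwise pointer comparison is the portable option.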
Jan 03
parent reply Christian Köstlin <christian.koestlin gmail.com> writes:
I added now a C variant that always does malloc/memcpy/free. It's much
slower, for sure.
Also, I put in some output that shows when a real realloc happens. It's
like you said:

did a real realloc
did a real realloc
did a real realloc
did a real realloc
did a real realloc
did a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did a real realloc
did not a real realloc

Funny thing is, I do not always get the same sequence of real reallocs.


--
Christian Köstlin
Jan 04
parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/4/18 4:47 AM, Christian Köstlin wrote:
 I added now a c variant, that does always malloc/memcpy/free. Its much
 slower for sure.
 Also I put some output in thats shows when a real realloc happens. Its
 like you said:
 
 did a real realloc
 did a real realloc
 did a real realloc
 did a real realloc
 did a real realloc
 did a real realloc
 did not a real realloc
 did not a real realloc
 did not a real realloc
 did not a real realloc
 did not a real realloc
 did not a real realloc
 did not a real realloc
 did not a real realloc
 did a real realloc
 did not a real realloc
 
 funny thing is, i do not always get the same sequence of real reallocs.
Another thing I did was print out the size of the buffer before the realloc. In other words:

   auto oldsize = buffer.length;
   auto oldptr = buffer.ptr;
   buffer = (cast(ubyte *)realloc(buffer.ptr, newsize))[0 .. newsize];
   if(buffer.ptr != oldptr)
      writeln("moved: ", oldsize);

Even the sizes malloc chooses for its big leap are not consistent from run to run! I'm wondering if possibly, when you keep reallocing a large block and it doesn't have enough memory and needs more from the system, it just tacks it onto the end of the existing big block. This would make the performance for this microbenchmark highly dependent on not doing any other allocations!

BTW, my numbers are consistent, but there is a difference from yours. For example, my C runs are about 430ms, and my fast D runs are about 650-700ms. I never get the 200ms-range runs that you are seeing. I have a similar machine, but still running Sierra.

-Steve
Jan 04
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/3/18 3:28 PM, Steven Schveighoffer wrote:

 Stay tuned, there will be updates to iopipe to hopefully make it as fast 
 in this microbenchmark as the C version :)
v0.0.3 has been released. To take advantage of using malloc/realloc, you can use std.experimental.allocator.mallocator.Mallocator as the BufferManager's allocator. e.g.:

   auto myPipe = openDev("file.gz").bufd  // not here
                      .unzip!Mallocator;  // here

   myPipe.ensureElems(); // this works now.
   auto bytesRead = myPipe.window.length;

The reason you don't need it on the first bufd is because that is the buffer of the file stream, which isn't going to grow.

Note: the buffer manager does not free the data on destruction! You are responsible for freeing it (if you use Mallocator).

-Steve
Jan 04
parent reply Christian Köstlin <christian.koestlin gmail.com> writes:
On 04.01.18 16:53, Steven Schveighoffer wrote:
 On 1/3/18 3:28 PM, Steven Schveighoffer wrote:
 
 Stay tuned, there will be updates to iopipe to hopefully make it as
 fast in this microbenchmark as the C version :)
v0.0.3 has been released. To take advantage of using malloc/realloc, you can use std.experimental.allocator.mallocator.Mallocator as the BufferManager's allocator. e.g.: auto myPipe = openDev("file.gz").bufd // not here       .unzip!Mallocator;              // here myPipe.ensureElems(); // this works now. auto bytesRead = myPipe.window.length; The reason you don't need it on the first bufd is because that is the buffer of the file stream, which isn't going to grow. Note: the buffer manager does not free the data on destruction! You are responsible for freeing it (if you use Mallocator).
Thanks Steve, this runs now faster; I will update the table.

Sorry, but I do not get how to clean up the malloc'ed memory :) Could you help me out here?

-- Christian Köstlin
Jan 04
next sibling parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/4/18 1:57 PM, Christian Köstlin wrote:
 Sorry, but I do not get how to cleanup the mallocated memory :)
 Could you help me out here?
   free(mypipe.window.ptr);

At the moment this works because you haven't released any data from the front. But eventually I will need to figure out how to handle this when it's not so neat.

-Steve
Jan 04
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/4/18 1:57 PM, Christian Köstlin wrote:
 Thanks Steve,
 this runs now faster, I will update the table.
Still a bit irked that I can't match the C speed :/ But, I can't get your C speed to duplicate on my mac even with gcc, so I'm not sure where to start. I find it interesting that you are not using any optimization flags for gcc. -Steve
Jan 04
parent reply Christian Köstlin <christian.koestlin gmail.com> writes:
On 04.01.18 20:46, Steven Schveighoffer wrote:
 On 1/4/18 1:57 PM, Christian Köstlin wrote:
 Thanks Steve,
 this runs now faster, I will update the table.
Still a bit irked that I can't match the C speed :/ But, I can't get your C speed to duplicate on my mac even with gcc, so I'm not sure where to start. I find it interesting that you are not using any optimization flags for gcc.
I guess the code in my program is small enough that the optimization flags do not matter... most of the stuff is pulled from libz, which is dynamically linked against /usr/lib/libz.1.dylib.

I also cannot understand what more I should do (will try realloc with Mallocator) for the dlang-low-level variant to get to the C speed. Rust is doing quite well there.

-- Christian Köstlin
Jan 04
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/5/18 1:01 AM, Christian Köstlin wrote:
 On 04.01.18 20:46, Steven Schveighoffer wrote:
 On 1/4/18 1:57 PM, Christian Köstlin wrote:
 Thanks Steve,
 this runs now faster, I will update the table.
Still a bit irked that I can't match the C speed :/ But, I can't get your C speed to duplicate on my mac even with gcc, so I'm not sure where to start. I find it interesting that you are not using any optimization flags for gcc.
I guess, the code in my program is small enough that the optimize flags do not matter... most of the stuff is pulled from libz? Which is dynamically linked against /usr/lib/libz.1.dylib.
Yeah, I guess most of the bottlenecks are inside libz, or the memory allocator. There isn't much optimization to be done in the main program itself.
 I also cannot understand what I should do more (will try realloc with
 Mallocator) for the dlang-low-level variant to get to the c speed.
D compiles just the same as C. So theoretically you should be able to get the same performance with a ported version of your C code. It's worth a shot.
 rust is doing quite well there
I'll say a few words of caution here:

1. Almost all of these tests use the same C library to unzip. So it's really not a test of the performance of decompression, but of the performance of memory management. And it appears that any test using malloc/realloc is in a different tier, presumably because of the lack of copies (as discussed earlier).

2. Your rust test (I think, I'm not sure) is testing 2 things in the same run, which could potentially have dramatic consequences for the second test. For instance, it could already have all the required memory blocks ready, and the allocation strategy suddenly gets better. Or maybe there is some kind of caching of the input being done. I think you have a fairer test for the second option by running it in a separate program. I've never used rust, so I don't know what exactly your code is doing.

3. It's hard to make a decision based on such microbenchmarks as to which solution is "better" in an actual real-world program, especially when the state/usage of the memory allocator plays a huge role in this.

-Steve
Jan 05
parent reply Christian Köstlin <christian.koestlin gmail.com> writes:
On 05.01.18 15:39, Steven Schveighoffer wrote:
 Yeah, I guess most of the bottlenecks are inside libz, or the memory
 allocator. There isn't much optimization to be done in the main program
 itself.

 D compiles just the same as C. So theoretically you should be able to
 get the same performance with a ported version of your C code. It's
 worth a shot.
I added another version that tries to do the "same" as the C version using Mallocator, but I am still way off; perhaps it's creating too many ranges on the underlying array. But it's around the same speed as your great iopipe thing.

My solution does have the same memory leak, as I am not sure how best to get the memory out of the FastAppender so that it is automagically cleaned up. Perhaps if we get rc things, this gets easier?

I updated https://github.com/gizmomogwai/benchmarks/tree/master/gunzip with the newest numbers on my machine, but I think your iopipe solution is the best one we can get at the moment!
 rust is doing quite well there
I'll say a few words of caution here: 1. Almost all of these tests use the same C library to unzip. So it's really not a test of the performance of decompression, but the performance of memory management. And it appears that any test using malloc/realloc is in a different tier. Presumably because of the lack of copies (as discussed earlier). 2. Your rust test (I think, I'm not sure) is testing 2 things in the same run, which could potentially have dramatic consequences for the second test. For instance, it could already have all the required memory blocks ready, and the allocation strategy suddenly gets better. Or maybe there is some kind of caching of the input being done. I think you have a fairer test for the second option by running it in a separate program. I've never used rust, so I don't know what exactly your code is doing. 3. It's hard to make a decision based on such microbenchmarks as to which solution is "better" in an actual real-world program, especially when the state/usage of the memory allocator plays a huge role in this.
sure .. thats true
Jan 05
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/5/18 3:09 PM, Christian Köstlin wrote:
 On 05.01.18 15:39, Steven Schveighoffer wrote:
 Yeah, I guess most of the bottlenecks are inside libz, or the memory
 allocator. There isn't much optimization to be done in the main program
 itself.

 D compiles just the same as C. So theoretically you should be able to
 get the same performance with a ported version of your C code. It's
 worth a shot.
I added another version that tries to do the "same" as the c version using mallocator, but i am still way off, perhaps its creating too many ranges on the underlying array. but its around the same speed as your great iopipe thing.
Hm... I think really there is some magic initial state of the allocator, and that's what allows it to go so fast.

One thing about the D version: because druntime is also using malloc (the GC is backed by malloc'd data after all), the initial state of the heap is quite different from when you start in C. It may be impossible or nearly impossible to duplicate the performance. But the flipside (if this is indeed the case) is that you won't see the same performance in a real-world app anyway, even in C.

One thing to try: you preallocate the ENTIRE buffer. This only works if you know how many bytes it will decompress to (not always possible), but it will take the allocator out of the equation completely. And it's probably going to be the most efficient method (you aren't leaving behind smaller unused blocks when you realloc). If for some reason we can't beat/tie the C version doing that, then something else is going on.
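For gzip input specifically, preallocating is often possible: the gzip format (RFC 1952) stores the uncompressed size, modulo 2^32, in the last four bytes of the stream. A small sketch of reading it (`gzip_isize` is a hypothetical helper, not from the thread; only exact for single-member gzip files under 4 GB):

```c
#include <stddef.h>
#include <stdint.h>

/* Read the ISIZE field from a gzip stream's trailer: the last 4 bytes,
   little-endian, holding the uncompressed length mod 2^32 (RFC 1952).
   Returns 0 if the input is too short to contain a trailer. */
uint32_t gzip_isize(const unsigned char *data, size_t len)
{
    if (len < 4)
        return 0;
    const unsigned char *p = data + len - 4;
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```

With that size in hand, a single up-front malloc (or one `new ubyte[]` in D) can replace the whole realloc sequence for this kind of file.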
 My solution does have the same memory leak, as I am not sure how to best
 get the memory out of the FastAppender so that it is automagically
 cleaned up. Perhaps if we get rc things, this gets easier?
I've been giving some thought to this. I think iopipe needs some buffer management primitives that allow you to finagle the buffer. I've been needing this for some time anyway (for file seeking). Right now, the buffer itself is buried in the chain, so it's hard to get at the actual buffer. Alternatively, I probably also need to give some thought to a mechanism that auto-frees the memory when it can tell nobody is still using the iopipe. Given that iopipe's signature feature is direct buffer access, this would mean anything that uses such a feature would have to be unsafe. -Steve
Jan 05
parent reply Christian Köstlin <christian.koestlin gmail.com> writes:
On 05.01.18 23:04, Steven Schveighoffer wrote:
 On 1/5/18 3:09 PM, Christian Köstlin wrote:
 On 05.01.18 15:39, Steven Schveighoffer wrote:
 Yeah, I guess most of the bottlenecks are inside libz, or the memory
 allocator. There isn't much optimization to be done in the main program
 itself.

 D compiles just the same as C. So theoretically you should be able to
 get the same performance with a ported version of your C code. It's
 worth a shot.
I added another version that tries to do the "same" as the c version using mallocator, but i am still way off, perhaps its creating too many ranges on the underlying array. but its around the same speed as your great iopipe thing.
Hm... I think really there is some magic initial state of the allocator, and that's what allows it to go so fast. One thing about the D version, because druntime is also using malloc (the GC is backed by malloc'd data after all), the initial state of the heap is quite different from when you start in C. It may be impossible or nearly impossible to duplicate the performance. But the flipside (if this is indeed the case) is that you won't see the same performance in a real-world app anyway, even in C. One thing to try, you preallocate the ENTIRE buffer. This only works if you know how many bytes it will decompress to (not always possible), but it will take the allocator out of the equation completely. And it's probably going to be the most efficient method (you aren't leaving behind smaller unused blocks when you realloc). If for some reason we can't beat/tie the C version doing that, then something else is going on.
Yes ... this is something I forgot to try out ... will do now :)

Mhh .. interesting numbers ... C is even faster, and my D low-level solution is also a little bit faster, but much slower than the no-copy version (funnily, no-copy is the wrong name, it just overwrites all the data in a small buffer).
 My solution does have the same memory leak, as I am not sure how to best
 get the memory out of the FastAppender so that it is automagically
 cleaned up. Perhaps if we get rc things, this gets easier?
I've been giving some thought to this. I think iopipe needs some buffer management primitives that allow you to finagle the buffer. I've been needing this for some time anyway (for file seeking). Right now, the buffer itself is buried in the chain, so it's hard to get at the actual buffer. Alternatively, I probably also need to give some thought to a mechanism that auto-frees the memory when it can tell nobody is still using the iopipe. Given that iopipe's signature feature is direct buffer access, this would mean anything that uses such a feature would have to be unsafe.
Yes ... that's tricky ... One question about iopipe: is it possible to transform the elements in the pipe as well, e.g. away from a buffer of bytes to JSON objects? -- Christian Köstlin
Jan 06
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/6/18 11:14 AM, Christian Köstlin wrote:
 On 05.01.18 23:04, Steven Schveighoffer wrote:
 One thing to try, you preallocate the ENTIRE buffer. This only works if
 you know how many bytes it will decompress to (not always possible), but
 it will take the allocator out of the equation completely. And it's
 probably going to be the most efficient method (you aren't leaving
 behind smaller unused blocks when you realloc). If for some reason we
 can't beat/tie the C version doing that, then something else is going on.
yes ... this is something i forgot to try out ... will do now :) mhh .. interesting numbers ... c is even faster, my d lowlevel solution is also a little bit faster, but much slower than the no copy version (funnily, no copy is the wrong name, it just overwrites all the data in a small buffer).
Not from what I'm reading, the C solution is about the same (257 vs. 261). Not sure if you have averaged these numbers, especially on a real computer that might be doing other things.

Note: I would expect it to be a tiny bit faster, but not monumentally faster. From my testing with the reallocation, it only reallocates a large quantity of data once.

However, the D solution should be much faster. Part of the issue is that you still aren't low-level enough :)

Instead of allocating the ubyte array with this line:

ubyte[] buffer = new ubyte[200*1024*1024];

Try this instead:

// from std.array
auto buffer = uninitializedArray!(ubyte[])(200*1024*1024);

The difference is that the first one will have the runtime 0-initialize all the data.
 one question about iopipe. is it possible to transform the elements in
 the pipe as well ... e.g. away from a buffer of bytes to json objects?
Yes! I am working on doing just that, but haven't had a chance to update the toy project I wrote: https://github.com/schveiguy/jsoniopipe

I was planning actually on having an iopipe of JsonItem, which would work just like a normal buffer, but reference the ubyte buffer underneath.

Eventually, the final product should have a range of JsonValue, which you would recurse into in order to parse its children. All of it will be lazy, and stream-based, so you don't have to load the whole file if it's huge.

Note, you can't have an iopipe of JsonValue, because it's a recursive format. JsonItems are just individual defined tokens, so they can be linear.

-Steve
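[The JsonItem idea, a linear stream of token descriptors that reference the underlying byte buffer instead of copying substrings out, can be sketched in Python. All names here are invented for illustration; this is not jsoniopipe's API, and the toy tokenizer handles only a subset of JSON.]

```python
import re

# Toy pull-tokenizer: yields (kind, start, end) descriptors that point
# into the underlying buffer rather than copying substrings out of it.
TOKEN = re.compile(
    rb'\s*(?:(?P<punct>[{}\[\],:])'
    rb'|(?P<string>"(?:[^"\\]|\\.)*")'
    rb'|(?P<number>-?\d+(?:\.\d+)?)'
    rb'|(?P<word>true|false|null))')

def json_items(buf: bytes):
    pos = 0
    while pos < len(buf):
        m = TOKEN.match(buf, pos)
        if not m:
            break  # end of recognizable input (toy parser: no error handling)
        kind = m.lastgroup
        yield kind, m.start(kind), m.end(kind)
        pos = m.end()

doc = b'{"id": 42, "ok": true}'
kinds = [k for k, s, e in json_items(doc)]
assert kinds == ['punct', 'string', 'punct', 'number', 'punct',
                 'string', 'punct', 'word', 'punct']
# Slicing the buffer with a token's offsets recovers its bytes lazily:
assert all(doc[s:e] for _, s, e in json_items(doc))
```

[Because each token is just a kind plus two offsets, a consumer can walk the stream lazily and only materialize the values it actually needs, which is the point of the streaming design described above.]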
Jan 07
parent =?UTF-8?Q?Christian_K=c3=b6stlin?= <christian.koestlin gmail.com> writes:
On 07.01.18 14:44, Steven Schveighoffer wrote:
 Not from what I'm reading, the C solution is about the same (257 vs.
 261). Not sure if you have averaged these numbers, especially on a real
 computer that might be doing other things.
Yes, you are right ... for proper benchmarking, proper statistics should be in place: discarding extreme values, averaging the rest, ...
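[A minimal sketch of that kind of robust summary, a trimmed mean over repeated runs, in Python; the helper name and the sample timings are invented for illustration.]

```python
import statistics

def trimmed_mean(samples, trim=1):
    """Drop the `trim` smallest and largest samples, then average the rest."""
    ordered = sorted(samples)
    kept = ordered[trim:len(ordered) - trim] if len(ordered) > 2 * trim else ordered
    return statistics.mean(kept)

# One run was disturbed by a busy machine; trimming removes that outlier.
runs_ms = [261, 257, 259, 412, 258, 260]
assert trimmed_mean(runs_ms) == statistics.mean([258, 259, 260, 261])
```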
 Note: I would expect it to be a tiny bit faster, but not monumentally
 faster. From my testing with the reallocation, it only reallocates a
 large quantity of data once.
 
 However, the D solution should be much faster. Part of the issue is that
 you still aren't low-level enough :)
 
 Instead of allocating the ubyte array with this line:
 
 ubyte[] buffer = new ubyte[200*1024*1024];
 
 Try this instead:
 
 // from std.array
 auto buffer = uninitializedArray!(ubyte[])(200*1024*1024);
Thanks for that ... I just did not know how to get an uninitialized array. I was aware that dlang is nice and puts .init there :)
 Yes! I am working on doing just that, but haven't had a chance to update
 the toy project I wrote: https://github.com/schveiguy/jsoniopipe
 
 I was planning actually on having an iopipe of JsonItem, which would
 work just like a normal buffer, but reference the ubyte buffer underneath.
 
 Eventually, the final product should have a range of JsonValue, which
 you would recurse into in order to parse its children. All of it will be
 lazy, and stream-based, so you don't have to load the whole file if it's
 huge.
 
 Note, you can't have an iopipe of JsonValue, because it's a recursive
 format. JsonItems are just individual defined tokens, so they can be
 linear.
Sounds really good. I played around with https://github.com/mleise/fast/blob/master/source/fast/json.d ... that's an interesting pull parser, with the wrong licence unfortunately ... I wonder if something like this could be done on top of iopipe instead of a "real" buffer. --- Christian Köstlin
Jan 09
prev sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Tuesday, 2 January 2018 at 10:27:11 UTC, Christian Köstlin 
wrote:
 After this I analyzed the first step of the process (gunzipping 
 the data from a file to memory), and found out, that dlangs 
 UnCompress is much slower than java, and ruby and plain c.
Yeah, std.zlib is VERY poorly written. You can get much better performance by just calling the C functions yourself instead. (You can just import etc.c.zlib; it is still included.)

Improving it would mean changing the public API. I think the one-shot compress/uncompress functions are ok, but the streaming class does a lot of unnecessary work inside, like copying stuff around.
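[For reference, the shape of the low-level streaming loop being recommended here, inflate called repeatedly over fixed-size chunks, can be sketched in Python with the stdlib zlib module; `wbits=47` (32+15) tells zlib to auto-detect a gzip or zlib header, matching what a raw C inflateInit2 setup would do. The helper name and chunk size are made up.]

```python
import io
import gzip
import zlib

def gunzip_stream(fileobj, chunk_size=64 * 1024) -> bytes:
    """Streaming gunzip: read fixed-size chunks, inflate each as it arrives."""
    d = zlib.decompressobj(wbits=47)  # 32+15: accept gzip or zlib headers
    out = []
    while True:
        raw = fileobj.read(chunk_size)
        if not raw:
            break
        out.append(d.decompress(raw))
    out.append(d.flush())
    return b"".join(out)

payload = b'{"key": "value"}\n' * 10_000
packed = gzip.compress(payload)
assert gunzip_stream(io.BytesIO(packed)) == payload
```

[The one join at the end is the only large copy; the per-chunk work stays in a small, reused buffer, which is the property the hand-written C loop gets as well.]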
Jan 02
parent =?UTF-8?Q?Christian_K=c3=b6stlin?= <christian.koestlin gmail.com> writes:
On 02.01.18 14:51, Adam D. Ruppe wrote:
 On Tuesday, 2 January 2018 at 10:27:11 UTC, Christian Köstlin wrote:
 After this I analyzed the first step of the process (gunzipping the
 data from a file to memory), and found out, that dlangs UnCompress is
 much slower than java, and ruby and plain c.
Yeah, std.zlib is VERY poorly written. You can get much better performance by just calling the C functions yourself instead. (You can just import etc.c.zlib; it is still included.) Improving it would mean changing the public API. I think the one-shot compress/uncompress functions are ok, but the streaming class does a lot of unnecessary work inside like copying stuff around.
I added a version that uses the gzip low-level APIs (similar to my example C program). I am still having problems copying the data fast enough to a dlang array. Please see the updated page: https://github.com/gizmomogwai/benchmarks/tree/master/gunzip -- Christian Köstlin
Jan 02