
digitalmars.D - Performance of std.json

"David Soria Parra" <dsp experimentalworks.net> writes:
Hi,

I have recently had to deal with large amounts of JSON data in D. 
While doing that I've found that std.json is remarkably slow in 
comparison to other languages' standard JSON implementations. I've 
created a small and simple benchmark that takes a local copy of a 
GitHub API call 
"https://api.github.com/repos/D-Programming-Language/dmd/pulls", 
parses it 100 times, and writes each title to stdout.

My results are as follows:
   ./d-test > /dev/null  3.54s user 0.02s system 99% cpu 3.560 total
   ./hs-test > /dev/null  0.02s user 0.00s system 93% cpu 0.023 total
   python test.py > /dev/null  0.77s user 0.02s system 99% cpu 0.792 total

The concrete implementations (sorry for my terrible haskell 
implementation) can be found here:

    https://github.com/dsp/D-Json-Tests/

This compares D's std.json with Haskell's Data.Aeson and Python's 
standard library json. I am a bit concerned about the current 
state of our JSON parser given that a lot of applications these 
days use JSON. I personally consider a high-speed implementation 
of JSON a critical part of a standard library.

Would it make sense to start thinking about using ujson4c as an 
external library, or maybe to come up with a better implementation? 
I know Orvid has something and might add some analysis as to why 
std.json is slow. Any ideas or pointers as to how to start with 
that?
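
For reference, a benchmark of the kind described above could look roughly like the following with std.json. This is a minimal sketch, not the code from the linked repository; the file name test.json and the helper titles are assumptions made for illustration.

```d
import std.file : readText;
import std.json : parseJSON;
import std.stdio : writeln;

/// Parses a JSON array of objects and returns each element's "title" string.
string[] titles(string text)
{
    string[] result;
    auto document = parseJSON(text);
    foreach (element; document.array)
        result ~= element["title"].str;
    return result;
}

void main()
{
    // "test.json" is an assumed local copy of the API response.
    auto text = readText("test.json");

    // Parse the same text 100 times and print every title each time.
    foreach (iteration; 0 .. 100)
        foreach (title; titles(text))
            writeln(title);
}
```

Reading the file once and re-parsing the text inside the loop keeps file I/O out of the measurement, so the timing mostly reflects the parser.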
Jun 01 2014
"Joshua Niehus" <jm.niehus gmail.com> writes:
On Monday, 2 June 2014 at 00:18:19 UTC, David Soria Parra wrote:
 Would it make sense to start thinking about using ujson4c as an 
 external library, or maybe come up with a better 
 implementation. I know Orvid has something and might add some 
 analysis as to why std.json is slow. Any ideas or pointers as 
 to how to start with that?

std.json is underpowered and in need of an overhaul. In the meantime, have you tried vibe.d's JSON module? http://vibed.org/api/vibe.data.json/
Jun 01 2014
Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Mon, 02 Jun 2014 00:18:18 +0000
David Soria Parra via Digitalmars-d <digitalmars-d puremagic.com> wrote:

 Hi,

 I have recently had to deal with large amounts of JSON data in D.
 While doing that I've found that std.json is remarkably slow in
 comparison to other languages' standard JSON implementations. I've
 created a small and simple benchmark that takes a local copy of a
 GitHub API call
 "https://api.github.com/repos/D-Programming-Language/dmd/pulls",
 parses it 100 times, and writes each title to stdout.

 My results are as follows:
    ./d-test > /dev/null  3.54s user 0.02s system 99% cpu 3.560 total
    ./hs-test > /dev/null  0.02s user 0.00s system 93% cpu 0.023 total
    python test.py > /dev/null  0.77s user 0.02s system 99% cpu 0.792 total

 The concrete implementations (sorry for my terrible haskell
 implementation) can be found here:

     https://github.com/dsp/D-Json-Tests/

 This compares D's std.json with Haskell's Data.Aeson and Python's
 standard library json. I am a bit concerned about the current
 state of our JSON parser given that a lot of applications these
 days use JSON. I personally consider a high-speed implementation
 of JSON a critical part of a standard library.

 Would it make sense to start thinking about using ujson4c as an
 external library, or maybe to come up with a better implementation?
 I know Orvid has something and might add some analysis as to why
 std.json is slow. Any ideas or pointers as to how to start with
 that?

It's my understanding that the current design of std.json is considered to be poor, but I haven't used it, so I don't know any of the details. But if it's as slow as you're finding to be the case, then I think that supports the idea that it needs a redesign. The question then is what a new std.json should look like and who would do it. And that pretty much comes down to an interested and motivated developer coming up with and implementing a new design and then proposing it here. And until someone takes up that torch, we'll be stuck with what we have. Certainly, there's no fundamental reason why we can't have a lightning-fast std.json. With ranges and slices, parsing in D in general should be faster than C/C++ (and definitely faster than Haskell or Python), and if it isn't, that indicates that the implementation (if not the whole design) of that code needs to be redone.

I know that vibe.d uses its own JSON implementation, but I don't know how much of that is part of its public API and how much of that is simply used internally: http://vibed.org

- Jonathan M Davis
Jun 01 2014
Jacob Carlborg <doob me.com> writes:
On 02/06/14 13:36, w0rp wrote:

 In terms of API, I wouldn't go completely for an approach based on
 serialising to structs. Having a tagged union type is still helpful for
 situations where you just want to quickly get at some JSON data and do
 something with it. I have thought a great deal about writing data *to*
 JSON strings however, and I have an idea for this I would like to share.

I think there should be quite a minimal API, then a proper serialization module can be built on top of that. -- /Jacob Carlborg
Jun 02 2014
Jacob Carlborg <doob me.com> writes:
On 02/06/14 21:13, Sean Kelly wrote:
 The vibe.d parser is better, but it still creates a DOM-style
 tree of objects, which isn't acceptable in some circumstances.  I
 posted a performance comparison of the JSON parser I created for
 work use with std.json a while back, and mine is almost 100x
 faster than std.json in a simple test and allocates zero memory
 to boot:

 http://forum.dlang.org/thread/cyzcirslzcgnyxbyzycc forum.dlang.org#post-gxgeizjsurulklzftfqz:40forum.dlang.org


 I haven't tried it vs. the vibe.d parser, but I suspect it will
 still beat it by an order of magnitude or more because of the not
 allocating thing.

 I've said this a bunch of times, but what I want to see is a
 SAX-style parser as the bottom layer with an optional DOM-style
 parser built on top of it.  Then people who want the tree
 generated can get it, and people who want performance or don't
 want allocations can get that too.  I'm starting to wonder if I
 should just try and get permission from work to open source my
 parser so I can submit it.  Parsing JSON really isn't terribly
 difficult though.  It shouldn't take more than a few days for one
 of the more parser-oriented people here to produce something
 comparable.

Yes, exactly. -- /Jacob Carlborg
Jun 02 2014
Jacob Carlborg <doob me.com> writes:
On 02/06/14 21:13, Sean Kelly wrote:

 I'm starting to wonder if I
 should just try and get permission from work to open source my
 parser so I can submit it.

That would be awesome. Is it written in D or was it C++ ? -- /Jacob Carlborg
Jun 02 2014
Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 02.06.2014 21:13, Sean Kelly wrote:
 The vibe.d parser is better, but it still creates a DOM-style
 tree of objects, which isn't acceptable in some circumstances.  I
 posted a performance comparison of the JSON parser I created for
 work use with std.json a while back, and mine is almost 100x
 faster than std.json in a simple test and allocates zero memory
 to boot:

 http://forum.dlang.org/thread/cyzcirslzcgnyxbyzycc forum.dlang.org#post-gxgeizjsurulklzftfqz:40forum.dlang.org


 I haven't tried it vs. the vibe.d parser, but I suspect it will
 still beat it by an order of magnitude or more because of the not
 allocating thing.

 I've said this a bunch of times, but what I want to see is a
 SAX-style parser as the bottom layer with an optional DOM-style
 parser built on top of it.  Then people who want the tree
 generated can get it, and people who want performance or don't
 want allocations can get that too.  I'm starting to wonder if I
 should just try and get permission from work to open source my
 parser so I can submit it.  Parsing JSON really isn't terribly
 difficult though.  It shouldn't take more than a few days for one
 of the more parser-oriented people here to produce something
 comparable.

For some time now, the vibe.d parser has been able to serialize directly from and to string form, circumventing the DOM step and avoiding unnecessary allocations. But I agree that an intermediate SAX layer would be nice to have. Maybe even an additional StAX layer.
Jun 03 2014
Jacob Carlborg <doob me.com> writes:
On 03/06/14 09:15, Johannes Pfau wrote:

 I'd probably prefer a tokenizer/lexer as the lowest layer, then SAX and
 DOM implemented using the tokenizer. This way we can provide a kind of
 input range. I actually used Brian Schott's std.lexer proposal to build a
 simple JSON tokenizer/lexer and it worked quite well. But I don't
 think std.lexer is zero-allocation yet so that's an important drawback.

If I recall correctly it will allocate strings instead of slicing the input. The strings are then reused. If the input is sliced the whole input is retained in memory even if not all of the input is used. -- /Jacob Carlborg
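
The trade-off described here can be shown with plain D strings. The following is a minimal sketch with hypothetical helper names (sliceToken, copyToken), not code from std.lexer: a slice costs no allocation but points into the input buffer and so keeps it alive, while a copy costs an allocation but is independent of the input.

```d
import std.stdio : writeln;

/// Returns a token as a slice: no allocation, but the result points into `input`.
string sliceToken(string input, size_t start, size_t end)
{
    return input[start .. end];
}

/// Returns a token as a fresh copy: one allocation, independent of `input`.
string copyToken(string input, size_t start, size_t end)
{
    return input[start .. end].idup;
}

void main()
{
    string input = `{"title":"Fix issue"}`;
    auto sliced = sliceToken(input, 2, 7);
    auto copied = copyToken(input, 2, 7);

    writeln(sliced, " ", copied);

    // The slice aliases the input buffer, so the garbage collector has to
    // keep the whole buffer alive for as long as the slice is reachable.
    writeln(sliced.ptr is input.ptr + 2);

    // The copy lives in its own memory block.
    writeln(copied.ptr is input.ptr + 2);
}
```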
Jun 03 2014
"w0rp" <devw0rp gmail.com> writes:
On Monday, 2 June 2014 at 00:39:48 UTC, Jonathan M Davis via 
Digitalmars-d wrote:
 It's my understanding that the current design of std.json is
 considered to be poor, but I haven't used it, so I don't know any
 of the details. But if it's as slow as you're finding to be the
 case, then I think that supports the idea that it needs a
 redesign. The question then is what a new std.json should look
 like and who would do it. And that pretty much comes down to an
 interested and motivated developer coming up with and
 implementing a new design and then proposing it here. And until
 someone takes up that torch, we'll be stuck with what we have.
 Certainly, there's no fundamental reason why we can't have a
 lightning-fast std.json. With ranges and slices, parsing in D in
 general should be faster than C/C++ (and definitely faster than
 Haskell or Python), and if it isn't, that indicates that the
 implementation (if not the whole design) of that code needs to
 be redone.

 I know that vibe.d uses its own JSON implementation, but I don't
 know how much of that is part of its public API and how much of
 that is simply used internally: http://vibed.org

 - Jonathan M Davis

I implemented a JSON library myself which parses JSON and generates JSON objects similar to how std.json does now. I wrote it largely because of the poor API in the standard library at the time, but I think by this point nearly all of the concerns have been alleviated.

At the time I benchmarked it against std.json and vibe.d's implementation, and they were all pretty equivalent in terms of performance. I settled for edging just slightly ahead of std.json. If there are any major performance gains to make, I believe we will have to completely rethink how we go about parsing JSON. I suspect transparent character encoding and decoding (dchar ranges) might be one potential source of trouble.

In terms of API, I wouldn't go completely for an approach based on serialising to structs. Having a tagged union type is still helpful for situations where you just want to quickly get at some JSON data and do something with it. I have thought a great deal about writing data *to* JSON strings however, and I have an idea for this I would like to share.

First, you define by convention that there is a function writeJSON which takes some value and an OutputRange, and then writes the value in a JSON representation directly to the OutputRange. You define in the library writeJSON functions for standard types.

    void writeJSON(OutputRange)(JSONValue, OutputRange);
    void writeJSON(OutputRange)(string, OutputRange);
    void writeJSON(OutputRange)(int, OutputRange);
    void writeJSON(OutputRange)(bool, OutputRange);
    void writeJSON(OutputRange)(typeof(null), OutputRange);
    // ...

You define one additional writeJSON function, which takes any InputRange of type T and writes an array of Ts. (So string[] will write an array of strings, int[] will write ints, etc.)

    void writeJSON(InputRange, OutputRange)(InputRange inRange, OutputRange outRange) {
        foreach(ref value; inRange) {
            writeJSON(value, outRange);
        }
    }

Add a convenience function which takes variadic arguments alternating string, T, string, U, ... Call it, say, writeJSONObject.

You now have a decent framework for writing objects directly to OutputRanges.

    struct Foo {
        AnotherType bar;
        string stringValue;
        int intValue;
    }

    void writeJSON(OutputRange)(Foo foo, OutputRange outRange) {
        // Writes {"bar":<bar_value>, ... }
        writeJSONObject(outRange,
            // writeJSONObject calls writeJSON for AnotherType, etc.
            "bar", foo.bar,
            "stringValue", foo.stringValue,
            "intValue", foo.intValue
        );
    }

There are more details, and something would need to be done for handling stack overflows (inlining?), but that's the idea I had for improving writing JSON at least. One advantage of this approach would be that it wouldn't be dependent on the GC, and scoped buffers could be used. (A nogc candidate, I think.) You can't get this ability out of something like toJSON, which produces a string at once.
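
The convention described above can be sketched as compilable D. This is one possible reading under stated assumptions, not an existing library: Point stands in for AnotherType, template constraints replace the plain overloads so that bool and integer arguments cannot be confused, the sink is passed by ref, and string escaping covers only quotes and backslashes.

```d
import std.array : appender;
import std.format : formattedWrite;
import std.range.primitives : isInputRange, put;
import std.stdio : writeln;
import std.traits : isIntegral;

void writeJSON(T, Out)(T value, ref Out sink) if (is(T == bool))
{
    put(sink, value ? "true" : "false");
}

void writeJSON(T, Out)(T value, ref Out sink) if (isIntegral!T)
{
    formattedWrite(sink, "%d", value);
}

/// Simplified: only `"` and `\` are escaped, control characters are not.
void writeJSON(T, Out)(T value, ref Out sink) if (is(T == string))
{
    put(sink, '"');
    foreach (char c; value)
    {
        if (c == '"' || c == '\\')
            put(sink, '\\');
        put(sink, c);
    }
    put(sink, '"');
}

/// Any other input range is written as a JSON array of its elements.
void writeJSON(R, Out)(R range, ref Out sink)
    if (isInputRange!R && !is(R == string))
{
    put(sink, '[');
    bool first = true;
    foreach (value; range)
    {
        if (!first)
            put(sink, ',');
        first = false;
        writeJSON(value, sink);
    }
    put(sink, ']');
}

/// Writes alternating key/value arguments as a JSON object.
void writeJSONObject(Out, Args...)(ref Out sink, Args args)
    if (Args.length % 2 == 0)
{
    put(sink, '{');
    foreach (i, arg; args)
    {
        static if (i % 2 == 0)
        {
            static if (i > 0)
                put(sink, ',');
            writeJSON(arg, sink); // key, expected to be a string
            put(sink, ':');
        }
        else
            writeJSON(arg, sink); // value, dispatched on its type
    }
    put(sink, '}');
}

struct Point { int x; int y; } // hypothetical stand-in for AnotherType
struct Foo { Point bar; string stringValue; int intValue; }

void writeJSON(Out)(Point p, ref Out sink)
{
    writeJSONObject(sink, "x", p.x, "y", p.y);
}

void writeJSON(Out)(Foo foo, ref Out sink)
{
    writeJSONObject(sink, "bar", foo.bar,
        "stringValue", foo.stringValue, "intValue", foo.intValue);
}

void main()
{
    auto sink = appender!string();
    writeJSON(Foo(Point(1, 2), "hi", 3), sink);
    writeln(sink.data);
}
```

The sketch works because every overload lives in one module. D resolves the writeJSON calls inside writeJSONObject in the scope where the template is defined, so overloads added in a user's own module would not be found automatically; a library version would need some other customization point.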
Jun 02 2014
"w0rp" <devw0rp gmail.com> writes:
It's worth noting, "pretty printing" could be configured entirely 
in an OutputRange which watches for certain syntax coming into 
the range and inserts whitespace where it believes to be 
appropriate, so writeJSON functions would not need to know 
anything about pretty printing.
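
A pretty-printing sink along these lines might look like the following. It is a minimal sketch with hypothetical names (PrettySink, pretty): it forwards characters to an underlying sink and inserts line breaks and two-space indentation around structural characters that occur outside string literals. Empty objects and arrays are not special-cased.

```d
import std.array : appender;
import std.stdio : writeln;

/// Hypothetical output range that reformats compact JSON as it passes through.
struct PrettySink(Out)
{
    Out* sink;      // underlying output range
    size_t depth;   // current nesting level
    bool inString;  // true while inside a string literal
    bool escaped;   // true directly after a backslash inside a string

    private void newline()
    {
        sink.put('\n');
        foreach (_; 0 .. depth)
            sink.put("  ");
    }

    void put(char c)
    {
        if (inString)
        {
            // Inside a string literal everything is copied verbatim.
            sink.put(c);
            if (escaped)
                escaped = false;
            else if (c == '\\')
                escaped = true;
            else if (c == '"')
                inString = false;
            return;
        }

        switch (c)
        {
            case '"': inString = true; sink.put(c); break;
            case '{': case '[': sink.put(c); ++depth; newline(); break;
            case '}': case ']': --depth; newline(); sink.put(c); break;
            case ',': sink.put(c); newline(); break;
            case ':': sink.put(": "); break;
            default: sink.put(c); break;
        }
    }

    void put(const(char)[] text)
    {
        foreach (char c; text)
            put(c);
    }
}

/// Convenience wrapper: runs compact JSON text through a PrettySink.
string pretty(string compact)
{
    auto buffer = appender!string();
    auto sink = PrettySink!(typeof(buffer))(&buffer);
    sink.put(compact);
    return buffer.data;
}

void main()
{
    writeln(pretty(`{"a":[1,2],"b":"x,y"}`));
}
```

Because the wrapper only exposes put, the code that produces the JSON does not need to know whether its output is being reformatted.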
Jun 02 2014
"Chris Williams" <yoreanon-chrisw yahoo.co.jp> writes:
On Monday, 2 June 2014 at 00:39:48 UTC, Jonathan M Davis via 
Digitalmars-d wrote:
 I know that vibe.d uses its own json implementation, but I 
 don't know
 how much of that is part of its public API and how much of that 
 is
 simply used internally: http://vibed.org

In general, I've been pretty happy with vibe.d, and I've heard that the parser speed of the JSON implementation is good. But I must admit that I found the API to be fairly obtuse. In order to do much of anything, you really need to serialize/deserialize from structs. The JSON objects themselves are pretty impossible to modify.

I haven't looked at how vibe's parser works, but any very-fast parser would probably need to support an input stream, so that it can build out data in parallel to I/O, and do a lot of manual memory management. E.g. you probably want a stack of reusable node buffers that you use to add elements to as you scan the JSON tree, then clone off purpose-sized nodes from the work buffers when you encounter the end of the definition. Whereas, the current implementation in std.json only accepts a complete string and for each node starts with no memory and has to allocate/reallocate for every fresh piece of information.

Having worked with JSON libraries quite a bit, the key to a good one is the ability to refer to paths through the data. So besides the JSON objects themselves, you need something like a "struct JPath" that represents an array of strings and size_ts, which you can pass into get, set, has, and count methods. I'd view the lack of that as the larger issue with the current JSON implementations.
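
A path type of this kind could be sketched on top of std.json as follows. The names JPath, getAt and hasAt and the sample document are assumptions made for illustration; std.json offers no such API. Only lookup is shown, set and count would follow the same traversal.

```d
import std.json : JSONType, JSONValue, parseJSON;
import std.stdio : writeln;

/// Hypothetical path type: a sequence of object keys and array indices.
struct JPath
{
    private struct Step
    {
        bool isIndex;
        string key;
        size_t index;
    }

    private Step[] steps;

    /// Returns a new path extended by an object key.
    JPath key(string name)
    {
        return JPath(steps ~ Step(false, name, 0));
    }

    /// Returns a new path extended by an array index.
    JPath index(size_t position)
    {
        return JPath(steps ~ Step(true, null, position));
    }
}

/// Returns a pointer to the value at `path`, or null if any step is missing.
const(JSONValue)* getAt(ref const JSONValue root, JPath path)
{
    const(JSONValue)* current = &root;
    foreach (step; path.steps)
    {
        if (step.isIndex)
        {
            if (current.type != JSONType.array)
                return null;
            auto items = current.array;
            if (step.index >= items.length)
                return null;
            current = &items[step.index];
        }
        else
        {
            if (current.type != JSONType.object)
                return null;
            auto found = step.key in current.object;
            if (found is null)
                return null;
            current = found;
        }
    }
    return current;
}

/// True if every step of `path` exists in `root`.
bool hasAt(ref const JSONValue root, JPath path)
{
    return getAt(root, path) !is null;
}

void main()
{
    // Made-up sample data, loosely shaped like a list of pull requests.
    auto doc = parseJSON(`{"pulls":[{"title":"Fix lexer","user":{"login":"alice"}}]}`);
    auto path = JPath().key("pulls").index(0).key("user").key("login");

    auto login = getAt(doc, path);
    writeln(login is null ? "<missing>" : login.str);
    writeln(hasAt(doc, JPath().key("pulls").index(3)));
}
```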
Jun 02 2014
"Sean Kelly" <sean invisibleduck.org> writes:
The vibe.d parser is better, but it still creates a DOM-style
tree of objects, which isn't acceptable in some circumstances.  I
posted a performance comparison of the JSON parser I created for
work use with std.json a while back, and mine is almost 100x
faster than std.json in a simple test and allocates zero memory
to boot:

http://forum.dlang.org/thread/cyzcirslzcgnyxbyzycc forum.dlang.org#post-gxgeizjsurulklzftfqz:40forum.dlang.org

I haven't tried it vs. the vibe.d parser, but I suspect it will
still beat it by an order of magnitude or more because of the not
allocating thing.

I've said this a bunch of times, but what I want to see is a
SAX-style parser as the bottom layer with an optional DOM-style
parser built on top of it.  Then people who want the tree
generated can get it, and people who want performance or don't
want allocations can get that too.  I'm starting to wonder if I
should just try and get permission from work to open source my
parser so I can submit it.  Parsing JSON really isn't terribly
difficult though.  It shouldn't take more than a few days for one
of the more parser-oriented people here to produce something
comparable.
Jun 02 2014
"David Soria Parra" <dsp experimentalworks.net> writes:
On Monday, 2 June 2014 at 19:05:15 UTC, Chris Williams wrote:

 In general, I've been pretty happy with vibe.d, and I've heard 
 that the parser speed of the JSON implementation is good. But I 
 must admit that I found the API to be fairly obtuse. In order 
 to do much of anything, you really need to 
 serialize/deserialize from structs. The JSON objects themselves 
 are pretty impossible to modify.

 I haven't looked at how vibe's parser works, but any very-fast 
 parser would probably need to support an input stream, so that 
 it can build out data in parallel to I/O, and do a lot of 
 manual memory management. E.g. you probably want a stack of 
 reusable node buffers that you use to add elements to as you 
 scan the JSON tree, then clone off purpose-sized nodes from the 
 work buffers when you encounter the end of the definition. 
 Whereas, the current implementation in std.json only accepts a 
 complete string and for each node starts with no memory and has 
 to allocate/reallocate for every fresh piece of information.

 Having worked with JSON libraries quite a bit, the key to a 
 good one is the ability to refer to paths through the data. So 
 besides the JSON objects themselves, you need something like a 
 "struct JPath" that represents an array of strings and size_ts, 
 which you can pass into get, set, has, and count methods. I'd 
 view the lack of that as the larger issue with the current JSON 
 implementations.

I think the main question is: given that std.json is close to being unusable for anything serious due to its poor performance, can we come up with something faster that has the same API? I am not sure what Phobos's stance on backwards compatibility is, but I'd rather keep the API than break it for whoever is using std.json.
Jun 02 2014
"Chris Williams" <yoreanon-chrisw yahoo.co.jp> writes:
On Monday, 2 June 2014 at 20:10:52 UTC, David Soria Parra wrote:
 I think the main question is: given that std.json is close to 
 being unusable for anything serious due to its poor performance, 
 can we come up with something faster that has the same API? I 
 am not sure what Phobos's stance on backwards compatibility is, 
 but I'd rather keep the API than break it for whoever is using 
 std.json.

std.json really only has two functions, parseJSON and toJSON. Any implementation is going to have those two functions, so in terms of not breaking anything, you're pretty safe there.

Since it doesn't have any functions except those two, it really comes down to the underlying data structure. Right now, you have to read the source and understand the structure in order to operate on it, which is a hassle, but is presumably what people are doing. So maintaining the current structure would be the key necessity. I think that limits the optimizations which could be performed, but doesn't make them impossible.

Adding a stream-based parsing function would probably be the main optimization. That adds to the API, but is backwards compatible.

The module has a lot of inner functions and recursion. Reducing the number of function calls, using manual stack management instead of recursion, etc. might give another significant gain. How parseJSON() works is irrelevant to the caller, so all of that can be optimized to the heart's content.
Jun 02 2014
Johannes Pfau <nospam example.com> writes:
On Mon, 02 Jun 2014 19:13:07 +0000
"Sean Kelly" <sean invisibleduck.org> wrote:

 I've said this a bunch of times, but what I want to see is a
 SAX-style parser as the bottom layer with an optional DOM-style
 parser built on top of it. 

I'd probably prefer a tokenizer/lexer as the lowest layer, then SAX and DOM implemented using the tokenizer. This way we can provide a kind of input range. I actually used Brian Schott's std.lexer proposal to build a simple JSON tokenizer/lexer and it worked quite well. But I don't think std.lexer is zero-allocation yet, so that's an important drawback.
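
To make the layering concrete, a tokenizer exposed as an input range could look roughly like the sketch below. It is hand-written for illustration and is not based on std.lexer; the names TokenKind, Token and JSONLexer are hypothetical. Tokens are slices of the input, so lexing allocates nothing, and numbers and string escapes are only delimited, not validated.

```d
import std.ascii : isDigit, isWhite;
import std.stdio : writeln;

enum TokenKind
{
    objectStart, objectEnd, arrayStart, arrayEnd,
    colon, comma, string_, number, literal, invalid
}

struct Token
{
    TokenKind kind;
    string text; // slice of the original input, no copy
}

private bool isDelimiter(char c)
{
    return isWhite(c) || c == ',' || c == ':' || c == '"'
        || c == '{' || c == '}' || c == '[' || c == ']';
}

/// Input range of tokens over a complete JSON text.
struct JSONLexer
{
    private string input;
    private Token current;
    private bool done;

    this(string input)
    {
        this.input = input;
        popFront(); // load the first token
    }

    @property bool empty() const { return done; }
    @property Token front() const { return current; }

    void popFront()
    {
        // Skip whitespace between tokens.
        size_t start = 0;
        while (start < input.length && isWhite(input[start]))
            ++start;
        input = input[start .. $];
        if (input.length == 0)
        {
            done = true;
            return;
        }

        size_t end = 1;
        TokenKind kind;
        switch (input[0])
        {
            case '{': kind = TokenKind.objectStart; break;
            case '}': kind = TokenKind.objectEnd; break;
            case '[': kind = TokenKind.arrayStart; break;
            case ']': kind = TokenKind.arrayEnd; break;
            case ':': kind = TokenKind.colon; break;
            case ',': kind = TokenKind.comma; break;
            case '"':
                kind = TokenKind.string_;
                // Scan to the closing quote, stepping over escaped characters.
                while (end < input.length && input[end] != '"')
                    end += (input[end] == '\\' && end + 1 < input.length) ? 2 : 1;
                if (end < input.length)
                    ++end; // include the closing quote
                else
                    kind = TokenKind.invalid; // unterminated string
                break;
            default:
            {
                while (end < input.length && !isDelimiter(input[end]))
                    ++end;
                auto text = input[0 .. end];
                if (text == "true" || text == "false" || text == "null")
                    kind = TokenKind.literal;
                else if (isDigit(text[0]) || text[0] == '-')
                    kind = TokenKind.number;
                else
                    kind = TokenKind.invalid;
                break;
            }
        }

        current = Token(kind, input[0 .. end]);
        input = input[end .. $];
    }
}

void main()
{
    foreach (token; JSONLexer(`{"title": "Fix", "n": 42, "ok": true}`))
        writeln(token.kind, " ", token.text);
}
```

A SAX-style layer would then be a loop over this range that dispatches on token.kind, and a DOM builder would sit on top of that.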
Jun 03 2014
Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Mon, 02 Jun 2014 19:13:07 +0000
Sean Kelly via Digitalmars-d <digitalmars-d puremagic.com> wrote:

 I've said this a bunch of times, but what I want to see is a
 SAX-style parser as the bottom layer with an optional DOM-style
 parser built on top of it.  Then people who want the tree
 generated can get it, and people who want performance or don't
 want allocations can get that too.  I'm starting to wonder if I
 should just try and get permission from work to open source my
 parser so I can submit it.  Parsing JSON really isn't terribly
 difficult though.  It shouldn't take more than a few days for one
 of the more parser-oriented people here to produce something
 comparable.

Agreed, though it might make sense to have something even lower level than a SAX parser. Certainly, for XML, I'd implement something that just gave you a range of the attributes without any consideration for what you might do with them, whereas it's my understanding that a SAX parser uses callbacks which are triggered when it finds what you're looking for. A SAX parser and DOM parser could then be built on top of the simple range API.

I'd be looking to do something similar with JSON were I implementing a JSON parser, though since JSON is a bit different from XML in structure, I'm not quite sure what the lowest level API which would still be useful would be. I'd have to think about it. But in principle, I agree with what you're suggesting.

- Jonathan M Davis
Jun 03 2014
"Masahiro Nakagawa" <repeatedly gmail.com> writes:
On Monday, 2 June 2014 at 00:18:19 UTC, David Soria Parra wrote:
 Hi,

 I have recently had to deal with large amounts of JSON data in D.
 While doing that I've found that std.json is remarkably slow in
 comparison to other languages' standard JSON implementations. I've
 created a small and simple benchmark that takes a local copy of a
 GitHub API call
 "https://api.github.com/repos/D-Programming-Language/dmd/pulls",
 parses it 100 times, and writes each title to stdout.

 My results are as follows:
    ./d-test > /dev/null  3.54s user 0.02s system 99% cpu 3.560 total
    ./hs-test > /dev/null  0.02s user 0.00s system 93% cpu 0.023 total
    python test.py > /dev/null  0.77s user 0.02s system 99% cpu 0.792 total

 The concrete implementations (sorry for my terrible haskell
 implementation) can be found here:

     https://github.com/dsp/D-Json-Tests/

 This compares D's std.json with Haskell's Data.Aeson and Python's
 standard library json. I am a bit concerned about the current
 state of our JSON parser given that a lot of applications these
 days use JSON. I personally consider a high-speed implementation
 of JSON a critical part of a standard library.

 Would it make sense to start thinking about using ujson4c as an
 external library, or maybe to come up with a better implementation?
 I know Orvid has something and might add some analysis as to why
 std.json is slow. Any ideas or pointers as to how to start with
 that?

I don't know the status of other D-based JSON libraries. If you can install the yajl library, then yajl-d is another candidate.

% time ./yajl_test > /dev/null
./yajl_test > /dev/null  0.42s user 0.01s system 99% cpu 0.434 total
% time python test.py > /dev/null
python test.py > /dev/null  0.65s user 0.02s system 99% cpu 0.671 total
% time ./test > /dev/null
./test > /dev/null  3.10s user 0.02s system 99% cpu 3.125 total

    import yajl.yajl, std.datetime, std.file, std.stdio;

    void parse()
    {
        foreach (elem; readText("test.json").decode.array)
        {
            writeln(elem.object["title"]);
        }
    }

    int main(string[] args)
    {
        for (uint i = 0; i < 100; i++)
        {
            parse();
        }
        return 0;
    }

http://code.dlang.org/packages/yajl

NOTE: yajl-d doesn't expose yajl's SAX-style API, unlike Sean's implementation.
Jun 03 2014
"Masahiro Nakagawa" <repeatedly gmail.com> writes:
On Monday, 2 June 2014 at 00:18:19 UTC, David Soria Parra wrote:
 Hi,

 I have recently had to deal with large amounts of JSON data in D.
 While doing that I've found that std.json is remarkably slow in
 comparison to other languages' standard JSON implementations. I've
 created a small and simple benchmark that takes a local copy of a
 GitHub API call
 "https://api.github.com/repos/D-Programming-Language/dmd/pulls",
 parses it 100 times, and writes each title to stdout.

 My results are as follows:
    ./d-test > /dev/null  3.54s user 0.02s system 99% cpu 3.560 total
    ./hs-test > /dev/null  0.02s user 0.00s system 93% cpu 0.023 total
    python test.py > /dev/null  0.77s user 0.02s system 99% cpu 0.792 total

 The concrete implementations (sorry for my terrible haskell
 implementation) can be found here:

     https://github.com/dsp/D-Json-Tests/

 This compares D's std.json with Haskell's Data.Aeson and Python's
 standard library json. I am a bit concerned about the current
 state of our JSON parser given that a lot of applications these
 days use JSON. I personally consider a high-speed implementation
 of JSON a critical part of a standard library.

 Would it make sense to start thinking about using ujson4c as an
 external library, or maybe to come up with a better implementation?
 I know Orvid has something and might add some analysis as to why
 std.json is slow. Any ideas or pointers as to how to start with
 that?

BTW, an acquaintance of mine points out that your Haskell code differs from the other samples: it parses the JSON array only once, which is why it is so fast. He has uploaded a version with the same behaviour as the other samples, parsing the JSON array on each loop iteration. Please check it.

https://gist.github.com/maoe/e5f72c3cf3687610fe5c

Results in my environment:

% time ./new_test > /dev/null
./new_test > /dev/null  1.13s user 0.02s system 99% cpu 1.144 total
% time ./test > /dev/null
./test > /dev/null  0.02s user 0.00s system 91% cpu 0.023 total
Jun 03 2014
"Sean Kelly" <sean invisibleduck.org> writes:
On Tuesday, 3 June 2014 at 06:39:04 UTC, Jacob Carlborg wrote:
 On 02/06/14 21:13, Sean Kelly wrote:

 I'm starting to wonder if I
 should just try and get permission from work to open source my
 parser so I can submit it.

That would be awesome. Is it written in D or was it C++ ?

It's written in C, and so would need an overhaul regardless. The user basically assigns a bunch of function pointers for the callbacks. Using the parser at this level is really kind of difficult because you have to create a state machine for parsing anything reasonably complex, so what I usually do is nest calls to foreachObjectField and foreachArrayElem. I'm wondering if we can't do something similar here, but with corresponding ForwardRanges instead of the opApply style.
Jun 03 2014