
digitalmars.D - RFC: std.json successor

reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Following up on the recent "std.jgrandson" thread [1], I've picked up 
the work (a lot earlier than anticipated) and finished a first version 
of a loose blend of said std.jgrandson, vibe.data.json and some changes 
that I had planned for vibe.data.json for a while. I'm quite pleased by 
the results so far, although without a serialization framework it still 
misses a very important building block.

Code: https://github.com/s-ludwig/std_data_json
Docs: http://s-ludwig.github.io/std_data_json/
DUB: http://code.dlang.org/packages/std_data_json

The new code contains:
  - Lazy lexer in the form of a token input range (using slices of the
    input if possible)
  - Lazy streaming parser (StAX style) in the form of a node input range
  - Eager DOM style parser returning a JSONValue
  - Range based JSON string generator taking either a token range, a
    node range, or a JSONValue
  - Opt-out location tracking (line/column) for tokens, nodes and values
  - No opDispatch() for JSONValue - this has been shown to do more harm
    than good in vibe.data.json
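
A rough sketch of how these pieces compose, using the names mentioned in
this thread and the linked docs (the exact signatures are assumptions on
my part and may differ from the actual code):

    import stdx.data.json;

    void example()
    {
        string text = `{"name": "D", "year": 2014}`;

        // lazy token range over the raw input
        auto tokens = lexJSON(text);

        // lazy StAX-style node range
        auto nodes = parseJSONStream(text);

        // eager DOM-style parsing into a JSONValue
        JSONValue value = toJSONValue(text);

        // range based string generation from a JSONValue
        string s = value.toJSONString();
    }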

The DOM style JSONValue type is based on std.variant.Algebraic. This 
currently has a few usability issues that can be solved by 
upgrading/fixing Algebraic:

  - Operator overloading only works sporadically
  - No "tag" enum is supported, so that switch()ing on the type of a
    value doesn't work and an if-else cascade is required
  - Operations and conversions between different Algebraic types are not
    conveniently supported, which becomes important once other, similar
    formats (e.g. BSON) are supported

Assuming that those points are solved, I'd like to get some early 
feedback before going for an official review. One open issue is how to 
handle unescaping of string literals. Currently it always unescapes 
immediately, which is more efficient for general input ranges when the 
unescaped result is needed, but less efficient for string inputs when 
the unescaped result is not needed. Maybe a flag could be used to 
conditionally switch behavior depending on the input range type.
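
To illustrate the trade-off, here is a hypothetical helper (not part of
the package): slicing is essentially free for string inputs, while a
generic input range forces a copy, and thus an allocation, either way.

    import std.array : array;
    import std.range : take;
    import std.traits : isSomeString;

    // Hypothetical helper, for illustration only.
    auto rawToken(Input)(Input input, size_t len)
    {
        static if (isSomeString!Input)
            return input[0 .. len];        // zero-copy slice of the input
        else
            return input.take(len).array;  // generic ranges must allocate
    }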

Destroy away! ;)

[1]: http://forum.dlang.org/thread/lrknjl$co7$1 digitalmars.com
Aug 21 2014
next sibling parent reply "Brian Schott" <briancschott gmail.com> writes:
On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote:
 Destroy away! ;)
source/stdx/data/json/lexer.d(263:8)[warn]: 'JSONToken' has method 'opEquals', but not 'toHash'.
source/stdx/data/json/lexer.d(499:65)[warn]: Use parenthesis to clarify this expression.
source/stdx/data/json/parser.d(516:8)[warn]: 'JSONParserNode' has method 'opEquals', but not 'toHash'.
source/stdx/data/json/value.d(95:10)[warn]: Variable c is never used.
source/stdx/data/json/value.d(99:10)[warn]: Variable d is never used.
source/stdx/data/json/package.d(942:14)[warn]: Variable val is never used.

It's likely that you can ignore these, but I thought I'd post them 
anyways. (The last three are in unittest blocks, for example.)
Aug 21 2014
next sibling parent reply Justin Whear <justin economicmodeling.com> writes:
Someone needs to make a "showbrianmycode" bot: mention a D github repo 
and it runs static analysis for you.
Aug 21 2014
parent reply "Idan Arye" <GenericNPC gmail.com> writes:
On Thursday, 21 August 2014 at 23:27:28 UTC, Justin Whear wrote:
 Someone needs to make a "showbrianmycode" bot: mention a D 
 github repo
 and it runs static analysis for you.
Why bother with mentioning a GitHub repo? Just make the bot periodically scan the DUB registry.
Aug 21 2014
parent "Brian Schott" <briancschott gmail.com> writes:
On Thursday, 21 August 2014 at 23:33:35 UTC, Idan Arye wrote:
 Why bother with mentioning a GitHub repo? Just make the bot 
 periodically scan the DUB registry.
It's kind of picky. http://i.imgur.com/SHNAWnH.png
Aug 21 2014
prev sibling parent Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 00:48, schrieb Brian Schott:
 On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote:
 Destroy away! ;)
source/stdx/data/json/lexer.d(263:8)[warn]: 'JSONToken' has method 'opEquals', but not 'toHash'. source/stdx/data/json/lexer.d(499:65)[warn]: Use parenthesis to clarify this expression. source/stdx/data/json/parser.d(516:8)[warn]: 'JSONParserNode' has method 'opEquals', but not 'toHash'. source/stdx/data/json/value.d(95:10)[warn]: Variable c is never used. source/stdx/data/json/value.d(99:10)[warn]: Variable d is never used. source/stdx/data/json/package.d(942:14)[warn]: Variable val is never used. It's likely that you can ignore these, but I thought I'd post them anyways. (The last three are in unittest blocks, for example.)
Fixed all of them (none of them was causing harm, but it's still nicer 
that way). Also added @safe and nothrow where possible. BTW, does anyone 
know what's holding back formattedWrite() from being @safe for simple 
types?
Aug 22 2014
prev sibling next sibling parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
On 8/21/14, 7:35 PM, Sönke Ludwig wrote:
 Following up on the recent "std.jgrandson" thread [1], I've picked up
 the work (a lot earlier than anticipated) and finished a first version
 of a loose blend of said std.jgrandson, vibe.data.json and some changes
 that I had planned for vibe.data.json for a while. I'm quite pleased by
 the results so far, although without a serialization framework it still
 misses a very important building block.

 Code: https://github.com/s-ludwig/std_data_json
 Docs: http://s-ludwig.github.io/std_data_json/
 DUB: http://code.dlang.org/packages/std_data_json
Say I have a class Person with name (string) and age (int) with a 
constructor that receives both. How would I create an instance of a 
Person from a json with the json stream?

Suppose the json is this:

{"age": 10, "name": "John"}

And the class is this:

class Person {
   this(string name, int age) {
     // ...
   }
}
Aug 21 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 02:42, schrieb Ary Borenszweig:
 Say I have a class Person with name (string) and age (int) with a
 constructor that receives both. How would I create an instance of a
 Person from a json with the json stream?

 Suppose the json is this:

 {"age": 10, "name": "John"}

 And the class is this:

 class Person {
    this(string name, int age) {
      // ...
    }
 }
Without a serialization framework it would in theory work like this:

     JSONValue v = parseJSON(`{"age": 10, "name": "John"}`);
     auto p = new Person(v["name"].get!string, v["age"].get!int);

unfortunately the operator overloading doesn't work like this currently, 
so this is needed:

     JSONValue v = parseJSON(`{"age": 10, "name": "John"}`);
     auto p = new Person(
         v.get!(Json[string])["name"].get!string,
         v.get!(Json[string])["age"].get!int);

That should be solved together with the new module (it could of course 
also easily be added to JSONValue itself instead of Algebraic, but the 
value of having it in Algebraic would be much higher).
Aug 21 2014
parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
On 8/22/14, 3:33 AM, Sönke Ludwig wrote:
 Am 22.08.2014 02:42, schrieb Ary Borenszweig:
 Say I have a class Person with name (string) and age (int) with a
 constructor that receives both. How would I create an instance of a
 Person from a json with the json stream?

 Suppose the json is this:

 {"age": 10, "name": "John"}

 And the class is this:

 class Person {
    this(string name, int age) {
      // ...
    }
 }
Without a serialization framework it would in theory work like this: JSONValue v = parseJSON(`{"age": 10, "name": "John"}`); auto p = new Person(v["name"].get!string, v["age"].get!int); unfortunately the operator overloading doesn't work like this currently, so this is needed: JSONValue v = parseJSON(`{"age": 10, "name": "John"}`); auto p = new Person( v.get!(Json[string])["name"].get!string, v.get!(Json[string])["age"].get!int);
But does this parse the whole json into JSONValue? I want to create a Person without creating an intermediate JSONValue for the whole json. Can this be done?
Aug 22 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 16:53, schrieb Ary Borenszweig:
 On 8/22/14, 3:33 AM, Sönke Ludwig wrote:
 Without a serialization framework it would in theory work like this:

      JSONValue v = parseJSON(`{"age": 10, "name": "John"}`);
      auto p = new Person(v["name"].get!string, v["age"].get!int);

 unfortunately the operator overloading doesn't work like this currently,
 so this is needed:

      JSONValue v = parseJSON(`{"age": 10, "name": "John"}`);
      auto p = new Person(
          v.get!(Json[string])["name"].get!string,
          v.get!(Json[string])["age"].get!int);
But does this parse the whole json into JSONValue? I want to create a Person without creating an intermediate JSONValue for the whole json. Can this be done?
That would be done by the serialization framework. Instead of using 
parseJSON(), it could use parseJSONStream() to populate the Person 
instance on the fly, without putting the whole JSON into memory. But I'd 
like to leave that for a later addition, because we'd otherwise end up 
with duplicate functionality once std.serialization gets finalized.

Manually it would work similar to this:

    auto nodes = parseJSONStream(`{"age": 10, "name": "John"}`);
    with (JSONParserNode.Kind) {
        enforce(nodes.front == objectStart);
        nodes.popFront();
        while (nodes.front != objectEnd) {
            auto key = nodes.front.key;
            nodes.popFront();
            if (key == "name") person.name = nodes.front.literal.string;
            else if (key == "age") person.age = nodes.front.literal.number;
            nodes.popFront(); // advance past the value node
        }
    }
Aug 22 2014
parent Ary Borenszweig <ary esperanto.org.ar> writes:
On 8/22/14, 1:24 PM, Sönke Ludwig wrote:
 Am 22.08.2014 16:53, schrieb Ary Borenszweig:
 On 8/22/14, 3:33 AM, Sönke Ludwig wrote:
 Without a serialization framework it would in theory work like this:

      JSONValue v = parseJSON(`{"age": 10, "name": "John"}`);
      auto p = new Person(v["name"].get!string, v["age"].get!int);

 unfortunately the operator overloading doesn't work like this currently,
 so this is needed:

      JSONValue v = parseJSON(`{"age": 10, "name": "John"}`);
      auto p = new Person(
          v.get!(Json[string])["name"].get!string,
          v.get!(Json[string])["age"].get!int);
But does this parse the whole json into JSONValue? I want to create a Person without creating an intermediate JSONValue for the whole json. Can this be done?
That would be done by the serialization framework. Instead of using parseJSON(), it could use parseJSONStream() to populate the Person instance on the fly, without putting the whole JSON into memory. But I'd like to leave that for a later addition, because we'd otherwise end up with duplicate functionality once std.serialization gets finalized. Manually it would work similar to this: auto nodes = parseJSONStream(`{"age": 10, "name": "John"}`); with (JSONParserNode.Kind) { enforce(nodes.front == objectStart); nodes.popFront(); while (nodes.front != objectEnd) { auto key = nodes.front.key; nodes.popFront(); if (key == "name") person.name = nodes.front.literal.string; else if (key == "age") person.age = nodes.front.literal.number; } }
Cool, that looks good :-)
Aug 22 2014
prev sibling next sibling parent reply "Colden Cullen" <ColdenCullen gmail.com> writes:
I notice in the docs there are several references to a 
`parseJSON` and `parseJson`, but I can't seem to find where 
either of these are defined. Is this just a typo?

Hope this helps: 
https://github.com/s-ludwig/std_data_json/search?q=parseJson&type=Code
Aug 21 2014
parent Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 04:35, schrieb Colden Cullen:
 I notice in the docs there are several references to a `parseJSON` and
 `parseJson`, but I can't seem to find where either of these are defined.
 Is this just a typo?

 Hope this helps:
 https://github.com/s-ludwig/std_data_json/search?q=parseJson&type=Code
Seems like I forgot to replace a few mentions. They are called parseJSONValue and toJSONValue now for clarity.
Aug 21 2014
prev sibling next sibling parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 00:35, schrieb Sönke Ludwig:
 The DOM style JSONValue type is based on std.variant.Algebraic. This
 currently has a few usability issues that can be solved by
 upgrading/fixing Algebraic:

   - Operator overloading only works sporadically
   - (...)
   - Operations and conversions between different Algebraic types is not
     conveniently supported, which gets important when other similar
     formats get supported (e.g. BSON)
https://github.com/D-Programming-Language/phobos/pull/2452
https://github.com/D-Programming-Language/phobos/pull/2453

Those fix the most important operators, index access and binary arithmetic.
Aug 22 2014
parent reply "matovitch" <camille.brugel laposte.net> writes:
Very nice! I had started (and dropped) a JSON module based on 
Algebraic too. So without opDispatch you plan to use a syntax 
like jPerson["age"] = 10? You didn't use stdx.d.lexer. Any 
reason why? (I am asking even though I have never used this 
module and never coded much in D, in fact.)
Aug 22 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 14:17, schrieb matovitch:
 Very nice ! I had started (and dropped) a json module based on Algebraic
 too. So without opDispatch you plan to use a syntax like jPerson["age"]
 = 10 ? You didn't use stdx.d.lexer. Any reason why ? (I am asking even
 if I never used this module.(never coded much in D in fact))
Exactly, that's the syntax you'd use for JSONValue. But my favorite way 
to work with most JSON data is actually to directly read the JSON string 
into a D struct using a serialization framework and then access the 
struct in a strongly typed way. This has both less syntactic and less 
runtime overhead, and it also greatly reduces the chance of field 
name/type related bugs.

The module is written against current Phobos, which is why stdx.d.lexer 
wasn't really an option. I'm also unsure if std.lexer would be able to 
handle the parsing required for JSON numbers and strings. But it would 
certainly be nice already if at least the token structure could be 
reused. However, it should also be possible to find a painless migration 
path later, when std.lexer is actually part of Phobos.
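
For reference, this is roughly what the struct-based approach looks like
today with vibe.data.json's deserializeJson (shown only to illustrate the
idea; a future std.serialization API may well look different):

    import vibe.data.json : deserializeJson;

    struct Person
    {
        string name;
        int age;
    }

    void example()
    {
        // the JSON text is mapped directly onto the strongly typed
        // struct, without going through a generic DOM value first
        auto p = deserializeJson!Person(`{"age": 10, "name": "John"}`);
        assert(p.name == "John" && p.age == 10);
    }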
Aug 22 2014
parent reply "matovitch" <camille.brugel laposte.net> writes:
On Friday, 22 August 2014 at 12:39:08 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 14:17, schrieb matovitch:
 Very nice ! I had started (and dropped) a json module based on 
 Algebraic
 too. So without opDispatch you plan to use a syntax like 
 jPerson["age"]
 = 10 ? You didn't use stdx.d.lexer. Any reason why ? (I am 
 asking even
 if I never used this module.(never coded much in D in fact))
Exactly, that's the syntax you'd use for JSONValue. But my favorite way to work with most JSON data is actually to directly read the JSON string into a D struct using a serialization framework and then access the struct in a strongly typed way. This has both, less syntactic and less runtime overhead, and also greatly reduces the chance for field name/type related bugs.
Completely agree, I am waiting for a serializer too. I would love to see something like cap'n proto in D.
 The module is written against current Phobos, which is why 
 stdx.d.lexer wasn't really an option. I'm also unsure if 
 std.lexer would be able to handle the parsing required for JSON 
 numbers and strings. But it would certainly be nice already if 
 at least the token structure could be reused. However, it 
 should also be possible to find a painless migration path 
 later, when std.lexer is actually part of Phobos.
OK. I think I remember there was a JSON parser provided as a sample for 
stdx.d.lexer.
Aug 22 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 14:47, schrieb matovitch:
 Ok. I think I remember there was a stdx.d.lexer's Json parser provided
 as sample.
I see, so you just have to write your own number/string parsing routines: https://github.com/Hackerpilot/lexer-demo/blob/master/jsonlexer.d
Aug 22 2014
parent "matovitch" <camille.brugel laposte.net> writes:
On Friday, 22 August 2014 at 13:00:19 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 14:47, schrieb matovitch:
 Ok. I think I remember there was a stdx.d.lexer's Json parser 
 provided
 as sample.
I see, so you just have to write your own number/string parsing routines: https://github.com/Hackerpilot/lexer-demo/blob/master/jsonlexer.d
It's kind of "low level" indeed... I don't know what kind of black magic 
all these template mixins are doing, but the code looks quite clean.

Confusing:

    // Therefore, this always returns false.
    bool isSeparating(size_t offset) pure nothrow @safe { return true; }
Aug 22 2014
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2014-08-22 00:35, Sönke Ludwig wrote:
 Following up on the recent "std.jgrandson" thread [1], I've picked up
 the work (a lot earlier than anticipated) and finished a first version
 of a loose blend of said std.jgrandson, vibe.data.json and some changes
 that I had planned for vibe.data.json for a while. I'm quite pleased by
 the results so far, although without a serialization framework it still
 misses a very important building block.

 Code: https://github.com/s-ludwig/std_data_json
 Docs: http://s-ludwig.github.io/std_data_json/
 DUB: http://code.dlang.org/packages/std_data_json
* Opening braces should be put on their own line to follow Phobos style 
guides

* I'm wondering about the assert in lexer.d, line 160. What happens if 
two invalid tokens after each other occur?

* I think we have talked about this before, when reviewing D lexers. I'm 
thinking of how to handle invalid data. Is it the best solution to throw 
an exception? Would it be possible to return an error token and have the 
client decide what to do about? Shouldn't it be possible to build a JSON 
validator on this?

* The lexer seems to always convert JSON types to their native D types, 
is that wise to do? That's unnecessary if you're implementing syntax 
highlighting

-- 
/Jacob Carlborg
Aug 22 2014
next sibling parent "Marc Schütz" <schuetzm gmx.net> writes:
On Friday, 22 August 2014 at 15:47:51 UTC, Jacob Carlborg wrote:
 * I think we have talked about this before, when reviewing D 
 lexers. I'm thinking of how to handle invalid data. Is it the 
 best solution to throw an exception? Would it be possible to 
 return an error token and have the client decide what to do 
 about?
Hmm... my initial reaction was "not as default - it should throw on error, otherwise no one will check for errors". But if it's returning an error token, maybe it would be sufficient if that token throws when its value is accessed?
Aug 22 2014
prev sibling parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 17:47, schrieb Jacob Carlborg:
 On 2014-08-22 00:35, Sönke Ludwig wrote:
 Following up on the recent "std.jgrandson" thread [1], I've picked up
 the work (a lot earlier than anticipated) and finished a first version
 of a loose blend of said std.jgrandson, vibe.data.json and some changes
 that I had planned for vibe.data.json for a while. I'm quite pleased by
 the results so far, although without a serialization framework it still
 misses a very important building block.

 Code: https://github.com/s-ludwig/std_data_json
 Docs: http://s-ludwig.github.io/std_data_json/
 DUB: http://code.dlang.org/packages/std_data_json
* Opening braces should be put on their own line to follow Phobos style guides
Will do.
 * I'm wondering about the assert in lexer.d, line 160. What happens if
 two invalid tokens after each other occur?
There are actually no invalid tokens at all, the "invalid" enum value is only used to denote that no token is currently stored in _front. If readToken() doesn't throw, there will always be a valid token.
 * I think we have talked about this before, when reviewing D lexers. I'm
 thinking of how to handle invalid data. Is it the best solution to throw
 an exception? Would it be possible to return an error token and have the
 client decide what to do about? Shouldn't it be possible to build a JSON
 validator on this?
That would indeed be a possibility, it's how I used to handle it in my private version of std.lexer, too. It could also be made a compile time option.
 * The lexer seems to always convert JSON types to their native D types,
 is that wise to do? That's unnecessary if you're implementing syntax
 highlighting
It's basically the same trade-off as for unescaping string literals. For "string" inputs, it would be more efficient to just store a slice, but for generic input ranges it avoids the otherwise needed allocation. The proposed flag could make an improvement here, too.
Aug 22 2014
parent Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 18:13, schrieb Sönke Ludwig:
 Am 22.08.2014 17:47, schrieb Jacob Carlborg:
 * Opening braces should be put on their own line to follow Phobos style
 guides
Will do.
 * I'm wondering about the assert in lexer.d, line 160. What happens if
 two invalid tokens after each other occur?
There are actually no invalid tokens at all, the "invalid" enum value is only used to denote that no token is currently stored in _front. If readToken() doesn't throw, there will always be a valid token.
Renamed from "invalid" to "none" now to avoid confusion ->
 * I think we have talked about this before, when reviewing D lexers. I'm
 thinking of how to handle invalid data. Is it the best solution to throw
 an exception? Would it be possible to return an error token and have the
 client decide what to do about? Shouldn't it be possible to build a JSON
 validator on this?
That would indeed be a possibility, it's how I used to handle it in my private version of std.lexer, too. It could also be made a compile time option.
and an additional "error" kind has been added, which implements the above. Enabled using LexOptions.noThrow.
 * The lexer seems to always convert JSON types to their native D types,
 is that wise to do? That's unnecessary if you're implementing syntax
 highlighting
It's basically the same trade-off as for unescaping string literals. For "string" inputs, it would be more efficient to just store a slice, but for generic input ranges it avoids the otherwise needed allocation. The proposed flag could make an improvement here, too.
Aug 22 2014
prev sibling next sibling parent reply "Marc Schütz" <schuetzm gmx.net> writes:
Some thoughts about the API:

1) Instead of `parseJSONValue` and `lexJSON`, how about static 
methods `JSON.parse` and `JSON.lex`, or even a module level 
functions `std.data.json.parse` etc.? The "JSON" part of the name 
is redundant.

2) Also, `parseJSONValue` and `parseJSONStream` probably don't 
need to have different names. They can be distinguished by their 
parameter types.

3) `toJSONString` shouldn't just take a boolean as flag for 
pretty-printing. It should either use something like 
`Pretty.YES`, or the function should be called 
`toPrettyJSONString` (I believe I have seen this latter 
convention elsewhere).
We should also think about whether we can just call the functions 
`toString` and `toPrettyString`. Alternatively, `toJSON` and 
`toPrettyJSON` should be considered.
Aug 22 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 18:15, schrieb "Marc Schütz" <schuetzm gmx.net>":
 Some thoughts about the API:

 1) Instead of `parseJSONValue` and `lexJSON`, how about static methods
 `JSON.parse` and `JSON.lex`, or even a module level functions
 `std.data.json.parse` etc.? The "JSON" part of the name is redundant.
For those functions it may be acceptable, although I really dislike that style, because it makes the code harder to read (what exactly does this parse?) and the functions are rarely used, so typing the additional "JSON" should be no issue at all. On the other hand, if you always type "JSON.lex", it's more to type than just "lexJSON".

But for "[JSON]Value" it gets ugly really quickly, because "Value"s are such a common thing and quickly occur in multiple kinds in the same source file.
 2) Also, `parseJSONValue` and `parseJSONStream` probably don't need to
 have different names. They can be distinguished by their parameter types.
Actually they take exactly the same parameters and just differ in their return value. It would be more descriptive to name them parseAsJSONValue and parseAsJSONStream - or maybe parseJSONAsValue or parseJSONToValue? The current naming is somewhat modeled after std.conv's "to!T" and "parse!T".
 3) `toJSONString` shouldn't just take a boolean as flag for
 pretty-printing. It should either use something like `Pretty.YES`, or
 the function should be called `toPrettyJSONString` (I believe I have
 seen this latter convention elsewhere).
 We should also think about whether we can just call the functions
 `toString` and `toPrettyString`. Alternatively, `toJSON` and
 `toPrettyJSON` should be considered.
Agreed, a boolean isn't good for a public interface; renaming the current writeAsString to a private writeAsStringImpl and then adding "(writeAs/to)[Pretty]String" sounds reasonable. Actually, I've done it that way for vibe.data.json.
Aug 22 2014
parent reply "Marc Schütz" <schuetzm gmx.net> writes:
On Friday, 22 August 2014 at 16:48:44 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 18:15, schrieb "Marc Schütz" <schuetzm gmx.net>":
 Some thoughts about the API:

 1) Instead of `parseJSONValue` and `lexJSON`, how about static 
 methods
 `JSON.parse` and `JSON.lex`, or even a module level functions
 `std.data.json.parse` etc.? The "JSON" part of the name is 
 redundant.
For those functions it may be acceptable, although I really dislike that style, because it makes the code harder to read (what exactly does this parse?) and the functions are rarely used, so that that typing that additional "JSON" should be no issue at all. On the other hand, if you always type "JSON.lex" it's more to type than just "lexJSON".
I'm not really concerned about the amount of typing; it just seemed a bit odd to have the redundant JSON in there, as we have module names for namespacing. Your argument about readability is true nevertheless. But...
 But for "[JSON]Value" it gets ugly really quick, because 
 "Value"s are such a common thing and quickly occur in multiple 
 kinds in the same source file.

 2) Also, `parseJSONValue` and `parseJSONStream` probably don't 
 need to
 have different names. They can be distinguished by their 
 parameter types.
Actually they take exactly the same parameters and just differ in their return value. It would be more descriptive to name them parseAsJSONValue and parseAsJSONStream - or maybe parseJSONAsValue or parseJSONToValue? The current naming is somewhat modeled after std.conv's "to!T" and "parse!T".
... why not use exactly the same convention then? => `parse!JSONValue`

It would be nice to have a "pluggable" API where you just need to specify the type in a factory method to choose the input format. Then there could be `parse!BSON`, `parse!YAML`, with the same style as `parse!(int[])`.

I know this sounds a bit like bike-shedding, but the API shouldn't stand by itself; it should fit into the "big picture", especially as there will probably be other parsers (you already named the module std._data_.json).
Aug 22 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 19:24, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Friday, 22 August 2014 at 16:48:44 UTC, Sönke Ludwig wrote:
 Actually they take exactly the same parameters and just differ in
 their return value. It would be more descriptive to name them
 parseAsJSONValue and parseAsJSONStream - or maybe parseJSONAsValue or
 parseJSONToValue? The current naming is somewhat modeled after
 std.conv's "to!T" and "parse!T".
... why not use exactly the same convention then? => `parse!JSONValue` Would be nice to have a "pluggable" API where you just need to specify the type in a factory method to choose the input format. Then there could be `parse!BSON`, `parse!YAML`, with the same style as `parse!(int[])`. I know this sound a bit like bike-shedding, but the API shouldn't stand by itself, but fit into the "big picture", especially as there will probably be other parsers (you already named the module std._data_.json).
That would be nice, but then it should also work together with std.conv, which basically is exactly this pluggable API. Just like this it would result in an ambiguity error if both std.data.json and std.conv are imported at the same time. Is there a way to make std.conv work properly with JSONValue? I guess the only theoretical way would be to put something in JSONValue, but that would result in a slightly ugly cyclic dependency between parser.d and value.d.
Aug 22 2014
parent reply "Marc Schütz" <schuetzm gmx.net> writes:
On Friday, 22 August 2014 at 17:35:20 UTC, Sönke Ludwig wrote:
 ... why not use exactly the same convention then? => 
 `parse!JSONValue`

 Would be nice to have a "pluggable" API where you just need to 
 specify
 the type in a factory method to choose the input format. Then 
 there
 could be `parse!BSON`, `parse!YAML`, with the same style as
 `parse!(int[])`.

 I know this sound a bit like bike-shedding, but the API 
 shouldn't stand
 by itself, but fit into the "big picture", especially as there 
 will
 probably be other parsers (you already named the module 
 std._data_.json).
That would be nice, but then it should also work together with std.conv, which basically is exactly this pluggable API. Just like this it would result in an ambiguity error if both std.data.json and std.conv are imported at the same time. Is there a way to make std.conv work properly with JSONValue? I guess the only theoretical way would be to put something in JSONValue, but that would result in a slightly ugly cyclic dependency between parser.d and value.d.
The easiest and cleanest way would be to add a function in std.data.json:

    auto parse(Target, Source)(Source input)
        if(is(Target == JSONValue))
    {
        return ...;
    }

The various overloads of `std.conv.parse` already have mutually exclusive 
template constraints, so they will not collide with our function.
Aug 22 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 19:57, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Friday, 22 August 2014 at 17:35:20 UTC, Sönke Ludwig wrote:
 ... why not use exactly the same convention then? => `parse!JSONValue`

 Would be nice to have a "pluggable" API where you just need to specify
 the type in a factory method to choose the input format. Then there
 could be `parse!BSON`, `parse!YAML`, with the same style as
 `parse!(int[])`.

 I know this sound a bit like bike-shedding, but the API shouldn't stand
 by itself, but fit into the "big picture", especially as there will
 probably be other parsers (you already named the module
 std._data_.json).
That would be nice, but then it should also work together with std.conv, which basically is exactly this pluggable API. Just like this it would result in an ambiguity error if both std.data.json and std.conv are imported at the same time. Is there a way to make std.conv work properly with JSONValue? I guess the only theoretical way would be to put something in JSONValue, but that would result in a slightly ugly cyclic dependency between parser.d and value.d.
The easiest and cleanest way would be to add a function in std.data.json: auto parse(Target, Source)(Source input) if(is(Target == JSONValue)) { return ...; } The various overloads of `std.conv.parse` already have mutually exclusive template constraints, they will not collide with our function.
Okay, for parse that may work, but what about to!()?
Aug 22 2014
parent reply "Marc Schütz" <schuetzm gmx.net> writes:
On Friday, 22 August 2014 at 18:08:34 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 19:57, schrieb "Marc Schütz" <schuetzm gmx.net>":
 The easiest and cleanest way would be to add a function in 
 std.data.json:

     auto parse(Target, Source)(Source input)
         if(is(Target == JSONValue))
     {
         return ...;
     }

 The various overloads of `std.conv.parse` already have mutually
 exclusive template constraints, they will not collide with our 
 function.
Okay, for parse that may work, but what about to!()?
What's the problem with to!()?
Aug 22 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 21:00, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Friday, 22 August 2014 at 18:08:34 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 19:57, schrieb "Marc Schütz" <schuetzm gmx.net>":
 The easiest and cleanest way would be to add a function in
 std.data.json:

     auto parse(Target, Source)(Source input)
         if(is(Target == JSONValue))
     {
         return ...;
     }

 The various overloads of `std.conv.parse` already have mutually
 exclusive template constraints, they will not collide with our function.
Okay, for parse that may work, but what about to!()?
What's the problem with to!()?
to!() definitely doesn't have a template constraint that excludes JSONValue. Instead, it will convert any struct type that doesn't define toString() to a D-like representation.
Aug 23 2014
parent reply "Marc Schütz" <schuetzm gmx.net> writes:
On Saturday, 23 August 2014 at 16:49:23 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 21:00, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Friday, 22 August 2014 at 18:08:34 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 19:57, schrieb "Marc Schütz" 
 <schuetzm gmx.net>":
 The easiest and cleanest way would be to add a function in
 std.data.json:

    auto parse(Target, Source)(Source input)
        if(is(Target == JSONValue))
    {
        return ...;
    }

 The various overloads of `std.conv.parse` already have 
 mutually
 exclusive template constraints, they will not collide with 
 our function.
Okay, for parse that may work, but what about to!()?
What's the problem with to!()?
to!() definitely doesn't have a template constraint that excludes JSONValue. Instead, it will convert any struct type that doesn't define toString() to a D-like representation.
For converting a JSONValue to a different type, JSONValue can implement `opCast`, which is the regular interface that std.conv.to uses if it's available. For converting something _to_ a JSONValue, std.conv.to will simply create an instance of it by calling the constructor.
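
A minimal standalone illustration of that mechanism (a toy type, not the
actual JSONValue code):

    import std.conv : to;

    struct Wrapped
    {
        int payload;
        // std.conv.to picks up a user-defined opCast when one is available
        T opCast(T)() const if (is(T == int)) { return payload; }
    }

    unittest
    {
        assert(Wrapped(42).to!int == 42);
    }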
Aug 23 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 23.08.2014 19:25, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Saturday, 23 August 2014 at 16:49:23 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 21:00, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Friday, 22 August 2014 at 18:08:34 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 19:57, schrieb "Marc Schütz" <schuetzm gmx.net>":
 The easiest and cleanest way would be to add a function in
 std.data.json:

    auto parse(Target, Source)(Source input)
        if(is(Target == JSONValue))
    {
        return ...;
    }

 The various overloads of `std.conv.parse` already have mutually
 exclusive template constraints, they will not collide with our
 function.
Okay, for parse that may work, but what about to!()?
What's the problem with to!()?
to!() definitely doesn't have a template constraint that excludes JSONValue. Instead, it will convert any struct type that doesn't define toString() to a D-like representation.
For converting a JSONValue to a different type, JSONValue can implement `opCast`, which is the regular interface that std.conv.to uses if it's available. For converting something _to_ a JSONValue, std.conv.to will simply create an instance of it by calling the constructor.
That would just introduce the aforementioned dependency cycle between JSONValue, the parser and the lexer. Possible, but not particularly pretty. Also, using the JSONValue constructor to parse an input string would contradict the intuitive behavior of just storing the string value.
Aug 23 2014
parent reply "Marc Schütz" <schuetzm gmx.net> writes:
On Saturday, 23 August 2014 at 17:32:01 UTC, Sönke Ludwig wrote:
 Am 23.08.2014 19:25, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Saturday, 23 August 2014 at 16:49:23 UTC, Sönke Ludwig 
 wrote:
 Am 22.08.2014 21:00, schrieb "Marc Schütz" 
 <schuetzm gmx.net>":
 On Friday, 22 August 2014 at 18:08:34 UTC, Sönke Ludwig 
 wrote:
 Am 22.08.2014 19:57, schrieb "Marc Schütz" 
 <schuetzm gmx.net>":
 The easiest and cleanest way would be to add a function in
 std.data.json:

   auto parse(Target, Source)(Source input)
       if(is(Target == JSONValue))
   {
       return ...;
   }

 The various overloads of `std.conv.parse` already have 
 mutually
 exclusive template constraints, they will not collide with 
 our
 function.
Okay, for parse that may work, but what about to!()?
What's the problem with to!()?
to!() definitely doesn't have a template constraint that excludes JSONValue. Instead, it will convert any struct type that doesn't define toString() to a D-like representation.
For converting a JSONValue to a different type, JSONValue can implement `opCast`, which is the regular interface that std.conv.to uses if it's available. For converting something _to_ a JSONValue, std.conv.to will simply create an instance of it by calling the constructor.
That would just introduce the said dependency cycle between JSONValue, the parser and the lexer. Possible, but not particularly pretty. Also, using the JSONValue constructor to parse an input string would contradict the intuitive behavior to just store the string value.
That's what I expect it to do anyway. For parsing, there are already other functions. "mystring".to!JSONValue should just wrap "mystring".
Aug 23 2014
parent Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 23.08.2014 20:31, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Saturday, 23 August 2014 at 17:32:01 UTC, Sönke Ludwig wrote:
 Am 23.08.2014 19:25, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Saturday, 23 August 2014 at 16:49:23 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 21:00, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Friday, 22 August 2014 at 18:08:34 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 19:57, schrieb "Marc Schütz" <schuetzm gmx.net>":
 The easiest and cleanest way would be to add a function in
 std.data.json:

   auto parse(Target, Source)(Source input)
       if(is(Target == JSONValue))
   {
       return ...;
   }

 The various overloads of `std.conv.parse` already have mutually
 exclusive template constraints, they will not collide with our
 function.
Okay, for parse that may work, but what about to!()?
What's the problem with to!()?
to!() definitely doesn't have a template constraint that excludes JSONValue. Instead, it will convert any struct type that doesn't define toString() to a D-like representation.
For converting a JSONValue to a different type, JSONValue can implement `opCast`, which is the regular interface that std.conv.to uses if it's available. For converting something _to_ a JSONValue, std.conv.to will simply create an instance of it by calling the constructor.
That would just introduce the said dependency cycle between JSONValue, the parser and the lexer. Possible, but not particularly pretty. Also, using the JSONValue constructor to parse an input string would contradict the intuitive behavior to just store the string value.
That's what I expect it to do anyway. For parsing, there are already other functions. "mystring".to!JSONValue should just wrap "mystring".
Probably, but then to!() is inconsistent with parse!(). Usually they are both the same apart from how the tail of the input string is handled.
Aug 23 2014
prev sibling next sibling parent reply "Christian Manning" <cmanning999 gmail.com> writes:
It would be nice to have integers treated separately to doubles. 
I know it makes the number parsing simpler to just treat 
everything as double, but still, it could be annoying when you 
expect an integer type.

I'd also like to see some benchmarks, particularly against some 
of the high performance C++ parsers, i.e. rapidjson, gason, 
sajson. Or even some of the "not bad" performance parsers with 
better APIs, i.e. QJsonDocument, jsoncpp and jsoncons (slow but 
perhaps comparable interface to this proposal?).
Aug 22 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 18:31, schrieb Christian Manning:
 It would be nice to have integers treated separately to doubles. I know
 it makes the number parsing simpler to just treat everything as double,
 but still, it could be annoying when you expect an integer type.
That's how I've done it for vibe.data.json, too. For the new implementation, I've just used the number parsing routine from Andrei's std.jgrandson module. Does anybody have reservations about representing integers as "long" instead?
 I'd also like to see some benchmarks, particularly against some of the
 high performance C++ parsers, i.e. rapidjson, gason, sajson. Or even
 some of the "not bad" performance parsers with better APIs, i.e.
 QJsonDocument, jsoncpp and jsoncons (slow but perhaps comparable
 interface to this proposal?).
That would indeed be nice to have, but I'm not sure if I can manage to squeeze that in besides finishing the module itself. My time frame for working on this is quite limited.
Aug 22 2014
parent reply "Marc Schütz" <schuetzm gmx.net> writes:
On Friday, 22 August 2014 at 16:56:26 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 18:31, schrieb Christian Manning:
 It would be nice to have integers treated separately to 
 doubles. I know
 it makes the number parsing simpler to just treat everything 
 as double,
 but still, it could be annoying when you expect an integer 
 type.
That's how I've done it for vibe.data.json, too. For the new implementation, I've just used the number parsing routine from Andrei's std.jgrandson module. Does anybody have reservations about representing integers as "long" instead?
It should automatically fall back to double on overflow. Maybe even use BigInt if applicable?
Aug 22 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 19:27, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Friday, 22 August 2014 at 16:56:26 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 18:31, schrieb Christian Manning:
 It would be nice to have integers treated separately to doubles. I know
 it makes the number parsing simpler to just treat everything as double,
 but still, it could be annoying when you expect an integer type.
That's how I've done it for vibe.data.json, too. For the new implementation, I've just used the number parsing routine from Andrei's std.jgrandson module. Does anybody have reservations about representing integers as "long" instead?
It should automatically fall back to double on overflow. Maybe even use BigInt if applicable?
I guess BigInt + exponent would be the only lossless way to represent any JSON number. That could then be converted to any desired smaller type as required. But checking for overflow during number parsing would definitely have an impact on parsing speed, as would using a BigInt, of course, so the question is how we want to set up the trade-off here (or whether there is another way that is overhead-free).
Aug 22 2014
next sibling parent reply "Marc Schütz" <schuetzm gmx.net> writes:
On Friday, 22 August 2014 at 17:45:03 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 19:27, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Friday, 22 August 2014 at 16:56:26 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 18:31, schrieb Christian Manning:
 It would be nice to have integers treated separately to 
 doubles. I know
 it makes the number parsing simpler to just treat everything 
 as double,
 but still, it could be annoying when you expect an integer 
 type.
That's how I've done it for vibe.data.json, too. For the new implementation, I've just used the number parsing routine from Andrei's std.jgrandson module. Does anybody have reservations about representing integers as "long" instead?
It should automatically fall back to double on overflow. Maybe even use BigInt if applicable?
I guess BigInt + exponent would be the only lossless way to represent any JSON number. That could then be converted to any desired smaller type as required. But checking for overflow during number parsing would definitely have an impact on parsing speed, as well as using a BigInt of course, so the question is how we want set up the trade off here (or if there is another way that is overhead-free).
As the functions will be templatized anyway, it should include a flags parameter. These and possible future extensions can then be selected by the user.
Aug 22 2014
parent Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 20:01, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Friday, 22 August 2014 at 17:45:03 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 19:27, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Friday, 22 August 2014 at 16:56:26 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 18:31, schrieb Christian Manning:
 It would be nice to have integers treated separately to doubles. I
 know
 it makes the number parsing simpler to just treat everything as
 double,
 but still, it could be annoying when you expect an integer type.
That's how I've done it for vibe.data.json, too. For the new implementation, I've just used the number parsing routine from Andrei's std.jgrandson module. Does anybody have reservations about representing integers as "long" instead?
It should automatically fall back to double on overflow. Maybe even use BigInt if applicable?
I guess BigInt + exponent would be the only lossless way to represent any JSON number. That could then be converted to any desired smaller type as required. But checking for overflow during number parsing would definitely have an impact on parsing speed, as well as using a BigInt of course, so the question is how we want set up the trade off here (or if there is another way that is overhead-free).
As the functions will be templatized anyway, it should include a flags parameter. These and possible future extensions can then be selected by the user.
I'm actually in the process of converting the "track_location" parameter to a flags enum and adding support for an error token, so this would fit right in.
Aug 22 2014
prev sibling parent reply "Christian Manning" <cmanning999 gmail.com> writes:
On Friday, 22 August 2014 at 17:45:03 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 19:27, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Friday, 22 August 2014 at 16:56:26 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 18:31, schrieb Christian Manning:
 It would be nice to have integers treated separately to 
 doubles. I know
 it makes the number parsing simpler to just treat everything 
 as double,
 but still, it could be annoying when you expect an integer 
 type.
That's how I've done it for vibe.data.json, too. For the new implementation, I've just used the number parsing routine from Andrei's std.jgrandson module. Does anybody have reservations about representing integers as "long" instead?
It should automatically fall back to double on overflow. Maybe even use BigInt if applicable?
I guess BigInt + exponent would be the only lossless way to represent any JSON number. That could then be converted to any desired smaller type as required. But checking for overflow during number parsing would definitely have an impact on parsing speed, as well as using a BigInt of course, so the question is how we want set up the trade off here (or if there is another way that is overhead-free).
You could check for a decimal point and for a 0 at the front (excluding a possible - sign); either would indicate a double, making the reasonable assumption that anything else will fit in a long.
Aug 22 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 21:48, schrieb Christian Manning:
 On Friday, 22 August 2014 at 17:45:03 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 19:27, schrieb "Marc Schütz" <schuetzm gmx.net>":
 On Friday, 22 August 2014 at 16:56:26 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 18:31, schrieb Christian Manning:
 It would be nice to have integers treated separately to doubles. I
 know
 it makes the number parsing simpler to just treat everything as
 double,
 but still, it could be annoying when you expect an integer type.
That's how I've done it for vibe.data.json, too. For the new implementation, I've just used the number parsing routine from Andrei's std.jgrandson module. Does anybody have reservations about representing integers as "long" instead?
It should automatically fall back to double on overflow. Maybe even use BigInt if applicable?
I guess BigInt + exponent would be the only lossless way to represent any JSON number. That could then be converted to any desired smaller type as required. But checking for overflow during number parsing would definitely have an impact on parsing speed, as well as using a BigInt of course, so the question is how we want set up the trade off here (or if there is another way that is overhead-free).
You could check for a decimal point and a 0 at the front (excluding possible - sign), either would indicate a double, making the reasonable assumption that anything else will fit in a long.
Yes, no decimal point + no exponent would work without overhead to detect integers, but that wouldn't solve the proposed automatic long->double overflow, which is what I meant. My current idea is to default to double and optionally support any of long, BigInt and "Decimal" (BigInt+exponent), where integer overflow only works for long->BigInt.
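
A sketch of those two ideas (not the actual implementation; "Decimal" is
just the hypothetical lossless fallback mentioned above):

    import std.algorithm : canFind;
    import std.bigint : BigInt;

    // value = mantissa * 10 ^^ exponent
    struct Decimal
    {
        BigInt mantissa;
        int exponent;
    }

    // "no decimal point + no exponent" integer detection
    bool looksLikeInteger(const(char)[] num)
    {
        return !num.canFind('.') && !num.canFind('e') && !num.canFind('E');
    }

    unittest
    {
        assert(looksLikeInteger("-42"));
        assert(!looksLikeInteger("10.5"));
        assert(!looksLikeInteger("1e10"));
    }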
Aug 22 2014
next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Friday, 22 August 2014 at 20:02:41 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 21:48, schrieb Christian Manning:
 On Friday, 22 August 2014 at 17:45:03 UTC, Sönke Ludwig wrote:
 Am 22.08.2014 19:27, schrieb "Marc Schütz" 
 <schuetzm gmx.net>":
 On Friday, 22 August 2014 at 16:56:26 UTC, Sönke Ludwig 
 wrote:
 Am 22.08.2014 18:31, schrieb Christian Manning:
 It would be nice to have integers treated separately to 
 doubles. I
 know
 it makes the number parsing simpler to just treat 
 everything as
 double,
 but still, it could be annoying when you expect an integer 
 type.
That's how I've done it for vibe.data.json, too. For the new implementation, I've just used the number parsing routine from Andrei's std.jgrandson module. Does anybody have reservations about representing integers as "long" instead?
It should automatically fall back to double on overflow. Maybe even use BigInt if applicable?
I guess BigInt + exponent would be the only lossless way to represent any JSON number. That could then be converted to any desired smaller type as required. But checking for overflow during number parsing would definitely have an impact on parsing speed, as well as using a BigInt of course, so the question is how we want set up the trade off here (or if there is another way that is overhead-free).
You could check for a decimal point and a 0 at the front (excluding possible - sign), either would indicate a double, making the reasonable assumption that anything else will fit in a long.
Yes, no decimal point + no exponent would work without overhead to detect integers, but that wouldn't solve the proposed automatic long->double overflow, which is what I meant. My current idea is to default to double and optionally support any of long, BigInt and "Decimal" (BigInt+exponent), where integer overflow only works for long->BigInt.
It might be the right choice anyway (seeing as JSON/JS do overflow to 
double), but FWIW it's still atrocious.

    double a = long.max;
    assert(iota(1, 1000000).map!(d => (a+d)-a).until!"a != 0".walkLength == 1024);

Yuk. Floating point numbers and integers are so completely different in 
behaviour that it's just dishonest to transparently switch between the 
two. This is especially the case for overflow from long -> double, where 
by definition you're 10 bits past being able to reliably and accurately 
represent the integer in question.
Aug 22 2014
prev sibling parent "Christian Manning" <cmanning999 gmail.com> writes:
 Yes, no decimal point + no exponent would work without overhead 
 to detect integers, but that wouldn't solve the proposed 
 automatic long->double overflow, which is what I meant. My 
 current idea is to default to double and optionally support any 
 of long, BigInt and "Decimal" (BigInt+exponent), where integer 
 overflow only works for long->BigInt.
Ah I see. I have to say, if you are going to treat integers and floating 
point numbers differently, then you should store them differently: long 
should be used to store integers, double for floating point numbers.

A 64 bit signed integer (long) is a totally reasonable limitation for 
integers, but even that would lose precision stored as a double as you 
are proposing (if I'm understanding right). I don't think BigInt needs to 
be brought into this at all really. In the case of integers met in the 
parser which are too large/small to fit in a long, give an error IMO. 
Such integers should be (and are by other libs IIRC) serialised in the 
form "1.234e-123" to force double parsing, perhaps losing precision at 
that stage rather than invisibly inside the library.

The size of JSON numbers is implementation defined and the whole thing 
shouldn't be degraded in both performance and usability to cover JSON 
serialisers that go beyond common native number types.

Of course, you are free to do whatever you like :)
Aug 22 2014
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/21/2014 3:35 PM, Sönke Ludwig wrote:
 Destroy away! ;)
Thanks for taking this on! This is valuable work. On to destruction!

I'm looking at:

http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/lexJSON.html

I anticipate this will be used a LOT and in very high speed demanding 
applications. With that in mind,

1. There's no mention of what will happen if it is passed malformed JSON 
strings. I presume an exception is thrown. Exceptions are both slow and 
consume GC memory. I suggest an alternative would be to emit an "Error" 
token instead; this would be much like how the UTF decoding algorithms 
emit a "replacement char" for invalid UTF sequences.

2. The escape sequenced strings presumably consume GC memory. This will 
be a problem for high performance code. I suggest either leaving them 
undecoded in the token stream, and letting higher level code decide what 
to do about them, or providing a hook that the user can override with 
his own allocation scheme.

If we don't make it possible to use std.json without invoking the GC, I 
believe the module will fail in the long term.
Aug 22 2014
next sibling parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
Am 22.08.2014 20:08, schrieb Walter Bright:
 On 8/21/2014 3:35 PM, Sönke Ludwig wrote:
 Destroy away! ;)
Thanks for taking this on! This is valuable work. On to destruction! I'm looking at: http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/lexJSON.html I anticipate this will be used a LOT and in very high speed demanding applications. With that in mind, 1. There's no mention of what will happen if it is passed malformed JSON strings. I presume an exception is thrown. Exceptions are both slow and consume GC memory. I suggest an alternative would be to emit an "Error" token instead; this would be much like how the UTF decoding algorithms emit a "replacement char" for invalid UTF sequences.
The latest version now features a LexOptions.noThrow option which causes an error token to be emitted instead. After popping the error token, the range is always empty.
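
A sketch of what consuming that could look like (the option and kind
names are the ones mentioned in this thread; the exact lexJSON signature
is an assumption):

    void consume(string text)
    {
        auto tokens = lexJSON!(LexOptions.noThrow)(text);
        foreach (t; tokens)
        {
            if (t.kind == JSONToken.Kind.error)
            {
                // handle the malformed input; the range ends after this token
                break;
            }
            // ... process the valid token ...
        }
    }
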
 2. The escape sequenced strings presumably consume GC memory. This will
 be a problem for high performance code. I suggest either leaving them
 undecoded in the token stream, and letting higher level code decide what
 to do about them, or provide a hook that the user can override with his
 own allocation scheme.
The problem is that it really depends on the use case and on the type of input stream which approach is more efficient (storing the escaped version of a string might require *two* allocations if the input range cannot be sliced and if the decoded string is then requested by the parser). My current idea therefore is to simply make this configurable, too. Enabling the use of custom allocators should be easily possible as an add-on functionality later on. At least my suggestion would be to wait with this until we have a finished std.allocator module.
Aug 22 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/22/2014 2:27 PM, Sönke Ludwig wrote:
 Am 22.08.2014 20:08, schrieb Walter Bright:
 1. There's no mention of what will happen if it is passed malformed JSON
 strings. I presume an exception is thrown. Exceptions are both slow and
 consume GC memory. I suggest an alternative would be to emit an "Error"
 token instead; this would be much like how the UTF decoding algorithms
 emit a "replacement char" for invalid UTF sequences.
The latest version now features a LexOptions.noThrow option which causes an error token to be emitted instead. After popping the error token, the range is always empty.
Having a nothrow option may prevent the functions from being attributed as "nothrow". But in any case, to worship at the Altar Of Composability, the error token could always be emitted, with a separate algorithm provided that passes through all non-error tokens and throws if it sees an error token.
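A sketch of such a composable wrapper, just to illustrate the idea (the token kind name and option spelling are assumptions, not the module's actual API):

// pass non-error tokens through unchanged, throw as soon as an error token appears
auto throwOnError(R)(R tokens)
{
    import std.algorithm : map;
    import std.exception : enforce;
    return tokens.map!((t) {
        enforce(t.kind != JSONToken.Kind.error, "malformed JSON input");
        return t;
    });
}

// usage: auto checked = lexJSON!(LexOptions.noThrow)(input).throwOnError;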
 2. The escape sequenced strings presumably consume GC memory. This will
 be a problem for high performance code. I suggest either leaving them
 undecoded in the token stream, and letting higher level code decide what
 to do about them, or provide a hook that the user can override with his
 own allocation scheme.
The problem is that it really depends on the use case and on the type of input stream which approach is more efficient (storing the escaped version of a string might require *two* allocations if the input range cannot be sliced and if the decoded string is then requested by the parser). My current idea therefore is to simply make this configurable, too. Enabling the use of custom allocators should be easily possible as an add-on functionality later on. At least my suggestion would be to wait with this until we have a finished std.allocator module.
I'm worried that std.allocator is stalled and we'll be digging ourselves deeper into needing to revise things later to remove GC usage. I'd really like to find a way to abstract the allocation away from the algorithm.
Aug 22 2014
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/22/2014 6:05 PM, Walter Bright wrote:
 The problem is that it really depends on the use case and on the type of input
 stream which approach is more efficient (storing the escaped version of a
string
 might require *two* allocations if the input range cannot be sliced and if the
 decoded string is then requested by the parser). My current idea therefore is
to
 simply make this configurable, too.

 Enabling the use of custom allocators should be easily possible as an add-on
 functionality later on. At least my suggestion would be to wait with this until
 we have a finished std.allocator module.
Another possibility is to have the user pass in a resizeable buffer which then will be used to store the strings in as necessary. One example is std.internal.scopebuffer. The nice thing about that is the user can use the stack for the storage, which works out to be very, very fast.
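For reference, a minimal sketch of the kind of usage being suggested (ScopeBuffer lives in std.internal.scopebuffer and is not a public Phobos API, so treat the details as approximate):

import std.internal.scopebuffer;

void fillBuffer()
{
    char[128] stack = void;                // storage lives on the stack
    auto buf = ScopeBuffer!char(stack[]);
    scope (exit) buf.free();               // falls back to the C heap only if it outgrows 'stack'
    buf.put("unescaped string data");
    // buf[] gives a slice of what has been written so far
}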
Aug 22 2014
parent reply "Ola Fosheim Gr" <ola.fosheim.grostad+dlang gmail.com> writes:
On Saturday, 23 August 2014 at 02:30:23 UTC, Walter Bright wrote:
 Another possibility is to have the user pass in a resizeable 
 buffer which then will be used to store the strings in as 
 necessary.

 One example is std.internal.scopebuffer. The nice thing about 
 that is the user can use the stack for the storage, which works 
 out to be very, very fast.
Does this mean that D is getting resizable stack allocations in lower stack frames? That has a lot of implications for code gen.
Aug 22 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/22/2014 9:01 PM, Ola Fosheim Gr wrote:
 On Saturday, 23 August 2014 at 02:30:23 UTC, Walter Bright wrote:
 One example is std.internal.scopebuffer. The nice thing about that is the user
 can use the stack for the storage, which works out to be very, very fast.
Does this mean that D is getting resizable stack allocations in lower stack frames? That has a lot of implications for code gen.
scopebuffer does not require resizeable stack allocations.
Aug 22 2014
parent reply "Ola Fosheim Gr" <ola.fosheim.grostad+dlang gmail.com> writes:
On Saturday, 23 August 2014 at 04:36:34 UTC, Walter Bright wrote:
 On 8/22/2014 9:01 PM, Ola Fosheim Gr wrote:
 Does this mean that D is getting resizable stack allocations 
 in lower stack
 frames? That has a lot of implications for code gen.
scopebuffer does not require resizeable stack allocations.
So you cannot use the stack for resizable allocations. That would however be a nice optimization. Iff an algorithm has only one alloca, can be inlined in a way which does not extend the stack, and uses a resizable buffer that grows downwards in memory, then you can have a resizable buffer on the stack:

HIMEM
...
Algorithm stack frame vars
Inlined vars
Buffer head/book keeping vars
Buffer end
Buffer front
...add to front here...
End of stack
LOMEM
Aug 22 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/22/2014 9:48 PM, Ola Fosheim Gr wrote:
 On Saturday, 23 August 2014 at 04:36:34 UTC, Walter Bright wrote:
 On 8/22/2014 9:01 PM, Ola Fosheim Gr wrote:
 Does this mean that D is getting resizable stack allocations in lower stack
 frames? That has a lot of implications for code gen.
scopebuffer does not require resizeable stack allocations.
So you cannot use the stack for resizable allocations.
Please, take a look at how scopebuffer works.
Aug 22 2014
parent reply "Ola Fosheim Gr" <ola.fosheim.grostad+dlang gmail.com> writes:
On Saturday, 23 August 2014 at 05:28:55 UTC, Walter Bright wrote:
 On 8/22/2014 9:48 PM, Ola Fosheim Gr wrote:
 On Saturday, 23 August 2014 at 04:36:34 UTC, Walter Bright 
 wrote:
 On 8/22/2014 9:01 PM, Ola Fosheim Gr wrote:
 Does this mean that D is getting resizable stack allocations 
 in lower stack
 frames? That has a lot of implications for code gen.
scopebuffer does not require resizeable stack allocations.
So you cannot use the stack for resizable allocations.
Please, take a look at how scopebuffer works.
I have? It requires an upper bound to stay on the stack, which creates a big hole in the stack. I don't think wasting the stack or moving to the heap is a nice, predictable solution. It would be better to just have a couple of regions that do "reverse" stack allocations, but the most efficient solution is the one I outlined.

With JSON you might be able to create an upper bound of, say, 4-8 times the size of the source iff you know the file size. You don't if you are streaming.

(scopebuffer is too unpredictable for real time; a pure stack solution is predictable)
Aug 22 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/22/2014 11:25 PM, Ola Fosheim Gr wrote:
 On Saturday, 23 August 2014 at 05:28:55 UTC, Walter Bright wrote:
 On 8/22/2014 9:48 PM, Ola Fosheim Gr wrote:
 On Saturday, 23 August 2014 at 04:36:34 UTC, Walter Bright wrote:
 On 8/22/2014 9:01 PM, Ola Fosheim Gr wrote:
 Does this mean that D is getting resizable stack allocations in lower stack
 frames? That has a lot of implications for code gen.
scopebuffer does not require resizeable stack allocations.
So you cannot use the stack for resizable allocations.
Please, take a look at how scopebuffer works.
I have? It requires an upperbound to stay on the stack, that creates a big hole in the stack. I don't think wasting the stack or moving to the heap is a nice predictable solution. It would be better to just have a couple of regions that do "reverse" stack allocations, but the most efficient solution is the one I outlined.
Scopebuffer is extensively used in Warp, and works very well. The "hole" in the stack is not a significant problem.
 With json you might be able to create an upperbound of say 4-8 times the size
of
 the source iff you know the file size. You don't if you are streaming.

 (scopebuffer is too unpredictable for real time, a pure stack solution is
 predictable)
You can always implement your own buffering system and pass it in - that's the point, it's under user control.
Aug 22 2014
parent "Ola Fosheim Gr" <ola.fosheim.grostad+dlang gmail.com> writes:
On Saturday, 23 August 2014 at 06:41:11 UTC, Walter Bright wrote:
 Scopebuffer is extensively used in Warp, and works very well. 
 The "hole" in the stack is not a significant problem.
Well, on a webserver you don't want to push out the caches for no good reason.
 You can always implement your own buffering system and pass it 
 in - that's the point, it's under user control.
My point is that you need compiler support to get good buffering options on the stack. Something like an alloca_inline:

auto buffer = alloca_inline getstuff();
process(buffer);

I think all memory allocation should be under compiler control; the library solutions are bound to be suboptimal, i.e. slower.
Aug 22 2014
prev sibling parent =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
Am 23.08.2014 03:05, schrieb Walter Bright:
 On 8/22/2014 2:27 PM, Sönke Ludwig wrote:
 Am 22.08.2014 20:08, schrieb Walter Bright:
 1. There's no mention of what will happen if it is passed malformed JSON
 strings. I presume an exception is thrown. Exceptions are both slow and
 consume GC memory. I suggest an alternative would be to emit an "Error"
 token instead; this would be much like how the UTF decoding algorithms
 emit a "replacement char" for invalid UTF sequences.
The latest version now features a LexOptions.noThrow option which causes an error token to be emitted instead. After popping the error token, the range is always empty.
Having a nothrow option may prevent the functions from being attributed as "nothrow".
It's a compile time option, so that shouldn't be an issue. There is also just a single "throw" statement in the source, so it's easy to isolate.
Aug 23 2014
prev sibling parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 22.08.2014 20:08, Walter Bright wrote:
 (...)
 2. The escape sequenced strings presumably consume GC memory. This will
 be a problem for high performance code. I suggest either leaving them
 undecoded in the token stream, and letting higher level code decide what
 to do about them, or provide a hook that the user can override with his
 own allocation scheme.

 If we don't make it possible to use std.json without invoking the GC, I
 believe the module will fail in the long term.
I've added two new types now to abstract away how strings and numbers are represented in memory. For string literals this means that for input types "string" and "immutable(ubyte)[]" they will always be stored as slices to the input buffer. JSONValue has a .rawValue property to access them, as well as an "alias this"ed .value property that transparently unescapes. At that place it would also be easy to provide a method that takes an arbitrary output range to unescape without allocations. Documentation and code are both updated (also added a note about exception behavior).
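As a rough usage sketch of what that reads like on the caller's side (the parse function name and indexing syntax here are my assumptions; only .rawValue and the "alias this"ed .value come from the description above):

// hypothetical entry point and indexing, for illustration only
auto v = parseJSONValue(`{"text": "line1\nline2"}`);
auto s = v["text"];
assert(s.rawValue == `line1\nline2`);   // slice of the input buffer, escapes intact
assert(s.value == "line1\nline2");      // transparently unescaped (alias this target)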
Aug 23 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/23/2014 9:36 AM, Sönke Ludwig wrote:
 input types "string" and "immutable(ubyte)[]"
Why the immutable(ubyte)[] ?
Aug 23 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 23.08.2014 19:38, Walter Bright wrote:
 On 8/23/2014 9:36 AM, Sönke Ludwig wrote:
 input types "string" and "immutable(ubyte)[]"
Why the immutable(ubyte)[] ?
I've adopted that basically from Andrei's module. The idea is to allow processing data with arbitrary character encoding. However, the output will always be Unicode and JSON is defined to be encoded as Unicode, too, so that could probably be dropped...
Aug 23 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/23/2014 10:42 AM, Sönke Ludwig wrote:
 Am 23.08.2014 19:38, schrieb Walter Bright:
 On 8/23/2014 9:36 AM, Sönke Ludwig wrote:
 input types "string" and "immutable(ubyte)[]"
Why the immutable(ubyte)[] ?
I've adopted that basically from Andrei's module. The idea is to allow processing data with arbitrary character encoding. However, the output will always be Unicode and JSON is defined to be encoded as Unicode, too, so that could probably be dropped...
I feel that non-UTF encodings should be handled by adapter algorithms, not embedded into the JSON lexer, so yes, I'd drop that.
Aug 23 2014
next sibling parent reply Brad Roberts via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 8/23/2014 10:46 AM, Walter Bright via Digitalmars-d wrote:
 On 8/23/2014 10:42 AM, Sönke Ludwig wrote:
 Am 23.08.2014 19:38, schrieb Walter Bright:
 On 8/23/2014 9:36 AM, Sönke Ludwig wrote:
 input types "string" and "immutable(ubyte)[]"
Why the immutable(ubyte)[] ?
I've adopted that basically from Andrei's module. The idea is to allow processing data with arbitrary character encoding. However, the output will always be Unicode and JSON is defined to be encoded as Unicode, too, so that could probably be dropped...
I feel that non-UTF encodings should be handled by adapter algorithms, not embedded into the JSON lexer, so yes, I'd drop that.
For performance purposes, determining encoding during lexing is useful. You can avoid any conversion costs when you know that the original string is ascii or utf-8 or other. The cost during lexing is essentially zero. The cost of storing that state might be a concern, or it might be free in otherwise unused padding space. The cost of re-scanning strings that can be avoided is non-trivial. My past experience with this was in an http parser, where there's even more complex logic than json parsing, but the concepts still apply.
Aug 23 2014
next sibling parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Saturday, 23 August 2014 at 19:01:13 UTC, Brad Roberts via 
Digitalmars-d wrote:
 original string is ascii or utf-8 or other.  The cost during 
 lexing is essentially zero.
I am not so sure when it comes to SIMD lexing. I think the behaviour should be specified in a way which encourages later optimizations.
Aug 23 2014
parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
Some baselines for performance:

https://github.com/mloskot/json_benchmark

http://chadaustin.me/2013/01/json-parser-benchmarking/
Aug 23 2014
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/23/2014 12:00 PM, Brad Roberts via Digitalmars-d wrote:
 On 8/23/2014 10:46 AM, Walter Bright via Digitalmars-d wrote:
 I feel that non-UTF encodings should be handled by adapter algorithms,
 not embedded into the JSON lexer, so yes, I'd drop that.
For performance purposes, determining encoding during lexing is useful.
I'm not convinced that using an adapter algorithm won't be just as fast.
Aug 23 2014
parent reply Brad Roberts via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 8/23/2014 3:20 PM, Walter Bright via Digitalmars-d wrote:
 On 8/23/2014 12:00 PM, Brad Roberts via Digitalmars-d wrote:
 On 8/23/2014 10:46 AM, Walter Bright via Digitalmars-d wrote:
 I feel that non-UTF encodings should be handled by adapter algorithms,
 not embedded into the JSON lexer, so yes, I'd drop that.
For performance purposes, determining encoding during lexing is useful.
I'm not convinced that using an adapter algorithm won't be just as fast.
Consider your own talks on optimizing the existing dmd lexer. In those talks you've talked about the evils of additional processing on every byte. That's what you're talking about here. While it's possible that the inliner and other optimizer steps might be able to integrate the two phases and remove some overhead, I'll believe it when I see the resulting assembly code.
Aug 23 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/23/2014 6:32 PM, Brad Roberts via Digitalmars-d wrote:
 I'm not convinced that using an adapter algorithm won't be just as fast.
Consider your own talks on optimizing the existing dmd lexer. In those talks you've talked about the evils of additional processing on every byte. That's what you're talking about here. While it's possible that the inliner and other optimizer steps might be able to integrate the two phases and remove some overhead, I'll believe it when I see the resulting assembly code.
On the other hand, deadalnix demonstrated that the ldc optimizer was able to remove the extra code. I have a reasonable faith that optimization can be improved where necessary to cover this.
Aug 25 2014
parent reply simendsjo <simendsjo+dlang gmail.com> writes:
On 08/25/2014 09:35 PM, Walter Bright wrote:
 On 8/23/2014 6:32 PM, Brad Roberts via Digitalmars-d wrote:
 I'm not convinced that using an adapter algorithm won't be just as fast.
Consider your own talks on optimizing the existing dmd lexer. In those talks you've talked about the evils of additional processing on every byte. That's what you're talking about here. While it's possible that the inliner and other optimizer steps might be able to integrate the two phases and remove some overhead, I'll believe it when I see the resulting assembly code.
On the other hand, deadalnix demonstrated that the ldc optimizer was able to remove the extra code. I have a reasonable faith that optimization can be improved where necessary to cover this.
I just happened to write a very small script yesterday and tested with the compilers (with dub --build=release):

dmd: 2.8 mb
gdc: 3.3 mb
ldc: 0.5 mb

So ldc can remove quite a substantial amount of code in some cases.
Aug 25 2014
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/25/2014 12:49 PM, simendsjo wrote:
 I just happened to write a very small script yesterday and tested with
 the compilers (with dub --build=release).

 dmd: 2.8 mb
 gdc: 3.3 mb
 ldc  0.5 mb

 So ldc can remove quite a substantial amount of code in some cases.
Speed optimizations are different.
Aug 25 2014
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 25/08/14 21:49, simendsjo wrote:

 So ldc can remove quite a substantial amount of code in some cases.
It's because the latest release of LDC has the --gc-sections flag enabled by default.

-- 
/Jacob Carlborg
Aug 26 2014
parent "Entusiastic user" <cncgeneralsfan999 abv.bg> writes:
I tried using "-disable-linker-strip-dead", but it had no effect. 
 From the error messages it seems the problem is compile-time and 
not link-time...



On Tuesday, 26 August 2014 at 07:01:09 UTC, Jacob Carlborg wrote:
 On 25/08/14 21:49, simendsjo wrote:

 So ldc can remove quite a substantial amount of code in some 
 cases.
It's because the latest release of LDC has the --gc-sections falg enabled by default.
Aug 26 2014
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/23/14, 10:46 AM, Walter Bright wrote:
 On 8/23/2014 10:42 AM, Sönke Ludwig wrote:
 Am 23.08.2014 19:38, schrieb Walter Bright:
 On 8/23/2014 9:36 AM, Sönke Ludwig wrote:
 input types "string" and "immutable(ubyte)[]"
Why the immutable(ubyte)[] ?
I've adopted that basically from Andrei's module. The idea is to allow processing data with arbitrary character encoding. However, the output will always be Unicode and JSON is defined to be encoded as Unicode, too, so that could probably be dropped...
I feel that non-UTF encodings should be handled by adapter algorithms, not embedded into the JSON lexer, so yes, I'd drop that.
I think accepting ubyte is a good idea. It means "got this stream of bytes off of the wire and it hasn't been validated as a UTF string". It also means (which is true) that the lexer does enough validation to constrain arbitrary bytes into text, and saves the caller from either a check (expensive) or a cast (unpleasant).

Reality is the JSON lexer takes ubytes and produces tokens.

Andrei
Aug 23 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/23/2014 2:36 PM, Andrei Alexandrescu wrote:
 I think accepting ubyte it's a good idea. It means "got this stream of bytes
off
 of the wire and it hasn't been validated as a UTF string". It also means (which
 is true) that the lexer does enough validation to constrain arbitrary bytes
into
 text, and saves caller from either a check (expensive) or a cast (unpleasant).

 Reality is the JSON lexer takes ubytes and produces tokens.
Using an adapter still makes sense, because:

1. The adapter should be just as fast as wiring it in internally
2. The adapter then becomes a general purpose tool that can be used elsewhere where the encoding is unknown or suspect
3. The scope of the adapter is small, so it is easier to get it right, and being reusable means every user benefits from it
4. If we can't make adapters efficient, we've failed at the ranges+algorithms model, and I'm very unwilling to fail at that
Aug 23 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/23/14, 3:24 PM, Walter Bright wrote:
 On 8/23/2014 2:36 PM, Andrei Alexandrescu wrote:
 I think accepting ubyte it's a good idea. It means "got this stream of
 bytes off
 of the wire and it hasn't been validated as a UTF string". It also
 means (which
 is true) that the lexer does enough validation to constrain arbitrary
 bytes into
 text, and saves caller from either a check (expensive) or a cast
 (unpleasant).

 Reality is the JSON lexer takes ubytes and produces tokens.
Using an adapter still makes sense, because: 1. The adapter should be just as fast as wiring it in internally 2. The adapter then becomes a general purpose tool that can be used elsewhere where the encoding is unknown or suspect 3. The scope of the adapter is small, so it is easier to get it right, and being reusable means every user benefits from it 4. If we can't make adapters efficient, we've failed at the ranges+algorithms model, and I'm very unwilling to fail at that
An adapter would solve the wrong problem here. There's nothing to adapt from and to. An adapter would be good if e.g. the stream uses UTF-16 or some Windows encoding. Bytes are the natural input for a json parser. Andrei
Aug 23 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/23/2014 3:51 PM, Andrei Alexandrescu wrote:
 An adapter would solve the wrong problem here. There's nothing to adapt from
and
 to.

 An adapter would be good if e.g. the stream uses UTF-16 or some Windows
 encoding. Bytes are the natural input for a json parser.
The adaptation is to take arbitrary byte input in an unknown encoding and produce valid UTF. Note that many html readers scan the bytes to see if it is ASCII, UTF, some code page encoding, Shift-JIS, etc., and translate accordingly. I do not see why that is less costly to put inside the JSON lexer than as an adapter.
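For the common "bytes off the wire, expected to be UTF-8" case, the adaptation step can be as small as the following sketch for sliceable input (the helper name is made up; a true lazy adapter range would do the same check incrementally):

import std.utf : validate;

// hypothetical helper: check the raw bytes once, then hand them to the lexer as char[]
const(char)[] asValidUTF8(const(ubyte)[] raw)
{
    auto text = cast(const(char)[]) raw;  // reinterpret only, no copy
    validate(text);                       // throws UTFException on malformed sequences
    return text;
}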
Aug 25 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 25 August 2014 at 19:38:05 UTC, Walter Bright wrote:
 The adaptation is to take arbitrary byte input in an unknown 
 encoding and produce valid UTF.
I agree. For a restful HTTP service the encoding should be specified in the HTTP header and the input rejected if it isn't UTF compatible. For that use scenario you only want validation, not conversion. However, some validation is free; for example, if you only accept numbers you could just turn off parsing of strings in the template…

If files are read from storage then you can reread the file if it fails validation on the first pass. I wonder in which use scenario both of these conditions fail:

1. unspecified character-set and cannot assume UTF for JSON
2. unable to re-parse
Aug 25 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 25.08.2014 21:50, "Ola Fosheim Grøstad"
<ola.fosheim.grostad+dlang gmail.com>" wrote:
 On Monday, 25 August 2014 at 19:38:05 UTC, Walter Bright wrote:
 The adaptation is to take arbitrary byte input in an unknown encoding
 and produce valid UTF.
I agree. For a restful http service the encoding should be specified in the http header and the input rejected if it isn't UTF compatible. For that use scenario you only want validation, not conversion. However some validation is free, like if you only accept numbers you could just turn off parsing of strings in the template… If files are read from storage then you can reread the file if it fails validation on the first pass. I wonder, in which use scenario it is that both of these conditions fail? 1. unspecified character-set and cannot assume UTF for JSON 3. unable to re-parse
BTW, JSON is *required* to be UTF encoded anyway as per RFC-7159, which is another argument for just letting the lexer assume valid UTF.
Aug 25 2014
next sibling parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 25 August 2014 at 20:35:32 UTC, Sönke Ludwig wrote:
 BTW, JSON is *required* to be UTF encoded anyway as per 
 RFC-7159, which is another argument for just letting the lexer 
 assume valid UTF.
The lexer cannot assume valid UTF since the client might be a rogue, but it can just bail out if the lookahead isn't JSON? So UTF-validation is limited to strings.

You have to parse the strings because of the \uXXXX escapes of course, so some basic validation is unavoidable? But I guess full validation of string content could be another useful option, along with "ignore escapes" for the case where you want to avoid decode-encode scenarios (like for a proxy, or if you store pre-escaped unicode in a database).
Aug 25 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 25.08.2014 22:51, "Ola Fosheim Grøstad"
<ola.fosheim.grostad+dlang gmail.com>" wrote:
 On Monday, 25 August 2014 at 20:35:32 UTC, Sönke Ludwig wrote:
 BTW, JSON is *required* to be UTF encoded anyway as per RFC-7159,
 which is another argument for just letting the lexer assume valid UTF.
The lexer cannot assume valid UTF since the client might be a rogue, but it can just bail out if the lookahead isn't jSON? So UTF-validation is limited to strings.
But why should UTF validation be the job of the lexer in the first place? D's "string" type is also defined to be UTF-8, so given that, it would of course be free to assume valid UTF-8. I agree with Walter there that validation/conversion should be added as a separate proxy range. But if we end up going for validating in the lexer, it would indeed be enough to validate inside strings, because the rest of the grammar assumes a subset of ASCII.
 You have to parse the strings because of the \uXXXX escapes of course,
 so some basic validation is unavoidable?
At least no UTF validation is needed. Since all non-ASCII characters will always be composed of bytes >0x7F, a sequence \uXXXX can be assumed to be valid wherever in the string it occurs, and all other bytes that don't belong to an escape sequence are just passed through as-is.
 But I guess full validation of
 string content could be another useful option along with "ignore
 escapes" for the case where you want to avoid decode-encode scenarios.
 (like for a proxy, or if you store pre-escaped unicode in a database)
Aug 25 2014
next sibling parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 25 August 2014 at 21:27:42 UTC, Sönke Ludwig wrote:
 But why should UTF validation be the job of the lexer in the 
 first place?
Because you want to save time, it is faster to integrate validation? The most likely use scenario is to receive REST data over HTTP that needs validation. Well, so then I agree with Andrei… array of bytes it is. ;-)
 added as a separate proxy range. But if we end up going for 
 validating in the lexer, it would indeed be enough to validate 
 inside strings, because the rest of the grammar assumes a 
 subset of ASCII.
Not assumes, but defines! :-)

If you have to validate UTF before lexing then you will end up needlessly scanning lots of ascii if the file contains lots of non-strings or is from an encoder that only sends pure ascii.

If you want to have "plugin" validation of strings then you also need to differentiate strings so that the user can select which data should be just ascii, utf8, numbers, ids etc. Otherwise the user will end up doing double validation (you have to bypass >7F followed by string-end anyway).

The advantage of integrated validation is that you can use 16 byte SIMD registers on the buffer. I presume you can load 16 bytes and do a BITWISE-AND on the MSB, then match against string-end and carefully use this to boost performance of simultaneous UTF validation, escape-scanning, and string-end scanning. A bit tricky, of course.
 At least no UTF validation is needed. Since all non-ASCII 
 characters will always be composed of bytes >0x7F, a sequence 
 \uXXXX can be assumed to be valid wherever in the string it 
 occurs, and all other bytes that don't belong to an escape 
 sequence are just passed through as-is.
You cannot assume \u… to be valid if you convert it.
Aug 25 2014
next sibling parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 25 August 2014 at 21:53:50 UTC, Ola Fosheim Grøstad 
wrote:
 I presume you can load 16 bytes and do BITWISE-AND on the MSB, 
 then match against string-end and carefully use this to boost 
 performance of simultanous UTF validation, escape-scanning, and 
 string-end scan. A bit tricky, of course.
I think it is doable and worth it…

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

e.g.:

__mmask16 _mm_cmpeq_epu8_mask (__m128i a, __m128i b)
__mmask32 _mm256_cmpeq_epu8_mask (__m256i a, __m256i b)
__mmask64 _mm512_cmpeq_epu8_mask (__m512i a, __m512i b)
__mmask16 _mm_test_epi8_mask (__m128i a, __m128i b)

etc.

So you can:

1. preload registers with "\\\\\\\\…", "\"\"…" and "\0\0\0…"
2. then compare signed/unsigned/equal whatever
3. then load 16, 32 or 64 bytes of data and stream until the masks trigger
4. test the masks
5. resolve any potential issues, goto 3
Aug 25 2014
parent reply "Kiith-Sa" <kiithsacmp gmail.com> writes:
On Monday, 25 August 2014 at 22:40:00 UTC, Ola Fosheim Grøstad 
wrote:
 On Monday, 25 August 2014 at 21:53:50 UTC, Ola Fosheim Grøstad 
 wrote:
 I presume you can load 16 bytes and do BITWISE-AND on the MSB, 
 then match against string-end and carefully use this to boost 
 performance of simultanous UTF validation, escape-scanning, 
 and string-end scan. A bit tricky, of course.
I think it is doable and worth it… https://software.intel.com/sites/landingpage/IntrinsicsGuide/ e.g.: __mmask16 _mm_cmpeq_epu8_mask (__m128i a, __m128i b) __mmask32 _mm256_cmpeq_epu8_mask (__m256i a, __m256i b) __mmask64 _mm512_cmpeq_epu8_mask (__m512i a, __m512i b) __mmask16 _mm_test_epi8_mask (__m128i a, __m128i b) etc. So you can: 1. preload registers with "\\\\\\\\…" , "\"\"…" and "\0\0\0…" 2. then compare signed/unsigned/equal whatever. 3. then load 16,32 or 64 bytes of data and stream until the masks trigger 4. tests masks 5. resolve any potential issues, goto 3
D:YAML uses a similar approach, but with 8 bytes (plain ulong - portable) to detect how many ASCII chars there are before the first non-ASCII UTF-8 sequence, and it significantly improves performance (didn't keep any numbers unfortunately, but it decreases decoding overhead to a fraction for most inputs, since YAML (and JSON) files tend to be mostly-ASCII with non-ASCII from time to time in strings; if we know that we have e.g. 100 chars incoming that are plain ASCII, we can use a fast path for them and only consider decoding after that).

See the countASCII() function in
https://github.com/kiith-sa/D-YAML/blob/master/source/dyaml/reader.d

However, this approach is useful only if you decode the whole buffer at once, not if you do something like foreach(dchar ch; "asdsššdfáľäô") {}, which is the most obvious way to decode in D.

FWIW, decoding _was_ a significant overhead in D:YAML (again, didn't keep numbers, but at a time it was around 10% in the profiler), and I didn't like the fact that it prevented making my code @nogc - I ended up copying chunks of std.utf and making them @nogc nothrow (D:YAML as a whole is not @nogc but I use @nogc in some parts basically as "@noalloc" to ensure I don't allocate anything).
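A rough sketch of that 8-bytes-at-a-time ASCII scan (an illustrative variant, not the actual countASCII() from D:YAML's reader.d; a real implementation would read each chunk with one unaligned load instead of assembling it byte by byte):

size_t leadingASCII(const(ubyte)[] data) @safe pure nothrow @nogc
{
    enum ulong highBits = 0x8080_8080_8080_8080;
    size_t i = 0;
    // scan 8 bytes per step as long as all of them have the high bit clear
    while (i + 8 <= data.length)
    {
        ulong chunk = 0;
        foreach (j; 0 .. 8)
            chunk |= cast(ulong) data[i + j] << (8 * j);
        if (chunk & highBits) break;   // a non-ASCII byte sits somewhere in this chunk
        i += 8;
    }
    // finish the remainder (and the chunk that triggered) byte by byte
    while (i < data.length && data[i] < 0x80)
        ++i;
    return i;
}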
Aug 25 2014
parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 25 August 2014 at 23:24:43 UTC, Kiith-Sa wrote:
 D:YAML uses a similar approach, but with 8 bytes (plain ulong - 
 portable) to detect how many ASCII chars are there before the 
 first non-ASCII UTF-8 sequence,  and it significantly improves 
 performance (didn't keep any numbers unfortunately, but it
Cool! I think often you will have an array of numbers so you could subtract "000000000…", then parse offset-bytes and convert the mantissa/exponent using shuffles and simd. Somehow…
Aug 25 2014
prev sibling parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 25.08.2014 23:53, "Ola Fosheim Grøstad"
<ola.fosheim.grostad+dlang gmail.com>" wrote:
 On Monday, 25 August 2014 at 21:27:42 UTC, Sönke Ludwig wrote:
 But why should UTF validation be the job of the lexer in the first place?
Because you want to save time, it is faster to integrate validation? The most likely use scenario is to receive REST data over HTTP that needs validation. Well, so then I agree with Andrei… array of bytes it is. ;-)
 added as a separate proxy range. But if we end up going for validating
 in the lexer, it would indeed be enough to validate inside strings,
 because the rest of the grammar assumes a subset of ASCII.
Not assumes, but defines! :-)
I guess it depends on if you look at the grammar as productions or comprehensions(right term?) ;)
 If you have to validate UTF before lexing then you will end up
 needlessly scanning lots of ascii if the file contains lots of
 non-strings or is from a encoder that only sends pure ascii.
That's true. So the ideal solution would be to *assume* UTF-8 when the input is char based and to *validate* if the input is "numeric".
 If you want to have "plugin" validation of strings then you also need to
 differentiate strings so that the user can select which data should be
 just ascii, utf8, numbers, ids etc. Otherwise the user will end up doing
 double validation (you have to bypass >7F followed by string-end anyway).

 The advantage of integrated validation is that you can use 16 bytes SIMD
 registers on the buffer.

 I presume you can load 16 bytes and do BITWISE-AND on the MSB, then
 match against string-end and carefully use this to boost performance of
 simultanous UTF validation, escape-scanning, and string-end scan. A bit
 tricky, of course.
Well, that's something that's definitely out of the scope of this proposal. Definitely an interesting direction to pursue, though.
 At least no UTF validation is needed. Since all non-ASCII characters
 will always be composed of bytes >0x7F, a sequence \uXXXX can be
 assumed to be valid wherever in the string it occurs, and all other
 bytes that don't belong to an escape sequence are just passed through
 as-is.
You cannot assume \u… to be valid if you convert it.
I meant "X" to stand for a hex digit. The point was just that you don't have to worry about interacting in a bad way with UTF sequences when you find "\uXXXX".
Aug 26 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Tuesday, 26 August 2014 at 07:51:04 UTC, Sönke Ludwig wrote:
 That's true. So the ideal solution would be to *assume* UTF-8 
 when the input is char based and to *validate* if the input is 
 "numeric".
I think you should validate JSON-strings to be UTF-8 encoded even if 
you allow illegal unicode values. Basically ensuring that >0x7f has the 
right number of bytes after it, so you don't get >0x7f as the last byte 
in a string etc.
 Well, that's something that's definitely out of the scope of 
 this proposal. Definitely an interesting direction to pursue, 
 though.
Maybe the interface/code structure is or could be designed so that the implementation could later be version()'ed to SIMD where possible.
 You cannot assume \u… to be valid if you convert it.
I meant "X" to stand for a hex digit. The point was just that you don't have to worry about interacting in a bad way with UTF sequences when you find "\uXXXX".
When you convert "\uXXXX" to UTF-8 bytes, is it then validated as a legal code point? I guess it is not necessary. Btw, I believe rapidJSON achieves high speed by converting strings in situ, so that if the prefix is escape free it just converts in place when it hits the first escape. Thus avoiding some moving.
Aug 26 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 26.08.2014 10:24, "Ola Fosheim Grøstad"
<ola.fosheim.grostad+dlang gmail.com>" wrote:
 On Tuesday, 26 August 2014 at 07:51:04 UTC, Sönke Ludwig wrote:
 That's true. So the ideal solution would be to *assume* UTF-8 when the
 input is char based and to *validate* if the input is "numeric".
I think you should validate JSON-strings to be UTF-8 encoded even if you allow illegal unicode values. Basically ensuring that >0x7f has the right number of bytes after it, so you don't get >0x7f as the last byte in a string etc.
I think this is a misunderstanding. What I mean is that if the input range passed to the lexer is char/wchar/dchar based, the lexer should assume that the input is well formed UTF. After all this is how D strings are defined. When on the other hand a ubyte/ushort/uint range is used, the lexer should validate all string literals.
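A small sketch of how that distinction might look inside the lexer (illustrative only, not the actual std_data_json code):

import std.range : ElementEncodingType;
import std.traits : isSomeChar;

void lexStringLiteral(R)(ref R input)
{
    static if (isSomeChar!(ElementEncodingType!R))
    {
        // char/wchar/dchar input: D defines these to hold valid UTF,
        // so scan without extra per-sequence validation
    }
    else
    {
        // ubyte/ushort/uint input: raw data off the wire,
        // validate multi-unit sequences while scanning
    }
}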
 Well, that's something that's definitely out of the scope of this
 proposal. Definitely an interesting direction to pursue, though.
Maybe the interface/code structure is or could be designed so that the implementation could later be version()'ed to SIMD where possible.
I guess that shouldn't be an issue. From the outside it's just a generic range that is passed in and internally it's always possible to add special cases for array inputs. If someone else wants to play around with this idea, we could of course also integrate it right away, it's just that I personally don't have the time to go to the extreme here.
 You cannot assume \u… to be valid if you convert it.
I meant "X" to stand for a hex digit. The point was just that you don't have to worry about interacting in a bad way with UTF sequences when you find "\uXXXX".
When you convert "\uXXXX" to UTF-8 bytes, is it then validated as a legal code point? I guess it is not necessary.
What is validated is that it forms valid UTF-16 surrogate pairs, and those are converted to a single dchar instead (if applicable). This is necessary, because otherwise the lexer would produce invalid UTF-8 for valid inputs. Apart from that, the value is used verbatim as a dchar.
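For reference, the surrogate pair combination this implies (standard UTF-16 decoding; not code quoted from the module):

// "\uD834\uDD1E" in a JSON string is a high/low surrogate pair
wchar hi = 0xD834, lo = 0xDD1E;
assert(hi >= 0xD800 && hi <= 0xDBFF);   // high surrogate range
assert(lo >= 0xDC00 && lo <= 0xDFFF);   // low surrogate range
dchar combined = cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
assert(combined == 0x1D11E);            // U+1D11E MUSICAL SYMBOL G CLEF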
 Btw, I believe rapidJSON achieves high speed by converting strings in
 situ, so that if the prefix is escape free it just converts in place
 when it hits the first escape. Thus avoiding some moving.
The same is true for this lexer, at least for array inputs. It actually currently just stores a slice of the string literal in all cases and lazily decodes on the first access. While doing that, it first skips any escape sequence free prefix and returns a slice if the whole string is escape sequence free.
Aug 26 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Tuesday, 26 August 2014 at 09:05:05 UTC, Sönke Ludwig wrote:
 When on the other hand a ubyte/ushort/uint range is used, the 
 lexer should validate all string literals.
Yes, so this will be supported? Because this is what is most useful.
Aug 26 2014
parent Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 26.08.2014 11:11, "Ola Fosheim Grøstad"
<ola.fosheim.grostad+dlang gmail.com>" wrote:
 On Tuesday, 26 August 2014 at 09:05:05 UTC, Sönke Ludwig wrote:
 When on the other hand a ubyte/ushort/uint range is used, the lexer
 should validate all string literals.
Yes, so this will be supported? Because this is what is most useful.
If nobody plays a veto card, I'll implement it that way.
Aug 26 2014
prev sibling parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
Btw, maybe it would be a good idea to take a look on the JSON 
that various browsers generates to see if there are any 
differences?

Then one could tune optimizations to what is the most common 
coding, like this:

1. start parsing assuming "browser style restricted JSON" grammar.

2. on failure jump to the slower "generic JSON"

Chrome does not seem to generate whitespace in JSON.stringify(). 
And I would not be surprised if the encoding of double is similar 
across browsers.

Ola.
Aug 25 2014
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/25/2014 1:35 PM, Sönke Ludwig wrote:
 BTW, JSON is *required* to be UTF encoded anyway as per RFC-7159, which is
 another argument for just letting the lexer assume valid UTF.
I think that settles it.
Aug 25 2014
prev sibling next sibling parent reply Andrej Mitrovic via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 8/22/14, Sönke Ludwig <digitalmars-d puremagic.com> wrote:
 Docs: http://s-ludwig.github.io/std_data_json/
This confused me for a solid minute:

// Lex a JSON string into a lazy range of tokens
auto tokens = lexJSON(`{"name": "Peter", "age": 42}`);
with (JSONToken.Kind) {
    assert(tokens.map!(t => t.kind).equal(
        [objectStart, string, colon, string, comma,
         string, colon, number, objectEnd]));
}

Generally I'd avoid using de-facto reserved names as enum member names (e.g. string).
Aug 22 2014
parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 22.08.2014 21:15, Andrej Mitrovic via Digitalmars-d wrote:
 On 8/22/14, Sönke Ludwig <digitalmars-d puremagic.com> wrote:
 Docs: http://s-ludwig.github.io/std_data_json/
This confused me for a solid minute: // Lex a JSON string into a lazy range of tokens auto tokens = lexJSON(`{"name": "Peter", "age": 42}`); with (JSONToken.Kind) { assert(tokens.map!(t => t.kind).equal( [objectStart, string, colon, string, comma, string, colon, number, objectEnd])); } Generally I'd avoid using de-facto reserved names as enum member names (e.g. string).
Hmmm, but it *is* a string. Isn't the problem more the use of with in this case? Maybe the example should just use with(JSONToken) and then Kind.string?
Aug 22 2014
parent Andrej Mitrovic via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 8/22/14, Sönke Ludwig <digitalmars-d puremagic.com> wrote:
 Hmmm, but it *is* a string. Isn't the problem more the use of with in
 this case?
Yeah, maybe so. I thought for a second it was a tuple, but then I saw the square brackets and was left scratching my head. :)
Aug 23 2014
prev sibling next sibling parent reply "deadalnix" <deadalnix gmail.com> writes:
First thank you for your work. std.json is horrible to use right 
now, so a replacement is more than welcome.

I haven't played with your code yet, so I may be asking for 
somethign that already exists, but did you had a look to jsvar by 
Adam ?

You can find it here: 
https://github.com/adamdruppe/arsd/blob/master/jsvar.d

One of the big pains when one works with a format like JSON is that 
you go from the untyped world to the typed world (the same 
problem occurs with XML and various config formats as well).

I think Adam got the right balance in jsvar. It behaves closely 
enough to javascript that it is convenient to manipulate, while 
removing the most dangerous behavior (concatenation is still done 
using ~ and not + as in JS).

If that is not already the case, I'd love for the elements I get 
out of my JSON to behave that way. If you can do that, you have a 
user.
Aug 22 2014
next sibling parent ketmar via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Sat, 23 Aug 2014 02:23:25 +0000
deadalnix via Digitalmars-d <digitalmars-d puremagic.com> wrote:

 I haven't played with your code yet, so I may be asking for
 somethign that already exists, but did you had a look to jsvar by
 Adam ?
jsvar uses opDispatch, and Sönke wrote:
  - No opDispatch() for JSONValue - this has shown to do more harm than
    good in vibe.data.json
Aug 22 2014
prev sibling parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 23.08.2014 04:23, deadalnix wrote:
 First thank you for your work. std.json is horrible to use right now, so
 a replacement is more than welcome.

 I haven't played with your code yet, so I may be asking for somethign
 that already exists, but did you had a look to jsvar by Adam ?

 You can find it here:
 https://github.com/adamdruppe/arsd/blob/master/jsvar.d

 One of the big pain when one work with format like JSON is that you go
 from the untyped world to the typed world (the same problem occurs with
 XML and various config format as well).

 I think Adam got the right balance in jsvar. It behave closely enough to
 javascript so it is convenient to manipulate, while removing the most
 dangerous behavior (concatenation is still done using ~and not + as in JS).

 If that is not already the case, I'd love that the element I get out of
 my JSON behave that way. If you can do that, you have a user.
Setting the issue of opDispatch aside, one of the goals was to use Algebraic to store values. It is probably not completely as flexible as jsvar, but still transparently enables a lot of operations (with those pull requests merged, at least). But it has another big advantage, which is that we can later define other types based on Algebraic, such as BSONValue, and those can be transparently runtime converted between each other in a generic way. A special case type on the other hand produces nasty dependencies between the formats.

Main issues of using opDispatch:

 - Prone to bugs where a normal field/method of the JSONValue struct is
   accessed instead of a JSON field
 - On top of that the var.field syntax gives the wrong impression that
   you are working with static typing, while var["field"] makes it clear
   that runtime indexing is going on
 - Every interface change of JSONValue would be a silent breaking change,
   because the whole string domain is used up for opDispatch
Aug 23 2014
next sibling parent reply "w0rp" <devw0rp gmail.com> writes:
On Saturday, 23 August 2014 at 09:22:01 UTC, Sönke Ludwig wrote:
 Main issues of using opDispatch:

  - Prone to bugs where a normal field/method of the JSONValue 
 struct is accessed instead of a JSON field
  - On top of that the var.field syntax gives the wrong 
 impression that you are working with static typing, while 
 var["field"] makes it clear that runtime indexing is going on
  - Every interface change of JSONValue would be a silent 
 breaking change, because the whole string domain is used up for 
 opDispatch
I have seen similar issues to these with simplexml in PHP. Using opDispatch to match all possible names except a few doesn't work so well. I'm not sure if you've changed it already, but I agree with the earlier comment about changing the flag for pretty printing from a boolean to an enum value. Booleans in interfaces is one of my pet peeves.
Aug 23 2014
parent =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
On 23.08.2014 14:19, w0rp wrote:
 I'm not sure if you've changed it already, but I agree with the earlier
 comment about changing the flag for pretty printing from a boolean to an
 enum value. Booleans in interfaces is one of my pet peeves.
It's split into two separate functions now. Having to type out a full enum value I guess would be too distracting in this case, since they will be pretty frequently used.
Aug 23 2014
prev sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Saturday, 23 August 2014 at 09:22:01 UTC, Sönke Ludwig wrote:
 Main issues of using opDispatch:

  - Prone to bugs where a normal field/method of the JSONValue 
 struct is accessed instead of a JSON field
  - On top of that the var.field syntax gives the wrong 
 impression that you are working with static typing, while 
 var["field"] makes it clear that runtime indexing is going on
  - Every interface change of JSONValue would be a silent 
 breaking change, because the whole string domain is used up for 
 opDispatch
Yes, I don't mind missing that one. It looks like a false good idea.
Aug 23 2014
prev sibling next sibling parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
I've added support (compile time option [1]) for long and BigInt in the 
lexer (and parser), see [2]. JSONValue currently still only stores 
double for numbers. There are two options for extending JSONValue:

1. Add long and BigInt to the set of supported types for JSONValue. This 
preserves all features of Algebraic and would later still allow 
transparent conversion to other similar value types (e.g. BSONValue). On 
the other hand it would be necessary to always check the actual type 
before accessing a number, or the Algebraic would throw.

2. Instead of double, store a JSONNumber in the Algebraic. This enables 
all the transparent conversions of JSONNumber and would thus be more 
convenient, but blocks the way for possible automatic conversions in the 
future.

I'm leaning towards 1, because allowing generic conversion between 
different JSONValue-like types was one of my prime goals for the new module.

[1]: 
http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/LexOptions.html
[2]: 
http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/JSONNumber.html
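To make the trade-off in option 1 concrete, here is a rough sketch of the type checking that user code would have to do (the Algebraic layout below is assumed for illustration only, not taken from the module):

import std.bigint : BigInt;
import std.variant : Algebraic;

// assumed payload, for illustration only
alias Payload = Algebraic!(typeof(null), bool, double, long, BigInt, string);

double asDouble(Payload v)
{
    // with several numeric types in the Algebraic, the caller has to check
    // the stored type first, otherwise get!double would throw for a long
    if (v.type == typeid(long))
        return cast(double) v.get!long;
    if (v.type == typeid(double))
        return v.get!double;
    throw new Exception("BigInt would need an explicit (possibly lossy) conversion");
}

With option 2, JSONNumber's transparent conversions would hide that branching from the caller instead.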
Aug 25 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 25 August 2014 at 11:30:15 UTC, Sönke Ludwig wrote:
 I've added support (compile time option [1]) for long and 
 BigInt in the lexer (and parser), see [2]. JSONValue currently 
 still only stores double for numbers.
It can be very useful to have a base 10 exponent representation in certain situations where you need to have the exact same results in two systems (like a third party ERP server versus a client side application). Base 2 exponents are tricky (incorrect) when you read ascii. E.g. I have resorted to using Decimal in Python just to avoid the weird round off issues when calculating prices where the price is given in fractions of the order unit. Perhaps a marginal problem, but could be important for some serious application areas where you need to integrate D with existing systems (for which you don't have the source code).
Aug 25 2014
parent Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 25.08.2014 14:12, "Ola Fosheim Grøstad"
<ola.fosheim.grostad+dlang gmail.com>" wrote:
 On Monday, 25 August 2014 at 11:30:15 UTC, Sönke Ludwig wrote:
 I've added support (compile time option [1]) for long and BigInt in
 the lexer (and parser), see [2]. JSONValue currently still only stores
 double for numbers.
It can be very useful to have a base 10 exponent representation in certain situations where you need to have the exact same results in two systems (like a third party ERP server versus a client side application). Base 2 exponents are tricky (incorrect) when you read ascii. E.g. I have resorted to using Decimal in Python just to avoid the weird round off issues when calculating prices where the price is given in fractions of the order unit. Perhaps a marginal problem, but could be important for some serious application areas where you need to integrate D with existing systems (for which you don't have the source code).
In fact, I've already prepared the code for that, but commented it out for now, because I wanted to have an efficient algorithm for converting double to Decimal and because we should probably first add a Decimal type to Phobos instead of adding it to the JSON module.
Aug 25 2014
prev sibling next sibling parent reply "Don" <x nospam.com> writes:
On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote:
 Following up on the recent "std.jgrandson" thread [1], I've 
 picked up the work (a lot earlier than anticipated) and 
 finished a first version of a loose blend of said 
 std.jgrandson, vibe.data.json and some changes that I had 
 planned for vibe.data.json for a while. I'm quite pleased by 
 the results so far, although without a serialization framework 
 it still misses a very important building block.

 Code: https://github.com/s-ludwig/std_data_json
 Docs: http://s-ludwig.github.io/std_data_json/
 DUB: http://code.dlang.org/packages/std_data_json

 The new code contains:
  - Lazy lexer in the form of a token input range (using slices 
 of the
    input if possible)
  - Lazy streaming parser (StAX style) in the form of a node 
 input range
  - Eager DOM style parser returning a JSONValue
  - Range based JSON string generator taking either a token 
 range, a
    node range, or a JSONValue
  - Opt-out location tracking (line/column) for tokens, nodes 
 and values
  - No opDispatch() for JSONValue - this has shown to do more 
 harm than
    good in vibe.data.json

 The DOM style JSONValue type is based on std.variant.Algebraic. 
 This currently has a few usability issues that can be solved by 
 upgrading/fixing Algebraic:

  - Operator overloading only works sporadically
  - No "tag" enum is supported, so that switch()ing on the type 
 of a
    value doesn't work and an if-else cascade is required
  - Operations and conversions between different Algebraic types 
 is not
    conveniently supported, which gets important when other 
 similar
    formats get supported (e.g. BSON)

 Assuming that those points are solved, I'd like to get some 
 early feedback before going for an official review. One open 
 issue is how to handle unescaping of string literals. Currently 
 it always unescapes immediately, which is more efficient for 
 general input ranges when the unescaped result is needed, but 
 less efficient for string inputs when the unescaped result is 
 not needed. Maybe a flag could be used to conditionally switch 
 behavior depending on the input range type.

 Destroy away! ;)

 [1]: http://forum.dlang.org/thread/lrknjl$co7$1 digitalmars.com
One missing feature (which is also missing from the existing std.json) is support for NaN and Infinity as JSON values. Although they are not part of the formal JSON spec (which is a ridiculous omission; the argument given for excluding them is fallacious), they do get generated if you use Javascript's toString to create the JSON. Many JSON libraries (eg Google's) also generate them, so they are frequently encountered in practice. So a JSON parser should at least be able to lex them.

ie this should be parsable:

{"foo": NaN, "bar": Infinity, "baz": -Infinity}

You should also put tests in for what happens when you pass NaN or infinity to toJSON. It shouldn't silently generate invalid JSON.
Aug 25 2014
next sibling parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 25 August 2014 at 13:07:08 UTC, Don wrote:
 practice. So a JSON parser should at least be able to lex them.

 ie this should be parsable:

 {"foo": NaN, "bar": Infinity, "baz": -Infinity}

 You should also put tests in for what happens when you pass NaN 
 or infinity to toJSON. It shouldn't silently generate invalid 
 JSON.
I believe you are allowed to use very high exponents, though, like 1E999. So you need to decide if those should be mapped to +Infinity or to the max value…

NaNs also come in two forms with differing semantics: signalling (NaNs) and quiet (NaN). NaN is used for 0/0 and sqrt(-1), but NaNs is used for illegal values and failure.

For some reason D does not seem to support this aspect of IEEE 754? I cannot find ".nans" listed on the page http://dlang.org/property.html

The distinction is important when you do conditional branching. With NaNs you might not be able to figure out which branch to take since you might have missed out on a real value; with NaN you got the value (which is known to be not real) and you might be able to branch.
Aug 25 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/25/2014 6:23 AM, "Ola Fosheim Grøstad" 
<ola.fosheim.grostad+dlang gmail.com>" wrote:
 On Monday, 25 August 2014 at 13:07:08 UTC, Don wrote:
 practice. So a JSON parser should at least be able to lex them.

 ie this should be parsable:

 {"foo": NaN, "bar": Infinity, "baz": -Infinity}

 You should also put tests in for what happens when you pass NaN or infinity to
 toJSON. It shouldn't silently generate invalid JSON.
I believe you are allowed to use very high exponents, though. Like: 1E999 . So you need to decide if those should be mapped to +Infinity or to the max value…
Infinity. Mapping to max value would be a horrible bug.
 NaN also come in two forms with differing semantics: signalling(NaNs) and quiet
 (NaN).  NaN is used for 0/0 and sqrt(-1), but NaNs is used for illegal values
 and failure.

 For some reason D does not seem to support this aspect of IEEE754? I cannot
find
 ".nans" listed on the page http://dlang.org/property.html
Because I tried supporting them in C++. It doesn't work for various reasons. Nobody else supports them, either.
Aug 25 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 25 August 2014 at 19:42:03 UTC, Walter Bright wrote:
 Infinity. Mapping to max value would be a horrible bug.
Yes… but then you are reading an illegal value that JSON does not support…
 For some reason D does not seem to support this aspect of 
 IEEE754? I cannot find
 ".nans" listed on the page http://dlang.org/property.html
Because I tried supporting them in C++. It doesn't work for various reasons. Nobody else supports them, either.
I haven't tested, but Python is supposed to throw on NaNs. gcc has support for nans in their documentation: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html IBM Fortran supports it… I think supporting signaling NaN is important for correctness.
Aug 25 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 25 August 2014 at 20:04:10 UTC, Ola Fosheim Grøstad 
wrote:
 I think supporting signaling NaN is important for correctness.
It is defined in C++11: http://en.cppreference.com/w/cpp/types/numeric_limits/signaling_NaN
Aug 25 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/25/2014 1:21 PM, "Ola Fosheim Grøstad" 
<ola.fosheim.grostad+dlang gmail.com>" wrote:
 On Monday, 25 August 2014 at 20:04:10 UTC, Ola Fosheim Grøstad wrote:
 I think supporting signaling NaN is important for correctness.
It is defined in C++11: http://en.cppreference.com/w/cpp/types/numeric_limits/signaling_NaN
I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile.
Aug 25 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote:
 I didn't know that. But recall I did implement it in DMC++, and 
 it turned out to simply not be useful. I'd be surprised if the 
 new C++ support for it does anything worthwhile.
Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values.
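
For what it's worth, D already lets you unmask the hardware 
exception that this scheme relies on via 
std.math.FloatingPointControl. A minimal sketch (0.0/0.0 stands 
in for arithmetic on an uninitialized signalling NaN, since 
both count as invalid operations; assumes a target where these 
exceptions are available):

---
import std.math : FloatingPointControl;

void main()
{
    FloatingPointControl fpctrl;
    fpctrl.enableExceptions(FloatingPointControl.invalidException);

    double x = 0.0;
    double y = x / x; // invalid operation -> hardware trap (SIGFPE)
                      // instead of quietly producing a NaN
}
---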
Aug 25 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/25/2014 4:15 PM, "Ola Fosheim Grøstad" 
<ola.fosheim.grostad+dlang gmail.com>" wrote:
 On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote:
 I didn't know that. But recall I did implement it in DMC++, and it turned out
 to simply not be useful. I'd be surprised if the new C++ support for it does
 anything worthwhile.
Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values.
That's the theory. The practice doesn't work out so well.
Aug 25 2014
parent reply "Don" <x nospam.com> writes:
On Monday, 25 August 2014 at 23:29:21 UTC, Walter Bright wrote:
 On 8/25/2014 4:15 PM, "Ola Fosheim Grøstad" 
 <ola.fosheim.grostad+dlang gmail.com>" wrote:
 On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote:
 I didn't know that. But recall I did implement it in DMC++, 
 and it turned out
 to simply not be useful. I'd be surprised if the new C++ 
 support for it does
 anything worthwhile.
Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values.
That's the theory. The practice doesn't work out so well.
To be more concrete: Processors from AMD have signalling NaN behaviour which is different from processors from Intel. And the situation is worst on most other architectures. It's a lost cause, I think.
Aug 26 2014
next sibling parent reply "Ola Fosheim Gr" <ola.fosheim.grostad+dlang gmail.com> writes:
On Tuesday, 26 August 2014 at 07:24:19 UTC, Don wrote:
 Processors from AMD have signalling NaN behaviour which is 
 different from processors from Intel.

 And the situation is worst on most other architectures. It's a 
 lost cause, I think.
I disagree. AFAIK signaling NaN was standardized in IEEE 754-2008. So it receives attention.
Aug 26 2014
parent reply "Don" <x nospam.com> writes:
On Tuesday, 26 August 2014 at 07:34:05 UTC, Ola Fosheim Gr wrote:
 On Tuesday, 26 August 2014 at 07:24:19 UTC, Don wrote:
 Processors from AMD have signalling NaN behaviour which is 
 different from processors from Intel.

 And the situation is worst on most other architectures. It's a 
 lost cause, I think.
I disagree. AFAIK signaling NaN was standardized in IEEE 754-2008. So it receives attention.
It was always in IEEE754. The decision in 754-2008 was simply 
to not remove it from the spec (a lot of people wanted to 
remove it). I don't think anything has changed.

The point is, existing hardware does not support it 
consistently. It's not possible at reasonable cost.

---
real uninitialized_var = real.snan;

void foo()
{
    real other_var = void;
    asm
    {
        fld uninitialized_var;
        fstp other_var;
    }
}
---

will signal on AMD, but not Intel. I'd love for this to work, 
but the hardware is fighting against us. I think it's useful 
only for debugging.
Aug 26 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Tuesday, 26 August 2014 at 10:55:20 UTC, Don wrote:
 It was always in IEEE754. The decision in 754-2008 was simply 
 to not remove it from the spec (a lot of people wanted to 
 remove it). I don't think anything has changed.
It was implementation defined before. I think they specified the bit in 2008.
      fld uninitialized_var;
      fstp other_var;
This is not SSE, but I guess MOVSS does not create exceptions either. AVX is quite complicated, but searching for "signaling" gives some hints about the semantics you can rely on. https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions https://software.intel.com/sites/default/files/managed/c6/a9/319433-020.pdf Ola.
Aug 26 2014
next sibling parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Tuesday, 26 August 2014 at 12:37:58 UTC, Ola Fosheim Grøstad 
wrote:

 either. AVX is quite complicated, but searching for "signaling" 
 gives some hints about the semantics you can rely on.
…
 https://software.intel.com/sites/default/files/managed/c6/a9/319433-020.pdf
(Actually, searching for "SNAN" is better…)
Aug 26 2014
parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
With the danger of being noisy, these instructions are subject to 
floating point exceptions according to my (perhaps sloppy) 
reading of Intel Architecture Instruction Set Extensions 
Programming Reference (2012):

(V)ADDPD, (V)ADDPS, (V)ADDSUBPD, (V)ADDSUBPS, (V)CMPPD, (V)CMPPS, 
(V)CVTDQ2PS, (V)CVTPD2DQ, (V)CVTPD2PS, (V)CVTPS2DQ, (V)CVTTPD2DQ, 
(V)CVTTPS2DQ, (V)DIVPD, (V)DIVPS, (V)DPPD*, (V)DPPS*, 
VFMADD132PD, VFMADD213PD, VFMADD231PD, VFMADD132PS, VFMADD213PS, 
VFMADD231PS, VFMADDSUB132PD, VFMADDSUB213PD, VFMADDSUB231PD, 
VFMADDSUB132PS, VFMADDSUB213PS, VFMADDSUB231PS, VFMSUBADD132PD, 
VFMSUBADD213PD, VFMSUBADD231PD, VFMSUBADD132PS, VFMSUBADD213PS, 
VFMSUBADD231PS, VFMSUB132PD, VFMSUB213PD, VFMSUB231PD, 
VFMSUB132PS, VFMSUB213PS, VFMSUB231PS, VFNMADD132PD, 
VFNMADD213PD, VFNMADD231PD, VFNMADD132PS, VFNMADD213PS, 
VFNMADD231PS, VFNMSUB132PD, VFNMSUB213PD, VFNMSUB231PD, 
VFNMSUB132PS, VFNMSUB213PS, VFNMSUB231PS, (V)HADDPD, (V)HADDPS, 
(V)HSUBPD, (V)HSUBPS, (V)MAXPD, (V)MAXPS, (V)MINPD, (V)MINPS, 
(V)MULPD, (V)MULPS, (V)ROUNDPS, (V)ROUNDPS, (V)SQRTPD, (V)SQRTPS, 
(V)SUBPD, (V)SUBPS

(V)ADDSD, (V)ADDSS, (V)CMPSD, (V)CMPSS, (V)COMISD, (V)COMISS, 
(V)CVTPS2PD, (V)CVTSD2SI, (V)CVTSD2SS, (V)CVTSI2SD, (V)CVTSI2SS, 
(V)CVTSS2SD, (V)CVTSS2SI, (V)CVTTSD2SI, (V)CVTTSS2SI, (V)DIVSD, 
(V)DIVSS, VFMADD132SD, VFMADD213SD, VFMADD231SD, VFMADD132SS, 
VFMADD213SS, VFMADD231SS, VFMSUB132SD, VFMSUB213SD, VFMSUB231SD, 
VFMSUB132SS, VFMSUB213SS, VFMSUB231SS, VFNMADD132SD, 
VFNMADD213SD, VFNMADD231SD, VFNMADD132SS, VFNMADD213SS, 
VFNMADD231SS, VFNMSUB132SD, VFNMSUB213SD, VFNMSUB231SD, 
VFNMSUB132SS, VFNMSUB213SS, VFNMSUB231SS, (V)MAXSD, (V)MAXSS, 
(V)MINSD, (V)MINSS, (V)MULSD, (V)MULSS, (V)ROUNDSD, (V)ROUNDSS, 
(V)SQRTSD, (V)SQRTSS, (V)SUBSD, (V)SUBSS, (V)UCOMISD, (V)UCOMISS

VCVTPH2PS, VCVTPS2PH

So I guess Intel floating point exceptions trigger on 
computations, but not on moves?

Ola.
Aug 26 2014
prev sibling parent reply "Don" <x nospam.com> writes:
On Tuesday, 26 August 2014 at 12:37:58 UTC, Ola Fosheim Grøstad 
wrote:
 On Tuesday, 26 August 2014 at 10:55:20 UTC, Don wrote:
 It was always in IEEE754. The decision in 754-2008 was simply 
 to not remove it from the spec (a lot of people wanted to 
 remove it). I don't think anything has changed.
It was implementation defined before. I think they specified the bit in 2008.
     fld uninitialized_var;
     fstp other_var;
This is not SSE, but I guess MOVSS does not create exceptions either.
No, it's more subtle. On the original x87, signalling NaNs are 
triggered for 64-bit loads, but not for 80-bit loads. You have 
to read the fine print to discover this. I don't think the 
behaviour was intentional.
Aug 26 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Tuesday, 26 August 2014 at 13:24:11 UTC, Don wrote:
 No, it's more subtle. On the original x87, signalling NaNs are 
 triggered for 64 bits loads, but not for 80 bit loads. You have 
 to read the fine print to discover this.
You are right, but it happens for loads from the FP-stack too: «Source operand is an SNaN. Does not occur if the source operand is in double extended-precision floating-point format (FLD m80fp or FLD ST(i)).»
 I don't think the behaviour was intentional.
It seems reasonable; you need to load/save NaNs without 
exceptions if you do a context switch? I don't think the 
extended format was meant for "end users".

Anyway, the x87 FP stack is history, even MOVSS is considered 
legacy by Intel…
Aug 26 2014
parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Tuesday, 26 August 2014 at 13:43:56 UTC, Ola Fosheim Grøstad 
wrote:
 Anyway, the x87 FP stack is history, even MOVSS is considered 
 legacy by Intel…
Sorry for being off-topic, but MOVSS and VMOVSS on AMD don't 
throw FP exceptions either, but calculations do. So it seems 
like AMD and Intel are sufficiently close for D to support 
NaNs, IMHO. Forget the legacy…

http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/26568_APM_v41.pdf

Ola.
Aug 26 2014
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/26/2014 12:24 AM, Don wrote:
 On Monday, 25 August 2014 at 23:29:21 UTC, Walter Bright wrote:
 On 8/25/2014 4:15 PM, "Ola Fosheim Grøstad"
 <ola.fosheim.grostad+dlang gmail.com>" wrote:
 On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote:
 I didn't know that. But recall I did implement it in DMC++, and it turned out
 to simply not be useful. I'd be surprised if the new C++ support for it does
 anything worthwhile.
Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values.
That's the theory. The practice doesn't work out so well.
To be more concrete: Processors from AMD have signalling NaN behaviour which is different from processors from Intel. And the situation is worst on most other architectures. It's a lost cause, I think.
The other issues were just when the snan => qnan conversion took place. This is quite unclear given the extensive constant folding, CTFE, etc., that D does. It was also affected by how dmd generates code. Some code gen on floating point doesn't need the FPU, such as toggling the sign bit. But then what happens with snan => qnan? The whole thing is an undefined, unmanageable mess.
Aug 27 2014
parent reply "Don" <x nospam.com> writes:
On Wednesday, 27 August 2014 at 23:51:54 UTC, Walter Bright wrote:
 On 8/26/2014 12:24 AM, Don wrote:
 On Monday, 25 August 2014 at 23:29:21 UTC, Walter Bright wrote:
 On 8/25/2014 4:15 PM, "Ola Fosheim Grøstad"
 <ola.fosheim.grostad+dlang gmail.com>" wrote:
 On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright 
 wrote:
 I didn't know that. But recall I did implement it in DMC++, 
 and it turned out
 to simply not be useful. I'd be surprised if the new C++ 
 support for it does
 anything worthwhile.
Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values.
That's the theory. The practice doesn't work out so well.
To be more concrete: Processors from AMD have signalling NaN behaviour which is different from processors from Intel. And the situation is worst on most other architectures. It's a lost cause, I think.
The other issues were just when the snan => qnan conversion took place. This is quite unclear given the extensive constant folding, CTFE, etc., that D does. It was also affected by how dmd generates code. Some code gen on floating point doesn't need the FPU, such as toggling the sign bit. But then what happens with snan => qnan? The whole thing is an undefined, unmanageable mess.
I think the way to think of it is: to the programmer, there is 
*no such thing* as an snan value. It's an implementation detail 
that should be invisible. Semantically, a signalling nan is a 
qnan value with a hardware breakpoint on it.

An SNAN should never enter the CPU. The CPU always converts 
them to QNAN if you try. You're kind of not supposed to know 
that SNAN exists.

Because of this, I think SNAN only ever makes sense for static 
variables. Setting local variables to snan doesn't make sense, 
since the snan has to enter the CPU. Making that work without 
triggering the snan is very painful. Making it trigger the snan 
on all forms of access is even worse.

If float.init exists, it cannot be an snan, since you are 
allowed to use float.init.
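
A minimal illustration of that last point:

---
void main()
{
    float x;        // default-initialized to float.init, a quiet NaN
    float y = x;    // using .init is legal and must not trap,
                    // so it cannot be a signalling NaN
    assert(x != x); // NaN compares unequal to itself
}
---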
Aug 28 2014
next sibling parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Thursday, 28 August 2014 at 11:09:16 UTC, Don wrote:
 I think the way to think of it is, to the programmer, there is 
 *no such thing* as an snan value. It's an implementation detail 
 that should be invisible.
 Semantically, a signalling nan is a qnan value with a hardware 
 breakpoint on it.
I disagree with this view.

QNAN: there is a value, but it does not result in a real
SNAN: the value is missing for an unspecified reason

AFAIK some x86 ops such as ROUNDPD allow you to treat SNAN as 
QNAN or throw an exception. So there is a built-in test if 
needed. Other ops such as reciprocals don't throw any FP 
exceptions and will treat SNAN as QNAN.
 An SNAN should never enter the CPU. The CPU always converts 
 them to QNAN if you try. You're kind of not supposed to know 
 that SNAN exists.
I'm not sure how you reached this interpretation? The solution should be to emit a test for SNAN explicitly or implicitly if you cannot prove that SNAN is impossible.
Aug 28 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
Or to be more explicit:

If have SNAN then there is no point in trying to recompute the 
expression using a different algorithm.

If have QNAN then you might want to recompute the expression 
using a different algorithm (e.g. complex numbers or 
analytically).

?
Aug 28 2014
parent reply "Don" <x nospam.com> writes:
On Thursday, 28 August 2014 at 12:10:58 UTC, Ola Fosheim Grøstad 
wrote:
 Or to be more explicit:

 If have SNAN then there is no point in trying to recompute the 
 expression using a different algorithm.

 If have QNAN then you might want to recompute the expression 
 using a different algorithm (e.g. complex numbers or 
 analytically).

 ?
No. Once you load an SNAN, it isn't an SNAN any more! It is a 
QNAN.

You cannot have an SNAN in a floating-point register (unless 
you do a nasty hack to pass it in). It gets converted during 
loading.

const float x = snan;
x = x;

// x is now a qnan.
Aug 28 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Thursday, 28 August 2014 at 14:43:30 UTC, Don wrote:
 No. Once you load an SNAN, it isn't an SNAN any more! It is a 
 QNAN.
By which definition? It is only if you consume the SNAN with an 
fp-exception-free arithmetic op that it should be turned into a 
QNAN. If you compute with an op that throws then it should 
throw an exception. MOV should not be viewed as a computation…

It also makes sense to save SNAN to file when converting 
corrupted data-files. SNAN could then mean "corrupted" and QNAN 
could mean "absent". You should not get an exception for 
loading a file. You should get an exception if you start 
computing on the SNAN in the file.
 You cannot have an SNAN in a floating-point register (unless 
 you do a nasty hack to pass it in). It gets converted during 
 loading.
I don't understand this position. If you cannot load SNAN then why does SSE handle SNAN in arithmetic ops and compares?
 const float x = snan;
 x = x;

 // x is now a qnan.
I disagree (and why const?). Assignment does nothing; it should 
not consume the SNAN. Assignment is just "naming". It is not 
"computing".
Aug 28 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
Let me try again:

SNAN => unfortunately absent

QNAN => deliberately absent

So you can have:

compute(SNAN) => handle(exception) {
    if(can turn unfortunate situation into deliberate)
    then compute(QNAN)
    else throw
)
Aug 28 2014
parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
Kahan states this in a 1997 paper:

«[…]An SNaN may be moved ( copied ) without incident, but any 
other arithmetic operation upon an SNaN is an INVALID operation ( 
and so is loading one onto the ix87's stack ) that must trap or 
else produce a new nonsignaling NaN. ( Another way to turn an 
SNaN into a NaN is to turn 0xxx...xxx into 1xxx...xxx with a 
logical OR.) Intended for, among other things, data missing from 
statistical collections, and for uninitialized variables[…]»

( http://www.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF)
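
The quiet/signalling distinction Kahan describes is a single 
bit of the fraction field; a minimal sketch of the logical-OR 
trick for double (whether the signalling bit survives a trip 
through the x87 is exactly the portability problem discussed 
above):

---
void main()
{
    // IEEE 754 double: all exponent bits set means NaN; the most
    // significant fraction bit (bit 51) is the quiet bit.
    ulong snanBits = 0x7FF0_0000_0000_0001UL; // quiet bit clear: SNaN
    ulong qnanBits = snanBits | (1UL << 51);  // OR in the quiet bit: QNaN

    double snan = *cast(double*) &snanBits;
    double qnan = *cast(double*) &qnanBits;

    assert(snan != snan && qnan != qnan); // both compare as NaN
}
---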

x87 is legacy; it predates IEEE754 by 5 years and should be 
forgotten.

Note also that the string representation for a signalling nan is 
"NANS", so it reasonable to save it to file if you need to 
represent missing data. "NAN" represents 0/0, sqrt(-1), not 
missing data.

I'm not really sure how it can be interpreted differently?

Ola.
Aug 28 2014
prev sibling parent "Daniel Murphy" <yebbliesnospam gmail.com> writes:
"Don"  wrote in message news:fvxmsrbicgpqkkiufdyv forum.dlang.org...

 If float.init exists, it cannot be an snan, since you are allowed to use 
 float.init.
So should we get rid of them from the language completely? 
Using them as template parameters doesn't even respect the sign 
of the NaN last time I checked, let alone the s/q bit or 
payload.

If we change float.init to be a qnan then it won't be possible 
to make one at compile time.
Aug 28 2014
prev sibling parent reply =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
Am 25.08.2014 15:07, schrieb Don:
 On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote:
 Following up on the recent "std.jgrandson" thread [1], I've picked up
 the work (a lot earlier than anticipated) and finished a first version
 of a loose blend of said std.jgrandson, vibe.data.json and some
 changes that I had planned for vibe.data.json for a while. I'm quite
 pleased by the results so far, although without a serialization
 framework it still misses a very important building block.

 Code: https://github.com/s-ludwig/std_data_json
 Docs: http://s-ludwig.github.io/std_data_json/
 DUB: http://code.dlang.org/packages/std_data_json

 The new code contains:
  - Lazy lexer in the form of a token input range (using slices of the
    input if possible)
  - Lazy streaming parser (StAX style) in the form of a node input range
  - Eager DOM style parser returning a JSONValue
  - Range based JSON string generator taking either a token range, a
    node range, or a JSONValue
  - Opt-out location tracking (line/column) for tokens, nodes and values
  - No opDispatch() for JSONValue - this has shown to do more harm than
    good in vibe.data.json

 The DOM style JSONValue type is based on std.variant.Algebraic. This
 currently has a few usability issues that can be solved by
 upgrading/fixing Algebraic:

  - Operator overloading only works sporadically
  - No "tag" enum is supported, so that switch()ing on the type of a
    value doesn't work and an if-else cascade is required
  - Operations and conversions between different Algebraic types is not
    conveniently supported, which gets important when other similar
    formats get supported (e.g. BSON)

 Assuming that those points are solved, I'd like to get some early
 feedback before going for an official review. One open issue is how to
 handle unescaping of string literals. Currently it always unescapes
 immediately, which is more efficient for general input ranges when the
 unescaped result is needed, but less efficient for string inputs when
 the unescaped result is not needed. Maybe a flag could be used to
 conditionally switch behavior depending on the input range type.

 Destroy away! ;)

 [1]: http://forum.dlang.org/thread/lrknjl$co7$1 digitalmars.com
One missing feature (which is also missing from the existing std.json) is support for NaN and Infinity as JSON values. Although they are not part of the formal JSON spec (which is a ridiculous omission, the argument given for excluding them is fallacious), they do get generated if you use Javascript's toString to create the JSON. Many JSON libraries (eg Google's) also generate them, so they are frequently encountered in practice. So a JSON parser should at least be able to lex them. ie this should be parsable: {"foo": NaN, "bar": Infinity, "baz": -Infinity}
This would probably best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though.
 You should also put tests in for what happens when you pass NaN or
 infinity to toJSON. It shouldn't silently generate invalid JSON.
Good point. The current solution to just use formattedWrite("%.16g") is also not ideal.
Aug 25 2014
next sibling parent reply =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
Am 25.08.2014 16:04, schrieb Sönke Ludwig:
 Am 25.08.2014 15:07, schrieb Don:
 On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote:
 Following up on the recent "std.jgrandson" thread [1], I've picked up
 the work (a lot earlier than anticipated) and finished a first version
 of a loose blend of said std.jgrandson, vibe.data.json and some
 changes that I had planned for vibe.data.json for a while. I'm quite
 pleased by the results so far, although without a serialization
 framework it still misses a very important building block.

 Code: https://github.com/s-ludwig/std_data_json
 Docs: http://s-ludwig.github.io/std_data_json/
 DUB: http://code.dlang.org/packages/std_data_json

 The new code contains:
  - Lazy lexer in the form of a token input range (using slices of the
    input if possible)
  - Lazy streaming parser (StAX style) in the form of a node input range
  - Eager DOM style parser returning a JSONValue
  - Range based JSON string generator taking either a token range, a
    node range, or a JSONValue
  - Opt-out location tracking (line/column) for tokens, nodes and values
  - No opDispatch() for JSONValue - this has shown to do more harm than
    good in vibe.data.json

 The DOM style JSONValue type is based on std.variant.Algebraic. This
 currently has a few usability issues that can be solved by
 upgrading/fixing Algebraic:

  - Operator overloading only works sporadically
  - No "tag" enum is supported, so that switch()ing on the type of a
    value doesn't work and an if-else cascade is required
  - Operations and conversions between different Algebraic types is not
    conveniently supported, which gets important when other similar
    formats get supported (e.g. BSON)

 Assuming that those points are solved, I'd like to get some early
 feedback before going for an official review. One open issue is how to
 handle unescaping of string literals. Currently it always unescapes
 immediately, which is more efficient for general input ranges when the
 unescaped result is needed, but less efficient for string inputs when
 the unescaped result is not needed. Maybe a flag could be used to
 conditionally switch behavior depending on the input range type.

 Destroy away! ;)

 [1]: http://forum.dlang.org/thread/lrknjl$co7$1 digitalmars.com
One missing feature (which is also missing from the existing std.json) is support for NaN and Infinity as JSON values. Although they are not part of the formal JSON spec (which is a ridiculous omission, the argument given for excluding them is fallacious), they do get generated if you use Javascript's toString to create the JSON. Many JSON libraries (eg Google's) also generate them, so they are frequently encountered in practice. So a JSON parser should at least be able to lex them. ie this should be parsable: {"foo": NaN, "bar": Infinity, "baz": -Infinity}
This would probably best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though.
http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/LexOptions.specialFloatLiterals.html
 You should also put tests in for what happens when you pass NaN or
 infinity to toJSON. It shouldn't silently generate invalid JSON.
Good point. The current solution to just use formattedWrite("%.16g") is also not ideal.
By default, floating-point special values are now output as 'null', according to the ECMA-script standard. Optionally, they will be emitted as 'NaN' and 'Infinity': http://s-ludwig.github.io/std_data_json/stdx/data/json/generator/GeneratorOptions.specialFloatLiterals.html
Aug 25 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 25 August 2014 at 15:34:29 UTC, Sönke Ludwig wrote:
 By default, floating-point special values are now output as 
 'null', according to the ECMA-script standard. Optionally, they 
 will be emitted as 'NaN' and 'Infinity':
ECMAScript presumes double. I think one should base Phobos on 
language-independent standards. I suggest:

http://tools.ietf.org/html/rfc7159

For a web server it would be most useful to get an exception 
since you risk ending up with web-clients not working with no 
logging. It is better to have an exception and log an error so 
the problem can be fixed.
Aug 25 2014
next sibling parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 25 August 2014 at 15:46:12 UTC, Ola Fosheim Grøstad 
wrote:
 For a web server it would be most useful to get an exception 
 since you risk ending up with web-clients not working with no 
 logging. It is better to have an exception and log an error so 
 the problem can be fixed.
Let me expand a bit on the difference between web clients and 
servers, assuming D is used on the server:

* Web servers have to check all input and log illegal activity. 
It is either a bug or an attack.

* Web clients don't have to check input from the server (at 
most a crypto check) and should not do double work if servers 
validate anyway.

* Web servers detect errors and send the error as a response to 
the client that displays it as a warning to the user. This is 
the uncommon case so you don't want to burden the client with 
it.

From this we can infer:

- It makes more sense for ECMAScript to turn illegal values 
into null since it runs on the client.

- The server needs efficient validation of input so that it can 
have faster response.

- The more integration of validation of typedness you can have 
in the parser, the better.

Thus it would be an advantage to be able to configure the 
validation done in the parser (through template mechanisms):

1. On write: throw exception on all illegal values or values 
that cannot be represented in the format. If the values are 
illegal then the client should not receive it. It could cause 
legal problems (like wrong prices).

2. On read: add the ability to configure the validation of 
typedness on many parameters:

- no nulls, no dicts, only nesting arrays etc
- predetermined key-values and automatic mapping to structs on 
exact match.
- require all leaf arrays to be uniform (array of strings, 
array of numbers)
- match a predefined grammar etc
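
As a concrete illustration of the kind of constraint meant 
here, a minimal sketch using the existing std.json API as a 
stand-in (the stdx.data.json equivalent would look similar):

---
import std.json;

// Constraint: root is an array and all values are strings.
bool isStringArray(JSONValue v)
{
    if (v.type != JSON_TYPE.ARRAY)
        return false;
    foreach (e; v.array)
        if (e.type != JSON_TYPE.STRING)
            return false;
    return true;
}

unittest
{
    assert(isStringArray(parseJSON(`["a", "b"]`)));
    assert(!isStringArray(parseJSON(`["a", 1]`)));
}
---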
Aug 25 2014
parent =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
 - It makes more sense for ECMAScript to turn illegal values into null
 since it runs on the client.
Like... node.js? Sorry, just kidding. I don't think it makes sense for clients to be less strict about such things, but I do agree with your assessment about being as strict as possible on the server. I also do think that exceptions are a perfect tool especially for server applications and that instead of avoiding them because they are slow, they should better be made fast enough to not be an issue.
Aug 25 2014
prev sibling parent reply =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
Am 25.08.2014 17:46, schrieb "Ola Fosheim Grøstad" 
<ola.fosheim.grostad+dlang gmail.com>":
 On Monday, 25 August 2014 at 15:34:29 UTC, Sönke Ludwig wrote:
 By default, floating-point special values are now output as 'null',
 according to the ECMA-script standard. Optionally, they will be
 emitted as 'NaN' and 'Infinity':
ECMAScript presumes double. I think one should base Phobos on language-independent standards. I suggest: http://tools.ietf.org/html/rfc7159
Well, of course it's based on that RFC, did you seriously think something else? However, that standard has no mention of infinity or NaN, and since JSON is designed to be a subset of ECMA script, it's basically the only thing that comes close.
 For a web server it would be most useful to get an exception since you
 risk ending up with web-clients not working with no logging. It is
 better to have an exception and log an error so the problem can be fixed.
Although you have a point there of course, it's also highly unlikely that those clients would work correctly if we presume that JSON supported infinity/NaN. So it would really be just coincidence to detect a bug like that. But I generally agree, it's just that the anti-exception voices are pretty loud these days (including Walter's), so that I opted for a non-throwing solution instead. I guess it wouldn't hurt though to default to throwing an exception, while still providing the GeneratorOptions.specialFloatLiterals option to handle those values without exception overhead, but in a non standard-conforming way.
Aug 25 2014
next sibling parent =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
Am 25.08.2014 22:21, schrieb Sönke Ludwig:
 that standard has no mention of infinity or
 NaN
Sorry, to be precise, it has no suggestion of how to *handle* infinity or NaN.
Aug 25 2014
prev sibling parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 25 August 2014 at 20:21:01 UTC, Sönke Ludwig wrote:
 Well, of course it's based on that RFC, did you seriously think 
 something else?
I made no assumptions, just responded to what you wrote :-). It would be reasonable in the context of vibe.d to assume the ECMAScript spec.
 But I generally agree, it's just that the anti-exception voices 
 are pretty loud these days (including Walter's), so that I 
 opted for a non-throwing solution instead.
Yes, the minimum requirement is to just get "did not validate" directly as a single value. One can create a wrapper to get exceptions.
 I guess it wouldn't hurt though to default to throwing an 
 exception, while still providing the 
 GeneratorOptions.specialFloatLiterals option to handle those 
 values without exception overhead, but in a non 
 standard-conforming way.
What I care most about is getting all the free validation that 
can be added with no extra cost. That will make writing web 
services easier. Like if you can define constraints like:

- root is array, values are strings.
- root is array, second level only arrays, third level is 
numbers
- root is dict, all arrays contain only numbers

What is a bit annoying about generic libs is that you have no 
idea what you are getting so you have to spend time creating 
dull validation code. But maybe StructuredJSON should be a 
separate library. It would be useful for REST services to 
specify the grammar and auto-generate both javascript and D 
structures to hold it along with validation code.

However, just turning off parsing of "true", "false", "null", 
"[", "{" etc seems like a cheap addition that also can improve 
parsing speed if the compiler can make do with two if 
statements instead of a switch.

Ola.
Aug 25 2014
prev sibling parent reply "Don" <x nospam.com> writes:
On Monday, 25 August 2014 at 14:04:12 UTC, Sönke Ludwig wrote:
 Am 25.08.2014 15:07, schrieb Don:
 On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig 
 wrote:
 Following up on the recent "std.jgrandson" thread [1], I've 
 picked up
 the work (a lot earlier than anticipated) and finished a 
 first version
 of a loose blend of said std.jgrandson, vibe.data.json and 
 some
 changes that I had planned for vibe.data.json for a while. 
 I'm quite
 pleased by the results so far, although without a 
 serialization
 framework it still misses a very important building block.

 Code: https://github.com/s-ludwig/std_data_json
 Docs: http://s-ludwig.github.io/std_data_json/
 DUB: http://code.dlang.org/packages/std_data_json

 The new code contains:
 - Lazy lexer in the form of a token input range (using slices 
 of the
   input if possible)
 - Lazy streaming parser (StAX style) in the form of a node 
 input range
 - Eager DOM style parser returning a JSONValue
 - Range based JSON string generator taking either a token 
 range, a
   node range, or a JSONValue
 - Opt-out location tracking (line/column) for tokens, nodes 
 and values
 - No opDispatch() for JSONValue - this has shown to do more 
 harm than
   good in vibe.data.json

 The DOM style JSONValue type is based on 
 std.variant.Algebraic. This
 currently has a few usability issues that can be solved by
 upgrading/fixing Algebraic:

 - Operator overloading only works sporadically
 - No "tag" enum is supported, so that switch()ing on the type 
 of a
   value doesn't work and an if-else cascade is required
 - Operations and conversions between different Algebraic 
 types is not
   conveniently supported, which gets important when other 
 similar
   formats get supported (e.g. BSON)

 Assuming that those points are solved, I'd like to get some 
 early
 feedback before going for an official review. One open issue 
 is how to
 handle unescaping of string literals. Currently it always 
 unescapes
 immediately, which is more efficient for general input ranges 
 when the
 unescaped result is needed, but less efficient for string 
 inputs when
 the unescaped result is not needed. Maybe a flag could be 
 used to
 conditionally switch behavior depending on the input range 
 type.

 Destroy away! ;)

 [1]: 
 http://forum.dlang.org/thread/lrknjl$co7$1 digitalmars.com
One missing feature (which is also missing from the existing std.json) is support for NaN and Infinity as JSON values. Although they are not part of the formal JSON spec (which is a ridiculous omission, the argument given for excluding them is fallacious), they do get generated if you use Javascript's toString to create the JSON. Many JSON libraries (eg Google's) also generate them, so they are frequently encountered in practice. So a JSON parser should at least be able to lex them. ie this should be parsable: {"foo": NaN, "bar": Infinity, "baz": -Infinity}
This would probably best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though.
Yes, it should be optional, but not a compile-time option. I 
think it should parse it, and based on a runtime flag, throw an 
error (perhaps an OutOfRange error or something, and use the 
same thing for values that exceed the representable range). An 
app may accept these non-standard values under certain 
circumstances and not others. In real-world code, you see a 
*lot* of these guys.

Part of the reason these are important is that NaN or Infinity 
generally means some Javascript code just has an uninitialized 
variable. Any other kind of invalid JSON typically means 
something very nasty has happened. It's important to 
distinguish these.
Aug 26 2014
parent reply =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
Am 26.08.2014 15:43, schrieb Don:
 On Monday, 25 August 2014 at 14:04:12 UTC, Sönke Ludwig wrote:
 Am 25.08.2014 15:07, schrieb Don:
 ie this should be parsable:

 {"foo": NaN, "bar": Infinity, "baz": -Infinity}
This would probably best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though.
Yes, it should be optional, but not a compile-time option. I think it should parse it, and based on a runtime flag, throw an error (perhaps an OutOfRange error or something, and use the same thing for values that exceed the representable range). An app may accept these non-standard values under certain circumstances and not others. In real-world code, you see a *lot* of these guys.
Why not a compile time option? That sounds to me like such an app should simply enable parsing those values and manually test for NaN at places where it matters. For all other (the majority) of applications, encountering NaN/Infinity will simply mean that there is a bug, so it makes sense to not accept those at all by default. Apart from that I don't think that it's a good idea for the lexer in general to accept non-standard input by default.
 Part of the reason these are important, is that NaN or Infinity
 generally means some Javascript code just has an uninitialized variable.
 Any other kind of invalid JSON typically means something very nasty has
 happened. It's important to distinguish these.
As far as I understood, JavaScript will output those special values as null (at least when not using external JSON libraries). But even if not, an uninitialized variable can also be very nasty, so it's hard to see why that kind of bug should be silently supported (by default).
Aug 26 2014
parent reply "Don" <x nospam.com> writes:
On Tuesday, 26 August 2014 at 14:06:42 UTC, Sönke Ludwig wrote:
 Am 26.08.2014 15:43, schrieb Don:
 On Monday, 25 August 2014 at 14:04:12 UTC, Sönke Ludwig wrote:
 Am 25.08.2014 15:07, schrieb Don:
 ie this should be parsable:

 {"foo": NaN, "bar": Infinity, "baz": -Infinity}
This would probably best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though.
Yes, it should be optional, but not a compile-time option. I think it should parse it, and based on a runtime flag, throw an error (perhaps an OutOfRange error or something, and use the same thing for values that exceed the representable range). An app may accept these non-standard values under certain circumstances and not others. In real-world code, you see a *lot* of these guys.
Why not a compile time option? That sounds to me like such an app should simply enable parsing those values and manually test for NaN at places where it matters. For all other (the majority) of applications, encountering NaN/Infinity will simply mean that there is a bug, so it makes sense to not accept those at all by default. Apart from that I don't think that it's a good idea for the lexer in general to accept non-standard input by default.
Please note, I've been talking about the lexer. I'm choosing my words very carefully.
 Part of the reason these are important, is that NaN or Infinity
 generally means some Javascript code just has an uninitialized 
 variable.
 Any other kind of invalid JSON typically means something very 
 nasty has
 happened. It's important to distinguish these.
As far as I understood, JavaScript will output those special values as null (at least when not using external JSON libraries).
No. Javascript generates them directly. Naive JS code generates these guys. That's why they're so important.
 But even if not, an uninitialized variable can also be very 
 nasty, so it's hard to see why that kind of bug should be 
 silently supported (by default).
I never said it should be accepted by default. I said it is a 
situation which should be *lexed*. Ideally, by default it 
should give a different error from simply 'invalid JSON'. I 
believe it should ALWAYS be lexed, even if an error is 
ultimately generated.

This is the difference: if you get NaN or Infinity, there's 
probably a straightforward bug in the Javascript code, but your 
D code is fine. Any other kind of JSON parsing error means 
you've got a garbage string that isn't JSON at all. They are 
very different errors.

It's a diagnostics issue.
Aug 26 2014
next sibling parent reply =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
Am 26.08.2014 16:40, schrieb Don:
 On Tuesday, 26 August 2014 at 14:06:42 UTC, Sönke Ludwig wrote:
 Am 26.08.2014 15:43, schrieb Don:
 On Monday, 25 August 2014 at 14:04:12 UTC, Sönke Ludwig wrote:
 Am 25.08.2014 15:07, schrieb Don:
 ie this should be parsable:

 {"foo": NaN, "bar": Infinity, "baz": -Infinity}
This would probably best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though.
Yes, it should be optional, but not a compile-time option. I think it should parse it, and based on a runtime flag, throw an error (perhaps an OutOfRange error or something, and use the same thing for values that exceed the representable range). An app may accept these non-standard values under certain circumstances and not others. In real-world code, you see a *lot* of these guys.
Why not a compile time option? That sounds to me like such an app should simply enable parsing those values and manually test for NaN at places where it matters. For all other (the majority) of applications, encountering NaN/Infinity will simply mean that there is a bug, so it makes sense to not accept those at all by default. Apart from that I don't think that it's a good idea for the lexer in general to accept non-standard input by default.
Please note, I've been talking about the lexer. I'm choosing my words very carefully.
I've been talking about the lexer, too. Sorry for the confusing use of the term "parsing" (after all, the lexer is also a parser, but anyway).
 Part of the reason these are important, is that NaN or Infinity
 generally means some Javascript code just has an uninitialized variable.
 Any other kind of invalid JSON typically means something very nasty has
 happened. It's important to distinguish these.
As far as I understood, JavaScript will output those special values as null (at least when not using external JSON libraries).
No. Javascript generates them directly. Naive JS code generates these guys. That's why they're so important.
JSON.stringify(0/0) == "null"

Holds for all browsers that I've tested.
 But even if not, an uninitialized variable can also be very nasty, so
 it's hard to see why that kind of bug should be silently supported (by
 default).
I never said it should accepted by default. I said it is a situation which should be *lexed*. Ideally, by default it should give a different error from simply 'invalid JSON'. I believe it should ALWAYS be lexed, even if an error is ultimately generated. This is the difference: if you get NaN or Infinity, there's probably a straightforward bug in the Javascript code, but your D code is fine. Any other kind of JSON parsing error means you've got a garbage string that isn't JSON at all. They are very different errors. It's a diagnostics issue.
The error will be more like "filename(line:column): Invalid token" - possibly the text following the line/column could also be displayed. Wouldn't that be sufficient?
Aug 26 2014
parent =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
Am 26.08.2014 16:51, schrieb Sönke Ludwig:
 Am 26.08.2014 16:40, schrieb Don:
 This is the difference: if you get NaN or Infinity, there's probably a
 straightforward bug in the Javascript code, but your D code is fine. Any
 other kind of JSON parsing error means you've got a garbage string that
 isn't JSON at all. They are very different errors.
 It's a diagnostics issue.
The error will be more like "filename(line:column): Invalid token" - possibly the text following the line/column could also be displayed. Wouldn't that be sufficient?
One argument against supporting it in the parser is that the parser currently works without any configuration, but the user would then have to specify two sets of configuration options with this added.
Aug 26 2014
prev sibling parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Tuesday, 26 August 2014 at 14:40:02 UTC, Don wrote:
 This is the difference: if you get NaN or Infinity, there's 
 probably a straightforward bug in the Javascript code, but your 
 D code is fine. Any other kind of JSON parsing error means 
 you've got a garbage string that isn't JSON at all. They are 
 very different errors.
I don't care either way, but JSON.stringify() has the following 
support:

IE8 and up
Firefox 3.5 and up
Safari 4 and up
Chrome

So not using it is very much legacy…
Aug 26 2014
prev sibling next sibling parent reply "Entusiastic user" <cncgeneralsfan999 abv.bg> writes:
Hi!

Thanks for the effort you've put in this.

I am having problems with building with LDC 0.14.0. DMD 2.066.0
seems to work fine (all unit tests pass). Do you have any ideas
why?

I am using Ubuntu 3.10 (Linux 3.11.0-15-generic x86_64).

Master was at 6a9f8e62e456c3601fe8ff2e1fbb640f38793d08.
$ dub fetch std_data_json --version=~master
$ cd std_data_json-master/
$ dub test --compiler=ldc2

Generating test runner configuration '__test__library__' for
'library' (library).
Building std_data_json ~master configuration "__test__library__",
build type unittest.
Running ldc2...
source/stdx/data/json/parser.d(77): Error: @safe function
'stdx.data.json.parser.__unittestL68_22' cannot call @system
function 'object.AssociativeArray!(string,
JSONValue).AssociativeArray.length'
source/stdx/data/json/parser.d(124): Error: @safe function
'stdx.data.json.parser.__unittestL116_24' cannot call @system
function 'object.AssociativeArray!(string,
JSONValue).AssociativeArray.length'
source/stdx/data/json/parser.d(341): Error: function
stdx.data.json.parser.JSONParserRange!(JSONLexerRange!string).JSONParserRange.opAssign
is not callable because it is annotated with @disable
source/stdx/data/json/parser.d(341): Error: @safe function
'stdx.data.json.parser.__unittestL318_32' cannot call @system
function
'stdx.data.json.parser.JSONParserRange!(JSONLexerRange!string).JSONParserRange.opAssign'
source/stdx/data/json/parser.d(633): Error: function
stdx.data.json.lexer.JSONToken.opAssign is not callable because
it is annotated with @disable
source/stdx/data/json/parser.d(633): Error:
'stdx.data.json.lexer.JSONToken.opAssign' is not nothrow
source/stdx/data/json/parser.d(630): Error: function
'stdx.data.json.parser.JSONParserNode.literal' is nothrow yet may
throw
FAIL
.dub/build/__test__library__-unittest-linux.posix-x86_64-ldc2-0F620B217010475A5A4E545A57CDD09A/
__test__library__ executable
Error executing command test: ldc2 failed with exit code 1.

Thanks
Aug 25 2014
next sibling parent "Entusiastic user" <cncgeneralsfan999 abv.bg> writes:
 ...
 I am using Ubuntu 3.10 (Linux 3.11.0-15-generic x86_64).
 ...
I meant Ubuntu 13.10 :D
Aug 25 2014
prev sibling parent =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
Am 26.08.2014 03:31, schrieb Entusiastic user:
 Hi!

 Thanks for the effort you've put in this.

 I am having problems with building with LDC 0.14.0. DMD 2.066.0
 seems to work fine (all unit tests pass). Do you have any ideas
 why?
I've fixed all errors on DMD 2.065 now. Hopefully that should also fix LDC.
Aug 26 2014
prev sibling next sibling parent "David Soria Parra" <davidsp fb.com> writes:
On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote:
 Following up on the recent "std.jgrandson" thread [1], I've 
 picked up the work (a lot earlier than anticipated) and 
 finished a first version of a loose blend of said 
 std.jgrandson, vibe.data.json and some changes that I had 
 planned for vibe.data.json for a while. I'm quite pleased by 
 the results so far, although without a serialization framework 
 it still misses a very important building block.

 Code: https://github.com/s-ludwig/std_data_json
 Docs: http://s-ludwig.github.io/std_data_json/
 DUB: http://code.dlang.org/packages/std_data_json
Do we have any benchmarks for this yet? Note that the main 
motivation for a new json parser was that std.json is 
remarkably slow in comparison to python's json or ujson.
Aug 26 2014
prev sibling next sibling parent "Atila Neves" <atila.neves gmail.com> writes:
Been using it for a bit now, I think the only thing I have to say 
is having to insert all of those `JSONValue` everywhere is 
tiresome and I never know when I have to do it.

Atila

On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote:
 Following up on the recent "std.jgrandson" thread [1], I've 
 picked up the work (a lot earlier than anticipated) and 
 finished a first version of a loose blend of said 
 std.jgrandson, vibe.data.json and some changes that I had 
 planned for vibe.data.json for a while. I'm quite pleased by 
 the results so far, although without a serialization framework 
 it still misses a very important building block.

 Code: https://github.com/s-ludwig/std_data_json
 Docs: http://s-ludwig.github.io/std_data_json/
 DUB: http://code.dlang.org/packages/std_data_json

 The new code contains:
  - Lazy lexer in the form of a token input range (using slices 
 of the
    input if possible)
  - Lazy streaming parser (StAX style) in the form of a node 
 input range
  - Eager DOM style parser returning a JSONValue
  - Range based JSON string generator taking either a token 
 range, a
    node range, or a JSONValue
  - Opt-out location tracking (line/column) for tokens, nodes 
 and values
  - No opDispatch() for JSONValue - this has shown to do more 
 harm than
    good in vibe.data.json

 The DOM style JSONValue type is based on std.variant.Algebraic. 
 This currently has a few usability issues that can be solved by 
 upgrading/fixing Algebraic:

  - Operator overloading only works sporadically
  - No "tag" enum is supported, so that switch()ing on the type 
 of a
    value doesn't work and an if-else cascade is required
  - Operations and conversions between different Algebraic types 
 is not
    conveniently supported, which gets important when other 
 similar
    formats get supported (e.g. BSON)

 Assuming that those points are solved, I'd like to get some 
 early feedback before going for an official review. One open 
 issue is how to handle unescaping of string literals. Currently 
 it always unescapes immediately, which is more efficient for 
 general input ranges when the unescaped result is needed, but 
 less efficient for string inputs when the unescaped result is 
 not needed. Maybe a flag could be used to conditionally switch 
 behavior depending on the input range type.

 Destroy away! ;)

 [1]: http://forum.dlang.org/thread/lrknjl$co7$1 digitalmars.com
Sep 08 2014
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Here's my destruction of std.data.json.

* lexer.d:

** Beautifully done. From what I understand, if the input is string or 
immutable(ubyte)[] then the strings are carved out as slices of the 
input, as opposed to newly allocated. Awesome.

** The string after lexing is correctly scanned and stored in raw format 
(escapes are not rewritten) and decoded on demand. Problem with decoding 
is that it may allocate memory, and it would be great (and not 
difficult) to make the lexer 100% lazy/non-allocating. To achieve that, 
lexer.d should define TWO "Kind"s of strings at the lexer level: regular 
string and undecoded string. The former is lexer.d's way of saying "I 
got lucky" in the sense that it didn't detect any '\\' so the raw and 
decoded strings are identical. No need for anyone to do any further 
processing in the majority of cases => win. The latter means the lexer 
lexed the string, saw at least one '\\', and leaves it to the caller to 
do the actual decoding.
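
A rough sketch of that split (names hypothetical, not the 
actual lexer.d API):

---
enum Kind
{
    // ...
    string_,          // raw slice == decoded value, nothing to do
    undecodedString,  // contains at least one '\\'; caller decodes
                      // on demand
}

struct Token
{
    Kind kind;
    string rawValue;  // always a slice of the input for string input
}
---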

** After moving the decoding business out of lexer.d, a way to take this 
further would be to qualify lexer methods as @nogc if the input is 
string/immutable(ubyte)[]. I wonder how to implement a conditional 
attribute. We'll probably need a language enhancement for that.

** The implementation uses manually-defined tagged unions for work. 
Could we use Algebraic instead - dogfooding and all that? I recall there 
was a comment in Sönke's original work that Algebraic has a specific 
issue (was it false pointers?) - so the question arises, should we fix 
Algebraic and use it thus helping other uses as well?

** I see the "boolean" kind, should we instead have the "true_" and 
"false_" kinds?

** Long story short I couldn't find any major issue with this module, 
and I looked! I do think the decoding logic should be moved outside of 
lexer.d or at least the JSONLexerRange.

* generator.d: looking good, no special comments. Like the consistent 
use of structs filled with options as template parameters.

* foundation.d:

** At four words per token, Location seems pretty bulky. How about 
reducing line and column to uint?
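
I.e. something along these lines (a sketch, not the current 
foundation.d definition):

---
struct Location
{
    string file;  // two words
    uint line;    // packed together
    uint column;  // into one word
}
---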

** Could JSONException create the message string in toString (i.e. 
when/if used) as opposed to in the constructor?

* parser.d:

** How about using .init instead of .defaults for options?

** I'm a bit surprised by JSONParserNode.Kind. E.g. the objectStart/End 
markers shouldn't appear as nodes. There should be an "object" node 
only. I guess that's needed for laziness.

** It's unclear where memory is being allocated in the parser. @nogc 
annotations wherever appropriate would be great.

* value.d:

** Looks like this is/may be the only place where memory is being 
managed, at least if the input is string/immutable(ubyte)[]. Right?

** Algebraic ftw.

============================

Overall: This is very close to everything I hoped! A bit more care to 
@nogc would be awesome, especially with the upcoming focus on memory 
management going forward.

After one more pass it would be great to move forward for review.


Andrei
Oct 12 2014
next sibling parent reply "Sean Kelly" <sean invisibleduck.org> writes:
On Sunday, 12 October 2014 at 18:17:29 UTC, Andrei Alexandrescu 
wrote:
 ** The string after lexing is correctly scanned and stored in 
 raw format (escapes are not rewritten) and decoded on demand. 
 Problem with decoding is that it may allocate memory, and it 
 would be great (and not difficult) to make the lexer 100% 
 lazy/non-allocating. To achieve that, lexer.d should define TWO 
 "Kind"s of strings at the lexer level: regular string and 
 undecoded string. The former is lexer.d's way of saying "I got 
 lucky" in the sense that it didn't detect any '\\' so the raw 
 and decoded strings are identical. No need for anyone to do any 
 further processing in the majority of cases => win. The latter 
 means the lexer lexed the string, saw at least one '\\', and 
 leaves it to the caller to do the actual decoding.
I'd like to see unescapeStringLiteral() made public. Then I can unescape multiple strings to the same preallocated destination, or even unescape in place (guaranteed to work since the result will always be smaller than the input).
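
Something along these lines (a simplified sketch handling only 
a few escapes; the real unescapeStringLiteral also deals with 
\uXXXX and friends):

---
// Unescape a JSON string body in place; returns the new length.
// Works because the unescaped form is never longer than the input.
size_t unescapeInPlace(char[] buf)
{
    size_t w;
    for (size_t r = 0; r < buf.length; r++)
    {
        char c = buf[r];
        if (c == '\\' && r + 1 < buf.length)
        {
            char next = buf[++r];
            switch (next)
            {
                case 'n': c = '\n'; break;
                case 't': c = '\t'; break;
                case '"': c = '"'; break;
                case '\\': c = '\\'; break;
                default:
                    // escapes not handled in this sketch are
                    // copied through verbatim
                    buf[w++] = c;
                    c = next;
                    break;
            }
        }
        buf[w++] = c;
    }
    return w;
}
---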
Oct 12 2014
next sibling parent reply "Sean Kelly" <sean invisibleduck.org> writes:
Oh, it looks like you aren't checking for 0x7F (DEL) as a control 
character.
Oct 12 2014
parent =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
Am 12.10.2014 23:52, schrieb Sean Kelly:
 Oh, it looks like you aren't checking for 0x7F (DEL) as a control
 character.
It doesn't get mentioned in the JSON spec, so I left it out. But I guess nothing speaks against adding it anyway.
Oct 13 2014
prev sibling parent =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
Am 12.10.2014 21:04, schrieb Sean Kelly:
 I'd like to see unescapeStringLiteral() made public.  Then I can
 unescape multiple strings to the same preallocated destination, or even
 unescape in place (guaranteed to work since the result will always be
 smaller than the input).
Will do. Same for the inverse functions.
Oct 13 2014
prev sibling parent reply =?ISO-8859-15?Q?S=F6nke_Ludwig?= <sludwig rejectedsoftware.com> writes:
Am 12.10.2014 20:17, schrieb Andrei Alexandrescu:
 Here's my destruction of std.data.json.

 * lexer.d:

 ** Beautifully done. From what I understand, if the input is string or
 immutable(ubyte)[] then the strings are carved out as slices of the
 input, as opposed to newly allocated. Awesome.

 ** The string after lexing is correctly scanned and stored in raw format
 (escapes are not rewritten) and decoded on demand. Problem with decoding
 is that it may allocate memory, and it would be great (and not
 difficult) to make the lexer 100% lazy/non-allocating. To achieve that,
 lexer.d should define TWO "Kind"s of strings at the lexer level: regular
 string and undecoded string. The former is lexer.d's way of saying "I
 got lucky" in the sense that it didn't detect any '\\' so the raw and
 decoded strings are identical. No need for anyone to do any further
 processing in the majority of cases => win. The latter means the lexer
 lexed the string, saw at least one '\\', and leaves it to the caller to
 do the actual decoding.
This is actually more or less done in unescapeStringLiteral() - if it doesn't find any '\\', it just returns the original string. JSONString also allows access to its .rawValue without doing any decoding/allocations. https://github.com/s-ludwig/std_data_json/blob/master/source/stdx/data/json/lexer.d#L1421 Unfortunately .rawValue can't be @nogc, because the "raw" value might have to be constructed first when the input is not a "string" (in that case unescaping is done on the fly for efficiency reasons).
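So for string inputs the typical case is already allocation-free. A tiny illustration of the intended behavior (this assumes the plain string-to-string overload of unescapeStringLiteral once it is public):

    // No '\\' present: the input slice itself is returned, nothing is copied.
    string plain = `hello world`;
    assert(unescapeStringLiteral(plain) is plain);

    // Only strings that actually contain an escape need a decoded copy.
    string escaped = `hello\nworld`;
    assert(unescapeStringLiteral(escaped) == "hello\nworld");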
 ** After moving the decoding business out of lexer.d, a way to take this
 further would be to qualify lexer methods as @nogc if the input is
 string/immutable(ubyte)[]. I wonder how to implement a conditional
 attribute. We'll probably need a language enhancement for that.
Isn't @nogc inferred? Everything is templated, so that should be possible. Or does attribute inference only work for template functions and not for methods of templated types? Should it?
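For reference, inference does kick in for free function templates - a toy example; whether the same applies to non-template methods of templated aggregates is exactly the open question:

    // No explicit attributes on the template...
    T twice(T)(T x) { return x + x; }

    // ...yet the instantiation is usable from @nogc code, because the
    // attributes are inferred per instance.
    @nogc nothrow void caller()
    {
        auto y = twice(21);
        assert(y == 42);
    }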
 ** The implementation uses manually-defined tagged unions for work.
 Could we use Algebraic instead - dogfooding and all that? I recall there
 was a comment in Sönke's original work that Algebraic has a specific
 issue (was it false pointers?) - so the question arises, should we fix
 Algebraic and use it thus helping other uses as well?
I had started on an implementation of a type- and ID-safe TaggedAlgebraic that uses Algebraic for its internal storage. If we can get that in first, it should be no problem to use it instead (with no or minimal API breakage). However, it uses a struct instead of an enum to define the "Kind" (which is the only nice way I could think of to safely couple enum value and type at compile time), so it's not as nice in the generated documentation.
 ** I see the "boolean" kind, should we instead have the "true_" and
 "false_" kinds?
I always found it cumbersome and awkward to work like that. What would be the reason to go that route?
 ** Long story short I couldn't find any major issue with this module,
 and I looked! I do think the decoding logic should be moved outside of
 lexer.d or at least the JSONLexerRange.

 * generator.d: looking good, no special comments. Like the consistent
 use of structs filled with options as template parameters.

 * foundation.d:

 ** At four words per token, Location seems pretty bulky. How about
 reducing line and column to uint?
Single-line JSON files >64k (or line counts >64k) are not uncommon, so that would only work in a limited way. My thought about this was that it is quite unusual to actually store the tokens for most purposes (especially when serializing directly to a native D type), so the Location size should have minimal impact on performance or memory consumption.
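For scale, the saving Andrei is after is 8 bytes per Location on 64-bit targets. A quick sketch, assuming a layout of roughly a file slice plus line/column (the field names are illustrative):

    struct LocationNow  { string file; size_t line, column; } // roughly the current four-word layout
    struct LocationUint { string file; uint line, column; }   // proposed

    // 32 vs. 24 bytes on 64-bit targets; no difference on 32-bit ones.
    static assert(LocationUint.sizeof <= LocationNow.sizeof);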
 ** Could JSONException create the message string in toString (i.e.
 when/if used) as opposed to in the constructor?
That could of course be done, but then you'd not get the full error message using ex.msg, only with ex.toString(), which usually prints a call trace as well. Alternatively, it's also possible to completely avoid using exceptions with LexOptions.noThrow.
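For completeness, deferring the formatting would look roughly like this - only a sketch, not the current JSONException, and it keeps exactly the ex.msg drawback mentioned above:

    class LazyLocationException : Exception
    {
        private size_t errLine, errColumn; // taken from the offending token's location

        this(string msg, size_t errLine, size_t errColumn,
             string file = __FILE__, size_t line = __LINE__) @safe pure nothrow
        {
            super(msg, file, line);
            this.errLine = errLine;
            this.errColumn = errColumn;
        }

        // The location is only rendered when the exception is actually
        // printed; ex.msg alone stays without it.
        override void toString(scope void delegate(in char[]) sink) const
        {
            import std.format : formattedWrite;
            sink.formattedWrite("JSON error at %s:%s: %s", errLine, errColumn, msg);
        }
    }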
 * parser.d:

 ** How about using .init instead of .defaults for options?
I'd slightly tend to prefer the more explicit "defaults", especially because "init" could mean either "defaults" or "none" (currently it means "none"). But another idea would be to invert the option values so that defaults==none... any objections?
 ** I'm a bit surprised by JSONParserNode.Kind. E.g. the objectStart/End
 markers shouldn't appear as nodes. There should be an "object" node
 only. I guess that's needed for laziness.
While you could infer the end of an object in the parser range by looking for the first entry that doesn't start with a "key" node, the same would not be possible for arrays, so in general the end marker *is* required. Note that the parser range is a StAX style parser, which is still very close to the lexical structure of the document. I was also wondering if there might be a better name than "JSONParserNode". It's not really embedded into a tree or graph structure, which the name tends to suggest.
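To make the array case concrete (self-contained toy types below, not JSONParserNode itself):

    enum NodeKind { arrayStart, arrayEnd, number }

    struct Node { NodeKind kind; double value = 0; }

    // [1, [2, 3], 4] as a StAX-style node stream:
    immutable Node[] stream = [
        Node(NodeKind.arrayStart),
        Node(NodeKind.number, 1),
        Node(NodeKind.arrayStart),
        Node(NodeKind.number, 2),
        Node(NodeKind.number, 3),
        Node(NodeKind.arrayEnd),
        Node(NodeKind.number, 4),
        Node(NodeKind.arrayEnd),
    ];
    // Without the arrayEnd nodes, the flat sequence could equally well
    // describe [1, [2, 3, 4]] - unlike objects, array elements have no
    // "key" node that could mark where a nesting level stops.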
 ** It's unclear where memory is being allocated in the parser. @nogc
 annotations wherever appropriate would be great.
The problem is that the parser accesses the lexer, which in turn accesses the underlying input range, which in turn could allocate. Depending on the options passed to the lexer, it could also throw, and thus allocate, an exception. In the end only JSONParserRange.empty could generally be made @nogc. However, attribute inference should be possible here in theory (the noThrow option is compile-time).
 * value.d:

 ** Looks like this is/may be the only place where memory is being
 managed, at least if the input is string/immutable(ubyte)[]. Right?
Yes, at least when setting aside optional exceptions and lazy allocations.
 ** Algebraic ftw.

 ============================

 Overall: This is very close to everything I hoped! A bit more care to
 @nogc would be awesome, especially with the upcoming focus on memory
 management going forward.
I've tried to use @nogc (as well as nothrow) in more places, but since it's mostly unknown whether the underlying input range allocates, it hasn't really been possible. Even at lower levels (private functions), almost any Phobos function that gets called is currently not @nogc, for reasons that are not always obvious, so I gave up on that for now.
 After one more pass it would be great to move forward for review.
There is also one pending change that I haven't finished yet: the optional UTF input validation (never validate "string" inputs, but do validate "ubyte[]" inputs). Oh, and there is the open issue of how to allocate in the case of non-array inputs. Initially I wanted to postpone this until we have an allocators module, but Walter would like to have a way to do manual memory management in the initial version. However, the ideal design is still unclear to me - it would either simply resemble a general allocator interface, or could use something like a callback that returns an output range, which would probably be quite cumbersome to work with. Any ideas in this direction would be welcome. Sönke
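PS: To make the output-range idea a bit more tangible, the hook could have roughly this shape - entirely hypothetical, nothing of it exists in the package:

    import std.array : Appender, appender;

    // The lexer would call such a hook whenever it needs storage for a
    // decoded string and the input cannot simply be sliced.
    alias StringSink = Appender!(char[]) delegate(size_t sizeHint);

    void example()
    {
        StringSink gcBacked = delegate(size_t sizeHint) {
            auto app = appender!(char[])();
            app.reserve(sizeHint); // a malloc-backed variant would reserve elsewhere
            return app;
        };
        // hypothetical: lexJSON(input, gcBacked);
    }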
Oct 13 2014
parent reply Jacob Carlborg <doob me.com> writes:
On 13/10/14 09:39, Sönke Ludwig wrote:

 ** At four words per token, Location seems pretty bulky. How about
 reducing line and column to uint?
Single-line JSON files >64k (or line counts >64k) are not uncommon
64k? -- /Jacob Carlborg
Oct 13 2014
parent reply =?ISO-8859-15?Q?S=F6nke_Ludwig?= <sludwig rejectedsoftware.com> writes:
On 13.10.2014 13:33, Jacob Carlborg wrote:
 On 13/10/14 09:39, Sönke Ludwig wrote:

 ** At four words per token, Location seems pretty bulky. How about
 reducing line and column to uint?
Single-line JSON files >64k (or line counts >64k) are not uncommon
64k?
Oh, I've read "both line and column into a single uint", because of "four words per token" - considering that "word == 16bit", but Andrei obviously meant "word == (void*).sizeof". If simply using uint instead of size_t is meant, then that's of course a different thing.
Oct 13 2014
next sibling parent reply "Daniel Murphy" <yebbliesnospam gmail.com> writes:
"Sönke Ludwig"  wrote in message news:m1ge08$10ub$1 digitalmars.com...

 Oh, I've read "both line and column into a single uint", because of "four 
 words per token" - considering that "word == 16bit", but Andrei obviously 
 meant "word == (void*).sizeof". If simply using uint instead of size_t is 
 meant, then that's of course a different thing.
I suppose a 4GB single-line json file is still possible.
Oct 13 2014
parent reply =?ISO-8859-15?Q?S=F6nke_Ludwig?= <sludwig rejectedsoftware.com> writes:
On 13.10.2014 16:36, Daniel Murphy wrote:
 "Sönke Ludwig"  wrote in message news:m1ge08$10ub$1 digitalmars.com...

 Oh, I've read "both line and column into a single uint", because of
 "four words per token" - considering that "word == 16bit", but Andrei
 obviously meant "word == (void*).sizeof". If simply using uint instead
 of size_t is meant, then that's of course a different thing.
I suppose a 4GB single-line json file is still possible.
If we make that assumption, we'd have to change it from size_t to ulong, but my feeling is that this case (format error at >4GB && human tries to look at that place using an editor) should be rare enough that we can make the compromise in favor of a smaller struct size.
Oct 13 2014
next sibling parent reply "Kiith-Sa" <kiithsacmp gmail.com> writes:
On Monday, 13 October 2014 at 17:21:44 UTC, Sönke Ludwig wrote:
 On 13.10.2014 16:36, Daniel Murphy wrote:
 "Sönke Ludwig"  wrote in message 
 news:m1ge08$10ub$1 digitalmars.com...

 Oh, I've read "both line and column into a single uint", 
 because of
 "four words per token" - considering that "word == 16bit", 
 but Andrei
 obviously meant "word == (void*).sizeof". If simply using 
 uint instead
 of size_t is meant, then that's of course a different thing.
I suppose a 4GB single-line json file is still possible.
If we make that assumption, we'd have to change it from size_t to ulong, but my feeling is that this case (format error at >4GB && human tries to look at that place using an editor) should be rare enough that we can make the compromise in favor of a smaller struct size.
What are you using the location structs for? In D:YAML they're only used for info about errors, so I use ushorts and ushort.max means "65535 or more".
Oct 13 2014
parent =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
On 13.10.2014 19:40, Kiith-Sa wrote:
 On Monday, 13 October 2014 at 17:21:44 UTC, Sönke Ludwig wrote:
 On 13.10.2014 16:36, Daniel Murphy wrote:
 "Sönke Ludwig"  wrote in message news:m1ge08$10ub$1 digitalmars.com...

 Oh, I've read "both line and column into a single uint", because of
 "four words per token" - considering that "word == 16bit", but Andrei
 obviously meant "word == (void*).sizeof". If simply using uint instead
 of size_t is meant, then that's of course a different thing.
I suppose a 4GB single-line json file is still possible.
If we make that assumption, we'd have to change it from size_t to ulong, but my feeling is that this case (format error at >4GB && human tries to look at that place using an editor) should be rare enough that we can make the compromise in favor of a smaller struct size.
What are you using the location structs for? In D:YAML they're only used for info about errors, so I use ushorts and ushort.max means "65535 or more".
Within the package itself they are also only used for error information. But they are also generally available with each token/node/value, so people could do very different things with them.
Oct 13 2014
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/13/14, 10:21 AM, Sönke Ludwig wrote:
 On 13.10.2014 16:36, Daniel Murphy wrote:
 "Sönke Ludwig"  wrote in message news:m1ge08$10ub$1 digitalmars.com...

 Oh, I've read "both line and column into a single uint", because of
 "four words per token" - considering that "word == 16bit", but Andrei
 obviously meant "word == (void*).sizeof". If simply using uint instead
 of size_t is meant, then that's of course a different thing.
I suppose a 4GB single-line json file is still possible.
If we make that assumption, we'd have to change it from size_t to ulong, but my feeling is that this case (format error at >4GB && human tries to look at that place using an editor) should be rare enough that we can make the compromise in favor of a smaller struct size.
Agreed. -- Andrei
Oct 13 2014
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/13/14, 4:45 AM, Sönke Ludwig wrote:
 On 13.10.2014 13:33, Jacob Carlborg wrote:
 On 13/10/14 09:39, Sönke Ludwig wrote:

 ** At four words per token, Location seems pretty bulky. How about
 reducing line and column to uint?
 Single-line JSON files >64k (or line counts >64k) are not uncommon
64k?
Oh, I've read "both line and column into a single uint", because of "four words per token" - considering that "word == 16bit", but Andrei obviously meant "word == (void*).sizeof". If simply using uint instead of size_t is meant, then that's of course a different thing.
Yah, one uint for each. -- Andrei
Oct 13 2014
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 22/08/14 00:35, Sönke Ludwig wrote:
 Following up on the recent "std.jgrandson" thread [1], I've picked up
 the work (a lot earlier than anticipated) and finished a first version
 of a loose blend of said std.jgrandson, vibe.data.json and some changes
 that I had planned for vibe.data.json for a while. I'm quite pleased by
 the results so far, although without a serialization framework it still
 misses a very important building block.

 Code: https://github.com/s-ludwig/std_data_json
JSONToken.Kind and JSONParserNode.Kind could be "ubyte" to save space. -- /Jacob Carlborg
Oct 13 2014
parent reply =?ISO-8859-15?Q?S=F6nke_Ludwig?= <sludwig rejectedsoftware.com> writes:
On 13.10.2014 13:37, Jacob Carlborg wrote:
 On 22/08/14 00:35, Sönke Ludwig wrote:
 Following up on the recent "std.jgrandson" thread [1], I've picked up
 the work (a lot earlier than anticipated) and finished a first version
 of a loose blend of said std.jgrandson, vibe.data.json and some changes
 that I had planned for vibe.data.json for a while. I'm quite pleased by
 the results so far, although without a serialization framework it still
 misses a very important building block.

 Code: https://github.com/s-ludwig/std_data_json
JSONToken.Kind and JSONParserNode.Kind could be "ubyte" to save space.
But it won't save space in practice, at least on x86, due to alignment, and depending on what the compiler assumes, the access can also be slower that way.
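A quick illustration of why the ubyte doesn't actually buy anything here (toy structs, not the real token layout):

    struct TokenA { size_t kind; double number; string str; }
    struct TokenB { ubyte  kind; double number; string str; }

    // The ubyte gets padded up to the alignment of the following fields,
    // so both layouts come out at the same size (32 bytes on x86-64).
    static assert(TokenA.sizeof == TokenB.sizeof);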
Oct 13 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/13/14, 4:48 AM, Sönke Ludwig wrote:
 On 13.10.2014 13:37, Jacob Carlborg wrote:
 On 22/08/14 00:35, Sönke Ludwig wrote:
 Following up on the recent "std.jgrandson" thread [1], I've picked up
 the work (a lot earlier than anticipated) and finished a first version
 of a loose blend of said std.jgrandson, vibe.data.json and some changes
 that I had planned for vibe.data.json for a while. I'm quite pleased by
 the results so far, although without a serialization framework it still
 misses a very important building block.

 Code: https://github.com/s-ludwig/std_data_json
JSONToken.Kind and JSONParserNode.Kind could be "ubyte" to save space.
But it won't save space in practice, at least on x86, due to alignment, and depending on what the compiler assumes, the access can also be slower that way.
Correct. -- Andrei
Oct 13 2014
prev sibling next sibling parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
On 8/21/14, 7:35 PM, Sönke Ludwig wrote:
 Following up on the recent "std.jgrandson" thread [1], I've picked up
 the work (a lot earlier than anticipated) and finished a first version
 of a loose blend of said std.jgrandson, vibe.data.json and some changes
 that I had planned for vibe.data.json for a while. I'm quite pleased by
 the results so far, although without a serialization framework it still
 misses a very important building block.

 Code: https://github.com/s-ludwig/std_data_json
 Docs: http://s-ludwig.github.io/std_data_json/
 DUB: http://code.dlang.org/packages/std_data_json

 Destroy away! ;)

 [1]: http://forum.dlang.org/thread/lrknjl$co7$1 digitalmars.com
Once it's done you can compare its performance against other languages with this benchmark: https://github.com/kostya/benchmarks/tree/master/json
Oct 17 2014
parent reply "Sean Kelly" <sean invisibleduck.org> writes:
On Friday, 17 October 2014 at 18:27:34 UTC, Ary Borenszweig wrote:
 Once its done you can compare its performance against other 
 languages with this benchmark:

 https://github.com/kostya/benchmarks/tree/master/json
Wow, the C++ Rapid parser is really impressive. I threw together a test with my own parser for comparison, and Rapid still beat it. It's the first parser I've encountered that's faster.

Ruby
0.4995479721139979
0.49977992077421846
0.49981146157805545
7.53s, 2330.9Mb

Python
0.499547972114
0.499779920774
0.499811461578
12.01s, 1355.1Mb

C++ Rapid
0.499548
0.49978
0.499811
1.75s, 1009.0Mb

JEP (mine)
0.49954797
0.49977992
0.49981146
2.38s, 203.4Mb
Oct 18 2014
next sibling parent "Sean Kelly" <sean invisibleduck.org> writes:
On Saturday, 18 October 2014 at 19:53:23 UTC, Sean Kelly wrote:
 On Friday, 17 October 2014 at 18:27:34 UTC, Ary Borenszweig 
 wrote:
 Once its done you can compare its performance against other 
 languages with this benchmark:

 https://github.com/kostya/benchmarks/tree/master/json
Wow, the C++ Rapid parser is really impressive. I threw together a test with my own parser for comparison, and Rapid still beat it. It's the first parser I've encountered that's faster.

C++ Rapid
0.499548
0.49978
0.499811
1.75s, 1009.0Mb

JEP (mine)
0.49954797
0.49977992
0.49981146
2.38s, 203.4Mb
I just commented out the sscanf() call that was parsing the float and re-ran the test to see what the difference would be. Here's the new timing:

JEP (mine)
0.00000000
0.00000000
0.00000000
1.23s, 203.1Mb

So nearly half of the total execution time was spent simply parsing floats. For this reason, I'm starting to think that this isn't the best benchmark of JSON parser performance.

The other issue with my parser is that it's written in C, and so all of the user-defined bits are called via a bank of function pointers. If it were converted to C++ or D, where this could be done via templates, it would be much faster. Just as a test I nulled out the function pointers I'd set to see what the cost of indirection was, and here's the result:

JEP (mine)
nan
nan
nan
0.57s, 109.4Mb

The memory difference is interesting, and I can't entirely explain it other than to say that it's probably an artifact of my mapping the file in as virtual memory rather than reading it into an allocated buffer. Either way, roughly 0.60s can be attributed to indirect function calls and the bit of logic on the other side, which seems like a good candidate for optimization.
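For comparison, this is roughly the devirtualization a C++ or D port would get for free (a toy sketch in D, not JEP's actual interface):

    // Callback through a function pointer - opaque to the optimizer:
    alias NumberHandler = void function(double);
    void parseViaPointer(const(char)[] input, NumberHandler onNumber)
    {
        // ... lexing elided ...
        onNumber(42);
    }

    // Callback as an alias template parameter - a candidate for inlining:
    void parseViaTemplate(alias onNumber)(const(char)[] input)
    {
        // ... lexing elided ...
        onNumber(42);
    }

    void main()
    {
        double sum = 0;
        parseViaPointer("42", (double d) {});            // can't touch locals here
        parseViaTemplate!((double d) => sum += d)("42"); // closure passed at compile time
        assert(sum == 42);
    }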
Oct 18 2014
prev sibling next sibling parent Ary Borenszweig <ary esperanto.org.ar> writes:
On 10/18/14, 4:53 PM, Sean Kelly wrote:
 On Friday, 17 October 2014 at 18:27:34 UTC, Ary Borenszweig wrote:
 Once its done you can compare its performance against other languages
 with this benchmark:

 https://github.com/kostya/benchmarks/tree/master/json
Wow, the C++ Rapid parser is really impressive. I threw together a test with my own parser for comparison, and Rapid still beat it. It's the first parser I've encountered that's faster.

Ruby
0.4995479721139979
0.49977992077421846
0.49981146157805545
7.53s, 2330.9Mb

Python
0.499547972114
0.499779920774
0.499811461578
12.01s, 1355.1Mb

C++ Rapid
0.499548
0.49978
0.499811
1.75s, 1009.0Mb

JEP (mine)
0.49954797
0.49977992
0.49981146
2.38s, 203.4Mb
Yes, C++ Rapid seems to be really, really fast. It has some SSE2/SSE4-specific optimizations and I guess a lot more. I have to investigate more in order to do something similar :-)
Oct 19 2014
prev sibling parent "David Soria Parra" <davidsp fb.com> writes:
On Saturday, 18 October 2014 at 19:53:23 UTC, Sean Kelly wrote:

 Python
 0.499547972114
 0.499779920774
 0.499811461578
 12.01s, 1355.1Mb
I assume this is the standard json module? I'm wondering how ujson, which is considered the fastest Python JSON module, would perform here.
Oct 20 2014
prev sibling parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote:
 ...
Added to the review queue as a work in progress with relevant links: http://wiki.dlang.org/Review_Queue
Feb 05 2015
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/5/15 1:07 AM, Jakob Ovrum wrote:
 On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote:
 ...
Added to the review queue as a work in progress with relevant links: http://wiki.dlang.org/Review_Queue
Yay! -- Andrei
Feb 05 2015
prev sibling parent =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig rejectedsoftware.com> writes:
On 05.02.2015 at 10:07, Jakob Ovrum wrote:
 On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote:
 ...
Added to the review queue as a work in progress with relevant links: http://wiki.dlang.org/Review_Queue
Thanks! I(t) should be ready for an official review in one or two weeks when my schedule relaxes a little bit.
Feb 05 2015