digitalmars.D - stdx.data.json needs a layer on top

reply "Laeeth Isharc" <laeethnospam nospamlaeeth.com> writes:
It's great, but it's not quite a replacement for std.json, as I 
see it.

The stream parser is fast, and it's valuable to be able to access 
it at a low level.

However, it was consciously designed to be low-level, and for 
something else to go on top.

As I understand it, there is a gap between what you can currently 
do with std.json (and indeed vibe.d's JSON) and what you can do 
with stdx.data.json.  And the capability falls short of what can 
be done in other standard libraries such as Python's.

So since we are going for a nuclear-power-station-included 
approach, does that not mean that we need to specify what this 
layer should do, and somebody should start to work on it?
Jun 23 2015
next sibling parent reply Rikki Cattermole <alphaglosined gmail.com> writes:
On 24/06/2015 12:17 a.m., Laeeth Isharc wrote:
 It's great, but it's not quite a replacement for std.json, as I see it.

 The stream parser is fast, and it's valuable to be able to access it at
 a low level.

 However, it was consciously designed to be low-level, and for something
 else to go on top.

 As I understand it, there is a gap between what you can currently do
 with std.json (and indeed vibed json) and what you can do with
 stdx.data.json.  And the capability falls short of what can be done in
 other standard libraries such as the ones for python.

 So since we are going for a nuclear-power station included approach,
 does that not mean that we need to specify what this layer should do,
 and somebody should start to work on it?
Please come onto https://www.livecoding.tv/alphaglosined/ and hang out for half an hour. I want to show you something related.
Jun 23 2015
parent reply "Laeeth Isharc" <laeethnospam nospamlaeeth.com> writes:
On Tuesday, 23 June 2015 at 12:28:00 UTC, Rikki Cattermole wrote:
 Please come onto https://www.livecoding.tv/alphaglosined/ and 
 hang out for half an hour. I want to show you something related.
What times GMT or BST are good for you?
Jun 23 2015
parent Rikki Cattermole <alphaglosined gmail.com> writes:
On 24/06/2015 7:05 a.m., Laeeth Isharc wrote:
 On Tuesday, 23 June 2015 at 12:28:00 UTC, Rikki Cattermole wrote:
 Please come onto https://www.livecoding.tv/alphaglosined/ and hang out
 for half an hour. I want to show you something related.
 What times GMT or BST are good for you?
12pm UTC+0 is when I aim to stream, and hopefully I'll stream again tonight, although I'm getting a bit tired after streaming for three days straight (usually it's only twice a week). Follow me, or keep an eye on livecodingtv on Twitter, to know when I start.
Jun 23 2015
prev sibling parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 23.06.2015 at 14:17, Laeeth Isharc wrote:
 It's great, but it's not quite a replacement for std.json, as I see it.

 The stream parser is fast, and it's valuable to be able to access it at
 a low level.

 However, it was consciously designed to be low-level, and for something
 else to go on top.

 As I understand it, there is a gap between what you can currently do
 with std.json (and indeed vibed json) and what you can do with
 stdx.data.json.  And the capability falls short of what can be done in
 other standard libraries such as the ones for python.

 So since we are going for a nuclear-power station included approach,
 does that not mean that we need to specify what this layer should do,
 and somebody should start to work on it?
One thing, which I consider the most important missing building block, is Jacob's anticipated std.serialization module [1]*. Skipping the data representation layer and going straight for statically typed access to the data is the way to go in a language such as D, at least in most situations.

Another part is a high-level layer on top of the stream parser that has existed for a while (albeit with room for improvement), but that I forgot to update the documentation for. I've now caught up on that and it can be found under [2] - see the read[...] and skip[...] functions (a rough usage sketch follows at the end of this message).

Do you, or anyone else, have further ideas for higher-level functionality, or any concrete examples in other standard libraries?

[1]: https://github.com/jacob-carlborg/orange
[2]: http://s-ludwig.github.io/std_data_json/stdx/data/json/parser.html

* Or any other suitable replacement, if that doesn't work out for some reason. The vibe.data.serialization module to me is not a suitable candidate as it stands, because it lacks some features of Jacob's solution, such as proper handling of (duplicate/interior) references. But it's a perfect fit for my own class of problems, so I currently can't justify putting work into this either.
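To make that concrete, here is a rough sketch of how code against that layer might look. parseJSONStream and the read[...]/skip[...] family are from [2], but the exact signatures used here (readObject with a key delegate, readArray with an element delegate, readDouble) are assumptions for illustration, not the documented API:

    import stdx.data.json.parser;

    double sumPrices(string text)
    {
        auto nodes = parseJSONStream(text);
        double total = 0;

        // Walk the top-level object, consuming only the one field we
        // care about and skipping everything else without building a DOM.
        nodes.readObject((string key) {
            if (key == "prices")   // field name invented for the example
                nodes.readArray({ total += nodes.readDouble(); });
            else
                nodes.skipValue();
        });

        return total;
    }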
Jun 23 2015
next sibling parent reply "Laeeth Isharc" <laeethnospam nospamlaeeth.com> writes:
On Tuesday, 23 June 2015 at 14:06:38 UTC, Sönke Ludwig wrote:
 As I understand it, there is a gap between what you can currently
 do with std.json (and indeed vibe.d's JSON) and what you can do
 with stdx.data.json.  And the capability falls short of what can
 be done in other standard libraries such as Python's.

 So since we are going for a nuclear-power-station-included
 approach, does that not mean that we need to specify what this
 layer should do, and somebody should start to work on it?
 One thing, which I consider the most important missing building block, is Jacob's anticipated std.serialization module [1]*. Skipping the data representation layer and going straight for statically typed access to the data is the way to go in a language such as D, at least in most situations.
Thanks, Sönke. I appreciate your taking the time to reply, and I hope I represented my understanding of things correctly.

I think things often get stuck in limbo because people don't know what's most useful, so I do think a central list of "things that need to be done" in the D ecosystem might be nice, if it doesn't become excessively structured and bureaucratic. (I ain't volunteering to maintain it, as I can't commit to it.)

Thing is, there are different use cases. For example, I pull data from Quandl - the metadata is standard and won't change in format often, but the data for a particular series will. For example, if I pull volatility data, that will have different fields from price or economic data. And I don't know beforehand the total set of possibilities. This must be quite a common use case, and indeed I just hit another one recently with a poorly-documented internal corporate database for securities.

Maybe it's fine to generate the static typing in response to reading the data, but then it ought (ultimately) to be easy to do so. Because otherwise you hack something up in Python because it's just easier, and that hack job becomes the basis for something larger than you ever intended or wanted, and it's never worth rewriting given the other stuff you need. But even if you prefer static typing generated on the fly (which maybe becomes useful via introspection a la Alexandrescu's talk), sometimes one will prefer dynamic typing, and since it's easy to provide in a way that doesn't destroy the elegance and coherence of the whole project, why not give people the option?

It seems to me that Guido painted a target on Python by saying "it's fast enough, and you are usually I/O etc. bound", because the numerical computing people have different needs. So BLAS and the like may be part of that, but having something like pandas - and the ability to get data in and out of it - would also be an important part of making it easy and fun to use D for this purpose, and it's not so hard to do, just a fair bit of work. Not that it makes sense to undergo a death march to duplicate Python functionality, but there are some things that are relatively easy and have a high payoff - like John Colvin's pydmagic.

(The link here, which may not be so obvious, is that pandas is in a way a kind of replacement for a spreadsheet, and being able to just pull stuff in without minding your 'p's and 'q's to get a quick result lends itself to the kind of iterative exploration that keeps spreadsheets overused even today. And that's the link to JSON and (de)serialization.)
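Coming back to generating the static typing from the data itself: as a toy sketch (using today's std.json; the "Volatility" name and the field names are invented for illustration), one could derive a struct declaration from a sample object like this:

    import std.json : JSONValue, JSON_TYPE, parseJSON;
    import std.stdio : writefln, writeln;

    // Print a D struct declaration inferred from one sample JSON object.
    void printStructFor(string name, JSONValue sample)
    {
        writefln("struct %s {", name);
        foreach (key, val; sample.object)
        {
            string t;
            switch (val.type)
            {
                case JSON_TYPE.STRING:  t = "string"; break;
                case JSON_TYPE.INTEGER: t = "long";   break;
                case JSON_TYPE.FLOAT:   t = "double"; break;
                case JSON_TYPE.TRUE:
                case JSON_TYPE.FALSE:   t = "bool";   break;
                default:                t = "JSONValue"; break; // nested/unknown
            }
            writefln("    %s %s;", t, key);
        }
        writeln("}");
    }

    void main()
    {
        printStructFor("Volatility",
            parseJSON(`{"date":"2015-06-23","implied_vol":17.5,"contracts":1200}`));
    }

A real tool would need to handle nesting and look at more than one sample, but even this much would remove a layer of friction.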
 Another part is a high level layer on top of the stream parser 
 that exists for a while (albeit with room for improvement), but 
 that I forgot to update the documentation for. I've now caught 
 up on that and it can be found under [2] - see the read[...] 
 and skip[...] functions.
Thank you for the link.
 Do you, or anyone else, have further ideas for higher level 
 functionality, or any concrete examples in other standard 
 libraries?
Will think it through and try to come up with some simple examples. Paging John Colvin and Russell Winder, too.
 * Or any other suitable replacement, if that doesn't work out 
 for some reason. The vibe.data.serialization module to me is 
 not a suitable candidate as it stands, because it lacks some 
 features of Jacob's solution, such as proper handling of 
 (duplicate/interior) references. But it's a perfect fit for my 
 own class of problems, so I currently can't justify to put work 
 into this either.
Is it worth you or someone else articulating what it does well that is missing from stdx.data.json?
Jun 23 2015
parent reply Jacob Carlborg <doob me.com> writes:
On 23/06/15 21:22, Laeeth Isharc wrote:

 Thing is there are different use cases.  For example, I pull data from
 Quandl - the metadata is standard and won't change in format often; but
 the data for a particular series will.  For example if I pull volatility
 data that will have different fields to price or economic data.  And I
 don't know beforehand the total set of possibilities.  This must be
 quite a common use case, and indeed I just hit another one recently with
 a poorly-documented internal corporate database for securities.
If the data can change between calls or is not consistent, my serialization library is not a good fit. But if the data is consistent and only changes over time, something like once a month, my serialization library could work, provided you update the data structures when the data changes. It can also work with optional fields if custom serialization is used.

--
/Jacob Carlborg
Jun 24 2015
parent reply "Laeeth Isharc" <nospamlaeeth nospam.laeeth.com> writes:
On Wednesday, 24 June 2015 at 13:15:52 UTC, Jacob Carlborg wrote:
 On 23/06/15 21:22, Laeeth Isharc wrote:

 Thing is, there are different use cases.  For example, I pull data
 from Quandl - the metadata is standard and won't change in format
 often, but the data for a particular series will.  For example, if
 I pull volatility data, that will have different fields from price
 or economic data.  And I don't know beforehand the total set of
 possibilities.  This must be quite a common use case, and indeed I
 just hit another one recently with a poorly-documented internal
 corporate database for securities.
 If the data can change between calls or is not consistent, my serialization library is not a good fit. But if the data is consistent and only changes over time, something like once a month, my serialization library could work, provided you update the data structures when the data changes. It can also work with optional fields if custom serialization is used.
Thanks, Jacob.

Some series shouldn't change too often. On the other hand, just with Quandl that is 10 million data series, taken from a whole range of different sources, some of them rather unfinished, and it's hard to know. My needs are not relevant for the library, except that I think people often want to explore new data sets iteratively (over the course of weeks and months). Of course it doesn't take long to write the struct (or make something that will write it given the data and some guidance), but that's one more layer of friction.

So from the perspective of D succeeding, I would think giving people the option of using static or dynamic typing as they prefer would pay off - within a coherent framework, so not using one library here and another there, when in other language ecosystems this is not fragmented.

I don't know if you have looked at pandas and the IPython notebook much. But now that one can call D code from the IPython notebook (again, a 'trivial' piece of glue, but ingenious - removing this small friction makes getting work done much easier), maybe having the option of dynamic types with JSON will have more value. See here, as one simple example:
http://nbviewer.ipython.org/gist/wesm/4757075/PandasTour.ipynb

So it would be nice to be able to do something like Adam Ruppe does here:
https://github.com/adamdruppe/arsd/blob/master/jsvar.d

    var j = json!q{
        "hello": {
            "data":[1,2,"giggle",4]
        },
        "world":20
    };

    writeln(j.hello.data[2]);

Obviously the scope is outside a serialization library, but I'm just thinking about the broader integrated and coherent library offering we should have.
Jun 24 2015
next sibling parent Jacob Carlborg <doob me.com> writes:
On 24/06/15 15:48, Laeeth Isharc wrote:

 So it would be nice to be able to something like Adam Ruppe does here:
 https://github.com/adamdruppe/arsd/blob/master/jsvar.d

 var j = json!q{
          "hello": {
              "data":[1,2,"giggle",4]
          },
          "world":20
      };

      writeln(j.hello.data[2]);

 Obviously the scope is outside a serialization library, but just
 thinking about the broader integrated and coherent library offering we
 should have.
I understand, and I agree it would be nice to have.

--
/Jacob Carlborg
Jun 24 2015
prev sibling parent reply Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 24.06.2015 at 15:48, Laeeth Isharc wrote:
 So it would be nice to be able to something like Adam Ruppe does here:
 https://github.com/adamdruppe/arsd/blob/master/jsvar.d

 var j = json!q{
          "hello": {
              "data":[1,2,"giggle",4]
          },
          "world":20
      };

      writeln(j.hello.data[2]);

 Obviously the scope is outside a serialization library, but just
 thinking about the broader integrated and coherent library offering we
 should have.
This is very close to what I had done initially for vibe.d's "Json" struct. However, this approach requires adding opDispatch with an unbounded input domain. This in turn means that *any* change of the normal members in the Json struct is a potential silent breaking change. In particular, it then de facto becomes impossible to add new methods.

Another issue that has come up is that such a struct passes all kinds of duck-typing tests, so that for example it was considered to be an input range when it really isn't. This can be an issue for things like a serialization library that doesn't include a special case for this type.

Finally, although this is partially a matter of taste, I personally found that using the member access syntax can lead to the wrong (subconscious) impression that these members were statically declared. I suspect that this makes it more likely that bugs caused by missing field existence checks slip into the source, as well as making it more difficult for the developer to detect typos (member access *looks* like normal statically checked code, so it's easy to overlook typos there).

For these reasons, the code with the proposed JSONValue [1] becomes a little more verbose, requiring index-based access instead:

    auto j = parseJSONValue(q{
        "hello": {
            "data":[1,2,"giggle",4]
        },
        "world":20
    });

    writeln(j["hello"]["data"][2]);

There is also a method to safely (without causing exceptions) iterate down a path within the JSON DOM, when parts of the path might be missing:

    writeln(j.opt("hello", "data")[2]);

JSONValue is backed by a std.variant.Algebraic, which has the advantage of getting the operators for free. It also means that JSONValue will automatically be compatible with other similar value types, such as a potential BSONValue (which has more types to choose from).

[1]: http://s-ludwig.github.io/std_data_json/stdx/data/json/value/JSONValue.html
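To make the duck-typing problem concrete, here is a minimal reduction (not vibe.d's actual Json code) showing how an unbounded opDispatch plus a bool conversion fools compile-only member checks:

    struct Var
    {
        // Unbounded member access: any name resolves and yields another Var.
        Var opDispatch(string name, Args...)(Args args) { return Var.init; }

        // A JSON variant type typically converts to bool (e.g. for `if (v)`).
        bool opCast(T : bool)() const { return false; }
    }

    // empty, front and popFront all "exist" via opDispatch, so duck-typed
    // checks that only test whether the range primitives compile (as
    // std.range's isInputRange did at the time) are satisfied, even though
    // Var is not a range at all:
    static assert(__traits(compiles, {
        Var v;
        if (v.empty) {}
        auto f = v.front;
        v.popFront();
    }));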
Jun 24 2015
parent Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 25.06.2015 at 07:52, Sönke Ludwig wrote:
 (...)
      auto j = parseJSONValue(q{
Should have been "toJSONValue".
Jun 24 2015
prev sibling parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 06/23/2015 04:06 PM, Sönke Ludwig wrote:
 
 Do you, or anyone else, have further ideas for higher level
 functionality, or any concrete examples in other standard libraries?
Being able to lazily foreach over elements would be nice.

    foreach (elem; nodes.readArray) {
        // each elem would be a bounded node stream (range)
        foreach (key, value; elem.readObject) {
        }
    }
Jun 24 2015
parent Sönke Ludwig <sludwig rejectedsoftware.com> writes:
On 24.06.2015 at 23:50, Martin Nowak wrote:
 On 06/23/2015 04:06 PM, Sönke Ludwig wrote:
 Do you, or anyone else, have further ideas for higher level
 functionality, or any concrete examples in other standard libraries?
 Being able to lazily foreach over elements would be nice.

     foreach (elem; nodes.readArray) {
         // each elem would be a bounded node stream (range)
         foreach (key, value; elem.readObject) {
         }
     }
An initial version of readArray is up for discussion:
https://github.com/s-ludwig/std_data_json/blob/3efc0600b4f8598dd6ccf897d6140d3351b5ee84/source/stdx/data/json/parser.d#L955

Unfortunately it is @system, because the reference to the input range gets escaped. The "VR" struct also has to be non-copyable, to avoid its "depth" field getting out of sync when it gets copied around.
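For illustration, a hypothetical reduction of the copying problem (names invented; this is not the actual code behind the link above):

    // Two copies sharing one parser would each track their own depth and
    // silently desynchronize as soon as one of them is advanced.
    struct BoundedNodeStream(Input)
    {
        private Input* m_input;   // escaped reference to the input -> @system
        private size_t m_depth = 1;

        @disable this(this);      // copying would fork m_depth while both
                                  // copies still advance the same *m_input

        bool empty() const { return m_depth == 0; }

        // front/popFront would read nodes from *m_input, incrementing
        // m_depth at object/array start nodes and decrementing at end nodes.
    }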
Jun 25 2015