
digitalmars.D - Range interface for std.serialization

reply Jacob Carlborg <doob me.com> writes:
After reading the review thread for std.serialization I've been trying 
to figure out what a range interface for std.serialization could look 
like. There have been several suggestions on how to implement the range 
interface and I feel that I really don't know what the best choice 
would be.

As far as I can see, there are two parts of the package where it makes 
sense to support the range APIs: the serializer (Serializer) and the 
archives (Archive).

Let's start with the archive and its output, used for serializing. One 
idea is to have the current method "data" return an input range.

Alternative AO1 (Archive Output 1):

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);
serializer.serialize(new Object);

auto inputRange = archive.data;

This is pretty straightforward, and the returned range can later be used 
to write to disk or whatever the user chooses.

If this alternative is chosen, how should the range for the XmlArchive 
work? Currently the archive returns a string; should the range just 
wrap the string and step through it character by character? That doesn't 
sound very efficient.
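One way to avoid a per-character range, sketched below with plain Phobos (this is an illustration, not part of the proposed std.serialization API): if the archive already holds its result as one string, "data" could hand out fixed-size slices of it instead of single characters.

```d
// Sketch: expose an existing string buffer as a range of chunks rather
// than a range of characters. Uses only std.range/std.algorithm.
import std.range : chunks;
import std.algorithm : joiner, equal;

void main()
{
    string serialized = "<object id=\"0\"><int>4</int></object>";
    // work on bytes to sidestep string auto-decoding
    auto bytes = cast(const(ubyte)[]) serialized;

    // a range of slices, each up to 8 bytes long - consumers writing to
    // disk or a socket get reasonably sized pieces
    auto byChunk = bytes.chunks(8);

    // flattening the chunks gives back the original data
    assert(byChunk.joiner.equal(bytes));
}
```

The chunk size would presumably be tunable; the point is only that the element type of the returned range need not be a single character.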



Alternative AO2:

Another idea is for the archive to use an output range, with this interface:

auto archive = new XmlArchive!(char);
archive.writeTo(outputRange);

auto serializer = new Serializer(archive);
serializer.serialize(new Object);

Use the output range when the serialization is done.

This raises the same question as with the input range: should I put 
characters into the range one by one?
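For what it's worth, Phobos' `put` primitive already lets a sink accept both forms, so the archive could flush whole slices when it has them. A minimal sketch:

```d
// Sketch: std.range.primitives.put forwards a whole string to a
// char-accepting output range, so per-character puts are not forced.
import std.array : appender;
import std.range.primitives : put;

void main()
{
    auto sink = appender!string();
    put(sink, '<');        // character by character works...
    put(sink, "object/");  // ...but whole slices avoid per-char overhead
    put(sink, '>');
    assert(sink.data == "<object/>");
}
```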



Now we come to the input for the archive, used for deserializing. I think 
the only alternative is an input range. I guess this is pretty 
straightforward. The archive needs to take the range as a template 
parameter to be able to store the range.

A problem with this (actually I don't know if it's considered a problem) 
is that the following won't be possible:

auto archive = new XmlArchive!(InputRange);
archive.data = archive.data;

which is what one would usually expect from an OO API. The problem here 
is that the archive is typed on the original input range, but the range 
returned from "data" is of a different type.
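One hedged way around the typing problem would be to type-erase the range with std.range.interfaces, so the archive always stores the same interface type regardless of the concrete range passed in. The `HypotheticalArchive` class below is an illustration only, not the real XmlArchive:

```d
// Sketch: type-erasing the stored range makes "data = data" type-check.
import std.range.interfaces : InputRange, inputRangeObject;
import std.algorithm : equal;

class HypotheticalArchive // stand-in for the archive, illustration only
{
    InputRange!dchar data; // one fixed type, whatever range comes in
}

void main()
{
    auto archive = new HypotheticalArchive;
    archive.data = inputRangeObject("<int>1</int>");

    // re-assignment from the same property now type-checks
    archive.data = archive.data;

    assert(archive.data.equal("<int>1</int>"));
}
```

The cost is a virtual call per element, which may or may not be acceptable here.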



Then we come to the serializer. For the input to the serializer there 
are a couple of alternatives:

Alternative SI1:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);
serializer.serialize(new Object);

The serializer can accept any type and will just serialize it. This is 
the current approach.

Alternative SI2:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);
serializer.serialize([1,2,3,4,5].stride(2).take(2));

The serializer recognizes input ranges and treats them differently. A 
range could be serialized in a couple of different ways:

* Serialize it as an array
* Serialize the range as if the "serialize" method had been called 
multiple times
* Invent a new structure and serialize it as a range
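Whichever option is picked, detecting the range case is straightforward with `isInputRange`. A sketch, where the bodies are placeholders for the real archiving work:

```d
// Sketch of SI2: branch on isInputRange inside "serialize".
// Only the trait test is real Phobos; the bodies are placeholders.
import std.range : isInputRange, iota;

void serialize(T)(T value)
{
    static if (isInputRange!T)
    {
        // e.g. walk the range and serialize each element
        foreach (e; value) { /* archive element e */ }
    }
    else
    {
        /* archive value as a single object */
    }
}

void main()
{
    static assert( isInputRange!(typeof(iota(1, 5))));
    static assert(!isInputRange!int);

    serialize(iota(1, 5)); // takes the range branch
    serialize(42);         // takes the single-object branch
}
```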

Alternative SI3:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);
[1,2,3,4,5].stride(2).take(2).copy(serializer);

The serializer can be an output range by implementing a "put" method. I 
guess this has the same problem as alternative SI2: how should it 
serialize the range?
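A small sketch of what SI3 would mean in practice: a type with a templated `put` automatically satisfies `isOutputRange` for every element type it accepts, so `std.algorithm.copy` can feed it. `SketchSerializer` here is a toy stand-in, counting puts instead of archiving:

```d
// Sketch of SI3: any type with a suitable "put" is a valid output range.
import std.range.primitives : isOutputRange;
import std.algorithm : copy;

class SketchSerializer // toy stand-in for the real serializer
{
    int count;

    void put(T)(T value)
    {
        ++count; // a real implementation would archive "value" here
    }
}

void main()
{
    static assert(isOutputRange!(SketchSerializer, int));
    static assert(isOutputRange!(SketchSerializer, string));

    auto s = new SketchSerializer;
    copy([1, 2, 3, 4, 5], s); // the class reference lets copy mutate it
    assert(s.count == 5);
}
```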



For the output of the serializer (deserializing) I'm not sure it makes 
sense to return a range, because you need to tell the serializer what 
root type to return:

Alternative SO1:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);

serializer.serialize(new Object);

auto object = serializer.deserialize!(Object)(data);

This is the current interface.



Alternative SO2:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);

serializer.serialize(new Object);

auto range = serializer.deserialize!(?)(data);

If the serializer returns a range, what type should be used in place of 
the question mark?
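One hedged answer to the question mark: the caller still names the *element* type, and gets a lazy range of that type back. Sketched below with a toy line-based format (one integer per line) standing in for a real archive; `deserializeRange` is a hypothetical API, not part of the package:

```d
// Sketch: deserialize into a lazy range of a caller-named element type.
import std.algorithm : map, splitter, equal;
import std.conv : to;

auto deserializeRange(T, R)(R input) // hypothetical API
{
    // toy format: one value per line
    return input.splitter('\n').map!(line => line.to!T);
}

void main()
{
    auto objects = deserializeRange!int("1\n2\n3");
    assert(objects.equal([1, 2, 3]));
}
```

This sidesteps naming the range type itself, but it still requires knowing the element type, which is the crux of the problem discussed below.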



Conclusion:

As far as I can see there are many alternatives, and I don't know which 
one is best.

-- 
/Jacob Carlborg
Aug 21 2013
next sibling parent reply "Dicebot" <public dicebot.lv> writes:
My 5 cents:

On Wednesday, 21 August 2013 at 18:45:48 UTC, Jacob Carlborg 
wrote:
 If this alternative is chosen how should the range for the 
 XmlArchive work like? Currently the archive returns a string, 
 should the range just wrap the string and step through 
 character by character? That doesn't sound very effective.

It should be a range of strings - one call to popFront should serialize one object from the input object range and provide the matching string buffer.
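Dicebot's "range of strings" idea can be sketched with plain Phobos `map`: each front is one serialized object, produced lazily. The lambda below is a hypothetical stand-in for whatever the archive does per object:

```d
// Sketch: lazily map an object range to a range of serialized strings.
import std.algorithm : map, equal;
import std.conv : text;

void main()
{
    auto objects = [1, 2, 3];

    // each popFront serializes exactly one object
    auto serialized = objects.map!(o => text("<int>", o, "</int>"));

    assert(serialized.equal(["<int>1</int>", "<int>2</int>", "<int>3</int>"]));
}
```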
 Alternative AO2:

 Another idea is the archive is an output range, having this 
 interface:

 auto archive = new XmlArchive!(char);
 archive.writeTo(outputRange);

 auto serializer = new Serializer(archive);
 serializer.serialize(new Object);

 Use the output range when the serialization is done.

I can't imagine a use case for this. Adding ranges just because you can is not very good :)
 A problem with this, actually I don't know if it's considered a 
 problem, is that the following won't be possible:

 auto archive = new XmlArchive!(InputRange);
 archive.data = archive.data;

What should this snippet do?
 Which one would usually expect from an OO API. The problem here 
 is that the archive is typed for the original input range but 
 the returned range from "data" is of a different type.

Range-based algorithms don't assign ranges. Transferring data from one range to another is done via copy(sourceRange, destRange) and similar tools.
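What `copy(sourceRange, destRange)` looks like in practice: the source is any input range, the destination any output range. Here an appender stands in for the archive's input side:

```d
// Sketch: transferring data between ranges with std.algorithm.copy.
import std.algorithm : copy;
import std.array : appender;

void main()
{
    auto source = "<object id=\"0\"/>";
    auto dest = appender!string();

    copy(source, dest); // reads from the input range, puts into the sink

    assert(dest.data == "<object id=\"0\"/>");
}
```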
 ... snip

It looks like the difficulties come from your initial assumption that one call to serialize/deserialize implies one object - in that model ranges are hardly useful. I don't think it is a reasonable restriction. What is practically useful is (de)serialization of a large list of objects lazily - and that is a natural job for ranges.
Aug 21 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-21 22:21, Dicebot wrote:

 It should be range of strings - one call to popFront should serialize
 one object from input object range and provide matching string buffer.

How should nesting be handled for a format like XML? Example:

class Bar
{
    int a;
}

class Foo
{
    int b;
    Bar bar;
}

Currently the following XML is produced when serializing Foo:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
    <int key="b" id="1">4</int>
    <object runtimeType="main.Bar" type="main.Bar" key="bar" id="2">
        <int key="a" id="3">3</int>
    </object>
</object>

If I shouldn't return the whole object, Foo, how can we know that when the string for Bar is returned it should actually be nested inside Foo?
 I can't imagine a use case for this. Adding ranges just because you can
 is not very good :)

Ok.
 What this snippet should do?

That was just a dummy snippet to set the data. This is a slightly better example:

auto archive = new XmlArchive!(string);
auto serializer = new Serializer(archive);
serializer.serialize(new Object);
writeToFile("foo.xml", archive.data);

Now I want to deserialize the data back:

archive.data = readFromFile("foo.xml");
// Error, cannot convert ReadFromFileRange to string
 Range-based algorithms don't assign ranges. Transferring data from one
 range to another is done via copy(sourceRange, destRange) and similar
 tools.

So how should the API look for setting the data used when deserializing? Like this?

auto data = readFromFile("foo.xml");
auto archive = new XmlArchive!(string);
copy(data, archive.data);
 It looks like difficulties come from your initial assumption that one
 call to serialize/deserialize implies one object - in that model ranges
 hardly are useful. I don't think it is a reasonable restriction. What is
 practically useful is (de)serialization of large list of objects lazily
 - and this is a natural job for ranges.

It depends on how you look at it. Currently it's only possible to serialize a single object with a single call to "serialize". So if you want to serialize multiple objects you do as you would normally do in your code: use an array, a linked list or similar. An array is still a single object, though it contains multiple objects; that is handled perfectly fine.

The question is if a range should be treated as multiple objects, and not as a single object (which it really is). How should it be serialized?

* Something like an array, resulting in this XML:

<array type="int" length="5" key="0" id="0">
    <int key="0" id="1">1</int>
    <int key="1" id="2">2</int>
    <int key="2" id="3">3</int>
    <int key="3" id="4">4</int>
    <int key="4" id="5">5</int>
</array>

* Or like calling "serialize" multiple times, resulting in this XML:

<int key="0" id="0">1</int>
<int key="1" id="1">2</int>
<int key="2" id="2">3</int>
<int key="3" id="3">4</int>
<int key="4" id="4">5</int>

* Or as a single object: then it would actually serialize the struct/class representing the range.

And the most important question: how should ranges be deserialized? One has to tell the serializer what type to return, otherwise it won't work. But the whole point of ranges is that you shouldn't need to know the type. Sometimes you cannot even name the type, i.e. Voldemort types.

-- 
/Jacob Carlborg
Aug 22 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-08-22 16:52, Dicebot wrote:

 Is there a reason arrays need to be serialized as (1), not (2)?

For arrays, one advantage is that I can allocate the whole array at once instead of appending, since the length of the array is serialized.
 I'd expect any input-range compliant data to be serialized as (2) and lazy.
 That allows you to use deserializer as a pipe over some sort of
 network-based string feed to get a potentially infinite input range of
 deserialized objects.

I still don't know how I would deserialize a range. I need to know the type to deserialize, not just the interface. -- /Jacob Carlborg
Aug 22 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-22 05:13, Tyler Jameson Little wrote:

 I don't like this because it still caches the whole object into memory.
 In a memory-restricted application, this is unacceptable.

It needs to store all serialized reference types, otherwise it cannot properly serialize a complete object graph. We don't want duplicates. Example:

The following code:

auto bar = new Bar;
bar.a = 3;

auto foo = new Foo;
foo.a = bar;
foo.b = bar;

Is serialized as:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
    <object runtimeType="main.Bar" type="main.Bar" key="a" id="1">
        <int key="a" id="2">3</int>
    </object>
    <reference key="b">1</reference>
</object>

"foo.b" is serialized just as a reference, not as the complete object, because that has already been serialized. The serializer needs to keep track of that.
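The bookkeeping described above can be sketched as a map from object identity to the id an object was first serialized under. The names below are hypothetical, not the real std.serialization internals:

```d
// Sketch: track already-serialized references by object identity.
class Bar { int a; }

struct RefTracker // hypothetical helper
{
    size_t[Object] seen; // object identity -> id of first serialization
    size_t nextId;

    // returns true (and the old id) if this object was already serialized
    bool lookup(Object o, out size_t id)
    {
        if (auto p = o in seen) { id = *p; return true; }
        id = seen[o] = nextId++;
        return false;
    }
}

void main()
{
    auto bar = new Bar;
    RefTracker t;
    size_t id;
    assert(!t.lookup(bar, id) && id == 0); // first time: emit full object
    assert( t.lookup(bar, id) && id == 0); // later: emit <reference>0</reference>
}
```

Note this table is only needed while one object graph is being serialized; it does not force the serialized *output* to stay in memory.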
 I think one call to popFront should release part of the serialized
 object. For example:

 struct B {
      int c, d;
 }

 struct A {
      int a;
      B b;
 }

 The JSON output of this would be:

      {
          a: 0,
          b: {
              c: 0,
              d: 0
          }
      }

 There's no reason why the serializer can't output this in chunks:

 Chunk 1:

      {
          a: 0,

 Chunk 2:

          b: {

 Etc...

It seems hard to keep track of nesting. I can't see how pretty printing would work using this technique.
 This is just a read-only property, which arguably doesn't break
 misconceptions. There should be no reason to assign directly to a range.

How should I set the data used for deserializing?
 I agree that (de)serializing a large list of objects lazily is
 important, but I don't think that's the natural interface for a
 Serializer. I think that each object should be lazily serialized instead
 to maximize throughput.

 If a Serializer is defined as only (de)serializing a single object, then
 serializing a range of Type would be as simple as using map() with a
 Serializer (getting a range of Serialize). If the allocs are too much,
 then the same serializer can be used, but serialize one-at-a-time.

 My main point here is that data should be written as it's being
 serialized. In a networked application, it may take a few packets to
 encode a larger object, so the first packets should be sent ASAP.

 As usual, feel free to destroy =D

Again, how does one keep track of nesting in formats like XML, JSON and YAML? -- /Jacob Carlborg
Aug 22 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-08-22 16:16, Tyler Jameson Little wrote:

 Right, but it doesn't need to keep the serialized data in memory.

No, exactly.
 Can't you just keep a counter? When you enter anything that would
 increase the indentation level, increment the indentation level. When
 leaving, decrement. At each level, insert whitespace equal to
 indentationLevel * whitespacePerLevel. This seems pretty trivial, unless
 I'm missing something.

That sounds like it would require quite some work. Currently I'm using std.xml.Document.toString.
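For what it's worth, the counter-based pretty printing quoted above can be sketched in a few lines; all names here are illustrative only:

```d
// Sketch: an indentation counter for streaming pretty-printed output.
import std.array : Appender, replicate;

struct PrettyWriter // illustrative, not part of std.serialization
{
    Appender!string sink;
    int level;

    void open(string tag)  { line("<" ~ tag ~ ">");  ++level; }
    void close(string tag) { --level; line("</" ~ tag ~ ">"); }

    void line(string s)
    {
        sink.put("    ".replicate(level)); // indent by current nesting level
        sink.put(s);
        sink.put("\n");
    }
}

void main()
{
    PrettyWriter w;
    w.open("object");
    w.line("<int>4</int>");
    w.close("object");
    assert(w.sink.data == "<object>\n    <int>4</int>\n</object>\n");
}
```

Whether this beats reusing std.xml's formatting is a separate question, but it shows the counter itself is little code.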
 Also, I didn't check, but it turns off pretty-printing by default, right?

No, not currently, see above.
 How about passing it in with a function? Each range passed this way
 would represent a single object, so the current
 deserialize!Foo(InputRange) would work the same way it does now.

The archive needs to store it somehow, so pass it in the constructor? -- /Jacob Carlborg
Aug 22 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-22 17:33, Johannes Pfau wrote:

 The reason is simple: In serialization it is not common to post-process
 the serialized data as far as I know.

Perhaps compression or encryption. -- /Jacob Carlborg
Aug 22 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-08-22 19:41, Johannes Pfau wrote:

 But compression or encryption are usually implemented as OutputRanges /
 OutputStreams.

Ok, I didn't know that. -- /Jacob Carlborg
Aug 22 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-22 17:49, Dicebot wrote:

 I was thinking about this but do we have any way to express non-monotone
 range in D? Variant seems only option, it implies that any format used
 by Archiver must always store type information though.

I'm wondering how interesting this really is to support. I basically only serialize a single object, which is the start of an object graph or an array.
 After some thinking I come to conclusion that is simply a matter of two
 `data` ranges - one parametrized by output type and other "raw". Latter
 than can output stuff in string chunks of undefined size (as small as
 serialization implementation allows). Does that help?

Do you mean that only the archive should handle ranges and the serializer shouldn't? -- /Jacob Carlborg
Aug 22 2013
prev sibling parent reply "Daniel Murphy" <yebblies nospamgmail.com> writes:
"Dicebot" <public dicebot.lv> wrote in message 
news:niufnloijwvjifusgisn forum.dlang.org...
 On Thursday, 22 August 2013 at 17:39:19 UTC, Johannes Pfau wrote:
 Yes, but the important point is that Serializer is _not_ an InputRange
 of serialized data. Instead it _uses_ a OutputRange / Stream
 internally.

Shame on me. I completely misunderstood you and thought you wanted to make the serializer an OutputRange itself. Your examples make a lot of sense and I do agree it is a use case worth supporting. I need some more time to imagine how that may impact the API in general.

It seems to me that if you give the serializer a 'put' method, it _will_ be a valid output range.
Aug 25 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-Aug-2013 12:20, Daniel Murphy wrote:
 "Dicebot" <public dicebot.lv> wrote in message
 news:niufnloijwvjifusgisn forum.dlang.org...
 On Thursday, 22 August 2013 at 17:39:19 UTC, Johannes Pfau wrote:
 Yes, but the important point is that Serializer is _not_ an InputRange
 of serialized data. Instead it _uses_ a OutputRange / Stream
 internally.

Shame on me. I completely misunderstood you and thought you wanted to make the serializer an OutputRange itself. Your examples make a lot of sense and I do agree it is a use case worth supporting. I need some more time to imagine how that may impact the API in general.

It seems to me that if you give serializer a 'put' method, it _will_ be a valid output range.

The serializer is an output range for pretty much anything (that is serializable). Literally, isOutputRange!(Serializer, T) would be true for a whole lot of types T, making it possible to dump any range of Ts via copy. Just make its put method work on a variety of types and you have it. -- Dmitry Olshansky
Aug 25 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-Aug-2013 23:15, Dicebot wrote:
 On Sunday, 25 August 2013 at 08:36:40 UTC, Dmitry Olshansky wrote:
 Same thoughts here.
 Serializer is an output range for pretty much anything (that is
 serializable). Literally isOutputRange!T would be true for a whole lot
 of things, making it possible to dumping any ranges of Ts via copy.
 Just make its put method work on a variety of types and you have it.

Can't it be both an OutputRange itself and provide an InputRange via the `serialize` call (for filters & similar pipe processing)?

I see that you potentially want to, say, compress serialized data on the fly via some range-based compressor. Or send it over the network... with some byChunk(favoriteBufferSize), or rather some kind of adapter that outputs no less than X bytes if not at the end. Then indeed it becomes awkward to model a 'sink' kind of range, as it is a transformer (no matter how convenient it makes putting stuff into it).

It looks like the serializer has 2 "ends" - one accepts any element type, the other produces ubyte[] chunks. A problem is how to connect that output end; more precisely, this puts our "ranges are the pipeline" idea into an awkward situation. Basically, on the front end data may arrive in various chunks, and ditto on the output. More than that, it isn't just an input range translated to ubyte[] (at least that'd be very ineffective and restrictive). But I have an idea.

With all that said, I get to the main point, hopefully. Here is an example far simpler than serialization. No matter how we look at this, there has to be a way to connect 2 sinks. Say I want to do:

// can only use an output range with it
formattedWrite(compressor, "Hey, %s !\n", name);

And have said compressor use LZMA on the data that is put into it - but it has to go somewhere. Thus the problem of, say, compressing formatted text is not solved by an input range, nor is the filtering of said text before 'put'-ing it somewhere. What's lacking is a way to connect a sink to another sink. My view of it is:

auto app = appender!(ubyte[])();
// thus compression is an output range wrapper
auto compressor = compress!LZMA(app);

In other words, an output range could be a filter, or rather a forwarder of the transformed result to some other output range. And we need this not only for serialization (though formattedWrite can arguably be seen as serialization) but any time we have to turn heterogeneous input into homogeneous output and post-process THAT output.
TL;DR: Simply put - make serialization an output range, and set an example by making the archiver the first output range adapter. Adapting the code by Jacob (Alternative AO2):

auto archiver = new XmlArchive!(char)(outputRange);
auto serializer = new Serializer(archiver);
serializer.put(new Object);
serializer.put([1, 2, 3, 4]); // mix and match stuff as you see fit

And even:

copy(iota(1, 10), serializer);

Would all work just fine.

-- 
Dmitry Olshansky
Aug 25 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-25 22:50, Dmitry Olshansky wrote:

 Adapting the code by Jacob (Alternative AO2)

 auto archiver = new XmlArchive!(char)(outputRange);
 auto serializer = new Serializer(archiver);
 serializer.put(new Object);
 serializer.put([1, 2, 3, 4]); //mix and match stuff as you see fit

 And even
 copy(iota(1, 10), serializer);

 Would all work just fine.

I'm still worried about how to get deserialized objects out, especially if they are serialized as a range. -- /Jacob Carlborg
Aug 26 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
26-Aug-2013 11:07, Jacob Carlborg wrote:
 On 2013-08-25 22:50, Dmitry Olshansky wrote:

 Adapting the code by Jacob (Alternative AO2)

 auto archiver = new XmlArchive!(char)(outputRange);
 auto serializer = new Serializer(archiver);
 serializer.put(new Object);
 serializer.put([1, 2, 3, 4]); //mix and match stuff as you see fit

 And even
 copy(iota(1, 10), serializer);

 Would all work just fine.

I'm still worried about how to get out deserialized objects, especially if they are serialized as a range.

An array or any container should do for a range.

I'm not 100% sure what kind of interface to use, but Serializer and Deserializer should not be "shipped in one package" as in one class. The two mirror each other but in essence are always used separately. Ditto for archiver/unarchiver: they simply provide different functionality and it makes no sense to reuse the same object in 2 ways.

Hence alternative 1 (unwrapping that snippet backwards):

// BTW why go new & classes here(?)
auto unarchiver = new XmlUnarchiver(someCharRange);
auto deserializer = new Deserializer(unarchiver);
auto obj = deserializer.unpack!Object;

// for a sequence/array in the underlying format it would use any container
List!int list = deserializer.unpack!(List!int);
int[] arr = deserializer.unpack!(int[]);

IMO looks quite nice. The problem of how exactly a container should be filled is open though. So another alternative, being more generic (the above could be considered convenience over this one):

Vector!int ints;
deserializer.unpackRange!(int)(x => ints.pushBack(x));

Basically, unpack the next sequence of data (as serialized) by feeding it to an output range, using the element type as a parameter. And a simple lambda qualifies as an output range.

Also take a look at the new digest API. I have an understanding that serialization would do well to take the same general strategy - concrete archivers as structs + a polymorphic interface and wrappers on top.

I'm still missing something about the separation of archiver and serializer, but in my mind these are tightly coupled and may as well be one entity. One tough little thing to take care of in std.serialization is how to reduce the amount of constant overhead (indirections, function calls, branches etc.) per item. Polymorphism is easily achieved on top of a fast and tight core; the other way around is impossible.

-- 
Dmitry Olshansky
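The claim that "a simple lambda qualifies as an output range" is real Phobos behavior: `std.range.primitives.put` accepts any callable that takes the element, so a delegate can serve as the sink for unpacked elements. A small demonstration:

```d
// Demonstration: a delegate is a valid output range in Phobos.
import std.range.primitives : isOutputRange, put;

void main()
{
    int[] collected;
    auto sink = (int x) { collected ~= x; };

    static assert(isOutputRange!(typeof(sink), int));

    put(sink, 1);
    put(sink, 2);
    assert(collected == [1, 2]);
}
```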
Aug 26 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-26 11:23, Dmitry Olshansky wrote:

 Array or any container should do for a range.

But then it won't be lazy. Or perhaps that's not a problem, since the whole deserialization should be lazy.
 I'm not 100% sure what kind of interface to use, but Serializer and
 Deserializer should not be "shipped in one package" as in one class.
 The two a mirror each other but in essence are always used separately.
 Ditto about archiver/unarchiver they simply provide different
 functionality and it makes no sense to reuse the same object in 2 ways.

 Hence alternative 1 (unwrapping that snippet backwards):

 //BTW why go new & classes here(?)

The reason to have classes is that I need reference types. I need to pass the serializer to the "toData" and "fromData" methods that can be implemented on the objects being (de)serialized. I guess they could take the argument by ref. Is it possible to force that?
 auto unarchiver = new XmlUnarchiver(someCharRange);
 auto deserialzier = new Deserializer(unarchiver);
 auto obj = deserializer.unpack!Object;

 //for sequence/array in underlying format it would use any container
 List!int list = deserializer.unpack!(List!int);
 int[] arr = deserializer.unpack!(int[]);

 IMO looks quite nice. The problem of how exactly should a container be
 filled is open though.

 So another alternative being more generic (the above could be consider
 convenience over this one):

 Vector!int ints;
 deserilaizer.unpackRange!(int)(x => ints.pushBack(x));

 Basically unpack next sequence of data (as serialized) by feeding it to
 an output range using element type as param. And a simple lambda
 qualifies as an output range.

Here we have yet another suggestion for an API. The whole reason for this thread is that people weren't happy with the current interface, i.e. it's not range based. Now we have probably as many suggestions as people who have answered in this thread. I still don't know how the API should look.
 Also take a look at the new digest API. I have an understanding that
 serialization would do well to take the same general strategy - concrete
 archivers as structs + polymorphic interface and wrappers on top.

I could have a look at that.
 I'm still missing something about separation of archiver and serializer
 but in my mind these are tightly coupled and may as well be one entity.
 One tough little thing to take care of in std.serialization is how to
 reduce amount of constant overhead (indirections, function calls,
 branches etc.) per item. Polymorphism is easily achieved on top of fast
 and tight core the other way around is impossible.

-- /Jacob Carlborg
Aug 26 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
26-Aug-2013 15:23, Jacob Carlborg wrote:
 On 2013-08-26 11:23, Dmitry Olshansky wrote:

 Array or any container should do for a range.

But then it won't be lazy, or perhaps that's not a problem, since the whole deserializing should be lazy.

It's lazy but you have to put stuff somewhere. Also, the 2nd API artifact, unpackRange, allows you to just look through the data (an output range, including a lambda, can do anything). This can easily sift through, say, swaths of data, picking only "lucky numbers":

deserializer.unpackRange!(int)((x){ if(isLucky(x)) writeln(x); });
 I'm not 100% sure what kind of interface to use, but Serializer and
 Deserializer should not be "shipped in one package" as in one class.
 The two a mirror each other but in essence are always used separately.
 Ditto about archiver/unarchiver they simply provide different
 functionality and it makes no sense to reuse the same object in 2 ways.

 Hence alternative 1 (unwrapping that snippet backwards):

 //BTW why go new & classes here(?)

The reason to have classes is that I need reference types. I need to pass the serializer to "toData" and "fromData" methods that can be implemented on the objects being (de)serialized. I guess they could take the argument by ref. Is it possible to force that?

Would be interesting to do that. One way is to pass an rvalue to said function; if it accepts that, then it's not by ref. Along the lines of:

__traits(compiles, {
    T val;
    // or better, can use a dummy function
    // that returns Serializer by value
    val.toData(Serializer.init);
});

At least that works with templated stuff too.
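Making the rvalue trick concrete: if a call with an rvalue compiles, `toData` does not take the serializer by ref. The types below are toy stand-ins for illustration:

```d
// Sketch: detect whether toData takes its parameter by ref,
// using the fact that a ref parameter rejects an rvalue.
struct Serializer {}

struct ByValue { void toData(Serializer s) {} }
struct ByRef   { void toData(ref Serializer s) {} }

enum takesByValue(T) = __traits(compiles, {
    T val;
    val.toData(Serializer.init); // Serializer.init is an rvalue
});

void main()
{
    static assert( takesByValue!ByValue);
    static assert(!takesByValue!ByRef); // ref parameter rejects the rvalue
}
```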
 auto unarchiver = new XmlUnarchiver(someCharRange);
 auto deserialzier = new Deserializer(unarchiver);
 auto obj = deserializer.unpack!Object;

 //for sequence/array in underlying format it would use any container
 List!int list = deserializer.unpack!(List!int);
 int[] arr = deserializer.unpack!(int[]);

 IMO looks quite nice. The problem of how exactly should a container be
 filled is open though.

 So another alternative being more generic (the above could be consider
 convenience over this one):

 Vector!int ints;
 deserilaizer.unpackRange!(int)(x => ints.pushBack(x));

 Basically unpack next sequence of data (as serialized) by feeding it to
 an output range using element type as param. And a simple lambda
 qualifies as an output range.

Here we have yet another suggestion for an API.

It's not just yet another one. It isn't about a particular shade of color. I can explain the ifs and whys of any design decision here if there is a doubt. I don't care about names, but I see the precise semantics and there is little left to define. For instance, I see good reasons why the serializer _has_ to be an OutputRange and not an InputRange, why the archiver _has_ to take an output range or be one, and so on. Ditto on why there has to be a separation of (un)archiver and (de)serializer.
 The whole reason for
 this thread is that people weren't happy with the current interface,
 i.e. not range based. Now we got probably just as many suggestions as
 people who have answered to this thread. I still don't know hot the API
 should look like.

It's not a question of putting some ranges here, or of replacing all arrays one comes across with ranges (as much as a lot of folks would unfortunately assume). Rather it's how it could operate with them at all without sacrificing functionality, performance and ease of use. And there is not much in this tight design space that actually works. Pardon me if my tone is a bit sharp. I, like anyone else, want the best design we can get. Now that a great deal of work is done it would be a shame to present it in a bad package.
 Also take a look at the new digest API. I have an understanding that
 serialization would do well to take the same general strategy - concrete
 archivers as structs + polymorphic interface and wrappers on top.

I could have a look at that.

Aug 26 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-26 15:57, Dmitry Olshansky wrote:

 It's not just yet another. It isn't about particular shade of color. I
 can explain the ifs and whys of any design decision here if there is a
 doubt. I don't care for names but I see the precise semantics and there
 is little left to define.

Yes, please do. As I see it there are four parts of the interface that need to be solved:

1. How to get data into the serializer
2. How to get data out of the serializer
3. How to get data into the archiver
4. How to get data out of the archiver
 Pardon me if my tone is a bit sharp. I like any other want the best
 design we can get. Now that the great deal of work is done it would be a
 shame to present it in a bad package.

Yes, that's why we're having this discussion. -- /Jacob Carlborg
Aug 26 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
26-Aug-2013 18:37, Jacob Carlborg wrote:
 On 2013-08-26 15:57, Dmitry Olshansky wrote:

 It's not just yet another. It isn't about particular shade of color. I
 can explain the ifs and whys of any design decision here if there is a
 doubt. I don't care for names but I see the precise semantics and there
 is little left to define.

Yes, please do.

 As I see it there are four parts of the interface that
 need to be solved:

 1. How to get data in to the serializer
 2. How to get data out of the serializer
 3. How to get data in to the archiver
 4. How to get data out of the archiver

More a question of implementation, then. The answer to both: wrap an output range in the archiver, and make the serializer be one. As for the connection with your current API that gives away an array - just think std.array.Appender (and a multitude more ways to chew the data).

Looking at your current code in depth... finally. First things first - there should not be a key parameter aside from stuff added by the archiver itself for its internal needs. Nor is there a simple way to locate data by key afterwards (certainly not every format defines such). It would require some tagged object model, and there is no such _requirement_ in serialization. Citing a line from archive.d:

"""
There are a couple of limitations when implementing a new archive, this is due to how the serializer and the archive interface is built. Except for what this interface says explicitly an archive needs to be able to handle the following: Unarchive a value based on a key or id, regardless of where in the archive the value is located
"""

- this is impossible in the setting of serialization. Serialization is NOT about modeling the whole dataset and providing queries into it. The model of serialization is that of unix _tar_: just dump any graph of data to "TAPE" and/or restore it back. If you can do processing on the fly - bonus points (and what I push for can). This confusion and its consequences are no doubt due to building on std.xml and the interface it presents.

Second - no, not every operation has to return some piece of Data that is produced. That would be tremendously inefficient and require keeping memory references to it alive (or be unsafe, in addition to slow). Instead it just outputs something to the underlying sink:

class Serializer
{
    ...
    // now we are an output range for anything;
    // add a constraint as you see fit
    void put(T)(T value)
    {
        serializeInternal(value); // calls methods of the archiver
    }

    Archiver archiver;
}

class MyArchiver(Output) if (isOutputRange!(Output, dchar)) // or ubyte if binary
{
    ...
    this(Output sink)
    {
        this.sink = sink;
    }

    // and a method, for example
    private void archivePrimitive (T) (T value, string key, Id id)
    {
        // along the lines of this, don't take it literally -
        // I've no idea of the actual format for the tags you use
        formattedWrite(sink, "<%s>%s</%s>", id, value, id);
    }
    ...
    Output sink;
}

The user just writes e.g.:

auto app = appender!(char[])();
auto archiver = new XmlArchiver(app);
auto serializer = new Serializer(archiver);

and works with the serializer as with an output range of anything (I showed the example before). Then once the data is required, just peek at app.data and there it is. So the in-memory case is easily covered. Other sinks bring more benefits, see e.g.:

auto sink = stdout.lockingTextWriter();

And the same code now writes directly to stdout, with no worries if there is a lot of stuff to write.

....

No matter how I look at the code it needs a lot of (re-)work. For instance, the Archive type is obsessed with strings. I can't see a need for that many strings attached :)

The awful duality of Serializer results literally in:

if (mode == serializing)
    doSerializing();
else
    doDeserializing();

And the spectacular pair:

T deserialize (T) (Data data, string key = "")
{
    mode = deserializing;
    ...
}

Data serialize (T) (T value, string key = null)
{
    mode = serializing;
    ...
}

The amount of extra code executed per bit of output is remarkably high, and a hallmark of a standard library is the pay-as-you-go principle. We (collectively, as Phobos devs) have to set the baseline for performance; if it's too low we're out of the game. For example - events are cute, but do we all need them? Do we always want the overhead of checking that stuff per field written?
Instead decompose these layers, make them stackable for instance: auto serializer = new Serializer(...); auto tracingSerializer = new TracingSerializer(serializer); Or just make 2 kinds of serializers with static if on a single template parameter bool withEvents it's trivial. Then a couple of aliases would finish the job. With that I'm observe that events are attached to types/fields... hum, in such a case it needs work to make them zero-cost if absent.
 Pardon me if my tone is a bit sharp. I like any other want the best
 design we can get. Now that the great deal of work is done it would be a
 shame to present it in a bad package.

Yes, that's why we're having this discussion.

And I'm afraid it's too late or the changes are too far reaching, but let's try it.

I'm especially destroyed by (and the fact that it's a part of the interface to implement):

    void archiveEnum (bool value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (bool value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (byte value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (char value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (dchar value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (int value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (long value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (short value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (ubyte value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (uint value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (ulong value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (ushort value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (wchar value, string baseType, string key, Id id);

-- 
Dmitry Olshansky
Aug 26 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-26 18:41, Dmitry Olshansky wrote:

 More a question of implementation then.

 Answer to both of them - wrapping an output range in archiver and being
 one for serializer. As for connection with your current API that gives
 away an array - just think std.array.Appender (and a multitude more ways
 to chew the data).

Ok, thank you.
 Looking at your current code in depth... finally. I would have a problem
 starting the productive answers to the questions.

 First things first - there should not be a key parameter aside from
 stuff added by archiver itself for its internal needs. Nor is there a
 simple way to locate data by key afterwards (certainly not every format
 defines such). It would require some tagged object model and there is no
 such _requirement_ in the serialization.

I would really like the serializer not being dependent on the order of the fields of the types it's (de)serializing.
 Citing a line from archive.d
 """
 There are a couple of limitations when implementing a new archive, this
 is due* to how the serializer and the archive interface is built. Except
 for what this interface says explicitly an archive needs to be able to
 handle the following:
 Unarchive a value based on a key or id, regardless of where in the
 archive  the value is located
 """
 - this is impossible in the setting of serialization.

 Serialization is NOT about modeling the whole dataset and providing
 queries into that. A model of serialization is that of unix _tar_  just
 dump any graphs of data to "TAPE" and/or restore back. If you can do
 processing on the fly - bonus points (and what I push for can).

 This confusion and its consequences are no doubt due to building on
 std.xml and the interface it presents.

It might be due to the XML format in general but certainly not due to std.xml. The XmlArchive originally used the XML package in Tango, long before it supported std.xml.
 Second - no, not every operation has to return some piece of Data that
 is produced. It would be tremendously inefficient and require keeping
 memory references to that alive (or be unsafe in addition to slow).
 Instead it just outputs something to the underlying sink.

 class Serializer
 {
      ...
      //now we are output range for anything.
      //add constraint as you see fit
      void put(T)(T value)
      {
          serializeInternal(value);//calls methods of archiver
      }

      Archiver archiver;
 }

 class MyArchiver(Output)
      if(isOutputRange!(Output, dchar)) //or ubyte if binary
 {
      ...
      this(Output sink)
      {
          this.sink = sink;
      }

      //and a method for example
      private void archivePrimitive (T) (T value, string key, Id id)
      {
          //along the lines of this, don't take literally
          //I've no idea of the actual format for tags you use
          formattedWrite(sink, "<%s>%s</%s>", id, value, id);
      }
      ...
      Output sink;
 }

Good, thank you.
 The user just writes e.g.

 auto app = appender!(char[])();
 auto archiver = new XmlArchiver(app);
 auto serializer = new Serializer(archiver);

 and works with it serializer as with output range of anything (I showed
 the example before)

 Then once the data is required just peek at app.data and there it is.
 So in-memory case is easily covered. Other sinks bring more benefits see
 e.g.:

 auto sink = stdout.lockingTextWriter();

 And the same code now writes directly to stdout and no worries if there
 is a lot of stuff to write.

Thank you for giving some concrete ideas for the API.
 ....

 No matter how I look at code it needs a lot of (re-)work.
 For instance Archive type is obsessed with strings. I can't see a need
 for that many strings attached :)

Yeah, I know. It's mainly because the archive doesn't use templates, since it needs to implement an interface.
 The awful duality of Serializer that results literally in:
 if(mode == serializing) doSerializing else do doDeserializing

 And the spectacular pair:

 T deserialize (T) (Data data, string key = "")
 {
          mode = deserializing;
      ...
 }

 Data serialize (T) (T value, string key = null)
 {
          mode = serializing;
      ...
 }

I guess that's easier to avoid if I divide Serializer into two separate parts, one for serializing and one for deserializing.
 Amount of extra code executed per bit of output is remarkably high, and
 a hallmark of standard library is pay as you go principle. We (as
 collectively Phobos devs) have to set the baseline for performance, if
 it's too low we're out of the game.

Now I think you're exaggerating a bit.

 For example - events are cute, but do we all need them? Do we always
 want an overhead of checking that stuff per field written?

Sure, there is some overhead from calling some functions, but the events are checked for at compile time so the overhead should be minimal.
 Instead decompose these layers, make them stackable for instance:

 auto serializer = new Serializer(...);
 auto tracingSerializer = new TracingSerializer(serializer);

 Or just make 2 kinds of serializers with static if on a single template
 parameter bool withEvents it's trivial. Then a couple of aliases would
 finish the job.

I don't think that will be needed. I can see if I can refactor a bit to minimize the overhead even more.
 With that I'm observe that events are attached to types/fields... hum,
 in such a case it needs work to make them zero-cost if absent.

The only cost is calling "triggerEvents" and "triggerEvent", the rest is performed at compile time.
 And I'm afraid it's too late or the changes are too far reaching but
 let's try it.

 I'm especially destroyed  by (and the fact that it's a part of interface
 to implement):

      void archiveEnum (bool value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (bool value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (byte value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (char value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (dchar value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (int value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (long value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (short value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (ubyte value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (uint value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (ulong value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (ushort value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (wchar value, string baseType, string key, Id id);

So you want templates instead?

I have read your posts, thank you for your comments. I'm planning now to:

* Split Serializer into two parts
* Make the parts structs
* Possibly provide class wrappers
* Split Archive into two parts
* Add a range interface to Serializer and Archive

-- 
/Jacob Carlborg
Aug 27 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-Aug-2013 23:23, Jacob Carlborg wrote:
 On 2013-08-26 18:41, Dmitry Olshansky wrote:
 Looking at your current code in depth... finally. I would have a problem
 starting the productive answers to the questions.

 First things first - there should not be a key parameter aside from
 stuff added by archiver itself for its internal needs. Nor is there a
 simple way to locate data by key afterwards (certainly not every format
 defines such). It would require some tagged object model and there is no
 such _requirement_ in the serialization.

I would really like the serializer not being dependent on the order of the fields of the types it's (de)serializing.

That depends on the format, and for those that have no keys or markers of any kind versioning might help here. For instance JSON/BSON could handle permutation of fields, but then it falls short of handling links e.g. pointers (maybe there is a trick to get it, but I can't think of any right away).

I suspect it would be best to somehow see archives by capabilities:

1. Rigid (most binary) - in-order, depends on the order of fields, may need to fit a scheme (in this case D types implicitly define one). Rigid archivers may also enjoy (per format, in the future) a code generator that given a scheme defines D types with a bit of CTFE+mixin.

2. Flexible - can survive reordering, is scheme-less, data defines structure etc., more easily handles versioning; e.g. XML is one.

This also neatly answers the question about scheme vs scheme-less serialization. Protocol Buffers/Thrift may be absorbed into the Rigid category if we can get the versioning right. Also, solving versioning is the last roadblock (after ranges) mentioned on the path to making this an epic addition to Phobos.

+ Some kind of capability flag (compile-time) for whether it can serialize full graphs or if the format is too limited for such. Taking that with Rigid would cover most adhoc binary formats in the wild, with Flexible it would handle some simple hierarchical formats as well.
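Such a compile-time capability flag could be sketched roughly like this. All names here (ArchiveKind, supportsReferences, the archive types) are illustrative assumptions, not the proposed API:

```d
// Illustrative sketch only - these names are assumptions, not the API.
enum ArchiveKind { rigid, flexible }

struct BinaryArchive
{
    enum kind = ArchiveKind.rigid;
    enum supportsReferences = false; // cannot serialize full graphs
}

struct XmlArchive
{
    enum kind = ArchiveKind.flexible;
    enum supportsReferences = true; // ids allow pointer back-references
}

void serializeGraph(Archive, T)(ref Archive archive, T root)
{
    // reject unsuitable formats at compile time, not at run time
    static assert(Archive.supportsReferences,
        "this archive format is too limited to serialize object graphs");
    // ... actual serialization ...
}
```

A Rigid format without reference support would then fail the static assert at compile time instead of silently producing a broken archive.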
 This confusion and its consequences are no doubt due to building on
 std.xml and the interface it presents.

It might be due to the XML format in general but certainly not due to std.xml. The XmlArchive originally used the XML package in Tango, long before it supported std.xml.

Was it DOM-ish too?
 The awful duality of Serializer that results literally in:
 if(mode == serializing) doSerializing else do doDeserializing

 And the spectacular pair:

 T deserialize (T) (Data data, string key = "")
 {
          mode = deserializing;
      ...
 }

 Data serialize (T) (T value, string key = null)
 {
          mode = serializing;
      ...
 }

I guess that's easier to avoid if I divide Serializer in to two separate parts, one for serializing and one for deserializing.

 Amount of extra code executed per bit of output is remarkably high

Now I think you're exaggerating a bit.

I've meant at least a check of 'mode' on each call to (de)serialize + some other branch-y stuff that tests overridden serializers etc. It could be a relatively new idiom to follow but there is a great value in having a lean common path aka 90% of use cases that need no extras should go the fastest route potentially at the _expense_ of *less frequent cases*. Simplified - the earlier you can elide extra work the better performance you get. To do that you may need to do double the overhead (checks) in less frequent case to remove some of it in the common case.
 a hallmark of standard library is pay as you go principle. We (as
 collectively Phobos devs) have to set the baseline for performance, if
 it's too low we're out of the game.

 For example - events are cute, but do we all need them? Do we always
 want an overhead of checking that stuff per field written?

Sure, there are some overhead of calling some functions but the events are checked for at compile time so the overhead should be minimal.

See below. I was talking namely about calling functions to see that no events are fired anyway.
 Instead decompose these layers, make them stackable for instance:

 auto serializer = new Serializer(...);
 auto tracingSerializer = new TracingSerializer(serializer);

 Or just make 2 kinds of serializers with static if on a single template
 parameter bool withEvents it's trivial. Then a couple of aliases would
 finish the job.

I don't think that will be needed. I can see if I can refactor a bit to minimize the overhead even more.

You are probably right as I note later on + there seems to be a way to elide the cost entirely if there are no events.
 With that I'm observe that events are attached to types/fields... hum,
 in such a case it needs work to make them zero-cost if absent.

The only cost is calling "triggerEvents" and "triggerEvent", the rest is performed at compile time.

Yeah, I see, but it's still a call to a delegate that's hard to inline (well, LDC/GDC might). Would it be hard to do a compile-time check if there are any events with the type in question at all and then call triggerEvent(s)?

While we are on the subject of delegates - you absolutely should use 'scope delegate', as most (all?) delegates are never stored anywhere but rather pass blocks of code to call deeper down the line. (I guess it's somewhat Ruby-style, but it's not a problem).
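The 'scope delegate' point can be sketched like this (archiveStruct is a hypothetical helper, not the actual archiver API):

```d
// Hypothetical sketch: the delegate only wraps nested serialization
// work, so marking it 'scope' promises it never escapes the call -
// the closure can then live on the caller's stack instead of the GC heap.
void archiveStruct(scope void delegate() archiveMembers, string key)
{
    // write the opening tag for `key` ...
    archiveMembers(); // run the nested serialization
    // ... write the closing tag
}

void example()
{
    int x = 42;
    // the literal captures `x`; with a scope parameter no heap
    // allocation is required for the closure
    archiveStruct((){ /* archive x here */ }, "S");
}
```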
 And I'm afraid it's too late or the changes are too far reaching but
 let's try it.

 I'm especially destroyed  by (and the fact that it's a part of interface
 to implement):

      void archiveEnum (bool value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (bool value, string baseType, string key, Id id);


 So you want templates instead?

Aye, as any faithful Phobos dev absolutely :)

Seriously though, ATM I just _suspect_ there is no need for Archive to be an interface. I would need to think this bit through more deeply, but the virtual call per field alone makes me nervous here.
 I have read your posts, thank you for your comments. I'm planning now to:

 * Split Serializer in to two parts
 * Make the parts struct
 * Possibly provide class wrappers
 * Split Archive in two parts
 * Add range interface to Serializer and Archive

Great checklist, this would help greatly. I'm glad you see the value in these changes. Feel free to nag me on the NG and personally for any deficiency you come across on the way there ;)

-- 
Dmitry Olshansky
Aug 27 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-27 22:12, Dmitry Olshansky wrote:

 I see...
 That depends on the format and for these that have no keys or markers of
 any kind versioning might help here. For instance JSON/BSON could handle
 permutation of fields, but I then it falls short of handling links e.g.
 pointers (maybe there is a trick to get it, but I can't think of any
 right away).

For pointers and reference types I'm currently serializing all fields with an id; then when there's a pointer or reference I can just do this:

<int name="foo" id="1">3</int>
<pointer name="bar">1</pointer>
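The bookkeeping behind that back-reference could be tracked with something like the following sketch (idByAddress/idFor are hypothetical names, not the library's internals):

```d
// Hypothetical sketch of the id table behind <pointer>1</pointer>:
// each serialized value is registered under an id, and a pointer to an
// already-registered value is emitted as a back-reference to that id.
size_t[const(void)*] idByAddress;
size_t nextId = 1;

// Returns the id for p; `seen` tells the caller whether to emit the
// value itself (<int id="...">) or just a back-reference (<pointer>).
size_t idFor(const(void)* p, out bool seen)
{
    if (auto existing = p in idByAddress)
    {
        seen = true;
        return *existing;
    }
    seen = false;
    return idByAddress[p] = nextId++;
}
```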
 I suspect it would be best to somehow see archives by capbilities:
 1. Rigid (most binary) - in-order, depends on the order of fields, may
 need to fit a scheme (in this cases D types implicitly define one)
 Rigid archivers may also enjoy (per format in the future) a code
 generator that given a scheme defines D types with a bit of CTFE+mixin.

 2. Flexible - can survive reordering, is scheme-less, data defines
 structure etc. easer handles versioning e.g. XML is one.

Yes, that's a good idea. In the binary archiver I'm working on I'm cheating quite a bit and relaxing the requirements made by the serializer.
 This also neatly answers the question about scheme vs scheme-less
 serialization. Protocol buffers/Thrift may be absorbed into Rigid
 category if we can get the versioning right. Also solving versioning is
 the last roadblock (after ranges) mentioned on the path to making this
 an epic addition to Phobos.

Versioning shouldn't be that hard, I think.
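One common shape for such versioning (a sketch under assumed names, not a committed design) is a per-type version number stored in the archive and branched on at deserialization:

```d
// Sketch: the type declares its current serialization version; the
// archive stores it alongside the data, and deserialization branches
// on the stored value to handle older layouts.
struct Person
{
    enum serializationVersion = 2;
    string name;
    int age; // field added in version 2
}

void deserializePerson(ref Person p, int storedVersion /*, archive... */)
{
    // `name` existed in every version
    // p.name = readString(...);
    if (storedVersion >= 2)
    {
        // p.age = readInt(...);
    }
    // older archives simply leave `age` at its .init value
}
```

This is roughly the approach Boost Serialization takes; schema-based formats like Protocol Buffers instead solve it with tagged, optional fields.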
 + Some kind of capability flag (compile-time) if it can serialize full
 graphs or if the format is to limited for such. Taking that with Rigid
 would cover most adhoc binary formats in the wild, with Flexible it
 would handle some simple hierarchical formats as well.

Sounds like a good idea.
 Was it DOM-ish too?

Yes.
 I've meant at least a check of 'mode' on each call to (de)serialize +
 some other branch-y stuff that tests overridden serializers etc.

 It could be a relatively new idiom to follow but there is a great value
 in having a lean common path aka 90% of use cases that need no extras
 should go the fastest route potentially at the _expense_ of *less
 frequent cases*.
 Simplified - the earlier you can elide extra work the better performance
 you get. To do that you may need to do double the overhead (checks) in
 less frequent case to remove some of it in the common case.

Yes, I understand the checking for "mode" wasn't the best approach. The internals are mostly coded to be straightforward and just work.
 See below. I was talking namely about calling functions to see that no
 events are fired anyway.

I can probably add a static-if before calling the functions.
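That static-if guard could look roughly like this. hasSerializationEvents is an assumed helper keyed on a member name purely for illustration; the library's actual event mechanism may differ:

```d
import std.traits : hasMember;

// Assumed convention for illustration: a type opts into events by
// defining an onSerialized member.
enum hasSerializationEvents(T) = hasMember!(T, "onSerialized");

struct Plain { int x; }
struct Evented { int x; void onSerialized() { /* react to serialization */ } }

void serializeValue(T)(ref T value)
{
    // ... archive the fields ...
    static if (hasSerializationEvents!T)
        value.onSerialized(); // the call is only compiled in when the hook exists
}
```

For a Plain instance no event code is emitted at all, which is the zero-cost-if-absent property being asked for.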
 Yeah, I see, but it's still a call to delegate that's hard to inline
 (well LDC/GDC might). Would it be hard to do a compile-time check if
 there are any events with the type in question at all and then call
 triggerEvent(s)?

No, I don't think so. I can also make the triggerEvents take the delegate by alias parameter, if that helps. Or inline it manually.
 While we are on the subject of delegates - you absolutely should use
 'scope delegate' as most (all?) delegates are never stored anywhere but
 rather pass blocks of code to call deeper down the line.
 (I guess it's somewhat Ruby-style, but it's not a problem).

Good idea.

The reason for the delegates is to avoid begin/end functions. This also forces the use of the API correctly. Hmm, actually it may not, since the Serializer technically is the user of the archiver API and that is already correctly implemented. The developer does need to implement the archiver API correctly, but there's nothing that stops him/her from _not_ calling the delegate. Am I overthinking this?
 Aye, as any faithful Phobos dev absolutely :)
 Seriously though ATM I just _suspect_ there is no need for Archive to be
 an interface. I would need to think this bit through more deeply but
 virtual call per field alone make me nervous here.

Originally it was using templates. One of my design goals back then was to not have to use templates. Templates force a slightly more complicated API for the user:

auto serializer = new Serializer!(XmlArchive);

Which is fine, but I'm not very happy about the API for custom serialization:

class Foo
{
    void toData (Archive) (Serializer!(Archive) serializer);
}

The user is either forced to use templates here as well, or:

class Foo
{
    void toData (Serializer!(XmlArchive) serializer);
}

... use a single type of archive. It's also possible to pass in anything as Archive. Now we have template constraints, which didn't exist back then, which make it a bit better.

About the large API to implement for an Archive, these are the criteria I had when creating the API, in order of importance:

1. Should be easy for a consumer to use
2. Should be easy for an archive implementor
3. Should be easy to implement the serializer

In this case, point 1 made it less easy for point 2. Point 2 made me push as much as possible to the serializer instead of having it in the archiver.

In the end, it's quite easy to copy-paste the API, do some search and replace, and forward methods like these:

void archiveEnum (bool value, string baseType, string key, Id id)
void archiveEnum (char value, string baseType, string key, Id id)
void archiveEnum (int value, string baseType, string key, Id id)

... to a private template method. That's what XmlArchive does:

https://github.com/jacob-carlborg/orange/blob/master/orange/serialization/archives/XmlArchive.d#L439

-- 
/Jacob Carlborg
Aug 28 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Aug-2013 11:13, Jacob Carlborg wrote:
 On 2013-08-27 22:12, Dmitry Olshansky wrote:

 I see...
 That depends on the format and for these that have no keys or markers of
 any kind versioning might help here. For instance JSON/BSON could handle
 permutation of fields, but I then it falls short of handling links e.g.
 pointers (maybe there is a trick to get it, but I can't think of any
 right away).

For pointers and reference types I currently serializing all fields with an id then when there's a pointer or reference I can just do this:

<int name="foo" id="1">3</int>
<pointer name="bar">1</pointer>

That would be tricky in JSON and quite overheadish (e.g. wrapping everything into object just in case there is a pointer there).
 I suspect it would be best to somehow see archives by capbilities:
 1. Rigid (most binary) - in-order, depends on the order of fields, may
 need to fit a scheme (in this cases D types implicitly define one)
 Rigid archivers may also enjoy (per format in the future) a code
 generator that given a scheme defines D types with a bit of CTFE+mixin.

 2. Flexible - can survive reordering, is scheme-less, data defines
 structure etc. easer handles versioning e.g. XML is one.

Yes, that's a good idea. In the binary archiver I'm working on I'm cheating quite a bit and relax the requirements made by the serializer.

Yes, instead of cheating you can just define them as different kinds. It would ease the friction and prevent some "impedance mismatch" problems.
 This also neatly answers the question about scheme vs scheme-less
 serialization. Protocol buffers/Thrift may be absorbed into Rigid
 category if we can get the versioning right. Also solving versioning is
 the last roadblock (after ranges) mentioned on the path to making this
 an epic addition to Phobos.

Versioning shouldn't be that hard, I think.

Then collect some info on how to approach this problem. See e.g. Boost Serialization, Protocol Buffers and Thrift. The key point is that it's many things to many different people.
 Was it DOM-ish too?

Yes.

That nails it. DOM isn't quite serialization but rather a hierarchical DB. BTW SQLite and other DBs may be an interesting backend for serialization (though they wouldn't have lookup until deserialization).
 Yeah, I see, but it's still a call to delegate that's hard to inline
 (well LDC/GDC might). Would it be hard to do a compile-time check if
 there are any events with the type in question at all and then call
 triggerEvent(s)?

No, I don't think so. I can also make the triggerEvents take the delegate by alias parameter, if that helps. Or inline it manually.

Great, anything to lessen the extra load.
 While we are on the subject of delegates - you absolutely should use
 'scope delegate' as most (all?) delegates are never stored anywhere but
 rather pass blocks of code to call deeper down the line.
 (I guess it's somewhat Ruby-style, but it's not a problem).

Good idea. The reasons for the delegates is to avoid begin/end functions. This also forces the use of the API correctly. Hmm, actually it may not. Since the Serializer technically is the user of the archiver API and that is already correctly implemented. The developer do need to implement the archiver API correctly, but there's nothing that stops him/her from _not_ calling the delegate. Am I over thinking this?

Seems like it; after all, library implementors should be trusted not to do truly awful things.
 Aye, as any faithful Phobos dev absolutely :)
 Seriously though ATM I just _suspect_ there is no need for Archive to be
 an interface. I would need to think this bit through more deeply but
 virtual call per field alone make me nervous here.

Originally it was using templates. One of my design goals back then was to not have to use templates. Templates forces slightly more complicated API for the user:

auto serializer = new Serializer!(XmlArchive);

Which is fine, but I'm not very happy about the API for custom serialization:

class Foo
{
    void toData (Archive) (Serializer!(Archive) serializer);
}

Rather this:

void toData(Serializer)(Serializer serializer)
    if (isSerializer!Serializer)
{
    ...
}

There is no need to even know what the archiver looks like in user code (wasn't it one of the goals of archivers?).
 The user is either forced to use templates here as well, or:

 class Foo
 {
      void toData (Serializer!(XmlArchive) serializer);
 }

The main problem would be that it can't be overridden, as templates are final.

After all of this I think Archivers are just fine as templates - the user only ever interacts with them during creation. Then it's the serializer templates that pick up the right types.

Serializers themselves, on the other hand, are present in user code and may need one common polymorphic abstract class that provides 'put' and forwards it to a set of abstract methods. All polymorphic wrappers would inherit from it. This won't prevent folks from using the templated version of toData/fromData if need be.
 ... use a single type of archive. It's also possible to pass in anything
 as Archive. Now we have template constraints, which didn't exist back
 then, make it a bit better.

 About the large API to implement for an Archive, this is the criteria I
 had when creating the API, in order of importance.

 1. Should be easy for a consumer to use
 2. Should be easy for an archive implementor
 3. Should be easy to implement the serializer

 In this case, point 1 made it less easy for point 2. Point 2 made me
 push as much as possible to the serializer instead of having it in the
 archiver.

I'd suggest to maximally hide away (Un)Archivers API from end users and as such it would be more convenient to just stay templated as it won't be seen.
 In the end, it's quite easy to copy-paste the API, do some search and
 replace and forward methods like these:

 void archiveEnum (bool value, string baseType, string key, Id id)
 void archiveEnum (char value, string baseType, string key, Id id)
 void archiveEnum (int value, string baseType, string key, Id id)

 ... to a private template method. That's what XmlArchive does:

 https://github.com/jacob-carlborg/orange/blob/master/orange/serialization/archives/XmlArchive.d#L439

-- Dmitry Olshansky
Aug 28 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Aug-2013 13:58, Dmitry Olshansky wrote:
 28-Aug-2013 11:13, Jacob Carlborg wrote:
 On 2013-08-27 22:12, Dmitry Olshansky wrote:

Rather this:

void toData(Serializer)(Serializer serializer)
    if (isSerializer!Serializer)
{
    ...
}

There is no need to even know what the archiver looks like in user code (wasn't it one of the goals of archivers?).
 The user is either forced to use templates here as well, or:

 class Foo
 {
      void toData (Serializer!(XmlArchive) serializer);
 }

The main problem would be that it can't be overridden, as templates are final.

After all of this I think Archivers are just fine as templates - the user only ever interacts with them during creation. Then it's the serializer templates that pick up the right types.

Serializers themselves, on the other hand, are present in user code and may need one common polymorphic abstract class that provides 'put' and forwards it to a set of abstract methods. All polymorphic wrappers would inherit from it.

Taking into account that you've settled on keeping Serializers as classes, just finalize all methods of a concrete serializer that is templated on the archiver (and make it a final class). Should be as simple as:

class Serializer {
    void put(T)(T item){ ... }
    //other methods per specific type
}

final class ConcreteSerializer(Archiver) : Serializer {
final:
    ...
    //use Archiver here to implement these hooks
}

Then users that use templates in their code would have concrete types; for others it quickly "decays" to the base class they use.

The boilerplate of defining a lot of methods now moves to Serializer, but there should be only one such (template) class anyway.

-- 
Dmitry Olshansky
Aug 28 2013
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-28 13:20, Dmitry Olshansky wrote:

 Taking into account that you've settled on keeping Serializers as
 classes

Not necessary.
 just finalize all methods of a concrete serializer that is
 templated on archiver (and make it a final class).

 Should be as simple as:

 class Serializer {
      void put(T)(T item){ ...}
      //other methods per specific type
 }

 final class ConcreteSerializer(Archiver) : Serializer {
 final:
      ...
      //use Archiver here to implement these hooks
 }

 Then users that use templates in their code would have concrete types,
 for others it quickly "decays" to the base class they use.

 The boilerplate of defining a lot of methods now moves to Serializer but
 there should be only one such (template) class anyway.

This is a good idea.

-- 
/Jacob Carlborg
Aug 28 2013
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-28 13:20, Dmitry Olshansky wrote:

Bumping this thread.

 Taking into account that you've settled on keeping Serializers as
 classes just finalize all methods of a concrete serializer that is
 templated on archiver (and make it a final class).

 Should be as simple as:

 class Serializer {
      void put(T)(T item){ ...}
      //other methods per specific type
 }

 final class ConcreteSerializer(Archiver) : Serializer {
 final:
      ...
      //use Archiver here to implement these hooks
 }

I'm having quite a hard time figuring out how this should work, or I'm misunderstanding what you're saying. If I understand you correctly I should do something like:

class Serializer {
    void put (T) (T item)
    {
        static if (is(T == int))
            serializeInt(item);
        ...
    }

    abstract void serializeInt (int item);
}

But if I'm doing it that way I will still have the problem of a lot of methods that need to be implemented in the archiver. Hmm, I guess it would be possible to minimize the number of methods used for built-in types. There's still a problem with user-defined types though. -- /Jacob Carlborg
Sep 24 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-Sep-2013 21:02, Jacob Carlborg пишет:
 On 2013-08-28 13:20, Dmitry Olshansky wrote:
 Taking into account that you've settled on keeping Serializers as
 classes just finalize all methods of a concrete serializer that is
 templated on archiver (and make it a final class).

 Should be as simple as:

 class Serializer {
      void put(T)(T item){ ...}
      //other methods per specific type
 }

 final class ConcreteSerializer(Archiver) : Serializer {
 final:
      ...
      //use Archiver here to implement these hooks
 }

I'm having quite a hard time figuring out how this should work, or I'm misunderstanding what you're saying. If I understand you correctly I should do something like:

class Serializer {
    void put (T) (T item)
    {
        static if (is(T == int))
            serializeInt(item);
        ...
    }

    abstract void serializeInt (int item);
}

But if I'm doing it that way I will still have the problem of a lot of methods that need to be implemented in the archiver.

If I'm correct, the archiver would have the benefit of templates and common code would be merged (so all of these methods in a concrete serializer just forward to archiver.write!int, archiver.write!uint, etc.). On the plus side of having a bunch of methods in Serializer, you need exactly one ConcreteSerializer!(Archive) that implements them. And a user-defined archiver need not even think of this, just define a single templated write (or put or whatever).
 Hmm, I guess it would be possible to minimize the number of methods used
 for built in types. There's still a problem with user defined types though.

Indeed. But it must be provided as a template in the generic serializer. The benefit is that said logic to serialize arbitrary UDTs is implemented there once and for all. The archiver is then partially relieved of it. To achieve that an archiver may need to provide some fundamental "hooks" like startStruct/endStruct (I didn't think through the exact ones). -- Dmitry Olshansky
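[Editor's note: a minimal sketch of the dispatch-plus-hooks scheme discussed in this subthread. The hook names (putInt, startStruct, endStruct) and Archiver methods (write, beginStruct) are assumptions following the thread's naming, not an actual std.serialization API.]

```d
// Templated front end dispatches to a small, fixed set of abstract
// hooks; UDT traversal is implemented once, in the base class.
abstract class Serializer
{
    void put(T)(T item)
    {
        static if (is(T == int))
            putInt(item);
        else static if (is(T == string))
            putString(item);
        else static if (is(T == struct))
        {
            startStruct(T.stringof);
            foreach (i, ref field; item.tupleof)
                put(field); // recurse; generic UDT logic lives here
            endStruct();
        }
        else
            static assert(false, "sketch only handles int/string/struct");
    }

protected:
    abstract void putInt(int item);
    abstract void putString(string item);
    abstract void startStruct(string name);
    abstract void endStruct();
}

// Exactly one concrete class per archiver; it merely forwards the
// hooks to the archiver's templated write methods.
final class ConcreteSerializer(Archiver) : Serializer
{
    private Archiver archiver;

    this(Archiver archiver) { this.archiver = archiver; }

protected:
    override void putInt(int item) { archiver.write(item); }
    override void putString(string item) { archiver.write(item); }
    override void startStruct(string name) { archiver.beginStruct(name); }
    override void endStruct() { archiver.endStruct(); }
}
```

Code that only sees the Serializer base class still gets polymorphic `put` for the built-in types, while the archiver author implements just the handful of hooks.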
Sep 24 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-09-24 22:27, Dmitry Olshansky wrote:

 Indeed. But it must be provided as a template in the generic serializer.
 The benefit is that said logic to serialize arbitrary UDTs is implemented
 there once and for all. The archiver is then partially relieved of it. To
 achieve that an archiver may need to provide some fundamental "hooks"
 like startStruct/endStruct (I didn't think through the exact ones).

Ok, that's basically how it already works. Thanks. -- /Jacob Carlborg
Sep 25 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-28 11:58, Dmitry Olshansky wrote:

 That would be tricky in JSON and quite overheadish (e.g. wrapping
 everything into object just in case there is a pointer there).

Yes.
 Yes, instead of cheating you can just define them as different kinds. It
 would ease the friction and prevent some "impedance mismatch" problems.

Yes, that's better.
 Then collect some info on how to approach this problem.
 See e.g. Boost serialziation, Protocol Buffers and Thrift.
 The key point is that it's many things to many different people.

I'll do that.
 Rather this:

 void toData(Serializer)(Serializer serializer)
      if(isSerializer!Serializer)
 {
      ...
 }

 There is no need to even know how archiver looks like for the user code
 (wasn't it one of the goals of archivers?).

Right, didn't think of using a template argument for the whole serializer.
 Serializers themselves on the other hand are present in user code and
 may need one common polymorphic abstract class that provides 'put' and
 forwards it to a set of abstract methods. All polymorphic wrappers would
 inherit from it.

 This won't prevent folks from using templated version of toData/fromData
 if need be.

That's a good idea.
 I'd suggest to maximally hide away (Un)Archivers API from end users and
 as such it would be more convenient to just stay templated as it won't
 be seen.

Yes. -- /Jacob Carlborg
Aug 28 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-27 22:12, Dmitry Olshansky wrote:

 Feel free to nag me on the NG and personally for any deficiency you come
 across on the way there ;)

About making Serializer a struct: actually I think Serializer should have reference-type semantics. I see no use case for passing a Serializer by value, although I do see the overhead of allocating a class and calling methods on it.

I do plan to add a free function for deserializing, for convenience. In that function the Serializer, if it's a class, would be allocated using emplace to make it stack-allocated. -- /Jacob Carlborg
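[Editor's note: a rough sketch of the emplace-based stack allocation Jacob describes. The free function name deserializeFrom and the deserialize signature are hypothetical, not the library's API.]

```d
import std.conv : emplace;

class Serializer
{
    // Stand-in for the real deserialization logic.
    T deserialize(T)(ubyte[] data) { return T.init; }
}

// Hypothetical convenience function: the class instance is constructed
// in a stack buffer via emplace, so the one-shot call does not touch
// the GC heap.
T deserializeFrom(T)(ubyte[] data)
{
    enum size = __traits(classInstanceSize, Serializer);
    void[size] buffer = void;
    auto serializer = emplace!Serializer(buffer[]);
    return serializer.deserialize!T(data);
}
```

The buffer lives only for the duration of the call, which is exactly the "convenience, no allocation" trade-off discussed above.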
Aug 28 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Aug-2013 12:08, Jacob Carlborg пишет:
 On 2013-08-27 22:12, Dmitry Olshansky wrote:

 Feel free to nag me on the NG and personally for any deficiency you come
 across on the way there ;)

About making Serializer a struct: actually I think Serializer should have reference-type semantics. I see no use case for passing a Serializer by value, although I do see the overhead of allocating a class and calling methods on it.

Here you are quite right... just add a factory that hides away its true origin (and the ctor as well, so it can be changed later if need be), e.g.:

auto serializer = serializerFor!(XmlArchiver)(archiver);
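[Editor's note: the factory could be as thin as the sketch below; serializerFor and ConcreteSerializer follow the names used in this thread and are not an actual Phobos API.]

```d
// Hides the archiver-parameterized concrete type behind a factory;
// callers that don't care about the concrete type can let it "decay"
// to the Serializer base class.
auto serializerFor(Archiver)(Archiver archiver)
{
    return new ConcreteSerializer!Archiver(archiver);
}

// usage:
// auto serializer = serializerFor(new XmlArchiver!char);
```

Because the concrete type is only named inside the factory, its constructor and layout can change later without breaking user code.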
 I do plan to add a free function for deserializing, for convenience. In
 that function Serializer, if it's a class, would be allocated using
 emplace to make it stack allocated.

Good idea. The API should have many layers, so that power users may keep digging to the bottom and those who just need to get the job done can do it in one stroke. -- Dmitry Olshansky
Aug 28 2013
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-27 22:12, Dmitry Olshansky wrote:

 Feel free to nag me on the NG and personally for any deficiency you come
 across on the way there ;)

I'm bumping this again with a new question. I'm thinking about how to output the data to an output range. If the output range is of type ubyte[], how should I output serialized data looking like this:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
    <int key="a" id="1">3</int>
</object>

Should I output this in one chunk, or in parts like this:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">

Then:

    <int key="a" id="1">3</int>

Then:

</object>

If the first case is chosen, I guess this data:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
    <int key="a" id="1">3</int>
</object>
<object runtimeType="main.Foo" type="main.Foo" key="1" id="2">
    <int key="a" id="3">3</int>
</object>

would be outputted in two chunks. -- /Jacob Carlborg
Oct 10 2013
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-10-10 14:13, Dicebot wrote:

 I thought the very point of output ranges is that you can output in any
 chunks that seem most convenient / efficient to you.

Outputting it like the first suggestion will be a lot more convenient, especially given how std.xml currently works, although it will probably not be as efficient. This basically means a complete object graph will be outputted in one chunk. That is, unless the top element is a range. -- /Jacob Carlborg
Oct 10 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
10-Oct-2013 11:45, Jacob Carlborg пишет:
 On 2013-08-27 22:12, Dmitry Olshansky wrote:

 Feel free to nag me on the NG and personally for any deficiency you come
 across on the way there ;)

I'm bumping this again with a new question. I'm thinking about how to output the data to an output range. If the output range is of type ubyte[], how should I output serialized data looking like this:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
    <int key="a" id="1">3</int>
</object>

Should I output this in one chunk, or in parts like this:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">

Then:

    <int key="a" id="1">3</int>

Then:

</object>

I do believe it's a very minor detail of a specific implementation of the XML archiver. Speaking of archivers in general, the main point is to try to avoid accumulating lots of data in memory if possible, and to put things as they go. This however doesn't preclude the 2nd goal: outputting as much as possible in one go (and not 2 chunks) is preferable (1 call vs 2 calls to put of the output range, etc.) as long as it doesn't harm memory usage (O(1) is OK, anything else is not) and doesn't complicate the archiver. -- Dmitry Olshansky
Oct 10 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-26 15:53, Dicebot wrote:

 In the end it is still your decision to make - people here can provide
 some input and help with technical details but it is still your library
 and your voting ;) Though I would probably suggest to value Phobos
 developer opinions more than, ugh, mine (or any other random trespasser).

Well I'm happy with the interface as it is, that's why I created it like that. But the other developers here are not, so it won't be accepted in its current state. That's why I'm asking: "what should it look like?". -- /Jacob Carlborg
Aug 26 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-26 15:42, Dicebot wrote:

 For me distinction was very natural. `(de)serializer` is something that
 takes care of D type introspection and provides it in simplified form to
 `(de)archiver` which embeds actual format knowledge. Former can get
 pretty tricky in D so it makes some sense to keep it separate.

I think he was referring to that there should be a separate deserializer from the serializer. And a separate unarchiver from the archiver. -- /Jacob Carlborg
Aug 26 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
26-Aug-2013 17:42, Dicebot пишет:
 On Monday, 26 August 2013 at 09:23:32 UTC, Dmitry Olshansky wrote:
 I'm still missing something about separation of archiver and
 serializer but in my mind these are tightly coupled and may as well be
 one entity.

For me distinction was very natural. `(de)serializer` is something that takes care of D type introspection and provides it in simplified form to `(de)archiver` which embeds actual format knowledge. Former can get pretty tricky in D so it makes some sense to keep it separate.

If that is the case then fine. Though upon seeing the Archive interface in its mortifying glory, I'm not sure about that simplifying bit ...
 I can't really add anything on ranges part of your comments - sounds
 like you have a better "big picture" anyway :)

-- Dmitry Olshansky
Aug 26 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
26-Aug-2013 00:50, Dmitry Olshansky пишет:
 25-Aug-2013 23:15, Dicebot пишет:
 On Sunday, 25 August 2013 at 08:36:40 UTC, Dmitry Olshansky wrote:
 Same thoughts here.
 Serializer is an output range for pretty much anything (that is
 serializable). Literally isOutputRange!T would be true for a whole lot
 of things, making it possible to dumping any ranges of Ts via copy.
 Just make its put method work on a variety of types and you have it.

Can't it be both OutputRange itself and provide InputRange via `serialize` call (for filters & similar pipe processing)?


 What's lacking is a way to connect a sink to another sink.

 My view of it is:
 auto app = appender!(ubyte[])();
 //thus compression is an output range wrapper
 auto compressor = compress!LZMA(app);

 In other words an output range could be a filter, or rather forwarder of
 the transformed result to some other output range. And we need this not
 only for serialization (though formattedWrite can arguably  be seen as a
 serialization) but anytime we have to turn heterogeneous input into
 homogeneous output and post-process THAT output.

On the subject of it, we can do some cool wonders by providing such adapters. Example: calculate the SHA1 hash of a message on the fly: https://gist.github.com/blackwhale/6339932

As a proof of concept it shows the power that output range adapters possess :)

Sadly it hits a bug in LockingTextWriter, namely the destructor fails on T.init (a usual oversight). Patch:

--- a/std/stdio.d
+++ b/std/stdio.d
@@ -1517,9 +1517,12 @@
 $(D Range) that locks the file and allows fast writing to it.

     ~this()
     {
-        FUNLOCK(fps);
-        fps = null;
-        handle = null;
+        if(fps)
+        {
+            FUNLOCK(fps);
+            fps = null;
+            handle = null;
+        }
     }

     this(this)

-- Dmitry Olshansky
Aug 26 2013
prev sibling next sibling parent "Tyler Jameson Little" <beatgammit gmail.com> writes:
On Wednesday, 21 August 2013 at 20:21:49 UTC, Dicebot wrote:
 My 5 cents:

 On Wednesday, 21 August 2013 at 18:45:48 UTC, Jacob Carlborg 
 wrote:
 If this alternative is chosen how should the range for the 
 XmlArchive work like? Currently the archive returns a string, 
 should the range just wrap the string and step through 
 character by character? That doesn't sound very effective.

It should be range of strings - one call to popFront should serialize one object from input object range and provide matching string buffer.

I don't like this because it still caches the whole object into memory. In a memory-restricted application, this is unacceptable. I think one call to popFront should release part of the serialized object. For example:

struct B {
    int c, d;
}

struct A {
    int a;
    B b;
}

The JSON output of this would be:

    {
        a: 0,
        b: {
            c: 0,
            d: 0
        }
    }

There's no reason why the serializer can't output this in chunks:

Chunk 1:

    {
        a: 0,

Chunk 2:

        b: {

Etc...

Most archive formats should support chunking. I realize this may be a rather large change to Orange, but I think it's a direction it should be headed.
 Alternative AO2:

 Another idea is the archive is an output range, having this 
 interface:

 auto archive = new XmlArchive!(char);
 archive.writeTo(outputRange);

 auto serializer = new Serializer(archive);
 serializer.serialize(new Object);

 Use the output range when the serialization is done.

I can't imagine a use case for this. Adding ranges just because you can is not very good :)

I completely agree.
 A problem with this, actually I don't know if it's considered 
 a problem, is that the following won't be possible:

 auto archive = new XmlArchive!(InputRange);
 archive.data = archive.data;

What should this snippet do?
 Which one would usually expect from an OO API. The problem 
 here is that the archive is typed for the original input range 
 but the returned range from "data" is of a different type.

Range-based algorithms don't assign ranges. Transferring data from one range to another is done via copy(sourceRange, destRange) and similar tools.

This is just a read-only property, which arguably doesn't break misconceptions. There should be no reason to assign directly to a range.
 It looks like difficulties come from your initial assumption 
 that one call to serialize/deserialize implies one object - in 
 that model ranges hardly are useful. I don't think it is a 
 reasonable restriction. What is practically useful is 
 (de)serialization of large list of objects lazily - and this is 
 a natural job for ranges.

I agree that (de)serializing a large list of objects lazily is important, but I don't think that's the natural interface for a Serializer. I think each object should be lazily serialized instead, to maximize throughput.

If a Serializer is defined as only (de)serializing a single object, then serializing a range of Type would be as simple as using map() with a Serializer (getting a range of serialized objects). If the allocs are too much, then the same serializer can be used, serializing one at a time.

My main point here is that data should be written as it's being serialized. In a networked application, it may take a few packets to encode a larger object, so the first packets should be sent ASAP.

As usual, feel free to destroy =D
Aug 21 2013
prev sibling next sibling parent "Tyler Jameson Little" <beatgammit gmail.com> writes:
On Thursday, 22 August 2013 at 07:16:11 UTC, Jacob Carlborg wrote:
 On 2013-08-22 05:13, Tyler Jameson Little wrote:

 I don't like this because it still caches the whole object 
 into memory.
 In a memory-restricted application, this is unacceptable.

It needs to store all serialized reference types, otherwise it cannot properly serialize a complete object graph. We don't want duplicates. Example, the following code:

auto bar = new Bar;
bar.a = 3;

auto foo = new Foo;
foo.a = bar;
foo.b = bar;

is serialized as:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
    <object runtimeType="main.Bar" type="main.Bar" key="a" id="1">
        <int key="a" id="2">3</int>
    </object>
    <reference key="b">1</reference>
</object>

For "foo.b" it just serializes a reference, not the complete object, because that has already been serialized. The serializer needs to keep track of that.

Right, but it doesn't need to keep the serialized data in memory.
 I think one call to popFront should release part of the 
 serialized
 object. For example:

 struct B {
     int c, d;
 }

 struct A {
     int a;
     B b;
 }

 The JSON output of this would be:

     {
         a: 0,
         b: {
             c: 0,
             d: 0
         }
     }

 There's no reason why the serializer can't output this in 
 chunks:

 Chunk 1:

     {
         a: 0,

 Chunk 2:

         b: {

 Etc...

It seems hard to keep track of nesting. I can't see how pretty printing using this technique would work.

Can't you just keep a counter? When you enter anything that would increase the indentation level, increment the counter; when leaving, decrement it. At each level, insert whitespace equal to indentationLevel * whitespacePerLevel. This seems pretty trivial, unless I'm missing something.

Also, I didn't check, but it turns off pretty-printing by default, right?
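[Editor's note: the counter idea might look like the sketch below; PrettyPrinter and its member names are illustrative, not Orange's actual printer.]

```d
// Tracks nesting depth while emitting chunked output; the archiver
// calls enter()/leave() on opening and closing tokens and prepends
// indent() to each line.
struct PrettyPrinter
{
    import std.array : replicate;

    size_t level;
    enum spacesPerLevel = 4;

    void enter() { ++level; }  // on <object>, '{', etc.
    void leave() { --level; }  // on the matching closing token

    // Whitespace to prepend to the current line.
    string indent() const
    {
        return " ".replicate(level * spacesPerLevel);
    }
}
```

Because the depth is the only state needed, pretty printing survives chunked output: each chunk just asks the printer for the current indentation.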
 This is just a read-only property, which arguably doesn't break
 misconceptions. There should be no reason to assign directly 
 to a range.

How should I set the data used for deserializing?

How about passing it in with a function? Each range passed this way would represent a single object, so the current deserialize!Foo(InputRange) would work the same way it does now.
 I agree that (de)serializing a large list of objects lazily is
 important, but I don't think that's the natural interface for a
 Serializer. I think that each object should be lazily 
 serialized instead
 to maximize throughput.

 If a Serializer is defined as only (de)serializing a single 
 object, then
 serializing a range of Type would be as simple as using map() 
 with a
 Serializer (getting a range of Serialize). If the allocs are 
 too much,
 then the same serializer can be used, but serialize 
 one-at-a-time.

 My main point here is that data should be written as it's being
 serialized. In a networked application, it may take a few 
 packets to
 encode a larger object, so the first packets should be sent 
 ASAP.

 As usual, feel free to destroy =D

Again, how does one keep track of nesting in formats like XML, JSON and YAML?

YAML will take a little extra care since whitespace is significant, but it should work well enough as I've described above.
Aug 22 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Thursday, 22 August 2013 at 03:13:46 UTC, Tyler Jameson Little 
wrote:
 On Wednesday, 21 August 2013 at 20:21:49 UTC, Dicebot wrote:
 It should be range of strings - one call to popFront should 
 serialize one object from input object range and provide 
 matching string buffer.

I don't like this because it still caches the whole object into memory. In a memory-restricted application, this is unacceptable.

Well, in memory-restricted applications having a large object at all is unacceptable. The rationale is that you hardly ever want a half-deserialized object. If the environment is very restrictive, smaller objects will be used anyway (a list of smaller objects).
 ...
 There's no reason why the serializer can't output this in chunks

Outputting on its own is not useful to discuss - in the pipe model, output matches input. What is the point in outputting partial chunks of a serialized object if you still need to provide it as a whole to the input?
Aug 22 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
I'll focus on part I find crucial:

On Thursday, 22 August 2013 at 07:08:28 UTC, Jacob Carlborg wrote:
 The question is if a range should be treated as multiple 
 objects, and not a single object (which it really is). How 
 should it be serialized?

 * Something like an array, resulting in this XML:

 <array type="int" length="5" key="0" id="0">
     <int key="0" id="1">1</int>
     <int key="1" id="2">2</int>
     <int key="2" id="3">3</int>
     <int key="3" id="4">4</int>
     <int key="4" id="5">5</int>
 </array>

 * Or like calling "serialize" multiple times, resulting in this 
 XML:

 <int key="0" id="0">1</int>
 <int key="1" id="1">2</int>
 <int key="2" id="2">3</int>
 <int key="3" id="3">4</int>
 <int key="4" id="4">5</int>

Is there a reason arrays need to be serialized as (1), not (2)? I'd expect any input-range compliant data to be serialized as (2), and lazily. That allows you to use the deserializer as a pipe over some sort of network-based string feed to get a potentially infinite input range of deserialized objects.
Aug 22 2013
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Thursday, 22 August 2013 at 14:48:57 UTC, Dicebot wrote:
 Outputting on its own is not useful to discuss - in pipe model 
 output matches input. What is the point in outputting partial 
 chunks of serialized object if you still need to provide it as 
 a whole to the input?

Partial chunks of serialized objects can be useful for applications that aren't immediately deserializing: E.g. sending over a network, storing to disk etc.
Aug 22 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Thursday, 22 August 2013 at 14:55:50 UTC, John Colvin wrote:
 Partial chunks of serialized objects can be useful for 
 applications that aren't immediately deserializing: E.g. 
 sending over a network, storing to disk etc.

But text I/O operates on character ranges anyway, it just uses whatever data is available:

// some imaginary stuff
InputRange!Object.serialize.until(5).copy(stdout);

`copy` will write the text buffer that matches one Object at a time. What is the point in serializing only half of a given object if it is already in memory and available?
Aug 22 2013
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 21 Aug 2013 22:21:48 +0200
schrieb "Dicebot" <public dicebot.lv>:

 
 Alternative AO2:

 Another idea is the archive is an output range, having this 
 interface:

 auto archive = new XmlArchive!(char);
 archive.writeTo(outputRange);

 auto serializer = new Serializer(archive);
 serializer.serialize(new Object);

 Use the output range when the serialization is done.

I can't imagine a use case for this. Adding ranges just because you can is not very good :)

I'm kinda confused why nobody here sees the benefits of the output range model. Most serialization libraries in other languages are implemented like that. For example, .NET:

--------
IFormatter formatter = ...
Stream stream = new FileStream(...)
formatter.Serialize(stream, obj);
stream.Close();
--------

The reason is simple: in serialization it is not common to post-process the serialized data, as far as I know. Usually it's either written to a file or sent over the network, which are perfect examples of Streams (or output ranges). Common usage is like this:

-------
auto s = FileStream;
auto serializer = Serializer(s);
serializer.serialize(1);
serializer.serialize("Hello");
foreach(value; ...)
    serializer.serialize(value);
-------

The classic way to efficiently implement this pattern is using an OutputRange/Stream. Serialization must be capable of outputting many 100MBs to a file or network without significant memory overhead.

There are two specific ways an InputRange interface can be useful. In case the serializer works as a filter for another range:

--------
auto serializer = new Serializer([1,2,3,4,5].take(3));
foreach(ubyte[] data; serializer)
--------

But InputRanges are limited to the same type for all elements; the "serialize" call isn't. Of course you can use Variant. But what about big structs? And performance matters, so the InputRange approach only works nicely if you serialize values of the same type.

The other way is if you only want to serialize one element:

--------
auto serializer = new Serializer(myobject);
foreach(ubyte[] data; serializer)
--------

It does not work well if you want to mix it with the "serialize" call:

-------
auto serializer = new Serializer();
serializer.serialize(1);
serializer.serialize("Hello");
serializer.serialize(3);
serializer.serialize(4);
foreach(ubyte[] data; serializer)
-------

Here the serializer has to cache data or the original objects until the data is processed via foreach.

If the serializer had access to an output range, the "serialize" calls could directly write to the stream without any caching. So the output-range model is clearly superior in this case.
Aug 22 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Thursday, 22 August 2013 at 15:33:07 UTC, Johannes Pfau wrote:
 The reason is simple: In serialization it is not common to 
 post-process
 the serialized data as far as I know. Usually it's either 
 written to a
 file or sent over network which are perfect examples of Streams 
 (or
 output ranges).

Hm but in this model it is file / socket which is an OutputRange, isn't it? Serializer itself just provides yet another InputRange which can be fed to target OutputRange. Am I getting this part wrong?
 But InputRanges are limited to the same type for all elements, 
 the
 "serialize" call isn't.

I was thinking about this but do we have any way to express non-monotone range in D? Variant seems only option, it implies that any format used by Archiver must always store type information though.
 Of course you can use Variant. But what about
 big structs?

After some thinking I've come to the conclusion that it is simply a matter of two `data` ranges - one parametrized by the output type and one "raw". The latter can then output stuff in string chunks of undefined size (as small as the serialization implementation allows). Does that help?
Aug 22 2013
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Thu, 22 Aug 2013 17:49:04 +0200
schrieb "Dicebot" <public dicebot.lv>:

 On Thursday, 22 August 2013 at 15:33:07 UTC, Johannes Pfau wrote:
 The reason is simple: In serialization it is not common to 
 post-process
 the serialized data as far as I know. Usually it's either 
 written to a
 file or sent over network which are perfect examples of Streams 
 (or
 output ranges).

Hm but in this model it is file / socket which is an OutputRange, isn't it? Serializer itself just provides yet another InputRange which can be fed to target OutputRange. Am I getting this part wrong?

Yes, but the important point is that Serializer is _not_ an InputRange of serialized data. Instead it _uses_ an OutputRange / Stream internally. I'll show a very simplified example:

---------------------
struct Serializer(T) //if(isOutputRange!(T, ubyte[]))
{
    private T _output;

    this(T output)
    {
        _output = output;
    }

    void serialize(T)(T data)
    {
        _output.put((cast(ubyte*)&data)[0..T.sizeof]);
    }
}

void put(File f, ubyte[] data) //File is not an OutputRange...
{
    f.write(data);
}

void main()
{
    auto serializer = Serializer!File(stdout);
    serializer.serialize("Test");
    serializer.serialize("Hello World!");
}
---------------------

As you can see there are absolutely no memory allocations necessary. Of course in reality you'll need a fixed buffer, but there's no dynamic allocation.

Now try to implement this in an efficient way as an InputRange. Here's the skeleton:

---------------------
struct Serializer
{
    void serialize(T)(T data) {}

    bool empty() {}
    ubyte[] front;
    void popFront() {}
}

void main()
{
    auto serializer = Serializer!File(stdout);
    serializer.serialize("Test");
    serializer.serialize("Hello World!");
    foreach(ubyte[] data; serializer)
}
---------------------

How would you implement this? This can only work efficiently if Serializer wraps its InputRange or if there's only one value to serialize. But the serialize method as defined above cannot be implemented efficiently with this approach.

Now I do confess that an InputRange filter is useful. But only for specific use cases; the more common use case is directly outputting to an OutputRange, and this should be as efficient as possible. With a good design it should be possible to support both cases efficiently with the same "backends". But implementing an InputRange serializer filter will still be much more difficult than the OutputRange case (the serializer must be capable of resuming serialization at any point, as your output buffer might be full).

I'd like to make another comment about performance.
I think there are two possible usages / user groups of std.serialization:

1) The classical, heavyweight C#/Java style serialization which can serialize complete object graphs, deals with inheritance and so on.

2) The simple "Just write the JSON representation of this struct to this file" kind of usage.

For use case 2 it's important that there's as little overhead as possible. Consider this struct:

struct Song
{
    string artist;
    string title;
}

If I'd write JSON serialization manually, it would look like this:

---------
auto a = Appender!string; //or any output range
Song s;
a.put("{\n");
a.put(` "artist"="`);
a.put(song.artist);
a.put(`",\n`);
a.put(` "title"="`);
a.put(song.title);
a.put(`"\n}\n`);
---------

As you can see this code does basically nothing: no allocation, no string processing, it just copies data. But it's annoying to write this boilerplate. I'd expect a serialization lib to let me do this:

serialize!JSON(a, s);

And performance should be very close to the hand-written code written above.
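[Editor's note: the one-liner described above could, in principle, be generated from the struct's fields with compile-time introspection. A rough sketch follows; serializeJSON is a hypothetical name, and the code handles only flat structs of string fields with no escaping.]

```d
import std.range.primitives : put;

// Writes a flat struct of string fields as JSON to any output range.
// Sketch only: no escaping, no nesting, no non-string field types.
void serializeJSON(T, Sink)(ref Sink sink, T value)
    if (is(T == struct))
{
    put(sink, "{\n");
    foreach (i, ref field; value.tupleof)
    {
        // Field names are recovered at compile time, so the emitted
        // code is just a sequence of copies, as in the manual version.
        enum name = __traits(identifier, T.tupleof[i]);
        put(sink, `    "` ~ name ~ `": "`);
        put(sink, field);
        put(sink, i + 1 == T.tupleof.length ? "\"\n" : "\",\n");
    }
    put(sink, "}\n");
}
```

Usage would mirror the hand-written example: `auto a = appender!string(); serializeJSON(a, s);`.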
Aug 22 2013
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Thu, 22 Aug 2013 18:13:23 +0200
schrieb Jacob Carlborg <doob me.com>:

 On 2013-08-22 17:33, Johannes Pfau wrote:
 
 The reason is simple: In serialization it is not common to
 post-process the serialized data as far as I know.

Perhaps compression or encryption.

But compression or encryption are usually implemented as OutputRanges / OutputStreams.
Aug 22 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Thursday, 22 August 2013 at 17:39:19 UTC, Johannes Pfau wrote:
 Yes, but the important point is that Serializer is _not_ an 
 InputRange
 of serialized data. Instead it _uses_ a OutputRange / Stream
 internally.

Shame on me. I completely misunderstood you and thought you wanted to make the serializer an OutputRange itself. Your examples make a lot of sense and I do agree it is a use case worth supporting. I need some more time to imagine how that may impact the API in general.
Aug 22 2013
prev sibling next sibling parent "Tyler Jameson Little" <beatgammit gmail.com> writes:
On Thursday, 22 August 2013 at 14:48:57 UTC, Dicebot wrote:
 On Thursday, 22 August 2013 at 03:13:46 UTC, Tyler Jameson 
 Little wrote:
 On Wednesday, 21 August 2013 at 20:21:49 UTC, Dicebot wrote:
 It should be range of strings - one call to popFront should 
 serialize one object from input object range and provide 
 matching string buffer.

I don't like this because it still caches the whole object into memory. In a memory-restricted application, this is unacceptable.

Well, in memory-restricted applications having a large object at all is unacceptable. The rationale is that you hardly ever want a half-deserialized object. If the environment is very restrictive, smaller objects will be used anyway (a list of smaller objects).

It seems you and I are trying to solve two very different problems. Perhaps if I explain my use case, it'll make things clearer.

I have a server that deserializes data from a socket, processes that data, then updates internal state and sends notifications to clients (which involves serialization as well). When new clients connect, they need all of this internal state, so the easiest way to provide it is to create one large object out of all of the smaller objects:

class Widget { }

class InternalState
{
    Widget[string] widgets;
    ... other data here
}

InternalState isn't very big by itself; it just has an associative array of Widget references with some other rather small data. When serialized, however, this can get quite large. Since archive formats are orders of magnitude less efficient than in-memory stores, caching the archived version of the internal state can be prohibitively expensive.

Let's say the serialized form of the internal state is 5MB, and I have 128MB available, while 50MB or so is used by the application. This leaves about 70MB, so I can only support 14 connected clients. With a streaming serializer (per object), I'll get that 5MB down to a few hundred KB and I can support many more clients.
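The streaming idea can be sketched like this: serialize one widget at a time straight into the sink, so peak memory is bounded by one serialized object rather than the whole internal state (the inline JSON formatting here is a toy stand-in for a real per-object serializer, and `streamState` is a hypothetical name):

```d
import std.array : appender;
import std.format : formattedWrite;

class Widget
{
    string name;
    this(string name) { this.name = name; }
}

// Hypothetical streaming approach: each widget is serialized and
// flushed to the sink individually, never materializing the full
// archive of the internal state in memory.
void streamState(Sink)(ref Sink sink, Widget[string] widgets)
{
    foreach (key, w; widgets)
    {
        sink.formattedWrite!`{"key":"%s","name":"%s"}`(key, w.name);
        sink.put('\n');
    }
}

void main()
{
    auto sink = appender!string();
    Widget[string] ws;
    ws["a"] = new Widget("alpha");
    streamState(sink, ws);
    assert(sink.data == "{\"key\":\"a\",\"name\":\"alpha\"}\n");
}
```

In the server scenario the sink would be the client socket instead of an appender, so the 5MB archive never exists as a whole.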
 ...
 There's no reason why the serializer can't output this in 
 chunks

Outputting on its own is not useful to discuss - in a pipe model output matches input. What is the point in outputting partial chunks of a serialized object if you still need to provide it as a whole to the input?

This only makes sense if you are deserializing right after serializing, which is *not* a common thing to do. Also, it's much more likely to need to serialize a single object (as in a REST API, a 3D model parser [think COLLADA], or a config parser). Providing a range seems to fit only a small niche: people that need to dump the state of the system. With single-object serialization and chunked output, you can define your own range to get the same effect, but with an API as you detailed, you can't avoid memory problems without going outside std.
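"Define your own range" on top of a single-object serializer can be as simple as mapping it over a range of objects; `toJson` below is a toy stand-in for such a serializer:

```d
import std.algorithm : map;
import std.array : array;
import std.conv : to;

struct Point { int x, y; }

// Toy stand-in for a single-object serializer.
string toJson(Point p)
{
    return `{"x":` ~ p.x.to!string ~ `,"y":` ~ p.y.to!string ~ `}`;
}

void main()
{
    auto points = [Point(1, 2), Point(3, 4)];
    // A lazy range of serialized chunks, built by the user, without
    // the library itself having to expose a range-of-strings API.
    auto chunks = points.map!toJson;
    assert(chunks.array == [`{"x":1,"y":2}`, `{"x":3,"y":4}`]);
}
```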
Aug 22 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Sunday, 25 August 2013 at 08:36:40 UTC, Dmitry Olshansky wrote:
 Same thoughts here.
 Serializer is an output range for pretty much anything (that is 
 serializable). Literally isOutputRange!T would be true for a 
 whole lot of things, making it possible to dump any range 
 of Ts via copy.
 Just make its put method work on a variety of types and you 
 have it.

Can't it be both an OutputRange itself and provide an InputRange via the `serialize` call (for filters and similar pipe processing)?
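A type can indeed do both; here is a sketch of a serializer whose `put` accepts anything it can serialize (so `isOutputRange` holds) while exposing the produced text, itself an input range of characters, via `data` (`TextSerializer` and its members are illustrative names only):

```d
import std.array : Appender;
import std.conv : to;
import std.range : isOutputRange;

// Sketch: the serializer is an output range (put accepts any value
// it can serialize) and also hands out the serialized result, which
// can feed filters and similar pipe processing as an input range.
struct TextSerializer
{
    Appender!string sink;

    void put(T)(T value)
    {
        sink.put(value.to!string);
        sink.put('\n');
    }

    auto data() { return sink.data; } // a string is an input range of chars
}

// Accepts ints, strings, ... anything convertible via to!string.
static assert(isOutputRange!(TextSerializer, int));

void main()
{
    TextSerializer s;
    s.put(42);
    s.put("hello");
    assert(s.data == "42\nhello\n");
}
```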
Aug 25 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Monday, 26 August 2013 at 09:23:32 UTC, Dmitry Olshansky wrote:
 I'm still missing something about separation of archiver and 
 serializer but in my mind these are tightly coupled and may as 
 well be one entity.

For me the distinction was very natural. The `(de)serializer` is something that takes care of D type introspection and provides it in simplified form to the `(de)archiver`, which embeds the actual format knowledge. The former can get pretty tricky in D, so it makes some sense to keep it separate. I can't really add anything on the ranges part of your comments - sounds like you have a better "big picture" anyway :)
Aug 26 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Monday, 26 August 2013 at 11:23:05 UTC, Jacob Carlborg wrote:
 Here we have yet another suggestion for an API. The whole 
 reason for this thread is that people weren't happy with the 
 current interface, i.e. not range based. Now we got probably 
 just as many suggestions as people who have answered to this 
 thread. I still don't know how the API should look.

In the end it is still your decision to make - people here can provide some input and help with technical details, but it is still your library and your voting ;) Though I would probably suggest valuing Phobos developers' opinions more than, ugh, mine (or any other random trespasser's). It is quite natural that such a package has lots of potential use cases and thus different API expectations. Choosing proper aesthetics is what makes programming an art :P
Aug 26 2013
prev sibling parent "Dicebot" <public dicebot.lv> writes:
On Thursday, 10 October 2013 at 07:45:51 UTC, Jacob Carlborg 
wrote:
 ...

I thought the very point of output ranges is that you can output in any chunks that seem most convenient / efficient to you.
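That flexibility is visible in `std.range.put`, which accepts either single elements or whole slices, so a serializer can emit whatever chunk size is convenient:

```d
import std.array : appender;
import std.range : put;

void main()
{
    auto a = appender!string();
    put(a, '{');        // one character at a time...
    put(a, `"key":1`);  // ...or a whole slice in one call
    put(a, '}');
    assert(a.data == `{"key":1}`);
}
```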
Oct 10 2013