
digitalmars.D - Range interface for std.serialization

reply Jacob Carlborg <doob me.com> writes:
After reading the review thread for std.serialization I've been trying 
to figure out what a range interface for std.serialization could look 
like. There have been several suggestions on how to implement the range 
interface and I feel that I really don't know what the best choice 
would be.

As far as I can see, there are two parts of the package where it makes 
sense to support the range APIs: the serializer (Serializer) and the 
archives (Archive).

Let's start with the archive and its output, used for serializing. One 
idea is to have the current method "data" return an input range.

Alternative AO1 (Archive Output 1):

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);
serializer.serialize(new Object);

auto inputRange = archive.data;

This is pretty straightforward, and the returned range can later be used 
to write to disk or whatever the user chooses.

If this alternative is chosen, how should the range for the XmlArchive 
work? Currently the archive returns a string; should the range just 
wrap the string and step through it character by character? That doesn't 
sound very efficient.
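One way to avoid a per-character range, sketched below with plain Phobos (this is an illustration, not part of the proposed std.serialization API): if the archive already holds its result as one string, "data" could hand out fixed-size slices of it instead of single characters.

```d
// Sketch: expose an existing string buffer as a range of chunks rather
// than a range of characters. Uses only std.range/std.algorithm.
import std.range : chunks;
import std.algorithm : joiner, equal;

void main()
{
    string serialized = "<object id=\"0\"><int>4</int></object>";
    // work on bytes to sidestep string auto-decoding
    auto bytes = cast(const(ubyte)[]) serialized;

    // a range of slices, each up to 8 bytes long - consumers writing to
    // disk or a socket get reasonably sized pieces
    auto byChunk = bytes.chunks(8);

    // flattening the chunks gives back the original data
    assert(byChunk.joiner.equal(bytes));
}
```

The chunk size would presumably be tunable; the point is only that the element type of the returned range need not be a single character.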



Alternative AO2:

Another idea is for the archive to use an output range, with this interface:

auto archive = new XmlArchive!(char);
archive.writeTo(outputRange);

auto serializer = new Serializer(archive);
serializer.serialize(new Object);

Use the output range when the serialization is done.

This raises the same question as with the input range: should I put 
characters into the range one by one?
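For what it's worth, Phobos' `put` primitive already lets a sink accept both forms, so the archive could flush whole slices when it has them. A minimal sketch:

```d
// Sketch: std.range.primitives.put forwards a whole string to a
// char-accepting output range, so per-character puts are not forced.
import std.array : appender;
import std.range.primitives : put;

void main()
{
    auto sink = appender!string();
    put(sink, '<');        // character by character works...
    put(sink, "object/");  // ...but whole slices avoid per-char overhead
    put(sink, '>');
    assert(sink.data == "<object/>");
}
```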



Now we come to the input for the archive, used for deserializing. I think 
the only alternative is an input range. I guess this is pretty 
straightforward. The archive needs to take the range as a template 
parameter to be able to store the range.

A problem with this (actually I don't know if it's considered a problem) 
is that the following won't be possible:

auto archive = new XmlArchive!(InputRange);
archive.data = archive.data;

which is what one would usually expect from an OO API. The problem here 
is that the archive is typed on the original input range, but the range 
returned from "data" is of a different type.
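One hedged way around the typing problem would be to type-erase the range with std.range.interfaces, so the archive always stores the same interface type regardless of the concrete range passed in. The `HypotheticalArchive` class below is an illustration only, not the real XmlArchive:

```d
// Sketch: type-erasing the stored range makes "data = data" type-check.
import std.range.interfaces : InputRange, inputRangeObject;
import std.algorithm : equal;

class HypotheticalArchive // stand-in for the archive, illustration only
{
    InputRange!dchar data; // one fixed type, whatever range comes in
}

void main()
{
    auto archive = new HypotheticalArchive;
    archive.data = inputRangeObject("<int>1</int>");

    // re-assignment from the same property now type-checks
    archive.data = archive.data;

    assert(archive.data.equal("<int>1</int>"));
}
```

The cost is a virtual call per element, which may or may not be acceptable here.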



Then we come to the serializer. For the input to the serializer there 
are a couple of alternatives:

Alternative SI1:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);
serializer.serialize(new Object);

The serializer can accept any type and will just serialize it. This is 
the current approach.

Alternative SI2:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);
serializer.serialize([1,2,3,4,5].stride(2).take(2));

The serializer recognizes input ranges and treats them differently. A 
range could be serialized in a couple of different ways:

* Serialize it as an array
* Serialize the range as if the "serialize" method had been called 
multiple times
* Invent a new structure and serialize it as a range
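Whichever option is picked, detecting the range case is straightforward with `isInputRange`. A sketch, where the bodies are placeholders for the real archiving work:

```d
// Sketch of SI2: branch on isInputRange inside "serialize".
// Only the trait test is real Phobos; the bodies are placeholders.
import std.range : isInputRange, iota;

void serialize(T)(T value)
{
    static if (isInputRange!T)
    {
        // e.g. walk the range and serialize each element
        foreach (e; value) { /* archive element e */ }
    }
    else
    {
        /* archive value as a single object */
    }
}

void main()
{
    static assert( isInputRange!(typeof(iota(1, 5))));
    static assert(!isInputRange!int);

    serialize(iota(1, 5)); // takes the range branch
    serialize(42);         // takes the single-object branch
}
```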

Alternative SI3:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);
[1,2,3,4,5].stride(2).take(2).copy(serializer);

The serializer can be an output range by implementing a "put" method. I 
guess this has the same problem as alternative SI2: how should it 
serialize the range?
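A small sketch of what SI3 would mean in practice: a type with a templated `put` automatically satisfies `isOutputRange` for every element type it accepts, so `std.algorithm.copy` can feed it. `SketchSerializer` here is a toy stand-in, counting puts instead of archiving:

```d
// Sketch of SI3: any type with a suitable "put" is a valid output range.
import std.range.primitives : isOutputRange;
import std.algorithm : copy;

class SketchSerializer // toy stand-in for the real serializer
{
    int count;

    void put(T)(T value)
    {
        ++count; // a real implementation would archive "value" here
    }
}

void main()
{
    static assert(isOutputRange!(SketchSerializer, int));
    static assert(isOutputRange!(SketchSerializer, string));

    auto s = new SketchSerializer;
    copy([1, 2, 3, 4, 5], s); // the class reference lets copy mutate it
    assert(s.count == 5);
}
```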



For the output of the serializer (deserializing) I'm not sure it makes 
sense to return a range, because you need to tell the serializer what 
root type to return:

Alternative SO1:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);

serializer.serialize(new Object);

auto object = serializer.deserialize!(Object)(data);

This is the current interface.



Alternative SO2:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);

serializer.serialize(new Object);

auto range = serializer.deserialize!(?)(data);

If the serializer returns a range, what type should be used in place of 
the question mark?
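One hedged answer to the question mark: the caller still names the *element* type, and gets a lazy range of that type back. Sketched below with a toy line-based format (one integer per line) standing in for a real archive; `deserializeRange` is a hypothetical API, not part of the package:

```d
// Sketch: deserialize into a lazy range of a caller-named element type.
import std.algorithm : map, splitter, equal;
import std.conv : to;

auto deserializeRange(T, R)(R input) // hypothetical API
{
    // toy format: one value per line
    return input.splitter('\n').map!(line => line.to!T);
}

void main()
{
    auto objects = deserializeRange!int("1\n2\n3");
    assert(objects.equal([1, 2, 3]));
}
```

This sidesteps naming the range type itself, but it still requires knowing the element type, which is the crux of the problem discussed below.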



Conclusion:

As far as I can see there are many alternatives, and I don't know which 
one is best.

-- 
/Jacob Carlborg
Aug 21 2013
next sibling parent reply "Dicebot" <public dicebot.lv> writes:
My 5 cents:

On Wednesday, 21 August 2013 at 18:45:48 UTC, Jacob Carlborg 
wrote:
 If this alternative is chosen how should the range for the 
 XmlArchive work like? Currently the archive returns a string, 
 should the range just wrap the string and step through 
 character by character? That doesn't sound very effective.

It should be a range of strings - one call to popFront should serialize one object from the input object range and provide the matching string buffer.
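Dicebot's "range of strings" idea can be sketched with plain Phobos `map`: each front is one serialized object, produced lazily. The lambda below is a hypothetical stand-in for whatever the archive does per object:

```d
// Sketch: lazily map an object range to a range of serialized strings.
import std.algorithm : map, equal;
import std.conv : text;

void main()
{
    auto objects = [1, 2, 3];

    // each popFront serializes exactly one object
    auto serialized = objects.map!(o => text("<int>", o, "</int>"));

    assert(serialized.equal(["<int>1</int>", "<int>2</int>", "<int>3</int>"]));
}
```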
 Alternative AO2:

 Another idea is the archive is an output range, having this 
 interface:

 auto archive = new XmlArchive!(char);
 archive.writeTo(outputRange);

 auto serializer = new Serializer(archive);
 serializer.serialize(new Object);

 Use the output range when the serialization is done.

I can't imagine a use case for this. Adding ranges just because you can is not very good :)
 A problem with this, actually I don't know if it's considered a 
 problem, is that the following won't be possible:

 auto archive = new XmlArchive!(InputRange);
 archive.data = archive.data;

What should this snippet do?
 Which one would usually expect from an OO API. The problem here 
 is that the archive is typed for the original input range but 
 the returned range from "data" is of a different type.

Range-based algorithms don't assign ranges. Transferring data from one range to another is done via copy(sourceRange, destRange) and similar tools.
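What `copy(sourceRange, destRange)` looks like in practice: the source is any input range, the destination any output range. Here an appender stands in for the archive's input side:

```d
// Sketch: transferring data between ranges with std.algorithm.copy.
import std.algorithm : copy;
import std.array : appender;

void main()
{
    auto source = "<object id=\"0\"/>";
    auto dest = appender!string();

    copy(source, dest); // reads from the input range, puts into the sink

    assert(dest.data == "<object id=\"0\"/>");
}
```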
 ... snip

It looks like the difficulties come from your initial assumption that one call to serialize/deserialize implies one object - in that model ranges are hardly useful. I don't think it is a reasonable restriction. What is practically useful is (de)serialization of a large list of objects lazily - and that is a natural job for ranges.
Aug 21 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-21 22:21, Dicebot wrote:

 It should be range of strings - one call to popFront should serialize
 one object from input object range and provide matching string buffer.

How should nesting be handled for a format like XML? Example:

class Bar
{
    int a;
}

class Foo
{
    int b;
    Bar bar;
}

Currently the following XML is produced when serializing Foo:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
    <int key="b" id="1">4</int>
    <object runtimeType="main.Bar" type="main.Bar" key="bar" id="2">
        <int key="a" id="3">3</int>
    </object>
</object>

If I shouldn't return the whole object, Foo, how can we know that when the string for Bar is returned it should actually be nested inside Foo?
 I can't imagine a use case for this. Adding ranges just because you can
 is not very good :)

Ok.
 What this snippet should do?

That was just a dummy snippet to set the data. This is a slightly better example:

auto archive = new XmlArchive!(string);
auto serializer = new Serializer(archive);
serializer.serialize(new Object);
writeToFile("foo.xml", archive.data);

Now I want to deserialize the data back:

archive.data = readFromFile("foo.xml");
// Error, cannot convert ReadFromFileRange to string
 Range-based algorithms don't assign ranges. Transferring data from one
 range to another is done via copy(sourceRange, destRange) and similar
 tools.

So how should the API look for setting the data used when deserializing? Like this?

auto data = readFromFile("foo.xml");
auto archive = new XmlArchive!(string);
copy(data, archive.data);
 It looks like difficulties come from your initial assumption that one
 call to serialize/deserialize implies one object - in that model ranges
 hardly are useful. I don't think it is a reasonable restriction. What is
 practically useful is (de)serialization of large list of objects lazily
 - and this is a natural job for ranges.

It depends on how you look at it. Currently it's only possible to serialize a single object with a single call to "serialize". So if you want to serialize multiple objects you do as you would normally do in your code: use an array, a linked list or similar. An array is still a single object, though it contains multiple objects; that is handled perfectly fine.

The question is if a range should be treated as multiple objects, and not as a single object (which it really is). How should it be serialized?

* Something like an array, resulting in this XML:

<array type="int" length="5" key="0" id="0">
    <int key="0" id="1">1</int>
    <int key="1" id="2">2</int>
    <int key="2" id="3">3</int>
    <int key="3" id="4">4</int>
    <int key="4" id="5">5</int>
</array>

* Or like calling "serialize" multiple times, resulting in this XML:

<int key="0" id="0">1</int>
<int key="1" id="1">2</int>
<int key="2" id="2">3</int>
<int key="3" id="3">4</int>
<int key="4" id="4">5</int>

* Or as a single object: then it would actually serialize the struct/class representing the range.

And the most important question: how should ranges be deserialized? One has to tell the serializer what type to return, otherwise it won't work. But the whole point of ranges is that you shouldn't need to know the type. Sometimes you cannot even name the type, i.e. Voldemort types.

-- 
/Jacob Carlborg
Aug 22 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-08-22 16:52, Dicebot wrote:

 Is there a reason arrays need to be serialized as (1), not (2)?

For arrays, one advantage is that I can allocate the whole array at once instead of appending, since the length of the array is serialized.
 I'd expect any input-range compliant data to be serialized as (2) and lazy.
 That allows you to use deserializer as a pipe over some sort of
 network-based string feed to get a potentially infinite input range of
 deserialized objects.

I still don't know how I would deserialize a range. I need to know the type to deserialize, not just the interface. -- /Jacob Carlborg
Aug 22 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-22 05:13, Tyler Jameson Little wrote:

 I don't like this because it still caches the whole object into memory.
 In a memory-restricted application, this is unacceptable.

It needs to store all serialized reference types, otherwise it cannot properly serialize a complete object graph. We don't want duplicates. Example:

The following code:

auto bar = new Bar;
bar.a = 3;

auto foo = new Foo;
foo.a = bar;
foo.b = bar;

Is serialized as:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
    <object runtimeType="main.Bar" type="main.Bar" key="a" id="1">
        <int key="a" id="2">3</int>
    </object>
    <reference key="b">1</reference>
</object>

"foo.b" is serialized just as a reference, not as the complete object, because that has already been serialized. The serializer needs to keep track of that.
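The bookkeeping described above can be sketched as a map from object identity to the id an object was first serialized under. The names below are hypothetical, not the real std.serialization internals:

```d
// Sketch: track already-serialized references by object identity.
class Bar { int a; }

struct RefTracker // hypothetical helper
{
    size_t[Object] seen; // object identity -> id of first serialization
    size_t nextId;

    // returns true (and the old id) if this object was already serialized
    bool lookup(Object o, out size_t id)
    {
        if (auto p = o in seen) { id = *p; return true; }
        id = seen[o] = nextId++;
        return false;
    }
}

void main()
{
    auto bar = new Bar;
    RefTracker t;
    size_t id;
    assert(!t.lookup(bar, id) && id == 0); // first time: emit full object
    assert( t.lookup(bar, id) && id == 0); // later: emit <reference>0</reference>
}
```

Note this table is only needed while one object graph is being serialized; it does not force the serialized *output* to stay in memory.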
 I think one call to popFront should release part of the serialized
 object. For example:

 struct B {
      int c, d;
 }

 struct A {
      int a;
      B b;
 }

 The JSON output of this would be:

      {
          a: 0,
          b: {
              c: 0,
              d: 0
          }
      }

 There's no reason why the serializer can't output this in chunks:

 Chunk 1:

      {
          a: 0,

 Chunk 2:

          b: {

 Etc...

It seems hard to keep track of nesting. I can't see how pretty printing would work using this technique.
 This is just a read-only property, which arguably doesn't break
 misconceptions. There should be no reason to assign directly to a range.

How should I set the data used for deserializing?
 I agree that (de)serializing a large list of objects lazily is
 important, but I don't think that's the natural interface for a
 Serializer. I think that each object should be lazily serialized instead
 to maximize throughput.

 If a Serializer is defined as only (de)serializing a single object, then
 serializing a range of Type would be as simple as using map() with a
 Serializer (getting a range of Serialize). If the allocs are too much,
 then the same serializer can be used, but serialize one-at-a-time.

 My main point here is that data should be written as it's being
 serialized. In a networked application, it may take a few packets to
 encode a larger object, so the first packets should be sent ASAP.

 As usual, feel free to destroy =D

Again, how does one keep track of nesting in formats like XML, JSON and YAML? -- /Jacob Carlborg
Aug 22 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-08-22 16:16, Tyler Jameson Little wrote:

 Right, but it doesn't need to keep the serialized data in memory.

No, exactly.
 Can't you just keep a counter? When you enter anything that would
 increase the indentation level, increment the indentation level. When
 leaving, decrement. At each level, insert whitespace equal to
 indentationLevel * whitespacePerLevel. This seems pretty trivial, unless
 I'm missing something.

That sounds like it would require quite some work. Currently I'm using std.xml.Document.toString.
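For what it's worth, the counter-based pretty printing quoted above can be sketched in a few lines; all names here are illustrative only:

```d
// Sketch: an indentation counter for streaming pretty-printed output.
import std.array : Appender, replicate;

struct PrettyWriter // illustrative, not part of std.serialization
{
    Appender!string sink;
    int level;

    void open(string tag)  { line("<" ~ tag ~ ">");  ++level; }
    void close(string tag) { --level; line("</" ~ tag ~ ">"); }

    void line(string s)
    {
        sink.put("    ".replicate(level)); // indent by current nesting level
        sink.put(s);
        sink.put("\n");
    }
}

void main()
{
    PrettyWriter w;
    w.open("object");
    w.line("<int>4</int>");
    w.close("object");
    assert(w.sink.data == "<object>\n    <int>4</int>\n</object>\n");
}
```

Whether this beats reusing std.xml's formatting is a separate question, but it shows the counter itself is little code.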
 Also, I didn't check, but it turns off pretty-printing by default, right?

No, not currently, see above.
 How about passing it in with a function? Each range passed this way
 would represent a single object, so the current
 deserialize!Foo(InputRange) would work the same way it does now.

The archive needs to store it somehow, so pass it in the constructor? -- /Jacob Carlborg
Aug 22 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-22 17:33, Johannes Pfau wrote:

 The reason is simple: In serialization it is not common to post-process
 the serialized data as far as I know.

Perhaps compression or encryption. -- /Jacob Carlborg
Aug 22 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-08-22 19:41, Johannes Pfau wrote:

 But compression or encryption are usually implemented as OutputRanges /
 OutputStreams.

Ok, I didn't know that. -- /Jacob Carlborg
Aug 22 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-22 17:49, Dicebot wrote:

 I was thinking about this but do we have any way to express non-monotone
 range in D? Variant seems only option, it implies that any format used
 by Archiver must always store type information though.

I'm wondering how interesting this really is to support. I basically only serialize a single object, which is the start of an object graph or an array.
 After some thinking I come to conclusion that is simply a matter of two
 `data` ranges - one parametrized by output type and other "raw". Latter
 than can output stuff in string chunks of undefined size (as small as
 serialization implementation allows). Does that help?

Do you mean that only the archive should handle ranges and the serializer shouldn't? -- /Jacob Carlborg
Aug 22 2013
prev sibling parent reply "Daniel Murphy" <yebblies nospamgmail.com> writes:
"Dicebot" <public dicebot.lv> wrote in message 
news:niufnloijwvjifusgisn forum.dlang.org...
 On Thursday, 22 August 2013 at 17:39:19 UTC, Johannes Pfau wrote:
 Yes, but the important point is that Serializer is _not_ an InputRange
 of serialized data. Instead it _uses_ a OutputRange / Stream
 internally.

Shame on me. I completely misunderstood you and thought you wanted to make the serializer an OutputRange itself. Your examples make a lot of sense and I do agree it is a use case worth supporting. I need some more time to imagine how that may impact the API in general.

It seems to me that if you give the serializer a 'put' method, it _will_ be a valid output range.
Aug 25 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-Aug-2013 12:20, Daniel Murphy wrote:
 "Dicebot" <public dicebot.lv> wrote in message
 news:niufnloijwvjifusgisn forum.dlang.org...
 On Thursday, 22 August 2013 at 17:39:19 UTC, Johannes Pfau wrote:
 Yes, but the important point is that Serializer is _not_ an InputRange
 of serialized data. Instead it _uses_ a OutputRange / Stream
 internally.

Shame on me. I completely misunderstood you and thought you wanted to make the serializer an OutputRange itself. Your examples make a lot of sense and I do agree it is a use case worth supporting. I need some more time to imagine how that may impact the API in general.

It seems to me that if you give serializer a 'put' method, it _will_ be a valid output range.

The serializer is an output range for pretty much anything (that is serializable). Literally, isOutputRange!(Serializer, T) would be true for a whole lot of types T, making it possible to dump any range of Ts via copy. Just make its put method work on a variety of types and you have it. -- Dmitry Olshansky
Aug 25 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-Aug-2013 23:15, Dicebot wrote:
 On Sunday, 25 August 2013 at 08:36:40 UTC, Dmitry Olshansky wrote:
 Same thoughts here.
 Serializer is an output range for pretty much anything (that is
 serializable). Literally isOutputRange!T would be true for a whole lot
 of things, making it possible to dumping any ranges of Ts via copy.
 Just make its put method work on a variety of types and you have it.

Can't it be both an OutputRange itself and provide an InputRange via the `serialize` call (for filters & similar pipe processing)?

I see that you potentially want to, say, compress serialized data on the fly via some range-based compressor. Or send it over the network... with some byChunk(favoriteBufferSize), or rather some kind of adapter that outputs no less than X bytes if not at the end. Then indeed it becomes awkward to model a 'sink' kind of range, as it is a transformer (no matter how convenient it makes putting stuff into it).

It looks like the serializer has 2 "ends" - one accepts any element type, the other produces ubyte[] chunks. A problem is how to connect that output end; more precisely, this puts our "ranges are the pipeline" idea into an awkward situation. Basically, on the front end data may arrive in various chunks, and ditto on the output. More than that, it isn't just an input range translated to ubyte[] (at least that'd be very ineffective and restrictive). But I have an idea.

With all that said, I get to the main point, hopefully. Here is an example far simpler than serialization. No matter how we look at this, there has to be a way to connect 2 sinks. Say I want to do:

// can only use an output range with it
formattedWrite(compressor, "Hey, %s !\n", name);

And have said compressor use LZMA on the data that is put into it - but it has to go somewhere. Thus the problem of, say, compressing formatted text is not solved by an input range, nor is the filtering of said text before 'put'-ing it somewhere. What's lacking is a way to connect a sink to another sink. My view of it is:

auto app = appender!(ubyte[])();
// thus compression is an output range wrapper
auto compressor = compress!LZMA(app);

In other words, an output range could be a filter, or rather a forwarder of the transformed result to some other output range. And we need this not only for serialization (though formattedWrite can arguably be seen as serialization) but any time we have to turn heterogeneous input into homogeneous output and post-process THAT output.
TL;DR: Simply put - make serialization an output range, and set an example by making the archiver the first output range adapter. Adapting the code by Jacob (Alternative AO2):

auto archiver = new XmlArchive!(char)(outputRange);
auto serializer = new Serializer(archiver);
serializer.put(new Object);
serializer.put([1, 2, 3, 4]); // mix and match stuff as you see fit

And even:

copy(iota(1, 10), serializer);

Would all work just fine.

-- 
Dmitry Olshansky
Aug 25 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-25 22:50, Dmitry Olshansky wrote:

 Adapting the code by Jacob (Alternative AO2)

 auto archiver = new XmlArchive!(char)(outputRange);
 auto serializer = new Serializer(archiver);
 serializer.put(new Object);
 serializer.put([1, 2, 3, 4]); //mix and match stuff as you see fit

 And even
 copy(iota(1, 10), serializer);

 Would all work just fine.

I'm still worried about how to get deserialized objects out, especially if they are serialized as a range. -- /Jacob Carlborg
Aug 26 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
26-Aug-2013 11:07, Jacob Carlborg wrote:
 On 2013-08-25 22:50, Dmitry Olshansky wrote:

 Adapting the code by Jacob (Alternative AO2)

 auto archiver = new XmlArchive!(char)(outputRange);
 auto serializer = new Serializer(archiver);
 serializer.put(new Object);
 serializer.put([1, 2, 3, 4]); //mix and match stuff as you see fit

 And even
 copy(iota(1, 10), serializer);

 Would all work just fine.

I'm still worried about how to get out deserialized objects, especially if they are serialized as a range.

An array or any container should do for a range.

I'm not 100% sure what kind of interface to use, but Serializer and Deserializer should not be "shipped in one package" as in one class. The two mirror each other but in essence are always used separately. Ditto for archiver/unarchiver: they simply provide different functionality and it makes no sense to reuse the same object in 2 ways.

Hence alternative 1 (unwrapping that snippet backwards):

// BTW why go new & classes here(?)
auto unarchiver = new XmlUnarchiver(someCharRange);
auto deserializer = new Deserializer(unarchiver);
auto obj = deserializer.unpack!Object;

// for a sequence/array in the underlying format it would use any container
List!int list = deserializer.unpack!(List!int);
int[] arr = deserializer.unpack!(int[]);

IMO looks quite nice. The problem of how exactly a container should be filled is open though. So another alternative, being more generic (the above could be considered convenience over this one):

Vector!int ints;
deserializer.unpackRange!(int)(x => ints.pushBack(x));

Basically, unpack the next sequence of data (as serialized) by feeding it to an output range, using the element type as a parameter. And a simple lambda qualifies as an output range.

Also take a look at the new digest API. I have an understanding that serialization would do well to take the same general strategy - concrete archivers as structs + a polymorphic interface and wrappers on top.

I'm still missing something about the separation of archiver and serializer, but in my mind these are tightly coupled and may as well be one entity. One tough little thing to take care of in std.serialization is how to reduce the amount of constant overhead (indirections, function calls, branches etc.) per item. Polymorphism is easily achieved on top of a fast and tight core; the other way around is impossible.

-- 
Dmitry Olshansky
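The claim that "a simple lambda qualifies as an output range" is real Phobos behavior: `std.range.primitives.put` accepts any callable that takes the element, so a delegate can serve as the sink for unpacked elements. A small demonstration:

```d
// Demonstration: a delegate is a valid output range in Phobos.
import std.range.primitives : isOutputRange, put;

void main()
{
    int[] collected;
    auto sink = (int x) { collected ~= x; };

    static assert(isOutputRange!(typeof(sink), int));

    put(sink, 1);
    put(sink, 2);
    assert(collected == [1, 2]);
}
```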
Aug 26 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-26 11:23, Dmitry Olshansky wrote:

 Array or any container should do for a range.

But then it won't be lazy. Or perhaps that's not a problem, since the whole deserialization should be lazy.
 I'm not 100% sure what kind of interface to use, but Serializer and
 Deserializer should not be "shipped in one package" as in one class.
 The two a mirror each other but in essence are always used separately.
 Ditto about archiver/unarchiver they simply provide different
 functionality and it makes no sense to reuse the same object in 2 ways.

 Hence alternative 1 (unwrapping that snippet backwards):

 //BTW why go new & classes here(?)

The reason to have classes is that I need reference types. I need to pass the serializer to the "toData" and "fromData" methods that can be implemented on the objects being (de)serialized. I guess they could take the argument by ref. Is it possible to force that?
 auto unarchiver = new XmlUnarchiver(someCharRange);
 auto deserialzier = new Deserializer(unarchiver);
 auto obj = deserializer.unpack!Object;

 //for sequence/array in underlying format it would use any container
 List!int list = deserializer.unpack!(List!int);
 int[] arr = deserializer.unpack!(int[]);

 IMO looks quite nice. The problem of how exactly should a container be
 filled is open though.

 So another alternative being more generic (the above could be consider
 convenience over this one):

 Vector!int ints;
 deserilaizer.unpackRange!(int)(x => ints.pushBack(x));

 Basically unpack next sequence of data (as serialized) by feeding it to
 an output range using element type as param. And a simple lambda
 qualifies as an output range.

Here we have yet another suggestion for an API. The whole reason for this thread is that people weren't happy with the current interface, i.e. it's not range based. Now we have probably as many suggestions as people who have answered in this thread. I still don't know how the API should look.
 Also take a look at the new digest API. I have an understanding that
 serialization would do well to take the same general strategy - concrete
 archivers as structs + polymorphic interface and wrappers on top.

I could have a look at that.
 I'm still missing something about separation of archiver and serializer
 but in my mind these are tightly coupled and may as well be one entity.
 One tough little thing to take care of in std.serialization is how to
 reduce amount of constant overhead (indirections, function calls,
 branches etc.) per item. Polymorphism is easily achieved on top of fast
 and tight core the other way around is impossible.

-- /Jacob Carlborg
Aug 26 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
26-Aug-2013 15:23, Jacob Carlborg wrote:
 On 2013-08-26 11:23, Dmitry Olshansky wrote:

 Array or any container should do for a range.

But then it won't be lazy, or perhaps that's not a problem, since the whole deserializing should be lazy.

It's lazy but you have to put stuff somewhere. Also, the 2nd API artifact, unpackRange, allows you to just look through the data (an output range, including a lambda, can do anything). This can easily sift through, say, swaths of data, picking only "lucky numbers":

deserializer.unpackRange!(int)((x){ if(isLucky(x)) writeln(x); });
 I'm not 100% sure what kind of interface to use, but Serializer and
 Deserializer should not be "shipped in one package" as in one class.
 The two a mirror each other but in essence are always used separately.
 Ditto about archiver/unarchiver they simply provide different
 functionality and it makes no sense to reuse the same object in 2 ways.

 Hence alternative 1 (unwrapping that snippet backwards):

 //BTW why go new & classes here(?)

The reason to have classes is that I need reference types. I need to pass the serializer to "toData" and "fromData" methods that can be implemented on the objects being (de)serialized. I guess they could take the argument by ref. Is it possible to force that?

Would be interesting to do that. One way is to pass an rvalue to said function; if it accepts that, then it's not by ref. Along the lines of:

__traits(compiles, {
    T val;
    // or better, can use a dummy function
    // that returns Serializer by value
    val.toData(Serializer.init);
});

At least that works with templated stuff too.
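Making the rvalue trick concrete: if a call with an rvalue compiles, `toData` does not take the serializer by ref. The types below are toy stand-ins for illustration:

```d
// Sketch: detect whether toData takes its parameter by ref,
// using the fact that a ref parameter rejects an rvalue.
struct Serializer {}

struct ByValue { void toData(Serializer s) {} }
struct ByRef   { void toData(ref Serializer s) {} }

enum takesByValue(T) = __traits(compiles, {
    T val;
    val.toData(Serializer.init); // Serializer.init is an rvalue
});

void main()
{
    static assert( takesByValue!ByValue);
    static assert(!takesByValue!ByRef); // ref parameter rejects the rvalue
}
```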
 auto unarchiver = new XmlUnarchiver(someCharRange);
 auto deserialzier = new Deserializer(unarchiver);
 auto obj = deserializer.unpack!Object;

 //for sequence/array in underlying format it would use any container
 List!int list = deserializer.unpack!(List!int);
 int[] arr = deserializer.unpack!(int[]);

 IMO looks quite nice. The problem of how exactly should a container be
 filled is open though.

 So another alternative being more generic (the above could be consider
 convenience over this one):

 Vector!int ints;
 deserilaizer.unpackRange!(int)(x => ints.pushBack(x));

 Basically unpack next sequence of data (as serialized) by feeding it to
 an output range using element type as param. And a simple lambda
 qualifies as an output range.

Here we have yet another suggestion for an API.

It's not just yet another one. It isn't about a particular shade of color. I can explain the ifs and whys of any design decision here if there is a doubt. I don't care about names, but I see the precise semantics and there is little left to define. For instance, I see good reasons why the serializer _has_ to be an OutputRange and not an InputRange, why the archiver _has_ to take an output range or be one, and so on. Ditto on why there has to be a separation of (un)archiver and (de)serializer.
 The whole reason for
 this thread is that people weren't happy with the current interface,
 i.e. not range based. Now we got probably just as many suggestions as
 people who have answered to this thread. I still don't know hot the API
 should look like.

It's not a question of putting some ranges here, or of replacing all arrays one comes across with ranges (as much as a lot of folks would unfortunately assume). Rather it's how it could operate with them at all without sacrificing functionality, performance and ease of use. And there is not much in this tight design space that actually works. Pardon me if my tone is a bit sharp. I, like anyone else, want the best design we can get. Now that a great deal of work is done it would be a shame to present it in a bad package.
 Also take a look at the new digest API. I have an understanding that
 serialization would do well to take the same general strategy - concrete
 archivers as structs + polymorphic interface and wrappers on top.

I could have a look at that.

Aug 26 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-26 15:57, Dmitry Olshansky wrote:

 It's not just yet another. It isn't about particular shade of color. I
 can explain the ifs and whys of any design decision here if there is a
 doubt. I don't care for names but I see the precise semantics and there
 is little left to define.

Yes, please do. As I see it there are four parts of the interface that need to be solved:

1. How to get data into the serializer
2. How to get data out of the serializer
3. How to get data into the archiver
4. How to get data out of the archiver
 Pardon me if my tone is a bit sharp. I like any other want the best
 design we can get. Now that the great deal of work is done it would be a
 shame to present it in a bad package.

Yes, that's why we're having this discussion. -- /Jacob Carlborg
Aug 26 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
26-Aug-2013 18:37, Jacob Carlborg wrote:
 On 2013-08-26 15:57, Dmitry Olshansky wrote:

 It's not just yet another. It isn't about particular shade of color. I
 can explain the ifs and whys of any design decision here if there is a
 doubt. I don't care for names but I see the precise semantics and there
 is little left to define.

Yes, please do.

 As I see it there are four parts of the interface that
 need to be solved:

 1. How to get data in to the serializer
 2. How to get data out of the serializer
 3. How to get data in to the archiver
 4. How to get data out of the archiver

More a question of implementation, then. The answer to both: wrap an output range in the archiver, and make the serializer be one. As for the connection with your current API that gives away an array - just think std.array.Appender (and a multitude more ways to chew the data).

Looking at your current code in depth... finally. First things first - there should not be a key parameter aside from stuff added by the archiver itself for its internal needs. Nor is there a simple way to locate data by key afterwards (certainly not every format defines such). It would require some tagged object model, and there is no such _requirement_ in serialization. Citing a line from archive.d:

"""
There are a couple of limitations when implementing a new archive, this is due to how the serializer and the archive interface is built. Except for what this interface says explicitly an archive needs to be able to handle the following: Unarchive a value based on a key or id, regardless of where in the archive the value is located
"""

- this is impossible in the setting of serialization. Serialization is NOT about modeling the whole dataset and providing queries into it. The model of serialization is that of unix _tar_: just dump any graph of data to "TAPE" and/or restore it back. If you can do processing on the fly - bonus points (and what I push for can). This confusion and its consequences are no doubt due to building on std.xml and the interface it presents.

Second - no, not every operation has to return some piece of Data that is produced. That would be tremendously inefficient and require keeping memory references to it alive (or be unsafe, in addition to slow). Instead it just outputs something to the underlying sink:

class Serializer
{
    ...
    // now we are an output range for anything;
    // add a constraint as you see fit
    void put(T)(T value)
    {
        serializeInternal(value); // calls methods of the archiver
    }

    Archiver archiver;
}

class MyArchiver(Output) if (isOutputRange!(Output, dchar)) // or ubyte if binary
{
    ...
    this(Output sink)
    {
        this.sink = sink;
    }

    // and a method, for example
    private void archivePrimitive (T) (T value, string key, Id id)
    {
        // along the lines of this, don't take it literally -
        // I've no idea of the actual format for the tags you use
        formattedWrite(sink, "<%s>%s</%s>", id, value, id);
    }
    ...
    Output sink;
}

The user just writes e.g.:

auto app = appender!(char[])();
auto archiver = new XmlArchiver(app);
auto serializer = new Serializer(archiver);

and works with the serializer as with an output range of anything (I showed the example before). Then once the data is required, just peek at app.data and there it is. So the in-memory case is easily covered. Other sinks bring more benefits, see e.g.:

auto sink = stdout.lockingTextWriter();

And the same code now writes directly to stdout, with no worries if there is a lot of stuff to write.

....

No matter how I look at the code it needs a lot of (re-)work. For instance, the Archive type is obsessed with strings. I can't see a need for that many strings attached :)

The awful duality of Serializer results literally in:

if (mode == serializing)
    doSerializing();
else
    doDeserializing();

And the spectacular pair:

T deserialize (T) (Data data, string key = "")
{
    mode = deserializing;
    ...
}

Data serialize (T) (T value, string key = null)
{
    mode = serializing;
    ...
}

The amount of extra code executed per bit of output is remarkably high, and a hallmark of a standard library is the pay-as-you-go principle. We (collectively, as Phobos devs) have to set the baseline for performance; if it's too low we're out of the game. For example - events are cute, but do we all need them? Do we always want the overhead of checking that stuff per field written?
Instead decompose these layers, make them stackable for instance: auto serializer = new Serializer(...); auto tracingSerializer = new TracingSerializer(serializer); Or just make 2 kinds of serializers with static if on a single template parameter bool withEvents it's trivial. Then a couple of aliases would finish the job. With that I'm observe that events are attached to types/fields... hum, in such a case it needs work to make them zero-cost if absent.
 Pardon me if my tone is a bit sharp. I like any other want the best
 design we can get. Now that the great deal of work is done it would be a
 shame to present it in a bad package.

Yes, that's why we're having this discussion.

And I'm afraid it's too late or the changes are too far reaching, but let's try it.

I'm especially destroyed by (and the fact that it's a part of the interface to implement):

    void archiveEnum (bool value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (bool value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (byte value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (char value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (dchar value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (int value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (long value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (short value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (ubyte value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (uint value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (ulong value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (ushort value, string baseType, string key, Id id);
    /// Ditto
    void archiveEnum (wchar value, string baseType, string key, Id id);

-- 
Dmitry Olshansky
Aug 26 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-26 18:41, Dmitry Olshansky wrote:

 More a question of implementation then.

 Answer to both of them - wrapping an output range in archiver and being
 one for serializer. As for connection with your current API that gives
 away an array - just think std.array.Appender (and a multitude more ways
 to chew the data).

Ok, thank you.
 Looking at your current code in depth... finally. I would have a problem
 starting the productive answers to the questions.

 First things first - there should not be a key parameter aside from
 stuff added by archiver itself for its internal needs. Nor is there a
 simple way to locate data by key afterwards (certainly not every format
 defines such). It would require some tagged object model and there is no
 such _requirement_ in the serialization.

I would really like the serializer not being dependent on the order of the fields of the types it's (de)serializing.
 Citing a line from archive.d
 """
 There are a couple of limitations when implementing a new archive, this
 is due* to how the serializer and the archive interface is built. Except
 for what this interface says explicitly an archive needs to be able to
 handle the following:
 Unarchive a value based on a key or id, regardless of where in the
 archive  the value is located
 """
 - this is impossible in the setting of serialization.

 Serialization is NOT about modeling the whole dataset and providing
 queries into that. A model of serialization is that of unix _tar_  just
 dump any graphs of data to "TAPE" and/or restore back. If you can do
 processing on the fly - bonus points (and what I push for can).

 This confusion and its consequences are no doubt due to building on
 std.xml and the interface it presents.

It might be due to the XML format in general but certainly not due to std.xml. The XmlArchive originally used the XML package in Tango, long before it supported std.xml.
 Second - no, not every operation has to return some piece of Data that
 is produced. It would be tremendously inefficient and require keeping
 memory references to that alive (or be unsafe in addition to slow).
 Instead it just outputs something to the underlying sink.

 class Serializer
 {
      ...
      //now we are output range for anything.
      //add constraint as you see fit
      void put(T)(T value)
      {
          serializeInternal(value);//calls methods of archiver
      }

      Archiver archiver;
 }

 class MyArchiver(Output)
      if(isOutputRange!(Output, dchar)) //or ubyte if binary
 {
      ...
      this(Output sink)
      {
          this.sink = sink;
      }

      //and a method for example
      private void archivePrimitive (T) (T value, string key, Id id)
      {
          //along the lines of this, don't take literally
          //I've no idea of the actual format for tags you use
          formattedWrite(sink, "<%s>%s</%s>", id, value, id);
      }
      ...
      Output sink;
 }

Good, thank you.
 The user just writes e.g.

 auto app = appender!(char[])();
 auto archiver = new XmlArchiver(app);
 auto serializer = new Serializer(archiver);

 and works with it serializer as with output range of anything (I showed
 the example before)

 Then once the data is required just peek at app.data and there it is.
 So in-memory case is easily covered. Other sinks bring more benefits see
 e.g.:

 auto sink = stdout.lockingTextWriter();

 And the same code now writes directly to stdout and no worries if there
 is a lot of stuff to write.

Thank you for giving some concrete ideas for the API.
 ....

 No matter how I look at code it needs a lot of (re-)work.
 For instance Archive type is obsessed with strings. I can't see a need
 for that many strings attached :)

Yeah, I know. It's mainly because the archive doesn't use templates, since it needs to implement an interface.
 The awful duality of Serializer that results literally in:
 if(mode == serializing) doSerializing else do doDeserializing

 And the spectacular pair:

 T deserialize (T) (Data data, string key = "")
 {
          mode = deserializing;
      ...
 }

 Data serialize (T) (T value, string key = null)
 {
          mode = serializing;
      ...
 }

I guess that's easier to avoid if I divide Serializer into two separate parts, one for serializing and one for deserializing.
 Amount of extra code executed per bit of output is remarkably high, and
 a hallmark of standard library is pay as you go principle. We (as
 collectively Phobos devs) have to set the baseline for performance, if
 it's too low we're out of the game.

Now I think you're exaggerating a bit.

 For example - events are cute, but do we all need them? Do we always
 want an overhead of checking that stuff per field written?

Sure, there is some overhead from calling some functions, but the events are checked for at compile time so the overhead should be minimal.
 Instead decompose these layers, make them stackable for instance:

 auto serializer = new Serializer(...);
 auto tracingSerializer = new TracingSerializer(serializer);

 Or just make 2 kinds of serializers with static if on a single template
 parameter bool withEvents it's trivial. Then a couple of aliases would
 finish the job.

I don't think that will be needed. I can see if I can refactor a bit to minimize the overhead even more.
 With that I'm observe that events are attached to types/fields... hum,
 in such a case it needs work to make them zero-cost if absent.

The only cost is calling "triggerEvents" and "triggerEvent", the rest is performed at compile time.
 And I'm afraid it's too late or the changes are too far reaching but
 let's try it.

 I'm especially destroyed  by (and the fact that it's a part of interface
 to implement):

      void archiveEnum (bool value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (bool value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (byte value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (char value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (dchar value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (int value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (long value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (short value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (ubyte value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (uint value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (ulong value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (ushort value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (wchar value, string baseType, string key, Id id);

So you want templates instead?

I have read your posts, thank you for your comments. I'm planning now to:

* Split Serializer into two parts
* Make the parts structs
* Possibly provide class wrappers
* Split Archive into two parts
* Add a range interface to Serializer and Archive

-- 
/Jacob Carlborg
Aug 27 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-Aug-2013 23:23, Jacob Carlborg wrote:
 On 2013-08-26 18:41, Dmitry Olshansky wrote:
 Looking at your current code in depth... finally. I would have a problem
 starting the productive answers to the questions.

 First things first - there should not be a key parameter aside from
 stuff added by archiver itself for its internal needs. Nor is there a
 simple way to locate data by key afterwards (certainly not every format
 defines such). It would require some tagged object model and there is no
 such _requirement_ in the serialization.

I would really like the serializer not being dependent on the order of the fields of the types it's (de)serializing.

That depends on the format, and for those that have no keys or markers of any kind versioning might help here. For instance JSON/BSON could handle permutation of fields, but then it falls short of handling links e.g. pointers (maybe there is a trick to get it, but I can't think of any right away).

I suspect it would be best to somehow see archives by capabilities:

1. Rigid (most binary) - in-order, depends on the order of fields, may need to fit a scheme (in this case D types implicitly define one). Rigid archivers may also enjoy (per format, in the future) a code generator that given a scheme defines D types with a bit of CTFE+mixin.

2. Flexible - can survive reordering, is scheme-less, data defines structure etc., more easily handles versioning; e.g. XML is one.

This also neatly answers the question about scheme vs scheme-less serialization. Protocol Buffers/Thrift may be absorbed into the Rigid category if we can get the versioning right. Also, solving versioning is the last roadblock (after ranges) mentioned on the path to making this an epic addition to Phobos.

+ Some kind of capability flag (compile-time) for whether it can serialize full graphs or if the format is too limited for such. Taking that with Rigid would cover most adhoc binary formats in the wild, with Flexible it would handle some simple hierarchical formats as well.
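Such a compile-time capability flag could be sketched roughly like this. All names here (ArchiveKind, supportsReferences, the archive types) are illustrative assumptions, not the proposed API:

```d
// Illustrative sketch only - these names are assumptions, not the API.
enum ArchiveKind { rigid, flexible }

struct BinaryArchive
{
    enum kind = ArchiveKind.rigid;
    enum supportsReferences = false; // cannot serialize full graphs
}

struct XmlArchive
{
    enum kind = ArchiveKind.flexible;
    enum supportsReferences = true; // ids allow pointer back-references
}

void serializeGraph(Archive, T)(ref Archive archive, T root)
{
    // reject unsuitable formats at compile time, not at run time
    static assert(Archive.supportsReferences,
        "this archive format is too limited to serialize object graphs");
    // ... actual serialization ...
}
```

A Rigid format without reference support would then fail the static assert at compile time instead of silently producing a broken archive.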
 This confusion and its consequences are no doubt due to building on
 std.xml and the interface it presents.

It might be due to the XML format in general but certainly not due to std.xml. The XmlArchive originally used the XML package in Tango, long before it supported std.xml.

Was it DOM-ish too?
 The awful duality of Serializer that results literally in:
 if(mode == serializing) doSerializing else do doDeserializing

 And the spectacular pair:

 T deserialize (T) (Data data, string key = "")
 {
          mode = deserializing;
      ...
 }

 Data serialize (T) (T value, string key = null)
 {
          mode = serializing;
      ...
 }

I guess that's easier to avoid if I divide Serializer in to two separate parts, one for serializing and one for deserializing.

 Amount of extra code executed per bit of output is remarkably high

Now I think you're exaggerating a bit.

I've meant at least a check of 'mode' on each call to (de)serialize + some other branch-y stuff that tests overridden serializers etc. It could be a relatively new idiom to follow but there is a great value in having a lean common path aka 90% of use cases that need no extras should go the fastest route potentially at the _expense_ of *less frequent cases*. Simplified - the earlier you can elide extra work the better performance you get. To do that you may need to do double the overhead (checks) in less frequent case to remove some of it in the common case.
 a hallmark of standard library is pay as you go principle. We (as
 collectively Phobos devs) have to set the baseline for performance, if
 it's too low we're out of the game.

 For example - events are cute, but do we all need them? Do we always
 want an overhead of checking that stuff per field written?

Sure, there are some overhead of calling some functions but the events are checked for at compile time so the overhead should be minimal.

See below. I was talking namely about calling functions to see that no events are fired anyway.
 Instead decompose these layers, make them stackable for instance:

 auto serializer = new Serializer(...);
 auto tracingSerializer = new TracingSerializer(serializer);

 Or just make 2 kinds of serializers with static if on a single template
 parameter bool withEvents it's trivial. Then a couple of aliases would
 finish the job.

I don't think that will be needed. I can see if I can refactor a bit to minimize the overhead even more.

You are probably right as I note later on + there seems to be a way to elide the cost entirely if there are no events.
 With that I'm observe that events are attached to types/fields... hum,
 in such a case it needs work to make them zero-cost if absent.

The only cost is calling "triggerEvents" and "triggerEvent", the rest is performed at compile time.

Yeah, I see, but it's still a call to a delegate that's hard to inline (well, LDC/GDC might). Would it be hard to do a compile-time check if there are any events with the type in question at all and then call triggerEvent(s)?

While we are on the subject of delegates - you absolutely should use 'scope delegate', as most (all?) delegates are never stored anywhere but rather pass blocks of code to call deeper down the line. (I guess it's somewhat Ruby-style, but it's not a problem).
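The 'scope delegate' point can be sketched like this (archiveStruct is a hypothetical helper, not the actual archiver API):

```d
// Hypothetical sketch: the delegate only wraps nested serialization
// work, so marking it 'scope' promises it never escapes the call -
// the closure can then live on the caller's stack instead of the GC heap.
void archiveStruct(scope void delegate() archiveMembers, string key)
{
    // write the opening tag for `key` ...
    archiveMembers(); // run the nested serialization
    // ... write the closing tag
}

void example()
{
    int x = 42;
    // the literal captures `x`; with a scope parameter no heap
    // allocation is required for the closure
    archiveStruct((){ /* archive x here */ }, "S");
}
```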
 And I'm afraid it's too late or the changes are too far reaching but
 let's try it.

 I'm especially destroyed  by (and the fact that it's a part of interface
 to implement):

      void archiveEnum (bool value, string baseType, string key, Id id);

      /// Ditto
      void archiveEnum (bool value, string baseType, string key, Id id);


 So you want templates instead?

Aye, as any faithful Phobos dev absolutely :)

Seriously though, ATM I just _suspect_ there is no need for Archive to be an interface. I would need to think this bit through more deeply, but the virtual call per field alone makes me nervous here.
 I have read your posts, thank you for your comments. I'm planning now to:

 * Split Serializer in to two parts
 * Make the parts struct
 * Possibly provide class wrappers
 * Split Archive in two parts
 * Add range interface to Serializer and Archive

Great checklist, this would help greatly. I'm glad you see the value in these changes. Feel free to nag me on the NG and personally for any deficiency you come across on the way there ;)

-- 
Dmitry Olshansky
Aug 27 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-27 22:12, Dmitry Olshansky wrote:

 I see...
 That depends on the format and for these that have no keys or markers of
 any kind versioning might help here. For instance JSON/BSON could handle
 permutation of fields, but I then it falls short of handling links e.g.
 pointers (maybe there is a trick to get it, but I can't think of any
 right away).

For pointers and reference types I'm currently serializing all fields with an id; then when there's a pointer or reference I can just do this:

<int name="foo" id="1">3</int>
<pointer name="bar">1</pointer>
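The bookkeeping behind that back-reference could be tracked with something like the following sketch (idByAddress/idFor are hypothetical names, not the library's internals):

```d
// Hypothetical sketch of the id table behind <pointer>1</pointer>:
// each serialized value is registered under an id, and a pointer to an
// already-registered value is emitted as a back-reference to that id.
size_t[const(void)*] idByAddress;
size_t nextId = 1;

// Returns the id for p; `seen` tells the caller whether to emit the
// value itself (<int id="...">) or just a back-reference (<pointer>).
size_t idFor(const(void)* p, out bool seen)
{
    if (auto existing = p in idByAddress)
    {
        seen = true;
        return *existing;
    }
    seen = false;
    return idByAddress[p] = nextId++;
}
```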
 I suspect it would be best to somehow see archives by capbilities:
 1. Rigid (most binary) - in-order, depends on the order of fields, may
 need to fit a scheme (in this cases D types implicitly define one)
 Rigid archivers may also enjoy (per format in the future) a code
 generator that given a scheme defines D types with a bit of CTFE+mixin.

 2. Flexible - can survive reordering, is scheme-less, data defines
 structure etc. easer handles versioning e.g. XML is one.

Yes, that's a good idea. In the binary archiver I'm working on I'm cheating quite a bit and relaxing the requirements made by the serializer.
 This also neatly answers the question about scheme vs scheme-less
 serialization. Protocol buffers/Thrift may be absorbed into Rigid
 category if we can get the versioning right. Also solving versioning is
 the last roadblock (after ranges) mentioned on the path to making this
 an epic addition to Phobos.

Versioning shouldn't be that hard, I think.
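One common shape for such versioning (a sketch under assumed names, not a committed design) is a per-type version number stored in the archive and branched on at deserialization:

```d
// Sketch: the type declares its current serialization version; the
// archive stores it alongside the data, and deserialization branches
// on the stored value to handle older layouts.
struct Person
{
    enum serializationVersion = 2;
    string name;
    int age; // field added in version 2
}

void deserializePerson(ref Person p, int storedVersion /*, archive... */)
{
    // `name` existed in every version
    // p.name = readString(...);
    if (storedVersion >= 2)
    {
        // p.age = readInt(...);
    }
    // older archives simply leave `age` at its .init value
}
```

This is roughly the approach Boost Serialization takes; schema-based formats like Protocol Buffers instead solve it with tagged, optional fields.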
 + Some kind of capability flag (compile-time) if it can serialize full
 graphs or if the format is to limited for such. Taking that with Rigid
 would cover most adhoc binary formats in the wild, with Flexible it
 would handle some simple hierarchical formats as well.

Sounds like a good idea.
 Was it DOM-ish too?

Yes.
 I've meant at least a check of 'mode' on each call to (de)serialize +
 some other branch-y stuff that tests overridden serializers etc.

 It could be a relatively new idiom to follow but there is a great value
 in having a lean common path aka 90% of use cases that need no extras
 should go the fastest route potentially at the _expense_ of *less
 frequent cases*.
 Simplified - the earlier you can elide extra work the better performance
 you get. To do that you may need to do double the overhead (checks) in
 less frequent case to remove some of it in the common case.

Yes, I understand the checking for "mode" wasn't the best approach. The internals are mostly coded to be straightforward and just work.
 See below. I was talking namely about calling functions to see that no
 events are fired anyway.

I can probably add a static-if before calling the functions.
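That static-if guard could look roughly like this. hasSerializationEvents is an assumed helper keyed on a member name purely for illustration; the library's actual event mechanism may differ:

```d
import std.traits : hasMember;

// Assumed convention for illustration: a type opts into events by
// defining an onSerialized member.
enum hasSerializationEvents(T) = hasMember!(T, "onSerialized");

struct Plain { int x; }
struct Evented { int x; void onSerialized() { /* react to serialization */ } }

void serializeValue(T)(ref T value)
{
    // ... archive the fields ...
    static if (hasSerializationEvents!T)
        value.onSerialized(); // the call is only compiled in when the hook exists
}
```

For a Plain instance no event code is emitted at all, which is the zero-cost-if-absent property being asked for.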
 Yeah, I see, but it's still a call to delegate that's hard to inline
 (well LDC/GDC might). Would it be hard to do a compile-time check if
 there are any events with the type in question at all and then call
 triggerEvent(s)?

No, I don't think so. I can also make the triggerEvents take the delegate by alias parameter, if that helps. Or inline it manually.
 While we are on the subject of delegates - you absolutely should use
 'scope delegate' as most (all?) delegates are never stored anywhere but
 rather pass blocks of code to call deeper down the line.
 (I guess it's somewhat Ruby-style, but it's not a problem).

Good idea.

The reason for the delegates is to avoid begin/end functions. This also forces the use of the API correctly. Hmm, actually it may not, since the Serializer technically is the user of the archiver API and that is already correctly implemented. The developer does need to implement the archiver API correctly, but there's nothing that stops him/her from _not_ calling the delegate. Am I overthinking this?
 Aye, as any faithful Phobos dev absolutely :)
 Seriously though ATM I just _suspect_ there is no need for Archive to be
 an interface. I would need to think this bit through more deeply but
 virtual call per field alone make me nervous here.

Originally it was using templates. One of my design goals back then was to not have to use templates. Templates force a slightly more complicated API for the user:

auto serializer = new Serializer!(XmlArchive);

Which is fine, but I'm not very happy about the API for custom serialization:

class Foo
{
    void toData (Archive) (Serializer!(Archive) serializer);
}

The user is either forced to use templates here as well, or:

class Foo
{
    void toData (Serializer!(XmlArchive) serializer);
}

... use a single type of archive. It's also possible to pass in anything as Archive. Now we have template constraints, which didn't exist back then, which make it a bit better.

About the large API to implement for an Archive, these are the criteria I had when creating the API, in order of importance:

1. Should be easy for a consumer to use
2. Should be easy for an archive implementor
3. Should be easy to implement the serializer

In this case, point 1 made it less easy for point 2. Point 2 made me push as much as possible to the serializer instead of having it in the archiver.

In the end, it's quite easy to copy-paste the API, do some search and replace, and forward methods like these:

void archiveEnum (bool value, string baseType, string key, Id id)
void archiveEnum (char value, string baseType, string key, Id id)
void archiveEnum (int value, string baseType, string key, Id id)

... to a private template method. That's what XmlArchive does:

https://github.com/jacob-carlborg/orange/blob/master/orange/serialization/archives/XmlArchive.d#L439

-- 
/Jacob Carlborg
Aug 28 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Aug-2013 11:13, Jacob Carlborg wrote:
 On 2013-08-27 22:12, Dmitry Olshansky wrote:

 I see...
 That depends on the format and for these that have no keys or markers of
 any kind versioning might help here. For instance JSON/BSON could handle
 permutation of fields, but I then it falls short of handling links e.g.
 pointers (maybe there is a trick to get it, but I can't think of any
 right away).

For pointers and reference types I currently serializing all fields with an id then when there's a pointer or reference I can just do this:

<int name="foo" id="1">3</int>
<pointer name="bar">1</pointer>

That would be tricky in JSON and quite overheadish (e.g. wrapping everything into object just in case there is a pointer there).
 I suspect it would be best to somehow see archives by capbilities:
 1. Rigid (most binary) - in-order, depends on the order of fields, may
 need to fit a scheme (in this cases D types implicitly define one)
 Rigid archivers may also enjoy (per format in the future) a code
 generator that given a scheme defines D types with a bit of CTFE+mixin.

 2. Flexible - can survive reordering, is scheme-less, data defines
 structure etc. easer handles versioning e.g. XML is one.

Yes, that's a good idea. In the binary archiver I'm working on I'm cheating quite a bit and relax the requirements made by the serializer.

Yes, instead of cheating you can just define them as different kinds. It would ease the friction and prevent some "impedance mismatch" problems.
 This also neatly answers the question about scheme vs scheme-less
 serialization. Protocol buffers/Thrift may be absorbed into Rigid
 category if we can get the versioning right. Also solving versioning is
 the last roadblock (after ranges) mentioned on the path to making this
 an epic addition to Phobos.

Versioning shouldn't be that hard, I think.

Then collect some info on how to approach this problem. See e.g. Boost Serialization, Protocol Buffers and Thrift. The key point is that it's many things to many different people.
 Was it DOM-ish too?

Yes.

That nails it. DOM isn't quite serialization but rather a hierarchical DB. BTW SQLite and other DBs may be an interesting backend for serialization (though they wouldn't have lookup until deserialization).
 Yeah, I see, but it's still a call to delegate that's hard to inline
 (well LDC/GDC might). Would it be hard to do a compile-time check if
 there are any events with the type in question at all and then call
 triggerEvent(s)?

No, I don't think so. I can also make the triggerEvents take the delegate by alias parameter, if that helps. Or inline it manually.

Great, anything to lessen the extra load.
 While we are on the subject of delegates - you absolutely should use
 'scope delegate' as most (all?) delegates are never stored anywhere but
 rather pass blocks of code to call deeper down the line.
 (I guess it's somewhat Ruby-style, but it's not a problem).

Good idea. The reasons for the delegates is to avoid begin/end functions. This also forces the use of the API correctly. Hmm, actually it may not. Since the Serializer technically is the user of the archiver API and that is already correctly implemented. The developer do need to implement the archiver API correctly, but there's nothing that stops him/her from _not_ calling the delegate. Am I over thinking this?

Seems like it; after all, library implementors should be trusted not to do truly awful things.
 Aye, as any faithful Phobos dev absolutely :)
 Seriously though ATM I just _suspect_ there is no need for Archive to be
 an interface. I would need to think this bit through more deeply but
 virtual call per field alone make me nervous here.

Originally it was using templates. One of my design goals back then was to not have to use templates. Templates forces slightly more complicated API for the user:

auto serializer = new Serializer!(XmlArchive);

Which is fine, but I'm not very happy about the API for custom serialization:

class Foo
{
    void toData (Archive) (Serializer!(Archive) serializer);
}

Rather this:

void toData(Serializer)(Serializer serializer)
    if (isSerializer!Serializer)
{
    ...
}

There is no need to even know what the archiver looks like in user code (wasn't it one of the goals of archivers?).
 The user is either forced to use templates here as well, or:

 class Foo
 {
      void toData (Serializer!(XmlArchive) serializer);
 }

The main problem would be that it can't be overridden, as templates are final.

After all of this I think Archivers are just fine as templates - the user only ever interacts with them during creation. Then it's the serializer templates that pick up the right types.

Serializers themselves, on the other hand, are present in user code and may need one common polymorphic abstract class that provides 'put' and forwards it to a set of abstract methods. All polymorphic wrappers would inherit from it. This won't prevent folks from using the templated version of toData/fromData if need be.
 ... use a single type of archive. It's also possible to pass in anything
 as Archive. Now we have template constraints, which didn't exist back
 then, make it a bit better.

 About the large API to implement for an Archive, this is the criteria I
 had when creating the API, in order of importance.

 1. Should be easy for a consumer to use
 2. Should be easy for an archive implementor
 3. Should be easy to implement the serializer

 In this case, point 1 made it less easy for point 2. Point 2 made me
 push as much as possible to the serializer instead of having it in the
 archiver.

I'd suggest to maximally hide away (Un)Archivers API from end users and as such it would be more convenient to just stay templated as it won't be seen.
 In the end, it's quite easy to copy-paste the API, do some search and
 replace and forward methods like these:

 void archiveEnum (bool value, string baseType, string key, Id id)
 void archiveEnum (char value, string baseType, string key, Id id)
 void archiveEnum (int value, string baseType, string key, Id id)

 ... to a private template method. That's what XmlArchive does:

 https://github.com/jacob-carlborg/orange/blob/master/orange/serialization/archives/XmlArchive.d#L439

-- Dmitry Olshansky
Aug 28 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Aug-2013 13:58, Dmitry Olshansky wrote:
 28-Aug-2013 11:13, Jacob Carlborg wrote:
 On 2013-08-27 22:12, Dmitry Olshansky wrote:

Rather this:

void toData(Serializer)(Serializer serializer)
    if (isSerializer!Serializer)
{
    ...
}

There is no need to even know what the archiver looks like in user code (wasn't it one of the goals of archivers?).
 The user is either forced to use templates here as well, or:

 class Foo
 {
      void toData (Serializer!(XmlArchive) serializer);
 }

The main problem would be that it can't be overridden, as templates are final.

After all of this I think Archivers are just fine as templates - the user only ever interacts with them during creation. Then it's the serializer templates that pick up the right types.

Serializers themselves, on the other hand, are present in user code and may need one common polymorphic abstract class that provides 'put' and forwards it to a set of abstract methods. All polymorphic wrappers would inherit from it.

Taking into account that you've settled on keeping Serializers as classes, just finalize all methods of a concrete serializer that is templated on the archiver (and make it a final class). Should be as simple as:

class Serializer {
    void put(T)(T item){ ... }
    //other methods per specific type
}

final class ConcreteSerializer(Archiver) : Serializer {
final:
    ...
    //use Archiver here to implement these hooks
}

Then users that use templates in their code would have concrete types; for others it quickly "decays" to the base class they use.

The boilerplate of defining a lot of methods now moves to Serializer, but there should be only one such (template) class anyway.

-- 
Dmitry Olshansky
Aug 28 2013
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-28 13:20, Dmitry Olshansky wrote:

 Taking into account that you've settled on keeping Serializers as
 classes

Not necessary.
 just finalize all methods of a concrete serializer that is
 templated on archiver (and make it a final class).

 Should be as simple as:

 class Serializer {
      void put(T)(T item){ ...}
      //other methods per specific type
 }

 final class ConcreteSerializer(Archiver) : Serializer {
 final:
      ...
      //use Archiver here to implement these hooks
 }

 Then users that use templates in their code would have concrete types,
 for others it quickly "decays" to the base class they use.

 The boilerplate of defining a lot of methods now moves to Serializer but
 there should be only one such (template) class anyway.

This is a good idea.

-- 
/Jacob Carlborg
Aug 28 2013
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-28 13:20, Dmitry Olshansky wrote:

Bumping this thread.

 Taking into account that you've settled on keeping Serializers as
 classes just finalize all methods of a concrete serializer that is
 templated on archiver (and make it a final class).

 Should be as simple as:

 class Serializer {
      void put(T)(T item){ ...}
      //other methods per specific type
 }

 final class ConcreteSerializer(Archiver) : Serializer {
 final:
      ...
      //use Archiver here to implement these hooks
 }

I'm having quite a hard time figuring out how this should work, or I'm misunderstanding what you're saying. If I understand you correctly I should do something like:

class Serializer {
    void put (T) (T item)
    {
        static if (is(T == int))
            serializeInt(item);
        ...
    }

    abstract void serializeInt (int item);
}

But if I'm doing it that way I will still have the problem of a lot of methods that need to be implemented in the archiver. Hmm, I guess it would be possible to minimize the number of methods used for built-in types. There's still a problem with user-defined types though. -- /Jacob Carlborg
Sep 24 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-Sep-2013 21:02, Jacob Carlborg пишет:
 On 2013-08-28 13:20, Dmitry Olshansky wrote:
 Taking into account that you've settled on keeping Serializers as
 classes just finalize all methods of a concrete serializer that is
 templated on archiver (and make it a final class).

 Should be as simple as:

 class Serializer {
      void put(T)(T item){ ...}
      //other methods per specific type
 }

 final class ConcreteSerializer(Archiver) : Serializer {
 final:
      ...
      //use Archiver here to implement these hooks
 }

I'm having quite a hard time figuring out how this should work, or I'm misunderstanding what you're saying. If I understand you correctly I should do something like:

class Serializer {
    void put (T) (T item)
    {
        static if (is(T == int))
            serializeInt(item);
        ...
    }

    abstract void serializeInt (int item);
}

But if I'm doing it that way I will still have the problem of a lot of methods that need to be implemented in the archiver.

If I'm correct, the archiver would have the benefit of templates and common code would be merged (so all of these methods in a concrete serializer just forward to archiver.write!int, archiver.write!uint, etc.). On the plus side of having a bunch of methods in Serializer, you need exactly one ConcreteSerializer!(Archive) that implements them. And a user-defined archiver need not even think of this, just define a single templated write (or put or whatever).
 Hmm, I guess it would be possible to minimize the number of methods used
 for built in types. There's still a problem with user defined types though.

Indeed. But it must be provided as a template in the generic serializer. The benefit is that said logic to serialize arbitrary UDTs is implemented there once and for all. The archiver is then partially relieved of it. To achieve that an archiver may need to provide some fundamental "hooks" like startStruct/endStruct (I didn't think through the exact ones). -- Dmitry Olshansky
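[Editor's note: a minimal sketch of the dispatch-plus-hooks scheme discussed in this subthread. The hook names (putInt, startStruct, endStruct) and Archiver methods (write, beginStruct) are assumptions following the thread's naming, not an actual std.serialization API.]

```d
// Templated front end dispatches to a small, fixed set of abstract
// hooks; UDT traversal is implemented once, in the base class.
abstract class Serializer
{
    void put(T)(T item)
    {
        static if (is(T == int))
            putInt(item);
        else static if (is(T == string))
            putString(item);
        else static if (is(T == struct))
        {
            startStruct(T.stringof);
            foreach (i, ref field; item.tupleof)
                put(field); // recurse; generic UDT logic lives here
            endStruct();
        }
        else
            static assert(false, "sketch only handles int/string/struct");
    }

protected:
    abstract void putInt(int item);
    abstract void putString(string item);
    abstract void startStruct(string name);
    abstract void endStruct();
}

// Exactly one concrete class per archiver; it merely forwards the
// hooks to the archiver's templated write methods.
final class ConcreteSerializer(Archiver) : Serializer
{
    private Archiver archiver;

    this(Archiver archiver) { this.archiver = archiver; }

protected:
    override void putInt(int item) { archiver.write(item); }
    override void putString(string item) { archiver.write(item); }
    override void startStruct(string name) { archiver.beginStruct(name); }
    override void endStruct() { archiver.endStruct(); }
}
```

Code that only sees the Serializer base class still gets polymorphic `put` for the built-in types, while the archiver author implements just the handful of hooks.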
Sep 24 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-09-24 22:27, Dmitry Olshansky wrote:

 Indeed. But it must be provided as a template in the generic serializer.
 The benefit is that said logic to serialize arbitrary UDTs is implemented
 there once and for all. The archiver is then partially relieved of it. To
 achieve that an archiver may need to provide some fundamental "hooks"
 like startStruct/endStruct (I didn't think through the exact ones).

Ok, that's basically how it already works. Thanks. -- /Jacob Carlborg
Sep 25 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-28 11:58, Dmitry Olshansky wrote:

 That would be tricky in JSON and quite overheadish (e.g. wrapping
 everything into object just in case there is a pointer there).

Yes.
 Yes, instead of cheating you can just define them as different kinds. It
 would ease the friction and prevent some "impedance mismatch" problems.

Yes, that's better.
 Then collect some info on how to approach this problem.
 See e.g. Boost serialziation, Protocol Buffers and Thrift.
 The key point is that it's many things to many different people.

I'll do that.
 Rather this:

 void toData(Serializer)(Serializer serializer)
      if(isSerializer!Serializer)
 {
      ...
 }

 There is no need to even know how archiver looks like for the user code
 (wasn't it one of the goals of archivers?).

Right, didn't think of using a template argument for the whole serializer.
 Serializers themselves on the other hand are present in user code and
 may need one common polymorphic abstract class that provides 'put' and
 forwards it to a set of abstract methods. All polymorphic wrappers would
 inherit from it.

 This won't prevent folks from using templated version of toData/fromData
 if need be.

That's a good idea.
 I'd suggest to maximally hide away (Un)Archivers API from end users and
 as such it would be more convenient to just stay templated as it won't
 be seen.

Yes. -- /Jacob Carlborg
Aug 28 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-27 22:12, Dmitry Olshansky wrote:

 Feel free to nag me on the NG and personally for any deficiency you come
 across on the way there ;)

About making Serializer a struct: actually I think Serializer should have reference-type semantics. I see no use case for passing a Serializer by value, although I do see the overhead of allocating a class and calling methods on it.

I do plan to add a free function for deserializing, for convenience. In that function the Serializer, if it's a class, would be allocated using emplace to make it stack-allocated. -- /Jacob Carlborg
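[Editor's note: a rough sketch of the emplace-based stack allocation Jacob describes. The free function name deserializeFrom and the deserialize signature are hypothetical, not the library's API.]

```d
import std.conv : emplace;

class Serializer
{
    // Stand-in for the real deserialization logic.
    T deserialize(T)(ubyte[] data) { return T.init; }
}

// Hypothetical convenience function: the class instance is constructed
// in a stack buffer via emplace, so the one-shot call does not touch
// the GC heap.
T deserializeFrom(T)(ubyte[] data)
{
    enum size = __traits(classInstanceSize, Serializer);
    void[size] buffer = void;
    auto serializer = emplace!Serializer(buffer[]);
    return serializer.deserialize!T(data);
}
```

The buffer lives only for the duration of the call, which is exactly the "convenience, no allocation" trade-off discussed above.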
Aug 28 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Aug-2013 12:08, Jacob Carlborg пишет:
 On 2013-08-27 22:12, Dmitry Olshansky wrote:

 Feel free to nag me on the NG and personally for any deficiency you come
 across on the way there ;)

About making Serializer a struct: actually I think Serializer should have reference-type semantics. I see no use case for passing a Serializer by value, although I do see the overhead of allocating a class and calling methods on it.

Here you are quite right... just add a factory that hides away its true origin (and the ctor as well, so it can be changed later if need be), e.g.:

auto serializer = serializerFor!(XmlArchiver)(archiver);
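[Editor's note: the factory could be as thin as the sketch below; serializerFor and ConcreteSerializer follow the names used in this thread and are not an actual Phobos API.]

```d
// Hides the archiver-parameterized concrete type behind a factory;
// callers that don't care about the concrete type can let it "decay"
// to the Serializer base class.
auto serializerFor(Archiver)(Archiver archiver)
{
    return new ConcreteSerializer!Archiver(archiver);
}

// usage:
// auto serializer = serializerFor(new XmlArchiver!char);
```

Because the concrete type is only named inside the factory, its constructor and layout can change later without breaking user code.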
 I do plan to add a free function for deserializing, for convenience. In
 that function Serializer, if it's a class, would be allocated using
 emplace to make it stack allocated.

Good idea. The API should have many layers, so that power users may keep digging to the bottom and those who just need to get the job done can do it in one stroke. -- Dmitry Olshansky
Aug 28 2013
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-27 22:12, Dmitry Olshansky wrote:

 Feel free to nag me on the NG and personally for any deficiency you come
 across on the way there ;)

I'm bumping this again with a new question. I'm thinking about how to output the data to an output range. If the output range is of type ubyte[], how should I output serialized data looking like this:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
    <int key="a" id="1">3</int>
</object>

Should I output this in one chunk, or in parts like this:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">

Then:

    <int key="a" id="1">3</int>

Then:

</object>

If the first case is chosen, I guess this data:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
    <int key="a" id="1">3</int>
</object>
<object runtimeType="main.Foo" type="main.Foo" key="1" id="2">
    <int key="a" id="3">3</int>
</object>

would be outputted in two chunks. -- /Jacob Carlborg
Oct 10 2013
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-10-10 14:13, Dicebot wrote:

 I thought the very point of output ranges is that you can output in any
 chunks that seem most convenient / efficient to you.

Outputting it like the first suggestion will be a lot more convenient, especially given how std.xml currently works, although it will probably not be as efficient. This basically means a complete object graph will be outputted in one chunk. That is, unless the top element is a range. -- /Jacob Carlborg
Oct 10 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
10-Oct-2013 11:45, Jacob Carlborg пишет:
 On 2013-08-27 22:12, Dmitry Olshansky wrote:

 Feel free to nag me on the NG and personally for any deficiency you come
 across on the way there ;)

I'm bumping this again with a new question. I'm thinking about how to output the data to an output range. If the output range is of type ubyte[], how should I output serialized data looking like this:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
    <int key="a" id="1">3</int>
</object>

Should I output this in one chunk, or in parts like this:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">

Then:

    <int key="a" id="1">3</int>

Then:

</object>

I do believe it's a very minor detail of a specific implementation of the XML archiver. Speaking of archivers in general, the main point is to try to avoid accumulating lots of data in memory if possible, and to put things as they go. This however doesn't preclude the 2nd goal: outputting as much as possible in one go (and not 2 chunks) is preferable (1 call vs 2 calls to put of the output range, etc.) as long as it doesn't harm memory usage (O(1) is OK, anything else is not) and doesn't complicate the archiver. -- Dmitry Olshansky
Oct 10 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-26 15:53, Dicebot wrote:

 In the end it is still your decision to make - people here can provide
 some input and help with technical details but it is still your library
 and your voting ;) Though I would probably suggest to value Phobos
 developer opinions more than, ugh, mine (or any other random trespasser).

Well I'm happy with the interface as it is, that's why I created it like that. But the other developers here are not, so it won't be accepted in its current state. That's why I'm asking: "what should it look like?". -- /Jacob Carlborg
Aug 26 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-26 15:42, Dicebot wrote:

 For me distinction was very natural. `(de)serializer` is something that
 takes care of D type introspection and provides it in simplified form to
 `(de)archiver` which embeds actual format knowledge. Former can get
 pretty tricky in D so it makes some sense to keep it separate.

I think he was referring to that there should be a separate deserializer from the serializer. And a separate unarchiver from the archiver. -- /Jacob Carlborg
Aug 26 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
26-Aug-2013 17:42, Dicebot пишет:
 On Monday, 26 August 2013 at 09:23:32 UTC, Dmitry Olshansky wrote:
 I'm still missing something about separation of archiver and
 serializer but in my mind these are tightly coupled and may as well be
 one entity.

For me distinction was very natural. `(de)serializer` is something that takes care of D type introspection and provides it in simplified form to `(de)archiver` which embeds actual format knowledge. Former can get pretty tricky in D so it makes some sense to keep it separate.

If that is the case then fine. Though upon seeing the Archive interface in its mortifying glory, I'm not sure about that simplifying bit ...
 I can't really add anything on ranges part of your comments - sounds
 like you have a better "big picture" anyway :)

-- Dmitry Olshansky
Aug 26 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
26-Aug-2013 00:50, Dmitry Olshansky пишет:
 25-Aug-2013 23:15, Dicebot пишет:
 On Sunday, 25 August 2013 at 08:36:40 UTC, Dmitry Olshansky wrote:
 Same thoughts here.
 Serializer is an output range for pretty much anything (that is
 serializable). Literally isOutputRange!T would be true for a whole lot
 of things, making it possible to dumping any ranges of Ts via copy.
 Just make its put method work on a variety of types and you have it.

Can't it be both OutputRange itself and provide InputRange via `serialize` call (for filters & similar pipe processing)?


 What's lacking is a way to connect a sink to another sink.

 My view of it is:
 auto app = appender!(ubyte[])();
 //thus compression is an output range wrapper
 auto compressor = compress!LZMA(app);

 In other words an output range could be a filter, or rather forwarder of
 the transformed result to some other output range. And we need this not
 only for serialization (though formattedWrite can arguably  be seen as a
 serialization) but anytime we have to turn heterogeneous input into
 homogeneous output and post-process THAT output.

On the subject of it, we can do some cool wonders by providing such adapters. Example: calculate the SHA1 hash of a message on the fly: https://gist.github.com/blackwhale/6339932

As a proof of concept it shows the power that output range adapters possess :)

Sadly it hits a bug in LockingTextWriter, namely the destructor fails on T.init (a usual oversight). Patch:

--- a/std/stdio.d
+++ b/std/stdio.d
@@ -1517,9 +1517,12 @@
 $(D Range) that locks the file and allows fast writing to it.

     ~this()
     {
-        FUNLOCK(fps);
-        fps = null;
-        handle = null;
+        if(fps)
+        {
+            FUNLOCK(fps);
+            fps = null;
+            handle = null;
+        }
     }

     this(this)

-- Dmitry Olshansky
Aug 26 2013
prev sibling next sibling parent "Tyler Jameson Little" <beatgammit gmail.com> writes:
On Wednesday, 21 August 2013 at 20:21:49 UTC, Dicebot wrote:
 My 5 cents:

 On Wednesday, 21 August 2013 at 18:45:48 UTC, Jacob Carlborg 
 wrote:
 If this alternative is chosen how should the range for the 
 XmlArchive work like? Currently the archive returns a string, 
 should the range just wrap the string and step through 
 character by character? That doesn't sound very effective.

It should be range of strings - one call to popFront should serialize one object from input object range and provide matching string buffer.

I don't like this because it still caches the whole object into memory. In a memory-restricted application, this is unacceptable. I think one call to popFront should release part of the serialized object. For example:

struct B {
    int c, d;
}

struct A {
    int a;
    B b;
}

The JSON output of this would be:

    {
        a: 0,
        b: {
            c: 0,
            d: 0
        }
    }

There's no reason why the serializer can't output this in chunks:

Chunk 1:

    {
        a: 0,

Chunk 2:

        b: {

Etc...

Most archive formats should support chunking. I realize this may be a rather large change to Orange, but I think it's a direction it should be headed.
 Alternative AO2:

 Another idea is the archive is an output range, having this 
 interface:

 auto archive = new XmlArchive!(char);
 archive.writeTo(outputRange);

 auto serializer = new Serializer(archive);
 serializer.serialize(new Object);

 Use the output range when the serialization is done.

I can't imagine a use case for this. Adding ranges just because you can is not very good :)

I completely agree.
 A problem with this, actually I don't know if it's considered 
 a problem, is that the following won't be possible:

 auto archive = new XmlArchive!(InputRange);
 archive.data = archive.data;

What should this snippet do?
 Which one would usually expect from an OO API. The problem 
 here is that the archive is typed for the original input range 
 but the returned range from "data" is of a different type.

Range-based algorithms don't assign ranges. Transferring data from one range to another is done via copy(sourceRange, destRange) and similar tools.

This is just a read-only property, which arguably doesn't break misconceptions. There should be no reason to assign directly to a range.
 It looks like difficulties come from your initial assumption 
 that one call to serialize/deserialize implies one object - in 
 that model ranges hardly are useful. I don't think it is a 
 reasonable restriction. What is practically useful is 
 (de)serialization of large list of objects lazily - and this is 
 a natural job for ranges.

I agree that (de)serializing a large list of objects lazily is important, but I don't think that's the natural interface for a Serializer. I think each object should be lazily serialized instead, to maximize throughput.

If a Serializer is defined as only (de)serializing a single object, then serializing a range of Type would be as simple as using map() with a Serializer (getting a range of serialized objects). If the allocs are too much, then the same serializer can be used, serializing one at a time.

My main point here is that data should be written as it's being serialized. In a networked application, it may take a few packets to encode a larger object, so the first packets should be sent ASAP.

As usual, feel free to destroy =D
Aug 21 2013
prev sibling next sibling parent "Tyler Jameson Little" <beatgammit gmail.com> writes:
On Thursday, 22 August 2013 at 07:16:11 UTC, Jacob Carlborg wrote:
 On 2013-08-22 05:13, Tyler Jameson Little wrote:

 I don't like this because it still caches the whole object 
 into memory.
 In a memory-restricted application, this is unacceptable.

It needs to store all serialized reference types, otherwise it cannot properly serialize a complete object graph. We don't want duplicates. Example, the following code:

auto bar = new Bar;
bar.a = 3;

auto foo = new Foo;
foo.a = bar;
foo.b = bar;

is serialized as:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
    <object runtimeType="main.Bar" type="main.Bar" key="a" id="1">
        <int key="a" id="2">3</int>
    </object>
    <reference key="b">1</reference>
</object>

For "foo.b" it just serializes a reference, not the complete object, because that has already been serialized. The serializer needs to keep track of that.

Right, but it doesn't need to keep the serialized data in memory.
 I think one call to popFront should release part of the 
 serialized
 object. For example:

 struct B {
     int c, d;
 }

 struct A {
     int a;
     B b;
 }

 The JSON output of this would be:

     {
         a: 0,
         b: {
             c: 0,
             d: 0
         }
     }

 There's no reason why the serializer can't output this in 
 chunks:

 Chunk 1:

     {
         a: 0,

 Chunk 2:

         b: {

 Etc...

It seems hard to keep track of nesting. I can't see how pretty printing using this technique would work.

Can't you just keep a counter? When you enter anything that would increase the indentation level, increment the counter; when leaving, decrement it. At each level, insert whitespace equal to indentationLevel * whitespacePerLevel. This seems pretty trivial, unless I'm missing something.

Also, I didn't check, but it turns off pretty-printing by default, right?
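[Editor's note: the counter idea might look like the sketch below; PrettyPrinter and its member names are illustrative, not Orange's actual printer.]

```d
// Tracks nesting depth while emitting chunked output; the archiver
// calls enter()/leave() on opening and closing tokens and prepends
// indent() to each line.
struct PrettyPrinter
{
    import std.array : replicate;

    size_t level;
    enum spacesPerLevel = 4;

    void enter() { ++level; }  // on <object>, '{', etc.
    void leave() { --level; }  // on the matching closing token

    // Whitespace to prepend to the current line.
    string indent() const
    {
        return " ".replicate(level * spacesPerLevel);
    }
}
```

Because the depth is the only state needed, pretty printing survives chunked output: each chunk just asks the printer for the current indentation.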
 This is just a read-only property, which arguably doesn't break
 misconceptions. There should be no reason to assign directly 
 to a range.

How should I set the data used for deserializing?

How about passing it in with a function? Each range passed this way would represent a single object, so the current deserialize!Foo(InputRange) would work the same way it does now.
 I agree that (de)serializing a large list of objects lazily is
 important, but I don't think that's the natural interface for a
 Serializer. I think that each object should be lazily 
 serialized instead
 to maximize throughput.

 If a Serializer is defined as only (de)serializing a single 
 object, then
 serializing a range of Type would be as simple as using map() 
 with a
 Serializer (getting a range of Serialize). If the allocs are 
 too much,
 then the same serializer can be used, but serialize 
 one-at-a-time.

 My main point here is that data should be written as it's being
 serialized. In a networked application, it may take a few 
 packets to
 encode a larger object, so the first packets should be sent 
 ASAP.

 As usual, feel free to destroy =D

Again, how does one keep track of nesting in formats like XML, JSON and YAML?

YAML will take a little extra care since whitespace is significant, but it should work well enough as I've described above.
Aug 22 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Thursday, 22 August 2013 at 03:13:46 UTC, Tyler Jameson Little 
wrote:
 On Wednesday, 21 August 2013 at 20:21:49 UTC, Dicebot wrote:
 It should be range of strings - one call to popFront should 
 serialize one object from input object range and provide 
 matching string buffer.

I don't like this because it still caches the whole object into memory. In a memory-restricted application, this is unacceptable.

Well, in memory-restricted applications having a large object at all is unacceptable. The rationale is that you hardly ever want a half-deserialized object. If the environment is very restrictive, smaller objects will be used anyway (a list of smaller objects).
 ...
 There's no reason why the serializer can't output this in chunks

Outputting on its own is not useful to discuss - in the pipe model, output matches input. What is the point in outputting partial chunks of a serialized object if you still need to provide it as a whole to the input?
Aug 22 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
I'll focus on part I find crucial:

On Thursday, 22 August 2013 at 07:08:28 UTC, Jacob Carlborg wrote:
 The question is if a range should be treated as multiple 
 objects, and not a single object (which it really is). How 
 should it be serialized?

 * Something like an array, resulting in this XML:

 <array type="int" length="5" key="0" id="0">
     <int key="0" id="1">1</int>
     <int key="1" id="2">2</int>
     <int key="2" id="3">3</int>
     <int key="3" id="4">4</int>
     <int key="4" id="5">5</int>
 </array>

 * Or like calling "serialize" multiple times, resulting in this 
 XML:

 <int key="0" id="0">1</int>
 <int key="1" id="1">2</int>
 <int key="2" id="2">3</int>
 <int key="3" id="3">4</int>
 <int key="4" id="4">5</int>

Is there a reason arrays need to be serialized as (1), not (2)? I'd expect any input-range compliant data to be serialized as (2), and lazily. That allows you to use the deserializer as a pipe over some sort of network-based string feed to get a potentially infinite input range of deserialized objects.
Aug 22 2013
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Thursday, 22 August 2013 at 14:48:57 UTC, Dicebot wrote:
 Outputting on its own is not useful to discuss - in pipe model 
 output matches input. What is the point in outputting partial 
 chunks of serialized object if you still need to provide it as 
 a whole to the input?

Partial chunks of serialized objects can be useful for applications that aren't immediately deserializing: E.g. sending over a network, storing to disk etc.
Aug 22 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Thursday, 22 August 2013 at 14:55:50 UTC, John Colvin wrote:
 Partial chunks of serialized objects can be useful for 
 applications that aren't immediately deserializing: E.g. 
 sending over a network, storing to disk etc.

But text I/O operates on character ranges anyway, it just uses whatever data is available:

// some imaginary stuff
InputRange!Object.serialize.until(5).copy(stdout);

`copy` will write the text buffer that matches one Object at a time. What is the point in serializing only half of a given object if it is already in memory and available?
Aug 22 2013
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Wed, 21 Aug 2013 22:21:48 +0200
schrieb "Dicebot" <public dicebot.lv>:

 
 Alternative AO2:

 Another idea is the archive is an output range, having this 
 interface:

 auto archive = new XmlArchive!(char);
 archive.writeTo(outputRange);

 auto serializer = new Serializer(archive);
 serializer.serialize(new Object);

 Use the output range when the serialization is done.

I can't imagine a use case for this. Adding ranges just because you can is not very good :)

I'm kinda confused why nobody here sees the benefits of the output range model. Most serialization libraries in other languages are implemented like that. For example, .NET:

--------
IFormatter formatter = ...
Stream stream = new FileStream(...)
formatter.Serialize(stream, obj);
stream.Close();
--------

The reason is simple: in serialization it is not common to post-process the serialized data, as far as I know. Usually it's either written to a file or sent over the network, which are perfect examples of Streams (or output ranges). Common usage is like this:

-------
auto s = FileStream;
auto serializer = Serializer(s);
serializer.serialize(1);
serializer.serialize("Hello");
foreach(value; ...)
    serializer.serialize(value);
-------

The classic way to efficiently implement this pattern is using an OutputRange/Stream. Serialization must be capable of outputting many 100MBs to a file or network without significant memory overhead.

There are two specific ways an InputRange interface can be useful. In case the serializer works as a filter for another range:

--------
auto serializer = new Serializer([1,2,3,4,5].take(3));
foreach(ubyte[] data; serializer)
--------

But InputRanges are limited to the same type for all elements; the "serialize" call isn't. Of course you can use Variant. But what about big structs? And performance matters, so the InputRange approach only works nicely if you serialize values of the same type.

The other way is if you only want to serialize one element:

--------
auto serializer = new Serializer(myobject);
foreach(ubyte[] data; serializer)
--------

It does not work well if you want to mix it with the "serialize" call:

-------
auto serializer = new Serializer();
serializer.serialize(1);
serializer.serialize("Hello");
serializer.serialize(3);
serializer.serialize(4);
foreach(ubyte[] data; serializer)
-------

Here the serializer has to cache data or the original objects until the data is processed via foreach.

If the serializer had access to an output range, the "serialize" calls could directly write to the stream without any caching. So the output-range model is clearly superior in this case.
Aug 22 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Thursday, 22 August 2013 at 15:33:07 UTC, Johannes Pfau wrote:
 The reason is simple: In serialization it is not common to 
 post-process
 the serialized data as far as I know. Usually it's either 
 written to a
 file or sent over network which are perfect examples of Streams 
 (or
 output ranges).

Hm but in this model it is file / socket which is an OutputRange, isn't it? Serializer itself just provides yet another InputRange which can be fed to target OutputRange. Am I getting this part wrong?
 But InputRanges are limited to the same type for all elements, 
 the
 "serialize" call isn't.

I was thinking about this but do we have any way to express non-monotone range in D? Variant seems only option, it implies that any format used by Archiver must always store type information though.
 Of course you can use Variant. But what about
 big structs?

After some thinking I've come to the conclusion that it is simply a matter of two `data` ranges - one parametrized by the output type and one "raw". The latter can then output stuff in string chunks of undefined size (as small as the serialization implementation allows). Does that help?
Aug 22 2013
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Thu, 22 Aug 2013 17:49:04 +0200
schrieb "Dicebot" <public dicebot.lv>:

 On Thursday, 22 August 2013 at 15:33:07 UTC, Johannes Pfau wrote:
 The reason is simple: In serialization it is not common to 
 post-process
 the serialized data as far as I know. Usually it's either 
 written to a
 file or sent over network which are perfect examples of Streams 
 (or
 output ranges).

Hm but in this model it is file / socket which is an OutputRange, isn't it? Serializer itself just provides yet another InputRange which can be fed to target OutputRange. Am I getting this part wrong?

Yes, but the important point is that Serializer is _not_ an InputRange of serialized data. Instead it _uses_ an OutputRange / Stream internally. I'll show a very simplified example:

---------------------
struct Serializer(T) //if(isOutputRange!(T, ubyte[]))
{
    private T _output;

    this(T output)
    {
        _output = output;
    }

    void serialize(T)(T data)
    {
        _output.put((cast(ubyte*)&data)[0..T.sizeof]);
    }
}

void put(File f, ubyte[] data) //File is not an OutputRange...
{
    f.write(data);
}

void main()
{
    auto serializer = Serializer!File(stdout);
    serializer.serialize("Test");
    serializer.serialize("Hello World!");
}
---------------------

As you can see there are absolutely no memory allocations necessary. Of course in reality you'll need a fixed buffer, but there's no dynamic allocation.

Now try to implement this in an efficient way as an InputRange. Here's the skeleton:

---------------------
struct Serializer
{
    void serialize(T)(T data) {}

    bool empty() {}
    ubyte[] front;
    void popFront() {}
}

void main()
{
    auto serializer = Serializer!File(stdout);
    serializer.serialize("Test");
    serializer.serialize("Hello World!");
    foreach(ubyte[] data; serializer)
}
---------------------

How would you implement this? This can only work efficiently if Serializer wraps its InputRange or if there's only one value to serialize. But the serialize method as defined above cannot be implemented efficiently with this approach.

Now I do confess that an InputRange filter is useful. But only for specific use cases; the more common use case is directly outputting to an OutputRange, and this should be as efficient as possible. With a good design it should be possible to support both cases efficiently with the same "backends". But implementing an InputRange serializer filter will still be much more difficult than the OutputRange case (the serializer must be capable of resuming serialization at any point, as your output buffer might be full).

I'd like to make another comment about performance.
I think there are two possible usages / user groups of std.serialization:

1) The classical, heavyweight C#/Java style serialization which can serialize complete object graphs, deals with inheritance and so on.

2) The simple "Just write the JSON representation of this struct to this file" kind of usage.

For use case 2 it's important that there's as little overhead as possible. Consider this struct:

struct Song
{
    string artist;
    string title;
}

If I'd write JSON serialization manually, it would look like this:

---------
auto a = Appender!string; //or any output range
Song s;
a.put("{\n");
a.put(` "artist"="`);
a.put(song.artist);
a.put(`",\n`);
a.put(` "title"="`);
a.put(song.title);
a.put(`"\n}\n`);
---------

As you can see this code does basically nothing: no allocation, no string processing, it just copies data. But it's annoying to write this boilerplate. I'd expect a serialization lib to let me do this:

serialize!JSON(a, s);

And performance should be very close to the hand-written code written above.
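[Editor's note: the one-liner described above could, in principle, be generated from the struct's fields with compile-time introspection. A rough sketch follows; serializeJSON is a hypothetical name, and the code handles only flat structs of string fields with no escaping.]

```d
import std.range.primitives : put;

// Writes a flat struct of string fields as JSON to any output range.
// Sketch only: no escaping, no nesting, no non-string field types.
void serializeJSON(T, Sink)(ref Sink sink, T value)
    if (is(T == struct))
{
    put(sink, "{\n");
    foreach (i, ref field; value.tupleof)
    {
        // Field names are recovered at compile time, so the emitted
        // code is just a sequence of copies, as in the manual version.
        enum name = __traits(identifier, T.tupleof[i]);
        put(sink, `    "` ~ name ~ `": "`);
        put(sink, field);
        put(sink, i + 1 == T.tupleof.length ? "\"\n" : "\",\n");
    }
    put(sink, "}\n");
}
```

Usage would mirror the hand-written example: `auto a = appender!string(); serializeJSON(a, s);`.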
Aug 22 2013
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Thu, 22 Aug 2013 18:13:23 +0200
schrieb Jacob Carlborg <doob me.com>:

 On 2013-08-22 17:33, Johannes Pfau wrote:
 
 The reason is simple: In serialization it is not common to
 post-process the serialized data as far as I know.

Perhaps compression or encryption.

But compression or encryption are usually implemented as OutputRanges / OutputStreams.
Aug 22 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Thursday, 22 August 2013 at 17:39:19 UTC, Johannes Pfau wrote:
 Yes, but the important point is that Serializer is _not_ an 
 InputRange
 of serialized data. Instead it _uses_ a OutputRange / Stream
 internally.

Shame on me. I completely misunderstood you and thought you wanted to make the serializer an OutputRange itself. Your examples make a lot of sense and I do agree it is a use case worth supporting. I need some more time to imagine how that may impact the API in general.
Aug 22 2013
prev sibling next sibling parent "Tyler Jameson Little" <beatgammit gmail.com> writes:
On Thursday, 22 August 2013 at 14:48:57 UTC, Dicebot wrote:
 On Thursday, 22 August 2013 at 03:13:46 UTC, Tyler Jameson 
 Little wrote:
 On Wednesday, 21 August 2013 at 20:21:49 UTC, Dicebot wrote:
 It should be range of strings - one call to popFront should 
 serialize one object from input object range and provide 
 matching string buffer.

I don't like this because it still caches the whole object into memory. In a memory-restricted application, this is unacceptable.

Well, in memory-restricted applications having a large object at all is unacceptable. The rationale is that you hardly ever want a half-deserialized object. If the environment is very restrictive, smaller objects will be used anyway (a list of smaller objects).

It seems you and I are trying to solve two very different problems. Perhaps if I explain my use case, it'll make things clearer.

I have a server that deserializes data from a socket, processes that data, then updates internal state and sends notifications to clients (which involves serialization as well). When new clients connect, they need all of this internal state, so the easiest way to provide it is to create one large object out of all of the smaller objects:

class Widget { }

class InternalState
{
    Widget[string] widgets;
    ... other data here
}

InternalState isn't very big by itself; it just has an associative array of Widget references with some other rather small data. When serialized, however, this can get quite large. Since archive formats are orders of magnitude less efficient than in-memory stores, caching the archived version of the internal state can be prohibitively expensive.

Let's say the serialized form of the internal state is 5MB, and I have 128MB available, while 50MB or so is used by the application. This leaves about 70MB, so I can only support 14 connected clients. With a streaming serializer (per object), I'll get that 5MB down to a few hundred KB and I can support many more clients.
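The streaming idea can be sketched like this: serialize one widget at a time straight into the sink, so peak memory is bounded by one serialized object rather than the whole internal state (the inline JSON formatting here is a toy stand-in for a real per-object serializer, and `streamState` is a hypothetical name):

```d
import std.array : appender;
import std.format : formattedWrite;

class Widget
{
    string name;
    this(string name) { this.name = name; }
}

// Hypothetical streaming approach: each widget is serialized and
// flushed to the sink individually, never materializing the full
// archive of the internal state in memory.
void streamState(Sink)(ref Sink sink, Widget[string] widgets)
{
    foreach (key, w; widgets)
    {
        sink.formattedWrite!`{"key":"%s","name":"%s"}`(key, w.name);
        sink.put('\n');
    }
}

void main()
{
    auto sink = appender!string();
    Widget[string] ws;
    ws["a"] = new Widget("alpha");
    streamState(sink, ws);
    assert(sink.data == "{\"key\":\"a\",\"name\":\"alpha\"}\n");
}
```

In the server scenario the sink would be the client socket instead of an appender, so the 5MB archive never exists as a whole.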
 ...
 There's no reason why the serializer can't output this in 
 chunks

Outputting on its own is not useful to discuss - in a pipe model output matches input. What is the point in outputting partial chunks of a serialized object if you still need to provide it as a whole to the input?

This only makes sense if you are deserializing right after serializing, which is *not* a common thing to do. Also, it's much more likely to need to serialize a single object (as in a REST API, a 3D model parser [think COLLADA], or a config parser). Providing a range seems to fit only a small niche: people that need to dump the state of the system. With single-object serialization and chunked output, you can define your own range to get the same effect, but with an API as you detailed, you can't avoid memory problems without going outside std.
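"Define your own range" on top of a single-object serializer can be as simple as mapping it over a range of objects; `toJson` below is a toy stand-in for such a serializer:

```d
import std.algorithm : map;
import std.array : array;
import std.conv : to;

struct Point { int x, y; }

// Toy stand-in for a single-object serializer.
string toJson(Point p)
{
    return `{"x":` ~ p.x.to!string ~ `,"y":` ~ p.y.to!string ~ `}`;
}

void main()
{
    auto points = [Point(1, 2), Point(3, 4)];
    // A lazy range of serialized chunks, built by the user, without
    // the library itself having to expose a range-of-strings API.
    auto chunks = points.map!toJson;
    assert(chunks.array == [`{"x":1,"y":2}`, `{"x":3,"y":4}`]);
}
```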
Aug 22 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Sunday, 25 August 2013 at 08:36:40 UTC, Dmitry Olshansky wrote:
 Same thoughts here.
 Serializer is an output range for pretty much anything (that is 
 serializable). Literally isOutputRange!T would be true for a 
 whole lot of things, making it possible to dump any range 
 of Ts via copy.
 Just make its put method work on a variety of types and you 
 have it.

Can't it be both an OutputRange itself and provide an InputRange via the `serialize` call (for filters and similar pipe processing)?
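A type can indeed do both; here is a sketch of a serializer whose `put` accepts anything it can serialize (so `isOutputRange` holds) while exposing the produced text, itself an input range of characters, via `data` (`TextSerializer` and its members are illustrative names only):

```d
import std.array : Appender;
import std.conv : to;
import std.range : isOutputRange;

// Sketch: the serializer is an output range (put accepts any value
// it can serialize) and also hands out the serialized result, which
// can feed filters and similar pipe processing as an input range.
struct TextSerializer
{
    Appender!string sink;

    void put(T)(T value)
    {
        sink.put(value.to!string);
        sink.put('\n');
    }

    auto data() { return sink.data; } // a string is an input range of chars
}

// Accepts ints, strings, ... anything convertible via to!string.
static assert(isOutputRange!(TextSerializer, int));

void main()
{
    TextSerializer s;
    s.put(42);
    s.put("hello");
    assert(s.data == "42\nhello\n");
}
```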
Aug 25 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Monday, 26 August 2013 at 09:23:32 UTC, Dmitry Olshansky wrote:
 I'm still missing something about separation of archiver and 
 serializer but in my mind these are tightly coupled and may as 
 well be one entity.

For me the distinction was very natural. The `(de)serializer` is something that takes care of D type introspection and provides it in simplified form to the `(de)archiver`, which embeds the actual format knowledge. The former can get pretty tricky in D, so it makes some sense to keep it separate. I can't really add anything on the ranges part of your comments - sounds like you have a better "big picture" anyway :)
Aug 26 2013
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Monday, 26 August 2013 at 11:23:05 UTC, Jacob Carlborg wrote:
 Here we have yet another suggestion for an API. The whole 
 reason for this thread is that people weren't happy with the 
 current interface, i.e. not range based. Now we got probably 
 just as many suggestions as people who have answered to this 
 thread. I still don't know how the API should look.

In the end it is still your decision to make - people here can provide some input and help with technical details, but it is still your library and your voting ;) Though I would probably suggest valuing Phobos developers' opinions more than, ugh, mine (or any other random trespasser's). It is quite natural that such a package has lots of potential use cases and thus different API expectations. Choosing proper aesthetics is what makes programming an art :P
Aug 26 2013
prev sibling parent "Dicebot" <public dicebot.lv> writes:
On Thursday, 10 October 2013 at 07:45:51 UTC, Jacob Carlborg 
wrote:
 ...

I thought the very point of output ranges is that you can output in any chunks that seem most convenient / efficient to you.
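That flexibility is visible in `std.range.put`, which accepts either single elements or whole slices, so a serializer can emit whatever chunk size is convenient:

```d
import std.array : appender;
import std.range : put;

void main()
{
    auto a = appender!string();
    put(a, '{');        // one character at a time...
    put(a, `"key":1`);  // ...or a whole slice in one call
    put(a, '}');
    assert(a.data == `{"key":1}`);
}
```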
Oct 10 2013