www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Array append performance 2

reply bearophile <bearophileHUGS lycos.com> writes:
Content-Type: text/plain

Until some more basic solution is found (and/or answers from Walter regarding
how to solve this append performance problem and regarding the idea of adding a
third capacity field to dynamical arrays), someone has suggested to create an
array builder class, like the string builder of Java. So I have created one,
the code is in the attach. Note that code is raw still, not commented, and it
does the bare minimum (append one item and extract the array, I'd like to add
generic extend methods, and tye possibility to change the capacity/length), its
tests are messy still, etc, once I know it's correct I'll improve it and I'll
add it to my libs (of course, I can offer a version of this code to
Phobos/Tango too).

It's not easy for me to write such GC-based code, so my questions are:
Is ArrayBuilder!() correct in every situation?
Why are the performance of the benchmark3 so low (a bit worse than normal array
append)? Can it be improved?


Extra note: while creating it I have found another bug in DMD I didn't know
about:

import std.stdio: writefln;
void main() {
    alias int[2] T;
    writefln(T.init);
}

This program has to print [0,0] instead of 0.


Timings, on a Core Duo   2 GHz, 2 GB RAM:

benchmark 1, N=2_000_000:
  Array append: 0.160 s
  ArrayBuilder: 0.026 s
benchmark 1, N=20_000_000:
  Array append: 1.837 s
  ArrayBuilder: 0.325 s
benchmark 1, N=50_000_000:
  Array append: 4.489 s
  ArrayBuilder: 0.731 s
benchmark 1, N=200_000_000:
  Array append: 32.293 s
  ArrayBuilder: Out of memory

benchmark 2, N=200_000:
  Array append: 0.099 s
  ArrayBuilder: 0.004 s
benchmark 2, N=2_000_000:
  Array append: 6.896 s
  ArrayBuilder: 0.050 s

benchmark 3, N=200_000:
  Array append: 0.090 s
  ArrayBuilder: 0.076 s
benchmark 3, N=2_000_000:
  Array append: 1.014 s
  ArrayBuilder: 0.923 s (why is this so slow?)
benchmark 3, N=6_000_000:
  Array append: 5.295 s
  ArrayBuilder: 4.431 s

benchmark 4, N=500_000:
  Array append: 1.109 s
  ArrayBuilder: 0.414 s
benchmark 4, N=1_000_000:
  Array append: 3.103 s

Bye,
bearophile
Aug 30 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
IsReferenceType!() isn't fully correct because it sees fixed size arrays as
reference types. I have posted a more correct (and cleaned, and
single-template) version here:

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=75813

Bye,
bearophile
Aug 30 2008
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 IsReferenceType!() isn't fully correct because it sees fixed size arrays as

version here:

 Bye,
 bearophile

One small issue: Your IsReference template doesn't work with invariant, and presumably const, types. For example, IsReferenceType!(typeof("Foo"[0])); //Returns true. The easiest way to fix this is to add a Mutable!(): static if (IsType!(Mutable!(Types[0]), bool, byte, ubyte, short, ushort, int, uint, long, ulong, float, double, real, ifloat, idouble, ireal, cfloat, cdouble, creal, char, dchar, wchar) ) Also, although it may be the least bad way to do this (I can't think of a better one), the process of elimination method this relies on seems kind of brittle because it would have to be modified for any new type that is added. If anyone knows of a less kludge-y way, please post.
Aug 30 2008
parent reply Sergey Gromov <snake.scaly gmail.com> writes:
dsimcha <dsimcha yahoo.com> wrote:
 static if (IsType!(Mutable!(Types[0]), bool, byte, ubyte, short, ushort, int,
uint,
                            long, ulong, float, double, real, ifloat, idouble,
                            ireal, cfloat, cdouble, creal, char, dchar, wchar) )
 
 Also, although it may be the least bad way to do this (I can't think of a
better
 one), the process of elimination method this relies on seems kind of brittle
 because it would have to be modified for any new type that is added.  If anyone
 knows of a less kludge-y way, please post.

I may be missing something, but the following seems to work: template Category(T) { static if ( is(T == interface) || is(T == class) || is(T Base : Base*)) { const Category = "reference"; } else const Category = "value"; } template Category(T : T[U], U) { const Category = Category!(T); } You need to extend this for structs and unions but that seems routine... -- SnakE
Aug 31 2008
parent Sergey Gromov <snake.scaly gmail.com> writes:
Sergey Gromov <snake.scaly gmail.com> wrote:
 template Category(T)
 {
 	static if (
 		is(T == interface) ||
 		is(T == class) ||
 		is(T Base : Base*))
 	{
 		const Category = "reference";
 	}
 	else
 		const Category = "value";
 }
 
 template Category(T : T[U], U)
 {
 	const Category = Category!(T);
 }

Fixed arrays: dynamic and associative arrays always contain references, static arrays must be checked recursively: template Category(T) { static if ( is(T == interface) || is(T == class) || is(T Elem : Elem[]) || // dyn array is(T Base : Base*)) { const Category = "reference"; } else const Category = "value"; } template Category(T : T[U], U) { // AAs always contain references const Category = "reference"; } template Category(T : T[U], size_t U) { // Static arrays may or may not contain references const Category = Category!(T); } -- SnakE
Aug 31 2008
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
Nice work on the reference template.  I had thought that the typeid(T).flags()
would be inlined, then constant folded, but CTFE doesn't work on it, so I guess
not.  Probably pretty negligible, but if I have a template laying around to do
this at compile time, I may as well use it.

Also, to answer your question, you do need to call hasNoPointers() each time you
realloc.

import std.stdio, std.gc, std.process;

void main() {
    uint[] foo = (cast(uint[]) malloc(uint.sizeof))[0..1];
    hasNoPointers(foo.ptr);
    foo = cast(uint[]) realloc(foo.ptr, 100_000 * uint.sizeof)[0..100_000];
    //Commenting next line out causes severe mem leaks.
    hasNoPointers(foo.ptr);
    foreach(ref f; foo) {
        f = cast(uint) test();  //Create false pointers.
    }
    system("pause");  //So mem usage can be checked.
}

uint* test() {
    return (new uint[1_000]).ptr;
}

Oh, one more thing:  What's a static struct as opposed to a regular struct? 
I've
never seen it used before.  I had assumed that this means a struct composed only
of static members, but apparently not.
Aug 30 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

Nice work on the reference template.<

It comes from several expansion and compression cycles, has requires few hours to be written ;-)
Also, to answer your question, you do need to call hasNoPointers() each time
you realloc.<

Right, I have just read the D docs that say it. But what's the rationale behind it? If I have a pointer and I want to realloc it, to make its space bigger, I never want to change the fact that the GC scans its pointers or not. Stating it once must suffice.
Oh, one more thing:  What's a static struct as opposed to a regular struct?<

I have taken a look at the documentation, and I have performed some benchmarks, now I think it does nothing, I have removed that attribute. I have flagged that struct fields as private because originally that was a class, but a struct is better and faster, so I have removed those "private" too.
One small issue:  Your IsReference template doesn't work with invariant, and
presumably const, types.<

Because I (as most people here, I presume) use D 1. Sorry for not stating it clearly.
 The easiest way to fix this is to add a Mutable!():
 static if (IsType!(Mutable!(Types[0]), bool, byte, [...]

Thank you, I have added a note into my libs, even if they are for D 1. Hopefully some one else will tell me if the ArrayBuilder is correct (even if this group isn't D.learn). Bye and thank you, bearophile
Aug 30 2008
prev sibling next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
dsimcha wrote:
 Nice work on the reference template.  I had thought that the typeid(T).flags()
 would be inlined, then constant folded, but CTFE doesn't work on it, so I guess
 not.  Probably pretty negligible, but if I have a template laying around to do
 this at compile time, I may as well use it.
 
 Also, to answer your question, you do need to call hasNoPointers() each time
you
 realloc.

Not with Tango. Sean
Aug 30 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-08-31 06:01:26 +0200, Sean Kelly <sean invisibleduck.org> said:

 dsimcha wrote:
 Nice work on the reference template.  I had thought that the typeid(T).flags()
 would be inlined, then constant folded, but CTFE doesn't work on it, so I guess
 not.  Probably pretty negligible, but if I have a template laying around to do
 this at compile time, I may as well use it.
 
 Also, to answer your question, you do need to call hasNoPointers() each 
 time you
 realloc.

Not with Tango. Sean

I would have done it quite differently: keep a linked list (if you want to improve of batches of 10 arrays), with a pointer to the end, append in O(1), and keep the total length. only upon a call to toArray allocate an array of the requested length and fill it. An array (even with tricks) is not a good structure to append to, so it seems a bad idea to me to use it. Fawzi
Aug 31 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Fawzi Mohamed:
 I would have done it quite differently: keep a linked list (if you want 
 to improve of batches of 10 arrays), with a pointer to the end, append 
 in O(1), and keep the total length.
 only upon a call to toArray allocate an array of the requested length 
 and fill it.
 An array (even with tricks) is not a good structure to append to, so it 
 seems a bad idea to me to use it.

Yes, my code was experimental, my main purpose was to ask if it's correct, in its GC management, etc. Algorithmic and code tuning is something I think about later. I'm now implementing ArrayBuilder with a deque-like, to see how it performs in comparison. There are 3 possible things you may want to join to such ArrayBuilder: - Fixed size arrays - Single items - Dynamic arrays For the dynamic arrays can be copied just a reference, for the other is better to copy the data. The single items can be added to short (64 KB) structs with fixed size arrays inside, organized into a linked list. The static arrays can equally be copied in pieces into such blocks. The dynamic arrays instead require short arrays filled with the struct of the dynamic arrays. This makes things messy. If the user keeps alternating single items to dynamic arrays the data structure has to avoid wasting too much memory. So I don't see an easy solution yet, beside my original one or a similar one (that consists in copying all items of dynamic arrays too). Bye and thank you, bearophile
Aug 31 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-08-31 17:37:13 +0200, bearophile <bearophileHUGS lycos.com> said:

 Fawzi Mohamed:
 I would have done it quite differently: keep a linked list (if you want
 to improve of batches of 10 arrays), with a pointer to the end, append
 in O(1), and keep the total length.
 only upon a call to toArray allocate an array of the requested length
 and fill it.
 An array (even with tricks) is not a good structure to append to, so it
 seems a bad idea to me to use it.

Yes, my code was experimental, my main purpose was to ask if it's correct, in its GC management, etc. Algorithmic and code tuning is something I think about later. I'm now implementing ArrayBuilder with a deque-like, to see how it performs in comparison. There are 3 possible things you may want to join to such ArrayBuilder: - Fixed size arrays - Single items - Dynamic arrays For the dynamic arrays can be copied just a reference, for the other is better to copy the data. The single items can be added to short (64 KB) structs with fixed size arrays inside, organized into a linked list. The static arrays can equally be copied in pieces into such blocks. The dynamic arrays instead require short arrays filled with the struct of the dynamic arrays. This makes things messy. If the user keeps alternating single items to dynamic arrays the data structure has to avoid wasting too much memory. So I don't see an easy solution yet, beside my original one or a similar one (that consists in copying all items of dynamic arrays too). Bye and thank you, bearophile

I had not thought about an important thing: the content of the array can change... so copying is safer. Probably using (as you said) a linked list of fixed arrays is the best approach for many small appends, maybe append might have a bool argument "fixed=false" to know if copying can be avoided. Fawzi
Aug 31 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Fawzi Mohamed:
 I had not thought about an important thing: the content of the array 
 can change... so copying is safer.

Yes, and there are other problems, that make the situation/code a mess, so I copy all items now, keeping code simpler. I'm debugging/benchmarking it now... Later, bearophile
Aug 31 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Denis Koroskin:
 Please, test my version, too!
 While it is slighly slower, it allows appending invariant arrays (i.e.  
 strings) without copying data (although it is currently not implemented).

Sorry, I'm not using D 2.x. I (hopefully) will soon show an improved version of my code, so you can test your version. Note: I have benchmarked the version with the deque-like data structure, and it's slower, the code is a bit more complex (and I may have hit a bug in the GC (I'd like to know if Tango behaves at the same way)), so I'll probably just use the first version, with the reallocs. ---------------- With Phobos: import std.stdio: writefln; import std.gc: malloc; void main() { int M = 1 << 16; int N = M / int.sizeof; const int N_POINTERS = 18; int*[N_POINTERS] pointers; foreach (ref ptr; pointers) { ptr = cast(int*)malloc(N * int.sizeof); if (ptr is null) { throw new Error("OUT OF MEM"); return 1; } } int last = 0; foreach (i, ptr; pointers) { int int_ptr = cast(int)ptr; int diff = int_ptr - last; writefln(i, " ", int_ptr, " ", diff, " ", M - diff); last = int_ptr; } } Prints: 0 9052160 9052160 -8986624 1 9117696 65536 0 2 9183232 65536 0 3 9248768 65536 0 4 9314304 65536 0 5 9379840 65536 0 6 9445376 65536 0 7 9510912 65536 0 8 9576448 65536 0 9 9641984 65536 0 10 9707520 65536 0 11 9773056 65536 0 12 9838592 65536 0 13 9904128 65536 0 14 9969664 65536 0 15 10092544 122880 -57344 16 10158080 65536 0 17 10223616 65536 0 Is that result at n = 15 normal/correct? Bye, bearophile
Sep 01 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Content-Type: text/plain

This is the first beta version, it seems to work well enough (despite too much
easy memory overflows), I'll add ddoc comments, better comments, I'll clean the
code, improve unit tests, etc. If you spot problems I'd like to know them.
You will probably find its finished version inside my libs in one or two days.
I presume Walter isn't interested to put this into Phobos, nor to 'fix' the
built-in arrays  (or maybe he's just busy).

Bye,
bearophile
Sep 01 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Well, a finished version of ArrayBuilder is in my libs, in the 'builders'
module:
http://www.fantascienza.net/leonardo/so/libs_d.zip

I have already used it to replace the old string appending used in the putr()
(printing functions) and now printing is about 2.5 times faster (and using it
for that purpose is a way to perform hundreds of unit tests on ArrayBuilder),
so I'm quite pleased.

Bye, and thank you for all the help and answers (but more bugs can be present
still),
bearophile
Sep 01 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
bearophile wrote:
 Well, a finished version of ArrayBuilder is in my libs, in the 'builders'
module:
 http://www.fantascienza.net/leonardo/so/libs_d.zip
 
 I have already used it to replace the old string appending used in the putr()
(printing functions) and now printing is about 2.5 times faster (and using it
for that purpose is a way to perform hundreds of unit tests on ArrayBuilder),
so I'm quite pleased.
 
 Bye, and thank you for all the help and answers (but more bugs can be present
still),
 bearophile

Very cool. Philosophically, though, isn't the whole purpose of having dynamic arrays in D to avoid having to create library implementations like this? I can definitely appreciate how something like this provides proof-of-concept for compiler & stdlib improvements, but in the end, having to rely on an array-builder class seems very non-D-ish. Thoughts? --benji
Sep 01 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Benji Smith:
 Philosophically, though, isn't the whole purpose of having dynamic 
 arrays in D to avoid having to create library implementations like this?

I agree, but so far I am not able to modify/improve the language. And an ArrayBuilder isn't meant to replace dynamic arrays, it's just a way to build them faster, "patching" what I perceive as one of their problems.
 I can definitely appreciate how something like this provides 
 proof-of-concept for compiler & stdlib improvements,

My libs are meant to improve std libs, if/where/when the compiler can't be modified. They are un-standard but they are largish libs anyway, and I'm developing them doing my best :-) Bye, bearophile
Sep 01 2008
parent reply "Dave" <Dave_member pathlink.com> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:g9i1nb$26pd$1 digitalmars.com...
 Benji Smith:
 Philosophically, though, isn't the whole purpose of having dynamic
 arrays in D to avoid having to create library implementations like this?

I agree, but so far I am not able to modify/improve the language. And an ArrayBuilder isn't meant to replace dynamic arrays, it's just a way to build them faster, "patching" what I perceive as one of their problems.

There have been quite a few examples of contributors modifying the D runtime to improve things like array concatenation. If it generally performs better and is demonstrably correct, I'm pretty sure Walter and the Tango crew would incorporate it. In other words, have a go at improving the arraycat code in internal/gc/gc.d if you have the time. It's simple enough to modify and build.
 I can definitely appreciate how something like this provides
 proof-of-concept for compiler & stdlib improvements,

My libs are meant to improve std libs, if/where/when the compiler can't be modified. They are un-standard but they are largish libs anyway, and I'm developing them doing my best :-) Bye, bearophile

Sep 02 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Dave:
 There have been quite a few examples of contributors modifying the D runtime 
 to improve things like array concatenation.

This requires three fields in the dynamic array struct, so maybe it requires changes in several points of the front-end.
If it generally performs better and is demonstrably correct, I'm pretty sure
Walter and the Tango crew would incorporate it.<

It seems to work, it speeds up the code where I have used it (I'm using it more and more in my libs, it has speed up the select() two times, it contains an array append because in general you don't know how many items you filter away from the input iterable) but I am not able to demostrate it correct. Its main fault so far is that it causes MemoryOverflow if you try to add something like 200 million size_t to it.
 In other words, have a go at improving the arraycat code in internal/gc/gc.d 
 if you have the time. It's simple enough to modify and build.

I'll take a look, but I presume it's not easy as you say :-) Bye, bearophile
Sep 03 2008
prev sibling parent BCS <ao pathlink.com> writes:
Reply to Benji,

 bearophile wrote:
 
 Well, a finished version of ArrayBuilder is in my libs, in the
 'builders' module: http://www.fantascienza.net/leonardo/so/libs_d.zip
 
 I have already used it to replace the old string appending used in
 the putr() (printing functions) and now printing is about 2.5 times
 faster (and using it for that purpose is a way to perform hundreds of
 unit tests on ArrayBuilder), so I'm quite pleased.
 
 Bye, and thank you for all the help and answers (but more bugs can be
 present still), bearophile
 

Philosophically, though, isn't the whole purpose of having dynamic arrays in D to avoid having to create library implementations like this? I can definitely appreciate how something like this provides proof-of-concept for compiler & stdlib improvements, but in the end, having to rely on an array-builder class seems very non-D-ish. Thoughts? --benji

I think that in many cases the information needed for the compiler to produce the needed optimizations is to hard to encode in D like languages. D can handle the individual cats but when to do the rest is a problem that can span to much code to expect the compiler to solve it. A human with knowledge of what is supposed to happen can use libs to improve the situation. OTOH the best solution would be to describe programs in a way that lets a computer do that work. We can dream.
Sep 01 2008
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Sun, 31 Aug 2008 04:58:10 +0400, dsimcha <dsimcha yahoo.com> wrote:

 Nice work on the reference template.  I had thought that the  
 typeid(T).flags()
 would be inlined, then constant folded, but CTFE doesn't work on it, so  
 I guess
 not.  Probably pretty negligible, but if I have a template laying around  
 to do
 this at compile time, I may as well use it.

 Also, to answer your question, you do need to call hasNoPointers() each  
 time you
 realloc.

 import std.stdio, std.gc, std.process;

 void main() {
     uint[] foo = (cast(uint[]) malloc(uint.sizeof))[0..1];
     hasNoPointers(foo.ptr);
     foo = cast(uint[]) realloc(foo.ptr, 100_000 *  
 uint.sizeof)[0..100_000];
     //Commenting next line out causes severe mem leaks.
     hasNoPointers(foo.ptr);
     foreach(ref f; foo) {
         f = cast(uint) test();  //Create false pointers.
     }
     system("pause");  //So mem usage can be checked.
 }

 uint* test() {
     return (new uint[1_000]).ptr;
 }

 Oh, one more thing:  What's a static struct as opposed to a regular  
 struct?  I've
 never seen it used before.  I had assumed that this means a struct  
 composed only
 of static members, but apparently not.

Static doesn't apply to a struct, only to inner classes, IIRC.
Aug 31 2008
prev sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Mon, 01 Sep 2008 01:11:46 +0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 Fawzi Mohamed:
 I had not thought about an important thing: the content of the array
 can change... so copying is safer.

Yes, and there are other problems, that make the situation/code a mess, so I copy all items now, keeping code simpler. I'm debugging/benchmarking it now... Later, bearophile

Please, test my version, too! While it is slighly slower, it allows appending invariant arrays (i.e. strings) without copying data (although it is currently not implemented). An idea is to store an array (or a list) of invariant elements. Current buffer is mutable until it is completely full. It then casted to invariant and appended to the list and an new mutable buffer (of a higher size) created. struct ArrayBuilder(T) { //private BearophileArrayBuilder!(invariant(T[])) subArrays; private invariant(T[])[] subArrays; private int lengthSoFar; private T[] tmpArray; private int curOffset = 0; void opCatAssign(T x) { int capacity = tmpArray.length; if (curOffset >= capacity) { int old_capacity = capacity; if (capacity == 0) capacity = 4; // starting point not too much small else if (capacity < (1 << 26)) // 67_108_864 capacity *= 2; // for amortized O(1) and less RAM fragmentation else capacity *= 1.3; // slower growing rate to save RAM if (old_capacity != 0) { subArrays ~= cast(invariant(T[]))tmpArray; } tmpArray = new T[capacity]; curOffset = 0; } tmpArray[curOffset] = x; assert(tmpArray[curOffset] == x); ++curOffset; ++lengthSoFar; } /// T[] toarray() { T[] array = new T[lengthSoFar]; int offset = 0; foreach (subArray; subArrays) { int newOffset = offset+subArray.length; array[offset..newOffset] = subArray[]; offset = newOffset; } array[offset..$] = tmpArray[0..curOffset]; return array; } }
Sep 01 2008
prev sibling parent reply Christian Kamm <kamm-incasoftware removethis.de> writes:
 Extra note: while creating it I have found another bug in DMD I didn't
 know about:
 
 import std.stdio: writefln;
 void main() {
     alias int[2] T;
     writefln(T.init);
 }
 
 This program has to print [0,0] instead of 0.

Actually, I don't think the D spec requires typeof(T.init) == T. And T is a valid initializer for T[N]: int[3] arr = 5; // sets all elements to 5 Christian
Aug 31 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Christian Kamm:
 Actually, I don't think the D spec requires typeof(T.init) == T.

Maybe the D specs have to be fixed then.
 And T is a valid initializer for T[N]:

Right, but the following program: import std.stdio: writefln; void main() { alias int[2] T; writefln(T.init); int[2] T_init; writefln(T_init); } prints: 0 [0,0] So if I have a dynamic array of static arrays (where T is the static array) I can't do this: array[0 .. $][] = T.init; I have to do: T T_init; array[0 .. $][] = T_init; While if T is a dynamic array I can do: array[0 .. $] = T.init; So even if they aren't bugs I don't like all such differences much... Bye, bearophile
Aug 31 2008
parent reply downs <default_357-line yahoo.de> writes:
bearophile wrote:
 Christian Kamm:
 Actually, I don't think the D spec requires typeof(T.init) == T.

Maybe the D specs have to be fixed then.

This should do what you want: template Init(T) { const T Init; }
Sep 01 2008
parent reply Nick B <nick.barbalich gmail.com> writes:
Hi

I came across this the other day, and no one has mentioned this, on this 
news group before, I thought I would bring the subject to the 
communities attention, so to speak.

Google has very recently, open sourced "Protocol Buffers".

What is it you ask ?  In a couple of lines it is a language-neutral, 
platform-neutral, extensible way of serializing structured data for use 
in communications protocols, data storage, and more.

Think XML, but smaller, faster, and simpler.

Why not just use XML ?

Protocol buffers have many advantages over XML for serializing 
structured data.  Protocol buffers:
* are simpler
* are 3 to 10 times smaller
* are 20 to 100 times faster
* are less ambiguous
* generate data access classes that are easier to use programmatically.

PB supports Java, Python,and C++ currently.

A more detailed overview can be found here:
http://code.google.com/apis/protocolbuffers/docs/overview.html

and the FAQ here:
http://code.google.com/apis/protocolbuffers/docs/faq.html

See the answer to the question "Can I add support for a new language to 
protocol buffers?" inside the FAQ.

Some Tips and comments can be found here:
http://zunger.livejournal.com/164024.html


My questions.
Does the D community see this of interest ?
Is this something they might use ?
Do they see value in it being added to the respective
Tango or Phobos frameworks ?

any other comments ?

cheers

Nick B.
Sep 01 2008
next sibling parent Lars Ivar Igesund <larsivar igesund.net> writes:
Nick B wrote:

 Hi
 
 I came across this the other day, and no one has mentioned this, on this
 news group before, I thought I would bring the subject to the
 communities attention, so to speak.

That is well and fine, but you shouldn't hijack a thread for it, rather start a new one.
 Google has very recently, open sourced "Protocol Buffers".
 
 What is it you ask ?  In a couple of lines it is a language-neutral,
 platform-neutral, extensible way of serializing structured data for use
 in communications protocols, data storage, and more.
 
 Think XML, but smaller, faster, and simpler.
 
 Why not just use XML ?
 
 Protocol buffers have many advantages over XML for serializing
 structured data.  Protocol buffers:
 * are simpler
 * are 3 to 10 times smaller
 * are 20 to 100 times faster

Not necessarily a valid point if you use the right parser, say, one from Tango.
 * are less ambiguous
 * generate data access classes that are easier to use programmatically.
 
 PB supports Java, Python,and C++ currently.
 
 A more detailed overview can be found here:
 http://code.google.com/apis/protocolbuffers/docs/overview.html
 
 and the FAQ here:
 http://code.google.com/apis/protocolbuffers/docs/faq.html
 
 See the answer to the question "Can I add support for a new language to
 protocol buffers?" inside the FAQ.
 
 Some Tips and comments can be found here:
 http://zunger.livejournal.com/164024.html
 
 
 My questions.
 Does the D community see this of interest ?
 Is this something they might use ?
 Do they see value in it being added to the respective
 Tango or Phobos frameworks ?

If the protocol is widely used, then it would certainly be beneficial to support it, and I would assume that it is fairly easy to implement on top of Tango. -- Lars Ivar Igesund blog at http://larsivi.net DSource, #d.tango & #D: larsivi Dancing the Tango
Sep 01 2008
prev sibling parent reply Brian Price <blprice61 yahoo.com> writes:
== Quote from Nick B (nick.barbalich gmail.com)'s article
 Hi
 I came across this the other day, and no one has mentioned this, on this
 news group before, I thought I would bring the subject to the
 communities attention, so to speak.
 Google has very recently, open sourced "Protocol Buffers".
 What is it you ask ?  In a couple of lines it is a language-neutral,
 platform-neutral, extensible way of serializing structured data for use
 in communications protocols, data storage, and more.
 Think XML, but smaller, faster, and simpler.
 Why not just use XML ?
 Protocol buffers have many advantages over XML for serializing
 structured data.  Protocol buffers:
 * are simpler
 * are 3 to 10 times smaller
 * are 20 to 100 times faster
 * are less ambiguous
 * generate data access classes that are easier to use programmatically.
 PB supports Java, Python,and C++ currently.
 A more detailed overview can be found here:
 http://code.google.com/apis/protocolbuffers/docs/overview.html
 and the FAQ here:
 http://code.google.com/apis/protocolbuffers/docs/faq.html
 See the answer to the question "Can I add support for a new language to
 protocol buffers?" inside the FAQ.
 Some Tips and comments can be found here:
 http://zunger.livejournal.com/164024.html
 My questions.
 Does the D community see this of interest ?
 Is this something they might use ?
 Do they see value in it being added to the respective
 Tango or Phobos frameworks ?
 any other comments ?
 cheers
 Nick B.

Hate to say it but this is yet another case of reinventing the wheel. The worst thing about this throwback to the early 90s is its inherent violation of DRY. This package intermingles and confuses three separate issues, treating them as a monolithic whole, namely: serialization, marshalling, and versioning. Binary solutions such as this, while more efficient byte-wise, run into portability problems especially with floating point values. They also lose the human readability of the data (sans the use of special tools). Often the use of a binary solution is a case of premature optimization and indicates bad design at a higher level. The marshalling strategy should be 'pluggable' so that one can use an easier to debug, human readable, data format during development. Versioning problems can be (and have been) addressed in a number of ways over the years, the simplest and imo often the best, is to sidestep the problem by serializing a map of the data rather than just the raw data. That is, each value has associated metadata indicating its field name (a key-value mapping iow). If XML is too heavy a hammer, JSON (with or without embedded metadata) is a good alternative with quite a bit of industry support. In fact, the use of JSON encoding with embedded metadata gives you a lightweight solution that works with nearly every language in present use. Even better, any language with good reflection support (such as Java) allows an implementation that does not violate DRY. This "Protocol Buffers" solution is a dinosaur, a throwback, yet another wheel when we need ground effects or anti-gravity. My 2c, Brian
Sep 01 2008
next sibling parent reply Leandro Lucarella <llucax gmail.com> writes:
Brian Price, el  1 de septiembre a las 14:31 me escribiste:
 Hate to say it but this is yet another case of reinventing the wheel.  The
worst
 thing about this throwback to the early 90s is its inherent violation of DRY.
 This package intermingles and confuses three separate issues, treating them as
a
 monolithic whole, namely: serialization, marshalling, and versioning.

Different problems has differents solutions. What you say is really nice and have sense when *latency* and other performance issues are not a problem to you (it's obviously a problem for Google and for a lot of other people, like me, I was doing a little "framework", similar to Google's PB at work, just before they release it =). When you deal with (almost) realtime systems, you can't pay the price for human readability, extra parsing time, extra abstraction and extra bytes thrown throgh the wire. You just can't. You don't even care about portability (this kind of things usually run in a very controlled environment). I just wanted to clarify this. -- Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/ ---------------------------------------------------------------------------- GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05) ---------------------------------------------------------------------------- PADRES DENUNCIAN QUE SU HIJA SE ESCAPO CON UN PARAGUAYITO -- Crónica TV
Sep 01 2008
parent Brian Price <blprice61 yahoo.com> writes:
== Quote from Leandro Lucarella (llucax gmail.com)'s article
 Brian Price, el  1 de septiembre a las 14:31 me escribiste:
 Hate to say it but this is yet another case of reinventing the wheel.  The
worst
 thing about this throwback to the early 90s is its inherent violation of DRY.
 This package intermingles and confuses three separate issues, treating them as
a
 monolithic whole, namely: serialization, marshalling, and versioning.

and have sense when *latency* and other performance issues are not a problem to you (it's obviously a problem for Google and for a lot of other people, like me, I was doing a little "framework", similar to Google's PB at work, just before they release it =). When you deal with (almost) realtime systems, you can't pay the price for human readability, extra parsing time, extra abstraction and extra bytes thrown throgh the wire. You just can't. You don't even care about portability (this kind of things usually run in a very controlled environment). I just wanted to clarify this.

In a (near) realtime embedded system with a high wire cost (cellphone, smart meters, etc) I'd have to agree with you. However, my initial impression was that this was being proposed/offered as a general purpose object streaming system. Multicore/hyperthreaded processors with multiple levels of caching have a huge latency (comparitively) between core and wire. Even in a high demand system there's generally plenty of cpu cycles left over to do quite a bit of extra parsing and data massaging (compression). A compressed text message (XML or JSON format) is very competitive in size with binary format (depending of course on the actual makeup of the data payload). I expect that in a few years the embedded systems will be sporting processors with similar capabilities to today's desktops and servers. At that time, it may be worthwhile re-evaluating the decision to go with a 'brittle' binary format in new designs. On a personal note, some of my friends are laughing at me over this - ten or twelve years ago I was vehemently against the idea of using human readable data formats for transmission - for the same reasons you give above. Of course my cellphone today probably has nearly the power of my desktop back then. Sorry for any misunderstanding, Brian
Sep 01 2008
prev sibling parent Mitja Slenc <mslenc gmail.com> writes:
Brian Price wrote:
 == Quote from Nick B (nick.barbalich gmail.com)'s article
 Hi
 I came across this the other day, and no one has mentioned this, on this
 news group before, I thought I would bring the subject to the
 communities attention, so to speak.
 Google has very recently, open sourced "Protocol Buffers".
 What is it you ask ?  In a couple of lines it is a language-neutral,
 platform-neutral, extensible way of serializing structured data for use
 in communications protocols, data storage, and more.
 Think XML, but smaller, faster, and simpler.
 Why not just use XML ?
 Protocol buffers have many advantages over XML for serializing
 structured data.  Protocol buffers:
 * are simpler
 * are 3 to 10 times smaller
 * are 20 to 100 times faster
 * are less ambiguous
 * generate data access classes that are easier to use programmatically.
 PB supports Java, Python,and C++ currently.
 A more detailed overview can be found here:
 http://code.google.com/apis/protocolbuffers/docs/overview.html
 and the FAQ here:
 http://code.google.com/apis/protocolbuffers/docs/faq.html
 See the answer to the question "Can I add support for a new language to
 protocol buffers?" inside the FAQ.
 Some Tips and comments can be found here:
 http://zunger.livejournal.com/164024.html
 My questions.
 Does the D community see this of interest ?
 Is this something they might use ?
 Do they see value in it being added to the respective
 Tango or Phobos frameworks ?
 any other comments ?
 cheers
 Nick B.

Hate to say it but this is yet another case of reinventing the wheel. The worst thing about this throwback to the early 90s is its inherent violation of DRY. This package intermingles and confuses three separate issues, treating them as a monolithic whole, namely: serialization, marshalling, and versioning.

Can you explain where you see violation of DRY exactly? Also, can you explain the big difference between serialization and marshalling? I find it hard to draw any meaningful line between the two..
 Binary solutions such as this, while more efficient byte-wise, run into
 portability problems especially with floating point values.

So, what sort of problems do you see with this specific solution? To me, it looks like a text based solution would have more problems with floating point values, because they typically get converted to/from decimal..
 They also lose the
 human readability of the data (sans the use of special tools).  Often the use
of a
 binary solution is a case of premature optimization and indicates bad design
at a
 higher level.  The marshalling strategy should be 'pluggable' so that one can
use
 an easier to debug, human readable, data format during development.

Well, in my experience, most data tends to live somewhere not readily accessible and/or human readable, so you need a tool one way or the other. Making generic tools for PBs shouldn't be too hard, as far as I can tell they do have support for generic message handling, 'reflection', there's a DebugString() method on each of them, etc..
 Versioning problems can be (and have been) addressed in a number of ways over
the
 years, the simplest and imo often the best, is to sidestep the problem by
 serializing a map of the data rather than just the raw data.  That is, each
value
 has associated metadata indicating its field name (a key-value mapping iow).

And the PB encoding differs from this how exactly? Each value has a key preceding it, which identifies which field it is..
 If XML is too heavy a hammer, JSON (with or without embedded metadata) is a
good
 alternative with quite a bit of industry support.  In fact, the use of JSON
 encoding with embedded metadata gives you a lightweight solution that works
with
 nearly every language in present use.  Even better, any language with good
 reflection support (such as Java) allows an implementation that does not
violate DRY.

I haven't tried it, but it looks like generic mapping to/from JSON would be very easily done on top of PBs, for when you need it (like when publishing a JavaScript API). But why would you want to use JSON in any other case? Compared to PBs, it's harder/slower to parse, definitely takes more space/bandwidth, has fewer types, and you don't get the nice free API protocol buffers provide you for each message type.
 This "Protocol Buffers" solution is a dinosaur, a throwback, yet another wheel
 when we need ground effects or anti-gravity.

Well, you haven't convinced me yet :) Mitja
Sep 02 2008