
digitalmars.D - Message passing between threads: Java 4 times faster than D

reply Nicolae Mihalache <xpromache gmail.com> writes:

Hello,

I'm a complete newbie in D and trying to compare with Java. I
implemented  a simple test for measuring the throughput in message
passing between threads. I see that Java can pass about 4mil
messages/sec while D only achieves 1mil/sec. I thought that D should
be faster.

The messages are simply integers (which are converted to Integer in Java).

The two programs are attached. I tried compiling the D version with
both dmd and gdc and various optimization flags.

mache

[attachment: mp.d]

import std.concurrency, std.stdio;
import std.datetime;

const n=100000000;
void main() {
    auto tid=spawn(&receiver);
    setMaxMailboxSize(tid, 1000, OnCrowding.block);
    tid.send(thisTid);
    foreach(i; 0..n) {
       tid.send(i);
    }
    writeln("finished sending");
    auto s=receiveOnly!(string)();
    writeln("received ", s);
}

void receiver() {
   auto mainTid=receiveOnly!(Tid)();
   StopWatch sw;
   sw.start();
   long s;
   for(auto i=0;i<n;i++) {
      auto msg = receiveOnly!(int)();
      s+=msg;
      //writeln("received ", msg);
   }
   sw.stop();
   writeln("finished receiving");
   writefln("received %d messages in %d msec sum=%d speed=%d msg/sec", n, sw.peek().msecs, s, n*1000L/sw.peek().msecs);
   mainTid.send("finished");
}
[attachment: ThroughputMpTest.java]

package inutil.local;

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ThroughputMpTest {
    static long n=10000000;

    static BlockingQueue<Integer> queue=new ArrayBlockingQueue<Integer>(1000);
    static class Consumer implements Runnable {
        @Override
        public void run() {
            long s=0;
            try {
                long t0=System.currentTimeMillis();
                for(int i=0;i<n;i++) {
                    int x=queue.take();
                    s+=x;
                }
                long t1=System.currentTimeMillis();
                double d=t1-t0;
                System.out.println(n+" messages received in "+d+" ms, sum="+s+" speed: "+1000*d/n+" microsec/message, "+1000*n/d+" messages/sec");
            } catch (Exception e){
                e.printStackTrace();
            }
        }
    }

    static class Producer implements Runnable {
        @Override
        public void run() {
            try {
                for(int i=0;i<n;i++) {
                    queue.put(i);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t=new Thread(new Consumer());
        t.start();
        (new Thread(new Producer())).start();
        t.join();
    }
}
Feb 09 2012
next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
On 09.02.2012 at 10:06, Nicolae Mihalache <xpromache gmail.com> wrote:

 Hello,

 I'm a complete newbie in D and trying to compare with Java. I
 implemented  a simple test for measuring the throughput in message
 passing between threads. I see that Java can pass about 4mil
 messages/sec while D only achieves 1mil/sec. I thought that D should
 be faster.

 The messages are simply integers (which are converted to Integer in  
 Java).

 The two programs are attached. I tried compiling the D version with
 both dmd and gdc and various optimization flags.

 mache

I cannot give you an explanation, just want to say that a message in std.concurrency also uses a wrapper (a 'Variant') plus a type field (standard, priority, linkDead). So you effectively get no optimization for int; the situation is the same as in Java. The second thing I notice is that std.concurrency uses a doubly-linked list implementation, while the Java version uses an array-backed queue, which needs no additional node allocations.
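As a concrete sketch of that overhead (a hypothetical layout for illustration, not the actual std.concurrency source), every int sent ends up looking roughly like this:

import std.variant;

// Illustrative only: the payload is boxed in a Variant next to a type tag,
// and each message occupies its own heap-allocated list node.
enum MsgType { standard, priority, linkDead }

struct Message
{
    MsgType type;   // the type field mentioned above
    Variant data;   // even a plain int travels boxed in a Variant
}

struct Node
{
    Message msg;
    Node* next;     // one node allocation per message in a linked-list mailbox
}

void main()
{
    auto n = new Node(Message(MsgType.standard, Variant(42)), null);
    assert(n.msg.data.get!int == 42);
}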
Feb 09 2012
prev sibling next sibling parent reply "Alex_Dovhal" <alex_dovhal yahoo.com> writes:
"Nicolae Mihalache" <xpromache gmail.com> wrote:
 Hello,

 I'm a complete newbie in D and trying to compare with Java. I
 implemented  a simple test for measuring the throughput in message
 passing between threads. I see that Java can pass about 4mil
 messages/sec while D only achieves 1mil/sec. I thought that D should
 be faster.

 The messages are simply integers (which are converted to Integer in Java).

 The two programs are attached. I tried compiling the D version with
 both dmd and gdc and various optimization flags.

 mache

Hi, I downloaded your two programs. I didn't run them, but noticed that in 'mp.d' you have n set to 100_000_000, while in 'ThroughputMpTest.java' n is set to 10_000_000, so with this the D code is 10/4 = 2.5 times faster :)
Feb 09 2012
next sibling parent reply "Alex_Dovhal" <alex_dovhal yahoo.com> writes:
Sorry, my mistake. It's strange to have different 'n', but you measure speed 
as 1000*n/time, so it doesn't matter if n is 10 times bigger. 
Feb 09 2012
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
 Generally, D's message passing is implemented in quite easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge overhead. I'd
 rather have a templated message passing system with type-safe message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This will be
 way faster, then Variant.

cc Sean Kelly

I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags).

We must make the message passing subsystem not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble believing Java's allocator is this much faster than D's).

Andrei
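One minimal sketch of that allocation-free direction, assuming a single message type and illustrative names (this is not std.concurrency's actual API): all storage is preallocated, so the steady state never touches the allocator.

import core.sync.mutex;

// A fixed-capacity ring-buffer mailbox for ints; nothing is allocated
// per message once the mailbox exists.
struct IntMailbox
{
    int[1024] buf;          // preallocated slots (one plain array per type)
    size_t head, count;
    Mutex lock;

    bool tryPut(int v)
    {
        synchronized (lock)
        {
            if (count == buf.length) return false;   // full: caller blocks or retries
            buf[(head + count) % buf.length] = v;
            ++count;
            return true;
        }
    }

    bool tryGet(out int v)
    {
        synchronized (lock)
        {
            if (count == 0) return false;             // empty
            v = buf[head];
            head = (head + 1) % buf.length;
            --count;
            return true;
        }
    }
}

void main()
{
    auto mbox = new IntMailbox;
    mbox.lock = new Mutex;
    assert(mbox.tryPut(42));
    int v;
    assert(mbox.tryGet(v) && v == 42);
}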
Feb 09 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/9/12 10:31 AM, Marco Leise wrote:
 On 09.02.2012 at 18:35, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
 Generally, D's message passing is implemented in quite easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge overhead. I'd
 rather have a templated message passing system with type-safe message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This will be
 way faster, then Variant.

cc Sean Kelly

I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags).

We must make the message passing subsystem not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble believing Java's allocator is this much faster than D's).

Andrei

Well, what does +1 Variant and +1 LinkedListNode sum up to?

Sorry, I don't understand... Andrei
Feb 09 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/9/12 11:49 AM, Marco Leise wrote:
 On 09.02.2012 at 20:35, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 If we're doing one allocation per
 message passed, that might explain the 4x performance difference (I
 have no trouble figuring Java's allocator is this much faster than
 D's).


 Andrei

Well, what does +1 Variant and +1 LinkedListNode sum up to?

Sorry, I don't understand... Andrei

There are at least 2 allocations, one for the Variant and one for the new node in the linked list aka message box. But from what you wrote it sounds like a Variant doesn't allocate unless the contained data exceeds some internal storage. Sean found another possible allocation in the other branch of this discussion.

I understand. The good news is, this looks like low-hanging fruit! I'll keep an eye on pull requests in druntime. Thanks to fellow Romanian Nicolae Mihalache for contributing the comparison. Andrei
Feb 09 2012
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02/09/2012 08:27 PM, Sean Kelly wrote:
 On Feb 9, 2012, at 10:14 AM, Marco Leise wrote:

 On 09.02.2012 at 17:22, dsimcha <dsimcha yahoo.com> wrote:

 I wonder how much it helps to just optimize the GC a little.  How much does
the performance gap close when you use DMD 2.058 beta instead of 2.057?  This
upcoming release has several new garbage collector optimizations.  If the GC is
the bottleneck, then it's not surprising that anything that relies heavily on
it is slow because D's GC is still fairly naive.

I did some OProfile-ing. The full report is attached, but for simplicity it is without call graph this time. Here is an excerpt:

CPU: Core 2, speed 2001 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        linenr info    symbol name
13838    18.8416  gcx.d:426      void* gc.gcx.GC.malloc(ulong, uint, ulong*)
4465      6.0795  gcx.d:2454     ulong gc.gcx.Gcx.fullcollect(void*)

One random thing that just occurred to me… if the standard receive pattern is:

receive((int x) { … });

There's a good chance that a stack frame is being dynamically allocated for the delegate when it's passed to receive (since I don't believe there's any way to declare the parameters to receive as "scope"). I'll have to check this, and maybe consider changing receive to use alias template parameters instead of normal function parameters?

You can mark an entire tuple as scope without trouble:

void foo(T, S...)(T arg1, scope S args) { ... }

Does this improve the run time?
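A minimal sketch of the effect under discussion, with made-up function names (receive's real declaration is in std.concurrency): a delegate that captures locals may have its context heap-allocated when passed as a normal parameter, while a scope parameter lets the compiler leave the closure on the stack.

// Illustrative only; not the actual receive() declaration.
void callNoScope(void delegate(int) dg) { dg(1); }        // context may be GC-allocated
void callScoped(scope void delegate(int) dg) { dg(1); }   // closure may stay on the stack

void main()
{
    int sum;
    callScoped((int x) { sum += x; });    // captures 'sum'; scope avoids the heap
    callNoScope((int x) { sum += x; });   // same capture; may allocate a context
    assert(sum == 2);
}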
Feb 09 2012
prev sibling next sibling parent Graham St Jack <Graham.StJack internode.on.net> writes:

I suggest using a template-generated type that can contain any of the 
messages to be sent over a channel. It is reasonably straightforward to 
generate all the boilerplate code necessary to make this happen. My 
prototype (attached) still needs work to remove linux dependencies and 
tighten it up, but it works ok. Another advantage of this approach 
(well, I see it as an advantage) is that you declare in a single 
location all the messages that can be sent over the channel, and of 
course the messages are type-safe.

The file of interest is concurrency.d.
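A rough sketch of that idea with hypothetical names (Graham's actual prototype is the attached concurrency.d, which differs): the legal message set is declared once, and the channel stores a tagged union over exactly those types, so no open-ended Variant is needed.

import std.meta : staticIndexOf;
import std.variant : Algebraic;

// All messages a channel can carry are listed at the declaration site.
struct Channel(Msgs...)
{
    alias Msg = Algebraic!Msgs;   // tagged union over exactly the declared types
    Msg[] queue;                  // toy storage; a real channel adds locking/blocking

    void put(T)(T m) if (staticIndexOf!(T, Msgs) != -1)
    {
        queue ~= Msg(m);          // anything outside Msgs is a compile-time error
    }
}

struct Ping { int seq; }
struct Stop {}

void main()
{
    Channel!(Ping, Stop) ch;
    ch.put(Ping(1));
    ch.put(Stop());
    // ch.put(3.14);  // would not compile: double is not in the message set
}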

On 10/02/12 02:14, Sean Kelly wrote:
 So a queue per message type?  How would ordering be preserved? Also, how would
this work for interprocess messaging?  An array-based queue is an option
however (though it would mean memmoves on receive), as are free-lists for
nodes, etc.  I guess the easiest thing there would be a lock-free shared slist
for the node free-list, though I couldn't weigh the chance of cache misses from
using old memory blocks vs. just expecting the allocator to be fast.

 On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan<gor.f.gyolchanyan gmail.com> 
wrote:

 Generally, D's message passing is implemented in quite easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge overhead. I'd
 rather have a templated message passing system with type-safe message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This will be
 way faster, then Variant.

 On Thu, Feb 9, 2012 at 3:12 PM, Alex_Dovhal<alex_dovhal yahoo.com>  wrote:
 Sorry, my mistake. It's strange to have different 'n', but you measure speed
 as 1000*n/time, so it doesn't matter if n is 10 times bigger.

-- Bye, Gor Gyolchanyan.


-- Graham St Jack
Feb 09 2012
prev sibling next sibling parent David Nadlinger <see klickverbot.at> writes:
On 2/9/12 11:17 PM, Sean Kelly wrote:
 On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote:
 I didn't yet got around to polish my lock-free SList/DList implementations,
 but mutexes should only become a problem with high contention when you need to
block.
 You'd also would need some kind of blocking for lock-free lists.

No blocking should be necessary for the lock-free list. Just try to steal a node with a CAS. If the result was null (i.e. if the list ended up being empty), allocate a node via malloc/GC.

And the neat thing is that you don't have to worry about node deletion as much when you have a GC… David
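A compact sketch of that pop path using core.atomic (illustrative types, not the std.concurrency internals): steal the head node with a CAS, and fall back to allocation only when the list is empty.

import core.atomic;

struct Node { int value; Node* next; }

struct FreeList
{
    private shared(Node)* head;

    // Pop ("steal") a node; allocate when the list is empty.
    Node* acquire() shared
    {
        for (;;)
        {
            auto h = atomicLoad(head);
            if (h is null)
                return new Node;        // empty: fall back to the GC allocator
            auto next = atomicLoad(h.next);
            if (cas(&head, h, next))
                return cast(Node*) h;   // we won the race for this node
        }
    }
}

shared FreeList nodePool;

void main()
{
    auto n = nodePool.acquire();
    n.value = 42;
}

(As comes up later in the thread, a bare CAS pop like this is exposed to the ABA problem; a complete implementation needs a versioned head or similar.)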
Feb 09 2012
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-02-10 14:54, Oliver Plow wrote:
 I wonder how much it helps to just optimize the GC a little.  How much
 does the performance gap close when you use DMD 2.058 beta instead of
 2.057?  This upcoming release has several new garbage collector
 optimizations.  If the GC is the bottleneck, then it's not surprising


Is there a way to "turn off" the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is likely not to set in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches But couldn't find anything helpful. Then you could measure the thing with GC "turned off" to see whether the GC is the problem or not. -- Oliver

There's a stub implementation of the GC which uses malloc, IIRC. Try using that one at link time. https://github.com/D-Programming-Language/druntime/blob/master/src/gcstub/gc.d -- /Jacob Carlborg
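For a quick check without relinking, core.memory also lets you suppress collections around the benchmark (this only delays the GC, as Artur notes below; it does not remove allocation costs):

import core.memory;

void main()
{
    GC.disable();              // collections are suppressed (may still run if memory is exhausted)
    scope (exit) GC.enable();

    // ... run the message-passing benchmark here ...
}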
Feb 10 2012
prev sibling parent deadalnix <deadalnix gmail.com> writes:
On 09/02/2012 at 20:57, Martin Nowak wrote:
 On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly <sean invisibleduck.org>
 wrote:

 So a queue per message type? How would ordering be preserved? Also,
 how would this work for interprocess messaging? An array-based queue
 is an option however (though it would mean memmoves on receive), as
 are free-lists for nodes, etc. I guess the easiest thing there would
 be a lock-free shared slist for the node free-list, though I couldn't
 weigh the chance of cache misses from using old memory blocks vs. just
 expecting the allocator to be fast.

I didn't yet got around to polish my lock-free SList/DList implementations, but mutexes should only become a problem with high contention when you need to block. You'd also would need some kind of blocking for lock-free lists.

Doesn't this require shared to be working?
Feb 10 2012
prev sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 02/10/12 14:54, Oliver Plow wrote:
 I wonder how much it helps to just optimize the GC a little.  How much
 does the performance gap close when you use DMD 2.058 beta instead of
 2.057?  This upcoming release has several new garbage collector
 optimizations.  If the GC is the bottleneck, then it's not surprising


Is there a way to "turn off" the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is likely not to set in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches But couldn't find anything helpful. Then you could measure the thing with GC "turned off" to see whether the GC is the problem or not.

Calling GC.disable() at runtime will delay the GC until it's actually needed, but won't disable it completely. Having a std noop GC stub selected by a switch would be nice, but you can get the same effect by giving the linker an object that has the necessary stubs.

For this test case something like the patch below improves things significantly; for more gains, std.concurrency needs more invasive changes. Note it's just a POC, to measure the current std.concurrency efficiency vs other approaches. The "freelist" arrays are not freed and leak; a complete implementation would free them when the link is shut down. Ignore the synchronize() calls - "synchronized" isn't properly lowered by the compiler, so I had to resort to this after switching the locking primitives. It should work with std "synchronized" equally well.

The original testcase from this thread achieves ~4M msg/sec with this change (the numbers aren't stable, but mostly in the 3.5..4.0M range; 4.5M+ happens sometimes). The memory usage also decreases noticeably.

artur

--- std/concurrency.d
+++ std/concurrency.d
@@ -1387,7 +1396,7 @@ private
                 m_last = n;

             Node* todelete = n.next;
             n.next = n.next.next;
-            //delete todelete;
+            delete todelete;
             m_count--;
         }
@@ -1430,6 +1439,56 @@ private
         {
             val = v;
         }
+        import core.memory;
+        import core.exception;
+        new(size_t size) {
+            void* p;
+            if (afreelist.length)
+                p = afreelist[--afreelist.length];
+            else if (gfreelist.length) {
+                {
+                    scope lock = synchronize(fl);
+                    if (gfreelist.length) {
+                        afreelist = cast(Node*[])gfreelist;
+                        gfreelist.length = 0;
+                    }
+                }
+                if (afreelist.length)
+                    p = afreelist[--afreelist.length];
+            }
+
+            if (p)
+                return p;
+
+            p = std.c.stdlib.malloc(size);
+            if (!p)
+                throw new OutOfMemoryError();
+            GC.addRange(p, size);
+            return p;
+        }
+        delete(void* p) {
+            if (!p)
+                return;
+            pfreelist ~= cast(Node*)p;
+            if (pfreelist.length >= 8)
+            {
+                {
+                    scope lock = synchronize(fl);
+                    gfreelist ~= cast(shared Node*[])pfreelist;
+                }
+                pfreelist.length = 0;
+                pfreelist.assumeSafeAppend();
+            }
+            // At some point all free nodes need to be freed, using:
+            //GC.removeRange(p);
+            //std.c.stdlib.free(p);
+        }
+        static Node*[] afreelist;
+        static ubyte[56] d1;
+        static Node*[] pfreelist;
+        static ubyte[56] d2;
+        shared static Node*[] gfreelist;
+        shared static Mutex fl;
 }
Feb 10 2012
prev sibling next sibling parent Nicolae Mihalache <xpromache gmail.com> writes:
That would be funny but it's not true. I tested with different values,
that's why I ended up uploading different versions.

The programs print the computed message rate and take into account
the number of messages.

mache





On Thu, Feb 9, 2012 at 11:57 AM, Alex_Dovhal <alex_dovhal yahoo.com> wrote:
 "Nicolae Mihalache" <xpromache gmail.com> wrote:
 Hello,

 I'm a complete newbie in D and trying to compare with Java. I
 implemented a simple test for measuring the throughput in message
 passing between threads. I see that Java can pass about 4mil
 messages/sec while D only achieves 1mil/sec. I thought that D should
 be faster.

 The messages are simply integers (which are converted to Integer in Java).

 The two programs are attached. I tried compiling the D version with
 both dmd and gdc and various optimization flags.

 mache

 Hi, I downloaded your two programs. I didn't run them, but noticed that in
 'mp.d' you have n set to 100_000_000, while in 'ThroughputMpTest.java' n is set to
 10_000_000, so with this the D code is 10/4 = 2.5 times faster :)

Feb 09 2012
prev sibling next sibling parent Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
Generally, D's message passing is implemented in quite an easy-to-use
way, but it is far from being fast.
I dislike the Variant structure, because it adds a huge overhead. I'd
rather have a templated message passing system with a type-safe message
queue, so no Variant is necessary.
In specific cases messages can be polymorphic objects. This will be
way faster than Variant.

On Thu, Feb 9, 2012 at 3:12 PM, Alex_Dovhal <alex_dovhal yahoo.com> wrote:
 Sorry, my mistake. It's strange to have different 'n', but you measure speed
 as 1000*n/time, so it doesn't matter if n is 10 times bigger.

-- Bye, Gor Gyolchanyan.
Feb 09 2012
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
So a queue per message type?  How would ordering be preserved? Also, how would this work for interprocess messaging?  An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc.  I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.

On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> wrote:

 Generally, D's message passing is implemented in quite easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge overhead. I'd
 rather have a templated message passing system with type-safe message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This will be
 way faster, then Variant.

 On Thu, Feb 9, 2012 at 3:12 PM, Alex_Dovhal <alex_dovhal yahoo.com> wrote:

 Sorry, my mistake. It's strange to have different 'n', but you measure speed
 as 1000*n/time, so it doesn't matter if n is 10 times bigger.

 --
 Bye,
 Gor Gyolchanyan.

Feb 09 2012
prev sibling next sibling parent "dsimcha" <dsimcha yahoo.com> writes:
I wonder how much it helps to just optimize the GC a little.  How 
much does the performance gap close when you use DMD 2.058 beta 
instead of 2.057?  This upcoming release has several new garbage 
collector optimizations.  If the GC is the bottleneck, then it's 
not surprising that anything that relies heavily on it is slow 
because D's GC is still fairly naive.

On Thursday, 9 February 2012 at 15:44:59 UTC, Sean Kelly wrote:
 So a queue per message type?  How would ordering be preserved? 
 Also, how would this work for interprocess messaging?  An 
 array-based queue is an option however (though it would mean 
 memmoves on receive), as are free-lists for nodes, etc.  I 
 guess the easiest thing there would be a lock-free shared slist 
 for the node free-list, though I couldn't weigh the chance of 
 cache misses from using old memory blocks vs. just expecting 
 the allocator to be fast.

 On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan 
 <gor.f.gyolchanyan gmail.com> wrote:

 Generally, D's message passing is implemented in quite 
 easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge 
 overhead. I'd
 rather have a templated message passing system with type-safe 
 message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This 
 will be
 way faster, then Variant.
 
 On Thu, Feb 9, 2012 at 3:12 PM, Alex Dovhal <alex 
 dovhal yahoo.com> wrote:
 Sorry, my mistake. It's strange to have different 'n', but 
 you measure speed
 as 1000*n/time, so it's doesn't matter if n is 10 times 
 bigger.
 
 

-- Bye, Gor Gyolchanyan.


Feb 09 2012
prev sibling next sibling parent Brad Anderson <eco gnuk.net> writes:

On Thu, Feb 9, 2012 at 9:22 AM, dsimcha <dsimcha yahoo.com> wrote:

 I wonder how much it helps to just optimize the GC a little.  How much
 does the performance gap close when you use DMD 2.058 beta instead of
 2.057?  This upcoming release has several new garbage collector
 optimizations.  If the GC is the bottleneck, then it's not surprising that
 anything that relies heavily on it is slow because D's GC is still fairly
 naive.


 On Thursday, 9 February 2012 at 15:44:59 UTC, Sean Kelly wrote:

 So a queue per message type?  How would ordering be preserved? Also, how
 would this work for interprocess messaging?  An array-based queue is an
 option however (though it would mean memmoves on receive), as are
 free-lists for nodes, etc.  I guess the easiest thing there would be a
 lock-free shared slist for the node free-list, though I couldn't weigh the
 chance of cache misses from using old memory blocks vs. just expecting the
 allocator to be fast.

 On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan <gor.f.gyolchanyan gmail.com>
 wrote:

  Generally, D's message passing is implemented in quite easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge overhead. I'd
 rather have a templated message passing system with type-safe message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This will be
 way faster, then Variant.

 On Thu, Feb 9, 2012 at 3:12 PM, Alex Dovhal <alex dovhal yahoo.com>
 wrote:

 Sorry, my mistake. It's strange to have different 'n', but you measure
 speed
 as 1000*n/time, so it's doesn't matter if n is 10 times bigger.

-- Bye, Gor Gyolchanyan.



dmd 2.057:
received 100000000 messages in 192034 msec sum=4999999950000000 speed=520741 msg/sec
received 100000000 messages in 84118 msec sum=4999999950000000 speed=1188806 msg/sec
received 100000000 messages in 88274 msec sum=4999999950000000 speed=1132836 msg/sec

dmd 2.058 beta:
received 100000000 messages in 93539 msec sum=4999999950000000 speed=1069072 msg/sec
received 100000000 messages in 96422 msec sum=4999999950000000 speed=1037107 msg/sec
received 100000000 messages in 203961 msec sum=4999999950000000 speed=490289 msg/sec

Both versions would inexplicably run at approximately half the speed sometimes. I have no idea what is up with that. I have no Java development environment to test for comparison. This machine has 4 cores and is running Windows.

Regards,
Brad Anderson
Feb 09 2012
prev sibling next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:

On 09.02.2012 at 17:22, dsimcha <dsimcha yahoo.com> wrote:

 I wonder how much it helps to just optimize the GC a little.  How much  
 does the performance gap close when you use DMD 2.058 beta instead of  
 2.057?  This upcoming release has several new garbage collector  
 optimizations.  If the GC is the bottleneck, then it's not surprising  
 that anything that relies heavily on it is slow because D's GC is still  
 fairly naive.

I did some OProfile-ing. The full report is attached, but for simplicity it is without call graph this time. Here is an excerpt:

CPU: Core 2, speed 2001 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        linenr info    symbol name
13838    18.8416  gcx.d:426      void* gc.gcx.GC.malloc(ulong, uint, ulong*)
4465      6.0795  gcx.d:2454     ulong gc.gcx.Gcx.fullcollect(void*)
...

Compiled with: gcc-Version 4.6.2 20111026 (gdc 0.31 - r751:34491c2e7bb4, using dmd 2.057) (GCC)

[attachment: report.txt, the full OProfile listing without call graph]
Feb 09 2012
prev sibling next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
On 09.02.2012 at 18:35, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
 Generally, D's message passing is implemented in quite easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge overhead. I'd
 rather have a templated message passing system with type-safe message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This will be
 way faster, then Variant.

cc Sean Kelly

I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags).

We must make the message passing subsystem not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble believing Java's allocator is this much faster than D's).

Andrei

Well, what does +1 Variant and +1 LinkedListNode sum up to?
Feb 09 2012
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Feb 9, 2012, at 10:14 AM, Marco Leise wrote:

 On 09.02.2012 at 17:22, dsimcha <dsimcha yahoo.com> wrote:

 I wonder how much it helps to just optimize the GC a little. How much
 does the performance gap close when you use DMD 2.058 beta instead of 2.057?
 This upcoming release has several new garbage collector optimizations. If the
 GC is the bottleneck, then it's not surprising that anything that relies
 heavily on it is slow because D's GC is still fairly naive.

 I did some OProfile-ing. The full report is attached, but for simplicity
 it is without call graph this time.

 CPU: Core 2, speed 2001 MHz (estimated)
 Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
 samples  %        linenr info    symbol name
 13838    18.8416  gcx.d:426      void* gc.gcx.GC.malloc(ulong, uint, ulong*)
 4465      6.0795  gcx.d:2454     ulong gc.gcx.Gcx.fullcollect(void*)

One random thing that just occurred to me… if the standard receive pattern is:

receive((int x) { … });

There's a good chance that a stack frame is being dynamically allocated for the delegate when it's passed to receive (since I don't believe there's any way to declare the parameters to receive as "scope"). I'll have to check this, and maybe consider changing receive to use alias template parameters instead of normal function parameters?
Feb 09 2012
prev sibling next sibling parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
On 09.02.2012 at 20:35, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 If we're doing one allocation per
 message passed, that might explain the 4x performance difference (I
 have no trouble figuring Java's allocator is this much faster than  
 D's).


 Andrei

Well, what does +1 Variant and +1 LinkedListNode sum up to?

Sorry, I don't understand... Andrei

There are at least 2 allocations, one for the Variant and one for the new node in the linked list aka message box. But from what you wrote it sounds like a Variant doesn't allocate unless the contained data exceeds some internal storage. Sean found another possible allocation in the other branch of this discussion.
Feb 09 2012
parent bearophile <bearophileHUGS lycos.com> writes:
Marco Leise:

 Sean found another possible allocation in the other  
 branch of this discussion.

Maybe this is able to help Sean in similar situations:

http://d.puremagic.com/issues/show_bug.cgi?id=5070

Bye,
bearophile
Feb 09 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly <sean invisibleduck.org>  
wrote:

 So a queue per message type?  How would ordering be preserved? Also, how  
 would this work for interprocess messaging?  An array-based queue is an  
 option however (though it would mean memmoves on receive), as are  
 free-lists for nodes, etc.  I guess the easiest thing there would be a  
 lock-free shared slist for the node free-list, though I couldn't weigh  
 the chance of cache misses from using old memory blocks vs. just  
 expecting the allocator to be fast.

I didn't yet get around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists.

Best first-order optimization would be to allocate the list node deterministically. The only reason to use GC memory for them is that mallocating is still too cumbersome. Nodes are unshared, so you'd want a unique_pointer.
Feb 09 2012
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Feb 9, 2012, at 10:31 AM, Marco Leise wrote:

 On 09.02.2012 at 18:35, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
 Generally, D's message passing is implemented in quite easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge overhead. I'd
 rather have a templated message passing system with type-safe message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This will be
 way faster, then Variant.

 cc Sean Kelly

 I haven't looked at the implementation, but one possible liability is that
 large messages don't fit in a Variant and must use dynamic allocation under
 the wraps. There are a number of ways to avoid that, such as parallel arrays
 (one array per type for data and one for the additional tags).

 We must make the message passing subsystem not use any memory allocation in
 the quiescent state. If we're doing one allocation per message passed, that
 might explain the 4x performance difference (I have no trouble figuring
 Java's allocator is this much faster than D's).

 Well, what does +1 Variant and +1 LinkedListNode sum up to?

FWIW, you can use DMD's built in profiler so long as the receiving thread is the same as the sending thread:

import std.concurrency;

void main() {
    for(int i = 0; i < 1_000_000; i++) {
        send(thisTid, 12345);
        auto x = receiveOnly!int();
    }
}

I generated timings for this both before and after adding "scope" to mbox.get():

$ dmd -release -inline -O abc
$ time abc

real    0m0.831s
user    0m0.829s
sys     0m0.002s

… add "scope" to mbox.get()

$ dmd -release -inline -O abc
$ time abc

real    0m0.653s
user    0m0.649s
sys     0m0.003s

And here's the trace log after "scope" was added. Notice that there were 61 calls to GCX.fullcollect(). We can also see that there was 1 allocation per send/receive operation, so only an alloc for the message list node.

$ dmd -O -release -profile abc
gladsheim:misc sean$ time abc

real    0m11.348s
user    0m11.331s
sys     0m0.015s

======== Timer Is 3579545 Ticks/Sec, Times are in Microsecs ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

1000000  437709765  220179413  220  void std.concurrency._send!(int)._send(std.concurrency.MsgType, std.concurrency.Tid, int)
1000000  300987757  140736393  140  bool std.concurrency.MessageBox.get!(nothrow safe void delegate(int), pure safe void function(std.concurrency.LinkTerminated)*, pure safe void function(std.concurrency.OwnerTerminated)*, pure safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow safe void delegate(int), scope pure safe void function(std.concurrency.LinkTerminated)*, scope pure safe void function(std.concurrency.OwnerTerminated)*, scope pure safe void function(std.variant.VariantN!(32u).VariantN)*)
1000000  202131609  89479808  89  void* gc.gcx.GC.malloc(uint, uint, uint*)
1  825045422  57556501  57556501  _Dmain
1000033  112651800  52026745  52  void* gc.gcx.GC.mallocNoSync(uint, uint, uint*)
61  53422342  49606106  813214  uint gc.gcx.Gcx.fullcollect(void*)
2000000  160103753  42531732  21  bool std.concurrency.MessageBox.get!(nothrow safe void delegate(int), pure safe void function(std.concurrency.LinkTerminated)*, pure safe void function(std.concurrency.OwnerTerminated)*, pure safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow safe void delegate(int), scope pure safe void function(std.concurrency.LinkTerminated)*, scope pure safe void function(std.concurrency.OwnerTerminated)*, scope pure safe void function(std.variant.VariantN!(32u).VariantN)*).bool scan(ref std.concurrency.List!(std.concurrency.Message).List)
2000000  42018612  39837170  19  int std.variant.VariantN!(32u).VariantN.handler!(int).handler(std.variant.VariantN!(32u).VariantN.OpID, ubyte[32]*, void*)
1000000  117572021  24641771  24  bool std.concurrency.MessageBox.get!(nothrow safe void delegate(int), pure safe void function(std.concurrency.LinkTerminated)*, pure safe void function(std.concurrency.OwnerTerminated)*, pure safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow safe void delegate(int), scope pure safe void function(std.concurrency.LinkTerminated)*, scope pure safe void function(std.concurrency.OwnerTerminated)*, scope pure safe void function(std.variant.VariantN!(32u).VariantN)*).bool onStandardMsg(ref std.concurrency.Message)
1000000  47280794  20418675  20  void std.concurrency.Message.map!(nothrow safe void delegate(int)).map(nothrow safe void delegate(int))
1000000  316556767  15569009  15  int std.concurrency.receiveOnly!(int).receiveOnly()
1000000  36317362  13212905  13  property bool std.variant.VariantN!(32u).VariantN.convertsTo!(int).convertsTo()
1000000  15445906  10879089  10  std.concurrency.Message std.concurrency.Message.__ctor!(int).__ctor(std.concurrency.MsgType, int)
1000000  45649454  9332092  9  property bool std.concurrency.Message.convertsTo!(int).convertsTo()
1000000  26790032  7875877  7  property int std.variant.VariantN!(32u).VariantN.get!(int).get()
1000000  444757778  7048013  7  void std.concurrency._send!(int)._send(std.concurrency.Tid, int)
15512  6681162  6657279  429  int gc.gcx.Gcx.allocPage(ubyte)
1000000  450932153  6174374  6  void std.concurrency.send!(int).send(std.concurrency.Tid, int)
1000000  4566817  4566817  4  std.variant.VariantN!(32u).VariantN std.variant.VariantN!(32u).VariantN.opAssign!(int).opAssign(int)
4087  3635735  3518353  860  void gc.gcx.Gcx.mark(void*, void*, int)
2000000  2069965  2069965  1  int std.variant.VariantN!(32u).VariantN.handler!(int).handler(std.variant.VariantN!(32u).VariantN.OpID, ubyte[32]*, void*).bool tryPutting(int*, TypeInfo, void*)
2990271  195287  195287  0  void* gc.gcx.sentinel_add(void*)
1000000  147610  147610  0  bool std.concurrency.MessageBox.get!(nothrow safe void delegate(int), pure safe void function(std.concurrency.LinkTerminated)*, pure safe void function(std.concurrency.OwnerTerminated)*, pure safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow safe void delegate(int), scope pure safe void function(std.concurrency.LinkTerminated)*, scope pure safe void function(std.concurrency.OwnerTerminated)*, scope pure safe void function(std.variant.VariantN!(32u).VariantN)*).bool pty(ref std.concurrency.List!(std.concurrency.Message).List)
1000032  121459  121459  0  void gc.gcx.Gcx.setBits(gc.gcx.Pool*, uint, uint)
2000000  111475  111475  0  int std.variant.VariantN!(32u).VariantN.handler!(int).handler(std.variant.VariantN!(32u).VariantN.OpID, ubyte[32]*, void*).int* getPtr(void*)
1000033  102413  102413  0  ubyte gc.gcx.Gcx.findBin(uint)
1002228  94742  94742  0  property uint gc.gcx.Pool.shiftBy()
1000000  72086  72086  0  int std.concurrency.receiveOnly!(int).receiveOnly().nothrow safe void __lambda17(int)
995119  70854  70854  0  void gc.gcx.Gcx.log_free(void*)
1000033  61505  61505  0  void gc.gcx.sentinel_init(void*, uint)
1000033  55070  55070  0  void gc.gcx.Gcx.log_malloc(void*, uint)
995119  51748  51748  0  void gc.gcx.sentinel_Invariant(const(void*))
1  24692  24692  24692  void gc.gcx.Pool.initialize(uint, bool)
3538  3544517  24525  6  void gc.gcx.Gcx.mark(void*, void*)
15511  23883  23522  1  uint gc.gcx.Pool.allocPages(uint)
124447  11615  11615  0  void gc.gcx.Gcx.clrBitsSmallSweep(gc.gcx.Pool*, uint, uint)
1  7051  5921  5921  void gc.gcx.GC.initialize()
1  27490  2797  2797  gc.gcx.Pool* gc.gcx.Gcx.newPool(uint, bool)
61  53424783  2441  40  uint gc.gcx.Gcx.fullcollectshell()
55  2227  2227  40  void gc.gcx.Gcx.addRange(void*, void*)
1  1129  1119  1119  void gc.gcx.Gcx.initialize()
16  360  360  22  uint gc.gcx.Pool.extendPages(uint)
1  2547  319  319  void gc.gcx.GC.addRange(void*, uint)
2196  278  278  0  gc.gcx.Pool* gc.gcx.Gcx.findPool(void*)
1  65  65  65  void gc.gcx.GC.disable()
1  9  9  9  void gc.gcx.Gcx.log_init()
1  5  5  5  void gc.gcx.GC.enable()
1  0  0  0  void gc.gcx.GC.setStackBottom(void*)
1  0  0  0  property uint gc.gcx.Pool.divisor()
Feb 09 2012
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Feb 9, 2012, at 11:53 AM, bearophile wrote:

 Marco Leise:

 Sean found another possible allocation in the other
 branch of this discussion.

 Maybe this is able to help Sean in similar situations:
 http://d.puremagic.com/issues/show_bug.cgi?id=5070

This would be handy. I don't always think to check the asm dump when I'm working with delegates.
Feb 09 2012
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote:

 On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly <sean invisibleduck.org> wrote:

 So a queue per message type? How would ordering be preserved? Also, how would
 this work for interprocess messaging? An array-based queue is an option
 however (though it would mean memmoves on receive), as are free-lists for
 nodes, etc. I guess the easiest thing there would be a lock-free shared slist
 for the node free-list, though I couldn't weigh the chance of cache misses
 from using old memory blocks vs. just expecting the allocator to be fast.

 I didn't yet got around to polish my lock-free SList/DList implementations,
 but mutexes should only become a problem with high contention when you need
 to block.
 You'd also would need some kind of blocking for lock-free lists.

No blocking should be necessary for the lock-free list. Just try to steal a node with a CAS. If the result was null (i.e. if the list ended up being empty), allocate a node via malloc/GC.

 Best first order optimization would be to allocate the list node
 deterministically.

Neat idea. I think I can make that change fairly trivially.
Feb 09 2012
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Feb 9, 2012, at 2:17 PM, Sean Kelly wrote:

 Best first order optimization would be to allocate the list node
 deterministically.

 Neat idea. I think I can make that change fairly trivially.

$ time abc

real    0m0.556s
user    0m0.555s
sys     0m0.001s

So another 100ms improvement. Switching to a (__gshared, no mutex) free-list that falls back on malloc yields:

$ time abc

real    0m0.505s
user    0m0.503s
sys     0m0.001s

Not as much of a gain there, and I believe we've eliminated all the allocations (though I'd have to do a profile build to verify). Still, that's approaching being twice as fast as before, which is definitely something.
Feb 09 2012
prev sibling next sibling parent "Oliver Plow" <saxo123 gmx.de> writes:
 I wonder how much it helps to just optimize the GC a little.  How much
 does the performance gap close when you use DMD 2.058 beta instead of
 2.057?  This upcoming release has several new garbage collector
 optimizations.  If the GC is the bottleneck, then it's not surprising


Is there a way to "turn off" the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is likely not to set in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches But couldn't find anything helpful. Then you could measure the thing with GC "turned off" to see whether the GC is the problem or not.

-- Oliver
Feb 10 2012
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
GC.disable and GC.reserve are applicable. I tested with these and they did help, but not a ton.

On Feb 10, 2012, at 5:54 AM, "Oliver Plow" <saxo123 gmx.de> wrote:

 I wonder how much it helps to just optimize the GC a little.  How much
 does the performance gap close when you use DMD 2.058 beta instead of
 2.057?  This upcoming release has several new garbage collector
 optimizations.  If the GC is the bottleneck, then it's not surprising

 Is there a way to "turn off" the GC, e.g. a compiler switch to set the heap
 size to a large number so that the GC is likely not to set in? I searched
 through this page: http://www.d-programming-language.org/dmd-windows.html#switches
 But couldn't find anything helpful. Then you could measure the thing with
 GC "turned off" to see whether the GC is the problem or not.

 -- Oliver

Feb 10 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 10 Feb 2012 15:07:41 +0100, deadalnix <deadalnix gmail.com> wrote:

 On 09/02/2012 at 20:57, Martin Nowak wrote:
 On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly <sean invisibleduck.org>
 wrote:

 So a queue per message type? How would ordering be preserved? Also,
 how would this work for interprocess messaging? An array-based queue
 is an option however (though it would mean memmoves on receive), as
 are free-lists for nodes, etc. I guess the easiest thing there would
 be a lock-free shared slist for the node free-list, though I couldn't
 weigh the chance of cache misses from using old memory blocks vs. just
 expecting the allocator to be fast.

 I didn't yet got around to polish my lock-free SList/DList
 implementations,
 but mutexes should only become a problem with high contention when you
 need to block.
 You'd also would need some kind of blocking for lock-free lists.

 Doesn't this require shared to be working?

Shared is already very helpful for writing lock-free stuff. I still need to use atomic load/store to make them portable, though. If shared were to add memory fences for full sequential ordering, one would rather hack around it using weaker ordering.
Feb 10 2012
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Feb 9, 2012, at 2:17 PM, Sean Kelly wrote:

 On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote:

 I didn't yet got around to polish my lock-free SList/DList implementations,
 but mutexes should only become a problem with high contention when you need
 to block.
 You'd also would need some kind of blocking for lock-free lists.

 No blocking should be necessary for the lock-free list. Just try to steal a
 node with a CAS. If the result was null (i.e. if the list ended up being
 empty), allocate a node via malloc/GC.

I just realized that the free-list is actually a stack, so doing this lock-free runs into the ABA problem. There goes the idea of this being an easy optimization.
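A sketch of the hazard, in comments (illustrative, not druntime code): the CAS compares only the pointer value, which can repeat even though the list changed underneath it.

struct Node { int value; Node* next; }

// ABA on a naive CAS pop:
//   T1: reads head = A and next = A.next (= B), then is preempted
//   T2: pops A, pops B, then pushes A back          -> list is now A -> C
//   T1: cas(&head, A, B) succeeds (A still matches) -> head now points at B,
//       which was recycled: the list is corrupted.
//
// Classic mitigation: pair the pointer with a version counter and CAS both
// words together, so a recycled pointer no longer compares equal.
struct TaggedHead
{
    Node*  ptr;
    size_t ver;   // bumped on every successful push/pop
}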
Feb 10 2012