www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Message passing between threads: Java 4 times faster than D

reply Nicolae Mihalache <xpromache gmail.com> writes:
Hello,

I'm a complete newbie in D and trying to compare with Java. I
implemented  a simple test for measuring the throughput in message
passing between threads. I see that Java can pass about 4mil
messages/sec while D only achieves 1mil/sec. I thought that D should
be faster.

The messages are simply integers (which are converted to Integer in Java).

The two programs are attached. I tried compiling the D version with
both dmd and gdc and various optimization flags.

mache
Feb 09 2012
next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 09.02.2012, 10:06 Uhr, schrieb Nicolae Mihalache <xpromache gmail.com>:

 Hello,

 I'm a complete newbie in D and trying to compare with Java. I
 implemented  a simple test for measuring the throughput in message
 passing between threads. I see that Java can pass about 4mil
 messages/sec while D only achieves 1mil/sec. I thought that D should
 be faster.

 The messages are simply integers (which are converted to Integer in  
 Java).

 The two programs are attached. I tried compiling the D version with
 both dmd and gdc and various optimization flags.

 mache

I cannot give you an explanation, just want to say that a message in std.concurrency is also using a wrapper (a 'Variant') + a type field (standard, priority, linkDead). So you effectively have no optimization for int, but the same situation as in Java. The second thing I notice is that std.concurrency uses a double linked list implementation, while you use an array in the Java version, which results in no additional node allocations.
Feb 09 2012
prev sibling parent reply "Alex_Dovhal" <alex_dovhal yahoo.com> writes:
"Nicolae Mihalache" <xpromache gmail.com> wrote:
 Hello,

 I'm a complete newbie in D and trying to compare with Java. I
 implemented  a simple test for measuring the throughput in message
 passing between threads. I see that Java can pass about 4mil
 messages/sec while D only achieves 1mil/sec. I thought that D should
 be faster.

 The messages are simply integers (which are converted to Integer in Java).

 The two programs are attached. I tried compiling the D version with
 both dmd and gdc and various optimization flags.

 mache

Hi, I downloaded your two programs, I didn't run them but noticed that in 'mp.d' you have n set to 100_000_000, while in 'ThroughputMpTest.java' n is set to 10_000_000, so with this D code is 10/4 = 2.5 times faster :)
Feb 09 2012
next sibling parent reply "Alex_Dovhal" <alex_dovhal yahoo.com> writes:
Sorry, my mistake. It's strange to have different 'n', but you measure speed 
as 1000*n/time, so it's doesn't matter if n is 10 times bigger. 
Feb 09 2012
next sibling parent reply Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
Generally, D's message passing is implemented in quite easy-to-use
way, but far from being fast.
I dislike the Variant structure, because it adds a huge overhead. I'd
rather have a templated message passing system with type-safe message
queue, so no Variant is necessary.
In specific cases Messages can be polymorphic objects. This will be
way faster, then Variant.

On Thu, Feb 9, 2012 at 3:12 PM, Alex_Dovhal <alex_dovhal yahoo.com> wrote:
 Sorry, my mistake. It's strange to have different 'n', but you measure speed
 as 1000*n/time, so it's doesn't matter if n is 10 times bigger.

-- Bye, Gor Gyolchanyan.
Feb 09 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
 Generally, D's message passing is implemented in quite easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge overhead. I'd
 rather have a templated message passing system with type-safe message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This will be
 way faster, then Variant.

cc Sean Kelly I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags). We must make the message passing subsystem to not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's). Andrei
Feb 09 2012
parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
Am 09.02.2012, 18:35 Uhr, schrieb Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org>:

 On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
 Generally, D's message passing is implemented in quite easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge overhead. I'd
 rather have a templated message passing system with type-safe message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This will be
 way faster, then Variant.

cc Sean Kelly I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags). We must make the message passing subsystem to not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's). Andrei

Well, what does +1 Variant and +1 LinkedListNode sum up to?
Feb 09 2012
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/9/12 10:31 AM, Marco Leise wrote:
 Am 09.02.2012, 18:35 Uhr, schrieb Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org>:

 On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
 Generally, D's message passing is implemented in quite easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge overhead. I'd
 rather have a templated message passing system with type-safe message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This will be
 way faster, then Variant.

cc Sean Kelly I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags). We must make the message passing subsystem to not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's). Andrei

Well, what does +1 Variant and +1 LinkedListNode sum up to?

Sorry, I don't understand... Andrei
Feb 09 2012
parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
Am 09.02.2012, 20:35 Uhr, schrieb Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org>:

 If we're doing one allocation per
 message passed, that might explain the 4x performance difference (I
 have no trouble figuring Java's allocator is this much faster than  
 D's).


 Andrei

Well, what does +1 Variant and +1 LinkedListNode sum up to?

Sorry, I don't understand... Andrei

There are at least 2 allocations, one for the Variant and one for the new node in the linked list aka message box. But from what you wrote it sounds like a Variant doesn't allocate unless the contained data exceeds some internal storage. Sean found another possible allocation in the other branch of this discussion.
Feb 09 2012
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Marco Leise:

 Sean found another possible allocation in the other  
 branch of this discussion.

Maybe this is able to help Sean and similar situations: http://d.puremagic.com/issues/show_bug.cgi?id=5070 Bye, bearophile
Feb 09 2012
parent Sean Kelly <sean invisibleduck.org> writes:
On Feb 9, 2012, at 11:53 AM, bearophile wrote:

 Marco Leise:
=20
 Sean found another possible allocation in the other =20
 branch of this discussion.

=20 Maybe this is able to help Sean and similar situations: http://d.puremagic.com/issues/show_bug.cgi?id=3D5070

This would be handy. I don't always think to check the asm dump when = I'm working with delegates.=
Feb 09 2012
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/9/12 11:49 AM, Marco Leise wrote:
 Am 09.02.2012, 20:35 Uhr, schrieb Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org>:

 If we're doing one allocation per
 message passed, that might explain the 4x performance difference (I
 have no trouble figuring Java's allocator is this much faster than
 D's).


 Andrei

Well, what does +1 Variant and +1 LinkedListNode sum up to?

Sorry, I don't understand... Andrei

There are at least 2 allocations, one for the Variant and one for the new node in the linked list aka message box. But from what you wrote it sounds like a Variant doesn't allocate unless the contained data exceeds some internal storage. Sean found another possible allocation in the other branch of this discussion.

I understand. The good news is, this looks like low-hanging fruit! I'll keep an eye on pull requests in druntime. Thanks to fellow Romanian Nicolae Mihalache for contributing the comparison. Andrei
Feb 09 2012
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
	charset=windows-1252

On Feb 9, 2012, at 10:31 AM, Marco Leise wrote:

 Am 09.02.2012, 18:35 Uhr, schrieb Andrei Alexandrescu =

<SeeWebsiteForEmail erdani.org>:
=20
 On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
 Generally, D's message passing is implemented in quite easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge overhead. =



I'd
 rather have a templated message passing system with type-safe =



message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This will be
 way faster, then Variant.

=20 cc Sean Kelly =20 I haven't looked at the implementation, but one possible liability is =


that large messages don't fit in a Variant and must use dynamic = allocation under the wraps. There are a number of ways to avoid that, = such as parallel arrays (one array per type for data and one for the = additional tags).
=20
 We must make the message passing subsystem to not use any memory =


allocation in the quiescent state. If we're doing one allocation per = message passed, that might explain the 4x performance difference (I have = no trouble figuring Java's allocator is this much faster than D's).
=20
 Well, what does +1 Variant and +1 LinkedListNode sum up to?

FWIW, you can use DMD's built in profiler so long as the receiving = thread is the same as the sending thread: import std.concurrency; void main() { for(int i =3D 0; i < 1_000_000; i++) { send(thisTid, 12345); auto x =3D receiveOnly!int(); } } I generated timings for this both before and after adding "scope" to = mbox.get(): $ dmd -release -inline -O abc $ time abc real 0m0.831s user 0m0.829s sys 0m0.002s =85 add "scope" to mbox.get() $ dmd -release -inline -O abc $ time abc real 0m0.653s user 0m0.649s sys 0m0.003s And here's the trace log after "scope" was added. Notice that there = were 61 calls to GCX.fullcollect(). We can also see that there was 1 = allocation per send/receive operation, so only an alloc for the message = list node. $ dmd -O -release -profile abc gladsheim:misc sean$ time abc real 0m11.348s user 0m11.331s sys 0m0.015s =3D=3D=3D=3D=3D=3D=3D=3D Timer Is 3579545 Ticks/Sec, Times are in = Microsecs =3D=3D=3D=3D=3D=3D=3D=3D Num Tree Func Per Calls Time Time Call 1000000 437709765 220179413 220 void = std.concurrency._send!(int)._send(std.concurrency.MsgType, = std.concurrency.Tid, int) 1000000 300987757 140736393 140 bool = std.concurrency.MessageBox.get!(nothrow safe void delegate(int), pure = safe void function(std.concurrency.LinkTerminated)*, pure safe void = function(std.concurrency.OwnerTerminated)*, pure safe void = function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow safe = void delegate(int), scope pure safe void = function(std.concurrency.LinkTerminated)*, scope pure safe void = function(std.concurrency.OwnerTerminated)*, scope pure safe void = function(std.variant.VariantN!(32u).VariantN)*) 1000000 202131609 89479808 89 void* = gc.gcx.GC.malloc(uint, uint, uint*) 1 825045422 57556501 57556501 _Dmain 1000033 112651800 52026745 52 void* = gc.gcx.GC.mallocNoSync(uint, uint, uint*) 61 53422342 49606106 813214 uint = gc.gcx.Gcx.fullcollect(void*) 2000000 160103753 42531732 21 bool = std.concurrency.MessageBox.get!(nothrow safe void delegate(int), pure = safe void function(std.concurrency.LinkTerminated)*, pure safe void = function(std.concurrency.OwnerTerminated)*, pure safe void = function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow safe = void delegate(int), scope pure safe void = function(std.concurrency.LinkTerminated)*, scope pure safe void = function(std.concurrency.OwnerTerminated)*, scope pure safe void = function(std.variant.VariantN!(32u).VariantN)*).bool scan(ref = std.concurrency.List!(std.concurrency.Message).List) 2000000 42018612 39837170 19 int = std.variant.VariantN!(32u).VariantN.handler!(int).handler(std.variant.Vari= antN!(32u).VariantN.OpID, ubyte[32]*, void*) 1000000 117572021 24641771 24 bool = std.concurrency.MessageBox.get!(nothrow safe void delegate(int), pure = safe void function(std.concurrency.LinkTerminated)*, pure safe void = function(std.concurrency.OwnerTerminated)*, pure safe void = function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow safe = void delegate(int), scope pure safe void = function(std.concurrency.LinkTerminated)*, scope pure safe void = function(std.concurrency.OwnerTerminated)*, scope pure safe void = function(std.variant.VariantN!(32u).VariantN)*).bool onStandardMsg(ref = std.concurrency.Message) 1000000 47280794 20418675 20 void = std.concurrency.Message.map!(nothrow safe void = delegate(int)).map(nothrow safe void delegate(int)) 1000000 316556767 15569009 15 int = std.concurrency.receiveOnly!(int).receiveOnly() 1000000 36317362 13212905 13 property bool = std.variant.VariantN!(32u).VariantN.convertsTo!(int).convertsTo() 1000000 15445906 10879089 10 std.concurrency.Message = std.concurrency.Message.__ctor!(int).__ctor(std.concurrency.MsgType, = int) 1000000 45649454 9332092 9 property bool = std.concurrency.Message.convertsTo!(int).convertsTo() 1000000 26790032 7875877 7 property int = std.variant.VariantN!(32u).VariantN.get!(int).get() 1000000 444757778 7048013 7 void = std.concurrency._send!(int)._send(std.concurrency.Tid, int) 15512 6681162 6657279 429 int = gc.gcx.Gcx.allocPage(ubyte) 1000000 450932153 6174374 6 void = std.concurrency.send!(int).send(std.concurrency.Tid, int) 1000000 4566817 4566817 4 = std.variant.VariantN!(32u).VariantN = std.variant.VariantN!(32u).VariantN.opAssign!(int).opAssign(int) 4087 3635735 3518353 860 void = gc.gcx.Gcx.mark(void*, void*, int) 2000000 2069965 2069965 1 int = std.variant.VariantN!(32u).VariantN.handler!(int).handler(std.variant.Vari= antN!(32u).VariantN.OpID, ubyte[32]*, void*).bool tryPutting(int*, = TypeInfo, void*) 2990271 195287 195287 0 void* = gc.gcx.sentinel_add(void*) 1000000 147610 147610 0 bool = std.concurrency.MessageBox.get!(nothrow safe void delegate(int), pure = safe void function(std.concurrency.LinkTerminated)*, pure safe void = function(std.concurrency.OwnerTerminated)*, pure safe void = function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow safe = void delegate(int), scope pure safe void = function(std.concurrency.LinkTerminated)*, scope pure safe void = function(std.concurrency.OwnerTerminated)*, scope pure safe void = function(std.variant.VariantN!(32u).VariantN)*).bool pty(ref = std.concurrency.List!(std.concurrency.Message).List) 1000032 121459 121459 0 void = gc.gcx.Gcx.setBits(gc.gcx.Pool*, uint, uint) 2000000 111475 111475 0 int = std.variant.VariantN!(32u).VariantN.handler!(int).handler(std.variant.Vari= antN!(32u).VariantN.OpID, ubyte[32]*, void*).int* getPtr(void*) 1000033 102413 102413 0 ubyte = gc.gcx.Gcx.findBin(uint) 1002228 94742 94742 0 property uint = gc.gcx.Pool.shiftBy() 1000000 72086 72086 0 int = std.concurrency.receiveOnly!(int).receiveOnly().nothrow safe void = __lambda17(int) 995119 70854 70854 0 void = gc.gcx.Gcx.log_free(void*) 1000033 61505 61505 0 void = gc.gcx.sentinel_init(void*, uint) 1000033 55070 55070 0 void = gc.gcx.Gcx.log_malloc(void*, uint) 995119 51748 51748 0 void = gc.gcx.sentinel_Invariant(const(void*)) 1 24692 24692 24692 void = gc.gcx.Pool.initialize(uint, bool) 3538 3544517 24525 6 void = gc.gcx.Gcx.mark(void*, void*) 15511 23883 23522 1 uint = gc.gcx.Pool.allocPages(uint) 124447 11615 11615 0 void = gc.gcx.Gcx.clrBitsSmallSweep(gc.gcx.Pool*, uint, uint) 1 7051 5921 5921 void = gc.gcx.GC.initialize() 1 27490 2797 2797 gc.gcx.Pool* = gc.gcx.Gcx.newPool(uint, bool) 61 53424783 2441 40 uint = gc.gcx.Gcx.fullcollectshell() 55 2227 2227 40 void = gc.gcx.Gcx.addRange(void*, void*) 1 1129 1119 1119 void = gc.gcx.Gcx.initialize() 16 360 360 22 uint = gc.gcx.Pool.extendPages(uint) 1 2547 319 319 void = gc.gcx.GC.addRange(void*, uint) 2196 278 278 0 gc.gcx.Pool* = gc.gcx.Gcx.findPool(void*) 1 65 65 65 void gc.gcx.GC.disable() 1 9 9 9 void = gc.gcx.Gcx.log_init() 1 5 5 5 void gc.gcx.GC.enable() 1 0 0 0 void = gc.gcx.GC.setStackBottom(void*) 1 0 0 0 property uint = gc.gcx.Pool.divisor()
Feb 09 2012
prev sibling next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
So a queue per message type?  How would ordering be preserved? Also, how wou=
ld this work for interprocess messaging?  An array-based queue is an option h=
owever (though it would mean memmoves on receive), as are free-lists for nod=
es, etc.  I guess the easiest thing there would be a lock-free shared slist f=
or the node free-list, though I couldn't weigh the chance of cache misses fr=
om using old memory blocks vs. just expecting the allocator to be fast.=20

On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> wr=
ote:

 Generally, D's message passing is implemented in quite easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge overhead. I'd
 rather have a templated message passing system with type-safe message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This will be
 way faster, then Variant.
=20
 On Thu, Feb 9, 2012 at 3:12 PM, Alex_Dovhal <alex_dovhal yahoo.com> wrote:=

 Sorry, my mistake. It's strange to have different 'n', but you measure sp=


eed
 as 1000*n/time, so it's doesn't matter if n is 10 times bigger.
=20
=20

=20 =20 =20 --=20 Bye, Gor Gyolchanyan.

Feb 09 2012
next sibling parent reply "dsimcha" <dsimcha yahoo.com> writes:
I wonder how much it helps to just optimize the GC a little.  How 
much does the performance gap close when you use DMD 2.058 beta 
instead of 2.057?  This upcoming release has several new garbage 
collector optimizations.  If the GC is the bottleneck, then it's 
not surprising that anything that relies heavily on it is slow 
because D's GC is still fairly naive.

On Thursday, 9 February 2012 at 15:44:59 UTC, Sean Kelly wrote:
 So a queue per message type?  How would ordering be preserved? 
 Also, how would this work for interprocess messaging?  An 
 array-based queue is an option however (though it would mean 
 memmoves on receive), as are free-lists for nodes, etc.  I 
 guess the easiest thing there would be a lock-free shared slist 
 for the node free-list, though I couldn't weigh the chance of 
 cache misses from using old memory blocks vs. just expecting 
 the allocator to be fast.

 On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan 
 <gor.f.gyolchanyan gmail.com> wrote:

 Generally, D's message passing is implemented in quite 
 easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge 
 overhead. I'd
 rather have a templated message passing system with type-safe 
 message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This 
 will be
 way faster, then Variant.
 
 On Thu, Feb 9, 2012 at 3:12 PM, Alex Dovhal <alex 
 dovhal yahoo.com> wrote:
 Sorry, my mistake. It's strange to have different 'n', but 
 you measure speed
 as 1000*n/time, so it's doesn't matter if n is 10 times 
 bigger.
 
 

-- Bye, Gor Gyolchanyan.


Feb 09 2012
next sibling parent Brad Anderson <eco gnuk.net> writes:
On Thu, Feb 9, 2012 at 9:22 AM, dsimcha <dsimcha yahoo.com> wrote:

 I wonder how much it helps to just optimize the GC a little.  How much
 does the performance gap close when you use DMD 2.058 beta instead of
 2.057?  This upcoming release has several new garbage collector
 optimizations.  If the GC is the bottleneck, then it's not surprising that
 anything that relies heavily on it is slow because D's GC is still fairly
 naive.


 On Thursday, 9 February 2012 at 15:44:59 UTC, Sean Kelly wrote:

 So a queue per message type?  How would ordering be preserved? Also, how
 would this work for interprocess messaging?  An array-based queue is an
 option however (though it would mean memmoves on receive), as are
 free-lists for nodes, etc.  I guess the easiest thing there would be a
 lock-free shared slist for the node free-list, though I couldn't weigh the
 chance of cache misses from using old memory blocks vs. just expecting the
 allocator to be fast.

 On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan <gor.f.gyolchanyan gmail.com>
 wrote:

  Generally, D's message passing is implemented in quite easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge overhead. I'd
 rather have a templated message passing system with type-safe message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This will be
 way faster, then Variant.

 On Thu, Feb 9, 2012 at 3:12 PM, Alex Dovhal <alex dovhal yahoo.com>
 wrote:

 Sorry, my mistake. It's strange to have different 'n', but you measure
 speed
 as 1000*n/time, so it's doesn't matter if n is 10 times bigger.

-- Bye, Gor Gyolchanyan.



dmd 2.057: received 100000000 messages in 192034 msec sum=4999999950000000 speed=520741 msg/sec received 100000000 messages in 84118 msec sum=4999999950000000 speed=1188806 msg/sec received 100000000 messages in 88274 msec sum=4999999950000000 speed=1132836 msg/sec dmd 2.058 beta: received 100000000 messages in 93539 msec sum=4999999950000000 speed=1069072 msg/sec received 100000000 messages in 96422 msec sum=4999999950000000 speed=1037107 msg/sec received 100000000 messages in 203961 msec sum=4999999950000000 speed=490289 msg/sec Both versions would inexplicably run at approximately half the speed sometimes. I have no idea what is up with that. I have no java development environment to test for comparison. This machine has 4 cores and is running Windows. Regards, Brad Anderson
Feb 09 2012
prev sibling next sibling parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
Am 09.02.2012, 17:22 Uhr, schrieb dsimcha <dsimcha yahoo.com>:

 I wonder how much it helps to just optimize the GC a little.  How much  
 does the performance gap close when you use DMD 2.058 beta instead of  
 2.057?  This upcoming release has several new garbage collector  
 optimizations.  If the GC is the bottleneck, then it's not surprising  
 that anything that relies heavily on it is slow because D's GC is still  
 fairly naive.

I did some OProfile-ing. The full report is attached, but for simplicity it is without call graph this time. Here is an excerpt: CPU: Core 2, speed 2001 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples % linenr info symbol name 13838 18.8416 gcx.d:426 void* gc.gcx.GC.malloc(ulong, uint, ulong*) 4465 6.0795 gcx.d:2454 ulong gc.gcx.Gcx.fullcollect(void*) ... Compiled with: gcc-Version 4.6.2 20111026 (gdc 0.31 - r751:34491c2e7bb4, using dmd 2.057) (GCC)
Feb 09 2012
parent reply Sean Kelly <sean invisibleduck.org> writes:
On Feb 9, 2012, at 10:14 AM, Marco Leise wrote:

 Am 09.02.2012, 17:22 Uhr, schrieb dsimcha <dsimcha yahoo.com>:
=20
 I wonder how much it helps to just optimize the GC a little.  How =


much does the performance gap close when you use DMD 2.058 beta instead = of 2.057? This upcoming release has several new garbage collector = optimizations. If the GC is the bottleneck, then it's not surprising = that anything that relies heavily on it is slow because D's GC is still = fairly naive.
=20
 I did some OProfile-ing. The full report is attached, but for =

simplicity it is without call graph this time. Here is an excerpt:
=20
 CPU: Core 2, speed 2001 MHz (estimated)
 Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a =

unit mask of 0x00 (Unhalted core cycles) count 100000
 samples  %        linenr info                 symbol name
 13838    18.8416  gcx.d:426                   void* =

gc.gcx.GC.malloc(ulong, uint, ulong*)
 4465      6.0795  gcx.d:2454                  ulong =

gc.gcx.Gcx.fullcollect(void*) One random thing that just occurred to me=85 if the standard receive = pattern is: receive((int x) { =85 }); There's a good chance that a stack frame is being dynamically allocated = for the delegate when it's passed to receive (since I don't believe = there's any way to declare the parameters to receive as "scope"). I'll = have to check this, and maybe consider changing receive to use alias = template parameters instead of normal function parameters?=
Feb 09 2012
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02/09/2012 08:27 PM, Sean Kelly wrote:
 On Feb 9, 2012, at 10:14 AM, Marco Leise wrote:

 Am 09.02.2012, 17:22 Uhr, schrieb dsimcha<dsimcha yahoo.com>:

 I wonder how much it helps to just optimize the GC a little.  How much does
the performance gap close when you use DMD 2.058 beta instead of 2.057?  This
upcoming release has several new garbage collector optimizations.  If the GC is
the bottleneck, then it's not surprising that anything that relies heavily on
it is slow because D's GC is still fairly naive.

I did some OProfile-ing. The full report is attached, but for simplicity it is without call graph this time. Here is an excerpt: CPU: Core 2, speed 2001 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples % linenr info symbol name 13838 18.8416 gcx.d:426 void* gc.gcx.GC.malloc(ulong, uint, ulong*) 4465 6.0795 gcx.d:2454 ulong gc.gcx.Gcx.fullcollect(void*)

One random thing that just occurred to me… if the standard receive pattern is: receive((int x) { … }); There's a good chance that a stack frame is being dynamically allocated for the delegate when it's passed to receive (since I don't believe there's any way to declare the parameters to receive as "scope"). I'll have to check this, and maybe consider changing receive to use alias template parameters instead of normal function parameters?

You can mark an entire tuple as scope without trouble: void foo(T,S...)(T arg1, scope S args) {...} Does this improve the run time?
Feb 09 2012
prev sibling next sibling parent reply "Oliver Plow" <saxo123 gmx.de> writes:
 I wonder how much it helps to just optimize the GC a little.  How much
 does the performance gap close when you use DMD 2.058 beta instead of
 2.057?  This upcoming release has several new garbage collector
 optimizations.  If the GC is the bottleneck, then it's not surprising


Is there a way to "turn off" the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is likely not to set in? I searched through this page: http://www.d-programming-language.org/dmd-w ndows.html#switches But couldn't find anything helpful. Then you could measure the thing with GC "turned off" to see whether the GC is the problem or not. -- Oliver -- Empfehlen Sie GMX DSL Ihren Freunden und Bekannten und wir belohnen Sie mit bis zu 50,- Euro! https://freundschaftswerbung.gmx.de
Feb 10 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-02-10 14:54, Oliver Plow wrote:
 I wonder how much it helps to just optimize the GC a little.  How much
 does the performance gap close when you use DMD 2.058 beta instead of
 2.057?  This upcoming release has several new garbage collector
 optimizations.  If the GC is the bottleneck, then it's not surprising


Is there a way to "turn off" the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is likely not to set in? I searched through this page: http://www.d-programming-language.org/dmd-w ndows.html#switches But couldn't find anything helpful. Then you could measure the thing with GC "turned off" to see whether the GC is the problem or not. -- Oliver

There's a stub implementation of the GC which uses malloc, IIRC. Try using that one at link time. https://github.com/D-Programming-Language/druntime/blob/master/src/gcstub/gc.d -- /Jacob Carlborg
Feb 10 2012
prev sibling next sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 02/10/12 14:54, Oliver Plow wrote:
 I wonder how much it helps to just optimize the GC a little.  How much
 does the performance gap close when you use DMD 2.058 beta instead of
 2.057?  This upcoming release has several new garbage collector
 optimizations.  If the GC is the bottleneck, then it's not surprising


Is there a way to "turn off" the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is likely not to set in? I searched through this page: http://www.d-programming-language.org/dmd-w ndows.html#switches But couldn't find anything helpful. Then you could measure the thing with GC "turned off" to see whether the GC is the problem or not.

Calling GC.disable() at runtime will delay the GC until it's actually needed, but won't disable it completely. Having a std noop GC stub selected by a switch would be nice, but you can get the same effect by giving the linker an object that has the necessary stubs. For this test case something like the patch below improves things significantly, for more gains, std.concurrency needs more invasive changes. Note it's just POC, to measure the current std.concurrency efficiency vs other approaches. The "freelist" arrays are not freed and leak, a complete implementation would free them when the link is shut down. Ignore the synchronize() calls - "synchronized" isn't properly lowered by the compiler, so i had to resort to this after switching the locking primitives. It should work with std "synchronized" equally well. The original testcase from this thread achieves ~4M msg/sec with this change (the numbers aren't stable, but mostly in the 3.5..4.0M range, 4.5M+ happens sometimes). The memory usage also decreases noticeably. artur --- std/concurrency.d +++ std/concurrency.d -1387,7 +1396,7 private m_last = n; Node* todelete = n.next; n.next = n.next.next; - //delete todelete; + delete todelete; m_count--; } -1430,6 +1439,56 private { val = v; } + import core.memory; + import core.exception; + new(size_t size) { + void* p; + if (afreelist.length) + p = afreelist[--afreelist.length]; + else if (gfreelist.length) { + { + scope lock = synchronize(fl); + if (gfreelist.length) { + afreelist = cast(Node*[])gfreelist; + gfreelist.length=0; + } + } + if (afreelist.length) + p = afreelist[--afreelist.length]; + } + + if (p) + return p; + + p = std.c.stdlib.malloc(size); + if (!p) + throw new OutOfMemoryError(); + GC.addRange(p, size); + return p; + } + delete(void* p) { + if (!p) + return; + pfreelist ~= cast(Node*)p; + if (pfreelist.length>=8) + { + { + scope lock = synchronize(fl); + gfreelist ~= cast(shared Node*[])pfreelist; + } + pfreelist.length=0; + pfreelist.assumeSafeAppend(); + } + // At some point all free nodes need to be freed, using: + //GC.removeRange(p); + //std.c.stdlib.free(p); + } + static Node*[] afreelist; + static ubyte[56] d1; + static Node*[] pfreelist; + static ubyte[56] d2; + shared static Node*[] gfreelist; + shared static Mutex fl; }
Feb 10 2012
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
GC.disable and GC.reserve are applicable. I tested with these and they did h=
elp but not a ton.=20

On Feb 10, 2012, at 5:54 AM, "Oliver Plow" <saxo123 gmx.de> wrote:

 I wonder how much it helps to just optimize the GC a little.  How much
 does the performance gap close when you use DMD 2.058 beta instead of
 2.057?  This upcoming release has several new garbage collector
 optimizations.  If the GC is the bottleneck, then it's not surprising


=20 Is there a way to "turn off" the GC, e.g. a compiler switch to set the hea=

p size to a large number so that the GC is likely not to set in? I searched t= hrough this page: http://www.d-programming-language.org/dmd-windows.html#swi= tches But couldn't find anything helpful. Then you could measure the thing w= ith GC "turned off" to see whether the GC is the problem or not.
=20
 -- Oliver
=20
 --=20
 Empfehlen Sie GMX DSL Ihren Freunden und Bekannten und wir
 belohnen Sie mit bis zu 50,- Euro! https://freundschaftswerbung.gmx.de

Feb 10 2012
prev sibling parent Graham St Jack <Graham.StJack internode.on.net> writes:
I suggest using a template-generated type that can contain any of the 
messages to be sent over a channel. It is reasonably straightforward to 
generate all the boilerplate code necessary to make this happen. My 
prototype (attached) still needs work to remove linux dependencies and 
tighten it up, but it works ok. Another advantage of this approach 
(well, I see it as an advantage) is that you declare in a single 
location all the messages that can be sent over the channel, and of 
course the messages are type-safe.

The file of interest is concurrency.d.

On 10/02/12 02:14, Sean Kelly wrote:
 So a queue per message type?  How would ordering be preserved? Also, how would
this work for interprocess messaging?  An array-based queue is an option
however (though it would mean memmoves on receive), as are free-lists for
nodes, etc.  I guess the easiest thing there would be a lock-free shared slist
for the node free-list, though I couldn't weigh the chance of cache misses from
using old memory blocks vs. just expecting the allocator to be fast.

 On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan<gor.f.gyolchanyan gmail.com> 
wrote:

 Generally, D's message passing is implemented in quite easy-to-use
 way, but far from being fast.
 I dislike the Variant structure, because it adds a huge overhead. I'd
 rather have a templated message passing system with type-safe message
 queue, so no Variant is necessary.
 In specific cases Messages can be polymorphic objects. This will be
 way faster, then Variant.

 On Thu, Feb 9, 2012 at 3:12 PM, Alex_Dovhal<alex_dovhal yahoo.com>  wrote:
 Sorry, my mistake. It's strange to have different 'n', but you measure speed
 as 1000*n/time, so it's doesn't matter if n is 10 times bigger.

-- Bye, Gor Gyolchanyan.


-- Graham St Jack
Feb 09 2012
prev sibling next sibling parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly <sean invisibleduck.org>  
wrote:

 So a queue per message type?  How would ordering be preserved? Also, how  
 would this work for interprocess messaging?  An array-based queue is an  
 option however (though it would mean memmoves on receive), as are  
 free-lists for nodes, etc.  I guess the easiest thing there would be a  
 lock-free shared slist for the node free-list, though I couldn't weigh  
 the chance of cache misses from using old memory blocks vs. just  
 expecting the allocator to be fast.

I didn't yet got around to polish my lock-free SList/DList implementations, but mutexes should only become a problem with high contention when you need to block. You'd also would need some kind of blocking for lock-free lists. Best first order optimization would be to allocate the list node deterministically. The only reason to use GC memory for them is that mallocating is still too cumbersome. Nodes are unshared so you'd want a unique_pointer.
Feb 09 2012
parent reply deadalnix <deadalnix gmail.com> writes:
Le 09/02/2012 20:57, Martin Nowak a écrit :
 On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly <sean invisibleduck.org>
 wrote:

 So a queue per message type? How would ordering be preserved? Also,
 how would this work for interprocess messaging? An array-based queue
 is an option however (though it would mean memmoves on receive), as
 are free-lists for nodes, etc. I guess the easiest thing there would
 be a lock-free shared slist for the node free-list, though I couldn't
 weigh the chance of cache misses from using old memory blocks vs. just
 expecting the allocator to be fast.

I didn't yet got around to polish my lock-free SList/DList implementations, but mutexes should only become a problem with high contention when you need to block. You'd also would need some kind of blocking for lock-free lists.

Doesn't this require shared to be working ?
Feb 10 2012
parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 10 Feb 2012 15:07:41 +0100, deadalnix <deadalnix gmail.com> wrot=
e:

 Le 09/02/2012 20:57, Martin Nowak a =C3=A9crit :
 On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly <sean invisibleduck.or=


g>
 wrote:

 So a queue per message type? How would ordering be preserved? Also,
 how would this work for interprocess messaging? An array-based queue=



 is an option however (though it would mean memmoves on receive), as
 are free-lists for nodes, etc. I guess the easiest thing there would=



 be a lock-free shared slist for the node free-list, though I couldn'=



t
 weigh the chance of cache misses from using old memory blocks vs. ju=



st
 expecting the allocator to be fast.

I didn't yet got around to polish my lock-free SList/DList =


 implementations,
 but mutexes should only become a problem with high contention when yo=


u
 need to block.
 You'd also would need some kind of blocking for lock-free lists.

Doesn't this require shared to be working ?

Shared is already very helpful for writing lock-free stuff. I still need to use atomic load/store to make them portable though. If shared would add memory fences for full sequential order one would rather hack around this using weaker ordering.
Feb 10 2012
prev sibling next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote:

 On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly =

<sean invisibleduck.org> wrote:
=20
 So a queue per message type?  How would ordering be preserved? Also, =


how would this work for interprocess messaging? An array-based queue is = an option however (though it would mean memmoves on receive), as are = free-lists for nodes, etc. I guess the easiest thing there would be a = lock-free shared slist for the node free-list, though I couldn't weigh = the chance of cache misses from using old memory blocks vs. just = expecting the allocator to be fast.
=20
 I didn't yet got around to polish my lock-free SList/DList =

implementations,
 but mutexes should only become a problem with high contention when you =

need to block.
 You'd also would need some kind of blocking for lock-free lists.

No blocking should be necessary for the lock-free list. Just try to = steal a node with a CAS. If the result was null (i.e. if the list ended = up being empty), allocate a node via malloc/GC.
 Best first order optimization would be to allocate the list node =

deterministically. Neat idea. I think I can make that change fairly trivially.=
Feb 09 2012
parent David Nadlinger <see klickverbot.at> writes:
On 2/9/12 11:17 PM, Sean Kelly wrote:
 On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote:
 I didn't yet got around to polish my lock-free SList/DList implementations,
 but mutexes should only become a problem with high contention when you need to
block.
 You'd also would need some kind of blocking for lock-free lists.

No blocking should be necessary for the lock-free list. Just try to steal a node with a CAS. If the result was null (i.e. if the list ended up being empty), allocate a node via malloc/GC.

And the neat thing is that you don't have to worry about node deletion as much when you have a GC… David
Feb 09 2012
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
	charset=us-ascii

On Feb 9, 2012, at 2:17 PM, Sean Kelly wrote:
=20
 Best first order optimization would be to allocate the list node =


deterministically.
=20
 Neat idea.  I think I can make that change fairly trivially.

$ time abc real 0m0.556s user 0m0.555s sys 0m0.001s So another 100ms improvement. Switching to a (__gshared, no mutex) = free-list that falls back on malloc yields: $ time abc real 0m0.505s user 0m0.503s sys 0m0.001s Not as much of a gain there, and I believe we've eliminated all the = allocations (though I'd have to do a pile build to verify). Still, = that's approaching being twice as fast as before, which is definitely = something.=
Feb 09 2012
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Feb 9, 2012, at 2:17 PM, Sean Kelly wrote:

 On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote:
=20
 I didn't yet got around to polish my lock-free SList/DList =


implementations,
 but mutexes should only become a problem with high contention when =


you need to block.
 You'd also would need some kind of blocking for lock-free lists.

=20 No blocking should be necessary for the lock-free list. Just try to =

steal a node with a CAS. If the result was null (i.e. if the list ended = up being empty), allocate a node via malloc/GC. I just realized that the free-list is actually a stack, so doing this = lock-free runs into the ABA problem. There goes the idea of this being = an easy optimization.=
Feb 10 2012
prev sibling parent Nicolae Mihalache <xpromache gmail.com> writes:
That would be funny but it's not true. I tested with different values,
that's why I ended up uploading different versions.

The programs print the computed message rate and takes into account
the number of messages.

mache





On Thu, Feb 9, 2012 at 11:57 AM, Alex_Dovhal <alex_dovhal yahoo.com> wrote:
 "Nicolae Mihalache" <xpromache gmail.com> wrote:
 Hello,

 I'm a complete newbie in D and trying to compare with Java. I
 implemented =A0a simple test for measuring the throughput in message
 passing between threads. I see that Java can pass about 4mil
 messages/sec while D only achieves 1mil/sec. I thought that D should
 be faster.

 The messages are simply integers (which are converted to Integer in Java=


).
 The two programs are attached. I tried compiling the D version with
 both dmd and gdc and various optimization flags.

 mache

Hi, I downloaded your two programs, I didn't run them but noticed that in 'mp.d' you have n set to 100_000_000, while in 'ThroughputMpTest.java' n is set =

to
 10_000_000, so with this D code is 10/4 =3D 2.5 times faster :)

Feb 09 2012