
digitalmars.D.learn - asynchronous communication between threads

reply Charles Hixson <charleshixsn earthlink.net> writes:
If I were to use the below as an asynchronous communication channel, 
would it avoid deadlocks (presuming that only Cell called Msg, and that 
when a thread activated a Cell, the first thing it did was process its 
mailbox)?
Also, if only around 7 cells were created in the main thread, would RAM 
usage remain reasonable? (I.e., when a thread was through with an MCell, 
would only one copy continue to exist in RAM, or would there be a copy 
for each thread that had ever referenced it?) Should MCell instances be 
marked shared?

FWIW, I would expect there to be many thousands of instances of MCell, 
and around 7 threads processing them.  Cells would only interact with 
other cells via the mailbox, which could only receive messages, except 
that the associated cell could also remove messages from it.

class MCell
{
    struct Msg
    {
        int    fromId;
        string msg;
        MCell  from;
    }

    synchronized class MBox
    {
        Msg[] msg;

        void receive (MCell mcell, int idNo, string messg)
        {
            msg ~= Msg (idNo, messg, mcell);
        }
    }

    synchronized class Cell
    {
        int     idNo;

        MCell   next;
        MCell[] up;
        MCell[] down;

        void send (MCell dest, string msg)
        {
            dest.receive (this, idNo, msg);
        }
    }
}
Jan 01 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Jan-2013 03:54, Charles Hixson writes:
 If I were to use the below as an asynchronous communication channel,
 would it avoid deadlocks (presuming that only Cell called Msg) and that
 when a thread activated Cell, the first thing it did was process its
 mailbox?
 Also, if only around 7 cells were created in the main thread, would RAM
 usage remain reasonable (i.e., when a thread was through with an MCell,
 would only one copy continue to exist in RAM?  Or would there be a copy
 for each thread that had ever referenced it?  Should MCell instances be
 marked shared?
That's a lot of questions to ask, and currently I can hardly decipher the whole logic from this and the code below alone; too much is behind the scenes. I'm willing to help out if you could post a more complete example or more precise questions, as e.g. "created 7 cells in the main thread" could be done very differently.

Also a nit: at this size of code (and for me it's unformatted) I'd recommend linking to the code on some paste service, e.g. dpaste http://dpaste.dzfl.pl/.

[snip]

-- 
Dmitry Olshansky
Jan 03 2013
parent reply Charles Hixson <charleshixsn earthlink.net> writes:
On 01/03/2013 08:40 AM, Dmitry Olshansky wrote:
 02-Jan-2013 03:54, Charles Hixson writes:
 If I were to use the below as an asynchronous communication channel,
 would it avoid deadlocks (presuming that only Cell called Msg) and that
 when a thread activated Cell, the first thing it did was process its
 mailbox?
 Also, if only around 7 cells were created in the main thread, would RAM
 usage remain reasonable (i.e., when a thread was through with an MCell,
 would only one copy continue to exist in RAM? Or would there be a copy
 for each thread that had ever referenced it? Should MCell instances be
 marked shared?
That's a lot of questions to ask, and currently I can hardly decipher the whole logic from this and the code below alone; too much is behind the scenes. I'm willing to help out if you could post a more complete example or more precise questions, as e.g. "created 7 cells in the main thread" could be done very differently.

Also a nit: at this size of code (and for me it's unformatted) I'd recommend linking to the code on some paste service, e.g. dpaste http://dpaste.dzfl.pl/.

[snip]
I'm trying to design a program (I'm not writing it yet) that requires asynchronous communication between LOTS of separate "cells". You can think of it as vaguely like a neural network, but it doesn't fit the definition; that's just an analogy.

I thought that D would be a good language to do this in, but all the built-in approaches seem to be "overly protective", locking a lot more than I want and yielding deadlocks, or copying more than I want and ensuring that I end up with stale data.

The approach that I've come up with to avoid this is to attach a mailbox to each "cell". In an ideal world, each cell would be a separate thread, but as there will be at least hundreds of thousands of them, and likely millions, that's unreasonable. So I need the threads to cycle through pools of "cells". Each cell may send messages to any other cell (not really, but you don't know in advance what the connections will be).

That skeleton of code was intended to show the idea of cells isolated from the outside, attached to a mailbox which has blocking access from the outside. The cells should never block, but this is largely because they are only directly accessed by the thread within which they run. The mailboxes can receive messages from anyone, but their processes are so short that blocking will be extremely brief.

I wanted to know if this proposed design would work, as in not getting into deadlocks, not blocking excessively, and not giving me excessively stale data. More details aren't available, because I didn't want to commit to this design if the basic design was wrong, so I haven't written them. It has been suggested that since the cell will only be accessed by the thread, it doesn't need to be synchronized.

I'm really nervous about how many copies of the cell will exist, however. Since there are going to be so many of them, if I ended up with a copy per thread, the system would bog down in unused storage. But the mailbox needs to be globally accessible for the scheme to work. 
N.B.: When a cell receives a message from another cell it's likely, but not guaranteed, to send a response back. It may also send responses onwards. And the number of cells isn't fixed, nor is the number of their connections. (This is less important in D than in many other languages, but even in D it affects serialization.) FWIW, I've been told that an approximately similar design has worked in Ada, though the design was written in Ada-83. (In Ada the mailbox was protected, which I take to be approximately the same as synchronized.)
Jan 03 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Jan-2013 22:38, Charles Hixson writes:
 On 01/03/2013 08:40 AM, Dmitry Olshansky wrote:
 02-Jan-2013 03:54, Charles Hixson writes:
 If I were to use the below as an asynchronous communication channel,
 would it avoid deadlocks (presuming that only Cell called Msg) and that
 when a thread activated Cell, the first thing it did was process its
 mailbox?
 Also, if only around 7 cells were created in the main thread, would RAM
 usage remain reasonable (i.e., when a thread was through with an MCell,
 would only one copy continue to exist in RAM? Or would there be a copy
 for each thread that had ever referenced it? Should MCell instances be
 marked shared?
That's a lot of questions to ask, and currently I can hardly decipher the whole logic from this and the code below alone; too much is behind the scenes. I'm willing to help out if you could post a more complete example or more precise questions, as e.g. "created 7 cells in the main thread" could be done very differently.

Also a nit: at this size of code (and for me it's unformatted) I'd recommend linking to the code on some paste service, e.g. dpaste http://dpaste.dzfl.pl/.

[snip]
I'm trying to design a program (I'm not writing it yet) that requires asynchronous communication between LOTS of separate "cells". You can think of it as vaguely like a neural network, but it doesn't fit the definition; that's just an analogy.

I thought that D would be a good language to do this in, but all the built-in approaches seem to be "overly protective", locking a lot more than I want and yielding deadlocks, or copying more than I want and ensuring that I end up with stale data.

The approach that I've come up with to avoid this is to attach a mailbox to each "cell". In an ideal world, each cell would be a separate thread, but as there will be at least hundreds of thousands of them, and likely millions, that's unreasonable. So I need the threads to cycle through pools of "cells". Each cell may send messages to any other cell (not really, but you don't know in advance what the connections will be).
So a cell is in fact a task (a more common name for it, I think) with a mailbox, and you want threads multiplexed across these tasks. The task is running some code/function/callback/whatever that periodically polls its mailbox & puts stuff in other tasks' mailboxes. So far so good?

Then definitely take a look at Fiber in druntime (it's core.Fiber AFAIK).
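For readers unfamiliar with fibers, here is a minimal sketch of the multiplexing idea using only core.thread.Fiber; the cell body is illustrative, standing in for "drain mailbox, then process":

```d
import core.thread : Fiber;

void main()
{
    int[] log;

    // A fiber runs until it calls yield(), then resumes right after the
    // yield the next time call() is invoked -- which is how many "cells"
    // can be multiplexed onto a few real threads.
    auto cell = new Fiber({
        log ~= 1;        // first activation: e.g. drain the mailbox
        Fiber.yield();
        log ~= 2;        // next activation: continue processing
    });

    cell.call();         // runs until the yield
    cell.call();         // resumes after the yield and finishes
    assert(log == [1, 2]);
}
```

A scheduler thread would keep an array of such fibers and `call()` each in turn; a fiber that has nothing to do just yields immediately.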
 That skeleton of code was intended to show the idea of cells isolated
 from outside attached to a mailbox, which has blocking access from the
 outside.  The cells should never block, but this is largely because they
 are only directly accessed by the thread within which they run.  The
 mailboxes can receive messages from anyone, but their processes are so
 short, that blocking will be extremely brief.
I'm not so optimistic about locking. Even though it's brief, there are many threads that may be putting stuff simultaneously into mailboxes, thus contending on the lock and causing context switches. Plus the lock/unlock of a mutex is not free. The lock-based message queue is nice for a start (fewer _subtle_ bugs) but you should look into lock-free alternatives later on.
 I wanted to know if this proposed design would work, as in not getting
 into deadlocks, not blocking excessively, and not giving me excessively
 stale data.
The crucial part is missing - taking a message out of the mailbox ;)

But anyway, let's focus on the details. 2 classes and 2 functions: Cell's send & MBox's receive. Let's suppose we have 2 Cells A & B and their mboxes MA & MB. From the code I see (that's full of typos? MCell & Cell are used interchangeably) the chain of events for a full send from A --> B is:

1. A Cell's send locks cell A. (send is sync-ed)
2. It locks target cell B.
3. It then locks its mailbox MB.
4. It undoes all the locks backwards.

Then there is of course a deadlock bound to happen if B is sending a message in the opposite direction, e.g.:

1. A locks A (making its first step)
2. B locks B (ditto)
3. A locks B & blocks
4. B locks A & blocks

That is, if for instance there is step 2. So I guess you haven't meant step 2. If there is no lock of the cell except before sending, then it looks legit, as there are 2 locks protecting separate entities:

- one is to protect the message queue on putting a message into it
- the other one is to protect ... what exactly? send is already implicitly guarded by the target queue's lock.

So I'd say you only need to guard the message queue and that's about it. The only other concern is properly scheduling the execution of tasks (or your cells) so that one runs on no more than one thread at any given time.

In the simplest case, just make a locked (or lock-free) queue of these and let threads pick cells/tasks from it and put them back when they are done.

Far better is a large bounded queue that you never ever remove/put stuff into. It's just a big array of pointers/references to tasks. A thread then just goes around it looking for valid/ready entries (they stay allocated, so no nulls there) and executes them. That goes firmly into the lock-free zone to make it correctly synchronized, though with a bit of care it should be doable.

The second one also can be done with locks. In this case a thread goes through all of the tasks/cells and tries to lock them (that's where your lock around the cell comes in, is it?). If it locks - cool, work on it; if not - try the next one.
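That lock-based scan could look roughly like this (names are illustrative, and this single-threaded demo only shows the tryLock skip-if-busy mechanics; in the real thing each worker thread would loop over the shared cell table):

```d
import core.sync.mutex : Mutex;

final class Cell
{
    Mutex lock;
    int activations;
    this() { lock = new Mutex; }
    void work() { ++activations; }   // stand-in for mailbox drain + processing
}

// One pass of a worker over the shared cell table: cells currently locked
// by another worker are simply skipped rather than waited on.
void scanOnce(Cell[] cells)
{
    foreach (c; cells)
    {
        if (c.lock.tryLock())
        {
            scope (exit) c.lock.unlock();
            c.work();
        }
    }
}

void main()
{
    auto cells = [new Cell, new Cell, new Cell];
    scanOnce(cells);
    scanOnce(cells);
    foreach (c; cells)
        assert(c.activations == 2);  // uncontended here, so nothing was skipped
}
```

The per-cell Mutex here is exactly the "lock around the cell" being discussed: it guarantees a cell never runs on two threads at once, without blocking the scanning thread.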
 More details aren't available, because I didn't want to
 commit to this design if the basic design was wrong, so I haven't
 written them.  It has been suggested that since the cell will only be
 accessed by the thread, it doesn't need to be synchronized.
 I'm really nervous about how many copies of the cell will exist,
 however.  Since there are going to be so many of them, if I ended up
 with a cell/thread, the system would bog down in unused storage.  But
 the mailbox needs to be globally accessible for the scheme to work.
Everything on the heap is accessible from other threads, provided they have the pointer to the object in question.
 N.B.:  When a cell receives a message from another cell it's likely, but
 not guaranteed, to send a response back.  It may also send responses
 onwards.  And the number of cells isn't fixed, nor is the number of
 their connections.  (This is less important in D than in many other
 languages, but even in D it affects serialization.)

 FWIW, I've been told that an approximately similar design has worked in
 Ada, though the design was written in Ada-83.  (In Ada the mailbox was
 protected, which I take to be approximately the same as synchronized.)
In general there are ways to make it fly. The tricks to use depend on the use case and on what the bottleneck is (I/O or CPU time). The pain points are faster mailboxes and better scheduling (as in fewer context switches for nothing, faster turn-around time, etc.).

-- 
Dmitry Olshansky
Jan 03 2013
parent reply Charles Hixson <charleshixsn earthlink.net> writes:
On 01/03/2013 11:34 AM, Dmitry Olshansky wrote:
 03-Jan-2013 22:38, Charles Hixson writes:
 On 01/03/2013 08:40 AM, Dmitry Olshansky wrote:
 02-Jan-2013 03:54, Charles Hixson writes:
 If I were to use the below as an asynchronous communication channel,
 would it avoid deadlocks (presuming that only Cell called Msg) and that
 when a thread activated Cell, the first thing it did was process its
 mailbox?
 Also, if only around 7 cells were created in the main thread, would RAM
 usage remain reasonable (i.e., when a thread was through with an MCell,
 would only one copy continue to exist in RAM? Or would there be a copy
 for each thread that had ever referenced it? Should MCell instances be
 marked shared?
That's a lot of questions to ask, and currently I can hardly decipher the whole logic from this and the code below alone; too much is behind the scenes. I'm willing to help out if you could post a more complete example or more precise questions, as e.g. "created 7 cells in the main thread" could be done very differently.

Also a nit: at this size of code (and for me it's unformatted) I'd recommend linking to the code on some paste service, e.g. dpaste http://dpaste.dzfl.pl/.

[snip]
I'm trying to design a program (I'm not writing it yet) that requires asynchronous communication between LOTS of separate "cells". You can think of it as vaguely like a neural network, but it doesn't fit the definition; that's just an analogy.

I thought that D would be a good language to do this in, but all the built-in approaches seem to be "overly protective", locking a lot more than I want and yielding deadlocks, or copying more than I want and ensuring that I end up with stale data.

The approach that I've come up with to avoid this is to attach a mailbox to each "cell". In an ideal world, each cell would be a separate thread, but as there will be at least hundreds of thousands of them, and likely millions, that's unreasonable. So I need the threads to cycle through pools of "cells". Each cell may send messages to any other cell (not really, but you don't know in advance what the connections will be).
So a cell is in fact a task (a more common name for it, I think) with a mailbox, and you want threads multiplexed across these tasks. The task is running some code/function/callback/whatever that periodically polls its mailbox & puts stuff in other tasks' mailboxes. So far so good?

Then definitely take a look at Fiber in druntime (it's core.Fiber AFAIK).
It's in core.Thread. A fiber would work well for the cell, but it's been pointed out to me that with this design an ordinary class instance would also work, so for that it's probably overkill. Which leaves the mailbox unhandled.

"Periodic" polling, as in time-based, is too expensive. Periodic polling as in "once in each cycle" is what is planned, and that part doesn't require the overhead of threads, etc., except to keep partial messages from being read. But the documentation of Fiber *strongly* implies that the contents of the mailbox would be copied from its normal location to the active thread, if it's not the same thread within which the cell executes. So it looks to me as if the mailbox needs to be synchronized. But I'm not really sure what synchronized means in D.

(N.B.: The cell is logically a task, but if it were implemented as such, then the mailbox would be unnecessary. But the overhead would be unbearable.)
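As a rough answer to that last question: in D, `synchronized` on a method or statement locks the hidden monitor mutex that every class object carries, so at most one thread at a time can be inside the guarded region for a given instance. A minimal mailbox sketch with illustrative names:

```d
class MBox
{
    private int[] queue;

    // synchronized (this) locks this object's monitor for the duration
    // of the block -- the same lock a `synchronized` method would take.
    void put(int m)
    {
        synchronized (this)
        {
            queue ~= m;
        }
    }

    // Only the owning cell calls this, but it still takes the lock so a
    // concurrent put() can't interleave with the drain.
    int[] drain()
    {
        synchronized (this)
        {
            auto all = queue;
            queue = null;     // let the GC reclaim the old array later
            return all;
        }
    }
}

void main()
{
    auto box = new MBox;
    foreach (i; 0 .. 3) box.put(i);
    assert(box.drain() == [0, 1, 2]);
    assert(box.drain().length == 0);
}
```

Note that nothing here copies the queue between threads; all threads see the same heap object, and the lock only serializes access to it.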
 That skeleton of code was intended to show the idea of cells isolated
 from outside attached to a mailbox, which has blocking access from the
 outside. The cells should never block, but this is largely because they
 are only directly accessed by the thread within which they run. The
 mailboxes can receive messages from anyone, but their processes are so
 short, that blocking will be extremely brief.
I'm not so optimistic about locking. Even though it's brief, there are many threads that may be putting stuff simultaneously into mailboxes, thus contending on the lock and causing context switches. Plus the lock/unlock of a mutex is not free. The lock-based message queue is nice for a start (fewer _subtle_ bugs) but you should look into lock-free alternatives later on.
The number of cells is huge when compared with the number of available threads, on the rough order of 1,000,000 to 6. Because of this I don't expect contention to be a big problem. OTOH, the number of cores per processor is going *up*, so a lock-free system would be very desirable.

But the only lock-free design I'm familiar with is CSMA/CD (Ethernet). And to implement that, I think the mailboxes would need to be shared, and this requires some protocol such as CSMA/CD for each access. It's not clear to me that this would have less overhead than would synchronized (which locks, if I understand correctly, the class instance when someone else has write access to it). Certainly TDPL talks about them sequentially, and implies that there is a tight connection between shared variables and synchronized classes. It is as if a synchronized class is the analog of an atomic operation acting on a larger set of data.

And when I look at implementing CSMA/CD, it looks as if only one thread can detect that its write was interrupted, and then only after it's complete. Unless I implement a checksum scheme, in which case I guess both could detect an improperly completed write. But then independent readers couldn't detect the problem. So readers would need to be locked out while writing was in process... and we're back to mutexes/spinlocks/etc.

So it looks to me as if a synchronized mailbox class eliminates a multitude of problems at the lowest reasonable cost. If it works as I hope it does. But if the mailbox ends up getting copied from thread address space to thread address space, then the overhead would be much higher than a naive estimate, and I can't guess how high. It *could* scale so poorly that 90% of the computation consisted of mailboxes being copied from thread to thread.
 I wanted to know if this proposed design would work, as in not getting
 into deadlocks, not blocking excessively, and not giving me excessively
 stale data.
The crucial part is missing - taking a message out of the mailbox ;)
The only entity permitted to take a message from the mailbox is the cell to which it is attached, and that happens at each initiation of activation. But that's one reason that the mailbox should be synchronized.
 But anyway let's focus on the details. 2 classes and 2 functions. Cell's
 send & MBox's receive.

 Let's suppose we have 2 Cells A & B and their mboxes MA & MB.
  From the code I see (that's full of typos? MCell & Cell are used
 interchangeably) the chain of events for a full send from A --> B is:

 1. A Cell's send locks cell A. (send is sync-ed)
Why does it lock the cell? But perhaps this is because the proffered cell class was synchronized.
 2. It locks target cell B.
Why does it lock cell B? Cell A makes no access to cell B, only to the mailbox. Which is why in that rough design cell and mailbox were separate (synchronized) classes at the same level. If it locks Cell B, then the design won't work. (OTOH, the Cell function doesn't need to be synchronized anyway; that's a remnant from a prior design.)
 3. It then locks its mailbox MB.
Check.
 4. undoes all the locks backwards.
Check.
 Then there is of course a deadlock bound to happen if B is sending
 message in opposite direction, e.g. :
 1. A locks A (making its first step)
 2. B lock B (ditto)
 3. A locks B & blocks
 3. B locks A & blocks
If the cells get locked, then the design will not work. No argument. I was attempting to avoid that by making the mailbox a separate class, and having each cell only contact the other cell's mailbox. If that doesn't work, then the design is not going to work, and I'll need to create a new one.
 That is, if for instance there is step 2. So I guess you haven't meant
 step 2.

 If there is no lock of the cell except before sending then it looks
 legit as there are 2 locks protecting separate entities:

 - one is to protect message queue on putting message into it

 - the other one is to protect ... what exactly? send is already
 implicitly guarded by target queue's lock.
Yeah. Sorry, that was sloppy thinking on my part. The cell doesn't need to be synchronized. (But I don't understand why that would be destructive of anything except efficiency.)
 So I'd say you only need to guard the message queue and that's about it.
 The only other concern is properly scheduling the execution of tasks (or
 your cells) so that one runs on no more than one thread at any given time.
The idea here is that each cell has an id#, and there is a pool of threads. Each cell is accessible by the thread whose sequence number is the same as id# mod # of threads. Each thread loops through the cells accessible to it. When a cell is activated by the thread, first it checks its mailbox to update its state. Then it processes, possibly sending messages to the mailboxes of other cells in a task-independent way. (It won't know about threads, only about mailboxes.) Execution then passes on to the next cell in the thread.
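That partitioning rule (id# mod # of threads) can be sketched as follows; `activate` is a hypothetical stand-in for "drain mailbox, then process":

```d
int ownerOf(int id, int nThreads)
{
    return id % nThreads;   // the only thread allowed to run cell `id`
}

// One cycle of one worker thread: visit exactly the cells it owns.
void runCycle(int threadIdx, int nThreads, int nCells,
              void delegate(int) activate)
{
    for (int id = threadIdx; id < nCells; id += nThreads)
        activate(id);
}

void main()
{
    int[] visited;
    runCycle(0, 3, 10, (int id) { visited ~= id; });
    assert(visited == [0, 3, 6, 9]);   // thread 0's share of 10 cells
    assert(ownerOf(7, 3) == 1);        // cell 7 belongs to thread 1
}
```

Because ownership is a pure function of the id, no two threads ever touch the same cell body, so the cells themselves need no locking; only the mailboxes, which any thread may write to, do.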
 In simplest case just make a locked (or lock-free) queue of these and
 let threads pick cells/tasks from it and put back when they are done.
A lock-free queue would do it, I think, but the only ones I know of are attached to the thread rather than to a data instance, and that WOULD lead to massive contention, and increasing contention as the number of cores (threads) increased.
 Far better is a large bounded queue that you never ever remove/put stuff
 into. It's just a big array of pointers/references to task. A thread
 then just goes around it looking for valid/ready entries (they stay
 allocated so no nulls there) and executes them. That goes firmly into
 lock-free zone to make it correctly synchronized though but with a bit
 of care it should be doable.
It would be quite reasonable to make the queue bounded, and reject messages when the mailbox was full. But, again, if it's attached to the thread rather than to the object there is going to be a massive problem with contention.
 The second one also can be done with locks. In this case a thread goes
 through all of tasks/cells and tries to lock them (That's where your
 lock around it comes in, is it?). If it locks - cool, work on it, if not
 - try the next one.
Good point. If the cell is initiated and can't read its mailbox (this action needs a write lock, as it removes messages), then going on to the next cell rather than blocking is better. (Again, if the mailbox is attached to the thread, this won't work; all mailboxes will be nearly constantly in contention.)
  > More details aren't available, because I didn't want to
 commit to this design if the basic design was wrong, so I haven't
 written them. It has been suggested that since the cell will only be
 accessed by the thread, it doesn't need to be synchronized.
 I'm really nervous about how many copies of the cell will exist,
 however. Since there are going to be so many of them, if I ended up
 with a cell/thread, the system would bog down in unused storage. But
 the mailbox needs to be globally accessible for the scheme to work.
Everything on the heap is accessible from other threads, provided they have the pointer to the object in question.
Good. And since I'm working with classes, everything is really on the heap, which means that only pointers will get copied. (Well, references. I avoid using pointers in my code unless I really can't avoid them.)
 N.B.: When a cell receives a message from another cell it's likely, but
 not guaranteed, to send a response back. It may also send responses
 onwards. And the number of cells isn't fixed, nor is the number of
 their connections. (This is less important in D than in many other
 languages, but even in D it affects serialization.)

 FWIW, I've been told that an approximately similar design has worked in
 Ada, though the design was written in Ada-83. (In Ada the mailbox was
 protected, which I take to be approximately the same as synchronized.)
In general there are ways to make it fly. The tricks to use depend on the use case and on what the bottleneck is (I/O or CPU time). The pain points are faster mailboxes and better scheduling (as in fewer context switches for nothing, faster turn-around time, etc.).
The bottleneck will be CPU time (given sufficient RAM). No way to avoid that.

Stuffing things into a mailbox is going to be basically copying a struct. (That hasn't been designed yet, but it's going to include a ref to the sender, a requested action, a numeric value, and I'm not sure what else. The "requested action" will probably be represented as an enum. I'm probably going to avoid strings, as they don't appear to be valuable, even though that's just a reference copy.) So say 128 bytes of parameter, or possibly 256. And receiving a message is copying the parameters into a queue.

Perhaps I could remove the synchronization from the class, and just guard calculating the index position into the queue, as once the position in the queue was known, there wouldn't be any contention WRT where the message would be stored. That should be a very fast design, but I worry about how much RAM would be required. With a pre-allocated queue, each mailbox would consume the maximal amount of space even if none of it were used. So if there were space for 20 messages, then the mailbox would consume 5,120 bytes plus overhead for each cell. Which means that if I have 500,000 cells (an extremely lowball figure), just the mailboxes would consume 2,560,000,000 bytes plus overhead. True, most of that would never be accessed, but enough accesses would be randomly distributed throughout the system that I would expect thrashing... or even failure due to inability to allocate memory.

This would be compressed significantly if I used a thread-attached mailbox, at the cost of nearly guaranteed massive contention problems. And 20 messages is an unreasonably low upper limit, even though it's too high for a mean value. (I expect the mean value to be closer to 3-4.) So I'd been planning on using variable-length arrays for the mailbox, which would be deallocated every time the cell accessed them. So messages could be attached by simply doing an append.

This would place the message queue on the heap, with an initially quite low value for the maximal # of messages. I'll admit that this may only look better because I can't estimate the amount of RAM consumed.

Perhaps I need to use some sort of disk cache of a data file, and only have the most active cells in memory at any particular time. This would still result in a lot of thrashing, but in a more controlled manner. I've only around 8 GB of actual memory, and it looks like 7.8 GB total memory if one includes virtual memory. Perhaps this needs to be upgraded... which would probably mean I upgraded the number of cores available, meaning increased contention. But I could clearly upgrade the amount of virtual memory with just a repartitioning of the disk. It's not been needed so far, but then I've just been designing this system, not implementing it. OTOH, virtual RAM is thrashing, so it's not clear how much that would help over, say, a B-tree that rolled out relatively inactive cells, even though each cell would need to be RAM resident at least once per cycle.

That said, the application I'm designing would probably overstress any computer I could afford. I'm almost resigned to that. I just want to come as close to reasonable speed of execution as possible, and I clearly want to avoid deadlocks and data that doesn't get properly updated.

OTOH, rolling things out to a B+Tree means that I need to devise a way to access the mailbox based around the target's id# rather than around a reference to the item. A hash table is the obvious solution, but the class managing the hash table would need to roll the cell+mailbox in if it isn't RAM resident. Not something that's reasonable to do while the mailbox access is locked. So the mailbox would need to queue the request for access to the cell's mailbox with a thread, stuff the message in a "to be acted upon" queue, and return. And where that queue should be stored isn't clear to me.

Perhaps that's where the thread-attached non-blocking queue should come into play. Also, hash tables have enough overhead themselves that they would limit the number of RAM-resident cells considerably over prior estimates, even while increasing the total number that could be dealt with. Probably they would halve the number of RAM-resident cells (a rough estimate, admittedly), while expanding the total number of cells that could be handled to be limited by available disk space. They would also impose a severe performance penalty. (In prior small-scale tests, hash table access has been very roughly about 1/10 as fast as variable-length array access. Of course, this is still a lot faster than disk file access.)

Still, it's sounding like the basic design will work, unless a cell calling the mailbox of another cell locks that cell, and I don't see any reason why it should... but I've been repeatedly surprised by the requirements imposed on concurrently executing threads of execution.
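The message layout sketched above might look like this (the field names and Action values are guesses, not the actual design; the point is that the struct is small and fixed-size, so an append is a cheap copy, and `queue = null` is the per-cycle deallocation):

```d
enum Action { excite, inhibit, query }   // illustrative action codes

class Cell { int id; this(int id) { this.id = id; } }

struct Msg
{
    Cell   from;     // reference back to the sender, not a copy of it
    Action action;
    double value;
}

void main()
{
    auto sender = new Cell(42);

    Msg[] queue;                               // variable-length mailbox
    queue ~= Msg(sender, Action.excite, 0.5);  // append = small struct copy
    queue ~= Msg(sender, Action.query,  1.0);

    assert(queue.length == 2);
    assert(queue[0].from.id == 42);            // only the reference was copied

    queue = null;   // after draining: hand the old array back to the GC
}
```

Since the array starts empty and grows on demand, idle cells cost almost nothing, unlike the pre-allocated 20-slot queue estimated above.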
Jan 03 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
04-Jan-2013 10:36, Charles Hixson writes:
[snip]
 So a cell is in fact a task (a more common name for it, I think) with a
 mailbox, and you want threads multiplexed across these tasks. The task
 is running some code/function/callback/whatever that periodically polls
 its mailbox & puts stuff in other tasks' mailboxes. So far so good?

 Then definitely take a look at Fiber in druntime (it's core.Fiber AFAIK).
It's in core.Thread. A fiber would work well for the cell, but it's been pointed out to me that with this design an ordinary class instance would also work, so for that it's probably overkill. Which leaves the mailbox unhandled.

"Periodic" polling, as in time-based, is too expensive. Periodic polling as in "once in each cycle" is what is planned, and that part doesn't require the overhead of threads, etc., except to keep partial messages from being read. But the documentation of Fiber *strongly* implies that the contents of the mailbox would be copied from its normal location to the active thread, if it's not the same thread within which the cell executes. So it looks to me as if the mailbox needs to be synchronized. But I'm not really sure what synchronized means in D.

(N.B.: The cell is logically a task, but if it were implemented as such, then the mailbox would be unnecessary. But the overhead would be unbearable.)
 That skeleton of code was intended to show the idea of cells isolated
 from outside attached to a mailbox, which has blocking access from the
 outside. The cells should never block, but this is largely because they
 are only directly accessed by the thread within which they run. The
 mailboxes can receive messages from anyone, but their processes are so
 short, that blocking will be extremely brief.
I'm not so optimistic about locking. Even though it's brief, there are many threads that may be putting stuff simultaneously into mailboxes, thus contending on the lock and causing context switches. Plus, the lock/unlock of a mutex is not free. The lock-based message queue is nice for a start (fewer _subtle_ bugs) but you should look into lock-free alternatives later on.
The number of cells is huge when compared with the number of available threads, on the rough order of 1,000,000 to 6. Because of this I don't expect contention to be a big problem. OTOH, the number of cores/processor is going *up*, so a lock-free system would be very desirable. But the only lock-free design I'm familiar with is CSMACD (ethernet). And to implement that, I think the mailboxes would need to be shared. And this requires some protocol such as CSMACD for each access. It's not clear to me that this would have less overhead than would synchronized (which locks, if I understand correctly, the class instance when someone else has write access to it).
CSMACD has the advantage of both parties knowing what they send and the ability to detect a collision before the end of the transfer(!). In software design, lock-free has analogous power but on the scale of one memory location (one word). This would imply first working on the side (creating the message etc.), then updating state in one atomic op; if there was a conflict (somebody else put their word into memory here) you redo the updating step.
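In D that pattern can be sketched with `core.atomic.cas`: prepare the node off to the side, publish it with one compare-and-swap, and redo on conflict. A hedged sketch (the `__gshared` global sidesteps `shared` type-checking, and names here are made up):

```d
import core.atomic;

struct Node { int payload; Node* next; }

__gshared Node* head;   // intrusive LIFO of pending messages; null = empty

void push(Node* n)      // any number of producer threads
{
    for (;;)
    {
        auto old = atomicLoad(head);
        n.next = old;                    // work done "on the side"
        if (cas(&head, old, n))
            break;                       // published in one atomic op
        // conflict: someone else got there first - redo the update step
    }
}

Node* drainAll()        // the single consumer (the owning cell)
{
    for (;;)
    {
        auto old = atomicLoad(head);
        if (cas(&head, old, cast(Node*) null))
            return old;                  // note: list comes back LIFO-ordered
    }
}
```

This is exactly the multi-producer, single-consumer shape of the mailbox: many cells push, one cell drains the whole list in a single atomic swap.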
 Certainly TDPL
 talks about them sequentially, and implies that there is a tight
 connection between shared variables and synchronized classes.  It is as
 if a synchronized class is the analog of an atomic operation acting on a
 larger set of data.  And when I look at implementing CSMACD it looks as
if only one thread can detect that its write was interrupted, and then
 only after it's complete.  Unless I implement a checksum scheme, in
 which case I guess both could detect an improperly completed write. But
 then independent readers couldn't detect the problem.  So readers would
 need to be locked out while writing was in process...and we're back to
 mutexes/spinlocks/etc.
 So it looks to me as if a synchronized mailbox class eliminates a
multitude of problems at the lowest reasonable cost.
Given the million cells, you are going to lock things that need not be locked most of the time. It's hardly a reasonable cost.
 If it works as I
 hope it does.  But if the mailbox ends up getting copied from thread
 address space to thread address space, then the overhead would be much
 higher than a naive estimate, and I can't guess how high.  It *could*
 scale so poorly that 90% of the computation consisted of mailboxes being
 copied from thread to thread.
I don't quite get your "attached to thread" (here and later). You mean as in std.concurrency? Well, it was designed this way, that's all. There is no reason why a data structure has to be "attached" to a thread. And you talk like it's not you who controls what gets copied and where :) As for the fear of things being copied between threads: if anything, threads share a single address space (a fact of life). If you want multiple processes then yes, different address spaces, and some copying is bound to happen.
 I wanted to know if this proposed design would work, as in not getting
 into deadlocks, not blocking excessively, and not giving me excessively
 stale data.
The crucial part is missing - taking a message out of the mailbox ;)
The only entity permitted to take a message from the mailbox is the cell to which it is attached, and that happens at each initiation of activation. But that's one reason that the mailbox should be synchronized.
Then here is the thing: if you're unlucky, a thread that sends will get blocked by the thread that reads messages. It's not deadlock, but it is a waste of time.
 But anyway let's focus on the details. 2 classes and 2 functions. Cell's
 send & MBox's receive.

 Let's suppose we have 2 Cells A & B and their mboxes MA & MB.
  From the code I see (that's full of typos? MCell & Cell are used
 interchangeably) the chain of event for full send from A --> B is:

 1. A Cell's send locks cell A. (send is sync-ed)
Why does it lock the cell? But perhaps this is because the proffered cell class was synchronized.
Yup.
 2. It locks target cell B.
Why does it lock cell B? Cell A makes no access to cell B. Only to the mailbox. Which is why in that rough design cell and mailbox were separate (synchronized) classes at the same level. If it locks Cell B, then the design won't work. (OTOH, the Cell function doesn't need to be synchronized anyway, that's a remnant from a prior design.)
 3. It then locks its mailbox MB.
Check.
 4. undoes all the locks backwards.
Check.
 Then there is of course a deadlock bound to happen if B is sending
 message in opposite direction, e.g. :
 1. A locks A (making its first step)
 2. B lock B (ditto)
 3. A locks B & blocks
 3. B locks A & blocks
If the cells get locked, then the design will not work. No argument. I was attempting to avoid that by making the mailbox a separate class, and having each cell only contact the other cell's mailbox. If that doesn't work, then the design is not going to work, and I'll need to create a new one.
That is, if for instance there is a step 2. So I guess you didn't mean step 2.

 If there is no lock of the cell except before sending then it looks
 legit as there are 2 locks protecting separate entities:

 - one is to protect message queue on putting message into it

 - the other one is to protect ... what exactly? send is already
 implicitly guarded by target queue's lock.
Yeah. Sorry, that was sloppy thinking on my part. The cell doesn't need to be synchronized. (But I don't understand why that would be destructive of anything except efficiency.)
I find the extra locking confusing as well, as it makes it look like nobody knows where synchronization is actually required.
 So I'd say you only need to guard the message queue and that's about it.
 The only other concern is properly scheduling the execution of tasks (or
your cells) so that one runs on no more than one thread at any given
 time.
The idea here is that each cell has an id#, and there is a pool of threads. Each cell is accessible by the thread whose sequence number is the same as id# mod # of threads. Each thread loops through the cells accessible to it. When a cell is activated by the thread, first it checks its mailbox to update its state. Then it processes, possibly sending messages to the mailboxes of other cells in a task-independent way. (It won't know about threads, only about mailboxes.) Execution then passes on to the next cell in the thread.
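A sketch of that partitioning scheme (all names here are made up for illustration): each thread owns exactly the cells whose id falls in its residue class, so no cell is ever touched by two threads and the cell itself needs no lock.

```d
// Sketch: cells partitioned across a fixed thread pool by id mod N.
enum nThreads = 6;

class Cell
{
    size_t idNo;
    int state;
    this(size_t id) { idNo = id; }

    void step()
    {
        // 1) drain own mailbox, 2) update state, 3) post to other
        // cells' mailboxes (never touch the other cells themselves)
        ++state;
    }
}

// Body of worker thread `tid` (0 <= tid < nThreads); one pass = one cycle.
void onePass(Cell[] cells, size_t tid)
{
    foreach (c; cells)
        if (c.idNo % nThreads == tid)   // this thread owns this cell
            c.step();
}
```

Since ownership is a pure function of the id, the partition needs no coordination at all; the only shared state left is the mailboxes.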
So the code to process messages is the same in each cell? What about connections (the links to other cells) - are they permanent?
 In simplest case just make a locked (or lock-free) queue of these and
 let threads pick cells/tasks from it and put back when they are done.
A lock-free queue would do it, I think, but the only ones I know of are attached to the thread rather than to a data instance, and that WOULD lead to massive contention. And increasing contention as the number of cores (threads) increased.
Yup, separating by ID (a round robin sort of thing) could be far better. That is, if the workload on cells is trivially balanced by partitioning cells beforehand (I bet not).
Far better is a large bounded queue that you never remove from or append to. It's just a big array of pointers/references to tasks. A thread
 then just goes around it looking for valid/ready entries (they stay
 allocated so no nulls there) and executes them. That goes firmly into
 lock-free zone to make it correctly synchronized though but with a bit
 of care it should be doable.
It would be quite reasonable to make the queue bounded, and reject messages when the mailbox was full. But, again, if it's attached to the thread rather than to the object there is going to be a massive problem with contention.
Again I don't get the attached to thread thing. TLS? Just don't put it in TLS then ;)
 The second one also can be done with locks. In this case a thread goes
 through all of tasks/cells and tries to lock them (That's where your
 lock around it comes in, is it?). If it locks - cool, work on it, if not
 - try the next one.
Good point. If the cell is initiated and can't read its mailbox (this action needs a write lock, as it removes messages), then going on to the next cell rather than blocking is better. (Again, if the mailbox is attached to the thread, this won't work. All mailboxes will be nearly constantly in contention.)
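The skip-instead-of-block idea looks like this with druntime's `Mutex.tryLock` (a sketch; the mailbox is simplified to a plain array):

```d
import core.sync.mutex;

// Sketch: the cell tries to grab its mailbox. If a producer holds the
// lock right now, report failure so the worker thread can move on to
// the next cell and revisit this one on the next pass.
bool tryDrain(Mutex lock, ref int[] msgs, out int[] batch)
{
    if (!lock.tryLock())
        return false;          // contended - don't block, just skip
    scope(exit) lock.unlock();
    batch = msgs;              // take ownership of the pending messages
    msgs = null;
    return true;
}
```

One caveat: druntime's Mutex is recursive, so `tryLock` from the *same* thread that already holds the lock still succeeds; the skip only happens when a different thread holds it, which is exactly the producer-collision case being discussed.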
  > More details aren't available, because I didn't want to
 commit to this design if the basic design was wrong, so I haven't
 written them. It has been suggested that since the cell will only be
accessed by the thread, it doesn't need to be synchronized.
 I'm really nervous about how many copies of the cell will exist,
 however. Since there are going to be so many of them, if I ended up
 with a cell/thread, the system would bog down in unused storage. But
 the mailbox needs to be globally accessible for the scheme to work.
Everything on the heap is accessible from other threads, provided they have the pointer to the object in question.
Good. And since I'm working with classes, everything is really on the heap, which means that only pointers will get copied. (Well, references. I avoid using pointers in my code unless I really can't avoid them.)
 N.B.: When a cell receives a message from another cell it's likely, but
 not guaranteed, to send a response back. It may also send responses
 onwards. And the number of cells isn't fixed, nor is the number of
 their connections. (This is less important in D than in many other
 languages, but even in D it affects serialization.)

 FWIW, I've been told that an approximately similar design has worked in
 Ada, though the design was written in Ada-83. (In Ada the mailbox was
 protected, which I take to be approximately the same as synchronized.)
In general there are ways to make it fly. The tricks to use depend on the use case and what the bottleneck is (I/O or CPU time). The pain points are faster mailboxes and better scheduling (as in fewer context switches for nothing, faster turn-around time etc.).
The bottleneck will be CPU time (given sufficient RAM). No way to avoid that. Stuffing things into a mailbox is going to be basically copying a struct. (That hasn't been designed yet, but it's going to include a ref to the sender, a requested action, a numeric value, and I'm not sure of what else. The "requested action" will probably be represented as an enum. I'm probably going to avoid strings, as they don't appear to be valuable, even though that's just a reference copy.) So say 128 bytes of parameters, or possibly 256.
That's a lot no matter how you put it: 1 message per mailbox across 1M cells is 128/256 megs of RAM to start with. Obviously even a modest 8 messages per mailbox is hitting gigabytes with messages this big. I bet the easiest way out is variable-length messages (so it doesn't have to be all 128/256 bytes). Then see the point about the queue below.
 And receiving a message is copying the
 parameters into a queue.
 Perhaps I could remove the synchronization
 from the class, and just guard calculating the index position into the
 queue, as once the position in the queue was known, there wouldn't be
 any contention WRT where the message would be stored.
Just keep in mind cache lines are ~64 bytes, thus messages would have to be aligned to a multiple of 64. That's where having a preallocated bounded queue of _pointers_ to messages could be nice, as you never write to the queue itself.
 That should be a
 very fast design, but I worry about how much RAM space would be
 required.
Yup, but the RAM will eat your design anyway if messages are that big and the amount of cells is millions. The only sensible way to scale this up is having a cluster of cheap machines each running a bunch of cells and communicating over fast network (that's still very slow compared to memory access).
 With a pre-allocated queue, each mailbox would consume the
 maximal amount of space even if none were used.  So if there were space
 for 20 messages, then the mailbox would consume 5120 bytes + overhead
 for each cell.  Which means that if I have 500,000 cells (an extremely
 lowball figure) just the mailboxes would consume 1,024,000,000 bytes
 plus overhead.
 True. most of that would never be accessed, but enough
 accesses would be randomly distributed throughout the system that I
 would expect thrashing...or even failure due to inability to allocate
 memory.
 This would be compressed significantly if I used a thread
 attached mailbox, at the cost of nearly guaranteed massive contention
 problems.
 And 20 messages is an unreasonably low upper limit, even
 though it's too high for a mean value.  (I expect the mean value to be
 closer to 3-4.)  So I'd been planning on using variable length arrays
 for the mailbox, which would be deallocated every time the cell accessed
 them.  So messages could be attached by simply doing an append.
Append = (re)allocation; that is, either taking the OS (or GC) lock to allocate, or the mailbox has an allocated memory chunk bigger than required anyway.
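In D, the allocation-per-append can be avoided by reserving capacity once and recycling the same GC block after each drain. A sketch (function names are made up):

```d
struct Msg { int fromId; }

// One up-front allocation sized for the expected batch.
Msg[] makeBuffer(size_t cap)
{
    Msg[] buf;
    buf.reserve(cap);     // appends stay in this block until cap is exceeded
    return buf;
}

// After the owning cell has *copied* the batch out, the same block
// can be reused for the next cycle instead of allocating a fresh array.
void recycle(ref Msg[] buf)
{
    buf.length = 0;
    buf.assumeSafeAppend();   // tell the runtime appends may reuse the block
}
```

Note this only works if the cell copies the messages out; in the swap scheme where the cell takes ownership of the array itself, the block must not be reused while the cell still holds it.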
  This
 would place the message queue on the heap, with an initially quite low
 value for maximal # of messages.  I'll admit that this may only look
 better because I can't estimate the amount of RAM consumed.
It may help, after the system warms up, to figure out sizes for mailboxes. However they may tend to oscillate, and keep in mind the extra space reserved by dynamic allocations (see above).
 Perhaps I need to use some sort of disk cache of a data file, and only
 have the most active cells in memory at any particular time.
Forget it - let your OS memory manager eat its turf with extra swap space. That being said it will crawl to a stop once swapping is relatively high.
 This would
 still result in a lot of thrashing, but in a more controlled manner.
 I've only around 8 GB of actual memory and it looks like 7.8 GB total
 memory, if one includes virtual memory.
You mean available? With some cleverness virtual memory can be overtaxed far beyond real RAM, BTW (if you never end up using more than RAM+swap of it).
  Perhaps this needs to be
 upgraded...which would probably mean I upgraded the number of cores
 available, meaning increased contention.
Given the numbers that's least of worries. Unnecessary locking is.
 But I could clearly upgrade
 the amount of virtual memory with just a repartitioning of the disk.
 It's not been needed so far, but then I've just been designing this
 system, not implementing it.  OTOH, virtual RAM is thrashing, so it's
 not clear how much that would help over, say, a BTree that rolled out
 relatively inactive cells, even though each cell would need to be RAM
resident at least once per cycle.

 That said, the application I'm designing would probably overstress any
 computer I could afford.  I'm almost resigned to that.  I just want to
 come as close to reasonable speed of execution as possible, and I
 clearly want to avoid deadlocks and data that doesn't get properly updated.
Then just run simulations as big as you can and optimize speed, then keep this size fixed and optimize RAM use, then start over again. The RAM vs speed tradeoff should be more or less analyzable (and extrapolable) even if tested on a select bunch of modest PCs.
 OTOH, rolling things out to a B+Tree means that I need to devise a way
 to access the mailbox based around the target's id# rather than around a
 reference to the item.  A hash table is the obvious solution, but the
 class managing the hash table would need to roll the cell+mailbox in if
 it isn't RAM resident.  Not something that's reasonable to do while the
 mailbox access is locked.  So the mailbox would need to queue the
 request for access to the cell's mailbox with a thread, stuff the
 message in a "to be acted upon" queue, and return.  And where that queue
 should be stored isn't clear to me.
That's why I suggest not even beginning to think about it - virtual RAM and the OS do it far better (and with hardware assistance from the CPU). Forget the hash table if RAM is at a premium; use straight arrays, they have the least per-item overhead.
 Perhaps that's where the thread
 attached non-blocking queue should come into play.  Also, hash tables
 have enough overhead themselves that they would limit the number of RAM
 resident cells considerably over prior estimates, even while increasing
 the total number that could be dealt with.  Probably they would halve
 the number of RAM resident cells (a rough estimate, admittedly), while
 expanding the total number of cells that could be handled to be limited
 by available disk space.  They would also impose a severe performance
 penalty.  (In prior small scale tests, hash table access has been very
 roughly about 1/10 as fast as variable length array access.  Of course,
 this is still a lot faster than disk file access.)

 Still, it's sounding like the basic design will work, unless a cell
 calling the mailbox of another cell locks that cell, and I don't see any
 reason why it should...but I've been repeatedly surprised by the
 requirements imposed on concurrently executing threads of execution.
There is no deadlock as discussed (+ not a single call tries to grab 2 locks at the same time). However, if there is a single lock around the mailbox (not 2 separate ones for put/get) you'd need to use the "try lock" trick mentioned to prevent the awkward blocking of one thread when reading & writing collide on the same mailbox. Also, if adventurous enough to look at lock-free, I suggest reading this piece by Herb Sutter: http://www.drdobbs.com/parallel/writing-a-generalized-concurrent-queue/211601363?pgno=2 It's quite short and it neatly deals with a lot of problems in making a concurrent queue in a portable way. (Note that you have a multi-producer, single-consumer queue.) -- Dmitry Olshansky
Jan 04 2013
parent reply Charles Hixson <charleshixsn earthlink.net> writes:
On 01/04/2013 02:56 AM, Dmitry Olshansky wrote:
04-Jan-2013 10:36, Charles Hixson writes:
 [snip]
 So cell is in fact a task (more common name for it I think) with a
 mailbox and you want threads multiplexed across these tasks. The task is
 running some code/function/callback/whatever that periodically polls a
 mailbox & puts stuff in other task's mailboxes. So far good?

 Then definitely take a look at Fiber in druntime (it's core.Fiber
 AFAIK).
It's in core.Thread. A fiber would work well for the cell, but it's been pointed out to me that with this design, an ordinary class instance would also work, so for that it's probably overkill. Which leaves the mailbox unhandled. "Periodic" polling, as in time based, is too expensive. Periodic polling as in "once in each cycle" is what is planned, and that part doesn't require the overhead of threads, etc., except to keep partial messages from being read. But the documentation of Fiber *strongly* implies that the contents of the mailbox would be copied from its normal location to the active thread, if it's not the same thread within which the cell executes. So it looks to me as if the mailbox needs to be synchronized. But I'm not really sure what synchronized means in D. (N.B.: The cell is logically a task, but if it were implemented as such, then the mailbox would be unnecessary. But the overhead would be unbearable.)
 That skeleton of code was intended to show the idea of cells isolated
 from outside attached to a mailbox, which has blocking access from the
 outside. The cells should never block, but this is largely because they
 are only directly accessed by the thread within which they run. The
 mailboxes can receive messages from anyone, but their processes are so
 short, that blocking will be extremely brief.
I'm not so optimistic about locking. Even though it's brief, there are many threads that may be putting stuff simultaneously into mailboxes, thus contending on the lock and causing context switches. Plus, the lock/unlock of a mutex is not free. The lock-based message queue is nice for a start (fewer _subtle_ bugs) but you should look into lock-free alternatives later on.
The number of cells is huge when compared with the number of available threads, on the rough order of 1,000,000 to 6. Because of this I don't expect contention to be a big problem. OTOH, the number of cores/processor is going *up*, so a lock-free system would be very desirable. But the only lock-free design I'm familiar with is CSMACD (ethernet). And to implement that, I think the mailboxes would need to be shared. And this requires some protocol such as CSMACD for each access. It's not clear to me that this would have less overhead than would synchronized (which locks, if I understand correctly, the class instance when someone else has write access to it).
CSMACD has the advantage of both parties knowing what they send and the ability to detect a collision before the end of the transfer(!). In software design, lock-free has analogous power but on the scale of one memory location (one word). This would imply first working on the side (creating the message etc.), then updating state in one atomic op; if there was a conflict (somebody else put their word into memory here) you redo the updating step.
 Certainly TDPL
 talks about them sequentially, and implies that there is a tight
 connection between shared variables and synchronized classes. It is as
 if a synchronized class is the analog of an atomic operation acting on a
 larger set of data. And when I look at implementing CSMACD it looks as
if only one thread can detect that its write was interrupted, and then
 only after it's complete. Unless I implement a checksum scheme, in
 which case I guess both could detect an improperly completed write. But
 then independent readers couldn't detect the problem. So readers would
 need to be locked out while writing was in process...and we're back to
 mutexes/spinlocks/etc.
 So it looks to me as if a synchronized mailbox class eliminates a
multitude of problems at the lowest reasonable cost.
Given the million cells, you are going to lock things that need not be locked most of the time. It's hardly a reasonable cost.
But is there any cost to keeping something locked that nobody's looking at? There's clearly a RAM cost, of course, but any solution will have some RAM cost, and presumably the library implementation of synchronized will be at least as good as any that I could come up with.
 If it works as I
 hope it does. But if the mailbox ends up getting copied from thread
 address space to thread address space, then the overhead would be much
 higher than a naive estimate, and I can't guess how high. It *could*
 scale so poorly that 90% of the computation consisted of mailboxes being
 copied from thread to thread.
I don't quite get your "attached to thread" (here and later). You mean as in std.concurrency? Well, it was designed this way, that's all. There is no reason why a data structure has to be "attached" to a thread. And you talk like it's not you who controls what gets copied and where :)
Sorry, I keep getting confused about threads accessing RAM. This is the first time I've actually TRIED to write a parallel task. You *did* point out that the heap was generally accessible, and most of the time I remember that. And this implies that only the reference to the thread... is really attached to the thread (i.e., stored in thread local memory).
 And the fear of things being copied between threads, if anything threads
 are sharing single address space (a fact of life). If you want multiple
processes then yes, different address space and some copying is bound to
 happen.
The copying is, indeed, bound to happen. I'm trying to design this so that the copying will be minimized. But I also get confused about what is local to the thread. (I didn't realize until you told me a message or so ago that the heap was shared.) I knew that it was in most languages, but apparently I got confused when TDPL said that in D threads did not share memory (13.3).
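The distinction TDPL is drawing: D globals and statics are thread-local by default (each thread gets its own copy), while `shared` (or `__gshared`) variables and everything reachable through the heap live in the single process-wide address space. A tiny demonstration (names made up here):

```d
import core.atomic;
import core.thread;

int tlsVar = 1;            // thread-local: every thread gets its own copy
shared int sharedVar = 1;  // one copy, visible to all threads

// Spawn a thread that writes both variables, then report what this
// thread sees afterwards: [its own tlsVar, the shared variable].
int[2] demo()
{
    auto t = new Thread({
        tlsVar = 42;                    // writes that thread's private copy
        atomicStore(sharedVar, 42);     // writes the single shared copy
    });
    t.start();
    t.join();
    return [tlsVar, atomicLoad(sharedVar)];
}
```

So nothing is copied between threads behind your back; TLS just means each thread silently gets its own instance of unshared globals.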
 I wanted to know if this proposed design would work, as in not getting
 into deadlocks, not blocking excessively, and not giving me excessively
 stale data.
The crucial part is missing - taking a message out of the mailbox ;)
The only entity permitted to take a message from the mailbox is the cell to which it is attached, and that happens at each initiation of activation. But that's one reason that the mailbox should be synchronized.
Then here is the thing: if you're unlucky, a thread that sends will get blocked by the thread that reads messages. It's not deadlock, but it is a waste of time.
This "waste of time" appears to me to be unavoidable. Basically the cell takes ownership of the message vector, and a new one is created for the mailbox. It's that, or timestamp everything, in which case I could use a queue. In that case, though, I'd need to copy the messages out as I removed them. Which would also take time, though in the "home" thread (i.e., the one holding a reference to the cell) rather than in the one that would be blocked in the first instance.
 But anyway let's focus on the details. 2 classes and 2 functions. Cell's
 send & MBox's receive.

 Let's suppose we have 2 Cells A & B and their mboxes MA & MB.
 From the code I see (that's full of typos? MCell & Cell are used
 interchangeably) the chain of event for full send from A --> B is:

 1. A Cell's send locks cell A. (send is sync-ed)
Why does it lock the cell? But perhaps this is because the proffered cell class was synchronized.
Yup.
OK. Since that cell is referenced in only one thread, it doesn't need to be synchronized.
 2. It locks target cell B.
Why does it lock cell B? Cell A makes no access to cell B. Only to the mailbox. Which is why in that rough design cell and mailbox were separate (synchronized) classes at the same level. If it locks Cell B, then the design won't work. (OTOH, the Cell function doesn't need to be synchronized anyway, that's a remnant from a prior design.)
 3. It then locks its mailbox MB.
Check.
 4. undoes all the locks backwards.
Check.
 Then there is of course a deadlock bound to happen if B is sending
 message in opposite direction, e.g. :
 1. A locks A (making its first step)
 2. B lock B (ditto)
 3. A locks B & blocks
 3. B locks A & blocks
If the cells get locked, then the design will not work. No argument. I was attempting to avoid that by making the mailbox a separate class, and having each cell only contact the other cell's mailbox. If that doesn't work, then the design is not going to work, and I'll need to create a new one.
That is, if for instance there is a step 2. So I guess you didn't mean step 2.

 If there is no lock of the cell except before sending then it looks
 legit as there are 2 locks protecting separate entities:

 - one is to protect message queue on putting message into it

 - the other one is to protect ... what exactly? send is already
 implicitly guarded by target queue's lock.
Yeah. Sorry, that was sloppy thinking on my part. The cell doesn't need to be synchronized. (But I don't understand why that would be destructive of anything except efficiency.)
I find the extra locking confusing as well, as it makes it look like nobody knows where synchronization is actually required.
Sorry. It is, indeed, evidence of confused thinking. The Cell class doesn't need to be locked.
 So I'd say you only need to guard the message queue and that's about it.
 The only other concern is properly scheduling the execution of tasks (or
your cells) so that one runs on no more than one thread at any given
 time.
The idea here is that each cell has an id#, and there is a pool of threads. Each cell is accessible by the thread whose sequence number is the same as id# mod # of threads. Each thread loops through the cells accessible to it. When a cell is activated by the thread, first it checks its mailbox to update its state. Then it processes, possibly sending messages to the mailboxes of other cells in a task-independent way. (It won't know about threads, only about mailboxes.) Execution then passes on to the next cell in the thread.
So the code to process messages is the same in each cell? What about connections (the links to other cells) - are they permanent?
The connections are variable, and neither predeterminable nor immutable. They vary depending on the incoming data. Also, the number of cells is variable depending on the incoming data.
 In simplest case just make a locked (or lock-free) queue of these and
 let threads pick cells/tasks from it and put back when they are done.
A lock-free queue would do it, I think, but the only ones I know of are attached to the thread rather than to a data instance, and that WOULD lead to massive contention. And increasing contention as the number of cores (threads) increased.
Yup, separating by ID (a round robin sort of thing) could be far better. That is, if the workload on cells is trivially balanced by partitioning cells beforehand (I bet not).
It might be. I'll need to think about that. I expect the load/cell to be essentially random WRT id#, but I could set things up so the next cell is added to the next available thread. It's more complex, but not hugely so. (For the other meaning of round robin, adding cells by id# mod thread count would be a round robin.)
Far better is a large bounded queue that you never remove from or append to. It's just a big array of pointers/references to tasks. A thread
 then just goes around it looking for valid/ready entries (they stay
 allocated so no nulls there) and executes them. That goes firmly into
 lock-free zone to make it correctly synchronized though but with a bit
 of care it should be doable.
It would be quite reasonable to make the queue bounded, and reject messages when the mailbox was full. But, again, if it's attached to the thread rather than to the object there is going to be a massive problem with contention.
Again I don't get the attached to thread thing. TLS? Just don't put it in TLS then ;)
Perhaps the problem is that I don't know how to implement a lock-free queue that's thread safe? I looked at the version in (IIRC) std.concurrency because that's the one I found in the library. Queues I can do, but when lock-free and thread-safe are added, I'm well beyond my depth.
 The second one also can be done with locks. In this case a thread goes
 through all of tasks/cells and tries to lock them (That's where your
 lock around it comes in, is it?). If it locks - cool, work on it, if not
 - try the next one.
Good point. If the cell is initiated and can't read its mailbox (this action needs a write lock, as it removes messages), then going on to the next cell rather than blocking is better. (Again, if the mailbox is attached to the thread, this won't work. All mailboxes will be nearly constantly in contention.)
 More details aren't available, because I didn't want to
 commit to this design if the basic design was wrong, so I haven't
 written them. It has been suggested that since the cell will only be
accessed by the thread, it doesn't need to be synchronized.
 I'm really nervous about how many copies of the cell will exist,
 however. Since there are going to be so many of them, if I ended up
 with a cell/thread, the system would bog down in unused storage. But
 the mailbox needs to be globally accessible for the scheme to work.
Everything on the heap is accessible from other threads, provided they have the pointer to the object in question.
Good. And since I'm working with classes, everything is really on the heap, which means that only pointers will get copied. (Well, references. I avoid using pointers in my code unless I really can't avoid them.)
 N.B.: When a cell receives a message from another cell it's likely, but
 not guaranteed, to send a response back. It may also send responses
 onwards. And the number of cells isn't fixed, nor is the number of
 their connections. (This is less important in D than in many other
 languages, but even in D it affects serialization.)

 FWIW, I've been told that an approximately similar design has worked in
 Ada, though the design was written in Ada-83. (In Ada the mailbox was
 protected, which I take to be approximately the same as synchronized.)
In general there are ways to make it fly. The tricks to use depend on the use case and what is bottleneck (I/O or the CPU time). The pain points is faster mailboxes and better scheduling (as in less context switches for nothing, faster turn-around time etc.).
The bottleneck will be CPU time (given sufficient RAM). No way to avoid that. Stuffing things into a mailbox is going to be basically copying a struct. (That hasn't been designed yet, but it's going to include a ref to the sender, a requested action, a numeric value, and I'm not sure what else. The "requested action" will probably be represented as an enum. I'm probably going to avoid strings, as they don't appear to be valuable, even though copying one is just a reference copy.) So say 128 bytes of parameters, or possibly 256.
That's a lot no matter how you put it: one message per mailbox across 1M cells is already 128/256 MB of RAM. Even a modest queue of about 8 messages per mailbox hits gigabytes with messages this big. I bet the easiest way out is variable-length messages (so a message doesn't have to be all of 128/256 bytes). Then see the point about the queue below.
 And receiving a message is copying the
 parameters into a queue.
 Perhaps I could remove the synchronization
 from the class, and just guard calculating the index position into the
 queue, as once the position in the queue was known, there wouldn't be
 any contention WRT where the message would be stored.
Just keep in mind cache lines are ~64 bytes, thus messages would have to be aligned on a multiple of 64. That's where having a preallocated bounded queue of _pointers_ to messages could be nice, as you never write to the queue itself.
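A hedged sketch of such a mailbox: a fixed-capacity ring of references, so the message payloads never touch the queue's own cache lines. All names here are mine, and the lock is a plain mutex rather than anything lock-free:

```d
import core.sync.mutex : Mutex;

class Msg { int fromId; int action; double value; }

class MBox
{
    private Msg[] ring;        // preallocated slots, never resized
    private size_t head, count;
    private Mutex guard;

    this(size_t capacity)
    {
        ring  = new Msg[capacity];
        guard = new Mutex;
    }

    // Reject instead of blocking when full (bounded-queue behavior).
    bool tryPut(Msg m)
    {
        guard.lock();
        scope (exit) guard.unlock();
        if (count == ring.length) return false;
        ring[(head + count) % ring.length] = m;
        ++count;
        return true;
    }

    // Drain everything in one locked step; the cell empties its
    // whole mailbox at the start of each activation anyway.
    Msg[] takeAll()
    {
        guard.lock();
        scope (exit) guard.unlock();
        auto result = new Msg[count];
        foreach (i; 0 .. count)
            result[i] = ring[(head + i) % ring.length];
        head = count = 0;
        return result;
    }
}
```

Only references move through the ring, so a 20-slot mailbox costs about 20 words plus the mutex, not 20 × 256 bytes.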
 That should be a
 very fast design, but I worry about how much RAM space would be
 required.
Yup, but the RAM will eat your design anyway if messages are that big and the number of cells is in the millions. The only sensible way to scale this up is having a cluster of cheap machines, each running a bunch of cells and communicating over a fast network (which is still very slow compared to memory access).
Yeah, but I'm not a company. I've got what I've got (or what I replace it with...but it would be replace, not add to. Space is also a constraint.)
 With a pre-allocated queue, each mailbox would consume the
 maximal amount of space even if none were used. So if there were space
 for 20 messages, then the mailbox would consume 5120 bytes + overhead
 for each cell. Which means that if I have 500,000 cells (an extremely
 lowball figure) just the mailboxes would consume 2,560,000,000 bytes
 plus overhead.
 True. most of that would never be accessed, but enough
 accesses would be randomly distributed throughout the system that I
 would expect thrashing...or even failure due to inability to allocate
 memory.
 This would be compressed significantly if I used a thread
 attached mailbox, at the cost of nearly guaranteed massive contention
 problems.
 And 20 messages is an unreasonably low upper limit, even
 though it's too high for a mean value. (I expect the mean value to be
 closer to 3-4.) So I'd been planning on using variable length arrays
 for the mailbox, which would be deallocated every time the cell accessed
 them. So messages could be attached by simply doing an append.
Append = (re)allocation, that is either taking the OS (or GC) lock to allocate, or the mailbox has an allocated memory chunk bigger than required anyway.
Well, if the mailbox is initially allocated to 4 messages, then if there's overflow it jumps to 8, IIUC. Then to 16. Etc. So automatic allocations will only happen occasionally, and for most cells will never happen (unless I've misjudged the average number of messages).
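As a sanity check on that growth pattern: D's `reserve` and `capacity` make the preallocation explicit. The exact growth schedule is implementation-defined, so the 4 → 8 → 16 doubling is only approximate, but reserving for the common case avoids reallocation entirely:

```d
void main()
{
    int[] mailbox;
    mailbox.reserve(4);               // typical load: ~3-4 messages
    immutable cap = mailbox.capacity; // at least 4, maybe more
    assert(cap >= 4);
    foreach (i; 0 .. 4)
        mailbox ~= i;                 // no reallocation within capacity
    assert(mailbox.capacity >= cap);  // still in the same block
}
```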
 This
 would place the message queue on the heap, with an initially quite low
 value for maximal # of messages. I'll admit that this may only look
 better because I can't estimate the amount of RAM consumed.
It may help after system warms up figuring out sizes for mailboxes. However they may tend to oscillate and keep in mind the extra space reserved with dynamic allocations (see above).
Yeah. 64 bytes is larger than I was guessing.
 Perhaps I need to use some sort of disk cache of a data file, and only
 have the most active cells in memory at any particular time.
Forget it - let your OS memory manager eat its turf with extra swap space. That being said it will crawl to a stop once swapping is relatively high.
 This would
 still result in a lot of thrashing, but in a more controlled manner.
 I've only around 8 GB of actual memory and it looks like 7.8 GB total
 memory, if one includes virtual memory.
You mean available? With some cleverness virtual memory can be overtaxed far beyond real RAM, BTW (if you never end up using more than RAM+swap of it).
Well, I didn't mean available. But if I decide to use virtual memory, I can repartition the hard disk to increase that dramatically. Probably buying a new machine would be better than adding RAM, though. My current machine is rather old. But I don't want to do that until I have a working program to use on it. And I can't yet buy enough machine for what I can afford, but then the program isn't ready yet either.
 Perhaps this needs to be
 upgraded...which would probably mean I upgraded the number of cores
 available, meaning increased contention.
Given the numbers that's least of worries. Unnecessary locking is.
 But I could clearly upgrade
 the amount of virtual memory with just a repartitioning of the disk.
 It's not been needed so far, but then I've just been designing this
 system, not implementing it. OTOH, virtual RAM is thrashing, so it's
 not clear how much that would help over, say, a BTree that rolled out
 relatively inactive cells, even though each cell would need to be RAM
 resident at least once per cycle.

 That said, the application I'm designing would probably overstress any
 computer I could afford. I'm almost resigned to that. I just want to
 come as close to reasonable speed of execution as possible, and I
 clearly want to avoid deadlocks and data that doesn't get properly
 updated.
Then just run simulations of as big as you can and optimize speed, then keep this size fixed and optimize RAM used, then over again. The RAM vs speed should be more or less analyzable (and extrapolable) even if tested on a select bunch of modest PCs.
"select bunch of modest PCs" is a problem. I've got a few old computers, but no space to use them in. (Seriously, I should really get rid of them, but one of them is occasionally useful when the active machine is in the shop.) Space is a lot more expensive than computers. FWIW, this program I'm describing *IS* a simulation of a more complex system, drastically simplified already. I've got less trouble sacrificing speed than capabilities. But I could reduce the size requirements by using a restricted range of inputs.
 OTOH, rolling things out to a B+Tree means that I need to devise a way
 to access the mailbox based around the target's id# rather than around a
 reference to the item. A hash table is the obvious solution, but the
 class managing the hash table would need to roll the cell+mailbox in if
 it isn't RAM resident. Not something that's reasonable to do while the
 mailbox access is locked. So the mailbox would need to queue the
 request for access to the cell's mailbox with a thread, stuff the
 message in a "to be acted upon" queue, and return. And where that queue
 should be stored isn't clear to me.
That's why I suggest to not even begin thinking about it - virtual RAM and the OS do it far better (and with hardware assistance from the CPU). Forget the hash table if RAM is at a premium; use straight arrays, they have the least per-item overhead.
Yes. A hash table, while obvious, is clearly not the right idea. I do find myself attracted to a variation on a B+Tree with fancy rebalancing, so that each node is more than 2/3 full. (I think I could get it to average over 3/4 full, excluding the root node.) This, however, is computationally expensive. Its main advantages are that the limit to the structure becomes disk space, and that persistence is automatically maintained. It would, however, require another redesign. OTOH, *some* means of saving state persistently will be necessary, as I can't dedicate my computer to nothing but this program. Also, deleting cells that have become very inactive may prove to be difficult, and might be better done while the rest of the program is inactive. (That would mean marking the cell for deletion during normal operation, and then running the garbage collector separately. Facilitating this is one reason that all links need to be bidirectional.)
 Perhaps that's where the thread
 attached non-blocking queue should come into play. Also, hash tables
 have enough overhead themselves that they would limit the number of RAM
 resident cells considerably over prior estimates, even while increasing
 the total number that could be dealt with. Probably they would halve
 the number of RAM resident cells (a rough estimate, admittedly), while
 expanding the total number of cells that could be handled to be limited
 by available disk space. They would also impose a severe performance
 penalty. (In prior small scale tests, hash table access has been very
 roughly about 1/10 as fast as variable length array access. Of course,
 this is still a lot faster than disk file access.)

 Still, it's sounding like the basic design will work, unless a cell
 calling the mailbox of another cell locks that cell, and I don't see any
 reason why it should...but I've been repeatedly surprised by the
 requirements imposed on concurrently executing threads of execution.
There is no deadlock as discussed (+ not a single call tries to grab 2 locks at the same time). However, if there is a single lock around the mailbox (not 2 separate ones for put/get), you'd need to use the "try lock" trick mentioned to prevent one thread blocking awkwardly when reading & writing collide on the same mailbox.
That sounds like good advice.
 Also if adventurous enough to look at lock-free I suggest to read this
 piece by Herb Sutter:
 http://www.drdobbs.com/parallel/writing-a-generalized-concurrent-queue/211601363?pgno=2
From that link (right near the start): "Like any shared variable, the next pointer needs to be protected by a mutex or, as here, declared as an ordered atomic type (C++0x atomic<> or Java/.NET volatile)." To me that sounds like synchronized, since it can't be atomic, being a struct. My C++ is pretty weak, though. I'm guessing at what is meant by "ordered atomic". In D, however, I believe that 64 bits is the length of the longest atomic type. This is a bit of a guess, but real types can't be atomic on Intel cpus, because they are 80 bits long. OTOH, the last bit of TDPL (pg 426-429 of my copy) talks about lock-free stacks and lists. For the mailbox I think a stack would do as well as a queue, as it's going to be emptied entirely when the cell accesses it, though perhaps the list would be better; but the stack is simpler. I'd need to adapt the pop routine into a "popAll", but that looks straightforward.
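A sketch of that adaptation using core.atomic. This is my own rendering of the idea, not TDPL's code, and the shared-qualifier handling may need adjusting for a given compiler version: producers push with CAS, and the single consumer detaches the whole list with one atomic swap of the head.

```d
import core.atomic : atomicLoad, cas;

struct Node
{
    int payload;
    Node* next;
}

shared struct Stack
{
    private Node* head;

    // Producers: classic CAS push (multi-producer safe).
    void push(Node* n)
    {
        Node* old;
        do
        {
            old = atomicLoad(head);
            n.next = old;
        }
        while (!cas(&head, old, n));
    }

    // Single consumer detaches the entire list in one step; items
    // come back newest-first, which is fine when the cell drains
    // its whole mailbox anyway.
    Node* popAll()
    {
        Node* old;
        do { old = atomicLoad(head); }
        while (old !is null && !cas(&head, old, cast(Node*) null));
        return old;
    }
}
```

Because popAll replaces the head exactly once, the usual pop-side ABA hazard of a Treiber stack doesn't arise on the consumer path.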
 it's quite short and it neatly deals with a lot of problems in making
 a concurrent queue in a portable way. (Note that you have a
 multi-producer, single-consumer queue.)
Jan 04 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
05-Jan-2013 08:15, Charles Hixson пишет:
 On 01/04/2013 02:56 AM, Dmitry Olshansky wrote:
 04-Jan-2013 10:36, Charles Hixson пишет:
 [snip]
 Certainly TDPL
 talks about them sequentially, and implies that there is a tight
 connection between shared variables and synchronized classes. It is as
 if a synchronized class is the analog of an atomic operation acting on a
 larger set of data. And when I look at implementing CSMA/CD it looks as
 if only one thread can detect that its write was interrupted, and then
 only after it's complete. Unless I implement a checksum scheme, in
 which case I guess both could detect an improperly completed write. But
 then independent readers couldn't detect the problem. So readers would
 need to be locked out while writing was in process...and we're back to
 mutexes/spinlocks/etc.
 So it looks to me as if a synchronized mailbox class eliminates a
 multitude of problems a the lowest reasonable cost.
Given a million cells, you are going to lock things that need not be locked most of the time. It's hardly a reasonable cost.
But is there any cost to keeping something locked that nobody's looking at? There's clearly a RAM cost, of course, but any solution will have some RAM cost, and presumably the library implementation of synchronized will be at least as good as any that I could come up with.
There is a cost to (un)lock a mutex. You can make a simple program and measure it or just look around for some hard data on the subject.
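A quick sketch of such a measurement using std.datetime.stopwatch (modern Phobos; the absolute numbers will of course vary by machine, and this measures only the uncontended case):

```d
import core.sync.mutex : Mutex;
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writefln;

// Average cost of an uncontended lock/unlock pair, in nanoseconds.
long nsPerLockPair(size_t n)
{
    auto m  = new Mutex;
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. n)
    {
        m.lock();
        m.unlock();
    }
    sw.stop();
    return sw.peek.total!"nsecs" / n;
}

void main()
{
    writefln("%s ns per uncontended lock/unlock pair",
             nsPerLockPair(1_000_000));
}
```

Multiply the result by millions of cells per pass and the aggregate cost of locking everything becomes visible even before contention enters the picture.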
 And the fear of things being copied between threads, if anything threads
 are sharing single address space (a fact of life). If you want multiple
 processes then yes, different address space and some coping is bound to
 happen.
The copying is, indeed, bound to happen. I'm trying to design this so that the copying will be minimized. But I also get confused about what is local to the thread. (I didn't realize until you told me a message or so ago that the heap was shared.) I knew that it was in most languages, but apparently I got confused when TDPL said that in D threads do not share memory (13.3).
The memory (on hardware level) is shared either way, but semantically (on type-system level) everything not tagged shared is thread-local. Stack is already private to each thread, globals are stored in TLS provided by the OS and things are set up in such a way that heap memory is also 'type-marked' as thread-local but in fact it's a type-cast away from being used by some other thread. The idea was to put more onus on what data is passed between threads to reduce threading errors and undue sharing of data.
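A small example of that default (module-level variables live in TLS unless tagged `shared`); the variable names are mine:

```d
import core.atomic : atomicOp;
import core.thread : Thread;

int tlsCounter;            // one copy per thread (TLS by default)
shared int sharedCounter;  // one copy for the whole process

void demo()
{
    auto t = new Thread({
        tlsCounter = 42;                 // the new thread's own copy
        atomicOp!"+="(sharedCounter, 1); // visible to every thread
    });
    t.start();
    t.join();
}

void main()
{
    demo();
    assert(tlsCounter == 0);      // main thread's copy was never touched
    assert(sharedCounter == 1);   // the shared one was
}
```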
 I wanted to know if this proposed design would work, as in not getting
 into deadlocks, not blocking excessively, and not giving me
 excessively
 stale data.
The the crucial part is missing - taking a message out of the mailbox ;)
The only entity permitted to take a message from the mailbox is the cell to which it is attached, and that happens at each initiation of activation. But that's one reason that the mailbox should be synchronized.
Then here is the thing - if unlucky, a thread that sends will get blocked by the thread that reads messages. It's not a deadlock, but a waste of time.
This "waste of time" appears to me to be unavoidable. Basically the cell takes ownership of the message vector, and a new one is created for the mailbox.
I meant that the thread should keep working on another cell, not be blocked by this unfortunate collision, i.e. the "try lock" idea.
 In simplest case just make a locked (or lock-free) queue of these and
 let threads pick cells/tasks from it and put back when they are done.
A lock-free queue would do it, I think, but the only ones I know of are attached to the thread rather than to a data instance, and that WOULD lead to massive contention. And increasing contention as the number of cores (threads) increased.
Yup, separating by ID (a round-robin sort of thing) could be far better. That is, if the workload on cells is trivially balanced by partitioning cells beforehand (I bet not).
It might be. I'll need to think about that. I expect the load/cell to be essentially random WRT id#, but I could set things up so the next cell is added to the next available thread. It's more complex, but not hugely so. (For the other meaning of round robin, adding cells by id# mod thread count would be a round robin.)
Yeah, I'd thought of a basic mod (n_threads + 1). [snip]
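A sketch of that id-based partitioning (all names assumed): ownership is fixed by `id % nThreads`, so no two workers ever touch the same cell on this path and the per-cell locks become unnecessary there.

```d
struct CellRef { uint id; }

// Cell i always belongs to worker (i % nThreads).
CellRef[][] partition(CellRef[] cells, size_t nThreads)
{
    auto buckets = new CellRef[][](nThreads);
    foreach (c; cells)
        buckets[c.id % nThreads] ~= c;
    return buckets;
}
```

Each worker thread then loops over its own bucket only; the trade-off is that a bucket with unusually busy cells cannot be helped by idle workers.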
 It would be quite reasonable to make the queue bounded, and reject
 messages when the mailbox was full. But, again, if it's attached to the
 thread rather than to the object there is going to be a massive problem
 with contention.
Again I don't get the attached to thread thing. TLS? Just don't put it in TLS then ;)
Perhaps the problem is that I don't know how to implement a lock-free queue that's thread-safe? I looked at the version in (IIRC) std.concurrency because that's the one I found in the library. Queues I can do, but once lock-free and thread-safe are added, I'm well beyond my depth.
Queue in std.concurrency AFAIK is not lock-free; there were floating thoughts of making it such. Either way it's a good idea to copy some well-analyzed work rather than try to roll your own lock-free stuff :) [snip]
 That should be a
 very fast design, but I worry about how much RAM space would be
 required.
Yup, but the RAM will eat your design anyway if messages are that big and the number of cells is in the millions. The only sensible way to scale this up is having a cluster of cheap machines, each running a bunch of cells and communicating over a fast network (which is still very slow compared to memory access).
Yeah, but I'm not a company. I've got what I've got (or what I replace it with...but it would be replace, not add to. Space is also a constraint.)
The other way to get out of the memory usage & variable queues is the following, that is if you can fit it, and that's a big if. The idea is:
1) Queues are intentionally quite small.
2) When a cell tries to put a message into the full queue of another cell, it saves this 'suspended' state (right before the send; *this* could be hard) and keeps the message waiting (1 per cell max).
3) The thread switches over to the cell with the full queue (this implies threads need to lock cells or otherwise avoid 2 threads working on one cell).
4) When (some) thread reaches the suspended cell, it first tries to continue work starting with that send.
Now it may or may not fit into your design, but it does 2 things: it keeps data from going stale (short queues + "busy" cells get their queues flushed more often) and it limits memory used. The overhead is obviously a bit of CPU time to save/restore the state of 'interrupted' cells.
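A rough single-threaded sketch of points 1, 2, and 4 (all names are my own assumptions; point 3's per-cell locking is omitted for brevity):

```d
class Msg { int action; }

class Cell
{
    private Msg[] queue;     // point 1: small, fixed capacity
    private size_t count;
    private Msg pendingMsg;  // point 2: at most one parked message
    private Cell pendingDest;

    this(size_t capacity) { queue = new Msg[capacity]; }

    // Reject instead of growing when full; the sender parks instead.
    bool tryReceive(Msg m)
    {
        if (count == queue.length) return false;
        queue[count++] = m;
        return true;
    }

    // Point 2: a failed send records the suspended state.
    void send(Cell dest, Msg m)
    {
        if (!dest.tryReceive(m))
        {
            pendingMsg  = m;
            pendingDest = dest;
        }
    }

    // Point 4: on each visit, retry the interrupted send first.
    bool resumePending()
    {
        if (pendingMsg is null) return true;
        if (!pendingDest.tryReceive(pendingMsg)) return false;
        pendingMsg  = null;
        pendingDest = null;
        return true;
    }

    // The cell empties its own queue when activated.
    size_t drain()
    {
        auto n = count;
        count = 0;
        return n;
    }
}
```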
 That said, the application I'm designing would probably overstress any
 computer I could afford. I'm almost resigned to that. I just want to
 come as close to reasonable speed of execution as possible, and I
 clearly want to avoid deadlocks and data that doesn't get properly
 updated.
Then just run simulations of as big as you can and optimize speed, then keep this size fixed and optimize RAM used, then over again. The RAM vs speed should be more or less analyzable (and extrapolable) even if tested on a select bunch of modest PCs.
"select bunch of modest PCs" is a problem. I've got a few old computers, but no space to use them in. (Seriously, I should really get rid of them, but one of them is occasionally useful when the active machine is in the shop.) Space is a lot more expensive than computers.
Aye, though laptops are pretty small and darn fast these days :)
 FWIW, this program I'm describing *IS* a simulation of a more complex
 system, drastically simplified already.
I deduced as much.
  I've got less trouble
 sacrificing speed than capabilities.  But I could reduce the size
 requirements by using a restricted range of inputs.
[snip]
 That's why I suggest to not even begin thinking about it - virtual ram
 and OS does it far better (and with hardware assistance from CPU).

 Forget the hash table if RAM is premium, use straight arrays they have
 the least per item overhead.
Yes. A hash table, while obvious, is clearly not the right idea. I do find myself attracted to a variation on a B+Tree with fancy rebalancing, so that each node is more than 2/3 full. (I think I could get it to average over 3/4 full, excluding the root node.) This, however, is computationally expensive. Its main advantages are that the limit to the structure becomes disk space, and that persistence is automatically maintained. It would, however, require another redesign. OTOH, *some* means of saving state persistently will be necessary, as I can't dedicate my computer to nothing but this program. Also, deleting cells that have become very inactive may prove to be difficult, and might be better done while the rest of the program is inactive. (That would mean marking the cell for deletion during normal operation, and then running the garbage collector separately. Facilitating this is one reason that all links need to be bidirectional.)
Then also take a look at memory-mapped I/O. Though persistence would mean fixing up all pointer/reference stuff...
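A minimal sketch using std.mmfile; the file name and record size are assumptions, and (as noted) any pointers stored in the mapped region would still need fixing up on reload:

```d
import std.mmfile : MmFile;

void main()
{
    enum recordSize = 256;   // assumed per-cell record size
    enum nRecords   = 1000;

    // The OS pages inactive parts of the mapping in and out on its
    // own, which is exactly the "roll out inactive cells" behavior.
    auto mm = new MmFile("cells.dat", MmFile.Mode.readWriteNew,
                         recordSize * nRecords, null);
    auto bytes = cast(ubyte[]) mm[];
    bytes[0 .. recordSize] = 0;          // touch only the first record
    assert(bytes.length == recordSize * nRecords);
}
```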
 Also if adventurous enough to look at lock-free I suggest to read this
 piece by Herb Sutter:
 http://www.drdobbs.com/parallel/writing-a-generalized-concurrent-queue/211601363?pgno=2
From that link (right near the start): "Like any shared variable, the next pointer needs to be protected by a mutex or, as here, declared as an ordered atomic type (C++0x atomic<> or Java/.NET volatile)." To me that sounds like synchronized, since it can't be atomic, being a struct. My C++ is pretty weak, though. I'm guessing at what is meant by "ordered atomic".
Structs can be atomic if size permits. Anyway, it's a pointer in disguise and thus can be atomic. "Ordered atomic" implies sequential consistency (w.r.t. multiple parallel operations on the same memory location) plus atomicity of the operation.
 In D, however, I believe that 64 bits is the
 length of the longest atomic type.  This is a bit of a guess, but real
 types can't be atomic on Intel cpus, because they are 80 bits long.
Yup. BTW structs can be used with atomic ops provided they are simple enough (no destructors, no postblits, etc.) and fit into a specific size.
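For example, a 4-byte plain-data struct can go through core.atomic's cas whole. This is a sketch; the exact supported sizes depend on the platform, and the struct/field names are mine:

```d
import core.atomic : atomicLoad, cas;

// Plain data that fits in a machine word; anything with a destructor,
// postblit, or oversize layout is excluded from atomic ops.
struct Slot { ushort id; ushort state; }   // 4 bytes total

shared Slot current;

// Lock-free state bump via compare-and-swap on the whole struct.
void bumpState()
{
    Slot old, next;
    do
    {
        old  = atomicLoad(current);
        next = Slot(old.id, cast(ushort)(old.state + 1));
    }
    while (!cas(&current, old, next));
}

void main()
{
    bumpState();
    bumpState();
    assert(atomicLoad(current).state == 2);
}
```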
 OTOH, the last bit of TDPL (pg 426-429 of my copy) talks about lock-free
 stacks and lists.  For the mailbox I think a stack would do as well as a
 queue as it's going to be emptied entirely when the cell accesses it,
 though perhaps the list would be better, but the stack is simpler.  I'd
 need to adapt the pop routine into a "popAll", but that looks
 straightforward.
The "looks straightforward" is almost always wrong with lock-free ;) Just be careful and set up your design so that you can swap the explosive & fast lock-free queues for good old lock-based ones. It's painfully bad to discover memory corruption due to *some* glitch in a lock-free data structure a week before shipping the app into production (or even a week after...). -- Dmitry Olshansky
Jan 07 2013