
digitalmars.D - Low-Lock Singletons In D

reply "dsimcha" <dsimcha yahoo.com> writes:
On the advice of Walter and Andrei, I've written a blog article 
about the low-lock Singleton pattern in D.  This is a previously 
obscure pattern that uses thread-local storage to make Singletons 
both thread-safe and efficient and was independently invented by 
at least me and Alexander Terekhov, an IBM researcher.  However, 
D's first-class treatment of thread-local storage means the time 
has come to move it out of obscurity and possibly make it the 
standard way to do Singletons.

Article:
http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

Reddit:
http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/
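For readers who want the shape of the pattern at a glance, here is a minimal sketch in C++ terms (my own translation with invented names; the article itself uses D). A thread-local flag lets each thread synchronize at most once; after that, get() is a TLS read plus a plain pointer read:

```cpp
#include <cassert>
#include <mutex>

// Hypothetical C++ sketch of the low-lock singleton: the thread-local
// flag means a thread takes the mutex at most once in its lifetime.
class Singleton {
public:
    static Singleton& get() {
        static thread_local bool seenByThisThread = false;
        if (!seenByThisThread) {
            // Slow path, entered once per thread: the lock both guards
            // construction and establishes happens-before for this thread.
            std::lock_guard<std::mutex> lock(mutex_);
            if (!instance_)
                instance_ = new Singleton;
            seenByThisThread = true;  // TLS write: not shared, no barrier needed
        }
        // Safe without synchronization: this thread already went through
        // the lock once, so the write to instance_ happens-before this read.
        return *instance_;
    }
private:
    Singleton() = default;
    static Singleton* instance_;
    static std::mutex mutex_;
};

Singleton* Singleton::instance_ = nullptr;
std::mutex Singleton::mutex_;
```

The design point is that the steady-state path touches only thread-local and already-synchronized memory, so it needs neither a lock nor a memory barrier.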
May 05 2013
next sibling parent "Joshua Niehus" <jm.niehus gmail.com> writes:
On Monday, 6 May 2013 at 02:35:33 UTC, dsimcha wrote:
 Article:
 http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

 Reddit:
 http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/

Excellent talk at the conf, solid blog: +1 and 1
May 05 2013
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/5/13 10:35 PM, dsimcha wrote:
 On the advice of Walter and Andrei, I've written a blog article about
 the low-lock Singleton pattern in D. This is a previously obscure
 pattern that uses thread-local storage to make Singletons both
 thread-safe and efficient and was independently invented by at least me
 and Alexander Terekhov, an IBM researcher. However, D's first-class
 treatment of thread-local storage means the time has come to move it out
 of obscurity and possibly make it the standard way to do Singletons.

 Article:
 http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

 Reddit:
 http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/

The reddit post is in need of upvotes to make it onto the /r/programming front page. I think the posting time wasn't the best: Saturday is the worst day to post anything (http://www.slideshare.net/babasave/the-hidden-secrets-of-successful-reddit-posts).

Anyhow, I've also posted to https://news.ycombinator.com/item?id=5660897

Vote up!

Andrei
May 05 2013
prev sibling next sibling parent reply "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 02:35:33 UTC, dsimcha wrote:
 that uses thread-local storage
 On DMD the overhead of TLS vs. unsafe is noticeable but small.  
 In both cases it pales in comparison to the overhead of 
 synchronizing on every call to get().

Hmm... So I just invented another method right now in like 10 minutes, which can completely avoid TLS altogether, and which I think might end up being faster.

Basically, my idea is that since modern CPUs (x86) are great at predicting predictable virtual calls, you can use that to your advantage as shown below.

The code below is C#, but easy enough to turn into D... I don't have the time to do it at the moment, but if you're interested give it a try and see how it compares to TLS:

    class Program
    {
        private interface IValue<T> { T Get(); }

        private class ActualValue<T> : IValue<T>
        {
            private T value;
            public T Get() { return this.value; }
        }

        private class NullValue<T> : IValue<T>
        {
            // This field is initialized on startup
            public static IValue<T> _static = new NullValue<T>();

            public T Get()
            {
                lock (this)
                {
                    _static = new ActualValue<T>();
                }
                // Insert memory barrier if you'd like
                return _static.Get();
            }
        }

        public static object Static
        { get { return NullValue<object>._static.Get(); } }

        static void Main(string[] args)
        {
            var a = Static;  // initializes on first use
            var b = Static;  // doesn't initialize anymore
        }
    }
May 06 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
06-May-2013 13:06, Mehrdad writes:
 On Monday, 6 May 2013 at 02:35:33 UTC, dsimcha wrote:
 that uses thread-local storage
 On DMD the overhead of TLS vs. unsafe is noticeable but small. In both
 cases it pales in comparison to the overhead of synchronizing on every
 call to get().

 Hmm... So I just invented another method right now in like 10 minutes,
 which can completely avoid TLS altogether, and which I think might end
 up being faster.

 Basically, my idea is that since modern CPUs (x86) are great at
 predicting predictable virtual calls, you can use that to your
 advantage as shown below. The code below is C#, but easy enough to
 turn into D... I don't have the time to do it at the moment but if
 you're interested give it a try and see how it compares to TLS:

 class Program
 {
     private interface IValue<T> { T Get(); }

     private class ActualValue<T> : IValue<T>
     {
         private T value;
         public T Get() { return this.value; }
     }

     private class NullValue<T> : IValue<T>
     {
         // This field is initialized on startup
         public static IValue<T> _static = new NullValue<T>();

And that field is still shared... it doesn't matter if the null was replaced by NullValue and an if-branch with indirect call. You have to read a field to know what to do next, and the other processor may as well write to it.
          public T Get()
          {
              lock (this)
              {

                  _static = new ActualValue<T>();

Who told you that the processor will immediately see the fully constructed value of ActualValue upon assignment? Barriers and other minor forms of black magic are still required on each access (e.g. atomic reads and atomic writes), which isn't the case with the TLS flag as discussed.
              }
              // Insert memory barrier if you'd like
              return _static.Get();
          }
      }

      public static object Static
      { get { return NullValue<object>._static.Get(); } }


      static void Main(string[] args)
      {
          var a = Static;  // initializes on first use
          var b = Static;  // doesn't initialize anymore
      }
 }

-- Dmitry Olshansky
May 06 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
06-May-2013 14:06, Mehrdad writes:
 On Monday, 6 May 2013 at 09:35:59 UTC, Dmitry Olshansky wrote:
 You have to read a field to know what to do next, and the other
 processor may as well write to it.

That never happens, though. _static is only written to inside a lock,

Yes, but the 1st processor may read _static exactly when the 2nd is inside the lock and writing to that field. Then chances are it will read whatever partial state there is written.
 and the check is inside a lock, hence the other processor can't be
 writing to it when it's being read...

Surely it can. The 1st processor takes the lock and writes while the second one reads static_ to call Get.
 Maybe I misunderstood what you're saying?

In pseudo-code what you do is this:

    read static_              // 1st processor may be reading here
    if static_ is NullValue   // that is an indirect call
        lock (static_) {
            check again
            write static_     // while the 2nd one writes here
        }
    else
        call static_ --> Get

I claim that 'read static_' had better be "protected" against that 'write static_' inside the lock. One way is to ensure the write is atomic w.r.t. that particular read operation, e.g. by hardware acquire-release, or via both reads and writes being atomic. On x86, word-sized reads/writes are atomic anyway (iff aligned, apparently).

The neat thing about that TLS pattern is that you need not rely on hardware-specific support.

-- 
Dmitry Olshansky
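In C++11 atomics the protection described above looks like this (an illustrative sketch with invented names, not code from the thread): the unsynchronized read becomes an acquire load that pairs with a release store made under the lock.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

struct Widget { int value; Widget() : value(42) {} };

std::atomic<Widget*> g_instance{nullptr};  // the shared "static_" slot
std::mutex g_mutex;

// Classic *fixed* double-checked locking.
Widget* get() {
    // Acquire load: pairs with the release store below, so if we see a
    // non-null pointer we also see the fully constructed Widget.
    Widget* p = g_instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(g_mutex);
        p = g_instance.load(std::memory_order_relaxed);  // re-check under the lock
        if (p == nullptr) {
            p = new Widget;
            // Release store: the constructor's writes cannot be reordered
            // past this publication of the pointer.
            g_instance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```

Every fast-path call still pays for the acquire load, which is exactly the cost the TLS pattern avoids.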
May 06 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
06-May-2013 22:23, Mehrdad writes:
 On Monday, 6 May 2013 at 11:21:37 UTC, Dmitry Olshansky wrote:
 Yes, but the 1st processor may read _static exactly when the 2nd is
 inside the lock and writing to that field. Then chances are it will
 read whatever partial state there is written.
 Surely it can. 1st processor takes the lock and write while the second
 one reads static_ to call Get.

It's a single pointer, there is no partial state -- it's either written or it isn't.

True... e.g. on x86, if that word is aligned. But the core of the problem is not only that word; it's rather the fact that the processor (and compiler, and who not) is free to re-order read/write ops inside that locked region. Potentially it could lead to epically nasty things.

    struct Static {
        int value;
        Static(int v) { value = v; }
    }

Now suppose:

    lock (_static) {
        _static = new Static(42); // or smth like that
    }

is compiled down to this "C--" code (quite realistically):

    lock _static_mutex;
    x = alloc int;
    x[0] = 42;
    static_ = x;
    unlock _static_mutex;

And now the compiler/CPU decides to optimize/execute it out of order (again, it's an illustration) as:

    lock _static_mutex;
    x = alloc int;
    // even if that's atomic
    static_ = x;
    // BOOM! somebody not locking the mutex may already
    // see static_ in a "half-baked" state
    x[0] = 42;
    unlock _static_mutex;

Observe that:

a) It can happen, and nothing prevents such a scenario. In fact the compiler should feel free to optimize inside the locked section, just not across it.
b) The chance that it happens is non-zero and rises with the complexity of the construction.
c) It depends on the kind of hardware used (OoO vs non-OoO).

So it would work more reliably on old Atoms, if that is of any comfort ;). That is, if your compiler isn't equally smart and doing nasty things behind your back ("Sorry, I just scheduled instructions optimally...").
 One way is to ensure write is atomic w.r.t. that particular read
 operation

Are't pointer writes always atomic?

In short - no. Even not counting the world of legal re-ordering, unaligned writes have this nasty habit of partially-written stuff being visible at the wrong time. Truth be told, relying on this kind of "not ever likely to go wrong" behavior is the very reason explicit atomics are encouraged even if they are NOPs on your current arch. That, or the heavier stuff - barriers.

Speaking of barriers - that is exactly the kind of thing that would disallow moving the write x[0] = 42 past the "line" of static_ = x. Simply put, barriers act as a line of "no reordering across this point" for specific operations (depending on the barrier type, or for all of them). Needless to say, this hurts performance.

-- 
Dmitry Olshansky
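Made concrete with C++11 fences (a sketch with invented names): the release barrier sits between constructing the object and publishing the pointer, and the reader needs a matching acquire barrier before using what it read.

```cpp
#include <atomic>
#include <cassert>

int storage[1];
std::atomic<int*> static_{nullptr};

void publish() {
    int* x = storage;
    x[0] = 42;  // construct the object first...
    // Release fence: no preceding write may be reordered past the
    // publication of the pointer below.
    std::atomic_thread_fence(std::memory_order_release);
    static_.store(x, std::memory_order_relaxed);  // ...then publish
}

int* consume() {
    int* x = static_.load(std::memory_order_relaxed);
    // Acquire fence: no following read may be reordered above the load,
    // so a non-null x implies x[0] is already 42.
    std::atomic_thread_fence(std::memory_order_acquire);
    return x;
}
```

Both fences are required; a writer-side barrier alone does not stop the reader's CPU or compiler from reordering its own accesses.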
May 06 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-May-2013 10:47, Mehrdad writes:
 On Monday, 6 May 2013 at 18:56:08 UTC, Dmitry Olshansky wrote:


 Thanks for the detailed explanation!


 And now compiler/CPU decides to optimize/execute out of order (again,
 it's an illustration) it as:

 lock _static_mutex;
 x = alloc int;
 //even if that's atomic
 static_ = x;
 // BOOM! somebody not locking mutex may already
 // see static_ in "half-baked" state
 x[0] = 42;
 unlock _static_mutex;

That's exactly the same as the classic double-checked lock bug, right?

Yeah, and that was my point to begin with - your method doesn't bring anything new. It's the same as the one with null and an 'if-null check', with the same issues, and requires atomics or barriers.
 As I wrote in my original code -- and as you also mentioned yourself --
 isn't it trivially fixed with a memory barrier?

 Like maybe replacing

      _static = new ActualValue<T>();

 with

      var value = new ActualValue<T>();
      _ReadWriteBarrier();
      _static = value;



 Wouldn't this make it correct?

It would, but then it's the same as the old fixed double-checked locking. Barriers hurt the performance we were after to begin with.

Now it would be interesting to measure the speed of this TLS low-lock vs. an atomic load/memory barrier + mutex. That measurement is absent from the blog post, but Andrei claims a memory barrier on each access is too slow.

-- 
Dmitry Olshansky
May 07 2013
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/6/2013 2:06 AM, Mehrdad wrote:
 So I just invented another method right now in like 10 minutes, which can
 completely avoid TLS altogether, and which I think might end up being faster.

I did that too. It's trickier than it appears.
May 06 2013
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/6/13 5:06 AM, Mehrdad wrote:
 On Monday, 6 May 2013 at 02:35:33 UTC, dsimcha wrote:
 that uses thread-local storage
 On DMD the overhead of TLS vs. unsafe is noticeable but small. In both
 cases it pales in comparison to the overhead of synchronizing on every
 call to get().

Hmm... So I just invented another method right now in like 10 minutes, which can completely avoid TLS altogether, and which I think might end up being faster.

It's well known. Needs a memory barrier on each access, so it's slower. Andrei
May 06 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/6/13 2:25 PM, Mehrdad wrote:
 On Monday, 6 May 2013 at 13:33:54 UTC, Andrei Alexandrescu wrote:
 It's well known. Needs a memory barrier on each access, so it's slower.

Hmm, are you saying I'm missing a memory barrier I should have written, or are you saying I already have a memory barrier which I'm not seeing? The only memory barrier I have is during initialization, and after that only read operations occur.

Any concurrent operation (in this case a read from one thread and a write from another) requires a handshake between threads, most often in the form of a release write coupled with an acquire read. Whenever the handshake is absent but concurrent operations on shared memory do occur, the code is broken. The beauty of the TLS-based pattern is that in the steady state there's no need for a shared read and handshake.

Andrei
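The handshake in its smallest form (a C++11 sketch, hypothetical names): the writer's release store pairs with the reader's acquire load, and only that pairing makes the data write visible before the flag.

```cpp
#include <atomic>
#include <cassert>

int payload = 0;                 // plain shared data
std::atomic<bool> ready{false};  // the flag both sides handshake on

void writer() {
    payload = 42;                                  // prepare the data...
    ready.store(true, std::memory_order_release);  // ...then release-write the flag
}

int reader() {
    // Acquire read: once the flag is seen true, the write to payload
    // is guaranteed visible. With a relaxed load, it would not be.
    while (!ready.load(std::memory_order_acquire)) {}
    return payload;
}
```

Run writer() and reader() on two threads and reader() returns 42; drop the acquire side and the guarantee is gone even though the writer did everything right.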
May 06 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/7/13 2:50 AM, Mehrdad wrote:
 On Monday, 6 May 2013 at 18:46:56 UTC, Andrei Alexandrescu wrote:
 Any concurrent operation (in this case read from one thread and write
 from another) requires a handshake between threads, most often in the
 form of an release write coupled with an acquire read. Whenever the
 handshake is absent but concurrent operations on shared memory do
 occur, the code is broken. The beauty of the TLS-based pattern is that
 in the steady state there's no need for a shared read and handshake.

 Andrei

Hmm, are you referring to the same lack of a barrier that the others are also referring to? As far as I can see, there shouldn't be a need for any other handshake in this example. As long as the object is fully initialized before _static is written to (easy enough with just a memory barrier), there is no penalty for subsequent reads whatsoever. Right?

No. A tutorial on memory consistency models would be too long to insert here. I don't know of a good online resource, does anyone? Andrei
May 07 2013
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/7/13 10:31 AM, Steven Schveighoffer wrote:
 On Tue, 07 May 2013 09:25:36 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 No. A tutorial on memory consistency models would be too long to
 insert here. I don't know of a good online resource, does anyone?

In essence, a read requires an acquire memory barrier, a write requires a release memory barrier, but in this case, we only need to be concerned if the value we get back is not valid (i.e. NullValue). Once in steady state, there is no need to acquire (as long as the write is atomic, the read value will either be NullValue or ActualValue, not something else).

There's always a need to acquire so as to figure out whether the steady state has been entered.

Andrei
May 07 2013
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/7/13 12:30 PM, deadalnix wrote:
 On Tuesday, 7 May 2013 at 16:14:50 UTC, Steven Schveighoffer wrote:
 Not really. Whether it is entered or not is dictated by the vtable.
 Even classic double-check locking doesn't need an acquire outside the
 lock. Even if your CPU's view of the variable is outdated, the check
 after the memory barrier inside the lock only occurs once. After that,
 steady state is achieved. All subsequent reads need no memory
 barriers, because the singleton object will never change after that.

 The only thing we need to guard against is non-atomic writes, and out
 of order writes of the static variable (fixed with a memory barrier).
 Instruction ordering OUTSIDE the lock is irrelevant, because if we
 don't get the "steady state" value (not null), then we go into the
 lock to perform the careful initialization with barriers.

 I think aligned native word writes are atomic, so we don't have to
 worry about that.

That is incorrect as the thread not going into the lock can see a partially initialized object.

Correct. Andrei
May 07 2013
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/7/13 12:46 PM, Steven Schveighoffer wrote:
 On Tue, 07 May 2013 12:30:05 -0400, deadalnix <deadalnix gmail.com> wrote:

 On Tuesday, 7 May 2013 at 16:14:50 UTC, Steven Schveighoffer wrote:
 Not really. Whether it is entered or not is dictated by the vtable.
 Even classic double-check locking doesn't need an acquire outside the
 lock. Even if your CPU's view of the variable is outdated, the check
 after the memory barrier inside the lock only occurs once. After
 that, steady state is achieved. All subsequent reads need no memory
 barriers, because the singleton object will never change after that.

 The only thing we need to guard against is non-atomic writes, and out
 of order writes of the static variable (fixed with a memory barrier).
 Instruction ordering OUTSIDE the lock is irrelevant, because if we
 don't get the "steady state" value (not null), then we go into the
 lock to perform the careful initialization with barriers.

 I think aligned native word writes are atomic, so we don't have to
 worry about that.

That is incorrect as the thread not going into the lock can see a partially initialized object.

The memory barrier prevents that. You don't store the variable until the object is initialized. That is the whole point.

A memory barrier is not a one-way thing, i.e. not only the writer must do it. Any operation on shared memory is a handshake between the writer and the reader. If the reader doesn't do its bit, it can see the writes out of order no matter what the writer does. Andrei
May 07 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-May-2013 23:49, Andrei Alexandrescu writes:
 On 5/7/13 12:46 PM, Steven Schveighoffer wrote:
 On Tue, 07 May 2013 12:30:05 -0400, deadalnix <deadalnix gmail.com>
 wrote:


 That is incorrect as the thread not going into the lock can see a
 partially initialized object.

The memory barrier prevents that. You don't store the variable until the object is initialized. That is the whole point.

A memory barrier is not a one-way thing, i.e. not only the writer must do it. Any operation on shared memory is a handshake between the writer and the reader. If the reader doesn't do its bit, it can see the writes out of order no matter what the writer does.

Returning to the confusing point. On x86 things are actually muddied by stronger-than-required hardware guarantees. Only because of this is there no need for an explicit read barrier (nor does x86 have one) in this case. Still, the read operation has to be marked specifically (volatile, asm block, whatever) to ensure the _compiler_ does the right thing (no reordering across it).

-- 
Dmitry Olshansky
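A sketch of what "marking the read" for the compiler can look like in C++ (invented names; the hardware fence is absent on purpose, since on x86 only the compiler needs restraining here):

```cpp
#include <atomic>

int value = 42;
int* static_ = nullptr;  // the published pointer

int* read_published() {
    // Read through a volatile view so the compiler must emit a real load
    // rather than cache the pointer in a register.
    int* p = *const_cast<int* volatile*>(&static_);
    // atomic_signal_fence is a compiler-only barrier: it generates no CPU
    // instruction, but forbids the compiler from moving later memory
    // accesses above this point.
    std::atomic_signal_fence(std::memory_order_acquire);
    return p;
}
```

On weaker architectures this is not enough; there std::atomic_thread_fence (a real hardware barrier) or an atomic acquire load would be required instead.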
May 07 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/7/13 6:12 PM, Steven Schveighoffer wrote:
 So the memory barrier ensures that neither the compiler nor the
 processor can re-order the stores to memory.

 But you are saying that the *reader* must also put in a memory barrier,
 otherwise it might see the stores out of order.

Yes. (One detail is there are several kinds of barriers, such as acquire and release.)
 It does not make sense to me, how does the reader see a half-initialized
 object, when the only way it sees anything is when it's stored to main
 memory *and* the stores are done in order?

There are several actors involved: two processors, each with its own cache, and the main memory. There are several cache coherency protocols, which reflect in different programming models.
 So that must not be what it is doing. What it must be doing is storing
 out of order, BUT placing a prevention mechanism from reading the memory
 until the "release" is done? Kind of like a minuscule mutex lock. So
 while it is out-of-order writing the object data, it holds a lock on the
 reference pointer to that data, so anyone using acquire cannot read it yet?

 That actually makes sense to me. Is that the case?

Not at all. A memory barrier dictates operation ordering. It doesn't do interlocking. One of Herb's slides shows very nicely how memory operations can't be moved ahead of an acquire and past a release.
 Returning to the confusing point.

 On x86 things are actually muddied by stronger then required hardware
 guarantees. And only because of this there no need for explicit read
 barrier (nor x86 have one) in this case. Still the read operation has
 to be marked specifically (volatile, asm block, whatever) to ensure
 the _compiler_ does the right thing (no reordering of across it).

I would think the model Mehrdad used would be sufficient to prevent caching of the data, since it is a virtual, no?

No. Andrei
May 07 2013
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/7/13 5:57 PM, Mehrdad wrote:
 On Tuesday, 7 May 2013 at 19:49:30 UTC, Andrei Alexandrescu wrote:
 A memory barrier is not a one-way thing, i.e. not only the writer must
 do it. Any operation on shared memory is a handshake between the
 writer and the reader. If the reader doesn't do its bit, it can see
 the writes out of order no matter what the writer does.

 Andrei

Andrew, I still don't understand: The writer is ensuring that writes to memory are happening _after_ the object is initialized and _before_ the reference to the old object is modified, via a memory barrier.

The writer is only half of the equation. The reader has its own cache to worry about and its own loading order.
 Unless you're claiming that a memory barrier _doesn't_ do what it's
 supposed to (i.e., the memory module is executing writes out-of-order
 even though the processor is issuing them in the correct order), there
 is no way for _anyone_ to see a partially initialized object anywhere...

I'm not claiming, I'm destroying :o). There is. I know it's confusing. You may want to peruse the reading materials linked by others. Andrei
May 07 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/7/2013 6:25 AM, Andrei Alexandrescu wrote:
 No. A tutorial on memory consistency models would be too long to insert here. I
 don't know of a good online resource, does anyone?

Here's one written by this Andrei Alexandrescu fellow, I hear he knows what he's talking about: http://erdani.com/publications/DDJ_Jul_Aug_2004_revised.pdf
May 07 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/7/13 1:31 PM, Walter Bright wrote:
 On 5/7/2013 6:25 AM, Andrei Alexandrescu wrote:
 No. A tutorial on memory consistency models would be too long to
 insert here. I
 don't know of a good online resource, does anyone?

Here's one written by this Andrei Alexandrescu fellow, I hear he knows what he's talking about: http://erdani.com/publications/DDJ_Jul_Aug_2004_revised.pdf

That doesn't get into memory models. Andrei
May 07 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-May-2013 17:25, Andrei Alexandrescu writes:
 On 5/7/13 2:50 AM, Mehrdad wrote:
 On Monday, 6 May 2013 at 18:46:56 UTC, Andrei Alexandrescu wrote:
 Any concurrent operation (in this case read from one thread and write
 from another) requires a handshake between threads, most often in the
 form of an release write coupled with an acquire read. Whenever the
 handshake is absent but concurrent operations on shared memory do
 occur, the code is broken. The beauty of the TLS-based pattern is that
 in the steady state there's no need for a shared read and handshake.

 Andrei

Hmm, are you referring to the same lack of a barrier that the others are also referring to? As far as I can see, there shouldn't be a need for any other handshake in this example. As long as the object is fully initialized before _static is written to (easy enough with just a memory barrier), there is no penalty for subsequent reads whatsoever. Right?

No. A tutorial on memory consistency models would be too long to insert here. I don't know of a good online resource, does anyone?

Sutter's Mill is a good starting point even though it's C++-biased. Two recent insightful talks on the C++11 memory model with down-and-dirty details: http://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/

-- 
Dmitry Olshansky
May 07 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 09:06:55 UTC, Mehrdad wrote:
 lock (this)
 {
 	_static = new ActualValue<T>();
 }

Oops I forgot the null check inside, don't forget that.
May 06 2013
prev sibling next sibling parent "Sergei Nosov" <sergei.nosov gmail.com> writes:
On Monday, 6 May 2013 at 09:11:00 UTC, Mehrdad wrote:
 On Monday, 6 May 2013 at 09:06:55 UTC, Mehrdad wrote:
 lock (this)
 {
 	_static = new ActualValue<T>();
 }

Oops I forgot the null check inside, don't forget that.

All that double-checking stuff is trying to avoid calling "lock" for the entire Get function body. Your solution does exactly that, so it's no good.
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 09:30:24 UTC, Sergei Nosov wrote:
 On Monday, 6 May 2013 at 09:11:00 UTC, Mehrdad wrote:
 On Monday, 6 May 2013 at 09:06:55 UTC, Mehrdad wrote:
 lock (this)
 {
 	_static = new ActualValue<T>();
 }

Oops I forgot the null check inside, don't forget that.

All that double-checking stuff is trying to avoid calling "lock" for entire Get function body. Your solution does exactly that. So it's no good.

There are _two_ Get() functions. Only one of them calls lock(); once the field is initialized, that Get() is no longer called, and the other one is called instead. Unless I'm missing something?
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 09:35:59 UTC, Dmitry Olshansky wrote:
 Who told you that the processor will immediately see the fully 
 constructed value of ActualValue upon assignment? Barriers and 
 other minor forms of black magic are still required

Isn't that why I wrote this?
             // Insert memory barrier if you'd like


May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 09:35:59 UTC, Dmitry Olshansky wrote:
 You have to read a field to know what to do next, and the other 
 processor may as well write to it.

That never happens, though. _static is only written to inside a lock, and the check is inside a lock, hence the other processor can't be writing to it when it's being read... Maybe I misunderstood what you're saying?
May 06 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Monday, 6 May 2013 at 10:06:50 UTC, Mehrdad wrote:
 On Monday, 6 May 2013 at 09:35:59 UTC, Dmitry Olshansky wrote:
 You have to read a field to know what to do next, and the 
 other processor may as well write to it.

That never happens, though. _static is only written to inside a lock, and the check is inside a lock, hence the other processor can't be writing to it when it's being read... Maybe I misunderstood what you're saying?

On x86 you rarely see this kind of thing, as the memory model is strong. You can see some weird things in rare cases, but as soon as you leave the x86 world, it isn't even worth trying.
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 10:12:53 UTC, deadalnix wrote:
 On Monday, 6 May 2013 at 10:06:50 UTC, Mehrdad wrote:
 On Monday, 6 May 2013 at 09:35:59 UTC, Dmitry Olshansky wrote:
 You have to read a field to know what to do next, and the 
 other processor may as well write to it.

That never happens, though. _static is only written to inside a lock, and the check is inside a lock, hence the other processor can't be writing to it when it's being read... Maybe I misunderstood what you're saying?

ON x86 you rarely see this kind of thing as the memory model is strong. You can see some weird thing in rare cases, but as soon as you leave x86 world, it isn't even worth trying.

Ah yeah, this entire thing was geared toward x86. On second thought, it probably doesn't improve anything compared to the double-checked lock, since branch prediction for an 'if' is most likely faster than prediction for a virtual call anyway. It was worth trying, I guess...
May 06 2013
prev sibling next sibling parent reply "Max Samukha" <maxsamukha gmail.com> writes:
On Monday, 6 May 2013 at 02:35:33 UTC, dsimcha wrote:
 On the advice of Walter and Andrei, I've written a blog article 
 about the low-lock Singleton pattern in D.  This is a 
 previously obscure pattern that uses thread-local storage to 
 make Singletons both thread-safe and efficient and was 
 independently invented by at least me and Alexander Terekhov, 
 an IBM researcher.  However, D's first-class treatment of 
 thread-local storage means the time has come to move it out of 
 obscurity and possibly make it the standard way to do 
 Singletons.

 Article:
 http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

 Reddit:
 http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/

FWIW, I played with a generalized form of this pattern long ago, something like (typing from memory):

    template sharedGlobal(alias ctor, alias lock = globalLock)
    {
        @property sharedGlobal()
        {
            alias ReturnType!ctor T;
            __gshared static Nullable!T t;
            static bool instantiated;
            if (!instantiated)
            {
                synchronized (lock)
                {
                    if (t is null)
                        t = ctor();
                }
            }
            return cast(T) t;
        }
    }

    alias sharedGlobal!({ return new Blah(); }) blah;

It should have been part of QtD but that has never happened.
May 06 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/6/2013 6:14 AM, Max Samukha wrote:
 FWIW, I played with a generalized form of this pattern long ago, something like
 (typing from memory):

And, that's the classic double checked locking bug!
May 06 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 06 May 2013 13:58:19 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 5/6/2013 6:14 AM, Max Samukha wrote:
 FWIW, I played with a generalized form of this pattern long ago,  
 something like
 (typing from memory):

And, that's the classic double checked locking bug!

I think that's exactly what David had in his slide (well, except for an obvious typo). He's only checking t once, and protecting the lock/check with a TLS boolean. Although the set of the boolean is missing (probably a memory error, since he typed it from memory :) -Steve
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 11:21:37 UTC, Dmitry Olshansky wrote:
 Yes, but the 1st processor may read _static exactly when the 
 2nd is inside the lock and writing to that field. Then chances 
 are it will read whatever partial state there is written.
 Surely it can. 1st processor takes the lock and write while the 
 second one reads static_ to call Get.

It's a single pointer, there is no partial state -- it's either written or it isn't.
 One way is to ensure write is atomic w.r.t. that particular 
 read operation

Are't pointer writes always atomic?
 The neat stuff about that TLS pattern was that you need not to 
 rely on hardware specific support.

Ah okay I see, didn't realize that.
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 13:33:54 UTC, Andrei Alexandrescu wrote:
 It's well known. Needs a memory barrier on each access, so it's 
 slower.

Hmm, are you saying I'm missing a memory barrier I should have written, or are you saying I already have a memory barrier which I'm not seeing? The only memory barrier I have is during initialization, and after that only read operations occur.
May 06 2013
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Monday, 6 May 2013 at 18:23:51 UTC, Mehrdad wrote:
 One way is to ensure write is atomic w.r.t. that particular 
 read operation


No. But *aligned* word-sized writes *on x86* are. David
May 06 2013
prev sibling next sibling parent "Idan Arye" <GenericNPC gmail.com> writes:
On Monday, 6 May 2013 at 02:35:33 UTC, dsimcha wrote:
 On the advice of Walter and Andrei, I've written a blog article 
 about the low-lock Singleton pattern in D.  This is a 
 previously obscure pattern that uses thread-local storage to 
 make Singletons both thread-safe and efficient and was 
 independently invented by at least me and Alexander Terekhov, 
 an IBM researcher.  However, D's first-class treatment of 
 thread-local storage means the time has come to move it out of 
 obscurity and possibly make it the standard way to do 
 Singletons.

 Article:
 http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

 Reddit:
 http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/

Thanks! I want to make a module with template mixins that implement some common idioms - singleton being one of them(http://forum.dlang.org/thread/fofbrlqruxbevnxchxdp forum.dlang.org). I'm going to use your version for implementing the singleton idiom.
May 06 2013
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 5/6/13, dsimcha <dsimcha yahoo.com> wrote:
 On the advice of Walter and Andrei, I've written a blog article
 about the low-lock Singleton pattern in D.

Personally this covers it for me:

final abstract class singleton
{
static:
    void foo() { }
    int f;
}

void main()
{
    singleton.foo();
    singleton.f = 1;
}

Never needed anything more complex than that. :p
May 06 2013
prev sibling next sibling parent "Diggory" <diggsey googlemail.com> writes:
On Monday, 6 May 2013 at 21:13:24 UTC, Andrej Mitrovic wrote:
 On 5/6/13, dsimcha <dsimcha yahoo.com> wrote:
 On the advice of Walter and Andrei, I've written a blog article
 about the low-lock Singleton pattern in D.

Personally this covers it for me:

final abstract class singleton
{
static:
    void foo() { }
    int f;
}

void main()
{
    singleton.foo();
    singleton.f = 1;
}

Never needed anything more complex than that. :p

Although that's only a single-per-thread-ton :P
May 06 2013
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 5/6/13, Diggory <diggsey googlemail.com> wrote:
 Although that's only a single-per-thread-ton :P

My bad.
 final abstract class singleton
 {
 __gshared:
     void foo() { }
     int f;
 }

May 06 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 06 May 2013 17:36:00 -0400, Andrej Mitrovic  
<andrej.mitrovich gmail.com> wrote:

 On 5/6/13, Diggory <diggsey googlemail.com> wrote:
 Although that's only a single-per-thread-ton :P

My bad.
 final abstract class singleton
 {
 __gshared:
     void foo() { }
     int f;
 }


The point that is missed is the lazy creation. Imagine if it's not just an int but a whole large resource, which takes a long time to create/open. If all we care about is access, and you *know* you are going to need it, you can just create it eagerly via shared static this(). That is not what the singleton pattern is for.

-Steve
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 18:51:12 UTC, David Nadlinger wrote:
 On Monday, 6 May 2013 at 18:23:51 UTC, Mehrdad wrote:
 One way is to ensure write is atomic w.r.t. that particular 
 read operation


No. But *aligned* word-sized writes *on x86* are. David

Right, but this was specifically meant for x86 (as I mentioned in the OP) and pointers are aligned by default. :)
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 18:56:08 UTC, Dmitry Olshansky wrote:


Thanks for the detailed explanation!


 And now compiler/CPU decides to optimize/execute out of order 
 (again, it's an illustration) it as:

 lock _static_mutex;
 x = alloc int;
 //even if that's atomic
 static_ = x;
 // BOOM! somebody not locking mutex may already
 // see static_ in "half-baked" state
 x[0] = 42;
 unlock _static_mutex;

That's exactly the same as the classic double-checked lock bug, right?

As I wrote in my original code -- and as you also mentioned yourself -- isn't it trivially fixed with a memory barrier? Like maybe replacing

    _static = new ActualValue<T>();

with

    var value = new ActualValue<T>();
    _ReadWriteBarrier();
    _static = value;

Wouldn't this make it correct?
 Aren't pointer writes always atomic?

In short - no. Even not counting the world of legal re-ordering, unaligned writes

But my example was completely aligned...
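For what it's worth, the acquire/release handshake being debated here can be written explicitly in D with core.atomic. This is an illustrative sketch, not code from the thread; the names Config and getConfig, and the use of the current MemoryOrder enum, are assumptions:

```d
import core.atomic;

class Config
{
    int value = 42;
}

// Shared slot holding the lazily created instance.
private shared Config instance_;

Config getConfig()
{
    // Acquire read: if we observe a non-null reference, we also
    // observe every store the publishing thread made before its
    // release store below.
    auto p = atomicLoad!(MemoryOrder.acq)(instance_);
    if (p is null)
    {
        synchronized // double-checked: re-test under the lock
        {
            p = atomicLoad!(MemoryOrder.acq)(instance_);
            if (p is null)
            {
                auto fresh = new shared Config;
                // Release write: the construction of 'fresh' cannot
                // be reordered past this publication.
                atomicStore!(MemoryOrder.rel)(instance_, fresh);
                p = fresh;
            }
        }
    }
    return cast(Config) p;
}
```

The release store pairs with the acquire load: a reader that sees the non-null reference is guaranteed to also see the completed construction, which is exactly what a plain pointer store does not promise on every architecture.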
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 18:46:56 UTC, Andrei Alexandrescu wrote:
 Any concurrent operation (in this case read from one thread and 
 write from another) requires a handshake between threads, most 
 often in the form of an release write coupled with an acquire 
 read. Whenever the handshake is absent but concurrent 
 operations on shared memory do occur, the code is broken. The 
 beauty of the TLS-based pattern is that in the steady state 
 there's no need for a shared read and handshake.

 Andrei

Hmm, are you referring to the same lack of a barrier that the others are also referring to? As far as I can see, there shouldn't be a need for any other handshake in this example. As long as the object is fully initialized before _static is written to (easy enough with just a memory barrier), there is no penalty for subsequent reads whatsoever. Right?
May 06 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 07 May 2013 09:25:36 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 No. A tutorial on memory consistency models would be too long to insert  
 here. I don't know of a good online resource, does anyone?

In essence, a read requires an acquire memory barrier and a write requires a release memory barrier, but in this case we only need to be concerned if the value we get back is not valid (i.e. NullValue). Once in steady state, there is no need to acquire (as long as the write is atomic, the read value will either be NullValue or ActualValue, not something else).

The code in the NullValue case must be handled very carefully with the correct memory barriers (given that it should only execute once, just insert a full memory barrier). But that is not the steady state.

It's not a revolutionary design, it's basic double-checked locking (implemented in an unnecessarily complex way). It can be done right, but it's still really difficult to get right. The benefit of David's method is that it's REALLY easy to get right, and is memory-model independent.

-Steve
May 07 2013
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Tuesday, 7 May 2013 at 06:50:16 UTC, Mehrdad wrote:
 As far as I can see, there shouldn't be a need for any other 
 handshake in this example.

 As long as the object is fully initialized before _static is 
 written to (easy enough with just a memory barrier), there is 
 no penalty for subsequent reads whatsoever.

 Right?

The issue is that the write to _static might never appear on the other threads, thus leading to multiple instances being created - even though it is atomic in the sense that you never end up reading a pointer with e.g. only half of the bytes updated. David
May 07 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 07 May 2013 10:33:13 -0400, David Nadlinger <see klickverbot.at>  
wrote:

 On Tuesday, 7 May 2013 at 06:50:16 UTC, Mehrdad wrote:
 As far as I can see, there shouldn't be a need for any other handshake  
 in this example.

 As long as the object is fully initialized before _static is written to  
 (easy enough with just a memory barrier), there is no penalty for  
 subsequent reads whatsoever.

 Right?

The issue is that the write to _static might never appear on the other threads, thus leading to multiple instances being created - even though it is atomic in the sense that you never end up reading a pointer with e.g. only half of the bytes updated.

I don't think that is an issue. The write to _static is protected by a lock, which should present a consistent view of it inside the lock. Mehrdad corrected his version right away so that the value is checked again inside the lock. This essentially is classic double-checked locking, with one check being a virtual table lookup.

-Steve
May 07 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 07 May 2013 11:30:12 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 5/7/13 10:31 AM, Steven Schveighoffer wrote:
 On Tue, 07 May 2013 09:25:36 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 No. A tutorial on memory consistency models would be too long to
 insert here. I don't know of a good online resource, does anyone?

In essence, a read requires an acquire memory barrier, a write requires a release memory barrier, but in this case, we only need to be concerned if the value we get back is not valid (i.e. NullValue). Once in steady state, there is no need to acquire (as long as the write is atomic, the read value will either be NullValue or ActualValue, not something else).

There's always a need to acquire so as to figure whether the steady state has been entered.

Not really. Whether it is entered or not is dictated by the vtable. Even classic double-checked locking doesn't need an acquire outside the lock. Even if your CPU's view of the variable is outdated, the check after the memory barrier inside the lock only occurs once. After that, steady state is achieved. All subsequent reads need no memory barriers, because the singleton object will never change after that.

The only thing we need to guard against is non-atomic writes, and out-of-order writes of the static variable (fixed with a memory barrier). Instruction ordering OUTSIDE the lock is irrelevant, because if we don't get the "steady state" value (not null), then we go into the lock to perform the careful initialization with barriers. I think aligned native word writes are atomic, so we don't have to worry about that.

But I think we've spent enough time on this solution. Yes, double-checked locking can be done, but David's pattern is far easier to implement, understand, and explain. It comes at a small cost of checking a boolean before each access of the initialized data. His benchmarks show a very small performance penalty. And another LARGE benefit is that you don't have to pull out your obscure (possibly challenged) memory model book/blog post or the CPU spec to prove it :)

Hmm... you might be able to mitigate the penalty by storing the actual object reference instead of a bool in the _instantiated variable. Then a separate load is not required. David?

-Steve
May 07 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 7 May 2013 at 16:14:50 UTC, Steven Schveighoffer 
wrote:
 Not really.  Whether it is entered or not is dictated by the 
 vtable.  Even classic double-check locking doesn't need an 
 acquire outside the lock.  Even if your CPU's view of the 
 variable is outdated, the check after the memory barrier inside 
 the lock only occurs once.  After that, steady state is 
 achieved.  All subsequent reads need no memory barriers, 
 because the singleton object will never change after that.

 The only thing we need to guard against is non-atomic writes, 
 and out of order writes of the static variable (fixed with a 
 memory barrier).  Instruction ordering OUTSIDE the lock is 
 irrelevant, because if we don't get the "steady state" value 
 (not null), then we go into the lock to perform the careful 
 initialization with barriers.

 I think aligned native word writes are atomic, so we don't have 
 to worry about that.

That is incorrect as the thread not going into the lock can see a partially initialized object.
May 07 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 07 May 2013 12:30:05 -0400, deadalnix <deadalnix gmail.com> wrote:

 On Tuesday, 7 May 2013 at 16:14:50 UTC, Steven Schveighoffer wrote:
 Not really.  Whether it is entered or not is dictated by the vtable.   
 Even classic double-check locking doesn't need an acquire outside the  
 lock.  Even if your CPU's view of the variable is outdated, the check  
 after the memory barrier inside the lock only occurs once.  After that,  
 steady state is achieved.  All subsequent reads need no memory  
 barriers, because the singleton object will never change after that.

 The only thing we need to guard against is non-atomic writes, and out  
 of order writes of the static variable (fixed with a memory barrier).   
 Instruction ordering OUTSIDE the lock is irrelevant, because if we  
 don't get the "steady state" value (not null), then we go into the lock  
 to perform the careful initialization with barriers.

 I think aligned native word writes are atomic, so we don't have to  
 worry about that.

That is incorrect as the thread not going into the lock can see a partially initialized object.

The memory barrier prevents that. You don't store the variable until the object is initialized. That is the whole point. -Steve
May 07 2013
prev sibling next sibling parent "QAston" <qaston gmail.com> writes:
 No. A tutorial on memory consistency models would be too long 
 to insert here. I don't know of a good online resource, does 
 anyone?

 Andrei

This was very helpful for me - focuses much more on the memory model itself than the c++11 part. http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2
May 07 2013
prev sibling next sibling parent "QAston" <qaston gmail.com> writes:
On Tuesday, 7 May 2013 at 20:17:43 UTC, QAston wrote:
 No. A tutorial on memory consistency models would be too long 
 to insert here. I don't know of a good online resource, does 
 anyone?

 Andrei

This was very helpful for me - focuses much more on the memory model itself than the c++11 part. http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2

Nah, look at Dmitry's post, he was the first to post that :P
May 07 2013
prev sibling next sibling parent "Sean Kelly" <sean invisibleduck.org> writes:
On Tuesday, 7 May 2013 at 13:25:37 UTC, Andrei Alexandrescu wrote:
 A tutorial on memory consistency models would be too long to 
 insert here. I don't know of a good online resource, does 
 anyone?

The best explanation I've found of memory models is this paper: http://dl.acm.org/citation.cfm?id=325102 Which can be had for free (albeit in terrible print quality) here: http://eecs.vanderbilt.edu/courses/eece343/papers/p15-gharachorloo.pdf Kourosh Gharachorloo actually has a number of really good papers about this stuff. Here's another, if you're interested in the details of how things work: http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-9.pdf
May 07 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Tuesday, 7 May 2013 at 19:49:30 UTC, Andrei Alexandrescu wrote:
 A memory barrier is not a one-way thing, i.e. not only the 
 writer must do it. Any operation on shared memory is a 
 handshake between the writer and the reader. If the reader 
 doesn't do its bit, it can see the writes out of order no 
 matter what the writer does.

 Andrei

Andrew, I still don't understand: The writer is ensuring that writes to memory are happening _after_ the object is initialized and _before_ the reference to the old object is modified, via a memory barrier. Unless you're claiming that a memory barrier _doesn't_ do what it's supposed to (i.e., the memory module is executing writes out-of-order even though the processor is issuing them in the correct order), there is no way for _anyone_ to see a partially initialized object anywhere...
May 07 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 07 May 2013 16:02:08 -0400, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 07-May-2013 23:49, Andrei Alexandrescu writes:
 On 5/7/13 12:46 PM, Steven Schveighoffer wrote:
 On Tue, 07 May 2013 12:30:05 -0400, deadalnix <deadalnix gmail.com>
 wrote:

 That is incorrect as the thread not going into the lock can see a
 partially initialized object.

 The memory barrier prevents that. You don't store the variable until the
 object is initialized. That is the whole point.

 A memory barrier is not a one-way thing, i.e. not only the writer must
 do it. Any operation on shared memory is a handshake between the writer
 and the reader. If the reader doesn't do its bit, it can see the writes
 out of order no matter what the writer does.

So the memory barrier ensures that neither the compiler nor the processor can re-order the stores to memory.

But you are saying that the *reader* must also put in a memory barrier, otherwise it might see the stores out of order. It does not make sense to me, how does the reader see a half-initialized object, when the only way it sees anything is when it's stored to main memory *and* the stores are done in order?

So that must not be what it is doing. What it must be doing is storing out of order, BUT placing a prevention mechanism from reading the memory until the "release" is done? Kind of like a minuscule mutex lock. So while it is out-of-order writing the object data, it holds a lock on the reference pointer to that data, so anyone using acquire cannot read it yet?

That actually makes sense to me. Is that the case?

 Returning to the confusing point.

 On x86 things are actually muddied by stronger than required hardware
 guarantees. And only because of this there is no need for an explicit
 read barrier (nor does x86 have one) in this case. Still, the read
 operation has to be marked specifically (volatile, asm block, whatever)
 to ensure the _compiler_ does the right thing (no reordering across it).

I would think the model Mehrdad used would be sufficient to prevent caching of the data, since it is a virtual call, no?

-Steve
May 07 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Tuesday, 7 May 2013 at 22:44:28 UTC, Andrei Alexandrescu wrote:
 The writer is only half of the equation. The reader has its own 
 cache to worry about and its own loading order.

Oooh! So basically this is the scenario you're referring to?

1. The reader has the uninitialized data in its cache
2. The writer writes the new data and a pointer to the new data
3. The reader sees a new pointer and attempts to load the new data
4. The reader receives stale data from its cache

In that case....
 I'm not claiming, I'm destroying :o)

... well done!!
 There is. I know it's confusing. You may want to peruse the 
 reading materials linked by others.

Ok thanks!
May 07 2013
prev sibling next sibling parent Brad Roberts <braddr puremagic.com> writes:
On 5/7/13 3:44 PM, Andrei Alexandrescu wrote:
 On 5/7/13 5:57 PM, Mehrdad wrote:
 On Tuesday, 7 May 2013 at 19:49:30 UTC, Andrei Alexandrescu wrote:
 A memory barrier is not a one-way thing, i.e. not only the writer must
 do it. Any operation on shared memory is a handshake between the
 writer and the reader. If the reader doesn't do its bit, it can see
 the writes out of order no matter what the writer does.

 Andrei

Andrew, I still don't understand: The writer is ensuring that writes to memory are happening _after_ the object is initialized and _before_ the reference to the old object is modified, via a memory barrier.

The writer is only half of the equation. The reader has its own cache to worry about and its own loading order.
 Unless you're claiming that a memory barrier _doesn't_ do what it's
 supposed to (i.e., the memory module is executing writes out-of-order
 even though the processor is issuing them in the correct order), there
 is no way for _anyone_ to see a partially initialized object anywhere...

I'm not claiming, I'm destroying :o). There is. I know it's confusing. You may want to peruse the reading materials linked by others. Andrei

(this might be a repeat, I've only skimmed this thread) Section 8.2 of this doc is a good read: http://download.intel.com/products/processor/manual/325384.pdf There's a couple allowed re-orderings that are not what most people expect. Later, Brad
May 07 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 07 May 2013 18:49:19 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 5/7/13 6:12 PM, Steven Schveighoffer wrote:
 So that must not be what it is doing. What it must be doing is storing
 out of order, BUT placing a prevention mechanism from reading the memory
 until the "release" is done? Kind of like a minuscule mutex lock. So
 while it is out-of-order writing the object data, it holds a lock on the
 reference pointer to that data, so anyone using acquire cannot read it  
 yet?

 That actually makes sense to me. Is that the case?

Not at all. A memory barrier dictates operation ordering. It doesn't do interlocking. One of Herb's slides shows very nicely how memory operations can't be moved ahead of an acquire and past a release.

I will have to watch those some time, when I have 3 hours to kill :) Thanks for sticking with me and not dismissing out of hand. This is actually the first time I felt close to grasping this concept. -Steve
May 07 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 05 May 2013 22:35:27 -0400, dsimcha <dsimcha yahoo.com> wrote:

 On the advice of Walter and Andrei, I've written a blog article about  
 the low-lock Singleton pattern in D.  This is a previously obscure  
 pattern that uses thread-local storage to make Singletons both  
 thread-safe and efficient and was independently invented by at least me  
 and Alexander Terekhov, an IBM researcher.  However, D's first-class  
 treatment of thread-local storage means the time has come to move it out  
 of obscurity and possibly make it the standard way to do Singletons.

 Article:
 http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

 Reddit:
 http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/

Pulling this out from the depths of this discussion:

David, the current pattern protects the read of the __gshared singleton with a thread-local boolean. This means to check to see if the value is valid, we do:

if(!instantiated_)
{
    ... // thread-safe initialization of instance_
    instantiated_ = true;
}
return instance_;

This requires a load of the boolean, and then a load of the instance. I wonder if it might be more efficient to store a copy of the instance instead of the bool? This would only require one load for the check:

if(!instantiated_instance_) // a TLS copy of the instance
{
    ... // thread-safe initialization of instance_
    instantiated_instance_ = instance_;
}
return instantiated_instance_;

I think in the steady state this devolves to the "unsafe" case, which of course is safe by that time. Might this account for at least some of dmd's poorer performance?

-Steve
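A self-contained sketch of that suggestion (names illustrative): the thread-local variable caches the reference itself, so it doubles as the instantiated flag and the steady-state path is a single load.

```d
class MySingleton
{
    private this() {}

    // The single instance, shared across all threads.
    private __gshared MySingleton instance_;

    // Thread-local cache of the reference: a non-null value means
    // this thread has already synchronized once, so no separate
    // boolean load is needed.
    private static MySingleton tlsInstance_;

    static MySingleton get()
    {
        if (tlsInstance_ is null)
        {
            synchronized (MySingleton.classinfo)
            {
                if (instance_ is null)
                    instance_ = new MySingleton;
            }
            tlsInstance_ = instance_;
        }
        return tlsInstance_;
    }
}
```

In steady state, get() costs one TLS load plus a null test; the __gshared reference is only read on each thread's first pass.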
May 07 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 7 May 2013 at 22:12:17 UTC, Steven Schveighoffer 
wrote:
 On Tue, 07 May 2013 16:02:08 -0400, Dmitry Olshansky 
 <dmitry.olsh gmail.com> wrote:

 07-May-2013 23:49, Andrei Alexandrescu пишет:
 On 5/7/13 12:46 PM, Steven Schveighoffer wrote:
 On Tue, 07 May 2013 12:30:05 -0400, deadalnix 
 <deadalnix gmail.com>
 wrote:


 That is incorrect as the thread not going into the lock can 
 see a
 partially initialized object.

The memory barrier prevents that. You don't store the variable until the object is initialized. That is the whole point.

A memory barrier is not a one-way thing, i.e. not only the writer must do it. Any operation on shared memory is a handshake between the writer and the reader. If the reader doesn't do its bit, it can see the writes out of order no matter what the writer does.


So the memory barrier ensures that neither the compiler nor the processor can re-order the stores to memory. But you are saying that the *reader* must also put in a memory barrier, otherwise it might see the stores out of order. It does not make sense to me, how does the reader see a half-initialized object, when the only way it sees anything is when it's stored to main memory *and* the stores are done in order?

Short answer: because reads can be done out of order and/or served from cached values. The reader must do its part of the job as well. You may want to read this series of articles, they are very interesting: http://preshing.com/20120913/acquire-and-release-semantics
May 07 2013
prev sibling next sibling parent "Max Samukha" <maxsamukha gmail.com> writes:
On Monday, 6 May 2013 at 17:58:19 UTC, Walter Bright wrote:
 On 5/6/2013 6:14 AM, Max Samukha wrote:
 FWIW, I played with a generalized form of this pattern long 
 ago, something like
 (typing from memory):

And, that's the classic double checked locking bug!

D man is the bug! I simply failed to insert a line that flags the thread-local guard. Sorry for that. The Nullable thing was an impromptu to avoid ugly specializations I used for nullable and non-nullable types in my original implementation. Note that the Nullable is not phobos Nullable - the latter incurs unnecessary overhead for types that are already nullable. Maybe, the use of Nullable is an overkill and a __gshared boolean (redundant in case of a nullable type) would suffice. David mentioned the trick on this NG years ago. It is well known, understood and would be rarely needed if D could properly do eager initialization of global state. :P Anyway, I popped up mainly to point out that the pattern should not be limited to classes.
May 24 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Max Samukha:

 Note that the Nullable is not phobos Nullable -
 the latter incurs unnecessary overhead for types that are 
 already nullable.

In Bugzilla I have suggested some improvements for Nullable, but in Phobos there is already an alternative Nullable that avoids that overhead: struct Nullable(T, T nullValue); Bye, bearophile
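For illustration, the sentinel-based form mentioned above is used roughly like this (a sketch against std.typecons; the alias name is made up):

```d
import std.typecons;

// The two-parameter Nullable uses a sentinel value to encode "null",
// so no extra boolean is stored alongside the payload.
alias MaybeInt = Nullable!(int, int.min);

void demo()
{
    MaybeInt n;
    assert(n.isNull);    // default-initialized to the int.min sentinel

    n = 7;
    assert(!n.isNull);
    assert(n.get == 7);
}
```

The trade-off is that the sentinel value itself can no longer be stored as a real payload, which is exactly the distinction debated below for already-nullable types like int*.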
May 24 2013
prev sibling next sibling parent "Max Samukha" <maxsamukha gmail.com> writes:
On Friday, 24 May 2013 at 13:05:36 UTC, bearophile wrote:
 Max Samukha:

 Note that the Nullable is not phobos Nullable -
 the latter incurs unnecessary overhead for types that are 
 already nullable.

In Bugzilla I have suggested some improvements for Nullable, but in Phobos there is already an alternative Nullable that avoids that overhead: struct Nullable(T, T nullValue); Bye, bearophile

The question is what should be the result of: Nullable(int*)? Nullable!(Nullable!T)? Forbidden (like in C#)? Aliased to the source type? A distinct type?
May 24 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Max Samukha:

 The question is what should be the result of:

 Nullable(int*)?
 Nullable!(Nullable!T)?

 Forbidden (like in C#)? Aliased to the source type? A distinct 
 type?

In D Nullable!(int*) needs to be a different type, because the interface is different between a pointer and a Nullable. But often you want to use something like Nullable(int*, null) or Nullable(int*, cast(int*)null) instead. In the second case maybe it can be collapsed into a single Nullable, but such behavior needs to be documented. Bye, bearophile
May 24 2013
prev sibling next sibling parent "Max Samukha" <maxsamukha gmail.com> writes:
On Friday, 24 May 2013 at 14:03:24 UTC, bearophile wrote:
 Max Samukha:

 The question is what should be the result of:

 Nullable(int*)?
 Nullable!(Nullable!T)?

 Forbidden (like in C#)? Aliased to the source type? A distinct 
 type?

In D Nullable!(int*) needs to be a different type, because the interface is different between a pointer and a Nullable.

It could be normalized by free functions isNull(T)(T t), nullableValue(T)(..) etc. Isn't that what was done to arrays so they could support the range interface?
 But often you want to use something like Nullable(int*, null) 
 or Nullable(int*, cast(int*)null) instead.

Do you have an example? In what cases is the distinction between a null null int* and non-null null int* necessary?
 In the second case maybe it can be collapsed into a single 
 Nullable, but such behavior needs to be documented.

Ok.
 Bye,
 bearophile

May 24 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Max Samukha:

 It could be normalized by free functions isNull(T)(T t), 
 nullableValue(T)(..) etc.

In Haskell/Scala if you program idiomatically you usually use HOFs as map, filter, etc that work "transparently" on nullables or collection of nullables, because Maybe is a monad. Example: you have a range of nullables, and you map a function on it, using a special map that ignores the null items. (This is one of the improvements I have suggested in Bugzilla.)
 But often you want to use something like Nullable(int*, null) 
 or Nullable(int*, cast(int*)null) instead.

Do you have an example? In what cases is the distinction between a null null int* and non-null null int* necessary?

If you use a Nullable!(int*, null) type that distinction vanishes, there is only one null. Bye, bearophile
May 24 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Friday, 24 May 2013 at 13:49:14 UTC, Max Samukha wrote:
 On Friday, 24 May 2013 at 13:05:36 UTC, bearophile wrote:
 Max Samukha:

 Note that the Nullable is not phobos Nullable -
 the latter incurs unnecessary overhead for types that are 
 already nullable.

In Bugzilla I have suggested some improvements for Nullable, but in Phobos there is already an alternative Nullable that avoids that overhead: struct Nullable(T, T nullValue); Bye, bearophile

The question is what should be the result of: Nullable(int*)?

New type, but no overhead.
 Nullable!(Nullable!T)?

New type, with an extra boolean.
May 24 2013
prev sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Tuesday, 7 May 2013 at 20:17:43 UTC, QAston wrote:
 No. A tutorial on memory consistency models would be too long 
 to insert here. I don't know of a good online resource, does 
 anyone?

 Andrei

This was very helpful for me - focuses much more on the memory model itself than the c++11 part. http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2

This was awesome/amazing.
May 25 2013