
digitalmars.D - Low-Lock Singletons In D

reply "dsimcha" <dsimcha yahoo.com> writes:
On the advice of Walter and Andrei, I've written a blog article 
about the low-lock Singleton pattern in D.  This is a previously 
obscure pattern that uses thread-local storage to make Singletons 
both thread-safe and efficient and was independently invented by 
at least me and Alexander Terekhov, an IBM researcher.  However, 
D's first-class treatment of thread-local storage means the time 
has come to move it out of obscurity and possibly make it the 
standard way to do Singletons.

Article:
http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

Reddit:
http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/
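For readers who want the shape of the pattern at a glance, here is a minimal sketch in C++ terms (my own translation with invented names; the article itself uses D). A thread-local flag lets each thread synchronize at most once; after that, get() is a TLS read plus a plain pointer read:

```cpp
#include <cassert>
#include <mutex>

// Hypothetical C++ sketch of the low-lock singleton: the thread-local
// flag means a thread takes the mutex at most once in its lifetime.
class Singleton {
public:
    static Singleton& get() {
        static thread_local bool seenByThisThread = false;
        if (!seenByThisThread) {
            // Slow path, entered once per thread: the lock both guards
            // construction and establishes happens-before for this thread.
            std::lock_guard<std::mutex> lock(mutex_);
            if (!instance_)
                instance_ = new Singleton;
            seenByThisThread = true;  // TLS write: not shared, no barrier needed
        }
        // Safe without synchronization: this thread already went through
        // the lock once, so the write to instance_ happens-before this read.
        return *instance_;
    }
private:
    Singleton() = default;
    static Singleton* instance_;
    static std::mutex mutex_;
};

Singleton* Singleton::instance_ = nullptr;
std::mutex Singleton::mutex_;
```

The design point is that the steady-state path touches only thread-local and already-synchronized memory, so it needs neither a lock nor a memory barrier.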
May 05 2013
next sibling parent "Joshua Niehus" <jm.niehus gmail.com> writes:
On Monday, 6 May 2013 at 02:35:33 UTC, dsimcha wrote:
 Article:
 http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

 Reddit:
 http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/

Excellent talk at the conf, solid blog: +1 and 1
May 05 2013
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/5/13 10:35 PM, dsimcha wrote:
 On the advice of Walter and Andrei, I've written a blog article about
 the low-lock Singleton pattern in D. This is a previously obscure
 pattern that uses thread-local storage to make Singletons both
 thread-safe and efficient and was independently invented by at least me
 and Alexander Terekhov, an IBM researcher. However, D's first-class
 treatment of thread-local storage means the time has come to move it out
 of obscurity and possibly make it the standard way to do Singletons.

 Article:
 http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

 Reddit:
 http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/

The reddit post is in need of upvotes to make it onto the /r/programming front page. I think the posting time wasn't the best: Saturday is the worst day to post anything (http://www.slideshare.net/babasave/the-hidden-secrets-of-successful-reddit-posts).

Anyhow, I've also posted to https://news.ycombinator.com/item?id=5660897

Vote up!

Andrei
May 05 2013
prev sibling next sibling parent reply "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 02:35:33 UTC, dsimcha wrote:
 that uses thread-local storage
 On DMD the overhead of TLS vs. unsafe is noticeable but small.  
 In both cases it pales in comparison to the overhead of 
 synchronizing on every call to get().

Hmm... So I just invented another method right now in like 10 minutes, which can completely avoid TLS altogether, and which I think might end up being faster.

Basically, my idea is that since modern CPUs (x86) are great at predicting predictable virtual calls, you can use that to your advantage as shown below.

The code below is C#, but easy enough to turn into D... I don't have the time to do it at the moment, but if you're interested give it a try and see how it compares to TLS:

    class Program
    {
        private interface IValue<T> { T Get(); }

        private class ActualValue<T> : IValue<T>
        {
            private T value;
            public T Get() { return this.value; }
        }

        private class NullValue<T> : IValue<T>
        {
            // This field is initialized on startup
            public static IValue<T> _static = new NullValue<T>();

            public T Get()
            {
                lock (this)
                {
                    _static = new ActualValue<T>();
                }
                // Insert memory barrier if you'd like
                return _static.Get();
            }
        }

        public static object Static
        { get { return NullValue<object>._static.Get(); } }

        static void Main(string[] args)
        {
            var a = Static;  // initializes on first use
            var b = Static;  // doesn't initialize anymore
        }
    }
May 06 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
06-May-2013 13:06, Mehrdad writes:
 On Monday, 6 May 2013 at 02:35:33 UTC, dsimcha wrote:
 that uses thread-local storage
 On DMD the overhead of TLS vs. unsafe is noticeable but small. In both
 cases it pales in comparison to the overhead of synchronizing on every
 call to get().

 Hmm... So I just invented another method right now in like 10 minutes,
 which can completely avoid TLS altogether, and which I think might end
 up being faster.

 Basically, my idea is that since modern CPUs (x86) are great at
 predicting predictable virtual calls, you can use that to your
 advantage as shown below. The code below is C#, but easy enough to
 turn into D... I don't have the time to do it at the moment but if
 you're interested give it a try and see how it compares to TLS:

 class Program
 {
     private interface IValue<T> { T Get(); }

     private class ActualValue<T> : IValue<T>
     {
         private T value;
         public T Get() { return this.value; }
     }

     private class NullValue<T> : IValue<T>
     {
         // This field is initialized on startup
         public static IValue<T> _static = new NullValue<T>();

And that field is still shared... it doesn't matter if the null was replaced by NullValue and an if-branch with indirect call. You have to read a field to know what to do next, and the other processor may as well write to it.
          public T Get()
          {
              lock (this)
              {

                  _static = new ActualValue<T>();

Who told you that the processor will immediately see the fully constructed value of ActualValue upon assignment? Barriers and other minor forms of black magic are still required on each access (e.g. atomic reads and atomic writes), which isn't the case with the TLS flag as discussed.
              }
              // Insert memory barrier if you'd like
              return _static.Get();
          }
      }

      public static object Static
      { get { return NullValue<object>._static.Get(); } }


      static void Main(string[] args)
      {
          var a = Static;  // initializes on first use
          var b = Static;  // doesn't initialize anymore
      }
 }

-- Dmitry Olshansky
May 06 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
06-May-2013 14:06, Mehrdad writes:
 On Monday, 6 May 2013 at 09:35:59 UTC, Dmitry Olshansky wrote:
 You have to read a field to know what to do next, and the other
 processor may as well write to it.

That never happens, though. _static is only written to inside a lock,

Yes, but the 1st processor may read _static exactly when the 2nd is inside the lock and writing to that field. Then chances are it will read whatever partial state there is written.
 and the check is inside a lock, hence the other processor can't be
 writing to it when it's being read...

Surely it can. The 1st processor takes the lock and writes while the second one reads static_ to call Get.
 Maybe I misunderstood what you're saying?

In pseudo-code what you do is this:

    read static_              // 1st processor may be reading here
    if static_ is NullValue   // that is an indirect call
        lock (static_) {
            check again
            write static_     // while the 2nd one writes here
        }
    else
        call static_ --> Get

I claim that 'read static_' had better be "protected" against that 'write static_' inside the lock. One way is to ensure the write is atomic w.r.t. that particular read operation, e.g. by hardware acquire-release, or via both reads and writes being atomic. On x86, word-sized reads/writes are atomic anyway (iff aligned, apparently).

The neat thing about that TLS pattern is that you need not rely on hardware-specific support.

-- 
Dmitry Olshansky
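In C++11 atomics the protection described above looks like this (an illustrative sketch with invented names, not code from the thread): the unsynchronized read becomes an acquire load that pairs with a release store made under the lock.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

struct Widget { int value; Widget() : value(42) {} };

std::atomic<Widget*> g_instance{nullptr};  // the shared "static_" slot
std::mutex g_mutex;

// Classic *fixed* double-checked locking.
Widget* get() {
    // Acquire load: pairs with the release store below, so if we see a
    // non-null pointer we also see the fully constructed Widget.
    Widget* p = g_instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(g_mutex);
        p = g_instance.load(std::memory_order_relaxed);  // re-check under the lock
        if (p == nullptr) {
            p = new Widget;
            // Release store: the constructor's writes cannot be reordered
            // past this publication of the pointer.
            g_instance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```

Every fast-path call still pays for the acquire load, which is exactly the cost the TLS pattern avoids.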
May 06 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
06-May-2013 22:23, Mehrdad writes:
 On Monday, 6 May 2013 at 11:21:37 UTC, Dmitry Olshansky wrote:
 Yes, but the 1st processor may read _static exactly when the 2nd is
 inside the lock and writing to that field. Then chances are it will
 read whatever partial state there is written.
 Surely it can. 1st processor takes the lock and write while the second
 one reads static_ to call Get.

It's a single pointer, there is no partial state -- it's either written or it isn't.

True... e.g. on x86, if that word is aligned. But the core of the problem is not only that word; it's rather the fact that the processor (and compiler, and who not) is free to re-order read/write ops inside that locked region. Potentially it could lead to epically nasty things.

    struct Static {
        int value;
        Static(int v) { value = v; }
    }

Now suppose:

    lock (_static) {
        _static = new Static(42); // or smth like that
    }

is compiled down to this "C--" code (quite realistically):

    lock _static_mutex;
    x = alloc int;
    x[0] = 42;
    static_ = x;
    unlock _static_mutex;

And now the compiler/CPU decides to optimize/execute it out of order (again, it's an illustration) as:

    lock _static_mutex;
    x = alloc int;
    // even if that's atomic
    static_ = x;
    // BOOM! somebody not locking the mutex may already
    // see static_ in a "half-baked" state
    x[0] = 42;
    unlock _static_mutex;

Observe that:

a) It can happen, and nothing prevents such a scenario. In fact the compiler should feel free to optimize inside the locked section, just not across it.
b) The chance that it happens is non-zero and rises with the complexity of the construction.
c) It depends on the kind of hardware used (OoO vs non-OoO).

So it would work more reliably on old Atoms, if that is of any comfort ;). That is, if your compiler isn't equally smart and doing nasty things behind your back ("Sorry, I just scheduled instructions optimally...").
 One way is to ensure write is atomic w.r.t. that particular read
 operation

Are't pointer writes always atomic?

In short - no. Even not counting the world of legal re-ordering, unaligned writes have this nasty habit of partially-written stuff being visible at the wrong time. Truth be told, relying on this kind of "not ever likely to go wrong" behavior is the very reason explicit atomics are encouraged even if they are NOPs on your current arch. That, or the heavier stuff - barriers.

Speaking of barriers - that is exactly the kind of thing that would disallow moving the write x[0] = 42 past the "line" of static_ = x. Simply put, barriers act as a line of "no reordering across this point" for specific operations (depending on the barrier type, or for all of them). Needless to say, this hurts performance.

-- 
Dmitry Olshansky
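Made concrete with C++11 fences (a sketch with invented names): the release barrier sits between constructing the object and publishing the pointer, and the reader needs a matching acquire barrier before using what it read.

```cpp
#include <atomic>
#include <cassert>

int storage[1];
std::atomic<int*> static_{nullptr};

void publish() {
    int* x = storage;
    x[0] = 42;  // construct the object first...
    // Release fence: no preceding write may be reordered past the
    // publication of the pointer below.
    std::atomic_thread_fence(std::memory_order_release);
    static_.store(x, std::memory_order_relaxed);  // ...then publish
}

int* consume() {
    int* x = static_.load(std::memory_order_relaxed);
    // Acquire fence: no following read may be reordered above the load,
    // so a non-null x implies x[0] is already 42.
    std::atomic_thread_fence(std::memory_order_acquire);
    return x;
}
```

Both fences are required; a writer-side barrier alone does not stop the reader's CPU or compiler from reordering its own accesses.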
May 06 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-May-2013 10:47, Mehrdad writes:
 On Monday, 6 May 2013 at 18:56:08 UTC, Dmitry Olshansky wrote:


 Thanks for the detailed explanation!


 And now compiler/CPU decides to optimize/execute out of order (again,
 it's an illustration) it as:

 lock _static_mutex;
 x = alloc int;
 //even if that's atomic
 static_ = x;
 // BOOM! somebody not locking mutex may already
 // see static_ in "half-baked" state
 x[0] = 42;
 unlock _static_mutex;

That's exactly the same as the classic double-checked lock bug, right?

Yeah, and that was my point to begin with - your method doesn't bring anything new. It's the same as the one with null and an 'if-null check', with the same issues, and requires atomics or barriers.
 As I wrote in my original code -- and as you also mentioned yourself --
 isn't it trivially fixed with a memory barrier?

 Like maybe replacing

      _static = new ActualValue<T>();

 with

      var value = new ActualValue<T>();
      _ReadWriteBarrier();
      _static = value;



 Wouldn't this make it correct?

It would, but then it's the same as the old fixed double-checked locking. Barriers hurt the performance we were after to begin with.

Now it would be interesting to measure the speed of this TLS low-lock vs. an atomic load/memory barrier + mutex. That measurement is absent from the blog post, but Andrei claims a memory barrier on each access is too slow.

-- 
Dmitry Olshansky
May 07 2013
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/6/2013 2:06 AM, Mehrdad wrote:
 So I just invented another method right now in like 10 minutes, which can
 completely avoid TLS altogether, and which I think might end up being faster.

I did that too. It's trickier than it appears.
May 06 2013
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/6/13 5:06 AM, Mehrdad wrote:
 On Monday, 6 May 2013 at 02:35:33 UTC, dsimcha wrote:
 that uses thread-local storage
 On DMD the overhead of TLS vs. unsafe is noticeable but small. In both
 cases it pales in comparison to the overhead of synchronizing on every
 call to get().

Hmm... So I just invented another method right now in like 10 minutes, which can completely avoid TLS altogether, and which I think might end up being faster.

It's well known. Needs a memory barrier on each access, so it's slower. Andrei
May 06 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/6/13 2:25 PM, Mehrdad wrote:
 On Monday, 6 May 2013 at 13:33:54 UTC, Andrei Alexandrescu wrote:
 It's well known. Needs a memory barrier on each access, so it's slower.

Hmm, are you saying I'm missing a memory barrier I should have written, or are you saying I already have a memory barrier which I'm not seeing? The only memory barrier I have is during initialization, and after that only read operations occur.

Any concurrent operation (in this case a read from one thread and a write from another) requires a handshake between threads, most often in the form of a release write coupled with an acquire read. Whenever the handshake is absent but concurrent operations on shared memory do occur, the code is broken. The beauty of the TLS-based pattern is that in the steady state there's no need for a shared read and handshake.

Andrei
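The handshake in its smallest form (a C++11 sketch, hypothetical names): the writer's release store pairs with the reader's acquire load, and only that pairing makes the data write visible before the flag.

```cpp
#include <atomic>
#include <cassert>

int payload = 0;                 // plain shared data
std::atomic<bool> ready{false};  // the flag both sides handshake on

void writer() {
    payload = 42;                                  // prepare the data...
    ready.store(true, std::memory_order_release);  // ...then release-write the flag
}

int reader() {
    // Acquire read: once the flag is seen true, the write to payload
    // is guaranteed visible. With a relaxed load, it would not be.
    while (!ready.load(std::memory_order_acquire)) {}
    return payload;
}
```

Run writer() and reader() on two threads and reader() returns 42; drop the acquire side and the guarantee is gone even though the writer did everything right.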
May 06 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/7/13 2:50 AM, Mehrdad wrote:
 On Monday, 6 May 2013 at 18:46:56 UTC, Andrei Alexandrescu wrote:
 Any concurrent operation (in this case read from one thread and write
 from another) requires a handshake between threads, most often in the
 form of an release write coupled with an acquire read. Whenever the
 handshake is absent but concurrent operations on shared memory do
 occur, the code is broken. The beauty of the TLS-based pattern is that
 in the steady state there's no need for a shared read and handshake.

 Andrei

Hmm, are you referring to the same lack of a barrier that the others are also referring to? As far as I can see, there shouldn't be a need for any other handshake in this example. As long as the object is fully initialized before _static is written to (easy enough with just a memory barrier), there is no penalty for subsequent reads whatsoever. Right?

No. A tutorial on memory consistency models would be too long to insert here. I don't know of a good online resource, does anyone? Andrei
May 07 2013
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/7/13 10:31 AM, Steven Schveighoffer wrote:
 On Tue, 07 May 2013 09:25:36 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 No. A tutorial on memory consistency models would be too long to
 insert here. I don't know of a good online resource, does anyone?

In essence, a read requires an acquire memory barrier, a write requires a release memory barrier, but in this case, we only need to be concerned if the value we get back is not valid (i.e. NullValue). Once in steady state, there is no need to acquire (as long as the write is atomic, the read value will either be NullValue or ActualValue, not something else).

There's always a need to acquire so as to figure out whether the steady state has been entered.

Andrei
May 07 2013
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/7/13 12:30 PM, deadalnix wrote:
 On Tuesday, 7 May 2013 at 16:14:50 UTC, Steven Schveighoffer wrote:
 Not really. Whether it is entered or not is dictated by the vtable.
 Even classic double-check locking doesn't need an acquire outside the
 lock. Even if your CPU's view of the variable is outdated, the check
 after the memory barrier inside the lock only occurs once. After that,
 steady state is achieved. All subsequent reads need no memory
 barriers, because the singleton object will never change after that.

 The only thing we need to guard against is non-atomic writes, and out
 of order writes of the static variable (fixed with a memory barrier).
 Instruction ordering OUTSIDE the lock is irrelevant, because if we
 don't get the "steady state" value (not null), then we go into the
 lock to perform the careful initialization with barriers.

 I think aligned native word writes are atomic, so we don't have to
 worry about that.

That is incorrect as the thread not going into the lock can see a partially initialized object.

Correct. Andrei
May 07 2013
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/7/13 12:46 PM, Steven Schveighoffer wrote:
 On Tue, 07 May 2013 12:30:05 -0400, deadalnix <deadalnix gmail.com> wrote:

 On Tuesday, 7 May 2013 at 16:14:50 UTC, Steven Schveighoffer wrote:
 Not really. Whether it is entered or not is dictated by the vtable.
 Even classic double-check locking doesn't need an acquire outside the
 lock. Even if your CPU's view of the variable is outdated, the check
 after the memory barrier inside the lock only occurs once. After
 that, steady state is achieved. All subsequent reads need no memory
 barriers, because the singleton object will never change after that.

 The only thing we need to guard against is non-atomic writes, and out
 of order writes of the static variable (fixed with a memory barrier).
 Instruction ordering OUTSIDE the lock is irrelevant, because if we
 don't get the "steady state" value (not null), then we go into the
 lock to perform the careful initialization with barriers.

 I think aligned native word writes are atomic, so we don't have to
 worry about that.

That is incorrect as the thread not going into the lock can see a partially initialized object.

The memory barrier prevents that. You don't store the variable until the object is initialized. That is the whole point.

A memory barrier is not a one-way thing, i.e. not only the writer must do it. Any operation on shared memory is a handshake between the writer and the reader. If the reader doesn't do its bit, it can see the writes out of order no matter what the writer does. Andrei
May 07 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-May-2013 23:49, Andrei Alexandrescu writes:
 On 5/7/13 12:46 PM, Steven Schveighoffer wrote:
 On Tue, 07 May 2013 12:30:05 -0400, deadalnix <deadalnix gmail.com>
 wrote:


 That is incorrect as the thread not going into the lock can see a
 partially initialized object.

The memory barrier prevents that. You don't store the variable until the object is initialized. That is the whole point.

A memory barrier is not a one-way thing, i.e. not only the writer must do it. Any operation on shared memory is a handshake between the writer and the reader. If the reader doesn't do its bit, it can see the writes out of order no matter what the writer does.

Returning to the confusing point. On x86 things are actually muddied by stronger-than-required hardware guarantees. Only because of this is there no need for an explicit read barrier (nor does x86 have one) in this case. Still, the read operation has to be marked specifically (volatile, asm block, whatever) to ensure the _compiler_ does the right thing (no reordering across it).

-- 
Dmitry Olshansky
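A sketch of what "marking the read" for the compiler can look like in C++ (invented names; the hardware fence is absent on purpose, since on x86 only the compiler needs restraining here):

```cpp
#include <atomic>

int value = 42;
int* static_ = nullptr;  // the published pointer

int* read_published() {
    // Read through a volatile view so the compiler must emit a real load
    // rather than cache the pointer in a register.
    int* p = *const_cast<int* volatile*>(&static_);
    // atomic_signal_fence is a compiler-only barrier: it generates no CPU
    // instruction, but forbids the compiler from moving later memory
    // accesses above this point.
    std::atomic_signal_fence(std::memory_order_acquire);
    return p;
}
```

On weaker architectures this is not enough; there std::atomic_thread_fence (a real hardware barrier) or an atomic acquire load would be required instead.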
May 07 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/7/13 6:12 PM, Steven Schveighoffer wrote:
 So the memory barrier ensures that neither the compiler nor the
 processor can re-order the stores to memory.

 But you are saying that the *reader* must also put in a memory barrier,
 otherwise it might see the stores out of order.

Yes. (One detail is there are several kinds of barriers, such as acquire and release.)
 It does not make sense to me, how does the reader see a half-initialized
 object, when the only way it sees anything is when it's stored to main
 memory *and* the stores are done in order?

There are several actors involved: two processors, each with its own cache, and the main memory. There are several cache coherency protocols, which reflect in different programming models.
 So that must not be what it is doing. What it must be doing is storing
 out of order, BUT placing a prevention mechanism from reading the memory
 until the "release" is done? Kind of like a minuscule mutex lock. So
 while it is out-of-order writing the object data, it holds a lock on the
 reference pointer to that data, so anyone using acquire cannot read it yet?

 That actually makes sense to me. Is that the case?

Not at all. A memory barrier dictates operation ordering. It doesn't do interlocking. One of Herb's slides shows very nicely how memory operations can't be moved ahead of an acquire and past a release.
 Returning to the confusing point.

 On x86 things are actually muddied by stronger then required hardware
 guarantees. And only because of this there no need for explicit read
 barrier (nor x86 have one) in this case. Still the read operation has
 to be marked specifically (volatile, asm block, whatever) to ensure
 the _compiler_ does the right thing (no reordering of across it).

I would think the model Mehrdad used would be sufficient to prevent caching of the data, since it is a virtual, no?

No. Andrei
May 07 2013
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/7/13 5:57 PM, Mehrdad wrote:
 On Tuesday, 7 May 2013 at 19:49:30 UTC, Andrei Alexandrescu wrote:
 A memory barrier is not a one-way thing, i.e. not only the writer must
 do it. Any operation on shared memory is a handshake between the
 writer and the reader. If the reader doesn't do its bit, it can see
 the writes out of order no matter what the writer does.

 Andrei

Andrew, I still don't understand: The writer is ensuring that writes to memory are happening _after_ the object is initialized and _before_ the reference to the old object is modified, via a memory barrier.

The writer is only half of the equation. The reader has its own cache to worry about and its own loading order.
 Unless you're claiming that a memory barrier _doesn't_ do what it's
 supposed to (i.e., the memory module is executing writes out-of-order
 even though the processor is issuing them in the correct order), there
 is no way for _anyone_ to see a partially initialized object anywhere...

I'm not claiming, I'm destroying :o). There is. I know it's confusing. You may want to peruse the reading materials linked by others. Andrei
May 07 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/7/2013 6:25 AM, Andrei Alexandrescu wrote:
 No. A tutorial on memory consistency models would be too long to insert here. I
 don't know of a good online resource, does anyone?

Here's one written by this Andrei Alexandrescu fellow, I hear he knows what he's talking about: http://erdani.com/publications/DDJ_Jul_Aug_2004_revised.pdf
May 07 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/7/13 1:31 PM, Walter Bright wrote:
 On 5/7/2013 6:25 AM, Andrei Alexandrescu wrote:
 No. A tutorial on memory consistency models would be too long to
 insert here. I
 don't know of a good online resource, does anyone?

Here's one written by this Andrei Alexandrescu fellow, I hear he knows what he's talking about: http://erdani.com/publications/DDJ_Jul_Aug_2004_revised.pdf

That doesn't get into memory models. Andrei
May 07 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-May-2013 17:25, Andrei Alexandrescu writes:
 On 5/7/13 2:50 AM, Mehrdad wrote:
 On Monday, 6 May 2013 at 18:46:56 UTC, Andrei Alexandrescu wrote:
 Any concurrent operation (in this case read from one thread and write
 from another) requires a handshake between threads, most often in the
 form of an release write coupled with an acquire read. Whenever the
 handshake is absent but concurrent operations on shared memory do
 occur, the code is broken. The beauty of the TLS-based pattern is that
 in the steady state there's no need for a shared read and handshake.

 Andrei

Hmm, are you referring to the same lack of a barrier that the others are also referring to? As far as I can see, there shouldn't be a need for any other handshake in this example. As long as the object is fully initialized before _static is written to (easy enough with just a memory barrier), there is no penalty for subsequent reads whatsoever. Right?

No. A tutorial on memory consistency models would be too long to insert here. I don't know of a good online resource, does anyone?

Sutter's Mill is a good starting point even though it's C++-biased. Two recent insightful talks on the C++11 memory model with down-and-dirty details: http://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/

-- 
Dmitry Olshansky
May 07 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 09:06:55 UTC, Mehrdad wrote:
 lock (this)
 {
 	_static = new ActualValue<T>();
 }

Oops I forgot the null check inside, don't forget that.
May 06 2013
prev sibling next sibling parent "Sergei Nosov" <sergei.nosov gmail.com> writes:
On Monday, 6 May 2013 at 09:11:00 UTC, Mehrdad wrote:
 On Monday, 6 May 2013 at 09:06:55 UTC, Mehrdad wrote:
 lock (this)
 {
 	_static = new ActualValue<T>();
 }

Oops I forgot the null check inside, don't forget that.

All that double-checking stuff is trying to avoid calling "lock" for the entire Get function body. Your solution does exactly that, so it's no good.
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 09:30:24 UTC, Sergei Nosov wrote:
 On Monday, 6 May 2013 at 09:11:00 UTC, Mehrdad wrote:
 On Monday, 6 May 2013 at 09:06:55 UTC, Mehrdad wrote:
 lock (this)
 {
 	_static = new ActualValue<T>();
 }

Oops I forgot the null check inside, don't forget that.

All that double-checking stuff is trying to avoid calling "lock" for entire Get function body. Your solution does exactly that. So it's no good.

There are _two_ Get() functions. Only one of them calls lock(); once the field is initialized, that Get() is no longer called, and the other one is called instead. Unless I'm missing something?
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 09:35:59 UTC, Dmitry Olshansky wrote:
 Who told you that the processor will immediately see the fully 
 constructed value of ActualValue upon assignment? Barriers and 
 other minor forms of black magic are still required

Isn't that why I wrote this?
             // Insert memory barrier if you'd like


May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 09:35:59 UTC, Dmitry Olshansky wrote:
 You have to read a field to know what to do next, and the other 
 processor may as well write to it.

That never happens, though. _static is only written to inside a lock, and the check is inside a lock, hence the other processor can't be writing to it when it's being read... Maybe I misunderstood what you're saying?
May 06 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Monday, 6 May 2013 at 10:06:50 UTC, Mehrdad wrote:
 On Monday, 6 May 2013 at 09:35:59 UTC, Dmitry Olshansky wrote:
 You have to read a field to know what to do next, and the 
 other processor may as well write to it.

That never happens, though. _static is only written to inside a lock, and the check is inside a lock, hence the other processor can't be writing to it when it's being read... Maybe I misunderstood what you're saying?

On x86 you rarely see this kind of thing, as the memory model is strong. You can see some weird things in rare cases, but as soon as you leave the x86 world, it isn't even worth trying.
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 10:12:53 UTC, deadalnix wrote:
 On Monday, 6 May 2013 at 10:06:50 UTC, Mehrdad wrote:
 On Monday, 6 May 2013 at 09:35:59 UTC, Dmitry Olshansky wrote:
 You have to read a field to know what to do next, and the 
 other processor may as well write to it.

That never happens, though. _static is only written to inside a lock, and the check is inside a lock, hence the other processor can't be writing to it when it's being read... Maybe I misunderstood what you're saying?

ON x86 you rarely see this kind of thing as the memory model is strong. You can see some weird thing in rare cases, but as soon as you leave x86 world, it isn't even worth trying.

Ah yeah, this entire thing was geared toward x86. On second thought, it probably doesn't improve anything compared to the double-checked lock, since branch prediction for an 'if' is most likely faster than prediction for a virtual call anyway. It was worth trying, I guess...
May 06 2013
prev sibling next sibling parent reply "Max Samukha" <maxsamukha gmail.com> writes:
On Monday, 6 May 2013 at 02:35:33 UTC, dsimcha wrote:
 On the advice of Walter and Andrei, I've written a blog article 
 about the low-lock Singleton pattern in D.  This is a 
 previously obscure pattern that uses thread-local storage to 
 make Singletons both thread-safe and efficient and was 
 independently invented by at least me and Alexander Terekhov, 
 an IBM researcher.  However, D's first-class treatment of 
 thread-local storage means the time has come to move it out of 
 obscurity and possibly make it the standard way to do 
 Singletons.

 Article:
 http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

 Reddit:
 http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/

FWIW, I played with a generalized form of this pattern long ago, something like (typing from memory):

    template sharedGlobal(alias ctor, alias lock = globalLock)
    {
        @property sharedGlobal()
        {
            alias ReturnType!ctor T;
            __gshared static Nullable!T t;
            static bool instantiated;
            if (!instantiated)
            {
                synchronized (lock)
                {
                    if (t is null)
                        t = ctor();
                }
            }
            return cast(T) t;
        }
    }

    alias sharedGlobal!({ return new Blah(); }) blah;

It should have been part of QtD but that has never happened.
May 06 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/6/2013 6:14 AM, Max Samukha wrote:
 FWIW, I played with a generalized form of this pattern long ago, something like
 (typing from memory):

And, that's the classic double checked locking bug!
May 06 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 06 May 2013 13:58:19 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 5/6/2013 6:14 AM, Max Samukha wrote:
 FWIW, I played with a generalized form of this pattern long ago,  
 something like
 (typing from memory):

And, that's the classic double checked locking bug!

I think that's exactly what David had in his slide (well, except for an obvious typo). He's only checking t once, and protecting the lock/check with a TLS boolean. Although the set of the boolean is missing (probably a memory error, since he typed it from memory :) -Steve
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 11:21:37 UTC, Dmitry Olshansky wrote:
 Yes, but the 1st processor may read _static exactly when the 
 2nd is inside the lock and writing to that field. Then chances 
 are it will read whatever partial state there is written.
 Surely it can. 1st processor takes the lock and write while the 
 second one reads static_ to call Get.

It's a single pointer, there is no partial state -- it's either written or it isn't.
 One way is to ensure write is atomic w.r.t. that particular 
 read operation

Are't pointer writes always atomic?
 The neat stuff about that TLS pattern was that you need not to 
 rely on hardware specific support.

Ah okay I see, didn't realize that.
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 13:33:54 UTC, Andrei Alexandrescu wrote:
 It's well known. Needs a memory barrier on each access, so it's 
 slower.

Hmm, are you saying I'm missing a memory barrier I should have written, or are you saying I already have a memory barrier which I'm not seeing? The only memory barrier I have is during initialization, and after that only read operations occur.
May 06 2013
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Monday, 6 May 2013 at 18:23:51 UTC, Mehrdad wrote:
 One way is to ensure write is atomic w.r.t. that particular 
 read operation


No. But *aligned* word-sized writes *on x86* are. David
May 06 2013
prev sibling next sibling parent "Idan Arye" <GenericNPC gmail.com> writes:
On Monday, 6 May 2013 at 02:35:33 UTC, dsimcha wrote:
 On the advice of Walter and Andrei, I've written a blog article 
 about the low-lock Singleton pattern in D.  This is a 
 previously obscure pattern that uses thread-local storage to 
 make Singletons both thread-safe and efficient and was 
 independently invented by at least me and Alexander Terekhov, 
 an IBM researcher.  However, D's first-class treatment of 
 thread-local storage means the time has come to move it out of 
 obscurity and possibly make it the standard way to do 
 Singletons.

 Article:
 http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

 Reddit:
 http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/

Thanks! I want to make a module with template mixins that implement some common idioms - singleton being one of them(http://forum.dlang.org/thread/fofbrlqruxbevnxchxdp forum.dlang.org). I'm going to use your version for implementing the singleton idiom.
May 06 2013
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 5/6/13, dsimcha <dsimcha yahoo.com> wrote:
 On the advice of Walter and Andrei, I've written a blog article
 about the low-lock Singleton pattern in D.

Personally this covers it for me:

final abstract class singleton
{
static:
    void foo() { }
    int f;
}

void main()
{
    singleton.foo();
    singleton.f = 1;
}

Never needed anything more complex than that. :p
May 06 2013
prev sibling next sibling parent "Diggory" <diggsey googlemail.com> writes:
On Monday, 6 May 2013 at 21:13:24 UTC, Andrej Mitrovic wrote:
 On 5/6/13, dsimcha <dsimcha yahoo.com> wrote:
 On the advice of Walter and Andrei, I've written a blog article
 about the low-lock Singleton pattern in D.

Personally this covers it for me:

final abstract class singleton
{
static:
    void foo() { }
    int f;
}

void main()
{
    singleton.foo();
    singleton.f = 1;
}

Never needed anything more complex than that. :p

Although that's only a single-per-thread-ton :P
May 06 2013
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 5/6/13, Diggory <diggsey googlemail.com> wrote:
 Although that's only a single-per-thread-ton :P

My bad.
 final abstract class singleton
 {
 __gshared:
     void foo() { }
     int f;
 }

May 06 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 06 May 2013 17:36:00 -0400, Andrej Mitrovic  
<andrej.mitrovich gmail.com> wrote:

 On 5/6/13, Diggory <diggsey googlemail.com> wrote:
 Although that's only a single-per-thread-ton :P

My bad.
 final abstract class singleton
 {
 __gshared:
     void foo() { }
     int f;
 }


The point that is missed is the lazy creation. Imagine if it's not just an int but a whole large resource, which takes a long time to create/open. If all we care about is access, and you *know* you are going to need it, you can just create it eagerly via shared static this(). That is not what the singleton pattern is for.

-Steve
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 18:51:12 UTC, David Nadlinger wrote:
 On Monday, 6 May 2013 at 18:23:51 UTC, Mehrdad wrote:
 One way is to ensure write is atomic w.r.t. that particular 
 read operation


No. But *aligned* word-sized writes *on x86* are. David

Right, but this was specifically meant for x86 (as I mentioned in the OP) and pointers are aligned by default. :)
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 18:56:08 UTC, Dmitry Olshansky wrote:


Thanks for the detailed explanation!


 And now compiler/CPU decides to optimize/execute out of order 
 (again, it's an illustration) it as:

 lock _static_mutex;
 x = alloc int;
 //even if that's atomic
 static_ = x;
 // BOOM! somebody not locking mutex may already
 // see static_ in "half-baked" state
 x[0] = 42;
 unlock _static_mutex;

That's exactly the same as the classic double-checked lock bug, right?

As I wrote in my original code -- and as you also mentioned yourself -- isn't it trivially fixed with a memory barrier? Like maybe replacing

    _static = new ActualValue<T>();

with

    var value = new ActualValue<T>();
    _ReadWriteBarrier();
    _static = value;

Wouldn't this make it correct?
 Aren't pointer writes always atomic?

In short - no. Even not counting the world of legal re-ordering, unaligned writes

But my example was completely aligned...
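For what it's worth, the acquire/release handshake being debated here can be written explicitly in D with core.atomic. This is an illustrative sketch, not code from the thread; the names Config and getConfig, and the use of the current MemoryOrder enum, are assumptions:

```d
import core.atomic;

class Config
{
    int value = 42;
}

// Shared slot holding the lazily created instance.
private shared Config instance_;

Config getConfig()
{
    // Acquire read: if we observe a non-null reference, we also
    // observe every store the publishing thread made before its
    // release store below.
    auto p = atomicLoad!(MemoryOrder.acq)(instance_);
    if (p is null)
    {
        synchronized // double-checked: re-test under the lock
        {
            p = atomicLoad!(MemoryOrder.acq)(instance_);
            if (p is null)
            {
                auto fresh = new shared Config;
                // Release write: the construction of 'fresh' cannot
                // be reordered past this publication.
                atomicStore!(MemoryOrder.rel)(instance_, fresh);
                p = fresh;
            }
        }
    }
    return cast(Config) p;
}
```

The release store pairs with the acquire load: a reader that sees the non-null reference is guaranteed to also see the completed construction, which is exactly what a plain pointer store does not promise on every architecture.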
May 06 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 6 May 2013 at 18:46:56 UTC, Andrei Alexandrescu wrote:
 Any concurrent operation (in this case read from one thread and 
 write from another) requires a handshake between threads, most 
 often in the form of an release write coupled with an acquire 
 read. Whenever the handshake is absent but concurrent 
 operations on shared memory do occur, the code is broken. The 
 beauty of the TLS-based pattern is that in the steady state 
 there's no need for a shared read and handshake.

 Andrei

Hmm, are you referring to the same lack of a barrier that the others are also referring to? As far as I can see, there shouldn't be a need for any other handshake in this example. As long as the object is fully initialized before _static is written to (easy enough with just a memory barrier), there is no penalty for subsequent reads whatsoever. Right?
May 06 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 07 May 2013 09:25:36 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 No. A tutorial on memory consistency models would be too long to insert  
 here. I don't know of a good online resource, does anyone?

In essence, a read requires an acquire memory barrier and a write requires a release memory barrier, but in this case we only need to be concerned if the value we get back is not valid (i.e. NullValue). Once in steady state, there is no need to acquire (as long as the write is atomic, the read value will either be NullValue or ActualValue, not something else).

The code in the NullValue case must be handled very carefully with the correct memory barriers (given that it should only execute once, just insert a full memory barrier). But that is not the steady state.

It's not a revolutionary design, it's basic double-checked locking (implemented in an unnecessarily complex way). It can be done right, but it's still really difficult to get right. The benefit of David's method is that it's REALLY easy to get right, and is memory-model independent.

-Steve
May 07 2013
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Tuesday, 7 May 2013 at 06:50:16 UTC, Mehrdad wrote:
 As far as I can see, there shouldn't be a need for any other 
 handshake in this example.

 As long as the object is fully initialized before _static is 
 written to (easy enough with just a memory barrier), there is 
 no penalty for subsequent reads whatsoever.

 Right?

The issue is that the write to _static might never appear on the other threads, thus leading to multiple instances being created - even though it is atomic in the sense that you never end up reading a pointer with e.g. only half of the bytes updated. David
May 07 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 07 May 2013 10:33:13 -0400, David Nadlinger <see klickverbot.at>  
wrote:

 On Tuesday, 7 May 2013 at 06:50:16 UTC, Mehrdad wrote:
 As far as I can see, there shouldn't be a need for any other handshake  
 in this example.

 As long as the object is fully initialized before _static is written to  
 (easy enough with just a memory barrier), there is no penalty for  
 subsequent reads whatsoever.

 Right?

The issue is that the write to _static might never appear on the other threads, thus leading to multiple instances being created - even though it is atomic in the sense that you never end up reading a pointer with e.g. only half of the bytes updated.

I don't think that is an issue. The write to _static is protected by a lock, which should present a consistent view of it inside the lock. Mehrdad corrected his version right away so that the value is checked again inside the lock. This essentially is classic double-checked locking, with one check being a virtual table lookup.

-Steve
May 07 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 07 May 2013 11:30:12 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 5/7/13 10:31 AM, Steven Schveighoffer wrote:
 On Tue, 07 May 2013 09:25:36 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 No. A tutorial on memory consistency models would be too long to
 insert here. I don't know of a good online resource, does anyone?

In essence, a read requires an acquire memory barrier, a write requires a release memory barrier, but in this case, we only need to be concerned if the value we get back is not valid (i.e. NullValue). Once in steady state, there is no need to acquire (as long as the write is atomic, the read value will either be NullValue or ActualValue, not something else).

There's always a need to acquire so as to figure whether the steady state has been entered.

Not really. Whether it is entered or not is dictated by the vtable. Even classic double-checked locking doesn't need an acquire outside the lock. Even if your CPU's view of the variable is outdated, the check after the memory barrier inside the lock only occurs once. After that, steady state is achieved. All subsequent reads need no memory barriers, because the singleton object will never change after that.

The only thing we need to guard against is non-atomic writes, and out-of-order writes of the static variable (fixed with a memory barrier). Instruction ordering OUTSIDE the lock is irrelevant, because if we don't get the "steady state" value (not null), then we go into the lock to perform the careful initialization with barriers. I think aligned native word writes are atomic, so we don't have to worry about that.

But I think we've spent enough time on this solution. Yes, double-checked locking can be done, but David's pattern is far easier to implement, understand, and explain. It comes at a small cost of checking a boolean before each access of the initialized data. His benchmarks show a very small performance penalty. And another LARGE benefit is that you don't have to pull out your obscure (possibly challenged) memory model book/blog post or the CPU spec to prove it :)

Hmm... you might be able to mitigate the penalty by storing the actual object reference instead of a bool in the _instantiated variable. Then a separate load is not required. David?

-Steve
May 07 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 7 May 2013 at 16:14:50 UTC, Steven Schveighoffer 
wrote:
 Not really.  Whether it is entered or not is dictated by the 
 vtable.  Even classic double-check locking doesn't need an 
 acquire outside the lock.  Even if your CPU's view of the 
 variable is outdated, the check after the memory barrier inside 
 the lock only occurs once.  After that, steady state is 
 achieved.  All subsequent reads need no memory barriers, 
 because the singleton object will never change after that.

 The only thing we need to guard against is non-atomic writes, 
 and out of order writes of the static variable (fixed with a 
 memory barrier).  Instruction ordering OUTSIDE the lock is 
 irrelevant, because if we don't get the "steady state" value 
 (not null), then we go into the lock to perform the careful 
 initialization with barriers.

 I think aligned native word writes are atomic, so we don't have 
 to worry about that.

That is incorrect as the thread not going into the lock can see a partially initialized object.
May 07 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 07 May 2013 12:30:05 -0400, deadalnix <deadalnix gmail.com> wrote:

 On Tuesday, 7 May 2013 at 16:14:50 UTC, Steven Schveighoffer wrote:
 Not really.  Whether it is entered or not is dictated by the vtable.   
 Even classic double-check locking doesn't need an acquire outside the  
 lock.  Even if your CPU's view of the variable is outdated, the check  
 after the memory barrier inside the lock only occurs once.  After that,  
 steady state is achieved.  All subsequent reads need no memory  
 barriers, because the singleton object will never change after that.

 The only thing we need to guard against is non-atomic writes, and out  
 of order writes of the static variable (fixed with a memory barrier).   
 Instruction ordering OUTSIDE the lock is irrelevant, because if we  
 don't get the "steady state" value (not null), then we go into the lock  
 to perform the careful initialization with barriers.

 I think aligned native word writes are atomic, so we don't have to  
 worry about that.

That is incorrect as the thread not going into the lock can see a partially initialized object.

The memory barrier prevents that. You don't store the variable until the object is initialized. That is the whole point. -Steve
May 07 2013
prev sibling next sibling parent "QAston" <qaston gmail.com> writes:
 No. A tutorial on memory consistency models would be too long 
 to insert here. I don't know of a good online resource, does 
 anyone?

 Andrei

This was very helpful for me - focuses much more on the memory model itself than the c++11 part. http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2
May 07 2013
prev sibling next sibling parent "QAston" <qaston gmail.com> writes:
On Tuesday, 7 May 2013 at 20:17:43 UTC, QAston wrote:
 No. A tutorial on memory consistency models would be too long 
 to insert here. I don't know of a good online resource, does 
 anyone?

 Andrei

This was very helpful for me - focuses much more on the memory model itself than the c++11 part. http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2

Nah, look at Dmitry's post, he was the first to post that :P
May 07 2013
prev sibling next sibling parent "Sean Kelly" <sean invisibleduck.org> writes:
On Tuesday, 7 May 2013 at 13:25:37 UTC, Andrei Alexandrescu wrote:
 A tutorial on memory consistency models would be too long to 
 insert here. I don't know of a good online resource, does 
 anyone?

The best explanation I've found of memory models is this paper: http://dl.acm.org/citation.cfm?id=325102 Which can be had for free (albeit in terrible print quality) here: http://eecs.vanderbilt.edu/courses/eece343/papers/p15-gharachorloo.pdf Kourosh Gharachorloo actually has a number of really good papers about this stuff. Here's another, if you're interested in the details of how things work: http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-9.pdf
May 07 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Tuesday, 7 May 2013 at 19:49:30 UTC, Andrei Alexandrescu wrote:
 A memory barrier is not a one-way thing, i.e. not only the 
 writer must do it. Any operation on shared memory is a 
 handshake between the writer and the reader. If the reader 
 doesn't do its bit, it can see the writes out of order no 
 matter what the writer does.

 Andrei

Andrew, I still don't understand: The writer is ensuring that writes to memory are happening _after_ the object is initialized and _before_ the reference to the old object is modified, via a memory barrier. Unless you're claiming that a memory barrier _doesn't_ do what it's supposed to (i.e., the memory module is executing writes out-of-order even though the processor is issuing them in the correct order), there is no way for _anyone_ to see a partially initialized object anywhere...
May 07 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 07 May 2013 16:02:08 -0400, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 07-May-2013 23:49, Andrei Alexandrescu writes:
 On 5/7/13 12:46 PM, Steven Schveighoffer wrote:
 On Tue, 07 May 2013 12:30:05 -0400, deadalnix <deadalnix gmail.com>
 wrote:

 That is incorrect as the thread not going into the lock can see a
 partially initialized object.

 The memory barrier prevents that. You don't store the variable until the
 object is initialized. That is the whole point.

 A memory barrier is not a one-way thing, i.e. not only the writer must
 do it. Any operation on shared memory is a handshake between the writer
 and the reader. If the reader doesn't do its bit, it can see the writes
 out of order no matter what the writer does.

So the memory barrier ensures that neither the compiler nor the processor can re-order the stores to memory.

But you are saying that the *reader* must also put in a memory barrier, otherwise it might see the stores out of order. It does not make sense to me, how does the reader see a half-initialized object, when the only way it sees anything is when it's stored to main memory *and* the stores are done in order?

So that must not be what it is doing. What it must be doing is storing out of order, BUT placing a prevention mechanism from reading the memory until the "release" is done? Kind of like a minuscule mutex lock. So while it is out-of-order writing the object data, it holds a lock on the reference pointer to that data, so anyone using acquire cannot read it yet?

That actually makes sense to me. Is that the case?

 Returning to the confusing point.

 On x86 things are actually muddied by stronger than required hardware
 guarantees. And only because of this there is no need for an explicit
 read barrier (nor does x86 have one) in this case. Still, the read
 operation has to be marked specifically (volatile, asm block, whatever)
 to ensure the _compiler_ does the right thing (no reordering across it).

I would think the model Mehrdad used would be sufficient to prevent caching of the data, since it is a virtual call, no?

-Steve
May 07 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Tuesday, 7 May 2013 at 22:44:28 UTC, Andrei Alexandrescu wrote:
 The writer is only half of the equation. The reader has its own 
 cache to worry about and its own loading order.

Oooh! So basically this is the scenario you're referring to?

1. The reader has the uninitialized data in its cache
2. The writer writes the new data and a pointer to the new data
3. The reader sees a new pointer and attempts to load the new data
4. The reader receives stale data from its cache

In that case....
 I'm not claiming, I'm destroying :o)

... well done!!
 There is. I know it's confusing. You may want to peruse the 
 reading materials linked by others.

Ok thanks!
May 07 2013
prev sibling next sibling parent Brad Roberts <braddr puremagic.com> writes:
On 5/7/13 3:44 PM, Andrei Alexandrescu wrote:
 On 5/7/13 5:57 PM, Mehrdad wrote:
 On Tuesday, 7 May 2013 at 19:49:30 UTC, Andrei Alexandrescu wrote:
 A memory barrier is not a one-way thing, i.e. not only the writer must
 do it. Any operation on shared memory is a handshake between the
 writer and the reader. If the reader doesn't do its bit, it can see
 the writes out of order no matter what the writer does.

 Andrei

Andrew, I still don't understand: The writer is ensuring that writes to memory are happening _after_ the object is initialized and _before_ the reference to the old object is modified, via a memory barrier.

The writer is only half of the equation. The reader has its own cache to worry about and its own loading order.
 Unless you're claiming that a memory barrier _doesn't_ do what it's
 supposed to (i.e., the memory module is executing writes out-of-order
 even though the processor is issuing them in the correct order), there
 is no way for _anyone_ to see a partially initialized object anywhere...

I'm not claiming, I'm destroying :o). There is. I know it's confusing. You may want to peruse the reading materials linked by others. Andrei

(this might be a repeat, I've only skimmed this thread) Section 8.2 of this doc is a good read: http://download.intel.com/products/processor/manual/325384.pdf There's a couple allowed re-orderings that are not what most people expect. Later, Brad
May 07 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 07 May 2013 18:49:19 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 5/7/13 6:12 PM, Steven Schveighoffer wrote:
 So that must not be what it is doing. What it must be doing is storing
 out of order, BUT placing a prevention mechanism from reading the memory
 until the "release" is done? Kind of like a minuscule mutex lock. So
 while it is out-of-order writing the object data, it holds a lock on the
 reference pointer to that data, so anyone using acquire cannot read it  
 yet?

 That actually makes sense to me. Is that the case?

Not at all. A memory barrier dictates operation ordering. It doesn't do interlocking. One of Herb's slides shows very nicely how memory operations can't be moved ahead of an acquire and past a release.

I will have to watch those some time, when I have 3 hours to kill :) Thanks for sticking with me and not dismissing out of hand. This is actually the first time I felt close to grasping this concept. -Steve
May 07 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 05 May 2013 22:35:27 -0400, dsimcha <dsimcha yahoo.com> wrote:

 On the advice of Walter and Andrei, I've written a blog article about  
 the low-lock Singleton pattern in D.  This is a previously obscure  
 pattern that uses thread-local storage to make Singletons both  
 thread-safe and efficient and was independently invented by at least me  
 and Alexander Terekhov, an IBM researcher.  However, D's first-class  
 treatment of thread-local storage means the time has come to move it out  
 of obscurity and possibly make it the standard way to do Singletons.

 Article:
 http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

 Reddit:
 http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/

Pulling this out from the depths of this discussion:

David, the current pattern protects the read of the __gshared singleton with a thread-local boolean. This means to check to see if the value is valid, we do:

if(!instantiated_)
{
    ... // thread-safe initialization of instance_
    instantiated_ = true;
}
return instance_;

This requires a load of the boolean, and then a load of the instance. I wonder if it might be more efficient to store a copy of the instance instead of the bool? This would only require one load for the check:

if(!instantiated_instance_) // a TLS copy of the instance
{
    ... // thread-safe initialization of instance_
    instantiated_instance_ = instance_;
}
return instantiated_instance_;

I think in the steady state this devolves to the "unsafe" case, which of course is safe by that time. Might this account for at least some of dmd's poorer performance?

-Steve
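A self-contained sketch of that suggestion (names illustrative): the thread-local variable caches the reference itself, so it doubles as the instantiated flag and the steady-state path is a single load.

```d
class MySingleton
{
    private this() {}

    // The single instance, shared across all threads.
    private __gshared MySingleton instance_;

    // Thread-local cache of the reference: a non-null value means
    // this thread has already synchronized once, so no separate
    // boolean load is needed.
    private static MySingleton tlsInstance_;

    static MySingleton get()
    {
        if (tlsInstance_ is null)
        {
            synchronized (MySingleton.classinfo)
            {
                if (instance_ is null)
                    instance_ = new MySingleton;
            }
            tlsInstance_ = instance_;
        }
        return tlsInstance_;
    }
}
```

In steady state, get() costs one TLS load plus a null test; the __gshared reference is only read on each thread's first pass.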
May 07 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 7 May 2013 at 22:12:17 UTC, Steven Schveighoffer 
wrote:
 On Tue, 07 May 2013 16:02:08 -0400, Dmitry Olshansky 
 <dmitry.olsh gmail.com> wrote:

 07-May-2013 23:49, Andrei Alexandrescu пишет:
 On 5/7/13 12:46 PM, Steven Schveighoffer wrote:
 On Tue, 07 May 2013 12:30:05 -0400, deadalnix 
 <deadalnix gmail.com>
 wrote:


 That is incorrect as the thread not going into the lock can 
 see a
 partially initialized object.

The memory barrier prevents that. You don't store the variable until the object is initialized. That is the whole point.

A memory barrier is not a one-way thing, i.e. not only the writer must do it. Any operation on shared memory is a handshake between the writer and the reader. If the reader doesn't do its bit, it can see the writes out of order no matter what the writer does.


So the memory barrier ensures that neither the compiler nor the processor can re-order the stores to memory. But you are saying that the *reader* must also put in a memory barrier, otherwise it might see the stores out of order. It does not make sense to me, how does the reader see a half-initialized object, when the only way it sees anything is when it's stored to main memory *and* the stores are done in order?

Short answer: because reads can be done out of order and/or served from cached values. The reader must do its part of the job as well. You may want to read this series of articles, they are very interesting: http://preshing.com/20120913/acquire-and-release-semantics
May 07 2013
prev sibling next sibling parent "Max Samukha" <maxsamukha gmail.com> writes:
On Monday, 6 May 2013 at 17:58:19 UTC, Walter Bright wrote:
 On 5/6/2013 6:14 AM, Max Samukha wrote:
 FWIW, I played with a generalized form of this pattern long 
 ago, something like
 (typing from memory):

And, that's the classic double checked locking bug!

D man is the bug! I simply failed to insert a line that flags the thread-local guard. Sorry for that. The Nullable thing was an impromptu to avoid ugly specializations I used for nullable and non-nullable types in my original implementation. Note that the Nullable is not phobos Nullable - the latter incurs unnecessary overhead for types that are already nullable. Maybe, the use of Nullable is an overkill and a __gshared boolean (redundant in case of a nullable type) would suffice. David mentioned the trick on this NG years ago. It is well known, understood and would be rarely needed if D could properly do eager initialization of global state. :P Anyway, I popped up mainly to point out that the pattern should not be limited to classes.
May 24 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Max Samukha:

 Note that the Nullable is not phobos Nullable -
 the latter incurs unnecessary overhead for types that are 
 already nullable.

In Bugzilla I have suggested some improvements for Nullable, but in Phobos there is already an alternative Nullable that avoids that overhead: struct Nullable(T, T nullValue); Bye, bearophile
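For illustration, the sentinel-based form mentioned above is used roughly like this (a sketch against std.typecons; the alias name is made up):

```d
import std.typecons;

// The two-parameter Nullable uses a sentinel value to encode "null",
// so no extra boolean is stored alongside the payload.
alias MaybeInt = Nullable!(int, int.min);

void demo()
{
    MaybeInt n;
    assert(n.isNull);    // default-initialized to the int.min sentinel

    n = 7;
    assert(!n.isNull);
    assert(n.get == 7);
}
```

The trade-off is that the sentinel value itself can no longer be stored as a real payload, which is exactly the distinction debated below for already-nullable types like int*.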
May 24 2013
prev sibling next sibling parent "Max Samukha" <maxsamukha gmail.com> writes:
On Friday, 24 May 2013 at 13:05:36 UTC, bearophile wrote:
 Max Samukha:

 Note that the Nullable is not phobos Nullable -
 the latter incurs unnecessary overhead for types that are 
 already nullable.

In Bugzilla I have suggested some improvements for Nullable, but in Phobos there is already an alternative Nullable that avoids that overhead: struct Nullable(T, T nullValue); Bye, bearophile

The question is what should be the result of: Nullable(int*)? Nullable!(Nullable!T)? Forbidden (like in C#)? Aliased to the source type? A distinct type?
May 24 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Max Samukha:

 The question is what should be the result of:

 Nullable(int*)?
 Nullable!(Nullable!T)?

 Forbidden (like in C#)? Aliased to the source type? A distinct 
 type?

In D Nullable!(int*) needs to be a different type, because the interface is different between a pointer and a Nullable. But often you want to use something like Nullable(int*, null) or Nullable(int*, cast(int*)null) instead. In the second case maybe it can be collapsed into a single Nullable, but such behavior needs to be documented. Bye, bearophile
May 24 2013
prev sibling next sibling parent "Max Samukha" <maxsamukha gmail.com> writes:
On Friday, 24 May 2013 at 14:03:24 UTC, bearophile wrote:
 Max Samukha:

 The question is what should be the result of:

 Nullable(int*)?
 Nullable!(Nullable!T)?

 Forbidden (like in C#)? Aliased to the source type? A distinct 
 type?

In D Nullable!(int*) needs to be a different type, because the interface is different between a pointer and a Nullable.

It could be normalized by free functions isNull(T)(T t), nullableValue(T)(..) etc. Isn't that what was done to arrays so they could support the range interface?
 But often you want to use something like Nullable(int*, null) 
 or Nullable(int*, cast(int*)null) instead.

Do you have an example? In what cases is the distinction between a null null int* and non-null null int* necessary?
 In the second case maybe it can be collapsed into a single 
 Nullable, but such behavior needs to be documented.

Ok.
 Bye,
 bearophile

May 24 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Max Samukha:

 It could be normalized by free functions isNull(T)(T t), 
 nullableValue(T)(..) etc.

In Haskell/Scala if you program idiomatically you usually use HOFs as map, filter, etc that work "transparently" on nullables or collection of nullables, because Maybe is a monad. Example: you have a range of nullables, and you map a function on it, using a special map that ignores the null items. (This is one of the improvements I have suggested in Bugzilla.)
 But often you want to use something like Nullable(int*, null) 
 or Nullable(int*, cast(int*)null) instead.

Do you have an example? In what cases is the distinction between a null null int* and non-null null int* necessary?

If you use a Nullable!(int*, null) type that distinction vanishes, there is only one null. Bye, bearophile
May 24 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Friday, 24 May 2013 at 13:49:14 UTC, Max Samukha wrote:
 On Friday, 24 May 2013 at 13:05:36 UTC, bearophile wrote:
 Max Samukha:

 Note that the Nullable is not phobos Nullable -
 the latter incurs unnecessary overhead for types that are 
 already nullable.

In Bugzilla I have suggested some improvements for Nullable, but in Phobos there is already an alternative Nullable that avoids that overhead: struct Nullable(T, T nullValue); Bye, bearophile

The question is what should be the result of: Nullable(int*)?

New type, but no overhead.
 Nullable!(Nullable!T)?

New type, with an extra boolean.
May 24 2013
prev sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Tuesday, 7 May 2013 at 20:17:43 UTC, QAston wrote:
 No. A tutorial on memory consistency models would be too long 
 to insert here. I don't know of a good online resource, does 
 anyone?

 Andrei

This was very helpful for me - focuses much more on the memory model itself than the c++11 part. http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2

This was awesome/amazing.
May 25 2013