www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Thin Lock Implementation

reply Bartosz Milewski <bartosz relisoft.com> writes:
I posted some implementation details of Thin Lock in my blog, 
http://bartoszmilewski.wordpress.com/ . I would appreciate comments.

And by the way, I put it on reddit too: 
http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .

Bartosz
Aug 17 2008
next sibling parent reply Frank Benoit <keinfarbton googlemail.com> writes:
Bartosz Milewski schrieb:
 I posted some implementation details of Thin Lock in my blog, 
 http://bartoszmilewski.wordpress.com/ . I would appreciate comments.
 
 And by the way, I put it on reddit too: 
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .
 
 Bartosz

This sounds very intersting. In what state is this? Discussion or are you already in process of implementing this for D?
Aug 18 2008
parent Bartosz Milewski <bartosz relisoft.com> writes:
I'm in the process of implementing it. I've been already rewriting parts 
of the runtime to make monitors more modular. Now I'm diving into the 
low-level implementation.
Since, at the moment, dmd only supports the x86, that's the platform I'm 
targeting. But I'm also abstracting processor-dependent part into a 
separate library.
Aug 18 2008
prev sibling next sibling parent reply Aarti_pl <aarti interia.pl> writes:
Bartosz Milewski pisze:
 I posted some implementation details of Thin Lock in my blog, 
 http://bartoszmilewski.wordpress.com/ . I would appreciate comments.
 
 And by the way, I put it on reddit too: 
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .
 
 Bartosz

Well, I am definitely no concurrency expert, but one questions bothers me a bit. From your article it seems that there will be limit of 2046 threads for program (11 bits, one based). Isn't it too small number, taking into account multi core processors which will emerge (able to start much more threads than currently), and applications like web servers (which serves a lot of concurrent connections)? BR Marcin Kuszczak (aarti_pl)
Aug 19 2008
next sibling parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Aarti_pl wrote:

 Bartosz Milewski pisze:
 I posted some implementation details of Thin Lock in my blog,
 http://bartoszmilewski.wordpress.com/ . I would appreciate comments.
 
 And by the way, I put it on reddit too:
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .
 
 Bartosz

Well, I am definitely no concurrency expert, but one questions bothers me a bit. From your article it seems that there will be limit of 2046 threads for program (11 bits, one based). Isn't it too small number, taking into account multi core processors which will emerge (able to start much more threads than currently), and applications like web servers (which serves a lot of concurrent connections)? BR Marcin Kuszczak (aarti_pl)

2046 isn't a particularly small number, in fact most OS'es aren't set up to allow more than 500-1000 depending on available memory. However, I agree that it should not restrict _more_ than the OS where a larger number is available. -- Lars Ivar Igesund blog at http://larsivi.net DSource, #d.tango & #D: larsivi Dancing the Tango
Aug 19 2008
parent "Jb" <jb nowhere.com> writes:
"Lars Ivar Igesund" <larsivar igesund.net> wrote in message 
news:g8e509$2872$1 digitalmars.com...
 2046 isn't a particularly small number, in fact most OS'es aren't set up 
 to
 allow more than 500-1000 depending on available memory.

 However, I agree that it should not restrict _more_ than the OS where a
 larger number is available.

If you're on a windows box you can see how many threads are currently up and running. My single core Athlon, which doesnt actualy have much going on, is clocking up 370 threads. I imagine that newer more busy PCs are running 2 or 3 times as many. That said.. this is system wide. I dont know if the OS will partition threads, so any particular process can only see it's threads, and a few OS threads. I dont know how it works.
Aug 19 2008
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Aarti_pl" wrote
 Bartosz Milewski pisze:
 I posted some implementation details of Thin Lock in my blog, 
 http://bartoszmilewski.wordpress.com/ . I would appreciate comments.

 And by the way, I put it on reddit too: 
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .

 Bartosz

Well, I am definitely no concurrency expert, but one questions bothers me a bit. From your article it seems that there will be limit of 2046 threads for program (11 bits, one based). Isn't it too small number, taking into account multi core processors which will emerge (able to start much more threads than currently), and applications like web servers (which serves a lot of concurrent connections)?

On a 32-bit system, the amount of addressable memory space and the stack size are the factors that limit the number of threads. For example, if addressable space is 2GB, and each thread has a 1MB stack, that's 2000 threads max (this is the typical situation for Windows). -Steve
Aug 19 2008
parent reply Sascha Katzner <sorry.no spam.invalid> writes:
Steven Schveighoffer wrote:
 On a 32-bit system, the amount of addressable memory space and the
 stack size are the factors that limit the number of threads.  For
 example, if addressable space is 2GB, and each thread has a 1MB
 stack, that's 2000 threads max (this is the typical situation for
 Windows).

Not exactly right, since the stack is organized in 4kb pages and these pages are not initialized until they are first used they usually require a lot less than 1mb. LLAP, Sascha
Aug 19 2008
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Sascha Katzner" wrote
 Steven Schveighoffer wrote:
 On a 32-bit system, the amount of addressable memory space and the
 stack size are the factors that limit the number of threads.  For
 example, if addressable space is 2GB, and each thread has a 1MB
 stack, that's 2000 threads max (this is the typical situation for
 Windows).

Not exactly right, since the stack is organized in 4kb pages and these pages are not initialized until they are first used they usually require a lot less than 1mb.

Hm... I read the limit above online (http://blogs.msdn.com/oldnewthing/archive/2005/07/29/444912.aspx). I think maybe the pages aren't initialized, but doesn't the system have to reserve them because it has to guarantee a 1MB stack for each? Or maybe that guy is completely wrong, or the concept has been fixed :) -Steve
Aug 19 2008
parent reply "Jb" <jb nowhere.com> writes:
"Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
news:g8emvj$1cu5$1 digitalmars.com...
 "Sascha Katzner" wrote
 Steven Schveighoffer wrote:
 On a 32-bit system, the amount of addressable memory space and the
 stack size are the factors that limit the number of threads.  For
 example, if addressable space is 2GB, and each thread has a 1MB
 stack, that's 2000 threads max (this is the typical situation for
 Windows).

Not exactly right, since the stack is organized in 4kb pages and these pages are not initialized until they are first used they usually require a lot less than 1mb.

Hm... I read the limit above online (http://blogs.msdn.com/oldnewthing/archive/2005/07/29/444912.aspx). I think maybe the pages aren't initialized, but doesn't the system have to reserve them because it has to guarantee a 1MB stack for each? Or maybe that guy is completely wrong, or the concept has been fixed :)

Read the whole page ;-) The 1MB is just the default stack size if you pass a value of zero to the CreateThread function, and usualy most linkers / librarys whatever do that. You can override this by specifying the stack size. And as he said he could then create 13000 rather than 2000 threads. But the actualy mimimum size is 64k, as that is the granularity of the windows allocator. You can check by calling getSystemInfo, and examining the allocationGranularity, but I've never seen it return anything other than 64k.
Aug 19 2008
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Jb" wrote
 "Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
 news:g8emvj$1cu5$1 digitalmars.com...
 "Sascha Katzner" wrote
 Steven Schveighoffer wrote:
 On a 32-bit system, the amount of addressable memory space and the
 stack size are the factors that limit the number of threads.  For
 example, if addressable space is 2GB, and each thread has a 1MB
 stack, that's 2000 threads max (this is the typical situation for
 Windows).

Not exactly right, since the stack is organized in 4kb pages and these pages are not initialized until they are first used they usually require a lot less than 1mb.

Hm... I read the limit above online (http://blogs.msdn.com/oldnewthing/archive/2005/07/29/444912.aspx). I think maybe the pages aren't initialized, but doesn't the system have to reserve them because it has to guarantee a 1MB stack for each? Or maybe that guy is completely wrong, or the concept has been fixed :)

Read the whole page ;-) The 1MB is just the default stack size if you pass a value of zero to the CreateThread function, and usualy most linkers / librarys whatever do that. You can override this by specifying the stack size. And as he said he could then create 13000 rather than 2000 threads. But the actualy mimimum size is 64k, as that is the granularity of the windows allocator. You can check by calling getSystemInfo, and examining the allocationGranularity, but I've never seen it return anything other than 64k.

Read my whole post :) "this is the typical situation for Windows" i.e. the default. -Steve
Aug 19 2008
parent "Jb" <jb nowhere.com> writes:
"Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
news:g8eqos$1jn5$1 digitalmars.com...
 Read my whole post :)

 "this is the typical situation for Windows"

 i.e. the default.

Ok, sorry, my bad. ;-)
Aug 19 2008
prev sibling next sibling parent "Jb" <jb nowhere.com> writes:
"Sascha Katzner" <sorry.no spam.invalid> wrote in message 
news:g8emle$1bad$1 digitalmars.com...
 Steven Schveighoffer wrote:
 On a 32-bit system, the amount of addressable memory space and the
 stack size are the factors that limit the number of threads.  For
 example, if addressable space is 2GB, and each thread has a 1MB
 stack, that's 2000 threads max (this is the typical situation for
 Windows).

Not exactly right, since the stack is organized in 4kb pages and these pages are not initialized until they are first used they usually require a lot less than 1mb.

The allocation granularity is actualy 64k chunks on most (at least all that ive ever used) windows OSes. The 4k granulary is for the protection status. Which i've always thought was a bit odd.
Aug 19 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Sascha Katzner wrote:
 Steven Schveighoffer wrote:
 On a 32-bit system, the amount of addressable memory space and the
 stack size are the factors that limit the number of threads.  For
 example, if addressable space is 2GB, and each thread has a 1MB
 stack, that's 2000 threads max (this is the typical situation for
 Windows).

Not exactly right, since the stack is organized in 4kb pages and these pages are not initialized until they are first used they usually require a lot less than 1mb.

The memory isn't committed until it is used, but the address space is. You cannot move the stack if it exceeds its preallocated address space allotment.
Aug 19 2008
next sibling parent reply BCS <ao pathlink.com> writes:
Reply to Walter,

 Sascha Katzner wrote:
 
 Steven Schveighoffer wrote:
 
 On a 32-bit system, the amount of addressable memory space and the
 stack size are the factors that limit the number of threads.  For
 example, if addressable space is 2GB, and each thread has a 1MB
 stack, that's 2000 threads max (this is the typical situation for
 Windows).
 

these pages are not initialized until they are first used they usually require a lot less than 1mb.

You cannot move the stack if it exceeds its preallocated address space allotment.

Must the stack be contiguous? Could you, on a stack overflow, heap allocate a new chunk and put in a frame that gets back to the last segment?
Aug 19 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
BCS wrote:
 The memory isn't committed until it is used, but the address space is.
 You cannot move the stack if it exceeds its preallocated address space
 allotment.

allocate a new chunk and put in a frame that gets back to the last segment?

No.
Aug 19 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Walter Bright wrote:
 BCS wrote:
 The memory isn't committed until it is used, but the address space is.
 You cannot move the stack if it exceeds its preallocated address space
 allotment.

allocate a new chunk and put in a frame that gets back to the last segment?

No.

I suppose I should explain a bit more. Stack overflows are detected by the hardware, which means every PUSH could overflow the stack. It isn't practical to then try and continue the stack elsewhere, because the corresponding POPs would not know what was going on.
Aug 19 2008
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
Thin locks sounds pretty cool, but I do have one question.  It seems like
automatically inflating the lock the first time contention is detected is a
fairly
inefficient strategy, since the lock may be contested, but very infrequently.
Would it be possible to keep track of the frequency of contesting and, if the
lock
is contested only, say, 1% of the time, just leave it as a spinlock?
Aug 20 2008
parent reply Bartosz Milewski <bartosz relisoft.com> writes:
dsimcha wrote:
 Thin locks sounds pretty cool, but I do have one question.  It seems like
 automatically inflating the lock the first time contention is detected is a
fairly
 inefficient strategy, since the lock may be contested, but very infrequently.
 Would it be possible to keep track of the frequency of contesting and, if the
lock
 is contested only, say, 1% of the time, just leave it as a spinlock?

The paper describing thin locks (my post has a link to it) claims that objects essentially come in two classes--the ones that are never contested, and the ones that are constantly contested. In-between situation are relatively rare.
Aug 20 2008
parent reply Brad Roberts <braddr bellevue.puremagic.com> writes:
On Wed, 20 Aug 2008, Bartosz Milewski wrote:

 dsimcha wrote:
 Thin locks sounds pretty cool, but I do have one question.  It seems like
 automatically inflating the lock the first time contention is detected is a
 fairly
 inefficient strategy, since the lock may be contested, but very
 infrequently.
 Would it be possible to keep track of the frequency of contesting and, if
 the lock
 is contested only, say, 1% of the time, just leave it as a spinlock?

The paper describing thin locks (my post has a link to it) claims that objects essentially come in two classes--the ones that are never contested, and the ones that are constantly contested. In-between situation are relatively rare.

I'll agree that inbetween is relatively rare, but just because there's contention, doesn't require the switch to heavy locks. In a lot of the code I have written, one spin (assuming the spin utilizes rep; nop;) is sufficient for the 'blocked' acquire to finish it's acquire. The use of heavy weight locks in that scenario is a net loss. Later, Brad
Aug 20 2008
parent reply Bartosz Milewski <bartosz relisoft.com> writes:
Brad Roberts wrote:
 I'll agree that inbetween is relatively rare, but just because there's 
 contention, doesn't require the switch to heavy locks.  In a lot of the 
 code I have written, one spin (assuming the spin utilizes rep; nop;) is 
 sufficient for the 'blocked' acquire to finish it's acquire.  The use of 
 heavy weight locks in that scenario is a net loss.

A thin lock is not a spin lock. It's an optimization on top of the OS lock specifically for the no-contention case. If there is contention, the OS lock, which is part of FatLock, is used. It presumably does its own optimizations, so calling it "heavy weight" may be unfair.
Aug 20 2008
parent reply Leandro Lucarella <llucax gmail.com> writes:
Bartosz Milewski, el 20 de agosto a las 15:14 me escribiste:
 Brad Roberts wrote:
I'll agree that inbetween is relatively rare, but just because there's 
contention, doesn't require the switch to heavy locks.  In a lot of the code I 
have written, one spin (assuming the spin utilizes rep; nop;) is sufficient for 
the 'blocked' acquire to finish it's acquire.  The use of heavy weight locks in 
that scenario is a net loss.

A thin lock is not a spin lock. It's an optimization on top of the OS lock specifically for the no-contention case. If there is contention, the OS lock, which is part of FatLock, is used. It presumably does its own optimizations, so calling it "heavy weight" may be unfair.

I would love to hear you point on optimizing an OS lock that is already optimized. Is that really necessary? Do you have numbers that back you up? Thank you. -- Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/ ---------------------------------------------------------------------------- GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05) ---------------------------------------------------------------------------- aFR [afr my.farts.cause.nuclear.reaction.org] has quit IRC (Ping timeout)
Aug 22 2008
next sibling parent Leandro Lucarella <llucax gmail.com> writes:
Robert Jacques, el 22 de agosto a las 22:43 me escribiste:
 On Fri, 22 Aug 2008 14:08:28 -0400, Leandro Lucarella <llucax gmail.com> wrote:
 
Bartosz Milewski, el 20 de agosto a las 15:14 me escribiste:
Brad Roberts wrote:
I'll agree that inbetween is relatively rare, but just because there's
contention, doesn't require the switch to heavy locks.  In a lot of the code I
have written, one spin (assuming the spin utilizes rep; nop;) is sufficient for
the 'blocked' acquire to finish it's acquire.  The use of heavy weight locks in
that scenario is a net loss.

A thin lock is not a spin lock. It's an optimization on top of the OS lock specifically for the no-contention case. If there is contention, the OS lock, which is part of FatLock, is used. It presumably does its own optimizations, so calling it "heavy weight" may be unfair.

I would love to hear you point on optimizing an OS lock that is already optimized. Is that really necessary? Do you have numbers that back you up? Thank you.


It doesn't (at least in Linux). -- Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/ ---------------------------------------------------------------------------- GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05) ---------------------------------------------------------------------------- Si pensas que el alma no se ve el alma sí se ve en los ojos
Aug 22 2008
prev sibling parent Bartosz Milewski <bartosz relisoft.com> writes:
Leandro Lucarella wrote:
 I would love to hear you point on optimizing an OS lock that is already
 optimized.

I think I'm going to write a separate blog post on this topic, since it's causing a lot of controversy. The short answer is that OS locks are optimized for different scenarios then thin locks.
 Is that really necessary? Do you have numbers that back you up?

The original thin lock paper has the numbers. Granted, they are for Java, but I don't see why they shouldn't be a ballpark for D as well.
Aug 24 2008
prev sibling parent reply BCS <ao pathlink.com> writes:
Reply to Walter,

 Walter Bright wrote:
 
 BCS wrote:
 
 The memory isn't committed until it is used, but the address space
 is. You cannot move the stack if it exceeds its preallocated
 address space allotment.
 

allocate a new chunk and put in a frame that gets back to the last segment?


the hardware, which means every PUSH could overflow the stack. It isn't practical to then try and continue the stack elsewhere, because the corresponding POPs would not know what was going on.

That's what the extra stack frame I mentioned is for. I'll grant you can't split a single frame across the sections but if you cold detect an impending stack overflow before the function runs (by push/popping the largest block that the function can use [calloc aside]) you could then do the new stack at that point. When the function returns, it ends up in the extra frame and it knows how to return to the right location.
Aug 20 2008
parent Sean Kelly <sean invisibleduck.org> writes:
BCS wrote:
 Reply to Walter,
 
 Walter Bright wrote:

 BCS wrote:

 The memory isn't committed until it is used, but the address space
 is. You cannot move the stack if it exceeds its preallocated
 address space allotment.

allocate a new chunk and put in a frame that gets back to the last segment?


the hardware, which means every PUSH could overflow the stack. It isn't practical to then try and continue the stack elsewhere, because the corresponding POPs would not know what was going on.

That's what the extra stack frame I mentioned is for. I'll grant you can't split a single frame across the sections but if you cold detect an impending stack overflow before the function runs (by push/popping the largest block that the function can use [calloc aside]) you could then do the new stack at that point. When the function returns, it ends up in the extra frame and it knows how to return to the right location.

This is basically how Fibers work, though the stack transition isn't invisible. Each Fiber has its own stack and execution is transferred to some location in this stack when the Fiber is called. Sean
Aug 20 2008
prev sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Fri, 22 Aug 2008 14:08:28 -0400, Leandro Lucarella <llucax gmail.com>  
wrote:

 Bartosz Milewski, el 20 de agosto a las 15:14 me escribiste:
 Brad Roberts wrote:
I'll agree that inbetween is relatively rare, but just because there's
contention, doesn't require the switch to heavy locks.  In a lot of  

have written, one spin (assuming the spin utilizes rep; nop;) is  

the 'blocked' acquire to finish it's acquire.  The use of heavy weight  

that scenario is a net loss.

A thin lock is not a spin lock. It's an optimization on top of the OS lock specifically for the no-contention case. If there is contention, the OS lock, which is part of FatLock, is used. It presumably does its own optimizations, so calling it "heavy weight" may be unfair.

I would love to hear you point on optimizing an OS lock that is already optimized. Is that really necessary? Do you have numbers that back you up? Thank you.

performance difference. Kernel calls are the reason that GC programs beat programs using malloc and minimizing them is the secret behind all the various 'fast' malloc replacements.
Aug 22 2008
prev sibling next sibling parent reply Bartosz Milewski <bartosz relisoft.com> writes:
Aarti_pl wrote:
  From your article it seems that there will be limit of 2046 threads for 
 program (11 bits, one based).
 
 Isn't it too small number, taking into account multi core processors 
 which will emerge (able to start much more threads than currently), and 
 applications like web servers (which serves a lot of concurrent 
 connections)?
 

This is the number I've found in the D runtime and I didn't question it (yet!). Thin lock has a lot of bits to spare--I made the recursion count ridiculously huge. Count can easily be cut down to 8 bits and all the rest could go towards the thread index if necessary. On a 64-bit machine, thread index could spill into the extra 32-bits. So we are are well prepared for the future.
Aug 19 2008
parent Lars Ivar Igesund <larsivar igesund.net> writes:
Bartosz Milewski wrote:

 Aarti_pl wrote:
  From your article it seems that there will be limit of 2046 threads for
 program (11 bits, one based).
 
 Isn't it too small number, taking into account multi core processors
 which will emerge (able to start much more threads than currently), and
 applications like web servers (which serves a lot of concurrent
 connections)?
 

This is the number I've found in the D runtime and I didn't question it (yet!).

I believe the same restriction doesn't exist in Tango's runtime. -- Lars Ivar Igesund blog at http://larsivi.net DSource, #d.tango & #D: larsivi Dancing the Tango
Aug 19 2008
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Aarti_pl (aarti interia.pl)'s article
 Bartosz Milewski pisze:
 I posted some implementation details of Thin Lock in my blog,
 http://bartoszmilewski.wordpress.com/ . I would appreciate comments.

 And by the way, I put it on reddit too:
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .

 Bartosz

me a bit. From your article it seems that there will be limit of 2046 threads for program (11 bits, one based). Isn't it too small number, taking into account multi core processors which will emerge (able to start much more threads than currently), and applications like web servers (which serves a lot of concurrent connections)?

I had the same concern, however, the compact thread id is clearly necessary for the algorithm to operate properly, so what can you do. Perhaps transfer some bits from the recursion counter to the thread id to increase that number. For what it's worth, Tango doesn't use a static thread list because I think the basic approach is flawed. As you say, it creates a fixed upper bound for the number of threads the app may contain, creates visible gaps in the global thread list (visible via getAll), and causes contention for a common cache line. However, I think it's reasonable to create some sort of thread id anyway, if only to allow for thin locks to operate. I've been kicking around a few ways to graft this functionality onto Tango without incurring any undue overhead. Sean
Aug 19 2008
prev sibling next sibling parent "Jb" <jb nowhere.com> writes:
"Bartosz Milewski" <bartosz relisoft.com> wrote in message 
news:g8aeg7$h2h$1 digitalmars.com...
I posted some implementation details of Thin Lock in my blog, 
http://bartoszmilewski.wordpress.com/ . I would appreciate comments.

 And by the way, I put it on reddit too: 
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .

One thing that occurs to me is this probably wont work with threads the D runtime doesnt know about. For example interupt call backs. For real time audio you typicaly send a callback funciton pointer to the soundcard driver and then this is called from the soundcards interupt thread. But the D runtime thread pool doesnt know about this thread. What about OS callbacks? Do they always come in on a thread that D knows about? Im not sure tbh. And if you want real time operation, pro audio, live video, ect.. then these callback threads have to remain invisible to the GC, so it cant pause them.. but you'd still need them visible in the global thread index.. or else your thin lock mechanism wont work. Then again maybe you shouldnt use the inbuilt thin locks in such circumstances anyway.
Aug 19 2008
prev sibling next sibling parent "Jb" <jb nowhere.com> writes:
"Bartosz Milewski" <bartosz relisoft.com> wrote in message 
news:g8aeg7$h2h$1 digitalmars.com...
I posted some implementation details of Thin Lock in my blog, 
http://bartoszmilewski.wordpress.com/ . I would appreciate comments.

Another thought I had is that when reading it is this. For debuging purposes the compiler could do this. When an object is created as never shared, rather than setting the thin lock to zero, you could instead set it to the current threadId. Then the compiler could atomaticy add checks into the DBC code to make sure that the object is only ever accessed from the thread that created it.
Aug 19 2008
prev sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Bartosz Milewski" wrote
I posted some implementation details of Thin Lock in my blog, 
http://bartoszmilewski.wordpress.com/ . I would appreciate comments.

 And by the way, I put it on reddit too: 
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .

This looks a lot more promising than the original concept that Walter linked to :) There is mention of casting away shared which throws an exception if the object isn't shared. Is this the locking mechanism to both read and write data? i.e. you have to cast away sharing to use the data, then cast it back to unshared? How do you wait for an object to be available so you can unshare it? What happens if you forget to uncast it? What happens to subcomponents? i.e. shared is supposed to be transitive, right? So if you cast away shared in the top-level object, does that go and try unsharing all the objects referenced by it (i.e. setting all the currently shared bits to 0)? If not, then how do you know that the subobjects are shared in the type system? It looks like a lot of this is going to be runtime checks, or is there going to be another type modifier for 'created shared, but currently unshared', i.e. 'locked' or something? Cool stuff, I think you are on the right track. -Steve
Aug 19 2008
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Steven Schveighoffer" wrote
 "Bartosz Milewski" wrote
I posted some implementation details of Thin Lock in my blog, 
http://bartoszmilewski.wordpress.com/ . I would appreciate comments.

 And by the way, I put it on reddit too: 
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .

This looks a lot more promising than the original concept that Walter linked to :) There is mention of casting away shared which throws an exception if the object isn't shared. Is this the locking mechanism to both read and write data? i.e. you have to cast away sharing to use the data, then cast it back to unshared? How do you wait for an object to be available so you can unshare it? What happens if you forget to uncast it? What happens to subcomponents? i.e. shared is supposed to be transitive, right? So if you cast away shared in the top-level object, does that go and try unsharing all the objects referenced by it (i.e. setting all the currently shared bits to 0)? If not, then how do you know that the subobjects are shared in the type system? It looks like a lot of this is going to be runtime checks, or is there going to be another type modifier for 'created shared, but currently unshared', i.e. 'locked' or something? Cool stuff, I think you are on the right track.

Reading further messages in your blog, I see that you expect sharing casts to be recursive (to preserve the transitive nature of it). What if two shared objects have references to the same shared object? If you 'unshare' one parent, the other object now is shared, but its subcomponent is not shared. Is that considered ok? -Steve
Aug 19 2008
parent reply Bartosz Milewski <bartosz relisoft.com> writes:
Steven Schveighoffer wrote:
 Reading further messages in your blog, I see that you expect sharing casts 
 to be recursive (to preserve the transitive nature of it).  What if two 
 shared objects have references to the same shared object?  If you 'unshare' 
 one parent, the other object now is shared, but its subcomponent is not 
 shared.  Is that considered ok?

No, it's definitely not ok, but unless there is a reference-counting mechanism in place, we can't do anything about it during casting. However, the sub-objects will be marked as exclusively owned by the casting thread, so any attempt from another thread to lock them will result in an exception. This is true to the spirit of casting in general.
Aug 19 2008
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Bartosz Milewski" wrote
 Steven Schveighoffer wrote:
 Reading further messages in your blog, I see that you expect sharing 
 casts to be recursive (to preserve the transitive nature of it).  What if 
 two shared objects have references to the same shared object?  If you 
 'unshare' one parent, the other object now is shared, but its 
 subcomponent is not shared.  Is that considered ok?

No, it's definitely not ok, but unless there is a reference-counting mechanism in place, we can't do anything about it during casting. However, the sub-objects will be marked as exclusively owned by the casting thread, so any attempt from another thread to lock them will result in an exception. This is true to the spirit of casting in general.

Yes, it's similar to casting something to invariant when you are not sure that another reference doesn't exist. However, in this case, the casting is considered to be an accepted practice (and probably more likely than casting something back and forth from invariant). Normally, casting has big warning signs all over it, I see this as a commonly used feature (or should it be?). I noticed that you have 'locked' and 'unshared' concepts as separate, the former using contention and the latter which is a try-only version. Is there any enforcement for not using either of these? For example, if thread A tries to access (without casting) a shared object when thread B has casted it to unshared, does it succeed? The thing I'm trying to figure out in all this is, I can see the potential with the implementation you are coming up with for some really cool locking features, yet the 'shared/unshared' plan also has to do with lock-free programming (with fences and stuff, I'm not very familiar with these things). It looks to me like you can do one or the other, but not both? How do you know which mode you are using a shared variable in? -Steve
Aug 19 2008
parent reply Sean Kelly <sean invisibleduck.org> writes:
Steven Schveighoffer wrote:
 "Bartosz Milewski" wrote
 Steven Schveighoffer wrote:
 Reading further messages in your blog, I see that you expect sharing 
 casts to be recursive (to preserve the transitive nature of it).  What if 
 two shared objects have references to the same shared object?  If you 
 'unshare' one parent, the other object now is shared, but its 
 subcomponent is not shared.  Is that considered ok?

mechanism in place, we can't do anything about it during casting. However, the sub-objects will be marked as exclusively owned by the casting thread, so any attempt from another thread to lock them will result in an exception. This is true to the spirit of casting in general.

Yes, it's similar to casting something to invariant when you are not sure that another reference doesn't exist. However, in this case, the casting is considered to be an accepted practice (and probably more likely than casting something back and forth from invariant). Normally, casting has big warning signs all over it, I see this as a commonly used feature (or should it be?). I noticed that you have 'locked' and 'unshared' concepts as separate, the former using contention and the latter which is a try-only version. Is there any enforcement for not using either of these? For example, if thread A tries to access (without casting) a shared object when thread B has casted it to unshared, does it succeed? The thing I'm trying to figure out in all this is, I can see the potential with the implementation you are coming up with for some really cool locking features, yet the 'shared/unshared' plan also has to do with lock-free programming (with fences and stuff, I'm not very familiar with these things). It looks to me like you can do one or the other, but not both? How do you know which mode you are using a shared variable in?

I think the implication that casting an object to unshared actually mutates the "shared" bit within the thin lock bitset. Something like this: shared MyClass c; void execThreadA() { MyClass x = makeLocal( c ); // 'shared' bit in instance of x is now 0 } T makeLocal(T)( inout T val ) { // Assume that the state of MyClass is appropriately visible to // the calling thread because this is a shared object shared T tmp = val; val = null; return cast(unshared) tmp; } If the "val = null" assignment is forgotten in makeLocal then an attempt to synchronize on c will throw a SharingException (or whatever it's called). The benefit of being able to cast away the shared label is clearly for performance... if you know that you're the sole owner of a shared object then you don't need to lock for otherwise synchronized accesses (so long), much like assertUnique for const/invariant data. Sean
Aug 19 2008
parent reply Bartosz Milewski <bartosz relisoft.com> writes:
Sean Kelly wrote:
 If the "val = null" assignment is forgotten in makeLocal then an attempt 
 to synchronize on c will throw a SharingException (or whatever it's 
 called).

I would like to make the nulling of the original reference part of the cast. In other words, makeLocal would take its argument by non-const reference and null it. That still doesn't solve the problem of multiple aliases, hence the need for SharingException.
 The benefit of being able to cast away the shared label is clearly for 
 performance... if you know that you're the sole owner of a shared object 
 then you don't need to lock for otherwise synchronized accesses (so 
 long), much like assertUnique for const/invariant data.

There is one complication though--the type system. A function that takes a non-shared argument can't be called with a shared object. But it's true that a sharing cast is an optimization. The safe way to do it is to make a thread-local deep copy of the object and pass that to a library function. (Object cloning is a hot topic in D.)
Aug 19 2008
parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from Bartosz Milewski (bartosz relisoft.com)'s article
 Sean Kelly wrote:
 The benefit of being able to cast away the shared label is clearly for
 performance... if you know that you're the sole owner of a shared object
 then you don't need to lock for otherwise synchronized accesses (so
 long), much like assertUnique for const/invariant data.

a non-shared argument can't be called with a shared object. But it's true that a sharing cast is an optimization. The safe way to do it is to make a thread-local deep copy of the object and pass that to a library function. (Object cloning is a hot topic in D.)

Yeah, that's the nasty thing about atomics--they provide yet another reason to duplicate function bodies. Sometimes I wonder if we won't eventually end up writing only template code simply out of necessity. I've had the same thought about cloning, though. Once 'shared' is in place we'll need to perform deep copies to transfer local data between threads, and I would expect such passing to be far more common than the use of shared data. So we'll need a reliable and efficient means of cloning. But then we'll need the same thing for STM, so I guess we'll be killing two birds with one stone :-) Sean
Aug 19 2008
parent Bartosz Milewski <bartosz relisoft.com> writes:
Sean Kelly wrote:
 So we'll need a reliable and efficient means of cloning.  But
 then we'll need the same thing for STM, so I guess we'll be killing two
 birds with one stone :-)

Exactly!
Aug 19 2008