digitalmars.D - Thin Lock Implementation

Bartosz Milewski (5/5) Aug 17 2008 I posted some implementation details of Thin Lock in my blog,

Frank Benoit (4/11) Aug 18 2008 This sounds very intersting.

Bartosz Milewski (6/6) Aug 18 2008 I'm in the process of implementing it. I've been already rewriting parts...

Aarti_pl (12/19) Aug 19 2008 Well, I am definitely no concurrency expert, but one questions bothers

Lars Ivar Igesund (10/33) Aug 19 2008 2046 isn't a particularly small number, in fact most OS'es aren't set up...

Jb (10/15) Aug 19 2008 If you're on a windows box you can see how many threads are currently up...

Steven Schveighoffer (6/22) Aug 19 2008 On a 32-bit system, the amount of addressable memory space and the stack...

Sascha Katzner (6/11) Aug 19 2008 Not exactly right, since the stack is organized in 4kb pages and these

Steven Schveighoffer (7/16) Aug 19 2008 Hm... I read the limit above online

Jb (11/27) Aug 19 2008 Read the whole page ;-)

Steven Schveighoffer (5/34) Aug 19 2008 Read my whole post :)

Jb (3/6) Aug 19 2008 Ok, sorry, my bad. ;-)

Jb (6/15) Aug 19 2008 The allocation granularity is actualy 64k chunks on most (at least all t...
Walter Bright (4/14) Aug 19 2008 The memory isn't committed until it is used, but the address space is.

BCS (3/21) Aug 19 2008 Must the stack be contiguous? Could you, on a stack overflow, heap alloc...

Walter Bright (2/7) Aug 19 2008 No.

Walter Bright (5/14) Aug 19 2008 I suppose I should explain a bit more. Stack overflows are detected by

dsimcha (5/5) Aug 20 2008 Thin locks sounds pretty cool, but I do have one question. It seems lik...

Bartosz Milewski (5/10) Aug 20 2008 The paper describing thin locks (my post has a link to it) claims that

Brad Roberts (8/21) Aug 20 2008 I'll agree that inbetween is relatively rare, but just because there's

Bartosz Milewski (5/10) Aug 20 2008 A thin lock is not a spin lock. It's an optimization on top of the OS

Leandro Lucarella (11/22) Aug 22 2008 I would love to hear you point on optimizing an OS lock that is already

Robert Jacques (6/28) Aug 22 2008 Assuming the OS lock involves a kernel call then yes, there is a major

Leandro Lucarella (9/32) Aug 22 2008 It doesn't (at least in Linux).

Bartosz Milewski (6/9) Aug 24 2008 I think I'm going to write a separate blog post on this topic, since

BCS (7/26) Aug 20 2008 That's what the extra stack frame I mentioned is for. I'll grant you can...

Sean Kelly (5/33) Aug 20 2008 This is basically how Fibers work, though the stack transition isn't

Bartosz Milewski (7/15) Aug 19 2008 This is the number I've found in the D runtime and I didn't question it

Lars Ivar Igesund (7/19) Aug 19 2008 I believe the same restriction doesn't exist in Tango's runtime.

Sean Kelly (12/28) Aug 19 2008 I had the same concern, however, the compact thread id is clearly necess...

Jb (15/19) Aug 19 2008 One thing that occurs to me is this probably wont work with threads the ...
Jb (8/10) Aug 19 2008 Another thought I had is that when reading it is this.
Steven Schveighoffer (18/22) Aug 19 2008 This looks a lot more promising than the original concept that Walter li...

Steven Schveighoffer (7/30) Aug 19 2008 Reading further messages in your blog, I see that you expect sharing cas...

Bartosz Milewski (6/11) Aug 19 2008 No, it's definitely not ok, but unless there is a reference-counting

Steven Schveighoffer (19/30) Aug 19 2008 Yes, it's similar to casting something to invariant when you are not sur...

Sean Kelly (24/57) Aug 19 2008 I think the implication that casting an object to unshared actually

Bartosz Milewski (10/17) Aug 19 2008 I would like to make the nulling of the original reference part of the

Sean Kelly (11/21) Aug 19 2008 Yeah, that's the nasty thing about atomics--they provide yet another rea...

Bartosz Milewski (2/5) Aug 19 2008 Exactly!

Bartosz Milewski <bartosz relisoft.com> writes:

I posted some implementation details of Thin Lock in my blog, 
http://bartoszmilewski.wordpress.com/ . I would appreciate comments.

And by the way, I put it on reddit too: 
http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .

Bartosz

Aug 17 2008

Frank Benoit <keinfarbton googlemail.com> writes:

Bartosz Milewski schrieb:
 I posted some implementation details of Thin Lock in my blog, 
 http://bartoszmilewski.wordpress.com/ . I would appreciate comments.
 
 And by the way, I put it on reddit too: 
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .
 
 Bartosz

This sounds very intersting.
In what state is this? Discussion or are you already in process of 
implementing this for D?

Aug 18 2008

Bartosz Milewski <bartosz relisoft.com> writes:

I'm in the process of implementing it. I've been already rewriting parts 
of the runtime to make monitors more modular. Now I'm diving into the 
low-level implementation.
Since, at the moment, dmd only supports the x86, that's the platform I'm 
targeting. But I'm also abstracting processor-dependent part into a 
separate library.

Aug 18 2008

Aarti_pl <aarti interia.pl> writes:

Bartosz Milewski pisze:
 I posted some implementation details of Thin Lock in my blog, 
 http://bartoszmilewski.wordpress.com/ . I would appreciate comments.
 
 And by the way, I put it on reddit too: 
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .
 
 Bartosz

Well, I am definitely no concurrency expert, but one questions bothers 
me a bit.

 From your article it seems that there will be limit of 2046 threads for 
program (11 bits, one based).

Isn't it too small number, taking into account multi core processors 
which will emerge (able to start much more threads than currently), and 
applications like web servers (which serves a lot of concurrent 
connections)?

BR
Marcin Kuszczak
(aarti_pl)

Aug 19 2008

Lars Ivar Igesund <larsivar igesund.net> writes:

Aarti_pl wrote:

 Bartosz Milewski pisze:
 I posted some implementation details of Thin Lock in my blog,
 http://bartoszmilewski.wordpress.com/ . I would appreciate comments.
 
 And by the way, I put it on reddit too:
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .
 
 Bartosz

 
 Well, I am definitely no concurrency expert, but one questions bothers
 me a bit.
 
  From your article it seems that there will be limit of 2046 threads for
 program (11 bits, one based).
 
 Isn't it too small number, taking into account multi core processors
 which will emerge (able to start much more threads than currently), and
 applications like web servers (which serves a lot of concurrent
 connections)?
 
 BR
 Marcin Kuszczak
 (aarti_pl)

2046 isn't a particularly small number, in fact most OS'es aren't set up to
allow more than 500-1000 depending on available memory.

However, I agree that it should not restrict _more_ than the OS where a
larger number is available.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango

Aug 19 2008

"Jb" <jb nowhere.com> writes:

"Lars Ivar Igesund" <larsivar igesund.net> wrote in message 
news:g8e509$2872$1 digitalmars.com...
 2046 isn't a particularly small number, in fact most OS'es aren't set up 
 to
 allow more than 500-1000 depending on available memory.

 However, I agree that it should not restrict _more_ than the OS where a
 larger number is available.

If you're on a windows box you can see how many threads are currently up and 
running. My single core Athlon, which doesnt actualy have much going on, is 
clocking up 370 threads.

I imagine that newer more busy PCs are running 2 or 3 times as many.

That said.. this is system wide. I dont know if the OS will partition 
threads, so any particular process can only see it's threads, and a few OS 
threads.

I dont know how it works.

Aug 19 2008

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

"Aarti_pl" wrote
 Bartosz Milewski pisze:
 I posted some implementation details of Thin Lock in my blog, 
 http://bartoszmilewski.wordpress.com/ . I would appreciate comments.

 And by the way, I put it on reddit too: 
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .

 Bartosz

 Well, I am definitely no concurrency expert, but one questions bothers me 
 a bit.

 From your article it seems that there will be limit of 2046 threads for 
 program (11 bits, one based).

 Isn't it too small number, taking into account multi core processors which 
 will emerge (able to start much more threads than currently), and 
 applications like web servers (which serves a lot of concurrent 
 connections)?

On a 32-bit system, the amount of addressable memory space and the stack 
size are the factors that limit the number of threads.  For example, if 
addressable space is 2GB, and each thread has a 1MB stack, that's 2000 
threads max (this is the typical situation for Windows).

-Steve

Aug 19 2008

Sascha Katzner <sorry.no spam.invalid> writes:

Steven Schveighoffer wrote:
 On a 32-bit system, the amount of addressable memory space and the
 stack size are the factors that limit the number of threads.  For
 example, if addressable space is 2GB, and each thread has a 1MB
 stack, that's 2000 threads max (this is the typical situation for
 Windows).

Not exactly right, since the stack is organized in 4kb pages and these
pages are not initialized until they are first used they usually require
a lot less than 1mb.

LLAP,
Sascha

Aug 19 2008

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

"Sascha Katzner" wrote
 Steven Schveighoffer wrote:
 On a 32-bit system, the amount of addressable memory space and the
 stack size are the factors that limit the number of threads.  For
 example, if addressable space is 2GB, and each thread has a 1MB
 stack, that's 2000 threads max (this is the typical situation for
 Windows).

 Not exactly right, since the stack is organized in 4kb pages and these
 pages are not initialized until they are first used they usually require
 a lot less than 1mb.

Hm... I read the limit above online 
(http://blogs.msdn.com/oldnewthing/archive/2005/07/29/444912.aspx).  I think 
maybe the pages aren't initialized, but doesn't the system have to reserve 
them because it has to guarantee a 1MB stack for each?  Or maybe that guy is 
completely wrong, or the concept has been fixed :)

-Steve

Aug 19 2008

"Jb" <jb nowhere.com> writes:

"Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
news:g8emvj$1cu5$1 digitalmars.com...
 "Sascha Katzner" wrote
 Steven Schveighoffer wrote:
 On a 32-bit system, the amount of addressable memory space and the
 stack size are the factors that limit the number of threads.  For
 example, if addressable space is 2GB, and each thread has a 1MB
 stack, that's 2000 threads max (this is the typical situation for
 Windows).

 Not exactly right, since the stack is organized in 4kb pages and these
 pages are not initialized until they are first used they usually require
 a lot less than 1mb.

 Hm... I read the limit above online 
 (http://blogs.msdn.com/oldnewthing/archive/2005/07/29/444912.aspx).  I 
 think maybe the pages aren't initialized, but doesn't the system have to 
 reserve them because it has to guarantee a 1MB stack for each?  Or maybe 
 that guy is completely wrong, or the concept has been fixed :)

Read the whole page ;-)

The 1MB is just the default stack size if you pass a value of zero to the 
CreateThread function, and usualy most linkers / librarys whatever do that.

You can override this by specifying the stack size. And as he said he could 
then create 13000 rather than 2000 threads.

But the actualy mimimum size is 64k, as that is the granularity of the 
windows allocator. You can check by calling getSystemInfo, and examining the 
allocationGranularity, but I've never seen it return anything other than 
64k.

Aug 19 2008

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

"Jb" wrote
 "Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
 news:g8emvj$1cu5$1 digitalmars.com...
 "Sascha Katzner" wrote
 Steven Schveighoffer wrote:
 On a 32-bit system, the amount of addressable memory space and the
 stack size are the factors that limit the number of threads.  For
 example, if addressable space is 2GB, and each thread has a 1MB
 stack, that's 2000 threads max (this is the typical situation for
 Windows).

 Not exactly right, since the stack is organized in 4kb pages and these
 pages are not initialized until they are first used they usually require
 a lot less than 1mb.

 Hm... I read the limit above online 
 (http://blogs.msdn.com/oldnewthing/archive/2005/07/29/444912.aspx).  I 
 think maybe the pages aren't initialized, but doesn't the system have to 
 reserve them because it has to guarantee a 1MB stack for each?  Or maybe 
 that guy is completely wrong, or the concept has been fixed :)

 Read the whole page ;-)

 The 1MB is just the default stack size if you pass a value of zero to the 
 CreateThread function, and usualy most linkers / librarys whatever do 
 that.

 You can override this by specifying the stack size. And as he said he 
 could then create 13000 rather than 2000 threads.

 But the actualy mimimum size is 64k, as that is the granularity of the 
 windows allocator. You can check by calling getSystemInfo, and examining 
 the allocationGranularity, but I've never seen it return anything other 
 than 64k.


Read my whole post :)

"this is the typical situation for Windows"

i.e. the default.

-Steve

Aug 19 2008

"Jb" <jb nowhere.com> writes:

"Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
news:g8eqos$1jn5$1 digitalmars.com...
 Read my whole post :)

 "this is the typical situation for Windows"

 i.e. the default.

Ok, sorry, my bad. ;-)

Aug 19 2008

"Jb" <jb nowhere.com> writes:

"Sascha Katzner" <sorry.no spam.invalid> wrote in message 
news:g8emle$1bad$1 digitalmars.com...
 Steven Schveighoffer wrote:
 On a 32-bit system, the amount of addressable memory space and the
 stack size are the factors that limit the number of threads.  For
 example, if addressable space is 2GB, and each thread has a 1MB
 stack, that's 2000 threads max (this is the typical situation for
 Windows).

 Not exactly right, since the stack is organized in 4kb pages and these
 pages are not initialized until they are first used they usually require
 a lot less than 1mb.

The allocation granularity is actualy 64k chunks on most (at least all that 
ive ever used) windows OSes.

The 4k granulary is for the protection status.

Which i've always thought was a bit odd.

Aug 19 2008

Walter Bright <newshound1 digitalmars.com> writes:

Sascha Katzner wrote:
 Steven Schveighoffer wrote:
 On a 32-bit system, the amount of addressable memory space and the
 stack size are the factors that limit the number of threads.  For
 example, if addressable space is 2GB, and each thread has a 1MB
 stack, that's 2000 threads max (this is the typical situation for
 Windows).

 
 Not exactly right, since the stack is organized in 4kb pages and these
 pages are not initialized until they are first used they usually require
 a lot less than 1mb.

The memory isn't committed until it is used, but the address space is. 
You cannot move the stack if it exceeds its preallocated address space 
allotment.

Aug 19 2008

BCS <ao pathlink.com> writes:

Reply to Walter,

 Sascha Katzner wrote:
 
 Steven Schveighoffer wrote:
 
 On a 32-bit system, the amount of addressable memory space and the
 stack size are the factors that limit the number of threads.  For
 example, if addressable space is 2GB, and each thread has a 1MB
 stack, that's 2000 threads max (this is the typical situation for
 Windows).
 

 Not exactly right, since the stack is organized in 4kb pages and
 these pages are not initialized until they are first used they
 usually require a lot less than 1mb.
 

 The memory isn't committed until it is used, but the address space is.
 You cannot move the stack if it exceeds its preallocated address space
 allotment.
 

Must the stack be contiguous? Could you, on a stack overflow, heap allocate 
a new chunk and put in a frame that gets back to the last segment?

Aug 19 2008

Walter Bright <newshound1 digitalmars.com> writes:

BCS wrote:
 The memory isn't committed until it is used, but the address space is.
 You cannot move the stack if it exceeds its preallocated address space
 allotment.

 Must the stack be contiguous? Could you, on a stack overflow, heap 
 allocate a new chunk and put in a frame that gets back to the last segment?

No.

Aug 19 2008

Walter Bright <newshound1 digitalmars.com> writes:

Walter Bright wrote:
 BCS wrote:
 The memory isn't committed until it is used, but the address space is.
 You cannot move the stack if it exceeds its preallocated address space
 allotment.

 Must the stack be contiguous? Could you, on a stack overflow, heap 
 allocate a new chunk and put in a frame that gets back to the last 
 segment?

 
 No.

I suppose I should explain a bit more. Stack overflows are detected by 
the hardware, which means every PUSH could overflow the stack. It isn't 
practical to then try and continue the stack elsewhere, because the 
corresponding POPs would not know what was going on.

Aug 19 2008

dsimcha <dsimcha yahoo.com> writes:

Thin locks sounds pretty cool, but I do have one question.  It seems like
automatically inflating the lock the first time contention is detected is a
fairly
inefficient strategy, since the lock may be contested, but very infrequently.
Would it be possible to keep track of the frequency of contesting and, if the
lock
is contested only, say, 1% of the time, just leave it as a spinlock?

Aug 20 2008

Bartosz Milewski <bartosz relisoft.com> writes:

dsimcha wrote:
 Thin locks sounds pretty cool, but I do have one question.  It seems like
 automatically inflating the lock the first time contention is detected is a
fairly
 inefficient strategy, since the lock may be contested, but very infrequently.
 Would it be possible to keep track of the frequency of contesting and, if the
lock
 is contested only, say, 1% of the time, just leave it as a spinlock?

The paper describing thin locks (my post has a link to it) claims that 
objects essentially come in two classes--the ones that are never 
contested, and the ones that are constantly contested. In-between 
situation are relatively rare.

Aug 20 2008

Brad Roberts <braddr bellevue.puremagic.com> writes:

On Wed, 20 Aug 2008, Bartosz Milewski wrote:

 dsimcha wrote:
 Thin locks sounds pretty cool, but I do have one question.  It seems like
 automatically inflating the lock the first time contention is detected is a
 fairly
 inefficient strategy, since the lock may be contested, but very
 infrequently.
 Would it be possible to keep track of the frequency of contesting and, if
 the lock
 is contested only, say, 1% of the time, just leave it as a spinlock?

 
 The paper describing thin locks (my post has a link to it) claims that objects
 essentially come in two classes--the ones that are never contested, and the
 ones that are constantly contested. In-between situation are relatively rare.

I'll agree that inbetween is relatively rare, but just because there's 
contention, doesn't require the switch to heavy locks.  In a lot of the 
code I have written, one spin (assuming the spin utilizes rep; nop;) is 
sufficient for the 'blocked' acquire to finish it's acquire.  The use of 
heavy weight locks in that scenario is a net loss.

Later,
Brad

Aug 20 2008

Bartosz Milewski <bartosz relisoft.com> writes:

Brad Roberts wrote:
 I'll agree that inbetween is relatively rare, but just because there's 
 contention, doesn't require the switch to heavy locks.  In a lot of the 
 code I have written, one spin (assuming the spin utilizes rep; nop;) is 
 sufficient for the 'blocked' acquire to finish it's acquire.  The use of 
 heavy weight locks in that scenario is a net loss.

A thin lock is not a spin lock. It's an optimization on top of the OS 
lock specifically for the no-contention case. If there is contention, 
the OS lock, which is part of FatLock, is used. It presumably does its 
own optimizations, so calling it "heavy weight" may be unfair.

Aug 20 2008

Leandro Lucarella <llucax gmail.com> writes:

Bartosz Milewski, el 20 de agosto a las 15:14 me escribiste:
 Brad Roberts wrote:
I'll agree that inbetween is relatively rare, but just because there's 
contention, doesn't require the switch to heavy locks.  In a lot of the code I 
have written, one spin (assuming the spin utilizes rep; nop;) is sufficient for 
the 'blocked' acquire to finish it's acquire.  The use of heavy weight locks in 
that scenario is a net loss.

 
 A thin lock is not a spin lock. It's an optimization on top of the OS lock 
 specifically for the no-contention case. If there is contention, the OS lock, 
 which is part of FatLock, is used. It presumably does its own optimizations,
so 
 calling it "heavy weight" may be unfair.

I would love to hear you point on optimizing an OS lock that is already
optimized.

Is that really necessary? Do you have numbers that back you up?

Thank you.

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
aFR [afr my.farts.cause.nuclear.reaction.org] has quit IRC (Ping timeout)

Aug 22 2008

"Robert Jacques" <sandford jhu.edu> writes:

On Fri, 22 Aug 2008 14:08:28 -0400, Leandro Lucarella <llucax gmail.com>  
wrote:

 Bartosz Milewski, el 20 de agosto a las 15:14 me escribiste:
 Brad Roberts wrote:
I'll agree that inbetween is relatively rare, but just because there's
contention, doesn't require the switch to heavy locks.  In a lot of  

 the code I
have written, one spin (assuming the spin utilizes rep; nop;) is  

 sufficient for
the 'blocked' acquire to finish it's acquire.  The use of heavy weight  

 locks in
that scenario is a net loss.

 A thin lock is not a spin lock. It's an optimization on top of the OS  
 lock
 specifically for the no-contention case. If there is contention, the OS  
 lock,
 which is part of FatLock, is used. It presumably does its own  
 optimizations, so
 calling it "heavy weight" may be unfair.

 I would love to hear you point on optimizing an OS lock that is already
 optimized.

 Is that really necessary? Do you have numbers that back you up?

 Thank you.

Assuming the OS lock involves a kernel call then yes, there is a major  
performance difference. Kernel calls are the reason that GC programs beat  
programs using malloc and minimizing them is the secret behind all the  
various 'fast' malloc replacements.

Aug 22 2008

Leandro Lucarella <llucax gmail.com> writes:

Robert Jacques, el 22 de agosto a las 22:43 me escribiste:
 On Fri, 22 Aug 2008 14:08:28 -0400, Leandro Lucarella <llucax gmail.com> wrote:
 
Bartosz Milewski, el 20 de agosto a las 15:14 me escribiste:
Brad Roberts wrote:
I'll agree that inbetween is relatively rare, but just because there's
contention, doesn't require the switch to heavy locks.  In a lot of the code I
have written, one spin (assuming the spin utilizes rep; nop;) is sufficient for
the 'blocked' acquire to finish it's acquire.  The use of heavy weight locks in
that scenario is a net loss.

A thin lock is not a spin lock. It's an optimization on top of the OS lock
specifically for the no-contention case. If there is contention, the OS lock,
which is part of FatLock, is used. It presumably does its own optimizations, so
calling it "heavy weight" may be unfair.

I would love to hear you point on optimizing an OS lock that is already
optimized.

Is that really necessary? Do you have numbers that back you up?

Thank you.

 Assuming the OS lock involves a kernel call

It doesn't (at least in Linux).

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
Si pensas que el alma no se ve
el alma sí se ve en los ojos

Aug 22 2008

Bartosz Milewski <bartosz relisoft.com> writes:

Leandro Lucarella wrote:
 I would love to hear you point on optimizing an OS lock that is already
 optimized.

I think I'm going to write a separate blog post on this topic, since 
it's causing a lot of controversy. The short answer is that OS locks are 
optimized for different scenarios then thin locks.

 Is that really necessary? Do you have numbers that back you up?

The original thin lock paper has the numbers. Granted, they are for 
Java, but I don't see why they shouldn't be a ballpark for D as well.

Aug 24 2008

BCS <ao pathlink.com> writes:

Reply to Walter,

 Walter Bright wrote:
 
 BCS wrote:
 
 The memory isn't committed until it is used, but the address space
 is. You cannot move the stack if it exceeds its preallocated
 address space allotment.
 

 Must the stack be contiguous? Could you, on a stack overflow, heap
 allocate a new chunk and put in a frame that gets back to the last
 segment?
 

 No.
 

 I suppose I should explain a bit more. Stack overflows are detected by
 the hardware, which means every PUSH could overflow the stack. It
 isn't practical to then try and continue the stack elsewhere, because
 the corresponding POPs would not know what was going on.
 

That's what the extra stack frame I mentioned is for. I'll grant you can't 
split a single frame across the sections but if you cold detect an impending 
stack overflow before the function runs (by push/popping the largest block 
that the function can use [calloc aside]) you could then do the new stack 
at that point. When the function returns, it ends up in the extra frame and 
it knows how to return to the right location.

Aug 20 2008

Sean Kelly <sean invisibleduck.org> writes:

BCS wrote:
 Reply to Walter,
 
 Walter Bright wrote:

 BCS wrote:

 The memory isn't committed until it is used, but the address space
 is. You cannot move the stack if it exceeds its preallocated
 address space allotment.

 Must the stack be contiguous? Could you, on a stack overflow, heap
 allocate a new chunk and put in a frame that gets back to the last
 segment?

 No.

 I suppose I should explain a bit more. Stack overflows are detected by
 the hardware, which means every PUSH could overflow the stack. It
 isn't practical to then try and continue the stack elsewhere, because
 the corresponding POPs would not know what was going on.

 
 That's what the extra stack frame I mentioned is for. I'll grant you 
 can't split a single frame across the sections but if you cold detect an 
 impending stack overflow before the function runs (by push/popping the 
 largest block that the function can use [calloc aside]) you could then 
 do the new stack at that point. When the function returns, it ends up in 
 the extra frame and it knows how to return to the right location.

This is basically how Fibers work, though the stack transition isn't 
invisible.  Each Fiber has its own stack and execution is transferred to 
some location in this stack when the Fiber is called.


Sean

Aug 20 2008

Bartosz Milewski <bartosz relisoft.com> writes:

Aarti_pl wrote:
  From your article it seems that there will be limit of 2046 threads for 
 program (11 bits, one based).
 
 Isn't it too small number, taking into account multi core processors 
 which will emerge (able to start much more threads than currently), and 
 applications like web servers (which serves a lot of concurrent 
 connections)?
 

This is the number I've found in the D runtime and I didn't question it 
(yet!). Thin lock has a lot of bits to spare--I made the recursion count 
ridiculously huge. Count can easily be cut down to 8 bits and all the 
rest could go towards the thread index if necessary. On a 64-bit 
machine, thread index could spill into the extra 32-bits. So we are are 
well prepared for the future.

Aug 19 2008

Lars Ivar Igesund <larsivar igesund.net> writes:

Bartosz Milewski wrote:

 Aarti_pl wrote:
  From your article it seems that there will be limit of 2046 threads for
 program (11 bits, one based).
 
 Isn't it too small number, taking into account multi core processors
 which will emerge (able to start much more threads than currently), and
 applications like web servers (which serves a lot of concurrent
 connections)?
 

 
 This is the number I've found in the D runtime and I didn't question it
 (yet!). 

I believe the same restriction doesn't exist in Tango's runtime.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango

Aug 19 2008

Sean Kelly <sean invisibleduck.org> writes:

== Quote from Aarti_pl (aarti interia.pl)'s article
 Bartosz Milewski pisze:
 I posted some implementation details of Thin Lock in my blog,
 http://bartoszmilewski.wordpress.com/ . I would appreciate comments.

 And by the way, I put it on reddit too:
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .

 Bartosz

 Well, I am definitely no concurrency expert, but one questions bothers
 me a bit.
  From your article it seems that there will be limit of 2046 threads for
 program (11 bits, one based).
 Isn't it too small number, taking into account multi core processors
 which will emerge (able to start much more threads than currently), and
 applications like web servers (which serves a lot of concurrent
 connections)?

I had the same concern, however, the compact thread id is clearly necessary
for the algorithm to operate properly, so what can you do.  Perhaps transfer
some bits from the recursion counter to the thread id to increase that number.

For what it's worth, Tango doesn't use a static thread list because I think the
basic approach is flawed.  As you say, it creates a fixed upper bound for the
number of threads the app may contain, creates visible gaps in the global
thread list (visible via getAll), and causes contention for a common cache
line.  However, I think it's reasonable to create some sort of thread id anyway,
if only to allow for thin locks to operate.  I've been kicking around a few ways
to graft this functionality onto Tango without incurring any undue overhead.


Sean

Aug 19 2008

"Jb" <jb nowhere.com> writes:

"Bartosz Milewski" <bartosz relisoft.com> wrote in message 
news:g8aeg7$h2h$1 digitalmars.com...
I posted some implementation details of Thin Lock in my blog, 
http://bartoszmilewski.wordpress.com/ . I would appreciate comments.

 And by the way, I put it on reddit too: 
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .

One thing that occurs to me is this probably wont work with threads the D 
runtime doesnt know about. For example interupt call backs. For real time 
audio you typicaly send a callback funciton pointer to the soundcard driver 
and then this is called from the soundcards interupt thread.

But the D runtime thread pool doesnt know about this thread.

What about OS callbacks? Do they always come in on a thread that D knows 
about? Im not sure tbh.

And if you want real time operation, pro audio, live video, ect.. then these 
callback threads have to remain invisible to the GC, so it cant pause them.. 
but you'd still need them visible in the global thread index.. or else your 
thin lock mechanism wont work.

Then again maybe you shouldnt use the inbuilt thin locks in such 
circumstances anyway.

Aug 19 2008

"Jb" <jb nowhere.com> writes:

"Bartosz Milewski" <bartosz relisoft.com> wrote in message 
news:g8aeg7$h2h$1 digitalmars.com...
I posted some implementation details of Thin Lock in my blog, 
http://bartoszmilewski.wordpress.com/ . I would appreciate comments.


Another thought I had is that when reading it is this.

For debuging purposes the compiler could do this.

When an object is created as never shared, rather than setting the thin lock 
to zero, you could instead set it to the current threadId.

Then the compiler could atomaticy add checks into the DBC code to make sure 
that the object is only ever accessed from the thread that created it.

Aug 19 2008

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

"Bartosz Milewski" wrote
I posted some implementation details of Thin Lock in my blog, 
http://bartoszmilewski.wordpress.com/ . I would appreciate comments.

 And by the way, I put it on reddit too: 
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .

This looks a lot more promising than the original concept that Walter linked 
to :)

There is mention of casting away shared which throws an exception if the 
object isn't shared.  Is this the locking mechanism to both read and write 
data?  i.e. you have to cast away sharing to use the data, then cast it back 
to unshared?  How do you wait for an object to be available so you can 
unshare it?

What happens if you forget to uncast it?

What happens to subcomponents?  i.e. shared is supposed to be transitive, 
right?  So if you cast away shared in the top-level object, does that go and 
try unsharing all the objects referenced by it (i.e. setting all the 
currently shared bits to 0)?  If not, then how do you know that the 
subobjects are shared in the type system?  It looks like a lot of this is 
going to be runtime checks, or is there going to be another type modifier 
for 'created shared, but currently unshared', i.e. 'locked' or something?

Cool stuff, I think you are on the right track.

-Steve

Aug 19 2008

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

"Steven Schveighoffer" wrote
 "Bartosz Milewski" wrote
I posted some implementation details of Thin Lock in my blog, 
http://bartoszmilewski.wordpress.com/ . I would appreciate comments.

 And by the way, I put it on reddit too: 
 http://www.reddit.com/comments/6wmq4/thin_lock_implementation/ .

 This looks a lot more promising than the original concept that Walter 
 linked to :)

 There is mention of casting away shared which throws an exception if the 
 object isn't shared.  Is this the locking mechanism to both read and write 
 data?  i.e. you have to cast away sharing to use the data, then cast it 
 back to unshared?  How do you wait for an object to be available so you 
 can unshare it?

 What happens if you forget to uncast it?

 What happens to subcomponents?  i.e. shared is supposed to be transitive, 
 right?  So if you cast away shared in the top-level object, does that go 
 and try unsharing all the objects referenced by it (i.e. setting all the 
 currently shared bits to 0)?  If not, then how do you know that the 
 subobjects are shared in the type system?  It looks like a lot of this is 
 going to be runtime checks, or is there going to be another type modifier 
 for 'created shared, but currently unshared', i.e. 'locked' or something?

 Cool stuff, I think you are on the right track.

Reading further messages in your blog, I see that you expect sharing casts 
to be recursive (to preserve the transitive nature of it).  What if two 
shared objects have references to the same shared object?  If you 'unshare' 
one parent, the other object now is shared, but its subcomponent is not 
shared.  Is that considered ok?

-Steve

Aug 19 2008

Bartosz Milewski <bartosz relisoft.com> writes:

Steven Schveighoffer wrote:
 Reading further messages in your blog, I see that you expect sharing casts 
 to be recursive (to preserve the transitive nature of it).  What if two 
 shared objects have references to the same shared object?  If you 'unshare' 
 one parent, the other object now is shared, but its subcomponent is not 
 shared.  Is that considered ok?

No, it's definitely not ok, but unless there is a reference-counting 
mechanism in place, we can't do anything about it during casting. 
However, the sub-objects will be marked as exclusively owned by the 
casting thread, so any attempt from another thread to lock them will 
result in an exception. This is true to the spirit of casting in general.

Aug 19 2008

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

"Bartosz Milewski" wrote
 Steven Schveighoffer wrote:
 Reading further messages in your blog, I see that you expect sharing 
 casts to be recursive (to preserve the transitive nature of it).  What if 
 two shared objects have references to the same shared object?  If you 
 'unshare' one parent, the other object now is shared, but its 
 subcomponent is not shared.  Is that considered ok?

 No, it's definitely not ok, but unless there is a reference-counting 
 mechanism in place, we can't do anything about it during casting. However, 
 the sub-objects will be marked as exclusively owned by the casting thread, 
 so any attempt from another thread to lock them will result in an 
 exception. This is true to the spirit of casting in general.

Yes, it's similar to casting something to invariant when you are not sure 
that another reference doesn't exist.

However, in this case, the casting is considered to be an accepted practice 
(and probably more likely than casting something back and forth from 
invariant).  Normally, casting has big warning signs all over it, I see this 
as a commonly used feature (or should it be?).

I noticed that you have 'locked' and 'unshared' concepts as separate, the 
former using contention and the latter which is a try-only version.  Is 
there any enforcement for not using either of these?  For example, if thread 
A tries to access (without casting) a shared object when thread B has casted 
it to unshared, does it succeed?

The thing I'm trying to figure out in all this is, I can see the potential 
with the implementation you are coming up with for some really cool locking 
features, yet the 'shared/unshared' plan also has to do with lock-free 
programming (with fences and stuff, I'm not very familiar with these 
things).  It looks to me like you can do one or the other, but not both? 
How do you know which mode you are using a shared variable in?

-Steve

Aug 19 2008

Sean Kelly <sean invisibleduck.org> writes:

Steven Schveighoffer wrote:
 "Bartosz Milewski" wrote
 Steven Schveighoffer wrote:
 Reading further messages in your blog, I see that you expect sharing 
 casts to be recursive (to preserve the transitive nature of it).  What if 
 two shared objects have references to the same shared object?  If you 
 'unshare' one parent, the other object now is shared, but its 
 subcomponent is not shared.  Is that considered ok?

 No, it's definitely not ok, but unless there is a reference-counting 
 mechanism in place, we can't do anything about it during casting. However, 
 the sub-objects will be marked as exclusively owned by the casting thread, 
 so any attempt from another thread to lock them will result in an 
 exception. This is true to the spirit of casting in general.

 
 Yes, it's similar to casting something to invariant when you are not sure 
 that another reference doesn't exist.
 
 However, in this case, the casting is considered to be an accepted practice 
 (and probably more likely than casting something back and forth from 
 invariant).  Normally, casting has big warning signs all over it, I see this 
 as a commonly used feature (or should it be?).
 
 I noticed that you have 'locked' and 'unshared' concepts as separate, the 
 former using contention and the latter which is a try-only version.  Is 
 there any enforcement for not using either of these?  For example, if thread 
 A tries to access (without casting) a shared object when thread B has casted 
 it to unshared, does it succeed?
 
 The thing I'm trying to figure out in all this is, I can see the potential 
 with the implementation you are coming up with for some really cool locking 
 features, yet the 'shared/unshared' plan also has to do with lock-free 
 programming (with fences and stuff, I'm not very familiar with these 
 things).  It looks to me like you can do one or the other, but not both? 
 How do you know which mode you are using a shared variable in?

I think the implication that casting an object to unshared actually 
mutates the "shared" bit within the thin lock bitset.  Something like this:

     shared MyClass c;

     void execThreadA()
     {
         MyClass x = makeLocal( c );
         // 'shared' bit in instance of x is now 0
     }

     T makeLocal(T)( inout T val )
     {
         // Assume that the state of MyClass is appropriately visible to
         // the calling thread because this is a shared object
         shared T tmp = val;
         val = null;
         return cast(unshared) tmp;
     }

If the "val = null" assignment is forgotten in makeLocal then an attempt 
to synchronize on c will throw a SharingException (or whatever it's called).

The benefit of being able to cast away the shared label is clearly for 
performance... if you know that you're the sole owner of a shared object 
then you don't need to lock for otherwise synchronized accesses (so 
long), much like assertUnique for const/invariant data.



Sean

Aug 19 2008

Bartosz Milewski <bartosz relisoft.com> writes:

Sean Kelly wrote:
 If the "val = null" assignment is forgotten in makeLocal then an attempt 
 to synchronize on c will throw a SharingException (or whatever it's 
 called).

I would like to make the nulling of the original reference part of the 
cast. In other words, makeLocal would take its argument by non-const 
reference and null it. That still doesn't solve the problem of multiple 
aliases, hence the need for SharingException.

 The benefit of being able to cast away the shared label is clearly for 
 performance... if you know that you're the sole owner of a shared object 
 then you don't need to lock for otherwise synchronized accesses (so 
 long), much like assertUnique for const/invariant data.

There is one complication though--the type system. A function that takes 
a non-shared argument can't be called with a shared object. But it's 
true that a sharing cast is an optimization. The safe way to do it is to 
make a thread-local deep copy of the object and pass that to a library 
function. (Object cloning is a hot topic in D.)

Aug 19 2008

Sean Kelly <sean invisibleduck.org> writes:

== Quote from Bartosz Milewski (bartosz relisoft.com)'s article
 Sean Kelly wrote:
 The benefit of being able to cast away the shared label is clearly for
 performance... if you know that you're the sole owner of a shared object
 then you don't need to lock for otherwise synchronized accesses (so
 long), much like assertUnique for const/invariant data.

 There is one complication though--the type system. A function that takes
 a non-shared argument can't be called with a shared object. But it's
 true that a sharing cast is an optimization. The safe way to do it is to
 make a thread-local deep copy of the object and pass that to a library
 function. (Object cloning is a hot topic in D.)

Yeah, that's the nasty thing about atomics--they provide yet another reason
to duplicate function bodies.  Sometimes I wonder if we won't eventually end
up writing only template code simply out of necessity.

I've had the same thought about cloning, though.  Once 'shared' is in place
we'll need to perform deep copies to transfer local data between threads,
and I would expect such passing to be far more common than the use of
shared data.  So we'll need a reliable and efficient means of cloning.  But
then we'll need the same thing for STM, so I guess we'll be killing two
birds with one stone :-)


Sean

Aug 19 2008

Bartosz Milewski <bartosz relisoft.com> writes:

Sean Kelly wrote:
 So we'll need a reliable and efficient means of cloning.  But
 then we'll need the same thing for STM, so I guess we'll be killing two
 birds with one stone :-)

Exactly!

Aug 19 2008

D Programming

C/C++ Programming

Other

digitalmars.D - Thin Lock Implementation