www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - [Issue 9163] New: std.parallelism broken with extensive optimizations (gdc)

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163

           Summary: std.parallelism broken with extensive optimizations
                    (gdc)
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: critical
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: johannespfau gmail.com


--- Comment #0 from Johannes Pfau <johannespfau gmail.com> 2012-12-16 01:23:23
PST ---
Disclaimer: Although the title contains "gdc" this is not a gdc bug as far as I
can see. DMDs optimizer is probably not good enough to reproduce this bug
though. I guess ldc will also be affected.


Artem Tarasov recently filed a big on the gdc bugtracker about std.parallelism
randomly failing: http://gdcproject.org/bugzilla/show_bug.cgi?id=16

jerro added an example which always fails for him and added some comments:
----------------------------
import std.stdio, std.parallelism;

void main()
{
    writeln(taskPool.reduce!"a+b"([0, 1, 2, 3]));
}
----------------------------
"The reason for this is that gdc generates code that doesn't call
atomicReadUbyte(status) on every iteration of the loop in
std.parallelism.TaskPool.workLoop as the source code does, but calls it once
before the loop begins and then uses a stored value inside the loop."

And as far as I can see this is std.parallelisms fault (or you could also blame
it on the language for not having "volatile"):

workLoop repeatedly calls atomicReadUbyte(status). status is a member of
TaskPool with type ubyte. It's not marked as shared or volatile, but it's
_used_ as a shared variable. atomicReadUbyte also doesn't mark its parameter as
shared/volatile.

The GCC backend only sees a repeated access to a thread-local variable. It
can't know that the variable can be changed by different threads and obviously
optimizes the repeated function call away. It's the textbook example of
volatile.

The solution is probably not that obvious. There have been discussions whether
shared/__gshared should imply volatile, but as long as there is no official
statement in this regard, there's not much we can do.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 16 2012
next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163


Walter Bright <bugzilla digitalmars.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugzilla digitalmars.com


--- Comment #1 from Walter Bright <bugzilla digitalmars.com> 2012-12-16
02:17:00 PST ---
shared should imply volatile, in that reordering of shared reads and writes
must not be done by the compiler.

This is not the case for __gshared.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 16 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163


Iain Buclaw <ibuclaw ubuntu.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ibuclaw ubuntu.com


--- Comment #2 from Iain Buclaw <ibuclaw ubuntu.com> 2012-12-16 02:36:57 PST ---
The read/writes just need to be safeguarded by synchronized to prevent any
memoization occuring.


private void atomicSetUbyte(ref ubyte stuff, ubyte newVal)
{
  synchronized atomicStore(*(cast(shared) &stuff), newVal);
}

private ubyte atomicReadUbyte(ref ubyte val)
{
  synchronized return atomicLoad(*(cast(shared) &val));
}

private bool atomicCasUbyte(ref ubyte stuff, ubyte testVal, ubyte newVal)
{
  synchronized return core.atomic.cas(cast(shared) &stuff, testVal, newVal);
}



I don't think the use of volatile statements/variables may help here...

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 16 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163


David Nadlinger <code klickverbot.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |code klickverbot.at


--- Comment #3 from David Nadlinger <code klickverbot.at> 2012-12-16 02:49:36
PST ---
I'm not sure that this isn't just a GDC bug. Why should status necessarily be
thread-local?

At least in LDC, the parameter to atomicReadUbyte is just an i8 pointer. There
is no reason why it should require any special qualification as
volatile/shared/etc. in order for the compiler to handle it correctly for
codegen _inside_ the function, as its target is simply not known. We will just
emit an LLVM atomic load instruction from that address.

But I'm also not quite sure why the calling code should permit hoisting the
call to that function (resp. the atomic load, after inlining) out of the loop.
You say that "the GCC backend only sees a repeated access to a thread-local
variable" – why should status necessarily be a thread-local variable?
Thread-localness of a class instance (TaskPool in this case) cannot be inferred
from the lack of a shared qualifier on its type in D, as there is __gshared.

As for the general requirements of shared, I'd rather wait until we have a
finished design including an at least somewhat precise spec for that (which is
something of crucial importance when it comes to semantics) before spending
lots of time trying to guess how exactly the requirements will be.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 16 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163



--- Comment #4 from David Nadlinger <code klickverbot.at> 2012-12-16 02:56:04
PST ---
(In reply to comment #2)
 The read/writes just need to be safeguarded by synchronized to prevent any
 memoization occuring.

Okay, I admit that this is all just hand-waiving until we finally get at »D memory model« spec together (maybe we should just start a wiki page to collect ideas), but: Why would you think that scalar promotion for the (inlined) function call is legal in the first place? To be able to do that, you would have to prove that the call to atomicReadUbyte/atomicLoad depends only on thread local state and has no global side effects, and I can't see how atomic operations would fit any sensible definition of that terms. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Dec 16 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163



--- Comment #5 from Iain Buclaw <ibuclaw ubuntu.com> 2012-12-16 03:59:53 PST ---
(In reply to comment #4)
 (In reply to comment #2)
 The read/writes just need to be safeguarded by synchronized to prevent any
 memoization occuring.

Okay, I admit that this is all just hand-waiving until we finally get at »D memory model« spec together (maybe we should just start a wiki page to collect ideas), but: Why would you think that scalar promotion for the (inlined) function call is legal in the first place? To be able to do that, you would have to prove that the call to atomicReadUbyte/atomicLoad depends only on thread local state and has no global side effects, and I can't see how atomic operations would fit any sensible definition of that terms.

Could be something trivial such as the backend somehow deduces that the function (atomicLoad) is pure. I'll have a poke round and see what I can did up. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Dec 16 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163


jerro.public gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jerro.public gmail.com


--- Comment #6 from jerro.public gmail.com 2012-12-16 07:40:04 PST ---
 Could be something trivial such as the backend somehow deduces that the
 function (atomicLoad) is pure.

I don't know anything about the gcc backend, but it seems to me that it would have to deduce that to be able to move a call out of the loop without breaking code. I also noticed that adding an empty gcc asm block to atomicLoad results in correct code being generated for workLoop. I guess that supports that theory. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Dec 16 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163



--- Comment #7 from David Nadlinger <code klickverbot.at> 2012-12-16 08:02:45
PST ---
Ah, GDC still implements atomic operations as normal D assignments combined
with a barrier intrinsic, right? If this is really the cause, I guess this
would be a good opportunity to switch to the new atomic builtins introduced
with C++11 support – they should lead to much better code anyway.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 16 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163



--- Comment #8 from Iain Buclaw <ibuclaw ubuntu.com> 2012-12-16 08:15:03 PST ---
(In reply to comment #6)
 Could be something trivial such as the backend somehow deduces that the
 function (atomicLoad) is pure.

I don't know anything about the gcc backend, but it seems to me that it would have to deduce that to be able to move a call out of the loop without breaking code. I also noticed that adding an empty gcc asm block to atomicLoad results in correct code being generated for workLoop. I guess that supports that theory.

Probably because asm statements are volatile, and is a black hole that the compiler does not touch. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Dec 16 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163



--- Comment #9 from Iain Buclaw <ibuclaw ubuntu.com> 2012-12-16 08:15:58 PST ---
(In reply to comment #7)
 Ah, GDC still implements atomic operations as normal D assignments combined
 with a barrier intrinsic, right? If this is really the cause, I guess this
 would be a good opportunity to switch to the new atomic builtins introduced
 with C++11 support – they should lead to much better code anyway.

Indeed, and the C++11 support is better mapped to D's atomic calls. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Dec 16 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163



--- Comment #10 from Johannes Pfau <johannespfau gmail.com> 2012-12-16 08:24:13
PST ---
(In reply to comment #3)
 I'm not sure that this isn't just a GDC bug. Why should status necessarily be
 thread-local?

Maybe my conclusions were premature, but according to TDPL chapter 13.12.1 page 414: "The compiler optimizes code using non-shared data to the maximum, in full confidence that no other thread can ever access it, and only tiptoes around shared data". The status field is not TLS data as it's not a static field, but it's instance data and to my understanding if the class instance isn't marked as shared and the field isn't shared the compiler is allowed to assume that only one thread can access that data. And if the compiler assumes status can only be accessed from one thread and "atomicReadUbyte" is inferred as pure, that would lead to this wrong behavior. You've got a point that this can't work for __gshared, so I'm not really sure what to do about this. So is it always invalid to optimize this code in D? --------------------------------------- pure bool checkFlag(const ref bool a) { } class A { bool flag; void test() { while(true) { if(checkFlag(flag)) //Can this check be moved out of the loop return; } } } -------------------------------------- (In reply to comment #2)
 
 I don't think the use of volatile statements/variables may help here...

wouldn't status as a volatile field signal to the compiler that it's value might change between calls to "atomicReadUbyte" and therefore prevent this optimization? -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Dec 16 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163



--- Comment #11 from Iain Buclaw <ibuclaw ubuntu.com> 2012-12-16 08:27:35 PST
---
(In reply to comment #10)
 (In reply to comment #3)
 I'm not sure that this isn't just a GDC bug. Why should status necessarily be
 thread-local?

Maybe my conclusions were premature, but according to TDPL chapter 13.12.1 page 414: "The compiler optimizes code using non-shared data to the maximum, in full confidence that no other thread can ever access it, and only tiptoes around shared data". The status field is not TLS data as it's not a static field, but it's instance data and to my understanding if the class instance isn't marked as shared and the field isn't shared the compiler is allowed to assume that only one thread can access that data. And if the compiler assumes status can only be accessed from one thread and "atomicReadUbyte" is inferred as pure, that would lead to this wrong behavior. You've got a point that this can't work for __gshared, so I'm not really sure what to do about this. So is it always invalid to optimize this code in D? --------------------------------------- pure bool checkFlag(const ref bool a) { } class A { bool flag; void test() { while(true) { if(checkFlag(flag)) //Can this check be moved out of the loop return; } } } -------------------------------------- (In reply to comment #2)
 
 I don't think the use of volatile statements/variables may help here...

wouldn't status as a volatile field signal to the compiler that it's value might change between calls to "atomicReadUbyte" and therefore prevent this optimization?

atomicReadUbyte isn't pure, but it is inlined. Leaving atomicLoad as the culprit. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Dec 16 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163



--- Comment #12 from David Nadlinger <code klickverbot.at> 2012-12-16 08:29:56
PST ---
Just as a general note, because not everybody might be aware of this: GCC has
its own "pure" attribute with semantics different from its namesake in the D
langauge.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 16 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163


Iain Buclaw <ibuclaw ubuntu.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


--- Comment #13 from Iain Buclaw <ibuclaw ubuntu.com> 2013-01-03 13:45:09 PST
---
Can no longer reproduce this in gdc after some changes. The dmd compiler may
not do everything right, but then again sometimes neither does gdc. :o)

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jan 03 2013
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=9163


David Simcha <dsimcha yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dsimcha yahoo.com


--- Comment #14 from David Simcha <dsimcha yahoo.com> 2013-01-06 05:56:19 PST
---
Sorry, I've been incredibly busy lately and just saw this post now.  IIUC the
**whole point** of atomic operations in core.atomic is that they're supposed to
act as fences.  No code motion should take place across ASM statements at the
compiler level, and none should take place across lock; instructions at the
lower levels.  Obviously either my understanding of the spec is wrong or GDC
isn't optimizing them properly if that's the case.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jan 06 2013