digitalmars.D.bugs - [Issue 9163] New: std.parallelism broken with extensive optimizations (gdc)

d-bugmail puremagic.com (47/47) Dec 16 2012 http://d.puremagic.com/issues/show_bug.cgi?id=9163

d-bugmail puremagic.com (12/12) Dec 16 2012 http://d.puremagic.com/issues/show_bug.cgi?id=9163
d-bugmail puremagic.com (24/24) Dec 16 2012 http://d.puremagic.com/issues/show_bug.cgi?id=9163
d-bugmail puremagic.com (26/26) Dec 16 2012 http://d.puremagic.com/issues/show_bug.cgi?id=9163
d-bugmail puremagic.com (14/16) Dec 16 2012 http://d.puremagic.com/issues/show_bug.cgi?id=9163
d-bugmail puremagic.com (9/22) Dec 16 2012 http://d.puremagic.com/issues/show_bug.cgi?id=9163
d-bugmail puremagic.com (14/16) Dec 16 2012 http://d.puremagic.com/issues/show_bug.cgi?id=9163
d-bugmail puremagic.com (9/9) Dec 16 2012 http://d.puremagic.com/issues/show_bug.cgi?id=9163
d-bugmail puremagic.com (8/16) Dec 16 2012 http://d.puremagic.com/issues/show_bug.cgi?id=9163
d-bugmail puremagic.com (7/11) Dec 16 2012 http://d.puremagic.com/issues/show_bug.cgi?id=9163
d-bugmail puremagic.com (40/44) Dec 16 2012 http://d.puremagic.com/issues/show_bug.cgi?id=9163
d-bugmail puremagic.com (8/57) Dec 16 2012 http://d.puremagic.com/issues/show_bug.cgi?id=9163
d-bugmail puremagic.com (8/8) Dec 16 2012 http://d.puremagic.com/issues/show_bug.cgi?id=9163
d-bugmail puremagic.com (12/12) Jan 03 2013 http://d.puremagic.com/issues/show_bug.cgi?id=9163
d-bugmail puremagic.com (15/15) Jan 06 2013 http://d.puremagic.com/issues/show_bug.cgi?id=9163

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163

           Summary: std.parallelism broken with extensive optimizations
                    (gdc)
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: critical
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: johannespfau gmail.com



PST ---
Disclaimer: Although the title contains "gdc" this is not a gdc bug as far as I
can see. DMDs optimizer is probably not good enough to reproduce this bug
though. I guess ldc will also be affected.


Artem Tarasov recently filed a big on the gdc bugtracker about std.parallelism
randomly failing: http://gdcproject.org/bugzilla/show_bug.cgi?id=16

jerro added an example which always fails for him and added some comments:
----------------------------
import std.stdio, std.parallelism;

void main()
{
    writeln(taskPool.reduce!"a+b"([0, 1, 2, 3]));
}
----------------------------
"The reason for this is that gdc generates code that doesn't call
atomicReadUbyte(status) on every iteration of the loop in
std.parallelism.TaskPool.workLoop as the source code does, but calls it once
before the loop begins and then uses a stored value inside the loop."

And as far as I can see this is std.parallelisms fault (or you could also blame
it on the language for not having "volatile"):

workLoop repeatedly calls atomicReadUbyte(status). status is a member of
TaskPool with type ubyte. It's not marked as shared or volatile, but it's
_used_ as a shared variable. atomicReadUbyte also doesn't mark its parameter as
shared/volatile.

The GCC backend only sees a repeated access to a thread-local variable. It
can't know that the variable can be changed by different threads and obviously
optimizes the repeated function call away. It's the textbook example of
volatile.

The solution is probably not that obvious. There have been discussions whether
shared/__gshared should imply volatile, but as long as there is no official
statement in this regard, there's not much we can do.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Dec 16 2012

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163


Walter Bright <bugzilla digitalmars.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugzilla digitalmars.com



02:17:00 PST ---
shared should imply volatile, in that reordering of shared reads and writes
must not be done by the compiler.

This is not the case for __gshared.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Dec 16 2012

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163


Iain Buclaw <ibuclaw ubuntu.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ibuclaw ubuntu.com



The read/writes just need to be safeguarded by synchronized to prevent any
memoization occuring.


private void atomicSetUbyte(ref ubyte stuff, ubyte newVal)
{
  synchronized atomicStore(*(cast(shared) &stuff), newVal);
}

private ubyte atomicReadUbyte(ref ubyte val)
{
  synchronized return atomicLoad(*(cast(shared) &val));
}

private bool atomicCasUbyte(ref ubyte stuff, ubyte testVal, ubyte newVal)
{
  synchronized return core.atomic.cas(cast(shared) &stuff, testVal, newVal);
}



I don't think the use of volatile statements/variables may help here...

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Dec 16 2012

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163


David Nadlinger <code klickverbot.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |code klickverbot.at



PST ---
I'm not sure that this isn't just a GDC bug. Why should status necessarily be
thread-local?

At least in LDC, the parameter to atomicReadUbyte is just an i8 pointer. There
is no reason why it should require any special qualification as
volatile/shared/etc. in order for the compiler to handle it correctly for
codegen _inside_ the function, as its target is simply not known. We will just
emit an LLVM atomic load instruction from that address.

But I'm also not quite sure why the calling code should permit hoisting the
call to that function (resp. the atomic load, after inlining) out of the loop.
You say that "the GCC backend only sees a repeated access to a thread-local
variable" – why should status necessarily be a thread-local variable?
Thread-localness of a class instance (TaskPool in this case) cannot be inferred
from the lack of a shared qualifier on its type in D, as there is __gshared.

As for the general requirements of shared, I'd rather wait until we have a
finished design including an at least somewhat precise spec for that (which is
something of crucial importance when it comes to semantics) before spending
lots of time trying to guess how exactly the requirements will be.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Dec 16 2012

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163




PST ---

 The read/writes just need to be safeguarded by synchronized to prevent any
 memoization occuring.

Okay, I admit that this is all just hand-waiving until we finally get at �D
memory model� spec together (maybe we should just start a wiki page to collect
ideas), but:

Why would you think that scalar promotion for the (inlined) function call is
legal in the first place? To be able to do that, you would have to prove that
the call to atomicReadUbyte/atomicLoad depends only on thread local state and
has no global side effects, and I can't see how atomic operations would fit any
sensible definition of that terms.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Dec 16 2012

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163






 The read/writes just need to be safeguarded by synchronized to prevent any
 memoization occuring.

 
 Okay, I admit that this is all just hand-waiving until we finally get at �D
 memory model� spec together (maybe we should just start a wiki page to collect
 ideas), but:
 
 Why would you think that scalar promotion for the (inlined) function call is
 legal in the first place? To be able to do that, you would have to prove that
 the call to atomicReadUbyte/atomicLoad depends only on thread local state and
 has no global side effects, and I can't see how atomic operations would fit any
 sensible definition of that terms.

Could be something trivial such as the backend somehow deduces that the
function (atomicLoad) is pure.  I'll have a poke round and see what I can did
up.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Dec 16 2012

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163


jerro.public gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jerro.public gmail.com



 Could be something trivial such as the backend somehow deduces that the
 function (atomicLoad) is pure.

I don't know anything about the gcc backend, but it seems to me that it would
have to deduce that to be able to move a call out of the loop without breaking
code. I also noticed that adding an empty gcc asm block to atomicLoad results
in correct code being generated for workLoop. I guess that supports that
theory.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Dec 16 2012

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163




PST ---
Ah, GDC still implements atomic operations as normal D assignments combined
with a barrier intrinsic, right? If this is really the cause, I guess this
would be a good opportunity to switch to the new atomic builtins introduced
with C++11 support – they should lead to much better code anyway.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Dec 16 2012

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163





 Could be something trivial such as the backend somehow deduces that the
 function (atomicLoad) is pure.

 
 I don't know anything about the gcc backend, but it seems to me that it would
 have to deduce that to be able to move a call out of the loop without breaking
 code. I also noticed that adding an empty gcc asm block to atomicLoad results
 in correct code being generated for workLoop. I guess that supports that
 theory.

Probably because asm statements are volatile, and is a black hole that the
compiler does not touch.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Dec 16 2012

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163





 Ah, GDC still implements atomic operations as normal D assignments combined
 with a barrier intrinsic, right? If this is really the cause, I guess this
 would be a good opportunity to switch to the new atomic builtins introduced
 with C++11 support – they should lead to much better code anyway.

Indeed, and the C++11 support is better mapped to D's atomic calls.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Dec 16 2012

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163




PST ---

 I'm not sure that this isn't just a GDC bug. Why should status necessarily be
 thread-local?

Maybe my conclusions were premature, but according to TDPL chapter 13.12.1 page
414: "The compiler optimizes code using non-shared data to the maximum, in full
confidence that no other thread can ever access it, and only tiptoes around
shared data".

The status field is not TLS data as it's not a static field, but it's instance
data and to my understanding if the class instance isn't marked as shared and
the field isn't shared the compiler is allowed to assume that only one thread
can access that data.

And if the compiler assumes status can only be accessed from one thread and
"atomicReadUbyte" is inferred as pure, that would lead to this wrong behavior.

You've got a point that this can't work for __gshared, so I'm not really sure
what to do about this.

So is it always invalid to optimize this code in D?
---------------------------------------
pure bool checkFlag(const ref bool a)
{

}

class A
{
    bool flag;

    void test()
    {
        while(true)
        {
            if(checkFlag(flag)) //Can this check be moved out of the loop
                return;
        }
    }
}
--------------------------------------


 
 I don't think the use of volatile statements/variables may help here...

wouldn't status as a volatile field signal to the compiler that it's value
might change between calls to "atomicReadUbyte" and therefore prevent this
optimization?

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Dec 16 2012

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163




---


 I'm not sure that this isn't just a GDC bug. Why should status necessarily be
 thread-local?

 
 Maybe my conclusions were premature, but according to TDPL chapter 13.12.1 page
 414: "The compiler optimizes code using non-shared data to the maximum, in full
 confidence that no other thread can ever access it, and only tiptoes around
 shared data".
 
 The status field is not TLS data as it's not a static field, but it's instance
 data and to my understanding if the class instance isn't marked as shared and
 the field isn't shared the compiler is allowed to assume that only one thread
 can access that data.
 
 And if the compiler assumes status can only be accessed from one thread and
 "atomicReadUbyte" is inferred as pure, that would lead to this wrong behavior.
 
 You've got a point that this can't work for __gshared, so I'm not really sure
 what to do about this.
 
 So is it always invalid to optimize this code in D?
 ---------------------------------------
 pure bool checkFlag(const ref bool a)
 {
 
 }
 
 class A
 {
     bool flag;
 
     void test()
     {
         while(true)
         {
             if(checkFlag(flag)) //Can this check be moved out of the loop
                 return;
         }
     }
 }
 --------------------------------------
 

 
 I don't think the use of volatile statements/variables may help here...

 
 wouldn't status as a volatile field signal to the compiler that it's value
 might change between calls to "atomicReadUbyte" and therefore prevent this
 optimization?

atomicReadUbyte isn't pure, but it is inlined.  Leaving atomicLoad as the
culprit.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Dec 16 2012

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163




PST ---
Just as a general note, because not everybody might be aware of this: GCC has
its own "pure" attribute with semantics different from its namesake in the D
langauge.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Dec 16 2012

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163


Iain Buclaw <ibuclaw ubuntu.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED



---
Can no longer reproduce this in gdc after some changes. The dmd compiler may
not do everything right, but then again sometimes neither does gdc. :o)

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Jan 03 2013

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=9163


David Simcha <dsimcha yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dsimcha yahoo.com



---
Sorry, I've been incredibly busy lately and just saw this post now.  IIUC the
**whole point** of atomic operations in core.atomic is that they're supposed to
act as fences.  No code motion should take place across ASM statements at the
compiler level, and none should take place across lock; instructions at the
lower levels.  Obviously either my understanding of the spec is wrong or GDC
isn't optimizing them properly if that's the case.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Jan 06 2013

D Programming

C/C++ Programming

Other

digitalmars.D.bugs - [Issue 9163] New: std.parallelism broken with extensive optimizations (gdc)