digitalmars.D.bugs - [Issue 9163] New: std.parallelism broken with extensive optimizations (gdc)
http://d.puremagic.com/issues/show_bug.cgi?id=9163

           Summary: std.parallelism broken with extensive optimizations
                    (gdc)
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: critical
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: johannespfau gmail.com

Disclaimer: Although the title contains "gdc", this is not a gdc bug as far
as I can see. DMD's optimizer is probably not good enough to reproduce this
bug, though. I guess ldc will also be affected.

Artem Tarasov recently filed a bug on the gdc bugtracker about
std.parallelism randomly failing:
http://gdcproject.org/bugzilla/show_bug.cgi?id=16

jerro added an example which always fails for him and added some comments:
----------------------------
import std.stdio, std.parallelism;

void main()
{
    writeln(taskPool.reduce!"a+b"([0, 1, 2, 3]));
}
----------------------------
"The reason for this is that gdc generates code that doesn't call
atomicReadUbyte(status) on every iteration of the loop in
std.parallelism.TaskPool.workLoop as the source code does, but calls it once
before the loop begins and then uses a stored value inside the loop."

And as far as I can see this is std.parallelism's fault (or you could also
blame it on the language for not having "volatile"):

workLoop repeatedly calls atomicReadUbyte(status). status is a member of
TaskPool with type ubyte. It's not marked as shared or volatile, but it's
_used_ as a shared variable. atomicReadUbyte also doesn't mark its parameter
as shared/volatile. The GCC backend only sees a repeated access to a
thread-local variable. It can't know that the variable can be changed by
different threads and obviously optimizes the repeated function call away.
It's the textbook example of volatile.

The solution is probably not that obvious. There have been discussions about
whether shared/__gshared should imply volatile, but as long as there is no
official statement in this regard, there's not much we can do.
Dec 16 2012
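To make the failure mode concrete, here is a minimal sketch of the polling
pattern described above. It is not the actual Phobos code; the class name
Pool, both field names, and the two poll functions are invented for
illustration. It contrasts a read of a plain field, which an optimizer may
legally hoist out of the loop, with an atomic read of a shared field, which
it may not:
---------------------------------------
import core.atomic;

class Pool
{
    ubyte status;          // plain, unqualified field: the backend may
                           // assume no other thread modifies it
    shared ubyte sstatus;  // shared field: read via atomicLoad below

    void pollPlain()
    {
        // Roughly the shape of the miscompiled workLoop after inlining:
        // nothing tells the optimizer the value can change, so the read
        // may legally be hoisted out of the loop.
        while (status == 0) { /* wait for another thread */ }
    }

    void pollAtomic()
    {
        // The atomic load on a shared lvalue must be performed on every
        // iteration; its result cannot be cached across iterations.
        while (atomicLoad(sstatus) == 0) { /* wait */ }
    }
}
---------------------------------------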
http://d.puremagic.com/issues/show_bug.cgi?id=9163

Walter Bright <bugzilla digitalmars.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugzilla digitalmars.com

shared should imply volatile, in that reordering of shared reads and writes
must not be done by the compiler. This is not the case for __gshared.
Dec 16 2012
http://d.puremagic.com/issues/show_bug.cgi?id=9163

Iain Buclaw <ibuclaw ubuntu.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ibuclaw ubuntu.com

The reads/writes just need to be safeguarded by synchronized to prevent any
memoization occurring.

private void atomicSetUbyte(ref ubyte stuff, ubyte newVal)
{
    synchronized atomicStore(*(cast(shared) &stuff), newVal);
}

private ubyte atomicReadUbyte(ref ubyte val)
{
    synchronized return atomicLoad(*(cast(shared) &val));
}

private bool atomicCasUbyte(ref ubyte stuff, ubyte testVal, ubyte newVal)
{
    synchronized return core.atomic.cas(cast(shared) &stuff, testVal, newVal);
}

I don't think the use of volatile statements/variables would help here...
Dec 16 2012
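For context, a sketch of how such a wrapper would sit in a workLoop-style
polling loop. The caller below (waitFor and its parameters are invented
names, not std.parallelism code) reads a plain ubyte through the synchronized
wrapper on every iteration; the synchronized block is what the optimizer
would have to treat as a barrier instead of reusing a previously loaded
value:
---------------------------------------
import core.atomic;

// The proposed wrapper from the comment above, reproduced for context.
private ubyte atomicReadUbyte(ref ubyte val)
{
    synchronized return atomicLoad(*(cast(shared) &val));
}

// Hypothetical caller, loosely modelled on TaskPool.workLoop: `status` is a
// plain ubyte, exactly as in std.parallelism, and some other thread is
// expected to change it eventually.
void waitFor(ref ubyte status, ubyte wanted)
{
    while (atomicReadUbyte(status) != wanted)
    {
        // every iteration re-enters the synchronized block, which the
        // optimizer must treat as a barrier rather than cache the value
    }
}
---------------------------------------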
http://d.puremagic.com/issues/show_bug.cgi?id=9163

David Nadlinger <code klickverbot.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |code klickverbot.at

I'm not sure that this isn't just a GDC bug. Why should status necessarily be
thread-local?

At least in LDC, the parameter to atomicReadUbyte is just an i8 pointer.
There is no reason why it should require any special qualification as
volatile/shared/etc. in order for the compiler to handle it correctly for
codegen _inside_ the function, as its target is simply not known. We will
just emit an LLVM atomic load instruction from that address.

But I'm also not quite sure why the calling code should permit hoisting the
call to that function (resp. the atomic load, after inlining) out of the
loop. You say that "the GCC backend only sees a repeated access to a
thread-local variable" – why should status necessarily be a thread-local
variable? Thread-localness of a class instance (TaskPool in this case) cannot
be inferred from the lack of a shared qualifier on its type in D, as there is
__gshared.

As for the general requirements of shared, I'd rather wait until we have a
finished design, including an at least somewhat precise spec (which is
something of crucial importance when it comes to semantics), before spending
lots of time trying to guess what exactly the requirements will be.
Dec 16 2012
http://d.puremagic.com/issues/show_bug.cgi?id=9163

> The reads/writes just need to be safeguarded by synchronized to prevent
> any memoization occurring.

Okay, I admit that this is all just hand-waving until we finally get a
»D memory model« spec together (maybe we should just start a wiki page to
collect ideas), but: Why would you think that scalar promotion for the
(inlined) function call is legal in the first place? To be able to do that,
you would have to prove that the call to atomicReadUbyte/atomicLoad depends
only on thread-local state and has no global side effects, and I can't see
how atomic operations would fit any sensible definition of those terms.
Dec 16 2012
http://d.puremagic.com/issues/show_bug.cgi?id=9163

>> The reads/writes just need to be safeguarded by synchronized to prevent
>> any memoization occurring.
>
> Okay, I admit that this is all just hand-waving until we finally get a
> »D memory model« spec together (maybe we should just start a wiki page to
> collect ideas), but: Why would you think that scalar promotion for the
> (inlined) function call is legal in the first place? To be able to do that,
> you would have to prove that the call to atomicReadUbyte/atomicLoad depends
> only on thread-local state and has no global side effects, and I can't see
> how atomic operations would fit any sensible definition of those terms.

Could be something trivial such as the backend somehow deducing that the
function (atomicLoad) is pure. I'll have a poke round and see what I can dig
up.
Dec 16 2012
http://d.puremagic.com/issues/show_bug.cgi?id=9163

jerro.public gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jerro.public gmail.com

> Could be something trivial such as the backend somehow deducing that the
> function (atomicLoad) is pure.

I don't know anything about the gcc backend, but it seems to me that it would
have to deduce that to be able to move a call out of the loop without
breaking code. I also noticed that adding an empty gcc asm block to
atomicLoad results in correct code being generated for workLoop. I guess that
supports that theory.
Dec 16 2012
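For reference, the kind of change jerro describes would look roughly like the
sketch below. This is a hypothetical illustration, not the actual core.atomic
source, and it assumes GDC's GCC-style extended asm statement (plain DMD does
not accept this syntax, and the exact form accepted by the 2012 compiler may
differ):
---------------------------------------
// Hypothetical sketch: an asm statement with no instructions but a "memory"
// clobber acts as a pure compiler barrier, so a previously loaded value of
// `val` cannot be reused across it.
ubyte readWithBarrier(ref ubyte val)
{
    asm { "" : : : "memory"; }   // emits no code, but blocks code motion
    return val;
}
---------------------------------------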
http://d.puremagic.com/issues/show_bug.cgi?id=9163

Ah, GDC still implements atomic operations as normal D assignments combined
with a barrier intrinsic, right? If this is really the cause, I guess this
would be a good opportunity to switch to the new atomic builtins introduced
with C++11 support – they should lead to much better code anyway.
Dec 16 2012
http://d.puremagic.com/issues/show_bug.cgi?id=9163

>> Could be something trivial such as the backend somehow deducing that the
>> function (atomicLoad) is pure.
>
> I don't know anything about the gcc backend, but it seems to me that it
> would have to deduce that to be able to move a call out of the loop without
> breaking code. I also noticed that adding an empty gcc asm block to
> atomicLoad results in correct code being generated for workLoop. I guess
> that supports that theory.

Probably because asm statements are volatile, and are a black hole that the
compiler does not touch.
Dec 16 2012
http://d.puremagic.com/issues/show_bug.cgi?id=9163

> Ah, GDC still implements atomic operations as normal D assignments combined
> with a barrier intrinsic, right? If this is really the cause, I guess this
> would be a good opportunity to switch to the new atomic builtins introduced
> with C++11 support – they should lead to much better code anyway.

Indeed, and the C++11 builtins map better onto D's atomic calls.
Dec 16 2012
http://d.puremagic.com/issues/show_bug.cgi?id=9163

> I'm not sure that this isn't just a GDC bug. Why should status necessarily
> be thread-local?

Maybe my conclusions were premature, but according to TDPL chapter 13.12.1,
page 414: "The compiler optimizes code using non-shared data to the maximum,
in full confidence that no other thread can ever access it, and only tiptoes
around shared data".

The status field is not TLS data, as it's not a static field, but it is
instance data, and to my understanding, if the class instance isn't marked as
shared and the field isn't shared, the compiler is allowed to assume that
only one thread can access that data. And if the compiler assumes status can
only be accessed from one thread and "atomicReadUbyte" is inferred as pure,
that would lead to this wrong behavior. You've got a point that this can't
work for __gshared, so I'm not really sure what to do about this.

So is it always invalid to optimize this code in D?
---------------------------------------
pure bool checkFlag(const ref bool a)
{
    return a;
}

class A
{
    bool flag;

    void test()
    {
        while (true)
        {
            if (checkFlag(flag)) // Can this check be moved out of the loop?
                return;
        }
    }
}
---------------------------------------

> I don't think the use of volatile statements/variables would help here...

Wouldn't status as a volatile field signal to the compiler that its value
might change between calls to "atomicReadUbyte" and therefore prevent this
optimization?
Dec 16 2012
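A small sketch of the __gshared point conceded above (the class, variable and
function names are invented): a class whose type carries no shared qualifier
can still be reached from several threads through a __gshared reference,
which is why the absence of shared on its own cannot justify hoisting the
read.
---------------------------------------
import core.thread;

class A
{
    bool flag;
}

__gshared A instance;   // the type carries no shared qualifier, yet the
                        // reference is visible to every thread

void main()
{
    instance = new A;

    // A second thread mutates the unqualified field through __gshared.
    auto t = new Thread({ instance.flag = true; });
    t.start();

    // The main thread polls the very same field. This is exactly the pattern
    // under discussion: if the compiler assumes "no shared qualifier means
    // thread-local" and hoists the read, the loop never terminates. Shown to
    // illustrate reachability, not as a correct way to wait.
    while (!instance.flag) { }

    t.join();
}
---------------------------------------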
http://d.puremagic.com/issues/show_bug.cgi?id=9163

> [...] And if the compiler assumes status can only be accessed from one
> thread and "atomicReadUbyte" is inferred as pure, that would lead to this
> wrong behavior. [...]

atomicReadUbyte isn't pure, but it is inlined. Leaving atomicLoad as the
culprit.
Dec 16 2012
http://d.puremagic.com/issues/show_bug.cgi?id=9163

Just as a general note, because not everybody might be aware of this: GCC has
its own "pure" attribute with semantics different from its namesake in the D
language.
Dec 16 2012
http://d.puremagic.com/issues/show_bug.cgi?id=9163

Iain Buclaw <ibuclaw ubuntu.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

Can no longer reproduce this in gdc after some changes. The dmd compiler may
not do everything right, but then again sometimes neither does gdc. :o)
Jan 03 2013
http://d.puremagic.com/issues/show_bug.cgi?id=9163

David Simcha <dsimcha yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dsimcha yahoo.com

Sorry, I've been incredibly busy lately and just saw this post now. IIUC, the
**whole point** of atomic operations in core.atomic is that they're supposed
to act as fences. No code motion should take place across ASM statements at
the compiler level, and none should take place across lock-prefixed
instructions at the lower levels. Obviously either my understanding of the
spec is wrong or GDC isn't optimizing them properly, if that's the case.
Jan 06 2013
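To restate that expectation in code: a minimal producer/consumer sketch,
where the names ready, payload, produce and consume are invented and the
MemoryOrder values are the ones in current core.atomic. The release store and
the acquire load are the fences in both senses above: the compiler may not
move the surrounding accesses across them, and the emitted instructions order
the hardware accesses the same way.
---------------------------------------
import core.atomic;

shared bool ready;   // invented flag, standing in for something like status
shared int  payload;

// Producer: publish the data, then set the flag. The release store to
// `ready` keeps the earlier payload store from being moved after it.
void produce()
{
    atomicStore(payload, 42);
    atomicStore!(MemoryOrder.rel)(ready, true);
}

// Consumer: the acquire load must be performed on every iteration, and the
// payload read may not be hoisted above it.
int consume()
{
    while (!atomicLoad!(MemoryOrder.acq)(ready)) { }
    return atomicLoad(payload);
}
---------------------------------------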