
digitalmars.D - Multicores and Publication Safety

reply Walter Bright <newshound1 digitalmars.com> writes:
"What memory fences are useful for on multiprocessors; and why you 
should care, even if you're not an assembly programmer."

http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/

http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/
Aug 04 2008
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Walter Bright wrote:
 http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/
There seems to be a cadre of reddit readers who immediately vote down anything on D. That can be counteracted if the community votes them up!
Aug 04 2008
prev sibling parent reply "Jb" <jb nowhere.com> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:g7855a$2sd3$1 digitalmars.com...
 "What memory fences are useful for on multiprocessors; and why you should 
 care, even if you're not an assembly programmer."

 http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/

 http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/
None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32.

http://www.intel.com/products/processor/manuals/318147.pdf

The key point being loads are not reordered with other loads, and stores are not reordered with other stores.
Aug 04 2008
next sibling parent reply Brad Roberts <braddr puremagic.com> writes:
Jb wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g7855a$2sd3$1 digitalmars.com...
 "What memory fences are useful for on multiprocessors; and why you should 
 care, even if you're not an assembly programmer."

 http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/

 http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/
None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32. http://www.intel.com/products/processor/manuals/318147.pdf The key point being loads are not reordered with other loads, and stores are not reordered with other stores.
Pay very close attention to sections 2.3 and 2.4 of that document.
Aug 04 2008
next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Brad Roberts wrote:
 Jb wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g7855a$2sd3$1 digitalmars.com...
 "What memory fences are useful for on multiprocessors; and why you should 
 care, even if you're not an assembly programmer."

 http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/

 http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/
None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32. http://www.intel.com/products/processor/manuals/318147.pdf The key point being loads are not reordered with other loads, and stores are not reordered with other stores.
Pay very close attention to sections 2.3 and 2.4 of that document.
2.4 is the most interesting aspect of PC. It means that you can run into situations like this:

  // Thread A
  x = 1;

  // Thread B
  if( x == 1 )
      y = 1;

  // Thread C
  if( y == 1 )
      assert( x == 1 ); // may fail

Alex Terekhov came up with a sneaky solution for this based on how the IA-32 spec says CAS is currently implemented:

  // Thread A
  x = 1;

  // Thread B
  t = CAS( x, 0, 0 );
  if( t == 1 )
      y = 1;

  // Thread C
  if( y == 1 )
      assert( x == 1 ); // true

In essence, Intel currently implements CAS by either storing the new value /or/ re-storing the old value based on the result of the comparison, and because all stores from a single processor are ordered, Thread C is therefore guaranteed to see the store to x before the store to y.

As cool as I find the above solution, however, I do hope that this helps to demonstrate the complexity of lock-free programming. It also shows just how complex analysis of this stuff is. Even with the full source code available it would take some doing for a compiler to recognize a problem similar to the above.

Sean
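For concreteness, here is a rough D sketch of the same three threads. It assumes an atomics module along the lines of core.atomic (atomicLoad / atomicStore / cas), where cas takes a pointer and returns a bool rather than the old value, so a failed CAS is how we detect that x was already 1. Note that the guarantee being exploited is a property of the hardware (the locked CMPXCHG that cas compiles to on x86 also acts as a full barrier), not something the raw memory order promises by itself.

  import core.atomic;
  import core.thread;

  shared int x;
  shared int y;

  void threadA()
  {
      atomicStore!(MemoryOrder.raw)(x, 1);
  }

  void threadB()
  {
      // cas returns true if it swapped (x was 0) and false if the compare
      // failed (x was already 1), which is the case we care about here.
      if (!cas(&x, 0, 0))
          atomicStore!(MemoryOrder.raw)(y, 1);
  }

  void threadC()
  {
      if (atomicLoad!(MemoryOrder.raw)(y) == 1)
          assert(atomicLoad!(MemoryOrder.raw)(x) == 1);
  }

  void main()
  {
      new Thread(&threadA).start();
      new Thread(&threadB).start();
      new Thread(&threadC).start();
      thread_joinAll();
  }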
Aug 04 2008
parent Brad Roberts <braddr puremagic.com> writes:
Sean Kelly wrote:
 Brad Roberts wrote:
 Jb wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote in message
 news:g7855a$2sd3$1 digitalmars.com...
 "What memory fences are useful for on multiprocessors; and why you
 should care, even if you're not an assembly programmer."

 http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/


 http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/
None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32. http://www.intel.com/products/processor/manuals/318147.pdf The key point being loads are not reordered with other loads, and stores are not reordered with other stores.
Pay very close attention to sections 2.3 and 2.4 of that document.
2.4 is the most interesting aspect of PC. It means that you can run into situations like this: // Thread A x = 1; // Thread B if( x == 1 ) y = 1; // Thread C if( y == 1 ) assert( x == 1 ); // may fail Alex Terekhov came up with a sneaky solution for this based on how the IA-32 spec says CAS is currently implemented: // Thread A x = 1; // Thread B t = CAS( x, 0, 0 ); if( t == 1 ) y = 1; // Thread C if( y == 1 ) assert( x == 1 ); // true In essence, Intel currently implements CAS by either storing the new value /or/ re-storing the old value based on the result of the comparison, and because all stores from a single processor are ordered, Thread C is therefore guaranteed to see the store to x before the store to y. As cool as I find the above solution, however, I do hope that this helps to demonstrate the complexity of lock-free programming. It also shows just how complex analysis of this stuff is. Even with the full source code available it would take some doing for a compiler to recognize a problem similar to the above. Sean
For that example, section 2.8 kicks in: locked instructions (such as CAS) help constrain ordering.

So, to summarize: reordering is real, even on x86-class hardware. To make life even more interesting, there are also various CPU bugs that help make things even worse. See this thread (unconfirmed info, but interesting nonetheless) on the linux-kernel mailing list:

http://www.ussg.iu.edu/hypermail/linux/kernel/0808.0/0882.html

Whee,
Brad
Aug 04 2008
prev sibling parent reply "Jb" <jb nowhere.com> writes:
"Brad Roberts" <braddr puremagic.com> wrote in message 
news:mailman.10.1217908384.1156.digitalmars-d puremagic.com...
 Jb wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote in message
 news:g7855a$2sd3$1 digitalmars.com...
 "What memory fences are useful for on multiprocessors; and why you 
 should
 care, even if you're not an assembly programmer."

 http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/

 http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/
None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32. http://www.intel.com/products/processor/manuals/318147.pdf The key point being loads are not reordered with other loads, and stores are not reordered with other stores.
Pay very close attention to sections 2.3 and 2.4 of that document.
They don't override 2.1, they complement it. I.e.:

*Stores cannot be reordered with other stores*
*Loads cannot be reordered with other loads*

  x = 1;
  ready = 1;

happens in order whether or not a load is reordered with those stores. You can't have a situation where a processor sees the write to "ready" before it sees the write to "x".

What Bartosz said, "writes to memory can be completed out of order and ...", is not true on x86.

What 2.3 is saying is that a later load could be reordered before either store, but it still can't be reordered before the store to 'x' and after the store to 'ready', because the order of those stores cannot be changed. If it gets reordered before the store to 'x' it implicitly gets reordered before the store to 'ready'. That's the whole point of the ordering of stores / loads being enforced.

Regarding 2.4: what this is saying is that there may be a delay between processors seeing each other's stores, not that they can be seen out of order. Processor 1 may see its own write to 'x' before processor 2 does, but processor 2 still won't see the write to 'ready' before the write to 'x'.
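To spell the publication pattern out in D: below is a sketch written with explicit acquire/release operations, assuming an atomics module along the lines of core.atomic. On x86 the release store and the acquire load compile down to ordinary MOVs precisely because of the store-store and load-load rules quoted above; on a more weakly ordered CPU the same source emits the required barriers.

  import core.atomic;

  shared int  x;
  shared bool ready;

  // Publisher: write the payload, then raise the flag.
  void publish()
  {
      atomicStore!(MemoryOrder.raw)(x, 1);        // the payload
      atomicStore!(MemoryOrder.rel)(ready, true); // release: payload is visible first
  }

  // Consumer: wait for the flag, then read the payload.
  void consume()
  {
      while (!atomicLoad!(MemoryOrder.acq)(ready)) {} // acquire: pairs with the release
      assert(atomicLoad!(MemoryOrder.raw)(x) == 1);
  }

  void main()
  {
      import core.thread : Thread, thread_joinAll;
      new Thread(&consume).start();
      new Thread(&publish).start();
      thread_joinAll();
  }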
Aug 05 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Jb wrote:
 What Bartoz said.. "writes to memory can be completed out of order and"
 
 Is not true on x86.
It's risky to write such code, however, because:

1. someone else may try to port it to another processor, and then be mystified as to why it breaks

2. Intel may change this behavior on future x86's, which means your code will break years from now
Aug 05 2008
parent reply "Jb" <jb nowhere.com> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:g795mq$25jq$1 digitalmars.com...
 Jb wrote:
 What Bartoz said.. "writes to memory can be completed out of order and"

 Is not true on x86.
It's risky to write such code, however, because: 1. someone else may try to port it to another processor, and then be mystified as to why it breaks
You can't design / write your code based on the idea that someone who doesn't know what they are doing will try and modify it later. And if they are unaware of memory ordering they are likely unaware of alignment and atomicity, and probably don't understand the subtleties of synchronization, and a whole bunch of other issues.

I'm not saying every Joe Bloggs programmer should know about memory ordering and use it where they can to avoid more expensive synchronization primitives. But the compiler and stdlib, or multithreading libraries, should know about it. I don't think the compiler should be dumping memory fences all over the place on the assumption that they might be needed by the x86 processors of 2012.
 2. Intel may change this behavior on future x86's, which means your code 
 will break years from now
I don't think they could, because I think a lot of code probably already relies on it. And I think it's likely that the new commitment to strong memory ordering from both AMD and Intel (both have PDFs regarding 64-bit that specify it) is mainly because they realize it is needed to help progress with multicore.
Aug 05 2008
parent reply Walter Bright <walter nospammm-digitalmars.com> writes:
Jb Wrote:

 
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g795mq$25jq$1 digitalmars.com...
 Jb wrote:
 What Bartoz said.. "writes to memory can be completed out of order and"

 Is not true on x86.
It's risky to write such code, however, because: 1. someone else may try to port it to another processor, and then be mystified as to why it breaks
You cant design / write your code based on the idea that someone who doesnt know what they are doing will try and modify it later. And if they are unaware of memory ordering they are likely unaware of alignment atomicity, and probably dont understand the subtleties of syncronization, and a whole bunch of other issues. I'm not saying every joe blogs programmer should know about memory ordering and use it where they can to avoid more expensive syncronization primatives. But the compiler and stdlib, or multithreding librarys, should know about it. I dont think the compiler should be dumping memory fences all over the place on the assumtion that they might be needed by the x86 processors of 2012.
The model the compiler uses is to generate code "as if" fences were inserted everywhere. The compiler may, however, as part of optimization and generating code for a particular CPU, elide as many as it can.
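A rough source-level illustration of that "as if" model (my reading of it, not actual compiler output), again assuming a core.atomic-style API where the default memory order is sequentially consistent:

  import core.atomic;

  shared int g;

  void writeG()
  {
      // Conceptually: fence; store g; fence.
      atomicStore(g, 42); // defaults to sequentially consistent
      // On x86 the acquire/release halves of those conceptual fences are
      // already guaranteed by the hardware, so the backend can reduce this
      // to a plain MOV plus a single serializing instruction (an MFENCE or
      // an XCHG); on a weakly ordered CPU the explicit barriers remain.
  }

  int readG()
  {
      // A sequentially consistent load on x86 is just a MOV; the conceptual
      // fences around it cost nothing and are elided.
      return atomicLoad(g);
  }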
 2. Intel may change this behavior on future x86's, which means your code 
 will break years from now
I dont think they could because i think a lot of code probably already relys on it. And i think it's likely that the new comitment to strong memory ordering, from both AMD and INTEL (both have pdfs regarding 64 bit that specify it), is mainly because they realize it is needed to help progress with multi core.
I think that is because the current language technology is deficient. We aim to fix that with D :-)
Aug 05 2008
parent "Jb" <jb nowhere.com> writes:
"Walter Bright" <walter nospammm-digitalmars.com> wrote in message 
news:g7b7h1$aeb$1 digitalmars.com...
 2. Intel may change this behavior on future x86's, which means your 
 code
 will break years from now
I dont think they could because i think a lot of code probably already relys on it. And i think it's likely that the new comitment to strong memory ordering, from both AMD and INTEL (both have pdfs regarding 64 bit that specify it), is mainly because they realize it is needed to help progress with multi core.
I think that is because the current language technology is deficient. We aim to fix that with D :-)
FWIW I think you're right. But a little more help from the hardware would be nice as well. I'd like to see "lock free" (non-blocking) synchronization made a bit easier, something like a double CAS.
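For reference, this is what a plain single-word CAS loop looks like, assuming a core.atomic-style cas. A double CAS would let two independent words be compared and swapped in one atomic step, which is what you end up wanting for things like ABA-safe lock-free lists; with a single CAS everything has to be squeezed into one word.

  import core.atomic;

  shared int counter;

  // Lock-free increment: retry until our compare-and-swap wins the race.
  int increment()
  {
      int seen;
      do
      {
          seen = atomicLoad(counter);
      } while (!cas(&counter, seen, seen + 1));
      return seen + 1;
  }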
Aug 06 2008
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Jb wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g7855a$2sd3$1 digitalmars.com...
 "What memory fences are useful for on multiprocessors; and why you should 
 care, even if you're not an assembly programmer."

 http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/

 http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/
None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32. http://www.intel.com/products/processor/manuals/318147.pdf The key point being loads are not reordered with other loads, and stores are not reordered with other stores.
Not true. The actual behavior of IA-32 processors has been hotly debated, but it's been established that at least certain AMD processors may reorder loads. Also, even under the PCsc model it is completely legal to "hoist" loads above stores, or equivalently, to "sink" stores below loads.

In short, unless you've *really* done your homework I suggest being very careful with respect to lock-free programming--i.e. always perform fully sequenced operations just to be safe. Tango has had such a module from the start, and it looks like Phobos2 may get one fairly soon as well.

Sean
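To make the "fully sequenced" advice concrete: the reordering that actually bites on x86 is a store followed by a load of a different location, which is the handshake at the heart of Dekker/Peterson-style locks. A sketch (not a complete lock, just the handshake), assuming a core.atomic-style API, of why a full fence is needed there:

  import core.atomic;

  shared int flag0, flag1;

  bool tryEnter0()
  {
      atomicStore!(MemoryOrder.raw)(flag0, 1);  // announce intent
      atomicFence();                            // full barrier: the load below must not be hoisted
      return atomicLoad!(MemoryOrder.raw)(flag1) == 0;
  }

  bool tryEnter1()
  {
      atomicStore!(MemoryOrder.raw)(flag1, 1);
      atomicFence();
      return atomicLoad!(MemoryOrder.raw)(flag0) == 0;
  }

Without the fences (or fully sequenced stores and loads), each thread's load may be hoisted above its store, both see 0, and both "enter" at once.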
Aug 04 2008
parent reply "Jb" <jb nowhere.com> writes:
"Sean Kelly" <sean invisibleduck.org> wrote in message 
news:g78man$17sb$1 digitalmars.com...
 Jb wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g7855a$2sd3$1 digitalmars.com...
 "What memory fences are useful for on multiprocessors; and why you 
 should care, even if you're not an assembly programmer."

 http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/

 http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/
None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32. http://www.intel.com/products/processor/manuals/318147.pdf The key point being loads are not reordered with other loads, and stores are not reordered with other stores.
Not true. The actual behavior of IA-32 processors has been hotly debated, but it's been established that at least certain AMD processors may reorder loads.
That's news to me.
 Also, even under the PCsc model it is completely legal to "hoist" loads 
 above stores, or equivalently, to "sink" stores below loads.
Yes, but as long as stores are not reordered with other stores, and loads not reordered with other loads, then that kind of reordering won't result in the situation Bartosz described.
Aug 05 2008
next sibling parent Sean Kelly <sean invisibleduck.org> writes:
Jb wrote:
 "Sean Kelly" <sean invisibleduck.org> wrote in message 
 news:g78man$17sb$1 digitalmars.com...
 Jb wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g7855a$2sd3$1 digitalmars.com...
 "What memory fences are useful for on multiprocessors; and why you 
 should care, even if you're not an assembly programmer."

 http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/

 http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/
None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32. http://www.intel.com/products/processor/manuals/318147.pdf The key point being loads are not reordered with other loads, and stores are not reordered with other stores.
Not true. The actual behavior of IA-32 processors has been hotly debated, but it's been established that at least certain AMD processors may reorder loads.
Thats news to me.
 Also, even under the PCsc model it is completely legal to "hoist" loads 
 above stores, or equivalently, to "sink" stores below loads.
Yes but as long as stores are not reordered with other stores, and loads not reordered with other loads, then that kind of re-ordering wont result in the situation Bartoz described.
True enough. It's mostly an issue with creating mutexes and the like. Sean
Aug 05 2008
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Jb wrote:
 "Sean Kelly" <sean invisibleduck.org> wrote in message 
 news:g78man$17sb$1 digitalmars.com...
 Jb wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g7855a$2sd3$1 digitalmars.com...
 "What memory fences are useful for on multiprocessors; and why you 
 should care, even if you're not an assembly programmer."

 http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/

 http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/
None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32. http://www.intel.com/products/processor/manuals/318147.pdf The key point being loads are not reordered with other loads, and stores are not reordered with other stores.
Not true. The actual behavior of IA-32 processors has been hotly debated, but it's been established that at least certain AMD processors may reorder loads.
Thats news to me.
I don't know that this was ever confirmed with anyone at AMD, but it did come up in the C++0x talks and I believe the linux kernel accounts for it. Sean
Aug 05 2008
parent reply "Jb" <jb nowhere.com> writes:
"Sean Kelly" <sean invisibleduck.org> wrote in message 
news:g79ugv$mdd$1 digitalmars.com...
 Jb wrote:
 "Sean Kelly" <sean invisibleduck.org> wrote in message 
 news:g78man$17sb$1 digitalmars.com...
 Jb wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g7855a$2sd3$1 digitalmars.com...
 "What memory fences are useful for on multiprocessors; and why you 
 should care, even if you're not an assembly programmer."

 http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/

 http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/
None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32. http://www.intel.com/products/processor/manuals/318147.pdf The key point being loads are not reordered with other loads, and stores are not reordered with other stores.
Not true. The actual behavior of IA-32 processors has been hotly debated, but it's been established that at least certain AMD processors may reorder loads.
Thats news to me.
I don't know that this was ever confirmed with anyone at AMD, but it did come up in the C++0x talks and I believe the linux kernel accounts for it.
I did a bit of googling and it does seem older AMDs were less strongly ordered, SSE/3DNow non-temporal stores in particular. But it looks like they have gone for strong ordering with AMD64.

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf

From 7.2: Multiprocessor Memory Ordering:

"Loads do not pass previous loads (loads are not re-ordered). Stores do not pass previous stores (stores are not re-ordered)"

Although skim-reading more of chapter 7, it looks like they might do reordering behind the scenes, or "such that the appearance of in-order execution is maintained" as they say.

My guess is that strong ordering, or at least the appearance of it, is an important factor in multicore CPUs scaling well.
Aug 05 2008
parent reply Sean Kelly <sean invisibleduck.org> writes:
Jb wrote:
 "Sean Kelly" <sean invisibleduck.org> wrote in message 
 news:g79ugv$mdd$1 digitalmars.com...
 Jb wrote:
 "Sean Kelly" <sean invisibleduck.org> wrote in message 
 news:g78man$17sb$1 digitalmars.com...
 Jb wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g7855a$2sd3$1 digitalmars.com...
 "What memory fences are useful for on multiprocessors; and why you 
 should care, even if you're not an assembly programmer."

 http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/

 http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/
None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32. http://www.intel.com/products/processor/manuals/318147.pdf The key point being loads are not reordered with other loads, and stores are not reordered with other stores.
Not true. The actual behavior of IA-32 processors has been hotly debated, but it's been established that at least certain AMD processors may reorder loads.
Thats news to me.
I don't know that this was ever confirmed with anyone at AMD, but it did come up in the C++0x talks and I believe the linux kernel accounts for it.
I did a bit of googling and it does seem older AMDs were less strongly ordered. It seems SSE/3DNow non temporal stores particulary. But it looks like they have gone for strong ordering with AMD64. http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf From 7.2 : Multiprocessor Memory Ordering. "Loads do not pass previous loads (loads are not re-ordered). Stores do not pass previous stores (stores are not re-ordered)" Although skim reading more of chapter 7 it looks like they might do reordering behind the scence, or "such that the appearance of in-order execution is maintained" as they say.
At least AMD and Intel have figured out how to separate discussion of implementation issues from visible behavior. The original IA-32 spec was an absolute disaster in this respect. I'm also encouraged that the memory model has been both fully specified and strengthened to PCsc or better. The x86 has always been pretty easy to deal with and it's nice to see that this will continue to be true.

I suppose my only question at this point is how the official memory barrier instructions apply to normal (non-SSE) instruction ordering. I don't suppose the recent specs say anything about this?
 My guess is that strong ordering, or at least the appearance of it, is an 
 important factor in multi core cpus scalling well.
Yup. And the Intel announcement makes the very good point that it's a huge factor in performance per watt as well. Strengthening the memory model and shrinking the pipeline allows for a tremendous amount of logic hardware to simply be thrown away, which means smaller, cooler, more energy-efficient CPUs. My big question now is how computers will be built in the coming years... will we have a few traditional (fast) cores plus a general-purpose parallel computing cluster? I suppose I should read that Intel paper posted yesterday. Sean
Aug 05 2008
parent Benji Smith <dlanguage benjismith.net> writes:
Sean Kelly wrote:
 My big question now is how computers will be 
 built in the coming years... will we have a few traditional (fast) cores 
 plus a general-purpose parallel computing cluster?
 
 Sean
Interesting you should bring this up. I was just reading an article yesterday about the "Cell Broadband Engine" used in the Playstation 3. It features one general-purpose 64-bit PowerPC chip (the "Power Processor Element") and eight co-processing cores (the "Synergistic Processing Units"), each with a 128-bit SIMD architecture. So, at least from the perspective of IBM and Sony, the answer is "yes". --benji
Aug 05 2008