
digitalmars.D - Who Ordered Memory Fences on an x86?

reply Walter Bright <newshound1 digitalmars.com> writes:
Another one of Bartosz' great blogs:

http://www.reddit.com/r/programming/comments/7bmkt/who_ordered_memory_fences_on_an_x86/

This will be required reading when we start implementing shared types.
Nov 05 2008
parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:get95v$1r39$1 digitalmars.com...
 Another one of Bartosz' great blogs:

 http://www.reddit.com/r/programming/comments/7bmkt/who_ordered_memory_fences_on_an_x86/

 This will be required reading when we start implementing shared types.

Call me a grumpy old fart, but I'd be happy just tossing fences in everywhere (when a multicore is detected) and be done with the whole mess, just because trying to wring every little bit of speed from, say, a 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd rather optimize for the lower end and let the fancy overpriced crap handle it however it will.

And that's even before tossing in the consideration that (to my dismay) most code these days is written in languages/platforms (ex, "Ajaxy" web-apps) that throw any notion of performance straight into the trash anyway (what's 100 extra cycles here and there, when the browser/interpreter/OS/whatever makes something as simple as navigation and text entry less responsive than it was on a 1MHz 6502?).
Nov 05 2008
next sibling parent reply Russell Lewis <webmaster villagersonline.com> writes:
Nick Sabalausky wrote:
 Call me a grumpy old fart, but I'd be happy just tossing fences in 
 everywhere (when a multicore is detected) and be done with the whole mess, 
 just because trying to wring every little bit of speed from, say, a 3+ GHz 
 multicore processor strikes me as a highly unworthy pursuit. I'd rather 
 optimize for the lower end and let the fancy overpriced crap handle it 
 however it will.

Ok, I'm not going to go as far as that. :) But I've heard that Intel has been pondering doing almost that, at the hardware level. The theory is that years from now, our CPUs will not be "one or a few extremely complex processors," but instead "hundreds or thousands of simplistic processors." You could implement a pretty braindead execution model (no reordering, etc.) if you had 1024 cores all working in parallel. The question, of course, is how fast the software will come along to the point where it can actually make use of that many cores.
Nov 05 2008
parent reply "Nick Sabalausky" <a a.a> writes:
"Russell Lewis" <webmaster villagersonline.com> wrote in message 
news:geu020$6fd$1 digitalmars.com...
 Nick Sabalausky wrote:
 Call me a grumpy old fart, but I'd be happy just tossing fences in 
 everywhere (when a multicore is detected) and be done with the whole 
 mess, just because trying to wring every little bit of speed from, say, a 
 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd 
 rather optimize for the lower end and let the fancy overpriced crap 
 handle it however it will.

Ok, I'm not going to go as far as that. :) But I've heard that Intel has been pondering doing almost that, at the hardware level. The theory is that years from now, our CPUs will not be "one or a few extremely complex processors," but instead "hundreds or thousands of simplistic processors." You could implement a pretty braindead execution model (no reordering, etc.) if you had 1024 cores all working in parallel. The question, of course, is how fast the software will come along to the point where it can actually make use of that many cores.

Yea, I've been hearing a lot about that. Be interesting to see what happens. Console gaming will probably be the best place to watch to see how that turns out (what with Sony's Cell and all).

Also, a few years ago I heard about some research on a "smart memory" chip that would allow certain basic operations to be performed within the memory chip itself (basically turning the memory cells into registers, AIUI). IIRC, the benefit they were aiming for was reducing the bottleneck of CPU<->RAM bus traffic. I haven't heard anything about it since then, but between that and the "lots of simple CPU cores" predictions, I wouldn't be surprised (though I'm not sure I would bet on it either) to eventually see traditional memory and processors (and maybe even hard drives) become replaced by a hybrid "CPU/RAM" chip.
Nov 05 2008
parent reply BCS <ao pathlink.com> writes:
Reply to Nick,


 Yea, I've been hearing a lot about that. Be interesting to see what
 happens. Console gaming will probably be the best place to watch to
 see how that turns out (what with Sony's Cell and all).
 
 Also, a few years ago I heard about some research on a "smart memory"
 chip that would allow certain basic operations to be performed within
 the memory chip itself (basically turning the memory cells into
 registers, AIUI). IIRC, the benefit they were aiming for was reducing
 the bottleneck of CPU<->RAM bus traffic. I haven't heard anything
 about it since then, but between that and the "lots of simple CPU
 cores" predictions, I wouldn't be surprised (though I'm not sure I
 would bet on it either) to eventually see traditional memory and
 processors (and maybe even hard drives) become replaced by a hybrid
 "CPU/RAM" chip.
 

One option people are throwing around in this is the Field Programmable Processor Array, think an FPGA with more complex gates.
Nov 06 2008
parent "Nick Sabalausky" <a a.a> writes:
"BCS" <ao pathlink.com> wrote in message 
news:78ccfa2d350718cb0e1e3ffdd376 news.digitalmars.com...
 Reply to Nick,


 Yea, I've been hearing a lot about that. Be interesting to see what
 happens. Console gaming will probably be the best place to watch to
 see how that turns out (what with Sony's Cell and all).

 Also, a few years ago I heard about some research on a "smart memory"
 chip that would allow certain basic operations to be performed within
 the memory chip itself (basically turning the memory cells into
 registers, AIUI). IIRC, the benefit they were aiming for was reducing
 the bottleneck of CPU<->RAM bus traffic. I haven't heard anything
 about it since then, but between that and the "lots of simple CPU
 cores" predictions, I wouldn't be surprised (though I'm not sure I
 would bet on it either) to eventually see traditional memory and
 processors (and maybe even hard drives) become replaced by a hybrid
 "CPU/RAM" chip.

One option people are throwing around in this is the Field Programmable Processor Array, think an FPGA with more complex gates.

Oh yes, the processor that can rewire itself on-the-fly. A very interesting idea.
Nov 06 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Nick Sabalausky wrote:
 Call me a grumpy old fart, but I'd be happy just tossing fences in 
 everywhere (when a multicore is detected) and be done with the whole mess, 
 just because trying to wring every little bit of speed from, say, a 3+ GHz 
 multicore processor strikes me as a highly unworthy pursuit. I'd rather 
 optimize for the lower end and let the fancy overpriced crap handle it 
 however it will.
 
 And that's even before tossing in the consideration that (to my dismay) most 
 code these days is written in languages/platforms (ex, "Ajaxy" web-apps) 
 that throw any notion of performance straight into the trash anyway (what's 
 100 extra cycles here and there, when the browser/interpreter/OS/whatever 
 makes something as simple as navigation and text entry less responsive than 
 it was on a 1MHz 6502?).

Bartosz, Andrei, Sean and I have discussed this at length. My personal view is that nobody actually understands the proper use of fences (the CPU documentation on exactly what they do is frustratingly obtuse, which does not help at all). Then there's the issue of fences behaving very differently on different CPUs. If you use explicit fences, you have no hope of portability.

To address this, the idea we've been tossing about is to allow the only operations on shared variables to be read and write, implemented as compiler intrinsics:

shared int x;
...
int y = shared_read(x);
shared_write(x, y + 1);

which implements: int y = x++;

(Forget the names of the intrinsics for the moment.)

Yes, it's painfully explicit. But it's easy to visually verify correctness, and one can grep for them for code review purposes. Each shared_read and shared_write is guaranteed to be sequentially consistent, within a thread as well as among multiple threads.

How they are implemented is up to the compiler. The compiler can do the naive approach and lard them up with airtight fences, or a more advanced compiler can do data flow analysis and compute a reasonable minimum number of fences required.

The point here is that only *one* person needs to know how the fences actually work on the target CPU, the person who writes the compiler back end. And even that person only needs to solve the problem once. I think that's a far more tractable problem than trying to educate every programmer out there on the subtleties of fences for every CPU variant.

Yes, this screws down very tightly what can be done with shared variables. Once we get this done, and get it right, we'll be able to see much more clearly where the right places are to loosen those screws.
Nov 05 2008
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:geu161$91s$1 digitalmars.com...
 Nick Sabalausky wrote:
 Call me a grumpy old fart, but I'd be happy just tossing fences in 
 everywhere (when a multicore is detected) and be done with the whole 
 mess, just because trying to wring every little bit of speed from, say, a 
 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd 
 rather optimize for the lower end and let the fancy overpriced crap 
 handle it however it will.

 And that's even before tossing in the consideration that (to my dismay) 
 most code these days is written in languages/platforms (ex, "Ajaxy" 
 web-apps) that throw any notion of performance straight into the trash 
 anyway (what's 100 extra cycles here and there, when the 
 browser/interpreter/OS/whatever makes something as simple as navigation 
 and text entry less responsive than it was on a 1MHz 6502?).

Bartosz, Andrei, Sean and I have discussed this at length. My personal view is that nobody actually understands the proper use of fences (the CPU documentation on exactly what they do is frustratingly obtuse, which does not help at all). Then there's the issue of fences behaving very differently on different CPUs. If you use explicit fences, you have no hope of portability.

From reading the article, I was under the impression that not using explicit fences led to CPUs inevitably making false assumptions and thus spitting out erroneous results. So it sounds like explicit fences are a case of "damned if you do, damned if you don't": ie, "Use explicit fences everywhere and you get unportable machine code. Don't use explicit fences and you get errors." Is this accurate? (If so, what a mess!)

Also, one thing I'm a little unclear on: is this whole mess only applicable when multiple cores are in use, or do the same problems crop up on unicore chips?
 To address this, the idea we've been tossing about is to allow the only 
 operations on shared variables to be read and write, implemented as 
 compiler intrinsics:

 shared int x;
 ...
 int y = shared_read(x);
 shared_write(x, y + 1);

 which implements: int y = x++;

 (Forget the names of the intrinsics for the moment.)

 Yes, it's painfully explicit. But it's easy to visually verify 
 correctness, and one can grep for them for code review purposes. Each 
 shared_read and shared_write are guaranteed to be sequentially consistent, 
 within a thread as well as among multiple threads.

I volunteer to respond to inevitable "Why does D's shared memory access syntax suck so badly?" inquiries with "You can thank the CPU vendors for that" ;) (Or do I misunderstand the root issue?)
 How they are implemented is up to the compiler. The compiler can do the 
 naive approach and lard them up with airtight fences, or a more advanced 
 compiler can do data flow analysis and compute a reasonable minimum number 
 of fences required.

 The point here is that only *one* person needs to know how the fences 
 actually work on the target CPU, the person who writes the compiler back 
 end. And even that person only needs to solve the problem once. I think 
 that's a far more tractable problem than trying to educate every 
 programmer out there on the subtleties of fences for every CPU variant.

 Yes, this screws down very tightly what can be done with shared variables. 
 Once we get this done, and get it right, we'll be able to see much more 
 clearly where the right places are to loosen those screws.

Seems to make sense. Maybe I'm being naive, but would it have made more sense for CPUs to assume memory accesses *cannot* be reordered unless told otherwise, instead of the other way around?
Nov 05 2008
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Nick Sabalausky wrote:
 Also, one thing I'm a little 
 unclear on: is this whole mess only applicable when multiple cores are in 
 use, or do the same problems crop up on unicore chips?

It's a problem only on multicore chips.
 Maybe I'm being naive, but would it have made more sense for CPUs to assume 
 memory accesses *cannot* be reordered unless told otherwise, instead of the 
 other way around? 

It's the way it is for performance reasons.
Nov 06 2008
prev sibling parent Russell Lewis <webmaster villagersonline.com> writes:
Nick Sabalausky wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:geu161$91s$1 digitalmars.com...
 Nick Sabalausky wrote:
 Call me a grumpy old fart, but I'd be happy just tossing fences in 
 everywhere (when a multicore is detected) and be done with the whole 
 mess, just because trying to wring every little bit of speed from, say, a 
 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd 
 rather optimize for the lower end and let the fancy overpriced crap 
 handle it however it will.

 And that's even before tossing in the consideration that (to my dismay) 
 most code these days is written in languages/platforms (ex, "Ajaxy" 
 web-apps) that throw any notion of performance straight into the trash 
 anyway (what's 100 extra cycles here and there, when the 
 browser/interpreter/OS/whatever makes something as simple as navigation 
 and text entry less responsive than it was on a 1MHz 6502?).

 Bartosz, Andrei, Sean and I have discussed this at length. My personal view is that nobody actually understands the proper use of fences (the CPU documentation on exactly what they do is frustratingly obtuse, which does not help at all). Then there's the issue of fences behaving very differently on different CPUs. If you use explicit fences, you have no hope of portability.

From reading the article, I was under the impression that not using explicit fences led to CPUs inevitably making false assumptions and thus spitting out erroneous results. So it sounds like explicit fences are a case of "damned if you do, damned if you don't": ie, "Use explicit fences everywhere and you get unportable machine code. Don't use explicit fences and you get errors." Is this accurate? (If so, what a mess!) Also, one thing I'm a little unclear on: is this whole mess only applicable when multiple cores are in use, or do the same problems crop up on unicore chips?

In theory, you can write a portable program that uses explicit fences. The problem is that you have to design for the absolute worst-case CPU, which basically means putting an absolute read/write fence in almost every conceivable place. That would make performance suck. For reasonable performance, you need to cut down on the fences to only those that are required in order to get correctness...and that is definitely *not* portable.
Nov 06 2008
prev sibling parent Russell Lewis <webmaster villagersonline.com> writes:
Walter Bright wrote:
 Yes, this screws down very tightly what can be done with shared 
 variables. Once we get this done, and get it right, we'll be able to see 
 much more clearly where the right places are to loosen those screws.

Wise choice. I'm readying my firehose so that I can spray you down when the flames come. :)
Nov 06 2008