digitalmars.D - Floating Point + Threads?
- dsimcha <dsimcha yahoo.com> Apr 15 2011
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Apr 15 2011
- Walter Bright <newshound2 digitalmars.com> Apr 16 2011
- Iain Buclaw <ibuclaw ubuntu.com> Apr 16 2011
- Walter Bright <newshound2 digitalmars.com> Apr 16 2011
- Iain Buclaw <ibuclaw ubuntu.com> Apr 16 2011
- Walter Bright <newshound2 digitalmars.com> Apr 16 2011
- dsimcha <dsimcha yahoo.com> Apr 16 2011
- "Robert Jacques" <sandford jhu.edu> Apr 15 2011
- dsimcha <dsimcha yahoo.com> Apr 16 2011
- dsimcha <dsimcha yahoo.com> Apr 16 2011
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Apr 16 2011
- dsimcha <dsimcha yahoo.com> Apr 16 2011
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Apr 16 2011
- Walter Bright <newshound2 digitalmars.com> Apr 16 2011
- Iain Buclaw <ibuclaw ubuntu.com> Apr 16 2011
- dsimcha <dsimcha yahoo.com> Apr 16 2011
- "JimBob" <jim bob.com> Apr 16 2011
- Timon Gehr <timon.gehr gmx.ch> Apr 16 2011
- dsimcha <dsimcha yahoo.com> Apr 16 2011
- dsimcha <dsimcha yahoo.com> Apr 16 2011
- Walter Bright <newshound2 digitalmars.com> Apr 16 2011
- bearophile <bearophileHUGS lycos.com> Apr 16 2011
- Don <nospam nospam.com> Apr 20 2011
- "JimBob" <jim bob.com> Apr 20 2011
- Don <nospam nospam.com> Apr 20 2011
- Sean Kelly <sean invisibleduck.org> Apr 20 2011
- Sean Kelly <sean invisibleduck.org> Apr 16 2011
- "Robert Jacques" <sandford jhu.edu> Apr 16 2011
- Sean Kelly <sean invisibleduck.org> Apr 19 2011
- "Robert Jacques" <sandford jhu.edu> Apr 19 2011
- Sean Kelly <sean invisibleduck.org> Apr 20 2011
I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 . Basically, the example in question sums up all the elements of a lazy range (actually, std.algorithm.map) in parallel. It uses taskPool.reduce, which divides the summation into work units to be executed in parallel. When executed in parallel, the results of the summation are non-deterministic after about the 12th decimal place, even though all of the following properties are true: 1. The work is divided into work units in a deterministic fashion. 2. Within each work unit, the summation happens in a deterministic order. 3. The final summation of the results of all the work units is done in a deterministic order. 4. The smallest term in the summation is about 5e-10. This means the difference across runs is about two orders of magnitude smaller than the smallest term. It can't be a concurrency bug where some terms sometimes get skipped. 5. The results for the individual tasks, not just the final summation, differ in the low-order bits. Each task is executed in a single thread. 6. The rounding mode is apparently the same in all of the threads. 7. The bug appears even on machines with only one core, as long as the number of task pool threads is manually set to >0. Since it's a single core machine, it can't be a low level memory model issue. What could possibly cause such small, non-deterministic differences in floating point results, given everything above? I'm just looking for suggestions here, as I don't even know where to start hunting for a bug like this.
Apr 15 2011
On 4/15/11 10:22 PM, dsimcha wrote:I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
Does the scheduling affect the summation order? Andrei
Apr 15 2011
On 4/15/2011 8:40 PM, Andrei Alexandrescu wrote:On 4/15/11 10:22 PM, dsimcha wrote:I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
Does the scheduling affect the summation order?
That's a good thought. FP addition results can differ dramatically depending on associativity.
Apr 16 2011
== Quote from Walter Bright (newshound2 digitalmars.com)'s articleOn 4/15/2011 8:40 PM, Andrei Alexandrescu wrote:On 4/15/11 10:22 PM, dsimcha wrote:I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
Does the scheduling affect the summation order?
associativity.
And not to forget optimisations too. ;)
Apr 16 2011
On 4/16/2011 6:46 AM, Iain Buclaw wrote:== Quote from Walter Bright (newshound2 digitalmars.com)'s articleThat's a good thought. FP addition results can differ dramatically depending on associativity.
And not to forget optimisations too. ;)
The dmd optimizer is careful not to reorder evaluation in such a way as to change the results.
Apr 16 2011
== Quote from Walter Bright (newshound2 digitalmars.com)'s articleOn 4/16/2011 6:46 AM, Iain Buclaw wrote:== Quote from Walter Bright (newshound2 digitalmars.com)'s articleThat's a good thought. FP addition results can differ dramatically depending on associativity.
And not to forget optimisations too. ;)
change the results.
And so it rightly shouldn't! I was thinking more of a case of FPU precision rather than ordering: as in you get a different result computing on SSE in double precision mode on the one hand, and by computing on x87 in double precision then writing to a double variable in memory. Classic example (which could either be a bug or non-bug depending on your POV): void test(double x, double y) { double y2 = x + 1.0; assert(y == y2); // triggers with -O } void main() { double x = .012; double y = x + 1.0; test(x, y); }
Apr 16 2011
On 4/16/2011 11:43 AM, Iain Buclaw wrote:I was thinking more of a case of FPU precision rather than ordering: as in you get a different result computing on SSE in double precision mode on the one hand, and by computing on x87 in double precision then writing to a double variable in memory.
You're right on that one.
Apr 16 2011
On 4/15/2011 11:40 PM, Andrei Alexandrescu wrote:On 4/15/11 10:22 PM, dsimcha wrote:I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
Does the scheduling affect the summation order? Andrei
No. I realize floating point addition isn't associative, but unless there's some detail I'm forgetting about, the ordering is deterministic within each work unit and the ordering of the final summation is deterministic.
Apr 16 2011
On Fri, 15 Apr 2011 23:22:04 -0400, dsimcha <dsimcha yahoo.com> wrote:I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 . Basically, the example in question sums up all the elements of a lazy range (actually, std.algorithm.map) in parallel. It uses taskPool.reduce, which divides the summation into work units to be executed in parallel. When executed in parallel, the results of the summation are non-deterministic after about the 12th decimal place, even though all of the following properties are true: 1. The work is divided into work units in a deterministic fashion. 2. Within each work unit, the summation happens in a deterministic order. 3. The final summation of the results of all the work units is done in a deterministic order. 4. The smallest term in the summation is about 5e-10. This means the difference across runs is about two orders of magnitude smaller than the smallest term. It can't be a concurrency bug where some terms sometimes get skipped. 5. The results for the individual tasks, not just the final summation, differ in the low-order bits. Each task is executed in a single thread. 6. The rounding mode is apparently the same in all of the threads. 7. The bug appears even on machines with only one core, as long as the number of task pool threads is manually set to >0. Since it's a single core machine, it can't be a low level memory model issue. What could possibly cause such small, non-deterministic differences in floating point results, given everything above? I'm just looking for suggestions here, as I don't even know where to start hunting for a bug like this.
Well, on one hand floating point math is not cumulative, and running sums have many known issues (I'd recommend looking up Khan summation). On the hand, it should be repeatably different. As for suggestions? First and foremost, you should always add small to large, so try using iota(n-1,-1,-1) instead of iota(n). Not only should the answer be better, but if your error rate goes down, you have a good idea of where the problem is. I'd also try isolating your implementation's numerics, from the underlying concurrency. i.e. use a task pool of 1 and don't let the host thread join it, so the entire job is done by one worker. The other thing to try is isolation /removing map and iota from the equation.
Apr 15 2011
On 4/16/2011 2:09 AM, Robert Jacques wrote:On Fri, 15 Apr 2011 23:22:04 -0400, dsimcha <dsimcha yahoo.com> wrote:I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 . Basically, the example in question sums up all the elements of a lazy range (actually, std.algorithm.map) in parallel. It uses taskPool.reduce, which divides the summation into work units to be executed in parallel. When executed in parallel, the results of the summation are non-deterministic after about the 12th decimal place, even though all of the following properties are true: 1. The work is divided into work units in a deterministic fashion. 2. Within each work unit, the summation happens in a deterministic order. 3. The final summation of the results of all the work units is done in a deterministic order. 4. The smallest term in the summation is about 5e-10. This means the difference across runs is about two orders of magnitude smaller than the smallest term. It can't be a concurrency bug where some terms sometimes get skipped. 5. The results for the individual tasks, not just the final summation, differ in the low-order bits. Each task is executed in a single thread. 6. The rounding mode is apparently the same in all of the threads. 7. The bug appears even on machines with only one core, as long as the number of task pool threads is manually set to >0. Since it's a single core machine, it can't be a low level memory model issue. What could possibly cause such small, non-deterministic differences in floating point results, given everything above? I'm just looking for suggestions here, as I don't even know where to start hunting for a bug like this.
Well, on one hand floating point math is not cumulative, and running sums have many known issues (I'd recommend looking up Khan summation). On the hand, it should be repeatably different. As for suggestions? First and foremost, you should always add small to large, so try using iota(n-1,-1,-1) instead of iota(n). Not only should the answer be better, but if your error rate goes down, you have a good idea of where the problem is. I'd also try isolating your implementation's numerics, from the underlying concurrency. i.e. use a task pool of 1 and don't let the host thread join it, so the entire job is done by one worker. The other thing to try is isolation /removing map and iota from the equation.
Right. For this example, though, assuming floating point math behaves like regular math is a good enough approximation. The issue isn't that the results aren't reasonably accurate. Furthermore, the results will always change slightly depending on how many work units you have. (I even warn in the documentation that floating point addition is not associative, though it is approximately associative in the well-behaved cases.) My only concern is whether this non-determinism represents some deep underlying bug. For a given work unit allocation (work unit allocations are deterministic and only change when the number of threads changes or they're changed explicitly), I can't figure out how scheduling could change the results at all. If I could be sure that it wasn't a symptom of an underlying bug in std.parallelism, I wouldn't care about this small amount of numerical fuzz. Floating point math is always inexact and parallel summation by its nature can't be made to give the exact same results as serial summation.
Apr 16 2011
On 4/16/2011 10:11 AM, dsimcha wrote:My only concern is whether this non-determinism represents some deep underlying bug. For a given work unit allocation (work unit allocations are deterministic and only change when the number of threads changes or they're changed explicitly), I can't figure out how scheduling could change the results at all. If I could be sure that it wasn't a symptom of an underlying bug in std.parallelism, I wouldn't care about this small amount of numerical fuzz. Floating point math is always inexact and parallel summation by its nature can't be made to give the exact same results as serial summation.
Ok, it's definitely **NOT** a bug in std.parallelism. Here's a reduced test case that only uses core.thread, not std.parallelism. All it does is sum an array using std.algorithm.reduce from the main thread, then start a new thread to do the same thing and compare answers. At the beginning of the summation function the rounding mode is printed to verify that it's the same for both threads. The two threads get slightly different answers. Just to thoroughly rule out a concurrency bug, I didn't even let the two threads execute concurrently. The main thread produces its result, then starts and immediately joins the second thread. import std.algorithm, core.thread, std.stdio, core.stdc.fenv; real sumRange(const(real)[] range) { writeln("Rounding mode: ", fegetround); // 0 from both threads. return reduce!"a + b"(range); } void main() { immutable n = 1_000_000; immutable delta = 1.0 / n; auto terms = new real[1_000_000]; foreach(i, ref term; terms) { immutable x = ( i - 0.5 ) * delta; term = delta / ( 1.0 + x * x ) * 1; } immutable res1 = sumRange(terms); writefln("%.19f", res1); real res2; auto t = new Thread( { res2 = sumRange(terms); } ); t.start(); t.join(); writefln("%.19f", res2); } Output: Rounding mode: 0 0.7853986633972191094 Rounding mode: 0 0.7853986633972437348
Apr 16 2011
On 4/16/11 9:52 AM, dsimcha wrote:Output: Rounding mode: 0 0.7853986633972191094 Rounding mode: 0 0.7853986633972437348
I think at this precision the difference may be in random bits. Probably you don't need to worry about it. Andrei
Apr 16 2011
On 4/16/2011 10:55 AM, Andrei Alexandrescu wrote:On 4/16/11 9:52 AM, dsimcha wrote:Output: Rounding mode: 0 0.7853986633972191094 Rounding mode: 0 0.7853986633972437348
I think at this precision the difference may be in random bits. Probably you don't need to worry about it. Andrei
"random bits"? I am fully aware that these low order bits are numerical fuzz and are meaningless from a practical perspective. I am only concerned because I thought these bits are supposed to be deterministic even if they're meaningless. Now that I've ruled out a bug in std.parallelism, I'm wondering if it's a bug in druntime or DMD.
Apr 16 2011
On 4/16/11 9:59 AM, dsimcha wrote:On 4/16/2011 10:55 AM, Andrei Alexandrescu wrote:On 4/16/11 9:52 AM, dsimcha wrote:Output: Rounding mode: 0 0.7853986633972191094 Rounding mode: 0 0.7853986633972437348
I think at this precision the difference may be in random bits. Probably you don't need to worry about it. Andrei
"random bits"? I am fully aware that these low order bits are numerical fuzz and are meaningless from a practical perspective. I am only concerned because I thought these bits are supposed to be deterministic even if they're meaningless. Now that I've ruled out a bug in std.parallelism, I'm wondering if it's a bug in druntime or DMD.
I seem to remember that essentially some of the last bits printed in such a result are essentially arbitrary. I forgot what could cause this. Andrei
Apr 16 2011
On 4/16/2011 9:52 AM, Andrei Alexandrescu wrote:I seem to remember that essentially some of the last bits printed in such a result are essentially arbitrary. I forgot what could cause this.
To see what the exact bits are, print using %A format. In any case, floating point bits are not random. They are completely deterministic.
Apr 16 2011
== Quote from dsimcha (dsimcha yahoo.com)'s articleOn 4/16/2011 10:11 AM, dsimcha wrote: Output: Rounding mode: 0 0.7853986633972191094 Rounding mode: 0 0.7853986633972437348
This is not something I can replicate on my workstation.
Apr 16 2011
On 4/16/2011 11:16 AM, Iain Buclaw wrote:== Quote from dsimcha (dsimcha yahoo.com)'s articleOn 4/16/2011 10:11 AM, dsimcha wrote: Output: Rounding mode: 0 0.7853986633972191094 Rounding mode: 0 0.7853986633972437348
This is not something I can replicate on my workstation.
Interesting. Since I know you're a Linux user, I fired up my Ubuntu VM and tried out my test case. I can't reproduce it either on Linux, only on Windows.
Apr 16 2011
"dsimcha" <dsimcha yahoo.com> wrote in message news:iocalv$2h58$1 digitalmars.com...Output: Rounding mode: 0 0.7853986633972191094 Rounding mode: 0 0.7853986633972437348
Could be something somewhere is getting truncated from real to double, which would mean 12 fewer bits of mantisa. Maybe the FPU is set to lower precision in one of the threads?
Apr 16 2011
Could be something somewhere is getting truncated from real to double, which would mean 12 fewer bits of mantisa. Maybe the FPU is set to lower precision in one of the threads?
Yes indeed, this is a _Windows_ "bug". I have experienced this in Windows before, the main thread's FPU state register is initialized to lower FPU-Precision (64bits) by default by the OS, presumably to make FP calculations faster. However, when you start a new thread, the FPU will use the whole 80 bits for computation because, curiously, the FPU is not reconfigured for those. Suggested fix: Add asm{fninit}; to the beginning of your main function, and the difference between the two will be gone. This would be a compatibility issue DMD/windows which disables the "real" data type. You might want to file a bug report for druntime if my suggested fix works. (This would imply that the real type was basically identical to the double type in Windows all along!)
Apr 16 2011
On 4/16/2011 2:15 PM, Timon Gehr wrote:Could be something somewhere is getting truncated from real to double, which would mean 12 fewer bits of mantisa. Maybe the FPU is set to lower precision in one of the threads?
Yes indeed, this is a _Windows_ "bug". I have experienced this in Windows before, the main thread's FPU state register is initialized to lower FPU-Precision (64bits) by default by the OS, presumably to make FP calculations faster. However, when you start a new thread, the FPU will use the whole 80 bits for computation because, curiously, the FPU is not reconfigured for those. Suggested fix: Add asm{fninit}; to the beginning of your main function, and the difference between the two will be gone. This would be a compatibility issue DMD/windows which disables the "real" data type. You might want to file a bug report for druntime if my suggested fix works. (This would imply that the real type was basically identical to the double type in Windows all along!)
Close: If I add this instruction to the function for the new thread, the difference goes away. The relevant statement is: auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } ); At any rate, this is a **huge** WTF that should probably be fixed in druntime. Once I understand it a little better, I'll file a bug report.
Apr 16 2011
On 4/16/2011 2:24 PM, dsimcha wrote:On 4/16/2011 2:15 PM, Timon Gehr wrote:Could be something somewhere is getting truncated from real to double, which would mean 12 fewer bits of mantisa. Maybe the FPU is set to lower precision in one of the threads?
Yes indeed, this is a _Windows_ "bug". I have experienced this in Windows before, the main thread's FPU state register is initialized to lower FPU-Precision (64bits) by default by the OS, presumably to make FP calculations faster. However, when you start a new thread, the FPU will use the whole 80 bits for computation because, curiously, the FPU is not reconfigured for those. Suggested fix: Add asm{fninit}; to the beginning of your main function, and the difference between the two will be gone. This would be a compatibility issue DMD/windows which disables the "real" data type. You might want to file a bug report for druntime if my suggested fix works. (This would imply that the real type was basically identical to the double type in Windows all along!)
Close: If I add this instruction to the function for the new thread, the difference goes away. The relevant statement is: auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } ); At any rate, this is a **huge** WTF that should probably be fixed in druntime. Once I understand it a little better, I'll file a bug report.
Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .
Apr 16 2011
On 4/16/2011 11:51 AM, Sean Kelly wrote:On Apr 16, 2011, at 11:43 AM, dsimcha wrote:Close: If I add this instruction to the function for the new thread, the difference goes away. The relevant statement is: auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } ); At any rate, this is a **huge** WTF that should probably be fixed in druntime. Once I understand it a little better, I'll file a bug report.
Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .
Really a Windows bug that should be fixed in druntime :-) I know I'm splitting hairs. This will be fixed for the next release.
The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.
Apr 16 2011
Walter:The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.
My congratulations to all the (mostly two) people involved in finding this bug and its causes :-) I'd like to see this module in Phobos. Bye, bearophile
Apr 16 2011
Sean Kelly wrote:On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright <newshound2 digitalmars.com> wrote:The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.
There is no option to set "80-bit precision" via the FPU control word.
??? Yes there is. enum PrecisionControl : short { PRECISION80 = 0x300, PRECISION64 = 0x200, PRECISION32 = 0x000 }; /** Set the number of bits of precision used by 'real'. * * Returns: the old precision. * This is not supported on all platforms. */ PrecisionControl reduceRealPrecision(PrecisionControl prec) { version(D_InlineAsm_X86) { short cont; asm { fstcw cont; mov CX, cont; mov AX, cont; and EAX, 0x0300; // Form the return value and CX, 0xFCFF; or CX, prec; mov cont, CX; fldcw cont; } } else { assert(0, "Not yet supported"); } }
Apr 20 2011
"Sean Kelly" <sean invisibleduck.org> wrote in message news:mailman.3597.1303316625.4748.digitalmars-d puremagic.com... On Apr 20, 2011, at 5:06 AM, Don wrote:Sean Kelly wrote:On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright <newshound2 digitalmars.com> wrote:The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.
64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
??? Yes there is. enum PrecisionControl : short { PRECISION80 = 0x300, PRECISION64 = 0x200, PRECISION32 = 0x000 }; So has Intel deprecated 80-bit FPU support? Why do the docs for this say that 64-bit is the highest precision? And more importantly, does this mean that we should be setting the PC field explicitly instead of relying on fninit? The docs say that fninit initializes to 64-bit precision. Or is that inaccurate as well?=
You misread the docs, it's talking about precision which is just the size of the mantisa, not the actual full size of the floating point data. IE... 80 float = 64 bit precision 64 float = 53 bit precision 32 float = 24 bit precision
Apr 20 2011
Sean Kelly wrote:On Apr 20, 2011, at 10:46 AM, JimBob wrote:"Sean Kelly" <sean invisibleduck.org> wrote in message news:mailman.3597.1303316625.4748.digitalmars-d puremagic.com... On Apr 20, 2011, at 5:06 AM, Don wrote:Sean Kelly wrote:On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright <newshound2 digitalmars.com> wrote:The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.
64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
enum PrecisionControl : short { PRECISION80 = 0x300, PRECISION64 = 0x200, PRECISION32 = 0x000 }; So has Intel deprecated 80-bit FPU support? Why do the docs for this say that 64-bit is the highest precision? And more importantly, does this mean that we should be setting the PC field explicitly instead of relying on fninit? The docs say that fninit initializes to 64-bit precision. Or is that inaccurate as well?=
the mantisa, not the actual full size of the floating point data. IE... 80 float = 64 bit precision 64 float = 53 bit precision 32 float = 24 bit precision
Oops, you're right. So to summarize: fninit does what we want because it sets 64-bit precision, which is effectively 80-bit mode. Is this correct?
Yes.
Apr 20 2011
On Apr 20, 2011, at 10:46 AM, JimBob wrote:=20 "Sean Kelly" <sean invisibleduck.org> wrote in message=20 news:mailman.3597.1303316625.4748.digitalmars-d puremagic.com... On Apr 20, 2011, at 5:06 AM, Don wrote: =20Sean Kelly wrote:On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright=20 <newshound2 digitalmars.com> wrote:=20 The dmd startup code (actually the C startup code) does an fninit. =
never thought about new thread starts. So, yeah, druntime should =
fninit on thread creation.
64-bit precision, which means that by default we aren't seeing the=20=
benefit of D's reals. I'd much prefer 80-bit precision by default.
=20 ??? Yes there is. =20 enum PrecisionControl : short { PRECISION80 =3D 0x300, PRECISION64 =3D 0x200, PRECISION32 =3D 0x000 }; =20 So has Intel deprecated 80-bit FPU support? Why do the docs for this =
that 64-bit is the highest precision? And more importantly, does this mean that =
should be setting the PC field explicitly instead of relying on fninit? The docs say =
fninit initializes to 64-bit precision. Or is that inaccurate as well?=3D
You misread the docs, it's talking about precision which is just the =
the mantisa, not the actual full size of the floating point data. =
=20 80 float =3D 64 bit precision 64 float =3D 53 bit precision 32 float =3D 24 bit precision
Oops, you're right. So to summarize: fninit does what we want because = it sets 64-bit precision, which is effectively 80-bit mode. Is this = correct?=
Apr 20 2011
On Apr 16, 2011, at 11:43 AM, dsimcha wrote:=20 Close: If I add this instruction to the function for the new thread, =
difference goes away. The relevant statement is: =20 auto t =3D new Thread( { asm { fninit; } res2 =3D sumRange(terms); } ); =20 At any rate, this is a **huge** WTF that should probably be fixed in druntime. Once I understand it a little better, I'll file a bug =
=20 Read up a little on what fninit does, etc. This is IMHO a druntime =
Really a Windows bug that should be fixed in druntime :-) I know I'm = splitting hairs. This will be fixed for the next release.
Apr 16 2011
On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright <newshound2 digitalmars.com> wrote:On 4/16/2011 11:51 AM, Sean Kelly wrote:On Apr 16, 2011, at 11:43 AM, dsimcha wrote:Close: If I add this instruction to the function for the new thread, the difference goes away. The relevant statement is: auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } ); At any rate, this is a **huge** WTF that should probably be fixed in druntime. Once I understand it a little better, I'll file a bug report.
Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .
Really a Windows bug that should be fixed in druntime :-) I know I'm splitting hairs. This will be fixed for the next release.
The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.
The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
Apr 16 2011
On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright =
=20 =20 The dmd startup code (actually the C startup code) does an fninit. I =
fninit on thread creation.=20 The documentation I've found on fninit seems to indicate it defaults =
benefit of D's reals. I'd much prefer 80-bit precision by default. There is no option to set "80-bit precision" via the FPU control word. = Section 8.1.5.2 of the Intel 64 SDM says the following: "The precision-control (PC) field (bits 8 and 9 of the x87 FPU control = word) determines the precision (64, 53, or 24 bits) of floating-point = calculations made by the x87 FPU (see Table 8-2). The default precision = is double extended precision, which uses the full 64-bit significand = available with the double extended-precision floating-point format of = the x87 FPU data registers. This setting is best suited for most = applications, because it allows applications to take full advantage of = the maximum precision available with the x87 FPU data registers." So it sounds like finit/fninit does what we want.=
Apr 19 2011
On Tue, 19 Apr 2011 14:18:46 -0400, Sean Kelly <sean invisibleduck.org> wrote:On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright <newshound2 digitalmars.com> wrote:The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.
The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
There is no option to set "80-bit precision" via the FPU control word. Section 8.1.5.2 of the Intel 64 SDM says the following: "The precision-control (PC) field (bits 8 and 9 of the x87 FPU control word) determines the precision (64, 53, or 24 bits) of floating-point calculations made by the x87 FPU (see Table 8-2). The default precision is double extended precision, which uses the full 64-bit significand available with the double extended-precision floating-point format of the x87 FPU data registers. This setting is best suited for most applications, because it allows applications to take full advantage of the maximum precision available with the x87 FPU data registers." So it sounds like finit/fninit does what we want.
Yes, that sounds right. Thanks for clarifying.
Apr 19 2011
On Apr 20, 2011, at 5:06 AM, Don wrote:Sean Kelly wrote:On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright =
=20 The dmd startup code (actually the C startup code) does an fninit. =
fninit on thread creation.The documentation I've found on fninit seems to indicate it defaults =
benefit of D's reals. I'd much prefer 80-bit precision by default.There is no option to set "80-bit precision" via the FPU control =
=20 ??? Yes there is. =20 enum PrecisionControl : short { PRECISION80 =3D 0x300, PRECISION64 =3D 0x200, PRECISION32 =3D 0x000 };
So has Intel deprecated 80-bit FPU support? Why do the docs for this = say that 64-bit is the highest precision? And more importantly, does = this mean that we should be setting the PC field explicitly instead of = relying on fninit? The docs say that fninit initializes to 64-bit = precision. Or is that inaccurate as well?=
Apr 20 2011









Walter Bright <newshound2 digitalmars.com> 