
digitalmars.D - Yet another strike against the current AA implementation

reply dsimcha <dsimcha yahoo.com> writes:
So I tried to create a struct to allow iteration over the keys or values of a
builtin associative array using a lazy forward range.  This would allow one to
pass the keys or values of an AA to any function that expects a regular old
forward range w/o any heap activity, eager copying, etc. and with O(1)
auxiliary memory usage.  Now that ranges are the lingua franca of iteration in
D, this seems like a pretty necessary feature.

The problem is that D's current AAs are based on a hash table with collision
resolution done using binary trees.  I guess this decision was made to allow
for O(log N) worst case performance even when there are ridiculous amounts of
hash collisions.  However, this is a very bad decision because, in exchange
for more bulletproof performance in corner cases, you lose:

1.  Every element is now (void*).sizeof bytes larger because you have to store
a left and right pointer.
2.  You can't iterate over a binary tree in O(1) auxiliary space, at least not
in any way I'm aware of.  This means that a lazy range to iterate over the
keys or values is relatively useless because it would generate heap activity
anyhow to store the stack or queue necessary to iterate over the tree, so you
may as well just use the current .keys or .values properties, which just
return an array.  (Because of the way that ranges are designed, you can't just
use the call stack, unless you use fibers or something, which is obviously not
a win from an efficiency point of view.)

Both of these are important in the more common, general case of sane hashing
performance.
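For concreteness, a range like the one described is straightforward over a chained hash table; the sketch below assumes buckets are singly linked lists (the Node/KeyRange names are hypothetical, not druntime's actual internals):

```d
// Hypothetical sketch: a lazy forward range over the keys of a hash
// table that resolves collisions with singly linked lists.  Iteration
// needs only a bucket index and a node pointer, i.e. O(1) auxiliary
// space -- exactly what a tree-based bucket cannot offer.
struct Node(K, V) { K key; V value; Node* next; }

struct KeyRange(K, V)
{
    private Node!(K, V)*[] buckets; // the table's bucket array
    private size_t bucket;          // index of the current bucket
    private Node!(K, V)* node;      // current node within that bucket

    this(Node!(K, V)*[] buckets)
    {
        this.buckets = buckets;
        skipEmptyBuckets();
    }

    bool empty() { return node is null; }
    K front() { return node.key; }

    void popFront()
    {
        node = node.next;
        if (node is null) { bucket++; skipEmptyBuckets(); }
    }

    private void skipEmptyBuckets()
    {
        while (bucket < buckets.length && buckets[bucket] is null)
            bucket++;
        node = bucket < buckets.length ? buckets[bucket] : null;
    }
}
```

Passing such a range to any function expecting a forward range would then involve no heap activity at all.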
Apr 26 2009
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2009-04-26 11:21:54 -0400, dsimcha <dsimcha yahoo.com> said:

 2.  You can't iterate over a binary tree in O(1) auxiliary space, at least not
 in any way I'm aware of.  This means that a lazy range to iterate over the
 keys or values is relatively useless because it would generate heap activity
 anyhow to store the stack or queue necessary to iterate over the tree, so you
 may as well just use the current .keys or .values properties, which just
 return an array.  (Because of the way that ranges are designed, you can't just
 use the call stack, unless you use fibers or something, which is obviously not
 a win from an efficiency point of view.)

Hum, are you saying opApply is superior when it comes to iterating in a tree? It seems that with opApply you could use the stack by recursively calling opApply with the given delegate on each one of the branches.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
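What Michel describes might look like this; TreeNode is a hypothetical stand-in for the AA's bucket trees, not the actual implementation:

```d
// Sketch: in-order traversal via opApply.  The recursion lives on the
// call stack, so no heap allocation is needed -- but the traversal
// cannot be paused and resumed the way a range can.
class TreeNode(T)
{
    T value;
    TreeNode left, right;

    int opApply(int delegate(ref T) dg)
    {
        if (left !is null)
            if (auto res = left.opApply(dg)) return res;
        if (auto res = dg(value)) return res;
        if (right !is null)
            if (auto res = right.opApply(dg)) return res;
        return 0;
    }
}
```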
Apr 26 2009
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Michel Fortin (michel.fortin michelf.com)'s article
 On 2009-04-26 11:21:54 -0400, dsimcha <dsimcha yahoo.com> said:
 2.  You can't iterate over a binary tree in O(1) auxiliary space, at least not
 in any way I'm aware of.  This means that a lazy range to iterate over the
 keys or values is relatively useless because it would generate heap activity
 anyhow to store the stack or queue necessary to iterate over the tree, so you
 may as well just use the current .keys or .values properties, which just
 return an array.  (Because of the way that ranges are designed, you can't just
 use the call stack, unless you use fibers or something, which is obviously not
 a win from an efficiency point of view.)

 Hum, are you saying opApply is superior when it comes to iterating in a tree? It seems that with opApply you could use the stack by recursively calling opApply with the given delegate on each one of the branches.

Exactly.  On the other hand, you lose a lot of flexibility with opApply.  If you tried to port most of std.range to operate on the opApply interface instead of the forward range interface, I doubt you'd get very far.

IMHO, though, opApply should *not* be deprecated.  opApply and ranges attempt to solve similar problems, but not the same problem.  Ranges attempt to solve the problem of iterating over some object with maximum flexibility from the point of view of the user of the object.  opApply attempts to solve the problem of iterating with maximum flexibility from the point of view of the implementer of the object.  In some cases, like the one you just described, the latter can be better.  However, it might make sense to document and give examples of this tradeoff somewhere.
Apr 26 2009
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2009-04-26 11:46:51 -0400, dsimcha <dsimcha yahoo.com> said:

 == Quote from Michel Fortin (michel.fortin michelf.com)'s article
 Hum, are you saying opApply is superior when it comes to iterating in a
 tree? It seems that with opApply you could use the stack by recursively
 calling opApply with the given delegate on each one of the branches.

 Exactly. On the other hand, you lose a lot of flexibility with opApply. If you tried to port most of std.range to operate on the opApply interface instead of the forward range interface, I doubt you'd get very far. IMHO, though, opApply should *not* be deprecated. opApply and ranges attempt to solve similar problems, but not the same problem. Ranges attempt to solve the problem of iterating over some object with maximum flexibility from the point of view of the user of the object. opApply attempts to solve the problem of iterating with maximum flexibility from the point of view of the implementer of the object. In some cases, like the one you just described, the latter can be better.

Indeed. I certainly agree that both ranges and opApply have their place. So what the implementer can do with opApply is write an optimized iteration algorithm for use with foreach. Which may mean that when both opApply and ranges are available for generating foreach's code, the compiler should favor opApply. Currently, I believe it's the reverse.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Apr 26 2009
next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Michel Fortin (michel.fortin michelf.com)'s article
 Indeed. I certainly agree that both ranges and opApply have their place.
 So what the implementer can do with opApply is write an optimized
 iteration algorithm for use with foreach. Which may mean that when both
 opApply and ranges are available for generating foreach's code, the
 compiler should favor opApply. Currently, I believe it's the reverse.

You probably have a point.  If the range interface is "good enough" for the implementer of the object to write an optimal implementation without defining opApply, then the implementer will probably not define opApply.  On the other hand, it could make sense to define a range interface for when the user of the object needs that flexibility and an opApply interface that is more efficient internally in the same object.

What I think it really boils down to, and what should be emphasized in the documentation, is control of the stack.  If the implementer of the object needs control of the stack during iteration, use opApply to get this control.  Otherwise, use ranges to allow the user more control over the iteration process.

However, what I care more about, upon thinking more about it, is that the concept of an iterable gets defined "officially" in Phobos and in the documentation.  An iterable is any object of type T such that the following code works:

foreach(elem; T.init) {
    // do stuff.
}

This can be considered a superset of input ranges, since all input ranges are iterables but not all iterables are input ranges.  Of course, stuff that works like:

foreach(key, value; T.init) {
    // do stuff.
}

uses opApply but would not be considered an iterable because it iterates more than one item.  Therefore, strictly speaking, the model is broken, but these kinds of oddball situations are few enough and far enough between that I think they can be ignored, at least until the issue of making ranges do this kind of thing too is addressed.

The idea, then, is that if all you need is an iterable for some generic function in Phobos or in any 3rd party library, make the constraint be that it only requires an iterable.  To encourage this, an isIterable template should be included in Phobos, and std.range.ElementType should be generalized to return the element type for all iterables, not just ranges.
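One possible shape for such an isIterable template, keyed on whether a foreach over T.init compiles (a sketch only, not necessarily the form Phobos later adopted):

```d
// Sketch: T is iterable if `foreach (elem; T.init)` compiles.
template isIterable(T)
{
    enum bool isIterable = is(typeof({
        foreach (elem; T.init) {}
    }));
}

// Arrays are iterable; a bare int is not.
static assert(isIterable!(int[]));
static assert(!isIterable!(int));
```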
Apr 26 2009
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Michel Fortin wrote:
 On 2009-04-26 11:46:51 -0400, dsimcha <dsimcha yahoo.com> said:
 
 == Quote from Michel Fortin (michel.fortin michelf.com)'s article
 Hum, are you saying opApply is superior when it comes to iterating in a
 tree? It seems that with opApply you could use the stack by recursively
 calling opApply with the given delegate on each one of the branches.

 Exactly. On the other hand, you lose a lot of flexibility with opApply. If you tried to port most of std.range to operate on the opApply interface instead of the forward range interface, I doubt you'd get very far. IMHO, though, opApply should *not* be deprecated. opApply and ranges attempt to solve similar problems, but not the same problem. Ranges attempt to solve the problem of iterating over some object with maximum flexibility from the point of view of the user of the object. opApply attempts to solve the problem of iterating with maximum flexibility from the point of view of the implementer of the object. In some cases, like the one you just described, the latter can be better.

Indeed. I certainly agree that both ranges and opApply have their place. So what the implementer can do with opApply is write an optimized iteration algorithm for use with foreach. Which may mean that when both opApply and ranges are available for generating foreach's code, the compiler should favor opApply. Currently, I believe it's the reverse.

I think, all other things being equal, that opApply tends to be slower than iterators. Andrei
Apr 26 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Steven Schveighoffer wrote:
 On Sun, 26 Apr 2009 21:20:32 -0400, Andrei Alexandrescu 
 I think, all other things being equal, that opApply tends to be slower 
 than iterators.

 It depends.  The range must have either inline-able functions, or must NOT call virtual functions.  Otherwise, it could be worse.  There is also the factor that you are using an inner function with opApply, so if you regularly access variables outside the foreach loop, then those add up.

 Observe that in opApply style, you do one delegate call per loop iteration.  With range style, you do three calls on the range per loop iteration (empty, front, popFront).  If those range calls call virtual calls or ARE virtual calls, then range loses.  Of course, the difference might be washed out if you do lots of accesses to an outer variable in your foreach loop, which require a pointer dereference vs. a stack dereference.  We should quantify this and see what the actual performance is.

This is a very good point. If a range offers virtual empty/popFront/front, then opApply is a faster proposition. This makes for a good use case for making opApply the preferred way of iteration, if present (if you just do foreach and opApply is present, use it). Andrei
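The call-count difference discussed above can be made concrete by hand-lowering both foreach forms over one type that supports both protocols (Counter is a made-up example type):

```d
import std.stdio;

// A type supporting both iteration protocols.
struct Counter
{
    uint n, limit;

    // Range protocol: three member calls per iteration.
    bool empty() { return n >= limit; }
    uint front() { return n; }
    void popFront() { n++; }

    // opApply protocol: one indirect (delegate) call per iteration.
    int opApply(int delegate(ref uint) dg)
    {
        for (uint i = n; i < limit; i++)
            if (auto res = dg(i)) return res;
        return 0;
    }
}

void main()
{
    uint sum1, sum2;

    // What foreach over a range roughly lowers to:
    for (auto r = Counter(0, 5); !r.empty(); r.popFront())
        sum1 += r.front();

    // What foreach via opApply roughly lowers to: the loop body becomes
    // a delegate, so `sum2` is reached through the enclosing frame.
    Counter(0, 5).opApply((ref uint i) { sum2 += i; return 0; });

    writeln(sum1, " ", sum2); // both sums are 10
}
```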
Apr 27 2009
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
dsimcha wrote:
 == Quote from Michel Fortin (michel.fortin michelf.com)'s article
 On 2009-04-26 11:21:54 -0400, dsimcha <dsimcha yahoo.com> said:
 2.  You can't iterate over a binary tree in O(1) auxiliary space, at least not
 in any way I'm aware of.  This means that a lazy range to iterate over the
 keys or values is relatively useless because it would generate heap activity
 anyhow to store the stack or queue necessary to iterate over the tree, so you
 may as well just use the current .keys or .values properties, which just
 return an array.  (Because of the way that ranges are designed, you can't just
 use the call stack, unless you use fibers or something, which is obviously not
 a win from an efficiency point of view.)

 Hum, are you saying opApply is superior when it comes to iterating in a tree? It seems that with opApply you could use the stack by recursively calling opApply with the given delegate on each one of the branches.

 Exactly. On the other hand, you lose a lot of flexibility with opApply. If you tried to port most of std.range to operate on the opApply interface instead of the forward range interface, I doubt you'd get very far. IMHO, though, opApply should *not* be deprecated. opApply and ranges attempt to solve similar problems, but not the same problem. Ranges attempt to solve the problem of iterating over some object with maximum flexibility from the point of view of the user of the object. opApply attempts to solve the problem of iterating with maximum flexibility from the point of view of the implementer of the object. In some cases, like the one you just described, the latter can be better. However, it might make sense to document and give examples of this tradeoff somewhere.

1.  Depth of iteration is low, so a short vector optimization will most of the time solve the problem without allocating extra memory.

2.  I haven't measured, but the cost of the indirect call is large enough to make me suspect that opApply is not as efficient as it's cracked to be, even when compared with an iterator.

Andrei
Apr 26 2009
parent reply Christopher Wright <dhasenan gmail.com> writes:
Andrei Alexandrescu wrote:
 2. I haven't measured, but the cost of the indirect call is large enough 
 to make me suspect that opApply is not as efficient as it's cracked to 
 be, even when compared with an iterator.

When you know the type beforehand or can use templates, that is, rather than wrapping your range struct in a wrapper class. If you can't use a template for whatever reason, ranges are going to suck -- three virtual calls rather than one. I don't usually care sufficiently about performance to worry about whether a call is virtual or not, but you brought that issue up before. And I imagine that, most of the time, you will know the range type in advance.
Apr 26 2009
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Christopher Wright wrote:
 Andrei Alexandrescu wrote:
 2. I haven't measured, but the cost of the indirect call is large 
 enough to make me suspect that opApply is not as efficient as it's 
 cracked to be, even when compared with an iterator.

When you know the type beforehand or can use templates, that is, rather than wrapping your range struct in a wrapper class. If you can't use a template for whatever reason, ranges are going to suck -- three virtual calls rather than one. I don't usually care sufficiently about performance to worry about whether a call is virtual or not, but you brought that issue up before. And I imagine that, most of the time, you will know the range type in advance.

Yah, that's a good motivation to change how hashes are currently handled. But above I was referring to the cost of opApply's callback, which adds to the costs you mention. Andrei
Apr 26 2009
prev sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Christopher Wright (dhasenan gmail.com)'s article
 Andrei Alexandrescu wrote:
 2. I haven't measured, but the cost of the indirect call is large enough
 to make me suspect that opApply is not as efficient as it's cracked to
 be, even when compared with an iterator.

 When you know the type beforehand or can use templates, that is, rather than wrapping your range struct in a wrapper class. If you can't use a template for whatever reason, ranges are going to suck -- three virtual calls rather than one. I don't usually care sufficiently about performance to worry about whether a call is virtual or not, but you brought that issue up before. And I imagine that, most of the time, you will know the range type in advance.

Rule number 1 of all performance discussions:  Always measure if possible.

import std.stdio, std.perf;

struct Direct {
    uint n;

    uint front() { return n; }
    void popFront() { n++; }
    bool empty() { return n >= 1_000_000_000; }
}

struct Apply {
    int opApply(int delegate(ref uint) dg) {
        int res = 0;
        foreach(i; 0U..1_000_000_000) {
            res = dg(i);
            if(res) break;
        }
        return res;
    }
}

class Virtual {
    uint n;

    uint front() { return n; }
    void popFront() { n++; }
    bool empty() { return n >= 1_000_000_000; }
}

void main() {
    scope pc = new PerformanceCounter;

    pc.start;
    foreach(elem; Direct.init) {}
    pc.stop;
    writeln("Direct:  ", pc.milliseconds);

    pc.start;
    auto v = new Virtual;
    foreach(elem; v) {}
    pc.stop;
    writeln("Virtual:  ", pc.milliseconds);

    pc.start;
    foreach(elem; Apply.init) {}
    pc.stop;
    writeln("opApply:  ", pc.milliseconds);
}

Output:
Direct:  2343
Virtual:  5695
opApply:  3014

Bottom line is that these pretty much come in the order you would expect them to, but there are no particularly drastic differences between any of them.  To put these timings in perspective, 5700 ms for 1 billion iterations is roughly (on a 2.7 GHz machine) 15 clock cycles per iteration.  How often does anyone really have code that is performance critical *and* where the contents of the loop don't take long enough to dwarf the 15 clock cycles per iteration loop overhead *and* you need the iteration to be polymorphic?
Apr 26 2009
next sibling parent reply Robert Fraser <fraserofthenight gmail.com> writes:
dsimcha wrote:
 == Quote from Christopher Wright (dhasenan gmail.com)'s article
 Andrei Alexandrescu wrote:
 2. I haven't measured, but the cost of the indirect call is large enough
 to make me suspect that opApply is not as efficient as it's cracked to
 be, even when compared with an iterator.

 [snip]

 Rule number 1 of all performance discussions:  Always measure if possible.

 [snip benchmark code]

 Output:
 Direct:  2343
 Virtual:  5695
 opApply:  3014

 Bottom line is that these pretty much come in the order you would expect them to, but there are no particularly drastic differences between any of them. [snip]

Did you compile that with optimization/inlining? If not, it's not really a fair test... and if so, why isn't the compiler getting rid of those pointless loops?
Apr 26 2009
parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Robert Fraser (fraserofthenight gmail.com)'s article
 Did you compile that with optimization/inlining? If not, it's not really
 a fair test... and if so why isn't the compiler getting rid of those
 pointless loops?

Yes, -O -inline -release. I didn't bother stating that explicitly b/c it's the standard way of doing benchmarks.
Apr 27 2009
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2009-04-27 00:50:23 -0400, dsimcha <dsimcha yahoo.com> said:

 Output:
 Direct:  2343
 Virtual:  5695
 opApply:  3014

Nice. Isn't there room for optimisation on the compiler side though? I mean, the compiler could inline opApply, and while doing so it could notice that the delegate is constant and inline it too. I suspect that adding this optimisation would put opApply at the same performance level as ranges.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Apr 27 2009
next sibling parent reply Daniel Keep <daniel.keep.lists gmail.com> writes:
Michel Fortin wrote:
 On 2009-04-27 00:50:23 -0400, dsimcha <dsimcha yahoo.com> said:
 
 Output:
 Direct:  2343
 Virtual:  5695
 opApply:  3014

Nice. Isn't there room for optimisation on the compiler side though? I mean, the compiler could inline opApply, and while doing so it could notice that the delegate is constant and inline it too. I suspect that adding this optimisation would put opApply at the same performance level as ranges.

Inlining does not automatically make things faster. Case in point: downs' raytracer stacy actually slows down when you inline certain parts. The penalty of not fitting in cache was overwhelming the penalty from using virtual methods. -- Daniel
Apr 27 2009
parent bearophile <bearophileHUGS lycos.com> writes:
 Michel Fortin:
 Isn't there room for optimisation on the compiler side though? I mean,
 the compiler could inline opApply, and while doing so it could notice
 that the delegate is constant and inline it too. I suspect that adding
 this optimisation would put opApply at the same performance level as
 ranges.


Daniel Keep:
 Inlining does not automatically make things faster.

That's true in general: if you inline too much code you may end up with too much code to be moved inside and out of the caches, reducing the performance. I agree with Michel, if done wisely by the compiler some inlining may help with such loops. Bye, bearophile
Apr 27 2009
prev sibling parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Michel Fortin wrote:
 On 2009-04-27 00:50:23 -0400, dsimcha <dsimcha yahoo.com> said:
 
 Output:
 Direct:  2343
 Virtual:  5695
 opApply:  3014

Nice. Isn't there room for optimisation on the compiler side though? I mean, the compiler could inline opApply, and while doing so it could notice that the delegate is constant and inline it too.

For non-virtual opApply this is exactly what LDC does[1] at the higher optimization levels :). (Assuming both opApply and the foreach body are deemed inline-worthy by the inliner) [1]: Since trunk r1219, which was about 11 days ago.
 I suspect that adding
 this optimisation would put opApply at the same performance level as
 ranges.

I haven't actually measured to see if this is true, but there should indeed be very little difference since this optimization essentially turns opApply into a regular loop (and LDC does this before any loop optimizations run).
Apr 27 2009
parent bearophile <bearophileHUGS lycos.com> writes:
Frits van Bommel:
 I haven't actually measured to see if this is true, but there should indeed be 
 very little difference since this optimization essentially turns opApply into
a 
 regular loop (and LDC does this before any loop optimizations run).

When you find a bit of free time please time it and show the timings :-) Bye, bearophile
Apr 27 2009
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
dsimcha:
 Rule number 1 of all performance discussions:  Always measure if possible.

Very good, thank you. I have run your code, with a bit of changes:

import std.stdio: writeln;
import std.perf: PerformanceCounter;

enum uint N = 1_000_000_000;

struct Direct {
    uint n;

    uint front() { return n; }
    void popFront() { n++; }
    bool empty() { return n >= N; }
}

struct Apply {
    int opApply(int delegate(ref uint) dg) {
        int res;
        foreach (i; 0U .. N) {
            res = dg(i);
            if (res) break;
        }
        return res;
    }
}

class Virtual {
    uint n;

    uint front() { return n; }
    void popFront() { n++; }
    bool empty() { return n >= N; }
}

void main() {
    scope pc = new PerformanceCounter;

    pc.start;
    foreach (elem; Direct.init) {}
    pc.stop;
    writeln("Direct: ", pc.milliseconds);

    pc.start;
    foreach (elem; Apply.init) {}
    pc.stop;
    writeln("opApply: ", pc.milliseconds);

    pc.start;
    auto v = new Virtual;
    foreach (elem; v) {}
    pc.stop;
    writeln("Virtual: ", pc.milliseconds);

    pc.start;
    scope v2 = new Virtual;
    foreach (elem; v2) {}
    pc.stop;
    writeln("Scoped virtual: ", pc.milliseconds);
}

As you see I have added a version with a scoped class at the bottom hoping to improve performance a bit, but the result is the opposite; can you explain why?

Direct: 2699
opApply: 3520
Virtual: 7543
Scoped virtual: 8550

Run on a Core2 2 GHz, on WinXP.

Bye,
bearophile
Apr 27 2009
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
dsimcha wrote:
 Output:
 Direct:  2343
 Virtual:  5695
 opApply:  3014
 
 Bottom line is that these pretty much come in the order you would expect them
to,
 but there are no particularly drastic differences between any of them.  To put
 these timings in perspective, 5700 ms for 1 billion iterations is roughly (on a
 2.7 GHz machine) 15 clock cycles per iteration.  How often does anyone really
have
 code that is performance critical *and* where the contents of the loop don't
take
 long enough to dwarf the 15 clock cycles per iteration loop overhead *and* you
 need the iteration to be polymorphic?

Thanks for the numbers. The way I'd read them is not to look at the milliseconds but at the ratios. Loops with opApply have an extra 28% overhead, and I think that doesn't make them look very good. Bearophile's numbers show a 30% overhead. I have measurements that show more dramatic pessimization, but I never keep those around. I know from my early tests for std.algorithm that had I used opApply in it, I would have essentially never used an algorithm if I could help it. Andrei
Apr 27 2009
prev sibling parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
dsimcha wrote:
[snip]
 
 Output:
 Direct:  2343
 Virtual:  5695
 opApply:  3014
 
 Bottom line is that these pretty much come in the order you would expect them
to,
 but there are no particularly drastic differences between any of them.  To put
 these timings in perspective, 5700 ms for 1 billion iterations is roughly (on a
 2.7 GHz machine) 15 clock cycles per iteration.  How often does anyone really
have
 code that is performance critical *and* where the contents of the loop don't
take
 long enough to dwarf the 15 clock cycles per iteration loop overhead *and* you
 need the iteration to be polymorphic?

I edited this code to work with ldc (D1) + Tango, and saw the Direct and opApply cases generate identical code (inc, cmp, jne, with the loop counter in a register) [1], so they're equally fast (modulo process scheduling randomness). Virtual was roughly 10 times slower on my machine (with ldc).

Unfortunately, I can't directly compare timings between ldc and dmd because dmd is likely at a disadvantage due to being 32-bit in a 64-bit world. Although... the Virtual case takes about equal time with ldc- and dmd-compiled code, so maybe the slowness of Direct/dmd when compared to Direct/ldc (the dmd code is a factor 3 slower) is due to it apparently not register-allocating the loop variable. The opApply case was another factor 2 slower than Direct with dmd on my machine, probably because opApply and the loop body don't get inlined.

It seems gdc is the only compiler to realize the first loop can be completely optimized away. It's again about equally fast for Virtual, but for opApply it's roughly a factor 3 slower than ldc; it seems to inline only opApply itself, not the loop body.

[1]: -O3 -release (with inlining), x86_64
Apr 27 2009
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2009-04-27 10:51:22 -0400, Frits van Bommel 
<fvbommel REMwOVExCAPSs.nl> said:

 I edited this code to work with ldc (D1) + Tango, and saw the Direct 
 and opApply cases generate identical code (inc, cmp, jne, with the loop 
 counter in a register) [1], so they're equally fast (modulo process 
 scheduling randomness).

Thank you for your timings. I think it shows my point: that by preferring ranges over opApply we're just optimising around a deficiency in DMD's optimizer.

I'm thinking... with proper inlining, perhaps we could take the notion of ranges out of the compiler and just define a generic opApply in std.range that uses front, popFront, and empty. :-) Perhaps.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Apr 28 2009
parent reply Daniel Keep <daniel.keep.lists gmail.com> writes:
Michel Fortin wrote:
 On 2009-04-27 10:51:22 -0400, Frits van Bommel
 <fvbommel REMwOVExCAPSs.nl> said:
 
 I edited this code to work with ldc (D1) + Tango, and saw the Direct
 and opApply cases generate identical code (inc, cmp, jne, with the
 loop counter in a register) [1], so they're equally fast (modulo
 process scheduling randomness).

 Thank you for your timings. I think it shows my point: that by preferring ranges over opApply we're just optimising around a deficiency in DMD's optimizer.

Not true. Here's an excellent reason to use ranges over opApply: you can't define zip with opApply. Because opApply uses inversion of control, you can't use more than one without bringing threads into the equation.
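A sketch of Daniel's point: with ranges, the caller holds both sequences and advances them in lockstep, which a callee-driven opApply cannot do without fibers. (This hand-rolled Zip/zip is an illustration, not the Phobos API of the time.)

```d
import std.stdio;
import std.typecons : tuple;

// Lockstep iteration over two ranges: only possible because the
// *caller* controls when each range advances.
struct Zip(R1, R2)
{
    R1 a;
    R2 b;

    bool empty() { return a.empty || b.empty; }
    auto front() { return tuple(a.front, b.front); }
    void popFront() { a.popFront(); b.popFront(); }
}

auto zip(R1, R2)(R1 a, R2 b) { return Zip!(R1, R2)(a, b); }

void main()
{
    import std.range : iota;

    foreach (pair; zip(iota(0, 3), iota(10, 13)))
        writeln(pair[0], " ", pair[1]); // prints "0 10", "1 11", "2 12"
}
```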
 I'm thinking... with proper inlining, perhaps we could take the notion
 of ranges out of the compiler and just define a generic opApply in
 std.range that use front, popFront, and empty. :-) Perhaps.

I suspect supporting ranges is just much easier. -- Daniel
Apr 28 2009
next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2009-04-28 06:33:13 -0400, Daniel Keep <daniel.keep.lists gmail.com> said:

 Michel Fortin wrote:
 On 2009-04-27 10:51:22 -0400, Frits van Bommel
 <fvbommel REMwOVExCAPSs.nl> said:
 
 I edited this code to work with ldc (D1) + Tango, and saw the Direct
 and opApply cases generate identical code (inc, cmp, jne, with the
 loop counter in a register) [1], so they're equally fast (modulo
 process scheduling randomness).

 Thank you for your timings. I think it shows my point: that by preferring ranges over opApply we're just optimising around a deficiency in DMD's optimizer.

Not true. Here's an excellent reason to use ranges over opApply: you can't define zip with opApply. Because opApply uses inversion of control, you can't use more than one without bringing threads into the equation.

I guess I removed too much context from the above posts. We're just timing various foreach implementations. You're right when you say ranges are more versatile than opApply, and I'm all for keeping both ranges and opApply. I just want the compiler to prefer opApply over ranges when both are available when generating code for foreach, since with opApply you can sometimes optimize things in a way that you can't with ranges.
 I'm thinking... with proper inlining, perhaps we could take the notion
 of ranges out of the compiler and just define a generic opApply in
 std.range that use front, popFront, and empty. :-) Perhaps.

I suspect supporting ranges is just much easier.

Sure, especially since they're already implemented in the compiler. Inlining of delegates known at compile-time would have a greater reach though.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Apr 28 2009
prev sibling next sibling parent reply downs <default_357-line yahoo.de> writes:
Daniel Keep wrote:
 
 Michel Fortin wrote:
 On 2009-04-27 10:51:22 -0400, Frits van Bommel
 <fvbommel REMwOVExCAPSs.nl> said:

 I edited this code to work with ldc (D1) + Tango, and saw the Direct
 and opApply cases generate identical code (inc, cmp, jne, with the
 loop counter in a register) [1], so they're equally fast (modulo
 process scheduling randomness).

 Thank you for your timings. I think it shows my point: that by preferring ranges over opApply we're just optimising around a deficiency in DMD's optimizer.

Not true. Here's an excellent reason to use ranges over opApply: you can't define zip with opApply. Because opApply uses inversion of control, you can't use more than one without bringing threads into the equation.

Your point stands, of course, but I just wanted to mention that stackthreads/fibers work too and have far less overhead.
 I'm thinking... with proper inlining, perhaps we could take the notion
 of ranges out of the compiler and just define a generic opApply in
 std.range that use front, popFront, and empty. :-) Perhaps.

I suspect supporting ranges is just much easier. -- Daniel

Apr 28 2009
parent reply grauzone <none example.net> writes:
downs wrote:
 Daniel Keep wrote:
 Michel Fortin wrote:
 On 2009-04-27 10:51:22 -0400, Frits van Bommel
 <fvbommel REMwOVExCAPSs.nl> said:

 I edited this code to work with ldc (D1) + Tango, and saw the Direct
 and opApply cases generate identical code (inc, cmp, jne, with the
 loop counter in a register) [1], so they're equally fast (modulo
 process scheduling randomness).

I think it shows my point: that by preferring ranges over opApply we're just optimising around a deficiency in DMD's optimizer.

Not true. Here's an excellent reason to use ranges over opApply: you can't define zip with opApply. Because opApply uses inversion of control, you can't use more than one without bringing threads into the equation.

Your point stands, of course, but I just wanted to mention that stackthreads/fibers work too and have far less overhead.

First, don't they have LOTS of memory overhead, because each fiber needs a new/separate stack? Second, isn't the overhead still relatively high compared to simple function calls?
Apr 28 2009
parent reply "Joel C. Salomon" <joelcsalomon gmail.com> writes:
grauzone wrote:
 downs wrote:
 Daniel Keep wrote:
 Here's an excellent reason to use ranges over opApply: you
 can't define zip with opApply.  Because opApply uses inversion of
 control, you can't use more than one without bringing threads into the
 equation.

Your point stands, of course, but I just wanted to mention that stackthreads/fibers work too and have far less overhead.

First, don't they have LOTS of memory overhead, because each fiber needs a new/separate stack?

A fiber-specific stack needn’t be very large. A few KB is often enough even for long-running threads; if the call stack is only going to be a few levels deep you might get away with a few hundred bytes each. —Joel Salomon
Apr 28 2009
parent reply downs <default_357-line yahoo.de> writes:
Joel C. Salomon wrote:
 grauzone wrote:
 downs wrote:
 Daniel Keep wrote:
 Here's an excellent reason to use ranges over opApply: you
 can't define zip with opApply.  Because opApply uses inversion of
 control, you can't use more than one without bringing threads into the
 equation.

Your point stands, of course, but I just wanted to mention that stackthreads/fibers work too and have far less overhead.

First, don't they have LOTS of memory overhead, because each fiber needs a new/separate stack?

A fiber-specific stack needn’t be very large. A few KB is often enough even for long-running threads; if the call stack is only going to be a few levels deep you might get away with a few hundred bytes each. —Joel Salomon

Furthermore, with anonymous mmapped memory the pages are only allocated when accessed - the only thing taken up front is address space. Which, granted, can be limiting. Luckily 64-bit will fix that. :)
Apr 28 2009
parent reply Georg Wrede <georg.wrede iki.fi> writes:
downs wrote:
 Joel C. Salomon wrote:
 grauzone wrote:
 downs wrote:
 Daniel Keep wrote:
 Here's an excellent reason to use ranges over opApply: you
 can't define zip with opApply.  Because opApply uses inversion of
 control, you can't use more than one without bringing threads into the
 equation.

Your point stands, of course, but I just wanted to mention that stackthreads/fibers work too and have far less overhead.

First, don't they have LOTS of memory overhead, because each fiber needs a new/separate stack?

A fiber-specific stack needn’t be very large. A few KB is often enough even for long-running threads; if the call stack is only going to be a few levels deep you might get away with a few hundred bytes each.


The Commodore-64 had a stack of 0.25kB shared between your Basic program, subroutine calls, and the operating system. While not unheard of, it still was unusual to have to make amendments to program design because of stack limitations, even in large programs.

*If* we had a convenient way to record the high-water mark of stack usage for functions (and maybe also threads), then people could have new insights on how little stack space is needed. This would be /very/ convenient once we start using thread-specific stacks, because the stack space has to be allocated in advance, hopefully not wasting huge amounts of space "just in case". Andrei, Walter???

PS, on the 6502 (the C64 CPU), the stack was hard-wired to 0100-01FF. This made the processor vastly simpler, which was essential for cost savings in an era where an additional transistor in the processor cost an arm and a leg. The area before the stack was used with 8-bit addressing, which was faster, and traditionally it was used more or less as an extension to the processor registers, because the 6502 only had the X, Y, A, PC, and Flags registers. The area after the stack was free for the circuit-board designer to use in any way he liked. On the C64 this was used for ROM, memory-mapped peripherals, IO, the heap, and the program area. Interestingly, self-modifying code was used for keyboard input, to enhance speed and code size. This concept was copied from the VIC-20, which came with 3k of usable RAM.
Apr 29 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Georg Wrede wrote:
 downs wrote:
 Joel C. Salomon wrote:
 grauzone wrote:
 downs wrote:
 Daniel Keep wrote:
 Here's an excellent reason to use ranges over opApply: you
 can't define zip with opApply.  Because opApply uses inversion of
 control, you can't use more than one without bringing threads into 
 the
 equation.

Your point stands, of course, but I just wanted to mention that stackthreads/fibers work too and have far less overhead.

First, don't they have LOTS of memory overhead, because each fiber needs a new/separate stack?

A fiber-specific stack needn’t be very large. A few KB is often enough even for long-running threads; if the call stack is only going to be a few levels deep you might get away with a few hundred bytes each.


The Commodore-64 had a stack of 0.25kB shared between your Basic program, subroutine calls, and the operating system. While not unheard of, it still was unusual to have to make amendments to program design because of stack limitations, even in large programs. *If* we had a convenient way to record the high-water mark of stack usage for functions (and maybe also threads), then people could have new insights on how little stack space is needed. This would be /very/ convenient once we start using thread specific stacks, because the stack space has to be allocated in advance, hopefully not wasting huge amounts of space "just in case". Andrei, Walter???

Since the C64, programs have gotten a lot bigger, programming style favors deeper call stacks, and (mutually) recursive functions are not a rarity. Computing the stack depth even for non-recursive functions is an interprocedural analysis, so it's difficult to do modularly. The stack is also unchecked, so going with a "good enough" approximation is bound to be risky. Andrei
Apr 29 2009
next sibling parent reply Georg Wrede <georg.wrede iki.fi> writes:
Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 downs wrote:
 Joel C. Salomon wrote:
 grauzone wrote:
 downs wrote:
 Daniel Keep wrote:
 Here's an excellent reason to use ranges over opApply: you
 can't define zip with opApply.  Because opApply uses inversion of
 control, you can't use more than one without bringing threads 
 into the
 equation.

Your point stands, of course, but I just wanted to mention that stackthreads/fibers work too and have far less overhead.

First, don't they have LOTS of memory overhead, because each fiber needs a new/separate stack?

A fiber-specific stack needn’t be very large. A few KB is often enough even for long-running threads; if the call stack is only going to be a few levels deep you might get away with a few hundred bytes each.


The Commodore-64 had a stack of 0.25kB shared between your Basic program, subroutine calls, and the operating system. While not unheard of, it still was unusual to have to make amendments to program design because of stack limitations, even in large programs. *If* we had a convenient way to record the high-water mark of stack usage for functions (and maybe also threads), then people could have new insights on how little stack space is needed. This would be /very/ convenient once we start using thread specific stacks, because the stack space has to be allocated in advance, hopefully not wasting huge amounts of space "just in case". Andrei, Walter???

Since C64 programs have gotten a lot bigger, programming style favors deeper call stacks, and (mutually) recursive functions are not a rarity. Computing the stack depth even for non-recursive functions is an interprocedural analysis so it's difficult to do modularly. The stack is also unchecked so going with a "good enough" approximation is bound to be risky.

The old size estimation problem. No matter how big you decide on, somewhere is an idiot who will blow that size too.

A runtime solution just occurred to me. The newly allocated stack for a thread is zeroed first, right? Well then, just before the thread finishes, one could scan the stack area and find the first address that contains a non-zero value. This isn't bulletproof in theory, but hey, the aim is to help in /estimating/ the needed stack size. If this shows the maximum stack usage as 160 bytes, then today's youngsters could allocate, say, 0.25k (or even 1k) instead of the 50k they'd otherwise allocate.

Incidentally, this raises another thought: what if the stack was extendable? Like D arrays? A stack overflow would trigger a relocation of the stack, and then continue. Being able to have hundreds of small stacks would be useful in other things than threads, too.
Apr 29 2009
next sibling parent reply "Joel C. Salomon" <joelcsalomon gmail.com> writes:
Georg Wrede wrote:
 *If* we had a convenient way to record the high-water mark of stack
 usage for functions (and maybe also threads), then people could have new
 insights on how little stack space is needed.
 
 This would be /very/ convenient once we start using thread specific
 stacks, because the stack space has to be allocated in advance,
 hopefully not wasting huge amounts of space "just in case".

What do you mean by “thread-specific stacks”?
 Incidentally, this raises another thought: what if the stack was
 extendable? Like D arrays? A stack overflow would trigger a relocation
 of the stack, and then continue. Being able to have hundreds of small
 stacks would be useful in other things than threads, too.

How much rewriting of the stack would have to happen? Relocating the stack will invalidate any pointers to automatic (in the C sense) variables. Detecting stack overflow at runtime might also not be trivial. —Joel Salomon
Apr 29 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Joel C. Salomon wrote:
 Detecting stack overflow at runtime might also not be trivial.

In a way it is trivial... if you accept a 7% added overhead per function call :o). (The number is what I remember it cost in Borland Pascal.) Andrei
Apr 29 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 [Detecting stack overflow at runtime] In a way it is trivial... if you accept
a 7% added overhead per function call :o). (The number is what I remember
costed Borland Pascal.)<

Now Delphi and FreePascal allow you to switch that on and off. From the page http://www.freepascal.org/docs-html/prog/progsu100.html :

1.2.24 $S : Stack checking

The {$S+} directive tells the compiler to generate stack checking code. This generates code to check if a stack overflow occurred, i.e. to see whether the stack has grown beyond its maximally allowed size. If the stack grows beyond the maximum size, then a run-time error is generated, and the program will exit with exit code 202. Specifying {$S-} will turn generation of stack-checking code off. The command line compiler switch -Ct has the same effect as the {$S+} directive. By default, no stack checking is performed.

I have never timed programs to test how much you have to pay if you want stack-checking code. The compiler is free, and suitable testing programs can be found in the Shootout site, so if you want you don't need much time to test this.

Bye, bearophile
Apr 29 2009
parent Daniel Keep <daniel.keep.lists gmail.com> writes:
bearophile wrote:
 Andrei Alexandrescu:
 [Detecting stack overflow at runtime] In a way it is trivial... if you accept
a 7% added overhead per function call :o). (The number is what I remember
costed Borland Pascal.)<

Now Delphi and FreePascal allow to switch that on and off. From the page: http://www.freepascal.org/docs-html/prog/progsu100.html 1.2.24 $S : Stack checking The {$S+} directive tells the compiler to generate stack checking code. This generates code to check if a stack overflow occurred, i.e. to see whether the stack has grown beyond its maximally allowed size. If the stack grows beyond the maximum size, then a run-time error is generated, and the program will exit with exit code 202. Specifying {$S-} will turn generation of stack-checking code off. The command line compiler switch -Ct has the same effect as the {$S+} directive. By default, no stack checking is performed. I have never timed programs to test how much you have to pay if you want stack checking code. The compiler is free, and suitable testing programs can be found in the Shootout site, so if you want you don't need much time to test this. Bye, bearophile

Doesn't the OS protect you from stack overflows? -- Daniel
Apr 29 2009
prev sibling parent Leandro Lucarella <llucax gmail.com> writes:
Andrei Alexandrescu, el 29 de abril a las 09:58 me escribiste:
 Joel C. Salomon wrote:
Detecting stack overflow at runtime might also not be trivial.

In a way it is trivial... if you accept a 7% added overhead per function call :o). (The number is what I remember costed Borland Pascal.)

Using page protection has no overhead, unless you actually hit a stack overflow. -- Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/ ---------------------------------------------------------------------------- GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05) ----------------------------------------------------------------------------
Apr 29 2009
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 Computing the stack depth even for non-recursive functions is an 
 interprocedural analysis so it's difficult to do modularly. The stack 
 is also unchecked so going with a "good enough" approximation is bound 
 to be risky.

The old size estimation problem. No matter how big you decide on, somewhere is an idiot who will blow that size too. A runtime solution just occurred to me. The newly allocated stack for a thread is zeroed first, right? Well then, just before the thread finishes, one could scan the stack area, and find the first address that contains a non-zero value. This isn't bullet proof in theory, but hey, the aim is to help in /estimating/ the needed stack size.

You can't be more serious than Mark Twain when giving advice on the stock market: "Buy some stock. If it goes up, sell it. If it doesn't go up, don't buy it."
 Incidentally, this raises another thought: what if the stack was 
 extendable? Like D arrays? A stack overflow would trigger a relocation 
 of the stack, and then continue. Being able to have hundreds of small 
 stacks would be useful in other things than threads, too.

That would be great. I don't know whether it can be done.

By the way, I think you're underestimating how much stack the OS gives to the stack. On my system:

$ grep PTHREAD_STACK /usr/include/**/*.h
/usr/include/bits/local_lim.h:#define PTHREAD_STACK_MIN 16384
/usr/include/pthread.h:   minimal size of the block must be PTHREAD_STACK_MIN.*/
/usr/include/pthread.h:   to be started.  This size must never be less than PTHREAD_STACK_MIN
$ _

You can't have a stack smaller than 16KB as far as I understand. I seem to recall the default stack size is much bigger.

Andrei
Apr 29 2009
next sibling parent Sean Kelly <sean invisibleduck.org> writes:
Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 Computing the stack depth even for non-recursive functions is an 
 interprocedural analysis so it's difficult to do modularly. The stack 
 is also unchecked so going with a "good enough" approximation is 
 bound to be risky.

The old size estimation problem. No matter how big you decide on, somewhere is an idiot who will blow that size too. A runtime solution just occurred to me. The newly allocated stack for a thread is zeroed first, right? Well then, just before the thread finishes, one could scan the stack area, and find the first address that contains a non-zero value. This isn't bullet proof in theory, but hey, the aim is to help in /estimating/ the needed stack size.


You could have the GC report stack sizes when it performs a collection. No guarantee it would be a high-water mark, but it's something.
 Incidentally, this raises another thought: what if the stack was 
 extendable? Like D arrays? A stack overflow would trigger a relocation 
 of the stack, and then continue. Being able to have hundreds of small 
 stacks would be useful in other things than threads, too.

That would be great. I don't know whether it can be done.

Windows works this way, and it's possible to do in user code (the Fiber implementation has some of the groundwork for this). Beyond "usable" stack is a guard page that triggers a stack extension when hit. But I think if you grow the stack past the guard page without ever touching it you can still blow the stack.
 By the way, I think you're underestimating how much stack the OS gives 
 to the stack. On my system:
 
 $ grep PTHREAD_STACK /usr/include/**/*.h
 /usr/include/bits/local_lim.h:#define PTHREAD_STACK_MIN 16384
 /usr/include/pthread.h:   minimal size of the block must be 
 PTHREAD_STACK_MIN.*/
 /usr/include/pthread.h:   to be started.  This size must never be less 
 than PTHREAD_STACK_MIN
 $ _
 
 You can't have a stack smaller than 16KB as far as I understand. I seem 
 to recall the default stack size is much bigger.

On Windows I think the default size is 1MB.
Apr 29 2009
prev sibling parent reply Georg Wrede <georg.wrede iki.fi> writes:
Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 Computing the stack depth even for non-recursive functions is an 
 interprocedural analysis so it's difficult to do modularly. The stack 
 is also unchecked so going with a "good enough" approximation is 
 bound to be risky.

The old size estimation problem. No matter how big you decide on, somewhere is an idiot who will blow that size too. A runtime solution just occurred to me. The newly allocated stack for a thread is zeroed first, right? Well then, just before the thread finishes, one could scan the stack area, and find the first address that contains a non-zero value. This isn't bullet proof in theory, but hey, the aim is to help in /estimating/ the needed stack size.

You can't be more serious than Mark Twain when giving advice on the stock market: "Buy some stock. If it goes up, sell it. If it doesn't go up, don't buy it."

I'd be amazed if the stack usage of hello.d varied from run to run.
 Incidentally, this raises another thought: what if the stack was 
 extendable? Like D arrays? A stack overflow would trigger a relocation 
 of the stack, and then continue. Being able to have hundreds of small 
 stacks would be useful in other things than threads, too.

That would be great. I don't know whether it can be done.

Two methods come to mind; see (1) below.
 By the way, I think you're underestimating how much stack the OS gives 
 to the stack. On my system:
 
 $ grep PTHREAD_STACK /usr/include/**/*.h
 /usr/include/bits/local_lim.h:#define PTHREAD_STACK_MIN 16384
 /usr/include/pthread.h:   minimal size of the block must be 
 PTHREAD_STACK_MIN.*/
 /usr/include/pthread.h:   to be started.  This size must never be less 
 than PTHREAD_STACK_MIN

Seems like that minimum means the OS won't start a new thread if there's less [stack] space than that left on the computer. Gotta be /some/ value, big enough to be fine with virtually any app.
 You can't have a stack smaller than 16KB as far as I understand. I seem 
 to recall the default stack size is much bigger.

Totally not having researched this... The page http://www.unix.com/high-level-programming/34632-how-find-out-stack-size-occupied-process.html states that "Most Linux systems have no stack limits set" (didn't have time to find a definitive reference, yet).

Also, PTHREAD_STACK_MIN would be the main process stack size. If we're running co-operative multitasking, or continuations or some such, then the other stacks really can be smaller. The important thing to know is, does "somebody else" than our program use the current stack? If not, then we're free to allocate "just enough" (in practice much more) to the stacks. IIUC, our stack is only used by our app, its library calls, and the OS API calls.

(1) Using several stacks:

What does one need to use several stacks? Let's assume we have the "regular" stack, and then we temporarily want to use another. First we allocate space for the new stack. Then we disable interrupts (we could even save processor state) and change the stack pointer to point to the new stack. Enable interrupts, and go on with the program (i.e. call the routine we want to have using the new stack). Once it finishes, disable interrupts, reset the stack pointer, enable interrupts.

Knowing in advance the needed stack space helps here. Especially if instances of the same code are to run with each stack, then all we need is an array of stack spaces. :-)

The other thing: suppose we really don't know the stack usage. Then each routine can use the same stack while it runs, and at task switch we copy the actually used portion of the stack to the heap. Next time we want to run that task, we copy the data back from the heap. I admit this isn't done in an evening, but it's doable.
Apr 29 2009
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 Computing the stack depth even for non-recursive functions is an 
 interprocedural analysis so it's difficult to do modularly. The 
 stack is also unchecked so going with a "good enough" approximation 
 is bound to be risky.

The old size estimation problem. No matter how big you decide on, somewhere is an idiot who will blow that size too. A runtime solution just occurred to me. The newly allocated stack for a thread is zeroed first, right? Well then, just before the thread finishes, one could scan the stack area, and find the first address that contains a non-zero value. This isn't bullet proof in theory, but hey, the aim is to help in /estimating/ the needed stack size.

You can't be more serious than Mark Twain when giving advice on the stock market: "Buy some stock. If it goes up, sell it. If it doesn't go up, don't buy it."

I'd be amazed if the stack usage of hello.d varied from run to run.

I'd be amazed if hello.d were your prototype of a useful program :o). Let's face it. Any program that uses mutual recursion of any kind (even when the user did not set out to use recursion) or that uses alloca (fewer do), is bound to have the consumed stack dependent on the input. Even without recursion, various traces of a program depend on the data and have different stack depths. That's why your estimate is utterly useless. I can't believe you brought it up as anything else but jesting (maybe you are and I'm not getting it). I can't even believe I'm replying! :o) Andrei
Apr 29 2009
parent reply Georg Wrede <georg.wrede iki.fi> writes:
Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 Computing the stack depth even for non-recursive functions is an 
 interprocedural analysis so it's difficult to do modularly. The 
 stack is also unchecked so going with a "good enough" approximation 
 is bound to be risky.

The old size estimation problem. No matter how big you decide on, somewhere is an idiot who will blow that size too. A runtime solution just occurred to me. The newly allocated stack for a thread is zeroed first, right? Well then, just before the thread finishes, one could scan the stack area, and find the first address that contains a non-zero value. This isn't bullet proof in theory, but hey, the aim is to help in /estimating/ the needed stack size.

You can't be more serious than Mark Twain when giving advice on the stock market: "Buy some stock. If it goes up, sell it. If it doesn't go up, don't buy it."

I'd be amazed if the stack usage of hello.d varied from run to run.

I'd be amazed if hello.d were your prototype of a useful program :o).

Somebody more resourceful than I might /prove/ some maximum for any program that doesn't use recursion (be it implicit or explicit recursion). Hello.d attempted to be an example of such a program.
 Let's face it. Any program that uses mutual recursion of any kind (even 
 when the user did not set out to use recursion) or that uses alloca 
 (fewer), is bound to have the consumed stack dependent on the input. 

Of course. And there's a limit to the stack size, ultimately at the hardware level. For any increase in available stack size, the probability of an out-of-stack exception decreases, never reaching zero. I agree.
 Even without recursion, various traces of a program depend on a data and 
 have different stack depths. That's why your estimate is utterly 
 useless.

It'd be useless if we were attempting to give 100.000% guarantees.
 I can't believe you brought it up as anything else but jesting 
 (maybe you are and I'm not getting it).

Jesting on a level where you might not get it, wouldn't be fun. And definitely boring and confusing to the rest of the crowd. :-) Maybe I'll have to dig up prior art, or something. Clearly I'm not qualified to properly defend the validity of this idea.
Apr 29 2009
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Georg Wrede wrote:
 Somebody more resourceful than I might /prove/ some maximum for any 
 program that doesn't use recursion (be it implicit or explicit 
 recursion). Hello.d attempted to be an example of such a program.

There's no need for a lot of expertise, it's a trivial fact for anyone who's written even a toy compiler. The tidbit that's difficult is the interprocedural part: in order to compute the stack consumed by any function, you need to know the stack consumed by all functions it invokes.
 Maybe I'll have to dig up prior art, or something. Clearly I'm not 
 qualified to properly defend the validity of this idea.

I think my failure is seeing what the idea is. My understanding is that you use one trace of a run to estimate the (average? minimum? maximum?) stack needed by future runs of a thread. It's unclear to me how that estimate would be used, what sort of information it provides (if any - I highly doubt it tells much), and generally how you'd integrate that into a compiler. Andrei
Apr 29 2009
parent Don <nospam nospam.com> writes:
Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 Somebody more resourceful than I might /prove/ some maximum for any 
 program that doesn't use recursion (be it implicit or explicit 
 recursion). Hello.d attempted to be an example of such a program.

There's no need for a lot of expertise, it's a trivial fact for anyone who's written even a toy compiler. The tidbit that's difficult is the interprocedural part: in order to compute the stack consumed by any function, you need to know the stack consumed by all functions it invokes.
 Maybe I'll have to dig up prior art, or something. Clearly I'm not 
 qualified to properly defend the validity of this idea.

I think my failure is seeing what the idea is. My understanding is that you use one trace of a run to estimate the (average? minimum? maximum?) stack needed by future runs of a thread. It's unclear to me how that estimate would be used, what sort of information it provides (if any - I highly doubt it tells much), and generally how you'd integrate that into a compiler.

I think in the first instance it was at least to get an idea of how much stack space you're typically using. I must confess to having absolutely no idea of how much stack space my D apps are using. It'd be an interesting thing to know.
 
 
 Andrei

Apr 29 2009
prev sibling parent reply Georg Wrede <georg.wrede iki.fi> writes:
Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 Computing the stack depth even for non-recursive functions is an 
 interprocedural analysis so it's difficult to do modularly. The 
 stack is also unchecked so going with a "good enough" 
 approximation is bound to be risky.

The old size estimation problem. No matter how big you decide on, somewhere is an idiot who will blow that size too. A runtime solution just occurred to me. The newly allocated stack for a thread is zeroed first, right? Well then, just before the thread finishes, one could scan the stack area, and find the first address that contains a non-zero value. This isn't bullet proof in theory, but hey, the aim is to help in /estimating/ the needed stack size.

You can't be more serious than Mark Twain when giving advice on the stock market: "Buy some stock. If it goes up, sell it. If it doesn't go up, don't buy it."

I'd be amazed if the stack usage of hello.d varied from run to run.

I'd be amazed if hello.d were your prototype of a useful program :o).

Somebody more resourceful than I might /prove/ some maximum for any program that doesn't use recursion (be it implicit or explicit recursion). Hello.d attempted to be an example of such a program.
 Let's face it. Any program that uses mutual recursion of any kind 
 (even when the user did not set out to use recursion) or that uses 
 alloca (fewer), is bound to have the consumed stack dependent on the 
 input. 

Of course. And there's a limit to the stack size, ultimately at the hardware level. For any increase in available stack size, the probability of an out-of-stack exception decreases, never reaching zero. I agree.
 Even without recursion, various traces of a program depend on a data 
 and have different stack depths. That's why your estimate is utterly 
 useless.

It'd be useless if we were attempting to give 100.000% guarantees.
 I can't believe you brought it up as anything else but jesting (maybe 
 you are and I'm not getting it).


Having looked into things, it turns out I'm the one that now suspects you of jesting.
 Maybe I'll have to dig up prior art, or something. Clearly I'm not 
 qualified to properly defend the validity of this idea.

There is prior art, by the truckload. A /very/ short slide show here: http://zope.stackless.com/StacklessEuroPy.ppt

I'd kill to get that in D.

And about your comments on stack size, seems regular Python has a built-in limit on recursion, at 1000 deep. That should be diametrically opposite your stance on stack size.
Apr 30 2009
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Georg Wrede wrote:
 I can't believe you brought it up as anything else but jesting (maybe 
 you are and I'm not getting it).


Having looked into things, it turns out I'm the one that now suspects you of jesting.

What I referred to was inferring a thread's actual stack requirements from one trace of its run. That's just untenable.
 Maybe I'll have to dig up prior art, or something. Clearly I'm not 
 qualified to properly defend the validity of this idea.

There is prior art, by the truckload. A /very/ short slide show here: http://zope.stackless.com/StacklessEuroPy.ppt I'd kill to get that in D.

Interesting, but you can't claim you were talking about that all along. I mean, come on! You say one thing that was untenable, and now you come up with a different one that is tenable. And you were right all along?
 And about your comments on stack size, seems regular Python has an 
 in-built limit on recursion, at 1000 deep. That should be diametrically 
 opposite your stance on stack size.

Python has apparently set a maximum to have a guarantee it won't crash when given a certain minimum stack size. That's nice, but I'm not quite seeing how that's opposite to what I said in this discussion. Andrei
Apr 30 2009
parent reply Georg Wrede <georg.wrede iki.fi> writes:
Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 I can't believe you brought it up as anything else but jesting 
 (maybe you are and I'm not getting it).


Having looked into things, it turns out I'm the one that now suspects you of jesting.


Could it be that you're looking at this too much through the eyes of a compiler developer? With source code that can include templates, it should be no surprise that a programmer can write a file which results in preposterous stack needs. This is certainly the case with C++, for example.
 What I referred to was inferring a thread's actual stack requirements 
 from one trace of its run. 

Of course you'd give different input data, a little like you give different input data to your unit tests.
 That's just untenable.

Agreed, IF there's a possibility of recursion happening, the variation in input data is not guaranteed, etc. But for practical purposes (read: most of my programs, and presumably those of the other programmers'), the situation is different.

For a particular application, we expect some explicit or implicit constraints on the input data. We also know (er, expect) whether (and to what extent) there will be recursion. Additionally (as with any real-life application), we accept that if the input data goes wild, the application /doesn't/ have to cope with it. (Making nuke-proof apps where reasonably good is sufficient is just gratuitous work.)

For example, a robust application, instead of accepting and coping with *anything*, instead has *documented limitations*. IMHO, that is the right way to "fix infinity problems". Not demanding limitless stack for all kinds of programs.
 Maybe I'll have to dig up prior art, or something. Clearly I'm not 
 qualified to properly defend the validity of this idea.

There is prior art, by the truckload. A /very/ short slide show here: http://zope.stackless.com/StacklessEuroPy.ppt I'd kill to get that in D.

Interesting, just you can't claim you were talking about that all along. I mean, come on! You say one thing that was untenable, and now you come up with a different one that is. And you were right all along?

I was actually talking about *two* things. If you remember, I was talking about measuring actually used stack depth. Then I got the idea of having separate stacks for threads. Two separate things in a post. (Now that I think more about it, I must have heard about stackless Python somewhere. But as I've seen others do, this time I, too, believed I invented something that I've actually seen somewhere.)
 And about your comments on stack size, seems regular Python has an 
 in-built limit on recursion, at 1000 deep. That should be 
 diametrically opposite your stance on stack size.

Python has apparently set a maximum to have a guarantee it won't crash when given a certain minimum stack size. That's nice, but I'm not quite seeing how that's opposite to what I said in this discussion.

Nope. It's because

- They don't want the entire computer thrashing each time some idiot has a typo in his recursive algorithm.
- They genuinely don't think you usually need preposterous levels of recursion. And if you do, then just change the limit.

Python is for Real Computers (as opposed to Windows). There it's polite to make your apps (here, the Python interpreter) have some decency as to how many resources they take. The idea is that there are hundreds of other users, and dozens of daemons running critical apps, round the clock.

If developers didn't know (by measurement or analysis) the reasonable stack size needed, there would be no mobile phones, no set-top boxes with hard disks, no GPS receivers, no digital PBX appliances, no solid-state firewalls, no ADSL boxes, no programmable calculators... Dammit, any pocket calculator that has parenthesis buttons needs a stack. And that stack space is predetermined by the manufacturer.

I'm still maintaining that the stack space needs of regular applications are much less than people tend to think. Last time I checked, you ran out of stack on a name-brand calculator with just ten levels of parentheses. That's not much. On the other hand, in real life, not too many people in the world have ever seen a calculator run out of stack space.
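Georg's pocket-calculator point can be illustrated with a short sketch (Python, with a hypothetical MAX_DEPTH of 10 standing in for the calculator's fixed hardware stack):

```python
# Sketch of the calculator argument: a fixed, small stack limit
# (MAX_DEPTH is hypothetical, standing in for the roughly ten
# parenthesis levels a name-brand calculator allows) handles almost
# all real-world input.
MAX_DEPTH = 10

def max_paren_depth(expr):
    """Track nesting depth of parentheses, failing past MAX_DEPTH."""
    depth = deepest = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            if depth > MAX_DEPTH:
                raise OverflowError("calculator stack overflow")
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest

print(max_paren_depth("(1 + (2 * (3 - 4)))"))  # -> 3
```

The point being made: the limit is tiny and predetermined, yet almost nobody ever hits it.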
Apr 30 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 I can't believe you brought it up as anything else but jesting 
 (maybe you are and I'm not getting it).


Having looked into things, it turns out I'm the one that now suspects you of jesting.


Could it be that you're looking at this too much through the eyes of a compiler developer? With source code that can include templates, it should be no surprise that a programmer can write a file which results in preposterous stack needs. This is certainly the case with C++, for example.

I am sorry, I need to fix a few mistakes in this post and then I'll have to leave this conversation as it becomes ungainful for us both. The above is wrong, or at least contains a claim that you'd need to support better. Stack size requirements grow with the prevalence of short functions. Those are a style of programming not particularly correlated with templates. I don't have lots of small functions that call one another in templated code, any more than in regular code.
 What I referred to was inferring a thread's actual stack requirements 
 from one trace of its run. 

Of course you'd give different input data, a little like you give different input data to your unit tests.
 That's just untenable.

Agreed, IF there's a possibility of recursion happening, the variation in input data is not guaranteed, etc. But for practical purposes (read: most of my programs, and presumably those of the other programmers'), the situation is different.

This is utter nonsense. Practical vs. theoretical has absolutely nothing to do with this at all. Forget recursion. I'm talking about straight "if"s that create different program traces that in turn require different stack sizes. Collecting one and saying it's fine "for practical purposes" is just nonsense. What's so complicated about this?
 For a particular application, we expect some explicit or implicit 
constraints on the input data. We also know (er, expect) whether (and to 
 what extent) there will be recursion. Additionally (as with any 
 real-life application), we accept that if the input data goes wild, the 
 application /doesn't/ have to cope with it. (Making nuke proof apps 
where reasonably good is sufficient, is just gratuitous work.)

Again, just to drive the point home - recursion has nothing to do with it. Your inferences are starting from a faulty premise.
 For example, a robust application, instead of accepting and coping with 
 *anything*, instead has *documented limitations*. IMHO, that is the 
 right way to "fix infinity problems". Not demanding limitless stack for 
 all kinds of programs.
 
 Maybe I'll have to dig up prior art, or something. Clearly I'm not 
 qualified to properly defend the validity of this idea.

There is prior art, by the truckload. A /very/ short slide show here: http://zope.stackless.com/StacklessEuroPy.ppt I'd kill to get that in D.

Interesting, just you can't claim you were talking about that all along. I mean, come on! You say one thing that was untenable, and now you come up with a different one that is. And you were right all along?

I was actually talking about *two* things. If you remember, I was talking about measuring actually used stack depth. Then I got the idea of having separate stacks for threads. Two separate things in a post.

Threads already have separate stacks. (You might have said more.)
 (Now that I think more about it, I must have heard about stackless 
 Python somewhere. But as I've seen others do, this time I, too, believed 
 I invented something that I've actually seen somewhere.)

Fortran was stackless for... for a long time :o). An activation record in Fortran is in static memory. To call a Fortran function, you deposit arguments in that static storage, issue the CALL, and wait.
 And about your comments on stack size, seems regular Python has an 
 in-built limit on recursion, at 1000 deep. That should be 
 diametrically opposite your stance on stack size.

Python has apparently set a maximum to have a guarantee it won't crash when given a certain minimum stack size. That's nice, but I'm not quite seeing how that's opposite to what I said in this discussion.

Nope. It's because

- They don't want the entire computer thrashing each time some idiot has a typo in his recursive algorithm.
- They genuinely don't think you usually need preposterous levels of recursion. And if you do, then just change the limit.

Python is for Real Computers (as opposed to Windows). There it's polite to make your apps (here, the Python interpreter) have some decency as to how many resources they take. The idea is that there are hundreds of other users, and dozens of daemons running critical apps, round the clock.

If developers didn't know (by measurement or analysis) the reasonable stack size needed, there would be no mobile phones, no set-top boxes with hard disks, no GPS receivers, no digital PBX appliances, no solid-state firewalls, no ADSL boxes, no programmable calculators... Dammit, any pocket calculator that has parenthesis buttons needs a stack. And that stack space is predetermined by the manufacturer.

I'm still maintaining that the stack space needs of regular applications are much less than people tend to think. Last time I checked, you ran out of stack on a name-brand calculator with just ten levels of parentheses. That's not much. On the other hand, in real life, not too many people in the world have ever seen a calculator run out of stack space.

Sure, I'm not contending that. But that's a rather different claim than some others you made. Andrei
Apr 30 2009
parent Georg Wrede <georg.wrede iki.fi> writes:
Andrei Alexandrescu wrote:
 I am sorry, I need to fix a few mistakes in this post and then I'll have 
 to leave this conversation as it becomes ungainful for us both.

Sigh.
Apr 30 2009
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Georg Wrede:
 And about your comments on stack size, seems regular Python has an 
 in-built limit on recursion, at 1000 deep. That should be diametrically 
 opposite your stance on stack size.

See also sys.setrecursionlimit() to lift that limit where useful. You can also define how deep a traceback to show when an error occurs (to avoid flooding the shell, for example).

Python is a high-level language, so naturally people expect it to be fit for recursive algorithms, because they are sometimes shorter and more natural (when they work on nested data structures). But the low stack limit (and the very low performance of function calls) forces people to often manually translate recursive algorithms into iterative ones. I hate this.

Bye,
bearophile
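A minimal sketch of the workaround bearophile describes, assuming CPython's default recursion limit of 1000:

```python
import sys

def depth(node):
    """Depth of a degenerate nested-list chain, e.g. [[[]]] -> 3."""
    if not node:
        return 1
    return 1 + depth(node[0])

# Build a chain nested 2001 deep -- well past the default limit of 1000,
# so calling depth() on it would normally raise RecursionError.
tree = []
for _ in range(2000):
    tree = [tree]

old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(5000)  # lift the limit where useful, then restore it
try:
    result = depth(tree)
finally:
    sys.setrecursionlimit(old_limit)

print(result)  # -> 2001
```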
Apr 30 2009
parent Georg Wrede <georg.wrede iki.fi> writes:
bearophile wrote:
 Georg Wrede:
 And about your comments on stack size, seems regular Python has an
  in-built limit on recursion, at 1000 deep. That should be
 diametrically opposite your stance on stack size.

See also sys.setrecursionlimit() to lift that limit where useful. You can also define how deep a traceback to show when an error occurs (to avoid flooding the shell, for example). Python is a high-level language, so naturally people expect it to be fit for recursive algorithms, because they are sometimes shorter and more natural (when they work on nested data structures). But the low stack limit (and the very low performance of function calls) forces people to often manually translate recursive algorithms into iterative ones. I hate this.

Python seems a nice language. It's a shame that it's a bit slow. Heh, that's good for D.
Apr 30 2009
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 
 You can't have a stack smaller than 16KB as far as I understand. I 
 seem to recall the default stack size is much bigger.

Totally not having researched this... The page http://www.unix.com/high-level-programming/34632-how-find-out-stack-size-occupied-process.html states that "Most Linux systems have no stack limits set" (didn't have time to find a definitive reference, yet).

One thing I don't understand is how the OS can manage essentially unbounded stack growth in a multithreaded program with a flat address space. Is it simply that the physical memory for each stack is mapped into the same virtual address space when a context switch occurs? And if so, how does this work with multiple cores? The only alternative I can think of would be to dedicate a virtual address range to each stack, which would set a definite upper bound on the stack size for each thread, and wasted/unused memory between stacks.
Apr 29 2009
parent Georg Wrede <georg.wrede iki.fi> writes:
Sean Kelly wrote:
 Georg Wrede wrote:
 Andrei Alexandrescu wrote:

 You can't have a stack smaller than 16KB as far as I understand. I 
 seem to recall the default stack size is much bigger.

Totally not having researched this... The page http://www.unix.com/high-level-programming/34632-how-find-out-stack-size-occupied-process.html states that "Most Linux systems have no stack limits set" (didn't have time to find a definitive reference, yet).

One thing I don't understand is how the OS can manage essentially unbounded stack growth in a multithreaded program with a flat address space. Is it simply that the physical memory for each stack is mapped into the same virtual address space when a context switch occurs? And if so, how does this work with multiple cores? The only alternative I can think of would be to dedicate a virtual address range to each stack, which would set a definite upper bound on the stack size for each thread, and wasted/unused memory between stacks.

Well, "unbounded" and unbounded. It just means there's no set value. The stack can grow as long as there is space. And the virtual address range is larger than the physical memory, so I guess they can say there's no limit.
Apr 29 2009
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 Computing the stack depth even for non-recursive functions is an 
 interprocedural analysis so it's difficult to do modularly. The stack is 
 also unchecked so going with a "good enough" approximation is bound to 
 be risky.

No way in hell is there a reliable way for a compiler to estimate stack depth. Heck, just a virtual function call blows it. The stack, however, is checked. The virtual memory system puts a "guard" page following the end of the stack, so running off the end produces an exception. If there is unallocated space beyond that, the operating system will allocate more pages, then resume from the exception.
Apr 29 2009
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Some more notes about Phobos2. Some of the things I say may be wrong because my
experience with Phobos2 is still limited.

Daniel Keep:
 Not true.  Here's an excellent reason to use ranges over opApply: you
 can't define zip with opApply.  Because opApply uses inversion of
 control, you can't use more than one without bringing threads into the
 equation.

I'll try to zip two ranges that return the leaves of two different binary trees. This can be a simple example to show to attract people to D2 language :-)

So I've tried using zip:

import std.range: zip;
import std.stdio: writeln;

void main() {
    auto a = [1, 2, 3];
    string[] b = ["a", "b", "c"];
    foreach (xy; zip(a, b))
        writeln(xy.at!(0), " ", xy.at!(1));
}

That doesn't work:

...\dmd\src\phobos\std\range.d(1847): Error: template instance Zip!(int[3u],immutable(char)[][]) does not match template declaration Zip(R...) if (R.length && allSatisfy!(isInputRange,R))

probably because 'a' is a static array. Is isInputRange false for static arrays?

------------

The most basic and useful range is the one of integer numbers. How can I create with Phobos2 lazy and eager ranges like the following ones?
 range(1, 5)



 range(5, 1, -1)



 list(xrange(5, 10))



 list(xrange(5, 10, 2))



Similar ranges are useful with map, zip, and in many other situations. (I hope the x..y range syntax of D2 foreach will evolve into a syntax that can be used everywhere an integral range can be used.)

------------

The docs say about "map":
Multiple functions can be passed to map. In that case, the element type of map
is a tuple containing one element for each function.<
 [...]
 foreach (e; map!("a + a", "a * a")(arr1))

Having the possibility to map two functions in parallel may be useful in some situations (where for example computing the items is costly), but quite more useful would be to map a function that takes two or more arguments. An example:
 map(lambda x, y: x*y, ["a", "b", "c"], xrange(1, 4))



If you don't have that you are forced to use a zip inside the map, but then you are also forced to change the D2 mapping function to something like:

(p){ return p.at!(0) * p.at!(1); }

----------

all() and any() functions are useful to test whether all or any items of an iterable are true (with an optional mapping function too). They are useful as templates too, so I suggest renaming allSatisfy to "All" and anySatisfy to "Any". A similar template named "Map" can be created if not already present, to map a given template over a typetuple.

In future I'll probably write more notes, suggestions and questions about Phobos2. I hope such notes are very useful.

Bye,
bearophile
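For reference, the Python behaviour bearophile is describing (a multi-iterable map, plus all()/any() with a mapping step) looks like this:

```python
# map with a two-argument function over two iterables (Python 3's map is
# lazy, so wrap it in list to force evaluation)
pairs = list(map(lambda x, y: x * y, ["a", "b", "c"], range(1, 4)))
print(pairs)  # -> ['a', 'bb', 'ccc']

# all()/any() with an "optional mapping function" supplied via a
# generator expression
print(all(len(s) >= 1 for s in pairs))  # -> True
print(any(len(s) >= 3 for s in pairs))  # -> True
```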
Apr 28 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 Some more notes about Phobos2. Some of the things I say may be wrong because my

 Daniel Keep:
 Not true.  Here's an excellent reason to use ranges over opApply: you
 can't define zip with opApply.  Because opApply uses inversion of
 control, you can't use more than one without bringing threads into the
 equation.


 So I've tried using zip:
 import std.range: zip;
 import std.stdio: writeln;
 void main() {
     auto a = [1, 2, 3];
     string[] b = ["a", "b", "c"];
     foreach (xy; zip(a, b))
         writeln(xy.at!(0), " ", xy.at!(1));
 }
 That doesn't work:
 ...\dmd\src\phobos\std\range.d(1847): Error: template instance

(R.length && allSatisfy!(isInputRange,R))
 probably because 'a' is a static array. Is isInputRange false for static
arrays?

Yes, because the range interface for arrays relies on std.array being imported and using extension method syntax. This has bitten me a bunch of times, too. The functions in std.array take dynamic arrays, and static arrays are not implicitly convertible to dynamic arrays, except in the specific case of binding an array literal to a variable. Whether they should be, I don't know.
 ------------
 The most basic and useful range is the one of integer numbers. How can I create

 range(1, 5)



 range(5, 1, -1)



 list(xrange(5, 10))



 list(xrange(5, 10, 2))



Similar ranges are useful with map,zip, and in many other situations. (I hope the x..y range syntax of D2 foreach will evolve in a syntax that can be

I think you're looking for std.range.iota, although it doesn't work so well right now because of bug 2871.
Apr 28 2009
parent bearophile <bearophileHUGS lycos.com> writes:
dsimcha:

Whether they should be, I don't know.<

In my D1 dlibs most things digest static arrays too (but they don't let you iterate on them or return them; that's a limit of the language). I don't know D2 well enough yet to give you a better answer, but I think it's better to remove similar limits from the language. Zip-ing two static arrays has to be a legit operation.
I think you're looking for std.range.iota,<

Yes, sorry, I have found it few seconds after sending the post. I am dumb.
although it doesn't work so well right now because of bug 2871.<

And I have found the bug, of course... :-) I can also see Denis Koroskin has suggested a small fix: http://d.puremagic.com/issues/show_bug.cgi?id=2871

Regarding iota: it's better for the third argument of iota to default to 1, because most of the time you want a step = 1. In Python range() goes one step further: if you give just one argument it's meant to be the second one, and the first defaults to zero. In practice this causes no problems because in Python you use range()/xrange() all the time. In D I may accept iota to have 2-3 arguments because it will probably be used much less often.

Bye and thank you,
bearophile
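For comparison, the Python defaults being described (plain stdlib range; in Python 2 the lazy version was called xrange):

```python
# With one argument, it is the stop value and start defaults to 0;
# the step argument defaults to 1.
print(list(range(5)))         # -> [0, 1, 2, 3, 4]
print(list(range(5, 10)))     # -> [5, 6, 7, 8, 9]
print(list(range(5, 10, 2)))  # -> [5, 7, 9]
print(list(range(5, 1, -1)))  # -> [5, 4, 3, 2]
```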
Apr 28 2009
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 26 Apr 2009 21:20:32 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 Michel Fortin wrote:
 On 2009-04-26 11:46:51 -0400, dsimcha <dsimcha yahoo.com> said:

 == Quote from Michel Fortin (michel.fortin michelf.com)'s article
 Hum, are you saying opApply superior when it comes to iterating in a
 tree? It seems that with opApply you could use the stack by  
 recursively
 calling opApply with the given delegate on each one of the branches.

Exactly. On the other hand, you lose a lot of flexibility with opApply. If you tried to port most of std.range to operate on the opApply interface instead of the forward range interface, I doubt you'd get very far.

IMHO, though, opApply should *not* be deprecated. opApply and ranges attempt to solve similar problems, but not the same problem. Ranges attempt to solve the problem of iterating over some object with maximum flexibility from the point of view of the user of the object. opApply attempts to solve the problem of iterating with maximum flexibility from the point of view of the implementer of the object. In some cases, like the one you just described, the latter can be better.

place. So what the implementer can do with opApply is write an optimized iteration algorithm for use with foreach. Which may mean that when both opApply and ranges are available for generating foreach's code, the compiler should favor opApply. Currently, I believe it's the reverse.

I think, all other things being equal, that opApply tends to be slower than iterators.

It depends. The range must have either inline-able functions, or must NOT call virtual functions. Otherwise, it could be worse. There is also the factor that you are using an inner function with opApply, so if you regularly access variables outside the foreach loop, those add up.

Observe that in opApply style, you do one delegate call per loop iteration. With range style, you do three calls on the range per loop iteration (empty, front, popFront). If those range calls make virtual calls or ARE virtual calls, then range loses. Of course, the difference might be washed out if you do lots of accesses to an outer variable in your foreach loop, which require a pointer dereference vs. a stack dereference.

We should quantify this and see what the actual performance is.

-Steve
Apr 27 2009
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 28 Apr 2009 07:35:22 -0400, downs <default_357-line yahoo.de>  
wrote:

 Daniel Keep wrote:
 Michel Fortin wrote:
 On 2009-04-27 10:51:22 -0400, Frits van Bommel
 <fvbommel REMwOVExCAPSs.nl> said:

 I edited this code to work with ldc (D1) + Tango, and saw the Direct
 and opApply cases generate identical code (inc, cmp, jne, with the
 loop counter in a register) [1], so they're equally fast (modulo
 process scheduling randomness).

prefering ranges over opApply we're just optimising around a deficiency in DMD's optimizer.

Not true. Here's an excellent reason to use ranges over opApply: you can't define zip with opApply. Because opApply uses inversion of control, you can't use more than one without bringing threads into the equation.

Your point stands, of course, but I just wanted to mention that stackthreads/fibers work too and have far less overhead.

read: less overhead than full threads, not less overhead than ranges ;) -Steve
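Daniel Keep's point about inversion of control can be illustrated in Python, where generators play the role of ranges (external iteration) and a callback-taking function plays the role of opApply (internal iteration). This is a sketch, not D code:

```python
# Internal iteration (opApply-style): the container drives the loop and
# calls you back, so two traversals cannot be interleaved from outside
# without threads or fibers.
def each_leaf(tree, callback):
    if isinstance(tree, tuple):
        for child in tree:
            each_leaf(child, callback)
    else:
        callback(tree)

# External iteration (range/generator style): the caller drives the
# iteration, so zip can interleave two traversals step by step.
def leaves(tree):
    if isinstance(tree, tuple):
        for child in tree:
            yield from leaves(child)
    else:
        yield tree

t1 = (1, (2, 3))
t2 = ("a", ("b", "c"))

collected = []
each_leaf(t1, collected.append)
print(collected)                          # -> [1, 2, 3]
print(list(zip(leaves(t1), leaves(t2))))  # -> [(1, 'a'), (2, 'b'), (3, 'c')]
```

Note that the generator version keeps one suspended frame per tree level, which is exactly the auxiliary-space concern raised at the top of this thread.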
Apr 28 2009
prev sibling parent reply Steve Teale <steve.teale britseyeview.com> writes:
dsimcha Wrote:

 So I tried to create a struct to allow iteration over the keys or values of a
 builtin associative array using a lazy forward range.  This would allow one to
 pass the keys or values of an AA to any function that expects a regular old
 forward range w/o any heap activity, eager copying, etc. and with O(1)
 auxiliary memory usage.  Now that ranges are the lingua franca of iteration in
 D, this seems like a pretty necessary feature.
 
 The problem is that D's current AAs are based on a hash table with collision
 resolution done using binary trees.  I guess this decision was made to allow
 for O(log N) worst case performance even when there are ridiculous amounts of
 hash collisions.  However, this is a very bad decision because, in exchange
 for more bulletproof performance in corner cases, you lose:
 
 1.  Every element is now (void*).sizeof bytes larger because you have to store
 a left and right pointer.
 2.  You can't iterate over a binary tree in O(1) auxiliary space, at least not
 in any way I'm aware of.  This means that a lazy range to iterate over the
 keys or values is relatively useless because it would generate heap activity
 anyhow to store the stack or queue necessary to iterate over the tree, so you
 may as well just use the current .keys or .values properties, which just
 return an array.  (Because of the way that ranges are designed, you can't just
 use the call stack, unless you use fibers or something, which is obviously not
 a win from an efficiency point of view.)
 
 Both of these are important in the more common, general case of sane hashing
 performance.

Ranges, ranges! That's all I hear these days, and it looks to me like the continuing advance of D toward being a complete meta-language.

Where do I see ranges described in terms that an old hand can understand? I'm constantly having to roll my own in many areas when I see how the meta stuff is implemented - like x ~= c to add a character to the end of an array - reallocation every time? I thought D was supposed to be a systems programming language, not something that was guaranteed to win the universe obfuscated code competition.

I've been trying to keep my projects up to date and compilable with D2.0xx, but I think I'm going to give up on that and rewrite them for whatever the current version of D1 is.

I seriously think that the crew who are driving the development should realize that not all computer programmers are going to have an IQ of 140, and that any practical language should be comprehensible to a majority of its users.

Maybe the problem I'm complaining about is just a lack of documentation. Generating said from comments really does not hack it - comments are always skimped, and usually lie.

Before something like Ranges are implemented, there should be some sort of RFC process where they are properly described rather than a reliance on D users to have read every thread of the newsgroup, and remembered it all.
Apr 26 2009
next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Steve Teale (steve.teale britseyeview.com)'s article
 dsimcha Wrote:
 So I tried to create a struct to allow iteration over the keys or values of a
 builtin associative array using a lazy forward range.  This would allow one to
 pass the keys or values of an AA to any function that expects a regular old
 forward range w/o any heap activity, eager copying, etc. and with O(1)
 auxiliary memory usage.  Now that ranges are the lingua franca of iteration in
 D, this seems like a pretty necessary feature.

 The problem is that D's current AAs are based on a hash table with collision
 resolution done using binary trees.  I guess this decision was made to allow
 for O(log N) worst case performance even when there are ridiculous amounts of
 hash collisions.  However, this is a very bad decision because, in exchange
 for more bulletproof performance in corner cases, you lose:

 1.  Every element is now (void*).sizeof bytes larger because you have to store
 a left and right pointer.
 2.  You can't iterate over a binary tree in O(1) auxiliary space, at least not
 in any way I'm aware of.  This means that a lazy range to iterate over the
 keys or values is relatively useless because it would generate heap activity
 anyhow to store the stack or queue necessary to iterate over the tree, so you
 may as well just use the current .keys or .values properties, which just
 return an array.  (Because of the way that ranges are designed, you can't just
 use the call stack, unless you use fibers or something, which is obviously not
 a win from an efficiency point of view.)

 Both of these are important in the more common, general case of sane hashing
 performance.


 Where do I see ranges described in terms that an old hand can understand?
 I'm constantly having to roll my own in many areas when I see how the meta
stuff

reallocation every time?
 I thought D was supposed to be a systems programming language, not something

 I've been trying to keep my projects up to date and compilable with D2.0xx, but

version of D1 is.
 I seriously think that the crew who are driving the development should realize

practical language should be comprehensible to a majority of its users.
 Maybe the problem I'm complaining about is just a lack of documentation.

skimped, and usually lie.
 Before something like Ranges are implemented, there should be some sort of RFC

have read every thread of the newsgroup, and remembered it all.

But there was an RFC, and it was even called that. It just happened to take place on this newsgroup. Also, keep in mind that D2 is still in alpha. Worrying about how to explain this stuff to people who aren't heavily involved with D at this point would slow down the pace of evolution and generate more confusion. The time to do this is when the dust has settled a little.
Apr 26 2009
prev sibling next sibling parent Leandro Lucarella <llucax gmail.com> writes:
Steve Teale, el 26 de abril a las 14:39 me escribiste:
 Before something like Ranges are implemented, there should be some sort
 of RFC process where they are properly described rather than a reliance
 on D users to have read every thread of the newsgroup, and remembered it
 all.

A PEP[1]-like process would be great.

[1] http://www.python.org/dev/peps/

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
<o_O> parakenotengobarraespaciadora <o_O> aver <o_O> estoyarreglandolabarraporkeserompiounapatita
Apr 26 2009
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Steve Teale wrote:
 Before something like Ranges are implemented, there should be some
 sort of RFC process where they are properly described rather than a
 reliance on D users to have read every thread of the newsgroup, and
 remembered it all.

There was. Incidentally, it was called "RFC on range design for D2".

I understand your frustration, but if you think for a minute, you realize your comments are uncalled for. We have discussed ranges at length and the community has had a long time to establish semantics and even names. I have (early, often, and repeatedly) warned the community that there will be a Phobos update that is bound to break a lot of code. I actively tried to massage several breaking changes into one single release so as to sweeten the pill. That all took a lot of time and thought from several people (Walter, Sean, and myself) who have better things to do.

Now it would be great if you put yourself in our shoes, read your comments again, and tell me how they reflect on the writer. You mention you want to see ranges described in terms that an old hand can understand. That's great. But that one good request was marred by comments that show you are more interested in reaffirming your prejudice than in overcoming it.

Andrei
Apr 26 2009
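For reference, the range interface under discussion is small. A minimal sketch of a forward range over a slice, following the D2 primitive names (empty/front/popFront, plus save for forward ranges) - the struct itself is hypothetical, for illustration only:

```d
import std.stdio;

// A minimal forward range wrapping a slice. The three input-range
// primitives are empty, front, and popFront; save() is what promotes
// an input range to a forward range (it returns an independent copy
// positioned at the current element).
struct SliceRange(T)
{
    private T[] data;

    @property bool empty() const { return data.length == 0; }
    @property T front() { return data[0]; }
    void popFront() { data = data[1 .. $]; }
    @property SliceRange save() { return this; } // copies the view, not the data
}

void main()
{
    auto r = SliceRange!int([1, 2, 3]);
    for (; !r.empty; r.popFront())
        writeln(r.front);
}
```

Because save() only copies a pointer/length pair, iterating such a range needs O(1) auxiliary memory, which is exactly the property the original post notes is hard to achieve over the tree-based AA buckets.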
prev sibling next sibling parent reply Georg Wrede <georg.wrede iki.fi> writes:
Steve Teale wrote:
 Ranges, ranges! That's all I hear these days, and it looks to me like
 the continuing advance of D toward being a complete meta-language.
 
 Where do I see ranges described in terms that an old hand can
 understand?
 
 I'm constantly having to roll my own in many areas when I see how the
 meta stuff is implemented - like x ~= c to add a character to the end
 of an array - reallocation every time?
 
 I thought D was supposed to be a systems programming language, not
 something that was guaranteed to win the universe obfuscated code
 competition.
 
 I've been trying to keep my projects up to date and compilable with
 D2.0xx, but I think I'm going to give up on that and rewrite them for
 whatever the current version of D1 is.
 
 I seriously think that the crew who are driving the development
 should realize that not all computer programmers are going to have an
 IQ of 140, and that any practical language should be comprehensible
 to a majority of its users.

It will, be patient. See below.
 Maybe the problem I'm complaining about is just a lack of
 documentation. Generating said from comments really does not hack it
 - comments are always skimped, and usually lie.
 
 Before something like Ranges are implemented, there should be some
 sort of RFC process where they are properly described rather than a
 reliance on D users to have read every thread of the newsgroup, and
 remembered it all.

There's an illusion. And that illusion comes from the D newsgroups having "wrong" names.

The D2 newsgroup should have a name like "D2 discussion -- for D language development folks, enter at your own risk". And a *D1* newsgroup should then be for anybody who actually uses the language for something. Currently, actually, the D.learn newsgroup has sort-of assumed this functionality.

Being as it is (a de facto D2 development newsgroup), d.D will contain lengthy discussions that meander and roam, possibly for months, before something crystallises and the issue is settled (at least for the time). A lack of proper documentation belongs to this setup. (But then, one really has to congratulate Andrei; Phobos docs have *never* been as good as they are today!!!)

About ranges: I suspect that once D2 is solidified (and we start working on D3 :-) ), ranges will be easy to use, easy to understand, and practical. Incidentally, that same complaint could have been made about templates. It took more than a year for the discussion to settle down a bit, and today templates seem not impossible at all to begin using (unless you want to begin right with the hairy stuff). And earlier, it took two months of intense discussion here before Unicode, UTF, and their relation to char, wchar and dchar got generally understood. Today you can program in D without much thinking about UTF, and things "just work". Even abroad.

But to sum it all up, any project whose purpose is anything other than learning D2 should be written in D1.
Apr 27 2009
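On the `x ~= c` reallocation question quoted above: the D runtime over-allocates array capacity, so appending in a loop does not reallocate on every call. A rough way to check this empirically (the exact growth policy is implementation-defined, so the count will vary by runtime, but it stays far below the number of appends):

```d
import std.stdio;

void main()
{
    char[] x;
    size_t moves = 0;
    auto prevPtr = x.ptr;

    foreach (i; 0 .. 1000)
    {
        x ~= 'a';                // append one character
        if (x.ptr !is prevPtr)   // pointer changed => storage was moved
        {
            ++moves;
            prevPtr = x.ptr;
        }
    }

    // With geometric capacity growth, moves is far smaller than 1000.
    writeln("appends: 1000, reallocations: ", moves);
}
```

So `~=` in a loop is amortized O(1) per append, not one heap allocation per character.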
next sibling parent reply Derek Parnell <derek psych.ward> writes:
On Mon, 27 Apr 2009 11:32:30 +0300, Georg Wrede wrote:

 Phobos docs have *never* been as good as they are today!!!

Really?! I still find them very confusing. Maybe it's just the formatting, but I often find it so hard to understand or find what I'm after that I refer to the source code to work it out.
 About ranges: I suspect that once D2 is solidified (and we start working 
 on D3 :-) ), ranges will be easy to use, easy to understand, and 
 practical. Incidentally, that same complaint could have been said about 
 templates. It took more than a year to get the discussion settled down a 
 bit, and today templates seem not impossible at all to begin using 
 (unless you want to begin right with the hairy stuff).

It must be me, but I still find D2 templates as clear as mud. The syntax is butt-ugly, counter-intuitive, and reader-hostile, IMHO. I'm sure that C++ is worse, but that is not the issue. I'll have to wait until somebody writes the "D Templates for Dummies" book.

-- 
Derek Parnell
Melbourne, Australia
skype: derek.j.parnell
Apr 27 2009
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Derek Parnell wrote:
 On Mon, 27 Apr 2009 11:32:30 +0300, Georg Wrede wrote:
 
 Phobos docs have *never* been as good as they are today!!!

Really?! I still find them very confusing. Maybe it's just the formatting, but I often find it so hard to understand or find what I'm after that I refer to the source code to work it out.

Suggestions are always welcome. Andrei
Apr 27 2009
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Derek Parnell:
 [...] I still find D2 templates as clear as mud. The syntax is
 butt-ugly, counter intuitive, and reader-hostile, IMHO.

I find templates in D1 easy to use in most situations. And template constraints of D2, too, are easy. What kind of syntax/semantics do you suggest to improve the current situation? If you have ideas I am interested.

Bye,
bearophile
Apr 27 2009
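For readers who find the syntax off-putting, a minimal D function template is not far from an ordinary function; a small sketch (the functions here are hypothetical examples, not library APIs):

```d
import std.stdio;

// A function template: T is deduced from the argument at the call site,
// so callers never have to write the !() instantiation syntax.
T twice(T)(T x)
{
    return x + x;
}

// A D2 template constraint: the if clause restricts which types are
// accepted, turning would-be instantiation errors into overload misses.
T half(T)(T x) if (is(T : double))
{
    return x / 2;
}

void main()
{
    writeln(twice(21));   // T deduced as int
    writeln(half(3.0));   // T deduced as double
}
```

The constraint syntax is the D2 addition bearophile refers to: it documents the template's requirements at the signature rather than deep inside the body.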
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Georg Wrede wrote:
 There's an illusion. And that illusion comes from the D newsgroups 
 having "wrong" names.
 
 The D2 newsgroup should have a name like "D2 discussion -- for D 
 language development folks, enter at your own risk". And a *D1* 
 newsgroup should then be for anybody who actually uses the language for 
 something. Currently, actually, the D.learn newsgroup has sort-of 
 assumed this functionality.

Yes, well put. I think it would be great to define a digitalmars.d2 or digitalmars.d2-design newsgroup. Andrei
Apr 27 2009
parent Christopher Wright <dhasenan gmail.com> writes:
Simen Kjaeraas wrote:
 Andrei Alexandrescu wrote:
 
 Georg Wrede wrote:
 There's an illusion. And that illusion comes from the D newsgroups 
 having "wrong" names.
  The D2 newsgroup should have a name like "D2 discussion -- for D 
 language development folks, enter at your own risk". And a *D1* 
 newsgroup should then be for anybody who actually uses the language 
 for something. Currently, actually, the D.learn newsgroup has sort-of 
 assumed this functionality.

Yes, well put. I think it would be great to define a digitalmars.d2 or digitalmars.d2-design newsgroup.

And what would we do the day D987 came out? Start another group called digitalmars.d987-design? Or just create one called digitalmars.d.design, and use that no matter the version of D being discussed? -- Simen

How about digitalmars.dnext?
Apr 27 2009
prev sibling next sibling parent "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
Andrei Alexandrescu wrote:

 Georg Wrede wrote:
 There's an illusion. And that illusion comes from the D newsgroups  
 having "wrong" names.
  The D2 newsgroup should have a name like "D2 discussion -- for D  
 language development folks, enter at your own risk". And a *D1*  
 newsgroup should then be for anybody who actually uses the language for  
 something. Currently, actually, the D.learn newsgroup has sort-of  
 assumed this functionality.

Yes, well put. I think it would be great to define a digitalmars.d2 or digitalmars.d2-design newsgroup.

And what would we do the day D987 came out? Start another group called digitalmars.d987-design? Or just create one called digitalmars.d.design, and use that no matter the version of D being discussed? -- Simen
Apr 27 2009
prev sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Tue, 28 Apr 2009 01:38:27 +0400, Christopher Wright <dhasenan gmail.com>
wrote:

 Simen Kjaeraas wrote:
 Andrei Alexandrescu wrote:

 Georg Wrede wrote:
 There's an illusion. And that illusion comes from the D newsgroups  
 having "wrong" names.
  The D2 newsgroup should have a name like "D2 discussion -- for D  
 language development folks, enter at your own risk". And a *D1*  
 newsgroup should then be for anybody who actually uses the language  
 for something. Currently, actually, the D.learn newsgroup has sort-of  
 assumed this functionality.

Yes, well put. I think it would be great to define a digitalmars.d2 or digitalmars.d2-design newsgroup.

And what would we do the day D987 came out? Start another group called digitalmars.d987-design? Or just create one called digitalmars.d.design, and use that no matter the version of D being discussed? -- Simen

How about digitalmars.dnext?

Maybe digitalmars.future? :)
Apr 27 2009