
digitalmars.D.learn - Memory leak with dynamic array

reply Joseph Wakeling <joseph.wakeling gmail.com> writes:
Hello all,

This one is another 'oh God it doesn't work like C++' query ... :-)

I've been having a problem with a memory leak that I've tied down to a dynamic
array that is repeatedly resized to zero and then extended within a loop.  The
following code gives a minimal example:

////////////////////////////////////////////////////////////////////
import std.array;
import std.stdio;

void main()
{
    real[] x;

    x.length = 5000000;

    foreach(uint i;0..100) {
        x.length = 0;

        foreach(uint j;0..5000000)
            x ~= j;

        writefln("At iteration %u, x has %u elements.",i,x.length);
    }
}
////////////////////////////////////////////////////////////////////

If I run this on my system, the memory usage explodes within a very short
time, despite the fact that the length of the array is set early on, so the
memory should all be reserved.  The writefln() command confirms that the
actual number of elements always stays within the original bounds set.

I'm guessing this is down to concatenation ~ working differently from C++
vectors' push_back() command.  I have also tried using the Appender and its
put() command, but the same result happens.

I even tried moving the array declaration within the first foreach() loop, and
putting an explicit delete at the end, but to no avail.

Can anyone advise? :-)

Thanks & best wishes,

    -- Joe
Apr 10 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Joseph Wakeling:

 Can anyone advise? :-)

Dynamic arrays in D are not black boxes; they are an abstraction that leaks a lot, so you must know how dynamic arrays are implemented. The implementation of dynamic arrays has changed recently -- does anyone have a link to a description of how they work now?

A few notes on your code: there is no need to import std.array in that little program. Often there is no need for the "uint" inside the foreach. I suggest putting a space after every comma. I also suggest writing big numbers like this: 5_000_000, because it helps avoid some mistakes.

From the little I know about the new arrays, this doesn't work anymore; zeroing the length produces a deallocation:

x.length = 5_000_000;
x.length = 0;

So a basic strategy for your problem is to not let the GC work:

import std.stdio: writefln;

void main() {
    auto x = new real[5_000_000];

    foreach (i; 0 .. 1_000) {
        int x_len = 0;

        foreach (j; 0 .. 5_000_000) {
            x[x_len] = j;
            x_len++;
        }

        writefln("At iteration %u, x has %u elements.", i, x_len);
    }
}

When the GC is not good enough, one thing you can do is manage the memory in a less automatic way...

Bye,
bearophile
Apr 10 2010
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Joseph Wakeling:

 Can anyone advise? :-)

My previous answer was mistaken. Read about capacity, reserve and assumeSafeAppend here:
http://www.digitalmars.com/d/2.0/phobos/object.html

I think a short page about the design of the new dynamic arrays could be added to the D site. This is a version of the program that doesn't blow up the memory (it uses more and more memory, but just a bit):

import std.stdio: writeln, writefln;

void main() {
    enum int N = 5_000_000;
    real[] arr;

    writeln("A) arr.capacity: ", arr.capacity);
    arr.reserve(N); // optional
    writeln("B) arr.capacity: ", arr.capacity);

    foreach (i; 0 .. 100) {
        arr.length = 0;
        writeln("1) arr.capacity: ", arr.capacity);
        assumeSafeAppend(arr); // not optional
        writeln("2) arr.capacity: ", arr.capacity);

        foreach (j; 0 .. N)
            arr ~= j;

        writefln("At iteration %u, arr has %u elements.", i, arr.length);
    }
}

Bye,
bearophile
Apr 10 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
A few dynamic array benchmarks that can show you more differences between C++
and D2.

Timings dmd, N=5_000_000, NLOOPS=15, T=double, seconds:
  C++ v1: 0.67
  C++ v2: 0.72
  D v1:   6.47 (uses >1 GB RAM)
  D v2:   0.76
  D v3:   5.25

dmd v2.043, dmd -O -release -inline
g++ 4.4.1, -O3 -Wall -s

-----------------

// D v.1
import std.c.stdio: printf;

void main() {
    alias double T;
    enum int N = 5_000_000;
    enum int NLOOPS = 15;

    T[] arr;
    arr.length = N;

    foreach (i; 0 .. NLOOPS) {
        arr.length = 0;

        foreach (j; 0 .. N)
            arr ~= j;

        printf("At iteration %d, arr has %u elements.\n", i, arr.length);
    }
}

-----------------

// D v.2
import std.c.stdio: printf;

void main() {
    alias double T;
    enum int N = 5_000_000;
    enum int NLOOPS = 15;

    auto arr = new T[N];

    foreach (i; 0 .. NLOOPS) {
        uint arr_len = 0;

        foreach (j; 0 .. N) {
            arr[arr_len] = j;
            arr_len++;
        }

        printf("At iteration %d, arr has %u elements.\n", i, arr_len);
    }
}

-----------------

// D v.3
import std.c.stdio: printf;

void main() {
    alias double T;
    enum int N = 5_000_000;
    enum int NLOOPS = 15;

    T[] arr;
    arr.reserve(N);

    foreach (i; 0 .. NLOOPS) {
        arr.length = 0;
        arr.assumeSafeAppend;

        foreach (j; 0 .. N)
            arr ~= j;

        printf("At iteration %d, arr has %u elements.\n", i, arr.length);
    }
}

-----------------

// C++ v.1
#include <stdio.h>
#include <vector>

int main() {
    typedef double T;
    const int N = 5000000;
    const int NLOOPS = 15;

    std::vector<T> arr;
    arr.reserve(N);

    for (int i = 0; i < NLOOPS; i++) {
        arr.clear();

        for (int j = 0; j < N; j++)
            arr.push_back(j);

        printf("At iteration %d, arr has %zu elements.\n", i, arr.size());
    }
}

-----------------

// C++ v.2
#include <stdio.h>
#include <vector>

int main() {
    typedef double T;
    const int N = 5000000;
    const int NLOOPS = 15;

    std::vector<T> arr(N);

    for (int i = 0; i < NLOOPS; i++) {
        size_t arr_len = 0;

        for (int j = 0; j < N; j++) {
            arr[arr_len] = j;
            arr_len++;
        }

        printf("At iteration %d, arr has %zu elements.\n", i, arr_len);
    }
}

-----------------

Bye,
bearophile
Apr 10 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
This can help confuse your mind a bit:

import std.stdio: writeln;

void main() {
    {
        int[] arr = [0, 1, 2, 3, 4, 5, 6].dup;
        int[] slice = arr[2 .. 4];
        writeln("arr, slice: ", arr, " | ", slice); // arr, slice: 0 1 2 3 4 5 6 | 2 3

        slice ~= 10; slice ~= 20;
        writeln("arr.capacity, slice.capacity: ", arr.capacity, " ", slice.capacity); // arr.capacity, slice.capacity: 7 7
        writeln("arr, slice: ", arr, " | ", slice); // arr, slice: 0 1 2 3 4 5 6 | 2 3 10 20
    }

    {
        int[] arr = [0, 1, 2, 3, 4, 5, 6].dup;
        int[] slice = arr[2 .. 4];
        writeln("arr, slice: ", arr, " | ", slice); // arr, slice: 0 1 2 3 4 5 6 | 2 3

        slice.assumeSafeAppend;
        slice ~= 10; slice ~= 20; // causes stomping
        writeln("arr.capacity, slice.capacity: ", arr.capacity, " ", slice.capacity); // arr.capacity, slice.capacity: 0 5
        writeln("arr, slice: ", arr, " | ", slice); // arr, slice: 0 1 2 3 10 20 6 | 2 3 10 20

        slice ~= 30; slice ~= 40;
        writeln("arr.capacity, slice.capacity: ", arr.capacity, " ", slice.capacity); // arr.capacity, slice.capacity: 7 7
        writeln("arr, slice: ", arr, " | ", slice); // arr, slice: 0 1 2 3 10 20 30 | 2 3 10 20 30 40
    }
}


The slice.capacity == 7 in the first case is just a coincidence; it's the
result of the over-allocation.

But I don't know why arr.capacity is zero and then seven in the second and
third cases.

Bye,
bearophile
Apr 10 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Steven Schveighoffer:

Thank you for your answers, Steven; I always appreciate your efforts both in
trying to improve the D runtime and in teaching me :-)

The new dynamic array regime has the main advantage of avoiding some
stomping-related bugs (a secondary advantage is slightly higher append speed).
One disadvantage is its increased complexity, which makes it harder for the
programmer to understand what dynamic arrays are actually doing under the
covers. Sometimes you don't need to know what's happening under the covers,
but arrays are such basic and common data structures, and their abstraction
leaks so much, that I think a D programmer must know what happens inside them
to use them at their best.


   C++ v1: 0.67
   C++ v2: 0.72


The 0.72 timing was from the second C++ program, sorry.
 That is to be expected.  The append function is not the most efficient  
 thing.  It still must do a lot of work just to look up the valid length  
 and verify appending can be done, whereas the v2 "append" is a single  
 instruction!

The v2 "append" seems to be two instructions:

arr[arr_len] = j;
arr_len++;

And the C++ v1 also has to verify whether the append can be done, comparing the size with the capacity, and if so assigning the item and increasing the size. I understand that the D code has to work more here, and I presume your code can't be improved. But those timings are a little sad anyway :-|
 A better method is to set the length, and then write the data.

What I have done in the D v2 example.
0 means if you append to the array, it will reallocate.  It will return 0  

taken over the "allocated" length of the block. Essentially, capacity indicates how many elements can be appended. The function gives up and returns 0 if it determines the array does not end at the allocated part of a block.<
For fun, add one more element to slice, and the arr will now have a valid
capacity :)<

I will have to read this some more times, and probably I will have to read the source code of the append implementation :-)
Technically, I could return the length of the array, but I'm not sure whether
that is as useful.<

I like this.
FYI, using arr after assuming safe append on the slice is undefined.<

/me nods.

Can you please write a short page, to be added to the D site (for example in the "Articles" section), that explains the data structures used under the covers by dynamic arrays and how those data structures work (and maybe why this design was chosen instead of other possible designs, and how it interacts with the current GC)? The purpose of this short text is to help D programmers understand what dynamic arrays do in their programs, and why they work this way. Dynamic arrays are among the most common data structures in D programs, so they are pretty important and basic.

If you are willing to write a first draft of this small document I can offer any kind of help you want: I can read and comment on your text, fix typos, convert it to HTML, I can even draw small data structures with their little pointers, I can beg Walter to add the resulting HTML page to his site, and so on :-) It's not a big job; despite everything it can't be a very complex data structure.

Bye and thank you,
bearophile
Apr 10 2010
next sibling parent bearophile writes:
One more small thing I was forgetting. In my dlibs1 there are functions similar
to:


T[][] NewVoidGCMatrix(T)(int nr, int nc) {
    // Part of the code by Marius Muja <mariusm cs.ubc.ca>
    assert(nr > 0, "NewVoidGCMatrix: nr must be > 0.");
    assert(nc > 0, "NewVoidGCMatrix: nc must be > 0.");
    void* mem = cast(void*)gcmalloc(nr * (T[]).sizeof + nr * nc * T.sizeof);
    hasNoPointers(mem);
    T[]* index = cast(T[]*)mem;
    T* mat = cast(T*)(mem + nr * (T[]).sizeof);

    for (int i = 0; i < nr; ++i) {
        index[i] = mat[0 .. nc];
        mat += nc;
    }

    return index[0 .. nr];
}

void delVoidGC(T)(ref T[] a) {
    gcrealloc(a.ptr, 0);
    a = null;
}


They allow allocating only one chunk of memory for a matrix, and in some
situations they give a little more performance (because they give a little more
CPU cache coherence). I guess such functions can't be used in D2. But maybe
once I know the underlying data structure I can write something similar for D2
too.

Bye,
bearophile
Apr 11 2010
prev sibling next sibling parent reply Joseph Wakeling <joseph.wakeling gmail.com> writes:
Thanks very much for the extended and detailed explanations.  The
assumeSafeAppend function indeed fixes the memory leakage. :-)

In reply to some comments ...

 That is to be expected.  The append function is not the most efficient
 thing.  It still must do a lot of work just to look up the valid length
 and verify appending can be done, whereas the v2 "append" is a single
 instruction!

 The v2 "append" seems to be two instructions: arr[arr_len] = j; arr_len++; And the C++ v1 also has to verify whether the append can be done, comparing the size with the capacity, and if so assigning the item and increasing the size. I understand that the D code has to work more here, and I presume your code can't be improved. But those timings are a little sad anyway :-|
 A better method is to set the length, and then write the data.

 What I have done in the D v2 example.

I was very happy to see that D _does_ have a 'reserve' function for arrays,
which I had been missing compared to C++ (it's not mentioned in the array docs).

Still, I don't think that pre-reserving the memory per se is the influencing
factor on the differences in performance.  To see why, compare these two pieces
of code -- a non-leaking D code, and the equivalent in C++:

////////////// D example /////////////////////////
import std.stdio;

void main()
{
    double[] x;

    foreach(uint i;0..100) {
        x.length = 0;
        writefln("Array capacity (1) = %u",x.capacity);
        assumeSafeAppend(x);
        writefln("Array capacity (2) = %u",x.capacity);

        foreach(uint j;0..5000000)
            x ~= j;

        writefln("At iteration %u, x has %u elements.",i,x.length);
    }
}
/////////////////////////

//////////// C++ example /////////////////////////
#include <vector>
#include <iostream>

using namespace std;

int main()
{
    vector<double> x;

    for(unsigned int i=0;i!=100;++i) {
        x.resize(0);
        cout << "Vector capacity: " << x.capacity() << endl;

        for(unsigned int j=0;j!=5000000;++j)
            x.push_back(j);

        cout << "At iteration " << i << ", x has " << x.size()
             << " elements." << endl;
    }

    return 0;
}
/////////////////////////

Note that in C++ the memory is not preassigned either.  The difference between
the performance of these pieces of code is striking -- on my machine the D
example takes about 70--75 seconds to run, whereas the C++ example speeds
through it in 10s or less.  D also uses about 20% more memory than the C++,
even though the C++ code declares a higher capacity for the vector (above 8
million) than D does for the array (a little over 5 million).

I don't know if it was naive to assume that D's dynamic arrays would be
equivalent to vectors in C++.  But it's clear that the D array appender ~= is
much slower than the C++ vector push_back().  Using an Appender class and
put() (from std.array) is even slower, despite the std.array docs recommending
this over ~. :-(

It's disappointing because I'm loving so much of D's syntax, but I can write
far more efficient code (with less explicit memory management) in C++ ...
Apr 11 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Joseph Wakeling:

 I was very happy to see that D _does_ have a 'reserve' function for arrays,
 which I had been missing compared to C++ (it's not mentioned in the array
 docs).

'reserve' is not a dynamic array method; it's a free function in the Phobos default module that also contains object, see the docs page:
http://www.digitalmars.com/d/2.0/phobos/object.html

It's not mentioned in the array docs because 'reserve' and the related functions are new; they come from a recent change in how dynamic arrays are implemented in D.
 Still, I don't think that pre-reserving the memory per se is the influencing
 factor on the differences in performance.

You are right. The difference in performance comes mostly from the way D arrays are designed. I agree with you that dynamic array append in D is slow, in absolute terms. Comparing the performance of D code with C++ is a good thing to do; it gives a baseline for comparison.

D dynamic arrays are more flexible than C++'s vector: they can be sliced, such slicing is O(1), and the slices are seen by the language just like other arrays. So you pay the price of some performance for such increased flexibility. The idea here is that the built-in data types must be as flexible as possible, even if their performance is not so high, so they can be used for many different purposes. Then the D standard library will have specialized data structures that are faster thanks to being more specialized and less flexible.

In D dynamic arrays some of the performance price is also paid for the automatic memory management, for the GC that's not a precise GC (for example, if your array has some empty items at the end past its true length, the GC must ignore them). I think D dynamic arrays have lower append performance because they don't keep the capacity in a handy place (the new design has a capacity, but it's not stored beside the length and pointer). I don't know how D array append works in multithreading situations.

If you don't require the slicing flexibility, you can use the D language to implement a C++-style Vector in D. If you implement a struct that contains capacity, length and a pointer to the data, don't let the GC scan the memory of the data, and overload ~= so that it doubles the size of the memory with a realloc once in a while, you probably get performance similar to the C++ one (not equal, because the performance profile of the GC in allocating memory chunks is a little more complex than the C malloc). See the bottom of this post.
 D also uses about 20% more memory than the C++

I don't know why; I presume it's caused by the GC bookkeeping, which must be careful to avoid memory fragmentation. Currently the D GC is not so refined.
 Using an Appender class and put() (from std.array) is even slower,
 despite the std.array docs recommending this over ~. :-(

That was useful for, and designed around, the previous design of the dynamic arrays. Maybe it can be removed from Phobos now...
 It's disappointing because I'm loving so much of D's syntax, but I can
 write far more efficient code (with less explicit memory management)
 in C++ ...

As you know, all data structures are the result of compromises: they balance performance across their different uses, and there is no perfect data structure. In practice D arrays are not designed for efficient "en masse" appending. A D dynamic array is seen locally as a struct of 2 words, so it's cheap to copy and pass around. Why that struct doesn't also contain the capacity is a long story. If you like writing benchmarks you will also find performance differences between, for example, a C++ hash_map and the built-in associative arrays of D. Besides the design (which for example can make it hard to implement append-efficient dynamic arrays in D), the D language is also still rather new; so far the focus has been on its design, not on tuning the performance of its implementation. C++ implementations have been performance-tuned for several years now :-)

-------------------

Below are three implementations, in D and C++. The Vector(T) is bare-bones: it misses most features and it's unsafe. In the D2 version I don't use the GC, and the memory is deallocated when arr goes out of scope.

Timings, T=double, N=5_000_000, NLOOPS=100, seconds:
  D1:  38.20
  D2:   8.32
  C++:  6.65

You can see the performance of the append in D2 is a matter of data structure implementation; it's not a language problem :-) With LDC (once we'll have a D2 version of it) the performance of D2 can probably be the same as the C++. DMD maybe loses a little here because it's not so good at inlining, or maybe because the C++ vector is better than this D2 code.

-------------------

// First D version
import std.stdio: printf;

void main() {
    alias double T;
    enum int N = 5_000_000;
    enum int NLOOPS = 100;
    T[] arr;

    foreach (i; 0 .. NLOOPS) {
        arr.length = 0;
        arr.assumeSafeAppend;
        printf("Array capacity = %d\n", arr.capacity);

        foreach (uint j; 0 .. N)
            arr ~= j;

        printf("At iteration %u, x has %u elements.\n", i, arr.length);
    }
}

------------------

// Second D version
import std.stdio: printf;
import std.c.stdlib: malloc, realloc, free;

struct Vector(T) {
    enum double FAST_GROW = 2;
    enum double SLOW_GROW = 1.3;
    enum int STARTING_SIZE = 4;
    enum int BIG_MEM = (1 << 27) / T.sizeof;

    T* ptr;
    int length, capacity;

    ~this() {
        free(ptr);
        ptr = null;
        length = 0;
        capacity = 0;
    }

    void opOpAssign(string Op:"~=")(T item) {
        if (length >= capacity) {
            if (capacity == 0)
                _grow_capacity(STARTING_SIZE);
            else if (capacity < BIG_MEM)
                _grow_capacity(cast(int)(capacity * FAST_GROW));
            else
                _grow_capacity(cast(int)(capacity * SLOW_GROW));
        }
        ptr[length] = item;
        length++;
    }

    void _grow_capacity(int new_capacity) {
        ptr = cast(T*)realloc(ptr, new_capacity * T.sizeof);
        assert(ptr);
        this.capacity = new_capacity;
    }
}

void main() {
    alias double T;
    enum int N = 5_000_000;
    enum int NLOOPS = 100;
    Vector!T arr;

    foreach (i; 0 .. NLOOPS) {
        arr.length = 0;
        printf("Array capacity = %d\n", arr.capacity);

        foreach (uint j; 0 .. N)
            arr ~= j;

        printf("At iteration %u, x has %u elements.\n", i, arr.length);
    }
}

------------------

// C++ version
#include <stdio.h>
#include <vector>

int main() {
    typedef double T;
    const unsigned int N = 5000000;
    const unsigned int NLOOPS = 100;
    std::vector<T> arr;

    for (unsigned int i = 0; i < NLOOPS; i++) {
        arr.resize(0);
        printf("Array capacity = %zu\n", arr.capacity());

        for (unsigned int j = 0; j < N; j++)
            arr.push_back(j);

        printf("At iteration %u, x has %zu elements.\n", i, arr.size());
    }
}

Bye,
bearophile
Apr 11 2010
next sibling parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 04/11/2010 07:42 PM, bearophile wrote:
 D dynamic arrays are more flexible than C++ vector, they can be sliced, such
slicing is O(1), and the slices are seen by the language just like other
arrays. So you pay the price of some performance for such increased
flexibility. The idea here is that the built-in data types must be as flexible
as possible even if their performance is not so high, so they can be used for
many different purposes. Then D standard library will have specialized data
structures that are faster thanks to being more specialized and less flexible.
In D dynamic arrays some of the performance price is also paid for the
automatic memory management, for the GC that's not a precise GC (for example if
your array has some empty items at the end past its true length, the GC must
ignore them).

OTish: It would be nice if the data-structures-to-be are designed such that they can be easily dropped in wherever dynamic arrays are used. At least the list-like ones.
Apr 11 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Ellery Newcomer:
 OTish: It would be nice if the data-structures-to-be are designed such 
 that they can be easily dropped in wherever dynamic arrays are used. At 
 least the list-like ones.

Andrei will probably do what you are asking for here. But those list-like data structures probably can't be defined using the normal array literals :-)

D2 operator overloading is flexible enough for many purposes, but data structure literals are a weak spot of D. For example, you need to put the digits in a string to initialize a large BigInt. It's not easy to invent a way to allow literal design through code in a C-like language.

Bye,
bearophile
Apr 11 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
 But those list-like data structures probably can't be defined using the normal
array literals :-)

I was wrong, this is not so bad looking:

struct List(T) {
    // Node of std.range.SListRange is not static!
    static struct Node {
        T data;
        Node* next;
        this(T d, Node* p=null) {
            data = d;
            next = p;
        }
    }

    Node* lh;

    this(T[] arr) {
        foreach_reverse (el; arr)
            lh = new Node(el, lh);
    }
}

void main() {
    List!int items = [1, 2, 3];
}

Bye,
bearophile
Apr 11 2010
parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 04/11/2010 09:36 PM, bearophile wrote:
 But those list-like data structures probably can't be defined using the normal
array literals :-)

 I was wrong, this is not so bad looking:

 struct List(T) {
     // Node of std.range.SListRange is not static!
     static struct Node {
         T data;
         Node* next;
         this(T d, Node* p=null) {
             data = d;
             next = p;
         }
     }

     Node* lh;

     this(T[] arr) {
         foreach_reverse (el; arr)
             lh = new Node(el, lh);
     }
 }

 void main() {
     List!int items = [1, 2, 3];
 }

 Bye, bearophile

It won't work for classes, though. I'd think there'd be more trouble with other parts of the language, though, like slices. foreach with index is another thing I care about particularly, but it should be trivial to implement. I can't recall what else is still special-cased with array literals. And I agree about integer literals.

What would be nice is a way to override how the literals are interpreted. Maybe some sort of macroish facility that pattern matches and manipulates the AST at compile time...

<dreaming>
macro (<Decl>: (<Type> is X) (<Identifier>) = (<IntegerLiteral>, lit), decl){
    string str = "`" ~ lit.getText() ~ "`";
    lit = ( str2X( $(<Exp>, str) ));
    return decl;
}
</dreaming>

Ick, that turned into a mess fast.
Apr 11 2010
parent bearophile <bearophileHUGS lycos.com> writes:
Something that I have forgotten, for the original poster: the D append is not
templated code, so this produces no template bloat as in the C++ <vector>. This
is another compromise.

-----------

Ellery Newcomer:

It won't work for classes, though.<

You have to write:

class List(T) {
    static struct Node {
        T data;
        Node* next;
        this(T d, Node* p=null) {
            data = d;
            next = p;
        }
    }

    Node* lh;

    this(T[] arr) {
        foreach_reverse (el; arr)
            lh = new Node(el, lh);
    }
}

void main() {
    auto items = new List!int([1, 2, 3]);
}
Ick, that turned into a mess fast.<

I'd like to be a computer scientist, able to create a DMeta similar to PyMeta. A DMeta can be much cleaner than that "mess". See the original OMeta:
http://tinlizzie.org/ometa/
and its Python version:
http://washort.twistedmatrix.com/
https://launchpad.net/pymeta

Bye,
bearophile
Apr 12 2010
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
A third, uglier D version; this is a bit faster than the C++ version. With a
few experiments I've seen that the second D version was slower than the C++
code because dmd missed the inlining of the opOpAssign, and because it
contained the if (capacity < BIG_MEM) test too.

LDC can surely inline the method in a situation like that. But the BIG_MEM
test was something I added to reduce memory usage for very large vectors. I
don't know if the C++ implementation has something similar.


// Third D version
import std.stdio: printf;
import std.c.stdlib: malloc, realloc, free;

struct Vector(T) {
    enum double FAST_GROW = 2;
    enum double SLOW_GROW = 1.3;
    enum int STARTING_SIZE = 4;
    enum int BIG_MEM = (1 << 27) / T.sizeof;

    T* ptr;
    int length, capacity;

    ~this() {
        free(ptr);
        ptr = null;
        length = 0;
        capacity = 0;
    }
}

void main() {
    alias double T;
    enum int N = 5_000_000;
    enum int NLOOPS = 100;
    Vector!T arr;

    foreach (i; 0 .. NLOOPS) {
        arr.length = 0;
        printf("Array capacity = %d\n", arr.capacity);

        foreach (uint j; 0 .. N) {
            if (arr.length >= arr.capacity) {
                int new_capacity = arr.capacity
                    ? cast(int)(arr.capacity * arr.FAST_GROW)
                    : arr.STARTING_SIZE;
                arr.ptr = cast(T*)realloc(arr.ptr, new_capacity * T.sizeof);
                assert(arr.ptr);
                arr.capacity = new_capacity;
            }

            arr.ptr[arr.length] = j;
            arr.length++;
        }

        printf("At iteration %u, x has %u elements.\n", i, arr.length);
    }
}

Bye,
bearophile
Apr 11 2010
prev sibling parent reply Joseph Wakeling <joseph.wakeling gmail.com> writes:
Thanks to all again for the discussion, examples and explanations. :-)

One note -- I wouldn't want anyone to think I'm bashing D or complaining.
I've been interested in the language for some time and this seemed an
opportune time to start experimenting properly.  It's fun, I'm learning
a lot, and I'm genuinely touched by the amount of effort put in by
everyone on this list to teach and share examples.

I'm also fully aware that D is still growing, and that I need to be
patient in some cases ... :-)

bearophile wrote:
 D dynamic arrays are more flexible than C++ vector, they can be sliced,
 such slicing is O(1), and the slices are seen by the language just like
 other arrays. So you pay the price of some performance for such
 increased flexibility. The idea here is that the built-in data types
 must be as flexible as possible even if their performance is not so
 high, so they can be used for many different purposes.

No complaint there. :-)
 Then D standard library will have specialized data structures that are
 faster thanks to being more specialized and less flexible.

In my case -- I'm turning into 'Mr C++' again -- probably that's often what I need.

If I look at the major benefits I found in moving from C to C++, the first was memory management that was as automatic as I required. For example, C++ vectors are great because they do away with having to put in malloc/realloc/free statements and let you treat dynamic arrays pretty much as 'just another variable'.

Within my own needs I've not yet found a case where the kind of smart GC functionality discussed on this thread seemed necessary, but of course I've never had it available to use before ... :-)
 In D dynamic arrays some of the performance price is also paid for the
 automatic memory management, for the GC that's not a precise GC (for
 example if your array has some empty items at the end past its true
 length, the GC must ignore them).

An idea was floating in my head about whether it is/could be possible to turn off GC safety features in a scope where they are unnecessary -- rather like a more general version of the 'assumeSafeAppend' function...
 With LDC (once we'll have a D2 version of it) the performance of D2
 can probably be the same as the C++. DMD maybe loses a little here
 because it's not so good at inlining, or maybe because the C++ vector
 is better than this D2 code.

I thought dev effort was now focusing back on GDC ... ? :-P

I have actually not made much use of the -inline option, because in the code I wrote (maybe not best suited to inlining...) it made the program generally run slower ...

Steven Schveighoffer wrote:
 The C++ example is reallocating memory, freeing memory it is no longer
 using.  It also manually handles the memory management, allocating larger
 and larger arrays in some algorithmically determined fashion (for example,
 multiplying the length by some constant factor).  This gives it an edge in
 performance because it does not have to do any costly lookup to determine
 if it can append in place, plus the realloc of the memory probably is
 cheaper than the GC realloc of D.

Right.  In fact you get precisely 24 allocs/deallocs, each doubling the memory reserve to give a total capacity of 2^23 -- and then that memory is there and can be used for the rest of the 100 iterations of the outer loop.

The shock for me was finding that D wasn't treating the memory like this, but was preserving each loop's memory (as you say, for good reason).
 D does not assume you stopped caring about the memory being pointed to
 when it had to realloc. [...] You can't do the same thing with C++
 vectors, when they reallocate, the memory they used to own could be
 freed.  This invalidates all pointers and iterators into the vector,
 but the language doesn't prevent you from having such dangling pointers.

I have a vague memory of trying to do something exactly like your example when I was working with C++ for the first time, and getting bitten on the arse by exactly the problem you describe. I wish I could remember where. I know that I found another (and possibly better) solution to do what I wanted, but it would be nice to see if a D-ish solution would give me something good.
 This must be fixed, the appender should be blazingly fast at appending
 (almost as fast as C++), with the drawback that the overhead is higher.

Overhead = memory cost? I'm not so bothered as long as the memory stays within constant, predictable bounds. It was the memory explosion that scared me. And I suspect I'd pay a small performance cost (though it would have to be small) for the kind of safety and flexibility the arrays have.
 You haven't done much with it yet.  When you start discovering how much D
 takes care of, you will be amazed :)

I know. :-) My needs are in some ways quite narrow -- numerical simulations in interdisciplinary physics -- hence the C background, and hence the premium on performance. They're also not very big programs -- simple enough for me to generally keep a personal overview on the memory management, even though with C++ that's usually all taken care of automatically (no new or delete statements if I can avoid it). What I'm fairly confident about is that, given not too much time, D will become a _far_ preferable language for that kind of development.
 The thing about D is it *can* be fast and unsafe, just as fast and unsafe
 as C, but that's not the default.

That's apparent -- I mean, given that D wraps the whole C standard library, I could basically write C code in D if I wanted, no? But of course it would have all the notational complexities of C, which is what I'd like to escape from ... :-P
Apr 12 2010
next sibling parent Don <nospam nospam.com> writes:
Joseph Wakeling wrote:
 Thanks to all again for the discussion, examples and explanations. :-)

[snip]
 My needs are in some ways quite narrow -- numerical simulations in
 interdisciplinary physics -- hence the C background, and hence the premium
 on performance.  They're also not very big programs -- simple enough for me
 to generally keep a personal overview on the memory management, even though
 with C++ that's usually all taken care of automatically (no new or delete
 statements if I can avoid it).
 
 What I'm fairly confident about is that, given not too much time, D will
 become a _far_ preferable language for that kind of development.

There are quite a lot of us here with exactly that kind of background. Something about the array issue -- D dynamic arrays are heavily geared towards algorithms which perform an initial allocation and afterwards avoid memory allocation entirely. In D, such slicing algorithms are extremely clean, extremely fast, and memory safe. In C++, it's much more difficult to write code in that manner. A straightforward translation from C++ will generally miss the benefits of D arrays, and you'll end up with slower code. A kind of "Zen of D" is to use array slices as much as possible.
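A minimal sketch of that allocate-once, slice-thereafter style (illustrative only):

```d
import std.stdio;

void main()
{
    // One allocation up front ...
    auto data = new double[10];
    foreach (i, ref x; data)
        x = i;

    // ... then work entirely through slices: no copying, no allocation.
    double[] firstHalf  = data[0 .. 5];
    double[] secondHalf = data[5 .. $];

    firstHalf[] *= 2.0;  // writes through to the underlying array

    writeln(data);  // the modifications are visible in the original
}
```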
Apr 12 2010
prev sibling parent reply Joseph Wakeling <joseph.wakeling webdrake.net> writes:
Joseph Wakeling wrote:
 On the other hand, if the assumeSafeAppend takes place outside the loops
 entirely, blow-up occurs -- because the command is not issued after the
 array resize to zero?  I've attached examples.

If I include a line x.reserve(5_000_000); before the assumeSafeAppend, the memory does not explode indefinitely -- but it uses about 3x as much memory as the 'NoLeak' code even when the latter lacks advance reservation of memory.
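For reference, the pattern under discussion looks roughly like this (D2, scaled down from the thread's 5_000_000 elements so it runs quickly):

```d
import std.stdio;

void main()
{
    real[] x;
    x.reserve(1000);  // scaled down from 5_000_000

    foreach (i; 0 .. 100)
    {
        x.length = 0;
        // After shrinking, tell the runtime that no other slice still
        // references the old elements, so '~=' may reuse the existing
        // block instead of reallocating.  Note this has to be re-issued
        // inside the loop, after each shrink -- issuing it once outside
        // the loop is what causes the blow-up described above.
        x.assumeSafeAppend();

        foreach (j; 0 .. 1000)
            x ~= j;
    }

    writefln("length: %s, capacity: %s", x.length, x.capacity);
}
```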
Apr 13 2010
next sibling parent reply Don <nospam nospam.com> writes:
Joseph Wakeling wrote:
 Steven Schveighoffer wrote:
 Assuming you may ask questions about this, it's not an exact description
 of what happens.  The append code and GC are smarter about it than this,
 I just wanted to explain it in an easy-to-understand way :)  The real
 code uses algorithms to determine optimal grow sizes and can extend into
 consecutive blocks if possible.

I realised that while running some of the code that bearophile suggested to me, because if you look at 'capacity' in a non-pre-reserved D array that's appended to 5 million times, its capacity is only a little over 5 million -- compared to a C++ vector, which sits at the nearest power of 2 (2^23, over 8 million). The GC overhead means that in this case actual memory usage is a bit higher than C++, but on a larger scale this could surely make a big difference.

In the case of the piece of code I'm working on, I don't think the pre-reserve is really so important as far as performance goes -- the append speed is a bit of a bugger, but the main issue is probably to do with the speed of copying and iterating across arrays. For example, I suspect that the D arrays'

    x[] = 0;
    y[] = z[];

is not as efficient as a C++ vector's

    x.assign(x.size(), 0);
    y.assign(z.begin(), z.end());

The D code compiles directly to a memset and memcpy, which are intrinsics. There's no way C++ could beat it.
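That is, the slice operations under discussion, sketched minimally:

```d
void main()
{
    auto x = new double[1_000];
    auto y = new double[1_000];
    auto z = new double[1_000];

    z[] = 1.5;   // whole-array fill: compiled down to a memset-style loop
    x[] = 0.0;
    y[] = z[];   // element-wise copy: compiled to a memcpy (lengths must match)

    assert(x[500] == 0.0);
    assert(y[999] == 1.5);
}
```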
Apr 13 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Joseph Wakeling:
 OK.  I'm just bemused about why -- having solved the memory leak -- my D
 code is running significantly slower than equivalent C++ code.

I'll try to take a look at your D code. In the meantime you can show the equivalent C++ code too, so I can compare. Note that dmd has an old backend, so it doesn't optimize as well as g++ or as llvm. LDC is often able to optimize better. Bye, bearophile
Apr 13 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
So far I've just given a light reading of the code. Notes:
- 1.0L in D is a real, not a double, it's 10-12-16 bytes long.
- Mt19937 is a Mersenne twister, keep in mind that it's really good but slow.
If you need less quality, use a faster generator.
- Compile your code with -w, so it tells you when you have missed an "override".
- ReputationAlgorithm can be an abstract class, I think. So you can write it as:
abstract class ReputationAlgorithm {
    this() {}
    this(ref Rating[] ratings,
         ref double[] reputationUser,
         ref double[] reputationObject);
}

- Both the Yzlm and AvgArithmetic classes can be marked as final, and in the other
classes there are further methods that can be marked as final. As in Java, in D class
methods are virtual unless you mark them as final; but unlike Java's HotSpot, as far
as I know no D compiler is currently able to inline virtual methods.
- Dmd has both a built-in profiler and code coverage; use them to find hot
spots.
- Put a space after commas, generally.
- pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 (but you have to
import std.math anyway, because of a bug).
- You can learn to use a little contract programming, and move the asserts into a
precondition, for example in the Yzlm constructor. Then you can add class
invariants too, if you want.
- Switching the contents of two dynamic arrays does work (it doesn't work well with
stack-allocated arrays, whose contents get copied).
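
The precondition suggestion can be sketched like this (a trimmed-down hypothetical version of the Yzlm constructor, not the full class):

```d
// Moving the constructor asserts into an 'in' contract, roughly as
// suggested above; the class is cut down to just the three fields.
class Yzlm
{
    double beta, convergenceRequirement, errorMin;

    this(double b, double c, double e)
    in
    {
        // Checked before the constructor body runs (in non-release builds).
        assert(b >= 0);
        assert(c > 0);
        assert(e >= 0);
    }
    body
    {
        beta = b;
        convergenceRequirement = c;
        errorMin = e;
    }

    invariant()
    {
        // A class invariant, checked around public method calls.
        assert(beta >= 0);
    }
}

void main()
{
    auto y = new Yzlm(0.8, 1e-12, 1e-36);
    assert(y.beta == 0.8);
}
```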

I'll try to add more comments later.
Bye,
bearophile
Apr 13 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
This is a little cleaned-up version of the result of the profiling (no inlining,
final classes, -O -release):

   Func  
   Time  

7586124  std.random __uniform std.random __MersenneTwisterEngine
4040384  main
2408110  performance userReputationUpdate performance Rating
2299561  std.math __nextafter nextafter
2176746  std.contracts __enforce
2051276  std.conv __to
2006079  performance.objectReputationUpdate performance Rating
1898177  std.array __Appender performance Rating Appender __put performance Rating put performance Rating

It's a little disaster. As I have thought the Mersenne twister is taking up lot
of time. And then the nextafter, the enforce, the to!, the appender...

Bye,
bearophile
Apr 13 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
This is a single module, and on my PC it is more than two times faster than the
original code. Now I think there are no naive ways left to speed it up.

Compile it with:
dmd -O -release -inline

The low performance was caused by:
- virtual methods that don't need to be virtual;
- ^^0.5 and pow(x, 2) aren't perfectly efficient;
- the two nested loops in the main are more efficient as ref double; this is
something dmd will need to fix;
- the Mersenne twister is slow. I have used a Kiss rnd generator here, adapted from
Tango;
- the uniform of std.random is bad: it contains nextafter, which is slow; it can
be avoided using a closed range, "[]", instead of the default one, "[)";
- somewhere in the middle of the random generator there was an enforce. I am
now not so sure it's good to use a lot of enforce() in the std lib of a system
language. Maybe Andrei will need to go back to using asserts inside contracts;
- the temporary struct inside the two main loops didn't get optimized away;
- the append was messy and slow; I have removed the appends and replaced them
with simpler and tidier code using an index, which is quite a bit faster.
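The closed-range variant mentioned above can be used like this (a small sketch against the std.random API; the seed value is arbitrary):

```d
import std.random;

void main()
{
    auto rng = Random(42);

    // The default boundaries "[)" exclude the upper bound, which
    // internally involves a nextafter call; the closed variant "[]"
    // avoids it.
    auto x = uniform(0.0, 1.0, rng);        // half-open: [0, 1)
    auto y = uniform!"[]"(0.0, 1.0, rng);   // closed: [0, 1]

    assert(x >= 0.0 && x < 1.0);
    assert(y >= 0.0 && y <= 1.0);
}
```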

Now the profiling result is clean:
16394266 main
 8079735 2performance510FastRandom7uniformMFddZd
 2841876 2performance510FastRandom8randUintMFZk
 1923377 2performance54Yzlm20userReputationUpdateMFKAS12performance56RatingKAdKAdZv
 1729333 2performance54Yzlm22objectReputationUpdateMFKAS12performance56RatingKAdKAdZd

Most of the time is spent in the main and generating rnd values, but the
timings are clean.

The resulting code can be slower than the C++ code, but now it's mostly a
matter of backend. If I port this code to D1 for Tango and I compile it very
well with LDC the performance can be about as good as the C++ version or better
:-) If necessary I can use an even faster random number generator, but that's
for tomorrow, if you want.


import std.stdio: printf;
import std.math: sqrt, pow;

struct FastRandom {
    uint kiss_x = 1;
    uint kiss_y = 2;
    uint kiss_z = 4;
    uint kiss_w = 8;
    uint kiss_carry = 0;
    uint kiss_k, kiss_m;

    this(uint seed) {
        this.seed(seed);
    }

    void seed(uint seed) {
        kiss_x = seed | 1;
        kiss_y = seed | 2;
        kiss_z = seed | 4;
        kiss_w = seed | 8;
        kiss_carry = 0;
    }

    uint randUint() {
        kiss_x = kiss_x * 69069 + 1;
        kiss_y ^= kiss_y << 13;
        kiss_y ^= kiss_y >> 17;
        kiss_y ^= kiss_y << 5;
        kiss_k = (kiss_z >> 2) + (kiss_w >> 3) + (kiss_carry >> 2);
        kiss_m = kiss_w + kiss_w + kiss_z + kiss_carry;
        kiss_z = kiss_w;
        kiss_w = kiss_m;
        kiss_carry = kiss_k >> 30;
        return kiss_x + kiss_y + kiss_w;
    }

    double random() {
        return this.randUint() / (uint.max + 1.0);
    }

    double uniform(double a, double b) {
        double r = cast(double)this.randUint() / (uint.max + 1.0);
        return a + (b - a) * r;
    }
}


struct Rating {
    uint user, object;
    double value;
}


abstract class ReputationAlgorithm {
    this() {}
}


final class Yzlm : ReputationAlgorithm {
    double beta;
    double convergenceRequirement;
    double errorMin;
//    uint[] userLinks; // commented out because for now it
                        // has a constant value for all users
    double[] weightSum;
    double[] oldReputationObject;

    double objectReputationUpdate(ref Rating[] ratings,
                                  ref double[] reputationUser,
                                  ref double[] reputationObject) {
        double diff = 0;
        double[] temp = oldReputationObject;

        // Original version had:
        //
        //    oldReputationObject[] = reputationObject[]
        //
        // This version is an attempt to save effort
        // by just switching round the memory the two
        // arrays are pointing at -- not sure if it
        // actually does what I'm expecting it to.
        // Doesn't seem to improve speed. :-(
        oldReputationObject = reputationObject;
        reputationObject = temp;
        reputationObject[] = 0;

        weightSum[] = 0;

        foreach (Rating r; ratings) {
            reputationObject[r.object] += reputationUser[r.user] * r.value;
            weightSum[r.object] += reputationUser[r.user];
        }

        foreach (uint object, ref double r; reputationObject) {
            r /= (weightSum[object] > 0) ? weightSum[object] : 1;
            auto aux = (r - oldReputationObject[object]);
            diff += aux * aux;
        }

        return sqrt(diff);
    }

    void userReputationUpdate(ref Rating[] ratings,
                              ref double[] reputationUser,
                              ref double[] reputationObject) {
        reputationUser[] = 0;

        foreach (Rating r; ratings) {
            auto aux = (r.value - reputationObject[r.object]);
            reputationUser[r.user] += aux * aux;
        }

        foreach (uint user, ref double r; reputationUser) {
            //if(userLinks[user]>0)
                r = pow((r / reputationObject.length /*userLinks[user]*/) + errorMin, -beta);
        }
    }

    void opCall(ref Rating[] ratings,
                ref double[] reputationUser,
                ref double[] reputationObject) {
    //    userLinks.length = reputationUser.length;
    //    userLinks[] = 0;

        weightSum.length = reputationObject.length;
        oldReputationObject.length = reputationObject.length;

    //    foreach (Rating r; ratings)
    //        userLinks[r.user]++;

        double diff;
        uint iterations = 0;
        do {
            userReputationUpdate(ratings, reputationUser, reputationObject);
            diff = objectReputationUpdate(ratings, reputationUser, reputationObject);
            ++iterations;
        } while (diff > convergenceRequirement);
        printf("Exited in %u iterations with diff = %g < %g\n",
               iterations, diff, convergenceRequirement);
    }

    this() {}

    this(double b, double c, double e) {
        beta = b;
        convergenceRequirement = c;
        errorMin = e;
        assert(beta >= 0);
        assert(convergenceRequirement > 0);
        assert(errorMin >= 0);
    }

    this(ref Rating[] ratings,
         ref double[] reputationUser,
         ref double[] reputationObject,
         double b, double c, double e) {
        this(b,c,e);

        opCall(ratings, reputationUser, reputationObject);
    }
}


class AvgWeighted : ReputationAlgorithm {
    double[] weightSum;

    void opCall(ref Rating[] ratings,
                ref double[] reputationUser,
                ref double[] reputationObject) {
        weightSum.length = reputationObject.length;
        weightSum[] = 0;

        reputationObject[] = 0;

        foreach (Rating r; ratings) {
            reputationObject[r.object] += reputationUser[r.user]*r.value;
            weightSum[r.object] += reputationUser[r.user];
        }

        foreach (uint o, ref double r; reputationObject)
            r /= weightSum[o];
    }

    this() {}

    this(ref Rating[] ratings,
         ref double[] reputationUser,
         ref double[] reputationObject) {
        opCall(ratings, reputationUser, reputationObject);
    }
}


final class AvgArithmetic : AvgWeighted {
    override void opCall(ref Rating[] ratings,
            ref double[] reputationUser,
            ref double[] reputationObject) {
        reputationUser[] = 1;
        super.opCall(ratings, reputationUser, reputationObject);
    }

    this() {}

    this(ref Rating[] ratings,
         ref double[] reputationUser,
         ref double[] reputationObject) {
        opCall(ratings, reputationUser, reputationObject);
    }
}


void main() {
    Rating[] ratings;
    double[] reputationObject, reputationUser;
    double[] objectQuality, userError;
    auto aa = new AvgArithmetic;
    auto yzlm = new Yzlm(0.8, 1e-12, 1e-36);

    auto rnd = FastRandom(1001);

    reputationObject.length = 1_000;
    reputationUser.length = 1_000;

    objectQuality.length = reputationObject.length;
    userError.length = reputationUser.length;

    ratings.length = reputationObject.length * reputationUser.length;

    foreach (uint i; 0 .. 100) {
        foreach (ref double Q; objectQuality)
            Q = rnd.uniform(0.0, 10.0);

        foreach (ref double sigma2; userError)
            sigma2 = rnd.random();

        int pos;
        foreach (uint object, ref double Q; objectQuality)
            foreach (uint user, ref double sigma2; userError)
                ratings[pos++] = Rating(user, object, rnd.uniform(Q - sigma2, Q + sigma2));

        printf("We now have %u ratings.\n", ratings.length);

        aa(ratings, reputationUser, reputationObject);
        yzlm(ratings, reputationUser, reputationObject);

        double deltaQ = 0;
        foreach (uint object, double r; reputationObject) {
            auto aux = (r - objectQuality[object]);
            deltaQ += aux * aux;
        }
        deltaQ = sqrt(deltaQ / reputationObject.length);
        printf("[%u] Error in quality estimate: %g\n", i, deltaQ);
    }
}


Bye,
bearophile
Apr 13 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
This compiles for D1. With LDC it runs about two times faster still, about
4.5-5 times faster than the original version. We can compare this to the C++
version of yours.

ldc -O5 -release -inline -enable-unsafe-fp-math -unroll-allow-partial test.d

I have also tried the faster rnd generator (R250-521), but in this program it's
not faster than Kiss, so I've kept Kiss. Note that Tango that you can use with
LDC has Kiss already.


version (Tango) {
    import tango.stdc.stdio: printf;
    import tango.math.Math: sqrt, pow;
} else {
    import std.stdio: printf;
    import std.math: sqrt, pow;
}

struct FastRandom {
    uint kiss_x = 1;
    uint kiss_y = 2;
    uint kiss_z = 4;
    uint kiss_w = 8;
    uint kiss_carry = 0;
    uint kiss_k, kiss_m;

    void seed(uint seed) {
        kiss_x = seed | 1;
        kiss_y = seed | 2;
        kiss_z = seed | 4;
        kiss_w = seed | 8;
        kiss_carry = 0;
    }

    uint randUint() {
        kiss_x = kiss_x * 69069 + 1;
        kiss_y ^= kiss_y << 13;
        kiss_y ^= kiss_y >> 17;
        kiss_y ^= kiss_y << 5;
        kiss_k = (kiss_z >> 2) + (kiss_w >> 3) + (kiss_carry >> 2);
        kiss_m = kiss_w + kiss_w + kiss_z + kiss_carry;
        kiss_z = kiss_w;
        kiss_w = kiss_m;
        kiss_carry = kiss_k >> 30;
        return kiss_x + kiss_y + kiss_w;
    }

    double random() {
        return this.randUint() / (uint.max + 1.0);
    }

    double uniform(double a, double b) {
        double r = cast(double)this.randUint() / (uint.max + 1.0);
        return a + (b - a) * r;
    }
}


struct Rating {
    uint user, object;
    double value;
}


abstract class ReputationAlgorithm {
    this() {}
}


final class Yzlm : ReputationAlgorithm {
    double beta;
    double convergenceRequirement;
    double errorMin;
//    uint[] userLinks; // commented out because for now it
                        // has a constant value for all users
    double[] weightSum;
    double[] oldReputationObject;

    double objectReputationUpdate(ref Rating[] ratings,
                                  ref double[] reputationUser,
                                  ref double[] reputationObject) {
        double diff = 0;
        double[] temp = oldReputationObject;

        // Original version had:
        //
        //    oldReputationObject[] = reputationObject[]
        //
        // This version is an attempt to save effort
        // by just switching round the memory the two
        // arrays are pointing at -- not sure if it
        // actually does what I'm expecting it to.
        // Doesn't seem to improve speed. :-(
        oldReputationObject = reputationObject;
        reputationObject = temp;
        reputationObject[] = 0;

        weightSum[] = 0;

        foreach (Rating r; ratings) {
            reputationObject[r.object] += reputationUser[r.user] * r.value;
            weightSum[r.object] += reputationUser[r.user];
        }

        foreach (uint object, ref double r; reputationObject) {
            r /= (weightSum[object] > 0) ? weightSum[object] : 1;
            auto aux = (r - oldReputationObject[object]);
            diff += aux * aux;
        }

        return sqrt(diff);
    }

    void userReputationUpdate(ref Rating[] ratings,
                              ref double[] reputationUser,
                              ref double[] reputationObject) {
        reputationUser[] = 0;

        foreach (Rating r; ratings) {
            auto aux = (r.value - reputationObject[r.object]);
            reputationUser[r.user] += aux * aux;
        }

        foreach (uint user, ref double r; reputationUser) {
            //if(userLinks[user]>0)
                r = pow((r / reputationObject.length /*userLinks[user]*/) + errorMin, -beta);
        }
    }

    void opCall(ref Rating[] ratings,
                ref double[] reputationUser,
                ref double[] reputationObject) {
    //    userLinks.length = reputationUser.length;
    //    userLinks[] = 0;

        weightSum.length = reputationObject.length;
        oldReputationObject.length = reputationObject.length;

    //    foreach (Rating r; ratings)
    //        userLinks[r.user]++;

        double diff;
        uint iterations = 0;
        do {
            userReputationUpdate(ratings, reputationUser, reputationObject);
            diff = objectReputationUpdate(ratings, reputationUser, reputationObject);
            ++iterations;
        } while (diff > convergenceRequirement);
        printf("Exited in %u iterations with diff = %g < %g\n",
               iterations, diff, convergenceRequirement);
    }

    this() {}

    this(double b, double c, double e) {
        beta = b;
        convergenceRequirement = c;
        errorMin = e;
        assert(beta >= 0);
        assert(convergenceRequirement > 0);
        assert(errorMin >= 0);
    }

    this(ref Rating[] ratings,
         ref double[] reputationUser,
         ref double[] reputationObject,
         double b, double c, double e) {
        this(b,c,e);

        opCall(ratings, reputationUser, reputationObject);
    }
}


class AvgWeighted : ReputationAlgorithm {
    double[] weightSum;

    void opCall(ref Rating[] ratings,
                ref double[] reputationUser,
                ref double[] reputationObject) {
        weightSum.length = reputationObject.length;
        weightSum[] = 0;

        reputationObject[] = 0;

        foreach (Rating r; ratings) {
            reputationObject[r.object] += reputationUser[r.user]*r.value;
            weightSum[r.object] += reputationUser[r.user];
        }

        foreach (uint o, ref double r; reputationObject)
            r /= weightSum[o];
    }

    this() {}

    this(ref Rating[] ratings,
         ref double[] reputationUser,
         ref double[] reputationObject) {
        opCall(ratings, reputationUser, reputationObject);
    }
}


final class AvgArithmetic : AvgWeighted {
    override void opCall(ref Rating[] ratings,
            ref double[] reputationUser,
            ref double[] reputationObject) {
        reputationUser[] = 1;
        super.opCall(ratings, reputationUser, reputationObject);
    }

    this() {}

    this(ref Rating[] ratings,
         ref double[] reputationUser,
         ref double[] reputationObject) {
        opCall(ratings, reputationUser, reputationObject);
    }
}


void main() {
    Rating[] ratings;
    double[] reputationObject, reputationUser;
    double[] objectQuality, userError;
    auto aa = new AvgArithmetic;
    auto yzlm = new Yzlm(0.8, 1e-12, 1e-36);

    FastRandom rnd;
    rnd.seed(1001);

    reputationObject.length = 1_000;
    reputationUser.length = 1_000;

    objectQuality.length = reputationObject.length;
    userError.length = reputationUser.length;

    ratings.length = reputationObject.length * reputationUser.length;

    for (uint i; i < 4; i++) { // 100  4  ***************
        foreach (ref double Q; objectQuality)
            Q = rnd.uniform(0.0, 10.0);

        foreach (ref double sigma2; userError)
            sigma2 = rnd.random();

        int pos;
        foreach (uint object, ref double Q; objectQuality)
            foreach (uint user, ref double sigma2; userError)
                ratings[pos++] = Rating(user, object, rnd.uniform(Q - sigma2, Q + sigma2));

        printf("We now have %u ratings.\n", ratings.length);

        aa(ratings, reputationUser, reputationObject);
        yzlm(ratings, reputationUser, reputationObject);

        double deltaQ = 0;
        foreach (uint object, double r; reputationObject) {
            auto aux = (r - objectQuality[object]);
            deltaQ += aux * aux;
        }
        deltaQ = sqrt(deltaQ / reputationObject.length);
        printf("[%u] Error in quality estimate: %g\n", i, deltaQ);
    }
}

Bye,
bearophile
Apr 13 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Joseph Wakeling:

 My guess for the difference is that you ran less than the full 100 iterations
 of the main for loop.

You are right, sorry.
 So I think it's probably just compiler difference that's to blame for speed
 differences.

This can be true, but such differences are not magic: they can be found, and eventually they can even become precise enhancement requests for the LLVM developers, who are open to such requests.
 After all, D in general and DMD in particular is in development.

DMD has a quite old back-end that is not in active development. Maybe it will become 64 bit.
I _am_ surprised that in your profiling the random number generator took up so
much time, as if you cut out the lines,

     yzlm(ratings, reputationUser, reputationObject);

 ... and recompile, the program screams through in no time at all -- which
 would lead me to think that it's the yzlm opCall and the functions it calls
 that take up the majority of time.  Or is there some lazy evaluation going
 on here ... ?

I've done a few more tests, and I've seen that my profiling in DMD was wrong: I had disabled the inlining (to have a less aggregated result), but this produced fake results, because in the true optimized run the inlining removes most of the overhead of the Kiss generator. So the true profiling is now like you say:

   Func
   Time

17703021  Yzlm.objectReputationUpdate
12709135  Yzlm.userReputationUpdate
 1385194  main

In this program I have seen that a good percentage of the speed difference between ldc/dmd is caused by foreach loops. When you iterate over structs longer than size_t, use ref (or in D2, better, ref const(...)):

foreach (Rating r; ratings)  ==>  foreach (ref const(Rating) r; ratings)

There is nothing algorithmically strange going on here, but to be sure you can take a look at the produced asm. This is the middle part (the two loops) of Yzlm.objectReputationUpdate compiled with ldc; it is one of the two main hot spots of the program, and you can compare this asm with the asm of the same part from the C++ version:

...
.LBB6_2:
    movl    (%eax), %edx
    movl    52(%esp), %edi
    movl    4(%edi), %ebx
    movsd   (%ebx,%edx,8), %xmm0
    mulsd   8(%eax), %xmm0
    movl    4(%eax), %edx
    movl    48(%esp), %ebx
    movl    4(%ebx), %ebx
    addsd   (%ebx,%edx,8), %xmm0
    movsd   %xmm0, (%ebx,%edx,8)
    movl    4(%eax), %edx
    movl    36(%esi), %ebx
    movsd   (%ebx,%edx,8), %xmm0
    movl    (%eax), %ebp
    movl    4(%edi), %edi
    addsd   (%edi,%ebp,8), %xmm0
    movsd   %xmm0, (%ebx,%edx,8)
    addl    $16, %eax
    decl    %ecx
    jne     .LBB6_2
.LBB6_3:
    movl    48(%esp), %eax
    movl    (%eax), %eax
    testl   %eax, %eax
    je      .LBB6_9
    movl    48(%esp), %ecx
    movl    4(%ecx), %ecx
    pxor    %xmm0, %xmm0
    xorl    %edx, %edx
    pxor    %xmm1, %xmm1
    .align  16
.LBB6_5:
    movl    36(%esi), %edi
    movsd   (%edi,%edx,8), %xmm2
    ucomisd %xmm1, %xmm2
    ja      .LBB6_7
    movsd   .LCPI6_0, %xmm2
.LBB6_7:
    movsd   (%ecx,%edx,8), %xmm3
    divsd   %xmm2, %xmm3
    movsd   %xmm3, (%ecx,%edx,8)
    movl    44(%esi), %edi
    subsd   (%edi,%edx,8), %xmm3
    mulsd   %xmm3, %xmm3
    addsd   %xmm3, %xmm0
    incl    %edx
    cmpl    %eax, %edx
    jne     .LBB6_5
...

I've also seen that the program gets a little faster with some unrolling, done like this:

const int UNROLL = 5;
assert(!(ratings.length % UNROLL));
assert(!(reputationUser.length % UNROLL));

for (int i; i < ratings.length; i += UNROLL) {
    foreach (k; Range!(UNROLL)) {
        auto r = &(ratings[i + k]);
        auto aux = (r.value - reputationObject[r.object]);
        reputationUser[r.user] += aux * aux;
    }
}

Where:

template Tuple(T...) { alias T Tuple; }

template Range(int stop) {
    static if (stop <= 0)
        alias Tuple!() Range;
    else
        alias Tuple!(Range!(stop-1), stop-1) Range;
}

But to make it work with arrays of arbitrary length you need a little more code, using a strategy similar to:

sum = 0;
for (i = 0; i < n - 3; i += 4)
    sum += A[i] + A[i+1] + A[i+2] + A[i+3];
for (; i < n; i++)
    sum += A[i];
I don't think it's the algorithm in any case (although it might be Phobos'
implementation) as working with Tango and LDC, changing between the Twister or
Kiss random number generators doesn't make for any meaningful performance
difference.<

Those Phobos and Tango implementations, when you use uniform(), are probably different: there are enforce() and nextafter() calls that are probably not used in that Tango code.
I noticed that when compiled with LDC the memory usage stays constant, but in
DMD the memory use blows up.  Is this a D1/D2 difference (with D2 having the
more advanced features that require preservation of memory) or is it a bug in
 one or the other of the compilers ... ?

D2 dynamic arrays have been changed recently; D1 dynamic arrays have not changed. LDC is still D1, so this is just a D1/D2 difference. The purpose of this change between D1 and D2 was to avoid bugs caused by "memory stomping" of array slices. Bye, bearophile
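A minimal illustration of the D2 behaviour being described (the second slice survives because the runtime refuses to stomp memory it may still reference):

```d
void main()
{
    int[] a = [1, 2, 3, 4];
    int[] b = a;          // a second slice over the same data

    a.length = 2;         // shrink a's view; b still sees all four elements
    a ~= 99;              // D2 reallocates rather than overwrite b[2]

    assert(b == [1, 2, 3, 4]);  // b is unaffected
    assert(a == [1, 2, 99]);
}
```

Under D1 semantics the append could happen in place, overwriting `b[2]` -- exactly the stomping this change prevents.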
Apr 15 2010
parent bearophile <bearophileHUGS lycos.com> writes:
 This is the middle parts (the two loops) of Yzlm.objectReputationUpdate
compiled with ldc, this is one of the two main hot spots of the program, the
you can compared this asm with the asm of the same part from the C++ version:

To find why the C++ code is faster you can show me the equivalent C++ code that I can compile myself, or you can compile your C++ code and show me the asm of the two functions that are hot spots. Bye, bearophile
Apr 15 2010
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
 - the two nested loops in the main are more efficient as ref double,
 this is something dmd will need to fix;

A test shows that on ldc the two nested loops are a little faster without the ref. I'd like the compiler to use a "const ref" when the foreach iterates on array items bigger than a word. I've done a related small test:

version (Tango)
    import tango.stdc.stdio: printf;
else
    import std.stdio: printf;

void main() {
    auto data = new double[1_000];
    double tot = 0.0;
    foreach (uint i, ref el; data)
        tot += el * i;
    printf("%f\n", tot);
}

Just the asm of the loop. Note how ldc keeps two iteration indexes, one in xmm0 and one in ESI, because it's faster this way. LCPI1_0 is ".quad 4607182418800017408", that is, the double value 1.0. (Here the presence of ref doesn't change the code produced by ldc, but in the two loops of the original code it makes a little difference.)

ldc -O5 -release -inline -enable-unsafe-fp-math -output-s test.d
dmd -O -release -inline test.d

-------------------------
ldc, with ref:

.LBB1_1:
    movapd  %xmm0, %xmm1
    mulsd   (%eax,%esi,8), %xmm1
    movsd   16(%esp), %xmm2
    addsd   %xmm1, %xmm2
    movsd   %xmm2, 16(%esp)
    incl    %esi
    cmpl    $1000, %esi
    addsd   .LCPI1_0, %xmm0
    jne     .LBB1_1

-------------------------
ldc, without ref:

.LBB1_1:
    movapd  %xmm0, %xmm1
    mulsd   (%eax,%esi,8), %xmm1
    movsd   16(%esp), %xmm2
    addsd   %xmm1, %xmm2
    movsd   %xmm2, 16(%esp)
    incl    %esi
    cmpl    $1000, %esi
    addsd   .LCPI1_0, %xmm0
    jne     .LBB1_1

-------------------------
dmd, with ref:

L45:
    mov     8[ESP],EDX
    xor     ESI,ESI
    fld     qword ptr [EDX*8][ECX]
    mov     0Ch[ESP],ESI
    mov     EBX,EDX
    inc     EDX
    fild    long64 ptr 8[ESP]
    fmulp   ST(1),ST
    fadd    qword ptr 018h[ESP]
    cmp     EDX,010h[ESP]
    fstp    qword ptr 018h[ESP]
    jb      L45

-------------------------
dmd, without ref:

L45:
    mov     8[ESP],EDX
    xor     ESI,ESI
    mov     EBX,EDX
    mov     0Ch[ESP],ESI
    fild    long64 ptr 8[ESP]
    fmul    qword ptr [EDX*8][ECX]
    inc     EDX
    cmp     EDX,010h[ESP]
    fadd    qword ptr 018h[ESP]
    fstp    qword ptr 018h[ESP]
    jb      L45

Bye,
bearophile
Apr 14 2010
prev sibling parent reply Don <nospam nospam.com> writes:
bearophile wrote:
 So far I've just given a light reading of the code. Notes:

 - pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 (but you have to
import std.math anyway, because of a bug).

That's not a bug. It's intentional. x ^^ y will probably always require import std.math, if y is a floating point number. Note that x^^2 doesn't require import std.math. (The fact that x ^^ 5 requires import std.math is a temporary limitation).
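A minimal sketch of the distinction Don describes, as the behavior stood in the compilers discussed in this thread (exact diagnostics may differ by compiler version):

```d
// Integer exponents are rewritten into multiplications by the
// compiler, so no import is needed for them:
void squares() {
    double x = 3.0;
    auto y = x ^^ 2;    // fine: lowered to x * x
}

// A floating-point exponent is lowered to a call to std.math.pow,
// so std.math must be visible at the use site:
import std.math;        // without this, x ^^ 0.5 is a compile error

void roots() {
    double x = 3.0;
    auto r = x ^^ 0.5;  // calls std.math.pow(x, 0.5) under the hood
}
```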
Apr 14 2010
next sibling parent reply "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
Don wrote:
 bearophile wrote:
 So far I've just given a light reading of the code. Notes:

 - pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 (but you 
 have to import std.math anyway, because of a bug).

That's not a bug. It's intentional. x ^^ y will probably always require import std.math, if y is a floating point number.

Really? Why is that? I find that kind of disappointing; I always believed it to be a temporary solution.

I think the inconsistency with the other operators will make this a major WTF for people new to the language. Why should a^^b require an explicit import while a*b doesn't?

If the language made it possible to overload operators using free functions, I wouldn't mind if opBinary!"^^"(float, float) was implemented in std.math. The way it is now, it's a halfway built-in, halfway library feature, and just seems halfway altogether.

-Lars
Apr 14 2010
parent reply Don <nospam nospam.com> writes:
Lars T. Kyllingstad wrote:
 Don wrote:
 bearophile wrote:
 So far I've just given a light reading of the code. Notes:

 - pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 (but 
 you have to import std.math anyway, because of a bug).

That's not a bug. It's intentional. x ^^ y will probably always require import std.math, if y is a floating point number.

Really? Why is that? I find that kind of disappointing, I always believed it to be a temporary solution. I think the inconsistency with the other operators will make this a major WTF for people new to the language. Why should a^^b require an explicit import while a*b doesn't?

Because pow() for floating point, when implemented properly, is a HUGE function that ends up dragging almost all of std.math into the executable. And I think it's deceptive to do that silently. To make it completely built-in, basically all of std.math would need to be moved into druntime. Feel free to try to change my mind, of course.
 If the language made it possible to overload operators using free 
 functions, I wouldn't mind if opBinary!"^^"(float, float) was 
 implemented in std.math.  The way it is now, it's a halfway built-in, 
 halfway library feature, and just seems halfway altogether.

You can expect the integration to become cleaner; still, it's a compromise. It was a big fight to get it into the language at all, and I've tried to minimize the cost of it. ^^ is basically syntax sugar, and the price you pay for the additional tiny bit of syntax sugar (avoiding import std.math) has an appalling cost-benefit ratio. Raising to a float power is really a niche feature. Are there really many uses for ^^ of floats, where std.math isn't imported already (for example, where you don't even use abs()) ?
Apr 14 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Don:

Thank you for your answer and comments.

 Because pow() for floating point, when implemented properly, is a HUGE
 function, that ends up dragging almost all of std.math into the
 executable. And I think it's deceptive to do that silently.
 To make it completely built-in, basically all of std.math would need to
 be moved into druntime.

I have just read the source of pow() and indeed it's a complex function that needs several other functions of math.
Raising to a float power is really a niche feature.<

Using it as x^^0.5 to perform the square root is not a niche feature; square roots are common enough. And sqrt is an intrinsic, it doesn't need all of std.math.

Error messages are not good yet. This program:

import std.c.stdio: printf;
void main() {
    printf("%f\n", 2.5 ^^ 0.5);
}

Generates:

test.d(3): Error: must import std.math to use ^^ operator
test.d(3): Error: undefined identifier module test5.std
test.d(3): Error: no property 'math' for type 'void'
Error: no property 'sqrt' for type 'int'
test.d(3): Error: function expected before (), not __error of type int

A better alternative is a single, focused error:

test.d(4): Error: must import std.math to use "x ^^ floating point".

If I change the code a little the situation doesn't improve much:

import std.c.stdio: printf;
import std.math: sqrt;
void main() {
    printf("%f\n", 2.5 ^^ 0.5);
}

test5.d(4): Error: undefined identifier module test5.std
test5.d(4): Error: no property 'math' for type 'void'
Error: no property 'sqrt' for type 'int'
test5.d(4): Error: function expected before (), not __error of type int

So I can see various solutions:

1) Keep the situation as it is now, but improve the error messages (essentially printing only one error message that's more focused).

2) As in point (1), but x^^0.5 doesn't require importing std.math; x^^FP still requires the import.

3) Automatically import pow() in all modules, even if this causes the import of a lot of std.math. So no errors are generated.

4) Automatically import pow() in a module only if it contains an x^^FP where FP is a floating point != 0.5, so the code in most modules is not affected by std.math.

Regardless of what the final decision will be, this topic must be discussed in the D newsgroup; it's not something you have to decide by yourself (even though you are a very good programmer, Don). This subthread can even be moved to the main D newsgroup.

Bye,
bearophile
Apr 14 2010
next sibling parent Don <nospam nospam.com> writes:
bearophile wrote:
 Don:
 
 Thank you for your answer and comments.
 
 Because pow() for floating point, when implemented properly, is a HUGE
 function, that ends up dragging almost all of std.math into the
 executable. And I think it's deceptive to do that silently.
 To make it completely built-in, basically all of std.math would need to
 be moved into druntime.

I have just read the source of pow() and indeed it's a complex function that needs several other functions of math.

It's not even implemented correctly yet. A correct implementation is AT LEAST ten times larger.
 
 
 Raising to a float power is really a niche feature.<

Used it as x^^0.5 to perform the square root is not a niche feature, square roots are common enough. And sqrt is an intrinsic, it doesn't need all std.math.

 
 Error messages are not good yet, this program:
 
 
 import std.c.stdio: printf;
 void main() {
     printf("%f\n", 2.5 ^^ 0.5);
 }
 
 
 Generates:
 
 test.d(3): Error: must import std.math to use ^^ operator
 test.d(3): Error: undefined identifier module test5.std
 test.d(3): Error: no property 'math' for type 'void'
 Error: no property 'sqrt' for type 'int'
 test.d(3): Error: function expected before (), not __error of type int
 
 
 Alternative single error:
 
 test.d(4): Error: must import std.math to use "x ^^ floating point".

Please don't complain about this obvious stuff. Parasitic error suppression is an ongoing task in the compiler. It's gradually improving, and there's nothing special about this particular case.
 If I change the code a little the situation doesn't improve a lot:
 
 import std.c.stdio: printf;
 import std.math: sqrt;
 void main() {
     printf("%f\n", 2.5 ^^ 0.5);
 }
 
 
 test5.d(4): Error: undefined identifier module test5.std
 test5.d(4): Error: no property 'math' for type 'void'
 Error: no property 'sqrt' for type 'int'
 test5.d(4): Error: function expected before (), not __error of type int
 
 
 So I can see various solutions:
 
 1) Keep the situation as now, but improve the error messages (essentially
printing only one error message that's more focused).
 
 2) As the point (1), but in the case of x^^0.5 doesn't require the import
math; x^^FP requires the import still.
 
 3) Automatically import the pow() in all modules, even if this causes the
import of lot of std.math. So no errors are generated.
 
 4) Automatically import the pow() in a module only if it contains a x^^FP
where FP is a floating point != 0.5  So the code in most modules is not
affected by std.math.
 
 Regardless what the final decision will be, this topic must be discussed in
the D newsgroup, it's not something you have to decide for yourself (despite
you are a very programmer, Don). This subthread can even be moved to the main D
newsgroup.

I see it as a VERY minor issue, we don't need to deal with it for a long time.
Apr 14 2010
prev sibling parent Justin Spahr-Summers <Justin.SpahrSummers gmail.com> writes:
On Wed, 14 Apr 2010 05:58:55 -0400, bearophile 
<bearophileHUGS lycos.com> wrote:
 
 Don:
 
Raising to a float power is really a niche feature.<

Used it as x^^0.5 to perform the square root is not a niche feature, square roots are common enough. And sqrt is an intrinsic, it doesn't need all std.math.

I think I've only ever used a square root once, and that was to get the distance between two points. I'm not sure that square roots are common enough in general code to justify an implicit import (especially if we think of D in its context as a systems programming language).
Apr 14 2010
prev sibling next sibling parent reply "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
Don wrote:
 Lars T. Kyllingstad wrote:
 Don wrote:
 bearophile wrote:
 So far I've just given a light reading of the code. Notes:

 - pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 (but 
 you have to import std.math anyway, because of a bug).

That's not a bug. It's intentional. x ^^ y will probably always require import std.math, if y is a floating point number.

Really? Why is that? I find that kind of disappointing, I always believed it to be a temporary solution. I think the inconsistency with the other operators will make this a major WTF for people new to the language. Why should a^^b require an explicit import while a*b doesn't?

Because pow() for floating point, when implemented properly, is a HUGE function, that ends up dragging almost all of std.math into the executable. And I think it's deceptive to do that silently. To make it completely built-in, basically all of std.math would need to be moved into druntime. Feel free to try to change my mind, of course.

Exponentiation is a built-in operation in FORTRAN, so I made this little program to check:

program test
    implicit none
    real :: base, expo, pow

    write (*,*) "Base:"
    read (*,*) base
    write (*,*) "Exponent:"
    read (*,*) expo

    pow = base**expo
    write (*,*) pow
end program test

The produced executable is 11K. If I replace exponentiation with multiplication, it is still 11K. Why wouldn't the same be possible in D?

Note that I'm not trying to argue with you; I am sure you know what you're talking about. I know very little of what's going on under the hood of either compiler, so I'm asking out of curiosity.
 If the language made it possible to overload operators using free 
 functions, I wouldn't mind if opBinary!"^^"(float, float) was 
 implemented in std.math.  The way it is now, it's a halfway built-in, 
 halfway library feature, and just seems halfway altogether.

You can expect the integration to become cleaner; still, it's a compromise. It was a big fight to get it into the language at all, and I've tried to minimize the cost of it. ^^ is basically syntax sugar, and the price you pay for the additional tiny bit of syntax sugar (avoiding import std.math) has an appalling cost-benefit ratio. Raising to a float power is really a niche feature. Are there really many uses for ^^ of floats, where std.math isn't imported already (for example, where you don't even use abs()) ?

It's the inconsistency with the other operations on built-in types that bothers me, not the hassle of writing 'import std.math'. But please don't get me wrong, I absolutely prefer the current solution over nothing. Now that I've gotten used to it I'm never going back. :) For fun I grepped my numerics library for uses of ^^, and found I've used it >65 times (65 is the line count, and it's used twice in some lines). And let it be known that the majority of the cases are float^^float! ;) -Lars
Apr 14 2010
parent reply "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
Lars T. Kyllingstad wrote:
 Don wrote:
 Lars T. Kyllingstad wrote:
 Don wrote:
 bearophile wrote:
 So far I've just given a light reading of the code. Notes:

 - pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 (but 
 you have to import std.math anyway, because of a bug).

That's not a bug. It's intentional. x ^^ y will probably always require import std.math, if y is a floating point number.

Really? Why is that? I find that kind of disappointing, I always believed it to be a temporary solution. I think the inconsistency with the other operators will make this a major WTF for people new to the language. Why should a^^b require an explicit import while a*b doesn't?

Because pow() for floating point, when implemented properly, is a HUGE function, that ends up dragging almost all of std.math into the executable. And I think it's deceptive to do that silently. To make it completely built-in, basically all of std.math would need to be moved into druntime. Feel free to try to change my mind, of course.

Exponentiation is a built-in operation in FORTRAN, so I made this little program to check:

program test
    implicit none
    real :: base, expo, pow

    write (*,*) "Base:"
    read (*,*) base
    write (*,*) "Exponent:"
    read (*,*) expo

    pow = base**expo
    write (*,*) pow
end program test

The produced executable is 11K. If I replace exponentiation with multiplication, it is still 11K. Why wouldn't the same be possible in D?

Scratch that, I just remembered that libgfortran is dynamically linked in by default. However, compiling with -static-libgfortran makes the executable 155K with both exponentiation and multiplication. -Lars
Apr 14 2010
parent reply Don <nospam nospam.com> writes:
Lars T. Kyllingstad wrote:
 Lars T. Kyllingstad wrote:
 Don wrote:
 Lars T. Kyllingstad wrote:
 Don wrote:
 bearophile wrote:
 So far I've just given a light reading of the code. Notes:

 - pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 (but 
 you have to import std.math anyway, because of a bug).

That's not a bug. It's intentional. x ^^ y will probably always require import std.math, if y is a floating point number.

Really? Why is that? I find that kind of disappointing, I always believed it to be a temporary solution. I think the inconsistency with the other operators will make this a major WTF for people new to the language. Why should a^^b require an explicit import while a*b doesn't?

Because pow() for floating point, when implemented properly, is a HUGE function, that ends up dragging almost all of std.math into the executable. And I think it's deceptive to do that silently. To make it completely built-in, basically all of std.math would need to be moved into druntime. Feel free to try to change my mind, of course.

Exponentiation is a built-in operation in FORTRAN, so I made this little program to check:

program test
    implicit none
    real :: base, expo, pow

    write (*,*) "Base:"
    read (*,*) base
    write (*,*) "Exponent:"
    read (*,*) expo

    pow = base**expo
    write (*,*) pow
end program test

The produced executable is 11K. If I replace exponentiation with multiplication, it is still 11K. Why wouldn't the same be possible in D?

Scratch that, I just remembered that libgfortran is dynamically linked in by default. However, compiling with -static-libgfortran makes the executable 155K with both exponentiation and multiplication. -Lars

I have a vague recollection that correctly-rounded pow() will require bigint (can't quite remember, though). I'm also concerned about build tools -- I don't want them to have to know about the dependency. As a bare minimum, the error message will need to improve (with some explanation of why std.math is required, to reduce the WTF factor). But in any case, it's a very minor issue. Personally, I'm finding that having ^^ as standard nomenclature is extremely useful, even apart from the use in code.
Apr 14 2010
parent reply Robert Clipsham <robert octarineparrot.com> writes:
On 14/04/10 20:54, Don wrote:
 I have a vague recollection that correctly-rounded pow() will require
 bigint (can't quite remember, though). I'm also concerned about build
 tools -- I don't want them to have to know about the dependency.
 As a bare minimum, the error message will need to improve (with some
 explanation of why std.math is required, to reduce the WTF factor).
 But in any case, it's a very minor issue.
 Personally, I'm finding that having ^^ as standard nomenclature is
 extremely useful, even apart from the use in code.

I've always disliked the requirement to import std.math, but only because of its lack of flexibility... Surely it would be better to require the pow() function to be defined somewhere; this way, if/when Tango is ported to D2, it doesn't have to hack around this to allow users to use ^^.
Apr 14 2010
parent bearophile <bearophileHUGS lycos.com> writes:
Robert Clipsham Wrote:

 this way if/when 
 tango is ported to D2 it doesn't have to hack around this to allow it to 
 allow users to use ^^.

I hope Tango2 will be designed to be installed beside Phobos2, and not in place of it. Bye, bearophile
Apr 14 2010
prev sibling parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
Don wrote:
 Lars T. Kyllingstad wrote:
 Don wrote:
 bearophile wrote:
 So far I've just given a light reading of the code. Notes:

 - pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 (but 
 you have to import std.math anyway, because of a bug).

That's not a bug. It's intentional. x ^^ y will probably always require import std.math, if y is a floating point number.

Really? Why is that? I find that kind of disappointing, I always believed it to be a temporary solution. I think the inconsistency with the other operators will make this a major WTF for people new to the language. Why should a^^b require an explicit import while a*b doesn't?

Because pow() for floating point, when implemented properly, is a HUGE function, that ends up dragging almost all of std.math into the executable. And I think it's deceptive to do that silently. To make it completely built-in, basically all of std.math would need to be moved into druntime. Feel free to try to change my mind, of course.

Is there a better way to do pow() for floating point numbers without importing std.math? I see this:

1. You want to do x ^^ fp.
2. The compiler complains, saying "if you want to do that, import std.math. I'm telling you this just to let you know you'll be importing the whole module".

Alternative 1 for the user: the user says "Ok, I import std.math".

Alternative 2 for the user: the user says "No way I'm importing std.math just to compute a pow. But... I still *need* to compute that pow... what is your advice, mr. compiler?"
Apr 15 2010
parent reply Don <nospam nospam.com> writes:
Ary Borenszweig wrote:
 Don wrote:
 Lars T. Kyllingstad wrote:
 Don wrote:
 bearophile wrote:
 So far I've just given a light reading of the code. Notes:

 - pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 (but 
 you have to import std.math anyway, because of a bug).

That's not a bug. It's intentional. x ^^ y will probably always require import std.math, if y is a floating point number.

Really? Why is that? I find that kind of disappointing, I always believed it to be a temporary solution. I think the inconsistency with the other operators will make this a major WTF for people new to the language. Why should a^^b require an explicit import while a*b doesn't?

Because pow() for floating point, when implemented properly, is a HUGE function, that ends up dragging almost all of std.math into the executable. And I think it's deceptive to do that silently. To make it completely built-in, basically all of std.math would need to be moved into druntime. Feel free to try to change my mind, of course.

Is there a better way to do pow() for floating point numbers without importing std.math? I see this: 1. You want to do x ^^ fp. 2. The compiler complains saying "if you want to do that, import std.math. I'm telling you this just to let you know you'll be importing the whole module". Alternative 1 for user: User says "Ok, I import std.math" Alternative 2 for user: User says "No way I'm importing std.math just to make a pow. But... I still *need* to make that pow... what is your advice, mr. compiler?"

That's a good point, it should be possible to use a static import as well. I do think it's pretty odd to be doing floating point without importing std.math, though. I mean, abs() is absolutely fundamental.
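For illustration, the qualified-import form Don is referring to would make the dependency explicit without pulling every std.math name into scope (a sketch; whether ^^ accepts a static import depends on how the compiler's lowering looks up pow, which is exactly the point under discussion):

```d
static import std.math;  // no names are injected into the current scope

void main() {
    double x = 2.5;
    auto r = x ^^ 0.5;              // still lowered to a call to std.math.pow
    auto p = std.math.pow(x, 3.0);  // other std.math symbols need full qualification
}
```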
Apr 15 2010
parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
Don wrote:
 Ary Borenszweig wrote:
 Don wrote:
 Lars T. Kyllingstad wrote:
 Don wrote:
 bearophile wrote:
 So far I've just given a light reading of the code. Notes:

 - pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 (but 
 you have to import std.math anyway, because of a bug).

That's not a bug. It's intentional. x ^^ y will probably always require import std.math, if y is a floating point number.

Really? Why is that? I find that kind of disappointing, I always believed it to be a temporary solution. I think the inconsistency with the other operators will make this a major WTF for people new to the language. Why should a^^b require an explicit import while a*b doesn't?

Because pow() for floating point, when implemented properly, is a HUGE function, that ends up dragging almost all of std.math into the executable. And I think it's deceptive to do that silently. To make it completely built-in, basically all of std.math would need to be moved into druntime. Feel free to try to change my mind, of course.

Is there a better way to do pow() for floating point numbers without importing std.math? I see this: 1. You want to do x ^^ fp. 2. The compiler complains saying "if you want to do that, import std.math. I'm telling you this just to let you know you'll be importing the whole module". Alternative 1 for user: User says "Ok, I import std.math" Alternative 2 for user: User says "No way I'm importing std.math just to make a pow. But... I still *need* to make that pow... what is your advice, mr. compiler?"

That's a good point, it should be possible to use a static import as well. I do think it's pretty odd to be doing floating point without importing std.math, though. I mean, abs() is absolutely fundamental.

But if you do a static import, the whole module gets linked in, right? My point is, if you are going to use pow, you will need std.math, so it'll always be a burden to import it by hand when using it. ^^
Apr 15 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Ary Borenszweig:
 My point is, if you are going to pow, you will need std.math, so it'll 
 always be a burden to import it by hand when using it. ^^

Can the automatic import happen only iff a module uses the ^^fp? Bye, bearophile
Apr 15 2010
parent Ary Borenszweig <ary esperanto.org.ar> writes:
bearophile wrote:
 Ary Borenszweig:
 My point is, if you are going to pow, you will need std.math, so it'll 
 always be a burden to import it by hand when using it. ^^

Can the automatic import happen only iff a module uses the ^^fp? Bye, bearophile

That's what I'm asking for.
Apr 16 2010
prev sibling parent reply Don <nospam nospam.com> writes:
Ary Borenszweig wrote:
 Don wrote:
 Ary Borenszweig wrote:
 Don wrote:
 Lars T. Kyllingstad wrote:
 Don wrote:
 bearophile wrote:
 So far I've just given a light reading of the code. Notes:

 - pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 
 (but you have to import std.math anyway, because of a bug).

That's not a bug. It's intentional. x ^^ y will probably always require import std.math, if y is a floating point number.

Really? Why is that? I find that kind of disappointing, I always believed it to be a temporary solution. I think the inconsistency with the other operators will make this a major WTF for people new to the language. Why should a^^b require an explicit import while a*b doesn't?

Because pow() for floating point, when implemented properly, is a HUGE function, that ends up dragging almost all of std.math into the executable. And I think it's deceptive to do that silently. To make it completely built-in, basically all of std.math would need to be moved into druntime. Feel free to try to change my mind, of course.

Is there a better way to do pow() for floating point numbers without importing std.math? I see this: 1. You want to do x ^^ fp. 2. The compiler complains saying "if you want to do that, import std.math. I'm telling you this just to let you know you'll be importing the whole module". Alternative 1 for user: User says "Ok, I import std.math" Alternative 2 for user: User says "No way I'm importing std.math just to make a pow. But... I still *need* to make that pow... what is your advice, mr. compiler?"

That's a good point, it should be possible to use a static import as well. I do think it's pretty odd to be doing floating point without importing std.math, though. I mean, abs() is absolutely fundamental.

But if you do a static import the whole module gets linked in, right? My point is, if you are going to pow, you will need std.math, so it'll always be a burden to import it by hand when using it. ^^

But you're assuming that you're using ^^ without using anything else from std.math. I think that's a very obscure case. For example, any code which is ported from C, C++, or D1 that uses pow will already be importing std.math. In cases where you see that you could use ^^ but aren't already using pow() (e.g., where you see z = x*x + y*y), you will need to add an import.
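Spelling out the kind of rewrite Don means, where introducing ^^ into code that never touched pow() suddenly requires the import (a sketch; the distance function here is just an illustrative example):

```d
import std.math;  // required only by the ^^ 0.5 below; the integer
                  // exponents on the previous line don't need it

double distance(double x, double y) {
    // before the rewrite this was: z = x*x + y*y; r = sqrt(z);
    auto z = x ^^ 2 + y ^^ 2;  // integer exponent: no std.math required
    return z ^^ 0.5;           // floating-point exponent: requires std.math
}
```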
Apr 16 2010
parent Ary Borenszweig <ary esperanto.org.ar> writes:
Don wrote:
 Ary Borenszweig wrote:
 Don wrote:
 Ary Borenszweig wrote:
 Don wrote:
 Lars T. Kyllingstad wrote:
 Don wrote:
 bearophile wrote:
 So far I've just given a light reading of the code. Notes:

 - pow(x, 2) and sqrt(y) can be written as x ^^ 2 and y ^^ 0.5 
 (but you have to import std.math anyway, because of a bug).

That's not a bug. It's intentional. x ^^ y will probably always require import std.math, if y is a floating point number.

Really? Why is that? I find that kind of disappointing, I always believed it to be a temporary solution. I think the inconsistency with the other operators will make this a major WTF for people new to the language. Why should a^^b require an explicit import while a*b doesn't?

Because pow() for floating point, when implemented properly, is a HUGE function, that ends up dragging almost all of std.math into the executable. And I think it's deceptive to do that silently. To make it completely built-in, basically all of std.math would need to be moved into druntime. Feel free to try to change my mind, of course.

Is there a better way to do pow() for floating point numbers without importing std.math? I see this: 1. You want to do x ^^ fp. 2. The compiler complains saying "if you want to do that, import std.math. I'm telling you this just to let you know you'll be importing the whole module". Alternative 1 for user: User says "Ok, I import std.math" Alternative 2 for user: User says "No way I'm importing std.math just to make a pow. But... I still *need* to make that pow... what is your advice, mr. compiler?"

That's a good point, it should be possible to use a static import as well. I do think it's pretty odd to be doing floating point without importing std.math, though. I mean, abs() is absolutely fundamental.

But if you do a static import the whole module gets linked in, right? My point is, if you are going to pow, you will need std.math, so it'll always be a burden to import it by hand when using it. ^^

But you're assuming that you're using ^^ without using anything else from std.math. I think that's a very obscure case. For example, any code which is ported from C or C++, or D1, that uses pow, will already be importing std.math. Cases where you see that you could use ^^ that isn't already using pow() (eg, where you see z = x*x + y*y), you will need to add an import.

- Oh, hi Foo, I want to show you this great language D. You can do blah and bleh.
- Cool, I will code something little to see how it is. ... Hey, WTF, I use the pow operator and it tells me I need to add an import? WTF??

You know, people program from scratch; they don't spend their lives porting C and C++ code... ^^
Apr 16 2010
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Don:

 That's not a bug. It's intentional. x ^^ y will probably always require 
 import std.math, if y is a floating point number.

What's the rationale behind this decision? I have filed it as a bug many days ago, because I think it's a wrong intention: http://d.puremagic.com/issues/show_bug.cgi?id=3832 Having to import all names from the math module to use a built-in operator is unintuitive, not handy, and sounds silly. Bye, bearophile
Apr 14 2010
prev sibling next sibling parent Joseph Wakeling <joseph.wakeling webdrake.net> writes:

Don wrote:
 The D code compiles directly to a memset and memcpy, which are
 intrinsics. There's no way C++ could beat it.

OK. I'm just bemused about why -- having solved the memory leak -- my D code is running significantly slower than equivalent C++ code.

For now I'm just playing with D by doing a rewrite of some C++ code I did for a project a while back -- to see if I can write a neater version of the code in the new language, and learn something about D in the process. I may have fallen into the trap you described earlier, of just 'translating' C++-style code directly into D, but I don't think that's the issue here. There are a couple of algorithmic inefficiencies in there for deliberate reasons, but removing them for testing purposes doesn't improve matters. The array append is slow, but it's not the major part of the running time of the program.

For the sake of clarity, I've attached the complete code. The really important part to look at is the file yzlm.d, in the opCall() function and the class functions objectReputationUpdate() and userReputationUpdate(). This is a reputation-generating process which takes a set of ratings by users of objects, and uses them to iteratively recalculate user reputation and re-aggregate the ratings taking this into account.

I don't see anything particularly bad about the way these algorithms are written, but maybe they are imperfectly D-ish ... :-)

Best wishes,

    -- Joe
Apr 13 2010
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
I'm catching up with the posts.

Joseph Wakeling:

 As for iteration, I don't know to what degree D's foreach() across
 arrays compares to for() commands in C++ -- I guess it should be pretty
 close in performance, no?

As you have seen from my benchmarks, when with dmd you use:

foreach (x; arr)

and x is longer than a word, the loop is not as fast as it can be. When x is longer than a word it's often more efficient to use a ref, especially in the inner loop:

foreach (ref const(double) x; arr)

With the ldc compiler the difference between using or not using a ref is zero, or sometimes using ref slows down the foreach a little. I have not filed enhancement requests about this because I think it's useless.

Bye,
bearophile
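The two loop forms being compared, side by side (a sketch; the speed difference only matters for element types larger than a machine word, and varies by compiler, as noted above):

```d
import std.stdio: writefln;

void main() {
    auto arr = new double[1_000];
    arr[] = 1.5;

    double tot1 = 0.0;
    foreach (x; arr)                   // each element is copied into x
        tot1 += x;

    double tot2 = 0.0;
    foreach (ref const(double) x; arr) // x aliases the element, no copy
        tot2 += x;

    // Same result either way; only the generated code may differ.
    writefln("%s %s", tot1, tot2);
}
```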
Apr 14 2010
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 14 Apr 2010 17:19:53 -0400, Joseph Wakeling  
<joseph.wakeling webdrake.net> wrote:


 /////////////////////////////////////////////////////////////////////////
 version(Tango) {
     import tango.stdc.stdio: printf;
 } else {
     import std.stdio: printf;
 }

 void main()
 {
     double[] x;

     for(uint i=0;i<100;++i) {
         x.length = 0;

         for(uint j=0;j<5_000;++j) {
             for(uint k=0;k<1_000;++k) {
                 x ~= j*k;
             }
         }

         printf("At iteration %u, x has %u elements.\n",i,x.length);
     }
 }
 /////////////////////////////////////////////////////////////////////////

 I noticed that when compiled with LDC the memory usage stays constant,
 but in DMD the memory use blows up.  Is this a D1/D2 difference (with D2
 having the more advanced features that require preservation of memory)
 or is it a bug in one or the other of the compilers ... ?

It is because D2 is more advanced in preserving memory.  In D1, setting
the length to zero is equivalent to also assuming it can append safely.

One thing I forgot about: there is currently a design flaw in D2 array
appending which will keep around a certain number of arrays even after
all references have been removed (i.e. the arrays should be collectable,
but they are not collected by the GC).  I have a clear path on how to
fix that, but I haven't done it yet.  See this bugzilla report:

http://d.puremagic.com/issues/show_bug.cgi?id=3929

This might also help explain why the memory blows up.

-Steve
Apr 15 2010
prev sibling parent Joseph Wakeling <joseph.wakeling webdrake.net> writes:
bearophile wrote:
 Robert Clipsham Wrote:
 
 this way if/when 
 tango is ported to D2 it doesn't have to hack around this to allow it to 
 allow users to use ^^.

I hope Tango2 will be designed to be installed beside Phobos2, and not in place of it.

I thought that was the aim ... ? At least according to D's Wikipedia page: http://en.wikipedia.org/wiki/D_%28programming_language%29#Division_concerning_the_standard_library
Apr 15 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 10 Apr 2010 18:17:12 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 This can help confuse your mind a bit:

 import std.stdio: writeln;
 void main() {
     {
         int[] arr = [0, 1, 2, 3, 4, 5, 6].dup;
         int[] slice = arr[2 .. 4];
         writeln("arr, slice: ", arr, " | ", slice); // arr, slice: 0 1 2  
 3 4 5 6 | 2 3
         slice ~= 10; slice ~= 20;
         writeln("arr.capacity, slice.capacity: ", arr.capacity, " ",  
 slice.capacity); // arr.capacity, slice.capacity: 7 7
         writeln("arr, slice: ", arr, " | ", slice); // arr, slice: 0 1 2  
 3 4 5 6 | 2 3 10 20
     }

     {
         int[] arr = [0, 1, 2, 3, 4, 5, 6].dup;
         int[] slice = arr[2 .. 4];
         writeln("arr, slice: ", arr, " | ", slice); // arr, slice: 0 1 2  
 3 4 5 6 | 2 3

         slice.assumeSafeAppend;
         slice ~= 10; slice ~= 20; // causes stomping
         writeln("arr.capacity, slice.capacity: ", arr.capacity, " ",  
 slice.capacity); // arr.capacity, slice.capacity: 0 5
         writeln("arr, slice: ", arr, " | ", slice); // arr, slice: 0 1 2  
 3 10 20 6 | 2 3 10 20
         slice ~= 30; slice ~= 40;
         writeln("arr.capacity, slice.capacity: ", arr.capacity, " ",  
 slice.capacity); // arr.capacity, slice.capacity: 7 7
         writeln("arr, slice: ", arr, " | ", slice); // arr, slice: 0 1 2  
 3 10 20 30 | 2 3 10 20 30 40
     }
 }


 The  slice.capacity = 7 in the first case is just a coincidence, it's  
 the result of the overallocation.

Yes, to allocate an array of 4 integers, you need 16 bytes, but there needs to be a padding byte to prevent cross-block pointer problems, so it uses a 32-byte block. The 32-byte block also needs a padding byte, so really you can only put in 7 elements, not 8.
 But I don't know why arr.capacity is zero and then seven in the second  
 and third case.

0 means that if you append to the array, it will reallocate.  It will
also return 0 for stack-allocated arrays.  This makes sense since the
slice has taken over the "allocated" length of the block.

Essentially, capacity indicates how many elements can be appended.  The
function gives up and returns 0 if it determines the array does not end
at the allocated part of a block.  Technically, I could return the
length of the array, but I'm not sure whether that is as useful.

For fun, add one more element to slice, and arr will then have a valid
capacity :)

FYI, using arr after assuming safe append on the slice is undefined.

-Steve
Apr 10 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 10 Apr 2010 17:19:17 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 Few dynamic array benchmarks that can show you more differences between  
 C++ and D2.

 Timings dmd, N=5_000_000, NLOOPS=15, T=double, seconds:
   C++ v1: 0.67
   C++ v1: 0.72
   D v1:   6.47 (uses >1 GB RAM)
   D v2:   0.76
   D v3:   5.25

 dmd v2.043, dmd -O -release -inline
 g++ 4.4.1, -O3 -Wall -s

That is to be expected.  The append function is not the most efficient
thing.  It still must do a lot of work just to look up the valid length
and verify appending can be done, whereas the v2 "append" is a single
instruction!

The append function is good when you don't know what the length of an
array is, and you are appending a small number of elements, or another
array.  I don't advise using it for one-at-a-time appending if you know
you have a large number of elements to append.  A better method is to
set the length, and then write the data.

-Steve
Apr 10 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 11 Apr 2010 12:50:11 -0400, Joseph Wakeling  
<joseph.wakeling gmail.com> wrote:

 I was very happy to see that D _does_ have a 'reserve' function for  
 arrays,
 which I had been missing compared to C++ (it's not mentioned in the array
 docs).

It's new. It probably should be mentioned in the spec, but it's a druntime thing.
 Still, I don't think that pre-reserving the memory per se is the  
 influencing
 factor on the differences in performance.

No, and like bearophile said, the D array is a compromise between performance and flexibility. There are amazing ways to use D arrays that you could never do with C++ vectors.
 Note that in C++ the memory is not preassigned either.  The difference
 between the performance of these pieces of code is striking -- on my
 machine the D example takes about 70--75 seconds to run, whereas the
 C++ example speeds through it in 10s or less.

The C++ example is reallocating memory, freeing memory it is no longer
using.  It also manually handles the memory management, allocating
larger and larger arrays in some algorithmically determined fashion (for
example, multiplying the length by some constant factor).  This gives it
an edge in performance because it does not have to do any costly lookup
to determine if it can append in place, plus the realloc of the memory
is probably cheaper than the GC realloc of D.

If you want to compare apples to apples (well, probably more like red
apples to green apples), you need to do these things in a struct for D.
I had thought the D appender class would do the trick, but as you stated
below, it's even slower.  This needs to be remedied.
 D also uses about 20% more memory than the C++ even though the C++ code
 declares a higher capacity for the vector (above 8 million) than D does
 for the array (a little over 5 million).

D does not assume you stopped caring about the memory being pointed to
when it had to realloc.  Therefore, it leaves it around in case you are
still using it.  You can also expect more overhead in the GC because it
tends to hang on to memory in case it wants to use it again, or because
it hasn't collected it yet.  This is true of most GC-based languages.
For example, you can do this in D:

int[] arr = new int[50];
int *arrelem = &arr[5];
for(int i = 0; i < 10000; i++)
   arr ~= i;
*arrelem = 12345; // valid.

You can't do the same thing with C++ vectors: when they reallocate, the
memory they used to own could be freed.  This invalidates all pointers
and iterators into the vector, but the language doesn't prevent you from
having such dangling pointers.  Using one of them can result in memory
corruption, one of the worst bugs.  D tries to eliminate as many memory
corruption problems as possible.
 I don't know if it was naive to assume that D's dynamic arrays would be
 equivalent to vectors in C++.  But it's clear that the D array appender
 ~= is much slower than the C++ vector push_back().

But is much safer, and supports safe slicing. Compromises.
 Using an Appender class and put() (from std.array) is even slower,
 despite the std.array docs recommending this over ~. :-(

This must be fixed, the appender should be blazingly fast at appending (almost as fast as C++), with the drawback that the overhead is higher.
 It's disappointing because I'm loving so much of D's syntax, but I can
 write far more efficient code (with less explicit memory management)
 in C++ ...

You haven't done much with it yet.  When you start discovering how much
D takes care of, you will be amazed :)  Array appending isn't the
fastest operation, but it is safe, and safety can be worth infinitely
more than speed sometimes.

The thing about D is it *can* be fast and unsafe, just as fast and
unsafe as C, but that's not the default.

-Steve
Apr 11 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 12 Apr 2010 12:03:38 -0400, Joseph Wakeling  
<joseph.wakeling gmail.com> wrote:

 I thought dev effort was now focusing back on GDC ... ? :-P

AFAIK, gdc hasn't been actively developed for a few years. ldc, on the other hand, has regular releases. I think ldc may be the future of D compilers, but I currently use dmd since I'm using D2.
 Steven Schveighoffer wrote:
 The C++ example is reallocating memory, freeing memory it is no longer
 using.  It also manually handles the memory management, allocating  
 larger
 and larger arrays in some algorithmically determined fashion (for  
 example,
 multiplying the length by some constant factor).  This gives it an edge  
 in
 performance because it does not have to do any costly lookup to  
 determine
 if it can append in place, plus the realloc of the memory probably is
 cheaper than the GC realloc of D.

Right. In fact you get precisely 24 allocs/deallocs, each doubling the memory reserve to give a total capacity of 2^23 -- and then that memory is there and can be used for the rest of the 100 iterations of the outer loop. The shock for me was finding that D wasn't treating the memory like this but was preserving each loop's memory (as you say, for good reason).

Yes, you get around this by preallocating.
 D does not assume you stopped caring about the memory being pointed to
 when it had to realloc. [...] You can't do the same thing with C++
 vectors, when they reallocate, the memory they used to own could be
 freed.  This invalidates all pointers and iterators into the vector,
 but the language doesn't prevent you from having such dangling pointers.

I have a vague memory of trying to do something exactly like your example when I was working with C++ for the first time, and getting bitten on the arse by exactly the problem you describe. I wish I could remember where. I know that I found another (and possibly better) solution to do what I wanted, but it would be nice to see if a D-ish solution would give me something good.

It's often these types of performance discrepancies that critics point
to (not that you are a critic), but it's the cost of having a more
comprehensive language.  Your appetite for the sheer performance of a
language will sour once you get bit by a few of these nasty bugs.  But D
fosters a completely different way of thinking about solving problems.

One problem with C++'s vector is that it is a value type -- you must
pass a reference in order to avoid copying an entire vector.  However,
D's arrays are a hybrid between reference and value type.  Often, once
you set data in a vector/array, you never change it again.  D allows
ways to enforce this (i.e. immutable) and also allows you to pass around
"slices" of your array with zero overhead (no copying).  It results in
some extremely high-performance code, which wouldn't be easy, or maybe
even possible, with C++.

Take for instance a split function.  In C++, I'd expect split(string x)
to return a vector<string>.  However, vector<string> makes a copy of
each part of the string it has split out.  D, however, can return
references to the original data (slices), which consume no overhead.
The only extra space allocated is the array to hold the string
references.  All this is also completely safe!  You could then even
modify the original string (assuming you were not using immutable
strings) in place!  Or append to any one of the strings in the array
safely.
 This must be fixed, the appender should be blazingly fast at appending
 (almost as fast as C++), with the drawback that the overhead is higher.

Overhead = memory cost? I'm not so bothered as long as the memory stays within constant, predictable bounds. It was the memory explosion that scared me. And I suspect I'd pay a small performance cost (though it would have to be small) for the kind of safety and flexibility the arrays have.

Overhead = bigger initialization cost, bigger memory footprint.  It's
not important if you are building a large array (which is what appender
should be for), but the cost would add up if you had lots of little
appenders that you didn't append much to.

The point is, the builtin array optimizes performance for operations
besides append, but allows appending as a convenience.  Appender should
optimize appending, sacrificing performance in other areas.  It all
depends on your particular application whether you should use appender
or builtin arrays (or something entirely different/custom).
 You haven't done much with it yet.  When you start discovering how much  
 D
 takes care of, you will be amazed :)

I know. :-) My needs are in some ways quite narrow -- numerical simulations in interdisciplinary physics -- hence the C background, and hence the premium on performance. They're also not very big programs -- simple enough for me to generally keep a personal overview on the memory management, even though with C++ that's usually all taken care of automatically (no new or delete statements if I can avoid it).

There are many in the community that use D for numerical stuff. It's definitely not as mature as it could be, but getting better. Don is adding a lot of cool stuff to it, including a builtin exponent operator and arbitrary precision numbers.
 The thing about D is it *can* be fast and unsafe, just as fast and  
 unsafe
 as C, but that's not the default.

That's apparent -- I mean, given that D wraps the whole C standard library, I could basically write C code in D if I wanted, no?

Yes, but that's not what I meant ;) I mean, you can write your own types, like the Appender (or what the appender *should* be) that optimize the behavior of code to meet any needs. And it can do it with a much better syntax than C. D's template system and ability to make user-types seem like builtins I think is unparalleled in C-like languages. -Steve
Apr 12 2010
prev sibling next sibling parent Brad Roberts <braddr puremagic.com> writes:
On Mon, 12 Apr 2010, Joseph Wakeling wrote:

 Curious question: how come with a registered email address my messages
 have to be moderated, whereas via the http interface I can post straight
 away? :-P
 
 Best wishes,
 
     -- Joe

All new subscribers to the lists start off moderated until a post flows through that isn't obvious spam. As soon as I see that, I remove the moderated flag. The list is a much bigger spam target than the news server is. I deflect dozens of spam messages a day, some of which are clever enough to register for the list first (not daily, but probably monthly). Later, Brad
Apr 12 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 12 Apr 2010 15:36:19 -0400, Joseph Wakeling  
<joseph.wakeling webdrake.net> wrote:

 Joseph Wakeling wrote:
 (Actually I'm still having some issues with this, despite using
 assumeSafeAppend, but more on that in a separate email.)

... solved; it's interesting to note that assumeSafeAppend has to be
used in _exactly_ the same scope as the append ~= itself.  e.g. with
this loop, the memory blows up:

////////////////////////////////////////////////////////////////
foreach(uint i;0..100) {
    x.length = 0;
    assumeSafeAppend(x);

    foreach(uint j;0..5000)
        foreach(uint k;0..1000)
            x ~= j*k;

    writefln("At iteration %u, x has %u elements.",i,x.length);
}
////////////////////////////////////////////////////////////////

This should work as you expect. Can you produce a full code example? Are you reserving space in x before-hand?
 ... while with this one it's OK:

     ////////////////////////////////////////////////////////////////
     foreach(uint i;0..100) {
         x.length = 0;

         foreach(uint j;0..5000) {
             foreach(uint k;0..1000) {
                 assumeSafeAppend(x);
                 x ~= j*k;
             }
         }

         writefln("At iteration %u, x has %u elements.",i,x.length);
     }
     ////////////////////////////////////////////////////////////////

The first assumeSafeAppend is the only one that matters. Subsequent ones are wasted cycles. I'm unsure why this one works and the other doesn't. Again, I need full compilable examples. -Steve
Apr 13 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 13 Apr 2010 08:24:57 -0400, Joseph Wakeling  
<joseph.wakeling webdrake.net> wrote:

 Steven Schveighoffer wrote:
 This should work as you expect.  Can you produce a full code example?
 Are you reserving space in x before-hand?

 Weird; now it works.  I wonder if it's because I upgraded to 2.043
 ... ?  I understand there was a fix in there for array append.

2.043 has fixed a corruption bug, one which you are unlikely to have triggered with a simple program like you wrote. However, 2.041 had a severe corruption bug, which any array could fall victim to. If you upgraded from 2.041, it's quite possible that fixed the problem. If you just upgraded from 2.042, I think this may be a case of stale object/binary files ;)
 On the other hand, if the assumeSafeAppend takes place outside the loops
 entirely, blow-up occurs -- because the command is not issued after the
 array resize to zero?  I've attached examples.

Yes, assumeSafeAppend must be called every time you want to overwrite
memory.  The runtime cannot tell whether you still have references to
the memory that was in use which you don't want to be changed by
appending.  So it makes a conservative decision: "hey, this memory was
in use, I'm not sure if it's in use anymore, so I'll reallocate rather
than stomp on it."  assumeSafeAppend tells the runtime "yes, it was in
use, but I'm no longer using it, so it is safe to stomp on it."

However, you only have to call it once per loop, because a single
assumeSafeAppend assumes the remainder of the memory block after the
array you pass as an argument is ok to stomp on.

-Steve
Apr 13 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 13 Apr 2010 08:30:57 -0400, Joseph Wakeling  
<joseph.wakeling webdrake.net> wrote:

 Joseph Wakeling wrote:
 On the other hand, if the assumeSafeAppend takes place outside the loops
 entirely, blow-up occurs -- because the command is not issued after the
 array resize to zero?  I've attached examples.

If I include a line x.reserve(5_000_000); before the assumeSafeAppend, the memory does not explode indefinitely -- but it uses about 3x as much memory as the 'NoLeak' code even when the latter lacks advance reservation of memory.

x.reserve(5_000_000) will reserve the memory for the first loop.  But
the second time through the loop will start reallocating from scratch
unless you tell the runtime it's safe to reuse that memory.  Most likely
it uses less memory than the NoLeak code because it has a nice
contiguous 40MB region to use in subsequent loops.  Because of the size
of the array, it will naturally gravitate towards that region, and once
it finds it, it will happily grow in place for a while.

assumeSafeAppend is needed to avoid reallocating the memory when setting
the length back to 0.  It is an optimization designed specifically for
this purpose.  My recommended approach for your little code sample is
this:

import std.array;
import std.stdio;

void main()
{
    double[] x;
    x.reserve(5_000_000); // heads up, runtime, I'm about to use 40MB of contiguous space

    foreach(uint i;0..100) {
        x.length = 0;
        assumeSafeAppend(x); // you can reuse the memory I just got done filling, I don't need it anymore

        foreach(uint j;0..5_000) {
            foreach(uint k;0..1_000) {
                x ~= j*k;
            }
        }

        writefln("At iteration %u, x has %u elements.",i,x.length);
    }
}

If you don't use reserve, here is what happens during the loop:

1st append: The smallest block possible to hold 1 double is found and
allocated.
2nd append: If the block holding the first double can hold 2, the
"allocated" size of the block is increased to 2 doubles.  Otherwise, the
smallest block possible to hold 2 doubles is found and allocated.
...

Which means the first loop will consume extra memory on its way to
building the full array.  With reserve, the array is preallocated, so
every append is able to simply update the allocated length instead of
reallocating, increasing performance and saving memory.

Note that every time a reallocation occurs because the block isn't big
enough, the old block is left behind until the GC cleans it up.  But
even when the GC cleans it up, it doesn't necessarily release the memory
back to the OS, so your program may still be consuming that memory.
-Steve
Apr 13 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 13 Apr 2010 08:56:03 -0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:


 If you don't use reserve, here is what happens during the loop:

 1st append: The smallest block possible to hold 1 double is found and  
 allocated
 2nd append: If the block holding the first double can hold 2, the  
 "allocated" size of the block is increased to 2 doubles.  Otherwise, the  
 smallest block possible to hold 2 doubles is found and allocated.
 ...

Assuming you may ask questions about this, it's not an exact description of what happens. The append code and GC are smarter about it than this, I just wanted to explain it in an easy-to-understand way :) The real code uses algorithms to determine optimal grow sizes and can extend into consecutive blocks if possible. -Steve
Apr 13 2010
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 13 Apr 2010 11:15:15 -0400, Joseph Wakeling  
<joseph.wakeling webdrake.net> wrote:

 As for iteration, I don't know to what degree D's foreach() across
 arrays compares to for() commands in C++ -- I guess it should be pretty
 close in performance, no?

D special cases foreach on arrays to basically be a standard array for
loop, i.e.:

foreach(x; arr)
{
   ...
}

translates to something like:

auto __len = arr.length;
for(uint __i = 0; __i < __len; ++__i)
{
   auto x = arr[__i];
   ...
}

Although it may be even further optimized, I'm not sure.

-Steve
Apr 13 2010
prev sibling next sibling parent Joseph Wakeling <joseph.wakeling webdrake.net> writes:
Steven Schveighoffer wrote:
 On Mon, 12 Apr 2010 12:03:38 -0400, Joseph Wakeling
 <joseph.wakeling gmail.com> wrote:
 
 I thought dev effort was now focusing back on GDC ... ? :-P

AFAIK, gdc hasn't been actively developed for a few years. ldc, on the other hand, has regular releases. I think ldc may be the future of D compilers, but I currently use dmd since I'm using D2.

http://bitbucket.org/goshawk/gdc/wiki/Home :-) Either way I'm happy. I don't have any issues with dmd, but I do want to see a properly free D compiler that can be prepackaged in all the Linux distros.
 Yes, you get around this by preallocating.

Sure. But that wasn't what shocked me -- what I was amazed by was setting up a situation where I _had_ preallocated the memory and still seeing the memory usage explode, because D was preserving the memory from each round of the loop. (Actually I'm still having some issues with this, despite using assumeSafeAppend, but more on that in a separate email.)
 It's often these types of performance discrepancies that critics point
 to (not that you are a critic), but it's the cost of having a more
 comprehensive language.  Your appetite for the sheer performance of a
 language will sour once you get bit by a few of these nasty bugs.

For sure.
 But D fosters a completely different way of thinking about solving
 problems.

I can see how the example you give would be fantastically useful. More generally, I think this is the point -- I need to adjust my head to writing D-ish code, just as when moving from C to C++ I needed to switch to various new ways of doing things.
 There are many in the community that use D for numerical stuff.  It's
 definitely not as mature as it could be, but getting better.  Don is
 adding a lot of cool stuff to it, including a builtin exponent operator
 and arbitrary precision numbers.

I guessed there would be -- I knew for example that there was someone out there working on a fairly major mathematical/numerical library for D, but it's a while since I checked that out. So take my earlier comment about numerical work to refer only to doing things the way I'm used to ... ;-)
 Yes, but that's not what I meant ;)  I mean, you can write your own
 types, like the Appender (or what the appender *should* be) that
 optimize the behavior of code to meet any needs.  And it can do it with
 a much better syntax than C.  D's template system and ability to make
 user-types seem like builtins I think is unparalleled in C-like languages.

Hence much pleasure and excitement in learning D ... :-) Don wrote:
 There are quite a lot of us here with exactly that kind of background.
 
 Something about the array issue -- D dynamic arrays are heavily geared towards
algorithms which perform an initial allocation and afterwards avoid memory
allocation entirely.
 In D, such slicing algorithms are extremely clean, extremely fast, and memory
safe.
 In C++, it's much more difficult to write code in that manner. A
straightforward translation from C++ will generally miss the benefits of D
arrays, and you'll end up with slower code.

Exactly my current situation. :-P
 A kind of "Zen of D" is to use array slices as much as possible.

I will look into this more and see if this approach can help with some
of my code -- are there existing projects I could take a look at to get
some examples?

Thanks & best wishes,

    -- Joe
Apr 12 2010
prev sibling next sibling parent Joseph Wakeling <joseph.wakeling webdrake.net> writes:
Joseph Wakeling wrote:
 (Actually I'm still having some issues with this, despite using
 assumeSafeAppend, but more on that in a separate email.)

... solved; it's interesting to note that assumeSafeAppend has to be
used in _exactly_ the same scope as the append ~= itself.  e.g. with
this loop, the memory blows up:

////////////////////////////////////////////////////////////////
foreach(uint i;0..100) {
    x.length = 0;
    assumeSafeAppend(x);

    foreach(uint j;0..5000)
        foreach(uint k;0..1000)
            x ~= j*k;

    writefln("At iteration %u, x has %u elements.",i,x.length);
}
////////////////////////////////////////////////////////////////

... while with this one it's OK:

////////////////////////////////////////////////////////////////
foreach(uint i;0..100) {
    x.length = 0;

    foreach(uint j;0..5000) {
        foreach(uint k;0..1000) {
            assumeSafeAppend(x);
            x ~= j*k;
        }
    }

    writefln("At iteration %u, x has %u elements.",i,x.length);
}
////////////////////////////////////////////////////////////////

Curious question: how come with a registered email address my messages
have to be moderated, whereas via the http interface I can post straight
away? :-P

Best wishes,

    -- Joe
Apr 12 2010
prev sibling next sibling parent Joseph Wakeling <joseph.wakeling webdrake.net> writes:

Steven Schveighoffer wrote:
 This should work as you expect.  Can you produce a full code example? 
 Are you reserving space in x before-hand?

Weird; now it works.  I wonder if it's because I upgraded to 2.043 ... ?
I understand there was a fix in there for array append.

On the other hand, if the assumeSafeAppend takes place outside the loops
entirely, blow-up occurs -- because the command is not issued after the
array resize to zero?  I've attached examples.

Thanks & best wishes,

    -- Joe
Apr 13 2010
prev sibling next sibling parent Joseph Wakeling <joseph.wakeling webdrake.net> writes:
Steven Schveighoffer wrote:
 Assuming you may ask questions about this, it's not an exact description
 of what happens.  The append code and GC are smarter about it than this,
 I just wanted to explain it in an easy-to-understand way :)  The real
 code uses algorithms to determine optimal grow sizes and can extend into
 consecutive blocks if possible.

I realised that while running some of the code that bearophile suggested
to me, because if you look at 'capacity' in a non-pre-reserved D array
that's appended to 5 million times, its capacity is only a little over 5
million -- compared to a C++ vector, whose capacity is the nearest power
of 2 (2^23, over 8 million).  The GC overhead means that in this case
actual memory usage is a bit higher than C++, but on a larger scale this
could surely make a big difference.

In the case of the piece of code I'm working on I don't think the
pre-reserve is really so important as far as performance goes -- the
append speed is a bit of a bugger, but the main issue is probably to do
with the speed of copying and iterating across arrays.  For example, I
suspect that the D arrays'

    x[] = 0;
    y[] = z[];

is not as efficient as a C++ vector's

    x.assign(x.size(), 0);
    y.assign(z.begin(), z.end());

I think I can find some ways round this by taking advantage of some of
the D arrays' cleverness, limiting the copying of values and instead
just swapping around which memory each array is pointing to.

As for iteration, I don't know to what degree D's foreach() across
arrays compares to for() commands in C++ -- I guess it should be pretty
close in performance, no?

Best wishes,

    -- Joe
Apr 13 2010
prev sibling next sibling parent Joseph Wakeling <joseph.wakeling webdrake.net> writes:
Hi Bearophile,

Thanks ever so much for the amount of effort you put into this -- it is
extremely kind and generous. :-)

I'm not seeing the significant gains that you see with DMD, but for me
too LDC doubles the speed.

My guess for the difference is that you ran less than the full 100
iterations of the main for loop.  As you can see in the code the Yzlm
class launches an iterative procedure which takes more or less
iterations to converge depending on initial conditions.

This _normally_ is relatively few, but can in some circumstances be
large (several hundred).

An effect of changing the random number generator is that it changes the
initial conditions presented to the algorithm.  For example, in the
first 10 runs of the loop, the original code performs 404 iterations
overall and your new code performs only 348.

Combine that with other savings like the removal of the appender and the
faster random number generation and you get a fairly sizeable saving --
about a 25-30% cut in running time when compiling with dmd.  But those
savings become less of an influence on running time as you increase the
number of simulations.

The _real_ measure of speed is thus not overall program running time but
running time relative to the total number of iterations.  The good news
is that while the LDC-compiled version doesn't beat g++ it comes pretty
close, with about 32 iterations per second for the D code compared to 26
for the C++. :-)

The C++ code can probably be optimised a bit further but not too much;
the algorithms in the iterative process are pretty much identical
between the two versions.  (The C++ doesn't have the 'clever' memory
swap-around that I put in the D code, but that doesn't make any
difference to D's performance anyway...)  So I think it's probably just
compiler difference that's to blame for speed differences -- which is
fine.  After all, D in general and DMD in particular is in development.

I _am_ surprised that in your profiling the random number generator took
up so much time, as if you cut out the lines,

        aa(ratings, reputationUser, reputationObject);
        yzlm(ratings, reputationUser, reputationObject);

... and recompile, the program screams through in no time at all --
which would lead me to think that it's the yzlm opCall and the functions
it calls that take up the majority of time.  Or is there some lazy
evaluation going on here ... ?

I don't think it's the algorithm in any case (although it might be
Phobos' implementation) as working with Tango and LDC, changing between
the Twister or Kiss random number generators doesn't make for any
meaningful performance difference.

One last thing.  While comparing DMD and LDC I noticed something odd.
Take this code, which is an LDC and DMD-compatible version of the 'array
leak' code I posted originally:

/////////////////////////////////////////////////////////////////////////
version(Tango) {
    import tango.stdc.stdio: printf;
} else {
    import std.stdio: printf;
}

void main()
{
    double[] x;

    for(uint i=0;i<100;++i) {
        x.length = 0;

        for(uint j=0;j<5_000;++j) {
            for(uint k=0;k<1_000;++k) {
                x ~= j*k;
            }
        }

        printf("At iteration %u, x has %u elements.\n",i,x.length);
    }
}
/////////////////////////////////////////////////////////////////////////

I noticed that when compiled with LDC the memory usage stays constant,
but in DMD the memory use blows up.  Is this a D1/D2 difference (with D2
having the more advanced features that require preservation of memory)
or is it a bug in one or the other of the compilers ... ?

Thanks & best wishes,

    -- Joe
Apr 14 2010
prev sibling next sibling parent Joseph Wakeling <joseph.wakeling webdrake.net> writes:

bearophile wrote:
 You are right, sorry.

No need to apologise! You helped me significantly improve my code, helped me understand D a lot better and left me feeling generally very positive about developing further in D. I'd call that a result. :-)
 So I think it's probably just compiler difference that's to blame for speed
differences<

This can be true, but such differences are not magic: they can be found, and they can eventually become precise enhancement requests for the LLVM developers, who are open to such requests.

I hope my questions here have been useful in this respect ...
 After all, D in general and DMD in particular is in development.<

DMD has quite an old back-end that is not in active development. Maybe it will become 64 bit.

The DMD backend seems a generally problematic piece of code, given the proprietary restrictions that apply to it.  I have the impression that it's being used simply because it's the backend that is known and tested, and that over the longer term there may be a switch ... ?

The one compiler-related concern I have is whether there will be an effective free-as-in-freedom compiler for D2 in the near future.  Work is going on with GDC, but it's moving slowly -- and I don't see a clear LDC roadmap for D2 support.
 In this program I have seen that a good percentage of the speed difference
between ldc/dmd is caused by foreach loops. When you iterate over structs
longer than size_t use ref (or in D2 better ref const(...)).
 foreach (Rating r; ratings)
 ==>
 foreach (ref const(Rating) r; ratings)
 
 There is nothing algorithmically strange going on here, but to be sure you can
take a look at the produced asm.

Interesting ... ! I'm afraid the asm is over my head, but I will read up and try to understand it.
 To find why the C++ code is faster you can show me the equivalent C++
 code that I can compile myself, or you can compile your C++ code and
 show me the asm of the two functions that are hot spots.

Sure -- attached.  The key functions are the ones in tref.cpp (the smalltref class) and trefsim.cpp, plus the treftest.cpp file, which runs a roughly-equivalent simulation to test.d.

It's a little embarrassing to share because it's fairly badly-designed code -- my first serious C++ attempt -- which served its purpose some time ago and was done with.  It started to become vaguely relevant again, so I thought I'd try to redo it with a better design, and rather than do it in C++ (which I could do much better by now), this seemed like a nice excuse to build my first D project ... :-)

I already made a couple of tweaks based on your improvements to the D code, but I think g++ probably makes such optimisations during compilation anyway.  I also realised that I wasn't comparing like with like -- the code was using the RanLux random number generator, and switching to mt19937 shaves quite a bit of time off.  So C++ still has a bit of an edge right now.

There's surely a lot of other stuff that could be done to improve this C++, but please don't put lots of effort in unless it helps with improving D (the language and/or frontend, not my code, which is not so important :-).

Thanks & best wishes,

    -- Joe
Apr 16 2010
prev sibling parent Joseph Wakeling <joseph.wakeling webdrake.net> writes:

bearophile wrote:
 This is the middle part (the two loops) of Yzlm.objectReputationUpdate
compiled with ldc; this is one of the two main hot spots of the program. You
can compare this asm with the asm of the same part from the C++ version:

To find why the C++ code is faster you can show me the equivalent C++ code that I can compile myself, or you can compile your C++ code and show me the asm of the two functions that are hot spots.

I took a little time last night and wrote up a C++ version that is much closer in design to the D code (as well as being smaller, simpler, and less hell to read...).  It has libboost as a dependency, for random number generation and for the BOOST_FOREACH macro.

Following your observation with D, it looks like the C++ also benefits from using references in the FOREACH loops for elements larger than size_t -- this shaves about 20s off the running time on my machine.  Either way, like its uglier predecessor, it's quite a bit faster than DMD or LDC.

Unfortunately I can't get llvm-g++ running on my machine, or I would compare performance there too.

Hope this makes an interesting point of comparison. :-)

Best wishes,

    -- Joe
Apr 17 2010