
digitalmars.D.learn - D outperformed by C++, what am I doing wrong?

reply amfvcg <hid d.en> writes:
Hi all,
I'm solving below task:

given container T and value R return sum of R-ranges over T. An 
example:
input : T=[1,1,1] R=2
output : [2, 1]

input : T=[1,2,3] R=1
output : [1,2,3]
(see dlang unittests for more examples)


The C++ code below, compiled with g++ 5.4.0 -O2 -std=c++14, runs on my 
machine in 656,836 µs (~0.66 s).
The D code below, compiled with dmd v2.067.1 -O, runs on my machine in 
~14.5 s.

Each language has its own "way of programming", and as I'm a 
beginner in D, I'm probably running through bushes instead of 
taking the highway. Therefore I'd like to ask you, experienced 
D devs, to shed some light on "how to do it the D way".


C++ code:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <iterator>
#include <list>
#include <map>
#include <string>
#include <utility>
#include <numeric>
#include <vector>


template<typename T, typename K>
std::vector<K> sum_elements(const T& beg, const T& end, std::size_t k, K def)
{
    if (k == 0) {
        return std::vector<K>{};
    }
    return sum_elements(beg, end, k, def, [](auto &l, auto &r){ return r+l; });
}

template<typename T, typename K, class BinaryOp>
std::vector<K> sum_elements(
        const T& beg,
        const T& end,
        std::size_t k,
        K def,
        BinaryOp op)
{
    std::vector<K> out;
    out.reserve((std::distance(beg, end) - 1)/k + 1);
    for (auto it = beg; it != end;
         std::advance(it, std::min(static_cast<std::size_t>(std::distance(it, end)), k)))
    {
        out.push_back(std::accumulate(
            it,
            std::next(it, std::min(static_cast<std::size_t>(std::distance(it, end)), k)),
            def, op));
    }
    return out;
}

int main()
{
    std::vector<int> vec;
    auto size = 1000000;
    vec.reserve(size);
    for (int i = 0; i < size; ++i)
        vec.push_back(i);
    auto beg = std::chrono::system_clock::now();
    auto sum = 0;
    for (int i = 0; i < 100; i++)
        sum += sum_elements(vec.begin(), vec.end(), 2, 0).size();
    auto end = std::chrono::system_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - beg).count()
              << std::endl;
    std::cout << sum << std::endl;

    return sum;
}


D code:

import std.stdio : writeln;
import std.algorithm.comparison: min;
import std.algorithm.iteration: sum;
import core.time: MonoTime, Duration;


T[] sum_subranges(T)(T[] input, uint range)
{
     T[] result;
     if (range == 0)
     {
         return result;
     }
     for (uint i; i < input.length; i=min(i+range, input.length))
     {
         result ~= sum(input[i..min(i+range, input.length)]);
     }
     return result;
}

unittest
{
     assert(sum_subranges([1,1,1], 2) == [2, 1]);
     assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
     assert(sum_subranges([], 2) == []);
     assert(sum_subranges([1], 2) == [1]);
     assert(sum_subranges([1], 0) == []);
}


int main()
{
     int[1000000] v;
     for (int i=0; i < 1000000; ++i)
         v[i] = i;
     int sum;
     MonoTime beg = MonoTime.currTime;
     for (int i=0; i < 100; i++)
         sum += cast(int)sum_subranges(v,2).length;
     MonoTime end = MonoTime.currTime;
     writeln(end-beg);
     writeln(sum);
     return sum;
}
Aug 12
next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 13/08/2017 7:09 AM, amfvcg wrote:
 [...]
Dmd compiles quickly, but doesn't create all that optimized code. Try ldc or gdc and get back to us about it ;)
Aug 12
prev sibling next sibling parent Roman Hargrave <roman hargrave.info> writes:
On Sunday, 13 August 2017 at 06:09:39 UTC, amfvcg wrote:
 [...]
If I had to guess, this could be due to backend and optimizer. I don't want to go in to detail on my thoughts because I am not an expert on codegen optimization, but I might suggest running your test compiled with GDC (using identical optimization settings as G++) and ldc2 with similar settings.
Aug 12
prev sibling next sibling parent reply Neia Neutuladh <neia ikeran.org> writes:
On Sunday, 13 August 2017 at 06:09:39 UTC, amfvcg wrote:
 Hi all,
 I'm solving below task:
Well, for one thing, you are preallocating in C++ code but not in D. On my machine, your version of the code completes in 3.175 seconds. Changing it a little reduces it to 0.420s:

    T[] result = new T[input.length];
    size_t o = 0;
    for (uint i; i < input.length; i=min(i+range, input.length))
    {
        result[o] = sum(input[i..min(i+range, input.length)]);
        o++;
    }
    return result[0..o];

You can also use Appender from std.array.
Aug 13
next sibling parent Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
this works ok for me with the ldc compiler, gdc does not work on my arch
machine so I can not do a comparison to your c++ version (clang does not
work with your c++ code)

import std.stdio : writeln;
import std.algorithm.comparison: min;
import std.algorithm.iteration: sum;
import core.time: MonoTime, Duration;


T[] sum_subranges(T)(T[] input, uint range)
{
    import std.array : appender;
    auto app = appender!(T[])();
    if (range == 0)
    {
        return app.data;
    }
    for (uint i; i < input.length; i=min(i+range, input.length))
    {
        app.put(sum(input[i..min(i+range, input.length)]));
    }
    return app.data;
}

unittest
{
    assert(sum_subranges([1,1,1], 2) == [2, 1]);
    assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
    assert(sum_subranges([], 2) == []);
    assert(sum_subranges([1], 2) == [1]);
    assert(sum_subranges([1], 0) == []);
}


int main()
{
    import std.range : iota, array;
    auto v = iota(0,1000000).array;
    int sum;
    MonoTime beg = MonoTime.currTime;
    for (int i=0; i < 100; i++)
        sum += cast(int)sum_subranges(v,2).length;
    MonoTime end = MonoTime.currTime;
    writeln(end-beg);
    writeln(sum);
    return sum;
}

On Sun, Aug 13, 2017 at 9:03 AM, Neia Neutuladh via Digitalmars-d-learn <
digitalmars-d-learn puremagic.com> wrote:

 [...]
Aug 13
prev sibling parent reply Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Here is more D idiomatic way:

import std.stdio : writeln;
import std.algorithm.comparison: min;
import std.algorithm.iteration: sum;
import core.time: MonoTime, Duration;


auto sum_subranges(T)(T input, uint range)
{
    import std.array : array;
    import std.range : chunks, ElementType;
    import std.algorithm : map;

    if (range == 0)
    {
        return ElementType!(T)[].init;
    }
    return input.chunks(range).map!(sum).array;
}

unittest
{
    assert(sum_subranges([1,1,1], 2) == [2, 1]);
    assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
    assert(sum_subranges([], 2) == []);
    assert(sum_subranges([1], 2) == [1]);
    assert(sum_subranges([1], 0) == []);
}


int main()
{
    import std.range : iota, array;
    auto v = iota(0,1000000);
    int sum;
    MonoTime beg = MonoTime.currTime;
    for (int i=0; i < 100; i++)
        sum += cast(int)sum_subranges(v,2).length;
    MonoTime end = MonoTime.currTime;
    writeln(end-beg);
    writeln(sum);
    return sum;
}

On Sun, Aug 13, 2017 at 9:13 AM, Daniel Kozak <kozzi11 gmail.com> wrote:

 [...]
Aug 13
parent reply amfvcg <hid d.en> writes:
On Sunday, 13 August 2017 at 07:30:32 UTC, Daniel Kozak wrote:
 [...]
Thank you all for the replies. Good to know the community is alive in D :)

Let's settle the playground:
D  : http://ideone.com/h4fnsD
C++: http://ideone.com/X1pyXG
Both using GCC under the hood.

C++ in 112 ms; D in:
- 2.5 sec with the original source;
- 2.5 sec with Daniel's 1st version;
- 5 sec timeout exceeded with Daniel's 2nd version;
- 1.8 sec with Neia-like preallocation.

So it's still not that neat. (What's interesting: the C++ code generates 2 KLOC of assembly, and D with ldc 12 KLOC - checked at godbolt.)

P.S. For the C++ version to work under clang, the function which takes (BinaryOp) must go before the other one (my bad).
Aug 13
next sibling parent reply Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
my second version on ldc takes 380ms and the c++ version on the same
compiler (clang) takes 350ms, so it seems to be almost the same

On Sun, Aug 13, 2017 at 9:51 AM, amfvcg via Digitalmars-d-learn <
digitalmars-d-learn puremagic.com> wrote:

 [...]
Aug 13
parent reply amfvcg <hid d.en> writes:
On Sunday, 13 August 2017 at 08:00:53 UTC, Daniel Kozak wrote:
 my second version on ldc takes 380ms and c++ version on same 
 compiler (clang), takes 350ms, so it seems to be almost same
Ok, on ideone (ldc 1.1.0) it timeouts, on dpaste (ldc 0.12.0) it gets killed. What version are you using? Either way, if that'd be the case - that's slick. (and ldc would be the compiler of choice for real use cases).
Aug 13
next sibling parent reply ikod <geller.garry gmail.com> writes:
On Sunday, 13 August 2017 at 08:13:56 UTC, amfvcg wrote:
 On Sunday, 13 August 2017 at 08:00:53 UTC, Daniel Kozak wrote:
 my second version on ldc takes 380ms and c++ version on same 
 compiler (clang), takes 350ms, so it seems to be almost same
 [...]
import std.stdio : writeln;
import std.algorithm.comparison: min;
import std.algorithm.iteration: sum;
import core.time: MonoTime, Duration;
import std.range;
import std.algorithm;

auto s1(T)(T input, uint r)
{
    return input.chunks(r).map!sum;
}

T[] sum_subranges(T)(T[] input, uint range)
{
    T[] result;
    if (range == 0)
    {
        return result;
    }
    for (uint i; i < input.length; i=min(i+range, input.length))
    {
        result ~= sum(input[i..min(i+range, input.length)]);
    }
    return result;
}

unittest
{
    assert(sum_subranges([1,1,1], 2) == [2, 1]);
    assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
    assert(sum_subranges([], 2) == []);
    assert(sum_subranges([1], 2) == [1]);
    assert(sum_subranges([1], 0) == []);
    assert(s1([1,1,1], 2).array == [2, 1]);
    assert(s1([1,1,1,2,3,3], 2).array == [2, 3, 6]);
}

int main()
{
    int sum;
    MonoTime beg0 = MonoTime.currTime;
    for (int i=0; i < 100; i++)
        sum += s1(iota(1000000),2).length;
    MonoTime end0 = MonoTime.currTime;
    writeln(end0-beg0);
    writeln(sum);

    sum = 0;
    int[1000000] v;
    for (int i=0; i < 1000000; ++i)
        v[i] = i;
    MonoTime beg = MonoTime.currTime;
    for (int i=0; i < 100; i++)
        sum += cast(int)sum_subranges(v,2).length;
    MonoTime end = MonoTime.currTime;
    writeln(end-beg);
    writeln(sum);
    return sum;
}

Gives me

5 μs and 2 hnsecs
50000000
3 secs, 228 ms, 837 μs, and 4 hnsecs
50000000
Aug 13
parent reply amfvcg <hid d.en> writes:
 Gives me

 5 μs and 2 hnsecs
 50000000
 3 secs, 228 ms, 837 μs, and 4 hnsecs
 50000000
And you've compiled it with? Btw. clang for c++ version works worse than gcc (for this case [112ms vs 180ms]).
Aug 13
parent ikod <geller.garry gmail.com> writes:
On Sunday, 13 August 2017 at 08:32:50 UTC, amfvcg wrote:
 [...]
DMD64 D Compiler v2.074.1
Aug 13
prev sibling parent reply Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:
On Sunday, 13 August 2017 at 08:13:56 UTC, amfvcg wrote:
 On Sunday, 13 August 2017 at 08:00:53 UTC, Daniel Kozak wrote:
 my second version on ldc takes 380ms and c++ version on same 
 compiler (clang), takes 350ms, so it seems to be almost same
 [...]
Here are my results:

$ uname -sri
Linux 4.10.0-28-generic x86_64

$ lscpu | grep 'Model name'
Model name: Intel(R) Core(TM) i7-3770K CPU 3.50GHz

$ ldc2 --version | head -n5
LDC - the LLVM D compiler (1.3.0):
  based on DMD v2.073.2 and LLVM 4.0.0
  built with LDC - the LLVM D compiler (1.3.0)
  Default target: x86_64-unknown-linux-gnu
  Host CPU: ivybridge

$ g++ --version | head -n1
g++ (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406

$ ldc2 -O3 --release sum_subranges.d
$ ./sum_subranges
378 ms, 556 μs, and 9 hnsecs
50000000

$ g++ -O5 sum_subranges.cpp -o sum_subranges
$ ./sum_subranges
237135
50000000
Aug 13
parent reply Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:
On Sunday, 13 August 2017 at 08:29:30 UTC, Petar Kirov 
[ZombineDev] wrote:
 [...]
With Daniel's latest version (
http://forum.dlang.org/post/mailman.5963.1502612885.31550.digitalmars-d
learn puremagic.com ):

$ ldc2 -O3 --release sum_subranges2.d
$ ./sum_subranges2
210 ms, 838 μs, and 8 hnsecs
50000000
Aug 13
parent reply amfvcg <hid d.en> writes:
On Sunday, 13 August 2017 at 08:33:53 UTC, Petar Kirov 
[ZombineDev] wrote:
 [...]
Great!!! And that's what I was hoping for. So the conclusion is: use the latest ldc that's out there.

Thank you Petar, thank you Daniel. (I cannot change the subject to SOLVED, can I?)

Btw. the idiomatic version of this D sample looks how I imagined it should!
Aug 13
parent reply Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:
On Sunday, 13 August 2017 at 08:43:29 UTC, amfvcg wrote:
 On Sunday, 13 August 2017 at 08:33:53 UTC, Petar Kirov 
 [ZombineDev] wrote:
 With Daniel's latest version (
 http://forum.dlang.org/post/mailman.5963.1502612885.31550.digitalmars-d
learn puremagic.com )

 $ ldc2 -O3 --release sum_subranges2.d
 $ ./sum_subranges2
 210 ms, 838 μs, and 8 hnsecs
 50000000
Great!!! And that's what I was hoping for. So the conclusion is: use the latest ldc that's out there. Thank you Petar, thank you Daniel. (I cannot change the subject to SOLVED, can I?) Btw. the idiomatic version of this d sample looks how I imagined it should!
There's one especially interesting result.

This instantiation:

sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint)

of the following function:

auto sum_subranges(T)(T input, uint range)
{
    import std.range : chunks, ElementType, array;
    import std.algorithm : map;
    return input.chunks(range).map!(sum);
}

gets optimized with LDC to:

  push rax
  test edi, edi
  je .LBB2_2
  mov edx, edi
  mov rax, rsi
  pop rcx
  ret
.LBB2_2:
  lea rsi, [rip + .L.str.3]
  lea rcx, [rip + .L.str]
  mov edi, 45
  mov edx, 89
  mov r8d, 6779
  call _d_assert_msg@PLT

I.e. the compiler turned an O(n) algorithm into O(1), which is quite neat. It is also quite surprising to me that it looks like even dmd managed to do a similar optimization:

sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint):
    push   rbp
    mov    rbp,rsp
    sub    rsp,0x30
    mov    DWORD PTR [rbp-0x8],edi
    mov    r9d,DWORD PTR [rbp-0x8]
    test   r9,r9
    jne    41
    mov    r8d,0x1b67
    mov    ecx,0x0
    mov    eax,0x61
    mov    rdx,rax
    mov    QWORD PTR [rbp-0x28],rdx
    mov    edx,0x0
    mov    edi,0x2d
    mov    rsi,rdx
    mov    rdx,QWORD PTR [rbp-0x28]
    call   41
41: mov    QWORD PTR [rbp-0x20],rsi
    mov    QWORD PTR [rbp-0x18],r9
    mov    rdx,QWORD PTR [rbp-0x18]
    mov    rax,QWORD PTR [rbp-0x20]
    mov    rsp,rbp
    pop    rbp
    ret

Moral of the story: templates + ranges are an awesome combination.
Aug 13
next sibling parent reply amfvcg <hid d.en> writes:
On Sunday, 13 August 2017 at 09:08:14 UTC, Petar Kirov 
[ZombineDev] wrote:
 [...]
Change the parameter for this array size to be taken from stdin and I assume that these optimizations will go away.
Aug 13
parent reply Johan Engelen <j j.nl> writes:
On Sunday, 13 August 2017 at 09:15:48 UTC, amfvcg wrote:
 Change the parameter for this array size to be taken from stdin 
 and I assume that these optimizations will go away.
This is paramount for all of the testing, examining, and comparisons that are discussed in this thread. Full information is given to the compiler, and you are basically testing the constant-folding power of the compilers (not unimportant). No runtime calculation is needed for the sum. Your program could be optimized to the following code:

```
void main()
{
    MonoTime beg = MonoTime.currTime;
    MonoTime end = MonoTime.currTime;
    writeln(end-beg);
    writeln(50000000);
}
```

So actually you should be more surprised that the reported time is not equal to near-zero (just the time between two `MonoTime.currTime` calls)!

Instead of `iota(1,1000000)`, you should initialize the array with random numbers with a randomization seed given by the user (e.g. commandline argument or stdin). Then, the program will actually have to do the runtime calculations that I assume you are expecting it to perform.

- Johan
Aug 13
parent Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:
On Sunday, 13 August 2017 at 09:56:44 UTC, Johan Engelen wrote:
 On Sunday, 13 August 2017 at 09:15:48 UTC, amfvcg wrote:
 Change the parameter for this array size to be taken from 
 stdin and I assume that these optimizations will go away.
This is paramount for all of the testing, examining, and comparisons that are discussed in this thread. Full information is given to the compiler, and you are basically testing the constant folding power of the compilers (not unimportant).
I agree that in general this is not the right way to benchmark. However, I am interested specifically in the pattern-matching / constant-folding abilities of the compiler. I would have expected `sum(iota(1, N + 1))` to be replaced with `(N*(N+1))/2`. LDC already does this optimization in some cases. I have opened an issue for some of the rest:
https://github.com/ldc-developers/ldc/issues/2271
 No runtime calculation is needed for the sum. Your program 
 could be optimized to the following code:
 ```
 void main()
 {
     MonoTime beg = MonoTime.currTime;
     MonoTime end = MonoTime.currTime;
     writeln(end-beg);
     writeln(50000000);
 }
 ```
 So actually you should be more surprised that the reported time 
 is not equal to near-zero (just the time between two 
 `MonoTime.currTime` calls)!
On Posix, `MonoTime.currTime`'s implementation uses clock_gettime(CLOCK_MONOTONIC, ...), which is quite a bit more involved than simply using the rdtsc instruction on x86. See: http://linuxmogeb.blogspot.bg/2013/10/how-does-clockgettime-work.html

On Windows, `MonoTime.currTime` uses QueryPerformanceCounter, which on Win 7 and later uses the rdtsc instruction, which makes it quite streamlined. In some testing I did several months ago, QueryPerformanceCounter had really good latency and precision (though I forgot the exact numbers I got).
 Instead of `iota(1,1000000)`, you should initialize the array 
 with random numbers with a randomization seed given by the user 
 (e.g. commandline argument or stdin). Then, the program will 
 actually have to do the runtime calculations that I assume you 
 are expecting it to perform.
Agreed, though I think Phobos's unpredictableSeed does an ok job w.r.t. seeding, so unless you want to repeat the benchmark on the exact same dataset, something like this does a good job:

T[] generate(T)(size_t size)
{
    import std.algorithm.iteration : map;
    import std.array : array;
    import std.random : uniform;
    import std.range : iota;
    return size.iota.map!(_ => uniform!T()).array;
}
Aug 13
prev sibling parent reply Johan Engelen <j j.nl> writes:
On Sunday, 13 August 2017 at 09:08:14 UTC, Petar Kirov 
[ZombineDev] wrote:
 This instantiation:

 sum_subranges(std.range.iota!(int, int).iota(int, int).Result, 
 uint)

 of the following function:

 auto sum_subranges(T)(T input, uint range)
 {
     import std.range : chunks, ElementType, array;
     import std.algorithm : map;
     return input.chunks(range).map!(sum);
 }

 gets optimized with LDC to:
  [snip]
 I.e. the compiler turned a O(n) algorithm to O(1), which is 
 quite neat. It is also quite surprising to me that it looks 
 like even dmd managed to do a similar optimization:
  [snip]
Execution of sum_subranges is already O(1), because the calculation of the sum is delayed: the return type of the function is not `uint`, it is `MapResult!(sum, <range>)` which does a lazy evaluation of the sum. - Johan
Aug 13
parent Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:
On Sunday, 13 August 2017 at 09:41:39 UTC, Johan Engelen wrote:
 On Sunday, 13 August 2017 at 09:08:14 UTC, Petar Kirov 
 [ZombineDev] wrote:
  [...]
  [...]
Execution of sum_subranges is already O(1), because the calculation of the sum is delayed: the return type of the function is not `uint`, it is `MapResult!(sum, <range>)` which does a lazy evaluation of the sum. - Johan
Heh, yeah you're absolutely right. I was just about to correct myself, when I saw your reply. Don't know how I missed such an obvious thing :D
Aug 13
prev sibling next sibling parent Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
this one is even faster than c++:
http://ideone.com/TRDsOo

On Sun, Aug 13, 2017 at 10:00 AM, Daniel Kozak <kozzi11 gmail.com> wrote:

 my second version on ldc takes 380ms and c++ version on same compiler
 (clang), takes 350ms, so it seems to be almost same

 On Sun, Aug 13, 2017 at 9:51 AM, amfvcg via Digitalmars-d-learn <
 digitalmars-d-learn puremagic.com> wrote:

 On Sunday, 13 August 2017 at 07:30:32 UTC, Daniel Kozak wrote:

 Here is more D idiomatic way:

 import std.stdio : writeln;
 import std.algorithm.comparison: min;
 import std.algorithm.iteration: sum;
 import core.time: MonoTime, Duration;


 auto sum_subranges(T)(T input, uint range)
 {
     import std.array : array;
     import std.range : chunks, ElementType;
     import std.algorithm : map;

     if (range == 0)
     {
         return ElementType!(T)[].init;
     }
     return input.chunks(range).map!(sum).array;
 }

 unittest
 {
     assert(sum_subranges([1,1,1], 2) == [2, 1]);
     assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
     assert(sum_subranges([], 2) == []);
     assert(sum_subranges([1], 2) == [1]);
     assert(sum_subranges([1], 0) == []);
 }


 int main()
 {
     import std.range : iota, array;
     auto v = iota(0,1000000);
     int sum;
     MonoTime beg = MonoTime.currTime;
     for (int i=0; i < 100; i++)
         sum += cast(int)sum_subranges(v,2).length;
     MonoTime end = MonoTime.currTime;
     writeln(end-beg);
     writeln(sum);
     return sum;
 }

 On Sun, Aug 13, 2017 at 9:13 AM, Daniel Kozak <kozzi11 gmail.com> wrote:

 this works ok for me with the ldc compiler; gdc does not work on my arch
 machine so I can not do a comparison to your c++ version (clang does not work
 with your c++ code)

 import std.stdio : writeln;
 import std.algorithm.comparison: min;
 import std.algorithm.iteration: sum;
 import core.time: MonoTime, Duration;


 T[] sum_subranges(T)(T[] input, uint range)
 {
     import std.array : appender;
     auto app = appender!(T[])();
     if (range == 0)
     {
         return app.data;
     }
     for (uint i; i < input.length; i=min(i+range, input.length))
     {
         app.put(sum(input[i..min(i+range, input.length)]));
     }
     return app.data;
 }

 unittest
 {
     assert(sum_subranges([1,1,1], 2) == [2, 1]);
     assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
     assert(sum_subranges([], 2) == []);
     assert(sum_subranges([1], 2) == [1]);
     assert(sum_subranges([1], 0) == []);
 }


 int main()
 {
     import std.range : iota, array;
     auto v = iota(0,1000000).array;
     int sum;
     MonoTime beg = MonoTime.currTime;
     for (int i=0; i < 100; i++)
         sum += cast(int)sum_subranges(v,2).length;
     MonoTime end = MonoTime.currTime;
     writeln(end-beg);
     writeln(sum);
     return sum;
 }

 On Sun, Aug 13, 2017 at 9:03 AM, Neia Neutuladh via Digitalmars-d-learn
 < digitalmars-d-learn puremagic.com> wrote:

 [...]

Thank you all for the replies. Good to know the community is alive in d :)

Let's settle the playground:
D  : http://ideone.com/h4fnsD
C++: http://ideone.com/X1pyXG
Both using GCC under the hood.

C++ in 112 ms; D in:
- 2.5 sec with original source;
- 2.5 sec with Daniel's 1st version;
- 5 sec timeout exceeded with Daniel's 2nd version;
- 1.8 sec with Neia-like preallocation;

So still it's not that neat. (What's interesting: the C++ code generates 2 KLOC of assembly, and D with ldc 12 KLOC - checked at godbolt.)

P.S. For the C++ version to work under clang, the function which takes (BinaryOp) must go before the other one (my bad).
Aug 13
prev sibling parent Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
And this one is awesome :P
http://ideone.com/muehUw

On Sun, Aug 13, 2017 at 10:27 AM, Daniel Kozak <kozzi11 gmail.com> wrote:

 [...]
Aug 13
prev sibling parent data pulverizer <data.pulverizer gmail.com> writes:
On Sunday, 13 August 2017 at 06:09:39 UTC, amfvcg wrote:
 Hi all,
 I'm solving below task:

 given container T and value R return sum of R-ranges over T. An 
 example:
 input : T=[1,1,1] R=2
 output : [2, 1]

 input : T=[1,2,3] R=1
 output : [1,2,3]
 (see dlang unittests for more examples)


 Below c++ code compiled with g++-5.4.0 -O2 -std=c++14 runs on 
 my machine in 656 836 us.
 Below D code compiled with dmd v2.067.1 -O runs on my machine 
 in ~ 14.5 sec.

 Each language has it's own "way of programming", and as I'm a 
 beginner in D - probably I'm running through bushes instead of 
 highway. Therefore I'd like to ask you, experienced dlang devs, 
 to shed some light on "how to do it dlang-way".

 ...
From time to time the forum gets questions like these. It would be nice if we had blog articles concerning high performance programming in D, a kind of best practice approach, covering everything from programming idioms to compilers and compiler flags.
Aug 17