digitalmars.D - Benchmarking mir.ndslice + lubeck against numpy and Julia

p.shkadzko (144/144) Jan 11 2020 Today I decided to write a couple of benchmarks to compare D mir

user1234 (5/9) Jan 11 2020 Can you specify the command line used for D ?

p.shkadzko (5/16) Jan 11 2020 All D code was compiled with ldc2

Jon Degenhardt (8/25) Jan 11 2020 A useful set of flags to try:

JN (6/14) Jan 11 2020 Meh. I don't like this kind of table. I think it should at least

Doc Andrew (8/13) Jan 11 2020 C? Try Fortran!
p.shkadzko (4/19) Jan 12 2020 Yes, technically we are benchmarking against C numpy and

Dennis (99/102) Jan 12 2020 No, I think the names in your comparison are fair.

p.shkadzko (4/11) Jan 12 2020 Thanks! Now it becomes much clearer. It makes sense that in the
bachmeier (8/12) Jan 12 2020 You left out an important detail in your description. gesdd is

Dennis (7/10) Jan 12 2020 Interesting, I didn't actually know that. I just quoted the scipy

Arine (15/15) Jan 11 2020 I ran your dot product test for python and D.

Arine (5/20) Jan 11 2020 ops `--config=release` should be `--build=release` cause you know

Ola Fosheim Grostad (11/14) Jan 11 2020 A decent optimizer would remove all your code except the print

p.shkadzko (18/32) Jan 12 2020 Yes, it all should be there, I was impatient to share the timings.

bachmeier (10/38) Jan 12 2020 These things can make *some* difference, but not the kind of

bachmeier (8/14) Jan 12 2020 I was going to say the same thing. I don't even need to look at

Joseph Rushton Wakeling (29/32) Jan 13 2020 Thanks for running these benchmarks. A few notes/questions that

jmh530 (9/10) Jan 13 2020 On 1 & 2, you are correct for wherever blas/lapack are used,

Joseph Rushton Wakeling (11/17) Jan 13 2020 Yes, in principle that's what `mir-glas` is supposed to offer,
bachmeier (6/12) Jan 13 2020 All you have to do is look at the timings. Julia calls into

jmh530 (2/8) Jan 13 2020 Yes, that was my initial reaction as well.

Joseph Rushton Wakeling (7/16) Jan 13 2020 Well, see the link I posted for some details on how they achieve

jmh530 (9/16) Jan 13 2020 ...take a look at the Julia benchmark in the first post. They are

Joseph Rushton Wakeling (4/8) Jan 13 2020 Fair. Is it possible that Julia is evaluating these calculations

Pavel Shkadzko (9/21) Jan 13 2020 Indeed, my experience with Julia is zero and I don't know what

9il (5/9) Jan 14 2020 Lubeck doesn't use it's own sad implementation. Instead it uses

p.shkadzko <p.shkadzko gmail.com> writes:

Today I decided to write a couple of benchmarks to compare D mir 
with lubeck against Python numpy, then I also added Julia 
snippets. The results appeared to be quite interesting.


Allocation and SVD of [5000 x 10] matrix:

+--------+--------------------+------------+
|  lang  |        libs        | time (sec) |
+--------+--------------------+------------+
| Python | numpy+scipy        |        0.5 |
| Julia  | LinearAlgebra      |     0.0014 |
| D      | mir.ndslice+lubeck |        0.6 |
+--------+--------------------+------------+


Allocation and SVD of [5000 x 100] matrix:

+--------+--------------------+------------+
|  lang  |        libs        | time (sec) |
+--------+--------------------+------------+
| Python | numpy+scipy        |        4.5 |
| Julia  | LinearAlgebra      |      0.024 |
| D      | mir.ndslice+lubeck |        5.2 |
+--------+--------------------+------------+


Allocation and SVD of [5000 x 1000] matrix:

+--------+--------------------+------------+
|  lang  |        libs        | time (sec) |
+--------+--------------------+------------+
| Python | numpy+scipy        |        1.4 |
| Julia  | LinearAlgebra      |          1 |
| D      | mir.ndslice+lubeck |      12.85 |
+--------+--------------------+------------+


Allocation and SVD of [500 x 10000] matrix:

+--------+--------------------+------------+
|  lang  |        libs        | time (sec) |
+--------+--------------------+------------+
| Python | numpy+scipy        |       2.34 |
| Julia  | LinearAlgebra      |        1.1 |
| D      | mir.ndslice+lubeck |         25 |
+--------+--------------------+------------+


Matrices allocation and dot product A [3000 x 3000] * B [3000 x 
3000]

+--------+--------------------+------------+
|  lang  |        libs        | time (sec) |
+--------+--------------------+------------+
| Python | numpy+scipy        |       0.62 |
| Julia  | LinearAlgebra      |      0.215 |
| D      | mir.ndslice+lubeck |        1.5 |
+--------+--------------------+------------+


D lubeck's svd method it quite slow and gets even slower with 
growth of the second dimension while scipy svd becomes 
surprisingly faster. Dot product unfortunately is also 
disappointing. I can only complement Julia on such amazing 
results.

Below is the code I was using.

Allocation and SVD of [A x B] matrix:
Python
------
import numpy as np
from scipy.linalg import svd
import timeit

def svd_fun():
     data = np.random.randint(0, 1024, 5000000).reshape((5000, 
1000))
     u, s, v = svd(data)

print(timeit.timeit(svd_fun, number=1))

Julia
-----
using LinearAlgebra
using BenchmarkTools

function svdFun()
     a = rand([0, 1024], 5000, 1000)
     res = svd(a)
end

 btime svdFun()


D mir.ndslice+lubeck
--------------------
/+dub.sdl:
dependency "mir" version="~>3.2.0"
dependency "lubeck" version="~>1.1.7"
libs "lapack" "openblas"
+/
import std.datetime;
import std.datetime.stopwatch : benchmark;
import std.stdio;
import std.random : Xorshift, unpredictableSeed, uniform;
import std.array: array;
import std.range: generate, take;


import mir.ndslice;
import mir.math.common : optmath;
import lubeck;


void svdFun()
{
     Xorshift rnd;
     rnd.seed(unpredictableSeed);
     auto matrix = generate(() => uniform(0, 1024, rnd))
         .take(5000_000)
         .array
         .sliced(5000, 1000);
     auto svdResult = matrix.svd;
}

void main()
{
     auto svdTime = benchmark!(svdFun)(1);
     writeln(svdTime);

}

Matrices allocation and dot product A [3000 x 3000] * B [3000 x 
3000]
Python
------
def dot_fun():
     a = np.random.random(9000000).reshape((3000, 3000))
     b = np.random.random(9000000).reshape((3000, 3000))
     c = np.dot(a, b)
print(timeit.timeit(dot_fun, number=10)/10)

Julia
-----
function dotFun()
     a = rand([0, 1.0], 3000, 3000)
     b = rand([0, 1.0], 3000, 3000)
     c = dot(a, b)
end
 btime dotFun()

D mir.ndslice+lubeck
--------------------
static  optmath auto rndMatrix(T)(const T maxN, in int dimA, in 
int dimB) {
     Xorshift rnd;
     rnd.seed(unpredictableSeed);
     const amount = dimA * dimB;
     return generate(() => uniform(0, maxN, rnd))
         .take(amount)
         .array
         .sliced(dimA, dimB);
}

static  optmath T fmuladd(T, Z)(const T a, Z z){
     return a + z.a * z.b;
}

void dotFun() {
     auto matrixA = rndMatrix!double(1.0, 3000, 3000);
     auto matrixB = rndMatrix!double(1.0, 3000, 3000);
     auto zipped = zip!true(matrixA, matrixB);
     auto dot = reduce!fmuladd(0.0, zipped);

}

void main()
{
     auto dotTime = benchmark!(dotFun)(1);
     writeln(dotTime);
}

Jan 11 2020

user1234 <user1234 12.de> writes:

On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.


 [...]

Can you specify the command line used for D ?
For now it's not clear if use LDC, GDC or DMD and the options 
used (assertions compiled or not ?). DMD is not a good choice for 
benchmarking.

Jan 11 2020

p.shkadzko <p.shkadzko gmail.com> writes:

On Saturday, 11 January 2020 at 22:21:18 UTC, user1234 wrote:
 On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.


 [...]

 Can you specify the command line used for D ?
 For now it's not clear if use LDC, GDC or DMD and the options 
 used (assertions compiled or not ?). DMD is not a good choice 
 for benchmarking.

All D code was compiled with ldc2
dub build --compiler=ldc2 --single matrix_ops.d

I also tried using $DFLAGS="--O3" but that didn't produce better 
results.

Jan 11 2020

Jon Degenhardt <jond noreply.com> writes:

On Saturday, 11 January 2020 at 22:50:46 UTC, p.shkadzko wrote:
 On Saturday, 11 January 2020 at 22:21:18 UTC, user1234 wrote:
 On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.


 [...]

 Can you specify the command line used for D ?
 For now it's not clear if use LDC, GDC or DMD and the options 
 used (assertions compiled or not ?). DMD is not a good choice 
 for benchmarking.

 All D code was compiled with ldc2
 dub build --compiler=ldc2 --single matrix_ops.d

 I also tried using $DFLAGS="--O3" but that didn't produce 
 better results.

A useful set of flags to try:

     -O -release -flto=thin 
-defaultlib=phobos2-ldc-lto,druntime-ldc-lto

The '-flto' option turns on LTO. Doesn't always help, but 
sometimes makes a significant difference, especially when library 
code is used in tight loops. Another option is to disable bounds 
checking in  safe code (--boundscheck=off).

Jan 11 2020

JN <666total wp.pl> writes:

On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Allocation and SVD of [5000 x 10] matrix:

 +--------+--------------------+------------+
 |  lang  |        libs        | time (sec) |
 +--------+--------------------+------------+
 | Python | numpy+scipy        |        0.5 |
 | Julia  | LinearAlgebra      |     0.0014 |
 | D      | mir.ndslice+lubeck |        0.6 |
 +--------+--------------------+------------+

Meh. I don't like this kind of table. I think it should at least 
say "Python/C". Imagine a newcomer encountering this table and 
being like "wow, D is slower even than Python, what a mess!". But 
numpy/scipy are running C underneath, so D is not that bad in 
comparison.

Jan 11 2020

Doc Andrew <x x.com> writes:

On Sunday, 12 January 2020 at 00:25:44 UTC, JN wrote:
 Meh. I don't like this kind of table. I think it should at 
 least say "Python/C". Imagine a newcomer encountering this 
 table and being like "wow, D is slower even than Python, what a 
 mess!". But numpy/scipy are running C underneath, so D is not 
 that bad in comparison.

C? Try Fortran! 
https://github.com/scipy/scipy/blob/master/scipy/linalg/src/i
_dist/src/idz_svd.f (I think that's the underlying code being called in the
OP's example)

But your point still stands. Most of the heavy-duty lifting done 
by Python scientific computing libs is done by carefully-tuned, 
long-standing native code. There's nothing stopping D from 
linking to said libraries...

-Doc

Jan 11 2020

p.shkadzko <p.shkadzko gmail.com> writes:

On Sunday, 12 January 2020 at 00:25:44 UTC, JN wrote:
 On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Allocation and SVD of [5000 x 10] matrix:

 +--------+--------------------+------------+
 |  lang  |        libs        | time (sec) |
 +--------+--------------------+------------+
 | Python | numpy+scipy        |        0.5 |
 | Julia  | LinearAlgebra      |     0.0014 |
 | D      | mir.ndslice+lubeck |        0.6 |
 +--------+--------------------+------------+

 Meh. I don't like this kind of table. I think it should at 
 least say "Python/C". Imagine a newcomer encountering this 
 table and being like "wow, D is slower even than Python, what a 
 mess!". But numpy/scipy are running C underneath, so D is not 
 that bad in comparison.

Yes, technically we are benchmarking against C numpy and 
C/Fortran scipy through Python syntax layer. It should have been 
"C/Fortran".

Jan 12 2020

Dennis <dkorpel gmail.com> writes:

On Sunday, 12 January 2020 at 09:43:46 UTC, p.shkadzko wrote:
 Yes, technically we are benchmarking against C numpy and 
 C/Fortran scipy through Python syntax layer. It should have 
 been "C/Fortran".

No, I think the names in your comparison are fair.

You are not comparing the languages themselve, but available 
solutions for scientific computing for each language.

Let me focus on your 5000 * 1000 SVD example. The thing is, all 
these languages / packages are just using LAPACK for that (see 
[1] [2] and [3]), so what you are actually comparing is:

- The default settings
- The LAPACK implementation that is used
- A bit of luck (your matrices are randomly created)
- The wrapping code (this is where the programming language 
actually matters)

The first important default setting is the algorithm. Lapack 
offers 'gesdd', a 'more efficient divide-and-conquer approach', 
and 'gesvd', a 'general rectangular approach' [4]. Python and 
Julia default to gesdd, while D's Lubeck defaults to gesvd.

I factored out the matrix generation code from the benchmark, and 
switched the D algorithm to 'gesdd', and the time went from ~17 
seconds to ~6 on my PC. Comparing that with Python, I got these 
times for some subsequent runs:

Python: 5.7, 6.3, 6.4, 5.3, 6.3
D (LDC release): 5.9, 5.8, 5.8, 5.9, 7.0, 5.7
D (DMD debug): 6.0, 6.5, 6.9, 6.7, 6.0

It can be seen that:

- The times vary between runs (possibly because randomized 
matrices, and other applications running on my PC)
- The times between scipy and Lubeck are comparable
- Compiler settings are not that important, the extra compilation 
time is worse than the gained speed (and when doing scientific 
computing you recompile a lot).

The second important setting is whether to compute full matrix U 
and Vt.

When computing the SVD of matrix A of size 5000x1000, the U*S*Vt 
have sizes 5000x5000, 5000x1000, and 1000x1000 respectively. 
However, since S is just a diagonal matrix with 1000 entries and 
the rest is all 0, you can choose to only compute U as a 
5000x1000 matrix. This obviously saves a lot of work.

Julia does this by default, while Python and D compute the full U 
by default.

When not computing the full matrices in Python and D, I get:

D: roughly 1.7 s
Python: roughly 1.2 s

I don't have Julia, but I bet that it will be in the same order 
of magnitude on my PC as well. The new code can be found below.

So in conclusion, Lubeck has some unfortunate default settings 
for benchmarking, and even with comparable settings D can be a 
little slower. This is either because inferior wrapping code, or 
because I am using a different LAPACK implementation. For D I use 
the LAPACK implementation of OpenBLAS that I compiled myself a 
while ago, I don't know what Python ships with.

In any case, while your benchmark is not representative of the 
actual languages, it is indeed unfortunate when people try 
something D, find it significantly slower than other languages, 
and someone on the forum has to point out all the steps to get D 
performance on par.

---

[1] 
https://github.com/scipy/scipy/blob/adc4f4f7bab120ccfab9383aba272954a0a12fb0/scipy/linalg/decomp_svd.py#L16-L139
[2] 
https://github.com/JuliaLang/julia/blob/aa35ee2d30065803410718378e5673c7f845da62/stdlib/LinearAlgebra/src/svd.jl#L93
[3] 
https://github.com/kaleidicassociates/lubeck/blob/72091ecdd545f9524a1d80e5880cda19845143d0/source/lubeck.d#L356

[4] 
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.svd.html

New Python code:
```
import numpy as np
from scipy.linalg import svd
import timeit
h = 5000
w = 1000
data = np.random.randint(0, 1024, w*h).reshape((h, w))
def svd_fun():
     u, s, v = svd(data, overwrite_a=False, full_matrices=False, 
lapack_driver='gesdd')

print(timeit.timeit(svd_fun, number=1))
```

New D code:
```
import std, mir.ndslice, lubeck;
import std.datetime.stopwatch: benchmark;

enum h = 5000;
enum w = 1000;

auto getMtx() {
     Xorshift rnd;
     rnd.seed(unpredictableSeed);
     auto matrix = generate(() => uniform(0, 1024, rnd))
         .take(w * h).array.sliced(h, w);
     return matrix;
}

auto svdFun(T)(T mtx) {
     auto result = mtx.svd!(No.allowDestroy, "gesdd")(Yes.slim); 
// gesvd gesdd
}

void main() {
     auto svdTime = benchmark!(() => getMtx.svdFun())(1);
     writeln(svdTime);
}
```

Jan 12 2020

p.shkadzko <p.shkadzko gmail.com> writes:

On Sunday, 12 January 2020 at 12:46:44 UTC, Dennis wrote:
 [1] 
 https://github.com/scipy/scipy/blob/adc4f4f7bab120ccfab9383aba272954a0a12fb0/scipy/linalg/decomp_svd.py#L16-L139
 [2] 
 https://github.com/JuliaLang/julia/blob/aa35ee2d30065803410718378e5673c7f845da62/stdlib/LinearAlgebra/src/svd.jl#L93
 [3] 
 https://github.com/kaleidicassociates/lubeck/blob/72091ecdd545f9524a1d80e5880cda19845143d0/source/lubeck.d#L356

 [...]

Thanks! Now it becomes much clearer. It makes sense that in the 
end of the day it is BLAS+LAPACK for any language engaging into 
matrix calculations.

Jan 12 2020

bachmeier <no spam.net> writes:

On Sunday, 12 January 2020 at 12:46:44 UTC, Dennis wrote:

 The first important default setting is the algorithm. Lapack 
 offers 'gesdd', a 'more efficient divide-and-conquer approach', 
 and 'gesvd', a 'general rectangular approach' [4]. Python and 
 Julia default to gesdd, while D's Lubeck defaults to gesvd.

You left out an important detail in your description. gesdd is 
more efficient, but at the expense of being less accurate, and 
can easily fail on you. I have a strong preference for correct as 
the default as I've run into problems with optimized for speed by 
default before. Nobody ever digs in and understands the tradeoffs 
involved for the default.

https://savannah.gnu.org/bugs/?55564

Jan 12 2020

Dennis <dkorpel gmail.com> writes:

On Sunday, 12 January 2020 at 14:41:11 UTC, bachmeier wrote:
 You left out an important detail in your description. gesdd is 
 more efficient, but at the expense of being less accurate, and 
 can easily fail on you.

Interesting, I didn't actually know that. I just quoted the scipy 
documentation. I have only used SVD once before today, and it was 
on a 3x3 matrix for point set registration. Performance and 
accuracy weren't a problem.

It was wondering why Matlab and Octave default to the slower 
method too, but that explains why.

Jan 12 2020

Arine <arine123445128843 gmail.com> writes:

I ran your dot product test for python and D.

$ python main.py
0.6625702600000001

$ ./main
[100 ms, 916 ╬╝s, and 2 hnsecs] // 0.1009162 secs

Some things I guess. Dub is horrible. The defaults are horrible. 
A lot of the way it functions is horrible. So I'm not surprised 
you got worse results, even though it seems you have a faster PC 
than me.

     dub build --config=release --compiler=ldc2 --arch=x86_64 
--single main.d

Dub sometimes defaults to x86, cause why not, it's not a dying 
platform. Which can give worse codegen. It also default to debug, 
cause, you know if you run something through dub it's obviously 
being run through a debugger.

Jan 11 2020

Arine <arine123445128843 gmail.com> writes:

On Sunday, 12 January 2020 at 05:05:15 UTC, Arine wrote:
 I ran your dot product test for python and D.

 $ python main.py
 0.6625702600000001

 $ ./main
 [100 ms, 916 ╬╝s, and 2 hnsecs] // 0.1009162 secs

 Some things I guess. Dub is horrible. The defaults are 
 horrible. A lot of the way it functions is horrible. So I'm not 
 surprised you got worse results, even though it seems you have 
 a faster PC than me.

     dub build --config=release --compiler=ldc2 --arch=x86_64 
 --single main.d

 Dub sometimes defaults to x86, cause why not, it's not a dying 
 platform. Which can give worse codegen. It also default to 
 debug, cause, you know if you run something through dub it's 
 obviously being run through a debugger.

ops `--config=release` should be `--build=release` cause you know 
dub doesn't have confusing argument names and a really horrible 
documentation page that literally just repeats all the same 
arguments 20 times.

Jan 11 2020

Ola Fosheim Grostad <ola.fosheim.grostad gmail.com> writes:

On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.

A decent optimizer would remove all your code except the print 
statement. Make sure to output the result of the computation. 
Also make sure you use the same algorithms and accuracy. If you 
write your own innerproduct in one language then you should do so 
in the other languages as well and require the result to follow 
ieee754 by evaluating the result.

Please note that floating point code cannot be fully restructured 
by the optimizer without setting the optimizer to less 
predictable fast-math settings. So it cannot even in theory 
approach hand tuned library code.

Jan 11 2020

p.shkadzko <p.shkadzko gmail.com> writes:

On Sunday, 12 January 2020 at 07:04:00 UTC, Ola Fosheim Grostad 
wrote:
 On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.

 A decent optimizer would remove all your code except the print 
 statement. Make sure to output the result of the computation. 
 Also make sure you use the same algorithms and accuracy. If you 
 write your own innerproduct in one language then you should do 
 so in the other languages as well and require the result to 
 follow ieee754 by evaluating the result.

 Please note that floating point code cannot be fully 
 restructured by the optimizer without setting the optimizer to 
 less predictable fast-math settings. So it cannot even in 
 theory approach hand tuned library code.

Yes, it all should be there, I was impatient to share the timings.

On the other hand, I believe that stating that D can be faster at 
something but you only have to know which compiler to use, 
correct flags, maybe don't use dub because its defaults are bad 
will never going to work. People will not spend time trying to 
figure all this out. There is already some compiler heritage one 
should be aware of when using D and I suspect a lot of other 
things. True that in order to make a fair comparison you should 
be an expert in all these things but then the results won't be 
representative of the beginners code. Python out-of-the-box gives 
you such guarantees saying: "here is a programming language for 
you that is easy as a toaster and btw be assured that it will run 
fast no matter what stupid thing you do". On top of that you have 
quite detailed and beginner friendly docs.

I shall keep checking out mir libs and lubeck to see what's more 
in there.

Jan 12 2020

bachmeier <no spam.net> writes:

On Sunday, 12 January 2020 at 10:25:09 UTC, p.shkadzko wrote:
 On Sunday, 12 January 2020 at 07:04:00 UTC, Ola Fosheim Grostad 
 wrote:
 On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.

 A decent optimizer would remove all your code except the print 
 statement. Make sure to output the result of the computation. 
 Also make sure you use the same algorithms and accuracy. If 
 you write your own innerproduct in one language then you 
 should do so in the other languages as well and require the 
 result to follow ieee754 by evaluating the result.

 Please note that floating point code cannot be fully 
 restructured by the optimizer without setting the optimizer to 
 less predictable fast-math settings. So it cannot even in 
 theory approach hand tuned library code.

 Yes, it all should be there, I was impatient to share the 
 timings.

 On the other hand, I believe that stating that D can be faster 
 at something but you only have to know which compiler to use, 
 correct flags, maybe don't use dub because its defaults are bad 
 will never going to work.

These things can make *some* difference, but not the kind of 
difference you're claiming to have uncovered.

 Python out-of-the-box gives you such guarantees saying: "here 
 is a programming language for you that is easy as a toaster and 
 btw be assured that it will run fast no matter what stupid 
 thing you do".

If you're doing simple things, sure, Python will just forward 
your requests to underlying C/C++/Fortran libraries, but any 
language - including D - can do that. You could have achieved the 
same results using Ruby, Tcl, or Perl as you did with your Python 
code. As soon as you start doing something involved, where some 
parts of your code are pure Python and some parts are 
C/C++/Fortran, performance falls apart in a hurry.

Jan 12 2020

bachmeier <no spam.net> writes:

On Sunday, 12 January 2020 at 07:04:00 UTC, Ola Fosheim Grostad 
wrote:
 On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.

 A decent optimizer would remove all your code except the print 
 statement.

I was going to say the same thing. I don't even need to look at 
the code (and I wouldn't have time to do so anyway). If all 
you're doing is comparing the execution speed of the underlying 
libraries, Julia is not going to beat Python by such a wide 
margin. Julia is "fast" because it's not carrying out the the 
claimed operations.

Jan 12 2020

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.

Thanks for running these benchmarks.  A few notes/questions that 
may be interesting:

(1) Do I understand right that both numpy and lubeck just call 
into external linear algebra libraries (BLAS and LAPACK) for the 
actual number crunching?  (I believe so.)

(2) Assuming that is true, then compiler flags will probably only 
make a difference to the initialization and allocation of the 
matrices.  (IIRC, Xorshift's performance is very different when 
optimized versus un-optimized.)  That likely explains why D and 
Python performance are near-identical once matrix initialization 
is taken out of the equation: the actual number-crunching is 
being done by the same external libraries.

(3) The much speedier Julia results can probably be explained by 
its use of dedicated data structures for various stages of 
calculation, see §5.2 of https://arxiv.org/pdf/1411.1607.pdf.

(4) Just as a general remark for your original benchmarks: while 
obviously a "just getting started" comparison is perfectly valid, 
it's unlikely you're comparing like with like.  Leaving aside the 
compiler optimization flags, odds are that e.g. the randomized 
initialization of matrix elements is being done very differently 
in the 3 different languages (probably a different underlying RNG 
in each case, and quite possibly some extra tricks in numpy and 
Julia to handle the case of randomly initializing the contents of 
an entire matrix or array).

**Principal takeaway**: anyone who wants to match Julia for speed 
needs to follow its lead in terms of dedicated data structures 
for linear algebra, not just hook into long-running standards 
like BLAS and LAPACK.  There are opportunities here for libmir :-)

Jan 13 2020

jmh530 <john.michael.hall gmail.com> writes:

On Monday, 13 January 2020 at 11:51:20 UTC, Joseph Rushton 
Wakeling wrote:
 [snip]

On 1 & 2, you are correct for wherever blas/lapack are used, 
which is in a lot of numpy, lubeck, R, etc. However, mir-glas is 
intended to be an alternative to blas, though I don't think much 
work has been done recently.

On 3, I didn't have time to look into the Julia results, but 
someone above made the comment that Julia was optimizing away the 
calculation itself. Dennis also had some interesting points above.

Jan 13 2020

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

On Monday, 13 January 2020 at 13:34:40 UTC, jmh530 wrote:
 However, mir-glas is intended to be an alternative to blas,
 though I don't think much work has been done recently.

Yes, in principle that's what `mir-glas` is supposed to offer, 
but the current the benchmarks only cover lubeck (which doesn't 
use `mir-glas`), and as you say, `mir-glas` itself is not 
feature-complete when it comes to linear algebra APIs.

If anyone does get round to trying to benchmark functionality 
from `mir-glas`, note that you will have to pay quite a lot of 
attention to optimization flags if you want to get best results.

 On 3, I didn't have time to look into the Julia results, but 
 someone above made the comment that Julia was optimizing away 
 the calculation itself. Dennis also had some interesting points 
 above.

Yes, I would imagine that might also be part of it.  Again, I 
would anticipate that being something where D ought to be able to 
readily implement similar features.

Jan 13 2020

bachmeier <no spam.net> writes:

On Monday, 13 January 2020 at 13:34:40 UTC, jmh530 wrote:
 On Monday, 13 January 2020 at 11:51:20 UTC, Joseph Rushton 
 Wakeling wrote:
 On 3, I didn't have time to look into the Julia results, but 
 someone above made the comment that Julia was optimizing away 
 the calculation itself. Dennis also had some interesting points 
 above.

All you have to do is look at the timings. Julia calls into 
lapack to do these operations, just like everybody else. No 
amount of optimization will result in the timings reported for 
Julia - it would be a revolution unlike any ever seen in 
computing if they were accurate.

Jan 13 2020

jmh530 <john.michael.hall gmail.com> writes:

On Monday, 13 January 2020 at 14:26:55 UTC, bachmeier wrote:
 [snip]

 All you have to do is look at the timings. Julia calls into 
 lapack to do these operations, just like everybody else. No 
 amount of optimization will result in the timings reported for 
 Julia - it would be a revolution unlike any ever seen in 
 computing if they were accurate.

Yes, that was my initial reaction as well.

Jan 13 2020

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

On Monday, 13 January 2020 at 14:52:59 UTC, jmh530 wrote:
 On Monday, 13 January 2020 at 14:26:55 UTC, bachmeier wrote:
 [snip]

 All you have to do is look at the timings. Julia calls into 
 lapack to do these operations, just like everybody else. No 
 amount of optimization will result in the timings reported for 
 Julia - it would be a revolution unlike any ever seen in 
 computing if they were accurate.

 Yes, that was my initial reaction as well.

Well, see the link I posted for some details on how they achieve 
that -- for example, when doing QR or LU decomposition, instead 
of doing in-place calculations where they have to replace every 
element of a m*n matrix, they define custom types that store the 
matrix factorizations in packed representations that only include 
the non-zero elements.

Jan 13 2020

jmh530 <john.michael.hall gmail.com> writes:

On Monday, 13 January 2020 at 17:09:29 UTC, Joseph Rushton 
Wakeling wrote:
 [snip]

 Well, see the link I posted for some details on how they 
 achieve that -- for example, when doing QR or LU decomposition, 
 instead of doing in-place calculations where they have to 
 replace every element of a m*n matrix, they define custom types 
 that store the matrix factorizations in packed representations 
 that only include the non-zero elements.

...take a look at the Julia benchmark in the first post. They are 
about 350x faster than the Numpy and D versions that are 
basically just calling C code. Do you really that the people who 
write linear algebra code are missing 350x improvements? Maybe 
their algorithm is faster than what lapack does, but I'm 
skeptical that - properly benchmarked - it could be that much 
faster.

Jan 13 2020

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

On Monday, 13 January 2020 at 17:33:16 UTC, jmh530 wrote:
 Do you really that the people who write linear algebra code are 
 missing 350x improvements? Maybe their algorithm is faster than 
 what lapack does, but I'm skeptical that - properly benchmarked 
 - it could be that much faster.

Fair.  Is it possible that Julia is evaluating these calculations 
lazily, so you don't pay for the factorization or SVD until you 
genuinely start using the numbers?

Jan 13 2020

Pavel Shkadzko <p.shkadzko gmail.com> writes:

On Monday, 13 January 2020 at 14:26:55 UTC, bachmeier wrote:
 On Monday, 13 January 2020 at 13:34:40 UTC, jmh530 wrote:
 On Monday, 13 January 2020 at 11:51:20 UTC, Joseph Rushton 
 Wakeling wrote:
 On 3, I didn't have time to look into the Julia results, but 
 someone above made the comment that Julia was optimizing away 
 the calculation itself. Dennis also had some interesting 
 points above.

 All you have to do is look at the timings. Julia calls into 
 lapack to do these operations, just like everybody else. No 
 amount of optimization will result in the timings reported for 
 Julia - it would be a revolution unlike any ever seen in 
 computing if they were accurate.

Indeed, my experience with Julia is zero and I don't know what 
 btime is actually testing. Just copied it from howto benchmark 
Julia code page. I would honestly test the time it takes the 
actual script to run like "$ time julia demo.jl" because it does 
take some time before it precompiles the code so you don't feel 
those reported 0.1 seconds at all. And if you do data processing 
and scripting, console responsiveness and processing speed is all 
that matters to me at least.

Jan 13 2020

9il <ilyayaroshenko gmail.com> writes:

On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.

 [...]

Lubeck doesn't use it's own sad implementation. Instead it uses 
the native library. The speed depends on the kind of BLAS library 
used. For speed program it's also recommended to use mir-lapack 
in pair with Intel MKL instead of Lubeck.

Jan 14 2020

D Programming

C/C++ Programming

Other

digitalmars.D - Benchmarking mir.ndslice + lubeck against numpy and Julia