www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Benchmarking mir.ndslice + lubeck against numpy and Julia

reply p.shkadzko <p.shkadzko gmail.com> writes:
Today I decided to write a couple of benchmarks to compare D mir 
with lubeck against Python numpy, then I also added Julia 
snippets. The results appeared to be quite interesting.


Allocation and SVD of [5000 x 10] matrix:

+--------+--------------------+------------+
|  lang  |        libs        | time (sec) |
+--------+--------------------+------------+
| Python | numpy+scipy        |        0.5 |
| Julia  | LinearAlgebra      |     0.0014 |
| D      | mir.ndslice+lubeck |        0.6 |
+--------+--------------------+------------+


Allocation and SVD of [5000 x 100] matrix:

+--------+--------------------+------------+
|  lang  |        libs        | time (sec) |
+--------+--------------------+------------+
| Python | numpy+scipy        |        4.5 |
| Julia  | LinearAlgebra      |      0.024 |
| D      | mir.ndslice+lubeck |        5.2 |
+--------+--------------------+------------+


Allocation and SVD of [5000 x 1000] matrix:

+--------+--------------------+------------+
|  lang  |        libs        | time (sec) |
+--------+--------------------+------------+
| Python | numpy+scipy        |        1.4 |
| Julia  | LinearAlgebra      |          1 |
| D      | mir.ndslice+lubeck |      12.85 |
+--------+--------------------+------------+


Allocation and SVD of [500 x 10000] matrix:

+--------+--------------------+------------+
|  lang  |        libs        | time (sec) |
+--------+--------------------+------------+
| Python | numpy+scipy        |       2.34 |
| Julia  | LinearAlgebra      |        1.1 |
| D      | mir.ndslice+lubeck |         25 |
+--------+--------------------+------------+


Matrices allocation and dot product A [3000 x 3000] * B [3000 x 
3000]

+--------+--------------------+------------+
|  lang  |        libs        | time (sec) |
+--------+--------------------+------------+
| Python | numpy+scipy        |       0.62 |
| Julia  | LinearAlgebra      |      0.215 |
| D      | mir.ndslice+lubeck |        1.5 |
+--------+--------------------+------------+


D lubeck's svd method it quite slow and gets even slower with 
growth of the second dimension while scipy svd becomes 
surprisingly faster. Dot product unfortunately is also 
disappointing. I can only complement Julia on such amazing 
results.

Below is the code I was using.

Allocation and SVD of [A x B] matrix:
Python
------
import numpy as np
from scipy.linalg import svd
import timeit

def svd_fun():
     data = np.random.randint(0, 1024, 5000000).reshape((5000, 
1000))
     u, s, v = svd(data)

print(timeit.timeit(svd_fun, number=1))

Julia
-----
using LinearAlgebra
using BenchmarkTools

function svdFun()
     a = rand([0, 1024], 5000, 1000)
     res = svd(a)
end

 btime svdFun()


D mir.ndslice+lubeck
--------------------
/+dub.sdl:
dependency "mir" version="~>3.2.0"
dependency "lubeck" version="~>1.1.7"
libs "lapack" "openblas"
+/
import std.datetime;
import std.datetime.stopwatch : benchmark;
import std.stdio;
import std.random : Xorshift, unpredictableSeed, uniform;
import std.array: array;
import std.range: generate, take;


import mir.ndslice;
import mir.math.common : optmath;
import lubeck;


void svdFun()
{
     Xorshift rnd;
     rnd.seed(unpredictableSeed);
     auto matrix = generate(() => uniform(0, 1024, rnd))
         .take(5000_000)
         .array
         .sliced(5000, 1000);
     auto svdResult = matrix.svd;
}

void main()
{
     auto svdTime = benchmark!(svdFun)(1);
     writeln(svdTime);

}

Matrices allocation and dot product A [3000 x 3000] * B [3000 x 
3000]
Python
------
def dot_fun():
     a = np.random.random(9000000).reshape((3000, 3000))
     b = np.random.random(9000000).reshape((3000, 3000))
     c = np.dot(a, b)
print(timeit.timeit(dot_fun, number=10)/10)

Julia
-----
function dotFun()
     a = rand([0, 1.0], 3000, 3000)
     b = rand([0, 1.0], 3000, 3000)
     c = dot(a, b)
end
 btime dotFun()

D mir.ndslice+lubeck
--------------------
static  optmath auto rndMatrix(T)(const T maxN, in int dimA, in 
int dimB) {
     Xorshift rnd;
     rnd.seed(unpredictableSeed);
     const amount = dimA * dimB;
     return generate(() => uniform(0, maxN, rnd))
         .take(amount)
         .array
         .sliced(dimA, dimB);
}

static  optmath T fmuladd(T, Z)(const T a, Z z){
     return a + z.a * z.b;
}

void dotFun() {
     auto matrixA = rndMatrix!double(1.0, 3000, 3000);
     auto matrixB = rndMatrix!double(1.0, 3000, 3000);
     auto zipped = zip!true(matrixA, matrixB);
     auto dot = reduce!fmuladd(0.0, zipped);

}

void main()
{
     auto dotTime = benchmark!(dotFun)(1);
     writeln(dotTime);
}
Jan 11
next sibling parent reply user1234 <user1234 12.de> writes:
On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.


 [...]
Can you specify the command line used for D ? For now it's not clear if use LDC, GDC or DMD and the options used (assertions compiled or not ?). DMD is not a good choice for benchmarking.
Jan 11
parent reply p.shkadzko <p.shkadzko gmail.com> writes:
On Saturday, 11 January 2020 at 22:21:18 UTC, user1234 wrote:
 On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.


 [...]
Can you specify the command line used for D ? For now it's not clear if use LDC, GDC or DMD and the options used (assertions compiled or not ?). DMD is not a good choice for benchmarking.
All D code was compiled with ldc2 dub build --compiler=ldc2 --single matrix_ops.d I also tried using $DFLAGS="--O3" but that didn't produce better results.
Jan 11
parent Jon Degenhardt <jond noreply.com> writes:
On Saturday, 11 January 2020 at 22:50:46 UTC, p.shkadzko wrote:
 On Saturday, 11 January 2020 at 22:21:18 UTC, user1234 wrote:
 On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.


 [...]
Can you specify the command line used for D ? For now it's not clear if use LDC, GDC or DMD and the options used (assertions compiled or not ?). DMD is not a good choice for benchmarking.
All D code was compiled with ldc2 dub build --compiler=ldc2 --single matrix_ops.d I also tried using $DFLAGS="--O3" but that didn't produce better results.
A useful set of flags to try: -O -release -flto=thin -defaultlib=phobos2-ldc-lto,druntime-ldc-lto The '-flto' option turns on LTO. Doesn't always help, but sometimes makes a significant difference, especially when library code is used in tight loops. Another option is to disable bounds checking in safe code (--boundscheck=off).
Jan 11
prev sibling next sibling parent reply JN <666total wp.pl> writes:
On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Allocation and SVD of [5000 x 10] matrix:

 +--------+--------------------+------------+
 |  lang  |        libs        | time (sec) |
 +--------+--------------------+------------+
 | Python | numpy+scipy        |        0.5 |
 | Julia  | LinearAlgebra      |     0.0014 |
 | D      | mir.ndslice+lubeck |        0.6 |
 +--------+--------------------+------------+
Meh. I don't like this kind of table. I think it should at least say "Python/C". Imagine a newcomer encountering this table and being like "wow, D is slower even than Python, what a mess!". But numpy/scipy are running C underneath, so D is not that bad in comparison.
Jan 11
next sibling parent Doc Andrew <x x.com> writes:
On Sunday, 12 January 2020 at 00:25:44 UTC, JN wrote:
 Meh. I don't like this kind of table. I think it should at 
 least say "Python/C". Imagine a newcomer encountering this 
 table and being like "wow, D is slower even than Python, what a 
 mess!". But numpy/scipy are running C underneath, so D is not 
 that bad in comparison.
C? Try Fortran! https://github.com/scipy/scipy/blob/master/scipy/linalg/src/i _dist/src/idz_svd.f (I think that's the underlying code being called in the OP's example) But your point still stands. Most of the heavy-duty lifting done by Python scientific computing libs is done by carefully-tuned, long-standing native code. There's nothing stopping D from linking to said libraries... -Doc
Jan 11
prev sibling parent reply p.shkadzko <p.shkadzko gmail.com> writes:
On Sunday, 12 January 2020 at 00:25:44 UTC, JN wrote:
 On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Allocation and SVD of [5000 x 10] matrix:

 +--------+--------------------+------------+
 |  lang  |        libs        | time (sec) |
 +--------+--------------------+------------+
 | Python | numpy+scipy        |        0.5 |
 | Julia  | LinearAlgebra      |     0.0014 |
 | D      | mir.ndslice+lubeck |        0.6 |
 +--------+--------------------+------------+
Meh. I don't like this kind of table. I think it should at least say "Python/C". Imagine a newcomer encountering this table and being like "wow, D is slower even than Python, what a mess!". But numpy/scipy are running C underneath, so D is not that bad in comparison.
Yes, technically we are benchmarking against C numpy and C/Fortran scipy through Python syntax layer. It should have been "C/Fortran".
Jan 12
parent reply Dennis <dkorpel gmail.com> writes:
On Sunday, 12 January 2020 at 09:43:46 UTC, p.shkadzko wrote:
 Yes, technically we are benchmarking against C numpy and 
 C/Fortran scipy through Python syntax layer. It should have 
 been "C/Fortran".
No, I think the names in your comparison are fair. You are not comparing the languages themselve, but available solutions for scientific computing for each language. Let me focus on your 5000 * 1000 SVD example. The thing is, all these languages / packages are just using LAPACK for that (see [1] [2] and [3]), so what you are actually comparing is: - The default settings - The LAPACK implementation that is used - A bit of luck (your matrices are randomly created) - The wrapping code (this is where the programming language actually matters) The first important default setting is the algorithm. Lapack offers 'gesdd', a 'more efficient divide-and-conquer approach', and 'gesvd', a 'general rectangular approach' [4]. Python and Julia default to gesdd, while D's Lubeck defaults to gesvd. I factored out the matrix generation code from the benchmark, and switched the D algorithm to 'gesdd', and the time went from ~17 seconds to ~6 on my PC. Comparing that with Python, I got these times for some subsequent runs: Python: 5.7, 6.3, 6.4, 5.3, 6.3 D (LDC release): 5.9, 5.8, 5.8, 5.9, 7.0, 5.7 D (DMD debug): 6.0, 6.5, 6.9, 6.7, 6.0 It can be seen that: - The times vary between runs (possibly because randomized matrices, and other applications running on my PC) - The times between scipy and Lubeck are comparable - Compiler settings are not that important, the extra compilation time is worse than the gained speed (and when doing scientific computing you recompile a lot). The second important setting is whether to compute full matrix U and Vt. When computing the SVD of matrix A of size 5000x1000, the U*S*Vt have sizes 5000x5000, 5000x1000, and 1000x1000 respectively. However, since S is just a diagonal matrix with 1000 entries and the rest is all 0, you can choose to only compute U as a 5000x1000 matrix. This obviously saves a lot of work. Julia does this by default, while Python and D compute the full U by default. When not computing the full matrices in Python and D, I get: D: roughly 1.7 s Python: roughly 1.2 s I don't have Julia, but I bet that it will be in the same order of magnitude on my PC as well. The new code can be found below. So in conclusion, Lubeck has some unfortunate default settings for benchmarking, and even with comparable settings D can be a little slower. This is either because inferior wrapping code, or because I am using a different LAPACK implementation. For D I use the LAPACK implementation of OpenBLAS that I compiled myself a while ago, I don't know what Python ships with. In any case, while your benchmark is not representative of the actual languages, it is indeed unfortunate when people try something D, find it significantly slower than other languages, and someone on the forum has to point out all the steps to get D performance on par. --- [1] https://github.com/scipy/scipy/blob/adc4f4f7bab120ccfab9383aba272954a0a12fb0/scipy/linalg/decomp_svd.py#L16-L139 [2] https://github.com/JuliaLang/julia/blob/aa35ee2d30065803410718378e5673c7f845da62/stdlib/LinearAlgebra/src/svd.jl#L93 [3] https://github.com/kaleidicassociates/lubeck/blob/72091ecdd545f9524a1d80e5880cda19845143d0/source/lubeck.d#L356 [4] https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.svd.html New Python code: ``` import numpy as np from scipy.linalg import svd import timeit h = 5000 w = 1000 data = np.random.randint(0, 1024, w*h).reshape((h, w)) def svd_fun(): u, s, v = svd(data, overwrite_a=False, full_matrices=False, lapack_driver='gesdd') print(timeit.timeit(svd_fun, number=1)) ``` New D code: ``` import std, mir.ndslice, lubeck; import std.datetime.stopwatch: benchmark; enum h = 5000; enum w = 1000; auto getMtx() { Xorshift rnd; rnd.seed(unpredictableSeed); auto matrix = generate(() => uniform(0, 1024, rnd)) .take(w * h).array.sliced(h, w); return matrix; } auto svdFun(T)(T mtx) { auto result = mtx.svd!(No.allowDestroy, "gesdd")(Yes.slim); // gesvd gesdd } void main() { auto svdTime = benchmark!(() => getMtx.svdFun())(1); writeln(svdTime); } ```
Jan 12
next sibling parent p.shkadzko <p.shkadzko gmail.com> writes:
On Sunday, 12 January 2020 at 12:46:44 UTC, Dennis wrote:
 [1] 
 https://github.com/scipy/scipy/blob/adc4f4f7bab120ccfab9383aba272954a0a12fb0/scipy/linalg/decomp_svd.py#L16-L139
 [2] 
 https://github.com/JuliaLang/julia/blob/aa35ee2d30065803410718378e5673c7f845da62/stdlib/LinearAlgebra/src/svd.jl#L93
 [3] 
 https://github.com/kaleidicassociates/lubeck/blob/72091ecdd545f9524a1d80e5880cda19845143d0/source/lubeck.d#L356

 [...]
Thanks! Now it becomes much clearer. It makes sense that in the end of the day it is BLAS+LAPACK for any language engaging into matrix calculations.
Jan 12
prev sibling parent reply bachmeier <no spam.net> writes:
On Sunday, 12 January 2020 at 12:46:44 UTC, Dennis wrote:

 The first important default setting is the algorithm. Lapack 
 offers 'gesdd', a 'more efficient divide-and-conquer approach', 
 and 'gesvd', a 'general rectangular approach' [4]. Python and 
 Julia default to gesdd, while D's Lubeck defaults to gesvd.
You left out an important detail in your description. gesdd is more efficient, but at the expense of being less accurate, and can easily fail on you. I have a strong preference for correct as the default as I've run into problems with optimized for speed by default before. Nobody ever digs in and understands the tradeoffs involved for the default. https://savannah.gnu.org/bugs/?55564
Jan 12
parent Dennis <dkorpel gmail.com> writes:
On Sunday, 12 January 2020 at 14:41:11 UTC, bachmeier wrote:
 You left out an important detail in your description. gesdd is 
 more efficient, but at the expense of being less accurate, and 
 can easily fail on you.
Interesting, I didn't actually know that. I just quoted the scipy documentation. I have only used SVD once before today, and it was on a 3x3 matrix for point set registration. Performance and accuracy weren't a problem. It was wondering why Matlab and Octave default to the slower method too, but that explains why.
Jan 12
prev sibling next sibling parent reply Arine <arine123445128843 gmail.com> writes:
I ran your dot product test for python and D.

$ python main.py
0.6625702600000001

$ ./main
[100 ms, 916 ╬╝s, and 2 hnsecs] // 0.1009162 secs

Some things I guess. Dub is horrible. The defaults are horrible. 
A lot of the way it functions is horrible. So I'm not surprised 
you got worse results, even though it seems you have a faster PC 
than me.

     dub build --config=release --compiler=ldc2 --arch=x86_64 
--single main.d

Dub sometimes defaults to x86, cause why not, it's not a dying 
platform. Which can give worse codegen. It also default to debug, 
cause, you know if you run something through dub it's obviously 
being run through a debugger.
Jan 11
parent Arine <arine123445128843 gmail.com> writes:
On Sunday, 12 January 2020 at 05:05:15 UTC, Arine wrote:
 I ran your dot product test for python and D.

 $ python main.py
 0.6625702600000001

 $ ./main
 [100 ms, 916 ╬╝s, and 2 hnsecs] // 0.1009162 secs

 Some things I guess. Dub is horrible. The defaults are 
 horrible. A lot of the way it functions is horrible. So I'm not 
 surprised you got worse results, even though it seems you have 
 a faster PC than me.

     dub build --config=release --compiler=ldc2 --arch=x86_64 
 --single main.d

 Dub sometimes defaults to x86, cause why not, it's not a dying 
 platform. Which can give worse codegen. It also default to 
 debug, cause, you know if you run something through dub it's 
 obviously being run through a debugger.
ops `--config=release` should be `--build=release` cause you know dub doesn't have confusing argument names and a really horrible documentation page that literally just repeats all the same arguments 20 times.
Jan 11
prev sibling next sibling parent reply Ola Fosheim Grostad <ola.fosheim.grostad gmail.com> writes:
On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.
A decent optimizer would remove all your code except the print statement. Make sure to output the result of the computation. Also make sure you use the same algorithms and accuracy. If you write your own innerproduct in one language then you should do so in the other languages as well and require the result to follow ieee754 by evaluating the result. Please note that floating point code cannot be fully restructured by the optimizer without setting the optimizer to less predictable fast-math settings. So it cannot even in theory approach hand tuned library code.
Jan 11
next sibling parent reply p.shkadzko <p.shkadzko gmail.com> writes:
On Sunday, 12 January 2020 at 07:04:00 UTC, Ola Fosheim Grostad 
wrote:
 On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.
A decent optimizer would remove all your code except the print statement. Make sure to output the result of the computation. Also make sure you use the same algorithms and accuracy. If you write your own innerproduct in one language then you should do so in the other languages as well and require the result to follow ieee754 by evaluating the result. Please note that floating point code cannot be fully restructured by the optimizer without setting the optimizer to less predictable fast-math settings. So it cannot even in theory approach hand tuned library code.
Yes, it all should be there, I was impatient to share the timings. On the other hand, I believe that stating that D can be faster at something but you only have to know which compiler to use, correct flags, maybe don't use dub because its defaults are bad will never going to work. People will not spend time trying to figure all this out. There is already some compiler heritage one should be aware of when using D and I suspect a lot of other things. True that in order to make a fair comparison you should be an expert in all these things but then the results won't be representative of the beginners code. Python out-of-the-box gives you such guarantees saying: "here is a programming language for you that is easy as a toaster and btw be assured that it will run fast no matter what stupid thing you do". On top of that you have quite detailed and beginner friendly docs. I shall keep checking out mir libs and lubeck to see what's more in there.
Jan 12
parent bachmeier <no spam.net> writes:
On Sunday, 12 January 2020 at 10:25:09 UTC, p.shkadzko wrote:
 On Sunday, 12 January 2020 at 07:04:00 UTC, Ola Fosheim Grostad 
 wrote:
 On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.
A decent optimizer would remove all your code except the print statement. Make sure to output the result of the computation. Also make sure you use the same algorithms and accuracy. If you write your own innerproduct in one language then you should do so in the other languages as well and require the result to follow ieee754 by evaluating the result. Please note that floating point code cannot be fully restructured by the optimizer without setting the optimizer to less predictable fast-math settings. So it cannot even in theory approach hand tuned library code.
Yes, it all should be there, I was impatient to share the timings. On the other hand, I believe that stating that D can be faster at something but you only have to know which compiler to use, correct flags, maybe don't use dub because its defaults are bad will never going to work.
These things can make *some* difference, but not the kind of difference you're claiming to have uncovered.
 Python out-of-the-box gives you such guarantees saying: "here 
 is a programming language for you that is easy as a toaster and 
 btw be assured that it will run fast no matter what stupid 
 thing you do".
If you're doing simple things, sure, Python will just forward your requests to underlying C/C++/Fortran libraries, but any language - including D - can do that. You could have achieved the same results using Ruby, Tcl, or Perl as you did with your Python code. As soon as you start doing something involved, where some parts of your code are pure Python and some parts are C/C++/Fortran, performance falls apart in a hurry.
Jan 12
prev sibling parent bachmeier <no spam.net> writes:
On Sunday, 12 January 2020 at 07:04:00 UTC, Ola Fosheim Grostad 
wrote:
 On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.
A decent optimizer would remove all your code except the print statement.
I was going to say the same thing. I don't even need to look at the code (and I wouldn't have time to do so anyway). If all you're doing is comparing the execution speed of the underlying libraries, Julia is not going to beat Python by such a wide margin. Julia is "fast" because it's not carrying out the the claimed operations.
Jan 12
prev sibling next sibling parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.
Thanks for running these benchmarks. A few notes/questions that may be interesting: (1) Do I understand right that both numpy and lubeck just call into external linear algebra libraries (BLAS and LAPACK) for the actual number crunching? (I believe so.) (2) Assuming that is true, then compiler flags will probably only make a difference to the initialization and allocation of the matrices. (IIRC, Xorshift's performance is very different when optimized versus un-optimized.) That likely explains why D and Python performance are near-identical once matrix initialization is taken out of the equation: the actual number-crunching is being done by the same external libraries. (3) The much speedier Julia results can probably be explained by its use of dedicated data structures for various stages of calculation, see §5.2 of https://arxiv.org/pdf/1411.1607.pdf. (4) Just as a general remark for your original benchmarks: while obviously a "just getting started" comparison is perfectly valid, it's unlikely you're comparing like with like. Leaving aside the compiler optimization flags, odds are that e.g. the randomized initialization of matrix elements is being done very differently in the 3 different languages (probably a different underlying RNG in each case, and quite possibly some extra tricks in numpy and Julia to handle the case of randomly initializing the contents of an entire matrix or array). **Principal takeaway**: anyone who wants to match Julia for speed needs to follow its lead in terms of dedicated data structures for linear algebra, not just hook into long-running standards like BLAS and LAPACK. There are opportunities here for libmir :-)
Jan 13
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 13 January 2020 at 11:51:20 UTC, Joseph Rushton 
Wakeling wrote:
 [snip]
On 1 & 2, you are correct for wherever blas/lapack are used, which is in a lot of numpy, lubeck, R, etc. However, mir-glas is intended to be an alternative to blas, though I don't think much work has been done recently. On 3, I didn't have time to look into the Julia results, but someone above made the comment that Julia was optimizing away the calculation itself. Dennis also had some interesting points above.
Jan 13
next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Monday, 13 January 2020 at 13:34:40 UTC, jmh530 wrote:
 However, mir-glas is intended to be an alternative to blas,
 though I don't think much work has been done recently.
Yes, in principle that's what `mir-glas` is supposed to offer, but the current the benchmarks only cover lubeck (which doesn't use `mir-glas`), and as you say, `mir-glas` itself is not feature-complete when it comes to linear algebra APIs. If anyone does get round to trying to benchmark functionality from `mir-glas`, note that you will have to pay quite a lot of attention to optimization flags if you want to get best results.
 On 3, I didn't have time to look into the Julia results, but 
 someone above made the comment that Julia was optimizing away 
 the calculation itself. Dennis also had some interesting points 
 above.
Yes, I would imagine that might also be part of it. Again, I would anticipate that being something where D ought to be able to readily implement similar features.
Jan 13
prev sibling parent reply bachmeier <no spam.net> writes:
On Monday, 13 January 2020 at 13:34:40 UTC, jmh530 wrote:
 On Monday, 13 January 2020 at 11:51:20 UTC, Joseph Rushton 
 Wakeling wrote:
 On 3, I didn't have time to look into the Julia results, but 
 someone above made the comment that Julia was optimizing away 
 the calculation itself. Dennis also had some interesting points 
 above.
All you have to do is look at the timings. Julia calls into lapack to do these operations, just like everybody else. No amount of optimization will result in the timings reported for Julia - it would be a revolution unlike any ever seen in computing if they were accurate.
Jan 13
next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 13 January 2020 at 14:26:55 UTC, bachmeier wrote:
 [snip]

 All you have to do is look at the timings. Julia calls into 
 lapack to do these operations, just like everybody else. No 
 amount of optimization will result in the timings reported for 
 Julia - it would be a revolution unlike any ever seen in 
 computing if they were accurate.
Yes, that was my initial reaction as well.
Jan 13
parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Monday, 13 January 2020 at 14:52:59 UTC, jmh530 wrote:
 On Monday, 13 January 2020 at 14:26:55 UTC, bachmeier wrote:
 [snip]

 All you have to do is look at the timings. Julia calls into 
 lapack to do these operations, just like everybody else. No 
 amount of optimization will result in the timings reported for 
 Julia - it would be a revolution unlike any ever seen in 
 computing if they were accurate.
Yes, that was my initial reaction as well.
Well, see the link I posted for some details on how they achieve that -- for example, when doing QR or LU decomposition, instead of doing in-place calculations where they have to replace every element of a m*n matrix, they define custom types that store the matrix factorizations in packed representations that only include the non-zero elements.
Jan 13
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 13 January 2020 at 17:09:29 UTC, Joseph Rushton 
Wakeling wrote:
 [snip]

 Well, see the link I posted for some details on how they 
 achieve that -- for example, when doing QR or LU decomposition, 
 instead of doing in-place calculations where they have to 
 replace every element of a m*n matrix, they define custom types 
 that store the matrix factorizations in packed representations 
 that only include the non-zero elements.
...take a look at the Julia benchmark in the first post. They are about 350x faster than the Numpy and D versions that are basically just calling C code. Do you really that the people who write linear algebra code are missing 350x improvements? Maybe their algorithm is faster than what lapack does, but I'm skeptical that - properly benchmarked - it could be that much faster.
Jan 13
parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Monday, 13 January 2020 at 17:33:16 UTC, jmh530 wrote:
 Do you really that the people who write linear algebra code are 
 missing 350x improvements? Maybe their algorithm is faster than 
 what lapack does, but I'm skeptical that - properly benchmarked 
 - it could be that much faster.
Fair. Is it possible that Julia is evaluating these calculations lazily, so you don't pay for the factorization or SVD until you genuinely start using the numbers?
Jan 13
prev sibling parent Pavel Shkadzko <p.shkadzko gmail.com> writes:
On Monday, 13 January 2020 at 14:26:55 UTC, bachmeier wrote:
 On Monday, 13 January 2020 at 13:34:40 UTC, jmh530 wrote:
 On Monday, 13 January 2020 at 11:51:20 UTC, Joseph Rushton 
 Wakeling wrote:
 On 3, I didn't have time to look into the Julia results, but 
 someone above made the comment that Julia was optimizing away 
 the calculation itself. Dennis also had some interesting 
 points above.
All you have to do is look at the timings. Julia calls into lapack to do these operations, just like everybody else. No amount of optimization will result in the timings reported for Julia - it would be a revolution unlike any ever seen in computing if they were accurate.
Indeed, my experience with Julia is zero and I don't know what btime is actually testing. Just copied it from howto benchmark Julia code page. I would honestly test the time it takes the actual script to run like "$ time julia demo.jl" because it does take some time before it precompiles the code so you don't feel those reported 0.1 seconds at all. And if you do data processing and scripting, console responsiveness and processing speed is all that matters to me at least.
Jan 13
prev sibling parent 9il <ilyayaroshenko gmail.com> writes:
On Saturday, 11 January 2020 at 21:54:13 UTC, p.shkadzko wrote:
 Today I decided to write a couple of benchmarks to compare D 
 mir with lubeck against Python numpy, then I also added Julia 
 snippets. The results appeared to be quite interesting.

 [...]
Lubeck doesn't use it's own sad implementation. Instead it uses the native library. The speed depends on the kind of BLAS library used. For speed program it's also recommended to use mir-lapack in pair with Intel MKL instead of Lubeck.
Jan 14