digitalmars.D - M1 10x faster than Intel at integral division, throughput one 64-bit

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

Integral division is the strongest arithmetic operation.

I have a friend who knows some M1 internals. He said it's really Star 
Trek stuff.

This will seriously challenge other CPU producers.

What perspectives do we have to run the compiler on M1 and produce M1 code?

May 12 2021

kookman <thekookman gmail.com> writes:

On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu 
wrote:
 https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

 Integral division is the strongest arithmetic operation.

 I have a friend who knows some M1 internals. He said it's 
 really Star Trek stuff.

 This will seriously challenge other CPU producers.

 What perspectives do we have to run the compiler on M1 and 
 produce M1 code?

LDC v1.26 is already doing a very good job IMHO. I'm running with 
native ARM (brew.sh) LDC on Mac mini M1 and compiles are not 
noticeably slower than DMD on my Intel MacBook. Impressive!

May 12 2021

Guillaume Piolat <first.last gmail.com> writes:

On Thursday, 13 May 2021 at 04:40:25 UTC, kookman wrote:
 LDC v1.26 is already doing a very good job IMHO. I'm running 
 with native ARM (brew.sh) LDC on Mac mini M1 and compiles are 
 not noticeably slower than DMD on my Intel MacBook. Impressive!

Even LDC 1.24 x86_64 cross-compiling to arm64 has very nice 
compilation time on the M1 here. Which is nice because you can 
then build the combined binary with one compiler.

May 13 2021

Max Haughton <maxhaton gmail.com> writes:

On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu
wrote:
https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

Integral division is the strongest arithmetic operation.

I have a friend who knows some M1 internals. He said it's
really Star Trek stuff.

This will seriously challenge other CPU producers.

What perspectives do we have to run the compiler on M1 and
produce M1 code?

It's already winning, let alone challenging, although consider
just how fucking enormous the transistor budget is on the M1 on a
per-core basis. From what is publicly known, the M1 doesn't
really have that much magic to it; it is rather an extremely wide
(where it really matters) iteration of what already works
elsewhere in the industry, combined with no x86 tax on desktop
for the first time. Intel's process engineers completely
dropped the ball, so the M1 is on a process something like 4-5x
denser than Intel 14nm.

Someone mentioned on Hacker News that Intel also improved integer
division in the Xeon generation after this one; that would be worth
benchmarking, although expecting monster SPECint numbers from a
28 core Xeon is probably missing the point.

Someone on the discord has an M1, D already works fine
apparently, I'm aiming to get a blog post out of it.

The GCC project has M1 hardware and should apparently be getting
support soon-ish. Apple don't like upstreaming their backends
from what I can tell, so it could be a while before they get
tuned much.

Apple also haven't published anything along the lines of an
optimization manual for M1, so I guess we'll find out via osmosis
what it's really capable of as time goes on. I think it's more
likely Apple get the Microsoft hidden-API treatment than actually
go public on some of the extensions they have made to the ARM ISA,
both in new instructions and in the form of an old trick SPARC
had, which basically turns TSO on underneath a program to aid x86
emulation.

May 12 2021

Nicholas Wilson <iamthewilsonator hotmail.com> writes:

On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu 
wrote:
 https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

 Integral division is the strongest arithmetic operation.

 I have a friend who knows some M1 internals. He said it's 
 really Star Trek stuff.

 This will seriously challenge other CPU producers.

 What perspectives do we have to run the compiler on M1 and 
 produce M1 code?

LDC works out of the box.
GDC may support it with GDC11 (no idea on time frame).
DMD has no support for ARM whatsoever.
non-dmdfe based compilers, no idea. Probably not.

May 12 2021

Iain Buclaw <ibuclaw gdcproject.org> writes:

On Thursday, 13 May 2021 at 06:02:31 UTC, Nicholas Wilson wrote:
 LDC works out of the box.
 GDC may support it with GDC11 (no idea on time frame).
 DMD has no support for ARM whatsoever.
 non-dmdfe based compilers, no idea. Probably not.

Nope, it doesn't work (sorry it took 5 months to raise an issue)

https://issues.dlang.org/show_bug.cgi?id=21919

Fortunately, OSX is broken on both X86_64 and ARM64 though.  So 
the sooner someone fixes it...

May 13 2021

Witold Baryluk <witold.baryluk gmail.com> writes:

On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu
wrote:
https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

Integral division is the strongest arithmetic operation.

I have a friend who knows some M1 internals. He said it's
really Star Trek stuff.

This will seriously challenge other CPU producers.

No. It means nothing.

1) M1 is built on a smaller manufacturing node, allowing them to
"waste" more silicon area on such niche stuff.

2) He measured throughput in highly div-dense code. He didn't
measure the actual speed (latency) of a divide. Anybody can make
integer division faster (higher throughput) by throwing in more
execution units or fully pipelining the integer divisions. It costs
a lot of silicon, for zero gain, because real world code doesn't
have a div every next or second instruction.

He only checked on one x86 CPU. What about 3rd Gen Xeons (i.e.
Cooper Lake)? Zen 2? Did he isolate memory effects?

I see he used very high `count` in the loop: `count=524k`. That
is 2MiB / 4MiB of data. Granted, this data access pattern will be
easy to predict and prefetching will work really well, but still
this will be touching multiple levels of cache.
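The sizes quoted above can be checked with a little arithmetic. This is only a sketch: it assumes `524k` means 2^19 = 524,288 elements and that the operands are 4-byte (u32) or 8-byte (u64) values. Neither is confirmed here, since the benchmark source is not shown in this thread.

```python
# Working-set size of the benchmark's input array.
# Assumptions (not from the benchmark source): "524k" is 2**19 elements,
# and elements are either 4 bytes (u32) or 8 bytes (u64).
COUNT = 2**19  # 524,288

def array_mib(count: int, bytes_per_element: int) -> float:
    """Return the size of an array in MiB."""
    return count * bytes_per_element / (1024 * 1024)

print(array_mib(COUNT, 4))  # u32 operands
print(array_mib(COUNT, 8))  # u64 operands
```

Under those assumptions the array matches the 2MiB / 4MiB figure, which is larger than typical L1 and many L2 caches.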

There are data dependencies in the loop on the `sum`, making it
really hard to do speculation.

We don't see the assembly code, and don't know how much the loops
are unrolled / how much potential there is for parallelism or
hardware pipelining.

3) Apple can get away with that, because they run on a leading
edge manufacturing node, clock lower to reduce power, and waste
silicon.

My guess is: If you do a single divide without a loop, it most
likely will be the same on both platforms.

There is nothing magical about M1 "fast" division.
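The latency versus throughput distinction in point 2 can be illustrated with a toy model. This is a rough sketch, not a description of any real core: the function names and the cycle counts are hypothetical, and it assumes a divider with a fixed latency that can accept a new independent divide every `recip_throughput` cycles.

```python
# Toy model of a divider unit. All figures are hypothetical, in cycles.

def dependent_chain_cycles(n: int, latency: int) -> int:
    """n divides where each needs the previous result: fully serialized."""
    return n * latency

def independent_cycles(n: int, latency: int, recip_throughput: int) -> int:
    """n independent divides: a new one can start every recip_throughput cycles."""
    if n == 0:
        return 0
    return latency + (n - 1) * recip_throughput

N = 1000
LATENCY = 20          # hypothetical latency of one divide
RECIP_THROUGHPUT = 5  # hypothetical cycles between independent divides

# Average cycles per divide in each scenario.
print(dependent_chain_cycles(N, LATENCY) / N)
print(independent_cycles(N, LATENCY, RECIP_THROUGHPUT) / N)
```

In this model a loop of independent divides approaches the reciprocal throughput per divide, while a dependent chain pays the full latency each time. A benchmark made of independent divides therefore says little about the cost of a single isolated divide.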

May 13 2021

Witold Baryluk <witold.baryluk gmail.com> writes:

On Thursday, 13 May 2021 at 11:58:50 UTC, Witold Baryluk wrote:
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu
wrote:
https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

Integral division is the strongest arithmetic operation.

I have a friend who knows some M1 internals. He said it's
really Star Trek stuff.

This will seriously challenge other CPU producers.

No. It means nothing.

1) M1 is built on a smaller manufacturing node, allowing them to
"waste" more silicon area on such niche stuff.

2) He measured throughput in highly div-dense code. He didn't
measure the actual speed (latency) of a divide. Anybody can make
integer division faster (higher throughput) by throwing in more
execution units or fully pipelining the integer divisions. It
costs a lot of silicon, for zero gain, because real world code
doesn't have a div every next or second instruction.

He only checked on one x86 CPU. What about 3rd Gen Xeons (i.e.
Cooper Lake)? Zen 2? Did he isolate memory effects?

I see he used very high `count` in the loop: `count=524k`. That
is 2MiB / 4MiB of data. Granted, this data access pattern will
be easy to predict and prefetching will work really well, but
still this will be touching multiple levels of cache.

There are data dependencies in the loop on the `sum`, making it
really hard to do speculation.

We don't see the assembly code, and don't know how much the
loops are unrolled / how much potential there is for
parallelism or hardware pipelining.

3) Apple can get away with that, because they run on a leading
edge manufacturing node, clock lower to reduce power, and waste
silicon.

My guess is: If you do a single divide without a loop, it most
likely will be the same on both platforms.

There is nothing magical about M1 "fast" division.

I just tested, using his benchmark code, on my somewhat older AMD
Zen+ CPU, which is clocked at 2.8GHz (so actually slower than either
M1 or the tested Xeon):

I got 1.156ns per u32 divide using hardware divide. If I
normalize this to 3.2GHz, it becomes 1.01ns.

0.399ns (or 0.349ns normalized to 3.2GHz) when using `libdivide`.
So exactly the same speed as M1 (0.351ns).

So, no, M1 is not 10 times faster than "x86".

Next time, exercise more critical thinking when reading
"benchmark" claims.
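As background for the `libdivide` figures in this post: libraries of that kind replace division by a fixed divisor with a multiplication and a shift. The sketch below shows one generic reciprocal-multiplication scheme for 32-bit unsigned operands. It is not libdivide's actual code, and the names `make_magic` and `fast_div` are invented for illustration.

```python
# Divide a 32-bit unsigned n by a fixed divisor d with one multiply and one
# shift. Generic reciprocal-multiplication scheme, NOT libdivide's real code.

def make_magic(d: int) -> int:
    """Precompute the multiplier ceil(2**64 / d) for a divisor 1 <= d < 2**32."""
    if not 1 <= d < 2**32:
        raise ValueError("divisor must fit in an unsigned 32-bit integer")
    return (2**64 - 1) // d + 1

def fast_div(n: int, magic: int) -> int:
    """Return n // d for 0 <= n < 2**32 using only a multiply and a shift."""
    return (n * magic) >> 64

# The multiplier is computed once and reused for many dividends.
magic7 = make_magic(7)
for n in (0, 6, 7, 100, 2**32 - 1):
    print(n, fast_div(n, magic7), n // 7)
```

The one-time cost of computing the multiplier pays off only when the same divisor is reused many times, which is the situation the benchmark measures.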

May 13 2021

Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:

On Thursday, 13 May 2021 at 12:06:01 UTC, Witold Baryluk wrote:
 Next time, exercise more critical thinking when reading 
 "benchmark" claims.

Indeed, proper benchmarks use application suites, not shoehorned 
synthetic garble... Besides, most performance sensitive code does 
not use division much if the programmers know what they are 
doing. And in this "benchmark" the division could've been moved 
out of the inner loop by a less-than-braindead compiler.

Looks like Intel is releasing a Clang based C++ compiler with 
OpenMP offload to Intel GPUs... Wonder if anyone knows anything 
about it?

May 13 2021

Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:

On Thursday, 13 May 2021 at 22:40:06 UTC, Ola Fosheim Grøstad 
wrote:
 And in this "benchmark" the division could've been moved out of 
 the inner loop by a less-than-braindead compiler.

Eh, no. I was wrong. Moral: never program right before bedtime.

May 14 2021

claptrap <clap trap.com> writes:

On Thursday, 13 May 2021 at 12:06:01 UTC, Witold Baryluk wrote:
 On Thursday, 13 May 2021 at 11:58:50 UTC, Witold Baryluk wrote:
 On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu 
 wrote:

 I just tested, using his benchmark code, on my somewhat older AMD 
 Zen+ CPU, which is clocked at 2.8GHz (so actually slower than 
 either M1 or the tested Xeon):

 I got 1.156ns per u32 divide using hardware divide. If I 
 normalize this to 3.2GHz, it becomes 1.01ns.

 0.399ns (or 0.349ns normalized to 3.2GHz) when using 
 `libdivide`. So exactly the same speed as M1 (0.351ns).

Zen3 is about 2 to 3 times faster than Zen1 for both latency and 
throughput of 32/64 idiv. So if your results are accurate, Zen3 
is 2 or 3 times faster than the M1.

May 13 2021

claptrap <clap trap.com> writes:

On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu 
wrote:
 https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

 Integral division is the strongest arithmetic operation.

 I have a friend who knows some M1 internals. He said it's 
 really Star Trek stuff.

 This will seriously challenge other CPU producers.

Integer division on Intel has always been excruciatingly slow: 64 
bit idiv can be up to 100 cycles in some cases, but DIVSD is like 
20 or something. It's much faster to convert to double, do the 
division, and convert back. (If you are ok with slightly lower 
precision.)
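The precision caveat can be made concrete with a small sketch. It relies on Python's `float` being an IEEE 754 double on common platforms; the helper name is made up. A double has a 53-bit significand, so the trick should be exact when both operands fit in 32 bits, but it can go wrong for full 64-bit values.

```python
def div_via_double(a: int, b: int) -> int:
    """Divide by converting to double, dividing, and truncating back."""
    return int(float(a) / float(b))

# Operands that fit in 32 bits: matches integer division in these cases.
for a, b in [(4294967295, 3), (4294967295, 4294967294), (123456789, 10)]:
    assert div_via_double(a, b) == a // b

# A 64-bit operand above 2**53 cannot be represented exactly as a double.
big = 2**53 + 1
print(div_via_double(big, 1) == big // 1)  # False: float(big) rounds to 2**53
```

So the conversion trick is a reasonable trade for 32-bit data and a lossy one for arbitrary 64-bit integers.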

Just for reference I looked up timings for Zen3 and 64 bit idiv is

9-17 latency, 7-12 throughput

For Skylake, which is what the Xeon 8275CL appears to be based 
on, it's:

42-95 latency, 24-90 throughput

So on paper a Zen3 is maybe 5 to 8 times faster at idiv than the 
Xeon he's using.
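The ratios behind that estimate can be worked out from the quoted figures. This sketch assumes the numbers are cycle counts and that "throughput" means reciprocal throughput (cycles per instruction). Pairing best case with best case and worst with worst is a simplification.

```python
# Figures quoted in the post for 64-bit idiv, as (min, max).
# Assumption: units are cycles; "throughput" is reciprocal throughput.
zen3 = {"latency": (9, 17), "throughput": (7, 12)}
skylake = {"latency": (42, 95), "throughput": (24, 90)}

def ratio_range(slow, fast):
    """Return (best-case ratio, worst-case ratio) of slow over fast."""
    return slow[0] / fast[0], slow[1] / fast[1]

for key in ("latency", "throughput"):
    low, high = ratio_range(skylake[key], zen3[key])
    print(f"{key}: {low:.1f}x to {high:.1f}x")
```

With these inputs the ratios land between roughly 3.4x and 7.5x, so the "5 to 8 times" figure sits toward the upper end of what the quoted numbers support.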

May 13 2021

Manu <turkeyman gmail.com> writes:

It's a strange thing to optimise... I seem to do an integer divide so
infrequently, that I can't imagine a measurable improvement in most code
I've ever written if it were substantially faster. I feel like I stopped
doing frequent integer divides right about the same time computers got
FPU's...

On Thu, May 13, 2021 at 12:00 PM Andrei Alexandrescu via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

 Integral division is the strongest arithmetic operation.

 I have a friend who knows some M1 internals. He said it's really Star
 Trek stuff.

 This will seriously challenge other CPU producers.

 What perspectives do we have to run the compiler on M1 and produce M1 code?

May 27 2021

Ola Fosheim Grostad <ola.fosheim.grostad gmail.com> writes:

On Thursday, 27 May 2021 at 08:46:20 UTC, Manu wrote:
 It's a strange thing to optimise... I seem to do an integer 
 divide so infrequently, that I can't imagine a measurable 
 improvement in most code I've ever written if it were 
 substantially faster. I feel like I stopped doing frequent 
 integer divides right about the same time computers got FPU's...

Maybe Swift programmers don't think in terms of 2^n? Apple wants 
Swift to be a system level language that is as easy to use as a 
high level language. Or maybe they have analyzed apps in the App 
Store and found that people use div more than we know?

I personally feel bad if I don't make a data structure compatible 
with 2^n algos... Except in Python, where I only feel bad if the 
code isn't looking clean.
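What "compatible with 2^n algos" buys can be sketched with index wrapping, for example in a ring buffer or hash table. When the size is a power of two, the remainder can be computed with a bit mask instead of a division. The function names below are hypothetical.

```python
def wrap_mod(i: int, size: int) -> int:
    """Wrap an index into [0, size) using the remainder (a division)."""
    return i % size

def wrap_mask(i: int, size: int) -> int:
    """Wrap an index into [0, size) using a mask; size must be a power of two."""
    if size <= 0 or size & (size - 1) != 0:
        raise ValueError("size must be a power of two")
    return i & (size - 1)

# Both give the same slot for non-negative indices when size is 2**n.
for i in (0, 5, 16, 17, 1000):
    print(i, wrap_mod(i, 16), wrap_mask(i, 16))
```

Code written this way never issues a hardware divide for indexing, which is one reason fast division matters less to programmers who size everything as 2^n.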

May 27 2021

deadalnix <deadalnix gmail.com> writes:

On Thursday, 27 May 2021 at 08:46:20 UTC, Manu wrote:
 It's a strange thing to optimise... I seem to do an integer 
 divide so infrequently, that I can't imagine a measurable 
 improvement in most code I've ever written if it were 
 substantially faster. I feel like I stopped doing frequent 
 integer divides right about the same time computers got FPU's...

There are a few places where it matters. Some cryptographic 
operations for instance, or data compression/decompression. 
Memory allocators tend to rely on it, not heavily, but the rest 
of the system depends heavily on them.

More generally, the problem with x86 divide isn't its perf per 
se, but the fact that it is not pipelined on Intel machines (no 
idea about AMD).

May 27 2021

Max Haughton <maxhaton gmail.com> writes:

On Thursday, 27 May 2021 at 12:50:52 UTC, deadalnix wrote:
 On Thursday, 27 May 2021 at 08:46:20 UTC, Manu wrote:
 It's a strange thing to optimise... I seem to do an integer 
 divide so infrequently, that I can't imagine a measurable 
 improvement in most code I've ever written if it were 
 substantially faster. I feel like I stopped doing frequent 
 integer divides right about the same time computers got 
 FPU's...

 There are a few places where it matters. Some cryptographic 
 operations for instance, or data compression/decompression. 
 Memory allocators tend to rely on it, not heavily, but the rest 
 of the system depends heavily on them.

 More generally, the problem with x86 divide isn't its perf per 
 se, but the fact that it is not pipelined on Intel machines (no 
 idea about AMD).

Not pipelined!?

https://www.uops.info/table.html?search=idiv&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKL=on&cb_ZEN3=on&cb_measurements=on&cb_doc=on&cb_base=on

May 27 2021

Ola Fosheim Grostad <ola.fosheim.grostad gmail.com> writes:

On Thursday, 27 May 2021 at 17:20:41 UTC, Max Haughton wrote:
 Not pipelined!?

On older Intel CPUs integer divide started to use parts of 
floating point divide logic to save space. Still pipelined... But 
ineffective.

May 27 2021
