digitalmars.D - M1 10x faster than Intel at integral division, throughput one 64-bit

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

Integral division is the strongest arithmetic operation.

I have a friend who knows some M1 internals. He said it's really Star 
Trek stuff.

This will seriously challenge other CPU producers.

What perspectives do we have to run the compiler on M1 and produce M1 code?
May 12
next sibling parent reply kookman <thekookman gmail.com> writes:
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu 
wrote:
 https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

 Integral division is the strongest arithmetic operation.

 I have a friend who knows some M1 internals. He said it's 
 really Star Trek stuff.

 This will seriously challenge other CPU producers.

 What perspectives do we have to run the compiler on M1 and 
 produce M1 code?
LDC v1.26 is already doing a very good job IMHO. I'm running with native ARM (brew.sh) LDC on Mac mini M1 and compiles are not noticeably slower than DMD on my Intel MacBook. Impressive!
May 12
parent Guillaume Piolat <first.last gmail.com> writes:
On Thursday, 13 May 2021 at 04:40:25 UTC, kookman wrote:
 LDC v1.26 is already doing a very good job IMHO. I'm running 
 with native ARM (brew.sh) LDC on Mac mini M1 and compiles are 
 not noticeably slower than DMD on my Intel MacBook. Impressive!
Even LDC 1.24 x86_64 cross-compiling to arm64 has very nice compilation time on the M1 here. Which is nice because you can then build the combined binary with one compiler.
May 13
prev sibling next sibling parent Max Haughton <maxhaton gmail.com> writes:
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu 
wrote:
 https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

 Integral division is the strongest arithmetic operation.

 I have a friend who knows some M1 internals. He said it's 
 really Star Trek stuff.

 This will seriously challenge other CPU producers.

 What perspectives do we have to run the compiler on M1 and 
 produce M1 code?
It's already winning, let alone challenging. Consider, though, just how fucking enormous the transistor budget is on the M1 on a per-core basis: from what is publicly known, the M1 doesn't really have that much magic to it, but is rather an extremely wide (where it really matters) iteration of what already works elsewhere in the industry, combined with no x86 tax on desktop for the first time. Intel's process engineers completely dropped the ball, so the M1 is on a process something like 4-5x denser than Intel 14nm.

Someone mentioned on Hacker News that Intel also improved the integer division capabilities of ThisXeon + 1 (i.e. the next Xeon generation), which would be worth benchmarking, although expecting monster SPECint numbers from a 28-core Xeon is probably missing the point.

Someone on the Discord has an M1, and D apparently already works fine; I'm aiming to get a blog post out of it. The GCC project has M1 hardware and should apparently be getting support soon-ish. Apple don't like upstreaming their backends from what I can tell, so it could be a while before they get tuned much.

Apple also haven't published anything along the lines of an optimization manual for the M1, so I guess we'll find out via osmosis what it's really capable of as time goes on. I think it's more likely that Apple get the Microsoft hidden-API treatment than that they actually go public on some of the extensions they have made to the ARM ISA, both in the form of new instructions and in the form of an old trick SPARC had, which basically turns TSO on underneath a program to aid x86 emulation.
May 12
prev sibling next sibling parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu 
wrote:
 https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

 Integral division is the strongest arithmetic operation.

 I have a friend who knows some M1 internals. He said it's 
 really Star Trek stuff.

 This will seriously challenge other CPU producers.

 What perspectives do we have to run the compiler on M1 and 
 produce M1 code?
LDC works out of the box.
GDC may support it with GDC11 (no idea on time frame).
DMD has no support for ARM whatsoever.
non-dmdfe based compilers, no idea. Probably not.
May 12
parent Iain Buclaw <ibuclaw gdcproject.org> writes:
On Thursday, 13 May 2021 at 06:02:31 UTC, Nicholas Wilson wrote:
 LDC works out of the box.
 GDC may support it with GDC11 (no idea on time frame).
 DMD has no support for ARM whatsoever.
 non-dmdfe based compilers, no idea. Probably not.
Nope, it doesn't work (sorry it took 5 months to raise an issue):
https://issues.dlang.org/show_bug.cgi?id=21919

Fortunately, OSX is broken on both X86_64 and ARM64 though. So the sooner someone fixes it...
May 13
prev sibling next sibling parent reply Witold Baryluk <witold.baryluk gmail.com> writes:
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu 
wrote:
 https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

 Integral division is the strongest arithmetic operation.

 I have a friend who knows some M1 internals. He said it's 
 really Star Trek stuff.

 This will seriously challenge other CPU producers.
No. It means nothing.

1) M1 is built on a smaller manufacturing node, which allows "wasting" more silicon area on such niche stuff.

2) He measured throughput in highly div-dense code. He didn't measure the actual speed (latency) of a divide. Anybody can make integer division faster (higher throughput) by throwing in more execution units or by fully pipelining the integer divisions. It costs a lot of silicon for zero gain, because real-world code doesn't have a div after div every next or second instruction.

He only checked on one x86 CPU. What about 3rd Gen Xeons (e.g. Cooper Lake)? Zen 2?

Did he isolate memory effects? I see he used a very high `count` in the loop: `count=524k`. That is 2MiB / 4MiB of data. Granted, this data access pattern will be easy to predict and prefetching will work really well, but this will still be touching multiple levels of cache.

There are data dependencies in the loop on the `sum`, making it really hard to do speculation. We don't see the assembly code, and don't know how much the loops are unrolled or how much potential there is for parallelism or hardware pipelining.

3) Apple can get away with that, because they run on a leading-edge manufacturing node, clock lower to reduce power, and waste silicon.

My guess is: if you do a single divide without a loop, it most likely will be the same on both platforms. There is nothing magical about M1 "fast" division.
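As a rough check on the working-set figure above, here is a minimal sketch. It assumes that "524k" means 2^19 = 524,288 elements and that the array holds 4-byte or 8-byte integers; both are assumptions read off the numbers in this post, not taken from the benchmark source.

```python
# Rough working-set estimate for the benchmark loop.
# Assumptions (not from the benchmark source): count = 2**19 elements,
# stored as 4-byte (u32) or 8-byte (u64) integers.
MIB = 1024 * 1024


def footprint_mib(count: int, element_bytes: int) -> float:
    """Return the array size in MiB for `count` elements of `element_bytes` each."""
    return count * element_bytes / MIB


count = 2 ** 19  # 524,288
u32_mib = footprint_mib(count, 4)
u64_mib = footprint_mib(count, 8)
print(f"u32 array: {u32_mib} MiB, u64 array: {u64_mib} MiB")
```

Either size is well beyond a typical L1 data cache, which fits the point that the loop touches more than one cache level.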
May 13
parent reply Witold Baryluk <witold.baryluk gmail.com> writes:
On Thursday, 13 May 2021 at 11:58:50 UTC, Witold Baryluk wrote:
 On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu 
 wrote:
 https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

 Integral division is the strongest arithmetic operation.

 I have a friend who knows some M1 internals. He said it's 
 really Star Trek stuff.

 This will seriously challenge other CPU producers.
 No. It means nothing.

 1) M1 is built on a smaller manufacturing node, which allows "wasting" more silicon area on such niche stuff.

 2) He measured throughput in highly div-dense code. He didn't measure the actual speed (latency) of a divide. Anybody can make integer division faster (higher throughput) by throwing in more execution units or by fully pipelining the integer divisions. It costs a lot of silicon for zero gain, because real-world code doesn't have a div after div every next or second instruction.

 He only checked on one x86 CPU. What about 3rd Gen Xeons (e.g. Cooper Lake)? Zen 2?

 Did he isolate memory effects? I see he used a very high `count` in the loop: `count=524k`. That is 2MiB / 4MiB of data. Granted, this data access pattern will be easy to predict and prefetching will work really well, but this will still be touching multiple levels of cache.

 There are data dependencies in the loop on the `sum`, making it really hard to do speculation. We don't see the assembly code, and don't know how much the loops are unrolled or how much potential there is for parallelism or hardware pipelining.

 3) Apple can get away with that, because they run on a leading-edge manufacturing node, clock lower to reduce power, and waste silicon.

 My guess is: if you do a single divide without a loop, it most likely will be the same on both platforms. There is nothing magical about M1 "fast" division.
I just tested, using his benchmark code, on my somewhat older AMD Zen+ CPU, which is clocked at 2.8GHz (so actually slower than either the M1 or the tested Xeon). I got 1.156ns per u32 divide using hardware divide. If I normalize this to 3.2GHz, it becomes 1.01ns. With `libdivide` I got 0.399ns (or 0.349ns normalized to 3.2GHz). So exactly the same speed as the M1 (0.351ns).

So, no, M1 is not 10 times faster than "x86".

Next time, exercise more critical thinking when reading "benchmark" claims.
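The clock normalization used for these numbers can be sketched as follows. It assumes that the cycle count per divide stays the same when the clock changes, so time per operation scales inversely with frequency; the function name is made up for illustration.

```python
def normalize_ns(measured_ns: float, measured_ghz: float, target_ghz: float) -> float:
    """Rescale a per-operation time to a different clock frequency.

    Assumes the operation takes the same number of cycles at both clocks,
    so time scales as measured_ghz / target_ghz.
    """
    return measured_ns * measured_ghz / target_ghz


# Figures quoted in the post: measured at 2.8GHz, normalized to 3.2GHz.
hw_divide = normalize_ns(1.156, 2.8, 3.2)   # hardware divide
lib_divide = normalize_ns(0.399, 2.8, 3.2)  # libdivide
print(f"hardware: {hw_divide:.3f} ns, libdivide: {lib_divide:.3f} ns")
```

Multiplying a time in nanoseconds by the clock in GHz gives cycles per operation directly: 1.156ns at 2.8GHz is roughly 3.2 cycles per divide in this throughput-bound loop.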
May 13
next sibling parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 May 2021 at 12:06:01 UTC, Witold Baryluk wrote:
 Next time, exercise more critical thinking when reading 
 "benchmark" claims.
Indeed, proper benchmarks use application suites, not shoehorned synthetic garble...

Besides, most performance-sensitive code does not use division much if the programmers know what they are doing. And in this "benchmark" the division could've been moved out of the inner loop by a less-than-braindead compiler.

Looks like Intel is releasing a Clang-based C++ compiler with OpenMP offload to Intel GPUs... Wonder if anyone knows anything about it?
May 13
parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Thursday, 13 May 2021 at 22:40:06 UTC, Ola Fosheim Grøstad 
wrote:
 And in this "benchmark" the division could've been moved out of 
 the inner loop by a less-than-braindead compiler.
Eh, no. I was wrong. Moral: never program right before bedtime.
May 14
prev sibling parent claptrap <clap trap.com> writes:
On Thursday, 13 May 2021 at 12:06:01 UTC, Witold Baryluk wrote:
 On Thursday, 13 May 2021 at 11:58:50 UTC, Witold Baryluk wrote:
 On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu 
 wrote:
 I just tested, using his benchmark code, on my somewhat older AMD Zen+ CPU, which is clocked at 2.8GHz (so actually slower than either the M1 or the tested Xeon). I got 1.156ns per u32 divide using hardware divide. If I normalize this to 3.2GHz, it becomes 1.01ns. With `libdivide` I got 0.399ns (or 0.349ns normalized to 3.2GHz). So exactly the same speed as the M1 (0.351ns).
Zen3 is about 2 to 3 times faster than Zen1 for both latency and throughput of 32/64 idiv. So if your results are accurate, Zen3 is 2 or 3 times faster than the M1.
May 13
prev sibling next sibling parent claptrap <clap trap.com> writes:
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu 
wrote:
 https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

 Integral division is the strongest arithmetic operation.

 I have a friend who knows some M1 internals. He said it's 
 really Star Trek stuff.

 This will seriously challenge other CPU producers.
Integer division on Intel has always been excruciatingly slow: 64-bit idiv can take up to 100 cycles in some cases, but DIVSD is like 20 or something. It's much faster to convert to double, do the division, and convert back (if you are OK with slightly lower precision).

Just for reference, I looked up timings for Zen3, and 64-bit idiv is 9-17 cycles latency, 7-12 throughput.

For Skylake, which is what it looks like the Xeon 8275CL is based on, it's 42-95 latency, 24-90 throughput.

So on paper a Zen3 is maybe 5 to 8 times faster at idiv than the Xeon he's using.
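To make the precision caveat concrete, here is a minimal sketch of the convert-to-double approach. Python floats are IEEE 754 doubles, so this shows the numeric behaviour only, not the speed; `div_via_double` is a hypothetical helper name, and the sketch assumes non-negative operands.

```python
def div_via_double(a: int, b: int) -> int:
    """Divide two non-negative integers by round-tripping through a double.

    A double carries a 53-bit significand, so operands that fit in 32 bits
    convert exactly; operands above 2**53 may be rounded on conversion.
    """
    return int(float(a) / float(b))


# 32-bit operands: the result matches integer division.
small = div_via_double(4294967295, 3)
print(small == 4294967295 // 3)

# An operand just above 2**53 is rounded when converted to double,
# so the quotient no longer matches integer division.
big = 2 ** 53 + 1
print(div_via_double(big, 1) == big // 1)
```

The "slightly lower precision" therefore only matters for 64-bit operands that need more than 53 bits.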
May 13
prev sibling parent reply Manu <turkeyman gmail.com> writes:
It's a strange thing to optimise... I seem to do an integer divide so
infrequently, that I can't imagine a measurable improvement in most code
I've ever written if it were substantially faster. I feel like I stopped
doing frequent integer divides right about the same time computers got
FPU's...

On Thu, May 13, 2021 at 12:00 PM Andrei Alexandrescu via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

 Integral division is the strongest arithmetic operation.

 I have a friend who knows some M1 internals. He said it's really Star
 Trek stuff.

 This will seriously challenge other CPU producers.

 What perspectives do we have to run the compiler on M1 and produce M1 code?
May 27
next sibling parent Ola Fosheim Grostad <ola.fosheim.grostad gmail.com> writes:
On Thursday, 27 May 2021 at 08:46:20 UTC, Manu wrote:
 It's a strange thing to optimise... I seem to do an integer 
 divide so infrequently, that I can't imagine a measurable 
 improvement in most code I've ever written if it were 
 substantially faster. I feel like I stopped doing frequent 
 integer divides right about the same time computers got FPU's...
Maybe Swift programmers don't think in terms of 2^n? Apple wants Swift to be a system-level language that is as easy to use as a high-level language. Or maybe they have analyzed apps in the App Store and found that people use div more than we know?

I personally feel bad if I don't make a data structure compatible with 2^n algos... Except in Python, where I only feel bad if the code isn't looking clean.
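A minimal sketch of what "compatible with 2^n algos" buys: when a table size is a power of two, wrapping an index needs only a bit mask instead of a division. The helper names are made up for illustration, and the sketch assumes non-negative indices.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two (exactly one bit set)."""
    return n > 0 and (n & (n - 1)) == 0


def wrap_index(i: int, size: int) -> int:
    """Map a non-negative index into [0, size).

    For power-of-two sizes this is a single AND; otherwise it falls back
    to the modulo operator, which needs a division.
    """
    if is_power_of_two(size):
        return i & (size - 1)
    return i % size


print(wrap_index(1000, 256), 1000 % 256)  # both paths agree
```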
May 27
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 27 May 2021 at 08:46:20 UTC, Manu wrote:
 It's a strange thing to optimise... I seem to do an integer 
 divide so infrequently, that I can't imagine a measurable 
 improvement in most code I've ever written if it were 
 substantially faster. I feel like I stopped doing frequent 
 integer divides right about the same time computers got FPU's...
There are a few places where it matters: some cryptographic operations, for instance, or data compression/decompression. Memory allocators tend to rely on it, not heavily, but the rest of the system depends heavily on them.

More generally, the problem with x86 divide isn't its perf per se, but the fact that it is not pipelined on Intel machines (no idea about AMD).
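When the same divisor is reused many times, the hardware divide can be replaced by a multiplication and a shift using a constant computed once. The sketch below shows the general idea for unsigned 32-bit dividends. It is an illustration of the technique, not libdivide's actual implementation, and `make_u32_divider` is a hypothetical name.

```python
def make_u32_divider(d: int):
    """Return a function computing n // d for 0 <= n < 2**32 without dividing by d.

    The one-time setup computes magic = ceil(2**64 / d). After that, each
    division is one multiplication and one shift. In C the product needs a
    128-bit intermediate; Python's big integers hide that detail.
    """
    if not 1 <= d < 2 ** 32:
        raise ValueError("divisor must be in [1, 2**32)")
    magic = (2 ** 64 - 1) // d + 1  # ceil(2**64 / d)

    def divide(n: int) -> int:
        return (n * magic) >> 64

    return divide


div7 = make_u32_divider(7)
print(div7(100), 100 // 7)
```

The setup still costs one real division, so this only pays off when the divisor is fixed across many divisions.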
May 27
parent reply Max Haughton <maxhaton gmail.com> writes:
On Thursday, 27 May 2021 at 12:50:52 UTC, deadalnix wrote:
 On Thursday, 27 May 2021 at 08:46:20 UTC, Manu wrote:
 It's a strange thing to optimise... I seem to do an integer 
 divide so infrequently, that I can't imagine a measurable 
 improvement in most code I've ever written if it were 
 substantially faster. I feel like I stopped doing frequent 
 integer divides right about the same time computers got 
 FPU's...
 There are a few places where it matters: some cryptographic operations, for instance, or data compression/decompression. Memory allocators tend to rely on it, not heavily, but the rest of the system depends heavily on them.

 More generally, the problem with x86 divide isn't its perf per se, but the fact that it is not pipelined on Intel machines (no idea about AMD).
Not pipelined!?

https://www.uops.info/table.html?search=idiv&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKL=on&cb_ZEN3=on&cb_measurements=on&cb_doc=on&cb_base=on
May 27
parent Ola Fosheim Grostad <ola.fosheim.grostad gmail.com> writes:
On Thursday, 27 May 2021 at 17:20:41 UTC, Max Haughton wrote:
 Not pipelined!?
On older Intel CPUs integer divide started to use parts of floating point divide logic to save space. Still pipelined... But ineffective.
May 27