digitalmars.D - M1 10x faster than Intel at integral division, throughput one 64-bit
- Andrei Alexandrescu (6/6) May 12 2021 https://www.reddit.com/r/programming/comments/nawerv/benchmarking_divisi...
- kookman (5/12) May 12 2021 LDC v1.26 is already doing a very good job IMHO. I'm running with
- Guillaume Piolat (4/7) May 13 2021 Even LDC 1.24 x86_64 cross-compiling to arm64 has very nice
- Max Haughton (29/36) May 12 2021 It's already winning let alone challenging, although consider
- Nicholas Wilson (6/13) May 12 2021 LDC works out of the box.
- Iain Buclaw (5/9) May 13 2021 Nope, it doesn't work (sorry it took 5 months to raise an issue)
- Witold Baryluk (28/33) May 13 2021 No. It means nothing.
- Witold Baryluk (11/48) May 13 2021 I just tested, using his benchmark code, on my a bit older AMD
- Ola Fosheim Grøstad (9/11) May 13 2021 Indeed, proper benchmarks use application suites, not shoehorned
- Ola Fosheim Grøstad (3/5) May 14 2021 Eh, no. I was wrong. Moral: never program right before bedtime.
- claptrap (4/15) May 13 2021 Zen3 is about 2 to 3 times faster than Zen1 for both latency and
- claptrap (14/19) May 13 2021 Integer division on Intel has always been excruciatingly slow, 64
- Manu (7/13) May 27 2021 It's a strange thing to optimise... I seem to do an integer divide so
- Ola Fosheim Grostad (8/13) May 27 2021 Maybe Swift programmers dont think in terms of 2^n? Apple wants
- deadalnix (8/13) May 27 2021 There are a few places where it matters. Some cryptographic
- Max Haughton (3/18) May 27 2021 Not pipelined!?
- Ola Fosheim Grostad (4/5) May 27 2021 On older Intel CPUs integer divide started to use parts of
https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

Integral division is the strongest arithmetic operation. I have a friend who knows some M1 internals. He said it's really Star Trek stuff. This will seriously challenge other CPU producers.

What perspectives do we have to run the compiler on M1 and produce M1 code?
May 12 2021
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
> Integral division is the strongest arithmetic operation. I have a friend who knows some M1 internals. He said it's really Star Trek stuff. This will seriously challenge other CPU producers. What perspectives do we have to run the compiler on M1 and produce M1 code?

LDC v1.26 is already doing a very good job IMHO. I'm running with native ARM (brew.sh) LDC on a Mac mini M1 and compiles are not noticeably slower than DMD on my Intel MacBook. Impressive!
May 12 2021
On Thursday, 13 May 2021 at 04:40:25 UTC, kookman wrote:
> LDC v1.26 is already doing a very good job IMHO. I'm running with native ARM (brew.sh) LDC on a Mac mini M1 and compiles are not noticeably slower than DMD on my Intel MacBook. Impressive!

Even LDC 1.24 x86_64 cross-compiling to arm64 has very nice compilation time on the M1 here. Which is nice, because you can then build the combined binary with one compiler.
May 13 2021
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
> Integral division is the strongest arithmetic operation. I have a friend who knows some M1 internals. He said it's really Star Trek stuff. This will seriously challenge other CPU producers. What perspectives do we have to run the compiler on M1 and produce M1 code?

It's already winning, let alone challenging, although consider just how fucking enormous the transistor budget is on the M1 on a per-core basis (i.e. from what is known in public, the M1 doesn't really have that much magic to it, but is rather an extremely wide - where it really matters - iteration of what already works elsewhere in the industry, combined with no x86 tax on desktop for the first time). Intel's process engineers completely dropped the ball, so the M1 is on a process something like 4-5x denser than Intel 14nm.

Someone mentioned on Hacker News that Intel improved integer division in the generation after this Xeon as well; it would be worth benchmarking - although expecting monster SPECint numbers from a 28-core Xeon is probably missing the point.

Someone on the Discord has an M1; D already works fine apparently, and I'm aiming to get a blog post out of it. The GCC project has M1 hardware and should apparently be getting support soon-ish. Apple don't like upstreaming their backends from what I can tell, so it could be a while before they get tuned much.

Apple also haven't published anything along the lines of an optimization manual for the M1, so I guess we'll find out via osmosis what it's really capable of as time goes on. I think it's more likely Apple get the Microsoft hidden-API treatment than actually go public on some of the extensions they have made to the ARM ISA - both in the form of new instructions and in the form of an old trick SPARC had which basically turns TSO on underneath a program to aid x86 emulation.
May 12 2021
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
> Integral division is the strongest arithmetic operation. I have a friend who knows some M1 internals. He said it's really Star Trek stuff. This will seriously challenge other CPU producers. What perspectives do we have to run the compiler on M1 and produce M1 code?

LDC works out of the box.
GDC may support it with GDC11 (no idea on time frame).
DMD has no support for ARM whatsoever.
Non-dmdfe-based compilers: no idea. Probably not.
May 12 2021
On Thursday, 13 May 2021 at 06:02:31 UTC, Nicholas Wilson wrote:
> LDC works out of the box.
> GDC may support it with GDC11 (no idea on time frame).
> DMD has no support for ARM whatsoever.
> Non-dmdfe-based compilers: no idea. Probably not.

Nope, it doesn't work (sorry it took 5 months to raise an issue):

https://issues.dlang.org/show_bug.cgi?id=21919

Fortunately, OSX is broken on both X86_64 and ARM64 though. So the sooner someone fixes it...
May 13 2021
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
> Integral division is the strongest arithmetic operation. I have a friend who knows some M1 internals. He said it's really Star Trek stuff. This will seriously challenge other CPU producers.

No. It means nothing.

1) The M1 is built on a smaller manufacturing node, allowing it to "waste" more silicon area on such niche stuff.

2) He measured throughput in highly div-dense code. He didn't measure the actual speed (latency) of a divide. Anybody can make integer division faster (higher throughput) by throwing more execution units at it or fully pipelining the integer divisions. It costs a lot of silicon, for zero gain, because real-world code doesn't have a div after a div every next or second instruction.

He only checked one x86 CPU. What about 3rd Gen Xeons (i.e. Cooper Lake)? Zen 2? Did he isolate memory effects? I see he used a very high `count` in the loop: `count=524k`. That is 2MiB / 4MiB of data. Granted, this data access pattern will be easy to predict and prefetching will work really well, but still this will be touching multiple levels of cache. There are data dependencies in the loop on the `sum`, making it really hard to speculate. We don't see the assembly code, and don't know how much the loops are unrolled / how much potential there is for parallelism or hardware pipelining.

3) Apple can get away with that because they run on a leading-edge manufacturing node, clock lower to reduce power, and waste silicon.

My guess is: if you do a single divide without a loop, it will most likely be the same on both platforms. There is nothing magical about the M1's "fast" division.
May 13 2021
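[Editor's note] The latency-versus-throughput distinction Witold draws above can be made concrete with a small C sketch (illustrative only; the divisor and the values are made up): a loop whose next divide depends on the previous result is bound by divide latency, while a loop of independent divides accumulated into a sum can overlap in the pipeline and is bound by divide throughput.

```c
#include <stdint.h>

/* Latency-bound: each divide's input depends on the previous divide's
 * result, so the divides cannot overlap in the pipeline. */
uint32_t chained_divides(uint32_t seed, int n) {
    uint32_t x = seed;
    for (int i = 0; i < n; i++)
        x = (x + 1000000000u) / 3u;
    return x;
}

/* Throughput-bound: the divides are independent of one another; only the
 * cheap additions into `sum` form a dependency chain, so a wide core can
 * keep several divides in flight at once. */
uint32_t independent_divides(const uint32_t *vals, int n) {
    uint32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += vals[i] / 3u;
    return sum;
}
```

Timing the two loops separately is what separates a claim about divide latency from one about divide throughput; a benchmark that only runs one of them tells you nothing about the other.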
On Thursday, 13 May 2021 at 11:58:50 UTC, Witold Baryluk wrote:
> No. It means nothing.
> [...]
> My guess is: if you do a single divide without a loop, it will most likely be the same on both platforms. There is nothing magical about the M1's "fast" division.

I just tested, using his benchmark code, on my somewhat older AMD Zen+ CPU, which is clocked at 2.8GHz (so actually slower than either the M1 or the tested Xeon):

I got 1.156ns per u32 divide using hardware divide. If I normalize this to 3.2GHz, it becomes 1.01ns.

0.399ns (or 0.349ns normalized to 3.2GHz) when using `libdivide`. So exactly the same speed as the M1 (0.351ns).

So, no, the M1 is not 10 times faster than "x86". Next time, exercise more critical thinking when reading "benchmark" claims.
May 13 2021
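[Editor's note] For readers wondering what `libdivide` actually does to beat the hardware divider: it precomputes a fixed-point reciprocal of a runtime-invariant divisor once, then replaces each divide by a multiply and a shift. A minimal sketch of that idea follows; this is not libdivide's real API (the names are hypothetical), and it assumes a GCC/Clang `__uint128_t` and a divisor of at least 2.

```c
#include <stdint.h>

typedef struct {
    uint64_t magic; /* floor(2^64 / d) + 1 */
} fastdiv_u32;     /* hypothetical name, not libdivide's API */

/* Precompute once per divisor. Valid for 2 <= d < 2^32; d == 1 would
 * overflow the magic constant and needs a special case. */
fastdiv_u32 fastdiv_prepare(uint32_t d) {
    fastdiv_u32 f = { 0xFFFFFFFFFFFFFFFFull / d + 1u };
    return f;
}

/* floor(x / d) as the high 64 bits of a 128-bit product: one multiply
 * and one shift instead of a hardware divide. */
uint32_t fastdiv_divide(fastdiv_u32 f, uint32_t x) {
    return (uint32_t)(((__uint128_t)f.magic * x) >> 64);
}
```

This is the same strength reduction compilers already apply when the divisor is a compile-time constant; libdivide's value is doing it for divisors known only at run time but reused across many dividends.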
On Thursday, 13 May 2021 at 12:06:01 UTC, Witold Baryluk wrote:
> Next time, exercise more critical thinking when reading "benchmark" claims.

Indeed, proper benchmarks use application suites, not shoehorned synthetic garble... Besides, most performance-sensitive code does not use division much if the programmers know what they are doing. And in this "benchmark" the division could've been moved out of the inner loop by a less-than-braindead compiler.

Looks like Intel is releasing a Clang-based C++ compiler with OpenMP offload to Intel GPUs... Wonder if anyone knows anything about it?
May 13 2021
On Thursday, 13 May 2021 at 22:40:06 UTC, Ola Fosheim Grøstad wrote:
> And in this "benchmark" the division could've been moved out of the inner loop by a less-than-braindead compiler.

Eh, no. I was wrong. Moral: never program right before bedtime.
May 14 2021
On Thursday, 13 May 2021 at 12:06:01 UTC, Witold Baryluk wrote:
> I just tested, using his benchmark code, on my somewhat older AMD Zen+ CPU, which is clocked at 2.8GHz (so actually slower than either the M1 or the tested Xeon): I got 1.156ns per u32 divide using hardware divide. If I normalize this to 3.2GHz, it becomes 1.01ns. 0.399ns (or 0.349ns normalized to 3.2GHz) when using `libdivide`. So exactly the same speed as the M1 (0.351ns).

Zen3 is about 2 to 3 times faster than Zen1 for both latency and throughput of 32/64-bit idiv. So if your results are accurate, Zen3 is 2 or 3 times faster than the M1.
May 13 2021
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
> Integral division is the strongest arithmetic operation. I have a friend who knows some M1 internals. He said it's really Star Trek stuff. This will seriously challenge other CPU producers.

Integer division on Intel has always been excruciatingly slow; 64-bit idiv can be up to 100 cycles in some cases, but DIVSD is like 20 or something. It's much faster to convert to double, do the division, and convert back. (If you are OK with slightly lower precision.)

Just for reference, I looked up timings for Zen3, and 64-bit idiv is 9-17 latency, 7-12 throughput.

For Skylake, which is what the Xeon 8275CL looks to be based on, it's 42-95 latency, 24-90 throughput.

So on paper a Zen3 is maybe 5 to 8 times faster at idiv than the Xeon he's using.
May 13 2021
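[Editor's note] The convert-to-double trick mentioned above can be sketched in C. For 32-bit operands the detour through double is actually exact, not approximate: both inputs fit in a double's 53-bit mantissa, and the true quotient a/b can never land within half an ulp below the next integer, so truncating the correctly rounded result recovers floor(a/b). The "slightly lower precision" caveat only bites for full 64-bit operands, which no longer fit the mantissa.

```c
#include <stdint.h>

/* u32 division via the double divider (often lower latency than idiv on
 * older Intel cores). Exact for all a and nonzero b, since both operands
 * convert to double without rounding. */
uint32_t div_u32_via_double(uint32_t a, uint32_t b) {
    return (uint32_t)((double)a / (double)b);
}
```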
It's a strange thing to optimise... I seem to do an integer divide so infrequently that I can't imagine a measurable improvement in most code I've ever written if it were substantially faster. I feel like I stopped doing frequent integer divides right about the same time computers got FPUs...

On Thu, May 13, 2021 at 12:00 PM Andrei Alexandrescu via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
> Integral division is the strongest arithmetic operation. I have a friend who knows some M1 internals. He said it's really Star Trek stuff. This will seriously challenge other CPU producers. What perspectives do we have to run the compiler on M1 and produce M1 code?
May 27 2021
On Thursday, 27 May 2021 at 08:46:20 UTC, Manu wrote:
> It's a strange thing to optimise... I seem to do an integer divide so infrequently that I can't imagine a measurable improvement in most code I've ever written if it were substantially faster. I feel like I stopped doing frequent integer divides right about the same time computers got FPUs...

Maybe Swift programmers don't think in terms of 2^n? Apple wants Swift to be a system-level language that is as easy to use as a high-level language. Or maybe they have analyzed apps in the App Store and found that people use div more than we know?

I personally feel bad if I don't make a data structure compatible with 2^n algos... Except in Python, where I only feel bad if the code isn't looking clean.
May 27 2021
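[Editor's note] The 2^n habit mentioned above is precisely about making divides disappear: when a size is a power of two, modulo becomes a bit mask and division becomes a shift, so no divider is involved at all. A tiny illustrative sketch (names and sizes hypothetical):

```c
#include <stdint.h>

#define TABLE_SLOTS 1024u /* must be a power of two */

/* hash % TABLE_SLOTS, but compiles to a single AND instruction */
uint32_t slot_for(uint32_t hash) {
    return hash & (TABLE_SLOTS - 1u);
}

/* byte_count / 8, but compiles to a single shift */
uint64_t words_for(uint64_t byte_count) {
    return byte_count >> 3;
}
```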
On Thursday, 27 May 2021 at 08:46:20 UTC, Manu wrote:
> It's a strange thing to optimise... I seem to do an integer divide so infrequently that I can't imagine a measurable improvement in most code I've ever written if it were substantially faster. I feel like I stopped doing frequent integer divides right about the same time computers got FPUs...

There are a few places where it matters. Some cryptographic operations, for instance, or data compression/decompression. Memory allocators tend to rely on it - not heavily, but the rest of the system depends heavily on them.

More generally, the problem with x86 divide isn't its perf per se, but the fact that it is not pipelined on Intel machines (no idea about AMD).
May 27 2021
On Thursday, 27 May 2021 at 12:50:52 UTC, deadalnix wrote:
> More generally, the problem with x86 divide isn't its perf per se, but the fact that it is not pipelined on Intel machines (no idea about AMD).

Not pipelined!?

https://www.uops.info/table.html?search=idiv&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKL=on&cb_ZEN3=on&cb_measurements=on&cb_doc=on&cb_base=on
May 27 2021
On Thursday, 27 May 2021 at 17:20:41 UTC, Max Haughton wrote:
> Not pipelined!?

On older Intel CPUs integer divide started to use parts of floating point divide logic to save space. Still pipelined... But ineffective.
May 27 2021