## digitalmars.D - How to squeeze more performance out of gcd?

- H. S. Teoh (22/22) Jun 12 So this week I finished a major feature in one of my projects, which
- Uknown (7/25) Jun 13 I was doing some competitive programming last month and fast-gcd
- H. S. Teoh (12/19) Jun 15 Funnily enough, I did read that article before posting it, and I did
- Uknown (6/13) Jun 19 I didn't notice your reply, but if you haven't solved this yet,

So this week I finished a major feature in one of my projects, which involves adding a custom number type that can do exact arithmetic with quadratic irrationals (numbers of the form (a+b*√r)/c with integer a,b,c, and fixed r). The custom number type, unsurprisingly, performs about 5x worse than using `double`, but with the plus side that arithmetic is exact so I never have to worry about round-off errors. Profiling a typical use case reveals that the most prominent bottleneck (50+% of total running time) is std.numeric.gcd. Inspecting the disassembly shows that it's using the fast binary GCD algorithm (Stein's algorithm) with the bsf and cmovg instructions (this is with `ldc2 -mcpu=native -O3`), which is as fast as you can get for GCD, AFAIK. Is there a way to squeeze more juice out of gcd? Like is there some novel algorithm or hack that could trim that 50% figure a little? One of the reasons gcd is a bottleneck is because the quadratic irrationals are normalized after every operation. I contemplated suppressing some of these normalizations in order to boost performance, but just today I had to fix a major bug caused by coefficient overflow, so I'm not keen on postponing normalization unless there's no other way around it. T -- For every argument for something, there is always an equal and opposite argument against it. Debates don't give answers, only wounded or inflated egos.

Jun 12

On Friday, 12 June 2020 at 23:27:35 UTC, H. S. Teoh wrote:So this week I finished a major feature in one of my projects, which involves adding a custom number type that can do exact arithmetic with quadratic irrationals (numbers of the form (a+b*√r)/c with integer a,b,c, and fixed r). The custom number type, unsurprisingly, performs about 5x worse than using `double`, but with the plus side that arithmetic is exact so I never have to worry about round-off errors. Profiling a typical use case reveals that the most prominent bottleneck (50+% of total running time) is std.numeric.gcd. Inspecting the disassembly shows that it's using the fast binary GCD algorithm (Stein's algorithm) with the bsf and cmovg instructions (this is with `ldc2 -mcpu=native -O3`), which is as fast as you can get for GCD, AFAIK. Is there a way to squeeze more juice out of gcd? Like is there some novel algorithm or hack that could trim that 50% figure a little? [...] TI was doing some competitive programming last month and fast-gcd came up, this was the fastest I found back then: https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/ Of course it turned out the actual fast approach to my problem was to skip the gcd calculations entirely, because they were irrelevant, but that blog post may help you.

Jun 13

On Sat, Jun 13, 2020 at 02:25:21PM +0000, Uknown via Digitalmars-d wrote: [...]I was doing some competitive programming last month and fast-gcd came up, this was the fastest I found back then: https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/ Of course it turned out the actual fast approach to my problem was to skip the gcd calculations entirely, because they were irrelevant, but that blog post may help you.Funnily enough, I did read that article before posting it, and I did check the disassembly that it was using the bsr instruction rather than a loop. Unfortunately, the gcd computations still occupy about 51% of total runtime. Probably I need to think about some way of reducing the number of gcd calculations, since the gcd computation itself is already maxed out in terms of optimization AFAICT. T -- "I'm running Windows '98." "Yes." "My computer isn't working now." "Yes, you already said that." -- User-Friendly

Jun 15

On Monday, 15 June 2020 at 16:45:01 UTC, H. S. Teoh wrote:On Sat, Jun 13, 2020 at 02:25:21PM +0000, Uknown via Digitalmars-d wrote: [...] [...] need to think about some way of reducing the number of gcd calculations, since the gcd computation itself is already maxed out in terms of optimization AFAICT. TI didn't notice your reply, but if you haven't solved this yet, you could try this: Make a simple recursive gcd implementation and memoize it. It might work better than any other implementation (depending on the numbers in question and the data set size)

Jun 19