## digitalmars.D - How to squeeze more performance out of gcd?

• H. S. Teoh (22/22) Jun 12 So this week I finished a major feature in one of my projects, which
• Uknown (7/25) Jun 13 I was doing some competitive programming last month and fast-gcd
• H. S. Teoh (12/19) Jun 15 Funnily enough, I did read that article before posting it, and I did
• Uknown (6/13) Jun 19 I didn't notice your reply, but if you haven't solved this yet,
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
```So this week I finished a major feature in one of my projects, which
involves adding a custom number type that can do exact arithmetic with
quadratic irrationals (numbers of the form (a+b*√r)/c with integer
a,b,c, and fixed r). The custom number type, unsurprisingly, performs
about 5x worse than using `double`, but with the plus side that
arithmetic is exact so I never have to worry about round-off errors.

Profiling a typical use case reveals that the most prominent bottleneck
(50+% of total running time) is std.numeric.gcd.  Inspecting the
disassembly shows that it's using the fast binary GCD algorithm (Stein's
algorithm) with the bsf and cmovg instructions (this is with `ldc2
-mcpu=native -O3`), which is as fast as you can get for GCD, AFAIK.

Is there a way to squeeze more juice out of gcd?  Like is there some
novel algorithm or hack that could trim that 50% figure a little?

One of the reasons gcd is a bottleneck is because the quadratic
irrationals are normalized after every operation. I contemplated
suppressing some of these normalizations in order to boost performance,
but just today I had to fix a major bug caused by coefficient overflow,
so I'm not keen on postponing normalization unless there's no other way
around it.

T

--
For every argument for something, there is always an equal and opposite
argument against it. Debates don't give answers, only wounded or inflated egos.
```
Jun 12
Uknown <sireeshkodali1 gmail.com> writes:
```On Friday, 12 June 2020 at 23:27:35 UTC, H. S. Teoh wrote:
So this week I finished a major feature in one of my projects,
which involves adding a custom number type that can do exact
arithmetic with quadratic irrationals (numbers of the form
(a+b*√r)/c with integer a,b,c, and fixed r). The custom number
type, unsurprisingly, performs about 5x worse than using
`double`, but with the plus side that arithmetic is exact so I
never have to worry about round-off errors.

Profiling a typical use case reveals that the most prominent
bottleneck (50+% of total running time) is std.numeric.gcd.
Inspecting the disassembly shows that it's using the fast
binary GCD algorithm (Stein's algorithm) with the bsf and cmovg
instructions (this is with `ldc2 -mcpu=native -O3`), which is
as fast as you can get for GCD, AFAIK.

Is there a way to squeeze more juice out of gcd?  Like is there
some novel algorithm or hack that could trim that 50% figure a
little?
[...]
T

I was doing some competitive programming last month and fast-gcd
came up, this was the fastest I found back then:
https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/

Of course it turned out the actual fast approach to my problem
was to skip the gcd calculations entirely, because they were
```
Jun 13
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
```On Sat, Jun 13, 2020 at 02:25:21PM +0000, Uknown via Digitalmars-d wrote:
[...]
I was doing some competitive programming last month and fast-gcd came
up, this was the fastest I found back then:
https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/

Of course it turned out the actual fast approach to my problem was to
skip the gcd calculations entirely, because they were irrelevant, but

Funnily enough, I did read that article before posting it, and I did
check the disassembly that it was using the bsr instruction rather than
a loop.  Unfortunately, the gcd computations still occupy about 51% of
total runtime.

Probably I need to think about some way of reducing the number of gcd
calculations, since the gcd computation itself is already maxed out in
terms of optimization AFAICT.

T

--
"I'm running Windows '98." "Yes." "My computer isn't working now." "Yes, you
```
Jun 15
Uknown <sireeshkodali1 gmail.com> writes:
```On Monday, 15 June 2020 at 16:45:01 UTC, H. S. Teoh wrote:
On Sat, Jun 13, 2020 at 02:25:21PM +0000, Uknown via
Digitalmars-d wrote: [...]
[...]
need to think about some way of reducing the number of gcd
calculations, since the gcd computation itself is already maxed
out in terms of optimization AFAICT.

T

I didn't notice your reply, but if you haven't solved this yet,
you could try this:

Make a simple recursive gcd implementation and memoize it. It
might work better than any other implementation (depending on the
numbers in question and the data set size)
```
Jun 19