www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - How to squeeze more performance out of gcd?

reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
So this week I finished a major feature in one of my projects, which
involves adding a custom number type that can do exact arithmetic with
quadratic irrationals (numbers of the form (a+b*√r)/c with integer
a,b,c, and fixed r). The custom number type, unsurprisingly, performs
about 5x worse than using `double`, but with the plus side that
arithmetic is exact so I never have to worry about round-off errors.

Profiling a typical use case reveals that the most prominent bottleneck
(50+% of total running time) is std.numeric.gcd.  Inspecting the
disassembly shows that it's using the fast binary GCD algorithm (Stein's
algorithm) with the bsf and cmovg instructions (this is with `ldc2
-mcpu=native -O3`), which is as fast as you can get for GCD, AFAIK.

Is there a way to squeeze more juice out of gcd?  Like is there some
novel algorithm or hack that could trim that 50% figure a little?

One of the reasons gcd is a bottleneck is because the quadratic
irrationals are normalized after every operation. I contemplated
suppressing some of these normalizations in order to boost performance,
but just today I had to fix a major bug caused by coefficient overflow,
so I'm not keen on postponing normalization unless there's no other way
around it.


T

-- 
For every argument for something, there is always an equal and opposite
argument against it. Debates don't give answers, only wounded or inflated egos.
Jun 12
parent reply Uknown <sireeshkodali1 gmail.com> writes:
On Friday, 12 June 2020 at 23:27:35 UTC, H. S. Teoh wrote:
 So this week I finished a major feature in one of my projects, 
 which involves adding a custom number type that can do exact 
 arithmetic with quadratic irrationals (numbers of the form 
 (a+b*√r)/c with integer a,b,c, and fixed r). The custom number 
 type, unsurprisingly, performs about 5x worse than using 
 `double`, but with the plus side that arithmetic is exact so I 
 never have to worry about round-off errors.

 Profiling a typical use case reveals that the most prominent 
 bottleneck (50+% of total running time) is std.numeric.gcd.  
 Inspecting the disassembly shows that it's using the fast 
 binary GCD algorithm (Stein's algorithm) with the bsf and cmovg 
 instructions (this is with `ldc2 -mcpu=native -O3`), which is 
 as fast as you can get for GCD, AFAIK.

 Is there a way to squeeze more juice out of gcd?  Like is there 
 some novel algorithm or hack that could trim that 50% figure a 
 little?
 [...]
 T
I was doing some competitive programming last month and fast-gcd came up, this was the fastest I found back then: https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/ Of course it turned out the actual fast approach to my problem was to skip the gcd calculations entirely, because they were irrelevant, but that blog post may help you.
Jun 13
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Jun 13, 2020 at 02:25:21PM +0000, Uknown via Digitalmars-d wrote:
[...]
 I was doing some competitive programming last month and fast-gcd came
 up, this was the fastest I found back then:
 https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/
 
 Of course it turned out the actual fast approach to my problem was to
 skip the gcd calculations entirely, because they were irrelevant, but
 that blog post may help you.
Funnily enough, I did read that article before posting it, and I did check the disassembly that it was using the bsr instruction rather than a loop. Unfortunately, the gcd computations still occupy about 51% of total runtime. Probably I need to think about some way of reducing the number of gcd calculations, since the gcd computation itself is already maxed out in terms of optimization AFAICT. T -- "I'm running Windows '98." "Yes." "My computer isn't working now." "Yes, you already said that." -- User-Friendly
Jun 15
parent Uknown <sireeshkodali1 gmail.com> writes:
On Monday, 15 June 2020 at 16:45:01 UTC, H. S. Teoh wrote:
 On Sat, Jun 13, 2020 at 02:25:21PM +0000, Uknown via 
 Digitalmars-d wrote: [...]
 [...]
 need to think about some way of reducing the number of gcd 
 calculations, since the gcd computation itself is already maxed 
 out in terms of optimization AFAICT.


 T
I didn't notice your reply, but if you haven't solved this yet, you could try this: Make a simple recursive gcd implementation and memoize it. It might work better than any other implementation (depending on the numbers in question and the data set size)
Jun 19