## digitalmars.D - Compiler optimizations

- "Craig Black" <cblack ara.com> Apr 30 2006
- dennis luehring <dl.soluz gmx.net> Apr 30 2006
- "Craig Black" <cblack ara.com> Apr 30 2006
- dennis luehring <dl.soluz gmx.net> Apr 30 2006
- "Craig Black" <cblack ara.com> Apr 30 2006
- James Pelcis <jpelcis gmail.com> Apr 30 2006
- "Craig Black" <cblack ara.com> Apr 30 2006
- Walter Bright <newshound digitalmars.com> May 02 2006
- Bruno Medeiros <brunodomedeirosATgmail SPAM.com> May 03 2006
- Walter Bright <newshound digitalmars.com> May 03 2006
- Thomas Kuehne <thomas-dloop kuehne.cn> May 03 2006
- Bruno Medeiros <brunodomedeirosATgmail SPAM.com> May 04 2006
- Don Clugston <dac nospam.com.au> May 04 2006
- pmoore <pmoore_member pathlink.com> May 04 2006
- pmoore <pmoore_member pathlink.com> May 04 2006
- Don Clugston <dac nospam.com.au> May 04 2006
- Dave <Dave_member pathlink.com> May 03 2006
- Dave <Dave_member pathlink.com> May 03 2006
- dennis luehring <dl.soluz gmx.net> Apr 30 2006
- "Craig Black" <cblack ara.com> Apr 30 2006
- dennis luehring <dl.soluz gmx.net> Apr 30 2006
- dennis luehring <dl.soluz gmx.net> Apr 30 2006
- "Craig Black" <cblack ara.com> Apr 30 2006
- "Craig Black" <cblack ara.com> Apr 30 2006
- "Craig Black" <cblack ara.com> Apr 30 2006
- Ben Phillips <Ben_member pathlink.com> Apr 30 2006
- "Craig Black" <cblack ara.com> Apr 30 2006
- dennis luehring <dl.soluz gmx.net> Apr 30 2006
- "Craig Black" <cblack ara.com> Apr 30 2006
- dennis luehring <dl.soluz gmx.net> Apr 30 2006
- "Craig Black" <cblack ara.com> May 01 2006
- Walter Bright <newshound digitalmars.com> May 03 2006
- pmoore <pmoore_member pathlink.com> May 04 2006
- Walter Bright <newshound digitalmars.com> May 03 2006
- Walter Bright <newshound digitalmars.com> May 02 2006

So I've been playing around with the pi program. I ported it to C++ and I've been doing some benchmarks with MSVC. It seems that MSVC is a little faster when SSE2 instructions are enabled, and slower when they are disabled. Which leads me to a question: does the D backend make use of SSE2 instructions? I think the answer is yes, but perhaps the optimizer is not quite as sophisticated as MSVC's yet.

Another question for someone who knows a lot about backend optimizations: I've noticed that integer division is faster when the divisor is a constant. How does the compiler optimize this?

Another thing. I know that integer division is essentially a floating point operation because there is no integer division instruction. I also read some on the internet and learned that Intel has a library that provides fixed-point math routines, which provides a division operation that outperforms floating-point division. Then I thought to myself, "Why don't compilers just use fixed-point division for dividing integers, since it's faster?" I know some of you out there know way more than me, so please, all you smart people, give me some input. -Craig

Apr 30 2006

> Another thing. I know that integer division is essentially a floating point operation because there is no integer division instruction.

and what is "idiv" for you? (http://www.jegerlehner.ch/intel/opcode.html)

> I also read some on the internet and learned that Intel has a library that provides fixed-point math routines, which provides a division operation that outperforms floating-point division. Then I thought to myself, "Why don't compilers just use fixed-point division for dividing integers since it's faster?"

a fixed-point division sometimes outperforms a floating-point operation (with loss of precision), but it never outperforms a true integer division

ciao dennis

Apr 30 2006

"dennis luehring" <dl.soluz gmx.net> wrote in message news:e32c0m$1dsu$1 digitaldaemon.com...

> > Another thing. I know that integer division is essentially a floating point operation because there is no integer division instruction.
>
> and what is "idiv" for you? (http://www.jegerlehner.ch/intel/opcode.html)

Please anyone, correct me if I'm wrong, but I think IDIV is a pseudo-instruction. 80x86 is notorious for those (as opposed to RISC). I read on the Intel website that Intel processors do not have a hardware integer division instruction. Apparently it is somehow better to use the floating point instruction instead. The benchmarks that I've done are consistent with this. I found it was faster to perform a floating-point multiplication on an integer and then convert it back to an integer than to do a single integer division.

> a fixed-point division sometimes outperforms a floating-point operation (with loss of precision), but it never outperforms a true integer division

Have you run benchmarks? -Craig

Apr 30 2006

> Please anyone, correct me if I'm wrong, but I think IDIV is a pseudo-instruction. 80x86 is notorious for those (as opposed to RISC). I read on the Intel website that intel processors do not have a hardware integer division instruction. Apparently it is somehow better to use the floating point instruction instead.

i don't know the truth - but why would all the compiler vendors around us (Microsoft, Intel, ...) produce the slower code if there is a faster solution? what would the reason be? (maybe it produces larger code - and that causes more cache problems...)

> The benchmarks that I've done are consistent with this. I found it was faster to perform a floating-point multiplication on an integer and then convert it back to an integer than just a single integer division.

can you post your benchmark source?

> Have you run benchmarks?

no - but i can run your benchmark...

Apr 30 2006

"dennis luehring" <dl.soluz gmx.net> wrote in message news:e32dfl$1f53$1 digitaldaemon.com...

> i don't know the truth - but why would all the compiler vendors around us (Microsoft, Intel, ...) produce the slower code if there is a faster solution?

I'm not saying that there is a faster solution, but there might be. Mainly, I'm just trying to learn more so that I will be better at optimizing my code.

> can you post your benchmark source? no - but i can run your benchmark...

These two functions show the difference that I'm talking about. When benchmarking, pass in the same value for the divisor (pick a value between 2 and 30), and pass in a large enough value for total that it will take some time for the CPU to finish. Surprisingly, the second one is faster. -Craig

```cpp
int divTest1(int divisor, int total)
{
    int sum = 0;
    for (int i = 0; i < total; i++)
    {
        int quotient = i / divisor;
        sum += quotient;
    }
    return sum; // return was missing in the original post
}

int divTest2(int divisor, int total)
{
    int sum = 0;
    double idiv = 1.0 / divisor;
    for (int i = 0; i < total; i++)
    {
        int quotient = i * idiv;
        sum += quotient;
    }
    return sum; // return was missing in the original post
}
```

Apr 30 2006

This doesn't really surprise me. If I am reading it correctly, the major difference comes from only doing the division once. If you did the division every time, the results might not be the same.

Craig Black wrote:

> These two functions show the difference that I'm talking about. When benchmarking, pass in the same value for the divisor (pick a value between 2 and 30), and pass in a large enough value for total that it will take some time for the CPU to finish. Surprisingly, the second one is faster. -Craig

Apr 30 2006

> This doesn't really surprise me. If I am reading it correctly, the major difference comes from only doing the division once. If you did the division every time, the results might not be the same.

I don't think you realize what is happening here ... we are dealing with integer division vs. floating point multiplication. With the floating point multiplication we have: 1) conversion from int to floating point, 2) the floating point multiplication, 3) conversion from floating point back to int. Although CPUs have gotten faster at converting to and from floating point, the conversion is not free. What's surprising is that these three operations are faster than a single integer division. This is because integer division is essentially floating point division under the hood. -Craig

Apr 30 2006

Craig Black wrote:

> This is because integer division is essentially floating point division under the hood.

Not exactly. What is true is that both kinds of division are based on the same 'shift and subtract' method (the same algorithm taught in 3rd grade for doing long division). If you're really interested in exactly how integer and floating point division work, get the Digital Mars CD, which has complete source code to an IEEE 754 floating point emulator, as well as code to do integer division for longs. www.digitalmars.com/shop.html

May 02 2006

Walter Bright wrote:

> Not exactly. What is true is that both kinds of division are based on the same 'shift and subtract' method (the same algorithm taught in 3rd grade for doing long division).

I ran these tests and I got basically the same results (the int division is slower). I am very intrigued and confused. Can you (or someone else) explain briefly why this is so? One would think it would be the other way around (float being slower), or at least the same speed. -- Bruno Medeiros - CS/E student http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

May 03 2006

Bruno Medeiros wrote:

> I ran these tests and I got basically the same results (the int division is slower). I am very intrigued and confused. Can you (or someone else) explain briefly why this is so? One would think it would be the other way around (float being slower) or at least the same speed.

I don't know. I'm surprised at such results. Perhaps it's because Intel has put a lot of effort into their FPU that they haven't put into integer math.

May 03 2006

Bruno Medeiros schrieb am 2006-05-03:

> I ran these tests and I got basically the same results (the int division is slower). I am very intrigued and confused. Can you (or someone else) explain briefly why this is so? One would think it would be the other way around (float being slower) or at least the same speed.

The code doesn't necessarily show that int division is slower than float multiplication. What CPU are we talking about? A naive interpretation of the "benchmark" assumes a single execution pipe that does floating point and integer operations in sequence ... Even assuming a single pipe: why is the SSE version faster? Does the benchmark measure the speed of int division against float multiplication? Does the benchmark measure the throughput of int division against float multiplication? Does the benchmark measure the throughput of int division of a set of numbers by a constant factor against float multiplication of the same set by (1 / constant factor)?

Thomas

May 03 2006

Thomas Kuehne wrote:

> The code doesn't necessarily show that int division is slower than float multiplication. What CPU are we talking about? Does the benchmark measure the speed of int division against float multiplication? Does the benchmark measure the throughput of int division against float multiplication? Does the benchmark measure the throughput of int division of a set of numbers by a constant factor against float multiplication of the same set by (1 / constant factor)?

Hum, yes, I should have been more specific. I only ran (a modified version of) the latest test, which measured the throughput of int division against double division (I hope...). Let me just put the code:

```c
#include <stdio.h>
#include <time.h>

//typedef double divtype;
typedef int divtype;

int main()
{
    clock_t start = clock();
    divtype result = 0;
    divtype div = 1;
    for (int max = 100000000; div < max; div++) {
        result = (42 / div);
    }
    clock_t finish = clock();
    double duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("[%f] %2.2f seconds\n", (double)result, duration);
}
```

I ran the tests with GCC, with both -O0 and -O2, on an Athlon XP, and in both cases the `typedef double divtype` version was about twice as fast. The assembly code I get for the division line is the following:

INT:

```
.stabn 68,0,17,LM6-_main
LM6:
    movl  $42, %edx
    movl  %edx, %eax
    sarl  $31, %edx
    idivl -12(%ebp)
    movl  %eax, -8(%ebp)
```

DOUBLE:

```
.stabn 68,0,17,LM6-_main
LM6:
    flds  LC0
    fdivs -12(%ebp)
    fstps -8(%ebp)
```

I have little idea what it is that it's doing. -- Bruno Medeiros - CS/E student http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

May 04 2006

Bruno Medeiros wrote:

> I ran these tests and I got basically the same results (the int division is slower). I am very intrigued and confused. Can you (or someone else) explain briefly why this is so? One would think it would be the other way around (float being slower) or at least the same speed.

It's true. For the Pentium, IDIV takes 46 cycles, while FDIV takes 39. For the PPro, PII and PIII, DIV takes 39, while FDIV takes 38. Not much difference. However, for the P4 (and the Athlon XP is probably similar), FDIV has a latency of 43 cycles, while DIV has a latency of 50, plus it executes 21 microcodes! It's not quite a factor of two, but it's close. In short -- Intel killed integer division on the P4.

May 04 2006

It depends how big the dividend and divisor are. Smaller values can take much less time than larger ones.

In article <e3cqjr$sek$1 digitaldaemon.com>, Don Clugston says...

> It's true. For the Pentium, IDIV takes 46 cycles, while FDIV takes 39. For the PPro, PII and PIII, DIV takes 39, while FDIV takes 38. Not much difference. However, for the P4 (and the Athlon XP is probably similar), FDIV has a latency of 43 cycles, while DIV has a latency of 50, plus it executes 21 microcodes! In short -- Intel killed integer division on the P4.

May 04 2006

It looks to me like your double division is actually encoded as float division, i.e. 32 bits. I am not too familiar with this syntax, but I believe double precision would look like this:

```
fldd  LC0
fdivd -12(%ebp)
fstpd -8(%ebp)
```

I am not surprised that the FPU version is faster than the int one. You will probably find that the Intel version is not the same as the AMD one either.

In article <e3co5o$jth$1 digitaldaemon.com>, Bruno Medeiros says...

> I ran the tests with GCC, with both -O0 and -O2, on an Athlon XP, and in both cases the typedef double divtype version was about twice as fast. The assembly code I get for the division line (DOUBLE) is: flds LC0 / fdivs -12(%ebp) / fstps -8(%ebp)

May 04 2006

pmoore wrote:

> It looks to me like your double division is actually encoded as float division, i.e. 32 bits. I am not too familiar with this syntax, but I believe double precision would look like this: fldd LC0 / fdivd -12(%ebp) / fstpd -8(%ebp)

That would be faster, actually (provided it's aligned sensibly). The precision only applies to the width of the load from memory. To reduce the precision of the division itself (which increases the speed), you have to set the machine status word.

> I am not surprised that the FPU version is faster than the int one. You will probably find that the Intel version is not the same as the AMD one either.

That's definitely true.

May 04 2006

Bruno Medeiros wrote:

> I ran these tests and I got basically the same results (the int division is slower). I am very intrigued and confused. Can you (or someone else) explain briefly why this is so? One would think it would be the other way around (float being slower) or at least the same speed.

If you take a look at the assembly for div(), idiv is being executed twice - once for the division and once for the modulo. If you replace the modulo with the explicit calculation, then the perf. will improve a lot, because imul & sub is faster than idiv. It'd be nice if dmd would do this optimization for us, because calculating the integer quotient and subsequent remainder for the same dividend and divisor would seem to be a pretty common pair of operations.

May 03 2006

Dave wrote:

> If you take a look at the assembly for div(), idiv is being executed twice - once for the division and once for the modulo. If you replace the modulo with the explicit calculation, then the perf. will improve a lot, because imul & sub is faster than idiv.

Obviously I was talking about the original pi.d implementation that's using the int modulo operator in the div() function:

```d
quotient = b / divisor;
//remainder = b % divisor;
remainder = b - quotient * divisor; // a lot faster
```

> It'd be nice if dmd would do this optimization for us, because calculating the integer quotient and subsequent remainder for the same dividend and divisor would seem to be a pretty common pair of operations.

Don't get me wrong - dmd does great for integer stuff IMHO, this is just one small issue I've noticed...

May 03 2006

small change:

```cpp
int divTest2(int divisor, int total)
{
    int sum = 0;
    for (int i = 0; i < total; i++)
    {
        int quotient = i * ( 1.0 / divisor ); // !!!!!!!!
        sum += quotient;
    }
}
```

Apr 30 2006

> int divTest2(int divisor, int total) { int sum = 0; for(int i = 0; i < total; i++) { int quotient = i * ( 1.0 / divisor ); // !!!!!!!! sum += quotient; } }

I don't see what you are trying to say here. -Craig

Apr 30 2006

> int divTest2(int divisor, int total) { int sum = 0; for(int i = 0; i < total; i++) { int quotient = i * ( 1.0 / divisor ); // !!!!!!!! sum += quotient; } }

you say that integer division is slower than the same division in floating point - or? but what is the problem with my benchmark then? --- sorry, in C ---

```c
#include <stdio.h>
#include <conio.h>
#include <time.h>

int main()
{
    int result = 0;
    int div = 10; //(or double) !!!
    clock_t start, finish;
    double duration;

    start = clock();
    for (int l = 0; l < 100000; l++) {
        for (int i = 0; i < 10000; i++) {
            result += (i * div) / div;
        }
    }
    finish = clock();

    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("[%i] %2.1f seconds\n", result, duration);
    getch();
}
```

on 3GHz: int: ~4.8 seconds, double: ~41.7 seconds - nearly ten times slower.

the problem with your benchmark is that your results are not really usable. try to write an "int doIntDivisionWithFloat(int Value, int Div)" and benchmark that - don't benchmark only half of such a function...

```c
int IntDivUsingInt(int divisor, int value)
{
    return value / divisor;
}

int IntDivUsingFloat(int divisor, int value)
{
    return value * (1.0 / divisor);
}

/* then, in the benchmark loop: */
for (int i = 0; i < longlongtime; i++) {
    /* IntDivUsingInt(10, i * 10); or IntDivUsingFloat(10, i * 10); */
}
```

is your IntDivUsingFloat still faster? ciao dennis

Apr 30 2006

and if you still don't understand your "optimizing", try to write a "real" benchmark where the value and the divisor change every step. your tests just show us - wow! multiplying is much faster than dividing - and that is something we all know very well (in float and int) :-) ciao dennis

Apr 30 2006

> your tests just show us - wow! multiplying is much faster than dividing - and that is something we all know very well (in float and int) :-)

No, actually it shows us that integer division can be slower than floating point division. This is because it uses floating point division under the hood and requires that the divisor be converted from int to float. -Craig

Apr 30 2006

Yours used a constant divisor, which allowed the compiler to optimize it somehow. That's one of the things that I'm trying to figure out: how does the compiler optimize integer division so well when the divisor is constant? Anyway, try this one.

```c
#include <stdio.h>
#include <conio.h>
#include <time.h>

//typedef int divtype;
typedef double divtype; // this one is faster

int Division(divtype div)
{
    int result = 0;
    for (int i = 0; i < 10000; i++) {
        result += i / div;
    }
    return result;
}

int main()
{
    int result = 0;
    clock_t start, finish;
    double duration;

    start = clock();
    for (divtype div = 1; div < 10000; div++) {
        result = Division(div);
    }
    finish = clock();

    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("[%i] %2.1f seconds\n", result, duration);
}
```

Apr 30 2006

If you want to get a direct comparison between floating point and integer division, then use:

```cpp
int divTest2(double divisor, int total)
{
    int sum = 0;
    for (int i = 0; i < total; i++)
    {
        int quotient = i / divisor; // !!!!!!!!
        sum += quotient;
    }
}
```

Apr 30 2006

In article <e32q1p$22dv$1 digitaldaemon.com>, Craig Black says...

> These two functions show the difference that I'm talking about. When benchmarking, pass in the same value for the divisor (pick a value between 2 and 30), and pass in a large enough value for total that it will take some time for the CPU to finish. Surprisingly, the second one is faster. -Craig

The second one is faster because you cheat. You aren't testing integer division but rather floating point multiplication. The first one has to divide every time through the loop; the second one only has to multiply. Multiplication is much faster than division (even floating point multiplication vs integer division), so your results are not surprising and are not really useful.

Apr 30 2006

> The second one is faster because you cheat.

Nope. Try this one. On my computer the floating point division is twice as fast. I believe this is due to the overhead of converting the divisor from int to double before performing the division.

```c
#include <stdio.h>
#include <conio.h>
#include <time.h>

//typedef int divtype;
typedef double divtype; // this one is faster

int main()
{
    int result = 0;
    clock_t start, finish;
    double duration;

    start = clock();
    for (divtype div = 1; div < 10000; div++) {
        for (int i = 0; i < 10000; i++) {
            result += i / div;
        }
    }
    finish = clock();

    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("[%i] %2.1f seconds\n", result, duration);
}
```

Apr 30 2006

Craig Black schrieb:

> Nope. Try this one. On my computer the floating point division is twice as fast.

sorry, but on my system int division is two times faster than floating point. and by the way - using "double" for the loop counter doesn't really help to focus the benchmarking (incrementing an int is much faster than incrementing a floating point value) ciao dennis

Apr 30 2006

This is probably because you have SSE optimizations disabled. This one works even without SSE. It shows that integer division is slower. Also, if you change the division to multiplication, you will notice that integer multiplication is faster, which is what you would expect.

```c
#include <stdio.h>
#include <conio.h>
#include <time.h>

//typedef int divtype;
typedef double divtype; // This one is faster

int main()
{
    divtype result = 0;
    clock_t start, finish;
    double duration;

    start = clock();
    divtype max = 100000000;
    for (divtype div = 1; div < max; div++) {
        divtype i = max - div;
        result += i / div;
    }
    finish = clock();

    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("[%f] %2.1f seconds\n", (double)result, duration);
}
```

Apr 30 2006

Craig Black schrieb:This is probably because you have SSE optimizations disabled. This one works even without SSE. This shows that integer division is slower. Also, if you change the division to multiplication, you will notice that integer multiplication is faster, which is what you would expect.

You're right, the double version is faster - but I don't think it can be used for integer division, because the to-float and back-to-int conversion costs too much... ciao dennis

Apr 30 2006

"dennis luehring" <dl.soluz gmx.net> wrote in message news:e34ak8$17sg$1 digitaldaemon.com...Craig Black schrieb:This is probably because you have SSE optimizations disabled. This one works even without SSE. This shows that integer division is slower. Also, if you change the division to multiplication, you will notice that integer multiplication is faster, which is what you would expect.

your right the double version is faster - i don't think that it can be used with integer division - because the to-float and back-to-int conversion costs too much... ciao dennis

Not if SSE is enabled. I have actually experienced significant performance increases by doing so. -Craig

May 01 2006

On my machine, integer is still faster. int: 3.7 double: 4.2

Craig Black wrote:
> This is probably because you have SSE optimizations disabled. This one works even without SSE. This shows that integer division is slower. [...]

May 03 2006

I think I remember you posting a while ago that you have quite an old machine :)

In article <e3bdub$1dsu$3 digitaldaemon.com>, Walter Bright says...
> On my machine, integer is still faster. int: 3.7 double: 4.2 [...]

May 04 2006

On my machine, the int version is 1.7 seconds, the double is 3.2. I am unable to reproduce your results.

Craig Black wrote:
>> The second one is faster because you cheat.
>
> Nope. Try this one. On my computer the floating point division is twice as fast. I believe this is due to the overhead of converting the divisor from int to double before performing the division. [...]

May 03 2006

Craig Black wrote:
> These two functions show the difference that I'm talking about. When benchmarking, pass in the same values for the divisor (pick a value between 2 and 30), and pass in a large enough value for total so that it will take some time for the CPU to finish. Surprisingly, the second one is faster.
>
> int divTest1(int divisor, int total)
> {
>     int sum = 0;
>     for (int i = 0; i < total; i++)
>     {
>         int quotient = i / divisor;
>         sum += quotient;
>     }
>     return sum;
> }
>
> int divTest2(int divisor, int total)
> {
>     int sum = 0;
>     double idiv = 1.0 / divisor;
>     for (int i = 0; i < total; i++)
>     {
>         int quotient = (int)(i * idiv);
>         sum += quotient;
>     }
>     return sum;
> }

Multiplication is a much simpler operation than division, which is why it's faster. Secondly, the reason compilers don't do the optimization in divTest2 is that it gives different results (different rounding). If you're really interested in optimization, I suggest using a disassembler on the object files, it will answer a lot of questions.

May 02 2006