digitalmars.D - Bits rotations

bearophile (89/89) Oct 17 2012 In cryptographic code, and generally in bit-twiddling code,

Adam D. Ruppe (28/31) Oct 17 2012 Why would you write that instead of using rol(x, 11) today?

H. S. Teoh (6/19) Oct 17 2012 +1, add this to core.bitop.

Iain Buclaw (8/95) Oct 18 2012 In the gdc-4.6 package you have there, it's only naked asm that can't

bearophile (11/15) Oct 18 2012 What's DIASM? Is it the D syntax for asm code? If this is right,

Iain Buclaw (28/41) Oct 18 2012 This topic has been discussed in the past. And the current status is

Don Clugston (9/56) Oct 18 2012 FYI: That code doesn't work in DMD either.

Iain Buclaw (7/78) Oct 18 2012 Normal assembler... naked assembler has its own set of problems

"bearophile" <bearophileHUGS lycos.com> writes:

In cryptographic code, and generally in bit-twiddling code, 
rotation of the bits in a word is a common operation. It's so 
common that Intel has made it an asm instruction.

A demo program:


uint foo(in uint x) pure nothrow {
     return (x << 11) | (x >> (32 - 11));
}

uint bar(uint x, in ubyte n) pure nothrow {
     asm {
         mov EAX, x;
         mov CL, n;
         rol EAX, CL;
         mov x, EAX;
     }
     return x;
}

void main() {
     import std.stdio;
     uint x = 4290772992U;
     writefln("%032b", x);
     uint y = foo(x);
     writefln("%032b", y);
     uint z = bar(x, 11);
     writefln("%032b", z);
}


Its output, dmd -O -release -inline:

11111111110000000000000000000000
00000000000000000000011111111110
00000000000000000000011111111110


Even with full optimizations DMD seems not able to detect the 
rotation in foo(), and writing assembly as in bar() is not an 
option, because the code is even longer and bar() can't be 
inlined. This is the ASM generated (32 bit):


_D4test3fooFNaNbxkZk:
         push    EAX
         mov ECX,[ESP]
         shl EAX,0Bh
         shr ECX,015h
         or  EAX,ECX
         pop ECX
         ret

_D4test3barFNaNbkxhZk:
         push    EBP
         mov EBP,ESP
         push    EAX
         mov EAX,8[EBP]
         mov CL,-4[EBP]
         rol EAX,CL
         mov 8[EBP],EAX
         mov EAX,8[EBP]
         mov ESP,EBP
         pop EBP
         ret 4


GDC 4.6.3 does better, recognizing the rol (foo() can be inlined, 
usually becoming 1 instruction in the middle of other code) (I 
like the demangling here!):

pure nothrow uint example.foo(const(uint)):
     movl    %edi, %eax
     roll    $11, %eax
     ret

pure nothrow uint example.bar(uint, const(ubyte)):
     movl  %edi, -4(%rsp)
     movb  %sil, -5(%rsp)
     movl -4(%rsp), %eax
     movb -5(%rsp), %cl
     roll %cl, %eax
     movl %eax, -4(%rsp)
     movl -4(%rsp), %eax
     ret


So I'd like to write:

uint spam(in uint x) pure nothrow  safe {
     import core.bitop: rol;
     return rol(x, 11);
}


This is better because:
- It's standard. If a CPU supports rol (or ror) the compiler uses 
it. If it doesn't support it, the compiler uses shifts and an or.
- It's shorter and more readable. For me "rol(x, 11)" is rather 
more easy to read and debug than code like "(x << 11) | (x >> (32 
- 11))".
- spam() is inlinable, just like foo() and unlike bar().
- It doesn't rely on compiler optimizations, that sometimes are 
not present. If the CPU supports the rol, the compiler doesn't 
need pattern matching, it just spits out a rol. DMD currently has 
such optimization, but apparently here it's not working, I don't 
know why. For such basic operation, that is built in many CPUs 
there is no need for compiler optimizations.

Bye,
bearophile

Oct 17 2012

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Thursday, 18 October 2012 at 02:36:59 UTC, bearophile wrote:
 uint foo(in uint x) pure nothrow {
     return (x << 11) | (x >> (32 - 11));
 }

Why would you write that instead of using rol(x, 11) today?

uint rol(in uint x, in uint y) pure nothrow {
     return (x << y) | (x >> (32 - y));
}

uint foo(in uint x) pure nothrow {
         return rol(x, 11);
}

the rest is the same. Compile it and see a rol instruction, 
inlined, in the main function.

Though there is a bit of extra spam around it, some movs that 
don't seem necessary to me.

Here's the top function so you can see the movs:

00000000 <_D5test43rolFNaNbxkxkZk>:
    0:   55                      push   ebp
    1:   8b ec                   mov    ebp,esp
    3:   50                      push   eax
    4:   8b 45 08                mov    eax,DWORD PTR [ebp+0x8]
    7:   8b 4d fc                mov    ecx,DWORD PTR [ebp-0x4]
    a:   8b e5                   mov    esp,ebp
    c:   d3 c0                   rol    eax,cl
    e:   5d                      pop    ebp
    f:   c2 04 00                ret    0x4
   12:   90                      nop
   13:   90                      nop



Perhaps we should add that rol function to the stdlib to save 
people from quickly doing it themselves, but there's no need to 
do anything beyond this simple function.

Oct 17 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Oct 18, 2012 at 04:55:38AM +0200, Adam D. Ruppe wrote:
 On Thursday, 18 October 2012 at 02:36:59 UTC, bearophile wrote:
uint foo(in uint x) pure nothrow {
    return (x << 11) | (x >> (32 - 11));
}

 
 Why would you write that instead of using rol(x, 11) today?
 
 uint rol(in uint x, in uint y) pure nothrow {
     return (x << y) | (x >> (32 - y));
 }

[...]
 Perhaps we should add that rol function to the stdlib to save people
 from quickly doing it themselves, but there's no need to do anything
 beyond this simple function.

+1, add this to core.bitop.


T

-- 
If you compete with slaves, you become a slave. -- Norbert Wiener

Oct 17 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 18 October 2012 03:36, bearophile <bearophileHUGS lycos.com> wrote:
 In cryptographic code, and generally in bit-twiddling code, rotation of the
 bits in a word is a common operation. It's so common that Intel has made it
 an asm instruction.

 A demo program:


 uint foo(in uint x) pure nothrow {
     return (x << 11) | (x >> (32 - 11));
 }

 uint bar(uint x, in ubyte n) pure nothrow {
     asm {
         mov EAX, x;
         mov CL, n;
         rol EAX, CL;
         mov x, EAX;
     }
     return x;
 }

 void main() {
     import std.stdio;
     uint x = 4290772992U;
     writefln("%032b", x);
     uint y = foo(x);
     writefln("%032b", y);
     uint z = bar(x, 11);
     writefln("%032b", z);
 }


 Its output, dmd -O -release -inline:

 11111111110000000000000000000000
 00000000000000000000011111111110
 00000000000000000000011111111110


 Even with full optimizations DMD seems not able to detect the rotation in
 foo(), and writing assembly as in bar() is not an option, because the code
 is even longer and bar() can't be inlined. This is the ASM generated (32
 bit):


 _D4test3fooFNaNbxkZk:
         push    EAX
         mov ECX,[ESP]
         shl EAX,0Bh
         shr ECX,015h
         or  EAX,ECX
         pop ECX
         ret

 _D4test3barFNaNbkxhZk:
         push    EBP
         mov EBP,ESP
         push    EAX
         mov EAX,8[EBP]
         mov CL,-4[EBP]
         rol EAX,CL
         mov 8[EBP],EAX
         mov EAX,8[EBP]
         mov ESP,EBP
         pop EBP
         ret 4


 GDC 4.6.3 does better, recognizing the rol (foo() can be inlined, usually
 becoming 1 instruction in the middle of other code) (I like the demangling
 here!):

 pure nothrow uint example.foo(const(uint)):
     movl    %edi, %eax
     roll    $11, %eax
     ret

 pure nothrow uint example.bar(uint, const(ubyte)):
     movl  %edi, -4(%rsp)
     movb  %sil, -5(%rsp)
     movl -4(%rsp), %eax
     movb -5(%rsp), %cl
     roll %cl, %eax
     movl %eax, -4(%rsp)
     movl -4(%rsp), %eax
     ret


 So I'd like to write:

 uint spam(in uint x) pure nothrow  safe {
     import core.bitop: rol;
     return rol(x, 11);
 }


 This is better because:
 - It's standard. If a CPU supports rol (or ror) the compiler uses it. If it
 doesn't support it, the compiler uses shifts and an or.
 - It's shorter and more readable. For me "rol(x, 11)" is rather more easy to
 read and debug than code like "(x << 11) | (x >> (32 - 11))".
 - spam() is inlinable, just like foo() and unlike bar().
 - It doesn't rely on compiler optimizations, that sometimes are not present.
 If the CPU supports the rol, the compiler doesn't need pattern matching, it
 just spits out a rol. DMD currently has such optimization, but apparently
 here it's not working, I don't know why. For such basic operation, that is
 built in many CPUs there is no need for compiler optimizations.

 Bye,
 bearophile


In the gdc-4.6 package you have there, it's only naked asm that can't
be inlined.   However it is worth noting that DIASM is no longer in
mainline gdc.

Thanks,
-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Oct 18 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Iain Buclaw:

 In the gdc-4.6 package you have there, it's only naked asm that 
 can't be inlined.

Good.


 However it is worth noting that DIASM is no longer in mainline 
 gdc.

What's DIASM? Is it the D syntax for asm code? If this is right, 
then gdc developers have done a mistake, reducing D code 
interoperability, creating an incompatibility where there wasn't 
(and reducing my desire to use gdc or to switch to it, because I 
have hundreds of lines of inlined asm in my D code), this means 
doing the opposite of what generally compiler writers are 
supposed to do (maybe this topic was discussed already, in past).

Bye,
bearophile

Oct 18 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 18 October 2012 09:27, bearophile <bearophileHUGS lycos.com> wrote:
 Iain Buclaw:


 In the gdc-4.6 package you have there, it's only naked asm that can't be
 inlined.


 Good.



 However it is worth noting that DIASM is no longer in mainline gdc.


 What's DIASM? Is it the D syntax for asm code? If this is right, then gdc
 developers have done a mistake, reducing D code interoperability, creating
 an incompatibility where there wasn't (and reducing my desire to use gdc or
 to switch to it, because I have hundreds of lines of inlined asm in my D
 code), this means doing the opposite of what generally compiler writers are
 supposed to do (maybe this topic was discussed already, in past).

 Bye,
 bearophile


This topic has been discussed in the past.  And the current status is
that GCC mainline has poisoned the frontend to use certain headers
that the IASM implementation in GDC depended on.

Example:

int zz(int p1)
{
  asm {
    naked;
    mov EAX, p1[EBP];
  }
}


To calculate p1[EBP], one would have to know where p1 will land on the
frame pointer to replace it with the relavant offset value.  This
would mean from the front-end we would have to invoke the back-end to
generate and tell us the stack frame layout of zz, which is not
possible because:

a) Invoking this before the optimisation passes may produce a
different result to what that actual result is after the optimisation
passes.
b) All functions are sitting in poisoned (for the front-end) headers.

There is an opportunity to defer parsing IASM until the GIMPLE
(middle-end) stage, however am still unable to retrieve the required
information to produce the correct codegen.


Regards
-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Oct 18 2012

Don Clugston <dac nospam.com> writes:

On 18/10/12 11:39, Iain Buclaw wrote:
 On 18 October 2012 09:27, bearophile <bearophileHUGS lycos.com> wrote:
 Iain Buclaw:


 In the gdc-4.6 package you have there, it's only naked asm that can't be
 inlined.


 Good.



 However it is worth noting that DIASM is no longer in mainline gdc.


 What's DIASM? Is it the D syntax for asm code? If this is right, then gdc
 developers have done a mistake, reducing D code interoperability, creating
 an incompatibility where there wasn't (and reducing my desire to use gdc or
 to switch to it, because I have hundreds of lines of inlined asm in my D
 code), this means doing the opposite of what generally compiler writers are
 supposed to do (maybe this topic was discussed already, in past).

 Bye,
 bearophile


 This topic has been discussed in the past.  And the current status is
 that GCC mainline has poisoned the frontend to use certain headers
 that the IASM implementation in GDC depended on.

 Example:

 int zz(int p1)
 {
    asm {
      naked;
      mov EAX, p1[EBP];
    }
 }


 To calculate p1[EBP], one would have to know where p1 will land on the
 frame pointer to replace it with the relavant offset value.  This
 would mean from the front-end we would have to invoke the back-end to
 generate and tell us the stack frame layout of zz, which is not
 possible because:

FYI: That code doesn't work in DMD either.
DMD assumes a frame pointer is created in naked ASM, which is totally 
wrong. Code like that should not compile. The compiler does not know 
what the correct offsets are and should not attempt to try.

 a) Invoking this before the optimisation passes may produce a
 different result to what that actual result is after the optimisation
 passes.
 b) All functions are sitting in poisoned (for the front-end) headers.

 There is an opportunity to defer parsing IASM until the GIMPLE
 (middle-end) stage, however am still unable to retrieve the required
 information to produce the correct codegen.

Are you just talking about naked asm? Conceptually naked asm should act 
as if it was created in an assembler in a seperate obj file, and 
accessed via extern(C).
If you have problems with non-naked asm, that would make more sense to me.

Oct 18 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 18 October 2012 15:22, Don Clugston <dac nospam.com> wrote:
 On 18/10/12 11:39, Iain Buclaw wrote:
 On 18 October 2012 09:27, bearophile <bearophileHUGS lycos.com> wrote:
 Iain Buclaw:


 In the gdc-4.6 package you have there, it's only naked asm that can't be
 inlined.



 Good.



 However it is worth noting that DIASM is no longer in mainline gdc.



 What's DIASM? Is it the D syntax for asm code? If this is right, then gdc
 developers have done a mistake, reducing D code interoperability,
 creating
 an incompatibility where there wasn't (and reducing my desire to use gdc
 or
 to switch to it, because I have hundreds of lines of inlined asm in my D
 code), this means doing the opposite of what generally compiler writers
 are
 supposed to do (maybe this topic was discussed already, in past).

 Bye,
 bearophile



 This topic has been discussed in the past.  And the current status is
 that GCC mainline has poisoned the frontend to use certain headers
 that the IASM implementation in GDC depended on.

 Example:

 int zz(int p1)
 {
    asm {
      naked;
      mov EAX, p1[EBP];
    }
 }


 To calculate p1[EBP], one would have to know where p1 will land on the
 frame pointer to replace it with the relavant offset value.  This
 would mean from the front-end we would have to invoke the back-end to
 generate and tell us the stack frame layout of zz, which is not
 possible because:


 FYI: That code doesn't work in DMD either.
 DMD assumes a frame pointer is created in naked ASM, which is totally wrong.
 Code like that should not compile. The compiler does not know what the
 correct offsets are and should not attempt to try.


 a) Invoking this before the optimisation passes may produce a
 different result to what that actual result is after the optimisation
 passes.
 b) All functions are sitting in poisoned (for the front-end) headers.

 There is an opportunity to defer parsing IASM until the GIMPLE
 (middle-end) stage, however am still unable to retrieve the required
 information to produce the correct codegen.


 Are you just talking about naked asm? Conceptually naked asm should act as
 if it was created in an assembler in a seperate obj file, and accessed via
 extern(C).
 If you have problems with non-naked asm, that would make more sense to me.

Normal assembler... naked assembler has its own set of problems
(requires patching in a "naked" style attribute which the x86 GCC
maintainers rejected outrightly).


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Oct 18 2012

D Programming

C/C++ Programming

Other

digitalmars.D - Bits rotations