www.digitalmars.com         C & C++   DMDScript  

D.gnu - gdc internals: weird way to translate "while" loops makes autovectorization impossible

reply downs <default_357-line yahoo.de> writes:
At the moment, gdc seems to be incapable of vectorizing simple loops.
Example.
The following C program will, if compiled with gcc -O3 -ftree-vectorize -msse,
generate SSE vectorized assembler code.
This can be verified by compiling with -ftree-vectorizer-verbose=7.
 #include "stdlib.h"
 
 int main() {
  float*ptr=malloc(sizeof(float)*8);
  float *tptr=ptr+8;
  while (ptr!=tptr) { *ptr=1.0; ptr++; } 
  return 0;
 }
 

I translated the program to D. Result:
 import std.c.stdlib : malloc;
 
 int main() {
  float*ptr=cast(float*)malloc(float.sizeof*8);
  float *tptr=ptr+8;
  while (ptr!=tptr) { *ptr=1.0; ptr++; } 
  return 0;
 }
 

Using the same flags, gdc doesn't vectorize the loop. I took the gdc vectorization process apart, added a few fprintf statements, and got the following log: test.d.t77.vect
 ;; Function main (_Dmain)
 
 
 test.d:6: note: ===== analyze_loop_nest =====
 test.d:6: note: === vect_analyze_loop_form ===
 test.d:6: note: not vectorized: too many BBs in loop.
 I added that below.

;; ;; Loop 1: ;; header 4, latch 3 ;; depth 1, level 1, outer 0 ;; nodes: 4 3 2 dropping block info ... Block 4 ;; basic block 4, loop depth 1, count 0 ;; prev block 3, next block 5 ;; pred: 3 [100.0%] (fallthru,exec) 1 [100.0%] (fallthru,exec) ;; succ: 2 [100.0%] (fallthru,exec) # ivtmp.27_1 = PHI <ivtmp.27_10(3), 8(1)>; # TMT.5_18 = PHI <TMT.5_13(3), TMT.5_12(1)>; # ptr_17 = PHI <ptr_9(3), ptr_3(1)>; <L2>:; # TMT.5_13 = V_MAY_DEF <TMT.5_18>; *ptr_17 = 1.0e+0; ptr_9 = ptr_17 + 4; goto <bb 2> (<L1>); Block 3 ;; basic block 3, loop depth 1, count 0 ;; prev block 2, next block 4 ;; pred: 2 [88.9%] (dfs_back,false,exec) ;; succ: 4 [100.0%] (fallthru,exec) <L11>:; Block 2 ;; basic block 2, loop depth 1, count 0 ;; prev block 1, next block 3 ;; pred: 4 [100.0%] (fallthru,exec) ;; succ: 5 [11.1%] (loop_exit,true,exec) 3 [88.9%] (dfs_back,false,exec) <L1>:; ivtmp.27_10 = ivtmp.27_1 - 1; if (ivtmp.27_10 == 0) goto <L3>; else goto <L11>; Done
 From here on out, it's normal gdc output.

test.d:6: note: bad loop form. test.d:6: note: vectorized 0 loops in function. main () { uintD.1160 ivtmp.27D.1244; floatD.1163 * ptrD.1186; floatD.1163 * tptrD.1189; intD.1177 D.1196; boolD.1173 D.1195; boolD.1173 D.1194; voidD.1154 * D.1191; # BLOCK 0 freq:1111, starting at line 4 # PRED: ENTRY [test.d : 4] D.1191_2 = [test.d : 4] malloc (32); [test.d : 4] ptrD.1186_3 = (floatD.1163 *) D.1191_2; [test.d : 5] tptrD.1189_4 = ptrD.1186_3 + 32; [test.d : 6] if (ptrD.1186_3 == tptrD.1189_4) [test.d : 6] goto <L3>; else goto <L9>; # SUCC: 5 1 # BLOCK 1 freq:988 # PRED: 0 <L9>:; goto <bb 4> (<L2>); # SUCC: 4 # BLOCK 2 freq:8889, starting at line 6 # PRED: 4 [test.d : 6] <L1>:; ivtmp.27D.1244_10 = ivtmp.27D.1244_1 - 1; [test.d : 6] if (ivtmp.27D.1244_10 == 0) [test.d : 6] goto <L3>; else goto <L11>; # SUCC: 5 3 # BLOCK 3 freq:7901 # PRED: 2 <L11>:; # SUCC: 4 # BLOCK 4 freq:8889, starting at line 6 # PRED: 3 1 # ivtmp.27D.1244_1 = PHI <ivtmp.27D.1244_10(3), 8(1)>; # ptrD.1186_17 = PHI <ptrD.1186_9(3), ptrD.1186_3(1)>; <L2>:; [test.d : 6] *ptrD.1186_17 = 1.0e+0; [test.d : 6] ptrD.1186_9 = ptrD.1186_17 + 4; [test.d : 6] goto <bb 2> (<L1>); # SUCC: 2 # BLOCK 5 freq:1111 # PRED: 2 0 <L3>:; return 0; # SUCC: EXIT }

Note the block 3, which only consists of a label and nothing more. Apparently, having three BBs instead of two blocks gdc from optimizing the loop. See also, tree-vect-analyze.c:1861. I don't know enough about gdc to see where that empty block is generated, but could we get rid of it, possibly, please? Greetings .. downs
Feb 06 2007
parent reply Downs <default_357-line yahoo.de> writes:
downs wrote:
 At the moment, gdc seems to be incapable of vectorizing simple loops.
 Example.
 The following C program will, if compiled with gcc -O3 -ftree-vectorize -msse,
generate SSE vectorized assembler code.
 This can be verified by compiling with -ftree-vectorizer-verbose=7.
 #include "stdlib.h"

 int main() {
  float*ptr=malloc(sizeof(float)*8);
  float *tptr=ptr+8;
  while (ptr!=tptr) { *ptr=1.0; ptr++; } 
  return 0;
 }

I translated the program to D. Result:
 import std.c.stdlib : malloc;

 int main() {
  float*ptr=cast(float*)malloc(float.sizeof*8);
  float *tptr=ptr+8;
  while (ptr!=tptr) { *ptr=1.0; ptr++; } 
  return 0;
 }

Using the same flags, gdc doesn't vectorize the loop. I took the gdc vectorization process apart, added a few fprintf statements, and got the following log: test.d.t77.vect
 ;; Function main (_Dmain)


 test.d:6: note: ===== analyze_loop_nest =====
 test.d:6: note: === vect_analyze_loop_form ===
 test.d:6: note: not vectorized: too many BBs in loop.
 I added that below.

;; ;; Loop 1: ;; header 4, latch 3 ;; depth 1, level 1, outer 0 ;; nodes: 4 3 2 dropping block info ... Block 4 ;; basic block 4, loop depth 1, count 0 ;; prev block 3, next block 5 ;; pred: 3 [100.0%] (fallthru,exec) 1 [100.0%] (fallthru,exec) ;; succ: 2 [100.0%] (fallthru,exec) # ivtmp.27_1 = PHI <ivtmp.27_10(3), 8(1)>; # TMT.5_18 = PHI <TMT.5_13(3), TMT.5_12(1)>; # ptr_17 = PHI <ptr_9(3), ptr_3(1)>; <L2>:; # TMT.5_13 = V_MAY_DEF <TMT.5_18>; *ptr_17 = 1.0e+0; ptr_9 = ptr_17 + 4; goto <bb 2> (<L1>); Block 3 ;; basic block 3, loop depth 1, count 0 ;; prev block 2, next block 4 ;; pred: 2 [88.9%] (dfs_back,false,exec) ;; succ: 4 [100.0%] (fallthru,exec) <L11>:; Block 2 ;; basic block 2, loop depth 1, count 0 ;; prev block 1, next block 3 ;; pred: 4 [100.0%] (fallthru,exec) ;; succ: 5 [11.1%] (loop_exit,true,exec) 3 [88.9%] (dfs_back,false,exec) <L1>:; ivtmp.27_10 = ivtmp.27_1 - 1; if (ivtmp.27_10 == 0) goto <L3>; else goto <L11>; Done
 From here on out, it's normal gdc output.

test.d:6: note: vectorized 0 loops in function. main () { uintD.1160 ivtmp.27D.1244; floatD.1163 * ptrD.1186; floatD.1163 * tptrD.1189; intD.1177 D.1196; boolD.1173 D.1195; boolD.1173 D.1194; voidD.1154 * D.1191; # BLOCK 0 freq:1111, starting at line 4 # PRED: ENTRY [test.d : 4] D.1191_2 = [test.d : 4] malloc (32); [test.d : 4] ptrD.1186_3 = (floatD.1163 *) D.1191_2; [test.d : 5] tptrD.1189_4 = ptrD.1186_3 + 32; [test.d : 6] if (ptrD.1186_3 == tptrD.1189_4) [test.d : 6] goto <L3>; else goto <L9>; # SUCC: 5 1 # BLOCK 1 freq:988 # PRED: 0 <L9>:; goto <bb 4> (<L2>); # SUCC: 4 # BLOCK 2 freq:8889, starting at line 6 # PRED: 4 [test.d : 6] <L1>:; ivtmp.27D.1244_10 = ivtmp.27D.1244_1 - 1; [test.d : 6] if (ivtmp.27D.1244_10 == 0) [test.d : 6] goto <L3>; else goto <L11>; # SUCC: 5 3 # BLOCK 3 freq:7901 # PRED: 2 <L11>:; # SUCC: 4 # BLOCK 4 freq:8889, starting at line 6 # PRED: 3 1 # ivtmp.27D.1244_1 = PHI <ivtmp.27D.1244_10(3), 8(1)>; # ptrD.1186_17 = PHI <ptrD.1186_9(3), ptrD.1186_3(1)>; <L2>:; [test.d : 6] *ptrD.1186_17 = 1.0e+0; [test.d : 6] ptrD.1186_9 = ptrD.1186_17 + 4; [test.d : 6] goto <bb 2> (<L1>); # SUCC: 2 # BLOCK 5 freq:1111 # PRED: 2 0 <L3>:; return 0; # SUCC: EXIT }

Note the block 3, which only consists of a label and nothing more. Apparently, having three BBs instead of two blocks gdc from optimizing the loop. See also, tree-vect-analyze.c:1861. I don't know enough about gdc to see where that empty block is generated, but could we get rid of it, possibly, please? Greetings .. downs

Any news on this? I believe the fact that a rather important optimization feature of GCC is completely broken on GDC merits at least a brief response. If you don't plan to fix this, then please, do tell me so. If I made some kind of obvious, plain mistake, then please do tell me so. But either way, I would really appreciate a response. Oh, btw: the following code serves to illustrate the problem further.
 import std.c.stdlib : malloc;

 int main() {
  float*ptr=cast(float*)malloc(float.sizeof*8); float*tptr=ptr+8;
  version(spaghetti) { loop: *ptr=1.0; ptr++; if (ptr!=tptr) goto loop; }
  else { do { *ptr=1.0; ptr++; } while (ptr!=tptr) }
  return 0;
 }

Anyway, greetings. --downs
Feb 24 2007
parent David Friedman <dvdfrdmn users.ess-eff.net> writes:
Downs wrote:
 downs wrote:
 
 At the moment, gdc seems to be incapable of vectorizing simple loops.
 Example.
 The following C program will, if compiled with gcc -O3 
 -ftree-vectorize -msse, generate SSE vectorized assembler code.
 This can be verified by compiling with -ftree-vectorizer-verbose=7.

 #include "stdlib.h"

 int main() {
  float*ptr=malloc(sizeof(float)*8);
  float *tptr=ptr+8;
  while (ptr!=tptr) { *ptr=1.0; ptr++; }  return 0;
 }

I translated the program to D. Result:
 import std.c.stdlib : malloc;

 int main() {
  float*ptr=cast(float*)malloc(float.sizeof*8);
  float *tptr=ptr+8;
  while (ptr!=tptr) { *ptr=1.0; ptr++; }  return 0;
 }

Using the same flags, gdc doesn't vectorize the loop. I took the gdc vectorization process apart, added a few fprintf statements, and got the following log: test.d.t77.vect
 ;; Function main (_Dmain)


 test.d:6: note: ===== analyze_loop_nest =====
 test.d:6: note: === vect_analyze_loop_form ===
 test.d:6: note: not vectorized: too many BBs in loop.

 I added that below.

dropping additional loop info ... ;; ;; Loop 1: ;; header 4, latch 3 ;; depth 1, level 1, outer 0 ;; nodes: 4 3 2 dropping block info ... Block 4 ;; basic block 4, loop depth 1, count 0 ;; prev block 3, next block 5 ;; pred: 3 [100.0%] (fallthru,exec) 1 [100.0%] (fallthru,exec) ;; succ: 2 [100.0%] (fallthru,exec) # ivtmp.27_1 = PHI <ivtmp.27_10(3), 8(1)>; # TMT.5_18 = PHI <TMT.5_13(3), TMT.5_12(1)>; # ptr_17 = PHI <ptr_9(3), ptr_3(1)>; <L2>:; # TMT.5_13 = V_MAY_DEF <TMT.5_18>; *ptr_17 = 1.0e+0; ptr_9 = ptr_17 + 4; goto <bb 2> (<L1>); Block 3 ;; basic block 3, loop depth 1, count 0 ;; prev block 2, next block 4 ;; pred: 2 [88.9%] (dfs_back,false,exec) ;; succ: 4 [100.0%] (fallthru,exec) <L11>:; Block 2 ;; basic block 2, loop depth 1, count 0 ;; prev block 1, next block 3 ;; pred: 4 [100.0%] (fallthru,exec) ;; succ: 5 [11.1%] (loop_exit,true,exec) 3 [88.9%] (dfs_back,false,exec) <L1>:; ivtmp.27_10 = ivtmp.27_1 - 1; if (ivtmp.27_10 == 0) goto <L3>; else goto <L11>; Done
 From here on out, it's normal gdc output.

test.d:6: note: bad loop form. test.d:6: note: vectorized 0 loops in function. main () { uintD.1160 ivtmp.27D.1244; floatD.1163 * ptrD.1186; floatD.1163 * tptrD.1189; intD.1177 D.1196; boolD.1173 D.1195; boolD.1173 D.1194; voidD.1154 * D.1191; # BLOCK 0 freq:1111, starting at line 4 # PRED: ENTRY [test.d : 4] D.1191_2 = [test.d : 4] malloc (32); [test.d : 4] ptrD.1186_3 = (floatD.1163 *) D.1191_2; [test.d : 5] tptrD.1189_4 = ptrD.1186_3 + 32; [test.d : 6] if (ptrD.1186_3 == tptrD.1189_4) [test.d : 6] goto <L3>; else goto <L9>; # SUCC: 5 1 # BLOCK 1 freq:988 # PRED: 0 <L9>:; goto <bb 4> (<L2>); # SUCC: 4 # BLOCK 2 freq:8889, starting at line 6 # PRED: 4 [test.d : 6] <L1>:; ivtmp.27D.1244_10 = ivtmp.27D.1244_1 - 1; [test.d : 6] if (ivtmp.27D.1244_10 == 0) [test.d : 6] goto <L3>; else goto <L11>; # SUCC: 5 3 # BLOCK 3 freq:7901 # PRED: 2 <L11>:; # SUCC: 4 # BLOCK 4 freq:8889, starting at line 6 # PRED: 3 1 # ivtmp.27D.1244_1 = PHI <ivtmp.27D.1244_10(3), 8(1)>; # ptrD.1186_17 = PHI <ptrD.1186_9(3), ptrD.1186_3(1)>; <L2>:; [test.d : 6] *ptrD.1186_17 = 1.0e+0; [test.d : 6] ptrD.1186_9 = ptrD.1186_17 + 4; [test.d : 6] goto <bb 2> (<L1>); # SUCC: 2 # BLOCK 5 freq:1111 # PRED: 2 0 <L3>:; return 0; # SUCC: EXIT }

Note the block 3, which only consists of a label and nothing more. Apparently, having three BBs instead of two blocks gdc from optimizing the loop. See also, tree-vect-analyze.c:1861. I don't know enough about gdc to see where that empty block is generated, but could we get rid of it, possibly, please? Greetings .. downs

Any news on this? I believe the fact that a rather important optimization feature of GCC is completely broken on GDC merits at least a brief response. If you don't plan to fix this, then please, do tell me so. If I made some kind of obvious, plain mistake, then please do tell me so. But either way, I would really appreciate a response. Oh, btw: the following code serves to illustrate the problem further. > import std.c.stdlib : malloc; > > int main() { > float*ptr=cast(float*)malloc(float.sizeof*8); float*tptr=ptr+8; > version(spaghetti) { loop: *ptr=1.0; ptr++; if (ptr!=tptr) goto loop; } > else { do { *ptr=1.0; ptr++; } while (ptr!=tptr) } > return 0; > } One of these versions will get autovectorized, the other not. Anyway, greetings. --downs

I will fix this, but I am not doing any work on it presently. My best guess so far is that the C compiler puts loop conditions at the end of the loop to reduce the number of basic blocks (but has an extra initial goto.) This is the form that the vectorizer expects. David
Feb 25 2007