
digitalmars.D - Why 16Mib static array size limit?

reply Ali Çehreli <acehreli yahoo.com> writes:
dmd does not allow anything larger. Shouldn't the limit depend on the 
.data segment size, which can be specified by the system? Is there a 
technical reason?

Observing that the limit is actually one less (note -1 below), it's 
probably due to an old limitation where indexes were limited to 24 bits?

static ubyte[16 * 1024 * 1024 - 1] arr;
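For reference, a minimal sketch of the workaround that comes up later in this thread: keep the declaration dynamic and allocate past the limit at runtime. The 64 MiB figure and the `arr`/`size` names are illustrative, not from the original post.

```d
// Sketch: allocating beyond dmd's 16 MiB static-array limit at runtime.
enum size_t size = 64 * 1024 * 1024; // 64 MiB, well past the limit

// static ubyte[size] arr;           // rejected: static array too large
ubyte[] arr;                         // dynamic array: no such limit

static this()
{
    arr = new ubyte[](size);         // allocated on the GC heap instead
}

void main()
{
    assert(arr.length == size);
}
```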

Ali
Aug 15 2016
next sibling parent reply NX <nightmarex1337 hotmail.com> writes:
https://issues.dlang.org/show_bug.cgi?id=14859

This limitation is put there because of optlink (which fails to 
link when you have enough static data), and is actually entirely 
meaningless when combined with -m32mscoff & -m64 switches (since 
other linkers handle huge static data just fine).
Aug 15 2016
parent Jacob Carlborg <doob me.com> writes:
On 2016-08-15 22:28, NX wrote:
 https://issues.dlang.org/show_bug.cgi?id=14859

 This limitation is put there because of optlink (which fails to link
 when you have enough static data), and is actually entirely meaningless
 when combined with -m32mscoff & -m64 switches (since other linkers
 handle huge static data just fine).
Or using any other platform than Windows :) -- /Jacob Carlborg
Aug 16 2016
prev sibling next sibling parent NX <nightmarex1337 hotmail.com> writes:
You can apply a patch if you're willing to compile dmd from 
source by doing the following:

Find the following code in file 'mtype.d' at line 4561:

                 bool overflow = false;
                 if (mulu(tbn.size(loc), d2, overflow) >= 0x1000000 || overflow) // put a 'reasonable' limit on it
                     goto Loverflow;

And change it to:

                 bool overflow = false;
                 mulu(tbn.size(loc), d2, overflow);
                 if (overflow)
                     goto Loverflow;

I would make a PR if I had the time (anyone?)...
Aug 15 2016
prev sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 08/15/2016 12:09 PM, Ali Çehreli wrote:
 dmd does not allow anything larger.
Could you please help me understand the following results, possibly by 
analyzing the produced assembly?

I wanted to see whether there were any performance penalties when one 
used D's recommendation of using dynamic arrays beyond 16MiB. Here is 
the test code:

enum size = 15 * 1024 * 1024;

version (STATIC) {
    ubyte[size] arr;
}
else {
    ubyte[] arr;

    static this() {
        arr = new ubyte[](size);
    }
}

void main() {
    auto p = arr.ptr;

    foreach (j; 0 .. 100) {
        foreach (i; 0 .. arr.length) {
            version (POINTER) {
                p[i] += cast(ubyte)i;
            }
            else {
                arr[i] += cast(ubyte)i;
            }
        }
    }
}

My CPU is an i7 with 4M cache:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 78
Model name:            Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
Stepping:              3
CPU MHz:               513.953
CPU max MHz:           3400.0000
CPU min MHz:           400.0000
BogoMIPS:              5615.89
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              4096K

I tried two compilers:

- DMD64 D Compiler v2.071.2-b2
- LDC - the LLVM D compiler (1.0.0): based on DMD v2.070.2 and LLVM 3.8.0

As seen in the code, I tried two version identifiers:

- STATIC: Use static array
- else: Use dynamic array
- POINTER: Access array elements through .ptr
- else: Access array elements through the [] operator

So, that gave me 8 combinations. Below, I list both the compilation 
command lines that I used and the wall-clock times that each program 
execution took (as reported by the 'time' utility).
1) dmd deneme.d -ofdeneme -O -boundscheck=off -inline -version=STATIC -version=POINTER
   4.332s

2) dmd deneme.d -ofdeneme -O -boundscheck=off -inline -version=STATIC
   4.238s

3) dmd deneme.d -ofdeneme -O -boundscheck=off -inline -version=POINTER
   4.321s

4) dmd deneme.d -ofdeneme -O -boundscheck=off -inline
   3.845s  <== BEST for dmd

5) ldc2 deneme.d -ofdeneme -O5 -release -boundscheck=off -d-version=POINTER -d-version=STATIC
   0.469s

6) ldc2 deneme.d -ofdeneme -O5 -release -boundscheck=off -d-version=STATIC
   0.472s

7) ldc2 deneme.d -ofdeneme -O5 -release -boundscheck=off -d-version=POINTER
   0.182s  <== BEST for ldc2

8) ldc2 deneme.d -ofdeneme -O5 -release -boundscheck=off
   0.792s

So, for dmd, going with the recommendation of using a dynamic array is 
faster. Interestingly, using .ptr is actually slower. How?

With ldc2, the best option is to go with a dynamic array ONLY IF you 
access the elements through the .ptr property. As seen in the last 
result, using the [] operator on the array is about 4 times slower than 
that.

Does that make sense to you? Why would that be?

Thank you,
Ali
Aug 15 2016
next sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:

for x86 dmd, the results are:
0m0.694s
0m1.176s
0m0.722s
0m1.454s


x86_64 codegen looks... not normal.
Aug 15 2016
prev sibling next sibling parent Yuxuan Shui <yshuiv7 gmail.com> writes:
On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:
 On 08/15/2016 12:09 PM, Ali Çehreli wrote:
 [...]
Could you please help me understand the following results, possibly by analyzing the produced assembly? [...]
There seem to be two things at work here:

1) When not accessed via the pointer, access to the array goes through 
the TCB every time (since it's a TLS variable).

2) LDC is able to vectorize the pointer version but not the array version.
 [...]
Aug 15 2016
prev sibling next sibling parent reply Johan Engelen <j j.nl> writes:
On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:
 
 With ldc2, the best option is to go with a dynamic array ONLY 
 IF you access the elements through the .ptr property. As seen 
 in the last result, using the [] operator on the array is about 
 4 times slower than that.
As Yuxuan Shui mentioned, the difference is in vectorization.

The non-POINTER version is not vectorized because the semantics of the 
code are not the same as the POINTER version's. Indexing `arr`, and 
writing to that address, could change `arr.ptr`, and so the loop would 
do something different when "caching" `arr.ptr` in `p` (POINTER 
version) versus the case without caching (non-POINTER version).

Evil code demonstrating the problem:
```
ubyte evil;
ubyte[] arr;

void doEvil() {
    // TODO: use this in the obfuscated-D contest
    arr = (&evil)[0..50];
}
```

The compiler somehow has to prove that `arr[i]` will never point to 
`arr.ptr` (it's called Alias Analysis in LLVM). Perhaps it is UB in D 
to have `arr[i]` ever point into `arr` itself, I don't know. If so, the 
code is vectorizable and we can try to make it so.

-Johan
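A sketch of the caching Johan describes: once `.ptr` and `.length` are copied into locals whose addresses are never taken, stores through the pointer cannot alias the slice header, so the loop becomes a vectorization candidate. The function name is made up for illustration.

```d
// Module-level (thread-local) array, as in Ali's test code.
ubyte[] arr;

void process()
{
    auto p = arr.ptr;     // cached: stores through p can't change p
    auto n = arr.length;  // cached: the loop bound is now loop-invariant

    foreach (i; 0 .. n)
        p[i] += cast(ubyte)i;
}
```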
Aug 16 2016
next sibling parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 08/16/2016 10:51 AM, Johan Engelen wrote:
 On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:
 With ldc2, the best option is to go with a dynamic array ONLY IF you
 access the elements through the .ptr property. As seen in the last
 result, using the [] operator on the array is about 4 times slower
 than that.
As Yuxuan Shui mentioned the difference is in vectorization. [...]

-Johan
Thank you all. That makes sense...

Agreeing that the POINTER version is applicable only in some cases, and 
looking only at the non-POINTER cases: for ldc2, a static array is 
faster, making the "arbitrary" 16MiB limit a performance issue.

For ldc2, static array is about 40% faster:

6) ldc2 deneme.d -ofdeneme -O5 -release -boundscheck=off -d-version=STATIC
   0.472s

8) ldc2 deneme.d -ofdeneme -O5 -release -boundscheck=off
   0.792s

It's the opposite for dmd:

2) dmd deneme.d -ofdeneme -O -boundscheck=off -inline -version=STATIC
   4.238s

4) dmd deneme.d -ofdeneme -O -boundscheck=off -inline
   3.845s

Ali
Aug 16 2016
parent reply Yuxuan Shui <yshuiv7 gmail.com> writes:
On Tuesday, 16 August 2016 at 18:46:06 UTC, Ali Çehreli wrote:
 On 08/16/2016 10:51 AM, Johan Engelen wrote:
 On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:
 With ldc2, the best option is to go with a dynamic array ONLY IF you
 access the elements through the .ptr property. As seen in the last
 result, using the [] operator on the array is about 4 times slower
 than that.
[...] -Johan
Thank you all. That makes sense... Agreeing that the POINTER version is applicable only in some cases, looking only at the non-POINTER cases, for ldc2, a static array is faster, making the "arbitrary" 16MiB limit a performance issue. For ldc2, static array is about 40% faster: [...] Ali
Actually, the STATIC version is always faster on my machine (Core 
i5-5200U), in both dmd and ldc2 cases.
Aug 16 2016
parent Yuxuan Shui <yshuiv7 gmail.com> writes:
On Tuesday, 16 August 2016 at 19:50:14 UTC, Yuxuan Shui wrote:
 On Tuesday, 16 August 2016 at 18:46:06 UTC, Ali Çehreli wrote:
 On 08/16/2016 10:51 AM, Johan Engelen wrote:
 On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli 
 wrote:
 With ldc2, the best option is to go with a dynamic array ONLY IF you
 access the elements through the .ptr property. As seen in the last
 result, using the [] operator on the array is about 4 times slower
 than that.
[...] -Johan
Thank you all. That makes sense... Agreeing that the POINTER version is applicable only in some cases, looking only at the non-POINTER cases, for ldc2, a static array is faster, making the "arbitrary" 16MiB limit a performance issue. For ldc2, static array is about 40% faster: [...] Ali
Actually, the STATIC version is always faster on my machine (Core i5 5200U), in both dmd and ldc2 cases
In dmd's case, the non-STATIC version seems to evaluate the loop 
condition (arr.length) **every time**.
Aug 16 2016
prev sibling parent reply Yuxuan Shui <yshuiv7 gmail.com> writes:
On Tuesday, 16 August 2016 at 17:51:13 UTC, Johan Engelen wrote:
 On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:
 
 With ldc2, the best option is to go with a dynamic array ONLY 
 IF you access the elements through the .ptr property. As seen 
 in the last result, using the [] operator on the array is 
 about 4 times slower than that.
As Yuxuan Shui mentioned the difference is in vectorization. [...]

-Johan
Wait, doesn't D have strict aliasing rules? ubyte* (&evil) should not be allowed to alias with ubyte** (&arr.ptr).
Aug 16 2016
next sibling parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Tuesday, 16 August 2016 at 20:11:12 UTC, Yuxuan Shui wrote:
 Wait, doesn't D have strict aliasing rules?
luckily, no. at least this is not enforced by dmd. and it is great.
Aug 16 2016
parent reply deadalnix <deadalnix gmail.com> writes:
On Tuesday, 16 August 2016 at 20:19:32 UTC, ketmar wrote:
 On Tuesday, 16 August 2016 at 20:11:12 UTC, Yuxuan Shui wrote:
 Wait, doesn't D have strict aliasing rules?
luckily, no. at least this is not enforced by dmd. and it is great.
[...] these days, so I don't think it's that good of a thing.

Almost every single case where Rust ends up being faster than C++ is 
because their type system allows more AA information to be available to 
the optimizer.

AA is also key to doing non-GC memory management at the language level.
Aug 17 2016
next sibling parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Wednesday, 17 August 2016 at 12:20:28 UTC, deadalnix wrote:
 On Tuesday, 16 August 2016 at 20:19:32 UTC, ketmar wrote:
 On Tuesday, 16 August 2016 at 20:11:12 UTC, Yuxuan Shui wrote:
 Wait, doesn't D have strict aliasing rules?
luckily, no. at least this is not enforced by dmd. and it is great.
these days, so I don't think it's that good of a thing. Almost every single one case where Rust end up being faster than C++ is because their type system allow for more AA information available for the optimizer. AA is also key to do non GC memory management at language level.
from my PoV, this kind of "optimizing" is overrated. i'm absolutely 
unable to understand why should i obey orders from machine instead of 
machine obeys my orders. if i want to go wild with pointers, don't tell 
me that i can't, just compile my code! C is literally ridden with this 
shit, and in the end it is a freakin' pain to write correct C code (if 
it is possible at all for something complex).
Aug 17 2016
parent reply deadalnix <deadalnix gmail.com> writes:
On Wednesday, 17 August 2016 at 12:32:20 UTC, ketmar wrote:
 from my PoV, this kind of "optimizing" is overrated. i'm 
 absolutely unable to understand why should i obey orders from 
 machine instead of machine obeys my orders. if i want to go 
 wild with pointers, don't tell me that i can't, just compile my 
 code! C is literally ridden with this shit, and in the end it 
 is a freakin' pain to write correct C code (if it is possible 
 at all for something comlex).
Because making 99.9% of the code slower because of a fringe use case 
isn't sound engineering. Especially since there are already ways to do 
this in a way that makes the AA happy, for instance using unions.
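A sketch of the union-based alternative alluded to here (names are illustrative): punning an int as bytes through a union declares the overlap in the type itself, so the compiler needs no aliasing guesswork.

```d
// The int and its bytes overlap by declaration, not by pointer casting.
union IntBytes
{
    int whole;
    ubyte[int.sizeof] bytes;
}

void main()
{
    IntBytes v;
    v.whole = 0x01020304;
    v.bytes[0] ^= 0xFF;   // declared overlap: `whole` is known to change
    assert(v.whole != 0x01020304);
}
```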
Aug 17 2016
parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Wednesday, 17 August 2016 at 13:27:14 UTC, deadalnix wrote:
 Especially since there are already ways to do this in a way 
 that makes the AA happy
exactly the thing i was writing about. "hey, you, meatbag! i, Teh Great Machine, said that you have to use unions, not pointers! what? making a pointer to union point into the middle of the buffer is exactly the same aliasing problem, so unions doesn't solve anything? I, Teh Great Machine, don't care. it is your problems, meatbag, i'm not here to serve you."
Aug 17 2016
parent Chris Wright <dhasenan gmail.com> writes:
On Wed, 17 Aug 2016 21:37:11 +0000, ketmar wrote:

 On Wednesday, 17 August 2016 at 13:27:14 UTC, deadalnix wrote:
 Especially since there are already ways to do this in a way that makes
 the AA happy
exactly the thing i was writing about. "hey, you, meatbag! i, Teh Great Machine, said that you have to use unions, not pointers! what? making a pointer to union point into the middle of the buffer is exactly the same aliasing problem, so unions doesn't solve anything? I, Teh Great Machine, don't care. it is your problems, meatbag, i'm not here to serve you."
It makes your intent more obvious. It's more obvious to other humans as 
well as the compiler. For me, it's a win with no downsides. For you: 
don't use @safe and you opt out of this class of optimizations and the 
related restrictions.
Aug 17 2016
prev sibling next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 17 August 2016 at 12:20:28 UTC, deadalnix wrote:

 these days, so I don't think it's that good of a thing.

 Almost every single one case where Rust end up being faster 
 than C++ is because their type system allow for more AA 
 information available for the optimizer.

 AA is also key to do non GC memory management at language level.
AA? Associative Array?
Aug 17 2016
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 17.08.2016 15:41, jmh530 wrote:
 On Wednesday, 17 August 2016 at 12:20:28 UTC, deadalnix wrote:

 so I don't think it's that good of a thing.

 Almost every single one case where Rust end up being faster than C++
 is because their type system allow for more AA information available
 for the optimizer.

 AA is also key to do non GC memory management at language level.
AA? Associative Array?
(Alias Analysis.)
Aug 17 2016
prev sibling parent deadalnix <deadalnix gmail.com> writes:
On Wednesday, 17 August 2016 at 13:41:09 UTC, jmh530 wrote:
 On Wednesday, 17 August 2016 at 12:20:28 UTC, deadalnix wrote:

 these days, so I don't think it's that good of a thing.

 Almost every single one case where Rust end up being faster 
 than C++ is because their type system allow for more AA 
 information available for the optimizer.

 AA is also key to do non GC memory management at language 
 level.
AA? Associative Array?
Alias Analysis. This is a common compiler acronym. Associative arrays 
are called maps by everybody outside this forum.
Aug 17 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/17/2016 5:20 AM, deadalnix wrote:

 don't think it's that good of a thing.

 Almost every single one case where Rust end up being faster than C++ is because
 their type system allow for more AA information available for the optimizer.

 AA is also key to do non GC memory management at language level.
At least for this case, as I mentioned in another post, if the pointer 
and length of the global are cached in locals, they can be kept in 
registers. The contents of locals don't have aliasing problems because, 
if their addresses are not taken, nobody can point to them. 
Optimization relies heavily on that.
Aug 17 2016
parent reply Yuxuan Shui <yshuiv7 gmail.com> writes:
On Wednesday, 17 August 2016 at 19:36:17 UTC, Walter Bright wrote:
 On 8/17/2016 5:20 AM, deadalnix wrote:

 these days, so I
 don't think it's that good of a thing.

 Almost every single one case where Rust end up being faster 
 than C++ is because
 their type system allow for more AA information available for 
 the optimizer.

 AA is also key to do non GC memory management at language 
 level.
At least for this case, as I mentioned in another post, if the pointer and length of the global is cached in a local, it can be cached in a register. The contents of locals don't have aliasing problems because if their addresses are not taken, nobody can point to them. Optimization relies heavily on that.
But doing so would be incorrect if D doesn't provide strong aliasing guarantees. And if D does provide these guarantees, we won't need to do this manually.
Aug 17 2016
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/17/2016 3:12 PM, Yuxuan Shui wrote:
 But doing so would be incorrect if D doesn't provide strong aliasing
guarantees.
 And if D does provide these guarantees, we won't need to do this manually.
It would be correct for that loop if the user does it.
Aug 17 2016
prev sibling parent reply Chris Wright <dhasenan gmail.com> writes:
On Wed, 17 Aug 2016 22:12:25 +0000, Yuxuan Shui wrote:
 But doing so would be incorrect if D doesn't provide strong aliasing
 guarantees. And if D does provide these guarantees, we won't need to do
 this manually.
The language can analyze all code that affects a local variable in many cases. You don't always need the language to guarantee it's impossible if the compiler can see that the user isn't doing anything funky.
Aug 17 2016
parent reply Yuxuan Shui <yshuiv7 gmail.com> writes:
On Thursday, 18 August 2016 at 00:20:32 UTC, Chris Wright wrote:
 On Wed, 17 Aug 2016 22:12:25 +0000, Yuxuan Shui wrote:
 But doing so would be incorrect if D doesn't provide strong 
 aliasing guarantees. And if D does provide these guarantees, 
 we won't need to do this manually.
The language can analyze all code that affects a local variable in many cases. You don't always need the language to guarantee it's impossible if the compiler can see that the user isn't doing anything funky.
That's right. But for Ali's code, the compiler is clearly not smart enough.
Aug 17 2016
next sibling parent reply Chris Wright <dhasenan gmail.com> writes:
On Thu, 18 Aug 2016 01:18:03 +0000, Yuxuan Shui wrote:

 On Thursday, 18 August 2016 at 00:20:32 UTC, Chris Wright wrote:
 On Wed, 17 Aug 2016 22:12:25 +0000, Yuxuan Shui wrote:
 But doing so would be incorrect if D doesn't provide strong aliasing
 guarantees. And if D does provide these guarantees,
 we won't need to do this manually.
The language can analyze all code that affects a local variable in many cases. You don't always need the language to guarantee it's impossible if the compiler can see that the user isn't doing anything funky.
That's right. But for Ali's code, the compiler is clearly not smart enough.
Most of the time, the compiler can successfully make those optimizations for local variables. The compiler can seldom make them for global variables. You get into whole program analysis territory pretty fast. So it's not surprising that the compiler doesn't handle global variables as well as it does local ones.
Aug 17 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/17/2016 9:16 PM, Chris Wright wrote:
 So it's not surprising that the compiler doesn't handle global variables
 as well as it does local ones.
Global variables are pretty much spawn of the devil. You're right that 
dmd does not make a special effort optimizing them. The way to deal 
with it:

1. minimize use of globals

2. if using them in a hot loop, copy a reference to the global into a 
local variable, and use the local variable in the loop instead

This also gets around the problem that thread local storage address 
calculation is slow and inefficient on all platforms.
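A minimal sketch of step 2, assuming a module-level array (which is thread-local by default in D); the names are illustrative.

```d
// Module-level variables in D default to thread-local storage.
ubyte[] global;

void hotLoop()
{
    auto local = global; // copy the slice header once: one TLS address
                         // computation, then only locals in the loop

    foreach (i; 0 .. local.length)
        local[i] += cast(ubyte)i;
}
```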
Aug 17 2016
prev sibling parent reply Johan Engelen <j j.nl> writes:
On Thursday, 18 August 2016 at 01:18:03 UTC, Yuxuan Shui wrote:
 On Thursday, 18 August 2016 at 00:20:32 UTC, Chris Wright wrote:
 The language can analyze all code that affects a local 
 variable in many cases. You don't always need the language to 
 guarantee it's impossible if the compiler can see that the 
 user isn't doing anything funky.
That's right. But for Ali's code, the compiler is clearly not smart enough.
Perhaps not smart enough, but it is very close to being smart enough. 
Perhaps we, the LDC devs, weren't smart enough in using the LLVM 
backend. ;-)

For Ali's code, `arr` is a global symbol that can be modified by code 
not seen by the compiler. So no assumptions can be made about the 
contents of arr.ptr and arr.length, and thus `arr[i]` could index into 
`arr` itself. Now, if we tell [*] the compiler that `arr` is not 
touched by code outside the module... the resulting machine code is 
identical with and without POINTER!

It's an interesting case to look further into. For example with 
Link-Time Optimization, delaying codegen until link time, we can 
internalize many symbols (i.e. telling the compiler no other module is 
going to do anything with them), allowing the compiler to reason better 
about the code and generate faster executables. Ideally, without much 
user effort. It's something I am already looking into. I don't know if 
we can mark `private` global variables as internal. If so, wow :-)

-Johan

[*] Unfortunately `private arr` does not have the desired effect (it 
does not make it an `internal` LLVM variable), so you have to hack into 
the LLVM IR file to make it happen.
Aug 18 2016
next sibling parent Johan Engelen <j j.nl> writes:
On Thursday, 18 August 2016 at 12:20:50 UTC, Johan Engelen wrote:
 I don't know if we can mark `private` global variables as 
 internal. If so, wow :-)
Nevermind, not possible. Templates, cross-module inlining, and probably other reasons.
Aug 18 2016
prev sibling parent =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 08/18/2016 05:20 AM, Johan Engelen wrote:

 `arr` is a global symbol that can be modified by code
 not seen by the compiler. So no assumptions can be made about the
 contents of arr.ptr and arr.length
Yet there is the following text in the spec:

* https://dlang.org/spec/statement.html#ForeachStatement

"The aggregate must be loop invariant, meaning that elements to the 
aggregate cannot be added or removed from it in the 
NoScopeNonEmptyStatement."

I think the spirit of the spec includes other code outside of the 
NoScopeNonEmptyStatement as well. If so, if elements are added to the 
array from other code (e.g. other threads), it should be a case of "all 
bets are off". So, I think both .ptr and .length could be cached.

Ali
Aug 18 2016
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 8/16/16 4:11 PM, Yuxuan Shui wrote:
 On Tuesday, 16 August 2016 at 17:51:13 UTC, Johan Engelen wrote:
 On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:
 With ldc2, the best option is to go with a dynamic array ONLY IF you
 access the elements through the .ptr property. As seen in the last
 result, using the [] operator on the array is about 4 times slower
 than that.
As Yuxuan Shui mentioned the difference is in vectorization. [...]

-Johan
Wait, doesn't D have strict aliasing rules? ubyte* (&evil) should not be allowed to alias with ubyte** (&arr.ptr).
Even if it did, I believe the wildcard is ubyte*. Just like char* can 
point at anything in C, ubyte* is D's equivalent. -Steve
Aug 16 2016
parent reply Charles Hixson via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 08/16/2016 01:49 PM, Steven Schveighoffer via Digitalmars-d wrote:
 On 8/16/16 4:11 PM, Yuxuan Shui wrote:
 On Tuesday, 16 August 2016 at 17:51:13 UTC, Johan Engelen wrote:
 On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:
 With ldc2, the best option is to go with a dynamic array ONLY IF you
 access the elements through the .ptr property. As seen in the last
 result, using the [] operator on the array is about 4 times slower
 than that.
As Yuxuan Shui mentioned the difference is in vectorization. [...]

-Johan
Wait, doesn't D have strict aliasing rules? ubyte* (&evil) should not be allowed to alias with ubyte** (&arr.ptr).
Even if it did, I believe the wildcard is ubyte*. Just like in C, char* can point at anything, ubyte is D's equivalent. -Steve
I think what you say is true (look at the code of std.outbuffer), but IIRC the documentation says that's supposed to be the job of void*.
Aug 16 2016
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 8/16/16 7:23 PM, Charles Hixson via Digitalmars-d wrote:
 On 08/16/2016 01:49 PM, Steven Schveighoffer via Digitalmars-d wrote:
 On 8/16/16 4:11 PM, Yuxuan Shui wrote:
 Wait, doesn't D have strict aliasing rules? ubyte* (&evil) should not be
 allowed to alias with ubyte** (&arr.ptr).
Even if it did, I believe the wildcard is ubyte*. Just like in C, char* can point at anything, ubyte is D's equivalent.
I think what you say is true (look at the code of std.outbuffer), but IIRC the documentation says that's supposed to be the job of void*.
void * is almost useless. In D you can assign a void[] from another 
void[], but other than that, there's no way to write the memory or read 
it.

In C, void * is also allowed to alias any other pointer. But char * is 
also allowed to provide arbitrary byte reading/writing.

I'd expect that D also would provide a similar option.

-Steve
Aug 17 2016
parent reply deadalnix <deadalnix gmail.com> writes:
On Wednesday, 17 August 2016 at 14:21:32 UTC, Steven 
Schveighoffer wrote:
 void * is almost useless. In D you can assign a void[] from 
 another void[], but other than that, there's no way to write 
 the memory or read it.

 In C, void * is also allowed to alias any other pointer. But 
 char * is also allowed to provide arbitrary byte 
 reading/writing.

 I'd expect that D also would provide a similar option.

 -Steve
Yes, but everything can alias with void*/void[]. Thus, you can cast 
from void* to T* "safely".
Aug 17 2016
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 8/17/16 10:38 AM, deadalnix wrote:
 On Wednesday, 17 August 2016 at 14:21:32 UTC, Steven Schveighoffer wrote:
 void * is almost useless. In D you can assign a void[] from another
 void[], but other than that, there's no way to write the memory or
 read it.

 In C, void * is also allowed to alias any other pointer. But char * is
 also allowed to provide arbitrary byte reading/writing.

 I'd expect that D also would provide a similar option.
Yes, but everything can alias with void*/void[] . Thus, you can cast from void* to T* "safely".
Sure, but how do you implement, let's say, byte swapping on an integer?

auto x = (cast(ubyte*)&myInt)[0 .. myInt.sizeof];
foreach (i; 0 .. x.length / 2)
    swap(x[i], x[$ - i - 1]);

So if the compiler can assume that x can't point at myInt, and thus 
myInt can't have changed, then we have a problem. You just can't do 
this with void (or at least not very easily).

-Steve
Aug 17 2016
parent Johan Engelen <j j.nl> writes:
How about the specific case of array indexing?
Is it UB to have `arr[i]` ever point into `arr` itself, or can we 
make it UB?

-Johan
Aug 17 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2016 6:28 PM, Ali Çehreli wrote:
 void main() {
     auto p = arr.ptr;

     foreach (j; 0 .. 100) {
         foreach (i; 0..arr.length) {
             version (POINTER) {
                 p[i] += cast(ubyte)i;
             }
             else {
                 arr[i] += cast(ubyte)i;
             }
         }
     }
 }
When accessing global arrays like this, cache the address of the data 
in a local variable. This will enable the compiler to enregister it. 
Putting global data in registers is problematic because any assignment 
through a pointer could change it, so the compiler takes a pessimistic 
view of it.

Your POINTER version does cache the pointer, but arr.length needs to be 
cached as well.
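Following this suggestion, Ali's loop with both the pointer and the length cached in locals would look roughly like this (a sketch, not benchmarked here; `arr` is the module-level array from the test code):

```d
ubyte[] arr;

void main()
{
    auto p = arr.ptr;          // cached pointer, as in the POINTER version
    immutable n = arr.length;  // cached length, so the bound is a local too

    foreach (j; 0 .. 100)
        foreach (i; 0 .. n)
            p[i] += cast(ubyte)i;
}
```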
Aug 16 2016