
digitalmars.D - Why 16Mib static array size limit?

reply Ali Çehreli <acehreli yahoo.com> writes:
dmd does not allow anything larger. Shouldn't the limit depend on the 
.data segment size, which can be specified by the system? Is there a 
technical reason?

Observing that the limit is actually one less (note -1 below), it's 
probably due to an old limitation where indexes were limited to 24 bits?

static ubyte[16 * 1024 * 1024 - 1] arr;
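For reference, a minimal sketch of the workaround that comes up later in this thread: keep the declaration dynamic and allocate past the limit at runtime. The 64 MiB figure and the `arr`/`size` names are illustrative, not from the original post.

```d
// Sketch: allocating beyond dmd's 16 MiB static-array limit at runtime.
enum size_t size = 64 * 1024 * 1024; // 64 MiB, well past the limit

// static ubyte[size] arr;           // rejected: static array too large
ubyte[] arr;                         // dynamic array: no such limit

static this()
{
    arr = new ubyte[](size);         // allocated on the GC heap instead
}

void main()
{
    assert(arr.length == size);
}
```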

Ali
Aug 15 2016
next sibling parent reply NX <nightmarex1337 hotmail.com> writes:
https://issues.dlang.org/show_bug.cgi?id=14859

This limitation is put there because of optlink (which fails to 
link when you have enough static data), and is actually entirely 
meaningless when combined with -m32mscoff & -m64 switches (since 
other linkers handle huge static data just fine).
Aug 15 2016
parent Jacob Carlborg <doob me.com> writes:
On 2016-08-15 22:28, NX wrote:
 https://issues.dlang.org/show_bug.cgi?id=14859

 This limitation is put there because of optlink (which fails to link
 when you have enough static data), and is actually entirely meaningless
 when combined with -m32mscoff & -m64 switches (since other linkers
 handle huge static data just fine).
Or using any other platform than Windows :) -- /Jacob Carlborg
Aug 16 2016
prev sibling next sibling parent NX <nightmarex1337 hotmail.com> writes:
You can apply a patch if you're willing to compile dmd from 
source by doing the following:

Find the following code in file 'mtype.d' at line 4561:

                 bool overflow = false;
                 if (mulu(tbn.size(loc), d2, overflow) >= 0x1000000 || overflow) // put a 'reasonable' limit on it
                     goto Loverflow;

And change it to:

                 bool overflow = false;
                 mulu(tbn.size(loc), d2, overflow);
                 if (overflow)
                     goto Loverflow;

I would make a PR if I had the time (anyone?)...
Aug 15 2016
prev sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 08/15/2016 12:09 PM, Ali Çehreli wrote:
 dmd does not allow anything larger.
Could you please help me understand the following results, possibly by 
analyzing the produced assembly?

I wanted to see whether there were any performance penalties when one 
used D's recommendation of using dynamic arrays beyond 16MiB. Here is 
the test code:

enum size = 15 * 1024 * 1024;

version (STATIC) {
    ubyte[size] arr;
}
else {
    ubyte[] arr;

    static this() {
        arr = new ubyte[](size);
    }
}

void main() {
    auto p = arr.ptr;

    foreach (j; 0 .. 100) {
        foreach (i; 0 .. arr.length) {
            version (POINTER) {
                p[i] += cast(ubyte)i;
            }
            else {
                arr[i] += cast(ubyte)i;
            }
        }
    }
}

My CPU is an i7 with 4M cache:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 78
Model name:            Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
Stepping:              3
CPU MHz:               513.953
CPU max MHz:           3400.0000
CPU min MHz:           400.0000
BogoMIPS:              5615.89
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              4096K

I tried two compilers:

- DMD64 D Compiler v2.071.2-b2
- LDC - the LLVM D compiler (1.0.0): based on DMD v2.070.2 and LLVM 3.8.0

As seen in the code, I tried two version identifiers:

- STATIC: Use static array
- else: Use dynamic array
- POINTER: Access array elements through .ptr
- else: Access array elements through the [] operator

So, that gave me 8 combinations. Below, I list both the compilation 
command lines that I used and the wall-clock times that each program 
execution took (as reported by the 'time' utility).
1) dmd deneme.d -ofdeneme -O -boundscheck=off -inline -version=STATIC -version=POINTER
   4.332s

2) dmd deneme.d -ofdeneme -O -boundscheck=off -inline -version=STATIC
   4.238s

3) dmd deneme.d -ofdeneme -O -boundscheck=off -inline -version=POINTER
   4.321s

4) dmd deneme.d -ofdeneme -O -boundscheck=off -inline
   3.845s  <== BEST for dmd

5) ldc2 deneme.d -ofdeneme -O5 -release -boundscheck=off -d-version=POINTER -d-version=STATIC
   0.469s

6) ldc2 deneme.d -ofdeneme -O5 -release -boundscheck=off -d-version=STATIC
   0.472s

7) ldc2 deneme.d -ofdeneme -O5 -release -boundscheck=off -d-version=POINTER
   0.182s  <== BEST for ldc2

8) ldc2 deneme.d -ofdeneme -O5 -release -boundscheck=off
   0.792s

So, for dmd, going with the recommendation of using a dynamic array is 
faster. Interestingly, using .ptr is actually slower. How?

With ldc2, the best option is to go with a dynamic array ONLY IF you 
access the elements through the .ptr property. As seen in the last 
result, using the [] operator on the array is about 4 times slower than 
that.

Does that make sense to you? Why would that be?

Thank you,
Ali
Aug 15 2016
next sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:

for x86 dmd, the results are:
0m0.694s
0m1.176s
0m0.722s
0m1.454s


x86_64 codegen looks... not normal.
Aug 15 2016
prev sibling next sibling parent Yuxuan Shui <yshuiv7 gmail.com> writes:
On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:
 On 08/15/2016 12:09 PM, Ali Çehreli wrote:
 [...]
Could you please help me understand the following results, possibly by analyzing the produced assembly? [...]
There seem to be two things at work here:

1) When not accessed via the pointer, access to the array goes through 
the TCB every time (since it's a TLS variable).

2) LDC is able to vectorize the pointer version but not the array version.
 [...]
Aug 15 2016
prev sibling next sibling parent reply Johan Engelen <j j.nl> writes:
On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:
 
 With ldc2, the best option is to go with a dynamic array ONLY 
 IF you access the elements through the .ptr property. As seen 
 in the last result, using the [] operator on the array is about 
 4 times slower than that.
As Yuxuan Shui mentioned, the difference is in vectorization.

The non-POINTER version is not vectorized because the semantics of the 
code are not the same as the POINTER version's. Indexing `arr`, and 
writing to that address, could change `arr.ptr`, and so the loop would 
do something different when "caching" `arr.ptr` in `p` (POINTER 
version) versus the case without caching (non-POINTER version).

Evil code demonstrating the problem:
```
ubyte evil;
ubyte[] arr;

void doEvil() {
    // TODO: use this in the obfuscated-D contest
    arr = (&evil)[0..50];
}
```

The compiler somehow has to prove that `arr[i]` will never point to 
`arr.ptr` (it's called Alias Analysis in LLVM). Perhaps it is UB in D 
to have `arr[i]` ever point into `arr` itself, I don't know. If so, the 
code is vectorizable and we can try to make it so.

-Johan
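A sketch of the caching Johan describes: once `.ptr` and `.length` are copied into locals whose addresses are never taken, stores through the pointer cannot alias the slice header, so the loop becomes a vectorization candidate. The function name is made up for illustration.

```d
// Module-level (thread-local) array, as in Ali's test code.
ubyte[] arr;

void process()
{
    auto p = arr.ptr;     // cached: stores through p can't change p
    auto n = arr.length;  // cached: the loop bound is now loop-invariant

    foreach (i; 0 .. n)
        p[i] += cast(ubyte)i;
}
```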
Aug 16 2016
next sibling parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 08/16/2016 10:51 AM, Johan Engelen wrote:
 On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:
 With ldc2, the best option is to go with a dynamic array ONLY IF you
 access the elements through the .ptr property. As seen in the last
 result, using the [] operator on the array is about 4 times slower
 than that.
As Yuxuan Shui mentioned the difference is in vectorization. [...]

-Johan
Thank you all. That makes sense...

Agreeing that the POINTER version is applicable only in some cases, and 
looking only at the non-POINTER cases: for ldc2, a static array is 
faster, making the "arbitrary" 16MiB limit a performance issue.

For ldc2, static array is about 40% faster:

6) ldc2 deneme.d -ofdeneme -O5 -release -boundscheck=off -d-version=STATIC
   0.472s

8) ldc2 deneme.d -ofdeneme -O5 -release -boundscheck=off
   0.792s

It's the opposite for dmd:

2) dmd deneme.d -ofdeneme -O -boundscheck=off -inline -version=STATIC
   4.238s

4) dmd deneme.d -ofdeneme -O -boundscheck=off -inline
   3.845s

Ali
Aug 16 2016
parent reply Yuxuan Shui <yshuiv7 gmail.com> writes:
On Tuesday, 16 August 2016 at 18:46:06 UTC, Ali Çehreli wrote:
 On 08/16/2016 10:51 AM, Johan Engelen wrote:
 On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:
 With ldc2, the best option is to go with a dynamic array ONLY IF you
 access the elements through the .ptr property. As seen in the last
 result, using the [] operator on the array is about 4 times slower
 than that.
[...] -Johan
Thank you all. That makes sense... Agreeing that the POINTER version is applicable only in some cases, looking only at the non-POINTER cases, for ldc2, a static array is faster, making the "arbitrary" 16MiB limit a performance issue. For ldc2, static array is about 40% faster: [...] Ali
Actually, the STATIC version is always faster on my machine (Core 
i5-5200U), in both dmd and ldc2 cases.
Aug 16 2016
parent Yuxuan Shui <yshuiv7 gmail.com> writes:
On Tuesday, 16 August 2016 at 19:50:14 UTC, Yuxuan Shui wrote:
 On Tuesday, 16 August 2016 at 18:46:06 UTC, Ali Çehreli wrote:
 On 08/16/2016 10:51 AM, Johan Engelen wrote:
 On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli 
 wrote:
 With ldc2, the best option is to go with a dynamic array ONLY IF you
 access the elements through the .ptr property. As seen in the last
 result, using the [] operator on the array is about 4 times slower
 than that.
[...] -Johan
Thank you all. That makes sense... Agreeing that the POINTER version is applicable only in some cases, looking only at the non-POINTER cases, for ldc2, a static array is faster, making the "arbitrary" 16MiB limit a performance issue. For ldc2, static array is about 40% faster: [...] Ali
Actually, the STATIC version is always faster on my machine (Core i5 5200U), in both dmd and ldc2 cases
In dmd's case, the non-STATIC version seems to evaluate the loop 
condition (arr.length) **every time**.
Aug 16 2016
prev sibling parent reply Yuxuan Shui <yshuiv7 gmail.com> writes:
On Tuesday, 16 August 2016 at 17:51:13 UTC, Johan Engelen wrote:
 On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:
 
 With ldc2, the best option is to go with a dynamic array ONLY 
 IF you access the elements through the .ptr property. As seen 
 in the last result, using the [] operator on the array is 
 about 4 times slower than that.
As Yuxuan Shui mentioned the difference is in vectorization. [...]

-Johan
Wait, doesn't D have strict aliasing rules? ubyte* (&evil) should not be allowed to alias with ubyte** (&arr.ptr).
Aug 16 2016
next sibling parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Tuesday, 16 August 2016 at 20:11:12 UTC, Yuxuan Shui wrote:
 Wait, doesn't D have strict aliasing rules?
luckily, no. at least this is not enforced by dmd. and it is great.
Aug 16 2016
parent reply deadalnix <deadalnix gmail.com> writes:
On Tuesday, 16 August 2016 at 20:19:32 UTC, ketmar wrote:
 On Tuesday, 16 August 2016 at 20:11:12 UTC, Yuxuan Shui wrote:
 Wait, doesn't D have strict aliasing rules?
luckily, no. at least this is not enforced by dmd. and it is great.
[...] these days, so I don't think it's that good of a thing.

Almost every single case where Rust ends up being faster than C++ is 
because their type system allows more AA information to be available to 
the optimizer.

AA is also key to doing non-GC memory management at the language level.
Aug 17 2016
next sibling parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Wednesday, 17 August 2016 at 12:20:28 UTC, deadalnix wrote:
 On Tuesday, 16 August 2016 at 20:19:32 UTC, ketmar wrote:
 On Tuesday, 16 August 2016 at 20:11:12 UTC, Yuxuan Shui wrote:
 Wait, doesn't D have strict aliasing rules?
luckily, no. at least this is not enforced by dmd. and it is great.
these days, so I don't think it's that good of a thing. Almost every single one case where Rust end up being faster than C++ is because their type system allow for more AA information available for the optimizer. AA is also key to do non GC memory management at language level.
from my PoV, this kind of "optimizing" is overrated. i'm absolutely 
unable to understand why should i obey orders from machine instead of 
machine obeys my orders. if i want to go wild with pointers, don't tell 
me that i can't, just compile my code! C is literally ridden with this 
shit, and in the end it is a freakin' pain to write correct C code (if 
it is possible at all for something complex).
Aug 17 2016
parent reply deadalnix <deadalnix gmail.com> writes:
On Wednesday, 17 August 2016 at 12:32:20 UTC, ketmar wrote:
 from my PoV, this kind of "optimizing" is overrated. i'm 
 absolutely unable to understand why should i obey orders from 
 machine instead of machine obeys my orders. if i want to go 
 wild with pointers, don't tell me that i can't, just compile my 
 code! C is literally ridden with this shit, and in the end it 
 is a freakin' pain to write correct C code (if it is possible 
 at all for something comlex).
Because making 99.9% of the code slower because of a fringe use case 
isn't sound engineering. Especially since there are already ways to do 
this in a way that makes the AA happy, for instance using unions.
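A sketch of the union-based alternative alluded to here (names are illustrative): punning an int as bytes through a union declares the overlap in the type itself, so the compiler needs no aliasing guesswork.

```d
// The int and its bytes overlap by declaration, not by pointer casting.
union IntBytes
{
    int whole;
    ubyte[int.sizeof] bytes;
}

void main()
{
    IntBytes v;
    v.whole = 0x01020304;
    v.bytes[0] ^= 0xFF;   // declared overlap: `whole` is known to change
    assert(v.whole != 0x01020304);
}
```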
Aug 17 2016
parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Wednesday, 17 August 2016 at 13:27:14 UTC, deadalnix wrote:
 Especially since there are already ways to do this in a way 
 that makes the AA happy
exactly the thing i was writing about. "hey, you, meatbag! i, Teh Great Machine, said that you have to use unions, not pointers! what? making a pointer to union point into the middle of the buffer is exactly the same aliasing problem, so unions doesn't solve anything? I, Teh Great Machine, don't care. it is your problems, meatbag, i'm not here to serve you."
Aug 17 2016
parent Chris Wright <dhasenan gmail.com> writes:
On Wed, 17 Aug 2016 21:37:11 +0000, ketmar wrote:

 On Wednesday, 17 August 2016 at 13:27:14 UTC, deadalnix wrote:
 Especially since there are already ways to do this in a way that makes
 the AA happy
exactly the thing i was writing about. "hey, you, meatbag! i, Teh Great Machine, said that you have to use unions, not pointers! what? making a pointer to union point into the middle of the buffer is exactly the same aliasing problem, so unions doesn't solve anything? I, Teh Great Machine, don't care. it is your problems, meatbag, i'm not here to serve you."
It makes your intent more obvious. It's more obvious to other humans as 
well as the compiler. For me, it's a win with no downsides. For you: 
don't use @safe and you opt out of this class of optimizations and the 
related restrictions.
Aug 17 2016
prev sibling next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 17 August 2016 at 12:20:28 UTC, deadalnix wrote:

 these days, so I don't think it's that good of a thing.

 Almost every single one case where Rust end up being faster 
 than C++ is because their type system allow for more AA 
 information available for the optimizer.

 AA is also key to do non GC memory management at language level.
AA? Associative Array?
Aug 17 2016
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 17.08.2016 15:41, jmh530 wrote:
 On Wednesday, 17 August 2016 at 12:20:28 UTC, deadalnix wrote:

 so I don't think it's that good of a thing.

 Almost every single one case where Rust end up being faster than C++
 is because their type system allow for more AA information available
 for the optimizer.

 AA is also key to do non GC memory management at language level.
AA? Associative Array?
(Alias Analysis.)
Aug 17 2016
prev sibling parent deadalnix <deadalnix gmail.com> writes:
On Wednesday, 17 August 2016 at 13:41:09 UTC, jmh530 wrote:
 On Wednesday, 17 August 2016 at 12:20:28 UTC, deadalnix wrote:

 these days, so I don't think it's that good of a thing.

 Almost every single one case where Rust end up being faster 
 than C++ is because their type system allow for more AA 
 information available for the optimizer.

 AA is also key to do non GC memory management at language 
 level.
AA? Associative Array?
Alias Analysis. This is a common compiler acronym. Associative arrays 
are called maps by everybody outside this forum.
Aug 17 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/17/2016 5:20 AM, deadalnix wrote:

 don't think it's that good of a thing.

 Almost every single one case where Rust end up being faster than C++ is because
 their type system allow for more AA information available for the optimizer.

 AA is also key to do non GC memory management at language level.
At least for this case, as I mentioned in another post, if the pointer 
and length of the global are cached in locals, they can be kept in 
registers. The contents of locals don't have aliasing problems because, 
if their addresses are not taken, nobody can point to them. 
Optimization relies heavily on that.
Aug 17 2016
parent reply Yuxuan Shui <yshuiv7 gmail.com> writes:
On Wednesday, 17 August 2016 at 19:36:17 UTC, Walter Bright wrote:
 On 8/17/2016 5:20 AM, deadalnix wrote:

 these days, so I
 don't think it's that good of a thing.

 Almost every single one case where Rust end up being faster 
 than C++ is because
 their type system allow for more AA information available for 
 the optimizer.

 AA is also key to do non GC memory management at language 
 level.
At least for this case, as I mentioned in another post, if the pointer and length of the global is cached in a local, it can be cached in a register. The contents of locals don't have aliasing problems because if their addresses are not taken, nobody can point to them. Optimization relies heavily on that.
But doing so would be incorrect if D doesn't provide strong aliasing guarantees. And if D does provide these guarantees, we won't need to do this manually.
Aug 17 2016
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/17/2016 3:12 PM, Yuxuan Shui wrote:
 But doing so would be incorrect if D doesn't provide strong aliasing
guarantees.
 And if D does provide these guarantees, we won't need to do this manually.
It would be correct for that loop if the user does it.
Aug 17 2016
prev sibling parent reply Chris Wright <dhasenan gmail.com> writes:
On Wed, 17 Aug 2016 22:12:25 +0000, Yuxuan Shui wrote:
 But doing so would be incorrect if D doesn't provide strong aliasing
 guarantees. And if D does provide these guarantees, we won't need to do
 this manually.
The language can analyze all code that affects a local variable in many cases. You don't always need the language to guarantee it's impossible if the compiler can see that the user isn't doing anything funky.
Aug 17 2016
parent reply Yuxuan Shui <yshuiv7 gmail.com> writes:
On Thursday, 18 August 2016 at 00:20:32 UTC, Chris Wright wrote:
 On Wed, 17 Aug 2016 22:12:25 +0000, Yuxuan Shui wrote:
 But doing so would be incorrect if D doesn't provide strong 
 aliasing guarantees. And if D does provide these guarantees, 
 we won't need to do this manually.
The language can analyze all code that affects a local variable in many cases. You don't always need the language to guarantee it's impossible if the compiler can see that the user isn't doing anything funky.
That's right. But for Ali's code, the compiler is clearly not smart enough.
Aug 17 2016
next sibling parent reply Chris Wright <dhasenan gmail.com> writes:
On Thu, 18 Aug 2016 01:18:03 +0000, Yuxuan Shui wrote:

 On Thursday, 18 August 2016 at 00:20:32 UTC, Chris Wright wrote:
 On Wed, 17 Aug 2016 22:12:25 +0000, Yuxuan Shui wrote:
 But doing so would be incorrect if D doesn't provide strong aliasing
 guarantees. And if D does provide these guarantees,
 we won't need to do this manually.
The language can analyze all code that affects a local variable in many cases. You don't always need the language to guarantee it's impossible if the compiler can see that the user isn't doing anything funky.
That's right. But for Ali's code, the compiler is clearly not smart enough.
Most of the time, the compiler can successfully make those optimizations for local variables. The compiler can seldom make them for global variables. You get into whole program analysis territory pretty fast. So it's not surprising that the compiler doesn't handle global variables as well as it does local ones.
Aug 17 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/17/2016 9:16 PM, Chris Wright wrote:
 So it's not surprising that the compiler doesn't handle global variables
 as well as it does local ones.
Global variables are pretty much spawn of the devil. You're right that 
dmd does not make a special effort optimizing them. The way to deal 
with it:

1. minimize use of globals

2. if using them in a hot loop, copy a reference to the global into a 
local variable, and use the local variable in the loop instead

This also gets around the problem that thread local storage address 
calculation is slow and inefficient on all platforms.
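A minimal sketch of step 2, assuming a module-level array (which is thread-local by default in D); the names are illustrative.

```d
// Module-level variables in D default to thread-local storage.
ubyte[] global;

void hotLoop()
{
    auto local = global; // copy the slice header once: one TLS address
                         // computation, then only locals in the loop

    foreach (i; 0 .. local.length)
        local[i] += cast(ubyte)i;
}
```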
Aug 17 2016
prev sibling parent reply Johan Engelen <j j.nl> writes:
On Thursday, 18 August 2016 at 01:18:03 UTC, Yuxuan Shui wrote:
 On Thursday, 18 August 2016 at 00:20:32 UTC, Chris Wright wrote:
 The language can analyze all code that affects a local 
 variable in many cases. You don't always need the language to 
 guarantee it's impossible if the compiler can see that the 
 user isn't doing anything funky.
That's right. But for Ali's code, the compiler is clearly not smart enough.
Perhaps not smart enough, but it is very close to being smart enough. 
Perhaps we, the LDC devs, weren't smart enough in using the LLVM 
backend. ;-)

For Ali's code, `arr` is a global symbol that can be modified by code 
not seen by the compiler. So no assumptions can be made about the 
contents of arr.ptr and arr.length, and thus `arr[i]` could index into 
`arr` itself. Now, if we tell [*] the compiler that `arr` is not 
touched by code outside the module... the resulting machine code is 
identical with and without POINTER!

It's an interesting case to look further into. For example with 
Link-Time Optimization, delaying codegen until link time, we can 
internalize many symbols (i.e. telling the compiler no other module is 
going to do anything with them), allowing the compiler to reason better 
about the code and generate faster executables. Ideally, without much 
user effort. It's something I am already looking into. I don't know if 
we can mark `private` global variables as internal. If so, wow :-)

-Johan

[*] Unfortunately `private arr` does not have the desired effect (it 
does not make it an `internal` LLVM variable), so you have to hack into 
the LLVM IR file to make it happen.
Aug 18 2016
next sibling parent Johan Engelen <j j.nl> writes:
On Thursday, 18 August 2016 at 12:20:50 UTC, Johan Engelen wrote:
 I don't know if we can mark `private` global variables as 
 internal. If so, wow :-)
Nevermind, not possible. Templates, cross-module inlining, and probably other reasons.
Aug 18 2016
prev sibling parent =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 08/18/2016 05:20 AM, Johan Engelen wrote:

 `arr` is a global symbol that can be modified by code
 not seen by the compiler. So no assumptions can be made about the
 contents of arr.ptr and arr.length
Yet there is the following text in the spec:

* https://dlang.org/spec/statement.html#ForeachStatement

"The aggregate must be loop invariant, meaning that elements to the 
aggregate cannot be added or removed from it in the 
NoScopeNonEmptyStatement."

I think the spirit of the spec includes other code outside of the 
NoScopeNonEmptyStatement as well. If so, if elements are added to the 
array from other code (e.g. other threads), it should be a case of "all 
bets are off". So, I think both .ptr and .length could be cached.

Ali
Aug 18 2016
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 8/16/16 4:11 PM, Yuxuan Shui wrote:
 On Tuesday, 16 August 2016 at 17:51:13 UTC, Johan Engelen wrote:
 On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:
 With ldc2, the best option is to go with a dynamic array ONLY IF you
 access the elements through the .ptr property. As seen in the last
 result, using the [] operator on the array is about 4 times slower
 than that.
As Yuxuan Shui mentioned the difference is in vectorization. [...]

-Johan
Wait, doesn't D have strict aliasing rules? ubyte* (&evil) should not be allowed to alias with ubyte** (&arr.ptr).
Even if it did, I believe the wildcard is ubyte*. Just like char* can 
point at anything in C, ubyte* is D's equivalent. -Steve
Aug 16 2016
parent reply Charles Hixson via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 08/16/2016 01:49 PM, Steven Schveighoffer via Digitalmars-d wrote:
 On 8/16/16 4:11 PM, Yuxuan Shui wrote:
 On Tuesday, 16 August 2016 at 17:51:13 UTC, Johan Engelen wrote:
 On Tuesday, 16 August 2016 at 01:28:05 UTC, Ali Çehreli wrote:
 With ldc2, the best option is to go with a dynamic array ONLY IF you
 access the elements through the .ptr property. As seen in the last
 result, using the [] operator on the array is about 4 times slower
 than that.
As Yuxuan Shui mentioned the difference is in vectorization. [...]

-Johan
Wait, doesn't D have strict aliasing rules? ubyte* (&evil) should not be allowed to alias with ubyte** (&arr.ptr).
Even if it did, I believe the wildcard is ubyte*. Just like in C, char* can point at anything, ubyte is D's equivalent. -Steve
I think what you say is true (look at the code of std.outbuffer), but IIRC the documentation says that's supposed to be the job of void*.
Aug 16 2016
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 8/16/16 7:23 PM, Charles Hixson via Digitalmars-d wrote:
 On 08/16/2016 01:49 PM, Steven Schveighoffer via Digitalmars-d wrote:
 On 8/16/16 4:11 PM, Yuxuan Shui wrote:
 Wait, doesn't D have strict aliasing rules? ubyte* (&evil) should not be
 allowed to alias with ubyte** (&arr.ptr).
Even if it did, I believe the wildcard is ubyte*. Just like in C, char* can point at anything, ubyte is D's equivalent.
I think what you say is true (look at the code of std.outbuffer), but IIRC the documentation says that's supposed to be the job of void*.
void * is almost useless. In D you can assign a void[] from another 
void[], but other than that, there's no way to write the memory or read 
it.

In C, void * is also allowed to alias any other pointer. But char * is 
also allowed to provide arbitrary byte reading/writing.

I'd expect that D also would provide a similar option.

-Steve
Aug 17 2016
parent reply deadalnix <deadalnix gmail.com> writes:
On Wednesday, 17 August 2016 at 14:21:32 UTC, Steven 
Schveighoffer wrote:
 void * is almost useless. In D you can assign a void[] from 
 another void[], but other than that, there's no way to write 
 the memory or read it.

 In C, void * is also allowed to alias any other pointer. But 
 char * is also allowed to provide arbitrary byte 
 reading/writing.

 I'd expect that D also would provide a similar option.

 -Steve
Yes, but everything can alias with void*/void[]. Thus, you can cast 
from void* to T* "safely".
Aug 17 2016
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 8/17/16 10:38 AM, deadalnix wrote:
 On Wednesday, 17 August 2016 at 14:21:32 UTC, Steven Schveighoffer wrote:
 void * is almost useless. In D you can assign a void[] from another
 void[], but other than that, there's no way to write the memory or
 read it.

 In C, void * is also allowed to alias any other pointer. But char * is
 also allowed to provide arbitrary byte reading/writing.

 I'd expect that D also would provide a similar option.
Yes, but everything can alias with void*/void[] . Thus, you can cast from void* to T* "safely".
Sure, but how do you implement, let's say, byte swapping on an integer?

auto x = (cast(ubyte*)&myInt)[0 .. myInt.sizeof];
foreach (i; 0 .. x.length / 2)
    swap(x[i], x[$ - i - 1]);

So if the compiler can assume that x can't point at myInt, and thus 
myInt can't have changed, then we have a problem. You just can't do 
this with void (or at least not very easily).

-Steve
Aug 17 2016
parent Johan Engelen <j j.nl> writes:
How about the specific case of array indexing?
Is it UB to have `arr[i]` ever point into `arr` itself, or can we 
make it UB?

-Johan
Aug 17 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2016 6:28 PM, Ali Çehreli wrote:
 void main() {
     auto p = arr.ptr;

     foreach (j; 0 .. 100) {
         foreach (i; 0..arr.length) {
             version (POINTER) {
                 p[i] += cast(ubyte)i;
             }
             else {
                 arr[i] += cast(ubyte)i;
             }
         }
     }
 }
When accessing global arrays like this, cache the address of the data 
in a local variable. This will enable the compiler to enregister it. 
Putting global data in registers is problematic because any assignment 
through a pointer could change it, so the compiler takes a pessimistic 
view of it.

Your POINTER version does cache the pointer, but arr.length needs to be 
cached as well.
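Following this suggestion, Ali's loop with both the pointer and the length cached in locals would look roughly like this (a sketch, not benchmarked here; `arr` is the module-level array from the test code):

```d
ubyte[] arr;

void main()
{
    auto p = arr.ptr;          // cached pointer, as in the POINTER version
    immutable n = arr.length;  // cached length, so the bound is a local too

    foreach (j; 0 .. 100)
        foreach (i; 0 .. n)
            p[i] += cast(ubyte)i;
}
```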
Aug 16 2016