
digitalmars.D - D array expansion and non-deterministic re-allocation

reply Bartosz Milewski <bartosz-nospam relisoft.com> writes:
I read Andrei's chapter on arrays and there's one thing that concerns me. When
a slice is extended, the decision to re-allocate, and therefore to cut its
connection to other slices, is non-deterministic. How does that influence
program testing and can it be exploited to attack a buggy system?

Here's a piece of code I came up with--it's not realistic but it describes a
certain class of scenarios:

void f(byte[] code)
{
    doSomeChecks(code);
    execute(code);
}

void doSomeChecks(byte[] a)
{
    // to avoid special-casing empty arrays
    a.length += 1; // extension

    byte b = 42;   // stand-in value: get byte b (or a whole bunch) from somewhere
    a[0] = b;      // write

    // check what happens when a[0] = b
}

Of course this code is incorrect because doSomeChecks assumes that it holds a
unique array, and the caller failed to make a copy before sending it to
doSomeChecks. Bad bad programmer!

However, imagine this scenario: under testing conditions, the array extension
just happens to always cause reallocation, so the write is NOT visible in the
calling routine. All tests pass with flying colors. But since the re-allocation
is, by language definition, non-deterministic, it might happen that, after the
code has been deployed, the re-allocation stops happening, say, on Sundays.
If a hacker learns about that, he may be able to inject malicious code or data
into the program.

In this case the non-determinism built into the language helped camouflage a
serious bug. Should we, or shouldn't we, worry about it in practice?
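
For concreteness, here is a minimal self-contained sketch of the scenario (the buffer size and the value 0x42 are arbitrary stand-ins); whether the caller observes the callee's write depends entirely on whether the extension happened to reallocate:

import std.stdio;

void doSomeChecks(byte[] a)
{
    a.length += 1;   // extension: may grow in place or reallocate
    a[0] = 0x42;     // the write under test
}

void main()
{
    auto code = new byte[](4);

    doSomeChecks(code);

    // If the extension reallocated, the caller's buffer is untouched;
    // if it grew in place, the callee's write is visible here.
    writeln("caller sees the write: ", code[0] == 0x42);

    // A defensive copy makes the outcome independent of the allocator:
    doSomeChecks(code.dup);
}

On typical allocators a small block like this has spare capacity, so the extension tends to happen in place and the write leaks back to the caller, but nothing in the language pins that down.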
Nov 15 2009
next sibling parent Tim Matthews <tim.matthews7 gmail.com> writes:
Bartosz Milewski wrote:
 I read Andrei's chapter on arrays and there's one thing that concerns me. When
a slice is extended, the decision to re-allocate, and therefore to cut its
connection to other slices, is non-deterministic. How does that influence
program testing and can it be exploited to attack a buggy system?
 
 Here's a piece of code I came up with--it's not realistic but it describes a
certain class of scenarios:
 
 void f(byte[] code)
 {
     doSomeChecks(code);
     execute(code);
 }
 
 void doSomeChecks(byte[] a)
 {
     
   // to avoid special-casing empty arrays
   a.length += 1; // extension
    
   // get byte b (or a whole bunch) from somewhere
    a[0] = b; // write
   
   // check what happens when a[0] = b
 }
 
 Of course this code is incorrect because doSomeChecks assumes that it holds a
unique array, and the caller failed to make a copy before sending it to
doSomeChecks. Bad bad programmer!
 
 However imagine this scenario: Under testing conditions, the array extension
just happens to always cause reallocation, so the write is NOT visible in the
calling routine. All tests pass with flying colors. But since the re-allocation
is, by language definition, non-deterministic, it might happen that, after the
code has been deployed, the re-allocation stopped happening, say, on Sundays.
If a hacker learns about that, he may be able to inject malicious code or data
into the program. 
 
 In this case the non-determinism built into the language helped camouflage a
serious bug. Should we, or shouldn't we, worry about it in practice?

Isn't the new kind of array created to fix this already? See T[new]. PS: Are you still working on D's concurrency?
Nov 15 2009
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Bartosz Milewski wrote:
 I read Andrei's chapter on arrays and there's one thing that concerns
 me. When a slice is extended, the decision to re-allocate, and
 therefore to cut its connection to other slices, is
 non-deterministic.

It is not non-deterministic. Try it - you'll get the same results every time. It is implementation-defined behavior.
 How does that influence program testing and can it
 be exploited to attack a buggy system?

Exploitations rely on undefined-behavior, such as buffer overflows, not implementation-defined behavior. This issue is no more conducive to an "exploit" than any other garden variety programming issue. It's entirely different than a buffer overflow attack.
Nov 15 2009
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:hdqea8$2mte$1 digitalmars.com...
 Bartosz Milewski wrote:
 I read Andrei's chapter on arrays and there's one thing that concerns
 me. When a slice is extended, the decision to re-allocate, and
 therefore to cut its connection to other slices, is
 non-deterministic.

It is not non-deterministic. Try it - you'll get the same results every time. It is implementation-defined behavior.
 How does that influence program testing and can it
 be exploited to attack a buggy system?

Exploitations rely on undefined-behavior, such as buffer overflows, not implementation-defined behavior. This issue is no more conducive to an "exploit" than any other garden variety programming issue. It's entirely different than a buffer overflow attack.

Definition used by Implementation "Foo": "Resize if there's room, otherwise reallocate" (sounds fairly realistic to me).

Code in a program using Implementation "Foo":

myArrayOnHeap ~= getUserInput();

Resizes or reallocates? Deterministically dependent on "Is there room?", which is deterministically dependent on two things:

1. The size of the user input (non-deterministic).
2. The amount of free space after the current end of myArrayOnHeap (dependent on the specific actions of the GC, which in turn is dependent on other behaviors of the program, which, for almost any useful program, are dependent on non-deterministic runtime conditions, such as other user input).

If 'A' ("resize or reallocate?") is deterministically determined by 'B', and 'B' ("is there room?") is non-deterministic ("dependent on things like user input"), then 'A' is, in effect, non-deterministic.

Simplified example of the same basic principle in action:

void main(char[][] args)
{
    if (args.length < 10)
        showPrettyPicture();
    else
        blowUpEarth();
}

Deterministic? Only in the same sense that "resize or realloc upon appending" is deterministic. Safe? Fuck no.
Nov 15 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Nick Sabalausky wrote:
 Deterministic? Only in the same sense that "resize or realloc upon 
 appending" is deterministic.

It's deterministic in the sense that if you run the program again with the same inputs, you will get the same result. This is a highly useful attribute for testing and debugging. A non-deterministic problem will give you a different result every time you run it. Threading problems are an example of this, as are any dependencies on uninitialized data. This particular issue is implementation-defined.
 Safe? Fuck no.

It's safe as in memory safe. This is as opposed to undefined-behavior, which is not memory safe. A buffer overflow is an example of undefined-behavior.
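
To make that distinction concrete, here is a small sketch (the array size is arbitrary): the pointer comparison reveals whether the append moved the block, and running the same binary again with the same inputs reports the same answer each time.

import std.stdio;

void main()
{
    auto a = new byte[](4);
    auto p = a.ptr;

    a.length += 1;   // extend; the runtime may or may not move the block

    // The result is fixed by allocator state rather than by chance:
    // re-running the same binary with the same inputs prints the same
    // thing every time.
    writeln("reallocated: ", a.ptr !is p);
}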
Nov 15 2009
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Denis Koroskin wrote:
 It is *non*-deterministic. The decision to reallocate depends (or will 
 depend) on LRU and it may be cleared by another thread (e.g. another 
 thread may reset it manually or via a GC cycle run).

The LRU is thread local.
Nov 15 2009
parent reply Denis Koroskin <2korden gmail.com> writes:
Walter Bright Wrote:

 Denis Koroskin wrote:
 It is *non*-deterministic. The decision to reallocate depends (or will 
 depend) on LRU and it may be cleared by another thread (e.g. another 
 thread may reset it manually or via a GC cycle run).

The LRU is thread local.

It will then prevent a moving GC from being possible in D. Otherwise, moving a block of data will invalidate LRU which will result in non-deterministic reallocation.
Nov 17 2009
parent reply Sean Kelly <sean invisibleduck.org> writes:
Denis Koroskin Wrote:

 Walter Bright Wrote:
 
 Denis Koroskin wrote:
 It is *non*-deterministic. The decision to reallocate depends (or will 
 depend) on LRU and it may be cleared by another thread (e.g. another 
 thread may reset it manually or via a GC cycle run).

The LRU is thread local.

It will then prevent a moving GC from being possible in D. Otherwise, moving a block of data will invalidate LRU which will result in non-deterministic reallocation.

It won't work with the existing GC either. If a page is completely emptied then it can be re-used for different sized allocations. This is why the current cache is wiped during a collection.
Nov 17 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
Sean Kelly wrote:
 Denis Koroskin Wrote:
 
 Walter Bright Wrote:
 
 Denis Koroskin wrote:
 It is *non*-deterministic. The decision to reallocate depends
 (or will depend) on LRU and it may be cleared by another thread
 (e.g. another thread may reset it manually or via a GC cycle
 run).


The LRU is thread local.

It will then prevent a moving GC from being possible in D. Otherwise, moving a block of data will invalidate LRU which will result in non-deterministic reallocation.

It won't work with the existing GC either. If a page is completely emptied then it can be re-used for different sized allocations. This is why the current cache is wiped during a collection.

In both cases, the LRU can be adjusted rather than wiped to account for it.
Nov 17 2009
prev sibling next sibling parent reply grauzone <none example.net> writes:
Walter Bright wrote:
 Safe? Fuck no.

It's safe as in memory safe. This is as opposed to undefined-behavior, which is not memory safe. A buffer overflow is an example of undefined-behavior.

Even if it's memory safe, it could cause a bug. Look at Bartosz's example again; whether the memory block gets copied really depends on the input data length.
Nov 15 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
grauzone wrote:
 Even if it's memory safe, it could cause a bug. Look at Bartosz example 
 again; if the memory block gets copied really depends from the input 
 data length.

Yes, it could cause a bug. But it isn't undefined-behavior, it isn't memory corruption, and it isn't non-deterministic. i = ++i; can also cause a bug.
Nov 16 2009
prev sibling next sibling parent reply Rainer Deyke <rainerd eldwood.com> writes:
Walter Bright wrote:
 It's deterministic in the sense that if you run the program again with
 the same inputs, you will get the same result. This is a highly useful
 attribute for testing and debugging.

On the same platform, with the same compiler, compiler settings, and standard library implementation. That makes it harder to test, not easier. You now have to test with multiple compilers.
 It's safe as in memory safe. This is as opposed to undefined-behavior,
 which is not memory safe. A buffer overflow is an example of
 undefined-behavior.

The current behavior is unsafe in that you can accidentally have two variables pointing at the same buffer. Let's say one buffer holds network input and the other holds some bytecode to execute. Boom - a bug that can be exploited to execute malicious (byte-)code. -- Rainer Deyke - rainerd eldwood.com
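
As a contrived sketch of the sharing hazard described above (the buffer layout and the names are made up): whether the append below grows in place into the neighbouring bytes or reallocates is precisely the implementation-defined decision under discussion.

import std.stdio;

void main()
{
    auto buffer = new byte[](8);        // one allocation...
    byte[] input    = buffer[0 .. 4];   // ...viewed as "network input"
    byte[] bytecode = buffer[4 .. 8];   // ...and as "bytecode to execute"

    byte untrusted = 0x7F;              // stand-in for attacker-controlled data
    input ~= untrusted;                 // grow in place or reallocate?

    // If the append extended in place, buffer[4] -- the first byte of
    // bytecode -- was just overwritten by untrusted data; if it
    // reallocated, bytecode is untouched. The language leaves that
    // choice to the implementation.
    writeln("bytecode[0] after append: ", bytecode[0]);
}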
Nov 16 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Rainer Deyke wrote:
 Walter Bright wrote:
 It's deterministic in the sense that if you run the program again with
 the same inputs, you will get the same result. This is a highly useful
 attribute for testing and debugging.

On the same platform, with the same compiler, compiler settings, and standard library implementation. That makes it harder to test, not easier. You now have to test with multiple compilers.

That is still determinate. Indeterminate means you get different results if you run it again on the same machine.
 It's safe as in memory safe. This is as opposed to undefined-behavior,
 which is not memory safe. A buffer overflow is an example of
 undefined-behavior.

The current behavior is unsafe in that you can accidentally have two variables pointing at the same buffer. Let's say one buffer holds network input and the other holds some bytecode to execute. Boom - a bug that can be exploited to execute malicious (byte-)code.

It is not two random arrays randomly sharing data.
Nov 16 2009
parent reply "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:hdr6mm$1bf0$1 digitalmars.com...
 Rainer Deyke wrote:
 Walter Bright wrote:
 It's deterministic in the sense that if you run the program again with
 the same inputs, you will get the same result. This is a highly useful
 attribute for testing and debugging.

On the same platform, with the same compiler, compiler settings, and standard library implementation. That makes it harder to test, not easier. You now have to test with multiple compilers.

That is still determinate. Indeterminate means you get different results if your run it again on the same machine.

Even if it is technically determinate if you run it on the same machine with the same inputs, that still does nothing to address Bartosz's claim that it's a potential security hole - Apps don't always get run on the same machine with the same inputs.
Nov 16 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
Nick Sabalausky wrote:
 Even if it is technically determinate if you run it on the same machine with 
 the same inputs, that still does nothing to address Bartosz's claim that 
 it's a potential security hole - Apps don't always get run on the same 
 machine with the same inputs.

It's not a security hole in any more serious manner than any other routine programming bug would be. Very few ordinary programming bugs are exploitable. A buffer overflow, however, is much more of a security hole because it is nearly always exploitable: it allows arbitrary user data to be executed. This is not the case with the array resizing issue. That's why I drew a distinction between undefined-behavior and implementation-defined behavior. The former is a couple of orders of magnitude more serious.
Nov 16 2009
prev sibling next sibling parent reply BCS <none anon.com> writes:
Hello Walter,

 Nick Sabalausky wrote:
 
 Deterministic? Only in the same sense that "resize or realloc upon
 appending" is deterministic.
 

It's deterministic in the sense that if you run the program again with the same inputs, you will get the same result. This is a highly useful attribute for testing and debugging. A non-deterministic problem will give you a different result every time you run it. Threading problems are an example of this, as are any dependencies on uninitialized data. This particular issue is implementation-defined.

Would you agree that it is not something the programmer can predict in advance?
Nov 16 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
BCS wrote:
 would you agree that it is not something the programer can predict in 
 advance?

He can, but it is not reasonable to expect him to. But it's still deterministic.
Nov 16 2009
parent reply Bartosz Milewski <bartosz-nospam relisoft.com> writes:
Walter Bright Wrote:

 BCS wrote:
 would you agree that it is not something the programer can predict in 
 advance?

He can, but it is not reasonable to expect him to. But it's still deterministic.

I've been following this discussion about determinism and I think it addresses the problem from the wrong point of view, that of a specific implementation. My concern is the semantics of the language. As it is defined right now, a conforming implementation is free to use a quantum random number generator to decide whether to re-allocate or not. Is it likely? I don't think so; but the non-determinism is part of the semantics of D arrays.
Nov 17 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Bartosz Milewski wrote:
 Walter Bright Wrote:
 
 BCS wrote:
 would you agree that it is not something the programer can predict in 
 advance?

He can, but it is not reasonable to expect him to. But it's still deterministic.

I've been following this discussion about determinism and I think it addresses the problem from the wrong point of view, that of a specific implementation. My concern is the semantics of the language. As it is defined right now, a conforming implementation is free to use a quantum random number generator to decide whether to re-allocate or not. Is it likely? I don't think so; but the non-determinism is part of the semantics of D arrays.

It would not be difficult to specify in the language definition (and TDPL) that behavior is deterministic for a given platform. I think this has some impact on the freedom of the memory allocator, but probably not major. Andrei
Nov 17 2009
parent reply Bartosz Milewski <bartosz-nospam relisoft.com> writes:
Andrei Alexandrescu Wrote:

 My concern is the semantics of the language. As it is defined right now, a
conforming implementation is free to use a quantum random number generator to
decide whether to re-allocate or not. Is it likely? I don't think so; but the
non-determinism is part of the semantics of D arrays.
 
 It would not be difficult to specify in the language definition (and 
 TDPL) that behavior is deterministic for a given platform. I think this 
 has some impact on the freedom of the memory allocator, but probably not 
 major.

Actually this wouldn't fix the problem. Although this would make the program deterministic, it would still exhibit chaotic behavior (and chaos is a pretty good simulator of non-determinism--see random number generators). An input string that is one character longer than in the previous run in one part of the program could cause change in allocation in a completely different part of the program (arbitrary long distance coupling). Theoretically, the heap is deterministic, but in practice no program should depend on all pointers having exactly the same values from run to run. For all intents and purposes the heap should be treated as non-deterministic. This is why no language bothers to impose determinism on the heap. Neither should D.
Nov 18 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Bartosz Milewski wrote:
 Andrei Alexandrescu Wrote:
 
 My concern is the semantics of the language. As it is defined right
 now, a conforming implementation is free to use a quantum random
 number generator to decide whether to re-allocate or not. Is it
 likely? I don't think so; but the non-determinism is part of the
 semantics of D arrays.
 
 It would not be difficult to specify in the language definition
 (and TDPL) that behavior is deterministic for a given platform. I
 think this has some impact on the freedom of the memory allocator,
 but probably not major.

Actually this wouldn't fix the problem. Although this would make the program deterministic, it would still exhibit chaotic behavior (and chaos is a pretty good simulator of non-determinism--see random number generators).

I am glad you have also reached that conclusion.
 An input string that is one character longer than in the previous run
 in one part of the program could cause change in allocation in a
 completely different part of the program (arbitrary long distance
 coupling).

Then you must restart your argument which was centered around non-determinism. It starts like that: "I read Andrei's chapter on arrays and there's one thing that concerns me. When a slice is extended, the decision to re-allocate, and therefore to cut its connection to other slices, is non-deterministic." Andrei
Nov 18 2009
prev sibling parent Bill Baxter <wbaxter gmail.com> writes:
On Tue, Nov 17, 2009 at 4:38 PM, Bartosz Milewski
<bartosz-nospam relisoft.com> wrote:
 Walter Bright Wrote:

 BCS wrote:
 would you agree that it is not something the programer can predict in
 advance?

He can, but it is not reasonable to expect him to. But it's still deterministic.

 I've been following this discussion about determinism and I think it addresses the problem from the wrong point of view, that of a specific implementation.
 My concern is the semantics of the language. As it is defined right now, a conforming implementation is free to use a quantum random number generator to decide whether to re-allocate or not. Is it likely? I don't think so; but the non-determinism is part of the semantics of D arrays.

Then why not just fix the spec to say the behavior has to be deterministic?

--bb
Nov 17 2009
prev sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Mon, 16 Nov 2009 09:24:13 +0300, Walter Bright  
<newshound1 digitalmars.com> wrote:

 Nick Sabalausky wrote:
 Deterministic? Only in the same sense that "resize or realloc upon  
 appending" is deterministic.

It's deterministic in the sense that if you run the program again with the same inputs, you will get the same result. This is a highly useful attribute for testing and debugging.

It is *non*-deterministic. The decision to reallocate depends (or will depend) on LRU and it may be cleared by another thread (e.g. another thread may reset it manually or via a GC cycle run).
Nov 15 2009
prev sibling next sibling parent Leandro Lucarella <llucax gmail.com> writes:
Tim Matthews, el 16 de noviembre a las 15:05 me escribiste:
 Isn't the new kind of arrays created to fix this already? See T[new].

T[new] is gone.

--
Leandro Lucarella (AKA luca)
http://llucax.com.ar/
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
Looks like McGuevara's or CheDonald's
Nov 15 2009
prev sibling next sibling parent reply Ali Cehreli <acehreli yahoo.com> writes:
Bartosz Milewski Wrote:

 I read Andrei's chapter on arrays and there's one thing that concerns me.

I tried to voice the same concern earlier, but my voice is not loud yet. ;)
 When a slice is extended, the decision to re-allocate, and therefore to cut
 its connection to other slices, is non-deterministic.

I don't see it being any different than C++'s std::vector invalidating iterators when the vector is expanded. That has the same "non-determinism", but we've been living with it. The reason we've been living happily with it is that it's been spelled out: "may invalidate all iterators."

In D2's slices, what we're missing is the correct language to describe the semantics. As long as we set the semantics right, there will be no problem. (Earlier, I proposed "discretionary sharing semantics" for D2's slice semantics.)

The first code examples that are used to describe slices had better say "slice" on the left-hand side:

int[] slice = new int[2]; // GOOD for the semantics
int[] array = new int[2]; // BAD for the semantics

Ali
Nov 16 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Ali Cehreli wrote:
 Bartosz Milewski Wrote:
 
 I read Andrei's chapter on arrays and there's one thing that concerns me.

I tried to voice the same concern earlier, but my voice is not loud yet. ;)
 When a slice is extended, the decision to re-allocate, and therefore to cut
 its connection to other slices, is non-deterministic.

I don't see it being any different than C++'s std::vector invalidating iterators when the vector is expanded. That has the same "non-determinism", but we've been living with it.

It is different because invalid STL iterators could do anything. Andrei
Nov 16 2009
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Bartosz Milewski (bartosz-nospam relisoft.com)'s article
 I read Andrei's chapter on arrays and there's one thing that concerns me. When a slice is extended, the decision to re-allocate, and therefore to cut its connection to other slices, is non-deterministic. How does that influence program testing and can it be exploited to attack a buggy system?
 Here's a piece of code I came up with--it's not realistic but it describes a certain class of scenarios:
 void f(byte[] code)
 {
     doSomeChecks(code);
     execute(code);
 }
 void doSomeChecks(byte[] a)
 {
    
   // to avoid special-casing empty arrays
   a.length += 1; // extension
    
   // get byte b (or a whole bunch) from somewhere
    a[0] = b; // write
   
   // check what happens when a[0] = b
 }
 Of course this code is incorrect because doSomeChecks assumes that it holds a unique array, and the caller failed to make a copy before sending it to doSomeChecks. Bad bad programmer!
 However, imagine this scenario: under testing conditions, the array extension just happens to always cause reallocation, so the write is NOT visible in the calling routine. All tests pass with flying colors. But since the re-allocation is, by language definition, non-deterministic, it might happen that, after the code has been deployed, the re-allocation stops happening, say, on Sundays. If a hacker learns about that, he may be able to inject malicious code or data into the program.
 In this case the non-determinism built into the language helped camouflage a serious bug. Should we, or shouldn't we, worry about it in practice?
The one thing that I think has been missing from this discussion is: what would be the alternative if we didn't have this "non-deterministic" reallocation? How else could you **efficiently** implement dynamic arrays?
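
For what it's worth, the usual answer is to move growth into a type that owns its storage outright, so that capacity and reallocation are part of that type's contract rather than something slices negotiate among themselves. Here is a rough sketch (the OwnedArray type and its names are hypothetical; in spirit it is close to the role std.array.Appender plays), not a claim about how D should do it:

struct OwnedArray(T)
{
    private T[] buf;      // storage owned exclusively by this struct
    private size_t len;   // number of elements in use

    void opOpAssign(string op : "~")(T value)
    {
        if (len == buf.length)  // out of capacity: reallocate explicitly
        {
            auto next = new T[](buf.length ? buf.length * 2 : 4);
            next[0 .. len] = buf[0 .. len];
            buf = next;         // the old buffer is never handed out
        }
        buf[len++] = value;
    }

    // Crossing an ownership boundary always copies.
    T[] toArray() { return buf[0 .. len].dup; }
}

unittest
{
    OwnedArray!int a;
    foreach (x; [1, 2, 3])
        a ~= x;
    assert(a.toArray() == [1, 2, 3]);
}

Appends stay amortized O(1) thanks to the doubling, and because the buffer is never exposed as a plain slice, growing it can neither stomp on nor silently detach from someone else's view; the price is a copy whenever data leaves the container.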
Nov 16 2009