
D - Memory layout in D

reply Bill Cox <bill viasic.com> writes:
Hi.

By the way, the design of D is very nice!  Definitely real improvements, 
even for an experienced programmer.

Here are some thoughts about the way compilers lay out memory.  If you're 
getting the idea from my posts so far that I'm a speed freak, you're 
right.  If an extra 20% of speed doesn't excite you, think about 
skipping this letter.

...

Compilers that are meant to generate fast code (like C) generally give 
the user control of exact memory layout.  D goes further than C and C++ 
in providing alignment constructs.

Compilers or interpreters meant to be easy to use and generate portable 
code (like Java) usually hide the memory layout from you.

While this sounds like the way it should be, I've found that if you give 
the compiler the flexibility to lay out memory however it wants (with 
optional help from the user), you generally get faster code using less 
memory, which is also portable and still easy to use.

For example, one optimization I've used in the past is automatically 
reordering fields.  Take this D code for example:

struct Foo {
     byte a;
     int b;
     byte c, d, e, f;
}

Foo c[10000000];

Did you cringe?  If not, definitely go on to the next news topic.  In the 
array 'c', I'd expect the field 'b' of 'Foo' to cross a word boundary, 
and require multiple machine code instructions to load or write. 
Further, the structure is 9 bytes long, so I'd expect the compiler to 
pad it to 12 or 16.  Yuk.  Ok, ok... any real programmer would fix the 
layout by hand.  For the less experienced programmers, however, it's 
nice for the compiler to automatically generate:

struct Foo {
     int b;
     byte a, c, d, e, f;
     byte padding[3];
}

This is kind of minor stuff.  However, imagine you wanted to push the 
limits of memory and speed.  Suppose that the critical code in an 
application is:

total = 0;
for(i = 0; i < 10000000; i++) {
     total ^= c[i].d;
}

Please forgive the use of the large constant.  It's just to illustrate 
that there are a LOT of Foo objects in memory stored mostly contiguously 
(as many good memory managers would do).

The critical code here accesses only the 'd' field.  Using a subset of 
fields is really common in inner loops.  But all these other fields are 
wasting space in the data cache.  If instead, the compiler generated:

struct Foo {
     int b;
     byte a, c, e, f;
}

Foo c[10000000];
byte c_d[10000000];

...
total = 0;
for(i = 0; i < 10000000; i++) {
     total ^= c_d[i];
}

Then, the cache would fill up with the 'd' fields, and leave out the 
other stuff.  I've measured this effect in the past, and it can easily 
be a 2x speed improvement of critical code when there are a lot of 
objects to be processed (many megabytes worth).

Also, the wasted memory (padding) in the structure Foo is gone.  We can 
simultaneously improve speed and memory.  The compiler now has to keep 
track of multiple objects to represent the original structure, but it 
can do that if memory layout is abstract.

The code generators we use where I work do some of this kind of 
optimization.  Really.  We'd never have the patience to pull out 
individual fields and manage their memory by hand.  With our tools, we 
get portable, fast, and memory-efficient code.

The key is simply removing the notion of memory layout from the language. 
The language gets simplified, but the compiler gets more complicated. 
The code gets faster and leaner, and inexperienced programmers have one 
less thing to worry about.

Bill Cox
Jan 25 2003
next sibling parent reply Ilya Minkov <midiclub 8ung.at> writes:
Hi.

Usually in these cases I'd overload a storage class and a stored-data 
class to store the data in a "broken-up" manner. You're right, it'd be 
better done at compilation. But have you thought about how complex it is 
to implement?

However, structs are data-arrangement mechanisms, and as such (by 
philosophy) cannot be changed. This could be done with classes, for 
which storage is not reliably defined anyway. The requirement is then 
that a class need not necessarily have a run-time representation. One 
could require that only classes implementing interfaces have a run-time 
representation, and only to the extent necessary. (?)

BTW, in classes data should be sorted largest-first and aligned already.

-i.


Bill Cox wrote:
 [...]
Jan 25 2003
parent "Serge K" <skarebo programmer.net> writes:
"Ilya Minkov" <midiclub 8ung.at> wrote in message
news:b0u5s7$2dro$1 digitaldaemon.com...
 BTW, in classes data should be sorted largest-first
It's not always good - some processors prefer the opposite order. For
example, Hitachi SuperH processors (quite a popular architecture in
embedded systems, also used in WinCE machines) have memory load/store
instructions with a 4-bit constant offset, scaled by the data size.
That means a single instruction can access only a quite limited memory
range:

  8-bit  : R_base + [0..15]
  16-bit : R_base + [0..30]
  32-bit : R_base + [0..60]

If class members are sorted smallest-first, it's more likely that any
class member can be reached without additional code to calculate its
address.
Jan 29 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
Some great ideas there. The idea in D is that with structs, the programmer
controls the layout. With classes, the compiler controls it and lays it out
for maximum performance, doing things like reordering as you suggest.

"Bill Cox" <bill viasic.com> wrote in message
news:3E327627.6000109 viasic.com...
 [...]
Jan 25 2003
next sibling parent Bill Cox <bill viasic.com> writes:
Sounds good!

-- Bill Cox

Walter wrote:
 Some great ideas there. The idea in D is that with structs, the programmer
 controls the layout. With classes, the compiler controls it and lays it out
 for maximum performance, doing things like reordering as you suggest.
 
 [...]
Jan 25 2003
prev sibling parent reply "Jeroen van Bemmel" <anonymous somewhere.com> writes:
"Walter" <walter digitalmars.com> wrote in message
news:b0vmev$472$1 digitaldaemon.com...
 Some great ideas there. The idea in D is that with structs, the programmer
 controls the layout. With classes, the compiler controls it and lays it
out
 for maximum performance, doing things like reordering as you suggest.
Following this line of thought, then:

- structs cannot have virtual methods
- structs can use inheritance but cannot implement an interface
- structs can have bitfields (and then perhaps classes should _not_ be
  allowed to have them; bitfields often imply low level control)

Would it make sense to rename them to 'aggregate' or something, to
distinguish them from the conventional C/C++ semantics?
Jan 26 2003
parent "Walter" <walter digitalmars.com> writes:
"Jeroen van Bemmel" <anonymous somewhere.com> wrote in message
news:b12m0r$1lgg$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:b0vmev$472$1 digitaldaemon.com...
 [...] Would it make sense to rename them to 'aggregate' or something,
 to distinguish them from the conventional C/C++ semantics?
They do have conventional C semantics; in fact, they are deliberately set up to match the layout of what the host C/C++ compiler would do.
Jan 28 2003