digitalmars.D - Holes in structs and opEquals

bearophile (22/22) Mar 07 2010 This comes from a small thread that is going on in digitalmars.D.learn.

Fawzi Mohamed (15/54) Mar 07 2010 one could argue that the unsafe operation is memset.

bearophile (7/9) Mar 07 2010 Note that if you write a opEquals for each struct then you have the same...

bearophile (10/10) Mar 07 2010 There are other solutions.

Walter Bright (9/25) Mar 07 2010 Right.

yigal chripun (8/43) Mar 07 2010 The compiler knows at compile-time what variables are initialized with "...

Walter Bright (9/18) Mar 08 2010 It can, but I don't agree that it should. For an =void initialization,

bearophile (6/14) Mar 08 2010 I don't fully like this design strategy, I don't like to separate so muc...

Don (20/57) Mar 08 2010 The same issue can arise with unions. This one is pretty hard to grep fo...

Don (3/25) Mar 07 2010 Yes. Most of the problems with struct opEquals were fixed in bug 3433.
Walter Bright (6/10) Mar 07 2010 The "holes" are defined to be filled with 0, and are when initialized by...

bearophile (5/8) Mar 07 2010 Thank you for your answer. I didn't read this important detail in the on...

yigal chripun (4/24) Mar 08 2010 Ok, that sound's reasonable if you want to keep the compiler simple and ...

Walter Bright (5/14) Mar 08 2010 The non-trivial cases are not detectable (halting problem), so it is a

bearophile (6/8) Mar 09 2010 On the other hand I have shown few ideas to avoid this problem, that req...

bearophile (98/98) Mar 09 2010 I guess what's holding back Walter on this improvement is the higher per...

bearophile <bearophileHUGS lycos.com> writes:

This comes from a small thread that is going on in digitalmars.D.learn.

This program asserts:


import std.c.string;
struct S { // 16 bytes, with a hole
    ushort s;
    double d;
}
void main() {
    S s1, s2;
    memset(&s1, ubyte.min, S.sizeof);
    memset(&s2, ubyte.max, S.sizeof);
    s1.s = s2.s = 0;
    s1.d = s2.d = 0;
    assert(s1 == s2);
}


But a correctly implemented opEquals (and opCmp) among structs has to ignore
the contents of the holes, because they can be filled with anything, for
example if the structs where not initialized (with =void).

A correct opEquals has to work recursively (so if the struct contains a string,
it has to control the string equality too).

And a built-in recursive opCmp/opHash for structs is about as useful as having
a correct opEquals.

A correct opEquals can be a little slower, but correctness come first. In the
uncommon situations where I really need max speed and I don't care of
correctness I can use a memcmp(&s1, &s2, S.sizeof). (And the compiler is free
to use just memcmp when a struct has no holes and doesn't contain references,
associative arrays and dynamic arrays).

Correctness of basic struct operators is not an optional feature, like
properties or even enums. If the opEquals among structs is wrong then it's
better to not have it at all.

Bye,
bearophile

Mar 07 2010

Fawzi Mohamed <fmohamed mac.com> writes:

On 2010-03-07 15:09:53 +0100, bearophile <bearophileHUGS lycos.com> said:

 This comes from a small thread that is going on in digitalmars.D.learn.
 
 This program asserts:
 
 
 import std.c.string;
 struct S { // 16 bytes, with a hole
     ushort s;
     double d;
 }
 void main() {
     S s1, s2;
     memset(&s1, ubyte.min, S.sizeof);
     memset(&s2, ubyte.max, S.sizeof);
     s1.s = s2.s = 0;
     s1.d = s2.d = 0;
     assert(s1 == s2);
 }
 
 
 But a correctly implemented opEquals (and opCmp) among structs has to 
 ignore the contents of the holes, because they can be filled with 
 anything, for example if the structs where not initialized (with =void).
 
 A correct opEquals has to work recursively (so if the struct contains a 
 string, it has to control the string equality too).
 
 And a built-in recursive opCmp/opHash for structs is about as useful as 
 having a correct opEquals.
 
 A correct opEquals can be a little slower, but correctness come first. 
 In the uncommon situations where I really need max speed and I don't 
 care of correctness I can use a memcmp(&s1, &s2, S.sizeof). (And the 
 compiler is free to use just memcmp when a struct has no holes and 
 doesn't contain references, associative arrays and dynamic arrays).
 
 Correctness of basic struct operators is not an optional feature, like 
 properties or even enums. If the opEquals among structs is wrong then 
 it's better to not have it at all.

one could argue that the unsafe operation is memset.
The compiler always initializes a struct, so that what you describe 
should never happen in a safe program.
Still as you say the following example that might indeed considered a bug:

S s1=void,s2=void;
s1.s=0; s1.d=0;
s2.s=0; s2.d=0;
assert(s1 == s2);

here the assert might fail depending on the previous content of the memory.
I am not sure of what is the best solution, I am not sure that defining 
a special comparison operation by default for each struct is the 
correct solution (it can be quite some bloat), please note that a user 
defined comparison function will not have these problems.
Still I agree that traking down a bug due to this might be very ugly...

Mar 07 2010

bearophile <bearophileHUGS lycos.com> writes:

Fawzi Mohamed:

I am not sure of what is the best solution,<

This is true for me too, that's why I have created this thread, to find a way
to solve this problem.


I am not sure that defining a special comparison operation by default for each
struct is the correct solution (it can be quite some bloat), please note that a
user defined comparison function will not have these problems.<

Note that if you write a opEquals for each struct then you have the same code
bloat. The difference is that you have had to write lot of boring code, that
can contain bugs, and your program is longer.

In a program you don't use == and != (or hashing) on all the structs, so in
theory the compiler can add such opEquals only to the structs that are tested
for equality in the program. I guess the usual modular compilation requirement
of D code asks too all structs to have such opEquals. In this case the
Link-Time Optimization of LLVM comes to the rescue and removed unused functions
from the whole program/whole compilation block.

Another possible solution is to have a single opEquals in the runtime, that
works like the array sort, using run time knowledge about the types. But this
is of probably slower.

Bye,
bearophile

Mar 07 2010

bearophile <bearophileHUGS lycos.com> writes:

There are other solutions.
For example remove the current opEquals from structs, so doing == among two
structs become a syntax error. And then add a property like  equable that when
present beside the struct name adds a specific and correct and recursive
opEquals to it.

 equable stuct Foo {
  int x;
}

Probably instead of  equable it's better to have an attribute that adds opHash,
opEquals and opCmp to a struct.
This is safe, avoids most code bloat, and keeps code fast. The disadvantage is
a that it adds a new attribute to the language.

If the user can create new attributes, then it's surely possible to define this
attribute with D code and some compile-time reflection, keeping the language
simple enough. There is one or more Java libs that use attributes for this
specific purpose.

Bye,
bearophile

Mar 07 2010

Walter Bright <newshound1 digitalmars.com> writes:

Fawzi Mohamed wrote:
 one could argue that the unsafe operation is memset.

That's right.


 The compiler always initializes a struct, so that what you describe 
 should never happen in a safe program.

Right.


 Still as you say the following example that might indeed considered a bug:
 
 S s1=void,s2=void;
 s1.s=0; s1.d=0;
 s2.s=0; s2.d=0;
 assert(s1 == s2);
 
 here the assert might fail depending on the previous content of the memory.
 I am not sure of what is the best solution, I am not sure that defining 
 a special comparison operation by default for each struct is the correct 
 solution (it can be quite some bloat), please note that a user defined 
 comparison function will not have these problems.

No, it's not a bug in the language, it's a bug in the user code. Using 
=void is an advanced feature, and it requires the user to know what he's 
doing with it. That's why it isn't allowed in safe mode.


 Still I agree that traking down a bug due to this might be very ugly...

The idea with =void initializations is that they are findable using 
grep, and so can be singled out for special attention when there is a 
problem.

Mar 07 2010

yigal chripun <yigal100 gmail.com> writes:

Walter Bright Wrote:

 Fawzi Mohamed wrote:
 one could argue that the unsafe operation is memset.

 
 That's right.
 
 
 The compiler always initializes a struct, so that what you describe 
 should never happen in a safe program.

 
 Right.
 
 
 Still as you say the following example that might indeed considered a bug:
 
 S s1=void,s2=void;
 s1.s=0; s1.d=0;
 s2.s=0; s2.d=0;
 assert(s1 == s2);
 
 here the assert might fail depending on the previous content of the memory.
 I am not sure of what is the best solution, I am not sure that defining 
 a special comparison operation by default for each struct is the correct 
 solution (it can be quite some bloat), please note that a user defined 
 comparison function will not have these problems.

 
 No, it's not a bug in the language, it's a bug in the user code. Using 
 =void is an advanced feature, and it requires the user to know what he's 
 doing with it. That's why it isn't allowed in safe mode.
 
 
 Still I agree that traking down a bug due to this might be very ugly...

 
 The idea with =void initializations is that they are findable using 
 grep, and so can be singled out for special attention when there is a 
 problem.

The compiler knows at compile-time what variables are initialized with "=void".
The compiler than can add a compilation error when such a struct variable is
used in an equals expression. 
this doesn't cover use of malloc() which must be the user's responsebility. 

e.g.
 S s1=void,s2=void;
 s1.s=0; s1.d=0;
 s2.s=0; s2.d=0;
 assert(s1 == s2); // <- this line should not compile

Mar 07 2010

Walter Bright <newshound1 digitalmars.com> writes:

yigal chripun wrote:
 The compiler knows at compile-time what variables are initialized with
"=void". The compiler than can add a compilation error when such a struct
variable is used in an equals expression. 
 this doesn't cover use of malloc() which must be the user's responsebility. 
 
 e.g.
  S s1=void,s2=void;
  s1.s=0; s1.d=0;
  s2.s=0; s2.d=0;
  assert(s1 == s2); // <- this line should not compile
 

It can, but I don't agree that it should. For an =void initialization, 
it's the user's responsibility to initialize it properly. The use of 
=void implies the user knows what he's doing with it, and will take care 
to initialize the 'holes' as necessary.

Trying to disable == for such structs is a losing battle, anyway, as the 
compiler could only detect the most obvious cases. Pass a reference to 
it to a function, store it in a data structure, etc., and all that goes 
away.

Mar 08 2010

bearophile <bearophileHUGS lycos.com> writes:

Walter Bright:
 It can, but I don't agree that it should. For an =void initialization, 
 it's the user's responsibility to initialize it properly. The use of 
 =void implies the user knows what he's doing with it, and will take care 
 to initialize the 'holes' as necessary.

I don't fully like this design strategy, I don't like to separate so much the
safe code from the unsafe code where the compiler refuses to give any safety. A
good compiler has to help assure safety every time it can.

On the other hand I agree with you that your design strategy helps keep the
compiler simpler (and sometimes compilation time low), and I agree that there
are times where it's hopeless to try to turn something unsafe in something a
little safer. It's a waste of time that can be used for something more useful.
So in the end I shut up the muzzle :-)


 Trying to disable == for such structs is a losing battle, anyway, as the 
 compiler could only detect the most obvious cases. Pass a reference to 
 it to a function, store it in a data structure, etc., and all that goes 
 away.

In theory a lint for D code can partially trace such movements to spot such
troubles (in debug release it can even add to the void structs an extra bool
attribute that always gets initialized, to tell apart initialized from
uninitialized structs at runtime and spot such bugs, but this starts to sound a
little silly).

Bye,
bearophile

Mar 08 2010

Don <nospam nospam.com> writes:

Walter Bright wrote:
 Fawzi Mohamed wrote:
 one could argue that the unsafe operation is memset.

 
 That's right.
 
 
 The compiler always initializes a struct, so that what you describe 
 should never happen in a safe program.

 
 Right.
 
 
 Still as you say the following example that might indeed considered a 
 bug:

 S s1=void,s2=void;
 s1.s=0; s1.d=0;
 s2.s=0; s2.d=0;
 assert(s1 == s2);

 here the assert might fail depending on the previous content of the 
 memory.
 I am not sure of what is the best solution, I am not sure that 
 defining a special comparison operation by default for each struct is 
 the correct solution (it can be quite some bloat), please note that a 
 user defined comparison function will not have these problems.

 
 No, it's not a bug in the language, it's a bug in the user code. Using 
 =void is an advanced feature, and it requires the user to know what he's 
 doing with it. That's why it isn't allowed in safe mode.
 
 
 Still I agree that traking down a bug due to this might be very ugly...

 
 The idea with =void initializations is that they are findable using 
 grep, and so can be singled out for special attention when there is a 
 problem.

The same issue can arise with unions. This one is pretty hard to grep for:

-----------
struct S { // 16 bytes, with a hole
     ushort s;
     double d;
}

union U {
   char [16] c;
   S s;
}

void main() {
     U u;
     u.c[]='2';

     S a = S(5, 2.0);

     u.s.s = a.s;
     u.s.d = a.d;
     S b = u.s;
     assert( a == b);
}

Mar 08 2010

Don <nospam nospam.com> writes:

bearophile wrote:
 This comes from a small thread that is going on in digitalmars.D.learn.
 
 This program asserts:
 
 
 import std.c.string;
 struct S { // 16 bytes, with a hole
     ushort s;
     double d;
 }
 void main() {
     S s1, s2;
     memset(&s1, ubyte.min, S.sizeof);
     memset(&s2, ubyte.max, S.sizeof);
     s1.s = s2.s = 0;
     s1.d = s2.d = 0;
     assert(s1 == s2);
 }
 
 
 But a correctly implemented opEquals (and opCmp) among structs has to ignore
the contents of the holes, because they can be filled with anything, for
example if the structs where not initialized (with =void).
 Correctness of basic struct operators is not an optional feature, like
properties or even enums. If the opEquals among structs is wrong then it's
better to not have it at all.

Yes. Most of the problems with struct opEquals were fixed in bug 3433. 
You've found an annoying case which wasn't covered.

Mar 07 2010

Walter Bright <newshound1 digitalmars.com> writes:

bearophile wrote:
 But a correctly implemented opEquals (and opCmp) among structs has to
 ignore the contents of the holes, because they can be filled with
 anything,


The "holes" are defined to be filled with 0, and are when initialized by 
the compiler. This is specifically so that memcmp can be done to compare 
the struct contents.


 for example if the structs where not initialized (with =void).

If you use =void, or allocate with malloc(), it is your responsibility 
to deal with the holes.

Mar 07 2010

bearophile <bearophileHUGS lycos.com> writes:

Walter Bright:
 The "holes" are defined to be filled with 0, and are when initialized by 
 the compiler. This is specifically so that memcmp can be done to compare 
 the struct contents.

Thank you for your answer. I didn't read this important detail in the online
documentation, I must have missed it, sorry. It's a detail that D programmers
must keep in mind well.

A future lint program for D will need to try to warn the programmer that using
the default opEquals among uninitialized structs is wrong.

Bye,
bearophile

Mar 07 2010

yigal chripun <yigal100 gmail.com> writes:

Walter Bright Wrote:

 yigal chripun wrote:
 The compiler knows at compile-time what variables are initialized with
"=void". The compiler than can add a compilation error when such a struct
variable is used in an equals expression. 
 this doesn't cover use of malloc() which must be the user's responsebility. 
 
 e.g.
  S s1=void,s2=void;
  s1.s=0; s1.d=0;
  s2.s=0; s2.d=0;
  assert(s1 == s2); // <- this line should not compile
 

 
 It can, but I don't agree that it should. For an =void initialization, 
 it's the user's responsibility to initialize it properly. The use of 
 =void implies the user knows what he's doing with it, and will take care 
 to initialize the 'holes' as necessary.
 
 Trying to disable == for such structs is a losing battle, anyway, as the 
 compiler could only detect the most obvious cases. Pass a reference to 
 it to a function, store it in a data structure, etc., and all that goes 
 away.

Ok, that sound's reasonable if you want to keep the compiler simple and fast.
How about same mode though? maybe it's worth to add this check only as part of
same mode, where there are less cases anyway since malloc() isn't safe, and
there are no pointers. is it allowed to use "=void" in safe mode at all?

another question I have: How would a user initialize the holes and doesn't it
negate the bebefits of void as optimisation?

Mar 08 2010

Walter Bright <newshound1 digitalmars.com> writes:

yigal chripun wrote:
 Trying to disable == for such structs is a losing battle, anyway,
 as the compiler could only detect the most obvious cases. Pass a
 reference to it to a function, store it in a data structure, etc.,
 and all that goes away.

 
 Ok, that sound's reasonable if you want to keep the compiler simple
 and fast.

The non-trivial cases are not detectable (halting problem), so it is a 
pointless feature to add.


 another question I have: How would a user initialize the holes and
 doesn't it negate the bebefits of void as optimisation?

The point of =void is to allow the user to determine those tradeoffs 
rather than the compiler.

Mar 08 2010

bearophile <bearophileHUGS lycos.com> writes:

Walter Bright:
 The non-trivial cases are not detectable (halting problem), so it is a 
 pointless feature to add.

On the other hand I have shown few ideas to avoid this problem, that require no
magic abilities to a compiler, essentially making the opBinary("==") operator
actively ignore the holes.

If generic code to ignore such holes is too much slow it can be a little faster
creating a specialized version for each struct. If this causes too much code
bloat, then the default opBinary("==") can be removed from structs and put only
when the programmer asks for it with something like a  equatable property.

If a programmer wants max speed ignoring the holes, such person can define a
opBinary("==") that uses memcmp or something similar.

Bye,
bearophile

Mar 09 2010

bearophile <bearophileHUGS lycos.com> writes:

I guess what's holding back Walter on this improvement is the higher
performance of the equal done ignoring the holes compared to the more correct
one. So I have done a test to see how can large is such performance difference,
because with not even a bit of experimental data it's hard to discuss:


import std.stdio: writeln;
import std.c.string: memcmp, memset;


/*
Basic uniform random generator: Minimal Standard in Park and
Miller (1988): "Random Number Generators: Good Ones Are Hard to
Find", Comm. of the ACM, 31, 1192-1201.
Parameters: m = 2^31-1, a=48271.

Translated to D from Pascal code by Jesper Lund:
http://www.gnu-pascal.de/crystal/gpc/en/mail1390.html
*/
struct Random {
    enum int m = int.max;
    enum int a = 48_271;
    enum int q = m / a;
    enum int r = m % a;
    int seed = 42;

    double nextDouble() {
        int k = seed / q;
        seed = a * (seed - k * q) - r * k;
        if (seed < 1)
            seed += m;
        return cast(double)seed / m;
    }

    int nextInt(int max) {
        int n = max + 1;
        int k = cast(int)(n * this.nextDouble());
        return (k == n) ? k - 1 : k;
    }
}


struct SA {
    short s;
    double d;
}

struct SB {
    short s;
    double d;
    bool opBinary(string S:"==")(SB other) {
        return (this.s == other.s) && (this.d == other.d);
    }
}


void main() {
    static assert(SA.sizeof == SB.sizeof);
    enum Versions { basic, defined, memcmp }

    enum int N = 1_000; // small, to keep data in CPU cache
    enum int NLOOPS = 100_000;
    enum Versions vers = Versions.defined; // basic, defined, memcmp

    SA[6] sa_items = [SA(20, 5.5), SA(20, 5.5),
                      SA(20, 6.5), SA(20, 6.5),
                      SA(30, 6.5), SA(30, 6.5)];

    memset(&(sa_items[1]), 255, SA.sizeof);
    sa_items[1].s = sa_items[0].s;
    sa_items[1].d = sa_items[0].d;

    memset(&(sa_items[3]), 255, SA.sizeof);
    sa_items[3].s = sa_items[2].s;
    sa_items[3].d = sa_items[2].d;

    memset(&(sa_items[5]), 255, SA.sizeof);
    sa_items[5].s = sa_items[4].s;
    sa_items[5].d = sa_items[4].d;


    auto sa1 = new SA[N];
    auto sb1 = new SB[sa1.length];
    auto sa2 = new SA[sa1.length];
    auto sb2 = new SB[sa1.length];

    auto rnd = Random(1);

    foreach (i; 0 .. sa1.length) {
        int r1 = rnd.nextInt(sa_items.length - 1);
        sa1[i] = sa_items[r1];
        sb1[i] = SB(sa1[i].s, sa1[i].d);

        int r2 = rnd.nextInt(sa_items.length - 1);
        sa2[i] = sa_items[r2];
        sb2[i] = SB(sa2[i].s, sa2[i].d);
    }

    int sum;

    final switch(vers) {
        case Versions.basic:
            for (int i; i < NLOOPS; i++)
                for (int j; j < N; j++)
                    sum += (sa1[j] == sa2[j]);
            break;
        case Versions.defined:
            for (int i; i < NLOOPS; i++)
                for (int j; j < N; j++)
                    sum += (sb1[j] == sb2[j]);
            break;
        case Versions.memcmp:
            for (int i; i < NLOOPS; i++)
                for (int j; j < N; j++)
                    sum += memcmp(&(sa1[j]), &(sa2[j]), SA.sizeof) == 0;
            break;
    }

    writeln(sum);
}


Timings, N=1_000, NLOOPS=100_000, seconds:
  basic:   3.76 (result sum = 16400000)
  defined: 4.48 (result sum = 34100000)
  memcmp:  4.81 (result sum = 16400000)

Bye,
bearophile

Mar 09 2010

D Programming

C/C++ Programming

Other

digitalmars.D - Holes in structs and opEquals