www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Holes in structs and opEquals

reply bearophile <bearophileHUGS lycos.com> writes:
This comes from a small thread that is going on in digitalmars.D.learn.

This program asserts:


import std.c.string;
struct S { // 16 bytes, with a hole
    ushort s;
    double d;
}
void main() {
    S s1, s2;
    memset(&s1, ubyte.min, S.sizeof);
    memset(&s2, ubyte.max, S.sizeof);
    s1.s = s2.s = 0;
    s1.d = s2.d = 0;
    assert(s1 == s2);
}


But a correctly implemented opEquals (and opCmp) among structs has to ignore
the contents of the holes, because they can be filled with anything, for
example if the structs where not initialized (with =void).

A correct opEquals has to work recursively (so if the struct contains a string,
it has to control the string equality too).

And a built-in recursive opCmp/opHash for structs is about as useful as having
a correct opEquals.

A correct opEquals can be a little slower, but correctness come first. In the
uncommon situations where I really need max speed and I don't care of
correctness I can use a memcmp(&s1, &s2, S.sizeof). (And the compiler is free
to use just memcmp when a struct has no holes and doesn't contain references,
associative arrays and dynamic arrays).

Correctness of basic struct operators is not an optional feature, like
properties or even enums. If the opEquals among structs is wrong then it's
better to not have it at all.

Bye,
bearophile
Mar 07 2010
next sibling parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2010-03-07 15:09:53 +0100, bearophile <bearophileHUGS lycos.com> said:

 This comes from a small thread that is going on in digitalmars.D.learn.
 
 This program asserts:
 
 
 import std.c.string;
 struct S { // 16 bytes, with a hole
     ushort s;
     double d;
 }
 void main() {
     S s1, s2;
     memset(&s1, ubyte.min, S.sizeof);
     memset(&s2, ubyte.max, S.sizeof);
     s1.s = s2.s = 0;
     s1.d = s2.d = 0;
     assert(s1 == s2);
 }
 
 
 But a correctly implemented opEquals (and opCmp) among structs has to 
 ignore the contents of the holes, because they can be filled with 
 anything, for example if the structs where not initialized (with =void).
 
 A correct opEquals has to work recursively (so if the struct contains a 
 string, it has to control the string equality too).
 
 And a built-in recursive opCmp/opHash for structs is about as useful as 
 having a correct opEquals.
 
 A correct opEquals can be a little slower, but correctness come first. 
 In the uncommon situations where I really need max speed and I don't 
 care of correctness I can use a memcmp(&s1, &s2, S.sizeof). (And the 
 compiler is free to use just memcmp when a struct has no holes and 
 doesn't contain references, associative arrays and dynamic arrays).
 
 Correctness of basic struct operators is not an optional feature, like 
 properties or even enums. If the opEquals among structs is wrong then 
 it's better to not have it at all.
one could argue that the unsafe operation is memset. The compiler always initializes a struct, so that what you describe should never happen in a safe program. Still as you say the following example that might indeed considered a bug: S s1=void,s2=void; s1.s=0; s1.d=0; s2.s=0; s2.d=0; assert(s1 == s2); here the assert might fail depending on the previous content of the memory. I am not sure of what is the best solution, I am not sure that defining a special comparison operation by default for each struct is the correct solution (it can be quite some bloat), please note that a user defined comparison function will not have these problems. Still I agree that traking down a bug due to this might be very ugly...
Mar 07 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Fawzi Mohamed:

I am not sure of what is the best solution,<
This is true for me too, that's why I have created this thread, to find a way to solve this problem.
I am not sure that defining a special comparison operation by default for each
struct is the correct solution (it can be quite some bloat), please note that a
user defined comparison function will not have these problems.<
Note that if you write a opEquals for each struct then you have the same code bloat. The difference is that you have had to write lot of boring code, that can contain bugs, and your program is longer. In a program you don't use == and != (or hashing) on all the structs, so in theory the compiler can add such opEquals only to the structs that are tested for equality in the program. I guess the usual modular compilation requirement of D code asks too all structs to have such opEquals. In this case the Link-Time Optimization of LLVM comes to the rescue and removed unused functions from the whole program/whole compilation block. Another possible solution is to have a single opEquals in the runtime, that works like the array sort, using run time knowledge about the types. But this is of probably slower. Bye, bearophile
Mar 07 2010
parent bearophile <bearophileHUGS lycos.com> writes:
There are other solutions.
For example remove the current opEquals from structs, so doing == among two
structs become a syntax error. And then add a property like  equable that when
present beside the struct name adds a specific and correct and recursive
opEquals to it.

 equable stuct Foo {
  int x;
}

Probably instead of  equable it's better to have an attribute that adds opHash,
opEquals and opCmp to a struct.
This is safe, avoids most code bloat, and keeps code fast. The disadvantage is
a that it adds a new attribute to the language.

If the user can create new attributes, then it's surely possible to define this
attribute with D code and some compile-time reflection, keeping the language
simple enough. There is one or more Java libs that use attributes for this
specific purpose.

Bye,
bearophile
Mar 07 2010
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Fawzi Mohamed wrote:
 one could argue that the unsafe operation is memset.
That's right.
 The compiler always initializes a struct, so that what you describe 
 should never happen in a safe program.
Right.
 Still as you say the following example that might indeed considered a bug:
 
 S s1=void,s2=void;
 s1.s=0; s1.d=0;
 s2.s=0; s2.d=0;
 assert(s1 == s2);
 
 here the assert might fail depending on the previous content of the memory.
 I am not sure of what is the best solution, I am not sure that defining 
 a special comparison operation by default for each struct is the correct 
 solution (it can be quite some bloat), please note that a user defined 
 comparison function will not have these problems.
No, it's not a bug in the language, it's a bug in the user code. Using =void is an advanced feature, and it requires the user to know what he's doing with it. That's why it isn't allowed in safe mode.
 Still I agree that traking down a bug due to this might be very ugly...
The idea with =void initializations is that they are findable using grep, and so can be singled out for special attention when there is a problem.
Mar 07 2010
next sibling parent reply yigal chripun <yigal100 gmail.com> writes:
Walter Bright Wrote:

 Fawzi Mohamed wrote:
 one could argue that the unsafe operation is memset.
That's right.
 The compiler always initializes a struct, so that what you describe 
 should never happen in a safe program.
Right.
 Still as you say the following example that might indeed considered a bug:
 
 S s1=void,s2=void;
 s1.s=0; s1.d=0;
 s2.s=0; s2.d=0;
 assert(s1 == s2);
 
 here the assert might fail depending on the previous content of the memory.
 I am not sure of what is the best solution, I am not sure that defining 
 a special comparison operation by default for each struct is the correct 
 solution (it can be quite some bloat), please note that a user defined 
 comparison function will not have these problems.
No, it's not a bug in the language, it's a bug in the user code. Using =void is an advanced feature, and it requires the user to know what he's doing with it. That's why it isn't allowed in safe mode.
 Still I agree that traking down a bug due to this might be very ugly...
The idea with =void initializations is that they are findable using grep, and so can be singled out for special attention when there is a problem.
The compiler knows at compile-time what variables are initialized with "=void". The compiler than can add a compilation error when such a struct variable is used in an equals expression. this doesn't cover use of malloc() which must be the user's responsebility. e.g. S s1=void,s2=void; s1.s=0; s1.d=0; s2.s=0; s2.d=0; assert(s1 == s2); // <- this line should not compile
Mar 07 2010
parent reply Walter Bright <newshound1 digitalmars.com> writes:
yigal chripun wrote:
 The compiler knows at compile-time what variables are initialized with
"=void". The compiler than can add a compilation error when such a struct
variable is used in an equals expression. 
 this doesn't cover use of malloc() which must be the user's responsebility. 
 
 e.g.
  S s1=void,s2=void;
  s1.s=0; s1.d=0;
  s2.s=0; s2.d=0;
  assert(s1 == s2); // <- this line should not compile
 
It can, but I don't agree that it should. For an =void initialization, it's the user's responsibility to initialize it properly. The use of =void implies the user knows what he's doing with it, and will take care to initialize the 'holes' as necessary. Trying to disable == for such structs is a losing battle, anyway, as the compiler could only detect the most obvious cases. Pass a reference to it to a function, store it in a data structure, etc., and all that goes away.
Mar 08 2010
parent bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 It can, but I don't agree that it should. For an =void initialization, 
 it's the user's responsibility to initialize it properly. The use of 
 =void implies the user knows what he's doing with it, and will take care 
 to initialize the 'holes' as necessary.
I don't fully like this design strategy, I don't like to separate so much the safe code from the unsafe code where the compiler refuses to give any safety. A good compiler has to help assure safety every time it can. On the other hand I agree with you that your design strategy helps keep the compiler simpler (and sometimes compilation time low), and I agree that there are times where it's hopeless to try to turn something unsafe in something a little safer. It's a waste of time that can be used for something more useful. So in the end I shut up the muzzle :-)
 Trying to disable == for such structs is a losing battle, anyway, as the 
 compiler could only detect the most obvious cases. Pass a reference to 
 it to a function, store it in a data structure, etc., and all that goes 
 away.
In theory a lint for D code can partially trace such movements to spot such troubles (in debug release it can even add to the void structs an extra bool attribute that always gets initialized, to tell apart initialized from uninitialized structs at runtime and spot such bugs, but this starts to sound a little silly). Bye, bearophile
Mar 08 2010
prev sibling parent Don <nospam nospam.com> writes:
Walter Bright wrote:
 Fawzi Mohamed wrote:
 one could argue that the unsafe operation is memset.
That's right.
 The compiler always initializes a struct, so that what you describe 
 should never happen in a safe program.
Right.
 Still as you say the following example that might indeed considered a 
 bug:

 S s1=void,s2=void;
 s1.s=0; s1.d=0;
 s2.s=0; s2.d=0;
 assert(s1 == s2);

 here the assert might fail depending on the previous content of the 
 memory.
 I am not sure of what is the best solution, I am not sure that 
 defining a special comparison operation by default for each struct is 
 the correct solution (it can be quite some bloat), please note that a 
 user defined comparison function will not have these problems.
No, it's not a bug in the language, it's a bug in the user code. Using =void is an advanced feature, and it requires the user to know what he's doing with it. That's why it isn't allowed in safe mode.
 Still I agree that traking down a bug due to this might be very ugly...
The idea with =void initializations is that they are findable using grep, and so can be singled out for special attention when there is a problem.
The same issue can arise with unions. This one is pretty hard to grep for: ----------- struct S { // 16 bytes, with a hole ushort s; double d; } union U { char [16] c; S s; } void main() { U u; u.c[]='2'; S a = S(5, 2.0); u.s.s = a.s; u.s.d = a.d; S b = u.s; assert( a == b); }
Mar 08 2010
prev sibling next sibling parent Don <nospam nospam.com> writes:
bearophile wrote:
 This comes from a small thread that is going on in digitalmars.D.learn.
 
 This program asserts:
 
 
 import std.c.string;
 struct S { // 16 bytes, with a hole
     ushort s;
     double d;
 }
 void main() {
     S s1, s2;
     memset(&s1, ubyte.min, S.sizeof);
     memset(&s2, ubyte.max, S.sizeof);
     s1.s = s2.s = 0;
     s1.d = s2.d = 0;
     assert(s1 == s2);
 }
 
 
 But a correctly implemented opEquals (and opCmp) among structs has to ignore
the contents of the holes, because they can be filled with anything, for
example if the structs where not initialized (with =void).
 Correctness of basic struct operators is not an optional feature, like
properties or even enums. If the opEquals among structs is wrong then it's
better to not have it at all.
Yes. Most of the problems with struct opEquals were fixed in bug 3433. You've found an annoying case which wasn't covered.
Mar 07 2010
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 But a correctly implemented opEquals (and opCmp) among structs has to
 ignore the contents of the holes, because they can be filled with
 anything,
The "holes" are defined to be filled with 0, and are when initialized by the compiler. This is specifically so that memcmp can be done to compare the struct contents.
 for example if the structs where not initialized (with =void).
If you use =void, or allocate with malloc(), it is your responsibility to deal with the holes.
Mar 07 2010
parent bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 The "holes" are defined to be filled with 0, and are when initialized by 
 the compiler. This is specifically so that memcmp can be done to compare 
 the struct contents.
Thank you for your answer. I didn't read this important detail in the online documentation, I must have missed it, sorry. It's a detail that D programmers must keep in mind well. A future lint program for D will need to try to warn the programmer that using the default opEquals among uninitialized structs is wrong. Bye, bearophile
Mar 07 2010
prev sibling next sibling parent reply yigal chripun <yigal100 gmail.com> writes:
Walter Bright Wrote:

 yigal chripun wrote:
 The compiler knows at compile-time what variables are initialized with
"=void". The compiler than can add a compilation error when such a struct
variable is used in an equals expression. 
 this doesn't cover use of malloc() which must be the user's responsebility. 
 
 e.g.
  S s1=void,s2=void;
  s1.s=0; s1.d=0;
  s2.s=0; s2.d=0;
  assert(s1 == s2); // <- this line should not compile
 
It can, but I don't agree that it should. For an =void initialization, it's the user's responsibility to initialize it properly. The use of =void implies the user knows what he's doing with it, and will take care to initialize the 'holes' as necessary. Trying to disable == for such structs is a losing battle, anyway, as the compiler could only detect the most obvious cases. Pass a reference to it to a function, store it in a data structure, etc., and all that goes away.
Ok, that sound's reasonable if you want to keep the compiler simple and fast. How about same mode though? maybe it's worth to add this check only as part of same mode, where there are less cases anyway since malloc() isn't safe, and there are no pointers. is it allowed to use "=void" in safe mode at all? another question I have: How would a user initialize the holes and doesn't it negate the bebefits of void as optimisation?
Mar 08 2010
parent Walter Bright <newshound1 digitalmars.com> writes:
yigal chripun wrote:
 Trying to disable == for such structs is a losing battle, anyway,
 as the compiler could only detect the most obvious cases. Pass a
 reference to it to a function, store it in a data structure, etc.,
 and all that goes away.
Ok, that sound's reasonable if you want to keep the compiler simple and fast.
The non-trivial cases are not detectable (halting problem), so it is a pointless feature to add.
 another question I have: How would a user initialize the holes and
 doesn't it negate the bebefits of void as optimisation?
The point of =void is to allow the user to determine those tradeoffs rather than the compiler.
Mar 08 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 The non-trivial cases are not detectable (halting problem), so it is a 
 pointless feature to add.
On the other hand I have shown few ideas to avoid this problem, that require no magic abilities to a compiler, essentially making the opBinary("==") operator actively ignore the holes. If generic code to ignore such holes is too much slow it can be a little faster creating a specialized version for each struct. If this causes too much code bloat, then the default opBinary("==") can be removed from structs and put only when the programmer asks for it with something like a equatable property. If a programmer wants max speed ignoring the holes, such person can define a opBinary("==") that uses memcmp or something similar. Bye, bearophile
Mar 09 2010
parent bearophile <bearophileHUGS lycos.com> writes:
I guess what's holding back Walter on this improvement is the higher
performance of the equal done ignoring the holes compared to the more correct
one. So I have done a test to see how can large is such performance difference,
because with not even a bit of experimental data it's hard to discuss:


import std.stdio: writeln;
import std.c.string: memcmp, memset;


/*
Basic uniform random generator: Minimal Standard in Park and
Miller (1988): "Random Number Generators: Good Ones Are Hard to
Find", Comm. of the ACM, 31, 1192-1201.
Parameters: m = 2^31-1, a=48271.

Translated to D from Pascal code by Jesper Lund:
http://www.gnu-pascal.de/crystal/gpc/en/mail1390.html
*/
struct Random {
    enum int m = int.max;
    enum int a = 48_271;
    enum int q = m / a;
    enum int r = m % a;
    int seed = 42;

    double nextDouble() {
        int k = seed / q;
        seed = a * (seed - k * q) - r * k;
        if (seed < 1)
            seed += m;
        return cast(double)seed / m;
    }

    int nextInt(int max) {
        int n = max + 1;
        int k = cast(int)(n * this.nextDouble());
        return (k == n) ? k - 1 : k;
    }
}


struct SA {
    short s;
    double d;
}

struct SB {
    short s;
    double d;
    bool opBinary(string S:"==")(SB other) {
        return (this.s == other.s) && (this.d == other.d);
    }
}


void main() {
    static assert(SA.sizeof == SB.sizeof);
    enum Versions { basic, defined, memcmp }

    enum int N = 1_000; // small, to keep data in CPU cache
    enum int NLOOPS = 100_000;
    enum Versions vers = Versions.defined; // basic, defined, memcmp

    SA[6] sa_items = [SA(20, 5.5), SA(20, 5.5),
                      SA(20, 6.5), SA(20, 6.5),
                      SA(30, 6.5), SA(30, 6.5)];

    memset(&(sa_items[1]), 255, SA.sizeof);
    sa_items[1].s = sa_items[0].s;
    sa_items[1].d = sa_items[0].d;

    memset(&(sa_items[3]), 255, SA.sizeof);
    sa_items[3].s = sa_items[2].s;
    sa_items[3].d = sa_items[2].d;

    memset(&(sa_items[5]), 255, SA.sizeof);
    sa_items[5].s = sa_items[4].s;
    sa_items[5].d = sa_items[4].d;


    auto sa1 = new SA[N];
    auto sb1 = new SB[sa1.length];
    auto sa2 = new SA[sa1.length];
    auto sb2 = new SB[sa1.length];

    auto rnd = Random(1);

    foreach (i; 0 .. sa1.length) {
        int r1 = rnd.nextInt(sa_items.length - 1);
        sa1[i] = sa_items[r1];
        sb1[i] = SB(sa1[i].s, sa1[i].d);

        int r2 = rnd.nextInt(sa_items.length - 1);
        sa2[i] = sa_items[r2];
        sb2[i] = SB(sa2[i].s, sa2[i].d);
    }

    int sum;

    final switch(vers) {
        case Versions.basic:
            for (int i; i < NLOOPS; i++)
                for (int j; j < N; j++)
                    sum += (sa1[j] == sa2[j]);
            break;
        case Versions.defined:
            for (int i; i < NLOOPS; i++)
                for (int j; j < N; j++)
                    sum += (sb1[j] == sb2[j]);
            break;
        case Versions.memcmp:
            for (int i; i < NLOOPS; i++)
                for (int j; j < N; j++)
                    sum += memcmp(&(sa1[j]), &(sa2[j]), SA.sizeof) == 0;
            break;
    }

    writeln(sum);
}


Timings, N=1_000, NLOOPS=100_000, seconds:
  basic:   3.76 (result sum = 16400000)
  defined: 4.48 (result sum = 34100000)
  memcmp:  4.81 (result sum = 16400000)

Bye,
bearophile
Mar 09 2010