www.digitalmars.com         C & C++   DMDScript  

D - [bug] static array slice as associative array index

reply serge lxnt.info writes:
I have just found D language and was interested in its performance. I found this
page http://www.functionalfuture.com/d/ and was not impressed by 'hash'
benchmark performance, so I tried to make my own implementation. But the problem
is that it seems to work for small values of 'n' but starts producing incorrect
results when it becomes larger (100000 for example, it gives 25659 instead of
18699). I use DMD 0.82 in Linux. If there is no bug in compiler, I would like to
know what's wrong with my program. Thanks.


import std.c.stdio;
import std.string;

int main(char[][] args)
{
int n = args.length < 2 ? 1 : atoi(args[1]);
int[char[]] X;
char[32] buf;
int c = 0;

for (int i = 1; i <= n; i++)
{
sprintf(buf, "%x", i);
X[buf[0..strlen(buf)]] = i;
}

for (int i = n; i >= 1; i--)
{
sprintf(buf, "%d", i);
if (buf[0..strlen(buf)] in X) c++;
}

printf("%d\n", c);

return 0;
}
Apr 11 2004
next sibling parent reply serge <serge lxnt.info> writes:
In article <c5b7km$osi$1 digitaldaemon.com>, serge lxnt.info says...
I have just found D language and was interested in its performance. I found this
page http://www.functionalfuture.com/d/ and was not impressed by 'hash'
benchmark performance, so I tried to make my own implementation. But the problem
is that it seems to work for small values of 'n' but starts producing incorrect
results when it becomes larger (100000 for example, it gives 25659 instead of
18699). I use DMD 0.82 in Linux. If there is no bug in compiler, I would like to
know what's wrong with my program. Thanks.

Here is a smaller testcase: int main() { char[1] key; int[char[]] dictionary; key[0] = 'a'; dictionary[key] = 1; key[0] = 'b'; dictionary[key] = 2; foreach(char[] k, int v; dictionary) printf("dictionary[\"%.*s\"] = %d\n", k, v); return 0; } It gives the following results when executed: dictionary["b"] = 1 dictionary["b"] = 2 Replacing 'dictionary[key]' with 'dictionary[key.dup]' helps to fix this problem.
Apr 11 2004
parent reply larry cowan <larry_member pathlink.com> writes:
In article <c5c9r5$2bsb$1 digitaldaemon.com>, serge says...
In article <c5b7km$osi$1 digitaldaemon.com>, serge lxnt.info says...
I have just found D language and was interested in its performance. I found this
page http://www.functionalfuture.com/d/ and was not impressed by 'hash'
benchmark performance, so I tried to make my own implementation. But the problem
is that it seems to work for small values of 'n' but starts producing incorrect
results when it becomes larger (100000 for example, it gives 25659 instead of
18699). I use DMD 0.82 in Linux. If there is no bug in compiler, I would like to
know what's wrong with my program. Thanks.

Here is a smaller testcase: int main() { char[1] key; int[char[]] dictionary; key[0] = 'a'; dictionary[key] = 1; key[0] = 'b'; dictionary[key] = 2; foreach(char[] k, int v; dictionary) printf("dictionary[\"%.*s\"] = %d\n", k, v); return 0; } It gives the following results when executed: dictionary["b"] = 1 dictionary["b"] = 2 Replacing 'dictionary[key]' with 'dictionary[key.dup]' helps to fix this problem.

Basically, you have solved your own problem. It is one of the things you are going to have to just get used to - at least for now. You have a single array defined (key) and set it to "a", load a dictionary item ("a",1) which saves a reference to the array and the value 1. Then you change the array content (no size change, no copy) to "b" with value 2 which picks up a reference to the key (the char array) and the value 2. If you then tried to get the value for literal "a", you would get no answer for 2 reasons: 1. the reference for both entries points to an array (key) with current value "b". 2. the literal "a" doesn't match anything. Furthermore, if you set the array now to value "a" or "c", it would match both entries. Non-intuitive, you bet! I don't like the way it works either. Asking the dictionary to pick up a dup gets the value staticized and then intuitive matching can occur. But I don't see when you would ever want the results as they are used now unless you were loading a prebuilt array of keys with appropriate values or some other load method with distinct key and value references if they are arrays. And finally, if you add a line printf("key = %.*s, value = %d\n","x",dictionary["x"]); before your display loop, you will somehow add an entry "x" with a value of 0 just for having made the reference. I think there are problems in the current implementation, but I don't know whether it's by accident or by design. Maybe there are some rules for use that will prevent all this, but they are not posted anywhere I know about. Generally, the performance of the compiler and its code is very good, but there are still limitations (bugs?) in both the design and the implementation. Some of the library stuff is still in alpha state and can be too slow. Everything is still in an as-is - help improve it - state as long as the release level stays below 1.0 - sort-of alpha, sort-of beta, but coming along very nicely with lots of potential. Criticisms, suggested or donated code, bug reports, etc are all welcome. Join the party.
Apr 11 2004
parent reply larry cowan <larry_member pathlink.com> writes:
In article <c5b7km$osi$1 digitaldaemon.com>, serge lxnt.info says...


As far as your original program is concerned, after 6999, then it starts incrementing twice for each entry found instead of just the c++. This looks like a compiler bug. Also, you don't need the slices - "buf" works just as well - because the sprintf's are extern C, and create a D array for buf, but the strlen works because the \0 that the sprintf ends buf with is either still there, or carried back by the current D compiler. I don't know if you can count on this as a compiler specification or just as Walter's current method of implementation.
Apr 11 2004
parent reply serge <serge lxnt.info> writes:
In article <c5ckom$2t8g$1 digitaldaemon.com>, larry cowan says...
In article <c5b7km$osi$1 digitaldaemon.com>, serge lxnt.info says...


As far as your original program is concerned, after 6999, then it starts incrementing twice for each entry found instead of just the c++. This looks like a compiler bug.

The original program is not written right, no suprise it behaves strange :) Inserting the same ".dup" suffix fixes all the problems, at least for me.
Also, you don't need the slices - "buf" works just as well - because the
sprintf's are extern C, and create a D array for buf, but the strlen works
because the \0 that the sprintf ends buf with is either still there, or carried
back by the current D compiler. I don't know if you can count on this as a
compiler specification or just as Walter's current method of implementation.

Using just "buf" is not quite the same, it is a string 32 characters long with embedded '\0' bytes. But the program just needs hextradecimal strings as associative array indexes. For example, formatting strings using sprintf into a static buffer could result in the following two strings: "123\0A\0\0\0...", "123\0B\0\0\0...". So using just "buf.dup", they will be different keys, but using "buf[0..strlen(buf)].dup", they are the same. I wonder if it is possible to add runtime checks to detect modifications of associative array keys, as it could be a source of the bugs which are hard to debug (just forget to duplicate a string for array index and you are in a trouble). Also, experimenting more with static arrays, I came to the following (incorrect) program. It returns a pointer to the string, allocated on stack and DMD compiler does not report it as error. Making such bugs is very easy for people unfamiliar with D (newcomers like me) and maybe detection of this situation as error in a compiler would be useful. import std.string; char[] f() { char[32] x; strcpy(x, "abc"); return x; } int main() { printf("%.*s\n", f()); return 0; }
Apr 12 2004
next sibling parent reply School <school users.sf.net> writes:
serge :

 In article <c5ckom$2t8g$1 digitaldaemon.com>, larry cowan says...
 
In article <c5b7km$osi$1 digitaldaemon.com>, serge lxnt.info says...


As far as your original program is concerned, after 6999, then it starts incrementing twice for each entry found instead of just the c++. This looks like a compiler bug.

The original program is not written right, no suprise it behaves strange :) Inserting the same ".dup" suffix fixes all the problems, at least for me.
Also, you don't need the slices - "buf" works just as well - because the
sprintf's are extern C, and create a D array for buf, but the strlen works
because the \0 that the sprintf ends buf with is either still there, or carried
back by the current D compiler. I don't know if you can count on this as a
compiler specification or just as Walter's current method of implementation.

Using just "buf" is not quite the same, it is a string 32 characters long with embedded '\0' bytes. But the program just needs hextradecimal strings as associative array indexes. For example, formatting strings using sprintf into a static buffer could result in the following two strings: "123\0A\0\0\0...", "123\0B\0\0\0...". So using just "buf.dup", they will be different keys, but using "buf[0..strlen(buf)].dup", they are the same. I wonder if it is possible to add runtime checks to detect modifications of associative array keys, as it could be a source of the bugs which are hard to debug (just forget to duplicate a string for array index and you are in a trouble).

is only the way D handle the char, or even String. -- School, yet another nickname for anonymous. :D ;-D
Apr 12 2004
parent reply J Anderson <REMOVEanderson badmama.com.au> writes:
School wrote:

 Hope future releases would give a warning. In fact, it is not a bug, it

is only the way D handle the char, or even String.
  

either true or false, not in between. -- -Anderson: http://badmama.com.au/~anderson/
Apr 12 2004
parent reply School <school users.sf.net> writes:
J Anderson wrote:
 There are no warnings in D. With warnings, Walter believes that it's
 either true or false, not in between.
 

It only because D compiler has no "so-called" warnings. -- School, yet another nickname for anonymous. :D ;-D
Apr 12 2004
parent J Anderson <REMOVEanderson badmama.com.au> writes:
School wrote:

J Anderson wrote:
  

There are no warnings in D. With warnings, Walter believes that it's
either true or false, not in between.

    

It only because D compiler has no "so-called" warnings.

Walter: " D compilers will not generate warnings for questionable code. Code will either be acceptable to the compiler or it will not be. This will eliminate any debate about which warnings are valid errors and which are not, and any debate about what to do with them. The need for compiler warnings is symptomatic of poor language design." http://www.digitalmars.com/d/overview.html I agree. Once you've used several different C++ compiles with different errors and interpretations of warnings, they become a real pain. ...And then there are people who simply ignore warnings until they have hundreds. What's the use of warnings then, your not going to read them all and neither is anyone new who jumps on the team (and if they do they're wasting there time as well because you know 99% are ok). If you go through and fix/hide them anyway, then they should be errors in the first place. -- -Anderson: http://badmama.com.au/~anderson/
Apr 12 2004
prev sibling parent larry cowan <larry_member pathlink.com> writes:
In article <c5e1im$20dn$1 digitaldaemon.com>, serge says...

Using just "buf" is not quite the same, it is a string 32 characters long with
embedded '\0' bytes. But the program just needs hextradecimal strings as
associative array indexes. For example, formatting strings using sprintf into a
static buffer could result in the following two strings: "123\0A\0\0\0...",
"123\0B\0\0\0...". So using just "buf.dup", they will be different keys, but
using "buf[0..strlen(buf)].dup", they are the same.

(I presume you are talking about linefeed x0A vs carriage return x0D). To slice the last char off the string you need "buf[0..strlen(buf)-1]" whether or not you are using ".dup". You are right about the length of buf, but that opens the question of whether the key compares should be string compares or array compares - should "123/0456/0789/0/0/..." be the same as "123/0654/0987/0/0..." or not?
I wonder if it is possible to add runtime checks to detect modifications of
associative array keys, as it could be a source of the bugs which are hard to
debug (just forget to duplicate a string for array index and you are in a
trouble). 

That would probably be terribly expensive for the compiler to insert, or for you to do it either. I feel that the point where the string is taken as a key for the array should be a place where the compiler takes a copy not a pointer. It should have frozen data there for the key, not some multiuse area with changeable contents. I don't like the potential for obscure tricks being developed with the current method, or worse, the likelihood of insecure programs being developed so easily.
Apr 13 2004
prev sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
On Sun, 11 Apr 2004 10:45:42 +0000 (UTC), serge lxnt.info wrote:

I have just found D language and was interested in its performance. I found this
page http://www.functionalfuture.com/d/ and was not impressed by 'hash'
benchmark performance, so I tried to make my own implementation. 

Any luck improving the hash test performance? Also I think the matrix mult test needs attention. It should probably use the type int[30][30] instead of int[][] see multi-dimensional arrays in http://www.digitalmars.com/d/arrays.html -Ben
Apr 14 2004
parent reply serge <serge_member pathlink.com> writes:
In article <ibvp70hmjvjrj5kf17cajl7l3mesomtv5v 4ax.com>, Ben Hinkle says...

On Sun, 11 Apr 2004 10:45:42 +0000 (UTC), serge lxnt.info wrote:

I have just found D language and was interested in its performance. I found this
page http://www.functionalfuture.com/d/ and was not impressed by 'hash'
benchmark performance, so I tried to make my own implementation. 

Any luck improving the hash test performance?

Results are in the attached file.
Also I think the matrix mult test needs attention.
It should probably use the type
 int[30][30]
instead of
 int[][]


see multi-dimensional arrays in
 http://www.digitalmars.com/d/arrays.html

-Ben

Apr 14 2004
parent reply serge <serge lxnt.info> writes:
In article <c5k6vs$2l9u$1 digitaldaemon.com>, serge says...

Results are in the attached file.

Well, I think these results need some comments. I compared performance of 4 programs, one compiled with GCC and 3 compiled with DMD. hash.c - hash test in pure C (from http://www.bagley.org/~doug/shootout/) hash.d - test.c ported to D hash.native.old.d - hash test from http://www.functionalfuture.com/d/ hash.native.new.d - my attempt to improve performance of hash test compiler command line options: gcc: -O3 -fomit-frame-pointer dmd: -O -release -inline performance results for n=1000000 on Athlon 850: hash.c : 3.690 seconds hash.d : 4.580 seconds hash.native.old.d : 26.110 seconds hash.native.new.d : 13.000 seconds So, hash.d and hash.c times differ not too much. They contain almost identical code and just compare the quality of optimizers in GCC and DMD. Given the fact that DMD is still not particularly optimized for performance, this result is very good. But difference between hash.d and hash.native.new.d is very noticeable. This can be an indication of poor associative array performance (DMD native associative arrays are about 3 times slower than hand written hash implementation). I hope this is not a design flaw and can be fixed in the future.
Apr 14 2004
next sibling parent Ben Hinkle <bhinkle4 juno.com> writes:
 But difference between hash.d and hash.native.new.d is very noticeable. This
can
 be an indication of poor associative array performance (DMD native associative
 arrays are about 3 times slower than hand written hash implementation). I hope
 this is not a design flaw and can be fixed in the future.

Interesting. I threw in some rehash calls during the first loop and it sped things up somewhat but not very close to hash.d - I wonder how much improvement there would be by prallocating the size by allowing the length to be set in associative arrays? It could even use the same prime number table in hash.d It would also be fun to play around with the hashing function to see if it makes any difference. Or inlining. or the GC. or... I wonder what a profiler like VTune says. -Ben
Apr 14 2004
prev sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
 hash.c            : 3.690 seconds
 hash.d            : 4.580 seconds
 hash.native.old.d : 26.110 seconds
 hash.native.new.d : 13.000 seconds

I played around a bit and have some more data: all code compiled with gcc -O3 -fomit-frame-pointer hash.c gdc -O3 -fomit-frame-pointer hash.d (same flags for phobos, too) hash.c 3.06 hash.d 3.13 hash.native.new.d 9.4 + pre-allocate 4.85 + new hash fcn 4.68 by "pre-allocate" I mean the line in aaA.d *aa = new pa[10]; is replaced with *aa = new pa[1572869]; // magic number from hash.c for 1000000 by "new hash fcn" I mean the function getHash in ti_Aa.d is replaced with (from www.cs.yorku.ca/~oz/hash.html) uint getHash(void*p) { char[] s = *(char[]*)p; uint len = s.length; char *str = s; uint hash = 5381; while (len--) hash = hash*33 ^ *str++; return hash; } I'll keep investigating where the time is going. I think it makes sense to add some way to reserve a large hash table size like in the C code. Building the hash table up one element at a time and rehashing every now and then is slow when you know you want a big table. -Ben ps. a big hurrah for gdc - I love it! We aren't limited to just playing around with phobos - we can actually go in and try adding .reserve properties and such. And plus it is simple to compare C and D code side by side. very very cool.
Apr 15 2004
parent C <dont respond.com> writes:
 I'll keep investigating where the time is going.

 ps. a big hurrah for gdc - I love it!

Cool! I havent played with it yet, is it pretty stable ? C On Thu, 15 Apr 2004 22:17:53 -0400, Ben Hinkle <bhinkle4 juno.com> wrote= :
 hash.c            : 3.690 seconds
 hash.d            : 4.580 seconds
 hash.native.old.d : 26.110 seconds
 hash.native.new.d : 13.000 seconds

I played around a bit and have some more data: all code compiled with gcc -O3 -fomit-frame-pointer hash.c gdc -O3 -fomit-frame-pointer hash.d (same flags for phobos, too) hash.c 3.06 hash.d 3.13 hash.native.new.d 9.4 + pre-allocate 4.85 + new hash fcn 4.68 by "pre-allocate" I mean the line in aaA.d *aa =3D new pa[10]; is replaced with *aa =3D new pa[1572869]; // magic number from hash.c for 1000000 by "new hash fcn" I mean the function getHash in ti_Aa.d is replaced =

 with (from www.cs.yorku.ca/~oz/hash.html)
 uint getHash(void*p)
 { char[] s =3D *(char[]*)p;
    uint len =3D s.length;
    char *str =3D s;
    uint hash =3D 5381;
    while (len--) hash =3D hash*33 ^ *str++;
    return hash;
 }

 I'll keep investigating where the time is going. I think it makes sens=

 to add some way to reserve a large hash table size like in the C code.=

 Building the hash table up one element at a time and rehashing every n=

 and then is slow when you know you want a big table.
 -Ben

 ps. a big hurrah for gdc - I love it! We aren't limited to just playin=

 around with phobos - we can actually go in and try adding .reserve =

 properties and such. And plus it is simple to compare C and D code sid=

 by side. very very cool.

-- = D Newsgroup.
Apr 15 2004