www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Re: Latest string_token Code

reply bearophile <bearophileHUGS lycos.com> writes:
Ben Hanson:

Even if you are an expert C++ programmer, I can express few more comments about
your code, to show how to program in D (here I comment only the first example
of each thing. More cases can follow).

------------------

You can write this:

template regex(CharT)
{
struct BasicStringToken
{

Like this:

template regex(CharT)
{
    struct BasicStringToken
    {

------------------

In this part:

    void negate()
    {
        CharT curr_char = START_CHAR;
        CharT[] temp;
        CharT *ptr = cast(CharT *) 0;


This is closer to normal D code, because in D there is "null", and in D the *
of pointer definitions is better written on the left, because it's part of the
type:

    void negate()
    {
        CharT curr_char = START_CHAR;
        CharT[] temp;
        CharT* ptr = null;


But the idiomatic D code is this because pointers are automatically initialized
to null, and nornally in D variable names are camelCase with a starting
lowercase (sometimes I'd like to use underscores, but this is the D style.
Constant names can contain underscores in D, I presume):

    void negate()
    {
        CharT currChar = START_CHAR;
        CharT[] temp;
        CharT* ptr;

------------------

This line:
        
            else if (!overlap.charset.length == 0)

I don't like it a lot, maybe this is better:

            else if (overlap.charset.length)

------------------

This code:

        else if (negated)
        {
            intersectNegated(rhs, overlap);
        }
        else // negated == false
        {
            intersectCharset(rhs, overlap);
        }


There is no need of that comment, never add useless noise to the code:

        else if (negated)
        {
            intersectNegated(rhs, overlap);
        }
        else
        {
            intersectCharset(rhs, overlap);
        }

------------------

I see that in your code you lot of pointers. Pointers can be used in D, but I
suggest you to use them only when necessary. Maybe some usage of pointers can
be replaced by normal array indexes, that are safer too (because in D in
nonrelease mode array bounds are tested).

For some situations I have created in D2 a "safer pointer" that in release mode
is as efficient as a pointer but in nonrelease mode asserts if you make it step
out of a pre-specified memory interval. I don't think lot of people here has
appreciated it, but I have used it to catch some of my pointer-releated bugs.

Bye,
bearophile
Jun 22 2010
parent reply Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 Ben Hanson:
 Even if you are an expert C++ programmer, I can express few more comments about

each thing. More cases can follow). It's hard to be an expert in C++ these days, particularly when posting to a group frequented by Andrei! :-D
 ------------------
 You can write this:
 template regex(CharT)
 {
 struct BasicStringToken
 {
 Like this:
 template regex(CharT)
 {
     struct BasicStringToken
     {
 ------------------
 In this part:
     void negate()
     {
         CharT curr_char = START_CHAR;
         CharT[] temp;
         CharT *ptr = cast(CharT *) 0;
 This is closer to normal D code, because in D there is "null", and in D the *
of

     void negate()
     {
         CharT curr_char = START_CHAR;
         CharT[] temp;
         CharT* ptr = null;
 But the idiomatic D code is this because pointers are automatically initialized

(sometimes I'd like to use underscores, but this is the D style. Constant names can contain underscores in D, I presume):
     void negate()
     {
         CharT currChar = START_CHAR;
         CharT[] temp;
         CharT* ptr;
 ------------------

I forgot about auto initialisation in D. D'oh!
 This line:
             else if (!overlap.charset.length == 0)
 I don't like it a lot, maybe this is better:
             else if (overlap.charset.length)

This is just a bug. Should be: else if (!overlap.charset.length)
 ------------------
 This code:
         else if (negated)
         {
             intersectNegated(rhs, overlap);
         }
         else // negated == false
         {
             intersectCharset(rhs, overlap);
         }
 There is no need of that comment, never add useless noise to the code:
         else if (negated)
         {
             intersectNegated(rhs, overlap);
         }
         else
         {
             intersectCharset(rhs, overlap);
         }

Those comments were deliberate as a 'yes I do mean that', but I've removed them anyway.
 ------------------
 I see that in your code you lot of pointers. Pointers can be used in D, but I

replaced by normal array indexes, that are safer too (because in D in nonrelease mode array bounds are tested).
 For some situations I have created in D2 a "safer pointer" that in release mode

out of a pre-specified memory interval. I don't think lot of people here has appreciated it, but I have used it to catch some of my pointer-releated bugs.
 Bye,
 bearophile

All the code for this library needs to absolutely as fast as it can be. As it turns out, by intersecting regex charsets once in the code then it won't take that many cycles, but the only question I have is: Will the optimiser create as fast code if you use indexes compared to pointers? Updated code follows. Thanks, Ben module main; import std.algorithm; import std.array; import std.c.string; import std.string; import std.stdio; template regex(CharT) { struct BasicStringToken { bool negated = false; CharT[] charset; enum size_t MAX_CHARS = CharT.max + 1; enum size_t START_CHAR = cast(CharT) 0x80 < 0 ? 0x80 : 0; this(const bool negated_, ref CharT[] charset_) { negated = negated_; charset = charset_; } void removeDuplicates() { charset.sort; squeeze(charset); } void normalise() { if (charset.length == MAX_CHARS) { negated = !negated; charset.clear(); } else if (charset.length > MAX_CHARS / 2) { negate(); } } void negate() { CharT currChar = START_CHAR; CharT[] temp; CharT *ptr; CharT *curr = charset.ptr; CharT *end = curr + charset.length; size_t i = 0; negated = !negated; temp.length = MAX_CHARS - charset.length; ptr = temp.ptr; while (curr < end) { while (*curr > currChar) { *ptr = currChar; ++ptr; ++currChar; ++i; } ++currChar; ++curr; ++i; } for (; i < MAX_CHARS; ++i) { *ptr = currChar; ++ptr; ++currChar; } charset = temp; } bool empty() { return charset.length == 0 && !negated; } bool any() { return charset.length == 0 && negated; } void clear() { negated = false; charset.length = 0; } void intersect(ref BasicStringToken rhs, ref BasicStringToken overlap) { if ((any() && rhs.any()) || (negated == rhs.negated && !any() && !rhs.any())) { intersectSameTypes(rhs, overlap); } else { intersectDiffTypes(rhs, overlap); } } private: void intersectSameTypes(ref BasicStringToken rhs, ref BasicStringToken overlap) { if (any()) { clear(); overlap.negated = true; rhs.clear(); } else { CharT *iter = charset.ptr; CharT *end = iter + charset.length; CharT *rhsIter = rhs.charset.ptr; CharT *rhsEnd = rhsIter + rhs.charset.length; overlap.negated = negated; while (iter != end && rhsIter != rhsEnd) { if (*iter < *rhsIter) { ++iter; } else if (*iter > *rhsIter) { ++rhsIter; } else { overlap.charset ~= *iter; memmove(iter, iter + 1, (charset.ptr + charset.length - iter) * CharT.sizeof); --end; charset.length -= 1; memmove(rhsIter, rhsIter + 1, (rhs.charset.ptr + rhs.charset.length - rhsIter) * CharT.sizeof); --rhsEnd; rhs.charset.length -= 1; } } if (negated) { // duplicates already merged // src, dest merge(charset, overlap.charset); // duplicates already merged // src, dest merge(rhs.charset, overlap.charset); negated = false; rhs.negated = false; swap(charset, rhs.charset); normalise(); overlap.normalise(); rhs.normalise(); } else if (!overlap.charset.length) { normalise(); overlap.normalise(); rhs.normalise(); } } } void intersectDiffTypes(ref BasicStringToken rhs, ref BasicStringToken overlap) { if (any()) { intersectAny(rhs, overlap); } else if (negated) { intersectNegated(rhs, overlap); } else { intersectCharset(rhs, overlap); } } void intersectAny(ref BasicStringToken rhs, ref BasicStringToken overlap) { if (rhs.negated) { rhs.intersectNegated(this, overlap); } else { rhs.intersectCharset(this, overlap); } } void intersectNegated(ref BasicStringToken rhs, ref BasicStringToken overlap) { if (rhs.any()) { overlap.negated = true; overlap.charset = charset; rhs.negated = false; rhs.charset = charset; clear(); } else { rhs.intersectCharset(this, overlap); } } void intersectCharset(ref BasicStringToken rhs, ref BasicStringToken overlap) { if (rhs.any()) { overlap.charset = charset; rhs.negated = true; rhs.charset = charset; clear(); } else { CharT *iter = charset.ptr; CharT *end = iter + charset.length; CharT *rhsIter = rhs.charset.ptr; CharT *rhsEnd = rhsIter + rhs.charset.length; while (iter != end && rhsIter != rhsEnd) { if (*iter < *rhsIter) { overlap.charset ~= *iter; rhs.charset.length += 1; rhsIter = rhs.charset.ptr; rhsEnd = rhsIter + rhs.charset.length; memmove(rhsIter + 1, rhsIter, (rhs.charset.length - (rhsEnd - rhsIter - 1)) * CharT.sizeof); ++rhsIter; memmove(iter, iter + 1, (charset.ptr + charset.length - iter) * CharT.sizeof); charset.length -= 1; --end; } else if (*iter > *rhsIter) { ++rhsIter; } else { ++iter; ++rhsIter; } } if (iter != end) { CharT[] temp; temp.length = end - iter; memmove(temp.ptr, iter, temp.length * CharT.sizeof); // nothing bigger in rhs than iter // src, dest merge(temp, overlap.charset); memmove(iter, iter + 1, (charset.ptr + charset.length - iter) * CharT.sizeof); charset.length -= 1; } if (!overlap.charset.empty()) { merge(overlap.charset, rhs.charset); // possible duplicates, so check for any and erase. squeeze(rhs.charset); normalise(); overlap.normalise(); rhs.normalise(); } } } void squeeze(ref CharT[] str) { if (str.length > 1) { CharT *write = str.ptr; CharT *end = write + str.length; CharT *read = write + 1; while (read != end) { while (read != end && *read == *write) { ++read; } if (read == end) break; ++write; if (read > write) { *write = *read; } ++read; } str.length = write + 1 - str.ptr; } } void merge(ref CharT[] src, ref CharT[] dest) { CharT[] temp; CharT *ptr; CharT *iter = src.ptr; CharT *end = iter + src.length; CharT *destIter = dest.ptr; CharT *destEnd = destIter + dest.length; temp.length = src.length + dest.length; ptr = temp.ptr; while (iter != end && destIter != destEnd) { if (*iter < *destIter) { *ptr++ = *iter++; } else { *ptr++ = *destIter++; } } while (iter != end) { *ptr++ = *iter++; } while (destIter != destEnd) { *ptr++ = *destIter++; } dest = temp; } }; } int main(char[][]argv) { regex!(char).BasicStringToken lhs; regex!(char).BasicStringToken rhs; regex!(char).BasicStringToken intersect; lhs.charset = "aaabbc".dup; lhs.negated = true; lhs.removeDuplicates(); rhs.charset = "bccddd".dup; rhs.negated = true; rhs.removeDuplicates(); writeln(lhs.charset, '(', lhs.negated, ") intersect ", rhs.charset, '(', rhs.negated, ") ="); lhs.intersect(rhs, intersect); writeln(lhs.charset, '(', lhs.negated, "), ", rhs.charset, '(', rhs.negated, "), ", intersect.charset, '(', intersect.negated, ')'); return 0; }
Jun 22 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Ben Hanson:
 Will the optimiser create as fast code if you use indexes compared to pointers?

Do you mean the dmd optimizer or the llvm one? LLVM is able to digest array indexes and pointers about equally efficiently. In one important case dmd seems to produce slower code with pointers compared to arrays :-) In some situations it's better to use pointers because they allow a richer semantics, but in many situations arrays give better-looking (and a bit safer) code. If you have some free time you can create an array-based version and compare their performance. Otherwise if your original C++ code uses pointers then maybe it's better to keep them, to avoid translation bugs. You need lot of care to translate code between two different languages, you need a strong rigour. Another small thing I've seen in your code:
 template regex(CharT)
 {
 struct BasicStringToken
 {

 };
 }

D struct definitions don't need an ending semicolon, so it's better to not add it in D code. Bye, bearophile
Jun 22 2010
prev sibling next sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Tue, 22 Jun 2010 20:59:55 +0000, Ben Hanson wrote:

 All the code for this library needs to absolutely as fast as it can be.
 As it turns out, by intersecting regex charsets once in the code then it
 won't take that many cycles, but the only question I have is:
 
 Will the optimiser create as fast code if you use indexes compared to
 pointers?

Probably not always, but there is no reason it shouldn't. In the cases where it doesn't, I'd say it is a compiler bug. In my opinion, you should always go the clean and safe route, and use arrays. And if someone complains that your code is slow, blame the compiler. ;) "Premature optimization is the root of all evil." -- Donald Knuth -Lars
Jun 23 2010
prev sibling parent "Rory McGuire" <rmcguire neonova.co.za> writes:
On Tue, 22 Jun 2010 22:59:55 +0200, Ben Hanson <Ben.Hanson tfbplc.co.uk>  
wrote:

 == Quote from bearophile (bearophileHUGS lycos.com)'s article
 Ben Hanson:
 Even if you are an expert C++ programmer, I can express few more  
 comments about

example of each thing. More cases can follow). It's hard to be an expert in C++ these days, particularly when posting to a group frequented by Andrei! :-D
 ------------------
 You can write this:
 template regex(CharT)
 {
 struct BasicStringToken
 {
 Like this:
 template regex(CharT)
 {
     struct BasicStringToken
     {
 ------------------
 In this part:
     void negate()
     {
         CharT curr_char = START_CHAR;
         CharT[] temp;
         CharT *ptr = cast(CharT *) 0;
 This is closer to normal D code, because in D there is "null", and in D  
 the * of

the type:
     void negate()
     {
         CharT curr_char = START_CHAR;
         CharT[] temp;
         CharT* ptr = null;
 But the idiomatic D code is this because pointers are automatically  
 initialized

lowercase (sometimes I'd like to use underscores, but this is the D style. Constant names can contain underscores in D, I presume):
     void negate()
     {
         CharT currChar = START_CHAR;
         CharT[] temp;
         CharT* ptr;
 ------------------

I forgot about auto initialisation in D. D'oh!
 This line:
             else if (!overlap.charset.length == 0)
 I don't like it a lot, maybe this is better:
             else if (overlap.charset.length)

This is just a bug. Should be: else if (!overlap.charset.length)
 ------------------
 This code:
         else if (negated)
         {
             intersectNegated(rhs, overlap);
         }
         else // negated == false
         {
             intersectCharset(rhs, overlap);
         }
 There is no need of that comment, never add useless noise to the code:
         else if (negated)
         {
             intersectNegated(rhs, overlap);
         }
         else
         {
             intersectCharset(rhs, overlap);
         }

Those comments were deliberate as a 'yes I do mean that', but I've removed them anyway.
 ------------------
 I see that in your code you lot of pointers. Pointers can be used in D,  
 but I

pointers can be replaced by normal array indexes, that are safer too (because in D in nonrelease mode array bounds are tested).
 For some situations I have created in D2 a "safer pointer" that in  
 release mode

it step out of a pre-specified memory interval. I don't think lot of people here has appreciated it, but I have used it to catch some of my pointer-releated bugs.
 Bye,
 bearophile

All the code for this library needs to absolutely as fast as it can be. As it turns out, by intersecting regex charsets once in the code then it won't take that many cycles, but the only question I have is: Will the optimiser create as fast code if you use indexes compared to pointers? Updated code follows. Thanks, Ben module main; import std.algorithm; import std.array; import std.c.string; import std.string; import std.stdio; template regex(CharT) { struct BasicStringToken { bool negated = false; CharT[] charset; enum size_t MAX_CHARS = CharT.max + 1; enum size_t START_CHAR = cast(CharT) 0x80 < 0 ? 0x80 : 0; this(const bool negated_, ref CharT[] charset_) { negated = negated_; charset = charset_; } void removeDuplicates() { charset.sort; squeeze(charset); } void normalise() { if (charset.length == MAX_CHARS) { negated = !negated; charset.clear(); } else if (charset.length > MAX_CHARS / 2) { negate(); } } void negate() { CharT currChar = START_CHAR; CharT[] temp; CharT *ptr; CharT *curr = charset.ptr; CharT *end = curr + charset.length; size_t i = 0; negated = !negated; temp.length = MAX_CHARS - charset.length; ptr = temp.ptr; while (curr < end) { while (*curr > currChar) { *ptr = currChar; ++ptr; ++currChar; ++i; } ++currChar; ++curr; ++i; } for (; i < MAX_CHARS; ++i) { *ptr = currChar; ++ptr; ++currChar; } charset = temp; } bool empty() { return charset.length == 0 && !negated; } bool any() { return charset.length == 0 && negated; } void clear() { negated = false; charset.length = 0; } void intersect(ref BasicStringToken rhs, ref BasicStringToken overlap) { if ((any() && rhs.any()) || (negated == rhs.negated && !any() && !rhs.any())) { intersectSameTypes(rhs, overlap); } else { intersectDiffTypes(rhs, overlap); } } private: void intersectSameTypes(ref BasicStringToken rhs, ref BasicStringToken overlap) { if (any()) { clear(); overlap.negated = true; rhs.clear(); } else { CharT *iter = charset.ptr; CharT *end = iter + charset.length; CharT *rhsIter = rhs.charset.ptr; CharT *rhsEnd = rhsIter + rhs.charset.length; overlap.negated = negated; while (iter != end && rhsIter != rhsEnd) { if (*iter < *rhsIter) { ++iter; } else if (*iter > *rhsIter) { ++rhsIter; } else { overlap.charset ~= *iter; memmove(iter, iter + 1, (charset.ptr + charset.length - iter) * CharT.sizeof); --end; charset.length -= 1; memmove(rhsIter, rhsIter + 1, (rhs.charset.ptr + rhs.charset.length - rhsIter) * CharT.sizeof); --rhsEnd; rhs.charset.length -= 1; } } if (negated) { // duplicates already merged // src, dest merge(charset, overlap.charset); // duplicates already merged // src, dest merge(rhs.charset, overlap.charset); negated = false; rhs.negated = false; swap(charset, rhs.charset); normalise(); overlap.normalise(); rhs.normalise(); } else if (!overlap.charset.length) { normalise(); overlap.normalise(); rhs.normalise(); } } } void intersectDiffTypes(ref BasicStringToken rhs, ref BasicStringToken overlap) { if (any()) { intersectAny(rhs, overlap); } else if (negated) { intersectNegated(rhs, overlap); } else { intersectCharset(rhs, overlap); } } void intersectAny(ref BasicStringToken rhs, ref BasicStringToken overlap) { if (rhs.negated) { rhs.intersectNegated(this, overlap); } else { rhs.intersectCharset(this, overlap); } } void intersectNegated(ref BasicStringToken rhs, ref BasicStringToken overlap) { if (rhs.any()) { overlap.negated = true; overlap.charset = charset; rhs.negated = false; rhs.charset = charset; clear(); } else { rhs.intersectCharset(this, overlap); } } void intersectCharset(ref BasicStringToken rhs, ref BasicStringToken overlap) { if (rhs.any()) { overlap.charset = charset; rhs.negated = true; rhs.charset = charset; clear(); } else { CharT *iter = charset.ptr; CharT *end = iter + charset.length; CharT *rhsIter = rhs.charset.ptr; CharT *rhsEnd = rhsIter + rhs.charset.length; while (iter != end && rhsIter != rhsEnd) { if (*iter < *rhsIter) { overlap.charset ~= *iter; rhs.charset.length += 1; rhsIter = rhs.charset.ptr; rhsEnd = rhsIter + rhs.charset.length; memmove(rhsIter + 1, rhsIter, (rhs.charset.length - (rhsEnd - rhsIter - 1)) * CharT.sizeof); ++rhsIter; memmove(iter, iter + 1, (charset.ptr + charset.length - iter) * CharT.sizeof); charset.length -= 1; --end; } else if (*iter > *rhsIter) { ++rhsIter; } else { ++iter; ++rhsIter; } } if (iter != end) { CharT[] temp; temp.length = end - iter; memmove(temp.ptr, iter, temp.length * CharT.sizeof); // nothing bigger in rhs than iter // src, dest merge(temp, overlap.charset); memmove(iter, iter + 1, (charset.ptr + charset.length - iter) * CharT.sizeof); charset.length -= 1; } if (!overlap.charset.empty()) { merge(overlap.charset, rhs.charset); // possible duplicates, so check for any and erase. squeeze(rhs.charset); normalise(); overlap.normalise(); rhs.normalise(); } } } void squeeze(ref CharT[] str) { if (str.length > 1) { CharT *write = str.ptr; CharT *end = write + str.length; CharT *read = write + 1; while (read != end) { while (read != end && *read == *write) { ++read; } if (read == end) break; ++write; if (read > write) { *write = *read; } ++read; } str.length = write + 1 - str.ptr; } } void merge(ref CharT[] src, ref CharT[] dest) { CharT[] temp; CharT *ptr; CharT *iter = src.ptr; CharT *end = iter + src.length; CharT *destIter = dest.ptr; CharT *destEnd = destIter + dest.length; temp.length = src.length + dest.length; ptr = temp.ptr; while (iter != end && destIter != destEnd) { if (*iter < *destIter) { *ptr++ = *iter++; } else { *ptr++ = *destIter++; } } while (iter != end) { *ptr++ = *iter++; } while (destIter != destEnd) { *ptr++ = *destIter++; } dest = temp; } }; } int main(char[][]argv) { regex!(char).BasicStringToken lhs; regex!(char).BasicStringToken rhs; regex!(char).BasicStringToken intersect; lhs.charset = "aaabbc".dup; lhs.negated = true; lhs.removeDuplicates(); rhs.charset = "bccddd".dup; rhs.negated = true; rhs.removeDuplicates(); writeln(lhs.charset, '(', lhs.negated, ") intersect ", rhs.charset, '(', rhs.negated, ") ="); lhs.intersect(rhs, intersect); writeln(lhs.charset, '(', lhs.negated, "), ", rhs.charset, '(', rhs.negated, "), ", intersect.charset, '(', intersect.negated, ')'); return 0; }

While you are using pointers to do all the copying, comparing etc... are you taking into account that the characters may span multiple CharTs?
Jun 23 2010