www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Table lookups - this is pretty definitive

reply Walter Bright <newshound2 digitalmars.com> writes:
Try this benchmark comparing various classification schemes:
---------------------------------
import core.stdc.stdlib;
import core.stdc.string;

import std.algorithm;
import std.array;
import std.ascii;
import std.datetime;
import std.range;
import std.stdio;
import std.traits;

bool isIdentifierChar0(ubyte c)
{
     return isAlphaNum(c) || c == '_' || c == '$';
}

bool isIdentifierChar1(ubyte c)
{
     return ((c >= '0' || c == '$') &&
             (c <= '9' || c >= 'A')  &&
             (c <= 'Z' || c >= 'a' || c == '_') &&
             (c <= 'z'));
}

immutable bool[256] tab2;
static this()
{
     for (size_t u = 0; u < 0x100; ++u)
     {
         tab2[u] = isIdentifierChar0(cast(ubyte)u);
     }
}

bool isIdentifierChar2(ubyte c)
{
     return tab2[c];
}

/*********************************************/

int f0()
{
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar0(cast(ubyte)u);
     }
     return x;
}

int f1()
{
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar1(cast(ubyte)u);
     }
     return x;
}

int f2()
{
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar2(cast(ubyte)u);
     }
     return x;
}

void main()
{
     auto r = benchmark!(f0, f1, f2)(10_000);
     writefln("Milliseconds %s %s %s", r[0].msecs, r[1].msecs, r[2].msecs);
}
Apr 01 2014
next sibling parent "w0rp" <devw0rp gmail.com> writes:
Milliseconds 22 14 10

Yeah, table lookup is the way to go for ASCII, it seems.
Apr 01 2014
prev sibling next sibling parent "w0rp" <devw0rp gmail.com> writes:
On Tuesday, 1 April 2014 at 18:41:03 UTC, w0rp wrote:
Milliseconds 22 14 10

Yeah, table lookup is the way to go for ASCII, it seems.

 Milliseconds 13 8 2

After I remember to use -inline -release -O
Apr 01 2014
prev sibling next sibling parent reply Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
01.04.2014 22:35, Walter Bright пишет:
 Try this benchmark comparing various classification schemes:
 ---------------------------------
 import core.stdc.stdlib;
 import core.stdc.string;

 import std.algorithm;
 import std.array;
 import std.ascii;
 import std.datetime;
 import std.range;
 import std.stdio;
 import std.traits;

 bool isIdentifierChar0(ubyte c)
 {
      return isAlphaNum(c) || c == '_' || c == '$';
 }

 bool isIdentifierChar1(ubyte c)
 {
      return ((c >= '0' || c == '$') &&
              (c <= '9' || c >= 'A')  &&
              (c <= 'Z' || c >= 'a' || c == '_') &&
              (c <= 'z'));
 }

 immutable bool[256] tab2;
 static this()
 {
      for (size_t u = 0; u < 0x100; ++u)
      {
          tab2[u] = isIdentifierChar0(cast(ubyte)u);
      }
 }

 bool isIdentifierChar2(ubyte c)
 {
      return tab2[c];
 }

 /*********************************************/

 int f0()
 {
      int x;
      for (uint u = 0; u < 0x100; ++u)
      {
          x += isIdentifierChar0(cast(ubyte)u);
      }
      return x;
 }

 int f1()
 {
      int x;
      for (uint u = 0; u < 0x100; ++u)
      {
          x += isIdentifierChar1(cast(ubyte)u);
      }
      return x;
 }

 int f2()
 {
      int x;
      for (uint u = 0; u < 0x100; ++u)
      {
          x += isIdentifierChar2(cast(ubyte)u);
      }
      return x;
 }

 void main()
 {
      auto r = benchmark!(f0, f1, f2)(10_000);
      writefln("Milliseconds %s %s %s", r[0].msecs, r[1].msecs, r[2].msecs);
 }

Some regular benchmark notes: 1. The first one passed to `benchmark` is always slower (e.g. pass `f2` and see). 2. Unexpected program behaviour changes: Let's use `benchmark!(f1, f1, f1)(1_000_000)`: Milliseconds 928 889 888 Then copy `isAlphaNum` in file (benchmark still shows same results) and change `dchar c` to `ubyte c`, result changes: Milliseconds 849 815 827 The difference is sufficient but `isAlphaNum` not called in benchmark. Also `f0` is faster than `f1`, benchmark!(f0, f0, f0)(1_000_000): Milliseconds 731 694 702 And `f2` wins, it's the only obvious thing, benchmark!(f2, f2, f2)(1_000_000): Milliseconds 227 184 184 Compiler used: dmd -O -inline -release -- Денис В. Шеломовский Denis V. Shelomovskij
Apr 01 2014
next sibling parent reply Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
01.04.2014 23:22, Denis Shelomovskij пишет:
 01.04.2014 22:35, Walter Bright пишет:
 Try this benchmark comparing various classification schemes:
 ---------------------------------
 import core.stdc.stdlib;
 import core.stdc.string;

 import std.algorithm;
 import std.array;
 import std.ascii;
 import std.datetime;
 import std.range;
 import std.stdio;
 import std.traits;

 bool isIdentifierChar0(ubyte c)
 {
      return isAlphaNum(c) || c == '_' || c == '$';
 }

 bool isIdentifierChar1(ubyte c)
 {
      return ((c >= '0' || c == '$') &&
              (c <= '9' || c >= 'A')  &&
              (c <= 'Z' || c >= 'a' || c == '_') &&
              (c <= 'z'));
 }

 immutable bool[256] tab2;
 static this()
 {
      for (size_t u = 0; u < 0x100; ++u)
      {
          tab2[u] = isIdentifierChar0(cast(ubyte)u);
      }
 }

 bool isIdentifierChar2(ubyte c)
 {
      return tab2[c];
 }

 /*********************************************/

 int f0()
 {
      int x;
      for (uint u = 0; u < 0x100; ++u)
      {
          x += isIdentifierChar0(cast(ubyte)u);
      }
      return x;
 }

 int f1()
 {
      int x;
      for (uint u = 0; u < 0x100; ++u)
      {
          x += isIdentifierChar1(cast(ubyte)u);
      }
      return x;
 }

 int f2()
 {
      int x;
      for (uint u = 0; u < 0x100; ++u)
      {
          x += isIdentifierChar2(cast(ubyte)u);
      }
      return x;
 }

 void main()
 {
      auto r = benchmark!(f0, f1, f2)(10_000);
      writefln("Milliseconds %s %s %s", r[0].msecs, r[1].msecs,
 r[2].msecs);
 }

Some regular benchmark notes: 1. The first one passed to `benchmark` is always slower (e.g. pass `f2` and see). 2. Unexpected program behaviour changes: Let's use `benchmark!(f1, f1, f1)(1_000_000)`: Milliseconds 928 889 888 Then copy `isAlphaNum` in file (benchmark still shows same results) and change `dchar c` to `ubyte c`, result changes: Milliseconds 849 815 827 The difference is sufficient but `isAlphaNum` not called in benchmark. Also `f0` is faster than `f1`, benchmark!(f0, f0, f0)(1_000_000): Milliseconds 731 694 702 And `f2` wins, it's the only obvious thing, benchmark!(f2, f2, f2)(1_000_000): Milliseconds 227 184 184 Compiler used: dmd -O -inline -release

Few more words about `dmd`: "Hey, silly, you still use `== x` to compare with `x`? Use table lookups! And never cast `size_t` to `ubyte`! See: --- import std.datetime; import std.stdio; immutable bool[256] tab2; static this() { foreach(size_t u; 0 .. 0x100) tab2[u] = u == '_'; } /*********************************************/ int f0() { int x; foreach(uint u; 0 .. 0x100) x += u == '_'; return x; } int f2() { int x; foreach(uint u; 0 .. 0x100) x += tab2[cast(ubyte)u]; return x; } int f2plus() { int x; foreach(size_t u; 0 .. 0x100) x += tab2[u]; return x; } void main() { auto r = benchmark!(f0, f0, f0)(1_000_000); writefln("Milliseconds %s %s %s (f0)", r[0].msecs, r[1].msecs, r[2].msecs); r = benchmark!(f2, f2, f2)(1_000_000); writefln("Milliseconds %s %s %s (f2)", r[0].msecs, r[1].msecs, r[2].msecs); r = benchmark!(f2plus, f2plus, f2plus)(1_000_000); writefln("Milliseconds %s %s %s (f2plus)", r[0].msecs, r[1].msecs, r[2].msecs); } --- Milliseconds 323 274 281 (f0) Milliseconds 185 185 185 (f2) Milliseconds 105 97 105 (f2plus) --- P.S. I don't want to say anything with these results. I just have no idea what happens here and I don't want to investigate dmd's way of optimizing things now. -- Денис В. Шеломовский Denis V. Shelomovskij
Apr 01 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 12:38 PM, Denis Shelomovskij wrote:
 I don't want to say anything with these results. I just have no idea what
 happens here and I don't want to investigate dmd's way of optimizing things
now.

This is why I recommend, when going all out in optimizing, that at some point you've gotta look at the assembler output of (insert your favorite compiler).
Apr 01 2014
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 12:22 PM, Denis Shelomovskij wrote:
 Compiler used: dmd -O -inline -release

For me, rewriting main() as: void main() { { auto r = benchmark!(f0, f1, f2)(100_000); writefln("Milliseconds %s %s %s", r[0].msecs, r[1].msecs, r[2].msecs); } { auto r = benchmark!(f0, f1, f2)(100_000); writefln("Milliseconds %s %s %s", r[0].msecs, r[1].msecs, r[2].msecs); } } Gives: Milliseconds 139 99 15 Milliseconds 122 93 13
Apr 01 2014
prev sibling next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:
 ---------------------------------
 import core.stdc.stdlib;
 import core.stdc.string;

 import std.algorithm;
 import std.array;
 import std.ascii;
 import std.datetime;
 import std.range;
 import std.stdio;
 import std.traits;

 bool isIdentifierChar0(ubyte c)
 {
     return isAlphaNum(c) || c == '_' || c == '$';
 }

 bool isIdentifierChar1(ubyte c)
 {
     return ((c >= '0' || c == '$') &&
             (c <= '9' || c >= 'A')  &&
             (c <= 'Z' || c >= 'a' || c == '_') &&
             (c <= 'z'));
 }

 immutable bool[256] tab2;
 static this()
 {
     for (size_t u = 0; u < 0x100; ++u)
     {
         tab2[u] = isIdentifierChar0(cast(ubyte)u);
     }
 }

 bool isIdentifierChar2(ubyte c)
 {
     return tab2[c];
 }

 /*********************************************/

 int f0()
 {
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar0(cast(ubyte)u);
     }
     return x;
 }

 int f1()
 {
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar1(cast(ubyte)u);
     }
     return x;
 }

 int f2()
 {
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar2(cast(ubyte)u);
     }
     return x;
 }

 void main()
 {
     auto r = benchmark!(f0, f1, f2)(10_000);
     writefln("Milliseconds %s %s %s", r[0].msecs, r[1].msecs, 
 r[2].msecs);
 }

It's not really possible to tell anything from that benchmark, especially with fancy modern optimisers and branch predictors. 1) ldc and gdc both optimise away some/all of the tests respectively. 2) once isIdentifierCharX is inlined, the compiler can determine what the results will be before-hand. 3) Even without 1) and 2), the results are unrealistically (very!) correlated, leading to a branch-prediction bonanza. I'm sure you know how significant that can be on modern CPUs. It is also very friendly to pre-fetching of the lookup table* Perhaps have the same benchmark, but working on realistic data from a file. *In my experience, the cost of using lookup tables often only appears in real code, where cache pressure and predictability becomes worse. This is especially true of lazy, range-based code. If several layers of a range use different lookup tables you can quickly end up ruining cache performance compared to an eager approach.
Apr 01 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 12:23 PM, John Colvin wrote:
 It's not really possible to tell anything from that benchmark, especially with
 fancy modern optimisers and branch predictors.

I disagree, read on.
 1) ldc and gdc both optimise away some/all of the tests respectively.
 2) once isIdentifierCharX is inlined, the compiler can determine what the
 results will be before-hand.

This is always an issue with benchmarking in order to find the best way to write an algorithm. I just adjust it so the particular compiler used can't throw it away.
 3) Even without 1) and 2), the results are unrealistically (very!) correlated,
 leading to a branch-prediction bonanza. I'm sure you know how significant that
 can be on modern CPUs. It is also very friendly to pre-fetching of the lookup
 table*

This is exactly why I wanted to test it.
 Perhaps have the same benchmark, but working on realistic data from a file.

Switching to it in Warp produced a very significant speedup on real files. https://github.com/facebook/warp/pull/5
Apr 01 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 2:30 PM, Peter Alexander wrote:
 Any idea how it is comparing to clang with this improvement?

No data yet. Stay tuned.
Apr 01 2014
prev sibling next sibling parent reply "JR" <zorael gmail.com> writes:
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 immutable bool[256] tab2;
 static this()
 {
     for (size_t u = 0; u < 0x100; ++u)
     {
         tab2[u] = isIdentifierChar0(cast(ubyte)u);
     }
 }

Table lookups are awesome. While tab2 there lends itself to being populated in a static this() it would really lower the bar if we could define static immutable AA literals at compile-time. Or at least CTFE-fill them. Having to rely on static this() for such is unfortunate and hacky. Relevant is http://forum.dlang.org/thread/oxnwtojtdymopgvanbwz forum.dlang.org in which I get a speed increase of >3.5x by doing string-to-enum translation/mapping using someEnum[string] AA lookups, instead of using someString.to!someEnum. (Given CtAAs, std.conv.to could then generate such for various translations where cheap -- assuming we still like AAs?)
Apr 01 2014
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 1:57 PM, Andrej Mitrovic wrote:
 immutable bool[256] tab2 =
      iota(0, 0x100)
      .map!(i => isIdentifierChar0(cast(ubyte)i))
      .array;

Yup, that works. Purty nice!
Apr 01 2014
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 1:57 PM, Andrej Mitrovic wrote:
 immutable bool[256] tab2 =
      iota(0, 0x100)
      .map!(i => isIdentifierChar0(cast(ubyte)i))
      .array;

Incorporated your idea: https://github.com/facebook/warp/pull/5/files
Apr 01 2014
prev sibling next sibling parent "Tove" <tove fransson.se> writes:
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:
 ---------------------------------

Original program... 3 4 1 10x as many iterations... 36 47 19 I think this benchmark is flawed... 1) Apparently there are rounding issues... assuming linear scaling, 1.9 ms yields ~1ms? Probably better to use usecs... 2) Table lookups will nearly always win in isolated 1 function tests... but not necessarily in real apps.
Apr 01 2014
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
01-Apr-2014 22:35, Walter Bright пишет:
 Try this benchmark comparing various classification schemes:

 int f0()
 {
      int x;
      for (uint u = 0; u < 0x100; ++u)
      {
          x += isIdentifierChar0(cast(ubyte)u);
      }
      return x;
 }

 int f1()
 {
      int x;
      for (uint u = 0; u < 0x100; ++u)
      {
          x += isIdentifierChar1(cast(ubyte)u);
      }
      return x;
 }

 int f2()
 {
      int x;
      for (uint u = 0; u < 0x100; ++u)
      {
          x += isIdentifierChar2(cast(ubyte)u);
      }
      return x;
 }

Would strongly suggest to use 2 kinds of data - randomized and some text. And, of course, it must be loaded at run-time. -- Dmitry Olshansky
Apr 01 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 12:46 PM, Dmitry Olshansky wrote:
 Would strongly suggest to use 2 kinds of data - randomized and some text. And,
 of course, it must be loaded at run-time.

See Vladimir's post.
Apr 01 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Apr-2014 02:09, Walter Bright пишет:
 On 4/1/2014 12:46 PM, Dmitry Olshansky wrote:
 Would strongly suggest to use 2 kinds of data - randomized and some
 text. And,
 of course, it must be loaded at run-time.

See Vladimir's post.

stuff ;) -- Dmitry Olshansky
Apr 02 2014
prev sibling next sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 04/01/14 21:35, JR wrote:
 On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 immutable bool[256] tab2;
 static this()
 {
     for (size_t u = 0; u < 0x100; ++u)
     {
         tab2[u] = isIdentifierChar0(cast(ubyte)u);
     }
 }

Table lookups are awesome. While tab2 there lends itself to being populated in a static this() it would really lower the bar if we could define static immutable AA literals at compile-time. Or at least CTFE-fill them. Having to rely on static this() for such is unfortunate and hacky.

You mention AAs, so you may already realize this, but there is no need for a mod ctor in the above example: immutable bool[256] tab2 = { bool t[256]; for (size_t u = 0; u < 0x100; ++u) t[u] = isIdentifierChar0(cast(ubyte)u); return t; }(); "Static immutable AAs" can be built at CT likewise, it's just that their type will be different from "std" RT AAs. artur
Apr 01 2014
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 4/1/14, Artur Skawina <art.08.09 gmail.com> wrote:
 You mention AAs, so you may already realize this, but there is no
 need for a mod ctor in the above example:

    immutable bool[256] tab2 = {
       bool t[256];
       for (size_t u = 0; u < 0x100; ++u)
           t[u] = isIdentifierChar0(cast(ubyte)u);
       return t;
    }();

Also (untested): immutable bool[256] tab2 = iota(0, 0x100) .map!(i => isIdentifierChar0(cast(ubyte)i)) .array;
Apr 01 2014
prev sibling next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
 Switching to it in Warp produced a very significant speedup on 
 real files.

 https://github.com/facebook/warp/pull/5

That's to be expected as warp will call it often and it will stay in the cache. For a program that calls it less often the other methods may be faster. Basically, it depends on the use case. Glad to see it's helped in warp. Any idea how it is comparing to clang with this improvement?
Apr 01 2014
prev sibling next sibling parent "Ivan Kazmenko" <gassa mail.ru> writes:
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:
 ---------------------------------

10_000 iterations: Milliseconds 7 10 2 Milliseconds 5 5 1 Milliseconds 8 10 2 100_000 iterations: Milliseconds 78 93 21 Milliseconds 78 94 21 Milliseconds 78 92 21 1_000_000 iterations: Milliseconds 584 614 146 Milliseconds 541 612 147 Milliseconds 502 609 146 Milliseconds 529 613 147 Milliseconds 539 606 147 This is DMD 32-bit 2.065.0 on an Intel Core-i7. The options are -O -release -inline -noboundscheck. A bit of comments: 1. The benchmark seems rather artificial to me. The natural order of input is fixed, which is nice for both the branch predictor (long streaks of each branch) and the array access (sequential). 2. The latter function wins, no wonder though, since there's no branching. (I tried to augment it by actually using the cache in between, but it still wins against my attempts so far.) Ivan Kazmenko.
Apr 01 2014
prev sibling next sibling parent "Ivan Kazmenko" <gassa mail.ru> writes:
On Tuesday, 1 April 2014 at 21:34:29 UTC, Ivan Kazmenko wrote:
 On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:

 2. The latter function wins, no wonder though, since there's no 
 branching.  (I tried to augment it by actually using the cache 
 in between, but it still wins against my attempts so far.)

The "it" being augmented is the benchmark, the "it" winning being the function isIdentifierChar2. This modification of the benchmark pretends to use a stack-allocated array alongside calling the funcions of interest: http://dpaste.dzfl.pl/1ed247445308 The timings are: Milliseconds 623 590 229 Milliseconds 629 592 231 Milliseconds 619 596 233 Milliseconds 524 592 283 Funnily, on DPaste, the picture changes: Milliseconds 1864 1349 1303 Perhaps there is no "-O -release -inline -noboundscheck" tuning there. Ivan Kazmenko.
Apr 01 2014
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:

I modified it a bit to use the Phobos source code as the test data: http://dump.thecybershadow.net/7a685ffb8588ec836e3ae0dba14e0618/test.d My results (DMD git HEAD, -O -inline -release): x86: 41070 29642 5169 x64: 42975 29759 5822 I've been using table lookups in my library for a while, e.g. for ASCII case conversions: https://github.com/CyberShadow/ae/blob/master/utils/text.d#L323-L341 or escaping JSON strings: https://github.com/CyberShadow/ae/blob/master/utils/json.d#L150-L188
Apr 01 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/1/2014 3:00 PM, Vladimir Panteleev wrote:
 I modified it a bit to use the Phobos source code as the test data:
 http://dump.thecybershadow.net/7a685ffb8588ec836e3ae0dba14e0618/test.d

Nice. I especially like the use of dirEntries to read all the text. That should go into the phobos documentation!
Apr 01 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Andrej Mitrovic:

 Also (untested):

 immutable bool[256] tab2 =
     iota(0, 0x100)
     .map!(i => isIdentifierChar0(cast(ubyte)i))
     .array;

Eventually this should work, and avoid the nasty cast: immutable bool[256] tab3 = ubyte .max .iota!"[]" .map!isIdentifierChar0 .array; Bye, bearophile
Apr 01 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Artur Skawina:

 You mention AAs, so you may already realize this, but there is 
 no
 need for a mod ctor in the above example:

    immutable bool[256] tab2 = {
       bool t[256];
       for (size_t u = 0; u < 0x100; ++u)
           t[u] = isIdentifierChar0(cast(ubyte)u);
       return t;
    }();

A recent improvement in foreach loops allows to remove the cast from your code, but I have found a new compiler bug: https://d.puremagic.com/issues/show_bug.cgi?id=12504 Bye, bearophile
Apr 01 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 Incorporated your idea:

A bit of statistics. I have counted about 260 "cast(" in the whole d files (total is 7752 CLOC lines of D code, plus 1008 blank lines and 905 lines of comments. I have not counted the lines of unittests). The files with the higher number of casts are: macros.d, with 86 casts constexpr.d, with 39 casts directive.d, with 30 casts lexer.d, with 28 casts stringlit.d, with 18 casts id.d, with 16 casts context.d, with 11 casts skip.d, with 9 casts Bye, bearophile
Apr 01 2014
prev sibling next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Do I see it right that it really comes down to one use of
isIdentifierStart(c) in one huge switch-case? Since "warp"
seems to be interested only in ASCII properties since it is
processing C code, could the switch-case be optimized into a
256 entries large jump table?

If a table lookup (char->bool) is faster than multiple
comparisons, why shouldn't this apply to char->address as well?

-- 
Marco
Apr 01 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Apr-2014 12:27, Daniel N пишет:
 On Wednesday, 2 April 2014 at 03:47:58 UTC, Marco Leise wrote:
 Do I see it right that it really comes down to one use of
 isIdentifierStart(c) in one huge switch-case? Since "warp"
 seems to be interested only in ASCII properties since it is
 processing C code, could the switch-case be optimized into a
 256 entries large jump table?

 If a table lookup (char->bool) is faster than multiple
 comparisons, why shouldn't this apply to char->address as well?

I was thinking the same, sure would be nice if computed goto:s were standardized. (as suggested by Bearophile many times)

_Explicit_ tail call (aka switch-call) would have been cleaner and more modular. It covers all relevant cases of computed-goto such as threaded-code interpreters and also recursion-heavy functional code. -- Dmitry Olshansky
Apr 02 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Apr-2014 15:26, bearophile пишет:
 Dmitry Olshansky:

 _Explicit_ tail call (aka switch-call) would have been cleaner and
 more modular. It covers all relevant cases of computed-goto such as
 threaded-code interpreters and also recursion-heavy functional code.

It looks like an additive change. Can you show a possible syntax and semantics for D? If it's so good it could also become a DIP later.

Easily. In grammar, a new kind of statement: SwitchCallStatment: goto CallExpression; Semantics: Same as that of a function call expression except: a) Can Switch-Call only to a function with the same return type. b) Anything below this statement is unreachable, i.e. treat it as return. c) Stack frame of the current function is overwritten with that of switched-to function. In other words after a switch-call stack looks as if the switched-to function was called instead in the first place. The last point is what modern tail-call optimizations do when one has 2 functions that tail-recurse to each other. They overwrite stackframes on such tail calls. However this only happens with certain optimization switch which IMO makes the thing totally unreliable. In fact I even observed that I could implement threaded-code interpreter by relying on tail-call optimization with LDC, but only when compiling with all optimizations enabled. Unoptimized builds simply blowup stack. This "semantics are tied to optimization" is a situation that I hate about C/C++ and would hope to address in D. -- Dmitry Olshansky
Apr 02 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/2/2014 12:03 PM, Dmitry Olshansky wrote:
 SwitchCallStatment:
      goto CallExpression;

 Semantics:
 Same as that of a function call expression except:
 a) Can Switch-Call only to a function with the same return type.
 b) Anything below this statement is unreachable, i.e. treat it as return.
 c) Stack frame of the current function is overwritten with that of switched-to
 function. In other words after a switch-call stack looks as if the switched-to
 function was called instead in the first place.

 The last point is what modern tail-call optimizations do when one has 2
 functions that tail-recurse to each other. They overwrite stackframes on such
 tail calls.

How is this different from simply recognizing "return foo(arg);" as being a goto?
 However this only happens with certain optimization switch which IMO makes the
 thing totally unreliable. In fact I even observed that I could implement
 threaded-code interpreter by relying on tail-call optimization with LDC, but
 only when compiling with all optimizations enabled. Unoptimized builds simply
 blowup stack. This "semantics are tied to optimization" is a situation that I
 hate about C/C++ and would hope to address in D.

Doesn't seem terribly different than the "force inline" feature discussion.
Apr 02 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Apr-2014 00:39, Walter Bright пишет:
 On 4/2/2014 12:03 PM, Dmitry Olshansky wrote:
 SwitchCallStatment:
      goto CallExpression;

 Semantics:
 Same as that of a function call expression except:
 a) Can Switch-Call only to a function with the same return type.
 b) Anything below this statement is unreachable, i.e. treat it as return.
 c) Stack frame of the current function is overwritten with that of
 switched-to
 function. In other words after a switch-call stack looks as if the
 switched-to
 function was called instead in the first place.

 The last point is what modern tail-call optimizations do when one has 2
 functions that tail-recurse to each other. They overwrite stackframes
 on such
 tail calls.

How is this different from simply recognizing "return foo(arg);" as being a goto?

If we can alter semantics of return foo(arg); to _always_ be a goto, guarantee a jump and reuse of the stack, then I'm all for it.
 However this only happens with certain optimization switch which IMO
 makes the
 thing totally unreliable. In fact I even observed that I could implement
 threaded-code interpreter by relying on tail-call optimization with
 LDC, but
 only when compiling with all optimizations enabled. Unoptimized builds
 simply
 blowup stack. This "semantics are tied to optimization" is a situation
 that I
 hate about C/C++ and would hope to address in D.

Doesn't seem terribly different than the "force inline" feature discussion.

Yes, it shows the same problem: semantics and optimization are often mixed up. In this case however a program would segfault if the relevant optimization was not performed, it's not just be many times slower. -- Dmitry Olshansky
Apr 02 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/2/2014 1:54 PM, Dmitry Olshansky wrote:
 If we can alter semantics of
 return foo(arg);
 to _always_ be a goto, guarantee a jump and reuse of the stack,  then I'm all
 for it.

I don't see why not. Note that we couldn't do this for extern(C) functions, or variadics, or caller functions with parameters that need destruction, or parameters that may refer to any locals. That last constraint might be a doozy, however. Why not submit an enhancement request to bugzilla?
Apr 02 2014
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/2/2014 3:06 PM, bearophile wrote:
 Walter Bright:

 I don't see why not. Note that we couldn't do this for extern(C) functions, or
 variadics, or caller functions with parameters that need destruction, or
 parameters that may refer to any locals. That last constraint might be a
 doozy, however.

 Why not submit an enhancement request to bugzilla?

So what does it happen if I (by mistake or by ignorance) try to use this on a function that doesn't satisfy one of those requirements?

Then it won't tail call it.
 D can't give an error in that case because it's a big breaking change. So is D
in that case just
 silently not performing the tail call as the programmer meant?

That's right.
Apr 02 2014
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/2/2014 6:21 PM, bearophile wrote:
 Then this feature needs a specific and explicit syntax, or it has _no_ point at
 all.

That's like saying inlining has no point because it doesn't have a particular syntax.
Apr 02 2014
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Apr-2014 05:05, Walter Bright пишет:
 On 4/2/2014 3:06 PM, bearophile wrote:
 Walter Bright:

 I don't see why not. Note that we couldn't do this for extern(C)
 functions, or
 variadics, or caller functions with parameters that need destruction, or
 parameters that may refer to any locals. That last constraint might be a
 doozy, however.

 Why not submit an enhancement request to bugzilla?

So what does it happen if I (by mistake or by ignorance) try to use this on a function that doesn't satisfy one of those requirements?

Then it won't tail call it.
 D can't give an error in that case because it's a big breaking change.
 So is D in that case just
 silently not performing the tail call as the programmer meant?

That's right.

From this it seems to me that once again you've walled yourself out of this question and going to frame it as an optimization. Tail-call has 2 sides to it: a) An optimization for functions that return with a call expression. I'd leave this case an optimization because it also has an undesirable effect on call stack, hindering debugging. b) An explicit instruction to switch execution to another function. This doesn't have any unexpected effects, and failing to make a tail-call will change the semantics of code. This is critical for certain kind of code, it simply doesn't work if tail-call is not a jump. I agree that nothing prevents doing optimization of return foo(args); in case a) if optimization is enabled, and this is something GCC/LLVM already do. But more importantly there should be an explicit tool for the case b), with syntax: goto foo(args); that will fail to compile if tail-call is not possible. Because the alternative of double-checking on every compiler that "return foo(args)" in all the right places is optimized the right way is fragile, and would turn out to be an _ongoing_ waste of time. -- Dmitry Olshansky
Apr 03 2014
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Apr-2014 01:51, Walter Bright пишет:
 On 4/2/2014 1:54 PM, Dmitry Olshansky wrote:
 If we can alter semantics of
 return foo(arg);
 to _always_ be a goto, guarantee a jump and reuse of the stack,  then
 I'm all
 for it.

I don't see why not. Note that we couldn't do this for extern(C) functions, or variadics, or caller functions with parameters that need destruction, or parameters that may refer to any locals. That last constraint might be a doozy, however. Why not submit an enhancement request to bugzilla?

I'd gladly author an enhancement once I'm certain of what exactly to state in it. Always optimizing return foo(args) to a jump may have impact on debugging experience (curious call-stacks) and cost people hours of wasted time chasing shadows. Adding an explicit form for it seems to me far more appealing and honest. -- Dmitry Olshansky
Apr 03 2014
prev sibling next sibling parent "Daniel N" <ufo orbiting.us> writes:
On Wednesday, 2 April 2014 at 03:47:58 UTC, Marco Leise wrote:
 Do I see it right that it really comes down to one use of
 isIdentifierStart(c) in one huge switch-case? Since "warp"
 seems to be interested only in ASCII properties since it is
 processing C code, could the switch-case be optimized into a
 256 entries large jump table?

 If a table lookup (char->bool) is faster than multiple
 comparisons, why shouldn't this apply to char->address as well?

I was thinking the same, sure would be nice if computed goto:s were standardized. (as suggested by Bearophile many times)
Apr 02 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Dmitry Olshansky:

 _Explicit_ tail call (aka switch-call) would have been cleaner 
 and more modular. It covers all relevant cases of computed-goto 
 such as threaded-code interpreters and also recursion-heavy 
 functional code.

It looks like an additive change. Can you show a possible syntax and semantics for D? If it's so good it could also become a DIP later. Bye, bearophile
Apr 02 2014
prev sibling next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:

 bool isIdentifierChar1(ubyte c)
 {
     return ((c >= '0' || c == '$') &&
             (c <= '9' || c >= 'A')  &&
             (c <= 'Z' || c >= 'a' || c == '_') &&
             (c <= 'z'));
 }

I'd like to point out this is quite a complicated function to begin with, so it doesn't generalize to all isXXX is ascii, for which the tests would be fairly simpler. In any case, (on my win32 machine) I can go from 810msecs to 500msecs using this function instead: bool isIdentifierChar1(ubyte c) { return c <= 'z' && ( 'a' <= c || ('0' <= c && (c <= '9' || c == '_' || ('A' <= c && c <= 'Z'))) || c == '$'); } That said, I'm abusing the fact that 50% of your bench is for chars over 0x80. If I loop only on actual ASCII you can find in text, (0x20 - 0X80), then those numbers "only" go from "320" => "300". Only slightly better, but still a win. *BUT*, if your functions were to accept any arbitrary codepoint, it would absolutely murder.
Apr 02 2014
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Apr-2014 20:52, Daniel N пишет:
 On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:

 bool isIdentifierChar1(ubyte c)
 {
    return ((c >= '0' || c == '$') &&
            (c <= '9' || c >= 'A')  &&
            (c <= 'Z' || c >= 'a' || c == '_') &&
            (c <= 'z'));
 }


Considering SSE4.2 was released already back in 2008 and was designed to accelerate this very use-case... Maybe it's more beneficial to add a few new code-paths... SSE4.2, NEON, etc.

Ideally we'd need: a) A collection of meaningful (high-level compared to std.simd) primitives that can be accelerated by SIMD, in Phobos. Such as searching, bit-counting etc. b) A way to automatically use whatever is best and supported by the hardware. GCC got a zero-overhead stuff for that since around 4.6 (?). -- Dmitry Olshansky
Apr 02 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/2/2014 12:38 PM, Dmitry Olshansky wrote:
 Ideally we'd need:
 a) A collection of meaningful (high-level compared to std.simd) primitives that
 can be accelerated by SIMD, in Phobos. Such as searching, bit-counting etc.

memchr is already optimized by SIMD, and std.algorithm.find uses memchr. In general, enormous effort has been poured into optimizing C standard library functions. If an algorithm can be implemented using such functions, we ought to.
Apr 02 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Apr-2014 00:46, Walter Bright пишет:
 On 4/2/2014 12:38 PM, Dmitry Olshansky wrote:
 Ideally we'd need:
 a) A collection of meaningful (high-level compared to std.simd)
 primitives that
 can be accelerated by SIMD, in Phobos. Such as searching, bit-counting
 etc.

memchr is already optimized by SIMD, and std.algorithm.find uses memchr. In general, enormous effort has been poured into optimizing C standard library functions. If an algorithm can be implemented using such functions, we ought to.

This is all good and well, but... memchr is working on the level of a single byte, leaving nothing for 2, 4, 8 byte-wide needles. Not type-safe. No fallback for CTFE. Obviously C standard is lacking many things, hence the endless row of extensions. -- Dmitry Olshansky
Apr 02 2014
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/2/14, 11:19 PM, monarch_dodra wrote:
 If I were to request the actual "memchr"/"wmemchr"/"dmemchr" functions
 in some "core.???" module, is this something we'd want, and would
 somebody know how to provide an efficient implementation?

Yes. Probably checking stuff with SIMD and a few other approaches would be great. Andrei
Apr 03 2014
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Apr-2014 10:19, monarch_dodra пишет:

 If I were to request the actual "memchr"/"wmemchr"/"dmemchr" functions
 in some "core.???" module, is this something we'd want, and would
 somebody know how to provide an efficient implementation?

Something I wanted in D since about 2011. Should have put an enhancement request or tipped bearophile ;) -- Dmitry Olshansky
Apr 03 2014
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/2/2014 4:49 AM, monarch_dodra wrote:
 That said, I'm abusing the fact that 50% of your bench is for chars over 0x80.
 If I loop only on actual ASCII you can find in text, (0x20 - 0X80), then those
 numbers "only" go from "320" => "300". Only slightly better, but still a win.

Surely a better approach would be to do a statistical analysis of character frequency in a representative corpus, and tune the compares that way. But the table lookup offered such a dramatic improvement, I didn't bother.
Apr 02 2014
prev sibling next sibling parent "Daniel N" <ufo orbiting.us> writes:
 On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:

 bool isIdentifierChar1(ubyte c)
 {
    return ((c >= '0' || c == '$') &&
            (c <= '9' || c >= 'A')  &&
            (c <= 'Z' || c >= 'a' || c == '_') &&
            (c <= 'z'));
 }


Considering SSE4.2 was released already back in 2008 and was designed to accelerate this very use-case... Maybe it's more beneficial to add a few new code-paths... SSE4.2, NEON, etc.
Apr 02 2014
prev sibling next sibling parent "Brad Anderson" <eco gnuk.net> writes:
On Tuesday, 1 April 2014 at 22:32:28 UTC, bearophile wrote:
 Eventually this should work, and avoid the nasty cast:

 immutable bool[256] tab3 =
     ubyte
     .max
     .iota!"[]"

I can't decide if this part is brilliant or terrible. Clever, either way.
Apr 02 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Brad Anderson:

 immutable bool[256] tab3 =
    ubyte
    .max
    .iota!"[]"

I can't decide if this part is brilliant or terrible. Clever, either way.

That iota enhancement has nothing terrible: the "[]" syntax is already used in Phobos for std.range.uniform, and it's quite needed, otherwise it can't generate the right bound of a type (like the last ubyte, last byte, last char, last int, etc). Bye, bearophile
Apr 02 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 I don't see why not. Note that we couldn't do this for 
 extern(C) functions, or variadics, or caller functions with 
 parameters that need destruction, or parameters that may refer 
 to any locals. That last constraint might be a doozy, however.

 Why not submit an enhancement request to bugzilla?

So what does it happen if I (by mistake or by ignorance) try to use this on a function that doesn't satisfy one of those requirements? D can't give an error in that case because it's a big breaking change. So is D in that case just silently not performing the tail call as the programmer meant? Bye, bearophile
Apr 02 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 D can't give an error in that case because it's a big breaking 
 change. So is D in that case just
 silently not performing the tail call as the programmer meant?

That's right.

Then this feature needs a specific and explicit syntax, or it has _no_ point at all. (Note: I am currently not asking for this feature in D. I have just shown one of its minimal requirements). Bye, bearophile
Apr 02 2014
prev sibling next sibling parent "Brad Anderson" <eco gnuk.net> writes:
On Wednesday, 2 April 2014 at 21:58:33 UTC, bearophile wrote:
 Brad Anderson:

 immutable bool[256] tab3 =
   ubyte
   .max
   .iota!"[]"

I can't decide if this part is brilliant or terrible. Clever, either way.

That iota enhancement has nothing terrible: the "[]" syntax is already used in Phobos for std.range.uniform, and it's quite needed, otherwise it can't generate the right bound of a type (like the last ubyte, last byte, last char, last int, etc). Bye, bearophile

I was more referring to ubyte.max.iota :)
Apr 02 2014
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 2 April 2014 at 21:18:19 UTC, Dmitry Olshansky 
wrote:
 03-Apr-2014 00:46, Walter Bright пишет:
 On 4/2/2014 12:38 PM, Dmitry Olshansky wrote:
 memchr is already optimized by SIMD, and std.algorithm.find 
 uses memchr.
 In general, enormous effort has been poured into optimizing C 
 standard
 library functions. If an algorithm can be implemented using 
 such
 functions, we ought to.

This is all good and well, but... memchr is working on the level of a single byte, leaving nothing for 2, 4, 8 byte-wide needles. Not type-safe. No fallback for CTFE.

I found myself needing "wmemchr" (and optionally "dmemchr") for speeding up find for wstrings and dstrings. I realized "wmemchr" exists, but it is platform defined: ushort (w) on windows, and uint (d) on linux. So it's hard to use, and definitely not the "W" version I'd expect in D code. If I were to request the actual "memchr"/"wmemchr"/"dmemchr" functions in some "core.???" module, is this something we'd want, and would somebody know how to provide an efficient implementation?
Apr 02 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 That's like saying inlining has no point because it doesn't 
 have a particular syntax.

Hopefully Dmitry Olshansky can explain why you are probably wrong regarding the proposed feature. Bye, bearophile
Apr 03 2014
prev sibling next sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 04/03/14 04:40, Walter Bright wrote:
 On 4/2/2014 6:21 PM, bearophile wrote:
 Then this feature needs a specific and explicit syntax, or it has _no_ point at
 all.

That's like saying inlining has no point because it doesn't have a particular syntax.

Weak inlining *hints* have no point - C learned this the hard way, with "inline" being useless today, and everybody having to use "always_inline/forceinline". The issue isn't about optimizations, which any decent compiler can deal with; it's about the programmer being able to rely on avoiding certain undesirable effects, like excessive stack consumption, unnecessary clobbering of the few precious cpu registers available etc. What Bearophile is saying is that an optional tail call optimization being mandated by a spec would be completely useless -- TCO can be legally done by the compiler anyway; documenting that it could happen does not help the programmer in any way - (s)he still has to assume that the optimization might not happen. A syntax like "goto return f();" which mandates TCO would make the situation different. [Defining computed gotos would be a better idea, as the tail-call transformation would significantly raise the requirements for a compliant D compiler. OTOH computed-gotos would lead to more readable D code, are easier to implement, and can be easily be made safe in D (by making the type of label-addresses unique per-function and avoiding function-cloning)] artur
Apr 03 2014
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 3 April 2014 at 19:39:57 UTC, Andrei Alexandrescu 
wrote:
 On 4/2/14, 11:19 PM, monarch_dodra wrote:
 If I were to request the actual "memchr"/"wmemchr"/"dmemchr" 
 functions
 in some "core.???" module, is this something we'd want, and 
 would
 somebody know how to provide an efficient implementation?

Yes. Probably checking stuff with SIMD and a few other approaches would be great. Andrei

I filed it here: https://d.puremagic.com/issues/show_bug.cgi?id=12515 Now, we just need someone to actually do it :)
Apr 03 2014
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 02 Apr 2014 18:05:32 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 4/2/2014 3:06 PM, bearophile wrote:
 Walter Bright:

 I don't see why not. Note that we couldn't do this for extern(C) functions, or
 variadics, or caller functions with parameters that need destruction, or
 parameters that may refer to any locals. That last constraint might be a
 doozy, however.

 Why not submit an enhancement request to bugzilla?

So what does it happen if I (by mistake or by ignorance) try to use this on a function that doesn't satisfy one of those requirements?

Then it won't tail call it.

How would `return foo(args)' be used in the C preprocessor example? Would it be implemented like with a jump table?: char c = ...; return caseTable[c](args); What are the benefits over "goto"? I figure with "goto" the semantics are clearer and you can easily access the variables on the stack. Does the TCO version enable any more use-cases? (Note: I consider "far less convoluted code" a valid use-case. :) ) -- Marco
Apr 05 2014
prev sibling next sibling parent reply Iain Buclaw <ibuclaw gdcproject.org> writes:
On 1 April 2014 19:35, Walter Bright <newshound2 digitalmars.com> wrote:
 Try this benchmark comparing various classification schemes:

GDC turns switches into table lookups, and speeds are comparable on my system, though your little hack is appears faster in release builds. Tests are with 1 million runs. --- Non-release run: Milliseconds 1797 1642 1501 1463 Non-release run, no bounds checking: Milliseconds 1730 1611 1483 1512 Release run: Milliseconds 1753 1633 1471 1528 Optimised run: Milliseconds 0 0 0 0 --- It is interesting that switches seem to take longer when bounds checking is turned off, and even longer when in release mode. However this is not very conclusive as I can't test with optimisations turned on. Unfortunately for your small program, gcc is too smart and optimises everything away. --- bool isIdentifierChar3(ubyte c) { switch(c) { case '0': .. case '9': case 'A': .. case 'Z': case 'a': .. case 'z': case '$': case '_': return true; default: return false; } }
Apr 07 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2014 2:41 AM, "Ola Fosheim Grøstad" 
<ola.fosheim.grostad+dlang gmail.com>" wrote:
 on.  Unfortunately for your small program, gcc is too smart and
 optimises everything away.

Haha, that's funny. But why don't you put the data in a separate compilation unit?

True, it's pretty easy to defeat that optimization.
Apr 07 2014
prev sibling next sibling parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
 on.  Unfortunately for your small program, gcc is too smart and
 optimises everything away.

Haha, that's funny. But why don't you put the data in a separate compilation unit?
Apr 07 2014
prev sibling next sibling parent reply Martin Nowak <code dawg.eu> writes:
On 04/01/2014 08:35 PM, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:

Counter example: https://github.com/D-Programming-Language/phobos/commit/8599d6b4acde0eff886876a56f54ecd8e6b528e9
Apr 07 2014
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 7 April 2014 at 11:35:37 UTC, Martin Nowak wrote:
 On 04/01/2014 08:35 PM, Walter Bright wrote:
 Try this benchmark comparing various classification schemes:

Counter example: https://github.com/D-Programming-Language/phobos/commit/8599d6b4acde0eff886876a56f54ecd8e6b528e9

It's all a matter of how complicated the "non-table" implementation is I guess. table-lookup has a "steep entry cost", but is constant, regardless of complexity. Table lookups were "recently" removed from std.ascii too. The implementation walter is benching "against" is particularly complicated here.
Apr 07 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2014 7:11 AM, monarch_dodra wrote:
 It's all a matter of how complicated the "non-table" implementation is I guess.

Yup. Memory lookups can be expensive. Once it's in the cache, it's cheap.
Apr 07 2014
prev sibling parent reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
I added a lookup scheme of my own, its not as fast as Walters (in fact 
its the slowest without -inline - release -O) but it uses 1 bit per 
entry in the table instead of a whole byte so you can have lots and lots 
of different tables. I'm even reasonably sure that it works correctly!


===============================
import core.stdc.stdlib;
import core.stdc.string;

import std.algorithm;
import std.array;
import std.ascii;
import std.datetime;
import std.range;
import std.stdio;
import std.traits;

bool isIdentifierChar0(ubyte c)
{
     return isAlphaNum(c) || c == '_' || c == '$';
}

bool isIdentifierChar1(ubyte c)
{
     return ((c >= '0' || c == '$') &&
             (c <= '9' || c >= 'A')  &&
             (c <= 'Z' || c >= 'a' || c == '_') &&
             (c <= 'z'));
}

immutable bool[256] tab2;
static this()
{
     for (size_t u = 0; u < 0x100; ++u)
     {
         tab2[u] = isIdentifierChar0(cast(ubyte)u);
     }
}

bool isIdentifierChar2(ubyte c)
{
     return tab2[c];
}

immutable ulong[4] tab3;
static this()
{
     for (size_t u = 0; u < 0x100; ++u)
     {
         if (isIdentifierChar0(cast(ubyte)u))
         {
             auto sub = u >>> 6;
             auto b = u & 0x3f;
             auto mask = 0x01L << b;
             tab3[sub] |= mask;
         }
     }
}

bool isIdentifierChar3(ubyte c)
{
	auto sub = c >>> 6;
	c &= 0x3f;
	auto mask = 0x01L << c;
	return (tab3[sub] & mask) > 0;
}

int f0()
{
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar0(cast(ubyte)u);
     }
     return x;
}

int f1()
{
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar1(cast(ubyte)u);
     }
     return x;
}

int f2()
{
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar2(cast(ubyte)u);
     }
     return x;
}

int f3()
{
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar3(cast(ubyte)u);
     }
     return x;
}

void main()
{
     auto r = benchmark!(f0, f1, f2, f3)(10_000);
     writefln("Milliseconds %s %s %s %s", r[0].msecs, r[1].msecs, 
r[2].msecs, r[3].msecs);
}
Apr 17 2014
next sibling parent reply "ixid" <nuaccount gmail.com> writes:
On Thursday, 17 April 2014 at 16:27:26 UTC, Alix Pexton wrote:
 I added a lookup scheme of my own, its not as fast as Walters 
 (in fact its the slowest without -inline - release -O) but it 
 uses 1 bit per entry in the table instead of a whole byte so 
 you can have lots and lots of different tables. I'm even 
 reasonably sure that it works correctly!


 ===============================
 import core.stdc.stdlib;
 import core.stdc.string;

 import std.algorithm;
 import std.array;
 import std.ascii;
 import std.datetime;
 import std.range;
 import std.stdio;
 import std.traits;

 bool isIdentifierChar0(ubyte c)
 {
     return isAlphaNum(c) || c == '_' || c == '$';
 }

 bool isIdentifierChar1(ubyte c)
 {
     return ((c >= '0' || c == '$') &&
             (c <= '9' || c >= 'A')  &&
             (c <= 'Z' || c >= 'a' || c == '_') &&
             (c <= 'z'));
 }

 immutable bool[256] tab2;
 static this()
 {
     for (size_t u = 0; u < 0x100; ++u)
     {
         tab2[u] = isIdentifierChar0(cast(ubyte)u);
     }
 }

 bool isIdentifierChar2(ubyte c)
 {
     return tab2[c];
 }

 immutable ulong[4] tab3;
 static this()
 {
     for (size_t u = 0; u < 0x100; ++u)
     {
         if (isIdentifierChar0(cast(ubyte)u))
         {
             auto sub = u >>> 6;
             auto b = u & 0x3f;
             auto mask = 0x01L << b;
             tab3[sub] |= mask;
         }
     }
 }

 bool isIdentifierChar3(ubyte c)
 {
 	auto sub = c >>> 6;
 	c &= 0x3f;
 	auto mask = 0x01L << c;
 	return (tab3[sub] & mask) > 0;
 }

 int f0()
 {
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar0(cast(ubyte)u);
     }
     return x;
 }

 int f1()
 {
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar1(cast(ubyte)u);
     }
     return x;
 }

 int f2()
 {
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar2(cast(ubyte)u);
     }
     return x;
 }

 int f3()
 {
     int x;
     for (uint u = 0; u < 0x100; ++u)
     {
         x += isIdentifierChar3(cast(ubyte)u);
     }
     return x;
 }

 void main()
 {
     auto r = benchmark!(f0, f1, f2, f3)(10_000);
     writefln("Milliseconds %s %s %s %s", r[0].msecs, 
 r[1].msecs, r[2].msecs, r[3].msecs);
 }

I feel like there must be a way of making a fast bit look up but my version is only moderate in speed. You can get all the bits you need on two 64 bit registers or one SSE register. I haven't tried bt, does that work with a 64 bit register?
Apr 17 2014
parent reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 17/04/2014 8:41 PM, monarch_dodra wrote:
 On Thursday, 17 April 2014 at 18:07:24 UTC, ixid wrote:
 I feel like there must be a way of making a fast bit look up but my
 version is only moderate in speed. You can get all the bits you need
 on two 64 bit registers or one SSE register. I haven't tried bt, does
 that work with a 64 bit register?

http://dlang.org/phobos/core_bitop.html#.bt ? Note it can be applied to the table in general, rather than the byte themselves. EG: ubyte[256] buf; auto b = bt(buf.ptr, 428);

I didn't think to look in core.bitop for a faster way to check bits, I'll check if it closes the gap to Walter's version. A...
Apr 18 2014
parent reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
I tested this against the others...
(with "-inline -release -O" of course)

===

uint[8] tab4; // bitop function only work on uints
static this()
{
     for (size_t u = 0; u < 0x100; ++u)
     {
         if (isIdentifierChar0(cast(ubyte)u))
         {
             bts(tab4.ptr, u);
         }
     }
}

bool isIdentifierChar4(ubyte c)
{
	return bt(tab4.ptr, c) != 0;
}

===

it takes about 3 times as long (about the same as the original 
isIdentifierChar1, that used lots of &&s and ||s).

So 2 shifts, 2 ands and a compare to zero, beats core.bitop.

for clarity, the output of my benchmark for all 5 version was

Milliseconds 21 15 2 5 15

the 2 (which on some runs was a 3) is Walter's table of bools, the 5 is 
my table of ulongs and the 15 on the end is core.bitop.

Short of delving into the asm, which I'm trying to avoid, any other 
suggestions for speed-ups?

A...
Apr 18 2014
parent Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
PS

I discovered that core.bitop.bts() won't accept a pointer to immutable 
even when inside a static this. Not sure if that counts as a bug or not ><
Apr 18 2014
prev sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 17 April 2014 at 18:07:24 UTC, ixid wrote:
 I feel like there must be a way of making a fast bit look up 
 but my version is only moderate in speed. You can get all the 
 bits you need on two 64 bit registers or one SSE register. I 
 haven't tried bt, does that work with a 64 bit register?

http://dlang.org/phobos/core_bitop.html#.bt ? Note it can be applied to the table in general, rather than the byte themselves. EG: ubyte[256] buf; auto b = bt(buf.ptr, 428);
Apr 17 2014