|
Archives
D Programming
digitalmars.D
digitalmars.D.bugs
digitalmars.D.dtl
digitalmars.D.ide
digitalmars.D.dwt
digitalmars.D.announce
digitalmars.D.learn
digitalmars.D.debugger
D.gnu
D
C/C++ Programming
c++
c++.announce
c++.atl
c++.beta
c++.chat
c++.command-line
c++.dos
c++.dos.16-bits
c++.dos.32-bits
c++.idde
c++.mfc
c++.rtl
c++.stl
c++.stl.hp
c++.stl.port
c++.stl.sgi
c++.stlsoft
c++.windows
c++.windows.16-bits
c++.windows.32-bits
c++.wxwindows
digitalmars.empire
digitalmars.DMDScript
electronics
|
digitalmars.D - Re: Perfect hashing for string switch
BCS:
Have you compared it to a decisition tree or lex style state mechine?
I have now implemented that too, it was not an immediate thing to do (I have
removed the versions 2 to 5 to reduce code size on codepad):
http://codepad.org/zOmPeE13
The results are good:
Timings, ldc, seconds:
test1: 4.48 // normal string switch
test2: 2.98 // perfect hash
test3: 2.09
test4: 2.07
test5: 5.44 // AA. Tango AA opIn_r is bug-slow
test6: 1.18 // new result
I hope this is enough.
I have created that large finite state machine in D with a Python program :-)
Bye,
bearophile
On 2010-01-27 15:17, bearophile wrote:
BCS:
Have you compared it to a decisition tree or lex style state mechine?
I have now implemented that too, it was not an immediate thing to do (I have
removed the versions 2 to 5 to reduce code size on codepad):
http://codepad.org/zOmPeE13
The results are good:
Timings, ldc, seconds:
test1: 4.48 // normal string switch
test2: 2.98 // perfect hash
test3: 2.09
test4: 2.07
test5: 5.44 // AA. Tango AA opIn_r is bug-slow
test6: 1.18 // new result
I hope this is enough.
I have created that large finite state machine in D with a Python program :-)
Your test6 is invalid: it reads beyond the bounds of the given array.
For example with input "th", it will check whether the third character
is 'i'; but the length of the array is only 2, so it shouldn't be doing
that.
--
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Hello Matti,
On 2010-01-27 15:17, bearophile wrote:
BCS:
Have you compared it to a decisition tree or lex style state
mechine?
(I have removed the versions 2 to 5 to reduce code size on codepad):
http://codepad.org/zOmPeE13
The results are good:
Timings, ldc, seconds:
test1: 4.48 // normal string switch
test2: 2.98 // perfect hash
test3: 2.09
test4: 2.07
test5: 5.44 // AA. Tango AA opIn_r is bug-slow
test6: 1.18 // new result
I hope this is enough.
I have created that large finite state machine in D with a Python
program :-)
For example with input "th", it will check whether the third character
is 'i'; but the length of the array is only 2, so it shouldn't be
doing that.
A simple fix to that is to make he first transition of the state machine
based on the length of the key.
that is, 'if' at the top becomes:
switch(word.length) {
case 13: goto S...:
case 12: goto S...:
case 11: goto S...:
case 10: goto S...:
case 9: goto S...:
case 8: goto S...:
case 7: goto S...:
case 6: goto S...:
case 5: goto S...:
case 4: goto S...:
case 3: goto S...:
case 2: goto S...:
default: goto UNREC;
}
--
<IXOYE><
Matti Niemenmaa:
Your test6 is invalid: it reads beyond the bounds of the given array.
For example with input "th", it will check whether the third character
is 'i'; but the length of the array is only 2, so it shouldn't be doing
that.
Shame on me, I am sorry -.- Thank you. I will try to improve the code later.
The good thing is that D2 allows me to implement something to catch that kind
of out-of-range pointer errors at run-time (D1 is not good enough here). That
code at the bottom is not well tested yet, so it can contain many bugs. And
probably it's not const-aware yet (if someone has suggestions to improve it I
will appreciate them). In programs a certain percentage of pointers run inside
an interval of memory, so something like this can be useful. Do you think this
is (once well debugged and improved) useful for Phobos2?
Up sides:
- this code seems to work and catchs the bug at runtime in the test6() function
I've written.
Downsides:
- currently compiled with D2 the test6() gets slower even when the code is not
in debug release. Future D2 compilers will need to reduce to zero or near zero
this overhead, so safer pointers can be used more freely.
Possible future improvements: use an "interval tree" where necessary to quickly
find if the pointer is inside more than one interval. This is probably a quite
less common need.
Bye,
bearophile
---------------------
import std.stdio: writeln, write;
template IsMutable(T) {
enum bool IsMutable = __traits(compiles, { T a; a = T.init; });
}
unittest { // of IsMutable(T)
auto s1 = "hello1";
static assert( !IsMutable!(typeof(s1[0])) );
char[] s2 = cast(char[])"hello2";
static assert( IsMutable!(typeof(s2[0])) );
}
/// ***** NOT tested well enough yet *****
struct IntervalPtr(T) {
T* ptr; // can be null
debug {
private T* start_ptr; // can't be null if present
private T* end_ptr; // can't be null if present
invariant() {
assert(start_ptr != null);
assert(end_ptr != null);
assert(start_ptr <= end_ptr);
assert(ptr == null || (ptr >= start_ptr && ptr < end_ptr));
}
} else {
this(T* aptr) {
this.ptr = aptr;
}
}
this(T* start, T* end) {
this.ptr = start;
debug {
assert(start != null);
assert(end != null);
assert(start <= end);
this.start_ptr = start;
this.end_ptr = end;
}
}
this(T* start, T* end, T* aptr) {
this.ptr = aptr;
debug {
assert(end != null);
assert(end != null);
assert(start <= end);
if (aptr != null) {
assert(aptr >= start);
assert(aptr < end);
}
this.start_ptr = start;
this.end_ptr = end;
}
}
T opStar() {
return *this.ptr;
}
void opPostInc() {
debug assert(this.ptr < (this.end_ptr - 1));
this.ptr++;
}
void opPostDec() {
debug assert(this.ptr > this.start_ptr);
this.ptr--;
}
void opAddAssign(size_t i) {
debug assert(this.ptr < (this.end_ptr - i));
this.ptr += i;
}
void opSubAssign(size_t i) {
debug assert(this.ptr >= (this.start_ptr + i));
this.ptr -= i;
}
void opAssign(T* other_ptr) {
debug {
if (other_ptr != null) {
assert(other_ptr >= this.start_ptr);
assert(other_ptr < this.end_ptr);
}
}
this.ptr = other_ptr;
}
void opAssign(IntervalPtr other) {
debug {
assert(other.start_ptr != null);
assert(other.end_ptr != null);
assert(other.end_ptr != null);
if (other.ptr != null) {
assert(other.ptr >= other.start_ptr);
assert(other.ptr < other.end_ptr);
}
this.start_ptr = other.start_ptr;
this.end_ptr = other.end_ptr;
}
this.ptr = other.ptr;
}
const bool opEquals(ref const(IntervalPtr) other) {
return this.ptr == other.ptr;
}
/*ref ?*/ T opIndex(size_t i) {
debug {
assert((this.ptr + i) >= this.start_ptr);
assert((this.ptr + i) < this.end_ptr);
}
return this.ptr[i];
}
static if (IsMutable!T) {
void opIndexAssign(T value, size_t i) {
debug {
assert((this.ptr + i) >= this.start_ptr);
assert((this.ptr + i) < this.end_ptr);
}
this.ptr[i] = value;
}
}
}
IntervalPtr!T intervalPtr(T)(/*inout ?*/ T[] arr) {
debug
return IntervalPtr!T(arr.ptr, arr.ptr + arr.length);
else
return IntervalPtr!T(arr.ptr);
}
IntervalPtr!T intervalPtr(T)(/*inout ?*/ T[] arr, T* aptr) {
debug
return IntervalPtr!T(arr.ptr, arr.ptr + arr.length, aptr);
else
return IntervalPtr!T(aptr);
}
IntervalPtr!T intervalPtr(T)(T* start, T* end) {
debug
return IntervalPtr!T(start, end);
else
return IntervalPtr!T(start);
}
IntervalPtr!T intervalPtr(T)(T* start, T* end, T* aptr) {
debug
return IntervalPtr!T(start, end, aptr);
else
return IntervalPtr!T(aptr);
}
void main() {
auto s = "hello";
writeln("String: ", s);
auto p1 = s.intervalPtr();
writeln("intervalPtr sizeof: ", p1.sizeof);
foreach (_; 0 .. 4) {
write(">", *p1, "< ");
++p1;
}
writeln(">", *p1, "< ");
auto p2 = intervalPtr(s.ptr, s.ptr + s.length);
foreach (_; 0 .. 4) {
write(">", *p2, "< ");
p2++;
}
writeln(">", *p2, "< ");
p1 = s.ptr; // p1 reset to the start
foreach (i; 0 .. s.length)
write(">", p1[i], "< ");
writeln();
}
Hello bearophile,
Matti Niemenmaa:
Your test6 is invalid: it reads beyond the bounds of the given array.
For example with input "th", it will check whether the third
character is 'i'; but the length of the array is only 2, so it
shouldn't be doing that.
later.
The good thing is that D2 allows me to implement something to catch
that kind of out-of-range pointer errors at run-time (D1 is not good
enough here).
Things I'd use in place of that:
/////
char[] str, int at = 0;
...
switch(str[at]) { ... }
...
at++;
or
/////
char[] str;
...
switch(str[0]) { ... }
...
str = str[1..$];
A good compiler should be able to convert the first to about the same as
what your type should compile to. A really good compiler should be able to
const propagate at and remove the variable and (if you start by switching
on the length) fold out even the bounds checks.
The second has the same optimization potential but for some reason I think
it would be expecting more of a compiler to get it optimized that well.
--
<IXOYE><
BCS:
Things I'd use in place of that:
/////
char[] str, int at = 0;
...
switch(str[at]) { ... }
...
at++;
or
/////
char[] str;
...
switch(str[0]) { ... }
...
str = str[1..$];
When in not-debug mode that struct of mine is just 1 word long, so it hopefully
gets optimized well by LDC2. While removing that free 2-word ling "str" is a
bit harder.
With that struct you don't need to carry around two things, a vector and index,
you have to keep and move around just one variable. And it acts just like a
pointer, so you can move it in both directions, etc (it's not handy nor safe to
do this with a slice, you need to keep a 3 words of memory as state to be safe).
You can (hopefully) use it as drop-in safer replacement for a pointer that you
know spans a range of memory (it can have few holes too), so you can use it in
D2 code translated from original C code, changing it very little. You don't
want to modify too much that kind of C code. In that broken finite state
machine of mine I've had to change just one line of code, where the "p" pointer
is defined.
With DMD you can remove 100% of the overhead defining it as a pointer in
not-debug release, and keeping the rest of the code that use the pointer
unchanged. This can be done locally (with a bit more complex pointer definition
that defines a true pointer in not-debug mode), or more generally modifying
those functions that act as constructors, so they return a true pointer in
not-debug mode (but I like this less, I think it's not tidy to have functions
that return different types in debug mode).
Having safer pointers goes well with the style of D language. Probably I'll put
this struct in my future d2libs :-) Maybe with small changes such safe ranged
pointers can be used in SafeD modules too :-)
Bye,
bearophile
Hello bearophile,
BCS:
Things I'd use in place of that:
/////
char[] str, int at = 0;
...
switch(str[at]) { ... }
...
at++;
or
/////
char[] str;
...
switch(str[0]) { ... }
...
str = str[1..$];
hopefully gets optimized well by LDC2. While removing that free 2-word
ling "str" is a bit harder.
Almost all the cases I can think of fall into one of a few categories:
- the length stuff can be statically verified to be correct and the length
need not be carried around
- the length stuff can't be statically verified but the code handles he dynamic
verification in such a way the optimizer can figure out how to safely omit
the length checks for much of the code.
- the length stuff can't be statically verified and there is no way to enforce
correct dynamic verification so you can't safely omit the length and checks
even in release mode.
In each case, the native solution can be just as fast as your solution or
your solution is not safe in release mode.
And it
acts just like a pointer, so you can move it in both directions, etc
I'll give you that point.
--
<IXOYE><
|
|