digitalmars.D - Fast string search
- bearophile (61/61) Dec 12 2010 Found through Reddit, it may be useful for Phobos:
 - Leonid Volnitsky (12/12) Dec 15 2010 Thanks for bug report! Code is very new.
 - bearophile (18/21) Dec 15 2010 I may not understand the purpose of the function, but after your fix thi...
 - Leonid Volnitsky (3/3) Dec 15 2010 36 - is correct. When not found volnitsky() return pointer to next byt...
 - bearophile (4/6) Dec 15 2010 Look better at the code, the main contains three writeln(). The first 36...
 - Leonid Volnitsky (2/3) Dec 16 2010 Oh, now I see. Then this is not correct translation to D. C++ returns...
 
Found through Reddit, it may be useful for Phobos:
http://volnitsky.com/project/str_search/
A possible partial D2 translation:
const(char)* volnitsky(string haystack, string needle) {
    enum size_t wSize = ushort.sizeof;
    enum size_t hSize = 64 * 1024;
    immutable char* S = haystack.ptr;
    immutable char* Se = &haystack[$ - 1];
    immutable char* SS = needle.ptr;
    immutable char* SSe = &needle[$ - 1];
    immutable size_t SS_size = SSe - SS; // needle length
    immutable size_t step = SS_size - wSize + 1;
    // if S is too small or SS is too big, fallback to std::search()
    if (step < wSize || SS_size >= (1 << 8) - 1) {
        throw new Exception("search!");
        //return std::search(S, Se, SS, SSe);
    }
    // make SS hash
    //auto H = new ubyte[hSize]; // 64 KB
    ubyte[hSize] H; // 64 KB
    for (size_t SSi = 0; SSi < (SS_size - wSize + 1); SSi++)  {
        size_t h = *cast(ushort*)(SS + SSi) % hSize;
        while (H[h])
            h = (h + 1) % hSize; // find free cell
        H[h] = cast(ubyte)(SSi + 1); // can't overflow!
    }
    // step through text
    for (char* p = cast(char*)(S + step); p <= Se - wSize; p += step) {
        size_t h = *cast(ushort*)p % hSize;
        if (H[h]) {
TRY_HASH_CELL:
            int SSi = H[h] - 1;
            const char* R = p - SSi;
            for (size_t i = 0; i < SS_size; i++) {
                if (R[i] != SS[i]) {
                    size_t h_next = (h + 1) % hSize;
                    if (H[h_next]) {
                        h = h_next;
                        goto TRY_HASH_CELL;
                    } else
                        goto NEXT_STEP;
                }
            }
            return R;
        }
NEXT_STEP:;
    }
    return Se;
}
//---------------------------------------------
import std.stdio: writeln;
void main() {
    string s1 = "hello, how are you? I'm well, thanks";
    string s2 = "wello"; // doesn't work with strings of odd lenght!
    auto ptr = volnitsky(s1, s2);
    writeln(ptr - s1.ptr);
    writeln(s1[ptr - s1.ptr]);
}
But as the example shows it doesn't seem to work with strings of odd length.
Also it lacks the call to std::search() still.
Bye,
bearophile
 Dec 12 2010
Thanks for bug report!  Code is very new.
Bug manifests itself when substring is in 0 position.
Fix:
diff --git a/volnitsky.h b/volnitsky.h
index a7dbc1f..62d686d 100644
-- a/volnitsky.h
+++ b/volnitsky.h
   -26,7 +26,7    const char*     volnitsky(const char*  S,  const char* Se,
const char* SS,  con
         // step through text
-        for (const char* p = S+step;   p <= Se-W_size;  p+=step)  {
+        for (const char* p = S+step-W_size;   p <= Se-W_size;  p+=step)  {
 Dec 15 2010
Leonid Volnitsky:Thanks for bug report! Code is very new.You are welcome :-)Bug manifests itself when substring is in 0 position. Fix:I may not understand the purpose of the function, but after your fix this code: void main() { string s1 = "hello, how are you? I'm well, thanks"; writeln(s1.length); string s2 = "wello"; auto ptr = volnitsky(s1, s2); writeln(ptr - s1.ptr); writeln(s1[ptr - s1.ptr .. $]); } Prints: 36 24 well, thanks But "wello" is not present in the s1 string, so returning 24 is not what I desire... Bye, bearophile
 Dec 15 2010
36 - is correct. When not found volnitsky() return pointer to next byte after last byte of s1 (same as std::search()). 24 - I don't know D to tell what it is (use of ptr past end of s1?)
 Dec 15 2010
Leonid Volnitsky:36 - is correct. When not found volnitsky() return pointer to next byte after last byte of s1 (same as std::search()).Look better at the code, the main contains three writeln(). The first 36 it prints is the length of the s1 string, that's obviously correct. The other two writeln show a wrong result (it may be just a C++==>D translation error of mine). Bye, bearophile
 Dec 15 2010
bearophile Wrote:Look better at the code,Oh, now I see. Then this is not correct translation to D. C++ returns correct results. I've fixed bug, but not the one that you see. You can verify this by running C++ version.
 Dec 16 2010








 
 
 
 Leonid Volnitsky <leonid volnitsky.com>