digitalmars.D.learn - maketrans and translate

bearophile (12/12) Feb 12 2012 In the online docs I've seen that std.string.maketrans() is (going to be...

Stewart Gordon (5/7) Feb 13 2012

bearophile (19/21) Feb 13 2012 A new version of the code:

Jonathan M Davis (4/34) Feb 13 2012 Do you have data to backup that there is a significant speed difference?...

bearophile (73/75) Feb 13 2012 I have just written a small benchmark, it contains both the URL to the t...

Jonathan M Davis (12/36) Feb 13 2012 I'll look into the best way to handle it. Those functions aren't being

bearophile (5/7) Feb 15 2012 I have added the topic to Bugzilla to avoid its oblivion:

bearophile <bearophileHUGS lycos.com> writes:

In the online docs I've seen that std.string.maketrans() is (going to be)
deprecated.
How do you adapt this code to the new regime?


import std.stdio, std.string;
void main() {
    char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text
    auto tab = maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ",
                        
"0111222333445566677788899901112223334455666777888999");
    auto text2 = text.translate(tab, "\"");
    writeln(text2);
}

Bye,
bearophile

Feb 12 2012

Stewart Gordon <smjg_1998 yahoo.com> writes:

On 13/02/2012 03:04, bearophile wrote:
 In the online docs I've seen that std.string.maketrans() is (going to be)
deprecated.
 How do you adapt this code to the new regime?

<snip>

Use an associative array for the translation table.

Or write your own functions that work in the same way as maketrans/translate.

Stewart.

Feb 13 2012

bearophile <bearophileHUGS lycos.com> writes:

Stewart Gordon:

 Use an associative array for the translation table.

A new version of the code:

import std.stdio, std.string, std.range;
void main() {
    char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text
    dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6',
    'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9',
    'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7',
    'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8',
    'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3',
    'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2',
    'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9'];
    auto text3 = text.translate(aa, "\"");
    writeln(text3);
}


 Or write your own functions that work in the same way as maketrans/translate.

The code with the associative array is longer, and every char in the original
string requires one or two associative array lookups. This is quite inefficient
for ASCII strings. 

Instead of writing my own string functions, or copying them from an older
Phobos, what about the idea of keeping both the old maketrans and translate
functions too and to un-deprecate them (maybe moving them in std.ascii)? They
are not UTF-safe, but I have large amounts of biological data text that is not
Unicode.

Bye,
bearophile

Feb 13 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Monday, February 13, 2012 08:09:06 bearophile wrote:
 Stewart Gordon:
 Use an associative array for the translation table.

 
 A new version of the code:
 
 import std.stdio, std.string, std.range;
 void main() {
 char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII
 text dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6',
 'c':'6', 'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9',
 'g':'9', 'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7',
 'k':'7', 'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8',
 'o':'8', 'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3',
 's':'3', 'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2',
 'w':'2', 'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9'];
 auto text3 = text.translate(aa, "\"");
 writeln(text3);
 }
 
 Or write your own functions that work in the same way as
 maketrans/translate.

 The code with the associative array is longer, and every char in the
 original string requires one or two associative array lookups. This is
 quite inefficient for ASCII strings.
 
 Instead of writing my own string functions, or copying them from an older
 Phobos, what about the idea of keeping both the old maketrans and translate
 functions too and to un-deprecate them (maybe moving them in std.ascii)?
 They are not UTF-safe, but I have large amounts of biological data text
 that is not Unicode.

Do you have data to backup that there is a significant speed difference? We're 
trying to not have any ASCII-only string stuff.

- Jonathan M Davis

Feb 13 2012

bearophile <bearophileHUGS lycos.com> writes:

Jonathan M Davis:

 Do you have data to backup that there is a significant speed difference?

I have just written a small benchmark, it contains both the URL to the test
data and the timings I am seeing on a slow PC. If your PC is faster feel free
to use it on more data (creating a larger input file on disk, concatenating
many copies of the same test data). If you see bugs in the benchmark please
tell me.

version3 uses the old translate. version4 uses the new translate.
version1InPlace is a reference point, it works in-place. version2InPlace is
similar, uses a linear scan for the chars to remove, and it seems a bit slower
than version1InPlace (probably because the small deltab array fits well in the
L1 cache).

If both the benchmark and the timings are correct, the old translate seems
almost 5 times faster than the new one (and this is not strange, the new
translate performs two associative array lookups for each char of the input
string).


import std.stdio, std.string, std.file, std.exception;

char[] version1InPlace(char[] text) {
    immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ";
    immutable t2 = "0111222333445566677788899901112223334455666777888999";
    immutable delchars = "\"";

    char[256] transtab;
    foreach (i, ref char c; transtab)
        c = cast(char)i;

    foreach (i, char c; t1)
        transtab[c] = t2[i];

    bool[256] deltab = false;
    foreach (char c; delchars)
        deltab[c] = true;

    int count;
    foreach (char c; text)
        if (!deltab[c]) {
            text[count] = transtab[c];
            count++;
        }

    return text[0 .. count];
}


char[] version2InPlace(char[] text) {
    immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ";
    immutable t2 = "0111222333445566677788899901112223334455666777888999";
    immutable delchars = "\"";

    char[256] transtab;
    foreach (i, ref char c; transtab)
        c = cast(char)i;

    foreach (i, char c; t1)
        transtab[c] = t2[i];

    int count;
    outer: foreach (char c; text) {
        foreach (char d; delchars)
            if (c == d)
                continue outer;
        text[count] = transtab[c];
        count++;
    }

    return text[0 .. count];
}


string version3(char[] text) {
    auto tab = maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ",
                        
"0111222333445566677788899901112223334455666777888999");
    return text.translate(tab, "\"");
}


string version4(string text) {
    dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6',
    'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9',
    'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7',
    'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8',
    'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3',
    'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2',
    'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9'];
    return text.translate(aa, "\"");
}


void main() {
    // http://patriot.net/~bmcgin/kjv12.txt
    auto txt = cast(char[])read("kjv12.txt");

    assert(version1InPlace(txt.dup) == version2InPlace(txt.dup));
    assert(version1InPlace(txt.dup) == version3(txt.dup));
    assert(version1InPlace(txt.dup) == version4(txt.idup));

    static if (1) version1InPlace(txt); // 0.16 seconds
    static if (0) version2InPlace(txt); // 0.18 seconds
    static if (0) version3(txt); // 0.20 seconds
    static if (0) version4(assumeUnique(txt)); // 0.97 seconds
}


 We're trying to not have any ASCII-only string stuff.

Around there is a large amount of data that is essentially text, but it's
ASCII, it's not Unicode. Like biological data, text data coming out of
instruments, etc. Handling it as binary data is not good.

Bye,
bearophile

Feb 13 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Monday, February 13, 2012 14:08:16 bearophile wrote:
 Jonathan M Davis:
 Do you have data to backup that there is a significant speed difference?

 
 I have just written a small benchmark, it contains both the URL to the test
 data and the timings I am seeing on a slow PC. If your PC is faster feel
 free to use it on more data (creating a larger input file on disk,
 concatenating many copies of the same test data). If you see bugs in the
 benchmark please tell me.
 
 version3 uses the old translate. version4 uses the new translate.
 version1InPlace is a reference point, it works in-place. version2InPlace is
 similar, uses a linear scan for the chars to remove, and it seems a bit
 slower than version1InPlace (probably because the small deltab array fits
 well in the L1 cache).
 
 If both the benchmark and the timings are correct, the old translate seems
 almost 5 times faster than the new one (and this is not strange, the new
 translate performs two associative array lookups for each char of the input
 string).

I'll look into the best way to handle it. Those functions aren't being 
deprecated this release regardless.

 We're trying to not have any ASCII-only string stuff.

 
 Around there is a large amount of data that is essentially text, but it's
 ASCII, it's not Unicode. Like biological data, text data coming out of
 instruments, etc. Handling it as binary data is not good.

In most cases, it should make no difference. You could use 1, 2, 3, and 4 as 
easily as you could use A, T, C, and G. Most algorithms don't care one whit 
whether you're dealing with characters or numbers. The only problem that I see 
is cases where there's an algorithm which would normally only be done with 
characters and not numbers and which therefore is in std.string and not 
std.array or std.algorithm - and translate falls into that category.

Regardless, in general, Phobos is not going to have ASCII-specific string 
processing.

- Jonathan M Davis

Feb 13 2012

bearophile <bearophileHUGS lycos.com> writes:

Jonathan M Davis:

 I'll look into the best way to handle it. Those functions aren't being 
 deprecated this release regardless.

I have added the topic to Bugzilla to avoid its oblivion:
http://d.puremagic.com/issues/show_bug.cgi?id=7515

Bye,
bearophile

Feb 15 2012

D Programming

C/C++ Programming

Other

digitalmars.D.learn - maketrans and translate