digitalmars.D.bugs - [Issue 7515] New: The new std.string.translate is slow for ASCII text
- d-bugmail puremagic.com (138/140) Feb 15 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7515
- d-bugmail puremagic.com (10/10) Mar 06 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7515
- d-bugmail puremagic.com (18/19) Mar 06 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7515
- d-bugmail puremagic.com (21/24) Mar 06 2012 I don't know. It seems to me that that's an implementation detail. makeT...
- d-bugmail puremagic.com (18/22) Mar 06 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7515
- d-bugmail puremagic.com (14/14) Mar 06 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7515
- d-bugmail puremagic.com (11/11) Mar 11 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7515
- d-bugmail puremagic.com (18/18) Mar 11 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7515
- d-bugmail puremagic.com (16/29) Mar 11 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7515
- d-bugmail puremagic.com (16/16) Mar 11 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7515
- d-bugmail puremagic.com (15/15) Mar 11 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7515
- d-bugmail puremagic.com (12/12) May 28 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7515
- d-bugmail puremagic.com (8/10) May 28 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7515
- d-bugmail puremagic.com (16/16) Jun 10 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7515
- d-bugmail puremagic.com (9/9) Jun 10 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7515
http://d.puremagic.com/issues/show_bug.cgi?id=7515 Summary: The new std.string.translate is slow for ASCII text Product: D Version: D2 Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Phobos AssignedTo: nobody puremagic.com ReportedBy: bearophile_hugs eml.cc --- Comment #0 from bearophile_hugs eml.cc 2012-02-15 19:23:51 PST --- std.string.maketrans() is going to be deprecated. This is a demo program, it's meant to work on a good amount of pure 7-bit ASCII text. The conversion table is the same as the one in this page: http://digitalmars.com/d/1.0/lisp-java-d.html import std.stdio, std.string; void main() { char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text auto tab = maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ", "0111222333445566677788899901112223334455666777888999"); auto text2 = text.translate(tab, "\""); writeln(text2); } Using the new std.string.maketrans the code becomes something like: import std.stdio, std.string, std.range; void main() { char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6', 'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9', 'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7', 'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8', 'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3', 'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2', 'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9']; auto text3 = text.translate(aa, "\""); writeln(text3); } The code with the associative array is _uglier_, _longer_ to write and read, and every char in the original string requires one or two associative array lookups. This is quite inefficient for sizeable ASCII strings. Jonathan M Davis has asked me a benchmark:Do you have data to backup that there is a significant speed difference?So I have written the following small benchmark, it contains both the URL to the test data (bible) and the timings I am seeing on a slow PC. If your PC is faster feel free to use it on more data (creating a larger input file on disk, concatenating many copies of the same test data). If you see bugs in the benchmark please tell me. Notes: - version3 uses the old translate; - version4 uses the new translate; - version1InPlace is a reference point, it works in-place; - version2InPlace is similar, uses a linear scan for the chars to remove, and it seems a bit slower than version1InPlace (probably because the small deltab array fits well in the L1 cache). If both the benchmark and the timings are correct, the old translate seems almost 5 times faster than the new one (and this is not strange, the new translate performs two associative array lookups for each char of the input string). import std.stdio, std.string, std.file, std.exception; char[] version1InPlace(char[] text) { immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ"; immutable t2 = "0111222333445566677788899901112223334455666777888999"; immutable delchars = "\""; char[256] transtab; foreach (i, ref char c; transtab) c = cast(char)i; foreach (i, char c; t1) transtab[c] = t2[i]; bool[256] deltab = false; foreach (char c; delchars) deltab[c] = true; int count; foreach (char c; text) if (!deltab[c]) { text[count] = transtab[c]; count++; } return text[0 .. count]; } char[] version2InPlace(char[] text) { immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ"; immutable t2 = "0111222333445566677788899901112223334455666777888999"; immutable delchars = "\""; char[256] transtab; foreach (i, ref char c; transtab) c = cast(char)i; foreach (i, char c; t1) transtab[c] = t2[i]; int count; outer: foreach (char c; text) { foreach (char d; delchars) if (c == d) continue outer; text[count] = transtab[c]; count++; } return text[0 .. count]; } string version3(char[] text) { auto tab = maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ", "0111222333445566677788899901112223334455666777888999"); return text.translate(tab, "\""); } string version4(string text) { dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6', 'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9', 'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7', 'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8', 'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3', 'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2', 'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9']; return text.translate(aa, "\""); } void main() { // http://patriot.net/~bmcgin/kjv12.txt auto txt = cast(char[])read("kjv12.txt"); assert(version1InPlace(txt.dup) == version2InPlace(txt.dup)); assert(version1InPlace(txt.dup) == version3(txt.dup)); assert(version1InPlace(txt.dup) == version4(txt.idup)); static if (1) version1InPlace(txt); // 0.16 seconds static if (0) version2InPlace(txt); // 0.18 seconds static if (0) version3(txt); // 0.20 seconds static if (0) version4(assumeUnique(txt)); // 0.97 seconds } Jonathan M Davis:We're trying to not have any ASCII-only string stuff.Around there is a large amount of data that is essentially text, but it's ASCII, it's not Unicode. Like biological data, text data coming out of instruments, etc. Handling it as binary data is not good. A possibility is to keep both the old maketrans and translate functions, and un-deprecate them (maybe moving them in std.ascii?). Other ideas are welcome. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 15 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515 Jonathan M Davis <jmdavisProg gmx.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |jmdavisProg gmx.com --- Comment #1 from Jonathan M Davis <jmdavisProg gmx.com> 2012-03-06 00:15:03 PST --- https://github.com/D-Programming-Language/phobos/pull/478 -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 06 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515 --- Comment #2 from bearophile_hugs eml.cc 2012-03-06 04:13:54 PST --- (In reply to comment #1)https://github.com/D-Programming-Language/phobos/pull/478Thank you. Your patch seems nice. Two suggestions: 1) If they don't already say it, then I suggest to add in the ddoc of the functions that the translation arrays have a length 128. This for both old D programmers and Python programmers. 2) This code in translate(): + bool[128] remTable; + + remTable[] = false; I suggest to write it like this, that's shorter and avoids a double initialization: + bool[128] remTable = false; -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 06 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515 --- Comment #3 from Jonathan M Davis <jmdavisProg gmx.com> 2012-03-06 15:27:13 PST ---1) If they don't already say it, then I suggest to add in the ddoc of the functions that the translation arrays have a length 128. This for both old D programmers and Python programmers.I don't know. It seems to me that that's an implementation detail. makeTrans doesn't even return a static array. It returns a dynamic one. So, as long as the array passed to translate comes from makeTrans (as it should), it's complete non-issue. If anything, I'd argue for creating a struct (e.g. TransTable) which just held the dynamic array without giving access to it so that no one _can_ screw with it (since screwing with it is just going to break code). It seems to me that the only reason that makeTrans exists in the first place rather than just putting it in translate is so that you can reuse the translation table. No one should be messing with it or passing anything else to that version of translate. So, why does the length of the array matter to the user of makeTrans or translate? And actually, I'm _really_ tempted to just go and change it so that it returns a struct which you can't mess with and is only created by makeTrans. That completely resolves any possible issues caused by someone passing the wrong array to translate. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 06 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515 --- Comment #4 from bearophile_hugs eml.cc 2012-03-06 15:42:22 PST --- (In reply to comment #3)It seems to me that the only reason that makeTrans exists in the first place rather than just putting it in translate is so that you can reuse the translation table. No one should be messing with it or passing anything else to that version of translate.As several other string functions, the two functions we are talking about here come from Python2: http://docs.python.org/library/string.html#string-functions http://docs.python.org/library/string.html#string.translate The Python docs tell the size of the table too: string.translate(s, table[, deletechars]) Delete all characters from s that are in deletechars (if present), and then translate the characters using table, which must be a 256-character string giving the translation for each character value, indexed by its ordinal. If table is None, then only the character deletion step is performed. In Python some times I have built the translation table manually. If makeTrans returns an opaque or privately defined struct this is not possible. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 06 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515 --- Comment #5 from Jonathan M Davis <jmdavisProg gmx.com> 2012-03-06 16:06:35 PST --- And _why_ would you ever need to build it manually? That's what makeTrans is for. The docs would have to explain exactly how the array is laid out for it to be at all reasonable for the user to be screwing with it, and right now, they definitely treat it as completely opaque. And without a really good reason why the user would be screwing with the translation table on their own, I would definitely argue that having it be opaque is a huge benefit, since it stops a whole possible set of bugs caused by translate being passed an array which doesn't conform to what it requires. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 06 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515 --- Comment #6 from bearophile_hugs eml.cc 2012-03-11 13:53:26 PDT --- I have thought some more about this and I have changed my mind: there is no point to artificially restrict the usefulness of this function. So I suggest: 1) To not use a opaque struct, but leave the simple translation array. 2) To use a translation array of 256 items. So translate is usable for an ubyte[] array input too (with a cast of the ubyte[] to char[] in the function call if necessary). -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 11 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515 --- Comment #7 from Jonathan M Davis <jmdavisProg gmx.com> 2012-03-11 14:22:41 PDT --- If you want a truly generic mapping function which operates on ubyte and creates an array which holds all of the possible values of that type, then it makes no sense to be putting it in std.string. It should be more generic and be in std.array (though given that it wouldn't really make sense to do what makeTrans and translate are doing with anything larger than a ubyte, I question how good an idea that would lbe). If what you care about is strings, then there is no point in having an array greater than 128 elements long, because there aren't any ASCII characters greater than that. Having 256 is pointless and wasteful, because there won't be any characters that large. Not to mention, the _whole_ reason that you were complaining about this in the first place was because the unicode-aware version of translate was slower than using maketrans/trnslate when dealing with pure ASCII, and using 128 elements is going to be faster. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 11 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515 --- Comment #8 from bearophile_hugs eml.cc 2012-03-11 14:40:45 PDT --- (In reply to comment #7) Thank you for your answers.If you want a truly generic mapping function which operates on ubyte and creates an array which holds all of the possible values of that type, then it makes no sense to be putting it in std.string. It should be more generic and be in std.array (though given that it wouldn't really make sense to do what makeTrans and translate are doing with anything larger than a ubyte, I question how good an idea that would lbe).I have not asked for a truly generic function for std.array. So this problem doesn't happen.If what you care about is strings, then there is no point in having an array greater than 128 elements long, because there aren't any ASCII characters greater than that. Having 256 is pointless and wasteful, because there won't be any characters that large. Not to mention, the _whole_ reason that you were complaining about this in the first place was because the unicode-aware version of translate was slower than using maketrans/trnslate when dealing with pure ASCII, and using 128 elements is going to be faster.Now I have had to use translate() for a different job: I have a good amount of 8-bit ASCII text in my language to process (http://en.wikipedia.org/wiki/Extended_ASCII ). It's not Unicode, and I don't think I can convert it to Unicode. I have to use translate() on this text. A 128-elements translate() isn't enough. The "waste" caused by using a 256 items array instead of a 128 items array is probably not significant. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 11 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515 --- Comment #9 from Jonathan M Davis <jmdavisProg gmx.com> 2012-03-11 15:12:49 PDT --- Okay. It seems to me that this is getting a bit ridiculous. Phobos doesn't support Extendend ASCII _anywhere_. Its support of ASCII-specific stuff is already minimal. So, to set up a string function to take a string which is _completely_ invalid as far as the rest of Phobos is concerned seems really off to me. All you have to do with an ASCII-only version is make sure that you don't pass it unicode characters. But with an Extended ASCII version, all of a sudden you have invalid unicode sequences. The only support that Phobos gives for non-unicode and non-ASCII encodings is std.encoding, and almost all of the ASCII-specific stuff is in std.ascii. The _only_ reason that I see to support this is the fact that the implementation that we've had happens to. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 11 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515 --- Comment #10 from Jonathan M Davis <jmdavisProg gmx.com> 2012-03-11 15:33:12 PDT --- The fact that the version that's going away deals with a 256 element array and therefore just so happens to work with Extended ASCII may be enough to justify just making the new one do that as well, but I honestly, don't see Extended ASCII as much of an argument when none of the other functions support it and are likely to blow up if you use it. The general idea with Phobos functions is that you will convert any non-unicode strings that you have to unicode before processing them, and then when you're done, if you want a different encoding, you convert it before writing it out to file or whatever you're doing with it. It's really not the intention that Phobos in general will support non-unicode encodings. Dealing with unicode is already complicated enough. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 11 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515 --- Comment #11 from Jonathan M Davis <jmdavisProg gmx.com> 2012-05-28 15:41:22 PDT --- Okay. After some rethinking the issue, I've closed the old pull request and created a new one: https://github.com/D-Programming-Language/phobos/pull/609 It retains the ability to handle 256 code units rather than the 128 that ASCII actually uses. What finally convinced me was the fact that if it takes the full 256 code units, then it could operate on full-on UTF-8 strings, replacing the ASCII characters and leaving the unicode ones alone. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
May 28 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515 --- Comment #12 from bearophile_hugs eml.cc 2012-05-28 17:13:48 PDT --- (In reply to comment #11)It retains the ability to handle 256 code units rather than the 128 that ASCII actually uses.Thank you :-) Please take also a look at Issue 8141 -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
May 28 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515 --- Comment #13 from github-bugzilla puremagic.com 2012-06-10 20:32:42 PDT --- Commits pushed to master at https://github.com/D-Programming-Language/phobos https://github.com/D-Programming-Language/phobos/commit/08acc16b830b35572e6ea004d936a7b2522801b8 Fix issue 7515 I created an adjusted version of translate which is ASCII-only and renamed maketrans to makeTrans with some minor changes. Instead of deprecating the old translate and maketrans, maketrans is now a deprecated alias of makeTrans, and the new ASCII-only translate should be compatible with the old one. https://github.com/D-Programming-Language/phobos/commit/94c06fb2469c0d98be842438b749e61571b42fe3 Merge pull request #609 from jmdavis/7515 Fix issue 7515 -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Jun 10 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7515 Jonathan M Davis <jmdavisProg gmx.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Jun 10 2012