digitalmars.D - suggestion of type: ustring
- ZY Zhou (22/22) Mar 20 2011 D's string is supposed to be utf8 encoded, however, the following code
- Jonathan M Davis (14/44) Mar 20 2011 It would be prohibitively expensive to be constantly validating strings....
- ZY Zhou (9/32) Mar 20 2011 No, it would be much much cheaper, since there are only 2 cases the vali...
- Jesse Phillips (5/17) Mar 20 2011 Honestly, so far the only time I had problems processing utf has been wh...
- spir (20/37) Mar 20 2011 Anyway, std.utf.validate just tries to *decode*! (see function source be...
D's string is supposed to be UTF-8 encoded, yet the following code compiles and runs with no error:

    string s = "\xff"; // s is invalid
    writeln(s);
    fileStream.writeLine(s);

To make sure only valid UTF-8 strings are used in the system, validation is needed everywhere, e.g.

    string cut3bytes(string s)
    in { validate(s); }
    out(result) { validate(result); }
    body { return s.length > 3 ? s[0..3] : s; }

I think it would be better if D had a ustring type to do all the validating, e.g.

    ustring s = "\xFF"; // compile error
    char[] c = [0xFF];
    ustring s = c.idup; // throw UtfException
    ustring s1 = "\xc2\xa2";
    ustring s2 = s1[0..1]; // throw UtfException

So the above example can be simplified to:

    ustring cut3bytes(ustring s) { return s.length > 3 ? s[0..3] : s; }

--ZY Zhou
Mar 20 2011
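For readers who want to reproduce the behaviour described in the post above, here is a minimal sketch. It assumes a current Phobos, where the exception is spelled std.utf.UTFException (the post writes UtfException); isValidUtf8 is a hypothetical helper name, not a library function.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string bad = "\xff";      // accepted by the compiler, but not valid UTF-8
    string cent = "\xc2\xa2"; // U+00A2, encoded as two bytes

    assert(!isValidUtf8(bad));
    assert(isValidUtf8(cent));
    // Slicing through the middle of a multi-byte sequence yields invalid data.
    assert(!isValidUtf8(cent[0 .. 1]));
}
```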
It would be prohibitively expensive to be constantly validating strings. You validate them at the point that they're created, and then you generally don't worry about it. Doing otherwise would be expensive. Some functions do check that a string is properly encoded, but most don't.

If you want a string type that actually validates on every operation, feel free to define a struct which holds a string internally and has all of the appropriate overloaded operators so that it's a range of dchar and whatnot. But you're going to have a hard time convincing folks that such a type should be in Phobos, and there's no way that it would make it into the language itself.

And honestly, how often do you have to worry about invalid strings? As long as you check them when they're created, you won't generally have problems with invalid strings, and it's a lot less expensive than constantly checking their validity whenever you do anything with them.

- Jonathan M Davis
Mar 20 2011
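The "struct which holds a string internally" suggested above could look roughly like the following. This is only a sketch under stated assumptions: the name UString and its members are hypothetical, it validates once at construction rather than on every operation, and it exposes just the minimal input-range interface over dchar.

```d
import std.utf : decode, validate;

// Hypothetical wrapper type; not part of Phobos.
struct UString
{
    private string data;

    this(string s)
    {
        validate(s); // throws std.utf.UTFException on malformed input
        data = s;
    }

    // Input-range primitives yielding code points (dchar).
    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }

    void popFront()
    {
        size_t i = 0;
        decode(data, i);     // advances i past the first code point
        data = data[i .. $];
    }

    string toString() { return data; }
}

void main()
{
    import std.exception : assertThrown;
    import std.range.primitives : isInputRange;
    import std.utf : UTFException;

    static assert(isInputRange!UString);

    size_t count;
    foreach (dchar c; UString("a\xc2\xa2b"))
        ++count;
    assert(count == 3); // 'a', U+00A2, 'b'

    assertThrown!UTFException(UString("\xff"));
}
```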
> It would be prohibitively expensive to be constantly validating strings.

No, it would be much, much cheaper, since there are only two cases where validation is needed:

1) When you convert char[] to ustring. In this case the validation is necessary anyway.
2) When you use split on a ustring. But since a ustring is guaranteed to be valid, the validation only needs to check 2 bytes of data (start and end), which is much cheaper than validating the entire string.

After that, no other function needs to worry about invalid UTF-8 strings: as long as the parameter is a ustring, no validation is needed.
Mar 20 2011
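The "check only the start and end" idea in case 2 can be sketched as follows. It relies on the stated precondition that the whole string is already valid UTF-8: a slice is then valid exactly when neither cut lands on a continuation byte (bit pattern 10xxxxxx). The names isBoundary and checkedSlice are hypothetical.

```d
import std.utf : UTFException;

// True if index i does not fall inside a multi-byte sequence.
// Assumes s as a whole is valid UTF-8.
bool isBoundary(string s, size_t i)
{
    return i == s.length || (s[i] & 0xC0) != 0x80;
}

// Hypothetical checked slice: O(1) work instead of re-validating the result.
string checkedSlice(string s, size_t a, size_t b)
{
    if (a > b || b > s.length)
        throw new Exception("slice out of range");
    if (!isBoundary(s, a) || !isBoundary(s, b))
        throw new UTFException("slice splits a code point");
    return s[a .. b];
}

void main()
{
    import std.exception : assertThrown;

    string s = "a\xc2\xa2b"; // 'a', U+00A2 (two bytes), 'b'
    assert(checkedSlice(s, 1, 3) == "\xc2\xa2");
    assert(checkedSlice(s, 0, 4) == s);
    assertThrown!UTFException(checkedSlice(s, 0, 2)); // end cuts U+00A2 in half
    assertThrown!UTFException(checkedSlice(s, 2, 4)); // start is a continuation byte
}
```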
ZY Zhou wrote:
>> It would be prohibitively expensive to be constantly validating strings.
> No, it would be much much cheaper, since there are only 2 cases the validating is needed 1) when you convert char[] to ustring, in this case, the validating is necessary 2) when you use split on ustring. but since ustring is guaranteed to be valid, the validating only need to check 2 bytes of data (start and end), much cheaper than validating the entire string. after that, all the other functions will no longer need to worry about invalid utf8 string, as long as the parameter is ustring, no validating is needed.

Honestly, so far the only time I have had problems processing UTF has been when someone stuck a stupid BOM[1] at the beginning of the file.

Question: what is so hard about inserting validity checks[2] into your code at the points you have described? That way you don't have to put them in the contracts of all your functions.

1. http://en.wikipedia.org/wiki/Byte_order_mark
2. http://digitalmars.com/d/2.0/phobos/std_utf.html#validate
Mar 20 2011
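One way to read the suggestion above is to check text once, where it enters the program. The sketch below assumes the input is supposed to be UTF-8 and that a leading UTF-8 byte order mark (bytes EF BB BF) should be dropped; sanitizeInput is a hypothetical name.

```d
import std.utf : validate;

// Hypothetical entry-point check: strip a UTF-8 BOM, then validate once.
string sanitizeInput(string raw)
{
    enum bom = "\xEF\xBB\xBF"; // U+FEFF encoded as UTF-8
    if (raw.length >= 3 && raw[0 .. 3] == bom)
        raw = raw[3 .. $];
    validate(raw); // throws std.utf.UTFException if the data is malformed
    return raw;
}

void main()
{
    import std.exception : assertThrown;
    import std.utf : UTFException;

    assert(sanitizeInput("\xEF\xBB\xBFhello") == "hello");
    assert(sanitizeInput("hello") == "hello");
    assertThrown!UTFException(sanitizeInput("bad \xff byte"));
}
```

Code downstream of such a function can then take plain string parameters without repeating the check in every contract.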
On 03/20/2011 05:12 PM, Jesse Phillips wrote:
> Honestly, so far the only time I had problems processing utf has been when someone stuck a stupid BOM[1] at the beginning of the file. Question, what is so hard about inserting validity checks[2] into your code just as you have described? This way you don't have to put them in contracts of all your functions.

Anyway, std.utf.validate just tries to *decode*! (See the function source below; decode itself throws an exception when it steps on invalid UTF.)

So it would be more efficient to just decode (which in D means the same as converting to dstring) at the start, and work with strings of code points throughout your process. In addition to validating only once, further operations can be much faster each time you need to operate at the level of code points.

    void validate(S)(in S s) if (isSomeString!S)
    {
        immutable len = s.length;
        for (size_t i = 0; i < len; )
        {
            decode(s, i);
        }
    }

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Mar 20 2011
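The "decode once at the start" approach described above might be sketched like this. The helper name decodeAll is hypothetical; it uses the same std.utf.decode loop as the quoted validate, but keeps the decoded code points instead of discarding them, so validation and conversion to dstring happen in a single pass.

```d
import std.utf : decode, UTFException;

// Hypothetical helper: validate and convert to UTF-32 in one pass.
dstring decodeAll(string s)
{
    dchar[] result;
    result.reserve(s.length); // upper bound: one code point per byte
    for (size_t i = 0; i < s.length; )
        result ~= decode(s, i); // throws UTFException on malformed input
    return result.idup;
}

void main()
{
    import std.exception : assertThrown;

    dstring d = decodeAll("a\xc2\xa2b");
    assert(d.length == 3);        // indices now count code points
    assert(d[1] == '\u00a2');
    assert(d[1 .. 2].length == 1); // slicing can no longer split a code point

    assertThrown!UTFException(decodeAll("\xff"));
}
```

The trade-off is memory: a dstring uses four bytes per code point, so this suits code that does a lot of code-point-level work on the same text.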