digitalmars.D - suggestion of type: ustring

reply ZY Zhou <rinick GgMmAaIiLl.com> writes:
D's string is supposed to be UTF-8 encoded; however, the following code
compiles and runs with no error:

  string s = "\xff"; // s is invalid
  writeln(s);
  fileStream.writeLine(s);

To make sure only valid UTF-8 strings are used in the system, validation
is needed everywhere, e.g.

  string cut3bytes(string s)
  in {validate(s);}
  out(result) {validate(result);}
  body {return s.length > 3 ? s[0..3] : s;}

I think it would be better if D had a ustring type to do all the validation,
e.g.

  ustring s = "\xFF";  // compile error

  char[] c = [0xFF];
  ustring s = c.idup;  // throws UtfException

  ustring s1 = "\xc2\xa2";
  ustring s2 = s1[0..1];  // throws UtfException

So the above example can be simplified to:

  ustring cut3bytes(ustring s)
  {return s.length > 3 ? s[0..3] : s;}

--ZY Zhou
Mar 20 2011
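
The behaviour described in the post above can be reproduced with a short sketch. It assumes a current Phobos, where the exception is spelled UTFException (the thread spells it UtfException), and `isValidUtf8` is an illustrative helper name, not a Phobos function.

```d
import std.utf : UTFException, validate;

/// Illustrative helper (not part of Phobos): true if s is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);   // std.utf.validate throws on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string bad = "\xff";              // compiles: the literal is not checked
    assert(bad.length == 1);
    assert(!isValidUtf8(bad));        // the problem only shows up when validated

    string good = "\xc2\xa2";         // U+00A2 encoded as two code units
    assert(isValidUtf8(good));
    assert(!isValidUtf8(good[0 .. 1])); // slicing mid-sequence breaks validity
}
```

The point of the sketch is only that nothing in the type `string` itself rejects these values; the check has to be requested explicitly.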
parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:

It would be prohibitively expensive to be constantly validating strings. You
validate them at the point that they're created, and then you generally don't
worry about it. Doing otherwise would be expensive. Some functions do check
that a string is properly encoded, but most don't.

If you want a string type that actually validates on every operation, feel
free to define a struct which holds a string internally and has all of the
appropriate overloaded operators so that it's a range of dchar and whatnot.
But you're going to have a hard time convincing folks that such a type should
be in Phobos, and there's no way that it would make it into the language
itself.

And honestly, how often do you have to worry about invalid strings? As long
as you check them when they're created, you won't generally have problems
with invalid strings, and it's a lot less expensive than constantly checking
their validity whenever you do anything with them.

- Jonathan M Davis
Mar 20 2011
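
The struct suggested in the reply above might start out like the following sketch. `UString` and its members are hypothetical names chosen for illustration; the sketch covers only validate-on-construction and omits the range interface and operator overloads the reply mentions.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

/// Hypothetical wrapper: holds a string that was validated once, on construction.
struct UString
{
    private string data;

    this(string s)
    {
        validate(s);   // throws UTFException on malformed UTF-8
        data = s;
    }

    /// Read-only access to the validated payload.
    string toString() const { return data; }

    /// Length in UTF-8 code units, like string.length.
    size_t length() const { return data.length; }
}

void main()
{
    auto ok = UString("\xc2\xa2");
    assert(ok.length == 2);
    assert(ok.toString() == "\xc2\xa2");

    char[] raw = [cast(char) 0xFF];
    assertThrown!UTFException(UString(raw.idup)); // rejected at creation time
}
```

Because `data` is private and immutable, code that receives a `UString` can rely on the check having happened once, which is the "validate at creation" policy the reply argues for.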
parent reply ZY Zhou <rinick GgMmAaIiLl.com> writes:
 It would be prohibitively expensive to be constantly validating strings.

No, it would be much, much cheaper, since there are only two cases where
validation is needed:

1) When you convert char[] to ustring. In this case the validation is
   necessary anyway.
2) When you split (slice) a ustring. But since a ustring is guaranteed to be
   valid, the validation only needs to check 2 bytes of data (start and end),
   which is much cheaper than validating the entire string.

After that, all the other functions no longer need to worry about invalid
UTF-8 strings: as long as the parameter is a ustring, no validation is needed.


Mar 20 2011
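
The two-byte check described in the reply above could look roughly like this. It is a sketch that assumes the input is already known to be valid UTF-8, and it relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx. `isBoundary` and `checkedSlice` are hypothetical names, not Phobos functions.

```d
import std.exception : assertThrown, enforce;
import std.utf : UTFException;

/// True if index i is the start of a code point in s, or the end of s.
/// Only meaningful when s is already known to be valid UTF-8.
bool isBoundary(string s, size_t i)
{
    return i == s.length || (s[i] & 0xC0) != 0x80;
}

/// Hypothetical checked slice: inspects at most two bytes of s.
string checkedSlice(string s, size_t from, size_t to)
{
    enforce(from <= to && to <= s.length, "slice out of range");
    if (!isBoundary(s, from) || !isBoundary(s, to))
        throw new UTFException("slice splits a UTF-8 sequence");
    return s[from .. to];
}

void main()
{
    string s = "a\xc2\xa2b";                  // 'a', U+00A2 (2 bytes), 'b'
    assert(checkedSlice(s, 1, 3) == "\xc2\xa2");
    assertThrown!UTFException(checkedSlice(s, 0, 2)); // ends inside U+00A2
}
```

The cost is constant per slice, independent of the string length, which is the core of the argument being made.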
next sibling parent Jesse Phillips <jessekphillips+D gmail.com> writes:
Honestly, so far the only time I had problems processing utf has been when
someone stuck a stupid BOM[1] at the beginning of the file.

Question, what is so hard about inserting validity checks[2] into your code
just as you have described? This way you don't have to put them in contracts
of all your functions.

1. http://en.wikipedia.org/wiki/Byte_order_mark
2. http://digitalmars.com/d/2.0/phobos/std_utf.html#validate
Mar 20 2011
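
Checking once at the point where data enters the program, as suggested in the reply above, might be sketched as follows. `acceptInput` is a hypothetical helper name. Stripping the UTF-8 byte order mark before validating is an assumption about what the caller wants, prompted by the BOM remark.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

/// Hypothetical input helper: drop a leading UTF-8 BOM, then validate once.
string acceptInput(string raw)
{
    enum bom = "\xEF\xBB\xBF";   // U+FEFF encoded as UTF-8
    // Compare code units directly, so nothing is decoded before validation.
    if (raw.length >= bom.length && raw[0 .. bom.length] == bom)
        raw = raw[bom.length .. $];
    validate(raw);               // throws UTFException on malformed input
    return raw;
}

void main()
{
    assert(acceptInput("\xEF\xBB\xBFhi") == "hi");
    assert(acceptInput("hi") == "hi");
    assertThrown!UTFException(acceptInput("\xff"));
}
```

With a helper like this at every input boundary, the functions further down the call chain do not need contracts that re-validate their arguments.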
prev sibling parent spir <denis.spir gmail.com> writes:
On 03/20/2011 05:12 PM, Jesse Phillips wrote:
 Question, what is so hard about inserting validity checks[2] into your code
 just as you have described? This way you don't have to put them in contracts
 of all your functions.

 2. http://digitalmars.com/d/2.0/phobos/std_utf.html#validate

Anyway, std.utf.validate just tries to *decode*! (See the function source
below; decode itself throws an exception when stepping on invalid UTF.) So it
would be more efficient to just decode (which in D means the same as
converting to dstring) at the start, and work with strings of code points all
along your process. In addition to validating only once, further operations
can be much faster each time you need to operate at the level of code points.

  void validate(S)(in S s) if (isSomeString!S)
  {
      immutable len = s.length;
      for (size_t i = 0; i < len; )
      {
          decode(s, i);
      }
  }

Denis
--
_________________
vita es estrany
spir.wikidot.com
Mar 20 2011
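
The decode-once approach from the last reply can be sketched with the same loop as the quoted `validate` source, keeping the decoded code points instead of discarding them. `decodeAll` is a hypothetical helper name, and the sketch assumes a current Phobos, where `std.utf.decode` throws UTFException on malformed input.

```d
import std.exception : assertThrown;
import std.utf : UTFException, decode;

/// Hypothetical helper: decode a UTF-8 string to code points in one pass.
/// Decoding doubles as validation, since decode throws on malformed input.
dstring decodeAll(string s)
{
    dchar[] result;
    for (size_t i = 0; i < s.length; )
        result ~= decode(s, i);   // decode advances i past the code point
    return result.idup;
}

void main()
{
    string utf8 = "a\xc2\xa2b";          // 'a', U+00A2, 'b' in 4 code units
    dstring points = decodeAll(utf8);
    assert(points.length == 3);          // one element per code point
    assert(points[1] == '\u00A2');       // constant-time indexing by code point
    assertThrown!UTFException(decodeAll("\xff"));
}
```

The trade-off is memory: a dstring uses four bytes per code point, in exchange for slicing and indexing that can never split a character's encoding.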