digitalmars.D - suggestion of type: ustring

ZY Zhou (22/22) Mar 20 2011 D's string is supposed to be utf8 encoded, however, the following code

Jonathan M Davis (14/44) Mar 20 2011 It would be prohibitively expensive to be constantly validating strings....

ZY Zhou (9/32) Mar 20 2011 No, it would be much much cheaper, since there are only 2 cases the vali...

Jesse Phillips (5/17) Mar 20 2011 Honestly, so far the only time I had problems processing utf has been wh...

spir (20/37) Mar 20 2011 Anyway, std.utf.validate just tries to *decode*! (see function source be...

ZY Zhou <rinick GgMmAaIiLl.com> writes:

D's string is supposed to be UTF-8 encoded; however, the following code
compiles and runs with no error:

  string s = "\xff"; // s is invalid
  writeln(s);
  fileStream.writeLine(s);

To make sure that only valid UTF-8 strings are used in the system, validation
is needed everywhere, e.g.

  string cut3bytes(string s)
  in {validate(s);}
  out(result) {validate(result);}
  body {return s.length > 3 ? s[0..3] : s;}

I think it would be better if D had a ustring type that does all the
validation, e.g.

  ustring s = "\xff";  // compile error

  char[] c = [0xFF];
  ustring s = c.idup;  // throw UtfException

  ustring s1 = "\xc2\xa2";
  ustring s2 = s1[0..1];  // throw UtfException

So the above example can be simplified to:

  ustring cut3bytes(ustring s)
  {return s.length > 3 ? s[0..3] : s;}

--ZY Zhou

Mar 20 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

It would be prohibitively expensive to be constantly validating strings. You 
validate them at the point that they're created, and then you generally don't 
worry about it. Doing otherwise would be expensive. Some functions do check that 
a string is properly encoded, but most don't. If you want a string type that 
actually validates on every operation, feel free to define a struct which 
holds a string internally and has all of the appropriate overloaded operators 
so that it's a range of dchar and whatnot. But you're going to have a hard 
time convincing folks that such a type should be in Phobos, and there's no way 
that it would make it into the language itself.
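
A minimal sketch of the wrapper struct described above, in case it helps to see the shape of it. The names UString and value are hypothetical and not part of Phobos; the only library calls assumed are std.utf.validate and std.utf.UTFException (spelled UtfException in Phobos at the time of this thread). Only construction and slicing are covered, and slicing simply re-validates the result.

```d
import std.utf : validate, UTFException;

/// Hypothetical wrapper type: holds a string that was validated when created.
struct UString
{
    private string data;

    /// Validates once, at construction; throws UTFException on malformed UTF-8.
    this(string s)
    {
        validate(s);
        data = s;
    }

    /// The validated payload.
    @property string value() const { return data; }

    /// Slicing builds a new UString, so a cut through a code point throws.
    UString opSlice(size_t lo, size_t hi) const
    {
        return UString(data[lo .. hi]);
    }
}

unittest
{
    import std.exception : assertThrown;

    auto cent = UString("\xc2\xa2");            // U+00A2 encoded as two bytes
    assert(cent.value == "\xc2\xa2");
    assertThrown!UTFException(UString("\xff")); // 0xFF never occurs in UTF-8
    assertThrown!UTFException(cent[0 .. 1]);    // cut inside a code point
}
```

Re-validating the whole slice is the simplest thing that is obviously correct; it is also the cost being argued about here.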

And honestly, how often do you have to worry about invalid strings? As long as 
you check them when they're created, you won't generally have problems with 
invalid strings, and it's a lot less expensive than constantly checking their 
validity whenever you do anything with them.

- Jonathan M Davis

Mar 20 2011

ZY Zhou <rinick GgMmAaIiLl.com> writes:

 It would be prohibitively expensive to be constantly validating strings.

No, it would be much cheaper, since there are only two cases in which
validation is needed:

1) When you convert char[] to ustring. In this case the validation is
necessary.
2) When you split a ustring. Since a ustring is guaranteed to be valid, the
validation only needs to check 2 bytes of data (the start and the end), which
is much cheaper than validating the entire string.

After that, no other function needs to worry about invalid UTF-8 strings: as
long as the parameter is a ustring, no validation is needed.
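
The two-byte check in case 2 could be sketched as follows. This assumes the input is already known to be valid UTF-8; the helper names isContinuation and sliceKeepsValidity are made up for illustration. A cut position is safe exactly when the byte at that position is not a continuation byte (bit pattern 10xxxxxx).

```d
/// True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool isContinuation(char b) pure nothrow @nogc @safe
{
    return (b & 0xC0) == 0x80;
}

/// Hypothetical helper: for a string s that is already valid UTF-8, reports
/// whether s[lo .. hi] is also valid, by looking only at the two cut points.
bool sliceKeepsValidity(string s, size_t lo, size_t hi) pure nothrow @nogc @safe
{
    assert(lo <= hi && hi <= s.length);
    // A cut at the very end of the string is always on a boundary.
    immutable startOk = lo == s.length || !isContinuation(s[lo]);
    immutable endOk = hi == s.length || !isContinuation(s[hi]);
    return startOk && endOk;
}

unittest
{
    string cent = "\xc2\xa2";                    // one code point, two bytes
    assert(sliceKeepsValidity(cent, 0, 2));
    assert(!sliceKeepsValidity(cent, 0, 1));     // ends inside the code point
    assert(!sliceKeepsValidity(cent, 1, 2));     // starts inside the code point
}
```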


Mar 20 2011

Jesse Phillips <jessekphillips+D gmail.com> writes:

ZY Zhou Wrote:

 It would be prohibitively expensive to be constantly validating strings.

 
 No, it would be much much cheaper, since there are only 2 cases the validating
 is needed
 
 1) when you convert char[] to ustring, in this case, the validating is
 necessary
 2) when you use split on ustring. but since ustring is guaranteed to be valid,
 the validating only need to check 2 bytes of data (start and end), much
 cheaper than validating the entire string.
 
 after that, all the other functions will no longer need to worry about invalid
 utf8 string, as long as the parameter is ustring, no validating is needed.

Honestly, so far the only time I had problems processing utf has been when
someone stuck a stupid BOM[1] at the beginning of the file.

Question, what is so hard about inserting validity checks[2] into your code
just as you have described? This way you don't have to put them in contracts of
all your functions.

1. http://en.wikipedia.org/wiki/Byte_order_mark
2. http://digitalmars.com/d/2.0/phobos/std_utf.html#validate
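
A rough sketch of what such a check at the point of input might look like, including the BOM case mentioned above. The function name checkedInput is hypothetical; the only library call assumed is std.utf.validate, which throws std.utf.UTFException on malformed input.

```d
import std.utf : validate, UTFException;

/// Hypothetical entry-point helper: strips a leading UTF-8 BOM (EF BB BF)
/// if present, then validates the rest once.
string checkedInput(string raw)
{
    if (raw.length >= 3 && raw[0 .. 3] == "\xEF\xBB\xBF")
        raw = raw[3 .. $];
    validate(raw); // throws UTFException on malformed UTF-8
    return raw;
}

unittest
{
    import std.exception : assertThrown;

    assert(checkedInput("\xEF\xBB\xBFabc") == "abc"); // BOM removed
    assert(checkedInput("abc") == "abc");
    assertThrown!UTFException(checkedInput("\xff"));
}
```

Code downstream of this one call can then take plain string parameters without contracts.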

Mar 20 2011

spir <denis.spir gmail.com> writes:

On 03/20/2011 05:12 PM, Jesse Phillips wrote:
 ZY Zhou Wrote:

 It would be prohibitively expensive to be constantly validating strings.

 No, it would be much much cheaper, since there are only 2 cases the validating
 is needed

 1) when you convert char[] to ustring, in this case, the validating is
 necessary
 2) when you use split on ustring. but since ustring is guaranteed to be valid,
 the validating only need to check 2 bytes of data (start and end), much
 cheaper than validating the entire string.

 after that, all the other functions will no longer need to worry about invalid
 utf8 string, as long as the parameter is ustring, no validating is needed.

 Honestly, so far the only time I had problems processing utf has been when
 someone stuck a stupid BOM[1] at the beginning of the file.

 Question, what is so hard about inserting validity checks[2] into your code
 just as you have described? This way you don't have to put them in contracts
 of all your functions.

 1. http://en.wikipedia.org/wiki/Byte_order_mark
 2. http://digitalmars.com/d/2.0/phobos/std_utf.html#validate

Anyway, std.utf.validate just tries to *decode*! (See the function source 
below; decode itself throws an exception when it hits invalid UTF.) 
So it would be more efficient to just decode at the start (which in D means 
converting to dstring) and work with strings of code points throughout your 
process. In addition to validating only once, further operations can be much 
faster each time you need to operate at the level of code points.

void validate(S)(in S s) if (isSomeString!S)
{
     immutable len = s.length;
     for (size_t i = 0; i < len; )
     {
         decode(s, i);
     }
}
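
The decode-once approach described above might look roughly like this. The helper name decodeOnce is hypothetical. It calls std.utf.validate explicitly before converting with std.conv.to, rather than assuming that the conversion itself rejects malformed input.

```d
import std.conv : to;
import std.utf : validate;

/// Hypothetical helper: validate once, then work on code points (dchar).
dstring decodeOnce(string s)
{
    validate(s);          // throws std.utf.UTFException on malformed UTF-8
    return s.to!dstring;  // one dchar per code point
}

unittest
{
    string s = "a\xc2\xa2b";       // 'a', U+00A2, 'b': 4 bytes
    dstring d = decodeOnce(s);
    assert(s.length == 4);
    assert(d.length == 3);         // 3 code points
    assert(d[1] == '\u00A2');      // indexing is by code point
    assert(d[0 .. 2].length == 2); // a slice can never split a code point
}
```

The trade-off is memory: a dstring uses four bytes per code point, whatever the text.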

Denis
-- 
_________________
vita es estrany
spir.wikidot.com

Mar 20 2011
