
digitalmars.D - If invalid string should crash (was: string need to be robust)

ZY Zhou <rinick GeeeeMail.com> writes:
Hi,

Invalid UTF-8 always breaks my program, so I suggested that when invalid
bytes in a UTF-8 string have to be converted to dchar, the decoder should use
the low surrogate code points (DC80~DCFF) instead of crashing the program. But
many people here don't like this idea; you think an exception is the right
thing. OK, let me ask you a question:

Do you always try/catch for invalid UTF when reading a file?
I believe you don't; you simply don't care.

While the text file is invalid, the use case itself is valid. Should a
browser crash on a web page that declares charset=utf8 but contains invalid
UTF-8? Exceptions don't help either; using them in this case is almost like
writing a UTF-8 decoder yourself.
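
For reference, the try/catch in question might look like the sketch below. It assumes current Phobos, where `std.utf.validate` throws `UTFException` on malformed input. `isWellFormed` is a hypothetical helper name, not something from this thread.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: reports whether a buffer is well-formed UTF-8
// instead of letting the exception propagate to the caller.
bool isWellFormed(const(char)[] text)
{
    try
    {
        validate(text); // throws UTFException on the first malformed sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

A caller would still have to decide what to do with a buffer that fails the check, which is the part the question above is really about.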

Anyway, since I'm already using my own UTF decoder, I don't care whether you
agree with me or not.

But for the following case, it is completely wrong if it crashes at line 3:

 1:  char[] c = [0xA0];
 2:  string s = c.idup;
 3:  foreach(dchar d; s){}

The expected result is either:
 a) crash at line 2, c is not valid utf
    and can't be converted to string
or:
 b) don't crash, and d = 0xDCA0;
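
A minimal sketch of what option b) could look like, assuming current Phobos, where `std.utf.decode` throws `UTFException` and leaves the index unchanged on malformed input. `decodeLenient` is a hypothetical name, and this is not the decoder mentioned above, only an illustration of the DC80~DCFF mapping.

```d
import std.utf : decode, UTFException;

// Hypothetical lenient decoder: well-formed sequences decode normally,
// and each undecodable byte 0x80..0xFF becomes the code point 0xDC00 + byte.
dchar[] decodeLenient(const(char)[] s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
    {
        try
        {
            result ~= decode(s, i); // advances i past one code point
        }
        catch (UTFException)
        {
            // Assumption: decode left i unchanged, so s[i] is the bad byte.
            result ~= cast(dchar)(0xDC00 + s[i]);
            ++i;
        }
    }
    return result;
}
```

With the example above, the single byte 0xA0 would come out as 0xDCA0. Note that the result is not valid UTF-32, since it contains a lone surrogate.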


--ZY Zhou
Mar 13 2011
%u <a aaa.com> writes:
 But for the following case, it is completely wrong if it crashes at line 3:
  1:  char[] c = [0xA0];
  2:  string s = c.idup;
  3:  foreach(dchar d; s){}
 The expected result is either:
  a) crash at line 2, c is not valid utf and can't be converted to string
 or:
  b) don't crash, and d = 0xDCA0;

I agree with a), but not b). I can't find anything in the Unicode standard that says you can use the low surrogates like that.
Mar 13 2011
"Simen kjaeraas" <simen.kjaras gmail.com> writes:
ZY Zhou <rinick geeeemail.com> wrote:


 But for the following case, it is completely wrong if it crashes at line 3:

Why? That is the point where you are actually saying 'I care about individual characters in this string'.
  1:  char[] c = [0xA0];
  2:  string s = c.idup;
  3:  foreach(dchar d; s){}

 The expected result is either:
  a) crash at line 2, c is not valid utf
     and can't be converted to string

A char[] is just as bound by the rules as string is (string is simply immutable(char)[]). Thus the program should feel free to expect it to contain valid UTF-8 data. Validating each string on every single copy operation is unacceptable overhead.
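
One way to read this is to validate once, at the point where raw bytes enter the program, and trust the data afterwards. A minimal sketch, assuming `std.utf.validate` from current Phobos; `toCheckedString` is a hypothetical helper, not an existing library function:

```d
import std.utf : validate, UTFException;

// Hypothetical boundary check: validate untrusted bytes once, then hand
// out an immutable string that later copies and slices can trust.
string toCheckedString(const(char)[] raw)
{
    validate(raw);   // throws UTFException if raw is not well-formed UTF-8
    return raw.idup; // plain copy, no further validation
}
```

Used in place of the bare `c.idup` in the example, the failure would surface at line 2 rather than inside the foreach at line 3, without charging every ordinary copy for a validation pass.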
 or:
  b) don't crash, and d = 0xDCA0;

b) is unacceptable in the general case. It may be good for your specific situation, but in general, it is simply ignoring an error.

-- Simen
Mar 14 2011