digitalmars.D - string need to be robust

ZY Zhou (18/18) Mar 13 2011 Hi,

Jonathan M Davis (3/22) Mar 13 2011 Check out std.utf. It has the functions for dealing with unicode stuff.

ZY Zhou (8/8) Mar 13 2011 std.utf throw exception instead of crash the program. but you still need...

%u (4/12) Mar 13 2011 At first glance I would like to overload the opInvalid? function in the ...
Jonathan M Davis (11/21) Mar 13 2011 I think that it's completely unreasonable to expect all string functions...

ZY Zhou (13/37) Mar 13 2011 As I explained in the first mail, if utf8 parser convert all invalid utf...

spir (11/26) Mar 13 2011 This is not a good idea, imo. Surrogate values /are/ invalid code points...

ZY Zhou (4/8) Mar 13 2011 If a invalid utf8 or utf16 code need to be converted to utf32, then it s...

spir (19/27) Mar 13 2011 You are wrong on both points.

spir (12/27) Mar 13 2011 PS: You are free to preprocess the source if you like it, and convert in...

spir (18/26) Mar 13 2011 The line reader /must/ crash or throw if the source is invalid. If you d...

spir (16/32) Mar 13 2011 D native features *must* crash or throw when the source text is invalid....

ZY Zhou (7/11) Mar 13 2011 What if I'm making a text editor with D?

Michel Fortin (23/30) Mar 13 2011 But what is the best thing to do when you got an invalid UTF file in a

Jacob Carlborg (6/44) Mar 13 2011 I would say that the functions should NOT crash but instead throw an

Andrei Alexandrescu (4/7) Mar 13 2011 Yah. In addition, the exception should provide index information such

Andrej Mitrovic (4/4) Mar 13 2011 Crash -> Have fun stepping through your code with a debugger, or
ZY Zhou (9/16) Mar 13 2011 it doesn't make sense to add try/catch every time you use tolower/touppe...

Jesse Phillips (4/12) Mar 13 2011 Right that isn't the point of throwing an exception. In this case you ha...
Jonathan M Davis (11/20) Mar 13 2011 If you're going to worry about string validity, it should be checked whe...
Jacob Carlborg (5/21) Mar 14 2011 Depending on what kind of application you write, all you need could just...

KennyTM~ (17/20) Mar 13 2011 It is already throwing an exception called

Jesse Phillips (2/21) Mar 13 2011 foreach does not use the range interface provided by std.array (that I k...

KennyTM~ (5/26) Mar 13 2011 I haven't checked on Windows, but there is no access violation in

Jacob Carlborg (5/25) Mar 14 2011 Good, then we can call it a day :) I assumed it crashed since he said it...

ZY Zhou <rinick GeeMail.com> writes:

Hi,

I wrote a small program to read and parse html(charset=UTF-8). It worked great
until some invalid utf8 chars appears in that page.
When the string is invalid, things like foreach or std.string.tolower will
just crash.
this make the string type totally unusable when processing files, since there
is no guarantee that utf8 file doesn't contain invalid utf8 chars.

So I made a utf8 decoder myself to convert char[] to dchar[]. In my decoder, I
convert all invalid utf8 chars to low surrogate code points(0x80~0xFF ->
0xDC80~0xDCFF), since low surrogate are invalid utf32 codes, I'm still able to
know which part of the string is invalid. Besides, after processing the
dchar[] string, I still can convert it back to utf8 char[] without affecting
any of the invalid part.

But it is still too easy to crash program with invalid string.
Is it possible to make this a native feature of string? Or is there any other
recommended method to solve this issue?


Thank you,
--ZY Zhou

Mar 13 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday 13 March 2011 01:57:12 ZY Zhou wrote:
 Hi,
 
 I wrote a small program to read and parse html(charset=UTF-8). It worked
 great until some invalid utf8 chars appears in that page.
 When the string is invalid, things like foreach or std.string.tolower will
 just crash.
 this make the string type totally unusable when processing files, since
 there is no guarantee that utf8 file doesn't contain invalid utf8 chars.
 
 So I made a utf8 decoder myself to convert char[] to dchar[]. In my
 decoder, I convert all invalid utf8 chars to low surrogate code
 points(0x80~0xFF -> 0xDC80~0xDCFF), since low surrogate are invalid utf32
 codes, I'm still able to know which part of the string is invalid.
 Besides, after processing the dchar[] string, I still can convert it back
 to utf8 char[] without affecting any of the invalid part.
 
 But it is still too easy to crash program with invalid string.
 Is it possible to make this a native feature of string? Or is there any
 other recommended method to solve this issue?

Check out std.utf. It has the functions for dealing with unicode stuff.

- Jonathan M Davis

Mar 13 2011

ZY Zhou <rinick GeeMail.com> writes:

std.utf throw exception instead of crash the program. but you still need to add
try/catch everywhere.

My point is: this simple code should work, instead of crash, it is supposed to
leave all invalid codes untouched and just process the valid parts.

Stream file = new BufferedFile("sample.txt");
foreach(char[] line; file) {
   string s = line.idup.tolower;
}

Mar 13 2011

%u <e ee.com> writes:

== Quote from ZY Zhou (rinick GeeMail.com)'s article
 std.utf throw exception instead of crash the program. but you still need to add
 try/catch everywhere.
 My point is: this simple code should work, instead of crash, it is supposed to
 leave all invalid codes untouched and just process the valid parts.
 Stream file = new BufferedFile("sample.txt");
 foreach(char[] line; file) {
    string s = line.idup.tolower;
 }

At first glance I would like to overload the opInvalid? function in the stream
such that iso a throw, the char would be ignored.
Disclaimer: I don't know D2 nor streams :)

Mar 13 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday 13 March 2011 04:34:24 ZY Zhou wrote:
 std.utf throw exception instead of crash the program. but you still need to
 add try/catch everywhere.
 
 My point is: this simple code should work, instead of crash, it is supposed
 to leave all invalid codes untouched and just process the valid parts.
 
 Stream file = new BufferedFile("sample.txt");
 foreach(char[] line; file) {
    string s = line.idup.tolower;
 }

I think that it's completely unreasonable to expect all string functions to 
worry about whether they're dealing with valid unicode or not. And a lot of 
string stuff would involve ranges which would require converting each code
point 
to UTF-32. And how is it supposed to do _that_ with invalid UTF-8?

I don't know how you expect to really be able to do anything with invalid UTF-8 
anyway. There may be something that could be added to std.utf to help better 
handle the situation, but I think that it's completely unreasonable to expect 
all of the string-based and/or range-based functions to be able to handle 
invalid unicode.

- Jonathan M Davis

Mar 13 2011

ZY Zhou <rinick GeeMail.com> writes:

 but I think that it's completely unreasonable to expect
 all of the string-based and/or range-based functions to be able to handle
 invalid unicode.

As I explained in the first mail, if utf8 parser convert all invalid utf8 chars
to
low surrogate code points(0x80~0xFF ->
0xDC80~0xDCFF), other string related functions will still work fine, and you can
also handle these error if you want

string s = "\xa0";
foreach(dchar d; s) {
  if (isValidUnicode(d)) {
    process(d);
  } else {
    handleError(d);
  }
}


== Quote from Jonathan M Davis (jmdavisProg gmx.com)'s article
 On Sunday 13 March 2011 04:34:24 ZY Zhou wrote:
 std.utf throw exception instead of crash the program. but you still need to
 add try/catch everywhere.

 My point is: this simple code should work, instead of crash, it is supposed
 to leave all invalid codes untouched and just process the valid parts.

 Stream file = new BufferedFile("sample.txt");
 foreach(char[] line; file) {
    string s = line.idup.tolower;
 }

 I think that it's completely unreasonable to expect all string functions to
 worry about whether they're dealing with valid unicode or not. And a lot of
 string stuff would involve ranges which would require converting each code
point
 to UTF-32. And how is it supposed to do _that_ with invalid UTF-8?
 I don't know how you expect to really be able to do anything with invalid UTF-8
 anyway. There may be something that could be added to std.utf to help better
 handle the situation, but I think that it's completely unreasonable to expect
 all of the string-based and/or range-based functions to be able to handle
 invalid unicode.
 - Jonathan M Davis

Mar 13 2011