
digitalmars.D - string need to be robust

reply ZY Zhou <rinick GeeMail.com> writes:
Hi,

I wrote a small program to read and parse HTML (charset=UTF-8). It worked great
until some invalid utf8 chars appeared in that page.
When the string is invalid, things like foreach or std.string.tolower will
just crash.
This makes the string type totally unusable when processing files, since there
is no guarantee that a utf8 file doesn't contain invalid utf8 chars.

So I made a utf8 decoder myself to convert char[] to dchar[]. In my decoder, I
convert all invalid utf8 chars to low surrogate code points (0x80~0xFF ->
0xDC80~0xDCFF). Since low surrogates are invalid utf32 codes, I'm still able to
know which part of the string is invalid. Besides, after processing the
dchar[] string, I can still convert it back to utf8 char[] without affecting
any of the invalid parts.
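Incidentally, this exact 0x80~0xFF -> 0xDC80~0xDCFF mapping is what Python later standardized as the "surrogateescape" codec error handler, so the round-trip property described above can be sketched there directly (the sample bytes below are illustrative, not from the original page):

```python
# Round-trip invalid bytes through decode/encode using the same
# low-surrogate trick described above (Python calls it "surrogateescape").
raw = b"valid \xe4\xb8\xad invalid \xa0\xff end"  # mixed valid/invalid UTF-8

text = raw.decode("utf-8", errors="surrogateescape")
# Each invalid byte 0xNN became the lone surrogate code point 0xDCNN,
# so the invalid parts stay distinguishable from real characters.
assert "\udca0" in text and "\udcff" in text

# Encoding back with the same handler recovers the original bytes exactly,
# leaving the invalid parts untouched.
assert text.encode("utf-8", errors="surrogateescape") == raw
```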

But it is still too easy to crash a program with an invalid string.
Is it possible to make this a native feature of string? Or is there any other
recommended method to solve this issue?


Thank you,
--ZY Zhou
Mar 13 2011
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday 13 March 2011 01:57:12 ZY Zhou wrote:
 Hi,
 
 I wrote a small program to read and parse html(charset=UTF-8). It worked
 great until some invalid utf8 chars appears in that page.
 When the string is invalid, things like foreach or std.string.tolower will
 just crash.
 this make the string type totally unusable when processing files, since
 there is no guarantee that utf8 file doesn't contain invalid utf8 chars.
 
 So I made a utf8 decoder myself to convert char[] to dchar[]. In my
 decoder, I convert all invalid utf8 chars to low surrogate code
 points(0x80~0xFF -> 0xDC80~0xDCFF), since low surrogate are invalid utf32
 codes, I'm still able to know which part of the string is invalid.
 Besides, after processing the dchar[] string, I still can convert it back
 to utf8 char[] without affecting any of the invalid part.
 
 But it is still too easy to crash program with invalid string.
 Is it possible to make this a native feature of string? Or is there any
 other recommended method to solve this issue?

Check out std.utf. It has the functions for dealing with unicode stuff.

- Jonathan M Davis
Mar 13 2011
next sibling parent reply ZY Zhou <rinick GeeMail.com> writes:
std.utf throws an exception instead of crashing the program, but you still need
to add try/catch everywhere.

My point is: this simple code should work; instead of crashing, it is supposed
to leave all invalid codes untouched and just process the valid parts.

Stream file = new BufferedFile("sample.txt");
foreach(char[] line; file) {
   string s = line.idup.tolower;
}
Mar 13 2011
next sibling parent %u <e ee.com> writes:
== Quote from ZY Zhou (rinick GeeMail.com)'s article
 std.utf throw exception instead of crash the program. but you still need to add
 try/catch everywhere.
 My point is: this simple code should work, instead of crash, it is supposed to
 leave all invalid codes untouched and just process the valid parts.
 Stream file = new BufferedFile("sample.txt");
 foreach(char[] line; file) {
    string s = line.idup.tolower;
 }

At first glance I would like to overload the opInvalid? function in the stream
such that instead of a throw, the char would be ignored.

Disclaimer: I don't know D2 nor streams :)
Mar 13 2011
prev sibling next sibling parent reply ZY Zhou <rinick GeeMail.com> writes:
 but I think that it's completely unreasonable to expect
 all of the string-based and/or range-based functions to be able to handle
 invalid unicode.

As I explained in the first mail, if the utf8 parser converts all invalid utf8
chars to low surrogate code points (0x80~0xFF -> 0xDC80~0xDCFF), other string
related functions will still work fine, and you can also handle these errors
if you want:

string s = "\xa0";
foreach(dchar d; s) {
   if (isValidUnicode(d)) {
      process(d);
   } else {
      handleError(d);
   }
}

== Quote from Jonathan M Davis (jmdavisProg gmx.com)'s article
 On Sunday 13 March 2011 04:34:24 ZY Zhou wrote:
 std.utf throw exception instead of crash the program. but you still need to
 add try/catch everywhere.

 My point is: this simple code should work, instead of crash, it is supposed
 to leave all invalid codes untouched and just process the valid parts.

 Stream file = new BufferedFile("sample.txt");
 foreach(char[] line; file) {
    string s = line.idup.tolower;
 }

 I think that it's completely unreasonable to expect all string functions to
 worry about whether they're dealing with valid unicode or not. And a lot of
 string stuff would involve ranges which would require converting each code
 point to UTF-32. And how is it supposed to do _that_ with invalid UTF-8? I
 don't know how you expect to really be able to do anything with invalid UTF-8
 anyway. There may be something that could be added to std.utf to help better
 handle the situation, but I think that it's completely unreasonable to expect
 all of the string-based and/or range-based functions to be able to handle
 invalid unicode.

 - Jonathan M Davis

Mar 13 2011
next sibling parent ZY Zhou <rinick GeeeeMail.com> writes:
If an invalid utf8 or utf16 code needs to be converted to utf32, then it should
be converted to an invalid utf32 code. That's why D800~DFFF are marked as
invalid code points in the unicode standard.

== Quote from spir (denis.spir gmail.com)'s article
 This is not a good idea, imo. Surrogate values /are/ invalid code points. (For
 the ones who guess, there are a range of /code unit/ values used to code in
 utf16 code points > 0xFFFF.) They should never appear in a string of dchar[];
 and a string of char[] code units should never encode a non-code point in the

Mar 13 2011
prev sibling parent spir <denis.spir gmail.com> writes:
On 03/13/2011 04:43 PM, ZY Zhou wrote:
 If a invalid utf8 or utf16 code need to be converted to utf32, then it should be
 converted to an invalid utf32. that's why D800~DFFF are marked as invalid points
 in unicode standard.

You are wrong on both points. First, there is no definition of invalid source
conversion into another format/encoding; instead it should be treated as
invalid, that's all. A language or string-processing library should certainly
*not* provide any way to do that. Instead, it should just signal invalidity by
crashing or throwing. Second, the range you mention is not intended for
application use; instead it is reserved for special use by utf16, and, as such,
invalid.

Since the beginning of this thread, you have been asking for the D standard
features (the *string types or *char[] arrays) to cope with your particular
needs of the moment, doing your job; at the price of all other use cases of
those features potentially becoming insecure or incorrect, crashing loads of
existing code which relies on correct behaviour, and breaking the standard.
Strange.

Denis
 == Quote from spir (denis.spir gmail.com)'s article
 This is not a good idea, imo. Surrogate values /are/ invalid code points. (For
 the ones who guess, there are a range of /code unit/ values used to code in
 utf16 code points>  0xFFFF.) They should never appear in a string of dchar[];
 and a string of char[] code units should never encode a non-code point in the


--
_________________
vita es estrany
spir.wikidot.com
Mar 13 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 03/13/2011 01:25 PM, ZY Zhou wrote:
 but I think that it's completely unreasonable to expect
  all of the string-based and/or range-based functions to be able to handle
  invalid unicode.


 As I explained in the first mail, if utf8 parser convert all invalid utf8 chars to
 low surrogate code points(0x80~0xFF -> 0xDC80~0xDCFF), other string related
 functions will still work fine, and you can also handle these error if you want

 string s = "\xa0";
 foreach(dchar d; s) {
    if (isValidUnicode(d)) {
       process(d);
    } else {
       handleError(d);
    }
 }

This is not a good idea, imo. Surrogate values /are/ invalid code points. (For
the ones who guess, there is a range of /code unit/ values used to code in
utf16 code points > 0xFFFF.) They should never appear in a string of dchar[];
and a string of char[] code units should never encode a non-code point in the
surrogate range.

Denis

--
_________________
vita es estrany
spir.wikidot.com
Mar 13 2011
prev sibling parent spir <denis.spir gmail.com> writes:
On 03/13/2011 01:25 PM, ZY Zhou wrote:
 but I think that it's completely unreasonable to expect
  all of the string-based and/or range-based functions to be able to handle
  invalid unicode.


 As I explained in the first mail, if utf8 parser convert all invalid utf8 chars to
 low surrogate code points(0x80~0xFF -> 0xDC80~0xDCFF), other string related
 functions will still work fine, and you can also handle these error if you want

 string s = "\xa0";
 foreach(dchar d; s) {
    if (isValidUnicode(d)) {
       process(d);
    } else {
       handleError(d);
    }
 }

PS: You are free to preprocess the source if you like, and convert invalid
parts into whatever you like. But instead of surrogates, you'd rather use one
of the freely usable ranges of values; or maybe use 0 (so that output won't be
disturbed); or better, the code point intended for "un-representable" things
(U+FFFD), which all fonts would correctly interpret (and usually display as an
inverse video '?').

Denis

--
_________________
vita es estrany
spir.wikidot.com
Mar 13 2011
prev sibling parent spir <denis.spir gmail.com> writes:
On 03/13/2011 12:34 PM, ZY Zhou wrote:
 std.utf throw exception instead of crash the program. but you still need to add
 try/catch everywhere.

 My point is: this simple code should work, instead of crash, it is supposed to
 leave all invalid codes untouched and just process the valid parts.

 Stream file = new BufferedFile("sample.txt");
 foreach(char[] line; file) {
     string s = line.idup.tolower;
 }

The line reader /must/ crash or throw if the source is invalid. If you do not
want to pre-process the source (and convert to surrogates as you do, which IMO
is not a very good idea) /before/ running the process loop, then read the
source as plain byte[], check and pre-process invalid text in this form, and
only then convert to string to be able to use string functions.

There is no builtin type in D that takes random input as a supposed string
(text) /without/ checking it, and still allows using string operations on it
as if it were checked to be valid. This would be foolish, don't you think? The
D way of checking the source of strings at construction time allows checking
only once, then safely applying any number of operations. The opposite would
require checking at each operation. And it would still throw or crash, which
is not what you expect, I guess: you would like, IIUC, invalid input to be
ignored.

Denis

--
_________________
vita es estrany
spir.wikidot.com
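The check-and-pre-process-in-byte-form step can be sketched with Python's codec machinery (used here purely for illustration; the replacement value here is U+FFFD rather than the surrogate scheme, and the helper name is made up):

```python
def sanitize_utf8(raw: bytes) -> bytes:
    # Decode with replacement, then re-encode: the result is guaranteed
    # to be valid UTF-8, with each invalid sequence turned into U+FFFD.
    return raw.decode("utf-8", errors="replace").encode("utf-8")

# The invalid byte 0xA0 becomes the replacement character; valid text
# is passed through untouched.
assert sanitize_utf8(b"ok \xa0!") == "ok \ufffd!".encode("utf-8")
```

After a pass like this, the text can be handed to string functions that assume validity, and they never need to throw.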
Mar 13 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday 13 March 2011 04:34:24 ZY Zhou wrote:
 std.utf throw exception instead of crash the program. but you still need to
 add try/catch everywhere.
 
 My point is: this simple code should work, instead of crash, it is supposed
 to leave all invalid codes untouched and just process the valid parts.
 
 Stream file = new BufferedFile("sample.txt");
 foreach(char[] line; file) {
    string s = line.idup.tolower;
 }

I think that it's completely unreasonable to expect all string functions to
worry about whether they're dealing with valid unicode or not. And a lot of
string stuff would involve ranges which would require converting each code
point to UTF-32. And how is it supposed to do _that_ with invalid UTF-8? I
don't know how you expect to really be able to do anything with invalid UTF-8
anyway.

There may be something that could be added to std.utf to help better handle
the situation, but I think that it's completely unreasonable to expect all of
the string-based and/or range-based functions to be able to handle invalid
unicode.

- Jonathan M Davis
Mar 13 2011
prev sibling next sibling parent reply spir <denis.spir gmail.com> writes:
On 03/13/2011 10:57 AM, ZY Zhou wrote:
 Hi,

 I wrote a small program to read and parse html(charset=UTF-8). It worked great
 until some invalid utf8 chars appears in that page.
 When the string is invalid, things like foreach or std.string.tolower will
 just crash.
 this make the string type totally unusable when processing files, since there
 is no guarantee that utf8 file doesn't contain invalid utf8 chars.

 So I made a utf8 decoder myself to convert char[] to dchar[]. In my decoder, I
 convert all invalid utf8 chars to low surrogate code points(0x80~0xFF ->
 0xDC80~0xDCFF), since low surrogate are invalid utf32 codes, I'm still able to
 know which part of the string is invalid. Besides, after processing the
 dchar[] string, I still can convert it back to utf8 char[] without affecting
 any of the invalid part.

 But it is still too easy to crash program with invalid string.
 Is it possible to make this a native feature of string? Or is there any other
 recommended method to solve this issue?

D native features *must* crash or throw when the source text is invalid. What
do you think? What should a square root function do when you pass it negative
input? /You/ may have special requirements for those cases (ignore it, log it,
negate it, replace it with 0 or 1...), but the library must crash anyway. Your
requirements are application-specific needs that /you/ must define yourself.
Hope I'm clear.

D offers a utf8 checking function (checking utf8 being the same as converting
to utf32, it just tries to convert and throws when it fails). I would use it
before processing to do what /you/ expect.

Denis

--
_________________
vita es estrany
spir.wikidot.com
Mar 13 2011
next sibling parent reply ZY Zhou <rinick GeeeMail.com> writes:
What if I'm making a text editor with D?
I know the text has something wrong, and I want to open it and fix it. The
exception won't help; if the editor just refuses to open an invalid file, then
the editor is useless.
Try opening an invalid utf file with a text editor, like vim, and you will
understand what I mean.

== Quote from spir (denis.spir gmail.com)'s article
 D offers an utf8 checking function (checking utf8 beeing the same as
 convertingto utf32, it just tries to convert and throws when fails). I would
 use before process to do what /you/ expect.
 Denis

Mar 13 2011
parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-03-13 10:18:24 -0400, ZY Zhou <rinick GeeeMail.com> said:

 What if I'm making a text editor with D?
 I know the text has something wrong, I want to open it and fix it. the exception
 won't help, if the editor just refuse to open invalid file, then the editor is
 useless.
 Try open an invalid utf file with a text editor, like vim, you will understand
 what I mean

But what is the best thing to do when you get an invalid UTF file in a text
editor? Perhaps you should show a warning to the user, perhaps you should also
ask the user to select the right text encoding (because it might simply not be
UTF-8), or perhaps you want to silently ignore the error and show an invalid
character marker at the right point in the text. All of these options are
valid and the programming language shouldn't decide that for you.

So I'd point out that a text file editor is a special use case; most programs
aren't text file editors and don't share this concern. In the same vein, HTML
parsers are also a special case that should know how to handle encodings. In
fact, HTML 5 defines explicitly how to deal with invalid UTF-8 sequences:
<http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#utf-8>

There are many good ways to deal with invalid UTF-8 sequences. Throwing an
exception seems like the most robust one to me since it protects against
invalid input. What to do with invalid input belongs in the application logic,
not the language.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
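The two policies being contrasted here, substitute a marker versus throw and let the application decide, can be sketched with Python's codec error handlers (the sample bytes are illustrative):

```python
raw = b"abc \xa0\xff def"  # two standalone invalid bytes

# Replacement policy: each invalid sequence becomes U+FFFD, which is how
# browsers decode UTF-8 under the WHATWG rules referenced above.
assert raw.decode("utf-8", errors="replace") == "abc \ufffd\ufffd def"

# Strict policy: raise, and report where decoding failed, so the
# application logic (not the language) decides what happens next.
try:
    raw.decode("utf-8")
    assert False, "should have raised"
except UnicodeDecodeError as e:
    assert e.start == 4  # index of the first invalid byte
```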
Mar 13 2011
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2011-03-13 13:22, spir wrote:
 On 03/13/2011 10:57 AM, ZY Zhou wrote:
 Hi,

 I wrote a small program to read and parse html(charset=UTF-8). It worked great
 until some invalid utf8 chars appears in that page.
 When the string is invalid, things like foreach or std.string.tolower will
 just crash.
 this make the string type totally unusable when processing files, since there
 is no guarantee that utf8 file doesn't contain invalid utf8 chars.

 So I made a utf8 decoder myself to convert char[] to dchar[]. In my decoder, I
 convert all invalid utf8 chars to low surrogate code points(0x80~0xFF ->
 0xDC80~0xDCFF), since low surrogate are invalid utf32 codes, I'm still able to
 know which part of the string is invalid. Besides, after processing the
 dchar[] string, I still can convert it back to utf8 char[] without affecting
 any of the invalid part.

 But it is still too easy to crash program with invalid string.
 Is it possible to make this a native feature of string? Or is there any other
 recommended method to solve this issue?

 D native features *must* crash or throw when the source text is invalid. What
 do you think? What should a square root function do when you pass it negative
 input? /You/ may have special requirements for those cases (ignore it, log it,
 negate it, replace it with 0 or 1...), but the library must crash anyway. Your
 requirements are application-specific needs that /you/ must define yourself.
 Hope I'm clear.
 D offers an utf8 checking function (checking utf8 beeing the same as
 convertingto utf32, it just tries to convert and throws when fails). I would
 use before process to do what /you/ expect.
 Denis

I would say that the functions should NOT crash but instead throw an
exception. Then the developer can choose what to do when there's an invalid
unicode character.

--
/Jacob Carlborg
Mar 13 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/13/11 1:55 PM, Jacob Carlborg wrote:
 I would say that the functions should NOT crash but instead throw an
 exception. Then the developer can choose what to do when there's an
 invalid unicode character.

Yah. In addition, the exception should provide index information such that an
application can elect to continue processing.

Andrei
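The elect-to-continue pattern that index information enables can be sketched in Python, whose UnicodeDecodeError carries start/end positions playing the role of the proposed index (the helper name is made up for illustration):

```python
def decode_skipping_bad_bytes(raw: bytes) -> str:
    # Decode strictly; on failure, use the position carried by the
    # exception to keep the valid prefix, skip the bad bytes, and resume.
    out, i = [], 0
    while i < len(raw):
        try:
            out.append(raw[i:].decode("utf-8"))
            break
        except UnicodeDecodeError as e:
            out.append(raw[i:i + e.start].decode("utf-8"))
            i += e.end  # jump past the invalid bytes and continue
    return "".join(out)

assert decode_skipping_bad_bytes(b"ab\xffcd") == "abcd"
```

Skipping is only one possible policy; the same loop could log, substitute a marker, or abort, which is the point of putting the decision in application code.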
Mar 13 2011
next sibling parent reply ZY Zhou <rinick GeeeeeMail.com> writes:
It doesn't make sense to add try/catch every time you use
tolower/toupper/foreach on a string. No one will do that.
You either throw an exception when converting invalid utf8 bytes to string, or
never throw an exception and use an invalid UTF32 code in dchar to represent
the invalid utf8 code.

  string s = "\xA0"; // this is the right place to throw the exception (or compile error)
  s.tolower; // no one will add try/catch on this

--ZY Zhou

== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 On 3/13/11 1:55 PM, Jacob Carlborg wrote:
 I would say that the functions should NOT crash but instead throw an
 exception. Then the developer can choose what to do when there's an
 invalid unicode character.

 Yah. In addition, the exception should provide index information such
 that an application can elect to continue processing. Andrei

Mar 13 2011
next sibling parent Jesse Phillips <jessekphillips+D gmail.com> writes:
ZY Zhou Wrote:

 it doesn't make sense to add try/catch every time you use
 tolower/toupper/foreach on string. No one will do that.
 You either throw exception when convert invalid utf8 bytes to string, or never
 throw exception and use invalid UTF32 code in dchar to represent invalid utf8
 code.

   string s = "\x0A"; // this is the right place to throw the exception (or compile error)

I don't think this is even a good place to have it. The problem being that
most strings/invalid strings will come in during run-time. And at run-time we
have std.file.readText, which will validate the string upon entry.
   s.tolower; // no one will add try/catch on this

Right, that isn't the point of throwing an exception. In this case you have
not written your program to handle invalid Unicode, which means your program
should not continue to operate until you decide to fix it. (Exceptions are
robust.)

However, an access violation like the one received from using foreach is not
appropriate and should be filed as a bug. You should be programming under the
assumption that everything is how you expect it, and verify and remedy
external inputs at the point of input. If you don't, then the program
shouldn't continue running.
Mar 13 2011
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2011-03-14 06:45, ZY Zhou wrote:
 it doesn't make sense to add try/catch every time you use
 tolower/toupper/foreach on string. No one will do that.
 You either throw exception when convert invalid utf8 bytes to string, or never
 throw exception and use invalid UTF32 code in dchar to represent invalid utf8
 code.

    string s = "\x0A"; // this is the right place to throw the exception (or compile error)
    s.tolower; // no one will add try/catch on this

 --ZY Zhou

 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 On 3/13/11 1:55 PM, Jacob Carlborg wrote:
 I would say that the functions should NOT crash but instead throw an
 exception. Then the developer can choose what to do when there's an
 invalid unicode character.

 Yah. In addition, the exception should provide index information such
 that an application can elect to continue processing. Andrei


Depending on what kind of application you write, all you need could just be to
wrap the main method in try/catch.

--
/Jacob Carlborg
Mar 14 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday 13 March 2011 22:45:38 ZY Zhou wrote:
 it doesn't make sense to add try/catch every time you use
 tolower/toupper/foreach on string. No one will do that.
 You either throw exception when convert invalid utf8 bytes to string, or
 never throw exception and use invalid UTF32 code in dchar to represent
 invalid utf8 code.
 
   string s = "\x0A"; // this is the right place to throw the exception (or
 compile error)
   s.tolower; // no one will add try/catch on this

If you're going to worry about string validity, it should be checked when the
string is initially created. If it's not valid, then fix it in whatever way
you deem appropriate. After that, you shouldn't have to worry about string
validity anymore.

Honestly, invalid UTF-8 is pretty rare overall. You'll get it because of a bad
file or somesuch, but once the string is valid, it stays valid. So, you really
only have to worry about it when you read in files and such. Once the file has
been correctly read in, you just use the strings without worrying about it.
There shouldn't be a need to use try-catch blocks much of anywhere to worry
about invalid unicode.

- Jonathan M Davis
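The check-at-creation discipline amounts to a hypothetical helper like the following (Python sketch for illustration; std.file.readText mentioned elsewhere in the thread plays the same role in D):

```python
def read_text_validated(path: str) -> str:
    """Validate once at the input boundary; everything downstream can
    then assume the text is valid and never needs try/catch."""
    with open(path, "rb") as f:
        raw = f.read()
    # Raises UnicodeDecodeError here, at the boundary, if the file is bad.
    return raw.decode("utf-8")
```

Callers handle (or report) the error at exactly one place instead of wrapping every string operation.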
Mar 13 2011
prev sibling parent reply KennyTM~ <kennytm gmail.com> writes:
On Mar 14, 11 02:55, Jacob Carlborg wrote:
 I would say that the functions should NOT crash but instead throw an
 exception. Then the developer can choose what to do when there's an
 invalid unicode character.

It is already throwing an exception called
core.exception.UnicodeException. This even provides you the index where
decoding failed.

(However Phobos is not using it, AFAIK.)

-----------
import core.exception, std.stdio, std.conv;

void main() {
    char[] s = [0x0f, 0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc];
    try {
        foreach (dchar d; s) {}
    } catch (UnicodeException e) {
        writefln("error at index %s (%x)", e.idx, to!ubyte(s[e.idx]));
        // error at index 5 (9a)
    }
}
-----------
Mar 13 2011
next sibling parent reply Jesse Phillips <jessekphillips+D gmail.com> writes:
KennyTM~ Wrote:

 It is already throwing an exception called 
 core.exception.UnicodeException. This even provides you the index where 
 decoding failed.
 
 (However Phobos is not using it, AFAIK.)
 
 -----------
 import core.exception, std.stdio, std.conv;
 
 void main() {
      char[] s = [0x0f, 0x12,0x34,0x56,0x78,0x9a,0xbc];
      try {
          foreach (dchar d; s){}
      } catch (UnicodeException e) {
          writefln("error at index %s (%x)", e.idx, to!ubyte(s[e.idx]));
          // error at index 5 (9a)
      }
 }
 -----------

foreach does not use the range interface provided by std.array (that I know of). So there should be a bug reported on the current behavior of Access Violation.
Mar 13 2011
parent KennyTM~ <kennytm gmail.com> writes:
On Mar 14, 11 13:53, Jesse Phillips wrote:
 KennyTM~ Wrote:

 It is already throwing an exception called
 core.exception.UnicodeException. This even provides you the index where
 decoding failed.

 (However Phobos is not using it, AFAIK.)

 -----------
 import core.exception, std.stdio, std.conv;

 void main() {
       char[] s = [0x0f, 0x12,0x34,0x56,0x78,0x9a,0xbc];
       try {
           foreach (dchar d; s){}
       } catch (UnicodeException e) {
           writefln("error at index %s (%x)", e.idx, to!ubyte(s[e.idx]));
           // error at index 5 (9a)
       }
 }
 -----------

foreach does not use the range interface provided by std.array (that I know of). So there should be a bug reported on the current behavior of Access Violation.

I haven't checked on Windows, but there is no access violation in foreach (http://ideone.com/WSBRm), and there is no access violation in the range interface either (as of 2.052 it throws a std.utf.UtfException), so I don't know what bug you are talking about.
Mar 13 2011
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2011-03-13 23:36, KennyTM~ wrote:
 On Mar 14, 11 02:55, Jacob Carlborg wrote:
 I would say that the functions should NOT crash but instead throw an
 exception. Then the developer can choose what to do when there's an
 invalid unicode character.

 It is already throwing an exception called
 core.exception.UnicodeException. This even provides you the index where
 decoding failed.

 (However Phobos is not using it, AFAIK.)

 -----------
 import core.exception, std.stdio, std.conv;

 void main() {
     char[] s = [0x0f, 0x12,0x34,0x56,0x78,0x9a,0xbc];
     try {
         foreach (dchar d; s){}
     } catch (UnicodeException e) {
         writefln("error at index %s (%x)", e.idx, to!ubyte(s[e.idx]));
         // error at index 5 (9a)
     }
 }
 -----------

Good, then we can call it a day :) I assumed it crashed since he said it did.

--
/Jacob Carlborg
Mar 14 2011
prev sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Crash -> Have fun stepping through your code with a debugger, or
worse, observe disassembly.
Throw -> (Hopefully) get an informative error message, which could
mean you'll be able to fix the bug quickly.
Mar 13 2011