digitalmars.D - Re: If invalid string should crash(was:string need to be robust)

Jussi Jumppanen <jussij zeusedit.com> Mar 13 2011

ZY Zhou <rinick GeeeeMail.com> Mar 13 2011

spir <denis.spir gmail.com> Mar 14 2011

Kagamin <spam here.lot> Mar 14 2011

Jussi Jumppanen <jussij zeusedit.com> writes:

%u Wrote:

 I agree with a), but not b). I can't find anything in the Unicode standard
 that says you can use the low surrogates like that.


According to: http://www.cl.cam.ac.uk/~mgk25/

    According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
    receiving UTF-8 shall interpret a "malformed sequence in the same way
    that it interprets a character that is outside the adopted subset" and
    "characters that are not within the adopted subset shall be indicated
    to the user" by a receiving device. A quite commonly used approach in
    UTF-8 decoders is to replace any malformed UTF-8 sequence by a
    replacement character (U+FFFD), which looks a bit like an inverted
    question mark, or a similar symbol. 

Refer to this file for the above quote: 

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

Mar 13 2011

ZY Zhou <rinick GeeeeMail.com> writes:

Thank you Jussi,

But this is still not part of the standard. U+FFFD is a commonly used approach,
while mapping to U+DC80..U+DCFF is also a common solution for that
(http://en.wikipedia.org/wiki/Utf8#Invalid_byte_sequences). Different
approaches solve different problems.
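
The two approaches mentioned above can be compared side by side. This is only a
minimal sketch in Python, not D: it uses Python's built-in codec error handlers
("replace" and "surrogateescape") as a stand-in, and the sample bytes are made
up for illustration. It says nothing about how std.utf behaves.

```python
# Illustrative only: Python's codec error handlers, not D's std.utf.
raw = b"abc\x80def"  # 0x80 is a lone continuation byte, so this is malformed UTF-8

# Approach 1: substitute U+FFFD for the malformed byte (lossy).
replaced = raw.decode("utf-8", errors="replace")

# Approach 2: map the bad byte 0x80 to the low surrogate U+DC80 (reversible).
escaped = raw.decode("utf-8", errors="surrogateescape")

# Re-encoding with the same handler restores the original bytes exactly.
round_tripped = escaped.encode("utf-8", errors="surrogateescape")

print(ascii(replaced))
print(ascii(escaped))
print(round_tripped == raw)
```

The U+FFFD form loses the original byte but yields a string of valid code
points; the U+DC80..U+DCFF form keeps the byte recoverable but leaves a lone
surrogate in the string.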

I think the current problem in D is that the std.utf module is ill-defined: it's
not designed to make developers' lives easier. It just makes developers ignore
the case where a UTF-8 string can be invalid.

--ZY Zhou

Mar 13 2011

spir <denis.spir gmail.com> writes:

On 03/14/2011 07:55 AM, ZY Zhou wrote:
 Thank you Jussi,

 But this is still not part of the standard. U+FFFD is a commonly used approach,
 while mapping to U+DC80..U+DCFF is also a common solution for that
 (http://en.wikipedia.org/wiki/Utf8#Invalid_byte_sequences). Different
 approaches solve different problems.


I am surprised by some of your very assertive statements (throughout the
thread). None of the string processing libs I have come across use the approach
you propose here, which is replacing invalid input with other invalid data
(surrogate values). On the other hand, the replacement character (U+FFFD)
mentioned by Jussi (which I also proposed in a previous post) is a valid
Unicode code point; the same goes for the freely user-available areas.
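
The distinction drawn here (a lone surrogate is not valid data, while U+FFFD
is) can be checked with a strict UTF-8 encoder. The following is a hedged
sketch in Python rather than D; `is_encodable_utf8` is a hypothetical helper
name, and U+E000 is chosen here as one example of a private-use code point.

```python
# Illustrative only: checks which code points a strict UTF-8 encoder accepts.
def is_encodable_utf8(ch: str) -> bool:
    """Return True if a strict UTF-8 encoder accepts this single code point."""
    try:
        ch.encode("utf-8")  # default error handling is strict
        return True
    except UnicodeEncodeError:
        return False

fffd_ok = is_encodable_utf8("\ufffd")       # replacement character
surrogate_ok = is_encodable_utf8("\udc80")  # lone low surrogate
private_ok = is_encodable_utf8("\ue000")    # private-use code point

print(fffd_ok, surrogate_ok, private_ok)
```

Under strict encoding the lone surrogate is rejected, which is why a string
carrying U+DC80..U+DCFF escapes cannot be written out as well-formed UTF-8
without a matching special-case handler.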

 I think the current problem in D is that the std.utf module is ill-defined:
 it's not designed to make developers' lives easier. It just makes developers
 ignore the case where a UTF-8 string can be invalid.


On the contrary, D deals perfectly well with invalid input by signalling it to
you, the programmer. It is not ignored, which would be the worst approach. What
to do with invalid input belongs to your application's logic (as pointed out by
Jonathan); you are demanding that the D standard libs do your job for you,
exactly the way you want it, using an incorrect approach.
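
The division of labour described above (the library signals the error, the
application chooses the policy) might look like the following. Again this is a
rough sketch in Python, not D; `load_text` and its `on_invalid` parameter are
hypothetical names invented for this illustration.

```python
# Illustrative only: strict decoding signals the problem, the caller picks a policy.
def load_text(raw: bytes, on_invalid: str = "reject") -> str:
    """Decode UTF-8 strictly; on failure apply the caller's chosen policy."""
    try:
        return raw.decode("utf-8")  # strict: raises on malformed input
    except UnicodeDecodeError as err:
        if on_invalid == "replace":
            # Application-level decision: accept lossy U+FFFD substitution.
            return raw.decode("utf-8", errors="replace")
        # Default policy: refuse the input and report where it went wrong.
        raise ValueError(f"invalid UTF-8 at byte offset {err.start}") from err

print(ascii(load_text(b"abc\x80def", on_invalid="replace")))
try:
    load_text(b"abc\x80def")
except ValueError as exc:
    print(exc)
```

The decoder itself never guesses; whether to reject, replace, or escape is
decided in one place by the calling code.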

Denis

-- 
_________________
vita es estrany
spir.wikidot.com

Mar 14 2011

Kagamin <spam here.lot> writes:

Jussi Jumppanen Wrote:

 %u Wrote:
 
 I agree with a), but not b). I can't find anything in the Unicode standard
 that says you can use the low surrogates like that.


 According to: http://www.cl.cam.ac.uk/~mgk25/
 
     According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
     receiving UTF-8 shall interpret a "malformed sequence in the same way
     that it interprets a character that is outside the adopted subset" and
     "characters that are not within the adopted subset shall be indicated
     to the user" by a receiving device. A quite commonly used approach in
     UTF-8 decoders is to replace any malformed UTF-8 sequence by a
     replacement character (U+FFFD), which looks a bit like an inverted
     question mark, or a similar symbol. 
 
 Refer to this file for the above quote: 
 
 http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt


Sounds like a text rendering guideline rather than a text processing guideline.

Mar 14 2011
