digitalmars.D.learn - Can std.json handle Unicode?

Jacob Carlborg <doob me.com> writes:
I'm wondering because I see that std.json uses isControl, isDigit and 
isHexDigit from std.ascii and not std.uni. This also causes a problem 
with a pull request I recently made for std.net.isemail. In one of its 
unit tests the DEL character (127) is used. According to 
std.ascii.isControl this is a control character, but not according to 
std.uni.isControl. This will cause the test suite for the pull request 
not to be run since std.json chokes on the DEL character.

https://github.com/D-Programming-Language/phobos/pull/1217

-- 
/Jacob Carlborg
Mar 23 2013
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, March 23, 2013 13:22:42 Jacob Carlborg wrote:
> I'm wondering because I see that std.json uses isControl, isDigit and
> isHexDigit from std.ascii and not std.uni. This also causes a problem
> with a pull request I recently made for std.net.isemail. In one of its
> unit tests the DEL character (127) is used. According to
> std.ascii.isControl this is a control character, but not according to
> std.uni.isControl. This will cause the test suite for the pull request
> not to be run since std.json chokes on the DEL character.
>
> https://github.com/D-Programming-Language/phobos/pull/1217

Curious. According to this page
( http://www.aivosto.com/vbtips/control-characters.html ), both space and
delete are ASCII control characters (though neither std.ascii nor C's iscntrl
deems space to be a control character), but neither of them is a control
character according to recent Unicode standards. This section on DEL

http://www.aivosto.com/vbtips/control-characters.html#DEL

seems to say that DEL should basically be ignored. It seems to think that NUL
should be treated the same way (and basically complains that languages like C
ever treated it as a terminator).

If I look at the RFC for JSON ( http://www.rfc-editor.org/rfc/rfc4627.txt ),
it specifically lists control characters as being U+0000 through U+001F, which
does _not_ include DEL or _any_ Unicode-specific control character. So, using
either std.ascii's or std.uni's isControl would be wrong. It specifically needs
to check whether a character is < 32 when checking for control characters.

And the grammar rule for string is

          string = quotation-mark *char quotation-mark

          char = unescaped /
                 escape (
                     %x22 /          ; "    quotation mark  U+0022
                     %x5C /          ; \    reverse solidus U+005C
                     %x2F /          ; /    solidus         U+002F
                     %x62 /          ; b    backspace       U+0008
                     %x66 /          ; f    form feed       U+000C
                     %x6E /          ; n    line feed       U+000A
                     %x72 /          ; r    carriage return U+000D
                     %x74 /          ; t    tab             U+0009
                     %x75 4HEXDIG )  ; uXXXX                U+XXXX

          escape = %x5C              ; \

          quotation-mark = %x22      ; "

          unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

So, it looks like the only characters that should be considered valid inside
the double-quotes of a string are \ (which indicates the beginning of an
escape sequence) and the characters listed in unescaped. So, in decimal, the
unescaped characters would be 32 and 33, 35 - 91, and everything 93 and
greater (up to 0x10FFFF). DEL is 127, so it should be considered valid.

So, if std.json is using isControl, my guess is that whoever wrote that was
not careful enough with the grammar (though it's easy enough to assume that
everyone means the same thing by control characters), and I'd be concerned
that std.json is not handling this set of grammar rules correctly with more
characters than just DEL.

- Jonathan M Davis
Mar 23 2013
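
The range check described in the post above can be sketched in a few lines. This is a minimal, language-neutral illustration written in Python rather than D; the function names `is_json_control` and `is_unescaped` are hypothetical and are not taken from std.json. It only encodes the `unescaped` rule and the U+0000 through U+001F control range from RFC 4627 as quoted in the post.

```python
def is_json_control(cp: int) -> bool:
    """RFC 4627 control characters: U+0000 through U+001F only."""
    return 0 <= cp < 0x20


def is_unescaped(cp: int) -> bool:
    """True if the code point matches the RFC 4627 `unescaped` rule:
    %x20-21 / %x23-5B / %x5D-10FFFF."""
    return (0x20 <= cp <= 0x21) or (0x23 <= cp <= 0x5B) or (0x5D <= cp <= 0x10FFFF)


DEL = 0x7F  # the character from the std.net.isemail unit test

print(is_json_control(DEL))  # False: DEL is outside U+0000..U+001F
print(is_unescaped(DEL))     # True: DEL may appear unescaped in a string
print(is_unescaped(0x22))    # False: quotation mark must be escaped
print(is_unescaped(0x5C))    # False: reverse solidus starts an escape
print(is_unescaped(0x1F))    # False: control characters must be escaped
```

Under this reading of the grammar, a parser would reject an unescaped character only when `is_unescaped` is false, so DEL would pass through unchanged.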
Jacob Carlborg <doob me.com> writes:
On 2013-03-23 21:08, Jonathan M Davis wrote:

> So, if std.json is using isControl, my guess is that whoever wrote that was
> not careful enough with the grammar (though it's easy enough to assume that
> everyone means the same thing by control characters), and I'd be concerned
> that std.json is not handling this set of grammar rules correctly with more
> characters than just DEL.

I see. Yes, one could think that "control character" would mean the same thing
in every situation for a given encoding.

-- 
/Jacob Carlborg
Mar 24 2013
Sean Kelly <sean invisibleduck.org> writes:
On Mar 23, 2013, at 5:22 AM, Jacob Carlborg <doob me.com> wrote:

> I'm wondering because I see that std.json uses isControl, isDigit and
> isHexDigit from std.ascii and not std.uni. This also causes a problem
> with a pull request I recently made for std.net.isemail. In one of its
> unit tests the DEL character (127) is used. According to
> std.ascii.isControl this is a control character, but not according to
> std.uni.isControl. This will cause the test suite for the pull request
> not to be run since std.json chokes on the DEL character.

I don't know about control characters, but std.json doesn't handle UTF-16
surrogate pairs in strings, which are legal JSON. I *think* it does properly
handle the 32 bit code points though.
Mar 25 2013
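
For readers unfamiliar with the surrogate-pair case mentioned above: JSON can write a code point outside the Basic Multilingual Plane as two consecutive `\uXXXX` escapes, and a parser is expected to combine them into one code point. The sketch below, in Python rather than D, shows the standard UTF-16 arithmetic. The helper name `combine_surrogates` is hypothetical and is not part of std.json; the call to Python's own `json.loads` is included only as a point of comparison.

```python
import json


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The JSON text "\ud83d\ude00" encodes U+1F600 as a surrogate pair.
code_point = combine_surrogates(0xD83D, 0xDE00)
print(hex(code_point))  # 0x1f600

# Python's json module performs the same combination when decoding.
decoded = json.loads('"\\ud83d\\ude00"')
print(len(decoded), hex(ord(decoded)))  # 1 0x1f600
```

A parser that treats each `\uXXXX` escape independently would instead produce two lone surrogates, which is the failure mode the post describes.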