
digitalmars.D.bugs - Standard omission or compiler bug: Hexadecimal escapes don't encode

reply Burton Radons <burton-radons shaw.ca> writes:
The following string is encoded into UTF-8:

    char [] c = "\u7362";

It encodes into x"E4 98 B7".  The following string, however, does not 
get encoded:

    char [] c = "\x8F";

It remains x"8F".  Thus the \x specifies a literal byte in the character 
stream as implemented.  The specification doesn't mention this twist, if 
it was intentional.
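A minimal way to observe this for yourself, sketched in present-day D syntax (assuming the compiler still passes \x bytes through unchecked, as described here):

    import std.stdio;

    void main()
    {
        // \u names a Unicode codepoint, so the compiler re-encodes it into
        // UTF-8 code units for a char[]/string destination.
        string u = "\u7362";

        // \x drops the literal byte into the string unchanged.
        string x = "\x8F";

        foreach (char c; u) writef("%02X ", cast(ubyte) c);   // the UTF-8 bytes of U+7362
        writeln();
        foreach (char c; x) writef("%02X ", cast(ubyte) c);   // 8F
        writeln();
    }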
Sep 19 2004
next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Burton Radons" <burton-radons shaw.ca> wrote in message
news:cilbgp$1p3a$1 digitaldaemon.com...
 The following string is encoded into UTF-8:

     char [] c = "\u7362";

 It encodes into x"E4 98 B7".  The following string, however, does not
 get encoded:

     char [] c = "\x8F";

 It remains x"8F".  Thus the \x specifies a literal byte in the character
 stream as implemented.  The specification doesn't mention this twist, if
 it was intentional.
I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as unicode and so is converted to UTF-8.
Sep 20 2004
parent reply Stewart Gordon <Stewart_member pathlink.com> writes:
In article <cioggi$k7p$1 digitaldaemon.com>, Walter says...
<snip>
 It remains x"8F".  Thus the \x specifies a literal byte in the character
 stream as implemented.  The specification doesn't mention this twist, if
 it was intentional.
I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as unicode and so is converted to UTF-8.
Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints. Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char. Moreover, it follows the "looks like C, acts like C" principle. Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter. Stewart.
Sep 21 2004
next sibling parent Stewart Gordon <Stewart_member pathlink.com> writes:
In article <ciout3$skf$1 digitaldaemon.com>, Stewart Gordon says and then some program or another makes a mess of...
<snip>
Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints.  Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char.  Moreover, it follows the "looks like C, acts like C" principle.

Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter.
Just what is wrong with this web newsgroup interface?  I should've carried on using my quote tidier.  If anyone else is having the same troubles, you're pointed here....

http://smjg.port5.com/faqs/usenet/quotetidy.html

Hopefully my regular posting environment will soon have a working power supply once again....

Stewart.
Sep 21 2004
prev sibling next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 21 Sep 2004 10:13:55 +0000 (UTC), Stewart Gordon 
<Stewart_member pathlink.com> wrote:

 In article <cioggi$k7p$1 digitaldaemon.com>, Walter says...
 <snip>
 It remains x"8F".  Thus the \x specifies a literal byte in the 
 character
 stream as implemented.  The specification doesn't mention this twist, 
 if
 it was intentional.
I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as unicode and so is converted to UTF-8.
Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints. Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char. Moreover, it follows the "looks like C, acts like C" principle.
I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence? Does the compiler/program catch this invalid sequence? I believe it should.
 Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter.

 Stewart.
-- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Sep 21 2004
next sibling parent "Walter" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opseo5g1si5a2sq9 digitalmars.com...
 I agree.. however doesn't this make it possible to create an invalid UTF-8
 sequence?
Yes.
 Does the compiler/program catch this invalid sequence?
 I believe it should.
Only if the string is interpreted as a wchar[] or dchar[].
Sep 21 2004
prev sibling next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opseo5g1si5a2sq9 digitalmars.com>, Regan Heath says...

I agree.. however doesn't this make it possible to create an invalid UTF-8 
sequence?
Yup. If you use \x in a char array you are doing /low level stuff/. You are doing encoding-by-hand - and it's up to you to get it right.
Does the compiler/program catch this invalid sequence?
I believe it should.
I disagree. If you're using \x then you're working at the byte level. You might be doing some system-programming-type stuff where you actually /want/ to break the rules. The compiler will catch it if and when you pass it to a toUTF function, and that's good enough for me. People simply need to understand the difference between \u and \x. Arcane Jill
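A minimal sketch of what "encoding-by-hand" means here, in present-day D syntax:

    void main()
    {
        // Encoding by hand with \x: the programmer supplies the UTF-8 code
        // units for U+00E9 and is responsible for getting them right.
        string byHand = "\xC3\xA9";

        // The same character via \u: the compiler does the encoding.
        string byEscape = "\u00E9";

        assert(byHand == byEscape);
        assert(byHand.length == 2);   // two code units for one character
    }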
Sep 22 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Wed, 22 Sep 2004 07:21:27 +0000 (UTC), Arcane Jill 
<Arcane_member pathlink.com> wrote:
 In article <opseo5g1si5a2sq9 digitalmars.com>, Regan Heath says...

 I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence?
Yup. If you use \x in a char array you are doing /low level stuff/. You are doing encoding-by-hand - and it's up to you to get it right.
I agree.
 Does the compiler/program catch this invalid sequence?
 I believe it should.
I disagree. If you're using \x then you're working at the byte level. You might be doing some system-programming-type stuff where you actually /want/ to break the rules.
I disagree.  char is 'defined' as being UTF encoded; IMO it should never not be.  If you want to 'break the rules' you can/should use ubyte[]; then you're not breaking any rules.
 The compiler will catch it if and when you pass it to a toUTF function, 
 and that's good enough for me.
Probably fair enough.. however, I think it would be more robust if it was made impossible to have an invalid utf8/16/32 sequence. That may be an impossible dream..
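A sketch of the "caught when you decode it" behaviour, using the std.utf names from later Phobos (which postdate this thread) and assuming the compiler still passes \x bytes through unchecked:

    import std.utf : toUTF32, UTFException;

    void main()
    {
        // The lone byte 0x8F is not valid UTF-8; nothing complains until
        // something actually tries to decode the array.
        char[] s = "\x8F".dup;

        bool caught;
        try
            cast(void) toUTF32(s);    // the decode is where it blows up
        catch (UTFException e)
            caught = true;            // invalid sequence detected at run time
        assert(caught);

        // Bytes that carry no UTF-8 claim are better off in ubyte[]:
        ubyte[] raw = [0x8F];
    }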
 People simply need to understand the difference between \u and \x.
But of course. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Sep 22 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <opseq16svz5a2sq9 digitalmars.com>, Regan Heath says...


I disagree. char is 'defined' as being UTF encoded, IMO it should never 
not be.
If you want to 'break the rules' you can/should use ubyte[], then, you're 
not breaking any rules.
Okay, you've convinced me. other integer and integer array literals. But that would be a real headache for Walter, since D is supposed to have a context-free grammar. It's not clear to me how the compiler could parse the difference. Arcane Jill
Sep 23 2004
prev sibling parent reply Stewart Gordon <Stewart_member pathlink.com> writes:
In article <opseq16svz5a2sq9 digitalmars.com>, Regan Heath says...
<snip>
 I disagree.  char is 'defined' as being UTF encoded, IMO it should 
 never not be.  If you want to 'break the rules' you can/should use 
 ubyte[], then, you're not breaking any rules.
Do char[] and ubyte[] implicitly convert between each other? If not, it could make code that interfaces a foreign API somewhat cluttered with casts. And besides this, which is more self-documenting for the purpose? ubyte[] obscures the fact that it's a string, rather than any old block of 8-bit numbers. char[] denotes a string, but is it any more misleading? People coming from a C(++) background are likely to see it and think 'string' rather than 'UTF-8'. (Does anyone actually come from a D background yet?)
 The compiler will catch it if and when you pass it to a toUTF 
 function, and that's good enough for me.
Probably fair enough.. however, I think it would be more robust if it was made impossible to have an invalid utf8/16/32 sequence. That may be an impossible dream..
<snip>

That would mean that a single char value would be restricted to the ASCII set, wouldn't it?

Stewart.
Sep 23 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ciu7q5$s35$1 digitaldaemon.com>, Stewart Gordon says...

Do char[] and ubyte[] implicitly convert between each other? 
Of course not. They can't even /ex/plicitly convert. How could they? You'd be converting from UTF-8 to ... what exactly? But I suspect you meant implicitly /cast/. In which case, no, they don't do that either.
If not, 
it could make code that interfaces a foreign API somewhat cluttered 
with casts.
Not really, since foreign API functions should be /expecting/ C-strings, that is, pointers to arrays of bytes (not chars), terminated with the byte value \0.  So, for example, strcat() should be declared in D with a byte-level signature, and not with a char-level one - see the sketch below.
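The contrast presumably looks something like the following sketch - my reconstruction of the shape, not Jill's original code:

    // A byte-level signature, matching what C actually promises
    // (C char == D byte/ubyte):
    extern (C) ubyte* strcat(ubyte* dst, ubyte* src);

    // ...rather than a char-level one, which would imply a UTF-8 contract
    // that C never made:
    //
    //     extern (C) char* strcat(char* dst, char* src);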
And besides this, which is more self-documenting for the purpose?
Well, this of course is the big area of disagreement.  We all want code to be easily maintainable.  That means, more readable; more self-documenting.  Readable code is a good thing.

The problem is that some of us (Regan and I, for example) look at a declaration of char[] and see "A string of Unicode characters encoded in UTF-8".  It is eminently self-documenting, by the very definition of char[].  We also look at a declaration of byte[] and see "An array of bytes whose interpretation depends on what you do with them".

Others (yourself included) apparently see things differently.  You look at a declaration of char[] and see "A string of not-necessarily-Unicode characters encoded in some unspecified way", and see byte[] as "An array of bytes whose interpretation is anything /other/ than a sequence of characters".

It is not really possible for code to be simultaneously self-documenting in both paradigms - but you might like to consider the fact that in C and C++, an array of C chars must be interpreted as "An array of bytes whose interpretation depends on what you do with them" - because C/C++ don't actually /have/ a character type, merely an overused byte type.  As soon as you start to think:

    D        Java             C/C++
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    byte     byte             signed char
    ubyte    no equivalent    unsigned char
    char     no equivalent    no equivalent
    wchar    char             wchar_t

and /stop/ imagining that D's char == C's char (which it clearly doesn't) then everything makes sense.
ubyte[] obscures the fact that it's a string, rather than any old 
block of 8-bit numbers.
And how would you make such a distinction in C?
char[] denotes a string, but is it any more 
misleading?  People coming from a C(++) background are likely to see 
it and think 'string' rather than 'UTF-8'.  (Does anyone actually 
come from a D background yet?)
Maybe you're answering your own question there. Stop thinking in C. This is D. Think in D. Even if nobody comes from a D background yet - let's just assume that one day, they will. It has been suggested over on the main forum that D's types be renamed. If no type called "char" existed in D; if instead, you had to choose between the types "utf8", "uint8" and "int8", it would be obvious which one you'd go for.
That would mean that a single char value would be restricted to the 
ASCII set , wouldn't it?
You're not thinking in Unicode.  A D char stores a "code unit" (a UTF-8 fragment), not a character codepoint.  UTF-8 code-units coincide with character codepoints /only/ in the ASCII range.  A single char value, however, can store any valid UTF-8 fragment.  You would be wrong, however, to interpret this as a character.  For example, see the sketch below.

Arcane Jill
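A minimal sketch of the point in present-day D:

    void main()
    {
        // U+00E9 needs two UTF-8 code units.  Each char holds one fragment;
        // neither fragment is, by itself, a character.
        string s = "\u00E9";
        assert(s.length == 2);
        assert(s[0] == 0xC3 && s[1] == 0xA9);   // code units, not codepoints
    }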
Sep 23 2004
parent reply Stewart Gordon <Stewart_member pathlink.com> writes:
In article <ciudb4$11ba$1 digitaldaemon.com>, Arcane Jill says...

 In article <ciu7q5$s35$1 digitaldaemon.com>, Stewart Gordon says...
 
 Do char[] and ubyte[] implicitly convert between each other?
Of course not. They can't even /ex/plicitly convert. How could they? You'd be converting from UTF-8 to ... what exactly?
I wouldn't. I'd be converting from bytes interpreted as chars to bytes interpreted as bytes of arbitrary semantics. <snip>
 Not really, since foreign API functions should be /expecting/ 
 C-strings, that is, pointers to arrays of bytes (not chars), 
 terminated with the byte value \0.
Even if they're written in/for Pascal or Fortran?
 So, for example, strcat() should be declared in D as:
 

 
 and not as:
 

Then how would I write the C call strcat(qwert, "yuiop"); in D? <snip>
 Others (yourself included) apparently see things differently.  You 
 look at a declaration of char[] and see "A string of 
 not-necessarily-Unicode characters encoded in some unspecified 
 way", and see byte[] as "An array of bytes whose interpretation is 
 anything /other/ than a sequence of characters".
Did I say that? I didn't mean to indicate that byte[] necessarily isn't an array of characters. Merely that I don't see people as seeing it and thinking 'string'. <snip>
 ubyte[] obscures the fact that it's a string, rather than any old 
 block of 8-bit numbers.
And how would you make such a distinction in C?
With a typedef. <snip>
 Maybe you're answering your own question there.  Stop thinking in 
 C.  This is D.  Think in D.
I do on the whole. But trying to think in Windows API at the same time isn't easy. It'll probably be easier once the D Windows headers are finished.
 Even if nobody comes from a D background yet - let's just assume 
 that one day, they will.
 
 It has been suggested over on the main forum that D's types be 
 renamed.  If no type called "char" existed in D; if instead, you 
 had to choose between the types "utf8", "uint8" and "int8", it 
 would be obvious which one you'd go for.
Then if only such types existed as "ansi", "windows1252", "windows1253", "ibm", "iso8859_5", "macdevanagari" then the list would be complete. <snip>
 You're not thinking in Unicode.  A D char stores a "code unit" (a 
 UTF-8 fragment), not a character codepoint.  UTF-8 code-units 
 coincide with character codepoints /only/ in the ASCII range.  A 
 single char value, however, can store any valid UTF-8 fragment.  
 You would be wrong, however, to interpret this as a character.
<snip> That makes sense.... Stewart.
Sep 23 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Thu, 23 Sep 2004 12:51:16 +0000 (UTC), Stewart Gordon 
<Stewart_member pathlink.com> wrote:

<snip>

 Even if nobody comes from a D background yet - let's just assume
 that one day, they will.

 It has been suggested over on the main forum that D's types be
 renamed.  If no type called "char" existed in D; if instead, you
 had to choose between the types "utf8", "uint8" and "int8", it
 would be obvious which one you'd go for.
Then if only such types existed as "ansi", "windows1252", "windows1253", "ibm", "iso8859_5", "macdevanagari" then the list would be complete.
It appears to me that Walter has decided on having only 3 types with a specified encoding, and all other encodings will be handled by using ubyte[]/byte[] and conversion functions.

I think this is the right choice.  I see Unicode as the future and other encodings as legacy encodings, whose use I hope gradually disappears.  Of course, if there is a valid reason for a certain encoding to remain, for speed/space/other reasons, and D wanted the same sort of built-in support as we have for utf8/16/32, then a new type might emerge.

<snip> Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
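A sketch of that division of labour; the conversion routine's name and coverage here are hypothetical (only ASCII plus the 0x80 euro sign of Windows-1252 is handled), since the thread leaves the real conversion functions unspecified:

    // Hypothetical converter: legacy bytes in, UTF-8 out.
    char[] windows1252ToUtf8(const(ubyte)[] legacy)
    {
        char[] result;
        foreach (b; legacy)
        {
            if (b < 0x80)
                result ~= cast(char) b;   // ASCII maps straight through
            else if (b == 0x80)
                result ~= "\u20AC";       // 0x80 is the euro sign in 1252
            else
                assert(false, "full table not sketched here");
        }
        return result;
    }

    void main()
    {
        ubyte[] legacy = [0x80, 0x31, 0x30];   // "€10" in Windows-1252
        assert(windows1252ToUtf8(legacy) == "\u20AC10");
    }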
Sep 23 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <opses3k8mv5a2sq9 digitalmars.com>, Regan Heath says...

It appears to me that Walter has decided on having only 3 types with a 
specified encoding, and all other encodings will be handled by using 
ubyte[]/byte[] and conversion functions.
Completely in agreement with you there. However, Stewart did actually ask a question which I couldn't answer, and which we shouldn't ignore. Maybe you have some ideas. ..anyway... I'm moving my reply to the main forum. I think it's more appropriate there. Arcane Jill
Sep 24 2004
prev sibling parent reply Stewart Gordon <Stewart_member pathlink.com> writes:
In article <opseo5g1si5a2sq9 digitalmars.com>, Regan Heath says...
<snip>
 I agree..  however doesn't this make it possible to create an 
 invalid UTF-8 sequence?  Does the compiler/program catch this 
 invalid sequence?  I believe it should.
I firmly don't believe in any attempts to force a specific character encoding on every char[] ever created. As said before, it should remain possible for char[] literals to contain character codes that aren't UTF-8, for such purposes as interfacing OS APIs. The ability to use arbitrary \x codes provides this neatly. I imagine few people would use it to insert UTF-8 characters in practice - if they want the checking, they can either type the character directly or use the \u code, which is much simpler than manually converting it to UTF-8. Stewart.
Sep 22 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cirlav$2a9s$1 digitaldaemon.com>, Stewart Gordon says...
I firmly don't believe in any attempts to force a specific character 
encoding on every char[] ever created.
I do, since it's documented that way.
As said before, it should 
remain possible for char[] literals to contain character codes that 
aren't UTF-8, for such purposes as interfacing OS APIs.
I agree that it should remain possible - but I disagree with the reason.  Non-UTF encodings are more properly stored as ubyte[] arrays in D.  Remember, C and C++ simply don't /have/ a type equivalent to D's char, so functions written in C or C++ were /never/ intended to receive such a type.  C's char == D's byte or ubyte.

The possible reasons why one might want to store arbitrary byte values in chars include scary hand-encoding of UTF-8 and possibly some esoteric custom extensions (for example, imagine you invent some backwardly compatible UTF-8-PLUS).  Such uses are, however, rare.  About as rare as, say, needing to write a custom allocator because "new" isn't good enough.  It should always be possible, but never commonplace.
The ability to use arbitrary \x codes provides this neatly.  I 
imagine few people would use it to insert UTF-8 characters in 
practice - if they want the checking, they can either type the 
character directly or use the \u code, which is much simpler than 
manually converting it to UTF-8.
Of course this makes perfect logical sense - /if/ you're talking about a ubyte[] array, not a char[] array. Jill
Sep 22 2004
prev sibling parent Regan Heath <regan netwin.co.nz> writes:
On Wed, 22 Sep 2004 10:49:03 +0000 (UTC), Stewart Gordon 
<Stewart_member pathlink.com> wrote:
 In article <opseo5g1si5a2sq9 digitalmars.com>, Regan Heath says...
 <snip>
 I agree..  however doesn't this make it possible to create an
 invalid UTF-8 sequence?  Does the compiler/program catch this
 invalid sequence?  I believe it should.
I firmly don't believe in any attempts to force a specific character encoding on every char[] ever created.
But it's 'defined' as having that encoding; if you don't want it, don't use char[] - use byte[] instead.
 As said before, it should
 remain possible for char[] literals to contain character codes that
 aren't UTF-8, for such purposes as interfacing OS APIs.
A C/C++ char* points at 8-bit values (typically signed) with no specified encoding.  D's byte[] matches that perfectly.  Maybe byte[] should be implicitly convertible to char* (if it's not already).
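For comparison, present-day D makes that conversion explicit at the C boundary - a sketch:

    import core.stdc.string : strlen;

    void main()
    {
        // Text in some legacy encoding, kept as raw bytes rather than char[]:
        ubyte[] legacy = cast(ubyte[]) "hello\0".dup;

        // Today this needs an explicit cast; the suggestion above is that
        // such a conversion might reasonably be implicit.
        auto n = strlen(cast(const(char)*) legacy.ptr);
        assert(n == 5);
    }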
 The ability to use arbitrary \x codes provides this neatly.  I
 imagine few people would use it to insert UTF-8 characters in
 practice - if they want the checking, they can either type the
 character directly or use the \u code, which is much simpler than
 manually converting it to UTF-8.
Sure, really I'm playing devil's advocate..  I question the logic of 'defining' char to be UTF-8 if you're not going to enforce it. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Sep 22 2004
prev sibling parent reply Burton Radons <burton-radons shaw.ca> writes:
Stewart Gordon wrote:

 In article <cioggi$k7p$1 digitaldaemon.com>, Walter says...
 <snip>
 
It remains x"8F".  Thus the \x specifies a literal byte in the character
stream as implemented.  The specification doesn't mention this twist, if
it was intentional.
I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as unicode and so is converted to UTF-8.
Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints. Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char. Moreover, it follows the "looks like C, acts like C" principle.
I don't think this will work; it requires specifying what encoding the compiler worked with internally.

For example, DMD works in UTF-8 internally.  Therefore the first string is okay but the second is not because the UTF-8 is broken:

    char [] foo = "\x8F";
    wchar [] bar = "\x8F";

But if a compiler uses UTF-16 or UTF-32 internally, then it won't detect any problem with either of those strings.  So a literal string must be valid for arbitrary conversion between any encoding (that can only be interpreted as "\x specifies a UNICODE character"), OR there must be a mandate for what encoding the compiler uses internally.  I think the former is less odious; as soon as you start depending upon features of an encoding, you get into trouble.
Sep 25 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cj4c5p$1r6s$1 digitaldaemon.com>, Burton Radons says...

I don't think this will work; it requires specifying what encoding the 
compiler worked with internally.

For example, DMD works in UTF-8 internally.
Walter assures us that the D language itself is not prejudiced toward UTF-8; that UTF-16 and UTF-32 have equal status. I can think of one or two examples which seem to contradict this, but they are likely to disappear once D gives us implicit conversions between the UTFs.
Therefore the first string 
is okay but the second is not because the UTF-8 is broken:

     char [] foo = "\x8F";
     wchar [] bar = "\x8F";
I presume you meant that the other way round: the /first/ string is broken; the /second/ string is okay. But that's okay. Anyone using \x in a char[], wchar[] or dchar[] is expected to know what they're doing, otherwise they should be using \u.
But if a compiler uses UTF-16 or UTF-32 internally, then it won't detect 
any problem with either of those strings.
There isn't /necessarily/ anything wrong with either of those strings. For example: The requirement for using \x is merely that the programmer knows their UTF.
So a literal string must be valid for arbitrary conversion between any 
encoding (that can only be interpreted as "\x specifies a UNICODE 
character"),
No, the requirement is that programmers /must not/ use \x within a string unless they understand exactly how it will be interpreted.  For most normal purposes, stick to this golden rule:

*) For char[], wchar[] or dchar[] - use \u
*) For all other arrays - use \x
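The golden rule, sketched in present-day D:

    void main()
    {
        // Character arrays: use \u and let the compiler do the encoding.
        string text = "\u00E9";                            // stored as UTF-8: C3 A9

        // Byte-oriented data: use \x and take the bytes exactly as written.
        auto bytes = cast(immutable(ubyte)[]) "\xC3\xA9";

        assert(cast(immutable(ubyte)[]) text == bytes);
    }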
OR there must be a mandate for what encoding the compiler 
uses internally.
I don't see a need for that.
I think the former is less odious; as soon as you 
start depending upon features of an encoding, you get into trouble.
Right.  Which is why \x in strings should be considered "experts only".  But I would hesitate to call that a "bug".

It would be /possible/ for D's lexer to distinguish character string constants from byte string constants in some cases.  I don't know if that would be a good idea.  What I mean is:

This would catch a lot of such bugs at compile time.  Maybe Walter could be persuaded to go for this, I don't know.  But \x bugs are bugs in user code, not in the compiler.

Arcane Jill
Sep 27 2004
next sibling parent Stewart Gordon <Stewart_member pathlink.com> writes:
In article <cj8fhp$1h9u$1 digitaldaemon.com>, Arcane Jill says...
<snip>

 UTF-8

 UTF-16 for U+008F
<snip> No, "\x8F" _means_ the byte with value 0x8F, meant to be interpreted as UTF-8. Somewhere in the docs there's an example or two of a wchar[] or dchar[] being initialised with UTF-8 in this way. Stewart.
Sep 29 2004
prev sibling parent reply Burton Radons <burton-radons smocky.com> writes:
Arcane Jill wrote:

 In article <cj4c5p$1r6s$1 digitaldaemon.com>, Burton Radons says...
 
 
I don't think this will work; it requires specifying what encoding the 
compiler worked with internally.

For example, DMD works in UTF-8 internally.
Walter assures us that the D language itself is not prejudiced toward UTF-8; that UTF-16 and UTF-32 have equal status. I can think of one or two examples which seem to contradict this, but they are likely to disappear once D gives us implicit conversions between the UTFs.
I don't understand what political interpretation you gave my statement, but it was only to introduce an object example.
Therefore the first string 
is okay but the second is not because the UTF-8 is broken:

    char [] foo = "\x8F";
    wchar [] bar = "\x8F";
I presume you meant that the other way round: the /first/ string is broken; the /second/ string is okay.
If the compiler uses UTF-8 internally, the first string compiles correctly as a string with length one while the second string does not, because the compiler tries to re-encode it as UTF-16 during semantic processing and fails.  I am describing DMD's current behaviour, mind.

If the compiler uses UTF-16 or UTF-32 internally (where it would convert the source file into its native encoding during BOM processing), then both strings compile.  The first string has length two; the second string has length one.

[snip]
But if a compiler uses UTF-16 or UTF-32 internally, then it won't detect 
any problem with either of those strings.
There isn't /necessarily/ anything wrong with either of those strings. For example:
If the compiler is using UTF-8 internally, there is no possible way to re-encode the second string as UTF-16 while remaining consistent with compilers that use different encodings.  To use your example encoding, if the compiler uses UTF-8 internally, then this code:

    wchar [] s = "\xC4\x8F";

Would result in a single-code string (you understand that D grammar is contextless and that string escapes are interpreted during tokenisation, right?).  However, if the compiler uses UTF-16 internally, it would result in a two-code string.

This does show a third option, however: change string escapes so that they must not be interpreted until after semantic processing, where they can be interpreted directly as their destination encoding.  But that only serves to illustrate how unnatural this behaviour is; which may be why no Unicode-supporting language that I can find that handles \x interprets it as anything but a character.

[snip]
Sep 29 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjg3qq$1ui6$1 digitaldaemon.com>, Burton Radons says...

Therefore the first string 
is okay but the second is not because the UTF-8 is broken:

    char [] foo = "\x8F";
    wchar [] bar = "\x8F";
I presume you meant that the other way round: the /first/ string is broken; the /second/ string is okay.
If the compiler uses UTF-8 internally, the first string compiles correctly as string with length one while the second string does not, because the compiler tries to re-encode it as UTF-16 during semantic processing and fails. I am describing DMD's current behaviour, mind.
Yes. I was wrong. The first example compiles okay, but results in foo containing an invalid UTF-8 sequence. The second example does not compile. (I assumed that it would, without testing the hypothesis. That'll teach me).
To use your example encoding,
if the compiler uses UTF-8 internally, then this code:

    wchar [] s = "\xC4\x8F";
Again, I was wrong. I assumed (without testing) that this would compile to a two-wchar string constant, with s[0] containing U+00C4 and s[1] containing U+008F. In actual fact, what this code yields is a one-wchar string constant, with s[0] containing U+010F. I would call that a bug. [ 0xC4, 0x8F ] is UTF-8 (not UTF-16) for U+010F. But s is a wchar string, so it's supposed to be UTF-16.
(you understand that D grammar is
contextless and that string escapes are interpreted during tokenisation,
right?).  However, if the compiler uses UTF-16 internally, it would
result in a two-code string.
I think you're right.  That's what's happening.  The compiler is interpreting all string constants as though they were UTF-8, regardless of the type of the destination.
This does show a third option, however: change string escapes so that
they must not be interpreted until after semantic processing where they
can be interpreted directly as their destination encoding.
That's the way I assumed it would be done.
But that
only serves to illustrate how unnatural this behaviour is; which may be 
why that no Unicode-supporting language that I can find that handles \x 
interprets it as anything but a character.
Actually, C and C++ interpret \x as a LOCAL ENCODING character, and \u as a UNICODE character.  Thus, in C++, if your local encoding were Windows-1252, then the following two statements would have identical effect:

Both of these will leave the string s containing a single (byte-wide) char, with value 0x80.  (Plus the null-terminator, of course).  Compare this with

which /should/ fail to compile on a Windows-1252 machine.  So you /are/ right, but nonetheless there is a difference between \x and \u.

And this presents a problem for D, because D aims to be portable between encodings.  In D, therefore, \x SHOULD NOT be interpreted according to the local encoding, because this would immediately make code non-portable.  One way around this would be to assert that \x should mean exactly the same thing as \u and \U (that is, to specify a Unicode character).  Now, that would be fine for those of us used to Latin-1, but Cyrillic users (for example) would be left out in the cold.

Currently, I have come to the conclusion that \x should be deprecated.  The escapes \u and \U explicitly specify a character set (i.e. Unicode), and that is what you need for portability.  \x just has too many problems.

Arcane Jill
Sep 30 2004
prev sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cilbgp$1p3a$1 digitaldaemon.com>, Burton Radons says...
The following string is encoded into UTF-8:

    char [] c = "\u7362";

It encodes into x"E4 98 B7".  The following string, however, does not 
get encoded:

    char [] c = "\x8F";

It remains x"8F".  Thus the \x specifies a literal byte in the character 
stream as implemented.  The specification doesn't mention this twist, if 
it was intentional.
This is correct behavior.  You should be using \u for Unicode characters.  \x is for literal bytes.  \u is supposed to understand the encoding.  \x is not.

In D, the source code encoding must always be a UTF, but in other computer languages, this is not so.  Imagine a C++ program in which the source code encoding were WINDOWS-1252.  In such a case, the following two lines would be equivalent:

In both cases, a single byte [0x80] will be placed in the string s.  And now, here's the same thing in a C++ program in which the source code encoding is WINDOWS-1251:

In both cases, a single byte [0x88] will be placed in the string s.

Now, since D does not allow non-UTF source code encodings, the distinction may appear blurred, but it's still there.  Just remember:

\x => insert this literal byte
\u => insert this Unicode character, encoded in the appropriate encoding.

Arcane Jill
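A closing sketch of that rule in present-day D - \u adapts to the destination's encoding, \x never does (assuming, as throughout this thread, that \x bytes go in unchecked):

    void main()
    {
        // \u names a character; the compiler encodes it to fit the destination.
        string  s8  = "\u20AC";      // UTF-8:  E2 82 AC  (three code units)
        wstring s16 = "\u20AC"w;     // UTF-16: 20AC      (one code unit)
        dstring s32 = "\u20AC"d;     // UTF-32: 000020AC  (one code unit)
        assert(s8.length == 3 && s16.length == 1 && s32.length == 1);

        // \x names a byte; it goes in exactly as written, whatever that does
        // to the surrounding UTF-8.
        string raw = "\x80";
        assert(raw.length == 1 && raw[0] == 0x80);
    }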
Sep 22 2004