
digitalmars.D.bugs - Standard omission or compiler bug: Hexadecimal escapes don't encode

reply Burton Radons <burton-radons shaw.ca> writes:
The following string is encoded into UTF-8:

    char [] c = "\u7362";

It encodes into x"E4 98 B7".  The following string, however, does not 
get encoded:

    char [] c = "\x8F";

It remains x"8F".  Thus the \x specifies a literal byte in the character 
stream as implemented.  The specification doesn't mention this twist, if 
it was intentional.
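A minimal way to observe this for yourself, sketched in present-day D syntax (assuming the compiler still passes \x bytes through unchecked, as described here):

    import std.stdio;

    void main()
    {
        // \u names a Unicode codepoint, so the compiler re-encodes it into
        // UTF-8 code units for a char[]/string destination.
        string u = "\u7362";

        // \x drops the literal byte into the string unchanged.
        string x = "\x8F";

        foreach (char c; u) writef("%02X ", cast(ubyte) c);   // the UTF-8 bytes of U+7362
        writeln();
        foreach (char c; x) writef("%02X ", cast(ubyte) c);   // 8F
        writeln();
    }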
Sep 19 2004
next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Burton Radons" <burton-radons shaw.ca> wrote in message
news:cilbgp$1p3a$1 digitaldaemon.com...
 The following string is encoded into UTF-8:

     char [] c = "\u7362";

 It encodes into x"E4 98 B7".  The following string, however, does not
 get encoded:

     char [] c = "\x8F";

 It remains x"8F".  Thus the \x specifies a literal byte in the character
 stream as implemented.  The specification doesn't mention this twist, if
 it was intentional.
I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as unicode and so is converted to UTF-8.
Sep 20 2004
parent reply Stewart Gordon <Stewart_member pathlink.com> writes:
In article <cioggi$k7p$1 digitaldaemon.com>, Walter says...
<snip>
 It remains x"8F".  Thus the \x specifies a literal byte in the character
 stream as implemented.  The specification doesn't mention this twist, if
 it was intentional.
I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as unicode and so is converted to UTF-8.
Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints. Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char. Moreover, it follows the "looks like C, acts like C" principle. Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter. Stewart.
Sep 21 2004
next sibling parent Stewart Gordon <Stewart_member pathlink.com> writes:
In article <ciout3$skf$1 digitaldaemon.com>, Stewart Gordon says and then some program or another makes a mess of...
<snip>
Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints.  Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char.  Moreover, it follows the "looks like C, acts like C" principle.

Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter.
Just what is wrong with this web newsgroup interface?  I should've carried on using my quote tidier.  If anyone else is having the same troubles, you're pointed here....

http://smjg.port5.com/faqs/usenet/quotetidy.html

Hopefully my regular posting environment will soon have a working power supply once again....

Stewart.
Sep 21 2004
prev sibling next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 21 Sep 2004 10:13:55 +0000 (UTC), Stewart Gordon 
<Stewart_member pathlink.com> wrote:

 In article <cioggi$k7p$1 digitaldaemon.com>, Walter says...
 <snip>
 It remains x"8F".  Thus the \x specifies a literal byte in the 
 character
 stream as implemented.  The specification doesn't mention this twist, 
 if
 it was intentional.
I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as unicode and so is converted to UTF-8.
Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints. Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char. Moreover, it follows the "looks like C, acts like C" principle.
I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence? Does the compiler/program catch this invalid sequence? I believe it should.
 Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter.

 Stewart.
-- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Sep 21 2004
next sibling parent "Walter" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opseo5g1si5a2sq9 digitalmars.com...
 I agree.. however doesn't this make it possible to create an invalid UTF-8
 sequence?
Yes.
 Does the compiler/program catch this invalid sequence?
 I believe it should.
Only if the string is interpreted as a wchar[] or dchar[].
Sep 21 2004
prev sibling next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opseo5g1si5a2sq9 digitalmars.com>, Regan Heath says...

I agree.. however doesn't this make it possible to create an invalid UTF-8 
sequence?
Yup. If you use \x in a char array you are doing /low level stuff/. You are doing encoding-by-hand - and it's up to you to get it right.
Does the compiler/program catch this invalid sequence?
I believe it should.
I disagree. If you're using \x then you're working at the byte level. You might be doing some system-programming-type stuff where you actually /want/ to break the rules. The compiler will catch it if and when you pass it to a toUTF function, and that's good enough for me. People simply need to understand the difference between \u and \x. Arcane Jill
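A minimal sketch of what "encoding-by-hand" means here, in present-day D syntax:

    void main()
    {
        // Encoding by hand with \x: the programmer supplies the UTF-8 code
        // units for U+00E9 and is responsible for getting them right.
        string byHand = "\xC3\xA9";

        // The same character via \u: the compiler does the encoding.
        string byEscape = "\u00E9";

        assert(byHand == byEscape);
        assert(byHand.length == 2);   // two code units for one character
    }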
Sep 22 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Wed, 22 Sep 2004 07:21:27 +0000 (UTC), Arcane Jill 
<Arcane_member pathlink.com> wrote:
 In article <opseo5g1si5a2sq9 digitalmars.com>, Regan Heath says...

 I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence?
Yup. If you use \x in a char array you are doing /low level stuff/. You are doing encoding-by-hand - and it's up to you to get it right.
I agree.
 Does the compiler/program catch this invalid sequence?
 I believe it should.
I disagree. If you're using \x then you're working at the byte level. You might be doing some system-programming-type stuff where you actually /want/ to break the rules.
I disagree.  char is 'defined' as being UTF encoded; IMO it should never not be.  If you want to 'break the rules' you can/should use ubyte[]; then you're not breaking any rules.
 The compiler will catch it if and when you pass it to a toUTF function, 
 and that's good enough for me.
Probably fair enough.. however, I think it would be more robust if it was made impossible to have an invalid utf8/16/32 sequence. That may be an impossible dream..
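A sketch of the "caught when you decode it" behaviour, using the std.utf names from later Phobos (which postdate this thread) and assuming the compiler still passes \x bytes through unchecked:

    import std.utf : toUTF32, UTFException;

    void main()
    {
        // The lone byte 0x8F is not valid UTF-8; nothing complains until
        // something actually tries to decode the array.
        char[] s = "\x8F".dup;

        bool caught;
        try
            cast(void) toUTF32(s);    // the decode is where it blows up
        catch (UTFException e)
            caught = true;            // invalid sequence detected at run time
        assert(caught);

        // Bytes that carry no UTF-8 claim are better off in ubyte[]:
        ubyte[] raw = [0x8F];
    }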
 People simply need to understand the difference between \u and \x.
But of course. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Sep 22 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <opseq16svz5a2sq9 digitalmars.com>, Regan Heath says...


I disagree. char is 'defined' as being UTF encoded, IMO it should never 
not be.
If you want to 'break the rules' you can/should use ubyte[], then, you're 
not breaking any rules.
Okay, you've convinced me. other integer and integer array literals. But that would be a real headache for Walter, since D is supposed to have a context-free grammar. It's not clear to me how the compiler could parse the difference. Arcane Jill
Sep 23 2004
prev sibling parent reply Stewart Gordon <Stewart_member pathlink.com> writes:
In article <opseq16svz5a2sq9 digitalmars.com>, Regan Heath says...
<snip>
 I disagree.  char is 'defined' as being UTF encoded, IMO it should 
 never not be.  If you want to 'break the rules' you can/should use 
 ubyte[], then, you're not breaking any rules.
Do char[] and ubyte[] implicitly convert between each other? If not, it could make code that interfaces a foreign API somewhat cluttered with casts. And besides this, which is more self-documenting for the purpose? ubyte[] obscures the fact that it's a string, rather than any old block of 8-bit numbers. char[] denotes a string, but is it any more misleading? People coming from a C(++) background are likely to see it and think 'string' rather than 'UTF-8'. (Does anyone actually come from a D background yet?)
 The compiler will catch it if and when you pass it to a toUTF 
 function, and that's good enough for me.
Probably fair enough.. however, I think it would be more robust if it was made impossible to have an invalid utf8/16/32 sequence. That may be an impossible dream..
<snip>

That would mean that a single char value would be restricted to the ASCII set, wouldn't it?

Stewart.
Sep 23 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ciu7q5$s35$1 digitaldaemon.com>, Stewart Gordon says...

Do char[] and ubyte[] implicitly convert between each other? 
Of course not. They can't even /ex/plicitly convert. How could they? You'd be converting from UTF-8 to ... what exactly? But I suspect you meant implicitly /cast/. In which case, no, they don't do that either.
If not, 
it could make code that interfaces a foreign API somewhat cluttered 
with casts.
Not really, since foreign API functions should be /expecting/ C-strings, that is, pointers to arrays of bytes (not chars), terminated with the byte value \0.  So, for example, strcat() should be declared in D with a byte-level signature, and not with a char-level one - see the sketch below.
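The contrast presumably looks something like the following sketch - my reconstruction of the shape, not Jill's original code:

    // A byte-level signature, matching what C actually promises
    // (C char == D byte/ubyte):
    extern (C) ubyte* strcat(ubyte* dst, ubyte* src);

    // ...rather than a char-level one, which would imply a UTF-8 contract
    // that C never made:
    //
    //     extern (C) char* strcat(char* dst, char* src);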
And besides this, which is more self-documenting for the purpose?
Well, this of course is the big area of disagreement.  We all want code to be easily maintainable.  That means, more readable; more self-documenting.  Readable code is a good thing.

The problem is that some of us (Regan and I, for example) look at a declaration of char[] and see "A string of Unicode characters encoded in UTF-8".  It is eminently self-documenting, by the very definition of char[].  We also look at a declaration of byte[] and see "An array of bytes whose interpretation depends on what you do with them".

Others (yourself included) apparently see things differently.  You look at a declaration of char[] and see "A string of not-necessarily-Unicode characters encoded in some unspecified way", and see byte[] as "An array of bytes whose interpretation is anything /other/ than a sequence of characters".

It is not really possible for code to be simultaneously self-documenting in both paradigms - but you might like to consider the fact that in C and C++, an array of C chars must be interpreted as "An array of bytes whose interpretation depends on what you do with them" - because C/C++ don't actually /have/ a character type, merely an overused byte type.  As soon as you start to think:

    D        Java             C/C++
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    byte     byte             signed char
    ubyte    no equivalent    unsigned char
    char     no equivalent    no equivalent
    wchar    char             wchar_t

and /stop/ imagining that D's char == C's char (which it clearly doesn't) then everything makes sense.
ubyte[] obscures the fact that it's a string, rather than any old 
block of 8-bit numbers.
And how would you make such a distinction in C?
char[] denotes a string, but is it any more 
misleading?  People coming from a C(++) background are likely to see 
it and think 'string' rather than 'UTF-8'.  (Does anyone actually 
come from a D background yet?)
Maybe you're answering your own question there. Stop thinking in C. This is D. Think in D. Even if nobody comes from a D background yet - let's just assume that one day, they will. It has been suggested over on the main forum that D's types be renamed. If no type called "char" existed in D; if instead, you had to choose between the types "utf8", "uint8" and "int8", it would be obvious which one you'd go for.
That would mean that a single char value would be restricted to the 
ASCII set , wouldn't it?
You're not thinking in Unicode.  A D char stores a "code unit" (a UTF-8 fragment), not a character codepoint.  UTF-8 code-units coincide with character codepoints /only/ in the ASCII range.  A single char value, however, can store any valid UTF-8 fragment.  You would be wrong, however, to interpret this as a character.  For example, see the sketch below.

Arcane Jill
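A minimal sketch of the point in present-day D:

    void main()
    {
        // U+00E9 needs two UTF-8 code units.  Each char holds one fragment;
        // neither fragment is, by itself, a character.
        string s = "\u00E9";
        assert(s.length == 2);
        assert(s[0] == 0xC3 && s[1] == 0xA9);   // code units, not codepoints
    }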
Sep 23 2004
parent reply Stewart Gordon <Stewart_member pathlink.com> writes:
In article <ciudb4$11ba$1 digitaldaemon.com>, Arcane Jill says...

 In article <ciu7q5$s35$1 digitaldaemon.com>, Stewart Gordon says...
 
 Do char[] and ubyte[] implicitly convert between each other?
Of course not. They can't even /ex/plicitly convert. How could they? You'd be converting from UTF-8 to ... what exactly?
I wouldn't. I'd be converting from bytes interpreted as chars to bytes interpreted as bytes of arbitrary semantics. <snip>
 Not really, since foreign API functions should be /expecting/ 
 C-strings, that is, pointers to arrays of bytes (not chars), 
 terminated with the byte value \0.
Even if they're written in/for Pascal or Fortran?
 So, for example, strcat() should be declared in D as:
 

 
 and not as:
 

Then how would I write the C call strcat(qwert, "yuiop"); in D? <snip>
 Others (yourself included) apparently see things differently.  You 
 look at a declaration of char[] and see "A string of 
 not-necessarily-Unicode characters encoded in some unspecified 
 way", and see byte[] as "An array of bytes whose interpretation is 
 anything /other/ than a sequence of characters".
Did I say that? I didn't mean to indicate that byte[] necessarily isn't an array of characters. Merely that I don't see people as seeing it and thinking 'string'. <snip>
 ubyte[] obscures the fact that it's a string, rather than any old 
 block of 8-bit numbers.
And how would you make such a distinction in C?
With a typedef. <snip>
 Maybe you're answering your own question there.  Stop thinking in 
 C.  This is D.  Think in D.
I do on the whole. But trying to think in Windows API at the same time isn't easy. It'll probably be easier once the D Windows headers are finished.
 Even if nobody comes from a D background yet - let's just assume 
 that one day, they will.
 
 It has been suggested over on the main forum that D's types be 
 renamed.  If no type called "char" existed in D; if instead, you 
 had to choose between the types "utf8", "uint8" and "int8", it 
 would be obvious which one you'd go for.
Then if only such types existed as "ansi", "windows1252", "windows1253", "ibm", "iso8859_5", "macdevanagari" then the list would be complete. <snip>
 You're not thinking in Unicode.  A D char stores a "code unit" (a 
 UTF-8 fragment), not a character codepoint.  UTF-8 code-units 
 coincide with character codepoints /only/ in the ASCII range.  A 
 single char value, however, can store any valid UTF-8 fragment.  
 You would be wrong, however, to interpret this as a character.
<snip> That makes sense.... Stewart.
Sep 23 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Thu, 23 Sep 2004 12:51:16 +0000 (UTC), Stewart Gordon 
<Stewart_member pathlink.com> wrote:

<snip>

 Even if nobody comes from a D background yet - let's just assume
 that one day, they will.

 It has been suggested over on the main forum that D's types be
 renamed.  If no type called "char" existed in D; if instead, you
 had to choose between the types "utf8", "uint8" and "int8", it
 would be obvious which one you'd go for.
Then if only such types existed as "ansi", "windows1252", "windows1253", "ibm", "iso8859_5", "macdevanagari" then the list would be complete.
It appears to me that Walter has decided on having only 3 types with a specified encoding, and all other encodings will be handled by using ubyte[]/byte[] and conversion functions.

I think this is the right choice.  I see Unicode as the future and other encodings as legacy encodings, whose use I hope gradually disappears.  Of course, if there is a valid reason for a certain encoding to remain, for speed/space/other reasons, and D wanted the same sort of built-in support as we have for utf8/16/32, then a new type might emerge.

<snip> Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
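A sketch of that division of labour; the conversion routine's name and coverage here are hypothetical (only ASCII plus the 0x80 euro sign of Windows-1252 is handled), since the thread leaves the real conversion functions unspecified:

    // Hypothetical converter: legacy bytes in, UTF-8 out.
    char[] windows1252ToUtf8(const(ubyte)[] legacy)
    {
        char[] result;
        foreach (b; legacy)
        {
            if (b < 0x80)
                result ~= cast(char) b;   // ASCII maps straight through
            else if (b == 0x80)
                result ~= "\u20AC";       // 0x80 is the euro sign in 1252
            else
                assert(false, "full table not sketched here");
        }
        return result;
    }

    void main()
    {
        ubyte[] legacy = [0x80, 0x31, 0x30];   // "€10" in Windows-1252
        assert(windows1252ToUtf8(legacy) == "\u20AC10");
    }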
Sep 23 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <opses3k8mv5a2sq9 digitalmars.com>, Regan Heath says...

It appears to me that Walter has decided on having only 3 types with a 
specified encoding, and all other encodings will be handled by using 
ubyte[]/byte[] and conversion functions.
Completely in agreement with you there. However, Stewart did actually ask a question which I couldn't answer, and which we shouldn't ignore. Maybe you have some ideas. ..anyway... I'm moving my reply to the main forum. I think it's more appropriate there. Arcane Jill
Sep 24 2004
prev sibling parent reply Stewart Gordon <Stewart_member pathlink.com> writes:
In article <opseo5g1si5a2sq9 digitalmars.com>, Regan Heath says...
<snip>
 I agree..  however doesn't this make it possible to create an 
 invalid UTF-8 sequence?  Does the compiler/program catch this 
 invalid sequence?  I believe it should.
I firmly don't believe in any attempts to force a specific character encoding on every char[] ever created. As said before, it should remain possible for char[] literals to contain character codes that aren't UTF-8, for such purposes as interfacing OS APIs. The ability to use arbitrary \x codes provides this neatly. I imagine few people would use it to insert UTF-8 characters in practice - if they want the checking, they can either type the character directly or use the \u code, which is much simpler than manually converting it to UTF-8. Stewart.
Sep 22 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cirlav$2a9s$1 digitaldaemon.com>, Stewart Gordon says...
I firmly don't believe in any attempts to force a specific character 
encoding on every char[] ever created.
I do, since it's documented that way.
As said before, it should 
remain possible for char[] literals to contain character codes that 
aren't UTF-8, for such purposes as interfacing OS APIs.
I agree that it should remain possible - but I disagree with the reason.  Non-UTF encodings are more properly stored as ubyte[] arrays in D.  Remember, C and C++ simply don't /have/ a type equivalent to D's char, so functions written in C or C++ were /never/ intended to receive such a type.  C's char == D's byte or ubyte.

The possible reasons why one might want to store arbitrary byte values in chars include scary hand-encoding of UTF-8 and possibly some esoteric custom extensions (for example, imagine you invent some backwardly compatible UTF-8-PLUS).  Such uses are, however, rare.  About as rare as, say, needing to write a custom allocator because "new" isn't good enough.  It should always be possible, but never commonplace.
The ability to use arbitrary \x codes provides this neatly.  I 
imagine few people would use it to insert UTF-8 characters in 
practice - if they want the checking, they can either type the 
character directly or use the \u code, which is much simpler than 
manually converting it to UTF-8.
Of course this makes perfect logical sense - /if/ you're talking about a ubyte[] array, not a char[] array. Jill
Sep 22 2004
prev sibling parent Regan Heath <regan netwin.co.nz> writes:
On Wed, 22 Sep 2004 10:49:03 +0000 (UTC), Stewart Gordon 
<Stewart_member pathlink.com> wrote:
 In article <opseo5g1si5a2sq9 digitalmars.com>, Regan Heath says...
 <snip>
 I agree..  however doesn't this make it possible to create an
 invalid UTF-8 sequence?  Does the compiler/program catch this
 invalid sequence?  I believe it should.
I firmly don't believe in any attempts to force a specific character encoding on every char[] ever created.
But it's 'defined' as having that encoding; if you don't want it, don't use char[] - use byte[] instead.
 As said before, it should
 remain possible for char[] literals to contain character codes that
 aren't UTF-8, for such purposes as interfacing OS APIs.
A C/C++ char* points at 8-bit values (typically signed) with no specified encoding.  D's byte[] matches that perfectly.  Maybe byte[] should be implicitly convertible to char* (if it's not already).
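For comparison, present-day D makes that conversion explicit at the C boundary - a sketch:

    import core.stdc.string : strlen;

    void main()
    {
        // Text in some legacy encoding, kept as raw bytes rather than char[]:
        ubyte[] legacy = cast(ubyte[]) "hello\0".dup;

        // Today this needs an explicit cast; the suggestion above is that
        // such a conversion might reasonably be implicit.
        auto n = strlen(cast(const(char)*) legacy.ptr);
        assert(n == 5);
    }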
 The ability to use arbitrary \x codes provides this neatly.  I
 imagine few people would use it to insert UTF-8 characters in
 practice - if they want the checking, they can either type the
 character directly or use the \u code, which is much simpler than
 manually converting it to UTF-8.
Sure, really I'm playing devil's advocate..  I question the logic of 'defining' char to be UTF-8 if you're not going to enforce it. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Sep 22 2004
prev sibling parent reply Burton Radons <burton-radons shaw.ca> writes:
Stewart Gordon wrote:

 In article <cioggi$k7p$1 digitaldaemon.com>, Walter says...
 <snip>
 
It remains x"8F".  Thus the \x specifies a literal byte in the character
stream as implemented.  The specification doesn't mention this twist, if
it was intentional.
I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as unicode and so is converted to UTF-8.
Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints. Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char. Moreover, it follows the "looks like C, acts like C" principle.
I don't think this will work; it requires specifying what encoding the compiler worked with internally.

For example, DMD works in UTF-8 internally.  Therefore the first string is okay but the second is not because the UTF-8 is broken:

    char [] foo = "\x8F";
    wchar [] bar = "\x8F";

But if a compiler uses UTF-16 or UTF-32 internally, then it won't detect any problem with either of those strings.  So a literal string must be valid for arbitrary conversion between any encoding (that can only be interpreted as "\x specifies a UNICODE character"), OR there must be a mandate for what encoding the compiler uses internally.  I think the former is less odious; as soon as you start depending upon features of an encoding, you get into trouble.
Sep 25 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cj4c5p$1r6s$1 digitaldaemon.com>, Burton Radons says...

I don't think this will work; it requires specifying what encoding the 
compiler worked with internally.

For example, DMD works in UTF-8 internally.
Walter assures us that the D language itself is not prejudiced toward UTF-8; that UTF-16 and UTF-32 have equal status. I can think of one or two examples which seem to contradict this, but they are likely to disappear once D gives us implicit conversions between the UTFs.
Therefore the first string 
is okay but the second is not because the UTF-8 is broken:

     char [] foo = "\x8F";
     wchar [] bar = "\x8F";
I presume you meant that the other way round: the /first/ string is broken; the /second/ string is okay. But that's okay. Anyone using \x in a char[], wchar[] or dchar[] is expected to know what they're doing, otherwise they should be using \u.
But if a compiler uses UTF-16 or UTF-32 internally, then it won't detect 
any problem with either of those strings.
There isn't /necessarily/ anything wrong with either of those strings. For example: The requirement for using \x is merely that the programmer knows their UTF.
So a literal string must be valid for arbitrary conversion between any 
encoding (that can only be interpreted as "\x specifies a UNICODE 
character"),
No, the requirement is that programmers /must not/ use \x within a string unless they understand exactly how it will be interpreted.  For most normal purposes, stick to this golden rule:

*) For char[], wchar[] or dchar[] - use \u
*) For all other arrays - use \x
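The golden rule, sketched in present-day D:

    void main()
    {
        // Character arrays: use \u and let the compiler do the encoding.
        string text = "\u00E9";                            // stored as UTF-8: C3 A9

        // Byte-oriented data: use \x and take the bytes exactly as written.
        auto bytes = cast(immutable(ubyte)[]) "\xC3\xA9";

        assert(cast(immutable(ubyte)[]) text == bytes);
    }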
OR there must be a mandate for what encoding the compiler 
uses internally.
I don't see a need for that.
I think the former is less odious; as soon as you 
start depending upon features of an encoding, you get into trouble.
Right.  Which is why \x in strings should be considered "experts only".  But I would hesitate to call that a "bug".

It would be /possible/ for D's lexer to distinguish character string constants from byte string constants in some cases.  I don't know if that would be a good idea.  What I mean is:

This would catch a lot of such bugs at compile time.  Maybe Walter could be persuaded to go for this, I don't know.  But \x bugs are bugs in user code, not in the compiler.

Arcane Jill
Sep 27 2004
next sibling parent Stewart Gordon <Stewart_member pathlink.com> writes:
In article <cj8fhp$1h9u$1 digitaldaemon.com>, Arcane Jill says...
<snip>

 UTF-8

 UTF-16 for U+008F
<snip> No, "\x8F" _means_ the byte with value 0x8F, meant to be interpreted as UTF-8. Somewhere in the docs there's an example or two of a wchar[] or dchar[] being initialised with UTF-8 in this way. Stewart.
Sep 29 2004
prev sibling parent reply Burton Radons <burton-radons smocky.com> writes:
Arcane Jill wrote:

 In article <cj4c5p$1r6s$1 digitaldaemon.com>, Burton Radons says...
 
 
I don't think this will work; it requires specifying what encoding the 
compiler worked with internally.

For example, DMD works in UTF-8 internally.
Walter assures us that the D language itself is not prejudiced toward UTF-8; that UTF-16 and UTF-32 have equal status. I can think of one or two examples which seem to contradict this, but they are likely to disappear once D gives us implicit conversions between the UTFs.
I don't understand what political interpretation you gave my statement, but it was only to introduce an object example.
Therefore the first string 
is okay but the second is not because the UTF-8 is broken:

    char [] foo = "\x8F";
    wchar [] bar = "\x8F";
I presume you meant that the other way round: the /first/ string is broken; the /second/ string is okay.
If the compiler uses UTF-8 internally, the first string compiles correctly as a string with length one while the second string does not, because the compiler tries to re-encode it as UTF-16 during semantic processing and fails.  I am describing DMD's current behaviour, mind.

If the compiler uses UTF-16 or UTF-32 internally (where it would convert the source file into its native encoding during BOM processing), then both strings compile.  The first string has length two; the second string has length one.

[snip]
But if a compiler uses UTF-16 or UTF-32 internally, then it won't detect 
any problem with either of those strings.
There isn't /necessarily/ anything wrong with either of those strings. For example:
If the compiler is using UTF-8 internally, there is no possible way to re-encode the second string as UTF-16 while remaining consistent with compilers that use different encodings.  To use your example encoding, if the compiler uses UTF-8 internally, then this code:

    wchar [] s = "\xC4\x8F";

Would result in a single-code string (you understand that D grammar is contextless and that string escapes are interpreted during tokenisation, right?).  However, if the compiler uses UTF-16 internally, it would result in a two-code string.

This does show a third option, however: change string escapes so that they must not be interpreted until after semantic processing, where they can be interpreted directly as their destination encoding.  But that only serves to illustrate how unnatural this behaviour is; which may be why no Unicode-supporting language that I can find that handles \x interprets it as anything but a character.

[snip]
Sep 29 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjg3qq$1ui6$1 digitaldaemon.com>, Burton Radons says...

Therefore the first string 
is okay but the second is not because the UTF-8 is broken:

    char [] foo = "\x8F";
    wchar [] bar = "\x8F";
I presume you meant that the other way round: the /first/ string is broken; the /second/ string is okay.
If the compiler uses UTF-8 internally, the first string compiles correctly as string with length one while the second string does not, because the compiler tries to re-encode it as UTF-16 during semantic processing and fails. I am describing DMD's current behaviour, mind.
Yes. I was wrong. The first example compiles okay, but results in foo containing an invalid UTF-8 sequence. The second example does not compile. (I assumed that it would, without testing the hypothesis. That'll teach me).
To use your example encoding,
if the compiler uses UTF-8 internally, then this code:

    wchar [] s = "\xC4\x8F";
Again, I was wrong. I assumed (without testing) that this would compile to a two-wchar string constant, with s[0] containing U+00C4 and s[1] containing U+008F. In actual fact, what this code yields is a one-wchar string constant, with s[0] containing U+010F. I would call that a bug. [ 0xC4, 0x8F ] is UTF-8 (not UTF-16) for U+010F. But s is a wchar string, so it's supposed to be UTF-16.
(you understand that D grammar is
contextless and that string escapes are interpreted during tokenisation,
right?).  However, if the compiler uses UTF-16 internally, it would
result in a two-code string.
I think you're right.  That's what's happening.  The compiler is interpreting all string constants as though they were UTF-8, regardless of the type of the destination.
This does show a third option, however: change string escapes so that
they must not be interpreted until after semantic processing where they
can be interpreted directly as their destination encoding.
That's the way I assumed it would be done.
But that
only serves to illustrate how unnatural this behaviour is; which may be 
why that no Unicode-supporting language that I can find that handles \x 
interprets it as anything but a character.
Actually, C and C++ interpret \x as a LOCAL ENCODING character, and \u as a UNICODE character.  Thus, in C++, if your local encoding were Windows-1252, then the following two statements would have identical effect:

Both of these will leave the string s containing a single (byte-wide) char, with value 0x80.  (Plus the null-terminator, of course).  Compare this with

which /should/ fail to compile on a Windows-1252 machine.  So you /are/ right, but nonetheless there is a difference between \x and \u.

And this presents a problem for D, because D aims to be portable between encodings.  In D, therefore, \x SHOULD NOT be interpreted according to the local encoding, because this would immediately make code non-portable.  One way around this would be to assert that \x should mean exactly the same thing as \u and \U (that is, to specify a Unicode character).  Now, that would be fine for those of us used to Latin-1, but Cyrillic users (for example) would be left out in the cold.

Currently, I have come to the conclusion that \x should be deprecated.  The escapes \u and \U explicitly specify a character set (i.e. Unicode), and that is what you need for portability.  \x just has too many problems.

Arcane Jill
Sep 30 2004
prev sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cilbgp$1p3a$1 digitaldaemon.com>, Burton Radons says...
The following string is encoded into UTF-8:

    char [] c = "\u7362";

It encodes into x"E4 98 B7".  The following string, however, does not 
get encoded:

    char [] c = "\x8F";

It remains x"8F".  Thus the \x specifies a literal byte in the character 
stream as implemented.  The specification doesn't mention this twist, if 
it was intentional.
This is correct behavior.  You should be using \u for Unicode characters.  \x is for literal bytes.  \u is supposed to understand the encoding.  \x is not.

In D, the source code encoding must always be a UTF, but in other computer languages, this is not so.  Imagine a C++ program in which the source code encoding were WINDOWS-1252.  In such a case, the following two lines would be equivalent:

In both cases, a single byte [0x80] will be placed in the string s.  And now, here's the same thing in a C++ program in which the source code encoding is WINDOWS-1251:

In both cases, a single byte [0x88] will be placed in the string s.

Now, since D does not allow non-UTF source code encodings, the distinction may appear blurred, but it's still there.  Just remember:

\x => insert this literal byte
\u => insert this Unicode character, encoded in the appropriate encoding.

Arcane Jill
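A closing sketch of that rule in present-day D - \u adapts to the destination's encoding, \x never does (assuming, as throughout this thread, that \x bytes go in unchecked):

    void main()
    {
        // \u names a character; the compiler encodes it to fit the destination.
        string  s8  = "\u20AC";      // UTF-8:  E2 82 AC  (three code units)
        wstring s16 = "\u20AC"w;     // UTF-16: 20AC      (one code unit)
        dstring s32 = "\u20AC"d;     // UTF-32: 000020AC  (one code unit)
        assert(s8.length == 3 && s16.length == 1 && s32.length == 1);

        // \x names a byte; it goes in exactly as written, whatever that does
        // to the surrounding UTF-8.
        string raw = "\x80";
        assert(raw.length == 1 && raw[0] == 0x80);
    }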
Sep 22 2004