
digitalmars.D - What to do about \x?

reply Arcane Jill <Arcane_member pathlink.com> writes:
The use of the escape sequence "\x" in string and character literals causes a
lot of confusion in D. 

(1) Most people on this thread use either WINDOWS-1252 or LATIN-1, and,
consequently, expect "\xE9" to emit a lowercase 'e' with an acute accent. It
does not.

(2) In a recent thread on this forum, novice (who is Russian) expected "\xC0" to
emit the Cyrillic letter capital A (because that's what happens in C++ on their
WINDOWS-1251 machine). It does not.

(3) Over in the bugs forum, it is being discussed and discovered that the DMD
compiler interprets "\x" as UTF-8 /even in wchar strings/ (which one might have
expected to be UTF-16). This is clearly nonsense.

So maybe it's time to clear up the confusion once and for all. What /should/
"\x" do? As I see it, these are the options:

(1) \x should be interpreted in the user's default encoding
(2) \x should be interpreted in the source-code encoding
(3) \x should be interpreted according to the destination type of the literal
(4) \x should be interpreted as UTF-8
(5) \x should be interpreted as Latin-1
(6) \x should be interpreted as ASCII
(7) \x should be deprecated

Option (1) is what C++ does. However - it makes code non-portable across
encodings: a source file created by one user will not necessarily compile
correctly for another.

Option (2) is more restricted, since the source file encoding must be one of
UTF-8, UTF-16 or UTF-32. I still don't like it though - the compiler shouldn't
behave differently just because the source file is saved differently.

Option (3) is what I (incorrectly) assumed \x would do. However, D has a
context-free grammar, and parses string literals /before/ it knows what kind of
thing it's assigning. (Exactly how it manages to make

#    wchar[] s = "hello";

do the right thing is beyond me, but whether or not this contextual typing could
be extended to include the interpretation of \x is something only Walter could
answer). Anyway, this would still be strange behaviour, from the point of view
of C++ programmers.

Option (4) is the status quo. It confuses everybody.

Option (5) is biased toward folk in the Western world. There is some
justification for it, however, since Latin-1 is a subset of Unicode, having
precisely the same codepoint-to-character mapping. If we went for this, then
"\x##" would always be the same thing as "\u00##" or "\U000000##". I /think/
(though I'm not certain) that this is what Java does.
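Option (5)'s premise - that Latin-1's codepoint-to-character mapping coincides with the first 256 Unicode codepoints - can be checked mechanically. A quick sketch in Python rather than D, since this is a fact about the encodings themselves, not about DMD:

```python
# Every Latin-1 byte 0x## decodes to the identically numbered
# Unicode codepoint U+00##.
for b in range(256):
    assert bytes([b]).decode("latin-1") == chr(b)

# So under option (5), "\xE9" would mean U+00E9
# (lowercase e with acute accent), matching "\u00E9".
assert bytes([0xE9]).decode("latin-1") == "\u00E9"
```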

Option (6) is not unreasonable. It would mean that "\x00" to "\x7F" would be the
only legal \x escape sequences - these are unambiguous. Sequences "\x80" to
"\xFF" would become compile-time errors. The error message should advise people
to use "\u" instead.

Option (7) is foolproof. Any use of "\x" becomes a compile-time error. The error
message should advise people to use "\u" instead.

What think you all?
Arcane Jill
Oct 01 2004
next sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
 The use of the escape sequence "\x" in string and character literals causes a
 lot of confusion in D. 
 
 (1) Most people on this thread use either WINDOWS-1252 or LATIN-1, and,
 consequently, expect "\xE9" to emit a lowercase 'e' with an acute accent. It
 does not.

It does _emit_ a lowercase 'e' with an acute accent, if its destination is anything in the Windows GUI and it was emitted using the A version of a Windows API function.
 (2) In a recent thread on this forum, novice (who is Russian) expected "\xC0"
to
 emit the Cyrillic letter capital A (because that's what happens in C++ on their
 WINDOWS-1251 machine). It does not.
 
 (3) Over in the bugs forum, it is being discussed and discovered that the DMD
 compiler interprets "\x" as UTF-8 /even in wchar strings/ (which one might have
 expected to be UTF-16). This is clearly nonsense.

In that case, how should it interpret \u or \U in a char string? \x is a UTF-8 fragment, \u is a UTF-16 fragment, \U is a UTF-32 fragment. Whatever rules we have for translating one into the other should be consistent. And the current behaviour satisfies that criterion nicely.

<snip>
 (Exactly how it manages to make
 
 #    wchar[] s = "hello";
 
 do the right thing is beyond me, but whether or not this contextual typing
could
 be extended to include the intepretation of \x is something only Walter could
 answer).

Just thinking about it, I guess that DMD uses an 8-bit internal representation during the tokenising and parsing phase, converting \u and \U codes (and of course literal characters in UTF-16 or UTF-32 source text) to their UTF-8 counterparts. During the semantic analysis, it then converts it back to UTF-16 or UTF-32 if it's assigning to a wchar[] or dchar[].

<snip>
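This hypothesised pipeline - hold the literal in UTF-8 while lexing, transcode to the destination UTF during semantic analysis - can be modelled in a few lines. A Python illustration of the idea only; DMD's actual internals are a guess here:

```python
# Stage 1 (lexer, hypothetically): the literal is stored as UTF-8 bytes.
stored = "\u010F".encode("utf-8")          # b"\xC4\x8F"

# Stage 2 (semantic analysis): re-encode for the destination type,
# e.g. a wchar[] (UTF-16) target.
as_wchar = stored.decode("utf-8").encode("utf-16-be")
assert as_wchar == b"\x01\x0F"             # one UTF-16 code unit, 0x010F
```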
 Option (4) is the status quo. It confuses everybody.

Well, I'm not confused ... yet.
 What think you all?

My vote goes to leaving \x as it is.

Stewart.
Oct 01 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjjcs7$9pf$1 digitaldaemon.com>, Stewart Gordon says...
Arcane Jill wrote:
 The use of the escape sequence "\x" in string and character literals causes a
 lot of confusion in D. 
 
 (1) Most people on this thread use either WINDOWS-1252 or LATIN-1, and,
 consequently, expect "\xE9" to emit a lowercase 'e' with an acute accent. It
 does not.

It does _emit_ a lowercase 'e' with an acute accent, if its destination is anything in the Windows GUI and it was emitted using the A version of a Windows API function.

Sorry, I didn't follow that. Can you give me a code example? I only meant that

#    char[] s = "\xE9";

does not leave s containing an e with an acute accent, which some people might expect.
In that case, how should it interpret \u or \U in a char string?

\u and \U are well defined universally. They do not depend on encoding.
\x is a UTF-8 fragment,

That's the status quo in D, yes.

 \u is a UTF-16 fragment, \U is a UTF-32 fragment.

Incorrect. \u is /not/ a UTF-16 fragment. Why did you assume that? Did you assume that because, in D, \x is a UTF-8 fragment? If so, we may take that as further evidence that the current implementation of \x causes confusion.

In fact, both \u and \U specify Unicode /characters/ (not UTF fragments). By definition, \u#### = \U0000#### = the Unicode character U+####. \u and \U are identical in all respects other than the number of hex digits expected to follow them. (You only need to resort to \U if more than four digits are present.) Thus,

#    wchar[] s = "\uD800\uDC00"; // error

(correctly) fails to compile, (correctly) requiring you instead to do:

#    wchar[] s = "\U00110000";

So you see, \x really _is_ the odd one out.
Whatever rules we have for translating one into the other 
should be consistent.

Such consistency would require option (5) from my original post, so that \x## = \u00## = \U000000## - in all cases regarded as a Unicode character, not a UTF fragment. (The primary argument /against/ this behaviour is that it effectively makes \x an ISO-8859-1 encoding, which could be considered to be Western bias.)
And the current behaviour satisfies that 
criterion nicely.

My point is that the current behavior of \x is /not/ consistent with \u or \U. It is also not consistent with the expectations of users used to C++ or Java behavior.
 Option (4) is the status quo. It confuses everybody.

Well, I'm not confused ... yet.

Well, I don't know about you, but the following confuses me:

#    wchar[] s = "\xC4\x8F";  // eh?
#    wchar[] t = "\u010F";
#    assert(s == t);          // yup - they're the same

Why should s - declared as a UTF-16 string - be able to accept UTF-8 literals, but not UTF-16 literals? Let's see that again with a different example:

#    dchar[] s = "\U00010000";        // Unicode - (correctly) compiles
#    dchar[] s = "\uD800\uDC00";      // UTF-16 - (correctly) fails to compile
#    dchar[] s = "\xF0\x90\x80\x80";  // UTF-8 - compiles

It's not consistent. (But \u and \U are implemented correctly).
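The byte sequences in these snippets are plain UTF facts, checkable in any Unicode-aware language. A Python sketch (Python, not D, since it verifies the encodings rather than DMD's behaviour):

```python
# U+010F encodes to C4 8F in UTF-8, which is why "\xC4\x8F"
# and "\u010F" end up as the same string.
assert "\u010F".encode("utf-8") == b"\xC4\x8F"

# U+10000 is the surrogate pair D800 DC00 in UTF-16 ...
assert "\U00010000".encode("utf-16-be") == b"\xD8\x00\xDC\x00"

# ... and F0 90 80 80 in UTF-8.
assert "\U00010000".encode("utf-8") == b"\xF0\x90\x80\x80"
```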
 What think you all?

Stewart.

For what it's worth, my vote goes to deprecating \x.

Arcane Jill
Oct 01 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjjgs7$bs2$1 digitaldaemon.com>, Arcane Jill says...

Erratum

#    wchar[] s = "\U00110000";

should read:

#    wchar[] s = "\U00010000";

Jill
Oct 01 2004
prev sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

 In article <cjjcs7$9pf$1 digitaldaemon.com>, Stewart Gordon says...

 It does _emit_ a lowercase 'e' with an acute accent, if its 
 destination is anything in the Windows GUI and it was emitted using 
 the A version of a Windows API function.

Sorry, I didn't follow that. Can you give me a code example?

char[] s = "\xE9";
SendMessageA(hWnd, WM_SETTEXT, 0, cast(LPARAM) cast(char*) s);

<snip>
 In fact, both \u and \U specify Unicode /characters/ (not UTF 
 fragments). By definition, \u#### = \U0000#### = the Unicode 
 character U+####. \u and \U are identical in all respects other than 
 the number of hex digits expected to follow them. (You only need to 
 resort to \U if more than four digits are present.) Thus, 
 
 #    wchar[] s = "\uD800\uDC00"; // error
 
 (correctly) fails to compile, (correctly) requiring you instead to 
 do:

I find in the spec:

lex.html
	\n			the linefeed character
	\t			the tab character
	\"			the double quote character
	\012			octal
	\x1A			hex
	\u1234			wchar character
	\U00101234		dchar character
	\r\n			carriage return, line feed

expression.html
"Character literals are single characters and resolve to one of type char, wchar, or dchar. If the literal is a \u escape sequence, it resolves to type wchar. If the literal is a \U escape sequence, it resolves to type dchar. Otherwise, it resolves to the type with the smallest size it will fit into."

What bit of the spec should I be reading instead?

<snip>
 Whatever rules we have for translating one into the other should be 
 consistent.

Such consistency would require option (5) from my original post, so that \x## = \u00## = \U000000## - in all cases regarded as a Unicode character, not a UTF fragment. (The primary argument /against/ this behaviour is that it effectively makes \x an ISO-8859-1 encoding, which could be considered to be Western bias.)

How does simply having \x, \u and \U represent UTF-8, UTF-16 and UTF-32 fragments respectively not achieve this consistency? <snip>
 Well, I don't know about you, but the following confuses me:
 
 #    wchar[] s = "\xC4\x8F";  // eh?
 #    wchar[] t = "\u010F";
 #    assert(s == t);          // yup -they're the same

I suppose it confused me before I realised what it meant.

The point, AIUI, is that all string literals are equal whether they are notated as UTF-8, UTF-16 or UTF-32, i.e. the lexer reduces all to the same thing. This is kind of consistent with the principle that the permitted source text encodings are all treated equally. This leaves the semantic analyser with only one kind of string literal to worry about, which it converts to the UTF of the target type.

Moreover, I imagine that string literals juxtaposed into one, or even the contents of a single " " pair, are allowed to mix the various escape notations. In this case, trying to label string literals at lex-time as UTF-8, UTF-16 or UTF-32 would be fruitless.
 Why should s - declared as a UTF-16 string - be able to accept UTF-8 
 literals, but not UTF-16 literals? Let's see that again with a 
 different example:

That would sound like a bug to me.

Stewart.
Oct 01 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjjo6a$g8o$1 digitaldaemon.com>, Stewart Gordon says...

I find in the spec:

lex.html
	\n			the linefeed character
	\t			the tab character
	\"			the double quote character
	\012			octal
	\x1A			hex
	\u1234			wchar character
	\U00101234		dchar character
	\r\n			carriage return, line feed

I would argue that the spec is wrong. \u and \U are Unicode things, and I distinctly remember a discussion on this very subject on the Unicode public forum a while back (though Walter could, if he were sufficiently perverse, give D a different definition). I suggest that the D spec /should/ read:

#    \u1234		Unicode character
#    \U00101234	Unicode character

This would be consistent with C++, C#, Java, the intent of the Unicode Consortium, and the actual current behavior of D.

Certainly, at present, the actual behavior of \u in D is different from the documentation, so either there is a documentation error, or else there is a bug. I'm inclined to the belief that it's a documentation error. I certainly hope so, because \u and \U should always be independent of encoding. No-one should have to learn UTF-16 to use \u. It should be legal to write:

#    wchar[] s = "\U00101234";

(which of course it currently is).
expression.html
"Character literals are single characters and resolve to one of type 
char, wchar, or dchar. If the literal is a \u escape sequence, it 
resolves to type wchar. If the literal is a \U escape sequence, it 
resolves to type dchar. Otherwise, it resolves to the type with the 
smallest size it will fit into."

Bugger! Again, that's not how Unicode is supposed to behave. The following should (and does) compile without complaint:

#    char c = '\U0000002A';

So again, D is behaving fine, but the documentation does not match reality. Documentation error or bug? I say it's a documentation error.
What bit of the spec should I be reading instead?

You read the D docs correctly. I was going by previous discussions on the Unicode public forum (from memory). Obviously, those discussions weren't specifically about D.
How does simply having \x, \u and \U represent UTF-8, UTF-16 and UTF-32 
fragments respectively not achieve this consistency?

It's just not how Unicode is supposed to behave. \u and \U are supposed to be Unicode characters. Nothing more. Nothing less. (And that of course is exactly what D has implemented).

I'm going to have a hard time backing that up - so please trust me on this one. If not, I'll have to go trawling through the Unicode archives, or passing this question on to the Consortium folk. It's a complicated question, because the \u thing isn't actually something the UC can define, but nonetheless the same definition is used by C++, Java, Python, C#, various internet RFCs, etc. etc. D would be out on a very dodgy limb here if it were to do things differently. (...and it's also what is implemented by D, so again, I claim it's the documentation which is in error).
The point, AIUI, is that all string literals are equal whether they are 
notated as UTF-8, UTF-16 or UTF-32, i.e. the lexer reduces all to the 
same thing.

But one should not have to learn /any/ UTF to encode a string. Why would anyone want that? All you should need to know to encode a character using an escape sequence is its codepoint. And that's what \u and \U are for.
This is kind of consistent with the principle that the 
permitted source text encodings are all treated equally.

Well, here at least we are in agreement. All supported encodings should be treated equally.
 Why should s - declared as a UTF-16 string - be able to accept UTF-8 
 literals, but not UTF-16 literals? Let's see that again with a 
 different example:

That would sound like a bug to me.

It's certainly an inconsistency - but it's the legality of \x, not the illegality of \u, about which I would complain. Arcane Jill
Oct 01 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
<snip>
 I would argue that the spec is wrong. \u and \U are Unicode things, 
 and I distinctly remember a discussion on this very subject on the 
 Unicode public forum a while back (though Walter could, if he were 
 sufficiently perverse, give D a different definition).

Which bit of The Unicode Standard should I read to find the meanings of \u and \U it sets in stone across every language ever invented? <snip>
 Certainly, at present, actual behavior of \u in D is different from 
 the documentation, so either there is a documentation error, or else 
 there is a bug. I'm inclined to the belief that it's a documentation 
 error. I certainly hope so, because \u and \U should always be 
 independent of encoding.

How would fixing the compiler to follow the spec create dependence on encoding?
 No-one should have to learn UTF-16 to use \u. It should be legal to 
 write:
 
 #    wchar[] s = "\U00101234";
 
 (which of course it currently is).

Agreed from the start. Nobody suggested that anyone should have to learn UTF-16. Nor that being _allowed_ to use UTF-16 and being _allowed_ to use UTF-32 (and hence actual codepoints) should be mutually exclusive. I thought that was half the spirit of having both \u and \U - to give the programmer the choice.
 expression.html
 "Character literals are single characters and resolve to one of 
 type char, wchar, or dchar. If the literal is a \u escape sequence, 
 it resolves to type wchar. If the literal is a \U escape sequence, 
 it resolves to type dchar. Otherwise, it resolves to the type with 
 the smallest size it will fit into."

Bugger! Again, that's not how Unicode is supposed to behave. The following should (and does) compile without complaint: # char c = '\U0000002A'; So again, D is behaving fine, but the documentation does not match reality. Documentation error or bug? I say it's a documentation error.

Ah, you have a point. It appears that character and string literals are translated by the same function, rather than character literals being labelled as char, wchar or dchar. So that could be either a doc error or a bug. Walter?

<snip>
 It's just not how Unicode is supposed to behave. \u and \U are 
 supposed to be Unicode characters. Nothing more. Nothing less. (And 
 that of course is exactly what D has implemented).
 
 I'm going to have a hard time backing that up - so please trust me on 
 this one. If not, I'll have to go trawling through the Unicode 
 archives, or passing this question on to the Consortium folk. It's a 
 complicated question, because the \u thing isn't actually something 
 the UC can define, but nonetheless the same definition is used by 
 C++, Java, Python, C#, various internet RFCs, etc. etc. D would be 
 out on a very dodgy limb here if it were to do things differently.

Well, at least what the D spec implies \u should mean is a superset of what you're implying is the 'correct' Unicode behaviour. <snip>
 The point, AIUI, is that all string literals are equal whether they 
 are notated as UTF-8, UTF-16 or UTF-32, i.e. the lexer reduces all 
 to the same thing.

But one should not have to learn /any/ UTF to encode a string. Why would anyone want that?

I can't see how that follows on from what I just said. But agreed.
 All you should need to know to encode a character using an escape 
 sequence is its codepoint. And that's what \u and \U are for.

Exactly. At least, whichever interpretation of \u we go by, it works for all codepoints below U+FFFE. The only thing left to debate is whether it should also work for those UTF-16 fragments that don't directly correspond to codepoints. If not, then there are two things to do:

- fix the documentation to explain this
- invent another escape to represent a UTF-16 fragment, for the sake of completeness.

Stewart.
Oct 01 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjjvnq$nje$1 digitaldaemon.com>, Stewart Gordon says...
Arcane Jill wrote:
<snip>
 I would argue that the spec is wrong. \u and \U are Unicode things, 
 and I distinctly remember a discussion on this very subject on the 
 Unicode public forum a while back (though Walter could, if he were 
 sufficiently perverse, give D a different definition).

Which bit of The Unicode Standard should I read to find the meanings of \u and \U it sets in stone across every language ever invented?

It doesn't, of course. The meaning of \u and \U is defined separately in each programming language, and not by the UC. Walter is free to define them how he likes for D.
<snip>
 Certainly, at present, actual behavior of \u in D is different from 
 the documentation, so either there is a documentation error, or else 
 there is a bug. I'm inclined to the belief that it's a documentation 
 error. I certainly hope so, because \u and \U should always be 
 independent of encoding.

How would fixing the compiler to follow the spec create dependence on encoding?

Tricky. Well, if "\uD800" were allowed, for example, it would open up the door to the very nasty possibility that "\U0000D800" might also be allowed - and that /definitely/ shouldn't be, because it's not a valid character. I dunno - I just think it could be very confusing and counterintuitive.

Everyone expects \u and \U to prefix /characters/, not fragments of some encoding scheme. Why would you want it any different?
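The point that U+D800 is not a valid character is easy to demonstrate: a conforming UTF encoder must refuse a lone surrogate. A Python illustration (an encoding fact, not DMD behaviour):

```python
# U+D800 is a surrogate codepoint, reserved for UTF-16 pairs;
# it is not a character, so strict UTF-8 encoding rejects it.
try:
    "\ud800".encode("utf-8")
    rejected = False
except UnicodeEncodeError:
    rejected = True
assert rejected
```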
Agreed from the start.  Nobody suggested that anyone should have to 
learn UTF-16.  Nor that being _allowed_ to use UTF-16 and being 
_allowed_ to use UTF-32 (and hence actual codepoints) should be mutually 
exclusive.  I thought that was half the spirit of having both \u and \U 
- to give the programmer the choice.

In other languages, \u#### is identical in meaning to \U0000####. Why make D different? If you want to hand-code UTF-16, you can always do it like this:

#    wchar[] s = [ 0xD800, 0xDC00 ];
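For anyone who does want to hand-assemble a surrogate pair, the arithmetic is fixed by the UTF-16 definition. A Python sketch of the decoding rule (illustration only):

```python
# A surrogate pair (hi in D800-DBFF, lo in DC00-DFFF) encodes
# a codepoint in the range U+10000..U+10FFFF.
hi, lo = 0xD800, 0xDC00
codepoint = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
assert codepoint == 0x10000

# Python's UTF-16 decoder agrees with the hand calculation.
assert b"\xD8\x00\xDC\x00".decode("utf-16-be") == chr(codepoint)
```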
Ah, you have a point.  It appears that character and string literals are 
translated by the same function, rather than character literals being 
labelled as char, wchar or dchar.  So that could be either a doc error 
or a bug.  Walter?

Well, as you know, I suspect a documentation error, but I leave it to Walter to answer once and for all.
Well, at least what the D spec implies \u should mean is a superset of 
what you're implying is the 'correct' Unicode behaviour.

Guess I can't argue with that.
Exactly.  At least, whichever interpretation of \u we go by, it works 
for all codepoints below U+FFFE.  The only thing left to debate is 
whether it should also work for those UTF-16 fragments that don't 
directly correspond to codepoints.  If not, then there are two things to do:

- fix the documentation to explain this

Agreed.
- invent another escape to represent a UTF-16 fragment, for the sake of 
completeness.

I'm not sure you need an escape sequence. Why not just replace

#    wchar c = '\uD800';

with

#    wchar c = 0xD800;

What requirement is there to hand-code UTF-8 /inside a string literal/? If you get it right, then you might just as well have used \U and the codepoint; if you get it wrong, you're buggered.

Arcane Jill
Oct 01 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
<snip>
 Tricky. Well, if "\uD800" were allowed, for example, it would open up the door
 to the very nasty possibility that "\U0000D800" might also be allowed - and
that
 /definitely/ shouldn't be, because it's not a valid character. I dunno - I just
 think it could be very confusing and counterintuitive.
 
 Everyone expects \u and \U to prefix /characters/, not fragments of some
 encoding scheme. Why would you want it any different?

Because that's how \x is, and at the moment we have an equivalent for UTF-16 according to the spec, but not according to the compiler.
Agreed from the start.  Nobody suggested that anyone should have to 
learn UTF-16.  Nor that being _allowed_ to use UTF-16 and being 
_allowed_ to use UTF-32 (and hence actual codepoints) should be mutually 
exclusive.  I thought that was half the spirit of having both \u and \U 
- to give the programmer the choice.

In other languages, \u#### is identical in meaning to \U0000####. Why make D different?

Are languages defined by the spec or the compiler? I'd have thought the spec, in which case D is already different.
 If you want to hand-code UTF-16, you can always do it like this:
 
 #    wchar[] s = [ 0xD800, 0xDC00 ];

If D's meant to be consistent, then I should also be able to do

wchar[] s = "\uD800\uDC00";

or even

dchar[] s = "\uD800\uDC00";

or

char[] s = "\uD800\uDC00";

or the same with the 'u' replaced by some other letter defined with these semantics.

<snip>
- invent another escape to represent a UTF-16 fragment, for the sake of 
completeness.

I'm not sure you need an escape sequence. Why not just replace

#    wchar c = '\uD800';

with

#    wchar c = 0xD800;

Because that doesn't enable whole string literals to be done like this. And because it would be inconsistent to allow string literals to be encoded in anything except UTF-16.
 What requirement is there to hand-code UTF-8 /inside a string literal/? If you
 get it right, then you might just as well have used \U and the codepoint; if
you
 get it wrong, you're buggered. 

The same as there was when the subject was first brought up. Stewart.
Oct 04 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjr5no$28em$1 digitaldaemon.com>, Stewart Gordon says...

 What requirement is there to hand-code UTF-8 /inside a string literal/? If you
 get it right, then you might just as well have used \U and the codepoint; if
you
 get it wrong, you're buggered. 

The same as there was when the subject was first brought up.

Which is what, exactly? Why would anyone need to hand-code UTF-8 inside a string literal? Or UTF-16 for that matter? I can't think of any circumstance in which "\x##\x##\x##\x##" would be preferable to "\U00######". Can you?

Jill
Oct 04 2004
parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
 In article <cjr5no$28em$1 digitaldaemon.com>, Stewart Gordon says...

 The same as there was when the subject was first brought up.

Which is what, exactly? Why would anyone need to hand-code UTF-8 inside a string literal? Or UTF-16 for that matter?

Sorry. Maybe I got mixed up and was thinking of the talk-to-death of initialising a char[] with arbitrary byte values.
 I can't think of any circumstance in which "\x##\x##\x##\x##" would be
 preferable to "\U00######". Can you?

Maybe it isn't in general. But I suppose if you're doing file I/O and want your code to be self-documenting to the extent of indicating how many actual bytes are being transferred....

Stewart.
Oct 04 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
This is a good summary of the situation. I don't know what the right answer
is yet, so it's a good topic for discussion. One issue not mentioned yet is
the (not that unusual) practice of stuffing arbitrary data into a string
with \x. For example, in C one might create a length prefixed 'pascal style'
string with:

    unsigned char* p = "\x03abc";

where the \x03 is not character data, but length data. I'm hesitant to say
"you cannot do that in D". I also want to support things like:

    byte[] a = "\x05\x89abc\xFF";

where clearly a bunch of binary data is desired.
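The pascal-style layout is a pure byte-level convention, independent of any text encoding. Sketched in Python, with a bytes object standing in for the proposed byte[] literal:

```python
# First byte is the length; the remaining bytes are the payload.
p = b"\x03abc"
assert p[0] == 3                  # length prefix, not character data
assert p[1:1 + p[0]] == b"abc"    # the three payload bytes
```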

One possibility is to use prefixes on the string literals:

    "..."    // Takes its type from the context, only ascii \x allowed
    c"..."    // char[] string literal, only ascii \x allowed
    w"..."   // wchar[] string literal, \x not allowed
    d"..."    // dchar[] string literal, \x not allowed
    b"..."    // byte[] binary data string literal, \x allowed, \u and \U
not allowed
Oct 01 2004
next sibling parent David L. Davis <SpottedTiger yahoo.com> writes:
In article <cjk0rv$oqo$1 digitaldaemon.com>, Walter says...
This is a good summary of the situation. I don't know what the right answer
is yet, so it's a good topic for discussion. One issue not mentioned yet is
the (not that unusual) practice of stuffing arbitrary data into a string
with \x. For example, in C one might create a length prefixed 'pascal style'
string with:

    unsigned char* p = "\x03abc";

where the \x03 is not character data, but length data. I'm hesitant to say
"you cannot do that in D". I also want to support things like:

    byte[] a = "\x05\x89abc\xFF";

where clearly a bunch of binary data is desired.

One possibility is to use prefixes on the string literals:

    "..."    // Takes its type from the context, only ascii \x allowed
    c"..."    // char[] string literal, only ascii \x allowed
    w"..."   // wchar[] string literal, \x not allowed
    d"..."    // dchar[] string literal, \x not allowed
    b"..."    // byte[] binary data string literal, \x allowed, \u and \U
not allowed

(which should also make the ~ string concatenation operator behave correctly...so, no more need to cast every darn string literal), and the cases of when \x can and cannot be used for each. I'm actually holding my breath for this one to happen! ;)

David L.

P.S. Welcome Back!! Hope you got some much needed rest.
-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Oct 01 2004
prev sibling next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjk0rv$oqo$1 digitaldaemon.com>, Walter says...
This is a good summary of the situation. I don't know what the right answer
is yet, so it's a good topic for discussion. One issue not mentioned yet is
the (not that unusual) practice of stuffing arbitrary data into a string
with \x. For example, in C one might create a length prefixed 'pascal style'
string with:

    unsigned char* p = "\x03abc";

where the \x03 is not character data, but length data. I'm hesitant to say
"you cannot do that in D".

You've /already/ outlawed that in D, Walter - at least, if the length is greater than 127. Try compiling this:

#    char[] p = "\x81abcdefg...";
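The reason the compiler balks is that 0x81 on its own is not well-formed UTF-8 (it is a continuation byte with no lead byte), so any strict UTF-8 decoder rejects it. A Python illustration of the encoding rule:

```python
# A lone 0x81 cannot start a UTF-8 sequence, so decoding fails.
try:
    b"\x81abcdefg".decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
assert not valid
```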
I also want to support things like:

    byte[] a = "\x05\x89abc\xFF";

where clearly a bunch of binary data is desired.

And I. That's a good plan. But that's a byte[] literal you want there, not a char[] literal. How to tell them apart, that's the problem...?
One possibility is to use prefixes on the string literals:

    "..."    // Takes its type from the context, only ascii \x allowed
    c"..."    // char[] string literal, only ascii \x allowed
    w"..."   // wchar[] string literal, \x not allowed
    d"..."    // dchar[] string literal, \x not allowed
    b"..."    // byte[] binary data string literal, \x allowed, \u and \U
not allowed

Perfect! I'd support that wholeheartedly. It definitely seems the right way to go - except that I'd allow the following:

#              \x      \u     \U
#    ------------------------------------
#    "..."     ASCII   yes    yes
#    c"..."    ASCII   yes    yes
#    w"..."    no      yes    yes
#    d"..."    no      yes    yes
#    b"..."    yes     no     no

That is, if the prefix is omitted, Unicode escapes should still be allowed. Otherwise people will complain when

#    char[] euroSign = "\u20AC";

fails to compile.

Oh - and one other thing: for the benefit of novice and other newcomers to D, it would be nice if the error message which is output when something like "\xC0" fails to compile was more helpful: "invalid UTF-8" has turned out to be confusing; "invalid ASCII" would be even worse (as it would suggest that D only does ASCII). You need an error message which lets folk know that we can do Unicode, and that "\u" followed by the four-digit Unicode codepoint is most likely what they want.

Arcane Jill
Oct 01 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjkhnf$167v$1 digitaldaemon.com>, Arcane Jill says...

Here's my latest suggestion, only very slightly modified from the last one. The
essential difference is the behavior of the default ("...") string type.

For string literals with a prefix, the following rules should apply:

#    Literal   \x        \u      \U     type
#    -------------------------------------------
#    a"..."    ASCII     no      no     char[]
#    c"..."    ASCII     yes     yes    char[]
#    w"..."    no        yes     yes    wchar[]
#    d"..."    no        yes     yes    dchar[]
#    b"..."    yes       no      no     ubyte[]

(Note the addition of the a"..." type. We don't actually need this, but it makes
the table below make more sense).

For string literals /without/ a prefix, the rules should be determined by
context, as follows:

#    Context           Treat as
#    ---------------------------
#    char[]            c"..."
#    wchar[]           w"..."
#    dchar[]           d"..."
#    ubyte[]           b"..."
#    byte[]            b"..."
#    indeterminate     a"..."

Now there's only one remaining problem: should /both/ of the following lines
compile, or should one of them need a cast? (or perhaps put another way, should
byte[] and ubyte[] be mutually implicitly convertable?)

#    byte[] x = "\x05\x89abc\xFF";
#    ubyte[] y = "\x05\x89abc\xFF";

(by the context rules, these strings would be interpreted as b"\x05\x89abc\xFF")

Arcane Jill

PS. Of course, this sort of thing (defining consecutive bytes of mixed data) is
simple to do in assembler, so it ought to be simple to do in a
close-to-the-processor language like C:

#    db 5, 89h, "abc", 0FFh
Oct 03 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjqrt4$1vbg$1 digitaldaemon.com>, Arcane Jill says...

PS. Of course, this sort of thing (defining consecutive bytes of mixed data) is
simple to do in assembler, so it ought to be simple to do in a
close-to-the-processor language like C:

Er - make that D.
#    db 5, 89h, "abc", 0FFh

Oct 04 2004
prev sibling next sibling parent Burton Radons <burton-radons smocky.com> writes:
Walter wrote:
 where the \x03 is not character data, but length data. I'm hesitant to say
 "you cannot do that in D". I also want to support things like:
 
     byte[] a = "\x05\x89abc\xFF";
 
 where clearly a bunch of binary data is desired.

That code won't compile. x"" should be changed to produce ubyte[] instead. For the "meat is my potatoes" crew you can allow implicitly converting x"" literals to char[]; then they can embed random data within the string to their heart's content.
 One possibility is to use prefixes on the string literals:
 
     "..."    // Takes its type from the context, only ascii \x allowed
     c"..."    // char[] string literal, only ascii \x allowed
     w"..."   // wchar[] string literal, \x not allowed
     d"..."    // dchar[] string literal, \x not allowed
     b"..."    // byte[] binary data string literal, \x allowed, \u and \U
 not allowed

So the language should be made more complex and even more hostile to templating because... don't got an answer there. Just leave \x as is but have it encode. It'll be a stumbling point for transitionary users, but that's what tutorials are for.
Oct 01 2004
prev sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Walter wrote:

<snip>
 One possibility is to use prefixes on the string literals:
 
     "..."    // Takes its type from the context, only ascii \x allowed
     c"..."    // char[] string literal, only ascii \x allowed
     w"..."   // wchar[] string literal, \x not allowed
     d"..."    // dchar[] string literal, \x not allowed
     b"..."    // byte[] binary data string literal, \x allowed, \u and \U
 not allowed

I thought it was by design that string literals take their type from the
context, and that they could be initialised by UTF-8, UTF-16 or UTF-32
regardless of the destination type. As such, it would be a step backward to
restrict what can go in a "...". This leaves two things to do:

1. Stop string literals from jumping to the wrong conclusions.

At parse time, a StringLiteral is just a StringLiteral, right? It has no
specific type yet. On seeing a subexpression

     StringLiteral ~ StringLiteral

the compiler would concatenate the strings there and then, and the result
would still be a StringLiteral. Only when a specific type is required, e.g.

- assignment
- concatenation with a string of known type
- passing to a function or template
- an explicit cast

would the semantic analysis turn it into a char[], wchar[] or dchar[].

The spec would need to be clear about what happens if a given
function/template name is overloaded to take different string types, and of
course the type it would have in a variadic function.

Other things we would need to be careful about:

(a) whether an expression like

     "£10"[1..3]

should be allowed, and what it should do

(b) when we get array arithmetic, whether it should be allowed on strings.

String literals and strings of known type could be considered separately in
this debate. Of course, c"..." et al could still be invented, if only as
syntactic sugar for cast(char[]) "....".

2. Clean up the issue of whether arbitrary UTF-16 fragments are allowed by \u
escapes - get the compiler matching the spec or vice versa. If we don't allow
them, then if only for the sake of completeness, we should invent a new escape
to denote UTF-16 fragments.

Stewart.
Oct 04 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjr7ar$29j8$1 digitaldaemon.com>, Stewart Gordon says...
Walter wrote:

<snip>
 One possibility is to use prefixes on the string literals:
 
     "..."    // Takes its type from the context, only ascii \x allowed
     c"..."    // char[] string literal, only ascii \x allowed
     w"..."   // wchar[] string literal, \x not allowed
     d"..."    // dchar[] string literal, \x not allowed
     b"..."    // byte[] binary data string literal, \x allowed, \u and \U
 not allowed

I thought it was by design that string literals take their type from the context, and that they could be initialised by UTF-8, UTF-16 or UTF-32 regardless of the destination type.

Opinions differ on that viewpoint. Currently, strings /cannot/ be initialized with UTF-16, and I believe that behavior to be correct. (You believe it to be incorrect, I know. Like I said, opinions differ). I don't believe that allowing them to be initialised with UTF-8 is a good idea either. It's dumb. Regardless of the destination type!
As such, it would be a step 
backward to restrict what can go in a "...".

I posted an update this morning with better rules, which I think would keep everyone happy. Well - everyone apart from those who want to explicitly stick UTF-8 and UTF-16 into char[], wchar[] and dchar[] anyway.
This leaves two things to do:

1. Stop string literals from jumping to the wrong conclusions.

Prefixes do that nicely. In the absence of prefixes, you can often figure it out. But /sometimes/ you just can't. For instance: f("..."), where f() is overloaded.
The spec would need to be clear about what happens if a given 
function/template name is overloaded to take different string types, and 
of course the type it would have in a variadic function.

Yeah. The algorithm I posted this morning (which is almost the same as Walter's) covers that nicely.
Other thing we would need to be careful about:

(a) whether an expression like

     "£10"[1..3]

should be allowed, and what it should do

An interesting conundrum!
2. Clean up the issue of whether arbitrary UTF-16 fragments are allowed 
by \u escapes - get the compiler matching the spec or vice versa.  If we 
don't allow them, then if only for the sake of completeness, we should 
invent a new escape to denote UTF-16 fragments.

Disagree. I still think that encoding-by-hand inside a string literal is a
dumb idea - both for UTF-8 and for UTF-16. What on Earth is wrong with \u####
and \U00###### (where #### and ###### are just Unicode codepoints in hex)?
Walter suggests (and I agree with him) that \x should be for inserting
arbitrary binary data into binary strings, and ASCII characters into text
strings. I see no /point/ in defining \x to be UTF-8 ... unless of course you
want to enter an obfuscated D contest with code like this:

#    wchar c = '\xE2\x82\xAC';    // currently legal

instead of either of:

#    wchar c = '\u20AC';
#    wchar c = '€';

It's just crazy.

Jill
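For anyone who wants to check the equivalence Jill is describing, a few lines of Python (a purely illustrative, present-day aside; any UTF-8-aware tool would do) confirm that \xE2\x82\xAC is exactly the UTF-8 encoding of U+20AC:

```python
# U+20AC (the Euro sign) encodes to the three UTF-8 bytes E2 82 AC,
# which is why '\xE2\x82\xAC' and '\u20AC' denote the same character
# under the "\x is UTF-8" interpretation being argued about here.
euro = "\u20ac"
utf8_bytes = euro.encode("utf-8")
print(utf8_bytes.hex())  # e282ac

# Decoding those bytes recovers the single codepoint U+20AC.
assert utf8_bytes == b"\xe2\x82\xac"
assert utf8_bytes.decode("utf-8") == "\u20ac"
```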
Oct 04 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjrd3d$2cep$1 digitaldaemon.com>, Arcane Jill says...
In article <cjr7ar$29j8$1 digitaldaemon.com>, Stewart Gordon says...

Other thing we would need to be careful about:

(a) whether an expression like

     "£10"[1..3]

should be allowed, and what it should do

An interesting conundrum!

That got me thinking a lot. Finally, it occurred to me that if the type of
"..." cannot be determined by context then it should be a syntax error. What's
more, non-ASCII characters shouldn't be allowed in a string constant of
unknown type. So, in effect, I'm saying:

#    c"£10"[1..3];   // results in c"£1"
#    w"£10"[1..3];   // results in w"£10"

And the complete set of rules should now be:

For string literals with a prefix, the following rules should apply:

#    Literal   /x        /u      /U     non-ASCII   type
#    -------------------------------------------------------
#    c"..."    ASCII     yes     yes    yes         char[]
#    w"..."    no        yes     yes    yes         wchar[]
#    d"..."    no        yes     yes    yes         dchar[]
#    b"..."    yes       no      no     no          ubyte[]

Note the extra column, "non-ASCII". What this means is that statements like

#    ubyte[] x = b"€100";

must be forbidden, because '€' is a non-ASCII character, and its
representation is unspecified. Is it UTF-8 ("\xE2\x82\xAC")? Is it UTF-16BE
("\x20\xAC")? UTF-16LE ("\xAC\x20")? Is it WINDOWS-1252 ("\x80")? You see the
problem.

For string literals /without/ a prefix, the rules should be determined by
context, as follows:

#    Context           Treat as
#    -------------------------------------
#    char[]            c"..."
#    wchar[]           w"..."
#    dchar[]           d"..."
#    ubyte[]           b"..."
#    byte[]            b"..."
#    indeterminate     compile-time error

I believe that these rules would lead to a resolution of Stewart's conundrum.

#    "£10"[1..3];   // illegal - type of "..." not known

String literals will now be consistent, logical, and sufficiently powerful for
Walter's purposes. The distinction between byte[] and ubyte[] would then be
the only remaining problem.

Arcane Jill
Oct 04 2004
next sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

<snip>
 That got me thinking a lot. Finally, it occurred to be that if the type of
"..."
 cannot be determined by context then it should be a syntax error.

Making it a _syntax_ error while retaining CFG might be tricky, if not impossible. Perhaps better would be to make it an error at the semantic analysis level.
 What's more, non-ASCII characters shouldn't be allowed in a string constant of
unknown type.
 So, in effect, I'm saying:
 
 #    c"£10"[1..3];   // results in c"£1"
 #    w"£10"[1..3];   // results in w"£10"

You seem to've got your indexing mixed up. Surely it would be c"\xA31" and w"10"?
 And the complete set of rules should now be:
 
 For string literals with a prefix, the following rules should apply:
 
 #    Literal   /x        /u      /U     non-ASCII   type
 #    -------------------------------------------------------
 #    c"..."    ASCII     yes     yes    yes         char[]
 #    w"..."    no        yes     yes    yes         wchar[]
 #    d"..."    no        yes     yes    yes         dchar[]
 #    b"..."    yes       no      no     no          ubyte[]

Where did those forward slashes come from?
 Note the extra column, "non-ASCII". What this means is that statements like
 
 #    ubyte[] x = b"€100";
 
 must be forbidden, because '€' is a non-ASCII character, and its representation
 is unspecified. Is it UTF-8 ("\xE2\x82\xAC")? Is it UTF-16BE ("\x20\xAC")?
 UTF-16LE ("\xAC\x20")? Is it WINDOWS-1252 ("\x80")? You see the problem.

Indeed, if we had b"..." then this would be the case.
 For string literals /without/ a prefix, the rules should be determined by
 context, as follows:
 
 #    Context           Treat as
 #    -------------------------------------
 #    char[]            c"..."
 #    wchar[]           w"..."
 #    dchar[]           d"..."
 #    ubyte[]           b"..."
 #    byte[]            b"..."
 #    indeterminate     compile-time error
 
 I believe that these rules would lead to a resolution of Stewart's conundrum.

Depends on what you mean by "treat as". My thought is that it is simplest to keep translation of string escapes at the lexical level. Of course, the prefix (or lack thereof) would be carried forward to the SA.

This of course would retain the current 'anything goes' approach for unprefixed literals, but also retains the flexibility that some of us want or even need.

Of course, if the context is this very conundrum, then it would indeed be a compile-time error, at least if it contains anything non-ASCII at all.

Stewart.
Oct 05 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjtufp$2rjd$1 digitaldaemon.com>, Stewart Gordon says...
 #    c"£10"[1..3];   // results in c"£1"
 #    w"£10"[1..3];   // results in w"£10"

You seem to've got your indexing mixed up. Surely it would be c"\xA31" and w"10"?

Whoops. My brain was seeing [0..2] for some reason. Of course it should be:

#    c"£10"[1..3];   // c"\xA31"
#    w"£10"[1..3];   // array bounds exception
 #    Literal   /x        /u      /U     non-ASCII   type


Errr. Typo. Read as backslash.
Indeed, if we had b"..." then this would be the case.

That was Walter's idea, but I like it.
Depends on what you mean by "treat as".  My thought is that it is 
simplest to keep translation of string escapes at the lexical level.  Of 
course, the prefix (or lack thereof) would be carried forward to the SA. 

You're starting to lose me. Compiler front ends are not my strong point. But, by "treat as", I meant, allow the same escape sequences. I couldn't tell you whether or not this would be feasible for a D compiler, but it's a suggestion to which Walter knows the answer, so it didn't seem unreasonable to suggest it.
  This of course would retain the current 'anything goes' approach for 
unprefixed literals, but also retains the flexibility that some of us 
want or even need.

Need? You have yet to provide a convincing argument as to why you (or, indeed,
anyone) would /need/ the syntax:

#    wchar[] s = "\xE2\x82\xAC";

which doesn't even make sense (unless you DEFINE \x as meaning a UTF-8
fragment, but there's just no need for that). "€" tells you you've got the
Euro character; "\u20AC" tells you you've got the character U+20AC (which any
code chart will tell you is the Euro character). "\xE2\x82\xAC" tells you ...
what?

If you need to place arbitrary binary data into a string literal, then
Walter's idea makes the most sense. He proposes:

#    byte[] s = "\x80\x81\x82";    // byte[] tells you this is binary data
#    f( b"\x80\x81\x82" );         // b"..." tells you this is binary data

There is really no way to do this if \x MUST be UTF-8.

Combining two posts into one here...

In article <cjtt82$2qsb$1 digitaldaemon.com>, Stewart Gordon says...
If people choose to initialise a string with UTF-8 or UTF-16, then it 
automatically follows that they should know what they're doing.  What 
would we gain by stopping these people?

Defining rules for string literals doesn't stop anyone doing anything. Look at
it this way - I'm saying that the following line would be outlawed:

#    char[] s = "\xE2\x82\xAC";    // illegal under proposed new rules

but of course we still allow this:

#    char[] s = "\u20AC";          // perfectly ok

but there would still be nothing - absolutely nothing - to stop you from
initialising a string with UTF-8, by doing any of:

#    byte[] s = "\xE2\x82\xAC";
#    char[] s = [ 0xE2, 0x82, 0xAC ];
#    char[] s = cast(char[]) b"\xE2\x82\xAC";

if you really wanted to. It seems to me to be the best of both worlds. People
who really know what they're doing and who /insist/ on hand-coding their UTF-8
will still be allowed to do so, at the cost of a slightly more involved
syntax, but everyone else will be protected from silly mistakes. And I speak
as someone who /can/ hand-code UTF-8, and /still/ wants to be protected from
doing it accidentally.
But they take away the convenience Walter has gone to the trouble to 
create, of which this is just one example:

http://www.digitalmars.com/d/ctod.html#ascii

You mean this?
The D Way
The type of a string is determined by semantic analysis,
so there is no need to wrap strings in a macro call:

    char[] foo_ascii = "hello";        // string is taken to be ascii 
    wchar[] foo_wchar = "hello";       // string is taken to be wchar 

Under the proposed new rules, that example would still work exactly as documented. Everything would still work, Stewart - everything that you would ever (sensibly) want to do - PLUS you'd get arbitrary binary strings thrown in as a bonus - AND you could still hand-code whatever you wanted (with the inherent risk of getting it wrong, obviously) with a small amount of extra typing. It's a no-lose situation.
What on Earth is wrong with \u#### and
 \U00###### (where #### and ###### are just Unicode codepoints in hex)?

Nothing - these are perfectly valid at the moment, and remain perfectly valid whether \u is interpreted as codepoints or UTF-16 fragments.

"\uD800" won't compile, so I'd hardly call it "perfectly valid at the moment".
Under the proposed new rules, however, you /would/ be able to do any of the
following:

#    wchar c = 0xD800;                         // UTF-16
#    wchar[] s = cast(wchar[]) b"\x00\xD8";    // UTF-16LE
#    wchar[] s = cast(wchar[]) b"\xD8\x00";    // UTF-16BE

or even

#    wchar[] s = "string with a " ~ cast(wchar)0xD800 ~ " in it";

I don't see a problem with this. If you want to create invalid strings, you
should not be surprised if you need a bit of casting.
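The reason "\uD800" won't compile is that U+D800 is a lone UTF-16 high surrogate, not a complete character. As a present-day illustration (Python, purely for demonstration), well-formed encoders reject it unless you explicitly opt in, which mirrors the explicit casts proposed above:

```python
# U+D800 is a high surrogate: a UTF-16 code unit, not a real character.
# A conforming encoder refuses to emit it on its own...
lone = "\ud800"
try:
    lone.encode("utf-16-le")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised  # lone surrogates are rejected by default

# ...unless you explicitly say "I know what I'm doing"
# (the moral equivalent of Jill's cast(wchar[]) b"\x00\xD8"):
assert lone.encode("utf-16-le", "surrogatepass") == b"\x00\xd8"
```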
 ... unless of course you want to enter an
 obfuscated D contest with code like this:
 
 #    wchar c = '\xE2\x82\xAC';    // currently legal
 
 instead of either of:
 
 #    wchar c = '\u20AC';
 #    wchar c = '';
 
 It's just crazy.

Maybe. But there's method in some people's madness.

But is there sufficient method in:

#    wchar c = '\xE2\x82\xAC';

(which leaves c containing the value 0x20AC) to justify its current status as
legal D? I mean, what methodic madness makes the above line better than any
of:

#    wchar c = '\u20AC';
#    wchar c = 0x20AC;
#    wchar c = '€';

..really?

Providing that D can do a reasonable job of auto-detecting the type of "..."
(and '...') in the most common circumstances (which it can), the rest just
becomes logical. I mean - we have an /opportunity/ here - an opportunity to
make D better; to make strings do the /obvious/, /intuitive/ thing, to allow
arbitrary binary strings (which we don't have currently), and all this
/without/ preventing programmers from doing "under the hood" stuff if they
really want to. Let us seize this opportunity while we can. It's not often
that such an opportunity arises that /actually has Walter's backing/ (and was
actually Walter's idea). We should be uniting on this one, not arguing with
each other (although it is good that differing arguments get thrashed out).

Arcane Jill
Oct 05 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

<snip>
 Whoops. My brain was seeing [0..2] for some reason. Of course it should be:
 
 #    c"10"[1..3];    // c"\xA31"
 #    w"10"[1..3];    // array bounds exception

Wrong again.  "£10" is three wchars: '£', '1', '0'.  w"10" is correct.

<snip>
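Stewart's correction is easy to verify mechanically: in UTF-8 the pound sign is the two bytes C2 A3, so a byte-level [1..3] slice of "£10" straddles the character, while the same slice over three UTF-16-style code units does not. A quick Python sketch (illustrative only):

```python
s = "\u00a310"  # the string "£10"

# As UTF-8 bytes: C2 A3 31 30 -- four bytes for three characters.
utf8 = s.encode("utf-8")
assert utf8 == b"\xc2\xa310"
# Slicing bytes [1..3] cuts the pound sign in half: \xA3 followed by '1'.
assert utf8[1:3] == b"\xa31"

# As whole code points (like wchar[] for BMP-only text): three units.
assert len(s) == 3
assert s[1:3] == "10"  # so w"£10"[1..3] really is "10"
```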
Depends on what you mean by "treat as".  My thought is that it is 
simplest to keep translation of string escapes at the lexical level.  Of 
course, the prefix (or lack thereof) would be carried forward to the SA. 

You're starting to lose me. Compiler front ends are not my strong point. But, by "treat as", I meant, allow the same escape sequences. I couldn't tell you whether or not this would be feasible for a D compiler, but it's a suggestion to which Walter knows the answer, so it didn't seem unreasonable to suggest it.

It would somewhat complicate the lexical analysis with little or no real benefit, which is one of the reasons I suggest allowing \x, \u or \U equally in unprefixed literals.
 This of course would retain the current 'anything goes' approach for 
 unprefixed literals, but also retains the flexibility that some of us 
 want or even need.

Need?

As I said before, need to be able to interface foreign APIs relying on non-Unicode character sets/encodings. <snip>
 if you really wanted to. It seems to me to be the best of both worlds. People
 who really know what they're doing and who /insist/ on hand-coding their UTF-8
 will still be allowed to do so, at the cost of a slightly more involved syntax,
 but everyone else will be protected from silly mistakes. And I speak as someone
 who /can/ hand-code UTF-8, and /still/ wants to be protected from doing it
 accidently.

What are the ill consequences of hand-coding UTF-8, from which one would need to be protected?
But they take away the convenience Walter has gone to the trouble to 
create, of which this is just one example:

http://www.digitalmars.com/d/ctod.html#ascii

You mean this?
The D Way
The type of a string is determined by semantic analysis,
so there is no need to wrap strings in a macro call:

   char[] foo_ascii = "hello";        // string is taken to be ascii 
   wchar[] foo_wchar = "hello";       // string is taken to be wchar 


That's indeed the only bit of D code in that section of the page. <snip>
What on Earth is wrong with \u#### and
\U00###### (where #### and ###### are just Unicode codepoints in hex)?

Nothing - these are perfectly valid at the moment, and remain perfectly valid whether \u is interpreted as codepoints or UTF-16 fragments.

"\uD800" won't compile, so I'd hardly call it "perfectly valid at the moment".

I meant that, _if_ #### happens to be Unicode codepoint, as you were saying, \u#### would be equally legal whether \u is defined to take a Unicode codepoint or a UTF-16 fragment. <snip>
 But is there sufficient method in:
 
 #    wchar c = '\xE2\x82\xAC';
 
 (which leaves c containing the value 0x20AC) to justify its current status as
 legal D. I mean, what methodic madness makes the above line better than any of:
 
 #    wchar c = '\u20AC';
 #    wchar c = 0x20AC;
 #    wchar c = '';
 
 ..really?

Using DMD as a UTF conversion tool, perhaps? :-) Maybe someone can come up with a variety of uses for this.
 Providing that D can do a reasonable job of auto-detecting the type of "..."
 (and '...') in the most common circumstances (which it can), the rest just
 becomes logical. I mean - we have an /opportunity/ here - an opportunity to
make
 D better; to make strings to the /obvious/, /intuitive/ thing, to allow
 arbitrary binary strings (which we don't have currently), and all this
/without/
 preventing programmers from doing "under the hood" stuff if they really want
to.
 Let us seize this opportunity while we can. It's not often that such an
 opportunity arises that /actually has Walter's backing/ (and was actually
 Walter's idea). We should be uniting on this one not arguing with each other
 (although it is good that differing arguments get thrashed out).

Maybe you're right. The trouble is that you have some points I don't really
agree with, like that writing strings in UTF-8 or UTF-16 should be illegal.

And some of my thoughts are:

- it should remain straightforward to interface legacy APIs, whatever
character sets/encodings they may rely on
- the implementation should remain simple, complete with context-free lexical
analysis

But at least we seem to agree that:

- string literals should remain specifiable in terms of actual Unicode
codepoints
- prefixed string literals might come in useful one day as an alternative to
unprefixed ones

Maybe we can come to a best of all three worlds.

Stewart.
Oct 05 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjuhhv$bn7$1 digitaldaemon.com>, Stewart Gordon says...

Wrong again.  "£10" is three wchars: '£', '1', '0'.  w"10" is correct.

<Hangs head in shame> What can I say? One of these days I must learn to count!
As I said before, need to be able to interface foreign APIs relying on 
non-Unicode character sets/encodings.

I think you must be misunderstanding me. Fair enough - maybe I'm not that good
at explaining things. Okay, let's say you have a foreign API whose signature
is something like:

#    extern(C) char * strstr(char *s1, char *s2);

Hopefully, we're all familiar with that one. Now, the following would all be
legal under both Jill's rules and Stewart's rules:

#    strstr("hello", "e");
#    strstr("€100", "€");
#    strstr("€100", "\u20AC");

The following would be legal under Stewart's rules, but illegal under Jill's:

#    strstr("€100", "\xE2\x82\xAC");   // note that this will /succeed/
#                                      // - that is, return non-null

However, even under Jill's rules, you could still do:

#    strstr("€100", cast(char*) b"\xE2\x82\xAC");

if you really wanted to insist on stuffing encoding details into a literal.

But you said "relying on non-Unicode character sets/encodings" - so let's look
at that now. Let's say we're running on a WINDOWS-1252 machine, in which
character set the character '€' has codepoint 0x80. Now you'd have to write:

#    strstr("\x80100", "\x80");                              // Stewart's rules
#    strstr(cast(char*) b"\x80100", cast(char*) b"\x80");    // Jill's rules

Now here, of course, by use of those casts, you are explicitly telling the
compiler "I know what I'm doing" -- which I don't think is unreasonable given
that you are now hand-coding WINDOWS-1252.

Of course, I've been assuming here that the type of b"..." would be byte[] or
ubyte[]. But it's equally possible that Walter might decide that the type of
b"..." is actually to be char[] -- precisely so that it /can/ interface easily
with foreign APIs, in which case we'd end up with:

#    strstr("\x80100", "\x80");      // Stewart's rules
#    strstr(b"\x80100", b"\x80");    // Jill's rules
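The ambiguity Jill keeps pointing at can be made concrete: the same one-character string '€' is a different byte sequence under each encoding mentioned in this thread. A short Python check (illustrative only):

```python
euro = "\u20ac"  # the Euro sign

# One character, four different byte representations:
assert euro.encode("utf-8") == b"\xe2\x82\xac"    # UTF-8
assert euro.encode("utf-16-be") == b"\x20\xac"    # UTF-16BE
assert euro.encode("utf-16-le") == b"\xac\x20"    # UTF-16LE
assert euro.encode("cp1252") == b"\x80"           # WINDOWS-1252

# So a bare b"€100" has no single obvious byte-level meaning -- which
# is exactly why the proposal forbids non-ASCII in b"..." literals.
```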
What are the ill consequences of hand-coding UTF-8, from which one would 
need to be protected?

POSSIBILITY ONE:

Suppose that a user accustomed to WINDOWS-1252 wanted to hand-code a string
containing a multiplication sign ('×') immediately followed by a Euro sign
('€'). Such a user might mistakenly type

#    char[] s = "\xD7\x80";

since 0xD7 is the WINDOWS-1252 codepoint for '×', and 0x80 is the
WINDOWS-1252 codepoint for '€'. This /does/ compile under present rules - but
results in s containing the single Unicode character U+05C0 (Hebrew
punctuation PASEQ). This is not what the user was expecting, and results
entirely from the system trying to interpret \x as UTF-8 when it wasn't. If
the user had been required to instead type

#    char[] s = "\u00D7\u20AC";

then they would have been protected from that error.

POSSIBILITY TWO:

Suppose a user decides to hand-code UTF-8 on purpose /and gets it wrong/. As
in:

#    char[] s = "\xE2\x82\x8C";  // whoops - should be "\xE2\x82\xAC"

who's going to notice? The compiler? Not in this case. Again, if the user had
been required instead to type:

#    char[] s = "\u20AC";

then they would have been protected from that error.
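Both failure modes are checkable: the WINDOWS-1252 bytes \xD7\x80 happen to be well-formed UTF-8 for U+05C0, and the mistyped \xE2\x82\x8C is itself valid UTF-8 for a completely different character (U+208C), which is exactly why no compiler would catch it. In Python terms (purely illustrative):

```python
# POSSIBILITY ONE: CP1252 bytes silently misread as UTF-8.
raw = b"\xd7\x80"
assert raw.decode("cp1252") == "\u00d7\u20ac"  # what the user meant: '×€'
assert raw.decode("utf-8") == "\u05c0"         # what UTF-8 makes of it: PASEQ

# POSSIBILITY TWO: a typo that is still well-formed UTF-8.
typo = b"\xe2\x82\x8c"
assert typo.decode("utf-8") == "\u208c"        # SUBSCRIPT EQUALS SIGN, not '€'
```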
 Let us seize this opportunity while we can. It's not often that such an
 opportunity arises that /actually has Walter's backing/ (and was actually
 Walter's idea). We should be uniting on this one not arguing with each other
 (although it is good that differing arguments get thrashed out).

Maybe you're right. The trouble is that you have some points I don't really agree with, like that writing strings in UTF-8 or UTF-16 should be illegal.

Not illegal - I'm only asking for a b prefix. As in b"...".
And some of my thoughts are:

- it should remain straightforward to interface legacy APIs, whatever 
character sets/encodings they may rely on

Yes, absolutely. That would be one use of b"...".
Maybe we can come to a best of all three worlds.

Of course we can. We're good. I'm not "making a stand" or "taking a position". I'm open to persuasion. I can be persuaded to change my mind. Assuming that's also true of you, it should just be a matter of logicking out all the pros and cons, and then letting Walter be the judge. Arcane Jill
Oct 06 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
<snip>
 POSSIBILITY ONE:
 
 Suppose that a user accustomed to WINDOWS-1252 wanted to hand-code a 
 string containing a multiplication sign ('×') immediately followed by 
 a Euro sign ('€'). Such a user might mistakenly type
 
 #    char[] s = "\xD7\x80";
 
 since 0xD7 is the WINDOWS-1252 codepoint for '×', and 0x80 is the 
 WINDOWS-1252 codepoint for '€'. This /does/ compile under present 
 rules - but results in s containing the single Unicode character 
 U+05C0 (Hebrew punctuation PASEQ). This is not what the user was 
 expecting, and results entirely from the system trying to interpret 
 \x as UTF-8 when it wasn't. If the user had been required to instead 
 type
 
 #    char[] s = "\u00D7\u20AC";
 
 then they would have been protected from that error.

I see....
 POSSIBILITY TWO:
 
 Suppose a user decides to hand-code UTF-8 on purpose /and gets it 
 wrong/. As in:
 
 #    char[] s = "\xE2\x82\x8C";  // whoops - should be "\xE2\x82\xAC"
 
 
 who's going to notice? The compiler? Not in this case. Again, if the 
 user had been required instead to type:
 
 #    char[] s = "\u20AC";
 
 then they would have been protected from that error.

How are these less typo-prone? Even if I did that, how would it know that I
didn't mean to type

     char[] s = "\u208C";

? As I started to say, people who choose to hand-code UTF-8, any other
encoding or even uncoded codepoints should know what they're doing and that
they have to be careful.

<snip>
 Maybe you're right.  The trouble is that you have some points I 
 don't really agree with, like that writing strings in UTF-8 or 
 UTF-16 should be illegal.

Not illegal - I'm only asking for a b prefix. As in b"...".
 And some of my thoughts are:
 
 - it should remain straightforward to interface legacy APIs, 
 whatever character sets/encodings they may rely on

Yes, absolutely. That would be one use of b"...".

I see....
 Maybe we can come to a best of all three worlds.

Of course we can. We're good. I'm not "making a stand" or "taking a position". I'm open to persuasion. I can be persuaded to change my mind. Assuming that's also true of you, it should just be a matter of logicking out all the pros and cons, and then letting Walter be the judge.

Good idea. Stewart.
Oct 06 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ck0gia$239i$1 digitaldaemon.com>, Stewart Gordon says...

 POSSIBILITY TWO:
 
 Suppose a user decides to hand-code UTF-8 on purpose /and gets it 
 wrong/. As in:
 
 #    char[] s = "\xE2\x82\x8C";  // whoops - should be "\xE2\x82\xAC"
 
 
 who's going to notice? The compiler? Not in this case. Again, if the 
 user had been required instead to type:
 
 #    char[] s = "\u20AC";
 
 then they would have been protected from that error.

How are these less typo-prone? Even if I did that, how would it know that I didn't mean to type char[] s = "\u208C";

True enough. But I guess, what I was trying to say is that \u20AC is something that any /human/ can look up in Unicode code charts. By contrast, you are unlikely to find a convenient lookup table anywhere on the web which will let you look up \xE2\x82\xAC. So the \u version is just more maintainable, and the \x version more obfuscated. I suppose I'm saying that if a project is maintained by more than one person, an error involving \u is more likely to be spotted by someone else in the team than an error involving a sequence of \x's. Maybe that's just subjective. Maybe I'm wrong. I dunno any more.
?  As I started to say, people who choose to hand-code UTF-8, any other 
encoding or even uncoded codepoints should know what they're doing and 
that they have to be careful.

Well obviously we both agree on that one. I think we only disagree on whether or not you should need a b before the "...". Jill
Oct 06 2004
prev sibling parent reply larrycowan <larrycowan_member pathlink.com> writes:
In article <cjtgk8$2h40$1 digitaldaemon.com>, Arcane Jill says...
...
String literals will now be consistent, logical, and sufficiently powerful for
Walter's purposes. The distinction between byte[] and ubyte[] would then be the
only remaining problem.

Arcane Jill

Why is there any problem here? The content is not different. ubyte and byte
are distinguished only when used singly (or grouped) as binary for an
arithmetic evaluation. The only difference I can see is for something like:

ubyte what = -123;  // valid

and

byte what = -123; // invalid, would be = 133 to same content

but that is a different type of initialization.
Oct 05 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjur82$lpv$1 digitaldaemon.com>, larrycowan says...

Why is there any problem here?

It's possible that there isn't. I don't know enough about how the compiler can sort these things out.
The content is not different.

Well, there is a sense in which the content is different. For example:

# byte[] x = "\x80\x90";
# ubyte[] y = "\x80\x90";
# assert( x[1] < 0 );
# assert( y[1] >= 0 );
# assert( x[1] != y[1] );

I guess I incline to the opinion that only ubyte[] should be used for binary string literals, but maybe that's not the only answer.
ubyte and
byte are distinguished only when used singly (or grouped) as binary for
an arithmetic evaluation.  The only difference I can see is for something
like:

ubyte what = 133;  // valid
and
byte what = 133; // invalid, would be = -123 for the same content

but that is a different type of initialization.

You mean assignment. Assignment can happen at times other than just initialization. But I was more getting at stuff like this:

# byte[] x = "\x80\x90";
# ubyte[] y = x;    // A lossless conversion? Should this compile?

or

# byte[] x = "\x80\x90";
# ubyte[] y = "\x80\x90";
# if (x == y)       // Again, should this compile?
#                   // And, if so, should it evaluate to true or false?

or

# void f(byte[] s) { /*stuff*/ }
# void f(ubyte[] s) { /*stuff*/ }
# f( b"\x80\xC0" ); // which f gets called?

Jill
Oct 05 2004
prev sibling parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
<snip>
 Currently, strings /cannot/ be initialized with UTF-16, and I believe that
 behavior to be correct. (You believe it to be incorrect, I know. Like I said,
 opinions differ).
 
 I don't believe that allowing them to be initialised with UTF-8 is a good idea
 either. It's dumb.
 
 Regardless of the destination type! 

If people choose to initialise a string with UTF-8 or UTF-16, then it automatically follows that they should know what they're doing. What would we gain by stopping these people? <snip>
 I posted an update this morning with better rules, which I think would keep
 everyone happy. Well - everyone apart from those who want to explicitly stick
 UTF-8 and UTF-16 into char[], wchar[] and dchar[] anyway.

Hmm......
This leaves two things to do:

1. Stop string literals from jumping to the wrong conclusions.

Prefixes do that nicely. In the absence of prefixes, you can often figure it out. But /sometimes/ you just can't. For instance: f("..."), where f() is overloaded.

But they take away the convenience Walter has gone to the trouble to create, of which this is just one example: http://www.digitalmars.com/d/ctod.html#ascii <snip>
2. Clean up the issue of whether arbitrary UTF-16 fragments are allowed 
by \u escapes - get the compiler matching the spec or vice versa.  If we 
don't allow them, then if only for the sake of completeness, we should 
invent a new escape to denote UTF-16 fragments.

Disagree. I still think that encoding-by-hand inside a string literal is a dumb idea - both for UTF-8 and for UTF-16. What on Earth is wrong with \u#### and \U00###### (where #### and ###### are just Unicode codepoints in hex)?

Nothing - these are perfectly valid at the moment, and remain perfectly valid whether \u is interpreted as codepoints or UTF-16 fragments.
 Walter suggests (and I agree with him) that \x should be for inserting
arbitrary
 binary data into binary strings, and ASCII characters into text strings. I see
 no /point/ in defining \x to be UTF-8 ... unless of course you want to enter an
 obfuscated D contest with code like this:
 
 #    wchar c = '\xE2\x82\xAC';    // currently legal
 
 instead of either of:
 
 #    wchar c = '\u20AC';
 #    wchar c = '€';
 
 It's just crazy.

Maybe. But there's method in some people's madness. Stewart.
Oct 05 2004
prev sibling parent reply Benjamin Herr <ben 0x539.de> writes:
Stewart Gordon wrote:
 I thought it was by design that string literals take their type from the 
 context

Eh, if we are that far, what prevents resolving overloaded function calls based on their return type? Then we could even have a sensible opCast. Sorry to hijack the thread, but I never quite realised this feature. -ben
Oct 04 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjrq9c$uea$1 digitaldaemon.com>, Benjamin Herr says...
Stewart Gordon wrote:
 I thought it was by design that string literals take their type from the 
 context

Eh, if we are that far, what prevents resolving overloaded function calls based on their return type? Then we could even have a sensible opCast. Sorry to hijack the thread, but I never quite realised this feature. -ben

As I understand it, the context-detection for string literals is very, very crude - basically limited to assignment statements of the form:

# T s = "literal";

although it ought to be fairly easily extended to:

# T s = /*stuff*/ ~ "literal" ~ /*stuff*/;

It's not a full context analysis - and even if it were, it would be an /exception/ to D's overall context-free grammar, not the rule. The language may be able to withstand one or two very simple exceptions, but that is a far cry from generically being able to tell the context of everything in all circumstances. The latter would (I suspect) require a complete language rewrite.

Arcane Jill

PS. I want a sensible opCast() too - but that's for another thread.
Oct 05 2004
parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

 In article <cjrq9c$uea$1 digitaldaemon.com>, Benjamin Herr says...
 
Stewart Gordon wrote:

I thought it was by design that string literals take their type from the 
context

Eh, if we are that far, what prevents resolving overloaded function calls based on their return type? Then we could even have a sensible opCast. Sorry to hijack the thread, but I never quite realised this feature.


There would be a lot more cases to deal with, it would probably be complex to implement, and could get utterly confusing to determine which path the types are going through in a complex expression. Simple type resolution of string literals is, OTOH, a relatively simple feature.
-ben

 As I understand it, the context-detection for string literals is very, very crude - basically limited to assignment statements of the form:
 # T s = "literal";
 although it ought to be fairly easily extended to:
 # T s = /*stuff*/ ~ "literal" ~ /*stuff*/;
 It's not a full context analysis - and even if it were, it would be an /exception/ to D's overall context-free grammar, not the rule.

How would it destroy CFG? Type resolution and expression simplification are part of semantic analysis. The mechanism would be simple and unambiguous to this extent:

literal ~ literal => literal (concatenated at compile time)
char[]  ~ literal => char[]
wchar[] ~ literal => wchar[]
dchar[] ~ literal => dchar[]
literal ~ char[]  => char[]
literal ~ wchar[] => wchar[]
literal ~ dchar[] => dchar[]
char[]  ~ wchar[] => invalid?
char[]  = literal => char[]
wchar[] = literal => wchar[]
dchar[] = literal => dchar[]

Stewart.
Oct 05 2004