
digitalmars.D - What to do about \x?

reply Arcane Jill <Arcane_member pathlink.com> writes:
The use of the escape sequence "\x" in string and character literals causes a
lot of confusion in D. 

(1) Most people on this thread use either WINDOWS-1252 or LATIN-1, and,
consequently, expect "\xE9" to emit a lowercase 'e' with an acute accent. It
does not.

(2) In a recent thread on this forum, novice (who is Russian) expected "\xC0" to
emit the Cyrillic letter capital A (because that's what happens in C++ on their
WINDOWS-1251 machine). It does not.

(3) Over in the bugs forum, it is being discussed and discovered that the DMD
compiler interprets "\x" as UTF-8 /even in wchar strings/ (which one might have
expected to be UTF-16). This is clearly nonsense.

So maybe it's time to clear up the confusion once and for all. What /should/
"\x" do? As I see it, these are the options:

(1) \x should be interpreted in the user's default encoding
(2) \x should be interpreted in the source-code encoding
(3) \x should be interpreted according to the destination type of the literal
(4) \x should be interpreted as UTF-8
(5) \x should be interpreted as Latin-1
(6) \x should be interpreted as ASCII
(7) \x should be deprecated

Option (1) is what C++ does. However - it makes code non-portable across
encodings: a source file created by one user will not necessarily compile
correctly for another.

Option (2) is more restricted, since the source file encoding must be one of
UTF-8, UTF-16 or UTF-32. I still don't like it though - the compiler shouldn't
behave differently just because the source file is saved differently.

Option (3) is what I (incorrectly) assumed \x would do. However, D has a
context-free grammar, and parses string literals /before/ it knows what kind of
thing it's assigning. (Exactly how it manages to make

#    wchar[] s = "hello";

do the right thing is beyond me, but whether or not this contextual typing could
be extended to include the interpretation of \x is something only Walter could
answer). Anyway, this would still be strange behaviour, from the point of view
of C++ programmers.

Option (4) is the status quo. It confuses everybody.

Option (5) is biased toward folk in the Western world. There is some
justification for it, however, since Latin-1 is a subset of Unicode, having
precisely the same codepoint-to-character mapping. If we went for this, then
"\x##" would always be the same thing as "\u00##" or "\U000000##". I /think/
(though I'm not certain) that this is what Java does.
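Option (5)'s premise - that Latin-1's codepoint-to-character mapping coincides with the first 256 Unicode codepoints - can be checked mechanically. A quick sketch in Python rather than D, since this is a fact about the encodings themselves, not about DMD:

```python
# Every Latin-1 byte 0x## decodes to the identically numbered
# Unicode codepoint U+00##.
for b in range(256):
    assert bytes([b]).decode("latin-1") == chr(b)

# So under option (5), "\xE9" would mean U+00E9
# (lowercase e with acute accent), matching "\u00E9".
assert bytes([0xE9]).decode("latin-1") == "\u00E9"
```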

Option (6) is not unreasonable. It would mean that "\x00" to "\x7F" would be the
only legal \x escape sequences - these are unambiguous. Sequences "\x80" to
"\xFF" would become compile-time errors. The error message should advise people
to use "\u" instead.

Option (7) is foolproof. Any use of "\x" becomes a compile-time error. The error
message should advise people to use "\u" instead.

What think you all?
Arcane Jill
Oct 01 2004
next sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
 The use of the escape sequence "\x" in string and character literals causes a
 lot of confusion in D. 
 
 (1) Most people on this thread use either WINDOWS-1252 or LATIN-1, and,
 consequently, expect "\xE9" to emit a lowercase 'e' with an acute accent. It
 does not.

It does _emit_ a lowercase 'e' with an acute accent, if its destination is anything in the Windows GUI and it was emitted using the A version of a Windows API function.
 (2) In a recent thread on this forum, novice (who is Russian) expected "\xC0"
to
 emit the Cyrillic letter capital A (because that's what happens in C++ on their
 WINDOWS-1251 machine). It does not.
 
 (3) Over in the bugs forum, it is being discussed and discovered that the DMD
 compiler interprets "\x" as UTF-8 /even in wchar strings/ (which one might have
 expected to be UTF-16). This is clearly nonsense.

In that case, how should it interpret \u or \U in a char string? \x is a UTF-8 fragment, \u is a UTF-16 fragment, \U is a UTF-32 fragment. Whatever rules we have for translating one into the other should be consistent. And the current behaviour satisfies that criterion nicely.

<snip>
 (Exactly how it manages to make
 
 #    wchar[] s = "hello";
 
 do the right thing is beyond me, but whether or not this contextual typing
could
 be extended to include the intepretation of \x is something only Walter could
 answer).

Just thinking about it, I guess that DMD uses an 8-bit internal representation during the tokenising and parsing phase, converting \u and \U codes (and of course literal characters in UTF-16 or UTF-32 source text) to their UTF-8 counterparts. During the semantic analysis, it then converts it back to UTF-16 or UTF-32 if it's assigning to a wchar[] or dchar[].

<snip>
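This hypothesised pipeline - hold the literal in UTF-8 while lexing, transcode to the destination UTF during semantic analysis - can be modelled in a few lines. A Python illustration of the idea only; DMD's actual internals are a guess here:

```python
# Stage 1 (lexer, hypothetically): the literal is stored as UTF-8 bytes.
stored = "\u010F".encode("utf-8")          # b"\xC4\x8F"

# Stage 2 (semantic analysis): re-encode for the destination type,
# e.g. a wchar[] (UTF-16) target.
as_wchar = stored.decode("utf-8").encode("utf-16-be")
assert as_wchar == b"\x01\x0F"             # one UTF-16 code unit, 0x010F
```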
 Option (4) is the status quo. It confuses everybody.

Well, I'm not confused ... yet.
 What think you all?

My vote goes to leaving \x as it is.

Stewart.
Oct 01 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjjcs7$9pf$1 digitaldaemon.com>, Stewart Gordon says...
Arcane Jill wrote:
 The use of the escape sequence "\x" in string and character literals causes a
 lot of confusion in D. 
 
 (1) Most people on this thread use either WINDOWS-1252 or LATIN-1, and,
 consequently, expect "\xE9" to emit a lowercase 'e' with an acute accent. It
 does not.

It does _emit_ a lowercase 'e' with an acute accent, if its destination is anything in the Windows GUI and it was emitted using the A version of a Windows API function.

Sorry, I didn't follow that. Can you give me a code example? I only meant that

#    char[] s = "\xE9";

does not leave s containing an e with an acute accent, which some people might expect.
In that case, how should it interpret \u or \U in a char string?

\u and \U are well defined universally. They do not depend on encoding.
\x is a UTF-8 fragment,

That's the status quo in D, yes.

 \u is a UTF-16 fragment, \U is a UTF-32 fragment.

Incorrect. \u is /not/ a UTF-16 fragment. Why did you assume that? Did you assume that because, in D, \x is a UTF-8 fragment? If so, we may take that as further evidence that the current implementation of \x causes confusion.

In fact, both \u and \U specify Unicode /characters/ (not UTF fragments). By definition, \u#### = \U0000#### = the Unicode character U+####. \u and \U are identical in all respects other than the number of hex digits expected to follow them. (You only need to resort to \U if more than four digits are present.) Thus,

#    wchar[] s = "\uD800\uDC00"; // error

(correctly) fails to compile, (correctly) requiring you instead to do:

#    wchar[] s = "\U00110000";

So you see, \x really _is_ the odd one out.
Whatever rules we have for translating one into the other 
should be consistent.

Such consistency would require option (5) from my original post, so that \x## = \u00## = \U000000## - in all cases regarded as a Unicode character, not a UTF fragment. (The primary argument /against/ this behaviour is that it effectively makes \x an ISO-8859-1 encoding, which could be considered to be Western bias.)
And the current behaviour satisfies that 
criterion nicely.

My point is that the current behavior of \x is /not/ consistent with \u or \U. It is also not consistent with the expectations of users used to C++ or Java behavior.
 Option (4) is the status quo. It confuses everybody.

Well, I'm not confused ... yet.

Well, I don't know about you, but the following confuses me:

#    wchar[] s = "\xC4\x8F";  // eh?
#    wchar[] t = "\u010F";
#    assert(s == t);          // yup - they're the same

Why should s - declared as a UTF-16 string - be able to accept UTF-8 literals, but not UTF-16 literals? Let's see that again with a different example:

#    dchar[] s = "\U00010000";        // Unicode - (correctly) compiles
#    dchar[] s = "\uD800\uDC00";      // UTF-16 - (correctly) fails to compile
#    dchar[] s = "\xF0\x90\x80\x80";  // UTF-8 - compiles

It's not consistent. (But \u and \U are implemented correctly).
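The byte sequences in these snippets are plain UTF facts, checkable in any Unicode-aware language. A Python sketch (Python, not D, since it verifies the encodings rather than DMD's behaviour):

```python
# U+010F encodes to C4 8F in UTF-8, which is why "\xC4\x8F"
# and "\u010F" end up as the same string.
assert "\u010F".encode("utf-8") == b"\xC4\x8F"

# U+10000 is the surrogate pair D800 DC00 in UTF-16 ...
assert "\U00010000".encode("utf-16-be") == b"\xD8\x00\xDC\x00"

# ... and F0 90 80 80 in UTF-8.
assert "\U00010000".encode("utf-8") == b"\xF0\x90\x80\x80"
```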
 What think you all?

Stewart.

For what it's worth, my vote goes to deprecating \x.

Arcane Jill
Oct 01 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjjgs7$bs2$1 digitaldaemon.com>, Arcane Jill says...

Erratum

#    wchar[] s = "\U00110000";

should read:

#    wchar[] s = "\U00010000";

Jill
Oct 01 2004
prev sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

 In article <cjjcs7$9pf$1 digitaldaemon.com>, Stewart Gordon says...

 It does _emit_ a lowercase 'e' with an acute accent, if its 
 destination is anything in the Windows GUI and it was emitted using 
 the A version of a Windows API function.

Sorry, I didn't follow that. Can you give me a code example?

char[] s = "\xE9";
SendMessageA(hWnd, WM_SETTEXT, 0, cast(LPARAM) cast(char*) s);

<snip>
 In fact, both \u and \U specify Unicode /characters/ (not UTF 
 fragments). By definition, \u#### = \U0000#### = the Unicode 
 character U+####. \u and \U are identical in all respects other than 
 the number of hex digits expected to follow them. (You only need to 
 resort to \U if more than four digits are present.) Thus, 
 
 #    wchar[] s = "\uD800\uDC00"; // error
 
 (correctly) fails to compile, (correctly) requiring you instead to 
 do:

I find in the spec:

lex.html
	\n			the linefeed character
	\t			the tab character
	\"			the double quote character
	\012			octal
	\x1A			hex
	\u1234			wchar character
	\U00101234		dchar character
	\r\n			carriage return, line feed

expression.html
"Character literals are single characters and resolve to one of type char, wchar, or dchar. If the literal is a \u escape sequence, it resolves to type wchar. If the literal is a \U escape sequence, it resolves to type dchar. Otherwise, it resolves to the type with the smallest size it will fit into."

What bit of the spec should I be reading instead?

<snip>
 Whatever rules we have for translating one into the other should be 
 consistent.

Such consistency would require option (5) from my original post, so that \x## = \u00## = \U000000## - in all cases regarded as a Unicode character, not a UTF fragment. (The primary argument /against/ this behaviour is that it effectively makes \x an ISO-8859-1 encoding, which could be considered to be Western bias.)

How does simply having \x, \u and \U represent UTF-8, UTF-16 and UTF-32 fragments respectively not achieve this consistency? <snip>
 Well, I don't know about you, but the following confuses me:
 
 #    wchar[] s = "\xC4\x8F";  // eh?
 #    wchar[] t = "\u010F";
 #    assert(s == t);          // yup -they're the same

I suppose it confused me before I realised what it meant.

The point, AIUI, is that all string literals are equal whether they are notated as UTF-8, UTF-16 or UTF-32, i.e. the lexer reduces all to the same thing. This is kind of consistent with the principle that the permitted source text encodings are all treated equally. This leaves the semantic analyser with only one kind of string literal to worry about, which it converts to the UTF of the target type.

Moreover, I imagine that string literals juxtaposed into one, or even the contents of a single " " pair, are allowed to mix the various escape notations. In this case, trying to label string literals at lex-time as UTF-8, UTF-16 or UTF-32 would be fruitless.
 Why should s - declared as a UTF-16 string - be able to accept UTF-8 
 literals, but not UTF-16 literals? Let's see that again with a 
 different example:

That would sound like a bug to me.

Stewart.
Oct 01 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjjo6a$g8o$1 digitaldaemon.com>, Stewart Gordon says...

I find in the spec:

lex.html
	\n			the linefeed character
	\t			the tab character
	\"			the double quote character
	\012			octal
	\x1A			hex
	\u1234			wchar character
	\U00101234		dchar character
	\r\n			carriage return, line feed

I would argue that the spec is wrong. \u and \U are Unicode things, and I distinctly remember a discussion on this very subject on the Unicode public forum a while back (though Walter could, if he were sufficiently perverse, give D a different definition). I suggest that the D spec /should/ read:

#    \u1234		Unicode character
#    \U00101234	Unicode character

This would be consistent with C++, C#, Java, the intent of the Unicode Consortium, and the actual current behavior of D.

Certainly, at present, the actual behavior of \u in D is different from the documentation, so either there is a documentation error, or else there is a bug. I'm inclined to the belief that it's a documentation error. I certainly hope so, because \u and \U should always be independent of encoding. No-one should have to learn UTF-16 to use \u. It should be legal to write:

#    wchar[] s = "\U00101234";

(which of course it currently is).
expression.html
"Character literals are single characters and resolve to one of type 
char, wchar, or dchar. If the literal is a \u escape sequence, it 
resolves to type wchar. If the literal is a \U escape sequence, it 
resolves to type dchar. Otherwise, it resolves to the type with the 
smallest size it will fit into."

Bugger! Again, that's not how Unicode is supposed to behave. The following should (and does) compile without complaint:

#    char c = '\U0000002A';

So again, D is behaving fine, but the documentation does not match reality. Documentation error or bug? I say it's a documentation error.
What bit of the spec should I be reading instead?

You read the D docs correctly. I was going by previous discussions on the Unicode public forum (from memory). Obviously, those discussions weren't specifically about D.
How does simply having \x, \u and \U represent UTF-8, UTF-16 and UTF-32 
fragments respectively not achieve this consistency?

It's just not how Unicode is supposed to behave. \u and \U are supposed to be Unicode characters. Nothing more. Nothing less. (And that of course is exactly what D has implemented).

I'm going to have a hard time backing that up - so please trust me on this one. If not, I'll have to go trawling through the Unicode archives, or passing this question on to the Consortium folk. It's a complicated question, because the \u thing isn't actually something the UC can define, but nonetheless the same definition is used by C++, Java, Python, C#, various internet RFCs, etc. etc. D would be out on a very dodgy limb here if it were to do things differently. (...and it's also what is implemented by D, so again, I claim it's the documentation which is in error).
The point, AIUI, is that all string literals are equal whether they are 
notated as UTF-8, UTF-16 or UTF-32, i.e. the lexer reduces all to the 
same thing.

But one should not have to learn /any/ UTF to encode a string. Why would anyone want that? All you should need to know to encode a character using an escape sequence is its codepoint. And that's what \u and \U are for.
This is kind of consistent with the principle that the 
permitted source text encodings are all treated equally.

Well, here at least we are in agreement. All supported encodings should be treated equally.
 Why should s - declared as a UTF-16 string - be able to accept UTF-8 
 literals, but not UTF-16 literals? Let's see that again with a 
 different example:

That would sound like a bug to me.

It's certainly an inconsistency - but it's the legality of \x, not the illegality of \u, about which I would complain. Arcane Jill
Oct 01 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
<snip>
 I would argue that the spec is wrong. \u and \U are Unicode things, 
 and I distinctly remember a discussion on this very subject on the 
 Unicode public forum a while back (though Walter could, if he were 
 sufficiently perverse, give D a different definition).

Which bit of The Unicode Standard should I read to find the meanings of \u and \U it sets in stone across every language ever invented? <snip>
 Certainly, at present, actual behavior of \u in D is different from 
 the documentation, so either there is a documentation error, or else 
 there is a bug. I'm inclined to the belief that it's a documentation 
 error. I certainly hope so, because \u and \U should always be 
 independent of encoding.

How would fixing the compiler to follow the spec create dependence on encoding?
 No-one should have to learn UTF-16 to use \u. It should be legal to 
 write:
 
 #    wchar[] s = "\U00101234";
 
 (which of course it currently is).

Agreed from the start. Nobody suggested that anyone should have to learn UTF-16. Nor that being _allowed_ to use UTF-16 and being _allowed_ to use UTF-32 (and hence actual codepoints) should be mutually exclusive. I thought that was half the spirit of having both \u and \U - to give the programmer the choice.
 expression.html
 "Character literals are single characters and resolve to one of 
 type char, wchar, or dchar. If the literal is a \u escape sequence, 
 it resolves to type wchar. If the literal is a \U escape sequence, 
 it resolves to type dchar. Otherwise, it resolves to the type with 
 the smallest size it will fit into."

Bugger! Again, that's not how Unicode is supposed to behave. The following should (and does) compile without complaint: # char c = '\U0000002A'; So again, D is behaving fine, but the documentation does not match reality. Documentation error or bug? I say it's a documentation error.

Ah, you have a point. It appears that character and string literals are translated by the same function, rather than character literals being labelled as char, wchar or dchar. So that could be either a doc error or a bug. Walter?

<snip>
 It's just not how Unicode is supposed to behave. \u and \U are 
 supposed to be Unicode characters. Nothing more. Nothing less. (And 
 that of course is exactly what D has implemented).
 
 I'm going to have a hard time backing that up - so please trust me on 
 this one. If not, I'll have to go trawling through the Unicode 
 archives, or passing this question on to the Consortium folk. It's a 
 complicated question, because the \u thing isn't actually something 
 the UC can define, but nonetheless the same definition is used by 
 C++, Java, Python, C#, various internet RFCs, etc. etc. D would be 
 out on a very dodgy limb here if it were to do things differently.

Well, at least what the D spec implies \u should mean is a superset of what you're implying is the 'correct' Unicode behaviour. <snip>
 The point, AIUI, is that all string literals are equal whether they 
 are notated as UTF-8, UTF-16 or UTF-32, i.e. the lexer reduces all 
 to the same thing.

But one should not have to learn /any/ UTF to encode a string. Why would anyone want that?

I can't see how that follows on from what I just said. But agreed.
 All you should need to know to encode a character using an escape 
 sequence is its codepoint. And that's what \u and \U are for.

Exactly. At least, whichever interpretation of \u we go by, it works for all codepoints below U+FFFE. The only thing left to debate is whether it should also work for those UTF-16 fragments that don't directly correspond to codepoints. If not, then there are two things to do:

- fix the documentation to explain this
- invent another escape to represent a UTF-16 fragment, for the sake of completeness.

Stewart.
Oct 01 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjjvnq$nje$1 digitaldaemon.com>, Stewart Gordon says...
Arcane Jill wrote:
<snip>
 I would argue that the spec is wrong. \u and \U are Unicode things, 
 and I distinctly remember a discussion on this very subject on the 
 Unicode public forum a while back (though Walter could, if he were 
 sufficiently perverse, give D a different definition).

Which bit of The Unicode Standard should I read to find the meanings of \u and \U it sets in stone across every language ever invented?

It doesn't, of course. The meaning of \u and \U is defined separately in each programming language, and not by the UC. Walter is free to define them how he likes for D.
<snip>
 Certainly, at present, actual behavior of \u in D is different from 
 the documentation, so either there is a documentation error, or else 
 there is a bug. I'm inclined to the belief that it's a documentation 
 error. I certainly hope so, because \u and \U should always be 
 independent of encoding.

How would fixing the compiler to follow the spec create dependence on encoding?

Tricky. Well, if "\uD800" were allowed, for example, it would open up the door to the very nasty possibility that "\U0000D800" might also be allowed - and that /definitely/ shouldn't be, because it's not a valid character. I dunno - I just think it could be very confusing and counterintuitive.

Everyone expects \u and \U to prefix /characters/, not fragments of some encoding scheme. Why would you want it any different?
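The point that U+D800 is not a valid character is easy to demonstrate: a conforming UTF encoder must refuse a lone surrogate. A Python illustration (an encoding fact, not DMD behaviour):

```python
# U+D800 is a surrogate codepoint, reserved for UTF-16 pairs;
# it is not a character, so strict UTF-8 encoding rejects it.
try:
    "\ud800".encode("utf-8")
    rejected = False
except UnicodeEncodeError:
    rejected = True
assert rejected
```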
Agreed from the start.  Nobody suggested that anyone should have to 
learn UTF-16.  Nor that being _allowed_ to use UTF-16 and being 
_allowed_ to use UTF-32 (and hence actual codepoints) should be mutually 
exclusive.  I thought that was half the spirit of having both \u and \U 
- to give the programmer the choice.

In other languages, \u#### is identical in meaning to \U0000####. Why make D different? If you want to hand-code UTF-16, you can always do it like this:

#    wchar[] s = [ 0xD800, 0xDC00 ];
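For anyone who does want to hand-assemble a surrogate pair, the arithmetic is fixed by the UTF-16 definition. A Python sketch of the decoding rule (illustration only):

```python
# A surrogate pair (hi in D800-DBFF, lo in DC00-DFFF) encodes
# a codepoint in the range U+10000..U+10FFFF.
hi, lo = 0xD800, 0xDC00
codepoint = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
assert codepoint == 0x10000

# Python's UTF-16 decoder agrees with the hand calculation.
assert b"\xD8\x00\xDC\x00".decode("utf-16-be") == chr(codepoint)
```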
Ah, you have a point.  It appears that character and string literals are 
translated by the same function, rather than character literals being 
labelled as char, wchar or dchar.  So that could be either a doc error 
or a bug.  Walter?

Well, as you know, I suspect a documentation error, but I leave it to Walter to answer once and for all.
Well, at least what the D spec implies \u should mean is a superset of 
what you're implying is the 'correct' Unicode behaviour.

Guess I can't argue with that.
Exactly.  At least, whichever interpretation of \u we go by, it works 
for all codepoints below U+FFFE.  The only thing left to debate is 
whether it should also work for those UTF-16 fragments that don't 
directly correspond to codepoints.  If not, then there are two things to do:

- fix the documentation to explain this

Agreed.
- invent another escape to represent a UTF-16 fragment, for the sake of 
completeness.

I'm not sure you need an escape sequence. Why not just replace

#    wchar c = '\uD800';

with

#    wchar c = 0xD800;

What requirement is there to hand-code UTF-8 /inside a string literal/? If you get it right, then you might just as well have used \U and the codepoint; if you get it wrong, you're buggered.

Arcane Jill
Oct 01 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
<snip>
 Tricky. Well, if "\uD800" were allowed, for example, it would open up the door
 to the very nasty possibility that "\U0000D800" might also be allowed - and
that
 /definitely/ shouldn't be, because it's not a valid character. I dunno - I just
 think it could be very confusing and counterintuitive.
 
 Everyone expects \u and \U to prefix /characters/, not fragments of some
 encoding scheme. Why would you want it any different?

Because that's how \x is, and at the moment we have an equivalent for UTF-16 according to the spec, but not according to the compiler.
Agreed from the start.  Nobody suggested that anyone should have to 
learn UTF-16.  Nor that being _allowed_ to use UTF-16 and being 
_allowed_ to use UTF-32 (and hence actual codepoints) should be mutually 
exclusive.  I thought that was half the spirit of having both \u and \U 
- to give the programmer the choice.

In other languages, \u#### is identical in meaning to \U0000####. Why make D different?

Are languages defined by the spec or the compiler? I'd have thought the spec, in which case D is already different.
 If you want to hand-code UTF-16, you can always do it like this:
 
 #    wchar[] s = [ 0xD800, 0xDC00 ];

If D's meant to be consistent, then I should also be able to do

wchar[] s = "\uD800\uDC00";

or even

dchar[] s = "\uD800\uDC00";

or

char[] s = "\uD800\uDC00";

or the same with the 'u' replaced by some other letter defined with these semantics.

<snip>
- invent another escape to represent a UTF-16 fragment, for the sake of 
completeness.

I'm not sure you need an escape sequence. Why not just replace

#    wchar c = '\uD800';

with

#    wchar c = 0xD800;

Because that doesn't enable whole string literals to be done like this. And because it would be inconsistent to allow string literals to be encoded in anything except UTF-16.
 What requirement is there to hand-code UTF-8 /inside a string literal/? If you
 get it right, then you might just as well have used \U and the codepoint; if
you
 get it wrong, you're buggered. 

The same as there was when the subject was first brought up. Stewart.
Oct 04 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjr5no$28em$1 digitaldaemon.com>, Stewart Gordon says...

 What requirement is there to hand-code UTF-8 /inside a string literal/? If you
 get it right, then you might just as well have used \U and the codepoint; if
you
 get it wrong, you're buggered. 

The same as there was when the subject was first brought up.

Which is what, exactly? Why would anyone need to hand-code UTF-8 inside a string literal? Or UTF-16 for that matter? I can't think of any circumstance in which "\x##\x##\x##\x##" would be preferable to "\U00######". Can you?

Jill
Oct 04 2004
parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
 In article <cjr5no$28em$1 digitaldaemon.com>, Stewart Gordon says...

 The same as there was when the subject was first brought up.

Which is what, exactly? Why would anyone need to hand-code UTF-8 inside a string literal? Or UTF-16 for that matter?

Sorry. Maybe I got mixed up and was thinking of the talk-to-death of initialising a char[] with arbitrary byte values.
 I can't think of any circumstance in which "\x##\x##\x##\x##" would be
 preferable to "\U00######". Can you?

Maybe it isn't in general. But I suppose if you're doing file I/O and want your code to be self-documenting to the extent of indicating how many actual bytes are being transferred....

Stewart.
Oct 04 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
This is a good summary of the situation. I don't know what the right answer
is yet, so it's a good topic for discussion. One issue not mentioned yet is
the (not that unusual) practice of stuffing arbitrary data into a string
with \x. For example, in C one might create a length prefixed 'pascal style'
string with:

    unsigned char* p = "\x03abc";

where the \x03 is not character data, but length data. I'm hesitant to say
"you cannot do that in D". I also want to support things like:

    byte[] a = "\x05\x89abc\xFF";

where clearly a bunch of binary data is desired.
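The pascal-style layout is a pure byte-level convention, independent of any text encoding. Sketched in Python, with a bytes object standing in for the proposed byte[] literal:

```python
# First byte is the length; the remaining bytes are the payload.
p = b"\x03abc"
assert p[0] == 3                  # length prefix, not character data
assert p[1:1 + p[0]] == b"abc"    # the three payload bytes
```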

One possibility is to use prefixes on the string literals:

    "..."    // Takes its type from the context, only ascii \x allowed
    c"..."    // char[] string literal, only ascii \x allowed
    w"..."   // wchar[] string literal, \x not allowed
    d"..."    // dchar[] string literal, \x not allowed
    b"..."    // byte[] binary data string literal, \x allowed, \u and \U
not allowed
Oct 01 2004
next sibling parent David L. Davis <SpottedTiger yahoo.com> writes:
In article <cjk0rv$oqo$1 digitaldaemon.com>, Walter says...
This is a good summary of the situation. I don't know what the right answer
is yet, so it's a good topic for discussion. One issue not mentioned yet is
the (not that unusual) practice of stuffing arbitrary data into a string
with \x. For example, in C one might create a length prefixed 'pascal style'
string with:

    unsigned char* p = "\x03abc";

where the \x03 is not character data, but length data. I'm hesitant to say
"you cannot do that in D". I also want to support things like:

    byte[] a = "\x05\x89abc\xFF";

where clearly a bunch of binary data is desired.

One possibility is to use prefixes on the string literals:

    "..."    // Takes its type from the context, only ascii \x allowed
    c"..."    // char[] string literal, only ascii \x allowed
    w"..."   // wchar[] string literal, \x not allowed
    d"..."    // dchar[] string literal, \x not allowed
    b"..."    // byte[] binary data string literal, \x allowed, \u and \U
not allowed

(which should also make the ~ string concatenation operator behave correctly...so, no more need to cast every darn string literal), and the cases of when \x can and cannot be used for each. I'm actually holding my breath for this one to happen! ;)

David L.

P.S. Welcome Back!! Hope you got some much needed rest.
-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Oct 01 2004
prev sibling next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjk0rv$oqo$1 digitaldaemon.com>, Walter says...
This is a good summary of the situation. I don't know what the right answer
is yet, so it's a good topic for discussion. One issue not mentioned yet is
the (not that unusual) practice of stuffing arbitrary data into a string
with \x. For example, in C one might create a length prefixed 'pascal style'
string with:

    unsigned char* p = "\x03abc";

where the \x03 is not character data, but length data. I'm hesitant to say
"you cannot do that in D".

You've /already/ outlawed that in D, Walter - at least, if the length is greater than 127. Try compiling this:

#    char[] p = "\x81abcdefg...";
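The reason the compiler balks is that 0x81 on its own is not well-formed UTF-8 (it is a continuation byte with no lead byte), so any strict UTF-8 decoder rejects it. A Python illustration of the encoding rule:

```python
# A lone 0x81 cannot start a UTF-8 sequence, so decoding fails.
try:
    b"\x81abcdefg".decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
assert not valid
```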
I also want to support things like:

    byte[] a = "\x05\x89abc\xFF";

where clearly a bunch of binary data is desired.

And I. That's a good plan. But that's a byte[] literal you want there, not a char[] literal. How to tell them apart, that's the problem...?
One possibility is to use prefixes on the string literals:

    "..."    // Takes its type from the context, only ascii \x allowed
    c"..."    // char[] string literal, only ascii \x allowed
    w"..."   // wchar[] string literal, \x not allowed
    d"..."    // dchar[] string literal, \x not allowed
    b"..."    // byte[] binary data string literal, \x allowed, \u and \U
not allowed

Perfect! I'd support that wholeheartedly. It definitely seems the right way to go - except that I'd allow the following:

#              \x      \u     \U
#    ------------------------------------
#    "..."     ASCII   yes    yes
#    c"..."    ASCII   yes    yes
#    w"..."    no      yes    yes
#    d"..."    no      yes    yes
#    b"..."    yes     no     no

That is, if the prefix is omitted, Unicode escapes should still be allowed. Otherwise people will complain when

#    char[] euroSign = "\u20AC";

fails to compile.

Oh - and one other thing: for the benefit of novice and other newcomers to D, it would be nice if the error message which is output when something like "\xC0" fails to compile was more helpful: "invalid UTF-8" has turned out to be confusing; "invalid ASCII" would be even worse (as it would suggest that D only does ASCII). You need an error message which lets folk know that we can do Unicode, and that "\u" followed by the four-digit Unicode codepoint is most likely what they want.

Arcane Jill
Oct 01 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjkhnf$167v$1 digitaldaemon.com>, Arcane Jill says...

Here's my latest suggestion, only very slightly modified from the last one. The
essential difference is the behavior of the default ("...") string type.

For string literals with a prefix, the following rules should apply:

#    Literal   \x        \u      \U     type
#    -------------------------------------------
#    a"..."    ASCII     no      no     char[]
#    c"..."    ASCII     yes     yes    char[]
#    w"..."    no        yes     yes    wchar[]
#    d"..."    no        yes     yes    dchar[]
#    b"..."    yes       no      no     ubyte[]

(Note the addition of the a"..." type. We don't actually need this, but it makes
the table below make more sense).

For string literals /without/ a prefix, the rules should be determined by
context, as follows:

#    Context           Treat as
#    ---------------------------
#    char[]            c"..."
#    wchar[]           w"..."
#    dchar[]           d"..."
#    ubyte[]           b"..."
#    byte[]            b"..."
#    indeterminate     a"..."

Now there's only one remaining problem: should /both/ of the following lines
compile, or should one of them need a cast? (or perhaps put another way, should
byte[] and ubyte[] be mutually implicitly convertable?)

#    byte[] x = "\x05\x89abc\xFF";
#    ubyte[] y = "\x05\x89abc\xFF";

(by the context rules, these strings would be interpreted as b"\x05\x89abc\xFF")

Arcane Jill

PS. Of course, this sort of thing (defining consecutive bytes of mixed data) is
simple to do in assembler, so it ought to be simple to do in a
close-to-the-processor language like C:

#    db 5, 89h, "abc", 0FFh
Oct 03 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjqrt4$1vbg$1 digitaldaemon.com>, Arcane Jill says...

PS. Of course, this sort of thing (defining consecutive bytes of mixed data) is
simple to do in assembler, so it ought to be simple to do in a
close-to-the-processor language like C:

Er - make that D.
#    db 5, 89h, "abc", 0FFh

Oct 04 2004
prev sibling next sibling parent Burton Radons <burton-radons smocky.com> writes:
Walter wrote:
 where the \x03 is not character data, but length data. I'm hesitant to say
 "you cannot do that in D". I also want to support things like:
 
     byte[] a = "\x05\x89abc\xFF";
 
 where clearly a bunch of binary data is desired.

That code won't compile. x"" should be changed to produce ubyte[] instead. For the "meat is my potatoes" crew you can allow implicitly converting x"" literals to char[]; then they can embed random data within the string to their heart's content.
 One possibility is to use prefixes on the string literals:
 
     "..."    // Takes its type from the context, only ascii \x allowed
     c"..."    // char[] string literal, only ascii \x allowed
     w"..."   // wchar[] string literal, \x not allowed
     d"..."    // dchar[] string literal, \x not allowed
     b"..."    // byte[] binary data string literal, \x allowed, \u and \U
 not allowed

So the language should be made more complex and even more hostile to templating because... don't got an answer there. Just leave \x as is but have it encode. It'll be a stumbling point for transitionary users, but that's what tutorials are for.
Oct 01 2004
prev sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Walter wrote:

<snip>
 One possibility is to use prefixes on the string literals:
 
     "..."    // Takes its type from the context, only ascii \x allowed
     c"..."    // char[] string literal, only ascii \x allowed
     w"..."   // wchar[] string literal, \x not allowed
     d"..."    // dchar[] string literal, \x not allowed
     b"..."    // byte[] binary data string literal, \x allowed, \u and \U
 not allowed

I thought it was by design that string literals take their type from the
context, and that they could be initialised by UTF-8, UTF-16 or UTF-32
regardless of the destination type. As such, it would be a step backward to
restrict what can go in a "...". This leaves two things to do:

1. Stop string literals from jumping to the wrong conclusions.

At parse time, a StringLiteral is just a StringLiteral, right? It has no
specific type yet. On seeing a subexpression

     StringLiteral ~ StringLiteral

the compiler would concatenate the strings there and then, and the result
would still be a StringLiteral. Only when a specific type is required, e.g.

- assignment
- concatenation with a string of known type
- passing to a function or template
- an explicit cast

would the semantic analysis turn it into a char[], wchar[] or dchar[].

The spec would need to be clear about what happens if a given
function/template name is overloaded to take different string types, and of
course the type it would have in a variadic function.

Other things we would need to be careful about:

(a) whether an expression like

     "£10"[1..3]

should be allowed, and what it should do

(b) when we get array arithmetic, whether it should be allowed on strings.

String literals and strings of known type could be considered separately in
this debate. Of course, c"..." et al could still be invented, if only as
syntactic sugar for cast(char[]) "....".

2. Clean up the issue of whether arbitrary UTF-16 fragments are allowed by \u
escapes - get the compiler matching the spec or vice versa. If we don't allow
them, then if only for the sake of completeness, we should invent a new escape
to denote UTF-16 fragments.

Stewart.
Oct 04 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjr7ar$29j8$1 digitaldaemon.com>, Stewart Gordon says...
Walter wrote:

<snip>
 One possibility is to use prefixes on the string literals:
 
     "..."    // Takes its type from the context, only ascii \x allowed
     c"..."    // char[] string literal, only ascii \x allowed
     w"..."   // wchar[] string literal, \x not allowed
     d"..."    // dchar[] string literal, \x not allowed
     b"..."    // byte[] binary data string literal, \x allowed, \u and \U
 not allowed

I thought it was by design that string literals take their type from the context, and that they could be initialised by UTF-8, UTF-16 or UTF-32 regardless of the destination type.

Opinions differ on that viewpoint. Currently, strings /cannot/ be initialized with UTF-16, and I believe that behavior to be correct. (You believe it to be incorrect, I know. Like I said, opinions differ). I don't believe that allowing them to be initialised with UTF-8 is a good idea either. It's dumb. Regardless of the destination type!
As such, it would be a step 
backward to restrict what can go in a "...".

I posted an update this morning with better rules, which I think would keep everyone happy. Well - everyone apart from those who want to explicitly stick UTF-8 and UTF-16 into char[], wchar[] and dchar[] anyway.
This leaves two things to do:

1. Stop string literals from jumping to the wrong conclusions.

Prefixes do that nicely. In the absence of prefixes, you can often figure it out. But /sometimes/ you just can't. For instance: f("..."), where f() is overloaded.
The spec would need to be clear about what happens if a given 
function/template name is overloaded to take different string types, and 
of course the type it would have in a variadic function.

Yeah. The algorithm I posted this morning (which is almost the same as Walter's) covers that nicely.
Other thing we would need to be careful about:

(a) whether an expression like

     "£10"[1..3]

should be allowed, and what it should do

An interesting conundrum!
2. Clean up the issue of whether arbitrary UTF-16 fragments are allowed 
by \u escapes - get the compiler matching the spec or vice versa.  If we 
don't allow them, then if only for the sake of completeness, we should 
invent a new escape to denote UTF-16 fragments.

Disagree. I still think that encoding-by-hand inside a string literal is a
dumb idea - both for UTF-8 and for UTF-16. What on Earth is wrong with \u####
and \U00###### (where #### and ###### are just Unicode codepoints in hex)?
Walter suggests (and I agree with him) that \x should be for inserting
arbitrary binary data into binary strings, and ASCII characters into text
strings. I see no /point/ in defining \x to be UTF-8 ... unless of course you
want to enter an obfuscated D contest with code like this:

#    wchar c = '\xE2\x82\xAC';    // currently legal

instead of either of:

#    wchar c = '\u20AC';
#    wchar c = '€';

It's just crazy.

Jill
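For anyone who wants to check the equivalence Jill is describing, a few lines of Python (a purely illustrative, present-day aside; any UTF-8-aware tool would do) confirm that \xE2\x82\xAC is exactly the UTF-8 encoding of U+20AC:

```python
# U+20AC (the Euro sign) encodes to the three UTF-8 bytes E2 82 AC,
# which is why '\xE2\x82\xAC' and '\u20AC' denote the same character
# under the "\x is UTF-8" interpretation being argued about here.
euro = "\u20ac"
utf8_bytes = euro.encode("utf-8")
print(utf8_bytes.hex())  # e282ac

# Decoding those bytes recovers the single codepoint U+20AC.
assert utf8_bytes == b"\xe2\x82\xac"
assert utf8_bytes.decode("utf-8") == "\u20ac"
```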
Oct 04 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjrd3d$2cep$1 digitaldaemon.com>, Arcane Jill says...
In article <cjr7ar$29j8$1 digitaldaemon.com>, Stewart Gordon says...

Other thing we would need to be careful about:

(a) whether an expression like

     "£10"[1..3]

should be allowed, and what it should do

An interesting conundrum!

That got me thinking a lot. Finally, it occurred to me that if the type of
"..." cannot be determined by context then it should be a syntax error. What's
more, non-ASCII characters shouldn't be allowed in a string constant of
unknown type. So, in effect, I'm saying:

#    c"£10"[1..3];   // results in c"£1"
#    w"£10"[1..3];   // results in w"£10"

And the complete set of rules should now be:

For string literals with a prefix, the following rules should apply:

#    Literal   /x        /u      /U     non-ASCII   type
#    -------------------------------------------------------
#    c"..."    ASCII     yes     yes    yes         char[]
#    w"..."    no        yes     yes    yes         wchar[]
#    d"..."    no        yes     yes    yes         dchar[]
#    b"..."    yes       no      no     no          ubyte[]

Note the extra column, "non-ASCII". What this means is that statements like

#    ubyte[] x = b"€100";

must be forbidden, because '€' is a non-ASCII character, and its
representation is unspecified. Is it UTF-8 ("\xE2\x82\xAC")? Is it UTF-16BE
("\x20\xAC")? UTF-16LE ("\xAC\x20")? Is it WINDOWS-1252 ("\x80")? You see the
problem.

For string literals /without/ a prefix, the rules should be determined by
context, as follows:

#    Context           Treat as
#    -------------------------------------
#    char[]            c"..."
#    wchar[]           w"..."
#    dchar[]           d"..."
#    ubyte[]           b"..."
#    byte[]            b"..."
#    indeterminate     compile-time error

I believe that these rules would lead to a resolution of Stewart's conundrum.

#    "£10"[1..3];   // illegal - type of "..." not known

String literals will now be consistent, logical, and sufficiently powerful for
Walter's purposes. The distinction between byte[] and ubyte[] would then be
the only remaining problem.

Arcane Jill
Oct 04 2004
next sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

<snip>
 That got me thinking a lot. Finally, it occurred to be that if the type of
"..."
 cannot be determined by context then it should be a syntax error.

Making it a _syntax_ error while retaining CFG might be tricky, if not impossible. Perhaps better would be to make it an error at the semantic analysis level.
 What's more, non-ASCII characters shouldn't be allowed in a string constant of
unknown type.
 So, in effect, I'm saying:
 
 #    c"£10"[1..3];   // results in c"£1"
 #    w"£10"[1..3];   // results in w"£10"

You seem to've got your indexing mixed up. Surely it would be c"\xA31" and w"10"?
 And the complete set of rules should now be:
 
 For string literals with a prefix, the following rules should apply:
 
 #    Literal   /x        /u      /U     non-ASCII   type
 #    -------------------------------------------------------
 #    c"..."    ASCII     yes     yes    yes         char[]
 #    w"..."    no        yes     yes    yes         wchar[]
 #    d"..."    no        yes     yes    yes         dchar[]
 #    b"..."    yes       no      no     no          ubyte[]

Where did those forward slashes come from?
 Note the extra column, "non-ASCII". What this means is that statements like
 
 #    ubyte[] x = b"€100";
 
 must be forbidden, because '€' is a non-ASCII character, and its representation
 is unspecified. Is it UTF-8 ("\xE2\x82\xAC")? Is it UTF-16BE ("\x20\xAC")?
 UTF-16LE ("\xAC\x20")? Is it WINDOWS-1252 ("\x80")? You see the problem.

Indeed, if we had b"..." then this would be the case.
 For string literals /without/ a prefix, the rules should be determined by
 context, as follows:
 
 #    Context           Treat as
 #    -------------------------------------
 #    char[]            c"..."
 #    wchar[]           w"..."
 #    dchar[]           d"..."
 #    ubyte[]           b"..."
 #    byte[]            b"..."
 #    indeterminate     compile-time error
 
 I believe that these rules would lead to a resolution of Stewart's conundrum.

Depends on what you mean by "treat as". My thought is that it is simplest to keep translation of string escapes at the lexical level. Of course, the prefix (or lack thereof) would be carried forward to the SA.

This of course would retain the current 'anything goes' approach for unprefixed literals, but also retains the flexibility that some of us want or even need.

Of course, if the context is this very conundrum, then it would indeed be a compile-time error, at least if it contains anything non-ASCII at all.

Stewart.
Oct 05 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjtufp$2rjd$1 digitaldaemon.com>, Stewart Gordon says...
 #    c"£10"[1..3];   // results in c"£1"
 #    w"£10"[1..3];   // results in w"£10"

You seem to've got your indexing mixed up. Surely it would be c"\xA31" and w"10"?

Whoops. My brain was seeing [0..2] for some reason. Of course it should be:

#    c"£10"[1..3];   // c"\xA31"
#    w"£10"[1..3];   // array bounds exception
 #    Literal   /x        /u      /U     non-ASCII   type


Errr. Typo. Read as backslash.
Indeed, if we had b"..." then this would be the case.

That was Walter's idea, but I like it.
Depends on what you mean by "treat as".  My thought is that it is 
simplest to keep translation of string escapes at the lexical level.  Of 
course, the prefix (or lack thereof) would be carried forward to the SA. 

You're starting to lose me. Compiler front ends are not my strong point. But, by "treat as", I meant, allow the same escape sequences. I couldn't tell you whether or not this would be feasible for a D compiler, but it's a suggestion to which Walter knows the answer, so it didn't seem unreasonable to suggest it.
  This of course would retain the current 'anything goes' approach for 
unprefixed literals, but also retains the flexibility that some of us 
want or even need.

Need? You have yet to provide a convincing argument as to why you (or, indeed,
anyone) would /need/ the syntax:

#    wchar[] s = "\xE2\x82\xAC";

which doesn't even make sense (unless you DEFINE \x as meaning a UTF-8
fragment, but there's just no need for that). "€" tells you you've got the
Euro character; "\u20AC" tells you you've got the character U+20AC (which any
code chart will tell you is the Euro character). "\xE2\x82\xAC" tells you ...
what?

If you need to place arbitrary binary data into a string literal, then
Walter's idea makes the most sense. He proposes:

#    byte[] s = "\x80\x81\x82";    // byte[] tells you this is binary data
#    f( b"\x80\x81\x82" );         // b"..." tells you this is binary data

There is really no way to do this if \x MUST be UTF-8.

Combining two posts into one here...

In article <cjtt82$2qsb$1 digitaldaemon.com>, Stewart Gordon says...
If people choose to initialise a string with UTF-8 or UTF-16, then it 
automatically follows that they should know what they're doing.  What 
would we gain by stopping these people?

Defining rules for string literals doesn't stop anyone doing anything. Look at
it this way - I'm saying that the following line would be outlawed:

#    char[] s = "\xE2\x82\xAC";    // illegal under proposed new rules

but of course we still allow this:

#    char[] s = "\u20AC";          // perfectly ok

but there would still be nothing - absolutely nothing - to stop you from
initialising a string with UTF-8, by doing any of:

#    byte[] s = "\xE2\x82\xAC";
#    char[] s = [ 0xE2, 0x82, 0xAC ];
#    char[] s = cast(char[]) b"\xE2\x82\xAC";

if you really wanted to. It seems to me to be the best of both worlds. People
who really know what they're doing and who /insist/ on hand-coding their UTF-8
will still be allowed to do so, at the cost of a slightly more involved
syntax, but everyone else will be protected from silly mistakes. And I speak
as someone who /can/ hand-code UTF-8, and /still/ wants to be protected from
doing it accidentally.
But they take away the convenience Walter has gone to the trouble to 
create, of which this is just one example:

http://www.digitalmars.com/d/ctod.html#ascii

You mean this?
The D Way
The type of a string is determined by semantic analysis,
so there is no need to wrap strings in a macro call:

    char[] foo_ascii = "hello";        // string is taken to be ascii 
    wchar[] foo_wchar = "hello";       // string is taken to be wchar 

Under the proposed new rules, that example would still work exactly as documented. Everything would still work, Stewart - everything that you would ever (sensibly) want to do - PLUS you'd get arbitrary binary strings thrown in as a bonus - AND you could still hand-code whatever you wanted (with the inherent risk of getting it wrong, obviously) with a small amount of extra typing. It's a no-lose situation.
What on Earth is wrong with \u#### and
 \U00###### (where #### and ###### are just Unicode codepoints in hex)?

Nothing - these are perfectly valid at the moment, and remain perfectly valid whether \u is interpreted as codepoints or UTF-16 fragments.

"\uD800" won't compile, so I'd hardly call it "perfectly valid at the moment".
Under the proposed new rules, however, you /would/ be able to do any of the
following:

#    wchar c = 0xD800;                         // UTF-16
#    wchar[] s = cast(wchar[]) b"\x00\xD8";    // UTF-16LE
#    wchar[] s = cast(wchar[]) b"\xD8\x00";    // UTF-16BE

or even

#    wchar[] s = "string with a " ~ cast(wchar)0xD800 ~ " in it";

I don't see a problem with this. If you want to create invalid strings, you
should not be surprised if you need a bit of casting.
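The reason "\uD800" won't compile is that U+D800 is a lone UTF-16 high surrogate, not a complete character. As a present-day illustration (Python, purely for demonstration), well-formed encoders reject it unless you explicitly opt in, which mirrors the explicit casts proposed above:

```python
# U+D800 is a high surrogate: a UTF-16 code unit, not a real character.
# A conforming encoder refuses to emit it on its own...
lone = "\ud800"
try:
    lone.encode("utf-16-le")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised  # lone surrogates are rejected by default

# ...unless you explicitly say "I know what I'm doing"
# (the moral equivalent of Jill's cast(wchar[]) b"\x00\xD8"):
assert lone.encode("utf-16-le", "surrogatepass") == b"\x00\xd8"
```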
 ... unless of course you want to enter an
 obfuscated D contest with code like this:
 
 #    wchar c = '\xE2\x82\xAC';    // currently legal
 
 instead of either of:
 
 #    wchar c = '\u20AC';
 #    wchar c = '';
 
 It's just crazy.

Maybe. But there's method in some people's madness.

But is there sufficient method in:

#    wchar c = '\xE2\x82\xAC';

(which leaves c containing the value 0x20AC) to justify its current status as
legal D? I mean, what methodic madness makes the above line better than any
of:

#    wchar c = '\u20AC';
#    wchar c = 0x20AC;
#    wchar c = '€';

..really?

Providing that D can do a reasonable job of auto-detecting the type of "..."
(and '...') in the most common circumstances (which it can), the rest just
becomes logical. I mean - we have an /opportunity/ here - an opportunity to
make D better; to make strings do the /obvious/, /intuitive/ thing, to allow
arbitrary binary strings (which we don't have currently), and all this
/without/ preventing programmers from doing "under the hood" stuff if they
really want to. Let us seize this opportunity while we can. It's not often
that such an opportunity arises that /actually has Walter's backing/ (and was
actually Walter's idea). We should be uniting on this one, not arguing with
each other (although it is good that differing arguments get thrashed out).

Arcane Jill
Oct 05 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

<snip>
 Whoops. My brain was seeing [0..2] for some reason. Of course it should be:
 
 #    c"10"[1..3];    // c"\xA31"
 #    w"10"[1..3];    // array bounds exception

Wrong again.  "£10" is three wchars: '£', '1', '0'.  w"10" is correct.

<snip>
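Stewart's correction is easy to verify mechanically: in UTF-8 the pound sign is the two bytes C2 A3, so a byte-level [1..3] slice of "£10" straddles the character, while the same slice over three UTF-16-style code units does not. A quick Python sketch (illustrative only):

```python
s = "\u00a310"  # the string "£10"

# As UTF-8 bytes: C2 A3 31 30 -- four bytes for three characters.
utf8 = s.encode("utf-8")
assert utf8 == b"\xc2\xa310"
# Slicing bytes [1..3] cuts the pound sign in half: \xA3 followed by '1'.
assert utf8[1:3] == b"\xa31"

# As whole code points (like wchar[] for BMP-only text): three units.
assert len(s) == 3
assert s[1:3] == "10"  # so w"£10"[1..3] really is "10"
```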
Depends on what you mean by "treat as".  My thought is that it is 
simplest to keep translation of string escapes at the lexical level.  Of 
course, the prefix (or lack thereof) would be carried forward to the SA. 

You're starting to lose me. Compiler front ends are not my strong point. But, by "treat as", I meant, allow the same escape sequences. I couldn't tell you whether or not this would be feasible for a D compiler, but it's a suggestion to which Walter knows the answer, so it didn't seem unreasonable to suggest it.

It would somewhat complicate the lexical analysis with little or no real benefit, which is one of the reasons I suggest allowing \x, \u or \U equally in unprefixed literals.
 This of course would retain the current 'anything goes' approach for 
 unprefixed literals, but also retains the flexibility that some of us 
 want or even need.

Need?

As I said before, need to be able to interface foreign APIs relying on non-Unicode character sets/encodings. <snip>
 if you really wanted to. It seems to me to be the best of both worlds. People
 who really know what they're doing and who /insist/ on hand-coding their UTF-8
 will still be allowed to do so, at the cost of a slightly more involved syntax,
 but everyone else will be protected from silly mistakes. And I speak as someone
 who /can/ hand-code UTF-8, and /still/ wants to be protected from doing it
 accidently.

What are the ill consequences of hand-coding UTF-8, from which one would need to be protected?
But they take away the convenience Walter has gone to the trouble to 
create, of which this is just one example:

http://www.digitalmars.com/d/ctod.html#ascii

You mean this?
The D Way
The type of a string is determined by semantic analysis,
so there is no need to wrap strings in a macro call:

   char[] foo_ascii = "hello";        // string is taken to be ascii 
   wchar[] foo_wchar = "hello";       // string is taken to be wchar 


That's indeed the only bit of D code in that section of the page. <snip>
What on Earth is wrong with \u#### and
\U00###### (where #### and ###### are just Unicode codepoints in hex)?

Nothing - these are perfectly valid at the moment, and remain perfectly valid whether \u is interpreted as codepoints or UTF-16 fragments.

"\uD800" won't compile, so I'd hardly call it "perfectly valid at the moment".

I meant that, _if_ #### happens to be Unicode codepoint, as you were saying, \u#### would be equally legal whether \u is defined to take a Unicode codepoint or a UTF-16 fragment. <snip>
 But is there sufficient method in:
 
 #    wchar c = '\xE2\x82\xAC';
 
 (which leaves c containing the value 0x20AC) to justify its current status as
 legal D. I mean, what methodic madness makes the above line better than any of:
 
 #    wchar c = '\u20AC';
 #    wchar c = 0x20AC;
 #    wchar c = '';
 
 ..really?

Using DMD as a UTF conversion tool, perhaps? :-) Maybe someone can come up with a variety of uses for this.
 Providing that D can do a reasonable job of auto-detecting the type of "..."
 (and '...') in the most common circumstances (which it can), the rest just
 becomes logical. I mean - we have an /opportunity/ here - an opportunity to
make
 D better; to make strings to the /obvious/, /intuitive/ thing, to allow
 arbitrary binary strings (which we don't have currently), and all this
/without/
 preventing programmers from doing "under the hood" stuff if they really want
to.
 Let us seize this opportunity while we can. It's not often that such an
 opportunity arises that /actually has Walter's backing/ (and was actually
 Walter's idea). We should be uniting on this one not arguing with each other
 (although it is good that differing arguments get thrashed out).

Maybe you're right. The trouble is that you have some points I don't really
agree with, like that writing strings in UTF-8 or UTF-16 should be illegal.

And some of my thoughts are:

- it should remain straightforward to interface legacy APIs, whatever
character sets/encodings they may rely on
- the implementation should remain simple, complete with context-free lexical
analysis

But at least we seem to agree that:

- string literals should remain specifiable in terms of actual Unicode
codepoints
- prefixed string literals might come in useful one day as an alternative to
unprefixed ones

Maybe we can come to a best of all three worlds.

Stewart.
Oct 05 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjuhhv$bn7$1 digitaldaemon.com>, Stewart Gordon says...

Wrong again.  "£10" is three wchars: '£', '1', '0'.  w"10" is correct.

<Hangs head in shame> What can I say? One of these days I must learn to count!
As I said before, need to be able to interface foreign APIs relying on 
non-Unicode character sets/encodings.

I think you must be misunderstanding me. Fair enough - maybe I'm not that good
at explaining things. Okay, let's say you have a foreign API whose signature
is something like:

#    extern(C) char * strstr(char *s1, char *s2);

Hopefully, we're all familiar with that one. Now, the following would all be
legal under both Jill's rules and Stewart's rules:

#    strstr("hello", "e");
#    strstr("€100", "€");
#    strstr("€100", "\u20AC");

The following would be legal under Stewart's rules, but illegal under Jill's:

#    strstr("€100", "\xE2\x82\xAC");   // note that this will /succeed/
#                                      // - that is, return non-null

However, even under Jill's rules, you could still do:

#    strstr("€100", cast(char*) b"\xE2\x82\xAC");

if you really wanted to insist on stuffing encoding details into a literal.

But you said "relying on non-Unicode character sets/encodings" - so let's look
at that now. Let's say we're running on a WINDOWS-1252 machine, in which
character set the character '€' has codepoint 0x80. Now you'd have to write:

#    strstr("\x80100", "\x80");                              // Stewart's rules
#    strstr(cast(char*) b"\x80100", cast(char*) b"\x80");    // Jill's rules

Now here, of course, by use of those casts, you are explicitly telling the
compiler "I know what I'm doing" -- which I don't think is unreasonable given
that you are now hand-coding WINDOWS-1252.

Of course, I've been assuming here that the type of b"..." would be byte[] or
ubyte[]. But it's equally possible that Walter might decide that the type of
b"..." is actually to be char[] -- precisely so that it /can/ interface easily
with foreign APIs, in which case we'd end up with:

#    strstr("\x80100", "\x80");      // Stewart's rules
#    strstr(b"\x80100", b"\x80");    // Jill's rules
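The ambiguity Jill keeps pointing at can be made concrete: the same one-character string '€' is a different byte sequence under each encoding mentioned in this thread. A short Python check (illustrative only):

```python
euro = "\u20ac"  # the Euro sign

# One character, four different byte representations:
assert euro.encode("utf-8") == b"\xe2\x82\xac"    # UTF-8
assert euro.encode("utf-16-be") == b"\x20\xac"    # UTF-16BE
assert euro.encode("utf-16-le") == b"\xac\x20"    # UTF-16LE
assert euro.encode("cp1252") == b"\x80"           # WINDOWS-1252

# So a bare b"€100" has no single obvious byte-level meaning -- which
# is exactly why the proposal forbids non-ASCII in b"..." literals.
```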
What are the ill consequences of hand-coding UTF-8, from which one would 
need to be protected?

POSSIBILITY ONE:

Suppose that a user accustomed to WINDOWS-1252 wanted to hand-code a string
containing a multiplication sign ('×') immediately followed by a Euro sign
('€'). Such a user might mistakenly type

#    char[] s = "\xD7\x80";

since 0xD7 is the WINDOWS-1252 codepoint for '×', and 0x80 is the
WINDOWS-1252 codepoint for '€'. This /does/ compile under present rules - but
results in s containing the single Unicode character U+05C0 (Hebrew
punctuation PASEQ). This is not what the user was expecting, and results
entirely from the system trying to interpret \x as UTF-8 when it wasn't. If
the user had been required to instead type

#    char[] s = "\u00D7\u20AC";

then they would have been protected from that error.

POSSIBILITY TWO:

Suppose a user decides to hand-code UTF-8 on purpose /and gets it wrong/. As
in:

#    char[] s = "\xE2\x82\x8C";  // whoops - should be "\xE2\x82\xAC"

who's going to notice? The compiler? Not in this case. Again, if the user had
been required instead to type:

#    char[] s = "\u20AC";

then they would have been protected from that error.
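Both failure modes are checkable: the WINDOWS-1252 bytes \xD7\x80 happen to be well-formed UTF-8 for U+05C0, and the mistyped \xE2\x82\x8C is itself valid UTF-8 for a completely different character (U+208C), which is exactly why no compiler would catch it. In Python terms (purely illustrative):

```python
# POSSIBILITY ONE: CP1252 bytes silently misread as UTF-8.
raw = b"\xd7\x80"
assert raw.decode("cp1252") == "\u00d7\u20ac"  # what the user meant: '×€'
assert raw.decode("utf-8") == "\u05c0"         # what UTF-8 makes of it: PASEQ

# POSSIBILITY TWO: a typo that is still well-formed UTF-8.
typo = b"\xe2\x82\x8c"
assert typo.decode("utf-8") == "\u208c"        # SUBSCRIPT EQUALS SIGN, not '€'
```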
 Let us seize this opportunity while we can. It's not often that such an
 opportunity arises that /actually has Walter's backing/ (and was actually
 Walter's idea). We should be uniting on this one not arguing with each other
 (although it is good that differing arguments get thrashed out).

Maybe you're right. The trouble is that you have some points I don't really agree with, like that writing strings in UTF-8 or UTF-16 should be illegal.

Not illegal - I'm only asking for a b prefix. As in b"...".
And some of my thoughts are:

- it should remain straightforward to interface legacy APIs, whatever 
character sets/encodings they may rely on

Yes, absolutely. That would be one use of b"...".
Maybe we can come to a best of all three worlds.

Of course we can. We're good. I'm not "making a stand" or "taking a position". I'm open to persuasion. I can be persuaded to change my mind. Assuming that's also true of you, it should just be a matter of logicking out all the pros and cons, and then letting Walter be the judge. Arcane Jill
Oct 06 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
<snip>
 POSSIBILITY ONE:
 
 Suppose that a user accustomed to WINDOWS-1252 wanted to hand-code a 
 string containing a multiplication sign ('×') immediately followed by 
 a Euro sign ('€'). Such a user might mistakenly type
 
 #    char[] s = "\xD7\x80";
 
 since 0xD7 is the WINDOWS-1252 codepoint for '×', and 0x80 is the 
 WINDOWS-1252 codepoint for '€'. This /does/ compile under present 
 rules - but results in s containing the single Unicode character 
 U+05C0 (Hebrew punctuation PASEQ). This is not what the user was 
 expecting, and results entirely from the system trying to interpret 
 \x as UTF-8 when it wasn't. If the user had been required to instead 
 type
 
 #    char[] s = "\u00D7\u20AC";
 
 then they would have been protected from that error.

I see....
 POSSIBILITY TWO:
 
 Suppose a user decides to hand-code UTF-8 on purpose /and gets it 
 wrong/. As in:
 
 #    char[] s = "\xE2\x82\x8C";  // whoops - should be "\xE2\x82\xAC"
 
 
 who's going to notice? The compiler? Not in this case. Again, if the 
 user had been required instead to type:
 
 #    char[] s = "\u20AC";
 
 then they would have been protected from that error.

How are these less typo-prone? Even if I did that, how would it know that I
didn't mean to type

     char[] s = "\u208C";

? As I started to say, people who choose to hand-code UTF-8, any other
encoding or even uncoded codepoints should know what they're doing and that
they have to be careful.

<snip>
 Maybe you're right.  The trouble is that you have some points I 
 don't really agree with, like that writing strings in UTF-8 or 
 UTF-16 should be illegal.

Not illegal - I'm only asking for a b prefix. As in b"...".
 And some of my thoughts are:
 
 - it should remain straightforward to interface legacy APIs, 
 whatever character sets/encodings they may rely on

Yes, absolutely. That would be one use of b"...".

I see....
 Maybe we can come to a best of all three worlds.

Of course we can. We're good. I'm not "making a stand" or "taking a position". I'm open to persuasion. I can be persuaded to change my mind. Assuming that's also true of you, it should just be a matter of logicking out all the pros and cons, and then letting Walter be the judge.

Good idea. Stewart.
Oct 06 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ck0gia$239i$1 digitaldaemon.com>, Stewart Gordon says...

 POSSIBILITY TWO:
 
 Suppose a user decides to hand-code UTF-8 on purpose /and gets it 
 wrong/. As in:
 
 #    char[] s = "\xE2\x82\x8C";  // whoops - should be "\xE2\x82\xAC"
 
 
 who's going to notice? The compiler? Not in this case. Again, if the 
 user had been required instead to type:
 
 #    char[] s = "\u20AC";
 
 then they would have been protected from that error.

How are these less typo-prone? Even if I did that, how would it know that I didn't mean to type char[] s = "\u208C";

True enough. But I guess, what I was trying to say is that \u20AC is something that any /human/ can look up in Unicode code charts. By contrast, you are unlikely to find a convenient lookup table anywhere on the web which will let you look up \xE2\x82\xAC. So the \u version is just more maintainable, and the \x version more obfuscated. I suppose I'm saying that if a project is maintained by more than one person, an error involving \u is more likely to be spotted by someone else in the team than an error involving a sequence of \x's. Maybe that's just subjective. Maybe I'm wrong. I dunno any more.
?  As I started to say, people who choose to hand-code UTF-8, any other 
encoding or even uncoded codepoints should know what they're doing and 
that they have to be careful.

Well obviously we both agree on that one. I think we only disagree on whether or not you should need a b before the "...". Jill
Oct 06 2004
prev sibling parent reply larrycowan <larrycowan_member pathlink.com> writes:
In article <cjtgk8$2h40$1 digitaldaemon.com>, Arcane Jill says...
...
String literals will now be consistent, logical, and sufficiently powerful for
Walter's purposes. The distinction between byte[] and ubyte[] would then be the
only remaining problem.

Arcane Jill

Why is there any problem here? The content is not different. ubyte and byte
are distinguished only when used singly (or grouped) as binary for an
arithmetic evaluation. The only difference I can see is for something like:

ubyte what = -123;  // valid

and

byte what = -123; // invalid, would be = 133 to same content

but that is a different type of initialization.
Oct 05 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjur82$lpv$1 digitaldaemon.com>, larrycowan says...

Why is there any problem here?

It's possible that there isn't. I don't know enough about how the compiler can sort these things out.
The content is not different.

Well, there is a sense in which the content is different. For example:

# byte[] x = "\x80\x90";
# ubyte[] y = "\x80\x90";
# assert( x[1] < 0 );
# assert( y[1] >= 0 );
# assert( x[1] != y[1] );

I guess I incline to the opinion that only ubyte[] should be used for binary string literals, but maybe that's not the only answer.
ubyte and
byte are distinguished only when used singly (or grouped) as binary for
an arithmetic evaluation.  The only difference I can see is for something
like:

ubyte what = 133;  // valid
and
byte what = 133; // invalid, would be = -123 for the same content

but that is a different type of initialization.

You mean assignment. Assignment can happen at times other than just initialization. But I was more getting at stuff like this:

# byte[] x = "\x80\x90";
# ubyte[] y = x;    // A lossless conversion? Should this compile?

or

# byte[] x = "\x80\x90";
# ubyte[] y = "\x80\x90";
# if (x == y)       // Again, should this compile?
#                   // And, if so, should it evaluate to true or false?

or

# void f(byte[] s) { /*stuff*/ }
# void f(ubyte[] s) { /*stuff*/ }
# f( b"\x80\xC0" ); // which f gets called?

Jill
Oct 05 2004
prev sibling parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
<snip>
 Currently, strings /cannot/ be initialized with UTF-16, and I believe that
 behavior to be correct. (You believe it to be incorrect, I know. Like I said,
 opinions differ).
 
 I don't believe that allowing them to be initialised with UTF-8 is a good idea
 either. It's dumb.
 
 Regardless of the destination type! 

If people choose to initialise a string with UTF-8 or UTF-16, then it automatically follows that they should know what they're doing. What would we gain by stopping these people? <snip>
 I posted an update this morning with better rules, which I think would keep
 everyone happy. Well - everyone apart from those who want to explicitly stick
 UTF-8 and UTF-16 into char[], wchar[] and dchar[] anyway.

Hmm......
This leaves two things to do:

1. Stop string literals from jumping to the wrong conclusions.

Prefixes do that nicely. In the absence of prefixes, you can often figure it out. But /sometimes/ you just can't. For instance: f("..."), where f() is overloaded.

But they take away the convenience Walter has gone to the trouble to create, of which this is just one example: http://www.digitalmars.com/d/ctod.html#ascii <snip>
2. Clean up the issue of whether arbitrary UTF-16 fragments are allowed 
by \u escapes - get the compiler matching the spec or vice versa.  If we 
don't allow them, then if only for the sake of completeness, we should 
invent a new escape to denote UTF-16 fragments.

Disagree. I still think that encoding-by-hand inside a string literal is a dumb idea - both for UTF-8 and for UTF-16. What on Earth is wrong with \u#### and \U00###### (where #### and ###### are just Unicode codepoints in hex)?

Nothing - these are perfectly valid at the moment, and remain perfectly valid whether \u is interpreted as codepoints or UTF-16 fragments.
 Walter suggests (and I agree with him) that \x should be for inserting
arbitrary
 binary data into binary strings, and ASCII characters into text strings. I see
 no /point/ in defining \x to be UTF-8 ... unless of course you want to enter an
 obfuscated D contest with code like this:
 
 #    wchar c = '\xE2\x82\xAC';    // currently legal
 
 instead of either of:
 
 #    wchar c = '\u20AC';
 #    wchar c = '€';
 
 It's just crazy.

Maybe. But there's method in some people's madness. Stewart.
Oct 05 2004
prev sibling parent reply Benjamin Herr <ben 0x539.de> writes:
Stewart Gordon wrote:
 I thought it was by design that string literals take their type from the 
 context

Eh, if we are that far, what prevents resolving overloaded function calls based on their return type? Then we could even have a sensible opCast. Sorry to hijack the thread, but I never quite realised this feature. -ben
Oct 04 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjrq9c$uea$1 digitaldaemon.com>, Benjamin Herr says...
Stewart Gordon wrote:
 I thought it was by design that string literals take their type from the 
 context

Eh, if we are that far, what prevents resolving overloaded function calls based on their return type? Then we could even have a sensible opCast. Sorry to hijack the thread, but I never quite realised this feature. -ben

As I understand it, the context-detection for string literals is very, very crude - basically limited to assignment statements of the form:

# T s = "literal";

although it ought to be fairly easily extended to:

# T s = /*stuff*/ ~ "literal" ~ /*stuff*/;

It's not a full context analysis - and even if it were, it would be an /exception/ to D's overall context-free grammar, not the rule. The language may be able to withstand one or two very simple exceptions, but that is a far cry from generically being able to tell the context of everything in all circumstances. The latter would (I suspect) require a complete language rewrite.

Arcane Jill

PS. I want a sensible opCast() too - but that's for another thread.
Oct 05 2004
parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

 In article <cjrq9c$uea$1 digitaldaemon.com>, Benjamin Herr says...
 
Stewart Gordon wrote:

I thought it was by design that string literals take their type from the 
context

Eh, if we are that far, what prevents resolving overloaded function calls based on their return type? Then we could even have a sensible opCast. Sorry to hijack the thread, but I never quite realised this feature.


There would be a lot more cases to deal with, it would probably be complex to implement, and could get utterly confusing to determine which path the types are going through in a complex expression. Simple type resolution of string literals is, OTOH, a relatively simple feature.
-ben

 As I understand it, the context-detection for string literals is very, very crude - basically limited to assignment statements of the form:
 # T s = "literal";
 although it ought to be fairly easily extended to:
 # T s = /*stuff*/ ~ "literal" ~ /*stuff*/;
 It's not a full context analysis - and even if it were, it would be an /exception/ to D's overall context-free grammar, not the rule.

How would it destroy CFG? Type resolution and expression simplification are part of semantic analysis. The mechanism would be simple and unambiguous to this extent:

literal ~ literal => literal (concatenated at compile time)
char[]  ~ literal => char[]
wchar[] ~ literal => wchar[]
dchar[] ~ literal => dchar[]
literal ~ char[]  => char[]
literal ~ wchar[] => wchar[]
literal ~ dchar[] => dchar[]
char[]  ~ wchar[] => invalid?
char[]  = literal => char[]
wchar[] = literal => wchar[]
dchar[] = literal => dchar[]

Stewart.
Oct 05 2004