digitalmars.D.bugs - Universal character names not supported

Sean Kelly <sean f4.ca> writes:
C:\code\d>type test.d
void \u00A0() {}
void main() {}

C:\code\d>dmd test
test.d(1): no identifier for declarator void
test.d(1): semicolon expected, not '┬'
test.d(1): Declaration expected, not '┬'
Oct 21 2005
Thomas Kühne <thomas-dloop kuehne.cn> writes:
What OS do you use?

Could you please zip the file and send it to me? (The compression should ensure that no "magic" encoding conversions are triggered.)

Thomas
Oct 22 2005
Sean Kelly <sean f4.ca> writes:
In article <djct8n$uka$1 digitaldaemon.com>, Thomas Kühne says...
What OS do you use?

WinXP. I created the file with UltraEdit, so a UTF-8 BOM may exist in the file as well.
Could you please zip the file and send it to me?

Done. I've also put it online here: http://home.f4.ca/sean/d/ucn.zip

Sean
Oct 22 2005
Thomas Kuehne <thomas-dloop kuehne.cn> writes:
That is not a Unicode problem. The file contains the *character* sequence "\u00A0", whereas it should contain the *byte* sequence "00A0" (assuming UTF-16 BE) or other *byte* sequences for UTF-8, UTF-16 LE, UTF-32 BE and UTF-32 LE.

For a source sample that uses non-ASCII identifiers try: http://dstress.kuehne.cn/run/unicode_03.d

Thomas
Oct 22 2005
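The difference between the typed character sequence and the encoded code point can be sketched in D. This is only an illustration, not something posted in the thread: the helper name utf8Bytes is made up here, and the code assumes a current D2 compiler rather than the 2005 release under discussion.

```d
import std.stdio : writefln;

/// Hypothetical helper: view the UTF-8 code units of a string as raw bytes.
immutable(ubyte)[] utf8Bytes(string s)
{
    return cast(immutable(ubyte)[]) s;
}

void main()
{
    // Raw (backquoted) string: the six ASCII characters \ u 0 0 A 0,
    // with no escape processing.
    string asTyped = `\u00A0`;

    // Escape inside a normal string literal: the single code point U+00A0.
    string asCodePoint = "\u00A0";
    wstring utf16 = "\u00A0"w;
    dstring utf32 = "\u00A0"d;

    writefln("typed text, UTF-8 bytes: %(%02X %)", utf8Bytes(asTyped));
    writefln("U+00A0, UTF-8 bytes:     %(%02X %)", utf8Bytes(asCodePoint));
    writefln("U+00A0, UTF-16 units:    %(%04X %)", cast(immutable(ushort)[]) utf16);
    writefln("U+00A0, UTF-32 units:    %(%08X %)", cast(immutable(uint)[]) utf32);
}
```

The typed text is the bytes 5C 75 30 30 41 30, while U+00A0 encoded as UTF-8 is the two bytes C2 A0. On a console using code page 437 or 850, those two bytes display as the characters ┬ and á.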
"Unknown W. Brackets" <unknown simplemachines.org> writes:
That's because your code is wrong.

void \u00A0()

Is like:

void 'a'()

Which gives a similar message:

dummy.d(1): no identifier for declarator void
dummy.d(1): semicolon expected, not '97U'
dummy.d(1): Declaration expected, not '97U'

If I use Unicode, it works fine:

void を検索()

-[Unknown]


Oct 22 2005
Sean Kelly <sean f4.ca> writes:
In article <djedqc$h0e$1 digitaldaemon.com>, Unknown W. Brackets says...
That's because your code is wrong.

void \u00A0()

Is like:

void 'a'()

Is it? From the D spec:

Identifier:
    IdentifierStart
    IdentifierStart IdentifierChars

IdentifierChars:
    IdentifierChar
    IdentifierChar IdentifierChars

IdentifierStart:
    _
    Letter
    UniversalAlpha

Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.)

And from the C standard:

identifier:
    identifier-nondigit
    identifier identifier-nondigit
    identifier digit

identifier-nondigit:
    nondigit
    universal-character-name

6.4.3 Universal character names

Syntax

universal-character-name:
    \u hex-quad
    \U hex-quad hex-quad

hex-quad:
    hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit

..

Universal character names may be used in identifiers, character constants, and string literals to designate characters that are not in the basic character set.

Semantics

The universal character name \Unnnnnnnn designates the character whose eight-digit short identifier (as specified by ISO/IEC 10646) is nnnnnnnn. Similarly, the universal character name \unnnn designates the character whose four-digit short identifier is nnnn (and whose eight-digit short identifier is 0000nnnn).

Do I have the declaration format wrong? I'll admit I've never used these before.

Sean
Oct 22 2005
"Unknown W. Brackets" <unknown simplemachines.org> writes:
I'm not sure I understand why you're quoting what you are.

In D, you can do this:

printf("hello" \n);

Yes, no typo.  See that \n outside?  That works fine.  In D, \n is just 
another string literal.  So is \0.  In fact, so is \u00A0.  They are 
all string literals.  Like "hello".

This is not, in any way, the same as C, nor should it be construed to 
be.  D is not C.  D is not C with rockets strapped on... D is another 
language entirely, which is very similar to C.

D, unlike C (or at least, most implementations of C), supports Unicode 
source files.  This means, instead of resorting to tricks or pretending, 
you can actually use strings of such characters.  Things like . 
Obviously, people from other countries than America do not type \uBLAH 
every time they want a character like that, nor should you.

Unicode is multibyte.  For example, \u00A0 means the Unicode character 
U+00A0.  In other words, (in big-endian UTF-16) those exact bytes: 00 and 
A0.  This is in contrast to ANSI which only uses one byte per character.

D supports actually having those codes in the file.  Open it with a 
binary text editor, and add those two bytes right there for the function 
name.  This is true unicode.  This means that Japanese programmers can 
program Japanese in Japanese, not using \uBLAH.  If I were Japanese (or 
fluent in the language), I would rather use English than that.

You're right that this is something C supported that D does not. 
However, since it is - imho - entirely and completely useless 
(especially compared to the much better feature D does have), I don't 
see why anyone's going to complain.

-[Unknown]


Oct 23 2005
Sean Kelly <sean f4.ca> writes:
In article <djfdin$1aom$1 digitaldaemon.com>, Unknown W. Brackets says...
You're right that this is something C supported that D does not. 
However, since it is - imho - entirely and completely useless 
(especially compared to the much better feature D does have), I don't 
see why anyone's going to complain.

If that's the case then it's fine with me. I read the D spec as saying that it was intended to support this format, but perhaps it just meant that the chars were supported in UTF format? Why the bit about UniversalAlpha for identifiers then?

Sean
Oct 23 2005
Thomas Kühne <thomas-dloop kuehne.cn> writes:
If it only said UTF for identifiers, the following should compile:

void 6(){
    // do something
}

"6" is part of Unicode but is not a "UniversalAlpha".

Thomas
Oct 23 2005
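The point that being a Unicode character is not enough to be an identifier character can be approximated in D with std.uni.isAlpha. This is a rough sketch under a stated assumption: isAlpha tests the Unicode Alphabetic property, which is only a stand-in for the C99 Annex D table the spec refers to, so the two may disagree for some characters. The function name looksLikeIdentifierStart is hypothetical.

```d
import std.stdio : writefln;
import std.uni : isAlpha;

/// Rough stand-in for "could start an identifier": an underscore or a
/// Unicode alphabetic character. This is NOT the compiler's actual table.
bool looksLikeIdentifierStart(dchar c)
{
    return c == '_' || isAlpha(c);
}

void main()
{
    // Sample characters: a letter, an underscore, a digit,
    // U+00A0 (no-break space) and a hiragana letter.
    foreach (dchar c; "a_6\u00A0を"d)
        writefln("U+%04X: %s", cast(uint) c, looksLikeIdentifierStart(c));
}
```

Under this approximation the digit 6 and U+00A0 are both rejected, while the letter, the underscore and the hiragana character are accepted.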
"Walter Bright" <newshound digitalmars.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message
news:djehj1$jso$1 digitaldaemon.com...
 In article <djedqc$h0e$1 digitaldaemon.com>, Unknown W. Brackets says...
That's because your code is wrong.

void \u00A0()

Is like:

void 'a'()

Is it?

Yes, it's wrong. D does not support the \u or \U syntax for identifier characters. D supports actual embedded unicode alpha characters as identifier characters. Similarly, D does not support digraphs or trigraphs. D does support \u and \U as a way to specify unicode characters within string literals.

The reason that D doesn't support identifiers like abc\u00A0xyz is because nobody in their right mind would actually write such an identifier. C is forced to because the C source character set is unspecified, so digraphs, trigraphs, and \u kludges are necessary to get 'portable' source code. One would presumably write C unicode identifiers in unicode, then there'd be some translator that converted them to \u syntax so that it would work with other C compilers.

The D source character set *is* specified to be unicode, so there's no reason to translate it to \u notation.

I've also never, ever seen anyone use \u notation in C code outside of a test suite. Ditto for both trigraphs and digraphs.
Oct 23 2005
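A minimal sketch of the two cases described in the post above, assuming a current D2 compiler and a source file saved as UTF-8. The names größe and nbsp are invented for the example.

```d
import std.stdio : writeln;

// A non-ASCII letter typed directly into the source file is accepted
// as part of an identifier.
int größe()
{
    return 42;
}

// \u escapes are accepted inside string literals, where each one
// denotes a single code point.
string nbsp()
{
    return "\u00A0";
}

void main()
{
    writeln(größe());        // prints 42
    // U+00A0 occupies two code units in a UTF-8 string.
    writeln(nbsp().length);  // prints 2

    // Not accepted: an escape written in place of an identifier character.
    // void \u00F6() {}
}
```

The commented-out declaration is left as a comment on purpose, since enabling it would stop the file from compiling.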
Sean Kelly <sean f4.ca> writes:
In article <djfnnu$1i1f$1 digitaldaemon.com>, Walter Bright says...
"Sean Kelly" <sean f4.ca> wrote in message
news:djehj1$jso$1 digitaldaemon.com...
 In article <djedqc$h0e$1 digitaldaemon.com>, Unknown W. Brackets says...
That's because your code is wrong.

void \u00A0()

Is like:

void 'a'()

Is it?

Yes, it's wrong. D does not support the \u or \U syntax for identifier characters. D supports actual embedded unicode alpha characters as identifier characters. Similarly, D does not support digraphs or trigraphs.

Thanks for clearing that up. The reference to Appendix D of the C99 spec threw me, as it referred to these characters. I suppose I should have realized that you meant the letters themselves rather than the formatting.
I've also never, ever seen anyone use \u notation in C code outside of a
test suite. Ditto for both trigraphs and digraphs.

Me either. I stumbled across that portion of the spec yesterday and tried it on a whim :-) Sorry for the confusion. Sean
Oct 23 2005
zwang <nehzgnaw gmail.com> writes:
I have come across them in obfuscated C code :-)
Oct 23 2005