digitalmars.D.bugs - Universal character names not supported


Sean Kelly <sean f4.ca> writes:

C:\code\d>type test.d
void \u00A0() {}
void main() {}

C:\code\d>dmd test
test.d(1): no identifier for declarator void

Oct 21 2005

Thomas Kühne <thomas-dloop kuehne.cn> writes:


Sean Kelly wrote:
 C:\code\d>type test.d
 void \u00A0() {}
 void main() {}
 
 C:\code\d>dmd test
 test.d(1): no identifier for declarator void



What OS do you use?

Could you please zip the file and send it to me?
(the compression should ensure that no "magic" encoding conversions are
triggered)

Thomas

Oct 22 2005

Sean Kelly <sean f4.ca> writes:

In article <djct8n$uka$1 digitaldaemon.com>, Thomas Kühne says...
What OS do you use?

WinXP.  I created the file with UltraEdit, so a UTF-8 BOM may exist in the file
as well.

Could you please zip the file and send it to me?

Done.  I've also put it online here:

http://home.f4.ca/sean/d/ucn.zip


Sean

Oct 22 2005

Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Sean Kelly wrote on 2005-10-22:
 In article <djct8n$uka$1 digitaldaemon.com>, Thomas Kühne says...
What OS do you use?

 WinXP.  I created the file with UltraEdit, so a UTF-8 BOM may exist in the
 file as well.

Could you please zip the file and send it to me?

 Done.  I've also put it online here:

 http://home.f4.ca/sean/d/ucn.zip

That is not a Unicode problem. The file contains the *character* sequence "\u00A0",
whereas it should contain the *byte* sequence "00A0" (assuming UTF-16 BE) or
the corresponding *byte* sequence for UTF-8, UTF-16 LE, UTF-32 BE or UTF-32 LE.
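
To make the character-versus-byte distinction concrete, here is a minimal sketch. It assumes a current D compiler (the `string`, `wstring` and `dstring` aliases postdate this thread), and the helper names are made up for illustration. Inside a string literal the `\u00A0` escape is legal, and the compiler stores the encoded code point rather than the six characters of the escape:

```d
// Hypothetical helpers: each returns U+00A0 (NO-BREAK SPACE) in one encoding.
string  nbspUtf8()  { return "\u00A0";  }   // UTF-8 code units
wstring nbspUtf16() { return "\u00A0"w; }   // UTF-16 code units
dstring nbspUtf32() { return "\u00A0"d; }   // UTF-32 code units

void main()
{
    // UTF-8 needs two bytes for U+00A0: C2 A0.
    auto u8 = nbspUtf8();
    assert(u8.length == 2);
    assert(u8[0] == 0xC2 && u8[1] == 0xA0);

    // UTF-16 and UTF-32 each need a single code unit with the value 0x00A0;
    // whether that unit is stored as 00 A0 or A0 00 depends on the byte order.
    assert(nbspUtf16().length == 1 && nbspUtf16()[0] == 0x00A0);
    assert(nbspUtf32().length == 1 && nbspUtf32()[0] == 0x00A0);
}
```

A source file that uses such a character in an identifier has to contain these encoded bytes directly, not the backslash escape.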

For a source sample that uses non-ASCII identifiers try:
http://dstress.kuehne.cn/run/unicode_03.d

Thomas

Oct 22 2005

"Unknown W. Brackets" <unknown simplemachines.org> writes:

That's because your code is wrong.

void \u00A0()

Is like:

void 'a'()

Which gives a similar message:

dummy.d(1): no identifier for declarator void
dummy.d(1): semicolon expected, not '97U'
dummy.d(1): Declaration expected, not '97U'

If I use Unicode, it works fine:

void を検索()

-[Unknown]



Oct 22 2005

Sean Kelly <sean f4.ca> writes:

In article <djedqc$h0e$1 digitaldaemon.com>, Unknown W. Brackets says...
That's because your code is wrong.

void \u00A0()

Is like:

void 'a'()

Is it?  From the D spec:

Identifier:
IdentifierStart
IdentifierStart IdentifierChars

IdentifierChars:
IdentifierChar
IdentifierChar IdentifierChars

IdentifierStart:
_
Letter
UniversalAlpha

Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is 
the C99 Standard.)

And from the C standard:

identifier:
identifier-nondigit
identifier identifier-nondigit
identifier digit

identifier-nondigit:
nondigit
universal-character-name

6.4.3 Universal character names

Syntax

universal-character-name:
\u hex-quad
\U hex-quad hex-quad

hex-quad:
hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit

..

Universal character names may be used in identifiers, character constants, 
and string literals to designate characters that are not in the basic 
character set.

Semantics

The universal character name \Unnnnnnnn designates the character whose 
eight-digit short identifier (as specified by ISO/IEC 10646) is nnnnnnnn. 
Similarly, the universal character name \unnnn designates the character 
whose four-digit short identifier is nnnn (and whose eight-digit short 
identifier is 0000nnnn).

Do I have the declaration format wrong?  I'll admit I've never used these
before.


Sean

Oct 22 2005

"Unknown W. Brackets" <unknown simplemachines.org> writes:

I'm not sure I understand why you're quoting what you are.

In D, you can do this:

printf("hello" \n);

Yes, no typo.  See that \n outside?  That works fine.  In D, \n is just 
another character literal.  So is \0.  In fact, so is \u00A0.  They are 
all character literals.  Like "hello".

This is not, in any way, the same as C, nor should it be construed to 
be.  D is not C.  D is not C with rockets strapped on... D is another 
language entirely, which is very similar to C.

D, unlike C (or at least, most implementations of C), supports Unicode 
source files.  This means, instead of resorting to tricks or pretending, 
you can actually use strings of such characters.  Things like �. 
Obviously, people from other countries than America do not type \uBLAH 
every time they want a character like that, nor should you.

Unicode is multibyte.  For example, \u00A0 means the Unicode character 
00A0.  In other words, (in UTF-16) those exact bytes: 00 and A0.  This 
is in contrast to ANSI which only uses one byte per character.

D supports actually having those codes in the file.  Open it with a 
binary text editor, and add those two bytes right there for the function 
name.  This is true unicode.  This means that Japanese programmers can 
program Japanese in Japanese, not using \uBLAH.  If I were Japanese (or 
fluent in the language), I would rather use English than that.

You're right that this is something C supported that D does not. 
However, since it is - imho - entirely and completely useless 
(especially compared to the much better feature D does have), I don't 
see why anyone's going to complain.

-[Unknown]



Oct 23 2005

Sean Kelly <sean f4.ca> writes:

In article <djfdin$1aom$1 digitaldaemon.com>, Unknown W. Brackets says...
You're right that this is something C supported that D does not. 
However, since it is - imho - entirely and completely useless 
(especially compared to the much better feature D does have), I don't 
see why anyone's going to complain.

If that's the case then it's fine with me.  I read the D spec as that it was
intended to support this format, but perhaps it just meant that the chars were
supported in UTF format?  Why the bit about UniversalAlpha for identifiers then?


Sean

Oct 23 2005

Thomas Kühne <thomas-dloop kuehne.cn> writes:


Sean Kelly wrote:

 In article <djfdin$1aom$1 digitaldaemon.com>, Unknown W. Brackets says...
 
You're right that this is something C supported that D does not. 
However, since it is - imho - entirely and completely useless 
(especially compared to the much better feature D does have), I don't 
see why anyone's going to complain.

 
 
 If that's the case then it's fine with me.  I read the D spec as that it was
 intended to support this format, but perhaps it just meant that the chars were
 supported in UTF format?  Why the bit about UniversalAlpha for identifiers then?

If it only said UTF for identifiers, the following would have to compile:

void 6(){
     // do something
}

"6" is part of Unicode but is not a "UniversalAlpha".
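
A rough sketch of that distinction, assuming a current D compiler with Phobos. Note that `std.uni.isAlpha` is only a stand-in here: it is not the C99 Appendix D table the spec refers to, and `canStartIdentifier` is a made-up name for illustration.

```d
import std.uni : isAlpha, isNumber;

// Hypothetical approximation of IdentifierStart: an underscore or an
// alphabetic code point. The real rule uses the C99 Appendix D table.
bool canStartIdentifier(dchar c)
{
    return c == '_' || isAlpha(c);
}

void main()
{
    // '6' is a perfectly valid Unicode character, but it is a digit,
    // so it cannot begin an identifier.
    assert(isNumber('6'));
    assert(!canStartIdentifier('6'));

    // Letters qualify, whether ASCII or not.
    assert(canStartIdentifier('x'));
    assert(canStartIdentifier('é'));

    // U+00A0 is a no-break space, which this stand-in does not treat as a letter.
    assert(!canStartIdentifier('\u00A0'));
}
```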

Thomas


Oct 23 2005

"Walter Bright" <newshound digitalmars.com> writes:

"Sean Kelly" <sean f4.ca> wrote in message
news:djehj1$jso$1 digitaldaemon.com...
 In article <djedqc$h0e$1 digitaldaemon.com>, Unknown W. Brackets says...
That's because your code is wrong.

void \u00A0()

Is like:

void 'a'()

 Is it?

Yes, it's wrong. D does not support the \u or \U syntax for identifier
characters. D supports actual embedded unicode alpha characters as
identifier characters. Similarly, D does not support digraphs or trigraphs.

D does support \u and \U as a way to specify unicode characters within
string literals.
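
A small sketch of the two cases, assuming a current D compiler and a source file saved as UTF-8. The function name `größe` is just an illustrative choice, not something from the thread:

```d
// The identifier contains ö (U+00F6) and ß (U+00DF), typed directly
// into the source file rather than written as \u escapes.
int größe() { return 42; }

void main()
{
    // \u takes four hex digits and \U takes eight; both are accepted
    // inside string and character literals.
    string escaped = "gr\u00F6\u00DFe";
    string direct  = "größe";
    assert(escaped == direct);

    dchar nbsp = '\U000000A0';
    assert(nbsp == 0xA0);

    // Calling the function whose name uses embedded Unicode letters.
    assert(größe() == 42);

    // By contrast, a declaration such as `int gr\u00F6\u00DFe() {...}`
    // is rejected, as the first post in this thread shows.
}
```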

The reason that D doesn't support identifiers like abc\u00A0xyz is because
nobody in their right mind would actually write such an identifier. C is
forced to because the C source character set is unspecified, so digraphs,
trigraphs, and \u kludges are necessary to get 'portable' source code. One
would presumably write C unicode identifiers in unicode, then there'd be
some translator that converted them to \u syntax so that it would work with
other C compilers. The D source character set *is* specified to be unicode,
so there's no reason to translate it to \u notation.

I've also never, ever seen anyone use \u notation in C code outside of a
test suite. Ditto for both trigraphs and digraphs.

Oct 23 2005

Sean Kelly <sean f4.ca> writes:

In article <djfnnu$1i1f$1 digitaldaemon.com>, Walter Bright says...
"Sean Kelly" <sean f4.ca> wrote in message
news:djehj1$jso$1 digitaldaemon.com...
 In article <djedqc$h0e$1 digitaldaemon.com>, Unknown W. Brackets says...
That's because your code is wrong.

void \u00A0()

Is like:

void 'a'()

 Is it?

Yes, it's wrong. D does not support the \u or \U syntax for identifier
characters. D supports actual embedded unicode alpha characters as
identifier characters. Similarly, D does not support digraphs or trigraphs.

Thanks for clearing that up.  The reference to Appendix D of the C99 spec threw
me, as it referred to these characters.  I suppose I should have realized that
you meant the letters themselves rather than the formatting.

I've also never, ever seen anyone use \u notation in C code outside of a
test suite. Ditto for both trigraphs and digraphs.

Me either.  I stumbled across that portion of the spec yesterday and tried it on
a whim :-)  Sorry for the confusion.


Sean

Oct 23 2005

zwang <nehzgnaw gmail.com> writes:


I have come across them in obfuscated C code :-)

Oct 23 2005
