www.digitalmars.com         C & C++   DMDScript  

D - D is too english centric

reply "Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:
Hi,

I have noted that C99 allows *any* Unicode character to be used in
identifiers using \u. The D specification limits characters in identifiers
to letters, digits, and '_', but does not even define what a letter is. The
DMD implementation defines a letter to be ['A'..'Z', 'a'..'z'].

I find this unfortunate, and in contrast to one of the main goals of D:
link compatibility with C.

It has previously been argued that only English should be used for
identifiers in order to support reuse better across language boundaries. But
that argument isn't always valid. For example, half a decade ago, I was
involved in building the IT infrastructure for a nation-wide real estate
network. One of the requirements was that *everything* was in Danish. It
involved lots of developers nation-wide, but no one outside Denmark. Of
course, identifiers couldn't be fully Danish, which introduced
inconsistency in how things were named. But that was only a limitation of C
back then, which might not be an issue a few years from now. If D has this
limitation, it might be a valid reason to deselect D in favor of other
languages. After all, English is the native language of only a minority.

Regards,
Martin M. Pedersen
May 27 2003
next sibling parent reply "Walter" <walter digitalmars.com> writes:
"Martin M. Pedersen" <mmp www.moeller-pedersen.dk> wrote in message
news:bb0sqs$1t1k$1 digitaldaemon.com...
 I have noted that C99 allows *any* unicode character to be used in
 identifiers using \u.

No, only characters that fall into certain Unicode ranges.

 The D specification limits characters in identifiers
 to letters, digits, and '_', but does not even define what a letter is. The
 DMD implementation defines a letter to be ['A'..'Z', 'a'..'z'].
 I find this unfortunate, and in contrast to one of the main goals of D:
 Link compatibility with C.

It's a good idea to change it to match C for the reasons you state.
May 27 2003
next sibling parent reply "Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:
"Walter" <walter digitalmars.com> wrote in message
news:bb1c8v$2e2l$1 digitaldaemon.com...
 I have noted that C99 allows *any* unicode character to be used in
 identifiers using \u.

 No, only characters that fall into certain unicode ranges.

I haven't found that, but you are the expert, so I believe you. It makes sense too.

 Link compatibility with C.

I'm glad we are in line here :-)

Regards,
Martin M. Pedersen
May 28 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Martin M. Pedersen" <mmp www.moeller-pedersen.dk> wrote in message
news:bb2hou$oas$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:bb1c8v$2e2l$1 digitaldaemon.com...
 I have noted that C99 allows *any* unicode character to be used in
 identifiers using \u.

 No, only characters that fall into certain unicode ranges.

 I haven't found that, but you are the expert, so I believe you. It makes
 sense too.

"Each universal character name in an identifier shall designate a character whose encoding in ISO/IEC 10646 falls into one of the ranges specified in annex D." C99 6.4.2.1-3
May 28 2003
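The range rule Walter quotes amounts to a table lookup. Below is a minimal sketch in C, assuming a table of inclusive code point ranges. The two ranges shown are only an illustrative excerpt from the Latin part of the annex; the names `ucn_range`, `allowed_ranges` and `ucn_allowed_in_identifier` are made up for this example, and the full list should be taken from C99 annex D itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points. */
struct ucn_range { uint32_t lo, hi; };

/* Illustrative excerpt only (part of the Latin block). The authoritative
   list is C99 annex D and should be copied from there. */
static const struct ucn_range allowed_ranges[] = {
    { 0x00C0, 0x00D6 },  /* A-grave .. O-diaeresis */
    { 0x00D8, 0x00F6 },  /* O-stroke .. o-diaeresis */
};

/* Returns true if code point cp falls into one of the table's ranges. */
bool ucn_allowed_in_identifier(uint32_t cp)
{
    size_t n = sizeof allowed_ranges / sizeof allowed_ranges[0];
    for (size_t i = 0; i < n; i++) {
        if (cp >= allowed_ranges[i].lo && cp <= allowed_ranges[i].hi)
            return true;
    }
    return false;
}
```

With this excerpt, U+00D8 (the Danish capital O-stroke) is accepted while U+00D7 (the multiplication sign) is not, which is the kind of distinction a "letters only" rule is after.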
parent reply Burton Radons <loth users.sourceforge.net> writes:
Walter wrote:

 "Martin M. Pedersen" <mmp www.moeller-pedersen.dk> wrote in message
 news:bb2hou$oas$1 digitaldaemon.com...
 
"Walter" <walter digitalmars.com> wrote in message
news:bb1c8v$2e2l$1 digitaldaemon.com...

 I have noted that C99 allows *any* unicode character to be used in
 identifiers using \u.

 No, only characters that fall into certain unicode ranges.

 I haven't found that, but you are the expert, so I believe you. It makes
 sense too.

 "Each universal character name in an identifier shall designate a character whose encoding in ISO/IEC 10646 falls into one of the ranges specified in annex D." C99 6.4.2.1-3

This could be more easily done by encoding into UTF-8 and assuming any byte with the eighth bit set is part of an identifier. It allows weird obfuscations, yes, but why care about that? I won't write code that uses one of Unicode's whitespace characters, and anyone whose code would be worth use by me would also not abuse it. At worst it'd be one of those features that kids get into abusing before they smarten up.

C99's decision itself looks pretty bad. I'd use \u escapes for codes which I don't WANT rendered, because they have no rendering (whitespaces), because they would screw up rendering (controls), because they don't have a rendering in my code-writing font, or because they have special numeric significance.

Whether this feature is implemented by any compilers and editors is certainly important to Martin's stated requirements. If his clients can't read the code he's written, he hasn't fulfilled his contract. Much more successful would be to use an encoding like UTF-8 or one of the BOM'd encodings D supports; all programs developed for Finns will surely render that.

If it develops that C gets a link standard for Unicode identifiers, then that can be emulated when mangling extern (C). There's no cause for following C99 exactly in the code itself.
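The "eighth bit" rule is easy to picture in a lexer. Here is a rough sketch in C, not taken from DMD or any real compiler; `ident_start`, `ident_continue` and `scan_identifier` are names invented for this example. It treats every byte of a multi-byte UTF-8 sequence as an identifier character, so it also accepts the Unicode whitespace and punctuation mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if byte c may start an identifier under the "high bit" rule:
   ASCII letters, '_', or any byte with the eighth bit set (that is,
   any byte of a multi-byte UTF-8 sequence). */
static bool ident_start(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           c == '_' || c >= 0x80;
}

/* Identifier continuation: anything that may start one, plus digits. */
static bool ident_continue(unsigned char c)
{
    return ident_start(c) || (c >= '0' && c <= '9');
}

/* Returns the length in bytes of the identifier at the start of the
   NUL-terminated string s, or 0 if s does not begin with one. */
size_t scan_identifier(const char *s)
{
    size_t n = 0;
    if (!ident_start((unsigned char)s[0]))
        return 0;
    while (ident_continue((unsigned char)s[n]))
        n++;
    return n;
}
```

For the UTF-8 text of a Danish word such as "rød" followed by " = 1", the scanner returns 4, because the letter ø occupies two bytes. No Unicode tables are needed, which is the simplicity being argued for.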
May 28 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Burton Radons" <loth users.sourceforge.net> wrote in message
news:bb3s9f$29qv$1 digitaldaemon.com...
 Walter wrote:
 "Each universal character name in an identifier shall designate a
 character whose encoding in ISO/IEC 10646 falls into one of the ranges
 specified in annex D." C99 6.4.2.1-3

 This could be more easily done by encoding into UTF-8 and assuming any byte with the eighth bit set is an identifier. It allows weird obfuscations, yes, but why care about that? I won't write code that uses one of UNICODE's whitespace characters, and anyone whose code would be worth use by me would also not abuse it. At worst it'd be one of those features that kids get into abusing before they smarten up. C99's decision itself looks pretty bad. I'd use \u escapes for codes which I don't WANT rendered because either they have no rendering (whitespaces), because they would screw up rendering (controls), don't have a rendering in my code-writing font, or have special numeric significance. Whether this feature is implemented by any compilers and editors is certainly important to Martin's stated requirements. If his clients can't read the code he's written, he hasn't fulfilled his contract. Much more successful would be to use an encoding like UTF-8 or one of the BOM'd encodings D supports; all programs developed for Finns will surely render that. If it develops that C gets a link standard for UNICODE identifiers, then that can be emulated when mangling extern (C). There's no cause for following C99 exactly in the code itself.

This is C's third attempt at internationalizing C source code. In 15 years I have yet to see any C source outside of a test suite that used trigraphs or digraphs. I'm skeptical the \u scheme will catch on, either.

I think the best way is to simply declare that the source text is UTF-8, UTF-16, or UTF-32. D already recognizes and automatically handles all three. Then, it is simply a matter of deciding which Unicode characters to allow in identifiers and as whitespace. The advantage of that is you can edit the source in any text editor that supports Unicode if you want to use more than ASCII. There is no need for any special editors that recognize trigraphs, digraphs, or on-the-fly \u translation.
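One common way to recognize the three encodings is to look for a byte-order mark at the start of the file. The sketch below is in C and is only an illustration of that idea. It is not DMD's actual detection code, and the names `src_encoding` and `detect_encoding` are invented here. It assumes a file without a BOM is UTF-8.

```c
#include <stddef.h>
#include <string.h>

enum src_encoding {
    ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF32LE, ENC_UTF32BE
};

/* Guess the encoding from a byte-order mark. Assumption: input with no
   BOM is treated as UTF-8. */
enum src_encoding detect_encoding(const unsigned char *buf, size_t len)
{
    /* UTF-32LE must be checked before UTF-16LE, because its BOM
       (FF FE 00 00) begins with the UTF-16LE BOM (FF FE). */
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return ENC_UTF32LE;
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return ENC_UTF32BE;
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0) return ENC_UTF16LE;
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0) return ENC_UTF16BE;
    return ENC_UTF8;  /* covers both the EF BB BF BOM and no BOM at all */
}
```

A caveat of this simple ordering: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE. Source files do not normally start that way, but a production lexer may want a stricter check.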
May 28 2003
parent reply Bill Cox <bill viasic.com> writes:
I'll put in a vote for UTF-8 support.  It seems to have the best chance 
of getting support from Linux IDEs and debuggers.

Bill

May 29 2003
parent Benji Smith <Benji_member pathlink.com> writes:
I agree. Source should be UTF-8.

--Benji


In article <3ED5FFE7.3040100 viasic.com>, Bill Cox says...
I'll put in a vote for UTF-8 support.  It seems to have the best chance 
of getting support from Linux IDEs and debuggers.

Bill

May 29 2003
prev sibling parent "Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:
"Walter" <walter digitalmars.com> wrote in message
news:bb1c8v$2e2l$1 digitaldaemon.com...
 The
 DMD implementation defines a letter to be ['A'..'Z', 'a'..'z'].
 I find this unfortunate, and in contrast to one of the main goals of D:
 Link compatibility with C.

 It's a good idea to change it to match C for the reasons you state.

Another way of resolving this would be to give the programmer control of the external identifier. Something like this:

    extern (C)
    {
        extern("foo\u4444") void foo() { bar(); }
        extern("bar\u4444") void bar();
    }

That would also allow us to access mangled C++ identifiers, and identifiers containing '$'. It would not be easy, but that is not what I ask for. I only want it to be possible.

Regards,
Martin M. Pedersen
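For comparison, GCC and Clang already offer something in this spirit for C: an asm label on a declaration sets the symbol name the linker sees, independently of the name used in the source. The sketch below assumes one of those compilers, since the asm label is a compiler extension and not ISO C. The symbol name `bar_external_name` is made up for the example and deliberately kept to plain ASCII, because which other characters are accepted depends on the assembler and target.

```c
/* Assumption: GCC or Clang. The asm label is an extension, not ISO C. */

/* In source the function is called bar; in the object file its symbol
   is bar_external_name. */
int bar(void) __asm__("bar_external_name");

int bar(void) { return 42; }

/* Callers keep using the source-level name. */
int foo(void) { return bar(); }
```

This shows that separating the source name from the external name is workable in practice, which is all the proposed extern("...") form would need.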
May 29 2003
prev sibling next sibling parent reply Ilya Minkov <Ilya_member pathlink.com> writes:
In article <bb0sqs$1t1k$1 digitaldaemon.com>, Martin M. Pedersen says...

It has previously been argued that only English should be used for
identifiers in order to support reuse better across language boundaries. But
that argument isn't always valid. For example, half a decade ago, I was
involved in building the IT infrastructure for a nation-wide real estate
network. One of the requirements was that *everything* was in Danish. It
involved lots of developers nation-wide, but no one outside Denmark. Of
course, identifiers couldn't be fully Danish, which introduced
inconsistency in how things were named. But that was only a limitation of C
back then, which might not be an issue a few years from now. If D has this
limitation, it might be a valid reason to deselect D in favor of other
languages. After all, English is the native language of only a minority.

Hello,

I believe there was a flamewar on this topic a few months ago, starting from an old 1st April joke article from Bjarne Stroustrup about adding Unicode identifiers to C++. I believe that most people on this newsgroup are not native English speakers. And nonetheless, the idea has found very little support, since:

- for almost any language, a transliteration scheme exists which approximates the language in terms of the Latin alphabet;
- keywords are English anyway, and in D there is no preprocessor to un-English them. :) Using any language other than English would lead to inconsistency anyway.
- I know quite a number of languages, but I have tremendous problems switching between them. It may take minutes every time. And having seen a single English keyword, I start thinking in English and you can be sure of all my subsequent comments being in English. Then, I also can't read both code and comments simultaneously. So I have to translate the comments into English to get going. I even refuse to use any code with comments in my native language. I believe there are plenty of people experiencing the same problem.

So, if you *really* want to mix your native language into a project, why don't you write a scanner, which would:

- translate keywords from your language into D;
- transliterate all other identifiers into Latin letters.

This would basically be an extended version of a lexer, and lexing D is really simple. Besides, there's a good ready-made lexer to borrow. :)
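The transliteration half of such a scanner can be sketched in a few lines of C. The mapping below covers only the three Danish letters and uses the conventional ae/oe/aa spellings; the table `translit` and the function `transliterate` are hypothetical names for this example, and a real tool would need a table per language plus the keyword translation step.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping for Danish (UTF-8 input). Other languages would
   need their own table. */
static const struct { const char *utf8; const char *latin; } translit[] = {
    { "\xC3\xA6", "ae" }, { "\xC3\xB8", "oe" }, { "\xC3\xA5", "aa" },
    { "\xC3\x86", "Ae" }, { "\xC3\x98", "Oe" }, { "\xC3\x85", "Aa" },
};

/* Copies src to dst, replacing mapped UTF-8 sequences with their Latin
   spelling. dst holds at most cap bytes including the terminator.
   Unmapped bytes are copied unchanged.
   Returns 0 on success, -1 if dst is too small. */
int transliterate(const char *src, char *dst, size_t cap)
{
    size_t out = 0;
    while (*src) {
        const char *rep = NULL;
        size_t skip = 1;
        for (size_t i = 0; i < sizeof translit / sizeof translit[0]; i++) {
            size_t len = strlen(translit[i].utf8);
            if (strncmp(src, translit[i].utf8, len) == 0) {
                rep = translit[i].latin;
                skip = len;
                break;
            }
        }
        size_t rlen = rep ? strlen(rep) : 1;
        if (out + rlen + 1 > cap)
            return -1;
        if (rep)
            memcpy(dst + out, rep, rlen);
        else
            dst[out] = *src;
        out += rlen;
        src += skip;
    }
    if (cap == 0)
        return -1;
    dst[out] = '\0';
    return 0;
}
```

One design consequence worth noting: the mapping is not reversible (an identifier spelled "roed" in the source is indistinguishable from a transliterated "rød"), so two distinct names can collide after translation.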
I have noted that C99 allows *any* unicode character to be used in
identifiers using \u. The D specification limits characters in identifiers
to letters, digits, and '_', but does not even define what a letter is. The
DMD implementation defines a letter to be ['A'..'Z', 'a'..'z'].

It is defined in the library. :>
I find this unfortunate, and in contrast to one of the main goals of D:
Link compatibility with C.

I have not seen a single piece of code using this silly feature. Is there any programmer's editor which has \u Unicode support as of yet? And any IDE? I would also like to see how many compilers implement that - and in what manner. Even if some do, their implementation would probably be incompatible with that of other compilers. So would you say C violates the requirement of link compatibility with itself as well? :>

-i.
May 28 2003
next sibling parent "Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:
"Ilya Minkov" <Ilya_member pathlink.com> wrote in message
news:bb2cup$in4$1 digitaldaemon.com...
 Hello, i believe there was a flamewar to this topic a few months ago,
 starting from an old 1st april joke article from Bjarne Stroustrup about
 adding unicode identifiers to C++.

I don't want to get into a flamewar, and I don't want to argue against your preferences for using english. My point is simply that sometimes it is not a choice one can make. For example, if you are supplied with libraries using unicode identifiers, that you are required to use. If it is necessary to wrap such functions in other C code, D cannot be said to be link compatible with C (C99). Likewise, you might also be required to implement an interface using such identifiers.
 I have not seen a single piece of code using this silly feature.

That is not really an argument. The feature exists, and will get support from compilers as time goes by. Silly or not, compilers cannot be said to be C99 compliant if they do not support it. Any serious compiler vendor will go in that direction. And some will use this feature - there must have been a reason for its introduction.
 Is there any programmer's editor which has \u unicode support as of yet?

They don't have to, as I read the document. They only need to support editing Unicode. Translation phase 1 is:

"Physical source file multibyte characters are mapped to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representation."

I believe this is also how DMD does things (except the trigraph stuff) - it maps Unicode chars to \u-sequences, that is.

 I would also like to see how many compilers implement that - and in what
 manner.

I don't know if the ABI is completely standardized, but the translation limits chapter gives me a clue how it is to be done:

"31 significant initial characters in an external identifier (each universal character name specifying a short identifier of 0000FFFF or less is considered 6 characters, each universal character name specifying a short identifier of 00010000 or more is considered 10 characters, and each extended source character is considered the same number of characters as the corresponding universal character name, if any)"

The numbers 6 and 10 indicate to me that they will be encoded using "\uXXXX" and "\UXXXXXXXX" or something very similar. But that is only a guess.

Regards,
Martin M. Pedersen
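The counting rule in that quotation can be worked through in code. Below is a small C sketch, written under one assumption that the quotation does not spell out: code points below 0x80 are taken to be basic source characters and counted as 1. The function names `significant_cost` and `significant_prefix` are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Characters one code point contributes toward the limit, following the
   quoted rule. Assumption: code points below 0x80 stand for basic source
   characters and count as 1. */
static size_t significant_cost(uint32_t cp)
{
    if (cp < 0x80)
        return 1;
    return cp <= 0xFFFF ? 6 : 10;
}

/* Returns how many leading code points of an identifier fit within
   limit significant characters (31 for external identifiers in C99). */
size_t significant_prefix(const uint32_t *cps, size_t n, size_t limit)
{
    size_t used = 0, i = 0;
    while (i < n && used + significant_cost(cps[i]) <= limit) {
        used += significant_cost(cps[i]);
        i++;
    }
    return i;
}
```

Under this rule an identifier like foo followed by U+4444 costs 3 + 6 = 9 characters, while an identifier made of six non-ASCII letters from the basic plane costs 36, so only its first five letters are guaranteed to be significant. Names written entirely in a non-Latin script therefore run into the limit after about five characters.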
May 28 2003
prev sibling parent reply Bill Cox <bill viasic.com> writes:
Hi, Ilya.

 I have not seen a single piece of code using this silly feature. Is there any
 programmer's editor which has \u unicode support as of yet? And any IDE?

The latest version of Vim supports UTF-8. However, it requires a kernel patch that isn't in RedHat 7.3. It is supposed to be in 8.0 on. It also doesn't work in the last version of Cygwin I installed. Anyone know how UTF support is coming along in emacs?

Bill
May 28 2003
parent Bill Cox <bill viasic.com> writes:
Err.... I read your post a little more carefully...  I don't know of any 
programming editors directly supporting the \u and \U features of C.

May 28 2003
prev sibling next sibling parent reply Georg Wrede <Georg_member pathlink.com> writes:
In article <bb0sqs$1t1k$1 digitaldaemon.com>, Martin M. Pedersen says...
It has previously been argued that only English should be used for
identifiers in order to support reuse better across language boundaries. But
that argument isn't always valid. For example, half a decade ago, I was
involved in building the IT infrastructure for a nation-wide real estate
network. One of the requirements was that *everything* was in Danish. It
involved lots of developers nation-wide, but no one outside Denmark. Of
course, identifiers couldn't be fully Danish, which introduced
inconsistency in how things were named.

Back in the bad old days, before MSDOS, we all used CP/M. There was this Nationalist project in Finland, with the goal of translating all operating system commands to Finnish, or Finnish abbreviations. Ostensibly this would be easier on people.

Turned out nobody wanted to use or learn the Finnish version. Their explanation: since these commands are "new words" to you anyway, the least of your troubles is the spelling. Compared with trying to grasp the meaning of these new concepts the spelling is a non-issue. And if you then have to use a non-Finnish version, you're totally lost.

Sure, D code written in Chinese would be more compact, maybe even more legible (in an absolute sense), with its one character variable names and method names. Maybe even parentheses and plus signs could be in Chinese equivalents. But I don't believe they'd want it.

Most Finnish companies have a policy where all program code and comments have to be in English. Even in those companies where the programmers and staff speak hardly any English at all.
May 28 2003
parent reply "Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:
"Georg Wrede" <Georg_member pathlink.com> wrote in message
 Most Finnish companies have a policy where all program
 code and comments have to be in English. Even in those
 companies where the programmers and staff speak hardly
 any English at all.

So do we. Yet there are exceptions. If the customer pays us to develop and deliver source code, it is his requirements that count, not our policy.

Regards,
Martin M. Pedersen
May 28 2003
parent "Walter" <walter digitalmars.com> writes:
"Martin M. Pedersen" <mmp www.moeller-pedersen.dk> wrote in message
news:bb2l7u$s2i$1 digitaldaemon.com...
 So do we. Yet there are exceptions. If the customer pays us to develop and
 deliver source code, it is his requirements that counts, not our policy.

Yup. Listen to the customers, not the marketing department <g>.
May 28 2003
prev sibling next sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
I agree that D is too English-centric (even ASCII-centric).

Concern about C99 link compatibility leads me to reflect on C99's boolean type:

http://www.uic.edu/classes/mcs/mcs494/f01/transparencies/sec8.4.pdf

Mark
May 28 2003
parent Mark Evans <Mark_member pathlink.com> writes:
Actually I still think that link compatibility with Digital Mars C++ would be a
huge win for D.  C++ also has a bool type.

Mark
May 28 2003
prev sibling parent reply Mark T <Mark_member pathlink.com> writes:
I have noted that C99 allows *any* unicode character to be used in
identifiers using \u. The D specification limits characters in identifiers
to letters, digits, and '_', but does not even define what a letter is. The
DMD implementation defines a letter to be ['A'..'Z', 'a'..'z'].

I don't think there is a full implementation of C99 yet. It was adopted in late 1999. Maybe some of this stuff will disappear due to lack of use. Did ISO sack the trigraph crap from C89/C90?
May 29 2003
parent "Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:
"Mark T" <Mark_member pathlink.com> wrote in message
news:bb6710$1v5d$1 digitaldaemon.com...
 I don't think there is a full implementation of C99 yet. It was adopted in
 late 1999. Maybe some of this stuff will disappear due to lack of use. Did ISO
 sack the trigraph crap from C89/C90?

No, trigraphs are still there.

Regards,
Martin M. Pedersen
May 30 2003