
digitalmars.D - Updating D beyond Unicode 2.0

reply Neia Neutuladh <neia ikeran.org> writes:
D's currently accepted identifier characters are based on Unicode 
2.0:

* ASCII range values are handled specially.
* Letters and combining marks from Unicode 2.0 are accepted.
* Numbers outside the ASCII range are accepted.
* Eight random punctuation marks are accepted.

This follows the C99 standard.
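To make that concrete, here's roughly what compiles and what doesn't 
under the current rules (an illustrative sketch, not verified against 
any particular compiler release):

    void main()
    {
        int é = 1;   // Latin letter, in Unicode 2.0: accepted
        int π = 2;   // Greek letter, in Unicode 2.0: accepted
        //int Ꭰ = 3; // Cherokee letter, added in Unicode 3.0: rejected
    }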


Most modern languages accept a wide range of Unicode identifier 
characters: Python, ECMAScript, just to name a few. A small number of 
languages reject non-ASCII characters: Dart, Perl. Some languages are 
weirdly generous: Swift and C11 allow everything outside the Basic 
Multilingual Plane.

I'd like to update that so that D accepts something as a valid 
identifier character if it's a letter or combining mark or 
modifier symbol that's present in Unicode 11, or a non-ASCII 
number. This allows the 146 most popular writing systems and a 
lot more characters from those writing systems. This *would* 
reject those eight random punctuation marks, so I'll keep them in 
as legacy characters.
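In rough terms, the proposed predicate would look something like the 
following minimal sketch. It leans on std.uni, whose tables track 
whatever Unicode version the installed Phobos ships with (not 
necessarily Unicode 11), and it elides the modifier-symbol and legacy 
punctuation tables:

    import std.uni : isAlpha, isMark, isNumber;
    import std.ascii : isAlphaNum;

    bool isIdentChar(dchar c)
    {
        if (c < 0x80)                     // ASCII handled specially
            return c == '_' || isAlphaNum(c);
        // Non-ASCII: letters, combining marks, and numbers.
        // Modifier symbols (category Sk) and the eight legacy
        // punctuation marks would need explicit table entries.
        return isAlpha(c) || isMark(c) || isNumber(c);
    }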

It would mean we don't have to reference the C99 standard when 
enumerating the allowed characters; we just have to refer to the 
Unicode standard, which we already need to talk about in the 
lexical part of the spec.

It might also make the lexer a tiny bit faster; it reduces the 
number of valid-ident-char segments to search from 245 to 134. On 
the other hand, it will change the ident char ranges from wchar 
to dchar, which means the table takes up marginally more memory.
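For reference, the table in question is just a sorted list of inclusive 
ranges that the lexer searches per character; something like this 
sketch (layout assumed for illustration, not dmd's actual data 
structure):

    struct CharRange { dchar lo, hi; }  // ranges were wchar-sized before

    bool inIdentTable(const CharRange[] table, dchar c)
    {
        // binary search over sorted, non-overlapping ranges
        size_t lo = 0, hi = table.length;
        while (lo < hi)
        {
            const mid = (lo + hi) / 2;
            if (c < table[mid].lo)      hi = mid;
            else if (c > table[mid].hi) lo = mid + 1;
            else return true;
        }
        return false;
    }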

And, of course, it lets you write programs entirely in Linear B, 
and that's a marketing ploy not to be missed.

I've got this coded up and can submit a PR, but I thought I'd get 
feedback here first.

Does anyone see any horrible potential problems here?

Or is there an interestingly better option?

Does this need a DIP?
Sep 21 2018
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
When I originally started with D, I thought non-ASCII identifiers with Unicode 
was a good idea. I've since slowly become less and less enthusiastic about it.

First off, D source text simply must (and does) fully support Unicode in 
comments, characters, and string literals. That's not an issue.

But identifiers? I've seen hardly any use of non-ASCII identifiers in C, 
C++, or D. In fact, I've seen zero use of it outside of test cases. I don't see 
much point in expanding the support of it. If people use such identifiers, the 
result would most likely be annoyance rather than illumination when people who 
don't know that language have to work on the code.

Extending it further will also cause problems for all the tools that work 
with D object code, like debuggers, disassemblers, linkers, filesystems, etc.

Absent a much more compelling rationale for it, I'd say no.
Sep 21 2018
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:
 But identifiers? I've seen hardly any use of non-ASCII 
 identifiers in C, C++, or D. In fact, I've seen zero use of it 
 outside of test cases.
Do you look at Japanese D code much? Or Turkish? Or Chinese? I know 
there are decently sized D communities in those languages, and I am 
pretty sure I have seen identifiers in their languages before, but I 
can't find it right now.

The thing is, there's a pretty clear potential for observation bias 
here. Even our search engine queries are going to be biased toward 
English-language results, so there can be a whole D world kinda 
invisible to you and me.

We should reach out and get solid stats before making a final decision.
 most likely be annoyance rather than illumination when people 
 who don't know that language have to work on the code.
Well, for example, with a Chinese company, they may very well find forced English identifiers to be an annoyance.
Sep 21 2018
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 09/21/2018 04:18 PM, Adam D. Ruppe wrote:

 Well, for example, with a Chinese company, they may very well find
 forced English identifiers to be an annoyance.
Fully agreed, but as far as I know, Turkish companies use English in 
source code. The Turkish alphabet is Latin based, where dotted and 
undotted versions of Latin letters are distinct and produce different 
meanings. Quick examples:

sık: dense (n), squeeze (v), ...
sik: penis (n), f*ck (v) [1]
şık: one of multiple choices (1), swanky (2)
döndür: return
dondur: make frozen
sök: disassemble, dismantle, ...
sok: insert, install, ...
şok: shock

Hence, non-Unicode is unacceptable in Turkish code unless we reserve 
programming to English speakers only, which is unacceptable because it 
would be exclusionary and would produce English identifiers that are 
frequently amusing. I've seen the latter in code of English learners. :)

Ali

[1] 
https://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail
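Incidentally, these letters sit in Latin Extended-A, which Unicode 2.0 
already covers, so they should already be legal in D identifiers 
today; a quick illustrative check (assumed behavior, not verified 
against a specific compiler):

    void main()
    {
        int sık;  // dotted and dotless i are distinct code points,
        int sik;  // so these are two different identifiers
        int şok;  // ş (U+015F) is a Unicode 2.0 Latin letter
    }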
Sep 23 2018
parent Kagamin <spam here.lot> writes:
On Sunday, 23 September 2018 at 11:18:42 UTC, Ali Çehreli wrote:
 Hence, non-Unicode is unacceptable in Turkish code
You even contributed to http://code.google.com/p/trileri/source/browse/trunk/tr/yazi.d
Sep 23 2018
prev sibling next sibling parent reply Neia Neutuladh <neia ikeran.org> writes:
On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:
 But identifiers? I've seen hardly any use of non-ASCII 
 identifiers in C, C++, or D. In fact, I've seen zero use of it 
 outside of test cases. I don't see much point in expanding the 
 support of it. If people use such identifiers, the result would 
 most likely be annoyance rather than illumination when people 
 who don't know that language have to work on the code.
...you *do* know that not every codebase has people working on it who 
only know English, right?

If I took a software development job in China, I'd need to learn 
Chinese. I'd expect the codebase to be in Chinese. Because a Chinese 
company generally operates in Chinese, and they're likely to have a 
lot of employees who only speak Chinese. And no, you can't just 
transcribe Chinese into ASCII. Same for Spanish, Norwegian, German, 
Polish, Russian -- heck, it's almost easier to list out the languages 
you *don't* need non-ASCII characters for.

Anyway, here's some more D code using non-ASCII identifiers, in case 
you need examples: https://git.ikeran.org/dhasenan/muzikilo
Sep 21 2018
next sibling parent Thomas Mader <thomas.mader gmail.com> writes:
On Saturday, 22 September 2018 at 01:08:26 UTC, Neia Neutuladh 
wrote:
 ...you *do* know that not every codebase has people working on 
 it who only know English, right?
This topic boils down to diversity vs. productivity, and supporting 
diversity in this case is questionable.

I work in a German speaking company and we have no developers who are 
not speaking German for now. In fact all are native speakers. Still we 
write our code, comments and commit messages in English. Even at 
university you learn that you should use English to code. The 
reasoning is simple: you never know who will work on your code in the 
future. If a company writes code in Chinese, they will have a hard 
time expanding the development of their codebase, even though Chinese 
is spoken by that many people. So even though you could use all sorts 
of characters, in a productive environment you had better choose not 
to do so. You might end up shooting yourself in the foot in the long 
run.

Diversity is important in other areas, but I don't see much advantage 
here, at least for now, because the spoken languages of today don't 
differ tremendously in what they are capable of expressing. This is 
also true for today's programming languages. Most of them are just 
different syntax for the very same ideas and concepts. That's not very 
helpful to bring people together and advance. My understanding is that 
even life, with its great diversity, has just one language (DNA) to 
define it.
Sep 22 2018
prev sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 9/21/18 9:08 PM, Neia Neutuladh wrote:
 On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:
 But identifiers? I've seen hardly any use of non-ASCII 
 identifiers in C, C++, or D. In fact, I've seen zero use of it outside 
 of test cases. I don't see much point in expanding the support of it. 
 If people use such identifiers, the result would most likely be 
 annoyance rather than illumination when people who don't know that 
 language have to work on the code.
 ...you *do* know that not every codebase has people working on it who 
 only know English, right?

 If I took a software development job in China, I'd need to learn 
 Chinese. I'd expect the codebase to be in Chinese. Because a Chinese 
 company generally operates in Chinese, and they're likely to have a 
 lot of employees who only speak Chinese. And no, you can't just 
 transcribe Chinese into ASCII. Same for Spanish, Norwegian, German, 
 Polish, Russian -- heck, it's almost easier to list out the languages 
 you *don't* need non-ASCII characters for.

 Anyway, here's some more D code using non-ASCII identifiers, in case 
 you need examples: https://git.ikeran.org/dhasenan/muzikilo
But aren't we arguing about the wrong thing here? D already accepts 
non-ASCII identifiers. What languages need an upgrade to unicode 
symbol names? In other words, what symbols aren't possible with the 
current support?

Or maybe I'm misunderstanding something.

-Steve
Sep 22 2018
parent reply Neia Neutuladh <neia ikeran.org> writes:
On Saturday, 22 September 2018 at 12:35:27 UTC, Steven 
Schveighoffer wrote:
 But aren't we arguing about the wrong thing here? D already 
 accepts non-ASCII identifiers.
Walter was doing that thing that people in the US who only speak 
English tend to do: forgetting that other people speak other 
languages, and that people who speak English can learn other languages 
to work with people who don't speak English. He was saying it's 
inevitably a mistake to use non-ASCII characters in identifiers and 
that nobody does use them in practice.

Walter talking like that sounds like he'd like to remove support for 
non-ASCII identifiers from the language. I've gotten by without 
maintaining a set of personal patches on top of DMD so far, and I'd 
like it if I didn't have to start.
 What languages need an upgrade to unicode symbol names? In 
 other words, what symbols aren't possible with the current 
 support?
Chinese and Japanese have gained about eleven thousand symbols since 
Unicode 2. Unicode 2 covers 25 writing systems, while Unicode 11 
covers 146.

Just updating to Unicode 3 would give us Cherokee, Ge'ez (multiple 
languages), Khmer (Cambodian), Mongolian, Burmese, Sinhala (Sri 
Lanka), Thaana (Maldivian), Canadian aboriginal syllabics, and Yi 
(Nuosu).
Sep 22 2018
next sibling parent reply Erik van Velzen <erik evanv.nl> writes:
On Saturday, 22 September 2018 at 16:56:10 UTC, Neia Neutuladh 
wrote:
 Walter was doing that thing that people in the US who only 
 speak English tend to do: forgetting that other people speak 
 other languages, and that people who speak English can learn 
 other languages to work with people who don't speak English. He 
 was saying it's inevitably a mistake to use non-ASCII 
 characters in identifiers and that nobody does use them in 
 practice.
There's a more charitable view, and that's that even furriners usually 
use English identifiers.

Nobody in this thread so far has said they are programming in 
non-ASCII. If there was a contingent of Japanese or Chinese users 
doing that, then surely they would speak up here or in Bugzilla to 
advocate for this feature?
Sep 22 2018
next sibling parent Neia Neutuladh <neia ikeran.org> writes:
On Saturday, 22 September 2018 at 19:59:42 UTC, Erik van Velzen 
wrote:
 Nobody in this thread so far has said they are programming in 
 non-ASCII.
I did. https://git.ikeran.org/dhasenan/muzikilo
Sep 22 2018
prev sibling next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Saturday, 22 September 2018 at 19:59:42 UTC, Erik van Velzen 
wrote:
 Nobody in this thread so far has said they are programming in 
 non-ASCII.
This is the obvious observation bias I alluded to before: of course 
people who don't read and write English aren't in this thread, since 
they cannot read or write the English used in this thread! Ditto for 
bugzilla.

Absence of evidence CAN be evidence of absence... but not when the 
absence is so easily explained by our shared bias.

Neia Neutuladh posted one link. I have seen Japanese D code before on 
twitter, but cannot find it now (surely because the search engines 
also share this bias). Perhaps those are the only two examples in 
existence, but I stand by my belief that we must reach out to these 
other communities somehow and do a proper, proactive study before 
dismissing the possibility.
Sep 22 2018
parent reply sarn <sarn theartofmachinery.com> writes:
On Sunday, 23 September 2018 at 00:18:06 UTC, Adam D. Ruppe wrote:
 I have seen Japanese D code before on twitter, but cannot find 
 it now (surely because the search engines also share this bias).
You can find a lot more Japanese D code on this blogging platform:
https://qiita.com/tags/dlang

Here's the most recent post to save you a click:
https://qiita.com/ShigekiKarita/items/9b3aa8f716848278ef62
Sep 22 2018
parent reply Shachar Shemesh <shachar weka.io> writes:
On 23/09/18 04:29, sarn wrote:
 On Sunday, 23 September 2018 at 00:18:06 UTC, Adam D. Ruppe wrote:
 I have seen Japanese D code before on twitter, but cannot find it now 
 (surely because the search engines also share this bias).
 You can find a lot more Japanese D code on this blogging platform:
 https://qiita.com/tags/dlang

 Here's the most recent post to save you a click:
 https://qiita.com/ShigekiKarita/items/9b3aa8f716848278ef62
Comments in Japanese. Identifiers in English.

Not advancing your point, I think.

Shachar
Sep 22 2018
parent reply sarn <sarn theartofmachinery.com> writes:
On Sunday, 23 September 2018 at 06:53:21 UTC, Shachar Shemesh 
wrote:
 On 23/09/18 04:29, sarn wrote:
 You can find a lot more Japanese D code on this blogging 
 platform:
 https://qiita.com/tags/dlang
 
 Here's the most recent post to save you a click:
 https://qiita.com/ShigekiKarita/items/9b3aa8f716848278ef62
 Comments in Japanese. Identifiers in English.

 Not advancing your point, I think.

 Shachar
Well, I knew that when I posted, so I honestly have no idea what point you assumed I was making.
Sep 23 2018
parent Shachar Shemesh <shachar weka.io> writes:
On 23/09/18 15:38, sarn wrote:
 On Sunday, 23 September 2018 at 06:53:21 UTC, Shachar Shemesh wrote:
 On 23/09/18 04:29, sarn wrote:
 You can find a lot more Japanese D code on this blogging platform:
 https://qiita.com/tags/dlang

 Here's the most recent post to save you a click:
 https://qiita.com/ShigekiKarita/items/9b3aa8f716848278ef62
 Comments in Japanese. Identifiers in English.

 Not advancing your point, I think.

 Shachar
Well, I knew that when I posted, so I honestly have no idea what point you assumed I was making.
I don't know what point you were trying to make. That's precisely why 
I posted.

I don't think D currently or ever enforces what type of (legal UTF-8) 
text you could use in comments or strings. This thread is about what's 
legal to use in identifiers. The example you brought does not use 
Unicode in identifiers, and is, therefore, irrelevant to the 
discussion we're having.

That was the point *I* was trying to make.

Shachar
Sep 23 2018
prev sibling parent aliak <something something.com> writes:
On Saturday, 22 September 2018 at 19:59:42 UTC, Erik van Velzen 
wrote:
 If there was a contingent of Japanese or Chinese users doing 
 that then surely they would speak up here or in Bugzilla to 
 advocate for this feature?
https://forum.dlang.org/post/piwvbtetcwyxlalocxkw forum.dlang.org
Sep 23 2018
prev sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 9/22/18 12:56 PM, Neia Neutuladh wrote:
 On Saturday, 22 September 2018 at 12:35:27 UTC, Steven Schveighoffer wrote:
 But aren't we arguing about the wrong thing here? D already accepts 
 non-ASCII identifiers.
 Walter was doing that thing that people in the US who only speak 
 English tend to do: forgetting that other people speak other 
 languages, and that people who speak English can learn other 
 languages to work with people who don't speak English.
I don't think he was doing that. I think what he was saying was, D 
tried to accommodate users who don't normally speak English, and they 
still use English (for the most part) for coding.

I'm actually surprised there isn't much code out there that is written 
with other identifiers besides ASCII, given that C99 supported them. I 
assumed it was because they weren't supported. Now I learn that they 
are supported, yet almost all C code I've ever seen is written in 
English. Perhaps that's just because I don't frequent foreign language 
sites though :)

But many people here speak English as a second language, and vouch for 
their cultures still using English to write code.
 He was saying it's inevitably a mistake to use 
 non-ASCII characters in identifiers and that nobody does use them in 
 practice.
I would expect people probably do try to use them in practice, it's just that the problems they run into aren't worth the effort (tool/environment support). But I have no first or even second hand experience with this. It does seem like Walter has a lot of experience with it though.
 Walter talking like that sounds like he'd like to remove support for 
 non-ASCII identifiers from the language. I've gotten by without 
 maintaining a set of personal patches on top of DMD so far, and I'd like 
 it if I didn't have to start.
I don't think he was saying that. I think he was against expanding 
support for further Unicode identifiers because the first effort did 
not produce any measurable benefit.

I'd be shocked, given the recent positions of Walter and Andrei, if 
they decided to remove non-ASCII identifiers that are currently 
supported, thereby breaking any existing code.
 What languages need an upgrade to unicode symbol names? In other 
 words, what symbols aren't possible with the current support?
 Chinese and Japanese have gained about eleven thousand symbols since 
 Unicode 2. Unicode 2 covers 25 writing systems, while Unicode 11 
 covers 146.

 Just updating to Unicode 3 would give us Cherokee, Ge'ez (multiple 
 languages), Khmer (Cambodian), Mongolian, Burmese, Sinhala (Sri 
 Lanka), Thaana (Maldivian), Canadian aboriginal syllabics, and Yi 
 (Nuosu).
Very interesting! I would agree that we should at least add support 
for Unicode symbols that are used in spoken languages, especially 
since we already support symbols that aren't ASCII. I don't see the 
downside, especially if you can already use Unicode 2.0 symbols for 
identifiers (the ship has already sailed).

It could be a good incentive to get kids in countries where English 
isn't commonly spoken to try D out as a first programming language ;) 
Using your native language to show example code could be a huge 
benefit for teaching coding.

My recommendation is to put the PR up for review (that you said you 
had ready) and see what happens. Having an actual patch to talk about 
could change minds. At the very least, it's worth not wasting the 
effort you have already spent. Even if it does need a DIP, the PR can 
show that one less piece of effort is needed to get it implemented.

-Steve
Sep 24 2018
prev sibling next sibling parent reply Joakim <dlang joakim.fea.st> writes:
On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:
 When I originally started with D, I thought non-ASCII 
 identifiers with Unicode was a good idea. I've since slowly 
 become less and less enthusiastic about it.

 First off, D source text simply must (and does) fully support 
 Unicode in comments, characters, and string literals. That's 
 not an issue.

 But identifiers? I've seen hardly any use of non-ASCII 
 identifiers in C, C++, or D. In fact, I've seen zero use of it 
 outside of test cases. I don't see much point in expanding the 
 support of it. If people use such identifiers, the result would 
 most likely be annoyance rather than illumination when people 
 who don't know that language have to work on the code.

 Extending it further will also cause problems for all the tools 
 that work with D object code, like debuggers, disassemblers, 
 linkers, filesystems, etc.
To wit, Windows linker error with Unicode symbol: https://github.com/ldc-developers/ldc/pull/2850#issuecomment-422968161
 Absent a much more compelling rationale for it, I'd say no.
I'm torn. I completely agree with Adam and others that people should 
be able to use any language they want. But the Unicode spec is such a 
tire fire that I'm leery of extending support for it.

Someone linked this Swift chapter on Unicode handling in an earlier 
forum thread; read the section on emoji in particular:

https://oleb.net/blog/2017/11/swift-4-strings/

I was laughing out loud when reading about composing "family" emojis 
with zero-width joiners. If you told me that was a tech parody, I'd 
have believed it.

I believe Swift just punts their Unicode support to ICU, like most any 
other project these days. That's a horrible sign, that you've created 
a spec so grotesquely complicated that most everybody relies on a 
single project to not have to deal with it.
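For the curious, the "family" composition glues individual person 
emoji together with zero-width joiners (U+200D); a minimal sketch in D 
(the string contents are the point here, nothing D-specific):

    void main()
    {
        // man + ZWJ + woman + ZWJ + girl: renderers that support it
        // draw these five code points as a single family glyph
        string family = "\U0001F468\u200D\U0001F469\u200D\U0001F467";
        assert(family.length == 18); // 18 UTF-8 code units, 5 code points
    }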
Sep 21 2018
next sibling parent Neia Neutuladh <neia ikeran.org> writes:
On Saturday, 22 September 2018 at 04:54:59 UTC, Joakim wrote:
 To wit, Windows linker error with Unicode symbol:

 https://github.com/ldc-developers/ldc/pull/2850#issuecomment-422968161
That's a good argument for sticking to ASCII for name mangling.
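You can inspect what the linker gets handed with .mangleof; a 
hypothetical probe (the exact mangled string depends on the module 
name):

    // π is already a legal D identifier (Greek, Unicode 2.0); its raw
    // UTF-8 bytes land in the mangled symbol, which is what trips up
    // some linkers and object-file tools
    void π() {}
    pragma(msg, π.mangleof);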
 I'm torn. I completely agree with Adam and others that people 
 should be able to use any language they want. But the Unicode 
 spec is such a tire fire that I'm leery of extending support 
 for it.
The compiler doesn't have to do much with Unicode processing, fortunately.
Sep 21 2018
prev sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Friday, September 21, 2018 10:54:59 PM MDT Joakim via Digitalmars-d 
wrote:
 I'm torn. I completely agree with Adam and others that people
 should be able to use any language they want. But the Unicode
 spec is such a tire fire that I'm leery of extending support for
 it.
Unicode identifiers may make sense in a code base that is going to be 
used solely by a group of developers who speak a particular language 
that uses a number of non-ASCII characters (especially languages like 
Chinese or Japanese), but it has no business in any code that's 
intended for international use. It just causes problems. At best, a 
particular, regional keyboard may be able to handle a particular 
symbol, but most other keyboards won't be able to. So, using that 
symbol causes problems for all of the developers from other parts of 
the world, even if those developers also have Unicode symbols in their 
native languages.
 Someone linked this Swift chapter on Unicode handling in an
 earlier forum thread, read the section on emoji in particular:

 https://oleb.net/blog/2017/11/swift-4-strings/

 I was laughing out loud when reading about composing "family"
 emojis with zero-width joiners. If you told me that was a tech
 parody, I'd have believed it.
Honestly, I was horrified to find out that emojis were even in 
Unicode. It makes no sense whatsoever. Emojis are supposed to be 
sequences of characters that can be interpreted as images. Treating 
them like Unicode symbols is like treating entire words like Unicode 
symbols. It's just plain stupid and a clear sign that Unicode has gone 
completely off the rails (if it was ever on them). Unfortunately, it's 
the best tool that we have for the job.

- Jonathan M Davis
Sep 22 2018
next sibling parent reply Shachar Shemesh <shachar weka.io> writes:
On 22/09/18 11:52, Jonathan M Davis wrote:
 
 Honestly, I was horrified to find out that emojis were even in Unicode. It
 makes no sense whatsover. Emojis are supposed to be sequences of characters
 that can be interepreted as images. Treating them like Unicode symbols is
 like treating entire words like Unicode symbols. It's just plain stupid and
 a clear sign that Unicode has gone completely off the rails (if it was ever
 on them). Unfortunately, it's the best tool that we have for the job.
 
 - Jonathan M Davis
Thank Allah that someone said it before I had to. I could not agree 
more. Encoding whole words as single Unicode code points makes no 
sense.

U+FDF2

Shachar
Sep 22 2018
parent reply Thomas Mader <thomas.mader gmail.com> writes:
On Saturday, 22 September 2018 at 10:24:48 UTC, Shachar Shemesh 
wrote:
 Thank Allah that someone said it before I had to. I could not 
 agree more. Encoding whole words as single Unicode code points 
 makes no sense.
The goal of Unicode is to support diversity; if you argue against 
that, you don't need Unicode at all. What you are saying is basically 
that you would remove Chinese too.

Emojis are not my world either, but they are an expression system / 
language.
Sep 22 2018
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Saturday, September 22, 2018 4:51:47 AM MDT Thomas Mader via Digitalmars-
d wrote:
 On Saturday, 22 September 2018 at 10:24:48 UTC, Shachar Shemesh

 wrote:
 Thank Allah that someone said it before I had to. I could not
 agree more. Encoding whole words as single Unicode code points
 makes no sense.
The goal of Unicode is to support diversity, if you argue against that you don't need Unicode at all. What you are saying is basically that you would remove Chinese too. Emojis are not my world either but it is an expression system / language.
Unicode is supposed to be a universal way of representing every 
character in every language. Emojis are not characters. They are 
sequences of characters that people use to represent images. I do not 
understand how an argument can even be made that they belong in 
Unicode. As I said, it's exactly the same as arguing that words should 
be represented in Unicode. Unfortunately, however, at least some of 
them are in there. :|

- Jonathan M Davis
Sep 22 2018
next sibling parent Shachar Shemesh <shachar weka.io> writes:
On 22/09/18 14:28, Jonathan M Davis wrote:
 As I said, it's exactly the same
 as arguing that words should be represented in Unicode. Unfortunately,
 however, at least some of them are in there. :|
 
 - Jonathan M Davis
To be fair to them, that word is part of the "Arabic Presentation 
Forms" section. The "Presentation Forms" sections are meant as 
backwards compatibility toward code points that existed before, and 
are not meant to be generated by Unicode aware applications.

Shachar
Sep 22 2018
prev sibling parent reply Thomas Mader <thomas.mader gmail.com> writes:
On Saturday, 22 September 2018 at 11:28:48 UTC, Jonathan M Davis 
wrote:
 Unicode is supposed to be a universal way of representing every 
 character in every language. Emojis are not characters. They 
 are sequences of characters that people use to represent 
 images. I do not understand how an argument can even be made 
 that they belong in Unicode. As I said, it's exactly the same 
 as arguing that words should be represented in Unicode. 
 Unfortunately, however, at least some of them are in there. :|
At least since the incorporation of emojis, it's not supposed to be a 
universal way of representing characters anymore. :-) Maybe there was 
a time when that was true, I don't know, but I think they see Unicode 
as a way to express all language symbols. And emoji is nothing else 
than a language where each symbol stands for an emotion/word/sentence.

If Unicode only allows languages with characters which are used to 
form words, it's excluding languages which use other ways of 
expressing something. Would you suggest removing such writing systems 
from Unicode? What should a museum do which is in need of software to 
somehow manage Egyptian hieroglyphs?

Unicode was made to support all sorts of writing systems, and using 
multiple characters per word is just one way to form a writing system.
Sep 22 2018
parent reply Shachar Shemesh <shachar weka.io> writes:
On 22/09/18 15:13, Thomas Mader wrote:
 Would you suggest to remove such writing systems out of Unicode?
 What should a museum do which is in need of a software to somehow manage 
 Egyptian hieroglyphs?
If memory serves me right, hieroglyphs actually represent consonants 
(vowels are implicit), and as such, are most definitely "characters".

The only language I can think of, off the top of my head, where words 
have distinct signs is sign language. It is a good question whether 
Unicode should include such a language (difficulty of representing 
motion in a font aside).

Shachar
Sep 22 2018
parent reply Neia Neutuladh <neia ikeran.org> writes:
On Saturday, 22 September 2018 at 12:24:49 UTC, Shachar Shemesh 
wrote:
 If memory serves me right, hieroglyphs actually represent 
 consonants (vowels are implicit), and as such, are most 
 definitely "characters".
Egyptian hieroglyphics uses logographs (symbols representing whole words, which might be multiple syllables), letters, and determinants (which don't represent any word but disambiguate the surrounding words). Looking things up serves me better than memory, usually.
 The only language I can think of, off the top of my head, where 
 words have distinct signs is sign language.
Logographic writing systems. There is one logographic writing system 
still in common use, and it's the standard writing system for Chinese 
and Japanese. That's about 1.4 billion people. It was used in Korea 
until hangul became popularized.

Unicode also aims to support writing systems that aren't used anymore. 
That means Mayan, cuneiform (several variants), Egyptian hieroglyphics 
and demotic script, several extinct variants on the Chinese writing 
system, and Luwian.

Sign languages generally don't have writing systems. They're also not 
generally related to any ambient spoken languages (for instance, 
American Sign Language is derived from French Sign Language), so if 
you speak sign language and can write, you're bilingual. Anyway, 
without writing systems, sign languages are irrelevant to Unicode.
Sep 22 2018
parent =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 09/22/2018 09:27 AM, Neia Neutuladh wrote:

 Logographic writing systems. There is one logographic writing system
 still in common use, and it's the standard writing system for Chinese
 and Japanese.
I had the misconception of each Chinese character meaning a word until 
I read "The Chinese Language, Fact and Fantasy" by John DeFrancis. One 
thing I learned was that Chinese is not purely logographic.

Ali
Sep 23 2018
prev sibling next sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 9/22/18 4:52 AM, Jonathan M Davis wrote:
 I was laughing out loud when reading about composing "family"
 emojis with zero-width joiners. If you told me that was a tech
 parody, I'd have believed it.
 Honestly, I was horrified to find out that emojis were even in 
 Unicode. It makes no sense whatsoever. Emojis are supposed to be 
 sequences of characters that can be interpreted as images. Treating 
 them like Unicode symbols is like treating entire words like Unicode 
 symbols. It's just plain stupid and a clear sign that Unicode has 
 gone completely off the rails (if it was ever on them). 
 Unfortunately, it's the best tool that we have for the job.
But aren't some (many?) Chinese/Japanese characters representing whole 
words?

-Steve
Sep 22 2018
next sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Saturday, September 22, 2018 6:37:09 AM MDT Steven Schveighoffer via 
Digitalmars-d wrote:
 On 9/22/18 4:52 AM, Jonathan M Davis wrote:
 I was laughing out loud when reading about composing "family"
 emojis with zero-width joiners. If you told me that was a tech
 parody, I'd have believed it.
 Honestly, I was horrified to find out that emojis were even in 
 Unicode. It makes no sense whatsoever. Emojis are supposed to be 
 sequences of characters that can be interpreted as images. Treating 
 them like Unicode symbols is like treating entire words like Unicode 
 symbols. It's just plain stupid and a clear sign that Unicode has 
 gone completely off the rails (if it was ever on them). 
 Unfortunately, it's the best tool that we have for the job.
But aren't some (many?) Chinese/Japanese characters representing whole words?
It's true that they're not characters in the sense that Roman 
characters are characters, but they're still part of the alphabets for 
those languages. Emojis are specifically formed from sequences of 
characters - e.g. :) is two characters which are already expressible 
on their own. They're meant to represent a smiley face, but it's a 
sequence of characters already. There's no need whatsoever to 
represent anything extra in Unicode.

It's already enough of a disaster that there are multiple ways to 
represent the same character in Unicode without nonsense like emojis. 
It's stuff like this that really makes me wish that we could come up 
with a new standard that would replace Unicode, but that's likely a 
pipe dream at this point.

- Jonathan M Davis
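For concreteness, the "multiple ways to represent the same character" 
problem is what Unicode normalization deals with; a minimal sketch 
using std.uni (assuming current Phobos):

    import std.uni : normalize, NFC;

    void main()
    {
        string precomposed = "\u00E9";  // é as a single code point
        string decomposed  = "e\u0301"; // e plus combining acute accent
        assert(precomposed != decomposed);                // bytes differ
        assert(normalize!NFC(decomposed) == precomposed); // same after NFC
    }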
Sep 22 2018
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 9/22/18 8:58 AM, Jonathan M Davis wrote:
 On Saturday, September 22, 2018 6:37:09 AM MDT Steven Schveighoffer via
 Digitalmars-d wrote:
 On 9/22/18 4:52 AM, Jonathan M Davis wrote:
 I was laughing out loud when reading about composing "family"
 emojis with zero-width joiners. If you told me that was a tech
 parody, I'd have believed it.
Honestly, I was horrified to find out that emojis were even in Unicode. It makes no sense whatsover. Emojis are supposed to be sequences of characters that can be interepreted as images. Treating them like Unicode symbols is like treating entire words like Unicode symbols. It's just plain stupid and a clear sign that Unicode has gone completely off the rails (if it was ever on them). Unfortunately, it's the best tool that we have for the job.
But aren't some (many?) Chinese/Japanese characters representing whole words?
 It's true that they're not characters in the sense that Roman 
 characters are characters, but they're still part of the alphabets 
 for those languages. Emojis are specifically formed from sequences of 
 characters - e.g. :) is two characters which are already expressible 
 on their own. They're meant to represent a smiley face, but it's a 
 sequence of characters already. There's no need whatsoever to 
 represent anything extra in Unicode.

 It's already enough of a disaster that there are multiple ways to 
 represent the same character in Unicode without nonsense like emojis. 
 It's stuff like this that really makes me wish that we could come up 
 with a new standard that would replace Unicode, but that's likely a 
 pipe dream at this point.
But there are tons of emojis that have nothing to do with sequences of 
characters. Like houses, or planes, or whatever. I don't even know 
what the sequences of characters are for them. I think it started out 
like that, but turned into something else.

Either way, I can't imagine any benefit from using emojis in symbol 
names.

-Steve
Sep 24 2018
prev sibling parent sarn <sarn theartofmachinery.com> writes:
On Saturday, 22 September 2018 at 12:37:09 UTC, Steven 
Schveighoffer wrote:
 But aren't some (many?) Chinese/Japanese characters 
 representing whole words?

 -Steve
Kind of hair-splitting, but it's more accurate to say that some Chinese/Japanese words can be written with one character. Like how English speakers wouldn't normally say that "A" and "I" are characters representing whole words.
Sep 22 2018
prev sibling next sibling parent reply Neia Neutuladh <neia ikeran.org> writes:
On Saturday, 22 September 2018 at 08:52:32 UTC, Jonathan M Davis 
wrote:
 Unicode identifiers may make sense in a code base that is going 
 to be used solely by a group of developers who speak a 
 particular language that uses a number a of non-ASCII 
 characters (especially languages like Chinese or Japanese), but 
 it has no business in any code that's intended for 
 international use. It just causes problems.
You have a problem when you need to share a codebase between two 
organizations using different languages. "Just use ASCII" is not the 
solution. "Use a language that most developers in both organizations 
can use" is. That's *usually* going to be English, but not always. For 
instance, a Belorussian company doing outsourcing work for a Russian 
company might reasonably write code in Russian.

If you're writing for a global audience, as most open source code is, 
you're usually going to use the most widely spoken language.
Sep 22 2018
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Saturday, September 22, 2018 10:07:38 AM MDT Neia Neutuladh via 
Digitalmars-d wrote:
 On Saturday, 22 September 2018 at 08:52:32 UTC, Jonathan M Davis

 wrote:
 Unicode identifiers may make sense in a code base that is going
 to be used solely by a group of developers who speak a
 particular language that uses a number a of non-ASCII
 characters (especially languages like Chinese or Japanese), but
 it has no business in any code that's intended for
 international use. It just causes problems.
You have a problem when you need to share a codebase between two organizations using different languages. "Just use ASCII" is not the solution. "Use a language that most developers in both organizations can use" is. That's *usually* going to be English, but not always. For instance, a Belorussian company doing outsourcing work for a Russian company might reasonably write code in Russian. If you're writing for a global audience, as most open source code is, you're usually going to use the most widely spoken language.
My point is that if your code base is definitely only going to be used 
within a group of people who are using a keyboard that supports a 
Unicode character that you want to use, then it's not necessarily a 
problem to use it. But if you're writing code that may be seen or used 
by a general audience (especially if it's going to be open source), 
then it needs to be in ASCII, or it's a serious problem. Even if it's 
a character like lambda that most everyone is going to understand, 
many, many programmers are not going to be able to type it on their 
keyboards, and that's going to cause nothing but problems.

For better or worse, English is the international language of science 
and engineering, and that includes programming. So, any programs that 
are intended to be seen and used by the world at large need to be in 
ASCII. And the biggest practical issue with that is whether a 
character is even on a typical keyboard. Using a Unicode character in 
a program makes it so that many programmers cannot type it. And even 
given the large breadth of Unicode characters, you could have a 
keyboard that supports a number of Unicode characters and still not 
have the Unicode character in question. So, open source programs need 
to be in ASCII.

Now, I don't know that it's a problem to support a wide range of 
Unicode characters in identifiers when you consider the issues of 
folks whose native language is not English (especially when it's a 
language like Chinese or Japanese), but open source programs should 
only be using ASCII identifiers. And unfortunately, sometimes, the 
fact that a language supports Unicode identifiers has led English 
speakers to do stupid things like use the lambda character in 
identifiers. So, I can understand Walter's reticence to go further 
with supporting Unicode identifiers, but on the other hand, when you 
consider how many people there are on the planet who use a language 
that doesn't even use the Latin alphabet, it's arguably a good idea to 
fully support Unicode identifiers.

- Jonathan M Davis
Sep 22 2018
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/22/2018 6:01 PM, Jonathan M Davis wrote:
 For better or worse, English is the international language of science and
 engineering, and that includes programming.
In the earlier days of D, I put on the web pages a google widget that 
would automatically translate the page into any language google 
supported. This was eventually removed (not by me) because nobody 
wanted it. Nobody (besides me) even noticed it was removed. And the D 
community is a very international one.

Supporting Unicode in identifiers gives users a false sense that it's 
a good idea to use them. Lots of programming tools don't work well 
with Unicode. Even Windows doesn't by default - you've got to run 
"chcp 65001" each time you open a console window. Filesystems don't 
work reliably with Unicode. Heck, the reason module names should be 
lower case in D is because mixed case doesn't work reliably across 
filesystems.

D supports Unicode in identifiers because C and C++ do, and we want to 
be able to interoperate with them. Extending Unicode identifier 
support off into other directions, especially ones that break such 
interoperability, is just doing a disservice to users.
Sep 23 2018
next sibling parent reply Neia Neutuladh <neia ikeran.org> writes:
On Sunday, 23 September 2018 at 21:12:13 UTC, Walter Bright wrote:
 D supports Unicode in identifiers because C and C++ do, and we 
 want to be able to interoperate with them. Extending Unicode 
 identifier support off into other directions, especially ones 
 that break such interoperability, is just doing a disservice to 
 users.
Okay, that's why you previously selected C99 as the standard for what characters to allow. Do you want to update to match C11? It's been out for the better part of a decade, after all.
Sep 23 2018
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/23/2018 3:23 PM, Neia Neutuladh wrote:
 Okay, that's why you previously selected C99 as the standard for what 
 characters to allow. Do you want to update to match C11? It's been out 
 for the better part of a decade, after all.
I wasn't aware it changed in C11.
Sep 23 2018
parent reply Neia Neutuladh <neia ikeran.org> writes:
On Monday, 24 September 2018 at 01:39:43 UTC, Walter Bright wrote:
 On 9/23/2018 3:23 PM, Neia Neutuladh wrote:
 Okay, that's why you previously selected C99 as the standard 
 for what characters to allow. Do you want to update to match 
 C11? It's been out for the better part of a decade, after all.
I wasn't aware it changed in C11.
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf page 522 
(PDF numbering) or 504 (internal numbering).

Outside the BMP, almost everything is allowed, including many things 
that are not currently mapped to any Unicode value. Within the BMP, a 
heck of a lot of stuff is allowed, including a lot that D doesn't 
currently allow.

GCC hasn't even updated to the C99 standard here, as far as I can 
tell, but clang-5.0 is up to date.
Sep 23 2018
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 9/24/18 12:23 AM, Neia Neutuladh wrote:
 On Monday, 24 September 2018 at 01:39:43 UTC, Walter Bright wrote:
 On 9/23/2018 3:23 PM, Neia Neutuladh wrote:
 Okay, that's why you previously selected C99 as the standard for what 
 characters to allow. Do you want to update to match C11? It's been 
 out for the better part of a decade, after all.
I wasn't aware it changed in C11.
 http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf page 522 
 (PDF numbering) or 504 (internal numbering).

 Outside the BMP, almost everything is allowed, including many things 
 that are not currently mapped to any Unicode value. Within the BMP, a 
 heck of a lot of stuff is allowed, including a lot that D doesn't 
 currently allow.

 GCC hasn't even updated to the C99 standard here, as far as I can 
 tell, but clang-5.0 is up to date.
I searched around for the current state of symbol names in C, and 
found some really crappy rules, though maybe this site isn't up to 
date?: https://en.cppreference.com/w/c/language/identifier

What I understand from that is:

1. Yes, you can use any Unicode character you want in C/C++ (seemingly 
since C99).

2. There are no rules about what *encoding* is acceptable; it's 
implementation defined. So various compilers have different rules as 
to what will be accepted in the actual source code. In fact, I read 
somewhere that not even ASCII is guaranteed to be supported.

The result being, that you have to write the identifiers with an ASCII 
escape sequence in order for it to be actually portable. Which to me, 
completely defeats the purpose of using such identifiers in the first 
place. For example, on that page, they have a line that works in 
clang, not in GCC (tagged as implementation defined):

char *🐱 = "cat";

The portable version looks like this:

char *\U0001f431 = "cat";

Seriously, who wants to use that?

Now, D can potentially do better (especially when all front-ends are 
the same) and support such things in the spec, but I think the 
argument "because C supports it" is kind of bunk. Or am I reading it 
wrong?

In any case, I would expect that symbol name support should be focused 
only on languages which people use, not emojis. If there are words in 
Chinese or Japanese that can't be expressed using D, while other words 
can, it would seem inconsistent to a Chinese or Japanese speaking 
user, and I think we should work to fix that. I just have no idea what 
the state of that is.

I also tend to agree that most code is going to be written in English, 
even when the primary language of the user is not. Part of the reason, 
which I haven't read here yet, is that all the keywords are in 
English. Someone has to kind of understand those to get the meaning of 
some constructs, and it's going to read strangely with the non-English 
words.

One group which I believe hasn't spoken up yet is the group making the 
hunt framework, whom I believe are all Chinese? At least their web 
site is. It would be good to hear from a group like that, which has 
large experience writing mature D code (it appears all to be in 
English), and how they feel about the support.

-Steve
Sep 24 2018
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 24 September 2018 at 13:26:14 UTC, Steven 
Schveighoffer wrote:
 Part of the reason, which I haven't read here yet, is that all 
 the keywords are in English.
Eh, those are kinda opaque sequences anyway, since the meanings aren't quite what the normal dictionary definition is anyway. Look up "int" in the dictionary... or "void", or even "string". They are just a handful of magic sequences we learn with the programming language. (And in languages like Rust, "fn", lol.)
 One group which I believe hasn't spoken up yet is the group 
 making the hunt framework, whom I believe are all Chinese? At 
 least their web site is.
I know they used a lot of my code as a starting point, and I, of 
course, wrote it in English, so that could have biased it a bit too. 
Though that might be a general point: if you want to use these 
libraries, they already come in some language.

Just even so, I still find it kinda hard to believe that everybody 
everywhere uses only English in all their code. Maybe our efforts 
should be going toward the Chinese market via natural language support 
instead of competing with Rust on computer language features :P
 It would be good to hear from a group like that which has large 
 experience writing mature D code (it appears all to be in 
 English) and how they feel about the support.
definitely.
Sep 24 2018
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 9/24/18 10:14 AM, Adam D. Ruppe wrote:
 On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer wrote:
 Part of the reason, which I haven't read here yet, is that all the 
 keywords are in English.
Eh, those are kinda opaque sequences anyway, since the meanings aren't quite what the normal dictionary definition is anyway. Look up "int" in the dictionary... or "void", or even "string". They are just a handful of magic sequences we learn with the programming language. (And in languages like Rust, "fn", lol.)
Well, even on top of that, the standard library is full of English 
words that read very coherently when used together (if you understand 
English). I can't imagine a long chain of English algorithms with some 
Chinese one pasted in the middle looks very good :)

I suppose you could alias them all...

-Steve
Sep 24 2018
parent reply Martin Tschierschke <mt smartdolphin.de> writes:
On Monday, 24 September 2018 at 14:34:21 UTC, Steven 
Schveighoffer wrote:
 On 9/24/18 10:14 AM, Adam D. Ruppe wrote:
 On Monday, 24 September 2018 at 13:26:14 UTC, Steven 
 Schveighoffer wrote:
 Part of the reason, which I haven't read here yet, is that 
 all the keywords are in English.
Eh, those are kinda opaque sequences anyway, since the meanings aren't quite what the normal dictionary definition is anyway. Look up "int" in the dictionary... or "void", or even "string". They are just a handful of magic sequences we learn with the programming language. (And in languages like Rust, "fn", lol.)
Well, even on top of that, the standard library is full of English words that read very coherently when used together (if you understand English). I can't imagine a long chain of English algorithms with some Chinese one pasted in the middle looks very good :) I suppose you could alias them all... -Steve
You might get really funny error messages:

    🙂 can't be casted to int.

:-)

And if you have to increment the number of cars you can write:

    🚗++;

This might give really funny looking programs!
Sep 24 2018
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 9/24/18 2:20 PM, Martin Tschierschke wrote:
 On Monday, 24 September 2018 at 14:34:21 UTC, Steven Schveighoffer wrote:
 On 9/24/18 10:14 AM, Adam D. Ruppe wrote:
 On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer 
 wrote:
 Part of the reason, which I haven't read here yet, is that all the 
 keywords are in English.
Eh, those are kinda opaque sequences anyway, since the meanings aren't quite what the normal dictionary definition is anyway. Look up "int" in the dictionary... or "void", or even "string". They are just a handful of magic sequences we learn with the programming language. (And in languages like Rust, "fn", lol.)
 Well, even on top of that, the standard library is full of English 
 words that read very coherently when used together (if you understand 
 English). I can't imagine a long chain of English algorithms with 
 some Chinese one pasted in the middle looks very good :)

 I suppose you could alias them all...
 You might get really funny error messages:

     🙂 can't be casted to int.
Haha, it could be cynical as well:

    int can’t be casted to int🤔

Oh, the games we could play.

-Steve
Sep 24 2018
prev sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Monday, 24 September 2018 at 13:26:14 UTC, Steven 
Schveighoffer wrote:
 2. There are no rules about what *encoding* is acceptable, it's 
 implementation defined. So various compilers have different 
 rules as to what will be accepted in the actual source code. In 
 fact, I read somewhere that not even ASCII is guaranteed to be 
 supported.
Indeed. IBM mainframes have C compilers too, but not ASCII. They code 
in EBCDIC. That's why, for instance, it's not portable to do things 
like

     if(c >= 'A' && c <= 'Z') printf("CAPITAL LETTER\n");

which does not work in EBCDIC: the uppercase letters there are not 
contiguous, so the range also matches non-letter code points.
Sep 24 2018
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 9/24/18 3:18 PM, Patrick Schluter wrote:
 On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer wrote:
 2. There are no rules about what *encoding* is acceptable, it's 
 implementation defined. So various compilers have different rules as 
 to what will be accepted in the actual source code. In fact, I read 
 somewhere that not even ASCII is guaranteed to be supported.
 Indeed. IBM mainframes have C compilers too, but not ASCII. They code 
 in EBCDIC. That's why, for instance, it's not portable to do things 
 like

      if(c >= 'A' && c <= 'Z') printf("CAPITAL LETTER\n");

 which does not work in EBCDIC.
Right. But it's just a side-note -- I'd guess all modern compilers 
support ASCII, and definitely ones that we would want to interoperate 
with.

Besides, that example is more concerned about *input data* encoding, 
not *source code* encoding. If the above is written in ASCII, then I 
would assume that the bytes in the source file are the ASCII bytes, 
and probably the IBM compilers would not know what to do with such 
files (it would all be gibberish if you opened it in an EBCDIC 
editor). You'd first have to translate it to EBCDIC, which is a red 
flag that likely this isn't going to work :)

-Steve
Sep 24 2018
prev sibling parent reply Dennis <dkorpel gmail.com> writes:
On Sunday, 23 September 2018 at 21:12:13 UTC, Walter Bright wrote:
 D supports Unicode in identifiers because C and C++ do, and we 
 want to be able to interoperate with them. Extending Unicode 
 identifier support off into other directions, especially ones 
 that break such interoperability, is just doing a disservice to 
 users.
I always thought D supported Unicode with the goal of going forward 
with it while C was stuck with ASCII: 
http://www.drdobbs.com/cpp/time-for-unicode/228700405

"The D programming language has already driven stakes in the ground, 
saying it will not support 16 bit processors, processors that don't 
have 8 bit bytes, and processors with crippled, non-IEEE floating 
point. Is it time to drive another stake in and say the time for 
Unicode has come?"

Have you changed your mind since?
Sep 23 2018
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/23/2018 6:06 PM, Dennis wrote:
 Have you changed your mind since?
D the language is well suited to the development of Unicode apps. D source code is another matter.
Sep 23 2018
parent reply Dennis <dkorpel gmail.com> writes:
On Monday, 24 September 2018 at 01:32:38 UTC, Walter Bright wrote:
 D the language is well suited to the development of Unicode 
 apps. D source code is another matter.
But in the article you specifically talk about the use of Unicode in 
the context of source code instead of apps:

"With the D programming language, we continuously run up against the 
problem that ASCII has reached its expressivity limits."

"There are the chevrons « and » which serve as another set of brackets 
to lighten the overburdened ambiguities of ( ). There are the 
dot-product and cross-product characters · and × which would make 
lovely infix operator tokens for math libraries. The greek letters 
would be great for math variable names."
Sep 24 2018
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, September 24, 2018 4:19:31 AM MDT Dennis via Digitalmars-d wrote:
 On Monday, 24 September 2018 at 01:32:38 UTC, Walter Bright wrote:
 D the language is well suited to the development of Unicode
 apps. D source code is another matter.
 But in the article you specifically talk about the use of Unicode in 
 the context of source code instead of apps:

 "With the D programming language, we continuously run up against the 
 problem that ASCII has reached its expressivity limits."

 "There are the chevrons « and » which serve as another set of 
 brackets to lighten the overburdened ambiguities of ( ). There are 
 the dot-product and cross-product characters · and × which would make 
 lovely infix operator tokens for math libraries. The greek letters 
 would be great for math variable names."
Given that the typical keyboard has none of those characters, 
maintaining code that used any of them would be a royal pain. It's one 
thing if they're used in the occasional string as data, but it's quite 
another if they're used as identifiers or operators. I don't see how 
that would be at all maintainable. You'd be forced to constantly copy 
and paste rather than type.

- Jonathan M Davis
Sep 24 2018
next sibling parent Dennis <dkorpel gmail.com> writes:
On Monday, 24 September 2018 at 10:36:50 UTC, Jonathan M Davis 
wrote:
 Given that the typical keyboard has none of those characters, 
 maintaining code that used any of them would be a royal pain.
Note that I'm not trying to argue either way, it's just that I used to think of Walter's stance on D and Unicode as: "D would fully embrace Unicode if only editors/debuggers etc. would embrace it too" But now I read:
 D supports Unicode in identifiers because C and C++ do, and we 
 want to be able to interoperate with them."
So I wonder what changed. I guess it's mostly answered in the first reply:
 When I originally started with D, I thought non-ASCII 
 identifiers with Unicode was a good idea. I've since slowly 
 become less and less enthusiastic about it.
Sep 24 2018
prev sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 24 September 2018 at 10:36:50 UTC, Jonathan M Davis 
wrote:
 Given that the typical keyboard has none of those characters, 
 maintaining code that used any of them would be a royal pain.
It is pretty easy to type them with a little keyboard config change, and editors like vim can even pick those up from comments in the file, though you have to train your fingers to use them effectively too... but if you were maintaining something long term, you'd just do that.
Sep 24 2018
prev sibling parent reply Abdulhaq <alynch4047 gmail.com> writes:
On Saturday, 22 September 2018 at 08:52:32 UTC, Jonathan M Davis 
wrote:

 Honestly, I was horrified to find out that emojis were even in 
 Unicode. It makes no sense whatsoever. Emojis are supposed to be 
 sequences of characters that can be interpreted as images. 
 Treating them like Unicode symbols is like treating entire 
 words like Unicode symbols. It's just plain stupid and a clear 
 sign that Unicode has gone completely off the rails (if it was 
 ever on them). Unfortunately, it's the best tool that we have 
 for the job.
According to the Unicode website, http://unicode.org/standard/WhatIsUnicode.html, """ Support of Unicode forms the foundation for the representation of languages and symbols in all major operating systems, search engines, browsers, laptops, and smart phones—plus the Internet and World Wide Web (URLs, HTML, XML, CSS, JSON, etc.)""" Note, Unicode supports symbols, not just characters. The smiley face symbol predates its ':-)' usage in ASCII text, https://www.smithsonianmag.com/arts-culture/who-really-invented-the-smiley-face-2058483/. It's fundamentally a symbol, not a sequence of characters. Therefore it is not unreasonable for it to be encoded with a Unicode number. I do agree though, of course, that it would seem bizarre to use an emoji as a D identifier. The early history of computer science is completely dominated by cultures who use Latin-script-based characters, and hence, quite reasonably, text encoding and its automated visual representation by computer-based devices is dominated by the requirements of Latin-script languages. However, the world keeps turning and, despite DT's best efforts, China et al. look set to become dominant. Even if not China, the chances are that eventually a non-Latin-script-based language will become very important. Parochial views like "all open source code should be in ASCII" will look silly. However, until that time D developers have to spend their time where it can be most useful. Hence the decision of whether to apply Neia's patch / ideas or not mainly depends on how much downstream effort would be required (debuggers etc., as Walter pointed out), and how much the gain is. As Unicode 2.0 is already supported, I would guess that the vast majority of people with access to a computer can already enter identifiers in D that are rich enough for them. As Adam said though, it would be a good idea to at least ask!
Sep 23 2018
parent Walter Bright <newshound2 digitalmars.com> writes:
On 9/23/2018 12:06 PM, Abdulhaq wrote:
 The early history of computer science is completely dominated by cultures who 
 use Latin-script-based characters,
Small character sets are much more implementable on primitive systems like telegraphs and electro-mechanical ttys. It wasn't even practical to display a rich character set until the early 1980's or so. There wasn't enough memory. Glass ttys at the time could barely, and I mean barely, display ASCII. I know because I designed and built one.
Sep 25 2018
prev sibling parent reply aliak <something something.com> writes:
On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:
 When I originally started with D, I thought non-ASCII 
 identifiers with Unicode was a good idea. I've since slowly 
 become less and less enthusiastic about it.

 First off, D source text simply must (and does) fully support 
 Unicode in comments, characters, and string literals. That's 
 not an issue.

 But identifiers? I haven't seen hardly any use of non-ascii 
 identifiers in C, C++, or D. In fact, I've seen zero use of it 
 outside of test cases. I don't see much point in expanding the 
 support of it. If people use such identifiers, the result would 
 most likely be annoyance rather than illumination when people 
 who don't know that language have to work on the code.
Not seeing identifiers in languages you don't program in or can't read is expected. If it's supported, it will be used. Japanese Swift: https://speakerdeck.com/codelynx/programming-swift-in-japanese
 Extending it further will also cause problems for all the tools 
 that work with D object code, like debuggers, disassemblers, 
 linkers, filesystems, etc.

 Absent a much more compelling rationale for it, I'd say no.
More compelling than: "there're 6 billion people in this world who don't speak english?" Allowing people to program in their own language while reducing the cognitive friction for people who want to learn programming in the majority of the world seems like a no-brainer thing to do.
Sep 23 2018
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/23/2018 9:52 AM, aliak wrote:
 Not seeing identifiers in languages you don't program in or can't read is 
 expected.
On the other hand, I've been programming for 40 years. I've customized my C++ compiler to emit error messages in various languages: https://github.com/DigitalMars/Compiler/blob/master/dm/src/dmc/msgsx.c I've implemented SHIFT-JIS encodings, along with .950 (Chinese) and .949 (Korean) code pages in the C++ compiler. I've worked in Japan writing software for Japanese companies. I've sold compilers internationally for 30 years (mostly to Germany and Japan). I did the tech support, meaning I'd see their code. --- There's a reason why dmd doesn't have international error messages. My experience with it is that international users don't want it. They prefer the English messages. I'm sure if you look hard enough you'll find someone using non-ASCII characters in identifiers. --- When I visited Remedy Games in Finland a few years back, I was surprised that everyone in the company was talking in English. I asked if they were doing that out of courtesy to me. They laughed, and said no, they talked in English because they came from all over the world, and English was the only language they had in common.
Sep 23 2018
next sibling parent reply 0xEAB <desisma heidel.beer> writes:
On Sunday, 23 September 2018 at 20:49:39 UTC, Walter Bright wrote:
 There's a reason why dmd doesn't have international error 
 messages. My experience with it is that international users 
 don't want it. They prefer the english messages.
I'm a native German speaker. As for my part, I agree on this, indeed. There are several reasons for this:

- Usually such translations are terrible, simply put.
- Disjointed translations [0]
- Non-idiomatic sentences that still sound like English somehow.
- Translations of tech terms [1]
- Non-idiomatic translations of tech terms [2]

However, well-done translations might be quite nice: in VS 2010 I was happy with the German error messages. I'm not sure whether it was just delusion, but I think it got worse with some later version, though.

[0] There's nothing worse than every single sentence being treated on its own during the translation process. At least that's what you'd often think when you face a longer error message. Usually you're confronted with non-linked and kindergarten-like sentences that don't seem to be meant to be put together. Often you'd think there were several translators. Favorite problem with this: 2 different terms for the same thing in two sentences.

[1] e.g. "integer type" -> "ganzzahliger Datentyp". This just sounds weird. Anyone using "int" in their code knows what it means anyway... Nevertheless, there are some common translations that are fine (primarily because they're common), e.g. "error" -> "Fehler".

[2] e.g. "assertion" -> "Assertionsfehler". This particular one can be found in Windows 10 and is not even proper German.
Sep 24 2018
next sibling parent 0xEAB <desisma heidel.beer> writes:
On Monday, 24 September 2018 at 15:17:14 UTC, 0xEAB wrote:

 German error messages.
addendum: I've been using the English version since VS2017
Sep 24 2018
prev sibling parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 09/24/2018 08:17 AM, 0xEAB wrote:

 - Non-idiomatic translations of tech terms [2]
This is something I had heard from a Digital Research programmer in early 90s: English message was something like "No memory left" and the German translation was "No memory on the left hand side" :) Ali
Sep 25 2018
next sibling parent Simen =?UTF-8?B?S2rDpnLDpXM=?= <simen.kjaras gmail.com> writes:
On Wednesday, 26 September 2018 at 02:12:07 UTC, Ali Çehreli 
wrote:
 On 09/24/2018 08:17 AM, 0xEAB wrote:

 - Non-idiomatic translations of tech terms [2]
This is something I had heard from a Digital Research programmer in early 90s: English message was something like "No memory left" and the German translation was "No memory on the left hand side" :)
My ex-girlfriend tried to learn SQL from a book that had gotten a prize for its use of Norwegian. As a result, every single concept used a different name from what everybody else uses, and while it may be possible to learn some SQL from this, it made googling an absolute nightmare. Just imagine a whole book saying CHOOSE for SELECT, IF for WHERE, and USING instead of FROM - only worse, since it's a different language. It even used SQL pseudo-code with these made-up names, and showed how to translate it to proper SQL as more of an afterthought. -- Simen
Sep 25 2018
prev sibling next sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Wednesday, 26 September 2018 at 02:12:07 UTC, Ali Çehreli 
wrote:
 On 09/24/2018 08:17 AM, 0xEAB wrote:

 - Non-idiomatic translations of tech terms [2]
This is something I had heard from a Digital Research programmer in early 90s: English message was something like "No memory left" and the German translation was "No memory on the left hand side" :)
The K&R in German was of the same "quality". That happens when the translator is not an IT person himself.
Sep 25 2018
prev sibling next sibling parent reply ShadoLight <ettienne.gilbert gmail.com> writes:
On Wednesday, 26 September 2018 at 02:12:07 UTC, Ali Çehreli 
wrote:
 On 09/24/2018 08:17 AM, 0xEAB wrote:

 - Non-idiomatic translations of tech terms [2]
[snip]
 English message was something like "No memory left" and the 
 German translation was "No memory on the left hand side" :)

 Ali
Not sure if this was not just some urban legend, but there was a delightful story back in the late 80s/early 90s about the early translation programs. They were in particular not very good at idiomatic translations, so people would play with idiomatic expressions from language X (say english) to language Y, and then back from Y to X - and then see what was returned. Apparently the expression "the spirit is willing but the flesh is weak" translated to Russian and back was returned by one such program as: "The vodka is good but the meat is rotten!"
Sep 26 2018
parent abcde1234 <abcde1234 ge.sd> writes:
On Wednesday, 26 September 2018 at 12:57:21 UTC, ShadoLight wrote:
 On Wednesday, 26 September 2018 at 02:12:07 UTC, Ali Çehreli 
 wrote:
 On 09/24/2018 08:17 AM, 0xEAB wrote:

 - Non-idiomatic translations of tech terms [2]
[snip]
 English message was something like "No memory left" and the 
 German translation was "No memory on the left hand side" :)

 Ali
Not sure if this was not just some urban legend, but there was a delightful story back in the late 80s/early 90s about the early translation programs. They were in particular not very good at idiomatic translations, so people would play with idiomatic expressions from language X (say english) to language Y, and then back from Y to X - and then see what was returned. Apparently the expression "the spirit is willing but the flesh is weak" translated to Russian and back was returned by one such program as: "The vodka is good but the meat is rotten!"
In case you missed it, this was spread widely in the tech news a month or so ago: https://translate.google.fr/?hl=fr#so/en/ngoo%20m%20goon%20goob%20goo%20goo%20goo%20mgoo%20goo%20goo%20goo%20goo%20goo%20m%20goo Still some progress to be made.
Sep 26 2018
prev sibling parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
A delicious Turkish dessert is "kabak tatlısı", made of squash. Now, it 
so happens that "kabak" also means "zucchini" in Turkish. Imagine my 
shock when I came across that dessert recipe in English that used 
zucchini as the ingredient! :)

Ali
Sep 26 2018
next sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Wednesday, September 26, 2018 11:15:01 PM MDT Ali Çehreli via 
Digitalmars-d wrote:
 A delicious Turkish dessert is "kabak tatlısı", made of squash. Now, it
 so happens that "kabak" also means "zucchini" in Turkish. Imagine my
 shock when I came across that dessert recipe in English that used
 zucchini as the ingredient! :)
Was it any good? ;) - Jonathan M Davis
Sep 26 2018
prev sibling parent reply Andrea Fontana <nospam example.com> writes:
On Thursday, 27 September 2018 at 05:15:01 UTC, Ali Çehreli wrote:
 A delicious Turkish dessert is "kabak tatlısı", made of squash. 
 Now, it so happens that "kabak" also means "zucchini" in 
 Turkish. Imagine my shock when I came across that dessert recipe 
 in English that used zucchini as the ingredient! :)

 Ali
You can't even imagine how many Italian words and recipes are distorted... Andrea
Sep 27 2018
parent Paolo Invernizzi <paolo.invernizzi gmail.com> writes:
On Thursday, 27 September 2018 at 07:03:51 UTC, Andrea Fontana 
wrote:
 On Thursday, 27 September 2018 at 05:15:01 UTC, Ali Çehreli 
 wrote:
 A delicious Turkish dessert is "kabak tatlısı", made of squash. 
 Now, it so happens that "kabak" also means "zucchini" in 
 Turkish. Imagine my shock when I came across that dessert 
 recipe in English that used zucchini as the ingredient! :)

 Ali
You can't even imagine how many italian words and recipes are distorted... Andrea
+1 :-P
Sep 27 2018
prev sibling next sibling parent Andrea Fontana <nospam example.com> writes:
On Sunday, 23 September 2018 at 20:49:39 UTC, Walter Bright wrote:
 On 9/23/2018 9:52 AM, aliak wrote:

 There's a reason why dmd doesn't have international error 
 messages. My experience with it is that international users 
 don't want it. They prefer the english messages.
Yes please. Keep them in English. But please, add an error code in front of them too.
 I'm sure if you look hard enough you'll find someone using 
 non-ASCII characters in identifiers.
It depends on what I'm developing. If I'm writing a public library that I'm planning to release on github, I use English identifiers. But of course, if it is a piece of software for my company or for myself, I use Italian identifiers. Andrea
Sep 26 2018
prev sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Sunday, September 23, 2018 2:49:39 PM MDT Walter Bright via Digitalmars-d 
wrote:
 There's a reason why dmd doesn't have international error messages. My
 experience with it is that international users don't want it. They prefer
 the english messages.
It reminds me of one of the reasons that Bryan Cantrill thinks that many folks use Linux - they want to be able to google their stack traces. Of course, that same argument would be a reason to use C/C++ rather than switching to D. But having an error in a format that's more common - and therefore more likely to have been posted somewhere where you might find a discussion of it, and maybe its solution - can be valuable. And that's without even getting into all of the translation issues discussed elsewhere in this thread. And it's not like compiler error messages - or programming speak in general - are really traditional English anyway. - Jonathan M Davis
Sep 26 2018
prev sibling next sibling parent reply Erik van Velzen <erik evanv.nl> writes:
Agreed with Walter.

I'm all on board with i18n but I see no need for non-ascii 
identifiers.

Even identifiers with a non-latin origin are usually written in 
the latin script.

As for real-world usage I've seen Cyrillic identifiers a few 
times in PHP.
Sep 21 2018
parent reply Seb <seb wilzba.ch> writes:
On Friday, 21 September 2018 at 23:00:45 UTC, Erik van Velzen 
wrote:
 Agreed with Walter.

 I'm all on board with i18n but I see no need for non-ascii 
 identifiers.

 Even identifiers with a non-latin origin are usually written in 
 the latin script.

 As for real-world usage I've seen Cyrillic identifiers a few 
 times in PHP.
A: Wait. Using emojis as identifiers is not a good idea? B: Yes. A: But the cool kids are doing it: https://codepen.io/andresgalante/pen/jbGqXj In all seriousness I hate it when someone thought it was funny to use the lambda symbol as an identifier and I have to copy that symbol whenever I want to use it because there's no convenient way to type it. (This is already supported in D.)
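For reference, a minimal sketch (hypothetical names) of what that already-legal code looks like; λ is a Unicode 2.0 letter, so D accepts it as an identifier today:

import std.stdio;

void main()
{
    // λ (U+03BB) lexes as an ordinary identifier
    auto λ = (int x) => x * 2;
    writeln(λ(21)); // prints 42
}

Easy enough to read at the call site; not so easy to type without a compose key or copy-paste.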
Sep 21 2018
next sibling parent Neia Neutuladh <neia ikeran.org> writes:
On Friday, 21 September 2018 at 23:17:42 UTC, Seb wrote:
 A: Wait. Using emojis as identifiers is not a good idea?
 B: Yes.
 A: But the cool kids are doing it:
The C11 spec says that emoji should be allowed in identifiers (ISO publication N1570 page 504/522), so it's not just the cool kids. I'm not in favor of emoji in identifiers.
 In all seriousness I hate it when someone thought it was funny to 
 use the lambda symbol as an identifier and I have to copy that 
 symbol whenever I want to use it because there's no convenient 
 way to type it.
It's supported because λ is a letter in a language spoken by thirteen million people. I mean, would you want to have to name a variable "lumиnosиty" because someone got annoyed at people using "i" as a variable name?
Sep 21 2018
prev sibling next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 22/09/2018 11:17 AM, Seb wrote:
 In all seriousness I hate it when someone thought it was funny to use the 
 lambda symbol as an identifier and I have to copy that symbol whenever I 
 want to use it because there's no convenient way to type it.
 (This is already supported in D.)
This can be strongly mitigated by using a compose key. But they are not terribly common unfortunately.
Sep 21 2018
prev sibling next sibling parent Kagamin <spam here.lot> writes:
On Friday, 21 September 2018 at 23:17:42 UTC, Seb wrote:
 A: Wait. Using emojis as identifiers is not a good idea?
 B: Yes.
 A: But the cool kids are doing it:

 https://codepen.io/andresgalante/pen/jbGqXj
It's not like we have a lot of good fonts (I know only one), and even fewer of them are suitable for code, and they can't realistically be expected to do everything; monospace fonts are often even ASCII-only.
Sep 23 2018
prev sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Friday, 21 September 2018 at 23:17:42 UTC, Seb wrote:
 In all seriousness I hate it when someone thought it was funny to 
 use the lambda symbol as an identifier and I have to copy that 
 symbol whenever I want to use it because there's no convenient 
 way to type it.
 (This is already supported in D.)
I just want to chime in that I've definitely used greek letters in "ordinary" code - it's handy when writing math and feeling lazy. Note that on Linux, with a simple configuration tweak (Windows key mapped to Compose, and https://gist.githubusercontent.com/zkat/6718053/raw/4535a2e2a988aa90937a69dbb8f10e6a43b4010/.XCompose ), you can for instance type "<windows key> l a m" to make the lambda symbol, or other Greek letters, very easily.
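For instance, something like this (a minimal sketch, hypothetical names) compiles today, since Greek letters are already in D's accepted identifier set:

import std.math;
import std.stdio;

void main()
{
    double μ = 0.0; // mean
    double σ = 1.0; // standard deviation
    double x = 0.5;

    // Gaussian density, written close to its textbook form
    double φ = exp(-((x - μ) ^^ 2) / (2 * σ ^^ 2)) / (σ * sqrt(2 * PI));
    writeln(φ);
}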
Sep 25 2018
parent reply Dukc <ajieskola gmail.com> writes:
When I make code that I expect to be used only around here, I 
generally write the code itself in English but comments in my own 
language. I agree that in general, it's better to stick with 
English in identifiers when the programming language and the 
standard library are in English.

On Tuesday, 25 September 2018 at 09:28:33 UTC, FeepingCreature 
wrote:
 On Friday, 21 September 2018 at 23:17:42 UTC, Seb wrote:
 In all seriousness I hate it when someone thought it was funny to 
 use the lambda symbol as an identifier and I have to copy that 
 symbol whenever I want to use it because there's no convenient 
 way to type it.
 (This is already supported in D.)
I just want to chime in that I've definitely used greek letters in "ordinary" code - it's handy when writing math and feeling lazy.
On the other hand, Unicode identifiers still have their value IMO. The quote above is one reason for that - if there is a very specialized codebase, it may just be impractical to transliterate everything. Another reason is that something may not have a good translation to English. If there is an enum type listing city names, it is IMO better to write them as normal, using Unicode. CityName.seinäjoki, not CityName.seinaejoki.
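For illustration, a minimal sketch (hypothetical city list); ä is among the letters D already accepts, so this compiles as-is:

import std.stdio;

enum CityName
{
    helsinki,
    seinäjoki,
    jyväskylä,
}

void main()
{
    auto home = CityName.seinäjoki;
    writeln(home); // prints "seinäjoki"
}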
Sep 25 2018
parent reply Shachar Shemesh <shachar weka.io> writes:
On 25/09/18 15:35, Dukc wrote:
 Another reason is that something may not have a good translation to 
 English. If there is an enum type listing city names, it is IMO better 
 to write them as normal, using Unicode. CityName.seinäjoki, not 
 CityName.seinaejoki.
This sounded like a very compelling example, until I gave it a second thought. I now fail to see how this example translates to a real-life scenario. City names (data, changes over time) as enums (compile time set) seem like a horrible idea. That may sound like a very technical objection to an otherwise valid point, but I really think that's not the case. The properties that cause city names to be poor candidates for enum values are the same as those that make them Unicode candidates. Shachar
Sep 25 2018
next sibling parent reply Dukc <ajieskola gmail.com> writes:
On Wednesday, 26 September 2018 at 06:50:47 UTC, Shachar Shemesh 
wrote:
 The properties that cause city names to be poor candidates for 
 enum values are the same as those that make them Unicode 
 candidates.
How so?
 City names (data, changes over time) as enums (compile time 
 set) seem like a horrible idea.
In most cases yes. But not always. You might be doing some sort of game where certain cities are a central concept, not just data with properties. Another possibility is that you're using code as data, AKA scripting. And who says anyway you can't make a program that's designed specifically for certain cities?
Sep 26 2018
parent reply Shachar Shemesh <shachar weka.io> writes:
On 26/09/18 10:26, Dukc wrote:
 On Wednesday, 26 September 2018 at 06:50:47 UTC, Shachar Shemesh wrote:
 The properties that cause city names to be poor candidates for enum 
 values are the same as those that make them Unicode candidates.
How so?
 City names (data, changes over time) as enums (compile time set) seem 
 like a horrible idea.
In most cases yes. But not always. You might be doing some sort of game where certain cities are a central concept, not just data with properties. Another possibility is that you're using code as data, AKA scripting. And who says anyway you can't make a program that's designed specifically for certain cities?
Sure you can. It's just very poor design. I think, when asking such questions, two types of answers are relevant. One is hypotheticals where you say "this design requires this". For such answers, the design needs to be a good one. It makes no sense to design a language to support a hypothetical design which is not a good one. The other type of answer is "it's being done in the real world". If it's in active use in the real world, it might make sense to support it, even if we can agree that the design is not optimal. Since your answer is hypothetical, I think arguing that this is not a good way to code is a valid response. Shachar
Sep 26 2018
parent Dukc <ajieskola gmail.com> writes:
On Wednesday, 26 September 2018 at 07:37:28 UTC, Shachar Shemesh 
wrote:
 The other type of answer is "it's being done in the real 
 world". If it's in active use in the real world, it might make 
 sense to support it, even if we can agree that the design is 
 not optimal.

 Shachar
Two years ago, I took part in implementing a commercial game. It would have faced the same thing, were it used. Anyway, the game has three characters with completely different abilities. The abilities were unique enough that it made sense to name some functions after the characters. One of the characters really has a non-ASCII character in his name, and that meant naming him differently in the code.
Sep 26 2018
prev sibling next sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 9/26/18 2:50 AM, Shachar Shemesh wrote:
 On 25/09/18 15:35, Dukc wrote:
 Another reason is that something may not have a good translation to 
 English. If there is an enum type listing city names, it is IMO better 
 to write them as normal, using Unicode. CityName.seinäjoki, not 
 CityName.seinaejoki.
This sounded like a very compelling example, until I gave it a second thought. I now fail to see how this example translates to a real-life scenario. City names (data, changes over time) as enums (compile time set) seem like a horrible idea. That may sound like a very technical objection to an otherwise valid point, but I really think that's not the case. The properties that cause city names to be poor candidates for enum values are the same as those that make them Unicode candidates.
Hm... I could actually see some "clever" use of opDispatch being used to define cities or other such names. In any case, I think the biggest pro for supporting Unicode symbol names is -- we already support Unicode symbol names. It doesn't make a whole lot of sense to only support some of them. -Steve
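P.S. A minimal sketch of the opDispatch idea (hypothetical names), where any identifier, Unicode included, becomes a lookup rather than an enum member:

import std.stdio;

struct CityName
{
    // Unknown members fall back to opDispatch; the identifier used
    // at the call site becomes a compile-time string.
    static string opDispatch(string name)()
    {
        return name; // a real program would look up data keyed by `name`
    }
}

void main()
{
    writeln(CityName.seinäjoki); // prints "seinäjoki"
}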
Sep 26 2018
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 9/25/2018 11:50 PM, Shachar Shemesh wrote:
 This sounded like a very compelling example, until I gave it a second thought. I 
 now fail to see how this example translates to a real-life scenario.
Also, there are usually common ASCII versions of city names, such as Cologne for Köln.
Sep 26 2018
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2018-09-21 18:27, Neia Neutuladh wrote:
 D's currently accepted identifier characters are based on Unicode 2.0:
 
 * ASCII range values are handled specially.
 * Letters and combining marks from Unicode 2.0 are accepted.
 * Numbers outside the ASCII range are accepted.
 * Eight random punctuation marks are accepted.
 
 This follows the C99 standard.
 

 Python, ECMAScript, just to name a few. A small number of languages 
 reject non-ASCII characters: Dart, Perl. Some languages are weirdly 
 generous: Swift and C11 allow everything outside the Basic Multilingual 
 Plane.
 
 I'd like to update that so that D accepts something as a valid 
 identifier character if it's a letter or combining mark or modifier 
 symbol that's present in Unicode 11, or a non-ASCII number. This allows 
 the 146 most popular writing systems and a lot more characters from 
 those writing systems. This *would* reject those eight random 
 punctuation marks, so I'll keep them in as legacy characters.
 
 It would mean we don't have to reference the C99 standard when 
 enumerating the allowed characters; we just have to refer to the Unicode 
 standard, which we already need to talk about in the lexical part of the 
 spec.
 
 It might also make the lexer a tiny bit faster; it reduces the number of 
 valid-ident-char segments to search from 245 to 134. On the other hand, 
 it will change the ident char ranges from wchar to dchar, which means 
 the table takes up marginally more memory.
 
 And, of course, it lets you write programs entirely in Linear B, and 
 that's a marketing ploy not to be missed.
 
 I've got this coded up and can submit a PR, but I thought I'd get 
 feedback here first.
 
 Does anyone see any horrible potential problems here?
 
 Or is there an interestingly better option?
 
 Does this need a DIP?
I'm not a native English speaker but I write all my public and private code in English. I expect anyone I work with to write their code in English as well, and I make sure they do. English is not enough either, it has to be American English. Despite this, I think that D should support as much of Unicode as possible (including using Unicode for identifiers). It should not be up to the programming language to decide which language the developer should write the code in. -- /Jacob Carlborg
Sep 25 2018
prev sibling parent reply rjframe <dlang ryanjframe.com> writes:
On Fri, 21 Sep 2018 16:27:46 +0000, Neia Neutuladh wrote:

 I've got this coded up and can submit a PR, but I thought I'd get
 feedback here first.
 
 Does anyone see any horrible potential problems here?
 
 Or is there an interestingly better option?
 
 Does this need a DIP?
I just want to point out, since this thread is still living, that there have been very few answers to the actual question ("should I submit my PR?"). Walter did answer the question, with the reasons that Unicode identifier support is not useful/helpful and could cause issues with tooling. Which is likely correct; and if we really want to follow this logic, Unicode identifier support should be removed from D entirely. I don't recall seeing anyone in favor providing technical reasons, save the OP. Especially since the work is done, it makes sense to me to ask for the PR for review. Worst case scenario, it sits there until we need it.
Sep 26 2018
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 9/26/18 5:54 AM, rjframe wrote:
 On Fri, 21 Sep 2018 16:27:46 +0000, Neia Neutuladh wrote:
 
 I've got this coded up and can submit a PR, but I thought I'd get
 feedback here first.

 Does anyone see any horrible potential problems here?

 Or is there an interestingly better option?

 Does this need a DIP?
I just want to point out since this thread is still living that there have been very few answers to the actual question ("should I submit my PR?"). Walter did answer the question, with the reasons that Unicode identifier support is not useful/helpful and could cause issues with tooling. Which is likely correct; and if we really want to follow this logic, Unicode identifier support should be removed from D entirely.
This is a non-starter. We can't break people's code, especially for trivial reasons like 'you shouldn't code that way because others don't like it'. I'm pretty sure Walter would be against removing Unicode support for identifiers.
 
 I don't recall seeing anyone in favor providing technical reasons, save
 the OP.
There doesn't necessarily need to be a technical reason. In fact, there really isn't one -- people can get by with using ASCII identifiers just fine (and many/most people do). Supporting Unicode would be purely for social or inclusive reasons (it may make D more approachable to non-English speaking schoolchildren for instance). As an only-English speaking person, it doesn't bother me either way to have Unicode identifiers. But the fact that we *already* support Unicode identifiers leads me to expect that we support *all* Unicode identifiers. It doesn't make a whole lot of sense to only support some of them.
 
 Especially since the work is done, it makes sense to me to ask for the PR
 for review. Worst case scenario, it sits there until we need it.
I suggested this as well. https://forum.dlang.org/post/poaq1q$its$1 digitalmars.com I think it stands a good chance of getting incorporated, just for the simple fact that it's enabling and not disruptive. -Steve
Sep 26 2018
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/26/2018 5:46 AM, Steven Schveighoffer wrote:
 This is a non-starter. We can't break people's code, especially for trivial 
 reasons like 'you shouldn't code that way because others don't like it'. I'm 
 pretty sure Walter would be against removing Unicode support for identifiers.
We're not going to remove it, because there's not much to gain from it. But expanding it seems of vanishingly little value. Note that each thing that gets added to D adds weight to it, and it needs to pull its weight. Nothing is free. I don't see a scenario where someone would be learning D and not know English. Non-English D instructional material is nearly non-existent. dlang.org is all in English. Don't most languages have a Romaji-like representation? C/C++ have made efforts in the past to support non-ASCII coding - digraphs, trigraphs, and alternate keywords. They've all failed miserably. The only people who seem to know those features even exist are language lawyers.
Sep 26 2018
next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Wednesday, 26 September 2018 at 20:43:47 UTC, Walter Bright 
wrote:
 I don't see a scenario where someone would be learning D and 
 not know English. Non-English D instructional material is 
 nearly non-existent.
http://ddili.org/ders/d/
Sep 26 2018
prev sibling next sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 9/26/18 4:43 PM, Walter Bright wrote:
 But expanding it seems of vanishingly little value. Note that each thing 
 that gets added to D adds weight to it, and it needs to pull its weight. 
 Nothing is free.
It may be that the weight is already there in the form of Unicode symbol support, and just the range of the characters supported isn't good enough for some languages. It might be like replacing your refrigerator -- you get an upgrade, but it's not going to take up any more space because you get rid of the old one. I would like to see the PR before passing judgment on the heft of the change. The value is simply in the consistency -- when some of the words for your language can be valid symbols but others can't, it becomes a weird guessing game as to what is supported. It would be like saying all identifiers can have any letters except `q`. Sure, you can get around that, but it's weirdly exclusive. I claim complete ignorance as to what is required; it hasn't been technically laid out what is at stake, and I'm not bilingual anyway. It could be true that I'm completely misunderstanding the positions of others. -Steve
Sep 26 2018
prev sibling next sibling parent Neia Neutuladh <neia ikeran.org> writes:
On 09/26/2018 01:43 PM, Walter Bright wrote:
 Don't most languages have a Romaji-like 
 representation?
Yes, a lot of languages that don't use the Latin alphabet have standard transcriptions into the Latin alphabet. Standard transcriptions into ASCII are much less common, and newer Unicode versions include more Latin characters to better support languages (and other use cases) using the Latin alphabet.
Sep 26 2018
prev sibling parent reply aliak <something something.com> writes:
On Wednesday, 26 September 2018 at 20:43:47 UTC, Walter Bright 
wrote:
 On 9/26/2018 5:46 AM, Steven Schveighoffer wrote:
 This is a non-starter. We can't break people's code, 
 especially for trivial reasons like 'you shouldn't code that 
 way because others don't like it'. I'm pretty sure Walter 
 would be against removing Unicode support for identifiers.
We're not going to remove it, because there's not much to gain from it. But expanding it seems of vanishingly little value. Note that each thing that gets added to D adds weight to it, and it needs to pull its weight. Nothing is free. I don't see a scenario where someone would be learning D and not know English. Non-English D instructional material is nearly non-existent. dlang.org is all in English. Don't most languages have a Romanji-like representation?
It's not that they don't know English. It's that non-English speakers can process words and sentences in their own language much more efficiently than in English. Knowing a language is not binary. Here's an example from this year's spring semester at NTNU (a Norwegian uni): http://folk.ntnu.no/frh/grprog/eksempel/eks_20.cpp ... That's the basic programming course. Whether the professor would use that I guess would depend on the ratio of English/non-English speakers. But it's there nonetheless. Of course Norway is a bad example because the English level here is, arguably, higher than in many English-speaking countries :p But it's a great example because even if you're great at English, sometimes people are still more comfortable/confident/efficient in their own native language. Some tech meetups from different countries try to do things in English and mostly it works. But it's been seen consistently with non-English audiences that presentations given in English result in silence, whereas if it's in their native language you have actual engagement. I fail to understand how supporting a version of Unicode from (not sure when it was released) 3 billion decades ago should just be left as is, neither updated nor removed, when there's someone who's willing to update it.
 C/C++ have made efforts in the past to support non-ASCII coding 
 - digraphs, trigraphs, and alternate keywords. They've all 
 failed miserably. The only people who seem to know those 
 features even exist are language lawyers.
This is not relevant. Trigraphs and digraphs did indeed fail miserably, but they do not represent any non-ASCII characters. The existential reasons for those abominations were different. Anyway, on a related note: D itself (not identifiers, but std) also supports Unicode 6 or something. That's from 2010. That's a decade ago. We're at Unicode 11 now. And I've already had someone tell me (while trying to get them to use D) - "hold on, it supports Unicode from a decade ago? Nah, I'm not touching it". Not that it's the same as supporting identifiers in code, but still the reaction is relevant. Cheers, - Ali
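P.S. For concreteness, a minimal sketch of the kind of std.uni classification in question (isAlpha is an existing std.uni function; the version complaint is about the Unicode tables behind it):

import std.stdio;
import std.uni : isAlpha;

void main()
{
    writeln(isAlpha('λ')); // true: λ is classified as a letter
    writeln(isAlpha('1')); // false: a digit, not a letter
}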
Sep 27 2018
next sibling parent reply Shachar Shemesh <shachar weka.io> writes:
On 27/09/18 10:35, aliak wrote:
 Here's an example from this years spring semester and NTNU (norwegian 
 uni): http://folk.ntnu.no/frh/grprog/eksempel/eks_20.cpp
 
 ... That's the basic programming course. Whether the professor would use 
 that I guess would depend on ratio of English/non-English speakers. But 
 it's there nonetheless.
I'm sorry I keep bringing this up, but context is really important here. The program you link to has non-ASCII in the comments and in the literals, but not in the identifiers. Nobody is opposed to having those. Shachar
Sep 27 2018
parent reply aliak <something something.com> writes:
On Thursday, 27 September 2018 at 08:16:00 UTC, Shachar Shemesh 
wrote:
 On 27/09/18 10:35, aliak wrote:
 Here's an example from this years spring semester and NTNU 
 (norwegian uni): 
 http://folk.ntnu.no/frh/grprog/eksempel/eks_20.cpp
 
 ... That's the basic programming course. Whether the professor 
 would use that I guess would depend on ratio of 
 English/non-English speakers. But it's there nonetheless.
I'm sorry I keep bringing this up, but context is really important here. The program you link to has non-ASCII in the comments and in the literals, but not in the identifiers. Nobody is opposed to having those. Shachar
The point was that being able to use non-English in code is demonstrably both helpful and useful to people. Norwegian happens to be easily anglicize-able. I've already linked to non-ASCII code versions in a previous post if you want that too.
Sep 27 2018
parent reply Shachar Shemesh <shachar weka.io> writes:
On 27/09/18 16:38, aliak wrote:
 The point was that being able to use non-English in code is demonstrably 
 both helpful and useful to people. Norwegian happens to be easily 
 anglicize-able. I've already linked to non ascii code versions in a 
 previous post if you want that too.
If you wish to make a point about something irrelevant to the discussion, that's fine. It is, however, irrelevant, mostly because it is uncontested. This thread is about the use of non-English in *identifiers*. This thread is not about comments. It is not about literals (i.e. - strings). Only about identifiers (function names, variable names etc.). If you have real world examples of those, that would be both interesting and relevant. Shachar
Sep 27 2018
parent reply aliak <something something.com> writes:
On Thursday, 27 September 2018 at 13:59:48 UTC, Shachar Shemesh 
wrote:
 On 27/09/18 16:38, aliak wrote:
 The point was that being able to use non-English in code is 
 demonstrably both helpful and useful to people. Norwegian 
 happens to be easily anglicize-able. I've already linked to 
 non ascii code versions in a previous post if you want that 
 too.
If you wish to make a point about something irrelevant to the discussion, that's fine. It is, however, irrelevant, mostly because it is uncontested. This thread is about the use of non-English in *identifiers*. This thread is not about comments. It is not about literals (i.e. - strings). Only about identifiers (function names, variable names etc.). If you have real world examples of those, that would be both interesting and relevant. Shachar
English doesn't mean ASCII. You can write non-English in ASCII, which you would've noticed if you'd opened the link, which had identifiers in Norwegian (which is not English). And again, I've already posted a link that shows non-ASCII identifiers. I'll paste it again here in case you don't want to read the thread: https://speakerdeck.com/codelynx/programming-swift-in-japanese
Sep 27 2018
parent reply sarn <sarn theartofmachinery.com> writes:
On Thursday, 27 September 2018 at 16:34:37 UTC, aliak wrote:
 On Thursday, 27 September 2018 at 13:59:48 UTC, Shachar Shemesh 
 wrote:
 On 27/09/18 16:38, aliak wrote:
 The point was that being able to use non-English in code is 
 demonstrably both helpful and useful to people. Norwegian 
 happens to be easily anglicize-able. I've already linked to 
 non ascii code versions in a previous post if you want that 
 too.
If you wish to make a point about something irrelevant to the discussion, that's fine. It is, however, irrelevant, mostly because it is uncontested. This thread is about the use of non-English in *identifiers*. This thread is not about comments. It is not about literals (i.e. - strings). Only about identifiers (function names, variable names etc.). If you have real world examples of those, that would be both interesting and relevant. Shachar
English doesn't mean ASCII. You can write non-English in ASCII, which you would've noticed if you'd opened the link, which had identifiers in Norwegian (which is not English). And again, I've already posted a link that shows non-ASCII identifiers. I'll paste it again here in case you don't want to read the thread: https://speakerdeck.com/codelynx/programming-swift-in-japanese
Shachar seems to be aiming for an internet high score by shooting down threads without reading them. You have better things to do. http://www.paulgraham.com/vb.html
Sep 27 2018
parent reply Dukc <ajieskola gmail.com> writes:
On Friday, 28 September 2018 at 02:23:32 UTC, sarn wrote:
 Shachar seems to be aiming for an internet high score by 
 shooting down threads without reading them.  You have better 
 things to do.
 http://www.paulgraham.com/vb.html
I believe you're being too harsh. It's easy to miss a part of a post sometimes.
Sep 28 2018
next sibling parent sarn <sarn theartofmachinery.com> writes:
On Friday, 28 September 2018 at 11:37:10 UTC, Dukc wrote:
 It's easy to miss a part of a post sometimes.
That's very true, and it's always good to give people the benefit of the doubt. But most people are able to post constructively here without * Abrasively and condescendingly declaring others' posts to be completely pointless * Doing that based on one single aspect of a post, without bothering to check the whole post or parent post * Doubling down even after getting a hint that the poster might not have posted 100% cluelessly * Doing all this more than once in a thread If Shachar starts posting constructively, I'll happily engage. I mean that. Otherwise I won't waste my time, and I'll tell others not to waste theirs, too.
Sep 28 2018
prev sibling parent reply Shachar Shemesh <shachar weka.io> writes:
On 28/09/18 14:37, Dukc wrote:
 On Friday, 28 September 2018 at 02:23:32 UTC, sarn wrote:
 Shachar seems to be aiming for an internet high score by shooting down 
 threads without reading them.  You have better things to do.
 http://www.paulgraham.com/vb.html
I believe you're being too harsh. It's easy to miss a part of a post sometimes.
A minor correction: Aliak is not accusing me of missing a part of the post. He's accusing me of not taking into account something he said in a different part of the *thread*. I.e. - I missed something he said in one of the other (as of this writing, 98) posts of this thread, thus causing Dukc to label me a bullshitter.
Sep 28 2018
parent reply Dukc <ajieskola gmail.com> writes:
On Saturday, 29 September 2018 at 02:22:55 UTC, Shachar Shemesh 
wrote:
 I missed something he said in one of the other (as of this 
 writing, 98) posts of this thread, and thus causing Dukc to 
 label me a bullshitter.
I know you meant Sarn, but still... can you please be a bit less aggressive with your wording?
Sep 29 2018
parent reply Shachar Shemesh <shachar weka.io> writes:
On 29/09/18 16:52, Dukc wrote:
 On Saturday, 29 September 2018 at 02:22:55 UTC, Shachar Shemesh wrote:
 I missed something he said in one of the other (as of this writing, 
 98) posts of this thread, and thus causing Dukc to label me a 
 bullshitter.
I know you meant Sarn, but still... can you please be a bit less aggressive with your wording?
From the article (the furthest point I read in it):
 When I ask myself what I've found life is too short for, the word that pops 
 into my head is "bullshit."
That is the word used by the article *you* linked to, in reference to me. If it offends you enough to be accused of *calling* someone that, just imagine how I felt being *called* that very same name. Seriously, I don't make it a habit of being offended by random people on the Internet, but this is more a conscious decision than a naturally thick skin. Seeing that label hurt. Don't worry. I've been on the Internet since 1991. That's longer than the median age here (i.e. - I've been on the Internet since before most of you were born). I've had my own fair share of flame wars, including some that, to my chagrin, I've started. In other words, I got over it. I did not reply, big though the temptation was. But the right time to be sensitive about what words are being used was *before* you linked to the article. Taking offense at being called out for calling someone something you find offensive is hypocritical. I never understood the focus on words. It's not the use of that word that offended me, it's the fact that you thought anything I did justified using it. I don't think using "cattle excrement" instead would have been any less hurtful. And it's not that the rest of your post was thoughtful, considerate, and took pains to give constructive criticism, with or without hurting anyone's feelings. It's just that that part doesn't seem to be what bothers you. Shachar
Sep 29 2018
parent Shachar Shemesh <shachar weka.io> writes:
On Saturday, 29 September 2018 at 16:19:38 UTC, ag0aep6g wrote:
 On 09/29/2018 04:19 PM, Shachar Shemesh wrote:
 On 29/09/18 16:52, Dukc wrote:
[...]
 I know you meant Sarn, but still... can you please be a bit 
 less aggressive with your wording?
From the article (the furthest point I read in it):
 When I ask myself what I've found life is too short for, the 
 word that pops into my head is "bullshit."
Dukc didn't post that link. sarn did.
You are 100% correct. My most sincere apologies. I am going to stop responding to this thread now. Shachar
Sep 29 2018
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 9/27/2018 12:35 AM, aliak wrote:
 Anyway, on a related note: D itself (not identifiers, but std) also supports 
 Unicode 6 or something. That's from 2010. That's a decade ago. We're at Unicode 
 11 now. And I've already had someone tell me (while trying to get them to use D) 
 - "hold on, it supports Unicode from a decade ago? Nah, I'm not touching it". Not 
 that it's the same as supporting identifiers in code, but still the reaction is 
 relevant.
Nobody is suggesting D not support Unicode in strings, comments, and the standard library. Please file any issues on Bugzilla, and PRs to fix them.
Sep 27 2018
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 9/26/2018 5:46 AM, Steven Schveighoffer wrote:
 Does this need a DIP?
Feel free to write one, but its chances of getting incorporated are remote and would require a pretty strong rationale that I haven't seen yet.
Sep 26 2018