## digitalmars.D - Why UTF-8/16 character encodings?

• Joakim (22/26) May 24 2013 This triggered a long-standing bugbear of mine: why are we using
• Peter Alexander (13/15) May 24 2013 Simple: backwards compatibility with all ASCII APIs (e.g. most C
• Joakim (47/59) May 24 2013 And yet here we are today, where an early decision made solely to
• Joakim (7/16) May 24 2013 Sorry, I was a bit imprecise. Here's what I meant to write:
• Walter Bright (17/22) May 24 2013 This is more a problem with the algorithms taking the easy way than a pr...
• anonymous (2/16) May 24 2013 The German ß becomes SS when capitalised. It's no encoding issue.
• Dmitry Olshansky (34/58) May 24 2013 You seem to think that not only UTF-8 is bad encoding but also one
• H. S. Teoh (102/151) May 24 2013 I remember those bad ole days of gratuitously-incompatible encodings. I
• Walter Bright (7/9) May 24 2013 One of the first, and best, decisions I made for D was it would be Unico...
• Manu (6/16) May 24 2013 Indeed, excellent decision!
• Walter Bright (3/4) May 24 2013 Oh, how I want to do that. But I still think the world hasn't completely...
• H. S. Teoh (13/18) May 24 2013 That would be most awesome!
• Timon Gehr (13/31) May 25 2013 This is what eg. Haskell, Coq are doing.
• Hans W. Uhlig (5/10) May 26 2013 Using those characters would be wonderful and while we do have
• Walter Bright (3/6) May 26 2013 I have a post-it stuck to my monitor with the numbers for various unicod...
• H. S. Teoh (13/21) May 26 2013 I have been thinking about this idea of a "reprogrammable keyboard", in
• Kiith-Sa (2/2) May 26 2013 You mean like
• H. S. Teoh (6/8) May 26 2013 Whoa! That is exactly what I had in mind!!
• Torje Digernes (5/12) May 26 2013 If you want to configure your keyboard so you can type unicode in
• H. S. Teoh (16/33) May 26 2013 Oh, I know *that*. I configured my xkb setup to switch between English
• Wyatt (23/38) May 26 2013 I've given this domain a fair bit of thought, and from my
• H. S. Teoh (34/71) May 27 2013 I like this idea. It's certainly more feasible than reinventing the
• Vladimir Panteleev (3/4) May 27 2013 Perhaps something like the compose key?
• H. S. Teoh (9/15) May 27 2013 I'm already using the compose key. But it only goes so far (I don't
• Vladimir Panteleev (3/6) May 27 2013 I thought the topic was typing the occasional Unicode character
• H. S. Teoh (16/23) May 27 2013 Well, D *does* support non-English identifiers, y'know... for example:
• Walter Bright (3/11) May 27 2013 I've recently come to the opinion that that's a bad idea, and D should n...
• Hans W. Uhlig (3/18) May 27 2013 Why do you think its a bad idea? It makes it such that code can
• H. S. Teoh (22/41) May 27 2013 Currently, the above code snippet compiles (upon inserting "import
• monarch_dodra (23/76) May 28 2013 I can tell you for a fact there are a tons of *private* companies
• Walter Bright (7/12) May 27 2013 Every time I've been to a programming shop in a foreign country, the dev...
• Diggory (7/23) May 27 2013 The most convincing case for usefulness I've seen was in java
• Olivier Pisano (13/17) May 28 2013 Would you have been to such an event if you could not have
• monarch_dodra (8/12) May 28 2013 That's because you have an academic view of code, and a library
• qznc (7/23) May 29 2013 Once I heared an argument from developers working for banks. They
• Walter Bright (2/7) May 29 2013 German is pretty easy to do in ASCII: Vermoegen and Buergschaft
• monarch_dodra (18/30) May 30 2013 What about Chinese? Russian? Japanese? It is doable, but I can
• Simen Kjaeraas (21/48) May 30 2013 On Thu, 30 May 2013 11:36:42 +0200, monarch_dodra
• Dicebot (4/14) May 30 2013 What about poor guys from other country that will support that
• monarch_dodra (18/33) May 30 2013 Well... defacto: "in practice but not necessarily ordained by
• Manu (5/19) May 30 2013 Have you ever worked on code written by people who barely speak English?
• Dicebot (4/11) May 30 2013 I have had comments with Finnish poetry in code I was responsible
• Kagamin (31/33) Jun 27 2013 I did. It's better than having a mixture of languages like here:
• deadalnix (7/14) Jun 27 2013 OOo codebase is historically mostly in german. They try to reduce
• Peter Williams (4/6) May 27 2013 So you're going to spell check them all to make sure that they're
• David Eagen (4/7) May 27 2013 That's it. I'm filing a bug against std.traits. There's a
• Manu (7/15) May 27 2013 How dare you!
• Peter Williams (4/12) May 27 2013 Except here in Australia and other places where they use the Queen's
• Manu (2/17) May 27 2013 Is there anywhere other than America that doesn't?
• Jacob Carlborg (4/5) May 28 2013 Canada, Jamaica, other countries in that region?
• Manu (3/7) May 28 2013 Yes, the region called America ;)
• Jacob Carlborg (4/6) May 28 2013 Oh, you meant the whole region and not the country.
• Simen Kjaeraas (4/8) May 28 2013 America is not a country. The country is called USA.
• Jacob Carlborg (5/6) May 28 2013 I know that, but I get the impression that people usually say "America"
• Peter Williams (4/7) May 28 2013 Last time I looked Canada was in America (which is a continent not a
• Diggory (3/13) May 28 2013 America isn't a continent, North America is a continent, and
• monarch_dodra (8/23) May 28 2013 Well, that point of view really depends from which continent
• H. S. Teoh (13/23) May 28 2013 [...]
• Peter Williams (4/24) May 28 2013 Last time I was there (about 40 years ago) Canadians didn't seem that
• H. S. Teoh (7/25) May 28 2013 [...]
• Jacob Carlborg (4/6) May 28 2013 Don't you have a spell checker in your editor? If not, find a new one :)
• "Luís Marques" (3/5) May 27 2013 I think it is a bad idea to program in a language other than
• Manu (5/11) May 27 2013 I can imagine a young student learning to code, that may not speak Engli...
• Manu (3/17) May 27 2013 Why? You said previously that you'd love to support extended operators ;...
• Torje Digernes (4/28) May 27 2013 I find features such as support for uncommon symbols in variables
• Walter Bright (2/16) May 27 2013 Extended operators, yes. Non-ascii identifiers, no.
• Oleg Kuporosov (6/9) May 28 2013 BTW, this is one of big D advantage, take into account
• Simen Kjaeraas (10/23) May 28 2013 :
• Jakob Ovrum (17/19) May 29 2013 Honestly, removing support for non-ASCII characters from
• Walter Bright (3/19) May 29 2013 I still think it's a bad idea, but it's obvious people want it in D, so ...
• Marco Leise (5/8) May 29 2013 Surprisingly ASCII also covers Cornish and Malay.
• Oleg Kuporosov (3/6) May 29 2013 Good, thanks, restrictions definetelly can and should be applied
• Jakob Ovrum (5/6) May 30 2013 I don't understand the logic behind this. Surely this is the
• Marco Leise (11/25) May 29 2013 t=20
• Timon Gehr (2/5) May 30 2013 No, that's deeply troubling.
• Entry (1/1) May 29 2013 My personal opinion is that code should only be in English.
• Peter Williams (3/4) May 29 2013 But why would you want to impose this restriction on others?
• Entry (7/11) May 30 2013 I wouldn't say impose. I'd say that programming in a unified
• monarch_dodra (8/21) May 30 2013 But programming IS a human tool, and thus, subject to human
• Entry (3/26) May 30 2013 What a way to attack a straw-man and completely miss the point at
• monarch_dodra (6/35) May 30 2013 Fine.
• Entry (7/43) May 30 2013 Take a minute to think about why we're all communicating in
• monarch_dodra (19/25) May 30 2013 Well that's condescending :/ and fallacious.
• Entry (8/33) May 30 2013 I'm glad you agree, though I believe that I never said anything
• Jakob Ovrum (3/11) May 30 2013 If the programmers who are going to be working on that code don't
• Entry (4/15) May 30 2013 Then there's no helping it. Though I wonder what kind of a
• Manu (2/18) May 30 2013 A child, or a student.
• Manu (2/44) May 30 2013 This is the definition of a *convention*, not a rule.
• Manu (1/16) May 30 2013
• Manu (12/27) May 30 2013 We don't all know English. Plenty of people don't.
• Walter Bright (2/12) May 30 2013 Sure, but the code itself is written using ASCII!
• Peter Williams (3/19) May 30 2013 Because they had no choice.
• Walter Bright (2/21) May 30 2013 Not true, D supports Unicode identifiers.
• Simen Kjaeraas (8/30) May 31 2013 ce,
• Timothee Cour (23/50) Jun 05 2013 =A7=E3=81=99=E3=80=82
• Brad Roberts (2/16) Jun 05 2013 Filed in bugzilla?
• Sean Kelly (10/30) Jun 17 2013 unicode strings/comments), we must
• H. S. Teoh (6/31) Jun 17 2013 Do linkers actually support 8-bit symbol names? Or do these have to be
• Sean Kelly (8/11) Jun 17 2013 Good question. It looks like the linker on OSX does:
• Brad Roberts (4/12) Jun 17 2013 Don't symbol names from dmd/win32 get compressed if they're too long, re...
• Walter Bright (2/6) Jun 17 2013 Optlink doesn't care what the symbol byte contents are.
• H. S. Teoh (10/17) Jun 18 2013 It seems ld on Linux doesn't, either. I just tested separate compilation
• Walter Bright (2/3) Jun 18 2013 I doubt it, but try it and see!
• H. S. Teoh (6/10) Jun 18 2013 Sadly I don't have access to a Windows dev machine. Anybody else cares
• Sean Kelly (11/26) Jun 19 2013 be
• Manu (19/36) May 30 2013 =E3=81=99=E3=80=82
• Walter Bright (3/4) May 30 2013 I am going to leave it that way based on the comments here, I only wante...
• Manu (6/28) May 30 2013 Indeed, and believe me, the variable names can often make NO sense, or
• Mr. Anonymous (2/31) May 28 2013 http://code.google.com/p/trileri/source/browse/trunk/tr/yazi.d
• Simen Kjaeraas (13/33) May 27 2013 On Tue, 28 May 2013 00:18:31 +0200, H. S. Teoh ...
• Jonathan M Davis (7/13) May 27 2013 I think that it was more an issue of that the only reason that Unicode w...
• Manu (22/36) May 27 2013 I'm fairly sure that any programmer who takes themself seriously will us...
• Daniel Murphy (3/10) May 25 2013 When these have keys on standard keyboards.
• Joakim (23/44) May 25 2013 That is why I asked this question here. I think D is still one
• Walter Bright (18/21) May 25 2013 I think you stand alone in your desire to return to code pages. I have y...
• Joakim (39/60) May 25 2013 Nobody is talking about going back to code pages. I'm talking
• Dmitry Olshansky (11/35) May 25 2013 Problem is what you outline is isomorphic with code-pages. Hence the
• Jonathan M Davis (17/44) May 25 2013 ith it
• H. S. Teoh (25/58) May 25 2013 [...]
• Walter Bright (7/14) May 25 2013 Many moons ago, when the earth was young and I had a few strands of hair...
• Vladimir Panteleev (30/39) May 25 2013 For the record, I noticed that programmers (myself included) that
• Joakim (22/44) May 25 2013 Combining characters are examples of complexity baked into the
• w0rp (1/1) May 25 2013 This is dumb. You are dumb. Go away.
• Vladimir Panteleev (14/26) May 25 2013 You don't need to do that to slice a string. I think you mean to
• Joakim (36/51) May 25 2013 Slicing a string implies finding the N-th code point, what other
• Vladimir Panteleev (2/8) May 25 2013 No. Are you sure you understand UTF-8 properly?
• Joakim (17/44) May 25 2013 Are you sure _you_ understand it properly? Both encodings have
• Peter Alexander (22/43) May 25 2013 I suggest you read up on UTF-8. You really don't understand it.
• Peter Alexander (2/10) May 25 2013 Oops. Missing a ++c in there, but I'm sure the point was made :-)
• Vladimir Panteleev (10/31) May 25 2013 It looks like you've missed an important property of UTF-8: lower
• Joakim (35/76) May 25 2013 OK, you got me with this particular special case: it is not
• Peter Alexander (11/17) May 25 2013 It's not just a shortcut, it is absolutely fundamental to the
• H. S. Teoh (14/32) May 25 2013 [...]
• Dmitry Olshansky (9/43) May 25 2013 +1
• Andrei Alexandrescu (5/13) May 25 2013 You mentioned this a couple of times, and I wonder what makes you so
• Walter Bright (6/18) May 25 2013 On the other hand, Joakim even admits his single byte encoding is variab...
• Joakim (20/29) May 25 2013 I have noted from the beginning that these large alphabets have
• Walter Bright (28/55) May 25 2013 If it's one byte sometimes, or two bytes sometimes, it's variable length...
• Joakim (95/174) May 26 2013 It is variable length, with the advantage that only strings
• Declan (2/184) May 26 2013 I服了u，I'm thinking of your name means joking?
• John Colvin (3/185) May 26 2013 I suggest you make an attempt at writing strstr and post it. Code
• Walter Bright (3/6) May 26 2013 C'mon, Joakim, show us this amazing strstr() implementation for your sch...
• Joakim (7/16) May 26 2013 You will see it when it's built into a fully working single-byte
• Diggory (38/43) May 25 2013 All I can say is if you think that is simpler than UTF-8 then you
• Dmitry Olshansky (13/41) May 24 2013 As is there are no UTF-8 specific tables (yet), but there are tools to
• Joakim (40/138) May 25 2013 This problem already exists for UTF-8, breaking ASCII
• Diggory (19/19) May 25 2013 I think you are a little confused about what unicode actually
• Joakim (20/41) May 25 2013 Incorrect.
• Diggory (36/70) May 25 2013 Given that all the machine registers are at least 32-bits already
• Joakim (36/111) May 25 2013 No, that directly _contradicts_ what you said about Unicode
• Diggory (52/164) May 25 2013 UCS does have nothing to do with code pages, it was designed as a
• Walter Bright (5/7) May 25 2013 I suspect the Chinese, Koreans, and Japanese would take exception to bei...
• Joakim (41/77) May 24 2013 Yes, on the encoding, if it's a variable-length encoding like
• Dmitry Olshansky (44/122) May 25 2013 UCS is dead and gone. Next in line to "640K is enough for everyone".
• Joakim (59/173) May 25 2013 I think you are confused. UCS refers to the Universal Character
• Juan Manuel Cabo (13/13) May 25 2013 ░░░░░░░░░ⓌⓉⒻ░
• Diggory (28/36) May 25 2013 "limited success of UTF-8"
• Joakim (114/263) May 26 2013 So you admit that UTF-8 isn't used on the vast majority of
• Dmitry Olshansky (29/159) May 25 2013 You can map a codepage to a subset of UCS :)
• Joakim (36/77) May 25 2013 Something like that. For a multi-language string encoding, the
• Dmitry Olshansky (7/19) May 25 2013 Runs away in horror :) It's mess even before you've got to details.
• Joakim (27/61) May 26 2013 Perhaps it's fatally flawed, but I don't see an argument for why,
• Juan Manuel Cabo (13/27) May 25 2013 You obviously are not thinking it through. Such encoding would
• H. S. Teoh (16/27) May 25 2013 [...]
• Joakim (33/116) May 26 2013 For some reason this posting by H. S. Teoh shows up on the
• H. S. Teoh (43/67) May 26 2013 IHBT. You said that to handle multilanguage strings, your header would
• Joakim (21/44) May 26 2013 Pretty funny how you claim you've been trolled and then go on to
• Vladimir Panteleev (12/15) May 26 2013 1. Make extraordinary claims
• Dmitry Olshansky (5/19) May 26 2013 +1
• Joakim (33/44) May 26 2013 I never refused small examples. I have provided several analyses
• Andrei Alexandrescu (14/18) May 26 2013 Congratulations, you are literally the only person on the Internet who
• Joakim (11/32) May 26 2013 Haha, that is funny, :D though "unicode is shit" returns at least
• Mr. Anonymous (4/16) May 26 2013 On the other hand:
• Joakim (8/23) May 26 2013 I'm not sure if you were trying to make my point, but you just
• Mr. Anonymous (2/26) May 26 2013 Man, you're a bullshit machine!
• Joakim (2/11) May 26 2013 What can I say? I'm very good at interpreting bad data. ;)
• Marco Leise (5/29) May 29 2013 Lol, https://www.google.com/search?q=%22utf-8+is+the+best%22
• Joakim (11/38) May 30 2013 Your point is? 121 results, including false positives like
• Marco Leise (8/19) May 30 2013 Alright, for me it said ~6.570.000 results, which I found
• Marcin Mstowski (12/49) May 26 2013 Character Data Representation
• Joakim (10/27) May 26 2013 You might be right, but I gave it a quick look and can't make out
• Marcin Mstowski (9/34) May 26 2013 Yes, because they didn't reinvent wheel from scratch and are reusing
• Joakim (13/30) May 26 2013 Which picture description did you have in mind? They all seem
• John Colvin (5/12) May 27 2013 It's your idea and project. Showing that it is original / doing
• Joakim (18/30) May 27 2013 Sure, some research is necessary. However, software is littered
• Walter Bright (4/9) May 25 2013 Please implement the simple C function strstr() with this simple scheme,...
• Walter Bright (19/28) May 25 2013 I'll go first. Here's a simple UTF-8 version in C. It's not the fastest ...
• H. S. Teoh (32/42) May 25 2013 I'll have you know that Chinese, Korean, and Japanese account for a
• H. S. Teoh (23/54) May 27 2013 I think there's a difference between allowing math symbols (which
• Torje Digernes (18/90) May 29 2013 I think there is very little difference, both cases are
"Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
toUpper/lower cannot be made in place if it should handle all
Unicode. Some characters will change their length when converted
to/from uppercase. Examples of these are the German double S
and some Turkish I.

This triggered a long-standing bugbear of mine: why are we using
these variable-length encodings at all?  Does anybody really care
about UTF-8 being "self-synchronizing," ie does anybody actually
use that in this day and age?  Sure, it's backwards-compatible
with ASCII and the vast majority of usage is probably just ASCII,
but that means the other languages don't matter anyway.  Not to
mention taking the valuable 8-bit real estate for English and
dumping the longer encodings on everyone else.

I'd just use a single-byte header to signify the language and
then put the vast majority of languages in a single byte
encoding, with the few exceptional languages with more than 256
characters encoded in two bytes.  OK, that doesn't cover
multi-language strings, but that is what, .000001% of usage?
Make your header a little longer and you could handle those also.
Yes, it wouldn't be strictly backwards-compatible with ASCII,
but it would be so much easier to internationalize.  Of course,
there's also the monoculture we're creating; love this UTF-8 rant
by tuomov, author of one of the first tiling window managers for
linux:

http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06

The emperor has no clothes, what am I missing?

May 24 2013
"Peter Alexander" <peter.alexander.au gmail.com> writes:
On Friday, 24 May 2013 at 17:05:57 UTC, Joakim wrote:
This triggered a long-standing bugbear of mine: why are we
using these variable-length encodings at all?

Simple: backwards compatibility with all ASCII APIs (e.g. most C
libraries), and because I don't want my strings to consume
multiple bytes per character when I don't need it.

Your language header idea is no good for at least three reasons:

1. What happens if I want to take a substring slice of your
string? I'll need to allocate a new string to add the header in.

2. What if I have a long string with the ASCII header and want to
append a non-ASCII character on the end? I'll need to reallocate
the whole string and widen it with the new header.

3. Even if I have a string that is 99% ASCII then I have to pay
extra bytes for every character just because 1% wasn't ASCII.
With UTF-8, I only pay the extra bytes when needed.

May 24 2013
"Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 17:43:03 UTC, Peter Alexander wrote:
Simple: backwards compatibility with all ASCII APIs (e.g. most
C libraries), and because I don't want my strings to consume
multiple bytes per character when I don't need it.

And yet here we are today, where an early decision made solely to
accommodate the authors of then-dominant all-ASCII APIs has now
foisted an unnecessarily complex encoding on all of us, with
reduced performance as the result.  You do realize that my
encoding would encode almost all languages' characters in single
bytes, unlike UTF-8, right?  Your latter argument is one against
UTF-8.

Your language header idea is no good for at least three reasons:

1. What happens if I want to take a substring slice of your
string? I'll need to allocate a new string to add the header in.

Good point.  The solution that comes to mind right now is that
you'd parse my format and store it in memory as a String class,
storing the chars in an internal array with the header stripped
out and the language stored in a property.  That way, even a
slice could be made to refer to the same language, by referring
to the language of the containing array.

Strictly speaking, this solution could also be implemented with
UTF-8, simply by changing the format for the data structure you
use in memory to the one I've outlined, as opposed to using the
UTF-8 encoding for both transmission and processing.  But if
you're going to use my format for processing, you might as well
use it for transmission also, since it is much smaller for
non-ASCII text.

Before you ridicule my solution as somehow unworkable, let me
remind you of the current monstrosity.  Currently, the language
is stored in every single UTF-8 character, by having the length
vary from one to four bytes depending on the language.  This
leads to Phobos converting every UTF-8 string to UTF-32, so that
it can easily run its algorithms on a constant-width 32-bit
character set, and the resulting performance penalties.  Perhaps
the biggest loss is that programmers everywhere are pushed to
ignorance or broken code.

Which seems more unworkable to you?

2. What if I have a long string with the ASCII header and want
to append a non-ASCII character on the end? I'll need to
reallocate the whole string and widen it with the new header.

How often does this happen in practice?  I suspect that this
almost never happens.  But if it does, it would be solved by the
String class I outlined above, as the header isn't stored in the
array anymore.

3. Even if I have a string that is 99% ASCII then I have to pay
extra bytes for every character just because 1% wasn't ASCII.
With UTF-8, I only pay the extra bytes when needed.

I don't understand what you mean here.  If your string has a
thousand non-ASCII characters, the UTF-8 version will have one or
two thousand more characters, ie 1 or 2 KB more.  My format would
use only a single byte for each non-ASCII language character used,
that's it.  It's a clear win for my format.

In any case, I just came up with the simplest format I could off
the top of my head, maybe there are gaping holes in it.  But my
point is that we should be able to come up with such a much
simpler format, which keeps most characters to a single byte, not
that my format is best.  All I want to argue is that UTF-8 is the
worst. ;)

May 24 2013
"Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 20:37:58 UTC, Joakim wrote:
3. Even if I have a string that is 99% ASCII then I have to
pay extra bytes for every character just because 1% wasn't
ASCII. With UTF-8, I only pay the extra bytes when needed.

I don't understand what you mean here.  If your string has a
thousand non-ASCII characters, the UTF-8 version will have one
or two thousand more characters, ie 1 or 2 KB more.  My format
would use only a single byte for each non-ASCII language character
used, that's it.  It's a clear win for my format.

Sorry, I was a bit imprecise.  Here's what I meant to write:

I don't understand what you mean here.  If your string has a
thousand non-ASCII characters, the UTF-8 version will have one
or two thousand more bytes, ie 1 or 2 KB more.  My format would
add only a single byte for each non-ASCII language used, that's it.
It's a clear win for my format.

May 24 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2013 1:37 PM, Joakim wrote:
This leads to Phobos converting every UTF-8 string to UTF-32, so that
it can easily run its algorithms on a constant-width 32-bit character set, and
the resulting performance penalties.

This is more a problem with the algorithms taking the easy way than a problem
with UTF-8. You can do all the string algorithms, including regex, by working
with the UTF-8 directly rather than converting to UTF-32. Then the algorithms
work at full speed.

Yes, it wouldn't be strictly backwards-compatible with ASCII, but it
would be so much easier to internationalize.

That was the go-to solution in the 1980's, they were called "code pages". A
disaster.

with the few exceptional languages with more than 256 characters
encoded in two bytes.

Like those rare languages Japanese, Korean, Chinese, etc. This too was done in
the 80's with "Shift-JIS" for Japanese, and some other wacky scheme for Korean,
and a third nutburger one for Chinese.

I've had the misfortune of supporting all that in the old Zortech C++ compiler.
It's AWFUL. If you think it's simpler, all I can say is you've never tried to
write internationalized code with it.

UTF-8 is heavenly in comparison. Your code is automatically internationalized.
It's awesome.

May 24 2013
"anonymous" <anonymous example.com> writes:
On Friday, 24 May 2013 at 17:05:57 UTC, Joakim wrote:
On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
toUpper/lower cannot be made in place if it should handle all
Unicode. Some characters will change their length when converted
to/from uppercase. Examples of these are the German double S
and some Turkish I.

This triggered a long-standing bugbear of mine: why are we
using these variable-length encodings at all?  Does anybody
really care about UTF-8 being "self-synchronizing," ie does
anybody actually use that in this day and age?  Sure, it's
backwards-compatible with ASCII and the vast majority of usage
is probably just ASCII, but that means the other languages
don't matter anyway.  Not to mention taking the valuable 8-bit
real estate for English and dumping the longer encodings on
everyone else.

The German ß becomes SS when capitalised. It's no encoding issue.

May 24 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 24-May-2013 21:05, Joakim wrote:
On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
toUpper/lower cannot be made in place if it should handle all Unicode.
Some characters will change their length when converted to/from
uppercase. Examples of these are the German double S and some Turkish I.

This triggered a long-standing bugbear of mine: why are we using these
variable-length encodings at all?  Does anybody really care about UTF-8
being "self-synchronizing," ie does anybody actually use that in this
day and age?  Sure, it's backwards-compatible with ASCII and the vast
majority of usage is probably just ASCII, but that means the other
languages don't matter anyway.  Not to mention taking the valuable 8-bit
real estate for English and dumping the longer encodings on everyone else.

I'd just use a single-byte header to signify the language and then put
the vast majority of languages in a single byte encoding, with the few
exceptional languages with more than 256 characters encoded in two
bytes.

You seem to think that not only is UTF-8 a bad encoding, but also that
one universal character set for all languages is a bad idea.

Separate code spaces were the case before Unicode (and utf-8). The
problem is not only that without the header the text is meaningless (no
easy slicing) but that the encoding of the data after the header
depends on a variety of factors - a list of encodings, actually. Now
everybody has to keep a (code) page per language just to know if it's 2
bytes per char or 1 byte per char or whatever. And you still work on
the assumption that there are no combining marks or region-specific stuff :)

In fact it was even "better": nobody ever talked about a header, they
just assumed a codepage from some global setting. Imagine yourself
creating a font rendering system these days - a hell of an exercise in
frustration (okay, how do I render 0x88? hmm, if that is in codepage
XYZ then ...).

OK, that doesn't cover multi-language strings, but that is what,
.000001% of usage?

This just shows you don't care about multilingual stuff at all.
Imagine any language tutor/translator/dictionary on the Web. For
instance, most languages need to intersperse ASCII (also keep in mind
e.g. HTML markup). Books often feature citations in a native language
(or e.g. Latin) along with translations.

Now also take into account math symbols, currency symbols and beyond.
Also, these days cultures are mixing in wild combinations, so you might
need to see the text even if you can't read it. Unicode is not only
about encoding characters from all languages; it needs to address the
universal representation of the symbols used in writing systems at large.

those also.  Yes, it wouldn't be strictly backwards-compatible with
ASCII, but it would be so much easier to internationalize.  Of course,
there's also the monoculture we're creating; love this UTF-8 rant by
tuomov, author of one of the first tiling window managers for linux:

We want monoculture! That is, to understand each other without all these
"par-le-vu-france?" and codepages of various complexity (insanity).

Want small? Use compression schemes, which work perfectly fine and get
you to the precious 1 byte per codepoint with exceptional speed:
http://www.unicode.org/reports/tr6/

http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06

The emperor has no clothes, what am I missing?

And borrowing the arguments from that rant: locale is borked shit when
it comes to encodings. Locales should be used for tweaking visuals like
number and date display and so on.

--
Dmitry Olshansky

May 24 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, May 25, 2013 at 01:21:25AM +0400, Dmitry Olshansky wrote:
On 24-May-2013 21:05, Joakim wrote:

[...]
This triggered a long-standing bugbear of mine: why are we using
these variable-length encodings at all?  Does anybody really care
about UTF-8 being "self-synchronizing," ie does anybody actually use
that in this day and age?  Sure, it's backwards-compatible with ASCII
and the vast majority of usage is probably just ASCII, but that means
the other languages don't matter anyway.  Not to mention taking the
valuable 8-bit real estate for English and dumping the longer
encodings on everyone else.

I'd just use a single-byte header to signify the language and then
put the vast majority of languages in a single byte encoding, with
the few exceptional languages with more than 256 characters encoded
in two bytes.

You seem to think that not only is UTF-8 a bad encoding, but also that
one universal character set for all languages is a bad idea.

Separate code spaces were the case before Unicode (and utf-8). The
problem is not only that without the header the text is meaningless (no
easy slicing) but that the encoding of the data after the header
depends on a variety of factors - a list of encodings, actually. Now
everybody has to keep a (code) page per language just to know if it's
2 bytes per char or 1 byte per char or whatever. And you still work on
the assumption that there are no combining marks or region-specific
stuff :)

I remember those bad ole days of gratuitously-incompatible encodings. I
hope those days never ever return. You'd get a text file in some
unknown encoding, and the only way to make any sense of it was to
guess what encoding it might be and hope you got lucky. Not only that,
the same language often has multiple encodings, so adding support for a
able to tell them apart (often with no info on which they are, if you're
lucky, or if you're unlucky, with *wrong* encoding type specs -- for
example, I *still* get email from outdated systems that claim to be
iso-8859 when it's actually KOI8R).

Prepending the encoding to the data doesn't help, because it's pretty
much guaranteed somebody will cut-n-paste some segment of that data and
save it without the encoding type header (or worse, some program will
try to "fix" broken low-level code by prepending a default encoding type
to everything, regardless of whether it's actually in that encoding or
not), thus ensuring nobody will be able to reliably recognize what
encoding it is down the road.

In fact it was even "better": nobody ever talked about a header, they
just assumed a codepage from some global setting. Imagine yourself
creating a font rendering system these days - a hell of an exercise in
frustration (okay, how do I render 0x88? hmm, if that is in codepage
XYZ then ...).

Not to mention, if the sysadmin changes the default locale settings, you
may suddenly discover that a bunch of your text files have become
gibberish, because some programs blindly assume that every text file is
in the current locale-specified language.

I tried writing language-agnostic text-processing programs in C/C++ once.
The Posix spec *seems* to promise language-independence with its locale
functions, but actually, the whole thing is one big inconsistent and
under-specified mess that has many unspecified, implementation-specific
behaviours that you can't rely on.  The APIs basically assume that you
set your locale's language once, and never change it, and every single
file you'll ever want to read must be encoded in that particular
encoding. If you try to read another encoding, too bad, you're screwed.
There isn't even a standard for locale names that you could use to
manually switch locales inside your program (yes, there are de facto
conventions, but there *are* systems out there that don't follow them).

And many standard library functions are affected by locale settings
(once you call setlocale, *anything* could change, like string
comparison, output encoding, etc.), making it a hairy mess to get
input/output of multiple encodings to work correctly. Basically, you
have to write everything manually, because the standard library can't
handle more than a single encoding correctly (well, not without extreme
amounts of pain, that is). So you're back to manipulating bytes
directly. Which means you have to keep large tables of every single
encoding you ever wish to support. And encoding-specific code to deal
with exceptions for those evil variant encodings that are supposedly
the same as the official standard of that encoding, but actually have
one or two subtle differences that cause your program to output
embarrassing garbage characters every now and then.

For all of its warts, Unicode fixed a WHOLE bunch of these problems, and
many times over.  And now we're trying to go back to that nightmarish
old world again? No way, José!

[...]
Make your header a little longer and you could handle those also.
Yes, it wouldn't be strictly backwards-compatible with ASCII, but it
would be so much easier to internationalize.  Of course, there's also
the monoculture we're creating; love this UTF-8 rant by tuomov,
author of one of the first tiling window managers for linux:

We want monoculture! That is, to understand each other without all these
"par-le-vu-france?" and codepages of various complexity (insanity).

Yeah, those codepages were an utter nightmare to deal with. Everybody
and his neighbour's dog invented their own codepage, sometimes multiple
codepages for a single language, all of which are gratuitously
incompatible with each other. Every codepage has its own peculiarities
and exceptions, and programs have to know how to deal with all of them.
Only to get broken again as soon as somebody invents yet another
codepage two years later, or creates yet another codepage variant just
for the heck of it.

If you're really concerned about encoding size, just use a compression
library -- they're readily available these days. Internally, the program
can just use UTF-16 for the most part -- UTF-32 is really only necessary
if you're routinely delving outside BMP, which is very rare.

As far as Phobos is concerned, Dmitry's new std.uni module has powerful
code-generation templates that let you write code that operates directly
on UTF-8 without needing to convert to UTF-32 first. Well, OK, maybe
we're not quite there yet, but the foundations are in place, and I'm
looking forward to the day when string functions will no longer have
implicit conversion to UTF-32, but will directly manipulate UTF-8 using
optimized state tables generated by std.uni.

Want small? Use compression schemes, which are perfectly fine, and
get to the precious 1 byte per codepoint with exceptional speed.
http://www.unicode.org/reports/tr6/

+1.  Using your own encoding is perfectly fine. Just don't do that for
data interchange. Unicode was created because we *want* a single
standard to communicate with each other without stupid broken encoding
issues that used to be rampant on the web before Unicode came along.

In the bad ole days, HTML could be served in any random number of
encodings, often out-of-sync with what the server claims the encoding
is, and browsers would assume arbitrary default encodings that for the
most part *appeared* to work but are actually fundamentally b0rken.
Sometimes webpages would show up mostly-intact, but with a few
characters mangled, because of deviations / variations on codepage
interpretation, or non-standard characters being used in a particular
encoding. It was a total, utter mess, that wasted who knows how many
man-hours of programming time to work around. For data interchange on
the internet, we NEED a universal standard that everyone can agree on.

http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06

The emperor has no clothes, what am I missing?

And borrowing the arguments from that rant: locale is borked
shit when it comes to encodings. Locales should be used for tweaking
visuals like number and date display, and so on.

[...]

I found that rant rather incoherent. I didn't find any convincing
arguments, just a lot of grievances about gratuitous complexity and
why monoculture is "bad", without much supporting evidence.

UTF-8, for all its flaws, is remarkably resilient to mangling -- you can
cut-n-paste any byte sequence and the receiving end can still make some
sense of it.  Not like the bad old days of codepages where you just get
one gigantic block of gibberish. A properly-synchronizing UTF-8 function
can still recover legible data, maybe with only a few characters at the
ends truncated in the worst case. I don't see how any codepage-based
encoding is an improvement over this.

T

--
There are 10 kinds of people in the world: those who can count in
binary, and those who can't.

May 24 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2013 3:42 PM, H. S. Teoh wrote:
I tried writing language-agnostic text-processing programs in C/C++

One of the first, and best, decisions I made for D was it would be
Unicode front to back.

At the time, Unicode was poorly supported by operating systems and lots of
software, and I encountered some initial resistance to it. But I believed
Unicode was the inevitable future.

Code pages, Shift-JIS, EBCDIC, etc., should all be terminated with prejudice.

May 24 2013
Manu <turkeyman gmail.com> writes:
On 25 May 2013 11:58, Walter Bright <newshound2 digitalmars.com> wrote:

On 5/24/2013 3:42 PM, H. S. Teoh wrote:

I tried writing language-agnostic text-processing programs in C/C++

One of the first, and best, decisions I made for D was it would be
Unicode front to back.

Indeed, excellent decision!
So when we define operators for u × v and a · b, or maybe n²? ;)

At the time, Unicode was poorly supported by operating systems and lots of
software, and I encountered some initial resistance to it. But I believed
Unicode was the inevitable future.

Code pages, Shift-JIS, EBCDIC, etc., should all be terminated with
prejudice.


May 24 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2013 7:16 PM, Manu wrote:
So when we define operators for u × v and a · b, or maybe n²? ;)

Oh, how I want to do that. But I still think the world hasn't completely caught
up with Unicode yet.

May 24 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, May 24, 2013 at 08:45:56PM -0700, Walter Bright wrote:
On 5/24/2013 7:16 PM, Manu wrote:
So when we define operators for u × v and a · b, or maybe n²? ;)

Oh, how I want to do that. But I still think the world hasn't
completely caught up with Unicode yet.

That would be most awesome!

Though it does raise the issue of how parsing would work, 'cos you
either have to assign a fixed precedence to each of these operators (and
there are a LOT of them in Unicode!), or allow user-defined operators
with custom precedence and associativity, which means nightmare for the
parser (it has to adapt itself to new operators as the code is
parsed/analysed, which then leads to issues with what happens if two
different modules define the same operator with conflicting precedence /
associativity).

T

--
Spaghetti code may be tangly, but lasagna code is just cheesy.

May 24 2013
Timon Gehr <timon.gehr gmx.ch> writes:
On 05/25/2013 05:56 AM, H. S. Teoh wrote:
On Fri, May 24, 2013 at 08:45:56PM -0700, Walter Bright wrote:
On 5/24/2013 7:16 PM, Manu wrote:
So when we define operators for u × v and a · b, or maybe n²? ;)

Oh, how I want to do that. But I still think the world hasn't
completely caught up with Unicode yet.

That would be most awesome!

Though it does raise the issue of how parsing would work, 'cos you
either have to assign a fixed precedence to each of these operators (and
there are a LOT of them in Unicode!),

I think this is what eg. Fortress is doing.

or allow user-defined operators
with custom precedence and associativity,

This is what eg. Haskell, Coq are doing.
(Though Coq has the advantage of not allowing forward references, and
hence inline parser customization is straightforward in Coq.)

which means nightmare for the
parser (it has to adapt itself to new operators as the code is
parsed/analysed,

It would be easier on the parsing side, since the parser would not fully
parse expressions. Semantic analysis would resolve precedences. This is
quite simple, and the current way the parser resolves operator
precedences is less efficient anyways.

which then leads to issues with what happens if two
different modules define the same operator with conflicting precedence /
associativity).

This would probably be an error without explicit disambiguation, or
follow the usual disambiguation rules. (trying all possibilities appears
to be exponential in the number of conflicting operators in an
expression in the worst case though.)

May 25 2013
"Hans W. Uhlig" <huhlig gmail.com> writes:
On Saturday, 25 May 2013 at 03:46:23 UTC, Walter Bright wrote:
On 5/24/2013 7:16 PM, Manu wrote:
So when we define operators for u × v and a · b, or maybe n²?
;)

Oh, how I want to do that. But I still think the world hasn't
completely caught up with Unicode yet.

Using those characters would be wonderful, but while we do have
unicode software support, we don't really have unicode hardware
support. I am still on my 102-key keyboard, and I haven't really
seen a good expanded-character keyboard come along.

May 26 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/26/2013 1:44 PM, Hans W. Uhlig wrote:
Using those characters would be wonderful and while we do have unicode software
support we don't really have unicode hardware support. I am still on my 102 key
keyboard and I haven't really seen a good expanded character keyboard come
along.

I have a post-it stuck to my monitor with the numbers for various unicode
characters, but I just can't see that for writing code.

May 26 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, May 26, 2013 at 02:14:17PM -0700, Walter Bright wrote:
On 5/26/2013 1:44 PM, Hans W. Uhlig wrote:
Using those characters would be wonderful and while we do have
unicode software support we don't really have unicode hardware
support. I am still on my 102 key keyboard and I haven't really seen
a good expanded character keyboard come along.

I have a post-it stuck to my monitor with the numbers for various
unicode characters, but I just can't see that for writing code.

I have been thinking about this idea of a "reprogrammable keyboard", in
that the keys are either a fixed layout with LCD labels on each key, or
perhaps the whole thing is a long touchscreen, that allows arbitrary
relabelling of keys (or, in the latter case, complete dynamic
reconfiguration of layout). There would be some convenient way to switch
between layouts, say a scrolling sidebar or roller dial of some sort, so
you could, in theory, type Unicode directly.

I haven't been able to refine this into an actual, implementable idea,
though.

T

--
Shin: (n.) A device for finding furniture in the dark.

May 26 2013
"Kiith-Sa" <kiithsacmp gmail.com> writes:
You mean like
http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard ?

May 26 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, May 26, 2013 at 11:25:09PM +0200, Kiith-Sa wrote:
You mean like http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard
?

Whoa! That is exactly what I had in mind!!

Pity they don't appear to support Linux, though. :-(

T

--
MACINTOSH: Most Applications Crash, If Not, The Operating System Hangs

May 26 2013
"Torje Digernes" <torjehoa pvv.org> writes:
On Sunday, 26 May 2013 at 21:46:38 UTC, H. S. Teoh wrote:
On Sun, May 26, 2013 at 11:25:09PM +0200, Kiith-Sa wrote:
You mean like
http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard
?

Whoa! That is exactly what I had in mind!!

Pity they don't appear to support Linux, though. :-(

T

If you want to configure your keyboard so you can type unicode in
Linux, you should make yourself familiar with xkb. It is not that
difficult to work with; not exactly user friendly, but definitely
superuser friendly.

May 26 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, May 27, 2013 at 12:30:02AM +0200, Torje Digernes wrote:
On Sunday, 26 May 2013 at 21:46:38 UTC, H. S. Teoh wrote:
On Sun, May 26, 2013 at 11:25:09PM +0200, Kiith-Sa wrote:
You mean like
http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard
?

Whoa! That is exactly what I had in mind!!

Pity they don't appear to support Linux, though. :-(

T

If you want to configure your keyboard so you can type unicode in
Linux you should make yourself familiar with xkb, it is not that
difficult to work with, but not exactly user friendly either, super
user friendly though.

Oh, I know *that*. I configured my xkb setup to switch between English
and Russian with the unused windows key (I used to have Greek too, but I
use it rarely enough that I took it out). It's just that without the
dynamic key labels, I have to touch-type, which requires learning each
layout as opposed to just looking for the symbol I need on the key
labels. And I have yet to figure out a sane way to support *all* of
Unicode without making the result unusable -- when I had Greek in the
mix, it was already getting cumbersome having to repeatedly hit the
windows key when alternating between two of the 3 languages.
That's simply not scalable to, say, 100 modes.  :-P

But maybe I'm just missing a really obvious solution. That happens a
lot.  :-P

T

--
War doesn't prove who's right, just who's left. -- BSD Games' Fortune

May 26 2013
"Wyatt" <wyatt.epp gmail.com> writes:
On Sunday, 26 May 2013 at 21:23:44 UTC, H. S. Teoh wrote:
I have been thinking about this idea of a "reprogrammable keyboard", in
that the keys are either a fixed layout with LCD labels on each key, or
perhaps the whole thing is a long touchscreen, that allows arbitrary
relabelling of keys (or, in the latter case, complete dynamic
reconfiguration of layout). There would be some convenient way to switch
between layouts, say a scrolling sidebar or roller dial of some sort, so
you could, in theory, type Unicode directly.

I haven't been able to refine this into an actual, implementable idea,
though.

I've given this domain a fair bit of thought, and from my
perspective you want to throw hardware at a software problem.
Have you ever used a Japanese input method?  They're sort of a
good exemplar here, wherein you type a sequence and then hit
space to cycle through possible ways of writing it.  So "ame" can
become あめ, 雨, 飴, etc.  Right now, in addition to my learning,
I also use it for things like α (アルファ) and Δ (デルタ).  It's
limited, but...usable, I guess.  Sort of.

The other end of this is TeX, which was designed around the idea
of composing scientific texts with a high degree of control and
flexibility.  Specialty characters are inserted with
backslash-escapes, like \alpha, \beta, etc.

Now combine the two:  An input method that outputs as usual,
until you enter a character code, which is substituted in real
time with what you actually want.
Example:
"values of \beta will give rise to dom!" composes as
"values of β will give rise to dom!"

No hardware required; just a smarter IME.  Like maybe this one:
http://www.andonyar.com/rec/2008-03/mathinput/ (I'm honestly not
yet sure how mature or usable that one is as I'm a UIM user, but
it does serve as a proof of concept).

May 26 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, May 27, 2013 at 04:17:06AM +0200, Wyatt wrote:
On Sunday, 26 May 2013 at 21:23:44 UTC, H. S. Teoh wrote:
in that the keys are either a fixed layout with LCD labels on each
key, or perhaps the whole thing is a long touchscreen, that allows
arbitrary relabelling of keys (or, in the latter case, complete
dynamic reconfiguration of layout). There would be some convenient
way to switch between layouts, say a scrolling sidebar or roller dial
of some sort, so you could, in theory, type Unicode directly.

I haven't been able to refine this into an actual, implementable
idea, though.

I've given this domain a fair bit of thought, and from my
perspective you want to throw hardware at a software problem.  Have
you ever used a Japanese input method?  They're sort of a good
exemplar here, wherein you type a sequence and then hit space to
cycle through possible ways of writing it.  So "ame" can become,
あめ, 雨, 飴, etc.  Right now, in addition to my learning, I also
use it for things like α (アルファ) and Δ (デルタ).  It's limited,
but...usable, I guess.  Sort of.

The other end of this is TeX, which was designed around the idea of
composing scientific texts with a high degree of control and
flexibility.  Specialty characters are inserted with
backslash-escapes, like \alpha, \beta, etc.

Now combine the two:  An input method that outputs as usual, until
you enter a character code which is substituted in real time to what
you actually want.
Example:
"values of \beta will give rise to dom!" composes as
"values of β will give rise to dom!"

No hardware required; just a smarter IME.  Like maybe this one:
http://www.andonyar.com/rec/2008-03/mathinput/ (I'm honestly not yet
sure how mature or usable that one is as I'm a UIM user, but it does
serve as a proof of concept).

I like this idea. It's certainly more feasible than reinventing the
Optimus Maximus keyboard. :) I can write code for free, but engineering
custom hardware is a bit beyond my abilities (and means!).

If we go the software route, then one possible strategy might be:

- Have a default mode that is whatever your default keyboard layout is
(the usual 100+-key layout, or DVORAK, whatever.).

- Assign one or two escape keys (not to be confused with the Esc key,
which is something else) that allows you to switch mode.

- Under the 1-key scheme, you'd use it to begin sequences like \beta,
except that instead of the backslash \, you're using a dedicated
key. These sequences can include individual characters (e.g.
<ESC>beta == β) or allow you to change the current input mode (e.g.
<ESC>grk to switch to a Greek layout that takes effect from that
point onwards until you enter, say, <ESC>eng). For convenience, the
sequence <ESC><ESC> can be shorthand for switching back to whatever
the default layout is, so that if you mistype an escape sequence
and end up in some strange unexpected layout mode, hitting <ESC>
twice will reset it back to the default.

- Under the 2-key scheme, you'd have one key dedicated for the
occasional foreign character (<ESC1>beta == β), and the second key
dedicated for switching layouts (thus allowing shorter sequences
for switching between languages without fear of conflicting with
single-character sequences, e.g., <ESC2>g for Greek).

Perhaps the 1-key scheme is the simplest to implement. The capslock key
is a good candidate, being conveniently located where your left little
finger is, and having no real useful function in this day and age.

The only drawback is no custom key labels. But perhaps that can be
alleviated by hooking an escape sequence to toggle an on-screen visual
representation of the current layout. Maybe <ESC>? can be assigned to
invoke a helper utility that renders the current layout on the screen.

T

--
Don't get stuck in a closet---wear yourself out.

May 27 2013
On Monday, 27 May 2013 at 02:17:08 UTC, Wyatt wrote:
No hardware required; just a smarter IME.

Perhaps something like the compose key?

http://en.wikipedia.org/wiki/Compose_key

May 27 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, May 27, 2013 at 09:59:52PM +0200, Vladimir Panteleev wrote:
On Monday, 27 May 2013 at 02:17:08 UTC, Wyatt wrote:
No hardware required; just a smarter IME.

Perhaps something like the compose key?

http://en.wikipedia.org/wiki/Compose_key

I'm already using the compose key. But it only goes so far (I don't
think compose key sequences cover all of unicode). Besides, it's
impractical to use compose key sequences to write large amounts of text
in some given language; a method of temporarily switching to a different
layout is necessary.
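For the occasional math operator, though, a few custom ~/.XCompose
entries go a fair way. The sequences below are just one possible
choice, not any distribution's defaults:

```
# ~/.XCompose -- keep the system defaults, then add our own sequences
include "%L"

<Multi_key> <x> <x>           : "×"   # MULTIPLICATION SIGN
<Multi_key> <period> <minus>  : "·"   # MIDDLE DOT
<Multi_key> <asciicircum> <2> : "²"   # SUPERSCRIPT TWO
```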

T

--
Тише едешь, дальше будешь.

May 27 2013
On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to write
large amounts of text in some given language; a method of
temporarily switching to a different layout is necessary.

I thought the topic was typing the occasional Unicode character
to use as an operator in D programs?

May 27 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2013 at 12:04:52AM +0200, Vladimir Panteleev wrote:
On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to write
large amounts of text in some given language; a method of
temporarily switching to a different layout is necessary.

I thought the topic was typing the occasional Unicode character to
use as an operator in D programs?

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

But for operators, you still need enough compose key sequences to cover
all of the Unicode operators -- and there are a LOT of them -- which I
don't think is currently done anywhere. You'd have to make your own
compose key maps to do it.

T

--
Freedom: (n.) Man's self-given right to be enslaved by his own depravity.

May 27 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should not
support it.

May 27 2013
"Hans W. Uhlig" <huhlig gmail.com> writes:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for
example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different
story. :)

I've recently come to the opinion that that's a bad idea, and D
should not support it.

Why do you think it's a bad idea? It makes it such that code can
be in various languages? Just lack of keyboard support?

May 27 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2013 at 01:28:22AM +0200, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for
example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story.
:)

I've recently come to the opinion that that's a bad idea, and D
should not support it.

Currently, the above code snippet compiles (upon inserting "import
std.stdio;", that is). Should that be made illegal?

Why do you think it's a bad idea? It makes it such that code can be
in various languages? Just lack of keyboard support?

I can't speak for Walter, but one issue that comes to mind is when
someone reads the code and doesn't understand the language the
identifiers are in, or worse, can't reliably recognize the distinctions
between the glyphs, and so can't match identifier names correctly -- if
you don't know Japanese, for example, seeing a bunch of Japanese
identifiers of equal length will look more-or-less the same (all
gibberish to you), so it only obscures the code. Or if your computer
doesn't have the requisite fonts to display the alphabet in question,
then you'll just see a bunch of ?'s or black blotches for all program
identifiers, making the code completely unreadable.

Since language keywords are already in English, we might as well
standardize on English identifiers too. (After all, Phobos identifiers
are English as well.) While it's cool to have multilingual identifiers,
I'm not sure if it actually adds any practical value. :) If anything, it
arguably detracts from usability. Multilingual program output, of
course, is a different kettle o' fish.

T

--
Doubt is a self-fulfilling prophecy.

May 27 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 27 May 2013 at 23:46:17 UTC, H. S. Teoh wrote:
On Tue, May 28, 2013 at 01:28:22AM +0200, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D
should not support it.

Currently, the above code snippet compiles (upon inserting "import
std.stdio;", that is). Should that be made illegal?

Why do you think it's a bad idea? It makes it such that code can be
in various languages? Just lack of keyboard support?

I can't speak for Walter, but one issue that comes to mind is when
someone reads the code and doesn't understand the language the
identifiers are in, or worse, can't reliably recognize the
distinctions between the glyphs, and so can't match identifier names
correctly -- if you don't know Japanese, for example, seeing a bunch
of Japanese identifiers of equal length will look more-or-less the
same (all gibberish to you), so it only obscures the code. Or if your
computer doesn't have the requisite fonts to display the alphabet in
question, then you'll just see a bunch of ?'s or black blotches for
all program identifiers, making the code completely unreadable.

Since language keywords are already in English, we might as well
standardize on English identifiers too. (After all, Phobos
identifiers are English as well.) While it's cool to have
multilingual identifiers, I'm not sure if it actually adds any
practical value. :) If anything, it arguably detracts from usability.
Multilingual program output, of course, is a different kettle o'
fish.

T

I can tell you for a fact there are tons of *private* companies that
create closed-source programs whose source code is *not* English. And
from *their* business perspective, it makes sense. They don't care if
you can't understand their source code, since *you* will never see
their source code. I'm quite confident there are tons of programs
that you use that *aren't* written in English.

My wife writes the embedded software for the hardware her company
sells. I can tell you the source code sure as hell isn't in English.
Why would it be? The entire company speaks the local language
natively. I've worked in Japan, and I can tell you the norm over
there is *not* to code in English.

And why should it be? Why would you code in a language that is not
your own if you don't plan to ever share your code outside your team?
Why would you care about users that don't have unicode support, if
the workstations of all your employees are unicode compatible?

Allowing unicode identifiers makes their work a better experience.
Why should we take that away from them? Whether or not you should be
able to use them belongs in a coding standard, not in a compiler
limitation.
May 28 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D should not
support it.

Why do you think it's a bad idea? It makes it such that code can be in
various languages? Just lack of keyboard support?

Every time I've been to a programming shop in a foreign country, the
developers speak English at work and code in English. Of course, that
doesn't mean that everyone does, but as far as I can tell the
overwhelming bulk is done in English.

Naturally, full Unicode needs to be in strings and comments, but symbol
names? I don't see the point nor the utility of it. Supporting such is
just pointless complexity in the language.

May 27 2013
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D
should not support it.

Why do you think it's a bad idea? It makes it such that code can
be in various languages? Just lack of keyboard support?

Every time I've been to a programming shop in a foreign country,
the developers speak English at work and code in English. Of
course, that doesn't mean that everyone does, but as far as I
can tell the overwhelming bulk is done in English.

Naturally, full Unicode needs to be in strings and comments, but
symbol names? I don't see the point nor the utility of it.
Supporting such is just pointless complexity in the language.

The most convincing case for usefulness I've seen was in java
where a class implemented a particular algorithm and so was named
after it. This name had a particular accented character and so
required unicode. Lots of algorithms are named after their
inventors and lots of these names contain unicode characters so
it's not that uncommon.

May 27 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2013 at 02:23:32AM +0200, Diggory wrote:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D
should not support it.

Why do you think it's a bad idea? It makes it such that code can
be in various languages? Just lack of keyboard support?

Every time I've been to a programming shop in a foreign country,
the developers speak english at work and code in english. Of
course, that doesn't mean that everyone does, but as far as I can
tell the overwhelming bulk is done in english.

Naturally, full Unicode needs to be in strings and comments, but
symbol names? I don't see the point nor the utilty of it.
Supporting such is just pointless complexity to the language.

The most convincing case for usefulness I've seen was in java where
a class implemented a particular algorithm and so was named after
it. This name had a particular accented character and so required
unicode. Lots of algorithms are named after their inventors and lots
of these names contain unicode characters so it's not that uncommon.

I don't find this a compelling reason to allow full Unicode on
identifiers, though. For one thing, somebody maintaining your code may
not know how to type said identifier correctly. It can be very
frustrating to have to keep copy-n-pasting identifiers just because they
contain foreign letters you can't type. Not to mention sheer
unreadability if the inventor's name is in Chinese, so the algorithm
name is also in Chinese, and the person maintaining the code can't read
Chinese. This will kill D code maintainability.

T

--
Don't drink and derive. Alcohol and algebra don't mix.

May 27 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 6:06 PM, H. S. Teoh wrote:
I don't find this a compelling reason to allow full Unicode on
identifiers, though. For one thing, somebody maintaining your code may
not know how to type said identifier correctly. It can be very
frustrating to have to keep copy-n-pasting identifiers just because they
contain foreign letters you can't type. Not to mention sheer
unreadability if the inventor's name is in Chinese, so the algorithm
name is also in Chinese, and the person maintaining the code can't read
Chinese. This will kill D code maintainability.

+1

May 27 2013
Michel Fortin <michel.fortin michelf.ca> writes:
On 2013-05-28 01:34:17 +0000, Walter Bright <newshound2 digitalmars.com> said:

On 5/27/2013 6:06 PM, H. S. Teoh wrote:
I don't find this a compelling reason to allow full Unicode on
identifiers, though. For one thing, somebody maintaining your code may
not know how to type said identifier correctly. It can be very
frustrating to have to keep copy-n-pasting identifiers just because they
contain foreign letters you can't type. Not to mention sheer
unreadability if the inventor's name is in Chinese, so the algorithm
name is also in Chinese, and the person maintaining the code can't read
Chinese. This will kill D code maintainability.

+1

-1

What's even worse for code maintainability is code that does not do
what it says.

Disallowing non-ASCII charsets does not prevent people from writing
foreign-language code. I've seen plenty of code in French in my life in
languages with no Unicode support. I've also seen plenty of bad English
in code. I'd rather see a correct French word as a variable or function
name than an incorrect English one. Correctly naming things is
difficult, and correctly naming them in a foreign language is even
harder. This surely applies to languages using non-ASCII alphabets too.

Of course, if you're not using English words you'll be limiting your
audience to programmers who understand that language. But you might
widen it in other directions. I worked once with a grad student who was
building a model to simulate breakages of water pipe systems. She was
good enough to write code that worked, although she needed my help for
a couple of things, notably increasing performance. The code was all in
French, and thankfully so as attempting to translate all those terms
(some dealing with concepts unknown to me) to English when writing the
code and back to French when explaining the concepts would have been
quite annoying, inefficient, and error-prone in our work.

While French likely will always be a possibility (as it fits well in
ASCII), I can see how writing code in Japanese or Russian might benefit
native speakers of those languages too, especially those for whom
programming is only an incidental part of their job. Programming is a
form of expression, and it's always easier to express ourself in our
own native language.

--
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca/

May 28 2013
"Olivier Pisano" <olivier.pisano laposte.net> writes:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
Every time I've been to a programming shop in a foreign
country, the developers speak english at work and code in
english. Of course, that doesn't mean that everyone does, but
as far as I can tell the overwhelming bulk is done in english.

Would you have been to such an event if you could not have
understood what people were doing or saying? Of course, when we
are working on something with international scope, we tend to do
it in english, but it doesn't mean every programming task is
performed in english…

Being a non-native English speaker, I tend to see Unicode
identifiers as an improvement over other programming languages,
depending on the context of the programming task and its intended
audience. BTW, I use a Unicode-aware alternative keyboard layout,
so I can type Greek letters or math symbols directly. ASCII-only
identifiers sound like an arbitrary limitation to me.

May 28 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
Every time I've been to a programming shop in a foreign
country, the developers speak english at work and code in
english. Of course, that doesn't mean that everyone does, but
as far as I can tell the overwhelming bulk is done in english.

That's because you have an academic view of code, and a library
approach to development.

When you are a private company selling closed source code, I
really don't see why you'd code in English.

IMO, whether it is a bad idea is not for us to judge (and less so
to stop), but for each company/organization to choose their own
coding standard.

May 28 2013
"qznc" <qznc web.de> writes:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and
D should not
support it.

Why do you think it's a bad idea? It makes it such that code
can be in various languages? Just lack of keyboard support?

Every time I've been to a programming shop in a foreign
country, the developers speak english at work and code in
english. Of course, that doesn't mean that everyone does, but
as far as I can tell the overwhelming bulk is done in english.

Naturally, full Unicode needs to be in strings and comments,
but symbol names? I don't see the point nor the utility of it.
Supporting such is just pointless complexity to the language.

Once I heard an argument from developers working for banks. They
coded financial concepts with German names (e.g. Vermögen,
Bürgschaft), which sometimes include äöüß. Some of those concepts
had no good translation into English, because they are not used
outside of Germany and the clients prefer the actual names anyway.

May 29 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/29/2013 3:26 AM, qznc wrote:
Once I heard an argument from developers working for banks. They
coded financial concepts with German names (e.g. Vermögen,
Bürgschaft), which sometimes include äöüß. Some of those concepts
had no good translation into English, because they are not used
outside of Germany and the clients prefer the actual names anyway.

German is pretty easy to do in ASCII: Vermoegen and Buergschaft
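The ASCII fallback Walter refers to is the standard German transliteration: each umlaut becomes the base vowel plus "e", and ß becomes "ss". A minimal sketch of that mapping (Python used purely for illustration; `to_ascii` is a hypothetical helper, not anything from the thread):

```python
# Standard German ASCII fallback: umlauts become vowel + 'e', eszett becomes 'ss'.
GERMAN_ASCII = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def to_ascii(word):
    """Replace German special characters with their ASCII digraphs."""
    return "".join(GERMAN_ASCII.get(ch, ch) for ch in word)

print(to_ascii("Vermögen"))    # Vermoegen
print(to_ascii("Bürgschaft"))  # Buergschaft
```

Note this convention is specific to German; most scripts (Cyrillic, CJK) have no comparably lossless ASCII form, which is the counterpoint raised below in the thread.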

May 29 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 29 May 2013 at 22:42:08 UTC, Walter Bright wrote:
On 5/29/2013 3:26 AM, qznc wrote:
Once I heard an argument from developers working for banks. They
coded financial concepts with German names (e.g. Vermögen,
Bürgschaft), which sometimes include äöüß. Some of those concepts
had no good translation into English, because they are not used
outside of Germany and the clients prefer the actual names anyway.

German is pretty easy to do in ASCII: Vermoegen and Buergschaft

What about Chinese? Russian? Japanese? It is doable, but I can
tell you for a fact that they very much don't like reading it
that way.

You know, having done programming in Japan, I know that a lot of
devs simply don't care for English, and they'd really enjoy just
being able to code in Japanese. I can't speak for the other
countries, but I'm sure that large but not spread out countries
like China would also just *love* to be able to code in 100%
Mandarin (I'd say they wouldn't care much for English either).

I think this possibility is actually a brilliant feature that
could help popularize the language overseas, especially in
teaching courses, or the private sector. Why turn down a
feature that makes us popular?

As for research/university, I think they are already global
enough to stick to English anyway.

No matter how I see it, I can only see benefits to keeping it,
and downsides to turning it down.

May 30 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Thu, 30 May 2013 11:36:42 +0200, monarch_dodra
<monarchdodra gmail.com> wrote:

On Wednesday, 29 May 2013 at 22:42:08 UTC, Walter Bright wrote:
On 5/29/2013 3:26 AM, qznc wrote:
Once I heard an argument from developers working for banks. They
coded financial concepts with German names (e.g. Vermögen,
Bürgschaft), which sometimes include äöüß. Some of those concepts
had no good translation into English, because they are not used
outside of Germany and the clients prefer the actual names anyway.

German is pretty easy to do in ASCII: Vermoegen and Buergschaft

What about Chinese? Russian? Japanese? It is doable, but I can tell
you for a fact that they very much don't like reading it that way.

You know, having done programming in Japan, I know that a lot of
devs simply don't care for English, and they'd really enjoy just
being able to code in Japanese. I can't speak for the other
countries, but I'm sure that large but not spread out countries
like China would also just *love* to be able to code in 100%
Mandarin (I'd say they wouldn't care much for English either).

I think this possibility is actually a brilliant feature that could
help popularize the language overseas, especially in teaching
courses, or the private sector. Why turn down a feature that makes
us popular?

As for research/university, I think they are already global enough
to stick to English anyway.

No matter how I see it, I can only see benefits to keeping it, and
downsides to turning it down.

Now if only we had the C preprocessor:

#define 如果 if
#define 直到 while

(Note: this is what Google Translate told me was good. I do not speak
Chinese.)

--
Simen

May 30 2013
"Dicebot" <m.strashun gmail.com> writes:
On Thursday, 30 May 2013 at 09:36:43 UTC, monarch_dodra wrote:
What about Chinese? Russian? Japanese? It is doable, but I can
tell you for a fact that they very much don't like reading it
that way.

You know, having done programming in Japan, I know that a lot
of devs simply don't care for English, and they'd really enjoy
just being able to code in Japanese. I can't speak for the
other countries, but I'm sure that large but not spread out
countries like China would also just *love* to be able to code
in 100% Mandarin (I'd say they wouldn't care much for English
either).

What about the poor guys from another country who will have to
support that project afterwards? English is the de facto standard
language of programming for a good reason.

May 30 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 10:13:46 UTC, Dicebot wrote:
On Thursday, 30 May 2013 at 09:36:43 UTC, monarch_dodra wrote:
What about Chinese? Russian? Japanese? It is doable, but I can
tell you for a fact that they very much don't like reading it
that way.

You know, having done programming in Japan, I know that a lot
of devs simply don't care for English, and they'd really enjoy
just being able to code in Japanese. I can't speak for the
other countries, but I'm sure that large but not spread out
countries like China would also just *love* to be able to code
in 100% Mandarin (I'd say they wouldn't care much for English
either).

What about the poor guys from another country who will have to
support that project afterwards? English is the de facto standard
language of programming for a good reason.

Well... de facto: "in practice but not necessarily ordained by
law".

Besides, even in english, there are use cases for unicode. Such
as math (Greek symbols).

And even if you are coding in English, that doesn't mean you can't
be working on a region-specific project that requires the
identifiers to have region-specific names (as in the German banking
example).

Finally, english does have a few (albeit rare) words that can't
be expressed with ASCII. For example: Möbius. Sure, you can write
it "Mobius", but why settle for wrong, when you can have right?

--------

I'm saying that even if I agree that code should be in English
(which I don't completely agree with), it's still not a strong
argument against unicode in identifiers. In this day and age, it
seems as arbitrary to me as requiring lines to not exceed 80
chars. That kind of shit belongs in a coding standard.

May 30 2013
Manu <turkeyman gmail.com> writes:
On 30 May 2013 20:13, Dicebot <m.strashun gmail.com> wrote:

On Thursday, 30 May 2013 at 09:36:43 UTC, monarch_dodra wrote:

What about Chinese? Russian? Japanese? It is doable, but I can tell you
for a fact that they very much don't like reading it that way.

You know, having done programming in Japan, I know that a lot of devs
simply don't care for English, and they'd really enjoy just being able to
code in Japanese. I can't speak for the other countries, but I'm sure that
large but not spread out countries like China would also just *love* to be
able to code in 100% Mandarin (I'd say they wouldn't care much for English
either).

What about the poor guys from another country who will have to support
that project afterwards? English is the de facto standard language of
programming for a good reason.

Have you ever worked on code written by people who barely speak English?
Even if they write English words, that doesn't make it 'English', or any
easier to understand. And people often tend to just transliterate into
Latin, which is kinda pointless too; how does that help?

May 30 2013
"Dicebot" <m.strashun gmail.com> writes:
On Thursday, 30 May 2013 at 11:29:47 UTC, Manu wrote:
Have you ever worked on code written by people who barely speak
English?
Even if they write English words, that doesn't make it
'English', or any
easier to understand. And people often tend to just
transliterate into
latin, which is kinda pointless too, how does that help?

I have had comments with Finnish poetry in code I was responsible
for supporting :( No need to provide the means to make people think
such an approach is the way to go.

May 30 2013
"Kagamin" <spam here.lot> writes:
On Thursday, 30 May 2013 at 11:29:47 UTC, Manu wrote:
Have you ever worked on code written by people who barely speak
English?

I did. It's better than having a mixture of languages like here:
assert(length == dizgi.length); - in one expression!
property Yazı küçüğü() const - property? const? küçüğü?

BTW I don't speak English myself, and D code doesn't comprise
English either. How well do you have to know English to use one
word to name a variable "player"? And I believe everyone who
learned math knows the Latin alphabet.

Unicode identifiers allow for typos, which can't be detected
visually. For example, the Greek and Cyrillic alphabets have letters
indistinguishable from ASCII ones, so they can sneak into ASCII text
and you won't see it. You can also have more fun with heuristic
language switchers.

Try to find a problem in this code:
------
class c
{
    void Сlose(){}
}

int main()
{
    c obj = new c;
    obj.Close();
    return 0;
}
------

That's an actual issue I had with C# in industrial code. And I
believe no one has checked Phobos for such errors.
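Homoglyph bugs like the Cyrillic "С" above can be caught mechanically by flagging identifiers that mix letters from more than one script. A rough sketch of such a check (Python used for illustration; `mixed_script_identifiers` is a hypothetical helper, and reading the script off the first word of the Unicode character name is a crude heuristic — a real tool would use the Unicode Script property):

```python
import re
import unicodedata

def mixed_script_identifiers(source):
    """Return identifiers that combine letters from more than one script,
    e.g. a Cyrillic 'С' hiding among Latin letters."""
    suspicious = []
    for ident in re.findall(r"[^\W\d]\w*", source):
        scripts = set()
        for ch in ident:
            if ch.isalpha():
                # Character names begin with the script, e.g.
                # 'LATIN CAPITAL LETTER C' vs 'CYRILLIC CAPITAL LETTER ES'.
                name = unicodedata.name(ch, "")
                if name:
                    scripts.add(name.split()[0])
        if len(scripts) > 1:
            suspicious.append(ident)
    return suspicious

print(mixed_script_identifiers("void Сlose() {}"))  # ['Сlose']
```

Run over a codebase, this would have flagged the `Сlose`/`Close` pair above before it ever compiled.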

I was taught BASIC at school and had no idea I should complain
about the Latin alphabet even though I didn't learn English back then.

Jun 27 2013
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
Every time I've been to a programming shop in a foreign
country, the developers speak english at work and code in
english. Of course, that doesn't mean that everyone does, but
as far as I can tell the overwhelming bulk is done in english.

The OOo codebase is historically mostly in German. They try to reduce
the amount of German in the codebase with each new version.

Some massive codebases are non english.

Naturally, full Unicode needs to be in strings and comments,
but symbol names? I don't see the point nor the utility of it.
Supporting such is just pointless complexity to the language.

I know this is a crazy idea, but someone told me once that most
people on this planet aren't living in English-speaking
countries. Insane, isn't it?

Jun 27 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 28/05/13 09:44, H. S. Teoh wrote:

Since language keywords are already in English, we might as well
standardize on English identifiers too.

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Peter

May 27 2013
"David Eagen" <davideagen mailinator.com> writes:
On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:

So you're going to spell check them all to make sure that
they're English?  Or did you mean ASCII?

Peter

That's it. I'm filing a bug against std.traits. There's a
unittest there with a struct named "Colour". Completely
unacceptable.

May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 13:22, David Eagen <davideagen mailinator.com> wrote:

On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Peter

That's it. I'm filing a bug against std.traits. There's a unittest there
with a struct named "Colour". Completely unacceptable.

How dare you!
What's unacceptable is that a bunch of ex-english speakers had the audacity
to rewrite the dictionary and continue to call it English!
I will never write colour without a u, ever! I may suffer the global
American cultural invasion of my country like the rest of us, but I will
never let them infiltrate my mind! ;)

May 27 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 9:27 PM, Manu wrote:
I will never write colour without a u, ever! I may suffer the global American
cultural invasion of my country like the rest of us, but I will never let them
infiltrate my mind! ;)

Resistance is useless.

May 27 2013
On Tuesday, 28 May 2013 at 04:52:55 UTC, Walter Bright wrote:
On 5/27/2013 9:27 PM, Manu wrote:
I will never write colour without a u, ever! I may suffer the
global American
cultural invasion of my country like the rest of us, but I
will never let them
infiltrate my mind! ;)

Resistance is useless.

*futile :P

May 27 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 28/05/13 13:22, David Eagen wrote:
On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Peter

That's it. I'm filing a bug against std.traits. There's a unittest there
with a struct named "Colour". Completely unacceptable.

Except here in Australia and other places where they use the Queen's
English :-)

Peter

May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 14:38, Peter Williams <pwil3058 bigpond.net.au> wrote:

On 28/05/13 13:22, David Eagen wrote:

On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Peter

That's it. I'm filing a bug against std.traits. There's a unittest there
with a struct named "Colour". Completely unacceptable.

Except here in Australia and other places where they use the Queen's
English :-)

Is there anywhere other than America that doesn't?

May 27 2013
Jacob Carlborg <doob me.com> writes:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

--
/Jacob Carlborg

May 28 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 19:12, Jacob Carlborg <doob me.com> wrote:

On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Yes, the region called America ;)
Although there's a few British colonies in the Caribbean...

May 28 2013
Jacob Carlborg <doob me.com> writes:
On 2013-05-28 14:09, Manu wrote:

Yes, the region called America ;)
Although there's a few British colonies in the Caribbean...

Oh, you meant the whole region and not the country.

--
/Jacob Carlborg

May 28 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Tue, 28 May 2013 14:11:29 +0200, Jacob Carlborg <doob me.com> wrote:

On 2013-05-28 14:09, Manu wrote:

Yes, the region called America ;)
Although there's a few British colonies in the Caribbean...

Oh, you meant the whole region and not the country.

America is not a country. The country is called USA.

--
Simen

May 28 2013
Jacob Carlborg <doob me.com> writes:
On 2013-05-28 14:58, Simen Kjaeraas wrote:

America is not a country. The country is called USA.

I know that, but I get the impression that people usually say "America"
and refer to USA.

--
/Jacob Carlborg

May 28 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent not a
country). :-)

Peter

May 28 2013
On Tuesday, 28 May 2013 at 23:33:47 UTC, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent
not a country). :-)

Peter

America isn't a continent, North America is a continent, and
Canada is in North America :P

May 28 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 29 May 2013 at 01:29:07 UTC, Diggory wrote:
On Tuesday, 28 May 2013 at 23:33:47 UTC, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent
not a country). :-)

Peter

America isn't a continent, North America is a continent, and
Canada is in North America :P

Well, that point of view really depends from which continent
you're from:
http://en.wikipedia.org/wiki/Continents#Number_of_continents

There is no internationally agreed-upon scheme. I, for one, have
always been taught that there is only "America", and that the
terms "North America" and "South America" were only meant to
denote regions within said continent.

May 28 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, May 29, 2013 at 09:33:32AM +1000, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent not a
country). :-)

[...]

If you say that to a Canadian to his face, you might get a hostile (or
faux-hostile) reaction. :)

Up here in the Great White North, we like to think of ourselves as
different from our rowdy neighbours to the south (even though we're not
that different, but we won't ever admit that :-P). And yes, "America"
means USA up here (and "American" especially means USian, as distinct
from Canadian), even though we all know that technically it refers to
the continent, not the country.

T

--
Computers aren't intelligent; they only think they are.

May 28 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 29/05/13 09:57, H. S. Teoh wrote:
On Wed, May 29, 2013 at 09:33:32AM +1000, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent not a
country). :-)

[...]

If you say that to a Canadian to his face, you might get a hostile (or
faux-hostile) reaction. :)

Up here in the Great White North, we like to think of ourselves as
different from our rowdy neighbours to the south (even though we're not
that different, but we won't ever admit that :-P). And yes, "America"
means USA up here (and "American" especially means USian, as distinct
from Canadian), even though we all know that technically it refers to
the continent, not the country.

Last time I was there (about 40 years ago) Canadians didn't seem that
touchy. :-)

Peter

May 28 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, May 29, 2013 at 10:36:08AM +1000, Peter Williams wrote:
On 29/05/13 09:57, H. S. Teoh wrote:
On Wed, May 29, 2013 at 09:33:32AM +1000, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent not a
country). :-)

[...]

If you say that to a Canadian to his face, you might get a hostile
(or faux-hostile) reaction. :)

[...]
Last time I was there (about 40 years ago) Canadians didn't seem
that touchy. :-)

[...]

Well, they are not, hence "faux-hostile". :)

T

--
Political correctness: socially-sanctioned hypocrisy.

May 28 2013
Jacob Carlborg <doob me.com> writes:
On 2013-05-28 03:38, Peter Williams wrote:

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Don't you have a spell checker in your editor? If not, find a new one :)

--
/Jacob Carlborg

May 28 2013
"Luís Marques" <luismarques gmail.com> writes:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D
should not support it.

I think it is a bad idea to program in a language other than
english, but I believe D should still support it.

May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 09:39, "Luís Marques" <luismarques gmail.com> wrote:

On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:

I've recently come to the opinion that that's a bad idea, and D should
not support it.

I think it is a bad idea to program in a language other than english, but
I believe D should still support it.

I can imagine a young student learning to code who may not speak English
(yet).
Or a not-so-unlikely future where we're all speaking Chinese ;)

May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 09:05, Walter Bright <newshound2 digitalmars.com> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should not
support it.

Why? You said previously that you'd love to support extended operators ;)

May 27 2013
"Torje Digernes" <torjehoa pvv.org> writes:
On Tuesday, 28 May 2013 at 00:34:20 UTC, Manu wrote:
On 28 May 2013 09:05, Walter Bright
<newshound2 digitalmars.com> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... for
example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different
story. :)

I've recently come to the opinion that that's a bad idea, and
D should not
support it.

Why? You said previously that you'd love to support extended
operators ;)

I find features such as support for uncommon symbols in variables
a strength, as they make some physics formulas a bit easier to read
in code form, which in my opinion is a good thing.
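The physics-readability point is easy to demonstrate in any language that already accepts Unicode identifiers; Python, for instance, allows them (identifiers are NFKC-normalized per PEP 3131). A small sketch, with the function and symbols chosen here purely as an example:

```python
import math

# Cartesian-to-polar conversion using the symbols physics texts actually use:
# r for the radius and θ (theta) for the angle. θ is a legal identifier.
def polar(x, y):
    r = math.hypot(x, y)   # radius, sqrt(x² + y²)
    θ = math.atan2(y, x)   # angle in radians
    return r, θ

r, θ = polar(3.0, 4.0)
print(r)  # 5.0
```

Whether `θ` beats `theta` is exactly the taste question this thread is arguing about, but the formula does read closer to its textbook form.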

May 27 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 5:34 PM, Manu wrote:
On 28 May 2013 09:05, Walter Bright <newshound2 digitalmars.com
<mailto:newshound2 digitalmars.com>> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should not
support it.

Why? You said previously that you'd love to support extended operators ;)

Extended operators, yes. Non-ascii identifiers, no.

May 27 2013
"Oleg Kuporosov" <Oleg.Kuporosov gmail.com> writes:
On Tuesday, 28 May 2013 at 01:34:47 UTC, Walter Bright wrote:

Why? You said previously that you'd love to support extended
operators ;)

Extended operators, yes. Non-ascii identifiers, no.

BTW, this is one of D's big advantages; take into account that
some day D could be used for teaching in schools outside the
US/GB, where pupils don't yet know English.


May 28 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Tue, 28 May 2013 01:05:46 +0200, Walter Bright
<newshound2 digitalmars.com> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should
not support it.

I've recently come to the opinion that you're wrong - using them is
often wrong, but D should support them. Various good reasons have been
posted in this thread.

--
Simen

May 28 2013
"Jakob Ovrum" <jakobovrum gmail.com> writes:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D
should not support it.

Honestly, removing support for non-ASCII characters from
identifiers is the worst idea you've had in a while. There is an
_unfathomable amount_ of code out there written in non-English
languages but hamfisted into an English-alphabet representation
because the programming language doesn't care to support it. The
resulting friction is considerable.

You seem to attribute particular value to personal anecdotes, so
here's one of mine: I personally know several prestigious
universities in Europe and Asia which teach programming using
Java and/or C with identifiers being in an English-alphabet
representation of the native non-English language. Using the
English language for identifiers is usually a sanctioned
alternative, but not the primary modus operandi. I also know
several professional programmers using their native non-English
language for identifiers in production code.


May 29 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/29/2013 2:42 AM, Jakob Ovrum wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D should not
support it.

Honestly, removing support for non-ASCII characters from identifiers is the
worst idea you've had in a while. There is an _unfathomable amount_ of code out
there written in non-English languages but hamfisted into an English-alphabet
representation because the programming language doesn't care to support it. The
resulting friction is considerable.

You seem to attribute particular value to personal anecdotes, so here's one of
mine: I personally know several prestigious universities in Europe and Asia
which teach programming using Java and/or C with identifiers being in an
English-alphabet representation of the native non-English language. Using the
English language for identifiers is usually a sanctioned alternative, but not
the primary modus operandi. I also know several professional programmers using
their native non-English language for identifiers in production code.

I still think it's a bad idea, but it's obvious people want it in D, so it'll
stay.

(Also note that I meant using ASCII, not necessarily english.)

May 29 2013
Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 29 May 2013 15:44:17 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

I still think it's a bad idea, but it's obvious people want it in D, so it'll
stay.

(Also note that I meant using ASCII, not necessarily english.)

Surprisingly ASCII also covers Cornish and Malay.

--
Marco

May 29 2013
"Oleg Kuporosov" <Oleg.Kuporosov gmail.com> writes:
On Wednesday, 29 May 2013 at 22:44:17 UTC, Walter Bright wrote:
I still think it's a bad idea, but it's obvious people want it
in D, so it'll stay.

(Also note that I meant using ASCII, not necessarily english.)

Good, thanks, restrictions definitely can and should be applied
per project, like for druntime/Phobos.

May 29 2013
"Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 29 May 2013 at 22:44:17 UTC, Walter Bright wrote:
(Also note that I meant using ASCII, not necessarily english.)

I don't understand the logic behind this. Surely this is the
worst combination; severely crippled ability to use non-English
languages (yes, even for European languages), yet non-speakers of
those languages still don't have a clue what it means.

May 30 2013
Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 27 May 2013 16:05:46 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
int число = 1;
foreach (и; 0..100)
число += и;
writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should not
support it.

I hope that was just a random thought. I knew a teacher who
would give all his methods German names so they are easier to
distinguish from the English Java library methods.
Personally I like to type α instead of alpha for angles, since
that is the identifier you'd expect in math. And everyone
likes "alias ℕ = size_t;", right? :) Déjà vu?

--
Marco

May 29 2013
Timon Gehr <timon.gehr gmx.ch> writes:
On 05/29/2013 12:03 PM, Marco Leise wrote:
...  And everyone
likes "alias ℕ = size_t;", right? :)
...

No, that's deeply troubling.

May 30 2013
"Entry" <no no.com> writes:
My personal opinion is that code should only be in English.

May 29 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

May 29 2013
"Entry" <no no.com> writes:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and variable
names in various human languages (Swedish, Russian), but be
accompanied by a similarly 'unified' language that we all know -
English. It is only my opinion though and I wouldn't force it
upon anyone.

May 30 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and variable
names in various human languages (Swedish, Russian), but be
accompanied by a similarly 'unified' language that we all know
- English. It is only my opinion though and I wouldn't force it
upon anyone.

But programming IS a human tool, and thus, subject to human
language.

Also, I don't see how a programming language is any more unified
than, say, a library.

While you wouldn't force it on anyone, would it also be your
opinion that putting a French book in a French library is a
sabotage of the world's library institutions?

May 30 2013
"Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 09:29:43 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams
wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and variable
names in various human languages (Swedish, Russian), but be
accompanied by a similarly 'unified' language that we all know
- English. It is only my opinion though and I wouldn't force
it upon anyone.

But programming IS a human tool, and thus, subject to human
language.

Also, I don't see how a programming language is any more
unified than, say, a library.

While you wouldn't force it on anyone, would it also be your
opinion that putting a French book in a French library is a
sabotage of the world's library institutions?

What a way to attack a straw-man and completely miss the point at
the same time.

May 30 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 13:12:17 UTC, Entry wrote:
On Thursday, 30 May 2013 at 09:29:43 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams
wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and variable
names in various human languages (Swedish, Russian), but be
accompanied by a similarly 'unified' language that we all
know - English. It is only my opinion though and I wouldn't
force it upon anyone.

But programming IS a human tool, and thus, subject to human
language.

Also, I don't see how a programming language is any more
unified than, say, a library.

While you wouldn't force it on anyone, would it also be your
opinion that putting a French book in a French library is a
sabotage of the world's library institutions?

What a way to attack a straw-man and completely miss the point
at the same time.

Fine.

In that case, I'll retort by saying that your use of the 'unified'

My retort was not correctly expressed, but I don't see how D is
"unified". I thought it was just a tool to create programs.

May 30 2013
"Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 13:52:09 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 13:12:17 UTC, Entry wrote:
On Thursday, 30 May 2013 at 09:29:43 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams
wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and
variable names in various human languages (Swedish,
Russian), but be accompanied by a similarly 'unified'
language that we all know - English. It is only my opinion
though and I wouldn't force it upon anyone.

But programming IS a human tool, and thus, subject to human
language.

Also, I don't see how a programming language is any more
unified than, say, a library.

While you wouldn't force it on anyone, would it also be your
opinion that putting a French book in a French library is a
sabotage of the world's library institutions?

What a way to attack a straw-man and completely miss the point
at the same time.

Fine.

In that case, I'll retort by saying that your use of the

My retort was not correctly expressed, but I don't see how D is
"unified". I thought it was just a tool to create programs.

Take a minute to think about why we're all communicating in
English here. Let's see if you can figure it out. I just think
that it's better to focus on two very specific languages with two
very specific purposes (D for programming and English for
communication). 'Twas just an idea, I don't care if you write
your code in hieroglyphs.

May 30 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 14:13:47 UTC, Entry wrote:
Take a minute to think about why we're all communicating in
English here. Let's see if you can figure it out.

Well that's condescending :/ and fallacious.

To answer your question, it may have something to do with the
fact that these are the English forums? Just a wild hunch. Oh.
And because we *can* speak English? That could also have
something to do with it.

There are tons of non-English speaking programming forums out
there. Maybe those that don't speak English are over there? Heck,
there are a few non-English threads in learn.

Oh. And did you know TDPL was published in Japanese? Why bother
right?

I just think that it's better to focus on two very specific
languages with two very specific purposes (D for programming
and English for communication). 'Twas just an idea, I don't
care if you write your code in hieroglyphs.

I really really agree with you.

Yet, I think they are orthogonal concepts, and that the D
programming language has no business choosing which communication
vector its users should use.

It's not just a matter (imo) of "I wouldn't force it upon
anyone", but "I think everyone should choose what's best for
them".

Yeah. I know. Same conclusion, but there is a nuance.

May 30 2013
"Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 14:49:12 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 14:13:47 UTC, Entry wrote:
Take a minute to think about why we're all communicating in
English here. Let's see if you can figure it out.

Well that's condescending :/ and fallacious.

To answer your question, it may have something to do with the
fact that these are the English forums? Just a wild hunch. Oh.
And because we *can* speak English? That could also have
something to do with it.

There are tons of non-English speaking programming forums out
there. Maybe those that don't speak English are over there?
Heck, there are a few non-English threads in learn.

Oh. And did you know TDPL was published in Japanese? Why bother
right?

I just think that it's better to focus on two very specific
languages with two very specific purposes (D for programming
and English for communication). 'Twas just an idea, I don't
care if you write your code in hieroglyphs.

I really really agree with you.

Yet, I think they are orthogonal concepts, and that the D
programming language has no business choosing which
communication vector its users should use.

It's not just a matter (imo) of "I wouldn't force it upon
anyone", but "I think everyone should choose what's best for
them".

Yeah. I know. Same conclusion, but there is a nuance.

I'm glad you agree, though I believe that I never said anything
about D 'choosing' which human languages are compatible with it.
I just expressed my belief that should people choose to construct
something, be it a ship or a computer program, the usage of a
single language will greatly enhance their progress (ever heard
the story of the Tower of Babel? wink wink). Sorry if my previous
comment seemed hostile, that was not my intention.

May 30 2013
"Jakob Ovrum" <jakobovrum gmail.com> writes:
On Thursday, 30 May 2013 at 15:48:12 UTC, Entry wrote:
I'm glad you agree, though I believe that I never said anything
about D 'choosing' which human languages are compatible with
it. I just expressed my belief that should people choose to
construct something, be it a ship or a computer program, the
usage of a single language will greatly enhance their progress
(ever heard the story of the Tower of Babel? wink wink). Sorry
if my previous comment seemed hostile, that was not my
intention.

If the programmers who are going to be working on that code don't
understand the "Single Language", then what use is it?

May 30 2013
"Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 16:05:13 UTC, Jakob Ovrum wrote:
On Thursday, 30 May 2013 at 15:48:12 UTC, Entry wrote:
I'm glad you agree, though I believe that I never said
anything about D 'choosing' which human languages are
compatible with it. I just expressed my belief that should
people choose to construct something, be it a ship or a
computer program, the usage of a single language will greatly
enhance their progress (ever heard the story of the Tower of
Babel? wink wink). Sorry if my previous comment seemed
hostile, that was not my intention.

If the programmers who are going to be working on that code
don't understand the "Single Language", then what use is it?

Then there's no helping it. Though I wonder what kind of a
programmer doesn't understand English enough to at least read the
code and comments.

May 30 2013
Manu <turkeyman gmail.com> writes:
On 31 May 2013 03:08, Entry <no no.com> wrote:

On Thursday, 30 May 2013 at 16:05:13 UTC, Jakob Ovrum wrote:

On Thursday, 30 May 2013 at 15:48:12 UTC, Entry wrote:

I'm glad you agree, though I believe that I never said anything about D
'choosing' which human languages are compatible with it. I just expressed
my belief that should people choose to construct something, be it a ship or
a computer program, the usage of a single language will greatly enhance
their progress (ever heard the story of the Tower of Babel? wink wink).
Sorry if my previous comment seemed hostile, that was not my intention.

If the programmers who are going to be working on that code don't
understand the "Single Language", then what use is it?

Then there's no helping it. Though I wonder what kind of a programmer
doesn't understand English enough to at least read the code and comments.

A child, or a student.

May 30 2013
Manu <turkeyman gmail.com> writes:
On 31 May 2013 01:48, Entry <no no.com> wrote:

On Thursday, 30 May 2013 at 14:49:12 UTC, monarch_dodra wrote:

On Thursday, 30 May 2013 at 14:13:47 UTC, Entry wrote:

Take a minute to think about why we're all communicating in English
here. Let's see if you can figure it out.

Well that's condescending :/ and fallacious.

To answer your question, it may have something to do with the fact that
these are the English forums? Just a wild hunch. Oh. And because we *can*
speak English? That could also have something to do with it.

There are tons of non-English speaking programming forums out there.
Maybe those that don't speak English are over there? Heck, there are a few
non-English threads in learn.

Oh. And did you know TDPL was published in Japanese? Why bother right?

I just think that it's better to focus on two very specific languages
with two very specific purposes (D for programming and English for
communication). 'Twas just an idea, I don't care if you write your code in
hieroglyphs.

I really really agree with you.

Yet, I think they are orthogonal concepts, and that the D programming
language has no business choosing which communication vector its users
should use.

It's not just a matter (imo) of "I wouldn't force it upon anyone", but "I
think everyone should choose what's best for them".

Yeah. I know. Same conclusion, but there is a nuance.

I'm glad you agree, though I believe that I never said anything about D
'choosing' which human languages are compatible with it. I just expressed
my belief that should people choose to construct something, be it a ship or
a computer program, the usage of a single language will greatly enhance
their progress (ever heard the story of the Tower of Babel? wink wink).
Sorry if my previous comment seemed hostile, that was not my intention.

This is the definition of a *convention*, not a rule.

May 30 2013
Manu <turkeyman gmail.com> writes:
On 30 May 2013 18:32, Entry <no no.com> wrote:

On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams wrote:

On 30/05/13 08:40, Entry wrote:

My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified language (D)
should not be sabotaged by comments and variable names in various human
languages (Swedish, Russian), but be accompanied by a similarly 'unified'
language that we all know - English. It is only my opinion though and I
wouldn't force it upon anyone.

We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance, it
almost always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that power
an industry worth 10s of billions.

May 30 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance, it
almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

May 30 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 31/05/13 05:07, Walter Bright wrote:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Peter

May 30 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2013 5:00 PM, Peter Williams wrote:
On 31/05/13 05:07, Walter Bright wrote:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Not true, D supports Unicode identifiers.

May 30 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Fri, 31 May 2013 07:57:37 +0200, Walter Bright
<newshound2 digitalmars.com> wrote:

On 5/30/2013 5:00 PM, Peter Williams wrote:
On 31/05/13 05:07, Walter Bright wrote:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Not true, D supports Unicode identifiers.

I doubt Sony and Nintendo use D extensively.

--
Simen

May 31 2013
1100110 <0b1100110 gmail.com> writes:
On 05/31/2013 05:11 AM, Simen Kjaeraas wrote:
On Fri, 31 May 2013 07:57:37 +0200, Walter Bright
<newshound2 digitalmars.com> wrote:

On 5/30/2013 5:00 PM, Peter Williams wrote:
On 31/05/13 05:07, Walter Bright wrote:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Not true, D supports Unicode identifiers.

I doubt Sony and Nintendo use D extensively.


Jun 17 2013
Timothee Cour <thelastmammoth gmail.com> writes:
On Thu, May 30, 2013 at 10:57 PM, Walter Bright
<newshound2 digitalmars.com>wrote:

On 5/30/2013 5:00 PM, Peter Williams wrote:

On 31/05/13 05:07, Walter Bright wrote:

On 5/30/2013 4:24 AM, Manu wrote:

We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Not true, D supports Unicode identifiers.

currently std.demangle.demangle doesn't work with unicode (see example
below)

If we decide to keep allowing unicode symbols (as opposed to just unicode
address this issue. Will supporting this negatively impact performance (of
both compile time and runtime) ?

Likewise, will linkers + other tools (gdb etc) be happy with unicode in
mangled names?

----
struct A{
int z;
void foo(int x){}
void さいごの果実(int x){}
void ªå(int x){}
}
mangledName!(A.さいごの果実).demangle.writeln; =>
_D4util13demangle_funs1A18さいごの果実MFiZv
----
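Note that the `18` in `_D4util13demangle_funs1A18さいごの果実MFiZv` is a byte count, not a character count: the six characters of さいごの果実 each take three bytes in UTF-8, consistent with D's length-prefixed name mangling. A quick check of that arithmetic (a Python sketch, purely for illustration):

```python
name = "さいごの果実"

# 6 characters, each a 3-byte UTF-8 sequence, so the length prefix in
# the mangled name counts 18 bytes rather than 6 characters.
print(len(name))                   # character count: 6
print(len(name.encode("utf-8")))   # UTF-8 byte count: 18
```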

Jun 05 2013
On 6/5/13 6:11 PM, Timothee Cour wrote:
currently std.demangle.demangle doesn't work with unicode (see example below)

If we decide to keep allowing unicode symbols (as opposed to just unicode
address this issue. Will supporting this negatively impact performance (of
both compile time and
runtime) ?

Likewise, will linkers + other tools (gdb etc) be happy with unicode in
mangled names?

----
structA{
intz;
voidfoo(intx){}
voidさいごの果実(intx){}
voidªå(intx){}
}
mangledName!(A.さいごの果実).demangle.writeln;=>_D4util13demangle_funs1A18さいごの果実MFiZv
----

Filed in bugzilla?

Jun 05 2013
Sean Kelly <sean invisibleduck.org> writes:
On Jun 5, 2013, at 6:21 PM, Brad Roberts <braddr puremagic.com> wrote:

On 6/5/13 6:11 PM, Timothee Cour wrote:
currently std.demangle.demangle doesn't work with unicode (see
example below)

If we decide to keep allowing unicode symbols (as opposed to just unicode
address this issue. Will supporting this negatively impact
performance (of both compile time and runtime) ?

Likewise, will linkers + other tools (gdb etc) be happy with unicode
in mangled names?

----
struct A{
int z;
void foo(int x){}
void さいごの果実(int x){}
void ªå(int x){}
}
mangledName!(A.さいごの果実).demangle.writeln; =>
_D4util13demangle_funs1A18さいごの果実MFiZv
----

Filed in bugzilla?

http://d.puremagic.com/issues/show_bug.cgi?id=3D10393
https://github.com/D-Programming-Language/druntime/pull/524

Jun 17 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jun 17, 2013 at 11:37:18AM -0700, Sean Kelly wrote:
On Jun 5, 2013, at 6:21 PM, Brad Roberts <braddr puremagic.com> wrote:

On 6/5/13 6:11 PM, Timothee Cour wrote:
currently std.demangle.demangle doesn't work with unicode (see example below)

If we decide to keep allowing unicode symbols (as opposed to just unicode
address this issue. Will supporting this negatively impact performance (of
both compile time and
runtime) ?

Likewise, will linkers + other tools (gdb etc) be happy with unicode in
mangled names?

----
structA{
intz;
voidfoo(intx){}
voidさいごの果実(intx){}
voidªå(intx){}
}
mangledName!(A.さいごの果実).demangle.writeln;=>_D4util13demangle_funs1A18さいごの果実MFiZv
----

Filed in bugzilla?

http://d.puremagic.com/issues/show_bug.cgi?id=10393
https://github.com/D-Programming-Language/druntime/pull/524

Do linkers actually support 8-bit symbol names? Or do these have to be
translated into ASCII somehow?

T

--
We've all heard that a million monkeys banging on a million typewriters will
eventually reproduce the entire works of Shakespeare.  Now, thanks to the
Internet, we know this is not true. -- Robert Wilensk

Jun 17 2013
Sean Kelly <sean invisibleduck.org> writes:
On Jun 17, 2013, at 11:47 AM, "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote:

Do linkers actually support 8-bit symbol names? Or do these have to be
translated into ASCII somehow?

Good question.  It looks like the linker on OSX does:

public	_D3abc1A18さいごの果実MFiZv
public	_D3abc1A4ªåMFiZv

The object file linked just fine.  I haven't tried OPTLINK on Win32 though.

Jun 17 2013
On 6/17/13 11:58 AM, Sean Kelly wrote:
On Jun 17, 2013, at 11:47 AM, "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote:
Do linkers actually support 8-bit symbol names? Or do these have to be
translated into ASCII somehow?

Good question.  It looks like the linker on OSX does:

public	_D3abc1A18さいごの果実MFiZv
public	_D3abc1A4ªåMFiZv

The object file linked just fine.  I haven't tried OPTLINK on Win32 though.

Don't symbol names from dmd/win32 get compressed if they're too long, resulting
in essentially arbitrary random binary data being used as symbol names?
Assuming my memory on that is correct then it's already demonstrated that
optlink doesn't care what the data is.

Jun 17 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 6/17/2013 6:28 PM, Brad Roberts wrote:
Don't symbol names from dmd/win32 get compressed if they're too long, resulting
in essentially arbitrary random binary data being used as symbol names?
Assuming my memory on that is correct then it's already demonstrated that
optlink doesn't care what the data is.

Optlink doesn't care what the symbol byte contents are.

Jun 17 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jun 17, 2013 at 06:49:19PM -0700, Walter Bright wrote:
On 6/17/2013 6:28 PM, Brad Roberts wrote:
Don't symbol names from dmd/win32 get compressed if they're too long, resulting
in essentially arbitrary random binary data being used as symbol names?
Assuming my memory on that is correct then it's already demonstrated that
optlink doesn't care what the data is.

Optlink doesn't care what the symbol byte contents are.

It seems ld on Linux doesn't, either. I just tested separate compilation
on some code containing functions and modules with Cyrillic names, and
it worked fine. But my system locale is UTF-8; I'm not sure if there may
be a problem on other system locales (not that modern systems would
actually use anything else, though!).

Might this cause a problem with the VS linker?

T

--
It only takes one twig to burn down a forest.

Jun 18 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 6/18/2013 9:44 AM, H. S. Teoh wrote:
Might this cause a problem with the VS linker?

I doubt it, but try it and see!

Jun 18 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jun 18, 2013 at 04:33:54PM -0700, Walter Bright wrote:
On 6/18/2013 9:44 AM, H. S. Teoh wrote:
Might this cause a problem with the VS linker?

I doubt it, but try it and see!

to try?

T

--
Study gravitation, it's a field with a lot of potential.

Jun 18 2013
Sean Kelly <sean invisibleduck.org> writes:
On Jun 17, 2013, at 6:28 PM, Brad Roberts <braddr puremagic.com> wrote:

On 6/17/13 11:58 AM, Sean Kelly wrote:
On Jun 17, 2013, at 11:47 AM, "H. S. Teoh" <hsteoh quickfur.ath.cx>
wrote:

Do linkers actually support 8-bit symbol names? Or do these have to be
translated into ASCII somehow?

Good question.  It looks like the linker on OSX does:

public	_D3abc1A18さいごの果実MFiZv
public	_D3abc1A4ªåMFiZv

The object file linked just fine.  I haven't tried OPTLINK on Win32
though.

Don't symbol names from dmd/win32 get compressed if they're too long,
resulting in essentially arbitrary random binary data being used as
symbol names?  Assuming my memory on that is correct then it's already
demonstrated that optlink doesn't care what the data is.

Yes.  So it isn't always possible to fully demangle really long symbol
names.  This is not terribly difficult to hit using templates,
especially if they take string arguments.

Jun 19 2013
Manu <turkeyman gmail.com> writes:
On 31 May 2013 05:07, Walter Bright <newshound2 digitalmars.com> wrote:

On 5/30/2013 4:24 AM, Manu wrote:

We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance, it
almost always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

But that doesn't make it English, or any more readable...
The only benefit to forcing users to use ASCII is that everyone can
physically type it.
1. It's not natural to type a word that you don't know what it is or how
to spell, you'll end up copy-pasting anyway rather than trying to
remember/copy it letter by letter and risk misspelling.
2. It's less natural for the people who CAN read it, because they have to
mentally transliterate too. (And if they're kids/amateurs who don't even
know the latin alphabet?)

Ie, it serves neither party to force someone who doesn't speak English to
write ASCII.
children learning to code. There's no compelling reason to force
identifiers in ASCII.
Currently, D offers a unique advantage; leave it that way.

May 30 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2013 5:04 PM, Manu wrote:
Currently, D offers a unique advantage; leave it that way.

I am going to leave it that way based on the comments here, I only wanted to
point out that the example didn't support Unicode identifiers.

May 30 2013
Manu <turkeyman gmail.com> writes:
On 31 May 2013 10:00, Peter Williams <pwil3058 bigpond.net.au> wrote:

On 31/05/13 05:07, Walter Bright wrote:

On 5/30/2013 4:24 AM, Manu wrote:

We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Indeed, and believe me, the variable names can often make NO sense, or
worse, they're misunderstood and quite misleading.
Ie, you think a variable is something, but you realise it's the inverse, or
just something completely different.

May 30 2013
"Mr. Anonymous" <mailnew4ster gmail.com> writes:
On Monday, 27 May 2013 at 22:20:16 UTC, H. S. Teoh wrote:
On Tue, May 28, 2013 at 12:04:52AM +0200, Vladimir Panteleev
wrote:
On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to
write
large amounts of text in some given language; a method of
temporarily switching to a different layout is necessary.

I thought the topic was typing the occasional Unicode
character to
use as an operator in D programs?

Well, D *does* support non-English identifiers, y'know... for
example:

void main(string[] args) {
int число = 1;
foreach (и; 0..100)
число += и;
writeln(число);
}

Of course, whether that's a good practice is a different story.
:)

But for operators, you still need enough compose key sequences
to cover
all of the Unicode operators -- and there are a LOT of them --
which I
don't think is currently done anywhere. You'd have to make your
own
compose key maps to do it.

T


May 28 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Tue, 28 May 2013 00:18:31 +0200, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:

On Tue, May 28, 2013 at 12:04:52AM +0200, Vladimir Panteleev wrote:
On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to write
large amounts of text in some given language; a method of
temporarily switching to a different layout is necessary.

I thought the topic was typing the occasional Unicode character to
use as an operator in D programs?

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
int число = 1;
foreach (и; 0..100)
число += и;
writeln(число);
}
}

Of course, whether that's a good practice is a different story. :)

But for operators, you still need enough compose key sequences to cover
all of the Unicode operators -- and there are a LOT of them -- which I
don't think is currently done anywhere. You'd have to make your own
compose key maps to do it.

The Fortress programming language has some 900 or so operators:

https://java.net/projects/projectfortress/sources/sources/content/Specification/fortress.1.0.pdf?rev=5558

Appendix C, and

https://java.net/projects/projectfortress/sources/sources/content/Documentation/Specification/fortress.pdf?rev=5558

chapter 14

--
Simen

May 27 2013
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, May 28, 2013 11:38:08 Peter Williams wrote:
On 28/05/13 09:44, H. S. Teoh wrote:
Since language keywords are already in English, we might as well
standardize on English identifiers too.

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

I think that it was more an issue of that the only reason that Unicode would
be necessary in identifiers would be if you weren't using English, so if you
assume that everyone is going to be using some form of English for their
identifier names, you can skip having Unicode in identifiers. So, a natural
effect of standardizing on English is that you can stick with ASCII.

- Jonathan M Davis

May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 11:42, Jonathan M Davis <jmdavisProg gmx.com> wrote:

On Tuesday, May 28, 2013 11:38:08 Peter Williams wrote:
On 28/05/13 09:44, H. S. Teoh wrote:
Since language keywords are already in English, we might as well
standardize on English identifiers too.

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

I think that it was more an issue of that the only reason that Unicode
would
be necessary in identifiers would be if you weren't using English, so if
you
assume that everyone is going to be using some form of English for their
identifier names, you can skip having Unicode in identifiers. So, a natural
effect of standardizing on English is that you can stick with ASCII.

I'm fairly sure that any programmer who takes themselves seriously will
use English; I don't see any reason why this rule should need to be
implemented by the compiler.
The loss I can imagine is that kids, or people from developing countries,
etc, may have an additional barrier to learning to code if they don't speak
English.
Nobody in this set is likely to produce a useful library that will be used
widely.
Likewise, no sane programmer is going to choose to use a library that's not
written in English.

You may argue that the keywords and libs are in English. I can attest from
personal experience, that a child, or a non-english-speaking beginner
probably has absolutely NO IDEA what the keywords mean anyway, even if they
do speak English.
I certainly had no idea when I was a kid, I just typed them because I
figured out what they did. I didn't even know how to say many of them, and
realised 5 years later that I was saying all the words wrong...

So my point is, why make this restriction as a static compiler rule, when
it's not practically going to be broken anyway. You never know, it may
actually assist some people somewhere.
I think it's a great thing that D can accept identifiers in non-english.

May 27 2013
"Daniel Murphy" <yebblies nospamgmail.com> writes:
"Manu" <turkeyman gmail.com> wrote in message
news:mailman.137.1369448229.13711.digitalmars-d puremagic.com...
One of the first, and best, decisions I made for D was it would be
Unicode
front to back.

Indeed, excellent decision!
So when we define operators for u  v and a  b, or maybe n? ;)

When these have keys on standard keyboards.

May 25 2013
"Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
One of the first, and best, decisions I made for D was it would
be Unicode front to back.

That is why I asked this question here.  I think D is still one
of the few programming languages with such unicode support.

This is more a problem with the algorithms taking the easy way
than a problem with UTF-8. You can do all the string
algorithms, including regex, by working with the UTF-8 directly
rather than converting to UTF-32. Then the algorithms work at
full speed.

I call BS on this.  There's no way working on a variable-width
encoding can be as "full speed" as a constant-width encoding.
Perhaps you mean that the slowdown is minimal, but I doubt that
also.

That was the go-to solution in the 1980's, they were called
"code pages". A disaster.

My understanding is that code pages were a "disaster" because
they weren't standardized and often badly implemented.  If you
used UCS with a single-byte encoding, you wouldn't have that
problem.

with the few exceptional languages with more than 256
characters encoded in two bytes.

Like those rare languages Japanese, Korean, Chinese, etc. This
too was done in the 80's with "Shift-JIS" for Japanese, and
some other wacky scheme for Korean, and a third nutburger one
for Chinese.

Of course, you have to have more than one byte for those
languages, because they have more than 256 characters.  So there
will be no compression gain over UTF-8/16 there, but a big gain
in parsing complexity with a simpler encoding, particularly when
dealing with multi-language strings.

I've had the misfortune of supporting all that in the old
Zortech C++ compiler. It's AWFUL. If you think it's simpler,
all I can say is you've never tried to write internationalized
code with it.

Heh, I'm not saying "let's go back to badly defined code pages" when
I'm saying "let's go back to single-byte encodings."  The two are
separate arguments.

UTF-8 is heavenly in comparison. Your code is automatically
internationalized. It's awesome.

At what cost?  Most programmers completely punt on unicode,
because they just don't want to deal with the complexity.
Perhaps you can deal with it and don't mind the performance loss,
but I suspect you're in the minority.

May 25 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 12:33 AM, Joakim wrote:
At what cost?  Most programmers completely punt on unicode, because they just
don't want to deal with the complexity. Perhaps you can deal with it and don't
mind the performance loss, but I suspect you're in the minority.

I think you stand alone in your desire to return to code pages. I have years of
experience with code pages and the unfixable misery they produce. This has
disappeared with Unicode. I find your arguments unpersuasive when stacked
against my experience. And yes, I have made a living writing high performance
code that deals with characters, and you are quite off base with claims that
UTF-8 has inevitable bad performance - though there is inefficient code in
Phobos for it, to be sure.

My grandfather wrote a book that consists of mixed German, French, and Latin
words, using special characters unique to those languages. Another failing of
code pages is it fails miserably at any such mixed language text. Unicode
handles it with aplomb.

I can't even write an email to Rainer Schütze in English under your scheme.

Code pages simply are no longer practical nor acceptable for a global
community.
D is never going to convert to a code page system, and even if it did, there's
no way D will ever convince the world to abandon Unicode, and so D would be as
useless as EBCDIC.

I'm afraid your quest is quixotic.

May 25 2013
"Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote:

Nobody is talking about going back to code pages.  I'm talking
about going to single-byte encodings, which do not imply the
problems that you had with code pages way back when.

I have years of experience with code pages and the unfixable
misery they produce. This has disappeared with Unicode. I find
your arguments unpersuasive when stacked against my experience.
And yes, I have made a living writing high performance code
that deals with characters, and you are quite off base with
claims that UTF-8 has inevitable bad performance - though there
is inefficient code in Phobos for it, to be sure.

How can a variable-width encoding possibly compete with a
constant-width encoding?  You have not articulated a reason for
this.  Do you believe there is a performance loss with
variable-width, but that it is not significant and therefore
worth it?  Or do you believe it can be implemented with no loss?

My grandfather wrote a book that consists of mixed German,
French, and Latin words, using special characters unique to
those languages. Another failing of code pages is it fails
miserably at any such mixed language text. Unicode handles it
with aplomb.

I see no reason why single-byte encodings wouldn't do a better
job at such mixed-language text.  You'd just have to have a
larger, more complex header or keep all your strings in a single
language, with a different format to compose them together for
your book.  This would be so much easier than UTF-8 that I cannot
see how anyone could argue for a variable-length encoding instead.

I can't even write an email to Rainer Schütze in English under

Why not?  You seem to think that my scheme doesn't implement
multi-language text at all, whereas I pointed out, from the
beginning, that it could be trivially done also.

Code pages simply are no longer practical nor acceptable for a
global community. D is never going to convert to a code page
system, and even if it did, there's no way D will ever convince
the world to abandon Unicode, and so D would be as useless as
EBCDIC.

I'm afraid you and others here seem to mentally translate "single-byte
encodings" to "code pages" in your head, then recoil
in horror as you remember all your problems with broken
implementations of code pages, even though those problems are not
intrinsic to single-byte encodings.

I'm not asking you to consider this for D.  I just wanted to
discuss why UTF-8 is used at all.  I had hoped for some technical
evaluations of its merits, but I seem to simply be dredging up a

The world may not "abandon Unicode," but it will abandon UTF-8,
because it's a dumb idea.  Unfortunately, such dumb ideas- XML
anyone?- often proliferate until someone comes up with something
better to show how dumb they are.  Perhaps it won't be the D
programming language that does that, but it would be easy to
implement my idea in D, so maybe it will be a D-based library
someday. :)

I'm afraid your quest is quixotic.

I'd argue the opposite, considering most programmers still can't
wrap their head around UTF-8.  If someone can just get a
single-byte encoding implemented and in front of them, I suspect
it will be UTF-8 that will be considered quixotic. :D

May 25 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-May-2013 13:05, Joakim wrote:
On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote:

Nobody is talking about going back to code pages.  I'm talking about
going to single-byte encodings, which do not imply the problems that you
had with code pages way back when.

The problem is that what you outline is isomorphic to code pages, hence
the grief of accumulated experience against them.
Code pages simply are no longer practical nor acceptable for a global
community. D is never going to convert to a code page system, and even
if it did, there's no way D will ever convince the world to abandon
Unicode, and so D would be as useless as EBCDIC.

I'm afraid you and others here seem to mentally translate "single-byte
encodings" to "code pages" in your head, then recoil in horror as you
remember all your problems with broken implementations of code pages,
even though those problems are not intrinsic to single-byte encodings.

I'm not asking you to consider this for D.  I just wanted to discuss why
UTF-8 is used at all.  I had hoped for some technical evaluations of its
merits, but I seem to simply be dredging up a bunch of repressed

Well, if somebody took up the quest to redefine UTF-8, they *might* come
up with something that is a bit faster to decode but shares the same
properties. Hardly a life saver anyway.
The world may not "abandon Unicode," but it will abandon UTF-8, because
it's a dumb idea.  Unfortunately, such dumb ideas- XML anyone?- often
proliferate until someone comes up with something better to show how
dumb they are.

Even children know XML is awful redundant shit as an interchange format.
The hierarchical document is a nice idea anyway.

Perhaps it won't be the D programming language that does
that, but it would be easy to implement my idea in D, so maybe it will
be a D-based library someday. :)

Implement Unicode compression scheme - at least that is standardized.

--
Dmitry Olshansky

May 25 2013
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, May 25, 2013 01:42:20 Walter Bright wrote:
On 5/25/2013 12:33 AM, Joakim wrote:
At what cost?  Most programmers completely punt on unicode, because they
just don't want to deal with the complexity. Perhaps you can deal with it
and don't mind the performance loss, but I suspect you're in the
minority.

I think you stand alone in your desire to return to code pages. I have
years of experience with code pages and the unfixable misery they produce.
This has disappeared with Unicode. I find your arguments unpersuasive when
stacked against my experience. And yes, I have made a living writing high
performance code that deals with characters, and you are quite off base
with claims that UTF-8 has inevitable bad performance - though there is
inefficient code in Phobos for it, to be sure.

My grandfather wrote a book that consists of mixed German, French, and Latin
words, using special characters unique to those languages. Another failing
of code pages is it fails miserably at any such mixed language text.
Unicode handles it with aplomb.

I can't even write an email to Rainer Schütze in English under your scheme.

Code pages simply are no longer practical nor acceptable for a global
community. D is never going to convert to a code page system, and even if
it did, there's no way D will ever convince the world to abandon Unicode,
and so D would be as useless as EBCDIC.

I'm afraid your quest is quixotic.

All I've got to say on this subject is "Thank you Walter Bright for building
Unicode into D!"

- Jonathan M Davis

May 25 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, May 25, 2013 at 04:14:34PM -0700, Jonathan M Davis wrote:
On Saturday, May 25, 2013 01:42:20 Walter Bright wrote:
On 5/25/2013 12:33 AM, Joakim wrote:
At what cost?  Most programmers completely punt on unicode,
because they just don't want to deal with the complexity. Perhaps
you can deal with it and don't mind the performance loss, but I
suspect you're in the minority.

have years of experience with code pages and the unfixable misery
they produce. This has disappeared with Unicode. I find your
arguments unpersuasive when stacked against my experience. And yes,
I have made a living writing high performance code that deals with
characters, and you are quite off base with claims that UTF-8 has
inevitable bad performance - though there is inefficient code in
Phobos for it, to be sure.

My grandfather wrote a book that consists of mixed German, French,
and Latin words, using special characters unique to those languages.
Another failing of code pages is it fails miserably at any such
mixed language text.  Unicode handles it with aplomb.

Walter Bright <newshound2 digitalmars.com>  writes:
On 5/25/2013 9:48 PM, H. S. Teoh wrote:
Then came along D with native Unicode support built right into the
language. And not just UTF-16 shoved down your throat like Java does (or
was it UTF-32?); UTF-8, UTF-16, and UTF-32 are all equally supported.
You cannot imagine what a happy camper I was since then!! Yes, Phobos
still has a ways to go in terms of performance w.r.t. UTF-8 strings, but
what we have right now is already far, far, superior to the situation in
C/C++, and things can only get better.

Many moons ago, when the earth was young and I had a few strands of hair
left, a C++ programmer challenged me to a "bakeoff", D vs C++. I wrote the
program in D
(a string processing program). He said "ahaaaa!" and wrote the C++ one. They
were fairly comparable.

I then suggested we do the internationalized version. I resubmitted exactly the
same program. He threw in the towel.

May 25 2013

On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
This is more a problem with the algorithms taking the easy way
than a problem with UTF-8. You can do all the string
algorithms, including regex, by working with the UTF-8
directly rather than converting to UTF-32. Then the algorithms
work at full speed.

I call BS on this.  There's no way working on a variable-width
encoding can be as "full speed" as a constant-width encoding.
Perhaps you mean that the slowdown is minimal, but I doubt that
also.

For the record, I noticed that programmers (myself included) that
had an incomplete understanding of Unicode / UTF exaggerate this
point, and sometimes needlessly assume that their code needs to
operate on individual characters (code points), when it is in
fact not so - and that code will work just fine as if it was
written to handle ASCII. The example Walter quoted (regex -
assuming you don't want Unicode ranges or case-insensitivity) is
one such case.

Another thing I noticed: sometimes when you think you really need
to operate on individual characters (and that your code will not
be correct unless you do that), the assumption will be incorrect
due to the existence of combining characters in Unicode. Two of
the often-quoted use cases of working on individual code points
is calculating the string width (assuming a fixed-width font),
and slicing the string - both of these will break with combining
characters if those are not accounted for. I believe the proper
way to approach such tasks is to implement the respective Unicode
algorithms for it, which I believe are non-trivial and for which
the relative impact for the overhead of working with a
variable-width encoding is acceptable.

Can you post some specific cases where the benefits of a
constant-width encoding are obvious and, in your opinion, make
constant-width encodings more useful than all the benefits of
UTF-8?

Also, I don't think this has been posted in this thread. Not sure

http://www.utf8everywhere.org/

And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 08:58:57 UTC, Vladimir Panteleev
wrote:
Another thing I noticed: sometimes when you think you really
need to operate on individual characters (and that your code
will not be correct unless you do that), the assumption will be
incorrect due to the existence of combining characters in
Unicode. Two of the often-quoted use cases of working on
individual code points is calculating the string width
(assuming a fixed-width font), and slicing the string - both of
these will break with combining characters if those are not
accounted for. I believe the proper way to approach such tasks
is to implement the respective Unicode algorithms for it, which
I believe are non-trivial and for which the relative impact for
the overhead of working with a variable-width encoding is
acceptable.

Combining characters are examples of complexity baked into the
various languages, so there's no way around that.  I'm arguing
against layering more complexity on top, through UTF-8.

Can you post some specific cases where the benefits of a
constant-width encoding are obvious and, in your opinion, make
constant-width encodings more useful than all the benefits of
UTF-8?

Let's take one you listed above, slicing a string.  You have to
either translate your entire string into UTF-32 so it's
constant-width, which is apparently what Phobos does, or decode
every single UTF-8 character along the way, every single time.  A
constant-width, single-byte encoding would be much easier to
slice, while still using at most half the space.

Also, I don't think this has been posted in this thread. Not

http://www.utf8everywhere.org/

That seems to be a call to using UTF-8 on Windows, with a lot of
info on how best to do so, with little justification for why
you'd want to do so in the first place.  For example,

"Q: But what about performance of text processing algorithms,
byte alignment, etc?

A: Is it really better with UTF-16? Maybe so."

Not exactly a considered analysis of the two. ;)

And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

You cannot honestly look at those multiple state diagrams and
tell me it's "simple."  That said, the difficulty of _using_
UTF-8 is a much bigger problem than implementing a decoder
in a library.

May 25 2013

"w0rp" <devw0rp gmail.com>  writes:
This is dumb. You are dumb. Go away.

May 25 2013

On Saturday, 25 May 2013 at 09:40:36 UTC, Joakim wrote:
Can you post some specific cases where the benefits of a
constant-width encoding are obvious and, in your opinion, make
constant-width encodings more useful than all the benefits of
UTF-8?

Let's take one you listed above, slicing a string.  You have to
either translate your entire string into UTF-32 so it's
constant-width, which is apparently what Phobos does, or decode
every single UTF-8 character along the way, every single time.
A constant-width, single-byte encoding would be much easier to
slice, while still using at most half the space.

You don't need to do that to slice a string. I think you mean to
say that you need to decode each character if you want to slice
the string at the N-th code point? But this is exactly what I'm
trying to point out: how would you find this N? How would you
know if it makes sense, taking into account combining characters,
and all the other complexities of Unicode?

If you want to split a string by ASCII whitespace (newlines, tabs
and spaces), it makes no difference whether the string is in
ASCII or UTF-8 - the code will behave correctly in either case,
regardless of the variable-width encoding.

You cannot honestly look at those multiple state diagrams and
tell me it's "simple."

I meant that it's simple to implement (and adapt/port to other
languages). I would say that UTF-8 is quite cleverly designed, so
I wouldn't say it's simple by itself.

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 10:33:12 UTC, Vladimir Panteleev
wrote:
You don't need to do that to slice a string. I think you mean
to say that you need to decode each character if you want to
slice the string at the N-th code point? But this is exactly
what I'm trying to point out: how would you find this N? How
would you know if it makes sense, taking into account combining
characters, and all the other complexities of Unicode?

Slicing a string implies finding the N-th code point, what other
way would you slice and have it make any sense?  Finding the N-th
point is much simpler with a constant-width encoding.
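
[Editorial note: finding the N-th code point in UTF-8 does not actually require decoding either. Continuation bytes always match the bit pattern 10xxxxxx, so counting the bytes that do NOT match it counts code points. A minimal C sketch; the function name `utf8_index` is mine, for illustration:]

```c
#include <stddef.h>

/* Return a pointer to the start of the n-th code point (0-based) in a
 * NUL-terminated UTF-8 string, or NULL if the string has fewer code
 * points. Only lead/ASCII bytes are counted; nothing is decoded. */
const char *utf8_index(const char *s, size_t n)
{
    for (; *s; ++s) {
        /* Continuation bytes look like 10xxxxxx; skip them. */
        if (((unsigned char)*s & 0xC0) != 0x80) {
            if (n == 0)
                return s;
            --n;
        }
    }
    return NULL;
}
```

[This is still O(length) rather than O(1), which is the real cost under debate; the point is only that no decoding state machine is involved.]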

I'm leaving aside combining characters and those intrinsic
language complexities baked into unicode in my previous analysis,
but if you want to bring those in, that's actually an argument in
favor of my encoding.  With my encoding, you know up front if
you're using languages that have such complexity- just check the
header- whereas with a chunk of random UTF-8 text, you cannot
ever know that unless you decode the entire string once and
extract knowledge of all the languages that are embedded.

For another similar example, let's say you want to run toUpper on
a multi-language string, which contains English in the first half
and some Asian script that doesn't define uppercase in the second
half.  With my format, toUpper can check the header, then process
the English half and skip the Asian half (I'm assuming that the
substring indices for each language would be stored in this more
complex header).  With UTF-8, you have to process the entire
string, because you never know what random languages might be
packed in there.

UTF-8 is riddled with such performance bottlenecks, all to make
it self-synchronizing.  But is anybody really using its less
compact encoding to do some "self-synchronized" integrity
checking?  I suspect almost nobody is.
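
[Editorial note: "self-synchronizing" refers less to integrity checking than to the property that a code point boundary can be found from any byte offset without rescanning from the start, because continuation bytes (10xxxxxx) can never be confused with lead bytes. A hypothetical C sketch of that recovery step:]

```c
#include <stddef.h>

/* Given an arbitrary byte offset i into a valid UTF-8 buffer, step
 * back to the start of the enclosing code point. For valid UTF-8 this
 * moves at most 3 bytes, since sequences are at most 4 bytes long. */
size_t utf8_sync_back(const char *s, size_t i)
{
    while (i > 0 && ((unsigned char)s[i] & 0xC0) == 0x80)
        --i;
    return i;
}
```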

If you want to split a string by ASCII whitespace (newlines,
tabs and spaces), it makes no difference whether the string is
in ASCII or UTF-8 - the code will behave correctly in either
case, variable-width-encodings regardless.

Except that a variable-width encoding will take longer to decode
while splitting, when compared to a single-byte encoding.

You cannot honestly look at those multiple state diagrams and
tell me it's "simple."

I meant that it's simple to implement (and adapt/port to other
languages). I would say that UTF-8 is quite cleverly designed,
so I wouldn't say it's simple by itself.

Perhaps, maybe decoding is not so bad for the type of people who
write the fundamental UTF-8 libraries.  But implementation does
not merely refer to the UTF-8 libraries, but also all the code
that tries to build on it for internationalized apps.  And
wrapping the average programmer's head around this mess likely
leads to as many problems as broken code page implementations
did back in the day. ;)

May 25 2013

On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines,
tabs and spaces), it makes no difference whether the string is
in ASCII or UTF-8 - the code will behave correctly in either
case, variable-width-encodings regardless.

Except that a variable-width encoding will take longer to
decode while splitting, when compared to a single-byte encoding.

No. Are you sure you understand UTF-8 properly?

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev
wrote:
On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines,
tabs and spaces), it makes no difference whether the string
is in ASCII or UTF-8 - the code will behave correctly in
either case, variable-width-encodings regardless.

Except that a variable-width encoding will take longer to
decode while splitting, when compared to a single-byte
encoding.

No. Are you sure you understand UTF-8 properly?

Are you sure _you_ understand it properly?  Both encodings have
to check every single character to test for whitespace, but the
single-byte encoding simply has to load each byte in the string
and compare it against the whitespace-signifying bytes, while the
variable-length code has to first load and parse potentially 4
bytes before it can compare, because it has to go through the
state machine that you linked to above.  Obviously the
constant-width encoding will be faster.  Did I really need to
explain this?

On Saturday, 25 May 2013 at 12:43:21 UTC, Andrei Alexandrescu
wrote:
On 5/25/13 3:33 AM, Joakim wrote:
On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
This is more a problem with the algorithms taking the easy
way than a
problem with UTF-8. You can do all the string algorithms,
including
regex, by working with the UTF-8 directly rather than
converting to
UTF-32. Then the algorithms work at full speed.

I call BS on this. There's no way working on a variable-width
encoding
can be as "full speed" as a constant-width encoding. Perhaps
you mean
that the slowdown is minimal, but I doubt that also.

You mentioned this a couple of times, and I wonder what makes
you so sure. On contemporary architectures small is fast and
large is slow; betting on replacing larger data with more
computation is quite often a win.

When has small ever been slow and large fast? ;) I'm talking
about replacing larger data _and_ more computation, ie UTF-8,
with smaller data and less computation, ie single-byte encodings,
so it is an unmitigated win in that regard. :)

May 25 2013

"Peter Alexander" <peter.alexander.au gmail.com>  writes:
On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev
wrote:
On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines,
tabs and spaces), it makes no difference whether the string
is in ASCII or UTF-8 - the code will behave correctly in
either case, variable-width-encodings regardless.

Except that a variable-width encoding will take longer to
decode while splitting, when compared to a single-byte
encoding.

No. Are you sure you understand UTF-8 properly?

Are you sure _you_ understand it properly?  Both encodings have
to check every single character to test for whitespace, but the
single-byte encoding simply has to load each byte in the string
and compare it against the whitespace-signifying bytes, while
the variable-length code has to first load and parse
potentially 4 bytes before it can compare, because it has to go
through the state machine that you linked to above.  Obviously
the constant-width encoding will be faster.  Did I really need
to explain this?

I suggest you read up on UTF-8. You really don't understand it.
There is no need to decode, you just treat the UTF-8 string as if
it is an ASCII string.

This code will count all spaces in a string whether it is encoded
as ASCII or UTF-8:

int countSpaces(const(char)* c)
{
int n = 0;
while (*c)
if (*c == ' ')
++n;
return n;
}

I repeat: there is no need to decode. Please read up on UTF-8.
You do not understand it. The reason you don't need to decode is
because UTF-8 is self-synchronising.

The code above tests for spaces only, but it works the same when
searching for any substring or single character. It is no slower
than fixed-width encoding for these operations.

Again, I urge you, please read up on UTF-8. It is very well
designed.

May 25 2013

"Peter Alexander" <peter.alexander.au gmail.com>  writes:
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
int countSpaces(const(char)* c)
{
int n = 0;
while (*c)
if (*c == ' ')
++n;
return n;
}

Oops. Missing a ++c in there, but I'm sure the point was made :-)
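
[Editorial note: for completeness, here is the corrected sketch, recast as plain C (the original used D's `const(char)*` syntax), with the missing increment:]

```c
/* Count ASCII space bytes in a NUL-terminated string. Works unchanged
 * on UTF-8 input: the byte 0x20 never occurs inside a multi-byte
 * sequence, so no decoding is needed. */
int countSpaces(const char *c)
{
    int n = 0;
    for (; *c; ++c)
        if (*c == ' ')
            ++n;
    return n;
}
```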

May 25 2013

On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev
wrote:
On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines,
tabs and spaces), it makes no difference whether the string
is in ASCII or UTF-8 - the code will behave correctly in
either case, variable-width-encodings regardless.

Except that a variable-width encoding will take longer to
decode while splitting, when compared to a single-byte
encoding.

No. Are you sure you understand UTF-8 properly?

Are you sure _you_ understand it properly?  Both encodings have
to check every single character to test for whitespace, but the
single-byte encoding simply has to load each byte in the string
and compare it against the whitespace-signifying bytes, while
the variable-length code has to first load and parse
potentially 4 bytes before it can compare, because it has to go
through the state machine that you linked to above.  Obviously
the constant-width encoding will be faster.  Did I really need
to explain this?

It looks like you've missed an important property of UTF-8: lower
ASCII remains encoded the same, and UTF-8 code units encoding
non-ASCII characters cannot be confused with ASCII characters.
Code that does not need Unicode code points can treat UTF-8
strings as ASCII strings, and does not need to decode each
character individually - because a 0x20 byte will mean "space"
regardless of context. That's why a function that splits a string
by ASCII whitespace does NOT need to perform UTF-8 decoding.

I hope this clears up the misunderstanding :)

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 14:18:32 UTC, Vladimir Panteleev
wrote:
On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
Are you sure _you_ understand it properly?  Both encodings
have to check every single character to test for whitespace,
but the single-byte encoding simply has to load each byte in
the string and compare it against the whitespace-signifying
bytes, while the variable-length code has to first load and
parse potentially 4 bytes before it can compare, because it
has to go through the state machine that you linked to above.
Obviously the constant-width encoding will be faster.  Did I
really need to explain this?

It looks like you've missed an important property of UTF-8:
lower ASCII remains encoded the same, and UTF-8 code units
encoding non-ASCII characters cannot be confused with ASCII
characters. Code that does not need Unicode code points can
treat UTF-8 strings as ASCII strings, and does not need to
decode each character individually - because a 0x20 byte will
mean "space" regardless of context. That's why a function that
splits a string by ASCII whitespace does NOT need to perform
UTF-8 decoding.

I hope this clears up the misunderstanding :)

OK, you got me with this particular special case: it is not
necessary to decode every UTF-8 character if you are simply
comparing against ASCII space characters.  My mixup is because I
was unaware whether every language used its own space character in
UTF-8 or reused the ASCII space character; apparently it's the
latter.

However, my overall point stands.  You still have to check 2-4
times as many bytes if you do it the way Peter suggests, as
opposed to a single-byte encoding.  There is a shortcut: you
could also check the first byte to see if it's ASCII or not and
then skip the right number of ensuing bytes in a character's
encoding if it isn't ASCII, but at that point you have begun
partially decoding the UTF-8 encoding, which you claimed wasn't
necessary and which will degrade performance anyway.

On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
I suggest you read up on UTF-8. You really don't understand it.
There is no need to decode, you just treat the UTF-8 string as
if it is an ASCII string.

Not being aware of this shortcut doesn't mean not understanding
UTF-8.

This code will count all spaces in a string whether it is
encoded as ASCII or UTF-8:

int countSpaces(const(char)* c)
{
    int n = 0;
    while (*c)
        if (*c == ' ')
            ++n;
    return n;
}

I repeat: there is no need to decode. Please read up on UTF-8.
You do not understand it. The reason you don't need to decode
is because UTF-8 is self-synchronising.

Not quite.  The reason you don't need to decode is because of the
particular encoding scheme chosen for UTF-8, a side effect of
ASCII backwards compatibility and reusing the ASCII space
character; it has nothing to do with whether it's
self-synchronizing or not.

The code above tests for spaces only, but it works the same
when searching for any substring or single character. It is no
slower than fixed-width encoding for these operations.

It doesn't work the same "for any substring or single character,"
it works the same for any single ASCII character.

Of course it's slower than a fixed-width single-byte encoding.
You have to check every single byte of a non-ASCII character in
UTF-8, whereas a single-byte encoding only has to check a single
byte per language character.  There is a shortcut if you
partially decode the first byte in UTF-8, mentioned above, but
you seem dead-set against decoding. ;)

Again, I urge you, please read up on UTF-8. It is very well
designed.

I disagree.  It is very badly designed, but the ASCII
compatibility does hack in some shortcuts like this, which still
don't save its performance.

May 25 2013

"Peter Alexander" <peter.alexander.au gmail.com>  writes:
On Saturday, 25 May 2013 at 14:58:02 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
I suggest you read up on UTF-8. You really don't understand
it. There is no need to decode, you just treat the UTF-8
string as if it is an ASCII string.

Not being aware of this shortcut doesn't mean not understanding
UTF-8.

It's not just a shortcut, it is absolutely fundamental to the
design of UTF-8. It's like saying you understand Lisp without
being aware that everything is a list.

Also, you continuously keep stating disadvantages to UTF-8 that
are completely false, like "slicing does require decoding".
Again, completely missing the point of UTF-8. I cannot conceive
how you can claim to understand how UTF-8 works while repeatedly
demonstrating that you do not.

You are either ignorant or a successful troll. In either case,
I'm done here.

May 25 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx>  writes:
On Sat, May 25, 2013 at 03:47:41PM +0200, Joakim wrote:
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote:
On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines,
tabs and spaces), it makes no difference whether the string is
in ASCII or UTF-8 - the code will behave correctly in either
case, variable-width-encodings regardless.

Except that a variable-width encoding will take longer to decode
while splitting, when compared to a single-byte encoding.

No. Are you sure you understand UTF-8 properly?

Are you sure _you_ understand it properly?  Both encodings have to
check every single character to test for whitespace, but the
single-byte encoding simply has to load each byte in the string and
compare it against the whitespace-signifying bytes, while the
variable-length code has to first load and parse potentially 4 bytes
before it can compare, because it has to go through the state
machine that you linked to above.  Obviously the constant-width
encoding will be faster.  Did I really need to explain this?

[...]

Have you actually tried to write a whitespace splitter for UTF-8? Do you
realize that you can use an ASCII whitespace splitter for UTF-8 and it
will work correctly?

There is no need to decode UTF-8 for whitespace splitting at all. There
is no need to parse anything. You just iterate over the bytes and split
on 0x20. There is no performance difference over ASCII.

As Dmitry said, UTF-8 is self-synchronizing. While current Phobos code
tries to play it safe by decoding every character, this is not necessary
in many cases.

T

--
The best compiler is between your ears. -- Michael Abrash

May 25 2013

Dmitry Olshansky <dmitry.olsh gmail.com>  writes:
On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
This is more a problem with the algorithms taking the easy way than a
problem with UTF-8. You can do all the string algorithms, including
regex, by working with the UTF-8 directly rather than converting to
UTF-32. Then the algorithms work at full speed.

I call BS on this.  There's no way working on a variable-width
encoding can be as "full speed" as a constant-width encoding. Perhaps
you mean that the slowdown is minimal, but I doubt that also.

For the record, I noticed that programmers (myself included) that had an
incomplete understanding of Unicode / UTF exaggerate this point, and
sometimes needlessly assume that their code needs to operate on
individual characters (code points), when it is in fact not so - and
that code will work just fine as if it was written to handle ASCII. The
example Walter quoted (regex - assuming you don't want Unicode ranges or
case-insensitivity) is one such case.

+1
BTW regex even with Unicode ranges and case-insensitivity is doable,
just not easy (yet).

Another thing I noticed: sometimes when you think you really need to
operate on individual characters (and that your code will not be correct
unless you do that), the assumption will be incorrect due to the
existence of combining characters in Unicode. Two of the often-quoted
use cases of working on individual code points is calculating the string
width (assuming a fixed-width font), and slicing the string - both of
these will break with combining characters if those are not accounted
for.  I believe the proper way to approach such tasks is to implement the
respective Unicode algorithms for it, which I believe are non-trivial
and for which the relative impact of the overhead of working with a
variable-width encoding is acceptable.

Another plus one. Algorithms defined on code point basis are quite
complex so that benefit of not decoding won't be that large. The benefit
of transparently special-casing ASCII in UTF-8 is far larger.

Can you post some specific cases where the benefits of a constant-width
encoding are obvious and, in your opinion, make constant-width encodings
more useful than all the benefits of UTF-8?

Also, I don't think this has been posted in this thread. Not sure if it

http://www.utf8everywhere.org/

And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

--
Dmitry Olshansky

May 25 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>  writes:
On 5/25/13 3:33 AM, Joakim wrote:
On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
This is more a problem with the algorithms taking the easy way than a
problem with UTF-8. You can do all the string algorithms, including
regex, by working with the UTF-8 directly rather than converting to
UTF-32. Then the algorithms work at full speed.

I call BS on this. There's no way working on a variable-width encoding
can be as "full speed" as a constant-width encoding. Perhaps you mean
that the slowdown is minimal, but I doubt that also.

You mentioned this a couple of times, and I wonder what makes you so
sure. On contemporary architectures small is fast and large is slow;
betting on replacing larger data with more computation is quite often a win.

Andrei

May 25 2013

Walter Bright <newshound2 digitalmars.com>  writes:
On 5/25/2013 5:43 AM, Andrei Alexandrescu wrote:
On 5/25/13 3:33 AM, Joakim wrote:
On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
This is more a problem with the algorithms taking the easy way than a
problem with UTF-8. You can do all the string algorithms, including
regex, by working with the UTF-8 directly rather than converting to
UTF-32. Then the algorithms work at full speed.

I call BS on this. There's no way working on a variable-width encoding
can be as "full speed" as a constant-width encoding. Perhaps you mean
that the slowdown is minimal, but I doubt that also.

You mentioned this a couple of times, and I wonder what makes you so sure. On
contemporary architectures small is fast and large is slow; betting on
replacing
larger data with more computation is quite often a win.

On the other hand, Joakim even admits his single byte encoding is variable
length, as otherwise he simply dismisses the rarely used (!) Chinese, Japanese,
and Korean languages, as well as any text that contains words from more than
one language.

I suspect he's trolling us, and quite successfully.

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 19:30:25 UTC, Walter Bright wrote:
On the other hand, Joakim even admits his single byte encoding
is variable length, as otherwise he simply dismisses the rarely
used (!) Chinese, Japanese, and Korean languages, as well as
any text that contains words from more than one language.

I have noted from the beginning that these large alphabets have
to be encoded to two bytes, so it is not a true constant-width
encoding if you are mixing one of those languages into a
single-byte encoded string.  But this "variable length" encoding
is so much simpler than UTF-8, there's no comparison.

I suspect he's trolling us, and quite successfully.

Ha, I wondered who would pull out this insult, quite surprised to
see it's Walter.  It seems to be the trend on the internet to
accuse anybody you disagree with of trolling, I am honestly
surprised to see Walter stoop so low.  Considering I'm the only
one making any cogent arguments here, perhaps I should wonder if
you're all trolling me. ;)

On Saturday, 25 May 2013 at 19:35:42 UTC, Walter Bright wrote:
I suspect the Chinese, Koreans, and Japanese would take
exception to being called irrelevant.

Irrelevant only because they are a small subset of the UCS.  I
have noted that they would also be handled by a two-byte encoding.

Good luck with your scheme that can't handle languages written
by billions of people!

So let's see: first you say that my scheme has to be variable
length because I am using two bytes to handle these languages,
then you claim I don't handle these languages.  This kind of
blatant contradiction within two posts can only be called...
trolling!

May 25 2013

Walter Bright <newshound2 digitalmars.com>  writes:
On 5/25/2013 1:03 PM, Joakim wrote:
On Saturday, 25 May 2013 at 19:30:25 UTC, Walter Bright wrote:
On the other hand, Joakim even admits his single byte encoding is variable
length, as otherwise he simply dismisses the rarely used (!) Chinese,
Japanese, and Korean languages, as well as any text that contains words from
more than one language.

I have noted from the beginning that these large alphabets have to be encoded
to two bytes, so it is not a true constant-width encoding if you are mixing one
of those languages into a single-byte encoded string.  But this "variable
length" encoding is so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's variable length. You
overlook that I've had to deal with this. It isn't "simpler", there's actually
more work to write code that adapts to one or two byte encodings.

I suspect he's trolling us, and quite successfully.

Ha, I wondered who would pull out this insult, quite surprised to see it's
Walter.  It seems to be the trend on the internet to accuse anybody you
disagree with of trolling, I am honestly surprised to see Walter stoop so low.
Considering I'm the only one making any cogent arguments here, perhaps I should
wonder if you're all trolling me. ;)

On Saturday, 25 May 2013 at 19:35:42 UTC, Walter Bright wrote:
I suspect the Chinese, Koreans, and Japanese would take exception to being
called irrelevant.

Irrelevant only because they are a small subset of the UCS.  I have noted that
they would also be handled by a two-byte encoding.

Good luck with your scheme that can't handle languages written by billions of
people!

So let's see: first you say that my scheme has to be variable length because I
am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard Chinese. You cannot have
it both ways. Code to deal with two bytes is significantly different than code
to deal with one. That means you've got a conditional in your generic code -
that isn't going to be faster than the conditional for UTF-8.

then you claim I don't handle
these languages.  This kind of blatant contradiction within two posts can only
be called... trolling!

You gave some vague handwaving about it, and then dismissed it as irrelevant,
along with more handwaving about what to do with text that has embedded words
in multiple languages.

Worse, there are going to be more than 256 of these encodings - you can't even
have a byte to specify them. Remember, Unicode has approximately 256,000
characters in it. How many code pages is that?

I was being kind saying you were trolling, as otherwise I'd be saying your
scheme was, to be blunt, absurd.

---------------------------------------

I'll be the first to admit that a lot of great ideas have been initially
dismissed by the experts as absurd. If you really believe in this, I recommend
that you write it up as a real article, taking care to fill in all the
handwaving with something specific, and include some benchmarks to prove your
performance claims. Post your article on reddit, stackoverflow, hackernews,
etc., and look for fertile ground for it. I'm sorry you're not finding fertile
ground here (so far, nobody has agreed with any of your points), and this is
the wrong place for such proposals anyway, as D is simply not going to switch
over to it.

Remember, extraordinary claims require extraordinary evidence, not handwaving
and assumptions disguised as bold assertions.

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
I have noted from the beginning that these large alphabets
have to be encoded to
two bytes, so it is not a true constant-width encoding if you
are mixing one of
those languages into a single-byte encoded string.  But this
"variable length"
encoding is so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's
variable length. You overlook that I've had to deal with this.
It isn't "simpler", there's actually more work to write code
that adapts to one or two byte encodings.

It is variable length, with the advantage that only strings
containing a few Asian languages are variable-length, as opposed
to UTF-8 having every non-English language string be
variable-length.  It may be more work to write library code to
handle my encoding, perhaps, but efficiency and ease of use are
paramount.

So let's see: first you say that my scheme has to be variable
length because I
am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard Chinese.
You cannot have it both ways. Code to deal with two bytes is
significantly different than code to deal with one. That means
you've got a conditional in your generic code - that isn't
going to be faster than the conditional for UTF-8.

Hah, I have explicitly said several times that I'd use a two-byte
encoding for Chinese and I already acknowledged that such a
predominantly single-byte encoding is still variable-length.  The
problem is that _you_ try to have it both ways: first you claimed
it is variable-length because I support Chinese that way, then
you claimed I don't support Chinese.

Yes, there will be conditionals, just as there are several
conditionals in phobos depending on whether a language supports
uppercase or not.  The question is whether the conditionals for
single-byte encoding will execute faster than decoding every
UTF-8 character.  This is a matter of engineering judgement, I
see no reason why you think decoding every UTF-8 character is
faster.

then you claim I don't handle
these languages.  This kind of blatant contradiction within
two posts can only
be called... trolling!

You gave some vague handwaving about it, and then dismissed it
as irrelevant, along with more handwaving about what to do with
text that has embedded words in multiple languages.

If it was mere "vague handwaving," how did you know I planned to
use two bytes to encode Chinese?  I'm not sure why you're

I didn't "handwave" about multi-language strings, I gave specific
ideas about how they might be implemented.  I'm not claiming to
have a bullet-proof and detailed single-byte encoding spec, just
spitballing some ideas on how to do it better than the abominable
UTF-8.

Worse, there are going to be more than 256 of these encodings -
you can't even have a byte to specify them. Remember, Unicode
has approximately 256,000 characters in it. How many code pages
is that?

There are 72 modern scripts in Unicode 6.1, 28 ancient scripts,
maybe another 50 symbolic sets.  That leaves space for another
100 or so new scripts.  Maybe you are so worried about
future-proofing that you'd use two bytes to signify the alphabet,
but I wouldn't.  I think it's more likely that we'll ditch
scripts than add them. ;) Most of those symbol sets should not be
in UCS.

I was being kind saying you were trolling, as otherwise I'd be
saying your scheme was, to be blunt, absurd.

I think it's absurd to use a self-synchronizing text encoding
from 20 years ago, that is really only useful when streaming
text, which nobody does today.  There may have been a time when
ASCII compatibility was paramount, when nobody cared about
internationalization and almost all libraries only took ASCII
input: that is not the case today.

I'll be the first to admit that a lot of great ideas have been
initially dismissed by the experts as absurd. If you really
believe in this, I recommend that you write it up as a real
article, taking care to fill in all the handwaving with
something specific, and include some benchmarks to prove your
performance claims. Post your article on reddit, stackoverflow,
hackernews, etc., and look for fertile ground for it. I'm sorry
you're not finding fertile ground here (so far, nobody has
agreed with any of your points), and this is the wrong place
for such proposals anyway, as D is simply not going to switch
over to it.

Let me admit in return that I might be completely wrong about my
single-byte encoding representing a step forward from UTF-8.
While this discussion has produced no argument that I'm wrong, it's
possible we've all missed something salient, some deal-breaker.
As I said before, I'm not proposing that D "switch over."  I was
simply asking people who know or at the very least use UTF-8 more
than most, as a result of employing one of the few languages with
Unicode support baked in, why they think UTF-8 is a good idea.

I was hoping for a technical discussion on the merits, before I
went ahead and implemented this single-byte encoding.  Since
nobody has been able to point out a reason for why my encoding
wouldn't be much better than UTF-8, I see no reason not to go
forward with my implementation.  I may write something up after
implementation: most people don't care about ideas, only results,
to the point where almost nobody can reason at all about ideas.

Remember, extraordinary claims require extraordinary evidence,
not handwaving and assumptions disguised as bold assertions.

I don't think my claims are extraordinary or backed by
"handwaving and assumptions."  Some people can reason about such
possible encodings, even in the incomplete form I've sketched
out, without having implemented them, if they know what they're
doing.

On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
On 5/25/2013 2:51 PM, Walter Bright wrote:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would contain a
single byte for every language used in the string, along with
multiple index bytes to signify the start and finish of every run
of single-language characters in the string. So, a list of
languages and a list of pure single-language substrings.

Please implement the simple C function strstr() with this
simple scheme, and
post it here.

http://www.digitalmars.com/rtl/string.html#strstr

I'll go first. Here's a simple UTF-8 version in C. It's not the
fastest way to do it, but at least it is correct:
----------------------------------
char *strstr(const char *s1, const char *s2) {
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *) s1;
    char c2 = *s2;
    while (len2 <= len1) {
        if (c2 == *s1)
            if (memcmp(s2, s1, len2) == 0)
                return (char *) s1;
        s1++;
        len1--;
    }
    return NULL;
}

There is no question that a UTF-8 implementation of strstr can be
simpler to write in C and D for multi-language strings that
include Korean/Chinese/Japanese.  But while the strstr
implementation for my encoding would contain more conditionals
and lines of code, it would be far more efficient.  For instance,
because you know where all the language substrings are from the
header, you can potentially rule out searching vast swathes of
the string, because they don't contain the same languages or
lengths as the string you're searching for.

Even if you're searching a single-language string, which won't
have those speedups, your naive implementation checks every byte,
even continuation bytes, in UTF-8 to see if they might match the
first letter of the search string, even though no continuation
byte will match.  You can avoid this by partially decoding the
leading bytes of UTF-8 characters and skipping over continuation
bytes, as I've mentioned earlier in this thread, but you've then
added more lines of code to your pretty yet simple function and
slowed down every iteration of the while loop.

My single-byte encoding has none of these problems, in fact, it's
much faster and uses less memory for the same function, while
allowing speedups not available to UTF-8.

Finally, being able to write simple yet inefficient functions
like this is not the test of a good encoding, as strstr is a
library function, and making library developers' lives easier is
a low priority for any good format.  The primary goals are ease
of use for library consumers, ie app developers, and speed and
efficiency of the code.  You are trading on the latter two for
the former with this implementation.  That is not a good tradeoff.

Perhaps it was a good trade 20 years ago when everyone rolled
their own code and nobody bothered waiting for those floppy disks
to arrive with expensive library code.  It is not a good trade
today.

May 26 2013

"Declan" <oyscal 163.com>  writes:
On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
I have noted from the beginning that these large alphabets
have to be encoded to
two bytes, so it is not a true constant-width encoding if you
are mixing one of
those languages into a single-byte encoded string.  But this
"variable length"
encoding is so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's
variable length. You overlook that I've had to deal with this.
It isn't "simpler", there's actually more work to write code
that adapts to one or two byte encodings.

It is variable length, with the advantage that only strings
containing a few Asian languages are variable-length, as
opposed to UTF-8 having every non-English language string be
variable-length.  It may be more work to write library code to
handle my encoding, perhaps, but efficiency and ease of use are
paramount.

So let's see: first you say that my scheme has to be variable
length because I
am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard
Chinese. You cannot have it both ways. Code to deal with two
bytes is significantly different than code to deal with one.
That means you've got a conditional in your generic code -
that isn't going to be faster than the conditional for UTF-8.

Hah, I have explicitly said several times that I'd use a
two-byte encoding for Chinese and I already acknowledged that
such a predominantly single-byte encoding is still
variable-length.  The problem is that _you_ try to have it both
ways: first you claimed it is variable-length because I support
Chinese that way, then you claimed I don't support Chinese.

Yes, there will be conditionals, just as there are several
conditionals in phobos depending on whether a language supports
uppercase or not.  The question is whether the conditionals for
single-byte encoding will execute faster than decoding every
UTF-8 character.  This is a matter of engineering judgement, I
see no reason why you think decoding every UTF-8 character is
faster.

then you claim I don't handle
these languages.  This kind of blatant contradiction within
two posts can only
be called... trolling!

You gave some vague handwaving about it, and then dismissed it
as irrelevant, along with more handwaving about what to do
with text that has embedded words in multiple languages.

If it was mere "vague handwaving," how did you know I planned
to use two bytes to encode Chinese?  I'm not sure why you're

I didn't "handwave" about multi-language strings, I gave
specific ideas about how they might be implemented.  I'm not
claiming to have a bullet-proof and detailed single-byte
encoding spec, just spitballing some ideas on how to do it
better than the abominable UTF-8.

Worse, there are going to be more than 256 of these encodings
- you can't even have a byte to specify them. Remember,
Unicode has approximately 256,000 characters in it. How many
code pages is that?

There are 72 modern scripts in Unicode 6.1, 28 ancient scripts,
maybe another 50 symbolic sets.  That leaves space for another
100 or so new scripts.  Maybe you are so worried about
future-proofing that you'd use two bytes to signify the
alphabet, but I wouldn't.  I think it's more likely that we'll
ditch scripts than add them. ;) Most of those symbol sets
should not be in UCS.

I was being kind saying you were trolling, as otherwise I'd be
saying your scheme was, to be blunt, absurd.

I think it's absurd to use a self-synchronizing text encoding
from 20 years ago, that is really only useful when streaming
text, which nobody does today.  There may have been a time when
ASCII compatibility was paramount, when nobody cared about
internationalization and almost all libraries only took ASCII
input: that is not the case today.

I'll be the first to admit that a lot of great ideas have been
initially dismissed by the experts as absurd. If you really
believe in this, I recommend that you write it up as a real
article, taking care to fill in all the handwaving with
something specific, and include some benchmarks to prove your
performance claims. Post your article on reddit,
stackoverflow, hackernews, etc., and look for fertile ground
for it. I'm sorry you're not finding fertile ground here (so
far, nobody has agreed with any of your points), and this is
the wrong place for such proposals anyway, as D is simply not
going to switch over to it.

Let me admit in return that I might be completely wrong about
my single-byte encoding representing a step forward from UTF-8.
While this discussion has produced no argument that I'm wrong,
it's possible we've all missed something salient, some
deal-breaker.  As I said before, I'm not proposing that D
"switch over."  I was simply asking people who know or at the
very least use UTF-8 more than most, as a result of employing
one of the few languages with Unicode support baked in, why
they think UTF-8 is a good idea.

I was hoping for a technical discussion on the merits, before I
went ahead and implemented this single-byte encoding.  Since
nobody has been able to point out a reason for why my encoding
wouldn't be much better than UTF-8, I see no reason not to go
forward with my implementation.  I may write something up after
implementation: most people don't care about ideas, only
results, to the point where almost nobody can reason at all

Remember, extraordinary claims require extraordinary evidence,
not handwaving and assumptions disguised as bold assertions.

I don't think my claims are extraordinary or backed by
"handwaving and assumptions."  Some people can reason about
such possible encodings, even in the incomplete form I've
sketched out, without having implemented them, if they know
what they're doing.

On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
On 5/25/2013 2:51 PM, Walter Bright wrote:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would contain a
single byte for every language used in the string, along with
multiple index bytes to signify the start and finish of every run
of single-language characters in the string. So, a list of
languages and a list of pure single-language substrings.

Please implement the simple C function strstr() with this
simple scheme, and
post it here.

http://www.digitalmars.com/rtl/string.html#strstr

I'll go first. Here's a simple UTF-8 version in C. It's not
the fastest way to do it, but at least it is correct:
----------------------------------
char *strstr(const char *s1, const char *s2) {
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *) s1;
    char c2 = *s2;
    while (len2 <= len1) {
        if (c2 == *s1)
            if (memcmp(s2, s1, len2) == 0)
                return (char *) s1;
        s1++;
        len1--;
    }
    return NULL;
}

There is no question that a UTF-8 implementation of strstr can
be simpler to write in C and D for multi-language strings that
include Korean/Chinese/Japanese.  But while the strstr
implementation for my encoding would contain more conditionals
and lines of code, it would be far more efficient.  For
instance, because you know where all the language substrings
are from the header, you can potentially rule out searching
vast swathes of the string, because they don't contain the same
languages or lengths as the string you're searching for.
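For what it's worth, the pruning idea can be sketched in C. Everything below is hypothetical: the post never specifies a header layout, so the struct, field widths, and names are all my own guesses, purely for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical run-table entry for the proposed header: one entry
 * per single-language run in the string.  Names and field widths
 * are my own guesses; the post does not pin them down. */
typedef struct {
    uint8_t  lang;    /* code-page id of this single-language run */
    uint32_t start;   /* byte offset where the run begins */
    uint32_t length;  /* run length in bytes */
} LangRun;

/* The pruning described above, for a single-language needle: a run
 * can only contain a match if it is at least as long as the needle
 * and uses one of the needle's languages. */
int run_can_match(const LangRun *run, const uint8_t *needle_langs,
                  size_t n_langs, size_t needle_len)
{
    if (run->length < needle_len)
        return 0;
    for (size_t i = 0; i < n_langs; i++)
        if (needle_langs[i] == run->lang)
            return 1;
    return 0;
}
```

Whether maintaining such a table actually beats UTF-8's decode branch in practice is exactly the engineering question being argued in this thread.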

Even if you're searching a single-language string, which won't
have those speedups, your naive implementation checks every
byte, even continuation bytes, in UTF-8 to see if they might
match the first letter of the search string, even though no
continuation byte will match.  You can avoid this by partially
decoding the leading bytes of UTF-8 characters and skipping
over continuation bytes, as I've mentioned earlier in this
thread, in each iteration of the while loop.
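The continuation-byte skip being described comes down to a single bit test, since every continuation byte in well-formed UTF-8 has the form 10xxxxxx. A small sketch in C (the helper and its name are mine, not from the thread):

```c
#include <stddef.h>

/* Return a pointer to the next non-continuation byte equal to c2,
 * or NULL if none is found.  Continuation bytes (10xxxxxx) can
 * never start a match, so they are skipped without comparison. */
const char *find_lead_byte(const char *s, char c2)
{
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) == 0x80)
            continue;   /* continuation byte: cannot be a lead byte */
        if (*s == c2)
            return s;
    }
    return NULL;
}
```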

My single-byte encoding has none of these problems, in fact,
it's much faster and uses less memory for the same function,
while providing additional speedups, from the header, that are
not available to UTF-8.

Finally, being able to write simple yet inefficient functions
like this is not the test of a good encoding, as strstr is a
library function, and making library developers' lives easier
is a low priority for any good format.  The primary goals are
ease of use for library consumers, ie app developers, and speed
and efficiency of the code.  You are trading the latter two
for the former with this implementation.  That is not a good
trade.

Perhaps it was a good trade 20 years ago when everyone rolled
their own code and nobody bothered waiting for those floppy
disks to arrive with expensive library code.  It is not a good
trade today.

I服了u ("OK, you win"), I'm thinking: does your name mean "joking"?

May 26 2013

"John Colvin" <john.loughran.colvin gmail.com>  writes:
On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
I have noted from the beginning that these large alphabets
have to be encoded to
two bytes, so it is not a true constant-width encoding if you
are mixing one of
those languages into a single-byte encoded string.  But this
"variable length"
encoding is so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's
variable length. You overlook that I've had to deal with this.
It isn't "simpler", there's actually more work to write code
that adapts to one or two byte encodings.

It is variable length, with the advantage that only strings
containing a few Asian languages are variable-length, as
opposed to UTF-8 having every non-English language string be
variable-length.  It may be more work to write library code to
handle my encoding, perhaps, but efficiency and ease of use are
paramount.

So let's see: first you say that my scheme has to be variable
length because I
am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard
Chinese. You cannot have it both ways. Code to deal with two
bytes is significantly different than code to deal with one.
That means you've got a conditional in your generic code -
that isn't going to be faster than the conditional for UTF-8.

Hah, I have explicitly said several times that I'd use a
two-byte encoding for Chinese and I already acknowledged that
such a predominantly single-byte encoding is still
variable-length.  The problem is that _you_ try to have it both
ways: first you claimed it is variable-length because I support
Chinese that way, then you claimed I don't support Chinese.

Yes, there will be conditionals, just as there are several
conditionals in phobos depending on whether a language supports
uppercase or not.  The question is whether the conditionals for
single-byte encoding will execute faster than decoding every
UTF-8 character.  This is a matter of engineering judgement, I
see no reason why you think decoding every UTF-8 character is
faster.

then you claim I don't handle
these languages.  This kind of blatant contradiction within
two posts can only
be called... trolling!

You gave some vague handwaving about it, and then dismissed it
as irrelevant, along with more handwaving about what to do
with text that has embedded words in multiple languages.

If it was mere "vague handwaving," how did you know I planned
to use two bytes to encode Chinese?  I'm not sure why you're

I didn't "handwave" about multi-language strings, I gave
specific ideas about how they might be implemented.  I'm not
claiming to have a bullet-proof and detailed single-byte
encoding spec, just spitballing some ideas on how to do it
better than the abominable UTF-8.

Worse, there are going to be more than 256 of these encodings
- you can't even have a byte to specify them. Remember,
Unicode has approximately 256,000 characters in it. How many
code pages is that?

There are 72 modern scripts in Unicode 6.1, 28 ancient scripts,
maybe another 50 symbolic sets.  That leaves space for another
100 or so new scripts.  Maybe you are so worried about
future-proofing that you'd use two bytes to signify the
alphabet, but I wouldn't.  I think it's more likely that we'll
ditch scripts than add them. ;) Most of those symbol sets
should not be in UCS.

I was being kind saying you were trolling, as otherwise I'd be
saying your scheme was, to be blunt, absurd.

I think it's absurd to use a self-synchronizing text encoding
from 20 years ago, that is really only useful when streaming
text, which nobody does today.  There may have been a time when
ASCII compatibility was paramount, when nobody cared about
internationalization and almost all libraries only took ASCII
input: that is not the case today.

I'll be the first to admit that a lot of great ideas have been
initially dismissed by the experts as absurd. If you really
believe in this, I recommend that you write it up as a real
article, taking care to fill in all the handwaving with
something specific, and include some benchmarks to prove your
performance claims. Post your article on reddit,
stackoverflow, hackernews, etc., and look for fertile ground
for it. I'm sorry you're not finding fertile ground here (so
far, nobody has agreed with any of your points), and this is
the wrong place for such proposals anyway, as D is simply not
going to switch over to it.

I suggest you make an attempt at writing strstr and post it. Code
speaks louder than words.

May 26 2013

Walter Bright <newshound2 digitalmars.com>  writes:
On 5/26/2013 4:31 AM, Joakim wrote:
My single-byte encoding has none of these problems, in fact, it's much faster
and uses less memory for the same function, while providing additional
speedups,
from the header, that are not available to UTF-8.

C'mon, Joakim, show us this amazing strstr() implementation for your scheme!

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 12:55:11 UTC, Walter Bright wrote:
On 5/26/2013 4:31 AM, Joakim wrote:
My single-byte encoding has none of these problems, in fact,
it's much faster
and uses less memory for the same function, while providing
additional speedups, from the header, that are not available
to UTF-8.

C'mon, Joakim, show us this amazing strstr() implementation for
your scheme!

You will see it when it's built into a fully working single-byte
encoding implementation.  I don't write toy code, particularly
inefficient functions like yours, for the reasons already given.

Heh, never seen that sketch before.  Never understood why anyone
likes this silly Monty Python stuff, from what little I've seen.

May 26 2013

On Saturday, 25 May 2013 at 20:03:59 UTC, Joakim wrote:
I have noted from the beginning that these large alphabets have
to be encoded to two bytes, so it is not a true constant-width
encoding if you are mixing one of those languages into a
single-byte encoded string.  But this "variable length"
encoding is so much simpler than UTF-8, there's no comparison.

All I can say is if you think that is simpler than UTF-8 then you
have completely the wrong idea about UTF-8.

Let me explain:

1) Take the byte at a particular offset in the string
2) If it is ASCII then we're done
3) Otherwise count the number of '1's at the start of the byte -
this is how many bytes make up the character (there's even an ASM
instruction to do this)
4) This first byte will look like '1110xxxx' for a 3 byte
character, '11110xxx' for a 4 byte character, etc.
5) All following bytes are of the form '10xxxxxx'
6) Now just concatenate all the 'x's together to get the code
point

Note that this is CONSTANT TIME, O(1) with minimal branching so
well suited to pipelining (after the initial byte the other bytes
can all be processed in parallel by the CPU) and only sequential
memory access so no cache misses, and zero additional memory
requirements.
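Those six steps translate almost line for line into C. A minimal sketch (my own code, not from the thread) that assumes well-formed input and skips validation:

```c
#include <stdint.h>

/* Decode one code point starting at s; store the sequence length
 * in *len.  Assumes valid, well-formed UTF-8 (the 10xxxxxx form of
 * continuation bytes from step 5 is not verified here). */
uint32_t utf8_decode(const unsigned char *s, int *len)
{
    unsigned char b = s[0];
    if (b < 0x80) {                    /* step 2: ASCII, done */
        *len = 1;
        return b;
    }
    int n = 0;                         /* step 3: count leading 1s */
    for (unsigned char t = b; t & 0x80; t <<= 1)
        n++;
    uint32_t cp = b & (0x7F >> n);     /* step 4: payload bits of the lead byte */
    for (int i = 1; i < n; i++)        /* step 6: six payload bits per trail byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    *len = n;
    return cp;
}
```

The leading-ones count in step 3 is where the single instruction comes in; with GCC/Clang it could be expressed for non-ASCII lead bytes as `__builtin_clz(~((uint32_t)b << 24))`.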

Now compare with your proposed scheme:

1) Look up the offset in the header using binary search: O(log N)
lots of branching
2) Look up the code page ID in a massive array of code pages to
work out how many bytes per character
3) Hope this array hasn't been paged out and is still in the cache
4) Extract that many bytes from the string and combine them into
a number
5) Look up this new number in yet another large array specific to
the code page
6) Hope this array hasn't been paged out and is still in the
cache too

This is O(log N), has lots of branching so no pipelining (every
stage depends on the result of the stage before), lots of random
memory access so lots of cache misses, lots of additional memory
requirements to store all those tables, and an algorithm that
isn't even any easier to understand.

Plus every other algorithm to operate on it except for decoding
is insanely complicated.

May 25 2013

Dmitry Olshansky <dmitry.olsh gmail.com>  writes:
25-May-2013 02:42, H. S. Teoh wrote:
On Sat, May 25, 2013 at 01:21:25AM +0400, Dmitry Olshansky wrote:
24-May-2013 21:05, Joakim wrote:

[...]

As far as Phobos is concerned, Dmitry's new std.uni module has powerful
code-generation templates that let you write code that operate directly
on UTF-8 without needing to convert to UTF-32 first.

As is there are no UTF-8 specific tables (yet), but there are tools to
create the required abstraction by hand. I plan to grow one for
std.regex that will thus be field-tested and then get into public
interface. In fact the needs of std.regex prompted me to provide more
Unicode stuff in the std.

Well, OK, maybe
we're not quite there yet, but the foundations are in place, and I'm
looking forward to the day when string functions will no longer have
implicit conversion to UTF-32, but will directly manipulate UTF-8 using
optimized state tables generated by std.uni.

Yup, but let's get the correctness part first, then performance ;)

Want small - use compression schemes which are perfectly fine and
get to the precious 1 byte per codepoint with exceptional speed.
http://www.unicode.org/reports/tr6/

+1.  Using your own encoding is perfectly fine. Just don't do that for
data interchange. Unicode was created because we *want* a single
standard to communicate with each other without stupid broken encoding
issues that used to be rampant on the web before Unicode came along.

BTW the document linked discusses _standard_ compression so that anybody
can decode that stuff. How you compress would largely affect the
compression ratio but not much beyond it.

In the bad ole days, HTML could be served in any random number of
encodings, often out-of-sync with what the server claims the encoding
is, and browsers would assume arbitrary default encodings that for the
most part *appeared* to work but are actually fundamentally b0rken.
Sometimes webpages would show up mostly-intact, but with a few
characters mangled, because of deviations / variations on codepage
interpretation, or non-standard characters being used in a particular
encoding. It was a total, utter mess, that wasted who knows how many
man-hours of programming time to work around. For data interchange on
the internet, we NEED a universal standard that everyone can agree on.

+1 on these and others :)

--
Dmitry Olshansky

May 24 2013

"Joakim" <joakim airpost.net>  writes:
On Friday, 24 May 2013 at 22:44:24 UTC, H. S. Teoh wrote:
I remember those bad ole days of gratuitously-incompatible
encodings. I
wish those days will never ever return again. You'd get a text
file in
some unknown encoding, and the only way to make any sense of it
was to
guess what encoding it might be and hope you get lucky. Not
only so, the
same language often has multiple encodings, so adding support
for a
single new language required supporting several new encodings
and being
able to tell them apart (often with no info on which they are,
if you're
lucky, or if you're unlucky, with *wrong* encoding type specs
-- for
example, I *still* get email from outdated systems that claim
to be
iso-8859 when it's actually KOI8R).

This is an argument for UCS, not UTF-8.

Prepending the encoding to the data doesn't help, because it's
pretty
much guaranteed somebody will cut-n-paste some segment of that
data and
save it without the encoding type header (or worse, some
program will
try to "fix" broken low-level code by prepending a default
encoding type
to everything, regardless of whether it's actually in that
encoding or
not), thus ensuring nobody will be able to reliably recognize
what
encoding it is down the road.

This problem already exists for UTF-8, breaking ASCII
compatibility in the process:

http://en.wikipedia.org/wiki/Byte_order_mark

Well, it's at the very least adding garbage data in the front,
just as my header would do. ;)

For all of its warts, Unicode fixed a WHOLE bunch of these
problems, and
made cross-linguistic data sane to handle without pulling out
your hair many times over.  And now we're trying to go back to that
nightmarish
old world again? No way, José!

No, I'm suggesting going back to one element of that "old world,"
single-byte encodings, but using UCS or some other standardized
character set to avoid all those incompatible code pages you had
to deal with.

If you're really concerned about encoding size, just use a
compression
library -- they're readily available these days. Internally,
the program
can just use UTF-16 for the most part -- UTF-32 is really only
necessary
if you're routinely delving outside BMP, which is very rare.

True, but you're still doubling your string size with UTF-16 and
non-ASCII text.  My concerns are the following, in order of
importance:

1. Lost programmer productivity due to these dumb variable-length
encodings.  That is the biggest loss from UTF-8's complexity.

2. Lost speed and memory due to using either an unnecessarily
complex variable-length encoding or because you translated
everything to 32-bit UTF-32 to get back to constant-width.

3. Lost bandwidth from using a fatter encoding.

As far as Phobos is concerned, Dmitry's new std.uni module has
powerful
code-generation templates that let you write code that operate
directly
on UTF-8 without needing to convert to UTF-32 first. Well, OK,
maybe
we're not quite there yet, but the foundations are in place,
and I'm
looking forward to the day when string functions will no longer
have
implicit conversion to UTF-32, but will directly manipulate
UTF-8 using
optimized state tables generated by std.uni.

There is no way this can ever be as performant as a
constant-width single-byte encoding.

+1.  Using your own encoding is perfectly fine. Just don't do
that for
data interchange. Unicode was created because we *want* a single
standard to communicate with each other without stupid broken
encoding
issues that used to be rampant on the web before Unicode came
along.

In the bad ole days, HTML could be served in any random number
of
encodings, often out-of-sync with what the server claims the
encoding
is, and browsers would assume arbitrary default encodings that
for the
most part *appeared* to work but are actually fundamentally
b0rken.
Sometimes webpages would show up mostly-intact, but with a few
characters mangled, because of deviations / variations on
codepage
interpretation, or non-standard characters being used in a
particular
encoding. It was a total, utter mess, that wasted who knows how
many
man-hours of programming time to work around. For data
interchange on
the internet, we NEED a universal standard that everyone can
agree on.

I disagree.  This is not an indictment of multiple encodings, it
is one of multiple unspecified or _broken_ encodings.  Given how
difficult UTF-8 is to get right, all you've likely done is
replace multiple broken encodings with a single encoding with
multiple broken implementations.

UTF-8, for all its flaws, is remarkably resilient to mangling
-- you can
cut-n-paste any byte sequence and the receiving end can still
make some
sense of it.  Not like the bad old days of codepages where you
just get
one gigantic block of gibberish. A properly-synchronizing UTF-8
function
can still recover legible data, maybe with only a few
characters at the
ends truncated in the worst case. I don't see how any
codepage-based
encoding is an improvement over this.
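The self-synchronization being described costs almost nothing to exploit: from any byte offset, skipping bytes of the form 10xxxxxx (at most three of them in well-formed UTF-8) lands on a character boundary. A sketch, with a helper name of my own choosing:

```c
/* Advance p to the next code-point boundary (a lead byte, or end).
 * Continuation bytes always match 10xxxxxx and lead bytes never do,
 * so at most 3 bytes are skipped in well-formed UTF-8. */
const char *utf8_resync(const char *p, const char *end)
{
    while (p < end && ((unsigned char)*p & 0xC0) == 0x80)
        p++;
    return p;
}
```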

Have you ever used this self-synchronizing feature of UTF-8?
Have you ever heard of anyone using it?  There is no reason why
this kind of limited checking of data integrity should be rolled
into the encoding itself.  Maybe everyone had plans to stream
text or something back then, but nobody does that today, so
you're good to go.

Unicode is still a "codepage-based encoding," nothing has changed
in that regard.  All UCS did is standardize a bunch of
pre-existing code pages, so that some of the redundancy was taken
out.  Unfortunately, the UTF-8 encoding then bloated the
transmission format and tempted devs to use this unnecessarily
complex format for processing too.

May 25 2013

I think you are a little confused about what unicode actually
is... Unicode has nothing to do with code pages and nobody uses
code pages any more except for compatibility with legacy
applications (with good reason!).

Unicode is:
1) A standardised numbering of a large number of characters
2) A set of standardised algorithms for operating on these
characters
3) A set of standardised encodings for efficiently encoding
sequences of these characters

You said that phobos converts UTF-8 strings to UTF-32 before
operating on them but that's not true. As it iterates over UTF-8
strings it iterates over dchars rather than chars, but that's not
in any way inefficient so I don't really see the problem.

Also your complaint that UTF-8 reserves the short characters for
the english alphabet is not really relevant - the characters with
longer encodings tend to be rarer (such as special symbols) or
carry more information (such as Chinese characters, where the
same sentence takes only about 1/3 the number of characters).

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what unicode actually
is... Unicode has nothing to do with code pages and nobody uses
code pages any more except for compatibility with legacy
applications (with good reason!).

Incorrect.

"Unicode is an effort to include all characters from previous
code pages into a single character enumeration that can be used
with a number of encoding schemes... In practice the various
Unicode character set encodings have simply been assigned their
own code page numbers, and all the other code pages have been
technically redefined as encodings for various subsets of
Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode

Unicode is:
1) A standardised numbering of a large number of characters
2) A set of standardised algorithms for operating on these
characters
3) A set of standardised encodings for efficiently encoding
sequences of these characters

What makes you think I'm unaware of this?  I have repeatedly
differentiated between UCS (1) and UTF-8 (3).

You said that phobos converts UTF-8 strings to UTF-32 before
operating on them but that's not true. As it iterates over
UTF-8 strings it iterates over dchars rather than chars, but
that's not in any way inefficient so I don't really see the
problem.

And what's a dchar?  Let's check:

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html

Of course that's inefficient, you are translating your whole
encoding over to a 32-bit encoding every time you need to process
it.  Walter as much as said so up above.

Also your complaint that UTF-8 reserves the short characters
for the english alphabet is not really relevant - the
characters with longer encodings tend to be rarer (such as
special symbols) or carry more information (such as Chinese
characters where the same sentence takes only about 1/3 the
number of characters).

The vast majority of non-english alphabets in UCS can be encoded
in a single byte.  It is your exceptions that are not relevant.

May 25 2013

On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what unicode actually
is... Unicode has nothing to do with code pages and nobody
uses code pages any more except for compatibility with legacy
applications (with good reason!).

Incorrect.

"Unicode is an effort to include all characters from previous
code pages into a single character enumeration that can be used
with a number of encoding schemes... In practice the various
Unicode character set encodings have simply been assigned their
own code page numbers, and all the other code pages have been
technically redefined as encodings for various subsets of
Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode

That confirms exactly what I just said...

You said that phobos converts UTF-8 strings to UTF-32 before
operating on them but that's not true. As it iterates over
UTF-8 strings it iterates over dchars rather than chars, but
that's not in any way inefficient so I don't really see the
problem.

And what's a dchar?  Let's check:

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html

Of course that's inefficient, you are translating your whole
encoding over to a 32-bit encoding every time you need to
process it.  Walter as much as said so up above.

Given that all the machine registers are at least 32-bits already
it doesn't make the slightest difference. The only additional
operations on top of ascii are when it's a multi-byte character,
and even then it's some simple bit manipulation which is as fast
as any variable width encoding is going to get.

The only alternatives to a variable width encoding I can see are:
- Single code page per string
This is completely useless because now you can't concatenate
strings of different code pages.

- Multiple code pages per string
This just makes everything overly complicated and is far slower
to decode what the actual character is than UTF-8.

- String with escape sequences to change code page
Can no longer access characters in the middle or end of the
string, you have to parse the entire string every time which
completely negates the benefit of a fixed width encoding.

- An encoding wide enough to store every character
This is just UTF-32.

Also your complaint that UTF-8 reserves the short characters
for the english alphabet is not really relevant - the
characters with longer encodings tend to be rarer (such as
special symbols) or carry more information (such as Chinese
characters where the same sentence takes only about 1/3 the
number of characters).

The vast majority of non-english alphabets in UCS can be
encoded in a single byte.  It is your exceptions that are not
relevant.

Well obviously... That's like saying "if you know what the exact
contents of a file are going to be anyway you can compress it to
a single byte!"

ie. It's possible to devise an encoding which will encode any
given string to an arbitrarily small size. It's still completely
useless because you'd have to know the string in advance...

- A useful encoding has to be able to handle every unicode
character
- As I've shown the only space-efficient way to do this is using
a variable length encoding like UTF-8
- Given the frequency distribution of unicode characters, UTF-8
does a pretty good job at encoding higher frequency characters in
fewer bytes.
- Yes you COULD encode non-english alphabets in a single byte but
doing so would be inefficient because it would mean the more
frequently used characters take more bytes to encode.

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what unicode actually
is... Unicode has nothing to do with code pages and nobody
uses code pages any more except for compatibility with legacy
applications (with good reason!).

Incorrect.

"Unicode is an effort to include all characters from previous
code pages into a single character enumeration that can be
used with a number of encoding schemes... In practice the
various Unicode character set encodings have simply been
assigned their own code page numbers, and all the other code
pages have been technically redefined as encodings for various
subsets of Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode

That confirms exactly what I just said...

No, it doesn't: it contradicts your claim that Unicode has
"nothing to do with code pages."  All UCS did is take a
bunch of existing code pages and standardize them into one
massive character set.  For example, ISCII was a pre-existing
single-byte encoding and Unicode "largely preserves the ISCII
layout within each block."
http://en.wikipedia.org/wiki/ISCII

All a code page is is a table of mappings, UCS is just a much
larger, standardized table of such mappings.

You said that phobos converts UTF-8 strings to UTF-32 before
operating on them but that's not true. As it iterates over
UTF-8 strings it iterates over dchars rather than chars, but
that's not in any way inefficient so I don't really see the
problem.

And what's a dchar?  Let's check:

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html

Of course that's inefficient, you are translating your whole
encoding over to a 32-bit encoding every time you need to
process it.  Walter as much as said so up above.

Given that all the machine registers are at least 32-bits
already it doesn't make the slightest difference. The only
additional operations on top of ascii are when it's a
multi-byte character, and even then it's some simple bit
manipulation which is as fast as any variable width encoding is
going to get.

I see you've abandoned without note your claim that phobos
doesn't convert UTF-8 to UTF-32 internally.  Perhaps converting
to UTF-32 is "as fast as any variable width encoding is going to
get" but my claim is that single-byte encodings will be faster.

The only alternatives to a variable width encoding I can see
are:
- Single code page per string
This is completely useless because now you can't concatenate
strings of different code pages.

I wouldn't be so fast to ditch this.  There is a real argument to
be made that strings of different languages are sufficiently
different that there should be no multi-language strings.  Is
this the best route?  I'm not sure, but I certainly wouldn't
dismiss it out of hand.

- Multiple code pages per string
This just makes everything overly complicated and is far slower
to decode what the actual character is than UTF-8.

I disagree, this would still be far faster than UTF-8,

- String with escape sequences to change code page
Can no longer access characters in the middle or end of the
string, you have to parse the entire string every time which
completely negates the benefit of a fixed width encoding.

I didn't think of this possibility, but you may be right that
it's sub-optimal.

Also your complaint that UTF-8 reserves the short characters
for the english alphabet is not really relevant - the
characters with longer encodings tend to be rarer (such as
Chinese characters, where the same sentence takes only about
1/3 the number of characters).

The vast majority of non-english alphabets in UCS can be
encoded in a single byte.  It is your exceptions that are not
relevant.

Well obviously... That's like saying "if you know what the
exact contents of a file are going to be anyway you can
compress it to a single byte!"

ie. It's possible to devise an encoding which will encode any
given string to an arbitrarily small size. It's still
completely useless because you'd have to know the string in
advance.

No, it's not the same at all.  The contents of an
arbitrary-length file cannot be compressed to a single byte, you
would have collisions galore.  But since most non-english
alphabets are less than 256 characters, they can all be uniquely
encoded in a single byte per character, with the header
determining what language's code page to use.  I don't understand

- A useful encoding has to be able to handle every unicode
character
- As I've shown the only space-efficient way to do this is
using a variable length encoding like UTF-8

You haven't shown this.

- Given the frequency distribution of unicode characters, UTF-8
does a pretty good job at encoding higher frequency characters
in fewer bytes.

No, it does a very bad job of this.  Every non-ASCII character
takes at least two bytes to encode, whereas my single-byte
encoding scheme would encode every alphabet with less than 256
characters in a single byte.

- Yes you COULD encode non-english alphabets in a single byte
but doing so would be inefficient because it would mean the
more frequently used characters take more bytes to encode.

Not sure what you mean by this.

May 25 2013

On Saturday, 25 May 2013 at 19:02:43 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what unicode
actually is... Unicode has nothing to do with code pages and
nobody uses code pages any more except for compatibility
with legacy applications (with good reason!).

Incorrect.

"Unicode is an effort to include all characters from previous
code pages into a single character enumeration that can be
used with a number of encoding schemes... In practice the
various Unicode character set encodings have simply been
assigned their own code page numbers, and all the other code
pages have been technically redefined as encodings for
various subsets of Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode

That confirms exactly what I just said...

having "nothing to do with code pages."  All UCS did is take a
bunch of existing code pages and standardize them into one
massive character set.  For example, ISCII was a pre-existing
single-byte encoding and Unicode "largely preserves the ISCII
layout within each block."
http://en.wikipedia.org/wiki/ISCII

All a code page is is a table of mappings, UCS is just a much
larger, standardized table of such mappings.

UCS does have nothing to do with code pages, it was designed as a
replacement for them. A codepage is a strict subset of the
possible characters, UCS is the entire set of possible characters.
You said that phobos converts UTF-8 strings to UTF-32 before
operating on them but that's not true. As it iterates over
UTF-8 strings it iterates over dchars rather than chars, but
that's not in any way inefficient so I don't really see the
problem.

And what's a dchar?  Let's check:

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html

Of course that's inefficient, you are translating your whole
encoding over to a 32-bit encoding every time you need to
process it.  Walter as much as said so up above.

Given that all the machine registers are at least 32-bits
already it doesn't make the slightest difference. The only
additional operations on top of ascii are when it's a
multi-byte character, and even then it's some simple bit
manipulation which is as fast as any variable width encoding
is going to get.

I see you've abandoned without note your claim that phobos
doesn't convert UTF-8 to UTF-32 internally.  Perhaps converting
to UTF-32 is "as fast as any variable width encoding is going
to get" but my claim is that single-byte encodings will be
faster.

I haven't "abandoned my claim". It's a simple fact that phobos
does not convert UTF-8 strings to UTF-32 strings before it uses
them.

ie. the difference between this:

    string mystr = ...;
    dstring temp = mystr.to!dstring;
    for (int i = 0; i < temp.length; ++i)
        process(temp[i]);

and this:

    string mystr = ...;
    size_t i = 0;
    while (i < mystr.length) {
        dchar current = decode(mystr, i);
        process(current);
    }

And if you can't see why the latter example is far more efficient
I give up...
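For what it's worth, the "simple bit manipulation" a decode step performs can be made concrete. Here is a rough sketch in Python rather than D (the helper name is invented, and it assumes well-formed input - this is not the actual Phobos implementation):

```python
def decode_utf8_at(b, i):
    """Decode one code point from bytes b at index i.
    Returns (code_point, next_index). Sketch only: no checks
    for truncated, overlong, or invalid sequences."""
    lead = b[i]
    if lead < 0x80:            # 0xxxxxxx: ASCII, one byte
        return lead, i + 1
    elif lead < 0xE0:          # 110xxxxx: two-byte sequence
        cp, n = lead & 0x1F, 2
    elif lead < 0xF0:          # 1110xxxx: three-byte sequence
        cp, n = lead & 0x0F, 3
    else:                      # 11110xxx: four-byte sequence
        cp, n = lead & 0x07, 4
    for j in range(i + 1, i + n):
        cp = (cp << 6) | (b[j] & 0x3F)   # 10xxxxxx continuations
    return cp, i + n
```

A few compares, shifts and masks per character - cheap, but still more work than indexing a fixed-width array, which is the crux of this whole argument.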

The only alternatives to a variable width encoding I can see
are:
- Single code page per string
This is completely useless because now you can't concatenate
strings of different code pages.

I wouldn't be so fast to ditch this.  There is a real argument
to be made that strings of different languages are sufficiently
different that there should be no multi-language strings.  Is
this the best route?  I'm not sure, but I certainly wouldn't
dismiss it out of hand.

- Multiple code pages per string
This just makes everything overly complicated and is far
slower to decode what the actual character is than UTF-8.

I disagree, this would still be far faster than UTF-8,

The cache misses alone caused by simply accessing the separate
header table would outweigh any savings, whereas UTF-8 decoding
takes a few assembly instructions, has perfect locality and
can be efficiently pipelined by the CPU.

Then there's all the extra processing involved combining the
headers when you concatenate strings. Plus you lose the one
benefit a fixed width encoding has because random access is no
longer possible without first finding out which header controls
the location you want to access.

- String with escape sequences to change code page
Can no longer access characters in the middle or end of the
string, you have to parse the entire string every time which
completely negates the benefit of a fixed width encoding.

I didn't think of this possibility, but you may be right that
it's sub-optimal.

Also your complaint that UTF-8 reserves the short characters
for the english alphabet is not really relevant - the
characters with longer encodings tend to be rarer (such as
Chinese characters, where the same sentence takes only about
1/3 the number of characters).

The vast majority of non-english alphabets in UCS can be
encoded in a single byte.  It is your exceptions that are not
relevant.

Well obviously... That's like saying "if you know what the
exact contents of a file are going to be anyway you can
compress it to a single byte!"

ie. It's possible to devise an encoding which will encode any
given string to an arbitrarily small size. It's still
completely useless because you'd have to know the string in
advance.

No, it's not the same at all.  The contents of an
arbitrary-length file cannot be compressed to a single byte,
you would have collisions galore.  But since most non-english
alphabets are less than 256 characters, they can all be
uniquely encoded in a single byte per character, with the
header determining what language's code page to use.  I don't

The more you know about the data you are compressing at the
time of writing the algorithm, the better the compression ratio
you can get, to the point that if you know exactly what the
file is going to contain you can compress it to nothing. This
is why you have specialised compression algorithms for images,
video, audio, etc.

It doesn't matter how few characters non-english alphabets have -
unless you know WHICH alphabet it is before-hand you can't store
it in a single byte. Since any given character could be in any
alphabet the best you can do is look at the probabilities of
different characters appearing and use shorter representations
for more common ones. (This is the basis for all lossless
compression) The english alphabet plus 0-9 and basic punctuation
are by far the most common characters used on computers so it
makes sense to use one byte for those and multiple bytes for
rarer characters.
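The byte-length tiers being argued over here are easy to check. A quick Python illustration (the sample characters are arbitrary picks from each tier): ASCII encodes to one byte, the Latin/Cyrillic/Greek supplements to two, most of the rest of the BMP (Indic, CJK) to three, and astral characters to four.

```python
# UTF-8 length of one character from several scripts.
for ch in "A", "é", "щ", "अ", "中", "😀":
    print(f"U+{ord(ch):05X}: {len(ch.encode('utf-8'))} byte(s)")
```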

- A useful encoding has to be able to handle every unicode
character
- As I've shown the only space-efficient way to do this is
using a variable length encoding like UTF-8

You haven't shown this.

If you had read my points above about the alternatives, such as
one code page per string, you would see that I had.

- Given the frequency distribution of unicode characters,
UTF-8 does a pretty good job at encoding higher frequency
characters in fewer bytes.

No, it does a very bad job of this.  Every non-ASCII character
takes at least two bytes to encode, whereas my single-byte
encoding scheme would encode every alphabet with less than 256
characters in a single byte.

And strings with mixed characters would use lots of memory and be
extremely slow. Common when using proper names, quotes, inline
translations, graphical characters, etc. etc. Not to mention the
added complexity to actually implement the algorithms.

May 25 2013

Walter Bright <newshound2 digitalmars.com>  writes:
On 5/25/2013 1:07 AM, Joakim wrote:
The vast majority of non-english alphabets in UCS can be encoded in a single
byte.  It is your exceptions that are not relevant.

I suspect the Chinese, Koreans, and Japanese would take exception to being
called irrelevant.

Good luck with your scheme that can't handle languages written by billions of
people!

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote:
You seem to think that not only UTF-8 is bad encoding but also
one unified encoding (code-space) is bad(?).

Yes, on the encoding, if it's a variable-length encoding like
UTF-8, no, on the code space.  I was originally going to title my
post, "Why Unicode?" but I have no real problem with UCS, which
merely standardized a bunch of pre-existing code pages.  Perhaps
there are a lot of problems with UCS also, I just haven't delved
into it enough to know.  My problem is with these dumb
variable-length encodings, so I was precise in the title.

Separate code spaces were the case before Unicode (and utf-8).
The problem is not only that without header text is meaningless
(no easy slicing) but the fact that encoding of data after
header strongly depends on a variety of factors - a list of
encodings actually. Now everybody has to keep a (code) page per
language to at least know if it's 2 bytes per char or 1 byte
per char or whatever. And you still work on a basis that there
is no combining marks and regional specific stuff :)

Everybody is still keeping code pages, UTF-8 hasn't changed that.
Does UTF-8 not need "to at least know if it's 2 bytes per char
or 1 byte per char or whatever?"  It has to do that also.
Everyone keeps talking about "easy slicing" as though UTF-8
provides it, but it doesn't.  Phobos turns UTF-8 into UTF-32
internally for all that ease of use, at least doubling your
string size in the process.  Correct me if I'm wrong, that was
what I read on the newsgroup sometime back.

In fact it was even "better" - nobody ever talked about a
header, they just assumed a codepage with some global setting.
Imagine
yourself creating a font rendering system these days - a hell
of an exercise in frustration (okay how do I render 0x88 ? mm
if that is in codepage XYZ then ...).

I understand that people were frustrated with all the code pages
out there before UCS standardized them, but that is a completely
different argument than my problem with UTF-8 and variable-length
encodings.  My proposed simple, header-based, constant-width
encoding could be implemented with UCS and there go all your
arguments about random code pages.

This just shows you don't care for multilingual stuff at all.
Imagine any language tutor/translator/dictionary on the Web.
For instance most languages need to intersperse ASCII (also
keep in mind e.g. HTML markup). Books often feature citations
in native language (or e.g. latin) along with translations.

This is a small segment of use and it would be handled fine by an
alternate encoding.

Now also take into account math symbols, currency symbols and
beyond. Also these days cultures are mixing in wild
combinations so you might need to see the text even if you
can't read it. Unicode is not only "encode characters from all
languages". It needs to address universal representation of
symbolics used in writing systems at large.

I take your point that it isn't just languages, but symbols also.
I see no reason why UTF-8 is a better encoding for that purpose
than the kind of simple encoding I've suggested.

We want monoculture! That is, to understand each other without all
these "par-le-vu-france?" and codepages of various
complexity(insanity).

I hate monoculture, but then I haven't had to decipher some
screwed-up codepage in the middle of the night. ;) That said, you
could standardize on UCS for your code space without using a bad
encoding like UTF-8, as I said above.

Want small - use compression schemes which are perfectly fine
and get to the precious 1byte per codepoint with exceptional
speed.
http://www.unicode.org/reports/tr6/

Correct me if I'm wrong, but it seems like that compression
scheme simply adds a header and then uses a single-byte encoding,
exactly what I suggested! :) But I get the impression that it's
only for sending over the wire, ie transmision, so all the
processing issues that UTF-8 introduces would still be there.

And borrowing the arguments from that rant: locale is
borked shit when it comes to encodings. Locales should be used
for tweaking visuals like numbers, date display and so on.

Is that worse than every API simply assuming UTF-8, as he says?
Broken locale support in the past, as you and others complain
about, doesn't invalidate the concept.  If they're screwing up
something so simple, imagine how much worse everyone is screwing
up something complex like UTF-8?

May 24 2013

Dmitry Olshansky <dmitry.olsh gmail.com>  writes:
25-May-2013 10:44, Joakim wrote:
On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote:
You seem to think that not only UTF-8 is bad encoding but also one

Yes, on the encoding, if it's a variable-length encoding like UTF-8, no,
on the code space.  I was originally going to title my post, "Why
Unicode?" but I have no real problem with UCS, which merely standardized
a bunch of pre-existing code pages.  Perhaps there are a lot of problems
with UCS also, I just haven't delved into it enough to know.  My problem
is with these dumb variable-length encodings, so I was precise in the
title.

UCS is dead and gone. Next in line to "640K is enough for everyone".
Simply put Unicode decided to take into account all diversity of
languages instead of ~80% of these. Hard to add anything else. No offense
meant but it feels like you actually live in universe that is 5-7 years
behind current state. UTF-16 (a successor to UCS) is no random-access
either. And it's shitty beyond measure, UTF-8 is a shining gem in
comparison.

Separate code spaces were the case before Unicode (and utf-8). The
problem is not only that without header text is meaningless (no easy
slicing) but the fact that encoding of data after header strongly
depends on a variety of factors - a list of encodings actually. Now
everybody has to keep a (code) page per language to at least know if
it's 2 bytes per char or 1 byte per char or whatever. And you still
work on a basis that there is no combining marks and regional specific
stuff :)

Everybody is still keeping code pages, UTF-8 hasn't changed that.

Legacy. Hard to switch overnight. There are graphs that indicate that
a few years from now you might never encounter a legacy encoding anymore,
only UTF-8/UTF-16.

Does
UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per
char or whatever?"

It's coherent in its scheme to determine that. You don't need extra
information synced to text unlike header stuff.

It has to do that also. Everyone keeps talking about
"easy slicing" as though UTF-8 provides it, but it doesn't.  Phobos
turns UTF-8 into UTF-32 internally for all that ease of use, at least
doubling your string size in the process.  Correct me if I'm wrong, that
was what I read on the newsgroup sometime back.

Indeed you are - searching for UTF-8 substring in UTF-8 string doesn't
do any decoding, and it does return you a slice of the balance of the original.

In fact it was even "better" nobody ever talked about header they just
assumed a codepage with some global setting. Imagine yourself creating
a font rendering system these days - a hell of an exercise in
frustration (okay how do I render 0x88 ? mm if that is in codepage XYZ
then ...).

I understand that people were frustrated with all the code pages out
there before UCS standardized them, but that is a completely different
argument than my problem with UTF-8 and variable-length encodings.  My
proposed simple, header-based, constant-width encoding could be
implemented with UCS and there go all your arguments about random code
pages.

No they don't - have you ever seen native Korean or Chinese codepages?
There are so many of them that there is no single sane way to deal
with them on a cross-locale basis (a problem that you simply ignore,
as noted below).

This just shows you don't care for multilingual stuff at all. Imagine
any language tutor/translator/dictionary on the Web. For instance most
languages need to intersperse ASCII (also keep in mind e.g. HTML
markup). Books often feature citations in native language (or e.g.
latin) along with translations.

This is a small segment of use and it would be handled fine by an
alternate encoding.

??? Simply makes no sense. There is no intersection between some legacy
encodings as of now. Or do you want to add N*(N-1) cross-encodings for
any combination of 2? What about 3 in one string?

Now also take into account math symbols, currency symbols and beyond.
Also these days cultures are mixing in wild combinations so you might
need to see the text even if you can't read it. Unicode is not only
"encode characters from all languages". It needs to address universal
representation of symbolics used in writing systems at large.

I take your point that it isn't just languages, but symbols also.  I see
no reason why UTF-8 is a better encoding for that purpose than the kind
of simple encoding I've suggested.

We want monoculture! That is, to understand each other without all these
"par-le-vu-france?" and codepages of various complexity(insanity).

I hate monoculture, but then I haven't had to decipher some screwed-up
codepage in the middle of the night. ;)

So you never had trouble with internationalization? What languages do you

That said, you could standardize
on UCS for your code space without using a bad encoding like UTF-8, as I
said above.

UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into
that trap (Java, Windows NT). You shouldn't.

Want small - use compression schemes which are perfectly fine and get
to the precious 1byte per codepoint with exceptional speed.
http://www.unicode.org/reports/tr6/

Correct me if I'm wrong, but it seems like that compression scheme
simply adds a header and then uses a single-byte encoding, exactly what
I suggested! :)

This is it, but it's far more flexible in the sense that it allows
multilingual strings just fine, and lone full-width unicode codepoints
as well.

But I get the impression that it's only for sending over
the wire, ie transmision, so all the processing issues that UTF-8
introduces would still be there.

Use mime-type etc. Standards are always a bit stringy and suboptimal,
their acceptance rate is one of chief advantages they have. Unicode has
horrifically large momentum now and not a single organization aside from
them tries to do this dirty work (=i18n).

And borrowing the arguments from that rant: locale is borked shit
when it comes to encodings. Locales should be used for tweaking
visuals like numbers, date display and so on.

Is that worse than every API simply assuming UTF-8, as he says? Broken
locale support in the past, as you and others complain about, doesn't
invalidate the concept.

It's combinatorial blowup and has some stone-walls to hit into. Consider
adding another encoding for "Tuva" for instance. Now you have to add 2*n
conversion routines to match it to other codepages/locales.

Beyond that - there are many things to consider in internationalization
and you would have to special case them all by codepage.

If they're screwing up something so simple,
imagine how much worse everyone is screwing up something complex like
UTF-8?

UTF-8 is pretty darn simple. BTW all it does is map [0..10FFFF] to a
sequence of octets. It does it pretty well and compatible with ASCII,
even the little rant you posted acknowledged that. Now you are either
against Unicode as a whole, or what?

--
Dmitry Olshansky

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
25-May-2013 10:44, Joakim wrote:
Yes, on the encoding, if it's a variable-length encoding like
UTF-8, no,
on the code space.  I was originally going to title my post,
"Why
Unicode?" but I have no real problem with UCS, which merely
standardized
a bunch of pre-existing code pages.  Perhaps there are a lot
of problems
with UCS also, I just haven't delved into it enough to know.

UCS is dead and gone. Next in line to "640K is enough for
everyone".

I think you are confused.  UCS refers to the Universal Character
Set, which is the backbone of Unicode:

http://en.wikipedia.org/wiki/Universal_Character_Set

You might be thinking of the unpopular UCS-2 and UCS-4 encodings,
which I have never referred to.

Separate code spaces were the case before Unicode (and
utf-8). The
problem is not only that without header text is meaningless
(no easy
slicing) but the fact that encoding of data after header
strongly
depends on a variety of factors - a list of encodings actually.
Now
everybody has to keep a (code) page per language to at least
know if
it's 2 bytes per char or 1 byte per char or whatever. And you
still
work on a basis that there is no combining marks and regional
specific
stuff :)

Everybody is still keeping code pages, UTF-8 hasn't changed
that.

Legacy. Hard to switch overnight. There are graphs that
indicate that a few years from now you might never encounter a
legacy encoding anymore, only UTF-8/UTF-16.

I didn't mean that people are literally keeping code pages.  I
meant that there's not much of a difference between code pages
with 2 bytes per char and the language character sets in UCS.

Does
UTF-8 not need "to at least know if it's 2 bytes per char or 1
byte per
char or whatever?"

It's coherent in its scheme to determine that. You don't need
extra information synced to text unlike header stuff.

?!  It's okay because you deem it "coherent in its scheme?"  I
deem headers much more coherent. :)

It has to do that also. Everyone keeps talking about
"easy slicing" as though UTF-8 provides it, but it doesn't.
Phobos
turns UTF-8 into UTF-32 internally for all that ease of use,
at least
doubling your string size in the process.  Correct me if I'm
wrong, that
was what I read on the newsgroup sometime back.

Indeed you are - searching for UTF-8 substring in UTF-8 string
doesn't do any decoding, and it does return you a slice of the
balance of the original.

Perhaps substring search doesn't strictly require decoding but
you have changed the subject: slicing does require decoding and
that's the use case you brought up to begin with.  I haven't
looked into it, but I suspect substring search not requiring
decoding is the exception for UTF-8 algorithms, not the rule.

??? Simply makes no sense. There is no intersection between
some legacy encodings as of now. Or do you want to add N*(N-1)
cross-encodings for any combination of 2? What about 3 in one
string?

I sketched two possible encodings above, neither of which would
require "cross-encodings."

We want monoculture! That is to understand each without all
these
"par-le-vu-france?" and codepages of various
complexity(insanity).

I hate monoculture, but then I haven't had to decipher some
screwed-up
codepage in the middle of the night. ;)

So you never had trouble of internationalization? What

This was meant as a point in your favor, conceding that I haven't
had to code with the terrible code pages system from the past.  I
can read and speak multiple languages, but I don't use anything
other than English text.

That said, you could standardize
on UCS for your code space without using a bad encoding like
UTF-8, as I
said above.

UCS is a myth as of ~5 years ago. Early adopters of Unicode
fell into that trap (Java, Windows NT). You shouldn't.

UCS, the character set, as noted above.  If that's a myth,
Unicode is a myth. :)

This is it, but it's far more flexible in the sense that it allows
multilingual strings just fine, and lone full-width unicode
codepoints as well.

That's only because it uses a more complex header than a single
byte for the language, which I noted could be done with my scheme
too, back when I first mentioned this unicode compression scheme.

But I get the impression that it's only for sending over
the wire, ie transmision, so all the processing issues that
UTF-8
introduces would still be there.

Use mime-type etc. Standards are always a bit stringy and
suboptimal, their acceptance rate is one of chief advantages
they have. Unicode has horrifically large momentum now and not
a single organization aside from them tries to do this dirty
work (=i18n).

You misunderstand.  I was saying that this unicode compression
scheme is meant for transmission and is probably fine for that,
precisely because it
seems to implement some version of my single-byte encoding
scheme!  You do raise a good point: the only reason why we're
likely using such a bad encoding in UTF-8 is that nobody else
wants to tackle this hairy problem.

Consider adding another encoding for "Tuva" for instance. Now
you have to add 2*n conversion routines to match it to other
codepages/locales.

Not sure what you're referring to here.

Beyond that - there are many things to consider in
internationalization and you would have to special case them
all by codepage.

Not necessarily.  But that is actually one of the advantages of
single-byte encodings, as I have noted above.  toUpper is a NOP
for a single-byte-encoded string in an Asian script; you can't
do that with a UTF-8 string.

If they're screwing up something so simple,
imagine how much worse everyone is screwing up something
complex like
UTF-8?

UTF-8 is pretty darn simple. BTW all it does is map [0..10FFFF]
to a sequence of octets. It does it pretty well and compatible
with ASCII, even the little rant you posted acknowledged that.
Now you are either against Unicode as a whole, or what?

The BOM link I gave notes that UTF-8 isn't always
ASCII-compatible.

There are two parts to Unicode.  I don't know enough about UCS,
the character set, ;) to be for it or against it, but I
acknowledge that a standardized character set may make sense.  I
am dead set against the UTF-8 variable-width encoding, for all
the reasons listed above.

On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
25-May-2013 13:05, Joakim wrote:
Nobody is talking about going back to code pages.  I'm talking
going to single-byte encodings, which do not imply the
problems that you
had with code pages way back when.

Problem is what you outline is isomorphic with code-pages.
Hence the grief of accumulated experience against them.

They may seem superficially similar but they're not.  For
example, from the beginning, I have suggested a more complex
header that can enable multi-language strings, as one possible
solution.  I don't think code pages provided that.

Well if somebody got the quest to redefine UTF-8 they *might*
come up with something that is a bit faster to decode but
shares the same properties. Hardly a life saver anyway.

Perhaps not, but I suspect programmers will flock to a
constant-width encoding that is much simpler and more efficient
than UTF-8.  Programmer productivity is the biggest loss from the
complexity of UTF-8, as I've noted before.

The world may not "abandon Unicode," but it will abandon
UTF-8, because
it's a dumb idea.  Unfortunately, such dumb ideas- XML
anyone?- often
proliferate until someone comes up with something better to
show how
dumb they are.

Even children know XML is awful, redundant shit as an interchange
format. The hierarchical document is a nice idea anyway.

_We_ both know that, but many others don't, or XML wouldn't be as
popular as it is. ;) I'm making a similar point about the more
limited success of UTF-8, ie it's still shit.

May 25 2013

"Juan Manuel Cabo" <juanmanuel.cabo gmail.com>  writes:
░░░░░░░░░ⓌⓉⒻ░
╔╗░╔╗░╔╗╔════╗╔════╗░░
║║░║║░║║╚═╗╔═╝║╔═══╝░░
║║░║║░║║░░║║░░║╚═╗░░░░
║╚═╝╚═╝║╔╗║║╔╗║╔═╝╔╗░░
╚══════╝╚╝╚╝╚╝╚╝░░╚╝░░

░░░░░░░░░░░░░░░░░░░░░░░░
█░█░█░░░░░░▐░░░░░░░░░░▐░
█░█░█▐▀█▐▀█▐░█▐▀█▐▀█▐▀█░
█░█░█▐▄█▐▄█▐▄▀▐▄█▐░█▐░█░
█▄█▄█▐▄▄▐▄▄▐░█▐▄▄▐░█▐▄█░
░░░░░░░░░░░░░░░░░░░░░░░░

--jm

May 25 2013

"limited success of UTF-8"

Becoming the de-facto standard encoding EVERYWHERE except for
windows which uses UTF-16 is hardly a failure...

I really don't understand your hatred for UTF-8 - it's simple to
decode and encode, fast and space-efficient. Fixed width
encodings are not inherently fast, the only thing they are faster
at is if you want to randomly access the Nth character instead of
the Nth byte. In the rare cases that you need to do a lot of this
kind of random access there exists UTF-32...

Any fixed width encoding which can encode every unicode character
must use at least 3 bytes, and using 4 bytes is probably going to
be faster because of alignment, so I don't see what the great
improvement over UTF-32 is going to be.

slicing does require decoding

Nope.

I didn't mean that people are literally keeping code pages.  I
meant that there's not much of a difference between code pages
with 2 bytes per char and the language character sets in UCS.

Unicode doesn't have "language character sets". The different
planes only exist for organisational purposes; they don't affect
how characters are encoded.

?!  It's okay because you deem it "coherent in its scheme?"  I
deem headers much more coherent. :)

Sure if you change the word "coherent" to mean something
completely different... Coherent means that you store related
things together, ie. everything that you need to decode a
character in the same place, not spread out between part of a
header and part of the string.

but I suspect substring search not requiring decoding is the
exception for UTF-8 algorithms, not the rule.

The only time you need to decode is when you need to do some
transformation that depends on the code point such as converting
case or identifying which character class a particular character
belongs to. Appending, slicing, copying, searching, replacing,
etc. basically all the most common text operations can all be
done without any encoding or decoding.
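One concrete reason slicing can avoid decoding: UTF-8 marks code-point boundaries in the bytes themselves, since continuation bytes are exactly those matching 10xxxxxx. A Python sketch of the boundary test (function names are made up for illustration):

```python
def at_boundary(data: bytes, i: int) -> bool:
    """True if byte index i starts a code point (or equals len).
    Continuation bytes are exactly those matching 10xxxxxx."""
    return i == len(data) or (data[i] & 0xC0) != 0x80

def prev_boundary(data: bytes, i: int) -> int:
    """Snap an arbitrary byte index back to the nearest code-point
    boundary at or before it - at most 3 steps back, no decoding."""
    while not at_boundary(data, i):
        i -= 1
    return i
```

A slice taken between two such boundaries is itself valid UTF-8, without ever computing which characters those bytes represent.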

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 18:56:42 UTC, Diggory wrote:
"limited success of UTF-8"

Becoming the de-facto standard encoding EVERYWHERE except for
windows which uses UTF-16 is hardly a failure...

So you admit that UTF-8 hasn't been used on the vast majority of
computers since the inception of Unicode.  That's what I call
limited success, thank you for agreeing with me. :)

I really don't understand your hatred for UTF-8 - it's simple
to decode and encode, fast and space-efficient. Fixed width
encodings are not inherently fast, the only thing they are
faster at is if you want to randomly access the Nth character
instead of the Nth byte. In the rare cases that you need to do
a lot of this kind of random access there exists UTF-32...

Space-efficient?  Do you even understand what a single-byte
encoding is?  Suffice to say, a single-byte encoding beats UTF-8
on all these measures, not just one.

Any fixed width encoding which can encode every unicode
character must use at least 3 bytes, and using 4 bytes is
probably going to be faster because of alignment, so I don't
see what the great improvement over UTF-32 is going to be.

Slaps head.  You don't need "at least 3 bytes" because you're
packing language info in the header.  I don't think you even know

slicing does require decoding

Nope.

Of course it does, at least partially.  There is no other way to
know where the code points are.

I didn't mean that people are literally keeping code pages.  I
meant that there's not much of a difference between code pages
with 2 bytes per char and the language character sets in UCS.

Unicode doesn't have "language character sets". The different
planes only exist for organisational purposes; they don't affect
how characters are encoded.

the different language character sets in this list:

http://en.wikipedia.org/wiki/List_of_Unicode_characters

?!  It's okay because you deem it "coherent in its scheme?"  I
deem headers much more coherent. :)

Sure, if you change the word "coherent" to mean something
completely different... Coherent means that you store related
things together, ie. everything that you need to decode a
character in the same place, not spread out between part of a

Coherent means that the organizational pieces fit together and
make sense conceptually, not that everything is stored together.
My point is that putting the language info in a header seems much
more coherent to me than ramming that info into every character.

but I suspect substring search not requiring decoding is the
exception for UTF-8 algorithms, not the rule.

The only time you need to decode is when you need to do some
transformation that depends on the code point such as
converting case or identifying which character class a
particular character belongs to. Appending, slicing, copying,
searching, replacing, etc. basically all the most common text
operations can all be done without any encoding or decoding.

Slicing by byte, which is the only way to slice without decoding,
is useless, I have to laugh that you even include it. :) All
these basic operations can be done very fast, often faster than
UTF-8, in a single-byte encoding.  Once you start talking code
points, it's no contest: UTF-8 flat out loses.

On Saturday, 25 May 2013 at 19:42:41 UTC, Diggory wrote:
All a code page is is a table of mappings, UCS is just a much
larger, standardized table of such mappings.

UCS has nothing to do with code pages; it was designed as a
replacement for them. A codepage is a strict subset of the
possible characters, UCS is the entire set of possible
characters.

"[I]t was designed as a replacement for them" by combining
several of them into a master code page and removing
redundancies.  Functionally, they are the same and historically
they maintain the same layout in at least some cases.  To then
say, UCS has "nothing to do with code pages" is just dense.

I see you've abandoned without note your claim that phobos
doesn't convert UTF-8 to UTF-32 internally.  Perhaps
converting to UTF-32 is "as fast as any variable width
encoding is going to get" but my claim is that single-byte
encodings will be faster.

I haven't "abandoned my claim". It's a simple fact that phobos
does not convert UTF-8 strings to UTF-32 strings before it uses
them.

ie. the difference between this:

    import std.conv : to;
    string mystr = ...;
    dstring temp = mystr.to!dstring;
    for (size_t i = 0; i < temp.length; ++i)
        process(temp[i]);

and this:

    import std.utf : decode;
    string mystr = ...;
    size_t i = 0;
    while (i < mystr.length) {
        dchar current = decode(mystr, i);
        process(current);
    }

And if you can't see why the latter example is far more
efficient I give up...

I take your point that phobos is often decoding by char as it
iterates through, but there are still functions in std.string
that convert the entire string, as in your first example.  The
point is that you are forced to decode everything to UTF-32,
whether by char or the entire string.  Your latter example may be
marginally more efficient but it is only useful for functions
that start from the beginning and walk the string in only one
direction, which not all operations do.

- Multiple code pages per string
This just makes everything overly complicated and is far
slower to decode what the actual character is than UTF-8.

I disagree, this would still be far faster than UTF-8,

The cache misses alone caused by simply accessing the separate
header would hurt, whereas decoding a UTF-8 character takes a few
assembly instructions, has perfect locality and can be
efficiently pipelined by the CPU.

Lol, you think a few potential cache misses is going to be slower
than repeatedly decoding, whether in assembly and pipelined or
not, every single UTF-8 character? :D

Then there's all the extra processing involved combining the
headers when you concatenate strings. Plus you lose the one
benefit a fixed width encoding has because random access is no
longer possible without first finding out which header controls
the location you want to access.

There would be a few arithmetic operations on substring indices
when concatenating strings, hardly anything.

Random access is still not only possible, it is incredibly fast
in most cases: you just have to check first if the header lists
any two-byte encodings.  This can be done once and cached as a
property of the string (set a boolean no_two_byte_encoding once
and simply have the slice operator check it before going ahead),
just as you could add a property to UTF-8 strings to allow quick
random access if they happen to be pure ASCII.  The difference is
that only strings that include the two-byte encoded
Korean/Chinese/Japanese characters would require a bit more
calculation for slicing in my scheme, whereas _every_ non-ASCII
UTF-8 string requires full decoding to allow random access.  This
is a clear win for my single-byte encoding, though maybe not the
complete demolition of UTF-8 you were hoping for. ;)
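The check described above could be sketched like this (Python used for illustration; the class name `SingleByteString` and the `no_two_byte_encoding` field are invented here to mirror the poster's hypothetical scheme, not taken from any real implementation):

```python
# Hypothetical sketch of the proposed scheme: cache whether the string
# contains any two-byte characters, and slice in O(1) when it doesn't.
class SingleByteString:
    def __init__(self, data: bytes, two_byte_runs=()):
        self.data = data
        # Byte ranges encoded two-bytes-per-char (e.g. CJK runs).
        self.two_byte_runs = tuple(two_byte_runs)
        self.no_two_byte_encoding = not self.two_byte_runs

    def char_slice(self, start, stop):
        if self.no_two_byte_encoding:
            # One byte per character: slicing is plain indexing.
            return self.data[start:stop]
        # Otherwise fall back to run-aware index arithmetic (not shown).
        raise NotImplementedError("run-aware slicing")

s = SingleByteString(b"hello world")
assert s.char_slice(0, 5) == b"hello"
```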

No, it's not the same at all.  The contents of an
arbitrary-length file cannot be compressed to a single byte,
you would have collisions galore.  But since most non-english
alphabets are less than 256 characters, they can all be
uniquely encoded in a single byte per character, with the
header determining what language's code page to use.  I don't

you are compressing you have at the time of writing the
algorithm, the better compression ratio you can get, to the
point that if you know exactly what the file is going to
contain you can compress it to nothing. This is why you have
specialised compression algorithms for images, video, audio,
etc.

This may be mostly true in general, but your specific example of
compressing down to a byte is nonsense.  For any arbitrarily long
data, there are always limits to compression.  What any of this
has to do with my single-byte encoding, I have no idea.

It doesn't matter how few characters non-english alphabets have
- unless you know WHICH alphabet it is before-hand you can't
store it in a single byte. Since any given character could be
in any alphabet the best you can do is look at the
probabilities of different characters appearing and use shorter
representations for more common ones. (This is the basis for
all lossless compression) The english alphabet plus 0-9 and
basic punctuation are by far the most common characters used on
computers so it makes sense to use one byte for those and
multiple bytes for rarer characters.

How many times have I said that "you know WHICH alphabet it is
before-hand" because that info is stored in the header?  That is
why I specifically said, from my first post, that multi-language
strings would have more complex headers, which I later pointed
out could list all the different language substrings within a
multi-language string.  Your silly exposition of how compression
works makes me wonder if you understand anything about how a
single-byte encoding would work.

Perhaps it made sense to use one byte for ASCII characters and
relegate _every other language_ to multiple bytes two decades
ago.  It doesn't make sense today.

- As I've shown the only space-efficient way to do this is
using a variable length encoding like UTF-8

You haven't shown this.

If you had read my points about multiple code pages per string
you would see that I had.

You are not packaging and transmitting the code pages with the
string, just as you do not ship the entire UCS with every UTF-8
string.  A single-byte encoding is going to be more
space-efficient for the vast majority of strings, everybody knows
this.

No, it does a very bad job of this.  Every non-ASCII character
takes at least two bytes to encode, whereas my single-byte
encoding scheme would encode every alphabet with less than 256
characters in a single byte.

And strings with mixed characters would use lots of memory and
be extremely slow. Common when using proper names, quotes,
inline translations, graphical characters, etc. etc. Not to
mention the added complexity to actually implement the
algorithms.

Ah, you have finally stumbled across the path to a good argument,
though I'm not sure how, given your seeming ignorance of how
single-byte encodings work. :) There _is_ a degenerate case with
my particular single-byte encoding (not the ones you list, which
would still be faster and use less memory than UTF-8): strings
that use many, if not all, character sets.  So the worst case
scenario might be something like a string that had 100
characters, every one from a different language.  In that case, I
think it would still be smaller than the equivalent UTF-8 string,
but not by much.

There might be some complexity in implementing the algorithms,
but on net, likely less than UTF-8, while being much more usable
for most programmers.

On Saturday, 25 May 2013 at 22:41:59 UTC, Diggory wrote:
1) Take the byte at a particular offset in the string
2) If it is ASCII then we're done
3) Otherwise count the number of '1's at the start of the byte
- this is how many bytes make up the character (there's even an
ASM instruction to do this)
4) This first byte will look like '1110xxxx' for a 3 byte
character, '11110xxx' for a 4 byte character, etc.
5) All following bytes are of the form '10xxxxxx'
6) Now just concatenate all the 'x's together and add an offset
to get the code point
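The six steps above amount to the following (a Python sketch of the same algorithm; it omits validation of the continuation bytes, which a real decoder would perform):

```python
# A direct transcription of the six decoding steps: constant time per
# character, branch-light, sequential memory access only.
def decode_utf8_at(buf: bytes, offset: int):
    """Return (code point, number of bytes) at `offset`."""
    b0 = buf[offset]
    if b0 < 0x80:                        # step 2: ASCII fast path
        return b0, 1
    # Step 3: count the leading 1-bits to get the sequence length
    # (hardware offers CLZ-style instructions for this).
    nbytes = 8 - (~b0 & 0xFF).bit_length()
    # Step 4: mask off the length prefix of the first byte.
    codepoint = b0 & (0x7F >> nbytes)
    # Steps 5-6: each continuation byte is 10xxxxxx; shift in its
    # low six bits.
    for b in buf[offset + 1:offset + nbytes]:
        codepoint = (codepoint << 6) | (b & 0x3F)
    return codepoint, nbytes

assert decode_utf8_at("é".encode("utf-8"), 0) == (0xE9, 2)
assert decode_utf8_at("€".encode("utf-8"), 0) == (0x20AC, 3)
```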

Not sure why you chose to write this basic UTF-8 stuff out, other
than to bluster on without much use.

Note that this is CONSTANT TIME, O(1) with minimal branching so
well suited to pipelining (after the initial byte the other
bytes can all be processed in parallel by the CPU) and only
sequential memory access so no cache misses, and zero

It is constant time _per character_.  You have to do it for every
character you access, which is where the cost adds up.

1) Look up the offset in the header using binary search: O(log
N), lots of branching

depends on the number of languages used and how many substrings
there are.  There are worst-case scenarios that could approach
something like log(n) but extremely unlikely in real-world use.
Most of the time, this would be O(1).

2) Look up the code page ID in a massive array of code pages to
work out how many bytes per character

Hardly, this could be done by a simple lookup function that
simply checked if the language was one of the few alphabets that
require two bytes.

3) Hope this array hasn't been paged out and is still in the
cache
4) Extract that many bytes from the string and combine them
into a number

Lol, I love how you think this is worth listing as a separate
step for the few two-byte encodings, yet have no problem with
doing this for every non-ASCII character in UTF-8.

5) Look up this new number in yet another large array specific
to the code page

Why?  The language byte and number uniquely specify the
character, just like your Unicode code point above.  If you were
simply encoding the UCS in a single-byte encoding, you would
arrange your scheme in such a way to trivially be able to
generate the UCS code point using these two bytes.
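The kind of mapping being described might look like this (a purely hypothetical sketch: the code-page ids and the `BLOCK_BASE` table are invented for illustration, though the block offsets shown are real Unicode block starts):

```python
# Hypothetical: give each one-byte alphabet a base offset into UCS, so
# (language byte, character byte) -> code point is one table lookup
# plus an add, with no per-string decoding state.
BLOCK_BASE = {
    0x00: 0x0000,   # Basic Latin / ASCII
    0x01: 0x0370,   # Greek and Coptic block
    0x02: 0x0400,   # Cyrillic block
}

def to_codepoint(lang: int, ch: int) -> int:
    return BLOCK_BASE[lang] + ch

assert chr(to_codepoint(0x02, 0x10)) == "А"  # U+0410, Cyrillic capital A
```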

This is O(log N), has lots of branching so no pipelining (every
stage depends on the result of the stage before), lots of
random memory access so lots of cache misses, lots of
additional memory requirements to store all those tables, and
an algorithm that isn't even any easier to understand.

Wrong on practically every count, as detailed above.

Plus every other algorithm to operate on it except for decoding
is insanely complicated.

They are still much _less_ complicated than UTF-8, that's the
comparison that matters.

May 26 2013

Dmitry Olshansky <dmitry.olsh gmail.com>  writes:
25-May-2013 22:26, Joakim writes:
On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
25-May-2013 10:44, Joakim writes:
Yes, on the encoding, if it's a variable-length encoding like UTF-8, no,
on the code space.  I was originally going to title my post, "Why
Unicode?" but I have no real problem with UCS, which merely standardized
a bunch of pre-existing code pages.  Perhaps there are a lot of problems
with UCS also, I just haven't delved into it enough to know.

UCS is dead and gone. Next in line to "640K is enough for everyone".

I think you are confused.  UCS refers to the Universal Character Set,
which is the backbone of Unicode:

http://en.wikipedia.org/wiki/Universal_Character_Set

You might be thinking of the unpopular UCS-2 and UCS-4 encodings, which
I have never referred to.

Yeah got confused. So sorry about that.

Separate code spaces were the case before Unicode (and utf-8). The
problem is not only that without header text is meaningless (no easy
slicing) but the fact that encoding of data after header strongly
depends on a variety of factors - a list of encodings actually. Now
everybody has to keep a (code) page per language to at least know if
it's 2 bytes per char or 1 byte per char or whatever. And you still
work on a basis that there is no combining marks and regional specific
stuff :)

Everybody is still keeping code pages, UTF-8 hasn't changed that.

Legacy. Hard to switch overnight. There are graphs that indicate
that a few years from now you might never encounter a legacy
encoding anymore, only UTF-8/UTF-16.

I didn't mean that people are literally keeping code pages.  I meant
that there's not much of a difference between code pages with 2 bytes
per char and the language character sets in UCS.

You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I take you right you propose to define string as a header that
denotes a set of windows in code space? I still fail to see how that
would scale, see below.

It has to do that also. Everyone keeps talking about
"easy slicing" as though UTF-8 provides it, but it doesn't. Phobos
turns UTF-8 into UTF-32 internally for all that ease of use, at least
doubling your string size in the process.  Correct me if I'm wrong, that
was what I read on the newsgroup sometime back.

Indeed you are - searching for a UTF-8 substring in a UTF-8
string doesn't do any decoding, and it does return you a slice of
the balance of the original.

Perhaps substring search doesn't strictly require decoding but you have
changed the subject: slicing does require decoding and that's the use
case you brought up to begin with.  I haven't looked into it, but I
suspect substring search not requiring decoding is the exception for
UTF-8 algorithms, not the rule.

Mm... strictly speaking (let's turn that argument backwards) - what
are algorithms that require slicing say [5..$] of string without ever
looking at it left to right, searching etc.? ??? Simply makes no
sense.

There is no intersection between some legacy encodings as of now. Or
do you want to add N*(N-1) cross-encodings for any combination of 2?
What about 3 in one string?

I sketched two possible encodings above, none of which would require
"cross-encodings."

We want monoculture! That is to understand each other without all
these "par-le-vu-france?" and codepages of various
complexity(insanity).

I hate monoculture, but then I haven't had to decipher some
screwed-up codepage in the middle of the night. ;)

So you never had trouble with internationalization? What languages do
you use (read/speak/etc.)?

This was meant as a point in your favor, conceding that I haven't had
to code with the terrible code pages system from the past.  I can
read and speak multiple languages, but I don't use anything other
than English text.

Okay then.

That said, you could standardize on UCS for your code space without
using a bad encoding like UTF-8, as I said above.

UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into
that trap (Java, Windows NT). You shouldn't.

UCS, the character set, as noted above.  If that's a myth, Unicode is
a myth. :)

Yeah, that was a mishap on my behalf. I think I've seen your 2-byte
argument way too often and it got concatenated to UCS, forming UCS-2
:)

This is it, but it's far more flexible in the sense that it allows
multilingual strings just fine and lone full-width unicode codepoints
as well.

That's only because it uses a more complex header than a single byte
for the language, which I noted could be done with my scheme, by
adding a more complex header,

What would it look like? Or how will the processing go?

long before you mentioned this unicode compression scheme.

It does inline headers or rather tags. That hop between fixed char
windows. It's not random-access nor claims to be.
But I get the impression that it's only for sending over the wire,
i.e. transmission, so all the processing issues that UTF-8 introduces
would still be there.

Use mime-type etc. Standards are always a bit stringy and suboptimal,
their acceptance rate is one of the chief advantages they have.
Unicode has horrifically large momentum now and not a single
organization aside from them tries to do this dirty work (=i18n).

You misunderstand. I was saying that this unicode compression scheme
doesn't help you with string processing, it is only for transmission
and is probably fine for that, precisely because it seems to
implement some version of my single-byte encoding scheme!  You do
raise a good point: the only reason why we're likely using such a bad
encoding in UTF-8 is that nobody else wants to tackle this hairy
problem.

Yup, where have you been say almost 10 years ago? :)

Consider adding another encoding for "Tuva" for instance. Now you
have to add 2*n conversion routines to match it to other
codepages/locales.

Not sure what you're referring to here.

If you adopt the "map to UCS" policy then nothing. Beyond that -
there are many things to consider in internationalization and you
would have to special-case them all by codepage.

Not necessarily.  But that is actually one of the advantages of
single-byte encodings, as I have noted above.  toUpper is a NOP for a
single-byte encoding string with an Asian script, you can't do that
with a UTF-8 string.

But you have to check what encoding it's in, and given that not all
codepages are that simple to uppercase, some generic algorithm is
required.

If they're screwing up something so simple, imagine how much worse
everyone is screwing up something complex like UTF-8?

UTF-8 is pretty darn simple. BTW all it does is map [0..10FFFF] to a
sequence of octets. It does it pretty well and is compatible with
ASCII, even the little rant you posted acknowledged that. Now are you
against Unicode as a whole or what?
The BOM link I gave notes that UTF-8 isn't always ASCII-compatible.

There are two parts to Unicode.  I don't know enough about UCS, the
character set, ;) to be for it or against it, but I acknowledge that
a standardized character set may make sense.  I am dead set against
the UTF-8 variable-width encoding, for all the reasons listed above.

Okay, we are getting somewhere, now that I understand your position
and got myself confused midway there.

On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
25-May-2013 13:05, Joakim writes:

Nobody is talking about going back to code pages.  I'm talking about
going to single-byte encodings, which do not imply the problems that
you had with code pages way back when.

Problem is, what you outline is isomorphic with code pages. Hence the
grief of accumulated experience against them.

They may seem superficially similar but they're not.  For example,
from the beginning, I have suggested a more complex header that can
enable multi-language strings, as one possible solution.  I don't
think code pages provided that.

The problem is how would you define an uppercase algorithm for a
multilingual string with 3 distinct 256 codespaces (windows)? I bet
it won't be pretty.

Well, if somebody got a quest to redefine UTF-8 they *might* come up
with something that is a bit faster to decode but shares the same
properties. Hardly a life saver anyway.

Perhaps not, but I suspect programmers will flock to a constant-width
encoding that is much simpler and more efficient than UTF-8.
Programmer productivity is the biggest loss from the complexity of
UTF-8, as I've noted before.

I still don't see how your solution scales to beyond 256 different
codepoints per string (= multiple pages/parts of UCS ;) ).

--
Dmitry Olshansky

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I take you right you propose to define string as a header that
denotes a set of windows in code space? I still fail to see how that
would scale, see below.

Something like that.  For a multi-language string encoding, the
header would contain a single byte for every language used in the
string, along with multiple index bytes to signify the start and
finish of every run of single-language characters in the string.  So,
a list of languages and a list of pure single-language substrings.
This is just off the top of my head, I'm not suggesting it is
definitive.

Mm... strictly speaking (let's turn that argument backwards) - what
are algorithms that require slicing say [5..$] of string
without ever looking at it left to right, searching etc.?

Don't know, I was just pointing out that all the claims of easy
slicing with UTF-8 are wrong.  But a single-byte encoding would
be scanned much faster also, as I've noted above, no decoding
necessary and single bytes will always be faster than multiple
bytes, even without decoding.

What would it look like? Or how will the processing go?

Detailed a bit above.  As I mentioned earlier in this thread,
functions like toUpper would execute much faster because you
wouldn't have to scan substrings containing languages that don't
have uppercase, which you have to scan in UTF-8.

long before you mentioned this unicode compression
scheme.

It does inline headers or rather tags. That hop between fixed
char windows. It's not random-access nor claims to be.

I wasn't criticizing it, just saying that it seems to be
superficially similar to my scheme. :)

version of my single-byte encoding scheme!  You do raise a
good point:
the only reason why we're likely using such a bad encoding in
UTF-8 is
that nobody else wants to tackle this hairy problem.

Yup, where have you been say almost 10 years ago? :)

I was in grad school, avoiding writing my thesis. :) I'd never
have thought I'd be discussing Unicode today, didn't even know
what it was back then.

Not necessarily.  But that is actually one of the advantages of
single-byte encodings, as I have noted above.  toUpper is a
NOP for a
single-byte encoding string with an Asian script, you can't do
that with
a UTF-8 string.

But you have to check what encoding it's in, and given that not
all codepages are that simple to uppercase, some generic
algorithm is required.

You have to check the language, but my point is that you can look
at the header and know that toUpper has to do nothing for a
single-byte-encoded string of an Asian script which doesn't have
uppercase characters.  With UTF-8, you have to decode the entire
string to find that out.

They may seem superficially similar but they're not.  For
example, from
the beginning, I have suggested a more complex header that can
enable
multi-language strings, as one possible solution.  I don't
think code
pages provided that.

The problem is how would you define an uppercase algorithm for a
multilingual string with 3 distinct 256 codespaces (windows)? I
bet it won't be pretty.

How is it done now?  It isn't pretty with UTF-8 now either, as
some languages have uppercase characters and others don't.  The
version of toUpper for my encoding will be similar, but it will
do less work, because it doesn't have to be invoked for every
character in the string.

I still don't see how your solution scales to beyond 256
different codepoints per string (= multiple pages/parts of UCS
;) ).

I assume you're talking about Chinese, Korean, etc. alphabets?  I
mentioned those to Walter earlier, they would have a two-byte
encoding.  No way around that, but they would still be easier to
deal with than UTF-8, because of the header.

May 25 2013

Dmitry Olshansky <dmitry.olsh gmail.com>  writes:
25-May-2013 23:51, Joakim writes:
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I take you right you propose to define string as a header that
denotes a set of windows in code space? I still fail to see how that
would scale, see below.

Something like that.  For a multi-language string encoding, the header
would contain a single byte for every language used in the string, along
with multiple index bytes to signify the start and finish of every run
of single-language characters in the string. So, a list of languages and
a list of pure single-language substrings.  This is just off the top of
my head, I'm not suggesting it is definitive.
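Taken literally, the layout just described might be sketched as follows (entirely hypothetical; the `Run` structure, the language ids, and `HeaderString` are invented for illustration and are not an existing format):

```python
# One entry per single-language run: a 1-byte language id plus the
# start/end indices of that run in the payload.
from collections import namedtuple

Run = namedtuple("Run", "lang start end")

class HeaderString:
    def __init__(self, runs, payload: bytes):
        self.runs = list(runs)      # the "header"
        self.payload = payload      # one byte per char in each run

    def to_upper(self):
        # The claimed win: runs in caseless scripts are skipped
        # wholesale, with no per-character inspection.
        out = bytearray(self.payload)
        for r in self.runs:
            if r.lang == 0x00:      # hypothetical id for Latin
                out[r.start:r.end] = self.payload[r.start:r.end].upper()
        return HeaderString(self.runs, bytes(out))

# A Latin run followed by a run in a caseless script (id 0x01):
s = HeaderString([Run(0x00, 0, 5), Run(0x01, 5, 8)], b"hello\xb0\xb1\xb2")
assert s.to_upper().payload == b"HELLO\xb0\xb1\xb2"
```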

Runs away in horror :) It's a mess even before you've got to the
details.

Another point about using sometimes a 2-byte encoding - welcome to the
nice world of BigEndian/LittleEndian i.e. the very trap UTF-16 has
stepped into.

--
Dmitry Olshansky

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 19:58:25 UTC, Dmitry Olshansky wrote:
Runs away in horror :) It's a mess even before you've got to the
details.

Perhaps it's fatally flawed, but I don't see an argument for why,
so I'll assume you can't find such a flaw.  It is still _much
less_ messy than UTF-8, that is the critical distinction.

Another point about using sometimes a 2-byte encoding - welcome
to the nice world of BigEndian/LittleEndian i.e. the very trap
UTF-16 has stepped into.

I don't think this is a sizable obstacle.  It takes some
coordination, but it is a minor issue.

On Saturday, 25 May 2013 at 20:20:11 UTC, Juan Manuel Cabo wrote:
You obviously are not thinking it through. Such encoding would
have a O(n^2) complexity for appending a character/symbol in a
different language to the string, since you would have to
update the beginning of the string, and move the contents
forward to make room. Not to mention that it wouldn't be
backwards compatible with ascii routines, and the complexity of
such a header would have to be carried all the way to font
rendering routines in the OS.

non-font-related assertions have been addressed earlier.  I see
no reason why a single-byte encoding of UCS would have to be
carried to "font rendering routines" but UTF-8 wouldn't be.

Multiple languages/symbols in one string is a blessing of
modern humane computing. It is the norm more than the exception
in most of the world.

I disagree, but in any case, most of this thread refers to
multi-language strings.  The argument is about how best to encode
them.

On Saturday, 25 May 2013 at 20:47:25 UTC, Peter Alexander wrote:
On Saturday, 25 May 2013 at 14:58:02 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander
wrote:
I suggest you read up on UTF-8. You really don't understand
it. There is no need to decode, you just treat the UTF-8
string as if it is an ASCII string.

Not being aware of this shortcut doesn't mean not
understanding UTF-8.

It's not just a shortcut, it is absolutely fundamental to the
design of UTF-8. It's like saying you understand Lisp without
being aware that everything is a list.

It is an accidental shortcut because of the encoding scheme
chosen for UTF-8 and, as I've noted, still less efficient than
similarly searching a single-byte encoding.  The fact that you
keep trumpeting this silly detail as somehow "fundamental"
suggests you have no idea what you're talking about.

Also, you continuously keep stating disadvantages to UTF-8 that
are completely false, like "slicing does require decoding".
Again, completely missing the point of UTF-8. I cannot conceive
how you can claim to understand how UTF-8 works yet repeatedly
demonstrating that you do not.

Slicing on code points requires decoding, I'm not sure how you
don't know that.  If you mean slicing by byte, that is not only
useless, but _every_ encoding can do that.  I cannot conceive how
you claim to defend UTF-8, yet keep making such stupid points,
that you don't even bother backing up.

You are either ignorant or a successful troll. In either case,
I'm done here.

Must be nice to just insult someone who has demolished your
arguments and leave.  Good riddance, you weren't adding anything.

May 26 2013

"Juan Manuel Cabo" <juanmanuel.cabo gmail.com>  writes:
On Saturday, 25 May 2013 at 19:51:43 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky
wrote:
You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I take you right you propose to define string as a header
that denotes a set of windows in code space? I still fail to
see how that would scale, see below.

Something like that.  For a multi-language string encoding, the
header would contain a single byte for every language used in
the string, along with multiple index bytes to signify the
start and finish of every run of single-language characters in
the string.  So, a list of languages and a list of pure
single-language substrings.  This is just off the top of my
head, I'm not suggesting it is definitive.

You obviously are not thinking it through. Such encoding would
have a O(n^2) complexity for appending a character/symbol in a
different language to the string, since you would have to update
the beginning of the string, and move the contents forward to
make room. Not to mention that it wouldn't be backwards
compatible with ascii routines, and the complexity of such a
header would have to be carried all the way to font rendering
routines in the OS.

Multiple languages/symbols in one string is a blessing of modern
humane computing. It is the norm more than the exception in most
of the world.

--jm

May 25 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx>  writes:
On Sat, May 25, 2013 at 09:51:42PM +0200, Joakim wrote:
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
If I take you right you propose to define string as a header that
denotes a set of windows in code space? I still fail to see how
that would scale, see below.

Something like that.  For a multi-language string encoding, the
header would contain a single byte for every language used in the
string, along with multiple index bytes to signify the start and
finish of every run of single-language characters in the string.
So, a list of languages and a list of pure single-language
substrings.  This is just off the top of my head, I'm not suggesting
it is definitive.

[...]

And just how exactly does that help with slicing? If anything, it makes
slicing way hairier and error-prone than UTF-8. In fact, this one point
alone already defeated any performance gains you may have had with a
single-byte encoding. Now you can't do *any* slicing at all without
convoluted algorithms to determine what encoding is where at the
endpoints of your slice, and the resulting slice must have new headers
to indicate the start/end of every different-language substring. By the
time you're done with all that, you're going way slower than processing
UTF-8.

Again I say, I'm not 100% sold on UTF-8, but what you're proposing here
is far worse.

T

--
The best compiler is between your ears. -- Michael Abrash

May 25 2013

"Joakim" <joakim airpost.net>  writes:
For some reason this posting by H. S. Teoh shows up on the
mailing list but not on the forum.

On Sat May 25 13:42:10 PDT 2013, H. S. Teoh wrote:
On Sat, May 25, 2013 at 10:07:41AM +0200, Joakim wrote:
The vast majority of non-English alphabets in UCS can be encoded in a
single byte.  It is your exceptions that are not relevant.

I'll have you know that Chinese, Korean, and Japanese account for a
significant percentage of the world's population, and therefore
arguments about "vast majority" are kinda missing the forest for the
trees. If you count the number of *alphabets* that can be encoded in a
single byte, you can get a majority, but that in no way reflects
actual usage.

Not just "a majority," the vast majority of alphabets, representing
85% of the world's population.

The only alternatives to a variable-width encoding I can see are:
- Single code page per string
This is completely useless because now you can't concatenate strings
of different code pages.

I wouldn't be so fast to ditch this.  There is a real argument to be
made that strings of different languages are sufficiently different
that there should be no multi-language strings.  Is this the best
route?  I'm not sure, but I certainly wouldn't dismiss it out of hand.

This is so patently absurd I don't even know how to begin. Have you
actually dealt with any significant amount of text at all? A large
amount of text in today's digital world is at least bilingual, if not
more. Even in pure English text, you occasionally need a foreign
letter in order to transcribe a borrowed/quoted word, e.g., "cliché",
"naïve", etc.. Under your scheme, it would be impossible to encode any
text that contains even a single instance of such words. All it takes
is *one* word in a 500-page text and your scheme breaks down, and
we're back to the bad ole days of codepages. And yes you can say "well
just include é and ï in the English code page". But then all it takes
is a single math formula that requires a Greek letter, and your text
is non-encodable anymore. By the time you pull in all the French,
German, Greek letters and math symbols, you might as well just go back
to UTF-8.

I think you misunderstand what this implies.  I mentioned it earlier
as another possibility to Walter, "keep all your strings in a single
language, with a different format to compose them together."  Nobody
is talking about disallowing alphabets other than English or going
back to code pages.  The fundamental question is whether it makes
sense to combine all these different alphabets and their idiosyncratic
rules into a single string and encoding.

There is a good argument to be made that the differences outweigh the
similarities and you'd be better off keeping each language/alphabet in
its own string.  It's a question of modeling, just like a class
hierarchy.  As I said, I'm not sure this is the best route, but it has
some real strengths.

The alternative is to have embedded escape sequences for the rare
foreign letter/word that you might need, but then you're back to being
unable to slice the string at will, since slicing it at the wrong
place will produce gibberish.

No one has presented this as a viable option.

I'm not saying UTF-8 (or UTF-16, etc.) is a panacea -- there are
things about it that are annoying, but it's certainly better than the
scheme you're proposing.

I disagree.

On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
And just how exactly does that help with slicing? If anything, it
makes slicing way hairier and error-prone than UTF-8. In fact, this
one point alone already defeated any performance gains you may have
had with a single-byte encoding. Now you can't do *any* slicing at
all without convoluted algorithms to determine what encoding is where
at the endpoints of your slice, and the resulting slice must have new
headers to indicate the start/end of every different-language
substring. By the time you're done with all that, you're going way
slower than processing UTF-8.

There are no convoluted algorithms, it's a simple check whether the
string contains any two-byte encodings, a check which can be done
once and cached.  If it's single-byte all the way through, no
problems whatsoever with slicing.  If there are two-byte
languages included, the slice function will have to do a little
arithmetic calculation before slicing.  You will also need a few
arithmetic ops to create the new header for the slice.  The point
is that these operations will be much faster than decoding every
code point to slice UTF-8.
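For comparison, slicing UTF-8 does not actually require decoding every code point either. UTF-8 is self-synchronizing: continuation bytes are exactly those matching the bit pattern 10xxxxxx, so snapping an arbitrary byte index back to the start of a code point inspects at most three bytes. A minimal C sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Move a byte index back to the nearest code-point boundary at or
 * before it. In well-formed UTF-8, every byte of a multi-byte sequence
 * except the first matches 10xxxxxx, so the loop runs at most 3 times. */
static size_t snap_to_boundary(const unsigned char *s, size_t i) {
    while (i > 0 && (s[i] & 0xC0) == 0x80)  /* continuation byte? */
        i--;
    return i;
}
```

For example, in the bytes of "aéb" (where é is the two-byte sequence C3 A9), snapping index 2 lands on index 1, the start of the é, without scanning the string from the beginning.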

Again I say, I'm not 100% sold on UTF-8, but what you're
proposing here
is far worse.

Well, I'm glad you realize some problems with UTF-8, :) even if
you dismiss my alternative out of hand.

May 26 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx>  writes:
On Sun, May 26, 2013 at 11:59:19AM +0200, Joakim wrote:
On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
And just how exactly does that help with slicing? If anything, it
makes slicing way hairier and error-prone than UTF-8. In fact, this
one point alone already defeated any performance gains you may have
had with a single-byte encoding. Now you can't do *any* slicing at
all without convoluted algorithms to determine what encoding is where
at the endpoints of your slice, and the resulting slice must have new
headers to indicate the start/end of every different-language
substring.  By the time you're done with all that, you're going way
slower than processing UTF-8.

There are no convoluted algorithms, it's a simple check whether the
string contains any two-byte encodings, a check which can be done
once and cached.

IHBT. You said that to handle multilanguage strings, your header would
have a list of starting/ending points indicating which encoding should
be used for which substring(s). That has nothing to do with two-byte
encodings. So, please show us the code: given a string containing, say,
English and French substrings, what will the header look like? And
what's the algorithm to take a slice of such a string?

If it's single-byte all the way through, no problems whatsoever with
slicing.

Huh?! How are there no problems with slicing? Let's say you have a
string that contains both English and French. According to your scheme,
you'll have some kind of header format that lets you say bytes 0-123 are
English, bytes 124-129 are French, and bytes 130-200 are English. Now
let's say I want a substring from 120 to 125. How would this be done?
And what about if I want a substring from 120 to 140? Or 126 to 130?
What if the string contains several runs of French?
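To make the question concrete, here is a hedged C sketch of the bookkeeping such a slice would need (the run layout is my own assumption; nothing this specific was proposed in the thread): the slice cannot simply reuse the source header, it must rebuild its own run table by clipping every run that overlaps the requested byte range.

```c
#include <assert.h>
#include <stddef.h>

/* One run of single-language text: a language id plus a byte length.
 * Hypothetical layout, for illustration only. */
typedef struct { unsigned char lang; size_t len; } Run;

/* Build the run table of the slice [start, end) by clipping each
 * overlapping run of the source string. Returns the number of runs
 * written to out (which must have room for nruns entries). */
static size_t slice_runs(const Run *runs, size_t nruns,
                         size_t start, size_t end, Run *out) {
    size_t pos = 0, nout = 0;
    for (size_t i = 0; i < nruns && pos < end; i++) {
        size_t rs = pos, re = pos + runs[i].len;
        size_t s = rs > start ? rs : start;   /* clip to slice start */
        size_t e = re < end ? re : end;       /* clip to slice end   */
        if (s < e)
            out[nout++] = (Run){ runs[i].lang, e - s };
        pos = re;
    }
    return nout;
}
```

For the example above (English bytes 0-123, French bytes 124-129, English bytes 130-200), the slice [120, 125) yields two runs: 4 English bytes and 1 French byte, and a fresh header must be built to describe them.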

If there are two-byte languages included, the slice function will have
to do a little arithmetic calculation before slicing.  You will also
need a few arithmetic ops to create the new header for the slice.  The
point is that these operations will be much faster than decoding every
code point to slice UTF-8.

You haven't proven that this "little arithmetic calculation" will be
faster than manipulating UTF-8. What if I have an English text that
contains quotations of Chinese, French, and Greek snippets? Math
symbols?  Please show us (1) how such a string should be encoded under
your scheme, and (2) the code will slice such a string in an efficient
way, according to your proposed encoding scheme.

(And before you dismiss such a string as unlikely or write it off as
rare, consider a technical math paper that cites the work of Chinese and
French authors -- a rather common thing these days. You'd need the extra
characters just to be able to cite their names, even if none of the
actual Chinese or French is quoted verbatim. Greek in general is used
all over math anyway, since for whatever reason mathematicians just love
Greek symbols, so it pretty much needs to be included by default.)

Again I say, I'm not 100% sold on UTF-8, but what you're proposing
here is far worse.

Well, I'm glad you realize some problems with UTF-8, :) even if you
dismiss my alternative out of hand.

Clearly, we're not seeing what you're seeing here. So instead of
making claims, you might want to show us the actual code. So far, I
haven't seen anything that
convinces me that your scheme is any better.  In fact, from what I can
see, it's a lot worse, and you're just evading pointed questions about
how to address those problems.  Maybe that's a wrong perception, but not
having any actual code to look at, I'm having a hard time believing your
claims. Right now I'm leaning towards agreeing with Walter that you're
just trolling us (and rather successfully at that).

So, please show us the code. Otherwise, I think I should just stop
responding, as we're obviously not on the same page and this discussion
isn't getting anywhere.

T

--
Some ideas are so stupid that only intellectuals could believe them. -- George
Orwell

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 14:37:27 UTC, H. S. Teoh wrote:
IHBT. You said that to handle multilanguage strings, your header would
have a list of starting/ending points indicating which encoding should
be used for which substring(s).

Pretty funny how you claim you've been trolled and then go on to
make a bunch of trolling arguments, which seem to imply you have
no idea how a single-byte encoding works.  I'm not going to
bother explaining it to you, anyone who knows encodings can
easily figure it out from what I've said so far.

Clearly, we're not seeing what you're seeing here. So instead of
making claims, you might want to show us the actual code. So far, I
haven't seen anything that convinces me that your scheme is any
better. In fact, from what I can see, it's a lot worse, and you're
just evading pointed questions about how to address those problems.
Maybe that's a wrong perception, but not having any actual code to
look at, I'm having a hard time believing your claims. Right now I'm
leaning towards agreeing with Walter that you're just trolling us (and
rather successfully at that).

If you think those are trolling arguments, that's you not
understanding what they're saying.  I have given my reasons for
considering the UTF-8 encoding worse.  If you can't understand my
arguments, you need to go out and learn some more about these issues.

So, please show us the code. Otherwise, I think I should just
stop
responding, as we're obviously not on the same page and this
discussion
isn't getting anywhere.

I've made my position clear: I don't write toy code.  It will
take too long for the kind of encoding I have in mind, so it
isn't worth my time, and if you can't understand the higher-level
technical language I'm using in these posts, you won't understand
the code anyway.  I have adequately sketched what I'd do, so that
anyone proficient in the art can reason about what the
consequences of such a scheme would be.  Perhaps that doesn't
include Walter and you.

I don't know why you'd want to keep responding to someone you
think is trolling you anyway.

May 26 2013

"Vladimir Panteleev"  writes:
On Sunday, 26 May 2013 at 15:23:33 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 14:37:27 UTC, H. S. Teoh wrote:
IHBT.

I've made my position clear: I don't write toy code.

1. Make extraordinary claims
2. Refuse to back up said claims with small examples because "I
don't write toy code"
3. Refuse to back up said claims with elaborate examples because
"It will
take too long"
4. Use arrogant tone throughout thread, imply that you're smarter
than the creators of UTF, and creators and long-time contributors
of D (never contribute code to D yourself)

Conclusion: Successful troll is successful :)

May 26 2013

Dmitry Olshansky <dmitry.olsh gmail.com>  writes:
On Sunday, 26 May 2013 at 15:23:33 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 14:37:27 UTC, H. S. Teoh wrote:
IHBT.

I've made my position clear: I don't write toy code.

1. Make extraordinary claims
2. Refuse to back up said claims with small examples because "I don't
write toy code"
3. Refuse to back up said claims with elaborate examples because "It will
take too long"
4. Use arrogant tone throughout thread, imply that you're smarter than
the creators of UTF, and creators and long-time contributors of D (never
contribute code to D yourself)

Conclusion: Successful troll is successful :)

+1

--
Dmitry Olshansky

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 16:54:53 UTC, Vladimir Panteleev wrote:
1. Make extraordinary claims

What is extraordinary about "UTF-8 is shit?"  It is obviously so.

2. Refuse to back up said claims with small examples because "I
don't write toy code"

I never refused small examples.  I have provided several analyses
of how a single-byte encoding would compare to UTF-8, along with
listing optimizations that make it much faster.  I finally
refused to analyze Teoh's examples because he accused me of
trolling and demanded code as the only possible explanation.

3. Refuse to back up said claims with elaborate examples
because "It will
take too long"

You are confused.  What I said is "I don't write toy code,
non-toy code would take too long, and you wouldn't understand it
anyway."

The whole demand for code is idiotic anyway.

If I outlined TCP/IP as a packet-switched network and briefly
sketched what the header might look like and the queuing
algorithms that I might use, I can just imagine you saying, "But
there's no code... how can I possibly understand what you're
saying without any code?"  If you can't understand networking
without seeing working code, you're not equipped to understand it
anyway, same here.

4. Use arrogant tone throughout thread, imply that you're
smarter than the creators of UTF, and creators and long-time
contributors of D (never contribute code to D yourself)

Hey, if the shoe fits. :)

I actually had a lot of respect for Walter till I read this
thread.  I can only assume that his past experience with code
pages was so maddening that he cannot be rational on the subject
of going to any single-byte encoding that would be similar, same
with others griping about code pages above.  I also don't think
he and others are paying much attention to the various points I'm
raising, hence his recent claim that I wouldn't handle Chinese,
when I addressed that from the beginning.

Or it could just be that I'm much smarter than everybody else in
this thread, ;) I can't rule it out given the often silly
responses I've been getting.

Conclusion: Successful troll is successful :)

Conclusion: Vladimir trolls me because he doesn't understand what
I'm talking about, which is why he doesn't raise a single
technical point in this post.

May 26 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>  writes:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously so.

Congratulations, you are literally the only person on the Internet who
said so: http://goo.gl/TFhUO

On 5/26/13 1:45 PM, Joakim wrote:
Or it could just be that I'm much smarter than everybody else in this
thread, ;) I can't rule it out given the often silly responses I've been
getting.

It is rare indeed that most everybody in this forum raises like one to
the same opinion. Usually
it's like whatever the topic, a debate will ensue between two ad-hoc groups.

It has become clear that people involved in this have gotten too
frustrated to have a constructive exchange. I suggest we collectively
drop it. What you may want to do is to use D's modeling abilities to
define a great string type pursuant to your ideas. If it is as good as
you believe it could, then it will enjoy use and adoption and everybody
will be better off.

Andrei

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously
so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at least
8 results.  How many people even know how UTF-8 works?  Given how
few people use it, I'm not surprised most don't know enough about
how it works to criticize it.

On 5/26/13 1:45 PM, Joakim wrote:
Or it could just be that I'm much smarter than everybody else
in this
thread, ;) I can't rule it out given the often silly responses
I've been
getting.

most everybody in this forum raises like one to the same opinion.
Usually it's like whatever the topic, a debate will ensue between two
ad-hoc groups.

I suspect it's because I'm presenting an original idea about a
not well-understood technology, Unicode, not the usual "emacs vs
vim" or "D should not have null references" argument.  For
example, how many here know what UCS is?  Most people never dig
into Unicode, it's just a black box that is annoying to deal with.

It has become clear that people involved in this have gotten
too frustrated to have a constructive exchange. I suggest we
collectively drop it. What you may want to do is to use D's
modeling abilities to define a great string type pursuant to
your ideas. If it is as good as you believe it could, then it
will enjoy use and adoption and everybody will be better off.

I agree.  I am enjoying your book, btw.

May 26 2013

"Mr. Anonymous" <mailnew4ster gmail.com>  writes:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu
wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously
so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at
least 8 results.  How many people even know how UTF-8 works?
Given how few people use it, I'm not surprised most don't know
enough about how it works to criticize it.

On the other hand:

:D

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu
wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously
so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at
least 8 results.  How many people even know how UTF-8 works?
Given how few people use it, I'm not surprised most don't know
enough about how it works to criticize it.

On the other hand:

I'm not sure if you were trying to make my point, but you just
did.  There are only 19 results for that search string.  If UTF-8
were such a rousing success and most developers found it easy to
understand, you wouldn't expect only 19 results for it and 8
against it.  The paucity of results suggests most don't know how
it works, or are perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

May 26 2013

"Mr. Anonymous" <mailnew4ster gmail.com>  writes:
On Sunday, 26 May 2013 at 19:25:37 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu
wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is
obviously so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at
least 8 results.  How many people even know how UTF-8 works?
Given how few people use it, I'm not surprised most don't
know enough about how it works to criticize it.

On the other hand:

I'm not sure if you were trying to make my point, but you just
did.  There are only 19 results for that search string.  If
UTF-8 were such a rousing success and most developers found it
easy to understand, you wouldn't expect only 19 results for it
and 8 against it.  The paucity of results suggests most don't
know how it works, or are perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

Man, you're a bullshit machine!

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 19:38:21 UTC, Mr. Anonymous wrote:
On Sunday, 26 May 2013 at 19:25:37 UTC, Joakim wrote:
I'm not sure if you were trying to make my point, but you just
did.  There are only 19 results for that search string.  If
UTF-8 were such a rousing success and most developers found it
easy to understand, you wouldn't expect only 19 results for it
and 8 against it.  The paucity of results suggests most don't
know how it works, or are perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

Man, you're a bullshit machine!

What can I say?  I'm very good at interpreting bad data. ;)

May 26 2013

Marco Leise <Marco.Leise gmx.de>  writes:
Am Sun, 26 May 2013 21:25:36 +0200
schrieb "Joakim" <joakim airpost.net>:

On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu
wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously
so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at
least 8 results.  How many people even know how UTF-8 works?
Given how few people use it, I'm not surprised most don't know
enough about how it works to criticize it.

On the other hand:

I'm not sure if you were trying to make my point, but you just
did.  There are only 19 results for that search string.  If UTF-8
were such a rousing success and most developers found it easy to
understand, you wouldn't expect only 19 results for it and 8
against it.  The paucity of results suggests most don't know how
it works, or are perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

--
Marco

May 29 2013

"Joakim" <joakim airpost.net>  writes:
On Wednesday, 29 May 2013 at 23:40:51 UTC, Marco Leise wrote:
Am Sun, 26 May 2013 21:25:36 +0200
schrieb "Joakim" <joakim airpost.net>:

On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu
wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is
obviously so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at
least 8 results.  How many people even know how UTF-8
works?  Given how few people use it, I'm not surprised most
don't know enough about how it works to criticize it.

On the other hand:

I'm not sure if you were trying to make my point, but you just
did.  There are only 19 results for that search string.  If
UTF-8 were such a rousing success and most developers found it
easy to understand, you wouldn't expect only 19 results for it
and 8 against it.  The paucity of results suggests most don't
know how it works, or are perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

Your point is?  121 results, including false positives like
"utf-8 is the best guess."  If you look at the results, almost
all make the pragmatic recommendation that UTF-8 is the best _for
now_, because it is better supported than other multi-language
formats.  That's like saying Windows is the best OS because it's
easier to find one in your local computer store.

Yet again, the fact that even this somewhat ambiguous search
string has only 121 results is damning of anyone liking UTF-8,
nothing else, given the many thousands of programmers that are
forced to use Unicode if they want to internationalize.

May 30 2013

Marco Leise <Marco.Leise gmx.de>  writes:
Am Thu, 30 May 2013 09:19:32 +0200
schrieb "Joakim" <joakim airpost.net>:

Your point is?  121 results, including false positives like
"utf-8 is the best guess."  If you look at the results, almost
all make the pragmatic recommendation that UTF-8 is the best _for
now_, because it is better supported than other multi-language
formats.  That's like saying Windows is the best OS because it's
easier to find one in your local computer store.

Yet again, the fact that even this somewhat ambiguous search
string has only 121 results is damning of anyone liking UTF-8,
nothing else, given the many thousands of programmers that are
forced to use Unicode if they want to internationalize.

Alright, for me it said ~6.570.000 results, which I found
funny. I'm not trying to make a point, but to troll. If there
is a point to be made, then that the count of search results
is a _very_ rough estimate.

--
Marco

May 30 2013

Marcin Mstowski <marmyst gmail.com>  writes:
Character Data Representation Architecture
<http://www-01.ibm.com/software/globalization/cdra/> by IBM. It is
what you want to do, with additions, and it has been available since
1995. When you come up with an inventive idea, I suggest you first
check what was already done in that area and then rethink this again
to check if you can do it better or improve the existing solution.
Other approaches are usually a waste of time and effort, unless you
are doing this for fun or you can't use existing solutions due to
problems with license, price, etc.

On Sun, May 26, 2013 at 9:05 PM, Joakim <joakim airpost.net> wrote:

On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu wrote:

On 5/26/13 1:45 PM, Joakim wrote:

What is extraordinary about "UTF-8 is shit?" It is obviously so.

Congratulations, you are literally the only person on the Internet who
said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at least 8
results.  How many people even know how UTF-8 works?  Given how few people
use it, I'm not surprised most don't know enough about how it works to
criticize it.

On 5/26/13 1:45 PM, Joakim wrote:
Or it could just be that I'm much smarter than everybody else in this
thread, ;) I can't rule it out given the often silly responses I've been
getting.

most everybody in this forum raises like one to the same opinion. Usually it's
like whatever the topic, a debate will ensue between two ad-hoc groups.

I suspect it's because I'm presenting an original idea about a not
well-understood technology, Unicode, not the usual "emacs vs vim" or "D
should not have null references" argument.  For example, how many here know
what UCS is?  Most people never dig into Unicode, it's just a black box
that is annoying to deal with.

It has become clear that people involved in this have gotten too
frustrated to have a constructive exchange. I suggest we collectively drop
it. What you may want to do is to use D's modeling abilities to define a
great string type pursuant to your ideas. If it is as good as you believe
it could, then it will enjoy use and adoption and everybody will be better
off.

I agree.  I am enjoying your book, btw.

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 19:20:15 UTC, Marcin Mstowski wrote:
Character Data Representation Architecture
<http://www-01.ibm.com/software/globalization/cdra/> by IBM. It is
what you want to do, with additions, and it has been available since
1995. When you come up with an inventive idea, I suggest you first
check what was already done in that area and then rethink this again
to check if you can do it better or improve the existing solution.
Other approaches are usually a waste of time and effort, unless you
are doing this for fun or you can't use existing solutions due to
problems with license, price, etc.

You might be right, but I gave it a quick look and can't make out
what the encoding actually is.  There is an appendix that lists
several possible encodings, including UTF-8!

Also, one of the first pages talks about representations of
floating point and integer numbers, which are outside the purview
of the text encodings we're talking about.  I cannot possibly be
expected to know about every dead format out there.  If you can
show that it is materially similar to my single-byte encoding
idea, it might be worth looking into.

May 26 2013

Marcin Mstowski <marmyst gmail.com>  writes:
On Sun, May 26, 2013 at 9:42 PM, Joakim <joakim airpost.net> wrote:

On Sunday, 26 May 2013 at 19:20:15 UTC, Marcin Mstowski wrote:

Character Data Representation Architecture
<http://www-01.ibm.com/software/globalization/cdra/> by IBM. It is
what you want to do, with additions, and it has been available since
1995. When you come up with an inventive idea, I suggest you first
check what was already done in that area and then rethink this again
to check if you can do it better or improve the existing solution.
Other approaches are usually a waste of time and effort, unless you
are doing this for fun or you can't use existing solutions due to
problems with license, price, etc.

You might be right, but I gave it a quick look and can't make out what the
encoding actually is.  There is an appendix that lists several possible
encodings, including UTF-8!

Yes, because they didn't reinvent the wheel from scratch and are
reusing existing encodings as a base. There isn't any problem with
adding another code page.

Also, one of the first pages talks about representations of floating point
and integer numbers, which are outside the purview of the text encodings

They are outside the scope of CDRA too. At least read the picture
description before making out-of-context assumptions.

I cannot possibly be expected to know about every dead format out there.

Nobody expect that.

If you can show that it is materially similar to my single-byte encoding
idea, it might be worth looking into.

Spending ~15 minutes reading the Introduction isn't worth your time,
so why should I waste my time showing you anything?

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 21:08:40 UTC, Marcin Mstowski wrote:
On Sun, May 26, 2013 at 9:42 PM, Joakim <joakim airpost.net>
wrote:
Also, one of the first pages talks about representations of
floating point
and integer numbers, which are outside the purview of the text
encodings

They are outside the scope of CDRA too. At least read the picture
description before making out-of-context assumptions.

Which picture description did you have in mind?  They all seem
fairly generic.  I do see now that one paragraph does say that
CDRA only deals with graphical characters and that they were only
talking about numbers earlier to introduce the topic of data
representation.

If you can show that it is materially similar to my
single-byte encoding
idea, it might be worth looking into.

Spending ~15 minutes reading the Introduction isn't worth your time,
so why should I waste my time showing you anything?

You claimed that my encoding was reinventing the wheel, therefore
the onus is on you to show which of the multiple encodings CDRA
uses that I'm reinventing.  I'm not interested in delving into
the docs for some dead IBM format to prove _your_ point.  More
likely, you are just dead wrong and CDRA simply uses code pages,
which are not the same as the single-byte encoding with a header
idea that I've sketched in this thread.

May 26 2013

"John Colvin" <john.loughran.colvin gmail.com>  writes:
On Monday, 27 May 2013 at 06:11:20 UTC, Joakim wrote:
You claimed that my encoding was reinventing the wheel,
therefore the onus is on you to show which of the multiple
encodings CDRA uses that I'm reinventing.  I'm not interested
in delving into the docs for some dead IBM format to prove
_your_ point.

It's your idea and project. Showing that it is original / doing
your research on previous efforts is probably something that
*you* should do, whether or not it's someone else's "point".

More likely, you are just dead wrong and CDRA simply uses code
pages

Based on what?

May 27 2013

"Joakim" <joakim airpost.net>  writes:
On Monday, 27 May 2013 at 12:25:06 UTC, John Colvin wrote:
On Monday, 27 May 2013 at 06:11:20 UTC, Joakim wrote:
You claimed that my encoding was reinventing the wheel,
therefore the onus is on you to show which of the multiple
encodings CDRA uses that I'm reinventing.  I'm not interested
in delving into the docs for some dead IBM format to prove
_your_ point.

It's your idea and project. Showing that it is original / doing
your research on previous efforts is probably something that
*you* should do, whether or not it's someone else's "point".

Sure, some research is necessary.  However, software is littered
with past projects that never really got started or bureaucratic
efforts, like CDRA appears to be, that never went anywhere.  I
can hardly be expected to go rummaging through all these efforts
in the hopes that what, someone else has already written the
code?  If you have a brain, you can look at the currently popular
approaches, which CDRA isn't, and come up with something that
makes more sense.  I don't much care if my idea is original, I
care that it is better.

More likely, you are just dead wrong and CDRA simply uses code
pages

Based on what?

Based on the fact that his link lists EBCDIC and several other
antiquated code page encodings in its list of proposed encodings.
If Marcin believes one of those is similar to my scheme, he
should say which one, otherwise his entire line of argument is
irrelevant.  It's not up to me to prove _his_ point.

Without having looked at any of the encodings in detail, I'm fairly
certain he's wrong.  If he feels otherwise, he can pipe up with
which one he had in mind.  The fact that he hasn't speaks volumes.

May 27 2013

Walter Bright <newshound2 digitalmars.com>  writes:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would
contain a single byte for every language used in the string, along with
multiple
index bytes to signify the start and finish of every run of single-language
characters in the string. So, a list of languages and a list of pure
single-language substrings.

Please implement the simple C function strstr() with this simple scheme, and
post it here.

http://www.digitalmars.com/rtl/string.html#strstr

May 25 2013

Walter Bright <newshound2 digitalmars.com>  writes:
On 5/25/2013 2:51 PM, Walter Bright wrote:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would
contain a single byte for every language used in the string, along with
multiple
index bytes to signify the start and finish of every run of single-language
characters in the string. So, a list of languages and a list of pure
single-language substrings.

Please implement the simple C function strstr() with this simple scheme, and
post it here.

http://www.digitalmars.com/rtl/string.html#strstr

I'll go first. Here's a simple UTF-8 version in C. It's not the fastest way to
do it, but at least it is correct:
----------------------------------
char *strstr(const char *s1, const char *s2) {
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *) s1;
    char c2 = *s2;
    while (len2 <= len1) {
        if (c2 == *s1)
            if (memcmp(s2, s1, len2) == 0)
                return (char *) s1;
        s1++;
        len1--;
    }
    return NULL;
}

May 25 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx>  writes:
On Sat, May 25, 2013 at 10:07:41AM +0200, Joakim wrote:
[...]
The vast majority of non-english alphabets in UCS can be encoded in
a single byte.  It is your exceptions that are not relevant.

I'll have you know that Chinese, Korean, and Japanese account for a
significant percentage of the world's population, and therefore
arguments about "vast majority" are kinda missing the forest for the
trees. If you count the number of *alphabets* that can be encoded in a
single byte, you can get a majority, but that in no way reflects actual
usage.

[...]
The only alternatives to a variable width encoding I can see are:
- Single code page per string
This is completely useless because now you can't concatenate
strings of different code pages.

I wouldn't be so fast to ditch this.  There is a real argument to be
made that strings of different languages are sufficiently different
that there should be no multi-language strings.  Is this the best
route?  I'm not sure, but I certainly wouldn't dismiss it out of hand.

This is so patently absurd I don't even know how to begin to answer...
have you actually dealt with any significant amount of text at all? A
large amount of text in today's digital world is at least bilingual, if
not more. Even in pure English text, you occasionally need a foreign
letter in order to transcribe a borrowed/quoted word, e.g., "cliché",
"naïve", etc.. Under your scheme, it would be impossible to encode any
text that contains even a single instance of such words. All it takes is
*one* word in a 500-page text and your scheme breaks down, and we're
back to the bad ole days of codepages. And yes you can say "well just
include é and ï in the English code page". But then all it takes is a
single math formula that requires a Greek letter, and your text is
non-encodable anymore. By the time you pull in all the French, German,
Greek letters and math symbols, you might as well just go back to UTF-8.

The alternative is to have embedded escape sequences for the rare
foreign letter/word that you might need, but then you're back to being
unable to slice the string at will, since slicing it at the wrong place
will produce gibberish.

I'm not saying UTF-8 (or UTF-16, etc.) is panacea -- there are things
about it that are annoying, but it's certainly better than the scheme
you're proposing.

T

--
You only live once.

May 25 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx>  writes:
On Tue, May 28, 2013 at 02:54:30AM +0200, Torje Digernes wrote:
On Tuesday, 28 May 2013 at 00:34:20 UTC, Manu wrote:
On 28 May 2013 09:05, Walter Bright <newshound2 digitalmars.com>
wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... for
example:

import std.stdio;

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different
story. :)

I've recently come to the opinion that that's a bad idea, and D
should not support it.

Why? You said previously that you'd love to support extended
operators ;)

I find features such as support for uncommon symbols in variables a
strength as it makes some physics formulas a bit easier to read in
code form, which in my opinion is a good thing.

I think there's a difference between allowing math symbols (which
includes things like (a subset of) Greek letters that mathematicians
love) in identifiers, and allowing full Unicode. What if you're assigned
to maintain code containing identifiers that have letters that don't
appear in any of your installed fonts?

I think it's OK to allow math symbols, but allowing the entire set of
Unicode characters is going a bit too far, IMO. For one thing, if some
code has identifiers written in Arabic, I wouldn't be able to understand
the code, simply because I'd have a hard time telling different
identifiers apart.  Besides, if the rest of the language (keywords,
Phobos, etc.) are in English, then I don't see any compelling reason to
use a different language in identifiers, other than to submit IODCC
entries. :-P

C doesn't support Unicode identifiers, for one thing, but I've seen
working C code written by people who barely understand any English -- it
didn't stop them at all. (The comments were of course in their native
language -- the compiler ignores everything inside anyway so 8-bit
native encodings or even UTF-8 can be sneaked in without provoking
compiler errors.)

T

--
WINDOWS = Will Install Needless Data On Whole System -- CompuMan

May 27 2013

"Torje Digernes" <torjehoa pvv.org>  writes:
On Tuesday, 28 May 2013 at 01:17:37 UTC, H. S. Teoh wrote:
On Tue, May 28, 2013 at 02:54:30AM +0200, Torje Digernes wrote:
On Tuesday, 28 May 2013 at 00:34:20 UTC, Manu wrote:
On 28 May 2013 09:05, Walter Bright <newshound2 digitalmars.com> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different
story. :)

I've recently come to the opinion that that's a bad idea, and D
should not support it.

Why? You said previously that you'd love to support extended
operators ;)

I find features such as support for uncommon symbols in variables a
strength, as it makes some physics formulas a bit easier to read in
code form, which in my opinion is a good thing.

I think there's a difference between allowing math symbols (which
includes things like (a subset of) Greek letters that mathematicians
love) in identifiers, and allowing full Unicode. What if you're
assigned to maintain code containing identifiers that have letters
that don't appear in any of your installed fonts?

I think it's OK to allow math symbols, but allowing the entire set of
Unicode characters is going a bit too far, IMO. For one thing, if some
code has identifiers written in Arabic, I wouldn't be able to
understand the code, simply because I'd have a hard time telling
different identifiers apart.  Besides, if the rest of the language
(keywords, Phobos, etc.) are in English, then I don't see any
compelling reason to use a different language in identifiers, other
than to submit IODCC entries. :-P

C doesn't support Unicode identifiers, for one thing, but I've seen
working C code written by people who barely understand any English --
it didn't stop them at all. (The comments were of course in their
native language -- the compiler ignores everything inside anyway so
8-bit native encodings or even UTF-8 can be sneaked in without
provoking compiler errors.)

T

T

I think there is very little difference; both cases artificially
limit the allowable symbols. What about symbols relevant in other
fields, ones that do not happen to use Greek letters primarily -- are
they to be treated differently?

What you propose is a built-in coding standard for D, based on your
feelings on the topic.

If what you fear is that Unicode will suddenly make cooperation
impossible, I doubt you are right; after all, there are all kinds of
ways to make terrible variable names (q, w, e, r ... qq, qw). If any
such identifiers show up in a project, I assume they get cleaned up,
so why wouldn't the same happen to Unicode identifiers if they cause
problems? Think about it: the cleanup should happen even faster,
because the symbol might not be typeable for everyone, whereas a
single- or double-letter gibberish name is perfectly reproducible and
might grow into the project, confusing every new reader. Are you
going to argue for disallowing variables that are not a compound
word or a dictionary word in English?

May 29 2013

<!--