## digitalmars.D - Why UTF-8/16 character encodings?

• Joakim (22/26) May 24 2013 This triggered a long-standing bugbear of mine: why are we using
• Peter Alexander (13/15) May 24 2013 Simple: backwards compatibility with all ASCII APIs (e.g. most C
• Joakim (47/59) May 24 2013 And yet here we are today, where an early decision made solely to
• Joakim (7/16) May 24 2013 Sorry, I was a bit imprecise. Here's what I meant to write:
• Walter Bright (17/22) May 24 2013 This is more a problem with the algorithms taking the easy way than a pr...
• anonymous (2/16) May 24 2013 The German ß becomes SS when capitalised. It's no encoding issue.
• Dmitry Olshansky (34/58) May 24 2013 You seem to think that not only UTF-8 is bad encoding but also one
• H. S. Teoh (102/151) May 24 2013 I remember those bad ole days of gratuitously-incompatible encodings. I
• Walter Bright (7/9) May 24 2013 One of the first, and best, decisions I made for D was it would be Unico...
• Manu (6/16) May 24 2013 Indeed, excellent decision!
• Walter Bright (3/4) May 24 2013 Oh, how I want to do that. But I still think the world hasn't completely...
• H. S. Teoh (13/18) May 24 2013 That would be most awesome!
• Timon Gehr (13/31) May 25 2013 This is what eg. Haskell, Coq are doing.
• Hans W. Uhlig (5/10) May 26 2013 Using those characters would be wonderful and while we do have
• Walter Bright (3/6) May 26 2013 I have a post-it stuck to my monitor with the numbers for various unicod...
• H. S. Teoh (13/21) May 26 2013 I have been thinking about this idea of a "reprogrammable keyboard", in
• Kiith-Sa (2/2) May 26 2013 You mean like
• H. S. Teoh (6/8) May 26 2013 Whoa! That is exactly what I had in mind!!
• Torje Digernes (5/12) May 26 2013 If you want to configure your keyboard so you can type unicode in
• H. S. Teoh (16/33) May 26 2013 Oh, I know *that*. I configured my xkb setup to switch between English
• Wyatt (23/38) May 26 2013 I've given this domain a fair bit of thought, and from my
• H. S. Teoh (34/71) May 27 2013 I like this idea. It's certainly more feasible than reinventing the
• Vladimir Panteleev (3/4) May 27 2013 Perhaps something like the compose key?
• H. S. Teoh (9/15) May 27 2013 I'm already using the compose key. But it only goes so far (I don't
• Vladimir Panteleev (3/6) May 27 2013 I thought the topic was typing the occasional Unicode character
• H. S. Teoh (16/23) May 27 2013 Well, D *does* support non-English identifiers, y'know... for example:
• Walter Bright (3/11) May 27 2013 I've recently come to the opinion that that's a bad idea, and D should n...
• Hans W. Uhlig (3/18) May 27 2013 Why do you think its a bad idea? It makes it such that code can
• H. S. Teoh (22/41) May 27 2013 Currently, the above code snippet compiles (upon inserting "import
• monarch_dodra (23/76) May 28 2013 I can tell you for a fact there are a tons of *private* companies
• Walter Bright (7/12) May 27 2013 Every time I've been to a programming shop in a foreign country, the dev...
• Diggory (7/23) May 27 2013 The most convincing case for usefulness I've seen was in java
• Olivier Pisano (13/17) May 28 2013 Would you have been to such an event if you could not have
• monarch_dodra (8/12) May 28 2013 That's because you have an academic view of code, and a library
• qznc (7/23) May 29 2013 Once I heared an argument from developers working for banks. They
• Walter Bright (2/7) May 29 2013 German is pretty easy to do in ASCII: Vermoegen and Buergschaft
• monarch_dodra (18/30) May 30 2013 What about Chinese? Russian? Japanese? It is doable, but I can
• Simen Kjaeraas (21/48) May 30 2013 On Thu, 30 May 2013 11:36:42 +0200, monarch_dodra
• Dicebot (4/14) May 30 2013 What about poor guys from other country that will support that
• monarch_dodra (18/33) May 30 2013 Well... defacto: "in practice but not necessarily ordained by
• Manu (5/19) May 30 2013 Have you ever worked on code written by people who barely speak English?
• Dicebot (4/11) May 30 2013 I have had comments with Finnish poetry in code I was responsible
• Kagamin (31/33) Jun 27 2013 I did. It's better than having a mixture of languages like here:
• deadalnix (7/14) Jun 27 2013 OOo codebase is historically mostly in german. They try to reduce
• Peter Williams (4/6) May 27 2013 So you're going to spell check them all to make sure that they're
• David Eagen (4/7) May 27 2013 That's it. I'm filing a bug against std.traits. There's a
• Manu (7/15) May 27 2013 How dare you!
• Peter Williams (4/12) May 27 2013 Except here in Australia and other places where they use the Queen's
• Manu (2/17) May 27 2013 Is there anywhere other than America that doesn't?
• Jacob Carlborg (4/5) May 28 2013 Canada, Jamaica, other countries in that region?
• Manu (3/7) May 28 2013 Yes, the region called America ;)
• Jacob Carlborg (4/6) May 28 2013 Oh, you meant the whole region and not the country.
• Simen Kjaeraas (4/8) May 28 2013 America is not a country. The country is called USA.
• Jacob Carlborg (5/6) May 28 2013 I know that, but I get the impression that people usually say "America"
• Peter Williams (4/7) May 28 2013 Last time I looked Canada was in America (which is a continent not a
• Diggory (3/13) May 28 2013 America isn't a continent, North America is a continent, and
• monarch_dodra (8/23) May 28 2013 Well, that point of view really depends from which continent
• H. S. Teoh (13/23) May 28 2013 [...]
• Peter Williams (4/24) May 28 2013 Last time I was there (about 40 years ago) Canadians didn't seem that
• H. S. Teoh (7/25) May 28 2013 [...]
• Jacob Carlborg (4/6) May 28 2013 Don't you have a spell checker in your editor? If not, find a new one :)
• "Luís Marques" (3/5) May 27 2013 I think it is a bad idea to program in a language other than
• Manu (5/11) May 27 2013 I can imagine a young student learning to code, that may not speak Engli...
• Manu (3/17) May 27 2013 Why? You said previously that you'd love to support extended operators ;...
• Torje Digernes (4/28) May 27 2013 I find features such as support for uncommon symbols in variables
• Walter Bright (2/16) May 27 2013 Extended operators, yes. Non-ascii identifiers, no.
• Oleg Kuporosov (6/9) May 28 2013 BTW, this is one of big D advantage, take into account
• Simen Kjaeraas (10/23) May 28 2013 :
• Jakob Ovrum (17/19) May 29 2013 Honestly, removing support for non-ASCII characters from
• Walter Bright (3/19) May 29 2013 I still think it's a bad idea, but it's obvious people want it in D, so ...
• Marco Leise (5/8) May 29 2013 Surprisingly ASCII also covers Cornish and Malay.
• Oleg Kuporosov (3/6) May 29 2013 Good, thanks, restrictions definetelly can and should be applied
• Jakob Ovrum (5/6) May 30 2013 I don't understand the logic behind this. Surely this is the
• Marco Leise (11/25) May 29 2013 t=20
• Timon Gehr (2/5) May 30 2013 No, that's deeply troubling.
• Entry (1/1) May 29 2013 My personal opinion is that code should only be in English.
• Peter Williams (3/4) May 29 2013 But why would you want to impose this restriction on others?
• Entry (7/11) May 30 2013 I wouldn't say impose. I'd say that programming in a unified
• monarch_dodra (8/21) May 30 2013 But programming IS a human tool, and thus, subject to human
• Entry (3/26) May 30 2013 What a way to attack a straw-man and completely miss the point at
• monarch_dodra (6/35) May 30 2013 Fine.
• Entry (7/43) May 30 2013 Take a minute to think about why we're all communicating in
• monarch_dodra (19/25) May 30 2013 Well that's condescending :/ and fallacious.
• Entry (8/33) May 30 2013 I'm glad you agree, though I believe that I never said anything
• Jakob Ovrum (3/11) May 30 2013 If the programmers who are going to be working on that code don't
• Entry (4/15) May 30 2013 Then there's no helping it. Though I wonder what kind of a
• Manu (2/18) May 30 2013 A child, or a student.
• Manu (2/44) May 30 2013 This is the definition of a *convention*, not a rule.
• Manu (1/16) May 30 2013
• Manu (12/27) May 30 2013 We don't all know English. Plenty of people don't.
• Walter Bright (2/12) May 30 2013 Sure, but the code itself is written using ASCII!
• Peter Williams (3/19) May 30 2013 Because they had no choice.
• Walter Bright (2/21) May 30 2013 Not true, D supports Unicode identifiers.
• Simen Kjaeraas (8/30) May 31 2013 ce,
• Timothee Cour (23/50) Jun 05 2013 =A7=E3=81=99=E3=80=82
• Brad Roberts (2/16) Jun 05 2013 Filed in bugzilla?
• Sean Kelly (10/30) Jun 17 2013 unicode strings/comments), we must
• H. S. Teoh (6/31) Jun 17 2013 Do linkers actually support 8-bit symbol names? Or do these have to be
• Sean Kelly (8/11) Jun 17 2013 Good question. It looks like the linker on OSX does:
• Brad Roberts (4/12) Jun 17 2013 Don't symbol names from dmd/win32 get compressed if they're too long, re...
• Walter Bright (2/6) Jun 17 2013 Optlink doesn't care what the symbol byte contents are.
• H. S. Teoh (10/17) Jun 18 2013 It seems ld on Linux doesn't, either. I just tested separate compilation
• Walter Bright (2/3) Jun 18 2013 I doubt it, but try it and see!
• H. S. Teoh (6/10) Jun 18 2013 Sadly I don't have access to a Windows dev machine. Anybody else cares
• Sean Kelly (11/26) Jun 19 2013 be
• Manu (19/36) May 30 2013 =E3=81=99=E3=80=82
• Walter Bright (3/4) May 30 2013 I am going to leave it that way based on the comments here, I only wante...
• Manu (6/28) May 30 2013 Indeed, and believe me, the variable names can often make NO sense, or
• Mr. Anonymous (2/31) May 28 2013 http://code.google.com/p/trileri/source/browse/trunk/tr/yazi.d
• Simen Kjaeraas (13/33) May 27 2013 On Tue, 28 May 2013 00:18:31 +0200, H. S. Teoh ...
• Jonathan M Davis (7/13) May 27 2013 I think that it was more an issue of that the only reason that Unicode w...
• Manu (22/36) May 27 2013 I'm fairly sure that any programmer who takes themself seriously will us...
• Daniel Murphy (3/10) May 25 2013 When these have keys on standard keyboards.
• Joakim (23/44) May 25 2013 That is why I asked this question here. I think D is still one
• Walter Bright (18/21) May 25 2013 I think you stand alone in your desire to return to code pages. I have y...
• Joakim (39/60) May 25 2013 Nobody is talking about going back to code pages. I'm talking
• Dmitry Olshansky (11/35) May 25 2013 Problem is what you outline is isomorphic with code-pages. Hence the
• Jonathan M Davis (17/44) May 25 2013 ith it
• H. S. Teoh (25/58) May 25 2013 [...]
• Walter Bright (7/14) May 25 2013 Many moons ago, when the earth was young and I had a few strands of hair...
• Vladimir Panteleev (30/39) May 25 2013 For the record, I noticed that programmers (myself included) that
• Joakim (22/44) May 25 2013 Combining characters are examples of complexity baked into the
• w0rp (1/1) May 25 2013 This is dumb. You are dumb. Go away.
• Vladimir Panteleev (14/26) May 25 2013 You don't need to do that to slice a string. I think you mean to
• Joakim (36/51) May 25 2013 Slicing a string implies finding the N-th code point, what other
• Vladimir Panteleev (2/8) May 25 2013 No. Are you sure you understand UTF-8 properly?
• Joakim (17/44) May 25 2013 Are you sure _you_ understand it properly? Both encodings have
• Peter Alexander (22/43) May 25 2013 I suggest you read up on UTF-8. You really don't understand it.
• Peter Alexander (2/10) May 25 2013 Oops. Missing a ++c in there, but I'm sure the point was made :-)
• Vladimir Panteleev (10/31) May 25 2013 It looks like you've missed an important property of UTF-8: lower
• Joakim (35/76) May 25 2013 OK, you got me with this particular special case: it is not
• Peter Alexander (11/17) May 25 2013 It's not just a shortcut, it is absolutely fundamental to the
• H. S. Teoh (14/32) May 25 2013 [...]
• Dmitry Olshansky (9/43) May 25 2013 +1
• Andrei Alexandrescu (5/13) May 25 2013 You mentioned this a couple of times, and I wonder what makes you so
• Walter Bright (6/18) May 25 2013 On the other hand, Joakim even admits his single byte encoding is variab...
• Joakim (20/29) May 25 2013 I have noted from the beginning that these large alphabets have
• Walter Bright (28/55) May 25 2013 If it's one byte sometimes, or two bytes sometimes, it's variable length...
• Joakim (95/174) May 26 2013 It is variable length, with the advantage that only strings
• Declan (2/184) May 26 2013 I服了u，I'm thinking of your name means joking?
• John Colvin (3/185) May 26 2013 I suggest you make an attempt at writing strstr and post it. Code
• Walter Bright (3/6) May 26 2013 C'mon, Joakim, show us this amazing strstr() implementation for your sch...
• Joakim (7/16) May 26 2013 You will see it when it's built into a fully working single-byte
• Diggory (38/43) May 25 2013 All I can say is if you think that is simpler than UTF-8 then you
• Dmitry Olshansky (13/41) May 24 2013 As is there are no UTF-8 specific tables (yet), but there are tools to
• Joakim (40/138) May 25 2013 This problem already exists for UTF-8, breaking ASCII
• Diggory (19/19) May 25 2013 I think you are a little confused about what unicode actually
• Joakim (20/41) May 25 2013 Incorrect.
• Diggory (36/70) May 25 2013 Given that all the machine registers are at least 32-bits already
• Joakim (36/111) May 25 2013 No, that directly _contradicts_ what you said about Unicode
• Diggory (52/164) May 25 2013 UCS does have nothing to do with code pages, it was designed as a
• Walter Bright (5/7) May 25 2013 I suspect the Chinese, Koreans, and Japanese would take exception to bei...
• Joakim (41/77) May 24 2013 Yes, on the encoding, if it's a variable-length encoding like
• Dmitry Olshansky (44/122) May 25 2013 UCS is dead and gone. Next in line to "640K is enough for everyone".
• Joakim (59/173) May 25 2013 I think you are confused. UCS refers to the Universal Character
• Juan Manuel Cabo (13/13) May 25 2013 ░░░░░░░░░ⓌⓉⒻ░
• Diggory (28/36) May 25 2013 "limited success of UTF-8"
• Joakim (114/263) May 26 2013 So you admit that UTF-8 isn't used on the vast majority of
• Dmitry Olshansky (29/159) May 25 2013 You can map a codepage to a subset of UCS :)
• Joakim (36/77) May 25 2013 Something like that. For a multi-language string encoding, the
• Dmitry Olshansky (7/19) May 25 2013 Runs away in horror :) It's mess even before you've got to details.
• Joakim (27/61) May 26 2013 Perhaps it's fatally flawed, but I don't see an argument for why,
• Juan Manuel Cabo (13/27) May 25 2013 You obviously are not thinking it through. Such encoding would
• H. S. Teoh (16/27) May 25 2013 [...]
• Joakim (33/116) May 26 2013 For some reason this posting by H. S. Teoh shows up on the
• H. S. Teoh (43/67) May 26 2013 IHBT. You said that to handle multilanguage strings, your header would
• Joakim (21/44) May 26 2013 Pretty funny how you claim you've been trolled and then go on to
• Vladimir Panteleev (12/15) May 26 2013 1. Make extraordinary claims
• Dmitry Olshansky (5/19) May 26 2013 +1
• Joakim (33/44) May 26 2013 I never refused small examples. I have provided several analyses
• Andrei Alexandrescu (14/18) May 26 2013 Congratulations, you are literally the only person on the Internet who
• Joakim (11/32) May 26 2013 Haha, that is funny, :D though "unicode is shit" returns at least
• Mr. Anonymous (4/16) May 26 2013 On the other hand:
• Joakim (8/23) May 26 2013 I'm not sure if you were trying to make my point, but you just
• Mr. Anonymous (2/26) May 26 2013 Man, you're a bullshit machine!
• Joakim (2/11) May 26 2013 What can I say? I'm very good at interpreting bad data. ;)
• Marco Leise (5/29) May 29 2013 Lol, https://www.google.com/search?q=%22utf-8+is+the+best%22
• Joakim (11/38) May 30 2013 Your point is? 121 results, including false positives like
• Marco Leise (8/19) May 30 2013 Alright, for me it said ~6.570.000 results, which I found
• Marcin Mstowski (12/49) May 26 2013 Character Data Representation
• Joakim (10/27) May 26 2013 You might be right, but I gave it a quick look and can't make out
• Marcin Mstowski (9/34) May 26 2013 Yes, because they didn't reinvent wheel from scratch and are reusing
• Joakim (13/30) May 26 2013 Which picture description did you have in mind? They all seem
• John Colvin (5/12) May 27 2013 It's your idea and project. Showing that it is original / doing
• Joakim (18/30) May 27 2013 Sure, some research is necessary. However, software is littered
• Walter Bright (4/9) May 25 2013 Please implement the simple C function strstr() with this simple scheme,...
• Walter Bright (19/28) May 25 2013 I'll go first. Here's a simple UTF-8 version in C. It's not the fastest ...
• H. S. Teoh (32/42) May 25 2013 I'll have you know that Chinese, Korean, and Japanese account for a
• H. S. Teoh (23/54) May 27 2013 I think there's a difference between allowing math symbols (which
• Torje Digernes (18/90) May 29 2013 I think there is very little difference, both cases are
"Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
toUpper/lower cannot be made in place if it should handle all
Unicode. Some characters will change their length when converted
to/from uppercase. Examples of these are the German double S
and some Turkish I.

This triggered a long-standing bugbear of mine: why are we using
these variable-length encodings at all?  Does anybody really care
about UTF-8 being "self-synchronizing," ie does anybody actually
use that in this day and age?  Sure, it's backwards-compatible
with ASCII and the vast majority of usage is probably just ASCII,
but that means the other languages don't matter anyway.  Not to
mention taking the valuable 8-bit real estate for English and
dumping the longer encodings on everyone else.

I'd just use a single-byte header to signify the language and
then put the vast majority of languages in a single byte
encoding, with the few exceptional languages with more than 256
characters encoded in two bytes.  OK, that doesn't cover
multi-language strings, but that is what, .000001% of usage?
Make your header a little longer and you could handle those also.
Yes, it wouldn't be strictly backwards-compatible with ASCII,
but it would be so much easier to internationalize.  Of course,
there's also the monoculture we're creating; love this UTF-8 rant
by tuomov, author of one of the first tiling window managers for
linux:

http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06

The emperor has no clothes, what am I missing?

May 24 2013
"Peter Alexander" <peter.alexander.au gmail.com> writes:
On Friday, 24 May 2013 at 17:05:57 UTC, Joakim wrote:
This triggered a long-standing bugbear of mine: why are we
using these variable-length encodings at all?

Simple: backwards compatibility with all ASCII APIs (e.g. most C
libraries), and because I don't want my strings to consume
multiple bytes per character when I don't need it.

Your language header idea is no good for at least three reasons:

1. What happens if I want to take a substring slice of your
string? I'll need to allocate a new string to add the header in.

2. What if I have a long string with the ASCII header and want to
append a non-ASCII character on the end? I'll need to reallocate
the whole string and widen it with the new header.

3. Even if I have a string that is 99% ASCII then I have to pay
extra bytes for every character just because 1% wasn't ASCII.
With UTF-8, I only pay the extra bytes when needed.

May 24 2013
"Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 17:43:03 UTC, Peter Alexander wrote:
Simple: backwards compatibility with all ASCII APIs (e.g. most
C libraries), and because I don't want my strings to consume
multiple bytes per character when I don't need it.

And yet here we are today, where an early decision made solely to
accommodate the authors of then-dominant all-ASCII APIs has now
foisted an unnecessarily complex encoding on all of us, with
reduced performance as the result.  You do realize that my
encoding would encode almost all languages' characters in single
bytes, unlike UTF-8, right?  Your latter argument is one against
UTF-8.

Your language header idea is no good for at least three reasons:

1. What happens if I want to take a substring slice of your
string? I'll need to allocate a new string to add the header in.

Good point.  The solution that comes to mind right now is that
you'd parse my format and store it in memory as a String class,
storing the chars in an internal array with the header stripped
out and the language stored in a property.  That way, even a
slice could be made to refer to the same language, by referring
to the language of the containing array.

Strictly speaking, this solution could also be implemented with
UTF-8, simply by changing the format for the data structure you
use in memory to the one I've outlined, as opposed to using the
UTF-8 encoding for both transmission and processing.  But if
you're going to use my format for processing, you might as well
use it for transmission also, since it is much smaller for
non-ASCII text.

Before you ridicule my solution as somehow unworkable, let me
remind you of the current monstrosity.  Currently, the language
is stored in every single UTF-8 character, by having the length
vary from one to four bytes depending on the language.  This
leads to Phobos converting every UTF-8 string to UTF-32, so that
it can easily run its algorithms on a constant-width 32-bit
character set, and the resulting performance penalties.  Perhaps
the biggest loss is that programmers everywhere are pushed to
ignorance or broken code.

Which seems more unworkable to you?

2. What if I have a long string with the ASCII header and want
to append a non-ASCII character on the end? I'll need to
reallocate the whole string and widen it with the new header.

How often does this happen in practice?  I suspect that this
almost never happens.  But if it does, it would be solved by the
String class I outlined above, as the header isn't stored in the
array anymore.

3. Even if I have a string that is 99% ASCII then I have to pay
extra bytes for every character just because 1% wasn't ASCII.
With UTF-8, I only pay the extra bytes when needed.

I don't understand what you mean here.  If your string has a
thousand non-ASCII characters, the UTF-8 version will have one or
two thousand more characters, ie 1 or 2 KB more.  My format would
use only a single byte for each non-ASCII language character used,
that's it.  It's a clear win for my format.

In any case, I just came up with the simplest format I could off
the top of my head, maybe there are gaping holes in it.  But my
point is that we should be able to come up with such a much
simpler format, which keeps most characters to a single byte, not
that my format is best.  All I want to argue is that UTF-8 is the
worst. ;)

May 24 2013
"Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 20:37:58 UTC, Joakim wrote:
3. Even if I have a string that is 99% ASCII then I have to
pay extra bytes for every character just because 1% wasn't
ASCII. With UTF-8, I only pay the extra bytes when needed.

I don't understand what you mean here.  If your string has a
thousand non-ASCII characters, the UTF-8 version will have one
or two thousand more characters, ie 1 or 2 KB more.  My format
would use only a single byte for each non-ASCII language character
used, that's it.  It's a clear win for my format.

Sorry, I was a bit imprecise.  Here's what I meant to write:

I don't understand what you mean here.  If your string has a
thousand non-ASCII characters, the UTF-8 version will have one
or two thousand more bytes, ie 1 or 2 KB more.  My format would
add only a single byte for each non-ASCII language used, that's it.
It's a clear win for my format.

May 24 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2013 1:37 PM, Joakim wrote:
This leads to Phobos converting every UTF-8 string to UTF-32, so that
it can easily run its algorithms on a constant-width 32-bit character set, and
the resulting performance penalties.

This is more a problem with the algorithms taking the easy way than a problem
with UTF-8. You can do all the string algorithms, including regex, by working
with the UTF-8 directly rather than converting to UTF-32. Then the algorithms
work at full speed.

Yes, it wouldn't be strictly backwards-compatible with ASCII, but it
would be so much easier to internationalize.

That was the go-to solution in the 1980's, they were called "code pages". A
disaster.

with the few exceptional languages with more than 256 characters
encoded in two bytes.

Like those rare languages Japanese, Korean, Chinese, etc. This too was done in
the 80's with "Shift-JIS" for Japanese, and some other wacky scheme for Korean,
and a third nutburger one for Chinese.

I've had the misfortune of supporting all that in the old Zortech C++ compiler.
It's AWFUL. If you think it's simpler, all I can say is you've never tried to
write internationalized code with it.

UTF-8 is heavenly in comparison. Your code is automatically internationalized.
It's awesome.

May 24 2013
"anonymous" <anonymous example.com> writes:
On Friday, 24 May 2013 at 17:05:57 UTC, Joakim wrote:
On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
toUpper/lower cannot be made in place if it should handle all
Unicode. Some characters will change their length when converted
to/from uppercase. Examples of these are the German double S
and some Turkish I.

This triggered a long-standing bugbear of mine: why are we
using these variable-length encodings at all?  Does anybody
really care about UTF-8 being "self-synchronizing," ie does
anybody actually use that in this day and age?  Sure, it's
backwards-compatible with ASCII and the vast majority of usage
is probably just ASCII, but that means the other languages
don't matter anyway.  Not to mention taking the valuable 8-bit
real estate for English and dumping the longer encodings on
everyone else.

The German ß becomes SS when capitalised. It's no encoding issue.

May 24 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 24-May-2013 21:05, Joakim wrote:
On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
toUpper/lower cannot be made in place if it should handle all Unicode.
Some characters will change their length when converted to/from
uppercase. Examples of these are the German double S and some Turkish I.

This triggered a long-standing bugbear of mine: why are we using these
variable-length encodings at all?  Does anybody really care about UTF-8
being "self-synchronizing," ie does anybody actually use that in this
day and age?  Sure, it's backwards-compatible with ASCII and the vast
majority of usage is probably just ASCII, but that means the other
languages don't matter anyway.  Not to mention taking the valuable 8-bit
real estate for English and dumping the longer encodings on everyone else.

I'd just use a single-byte header to signify the language and then put
the vast majority of languages in a single byte encoding, with the few
exceptional languages with more than 256 characters encoded in two
bytes.

You seem to think that not only is UTF-8 a bad encoding, but also that
one universal character set for all languages is a bad idea.

Separate code spaces were the case before Unicode (and utf-8). The
problem is not only that without the header the text is meaningless (no
easy slicing) but that the encoding of the data after the header
depends on a variety of factors - a list of encodings, actually. Now
everybody has to keep a (code) page per language just to know if it's 2
bytes per char or 1 byte per char or whatever. And you still work on
the assumption that there are no combining marks or region-specific stuff :)

In fact it was even "better": nobody ever talked about a header, they
just assumed a codepage from some global setting. Imagine yourself
creating a font rendering system these days - a hell of an exercise in
frustration (okay, how do I render 0x88? hmm, if that is in codepage
XYZ then ...).

OK, that doesn't cover multi-language strings, but that is what,
.000001% of usage?

This just shows you don't care about multilingual stuff at all.
Imagine any language tutor/translator/dictionary on the Web. For
instance, most languages need to intersperse ASCII (also keep in mind
e.g. HTML markup). Books often feature citations in a native language
(or e.g. Latin) along with translations.

Now also take into account math symbols, currency symbols and beyond.
Also, these days cultures are mixing in wild combinations, so you might
need to see the text even if you can't read it. Unicode is not only
about encoding characters from all languages; it needs to address the
universal representation of the symbols used in writing systems at large.

those also.  Yes, it wouldn't be strictly backwards-compatible with
ASCII, but it would be so much easier to internationalize.  Of course,
there's also the monoculture we're creating; love this UTF-8 rant by
tuomov, author of one of the first tiling window managers for linux:

We want monoculture! That is, to understand each other without all these
"par-le-vu-france?" and codepages of various complexity (insanity).

Want small? Use compression schemes, which work perfectly fine and get
you to the precious 1 byte per codepoint with exceptional speed:
http://www.unicode.org/reports/tr6/

http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06

The emperor has no clothes, what am I missing?

And borrowing the arguments from that rant: locale is borked shit when
it comes to encodings. Locales should be used for tweaking visuals like
number and date display and so on.

--
Dmitry Olshansky

May 24 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, May 25, 2013 at 01:21:25AM +0400, Dmitry Olshansky wrote:
On 24-May-2013 21:05, Joakim wrote:

[...]
This triggered a long-standing bugbear of mine: why are we using
these variable-length encodings at all?  Does anybody really care
about UTF-8 being "self-synchronizing," ie does anybody actually use
that in this day and age?  Sure, it's backwards-compatible with ASCII
and the vast majority of usage is probably just ASCII, but that means
the other languages don't matter anyway.  Not to mention taking the
valuable 8-bit real estate for English and dumping the longer
encodings on everyone else.

I'd just use a single-byte header to signify the language and then
put the vast majority of languages in a single byte encoding, with
the few exceptional languages with more than 256 characters encoded
in two bytes.

You seem to think that not only is UTF-8 a bad encoding, but also that
one universal character set for all languages is a bad idea.

Separate code spaces were the case before Unicode (and utf-8). The
problem is not only that without the header the text is meaningless (no
easy slicing) but that the encoding of the data after the header
depends on a variety of factors - a list of encodings, actually. Now
everybody has to keep a (code) page per language just to know if it's
2 bytes per char or 1 byte per char or whatever. And you still work on
the assumption that there are no combining marks or region-specific
stuff :)

I remember those bad ole days of gratuitously-incompatible encodings. I
hope those days never ever return. You'd get a text file in some
unknown encoding, and the only way to make any sense of it was to
guess what encoding it might be and hope you got lucky. Not only that,
the same language often has multiple encodings, so adding support for a
able to tell them apart (often with no info on which they are, if you're
lucky, or if you're unlucky, with *wrong* encoding type specs -- for
example, I *still* get email from outdated systems that claim to be
iso-8859 when it's actually KOI8R).

Prepending the encoding to the data doesn't help, because it's pretty
much guaranteed somebody will cut-n-paste some segment of that data and
save it without the encoding type header (or worse, some program will
try to "fix" broken low-level code by prepending a default encoding type
to everything, regardless of whether it's actually in that encoding or
not), thus ensuring nobody will be able to reliably recognize what
encoding it is down the road.

In fact it was even "better": nobody ever talked about a header, they
just assumed a codepage from some global setting. Imagine yourself
creating a font rendering system these days - a hell of an exercise in
frustration (okay, how do I render 0x88? hmm, if that is in codepage
XYZ then ...).

Not to mention, if the sysadmin changes the default locale settings, you
may suddenly discover that a bunch of your text files have become
gibberish, because some programs blindly assume that every text file is
in the current locale-specified language.

I tried writing language-agnostic text-processing programs in C/C++ once.
The Posix spec *seems* to promise language-independence with its locale
functions, but actually, the whole thing is one big inconsistent and
under-specified mess that has many unspecified, implementation-specific
behaviours that you can't rely on.  The APIs basically assume that you
set your locale's language once, and never change it, and every single
file you'll ever want to read must be encoded in that particular
encoding. If you try to read another encoding, too bad, you're screwed.
There isn't even a standard for locale names that you could use to
manually switch locales inside your program (yes, there are de facto
conventions, but there *are* systems out there that don't follow them).

And many standard library functions are affected by locale settings
(once you call setlocale, *anything* could change, like string
comparison, output encoding, etc.), making it a hairy mess to get
input/output of multiple encodings to work correctly. Basically, you
have to write everything manually, because the standard library can't
handle more than a single encoding correctly (well, not without extreme
amounts of pain, that is). So you're back to manipulating bytes
directly. Which means you have to keep large tables of every single
encoding you ever wish to support. And encoding-specific code to deal
with exceptions for those evil variant encodings that are supposedly
the same as the official standard of that encoding, but actually have
one or two subtle differences that cause your program to output
embarrassing garbage characters every now and then.

For all of its warts, Unicode fixed a WHOLE bunch of these problems, and
many times over.  And now we're trying to go back to that nightmarish
old world again? No way, José!

[...]
Make your header a little longer and you could handle those also.
Yes, it wouldn't be strictly backwards-compatible with ASCII, but it
would be so much easier to internationalize.  Of course, there's also
the monoculture we're creating; love this UTF-8 rant by tuomov,
author of one of the first tiling window managers for linux:

We want monoculture! That is, to understand each other without all these
"par-le-vu-france?" and codepages of various complexity (insanity).

Yeah, those codepages were an utter nightmare to deal with. Everybody
and his neighbour's dog invented their own codepage, sometimes multiple
codepages for a single language, all of which are gratuitously
incompatible with each other. Every codepage has its own peculiarities
and exceptions, and programs have to know how to deal with all of them.
Only to get broken again as soon as somebody invents yet another
codepage two years later, or creates yet another codepage variant just
for the heck of it.

If you're really concerned about encoding size, just use a compression
library -- they're readily available these days. Internally, the program
can just use UTF-16 for the most part -- UTF-32 is really only necessary
if you're routinely delving outside BMP, which is very rare.

As far as Phobos is concerned, Dmitry's new std.uni module has powerful
code-generation templates that let you write code that operates directly
on UTF-8 without needing to convert to UTF-32 first. Well, OK, maybe
we're not quite there yet, but the foundations are in place, and I'm
looking forward to the day when string functions will no longer have
implicit conversion to UTF-32, but will directly manipulate UTF-8 using
optimized state tables generated by std.uni.

Want small? Use compression schemes, which are perfectly fine, and
get to the precious 1 byte per codepoint with exceptional speed.
http://www.unicode.org/reports/tr6/

+1.  Using your own encoding is perfectly fine. Just don't do that for
data interchange. Unicode was created because we *want* a single
standard to communicate with each other without stupid broken encoding
issues that used to be rampant on the web before Unicode came along.

In the bad ole days, HTML could be served in any random number of
encodings, often out-of-sync with what the server claims the encoding
is, and browsers would assume arbitrary default encodings that for the
most part *appeared* to work but are actually fundamentally b0rken.
Sometimes webpages would show up mostly-intact, but with a few
characters mangled, because of deviations / variations on codepage
interpretation, or non-standard characters being used in a particular
encoding. It was a total, utter mess, that wasted who knows how many
man-hours of programming time to work around. For data interchange on
the internet, we NEED a universal standard that everyone can agree on.

http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06

The emperor has no clothes, what am I missing?

And borrowing the arguments from that rant: locale is borked
shit when it comes to encodings. Locales should be used for tweaking
visuals like number and date display, and so on.

[...]

I found that rant rather incoherent. I didn't find any convincing
arguments, just a lot of grievances about gratuitous complexity and
why monoculture is "bad", without much supporting evidence.

UTF-8, for all its flaws, is remarkably resilient to mangling -- you can
cut-n-paste any byte sequence and the receiving end can still make some
sense of it.  Not like the bad old days of codepages where you just get
one gigantic block of gibberish. A properly-synchronizing UTF-8 function
can still recover legible data, maybe with only a few characters at the
ends truncated in the worst case. I don't see how any codepage-based
encoding is an improvement over this.

T

--
There are 10 kinds of people in the world: those who can count in
binary, and those who can't.

May 24 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2013 3:42 PM, H. S. Teoh wrote:
I tried writing language-agnostic text-processing programs in C/C++

One of the first, and best, decisions I made for D was it would be
Unicode front to back.

At the time, Unicode was poorly supported by operating systems and lots of
software, and I encountered some initial resistance to it. But I believed
Unicode was the inevitable future.

Code pages, Shift-JIS, EBCDIC, etc., should all be terminated with prejudice.

May 24 2013
Manu <turkeyman gmail.com> writes:
On 25 May 2013 11:58, Walter Bright <newshound2 digitalmars.com> wrote:

On 5/24/2013 3:42 PM, H. S. Teoh wrote:

I tried writing language-agnostic text-processing programs in C/C++

One of the first, and best, decisions I made for D was it would be
Unicode front to back.

Indeed, excellent decision!
So when we define operators for u × v and a · b, or maybe n²? ;)

At the time, Unicode was poorly supported by operating systems and lots of
software, and I encountered some initial resistance to it. But I believed
Unicode was the inevitable future.

Code pages, Shift-JIS, EBCDIC, etc., should all be terminated with
prejudice.


May 24 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2013 7:16 PM, Manu wrote:
So when we define operators for u × v and a · b, or maybe n²? ;)

Oh, how I want to do that. But I still think the world hasn't completely caught
up with Unicode yet.

May 24 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, May 24, 2013 at 08:45:56PM -0700, Walter Bright wrote:
On 5/24/2013 7:16 PM, Manu wrote:
So when we define operators for u × v and a · b, or maybe n²? ;)

Oh, how I want to do that. But I still think the world hasn't
completely caught up with Unicode yet.

That would be most awesome!

Though it does raise the issue of how parsing would work, 'cos you
either have to assign a fixed precedence to each of these operators (and
there are a LOT of them in Unicode!), or allow user-defined operators
with custom precedence and associativity, which means nightmare for the
parser (it has to adapt itself to new operators as the code is
parsed/analysed, which then leads to issues with what happens if two
different modules define the same operator with conflicting precedence /
associativity).

T

--
Spaghetti code may be tangly, but lasagna code is just cheesy.

May 24 2013
Timon Gehr <timon.gehr gmx.ch> writes:
On 05/25/2013 05:56 AM, H. S. Teoh wrote:
On Fri, May 24, 2013 at 08:45:56PM -0700, Walter Bright wrote:
On 5/24/2013 7:16 PM, Manu wrote:
So when we define operators for u × v and a · b, or maybe n²? ;)

Oh, how I want to do that. But I still think the world hasn't
completely caught up with Unicode yet.

That would be most awesome!

Though it does raise the issue of how parsing would work, 'cos you
either have to assign a fixed precedence to each of these operators (and
there are a LOT of them in Unicode!),

I think this is what eg. Fortress is doing.

or allow user-defined operators
with custom precedence and associativity,

This is what eg. Haskell, Coq are doing.
(Though Coq has the advantage of not allowing forward references, and
hence inline parser customization is straightforward in Coq.)

which means nightmare for the
parser (it has to adapt itself to new operators as the code is
parsed/analysed,

It would be easier on the parsing side, since the parser would not fully
parse expressions. Semantic analysis would resolve precedences. This is
quite simple, and the current way the parser resolves operator
precedences is less efficient anyways.

which then leads to issues with what happens if two
different modules define the same operator with conflicting precedence /
associativity).

This would probably be an error without explicit disambiguation, or
follow the usual disambiguation rules. (trying all possibilities appears
to be exponential in the number of conflicting operators in an
expression in the worst case though.)

May 25 2013
"Hans W. Uhlig" <huhlig gmail.com> writes:
On Saturday, 25 May 2013 at 03:46:23 UTC, Walter Bright wrote:
On 5/24/2013 7:16 PM, Manu wrote:
So when we define operators for u × v and a · b, or maybe n²?
;)

Oh, how I want to do that. But I still think the world hasn't
completely caught up with Unicode yet.

Using those characters would be wonderful, but while we do have
unicode software support, we don't really have unicode hardware
support. I am still on my 102-key keyboard, and I haven't really
seen a good expanded-character keyboard come along.

May 26 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/26/2013 1:44 PM, Hans W. Uhlig wrote:
Using those characters would be wonderful and while we do have unicode software
support we don't really have unicode hardware support. I am still on my 102 key
keyboard and I haven't really seen a good expanded character keyboard come
along.

I have a post-it stuck to my monitor with the numbers for various unicode
characters, but I just can't see that for writing code.

May 26 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, May 26, 2013 at 02:14:17PM -0700, Walter Bright wrote:
On 5/26/2013 1:44 PM, Hans W. Uhlig wrote:
Using those characters would be wonderful and while we do have
unicode software support we don't really have unicode hardware
support. I am still on my 102 key keyboard and I haven't really seen
a good expanded character keyboard come along.

I have a post-it stuck to my monitor with the numbers for various
unicode characters, but I just can't see that for writing code.

I have been thinking about this idea of a "reprogrammable keyboard", in
that the keys are either a fixed layout with LCD labels on each key, or
perhaps the whole thing is a long touchscreen, that allows arbitrary
relabelling of keys (or, in the latter case, complete dynamic
reconfiguration of layout). There would be some convenient way to switch
between layouts, say a scrolling sidebar or roller dial of some sort, so
you could, in theory, type Unicode directly.

I haven't been able to refine this into an actual, implementable idea,
though.

T

--
Shin: (n.) A device for finding furniture in the dark.

May 26 2013
"Kiith-Sa" <kiithsacmp gmail.com> writes:
You mean like
http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard ?

May 26 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, May 26, 2013 at 11:25:09PM +0200, Kiith-Sa wrote:
You mean like http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard
?

Whoa! That is exactly what I had in mind!!

Pity they don't appear to support Linux, though. :-(

T

--
MACINTOSH: Most Applications Crash, If Not, The Operating System Hangs

May 26 2013
"Torje Digernes" <torjehoa pvv.org> writes:
On Sunday, 26 May 2013 at 21:46:38 UTC, H. S. Teoh wrote:
On Sun, May 26, 2013 at 11:25:09PM +0200, Kiith-Sa wrote:
You mean like
http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard
?

Whoa! That is exactly what I had in mind!!

Pity they don't appear to support Linux, though. :-(

T

If you want to configure your keyboard so you can type unicode in
Linux, you should make yourself familiar with xkb. It is not that
difficult to work with; not exactly user friendly, but definitely
superuser friendly.

May 26 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, May 27, 2013 at 12:30:02AM +0200, Torje Digernes wrote:
On Sunday, 26 May 2013 at 21:46:38 UTC, H. S. Teoh wrote:
On Sun, May 26, 2013 at 11:25:09PM +0200, Kiith-Sa wrote:
You mean like
http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard
?

Whoa! That is exactly what I had in mind!!

Pity they don't appear to support Linux, though. :-(

T

If you want to configure your keyboard so you can type unicode in
Linux you should make yourself familiar with xkb, it is not that
difficult to work with, but not exactly user friendly either, super
user friendly though.

Oh, I know *that*. I configured my xkb setup to switch between English
and Russian with the unused windows key (I used to have Greek too, but I
use it rarely enough that I took it out). It's just that without the
dynamic key labels, I have to touch-type, which requires learning each
layout as opposed to just looking for the symbol I need on the key
labels. And I have yet to figure out a sane way to support *all* of
Unicode without making the result unusable -- when I had Greek in the
mix, it was already getting cumbersome having to repeatedly hit the
windows key when alternating between two of the 3 languages.
That's simply not scalable to, say, 100 modes.  :-P

But maybe I'm just missing a really obvious solution. That happens a
lot.  :-P

T

--
War doesn't prove who's right, just who's left. -- BSD Games' Fortune

May 26 2013
"Wyatt" <wyatt.epp gmail.com> writes:
On Sunday, 26 May 2013 at 21:23:44 UTC, H. S. Teoh wrote:
I have been thinking about this idea of a "reprogrammable keyboard", in
that the keys are either a fixed layout with LCD labels on each key, or
perhaps the whole thing is a long touchscreen, that allows arbitrary
relabelling of keys (or, in the latter case, complete dynamic
reconfiguration of layout). There would be some convenient way to switch
between layouts, say a scrolling sidebar or roller dial of some sort, so
you could, in theory, type Unicode directly.

I haven't been able to refine this into an actual, implementable idea,
though.

I've given this domain a fair bit of thought, and from my
perspective you want to throw hardware at a software problem.
Have you ever used a Japanese input method?  They're sort of a
good exemplar here, wherein you type a sequence and then hit
space to cycle through possible ways of writing it.  So "ame" can
become あめ, 雨, 飴, etc.  Right now, in addition to my learning,
I also use it for things like α (アルファ) and Δ (デルタ).  It's
limited, but...usable, I guess.  Sort of.

The other end of this is TeX, which was designed around the idea
of composing scientific texts with a high degree of control and
flexibility.  Specialty characters are inserted with
backslash-escapes, like \alpha, \beta, etc.

Now combine the two:  An input method that outputs as usual,
until you enter a character code, which is substituted in real
time with what you actually want.
Example:
"values of \beta will give rise to dom!" composes as
"values of β will give rise to dom!"

No hardware required; just a smarter IME.  Like maybe this one:
http://www.andonyar.com/rec/2008-03/mathinput/ (I'm honestly not
yet sure how mature or usable that one is as I'm a UIM user, but
it does serve as a proof of concept).

May 26 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, May 27, 2013 at 04:17:06AM +0200, Wyatt wrote:
On Sunday, 26 May 2013 at 21:23:44 UTC, H. S. Teoh wrote:
in that the keys are either a fixed layout with LCD labels on each
key, or perhaps the whole thing is a long touchscreen, that allows
arbitrary relabelling of keys (or, in the latter case, complete
dynamic reconfiguration of layout). There would be some convenient
way to switch between layouts, say a scrolling sidebar or roller dial
of some sort, so you could, in theory, type Unicode directly.

I haven't been able to refine this into an actual, implementable
idea, though.

I've given this domain a fair bit of thought, and from my
perspective you want to throw hardware at a software problem.  Have
you ever used a Japanese input method?  They're sort of a good
exemplar here, wherein you type a sequence and then hit space to
cycle through possible ways of writing it.  So "ame" can become,
あめ, 雨, 飴, etc.  Right now, in addition to my learning, I also
use it for things like α (アルファ) and Δ (デルタ).  It's limited,
but...usable, I guess.  Sort of.

The other end of this is TeX, which was designed around the idea of
composing scientific texts with a high degree of control and
flexibility.  Specialty characters are inserted with
backslash-escapes, like \alpha, \beta, etc.

Now combine the two:  An input method that outputs as usual, until
you enter a character code which is substituted in real time to what
you actually want.
Example:
"values of \beta will give rise to dom!" composes as
"values of β will give rise to dom!"

No hardware required; just a smarter IME.  Like maybe this one:
http://www.andonyar.com/rec/2008-03/mathinput/ (I'm honestly not yet
sure how mature or usable that one is as I'm a UIM user, but it does
serve as a proof of concept).

I like this idea. It's certainly more feasible than reinventing the
Optimus Maximus keyboard. :) I can write code for free, but engineering
custom hardware is a bit beyond my abilities (and means!).

If we go the software route, then one possible strategy might be:

- Have a default mode that is whatever your default keyboard layout is
(the usual 100+-key layout, or DVORAK, whatever.).

- Assign one or two escape keys (not to be confused with the Esc key,
which is something else) that allows you to switch mode.

- Under the 1-key scheme, you'd use it to begin sequences like \beta,
except that instead of the backslash \, you're using a dedicated
key. These sequences can include individual characters (e.g.
<ESC>beta == β) or allow you to change the current input mode (e.g.
<ESC>grk to switch to a Greek layout that takes effect from that
point onwards until you enter, say, <ESC>eng). For convenience, the
sequence <ESC><ESC> can be shorthand for switching back to whatever
the default layout is, so that if you mistype an escape sequence
and end up in some strange unexpected layout mode, hitting <ESC>
twice will reset it back to the default.

- Under the 2-key scheme, you'd have one key dedicated for the
occasional foreign character (<ESC1>beta == β), and the second key
dedicated for switching layouts (thus allowing shorter sequences
for switching between languages without fear of conflicting with
single-character sequences, e.g., <ESC2>g for Greek).

Perhaps the 1-key scheme is the simplest to implement. The capslock key
is a good candidate, being conveniently located where your left little
finger is, and having no real useful function in this day and age.

The only drawback is no custom key labels. But perhaps that can be
alleviated by hooking an escape sequence to toggle an on-screen visual
representation of the current layout. Maybe <ESC>? can be assigned to
invoke a helper utility that renders the current layout on the screen.

T

--
Don't get stuck in a closet---wear yourself out.

May 27 2013
On Monday, 27 May 2013 at 02:17:08 UTC, Wyatt wrote:
No hardware required; just a smarter IME.

Perhaps something like the compose key?

http://en.wikipedia.org/wiki/Compose_key

May 27 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, May 27, 2013 at 09:59:52PM +0200, Vladimir Panteleev wrote:
On Monday, 27 May 2013 at 02:17:08 UTC, Wyatt wrote:
No hardware required; just a smarter IME.

Perhaps something like the compose key?

http://en.wikipedia.org/wiki/Compose_key

I'm already using the compose key. But it only goes so far (I don't
think compose key sequences cover all of unicode). Besides, it's
impractical to use compose key sequences to write large amounts of text
in some given language; a method of temporarily switching to a different
layout is necessary.
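For the occasional math operator, though, a few custom ~/.XCompose
entries go a fair way. The sequences below are just one possible
choice, not any distribution's defaults:

```
# ~/.XCompose -- keep the system defaults, then add our own sequences
include "%L"

<Multi_key> <x> <x>           : "×"   # MULTIPLICATION SIGN
<Multi_key> <period> <minus>  : "·"   # MIDDLE DOT
<Multi_key> <asciicircum> <2> : "²"   # SUPERSCRIPT TWO
```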

T

--
Тише едешь, дальше будешь.

May 27 2013
On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to write
large amounts of text in some given language; a method of
temporarily switching to a different layout is necessary.

I thought the topic was typing the occasional Unicode character
to use as an operator in D programs?

May 27 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2013 at 12:04:52AM +0200, Vladimir Panteleev wrote:
On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to write
large amounts of text in some given language; a method of
temporarily switching to a different layout is necessary.

I thought the topic was typing the occasional Unicode character to
use as an operator in D programs?

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

But for operators, you still need enough compose key sequences to cover
all of the Unicode operators -- and there are a LOT of them -- which I
don't think is currently done anywhere. You'd have to make your own
compose key maps to do it.

T

--
Freedom: (n.) Man's self-given right to be enslaved by his own depravity.

May 27 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should not
support it.

May 27 2013
"Hans W. Uhlig" <huhlig gmail.com> writes:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for
example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different
story. :)

I've recently come to the opinion that that's a bad idea, and D
should not support it.

Why do you think it's a bad idea? It makes it such that code can
be in various languages? Just lack of keyboard support?

May 27 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2013 at 01:28:22AM +0200, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for
example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story.
:)

I've recently come to the opinion that that's a bad idea, and D
should not support it.

Currently, the above code snippet compiles (upon inserting "import
std.stdio;", that is). Should that be made illegal?

Why do you think it's a bad idea? It makes it such that code can be
in various languages? Just lack of keyboard support?

I can't speak for Walter, but one issue that comes to mind is when
someone reads the code and doesn't understand the language the
identifiers are in, or worse, can't reliably recognize the distinctions
between the glyphs, and so can't match identifier names correctly -- if
you don't know Japanese, for example, seeing a bunch of Japanese
identifiers of equal length will look more-or-less the same (all
gibberish to you), so it only obscures the code. Or if your computer
doesn't have the requisite fonts to display the alphabet in question,
then you'll just see a bunch of ?'s or black blotches for all program
identifiers, making the code completely unreadable.

Since language keywords are already in English, we might as well
standardize on English identifiers too. (After all, Phobos identifiers
are English as well.) While it's cool to have multilingual identifiers,
I'm not sure if it actually adds any practical value. :) If anything, it
arguably detracts from usability. Multilingual program output, of
course, is a different kettle o' fish.

T

--
Doubt is a self-fulfilling prophecy.

May 27 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 27 May 2013 at 23:46:17 UTC, H. S. Teoh wrote:
On Tue, May 28, 2013 at 01:28:22AM +0200, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D
should not support it.

Currently, the above code snippet compiles (upon inserting "import
std.stdio;", that is). Should that be made illegal?

Why do you think it's a bad idea? It makes it such that code can be
in various languages? Just lack of keyboard support?

I can't speak for Walter, but one issue that comes to mind is when
someone reads the code and doesn't understand the language the
identifiers are in, or worse, can't reliably recognize the
distinctions between the glyphs, and so can't match identifier names
correctly -- if you don't know Japanese, for example, seeing a bunch
of Japanese identifiers of equal length will look more-or-less the
same (all gibberish to you), so it only obscures the code. Or if your
computer doesn't have the requisite fonts to display the alphabet in
question, then you'll just see a bunch of ?'s or black blotches for
all program identifiers, making the code completely unreadable.

Since language keywords are already in English, we might as well
standardize on English identifiers too. (After all, Phobos
identifiers are English as well.) While it's cool to have
multilingual identifiers, I'm not sure if it actually adds any
practical value. :) If anything, it arguably detracts from usability.
Multilingual program output, of course, is a different kettle o'
fish.

T

I can tell you for a fact there are tons of *private* companies that
create closed-source programs whose source code is *not* English. And
from *their* business perspective, it makes sense. They don't care if
you can't understand their source code, since *you* will never see
their source code. I'm quite confident there are tons of programs
that you use that *aren't* written in English.

My wife writes the embedded software for the hardware her company
sells. I can tell you the source code sure as hell isn't in English.
Why would it be? The entire company speaks the local language
natively. I've worked in Japan, and I can tell you the norm over
there is *not* to code in English.

And why should it be? Why would you code in a language that is not
your own if you don't plan to ever share your code outside your team?
Why would you care about users that don't have unicode support, if
the workstations of all your employees are unicode compatible?

Allowing unicode identifiers makes their work a better experience.
Why should we take that away from them? Whether or not you should be
able to use them belongs in a coding standard, not in a compiler
limitation.
May 28 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D should not
support it.

Why do you think it's a bad idea? It makes it such that code can be in
various languages? Just lack of keyboard support?

Every time I've been to a programming shop in a foreign country, the
developers speak English at work and code in English. Of course, that
doesn't mean that everyone does, but as far as I can tell the
overwhelming bulk is done in English.

Naturally, full Unicode needs to be in strings and comments, but symbol
names? I don't see the point nor the utility of it. Supporting such is
just pointless complexity in the language.

May 27 2013
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D
should not support it.

Why do you think it's a bad idea? It makes it such that code can
be in various languages? Just lack of keyboard support?

Every time I've been to a programming shop in a foreign country,
the developers speak English at work and code in English. Of
course, that doesn't mean that everyone does, but as far as I
can tell the overwhelming bulk is done in English.

Naturally, full Unicode needs to be in strings and comments, but
symbol names? I don't see the point nor the utility of it.
Supporting such is just pointless complexity in the language.

The most convincing case for usefulness I've seen was in java
where a class implemented a particular algorithm and so was named
after it. This name had a particular accented character and so
required unicode. Lots of algorithms are named after their
inventors and lots of these names contain unicode characters so
it's not that uncommon.

May 27 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2013 at 02:23:32AM +0200, Diggory wrote:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D
should not support it.

Why do you think it's a bad idea? It makes it such that code can
be in various languages? Just lack of keyboard support?

Every time I've been to a programming shop in a foreign country,
the developers speak english at work and code in english. Of
course, that doesn't mean that everyone does, but as far as I can
tell the overwhelming bulk is done in english.

Naturally, full Unicode needs to be in strings and comments, but
symbol names? I don't see the point nor the utilty of it.
Supporting such is just pointless complexity to the language.

The most convincing case for usefulness I've seen was in java where
a class implemented a particular algorithm and so was named after
it. This name had a particular accented character and so required
unicode. Lots of algorithms are named after their inventors and lots
of these names contain unicode characters so it's not that uncommon.

I don't find this a compelling reason to allow full Unicode on
identifiers, though. For one thing, somebody maintaining your code may
not know how to type said identifier correctly. It can be very
frustrating to have to keep copy-n-pasting identifiers just because they
contain foreign letters you can't type. Not to mention sheer
unreadability if the inventor's name is in Chinese, so the algorithm
name is also in Chinese, and the person maintaining the code can't read
Chinese. This will kill D code maintainability.

T

--
Don't drink and derive. Alcohol and algebra don't mix.

May 27 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 6:06 PM, H. S. Teoh wrote:
I don't find this a compelling reason to allow full Unicode on
identifiers, though. For one thing, somebody maintaining your code may
not know how to type said identifier correctly. It can be very
frustrating to have to keep copy-n-pasting identifiers just because they
contain foreign letters you can't type. Not to mention sheer
unreadability if the inventor's name is in Chinese, so the algorithm
name is also in Chinese, and the person maintaining the code can't read
Chinese. This will kill D code maintainability.

+1

May 27 2013
Michel Fortin <michel.fortin michelf.ca> writes:
On 2013-05-28 01:34:17 +0000, Walter Bright <newshound2 digitalmars.com> said:

On 5/27/2013 6:06 PM, H. S. Teoh wrote:
I don't find this a compelling reason to allow full Unicode on
identifiers, though. For one thing, somebody maintaining your code may
not know how to type said identifier correctly. It can be very
frustrating to have to keep copy-n-pasting identifiers just because they
contain foreign letters you can't type. Not to mention sheer
unreadability if the inventor's name is in Chinese, so the algorithm
name is also in Chinese, and the person maintaining the code can't read
Chinese. This will kill D code maintainability.

+1

-1

What's even worse for code maintainability is code that does not do
what it says.

Disallowing non-ASCII charsets does not prevent people from writing
foreign-language code. I've seen plenty of code in French in my life in
languages with no Unicode support. I've also seen plenty of bad English
in code. I'd rather see a correct French word as a variable or function
name than an incorrect English one. Correctly naming things is
difficult, and correctly naming them in a foreign language is even
harder. This surely applies to languages using non-ASCII alphabets too.

Of course, if you're not using English words you'll be limiting your
audience to programmers who understand that language. But you might
widen it in other directions. I worked once with a grad student who was
building a model to simulate breakages of water pipe systems. She was
good enough to write code that worked, although she needed my help for
a couple of things, notably increasing performance. The code was all in
French, and thankfully so as attempting to translate all those terms
(some dealing with concepts unknown to me) to English when writing the
code and back to French when explaining the concepts would have been
quite annoying, inefficient, and error-prone in our work.

While French likely will always be a possibility (as it fits well in
ASCII), I can see how writing code in Japanese or Russian might benefit
native speakers of those languages too, especially those for whom
programming is only an incidental part of their job. Programming is a
form of expression, and it's always easier to express ourself in our
own native language.

--
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca/

May 28 2013
"Olivier Pisano" <olivier.pisano laposte.net> writes:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
Every time I've been to a programming shop in a foreign
country, the developers speak english at work and code in
english. Of course, that doesn't mean that everyone does, but
as far as I can tell the overwhelming bulk is done in english.

Would you have been to such an event if you could not have
understood what people were doing or saying? Of course, when we
are working on something with international scope, we tend to do
it in english, but it doesn't mean every programming task is
performed in english…

Being a non-native English speaker, I tend to see Unicode
identifiers as an improvement over other programming languages,
depending on the context of the programming task and its intended
audience. BTW, I use a Unicode-aware alternative keyboard layout,
so I can type Greek letters or math symbols directly. ASCII-only
identifiers sound like an arbitrary limitation to me.

May 28 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
Every time I've been to a programming shop in a foreign
country, the developers speak english at work and code in
english. Of course, that doesn't mean that everyone does, but
as far as I can tell the overwhelming bulk is done in english.

That's because you have an academic view of code, and a library
approach to development.

When you are a private company selling closed source code, I
really don't see why you'd code in English.

IMO, whether it is a bad idea is not for us to judge (and less so
to stop), but for each company/organization to choose their own
coding standard.

May 28 2013
"qznc" <qznc web.de> writes:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and
D should not
support it.

Why do you think it's a bad idea? It makes it such that code
can be in various languages? Just lack of keyboard support?

Every time I've been to a programming shop in a foreign
country, the developers speak english at work and code in
english. Of course, that doesn't mean that everyone does, but
as far as I can tell the overwhelming bulk is done in english.

Naturally, full Unicode needs to be in strings and comments,
but symbol names? I don't see the point nor the utility of it.
Supporting such is just pointless complexity to the language.

Once I heard an argument from developers working for banks. They
coded financial concepts with German names (e.g. Vermögen,
Bürgschaft), which sometimes include äöüß. Some of those concepts
had no good translation into English, because they are not used
outside of Germany and the clients prefer the actual names anyway.

May 29 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/29/2013 3:26 AM, qznc wrote:
Once I heard an argument from developers working for banks. They
coded financial concepts with German names (e.g. Vermögen,
Bürgschaft), which sometimes include äöüß. Some of those concepts
had no good translation into English, because they are not used
outside of Germany and the clients prefer the actual names anyway.

German is pretty easy to do in ASCII: Vermoegen and Buergschaft
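The ASCII fallback Walter refers to is the standard German transliteration: each umlaut becomes the base vowel plus "e", and ß becomes "ss". A minimal sketch of that mapping (Python used purely for illustration; `to_ascii` is a hypothetical helper, not anything from the thread):

```python
# Standard German ASCII fallback: umlauts become vowel + 'e', eszett becomes 'ss'.
GERMAN_ASCII = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def to_ascii(word):
    """Replace German special characters with their ASCII digraphs."""
    return "".join(GERMAN_ASCII.get(ch, ch) for ch in word)

print(to_ascii("Vermögen"))    # Vermoegen
print(to_ascii("Bürgschaft"))  # Buergschaft
```

Note this convention is specific to German; most scripts (Cyrillic, CJK) have no comparably lossless ASCII form, which is the counterpoint raised below in the thread.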

May 29 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 29 May 2013 at 22:42:08 UTC, Walter Bright wrote:
On 5/29/2013 3:26 AM, qznc wrote:
Once I heard an argument from developers working for banks. They
coded financial concepts with German names (e.g. Vermögen,
Bürgschaft), which sometimes include äöüß. Some of those concepts
had no good translation into English, because they are not used
outside of Germany and the clients prefer the actual names anyway.

German is pretty easy to do in ASCII: Vermoegen and Buergschaft

What about Chinese? Russian? Japanese? It is doable, but I can
tell you for a fact that they very much don't like reading it
that way.

You know, having done programming in Japan, I know that a lot of
devs simply don't care for English, and they'd really enjoy just
being able to code in Japanese. I can't speak for the other
countries, but I'm sure that large but not spread out countries
like China would also just *love* to be able to code in 100%
Mandarin (I'd say they wouldn't care much for English either).

I think this possibility is actually a brilliant feature that
could help popularize the language overseas, especially in
teaching courses, or the private sector. Why turn down a
feature that makes us popular?

As for research/university, I think they are already global
enough to stick to English anyway.

No matter how I see it, I can only see benefits to keeping it,
and downsides to turning it down.

May 30 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Thu, 30 May 2013 11:36:42 +0200, monarch_dodra
<monarchdodra gmail.com> wrote:

On Wednesday, 29 May 2013 at 22:42:08 UTC, Walter Bright wrote:
On 5/29/2013 3:26 AM, qznc wrote:
Once I heard an argument from developers working for banks. They
coded financial concepts with German names (e.g. Vermögen,
Bürgschaft), which sometimes include äöüß. Some of those concepts
had no good translation into English, because they are not used
outside of Germany and the clients prefer the actual names anyway.

German is pretty easy to do in ASCII: Vermoegen and Buergschaft

What about Chinese? Russian? Japanese? It is doable, but I can tell
you for a fact that they very much don't like reading it that way.

You know, having done programming in Japan, I know that a lot of
devs simply don't care for English, and they'd really enjoy just
being able to code in Japanese. I can't speak for the other
countries, but I'm sure that large but not spread out countries
like China would also just *love* to be able to code in 100%
Mandarin (I'd say they wouldn't care much for English either).

I think this possibility is actually a brilliant feature that could
help popularize the language overseas, especially in teaching
courses, or the private sector. Why turn down a feature that makes
us popular?

As for research/university, I think they are already global enough
to stick to English anyway.

No matter how I see it, I can only see benefits to keeping it, and
downsides to turning it down.

Now if only we had the C preprocessor:

#define 如果 if
#define 直到 while

(Note: this is what Google Translate told me was good. I do not speak
Chinese.)

--
Simen

May 30 2013
"Dicebot" <m.strashun gmail.com> writes:
On Thursday, 30 May 2013 at 09:36:43 UTC, monarch_dodra wrote:
What about Chinese? Russian? Japanese? It is doable, but I can
tell you for a fact that they very much don't like reading it
that way.

You know, having done programming in Japan, I know that a lot
of devs simply don't care for English, and they'd really enjoy
just being able to code in Japanese. I can't speak for the
other countries, but I'm sure that large but not spread out
countries like China would also just *love* to be able to code
in 100% Mandarin (I'd say they wouldn't care much for English
either).

What about the poor guys from another country who will have to
support that project afterwards? English is the de facto standard
language of programming for a good reason.

May 30 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 10:13:46 UTC, Dicebot wrote:
On Thursday, 30 May 2013 at 09:36:43 UTC, monarch_dodra wrote:
What about Chinese? Russian? Japanese? It is doable, but I can
tell you for a fact that they very much don't like reading it
that way.

You know, having done programming in Japan, I know that a lot
of devs simply don't care for English, and they'd really enjoy
just being able to code in Japanese. I can't speak for the
other countries, but I'm sure that large but not spread out
countries like China would also just *love* to be able to code
in 100% Mandarin (I'd say they wouldn't care much for English
either).

What about the poor guys from another country who will have to
support that project afterwards? English is the de facto standard
language of programming for a good reason.

Well... de facto: "in practice but not necessarily ordained by
law".

Besides, even in english, there are use cases for unicode. Such
as math (Greek symbols).

And even if you are coding in English, that doesn't mean you can't
be working on a region-specific project that requires the
identifiers to have region-specific names (as in the German banking
example).

Finally, english does have a few (albeit rare) words that can't
be expressed with ASCII. For example: Möbius. Sure, you can write
it "Mobius", but why settle for wrong, when you can have right?

--------

I'm saying that even if I agree that code should be in English
(which I don't completely agree with), it's still not a strong
argument against unicode in identifiers. In this day and age, it
seems as arbitrary to me as requiring lines to not exceed 80
chars. That kind of shit belongs in a coding standard.

May 30 2013
Manu <turkeyman gmail.com> writes:
On 30 May 2013 20:13, Dicebot <m.strashun gmail.com> wrote:

On Thursday, 30 May 2013 at 09:36:43 UTC, monarch_dodra wrote:

What about Chinese? Russian? Japanese? It is doable, but I can tell you
for a fact that they very much don't like reading it that way.

You know, having done programming in Japan, I know that a lot of devs
simply don't care for English, and they'd really enjoy just being able to
code in Japanese. I can't speak for the other countries, but I'm sure that
large but not spread out countries like China would also just *love* to be
able to code in 100% Mandarin (I'd say they wouldn't care much for English
either).

What about the poor guys from another country who will have to support
that project afterwards? English is the de facto standard language of
programming for a good reason.

Have you ever worked on code written by people who barely speak English?
Even if they write English words, that doesn't make it 'English', or any
easier to understand. And people often tend to just transliterate into
Latin, which is kinda pointless too; how does that help?

May 30 2013
"Dicebot" <m.strashun gmail.com> writes:
On Thursday, 30 May 2013 at 11:29:47 UTC, Manu wrote:
Have you ever worked on code written by people who barely speak
English?
Even if they write English words, that doesn't make it
'English', or any
easier to understand. And people often tend to just
transliterate into
latin, which is kinda pointless too, how does that help?

I have had comments with Finnish poetry in code I was responsible
for supporting :( No need to provide the means to make people think
such an approach is the way to go.

May 30 2013
"Kagamin" <spam here.lot> writes:
On Thursday, 30 May 2013 at 11:29:47 UTC, Manu wrote:
Have you ever worked on code written by people who barely speak
English?

I did. It's better than having a mixture of languages like here:
assert(length == dizgi.length); - in one expression!
property Yazı küçüğü() const - property? const? küçüğü?

BTW I don't speak English myself, and D code doesn't comprise
English either. How well do you have to know English to use one
word to name a variable "player"? And I believe everyone who
learned math knows the Latin alphabet.

Unicode identifiers allow for typos, which can't be detected
visually. For example, the Greek and Cyrillic alphabets have letters
indistinguishable from ASCII ones, so they can sneak into ASCII text
and you won't see it. You can also have more fun with heuristic
language switchers.

Try to find a problem in this code:
------
class c
{
    void Сlose(){}
}

int main()
{
    c obj = new c;
    obj.Close();
    return 0;
}
------

That's an actual issue I had with C# in industrial code. And I
believe no one has checked Phobos for such errors.
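Homoglyph bugs like the Cyrillic "С" above can be caught mechanically by flagging identifiers that mix letters from more than one script. A rough sketch of such a check (Python used for illustration; `mixed_script_identifiers` is a hypothetical helper, and reading the script off the first word of the Unicode character name is a crude heuristic — a real tool would use the Unicode Script property):

```python
import re
import unicodedata

def mixed_script_identifiers(source):
    """Return identifiers that combine letters from more than one script,
    e.g. a Cyrillic 'С' hiding among Latin letters."""
    suspicious = []
    for ident in re.findall(r"[^\W\d]\w*", source):
        scripts = set()
        for ch in ident:
            if ch.isalpha():
                # Character names begin with the script, e.g.
                # 'LATIN CAPITAL LETTER C' vs 'CYRILLIC CAPITAL LETTER ES'.
                name = unicodedata.name(ch, "")
                if name:
                    scripts.add(name.split()[0])
        if len(scripts) > 1:
            suspicious.append(ident)
    return suspicious

print(mixed_script_identifiers("void Сlose() {}"))  # ['Сlose']
```

Run over a codebase, this would have flagged the `Сlose`/`Close` pair above before it ever compiled.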

I was taught BASIC at school and had no idea I should complain
about the Latin alphabet even though I didn't learn English back then.

Jun 27 2013
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
Every time I've been to a programming shop in a foreign
country, the developers speak english at work and code in
english. Of course, that doesn't mean that everyone does, but
as far as I can tell the overwhelming bulk is done in english.

The OOo codebase is historically mostly in German. They try to reduce
the amount of German in the codebase with each new version.

Some massive codebases are non english.

Naturally, full Unicode needs to be in strings and comments,
but symbol names? I don't see the point nor the utility of it.
Supporting such is just pointless complexity to the language.

I know this is a crazy idea, but someone told me once that most
people on this planet aren't living in English-speaking
countries. Insane, isn't it?

Jun 27 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 28/05/13 09:44, H. S. Teoh wrote:

Since language keywords are already in English, we might as well
standardize on English identifiers too.

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Peter

May 27 2013
"David Eagen" <davideagen mailinator.com> writes:
On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:

So you're going to spell check them all to make sure that
they're English?  Or did you mean ASCII?

Peter

That's it. I'm filing a bug against std.traits. There's a
unittest there with a struct named "Colour". Completely
unacceptable.

May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 13:22, David Eagen <davideagen mailinator.com> wrote:

On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Peter

That's it. I'm filing a bug against std.traits. There's a unittest there
with a struct named "Colour". Completely unacceptable.

How dare you!
What's unacceptable is that a bunch of ex-english speakers had the audacity
to rewrite the dictionary and continue to call it English!
I will never write colour without a u, ever! I may suffer the global
American cultural invasion of my country like the rest of us, but I will
never let them infiltrate my mind! ;)

May 27 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 9:27 PM, Manu wrote:
I will never write colour without a u, ever! I may suffer the global American
cultural invasion of my country like the rest of us, but I will never let them
infiltrate my mind! ;)

Resistance is useless.

May 27 2013
On Tuesday, 28 May 2013 at 04:52:55 UTC, Walter Bright wrote:
On 5/27/2013 9:27 PM, Manu wrote:
I will never write colour without a u, ever! I may suffer the
global American
cultural invasion of my country like the rest of us, but I
will never let them
infiltrate my mind! ;)

Resistance is useless.

*futile :P

May 27 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 28/05/13 13:22, David Eagen wrote:
On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Peter

That's it. I'm filing a bug against std.traits. There's a unittest there
with a struct named "Colour". Completely unacceptable.

Except here in Australia and other places where they use the Queen's
English :-)

Peter

May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 14:38, Peter Williams <pwil3058 bigpond.net.au> wrote:

On 28/05/13 13:22, David Eagen wrote:

On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Peter

That's it. I'm filing a bug against std.traits. There's a unittest there
with a struct named "Colour". Completely unacceptable.

Except here in Australia and other places where they use the Queen's
English :-)

Is there anywhere other than America that doesn't?

May 27 2013
Jacob Carlborg <doob me.com> writes:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

--
/Jacob Carlborg

May 28 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 19:12, Jacob Carlborg <doob me.com> wrote:

On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Yes, the region called America ;)
Although there's a few British colonies in the Caribbean...

May 28 2013
Jacob Carlborg <doob me.com> writes:
On 2013-05-28 14:09, Manu wrote:

Yes, the region called America ;)
Although there's a few British colonies in the Caribbean...

Oh, you meant the whole region and not the country.

--
/Jacob Carlborg

May 28 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Tue, 28 May 2013 14:11:29 +0200, Jacob Carlborg <doob me.com> wrote:

On 2013-05-28 14:09, Manu wrote:

Yes, the region called America ;)
Although there's a few British colonies in the Caribbean...

Oh, you meant the whole region and not the country.

America is not a country. The country is called USA.

--
Simen

May 28 2013
Jacob Carlborg <doob me.com> writes:
On 2013-05-28 14:58, Simen Kjaeraas wrote:

America is not a country. The country is called USA.

I know that, but I get the impression that people usually say "America"
and refer to USA.

--
/Jacob Carlborg

May 28 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent not a
country). :-)

Peter

May 28 2013
On Tuesday, 28 May 2013 at 23:33:47 UTC, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent
not a country). :-)

Peter

America isn't a continent, North America is a continent, and
Canada is in North America :P

May 28 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 29 May 2013 at 01:29:07 UTC, Diggory wrote:
On Tuesday, 28 May 2013 at 23:33:47 UTC, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent
not a country). :-)

Peter

America isn't a continent, North America is a continent, and
Canada is in North America :P

Well, that point of view really depends from which continent
you're from:
http://en.wikipedia.org/wiki/Continents#Number_of_continents

There is no internationally agreed-upon scheme. I, for one, have
always been taught that there is only "America", and that the
terms "North America" and "South America" were only meant to
denote regions within said continent.

May 28 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, May 29, 2013 at 09:33:32AM +1000, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent not a
country). :-)

[...]

If you say that to a Canadian to his face, you might get a hostile (or
faux-hostile) reaction. :)

Up here in the Great White North, we like to think of ourselves as
different from our rowdy neighbours to the south (even though we're not
that different, but we won't ever admit that :-P). And yes, "America"
means USA up here (and "American" especially means USian, as distinct
from Canadian), even though we all know that technically it refers to
the continent, not the country.

T

--
Computers aren't intelligent; they only think they are.

May 28 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 29/05/13 09:57, H. S. Teoh wrote:
On Wed, May 29, 2013 at 09:33:32AM +1000, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent not a
country). :-)

[...]

If you say that to a Canadian to his face, you might get a hostile (or
faux-hostile) reaction. :)

Up here in the Great White North, we like to think of ourselves as
different from our rowdy neighbours to the south (even though we're not
that different, but we won't ever admit that :-P). And yes, "America"
means USA up here (and "American" especially means USian, as distinct
from Canadian), even though we all know that technically it refers to
the continent, not the country.

Last time I was there (about 40 years ago) Canadians didn't seem that
touchy. :-)

Peter

May 28 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, May 29, 2013 at 10:36:08AM +1000, Peter Williams wrote:
On 29/05/13 09:57, H. S. Teoh wrote:
On Wed, May 29, 2013 at 09:33:32AM +1000, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent not a
country). :-)

[...]

If you say that to a Canadian to his face, you might get a hostile
(or faux-hostile) reaction. :)

[...]
Last time I was there (about 40 years ago) Canadians didn't seem
that touchy. :-)

[...]

Well, they are not, hence "faux-hostile". :)

T

--
Political correctness: socially-sanctioned hypocrisy.

May 28 2013
Jacob Carlborg <doob me.com> writes:
On 2013-05-28 03:38, Peter Williams wrote:

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Don't you have a spell checker in your editor? If not, find a new one :)

--
/Jacob Carlborg

May 28 2013
"Luís Marques" <luismarques gmail.com> writes:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D
should not support it.

I think it is a bad idea to program in a language other than
english, but I believe D should still support it.

May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 09:39, "Luís Marques" <luismarques gmail.com> wrote:

On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:

I've recently come to the opinion that that's a bad idea, and D should
not support it.

I think it is a bad idea to program in a language other than english, but
I believe D should still support it.

I can imagine a young student learning to code who may not speak English
(yet).
Or a not-so-unlikely future where we're all speaking Chinese ;)

May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 09:05, Walter Bright <newshound2 digitalmars.com> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should not
support it.

Why? You said previously that you'd love to support extended operators ;)

May 27 2013
"Torje Digernes" <torjehoa pvv.org> writes:
On Tuesday, 28 May 2013 at 00:34:20 UTC, Manu wrote:
On 28 May 2013 09:05, Walter Bright
<newshound2 digitalmars.com> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... for
example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different
story. :)

I've recently come to the opinion that that's a bad idea, and
D should not
support it.

Why? You said previously that you'd love to support extended
operators ;)

I find features such as support for uncommon symbols in variables
a strength, as they make some physics formulas a bit easier to read
in code form, which in my opinion is a good thing.
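The physics-readability point is easy to demonstrate in any language that already accepts Unicode identifiers; Python, for instance, allows them (identifiers are NFKC-normalized per PEP 3131). A small sketch, with the function and symbols chosen here purely as an example:

```python
import math

# Cartesian-to-polar conversion using the symbols physics texts actually use:
# r for the radius and θ (theta) for the angle. θ is a legal identifier.
def polar(x, y):
    r = math.hypot(x, y)   # radius, sqrt(x² + y²)
    θ = math.atan2(y, x)   # angle in radians
    return r, θ

r, θ = polar(3.0, 4.0)
print(r)  # 5.0
```

Whether `θ` beats `theta` is exactly the taste question this thread is arguing about, but the formula does read closer to its textbook form.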

May 27 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 5:34 PM, Manu wrote:
On 28 May 2013 09:05, Walter Bright <newshound2 digitalmars.com
<mailto:newshound2 digitalmars.com>> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should not
support it.

Why? You said previously that you'd love to support extended operators ;)

Extended operators, yes. Non-ascii identifiers, no.

May 27 2013
"Oleg Kuporosov" <Oleg.Kuporosov gmail.com> writes:
On Tuesday, 28 May 2013 at 01:34:47 UTC, Walter Bright wrote:

Why? You said previously that you'd love to support extended
operators ;)

Extended operators, yes. Non-ascii identifiers, no.

BTW, this is one of D's big advantages; take into account that
some day D could be used for teaching in schools outside the
US/GB, where pupils don't yet know English.


May 28 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Tue, 28 May 2013 01:05:46 +0200, Walter Bright
<newshound2 digitalmars.com> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should
not support it.

I've recently come to the opinion that you're wrong - using them is
often wrong, but D should support them. Various good reasons have been
posted in this thread.

--
Simen

May 28 2013
"Jakob Ovrum" <jakobovrum gmail.com> writes:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D
should not support it.

Honestly, removing support for non-ASCII characters from
identifiers is the worst idea you've had in a while. There is an
_unfathomable amount_ of code out there written in non-English
languages but hamfisted into an English-alphabet representation
because the programming language doesn't care to support it. The
resulting friction is considerable.

You seem to attribute particular value to personal anecdotes, so
here's one of mine: I personally know several prestigious
universities in Europe and Asia which teach programming using
Java and/or C with identifiers being in an English-alphabet
representation of the native non-English language. Using the
English language for identifiers is usually a sanctioned
alternative, but not the primary modus operandi. I also know
several professional programmers using their native non-English
language for identifiers in production code.


May 29 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/29/2013 2:42 AM, Jakob Ovrum wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D should not
support it.

Honestly, removing support for non-ASCII characters from identifiers is the
worst idea you've had in a while. There is an _unfathomable amount_ of code out
there written in non-English languages but hamfisted into an English-alphabet
representation because the programming language doesn't care to support it. The
resulting friction is considerable.

You seem to attribute particular value to personal anecdotes, so here's one of
mine: I personally know several prestigious universities in Europe and Asia
which teach programming using Java and/or C with identifiers being in an
English-alphabet representation of the native non-English language. Using the
English language for identifiers is usually a sanctioned alternative, but not
the primary modus operandi. I also know several professional programmers using
their native non-English language for identifiers in production code.

I still think it's a bad idea, but it's obvious people want it in D, so it'll
stay.

(Also note that I meant using ASCII, not necessarily english.)

May 29 2013
Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 29 May 2013 15:44:17 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

I still think it's a bad idea, but it's obvious people want it in D, so it'll
stay.

(Also note that I meant using ASCII, not necessarily english.)

Surprisingly ASCII also covers Cornish and Malay.

--
Marco

May 29 2013
"Oleg Kuporosov" <Oleg.Kuporosov gmail.com> writes:
On Wednesday, 29 May 2013 at 22:44:17 UTC, Walter Bright wrote:
I still think it's a bad idea, but it's obvious people want it
in D, so it'll stay.

(Also note that I meant using ASCII, not necessarily english.)

Good, thanks, restrictions definitely can and should be applied
per project, like for druntime/Phobos.

May 29 2013
"Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 29 May 2013 at 22:44:17 UTC, Walter Bright wrote:
(Also note that I meant using ASCII, not necessarily english.)

I don't understand the logic behind this. Surely this is the
worst combination; severely crippled ability to use non-English
languages (yes, even for European languages), yet non-speakers of
those languages still don't have a clue what it means.

May 30 2013
Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 27 May 2013 16:05:46 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
int число = 1;
foreach (и; 0..100)
число += и;
writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should not
support it.

I hope that was just a random thought. I knew a teacher who
would give all his methods German names so they are easier to
distinguish from the English Java library methods.
Personally I like to type α instead of alpha for angles, since
that is the identifier you'd expect in math. And everyone
likes "alias ℕ = size_t;", right? :) Déjà vu?

--
Marco

May 29 2013
Timon Gehr <timon.gehr gmx.ch> writes:
On 05/29/2013 12:03 PM, Marco Leise wrote:
...  And everyone
likes "alias ℕ = size_t;", right? :)
...

No, that's deeply troubling.

May 30 2013
"Entry" <no no.com> writes:
My personal opinion is that code should only be in English.

May 29 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

May 29 2013
"Entry" <no no.com> writes:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and variable
names in various human languages (Swedish, Russian), but be
accompanied by a similarly 'unified' language that we all know -
English. It is only my opinion though and I wouldn't force it
upon anyone.

May 30 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and variable
names in various human languages (Swedish, Russian), but be
accompanied by a similarly 'unified' language that we all know
- English. It is only my opinion though and I wouldn't force it
upon anyone.

But programming IS a human tool, and thus, subject to human
language.

Also, I don't see how a programming language is any more unified
than, say, a library.

While you wouldn't force it on anyone, would it also be your
opinion that putting a French book in a French library is a
sabotage of the world's library institutions?

May 30 2013
"Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 09:29:43 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams
wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and variable
names in various human languages (Swedish, Russian), but be
accompanied by a similarly 'unified' language that we all know
- English. It is only my opinion though and I wouldn't force
it upon anyone.

But programming IS a human tool, and thus, subject to human
language.

Also, I don't see how a programming language is any more
unified than, say, a library.

While you wouldn't force it on anyone, would it also be your
opinion that putting a French book in a French library is a
sabotage of the world's library institutions?

What a way to attack a straw-man and completely miss the point at
the same time.

May 30 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 13:12:17 UTC, Entry wrote:
On Thursday, 30 May 2013 at 09:29:43 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams
wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and variable
names in various human languages (Swedish, Russian), but be
accompanied by a similarly 'unified' language that we all
know - English. It is only my opinion though and I wouldn't
force it upon anyone.

But programming IS a human tool, and thus, subject to human
language.

Also, I don't see how a programming language is any more
unified than, say, a library.

While you wouldn't force it on anyone, would it also be your
opinion that putting a French book in a French library is a
sabotage of the world's library institutions?

What a way to attack a straw-man and completely miss the point
at the same time.

Fine.

In that case, I'll retort by saying that your use of the 'unified'

My retort was not correctly expressed, but I don't see how D is
"unified". I thought it was just a tool to create programs.

May 30 2013
"Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 13:52:09 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 13:12:17 UTC, Entry wrote:
On Thursday, 30 May 2013 at 09:29:43 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams
wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and
variable names in various human languages (Swedish,
Russian), but be accompanied by a similarly 'unified'
language that we all know - English. It is only my opinion
though and I wouldn't force it upon anyone.

But programming IS a human tool, and thus, subject to human
language.

Also, I don't see how a programming language is any more
unified than, say, a library.

While you wouldn't force it on anyone, would it also be your
opinion that putting a French book in a French library is a
sabotage of the world's library institutions?

What a way to attack a straw-man and completely miss the point
at the same time.

Fine.

In that case, I'll retort by saying that your use of the

My retort was not correctly expressed, but I don't see how D is
"unified". I thought it was just a tool to create programs.

Take a minute to think about why we're all communicating in
English here. Let's see if you can figure it out. I just think
that it's better to focus on two very specific languages with two
very specific purposes (D for programming and English for
communication). 'Twas just an idea, I don't care if you write
your code in hieroglyphs.

May 30 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 14:13:47 UTC, Entry wrote:
Take a minute to think about why we're all communicating in
English here. Let's see if you can figure it out.

Well that's condescending :/ and fallacious.

To answer your question, it may have something to do with the
fact that these are the English forums? Just a wild hunch. Oh.
And because we *can* speak English? That could also have
something to do with it.

There are tons of non-English speaking programming forums out
there. Maybe those that don't speak English are over there? Heck,
there are a few non-English threads in learn.

Oh. And did you know TDPL was published in Japanese? Why bother
right?

I just think that it's better to focus on two very specific
languages with two very specific purposes (D for programming
and English for communication). 'Twas just an idea, I don't
care if you write your code in hieroglyphs.

I really really agree with you.

Yet, I think they are orthogonal concepts, and that the D
programming language has no business choosing which communication
vector its users should use.

It's not just a matter (imo) of "I wouldn't force it upon
anyone", but "I think everyone should choose what's best for
them".

Yeah. I know. Same conclusion, but there is a nuance.

May 30 2013
"Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 14:49:12 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 14:13:47 UTC, Entry wrote:
Take a minute to think about why we're all communicating in
English here. Let's see if you can figure it out.

Well that's condescending :/ and fallacious.

To answer your question, it may have something to do with the
fact that these are the English forums? Just a wild hunch. Oh.
And because we *can* speak English? That could also have
something to do with it.

There are tons of non-English speaking programming forums out
there. Maybe those that don't speak English are over there?
Heck, there are a few non-English threads in learn.

Oh. And did you know TDPL was published in Japanese? Why bother
right?

I just think that it's better to focus on two very specific
languages with two very specific purposes (D for programming
and English for communication). 'Twas just an idea, I don't
care if you write your code in hieroglyphs.

I really really agree with you.

Yet, I think they are orthogonal concepts, and that the D
programming language has no business choosing which
communication vector its users should use.

It's not just a matter (imo) of "I wouldn't force it upon
anyone", but "I think everyone should choose what's best for
them".

Yeah. I know. Same conclusion, but there is a nuance.

I'm glad you agree, though I believe that I never said anything
about D 'choosing' which human languages are compatible with it.
I just expressed my belief that should people choose to construct
something, be it a ship or a computer program, the usage of a
single language will greatly enhance their progress (ever heard
the story of the Tower of Babel? wink wink). Sorry if my previous
comment seemed hostile, that was not my intention.

May 30 2013
"Jakob Ovrum" <jakobovrum gmail.com> writes:
On Thursday, 30 May 2013 at 15:48:12 UTC, Entry wrote:
I'm glad you agree, though I believe that I never said anything
about D 'choosing' which human languages are compatible with
it. I just expressed my belief that should people choose to
construct something, be it a ship or a computer program, the
usage of a single language will greatly enhance their progress
(ever heard the story of the Tower of Babel? wink wink). Sorry
if my previous comment seemed hostile, that was not my
intention.

If the programmers who are going to be working on that code don't
understand the "Single Language", then what use is it?

May 30 2013
"Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 16:05:13 UTC, Jakob Ovrum wrote:
On Thursday, 30 May 2013 at 15:48:12 UTC, Entry wrote:
I'm glad you agree, though I believe that I never said
anything about D 'choosing' which human languages are
compatible with it. I just expressed my belief that should
people choose to construct something, be it a ship or a
computer program, the usage of a single language will greatly
enhance their progress (ever heard the story of the Tower of
Babel? wink wink). Sorry if my previous comment seemed
hostile, that was not my intention.

If the programmers who are going to be working on that code
don't understand the "Single Language", then what use is it?

Then there's no helping it. Though I wonder what kind of a
programmer doesn't understand English enough to at least read the
code and comments.

May 30 2013
Manu <turkeyman gmail.com> writes:
On 31 May 2013 03:08, Entry <no no.com> wrote:

On Thursday, 30 May 2013 at 16:05:13 UTC, Jakob Ovrum wrote:

On Thursday, 30 May 2013 at 15:48:12 UTC, Entry wrote:

I'm glad you agree, though I believe that I never said anything about D
'choosing' which human languages are compatible with it. I just expressed
my belief that should people choose to construct something, be it a ship or
a computer program, the usage of a single language will greatly enhance
their progress (ever heard the story of the Tower of Babel? wink wink).
Sorry if my previous comment seemed hostile, that was not my intention.

If the programmers who are going to be working on that code don't
understand the "Single Language", then what use is it?

Then there's no helping it. Though I wonder what kind of a programmer
doesn't understand English enough to at least read the code and comments.

A child, or a student.

May 30 2013
Manu <turkeyman gmail.com> writes:
On 31 May 2013 01:48, Entry <no no.com> wrote:

On Thursday, 30 May 2013 at 14:49:12 UTC, monarch_dodra wrote:

On Thursday, 30 May 2013 at 14:13:47 UTC, Entry wrote:

Take a minute to think about why we're all communicating in English
here. Let's see if you can figure it out.

Well that's condescending :/ and fallacious.

To answer your question, it may have something to do with the fact that
these are the English forums? Just a wild hunch. Oh. And because we *can*
speak English? That could also have something to do with it.

There are tons of non-English speaking programming forums out there.
Maybe those that don't speak English are over there? Heck, there are a few
non-English threads in learn.

Oh. And did you know TDPL was published in Japanese? Why bother right?

I just think that it's better to focus on two very specific languages
with two very specific purposes (D for programming and English for
communication). 'Twas just an idea, I don't care if you write your code in
hieroglyphs.

I really really agree with you.

Yet, I think they are orthogonal concepts, and that the D programming
language has no business choosing which communication vector its users
should use.

It's not just a matter (imo) of "I wouldn't force it upon anyone", but "I
think everyone should choose what's best for them".

Yeah. I know. Same conclusion, but there is a nuance.

I'm glad you agree, though I believe that I never said anything about D
'choosing' which human languages are compatible with it. I just expressed
my belief that should people choose to construct something, be it a ship or
a computer program, the usage of a single language will greatly enhance
their progress (ever heard the story of the Tower of Babel? wink wink).
Sorry if my previous comment seemed hostile, that was not my intention.

This is the definition of a *convention*, not a rule.

May 30 2013
Manu <turkeyman gmail.com> writes:
On 30 May 2013 18:32, Entry <no no.com> wrote:

On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams wrote:

On 30/05/13 08:40, Entry wrote:

My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified language (D)
should not be sabotaged by comments and variable names in various human
languages (Swedish, Russian), but be accompanied by a similarly 'unified'
language that we all know - English. It is only my opinion though and I
wouldn't force it upon anyone.

We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance, it
almost always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that power
an industry worth 10s of billions.

May 30 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance, it
almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

May 30 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 31/05/13 05:07, Walter Bright wrote:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Peter

May 30 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2013 5:00 PM, Peter Williams wrote:
On 31/05/13 05:07, Walter Bright wrote:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Not true, D supports Unicode identifiers.

May 30 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Fri, 31 May 2013 07:57:37 +0200, Walter Bright
<newshound2 digitalmars.com> wrote:

On 5/30/2013 5:00 PM, Peter Williams wrote:
On 31/05/13 05:07, Walter Bright wrote:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Not true, D supports Unicode identifiers.

I doubt Sony and Nintendo use D extensively.

--
Simen

May 31 2013
1100110 <0b1100110 gmail.com> writes:
On 05/31/2013 05:11 AM, Simen Kjaeraas wrote:
On Fri, 31 May 2013 07:57:37 +0200, Walter Bright
<newshound2 digitalmars.com> wrote:

On 5/30/2013 5:00 PM, Peter Williams wrote:
On 31/05/13 05:07, Walter Bright wrote:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Not true, D supports Unicode identifiers.

I doubt Sony and Nintendo use D extensively.


Jun 17 2013
Timothee Cour <thelastmammoth gmail.com> writes:
On Thu, May 30, 2013 at 10:57 PM, Walter Bright
<newshound2 digitalmars.com>wrote:

On 5/30/2013 5:00 PM, Peter Williams wrote:

On 31/05/13 05:07, Walter Bright wrote:

On 5/30/2013 4:24 AM, Manu wrote:

We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Not true, D supports Unicode identifiers.

currently std.demangle.demangle doesn't work with unicode (see example
below)

If we decide to keep allowing unicode symbols (as opposed to just unicode
address this issue. Will supporting this negatively impact performance (of
both compile time and runtime) ?

Likewise, will linkers + other tools (gdb etc) be happy with unicode in
mangled names?

----
struct A{
int z;
void foo(int x){}
void さいごの果実(int x){}
void ªå(int x){}
}
mangledName!(A.さいごの果実).demangle.writeln; =>
_D4util13demangle_funs1A18さいごの果実MFiZv
----
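Note that the `18` in `_D4util13demangle_funs1A18さいごの果実MFiZv` is a byte count, not a character count: the six characters of さいごの果実 each take three bytes in UTF-8, consistent with D's length-prefixed name mangling. A quick check of that arithmetic (a Python sketch, purely for illustration):

```python
name = "さいごの果実"

# 6 characters, each a 3-byte UTF-8 sequence, so the length prefix in
# the mangled name counts 18 bytes rather than 6 characters.
print(len(name))                   # character count: 6
print(len(name.encode("utf-8")))   # UTF-8 byte count: 18
```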

Jun 05 2013
On 6/5/13 6:11 PM, Timothee Cour wrote:
currently std.demangle.demangle doesn't work with unicode (see example below)

If we decide to keep allowing unicode symbols (as opposed to just unicode
address this issue. Will supporting this negatively impact performance (of
both compile time and
runtime) ?

Likewise, will linkers + other tools (gdb etc) be happy with unicode in
mangled names?

----
structA{
intz;
voidfoo(intx){}
voidさいごの果実(intx){}
voidªå(intx){}
}
mangledName!(A.さいごの果実).demangle.writeln;=>_D4util13demangle_funs1A18さいごの果実MFiZv
----

Filed in bugzilla?

Jun 05 2013
Sean Kelly <sean invisibleduck.org> writes:
On Jun 5, 2013, at 6:21 PM, Brad Roberts <braddr puremagic.com> wrote:

On 6/5/13 6:11 PM, Timothee Cour wrote:
currently std.demangle.demangle doesn't work with unicode (see
example below)

If we decide to keep allowing unicode symbols (as opposed to just unicode
address this issue. Will supporting this negatively impact
performance (of both compile time and runtime) ?

Likewise, will linkers + other tools (gdb etc) be happy with unicode
in mangled names?

----
struct A{
int z;
void foo(int x){}
void さいごの果実(int x){}
void ªå(int x){}
}
mangledName!(A.さいごの果実).demangle.writeln; =>
_D4util13demangle_funs1A18さいごの果実MFiZv
----

Filed in bugzilla?

http://d.puremagic.com/issues/show_bug.cgi?id=3D10393
https://github.com/D-Programming-Language/druntime/pull/524

Jun 17 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jun 17, 2013 at 11:37:18AM -0700, Sean Kelly wrote:
On Jun 5, 2013, at 6:21 PM, Brad Roberts <braddr puremagic.com> wrote:

On 6/5/13 6:11 PM, Timothee Cour wrote:
currently std.demangle.demangle doesn't work with unicode (see example below)

If we decide to keep allowing unicode symbols (as opposed to just unicode
address this issue. Will supporting this negatively impact performance (of
both compile time and
runtime) ?

Likewise, will linkers + other tools (gdb etc) be happy with unicode in
mangled names?

----
structA{
intz;
voidfoo(intx){}
voidさいごの果実(intx){}
voidªå(intx){}
}
mangledName!(A.さいごの果実).demangle.writeln;=>_D4util13demangle_funs1A18さいごの果実MFiZv
----

Filed in bugzilla?

http://d.puremagic.com/issues/show_bug.cgi?id=10393
https://github.com/D-Programming-Language/druntime/pull/524

Do linkers actually support 8-bit symbol names? Or do these have to be
translated into ASCII somehow?

T

--
We've all heard that a million monkeys banging on a million typewriters will
eventually reproduce the entire works of Shakespeare.  Now, thanks to the
Internet, we know this is not true. -- Robert Wilensk

Jun 17 2013
Sean Kelly <sean invisibleduck.org> writes:
On Jun 17, 2013, at 11:47 AM, "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote:

Do linkers actually support 8-bit symbol names? Or do these have to be
translated into ASCII somehow?

Good question.  It looks like the linker on OSX does:

public	_D3abc1A18さいごの果実MFiZv
public	_D3abc1A4ªåMFiZv

The object file linked just fine.  I haven't tried OPTLINK on Win32 though.

Jun 17 2013
On 6/17/13 11:58 AM, Sean Kelly wrote:
On Jun 17, 2013, at 11:47 AM, "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote:
Do linkers actually support 8-bit symbol names? Or do these have to be
translated into ASCII somehow?

Good question.  It looks like the linker on OSX does:

public	_D3abc1A18さいごの果実MFiZv
public	_D3abc1A4ªåMFiZv

The object file linked just fine.  I haven't tried OPTLINK on Win32 though.

Don't symbol names from dmd/win32 get compressed if they're too long, resulting
in essentially arbitrary random binary data being used as symbol names?
Assuming my memory on that is correct then it's already demonstrated that
optlink doesn't care what the data is.

Jun 17 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 6/17/2013 6:28 PM, Brad Roberts wrote:
Don't symbol names from dmd/win32 get compressed if they're too long, resulting
in essentially arbitrary random binary data being used as symbol names?
Assuming my memory on that is correct then it's already demonstrated that
optlink doesn't care what the data is.

Optlink doesn't care what the symbol byte contents are.

Jun 17 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jun 17, 2013 at 06:49:19PM -0700, Walter Bright wrote:
On 6/17/2013 6:28 PM, Brad Roberts wrote:
Don't symbol names from dmd/win32 get compressed if they're too long, resulting
in essentially arbitrary random binary data being used as symbol names?
Assuming my memory on that is correct then it's already demonstrated that
optlink doesn't care what the data is.

Optlink doesn't care what the symbol byte contents are.

It seems ld on Linux doesn't, either. I just tested separate compilation
on some code containing functions and modules with Cyrillic names, and
it worked fine. But my system locale is UTF-8; I'm not sure if there may
be a problem on other system locales (not that modern systems would
actually use anything else, though!).

Might this cause a problem with the VS linker?

T

--
It only takes one twig to burn down a forest.

Jun 18 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 6/18/2013 9:44 AM, H. S. Teoh wrote:
Might this cause a problem with the VS linker?

I doubt it, but try it and see!

Jun 18 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jun 18, 2013 at 04:33:54PM -0700, Walter Bright wrote:
On 6/18/2013 9:44 AM, H. S. Teoh wrote:
Might this cause a problem with the VS linker?

I doubt it, but try it and see!

to try?

T

--
Study gravitation, it's a field with a lot of potential.

Jun 18 2013
Sean Kelly <sean invisibleduck.org> writes:
On Jun 17, 2013, at 6:28 PM, Brad Roberts <braddr puremagic.com> wrote:

On 6/17/13 11:58 AM, Sean Kelly wrote:
On Jun 17, 2013, at 11:47 AM, "H. S. Teoh" <hsteoh quickfur.ath.cx>
wrote:

Do linkers actually support 8-bit symbol names? Or do these have to be
translated into ASCII somehow?

Good question.  It looks like the linker on OSX does:

public	_D3abc1A18さいごの果実MFiZv
public	_D3abc1A4ªåMFiZv

The object file linked just fine.  I haven't tried OPTLINK on Win32
though.

Don't symbol names from dmd/win32 get compressed if they're too long,
resulting in essentially arbitrary random binary data being used as
symbol names?  Assuming my memory on that is correct then it's already
demonstrated that optlink doesn't care what the data is.

Yes.  So it isn't always possible to fully demangle really long symbol
names.  This is not terribly difficult to hit using templates,
especially if they take string arguments.

Jun 19 2013
Manu <turkeyman gmail.com> writes:
On 31 May 2013 05:07, Walter Bright <newshound2 digitalmars.com> wrote:

On 5/30/2013 4:24 AM, Manu wrote:

We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance, it
almost always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

But that doesn't make it English, or any more readable...
The only benefit to forcing users to use ASCII is that everyone can
physically type it.
1. It's not natural to type a word that you don't know what it is or how
to spell, you'll end up copy-pasting anyway rather than trying to
remember/copy it letter by letter and risk misspelling.
2. It's less natural for the people who CAN read it, because they have to
mentally transliterate too. (And if they're kids/amateurs who don't even
know the latin alphabet?)

Ie, it serves neither party to force someone who doesn't speak English to
write ASCII.
children learning to code. There's no compelling reason to force
identifiers in ASCII.
Currently, D offers a unique advantage; leave it that way.

May 30 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2013 5:04 PM, Manu wrote:
Currently, D offers a unique advantage; leave it that way.

I am going to leave it that way based on the comments here, I only wanted to
point out that the example didn't support Unicode identifiers.

May 30 2013
Manu <turkeyman gmail.com> writes:
On 31 May 2013 10:00, Peter Williams <pwil3058 bigpond.net.au> wrote:

On 31/05/13 05:07, Walter Bright wrote:

On 5/30/2013 4:24 AM, Manu wrote:

We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Indeed, and believe me, the variable names can often make NO sense, or
worse, they're misunderstood and quite misleading.
Ie, you think a variable is something, but you realise it's the inverse, or
just something completely different.

May 30 2013
"Mr. Anonymous" <mailnew4ster gmail.com> writes:
On Monday, 27 May 2013 at 22:20:16 UTC, H. S. Teoh wrote:
On Tue, May 28, 2013 at 12:04:52AM +0200, Vladimir Panteleev
wrote:
On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to
write
large amounts of text in some given language; a method of
temporarily switching to a different layout is necessary.

I thought the topic was typing the occasional Unicode
character to
use as an operator in D programs?

Well, D *does* support non-English identifiers, y'know... for
example:

void main(string[] args) {
int число = 1;
foreach (и; 0..100)
число += и;
writeln(число);
}

Of course, whether that's a good practice is a different story.
:)

But for operators, you still need enough compose key sequences
to cover
all of the Unicode operators -- and there are a LOT of them --
which I
don't think is currently done anywhere. You'd have to make your
own
compose key maps to do it.

T


May 28 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Tue, 28 May 2013 00:18:31 +0200, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:

On Tue, May 28, 2013 at 12:04:52AM +0200, Vladimir Panteleev wrote:
On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to write
large amounts of text in some given language; a method of
temporarily switching to a different layout is necessary.

I thought the topic was typing the occasional Unicode character to
use as an operator in D programs?

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
int число = 1;
foreach (и; 0..100)
число += и;
writeln(число);
}
}

Of course, whether that's a good practice is a different story. :)

But for operators, you still need enough compose key sequences to cover
all of the Unicode operators -- and there are a LOT of them -- which I
don't think is currently done anywhere. You'd have to make your own
compose key maps to do it.

The Fortress programming language has some 900 or so operators:

https://java.net/projects/projectfortress/sources/sources/content/Specification/fortress.1.0.pdf?rev=5558

Appendix C, and

https://java.net/projects/projectfortress/sources/sources/content/Documentation/Specification/fortress.pdf?rev=5558

chapter 14

--
Simen

May 27 2013
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, May 28, 2013 11:38:08 Peter Williams wrote:
On 28/05/13 09:44, H. S. Teoh wrote:
Since language keywords are already in English, we might as well
standardize on English identifiers too.

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

I think that it was more an issue of that the only reason that Unicode would
be necessary in identifiers would be if you weren't using English, so if you
assume that everyone is going to be using some form of English for their
identifier names, you can skip having Unicode in identifiers. So, a natural
effect of standardizing on English is that you can stick with ASCII.

- Jonathan M Davis

May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 11:42, Jonathan M Davis <jmdavisProg gmx.com> wrote:

On Tuesday, May 28, 2013 11:38:08 Peter Williams wrote:
On 28/05/13 09:44, H. S. Teoh wrote:
Since language keywords are already in English, we might as well
standardize on English identifiers too.

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

I think that it was more an issue of that the only reason that Unicode
would
be necessary in identifiers would be if you weren't using English, so if
you
assume that everyone is going to be using some form of English for their
identifier names, you can skip having Unicode in identifiers. So, a natural
effect of standardizing on English is that you can stick with ASCII.

I'm fairly sure that any programmer who takes themselves seriously will
use English; I don't see any reason why this rule should need to be
implemented by the compiler.
The loss I can imagine is that kids, or people from developing countries,
etc, may have an additional barrier to learning to code if they don't speak
English.
Nobody in this set is likely to produce a useful library that will be used
widely.
Likewise, no sane programmer is going to choose to use a library that's not
written in English.

You may argue that the keywords and libs are in English. I can attest from
personal experience, that a child, or a non-english-speaking beginner
probably has absolutely NO IDEA what the keywords mean anyway, even if they
do speak English.
I certainly had no idea when I was a kid, I just typed them because I
figured out what they did. I didn't even know how to say many of them, and
realised 5 years later that I was saying all the words wrong...

So my point is, why make this restriction as a static compiler rule, when
it's not practically going to be broken anyway. You never know, it may
actually assist some people somewhere.
I think it's a great thing that D can accept identifiers in non-english.

May 27 2013
"Daniel Murphy" <yebblies nospamgmail.com> writes:
"Manu" <turkeyman gmail.com> wrote in message
news:mailman.137.1369448229.13711.digitalmars-d puremagic.com...
One of the first, and best, decisions I made for D was it would be
Unicode
front to back.

Indeed, excellent decision!
So when we define operators for u  v and a  b, or maybe n? ;)

When these have keys on standard keyboards.

May 25 2013
"Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
One of the first, and best, decisions I made for D was it would
be Unicode front to back.

That is why I asked this question here.  I think D is still one
of the few programming languages with such unicode support.

This is more a problem with the algorithms taking the easy way
than a problem with UTF-8. You can do all the string
algorithms, including regex, by working with the UTF-8 directly
rather than converting to UTF-32. Then the algorithms work at
full speed.

I call BS on this.  There's no way working on a variable-width
encoding can be as "full speed" as a constant-width encoding.
Perhaps you mean that the slowdown is minimal, but I doubt that
also.

That was the go-to solution in the 1980's, they were called
"code pages". A disaster.

My understanding is that code pages were a "disaster" because
they weren't standardized and often badly implemented.  If you
used UCS with a single-byte encoding, you wouldn't have that
problem.

with the few exceptional languages with more than 256
characters encoded in two bytes.

Like those rare languages Japanese, Korean, Chinese, etc. This
too was done in the 80's with "Shift-JIS" for Japanese, and
some other wacky scheme for Korean, and a third nutburger one
for Chinese.

Of course, you have to have more than one byte for those
languages, because they have more than 256 characters.  So there
will be no compression gain over UTF-8/16 there, but a big gain
in parsing complexity with a simpler encoding, particularly when
dealing with multi-language strings.

I've had the misfortune of supporting all that in the old
Zortech C++ compiler. It's AWFUL. If you think it's simpler,
all I can say is you've never tried to write internationalized
code with it.

Heh, I'm not saying "let's go back to badly defined code pages" when
I'm saying "let's go back to single-byte encodings."  The two are
separate arguments.

UTF-8 is heavenly in comparison. Your code is automatically
internationalized. It's awesome.

At what cost?  Most programmers completely punt on unicode,
because they just don't want to deal with the complexity.
Perhaps you can deal with it and don't mind the performance loss,
but I suspect you're in the minority.

May 25 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 12:33 AM, Joakim wrote:
At what cost?  Most programmers completely punt on unicode, because they just
don't want to deal with the complexity. Perhaps you can deal with it and don't
mind the performance loss, but I suspect you're in the minority.

I think you stand alone in your desire to return to code pages. I have years of
experience with code pages and the unfixable misery they produce. This has
disappeared with Unicode. I find your arguments unpersuasive when stacked
against my experience. And yes, I have made a living writing high performance
code that deals with characters, and you are quite off base with claims that
UTF-8 has inevitable bad performance - though there is inefficient code in
Phobos for it, to be sure.

My grandfather wrote a book that consists of mixed German, French, and Latin
words, using special characters unique to those languages. Another failing of
code pages is it fails miserably at any such mixed language text. Unicode
handles it with aplomb.

I can't even write an email to Rainer Schütze in English under your scheme.

Code pages simply are no longer practical nor acceptable for a global
community.
D is never going to convert to a code page system, and even if it did, there's
no way D will ever convince the world to abandon Unicode, and so D would be as
useless as EBCDIC.

I'm afraid your quest is quixotic.

May 25 2013
"Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote:

Nobody is talking about going back to code pages.  I'm talking
about going to single-byte encodings, which do not imply the
problems that you had with code pages way back when.

I have years of experience with code pages and the unfixable
misery they produce. This has disappeared with Unicode. I find
your arguments unpersuasive when stacked against my experience.
And yes, I have made a living writing high performance code
that deals with characters, and you are quite off base with
claims that UTF-8 has inevitable bad performance - though there
is inefficient code in Phobos for it, to be sure.

How can a variable-width encoding possibly compete with a
constant-width encoding?  You have not articulated a reason for
this.  Do you believe there is a performance loss with
variable-width, but that it is not significant and therefore
worth it?  Or do you believe it can be implemented with no loss?

My grandfather wrote a book that consists of mixed German,
French, and Latin words, using special characters unique to
those languages. Another failing of code pages is it fails
miserably at any such mixed language text. Unicode handles it
with aplomb.

I see no reason why single-byte encodings wouldn't do a better
job at such mixed-language text.  You'd just have to have a
larger, more complex header or keep all your strings in a single
language, with a different format to compose them together for
your book.  This would be so much easier than UTF-8 that I cannot
see how anyone could argue for a variable-length encoding instead.

I can't even write an email to Rainer Schütze in English under

Why not?  You seem to think that my scheme doesn't implement
multi-language text at all, whereas I pointed out, from the
beginning, that it could be trivially done also.

Code pages simply are no longer practical nor acceptable for a
global community. D is never going to convert to a code page
system, and even if it did, there's no way D will ever convince
the world to abandon Unicode, and so D would be as useless as
EBCDIC.

I'm afraid you and others here seem to mentally translate "single-byte
encodings" to "code pages" in your head, then recoil
in horror as you remember all your problems with broken
implementations of code pages, even though those problems are not
intrinsic to single-byte encodings.

I'm not asking you to consider this for D.  I just wanted to
discuss why UTF-8 is used at all.  I had hoped for some technical
evaluations of its merits, but I seem to simply be dredging up a

The world may not "abandon Unicode," but it will abandon UTF-8,
because it's a dumb idea.  Unfortunately, such dumb ideas- XML
anyone?- often proliferate until someone comes up with something
better to show how dumb they are.  Perhaps it won't be the D
programming language that does that, but it would be easy to
implement my idea in D, so maybe it will be a D-based library
someday. :)

I'm afraid your quest is quixotic.

I'd argue the opposite, considering most programmers still can't
wrap their head around UTF-8.  If someone can just get a
single-byte encoding implemented and in front of them, I suspect
it will be UTF-8 that will be considered quixotic. :D

May 25 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-May-2013 13:05, Joakim wrote:
On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote:

Nobody is talking about going back to code pages.  I'm talking about
going to single-byte encodings, which do not imply the problems that you
had with code pages way back when.

The problem is that what you outline is isomorphic to code pages, hence
the grief of accumulated experience against them.
Code pages simply are no longer practical nor acceptable for a global
community. D is never going to convert to a code page system, and even
if it did, there's no way D will ever convince the world to abandon
Unicode, and so D would be as useless as EBCDIC.

I'm afraid you and others here seem to mentally translate "single-byte
encodings" to "code pages" in your head, then recoil in horror as you
remember all your problems with broken implementations of code pages,
even though those problems are not intrinsic to single-byte encodings.

I'm not asking you to consider this for D.  I just wanted to discuss why
UTF-8 is used at all.  I had hoped for some technical evaluations of its
merits, but I seem to simply be dredging up a bunch of repressed

Well, if somebody took up the quest to redefine UTF-8, they *might* come
up with something that is a bit faster to decode but shares the same
properties. Hardly a life saver anyway.
The world may not "abandon Unicode," but it will abandon UTF-8, because
it's a dumb idea.  Unfortunately, such dumb ideas- XML anyone?- often
proliferate until someone comes up with something better to show how
dumb they are.

Even children know XML is awful redundant shit as an interchange format.
The hierarchical document is a nice idea anyway.

Perhaps it won't be the D programming language that does
that, but it would be easy to implement my idea in D, so maybe it will
be a D-based library someday. :)

Implement Unicode compression scheme - at least that is standardized.

--
Dmitry Olshansky

May 25 2013
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, May 25, 2013 01:42:20 Walter Bright wrote:
On 5/25/2013 12:33 AM, Joakim wrote:
At what cost?  Most programmers completely punt on unicode, because they
just don't want to deal with the complexity. Perhaps you can deal with it
and don't mind the performance loss, but I suspect you're in the
minority.

I think you stand alone in your desire to return to code pages. I have
years of experience with code pages and the unfixable misery they produce.
This has disappeared with Unicode. I find your arguments unpersuasive when
stacked against my experience. And yes, I have made a living writing high
performance code that deals with characters, and you are quite off base
with claims that UTF-8 has inevitable bad performance - though there is
inefficient code in Phobos for it, to be sure.

My grandfather wrote a book that consists of mixed German, French, and Latin
words, using special characters unique to those languages. Another failing
of code pages is it fails miserably at any such mixed language text.
Unicode handles it with aplomb.

I can't even write an email to Rainer Schütze in English under your scheme.

Code pages simply are no longer practical nor acceptable for a global
community. D is never going to convert to a code page system, and even if
it did, there's no way D will ever convince the world to abandon Unicode,
and so D would be as useless as EBCDIC.

I'm afraid your quest is quixotic.

All I've got to say on this subject is "Thank you Walter Bright for building
Unicode into D!"

- Jonathan M Davis

May 25 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, May 25, 2013 at 04:14:34PM -0700, Jonathan M Davis wrote:
On Saturday, May 25, 2013 01:42:20 Walter Bright wrote:
On 5/25/2013 12:33 AM, Joakim wrote:
At what cost?  Most programmers completely punt on unicode,
because they just don't want to deal with the complexity. Perhaps
you can deal with it and don't mind the performance loss, but I
suspect you're in the minority.

have years of experience with code pages and the unfixable misery
they produce. This has disappeared with Unicode. I find your
arguments unpersuasive when stacked against my experience. And yes,
I have made a living writing high performance code that deals with
characters, and you are quite off base with claims that UTF-8 has
inevitable bad performance - though there is inefficient code in
Phobos for it, to be sure.

My grandfather wrote a book that consists of mixed German, French,
and Latin words, using special characters unique to those languages.
Another failing of code pages is it fails miserably at any such
mixed language text.  Unicode handles it with aplomb.

Walter Bright <newshound2 digitalmars.com>  writes:
On 5/25/2013 9:48 PM, H. S. Teoh wrote:
Then came along D with native Unicode support built right into the
language. And not just UTF-16 shoved down your throat like Java does (or
was it UTF-32?); UTF-8, UTF-16, and UTF-32 are all equally supported.
You cannot imagine what a happy camper I was since then!! Yes, Phobos
still has a ways to go in terms of performance w.r.t. UTF-8 strings, but
what we have right now is already far, far, superior to the situation in
C/C++, and things can only get better.

Many moons ago, when the earth was young and I had a few strands of hair
left, a C++ programmer challenged me to a "bakeoff", D vs C++. I wrote the
program in D
(a string processing program). He said "ahaaaa!" and wrote the C++ one. They
were fairly comparable.

I then suggested we do the internationalized version. I resubmitted exactly the
same program. He threw in the towel.

May 25 2013

On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
This is more a problem with the algorithms taking the easy way
than a problem with UTF-8. You can do all the string
algorithms, including regex, by working with the UTF-8
directly rather than converting to UTF-32. Then the algorithms
work at full speed.

I call BS on this.  There's no way working on a variable-width
encoding can be as "full speed" as a constant-width encoding.
Perhaps you mean that the slowdown is minimal, but I doubt that
also.

For the record, I noticed that programmers (myself included) that
had an incomplete understanding of Unicode / UTF exaggerate this
point, and sometimes needlessly assume that their code needs to
operate on individual characters (code points), when it is in
fact not so - and that code will work just fine as if it was
written to handle ASCII. The example Walter quoted (regex -
assuming you don't want Unicode ranges or case-insensitivity) is
one such case.

Another thing I noticed: sometimes when you think you really need
to operate on individual characters (and that your code will not
be correct unless you do that), the assumption will be incorrect
due to the existence of combining characters in Unicode. Two of
the often-quoted use cases of working on individual code points
is calculating the string width (assuming a fixed-width font),
and slicing the string - both of these will break with combining
characters if those are not accounted for. I believe the proper
way to approach such tasks is to implement the respective Unicode
algorithms for it, which I believe are non-trivial and for which
the relative impact for the overhead of working with a
variable-width encoding is acceptable.

Can you post some specific cases where the benefits of a
constant-width encoding are obvious and, in your opinion, make
constant-width encodings more useful than all the benefits of
UTF-8?

Also, I don't think this has been posted in this thread. Not sure

http://www.utf8everywhere.org/

And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 08:58:57 UTC, Vladimir Panteleev
wrote:
Another thing I noticed: sometimes when you think you really
need to operate on individual characters (and that your code
will not be correct unless you do that), the assumption will be
incorrect due to the existence of combining characters in
Unicode. Two of the often-quoted use cases of working on
individual code points is calculating the string width
(assuming a fixed-width font), and slicing the string - both of
these will break with combining characters if those are not
accounted for. I believe the proper way to approach such tasks
is to implement the respective Unicode algorithms for it, which
I believe are non-trivial and for which the relative impact for
the overhead of working with a variable-width encoding is
acceptable.

Combining characters are examples of complexity baked into the
various languages, so there's no way around that.  I'm arguing
against layering more complexity on top, through UTF-8.

Can you post some specific cases where the benefits of a
constant-width encoding are obvious and, in your opinion, make
constant-width encodings more useful than all the benefits of
UTF-8?

Let's take one you listed above, slicing a string.  You have to
either translate your entire string into UTF-32 so it's
constant-width, which is apparently what Phobos does, or decode
every single UTF-8 character along the way, every single time.  A
constant-width, single-byte encoding would be much easier to
slice, while still using at most half the space.

Also, I don't think this has been posted in this thread. Not

http://www.utf8everywhere.org/

That seems to be a call to using UTF-8 on Windows, with a lot of
info on how best to do so, with little justification for why
you'd want to do so in the first place.  For example,

"Q: But what about performance of text processing algorithms,
byte alignment, etc?

A: Is it really better with UTF-16? Maybe so."

Not exactly a considered analysis of the two. ;)

And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

You cannot honestly look at those multiple state diagrams and
tell me it's "simple."  That said, the difficulty of _using_
UTF-8 is a much bigger problem than implementing a decoder
in a library.

May 25 2013

"w0rp" <devw0rp gmail.com>  writes:
This is dumb. You are dumb. Go away.

May 25 2013

On Saturday, 25 May 2013 at 09:40:36 UTC, Joakim wrote:
Can you post some specific cases where the benefits of a
constant-width encoding are obvious and, in your opinion, make
constant-width encodings more useful than all the benefits of
UTF-8?

Let's take one you listed above, slicing a string.  You have to
either translate your entire string into UTF-32 so it's
constant-width, which is apparently what Phobos does, or decode
every single UTF-8 character along the way, every single time.
A constant-width, single-byte encoding would be much easier to
slice, while still using at most half the space.

You don't need to do that to slice a string. I think you mean to
say that you need to decode each character if you want to slice
the string at the N-th code point? But this is exactly what I'm
trying to point out: how would you find this N? How would you
know if it makes sense, taking into account combining characters,
and all the other complexities of Unicode?

If you want to split a string by ASCII whitespace (newlines, tabs
and spaces), it makes no difference whether the string is in
ASCII or UTF-8 - the code will behave correctly in either case,
regardless of the variable-width encoding.

You cannot honestly look at those multiple state diagrams and
tell me it's "simple."

I meant that it's simple to implement (and adapt/port to other
languages). I would say that UTF-8 is quite cleverly designed, so
I wouldn't say it's simple by itself.

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 10:33:12 UTC, Vladimir Panteleev
wrote:
You don't need to do that to slice a string. I think you mean
to say that you need to decode each character if you want to
slice the string at the N-th code point? But this is exactly
what I'm trying to point out: how would you find this N? How
would you know if it makes sense, taking into account combining
characters, and all the other complexities of Unicode?

Slicing a string implies finding the N-th code point, what other
way would you slice and have it make any sense?  Finding the N-th
point is much simpler with a constant-width encoding.
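
[Editorial note: finding the N-th code point in UTF-8 does not actually require decoding either. Continuation bytes always match the bit pattern 10xxxxxx, so counting the bytes that do NOT match it counts code points. A minimal C sketch; the function name `utf8_index` is mine, for illustration:]

```c
#include <stddef.h>

/* Return a pointer to the start of the n-th code point (0-based) in a
 * NUL-terminated UTF-8 string, or NULL if the string has fewer code
 * points. Only lead/ASCII bytes are counted; nothing is decoded. */
const char *utf8_index(const char *s, size_t n)
{
    for (; *s; ++s) {
        /* Continuation bytes look like 10xxxxxx; skip them. */
        if (((unsigned char)*s & 0xC0) != 0x80) {
            if (n == 0)
                return s;
            --n;
        }
    }
    return NULL;
}
```

[This is still O(length) rather than O(1), which is the real cost under debate; the point is only that no decoding state machine is involved.]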

I'm leaving aside combining characters and those intrinsic
language complexities baked into unicode in my previous analysis,
but if you want to bring those in, that's actually an argument in
favor of my encoding.  With my encoding, you know up front if
you're using languages that have such complexity- just check the
header- whereas with a chunk of random UTF-8 text, you cannot
ever know that unless you decode the entire string once and
extract knowledge of all the languages that are embedded.

For another similar example, let's say you want to run toUpper on
a multi-language string, which contains English in the first half
and some Asian script that doesn't define uppercase in the second
half.  With my format, toUpper can check the header, then process
the English half and skip the Asian half (I'm assuming that the
substring indices for each language would be stored in this more
complex header).  With UTF-8, you have to process the entire
string, because you never know what random languages might be
packed in there.

UTF-8 is riddled with such performance bottlenecks, all to make
it self-synchronizing.  But is anybody really using its less
compact encoding to do some "self-synchronized" integrity
checking?  I suspect almost nobody is.
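
[Editorial note: "self-synchronizing" refers less to integrity checking than to the property that a code point boundary can be found from any byte offset without rescanning from the start, because continuation bytes (10xxxxxx) can never be confused with lead bytes. A hypothetical C sketch of that recovery step:]

```c
#include <stddef.h>

/* Given an arbitrary byte offset i into a valid UTF-8 buffer, step
 * back to the start of the enclosing code point. For valid UTF-8 this
 * moves at most 3 bytes, since sequences are at most 4 bytes long. */
size_t utf8_sync_back(const char *s, size_t i)
{
    while (i > 0 && ((unsigned char)s[i] & 0xC0) == 0x80)
        --i;
    return i;
}
```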

If you want to split a string by ASCII whitespace (newlines,
tabs and spaces), it makes no difference whether the string is
in ASCII or UTF-8 - the code will behave correctly in either
case, variable-width-encodings regardless.

Except that a variable-width encoding will take longer to decode
while splitting, when compared to a single-byte encoding.

You cannot honestly look at those multiple state diagrams and
tell me it's "simple."

I meant that it's simple to implement (and adapt/port to other
languages). I would say that UTF-8 is quite cleverly designed,
so I wouldn't say it's simple by itself.

Perhaps, maybe decoding is not so bad for the type of people who
write the fundamental UTF-8 libraries.  But implementation does
not merely refer to the UTF-8 libraries, but also all the code
that tries to build on it for internationalized apps.  And
wrapping the average programmer's head around this mess likely
leads to as many problems as broken code page implementations
did back in the day. ;)

May 25 2013

On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines,
tabs and spaces), it makes no difference whether the string is
in ASCII or UTF-8 - the code will behave correctly in either
case, variable-width-encodings regardless.

Except that a variable-width encoding will take longer to
decode while splitting, when compared to a single-byte encoding.

No. Are you sure you understand UTF-8 properly?

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev
wrote:
On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines,
tabs and spaces), it makes no difference whether the string
is in ASCII or UTF-8 - the code will behave correctly in
either case, variable-width-encodings regardless.

Except that a variable-width encoding will take longer to
decode while splitting, when compared to a single-byte
encoding.

No. Are you sure you understand UTF-8 properly?

Are you sure _you_ understand it properly?  Both encodings have
to check every single character to test for whitespace, but the
single-byte encoding simply has to load each byte in the string
and compare it against the whitespace-signifying bytes, while the
variable-length code has to first load and parse potentially 4
bytes before it can compare, because it has to go through the
state machine that you linked to above.  Obviously the
constant-width encoding will be faster.  Did I really need to
explain this?

On Saturday, 25 May 2013 at 12:43:21 UTC, Andrei Alexandrescu
wrote:
On 5/25/13 3:33 AM, Joakim wrote:
On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
This is more a problem with the algorithms taking the easy
way than a
problem with UTF-8. You can do all the string algorithms,
including
regex, by working with the UTF-8 directly rather than
converting to
UTF-32. Then the algorithms work at full speed.

I call BS on this. There's no way working on a variable-width
encoding
can be as "full speed" as a constant-width encoding. Perhaps
you mean
that the slowdown is minimal, but I doubt that also.

You mentioned this a couple of times, and I wonder what makes
you so sure. On contemporary architectures small is fast and
large is slow; betting on replacing larger data with more
computation is quite often a win.

When has small ever been slow and large fast? ;) I'm talking
about replacing larger data _and_ more computation, ie UTF-8,
with smaller data and less computation, ie single-byte encodings,
so it is an unmitigated win in that regard. :)

May 25 2013

"Peter Alexander" <peter.alexander.au gmail.com>  writes:
On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev
wrote:
On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines,
tabs and spaces), it makes no difference whether the string
is in ASCII or UTF-8 - the code will behave correctly in
either case, variable-width-encodings regardless.

Except that a variable-width encoding will take longer to
decode while splitting, when compared to a single-byte
encoding.

No. Are you sure you understand UTF-8 properly?

Are you sure _you_ understand it properly?  Both encodings have
to check every single character to test for whitespace, but the
single-byte encoding simply has to load each byte in the string
and compare it against the whitespace-signifying bytes, while
the variable-length code has to first load and parse
potentially 4 bytes before it can compare, because it has to go
through the state machine that you linked to above.  Obviously
the constant-width encoding will be faster.  Did I really need
to explain this?

I suggest you read up on UTF-8. You really don't understand it.
There is no need to decode, you just treat the UTF-8 string as if
it is an ASCII string.

This code will count all spaces in a string whether it is encoded
as ASCII or UTF-8:

int countSpaces(const(char)* c)
{
int n = 0;
while (*c)
if (*c == ' ')
++n;
return n;
}

I repeat: there is no need to decode. Please read up on UTF-8.
You do not understand it. The reason you don't need to decode is
because UTF-8 is self-synchronising.

The code above tests for spaces only, but it works the same when
searching for any substring or single character. It is no slower
than fixed-width encoding for these operations.

Again, I urge you, please read up on UTF-8. It is very well
designed.

May 25 2013

"Peter Alexander" <peter.alexander.au gmail.com>  writes:
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
int countSpaces(const(char)* c)
{
int n = 0;
while (*c)
if (*c == ' ')
++n;
return n;
}

Oops. Missing a ++c in there, but I'm sure the point was made :-)
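
[Editorial note: for completeness, here is the corrected sketch, recast as plain C (the original used D's `const(char)*` syntax), with the missing increment:]

```c
/* Count ASCII space bytes in a NUL-terminated string. Works unchanged
 * on UTF-8 input: the byte 0x20 never occurs inside a multi-byte
 * sequence, so no decoding is needed. */
int countSpaces(const char *c)
{
    int n = 0;
    for (; *c; ++c)
        if (*c == ' ')
            ++n;
    return n;
}
```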

May 25 2013

On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev
wrote:
On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines,
tabs and spaces), it makes no difference whether the string
is in ASCII or UTF-8 - the code will behave correctly in
either case, variable-width-encodings regardless.

Except that a variable-width encoding will take longer to
decode while splitting, when compared to a single-byte
encoding.

No. Are you sure you understand UTF-8 properly?

Are you sure _you_ understand it properly?  Both encodings have
to check every single character to test for whitespace, but the
single-byte encoding simply has to load each byte in the string
and compare it against the whitespace-signifying bytes, while
the variable-length code has to first load and parse
potentially 4 bytes before it can compare, because it has to go
through the state machine that you linked to above.  Obviously
the constant-width encoding will be faster.  Did I really need
to explain this?

It looks like you've missed an important property of UTF-8: lower
ASCII remains encoded the same, and UTF-8 code units encoding
non-ASCII characters cannot be confused with ASCII characters.
Code that does not need Unicode code points can treat UTF-8
strings as ASCII strings, and does not need to decode each
character individually - because a 0x20 byte will mean "space"
regardless of context. That's why a function that splits a string
by ASCII whitespace does NOT need to perform UTF-8 decoding.

I hope this clears up the misunderstanding :)

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 14:18:32 UTC, Vladimir Panteleev
wrote:
On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
Are you sure _you_ understand it properly?  Both encodings
have to check every single character to test for whitespace,
but the single-byte encoding simply has to load each byte in
the string and compare it against the whitespace-signifying
bytes, while the variable-length code has to first load and
parse potentially 4 bytes before it can compare, because it
has to go through the state machine that you linked to above.
Obviously the constant-width encoding will be faster.  Did I
really need to explain this?

It looks like you've missed an important property of UTF-8:
lower ASCII remains encoded the same, and UTF-8 code units
encoding non-ASCII characters cannot be confused with ASCII
characters. Code that does not need Unicode code points can
treat UTF-8 strings as ASCII strings, and does not need to
decode each character individually - because a 0x20 byte will
mean "space" regardless of context. That's why a function that
splits a string by ASCII whitespace does NOT need to perform
UTF-8 decoding.

I hope this clears up the misunderstanding :)

OK, you got me with this particular special case: it is not
necessary to decode every UTF-8 character if you are simply
comparing against ASCII space characters.  My mixup is because I
was unaware whether every language used its own space character in
UTF-8 or reused the ASCII space character; apparently it's the
latter.

However, my overall point stands.  You still have to check 2-4
times as many bytes if you do it the way Peter suggests, as
opposed to a single-byte encoding.  There is a shortcut: you
could also check the first byte to see if it's ASCII or not and
then skip the right number of ensuing bytes in a character's
encoding if it isn't ASCII, but at that point you have begun
partially decoding the UTF-8 encoding, which you claimed wasn't
necessary and which will degrade performance anyway.

On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
I suggest you read up on UTF-8. You really don't understand it.
There is no need to decode, you just treat the UTF-8 string as
if it is an ASCII string.

Not being aware of this shortcut doesn't mean not understanding
UTF-8.

This code will count all spaces in a string whether it is
encoded as ASCII or UTF-8:

int countSpaces(const(char)* c)
{
    int n = 0;
    while (*c)
        if (*c == ' ')
            ++n;
    return n;
}

I repeat: there is no need to decode. Please read up on UTF-8.
You do not understand it. The reason you don't need to decode
is because UTF-8 is self-synchronising.

Not quite.  The reason you don't need to decode is because of the
particular encoding scheme chosen for UTF-8, a side effect of
ASCII backwards compatibility and reusing the ASCII space
character; it has nothing to do with whether it's
self-synchronizing or not.

The code above tests for spaces only, but it works the same
when searching for any substring or single character. It is no
slower than fixed-width encoding for these operations.

It doesn't work the same "for any substring or single character,"
it works the same for any single ASCII character.

Of course it's slower than a fixed-width single-byte encoding.
You have to check every single byte of a non-ASCII character in
UTF-8, whereas a single-byte encoding only has to check a single
byte per language character.  There is a shortcut if you
partially decode the first byte in UTF-8, mentioned above, but
you seem dead-set against decoding. ;)

Again, I urge you, please read up on UTF-8. It is very well
designed.

I disagree.  It is very badly designed, but the ASCII
compatibility does hack in some shortcuts like this, which still
don't save its performance.

May 25 2013

"Peter Alexander" <peter.alexander.au gmail.com>  writes:
On Saturday, 25 May 2013 at 14:58:02 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
I suggest you read up on UTF-8. You really don't understand
it. There is no need to decode, you just treat the UTF-8
string as if it is an ASCII string.

Not being aware of this shortcut doesn't mean not understanding
UTF-8.

It's not just a shortcut, it is absolutely fundamental to the
design of UTF-8. It's like saying you understand Lisp without
being aware that everything is a list.

Also, you continuously keep stating disadvantages to UTF-8 that
are completely false, like "slicing does require decoding".
Again, completely missing the point of UTF-8. I cannot conceive
how you can claim to understand how UTF-8 works while repeatedly
demonstrating that you do not.

You are either ignorant or a successful troll. In either case,
I'm done here.

May 25 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx>  writes:
On Sat, May 25, 2013 at 03:47:41PM +0200, Joakim wrote:
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote:
On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines,
tabs and spaces), it makes no difference whether the string is
in ASCII or UTF-8 - the code will behave correctly in either
case, variable-width-encodings regardless.

Except that a variable-width encoding will take longer to decode
while splitting, when compared to a single-byte encoding.

No. Are you sure you understand UTF-8 properly?

Are you sure _you_ understand it properly?  Both encodings have to
check every single character to test for whitespace, but the
single-byte encoding simply has to load each byte in the string and
compare it against the whitespace-signifying bytes, while the
variable-length code has to first load and parse potentially 4 bytes
before it can compare, because it has to go through the state
machine that you linked to above.  Obviously the constant-width
encoding will be faster.  Did I really need to explain this?

[...]

Have you actually tried to write a whitespace splitter for UTF-8? Do you
realize that you can use an ASCII whitespace splitter for UTF-8 and it
will work correctly?

There is no need to decode UTF-8 for whitespace splitting at all. There
is no need to parse anything. You just iterate over the bytes and split
on 0x20. There is no performance difference over ASCII.

As Dmitry said, UTF-8 is self-synchronizing. While current Phobos code
tries to play it safe by decoding every character, this is not necessary
in many cases.

T

--
The best compiler is between your ears. -- Michael Abrash

May 25 2013

Dmitry Olshansky <dmitry.olsh gmail.com>  writes:
On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
This is more a problem with the algorithms taking the easy way than a
problem with UTF-8. You can do all the string algorithms, including
regex, by working with the UTF-8 directly rather than converting to
UTF-32. Then the algorithms work at full speed.

I call BS on this.  There's no way working on a variable-width
encoding can be as "full speed" as a constant-width encoding. Perhaps
you mean that the slowdown is minimal, but I doubt that also.

For the record, I noticed that programmers (myself included) that had an
incomplete understanding of Unicode / UTF exaggerate this point, and
sometimes needlessly assume that their code needs to operate on
individual characters (code points), when it is in fact not so - and
that code will work just fine as if it was written to handle ASCII. The
example Walter quoted (regex - assuming you don't want Unicode ranges or
case-insensitivity) is one such case.

+1
BTW regex even with Unicode ranges and case-insensitivity is doable,
just not easy (yet).

Another thing I noticed: sometimes when you think you really need to
operate on individual characters (and that your code will not be correct
unless you do that), the assumption will be incorrect due to the
existence of combining characters in Unicode. Two of the often-quoted
use cases of working on individual code points is calculating the string
width (assuming a fixed-width font), and slicing the string - both of
these will break with combining characters if those are not accounted
for.  I believe the proper way to approach such tasks is to implement the
respective Unicode algorithms for it, which I believe are non-trivial
and for which the relative impact of the overhead of working with a
variable-width encoding is acceptable.

Another plus one. Algorithms defined on code point basis are quite
complex so that benefit of not decoding won't be that large. The benefit
of transparently special-casing ASCII in UTF-8 is far larger.

Can you post some specific cases where the benefits of a constant-width
encoding are obvious and, in your opinion, make constant-width encodings
more useful than all the benefits of UTF-8?

Also, I don't think this has been posted in this thread. Not sure if it

http://www.utf8everywhere.org/

And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

--
Dmitry Olshansky

May 25 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>  writes:
On 5/25/13 3:33 AM, Joakim wrote:
On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
This is more a problem with the algorithms taking the easy way than a
problem with UTF-8. You can do all the string algorithms, including
regex, by working with the UTF-8 directly rather than converting to
UTF-32. Then the algorithms work at full speed.

I call BS on this. There's no way working on a variable-width encoding
can be as "full speed" as a constant-width encoding. Perhaps you mean
that the slowdown is minimal, but I doubt that also.

You mentioned this a couple of times, and I wonder what makes you so
sure. On contemporary architectures small is fast and large is slow;
betting on replacing larger data with more computation is quite often a win.

Andrei

May 25 2013

Walter Bright <newshound2 digitalmars.com>  writes:
On 5/25/2013 5:43 AM, Andrei Alexandrescu wrote:
On 5/25/13 3:33 AM, Joakim wrote:
On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
This is more a problem with the algorithms taking the easy way than a
problem with UTF-8. You can do all the string algorithms, including
regex, by working with the UTF-8 directly rather than converting to
UTF-32. Then the algorithms work at full speed.

I call BS on this. There's no way working on a variable-width encoding
can be as "full speed" as a constant-width encoding. Perhaps you mean
that the slowdown is minimal, but I doubt that also.

You mentioned this a couple of times, and I wonder what makes you so sure. On
contemporary architectures small is fast and large is slow; betting on
replacing
larger data with more computation is quite often a win.

On the other hand, Joakim even admits his single byte encoding is variable
length, as otherwise he simply dismisses the rarely used (!) Chinese, Japanese,
and Korean languages, as well as any text that contains words from more than
one language.

I suspect he's trolling us, and quite successfully.

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 19:30:25 UTC, Walter Bright wrote:
On the other hand, Joakim even admits his single byte encoding
is variable length, as otherwise he simply dismisses the rarely
used (!) Chinese, Japanese, and Korean languages, as well as
any text that contains words from more than one language.

I have noted from the beginning that these large alphabets have
to be encoded to two bytes, so it is not a true constant-width
encoding if you are mixing one of those languages into a
single-byte encoded string.  But this "variable length" encoding
is so much simpler than UTF-8, there's no comparison.

I suspect he's trolling us, and quite successfully.

Ha, I wondered who would pull out this insult, quite surprised to
see it's Walter.  It seems to be the trend on the internet to
accuse anybody you disagree with of trolling, I am honestly
surprised to see Walter stoop so low.  Considering I'm the only
one making any cogent arguments here, perhaps I should wonder if
you're all trolling me. ;)

On Saturday, 25 May 2013 at 19:35:42 UTC, Walter Bright wrote:
I suspect the Chinese, Koreans, and Japanese would take
exception to being called irrelevant.

Irrelevant only because they are a small subset of the UCS.  I
have noted that they would also be handled by a two-byte encoding.

Good luck with your scheme that can't handle languages written
by billions of people!

So let's see: first you say that my scheme has to be variable
length because I am using two bytes to handle these languages,
then you claim I don't handle these languages.  This kind of
blatant contradiction within two posts can only be called...
trolling!

May 25 2013

Walter Bright <newshound2 digitalmars.com>  writes:
On 5/25/2013 1:03 PM, Joakim wrote:
On Saturday, 25 May 2013 at 19:30:25 UTC, Walter Bright wrote:
On the other hand, Joakim even admits his single byte encoding is variable
length, as otherwise he simply dismisses the rarely used (!) Chinese,
Japanese, and Korean languages, as well as any text that contains words from
more than one language.

I have noted from the beginning that these large alphabets have to be encoded
to two bytes, so it is not a true constant-width encoding if you are mixing one
of those languages into a single-byte encoded string.  But this "variable
length" encoding is so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's variable length. You
overlook that I've had to deal with this. It isn't "simpler", there's actually
more work to write code that adapts to one or two byte encodings.

I suspect he's trolling us, and quite successfully.

Ha, I wondered who would pull out this insult, quite surprised to see it's
Walter.  It seems to be the trend on the internet to accuse anybody you
disagree with of trolling, I am honestly surprised to see Walter stoop so low.
Considering I'm the only one making any cogent arguments here, perhaps I should
wonder if you're all trolling me. ;)

On Saturday, 25 May 2013 at 19:35:42 UTC, Walter Bright wrote:
I suspect the Chinese, Koreans, and Japanese would take exception to being
called irrelevant.

Irrelevant only because they are a small subset of the UCS.  I have noted that
they would also be handled by a two-byte encoding.

Good luck with your scheme that can't handle languages written by billions of
people!

So let's see: first you say that my scheme has to be variable length because I
am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard Chinese. You cannot have
it both ways. Code to deal with two bytes is significantly different than code
to deal with one. That means you've got a conditional in your generic code -
that isn't going to be faster than the conditional for UTF-8.

then you claim I don't handle
these languages.  This kind of blatant contradiction within two posts can only
be called... trolling!

You gave some vague handwaving about it, and then dismissed it as irrelevant,
along with more handwaving about what to do with text that has embedded words
in multiple languages.

Worse, there are going to be more than 256 of these encodings - you can't even
have a byte to specify them. Remember, Unicode has approximately 256,000
characters in it. How many code pages is that?

I was being kind saying you were trolling, as otherwise I'd be saying your
scheme was, to be blunt, absurd.

---------------------------------------

I'll be the first to admit that a lot of great ideas have been initially
dismissed by the experts as absurd. If you really believe in this, I recommend
that you write it up as a real article, taking care to fill in all the
handwaving with something specific, and include some benchmarks to prove your
performance claims. Post your article on reddit, stackoverflow, hackernews,
etc., and look for fertile ground for it. I'm sorry you're not finding fertile
ground here (so far, nobody has agreed with any of your points), and this is
the wrong place for such proposals anyway, as D is simply not going to switch
over to it.

Remember, extraordinary claims require extraordinary evidence, not handwaving
and assumptions disguised as bold assertions.

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
I have noted from the beginning that these large alphabets
have to be encoded to
two bytes, so it is not a true constant-width encoding if you
are mixing one of
those languages into a single-byte encoded string.  But this
"variable length"
encoding is so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's
variable length. You overlook that I've had to deal with this.
It isn't "simpler", there's actually more work to write code
that adapts to one or two byte encodings.

It is variable length, with the advantage that only strings
containing a few Asian languages are variable-length, as opposed
to UTF-8 having every non-English language string be
variable-length.  It may be more work to write library code to
handle my encoding, perhaps, but efficiency and ease of use are
paramount.

So let's see: first you say that my scheme has to be variable
length because I
am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard Chinese.
You cannot have it both ways. Code to deal with two bytes is
significantly different than code to deal with one. That means
you've got a conditional in your generic code - that isn't
going to be faster than the conditional for UTF-8.

Hah, I have explicitly said several times that I'd use a two-byte
encoding for Chinese and I already acknowledged that such a
predominantly single-byte encoding is still variable-length.  The
problem is that _you_ try to have it both ways: first you claimed
it is variable-length because I support Chinese that way, then
you claimed I don't support Chinese.

Yes, there will be conditionals, just as there are several
conditionals in phobos depending on whether a language supports
uppercase or not.  The question is whether the conditionals for
single-byte encoding will execute faster than decoding every
UTF-8 character.  This is a matter of engineering judgement, I
see no reason why you think decoding every UTF-8 character is
faster.

then you claim I don't handle
these languages.  This kind of blatant contradiction within
two posts can only
be called... trolling!

You gave some vague handwaving about it, and then dismissed it
as irrelevant, along with more handwaving about what to do with
text that has embedded words in multiple languages.

If it was mere "vague handwaving," how did you know I planned to
use two bytes to encode Chinese?  I'm not sure why you're

I didn't "handwave" about multi-language strings, I gave specific
ideas about how they might be implemented.  I'm not claiming to
have a bullet-proof and detailed single-byte encoding spec, just
spitballing some ideas on how to do it better than the abominable
UTF-8.

Worse, there are going to be more than 256 of these encodings -
you can't even have a byte to specify them. Remember, Unicode
has approximately 256,000 characters in it. How many code pages
is that?

There are 72 modern scripts in Unicode 6.1, 28 ancient scripts,
maybe another 50 symbolic sets.  That leaves space for another
100 or so new scripts.  Maybe you are so worried about
future-proofing that you'd use two bytes to signify the alphabet,
but I wouldn't.  I think it's more likely that we'll ditch
scripts than add them. ;) Most of those symbol sets should not be
in UCS.

I was being kind saying you were trolling, as otherwise I'd be
saying your scheme was, to be blunt, absurd.

I think it's absurd to use a self-synchronizing text encoding
from 20 years ago, that is really only useful when streaming
text, which nobody does today.  There may have been a time when
ASCII compatibility was paramount, when nobody cared about
internationalization and almost all libraries only took ASCII
input: that is not the case today.

I'll be the first to admit that a lot of great ideas have been
initially dismissed by the experts as absurd. If you really
believe in this, I recommend that you write it up as a real
article, taking care to fill in all the handwaving with
something specific, and include some benchmarks to prove your
performance claims. Post your article on reddit, stackoverflow,
hackernews, etc., and look for fertile ground for it. I'm sorry
you're not finding fertile ground here (so far, nobody has
agreed with any of your points), and this is the wrong place
for such proposals anyway, as D is simply not going to switch
over to it.

Let me admit in return that I might be completely wrong about my
single-byte encoding representing a step forward from UTF-8.
While this discussion has produced no argument that I'm wrong, it's
possible we've all missed something salient, some deal-breaker.
As I said before, I'm not proposing that D "switch over."  I was
simply asking people who know or at the very least use UTF-8 more
than most, as a result of employing one of the few languages with
Unicode support baked in, why they think UTF-8 is a good idea.

I was hoping for a technical discussion on the merits, before I
went ahead and implemented this single-byte encoding.  Since
nobody has been able to point out a reason for why my encoding
wouldn't be much better than UTF-8, I see no reason not to go
forward with my implementation.  I may write something up after
implementation: most people don't care about ideas, only results,
to the point where almost nobody can reason at all about ideas.

Remember, extraordinary claims require extraordinary evidence,
not handwaving and assumptions disguised as bold assertions.

I don't think my claims are extraordinary or backed by
"handwaving and assumptions."  Some people can reason about such
possible encodings, even in the incomplete form I've sketched
out, without having implemented them, if they know what they're
doing.

On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
On 5/25/2013 2:51 PM, Walter Bright wrote:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would contain a
single byte for every language used in the string, along with
multiple index bytes to signify the start and finish of every run
of single-language characters in the string. So, a list of
languages and a list of pure single-language substrings.

Please implement the simple C function strstr() with this
simple scheme, and
post it here.

http://www.digitalmars.com/rtl/string.html#strstr

I'll go first. Here's a simple UTF-8 version in C. It's not the
fastest way to do it, but at least it is correct:
----------------------------------
char *strstr(const char *s1, const char *s2) {
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *) s1;
    char c2 = *s2;
    while (len2 <= len1) {
        if (c2 == *s1)
            if (memcmp(s2, s1, len2) == 0)
                return (char *) s1;
        s1++;
        len1--;
    }
    return NULL;
}

There is no question that a UTF-8 implementation of strstr can be
simpler to write in C and D for multi-language strings that
include Korean/Chinese/Japanese.  But while the strstr
implementation for my encoding would contain more conditionals
and lines of code, it would be far more efficient.  For instance,
because you know where all the language substrings are from the
header, you can potentially rule out searching vast swathes of
the string, because they don't contain the same languages or
lengths as the string you're searching for.

Even if you're searching a single-language string, which won't
have those speedups, your naive implementation checks every byte,
even continuation bytes, in UTF-8 to see if they might match the
first letter of the search string, even though no continuation
byte will match.  You can avoid this by partially decoding the
leading bytes of UTF-8 characters and skipping over continuation
bytes, as I've mentioned earlier in this thread, but you've then
added more lines of code to your pretty yet simple function and
slowed down every iteration of the while loop.

My single-byte encoding has none of these problems, in fact, it's
much faster and uses less memory for the same function, while
allowing speedups not available to UTF-8.

Finally, being able to write simple yet inefficient functions
like this is not the test of a good encoding, as strstr is a
library function, and making library developers' lives easier is
a low priority for any good format.  The primary goals are ease
of use for library consumers, ie app developers, and speed and
efficiency of the code.  You are trading on the latter two for
the former with this implementation.  That is not a good tradeoff.

Perhaps it was a good trade 20 years ago when everyone rolled
their own code and nobody bothered waiting for those floppy disks
to arrive with expensive library code.  It is not a good trade
today.

May 26 2013

"Declan" <oyscal 163.com>  writes:
On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
I have noted from the beginning that these large alphabets
have to be encoded to
two bytes, so it is not a true constant-width encoding if you
are mixing one of
those languages into a single-byte encoded string.  But this
"variable length"
encoding is so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's
variable length. You overlook that I've had to deal with this.
It isn't "simpler", there's actually more work to write code
that adapts to one or two byte encodings.

It is variable length, with the advantage that only strings
containing a few Asian languages are variable-length, as
opposed to UTF-8 having every non-English language string be
variable-length.  It may be more work to write library code to
handle my encoding, perhaps, but efficiency and ease of use are
paramount.

So let's see: first you say that my scheme has to be variable
length because I
am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard
Chinese. You cannot have it both ways. Code to deal with two
bytes is significantly different than code to deal with one.
That means you've got a conditional in your generic code -
that isn't going to be faster than the conditional for UTF-8.

Hah, I have explicitly said several times that I'd use a
two-byte encoding for Chinese and I already acknowledged that
such a predominantly single-byte encoding is still
variable-length.  The problem is that _you_ try to have it both
ways: first you claimed it is variable-length because I support
Chinese that way, then you claimed I don't support Chinese.

Yes, there will be conditionals, just as there are several
conditionals in phobos depending on whether a language supports
uppercase or not.  The question is whether the conditionals for
single-byte encoding will execute faster than decoding every
UTF-8 character.  This is a matter of engineering judgement, I
see no reason why you think decoding every UTF-8 character is
faster.

then you claim I don't handle
these languages.  This kind of blatant contradiction within
two posts can only
be called... trolling!

You gave some vague handwaving about it, and then dismissed it
as irrelevant, along with more handwaving about what to do
with text that has embedded words in multiple languages.

If it was mere "vague handwaving," how did you know I planned
to use two bytes to encode Chinese?  I'm not sure why you're

I didn't "handwave" about multi-language strings, I gave
specific ideas about how they might be implemented.  I'm not
claiming to have a bullet-proof and detailed single-byte
encoding spec, just spitballing some ideas on how to do it
better than the abominable UTF-8.

Worse, there are going to be more than 256 of these encodings
- you can't even have a byte to specify them. Remember,
Unicode has approximately 256,000 characters in it. How many
code pages is that?

There are 72 modern scripts in Unicode 6.1, 28 ancient scripts,
maybe another 50 symbolic sets.  That leaves space for another
100 or so new scripts.  Maybe you are so worried about
future-proofing that you'd use two bytes to signify the
alphabet, but I wouldn't.  I think it's more likely that we'll
ditch scripts than add them. ;) Most of those symbol sets
should not be in UCS.

I was being kind saying you were trolling, as otherwise I'd be
saying your scheme was, to be blunt, absurd.

I think it's absurd to use a self-synchronizing text encoding
from 20 years ago, that is really only useful when streaming
text, which nobody does today.  There may have been a time when
ASCII compatibility was paramount, when nobody cared about
internationalization and almost all libraries only took ASCII
input: that is not the case today.

I'll be the first to admit that a lot of great ideas have been
initially dismissed by the experts as absurd. If you really
believe in this, I recommend that you write it up as a real
article, taking care to fill in all the handwaving with
something specific, and include some benchmarks to prove your
performance claims. Post your article on reddit,
stackoverflow, hackernews, etc., and look for fertile ground
for it. I'm sorry you're not finding fertile ground here (so
far, nobody has agreed with any of your points), and this is
the wrong place for such proposals anyway, as D is simply not
going to switch over to it.

Let me admit in return that I might be completely wrong about
my single-byte encoding representing a step forward from UTF-8.
While this discussion has produced no argument that I'm wrong,
it's possible we've all missed something salient, some
deal-breaker.  As I said before, I'm not proposing that D
"switch over."  I was simply asking people who know or at the
very least use UTF-8 more than most, as a result of employing
one of the few languages with Unicode support baked in, why
they think UTF-8 is a good idea.

I was hoping for a technical discussion on the merits, before I
went ahead and implemented this single-byte encoding.  Since
nobody has been able to point out a reason for why my encoding
wouldn't be much better than UTF-8, I see no reason not to go
forward with my implementation.  I may write something up after
implementation: most people don't care about ideas, only
results, to the point where almost nobody can reason at all

Remember, extraordinary claims require extraordinary evidence,
not handwaving and assumptions disguised as bold assertions.

I don't think my claims are extraordinary or backed by
"handwaving and assumptions."  Some people can reason about
such possible encodings, even in the incomplete form I've
sketched out, without having implemented them, if they know
what they're doing.

On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
On 5/25/2013 2:51 PM, Walter Bright wrote:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would contain a
single byte for every language used in the string, along with
multiple index bytes to signify the start and finish of every run
of single-language characters in the string. So, a list of
languages and a list of pure single-language substrings.

Please implement the simple C function strstr() with this
simple scheme, and
post it here.

http://www.digitalmars.com/rtl/string.html#strstr

I'll go first. Here's a simple UTF-8 version in C. It's not
the fastest way to do it, but at least it is correct:
----------------------------------
char *strstr(const char *s1, const char *s2) {
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *) s1;
    char c2 = *s2;
    while (len2 <= len1) {
        if (c2 == *s1)
            if (memcmp(s2, s1, len2) == 0)
                return (char *) s1;
        s1++;
        len1--;
    }
    return NULL;
}

There is no question that a UTF-8 implementation of strstr can
be simpler to write in C and D for multi-language strings that
include Korean/Chinese/Japanese.  But while the strstr
implementation for my encoding would contain more conditionals
and lines of code, it would be far more efficient.  For
instance, because you know where all the language substrings
are from the header, you can potentially rule out searching
vast swathes of the string, because they don't contain the same
languages or lengths as the string you're searching for.
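For what it's worth, the pruning idea can be sketched in C. Everything below is hypothetical: the post never specifies a header layout, so the struct, field widths, and names are all my own guesses, purely for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical run-table entry for the proposed header: one entry
 * per single-language run in the string.  Names and field widths
 * are my own guesses; the post does not pin them down. */
typedef struct {
    uint8_t  lang;    /* code-page id of this single-language run */
    uint32_t start;   /* byte offset where the run begins */
    uint32_t length;  /* run length in bytes */
} LangRun;

/* The pruning described above, for a single-language needle: a run
 * can only contain a match if it is at least as long as the needle
 * and uses one of the needle's languages. */
int run_can_match(const LangRun *run, const uint8_t *needle_langs,
                  size_t n_langs, size_t needle_len)
{
    if (run->length < needle_len)
        return 0;
    for (size_t i = 0; i < n_langs; i++)
        if (needle_langs[i] == run->lang)
            return 1;
    return 0;
}
```

Whether maintaining such a table actually beats UTF-8's decode branch in practice is exactly the engineering question being argued in this thread.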

Even if you're searching a single-language string, which won't
have those speedups, your naive implementation checks every
byte, even continuation bytes, in UTF-8 to see if they might
match the first letter of the search string, even though no
continuation byte will match.  You can avoid this by partially
decoding the leading bytes of UTF-8 characters and skipping
over continuation bytes, as I've mentioned earlier in this
thread, in each iteration of the while loop.
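The continuation-byte skip being described comes down to a single bit test, since every continuation byte in well-formed UTF-8 has the form 10xxxxxx. A small sketch in C (the helper and its name are mine, not from the thread):

```c
#include <stddef.h>

/* Return a pointer to the next non-continuation byte equal to c2,
 * or NULL if none is found.  Continuation bytes (10xxxxxx) can
 * never start a match, so they are skipped without comparison. */
const char *find_lead_byte(const char *s, char c2)
{
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) == 0x80)
            continue;   /* continuation byte: cannot be a lead byte */
        if (*s == c2)
            return s;
    }
    return NULL;
}
```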

My single-byte encoding has none of these problems, in fact,
it's much faster and uses less memory for the same function,
while providing additional speedups, from the header, that are
not available to UTF-8.

Finally, being able to write simple yet inefficient functions
like this is not the test of a good encoding, as strstr is a
library function, and making library developers' lives easier
is a low priority for any good format.  The primary goals are
ease of use for library consumers, ie app developers, and speed
and efficiency of the code.  You are trading the latter two
for the former with this implementation.  That is not a good
trade.

Perhaps it was a good trade 20 years ago when everyone rolled
their own code and nobody bothered waiting for those floppy
disks to arrive with expensive library code.  It is not a good
trade today.

I服了u ("OK, you win"), I'm thinking: does your name mean "joking"?

May 26 2013

"John Colvin" <john.loughran.colvin gmail.com>  writes:
On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
I have noted from the beginning that these large alphabets
have to be encoded to
two bytes, so it is not a true constant-width encoding if you
are mixing one of
those languages into a single-byte encoded string.  But this
"variable length"
encoding is so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's
variable length. You overlook that I've had to deal with this.
It isn't "simpler", there's actually more work to write code
that adapts to one or two byte encodings.

It is variable length, with the advantage that only strings
containing a few Asian languages are variable-length, as
opposed to UTF-8 having every non-English language string be
variable-length.  It may be more work to write library code to
handle my encoding, perhaps, but efficiency and ease of use are
paramount.

So let's see: first you say that my scheme has to be variable
length because I
am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard
Chinese. You cannot have it both ways. Code to deal with two
bytes is significantly different than code to deal with one.
That means you've got a conditional in your generic code -
that isn't going to be faster than the conditional for UTF-8.

Hah, I have explicitly said several times that I'd use a
two-byte encoding for Chinese and I already acknowledged that
such a predominantly single-byte encoding is still
variable-length.  The problem is that _you_ try to have it both
ways: first you claimed it is variable-length because I support
Chinese that way, then you claimed I don't support Chinese.

Yes, there will be conditionals, just as there are several
conditionals in phobos depending on whether a language supports
uppercase or not.  The question is whether the conditionals for
single-byte encoding will execute faster than decoding every
UTF-8 character.  This is a matter of engineering judgement, I
see no reason why you think decoding every UTF-8 character is
faster.

then you claim I don't handle
these languages.  This kind of blatant contradiction within
two posts can only
be called... trolling!

You gave some vague handwaving about it, and then dismissed it
as irrelevant, along with more handwaving about what to do
with text that has embedded words in multiple languages.

If it was mere "vague handwaving," how did you know I planned
to use two bytes to encode Chinese?  I'm not sure why you're

I didn't "handwave" about multi-language strings, I gave
specific ideas about how they might be implemented.  I'm not
claiming to have a bullet-proof and detailed single-byte
encoding spec, just spitballing some ideas on how to do it
better than the abominable UTF-8.

Worse, there are going to be more than 256 of these encodings
- you can't even have a byte to specify them. Remember,
Unicode has approximately 256,000 characters in it. How many
code pages is that?

There are 72 modern scripts in Unicode 6.1, 28 ancient scripts,
maybe another 50 symbolic sets.  That leaves space for another
100 or so new scripts.  Maybe you are so worried about
future-proofing that you'd use two bytes to signify the
alphabet, but I wouldn't.  I think it's more likely that we'll
ditch scripts than add them. ;) Most of those symbol sets
should not be in UCS.

I was being kind saying you were trolling, as otherwise I'd be
saying your scheme was, to be blunt, absurd.

I think it's absurd to use a self-synchronizing text encoding
from 20 years ago, that is really only useful when streaming
text, which nobody does today.  There may have been a time when
ASCII compatibility was paramount, when nobody cared about
internationalization and almost all libraries only took ASCII
input: that is not the case today.

I'll be the first to admit that a lot of great ideas have been
initially dismissed by the experts as absurd. If you really
believe in this, I recommend that you write it up as a real
article, taking care to fill in all the handwaving with
something specific, and include some benchmarks to prove your
performance claims. Post your article on reddit,
stackoverflow, hackernews, etc., and look for fertile ground
for it. I'm sorry you're not finding fertile ground here (so
far, nobody has agreed with any of your points), and this is
the wrong place for such proposals anyway, as D is simply not
going to switch over to it.

I suggest you make an attempt at writing strstr and post it. Code
speaks louder than words.

May 26 2013

Walter Bright <newshound2 digitalmars.com>  writes:
On 5/26/2013 4:31 AM, Joakim wrote:
My single-byte encoding has none of these problems, in fact, it's much faster
and uses less memory for the same function, while providing additional
speedups,
from the header, that are not available to UTF-8.

C'mon, Joakim, show us this amazing strstr() implementation for your scheme!

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 12:55:11 UTC, Walter Bright wrote:
On 5/26/2013 4:31 AM, Joakim wrote:
My single-byte encoding has none of these problems, in fact,
it's much faster
and uses less memory for the same function, while providing
additional speedups, from the header, that are not available
to UTF-8.

C'mon, Joakim, show us this amazing strstr() implementation for
your scheme!

You will see it when it's built into a fully working single-byte
encoding implementation.  I don't write toy code, particularly
inefficient functions like yours, for the reasons already given.

Heh, never seen that sketch before.  Never understood why anyone
likes this silly Monty Python stuff, from what little I've seen.

May 26 2013

On Saturday, 25 May 2013 at 20:03:59 UTC, Joakim wrote:
I have noted from the beginning that these large alphabets have
to be encoded to two bytes, so it is not a true constant-width
encoding if you are mixing one of those languages into a
single-byte encoded string.  But this "variable length"
encoding is so much simpler than UTF-8, there's no comparison.

All I can say is if you think that is simpler than UTF-8 then you
have completely the wrong idea about UTF-8.

Let me explain:

1) Take the byte at a particular offset in the string
2) If it is ASCII then we're done
3) Otherwise count the number of '1's at the start of the byte -
this is how many bytes make up the character (there's even an ASM
instruction to do this)
4) This first byte will look like '1110xxxx' for a 3 byte
character, '11110xxx' for a 4 byte character, etc.
5) All following bytes are of the form '10xxxxxx'
6) Now just concatenate all the 'x's together to get the code
point

Note that this is CONSTANT TIME, O(1) with minimal branching so
well suited to pipelining (after the initial byte the other bytes
can all be processed in parallel by the CPU) and only sequential
memory access so no cache misses, and zero additional memory
requirements.
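Those six steps translate almost line for line into C. A minimal sketch (my own code, not from the thread) that assumes well-formed input and skips validation:

```c
#include <stdint.h>

/* Decode one code point starting at s; store the sequence length
 * in *len.  Assumes valid, well-formed UTF-8 (the 10xxxxxx form of
 * continuation bytes from step 5 is not verified here). */
uint32_t utf8_decode(const unsigned char *s, int *len)
{
    unsigned char b = s[0];
    if (b < 0x80) {                    /* step 2: ASCII, done */
        *len = 1;
        return b;
    }
    int n = 0;                         /* step 3: count leading 1s */
    for (unsigned char t = b; t & 0x80; t <<= 1)
        n++;
    uint32_t cp = b & (0x7F >> n);     /* step 4: payload bits of the lead byte */
    for (int i = 1; i < n; i++)        /* step 6: six payload bits per trail byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    *len = n;
    return cp;
}
```

The leading-ones count in step 3 is where the single instruction comes in; with GCC/Clang it could be expressed for non-ASCII lead bytes as `__builtin_clz(~((uint32_t)b << 24))`.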

Now compare with your proposed scheme:

1) Look up the offset in the header using binary search: O(log N)
lots of branching
2) Look up the code page ID in a massive array of code pages to
work out how many bytes per character
3) Hope this array hasn't been paged out and is still in the cache
4) Extract that many bytes from the string and combine them into
a number
5) Look up this new number in yet another large array specific to
the code page
6) Hope this array hasn't been paged out and is still in the
cache too

This is O(log N), has lots of branching so no pipelining (every
stage depends on the result of the stage before), lots of random
memory access so lots of cache misses, lots of additional memory
requirements to store all those tables, and an algorithm that
isn't even any easier to understand.

Plus every other algorithm to operate on it except for decoding
is insanely complicated.

May 25 2013

Dmitry Olshansky <dmitry.olsh gmail.com>  writes:
25-May-2013 02:42, H. S. Teoh wrote:
On Sat, May 25, 2013 at 01:21:25AM +0400, Dmitry Olshansky wrote:
24-May-2013 21:05, Joakim wrote:

[...]

As far as Phobos is concerned, Dmitry's new std.uni module has powerful
code-generation templates that let you write code that operate directly
on UTF-8 without needing to convert to UTF-32 first.

As is there are no UTF-8 specific tables (yet), but there are tools to
create the required abstraction by hand. I plan to grow one for
std.regex that will thus be field-tested and then get into public
interface. In fact the needs of std.regex prompted me to provide more
Unicode stuff in the std.

Well, OK, maybe
we're not quite there yet, but the foundations are in place, and I'm
looking forward to the day when string functions will no longer have
implicit conversion to UTF-32, but will directly manipulate UTF-8 using
optimized state tables generated by std.uni.

Yup, but let's get the correctness part first, then performance ;)

Want small - use compression schemes which are perfectly fine and
get to the precious 1 byte per codepoint with exceptional speed.
http://www.unicode.org/reports/tr6/

+1.  Using your own encoding is perfectly fine. Just don't do that for
data interchange. Unicode was created because we *want* a single
standard to communicate with each other without stupid broken encoding
issues that used to be rampant on the web before Unicode came along.

BTW the document linked discusses _standard_ compression so that anybody
can decode that stuff. How you compress would largely affect the
compression ratio but not much beyond it.

In the bad ole days, HTML could be served in any random number of
encodings, often out-of-sync with what the server claims the encoding
is, and browsers would assume arbitrary default encodings that for the
most part *appeared* to work but are actually fundamentally b0rken.
Sometimes webpages would show up mostly-intact, but with a few
characters mangled, because of deviations / variations on codepage
interpretation, or non-standard characters being used in a particular
encoding. It was a total, utter mess, that wasted who knows how many
man-hours of programming time to work around. For data interchange on
the internet, we NEED a universal standard that everyone can agree on.

+1 on these and others :)

--
Dmitry Olshansky

May 24 2013

"Joakim" <joakim airpost.net>  writes:
On Friday, 24 May 2013 at 22:44:24 UTC, H. S. Teoh wrote:
I remember those bad ole days of gratuitously-incompatible
encodings. I
wish those days will never ever return again. You'd get a text
file in
some unknown encoding, and the only way to make any sense of it
was to
guess what encoding it might be and hope you get lucky. Not
only so, the
same language often has multiple encodings, so adding support
for a
single new language required supporting several new encodings
and being
able to tell them apart (often with no info on which they are,
if you're
lucky, or if you're unlucky, with *wrong* encoding type specs
-- for
example, I *still* get email from outdated systems that claim
to be
iso-8859 when it's actually KOI8R).

This is an argument for UCS, not UTF-8.

Prepending the encoding to the data doesn't help, because it's
pretty
much guaranteed somebody will cut-n-paste some segment of that
data and
save it without the encoding type header (or worse, some
program will
try to "fix" broken low-level code by prepending a default
encoding type
to everything, regardless of whether it's actually in that
encoding or
not), thus ensuring nobody will be able to reliably recognize
what
encoding it is down the road.

This problem already exists for UTF-8, breaking ASCII
compatibility in the process:

http://en.wikipedia.org/wiki/Byte_order_mark

Well, it's at the very least adding garbage data in the front,
just as my header would do. ;)

For all of its warts, Unicode fixed a WHOLE bunch of these
problems, and
made cross-linguistic data sane to handle without pulling out
your hair many times over.  And now we're trying to go back to that
nightmarish
old world again? No way, José!

No, I'm suggesting going back to one element of that "old world,"
single-byte encodings, but using UCS or some other standardized
character set to avoid all those incompatible code pages you had
to deal with.

If you're really concerned about encoding size, just use a
compression
library -- they're readily available these days. Internally,
the program
can just use UTF-16 for the most part -- UTF-32 is really only
necessary
if you're routinely delving outside BMP, which is very rare.

True, but you're still doubling your string size with UTF-16 and
non-ASCII text.  My concerns are the following, in order of
importance:

1. Lost programmer productivity due to these dumb variable-length
encodings.  That is the biggest loss from UTF-8's complexity.

2. Lost speed and memory due to using either an unnecessarily
complex variable-length encoding or because you translated
everything to 32-bit UTF-32 to get back to constant-width.

3. Lost bandwidth from using a fatter encoding.

As far as Phobos is concerned, Dmitry's new std.uni module has
powerful
code-generation templates that let you write code that operate
directly
on UTF-8 without needing to convert to UTF-32 first. Well, OK,
maybe
we're not quite there yet, but the foundations are in place,
and I'm
looking forward to the day when string functions will no longer
have
implicit conversion to UTF-32, but will directly manipulate
UTF-8 using
optimized state tables generated by std.uni.

There is no way this can ever be as performant as a
constant-width single-byte encoding.

+1.  Using your own encoding is perfectly fine. Just don't do
that for
data interchange. Unicode was created because we *want* a single
standard to communicate with each other without stupid broken
encoding
issues that used to be rampant on the web before Unicode came
along.

In the bad ole days, HTML could be served in any random number
of
encodings, often out-of-sync with what the server claims the
encoding
is, and browsers would assume arbitrary default encodings that
for the
most part *appeared* to work but are actually fundamentally
b0rken.
Sometimes webpages would show up mostly-intact, but with a few
characters mangled, because of deviations / variations on
codepage
interpretation, or non-standard characters being used in a
particular
encoding. It was a total, utter mess, that wasted who knows how
many
man-hours of programming time to work around. For data
interchange on
the internet, we NEED a universal standard that everyone can
agree on.

I disagree.  This is not an indictment of multiple encodings, it
is one of multiple unspecified or _broken_ encodings.  Given how
difficult UTF-8 is to get right, all you've likely done is
replace multiple broken encodings with a single encoding with
multiple broken implementations.

UTF-8, for all its flaws, is remarkably resilient to mangling
-- you can
cut-n-paste any byte sequence and the receiving end can still
make some
sense of it.  Not like the bad old days of codepages where you
just get
one gigantic block of gibberish. A properly-synchronizing UTF-8
function
can still recover legible data, maybe with only a few
characters at the
ends truncated in the worst case. I don't see how any
codepage-based
encoding is an improvement over this.
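The self-synchronization being described costs almost nothing to exploit: from any byte offset, skipping bytes of the form 10xxxxxx (at most three of them in well-formed UTF-8) lands on a character boundary. A sketch, with a helper name of my own choosing:

```c
/* Advance p to the next code-point boundary (a lead byte, or end).
 * Continuation bytes always match 10xxxxxx and lead bytes never do,
 * so at most 3 bytes are skipped in well-formed UTF-8. */
const char *utf8_resync(const char *p, const char *end)
{
    while (p < end && ((unsigned char)*p & 0xC0) == 0x80)
        p++;
    return p;
}
```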

Have you ever used this self-synchronizing feature of UTF-8?
Have you ever heard of anyone using it?  There is no reason why
this kind of limited checking of data integrity should be rolled
into the encoding itself.  Maybe everyone had plans to stream
text or something back then, but nobody does that today, so
you're good to go.

Unicode is still a "codepage-based encoding," nothing has changed
in that regard.  All UCS did is standardize a bunch of
pre-existing code pages, so that some of the redundancy was taken
out.  Unfortunately, the UTF-8 encoding then bloated the
transmission format and tempted devs to use this unnecessarily
complex format for processing too.

May 25 2013

I think you are a little confused about what unicode actually
is... Unicode has nothing to do with code pages and nobody uses
code pages any more except for compatibility with legacy
applications (with good reason!).

Unicode is:
1) A standardised numbering of a large number of characters
2) A set of standardised algorithms for operating on these
characters
3) A set of standardised encodings for efficiently encoding
sequences of these characters

You said that phobos converts UTF-8 strings to UTF-32 before
operating on them but that's not true. As it iterates over UTF-8
strings it iterates over dchars rather than chars, but that's not
in any way inefficient so I don't really see the problem.

Also your complaint that UTF-8 reserves the short characters for
the english alphabet is not really relevant - the characters with
longer encodings tend to be rarer (such as special symbols) or
carry more information (such as Chinese characters, where the
same sentence takes only about 1/3 the number of characters).

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what unicode actually
is... Unicode has nothing to do with code pages and nobody uses
code pages any more except for compatibility with legacy
applications (with good reason!).

Incorrect.

"Unicode is an effort to include all characters from previous
code pages into a single character enumeration that can be used
with a number of encoding schemes... In practice the various
Unicode character set encodings have simply been assigned their
own code page numbers, and all the other code pages have been
technically redefined as encodings for various subsets of
Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode

Unicode is:
1) A standardised numbering of a large number of characters
2) A set of standardised algorithms for operating on these
characters
3) A set of standardised encodings for efficiently encoding
sequences of these characters

What makes you think I'm unaware of this?  I have repeatedly
differentiated between UCS (1) and UTF-8 (3).

You said that phobos converts UTF-8 strings to UTF-32 before
operating on them but that's not true. As it iterates over
UTF-8 strings it iterates over dchars rather than chars, but
that's not in any way inefficient so I don't really see the
problem.

And what's a dchar?  Let's check:

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html

Of course that's inefficient, you are translating your whole
encoding over to a 32-bit encoding every time you need to process
it.  Walter as much as said so up above.

Also your complaint that UTF-8 reserves the short characters
for the english alphabet is not really relevant - the
characters with longer encodings tend to be rarer (such as
special symbols) or carry more information (such as Chinese
characters where the same sentence takes only about 1/3 the
number of characters).

The vast majority of non-english alphabets in UCS can be encoded
in a single byte.  It is your exceptions that are not relevant.

May 25 2013

On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what unicode actually
is... Unicode has nothing to do with code pages and nobody
uses code pages any more except for compatibility with legacy
applications (with good reason!).

Incorrect.

"Unicode is an effort to include all characters from previous
code pages into a single character enumeration that can be used
with a number of encoding schemes... In practice the various
Unicode character set encodings have simply been assigned their
own code page numbers, and all the other code pages have been
technically redefined as encodings for various subsets of
Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode

That confirms exactly what I just said...

You said that phobos converts UTF-8 strings to UTF-32 before
operating on them but that's not true. As it iterates over
UTF-8 strings it iterates over dchars rather than chars, but
that's not in any way inefficient so I don't really see the
problem.

And what's a dchar?  Let's check:

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html

Of course that's inefficient, you are translating your whole
encoding over to a 32-bit encoding every time you need to
process it.  Walter as much as said so up above.

Given that all the machine registers are at least 32-bits already
it doesn't make the slightest difference. The only additional
operations on top of ascii are when it's a multi-byte character,
and even then it's some simple bit manipulation which is as fast
as any variable width encoding is going to get.

The only alternatives to a variable width encoding I can see are:
- Single code page per string
This is completely useless because now you can't concatenate
strings of different code pages.

- Multiple code pages per string
This just makes everything overly complicated and is far slower
to decode what the actual character is than UTF-8.

- String with escape sequences to change code page
Can no longer access characters in the middle or end of the
string, you have to parse the entire string every time which
completely negates the benefit of a fixed width encoding.

- An encoding wide enough to store every character
This is just UTF-32.

Also your complaint that UTF-8 reserves the short characters
for the english alphabet is not really relevant - the
characters with longer encodings tend to be rarer (such as
special symbols) or carry more information (such as Chinese
characters where the same sentence takes only about 1/3 the
number of characters).

The vast majority of non-english alphabets in UCS can be
encoded in a single byte.  It is your exceptions that are not
relevant.

Well obviously... That's like saying "if you know what the exact
contents of a file are going to be anyway you can compress it to
a single byte!"

ie. It's possible to devise an encoding which will encode any
given string to an arbitrarily small size. It's still completely
useless because you'd have to know the string in advance...

- A useful encoding has to be able to handle every unicode
character
- As I've shown the only space-efficient way to do this is using
a variable length encoding like UTF-8
- Given the frequency distribution of unicode characters, UTF-8
does a pretty good job at encoding higher frequency characters in
fewer bytes.
- Yes you COULD encode non-english alphabets in a single byte but
doing so would be inefficient because it would mean the more
frequently used characters take more bytes to encode.

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what unicode actually
is... Unicode has nothing to do with code pages and nobody
uses code pages any more except for compatibility with legacy
applications (with good reason!).

Incorrect.

"Unicode is an effort to include all characters from previous
code pages into a single character enumeration that can be
used with a number of encoding schemes... In practice the
various Unicode character set encodings have simply been
assigned their own code page numbers, and all the other code
pages have been technically redefined as encodings for various
subsets of Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode

That confirms exactly what I just said...

No, it doesn't: it contradicts your claim that Unicode has
"nothing to do with code pages."  All UCS did is take a
bunch of existing code pages and standardize them into one
massive character set.  For example, ISCII was a pre-existing
single-byte encoding and Unicode "largely preserves the ISCII
layout within each block."
http://en.wikipedia.org/wiki/ISCII

All a code page is is a table of mappings, UCS is just a much
larger, standardized table of such mappings.

You said that phobos converts UTF-8 strings to UTF-32 before
operating on them but that's not true. As it iterates over
UTF-8 strings it iterates over dchars rather than chars, but
that's not in any way inefficient so I don't really see the
problem.

And what's a dchar?  Let's check:

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html

Of course that's inefficient, you are translating your whole
encoding over to a 32-bit encoding every time you need to
process it.  Walter as much as said so up above.

Given that all the machine registers are at least 32-bits
already it doesn't make the slightest difference. The only
additional operations on top of ascii are when it's a
multi-byte character, and even then it's some simple bit
manipulation which is as fast as any variable width encoding is
going to get.

I see you've abandoned without note your claim that phobos
doesn't convert UTF-8 to UTF-32 internally.  Perhaps converting
to UTF-32 is "as fast as any variable width encoding is going to
get" but my claim is that single-byte encodings will be faster.

The only alternatives to a variable width encoding I can see
are:
- Single code page per string
This is completely useless because now you can't concatenate
strings of different code pages.

I wouldn't be so fast to ditch this.  There is a real argument to
be made that strings of different languages are sufficiently
different that there should be no multi-language strings.  Is
this the best route?  I'm not sure, but I certainly wouldn't
dismiss it out of hand.

- Multiple code pages per string
This just makes everything overly complicated and is far slower
to decode what the actual character is than UTF-8.

I disagree, this would still be far faster than UTF-8,

- String with escape sequences to change code page
Can no longer access characters in the middle or end of the
string, you have to parse the entire string every time which
completely negates the benefit of a fixed width encoding.

I didn't think of this possibility, but you may be right that
it's sub-optimal.

Also your complaint that UTF-8 reserves the short characters
for the english alphabet is not really relevant - the
characters with longer encodings tend to be rarer (such as
Chinese characters, where the same sentence takes only about
1/3 the number of characters).

The vast majority of non-english alphabets in UCS can be
encoded in a single byte.  It is your exceptions that are not
relevant.

Well obviously... That's like saying "if you know what the
exact contents of a file are going to be anyway you can
compress it to a single byte!"

ie. It's possible to devise an encoding which will encode any
given string to an arbitrarily small size. It's still
completely useless because you'd have to know the string in
advance.

No, it's not the same at all.  The contents of an
arbitrary-length file cannot be compressed to a single byte, you
would have collisions galore.  But since most non-english
alphabets are less than 256 characters, they can all be uniquely
encoded in a single byte per character, with the header
determining what language's code page to use.  I don't understand

- A useful encoding has to be able to handle every unicode
character
- As I've shown the only space-efficient way to do this is
using a variable length encoding like UTF-8

You haven't shown this.

- Given the frequency distribution of unicode characters, UTF-8
does a pretty good job at encoding higher frequency characters
in fewer bytes.

No, it does a very bad job of this.  Every non-ASCII character
takes at least two bytes to encode, whereas my single-byte
encoding scheme would encode every alphabet with less than 256
characters in a single byte.

- Yes you COULD encode non-english alphabets in a single byte
but doing so would be inefficient because it would mean the
more frequently used characters take more bytes to encode.

Not sure what you mean by this.

May 25 2013

On Saturday, 25 May 2013 at 19:02:43 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what unicode
actually is... Unicode has nothing to do with code pages and
nobody uses code pages any more except for compatibility
with legacy applications (with good reason!).

Incorrect.

"Unicode is an effort to include all characters from previous
code pages into a single character enumeration that can be
used with a number of encoding schemes... In practice the
various Unicode character set encodings have simply been
assigned their own code page numbers, and all the other code
pages have been technically redefined as encodings for
various subsets of Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode

That confirms exactly what I just said...

having "nothing to do with code pages."  All UCS did is take a
bunch of existing code pages and standardize them into one
massive character set.  For example, ISCII was a pre-existing
single-byte encoding and Unicode "largely preserves the ISCII
layout within each block."
http://en.wikipedia.org/wiki/ISCII

All a code page is is a table of mappings, UCS is just a much
larger, standardized table of such mappings.

UCS does have nothing to do with code pages, it was designed as a
replacement for them. A codepage is a strict subset of the
possible characters, UCS is the entire set of possible characters.
You said that phobos converts UTF-8 strings to UTF-32 before
operating on them but that's not true. As it iterates over
UTF-8 strings it iterates over dchars rather than chars, but
that's not in any way inefficient so I don't really see the
problem.

And what's a dchar?  Let's check:

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html

Of course that's inefficient, you are translating your whole
encoding over to a 32-bit encoding every time you need to
process it.  Walter as much as said so up above.

Given that all the machine registers are at least 32-bits
already it doesn't make the slightest difference. The only
additional operations on top of ascii are when it's a
multi-byte character, and even then it's some simple bit
manipulation which is as fast as any variable width encoding
is going to get.

I see you've abandoned without note your claim that phobos
doesn't convert UTF-8 to UTF-32 internally.  Perhaps converting
to UTF-32 is "as fast as any variable width encoding is going
to get" but my claim is that single-byte encodings will be
faster.

I haven't "abandoned my claim". It's a simple fact that phobos
does not convert UTF-8 strings to UTF-32 strings before it uses
them.

ie. the difference between this:

    string mystr = ...;
    dstring temp = mystr.to!dstring;
    for (int i = 0; i < temp.length; ++i)
        process(temp[i]);

and this:

    string mystr = ...;
    size_t i = 0;
    while (i < mystr.length) {
        dchar current = decode(mystr, i);
        process(current);
    }

And if you can't see why the latter example is far more efficient
I give up...
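For what it's worth, the "simple bit manipulation" a decode step performs can be made concrete. Here is a rough sketch in Python rather than D (the helper name is invented, and it assumes well-formed input - this is not the actual Phobos implementation):

```python
def decode_utf8_at(b, i):
    """Decode one code point from bytes b at index i.
    Returns (code_point, next_index). Sketch only: no checks
    for truncated, overlong, or invalid sequences."""
    lead = b[i]
    if lead < 0x80:            # 0xxxxxxx: ASCII, one byte
        return lead, i + 1
    elif lead < 0xE0:          # 110xxxxx: two-byte sequence
        cp, n = lead & 0x1F, 2
    elif lead < 0xF0:          # 1110xxxx: three-byte sequence
        cp, n = lead & 0x0F, 3
    else:                      # 11110xxx: four-byte sequence
        cp, n = lead & 0x07, 4
    for j in range(i + 1, i + n):
        cp = (cp << 6) | (b[j] & 0x3F)   # 10xxxxxx continuations
    return cp, i + n
```

A few compares, shifts and masks per character - cheap, but still more work than indexing a fixed-width array, which is the crux of this whole argument.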

The only alternatives to a variable width encoding I can see
are:
- Single code page per string
This is completely useless because now you can't concatenate
strings of different code pages.

I wouldn't be so fast to ditch this.  There is a real argument
to be made that strings of different languages are sufficiently
different that there should be no multi-language strings.  Is
this the best route?  I'm not sure, but I certainly wouldn't
dismiss it out of hand.

- Multiple code pages per string
This just makes everything overly complicated and is far
slower to decode what the actual character is than UTF-8.

I disagree, this would still be far faster than UTF-8,

The cache misses alone caused by simply accessing the separate
header table would outweigh any savings, whereas UTF-8 decoding
takes a few assembly instructions, has perfect locality and
can be efficiently pipelined by the CPU.

Then there's all the extra processing involved combining the
headers when you concatenate strings. Plus you lose the one
benefit a fixed width encoding has because random access is no
longer possible without first finding out which header controls
the location you want to access.

- String with escape sequences to change code page
Can no longer access characters in the middle or end of the
string, you have to parse the entire string every time which
completely negates the benefit of a fixed width encoding.

I didn't think of this possibility, but you may be right that
it's sub-optimal.

Also your complaint that UTF-8 reserves the short characters
for the english alphabet is not really relevant - the
characters with longer encodings tend to be rarer (such as
Chinese characters, where the same sentence takes only about
1/3 the number of characters).

The vast majority of non-english alphabets in UCS can be
encoded in a single byte.  It is your exceptions that are not
relevant.

Well obviously... That's like saying "if you know what the
exact contents of a file are going to be anyway you can
compress it to a single byte!"

ie. It's possible to devise an encoding which will encode any
given string to an arbitrarily small size. It's still
completely useless because you'd have to know the string in
advance.

No, it's not the same at all.  The contents of an
arbitrary-length file cannot be compressed to a single byte,
you would have collisions galore.  But since most non-english
alphabets are less than 256 characters, they can all be
uniquely encoded in a single byte per character, with the
header determining what language's code page to use.  I don't

The more you know about the data you are compressing at the
time of writing the algorithm, the better the compression ratio
you can get, to the point that if you know exactly what the
file is going to contain you can compress it to nothing. This
is why you have specialised compression algorithms for images,
video, audio, etc.

It doesn't matter how few characters non-english alphabets have -
unless you know WHICH alphabet it is before-hand you can't store
it in a single byte. Since any given character could be in any
alphabet the best you can do is look at the probabilities of
different characters appearing and use shorter representations
for more common ones. (This is the basis for all lossless
compression) The english alphabet plus 0-9 and basic punctuation
are by far the most common characters used on computers so it
makes sense to use one byte for those and multiple bytes for
rarer characters.
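The byte-length tiers being argued over here are easy to check. A quick Python illustration (the sample characters are arbitrary picks from each tier): ASCII encodes to one byte, the Latin/Cyrillic/Greek supplements to two, most of the rest of the BMP (Indic, CJK) to three, and astral characters to four.

```python
# UTF-8 length of one character from several scripts.
for ch in "A", "é", "щ", "अ", "中", "😀":
    print(f"U+{ord(ch):05X}: {len(ch.encode('utf-8'))} byte(s)")
```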

- A useful encoding has to be able to handle every unicode
character
- As I've shown the only space-efficient way to do this is
using a variable length encoding like UTF-8

You haven't shown this.

If you had read my points above about the alternatives, such as
one code page per string, you would see that I had.

- Given the frequency distribution of unicode characters,
UTF-8 does a pretty good job at encoding higher frequency
characters in fewer bytes.

No, it does a very bad job of this.  Every non-ASCII character
takes at least two bytes to encode, whereas my single-byte
encoding scheme would encode every alphabet with less than 256
characters in a single byte.

And strings with mixed characters would use lots of memory and be
extremely slow. Common when using proper names, quotes, inline
translations, graphical characters, etc. etc. Not to mention the
added complexity to actually implement the algorithms.

May 25 2013

Walter Bright <newshound2 digitalmars.com>  writes:
On 5/25/2013 1:07 AM, Joakim wrote:
The vast majority of non-english alphabets in UCS can be encoded in a single
byte.  It is your exceptions that are not relevant.

I suspect the Chinese, Koreans, and Japanese would take exception to being
called irrelevant.

Good luck with your scheme that can't handle languages written by billions of
people!

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote:
You seem to think that not only UTF-8 is bad encoding but also
one unified encoding (code-space) is bad(?).

Yes, on the encoding, if it's a variable-length encoding like
UTF-8, no, on the code space.  I was originally going to title my
post, "Why Unicode?" but I have no real problem with UCS, which
merely standardized a bunch of pre-existing code pages.  Perhaps
there are a lot of problems with UCS also, I just haven't delved
into it enough to know.  My problem is with these dumb
variable-length encodings, so I was precise in the title.

Separate code spaces were the case before Unicode (and utf-8).
The problem is not only that without header text is meaningless
(no easy slicing) but the fact that encoding of data after
header strongly depends on a variety of factors - a list of
encodings actually. Now everybody has to keep a (code) page per
language to at least know if it's 2 bytes per char or 1 byte
per char or whatever. And you still work on a basis that there
is no combining marks and regional specific stuff :)

Everybody is still keeping code pages, UTF-8 hasn't changed that.
Does UTF-8 not need "to at least know if it's 2 bytes per char
or 1 byte per char or whatever?"  It has to do that also.
Everyone keeps talking about "easy slicing" as though UTF-8
provides it, but it doesn't.  Phobos turns UTF-8 into UTF-32
internally for all that ease of use, at least doubling your
string size in the process.  Correct me if I'm wrong, that was
what I read on the newsgroup sometime back.

In fact it was even "better" - nobody ever talked about a
header, they just assumed a codepage with some global setting.
Imagine
yourself creating a font rendering system these days - a hell
of an exercise in frustration (okay how do I render 0x88 ? mm
if that is in codepage XYZ then ...).

I understand that people were frustrated with all the code pages
out there before UCS standardized them, but that is a completely
different argument than my problem with UTF-8 and variable-length
encodings.  My proposed simple, header-based, constant-width
encoding could be implemented with UCS and there go all your
arguments about random code pages.

This just shows you don't care for multilingual stuff at all.
Imagine any language tutor/translator/dictionary on the Web.
For instance most languages need to intersperse ASCII (also
keep in mind e.g. HTML markup). Books often feature citations
in native language (or e.g. latin) along with translations.

This is a small segment of use and it would be handled fine by an
alternate encoding.

Now also take into account math symbols, currency symbols and
beyond. Also these days cultures are mixing in wild
combinations so you might need to see the text even if you
can't read it. Unicode is not only "encode characters from all
languages". It needs to address universal representation of
symbolics used in writing systems at large.

I take your point that it isn't just languages, but symbols also.
I see no reason why UTF-8 is a better encoding for that purpose
than the kind of simple encoding I've suggested.

We want monoculture! That is, to understand each other without all
these "par-le-vu-france?" and codepages of various
complexity(insanity).

I hate monoculture, but then I haven't had to decipher some
screwed-up codepage in the middle of the night. ;) That said, you
could standardize on UCS for your code space without using a bad
encoding like UTF-8, as I said above.

Want small - use compression schemes which are perfectly fine
and get to the precious 1byte per codepoint with exceptional
speed.
http://www.unicode.org/reports/tr6/

Correct me if I'm wrong, but it seems like that compression
scheme simply adds a header and then uses a single-byte encoding,
exactly what I suggested! :) But I get the impression that it's
only for sending over the wire, ie transmision, so all the
processing issues that UTF-8 introduces would still be there.

And borrowing the arguments from that rant: locale is
borked shit when it comes to encodings. Locales should be used
for tweaking visuals like numbers, date display and so on.

Is that worse than every API simply assuming UTF-8, as he says?
Broken locale support in the past, as you and others complain
about, doesn't invalidate the concept.  If they're screwing up
something so simple, imagine how much worse everyone is screwing
up something complex like UTF-8?

May 24 2013

Dmitry Olshansky <dmitry.olsh gmail.com>  writes:
25-May-2013 10:44, Joakim wrote:
On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote:
You seem to think that not only UTF-8 is bad encoding but also one

Yes, on the encoding, if it's a variable-length encoding like UTF-8, no,
on the code space.  I was originally going to title my post, "Why
Unicode?" but I have no real problem with UCS, which merely standardized
a bunch of pre-existing code pages.  Perhaps there are a lot of problems
with UCS also, I just haven't delved into it enough to know.  My problem
is with these dumb variable-length encodings, so I was precise in the
title.

UCS is dead and gone. Next in line to "640K is enough for everyone".
Simply put Unicode decided to take into account all diversity of
languages instead of ~80% of these. Hard to add anything else. No offense
meant but it feels like you actually live in universe that is 5-7 years
behind current state. UTF-16 (a successor to UCS) is no random-access
either. And it's shitty beyond measure, UTF-8 is a shining gem in
comparison.

Separate code spaces were the case before Unicode (and utf-8). The
problem is not only that without header text is meaningless (no easy
slicing) but the fact that encoding of data after header strongly
depends on a variety of factors - a list of encodings actually. Now
everybody has to keep a (code) page per language to at least know if
it's 2 bytes per char or 1 byte per char or whatever. And you still
work on a basis that there is no combining marks and regional specific
stuff :)

Everybody is still keeping code pages, UTF-8 hasn't changed that.

Legacy. Hard to switch overnight. There are graphs that indicate that
a few years from now you might never encounter a legacy encoding anymore,
only UTF-8/UTF-16.

Does
UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per
char or whatever?"

It's coherent in its scheme to determine that. You don't need extra
information synced to text unlike header stuff.

It has to do that also. Everyone keeps talking about
"easy slicing" as though UTF-8 provides it, but it doesn't.  Phobos
turns UTF-8 into UTF-32 internally for all that ease of use, at least
doubling your string size in the process.  Correct me if I'm wrong, that
was what I read on the newsgroup sometime back.

Indeed you are - searching for UTF-8 substring in UTF-8 string doesn't
do any decoding, and it does return you a slice of the balance of the original.

In fact it was even "better" nobody ever talked about header they just
assumed a codepage with some global setting. Imagine yourself creating
a font rendering system these days - a hell of an exercise in
frustration (okay how do I render 0x88 ? mm if that is in codepage XYZ
then ...).

I understand that people were frustrated with all the code pages out
there before UCS standardized them, but that is a completely different
argument than my problem with UTF-8 and variable-length encodings.  My
proposed simple, header-based, constant-width encoding could be
implemented with UCS and there go all your arguments about random code
pages.

No they don't - have you ever seen native Korean or Chinese codepages?
There are so many of them that there is no single sane way to deal
with them on a cross-locale basis (a problem that you simply ignore,
as noted below).

This just shows you don't care for multilingual stuff at all. Imagine
any language tutor/translator/dictionary on the Web. For instance most
languages need to intersperse ASCII (also keep in mind e.g. HTML
markup). Books often feature citations in native language (or e.g.
latin) along with translations.

This is a small segment of use and it would be handled fine by an
alternate encoding.

??? Simply makes no sense. There is no intersection between some legacy
encodings as of now. Or do you want to add N*(N-1) cross-encodings for
any combination of 2? What about 3 in one string?

Now also take into account math symbols, currency symbols and beyond.
Also these days cultures are mixing in wild combinations so you might
need to see the text even if you can't read it. Unicode is not only
"encode characters from all languages". It needs to address universal
representation of symbolics used in writing systems at large.

I take your point that it isn't just languages, but symbols also.  I see
no reason why UTF-8 is a better encoding for that purpose than the kind
of simple encoding I've suggested.

We want monoculture! That is, to understand each other without all these
"par-le-vu-france?" and codepages of various complexity(insanity).

I hate monoculture, but then I haven't had to decipher some screwed-up
codepage in the middle of the night. ;)

So you never had trouble with internationalization? What languages do you

That said, you could standardize
on UCS for your code space without using a bad encoding like UTF-8, as I
said above.

UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into
that trap (Java, Windows NT). You shouldn't.

Want small - use compression schemes which are perfectly fine and get
to the precious 1byte per codepoint with exceptional speed.
http://www.unicode.org/reports/tr6/

Correct me if I'm wrong, but it seems like that compression scheme
simply adds a header and then uses a single-byte encoding, exactly what
I suggested! :)

This is it, but it's far more flexible in the sense that it allows
multilingual strings just fine, and lone full-width unicode codepoints
as well.

But I get the impression that it's only for sending over
the wire, ie transmision, so all the processing issues that UTF-8
introduces would still be there.

Use mime-type etc. Standards are always a bit stringy and suboptimal,
their acceptance rate is one of chief advantages they have. Unicode has
horrifically large momentum now and not a single organization aside from
them tries to do this dirty work (=i18n).

And borrowing the arguments from that rant: locale is borked shit
when it comes to encodings. Locales should be used for tweaking
visuals like numbers, date display and so on.

Is that worse than every API simply assuming UTF-8, as he says? Broken
locale support in the past, as you and others complain about, doesn't
invalidate the concept.

It's combinatorial blowup and has some stone-walls to hit into. Consider
adding another encoding for "Tuva" for instance. Now you have to add 2*n
conversion routines to match it to other codepages/locales.

Beyond that - there are many things to consider in internationalization
and you would have to special case them all by codepage.

If they're screwing up something so simple,
imagine how much worse everyone is screwing up something complex like
UTF-8?

UTF-8 is pretty darn simple. BTW all it does is map [0..10FFFF] to a
sequence of octets. It does it pretty well and compatible with ASCII,
even the little rant you posted acknowledged that. Now you are either
against Unicode as a whole, or what?

--
Dmitry Olshansky

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
25-May-2013 10:44, Joakim wrote:
Yes, on the encoding, if it's a variable-length encoding like
UTF-8, no,
on the code space.  I was originally going to title my post,
"Why
Unicode?" but I have no real problem with UCS, which merely
standardized
a bunch of pre-existing code pages.  Perhaps there are a lot
of problems
with UCS also, I just haven't delved into it enough to know.

UCS is dead and gone. Next in line to "640K is enough for
everyone".

I think you are confused.  UCS refers to the Universal Character
Set, which is the backbone of Unicode:

http://en.wikipedia.org/wiki/Universal_Character_Set

You might be thinking of the unpopular UCS-2 and UCS-4 encodings,
which I have never referred to.

Separate code spaces were the case before Unicode (and
utf-8). The
problem is not only that without header text is meaningless
(no easy
slicing) but the fact that encoding of data after header
strongly
depends on a variety of factors - a list of encodings actually.
Now
everybody has to keep a (code) page per language to at least
know if
it's 2 bytes per char or 1 byte per char or whatever. And you
still
work on a basis that there is no combining marks and regional
specific
stuff :)

Everybody is still keeping code pages, UTF-8 hasn't changed
that.

Legacy. Hard to switch overnight. There are graphs that
indicate that a few years from now you might never encounter a
legacy encoding anymore, only UTF-8/UTF-16.

I didn't mean that people are literally keeping code pages.  I
meant that there's not much of a difference between code pages
with 2 bytes per char and the language character sets in UCS.

Does
UTF-8 not need "to at least know if it's 2 bytes per char or 1
byte per
char or whatever?"

It's coherent in its scheme to determine that. You don't need
extra information synced to text unlike header stuff.

?!  It's okay because you deem it "coherent in its scheme?"  I
deem headers much more coherent. :)

It has to do that also. Everyone keeps talking about
"easy slicing" as though UTF-8 provides it, but it doesn't.
Phobos
turns UTF-8 into UTF-32 internally for all that ease of use,
at least
doubling your string size in the process.  Correct me if I'm
wrong, that
was what I read on the newsgroup sometime back.

Indeed you are - searching for UTF-8 substring in UTF-8 string
doesn't do any decoding, and it does return you a slice of the
balance of the original.

Perhaps substring search doesn't strictly require decoding but
you have changed the subject: slicing does require decoding and
that's the use case you brought up to begin with.  I haven't
looked into it, but I suspect substring search not requiring
decoding is the exception for UTF-8 algorithms, not the rule.

??? Simply makes no sense. There is no intersection between
some legacy encodings as of now. Or do you want to add N*(N-1)
cross-encodings for any combination of 2? What about 3 in one
string?

I sketched two possible encodings above, neither of which would
require "cross-encodings."

We want monoculture! That is to understand each without all
these
"par-le-vu-france?" and codepages of various
complexity(insanity).

I hate monoculture, but then I haven't had to decipher some
screwed-up
codepage in the middle of the night. ;)

So you never had trouble of internationalization? What

This was meant as a point in your favor, conceding that I haven't
had to code with the terrible code pages system from the past.  I
can read and speak multiple languages, but I don't use anything
other than English text.

That said, you could standardize
on UCS for your code space without using a bad encoding like
UTF-8, as I
said above.

UCS is a myth as of ~5 years ago. Early adopters of Unicode
fell into that trap (Java, Windows NT). You shouldn't.

UCS, the character set, as noted above.  If that's a myth,
Unicode is a myth. :)

This is it, but it's far more flexible in the sense that it allows
multilingual strings just fine, and lone full-width unicode
codepoints as well.

That's only because it uses a more complex header than a single
byte for the language, which I noted could be done with my scheme
too, back when I first mentioned this unicode compression scheme.

But I get the impression that it's only for sending over
the wire, ie transmision, so all the processing issues that
UTF-8
introduces would still be there.

Use mime-type etc. Standards are always a bit stringy and
suboptimal, their acceptance rate is one of chief advantages
they have. Unicode has horrifically large momentum now and not
a single organization aside from them tries to do this dirty
work (=i18n).

You misunderstand.  I was saying that this unicode compression
scheme is meant for transmission and is probably fine for that,
precisely because it
seems to implement some version of my single-byte encoding
scheme!  You do raise a good point: the only reason why we're
likely using such a bad encoding in UTF-8 is that nobody else
wants to tackle this hairy problem.

Consider adding another encoding for "Tuva" for instance. Now
you have to add 2*n conversion routines to match it to other
codepages/locales.

Not sure what you're referring to here.

Beyond that - there are many things to consider in
internationalization and you would have to special case them
all by codepage.

Not necessarily.  But that is actually one of the advantages of
single-byte encodings, as I have noted above.  toUpper is a NOP
for a single-byte-encoded string in an Asian script; you can't
do that with a UTF-8 string.

If they're screwing up something so simple,
imagine how much worse everyone is screwing up something
complex like
UTF-8?

UTF-8 is pretty darn simple. BTW all it does is map [0..10FFFF]
to a sequence of octets. It does it pretty well and compatible
with ASCII, even the little rant you posted acknowledged that.
Now you are either against Unicode as a whole, or what?

The BOM link I gave notes that UTF-8 isn't always
ASCII-compatible.

There are two parts to Unicode.  I don't know enough about UCS,
the character set, ;) to be for it or against it, but I
acknowledge that a standardized character set may make sense.  I
am dead set against the UTF-8 variable-width encoding, for all
the reasons listed above.

On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
25-May-2013 13:05, Joakim wrote:
Nobody is talking about going back to code pages.  I'm talking
going to single-byte encodings, which do not imply the
problems that you
had with code pages way back when.

Problem is what you outline is isomorphic with code-pages.
Hence the grief of accumulated experience against them.

They may seem superficially similar but they're not.  For
example, from the beginning, I have suggested a more complex
header that can enable multi-language strings, as one possible
solution.  I don't think code pages provided that.

Well if somebody got the quest to redefine UTF-8 they *might*
come up with something that is a bit faster to decode but
shares the same properties. Hardly a life saver anyway.

Perhaps not, but I suspect programmers will flock to a
constant-width encoding that is much simpler and more efficient
than UTF-8.  Programmer productivity is the biggest loss from the
complexity of UTF-8, as I've noted before.

The world may not "abandon Unicode," but it will abandon
UTF-8, because
it's a dumb idea.  Unfortunately, such dumb ideas- XML
anyone?- often
proliferate until someone comes up with something better to
show how
dumb they are.

Even children know XML is awful, redundant shit as an interchange
format. The hierarchical document is a nice idea anyway.

_We_ both know that, but many others don't, or XML wouldn't be as
popular as it is. ;) I'm making a similar point about the more
limited success of UTF-8, ie it's still shit.

May 25 2013

"Juan Manuel Cabo" <juanmanuel.cabo gmail.com>  writes:
░░░░░░░░░ⓌⓉⒻ░
╔╗░╔╗░╔╗╔════╗╔════╗░░
║║░║║░║║╚═╗╔═╝║╔═══╝░░
║║░║║░║║░░║║░░║╚═╗░░░░
║╚═╝╚═╝║╔╗║║╔╗║╔═╝╔╗░░
╚══════╝╚╝╚╝╚╝╚╝░░╚╝░░

░░░░░░░░░░░░░░░░░░░░░░░░
█░█░█░░░░░░▐░░░░░░░░░░▐░
█░█░█▐▀█▐▀█▐░█▐▀█▐▀█▐▀█░
█░█░█▐▄█▐▄█▐▄▀▐▄█▐░█▐░█░
█▄█▄█▐▄▄▐▄▄▐░█▐▄▄▐░█▐▄█░
░░░░░░░░░░░░░░░░░░░░░░░░

--jm

May 25 2013

"limited success of UTF-8"

Becoming the de-facto standard encoding EVERYWHERE except for
windows which uses UTF-16 is hardly a failure...

I really don't understand your hatred for UTF-8 - it's simple to
decode and encode, fast and space-efficient. Fixed width
encodings are not inherently fast, the only thing they are faster
at is if you want to randomly access the Nth character instead of
the Nth byte. In the rare cases that you need to do a lot of this
kind of random access there exists UTF-32...

Any fixed width encoding which can encode every unicode character
must use at least 3 bytes, and using 4 bytes is probably going to
be faster because of alignment, so I don't see what the great
improvement over UTF-32 is going to be.

slicing does require decoding

Nope.

I didn't mean that people are literally keeping code pages.  I
meant that there's not much of a difference between code pages
with 2 bytes per char and the language character sets in UCS.

Unicode doesn't have "language character sets". The different
planes only exist for organisational purposes; they don't affect
how characters are encoded.

?!  It's okay because you deem it "coherent in its scheme?"  I
deem headers much more coherent. :)

Sure if you change the word "coherent" to mean something
completely different... Coherent means that you store related
things together, ie. everything that you need to decode a
character in the same place, not spread out between part of a
header and part of the string.

but I suspect substring search not requiring decoding is the
exception for UTF-8 algorithms, not the rule.

The only time you need to decode is when you need to do some
transformation that depends on the code point such as converting
case or identifying which character class a particular character
belongs to. Appending, slicing, copying, searching, replacing,
etc. basically all the most common text operations can all be
done without any encoding or decoding.
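One concrete reason slicing can avoid decoding: UTF-8 marks code-point boundaries in the bytes themselves, since continuation bytes are exactly those matching 10xxxxxx. A Python sketch of the boundary test (function names are made up for illustration):

```python
def at_boundary(data: bytes, i: int) -> bool:
    """True if byte index i starts a code point (or equals len).
    Continuation bytes are exactly those matching 10xxxxxx."""
    return i == len(data) or (data[i] & 0xC0) != 0x80

def prev_boundary(data: bytes, i: int) -> int:
    """Snap an arbitrary byte index back to the nearest code-point
    boundary at or before it - at most 3 steps back, no decoding."""
    while not at_boundary(data, i):
        i -= 1
    return i
```

A slice taken between two such boundaries is itself valid UTF-8, without ever computing which characters those bytes represent.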

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 18:56:42 UTC, Diggory wrote:
"limited success of UTF-8"

Becoming the de-facto standard encoding EVERYWHERE except for
windows which uses UTF-16 is hardly a failure...

So you admit that UTF-8 hasn't been used on the vast majority of
computers since the inception of Unicode.  That's what I call
limited success, thank you for agreeing with me. :)

I really don't understand your hatred for UTF-8 - it's simple
to decode and encode, fast and space-efficient. Fixed width
encodings are not inherently fast, the only thing they are
faster at is if you want to randomly access the Nth character
instead of the Nth byte. In the rare cases that you need to do
a lot of this kind of random access there exists UTF-32...

Space-efficient?  Do you even understand what a single-byte
encoding is?  Suffice to say, a single-byte encoding beats UTF-8
on all these measures, not just one.

Any fixed width encoding which can encode every unicode
character must use at least 3 bytes, and using 4 bytes is
probably going to be faster because of alignment, so I don't
see what the great improvement over UTF-32 is going to be.

Slaps head.  You don't need "at least 3 bytes" because you're
packing language info in the header.  I don't think you even know

slicing does require decoding

Nope.

Of course it does, at least partially.  There is no other way to
know where the code points are.

I didn't mean that people are literally keeping code pages.  I
meant that there's not much of a difference between code pages
with 2 bytes per char and the language character sets in UCS.

Unicode doesn't have "language character sets". The different
planes only exist for organisational purposes; they don't affect
how characters are encoded.

the different language character sets in this list:

http://en.wikipedia.org/wiki/List_of_Unicode_characters

?!  It's okay because you deem it "coherent in its scheme?"  I
deem headers much more coherent. :)

Sure, if you change the word "coherent" to mean something
completely different... Coherent means that you store related
things together, ie. everything that you need to decode a
character in the same place, not spread out between part of a

Coherent means that the organizational pieces fit together and
make sense conceptually, not that everything is stored together.
My point is that putting the language info in a header seems much
more coherent to me than ramming that info into every character.

but I suspect substring search not requiring decoding is the
exception for UTF-8 algorithms, not the rule.

The only time you need to decode is when you need to do some
transformation that depends on the code point such as
converting case or identifying which character class a
particular character belongs to. Appending, slicing, copying,
searching, replacing, etc. basically all the most common text
operations can all be done without any encoding or decoding.

Slicing by byte, which is the only way to slice without decoding,
is useless, I have to laugh that you even include it. :) All
these basic operations can be done very fast, often faster than
UTF-8, in a single-byte encoding.  Once you start talking code
points, it's no contest: UTF-8 flat out loses.

On Saturday, 25 May 2013 at 19:42:41 UTC, Diggory wrote:
All a code page is is a table of mappings, UCS is just a much
larger, standardized table of such mappings.

UCS has nothing to do with code pages; it was designed as a
replacement for them. A codepage is a strict subset of the
possible characters, UCS is the entire set of possible
characters.

"[I]t was designed as a replacement for them" by combining
several of them into a master code page and removing
redundancies.  Functionally, they are the same and historically
they maintain the same layout in at least some cases.  To then
say, UCS has "nothing to do with code pages" is just dense.

I see you've abandoned without note your claim that phobos
doesn't convert UTF-8 to UTF-32 internally.  Perhaps
converting to UTF-32 is "as fast as any variable width
encoding is going to get" but my claim is that single-byte
encodings will be faster.

I haven't "abandoned my claim". It's a simple fact that phobos
does not convert UTF-8 strings to UTF-32 strings before it uses
them.

ie. the difference between this:

    import std.conv : to;
    string mystr = ...;
    dstring temp = mystr.to!dstring;
    for (size_t i = 0; i < temp.length; ++i)
        process(temp[i]);

and this:

    import std.utf : decode;
    string mystr = ...;
    size_t i = 0;
    while (i < mystr.length) {
        dchar current = decode(mystr, i);
        process(current);
    }

And if you can't see why the latter example is far more
efficient I give up...

I take your point that phobos is often decoding by char as it
iterates through, but there are still functions in std.string
that convert the entire string, as in your first example.  The
point is that you are forced to decode everything to UTF-32,
whether by char or the entire string.  Your latter example may be
marginally more efficient but it is only useful for functions
that start from the beginning and walk the string in only one
direction, which not all operations do.

- Multiple code pages per string
This just makes everything overly complicated and is far
slower to decode what the actual character is than UTF-8.

I disagree, this would still be far faster than UTF-8,

The cache misses alone caused by simply accessing the separate
header would hurt, whereas decoding a UTF-8 character takes a few
assembly instructions, has perfect locality and can be
efficiently pipelined by the CPU.

Lol, you think a few potential cache misses is going to be slower
than repeatedly decoding, whether in assembly and pipelined or
not, every single UTF-8 character? :D

Then there's all the extra processing involved combining the
headers when you concatenate strings. Plus you lose the one
benefit a fixed width encoding has because random access is no
longer possible without first finding out which header controls
the location you want to access.

There would be a few arithmetic operations on substring indices
when concatenating strings, hardly anything.

Random access is still not only possible, it is incredibly fast
in most cases: you just have to check first if the header lists
any two-byte encodings.  This can be done once and cached as a
property of the string (set a boolean no_two_byte_encoding once
and simply have the slice operator check it before going ahead),
just as you could add a property to UTF-8 strings to allow quick
random access if they happen to be pure ASCII.  The difference is
that only strings that include the two-byte encoded
Korean/Chinese/Japanese characters would require a bit more
calculation for slicing in my scheme, whereas _every_ non-ASCII
UTF-8 string requires full decoding to allow random access.  This
is a clear win for my single-byte encoding, though maybe not the
complete demolition of UTF-8 you were hoping for. ;)
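The check described above could be sketched like this (Python used for illustration; the class name `SingleByteString` and the `no_two_byte_encoding` field are invented here to mirror the poster's hypothetical scheme, not taken from any real implementation):

```python
# Hypothetical sketch of the proposed scheme: cache whether the string
# contains any two-byte characters, and slice in O(1) when it doesn't.
class SingleByteString:
    def __init__(self, data: bytes, two_byte_runs=()):
        self.data = data
        # Byte ranges encoded two-bytes-per-char (e.g. CJK runs).
        self.two_byte_runs = tuple(two_byte_runs)
        self.no_two_byte_encoding = not self.two_byte_runs

    def char_slice(self, start, stop):
        if self.no_two_byte_encoding:
            # One byte per character: slicing is plain indexing.
            return self.data[start:stop]
        # Otherwise fall back to run-aware index arithmetic (not shown).
        raise NotImplementedError("run-aware slicing")

s = SingleByteString(b"hello world")
assert s.char_slice(0, 5) == b"hello"
```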

No, it's not the same at all.  The contents of an
arbitrary-length file cannot be compressed to a single byte,
you would have collisions galore.  But since most non-english
alphabets are less than 256 characters, they can all be
uniquely encoded in a single byte per character, with the
header determining what language's code page to use.  I don't

you are compressing you have at the time of writing the
algorithm, the better compression ratio you can get, to the
point that if you know exactly what the file is going to
contain you can compress it to nothing. This is why you have
specialised compression algorithms for images, video, audio,
etc.

This may be mostly true in general, but your specific example of
compressing down to a byte is nonsense.  For any arbitrarily long
data, there are always limits to compression.  What any of this
has to do with my single-byte encoding, I have no idea.

It doesn't matter how few characters non-english alphabets have
- unless you know WHICH alphabet it is before-hand you can't
store it in a single byte. Since any given character could be
in any alphabet the best you can do is look at the
probabilities of different characters appearing and use shorter
representations for more common ones. (This is the basis for
all lossless compression) The english alphabet plus 0-9 and
basic punctuation are by far the most common characters used on
computers so it makes sense to use one byte for those and
multiple bytes for rarer characters.

How many times have I said that "you know WHICH alphabet it is
before-hand" because that info is stored in the header?  That is
why I specifically said, from my first post, that multi-language
strings would have more complex headers, which I later pointed
out could list all the different language substrings within a
multi-language string.  Your silly exposition of how compression
works makes me wonder if you understand anything about how a
single-byte encoding would work.

Perhaps it made sense to use one byte for ASCII characters and
relegate _every other language_ to multiple bytes two decades
ago.  It doesn't make sense today.

- As I've shown the only space-efficient way to do this is
using a variable length encoding like UTF-8

You haven't shown this.

If you had read my points about multiple code pages per string
you would see that I had.

You are not packaging and transmitting the code pages with the
string, just as you do not ship the entire UCS with every UTF-8
string.  A single-byte encoding is going to be more
space-efficient for the vast majority of strings, everybody knows
this.

No, it does a very bad job of this.  Every non-ASCII character
takes at least two bytes to encode, whereas my single-byte
encoding scheme would encode every alphabet with less than 256
characters in a single byte.

And strings with mixed characters would use lots of memory and
be extremely slow. Common when using proper names, quotes,
inline translations, graphical characters, etc. etc. Not to
mention the added complexity to actually implement the
algorithms.

Ah, you have finally stumbled across the path to a good argument,
though I'm not sure how, given your seeming ignorance of how
single-byte encodings work. :) There _is_ a degenerate case with
my particular single-byte encoding (not the ones you list, which
would still be faster and use less memory than UTF-8): strings
that use many, if not all, character sets.  So the worst case
scenario might be something like a string that had 100
characters, every one from a different language.  In that case, I
think it would still be smaller than the equivalent UTF-8 string,
but not by much.

There might be some complexity in implementing the algorithms,
but on net, likely less than UTF-8, while being much more usable
for most programmers.

On Saturday, 25 May 2013 at 22:41:59 UTC, Diggory wrote:
1) Take the byte at a particular offset in the string
2) If it is ASCII then we're done
3) Otherwise count the number of '1's at the start of the byte
- this is how many bytes make up the character (there's even an
ASM instruction to do this)
4) This first byte will look like '1110xxxx' for a 3 byte
character, '11110xxx' for a 4 byte character, etc.
5) All following bytes are of the form '10xxxxxx'
6) Now just concatenate all the 'x's together and add an offset
to get the code point
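The six steps above amount to the following (a Python sketch of the same algorithm; it omits validation of the continuation bytes, which a real decoder would perform):

```python
# A direct transcription of the six decoding steps: constant time per
# character, branch-light, sequential memory access only.
def decode_utf8_at(buf: bytes, offset: int):
    """Return (code point, number of bytes) at `offset`."""
    b0 = buf[offset]
    if b0 < 0x80:                        # step 2: ASCII fast path
        return b0, 1
    # Step 3: count the leading 1-bits to get the sequence length
    # (hardware offers CLZ-style instructions for this).
    nbytes = 8 - (~b0 & 0xFF).bit_length()
    # Step 4: mask off the length prefix of the first byte.
    codepoint = b0 & (0x7F >> nbytes)
    # Steps 5-6: each continuation byte is 10xxxxxx; shift in its
    # low six bits.
    for b in buf[offset + 1:offset + nbytes]:
        codepoint = (codepoint << 6) | (b & 0x3F)
    return codepoint, nbytes

assert decode_utf8_at("é".encode("utf-8"), 0) == (0xE9, 2)
assert decode_utf8_at("€".encode("utf-8"), 0) == (0x20AC, 3)
```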

Not sure why you chose to write this basic UTF-8 stuff out, other
than to bluster on without much use.

Note that this is CONSTANT TIME, O(1) with minimal branching so
well suited to pipelining (after the initial byte the other
bytes can all be processed in parallel by the CPU) and only
sequential memory access so no cache misses, and zero

It is constant time _per character_.  You have to do it for every
character you access, which is where the cost adds up.

1) Look up the offset in the header using binary search: O(log
N), lots of branching

depends on the number of languages used and how many substrings
there are.  There are worst-case scenarios that could approach
something like log(n) but extremely unlikely in real-world use.
Most of the time, this would be O(1).

2) Look up the code page ID in a massive array of code pages to
work out how many bytes per character

Hardly, this could be done by a simple lookup function that
simply checked if the language was one of the few alphabets that
require two bytes.

3) Hope this array hasn't been paged out and is still in the
cache
4) Extract that many bytes from the string and combine them
into a number

Lol, I love how you think this is worth listing as a separate
step for the few two-byte encodings, yet have no problem with
doing this for every non-ASCII character in UTF-8.

5) Look up this new number in yet another large array specific
to the code page

Why?  The language byte and number uniquely specify the
character, just like your Unicode code point above.  If you were
simply encoding the UCS in a single-byte encoding, you would
arrange your scheme in such a way to trivially be able to
generate the UCS code point using these two bytes.
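The kind of mapping being described might look like this (a purely hypothetical sketch: the code-page ids and the `BLOCK_BASE` table are invented for illustration, though the block offsets shown are real Unicode block starts):

```python
# Hypothetical: give each one-byte alphabet a base offset into UCS, so
# (language byte, character byte) -> code point is one table lookup
# plus an add, with no per-string decoding state.
BLOCK_BASE = {
    0x00: 0x0000,   # Basic Latin / ASCII
    0x01: 0x0370,   # Greek and Coptic block
    0x02: 0x0400,   # Cyrillic block
}

def to_codepoint(lang: int, ch: int) -> int:
    return BLOCK_BASE[lang] + ch

assert chr(to_codepoint(0x02, 0x10)) == "А"  # U+0410, Cyrillic capital A
```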

This is O(log N), has lots of branching so no pipelining (every
stage depends on the result of the stage before), lots of
random memory access so lots of cache misses, lots of
additional memory requirements to store all those tables, and
an algorithm that isn't even any easier to understand.

Wrong on practically every count, as detailed above.

Plus every other algorithm to operate on it except for decoding
is insanely complicated.

They are still much _less_ complicated than UTF-8, that's the
comparison that matters.

May 26 2013

Dmitry Olshansky <dmitry.olsh gmail.com>  writes:
25-May-2013 22:26, Joakim writes:
On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
25-May-2013 10:44, Joakim writes:
Yes, on the encoding, if it's a variable-length encoding like UTF-8, no,
on the code space.  I was originally going to title my post, "Why
Unicode?" but I have no real problem with UCS, which merely standardized
a bunch of pre-existing code pages.  Perhaps there are a lot of problems
with UCS also, I just haven't delved into it enough to know.

UCS is dead and gone. Next in line to "640K is enough for everyone".

I think you are confused.  UCS refers to the Universal Character Set,
which is the backbone of Unicode:

http://en.wikipedia.org/wiki/Universal_Character_Set

You might be thinking of the unpopular UCS-2 and UCS-4 encodings, which
I have never referred to.

Yeah got confused. So sorry about that.

Separate code spaces were the case before Unicode (and utf-8). The
problem is not only that without header text is meaningless (no easy
slicing) but the fact that encoding of data after header strongly
depends on a variety of factors - a list of encodings actually. Now
everybody has to keep a (code) page per language to at least know if
it's 2 bytes per char or 1 byte per char or whatever. And you still
work on a basis that there is no combining marks and regional specific
stuff :)

Everybody is still keeping code pages, UTF-8 hasn't changed that.

Legacy. Hard to switch overnight. There are graphs that indicate
that a few years from now you might never encounter a legacy
encoding anymore, only UTF-8/UTF-16.

I didn't mean that people are literally keeping code pages.  I meant
that there's not much of a difference between code pages with 2 bytes
per char and the language character sets in UCS.

You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I take you right you propose to define string as a header that
denotes a set of windows in code space? I still fail to see how that
would scale, see below.

It has to do that also. Everyone keeps talking about
"easy slicing" as though UTF-8 provides it, but it doesn't. Phobos
turns UTF-8 into UTF-32 internally for all that ease of use, at least
doubling your string size in the process.  Correct me if I'm wrong, that
was what I read on the newsgroup sometime back.

Indeed you are - searching for a UTF-8 substring in a UTF-8
string doesn't do any decoding, and it does return you a slice of
the balance of the original.

Perhaps substring search doesn't strictly require decoding but you have
changed the subject: slicing does require decoding and that's the use
case you brought up to begin with.  I haven't looked into it, but I
suspect substring search not requiring decoding is the exception for
UTF-8 algorithms, not the rule.

Mm... strictly speaking (let's turn that argument backwards) - what
are algorithms that require slicing say [5..$] of string without ever
looking at it left to right, searching etc.? ??? Simply makes no
sense.

There is no intersection between some legacy encodings as of now. Or
do you want to add N*(N-1) cross-encodings for any combination of 2?
What about 3 in one string?

I sketched two possible encodings above, none of which would require
"cross-encodings."

We want monoculture! That is to understand each other without all
these "par-le-vu-france?" and codepages of various
complexity(insanity).

I hate monoculture, but then I haven't had to decipher some
screwed-up codepage in the middle of the night. ;)

So you never had trouble with internationalization? What languages do
you use (read/speak/etc.)?

This was meant as a point in your favor, conceding that I haven't had
to code with the terrible code pages system from the past.  I can
read and speak multiple languages, but I don't use anything other
than English text.

Okay then.

That said, you could standardize on UCS for your code space without
using a bad encoding like UTF-8, as I said above.

UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into
that trap (Java, Windows NT). You shouldn't.

UCS, the character set, as noted above.  If that's a myth, Unicode is
a myth. :)

Yeah, that was a mishap on my behalf. I think I've seen your 2-byte
argument way too often and it got concatenated to UCS, forming UCS-2
:)

This is it, but it's far more flexible in the sense that it allows
multilingual strings just fine and lone full-width unicode codepoints
as well.

That's only because it uses a more complex header than a single byte
for the language, which I noted could be done with my scheme, by
adding a more complex header,

What would it look like? Or how will the processing go?

long before you mentioned this unicode compression scheme.

It does inline headers or rather tags. That hop between fixed char
windows. It's not random-access nor claims to be.
But I get the impression that it's only for sending over the wire,
i.e. transmission, so all the processing issues that UTF-8 introduces
would still be there.

Use mime-type etc. Standards are always a bit stringy and suboptimal,
their acceptance rate is one of the chief advantages they have.
Unicode has horrifically large momentum now and not a single
organization aside from them tries to do this dirty work (=i18n).

You misunderstand. I was saying that this unicode compression scheme
doesn't help you with string processing, it is only for transmission
and is probably fine for that, precisely because it seems to
implement some version of my single-byte encoding scheme!  You do
raise a good point: the only reason why we're likely using such a bad
encoding in UTF-8 is that nobody else wants to tackle this hairy
problem.

Yup, where have you been say almost 10 years ago? :)

Consider adding another encoding for "Tuva" for instance. Now you
have to add 2*n conversion routines to match it to other
codepages/locales.

Not sure what you're referring to here.

If you adopt the "map to UCS" policy then nothing. Beyond that -
there are many things to consider in internationalization and you
would have to special-case them all by codepage.

Not necessarily.  But that is actually one of the advantages of
single-byte encodings, as I have noted above.  toUpper is a NOP for a
single-byte encoding string with an Asian script, you can't do that
with a UTF-8 string.

But you have to check what encoding it's in, and given that not all
codepages are that simple to uppercase, some generic algorithm is
required.

If they're screwing up something so simple, imagine how much worse
everyone is screwing up something complex like UTF-8?

UTF-8 is pretty darn simple. BTW all it does is map [0..10FFFF] to a
sequence of octets. It does it pretty well and is compatible with
ASCII, even the little rant you posted acknowledged that. Now are you
against Unicode as a whole or what?
The BOM link I gave notes that UTF-8 isn't always ASCII-compatible.

There are two parts to Unicode.  I don't know enough about UCS, the
character set, ;) to be for it or against it, but I acknowledge that
a standardized character set may make sense.  I am dead set against
the UTF-8 variable-width encoding, for all the reasons listed above.

Okay, we are getting somewhere, now that I understand your position
and got myself confused midway there.

On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
25-May-2013 13:05, Joakim writes:

Nobody is talking about going back to code pages.  I'm talking about
going to single-byte encodings, which do not imply the problems that
you had with code pages way back when.

Problem is, what you outline is isomorphic with code pages. Hence the
grief of accumulated experience against them.

They may seem superficially similar but they're not.  For example,
from the beginning, I have suggested a more complex header that can
enable multi-language strings, as one possible solution.  I don't
think code pages provided that.

The problem is how would you define an uppercase algorithm for a
multilingual string with 3 distinct 256 codespaces (windows)? I bet
it won't be pretty.

Well, if somebody got a quest to redefine UTF-8 they *might* come up
with something that is a bit faster to decode but shares the same
properties. Hardly a life saver anyway.

Perhaps not, but I suspect programmers will flock to a constant-width
encoding that is much simpler and more efficient than UTF-8.
Programmer productivity is the biggest loss from the complexity of
UTF-8, as I've noted before.

I still don't see how your solution scales to beyond 256 different
codepoints per string (= multiple pages/parts of UCS ;) ).

--
Dmitry Olshansky

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I take you right you propose to define string as a header that
denotes a set of windows in code space? I still fail to see how that
would scale, see below.

Something like that.  For a multi-language string encoding, the
header would contain a single byte for every language used in the
string, along with multiple index bytes to signify the start and
finish of every run of single-language characters in the string.  So,
a list of languages and a list of pure single-language substrings.
This is just off the top of my head, I'm not suggesting it is
definitive.

Mm... strictly speaking (let's turn that argument backwards) - what
are algorithms that require slicing say [5..$] of string
without ever looking at it left to right, searching etc.?

Don't know, I was just pointing out that all the claims of easy
slicing with UTF-8 are wrong.  But a single-byte encoding would
be scanned much faster also, as I've noted above, no decoding
necessary and single bytes will always be faster than multiple
bytes, even without decoding.

What would it look like? Or how will the processing go?

Detailed a bit above.  As I mentioned earlier in this thread,
functions like toUpper would execute much faster because you
wouldn't have to scan substrings containing languages that don't
have uppercase, which you have to scan in UTF-8.

long before you mentioned this unicode compression
scheme.

It does inline headers or rather tags. That hop between fixed
char windows. It's not random-access nor claims to be.

I wasn't criticizing it, just saying that it seems to be
superficially similar to my scheme. :)

version of my single-byte encoding scheme!  You do raise a
good point:
the only reason why we're likely using such a bad encoding in
UTF-8 is
that nobody else wants to tackle this hairy problem.

Yup, where have you been say almost 10 years ago? :)

I was in grad school, avoiding writing my thesis. :) I'd never
have thought I'd be discussing Unicode today, didn't even know
what it was back then.

Not necessarily.  But that is actually one of the advantages of
single-byte encodings, as I have noted above.  toUpper is a
NOP for a
single-byte encoding string with an Asian script, you can't do
that with
a UTF-8 string.

But you have to check what encoding it's in, and given that not
all codepages are that simple to uppercase, some generic
algorithm is required.

You have to check the language, but my point is that you can look
at the header and know that toUpper has to do nothing for a
single-byte-encoded string of an Asian script which doesn't have
uppercase characters.  With UTF-8, you have to decode the entire
string to find that out.

They may seem superficially similar but they're not.  For
example, from
the beginning, I have suggested a more complex header that can
enable
multi-language strings, as one possible solution.  I don't
think code
pages provided that.

The problem is how would you define an uppercase algorithm for a
multilingual string with 3 distinct 256 codespaces (windows)? I
bet it won't be pretty.

How is it done now?  It isn't pretty with UTF-8 now either, as
some languages have uppercase characters and others don't.  The
version of toUpper for my encoding will be similar, but it will
do less work, because it doesn't have to be invoked for every
character in the string.

I still don't see how your solution scales to beyond 256
different codepoints per string (= multiple pages/parts of UCS
;) ).

I assume you're talking about Chinese, Korean, etc. alphabets?  I
mentioned those to Walter earlier, they would have a two-byte
encoding.  No way around that, but they would still be easier to
deal with than UTF-8, because of the header.

May 25 2013

Dmitry Olshansky <dmitry.olsh gmail.com>  writes:
25-May-2013 23:51, Joakim writes:
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I take you right you propose to define string as a header that
denotes a set of windows in code space? I still fail to see how that
would scale, see below.

Something like that.  For a multi-language string encoding, the header
would contain a single byte for every language used in the string, along
with multiple index bytes to signify the start and finish of every run
of single-language characters in the string. So, a list of languages and
a list of pure single-language substrings.  This is just off the top of
my head, I'm not suggesting it is definitive.
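Taken literally, the layout just described might be sketched as follows (entirely hypothetical; the `Run` structure, the language ids, and `HeaderString` are invented for illustration and are not an existing format):

```python
# One entry per single-language run: a 1-byte language id plus the
# start/end indices of that run in the payload.
from collections import namedtuple

Run = namedtuple("Run", "lang start end")

class HeaderString:
    def __init__(self, runs, payload: bytes):
        self.runs = list(runs)      # the "header"
        self.payload = payload      # one byte per char in each run

    def to_upper(self):
        # The claimed win: runs in caseless scripts are skipped
        # wholesale, with no per-character inspection.
        out = bytearray(self.payload)
        for r in self.runs:
            if r.lang == 0x00:      # hypothetical id for Latin
                out[r.start:r.end] = self.payload[r.start:r.end].upper()
        return HeaderString(self.runs, bytes(out))

# A Latin run followed by a run in a caseless script (id 0x01):
s = HeaderString([Run(0x00, 0, 5), Run(0x01, 5, 8)], b"hello\xb0\xb1\xb2")
assert s.to_upper().payload == b"HELLO\xb0\xb1\xb2"
```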

Runs away in horror :) It's a mess even before you've got to the
details.

Another point about using sometimes a 2-byte encoding - welcome to the
nice world of BigEndian/LittleEndian i.e. the very trap UTF-16 has
stepped into.

--
Dmitry Olshansky

May 25 2013

"Joakim" <joakim airpost.net>  writes:
On Saturday, 25 May 2013 at 19:58:25 UTC, Dmitry Olshansky wrote:
Runs away in horror :) It's a mess even before you've got to the
details.

Perhaps it's fatally flawed, but I don't see an argument for why,
so I'll assume you can't find such a flaw.  It is still _much
less_ messy than UTF-8, that is the critical distinction.

Another point about using sometimes a 2-byte encoding - welcome
to the nice world of BigEndian/LittleEndian i.e. the very trap
UTF-16 has stepped into.

I don't think this is a sizable obstacle.  It takes some
coordination, but it is a minor issue.

On Saturday, 25 May 2013 at 20:20:11 UTC, Juan Manuel Cabo wrote:
You obviously are not thinking it through. Such encoding would
have a O(n^2) complexity for appending a character/symbol in a
different language to the string, since you would have to
update the beginning of the string, and move the contents
forward to make room. Not to mention that it wouldn't be
backwards compatible with ascii routines, and the complexity of
such a header would have to be carried all the way to font
rendering routines in the OS.

non-font-related assertions have been addressed earlier.  I see
no reason why a single-byte encoding of UCS would have to be
carried to "font rendering routines" but UTF-8 wouldn't be.

Multiple languages/symbols in one string is a blessing of
modern humane computing. It is the norm more than the exception
in most of the world.

I disagree, but in any case, most of this thread refers to
multi-language strings.  The argument is about how best to encode
them.

On Saturday, 25 May 2013 at 20:47:25 UTC, Peter Alexander wrote:
On Saturday, 25 May 2013 at 14:58:02 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander
wrote:
I suggest you read up on UTF-8. You really don't understand
it. There is no need to decode, you just treat the UTF-8
string as if it is an ASCII string.

Not being aware of this shortcut doesn't mean not
understanding UTF-8.

It's not just a shortcut, it is absolutely fundamental to the
design of UTF-8. It's like saying you understand Lisp without
being aware that everything is a list.

It is an accidental shortcut because of the encoding scheme
chosen for UTF-8 and, as I've noted, still less efficient than
similarly searching a single-byte encoding.  The fact that you
keep trumpeting this silly detail as somehow "fundamental"
suggests you have no idea what you're talking about.

Also, you continuously keep stating disadvantages to UTF-8 that
are completely false, like "slicing does require decoding".
Again, completely missing the point of UTF-8. I cannot conceive
how you can claim to understand how UTF-8 works yet repeatedly
demonstrating that you do not.

Slicing on code points requires decoding, I'm not sure how you
don't know that.  If you mean slicing by byte, that is not only
useless, but _every_ encoding can do that.  I cannot conceive how
you claim to defend UTF-8, yet keep making such stupid points,
that you don't even bother backing up.

You are either ignorant or a successful troll. In either case,
I'm done here.

Must be nice to just insult someone who has demolished your
arguments and leave.  Good riddance, you weren't adding anything.

May 26 2013

"Juan Manuel Cabo" <juanmanuel.cabo gmail.com>  writes:
On Saturday, 25 May 2013 at 19:51:43 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky
wrote:
You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I take you right you propose to define string as a header
that denotes a set of windows in code space? I still fail to
see how that would scale, see below.

Something like that.  For a multi-language string encoding, the
header would contain a single byte for every language used in
the string, along with multiple index bytes to signify the
start and finish of every run of single-language characters in
the string.  So, a list of languages and a list of pure
single-language substrings.  This is just off the top of my
head, I'm not suggesting it is definitive.

You obviously are not thinking it through. Such encoding would
have a O(n^2) complexity for appending a character/symbol in a
different language to the string, since you would have to update
the beginning of the string, and move the contents forward to
make room. Not to mention that it wouldn't be backwards
compatible with ascii routines, and the complexity of such a
header would have to be carried all the way to font rendering
routines in the OS.

Multiple languages/symbols in one string is a blessing of modern
humane computing. It is the norm more than the exception in most
of the world.

--jm

May 25 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx>  writes:
On Sat, May 25, 2013 at 09:51:42PM +0200, Joakim wrote:
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
If I take you right you propose to define string as a header that
denotes a set of windows in code space? I still fail to see how
that would scale, see below.

Something like that.  For a multi-language string encoding, the
header would contain a single byte for every language used in the
string, along with multiple index bytes to signify the start and
finish of every run of single-language characters in the string.
So, a list of languages and a list of pure single-language
substrings.  This is just off the top of my head, I'm not suggesting
it is definitive.

[...]

And just how exactly does that help with slicing? If anything, it makes
slicing way hairier and error-prone than UTF-8. In fact, this one point
alone already defeated any performance gains you may have had with a
single-byte encoding. Now you can't do *any* slicing at all without
convoluted algorithms to determine what encoding is where at the
endpoints of your slice, and the resulting slice must have new headers
to indicate the start/end of every different-language substring. By the
time you're done with all that, you're going way slower than processing
UTF-8.

Again I say, I'm not 100% sold on UTF-8, but what you're proposing here
is far worse.

T

--
The best compiler is between your ears. -- Michael Abrash

May 25 2013

"Joakim" <joakim airpost.net>  writes:
For some reason this posting by H. S. Teoh shows up on the
mailing list but not on the forum.

On Sat May 25 13:42:10 PDT 2013, H. S. Teoh wrote:
On Sat, May 25, 2013 at 10:07:41AM +0200, Joakim wrote:
The vast majority of non-English alphabets in UCS can be encoded in a
single byte.  It is your exceptions that are not relevant.

I'll have you know that Chinese, Korean, and Japanese account for a
significant percentage of the world's population, and therefore
arguments about "vast majority" are kinda missing the forest for the
trees. If you count the number of *alphabets* that can be encoded in a
single byte, you can get a majority, but that in no way reflects
actual usage.

Not just "a majority," the vast majority of alphabets, representing
85% of the world's population.

The only alternatives to a variable-width encoding I can see are:
- Single code page per string
This is completely useless because now you can't concatenate strings
of different code pages.

I wouldn't be so fast to ditch this.  There is a real argument to be
made that strings of different languages are sufficiently different
that there should be no multi-language strings.  Is this the best
route?  I'm not sure, but I certainly wouldn't dismiss it out of hand.

This is so patently absurd I don't even know how to begin. Have you
actually dealt with any significant amount of text at all? A large
amount of text in today's digital world is at least bilingual, if not
more. Even in pure English text, you occasionally need a foreign
letter in order to transcribe a borrowed/quoted word, e.g., "cliché",
"naïve", etc.. Under your scheme, it would be impossible to encode any
text that contains even a single instance of such words. All it takes
is *one* word in a 500-page text and your scheme breaks down, and
we're back to the bad ole days of codepages. And yes you can say "well
just include é and ï in the English code page". But then all it takes
is a single math formula that requires a Greek letter, and your text
is non-encodable anymore. By the time you pull in all the French,
German, Greek letters and math symbols, you might as well just go back
to UTF-8.

I think you misunderstand what this implies.  I mentioned it earlier
as another possibility to Walter, "keep all your strings in a single
language, with a different format to compose them together."  Nobody
is talking about disallowing alphabets other than English or going
back to code pages.  The fundamental question is whether it makes
sense to combine all these different alphabets and their idiosyncratic
rules into a single string and encoding.

There is a good argument to be made that the differences outweigh the
similarities and you'd be better off keeping each language/alphabet in
its own string.  It's a question of modeling, just like a class
hierarchy.  As I said, I'm not sure this is the best route, but it has
some real strengths.

The alternative is to have embedded escape sequences for the rare
foreign letter/word that you might need, but then you're back to being
unable to slice the string at will, since slicing it at the wrong
place will produce gibberish.

No one has presented this as a viable option.

I'm not saying UTF-8 (or UTF-16, etc.) is a panacea -- there are
things about it that are annoying, but it's certainly better than the
scheme you're proposing.

I disagree.

On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
And just how exactly does that help with slicing? If anything, it
makes slicing way hairier and error-prone than UTF-8. In fact, this
one point alone already defeated any performance gains you may have
had with a single-byte encoding. Now you can't do *any* slicing at
all without convoluted algorithms to determine what encoding is where
at the endpoints of your slice, and the resulting slice must have new
headers to indicate the start/end of every different-language
substring. By the time you're done with all that, you're going way
slower than processing UTF-8.

There are no convoluted algorithms, it's a simple check whether the
string contains any two-byte encodings, a check which can be done
once and cached.  If it's single-byte all the way through, no
problems whatsoever with slicing.  If there are two-byte
languages included, the slice function will have to do a little
arithmetic calculation before slicing.  You will also need a few
arithmetic ops to create the new header for the slice.  The point
is that these operations will be much faster than decoding every
code point to slice UTF-8.
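For comparison, slicing UTF-8 does not actually require decoding every code point either. UTF-8 is self-synchronizing: continuation bytes are exactly those matching the bit pattern 10xxxxxx, so snapping an arbitrary byte index back to the start of a code point inspects at most three bytes. A minimal C sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Move a byte index back to the nearest code-point boundary at or
 * before it. In well-formed UTF-8, every byte of a multi-byte sequence
 * except the first matches 10xxxxxx, so the loop runs at most 3 times. */
static size_t snap_to_boundary(const unsigned char *s, size_t i) {
    while (i > 0 && (s[i] & 0xC0) == 0x80)  /* continuation byte? */
        i--;
    return i;
}
```

For example, in the bytes of "aéb" (where é is the two-byte sequence C3 A9), snapping index 2 lands on index 1, the start of the é, without scanning the string from the beginning.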

Again I say, I'm not 100% sold on UTF-8, but what you're
proposing here
is far worse.

Well, I'm glad you realize some problems with UTF-8, :) even if
you dismiss my alternative out of hand.

May 26 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx>  writes:
On Sun, May 26, 2013 at 11:59:19AM +0200, Joakim wrote:
On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
And just how exactly does that help with slicing? If anything, it
makes slicing way hairier and error-prone than UTF-8. In fact, this
one point alone already defeated any performance gains you may have
had with a single-byte encoding. Now you can't do *any* slicing at
all without convoluted algorithms to determine what encoding is where
at the endpoints of your slice, and the resulting slice must have new
headers to indicate the start/end of every different-language
substring.  By the time you're done with all that, you're going way
slower than processing UTF-8.

There are no convoluted algorithms, it's a simple check whether the
string contains any two-byte encodings, a check which can be done
once and cached.

IHBT. You said that to handle multilanguage strings, your header would
have a list of starting/ending points indicating which encoding should
be used for which substring(s). That has nothing to do with two-byte
encodings. So, please show us the code: given a string containing, say,
English and French substrings, what will the header look like? And
what's the algorithm to take a slice of such a string?

If it's single-byte all the way through, no problems whatsoever with
slicing.

Huh?! How are there no problems with slicing? Let's say you have a
string that contains both English and French. According to your scheme,
you'll have some kind of header format that lets you say bytes 0-123 are
English, bytes 124-129 are French, and bytes 130-200 are English. Now
let's say I want a substring from 120 to 125. How would this be done?
And what about if I want a substring from 120 to 140? Or 126 to 130?
What if the string contains several runs of French?
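To make the question concrete, here is a hedged C sketch of the bookkeeping such a slice would need (the run layout is my own assumption; nothing this specific was proposed in the thread): the slice cannot simply reuse the source header, it must rebuild its own run table by clipping every run that overlaps the requested byte range.

```c
#include <assert.h>
#include <stddef.h>

/* One run of single-language text: a language id plus a byte length.
 * Hypothetical layout, for illustration only. */
typedef struct { unsigned char lang; size_t len; } Run;

/* Build the run table of the slice [start, end) by clipping each
 * overlapping run of the source string. Returns the number of runs
 * written to out (which must have room for nruns entries). */
static size_t slice_runs(const Run *runs, size_t nruns,
                         size_t start, size_t end, Run *out) {
    size_t pos = 0, nout = 0;
    for (size_t i = 0; i < nruns && pos < end; i++) {
        size_t rs = pos, re = pos + runs[i].len;
        size_t s = rs > start ? rs : start;   /* clip to slice start */
        size_t e = re < end ? re : end;       /* clip to slice end   */
        if (s < e)
            out[nout++] = (Run){ runs[i].lang, e - s };
        pos = re;
    }
    return nout;
}
```

For the example above (English bytes 0-123, French bytes 124-129, English bytes 130-200), the slice [120, 125) yields two runs: 4 English bytes and 1 French byte, and a fresh header must be built to describe them.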

If there are two-byte languages included, the slice function will have
to do a little arithmetic calculation before slicing.  You will also
need a few arithmetic ops to create the new header for the slice.  The
point is that these operations will be much faster than decoding every
code point to slice UTF-8.

You haven't proven that this "little arithmetic calculation" will be
faster than manipulating UTF-8. What if I have an English text that
contains quotations of Chinese, French, and Greek snippets? Math
symbols?  Please show us (1) how such a string should be encoded under
your scheme, and (2) the code will slice such a string in an efficient
way, according to your proposed encoding scheme.

(And before you dismiss such a string as unlikely or write it off as
rare, consider a technical math paper that cites the work of Chinese and
French authors -- a rather common thing these days. You'd need the extra
characters just to be able to cite their names, even if none of the
actual Chinese or French is quoted verbatim. Greek in general is used
all over math anyway, since for whatever reason mathematicians just love
Greek symbols, so it pretty much needs to be included by default.)

Again I say, I'm not 100% sold on UTF-8, but what you're proposing
here is far worse.

Well, I'm glad you realize some problems with UTF-8, :) even if you
dismiss my alternative out of hand.

Clearly, we're not seeing what you're seeing here. So instead of
making claims, you might want to show us the actual code. So far, I
haven't seen anything that
convinces me that your scheme is any better.  In fact, from what I can
see, it's a lot worse, and you're just evading pointed questions about
how to address those problems.  Maybe that's a wrong perception, but not
having any actual code to look at, I'm having a hard time believing your
claims. Right now I'm leaning towards agreeing with Walter that you're
just trolling us (and rather successfully at that).

So, please show us the code. Otherwise, I think I should just stop
responding, as we're obviously not on the same page and this discussion
isn't getting anywhere.

T

--
Some ideas are so stupid that only intellectuals could believe them. -- George
Orwell

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 14:37:27 UTC, H. S. Teoh wrote:
IHBT. You said that to handle multilanguage strings, your header would
have a list of starting/ending points indicating which encoding should
be used for which substring(s).

Pretty funny how you claim you've been trolled and then go on to
make a bunch of trolling arguments, which seem to imply you have
no idea how a single-byte encoding works.  I'm not going to
bother explaining it to you, anyone who knows encodings can
easily figure it out from what I've said so far.

Clearly, we're not seeing what you're seeing here. So instead of
making claims, you might want to show us the actual code. So far, I
haven't seen anything that convinces me that your scheme is any
better. In fact, from what I can see, it's a lot worse, and you're
just evading pointed questions about how to address those problems.
Maybe that's a wrong perception, but not having any actual code to
look at, I'm having a hard time believing your claims. Right now I'm
leaning towards agreeing with Walter that you're just trolling us (and
rather successfully at that).

If you think those are trolling arguments, that's you not
understanding what they're saying.  I have given my reasons for
considering the UTF-8 encoding worse.  If you can't understand my
arguments, you need to go out and learn some more about these issues.

So, please show us the code. Otherwise, I think I should just
stop
responding, as we're obviously not on the same page and this
discussion
isn't getting anywhere.

I've made my position clear: I don't write toy code.  It will
take too long for the kind of encoding I have in mind, so it
isn't worth my time, and if you can't understand the higher-level
technical language I'm using in these posts, you won't understand
the code anyway.  I have adequately sketched what I'd do, so that
anyone proficient in the art can reason about what the
consequences of such a scheme would be.  Perhaps that doesn't
include Walter and you.

I don't know why you'd want to keep responding to someone you
think is trolling you anyway.

May 26 2013

"Vladimir Panteleev"  writes:
On Sunday, 26 May 2013 at 15:23:33 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 14:37:27 UTC, H. S. Teoh wrote:
IHBT.

I've made my position clear: I don't write toy code.

1. Make extraordinary claims
2. Refuse to back up said claims with small examples because "I
don't write toy code"
3. Refuse to back up said claims with elaborate examples because
"It will
take too long"
4. Use arrogant tone throughout thread, imply that you're smarter
than the creators of UTF, and creators and long-time contributors
of D (never contribute code to D yourself)

Conclusion: Successful troll is successful :)

May 26 2013

Dmitry Olshansky <dmitry.olsh gmail.com>  writes:
On Sunday, 26 May 2013 at 15:23:33 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 14:37:27 UTC, H. S. Teoh wrote:
IHBT.

I've made my position clear: I don't write toy code.

1. Make extraordinary claims
2. Refuse to back up said claims with small examples because "I don't
write toy code"
3. Refuse to back up said claims with elaborate examples because "It will
take too long"
4. Use arrogant tone throughout thread, imply that you're smarter than
the creators of UTF, and creators and long-time contributors of D (never
contribute code to D yourself)

Conclusion: Successful troll is successful :)

+1

--
Dmitry Olshansky

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 16:54:53 UTC, Vladimir Panteleev wrote:
1. Make extraordinary claims

What is extraordinary about "UTF-8 is shit?"  It is obviously so.

2. Refuse to back up said claims with small examples because "I
don't write toy code"

I never refused small examples.  I have provided several analyses
of how a single-byte encoding would compare to UTF-8, along with
listing optimizations that make it much faster.  I finally
refused to analyze Teoh's examples because he accused me of
trolling and demanded code as the only possible explanation.

3. Refuse to back up said claims with elaborate examples
because "It will
take too long"

You are confused.  What I said is "I don't write toy code,
non-toy code would take too long, and you wouldn't understand it
anyway."

The whole demand for code is idiotic anyway.

If I outlined TCP/IP as a packet-switched network and briefly
sketched what the header might look like and the queuing
algorithms that I might use, I can just imagine you saying, "But
there's no code... how can I possibly understand what you're
saying without any code?"  If you can't understand networking
without seeing working code, you're not equipped to understand it
anyway, same here.

4. Use arrogant tone throughout thread, imply that you're
smarter than the creators of UTF, and creators and long-time
contributors of D (never contribute code to D yourself)

Hey, if the shoe fits. :)

I actually had a lot of respect for Walter till I read this
thread.  I can only assume that his past experience with code
pages was so maddening that he cannot be rational on the subject
of going to any single-byte encoding that would be similar, same
with others griping about code pages above.  I also don't think
he and others are paying much attention to the various points I'm
raising, hence his recent claim that I wouldn't handle Chinese,
when I addressed that from the beginning.

Or it could just be that I'm much smarter than everybody else in
this thread, ;) I can't rule it out given the often silly
responses I've been getting.

Conclusion: Successful troll is successful :)

Conclusion: Vladimir trolls me because he doesn't understand what
I'm talking about, which is why he doesn't raise a single
technical point in this post.

May 26 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>  writes:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously so.

Congratulations, you are literally the only person on the Internet who
said so: http://goo.gl/TFhUO

On 5/26/13 1:45 PM, Joakim wrote:
Or it could just be that I'm much smarter than everybody else in this
thread, ;) I can't rule it out given the often silly responses I've been
getting.

It is rare indeed that most everybody in this forum raises like one to
the same opinion. Usually
it's like whatever the topic, a debate will ensue between two ad-hoc groups.

It has become clear that people involved in this have gotten too
frustrated to have a constructive exchange. I suggest we collectively
drop it. What you may want to do is to use D's modeling abilities to
define a great string type pursuant to your ideas. If it is as good as
you believe it could, then it will enjoy use and adoption and everybody
will be better off.

Andrei

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously
so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at least
8 results.  How many people even know how UTF-8 works?  Given how
few people use it, I'm not surprised most don't know enough about
how it works to criticize it.

On 5/26/13 1:45 PM, Joakim wrote:
Or it could just be that I'm much smarter than everybody else
in this
thread, ;) I can't rule it out given the often silly responses
I've been
getting.

most everybody in this forum raises like one to the same opinion.
Usually it's like whatever the topic, a debate will ensue between two
ad-hoc groups.

I suspect it's because I'm presenting an original idea about a
not well-understood technology, Unicode, not the usual "emacs vs
vim" or "D should not have null references" argument.  For
example, how many here know what UCS is?  Most people never dig
into Unicode, it's just a black box that is annoying to deal with.

It has become clear that people involved in this have gotten
too frustrated to have a constructive exchange. I suggest we
collectively drop it. What you may want to do is to use D's
modeling abilities to define a great string type pursuant to
your ideas. If it is as good as you believe it could, then it
will enjoy use and adoption and everybody will be better off.

I agree.  I am enjoying your book, btw.

May 26 2013

"Mr. Anonymous" <mailnew4ster gmail.com>  writes:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu
wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously
so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at
least 8 results.  How many people even know how UTF-8 works?
Given how few people use it, I'm not surprised most don't know
enough about how it works to criticize it.

On the other hand:

:D

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu
wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously
so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at
least 8 results.  How many people even know how UTF-8 works?
Given how few people use it, I'm not surprised most don't know
enough about how it works to criticize it.

On the other hand:

I'm not sure if you were trying to make my point, but you just
did.  There are only 19 results for that search string.  If UTF-8
were such a rousing success and most developers found it easy to
understand, you wouldn't expect only 19 results for it and 8
against it.  The paucity of results suggests most don't know how
it works, or are perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

May 26 2013

"Mr. Anonymous" <mailnew4ster gmail.com>  writes:
On Sunday, 26 May 2013 at 19:25:37 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu
wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is
obviously so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at
least 8 results.  How many people even know how UTF-8 works?
Given how few people use it, I'm not surprised most don't
know enough about how it works to criticize it.

On the other hand:

I'm not sure if you were trying to make my point, but you just
did.  There are only 19 results for that search string.  If
UTF-8 were such a rousing success and most developers found it
easy to understand, you wouldn't expect only 19 results for it
and 8 against it.  The paucity of results suggests most don't
know how it works, or are perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

Man, you're a bullshit machine!

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 19:38:21 UTC, Mr. Anonymous wrote:
On Sunday, 26 May 2013 at 19:25:37 UTC, Joakim wrote:
I'm not sure if you were trying to make my point, but you just
did.  There are only 19 results for that search string.  If
UTF-8 were such a rousing success and most developers found it
easy to understand, you wouldn't expect only 19 results for it
and 8 against it.  The paucity of results suggests most don't
know how it works, or are perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

Man, you're a bullshit machine!

What can I say?  I'm very good at interpreting bad data. ;)

May 26 2013

Marco Leise <Marco.Leise gmx.de>  writes:
Am Sun, 26 May 2013 21:25:36 +0200
schrieb "Joakim" <joakim airpost.net>:

On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu
wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously
so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at
least 8 results.  How many people even know how UTF-8 works?
Given how few people use it, I'm not surprised most don't know
enough about how it works to criticize it.

On the other hand:

I'm not sure if you were trying to make my point, but you just
did.  There are only 19 results for that search string.  If UTF-8
were such a rousing success and most developers found it easy to
understand, you wouldn't expect only 19 results for it and 8
against it.  The paucity of results suggests most don't know how
it works, or are perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

--
Marco

May 29 2013

"Joakim" <joakim airpost.net>  writes:
On Wednesday, 29 May 2013 at 23:40:51 UTC, Marco Leise wrote:
Am Sun, 26 May 2013 21:25:36 +0200
schrieb "Joakim" <joakim airpost.net>:

On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu
wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is
obviously so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at
least 8 results.  How many people even know how UTF-8
works?  Given how few people use it, I'm not surprised most
don't know enough about how it works to criticize it.

On the other hand:

I'm not sure if you were trying to make my point, but you just
did.  There are only 19 results for that search string.  If
UTF-8 were such a rousing success and most developers found it
easy to understand, you wouldn't expect only 19 results for it
and 8 against it.  The paucity of results suggests most don't
know how it works, or are perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

Your point is?  121 results, including false positives like
"utf-8 is the best guess."  If you look at the results, almost
all make the pragmatic recommendation that UTF-8 is the best _for
now_, because it is better supported than other multi-language
formats.  That's like saying Windows is the best OS because it's
easier to find one in your local computer store.

Yet again, the fact that even this somewhat ambiguous search
string has only 121 results is damning of anyone liking UTF-8,
nothing else, given the many thousands of programmers that are
forced to use Unicode if they want to internationalize.

May 30 2013

Marco Leise <Marco.Leise gmx.de>  writes:
Am Thu, 30 May 2013 09:19:32 +0200
schrieb "Joakim" <joakim airpost.net>:

Your point is?  121 results, including false positives like
"utf-8 is the best guess."  If you look at the results, almost
all make the pragmatic recommendation that UTF-8 is the best _for
now_, because it is better supported than other multi-language
formats.  That's like saying Windows is the best OS because it's
easier to find one in your local computer store.

Yet again, the fact that even this somewhat ambiguous search
string has only 121 results is damning of anyone liking UTF-8,
nothing else, given the many thousands of programmers that are
forced to use Unicode if they want to internationalize.

Alright, for me it said ~6.570.000 results, which I found
funny. I'm not trying to make a point, but to troll. If there
is a point to be made, then that the count of search results
is a _very_ rough estimate.

--
Marco

May 30 2013

Marcin Mstowski <marmyst gmail.com>  writes:
Character Data Representation Architecture
<http://www-01.ibm.com/software/globalization/cdra/> by IBM. It is
what you want to do, with additions, and it has been available since
1995. When you come up with an inventive idea, I suggest you first
check what was already done in that area and then rethink this again
to check if you can do it better or improve the existing solution.
Other approaches are usually a waste of time and effort, unless you
are doing this for fun or you can't use existing solutions due to
problems with license, price, etc.

On Sun, May 26, 2013 at 9:05 PM, Joakim <joakim airpost.net> wrote:

On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu wrote:

On 5/26/13 1:45 PM, Joakim wrote:

What is extraordinary about "UTF-8 is shit?" It is obviously so.

Congratulations, you are literally the only person on the Internet who
said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at least 8
results.  How many people even know how UTF-8 works?  Given how few people
use it, I'm not surprised most don't know enough about how it works to
criticize it.

On 5/26/13 1:45 PM, Joakim wrote:
Or it could just be that I'm much smarter than everybody else in this
thread, ;) I can't rule it out given the often silly responses I've been
getting.

most everybody in this forum raises like one to the same opinion. Usually it's
like whatever the topic, a debate will ensue between two ad-hoc groups.

I suspect it's because I'm presenting an original idea about a not
well-understood technology, Unicode, not the usual "emacs vs vim" or "D
should not have null references" argument.  For example, how many here know
what UCS is?  Most people never dig into Unicode, it's just a black box
that is annoying to deal with.

It has become clear that people involved in this have gotten too
frustrated to have a constructive exchange. I suggest we collectively drop
it. What you may want to do is to use D's modeling abilities to define a
great string type pursuant to your ideas. If it is as good as you believe
it could, then it will enjoy use and adoption and everybody will be better
off.

I agree.  I am enjoying your book, btw.

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 19:20:15 UTC, Marcin Mstowski wrote:
Character Data Representation Architecture
<http://www-01.ibm.com/software/globalization/cdra/> by IBM. It is
what you want to do, with additions, and it has been available since
1995. When you come up with an inventive idea, I suggest you first
check what was already done in that area and then rethink this again
to check if you can do it better or improve the existing solution.
Other approaches are usually a waste of time and effort, unless you
are doing this for fun or you can't use existing solutions due to
problems with license, price, etc.

You might be right, but I gave it a quick look and can't make out
what the encoding actually is.  There is an appendix that lists
several possible encodings, including UTF-8!

Also, one of the first pages talks about representations of
floating point and integer numbers, which are outside the purview
of the text encodings we're talking about.  I cannot possibly be
expected to know about every dead format out there.  If you can
show that it is materially similar to my single-byte encoding
idea, it might be worth looking into.

May 26 2013

Marcin Mstowski <marmyst gmail.com>  writes:
On Sun, May 26, 2013 at 9:42 PM, Joakim <joakim airpost.net> wrote:

On Sunday, 26 May 2013 at 19:20:15 UTC, Marcin Mstowski wrote:

Character Data Representation Architecture
<http://www-01.ibm.com/software/globalization/cdra/> by IBM. It is
what you want to do, with additions, and it has been available since
1995. When you come up with an inventive idea, I suggest you first
check what was already done in that area and then rethink this again
to check if you can do it better or improve the existing solution.
Other approaches are usually a waste of time and effort, unless you
are doing this for fun or you can't use existing solutions due to
problems with license, price, etc.

You might be right, but I gave it a quick look and can't make out what the
encoding actually is.  There is an appendix that lists several possible
encodings, including UTF-8!

Yes, because they didn't reinvent the wheel from scratch and are
reusing existing encodings as a base. There isn't any problem with
adding another code page.

Also, one of the first pages talks about representations of floating point
and integer numbers, which are outside the purview of the text encodings

They are outside the scope of CDRA too. At least read the picture
description before making out-of-context assumptions.

I cannot possibly be expected to know about every dead format out there.

Nobody expect that.

If you can show that it is materially similar to my single-byte encoding
idea, it might be worth looking into.

Spending ~15 minutes reading the Introduction isn't worth your time,
so why should I waste my time showing you anything?

May 26 2013

"Joakim" <joakim airpost.net>  writes:
On Sunday, 26 May 2013 at 21:08:40 UTC, Marcin Mstowski wrote:
On Sun, May 26, 2013 at 9:42 PM, Joakim <joakim airpost.net>
wrote:
Also, one of the first pages talks about representations of
floating point
and integer numbers, which are outside the purview of the text
encodings

They are outside the scope of CDRA too. At least read the picture
description before making out-of-context assumptions.

Which picture description did you have in mind?  They all seem
fairly generic.  I do see now that one paragraph does say that
CDRA only deals with graphical characters and that they were only
talking about numbers earlier to introduce the topic of data
representation.

If you can show that it is materially similar to my
single-byte encoding
idea, it might be worth looking into.

Spending ~15 minutes reading the Introduction isn't worth your time,
so why should I waste my time showing you anything?

You claimed that my encoding was reinventing the wheel, therefore
the onus is on you to show which of the multiple encodings CDRA
uses that I'm reinventing.  I'm not interested in delving into
the docs for some dead IBM format to prove _your_ point.  More
likely, you are just dead wrong and CDRA simply uses code pages,
which are not the same as the single-byte encoding with a header
idea that I've sketched in this thread.

May 26 2013

"John Colvin" <john.loughran.colvin gmail.com>  writes:
On Monday, 27 May 2013 at 06:11:20 UTC, Joakim wrote:
You claimed that my encoding was reinventing the wheel,
therefore the onus is on you to show which of the multiple
encodings CDRA uses that I'm reinventing.  I'm not interested
in delving into the docs for some dead IBM format to prove
_your_ point.

It's your idea and project. Showing that it is original / doing
your research on previous efforts is probably something that
*you* should do, whether or not it's someone else's "point".

More likely, you are just dead wrong and CDRA simply uses code
pages

Based on what?

May 27 2013

"Joakim" <joakim airpost.net>  writes:
On Monday, 27 May 2013 at 12:25:06 UTC, John Colvin wrote:
On Monday, 27 May 2013 at 06:11:20 UTC, Joakim wrote:
You claimed that my encoding was reinventing the wheel,
therefore the onus is on you to show which of the multiple
encodings CDRA uses that I'm reinventing.  I'm not interested
in delving into the docs for some dead IBM format to prove
_your_ point.

It's your idea and project. Showing that it is original / doing
your research on previous efforts is probably something that
*you* should do, whether or not it's someone else's "point".

Sure, some research is necessary.  However, software is littered
with past projects that never really got started or bureaucratic
efforts, like CDRA appears to be, that never went anywhere.  I
can hardly be expected to go rummaging through all these efforts
in the hopes that what, someone else has already written the
code?  If you have a brain, you can look at the currently popular
approaches, which CDRA isn't, and come up with something that
makes more sense.  I don't much care if my idea is original, I
care that it is better.

More likely, you are just dead wrong and CDRA simply uses code
pages

Based on what?

Based on the fact that his link lists EBCDIC and several other
antiquated code page encodings in its list of proposed encodings.
If Marcin believes one of those is similar to my scheme, he
should say which one, otherwise his entire line of argument is
irrelevant.  It's not up to me to prove _his_ point.

Without having looked at any of the encodings in detail, I'm fairly
certain he's wrong.  If he feels otherwise, he can pipe up with
which one he had in mind.  The fact that he hasn't speaks volumes.

May 27 2013

Walter Bright <newshound2 digitalmars.com>  writes:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would
contain a single byte for every language used in the string, along with
multiple
index bytes to signify the start and finish of every run of single-language
characters in the string. So, a list of languages and a list of pure
single-language substrings.

Please implement the simple C function strstr() with this simple scheme, and
post it here.

http://www.digitalmars.com/rtl/string.html#strstr

May 25 2013

Walter Bright <newshound2 digitalmars.com>  writes:
On 5/25/2013 2:51 PM, Walter Bright wrote:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would
contain a single byte for every language used in the string, along with
multiple
index bytes to signify the start and finish of every run of single-language
characters in the string. So, a list of languages and a list of pure
single-language substrings.

Please implement the simple C function strstr() with this simple scheme, and
post it here.

http://www.digitalmars.com/rtl/string.html#strstr

I'll go first. Here's a simple UTF-8 version in C. It's not the fastest way to
do it, but at least it is correct:
----------------------------------
char *strstr(const char *s1, const char *s2) {
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *) s1;
    char c2 = *s2;
    while (len2 <= len1) {
        if (c2 == *s1)
            if (memcmp(s2, s1, len2) == 0)
                return (char *) s1;
        s1++;
        len1--;
    }
    return NULL;
}

May 25 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx>  writes:
On Sat, May 25, 2013 at 10:07:41AM +0200, Joakim wrote:
[...]
The vast majority of non-english alphabets in UCS can be encoded in
a single byte.  It is your exceptions that are not relevant.

I'll have you know that Chinese, Korean, and Japanese account for a
significant percentage of the world's population, and therefore
arguments about "vast majority" are kinda missing the forest for the
trees. If you count the number of *alphabets* that can be encoded in a
single byte, you can get a majority, but that in no way reflects actual
usage.

[...]
The only alternatives to a variable width encoding I can see are:
- Single code page per string
This is completely useless because now you can't concatenate
strings of different code pages.

I wouldn't be so fast to ditch this.  There is a real argument to be
made that strings of different languages are sufficiently different
that there should be no multi-language strings.  Is this the best
route?  I'm not sure, but I certainly wouldn't dismiss it out of hand.

This is so patently absurd I don't even know how to begin to answer...
have you actually dealt with any significant amount of text at all? A
large amount of text in today's digital world is at least bilingual, if
not more. Even in pure English text, you occasionally need a foreign
letter in order to transcribe a borrowed/quoted word, e.g., "cliché",
"naïve", etc.. Under your scheme, it would be impossible to encode any
text that contains even a single instance of such words. All it takes is
*one* word in a 500-page text and your scheme breaks down, and we're
back to the bad ole days of codepages. And yes you can say "well just
include é and ï in the English code page". But then all it takes is a
single math formula that requires a Greek letter, and your text is
non-encodable anymore. By the time you pull in all the French, German,
Greek letters and math symbols, you might as well just go back to UTF-8.

The alternative is to have embedded escape sequences for the rare
foreign letter/word that you might need, but then you're back to being
unable to slice the string at will, since slicing it at the wrong place
will produce gibberish.

I'm not saying UTF-8 (or UTF-16, etc.) is panacea -- there are things
about it that are annoying, but it's certainly better than the scheme
you're proposing.

T

--
You only live once.

May 25 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx>  writes:
On Tue, May 28, 2013 at 02:54:30AM +0200, Torje Digernes wrote:
On Tuesday, 28 May 2013 at 00:34:20 UTC, Manu wrote:
On 28 May 2013 09:05, Walter Bright <newshound2 digitalmars.com>
wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... for
example:

import std.stdio;

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different
story. :)

I've recently come to the opinion that that's a bad idea, and D
should not support it.

Why? You said previously that you'd love to support extended
operators ;)

I find features such as support for uncommon symbols in variables a
strength as it makes some physics formulas a bit easier to read in
code form, which in my opinion is a good thing.

I think there's a difference between allowing math symbols (which
includes things like (a subset of) Greek letters that mathematicians
love) in identifiers, and allowing full Unicode. What if you're assigned
to maintain code containing identifiers that have letters that don't
appear in any of your installed fonts?

I think it's OK to allow math symbols, but allowing the entire set of
Unicode characters is going a bit too far, IMO. For one thing, if some
code has identifiers written in Arabic, I wouldn't be able to understand
the code, simply because I'd have a hard time telling different
identifiers apart.  Besides, if the rest of the language (keywords,
Phobos, etc.) are in English, then I don't see any compelling reason to
use a different language in identifiers, other than to submit IODCC
entries. :-P

C doesn't support Unicode identifiers, for one thing, but I've seen
working C code written by people who barely understand any English -- it
didn't stop them at all. (The comments were of course in their native
language -- the compiler ignores everything inside anyway so 8-bit
native encodings or even UTF-8 can be sneaked in without provoking
compiler errors.)

T

--
WINDOWS = Will Install Needless Data On Whole System -- CompuMan

May 27 2013

"Torje Digernes" <torjehoa pvv.org>  writes:
On Tuesday, 28 May 2013 at 01:17:37 UTC, H. S. Teoh wrote:
On Tue, May 28, 2013 at 02:54:30AM +0200, Torje Digernes wrote:
On Tuesday, 28 May 2013 at 00:34:20 UTC, Manu wrote:
On 28 May 2013 09:05, Walter Bright <newshound2 digitalmars.com> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different
story. :)

I've recently come to the opinion that that's a bad idea, and D
should not support it.

Why? You said previously that you'd love to support extended
operators ;)

I find features such as support for uncommon symbols in variables a
strength, as it makes some physics formulas a bit easier to read in
code form, which in my opinion is a good thing.

I think there's a difference between allowing math symbols (which
includes things like (a subset of) Greek letters that mathematicians
love) in identifiers, and allowing full Unicode. What if you're
assigned to maintain code containing identifiers that have letters
that don't appear in any of your installed fonts?

I think it's OK to allow math symbols, but allowing the entire set of
Unicode characters is going a bit too far, IMO. For one thing, if some
code has identifiers written in Arabic, I wouldn't be able to
understand the code, simply because I'd have a hard time telling
different identifiers apart.  Besides, if the rest of the language
(keywords, Phobos, etc.) are in English, then I don't see any
compelling reason to use a different language in identifiers, other
than to submit IODCC entries. :-P

C doesn't support Unicode identifiers, for one thing, but I've seen
working C code written by people who barely understand any English --
it didn't stop them at all. (The comments were of course in their
native language -- the compiler ignores everything inside anyway so
8-bit native encodings or even UTF-8 can be sneaked in without
provoking compiler errors.)

T

T

I think there is very little difference; both cases artificially
limit the allowable symbols. What about symbols relevant in other
fields, ones that do not happen to use Greek letters primarily -- are
they to be treated differently?

What you propose is a built-in coding standard for D, based on your
feelings on the topic.

If what you fear is that Unicode will suddenly make cooperation
impossible, I doubt you are right; after all, there are all kinds of
ways to make terrible variable names (q, w, e, r ... qq, qw). If any
such identifiers show up in a project, I assume they get cleaned up,
so why wouldn't the same happen to Unicode identifiers if they cause
problems? Think about it: the cleanup should happen even faster,
because the symbol might not be typeable for everyone, whereas a
single- or double-letter gibberish name is perfectly reproducible and
might grow into the project, confusing every new reader. Are you
going to argue for disallowing variables that are not a compound
word or a dictionary word in English?

May 29 2013

<!--