## digitalmars.D - Why UTF-8/16 character encodings?

"Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
toUpper/lower cannot be made in place if it should handle all
Unicode. Some characters will change their length when converted
to/from uppercase. Examples of these are the German double S
and some Turkish I.
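That claim is easy to verify; a quick sketch (in Python, just because it exposes the Unicode case mappings directly) of the two cases Jacob cites:

```python
# Case mapping is not length-preserving, so an in-place toUpper/toLower
# over a fixed-size buffer cannot handle full Unicode.
assert "ß".upper() == "SS"                    # German sharp s: 1 codepoint -> 2
assert len("İ".encode("utf-8")) == 2          # Turkish dotted capital I: 2 UTF-8 bytes
assert len("İ".lower().encode("utf-8")) == 3  # lowercases to 'i' + U+0307: 3 bytes
```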

This triggered a long-standing bugbear of mine: why are we using
these variable-length encodings at all?  Does anybody really care
about UTF-8 being "self-synchronizing," ie does anybody actually
use that in this day and age?  Sure, it's backwards-compatible
with ASCII and the vast majority of usage is probably just ASCII,
but that means the other languages don't matter anyway.  Not to
mention taking the valuable 8-bit real estate for English and
dumping the longer encodings on everyone else.
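(The "self-synchronizing" property in question can be sketched concretely, in Python for illustration: continuation bytes always match the bit pattern 10xxxxxx, so from any byte offset you can find the nearest codepoint boundary locally, without decoding from the start of the string.)

```python
data = "naïve café".encode("utf-8")

def prev_boundary(buf: bytes, i: int) -> int:
    """Back up from offset i to the start of the codepoint containing it."""
    while (buf[i] & 0xC0) == 0x80:  # 0b10xxxxxx marks a continuation byte
        i -= 1
    return i

# Offset 3 lands inside the two-byte sequence for 'ï'...
assert prev_boundary(data, 3) == 2
# ...and the suffix from that boundary is again valid UTF-8.
assert data[prev_boundary(data, 3):].decode("utf-8") == "ïve café"
```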

I'd just use a single-byte header to signify the language and
then put the vast majority of languages in a single byte
encoding, with the few exceptional languages with more than 256
characters encoded in two bytes.  OK, that doesn't cover
multi-language strings, but that is what, .000001% of usage?
Make your header a little longer and you could handle those also.
Yes, it wouldn't be strictly backwards-compatible with ASCII,
but it would be so much easier to internationalize.  Of course,
there's also the monoculture we're creating; love this UTF-8 rant
by tuomov, author of one of the first tiling window managers for
linux:

http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06

The emperor has no clothes, what am I missing?

May 24 2013
"Peter Alexander" <peter.alexander.au gmail.com> writes:
On Friday, 24 May 2013 at 17:05:57 UTC, Joakim wrote:
This triggered a long-standing bugbear of mine: why are we
using these variable-length encodings at all?

Simple: backwards compatibility with all ASCII APIs (e.g. most C
libraries), and because I don't want my strings to consume
multiple bytes per character when I don't need it.

Your language header idea is no good for at least three reasons:

1. What happens if I want to take a substring slice of your
string? I'll need to allocate a new string to add the header in.

2. What if I have a long string with the ASCII header and want to
append a non-ASCII character on the end? I'll need to reallocate
the whole string and widen it with the new header.

3. Even if I have a string that is 99% ASCII then I have to pay
extra bytes for every character just because 1% wasn't ASCII.
With UTF-8, I only pay the extra bytes when needed.
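Each of the three points can be checked concretely (a Python sketch for illustration):

```python
s = "mostly ASCII text with a little ü"
b = s.encode("utf-8")
# 1. A slice at codepoint boundaries is a complete UTF-8 string in itself;
#    no header needs to be re-attached or rewritten.
assert b[:6].decode("utf-8") == "mostly"
# 2. Appending a non-ASCII character never forces re-encoding what's there.
b2 = b + "é".encode("utf-8")
assert b2.decode("utf-8") == s + "é"
# 3. Only the non-ASCII characters pay extra bytes.
assert len(b) == len(s) + 1  # one 2-byte 'ü'; everything else is 1 byte each
```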

May 24 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2013 1:37 PM, Joakim wrote:
This leads to Phobos converting every UTF-8 string to UTF-32, so that
it can easily run its algorithms on a constant-width 32-bit character set, and
the resulting performance penalties.

This is more a problem with the algorithms taking the easy way than a problem
with UTF-8. You can do all the string algorithms, including regex, by working
with the UTF-8 directly rather than converting to UTF-32. Then the algorithms
work at full speed.
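For example, substring search needs no decoding at all: UTF-8 lead and continuation bytes occupy disjoint ranges, so a byte-level match of a whole encoded needle can only land on codepoint boundaries (sketched in Python for illustration):

```python
# A plain byte search finds codepoint-aligned matches without ever
# decoding to UTF-32.
haystack = "Schöne Grüße aus München".encode("utf-8")
needle = "München".encode("utf-8")
i = haystack.find(needle)  # ordinary byte-level search
assert i != -1
assert haystack[i:].decode("utf-8") == "München"
```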

Yes, it wouldn't be strictly backwards-compatible with ASCII, but it would be

That was the go-to solution in the 1980's, they were called "code pages". A
disaster.

with the few exceptional languages with more than 256 characters encoded in

Like those rare languages Japanese, Korean, Chinese, etc. This too was done in
the 80's with "Shift-JIS" for Japanese, and some other wacky scheme for Korean,
and a third nutburger one for Chinese.

I've had the misfortune of supporting all that in the old Zortech C++ compiler.
It's AWFUL. If you think it's simpler, all I can say is you've never tried to
write internationalized code with it.

UTF-8 is heavenly in comparison. Your code is automatically internationalized.
It's awesome.

May 24 2013
"anonymous" <anonymous example.com> writes:
On Friday, 24 May 2013 at 17:05:57 UTC, Joakim wrote:
On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
toUpper/lower cannot be made in place if it should handle all
Unicode. Some characters will change their length when converted
to/from uppercase. Examples of these are the German double S
and some Turkish I.

This triggered a long-standing bugbear of mine: why are we
using these variable-length encodings at all?  Does anybody
really care about UTF-8 being "self-synchronizing," ie does
anybody actually use that in this day and age?  Sure, it's
backwards-compatible with ASCII and the vast majority of usage
is probably just ASCII, but that means the other languages
don't matter anyway.  Not to mention taking the valuable 8-bit
real estate for English and dumping the longer encodings on
everyone else.

The German ß becomes SS when capitalised. It's not an encoding issue.

May 24 2013
"Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 17:43:03 UTC, Peter Alexander wrote:
Simple: backwards compatibility with all ASCII APIs (e.g. most
C libraries), and because I don't want my strings to consume
multiple bytes per character when I don't need it.

accommodate the authors of then-dominant all-ASCII APIs has now
foisted an unnecessarily complex encoding on all of us, with
reduced performance as the result.  You do realize that my
encoding would encode almost all languages' characters in single
bytes, unlike UTF-8, right?  Your latter argument is one against
UTF-8.

Your language header idea is no good for at least three reasons:

1. What happens if I want to take a substring slice of your
string? I'll need to allocate a new string to add the header in.

you'd parse my format and store it in memory as a String class,
storing the chars in an internal array with the header stripped
out and the language stored in a property.  That way, even a
slice could be made to refer to the same language, by referring
to the language of the containing array.

Strictly speaking, this solution could also be implemented with
UTF-8, simply by changing the format for the data structure you
use in memory to the one I've outlined, as opposed to using
the UTF-8 encoding for both transmission and processing.  But if
you're going to use my format for processing, you might as well
use it for transmission also, since it is much smaller for
non-ASCII text.
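(For concreteness, the in-memory representation described above could be sketched like this; a toy illustration of the proposal with all names hypothetical, not an endorsement of it:)

```python
from dataclasses import dataclass

@dataclass
class HeaderString:
    lang: int    # language id from the stripped-off header
    data: bytes  # single-byte character codes, header removed

    def slice(self, start: int, stop: int) -> "HeaderString":
        # A slice refers to the containing string's language, so no new
        # header has to be allocated and prepended.
        return HeaderString(self.lang, self.data[start:stop])

s = HeaderString(lang=0x07, data=bytes([0x41, 0x42, 0x43, 0x44]))
sub = s.slice(1, 3)
assert sub.lang == s.lang and sub.data == b"BC"
```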

Before you ridicule my solution as somehow unworkable, let me
remind you of the current monstrosity.  Currently, the language
is stored in every single UTF-8 character, by having the length
vary from one to four bytes depending on the language.  This
leads to Phobos converting every UTF-8 string to UTF-32, so that
it can easily run its algorithms on a constant-width 32-bit
character set, and the resulting performance penalties.  Perhaps
the biggest loss is that programmers everywhere are pushed to
ignorance or broken code.

Which seems more unworkable to you?

2. What if I have a long string with the ASCII header and want
to append a non-ASCII character on the end? I'll need to
reallocate the whole string and widen it with the new header.

almost never happens.  But if it does, it would be solved by the
String class I outlined above, as the header isn't stored in the
array anymore.

3. Even if I have a string that is 99% ASCII then I have to pay
extra bytes for every character just because 1% wasn't ASCII.
With UTF-8, I only pay the extra bytes when needed.

thousand non-ASCII characters, the UTF-8 version will have one or
two thousand more characters, ie 1 or 2 KB more.  My format would
only add a byte or two of header per language used, that's it.
It's a clear win for my format.

In any case, I just came up with the simplest format I could off
the top of my head, maybe there are gaping holes in it.  But my
point is that we should be able to come up with such a much
simpler format, which keeps most characters to a single byte, not
that my format is best.  All I want to argue is that UTF-8 is the
worst. ;)

May 24 2013
"Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 20:37:58 UTC, Joakim wrote:
3. Even if I have a string that is 99% ASCII then I have to
pay extra bytes for every character just because 1% wasn't
ASCII. With UTF-8, I only pay the extra bytes when needed.

thousand non-ASCII characters, the UTF-8 version will have one
or two thousand more characters, ie 1 or 2 KB more.  My format
language character used, that's it.  It's a clear win for my
format.

I don't understand what you mean here.  If your string has a
thousand non-ASCII characters, the UTF-8 version will have one
or two thousand more bytes, ie 1 or 2 KB more.  My format would
only add a byte or two of header per language used, that's it.
It's a clear win for my format.

May 24 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 24-May-2013 21:05, Joakim wrote:
On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
toUpper/lower cannot be made in place if it should handle all Unicode.
Some characters will change their length when converted to/from
uppercase. Examples of these are the German double S and some Turkish I.

This triggered a long-standing bugbear of mine: why are we using these
variable-length encodings at all?  Does anybody really care about UTF-8
being "self-synchronizing," ie does anybody actually use that in this
day and age?  Sure, it's backwards-compatible with ASCII and the vast
majority of usage is probably just ASCII, but that means the other
languages don't matter anyway.  Not to mention taking the valuable 8-bit
real estate for English and dumping the longer encodings on everyone else.

I'd just use a single-byte header to signify the language and then put
the vast majority of languages in a single byte encoding, with the few
exceptional languages with more than 256 characters encoded in two
bytes.

You seem to think that not only UTF-8 is bad encoding but also one

Separate code spaces were the case before Unicode (and UTF-8). The
problem is not only that without the header the text is meaningless (no
easy slicing) but the fact that the encoding of the data after the
header depends on a variety of factors - a whole list of encodings,
actually. Now everybody has to keep a (code) page per language just to
know whether it's 2 bytes per char or 1 byte per char or whatever. And
you still work on the assumption that there are no combining marks or
region-specific stuff :)

In fact it was even "better": nobody ever talked about a header, they
just assumed a codepage from some global setting. Imagine yourself
creating a font rendering system these days - a hell of an exercise in
frustration (okay, how do I render 0x88? Hmm, if that is in codepage XYZ
then ...).
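The ambiguity is easy to demonstrate (Python for illustration): the very same byte is a different character under every legacy code page, while under UTF-8 it is not even a complete sequence.

```python
raw = b"\xe4"
assert raw.decode("latin-1") == "ä"   # Western European
assert raw.decode("cp1251") == "д"    # Cyrillic
assert raw.decode("cp1253") == "δ"    # Greek
# UTF-8 removes the ambiguity: a lone 0xE4 is not a valid sequence at all.
try:
    raw.decode("utf-8")
    assert False, "should have raised"
except UnicodeDecodeError:
    pass
```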

OK, that doesn't cover multi-language strings, but that is what,
.000001% of usage?

This just shows you don't care for multilingual stuff at all. Imagine
any language tutor/translator/dictionary on the Web. For instance, most
languages need to intersperse text with ASCII (also keep in mind e.g.
HTML markup). Books often feature citations in the native language (or
e.g. Latin) along with translations.

Now also take into account math symbols, currency symbols and beyond.
Also these days cultures are mixing in wild combinations so you might
need to see the text even if you can't read it. Unicode is not only
"encode characters from all languages". It needs to address universal
representation of symbolics used in writing systems at large.

those also.  Yes, it wouldn't be strictly backwards-compatible with
ASCII, but it would be so much easier to internationalize.  Of course,
there's also the monoculture we're creating; love this UTF-8 rant by
tuomov, author of one of the first tiling window managers for linux:

"par-le-vu-france?" and codepages of various complexity (insanity).

Want small? Use compression schemes, which work perfectly fine and get
to the precious 1 byte per codepoint with exceptional speed.
http://www.unicode.org/reports/tr6/
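Python ships no SCSU codec, but a generic compressor makes the same point as a rough stand-in:

```python
import zlib

# If size is the concern, generic compression already gets multi-byte
# UTF-8 text to around (or below) one byte per codepoint.
text = "Широка страна моя родная, много в ней лесов, полей и рек. " * 20
utf8 = text.encode("utf-8")
assert len(utf8) > 1.7 * len(text)           # Cyrillic: ~2 UTF-8 bytes/codepoint
assert len(zlib.compress(utf8)) < len(text)  # compressed: < 1 byte per codepoint
```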

http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06

The emperor has no clothes, what am I missing?

And borrowing the arguments from that rant: locale is borked shit when
it comes to encodings. Locales should be used for tweaking visuals like
number and date display and so on.

--
Dmitry Olshansky

May 24 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2013 3:42 PM, H. S. Teoh wrote:
I tried writing language-agnostic text-processing programs in C/C++

One of the first, and best, decisions I made for D was it would be Unicode
front to back.

At the time, Unicode was poorly supported by operating systems and lots of
software, and I encountered some initial resistance to it. But I believed
Unicode was the inevitable future.

Code pages, Shift-JIS, EBCDIC, etc., should all be terminated with prejudice.

May 24 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2013 7:16 PM, Manu wrote:
So when we define operators for u × v and a · b, or maybe n²? ;)

Oh, how I want to do that. But I still think the world hasn't completely caught
up with Unicode yet.

May 24 2013
Timon Gehr <timon.gehr gmx.ch> writes:
On 05/25/2013 05:56 AM, H. S. Teoh wrote:
On Fri, May 24, 2013 at 08:45:56PM -0700, Walter Bright wrote:
On 5/24/2013 7:16 PM, Manu wrote:
So when we define operators for u × v and a · b, or maybe n²? ;)

Oh, how I want to do that. But I still think the world hasn't
completely caught up with Unicode yet.

That would be most awesome!

Though it does raise the issue of how parsing would work, 'cos you
either have to assign a fixed precedence to each of these operators (and
there are a LOT of them in Unicode!),

I think this is what eg. fortress is doing.

or allow user-defined operators
with custom precedence and associativity,

This is what eg. Haskell, Coq are doing.
(Though Coq has the advantage of not allowing forward references, and
hence inline parser customization is straightforward in Coq.)

which means nightmare for the
parser (it has to adapt itself to new operators as the code is
parsed/analysed,

It would be easier on the parsing side, since the parser would not fully
parse expressions. Semantic analysis would resolve precedences. This is
quite simple, and the current way the parser resolves operator
precedences is less efficient anyways.

which then leads to issues with what happens if two
different modules define the same operator with conflicting precedence /
associativity).

This would probably be an error without explicit disambiguation, or
follow the usual disambiguation rules. (trying all possibilities appears
to be exponential in the number of conflicting operators in an
expression in the worst case though.)

May 25 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/26/2013 1:44 PM, Hans W. Uhlig wrote:
Using those characters would be wonderful, and while we do have Unicode
software support, we don't really have Unicode hardware support. I am still
on my 102-key keyboard and I haven't really seen a good expanded character
keyboard come along.

I have a post-it stuck to my monitor with the numbers for various unicode
characters, but I just can't see that for writing code.

May 26 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for example:

import std.stdio;

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should not
support it.

May 27 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D should not
support it.

Why do you think it's a bad idea? It makes it such that code can be in various
languages? Just lack of keyboard support?

Every time I've been to a programming shop in a foreign country, the developers
speak English at work and code in English. Of course, that doesn't mean that
everyone does, but as far as I can tell the overwhelming bulk is done in
English.

Naturally, full Unicode needs to be in strings and comments, but symbol names? I
don't see the point nor the utility of it. Supporting such is just pointless
complexity in the language.

May 27 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 6:06 PM, H. S. Teoh wrote:
I don't find this a compelling reason to allow full Unicode on
identifiers, though. For one thing, somebody maintaining your code may
not know how to type said identifier correctly. It can be very
frustrating to have to keep copy-n-pasting identifiers just because they
contain foreign letters you can't type. Not to mention sheer
unreadability if the inventor's name is in Chinese, so the algorithm
name is also in Chinese, and the person maintaining the code can't read
Chinese. This will kill D code maintainability.

+1

May 27 2013
Michel Fortin <michel.fortin michelf.ca> writes:
On 2013-05-28 01:34:17 +0000, Walter Bright <newshound2 digitalmars.com> said:

On 5/27/2013 6:06 PM, H. S. Teoh wrote:
I don't find this a compelling reason to allow full Unicode on
identifiers, though. For one thing, somebody maintaining your code may
not know how to type said identifier correctly. It can be very
frustrating to have to keep copy-n-pasting identifiers just because they
contain foreign letters you can't type. Not to mention sheer
unreadability if the inventor's name is in Chinese, so the algorithm
name is also in Chinese, and the person maintaining the code can't read
Chinese. This will kill D code maintainability.

+1

-1

What's even worse for code maintainability is code that does not do
what it says.

Disallowing non-ASCII charsets does not prevent people from writing
foreign-language code. I've seen plenty of code in French in my life in
languages with no Unicode support. I've also seen plenty of bad English
in code. I'd rather see a correct French word as a variable or function
name than an incorrect English one. Correctly naming things is
difficult, and correctly naming them in a foreign language is even
more so. This surely applies to languages using non-ASCII alphabets too.

Of course, if you're not using English words you'll be limiting your
audience to programmers who understand that language. But you might
widen it in other directions. I worked once with a grad student who was
building a model to simulate breakages of water pipe systems. She was
good enough to write code that worked, although she needed my help for
a couple of things, notably increasing performance. The code was all in
French, and thankfully so as attempting to translate all those terms
(some dealing with concepts unknown to me) to English when writing the
code and back to French when explaining the concepts would have been
quite annoying, inefficient, and error-prone in our work.

While French likely will always be a possibility (as it fits well in
ASCII), I can see how writing code in Japanese or Russian might benefit
native speakers of those languages too, especially those for whom
programming is only an incidental part of their job. Programming is a
form of expression, and it's always easier to express ourselves in our
own native language.

--
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca/

May 28 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/29/2013 3:26 AM, qznc wrote:
Once I heard an argument from developers working for banks. They coded
with German names (e.g. Vermögen, Bürgschaft), which sometimes include
äöüß. Some of those concepts had no good translation into English,
because they are not used outside of Germany and the clients prefer the
actual names anyways.

German is pretty easy to do in ASCII: Vermoegen and Buergschaft
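The standard transliterations (ä→ae, ö→oe, ü→ue, ß→ss) are mechanical; a sketch in Python for illustration:

```python
# A translation table implementing the conventional German ASCII fallback.
DE_ASCII = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue",
                          "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"})
assert "Vermögen".translate(DE_ASCII) == "Vermoegen"
assert "Bürgschaft".translate(DE_ASCII) == "Buergschaft"
```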

May 29 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 5:34 PM, Manu wrote:
On 28 May 2013 09:05, Walter Bright <newshound2 digitalmars.com
<mailto:newshound2 digitalmars.com>> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should not
support it.

Why? You said previously that you'd love to support extended operators ;)

Extended operators, yes. Non-ascii identifiers, no.

May 27 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 9:27 PM, Manu wrote:
I will never write colour without a u, ever! I may suffer the global American
cultural invasion of my country like the rest of us, but I will never let them
infiltrate my mind! ;)

Resistance is useless.

May 27 2013
Jacob Carlborg <doob me.com> writes:
On 2013-05-28 03:38, Peter Williams wrote:

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Don't you have a spell checker in your editor? If not, find a new one :)

--
/Jacob Carlborg

May 28 2013
Jacob Carlborg <doob me.com> writes:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

--
/Jacob Carlborg

May 28 2013
Jacob Carlborg <doob me.com> writes:
On 2013-05-28 14:09, Manu wrote:

Yes, the region called America ;)
Although there's a few British colonies in the Caribbean...

Oh, you meant the whole region and not the country.

--
/Jacob Carlborg

May 28 2013
Jacob Carlborg <doob me.com> writes:
On 2013-05-28 14:58, Simen Kjaeraas wrote:

America is not a country. The country is called USA.

I know that, but I get the impression that people usually say "America"
and refer to USA.

--
/Jacob Carlborg

May 28 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/29/2013 2:42 AM, Jakob Ovrum wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D should not
support it.

Honestly, removing support for non-ASCII characters from identifiers is the
worst idea you've had in a while. There is an _unfathomable amount_ of code out
there written in non-English languages but hamfisted into an English-alphabet
representation because the programming language doesn't care to support it. The
resulting friction is considerable.

You seem to attribute particular value to personal anecdotes, so here's one of
mine: I personally know several prestigious universities in Europe and Asia
which teach programming using Java and/or C with identifiers being in an
English-alphabet representation of the native non-English language. Using the
English language for identifiers is usually a sanctioned alternative, but not
the primary modus operandi. I also know several professional programmers using
their native non-English language for identifiers in production code.

I still think it's a bad idea, but it's obvious people want it in D, so it'll
stay.

(Also note that I meant using ASCII, not necessarily english.)

May 29 2013
Timon Gehr <timon.gehr gmx.ch> writes:
On 05/29/2013 12:03 PM, Marco Leise wrote:
...  And everyone
likes "alias ℕ = size_t;", right? :)
...

No, that's deeply troubling.

May 30 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance, it
almost
always looks like this:

{
    // E: I like cake.
    // J: ケーキが好きです。
    player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

May 30 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2013 5:00 PM, Peter Williams wrote:
On 31/05/13 05:07, Walter Bright wrote:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
    // E: I like cake.
    // J: ケーキが好きです。
    player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Not true, D supports Unicode identifiers.

May 30 2013
1100110 <0b1100110 gmail.com> writes:
On 05/31/2013 05:11 AM, Simen Kjaeraas wrote:
On Fri, 31 May 2013 07:57:37 +0200, Walter Bright
<newshound2 digitalmars.com> wrote:

On 5/30/2013 5:00 PM, Peter Williams wrote:
On 31/05/13 05:07, Walter Bright wrote:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
    // E: I like cake.
    // J: ケーキが好きです。
    player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Not true, D supports Unicode identifiers.

I doubt Sony and Nintendo use D extensively.


Jun 17 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 6/17/2013 6:28 PM, Brad Roberts wrote:
Don't symbol names from dmd/win32 get compressed if they're too long, resulting
in essentially arbitrary random binary data being used as symbol names?
Assuming my memory on that is correct then it's already demonstrated that
optlink doesn't care what the data is.

Optlink doesn't care what the symbol byte contents are.

Jun 17 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 6/18/2013 9:44 AM, H. S. Teoh wrote:
Might this cause a problem with the VS linker?

I doubt it, but try it and see!

Jun 18 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jun 18, 2013 at 04:33:54PM -0700, Walter Bright wrote:
On 6/18/2013 9:44 AM, H. S. Teoh wrote:
Might this cause a problem with the VS linker?

I doubt it, but try it and see!

to try?

T

--
Study gravitation, it's a field with a lot of potential.

Jun 18 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2013 5:04 PM, Manu wrote:
Currently, D offers a unique advantage; leave it that way.

I am going to leave it that way based on the comments here; I only wanted to
point out that the example didn't support Unicode identifiers.

May 30 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jun 17, 2013 at 06:49:19PM -0700, Walter Bright wrote:
On 6/17/2013 6:28 PM, Brad Roberts wrote:
Don't symbol names from dmd/win32 get compressed if they're too long, resulting
in essentially arbitrary random binary data being used as symbol names?
Assuming my memory on that is correct then it's already demonstrated that
optlink doesn't care what the data is.

Optlink doesn't care what the symbol byte contents are.

It seems ld on Linux doesn't, either. I just tested separate compilation
on some code containing functions and modules with Cyrillic names, and
it worked fine. But my system locale is UTF-8; I'm not sure if there may
be a problem on other system locales (not that modern systems would
actually use anything else, though!).

Might this cause a problem with the VS linker?

T

--
It only takes one twig to burn down a forest.

Jun 18 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 12:33 AM, Joakim wrote:
At what cost?  Most programmers completely punt on unicode, because they just
don't want to deal with the complexity. Perhaps you can deal with it and don't
mind the performance loss, but I suspect you're in the minority.

I think you stand alone in your desire to return to code pages. I have years of
experience with code pages and the unfixable misery they produce. This has
disappeared with Unicode. I find your arguments unpersuasive when stacked
against my experience. And yes, I have made a living writing high performance
code that deals with characters, and you are quite off base with claims that
UTF-8 has inevitable bad performance - though there is inefficient code in
Phobos for it, to be sure.

My grandfather wrote a book that consists of mixed German, French, and Latin
words, using special characters unique to those languages. Another failing of
code pages is that they fail miserably at any such mixed-language text. Unicode
handles it with aplomb.

I can't even write an email to Rainer Schütze in English under your scheme.

Code pages simply are no longer practical nor acceptable for a global
community.
D is never going to convert to a code page system, and even if it did, there's
no way D will ever convince the world to abandon Unicode, and so D would be as
useless as EBCDIC.

I'm afraid your quest is quixotic.

May 25 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 25-May-2013 13:05, Joakim wrote:
On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote:

going to single-byte encodings, which do not imply the problems that you
had with code pages way back when.

The problem is that what you outline is isomorphic to code pages. Hence the
grief of accumulated experience against them.
Code pages simply are no longer practical nor acceptable for a global
community. D is never going to convert to a code page system, and even
if it did, there's no way D will ever convince the world to abandon
Unicode, and so D would be as useless as EBCDIC.

encodings" to "code pages" in your head, then recoil in horror as you
remember all your problems with broken implementations of code pages,
even though those problems are not intrinsic to single-byte encodings.

I'm not asking you to consider this for D.  I just wanted to discuss why
UTF-8 is used at all.  I had hoped for some technical evaluations of its
merits, but I seem to simply be dredging up a bunch of repressed

Well, if somebody got a quest to redefine UTF-8 they *might* come up with
something that is a bit faster to decode but shares the same properties.
Hardly a life-saver anyway.
The world may not "abandon Unicode," but it will abandon UTF-8, because
it's a dumb idea.  Unfortunately, such dumb ideas- XML anyone?- often
proliferate until someone comes up with something better to show how
dumb they are.

Even children know XML is awful redundant shit as an interchange format.
The hierarchical document model is a nice idea, though.

Perhaps it won't be the D programming language that does
that, but it would be easy to implement my idea in D, so maybe it will
be a D-based library someday. :)

Implement a Unicode compression scheme - at least that is standardized.

--
Dmitry Olshansky

May 25 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 9:48 PM, H. S. Teoh wrote:
Then came along D with native Unicode support built right into the
language. And not just UTF-16 shoved down your throat like Java does (or
was it UTF-32?); UTF-8, UTF-16, and UTF-32 are all equally supported.
You cannot imagine what a happy camper I was since then!! Yes, Phobos
still has a ways to go in terms of performance w.r.t. UTF-8 strings, but
what we have right now is already far, far, superior to the situation in
C/C++, and things can only get better.

Many moons ago, when the earth was young and I had a few strands of hair
left, a C++ programmer challenged me to a "bakeoff", D vs C++. I wrote the
program in D (a string processing program). He said "ahaaaa!" and wrote the
C++ one. They were fairly comparable.

I then suggested we do the internationalized version. I resubmitted exactly the
same program. He threw in the towel.

May 25 2013
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/25/13 3:33 AM, Joakim wrote:
On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
This is more a problem with the algorithms taking the easy way than a
problem with UTF-8. You can do all the string algorithms, including
regex, by working with the UTF-8 directly rather than converting to
UTF-32. Then the algorithms work at full speed.

can be as "full speed" as a constant-width encoding. Perhaps you mean
that the slowdown is minimal, but I doubt that also.

You mentioned this a couple of times, and I wonder what makes you so
sure. On contemporary architectures small is fast and large is slow;
betting on replacing larger data with more computation is quite often a win.

Andrei

May 25 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 5:43 AM, Andrei Alexandrescu wrote:
On 5/25/13 3:33 AM, Joakim wrote:
On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
This is more a problem with the algorithms taking the easy way than a
problem with UTF-8. You can do all the string algorithms, including
regex, by working with the UTF-8 directly rather than converting to
UTF-32. Then the algorithms work at full speed.

can be as "full speed" as a constant-width encoding. Perhaps you mean
that the slowdown is minimal, but I doubt that also.

You mentioned this a couple of times, and I wonder what makes you so sure. On
contemporary architectures small is fast and large is slow; betting on
replacing larger data with more computation is quite often a win.

On the other hand, Joakim even admits his single byte encoding is variable
length, as otherwise he simply dismisses the rarely used (!) Chinese, Japanese,
and Korean languages, as well as any text that contains words from more than
one language.

I suspect he's trolling us, and quite successfully.

May 25 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 1:03 PM, Joakim wrote:
On Saturday, 25 May 2013 at 19:30:25 UTC, Walter Bright wrote:
On the other hand, Joakim even admits his single byte encoding is variable
length, as otherwise he simply dismisses the rarely used (!) Chinese,
Japanese, and Korean languages, as well as any text that contains words from
more than one language.

two bytes, so it is not a true constant-width encoding if you are mixing one of
those languages into a single-byte encoded string.  But this "variable length"
encoding is so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's variable length. You
overlook that I've had to deal with this. It isn't "simpler", there's actually
more work to write code that adapts to one or two byte encodings.

I suspect he's trolling us, and quite successfully.

Walter.  It seems to be the trend on the internet to accuse anybody you
disagree with of trolling; I am honestly surprised to see Walter stoop so low.
Considering I'm the only one making any cogent arguments here, perhaps I should
wonder if you're all trolling me. ;)

On Saturday, 25 May 2013 at 19:35:42 UTC, Walter Bright wrote:
I suspect the Chinese, Koreans, and Japanese would take exception to being
called irrelevant.

they would also be handled by a two-byte encoding.

Good luck with your scheme that can't handle languages written by billions of
people!

I am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard Chinese. You cannot have
it both ways. Code to deal with two bytes is significantly different than code
to deal with one. That means you've got a conditional in your generic code -
that isn't going to be faster than the conditional for UTF-8.

then you claim I don't handle
these languages.  This kind of blatant contradiction within two posts can only
be called... trolling!

You gave some vague handwaving about it, and then dismissed it as irrelevant,
along with more handwaving about what to do with text that has embedded words
in multiple languages.

Worse, there are going to be more than 256 of these encodings - you can't even
have a byte to specify them. Remember, Unicode has approximately 256,000
characters in it. How many code pages is that?
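A quick back-of-envelope sketch of that count (illustrative only, not from the original post):

```d
void main()
{
    // ~256,000 assigned Unicode characters, at most 256 per single-byte page:
    enum pages = 256_000 / 256;   // 1000 distinct code pages
    static assert(pages == 1000);
    static assert(pages > 255);   // too many to select with a one-byte header
}
```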

I was being kind saying you were trolling, as otherwise I'd be saying your
scheme was, to be blunt, absurd.

---------------------------------------

I'll be the first to admit that a lot of great ideas have been initially
dismissed by the experts as absurd. If you really believe in this, I recommend
that you write it up as a real article, taking care to fill in all the
handwaving with something specific, and include some benchmarks to prove your
performance claims. Post your article on reddit, stackoverflow, hackernews,
etc., and look for fertile ground for it. I'm sorry you're not finding fertile
ground here (so far, nobody has agreed with any of your points), and this is
the wrong place for such proposals anyway, as D is simply not going to switch
over to it.

Remember, extraordinary claims require extraordinary evidence, not handwaving
and assumptions disguised as bold assertions.

May 25 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 5/26/2013 4:31 AM, Joakim wrote:
My single-byte encoding has none of these problems; in fact, it's much faster
and uses less memory for the same function, while providing additional
speedups, from the header, that are not available to UTF-8.

C'mon, Joakim, show us this amazing strstr() implementation for your scheme!


May 26 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-May-2013 12:58, Vladimir Panteleev wrote:
On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
This is more a problem with the algorithms taking the easy way than a
problem with UTF-8. You can do all the string algorithms, including
regex, by working with the UTF-8 directly rather than converting to
UTF-32. Then the algorithms work at full speed.

encoding can be as "full speed" as a constant-width encoding. Perhaps
you mean that the slowdown is minimal, but I doubt that also.

For the record, I noticed that programmers (myself included) that had an
incomplete understanding of Unicode / UTF exaggerate this point, and
sometimes needlessly assume that their code needs to operate on
individual characters (code points), when it is in fact not so - and
that code will work just fine as if it was written to handle ASCII. The
example Walter quoted (regex - assuming you don't want Unicode ranges or
case-insensitivity) is one such case.

+1
BTW regex even with Unicode ranges and case-insensitivity is doable, just
not easy (yet).

Another thing I noticed: sometimes when you think you really need to
operate on individual characters (and that your code will not be correct
unless you do that), the assumption will be incorrect due to the
existence of combining characters in Unicode. Two of the often-quoted
use cases of working on individual code points is calculating the string
width (assuming a fixed-width font), and slicing the string - both of
these will break with combining characters if those are not accounted
for.  I believe the proper way to approach such tasks is to implement the
respective Unicode algorithms for it, which I believe are non-trivial
and for which the relative impact for the overhead of working with a
variable-width encoding is acceptable.

Another plus one. Algorithms defined on code point basis are quite
complex so that benefit of not decoding won't be that large. The benefit
of transparently special-casing ASCII in UTF-8 is far larger.
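That ASCII fast path can be sketched in a few lines (an illustrative example, not from the original post; `countCodePoints` is a hypothetical helper, and `std.utf.stride` is assumed for the slow path): bytes below 0x80 are always complete characters in UTF-8, so the common case needs no decoding at all.

```d
import std.utf : stride;

// Hypothetical helper: count code points with a transparent ASCII fast
// path. Only the rare multi-byte sequence falls back to the decoder.
size_t countCodePoints(string s)
{
    size_t n;
    for (size_t i = 0; i < s.length; ++n)
    {
        if (s[i] < 0x80) ++i;        // ASCII: one byte, one code point
        else i += stride(s, i);      // multi-byte: ask the decoder for length
    }
    return n;
}

void main()
{
    assert(countCodePoints("hello") == 5);   // pure ASCII, never decodes
    assert(countCodePoints("café") == 4);    // 'é' is two bytes, one char
    assert(countCodePoints("さいご") == 3);   // three 3-byte characters
}
```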

Can you post some specific cases where the benefits of a constant-width
encoding are obvious and, in your opinion, make constant-width encodings
more useful than all the benefits of UTF-8?

Also, I don't think this has been posted in this thread. Not sure if it

http://www.utf8everywhere.org/

And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

--
Dmitry Olshansky

May 25 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2013 at 12:04:52AM +0200, Vladimir Panteleev wrote:
On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to write
large amounts of text in some given language; a method of
temporarily switching to a different layout is necessary.

I thought the topic was typing the occasional Unicode character to
use as an operator in D programs?

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
    import std.stdio : writeln;
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story. :)

But for operators, you still need enough compose key sequences to cover
all of the Unicode operators -- and there are a LOT of them -- which I
don't think is currently done anywhere. You'd have to make your own
compose key maps to do it.

T

--
Freedom: (n.) Man's self-given right to be enslaved by his own depravity.

May 27 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, May 29, 2013 at 09:33:32AM +1000, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent not a
country). :-)

If you say that to a Canadian to his face, you might get a hostile (or
faux-hostile) reaction. :)

Up here in the Great White North, we like to think of ourselves as
different from our rowdy neighbours to the south (even though we're not
that different, but we won't ever admit that :-P). And yes, "America"
means USA up here (and "American" especially means USian, as distinct
from Canadian), even though we all know that technically it refers to
the continent, not the country.

T

--
Computers aren't intelligent; they only think they are.

May 28 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 29/05/13 09:57, H. S. Teoh wrote:
On Wed, May 29, 2013 at 09:33:32AM +1000, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent not a
country). :-)

If you say that to a Canadian to his face, you might get a hostile (or
faux-hostile) reaction. :)

Up here in the Great White North, we like to think of ourselves as
different from our rowdy neighbours to the south (even though we're not
that different, but we won't ever admit that :-P). And yes, "America"
means USA up here (and "American" especially means USian, as distinct
from Canadian), even though we all know that technically it refers to
the continent, not the country.

Last time I was there (about 40 years ago) Canadians didn't seem that
touchy. :-)

Peter

May 28 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Tue, 28 May 2013 00:18:31 +0200, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:

On Tue, May 28, 2013 at 12:04:52AM +0200, Vladimir Panteleev wrote:
On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to write
large amounts of text in some given language; a method of
temporarily switching to a different layout is necessary.

I thought the topic was typing the occasional Unicode character to
use as an operator in D programs?

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
int число = 1;
foreach (и; 0..100)
число += и;
writeln(число);
}

Of course, whether that's a good practice is a different story. :)

But for operators, you still need enough compose key sequences to cover
all of the Unicode operators -- and there are a LOT of them -- which I
don't think is currently done anywhere. You'd have to make your own
compose key maps to do it.

The Fortress programming language has some 900 or so operators:

https://java.net/projects/projectfortress/sources/sources/content/Specification/fortress.1.0.pdf?rev=5558

Appendix C, and

https://java.net/projects/projectfortress/sources/sources/content/Documentation/Specification/fortress.pdf?rev=5558

chapter 14

--
Simen

May 27 2013
"Hans W. Uhlig" <huhlig gmail.com> writes:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for
example:

void main(string[] args) {
int число = 1;
foreach (и; 0..100)
число += и;
writeln(число);
}

Of course, whether that's a good practice is a different
story. :)

I've recently come to the opinion that that's a bad idea, and D
should not support it.

Why do you think it's a bad idea? Because it makes it such that code can
be in various languages? Or just lack of keyboard support?

May 27 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, May 29, 2013 at 10:36:08AM +1000, Peter Williams wrote:
On 29/05/13 09:57, H. S. Teoh wrote:
On Wed, May 29, 2013 at 09:33:32AM +1000, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent not a
country). :-)

If you say that to a Canadian to his face, you might get a hostile
(or faux-hostile) reaction. :)

Last time I was there (about 40 years ago) Canadians didn't seem
that touchy. :-)

Well, they are not, hence "faux-hostile". :)

T

--
Political correctness: socially-sanctioned hypocrisy.

May 28 2013
"Luís Marques" <luismarques gmail.com> writes:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D
should not support it.

I think it is a bad idea to program in a language other than
english, but I believe D should still support it.

May 27 2013
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and
D should not
support it.

Why do you think its a bad idea? It makes it such that code
can be in various
languages? Just lack of keyboard support?

Every time I've been to a programming shop in a foreign
country, the developers speak english at work and code in
english. Of course, that doesn't mean that everyone does, but
as far as I can tell the overwhelming bulk is done in english.

Naturally, full Unicode needs to be in strings and comments,
but symbol names? I don't see the point nor the utilty of it.
Supporting such is just pointless complexity to the language.

The most convincing case for usefulness I've seen was in java
where a class implemented a particular algorithm and so was named
after it. This name had a particular accented character and so
required unicode. Lots of algorithms are named after their
inventors and lots of these names contain unicode characters so
it's not that uncommon.

May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 09:05, Walter Bright <newshound2 digitalmars.com> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
int число = 1;
foreach (и; 0..100)
число += и;
writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should not
support it.

Why? You said previously that you'd love to support extended operators ;)


May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 09:39, "Luís Marques" <luismarques gmail.com> wrote:

On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:

I've recently come to the opinion that that's a bad idea, and D should
not support it.

I think it is a bad idea to program in a language other than english, but
I believe D should still support it.

I can imagine a young student learning to code who may not speak English
(yet).
Or a not-so-unlikely future where we're all speaking chinese ;)


May 27 2013
"Torje Digernes" <torjehoa pvv.org> writes:
On Tuesday, 28 May 2013 at 00:34:20 UTC, Manu wrote:
On 28 May 2013 09:05, Walter Bright
<newshound2 digitalmars.com> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... for
example:

void main(string[] args) {
int число = 1;
foreach (и; 0..100)
число += и;
writeln(число);
}

Of course, whether that's a good practice is a different
story. :)

I've recently come to the opinion that that's a bad idea, and
D should not
support it.

Why? You said previously that you'd love to support extended
operators ;)

I find features such as support for uncommon symbols in variables
a strength as it makes some physics formulas a bit easier to read
in code form, which in my opinion is a good thing.

May 27 2013
"Olivier Pisano" <olivier.pisano laposte.net> writes:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
Every time I've been to a programming shop in a foreign
country, the developers speak english at work and code in
english. Of course, that doesn't mean that everyone does, but
as far as I can tell the overwhelming bulk is done in english.

Would you have been to such an event if you could not have
understood what people were doing or saying? Of course, when we
are working on something with international scope, we tend to do
it in english, but it doesn't mean every programming task is
performed in english…

Being a non-native english speaker, I tend to see Unicode
identifiers as an improvement over other programming languages,
depending on the context of the programming task and its intended
audience. BTW, I use a Unicode-aware alternative keyboard layout,
so I can type greek letters or math symbols directly. ASCII-only
identifiers sound like an arbitrary limitation to me.

May 28 2013
"Oleg Kuporosov" <Oleg.Kuporosov gmail.com> writes:
On Tuesday, 28 May 2013 at 01:34:47 UTC, Walter Bright wrote:

Why? You said previously that you'd love to support extended
operators ;)

Extended operators, yes. Non-ascii identifiers, no.

BTW, this is one of D's big advantages; take into account that some day D
could be used for teaching in schools outside the US/GB where pupils don't
yet know English.


May 28 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
Every time I've been to a programming shop in a foreign
country, the developers speak english at work and code in
english. Of course, that doesn't mean that everyone does, but
as far as I can tell the overwhelming bulk is done in english.

That's because you have an academic view of code, and a library
approach to development.

When you are a private company selling closed source code, I
really don't see why you'd code in English.

IMO, whether it is a bad idea is not for us to judge (and less so
to stop), but for each company/organization to choose their own
coding standard.

May 28 2013
"Mr. Anonymous" <mailnew4ster gmail.com> writes:
On Monday, 27 May 2013 at 22:20:16 UTC, H. S. Teoh wrote:
On Tue, May 28, 2013 at 12:04:52AM +0200, Vladimir Panteleev wrote:
On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to write
large amounts of text in some given language; a method of
temporarily switching to a different layout is necessary.

I thought the topic was typing the occasional Unicode character to
use as an operator in D programs?

Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
int число = 1;
foreach (и; 0..100)
число += и;
writeln(число);
}

Of course, whether that's a good practice is a different story. :)

But for operators, you still need enough compose key sequences to cover
all of the Unicode operators -- and there are a LOT of them -- which I
don't think is currently done anywhere. You'd have to make your own
compose key maps to do it.

T


May 28 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Tue, 28 May 2013 01:05:46 +0200, Walter Bright <newshound2 digitalmars.com> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
int число = 1;
foreach (и; 0..100)
число += и;
writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should
not support it.

I've recently come to the opinion that you're wrong - using them is often
wrong, but D should support them. Various good reasons have been posted.

--
Simen

May 28 2013
"Jakob Ovrum" <jakobovrum gmail.com> writes:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and D
should not support it.

Honestly, removing support for non-ASCII characters from
identifiers is the worst idea you've had in a while. There is an
_unfathomable amount_ of code out there written in non-English
languages but hamfisted into an English-alphabet representation
because the programming language doesn't care to support it. The
resulting friction is considerable.

You seem to attribute particular value to personal anecdotes, so
here's one of mine: I personally know several prestigious
universities in Europe and Asia which teach programming using
Java and/or C with identifiers being in an English-alphabet
representation of the native non-English language. Using the
English language for identifiers is usually a sanctioned
alternative, but not the primary modus operandi. I also know
several professional programmers using their native non-English
language for identifiers in production code.


May 29 2013
Marco Leise <Marco.Leise gmx.de> writes:
On Mon, 27 May 2013 16:05:46 -0700, Walter Bright <newshound2 digitalmars.com> wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for example:

void main(string[] args) {
int число = 1;
foreach (и; 0..100)
число += и;
writeln(число);
}

Of course, whether that's a good practice is a different story. :)

I've recently come to the opinion that that's a bad idea, and D should not
support it.

I hope that was just a random thought. I knew a teacher who
would give all his methods German names so they are easier to
distinguish from the English Java library methods.
Personally I like to type α instead of alpha for angles, since
that is the identifier you'd expect in math. And everyone
likes "alias ℕ = size_t;", right? :) Déjà vu?

--
Marco

May 29 2013
"qznc" <qznc web.de> writes:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and
D should not
support it.

Why do you think its a bad idea? It makes it such that code
can be in various
languages? Just lack of keyboard support?

Every time I've been to a programming shop in a foreign
country, the developers speak english at work and code in
english. Of course, that doesn't mean that everyone does, but
as far as I can tell the overwhelming bulk is done in english.

Naturally, full Unicode needs to be in strings and comments,
but symbol names? I don't see the point nor the utilty of it.
Supporting such is just pointless complexity to the language.

Once I heard an argument from developers working for banks. They coded
financial concepts with German names (e.g. Vermögen, Bürgschaft),
which sometimes include äöüß. Some of those concepts had no good
translation into English, because they are not used outside of
Germany and the clients prefer the actual names anyway.

May 29 2013
"Entry" <no no.com> writes:
My personal opinion is that code should only be in English.

May 29 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jun 17, 2013 at 11:37:18AM -0700, Sean Kelly wrote:
On Jun 5, 2013, at 6:21 PM, Brad Roberts <braddr puremagic.com> wrote:

On 6/5/13 6:11 PM, Timothee Cour wrote:
currently std.demangle.demangle doesn't work with unicode (see example below)

If we decide to keep allowing unicode symbols (as opposed to just unicode
strings), demangle should address this issue. Will supporting this negatively
impact performance (of both compile time and runtime)?

Likewise, will linkers + other tools (gdb etc) be happy with unicode in
mangled names?

----
struct A {
    int z;
    void foo(int x) {}
    void さいごの果実(int x) {}
    void ªå(int x) {}
}
mangledName!(A.さいごの果実).demangle.writeln;
=> _D4util13demangle_funs1A18さいごの果実MFiZv
----

Filed in bugzilla?

http://d.puremagic.com/issues/show_bug.cgi?id=10393
https://github.com/D-Programming-Language/druntime/pull/524

Do linkers actually support 8-bit symbol names? Or do these have to be
translated into ASCII somehow?

T

--
We've all heard that a million monkeys banging on a million typewriters will
eventually reproduce the entire works of Shakespeare.  Now, thanks to the
Internet, we know this is not true. -- Robert Wilensk

Jun 17 2013
Sean Kelly <sean invisibleduck.org> writes:
On Jun 17, 2013, at 11:47 AM, "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote:

Do linkers actually support 8-bit symbol names? Or do these have to be
translated into ASCII somehow?

Good question.  It looks like the linker on OSX does:

public	_D3abc1A18さいごの果実MFiZv
public	_D3abc1A4ªåMFiZv

The object file linked just fine.  I haven't tried OPTLINK on Win32 though.

Jun 17 2013
Brad Roberts <braddr puremagic.com> writes:
On 6/17/13 11:58 AM, Sean Kelly wrote:
On Jun 17, 2013, at 11:47 AM, "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote:
Do linkers actually support 8-bit symbol names? Or do these have to be
translated into ASCII somehow?

Good question.  It looks like the linker on OSX does:

public	_D3abc1A18さいごの果実MFiZv
public	_D3abc1A4ªåMFiZv

The object file linked just fine.  I haven't tried OPTLINK on Win32 though.

Don't symbol names from dmd/win32 get compressed if they're too long,
resulting in essentially arbitrary random binary data being used as symbol
names?  Assuming my memory on that is correct, then it's already demonstrated
that optlink doesn't care what the data is.

Jun 17 2013
Sean Kelly <sean invisibleduck.org> writes:
On Jun 17, 2013, at 6:28 PM, Brad Roberts <braddr puremagic.com> wrote:

On 6/17/13 11:58 AM, Sean Kelly wrote:
On Jun 17, 2013, at 11:47 AM, "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote:

Do linkers actually support 8-bit symbol names? Or do these have to be
translated into ASCII somehow?

Good question.  It looks like the linker on OSX does:

public	_D3abc1A18さいごの果実MFiZv
public	_D3abc1A4ªåMFiZv

The object file linked just fine.  I haven't tried OPTLINK on Win32 though.

Don't symbol names from dmd/win32 get compressed if they're too long,
resulting in essentially arbitrary random binary data being used as
symbol names?  Assuming my memory on that is correct, then it's already
demonstrated that optlink doesn't care what the data is.

Yes.  So it isn't always possible to fully demangle really long symbol
names.  This is not terribly difficult to hit using templates,
especially if they take string arguments.

Jun 19 2013
Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 29 May 2013 15:44:17 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

I still think it's a bad idea, but it's obvious people want it in D, so it'll
stay.

(Also note that I meant using ASCII, not necessarily english.)

Surprisingly ASCII also covers Cornish and Malay.

--
Marco

May 29 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

May 29 2013
"Oleg Kuporosov" <Oleg.Kuporosov gmail.com> writes:
On Wednesday, 29 May 2013 at 22:44:17 UTC, Walter Bright wrote:
I still think it's a bad idea, but it's obvious people want it
in D, so it'll stay.

(Also note that I meant using ASCII, not necessarily english.)

Good, thanks. Restrictions definitely can and should be applied
per project, like for druntime/Phobos.

May 29 2013
"Entry" <no no.com> writes:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and variable
names in various human languages (Swedish, Russian), but be
accompanied by a similarly 'unified' language that we all know -
English. It is only my opinion though and I wouldn't force it
upon anyone.

May 30 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and variable
names in various human languages (Swedish, Russian), but be
accompanied by a similarly 'unified' language that we all know
- English. It is only my opinion though and I wouldn't force it
upon anyone.

But programming IS a human tool, and thus, subject to human
language.

Also, I don't see how a programming language is any more unified
than, say, a library.

While you wouldn't force it on anyone, would it also be your
opinion that putting a French book in a french library be a
sabotage of the world's librarial institutions?

May 30 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 29 May 2013 at 22:42:08 UTC, Walter Bright wrote:
On 5/29/2013 3:26 AM, qznc wrote:
Once I heard an argument from developers working for banks. They coded
financial concepts with german names (e.g. Vermögen, Bürgschaft), which
sometimes include äöüß. Some of those concepts had no good translation
into english, because they are not used outside of Germany and the
clients prefer the actual names anyways.

German is pretty easy to do in ASCII: Vermoegen and Buergschaft
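A minimal sketch of the ASCII fallback Walter is referring to (the mapping table and function name are my own illustration, not from the thread):

```python
# Standard German ASCII fallbacks: ae/oe/ue for umlauts, ss for sharp s.
# (Illustrative sketch; the names here are hypothetical.)
FALLBACK = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
            "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

def asciify(word: str) -> str:
    """Replace German letters outside ASCII with their usual digraphs."""
    return "".join(FALLBACK.get(ch, ch) for ch in word)

print(asciify("Vermögen"))    # Vermoegen
print(asciify("Bürgschaft"))  # Buergschaft
```

This only works because German has a conventional digraph spelling; as the reply below notes, no such convention exists for Chinese, Russian, or Japanese.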

What about Chinese? Russian? Japanese? It is doable, but I can
tell you for a fact that they very much don't like reading it
that way.

You know, having done programming in Japan, I know that a lot of
devs simply don't care for english, and they'd really enjoy just
being able to code in Japanese. I can't speak for the other
countries, but I'm sure that large but not spread out countries
like China would also just *love* to be able to code in 100%
Mandarin (I'd say they wouldn't care much for English either).

I think this possibility is actually a brilliant feature that
could help popularize the language overseas, especially in
teaching courses, or the private sector. Why turn down a
feature that makes us popular?

As for research/university, I think they are already global
enough to stick to English anyways.

No matter how I see it, I can only see benefits to keeping it,
and downsides to turning it down.

May 30 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Thu, 30 May 2013 11:36:42 +0200, monarch_dodra
<monarchdodra gmail.com> wrote:

On Wednesday, 29 May 2013 at 22:42:08 UTC, Walter Bright wrote:
On 5/29/2013 3:26 AM, qznc wrote:
Once I heard an argument from developers working for banks. They coded
financial concepts with german names (e.g. Vermögen, Bürgschaft), which
sometimes include äöüß. Some of those concepts had no good translation
into english, because they are not used outside of Germany and the
clients prefer the actual names anyways.

German is pretty easy to do in ASCII: Vermoegen and Buergschaft

What about Chinese? Russian? Japanese? It is doable, but I can tell you
for a fact that they very much don't like reading it that way.

You know, having done programming in Japan, I know that a lot of devs
simply don't care for english, and they'd really enjoy just being able
to code in Japanese. I can't speak for the other countries, but I'm sure
that large but not spread out countries like China would also just
*love* to be able to code in 100% Mandarin (I'd say they wouldn't care
much for English either).

I think this possibility is actually a brilliant feature that could help
popularize the language overseas, especially in teaching courses, or the
private sector. Why turn down a feature that makes us popular?

As for research/university, I think they are already global enough to
stick to English anyways.

No matter how I see it, I can only see benefits to keeping it, and
downsides to turning it down.

Now if only we had the C preprocessor:

#define 如果 if
#define 直到 while

(Note: this is what Google Translate told me was good. I do not speak,

--
Simen

May 30 2013
"Dicebot" <m.strashun gmail.com> writes:
On Thursday, 30 May 2013 at 09:36:43 UTC, monarch_dodra wrote:
What about Chinese? Russian? Japanese? It is doable, but I can
tell you for a fact that they very much don't like reading it
that way.

You know, having done programming in Japan, I know that a lot
of devs simply don't care for english, and they'd really enjoy
just being able to code in Japanese. I can't speak for the
other countries, but I'm sure that large but not spread out
countries like China would also just *love* to be able to code
in 100% Mandarin (I'd say they wouldn't care much for English
either).

What about the poor guys from other countries who will support
that project afterward? English is a de-facto standard language
for programming for a good reason.

May 30 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 10:13:46 UTC, Dicebot wrote:
On Thursday, 30 May 2013 at 09:36:43 UTC, monarch_dodra wrote:
What about Chinese? Russian? Japanese? It is doable, but I can
tell you for a fact that they very much don't like reading it
that way.

You know, having done programming in Japan, I know that a lot
of devs simply don't care for english, and they'd really enjoy
just being able to code in Japanese. I can't speak for the
other countries, but I'm sure that large but not spread out
countries like China would also just *love* to be able to code
in 100% Mandarin (I'd say they wouldn't care much for English
either).

What about the poor guys from other countries who will support
that project afterward? English is a de-facto standard language
for programming for a good reason.

Well... defacto: "in practice but not necessarily ordained by
law".

Besides, even in english, there are use cases for unicode. Such
as math (Greek symbols).

And even if you are coding in english, that doesn't mean you
can't be working on a region-specific project that requires the
identifiers to have region-specific names (AKA, the German
banking reference).

Finally, english does have a few (albeit rare) words that can't
be expressed with ASCII. For example: Möbius. Sure, you can write
it "Mobius", but why settle for wrong, when you can have right?

--------

I'm saying that even if I agree that code should be in English
(which I don't completely agree with), it's still not a strong
argument against unicode in identifiers. In this day and age, it
seems as arbitrary to me as requiring lines to not exceed 80
chars. That kind of shit belongs in a coding standard.
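As a rough illustration of the use cases listed above (Python 3, which like D permits Unicode identifiers; the particular names and values are invented for the example):

```python
# Unicode identifiers cover all three cases monarch_dodra lists:
π = 3.141592653589793        # Greek symbols for math
Bürgschaft = 50_000          # region-specific German banking term
Möbius_band_sides = 1        # "right" spelling rather than "Mobius"

circumference = 2 * π        # reads like the textbook formula
print(circumference, Bürgschaft, Möbius_band_sides)
```

Whether such names are a good idea on a given team is, as the post argues, a coding-standard question, not a language question.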

May 30 2013
Manu <turkeyman gmail.com> writes:

On 30 May 2013 18:32, Entry <no no.com> wrote:

On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams wrote:

On 30/05/13 08:40, Entry wrote:

My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified language (D)
should not be sabotaged by comments and variable names in various human
languages (Swedish, Russian), but be accompanied by a similarly 'unified'
language that we all know - English. It is only my opinion though and I
wouldn't force it upon anyone.

We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance, it
almost always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that power
an industry worth 10s of billions.


May 30 2013
Manu <turkeyman gmail.com> writes:

On 30 May 2013 20:13, Dicebot <m.strashun gmail.com> wrote:

On Thursday, 30 May 2013 at 09:36:43 UTC, monarch_dodra wrote:

What about Chinese? Russian? Japanese? It is doable, but I can tell you
for a fact that they very much don't like reading it that way.

You know, having done programming in Japan, I know that a lot of devs
simply don't care for english, and they'd really enjoy just being able to
code in Japanese. I can't speak for the other countries, but I'm sure that
large but not spread out countries like China would also just *love* to be
able to code in 100% Mandarin (I'd say they wouldn't care much for English
either).

What about the poor guys from other countries who will support that project
afterward? English is a de-facto standard language for programming for a good
reason.

Have you ever worked on code written by people who barely speak English?
Even if they write English words, that doesn't make it 'English', or any
easier to understand. And people often tend to just transliterate into
latin, which is kinda pointless too, how does that help?


May 30 2013
"Dicebot" <m.strashun gmail.com> writes:
On Thursday, 30 May 2013 at 11:29:47 UTC, Manu wrote:
Have you ever worked on code written by people who barely speak
English?
Even if they write English words, that doesn't make it
'English', or any
easier to understand. And people often tend to just
transliterate into
latin, which is kinda pointless too, how does that help?

I have had comments with Finnish poetry in code I was responsible
for supporting :( No need to provide the means that suggest such
an approach is the way to go.

May 30 2013
"Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 09:29:43 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams
wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and variable
names in various human languages (Swedish, Russian), but be
accompanied by a similarly 'unified' language that we all know
- English. It is only my opinion though and I wouldn't force
it upon anyone.

But programming IS a human tool, and thus, subject to human
language.

Also, I don't see how a programming language is any more
unified than, say, a library.

While you wouldn't force it on anyone, would it also be your
opinion that putting a French book in a french library be a
sabotage of the world's librarial institutions?

What a way to attack a straw-man and completely miss the point at
the same time.

May 30 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 13:12:17 UTC, Entry wrote:
On Thursday, 30 May 2013 at 09:29:43 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams
wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and variable
names in various human languages (Swedish, Russian), but be
accompanied by a similarly 'unified' language that we all
know - English. It is only my opinion though and I wouldn't
force it upon anyone.

But programming IS a human tool, and thus, subject to human
language.

Also, I don't see how a programming language is any more
unified than, say, a library.

While you wouldn't force it on anyone, would it also be your
opinion that putting a French book in a french library be a
sabotage of the world's librarial institutions?

What a way to attack a straw-man and completely miss the point
at the same time.

Fine.

In that case, I'll retort by saying that your use of the 'unified'

My retort was not correctly expressed, but I don't see how D is
"unified". I thought it was just a tool to create programs.

May 30 2013
"Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 13:52:09 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 13:12:17 UTC, Entry wrote:
On Thursday, 30 May 2013 at 09:29:43 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams
wrote:
On 30/05/13 08:40, Entry wrote:
My personal opinion is that code should only be in English.

But why would you want to impose this restriction on others?

Peter

I wouldn't say impose. I'd say that programming in a unified
language (D) should not be sabotaged by comments and
variable names in various human languages (Swedish,
Russian), but be accompanied by a similarly 'unified'
language that we all know - English. It is only my opinion
though and I wouldn't force it upon anyone.

But programming IS a human tool, and thus, subject to human
language.

Also, I don't see how a programming language is any more
unified than, say, a library.

While you wouldn't force it on anyone, would it also be your
opinion that putting a French book in a french library be a
sabotage of the world's librarial institutions?

What a way to attack a straw-man and completely miss the point
at the same time.

Fine.

In that case, I'll retort by saying that your use of the

My retort was not correctly expressed, but I don't see how D is
"unified". I thought it was just a tool to create programs.

Take a minute to think about why we're all communicating in
English here. Let's see if you can figure it out. I just think
that it's better to focus on two very specific languages with two
very specific purposes (D for programming and English for
communication). 'Twas just an idea, I don't care if you write
your code in hieroglyphs.

May 30 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 14:13:47 UTC, Entry wrote:
Take a minute to think about why we're all communicating in
English here. Let's see if you can figure it out.

Well that's condescending :/ and fallacious.

To answer your question, it may have something to do with the
fact that these are the English forums? Just a wild hunch. Oh.
And because we *can* speak English? That could also have
something to do with it.

There are tons of non-English speaking programming forums out
there. Maybe those that don't speak English are over there? Heck,
there are a few non-English threads in learn.

Oh. And did you know TDPL was published in Japanese? Why bother
right?

I just think that it's better to focus on two very specific
languages with two very specific purposes (D for programming
and English for communication). 'Twas just an idea, I don't
care if you write your code in hieroglyphs.

I really really agree with you.

Yet, I think they are orthogonal concepts, and that the D
programming language has no business choosing which communication
vector its users should use.

It's not just a matter (imo) of "I wouldn't force it upon
anyone", but "I think everyone should choose what's best for
them".

Yeah. I know. Same conclusion, but there is a nuance.

May 30 2013
"Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 29 May 2013 at 22:44:17 UTC, Walter Bright wrote:
(Also note that I meant using ASCII, not necessarily english.)

I don't understand the logic behind this. Surely this is the
worst combination; severely crippled ability to use non-English
languages (yes, even for European languages), yet non-speakers of
those languages still don't have a clue what it means.

May 30 2013
"Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 14:49:12 UTC, monarch_dodra wrote:
On Thursday, 30 May 2013 at 14:13:47 UTC, Entry wrote:
Take a minute to think about why we're all communicating in
English here. Let's see if you can figure it out.

Well that's condescending :/ and fallacious.

To answer your question, it may have something to do with the
fact that these are the English forums? Just a wild hunch. Oh.
And because we *can* speak English? That could also have
something to do with it.

There are tons of non-English speaking programming forums out
there. Maybe those that don't speak English are over there?
Heck, there are a few non-English threads in learn.

Oh. And did you know TDPL was published in Japanese? Why bother
right?

I just think that it's better to focus on two very specific
languages with two very specific purposes (D for programming
and English for communication). 'Twas just an idea, I don't
care if you write your code in hieroglyphs.

I really really agree with you.

Yet, I think they are orthogonal concepts, and that the D
programming language has no business choosing which
communication vector its users should use.

It's not just a matter (imo) of "I wouldn't force it upon
anyone", but "I think everyone should choose what's best for
them".

Yeah. I know. Same conclusion, but there is a nuance.

I'm glad you agree, though I believe that I never said anything
about D 'choosing' which human languages are compatible with it.
I just expressed my belief that should people choose to construct
something, be it a ship or a computer program, the usage of a
single language will greatly enhance their progress (ever heard
the story of the Tower of Babel? wink wink). Sorry if my previous
comment seemed hostile, that was not my intention.

May 30 2013
"Jakob Ovrum" <jakobovrum gmail.com> writes:
On Thursday, 30 May 2013 at 15:48:12 UTC, Entry wrote:
I'm glad you agree, though I believe that I never said anything
about D 'choosing' which human languages are compatible with
it. I just expressed my belief that should people choose to
construct something, be it a ship or a computer program, the
usage of a single language will greatly enhance their progress
(ever heard the story of the Tower of Babel? wink wink). Sorry
if my previous comment seemed hostile, that was not my
intention.

If the programmers who are going to be working on that code don't
understand the "Single Language", then what use is it?

May 30 2013
"Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 16:05:13 UTC, Jakob Ovrum wrote:
On Thursday, 30 May 2013 at 15:48:12 UTC, Entry wrote:
I'm glad you agree, though I believe that I never said
anything about D 'choosing' which human languages are
compatible with it. I just expressed my belief that should
people choose to construct something, be it a ship or a
computer program, the usage of a single language will greatly
enhance their progress (ever heard the story of the Tower of
Babel? wink wink). Sorry if my previous comment seemed
hostile, that was not my intention.

If the programmers who are going to be working on that code
don't understand the "Single Language", then what use is it?

Then there's no helping it. Though I wonder what kind of a
programmer doesn't understand English enough to at least read the

May 30 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 31/05/13 05:07, Walter Bright wrote:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Peter

May 30 2013
Manu <turkeyman gmail.com> writes:

On 31 May 2013 05:07, Walter Bright <newshound2 digitalmars.com> wrote:

On 5/30/2013 4:24 AM, Manu wrote:

We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost always looks like this:

{
// E: I like cake.
// J: ケーキが好きです。
player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

But that doesn't make it English, or any more readable...
The only benefit to forcing users to use ASCII is that everyone can
physically type it. But that comes with disadvantages:
 1. It's not natural to type a word that you don't know what it is or how
to spell; you'll end up copy-pasting anyway rather than trying to
remember/copy it letter by letter and risk misspelling.
 2. It's less natural for the people who CAN read it, because they have to
mentally transliterate too. (And if they're kids/amateurs who don't even
know the latin alphabet?)

Ie, it serves neither party to force someone who doesn't speak English to
write ASCII. Add that to the points I (and others) made earlier about
education, or children learning to code. There's no compelling reason to
force identifiers in ASCII.
Currently, D offers a unique advantage; leave it that way.


May 30 2013
Manu <turkeyman gmail.com> writes:

On 31 May 2013 01:48, Entry <no no.com> wrote:

On Thursday, 30 May 2013 at 14:49:12 UTC, monarch_dodra wrote:

On Thursday, 30 May 2013 at 14:13:47 UTC, Entry wrote:

Take a minute to think about why we're all communicating in English
here. Let's see if you can figure it out.

Well that's condescending :/ and fallacious.

To answer your question, it may have something to do with the fact that
these are the English forums? Just a wild hunch. Oh. And because we *can*
speak English? That could also have something to do with it.

There are tons of non-English speaking programming forums out there.
Maybe those that don't speak English are over there? Heck, there are a few
non-English threads in learn.
Oh. And did you know TDPL was published in Japanese? Why bother right?

I just think that it's better to focus on two very specific languages
with two very specific purposes (D for programming and English for
communication). 'Twas just an idea, I don't care if you write your code in
hieroglyphs.

I really really agree with you.

Yet, I think they are orthogonal concepts, and that the D programming
language has no business choosing which communication vector its users
should use.

It's not just a matter (imo) of "I wouldn't force it upon anyone", but "I
think everyone should choose what's best for them".

Yeah. I know. Same conclusion, but there is a nuance.

I'm glad you agree, though I believe that I never said anything about D
'choosing' which human languages are compatible with it. I just expressed
my belief that should people choose to construct something, be it a ship or
a computer program, the usage of a single language will greatly enhance
their progress (ever heard the story of the Tower of Babel? wink wink).
Sorry if my previous comment seemed hostile, that was not my intention.

This is the definition of a *convention*, not a rule.


May 30 2013
Manu <turkeyman gmail.com> writes:

On 31 May 2013 03:08, Entry <no no.com> wrote:

On Thursday, 30 May 2013 at 16:05:13 UTC, Jakob Ovrum wrote:

On Thursday, 30 May 2013 at 15:48:12 UTC, Entry wrote:

I'm glad you agree, though I believe that I never said anything about D
'choosing' which human languages are compatible with it. I just expressed
my belief that should people choose to construct something, be it a ship or
a computer program, the usage of a single language will greatly enhance
their progress (ever heard the story of the Tower of Babel? wink wink).
Sorry if my previous comment seemed hostile, that was not my intention.

If the programmers who are going to be working on that code don't
understand the "Single Language", then what use is it?

Then there's no helping it. Though I wonder what kind of a programmer
doesn't understand English enough to at least read the code and comments.

A child, or a student.


May 30 2013
Manu <turkeyman gmail.com> writes:

On 31 May 2013 10:00, Peter Williams <pwil3058 bigpond.net.au> wrote:

On 31/05/13 05:07, Walter Bright wrote:

On 5/30/2013 4:24 AM, Manu wrote:

We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
    // E: I like cake.
    です。
    player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Because they had no choice.

Indeed, and believe me, the variable names can often make NO sense, or
worse, they're misunderstood and quite misleading.
Ie, you think a variable is something, but you realise it's the inverse, or
just something completely different.


May 30 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Fri, 31 May 2013 07:57:37 +0200, Walter Bright <newshound2 digitalmars.com> wrote:

On 5/30/2013 5:00 PM, Peter Williams wrote:
On 31/05/13 05:07, Walter Bright wrote:
On 5/30/2013 4:24 AM, Manu wrote:
We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
    // E: I like cake.
    です。
    player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Not true, D supports Unicode identifiers.

I doubt Sony and Nintendo use D extensively.

--
Simen

May 31 2013
Timothee Cour <thelastmammoth gmail.com> writes:

On Thu, May 30, 2013 at 10:57 PM, Walter Bright
<newshound2 digitalmars.com>wrote:

On 5/30/2013 5:00 PM, Peter Williams wrote:

On 31/05/13 05:07, Walter Bright wrote:

On 5/30/2013 4:24 AM, Manu wrote:

We don't all know English. Plenty of people don't.
I've worked a lot with Sony and Nintendo code/libraries, for instance,
it almost
always looks like this:

{
    // E: I like cake.
    です。
    player.eatCake();
}

Clearly someone doesn't speak English in these massive codebases that
power an
industry worth 10s of billions.

Sure, but the code itself is written using ASCII!

Not true, D supports Unicode identifiers.

currently std.demangle.demangle doesn't work with unicode (see example
below)

If we decide to keep allowing unicode symbols (as opposed to just unicode
address this issue. Will supporting this negatively impact performance (of
both compile time and runtime) ?

Likewise, will linkers + other tools (gdb etc) be happy with unicode in
mangled names?

----
struct A{
    int z;
    void foo(int x){}
    void さいごの果実(int x){}
    void ªå(int x){}
}
mangledName!(A.さいごの果実).demangle.writeln;
=> _D4util13demangle_funs1A18さいごの果実MFiZv
----
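A side note on the "18" in the mangled name above: judging by the example's own output, D's length prefixes count UTF-8 code units, not characters. さいごの果実 is six code points but eighteen bytes. A quick check (Python shown purely for illustration, since the property is about the encoding rather than about D):

```python
name = "さいごの果実"          # the identifier from the example above
utf8 = name.encode("utf-8")    # the raw bytes that end up in the mangled name

print(len(name))   # 6 code points
print(len(utf8))   # 18 bytes -- the "18" prefix in _D4util13demangle_funs1A18...
```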


Jun 05 2013
On 6/5/13 6:11 PM, Timothee Cour wrote:
currently std.demangle.demangle doesn't work with unicode (see example below)

If we decide to keep allowing unicode symbols (as opposed to just unicode
address this issue. Will supporting this negatively impact performance (of
both compile time and
runtime) ?

Likewise, will linkers + other tools (gdb etc) be happy with unicode in
mangled names?

----
struct A{
    int z;
    void foo(int x){}
    void さいごの果実(int x){}
    void ªå(int x){}
}
mangledName!(A.さいごの果実).demangle.writeln;
=> _D4util13demangle_funs1A18さいごの果実MFiZv
----

Filed in bugzilla?
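As a point of comparison with Walter's remark that D supports Unicode identifiers: D is not alone here. This sketch (Python 3 shown for illustration, not part of the thread's D code) uses the same identifier from the example above:

```python
# Python 3 also accepts non-ASCII identifiers, so the name from the
# D example above is perfectly legal here as well:
def さいごの果実(x):
    return x + 1

print(さいごの果実(41))  # 42
```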

Jun 05 2013
Sean Kelly <sean invisibleduck.org> writes:
On Jun 5, 2013, at 6:21 PM, Brad Roberts <braddr puremagic.com> wrote:

On 6/5/13 6:11 PM, Timothee Cour wrote:
currently std.demangle.demangle doesn't work with unicode (see example
below)

If we decide to keep allowing unicode symbols (as opposed to just unicode
address this issue. Will supporting this negatively impact performance
(of both compile time and runtime) ?

Likewise, will linkers + other tools (gdb etc) be happy with unicode in
mangled names?

----
struct A{
    int z;
    void foo(int x){}
    void さいごの果実(int x){}
    void ªå(int x){}
}
mangledName!(A.さいごの果実).demangle.writeln;
=> _D4util13demangle_funs1A18さいごの果実MFiZv
----

Filed in bugzilla?

http://d.puremagic.com/issues/show_bug.cgi?id=10393
https://github.com/D-Programming-Language/druntime/pull/524

Jun 17 2013
"Kagamin" <spam here.lot> writes:
On Thursday, 30 May 2013 at 11:29:47 UTC, Manu wrote:
Have you ever worked on code written by people who barely speak
English?

I did. It's better than having a mixture of languages like here:
assert(length == dizgi.length); - in one expression!
property Yazı küçüğü() const - property? const? küçüğü?

BTW I don't speak English myself, and D code doesn't comprise
English either. How well do you have to know English to use one
word to name a variable "player"? And I believe everyone who
learned math know latin alphabet.

Unicode identifiers allow for typos, which can't be detected
visually. For example greek and cyrillic alphabets have letters
indistinguishable from ASCII so they can sneak into ASCII text
and you won't see it. You can also have more fun with heuristic
language switchers.

Try to find a problem in this code:
------
class c
{
    void Сlose(){}
}

int main()
{
    c obj = new c;
    obj.Close();
    return 0;
}
------

That's an actual issue I had with C# in industrial code. And I
believe no one has checked Phobos for such errors.
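(Spoiler for the snippet above: the first letter of `Сlose` is U+0421 CYRILLIC CAPITAL LETTER ES, not Latin `C`, so `obj.Close()` refers to a method that was never declared.) A tooling check for such homoglyphs is straightforward; here is a sketch (Python used for illustration), where `suspicious_chars` is a hypothetical helper, not an existing API:

```python
import unicodedata

def suspicious_chars(identifier):
    """List the non-ASCII characters in an identifier with their Unicode names."""
    return [(ch, unicodedata.name(ch)) for ch in identifier if ord(ch) > 127]

# "\u0421lose" is the booby-trapped name from the example above:
print(suspicious_chars("\u0421lose"))  # [('С', 'CYRILLIC CAPITAL LETTER ES')]
print(suspicious_chars("Close"))       # []
```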

I was taught BASIC at school and had no idea I should complain
about latin alphabet even though I didn't learn English back then.

Jun 27 2013
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
Every time I've been to a programming shop in a foreign
country, the developers speak english at work and code in
english. Of course, that doesn't mean that everyone does, but
as far as I can tell the overwhelming bulk is done in english.

The OOo codebase is historically mostly in German. They try to reduce
the amount of German in the codebase with each new version.

Some massive codebases are non-English.

Naturally, full Unicode needs to be in strings and comments,
but symbol names? I don't see the point nor the utilty of it.
Supporting such is just pointless complexity to the language.

I know this is a crazy idea, but someone told me once that most
people on this planet aren't living in English-speaking
countries. Insane, isn't it?

Jun 27 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-May-2013 02:42, H. S. Teoh writes:
On Sat, May 25, 2013 at 01:21:25AM +0400, Dmitry Olshansky wrote:
24-May-2013 21:05, Joakim writes:

As far as Phobos is concerned, Dmitry's new std.uni module has powerful
code-generation templates that let you write code that operate directly
on UTF-8 without needing to convert to UTF-32 first.

As is there are no UTF-8 specific tables (yet), but there are tools to
create the required abstraction by hand. I plan to grow one for
std.regex that will thus be field-tested and then get into public
interface. In fact the needs of std.regex prompted me to provide more
Unicode stuff in the std.

Well, OK, maybe
we're not quite there yet, but the foundations are in place, and I'm
looking forward to the day when string functions will no longer have
implicit conversion to UTF-32, but will directly manipulate UTF-8 using
optimized state tables generated by std.uni.

Yup, but let's get the correctness part first, then performance ;)

Want small - use compression schemes which are perfectly fine and
get to the precious 1byte per codepoint with exceptional speed.
http://www.unicode.org/reports/tr6/

+1.  Using your own encoding is perfectly fine. Just don't do that for
data interchange. Unicode was created because we *want* a single
standard to communicate with each other without stupid broken encoding
issues that used to be rampant on the web before Unicode came along.

BTW the document linked discusses _standard_ compression so that anybody
can decode that stuff. How you compress would largely affect the
compression ratio but not much beyond it..

In the bad ole days, HTML could be served in any random number of
encodings, often out-of-sync with what the server claims the encoding
is, and browsers would assume arbitrary default encodings that for the
most part *appeared* to work but are actually fundamentally b0rken.
Sometimes webpages would show up mostly-intact, but with a few
characters mangled, because of deviations / variations on codepage
interpretation, or non-standard characters being used in a particular
encoding. It was a total, utter mess, that wasted who knows how many
man-hours of programming time to work around. For data interchange on
the internet, we NEED a universal standard that everyone can agree on.

+1 on these and others :)

--
Dmitry Olshansky

May 24 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, May 25, 2013 at 03:47:41PM +0200, Joakim wrote:
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote:
On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines,
tabs and spaces), it makes no difference whether the string is
in ASCII or UTF-8 - the code will behave correctly in either
case, variable-width-encodings regardless.

while splitting, when compared to a single-byte encoding.

No. Are you sure you understand UTF-8 properly?

check every single character to test for whitespace, but the
single-byte encoding simply has to load each byte in the string and
compare it against the whitespace-signifying bytes, while the
variable-length code has to first load and parse potentially 4 bytes
before it can compare, because it has to go through the state
machine that you linked to above.  Obviously the constant-width
encoding will be faster.  Did I really need to explain this?

Have you actually tried to write a whitespace splitter for UTF-8? Do you
realize that you can use an ASCII whitespace splitter for UTF-8 and it
will work correctly?

There is no need to decode UTF-8 for whitespace splitting at all. There
is no need to parse anything. You just iterate over the bytes and split
on 0x20. There is no performance difference over ASCII.
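The claim is easy to demonstrate on raw bytes: in UTF-8 every byte of a multi-byte sequence is >= 0x80, so the ASCII space byte 0x20 can never occur inside an encoded character. A minimal sketch (Python used for illustration; the property belongs to the encoding, not the language):

```python
text = "日本語 deutsch ß"       # mixed-script sample
utf8 = text.encode("utf-8")

# Split on the raw ASCII space byte -- no decoding involved.
pieces = utf8.split(b"\x20")

# Each piece is still valid UTF-8, because 0x20 cannot appear
# inside a multi-byte sequence.
print([p.decode("utf-8") for p in pieces])  # ['日本語', 'deutsch', 'ß']
```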

As Dmitry said, UTF-8 is self-synchronizing. While current Phobos code
tries to play it safe by decoding every character, this is not necessary
in many cases.

T

--
The best compiler is between your ears. -- Michael Abrash

May 25 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-May-2013 10:44, Joakim writes:
On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote:
You seem to think that not only UTF-8 is bad encoding but also one

Yes, on the encoding, if it's a variable-length encoding like UTF-8, no,
on the code space.  I was originally going to title my post, "Why
Unicode?" but I have no real problem with UCS, which merely standardized
a bunch of pre-existing code pages.  Perhaps there are a lot of problems
with UCS also, I just haven't delved into it enough to know.  My problem
is with these dumb variable-length encodings, so I was precise in the
title.

UCS is dead and gone. Next in line to "640K is enough for everyone".
Simply put Unicode decided to take into account all diversity of
languages instead of ~80% of these. Hard to add anything else. No offense
meant but it feels like you actually live in universe that is 5-7 years
behind current state. UTF-16 (a successor to UCS) is no random-access
either. And it's shitty beyond measure, UTF-8 is a shining gem in
comparison.

Separate code spaces were the case before Unicode (and utf-8). The
problem is not only that without header text is meaningless (no easy
slicing) but the fact that encoding of data after header strongly
depends a variety of factors -  a list of encodings actually. Now
everybody has to keep a (code) page per language to at least know if
it's 2 bytes per char or 1 byte per char or whatever. And you still
work on a basis that there is no combining marks and regional specific
stuff :)

Legacy. Hard to switch overnight. There are graphs that indicate that
few years from now you might never encounter a legacy encoding anymore,
only UTF-8/UTF-16.

Does
UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per
char or whatever?"

It's coherent in its scheme to determine that. You don't need extra
information synced to text unlike header stuff.

It has to do that also. Everyone keeps talking about
"easy slicing" as though UTF-8 provides it, but it doesn't.  Phobos
turns UTF-8 into UTF-32 internally for all that ease of use, at least
doubling your string size in the process.  Correct me if I'm wrong, that
was what I read on the newsgroup sometime back.

Indeed you are - searching for UTF-8 substring in UTF-8 string doesn't
do any decoding and it does return you a slice of a balance of original.
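Dmitry's point rests on the same self-synchronization property: a valid UTF-8 sequence can only match another valid sequence at a character boundary, so a plain byte-level search finds exactly the right positions. A sketch (Python on raw bytes, for illustration):

```python
haystack = "каждый охотник желает знать".encode("utf-8")
needle = "охотник".encode("utf-8")

# Plain byte search -- neither string is decoded.
pos = haystack.find(needle)

# The result is a byte offset that necessarily falls on a character
# boundary, so slicing the haystack there is safe.
print(pos)                                              # 13
print(haystack[pos:pos + len(needle)].decode("utf-8"))  # охотник
```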

In fact it was even "better" nobody ever talked about header they just
assumed a codepage with some global setting. Imagine yourself creating
a font rendering system these days - a hell of an exercise in
frustration (okay how do I render 0x88 ? mm if that is in codepage XYZ
then ...).

there before UCS standardized them, but that is a completely different
argument than my problem with UTF-8 and variable-length encodings.  My
proposed simple, header-based, constant-width encoding could be
implemented with UCS and there go all your arguments about random code
pages.

No they don't - have you ever seen native Korean or Chinese codepages?
that there is no single sane way to deal with it on cross-locale basis
(that you simply ignore as noted below).

This just shows you don't care for multilingual stuff at all. Imagine
any language tutor/translator/dictionary on the Web. For instance most
languages need to intersperse ASCII (also keep in mind e.g. HTML
markup). Books often feature citations in native language (or e.g.
latin) along with translations.

alternate encoding.

??? Simply makes no sense. There is no intersection between some legacy
encodings as of now. Or do you want to add N*(N-1) cross-encodings for
any combination of 2? What about 3 in one string?

Now also take into account math symbols, currency symbols and beyond.
Also these days cultures are mixing in wild combinations so you might
need to see the text even if you can't read it. Unicode is not only
"encode characters from all languages". It needs to address universal
representation of symbolics used in writing systems at large.

no reason why UTF-8 is a better encoding for that purpose than the kind
of simple encoding I've suggested.

We want monoculture! That is to understand each without all these
"par-le-vu-france?" and codepages of various complexity(insanity).

codepage in the middle of the night. ;)

So you never had trouble of internationalization? What languages do you
use (read/speak/etc.)?
That said, you could standardize
on UCS for your code space without using a bad encoding like UTF-8, as I
said above.

UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into
that trap (Java, Windows NT). You shouldn't.

Want small - use compression schemes which are perfectly fine and get
to the precious 1byte per codepoint with exceptional speed.
http://www.unicode.org/reports/tr6/

simply adds a header and then uses a single-byte encoding, exactly what
I suggested! :)

This is it but it's far more flexible in a sense that it allows
multilingual strings just fine and lone full-width Unicode codepoints
as well.

But I get the impression that it's only for sending over
the wire, ie transmision, so all the processing issues that UTF-8
introduces would still be there.

Use mime-type etc. Standards are always a bit stringy and suboptimal,
their acceptance rate is one of chief advantages they have. Unicode has
horrifically large momentum now and not a single organization aside from
them tries to do this dirty work (=i18n).

And borrowing the arguments from from that rant: locale is borked shit
when it comes to encodings. Locales should be used for tweaking visual
like numbers, date display an so on.

locale support in the past, as you and others complain about, doesn't
invalidate the concept.

It's combinatorial blowup and has some stone-walls to hit into. Consider
adding another encoding for "Tuva" for instance. Now you have to add 2*n
conversion routines to match it to other codepages/locales.

Beyond that - there are many things to consider in internationalization
and you would have to special case them all by codepage.

If they're screwing up something so simple,
imagine how much worse everyone is screwing up something complex like
UTF-8?

UTF-8 is pretty darn simple. BTW all it does is map [0..10FFFF] to a
sequence of octets. It does it pretty well and compatible with ASCII,
even the little rant you posted acknowledged that. Now you are either
against Unicode as whole or what?
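Dmitry's summary, that UTF-8 merely maps [0..10FFFF] to octets, fits in a few lines of code. Here is a hand-rolled encoder checked against a built-in codec (Python used for illustration; this is a sketch of the standard bit layout, not production code):

```python
def utf8_encode(cp):
    """Encode a single code point by hand, following the UTF-8 bit layout."""
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, up to U+10FFFF
    return bytes([0xF0 | cp >> 18,
                  0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

# Agrees with the built-in codec on samples from each length class:
for cp in (0x41, 0x421, 0x3042, 0x10348):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```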

--
Dmitry Olshansky

May 25 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-May-2013 22:26, Joakim writes:
On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
25-May-2013 10:44, Joakim writes:
Yes, on the encoding, if it's a variable-length encoding like UTF-8, no,
on the code space.  I was originally going to title my post, "Why
Unicode?" but I have no real problem with UCS, which merely standardized
a bunch of pre-existing code pages.  Perhaps there are a lot of problems
with UCS also, I just haven't delved into it enough to know.

UCS is dead and gone. Next in line to "640K is enough for everyone".

which is the backbone of Unicode:

http://en.wikipedia.org/wiki/Universal_Character_Set

You might be thinking of the unpopular UCS-2 and UCS-4 encodings, which
I have never referred to.

Yeah got confused. So sorry about that.

Separate code spaces were the case before Unicode (and utf-8). The
problem is not only that without header text is meaningless (no easy
slicing) but the fact that encoding of data after header strongly
depends a variety of factors -  a list of encodings actually. Now
everybody has to keep a (code) page per language to at least know if
it's 2 bytes per char or 1 byte per char or whatever. And you still
work on a basis that there is no combining marks and regional specific
stuff :)

Legacy. Hard to switch overnight. There are graphs that indicate that
few years from now you might never encounter a legacy encoding
anymore, only UTF-8/UTF-16.

that there's not much of a difference between code pages with 2 bytes
per char and the language character sets in UCS.

You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I take you right you propose to define string as a header that
denotes a set of windows in code space? I still fail to see how that
would scale see below.

It has to do that also. Everyone keeps talking about
"easy slicing" as though UTF-8 provides it, but it doesn't. Phobos
turns UTF-8 into UTF-32 internally for all that ease of use, at least
doubling your string size in the process.  Correct me if I'm wrong, that
was what I read on the newsgroup sometime back.

Indeed you are - searching for UTF-8 substring in UTF-8 string doesn't
do any decoding and it does return you a slice of a balance of original.

changed the subject: slicing does require decoding and that's the use
case you brought up to begin with.  I haven't looked into it, but I
suspect substring search not requiring decoding is the exception for
UTF-8 algorithms, not the rule.

Mm... strictly speaking (let's turn that argument backwards) - what are
algorithms that require slicing say [5..] of string without ever looking at it left to right, searching etc.? ??? Simply makes no sense. There is no intersection between some legacy encodings as of now. Or do you want to add N*(N-1) cross-encodings for any combination of 2? What about 3 in one string? "cross-encodings." We want monoculture! That is to understand each without all these "par-le-vu-france?" and codepages of various complexity(insanity). codepage in the middle of the night. ;) So you never had trouble of internationalization? What languages do you use (read/speak/etc.)? code with the terrible code pages system from the past. I can read and speak multiple languages, but I don't use anything other than English text. Okay then. That said, you could standardize on UCS for your code space without using a bad encoding like UTF-8, as I said above. UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into that trap (Java, Windows NT). You shouldn't. myth. :) Yeah, that was a mishap on my behalf. I think I've seen your 2 byte argument way to often and it got concatenated to UCS forming UCS-2 :) This is it but it's far more flexible in a sense that it allows multi-linguagal strings just fine and lone full-with unicode codepoints as well. the language, which I noted could be done with my scheme, by adding a more complex header, How would it look like? Or how the processing will go? long before you mentioned this unicode compression scheme. It does inline headers or rather tags. That hop between fixed char windows. It's not random-access nor claims to be. But I get the impression that it's only for sending over the wire, ie transmision, so all the processing issues that UTF-8 introduces would still be there. Use mime-type etc. Standards are always a bit stringy and suboptimal, their acceptance rate is one of chief advantages they have. 
Unicode has horrifically large momentum now, and not a single organization aside from them tries to do this dirty work (=i18n).

doesn't help you with string processing, it is only for transmission and is probably fine for that, precisely because it seems to implement some version of my single-byte encoding scheme! You do raise a good point: the only reason we're likely using such a bad encoding in UTF-8 is that nobody else wants to tackle this hairy problem.

Yup, where have you been, say, almost 10 years ago? :)

Consider adding another encoding for "Tuva" for instance. Now you have to add 2*n conversion routines to match it to other codepages/locales. Beyond that - there are many things to consider in internationalization and you would have to special-case them all by codepage.

single-byte encodings, as I have noted above. toUpper is a NOP for a single-byte encoded string with an Asian script; you can't do that with a UTF-8 string.

But you have to check what encoding it's in, and given that not all codepages are that simple to uppercase, some generic algorithm is required.

If they're screwing up something so simple, imagine how much worse everyone is screwing up something complex like UTF-8?

UTF-8 is pretty darn simple. BTW, all it does is map [0..10FFFF] to a sequence of octets. It does it pretty well, and compatibly with ASCII; even the little rant you posted acknowledged that. Now, are you against Unicode as a whole, or what?

There are two parts to Unicode. I don't know enough about UCS, the character set, ;) to be for it or against it, but I acknowledge that a standardized character set may make sense. I am dead set against the UTF-8 variable-width encoding, for all the reasons listed above.

Okay, we are getting somewhere, now that I understand your position; I got myself confused midway there.

On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
25-May-2013 13:05, Joakim writes:
Nobody is talking about going back to code pages.
I'm talking about going to single-byte encodings, which do not imply the problems that you had with code pages way back when.

Problem is, what you outline is isomorphic with code pages. Hence the grief of accumulated experience against them.

the beginning, I have suggested a more complex header that can enable multi-language strings, as one possible solution. I don't think code pages provided that.

The problem is, how would you define an uppercase algorithm for a multilingual string with 3 distinct 256-codespaces (windows)? I bet it won't be pretty.

Well, if somebody got a quest to redefine UTF-8 they *might* come up with something that is a bit faster to decode but shares the same properties. Hardly a life saver anyway.

encoding that is much simpler and more efficient than UTF-8. Programmer productivity is the biggest loss from the complexity of UTF-8, as I've noted before.

I still don't see how your solution scales beyond 256 different codepoints per string (= multiple pages/parts of UCS ;) ).

-- 
Dmitry Olshansky

May 25 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-May-2013 23:51, Joakim writes:
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:

You can map a codepage to a subset of UCS :) That's what they do internally anyway. If I take you right, you propose to define a string as a header that denotes a set of windows in code space? I still fail to see how that would scale - see below.

would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings. This is just off the top of my head, I'm not suggesting it is definitive.

Runs away in horror :) It's a mess even before you've got to the details.

Another point about sometimes using a 2-byte encoding - welcome to the nice world of BigEndian/LittleEndian, i.e. the very trap UTF-16 has stepped into.
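The endianness trap mentioned above is easy to demonstrate: the same text produces different byte sequences under UTF-16LE and UTF-16BE, which is exactly why UTF-16 needs a byte order mark. A quick sketch in Python:

```python
# One string, two incompatible UTF-16 byte sequences.
text = "Aя"  # U+0041 'A', U+044F Cyrillic 'я'

le = text.encode("utf-16-le")
be = text.encode("utf-16-be")
assert le == b"\x41\x00\x4f\x04"
assert be == b"\x00\x41\x04\x4f"
assert le != be

# Plain "utf-16" prepends a BOM so a reader can tell which one it got.
with_bom = text.encode("utf-16")
bom = with_bom[:2]
assert bom in (b"\xff\xfe", b"\xfe\xff")
order = "utf-16-le" if bom == b"\xff\xfe" else "utf-16-be"
assert with_bom[2:].decode(order) == text
```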
-- 
Dmitry Olshansky

May 25 2013

Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings.

Please implement the simple C function strstr() with this simple scheme, and post it here.

http://www.digitalmars.com/rtl/string.html#strstr

May 25 2013

Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 2:51 PM, Walter Bright wrote:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings.

Please implement the simple C function strstr() with this simple scheme, and post it here.
http://www.digitalmars.com/rtl/string.html#strstr

I'll go first. Here's a simple UTF-8 version in C. It's not the fastest way to do it, but at least it is correct:

----------------------------------
#include <stddef.h>   /* size_t, NULL */
#include <string.h>   /* strlen, memcmp */

char *strstr(const char *s1, const char *s2)
{
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *) s1;
    char c2 = *s2;
    while (len2 <= len1)
    {
        if (c2 == *s1)
            if (memcmp(s2, s1, len2) == 0)
                return (char *) s1;
        s1++;
        len1--;
    }
    return NULL;
}
----------------------------------

May 25 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:
26-May-2013 20:54, Vladimir Panteleev writes:
On Sunday, 26 May 2013 at 15:23:33 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 14:37:27 UTC, H. S. Teoh wrote:
IHBT. I've made my position clear: I don't write toy code.

1. Make extraordinary claims
2. Refuse to back up said claims with small examples because "I don't write toy code"
3.
Refuse to back up said claims with elaborate examples because "It will take too long"
4. Use an arrogant tone throughout the thread; imply that you're smarter than the creators of UTF, and the creators and long-time contributors of D (while never contributing code to D yourself)

Result: 70-post thread
Conclusion: Successful troll is successful :)

+1. Result: 71-post thread ;)

-- 
Dmitry Olshansky

May 26 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously so.

Congratulations, you are literally the only person on the Internet who has said so: http://goo.gl/TFhUO

On 5/26/13 1:45 PM, Joakim wrote:
Or it could just be that I'm much smarter than everybody else in this thread, ;) I can't rule it out given the often silly responses I've been getting.

One odd thing about this thread is that it's extremely rare for almost everybody in this forum to rally as one around the same opinion. Usually, whatever the topic, a debate will ensue between two ad-hoc groups. It has become clear that the people involved have gotten too frustrated to have a constructive exchange. I suggest we collectively drop it. What you may want to do is use D's modeling abilities to define a great string type pursuant to your ideas. If it is as good as you believe, it will enjoy use and adoption and everybody will be better off.

Andrei

May 26 2013

Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 1:07 AM, Joakim wrote:
The vast majority of non-English alphabets in UCS can be encoded in a single byte. It is your exceptions that are not relevant.

I suspect the Chinese, Koreans, and Japanese would take exception to being called irrelevant. Good luck with your scheme that can't handle languages written by billions of people!

May 25 2013

"H. S.
Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, May 27, 2013 at 09:59:52PM +0200, Vladimir Panteleev wrote:
On Monday, 27 May 2013 at 02:17:08 UTC, Wyatt wrote:
No hardware required; just a smarter IME.

Perhaps something like the compose key? http://en.wikipedia.org/wiki/Compose_key

I'm already using the compose key. But it only goes so far (I don't think compose key sequences cover all of Unicode). Besides, it's impractical to use compose key sequences to write large amounts of text in some given language; a method of temporarily switching to a different layout is necessary.

T

-- 
The slower you go, the further you'll get. (Russian proverb)

May 27 2013

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to write large amounts of text in some given language; a method of temporarily switching to a different layout is necessary.

I thought the topic was typing the occasional Unicode character to use as an operator in D programs?

May 27 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, May 25, 2013 at 01:21:25AM +0400, Dmitry Olshansky wrote:
24-May-2013 21:05, Joakim writes:
This triggered a long-standing bugbear of mine: why are we using these variable-length encodings at all? Does anybody really care about UTF-8 being "self-synchronizing," ie does anybody actually use that in this day and age? Sure, it's backwards-compatible with ASCII and the vast majority of usage is probably just ASCII, but that means the other languages don't matter anyway. Not to mention taking the valuable 8-bit real estate for English and dumping the longer encodings on everyone else. I'd just use a single-byte header to signify the language and then put the vast majority of languages in a single byte encoding, with the few exceptional languages with more than 256 characters encoded in two bytes.
You seem to think that not only UTF-8 is bad encoding but also one unified encoding (code-space) is bad(?). Separate code spaces were the case before Unicode (and utf-8). The problem is not only that without header text is meaningless (no easy slicing) but the fact that encoding of data after header strongly depends a variety of factors - a list of encodings actually. Now everybody has to keep a (code) page per language to at least know if it's 2 bytes per char or 1 byte per char or whatever. And you still work on a basis that there is no combining marks and regional specific stuff :) I remember those bad ole days of gratuitously-incompatible encodings. I wish those days will never ever return again. You'd get a text file in some unknown encoding, and the only way to make any sense of it was to guess what encoding it might be and hope you get lucky. Not only so, the same language often has multiple encodings, so adding support for a single new language required supporting several new encodings and being able to tell them apart (often with no info on which they are, if you're lucky, or if you're unlucky, with *wrong* encoding type specs -- for example, I *still* get email from outdated systems that claim to be iso-8859 when it's actually KOI8R). Prepending the encoding to the data doesn't help, because it's pretty much guaranteed somebody will cut-n-paste some segment of that data and save it without the encoding type header (or worse, some program will try to "fix" broken low-level code by prepending a default encoding type to everything, regardless of whether it's actually in that encoding or not), thus ensuring nobody will be able to reliably recognize what encoding it is down the road. In fact it was even "better" nobody ever talked about header they just assumed a codepage with some global setting. Imagine yourself creating a font rendering system these days - a hell of an exercise in frustration (okay how do I render 0x88 ? 
mm if that is in codepage XYZ then ...). Not to mention, if the sysadmin changes the default locale settings, you may suddenly discover that a bunch of your text files have become gibberish, because some programs blindly assume that every text file is in the current locale-specified language. I tried writing language-agnostic text-processing programs in C/C++ before the widespread adoption of Unicode. It was a living nightmare. The Posix spec *seems* to promise language-independence with its locale functions, but actually, the whole thing is one big inconsistent and under-specified mess that has many unspecified, implementation-specific behaviours that you can't rely on. The APIs basically assume that you set your locale's language once, and never change it, and every single file you'll ever want to read must be encoded in that particular encoding. If you try to read another encoding, too bad, you're screwed. There isn't even a standard for locale names that you could use to manually switch to inside your program (yes there are de facto conventions, but there *are* systems out there that don't follow it). And many standard library functions are affected by locale settings (once you call setlocale, *anything* could change, like string comparison, output encoding, etc.), making it a hairy mess to get input/output of multiple encodings to work correctly. Basically, you have to write everything manually, because the standard library can't handle more than a single encoding correctly (well, not without extreme amounts of pain, that is). So you're back to manipulating bytes directly. Which means you have to keep large tables of every single encoding you ever wish to support. And encoding-specific code to deal with exceptions with those evil variant encodings that are supposedly the same as the official standard of that encoding, but actually have one or two subtle differences that cause your program to output embarrassing garbage characters every now and then. 
For all of its warts, Unicode fixed a WHOLE bunch of these problems, and made cross-linguistic data sane to handle without pulling out your hair, many times over. And now we're trying to go back to that nightmarish old world again? No way, José! [...] Make your header a little longer and you could handle those also. Yes, it wouldn't be strictly backwards-compatible with ASCII, but it would be so much easier to internationalize. Of course, there's also the monoculture we're creating; love this UTF-8 rant by tuomov, author of one the first tiling window managers for linux: "par-le-vu-france?" and codepages of various complexity(insanity). Yeah, those codepages were an utter nightmare to deal with. Everybody and his neighbour's dog invented their own codepage, sometimes multiple codepages for a single language, all of which are gratuitously incompatible with each other. Every codepage has its own peculiarities and exceptions, and programs have to know how to deal with all of them. Only to get broken again as soon as somebody invents yet another codepage two years later, or creates yet another codepage variant just for the heck of it. If you're really concerned about encoding size, just use a compression library -- they're readily available these days. Internally, the program can just use UTF-16 for the most part -- UTF-32 is really only necessary if you're routinely delving outside BMP, which is very rare. As far as Phobos is concerned, Dmitry's new std.uni module has powerful code-generation templates that let you write code that operate directly on UTF-8 without needing to convert to UTF-32 first. Well, OK, maybe we're not quite there yet, but the foundations are in place, and I'm looking forward to the day when string functions will no longer have implicit conversion to UTF-32, but will directly manipulate UTF-8 using optimized state tables generated by std.uni. 
Want small - use compression schemes which are perfectly fine and get to the precious 1byte per codepoint with exceptional speed. http://www.unicode.org/reports/tr6/ +1. Using your own encoding is perfectly fine. Just don't do that for data interchange. Unicode was created because we *want* a single standard to communicate with each other without stupid broken encoding issues that used to be rampant on the web before Unicode came along. In the bad ole days, HTML could be served in any random number of encodings, often out-of-sync with what the server claims the encoding is, and browsers would assume arbitrary default encodings that for the most part *appeared* to work but are actually fundamentally b0rken. Sometimes webpages would show up mostly-intact, but with a few characters mangled, because of deviations / variations on codepage interpretation, or non-standard characters being used in a particular encoding. It was a total, utter mess, that wasted who knows how many man-hours of programming time to work around. For data interchange on the internet, we NEED a universal standard that everyone can agree on. http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06 The emperor has no clothes, what am I missing? And borrowing the arguments from from that rant: locale is borked shit when it comes to encodings. Locales should be used for tweaking visual like numbers, date display an so on. I found that rant rather incoherent. I didn't find any convincing arguments as to why we should return to the bad old scheme of codepages and gratuitous complexity, just a lot of grievances about why monoculture is "bad" without much supporting evidence. UTF-8, for all its flaws, is remarkably resilient to mangling -- you can cut-n-paste any byte sequence and the receiving end can still make some sense of it. Not like the bad old days of codepages where you just get one gigantic block of gibberish. 
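That resilience is mechanical rather than accidental: every UTF-8 continuation byte matches 0b10xxxxxx, so a decoder handed a fragment cut mid-character can skip forward to the next lead byte and resume cleanly. A minimal sketch of such resynchronization (Python for brevity):

```python
def resync(fragment: bytes) -> str:
    """Recover readable text from a UTF-8 fragment cut at arbitrary
    byte offsets: skip orphaned continuation bytes (0b10xxxxxx) at the
    start; errors="ignore" drops a character truncated at the end."""
    i = 0
    while i < len(fragment) and (fragment[i] & 0xC0) == 0x80:
        i += 1
    return fragment[i:].decode("utf-8", errors="ignore")

data = "résumé 履歴書".encode("utf-8")

# Cut inside the 2-byte 'é': only that one character is lost.
assert resync(data[2:]) == "sumé 履歴書"
# Truncate inside the trailing 3-byte '書': again one character lost.
assert resync(data[:-1]) == "résumé 履歴"
# An intact buffer passes through unchanged.
assert resync(data) == "résumé 履歴書"
```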
A properly-synchronizing UTF-8 function can still recover legible data, maybe with only a few characters at the ends truncated in the worst case. I don't see how any codepage-based encoding is an improvement over this. T -- There are 10 kinds of people in the world: those who can count in binary, and those who can't.  May 24 2013 "H. S. Teoh" <hsteoh quickfur.ath.cx> writes: On Mon, May 27, 2013 at 12:30:02AM +0200, Torje Digernes wrote: On Sunday, 26 May 2013 at 21:46:38 UTC, H. S. Teoh wrote: On Sun, May 26, 2013 at 11:25:09PM +0200, Kiith-Sa wrote: You mean like http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard ? Whoa! That is exactly what I had in mind!! Pity they don't appear to support Linux, though. :-( T If you want to configure your keyboard so you can type unicode in Linux you should make yourself familiar with xkb, it is not that difficult to work with, but not exactly user friendly either, super user friendly though. Oh, I know *that*. I configured my xkb setup to switch between English and Russian with the unused windows key (I used to have Greek too, but I use it rarely enough that I took it out). It's just that without the dynamic key labels, I have to touch-type, which requires learning each layout as opposed to just looking for the symbol I need on the key labels. And I have yet to figure out a sane way to support *all* of Unicode without making the result unusable -- when I had Greek in the mix, it was already getting cumbersome having to continually hit the windows key repeatedly when alternating between two of the 3 languages. That's simply not scalable to, say, 100 modes. :-P But maybe I'm just missing a really obvious solution. That happens a lot. :-P T -- War doesn't prove who's right, just who's left. 
-- BSD Games' Fortune

May 26 2013

Manu <turkeyman gmail.com> writes:
On 25 May 2013 11:58, Walter Bright <newshound2 digitalmars.com> wrote:
On 5/24/2013 3:42 PM, H. S. Teoh wrote:
I tried writing language-agnostic text-processing programs in C/C++ before the widespread adoption of Unicode.

One of the first, and best, decisions I made for D was it would be Unicode front to back.

Indeed, excellent decision!
So when we define operators for u × v and a · b, or maybe n²? ;)

At the time, Unicode was poorly supported by operating systems and lots of software, and I encountered some initial resistance to it. But I believed Unicode was the inevitable future.

Code pages, Shift-JIS, EBCDIC, etc., should all be terminated with prejudice.

May 24 2013

"Daniel Murphy" <yebblies nospamgmail.com> writes:
"Manu" <turkeyman gmail.com> wrote in message news:mailman.137.1369448229.13711.digitalmars-d puremagic.com...
So when we define operators for u × v and a · b, or maybe n²? ;)

When these have keys on standard keyboards.

May 25 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, May 24, 2013 at 08:45:56PM -0700, Walter Bright wrote:
On 5/24/2013 7:16 PM, Manu wrote:
So when we define operators for u × v and a · b, or maybe n²? ;)

Oh, how I want to do that. But I still think the world hasn't completely caught up with Unicode yet.

That would be most awesome! Though it does raise the issue of how parsing would work, 'cos you either have to assign a fixed precedence to each of these operators (and there are a LOT of them in Unicode!), or allow user-defined operators with custom precedence and associativity, which means nightmare for the parser (it has to adapt itself to new operators as the code is parsed/analysed, which then leads to issues with what happens if two different modules define the same operator with conflicting precedence / associativity).

T

-- 
Spaghetti code may be tangly, but lasagna code is just cheesy.
May 24 2013

"Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote:
You seem to think that not only UTF-8 is a bad encoding but also that one unified encoding (code-space) is bad(?).

UTF-8, yes; the unified code space, no. I was originally going to title my post "Why Unicode?" but I have no real problem with UCS, which merely standardized a bunch of pre-existing code pages. Perhaps there are a lot of problems with UCS also, I just haven't delved into it enough to know. My problem is with these dumb variable-length encodings, so I was precise in the title.

Separate code spaces were the case before Unicode (and utf-8). The problem is not only that without the header the text is meaningless (no easy slicing) but the fact that the encoding of data after the header strongly depends on a variety of factors - a list of encodings, actually. Now everybody has to keep a (code) page per language to at least know if it's 2 bytes per char or 1 byte per char or whatever. And you still work on the basis that there are no combining marks and region-specific stuff :)

Does UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per char or whatever?" It has to do that also. Everyone keeps talking about "easy slicing" as though UTF-8 provides it, but it doesn't. Phobos turns UTF-8 into UTF-32 internally for all that ease of use, at least doubling your string size in the process. Correct me if I'm wrong; that was what I read on the newsgroup some time back.

In fact it was even "better": nobody ever talked about a header, they just assumed a codepage with some global setting. Imagine yourself creating a font rendering system these days - a hell of an exercise in frustration (okay, how do I render 0x88? mm, if that is in codepage XYZ then ...).

out there before UCS standardized them, but that is a completely different argument than my problem with UTF-8 and variable-length encodings.
My proposed simple, header-based, constant-width encoding could be implemented with UCS, and there go all your arguments about random code pages.

This just shows you don't care for multilingual stuff at all. Imagine any language tutor/translator/dictionary on the Web. For instance, most languages need to intersperse ASCII (also keep in mind e.g. HTML markup). Books often feature citations in a native language (or e.g. Latin) along with translations.

alternate encoding.

Now also take into account math symbols, currency symbols and beyond. Also, these days cultures are mixing in wild combinations, so you might need to see the text even if you can't read it. Unicode is not only "encode characters from all languages". It needs to address universal representation of the symbolics used in writing systems at large.

I see no reason why UTF-8 is a better encoding for that purpose than the kind of simple encoding I've suggested.

We want monoculture! That is to understand each other without all these "par-le-vu-france?" and codepages of various complexity (insanity).

screwed-up codepage in the middle of the night. ;)

That said, you could standardize on UCS for your code space without using a bad encoding like UTF-8, as I said above.

Want small - use compression schemes which are perfectly fine and get to the precious 1 byte per codepoint with exceptional speed. http://www.unicode.org/reports/tr6/

scheme simply adds a header and then uses a single-byte encoding, exactly what I suggested! :) But I get the impression that it's only for sending over the wire, ie transmission, so all the processing issues that UTF-8 introduces would still be there.

And borrowing the arguments from that rant: locale is borked shit when it comes to encodings. Locales should be used for tweaking visuals like numbers, date display and so on.

Broken locale support in the past, as you and others complain about, doesn't invalidate the concept.
If they're screwing up something so simple, imagine how much worse everyone is screwing up something complex like UTF-8?  May 24 2013 "Joakim" <joakim airpost.net> writes: On Friday, 24 May 2013 at 22:44:24 UTC, H. S. Teoh wrote: I remember those bad ole days of gratuitously-incompatible encodings. I wish those days will never ever return again. You'd get a text file in some unknown encoding, and the only way to make any sense of it was to guess what encoding it might be and hope you get lucky. Not only so, the same language often has multiple encodings, so adding support for a single new language required supporting several new encodings and being able to tell them apart (often with no info on which they are, if you're lucky, or if you're unlucky, with *wrong* encoding type specs -- for example, I *still* get email from outdated systems that claim to be iso-8859 when it's actually KOI8R). Prepending the encoding to the data doesn't help, because it's pretty much guaranteed somebody will cut-n-paste some segment of that data and save it without the encoding type header (or worse, some program will try to "fix" broken low-level code by prepending a default encoding type to everything, regardless of whether it's actually in that encoding or not), thus ensuring nobody will be able to reliably recognize what encoding it is down the road. compatibility in the process: http://en.wikipedia.org/wiki/Byte_order_mark Well, at the very least adding garbage ASCII data in the front, just as my header would do. ;) For all of its warts, Unicode fixed a WHOLE bunch of these problems, and made cross-linguistic data sane to handle without pulling out your hair, many times over. And now we're trying to go back to that nightmarish old world again? No way, José! single-byte encodings, but using UCS or some other standardized character set to avoid all those incompatible code pages you had to deal with. 
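The "single-byte encodings over UCS" idea can be made concrete, and doing so shows exactly where it strains. Nothing in the thread specifies a format, so the header byte and 256-codepoint-window rule below are purely hypothetical, invented for illustration (Python for brevity):

```python
def window_encode(text: str) -> bytes:
    """Hypothetical scheme: a one-byte header giving the high bits of
    the code points (a 256-point UCS window), then one byte per char.
    Only works while every character shares the same window, and only
    for code points below U+10000."""
    bases = {ord(ch) >> 8 for ch in text}
    if len(bases) != 1:
        raise ValueError("mixed-window string: scheme needs a bigger header")
    return bytes([bases.pop()]) + bytes(ord(ch) & 0xFF for ch in text)

def window_decode(blob: bytes) -> str:
    base = blob[0] << 8
    return "".join(chr(base | b) for b in blob[1:])

# Pure Cyrillic text fits one window: 1 byte/char plus 1 header byte.
russian = "привет"
enc = window_encode(russian)
assert len(enc) == len(russian) + 1
assert window_decode(enc) == russian

# But even one ASCII character (window 0x00) breaks it, which is the
# multi-language objection raised repeatedly in this thread.
try:
    window_encode("привет!")
except ValueError as e:
    print(e)
```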
If you're really concerned about encoding size, just use a compression library -- they're readily available these days. Internally, the program can just use UTF-16 for the most part -- UTF-32 is really only necessary if you're routinely delving outside the BMP, which is very rare.

non-ASCII text. My concerns are the following, in order of importance:

1. Lost programmer productivity due to these dumb variable-length encodings. That is the biggest loss from UTF-8's complexity.
2. Lost speed and memory due to using either an unnecessarily complex variable-length encoding, or because you translated everything to 32-bit UTF-32 to get back to constant width.
3. Lost bandwidth from using a fatter encoding.

As far as Phobos is concerned, Dmitry's new std.uni module has powerful code-generation templates that let you write code that operates directly on UTF-8 without needing to convert to UTF-32 first. Well, OK, maybe we're not quite there yet, but the foundations are in place, and I'm looking forward to the day when string functions will no longer have implicit conversion to UTF-32, but will directly manipulate UTF-8 using optimized state tables generated by std.uni.

constant-width single-byte encoding.

+1. Using your own encoding is perfectly fine. Just don't do that for data interchange. Unicode was created because we *want* a single standard to communicate with each other without the stupid broken encoding issues that used to be rampant on the web before Unicode came along. In the bad ole days, HTML could be served in any random number of encodings, often out of sync with what the server claims the encoding is, and browsers would assume arbitrary default encodings that for the most part *appeared* to work but were actually fundamentally b0rken. Sometimes webpages would show up mostly intact, but with a few characters mangled, because of deviations / variations in codepage interpretation, or non-standard characters being used in a particular encoding.
It was a total, utter mess, that wasted who knows how many man-hours of programming time to work around. For data interchange on the internet, we NEED a universal standard that everyone can agree on. is one of multiple unspecified or _broken_ encodings. Given how difficult UTF-8 is to get right, all you've likely done is replace multiple broken encodings with a single encoding with multiple broken implementations. UTF-8, for all its flaws, is remarkably resilient to mangling -- you can cut-n-paste any byte sequence and the receiving end can still make some sense of it. Not like the bad old days of codepages where you just get one gigantic block of gibberish. A properly-synchronizing UTF-8 function can still recover legible data, maybe with only a few characters at the ends truncated in the worst case. I don't see how any codepage-based encoding is an improvement over this. you ever heard of anyone using it? There is no reason why this kind of limited checking of data integrity should be rolled into the encoding. Maybe this made sense two decades ago when everyone had plans to stream text or something, but nobody does that nowadays. Just put a checksum in your header and you're good to go. Unicode is still a "codepage-based encoding," nothing has changed in that regard. All UCS did is standardize a bunch of pre-existing code pages, so that some of the redundancy was taken out. Unfortunately, the UTF-8 encoding then bloated the transmission format and tempted devs to use this unnecessarily complex format for processing too.  May 25 2013 "Joakim" <joakim airpost.net> writes: On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote: One of the first, and best, decisions I made for D was it would be Unicode front to back. of the few programming languages with such unicode support. This is more a problem with the algorithms taking the easy way than a problem with UTF-8. 
You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed. encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also. That was the go-to solution in the 1980's, they were called "code pages". A disaster. they weren't standardized and often badly implemented. If you used UCS with a single-byte encoding, you wouldn't have that problem. with the few exceptional languages with more than 256 Like those rare languages Japanese, Korean, Chinese, etc. This too was done in the 80's with "Shift-JIS" for Japanese, and some other wacky scheme for Korean, and a third nutburger one for Chinese. languages, because they have more than 256 characters. So there will be no compression gain over UTF-8/16 there, but a big gain in parsing complexity with a simpler encoding, particularly when dealing with multi-language strings. I've had the misfortune of supporting all that in the old Zortech C++ compiler. It's AWFUL. If you think it's simpler, all I can say is you've never tried to write internationalized code with it. because I'm saying "let's go back to single-byte encodings." The two are separate arguments. UTF-8 is heavenly in comparison. Your code is automatically internationalized. It's awesome. because they just don't want to deal with the complexity. Perhaps you can deal with it and don't mind the performance loss, but I suspect you're in the minority.  May 25 2013 "Diggory" <diggsey googlemail.com> writes: I think you are a little confused about what unicode actually is... Unicode has nothing to do with code pages and nobody uses code pages any more except for compatibility with legacy applications (with good reason!). 
Unicode is:
1) A standardised numbering of a large number of characters
2) A set of standardised algorithms for operating on these characters
3) A set of standardised encodings for efficiently encoding sequences
of these characters

You said that phobos converts UTF-8 strings to UTF-32 before operating
on them but that's not true. As it iterates over UTF-8 strings it
iterates over dchars rather than chars, but that's not in any way
inefficient so I don't really see the problem.

Also your complaint that UTF-8 reserves the short characters for the
english alphabet is not really relevant - the characters with longer
encodings tend to be rarer (such as special symbols) or carry more
information (such as chinese characters where the same sentence takes
only about 1/3 the number of characters).

May 25 2013
"Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what unicode actually is...
Unicode has nothing to do with code pages and nobody uses code pages
any more except for compatibility with legacy applications (with good
reason!).

"Unicode is an effort to include all characters from previous code
pages into a single character enumeration that can be used with a
number of encoding schemes... In practice the various Unicode
character set encodings have simply been assigned their own code page
numbers, and all the other code pages have been technically redefined
as encodings for various subsets of Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode

Unicode is:
1) A standardised numbering of a large number of characters
2) A set of standardised algorithms for operating on these characters
3) A set of standardised encodings for efficiently encoding sequences
of these characters

differentiated between UCS (1) and UTF-8 (3).

You said that phobos converts UTF-8 strings to UTF-32 before operating
on them but that's not true.
As it iterates over UTF-8 strings it iterates over dchars rather than chars, but that's not in any way inefficient so I don't really see the problem. dchar : unsigned 32 bit UTF-32 http://dlang.org/type.html Of course that's inefficient, you are translating your whole encoding over to a 32-bit encoding every time you need to process it. Walter as much as said so up above. Also your complaint that UTF-8 reserves the short characters for the english alphabet is not really relevant - the characters with longer encodings tend to be rarer (such as special symbols) or carry more information (such as chinese characters where the same sentence takes only about 1/3 the number of characters). in a single byte. It is your exceptions that are not relevant.  May 25 2013 "Vladimir Panteleev" <vladimir thecybershadow.net> writes: On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote: This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed. encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also. For the record, I noticed that programmers (myself included) that had an incomplete understanding of Unicode / UTF exaggerate this point, and sometimes needlessly assume that their code needs to operate on individual characters (code points), when it is in fact not so - and that code will work just fine as if it was written to handle ASCII. The example Walter quoted (regex - assuming you don't want Unicode ranges or case-insensitivity) is one such case. Another thing I noticed: sometimes when you think you really need to operate on individual characters (and that your code will not be correct unless you do that), the assumption will be incorrect due to the existence of combining characters in Unicode. 
Two of the often-quoted use cases of working on individual code points is calculating the string width (assuming a fixed-width font), and slicing the string - both of these will break with combining characters if those are not accounted for. I believe the proper way to approach such tasks is to implement the respective Unicode algorithms for it, which I believe are non-trivial and for which the relative impact for the overhead of working with a variable-width encoding is acceptable. Can you post some specific cases where the benefits of a constant-width encoding are obvious and, in your opinion, make constant-width encodings more useful than all the benefits of UTF-8? Also, I don't think this has been posted in this thread. Not sure if it answers your points, though: http://www.utf8everywhere.org/ And here's a simple and correct UTF-8 decoder: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/  May 25 2013 "Joakim" <joakim airpost.net> writes: On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote: I think you stand alone in your desire to return to code pages. about going to single-byte encodings, which do not imply the problems that you had with code pages way back when. I have years of experience with code pages and the unfixable misery they produce. This has disappeared with Unicode. I find your arguments unpersuasive when stacked against my experience. And yes, I have made a living writing high performance code that deals with characters, and you are quite off base with claims that UTF-8 has inevitable bad performance - though there is inefficient code in Phobos for it, to be sure. constant-width encoding? You have not articulated a reason for this. Do you believe there is a performance loss with variable-width, but that it is not significant and therefore worth it? Or do you believe it can be implemented with no loss? That is what I asked above, but you did not answer. 
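(Vladimir's combining-character caveat above can be demonstrated concretely — a Python sketch using the stdlib unicodedata module:)

```python
import unicodedata

s = "cafe\u0301"    # 'e' followed by U+0301 COMBINING ACUTE ACCENT
t = "caf\u00e9"     # precomposed "café"

assert len(s) == 5 and len(t) == 4             # different code point counts...
assert unicodedata.normalize("NFC", s) == t    # ...but canonically equivalent

# Naive code-point slicing can tear a combining mark off its base:
assert s[:4] == "cafe"   # the accent the reader sees on the final "é" is lost
```

So "slice at the N-th code point" is not a well-defined user-visible operation even in a fixed-width encoding like UTF-32, which is Vladimir's point.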
My grandfather wrote a book that consists of mixed German, French, and Latin words, using special characters unique to those languages. Another failing of code pages is it fails miserably at any such mixed language text. Unicode handles it with aplomb. job at such mixed-language text. You'd just have to have a larger, more complex header or keep all your strings in a single language, with a different format to compose them together for your book. This would be so much easier than UTF-8 that I cannot see how anyone could argue for a variable-length encoding instead. I can't even write an email to Rainer Schütze in English under your scheme. multi-language text at all, whereas I pointed out, from the beginning, that it could be trivially done also. Code pages simply are no longer practical nor acceptable for a global community. D is never going to convert to a code page system, and even if it did, there's no way D will ever convince the world to abandon Unicode, and so D would be as useless as EBCDIC. "single-byte encodings" to "code pages" in your head, then recoil in horror as you remember all your problems with broken implementations of code pages, even though those problems are not intrinsic to single-byte encodings. I'm not asking you to consider this for D. I just wanted to discuss why UTF-8 is used at all. I had hoped for some technical evaluations of its merits, but I seem to simply be dredging up a bunch of repressed memories about code pages instead. ;) The world may not "abandon Unicode," but it will abandon UTF-8, because it's a dumb idea. Unfortunately, such dumb ideas- XML anyone?- often proliferate until someone comes up with something better to show how dumb they are. Perhaps it won't be the D programming language that does that, but it would be easy to implement my idea in D, so maybe it will be a D-based library someday. :) I'm afraid your quest is quixotic. wrap their head around UTF-8. 
If someone can just get a single-byte encoding implemented and in front of them, I suspect it will be UTF-8 that will be considered quixotic. :D  May 25 2013 "Joakim" <joakim airpost.net> writes: On Saturday, 25 May 2013 at 08:58:57 UTC, Vladimir Panteleev wrote: Another thing I noticed: sometimes when you think you really need to operate on individual characters (and that your code will not be correct unless you do that), the assumption will be incorrect due to the existence of combining characters in Unicode. Two of the often-quoted use cases of working on individual code points is calculating the string width (assuming a fixed-width font), and slicing the string - both of these will break with combining characters if those are not accounted for. I believe the proper way to approach such tasks is to implement the respective Unicode algorithms for it, which I believe are non-trivial and for which the relative impact for the overhead of working with a variable-width encoding is acceptable. various languages, so there's no way around that. I'm arguing against layering more complexity on top, through UTF-8. Can you post some specific cases where the benefits of a constant-width encoding are obvious and, in your opinion, make constant-width encodings more useful than all the benefits of UTF-8? either translate your entire string into UTF-32 so it's constant-width, which is apparently what Phobos does, or decode every single UTF-8 character along the way, every single time. A constant-width, single-byte encoding would be much easier to slice, while still using at most half the space. Also, I don't think this has been posted in this thread. Not sure if it answers your points, though: http://www.utf8everywhere.org/ info on how best to do so, with little justification for why you'd want to do so in the first place. For example, "Q: But what about performance of text processing algorithms, byte alignment, etc? A: Is it really better with UTF-16? Maybe so." 
Not exactly a considered analysis of the two. ;)

And here's a simple and correct UTF-8 decoder:
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

tell me it's "simple." That said, the difficulty of _using_ UTF-8 is a
much bigger problem than implementing a decoder in a library.

May 25 2013
"w0rp" <devw0rp gmail.com> writes:
This is dumb. You are dumb. Go away.

May 25 2013
"Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 25 May 2013 at 09:40:36 UTC, Joakim wrote:
Can you post some specific cases where the benefits of a
constant-width encoding are obvious and, in your opinion, make
constant-width encodings more useful than all the benefits of UTF-8?

either translate your entire string into UTF-32 so it's
constant-width, which is apparently what Phobos does, or decode every
single UTF-8 character along the way, every single time. A
constant-width, single-byte encoding would be much easier to slice,
while still using at most half the space.

You don't need to do that to slice a string. I think you mean to say
that you need to decode each character if you want to slice the string
at the N-th code point? But this is exactly what I'm trying to point
out: how would you find this N? How would you know if it makes sense,
taking into account combining characters, and all the other
complexities of Unicode?

If you want to split a string by ASCII whitespace (newlines, tabs and
spaces), it makes no difference whether the string is in ASCII or
UTF-8 - the code will behave correctly in either case,
variable-width-encodings regardless.

You cannot honestly look at those multiple state diagrams and tell me
it's "simple."

I meant that it's simple to implement (and adapt/port to other
languages). I would say that UTF-8 is quite cleverly designed, so I
wouldn't say it's simple by itself.

May 25 2013
"Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 10:33:12 UTC, Vladimir Panteleev wrote:
You don't need to do that to slice a string.
I think you mean to say that you need to decode each character if you want to slice the string at the N-th code point? But this is exactly what I'm trying to point out: how would you find this N? How would you know if it makes sense, taking into account combining characters, and all the other complexities of Unicode? way would you slice and have it make any sense? Finding the N-th point is much simpler with a constant-width encoding. I'm leaving aside combining characters and those intrinsic language complexities baked into unicode in my previous analysis, but if you want to bring those in, that's actually an argument in favor of my encoding. With my encoding, you know up front if you're using languages that have such complexity- just check the header- whereas with a chunk of random UTF-8 text, you cannot ever know that unless you decode the entire string once and extract knowledge of all the languages that are embedded. For another similar example, let's say you want to run toUpper on a multi-language string, which contains English in the first half and some Asian script that doesn't define uppercase in the second half. With my format, toUpper can check the header, then process the English half and skip the Asian half (I'm assuming that the substring indices for each language would be stored in this more complex header). With UTF-8, you have to process the entire string, because you never know what random languages might be packed in there. UTF-8 is riddled with such performance bottlenecks, all to make if self-synchronizing. But is anybody really using its less compact encoding to do some "self-synchronized" integrity checking? I suspect almost nobody is. If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless. while splitting, when compared to a single-byte encoding. 
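(On the toUpper discussion above: the thread opened with the observation that case conversion can change a string's length. That is visible in any Unicode-aware language — a Python sketch of the German sharp-s case:)

```python
# The German sharp s uppercases to "SS": the result gains a code point,
# so an in-place toUpper over fixed-size slots is impossible regardless
# of which encoding is used.
s = "straße"
assert s.upper() == "STRASSE"
assert len(s) == 6 and len(s.upper()) == 7
```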
You cannot honestly look at those multiple state diagrams and tell me it's "simple." I meant that it's simple to implement (and adapt/port to other languages). I would say that UTF-8 is quite cleverly designed, so I wouldn't say it's simple by itself. write the fundamental UTF-8 libraries. But implementation does not merely refer to the UTF-8 libraries, but also all the code that tries to build on it for internationalized apps. And with all the unnecessary additional complexity added by UTF-8, wrapping the average programmer's head around this mess likely leads to as many problems as broken code pages implementations did back in the day. ;)  May 25 2013 "Vladimir Panteleev" <vladimir thecybershadow.net> writes: On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote: If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless. decode while splitting, when compared to a single-byte encoding. No. Are you sure you understand UTF-8 properly?  May 25 2013 "Joakim" <joakim airpost.net> writes: On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote: On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote: If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless. decode while splitting, when compared to a single-byte encoding. No. Are you sure you understand UTF-8 properly? to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above. 
Obviously the constant-width encoding will be faster. Did I really need to explain this? On Saturday, 25 May 2013 at 12:43:21 UTC, Andrei Alexandrescu wrote: On 5/25/13 3:33 AM, Joakim wrote: On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote: This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed. encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also. You mentioned this a couple of times, and I wonder what makes you so sure. On contemporary architectures small is fast and large is slow; betting on replacing larger data with more computation is quite often a win. about replacing larger data _and_ more computation, ie UTF-8, with smaller data and less computation, ie single-byte encodings, so it is an unmitigated win in that regard. :)  May 25 2013 "Peter Alexander" <peter.alexander.au gmail.com> writes: On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote: On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote: On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote: If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless. decode while splitting, when compared to a single-byte encoding. No. Are you sure you understand UTF-8 properly? to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above. 
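(For reference, the "state machine" mentioned above is Hoehrmann's table-driven DFA, which also validates. Stripped of validation, the branching a UTF-8 decoder actually needs is modest — a non-validating Python sketch that assumes well-formed input:)

```python
def decode_one(data: bytes, i: int):
    """Decode one code point starting at index i; return (codepoint, next_i).
    Non-validating sketch: assumes i points at the lead byte of well-formed UTF-8."""
    b = data[i]
    if b < 0x80:                 # 0xxxxxxx: plain ASCII
        return b, i + 1
    elif b < 0xE0:               # 110xxxxx: 2-byte sequence
        n, cp = 2, b & 0x1F
    elif b < 0xF0:               # 1110xxxx: 3-byte sequence
        n, cp = 3, b & 0x0F
    else:                        # 11110xxx: 4-byte sequence
        n, cp = 4, b & 0x07
    for j in range(i + 1, i + n):
        cp = (cp << 6) | (data[j] & 0x3F)   # fold in 6 bits per trailing byte
    return cp, i + n

assert decode_one(b"a", 0) == (97, 1)
assert decode_one("é".encode("utf-8"), 0) == (0xE9, 2)
assert decode_one("€".encode("utf-8"), 0) == (0x20AC, 3)
```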
Obviously the constant-width encoding will be faster. Did I really
need to explain this?

I suggest you read up on UTF-8. You really don't understand it.

There is no need to decode, you just treat the UTF-8 string as if it
is an ASCII string. This code will count all spaces in a string
whether it is encoded as ASCII or UTF-8:

int countSpaces(const(char)* c)
{
    int n = 0;
    while (*c)
        if (*c == ' ')
            ++n;
    return n;
}

I repeat: there is no need to decode. Please read up on UTF-8. You do
not understand it. The reason you don't need to decode is because
UTF-8 is self-synchronising.

The code above tests for spaces only, but it works the same when
searching for any substring or single character. It is no slower than
fixed-width encoding for these operations.

Again, I urge you, please read up on UTF-8. It is very well designed.

May 25 2013
"Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
int countSpaces(const(char)* c)
{
    int n = 0;
    while (*c)
        if (*c == ' ')
            ++n;
    return n;
}

Oops. Missing a ++c in there, but I'm sure the point was made :-)
to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above. Obviously the constant-width encoding will be faster. Did I really need to explain this? It looks like you've missed an important property of UTF-8: lower ASCII remains encoded the same, and UTF-8 code units encoding non-ASCII characters cannot be confused with ASCII characters. Code that does not need Unicode code points can treat UTF-8 strings as ASCII strings, and does not need to decode each character individually - because a 0x20 byte will mean "space" regardless of context. That's why a function that splits a string by ASCII whitespace does NOT need do perform UTF-8 decoding. I hope this clears up the misunderstanding :)  May 25 2013 "Joakim" <joakim airpost.net> writes: On Saturday, 25 May 2013 at 14:18:32 UTC, Vladimir Panteleev wrote: On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote: Are you sure _you_ understand it properly? Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above. Obviously the constant-width encoding will be faster. Did I really need to explain this? It looks like you've missed an important property of UTF-8: lower ASCII remains encoded the same, and UTF-8 code units encoding non-ASCII characters cannot be confused with ASCII characters. 
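(The property just stated can be checked in a few lines of Python:)

```python
# In UTF-8, bytes 0x00-0x7F appear only as themselves: every byte of a
# multi-byte sequence has its high bit set. So a 0x20 byte in the stream
# is always a real space, and splitting needs no decoding at all.
s = "Γειά σου κόσμε hello".encode("utf-8")
words = s.split(b" ")   # pure byte-level split
assert [w.decode("utf-8") for w in words] == ["Γειά", "σου", "κόσμε", "hello"]
```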
Code that does not need Unicode code points can treat UTF-8 strings as
ASCII strings, and does not need to decode each character individually
- because a 0x20 byte will mean "space" regardless of context. That's
why a function that splits a string by ASCII whitespace does NOT need
to perform UTF-8 decoding.

I hope this clears up the misunderstanding :)

necessary to decode every UTF-8 character if you are simply comparing
against ASCII space characters. My mixup is because I was unaware if
every language used its own space character in UTF-8 or if they reuse
the ASCII space character, apparently it's the latter.

However, my overall point stands. You still have to check 2-4 times as
many bytes if you do it the way Peter suggests, as opposed to a
single-byte encoding. There is a shortcut: you could also check the
first byte to see if it's ASCII or not and then skip the right number
of ensuing bytes in a character's encoding if it isn't ASCII, but at
that point you have begun partially decoding the UTF-8 encoding, which
you claimed wasn't necessary and which will degrade performance anyway.

On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
I suggest you read up on UTF-8. You really don't understand it. There
is no need to decode, you just treat the UTF-8 string as if it is an
ASCII string.

UTF-8. This code will count all spaces in a string whether it is
encoded as ASCII or UTF-8:

int countSpaces(const(char)* c)
{
    int n = 0;
    while (*c)
        if (*c == ' ')
            ++n;
    return n;
}

I repeat: there is no need to decode. Please read up on UTF-8. You do
not understand it. The reason you don't need to decode is because
UTF-8 is self-synchronising.

particular encoding scheme chosen for UTF-8, a side effect of ASCII
backwards compatibility and reusing the ASCII space character; it has
nothing to do with whether it's self-synchronizing or not.

The code above tests for spaces only, but it works the same when
searching for any substring or single character.
It is no slower than fixed-width encoding for these operations. it works the same for any single ASCII character. Of course it's slower than a fixed-width single-byte encoding. You have to check every single byte of a non-ASCII character in UTF-8, whereas a single-byte encoding only has to check a single byte per language character. There is a shortcut if you partially decode the first byte in UTF-8, mentioned above, but you seem dead-set against decoding. ;) Again, I urge you, please read up on UTF-8. It is very well designed. compatibility does hack in some shortcuts like this, which still don't save its performance.  May 25 2013 "Diggory" <diggsey googlemail.com> writes: On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote: On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote: I think you are a little confused about what unicode actually is... Unicode has nothing to do with code pages and nobody uses code pages any more except for compatibility with legacy applications (with good reason!). "Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes... In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode." http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode That confirms exactly what I just said... You said that phobos converts UTF-8 strings to UTF-32 before operating on them but that's not true. As it iterates over UTF-8 strings it iterates over dchars rather than chars, but that's not in any way inefficient so I don't really see the problem. dchar : unsigned 32 bit UTF-32 http://dlang.org/type.html Of course that's inefficient, you are translating your whole encoding over to a 32-bit encoding every time you need to process it. Walter as much as said so up above. 
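(The lead-byte "shortcut" discussed above is a couple of comparisons, not a real decode — a Python sketch:)

```python
def seq_len(lead: int) -> int:
    """UTF-8 sequence length, read off the lead byte alone (no decoding)."""
    if lead < 0x80: return 1     # 0xxxxxxx: ASCII
    if lead < 0xE0: return 2     # 110xxxxx
    if lead < 0xF0: return 3     # 1110xxxx
    return 4                     # 11110xxx

for ch, n in [("a", 1), ("é", 2), ("€", 3), ("😀", 4)]:
    assert seq_len(ch.encode("utf-8")[0]) == n
```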
Given that all the machine registers are at least 32-bits already it
doesn't make the slightest difference. The only additional operations
on top of ascii are when it's a multi-byte character, and even then
it's some simple bit manipulation which is as fast as any variable
width encoding is going to get.

The only alternatives to a variable width encoding I can see are:

- Single code page per string
This is completely useless because now you can't concatenate strings
of different code pages.

- Multiple code pages per string
This just makes everything overly complicated and is far slower to
decode what the actual character is than UTF-8.

- String with escape sequences to change code page
Can no longer access characters in the middle or end of the string,
you have to parse the entire string every time which completely
negates the benefit of a fixed width encoding.

- An encoding wide enough to store every character
This is just UTF-32.

Also your complaint that UTF-8 reserves the short characters for the
english alphabet is not really relevant - the characters with longer
encodings tend to be rarer (such as special symbols) or carry more
information (such as chinese characters where the same sentence takes
only about 1/3 the number of characters).

encoded in a single byte. It is your exceptions that are not relevant.

Well obviously... That's like saying "if you know what the exact
contents of a file are going to be anyway you can compress it to a
single byte!" ie. It's possible to devise an encoding which will
encode any given string to an arbitrarily small size. It's still
completely useless because you'd have to know the string in advance...

- A useful encoding has to be able to handle every unicode character
- As I've shown the only space-efficient way to do this is using a
variable length encoding like UTF-8
- Given the frequency distribution of unicode characters, UTF-8 does a
pretty good job at encoding higher frequency characters in fewer bytes.
- Yes you COULD encode non-english alphabets in a single byte but doing so would be inefficient because it would mean the more frequently used characters take more bytes to encode.  May 25 2013 "Joakim" <joakim airpost.net> writes: On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote: 25-May-2013 10:44, Joakim пишет: Yes, on the encoding, if it's a variable-length encoding like UTF-8, no, on the code space. I was originally going to title my post, "Why Unicode?" but I have no real problem with UCS, which merely standardized a bunch of pre-existing code pages. Perhaps there are a lot of problems with UCS also, I just haven't delved into it enough to know. UCS is dead and gone. Next in line to "640K is enough for everyone". Set, which is the backbone of Unicode: http://en.wikipedia.org/wiki/Universal_Character_Set You might be thinking of the unpopular UCS-2 and UCS-4 encodings, which I have never referred to. Separate code spaces were the case before Unicode (and utf-8). The problem is not only that without header text is meaningless (no easy slicing) but the fact that encoding of data after header strongly depends a variety of factors - a list of encodings actually. Now everybody has to keep a (code) page per language to at least know if it's 2 bytes per char or 1 byte per char or whatever. And you still work on a basis that there is no combining marks and regional specific stuff :) that. Legacy. Hard to switch overnight. There are graphs that indicate that few years from now you might never encounter a legacy encoding anymore, only UTF-8/UTF-16. meant that there's not much of a difference between code pages with 2 bytes per char and the language character sets in UCS. Does UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per char or whatever?" It's coherent in its scheme to determine that. You don't need extra information synced to text unlike header stuff. deem headers much more coherent. :) It has to do that also. 
Everyone keeps talking about "easy slicing" as though UTF-8 provides
it, but it doesn't. Phobos turns UTF-8 into UTF-32 internally for all
that ease of use, at least doubling your string size in the process.
Correct me if I'm wrong, that was what I read on the newsgroup
sometime back.

Indeed you are - searching for UTF-8 substring in UTF-8 string doesn't
do any decoding and it does return you a slice of a balance of
original.

you have changed the subject: slicing does require decoding and that's
the use case you brought up to begin with. I haven't looked into it,
but I suspect substring search not requiring decoding is the exception
for UTF-8 algorithms, not the rule.

??? Simply makes no sense. There is no intersection between some
legacy encodings as of now. Or do you want to add N*(N-1)
cross-encodings for any combination of 2? What about 3 in one string?

require "cross-encodings."

We want monoculture! That is to understand each other without all
these "par-le-vu-france?" and codepages of various complexity
(insanity).

screwed-up codepage in the middle of the night. ;)

So you never had trouble of internationalization? What languages do
you use (read/speak/etc.)?

had to code with the terrible code pages system from the past. I can
read and speak multiple languages, but I don't use anything other than
English text. That said, you could standardize on UCS for your code
space without using a bad encoding like UTF-8, as I said above.

UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into
that trap (Java, Windows NT). You shouldn't.

Unicode is a myth. :)

This is it but it's far more flexible in a sense that it allows
multi-lingual strings just fine and lone full-width unicode codepoints
as well.

byte for the language, which I noted could be done with my scheme, by
adding a more complex header, long before you mentioned this unicode
compression scheme.
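(Dmitry's substring claim above — search needs no decoding and the result is a valid slice — is easy to verify at the byte level:)

```python
haystack = "Rainer Schütze wrote this".encode("utf-8")
needle = "ütze".encode("utf-8")

pos = haystack.find(needle)   # plain byte search, no decoding anywhere
assert pos == len("Rainer Sch".encode("utf-8"))

# A false match starting inside another character is impossible: the
# needle's lead byte 0xC3 can never occur as a continuation byte, and
# the slice returned is itself valid UTF-8.
assert haystack[pos:pos + len(needle)].decode("utf-8") == "ütze"
```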
But I get the impression that it's only for sending over the
wire, ie transmission, so all the processing issues that UTF-8
introduces would still be there.

Use mime-type etc. Standards are always a bit stringy and
suboptimal; their acceptance rate is one of the chief advantages
they have. Unicode has horrifically large momentum now, and not
a single organization aside from them tries to do this dirty
work (=i18n).

scheme doesn't help you with string processing, it is only for
transmission and is probably fine for that, precisely because it
seems to implement some version of my single-byte encoding
scheme!  You do raise a good point: the only reason why we're
likely using such a bad encoding in UTF-8 is that nobody else
wants to tackle this hairy problem.

Consider adding another encoding for "Tuva" for instance. Now
you have to add 2*n conversion routines to match it to other
codepages/locales. Beyond that - there are many things to
consider in internationalization, and you would have to
special-case them all by codepage.

single-byte encodings, as I have noted above.  toUpper is a NOP
for a single-byte encoded string with an Asian script, you can't
do that with a UTF-8 string.  If they're screwing up something
so simple, imagine how much worse everyone is screwing up
something complex like UTF-8?

UTF-8 is pretty darn simple. BTW all it does is map [0..10FFFF]
to a sequence of octets. It does it pretty well and is
compatible with ASCII; even the little rant you posted
acknowledged that. Now, are you against Unicode as a whole or
what?

ASCII-compatible.  There are two parts to Unicode.  I don't know
enough about UCS, the character set, ;) to be for it or against
it, but I acknowledge that a standardized character set may make
sense.  I am dead set against the UTF-8 variable-width encoding,
for all the reasons listed above.

On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
25-May-2013 13:05, Joakim writes:
Nobody is talking about going back to code pages.
I'm talking about going to single-byte encodings, which do not
imply the problems that you had with code pages way back when.

Problem is, what you outline is isomorphic with code pages.
Hence the grief of accumulated experience against them.

example, from the beginning, I have suggested a more complex
header that can enable multi-language strings, as one possible
solution.  I don't think code pages provided that.

Well, if somebody got a quest to redefine UTF-8 they *might*
come up with something that is a bit faster to decode but shares
the same properties. Hardly a life-saver anyway.

constant-width encoding that is much simpler and more efficient
than UTF-8.  Programmer productivity is the biggest loss from
the complexity of UTF-8, as I've noted before.  The world may
not "abandon Unicode," but it will abandon UTF-8, because it's a
dumb idea.  Unfortunately, such dumb ideas- XML anyone?- often
proliferate until someone comes up with something better to show
how dumb they are.

Even children know XML is awful redundant shit as an interchange
format. The hierarchical document is a nice idea anyway.

popular as it is. ;) I'm making a similar point about the more
limited success of UTF-8, ie it's still shit.

May 25 2013
"Juan Manuel Cabo" <juanmanuel.cabo gmail.com> writes:
░░░░░░░░░ⓌⓉⒻ░
╔╗░╔╗░╔╗╔════╗╔════╗░░
║║░║║░║║╚═╗╔═╝║╔═══╝░░
║║░║║░║║░░║║░░║╚═╗░░░░
║╚═╝╚═╝║╔╗║║╔╗║╔═╝╔╗░░
╚══════╝╚╝╚╝╚╝╚╝░░╚╝░░
░░░░░░░░░░░░░░░░░░░░░░░░
█░█░█░░░░░░▐░░░░░░░░░░▐░
█░█░█▐▀█▐▀█▐░█▐▀█▐▀█▐▀█░
█░█░█▐▄█▐▄█▐▄▀▐▄█▐░█▐░█░
█▄█▄█▐▄▄▐▄▄▐░█▐▄▄▐░█▐▄█░
░░░░░░░░░░░░░░░░░░░░░░░░

--jm

May 25 2013
"Diggory" <diggsey googlemail.com> writes:
"limited success of UTF-8"

Becoming the de-facto standard encoding EVERYWHERE except for
Windows, which uses UTF-16, is hardly a failure... I really
don't understand your hatred for UTF-8 - it's simple to decode
and encode, fast and space-efficient.
Fixed-width encodings are not inherently fast; the only thing
they are faster at is if you want to randomly access the Nth
character instead of the Nth byte. In the rare cases where you
need to do a lot of this kind of random access there exists
UTF-32...

Any fixed-width encoding which can encode every Unicode
character must use at least 3 bytes, and using 4 bytes is
probably going to be faster because of alignment, so I don't see
what the great improvement over UTF-32 is going to be.

slicing does require decoding

I didn't mean that people are literally keeping code pages.  I
meant that there's not much of a difference between code pages
with 2 bytes per char and the language character sets in UCS.

Unicode doesn't have "language character sets". The different
planes only exist for organisational purposes; they don't affect
how characters are encoded.

?!  It's okay because you deem it "coherent in its scheme?"  I
deem headers much more coherent. :)

Sure, if you change the word "coherent" to mean something
completely different... Coherent means that you store related
things together, ie. everything that you need to decode a
character in the same place, not spread out between part of a
character and a header.

but I suspect substring search not requiring decoding is the
exception for UTF-8 algorithms, not the rule.

transformation that depends on the code point, such as
converting case or identifying which character class a
particular character belongs to. Appending, slicing, copying,
searching, replacing, etc. - basically all the most common text
operations - can all be done without any encoding or decoding.

May 25 2013
"Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what unicode actually
is...
Unicode has nothing to do with code pages, and nobody uses code
pages any more except for compatibility with legacy applications
(with good reason!).

"Unicode is an effort to include all characters from previous
code pages into a single character enumeration that can be used
with a number of encoding schemes... In practice the various
Unicode character set encodings have simply been assigned their
own code page numbers, and all the other code pages have been
technically redefined as encodings for various subsets of
Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode

That confirms exactly what I just said...

having "nothing to do with code pages."  All UCS did is take a
bunch of existing code pages and standardize them into one
massive character set.  For example, ISCII was a pre-existing
single-byte encoding, and Unicode "largely preserves the ISCII
layout within each block."
http://en.wikipedia.org/wiki/ISCII
All a code page is is a table of mappings; UCS is just a much
larger, standardized table of such mappings.

You said that Phobos converts UTF-8 strings to UTF-32 before
operating on them, but that's not true. As it iterates over
UTF-8 strings it iterates over dchars rather than chars, but
that's not in any way inefficient so I don't really see the
problem.

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html
Of course that's inefficient: you are translating your whole
encoding over to a 32-bit encoding every time you need to
process it.  Walter as much as said so up above.

Given that all the machine registers are at least 32 bits
already it doesn't make the slightest difference. The only
additional operations on top of ASCII are when it's a multi-byte
character, and even then it's some simple bit manipulation,
which is as fast as any variable-width encoding is going to get.

doesn't convert UTF-8 to UTF-32 internally.
Perhaps converting to UTF-32 is "as fast as any variable-width
encoding is going to get", but my claim is that single-byte
encodings will be faster.

The only alternatives to a variable-width encoding I can see
are:

- Single code page per string
This is completely useless because now you can't concatenate
strings of different code pages.

be made that strings of different languages are sufficiently
different that there should be no multi-language strings.  Is
this the best route?  I'm not sure, but I certainly wouldn't
dismiss it out of hand.

- Multiple code pages per string
This just makes everything overly complicated and is far slower
to decode what the actual character is than UTF-8.

particularly if you designed your header right.

- String with escape sequences to change code page
Can no longer access characters in the middle or end of the
string; you have to parse the entire string every time, which
completely negates the benefit of a fixed-width encoding.

it's sub-optimal.

Also, your complaint that UTF-8 reserves the short characters
for the English alphabet is not really relevant - the characters
with longer encodings tend to be rarer (such as special symbols)
or carry more information (such as Chinese characters, where the
same sentence takes only about 1/3 the number of characters).

encoded in a single byte.  It is your exceptions that are not
relevant.

Well obviously... That's like saying "if you know what the exact
contents of a file are going to be anyway you can compress it to
a single byte!" ie. it's possible to devise an encoding which
will encode any given string to an arbitrarily small size. It's
still completely useless because you'd have to know the string
in advance...

arbitrary-length file cannot be compressed to a single byte; you
would have collisions galore.  But since most non-English
alphabets have fewer than 256 characters, they can all be
uniquely encoded in a single byte per character, with the header
determining what language's code page to use.
I don't understand your analogy whatsoever.

- A useful encoding has to be able to handle every Unicode
character
- As I've shown, the only space-efficient way to do this is
using a variable-length encoding like UTF-8
- Given the frequency distribution of Unicode characters, UTF-8
does a pretty good job at encoding higher-frequency characters
in fewer bytes.

takes at least two bytes to encode, whereas my single-byte
encoding scheme would encode every alphabet with fewer than 256
characters in a single byte.

- Yes you COULD encode non-English alphabets in a single byte,
but doing so would be inefficient because it would mean the more
frequently used characters take more bytes to encode.

May 25 2013
"Diggory" <diggsey googlemail.com> writes:
On Saturday, 25 May 2013 at 19:02:43 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
I think you are a little confused about what unicode actually
is...

Unicode has nothing to do with code pages, and nobody uses code
pages any more except for compatibility with legacy applications
(with good reason!).

"Unicode is an effort to include all characters from previous
code pages into a single character enumeration that can be used
with a number of encoding schemes... In practice the various
Unicode character set encodings have simply been assigned their
own code page numbers, and all the other code pages have been
technically redefined as encodings for various subsets of
Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode

That confirms exactly what I just said...

having "nothing to do with code pages."  All UCS did is take a
bunch of existing code pages and standardize them into one
massive character set.  For example, ISCII was a pre-existing
single-byte encoding, and Unicode "largely preserves the ISCII
layout within each block."
http://en.wikipedia.org/wiki/ISCII
All a code page is is a table of mappings; UCS is just a much
larger, standardized table of such mappings.

UCS does have nothing to do with code pages; it was designed as
a replacement for them. A code page is a strict subset of the
possible characters, UCS is the entire set of possible
characters.

You said that Phobos converts UTF-8 strings to UTF-32 before
operating on them, but that's not true. As it iterates over
UTF-8 strings it iterates over dchars rather than chars, but
that's not in any way inefficient so I don't really see the
problem.

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html
Of course that's inefficient: you are translating your whole
encoding over to a 32-bit encoding every time you need to
process it.  Walter as much as said so up above.

Given that all the machine registers are at least 32 bits
already it doesn't make the slightest difference. The only
additional operations on top of ASCII are when it's a multi-byte
character, and even then it's some simple bit manipulation,
which is as fast as any variable-width encoding is going to get.

doesn't convert UTF-8 to UTF-32 internally.

Perhaps converting to UTF-32 is "as fast as any variable-width
encoding is going to get", but my claim is that single-byte
encodings will be faster.

I haven't "abandoned my claim". It's a simple fact that Phobos
does not convert UTF-8 strings to UTF-32 strings before it uses
them, ie. the difference between this:

    string mystr = ...;
    dstring temp = mystr.to!dstring;
    for (int i = 0; i < temp.length; ++i)
        process(temp[i]);

and this:

    string mystr = ...;
    size_t i = 0;
    while (i < mystr.length)
    {
        dchar current = decode(mystr, i);
        process(current);
    }

And if you can't see why the latter example is far more
efficient, I give up...

The only alternatives to a variable-width encoding I can see
are:

- Single code page per string
This is completely useless because now you can't concatenate
strings of different code pages.
be made that strings of different languages are sufficiently
different that there should be no multi-language strings.  Is
this the best route?  I'm not sure, but I certainly wouldn't
dismiss it out of hand.

- Multiple code pages per string
This just makes everything overly complicated and is far slower
to decode what the actual character is than UTF-8.

particularly if you designed your header right.

The cache misses alone caused by simply accessing the separate
headers would be a larger overhead than decoding UTF-8, which
takes a few assembly instructions, has perfect locality, and can
be efficiently pipelined by the CPU. Then there's all the extra
processing involved in combining the headers when you
concatenate strings. Plus you lose the one benefit a fixed-width
encoding has, because random access is no longer possible
without first finding out which header controls the location you
want to access.

- String with escape sequences to change code page
Can no longer access characters in the middle or end of the
string; you have to parse the entire string every time, which
completely negates the benefit of a fixed-width encoding.

it's sub-optimal.

Also, your complaint that UTF-8 reserves the short characters
for the English alphabet is not really relevant - the characters
with longer encodings tend to be rarer (such as special symbols)
or carry more information (such as Chinese characters, where the
same sentence takes only about 1/3 the number of characters).

encoded in a single byte.  It is your exceptions that are not
relevant.

Well obviously... That's like saying "if you know what the exact
contents of a file are going to be anyway you can compress it to
a single byte!" ie. it's possible to devise an encoding which
will encode any given string to an arbitrarily small size. It's
still completely useless because you'd have to know the string
in advance...

arbitrary-length file cannot be compressed to a single byte; you
would have collisions galore.
But since most non-English alphabets have fewer than 256
characters, they can all be uniquely encoded in a single byte
per character, with the header determining what language's code
page to use.

I don't understand your analogy whatsoever.

It's very simple - the more information about the type of data
you are compressing you have at the time of writing the
algorithm, the better compression ratio you can get, to the
point that if you know exactly what the file is going to contain
you can compress it to nothing. This is why you have specialised
compression algorithms for images, video, audio, etc.

It doesn't matter how few characters non-English alphabets have
- unless you know WHICH alphabet it is beforehand you can't
store it in a single byte. Since any given character could be in
any alphabet, the best you can do is look at the probabilities
of different characters appearing and use shorter
representations for more common ones. (This is the basis for all
lossless compression.) The English alphabet plus 0-9 and basic
punctuation are by far the most common characters used on
computers, so it makes sense to use one byte for those and
multiple bytes for rarer characters.

- A useful encoding has to be able to handle every Unicode
character
- As I've shown, the only space-efficient way to do this is
using a variable-length encoding like UTF-8

per string you would see that I had.

- Given the frequency distribution of Unicode characters, UTF-8
does a pretty good job at encoding higher-frequency characters
in fewer bytes.

takes at least two bytes to encode, whereas my single-byte
encoding scheme would encode every alphabet with fewer than 256
characters in a single byte.

And strings with mixed characters would use lots of memory and
be extremely slow. Common when using proper names, quotes,
inline translations, graphical characters, etc. etc. Not to
mention the added complexity to actually implement the
algorithms.
May 25 2013 "Joakim" <joakim airpost.net> writes: On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote: You can map a codepage to a subset of UCS :) That's what they do internally anyway. If I take you right you propose to define string as a header that denotes a set of windows in code space? I still fail to see how that would scale see below. header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings. This is just off the top of my head, I'm not suggesting it is definitive. Mm... strictly speaking (let's turn that argument backwards) - what are algorithms that require slicing say [5..] of string
without ever looking at it left to right, searching etc.?

slicing with UTF-8 are wrong.  But a single-byte encoding would
be scanned much faster also, as I've noted above, no decoding
necessary and single bytes will always be faster than multiple
bytes, even without decoding.

What would it look like? Or how would the processing go?

functions like toUpper would execute much faster because you
wouldn't have to scan substrings containing languages that don't
have uppercase, which you have to scan in UTF-8.

long before you mentioned this unicode compression
scheme.

It does inline headers, or rather tags, that hop between fixed
char windows. It's not random-access, nor does it claim to be.

superficially similar to my scheme. :)

version of my single-byte encoding scheme!  You do raise a
good point:
the only reason why we're likely using such a bad encoding in
UTF-8 is
that nobody else wants to tackle this hairy problem.

Yup, where have you been say almost 10 years ago? :)

have thought I'd be discussing Unicode today, didn't even know
what it was back then.

Not necessarily.  But that is actually one of the advantages of
single-byte encodings, as I have noted above.  toUpper is a
NOP for a
single-byte encoding string with an Asian script, you can't do
that with
a UTF-8 string.

But you have to check what encoding it's in, and given that
not all codepages are that simple to uppercase, some generic
algorithm is required.

at the header and know that toUpper has to do nothing for a
single-byte-encoded string of an Asian script which doesn't have
uppercase characters.  With UTF-8, you have to decode the entire
string to find that out.

They may seem superficially similar but they're not.  For
example, from
the beginning, I have suggested a more complex header that can
enable
multi-language strings, as one possible solution.  I don't
think code
pages provided that.

The problem is: how would you define an uppercase algorithm for
a multilingual string with 3 distinct 256-character codespaces
(windows)? I bet it won't be pretty.

some languages have uppercase characters and others don't.  The
version of toUpper for my encoding will be similar, but it will
do less work, because it doesn't have to be invoked for every
character in the string.

I still don't see how your solution scales to beyond 256
different codepoints per string (= multiple pages/parts of UCS
;) ).

mentioned those to Walter earlier, they would have a two-byte
encoding.  No way around that, but they would still be easier to
deal with than UTF-8, because of the header.

May 25 2013
"Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 19:30:25 UTC, Walter Bright wrote:
On the other hand, Joakim even admits his single byte encoding
is variable length, as otherwise he simply dismisses the rarely
used (!) Chinese, Japanese, and Korean languages, as well as
any text that contains words from more than one language.

to be encoded to two bytes, so it is not a true constant-width
encoding if you are mixing one of those languages into a
single-byte encoded string.  But this "variable length" encoding
is so much simpler than UTF-8, there's no comparison.

I suspect he's trolling us, and quite successfully.

see it's Walter.  It seems to be the trend on the internet to
accuse anybody you disagree with of trolling, I am honestly
surprised to see Walter stoop so low.  Considering I'm the only
one making any cogent arguments here, perhaps I should wonder if
you're all trolling me. ;)

On Saturday, 25 May 2013 at 19:35:42 UTC, Walter Bright wrote:
I suspect the Chinese, Koreans, and Japanese would take
exception to being called irrelevant.

have noted that they would also be handled by a two-byte encoding.

Good luck with your scheme that can't handle languages written
by billions of people!

length because I am using two bytes to handle these languages,
then you claim I don't handle these languages.  This kind of
blatant contradiction within two posts can only be called...
trolling!

May 25 2013
"Juan Manuel Cabo" <juanmanuel.cabo gmail.com> writes:
On Saturday, 25 May 2013 at 19:51:43 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky
wrote:
You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I take you right you propose to define string as a header
that denotes a set of windows in code space? I still fail to
see how that would scale see below.

header would contain a single byte for every language used in
the string, along with multiple index bytes to signify the
start and finish of every run of single-language characters in
the string.  So, a list of languages and a list of pure
single-language substrings.  This is just off the top of my
head, I'm not suggesting it is definitive.

You obviously are not thinking it through. Such an encoding
would have O(n^2) complexity for appending a character/symbol
in a different language to the string, since you would have to
update the beginning of the string and move the contents
forward to make room. Not to mention that it wouldn't be
backwards compatible with ascii routines, and the complexity
of such a header would have to be carried all the way to the
font rendering routines in the OS.

Multiple languages/symbols in one string is a blessing of modern
humane computing. It is the norm more than the exception in most
of the world.

--jm

May 25 2013
"Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 25 May 2013 at 14:58:02 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
I suggest you read up on UTF-8. You really don't understand
it. There is no need to decode, you just treat the UTF-8
string as if it is an ASCII string.

UTF-8.

It's not just a shortcut, it is absolutely fundamental to the
design of UTF-8. It's like saying you understand Lisp without
being aware that everything is a list.

Also, you continuously keep stating disadvantages to UTF-8 that
are completely false, like "slicing does require decoding".
Again, completely missing the point of UTF-8. I cannot conceive
how you can claim to understand how UTF-8 works yet repeatedly
demonstrating that you do not.

You are either ignorant or a successful troll. In either case,
I'm done here.

May 25 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, May 25, 2013 at 09:51:42PM +0200, Joakim wrote:
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
If I take you right you propose to define string as a header that
denotes a set of windows in code space? I still fail to see how
that would scale see below.

Something like that.  For a multi-language string encoding, the
header would contain a single byte for every language used in the
string, along with multiple index bytes to signify the start and
finish of every run of single-language characters in the string.
So, a list of languages and a list of pure single-language
substrings.  This is just off the top of my head, I'm not suggesting
it is definitive.

And just how exactly does that help with slicing? If anything, it makes
slicing way hairier and error-prone than UTF-8. In fact, this one point
alone already defeated any performance gains you may have had with a
single-byte encoding. Now you can't do *any* slicing at all without
convoluted algorithms to determine what encoding is where at the
endpoints of your slice, and the resulting slice must have new headers
to indicate the start/end of every different-language substring. By the
time you're done with all that, you're going way slower than processing
UTF-8.

Again I say, I'm not 100% sold on UTF-8, but what you're proposing here
is far worse.

T

--
The best compiler is between your ears. -- Michael Abrash

May 25 2013
On Saturday, 25 May 2013 at 20:03:59 UTC, Joakim wrote:
I have noted from the beginning that these large alphabets have
to be encoded to two bytes, so it is not a true constant-width
encoding if you are mixing one of those languages into a
single-byte encoded string.  But this "variable length"
encoding is so much simpler than UTF-8, there's no comparison.

All I can say is if you think that is simpler than UTF-8 then you
have completely the wrong idea about UTF-8.

Let me explain:

1) Take the byte at a particular offset in the string
2) If it is ASCII then we're done
3) Otherwise count the number of '1's at the start of the byte -
this is how many bytes make up the character (there's even an ASM
instruction to do this)
4) This first byte will look like '1110xxxx' for a 3 byte
character, '11110xxx' for a 4 byte character, etc.
5) All following bytes are of the form '10xxxxxx'
6) Now just concatenate all the 'x's together and add an offset
to get the code point

Note that this is CONSTANT TIME, O(1) with minimal branching so
well suited to pipelining (after the initial byte the other bytes
can all be processed in parallel by the CPU) and only sequential
memory access so no cache misses, and zero additional memory
requirements

1) Look up the offset in the header using binary search:
O(log N), lots of branching
2) Look up the code page ID in a massive array of code pages to
work out how many bytes per character
3) Hope this array hasn't been paged out and is still in the cache
4) Extract that many bytes from the string and combine them into
a number
5) Look up this new number in yet another large array specific to
the code page
6) Hope this array hasn't been paged out and is still in the
cache too

This is O(log N), has lots of branching so no pipelining (every
stage depends on the result of the stage before), lots of random
memory access so lots of cache misses, lots of additional memory
requirements to store all those tables, and an algorithm that
isn't even any easier to understand.

Plus every other algorithm to operate on it except for decoding
is insanely complicated.

May 25 2013
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, May 25, 2013 01:42:20 Walter Bright wrote:
On 5/25/2013 12:33 AM, Joakim wrote:
At what cost?  Most programmers completely punt on unicode, because=

just don't want to deal with the complexity. Perhaps you can deal w=

and don't mind the performance loss, but I suspect you're in the
minority.

of experience with code pages and the unfixable misery they produce. =

has disappeared with Unicode. I find your arguments unpersuasive when=

stacked against my experience. And yes, I have made a living writing =

performance code that deals with characters, and you are quite off ba=

with claims that UTF-8 has inevitable bad performance - though there =

inefficient code in Phobos for it, to be sure.
=20
My grandfather wrote a book that consists of mixed German, French, an=

words, using special characters unique to those languages. Another fa=

of code pages is it fails miserably at any such mixed language text.
Unicode handles it with aplomb.
=20
I can't even write an email to Rainer Sch=C3=BCtze in English under y=

=20
Code pages simply are no longer practical nor acceptable for a global=

community. D is never going to convert to a code page system, and eve=

it did, there's no way D will ever convince the world to abandon Unic=

and so D would be as useless as EBCDIC.
=20
I'm afraid your quest is quixotic.

All I've got to say on this subject is "Thank you Walter Bright for bui=
lding=20
Unicode into D!"

- Jonathan M Davis

May 25 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, May 25, 2013 at 04:14:34PM -0700, Jonathan M Davis wrote:
On Saturday, May 25, 2013 01:42:20 Walter Bright wrote:
On 5/25/2013 12:33 AM, Joakim wrote:
At what cost?  Most programmers completely punt on unicode,
because they just don't want to deal with the complexity. Perhaps
you can deal with it and don't mind the performance loss, but I
suspect you're in the minority.

have years of experience with code pages and the unfixable misery
they produce. This has disappeared with Unicode. I find your
arguments unpersuasive when stacked against my experience. And yes,
I have made a living writing high performance code that deals with
characters, and you are quite off base with claims that UTF-8 has
inevitable bad performance - though there is inefficient code in
Phobos for it, to be sure.

My grandfather wrote a book that consists of mixed German, French,
and Latin words, using special characters unique to those languages.
Another failing of code pages is it fails miserably at any such
mixed language text.  Unicode handles it with aplomb.

I can't even write an email to Rainer Schütze in English under your
scheme.

Code pages simply are no longer practical nor acceptable for a
global community. D is never going to convert to a code page system,
and even if it did, there's no way D will ever convince the world to
abandon Unicode, and so D would be as useless as EBCDIC.

I'm afraid your quest is quixotic.

All I've got to say on this subject is "Thank you Walter Bright for
building Unicode into D!"

Ditto here!

In fact, Unicode support in D (esp. UTF-8) was one of the major factors
that convinced me to adopt D. I had been trying to write
language-agnostic programs in C/C++, and ... let's just say that it was
one gigantic hairy mess, and required lots of system-dependent hacks and
unfounded assumptions ("it appears to work so I think the code's correct
even though according to spec it shouldn't have worked"). I18n support
in libc was spotty and incomplete, with many common functions breaking
in unexpected ways once you step outside ASCII, and libraries like
gettext address some of the issues but not all. Getting *real* i18n
support required using a full-fledged i18n library like libicu, which
required using custom string types. The whole experience was so painful
I've since avoided doing any i18n in C/C++ at all.

Then came along D with native Unicode support built right into the
language. And not just UTF-16 shoved down your throat like Java does (or
was it UTF-32?); UTF-8, UTF-16, and UTF-32 are all equally supported.
You cannot imagine what a happy camper I was since then!! Yes, Phobos
still has a ways to go in terms of performance w.r.t. UTF-8 strings, but
what we have right now is already far, far, superior to the situation in
C/C++, and things can only get better.

T

--
Freedom of speech: the whole world has no right *not* to hear my spouting off!

May 25 2013
"Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 18:56:42 UTC, Diggory wrote:
"limited success of UTF-8"

Becoming the de-facto standard encoding EVERYWHERE except for
Windows, which uses UTF-16, is hardly a failure...

computers since the inception of Unicode.  That's what I call
limited success, thank you for agreeing with me. :)

I really don't understand your hatred for UTF-8 - it's simple
to decode and encode, fast and space-efficient. Fixed width
encodings are not inherently fast, the only thing they are
faster at is if you want to randomly access the Nth character
instead of the Nth byte. In the rare cases that you need to do
a lot of this kind of random access there exists UTF-32...

encoding is?  Suffice to say, a single-byte encoding beats UTF-8
on all these measures, not just one.

Any fixed width encoding which can encode every unicode
character must use at least 3 bytes, and using 4 bytes is
probably going to be faster because of alignment, so I don't
see what the great improvement over UTF-32 is going to be.

packing language info in the header.  I don't think you even know

slicing does require decoding

know where the code points are.

I didn't mean that people are literally keeping code pages.  I
meant that there's not much of a difference between code pages
with 2 bytes per char and the language character sets in UCS.

Unicode doesn't have "language character sets". The different
planes only exist for organisational purposes; they don't affect
how characters are encoded.

the different language character sets in this list:

http://en.wikipedia.org/wiki/List_of_Unicode_characters

?!  It's okay because you deem it "coherent in its scheme?"  I
deem headers much more coherent. :)

Sure if you change the word "coherent" to mean something
completely different... Coherent means that you store related
things together, ie. everything that you need to decode a
character in the same place, not spread out between part of a

make sense conceptually, not that everything is stored together.
My point is that putting the language info in a header seems much
more coherent to me than ramming that info into every character.

but I suspect substring search not requiring decoding is the
exception for UTF-8 algorithms, not the rule.

transformation that depends on the code point such as
converting case or identifying which character class a
particular character belongs to. Appending, slicing, copying,
searching, replacing, etc. basically all the most common text
operations can all be done without any encoding or decoding.

is useless, I have to laugh that you even include it. :) All
these basic operations can be done very fast, often faster than
UTF-8, in a single-byte encoding.  Once you start talking code
points, it's no contest: UTF-8 flat out loses.

On Saturday, 25 May 2013 at 19:42:41 UTC, Diggory wrote:
All a code page is is a table of mappings, UCS is just a much
larger, standardized table of such mappings.

UCS does have nothing to do with code pages, it was designed as
a replacement for them. A codepage is a strict subset of the
possible characters, UCS is the entire set of possible
characters.

several of them into a master code page and removing
redundancies.  Functionally, they are the same and historically
they maintain the same layout in at least some cases.  To then
say, UCS has "nothing to do with code pages" is just dense.

I see you've abandoned without note your claim that phobos
doesn't convert UTF-8 to UTF-32 internally.  Perhaps
converting to UTF-32 is "as fast as any variable width
encoding is going to get" but my claim is that single-byte
encodings will be faster.

I haven't "abandoned my claim". It's a simple fact that phobos
does not convert UTF-8 string to UTF-32 strings before it uses
them.

ie. the difference between this:

    string mystr = ...;
    dstring temp = mystr.to!dstring;
    for (int i = 0; i < temp.length; ++i)
        process(temp[i]);

and this:

    string mystr = ...;
    size_t i = 0;
    while (i < mystr.length) {
        dchar current = decode(mystr, i);
        process(current);
    }

And if you can't see why the latter example is far more
efficient I give up...

iterates through, but there are still functions in std.string
that convert the entire string, as in your first example.  The
point is that you are forced to decode everything to UTF-32,
whether by char or the entire string.  Your latter example may be
marginally more efficient but it is only useful for functions
that start from the beginning and walk the string in only one
direction, which not all operations do.

- Multiple code pages per string
This just makes everything overly complicated and is far
slower to decode what the actual character is than UTF-8.

The cache misses alone caused by simply accessing the separate
takes a few assembly instructions and has perfect locality and
can be efficiently pipelined by the CPU.

than repeatedly decoding, whether in assembly and pipelined or
not, every single UTF-8 character? :D

Then there's all the extra processing involved combining the
headers when you concatenate strings. Plus you lose the one
benefit a fixed width encoding has because random access is no
longer possible without first finding out which header controls
the location you want to access.

when concatenating strings, hardly anything.

Random access is still not only possible, it is incredibly fast
in most cases: you just have to check first if the header lists
any two-byte encodings.  This can be done once and cached as a
property of the string (set a boolean no_two_byte_encoding once
and simply have the slice operator check it before going ahead),
just as you could add a property to UTF-8 strings to allow quick
random access if they happen to be pure ASCII.  The difference is
that only strings that include the two-byte encoded
Korean/Chinese/Japanese characters would require a bit more
calculation for slicing in my scheme, whereas _every_ non-ASCII
UTF-8 string requires full decoding to allow random access.  This
is a clear win for my single-byte encoding, though maybe not the
complete demolition of UTF-8 you were hoping for. ;)

No, it's not the same at all.  The contents of an
arbitrary-length file cannot be compressed to a single byte,
you would have collisions galore.  But since most non-english
alphabets are less than 256 characters, they can all be
uniquely encoded in a single byte per character, with the
header determining what language's code page to use.  I don't

you are compressing you have at the time of writing the
algorithm the better compression ratio you can get, to the
point that if you know exactly what the file is going to
contain you can compress it to nothing. This is why you have
specialised compression algorithms for images, video, audio,
etc.

compressing down to a byte is nonsense.  For any arbitrarily long
data, there are always limits to compression.  What any of this
has to do with my single-byte encoding, I have no idea.

It doesn't matter how few characters non-english alphabets have
- unless you know WHICH alphabet it is before-hand you can't
store it in a single byte. Since any given character could be
in any alphabet the best you can do is look at the
probabilities of different characters appearing and use shorter
representations for more common ones. (This is the basis for
all lossless compression) The english alphabet plus 0-9 and
basic punctuation are by far the most common characters used on
computers so it makes sense to use one byte for those and
multiple bytes for rarer characters.

before-hand" because that info is stored in the header?  That is
why I specifically said, from my first post, that multi-language
strings would have more complex headers, which I later pointed
out could list all the different language substrings within a
multi-language string.  Your silly exposition of how compression
works makes me wonder if you understand anything about how a
single-byte encoding would work.

Perhaps it made sense to use one byte for ASCII characters and
relegate _every other language_ to multiple bytes two decades
ago.  It doesn't make sense today.

- As I've shown the only space-efficient way to do this is
using a variable length encoding like UTF-8

pages per string you would see that I had.

string, just as you do not ship the entire UCS with every UTF-8
string.  A single-byte encoding is going to be more
space-efficient for the vast majority of strings, everybody knows
this.

No, it does a very bad job of this.  Every non-ASCII character
takes at least two bytes to encode, whereas my single-byte
encoding scheme would encode every alphabet with less than 256
characters in a single byte.

And strings with mixed characters would use lots of memory and
be extremely slow. Common when using proper names, quotes,
inline translations, graphical characters, etc. etc. Not to
mention the added complexity to actually implement the
algorithms.

though I'm not sure how, given your seeming ignorance of how
single-byte encodings work. :) There _is_ a degenerate case with
my particular single-byte encoding (not the ones you list, which
would still be faster and use less memory than UTF-8): strings
that use many, if not all, character sets.  So the worst case
scenario might be something like a string that had 100
characters, every one from a different language.  In that case, I
think it would still be smaller than the equivalent UTF-8 string,
but not by much.

There might be some complexity in implementing the algorithms,
but on net, likely less than UTF-8, while being much more usable
for most programmers.

On Saturday, 25 May 2013 at 22:41:59 UTC, Diggory wrote:
1) Take the byte at a particular offset in the string
2) If it is ASCII then we're done
3) Otherwise count the number of '1's at the start of the byte
- this is how many bytes make up the character (there's even an
ASM instruction to do this)
4) This first byte will look like '1110xxxx' for a 3 byte
character, '11110xxx' for a 4 byte character, etc.
5) All following bytes are of the form '10xxxxxx'
6) Now just concatenate all the 'x's together and add an offset
to get the code point

than to bluster on without much use.

Note that this is CONSTANT TIME, O(1) with minimal branching so
well suited to pipelining (after the initial byte the other
bytes can all be processed in parallel by the CPU) and only
sequential memory access so no cache misses, and zero

up.

1) Look up the offset in the header using binary search: O(log
N) lots of branching

depends on the number of languages used and how many substrings
there are.  There are worst-case scenarios that could approach
something like log(n) but extremely unlikely in real-world use.
Most of the time, this would be O(1).

2) Look up the code page ID in a massive array of code pages to
work out how many bytes per character

simply checked if the language was one of the few alphabets that
require two bytes.

3) Hope this array hasn't been paged out and is still in the
cache
4) Extract that many bytes from the string and combine them
into a number

step for the few two-byte encodings, yet have no problem with
doing this for every non-ASCII character in UTF-8.

5) Look up this new number in yet another large array specific
to the code page

character, just like your Unicode code point above.  If you were
simply encoding the UCS in a single-byte encoding, you would
arrange your scheme in such a way to trivially be able to
generate the UCS code point using these two bytes.

This is O(log N) has lots of branching so no pipelining (every
stage depends on the result of the stage before), lots of
random memory access so lots of cache misses, lots of
additional memory requirements to store all those tables, and
an algorithm that isn't even any easier to understand.

Plus every other algorithm to operate on it except for decoding
is insanely complicated.

comparison that matters.

May 26 2013
"Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 19:58:25 UTC, Dmitry Olshansky wrote:
Runs away in horror :) It's a mess even before you've got to the
details.

so I'll assume you can't find such a flaw.  It is still _much
less_ messy than UTF-8, that is the critical distinction.

Another point about using sometimes a 2-byte encoding - welcome
to the nice world of BigEndian/LittleEndian i.e. the very trap
UTF-16 has stepped into.

coordination, but it is a minor issue.

On Saturday, 25 May 2013 at 20:20:11 UTC, Juan Manuel Cabo wrote:
You obviously are not thinking it through. Such encoding would
have a O(n^2) complexity for appending a character/symbol in a
different language to the string, since you would have to
update the beginning of the string, and move the contents
forward to make room. Not to mention that it wouldn't be
backwards compatible with ascii routines, and the complexity of
such a header would be have to be carried all the way to font
rendering routines in the OS.

non-font-related assertions have been addressed earlier.  I see
no reason why a single-byte encoding of UCS would have to be
carried to "font rendering routines" but UTF-8 wouldn't be.

Multiple languages/symbols in one string is a blessing of
modern humane computing. It is the norm more than the exception
in most of the world.

multi-language strings.  The argument is about how best to encode
them.

On Saturday, 25 May 2013 at 20:47:25 UTC, Peter Alexander wrote:
On Saturday, 25 May 2013 at 14:58:02 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander
wrote:
I suggest you read up on UTF-8. You really don't understand
it. There is no need to decode, you just treat the UTF-8
string as if it is an ASCII string.

understanding UTF-8.

It's not just a shortcut, it is absolutely fundamental to the
design of UTF-8. It's like saying you understand Lisp without
being aware that everything is a list.

chosen for UTF-8 and, as I've noted, still less efficient than
similarly searching a single-byte encoding.  The fact that you
keep trumpeting this silly detail as somehow "fundamental"
suggests you have no idea what you're talking about.

Also, you continuously keep stating disadvantages to UTF-8 that
are completely false, like "slicing does require decoding".
Again, completely missing the point of UTF-8. I cannot conceive
how you can claim to understand how UTF-8 works yet repeatedly
demonstrating that you do not.

don't know that.  If you mean slicing by byte, that is not only
useless, but _every_ encoding can do that.  I cannot conceive how
you claim to defend UTF-8, yet keep making such stupid points,
that you don't even bother backing up.

You are either ignorant or a successful troll. In either case,
I'm done here.

arguments and leave.  Good riddance, you weren't adding anything.

May 26 2013
"Joakim" <joakim airpost.net> writes:
For some reason this posting by H. S. Teoh shows up on the
mailing list but not on the forum.

On Sat May 25 13:42:10 PDT 2013, H. S. Teoh wrote:
On Sat, May 25, 2013 at 10:07:41AM +0200, Joakim wrote:
The vast majority of non-english alphabets in UCS can be encoded
in a single byte.  It is your exceptions that are not relevant.

I'll have you know that Chinese, Korean, and Japanese account for
a significant percentage of the world's population, and therefore
arguments about "vast majority" are kinda missing the forest for
the trees. If you count the number of *alphabets* that can be
encoded in a single byte, you can get a majority, but that in no
way reflects actual usage.

representing 85% of the world's population.

The only alternatives to a variable width encoding I can see are:
- Single code page per string
This is completely useless because now you can't concatenate
strings of different code pages.

to be made that strings of different languages are sufficiently
different that there should be no multi-language strings.  Is
this the best route?  I'm not sure, but I certainly wouldn't
dismiss it out of hand.

This is so patently absurd I don't even know how to begin to
have you actually dealt with any significant amount of text at
all? A large amount of text in today's digital world is at least
bilingual, if not more. Even in pure English text, you
occasionally need a foreign letter in order to transcribe a
borrowed/quoted word, e.g., "cliché", "naïve", etc.. Under your
scheme, it would be impossible to encode any text that contains
even a single instance of such words. All it takes is *one* word
in a 500-page text and your scheme breaks down, and we're back
to the bad ole days of codepages. And yes you can say "well just
include é and ï in the English code page". But then all it takes
is a single math formula that requires a Greek letter, and your
text is non-encodable anymore. By the time you pull in all the
French, German, Greek letters and math symbols, you might as
well just go back to UTF-8.

earlier as another possibility to Walter, "keep all your strings
in a single language, with a different format to compose them
together."  Nobody is talking about disallowing alphabets other
than English or going back to code pages.  The fundamental
question is whether it makes sense to combine all these different
alphabets and their idiosyncratic rules into a single string and
encoding.

There is a good argument to be made that the differences outweigh
the similarities and you'd be better off keeping each
language/alphabet in its own string.  It's a question of
modeling, just like a class hierarchy.  As I said, I'm not sure
this is the best route, but it has some real strengths.

The alternative is to have embedded escape sequences for the
rare foreign letter/word that you might need, but then you're
back to being unable to slice the string at will, since slicing
it at the wrong place will produce gibberish.

I'm not saying UTF-8 (or UTF-16, etc.) is panacea -- there are
things about it that are annoying, but it's certainly better
than the scheme you're proposing.

On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
And just how exactly does that help with slicing? If anything,
it makes slicing way hairier and error-prone than UTF-8. In
fact, this one point
with a single-byte encoding. Now you can't do *any* slicing at
all without convoluted algorithms to determine what encoding is
where at the endpoints of your slice, and the resulting slice
must have new headers to indicate the start/end of every
different-language substring. By the time you're done with all
that, you're going way slower than processing UTF-8.

string contains any two-byte encodings, a check which can be done
once and cached.  If it's single-byte all the way through, no
problems whatsoever with slicing.  If there are two-byte
languages included, the slice function will have to do a little
arithmetic calculation before slicing.  You will also need a few
arithmetic ops to create the new header for the slice.  The point
is that these operations will be much faster than decoding every
code point to slice UTF-8.

Again I say, I'm not 100% sold on UTF-8, but what you're
proposing here
is far worse.

you dismiss my alternative out of hand.

May 26 2013
"Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
I have noted from the beginning that these large alphabets have
to be encoded to two bytes, so it is not a true constant-width
encoding if you are mixing one of those languages into a
single-byte encoded string.  But this "variable length" encoding
is so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's
variable length. You overlook that I've had to deal with this.
It isn't "simpler", there's actually more work to write code
that adapts to one or two byte encodings.

containing a few Asian languages are variable-length, as opposed
to UTF-8 having every non-English language string be
variable-length.  It may be more work to write library code to
handle my encoding, perhaps, but efficiency and ease of use are
paramount.

So let's see: first you say that my scheme has to be variable
length because I am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard Chinese.
You cannot have it both ways. Code to deal with two bytes is
significantly different than code to deal with one. That means
you've got a conditional in your generic code - that isn't
going to be faster than the conditional for UTF-8.

encoding for Chinese and I already acknowledged that such a
predominantly single-byte encoding is still variable-length.  The
problem is that _you_ try to have it both ways: first you claimed
it is variable-length because I support Chinese that way, then
you claimed I don't support Chinese.

Yes, there will be conditionals, just as there are several
conditionals in phobos depending on whether a language supports
uppercase or not.  The question is whether the conditionals for
single-byte encoding will execute faster than decoding every
UTF-8 character.  This is a matter of engineering judgement, I
see no reason why you think decoding every UTF-8 character is
faster.

then you claim I don't handle these languages.  This kind of
blatant contradiction within two posts can only be called...
trolling!

You gave some vague handwaving about it, and then dismissed it
as irrelevant, along with more handwaving about what to do with
text that has embedded words in multiple languages.

use two bytes to encode Chinese?  I'm not sure why you're

I didn't "handwave" about multi-language strings, I gave specific
ideas about how they might be implemented.  I'm not claiming to
have a bullet-proof and detailed single-byte encoding spec, just
spitballing some ideas on how to do it better than the abominable
UTF-8.

Worse, there are going to be more than 256 of these encodings -
you can't even have a byte to specify them. Remember, Unicode
has approximately 256,000 characters in it. How many code pages
is that?

maybe another 50 symbolic sets.  That leaves space for another
100 or so new scripts.  Maybe you are so worried about
future-proofing that you'd use two bytes to signify the alphabet,
but I wouldn't.  I think it's more likely that we'll ditch
scripts than add them. ;) Most of those symbol sets should not be
in UCS.

I was being kind saying you were trolling, as otherwise I'd be
saying your scheme was, to be blunt, absurd.

from 20 years ago, that is really only useful when streaming
text, which nobody does today.  There may have been a time when
ASCII compatibility was paramount, when nobody cared about
internationalization and almost all libraries only took ASCII
input: that is not the case today.

I'll be the first to admit that a lot of great ideas have been
initially dismissed by the experts as absurd. If you really
believe in this, I recommend that you write it up as a real
article, taking care to fill in all the handwaving with
something specific, and include some benchmarks to prove your
performance claims. Post your article on reddit, stackoverflow,
hackernews, etc., and look for fertile ground for it. I'm sorry
you're not finding fertile ground here (so far, nobody has
agreed with any of your points), and this is the wrong place
for such proposals anyway, as D is simply not going to switch
over to it.

single-byte encoding representing a step forward from UTF-8.
While this argument has produced no argument that I'm wrong, it's
possible we've all missed something salient, some deal-breaker.
As I said before, I'm not proposing that D "switch over."  I was
simply asking people who know or at the very least use UTF-8 more
than most, as a result of employing one of the few languages with
Unicode support baked in, why they think UTF-8 is a good idea.

I was hoping for a technical discussion on the merits, before I
went ahead and implemented this single-byte encoding.  Since
nobody has been able to point out a reason for why my encoding
wouldn't be much better than UTF-8, I see no reason not to go
forward with my implementation.  I may write something up after
implementation: most people don't care about ideas, only results,
to the point where almost nobody can reason at all about ideas.

Remember, extraordinary claims require extraordinary evidence,
not handwaving and assumptions disguised as bold assertions.

"handwaving and assumptions."  Some people can reason about such
possible encodings, even in the incomplete form I've sketched
out, without having implemented them, if they know what they're
doing.

On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
On 5/25/2013 2:51 PM, Walter Bright wrote:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would contain a
single byte for every language used in the string, along with
multiple index bytes to signify the start and finish of every
run of single-language characters in the string. So, a list of
languages and a list of pure single-language substrings.

Please implement the simple C function strstr() with this simple
scheme, and post it here.

http://www.digitalmars.com/rtl/string.html#strstr

I'll go first. Here's a simple UTF-8 version in C. It's not the
fastest way to do it, but at least it is correct:
----------------------------------
char *strstr(const char *s1, const char *s2) {
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)                      /* empty needle matches at the start */
        return (char *) s1;
    char c2 = *s2;                  /* first byte of the needle */
    while (len2 <= len1) {
        if (c2 == *s1)              /* cheap first-byte check */
            if (memcmp(s2, s1, len2) == 0)
                return (char *) s1;
        s1++;
        len1--;
    }
    return NULL;
}

simpler to write in C and D for multi-language strings that
include Korean/Chinese/Japanese.  But while the strstr
implementation for my encoding would contain more conditionals
and lines of code, it would be far more efficient.  For instance,
because you know where all the language substrings are from the
header, you can potentially rule out searching vast swathes of
the string, because they don't contain the same languages or
lengths as the string you're searching for.

Even if you're searching a single-language string, which won't
have those speedups, your naive implementation checks every byte,
even continuation bytes, in UTF-8 to see if they might match the
first letter of the search string, even though no continuation
byte will match.  You can avoid this by partially decoding the
leading bytes of UTF-8 characters and skipping over continuation
bytes, as I've mentioned earlier in this thread, but you've then
added more lines of code to your pretty yet simple function and

My single-byte encoding has none of these problems, in fact, it's
much faster and uses less memory for the same function, while
available to UTF-8.

Finally, being able to write simple yet inefficient functions
like this is not the test of a good encoding, as strstr is a
library function, and making library developers' lives easier is
a low priority for any good format.  The primary goals are ease
of use for library consumers, ie app developers, and speed and
efficiency of the code.  You are trading on the latter two for
the former with this implementation.  That is not a good tradeoff.

Perhaps it was a good trade 20 years ago when everyone rolled
their own code and nobody bothered waiting for those floppy disks
to arrive with expensive library code.  It is not a good trade
today.

May 26 2013
"Declan" <oyscal 163.com> writes:
On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
I have noted from the beginning that these large alphabets have
to be encoded to two bytes, so it is not a true constant-width
encoding if you are mixing one of those languages into a
single-byte encoded string.  But this "variable length" encoding
is so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's
variable length. You overlook that I've had to deal with this.
It isn't "simpler", there's actually more work to write code
that adapts to one or two byte encodings.

containing a few Asian languages are variable-length, as
opposed to UTF-8 having every non-English language string be
variable-length.  It may be more work to write library code to
handle my encoding, perhaps, but efficiency and ease of use are
paramount.

So let's see: first you say that my scheme has to be variable
length because I am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard
Chinese. You cannot have it both ways. Code to deal with two
bytes is significantly different than code to deal with one.
That means you've got a conditional in your generic code -
that isn't going to be faster than the conditional for UTF-8.

two-byte encoding for Chinese and I already acknowledged that
such a predominantly single-byte encoding is still
variable-length.  The problem is that _you_ try to have it both
ways: first you claimed it is variable-length because I support
Chinese that way, then you claimed I don't support Chinese.

Yes, there will be conditionals, just as there are several
conditionals in phobos depending on whether a language supports
uppercase or not.  The question is whether the conditionals for
single-byte encoding will execute faster than decoding every
UTF-8 character.  This is a matter of engineering judgement, I
see no reason why you think decoding every UTF-8 character is
faster.

then you claim I don't handle these languages.  This kind of
blatant contradiction within two posts can only be called...
trolling!

You gave some vague handwaving about it, and then dismissed it
as irrelevant, along with more handwaving about what to do
with text that has embedded words in multiple languages.

to use two bytes to encode Chinese?  I'm not sure why you're

I didn't "handwave" about multi-language strings, I gave
specific ideas about how they might be implemented.  I'm not
claiming to have a bullet-proof and detailed single-byte
encoding spec, just spitballing some ideas on how to do it
better than the abominable UTF-8.

Worse, there are going to be more than 256 of these encodings
- you can't even have a byte to specify them. Remember,
Unicode has approximately 256,000 characters in it. How many
code pages is that?

maybe another 50 symbolic sets.  That leaves space for another
100 or so new scripts.  Maybe you are so worried about
future-proofing that you'd use two bytes to signify the
alphabet, but I wouldn't.  I think it's more likely that we'll
ditch scripts than add them. ;) Most of those symbol sets
should not be in UCS.

I was being kind saying you were trolling, as otherwise I'd be
saying your scheme was, to be blunt, absurd.

from 20 years ago, that is really only useful when streaming
text, which nobody does today.  There may have been a time when
ASCII compatibility was paramount, when nobody cared about
internationalization and almost all libraries only took ASCII
input: that is not the case today.

I'll be the first to admit that a lot of great ideas have been
initially dismissed by the experts as absurd. If you really
believe in this, I recommend that you write it up as a real
article, taking care to fill in all the handwaving with
something specific, and include some benchmarks to prove your
performance claims. Post your article on reddit,
stackoverflow, hackernews, etc., and look for fertile ground
for it. I'm sorry you're not finding fertile ground here (so
far, nobody has agreed with any of your points), and this is
the wrong place for such proposals anyway, as D is simply not
going to switch over to it.

I do believe in my single-byte encoding representing a step forward
from UTF-8.  While this discussion has produced no argument that I'm wrong,
it's possible we've all missed something salient, some
deal-breaker.  As I said before, I'm not proposing that D
"switch over."  I was simply asking people who know or at the
very least use UTF-8 more than most, as a result of employing
one of the few languages with Unicode support baked in, why
they think UTF-8 is a good idea.

I was hoping for a technical discussion on the merits, before I
went ahead and implemented this single-byte encoding.  Since
nobody has been able to point out a reason why my encoding
wouldn't be much better than UTF-8, I see no reason not to go
forward with my implementation.  I may write something up after
implementation: most people don't care about ideas, only
results, to the point where almost nobody can reason at all

Remember, extraordinary claims require extraordinary evidence,
not handwaving and assumptions disguised as bold assertions.

My claims are not mere "handwaving and assumptions."  Some people can reason about
such possible encodings, even in the incomplete form I've
sketched out, without having implemented them, if they know
what they're doing.

On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
On 5/25/2013 2:51 PM, Walter Bright wrote:
On 5/25/2013 12:51 PM, Joakim wrote:
For a multi-language string encoding, the header would contain a
single byte for every language used in the string, along with
multiple index bytes to signify the start and finish of every run
of single-language characters in the string. So, a list of
languages and a list of pure single-language substrings.

Please implement the simple C function strstr() with this simple
scheme, and post it here.

http://www.digitalmars.com/rtl/string.html#strstr

I'll go first. Here's a simple UTF-8 version in C. It's not
the fastest way to do it, but at least it is correct:
----------------------------------
#include <string.h>   /* strlen, memcmp */

char *strstr(const char *s1, const char *s2) {
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *) s1;
    char c2 = *s2;
    while (len2 <= len1) {
        if (c2 == *s1)
            if (memcmp(s2, s1, len2) == 0)
                return (char *) s1;
        s1++;
        len1--;
    }
    return NULL;
}

No doubt your UTF-8 version would be simpler to write in C and D
for multi-language strings that include Korean/Chinese/Japanese.  But while the strstr
implementation for my encoding would contain more conditionals
and lines of code, it would be far more efficient.  For
instance, because you know where all the language substrings
are from the header, you can potentially rule out searching
vast swathes of the string, because they don't contain the same
languages or lengths as the string you're searching for.

Even if you're searching a single-language string, which won't
have those speedups, your naive implementation checks every
byte, even continuation bytes, in UTF-8 to see if they might
match the first letter of the search string, even though no
continuation byte will match.  You can avoid this by partially
decoding the leading bytes of UTF-8 characters and skipping over
continuation bytes within each iteration of the while loop, as I've
mentioned earlier.

My single-byte encoding has none of these problems; in fact, it's
much faster and uses less memory for the same function, while
providing speedups from the header that are not available to UTF-8.

Finally, being able to write simple yet inefficient functions
like this is not the test of a good encoding, as strstr is a
library function, and making library developers' lives easier
is a low priority for any good format.  The primary goals are
ease of use for library consumers, ie app developers, and speed
and efficiency of the code.  You are trading away the latter two
for the former with this implementation.  That is not a good trade.

Perhaps it was a good trade 20 years ago, when everyone rolled
their own code and nobody bothered waiting for those floppy
disks to arrive with expensive library code.  It is not a good
trade today.

I服了u ("you win, I give up"). I'm starting to think your name means "joking"?

May 26 2013
"John Colvin" <john.loughran.colvin gmail.com> writes:
On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
I have noted from the beginning that these large alphabets have to
be encoded to two bytes, so it is not a true constant-width
encoding if you are mixing one of those languages into a
single-byte encoded string.  But this "variable length" encoding is
so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's
variable length. You overlook that I've had to deal with this.
It isn't "simpler", there's actually more work to write code
that adapts to one or two byte encodings.

Yes, only strings containing a few Asian languages are variable-length, as
opposed to UTF-8 having every non-English language string be
variable-length.  It may be more work to write library code to
handle my encoding, perhaps, but efficiency and ease of use are
paramount.

So let's see: first you say that my scheme has to be variable
length because I am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard
Chinese. You cannot have it both ways. Code to deal with two
bytes is significantly different than code to deal with one.
That means you've got a conditional in your generic code -
that isn't going to be faster than the conditional for UTF-8.

I said from the beginning that I'd use a two-byte encoding for Chinese, and I already acknowledged that
such a predominantly single-byte encoding is still
variable-length.  The problem is that _you_ try to have it both
ways: first you claimed it is variable-length because I support
Chinese that way, then you claimed I don't support Chinese.

Yes, there will be conditionals, just as there are several
conditionals in phobos depending on whether a language supports
uppercase or not.  The question is whether the conditionals for
single-byte encoding will execute faster than decoding every
UTF-8 character.  This is a matter of engineering judgement, I
see no reason why you think decoding every UTF-8 character is
faster.

I suggest you make an attempt at writing strstr and post it. Code
speaks louder than words.

May 26 2013
"Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 12:55:11 UTC, Walter Bright wrote:
On 5/26/2013 4:31 AM, Joakim wrote:
My single-byte encoding has none of these problems, in fact,
it's much faster
and uses less memory for the same function, while providing
from the header, that are not available to UTF-8.

C'mon, Joakim, show us this amazing strstr() implementation for your encoding.

encoding implementation.  I don't write toy code, particularly
inefficient functions like yours, for the reasons given, which

likes this silly Monty Python stuff, from what little I've seen.

May 26 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, May 26, 2013 at 11:59:19AM +0200, Joakim wrote:
On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
And just how exactly does that help with slicing? If anything, it
makes slicing way hairier and error-prone than UTF-8. In fact, this
one point alone already defeated any performance gains you may have
had with a single-byte encoding. Now you can't do *any* slicing at
all without convoluted algorithms to determine what encoding is where
at the endpoints of your slice, and the resulting slice must have new
headers to indicate the start/end of every different-language
substring.  By the time you're done with all that, you're going way
slower than processing UTF-8.

There are no convoluted algorithms; it's a simple check whether the
string contains any two-byte encodings, a check which can be done
once and cached.

IHBT. You said that to handle multilanguage strings, your header will have a list of starting/ending points indicating which encoding should
be used for which substring(s). That has nothing to do with two-byte
encodings. So, please show us the code: given a string containing, say,
English and French substrings, what will the header look like? And
what's the algorithm to take a slice of such a string?

If it's single-byte all the way through, no problems whatsoever with
slicing.

Huh?! How are there no problems with slicing? Let's say you have a
string that contains both English and French. According to your scheme,
you'll have some kind of header format that lets you say bytes 0-123 are
English, bytes 124-129 are French, and bytes 130-200 are English. Now
let's say I want a substring from 120 to 125. How would this be done?
And what about if I want a substring from 120 to 140? Or 126 to 130?
What if the string contains several runs of French?

If there are two-byte languages included, the slice function will have
to do a little arithmetic calculation before slicing.  You will also
need a few arithmetic ops to create the new header for the slice.  The
point is that these operations will be much faster than decoding every
code point to slice UTF-8.

You haven't proven that this "little arithmetic calculation" will be
faster than manipulating UTF-8. What if I have an English text that
contains quotations of Chinese, French, and Greek snippets? Math
symbols?  Please show us (1) how such a string should be encoded under
your scheme, and (2) the code will slice such a string in an efficient
way, according to your proposed encoding scheme.

(And before you dismiss such a string as unlikely or write it off as
rare, consider a technical math paper that cites the work of Chinese and
French authors -- a rather common thing these days. You'd need the extra
characters just to be able to cite their names, even if none of the
actual Chinese or French is quoted verbatim. Greek in general is used
all over math anyway, since for whatever reason mathematicians just love
Greek symbols, so it pretty much needs to be included by default.)

Again I say, I'm not 100% sold on UTF-8, but what you're proposing
here is far worse.

Yet you dismiss my alternative out of hand.

Clearly, we're not seeing what you're seeing here. So instead of making
claims, you might want to show us the actual code.  So far, I haven't seen anything that
convinces me that your scheme is any better.  In fact, from what I can
see, it's a lot worse, and you're just evading pointed questions about
how to address those problems.  Maybe that's a wrong perception, but not
having any actual code to look at, I'm having a hard time believing your
claims. Right now I'm leaning towards agreeing with Walter that you're
just trolling us (and rather successfully at that).

So, please show us the code. Otherwise, I think I should just stop
responding, as we're obviously not on the same page and this discussion
isn't getting anywhere.

T

--
Some ideas are so stupid that only intellectuals could believe them. -- George Orwell

May 26 2013
"Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 14:37:27 UTC, H. S. Teoh wrote:
IHBT. You said that to handle multilanguage strings, your header
will have a list of starting/ending points.

You go on to make a bunch of trolling arguments, which seem to imply you have
no idea how a single-byte encoding works.  I'm not going to
bother explaining it to you, anyone who knows encodings can
easily figure it out from what I've said so far.

Clearly, we're not seeing what you're seeing here. So instead of
making claims, you might want to show us the actual code.  So far,
I haven't seen anything that convinces me that your scheme is any
better.  In fact, from what I can see, it's a lot worse, and you're
just evading pointed questions about how to address those problems.
Maybe that's a wrong perception, but not having any actual code to
look at, I'm having a hard time believing your claims. Right now
I'm leaning towards agreeing with Walter that you're just trolling
us (and rather successfully at that).

That's not trolling, that's you not understanding what they're
saying.  I have laid out my arguments for the UTF-8 encoding being
worse.  If you can't understand my arguments, you need to go out
and learn some more about these issues.

So, please show us the code. Otherwise, I think I should just
stop
responding, as we're obviously not on the same page and this
discussion
isn't getting anywhere.

It would take too long for the kind of encoding I have in mind, so it
isn't worth my time, and if you can't understand the higher-level
technical language I'm using in these posts, you won't understand
the code anyway.  I have adequately sketched what I'd do, so that
anyone proficient in the art can reason about what the
consequences of such a scheme would be.  Perhaps that doesn't
include Walter and you.

I don't know why you'd want to keep responding to someone you
think is trolling you anyway.

May 26 2013
"Vladimir Panteleev" writes:
On Sunday, 26 May 2013 at 15:23:33 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 14:37:27 UTC, H. S. Teoh wrote:
IHBT.

I've made my position clear: I don't write toy code.

1. Make extraordinary claims
2. Refuse to back up said claims with small examples because "I don't write toy code"
3. Refuse to back up said claims with elaborate examples because "It will take too long"
4. Use an arrogant tone throughout the thread, implying that you're smarter than the creators of UTF and the creators and long-time contributors of D (while never contributing code to D yourself)

Conclusion: Successful troll is successful :)

May 26 2013
"Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 16:54:53 UTC, Vladimir Panteleev wrote:
1. Make extraordinary claims

2. Refuse to back up said claims with small examples because "I
don't write toy code"

I have repeatedly given small examples of how a single-byte
encoding would compare to UTF-8, along with listing optimizations
that make it much faster.  I finally refused to analyze Teoh's
examples because he accused me of trolling and demanded code as the
only possible explanation.

3. Refuse to back up said claims with elaborate examples because
"It will take too long"

What I actually said was that writing non-toy code would take too
long, and you wouldn't understand it anyway.

The whole demand for code is idiotic anyway.

If I outlined TCP/IP as a packet-switched network and briefly
sketched what the header might look like and the queuing
algorithms that I might use, I can just imagine you saying, "But
there's no code... how can I possibly understand what you're
saying without any code?"  If you can't understand networking
without seeing working code, you're not equipped to understand it
anyway, same here.

4. Use arrogant tone throughout thread, imply that you're
smarter than the creators of UTF, and creators and long-time
contributors of D (never contribute code to D yourself)

I actually had a lot of respect for Walter till I read this
thread.  I can only assume that his past experience with code
pages was so maddening that he cannot be rational on the subject
of going to any single-byte encoding that would be similar, same
with others griping about code pages above.  I also don't think
he and others are paying much attention to the various points I'm
raising, hence his recent claim that I wouldn't handle Chinese,
when I addressed that from the beginning.

Or it could just be that I'm much smarter than everybody else in
this thread. ;) I can't rule it out, given the often silly
responses I've been getting.

Conclusion: Successful troll is successful :)

Perhaps he just doesn't understand what I'm talking about, which is
why he doesn't raise a single technical point in this post.

May 26 2013
"Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously
so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at least 8
results.  How many people even know how UTF-8 works?  Given how few
people use it, I'm not surprised most don't know enough about how
it works to criticize it.

On 5/26/13 1:45 PM, Joakim wrote:
Or it could just be that I'm much smarter than everybody else
in this
thread, ;) I can't rule it out given the often silly responses
I've been
getting.

most everybody in this forum raises like one to the same opinion.
Usually it's like whatever the topic, a debate will ensue between
two ad-hoc groups.

I suspect it's because I'm presenting an original idea about a not well-understood technology, Unicode, not the usual "emacs vs
vim" or "D should not have null references" argument.  For
example, how many here know what UCS is?  Most people never dig
into Unicode, it's just a black box that is annoying to deal with.

It has become clear that people involved in this have gotten
too frustrated to have a constructive exchange. I suggest we
collectively drop it. What you may want to do is to use D's
modeling abilities to define a great string type pursuant to
your ideas. If it is as good as you believe it could, then it
will enjoy use and adoption and everybody will be better off.

I agree.  I am enjoying your book, btw.
May 26 2013
"Mr. Anonymous" <mailnew4ster gmail.com> writes:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu
wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously
so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

Haha, that is funny, :D though "unicode is shit" returns at least 8 results.  How many people even know how UTF-8 works?
Given how few people use it, I'm not surprised most don't know
enough about how it works to criticize it.

On the other hand:

:D

May 26 2013
Marcin Mstowski <marmyst gmail.com> writes:

Character Data Representation Architecture
<http://www-01.ibm.com/software/globalization/cdra/> by IBM. It is
what you want to do, with additions, and it has been available
since 1995.
When you come up with an inventive idea, I suggest you first check
what was already done in that area and then rethink this again to
check if you can do this better or improve the existing solution.
Other approaches are usually a waste of time and effort, unless you
are doing this for fun or you can't use existing solutions due to
problems with license, price, etc.



May 26 2013
"Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu
wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously
so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

least 8 results.  How many people even know how UTF-8 works?
Given how few people use it, I'm not surprised most don't know
enough about how it works to criticize it.

On the other hand:

I'm not sure if you were trying to make my point, but you just did.
There are only 19 results for that search string.  If UTF-8 were
such a rousing success and most developers found it easy to
understand, you wouldn't expect only 19 results for it and 8
against it.  The paucity of results suggests most don't know how it
works or are perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

May 26 2013
"Mr. Anonymous" <mailnew4ster gmail.com> writes:
On Sunday, 26 May 2013 at 19:25:37 UTC, Joakim wrote:
I'm not sure if you were trying to make my point, but you just did.  There are only 19 results for that search string.  If
UTF-8 were such a rousing success and most developers found it
easy to understand, you wouldn't expect only 19 results for it
and 8 against it.  The paucity of results suggests most don't
know how it works or perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

Man, you're a bullshit machine!

May 26 2013
"Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 19:20:15 UTC, Marcin Mstowski wrote:
Character Data Representation Architecture
<http://www-01.ibm.com/software/globalization/cdra/> by IBM. It is
what you want to do, with additions, and it has been available
since 1995.
When you come up with an inventive idea, I suggest you first check
what was already done in that area and then rethink this again to
check if you can do this better or improve the existing solution.
Other approaches are usually a waste of time and effort, unless you
are doing this for fun or you can't use existing solutions due to
problems with license, price, etc.

From what I can tell, CDRA is about tagging data with an encoding identifier, not about what the encoding actually is.  There is an appendix that lists
several possible encodings, including UTF-8!

Also, one of the first pages talks about representations of
floating point and integer numbers, which are outside the purview
of the text encodings we're talking about.  I cannot possibly be
expected to know about every dead format out there.  If you can
show that it is materially similar to my single-byte encoding
idea, it might be worth looking into.

May 26 2013
"Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 19:38:21 UTC, Mr. Anonymous wrote:
On Sunday, 26 May 2013 at 19:25:37 UTC, Joakim wrote:
I'm not sure if you were trying to make my point, but you just
did.  There are only 19 results for that search string.  If
UTF-8 were such a rousing success and most developers found it
easy to understand, you wouldn't expect only 19 results for it
and 8 against it.  The paucity of results suggests most don't
know how it works or are perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

Man, you're a bullshit machine!


May 26 2013
"Hans W. Uhlig" <huhlig gmail.com> writes:
On Saturday, 25 May 2013 at 03:46:23 UTC, Walter Bright wrote:
On 5/24/2013 7:16 PM, Manu wrote:
So when we define operators for u × v and a · b, or maybe n²?
;)

Oh, how I want to do that. But I still think the world hasn't
completely caught up with Unicode yet.

Using those characters would be wonderful, and while we do have
unicode software support, we don't really have unicode hardware
support. I am still on my 102-key keyboard and I haven't really
seen a good expanded-character keyboard come along.

May 26 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2013 at 01:28:22AM +0200, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for
example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story.
:)

I've recently come to the opinion that that's a bad idea, and D
should not support it.

Currently, the above code snippet compiles (upon inserting "import
std.stdio;", that is). Should that be made illegal?

Why do you think it's a bad idea? It makes it such that code can be
in various languages? Just lack of keyboard support?

I can't speak for Walter, but one issue that comes to mind is when
someone reads the code and doesn't understand the language the
identifiers are in, or worse, can't reliably recognize the distinctions
between the glyphs, and so can't match identifier names correctly -- if
you don't know Japanese, for example, seeing a bunch of Japanese
identifiers of equal length will look more-or-less the same (all
gibberish to you), so it only obscures the code. Or if your computer
doesn't have the requisite fonts to display the alphabet in question,
then you'll just see a bunch of ?'s or black blotches for all program
identifiers, making the code completely unreadable.

Since language keywords are already in English, we might as well
standardize on English identifiers too. (After all, Phobos identifiers
are English as well.) While it's cool to have multilingual identifiers,
I'm not sure if it actually adds any practical value. :) If anything, it
arguably detracts from usability. Multilingual program output, of
course, is a different kettle o' fish.

T

--
Doubt is a self-fulfilling prophecy.

May 27 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 28/05/13 09:44, H. S. Teoh wrote:

Since language keywords are already in English, we might as well
standardize on English identifiers too.

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Peter

May 27 2013
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, May 28, 2013 11:38:08 Peter Williams wrote:
On 28/05/13 09:44, H. S. Teoh wrote:
Since language keywords are already in English, we might as well
standardize on English identifiers too.

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

I think it was more that the only reason Unicode would be
necessary in identifiers is if you weren't using English, so if
you assume that everyone is going to be using some form of
English for their identifier names, you can skip having Unicode
in identifiers. So, a natural effect of standardizing on English
is that you can stick with ASCII.

- Jonathan M Davis

May 27 2013
"David Eagen" <davideagen mailinator.com> writes:
On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:

So you're going to spell check them all to make sure that
they're English?  Or did you mean ASCII?

Peter

That's it. I'm filing a bug against std.traits. There's a
unittest there with a struct named "Colour". Completely
unacceptable.

May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 11:42, Jonathan M Davis <jmdavisProg gmx.com> wrote:

On Tuesday, May 28, 2013 11:38:08 Peter Williams wrote:
On 28/05/13 09:44, H. S. Teoh wrote:
Since language keywords are already in English, we might as well
standardize on English identifiers too.

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

I think it was more that the only reason Unicode would be
necessary in identifiers is if you weren't using English, so if
you assume that everyone is going to be using some form of
English for their identifier names, you can skip having Unicode
in identifiers. So, a natural effect of standardizing on English
is that you can stick with ASCII.

I'm fairly sure that any programmer who takes themselves seriously
will use English; I don't see any reason why this rule should need
to be implemented by the compiler.
The loss I can imagine is that kids, or people from developing countries,
etc, may have an additional barrier to learning to code if they don't speak
English.
Nobody in this set is likely to produce a useful library that will be used
widely.
Likewise, no sane programmer is going to choose to use a library that's not
written in English.

You may argue that the keywords and libs are in English. I can
attest from personal experience that a child or a
non-English-speaking beginner probably has absolutely NO IDEA what
the keywords mean anyway, even if they do speak English.
I certainly had no idea when I was a kid, I just typed them because I
figured out what they did. I didn't even know how to say many of them, and
realised 5 years later that I was saying all the words wrong...

So my point is, why make this restriction a static compiler rule
when it's not practically going to be broken anyway? You never
know, it may actually assist some people somewhere.
I think it's a great thing that D can accept identifiers in
non-English.


May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 13:22, David Eagen <davideagen mailinator.com> wrote:

On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Peter

That's it. I'm filing a bug against std.traits. There's a unittest
there with a struct named "Colour". Completely unacceptable.

How dare you!
What's unacceptable is that a bunch of ex-english speakers had the audacity
to rewrite the dictionary and continue to call it English!
I will never write colour without a u, ever! I may suffer the global
American cultural invasion of my country like the rest of us, but I will
never let them infiltrate my mind! ;)


May 27 2013
On Tuesday, 28 May 2013 at 04:52:55 UTC, Walter Bright wrote:
On 5/27/2013 9:27 PM, Manu wrote:
I will never write colour without a u, ever! I may suffer the
global American
cultural invasion of my country like the rest of us, but I
will never let them
infiltrate my mind! ;)

Resistance is useless.

*futile :P

May 27 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 28/05/13 13:22, David Eagen wrote:
On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Peter

That's it. I'm filing a bug against std.traits. There's a unittest
there with a struct named "Colour". Completely unacceptable.

Except here in Australia and other places where they use the Queen's
English :-)

Peter

May 27 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 14:38, Peter Williams <pwil3058 bigpond.net.au> wrote:

On 28/05/13 13:22, David Eagen wrote:

On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:

So you're going to spell check them all to make sure that they're
English?  Or did you mean ASCII?

Peter

That's it. I'm filing a bug against std.traits. There's a unittest
there with a struct named "Colour". Completely unacceptable.

Except here in Australia and other places where they use the Queen's
English :-)

Is there anywhere other than America that doesn't?


May 27 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 27 May 2013 at 23:46:17 UTC, H. S. Teoh wrote:
On Tue, May 28, 2013 at 01:28:22AM +0200, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for
example:

void main(string[] args) {
    int число = 1;
    foreach (и; 0..100)
        число += и;
    writeln(число);
}

Of course, whether that's a good practice is a different story.
:)

I've recently come to the opinion that that's a bad idea, and D
should not support it.

Currently, the above code snippet compiles (upon inserting
"import std.stdio;", that is). Should that be made illegal?

Why do you think it's a bad idea? It makes it such that code can
be in various languages? Just lack of keyboard support?

I can't speak for Walter, but one issue that comes to mind is
when someone reads the code and doesn't understand the language
the identifiers are in, or worse, can't reliably recognize the
distinctions between the glyphs, and so can't match identifier
names correctly -- if you don't know Japanese, for example,
seeing a bunch of Japanese identifiers of equal length will look
more-or-less the same (all gibberish to you), so it only obscures
the code. Or if your computer doesn't have the requisite fonts to
display the alphabet in question, then you'll just see a bunch of
?'s or black blotches for all program identifiers, making the
code completely unreadable.

Since language keywords are already in English, we might as well
standardize on English identifiers too. (After all, Phobos
identifiers are English as well.) While it's cool to have
multilingual identifiers, I'm not sure if it actually adds any
practical value. :) If anything, it arguably detracts from
usability. Multilingual program output, of course, is a different
kettle o' fish.

T

I can tell you for a fact there are tons of *private* companies
that create closed-source programs whose source code is *not* in
English. And from *their* business perspective, it makes sense.
They don't care if you can't understand their source code, since
*you* will never see their source code. I'm quite confident there
are tons of programs that you use that *aren't* written in
English.

My wife writes the embedded software for the hardware her company
sells. I can tell you the source code sure as hell isn't in
English. Why would it be? The entire company speaks the local
language natively.
I've worked in Japan, and I can tell you the norm over there is
*not* to code in English.

And why should it? Why would you code in a language that is not
your own, if you don't plan to ever share your code to outside
your team? Why would you care about users that don't have unicode
support, if the workstations of all your employees are unicode
compatible?

Allowing unicode identifiers makes their work a better
experience. Why should we take that away from them?

but whether or not you should be able to use them should belong
in a coding standard, not in a compiler limitation.

May 28 2013
Manu <turkeyman gmail.com> writes:
On 28 May 2013 19:12, Jacob Carlborg <doob me.com> wrote:

On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Yes, the region called America ;)
Although there's a few British colonies in the Caribbean...


May 28 2013
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Tue, 28 May 2013 14:11:29 +0200, Jacob Carlborg <doob me.com> wrote:

On 2013-05-28 14:09, Manu wrote:

Yes, the region called America ;)
Although there's a few British colonies in the Caribbean...

Oh, you meant the whole region and not the country.

America is not a country. The country is called USA.

--
Simen

May 28 2013
Peter Williams <pwil3058 bigpond.net.au> writes:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent not a
country). :-)

Peter

May 28 2013
On Tuesday, 28 May 2013 at 23:33:47 UTC, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent
not a country). :-)

Peter

America isn't a continent, North America is a continent, and
Canada is in North America :P

May 28 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 29 May 2013 at 01:29:07 UTC, Diggory wrote:
On Tuesday, 28 May 2013 at 23:33:47 UTC, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?

Canada, Jamaica, other countries in that region?

Last time I looked Canada was in America (which is a continent
not a country). :-)

Peter

America isn't a continent, North America is a continent, and
Canada is in North America :P

Well, that point of view really depends on which continent
you're from:
http://en.wikipedia.org/wiki/Continents#Number_of_continents

There is no internationally agreed-on scheme. I for one have
always been taught that there is only "America", and that the
terms "North America" and "South America" were only meant to
denote regions within said continent.

May 28 2013
Marcin Mstowski <marmyst gmail.com> writes:
On Sun, May 26, 2013 at 9:42 PM, Joakim <joakim airpost.net> wrote:

On Sunday, 26 May 2013 at 19:20:15 UTC, Marcin Mstowski wrote:

Character Data Representation
Architecture <http://www-01.ibm.com/software/globalization/cdra/>
by IBM. It is what you want to do, with additions, and it has
been available since 1995.
When you come up with an inventive idea, I suggest you first
check what has already been done in that area and then rethink
this again to check if you can do it better or improve an
existing solution. Other approaches are usually a waste of time
and effort, unless you are doing this for fun or you can't use
existing solutions due to problems with license, price, etc.

You might be right, but I gave it a quick look and can't make out
what the encoding actually is.  There is an appendix that lists
several possible encodings, including UTF-8!

Yes, because they didn't reinvent the wheel from scratch and are
reusing existing encodings as a base. There isn't any problem
with adding another code page.

Also, one of the first pages talks about representations of
floating point and integer numbers, which are outside the purview
of the text encodings we're talking about.

They are outside the scope of CDRA too. At least read the picture
description before making out-of-context assumptions.

I cannot possibly be expected to know about every dead format out there.

Nobody expects that.

If you can show that it is materially similar to my single-byte encoding
idea, it might be worth looking into.

Spending ~15 min reading the Introduction isn't worth your time,
so why should I waste my time showing you anything?


May 26 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, May 26, 2013 at 02:14:17PM -0700, Walter Bright wrote:
On 5/26/2013 1:44 PM, Hans W. Uhlig wrote:
Using those characters would be wonderful and while we do have
unicode software support we don't really have unicode hardware
support. I am still on my 102 key keyboard and I haven't really seen
a good expanded character keyboard come along.

I have a post-it stuck to my monitor with the numbers for various
unicode characters, but I just can't see that for writing code.

that the keys are either a fixed layout with LCD labels on each key, or
perhaps the whole thing is a long touchscreen, that allows arbitrary
relabelling of keys (or, in the latter case, complete dynamic
reconfiguration of layout). There would be some convenient way to switch
between layouts, say a scrolling sidebar or roller dial of some sort, so
you could, in theory, type Unicode directly.

I haven't been able to refine this into an actual, implementable idea,
though.

T

--
Shin: (n.) A device for finding furniture in the dark.

May 26 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2013 at 02:23:32AM +0200, Diggory wrote:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and
D should not
support it.

Why do you think its a bad idea? It makes it such that code can
be in various
languages? Just lack of keyboard support?

Every time I've been to a programming shop in a foreign country,
the developers speak english at work and code in english. Of
course, that doesn't mean that everyone does, but as far as I can
tell the overwhelming bulk is done in english.

Naturally, full Unicode needs to be in strings and comments, but
symbol names? I don't see the point nor the utilty of it.
Supporting such is just pointless complexity to the language.

The most convincing case for usefulness I've seen was in java where
a class implemented a particular algorithm and so was named after
it. This name had a particular accented character and so required
unicode. Lots of algorithms are named after their inventors and lots
of these names contain unicode characters so it's not that uncommon.

I don't find this a compelling reason to allow full Unicode on
identifiers, though. For one thing, somebody maintaining your code may
not know how to type said identifier correctly. It can be very
frustrating to have to keep copy-n-pasting identifiers just because they
contain foreign letters you can't type. Not to mention sheer
unreadability if the inventor's name is in Chinese, so the algorithm
name is also in Chinese, and the person maintaining the code can't read
Chinese. This will kill D code maintainability.

T

--
Don't drink and derive. Alcohol and algebra don't mix.

May 27 2013
"Kiith-Sa" <kiithsacmp gmail.com> writes:
You mean like
http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard ?

May 26 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, May 26, 2013 at 11:25:09PM +0200, Kiith-Sa wrote:
You mean like http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard
?

Whoa! That is exactly what I had in mind!!

Pity they don't appear to support Linux, though. :-(

T

--
MACINTOSH: Most Applications Crash, If Not, The Operating System Hangs

May 26 2013
"Torje Digernes" <torjehoa pvv.org> writes:
On Sunday, 26 May 2013 at 21:46:38 UTC, H. S. Teoh wrote:
On Sun, May 26, 2013 at 11:25:09PM +0200, Kiith-Sa wrote:
You mean like
http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard
?

Whoa! That is exactly what I had in mind!!

Pity they don't appear to support Linux, though. :-(

T

If you want to configure your keyboard so you can type unicode in
Linux, you should make yourself familiar with xkb. It is not that
difficult to work with, but not exactly user friendly either;
super-user friendly, though.

May 26 2013
"Wyatt" <wyatt.epp gmail.com> writes:
On Sunday, 26 May 2013 at 21:23:44 UTC, H. S. Teoh wrote:
keyboard", in that the keys are either a fixed layout with LCD
labels on each key, or perhaps the whole thing is a long
touchscreen, that allows arbitrary relabelling of keys (or, in
the latter case, complete dynamic reconfiguration of layout).
There would be some convenient way to switch between layouts, say
a scrolling sidebar or roller dial of some sort, so you could, in
theory, type Unicode directly.

I haven't been able to refine this into an actual, implementable
idea, though.

perspective you want to throw hardware at a software problem.
Have you ever used a Japanese input method?  They're sort of a
good exemplar here, wherein you type a sequence and then hit
space to cycle through possible ways of writing it.  So "ame" can
become あめ, 雨, 飴, etc.  Right now, in addition to my learning, I
also use it for things like α (アルファ) and Δ (デルタ).  It's
limited, but...usable, I guess.  Sort of.

The other end of this is TeX, which was designed around the idea
of composing scientific texts with a high degree of control and
flexibility.  Specialty characters are inserted with
backslash-escapes, like \alpha, \beta, etc.

Now combine the two:  An input method that outputs as usual,
until you enter a character code which is substituted in real
time to what you actually want.
Example:
"values of \beta will give rise to dom!" composes as
"values of β will give rise to dom!"

No hardware required; just a smarter IME.  Like maybe this one:
http://www.andonyar.com/rec/2008-03/mathinput/ (I'm honestly not
yet sure how mature or usable that one is as I'm a UIM user, but
it does serve as a proof of concept).
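The backslash-escape substitution described above takes only a few
lines in any language. A minimal Python sketch; the escape table
here is a toy, since a real IME or TeX knows far more names:

```python
import re

# Toy escape table; real TeX/IMEs support hundreds of names.
ESCAPES = {"alpha": "α", "beta": "β", "Delta": "Δ"}

def compose(text):
    # Replace each \name with its Unicode character when known,
    # leaving unknown escapes untouched.
    return re.sub(r"\\([A-Za-z]+)",
                  lambda m: ESCAPES.get(m.group(1), m.group(0)),
                  text)

print(compose(r"values of \beta will give rise to dom!"))
# → values of β will give rise to dom!
```

A live IME would run this substitution keystroke by keystroke
rather than over a finished string, but the lookup is the same.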

May 26 2013
"Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 21:08:40 UTC, Marcin Mstowski wrote:
On Sun, May 26, 2013 at 9:42 PM, Joakim <joakim airpost.net>
wrote:
Also, one of the first pages talks about representations of
floating point
and integer numbers, which are outside the purview of the text
encodings

They are outside of scope of CDRA too. At least read picture
description
before making out of context assumptions.

fairly generic.  I do see now that one paragraph does say that
CDRA only deals with graphical characters and that they were only
talking about numbers earlier to introduce the topic of data
representation.

If you can show that it is materially similar to my
single-byte encoding
idea, it might be worth looking into.

why should
i waste my time showing you anything ?

You claimed that my encoding was reinventing the wheel, therefore
the onus is on you to show which of the multiple encodings CDRA
uses that I'm reinventing.  I'm not interested in delving into
the docs for some dead IBM format to prove _your_ point.  More
likely, you are just dead wrong and CDRA simply uses code pages,
which are not the same as the single-byte encoding with a header
idea that I've sketched in this thread.
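For concreteness, the header-plus-single-byte scheme being argued
about might look roughly like the following sketch. The page ids
and code-page tables are invented for illustration; the posts in
this thread don't specify them.

```python
# Hypothetical sketch: a one-byte header selects a code page, then
# every character costs exactly one byte. Page ids and tables are
# made up for this example.
PAGES = {
    0x01: "abcdefghijklmnopqrstuvwxyz",            # toy "Latin" page
    0x02: "абвгдежзийклмнопрстуфхцчшщыэюя",        # toy "Cyrillic" page
}

def encode(page_id, text):
    table = PAGES[page_id]
    return bytes([page_id] + [table.index(c) for c in text])

def decode(data):
    table = PAGES[data[0]]
    return "".join(table[b] for b in data[1:])

msg = encode(0x02, "где")
print(len(msg))                    # 4: one header byte + three chars
print(decode(msg))                 # где
print(len("где".encode("utf-8")))  # 6: UTF-8 spends two bytes each
```

The sketch also exposes the objection raised in the thread: a
single header byte means one page per string, so mixed-language
text needs a longer header or escape mechanism.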

May 26 2013
"John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 27 May 2013 at 06:11:20 UTC, Joakim wrote:
You claimed that my encoding was reinventing the wheel,
therefore the onus is on you to show which of the multiple
encodings CDRA uses that I'm reinventing.  I'm not interested
in delving into the docs for some dead IBM format to prove
_your_ point.

It's your idea and project. Showing that it is original / doing
your research on previous efforts is probably something that
*you* should do, whether or not it's someone else's "point".

More likely, you are just dead wrong and CDRA simply uses code
pages


May 27 2013
"Joakim" <joakim airpost.net> writes:
On Monday, 27 May 2013 at 12:25:06 UTC, John Colvin wrote:
On Monday, 27 May 2013 at 06:11:20 UTC, Joakim wrote:
You claimed that my encoding was reinventing the wheel,
therefore the onus is on you to show which of the multiple
encodings CDRA uses that I'm reinventing.  I'm not interested
in delving into the docs for some dead IBM format to prove
_your_ point.

It's your idea and project. Showing that it is original / doing
your research on previous efforts is probably something that
*you* should do, whether or not it's someone else's "point".

with past projects that never really got started or bureaucratic
efforts, like CDRA appears to be, that never went anywhere.  I
can hardly be expected to go rummaging through all these efforts
in the hopes that what, someone else has already written the
code?  If you have a brain, you can look at the currently popular
approaches, which CDRA isn't, and come up with something that
makes more sense.  I don't much care if my idea is original, I
care that it is better.

More likely, you are just dead wrong and CDRA simply uses code
pages

antiquated code page encodings in its list of proposed encodings.
If Marcin believes one of those is similar to my scheme, he
should say which one, otherwise his entire line of argument is
irrelevant.  It's not up to me to prove _his_ point.

Without having looked at any of the encodings in detail, I'm fairly
certain he's wrong.  If he feels otherwise, he can pipe up with
which one he had in mind.  The fact that he hasn't speaks volumes.

May 27 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, May 27, 2013 at 04:17:06AM +0200, Wyatt wrote:
On Sunday, 26 May 2013 at 21:23:44 UTC, H. S. Teoh wrote:
in that the keys are either a fixed layout with LCD labels on each
key, or perhaps the whole thing is a long touchscreen, that allows
arbitrary relabelling of keys (or, in the latter case, complete
dynamic reconfiguration of layout). There would be some convenient
way to switch between layouts, say a scrolling sidebar or roller dial
of some sort, so you could, in theory, type Unicode directly.

I haven't been able to refine this into an actual, implementable
idea, though.

perspective you want to throw hardware at a software problem.  Have
you ever used a Japanese input method?  They're sort of a good
exemplar here, wherein you type a sequence and then hit space to
cycle through possible ways of writing it.  So "ame" can become
あめ, 雨, 飴, etc.  Right now, in addition to my learning, I also
use it for things like α (アルファ) and Δ (デルタ).  It's limited,
but...usable, I guess.  Sort of.

The other end of this is TeX, which was designed around the idea of
composing scientific texts with a high degree of control and
flexibility.  Specialty characters are inserted with
backslash-escapes, like \alpha, \beta, etc.

Now combine the two:  An input method that outputs as usual, until
you enter a character code, which is replaced in real time with what
you actually want.
Example:
"values of \beta will give rise to dom!" composes as
"values of β will give rise to dom!"

No hardware required; just a smarter IME.  Like maybe this one:
http://www.andonyar.com/rec/2008-03/mathinput/ (I'm honestly not yet
sure how mature or usable that one is as I'm a UIM user, but it does
serve as a proof of concept).

I like this idea. It's certainly more feasible than reinventing the
Optimus Maximus keyboard. :) I can write code for free, but engineering
custom hardware is a bit beyond my abilities (and means!).

If we go the software route, then one possible strategy might be:

- Have a default mode that is whatever your default keyboard layout is
(the usual 100+-key layout, DVORAK, whatever).

- Assign one or two escape keys (not to be confused with the Esc key,
which is something else) that allows you to switch mode.

- Under the 1-key scheme, you'd use it to begin sequences like \beta,
except that instead of the backslash \, you're using a dedicated
key. These sequences can include individual characters (e.g.
<ESC>beta == β) or allow you to change the current input mode (e.g.
<ESC>grk to switch to a Greek layout that takes effect from that
point onwards until you enter, say, <ESC>eng). For convenience, the
sequence <ESC><ESC> can be shorthand for switching back to whatever
the default layout is, so that if you mistype an escape sequence
and end up in some strange unexpected layout mode, hitting <ESC>
twice will reset it back to the default.

- Under the 2-key scheme, you'd have one key dedicated for the
occasional foreign character (<ESC1>beta == β), and the second key
dedicated for switching layouts (thus allowing shorter sequences
for switching between languages without fear of conflicting with
single-character sequences, e.g., <ESC2>g for Greek).

Perhaps the 1-key scheme is the simplest to implement. The capslock key
is a good candidate, being conveniently located where your left little
finger is, and having no real useful function in this day and age.
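The 1-key scheme could be prototyped as a tiny state machine. The sketch below is hypothetical: it assumes a dedicated "ESC" keycode and that a space terminates each sequence, neither of which is part of the proposal as stated.

```python
# Sketch of the 1-key scheme: a dedicated escape key ("ESC" below,
# not the Esc key) starts a sequence; a space terminates it.
GREEK = {"a": "\u03b1", "b": "\u03b2"}            # toy "grk" layout
DEFAULT = {c: c for c in "abcdefghijklmnopqrstuvwxyz "}
SEQUENCES = {"beta": "\u03b2",                    # single character
             "grk": "layout:greek",               # layout switches
             "eng": "layout:default"}

class OneKeyIME:
    def __init__(self):
        self.layout = DEFAULT
        self.pending = None   # None = normal mode; str = sequence buffer

    def key(self, k):
        """Feed one keystroke; return the text to emit (may be '')."""
        if k == "ESC":
            if self.pending == "":        # ESC ESC: reset to default
                self.layout, self.pending = DEFAULT, None
            else:                         # begin an escape sequence
                self.pending = ""
            return ""
        if self.pending is not None:
            if k == " ":                  # space ends the sequence
                action = SEQUENCES.get(self.pending, "")
                self.pending = None
                if action == "layout:greek":
                    self.layout = GREEK
                    return ""
                if action == "layout:default":
                    self.layout = DEFAULT
                    return ""
                return action             # single-character sequence
            self.pending += k
            return ""
        return self.layout.get(k, k)      # normal typing

ime = OneKeyIME()
keys = list("values of ") + ["ESC"] + list("beta ")
print("".join(ime.key(k) for k in keys))  # values of β
```

After an "ESC grk " sequence the same object would emit Greek characters for plain keys until "ESC eng " or "ESC ESC" restores the default layout.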

The only drawback is no custom key labels. But perhaps that can be
alleviated by hooking an escape sequence to toggle an on-screen visual
representation of the current layout. Maybe <ESC>? can be assigned to
invoke a helper utility that renders the current layout on the screen.

T

--
Don't get stuck in a closet---wear yourself out.

May 27 2013
On Monday, 27 May 2013 at 02:17:08 UTC, Wyatt wrote:
No hardware required; just a smarter IME.

Perhaps something like the compose key?

http://en.wikipedia.org/wiki/Compose_key

May 27 2013
Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 26 May 2013 21:25:36 +0200
schrieb "Joakim" <joakim airpost.net>:

On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu
wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is obviously
so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

least 8 results.  How many people even know how UTF-8 works?
Given how few people use it, I'm not surprised most don't know
enough about how it works to criticize it.

On the other hand:

did.  There are only 19 results for that search string.  If UTF-8
were such a rousing success and most developers found it easy to
understand, you wouldn't expect only 19 results for it and 8
against it.  The paucity of results suggests most don't know how
it works, or are perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

--
Marco

May 29 2013
"Joakim" <joakim airpost.net> writes:
On Wednesday, 29 May 2013 at 23:40:51 UTC, Marco Leise wrote:
Am Sun, 26 May 2013 21:25:36 +0200
schrieb "Joakim" <joakim airpost.net>:

On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu
wrote:
On 5/26/13 1:45 PM, Joakim wrote:
What is extraordinary about "UTF-8 is shit?" It is
obviously so.

Congratulations, you are literally the only person on the
Internet who said so: http://goo.gl/TFhUO

least 8 results.  How many people even know how UTF-8
works?  Given how few people use it, I'm not surprised most
don't know enough about how it works to criticize it.

On the other hand:

did.  There are only 19 results for that search string.  If
UTF-8 were such a rousing success and most developers found it
easy to understand, you wouldn't expect only 19 results for it
and 8 against it.  The paucity of results suggests most don't
know how it works, or are perhaps simply annoyed by it, liking the
internationalization but disliking the complexity.

"utf-8 is the best guess."  If you look at the results, almost
all make the pragmatic recommendation that UTF-8 is the best _for
now_, because it is better supported than other multi-language
formats.  That's like saying Windows is the best OS because it's
easier to find one in your local computer store.

Yet again, the fact that even this somewhat ambiguous search
string has only 121 results is damning of anyone liking UTF-8,
nothing else, given the many thousands of programmers who are
forced to use Unicode if they want to internationalize.

May 30 2013
Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 30 May 2013 09:19:32 +0200
schrieb "Joakim" <joakim airpost.net>:

Your point is?  121 results, including false positives like
"utf-8 is the best guess."  If you look at the results, almost
all make the pragmatic recommendation that UTF-8 is the best _for
now_, because it is better supported than other multi-language
formats.  That's like saying Windows is the best OS because it's
easier to find one in your local computer store.

Yet again, the fact that even this somewhat ambiguous search
string has only 121 results is damning of anyone liking UTF-8,
nothing else, given the many thousands of programmers who are
forced to use Unicode if they want to internationalize.

Alright, for me it said ~6,570,000 results, which I found
funny. I'm not trying to make a point, but to troll. If there
is a point to be made, it is that the count of search results
is a _very_ rough estimate.

--
Marco

May 30 2013