www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Why UTF-8/16 character encodings?

reply "Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
 toUpper/lower cannot be made in place if it should handle all 
 Unicode. Some characters will change their length when convert 
 to/from uppercase. Examples of these are the German double S 
 and some Turkish I.
This triggered a long-standing bugbear of mine: why are we using these variable-length encodings at all? Does anybody really care about UTF-8 being "self-synchronizing," ie does anybody actually use that in this day and age? Sure, it's backwards-compatible with ASCII and the vast majority of usage is probably just ASCII, but that means the other languages don't matter anyway. Not to mention taking the valuable 8-bit real estate for English and dumping the longer encodings on everyone else. I'd just use a single-byte header to signify the language and then put the vast majority of languages in a single byte encoding, with the few exceptional languages with more than 256 characters encoded in two bytes. OK, that doesn't cover multi-language strings, but that is what, .000001% of usage? Make your header a little longer and you could handle those also. Yes, it wouldn't be strictly backwards-compatible with ASCII, but it would be so much easier to internationalize. Of course, there's also the monoculture we're creating; love this UTF-8 rant by tuomov, author of one the first tiling window managers for linux: http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06 The emperor has no clothes, what am I missing?
May 24 2013
next sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Friday, 24 May 2013 at 17:05:57 UTC, Joakim wrote:
 This triggered a long-standing bugbear of mine: why are we 
 using these variable-length encodings at all?
Simple: backwards compatibility with all ASCII APIs (e.g. most C libraries), and because I don't want my strings to consume multiple bytes per character when I don't need it. Your language header idea is no good for at least three reasons: 1. What happens if I want to take a substring slice of your string? I'll need to allocate a new string to add the header in. 2. What if I have a long string with the ASCII header and want to append a non-ASCII character on the end? I'll need to reallocate the whole string and widen it with the new header. 3. Even if I have a string that is 99% ASCII then I have to pay extra bytes for every character just because 1% wasn't ASCII. With UTF-8, I only pay the extra bytes when needed.
May 24 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 17:43:03 UTC, Peter Alexander wrote:
 Simple: backwards compatibility with all ASCII APIs (e.g. most 
 C libraries), and because I don't want my strings to consume 
 multiple bytes per character when I don't need it.
And yet here we are today, where an early decision made solely to accommodate the authors of then-dominant all-ASCII APIs has now foisted an unnecessarily complex encoding on all of us, with reduced performance as the result. You do realize that my encoding would encode almost all languages' characters in single bytes, unlike UTF-8, right? Your latter argument is one against UTF-8.
 Your language header idea is no good for at least three reasons:

 1. What happens if I want to take a substring slice of your 
 string? I'll need to allocate a new string to add the header in.
Good point. The solution that comes to mind right now is that you'd parse my format and store it in memory as a String class, storing the chars in an internal array with the header stripped out and the language stored in a property. That way, even a slice could be made to refer to the same language, by referring to the language of the containing array. Strictly speaking, this solution could also be implemented with UTF-8, simply by changing the format for the data structure you use in memory to the one I've outlined, as opposed to using the the UTF-8 encoding for both transmission and processing. But if you're going to use my format for processing, you might as well use it for transmission also, since it is much smaller for non-ASCII text. Before you ridicule my solution as somehow unworkable, let me remind you of the current monstrosity. Currently, the language is stored in every single UTF-8 character, by having the length vary from one to four bytes depending on the language. This leads to Phobos converting every UTF-8 string to UTF-32, so that it can easily run its algorithms on a constant-width 32-bit character set, and the resulting performance penalties. Perhaps the biggest loss is that programmers everywhere are pushed to wrap their heads around this mess, predictably leading to either ignorance or broken code. Which seems more unworkable to you?
 2. What if I have a long string with the ASCII header and want 
 to append a non-ASCII character on the end? I'll need to 
 reallocate the whole string and widen it with the new header.
How often does this happen in practice? I suspect that this almost never happens. But if it does, it would be solved by the String class I outlined above, as the header isn't stored in the array anymore.
 3. Even if I have a string that is 99% ASCII then I have to pay 
 extra bytes for every character just because 1% wasn't ASCII. 
 With UTF-8, I only pay the extra bytes when needed.
I don't understand what you mean here. If your string has a thousand non-ASCII characters, the UTF-8 version will have one or two thousand more characters, ie 1 or 2 KB more. My format would add a couple bytes in the header for each non-ASCII language character used, that's it. It's a clear win for my format. In any case, I just came up with the simplest format I could off the top of my head, maybe there are gaping holes in it. But my point is that we should be able to come up with such a much simpler format, which keeps most characters to a single byte, not that my format is best. All I want to argue is that UTF-8 is the worst. ;)
May 24 2013
next sibling parent "Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 20:37:58 UTC, Joakim wrote:
 3. Even if I have a string that is 99% ASCII then I have to 
 pay extra bytes for every character just because 1% wasn't 
 ASCII. With UTF-8, I only pay the extra bytes when needed.
I don't understand what you mean here. If your string has a thousand non-ASCII characters, the UTF-8 version will have one or two thousand more characters, ie 1 or 2 KB more. My format would add a couple bytes in the header for each non-ASCII language character used, that's it. It's a clear win for my format.
Sorry, I was a bit imprecise. Here's what I meant to write: I don't understand what you mean here. If your string has a thousand non-ASCII characters, the UTF-8 version will have one or two thousand more bytes, ie 1 or 2 KB more. My format would add a couple bytes in the header for each non-ASCII language used, that's it. It's a clear win for my format.
May 24 2013
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2013 1:37 PM, Joakim wrote:
 This leads to Phobos converting every UTF-8 string to UTF-32, so that
 it can easily run its algorithms on a constant-width 32-bit character set, and
 the resulting performance penalties.
This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed.
 Yes, it wouldn't be strictly backwards-compatible with ASCII, but it would be 
so much easier to internationalize. That was the go-to solution in the 1980's, they were called "code pages". A disaster.
 with the few exceptional languages with more than 256 characters encoded in 
two bytes. Like those rare languages Japanese, Korean, Chinese, etc. This too was done in the 80's with "Shift-JIS" for Japanese, and some other wacky scheme for Korean, and a third nutburger one for Chinese. I've had the misfortune of supporting all that in the old Zortech C++ compiler. It's AWFUL. If you think it's simpler, all I can say is you've never tried to write internationalized code with it. UTF-8 is heavenly in comparison. Your code is automatically internationalized. It's awesome.
May 24 2013
prev sibling next sibling parent "anonymous" <anonymous example.com> writes:
On Friday, 24 May 2013 at 17:05:57 UTC, Joakim wrote:
 On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
 toUpper/lower cannot be made in place if it should handle all 
 Unicode. Some characters will change their length when convert 
 to/from uppercase. Examples of these are the German double S 
 and some Turkish I.
This triggered a long-standing bugbear of mine: why are we using these variable-length encodings at all? Does anybody really care about UTF-8 being "self-synchronizing," ie does anybody actually use that in this day and age? Sure, it's backwards-compatible with ASCII and the vast majority of usage is probably just ASCII, but that means the other languages don't matter anyway. Not to mention taking the valuable 8-bit real estate for English and dumping the longer encodings on everyone else.
The German ß becomes SS when capitalised. It's no encoding issue.
May 24 2013
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-May-2013 21:05, Joakim пишет:
 On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
 toUpper/lower cannot be made in place if it should handle all Unicode.
 Some characters will change their length when convert to/from
 uppercase. Examples of these are the German double S and some Turkish I.
This triggered a long-standing bugbear of mine: why are we using these variable-length encodings at all? Does anybody really care about UTF-8 being "self-synchronizing," ie does anybody actually use that in this day and age? Sure, it's backwards-compatible with ASCII and the vast majority of usage is probably just ASCII, but that means the other languages don't matter anyway. Not to mention taking the valuable 8-bit real estate for English and dumping the longer encodings on everyone else. I'd just use a single-byte header to signify the language and then put the vast majority of languages in a single byte encoding, with the few exceptional languages with more than 256 characters encoded in two bytes.
You seem to think that not only UTF-8 is bad encoding but also one unified encoding (code-space) is bad(?). Separate code spaces were the case before Unicode (and utf-8). The problem is not only that without header text is meaningless (no easy slicing) but the fact that encoding of data after header strongly depends a variety of factors - a list of encodings actually. Now everybody has to keep a (code) page per language to at least know if it's 2 bytes per char or 1 byte per char or whatever. And you still work on a basis that there is no combining marks and regional specific stuff :) In fact it was even "better" nobody ever talked about header they just assumed a codepage with some global setting. Imagine yourself creating a font rendering system these days - a hell of an exercise in frustration (okay how do I render 0x88 ? mm if that is in codepage XYZ then ...).
 OK, that doesn't cover multi-language strings, but that is what,
 .000001% of usage?
This just shows you don't care for multilingual stuff at all. Imagine any language tutor/translator/dictionary on the Web. For instance most languages need to intersperse ASCII (also keep in mind e.g. HTML markup). Books often feature citations in native language (or e.g. latin) along with translations. Now also take into account math symbols, currency symbols and beyond. Also these days cultures are mixing in wild combinations so you might need to see the text even if you can't read it. Unicode is not only "encode characters from all languages". It needs to address universal representation of symbolics used in writing systems at large.
 Make your header a little longer and you could handle
 those also.  Yes, it wouldn't be strictly backwards-compatible with
 ASCII, but it would be so much easier to internationalize.  Of course,
 there's also the monoculture we're creating; love this UTF-8 rant by
 tuomov, author of one the first tiling window managers for linux:
We want monoculture! That is to understand each without all these "par-le-vu-france?" and codepages of various complexity(insanity). Want small - use compression schemes which are perfectly fine and get to the precious 1byte per codepoint with exceptional speed. http://www.unicode.org/reports/tr6/
 http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06

 The emperor has no clothes, what am I missing?
And borrowing the arguments from from that rant: locale is borked shit when it comes to encodings. Locales should be used for tweaking visual like numbers, date display an so on. -- Dmitry Olshansky
May 24 2013
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, May 25, 2013 at 01:21:25AM +0400, Dmitry Olshansky wrote:
 24-May-2013 21:05, Joakim пишет:
[...]
This triggered a long-standing bugbear of mine: why are we using
these variable-length encodings at all?  Does anybody really care
about UTF-8 being "self-synchronizing," ie does anybody actually use
that in this day and age?  Sure, it's backwards-compatible with ASCII
and the vast majority of usage is probably just ASCII, but that means
the other languages don't matter anyway.  Not to mention taking the
valuable 8-bit real estate for English and dumping the longer
encodings on everyone else.

I'd just use a single-byte header to signify the language and then
put the vast majority of languages in a single byte encoding, with
the few exceptional languages with more than 256 characters encoded
in two bytes.
You seem to think that not only UTF-8 is bad encoding but also one unified encoding (code-space) is bad(?). Separate code spaces were the case before Unicode (and utf-8). The problem is not only that without header text is meaningless (no easy slicing) but the fact that encoding of data after header strongly depends a variety of factors - a list of encodings actually. Now everybody has to keep a (code) page per language to at least know if it's 2 bytes per char or 1 byte per char or whatever. And you still work on a basis that there is no combining marks and regional specific stuff :)
I remember those bad ole days of gratuitously-incompatible encodings. I wish those days will never ever return again. You'd get a text file in some unknown encoding, and the only way to make any sense of it was to guess what encoding it might be and hope you get lucky. Not only so, the same language often has multiple encodings, so adding support for a single new language required supporting several new encodings and being able to tell them apart (often with no info on which they are, if you're lucky, or if you're unlucky, with *wrong* encoding type specs -- for example, I *still* get email from outdated systems that claim to be iso-8859 when it's actually KOI8R). Prepending the encoding to the data doesn't help, because it's pretty much guaranteed somebody will cut-n-paste some segment of that data and save it without the encoding type header (or worse, some program will try to "fix" broken low-level code by prepending a default encoding type to everything, regardless of whether it's actually in that encoding or not), thus ensuring nobody will be able to reliably recognize what encoding it is down the road.
 In fact it was even "better" nobody ever talked about header they just
 assumed a codepage with some global setting. Imagine yourself creating
 a font rendering system these days - a hell of an exercise in
 frustration (okay how do I render 0x88 ? mm if that is in codepage XYZ
 then ...).
Not to mention, if the sysadmin changes the default locale settings, you may suddenly discover that a bunch of your text files have become gibberish, because some programs blindly assume that every text file is in the current locale-specified language. I tried writing language-agnostic text-processing programs in C/C++ before the widespread adoption of Unicode. It was a living nightmare. The Posix spec *seems* to promise language-independence with its locale functions, but actually, the whole thing is one big inconsistent and under-specified mess that has many unspecified, implementation-specific behaviours that you can't rely on. The APIs basically assume that you set your locale's language once, and never change it, and every single file you'll ever want to read must be encoded in that particular encoding. If you try to read another encoding, too bad, you're screwed. There isn't even a standard for locale names that you could use to manually switch to inside your program (yes there are de facto conventions, but there *are* systems out there that don't follow it). And many standard library functions are affected by locale settings (once you call setlocale, *anything* could change, like string comparison, output encoding, etc.), making it a hairy mess to get input/output of multiple encodings to work correctly. Basically, you have to write everything manually, because the standard library can't handle more than a single encoding correctly (well, not without extreme amounts of pain, that is). So you're back to manipulating bytes directly. Which means you have to keep large tables of every single encoding you ever wish to support. And encoding-specific code to deal with exceptions with those evil variant encodings that are supposedly the same as the official standard of that encoding, but actually have one or two subtle differences that cause your program to output embarrassing garbage characters every now and then. For all of its warts, Unicode fixed a WHOLE bunch of these problems, and made cross-linguistic data sane to handle without pulling out your hair, many times over. And now we're trying to go back to that nightmarish old world again? No way, José! [...]
Make your header a little longer and you could handle those also.
Yes, it wouldn't be strictly backwards-compatible with ASCII, but it
would be so much easier to internationalize.  Of course, there's also
the monoculture we're creating; love this UTF-8 rant by tuomov,
author of one the first tiling window managers for linux:
We want monoculture! That is to understand each without all these "par-le-vu-france?" and codepages of various complexity(insanity).
Yeah, those codepages were an utter nightmare to deal with. Everybody and his neighbour's dog invented their own codepage, sometimes multiple codepages for a single language, all of which are gratuitously incompatible with each other. Every codepage has its own peculiarities and exceptions, and programs have to know how to deal with all of them. Only to get broken again as soon as somebody invents yet another codepage two years later, or creates yet another codepage variant just for the heck of it. If you're really concerned about encoding size, just use a compression library -- they're readily available these days. Internally, the program can just use UTF-16 for the most part -- UTF-32 is really only necessary if you're routinely delving outside BMP, which is very rare. As far as Phobos is concerned, Dmitry's new std.uni module has powerful code-generation templates that let you write code that operate directly on UTF-8 without needing to convert to UTF-32 first. Well, OK, maybe we're not quite there yet, but the foundations are in place, and I'm looking forward to the day when string functions will no longer have implicit conversion to UTF-32, but will directly manipulate UTF-8 using optimized state tables generated by std.uni.
 Want small - use compression schemes which are perfectly fine and
 get to the precious 1byte per codepoint with exceptional speed.
 http://www.unicode.org/reports/tr6/
+1. Using your own encoding is perfectly fine. Just don't do that for data interchange. Unicode was created because we *want* a single standard to communicate with each other without stupid broken encoding issues that used to be rampant on the web before Unicode came along. In the bad ole days, HTML could be served in any random number of encodings, often out-of-sync with what the server claims the encoding is, and browsers would assume arbitrary default encodings that for the most part *appeared* to work but are actually fundamentally b0rken. Sometimes webpages would show up mostly-intact, but with a few characters mangled, because of deviations / variations on codepage interpretation, or non-standard characters being used in a particular encoding. It was a total, utter mess, that wasted who knows how many man-hours of programming time to work around. For data interchange on the internet, we NEED a universal standard that everyone can agree on.
http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06

The emperor has no clothes, what am I missing?
And borrowing the arguments from from that rant: locale is borked shit when it comes to encodings. Locales should be used for tweaking visual like numbers, date display an so on.
[...] I found that rant rather incoherent. I didn't find any convincing arguments as to why we should return to the bad old scheme of codepages and gratuitous complexity, just a lot of grievances about why monoculture is "bad" without much supporting evidence. UTF-8, for all its flaws, is remarkably resilient to mangling -- you can cut-n-paste any byte sequence and the receiving end can still make some sense of it. Not like the bad old days of codepages where you just get one gigantic block of gibberish. A properly-synchronizing UTF-8 function can still recover legible data, maybe with only a few characters at the ends truncated in the worst case. I don't see how any codepage-based encoding is an improvement over this. T -- There are 10 kinds of people in the world: those who can count in binary, and those who can't.
May 24 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2013 3:42 PM, H. S. Teoh wrote:
 I tried writing language-agnostic text-processing programs in C/C++
 before the widespread adoption of Unicode.
One of the first, and best, decisions I made for D was it would be Unicode front to back. At the time, Unicode was poorly supported by operating systems and lots of software, and I encountered some initial resistance to it. But I believed Unicode was the inevitable future. Code pages, Shift-JIS, EBCDIC, etc., should all be terminated with prejudice.
May 24 2013
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 25 May 2013 11:58, Walter Bright <newshound2 digitalmars.com> wrote:

 On 5/24/2013 3:42 PM, H. S. Teoh wrote:

 I tried writing language-agnostic text-processing programs in C/C++
 before the widespread adoption of Unicode.
One of the first, and best, decisions I made for D was it would be Unicod=
e
 front to back.
Indeed, excellent decision! So when we define operators for u =C3=97 v and a =C2=B7 b, or maybe n=C2=B2= ? ;) At the time, Unicode was poorly supported by operating systems and lots of
 software, and I encountered some initial resistance to it. But I believed
 Unicode was the inevitable future.

 Code pages, Shift-JIS, EBCDIC, etc., should all be terminated with
 prejudice.
May 24 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/24/2013 7:16 PM, Manu wrote:
 So when we define operators for u × v and a · b, or maybe n²? ;)
Oh, how I want to do that. But I still think the world hasn't completely caught up with Unicode yet.
May 24 2013
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, May 24, 2013 at 08:45:56PM -0700, Walter Bright wrote:
 On 5/24/2013 7:16 PM, Manu wrote:
So when we define operators for u  v and a  b, or maybe n? ;)
Oh, how I want to do that. But I still think the world hasn't completely caught up with Unicode yet.
That would be most awesome! Though it does raise the issue of how parsing would work, 'cos you either have to assign a fixed precedence to each of these operators (and there are a LOT of them in Unicode!), or allow user-defined operators with custom precedence and associativity, which means nightmare for the parser (it has to adapt itself to new operators as the code is parsed/analysed, which then leads to issues with what happens if two different modules define the same operator with conflicting precedence / associativity). T -- Spaghetti code may be tangly, but lasagna code is just cheesy.
May 24 2013
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 05/25/2013 05:56 AM, H. S. Teoh wrote:
 On Fri, May 24, 2013 at 08:45:56PM -0700, Walter Bright wrote:
 On 5/24/2013 7:16 PM, Manu wrote:
 So when we define operators for u  v and a  b, or maybe n? ;)
Oh, how I want to do that. But I still think the world hasn't completely caught up with Unicode yet.
That would be most awesome! Though it does raise the issue of how parsing would work, 'cos you either have to assign a fixed precedence to each of these operators (and there are a LOT of them in Unicode!),
I think this is what eg. fortress is doing.
 or allow user-defined operators
 with custom precedence and associativity,
This is what eg. Haskell, Coq are doing. (Though Coq has the advantage of not allowing forward references, and hence inline parser customization is straighforward in Coq.)
 which means nightmare for the
 parser (it has to adapt itself to new operators as the code is
 parsed/analysed,
It would be easier on the parsing side, since the parser would not fully parse expressions. Semantic analysis would resolve precedences. This is quite simple, and the current way the parser resolves operator precedences is less efficient anyways.
 which then leads to issues with what happens if two
 different modules define the same operator with conflicting precedence /
 associativity).
This would probably be an error without explicit disambiguation, or follow the usual disambiguation rules. (trying all possibilities appears to be exponential in the number of conflicting operators in an expression in the worst case though.)
May 25 2013
prev sibling parent reply "Hans W. Uhlig" <huhlig gmail.com> writes:
On Saturday, 25 May 2013 at 03:46:23 UTC, Walter Bright wrote:
 On 5/24/2013 7:16 PM, Manu wrote:
 So when we define operators for u × v and a · b, or maybe n²? 
 ;)
Oh, how I want to do that. But I still think the world hasn't completely caught up with Unicode yet.
Using those characters would be wonderful and while we do have unicode software support we don't really have unicode hardware support. I am still on my 102 key keyboard and I haven't really seen a good expanded character keyboard come along.
May 26 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/26/2013 1:44 PM, Hans W. Uhlig wrote:
 Using those characters would be wonderful and while we do have unicode software
 support we don't really have unicode hardware support. I am still on my 102 key
 keyboard and I haven't really seen a good expanded character keyboard come
along.
I have a post-it stuck to my monitor with the numbers for various unicode characters, but I just can't see that for writing code.
May 26 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, May 26, 2013 at 02:14:17PM -0700, Walter Bright wrote:
 On 5/26/2013 1:44 PM, Hans W. Uhlig wrote:
Using those characters would be wonderful and while we do have
unicode software support we don't really have unicode hardware
support. I am still on my 102 key keyboard and I haven't really seen
a good expanded character keyboard come along.
I have a post-it stuck to my monitor with the numbers for various unicode characters, but I just can't see that for writing code.
I have been thinking about this idea of a "reprogrammable keyboard", in that the keys are either a fixed layout with LCD labels on each key, or perhaps the whole thing is a long touchscreen, that allows arbitrary relabelling of keys (or, in the latter case, complete dynamic reconfiguration of layout). There would be some convenient way to switch between layouts, say a scrolling sidebar or roller dial of some sort, so you could, in theory, type Unicode directly. I haven't been able to refine this into an actual, implementable idea, though. T -- Shin: (n.) A device for finding furniture in the dark.
May 26 2013
next sibling parent reply "Kiith-Sa" <kiithsacmp gmail.com> writes:
You mean like 
http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard ?
May 26 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, May 26, 2013 at 11:25:09PM +0200, Kiith-Sa wrote:
 You mean like http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard
 ?
Whoa! That is exactly what I had in mind!! Pity they don't appear to support Linux, though. :-( T -- MACINTOSH: Most Applications Crash, If Not, The Operating System Hangs
May 26 2013
parent reply "Torje Digernes" <torjehoa pvv.org> writes:
On Sunday, 26 May 2013 at 21:46:38 UTC, H. S. Teoh wrote:
 On Sun, May 26, 2013 at 11:25:09PM +0200, Kiith-Sa wrote:
 You mean like 
 http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard
 ?
Whoa! That is exactly what I had in mind!! Pity they don't appear to support Linux, though. :-( T
If you want to configure your keyboard so you can type unicode in Linux you should make yourself familiar with xkb, it is not that difficult to work with, but not exactly user friendly either, super user friendly though.
May 26 2013
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, May 27, 2013 at 12:30:02AM +0200, Torje Digernes wrote:
 On Sunday, 26 May 2013 at 21:46:38 UTC, H. S. Teoh wrote:
On Sun, May 26, 2013 at 11:25:09PM +0200, Kiith-Sa wrote:
You mean like
http://en.wikipedia.org/wiki/Optimus_Maximus_keyboard
?
Whoa! That is exactly what I had in mind!! Pity they don't appear to support Linux, though. :-( T
If you want to configure your keyboard so you can type unicode in Linux you should make yourself familiar with xkb, it is not that difficult to work with, but not exactly user friendly either, super user friendly though.
Oh, I know *that*. I configured my xkb setup to switch between English and Russian with the unused windows key (I used to have Greek too, but I use it rarely enough that I took it out). It's just that without the dynamic key labels, I have to touch-type, which requires learning each layout as opposed to just looking for the symbol I need on the key labels. And I have yet to figure out a sane way to support *all* of Unicode without making the result unusable -- when I had Greek in the mix, it was already getting cumbersome having to continually hit the windows key repeatedly when alternating between two of the 3 languages. That's simply not scalable to, say, 100 modes. :-P But maybe I'm just missing a really obvious solution. That happens a lot. :-P T -- War doesn't prove who's right, just who's left. -- BSD Games' Fortune
May 26 2013
prev sibling parent reply "Wyatt" <wyatt.epp gmail.com> writes:
On Sunday, 26 May 2013 at 21:23:44 UTC, H. S. Teoh wrote:
 I have been thinking about this idea of a "reprogrammable 
 keyboard", in
 that the keys are either a fixed layout with LCD labels on each 
 key, or
 perhaps the whole thing is a long touchscreen, that allows 
 arbitrary
 relabelling of keys (or, in the latter case, complete dynamic
 reconfiguration of layout). There would be some convenient way 
 to switch
 between layouts, say a scrolling sidebar or roller dial of some 
 sort, so
 you could, in theory, type Unicode directly.

 I haven't been able to refine this into an actual, 
 implementable idea,
 though.
I've given this domain a fair bit of thought, and from my perspective you want to throw hardware at a software problem. Have you ever used a Japanese input method? They're sort of a good exemplar here, wherein you type a sequence and then hit space to cycle through possible ways of writing it. So "ame" can become, あめ, 雨, 飴, etc. Right now, in addition to my learning, I also use it for things like α (アルファ) and Δ (デルタ). It's limited, but...usable, I guess. Sort of. The other end of this is TeX, which was designed around the idea of composing scientific texts with a high degree of control and flexibility. Specialty characters are inserted with backslash-escapes, like \alpha, \beta, etc. Now combine the two: An input method that outputs as usual, until you enter a character code which is substituted in real time to what you actually want. Example: "values of \beta will give rise to dom!" composes as "values of β will give rise to dom!" No hardware required; just a smarter IME. Like maybe this one: http://www.andonyar.com/rec/2008-03/mathinput/ (I'm honestly not yet sure how mature or usable that one is as I'm a UIM user, but it does serve as a proof of concept).
May 26 2013
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, May 27, 2013 at 04:17:06AM +0200, Wyatt wrote:
 On Sunday, 26 May 2013 at 21:23:44 UTC, H. S. Teoh wrote:
I have been thinking about this idea of a "reprogrammable keyboard",
in that the keys are either a fixed layout with LCD labels on each
key, or perhaps the whole thing is a long touchscreen, that allows
arbitrary relabelling of keys (or, in the latter case, complete
dynamic reconfiguration of layout). There would be some convenient
way to switch between layouts, say a scrolling sidebar or roller dial
of some sort, so you could, in theory, type Unicode directly.

I haven't been able to refine this into an actual, implementable
idea, though.
I've given this domain a fair bit of thought, and from my perspective you want to throw hardware at a software problem. Have you ever used a Japanese input method? They're sort of a good exemplar here, wherein you type a sequence and then hit space to cycle through possible ways of writing it. So "ame" can become, あめ, 雨, 飴, etc. Right now, in addition to my learning, I also use it for things like α (アルファ) and Δ (デルタ). It's limited, but...usable, I guess. Sort of. The other end of this is TeX, which was designed around the idea of composing scientific texts with a high degree of control and flexibility. Specialty characters are inserted with backslash-escapes, like \alpha, \beta, etc. Now combine the two: An input method that outputs as usual, until you enter a character code which is substituted in real time to what you actually want. Example: "values of \beta will give rise to dom!" composes as "values of β will give rise to dom!" No hardware required; just a smarter IME. Like maybe this one: http://www.andonyar.com/rec/2008-03/mathinput/ (I'm honestly not yet sure how mature or usable that one is as I'm a UIM user, but it does serve as a proof of concept).
I like this idea. It's certainly more feasible than reinventing the Optimus Maximus keyboard. :) I can write code for free, but engineering custom hardware is a bit beyond my abilities (and means!). If we go the software route, then one possible strategy might be: - Have a default mode that is whatever your default keyboard layout is (the usual 100+-key layout, or DVORAK, whatever.). - Assign one or two escape keys (not to be confused with the Esc key, which is something else) that allows you to switch mode. - Under the 1-key scheme, you'd use it to begin sequences like \beta, except that instead of the backslash \, you're using a dedicated key. These sequences can include individual characters (e.g. <ESC>beta == β) or allow you to change the current input mode (e.g. <ESC>grk to switch to a Greek layout that takes effect from that point onwards until you enter, say, <ESC>eng). For convenience, the sequence <ESC><ESC> can be shorthand for switching back to whatever the default layout is, so that if you mistype an escape sequence and end up in some strange unexpected layout mode, hitting <ESC> twice will reset it back to the default. - Under the 2-key scheme, you'd have one key dedicated for the occasional foreign character (<ESC1>beta == β), and the second key dedicated for switching layouts (thus allowing shorter sequences for switching between languages without fear of conflicting with single-character sequences, e.g., <ESC2>g for Greek). Perhaps the 1-key scheme is the simplest to implement. The capslock key is a good candidate, being conveniently located where your left little finger is, and having no real useful function in this day and age. The only drawback is no custom key labels. But perhaps that can be alleviated by hooking an escape sequence to toggle an on-screen visual representation of the current layout. Maybe <ESC>? can be assigned to invoke a helper utility that renders the current layout on the screen. T -- Don't get stuck in a closet---wear yourself out.
May 27 2013
prev sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Monday, 27 May 2013 at 02:17:08 UTC, Wyatt wrote:
 No hardware required; just a smarter IME.
Perhaps something like the compose key? http://en.wikipedia.org/wiki/Compose_key
May 27 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, May 27, 2013 at 09:59:52PM +0200, Vladimir Panteleev wrote:
 On Monday, 27 May 2013 at 02:17:08 UTC, Wyatt wrote:
No hardware required; just a smarter IME.
Perhaps something like the compose key? http://en.wikipedia.org/wiki/Compose_key
I'm already using the compose key. But it only goes so far (I don't think compose key sequences cover all of unicode). Besides, it's impractical to use compose key sequences to write large amounts of text in some given language; a method of temporarily switching to a different layout is necessary. T -- Тише едешь, дальше будешь.
May 27 2013
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
 Besides, it's impractical to use compose key sequences to write 
 large amounts of text in some given language; a method of 
 temporarily switching to a different layout is necessary.
I thought the topic was typing the occasional Unicode character to use as an operator in D programs?
May 27 2013
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2013 at 12:04:52AM +0200, Vladimir Panteleev wrote:
 On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to write
large amounts of text in some given language; a method of
temporarily switching to a different layout is necessary.
I thought the topic was typing the occasional Unicode character to use as an operator in D programs?
Well, D *does* support non-English identifiers, y'know... for example: void main(string[] args) { int число = 1; foreach (и; 0..100) число += и; writeln(число); } Of course, whether that's a good practice is a different story. :) But for operators, you still need enough compose key sequences to cover all of the Unicode operators -- and there are a LOT of them -- which I don't think is currently done anywhere. You'd have to make your own compose key maps to do it. T -- Freedom: (n.) Man's self-given right to be enslaved by his own depravity.
May 27 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
 Well, D *does* support non-English identifiers, y'know... for example:

 	void main(string[] args) {
 		int число = 1;
 		foreach (и; 0..100)
 			число += и;
 		writeln(число);
 	}

 Of course, whether that's a good practice is a different story. :)
I've recently come to the opinion that that's a bad idea, and D should not support it.
May 27 2013
next sibling parent reply "Hans W. Uhlig" <huhlig gmail.com> writes:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
 On 5/27/2013 3:18 PM, H. S. Teoh wrote:
 Well, D *does* support non-English identifiers, y'know... for 
 example:

 	void main(string[] args) {
 		int число = 1;
 		foreach (и; 0..100)
 			число += и;
 		writeln(число);
 	}

 Of course, whether that's a good practice is a different 
 story. :)
I've recently come to the opinion that that's a bad idea, and D should not support it.
Why do you think its a bad idea? It makes it such that code can be in various languages? Just lack of keyboard support?
May 27 2013
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2013 at 01:28:22AM +0200, Hans W. Uhlig wrote:
 On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for
example:

	void main(string[] args) {
		int число = 1;
		foreach (и; 0..100)
			число += и;
		writeln(число);
	}

Of course, whether that's a good practice is a different story.
:)
I've recently come to the opinion that that's a bad idea, and D should not support it.
Currently, the above code snippet compiles (upon inserting "import std.stdio;", that is). Should that be made illegal?
 Why do you think its a bad idea? It makes it such that code can be
 in various languages? Just lack of keyboard support?
I can't speak for Walter, but one issue that comes to mind is when someone reads the code and doesn't understand the language the identifiers are in, or worse, can't reliably recognize the distinctions between the glyphs, and so can't match identifier names correctly -- if you don't know Japanese, for example, seeing a bunch of Japanese identifiers of equal length will look more-or-less the same (all gibberish to you), so it only obscures the code. Or if your computer doesn't have the requisite fonts to display the alphabet in question, then you'll just see a bunch of ?'s or black blotches for all program identifiers, making the code completely unreadable. Since language keywords are already in English, we might as well standardize on English identifiers too. (After all, Phobos identifiers are English as well.) While it's cool to have multilingual identifiers, I'm not sure if it actually adds any practical value. :) If anything, it arguably detracts from usability. Multilingual program output, of course, is a different kettle o' fish. T -- Doubt is a self-fulfilling prophecy.
May 27 2013
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 27 May 2013 at 23:46:17 UTC, H. S. Teoh wrote:
 On Tue, May 28, 2013 at 01:28:22AM +0200, Hans W. Uhlig wrote:
 On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
On 5/27/2013 3:18 PM, H. S. Teoh wrote:
Well, D *does* support non-English identifiers, y'know... for
example:

	void main(string[] args) {
		int число = 1;
		foreach (и; 0..100)
			число += и;
		writeln(число);
	}

Of course, whether that's a good practice is a different 
story.
:)
I've recently come to the opinion that that's a bad idea, and D should not support it.
Currently, the above code snippet compiles (upon inserting "import std.stdio;", that is). Should that be made illegal?
 Why do you think its a bad idea? It makes it such that code 
 can be
 in various languages? Just lack of keyboard support?
I can't speak for Walter, but one issue that comes to mind is when someone reads the code and doesn't understand the language the identifiers are in, or worse, can't reliably recognize the distinctions between the glyphs, and so can't match identifier names correctly -- if you don't know Japanese, for example, seeing a bunch of Japanese identifiers of equal length will look more-or-less the same (all gibberish to you), so it only obscures the code. Or if your computer doesn't have the requisite fonts to display the alphabet in question, then you'll just see a bunch of ?'s or black blotches for all program identifiers, making the code completely unreadable. Since language keywords are already in English, we might as well standardize on English identifiers too. (After all, Phobos identifiers are English as well.) While it's cool to have multilingual identifiers, I'm not sure if it actually adds any practical value. :) If anything, it arguably detracts from usability. Multilingual program output, of course, is a different kettle o' fish. T
I can tell you for a fact there are a tons of *private* companies that create closed source programs, whose source code is *not* English. And from *their* business perspective, it makes sense. They don't care if you can't understand their source code, since *you* will never see their source code. I'm quite confident there are tons of programs that you use that *aren't* written in English. My wifes writes the embedded soft for hardware her company sells. I can tell you the source code sure as hell isn't in English. Why would it? The entire company speaks the local language natively. I've worked in Japan, and I can tell you the norm over there is *not* to code in English. And why should it? Why would you code in a language that is not your own, if you don't plan to ever share your code to outside your team? Why would you care about users that don't have unicode support, if the workstations of all your employees is unicode compatible? Allowing unicode identifiers makes their work a better experience. Why should we take that away from them? There are advantages and disadvantages to non-ASCII identifiers, but whether or not you should be able to use them should belong in a coding standard, not in a compiler limitation.
May 28 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
 On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
 I've recently come to the opinion that that's a bad idea, and D should not
 support it.
Why do you think its a bad idea? It makes it such that code can be in various languages? Just lack of keyboard support?
Every time I've been to a programming shop in a foreign country, the developers speak english at work and code in english. Of course, that doesn't mean that everyone does, but as far as I can tell the overwhelming bulk is done in english. Naturally, full Unicode needs to be in strings and comments, but symbol names? I don't see the point nor the utilty of it. Supporting such is just pointless complexity to the language.
May 27 2013
next sibling parent reply "Diggory" <diggsey googlemail.com> writes:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
 On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
 On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
 I've recently come to the opinion that that's a bad idea, and 
 D should not
 support it.
Why do you think its a bad idea? It makes it such that code can be in various languages? Just lack of keyboard support?
Every time I've been to a programming shop in a foreign country, the developers speak english at work and code in english. Of course, that doesn't mean that everyone does, but as far as I can tell the overwhelming bulk is done in english. Naturally, full Unicode needs to be in strings and comments, but symbol names? I don't see the point nor the utilty of it. Supporting such is just pointless complexity to the language.
The most convincing case for usefulness I've seen was in java where a class implemented a particular algorithm and so was named after it. This name had a particular accented character and so required unicode. Lots of algorithms are named after their inventors and lots of these names contain unicode characters so it's not that uncommon.
May 27 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2013 at 02:23:32AM +0200, Diggory wrote:
 On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
I've recently come to the opinion that that's a bad idea, and
D should not
support it.
Why do you think its a bad idea? It makes it such that code can be in various languages? Just lack of keyboard support?
Every time I've been to a programming shop in a foreign country, the developers speak english at work and code in english. Of course, that doesn't mean that everyone does, but as far as I can tell the overwhelming bulk is done in english. Naturally, full Unicode needs to be in strings and comments, but symbol names? I don't see the point nor the utilty of it. Supporting such is just pointless complexity to the language.
The most convincing case for usefulness I've seen was in java where a class implemented a particular algorithm and so was named after it. This name had a particular accented character and so required unicode. Lots of algorithms are named after their inventors and lots of these names contain unicode characters so it's not that uncommon.
I don't find this a compelling reason to allow full Unicode on identifiers, though. For one thing, somebody maintaining your code may not know how to type said identifier correctly. It can be very frustrating to have to keep copy-n-pasting identifiers just because they contain foreign letters you can't type. Not to mention sheer unreadability if the inventor's name is in Chinese, so the algorithm name is also in Chinese, and the person maintaining the code can't read Chinese. This will kill D code maintainability. T -- Don't drink and derive. Alcohol and algebra don't mix.
May 27 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 6:06 PM, H. S. Teoh wrote:
 I don't find this a compelling reason to allow full Unicode on
 identifiers, though. For one thing, somebody maintaining your code may
 not know how to type said identifier correctly. It can be very
 frustrating to have to keep copy-n-pasting identifiers just because they
 contain foreign letters you can't type. Not to mention sheer
 unreadability if the inventor's name is in Chinese, so the algorithm
 name is also in Chinese, and the person maintaining the code can't read
 Chinese. This will kill D code maintainability.
+1
May 27 2013
parent Michel Fortin <michel.fortin michelf.ca> writes:
On 2013-05-28 01:34:17 +0000, Walter Bright <newshound2 digitalmars.com> said:

 On 5/27/2013 6:06 PM, H. S. Teoh wrote:
 I don't find this a compelling reason to allow full Unicode on
 identifiers, though. For one thing, somebody maintaining your code may
 not know how to type said identifier correctly. It can be very
 frustrating to have to keep copy-n-pasting identifiers just because they
 contain foreign letters you can't type. Not to mention sheer
 unreadability if the inventor's name is in Chinese, so the algorithm
 name is also in Chinese, and the person maintaining the code can't read
 Chinese. This will kill D code maintainability.
+1
-1 What's even worse for code maintainability is code that does not do what it says. Disallowing non-ASCII charsets does not prevent people from writing foreign-language code. I've seen plenty of code in French in my life in languages with no Unicode support. I've also seen plenty of bad English in code. I'd rather see a correct French word as a variable or function name than an incorrect English one. Correctly naming things is difficult, and correctly naming them in a foreign language is even more. This surely apply to languages using non-ASCII alphabets too. Of course, if you're not using English words you'll be limiting audience to programmers who understand that language. But you might widen it in other directions. I worked once with a grad student who was building a model to simulate breakages of water pipe systems. She was good enough to write code that worked, although she needed my help for a couple of things, notably increasing performance. The code was all in French, and thankfully so as attempting to translate all those terms (some dealing with concepts unknown to me) to English when writing the code and back to French when explaining the concepts would have been quite annoying, inefficient, and error-prone in our work. While French likely will always be a possibility (as it fits well in ASCII), I can see how writing code in Japanese or Russian might benefit native speakers of those languages too, especially those for who programming is only an incidental part of their job. Programming is a form of expression, and it's always easier to express ourself in our own native language. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca/
May 28 2013
prev sibling next sibling parent "Olivier Pisano" <olivier.pisano laposte.net> writes:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
 Every time I've been to a programming shop in a foreign 
 country, the developers speak english at work and code in 
 english. Of course, that doesn't mean that everyone does, but 
 as far as I can tell the overwhelming bulk is done in english.
Would you have been to such an event if you could not have understood what people were doing or saying? Of course, when we are working on something with international scope, we tend to do it in english, but it doesn't mean every programming task is performed in english… Being a non-native english speaker, I tend to see Unicode identifiers as an improvement over other programming languages. It's like operator overloading, it is good when used moderately, depending on the context of the programming task and its intended audience. BTW, I use a Unicode-aware alternative keyboard layout, so I can type greek letters or math symbols directly. ASCII-only identifiers sounds like an arbitrary limitation for me.
May 28 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
 Every time I've been to a programming shop in a foreign 
 country, the developers speak english at work and code in 
 english. Of course, that doesn't mean that everyone does, but 
 as far as I can tell the overwhelming bulk is done in english.
That's because you have an academic view of code, and a library approach to development. When you are a private company selling closed source code, I really don't see why you'd code in English. IMO, whether it is a bad idea is not for us to judge (and less so to stop), but for each company/organization to choose their own coding standard.
May 28 2013
prev sibling next sibling parent reply "qznc" <qznc web.de> writes:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
 On 5/27/2013 4:28 PM, Hans W. Uhlig wrote:
 On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
 I've recently come to the opinion that that's a bad idea, and 
 D should not
 support it.
Why do you think its a bad idea? It makes it such that code can be in various languages? Just lack of keyboard support?
Every time I've been to a programming shop in a foreign country, the developers speak english at work and code in english. Of course, that doesn't mean that everyone does, but as far as I can tell the overwhelming bulk is done in english. Naturally, full Unicode needs to be in strings and comments, but symbol names? I don't see the point nor the utilty of it. Supporting such is just pointless complexity to the language.
Once I heared an argument from developers working for banks. They coded business-specific stuff in Java. Business-specific meant financial concepts with german names (e.g. Vermögen,Bürgschaft), which sometimes include äöüß. Some of those concept had no good translation into english, because they are not used outside of Germany and the clients prefer the actual names anyways.
May 29 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/29/2013 3:26 AM, qznc wrote:
 Once I heared an argument from developers working for banks. They coded
 business-specific stuff in Java. Business-specific meant financial concepts
with
 german names (e.g. Vermögen,Bürgschaft), which sometimes include äöüß.
Some of
 those concept had no good translation into english, because they are not used
 outside of Germany and the clients prefer the actual names anyways.
German is pretty easy to do in ASCII: Vermoegen and Buergschaft
May 29 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 29 May 2013 at 22:42:08 UTC, Walter Bright wrote:
 On 5/29/2013 3:26 AM, qznc wrote:
 Once I heared an argument from developers working for banks. 
 They coded
 business-specific stuff in Java. Business-specific meant 
 financial concepts with
 german names (e.g. Vermögen,Bürgschaft), which sometimes 
 include äöüß. Some of
 those concept had no good translation into english, because 
 they are not used
 outside of Germany and the clients prefer the actual names 
 anyways.
German is pretty easy to do in ASCII: Vermoegen and Buergschaft
What about Chinese? Russian? Japanese? It is doable, but I can tell you for a fact that they very much don't like reading it that way. You know, having done programming in Japan, I know that a lot of devs simply don't care for english, and they'd really enjoy just being able to code in Japanese. I can't speak for the other countries, but I'm sure that large but not spread out countries like China would also just *love* to be able to code in 100% Madarin (I'd say they wouldn't care much for English either). I think this possibility is actually a brilliant feature that could help popularize the language oversees, especially in teaching courses, or the private sector. Why not turn down a feature that makes us popular? As for research/university, I think they are already global enough to stick to English anyways. No matter how I see it, I can only see benefits to keeping it, and downsides to turning it down.
May 30 2013
next sibling parent "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Thu, 30 May 2013 11:36:42 +0200, monarch_dodra <monarchdodra gmail.co=
m>  =

wrote:

 On Wednesday, 29 May 2013 at 22:42:08 UTC, Walter Bright wrote:
 On 5/29/2013 3:26 AM, qznc wrote:
 Once I heared an argument from developers working for banks. They co=
ded
 business-specific stuff in Java. Business-specific meant financial  =
 concepts with
 german names (e.g. Verm=C3=B6gen,B=C3=BCrgschaft), which sometimes i=
nclude =C3=A4=C3=B6=C3=BC=C3=9F. =
 Some of
 those concept had no good translation into english, because they are=
=
 not used
 outside of Germany and the clients prefer the actual names anyways.
German is pretty easy to do in ASCII: Vermoegen and Buergschaft
What about Chinese? Russian? Japanese? It is doable, but I can tell yo=
u =
 for a fact that they very much don't like reading it that way.

 You know, having done programming in Japan, I know that a lot of devs =
=
 simply don't care for english, and they'd really enjoy just being able=
=
 to code in Japanese. I can't speak for the other countries, but I'm su=
re =
 that large but not spread out countries like China would also just  =
 *love* to be able to code in 100% Madarin (I'd say they wouldn't care =
=
 much for English either).

 I think this possibility is actually a brilliant feature that could he=
lp =
 popularize the language oversees, especially in teaching courses, or t=
he =
 private sector. Why not turn down a feature that makes us popular?

 As for research/university, I think they are already global enough to =
=
 stick to English anyways.

 No matter how I see it, I can only see benefits to keeping it, and  =
 downsides to turning it down.
Now if only we had the C preprocessor: #define =E5=A6=82=E6=9E=9C if #define =E7=9B=B4=E5=88=B0 while (Note: this is what Google Translate told me was good. I do not speak, read or otherwise understand chinese) -- = Simen
May 30 2013
prev sibling parent reply "Dicebot" <m.strashun gmail.com> writes:
On Thursday, 30 May 2013 at 09:36:43 UTC, monarch_dodra wrote:
 What about Chinese? Russian? Japanese? It is doable, but I can 
 tell you for a fact that they very much don't like reading it 
 that way.

 You know, having done programming in Japan, I know that a lot 
 of devs simply don't care for english, and they'd really enjoy 
 just being able to code in Japanese. I can't speak for the 
 other countries, but I'm sure that large but not spread out 
 countries like China would also just *love* to be able to code 
 in 100% Madarin (I'd say they wouldn't care much for English 
 either).
What about poor guys from other country that will support that project after? English is a de-facto standard language for programming for a good reason.
May 30 2013
next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 10:13:46 UTC, Dicebot wrote:
 On Thursday, 30 May 2013 at 09:36:43 UTC, monarch_dodra wrote:
 What about Chinese? Russian? Japanese? It is doable, but I can 
 tell you for a fact that they very much don't like reading it 
 that way.

 You know, having done programming in Japan, I know that a lot 
 of devs simply don't care for english, and they'd really enjoy 
 just being able to code in Japanese. I can't speak for the 
 other countries, but I'm sure that large but not spread out 
 countries like China would also just *love* to be able to code 
 in 100% Madarin (I'd say they wouldn't care much for English 
 either).
What about poor guys from other country that will support that project after? English is a de-facto standard language for programming for a good reason.
Well... defacto: "in practice but not necessarily ordained by law". Besides, even in english, there are use cases for unicode. Such as math (Greek symbols). And even if you are coding in english, that don't mean you can't be working on a region specific project, that requires the identifiers to have region-specific names (AKA, German banking reference). Finally, english does have a few (albeit rare) words that can't be expressed with ASCII. For example: Möbius. Sure, you can write it "Mobius", but why settle for wrong, when you can have right? -------- I'm saying that even if I agree that code should be in English (which I don't completly agree with), it's still not a strong argument against unicode in identifiers. In this day and age, it seems as arbitrary to me as requiring lines to not exceed 80 chars. That kind of shit belongs in a coding standard.
May 30 2013
prev sibling parent reply Manu <turkeyman gmail.com> writes:
On 30 May 2013 20:13, Dicebot <m.strashun gmail.com> wrote:

 On Thursday, 30 May 2013 at 09:36:43 UTC, monarch_dodra wrote:

 What about Chinese? Russian? Japanese? It is doable, but I can tell you
 for a fact that they very much don't like reading it that way.

 You know, having done programming in Japan, I know that a lot of devs
 simply don't care for english, and they'd really enjoy just being able to
 code in Japanese. I can't speak for the other countries, but I'm sure that
 large but not spread out countries like China would also just *love* to be
 able to code in 100% Madarin (I'd say they wouldn't care much for English
 either).
What about poor guys from other country that will support that project after? English is a de-facto standard language for programming for a good reason.
Have you ever worked on code written by people who barely speak English? Even if they write English words, that doesn't make it 'English', or any easier to understand. And people often tend to just transliterate into latin, which is kinda pointless too, how does that help?
May 30 2013
next sibling parent "Dicebot" <m.strashun gmail.com> writes:
On Thursday, 30 May 2013 at 11:29:47 UTC, Manu wrote:
 Have you ever worked on code written by people who barely speak 
 English?
 Even if they write English words, that doesn't make it 
 'English', or any
 easier to understand. And people often tend to just 
 transliterate into
 latin, which is kinda pointless too, how does that help?
I have had comments with Finnish poetry in code I was responsible to support :( No need to provide means to think such approach is the way to go.
May 30 2013
prev sibling parent "Kagamin" <spam here.lot> writes:
On Thursday, 30 May 2013 at 11:29:47 UTC, Manu wrote:
 Have you ever worked on code written by people who barely speak 
 English?
I did. It's better than having a mixture of languages like here: http://code.google.com/p/trileri/source/browse/trunk/tr/yazi.d assert(length == dizgi.length); - in one expression! property Yazı küçüğü() const - property? const? küçüğü? BTW I don't speak English myself, and D code doesn't comprise English either. How well do you have to know English to use one word to name a variable "player"? And I believe everyone who learned math know latin alphabet. Unicode identifiers allow for typos, which can't be detected visually. For example greek and cyrillic alphabets have letters indistinguishable from ASCII so they can sneak into ASCII text and you won't see it. You can also have more fun with heuristic language switchers. Try to find a problem in this code: ------ class c { void Сlose(){} } int main() { c obj=new c; obj.Close(); return 0; } ------ believe noone checked phobos for such errors. I was taught BASIC at school and had no idea I should complain about latin alphabet even though I didn't learn English back then.
Jun 27 2013
prev sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 28 May 2013 at 00:11:18 UTC, Walter Bright wrote:
 Every time I've been to a programming shop in a foreign 
 country, the developers speak english at work and code in 
 english. Of course, that doesn't mean that everyone does, but 
 as far as I can tell the overwhelming bulk is done in english.
OOo codebase is historically mostly in german. They try to reduce the amunt of german in the codebase with each new version. Some massive codebases are non english.
 Naturally, full Unicode needs to be in strings and comments, 
 but symbol names? I don't see the point nor the utilty of it. 
 Supporting such is just pointless complexity to the language.
I know this is a crazy idea, but someone told be once that most people on this planet aren't living in english speaking countries. Insane isn't it ?
Jun 27 2013
prev sibling parent reply Peter Williams <pwil3058 bigpond.net.au> writes:
On 28/05/13 09:44, H. S. Teoh wrote:

 Since language keywords are already in English, we might as well
 standardize on English identifiers too.
So you're going to spell check them all to make sure that they're English? Or did you mean ASCII? Peter
May 27 2013
next sibling parent reply "David Eagen" <davideagen mailinator.com> writes:
On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:

 So you're going to spell check them all to make sure that 
 they're English?  Or did you mean ASCII?

 Peter
That's it. I'm filing a bug against std.traits. There's a unittest there that with a struct named "Colour". Completely unacceptable.
May 27 2013
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 28 May 2013 13:22, David Eagen <davideagen mailinator.com> wrote:

 On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:


 So you're going to spell check them all to make sure that they're
 English?  Or did you mean ASCII?

 Peter
That's it. I'm filing a bug against std.traits. There's a unittest there that with a struct named "Colour". Completely unacceptable.
How dare you! What's unacceptable is that a bunch of ex-english speakers had the audacity to rewrite the dictionary and continue to call it English! I will never write colour without a u, ever! I may suffer the global American cultural invasion of my country like the rest of us, but I will never let them infiltrate my mind! ;)
May 27 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 9:27 PM, Manu wrote:
 I will never write colour without a u, ever! I may suffer the global American
 cultural invasion of my country like the rest of us, but I will never let them
 infiltrate my mind! ;)
Resistance is useless.
May 27 2013
parent "Diggory" <diggsey googlemail.com> writes:
On Tuesday, 28 May 2013 at 04:52:55 UTC, Walter Bright wrote:
 On 5/27/2013 9:27 PM, Manu wrote:
 I will never write colour without a u, ever! I may suffer the 
 global American
 cultural invasion of my country like the rest of us, but I 
 will never let them
 infiltrate my mind! ;)
Resistance is useless.
*futile :P
May 27 2013
prev sibling next sibling parent Peter Williams <pwil3058 bigpond.net.au> writes:
On 28/05/13 13:22, David Eagen wrote:
 On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:

 So you're going to spell check them all to make sure that they're
 English?  Or did you mean ASCII?

 Peter
That's it. I'm filing a bug against std.traits. There's a unittest there that with a struct named "Colour". Completely unacceptable.
Except here in Australia and other places where they use the Queen's English :-) Peter
May 27 2013
prev sibling parent reply Manu <turkeyman gmail.com> writes:
On 28 May 2013 14:38, Peter Williams <pwil3058 bigpond.net.au> wrote:

 On 28/05/13 13:22, David Eagen wrote:

 On Tuesday, 28 May 2013 at 01:38:22 UTC, Peter Williams wrote:


 So you're going to spell check them all to make sure that they're
 English?  Or did you mean ASCII?

 Peter
That's it. I'm filing a bug against std.traits. There's a unittest there that with a struct named "Colour". Completely unacceptable.
Except here in Australia and other places where they use the Queen's English :-)
Is there anywhere other than America that doesn't?
May 27 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-05-28 08:00, Manu wrote:

 Is there anywhere other than America that doesn't?
Canada, Jamaica, other countries in that region? -- /Jacob Carlborg
May 28 2013
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 28 May 2013 19:12, Jacob Carlborg <doob me.com> wrote:

 On 2013-05-28 08:00, Manu wrote:

  Is there anywhere other than America that doesn't?

 Canada, Jamaica, other countries in that region?
Yes, the region called America ;) Although there's a few British colonies in the Caribbean...
May 28 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-05-28 14:09, Manu wrote:

 Yes, the region called America ;)
 Although there's a few British colonies in the Caribbean...
Oh, you meant the whole region and not the country. -- /Jacob Carlborg
May 28 2013
parent reply "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Tue, 28 May 2013 14:11:29 +0200, Jacob Carlborg <doob me.com> wrote:

 On 2013-05-28 14:09, Manu wrote:

 Yes, the region called America ;)
 Although there's a few British colonies in the Caribbean...
Oh, you meant the whole region and not the country.
America is not a country. The country is called USA. -- Simen
May 28 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-05-28 14:58, Simen Kjaeraas wrote:

 America is not a country. The country is called USA.
I know that, but I get the impression that people usually say "America" and refer to USA. -- /Jacob Carlborg
May 28 2013
prev sibling next sibling parent reply Peter Williams <pwil3058 bigpond.net.au> writes:
On 28/05/13 19:12, Jacob Carlborg wrote:
 On 2013-05-28 08:00, Manu wrote:

 Is there anywhere other than America that doesn't?
Canada, Jamaica, other countries in that region?
Last time I looked Canada was in America (which is a continent not a country). :-) Peter
May 28 2013
parent reply "Diggory" <diggsey googlemail.com> writes:
On Tuesday, 28 May 2013 at 23:33:47 UTC, Peter Williams wrote:
 On 28/05/13 19:12, Jacob Carlborg wrote:
 On 2013-05-28 08:00, Manu wrote:

 Is there anywhere other than America that doesn't?
Canada, Jamaica, other countries in that region?
Last time I looked Canada was in America (which is a continent not a country). :-) Peter
America isn't a continent, North America is a continent, and Canada is in North America :P
May 28 2013
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 29 May 2013 at 01:29:07 UTC, Diggory wrote:
 On Tuesday, 28 May 2013 at 23:33:47 UTC, Peter Williams wrote:
 On 28/05/13 19:12, Jacob Carlborg wrote:
 On 2013-05-28 08:00, Manu wrote:

 Is there anywhere other than America that doesn't?
Canada, Jamaica, other countries in that region?
Last time I looked Canada was in America (which is a continent not a country). :-) Peter
America isn't a continent, North America is a continent, and Canada is in North America :P
Well, that point of view really depends from which continent you're from: http://en.wikipedia.org/wiki/Continents#Number_of_continents There is no internationally agreed on scheme. I for one, have always been taught that there is only "America", and that the terms "North America" and "South America" where only meant to denote regions within said continent.
May 28 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, May 29, 2013 at 09:33:32AM +1000, Peter Williams wrote:
 On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?
Canada, Jamaica, other countries in that region?
Last time I looked Canada was in America (which is a continent not a country). :-)
[...] If you say that to a Canadian to his face, you might get a hostile (or faux-hostile) reaction. :) Up here in the Great White North, we like to think of ourselves as different from our rowdy neighbours to the south (even though we're not that different, but we won't ever admit that :-P). And yes, "America" means USA up here (and "American" especially means USian, as distinct from Canadian), even though we all know that technically it refers to the continent, not the country. T -- Computers aren't intelligent; they only think they are.
May 28 2013
prev sibling next sibling parent Peter Williams <pwil3058 bigpond.net.au> writes:
On 29/05/13 09:57, H. S. Teoh wrote:
 On Wed, May 29, 2013 at 09:33:32AM +1000, Peter Williams wrote:
 On 28/05/13 19:12, Jacob Carlborg wrote:
 On 2013-05-28 08:00, Manu wrote:

 Is there anywhere other than America that doesn't?
Canada, Jamaica, other countries in that region?
Last time I looked Canada was in America (which is a continent not a country). :-)
[...] If you say that to a Canadian to his face, you might get a hostile (or faux-hostile) reaction. :) Up here in the Great White North, we like to think of ourselves as different from our rowdy neighbours to the south (even though we're not that different, but we won't ever admit that :-P). And yes, "America" means USA up here (and "American" especially means USian, as distinct from Canadian), even though we all know that technically it refers to the continent, not the country.
Last time I was there (about 40 years ago) Canadians didn't seem that touchy. :-) Peter
May 28 2013
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, May 29, 2013 at 10:36:08AM +1000, Peter Williams wrote:
 On 29/05/13 09:57, H. S. Teoh wrote:
On Wed, May 29, 2013 at 09:33:32AM +1000, Peter Williams wrote:
On 28/05/13 19:12, Jacob Carlborg wrote:
On 2013-05-28 08:00, Manu wrote:

Is there anywhere other than America that doesn't?
Canada, Jamaica, other countries in that region?
Last time I looked Canada was in America (which is a continent not a country). :-)
[...] If you say that to a Canadian to his face, you might get a hostile (or faux-hostile) reaction. :)
[...]
 Last time I was there (about 40 years ago) Canadians didn't seem
 that touchy. :-)
[...] Well, they are not, hence "faux-hostile". :) T -- Political correctness: socially-sanctioned hypocrisy.
May 28 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-05-28 03:38, Peter Williams wrote:

 So you're going to spell check them all to make sure that they're
 English?  Or did you mean ASCII?
Don't you have a spell checker in your editor? If not, find a new one :) -- /Jacob Carlborg
May 28 2013
prev sibling next sibling parent reply =?UTF-8?B?Ikx1w61z?= Marques" <luismarques gmail.com> writes:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
 I've recently come to the opinion that that's a bad idea, and D 
 should not support it.
I think it is a bad idea to program in a language other than english, but I believe D should still support it.
May 27 2013
parent Manu <turkeyman gmail.com> writes:
On 28 May 2013 09:39, <luismarques gmail.com>" puremagic.com <
"\"Lu=C3=ADs".Marques"> wrote:

 On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:

 I've recently come to the opinion that that's a bad idea, and D should
 not support it.
I think it is a bad idea to program in a language other than english, but I believe D should still support it.
I can imagine a young student learning to code, that may not speak English (yet). Or a not-so-unlikely future where we're all speaking chinese ;)
May 27 2013
prev sibling next sibling parent reply Manu <turkeyman gmail.com> writes:
On 28 May 2013 09:05, Walter Bright <newshound2 digitalmars.com> wrote:

 On 5/27/2013 3:18 PM, H. S. Teoh wrote:

 Well, D *does* support non-English identifiers, y'know... for example:

         void main(string[] args) {
                 int =D1=87=D0=B8=D1=81=D0=BB=D0=BE =3D 1;
                 foreach (=D0=B8; 0..100)
                         =D1=87=D0=B8=D1=81=D0=BB=D0=BE +=3D =D0=B8;
                 writeln(=D1=87=D0=B8=D1=81=D0=BB=D0=BE);
         }

 Of course, whether that's a good practice is a different story. :)
I've recently come to the opinion that that's a bad idea, and D should no=
t
 support it.
Why? You said previously that you'd love to support extended operators ;)
May 27 2013
next sibling parent "Torje Digernes" <torjehoa pvv.org> writes:
On Tuesday, 28 May 2013 at 00:34:20 UTC, Manu wrote:
 On 28 May 2013 09:05, Walter Bright 
 <newshound2 digitalmars.com> wrote:

 On 5/27/2013 3:18 PM, H. S. Teoh wrote:

 Well, D *does* support non-English identifiers, y'know... for 
 example:

         void main(string[] args) {
                 int число = 1;
                 foreach (и; 0..100)
                         число += и;
                 writeln(число);
         }

 Of course, whether that's a good practice is a different 
 story. :)
I've recently come to the opinion that that's a bad idea, and D should not support it.
Why? You said previously that you'd love to support extended operators ;)
I find features such as support for uncommon symbols in variables a strength as it makes some physics formulas a bit easier to read in code form, which in my opinion is a good thing.
May 27 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2013 5:34 PM, Manu wrote:
 On 28 May 2013 09:05, Walter Bright <newshound2 digitalmars.com
 <mailto:newshound2 digitalmars.com>> wrote:

     On 5/27/2013 3:18 PM, H. S. Teoh wrote:

         Well, D *does* support non-English identifiers, y'know... for example:

                  void main(string[] args) {
                          int число = 1;
                          foreach (и; 0..100)
                                  число += и;
                          writeln(число);
                  }

         Of course, whether that's a good practice is a different story. :)


     I've recently come to the opinion that that's a bad idea, and D should not
     support it.


 Why? You said previously that you'd love to support extended operators ;)
Extended operators, yes. Non-ascii identifiers, no.
May 27 2013
parent "Oleg Kuporosov" <Oleg.Kuporosov gmail.com> writes:
On Tuesday, 28 May 2013 at 01:34:47 UTC, Walter Bright wrote:

 Why? You said previously that you'd love to support extended 
 operators ;)
Extended operators, yes. Non-ascii identifiers, no.
BTW, this is one of big D advantage, take into account some day D could be used for teaching in schools where pupils still doesn't know English somewhere outside US/GB. It is much easy to start with localized Ids. Please keep unicode in language.
May 28 2013
prev sibling next sibling parent "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Tue, 28 May 2013 01:05:46 +0200, Walter Bright  =

<newshound2 digitalmars.com> wrote:

 On 5/27/2013 3:18 PM, H. S. Teoh wrote:
 Well, D *does* support non-English identifiers, y'know... for example=
:
 	void main(string[] args) {
 		int =D1=87=D0=B8=D1=81=D0=BB=D0=BE =3D 1;
 		foreach (=D0=B8; 0..100)
 			=D1=87=D0=B8=D1=81=D0=BB=D0=BE +=3D =D0=B8;
 		writeln(=D1=87=D0=B8=D1=81=D0=BB=D0=BE);
 	}

 Of course, whether that's a good practice is a different story. :)
I've recently come to the opinion that that's a bad idea, and D should=
=
 not support it.
I've recently come to the opinion that you're wrong - using them is ofte= n wrong, but D should support them. Various good reasons have been posted in this thread. -- = Simen
May 28 2013
prev sibling next sibling parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
 I've recently come to the opinion that that's a bad idea, and D 
 should not support it.
Honestly, removing support for non-ASCII characters from identifiers is the worst idea you've had in a while. There is an _unfathomable amount_ of code out there written in non-English languages but hamfisted into an English-alphabet representation because the programming language doesn't care to support it. The resulting friction is considerable. You seem to attribute particular value to personal anecdotes, so here's one of mine: I personally know several prestigious universities in Europe and Asia which teach programming using Java and/or C with identifiers being in an English-alphabet representation of the native non-English language. Using the English language for identifiers is usually a sanctioned alternative, but not the primary modus operandi. I also know several professional programmers using their native non-English language for identifiers in production code. Please reconsider.
May 29 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/29/2013 2:42 AM, Jakob Ovrum wrote:
 On Monday, 27 May 2013 at 23:05:46 UTC, Walter Bright wrote:
 I've recently come to the opinion that that's a bad idea, and D should not
 support it.
Honestly, removing support for non-ASCII characters from identifiers is the worst idea you've had in a while. There is an _unfathomable amount_ of code out there written in non-English languages but hamfisted into an English-alphabet representation because the programming language doesn't care to support it. The resulting friction is considerable. You seem to attribute particular value to personal anecdotes, so here's one of mine: I personally know several prestigious universities in Europe and Asia which teach programming using Java and/or C with identifiers being in an English-alphabet representation of the native non-English language. Using the English language for identifiers is usually a sanctioned alternative, but not the primary modus operandi. I also know several professional programmers using their native non-English language for identifiers in production code. Please reconsider.
I still think it's a bad idea, but it's obvious people want it in D, so it'll stay. (Also note that I meant using ASCII, not necessarily english.)
May 29 2013
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 29 May 2013 15:44:17 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 I still think it's a bad idea, but it's obvious people want it in D, so it'll
stay.
 
 (Also note that I meant using ASCII, not necessarily english.)
Surprisingly ASCII also covers Cornish and Malay. -- Marco
May 29 2013
prev sibling next sibling parent "Oleg Kuporosov" <Oleg.Kuporosov gmail.com> writes:
On Wednesday, 29 May 2013 at 22:44:17 UTC, Walter Bright wrote:
 I still think it's a bad idea, but it's obvious people want it 
 in D, so it'll stay.

 (Also note that I meant using ASCII, not necessarily english.)
Good, thanks, restrictions definetelly can and should be applied per project, like for druntime/Phobos.
May 29 2013
prev sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 29 May 2013 at 22:44:17 UTC, Walter Bright wrote:
 (Also note that I meant using ASCII, not necessarily english.)
I don't understand the logic behind this. Surely this is the worst combination; severely crippled ability to use non-English languages (yes, even for European languages), yet non-speakers of those languages still don't have a clue what it means.
May 30 2013
prev sibling next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 27 May 2013 16:05:46 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 5/27/2013 3:18 PM, H. S. Teoh wrote:
 Well, D *does* support non-English identifiers, y'know... for example:

 	void main(string[] args) {
 		int =D1=87=D0=B8=D1=81=D0=BB=D0=BE =3D 1;
 		foreach (=D0=B8; 0..100)
 			=D1=87=D0=B8=D1=81=D0=BB=D0=BE +=3D =D0=B8;
 		writeln(=D1=87=D0=B8=D1=81=D0=BB=D0=BE);
 	}

 Of course, whether that's a good practice is a different story. :)
=20 I've recently come to the opinion that that's a bad idea, and D should no=
t=20
 support it.
I hope that was just a random thought. I knew a teacher who would give all his methods German names so they are easier to distinguish from the English Java library methods. Personally I like to type =CE=B1 instead of alpha for angles, since that is the identifier you'd expect in math. And everyone likes "alias =E2=84=95 =3D size_t;", right? :) D=C3=A9j=C3=A0 vu? --=20 Marco
May 29 2013
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 05/29/2013 12:03 PM, Marco Leise wrote:
 ...  And everyone
 likes "alias ℕ = size_t;", right? :)
 ...
No, that's deeply troubling.
May 30 2013
prev sibling parent reply "Entry" <no no.com> writes:
My personal opinion is that code should only be in English.
May 29 2013
parent reply Peter Williams <pwil3058 bigpond.net.au> writes:
On 30/05/13 08:40, Entry wrote:
 My personal opinion is that code should only be in English.
But why would you want to impose this restriction on others? Peter
May 29 2013
parent reply "Entry" <no no.com> writes:
On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams wrote:
 On 30/05/13 08:40, Entry wrote:
 My personal opinion is that code should only be in English.
But why would you want to impose this restriction on others? Peter
I wouldn't say impose. I'd say that programming in a unified language (D) should not be sabotaged by comments and variable names in various human languages (Swedish, Russian), but be accompanied by a similarly 'unified' language that we all know - English. It is only my opinion though and I wouldn't force it upon anyone.
May 30 2013
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
 On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams wrote:
 On 30/05/13 08:40, Entry wrote:
 My personal opinion is that code should only be in English.
But why would you want to impose this restriction on others? Peter
I wouldn't say impose. I'd say that programming in a unified language (D) should not be sabotaged by comments and variable names in various human languages (Swedish, Russian), but be accompanied by a similarly 'unified' language that we all know - English. It is only my opinion though and I wouldn't force it upon anyone.
But programming IS a human tool, and thus, subject to human language. Also, I don't see how a programming language is any more unified than, say, a library. While you wouldn't force it on anyone, would it also be your opinion that putting a French book in a french library be a sabotage of the world's librarial institutions?
May 30 2013
parent reply "Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 09:29:43 UTC, monarch_dodra wrote:
 On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
 On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams 
 wrote:
 On 30/05/13 08:40, Entry wrote:
 My personal opinion is that code should only be in English.
But why would you want to impose this restriction on others? Peter
I wouldn't say impose. I'd say that programming in a unified language (D) should not be sabotaged by comments and variable names in various human languages (Swedish, Russian), but be accompanied by a similarly 'unified' language that we all know - English. It is only my opinion though and I wouldn't force it upon anyone.
But programming IS a human tool, and thus, subject to human language. Also, I don't see how a programming language is any more unified than, say, a library. While you wouldn't force it on anyone, would it also be your opinion that putting a French book in a french library be a sabotage of the world's librarial institutions?
What a way to attack a straw-man and completely miss the point at the same time.
May 30 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 13:12:17 UTC, Entry wrote:
 On Thursday, 30 May 2013 at 09:29:43 UTC, monarch_dodra wrote:
 On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
 On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams 
 wrote:
 On 30/05/13 08:40, Entry wrote:
 My personal opinion is that code should only be in English.
But why would you want to impose this restriction on others? Peter
I wouldn't say impose. I'd say that programming in a unified language (D) should not be sabotaged by comments and variable names in various human languages (Swedish, Russian), but be accompanied by a similarly 'unified' language that we all know - English. It is only my opinion though and I wouldn't force it upon anyone.
But programming IS a human tool, and thus, subject to human language. Also, I don't see how a programming language is any more unified than, say, a library. While you wouldn't force it on anyone, would it also be your opinion that putting a French book in a french library be a sabotage of the world's librarial institutions?
What a way to attack a straw-man and completely miss the point at the same time.
Fine. In that case, I'll retort by saying that you use of the 'unified' is intentionally loaded to favor your stance. My retort was not correctly expressed, but I don't see how D is "unified". I thought it was just a tool to create programs.
May 30 2013
parent reply "Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 13:52:09 UTC, monarch_dodra wrote:
 On Thursday, 30 May 2013 at 13:12:17 UTC, Entry wrote:
 On Thursday, 30 May 2013 at 09:29:43 UTC, monarch_dodra wrote:
 On Thursday, 30 May 2013 at 08:32:01 UTC, Entry wrote:
 On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams 
 wrote:
 On 30/05/13 08:40, Entry wrote:
 My personal opinion is that code should only be in English.
But why would you want to impose this restriction on others? Peter
I wouldn't say impose. I'd say that programming in a unified language (D) should not be sabotaged by comments and variable names in various human languages (Swedish, Russian), but be accompanied by a similarly 'unified' language that we all know - English. It is only my opinion though and I wouldn't force it upon anyone.
But programming IS a human tool, and thus, subject to human language. Also, I don't see how a programming language is any more unified than, say, a library. While you wouldn't force it on anyone, would it also be your opinion that putting a French book in a french library be a sabotage of the world's librarial institutions?
What a way to attack a straw-man and completely miss the point at the same time.
Fine. In that case, I'll retort by saying that you use of the 'unified' is intentionally loaded to favor your stance. My retort was not correctly expressed, but I don't see how D is "unified". I thought it was just a tool to create programs.
Take a minute to think about why we're all communicating in English here. Let's see if you can figure it out. I just think that it's better to focus on two very specific languages with two very specific purposes (D for programming and English for communication). 'Twas just an idea, I don't care if you write your code in hieroglyphs.
May 30 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 30 May 2013 at 14:13:47 UTC, Entry wrote:
 Take a minute to think about why we're all communicating in 
 English here. Let's see if you can figure it out.
Well that's condescending :/ and fallacious. To answer your question, it may have something to do with the fact that these are the English forums? Just a wild hunch. Oh. And because we *can* speak English? That could also have something to do with it. There are tons of non-English speaking programming forums out there. Maybe those that don't speak English are over there? Heck, there are a few non-English threads in learn. Oh. And did you know TDPL was published in Japanese? Why bother right?
 I just think that it's better to focus on two very specific 
 languages with two very specific purposes (D for programming 
 and English for communication). 'Twas just an idea, I don't 
 care if you write your code in hieroglyphs.
I really really agree with you. Yet, I think they are orthogonal concepts, and that the D programming language has no business choosing which communication vector its users should use. It's not just a matter (imo) of "I wouldn't force it upon anyone", but "I think everyone should choose what's best for them". Yeah. I know. Same conclusion, but there is a nuance.
May 30 2013
parent reply "Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 14:49:12 UTC, monarch_dodra wrote:
 On Thursday, 30 May 2013 at 14:13:47 UTC, Entry wrote:
 Take a minute to think about why we're all communicating in 
 English here. Let's see if you can figure it out.
Well that's condescending :/ and fallacious. To answer your question, it may have something to do with the fact that these are the English forums? Just a wild hunch. Oh. And because we *can* speak English? That could also have something to do with it. There are tons of non-English speaking programming forums out there. Maybe those that don't speak English are over there? Heck, there are a few non-English threads in learn. Oh. And did you know TDPL was published in Japanese? Why bother right?
 I just think that it's better to focus on two very specific 
 languages with two very specific purposes (D for programming 
 and English for communication). 'Twas just an idea, I don't 
 care if you write your code in hieroglyphs.
I really really agree with you. Yet, I think they are orthogonal concepts, and that the D programming language has no business choosing which communication vector its users should use. It's not just a matter (imo) of "I wouldn't force it upon anyone", but "I think everyone should choose what's best for them". Yeah. I know. Same conclusion, but there is a nuance.
I'm glad you agree, though I believe that I never said anything about D 'choosing' which human languages are compatible with it. I just expressed my belief that should people choose to construct something, be it a ship or a computer program, the usage of a single language will greatly enhance their progress (ever heard the story of the Tower of Babel? wink wink). Sorry if my previous comment seemed hostile, that was not my intention.
May 30 2013
next sibling parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Thursday, 30 May 2013 at 15:48:12 UTC, Entry wrote:
 I'm glad you agree, though I believe that I never said anything 
 about D 'choosing' which human languages are compatible with 
 it. I just expressed my belief that should people choose to 
 construct something, be it a ship or a computer program, the 
 usage of a single language will greatly enhance their progress 
 (ever heard the story of the Tower of Babel? wink wink). Sorry 
 if my previous comment seemed hostile, that was not my 
 intention.
If the programmers who are going to be working on that code don't understand the "Single Language", then what use is it?
May 30 2013
parent reply "Entry" <no no.com> writes:
On Thursday, 30 May 2013 at 16:05:13 UTC, Jakob Ovrum wrote:
 On Thursday, 30 May 2013 at 15:48:12 UTC, Entry wrote:
 I'm glad you agree, though I believe that I never said 
 anything about D 'choosing' which human languages are 
 compatible with it. I just expressed my belief that should 
 people choose to construct something, be it a ship or a 
 computer program, the usage of a single language will greatly 
 enhance their progress (ever heard the story of the Tower of 
 Babel? wink wink). Sorry if my previous comment seemed 
 hostile, that was not my intention.
If the programmers who are going to be working on that code don't understand the "Single Language", then what use is it?
Then there's no helping it. Though I wonder what kind of a programmer doesn't understand English enough to at least read the code and comments.
May 30 2013
parent Manu <turkeyman gmail.com> writes:
On 31 May 2013 03:08, Entry <no no.com> wrote:

 On Thursday, 30 May 2013 at 16:05:13 UTC, Jakob Ovrum wrote:

 On Thursday, 30 May 2013 at 15:48:12 UTC, Entry wrote:

 I'm glad you agree, though I believe that I never said anything about D
 'choosing' which human languages are compatible with it. I just expressed
 my belief that should people choose to construct something, be it a ship or
 a computer program, the usage of a single language will greatly enhance
 their progress (ever heard the story of the Tower of Babel? wink wink).
 Sorry if my previous comment seemed hostile, that was not my intention.
If the programmers who are going to be working on that code don't understand the "Single Language", then what use is it?
Then there's no helping it. Though I wonder what kind of a programmer doesn't understand English enough to at least read the code and comments.
A child, or a student.
May 30 2013
prev sibling parent Manu <turkeyman gmail.com> writes:
On 31 May 2013 01:48, Entry <no no.com> wrote:

 On Thursday, 30 May 2013 at 14:49:12 UTC, monarch_dodra wrote:

 On Thursday, 30 May 2013 at 14:13:47 UTC, Entry wrote:

 Take a minute to think about why we're all communicating in English
 here. Let's see if you can figure it out.
Well that's condescending :/ and fallacious. To answer your question, it may have something to do with the fact that these are the English forums? Just a wild hunch. Oh. And because we *can* speak English? That could also have something to do with it. There are tons of non-English speaking programming forums out there. Maybe those that don't speak English are over there? Heck, there are a few non-English threads in learn. Oh. And did you know TDPL was published in Japanese? Why bother right? I just think that it's better to focus on two very specific languages
 with two very specific purposes (D for programming and English for
 communication). 'Twas just an idea, I don't care if you write your code in
 hieroglyphs.
I really really agree with you. Yet, I think they are orthogonal concepts, and that the D programming language has no business choosing which communication vector its users should use. It's not just a matter (imo) of "I wouldn't force it upon anyone", but "I think everyone should choose what's best for them". Yeah. I know. Same conclusion, but there is a nuance.
I'm glad you agree, though I believe that I never said anything about D 'choosing' which human languages are compatible with it. I just expressed my belief that should people choose to construct something, be it a ship or a computer program, the usage of a single language will greatly enhance their progress (ever heard the story of the Tower of Babel? wink wink). Sorry if my previous comment seemed hostile, that was not my intention.
This is the definition of a *convention*, not a rule.
May 30 2013
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 30 May 2013 18:32, Entry <no no.com> wrote:

 On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams wrote:

 On 30/05/13 08:40, Entry wrote:

 My personal opinion is that code should only be in English.
But why would you want to impose this restriction on others? Peter
I wouldn't say impose. I'd say that programming in a unified language (D) should not be sabotaged by comments and variable names in various human languages (Swedish, Russian), but be accompanied by a similarly 'unified' language that we all know - English. It is only my opinion though and I wouldn't force it upon anyone.
May 30 2013
prev sibling parent reply Manu <turkeyman gmail.com> writes:
On 30 May 2013 18:32, Entry <no no.com> wrote:

 On Wednesday, 29 May 2013 at 23:57:01 UTC, Peter Williams wrote:

 On 30/05/13 08:40, Entry wrote:

 My personal opinion is that code should only be in English.
But why would you want to impose this restriction on others? Peter
I wouldn't say impose. I'd say that programming in a unified language (D) should not be sabotaged by comments and variable names in various human languages (Swedish, Russian), but be accompanied by a similarly 'unified' language that we all know - English. It is only my opinion though and I wouldn't force it upon anyone.
We don't all know English. Plenty of people don't. I've worked a lot with Sony and Nintendo code/libraries, for instance, it almost always looks like this: { // E: I like cake. // J: =E3=82=B1=E3=83=BC=E3=82=AD=E3=81=8C=E5=A5=BD=E3=81=8D=E3=81=A7=E3= =81=99=E3=80=82 player.eatCake(); } Clearly someone doesn't speak English in these massive codebases that power an industry worth 10s of billions.
May 30 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2013 4:24 AM, Manu wrote:
 We don't all know English. Plenty of people don't.
 I've worked a lot with Sony and Nintendo code/libraries, for instance, it
almost
 always looks like this:

 {
    // E: I like cake.
    // J: ケーキが好きです。
    player.eatCake();
 }

 Clearly someone doesn't speak English in these massive codebases that power an
 industry worth 10s of billions.
Sure, but the code itself is written using ASCII!
May 30 2013
next sibling parent reply Peter Williams <pwil3058 bigpond.net.au> writes:
On 31/05/13 05:07, Walter Bright wrote:
 On 5/30/2013 4:24 AM, Manu wrote:
 We don't all know English. Plenty of people don't.
 I've worked a lot with Sony and Nintendo code/libraries, for instance,
 it almost
 always looks like this:

 {
    // E: I like cake.
    // J: ケーキが好きです。
    player.eatCake();
 }

 Clearly someone doesn't speak English in these massive codebases that
 power an
 industry worth 10s of billions.
Sure, but the code itself is written using ASCII!
Because they had no choice. Peter
May 30 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2013 5:00 PM, Peter Williams wrote:
 On 31/05/13 05:07, Walter Bright wrote:
 On 5/30/2013 4:24 AM, Manu wrote:
 We don't all know English. Plenty of people don't.
 I've worked a lot with Sony and Nintendo code/libraries, for instance,
 it almost
 always looks like this:

 {
    // E: I like cake.
    // J: ケーキが好きです。
    player.eatCake();
 }

 Clearly someone doesn't speak English in these massive codebases that
 power an
 industry worth 10s of billions.
Sure, but the code itself is written using ASCII!
Because they had no choice.
Not true, D supports Unicode identifiers.
May 30 2013
next sibling parent reply "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Fri, 31 May 2013 07:57:37 +0200, Walter Bright  =

<newshound2 digitalmars.com> wrote:

 On 5/30/2013 5:00 PM, Peter Williams wrote:
 On 31/05/13 05:07, Walter Bright wrote:
 On 5/30/2013 4:24 AM, Manu wrote:
 We don't all know English. Plenty of people don't.
 I've worked a lot with Sony and Nintendo code/libraries, for instan=
ce,
 it almost
 always looks like this:

 {
    // E: I like cake.
    // J: =E3=82=B1=E3=83=BC=E3=82=AD=E3=81=8C=E5=A5=BD=E3=81=8D=E3=81=
=A7=E3=81=99=E3=80=82
    player.eatCake();
 }

 Clearly someone doesn't speak English in these massive codebases th=
at
 power an
 industry worth 10s of billions.
Sure, but the code itself is written using ASCII!
Because they had no choice.
Not true, D supports Unicode identifiers.
I doubt Sony and Nintendo use D extensively. -- = Simen
May 31 2013
parent 1100110 <0b1100110 gmail.com> writes:
On 05/31/2013 05:11 AM, Simen Kjaeraas wrote:
 On Fri, 31 May 2013 07:57:37 +0200, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 5/30/2013 5:00 PM, Peter Williams wrote:
 On 31/05/13 05:07, Walter Bright wrote:
 On 5/30/2013 4:24 AM, Manu wrote:
 We don't all know English. Plenty of people don't.
 I've worked a lot with Sony and Nintendo code/libraries, for instance,
 it almost
 always looks like this:

 {
    // E: I like cake.
    // J: ケーキが好きです。
    player.eatCake();
 }

 Clearly someone doesn't speak English in these massive codebases that
 power an
 industry worth 10s of billions.
Sure, but the code itself is written using ASCII!
Because they had no choice.
Not true, D supports Unicode identifiers.
I doubt Sony and Nintendo use D extensively.
Jun 17 2013
prev sibling next sibling parent Timothee Cour <thelastmammoth gmail.com> writes:
On Thu, May 30, 2013 at 10:57 PM, Walter Bright
<newshound2 digitalmars.com>wrote:

 On 5/30/2013 5:00 PM, Peter Williams wrote:

 On 31/05/13 05:07, Walter Bright wrote:

 On 5/30/2013 4:24 AM, Manu wrote:

 We don't all know English. Plenty of people don't.
 I've worked a lot with Sony and Nintendo code/libraries, for instance,
 it almost
 always looks like this:

 {
    // E: I like cake.
    // J: =E3=82=B1=E3=83=BC=E3=82=AD=E3=81=8C=E5=A5=BD=E3=81=8D=E3=81=
=A7=E3=81=99=E3=80=82
    player.eatCake();
 }

 Clearly someone doesn't speak English in these massive codebases that
 power an
 industry worth 10s of billions.
Sure, but the code itself is written using ASCII!
Because they had no choice.
Not true, D supports Unicode identifiers.
currently std.demangle.demangle doesn't work with unicode (see example below) If we decide to keep allowing unicode symbols (as opposed to just unicode strings/comments), we must address this issue. Will supporting this negatively impact performance (of both compile time and runtime) ? Likewise, will linkers + other tools (gdb etc) be happy with unicode in mangled names? ---- struct A{ int z; void foo(int x){} void =E3=81=95=E3=81=84=E3=81=94=E3=81=AE=E6=9E=9C=E5=AE=9F(int x){} void =C2=AA=C3=A5(int x){} } mangledName!(A.=E3=81=95=E3=81=84=E3=81=94=E3=81=AE=E6=9E=9C=E5=AE=9F).dema= ngle.writeln;=3D> _D4util13demangle_funs1A18=E3=81=95=E3=81=84=E3=81=94=E3=81=AE=E6=9E=9C=E5= =AE=9FMFiZv ----
Jun 05 2013
prev sibling next sibling parent Brad Roberts <braddr puremagic.com> writes:
On 6/5/13 6:11 PM, Timothee Cour wrote:
 currently std.demangle.demangle doesn't work with unicode (see example below)

 If we decide to keep allowing unicode symbols (as opposed to just unicode
strings/comments), we must
 address this issue. Will supporting this negatively impact performance (of
both compile time and
 runtime) ?

 Likewise, will linkers + other tools (gdb etc) be happy with unicode in
mangled names?

 ----
 structA{
 intz;
 voidfoo(intx){}
 voidさいごの果実(intx){}
 voidªå(intx){}
 }
 mangledName!(A.さいごの果実).demangle.writeln;=>_D4util13demangle_funs1A18さいごの果実MFiZv
 ----
Filed in bugzilla?
Jun 05 2013
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Jun 5, 2013, at 6:21 PM, Brad Roberts <braddr puremagic.com> wrote:

 On 6/5/13 6:11 PM, Timothee Cour wrote:
 currently std.demangle.demangle doesn't work with unicode (see =
example below)
=20
 If we decide to keep allowing unicode symbols (as opposed to just =
unicode strings/comments), we must
 address this issue. Will supporting this negatively impact =
performance (of both compile time and
 runtime) ?
=20
 Likewise, will linkers + other tools (gdb etc) be happy with unicode =
in mangled names?
=20
 ----
 structA{
 intz;
 voidfoo(intx){}
 void=E3=81=95=E3=81=84=E3=81=94=E3=81=AE=E6=9E=9C=E5=AE=9F(intx){}
 void=C2=AA=C3=A5(intx){}
 }
 =
mangledName!(A.=E3=81=95=E3=81=84=E3=81=94=E3=81=AE=E6=9E=9C=E5=AE=9F).dem= angle.writeln;=3D>_D4util13demangle_funs1A18=E3=81=95=E3=81=84=E3=81=94=E3= =81=AE=E6=9E=9C=E5=AE=9FMFiZv
 ----
=20 Filed in bugzilla?
http://d.puremagic.com/issues/show_bug.cgi?id=3D10393 https://github.com/D-Programming-Language/druntime/pull/524
Jun 17 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jun 17, 2013 at 11:37:18AM -0700, Sean Kelly wrote:
 On Jun 5, 2013, at 6:21 PM, Brad Roberts <braddr puremagic.com> wrote:
 
 On 6/5/13 6:11 PM, Timothee Cour wrote:
 currently std.demangle.demangle doesn't work with unicode (see example below)
 
 If we decide to keep allowing unicode symbols (as opposed to just unicode
strings/comments), we must
 address this issue. Will supporting this negatively impact performance (of
both compile time and
 runtime) ?
 
 Likewise, will linkers + other tools (gdb etc) be happy with unicode in
mangled names?
 
 ----
 structA{
 intz;
 voidfoo(intx){}
 voidさいごの果実(intx){}
 voidªå(intx){}
 }
 mangledName!(A.さいごの果実).demangle.writeln;=>_D4util13demangle_funs1A18さいごの果実MFiZv
 ----
Filed in bugzilla?
http://d.puremagic.com/issues/show_bug.cgi?id=10393 https://github.com/D-Programming-Language/druntime/pull/524
Do linkers actually support 8-bit symbol names? Or do these have to be translated into ASCII somehow? T -- We've all heard that a million monkeys banging on a million typewriters will eventually reproduce the entire works of Shakespeare. Now, thanks to the Internet, we know this is not true. -- Robert Wilensk
Jun 17 2013
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Jun 17, 2013, at 11:47 AM, "H. S. Teoh" <hsteoh quickfur.ath.cx> =
wrote:
=20
 Do linkers actually support 8-bit symbol names? Or do these have to be
 translated into ASCII somehow?
Good question. It looks like the linker on OSX does: public _D3abc1A18=E3=81=95=E3=81=84=E3=81=94=E3=81=AE=E6=9E=9C=E5= =AE=9FMFiZv public _D3abc1A4=C2=AA=C3=A5MFiZv The object file linked just fine. I haven't tried OPTLINK on Win32 = though.=
Jun 17 2013
prev sibling next sibling parent reply Brad Roberts <braddr puremagic.com> writes:
On 6/17/13 11:58 AM, Sean Kelly wrote:
 On Jun 17, 2013, at 11:47 AM, "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote:
 Do linkers actually support 8-bit symbol names? Or do these have to be
 translated into ASCII somehow?
Good question. It looks like the linker on OSX does: public _D3abc1A18さいごの果実MFiZv public _D3abc1A4ªåMFiZv The object file linked just fine. I haven't tried OPTLINK on Win32 though.
Don't symbol names from dmd/win32 get compressed if they're too long, resulting in essentially arbitrary random binary data being used as symbol names? Assuming my memory on that is correct then it's already demonstrated that optlink doesn't care what the data is.
Jun 17 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/17/2013 6:28 PM, Brad Roberts wrote:
 Don't symbol names from dmd/win32 get compressed if they're too long, resulting
 in essentially arbitrary random binary data being used as symbol names?
 Assuming my memory on that is correct then it's already demonstrated that
 optlink doesn't care what the data is.
Optlink doesn't care what the symbol byte contents are.
Jun 17 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jun 17, 2013 at 06:49:19PM -0700, Walter Bright wrote:
 On 6/17/2013 6:28 PM, Brad Roberts wrote:
Don't symbol names from dmd/win32 get compressed if they're too long, resulting
in essentially arbitrary random binary data being used as symbol names?
Assuming my memory on that is correct then it's already demonstrated that
optlink doesn't care what the data is.
Optlink doesn't care what the symbol byte contents are.
It seems ld on Linux doesn't, either. I just tested separate compilation on some code containing functions and modules with Cyrillic names, and it worked fine. But my system locale is UTF-8; I'm not sure if there may be a problem on other system locales (not that modern systems would actually use anything else, though!). Might this cause a problem with the VS linker? T -- It only takes one twig to burn down a forest.
Jun 18 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/18/2013 9:44 AM, H. S. Teoh wrote:
 Might this cause a problem with the VS linker?
I doubt it, but try it and see!
Jun 18 2013
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jun 18, 2013 at 04:33:54PM -0700, Walter Bright wrote:
 On 6/18/2013 9:44 AM, H. S. Teoh wrote:
Might this cause a problem with the VS linker?
I doubt it, but try it and see!
Sadly I don't have access to a Windows dev machine. Anybody else cares to try? T -- Study gravitation, it's a field with a lot of potential.
Jun 18 2013
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Jun 17, 2013, at 6:28 PM, Brad Roberts <braddr puremagic.com> wrote:

 On 6/17/13 11:58 AM, Sean Kelly wrote:
 On Jun 17, 2013, at 11:47 AM, "H. S. Teoh" <hsteoh quickfur.ath.cx> =
wrote:
=20
 Do linkers actually support 8-bit symbol names? Or do these have to =
be
 translated into ASCII somehow?
=20 Good question. It looks like the linker on OSX does: =20 public _D3abc1A18=E3=81=95=E3=81=84=E3=81=94=E3=81=AE=E6=9E=9C=E5=
=AE=9FMFiZv
 	public	_D3abc1A4=C2=AA=C3=A5MFiZv
=20
 The object file linked just fine.  I haven't tried OPTLINK on Win32 =
though.
=20
=20 Don't symbol names from dmd/win32 get compressed if they're too long, =
resulting in essentially arbitrary random binary data being used as = symbol names? Assuming my memory on that is correct then it's already = demonstrated that optlink doesn't care what the data is. Yes. So it isn't always possible to fully demangle really long symbol = names. This is not terribly difficult to hit using templates, = especially if they take string arguments.=
Jun 19 2013
prev sibling next sibling parent reply Manu <turkeyman gmail.com> writes:
On 31 May 2013 05:07, Walter Bright <newshound2 digitalmars.com> wrote:

 On 5/30/2013 4:24 AM, Manu wrote:

 We don't all know English. Plenty of people don't.
 I've worked a lot with Sony and Nintendo code/libraries, for instance, i=
t
 almost
 always looks like this:

 {
    // E: I like cake.
    // J: =E3=82=B1=E3=83=BC=E3=82=AD=E3=81=8C=E5=A5=BD=E3=81=8D=E3=81=A7=
=E3=81=99=E3=80=82
    player.eatCake();
 }

 Clearly someone doesn't speak English in these massive codebases that
 power an
 industry worth 10s of billions.
Sure, but the code itself is written using ASCII!
But that doesn't make it English, or any more readable... The only benefit to forcing users to use ASCII is that everyone can physically type it. But that comes with disadvantages: 1. It's not natural to type a word that you don't know what it is or how to spell, you'll end up copy-pasting anyway rather than trying to remember/copy it letter by letter and risk misspelling. 2. It's less natural for the people who CAN read it, because they have to mentally transliterate too. (And if they're kids/amateurs who don't know even know the latin alphabet?) Ie, it serves neither party to force someone who doesn't speak English to write ASCII. Add that to the points I (and others) made earlier about education, or children learning to code. There's no compelling reason to force identifiers in ASCII. Currently, D offers a unique advantage; leave it that way.
May 30 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2013 5:04 PM, Manu wrote:
 Currently, D offers a unique advantage; leave it that way.
I am going to leave it that way based on the comments here, I only wanted to point out that the example didn't support Unicode identifiers.
May 30 2013
prev sibling parent Manu <turkeyman gmail.com> writes:
On 31 May 2013 10:00, Peter Williams <pwil3058 bigpond.net.au> wrote:

 On 31/05/13 05:07, Walter Bright wrote:

 On 5/30/2013 4:24 AM, Manu wrote:

 We don't all know English. Plenty of people don't.
 I've worked a lot with Sony and Nintendo code/libraries, for instance,
 it almost
 always looks like this:

 {
    // E: I like cake.
    // J: =E3=82=B1=E3=83=BC=E3=82=AD=E3=81=8C=E5=A5=BD=E3=81=8D=E3=81=
=A7=E3=81=99=E3=80=82
    player.eatCake();
 }

 Clearly someone doesn't speak English in these massive codebases that
 power an
 industry worth 10s of billions.
Sure, but the code itself is written using ASCII!
Because they had no choice.
Indeed, and believe me, the variable names can often make NO sense, or worse, they're misunderstood and quite misleading. Ie, you think a variable is something, but you realise it's the inverse, or just something completely different.
May 30 2013
prev sibling parent "Mr. Anonymous" <mailnew4ster gmail.com> writes:
On Monday, 27 May 2013 at 22:20:16 UTC, H. S. Teoh wrote:
 On Tue, May 28, 2013 at 12:04:52AM +0200, Vladimir Panteleev 
 wrote:
 On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to 
write
large amounts of text in some given language; a method of
temporarily switching to a different layout is necessary.
I thought the topic was typing the occasional Unicode character to use as an operator in D programs?
Well, D *does* support non-English identifiers, y'know... for example: void main(string[] args) { int число = 1; foreach (и; 0..100) число += и; writeln(число); } Of course, whether that's a good practice is a different story. :) But for operators, you still need enough compose key sequences to cover all of the Unicode operators -- and there are a LOT of them -- which I don't think is currently done anywhere. You'd have to make your own compose key maps to do it. T
http://code.google.com/p/trileri/source/browse/trunk/tr/yazi.d
May 28 2013
prev sibling parent "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Tue, 28 May 2013 00:18:31 +0200, H. S. Teoh <hsteoh quickfur.ath.cx> =
 =

wrote:

 On Tue, May 28, 2013 at 12:04:52AM +0200, Vladimir Panteleev wrote:
 On Monday, 27 May 2013 at 21:24:15 UTC, H. S. Teoh wrote:
Besides, it's impractical to use compose key sequences to write
large amounts of text in some given language; a method of
temporarily switching to a different layout is necessary.
I thought the topic was typing the occasional Unicode character to use as an operator in D programs?
Well, D *does* support non-English identifiers, y'know... for example:=
 	void main(string[] args) {
 		int =D1=87=D0=B8=D1=81=D0=BB=D0=BE =3D 1;
 		foreach (=D0=B8; 0..100)
 			=D1=87=D0=B8=D1=81=D0=BB=D0=BE +=3D =D0=B8;
 		writeln(=D1=87=D0=B8=D1=81=D0=BB=D0=BE);
 	}

 Of course, whether that's a good practice is a different story. :)

 But for operators, you still need enough compose key sequences to cove=
r
 all of the Unicode operators -- and there are a LOT of them -- which I=
 don't think is currently done anywhere. You'd have to make your own
 compose key maps to do it.
The Fortress programming language has some 900 or so operators: https://java.net/projects/projectfortress/sources/sources/content/Specif= ication/fortress.1.0.pdf?rev=3D5558 Appendix C, and https://java.net/projects/projectfortress/sources/sources/content/Docume= ntation/Specification/fortress.pdf?rev=3D5558 chapter 14 -- = Simen
May 27 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, May 28, 2013 11:38:08 Peter Williams wrote:
 On 28/05/13 09:44, H. S. Teoh wrote:
 Since language keywords are already in English, we might as well
 standardize on English identifiers too.
So you're going to spell check them all to make sure that they're English? Or did you mean ASCII?
I think that it was more an issue of that the only reason that Unicode would be necessary in identifiers would be if you weren't using English, so if you assume that everyone is going to be using some form of English for their identifier names, you can skip having Unicode in identifiers. So, a natural effect of standardizing on English is that you can stick with ASCII. - Jonathan M Davis
May 27 2013
prev sibling parent Manu <turkeyman gmail.com> writes:
On 28 May 2013 11:42, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 On Tuesday, May 28, 2013 11:38:08 Peter Williams wrote:
 On 28/05/13 09:44, H. S. Teoh wrote:
 Since language keywords are already in English, we might as well
 standardize on English identifiers too.
So you're going to spell check them all to make sure that they're English? Or did you mean ASCII?
I think that it was more an issue of that the only reason that Unicode would be necessary in identifiers would be if you weren't using English, so if you assume that everyone is going to be using some form of English for their identifier names, you can skip having Unicode in identifiers. So, a natural effect of standardizing on English is that you can stick with ASCII.
I'm fairly sure that any programmer who takes themself seriously will use English, I don't see any reason why this rule should nee to be be implemented by the compiler. The loss I can imagine is that kids, or people from developing countries, etc, may have an additional barrier to learning to code if they don't speak English. Nobody in this set is likely to produce a useful library that will be used widely. Likewise, no sane programmer is going to choose to use a library that's not written in English. You may argue that the keywords and libs are in English. I can attest from personal experience, that a child, or a non-english-speaking beginner probably has absolutely NO IDEA what the keywords mean anyway, even if they do speak English. I certainly had no idea when I was a kid, I just typed them because I figured out what they did. I didn't even know how to say many of them, and realised 5 years later than I was saying all the words wrong... So my point is, why make this restriction as a static compiler rule, when it's not practically going to be broken anyway. You never know, it may actually assist some people somewhere. I think it's a great thing that D can accept identifiers in non-english.
May 27 2013
prev sibling parent "Daniel Murphy" <yebblies nospamgmail.com> writes:
"Manu" <turkeyman gmail.com> wrote in message 
news:mailman.137.1369448229.13711.digitalmars-d puremagic.com...
 One of the first, and best, decisions I made for D was it would be 
 Unicode
 front to back.
Indeed, excellent decision! So when we define operators for u v and a b, or maybe n? ;)
When these have keys on standard keyboards.
May 25 2013
prev sibling parent reply "Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
 One of the first, and best, decisions I made for D was it would 
 be Unicode front to back.
That is why I asked this question here. I think D is still one of the few programming languages with such unicode support.
 This is more a problem with the algorithms taking the easy way 
 than a problem with UTF-8. You can do all the string 
 algorithms, including regex, by working with the UTF-8 directly 
 rather than converting to UTF-32. Then the algorithms work at 
 full speed.
I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also.
 That was the go-to solution in the 1980's, they were called 
 "code pages". A disaster.
My understanding is that code pages were a "disaster" because they weren't standardized and often badly implemented. If you used UCS with a single-byte encoding, you wouldn't have that problem.
 with the few exceptional languages with more than 256
characters encoded in two bytes. Like those rare languages Japanese, Korean, Chinese, etc. This too was done in the 80's with "Shift-JIS" for Japanese, and some other wacky scheme for Korean, and a third nutburger one for Chinese.
Of course, you have to have more than one byte for those languages, because they have more than 256 characters. So there will be no compression gain over UTF-8/16 there, but a big gain in parsing complexity with a simpler encoding, particularly when dealing with multi-language strings.
 I've had the misfortune of supporting all that in the old 
 Zortech C++ compiler. It's AWFUL. If you think it's simpler, 
 all I can say is you've never tried to write internationalized 
 code with it.
Heh, I'm not saying "let's go back to badly defined code pages" because I'm saying "let's go back to single-byte encodings." The two are separate arguments.
 UTF-8 is heavenly in comparison. Your code is automatically 
 internationalized. It's awesome.
At what cost? Most programmers completely punt on unicode, because they just don't want to deal with the complexity. Perhaps you can deal with it and don't mind the performance loss, but I suspect you're in the minority.
May 25 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 12:33 AM, Joakim wrote:
 At what cost?  Most programmers completely punt on unicode, because they just
 don't want to deal with the complexity. Perhaps you can deal with it and don't
 mind the performance loss, but I suspect you're in the minority.
I think you stand alone in your desire to return to code pages. I have years of experience with code pages and the unfixable misery they produce. This has disappeared with Unicode. I find your arguments unpersuasive when stacked against my experience. And yes, I have made a living writing high performance code that deals with characters, and you are quite off base with claims that UTF-8 has inevitable bad performance - though there is inefficient code in Phobos for it, to be sure. My grandfather wrote a book that consists of mixed German, French, and Latin words, using special characters unique to those languages. Another failing of code pages is it fails miserably at any such mixed language text. Unicode handles it with aplomb. I can't even write an email to Rainer Schütze in English under your scheme. Code pages simply are no longer practical nor acceptable for a global community. D is never going to convert to a code page system, and even if it did, there's no way D will ever convince the world to abandon Unicode, and so D would be as useless as EBCDIC. I'm afraid your quest is quixotic.
May 25 2013
next sibling parent reply "Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote:
 I think you stand alone in your desire to return to code pages.
Nobody is talking about going back to code pages. I'm talking about going to single-byte encodings, which do not imply the problems that you had with code pages way back when.
 I have years of experience with code pages and the unfixable 
 misery they produce. This has disappeared with Unicode. I find 
 your arguments unpersuasive when stacked against my experience. 
 And yes, I have made a living writing high performance code 
 that deals with characters, and you are quite off base with 
 claims that UTF-8 has inevitable bad performance - though there 
 is inefficient code in Phobos for it, to be sure.
How can a variable-width encoding possibly compete with a constant-width encoding? You have not articulated a reason for this. Do you believe there is a performance loss with variable-width, but that it is not significant and therefore worth it? Or do you believe it can be implemented with no loss? That is what I asked above, but you did not answer.
 My grandfather wrote a book that consists of mixed German, 
 French, and Latin words, using special characters unique to 
 those languages. Another failing of code pages is it fails 
 miserably at any such mixed language text. Unicode handles it 
 with aplomb.
I see no reason why single-byte encodings wouldn't do a better job at such mixed-language text. You'd just have to have a larger, more complex header or keep all your strings in a single language, with a different format to compose them together for your book. This would be so much easier than UTF-8 that I cannot see how anyone could argue for a variable-length encoding instead.
 I can't even write an email to Rainer Schütze in English under 
 your scheme.
Why not? You seem to think that my scheme doesn't implement multi-language text at all, whereas I pointed out, from the beginning, that it could be trivially done also.
 Code pages simply are no longer practical nor acceptable for a 
 global community. D is never going to convert to a code page 
 system, and even if it did, there's no way D will ever convince 
 the world to abandon Unicode, and so D would be as useless as 
 EBCDIC.
I'm afraid you and others here seem to mentally translate "single-byte encodings" to "code pages" in your head, then recoil in horror as you remember all your problems with broken implementations of code pages, even though those problems are not intrinsic to single-byte encodings. I'm not asking you to consider this for D. I just wanted to discuss why UTF-8 is used at all. I had hoped for some technical evaluations of its merits, but I seem to simply be dredging up a bunch of repressed memories about code pages instead. ;) The world may not "abandon Unicode," but it will abandon UTF-8, because it's a dumb idea. Unfortunately, such dumb ideas- XML anyone?- often proliferate until someone comes up with something better to show how dumb they are. Perhaps it won't be the D programming language that does that, but it would be easy to implement my idea in D, so maybe it will be a D-based library someday. :)
 I'm afraid your quest is quixotic.
I'd argue the opposite, considering most programmers still can't wrap their head around UTF-8. If someone can just get a single-byte encoding implemented and in front of them, I suspect it will be UTF-8 that will be considered quixotic. :D
May 25 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-May-2013 13:05, Joakim пишет:
 On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote:
 I think you stand alone in your desire to return to code pages.
Nobody is talking about going back to code pages. I'm talking about going to single-byte encodings, which do not imply the problems that you had with code pages way back when.
Problem is what you outline is isomorphic with code-pages. Hence the grief of accumulated experience against them.
 Code pages simply are no longer practical nor acceptable for a global
 community. D is never going to convert to a code page system, and even
 if it did, there's no way D will ever convince the world to abandon
 Unicode, and so D would be as useless as EBCDIC.
I'm afraid you and others here seem to mentally translate "single-byte encodings" to "code pages" in your head, then recoil in horror as you remember all your problems with broken implementations of code pages, even though those problems are not intrinsic to single-byte encodings. I'm not asking you to consider this for D. I just wanted to discuss why UTF-8 is used at all. I had hoped for some technical evaluations of its merits, but I seem to simply be dredging up a bunch of repressed memories about code pages instead. ;)
Well if somebody get a quest to redefine UTF-8 they *might* come up with something that is a bit faster to decode but shares the same properties. Hardly a life saver anyway.
 The world may not "abandon Unicode," but it will abandon UTF-8, because
 it's a dumb idea.  Unfortunately, such dumb ideas- XML anyone?- often
 proliferate until someone comes up with something better to show how
 dumb they are.
Even children know XML is awful redundant shit as interchange format. The hierarchical document is a nice idea anyway.
 Perhaps it won't be the D programming language that does
 that, but it would be easy to implement my idea in D, so maybe it will
 be a D-based library someday. :)
Implement Unicode compression scheme - at least that is standardized. -- Dmitry Olshansky
May 25 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, May 25, 2013 01:42:20 Walter Bright wrote:
 On 5/25/2013 12:33 AM, Joakim wrote:
 At what cost?  Most programmers completely punt on unicode, because=
they
 just don't want to deal with the complexity. Perhaps you can deal w=
ith it
 and don't mind the performance loss, but I suspect you're in the
 minority.
=20 I think you stand alone in your desire to return to code pages. I hav=
e years
 of experience with code pages and the unfixable misery they produce. =
This
 has disappeared with Unicode. I find your arguments unpersuasive when=
 stacked against my experience. And yes, I have made a living writing =
high
 performance code that deals with characters, and you are quite off ba=
se
 with claims that UTF-8 has inevitable bad performance - though there =
is
 inefficient code in Phobos for it, to be sure.
=20
 My grandfather wrote a book that consists of mixed German, French, an=
d Latin
 words, using special characters unique to those languages. Another fa=
iling
 of code pages is it fails miserably at any such mixed language text.
 Unicode handles it with aplomb.
=20
 I can't even write an email to Rainer Sch=C3=BCtze in English under y=
our scheme.
=20
 Code pages simply are no longer practical nor acceptable for a global=
 community. D is never going to convert to a code page system, and eve=
n if
 it did, there's no way D will ever convince the world to abandon Unic=
ode,
 and so D would be as useless as EBCDIC.
=20
 I'm afraid your quest is quixotic.
All I've got to say on this subject is "Thank you Walter Bright for bui= lding=20 Unicode into D!" - Jonathan M Davis
May 25 2013
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, May 25, 2013 at 04:14:34PM -0700, Jonathan M Davis wrote:
 On Saturday, May 25, 2013 01:42:20 Walter Bright wrote:
 On 5/25/2013 12:33 AM, Joakim wrote:
 At what cost?  Most programmers completely punt on unicode,
 because they just don't want to deal with the complexity. Perhaps
 you can deal with it and don't mind the performance loss, but I
 suspect you're in the minority.
I think you stand alone in your desire to return to code pages. I have years of experience with code pages and the unfixable misery they produce. This has disappeared with Unicode. I find your arguments unpersuasive when stacked against my experience. And yes, I have made a living writing high performance code that deals with characters, and you are quite off base with claims that UTF-8 has inevitable bad performance - though there is inefficient code in Phobos for it, to be sure. My grandfather wrote a book that consists of mixed German, French, and Latin words, using special characters unique to those languages. Another failing of code pages is it fails miserably at any such mixed language text. Unicode handles it with aplomb.
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 9:48 PM, H. S. Teoh wrote:
 Then came along D with native Unicode support built right into the
 language. And not just UTF-16 shoved down your throat like Java does (or
 was it UTF-32?); UTF-8, UTF-16, and UTF-32 are all equally supported.
 You cannot imagine what a happy camper I was since then!! Yes, Phobos
 still has a ways to go in terms of performance w.r.t. UTF-8 strings, but
 what we have right now is already far, far, superior to the situation in
 C/C++, and things can only get better.
Many moons ago, when the earth was young and I had a few strands of hair left, a C++ programmer challenged me to a "bakeoff", D vs C++. I wrote the program in D (a string processing program). He said "ahaaaa!" and wrote the C++ one. They were fairly comparable. I then suggested we do the internationalized version. I resubmitted exactly the same program. He threw in the towel.
May 25 2013
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
 This is more a problem with the algorithms taking the easy way 
 than a problem with UTF-8. You can do all the string 
 algorithms, including regex, by working with the UTF-8 
 directly rather than converting to UTF-32. Then the algorithms 
 work at full speed.
I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also.
For the record, I noticed that programmers (myself included) that had an incomplete understanding of Unicode / UTF exaggerate this point, and sometimes needlessly assume that their code needs to operate on individual characters (code points), when it is in fact not so - and that code will work just fine as if it was written to handle ASCII. The example Walter quoted (regex - assuming you don't want Unicode ranges or case-insensitivity) is one such case. Another thing I noticed: sometimes when you think you really need to operate on individual characters (and that your code will not be correct unless you do that), the assumption will be incorrect due to the existence of combining characters in Unicode. Two of the often-quoted use cases of working on individual code points is calculating the string width (assuming a fixed-width font), and slicing the string - both of these will break with combining characters if those are not accounted for. I believe the proper way to approach such tasks is to implement the respective Unicode algorithms for it, which I believe are non-trivial and for which the relative impact for the overhead of working with a variable-width encoding is acceptable. Can you post some specific cases where the benefits of a constant-width encoding are obvious and, in your opinion, make constant-width encodings more useful than all the benefits of UTF-8? Also, I don't think this has been posted in this thread. Not sure if it answers your points, though: http://www.utf8everywhere.org/ And here's a simple and correct UTF-8 decoder: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
May 25 2013
next sibling parent reply "Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 08:58:57 UTC, Vladimir Panteleev 
wrote:
 Another thing I noticed: sometimes when you think you really 
 need to operate on individual characters (and that your code 
 will not be correct unless you do that), the assumption will be 
 incorrect due to the existence of combining characters in 
 Unicode. Two of the often-quoted use cases of working on 
 individual code points is calculating the string width 
 (assuming a fixed-width font), and slicing the string - both of 
 these will break with combining characters if those are not 
 accounted for. I believe the proper way to approach such tasks 
 is to implement the respective Unicode algorithms for it, which 
 I believe are non-trivial and for which the relative impact for 
 the overhead of working with a variable-width encoding is 
 acceptable.
Combining characters are examples of complexity baked into the various languages, so there's no way around that. I'm arguing against layering more complexity on top, through UTF-8.
 Can you post some specific cases where the benefits of a 
 constant-width encoding are obvious and, in your opinion, make 
 constant-width encodings more useful than all the benefits of 
 UTF-8?
Let's take one you listed above, slicing a string. You have to either translate your entire string into UTF-32 so it's constant-width, which is apparently what Phobos does, or decode every single UTF-8 character along the way, every single time. A constant-width, single-byte encoding would be much easier to slice, while still using at most half the space.
 Also, I don't think this has been posted in this thread. Not 
 sure if it answers your points, though:

 http://www.utf8everywhere.org/
That seems to be a call to using UTF-8 on Windows, with a lot of info on how best to do so, with little justification for why you'd want to do so in the first place. For example, "Q: But what about performance of text processing algorithms, byte alignment, etc? A: Is it really better with UTF-16? Maybe so." Not exactly a considered analysis of the two. ;)
 And here's a simple and correct UTF-8 decoder:

 http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
You cannot honestly look at those multiple state diagrams and tell me it's "simple." That said, the difficulty of _using_ UTF-8 is a much bigger than problem than implementing a decoder in a library.
May 25 2013
next sibling parent "w0rp" <devw0rp gmail.com> writes:
This is dumb. You are dumb. Go away.
May 25 2013
prev sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 25 May 2013 at 09:40:36 UTC, Joakim wrote:
 Can you post some specific cases where the benefits of a 
 constant-width encoding are obvious and, in your opinion, make 
 constant-width encodings more useful than all the benefits of 
 UTF-8?
Let's take one you listed above, slicing a string. You have to either translate your entire string into UTF-32 so it's constant-width, which is apparently what Phobos does, or decode every single UTF-8 character along the way, every single time. A constant-width, single-byte encoding would be much easier to slice, while still using at most half the space.
You don't need to do that to slice a string. I think you mean to say that you need to decode each character if you want to slice the string at the N-th code point? But this is exactly what I'm trying to point out: how would you find this N? How would you know if it makes sense, taking into account combining characters, and all the other complexities of Unicode? If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless.
 You cannot honestly look at those multiple state diagrams and 
 tell me it's "simple."
I meant that it's simple to implement (and adapt/port to other languages). I would say that UTF-8 is quite cleverly designed, so I wouldn't say it's simple by itself.
May 25 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 10:33:12 UTC, Vladimir Panteleev 
wrote:
 You don't need to do that to slice a string. I think you mean 
 to say that you need to decode each character if you want to 
 slice the string at the N-th code point? But this is exactly 
 what I'm trying to point out: how would you find this N? How 
 would you know if it makes sense, taking into account combining 
 characters, and all the other complexities of Unicode?
Slicing a string implies finding the N-th code point, what other way would you slice and have it make any sense? Finding the N-th point is much simpler with a constant-width encoding. I'm leaving aside combining characters and those intrinsic language complexities baked into unicode in my previous analysis, but if you want to bring those in, that's actually an argument in favor of my encoding. With my encoding, you know up front if you're using languages that have such complexity- just check the header- whereas with a chunk of random UTF-8 text, you cannot ever know that unless you decode the entire string once and extract knowledge of all the languages that are embedded. For another similar example, let's say you want to run toUpper on a multi-language string, which contains English in the first half and some Asian script that doesn't define uppercase in the second half. With my format, toUpper can check the header, then process the English half and skip the Asian half (I'm assuming that the substring indices for each language would be stored in this more complex header). With UTF-8, you have to process the entire string, because you never know what random languages might be packed in there. UTF-8 is riddled with such performance bottlenecks, all to make if self-synchronizing. But is anybody really using its less compact encoding to do some "self-synchronized" integrity checking? I suspect almost nobody is.
 If you want to split a string by ASCII whitespace (newlines, 
 tabs and spaces), it makes no difference whether the string is 
 in ASCII or UTF-8 - the code will behave correctly in either 
 case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding.
 You cannot honestly look at those multiple state diagrams and 
 tell me it's "simple."
I meant that it's simple to implement (and adapt/port to other languages). I would say that UTF-8 is quite cleverly designed, so I wouldn't say it's simple by itself.
Perhaps, maybe decoding is not so bad for the type of people who write the fundamental UTF-8 libraries. But implementation does not merely refer to the UTF-8 libraries, but also all the code that tries to build on it for internationalized apps. And with all the unnecessary additional complexity added by UTF-8, wrapping the average programmer's head around this mess likely leads to as many problems as broken code pages implementations did back in the day. ;)
May 25 2013
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
 If you want to split a string by ASCII whitespace (newlines, 
 tabs and spaces), it makes no difference whether the string is 
 in ASCII or UTF-8 - the code will behave correctly in either 
 case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding.
No. Are you sure you understand UTF-8 properly?
May 25 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev 
wrote:
 On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
 If you want to split a string by ASCII whitespace (newlines, 
 tabs and spaces), it makes no difference whether the string 
 is in ASCII or UTF-8 - the code will behave correctly in 
 either case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding.
No. Are you sure you understand UTF-8 properly?
Are you sure _you_ understand it properly? Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above. Obviously the constant-width encoding will be faster. Did I really need to explain this? On Saturday, 25 May 2013 at 12:43:21 UTC, Andrei Alexandrescu wrote:
 On 5/25/13 3:33 AM, Joakim wrote:
 On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
 This is more a problem with the algorithms taking the easy 
 way than a
 problem with UTF-8. You can do all the string algorithms, 
 including
 regex, by working with the UTF-8 directly rather than 
 converting to
 UTF-32. Then the algorithms work at full speed.
I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also.
You mentioned this a couple of times, and I wonder what makes you so sure. On contemporary architectures small is fast and large is slow; betting on replacing larger data with more computation is quite often a win.
When has small ever been slow and large fast? ;) I'm talking about replacing larger data _and_ more computation, ie UTF-8, with smaller data and less computation, ie single-byte encodings, so it is an unmitigated win in that regard. :)
May 25 2013
next sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
 On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev 
 wrote:
 On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
 If you want to split a string by ASCII whitespace (newlines, 
 tabs and spaces), it makes no difference whether the string 
 is in ASCII or UTF-8 - the code will behave correctly in 
 either case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding.
No. Are you sure you understand UTF-8 properly?
Are you sure _you_ understand it properly? Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above. Obviously the constant-width encoding will be faster. Did I really need to explain this?
I suggest you read up on UTF-8. You really don't understand it. There is no need to decode, you just treat the UTF-8 string as if it is an ASCII string. This code will count all spaces in a string whether it is encoded as ASCII or UTF-8: int countSpaces(const(char)* c) { int n = 0; while (*c) if (*c == ' ') ++n; return n; } I repeat: there is no need to decode. Please read up on UTF-8. You do not understand it. The reason you don't need to decode is because UTF-8 is self-synchronising. The code above tests for spaces only, but it works the same when searching for any substring or single character. It is no slower than fixed-width encoding for these operations. Again, I urge you, please read up on UTF-8. It is very well designed.
May 25 2013
parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
 int countSpaces(const(char)* c)
 {
     int n = 0;
     while (*c)
         if (*c == ' ')
             ++n;
     return n;
 }
Oops. Missing a ++c in there, but I'm sure the point was made :-)
May 25 2013
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
 On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev 
 wrote:
 On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
 If you want to split a string by ASCII whitespace (newlines, 
 tabs and spaces), it makes no difference whether the string 
 is in ASCII or UTF-8 - the code will behave correctly in 
 either case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding.
No. Are you sure you understand UTF-8 properly?
Are you sure _you_ understand it properly? Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above. Obviously the constant-width encoding will be faster. Did I really need to explain this?
It looks like you've missed an important property of UTF-8: lower ASCII remains encoded the same, and UTF-8 code units encoding non-ASCII characters cannot be confused with ASCII characters. Code that does not need Unicode code points can treat UTF-8 strings as ASCII strings, and does not need to decode each character individually - because a 0x20 byte will mean "space" regardless of context. That's why a function that splits a string by ASCII whitespace does NOT need do perform UTF-8 decoding. I hope this clears up the misunderstanding :)
May 25 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 14:18:32 UTC, Vladimir Panteleev 
wrote:
 On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
 Are you sure _you_ understand it properly?  Both encodings 
 have to check every single character to test for whitespace, 
 but the single-byte encoding simply has to load each byte in 
 the string and compare it against the whitespace-signifying 
 bytes, while the variable-length code has to first load and 
 parse potentially 4 bytes before it can compare, because it 
 has to go through the state machine that you linked to above.  
 Obviously the constant-width encoding will be faster.  Did I 
 really need to explain this?
It looks like you've missed an important property of UTF-8: lower ASCII remains encoded the same, and UTF-8 code units encoding non-ASCII characters cannot be confused with ASCII characters. Code that does not need Unicode code points can treat UTF-8 strings as ASCII strings, and does not need to decode each character individually - because a 0x20 byte will mean "space" regardless of context. That's why a function that splits a string by ASCII whitespace does NOT need do perform UTF-8 decoding. I hope this clears up the misunderstanding :)
OK, you got me with this particular special case: it is not necessary to decode every UTF-8 character if you are simply comparing against ASCII space characters. My mixup is because I was unaware if every language used its own space character in UTF-8 or if they reuse the ASCII space character, apparently it's the latter. However, my overall point stands. You still have to check 2-4 times as many bytes if you do it the way Peter suggests, as opposed to a single-byte encoding. There is a shortcut: you could also check the first byte to see if it's ASCII or not and then skip the right number of ensuing bytes in a character's encoding if it isn't ASCII, but at that point you have begun partially decoding the UTF-8 encoding, which you claimed wasn't necessary and which will degrade performance anyway. On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
 I suggest you read up on UTF-8. You really don't understand it. 
 There is no need to decode, you just treat the UTF-8 string as 
 if it is an ASCII string.
Not being aware of this shortcut doesn't mean not understanding UTF-8.
 This code will count all spaces in a string whether it is 
 encoded as ASCII or UTF-8:

 int countSpaces(const(char)* c)
 {
     int n = 0;
     while (*c)
         if (*c == ' ')
             ++n;
     return n;
 }

 I repeat: there is no need to decode. Please read up on UTF-8. 
 You do not understand it. The reason you don't need to decode 
 is because UTF-8 is self-synchronising.
Not quite. The reason you don't need to decode is because of the particular encoding scheme chosen for UTF-8, a side effect of ASCII backwards compatibility and reusing the ASCII space character; it has nothing to do with whether it's self-synchronizing or not.
 The code above tests for spaces only, but it works the same 
 when searching for any substring or single character. It is no 
 slower than fixed-width encoding for these operations.
It doesn't work the same "for any substring or single character," it works the same for any single ASCII character. Of course it's slower than a fixed-width single-byte encoding. You have to check every single byte of a non-ASCII character in UTF-8, whereas a single-byte encoding only has to check a single byte per language character. There is a shortcut if you partially decode the first byte in UTF-8, mentioned above, but you seem dead-set against decoding. ;)
 Again, I urge you, please read up on UTF-8. It is very well 
 designed.
I disagree. It is very badly designed, but the ASCII compatibility does hack in some shortcuts like this, which still don't save its performance.
May 25 2013
parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 25 May 2013 at 14:58:02 UTC, Joakim wrote:
 On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
 I suggest you read up on UTF-8. You really don't understand 
 it. There is no need to decode, you just treat the UTF-8 
 string as if it is an ASCII string.
Not being aware of this shortcut doesn't mean not understanding UTF-8.
It's not just a shortcut, it is absolutely fundamental to the design of UTF-8. It's like saying you understand Lisp without being aware that everything is a list. Also, you continuously keep stating disadvantages to UTF-8 that are completely false, like "slicing does require decoding". Again, completely missing the point of UTF-8. I cannot conceive how you can claim to understand how UTF-8 works yet repeatedly demonstrating that you do not. You are either ignorant or a successful troll. In either case, I'm done here.
May 25 2013
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, May 25, 2013 at 03:47:41PM +0200, Joakim wrote:
 On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote:
On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines,
tabs and spaces), it makes no difference whether the string is
in ASCII or UTF-8 - the code will behave correctly in either
case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding.
No. Are you sure you understand UTF-8 properly?
Are you sure _you_ understand it properly? Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above. Obviously the constant-width encoding will be faster. Did I really need to explain this?
[...] Have you actually tried to write a whitespace splitter for UTF-8? Do you realize that you can use an ASCII whitespace splitter for UTF-8 and it will work correctly? There is no need to decode UTF-8 for whitespace splitting at all. There is no need to parse anything. You just iterate over the bytes and split on 0x20. There is no performance difference over ASCII. As Dmitry said, UTF-8 is self-synchronizing. While current Phobos code tries to play it safe by decoding every character, this is not necessary in many cases. T -- The best compiler is between your ears. -- Michael Abrash
May 25 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-May-2013 12:58, Vladimir Panteleev пишет:
 On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
 This is more a problem with the algorithms taking the easy way than a
 problem with UTF-8. You can do all the string algorithms, including
 regex, by working with the UTF-8 directly rather than converting to
 UTF-32. Then the algorithms work at full speed.
I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also.
For the record, I noticed that programmers (myself included) that had an incomplete understanding of Unicode / UTF exaggerate this point, and sometimes needlessly assume that their code needs to operate on individual characters (code points), when it is in fact not so - and that code will work just fine as if it was written to handle ASCII. The example Walter quoted (regex - assuming you don't want Unicode ranges or case-insensitivity) is one such case.
+1 BTW regex even with Unicode ranges and case-insensitivity is doable just not easy (yet).
 Another thing I noticed: sometimes when you think you really need to
 operate on individual characters (and that your code will not be correct
 unless you do that), the assumption will be incorrect due to the
 existence of combining characters in Unicode. Two of the often-quoted
 use cases of working on individual code points is calculating the string
 width (assuming a fixed-width font), and slicing the string - both of
 these will break with combining characters if those are not accounted
 for.  I believe the proper way to approach such tasks is to implement the
 respective Unicode algorithms for it, which I believe are non-trivial
 and for which the relative impact for the overhead of working with a
 variable-width encoding is acceptable.
Another plus one. Algorithms defined on code point basis are quite complex so that benefit of not decoding won't be that large. The benefit of transparently special-casing ASCII in UTF-8 is far larger.
 Can you post some specific cases where the benefits of a constant-width
 encoding are obvious and, in your opinion, make constant-width encodings
 more useful than all the benefits of UTF-8?

 Also, I don't think this has been posted in this thread. Not sure if it
 answers your points, though:

 http://www.utf8everywhere.org/

 And here's a simple and correct UTF-8 decoder:

 http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
-- Dmitry Olshansky
May 25 2013
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/25/13 3:33 AM, Joakim wrote:
 On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
 This is more a problem with the algorithms taking the easy way than a
 problem with UTF-8. You can do all the string algorithms, including
 regex, by working with the UTF-8 directly rather than converting to
 UTF-32. Then the algorithms work at full speed.
I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also.
You mentioned this a couple of times, and I wonder what makes you so sure. On contemporary architectures small is fast and large is slow; betting on replacing larger data with more computation is quite often a win. Andrei
May 25 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 5:43 AM, Andrei Alexandrescu wrote:
 On 5/25/13 3:33 AM, Joakim wrote:
 On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
 This is more a problem with the algorithms taking the easy way than a
 problem with UTF-8. You can do all the string algorithms, including
 regex, by working with the UTF-8 directly rather than converting to
 UTF-32. Then the algorithms work at full speed.
I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also.
You mentioned this a couple of times, and I wonder what makes you so sure. On contemporary architectures small is fast and large is slow; betting on replacing larger data with more computation is quite often a win.
On the other hand, Joakim even admits his single byte encoding is variable length, as otherwise he simply dismisses the rarely used (!) Chinese, Japanese, and Korean languages, as well as any text that contains words from more than one language. I suspect he's trolling us, and quite successfully.
May 25 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 19:30:25 UTC, Walter Bright wrote:
 On the other hand, Joakim even admits his single byte encoding 
 is variable length, as otherwise he simply dismisses the rarely 
 used (!) Chinese, Japanese, and Korean languages, as well as 
 any text that contains words from more than one language.
I have noted from the beginning that these large alphabets have to be encoded to two bytes, so it is not a true constant-width encoding if you are mixing one of those languages into a single-byte encoded string. But this "variable length" encoding is so much simpler than UTF-8, there's no comparison.
 I suspect he's trolling us, and quite successfully.
Ha, I wondered who would pull out this insult, quite surprised to see it's Walter. It seems to be the trend on the internet to accuse anybody you disagree with of trolling, I am honestly surprised to see Walter stoop so low. Considering I'm the only one making any cogent arguments here, perhaps I should wonder if you're all trolling me. ;) On Saturday, 25 May 2013 at 19:35:42 UTC, Walter Bright wrote:
 I suspect the Chinese, Koreans, and Japanese would take 
 exception to being called irrelevant.
Irrelevant only because they are a small subset of the UCS. I have noted that they would also be handled by a two-byte encoding.
 Good luck with your scheme that can't handle languages written 
 by billions of people!
So let's see: first you say that my scheme has to be variable length because I am using two bytes to handle these languages, then you claim I don't handle these languages. This kind of blatant contradiction within two posts can only be called... trolling!
May 25 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 1:03 PM, Joakim wrote:
 On Saturday, 25 May 2013 at 19:30:25 UTC, Walter Bright wrote:
 On the other hand, Joakim even admits his single byte encoding is variable
 length, as otherwise he simply dismisses the rarely used (!) Chinese,
 Japanese, and Korean languages, as well as any text that contains words from
 more than one language.
I have noted from the beginning that these large alphabets have to be encoded to two bytes, so it is not a true constant-width encoding if you are mixing one of those languages into a single-byte encoded string. But this "variable length" encoding is so much simpler than UTF-8, there's no comparison.
If it's one byte sometimes, or two bytes sometimes, it's variable length. You overlook that I've had to deal with this. It isn't "simpler", there's actually more work to write code that adapts to one or two byte encodings.
 I suspect he's trolling us, and quite successfully.
Ha, I wondered who would pull out this insult, quite surprised to see it's Walter. It seems to be the trend on the internet to accuse anybody you disagree with of trolling, I am honestly surprised to see Walter stoop so low. Considering I'm the only one making any cogent arguments here, perhaps I should wonder if you're all trolling me. ;) On Saturday, 25 May 2013 at 19:35:42 UTC, Walter Bright wrote:
 I suspect the Chinese, Koreans, and Japanese would take exception to being
 called irrelevant.
Irrelevant only because they are a small subset of the UCS. I have noted that they would also be handled by a two-byte encoding.
 Good luck with your scheme that can't handle languages written by billions of
 people!
So let's see: first you say that my scheme has to be variable length because I am using two bytes to handle these languages,
Well, it *is* variable length or you have to disregard Chinese. You cannot have it both ways. Code to deal with two bytes is significantly different than code to deal with one. That means you've got a conditional in your generic code - that isn't going to be faster than the conditional for UTF-8.
 then you claim I don't handle
 these languages.  This kind of blatant contradiction within two posts can only
 be called... trolling!
You gave some vague handwaving about it, and then dismissed it as irrelevant, along with more handwaving about what to do with text that has embedded words in multiple languages. Worse, there are going to be more than 256 of these encodings - you can't even have a byte to specify them. Remember, Unicode has approximately 256,000 characters in it. How many code pages is that? I was being kind saying you were trolling, as otherwise I'd be saying your scheme was, to be blunt, absurd. --------------------------------------- I'll be the first to admit that a lot of great ideas have been initially dismissed by the experts as absurd. If you really believe in this, I recommend that you write it up as a real article, taking care to fill in all the handwaving with something specific, and include some benchmarks to prove your performance claims. Post your article on reddit, stackoverflow, hackernews, etc., and look for fertile ground for it. I'm sorry you're not finding fertile ground here (so far, nobody has agreed with any of your points), and this is the wrong place for such proposals anyway, as D is simply not going to switch over to it. Remember, extraordinary claims require extraordinary evidence, not handwaving and assumptions disguised as bold assertions.
May 25 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
 I have noted from the beginning that these large alphabets 
 have to be encoded to
 two bytes, so it is not a true constant-width encoding if you 
 are mixing one of
 those languages into a single-byte encoded string.  But this 
 "variable length"
 encoding is so much simpler than UTF-8, there's no comparison.
If it's one byte sometimes, or two bytes sometimes, it's variable length. You overlook that I've had to deal with this. It isn't "simpler", there's actually more work to write code that adapts to one or two byte encodings.
It is variable length, with the advantage that only strings containing a few Asian languages are variable-length, as opposed to UTF-8 having every non-English language string be variable-length. It may be more work to write library code to handle my encoding, perhaps, but efficiency and ease of use are paramount.
 So let's see: first you say that my scheme has to be variable 
 length because I
 am using two bytes to handle these languages,
Well, it *is* variable length or you have to disregard Chinese. You cannot have it both ways. Code to deal with two bytes is significantly different than code to deal with one. That means you've got a conditional in your generic code - that isn't going to be faster than the conditional for UTF-8.
Hah, I have explicitly said several times that I'd use a two-byte encoding for Chinese and I already acknowledged that such a predominantly single-byte encoding is still variable-length. The problem is that _you_ try to have it both ways: first you claimed it is variable-length because I support Chinese that way, then you claimed I don't support Chinese. Yes, there will be conditionals, just as there are several conditionals in phobos depending on whether a language supports uppercase or not. The question is whether the conditionals for single-byte encoding will execute faster than decoding every UTF-8 character. This is a matter of engineering judgement, I see no reason why you think decoding every UTF-8 character is faster.
 then you claim I don't handle
 these languages.  This kind of blatant contradiction within 
 two posts can only
 be called... trolling!
You gave some vague handwaving about it, and then dismissed it as irrelevant, along with more handwaving about what to do with text that has embedded words in multiple languages.
If it was mere "vague handwaving," how did you know I planned to use two bytes to encode Chinese? I'm not sure why you're continuing along this contradictory path. I didn't "handwave" about multi-language strings, I gave specific ideas about how they might be implemented. I'm not claiming to have a bullet-proof and detailed single-byte encoding spec, just spitballing some ideas on how to do it better than the abominable UTF-8.
 Worse, there are going to be more than 256 of these encodings - 
 you can't even have a byte to specify them. Remember, Unicode 
 has approximately 256,000 characters in it. How many code pages 
 is that?
There are 72 modern scripts in Unicode 6.1, 28 ancient scripts, maybe another 50 symbolic sets. That leaves space for another 100 or so new scripts. Maybe you are so worried about future-proofing that you'd use two bytes to signify the alphabet, but I wouldn't. I think it's more likely that we'll ditch scripts than add them. ;) Most of those symbol sets should not be in UCS.
 I was being kind saying you were trolling, as otherwise I'd be 
 saying your scheme was, to be blunt, absurd.
I think it's absurd to use a self-synchronizing text encoding from 20 years ago, that is really only useful when streaming text, which nobody does today. There may have been a time when ASCII compatibility was paramount, when nobody cared about internationalization and almost all libraries only took ASCII input: that is not the case today.
 I'll be the first to admit that a lot of great ideas have been 
 initially dismissed by the experts as absurd. If you really 
 believe in this, I recommend that you write it up as a real 
 article, taking care to fill in all the handwaving with 
 something specific, and include some benchmarks to prove your 
 performance claims. Post your article on reddit, stackoverflow, 
 hackernews, etc., and look for fertile ground for it. I'm sorry 
 you're not finding fertile ground here (so far, nobody has 
 agreed with any of your points), and this is the wrong place 
 for such proposals anyway, as D is simply not going to switch 
 over to it.
Let me admit in return that I might be completely wrong about my single-byte encoding representing a step forward from UTF-8. While this argument has produced no argument that I'm wrong, it's possible we've all missed something salient, some deal-breaker. As I said before, I'm not proposing that D "switch over." I was simply asking people who know or at the very least use UTF-8 more than most, as a result of employing one of the few languages with Unicode support baked in, why they think UTF-8 is a good idea. I was hoping for a technical discussion on the merits, before I went ahead and implemented this single-byte encoding. Since nobody has been able to point out a reason for why my encoding wouldn't be much better than UTF-8, I see no reason not to go forward with my implementation. I may write something up after implementation: most people don't care about ideas, only results, to the point where almost nobody can reason at all about ideas.
 Remember, extraordinary claims require extraordinary evidence, 
 not handwaving and assumptions disguised as bold assertions.
I don't think my claims are extraordinary or backed by "handwaving and assumptions." Some people can reason about such possible encodings, even in the incomplete form I've sketched out, without having implemented them, if they know what they're doing. On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
 On 5/25/2013 2:51 PM, Walter Bright wrote:
 On 5/25/2013 12:51 PM, Joakim wrote:
 For a multi-language string encoding, the header would
 contain a single byte for every language used in the string, 
 along with multiple
 index bytes to signify the start and finish of every run of 
 single-language
 characters in the string. So, a list of languages and a list 
 of pure
 single-language substrings.
Please implement the simple C function strstr() with this simple scheme, and post it here. http://www.digitalmars.com/rtl/string.html#strstr
I'll go first. Here's a simple UTF-8 version in C. It's not the fastest way to do it, but at least it is correct: ---------------------------------- char *strstr(const char *s1,const char *s2) { size_t len1 = strlen(s1); size_t len2 = strlen(s2); if (!len2) return (char *) s1; char c2 = *s2; while (len2 <= len1) { if (c2 == *s1) if (memcmp(s2,s1,len2) == 0) return (char *) s1; s1++; len1--; } return NULL; }
There is no question that a UTF-8 implementation of strstr can be simpler to write in C and D for multi-language strings that include Korean/Chinese/Japanese. But while the strstr implementation for my encoding would contain more conditionals and lines of code, it would be far more efficient. For instance, because you know where all the language substrings are from the header, you can potentially rule out searching vast swathes of the string, because they don't contain the same languages or lengths as the string you're searching for. Even if you're searching a single-language string, which won't have those speedups, your naive implementation checks every byte, even continuation bytes, in UTF-8 to see if they might match the first letter of the search string, even though no continuation byte will match. You can avoid this by partially decoding the leading bytes of UTF-8 characters and skipping over continuation bytes, as I've mentioned earlier in this thread, but you've then added more lines of code to your pretty yet simple function and added decoding overhead to every iteration of the while loop. My single-byte encoding has none of these problems, in fact, it's much faster and uses less memory for the same function, while providing additional speedups, from the header, that are not available to UTF-8. Finally, being able to write simple yet inefficient functions like this is not the test of a good encoding, as strstr is a library function, and making library developers' lives easier is a low priority for any good format. The primary goals are ease of use for library consumers, ie app developers, and speed and efficiency of the code. You are trading on the latter two for the former with this implementation. That is not a good tradeoff. Perhaps it was a good trade 20 years ago when everyone rolled their own code and nobody bothered waiting for those floppy disks to arrive with expensive library code. It is not a good trade today.
May 26 2013
next sibling parent "Declan" <oyscal 163.com> writes:
On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
 On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
 I have noted from the beginning that these large alphabets 
 have to be encoded to
 two bytes, so it is not a true constant-width encoding if you 
 are mixing one of
 those languages into a single-byte encoded string.  But this 
 "variable length"
 encoding is so much simpler than UTF-8, there's no comparison.
If it's one byte sometimes, or two bytes sometimes, it's variable length. You overlook that I've had to deal with this. It isn't "simpler", there's actually more work to write code that adapts to one or two byte encodings.
It is variable length, with the advantage that only strings containing a few Asian languages are variable-length, as opposed to UTF-8 having every non-English language string be variable-length. It may be more work to write library code to handle my encoding, perhaps, but efficiency and ease of use are paramount.
 So let's see: first you say that my scheme has to be variable 
 length because I
 am using two bytes to handle these languages,
Well, it *is* variable length or you have to disregard Chinese. You cannot have it both ways. Code to deal with two bytes is significantly different than code to deal with one. That means you've got a conditional in your generic code - that isn't going to be faster than the conditional for UTF-8.
Hah, I have explicitly said several times that I'd use a two-byte encoding for Chinese and I already acknowledged that such a predominantly single-byte encoding is still variable-length. The problem is that _you_ try to have it both ways: first you claimed it is variable-length because I support Chinese that way, then you claimed I don't support Chinese. Yes, there will be conditionals, just as there are several conditionals in phobos depending on whether a language supports uppercase or not. The question is whether the conditionals for single-byte encoding will execute faster than decoding every UTF-8 character. This is a matter of engineering judgement, I see no reason why you think decoding every UTF-8 character is faster.
 then you claim I don't handle
 these languages.  This kind of blatant contradiction within 
 two posts can only
 be called... trolling!
You gave some vague handwaving about it, and then dismissed it as irrelevant, along with more handwaving about what to do with text that has embedded words in multiple languages.
If it was mere "vague handwaving," how did you know I planned to use two bytes to encode Chinese? I'm not sure why you're continuing along this contradictory path. I didn't "handwave" about multi-language strings, I gave specific ideas about how they might be implemented. I'm not claiming to have a bullet-proof and detailed single-byte encoding spec, just spitballing some ideas on how to do it better than the abominable UTF-8.
 Worse, there are going to be more than 256 of these encodings 
 - you can't even have a byte to specify them. Remember, 
 Unicode has approximately 256,000 characters in it. How many 
 code pages is that?
There are 72 modern scripts in Unicode 6.1, 28 ancient scripts, maybe another 50 symbolic sets. That leaves space for another 100 or so new scripts. Maybe you are so worried about future-proofing that you'd use two bytes to signify the alphabet, but I wouldn't. I think it's more likely that we'll ditch scripts than add them. ;) Most of those symbol sets should not be in UCS.
 I was being kind saying you were trolling, as otherwise I'd be 
 saying your scheme was, to be blunt, absurd.
I think it's absurd to use a self-synchronizing text encoding from 20 years ago, that is really only useful when streaming text, which nobody does today. There may have been a time when ASCII compatibility was paramount, when nobody cared about internationalization and almost all libraries only took ASCII input: that is not the case today.
 I'll be the first to admit that a lot of great ideas have been 
 initially dismissed by the experts as absurd. If you really 
 believe in this, I recommend that you write it up as a real 
 article, taking care to fill in all the handwaving with 
 something specific, and include some benchmarks to prove your 
 performance claims. Post your article on reddit, 
 stackoverflow, hackernews, etc., and look for fertile ground 
 for it. I'm sorry you're not finding fertile ground here (so 
 far, nobody has agreed with any of your points), and this is 
 the wrong place for such proposals anyway, as D is simply not 
 going to switch over to it.
Let me admit in return that I might be completely wrong about my single-byte encoding representing a step forward from UTF-8. While this argument has produced no argument that I'm wrong, it's possible we've all missed something salient, some deal-breaker. As I said before, I'm not proposing that D "switch over." I was simply asking people who know or at the very least use UTF-8 more than most, as a result of employing one of the few languages with Unicode support baked in, why they think UTF-8 is a good idea. I was hoping for a technical discussion on the merits, before I went ahead and implemented this single-byte encoding. Since nobody has been able to point out a reason for why my encoding wouldn't be much better than UTF-8, I see no reason not to go forward with my implementation. I may write something up after implementation: most people don't care about ideas, only results, to the point where almost nobody can reason at all about ideas.
 Remember, extraordinary claims require extraordinary evidence, 
 not handwaving and assumptions disguised as bold assertions.
I don't think my claims are extraordinary or backed by "handwaving and assumptions." Some people can reason about such possible encodings, even in the incomplete form I've sketched out, without having implemented them, if they know what they're doing. On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
 On 5/25/2013 2:51 PM, Walter Bright wrote:
 On 5/25/2013 12:51 PM, Joakim wrote:
 For a multi-language string encoding, the header would
 contain a single byte for every language used in the string, 
 along with multiple
 index bytes to signify the start and finish of every run of 
 single-language
 characters in the string. So, a list of languages and a list 
 of pure
 single-language substrings.
Please implement the simple C function strstr() with this simple scheme, and post it here. http://www.digitalmars.com/rtl/string.html#strstr
I'll go first. Here's a simple UTF-8 version in C. It's not the fastest way to do it, but at least it is correct: ---------------------------------- char *strstr(const char *s1,const char *s2) { size_t len1 = strlen(s1); size_t len2 = strlen(s2); if (!len2) return (char *) s1; char c2 = *s2; while (len2 <= len1) { if (c2 == *s1) if (memcmp(s2,s1,len2) == 0) return (char *) s1; s1++; len1--; } return NULL; }
There is no question that a UTF-8 implementation of strstr can be simpler to write in C and D for multi-language strings that include Korean/Chinese/Japanese. But while the strstr implementation for my encoding would contain more conditionals and lines of code, it would be far more efficient. For instance, because you know where all the language substrings are from the header, you can potentially rule out searching vast swathes of the string, because they don't contain the same languages or lengths as the string you're searching for. Even if you're searching a single-language string, which won't have those speedups, your naive implementation checks every byte, even continuation bytes, in UTF-8 to see if they might match the first letter of the search string, even though no continuation byte will match. You can avoid this by partially decoding the leading bytes of UTF-8 characters and skipping over continuation bytes, as I've mentioned earlier in this thread, but you've then added more lines of code to your pretty yet simple function and added decoding overhead to every iteration of the while loop. My single-byte encoding has none of these problems, in fact, it's much faster and uses less memory for the same function, while providing additional speedups, from the header, that are not available to UTF-8. Finally, being able to write simple yet inefficient functions like this is not the test of a good encoding, as strstr is a library function, and making library developers' lives easier is a low priority for any good format. The primary goals are ease of use for library consumers, ie app developers, and speed and efficiency of the code. You are trading on the latter two for the former with this implementation. That is not a good tradeoff. Perhaps it was a good trade 20 years ago when everyone rolled their own code and nobody bothered waiting for those floppy disks to arrive with expensive library code. It is not a good trade today.
I服了u,I'm thinking of your name means joking?
May 26 2013
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
 On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
 I have noted from the beginning that these large alphabets 
 have to be encoded to
 two bytes, so it is not a true constant-width encoding if you 
 are mixing one of
 those languages into a single-byte encoded string.  But this 
 "variable length"
 encoding is so much simpler than UTF-8, there's no comparison.
If it's one byte sometimes, or two bytes sometimes, it's variable length. You overlook that I've had to deal with this. It isn't "simpler", there's actually more work to write code that adapts to one or two byte encodings.
It is variable length, with the advantage that only strings containing a few Asian languages are variable-length, as opposed to UTF-8 having every non-English language string be variable-length. It may be more work to write library code to handle my encoding, perhaps, but efficiency and ease of use are paramount.
 So let's see: first you say that my scheme has to be variable 
 length because I
 am using two bytes to handle these languages,
Well, it *is* variable length or you have to disregard Chinese. You cannot have it both ways. Code to deal with two bytes is significantly different than code to deal with one. That means you've got a conditional in your generic code - that isn't going to be faster than the conditional for UTF-8.
Hah, I have explicitly said several times that I'd use a two-byte encoding for Chinese and I already acknowledged that such a predominantly single-byte encoding is still variable-length. The problem is that _you_ try to have it both ways: first you claimed it is variable-length because I support Chinese that way, then you claimed I don't support Chinese. Yes, there will be conditionals, just as there are several conditionals in phobos depending on whether a language supports uppercase or not. The question is whether the conditionals for single-byte encoding will execute faster than decoding every UTF-8 character. This is a matter of engineering judgement, I see no reason why you think decoding every UTF-8 character is faster.
 then you claim I don't handle
 these languages.  This kind of blatant contradiction within 
 two posts can only
 be called... trolling!
You gave some vague handwaving about it, and then dismissed it as irrelevant, along with more handwaving about what to do with text that has embedded words in multiple languages.
If it was mere "vague handwaving," how did you know I planned to use two bytes to encode Chinese? I'm not sure why you're continuing along this contradictory path. I didn't "handwave" about multi-language strings, I gave specific ideas about how they might be implemented. I'm not claiming to have a bullet-proof and detailed single-byte encoding spec, just spitballing some ideas on how to do it better than the abominable UTF-8.
 Worse, there are going to be more than 256 of these encodings 
 - you can't even have a byte to specify them. Remember, 
 Unicode has approximately 256,000 characters in it. How many 
 code pages is that?
There are 72 modern scripts in Unicode 6.1, 28 ancient scripts, maybe another 50 symbolic sets. That leaves space for another 100 or so new scripts. Maybe you are so worried about future-proofing that you'd use two bytes to signify the alphabet, but I wouldn't. I think it's more likely that we'll ditch scripts than add them. ;) Most of those symbol sets should not be in UCS.
 I was being kind saying you were trolling, as otherwise I'd be 
 saying your scheme was, to be blunt, absurd.
I think it's absurd to use a self-synchronizing text encoding from 20 years ago, that is really only useful when streaming text, which nobody does today. There may have been a time when ASCII compatibility was paramount, when nobody cared about internationalization and almost all libraries only took ASCII input: that is not the case today.
 I'll be the first to admit that a lot of great ideas have been 
 initially dismissed by the experts as absurd. If you really 
 believe in this, I recommend that you write it up as a real 
 article, taking care to fill in all the handwaving with 
 something specific, and include some benchmarks to prove your 
 performance claims. Post your article on reddit, 
 stackoverflow, hackernews, etc., and look for fertile ground 
 for it. I'm sorry you're not finding fertile ground here (so 
 far, nobody has agreed with any of your points), and this is 
 the wrong place for such proposals anyway, as D is simply not 
 going to switch over to it.
Let me admit in return that I might be completely wrong about my single-byte encoding representing a step forward from UTF-8. While this argument has produced no argument that I'm wrong, it's possible we've all missed something salient, some deal-breaker. As I said before, I'm not proposing that D "switch over." I was simply asking people who know or at the very least use UTF-8 more than most, as a result of employing one of the few languages with Unicode support baked in, why they think UTF-8 is a good idea. I was hoping for a technical discussion on the merits, before I went ahead and implemented this single-byte encoding. Since nobody has been able to point out a reason for why my encoding wouldn't be much better than UTF-8, I see no reason not to go forward with my implementation. I may write something up after implementation: most people don't care about ideas, only results, to the point where almost nobody can reason at all about ideas.
 Remember, extraordinary claims require extraordinary evidence, 
 not handwaving and assumptions disguised as bold assertions.
I don't think my claims are extraordinary or backed by "handwaving and assumptions." Some people can reason about such possible encodings, even in the incomplete form I've sketched out, without having implemented them, if they know what they're doing. On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
 On 5/25/2013 2:51 PM, Walter Bright wrote:
 On 5/25/2013 12:51 PM, Joakim wrote:
 For a multi-language string encoding, the header would
 contain a single byte for every language used in the string, 
 along with multiple
 index bytes to signify the start and finish of every run of 
 single-language
 characters in the string. So, a list of languages and a list 
 of pure
 single-language substrings.
Please implement the simple C function strstr() with this simple scheme, and post it here. http://www.digitalmars.com/rtl/string.html#strstr
I'll go first. Here's a simple UTF-8 version in C. It's not the fastest way to do it, but at least it is correct: ---------------------------------- char *strstr(const char *s1,const char *s2) { size_t len1 = strlen(s1); size_t len2 = strlen(s2); if (!len2) return (char *) s1; char c2 = *s2; while (len2 <= len1) { if (c2 == *s1) if (memcmp(s2,s1,len2) == 0) return (char *) s1; s1++; len1--; } return NULL; }
There is no question that a UTF-8 implementation of strstr can be simpler to write in C and D for multi-language strings that include Korean/Chinese/Japanese. But while the strstr implementation for my encoding would contain more conditionals and lines of code, it would be far more efficient. For instance, because you know where all the language substrings are from the header, you can potentially rule out searching vast swathes of the string, because they don't contain the same languages or lengths as the string you're searching for. Even if you're searching a single-language string, which won't have those speedups, your naive implementation checks every byte, even continuation bytes, in UTF-8 to see if they might match the first letter of the search string, even though no continuation byte will match. You can avoid this by partially decoding the leading bytes of UTF-8 characters and skipping over continuation bytes, as I've mentioned earlier in this thread, but you've then added more lines of code to your pretty yet simple function and added decoding overhead to every iteration of the while loop. My single-byte encoding has none of these problems, in fact, it's much faster and uses less memory for the same function, while providing additional speedups, from the header, that are not available to UTF-8. Finally, being able to write simple yet inefficient functions like this is not the test of a good encoding, as strstr is a library function, and making library developers' lives easier is a low priority for any good format. The primary goals are ease of use for library consumers, ie app developers, and speed and efficiency of the code. You are trading on the latter two for the former with this implementation. That is not a good tradeoff. Perhaps it was a good trade 20 years ago when everyone rolled their own code and nobody bothered waiting for those floppy disks to arrive with expensive library code. It is not a good trade today.
I suggest you make an attempt at writing strstr and post it. Code speaks louder than words.
May 26 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/26/2013 4:31 AM, Joakim wrote:
 My single-byte encoding has none of these problems, in fact, it's much faster
 and uses less memory for the same function, while providing additional
speedups,
 from the header, that are not available to UTF-8.
C'mon, Joakim, show us this amazing strstr() implementation for your scheme! http://www.youtube.com/watch?v=dhRUe-gz690
May 26 2013
parent "Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 12:55:11 UTC, Walter Bright wrote:
 On 5/26/2013 4:31 AM, Joakim wrote:
 My single-byte encoding has none of these problems, in fact, 
 it's much faster
 and uses less memory for the same function, while providing 
 additional speedups,
 from the header, that are not available to UTF-8.
C'mon, Joakim, show us this amazing strstr() implementation for your scheme!
You will see it when it's built into a fully working single-byte encoding implementation. I don't write toy code, particularly inefficient functions like yours, for the reasons given, which seem to have gone over your head.
 http://www.youtube.com/watch?v=dhRUe-gz690
Heh, never seen that sketch before. Never understood why anyone likes this silly Monty Python stuff, from what little I've seen.
May 26 2013
prev sibling parent "Diggory" <diggsey googlemail.com> writes:
On Saturday, 25 May 2013 at 20:03:59 UTC, Joakim wrote:
 I have noted from the beginning that these large alphabets have 
 to be encoded to two bytes, so it is not a true constant-width 
 encoding if you are mixing one of those languages into a 
 single-byte encoded string.  But this "variable length" 
 encoding is so much simpler than UTF-8, there's no comparison.
All I can say is if you think that is simpler than UTF-8 then you have completely the wrong idea about UTF-8. Let me explain: 1) Take the byte at a particular offset in the string 2) If it is ASCII then we're done 3) Otherwise count the number of '1's at the start of the byte - this is how many bytes make up the character (there's even an ASM instruction to do this) 4) This first byte will look like '1110xxxx' for a 3 byte character, '11110xxx' for a 4 byte character, etc. 5) All following bytes are of the form '10xxxxxx' 6) Now just concatenate all the 'x's together and add an offset to get the code point Note that this is CONSTANT TIME, O(1) with minimal branching so well suited to pipelining (after the initial byte the other bytes can all be processed in parallel by the CPU) and only sequential memory access so no cache misses, and zero additional memory requirements Now compare your encoding: 1) Look up the offset in the header using binary search: O(log N) lots of branching 2) Look up the code page ID in a massive array of code pages to work out how many bytes per character 3) Hope this array hasn't been paged out and is still in the cache 4) Extract that many bytes from the string and combine them into a number 5) Look up this new number in yet another large array specific to the code page 6) Hope this array hasn't been paged out and is still in the cache too This is O(log N) has lots of branching so no pipelining (every stage depends on the result of the stage before), lots of random memory access so lots of cache misses, lots of additional memory requirements to store all those tables, and an algorithm that isn't even any easier to understand. Plus every other algorithm to operate on it except for decoding is insanely complicated.
May 25 2013
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-May-2013 02:42, H. S. Teoh пишет:
 On Sat, May 25, 2013 at 01:21:25AM +0400, Dmitry Olshansky wrote:
 24-May-2013 21:05, Joakim пишет:
[...]
 As far as Phobos is concerned, Dmitry's new std.uni module has powerful
 code-generation templates that let you write code that operate directly
 on UTF-8 without needing to convert to UTF-32 first.
As is there are no UTF-8 specific tables (yet), but there are tools to create the required abstraction by hand. I plan to grow one for std.regex that will thus be field-tested and then get into public interface. In fact the needs of std.regex prompted me to provide more Unicode stuff in the std.
 Well, OK, maybe
 we're not quite there yet, but the foundations are in place, and I'm
 looking forward to the day when string functions will no longer have
 implicit conversion to UTF-32, but will directly manipulate UTF-8 using
 optimized state tables generated by std.uni.
Yup, but let's get the correctness part first, then performance ;)
 Want small - use compression schemes which are perfectly fine and
 get to the precious 1byte per codepoint with exceptional speed.
 http://www.unicode.org/reports/tr6/
+1. Using your own encoding is perfectly fine. Just don't do that for data interchange. Unicode was created because we *want* a single standard to communicate with each other without stupid broken encoding issues that used to be rampant on the web before Unicode came along.
BTW the document linked discusses _standard_ compression so that anybody can decode that stuff. How you compress would largely affect the compression ratio but not much beyond it..
 In the bad ole days, HTML could be served in any random number of
 encodings, often out-of-sync with what the server claims the encoding
 is, and browsers would assume arbitrary default encodings that for the
 most part *appeared* to work but are actually fundamentally b0rken.
 Sometimes webpages would show up mostly-intact, but with a few
 characters mangled, because of deviations / variations on codepage
 interpretation, or non-standard characters being used in a particular
 encoding. It was a total, utter mess, that wasted who knows how many
 man-hours of programming time to work around. For data interchange on
 the internet, we NEED a universal standard that everyone can agree on.
+1 on these and others :) -- Dmitry Olshansky
May 24 2013
prev sibling parent reply "Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 22:44:24 UTC, H. S. Teoh wrote:
 I remember those bad ole days of gratuitously-incompatible 
 encodings. I
 wish those days will never ever return again. You'd get a text 
 file in
 some unknown encoding, and the only way to make any sense of it 
 was to
 guess what encoding it might be and hope you get lucky. Not 
 only so, the
 same language often has multiple encodings, so adding support 
 for a
 single new language required supporting several new encodings 
 and being
 able to tell them apart (often with no info on which they are, 
 if you're
 lucky, or if you're unlucky, with *wrong* encoding type specs 
 -- for
 example, I *still* get email from outdated systems that claim 
 to be
 iso-8859 when it's actually KOI8R).
This is an argument for UCS, not UTF-8.
 Prepending the encoding to the data doesn't help, because it's 
 pretty
 much guaranteed somebody will cut-n-paste some segment of that 
 data and
 save it without the encoding type header (or worse, some 
 program will
 try to "fix" broken low-level code by prepending a default 
 encoding type
 to everything, regardless of whether it's actually in that 
 encoding or
 not), thus ensuring nobody will be able to reliably recognize 
 what
 encoding it is down the road.
This problem already exists for UTF-8, breaking ASCII compatibility in the process: http://en.wikipedia.org/wiki/Byte_order_mark Well, at the very least adding garbage ASCII data in the front, just as my header would do. ;)
 For all of its warts, Unicode fixed a WHOLE bunch of these 
 problems, and
 made cross-linguistic data sane to handle without pulling out 
 your hair,
 many times over.  And now we're trying to go back to that 
 nightmarish
 old world again? No way, José!
No, I'm suggesting going back to one element of that "old world," single-byte encodings, but using UCS or some other standardized character set to avoid all those incompatible code pages you had to deal with.
 If you're really concerned about encoding size, just use a 
 compression
 library -- they're readily available these days. Internally, 
 the program
 can just use UTF-16 for the most part -- UTF-32 is really only 
 necessary
 if you're routinely delving outside BMP, which is very rare.
True, but you're still doubling your string size with UTF-16 and non-ASCII text. My concerns are the following, in order of importance: 1. Lost programmer productivity due to these dumb variable-length encodings. That is the biggest loss from UTF-8's complexity. 2. Lost speed and memory due to using either an unnecessarily complex variable-length encoding or because you translated everything to 32-bit UTF-32 to get back to constant-width. 3. Lost bandwidth from using a fatter encoding.
 As far as Phobos is concerned, Dmitry's new std.uni module has 
 powerful
 code-generation templates that let you write code that operate 
 directly
 on UTF-8 without needing to convert to UTF-32 first. Well, OK, 
 maybe
 we're not quite there yet, but the foundations are in place, 
 and I'm
 looking forward to the day when string functions will no longer 
 have
 implicit conversion to UTF-32, but will directly manipulate 
 UTF-8 using
 optimized state tables generated by std.uni.
There is no way this can ever be as performant as a constant-width single-byte encoding.
 +1.  Using your own encoding is perfectly fine. Just don't do 
 that for
 data interchange. Unicode was created because we *want* a single
 standard to communicate with each other without stupid broken 
 encoding
 issues that used to be rampant on the web before Unicode came 
 along.

 In the bad ole days, HTML could be served in any random number 
 of
 encodings, often out-of-sync with what the server claims the 
 encoding
 is, and browsers would assume arbitrary default encodings that 
 for the
 most part *appeared* to work but are actually fundamentally 
 b0rken.
 Sometimes webpages would show up mostly-intact, but with a few
 characters mangled, because of deviations / variations on 
 codepage
 interpretation, or non-standard characters being used in a 
 particular
 encoding. It was a total, utter mess, that wasted who knows how 
 many
 man-hours of programming time to work around. For data 
 interchange on
 the internet, we NEED a universal standard that everyone can 
 agree on.
I disagree. This is not an indictment of multiple encodings, it is one of multiple unspecified or _broken_ encodings. Given how difficult UTF-8 is to get right, all you've likely done is replace multiple broken encodings with a single encoding with multiple broken implementations.
 UTF-8, for all its flaws, is remarkably resilient to mangling 
 -- you can
 cut-n-paste any byte sequence and the receiving end can still 
 make some
 sense of it.  Not like the bad old days of codepages where you 
 just get
 one gigantic block of gibberish. A properly-synchronizing UTF-8 
 function
 can still recover legible data, maybe with only a few 
 characters at the
 ends truncated in the worst case. I don't see how any 
 codepage-based
 encoding is an improvement over this.
Have you ever used this self-synchronizing future of UTF-8? Have you ever heard of anyone using it? There is no reason why this kind of limited checking of data integrity should be rolled into the encoding. Maybe this made sense two decades ago when everyone had plans to stream text or something, but nobody does that nowadays. Just put a checksum in your header and you're good to go. Unicode is still a "codepage-based encoding," nothing has changed in that regard. All UCS did is standardize a bunch of pre-existing code pages, so that some of the redundancy was taken out. Unfortunately, the UTF-8 encoding then bloated the transmission format and tempted devs to use this unnecessarily complex format for processing too.
May 25 2013
parent reply "Diggory" <diggsey googlemail.com> writes:
I think you are a little confused about what unicode actually 
is... Unicode has nothing to do with code pages and nobody uses 
code pages any more except for compatibility with legacy 
applications (with good reason!).

Unicode is:
1) A standardised numbering of a large number of characters
2) A set of standardised algorithms for operating on these 
characters
3) A set of standardised encodings for efficiently encoding 
sequences of these characters

You said that phobos converts UTF-8 strings to UTF-32 before 
operating on them but that's not true. As it iterates over UTF-8 
strings it iterates over dchars rather than chars, but that's not 
in any way inefficient so I don't really see the problem.

Also your complaint that UTF-8 reserves the short characters for 
the english alphabet is not really relevant - the characters with 
longer encodings tend to be rarer (such as special symbols) or 
carry more information (such as chinese characters where the same 
sentence takes only about 1/3 the number of characters).
May 25 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
 I think you are a little confused about what unicode actually 
 is... Unicode has nothing to do with code pages and nobody uses 
 code pages any more except for compatibility with legacy 
 applications (with good reason!).
Incorrect. "Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes... In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode." http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
 Unicode is:
 1) A standardised numbering of a large number of characters
 2) A set of standardised algorithms for operating on these 
 characters
 3) A set of standardised encodings for efficiently encoding 
 sequences of these characters
What makes you think I'm unaware of this? I have repeatedly differentiated between UCS (1) and UTF-8 (3).
 You said that phobos converts UTF-8 strings to UTF-32 before 
 operating on them but that's not true. As it iterates over 
 UTF-8 strings it iterates over dchars rather than chars, but 
 that's not in any way inefficient so I don't really see the 
 problem.
And what's a dchar? Let's check: dchar : unsigned 32 bit UTF-32 http://dlang.org/type.html Of course that's inefficient, you are translating your whole encoding over to a 32-bit encoding every time you need to process it. Walter as much as said so up above.
 Also your complaint that UTF-8 reserves the short characters 
 for the english alphabet is not really relevant - the 
 characters with longer encodings tend to be rarer (such as 
 special symbols) or carry more information (such as chinese 
 characters where the same sentence takes only about 1/3 the 
 number of characters).
The vast majority of non-english alphabets in UCS can be encoded in a single byte. It is your exceptions that are not relevant.
May 25 2013
next sibling parent reply "Diggory" <diggsey googlemail.com> writes:
On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
 On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
 I think you are a little confused about what unicode actually 
 is... Unicode has nothing to do with code pages and nobody 
 uses code pages any more except for compatibility with legacy 
 applications (with good reason!).
Incorrect. "Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes... In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode." http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
That confirms exactly what I just said...
 You said that phobos converts UTF-8 strings to UTF-32 before 
 operating on them but that's not true. As it iterates over 
 UTF-8 strings it iterates over dchars rather than chars, but 
 that's not in any way inefficient so I don't really see the 
 problem.
And what's a dchar? Let's check: dchar : unsigned 32 bit UTF-32 http://dlang.org/type.html Of course that's inefficient, you are translating your whole encoding over to a 32-bit encoding every time you need to process it. Walter as much as said so up above.
Given that all the machine registers are at least 32-bits already it doesn't make the slightest difference. The only additional operations on top of ascii are when it's a multi-byte character, and even then it's some simple bit manipulation which is as fast as any variable width encoding is going to get. The only alternatives to a variable width encoding I can see are: - Single code page per string This is completely useless because now you can't concatenate strings of different code pages. - Multiple code pages per string This just makes everything overly complicated and is far slower to decode what the actual character is than UTF-8. - String with escape sequences to change code page Can no longer access characters in the middle or end of the string, you have to parse the entire string every time which completely negates the benefit of a fixed width encoding. - An encoding wide enough to store every character This is just UTF-32.
 Also your complaint that UTF-8 reserves the short characters 
 for the english alphabet is not really relevant - the 
 characters with longer encodings tend to be rarer (such as 
 special symbols) or carry more information (such as chinese 
 characters where the same sentence takes only about 1/3 the 
 number of characters).
The vast majority of non-english alphabets in UCS can be encoded in a single byte. It is your exceptions that are not relevant.
Well obviously... That's like saying "if you know what the exact contents of a file are going to be anyway you can compress it to a single byte!" ie. It's possible to devise an encoding which will encode any given string to an arbitrarily small size. It's still completely useless because you'd have to know the string in advance... - A useful encoding has to be able to handle every unicode character - As I've shown the only space-efficient way to do this is using a variable length encoding like UTF-8 - Given the frequency distribution of unicode characters, UTF-8 does a pretty good job at encoding higher frequency characters in fewer bytes. - Yes you COULD encode non-english alphabets in a single byte but doing so would be inefficient because it would mean the more frequently used characters take more bytes to encode.
May 25 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
 On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
 On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
 I think you are a little confused about what unicode actually 
 is... Unicode has nothing to do with code pages and nobody 
 uses code pages any more except for compatibility with legacy 
 applications (with good reason!).
Incorrect. "Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes... In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode." http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
That confirms exactly what I just said...
No, that directly _contradicts_ what you said about Unicode having "nothing to do with code pages." All UCS did is take a bunch of existing code pages and standardize them into one massive character set. For example, ISCII was a pre-existing single-byte encoding and Unicode "largely preserves the ISCII layout within each block." http://en.wikipedia.org/wiki/ISCII All a code page is is a table of mappings, UCS is just a much larger, standardized table of such mappings.
 You said that phobos converts UTF-8 strings to UTF-32 before 
 operating on them but that's not true. As it iterates over 
 UTF-8 strings it iterates over dchars rather than chars, but 
 that's not in any way inefficient so I don't really see the 
 problem.
And what's a dchar? Let's check: dchar : unsigned 32 bit UTF-32 http://dlang.org/type.html Of course that's inefficient, you are translating your whole encoding over to a 32-bit encoding every time you need to process it. Walter as much as said so up above.
Given that all the machine registers are at least 32-bits already it doesn't make the slightest difference. The only additional operations on top of ascii are when it's a multi-byte character, and even then it's some simple bit manipulation which is as fast as any variable width encoding is going to get.
I see you've abandoned without note your claim that phobos doesn't convert UTF-8 to UTF-32 internally. Perhaps converting to UTF-32 is "as fast as any variable width encoding is going to get" but my claim is that single-byte encodings will be faster.
 The only alternatives to a variable width encoding I can see 
 are:
 - Single code page per string
 This is completely useless because now you can't concatenate 
 strings of different code pages.
I wouldn't be so fast to ditch this. There is a real argument to be made that strings of different languages are sufficiently different that there should be no multi-language strings. Is this the best route? I'm not sure, but I certainly wouldn't dismiss it out of hand.
 - Multiple code pages per string
 This just makes everything overly complicated and is far slower 
 to decode what the actual character is than UTF-8.
I disagree, this would still be far faster than UTF-8, particularly if you designed your header right.
 - String with escape sequences to change code page
 Can no longer access characters in the middle or end of the 
 string, you have to parse the entire string every time which 
 completely negates the benefit of a fixed width encoding.
I didn't think of this possibility, but you may be right that it's sub-optimal.
 Also your complaint that UTF-8 reserves the short characters 
 for the english alphabet is not really relevant - the 
 characters with longer encodings tend to be rarer (such as 
 special symbols) or carry more information (such as chinese 
 characters where the same sentence takes only about 1/3 the 
 number of characters).
The vast majority of non-english alphabets in UCS can be encoded in a single byte. It is your exceptions that are not relevant.
Well obviously... That's like saying "if you know what the exact contents of a file are going to be anyway you can compress it to a single byte!" ie. It's possible to devise an encoding which will encode any given string to an arbitrarily small size. It's still completely useless because you'd have to know the string in advance...
No, it's not the same at all. The contents of an arbitrary-length file cannot be compressed to a single byte, you would have collisions galore. But since most non-english alphabets are less than 256 characters, they can all be uniquely encoded in a single byte per character, with the header determining what language's code page to use. I don't understand your analogy whatsoever.
 - A useful encoding has to be able to handle every unicode 
 character
 - As I've shown the only space-efficient way to do this is 
 using a variable length encoding like UTF-8
You haven't shown this.
 - Given the frequency distribution of unicode characters, UTF-8 
 does a pretty good job at encoding higher frequency characters 
 in fewer bytes.
No, it does a very bad job of this. Every non-ASCII character takes at least two bytes to encode, whereas my single-byte encoding scheme would encode every alphabet with less than 256 characters in a single byte.
 - Yes you COULD encode non-english alphabets in a single byte 
 but doing so would be inefficient because it would mean the 
 more frequently used characters take more bytes to encode.
Not sure what you mean by this.
May 25 2013
parent "Diggory" <diggsey googlemail.com> writes:
On Saturday, 25 May 2013 at 19:02:43 UTC, Joakim wrote:
 On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
 On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
 On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
 I think you are a little confused about what unicode 
 actually is... Unicode has nothing to do with code pages and 
 nobody uses code pages any more except for compatibility 
 with legacy applications (with good reason!).
Incorrect. "Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes... In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode." http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
That confirms exactly what I just said...
No, that directly _contradicts_ what you said about Unicode having "nothing to do with code pages." All UCS did is take a bunch of existing code pages and standardize them into one massive character set. For example, ISCII was a pre-existing single-byte encoding and Unicode "largely preserves the ISCII layout within each block." http://en.wikipedia.org/wiki/ISCII All a code page is is a table of mappings, UCS is just a much larger, standardized table of such mappings.
UCS does have nothing to do with code pages, it was designed as a replacement for them. A codepage is a strict subset of the possible characters, UCS is the entire set of possible characters.
 You said that phobos converts UTF-8 strings to UTF-32 before 
 operating on them but that's not true. As it iterates over 
 UTF-8 strings it iterates over dchars rather than chars, but 
 that's not in any way inefficient so I don't really see the 
 problem.
And what's a dchar? Let's check: dchar : unsigned 32 bit UTF-32 http://dlang.org/type.html Of course that's inefficient, you are translating your whole encoding over to a 32-bit encoding every time you need to process it. Walter as much as said so up above.
Given that all the machine registers are at least 32-bits already it doesn't make the slightest difference. The only additional operations on top of ascii are when it's a multi-byte character, and even then it's some simple bit manipulation which is as fast as any variable width encoding is going to get.
I see you've abandoned without note your claim that phobos doesn't convert UTF-8 to UTF-32 internally. Perhaps converting to UTF-32 is "as fast as any variable width encoding is going to get" but my claim is that single-byte encodings will be faster.
I haven't "abandoned my claim". It's a simple fact that phobos does not convert UTF-8 string to UTF-32 strings before it uses them. ie. the difference between this: string mystr = ...; dstring temp = mystr.to!dstring; for (int i = 0; i < temp.length; ++i) process(temp[i]); and this: string mystr = ...; size_t i = 0; while (i < mystr.length) { dchar current = decode(mystr, i); process(current); } And if you can't see why the latter example is far more efficient I give up...
 The only alternatives to a variable width encoding I can see 
 are:
 - Single code page per string
 This is completely useless because now you can't concatenate 
 strings of different code pages.
I wouldn't be so fast to ditch this. There is a real argument to be made that strings of different languages are sufficiently different that there should be no multi-language strings. Is this the best route? I'm not sure, but I certainly wouldn't dismiss it out of hand.
 - Multiple code pages per string
 This just makes everything overly complicated and is far 
 slower to decode what the actual character is than UTF-8.
I disagree, this would still be far faster than UTF-8, particularly if you designed your header right.
The cache misses alone caused by simply accessing the separate headers would be a larger overhead than decoding UTF-8 which takes a few assembly instructions and has perfect locality and can be efficiently pipelined by the CPU. Then there's all the extra processing involved combining the headers when you concatenate strings. Plus you lose the one benefit a fixed width encoding has because random access is no longer possible without first finding out which header controls the location you want to access.
 - String with escape sequences to change code page
 Can no longer access characters in the middle or end of the 
 string, you have to parse the entire string every time which 
 completely negates the benefit of a fixed width encoding.
I didn't think of this possibility, but you may be right that it's sub-optimal.
 Also your complaint that UTF-8 reserves the short characters 
 for the english alphabet is not really relevant - the 
 characters with longer encodings tend to be rarer (such as 
 special symbols) or carry more information (such as chinese 
 characters where the same sentence takes only about 1/3 the 
 number of characters).
The vast majority of non-english alphabets in UCS can be encoded in a single byte. It is your exceptions that are not relevant.
Well obviously... That's like saying "if you know what the exact contents of a file are going to be anyway you can compress it to a single byte!" ie. It's possible to devise an encoding which will encode any given string to an arbitrarily small size. It's still completely useless because you'd have to know the string in advance...
No, it's not the same at all. The contents of an arbitrary-length file cannot be compressed to a single byte, you would have collisions galore. But since most non-english alphabets are less than 256 characters, they can all be uniquely encoded in a single byte per character, with the header determining what language's code page to use. I don't understand your analogy whatsoever.
It's very simple - the more information about the type of data you are compressing you have at the time of writing the algorithm the better compression ration you can get, to the point that if you know exactly what the file is going to contain you can compress it to nothing. This is why you have specialised compression algorithms for images, video, audio, etc. It doesn't matter how few characters non-english alphabets have - unless you know WHICH alphabet it is before-hand you can't store it in a single byte. Since any given character could be in any alphabet the best you can do is look at the probabilities of different characters appearing and use shorter representations for more common ones. (This is the basis for all lossless compression) The english alphabet plus 0-9 and basic punctuation are by far the most common characters used on computers so it makes sense to use one byte for those and multiple bytes for rarer characters.
 - A useful encoding has to be able to handle every unicode 
 character
 - As I've shown the only space-efficient way to do this is 
 using a variable length encoding like UTF-8
You haven't shown this.
If you had thought through your suggestion of multiple code pages per string you would see that I had.
 - Given the frequency distribution of unicode characters, 
 UTF-8 does a pretty good job at encoding higher frequency 
 characters in fewer bytes.
No, it does a very bad job of this. Every non-ASCII character takes at least two bytes to encode, whereas my single-byte encoding scheme would encode every alphabet with less than 256 characters in a single byte.
And strings with mixed characters would use lots of memory and be extremely slow. Common when using proper names, quotes, inline translations, graphical characters, etc. etc. Not to mention the added complexity to actually implement the algorithms.
May 25 2013
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 1:07 AM, Joakim wrote:
 The vast majority of non-english alphabets in UCS can be encoded in a single
 byte.  It is your exceptions that are not relevant.
I suspect the Chinese, Koreans, and Japanese would take exception to being called irrelevant. Good luck with your scheme that can't handle languages written by billions of people!
May 25 2013
prev sibling parent reply "Joakim" <joakim airpost.net> writes:
On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote:
 You seem to think that not only UTF-8 is bad encoding but also 
 one unified encoding (code-space) is bad(?).
Yes, on the encoding, if it's a variable-length encoding like UTF-8, no, on the code space. I was originally going to title my post, "Why Unicode?" but I have no real problem with UCS, which merely standardized a bunch of pre-existing code pages. Perhaps there are a lot of problems with UCS also, I just haven't delved into it enough to know. My problem is with these dumb variable-length encodings, so I was precise in the title.
 Separate code spaces were the case before Unicode (and utf-8). 
 The problem is not only that without header text is meaningless 
 (no easy slicing) but the fact that encoding of data after 
 header strongly depends a variety of factors -  a list of 
 encodings actually. Now everybody has to keep a (code) page per 
 language to at least know if it's 2 bytes per char or 1 byte 
 per char or whatever. And you still work on a basis that there 
 is no combining marks and regional specific stuff :)
Everybody is still keeping code pages, UTF-8 hasn't changed that. Does UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per char or whatever?" It has to do that also. Everyone keeps talking about "easy slicing" as though UTF-8 provides it, but it doesn't. Phobos turns UTF-8 into UTF-32 internally for all that ease of use, at least doubling your string size in the process. Correct me if I'm wrong, that was what I read on the newsgroup sometime back.
 In fact it was even "better" nobody ever talked about header 
 they just assumed a codepage with some global setting. Imagine 
 yourself creating a font rendering system these days - a hell 
 of an exercise in frustration (okay how do I render 0x88 ? mm 
 if that is in codepage XYZ then ...).
I understand that people were frustrated with all the code pages out there before UCS standardized them, but that is a completely different argument than my problem with UTF-8 and variable-length encodings. My proposed simple, header-based, constant-width encoding could be implemented with UCS and there go all your arguments about random code pages.
 This just shows you don't care for multilingual stuff at all. 
 Imagine any language tutor/translator/dictionary on the Web. 
 For instance most languages need to intersperse ASCII (also 
 keep in mind e.g. HTML markup). Books often feature citations 
 in native language (or e.g. latin) along with translations.
This is a small segment of use and it would be handled fine by an alternate encoding.
 Now also take into account math symbols, currency symbols and 
 beyond. Also these days cultures are mixing in wild 
 combinations so you might need to see the text even if you 
 can't read it. Unicode is not only "encode characters from all 
 languages". It needs to address universal representation of 
 symbolics used in writing systems at large.
I take your point that it isn't just languages, but symbols also. I see no reason why UTF-8 is a better encoding for that purpose than the kind of simple encoding I've suggested.
 We want monoculture! That is to understand each without all 
 these "par-le-vu-france?" and codepages of various 
 complexity(insanity).
I hate monoculture, but then I haven't had to decipher some screwed-up codepage in the middle of the night. ;) That said, you could standardize on UCS for your code space without using a bad encoding like UTF-8, as I said above.
 Want small - use compression schemes which are perfectly fine 
 and get to the precious 1byte per codepoint with exceptional 
 speed.
 http://www.unicode.org/reports/tr6/
Correct me if I'm wrong, but it seems like that compression scheme simply adds a header and then uses a single-byte encoding, exactly what I suggested! :) But I get the impression that it's only for sending over the wire, ie transmision, so all the processing issues that UTF-8 introduces would still be there.
 And borrowing the arguments from from that rant: locale is 
 borked shit when it comes to encodings. Locales should be used 
 for tweaking visual like numbers, date display an so on.
Is that worse than every API simply assuming UTF-8, as he says? Broken locale support in the past, as you and others complain about, doesn't invalidate the concept. If they're screwing up something so simple, imagine how much worse everyone is screwing up something complex like UTF-8?
May 24 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-May-2013 10:44, Joakim пишет:
 On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote:
 You seem to think that not only UTF-8 is bad encoding but also one
 unified encoding (code-space) is bad(?).
Yes, on the encoding, if it's a variable-length encoding like UTF-8, no, on the code space. I was originally going to title my post, "Why Unicode?" but I have no real problem with UCS, which merely standardized a bunch of pre-existing code pages. Perhaps there are a lot of problems with UCS also, I just haven't delved into it enough to know. My problem is with these dumb variable-length encodings, so I was precise in the title.
UCS is dead and gone. Next in line to "640K is enough for everyone". Simply put Unicode decided to take into account all diversity of luggages instead of ~80% of these. Hard to add anything else. No offense meant but it feels like you actually live in universe that is 5-7 years behind current state. UTF-16 (a successor to UCS) is no random-access either. And it's shitty beyond measure, UTF-8 is a shining gem in comparison.
 Separate code spaces were the case before Unicode (and utf-8). The
 problem is not only that without header text is meaningless (no easy
 slicing) but the fact that encoding of data after header strongly
 depends a variety of factors -  a list of encodings actually. Now
 everybody has to keep a (code) page per language to at least know if
 it's 2 bytes per char or 1 byte per char or whatever. And you still
 work on a basis that there is no combining marks and regional specific
 stuff :)
Everybody is still keeping code pages, UTF-8 hasn't changed that.
Legacy. Hard to switch overnight. There are graphs that indicate that few years from now you might never encounter a legacy encoding anymore, only UTF-8/UTF-16.
  Does
 UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per
 char or whatever?"
It's coherent in its scheme to determine that. You don't need extra information synced to text unlike header stuff.
 It has to do that also. Everyone keeps talking about
 "easy slicing" as though UTF-8 provides it, but it doesn't.  Phobos
 turns UTF-8 into UTF-32 internally for all that ease of use, at least
 doubling your string size in the process.  Correct me if I'm wrong, that
 was what I read on the newsgroup sometime back.
Indeed you are - searching for UTF-8 substring in UTF-8 string doesn't do any decoding and it does return you a slice of a balance of original.
 In fact it was even "better" nobody ever talked about header they just
 assumed a codepage with some global setting. Imagine yourself creating
 a font rendering system these days - a hell of an exercise in
 frustration (okay how do I render 0x88 ? mm if that is in codepage XYZ
 then ...).
I understand that people were frustrated with all the code pages out there before UCS standardized them, but that is a completely different argument than my problem with UTF-8 and variable-length encodings. My proposed simple, header-based, constant-width encoding could be implemented with UCS and there go all your arguments about random code pages.
No they don't - have you ever seen native Korean or Chinese codepages? Problems with your header based approach are self-evident in a sense that there is no single sane way to deal with it on cross-locale basis (that you simply ignore as noted below).
 This just shows you don't care for multilingual stuff at all. Imagine
 any language tutor/translator/dictionary on the Web. For instance most
 languages need to intersperse ASCII (also keep in mind e.g. HTML
 markup). Books often feature citations in native language (or e.g.
 latin) along with translations.
This is a small segment of use and it would be handled fine by an alternate encoding.
??? Simply makes no sense. There is no intersection between some legacy encodings as of now. Or do you want to add N*(N-1) cross-encodings for any combination of 2? What about 3 in one string?
 Now also take into account math symbols, currency symbols and beyond.
 Also these days cultures are mixing in wild combinations so you might
 need to see the text even if you can't read it. Unicode is not only
 "encode characters from all languages". It needs to address universal
 representation of symbolics used in writing systems at large.
I take your point that it isn't just languages, but symbols also. I see no reason why UTF-8 is a better encoding for that purpose than the kind of simple encoding I've suggested.
 We want monoculture! That is to understand each without all these
 "par-le-vu-france?" and codepages of various complexity(insanity).
I hate monoculture, but then I haven't had to decipher some screwed-up codepage in the middle of the night. ;)
So you never had trouble of internationalization? What languages do you use (read/speak/etc.)?
That said, you could standardize
 on UCS for your code space without using a bad encoding like UTF-8, as I
 said above.
UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into that trap (Java, Windows NT). You shouldn't.
 Want small - use compression schemes which are perfectly fine and get
 to the precious 1byte per codepoint with exceptional speed.
 http://www.unicode.org/reports/tr6/
Correct me if I'm wrong, but it seems like that compression scheme simply adds a header and then uses a single-byte encoding, exactly what I suggested! :)
This is it but it's far more flexible in a sense that it allows multi-linguagal strings just fine and lone full-with unicode codepoints as well.
 But I get the impression that it's only for sending over
 the wire, ie transmision, so all the processing issues that UTF-8
 introduces would still be there.
Use mime-type etc. Standards are always a bit stringy and suboptimal, their acceptance rate is one of chief advantages they have. Unicode has horrifically large momentum now and not a single organization aside from them tries to do this dirty work (=i18n).
 And borrowing the arguments from from that rant: locale is borked shit
 when it comes to encodings. Locales should be used for tweaking visual
 like numbers, date display an so on.
Is that worse than every API simply assuming UTF-8, as he says? Broken locale support in the past, as you and others complain about, doesn't invalidate the concept.
It's combinatorial blowup and has some stone-walls to hit into. Consider adding another encoding for "Tuva" for isntance. Now you have to add 2*n conversion routines to match it to other codepages/locales. Beyond that - there are many things to consider in internationalization and you would have to special case them all by codepage.
 If they're screwing up something so simple,
 imagine how much worse everyone is screwing up something complex like
 UTF-8?
UTF-8 is pretty darn simple. BTW all it does is map [0..10FFFF] to a sequence of octets. It does it pretty well and compatible with ASCII, even the little rant you posted acknowledged that. Now you are either against Unicode as whole or what? -- Dmitry Olshansky
May 25 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
 25-May-2013 10:44, Joakim пишет:
 Yes, on the encoding, if it's a variable-length encoding like 
 UTF-8, no,
 on the code space.  I was originally going to title my post, 
 "Why
 Unicode?" but I have no real problem with UCS, which merely 
 standardized
 a bunch of pre-existing code pages.  Perhaps there are a lot 
 of problems
 with UCS also, I just haven't delved into it enough to know.
UCS is dead and gone. Next in line to "640K is enough for everyone".
I think you are confused. UCS refers to the Universal Character Set, which is the backbone of Unicode: http://en.wikipedia.org/wiki/Universal_Character_Set You might be thinking of the unpopular UCS-2 and UCS-4 encodings, which I have never referred to.
 Separate code spaces were the case before Unicode (and 
 utf-8). The
 problem is not only that without header text is meaningless 
 (no easy
 slicing) but the fact that encoding of data after header 
 strongly
 depends a variety of factors -  a list of encodings actually. 
 Now
 everybody has to keep a (code) page per language to at least 
 know if
 it's 2 bytes per char or 1 byte per char or whatever. And you 
 still
 work on a basis that there is no combining marks and regional 
 specific
 stuff :)
Everybody is still keeping code pages, UTF-8 hasn't changed that.
Legacy. Hard to switch overnight. There are graphs that indicate that few years from now you might never encounter a legacy encoding anymore, only UTF-8/UTF-16.
I didn't mean that people are literally keeping code pages. I meant that there's not much of a difference between code pages with 2 bytes per char and the language character sets in UCS.
 Does
 UTF-8 not need "to at least know if it's 2 bytes per char or 1 
 byte per
 char or whatever?"
It's coherent in its scheme to determine that. You don't need extra information synced to text unlike header stuff.
?! It's okay because you deem it "coherent in its scheme?" I deem headers much more coherent. :)
 It has to do that also. Everyone keeps talking about
 "easy slicing" as though UTF-8 provides it, but it doesn't.  
 Phobos
 turns UTF-8 into UTF-32 internally for all that ease of use, 
 at least
 doubling your string size in the process.  Correct me if I'm 
 wrong, that
 was what I read on the newsgroup sometime back.
Indeed you are - searching for UTF-8 substring in UTF-8 string doesn't do any decoding and it does return you a slice of a balance of original.
Perhaps substring search doesn't strictly require decoding but you have changed the subject: slicing does require decoding and that's the use case you brought up to begin with. I haven't looked into it, but I suspect substring search not requiring decoding is the exception for UTF-8 algorithms, not the rule.
 ??? Simply makes no sense. There is no intersection between 
 some legacy encodings as of now. Or do you want to add N*(N-1) 
 cross-encodings for any combination of 2? What about 3 in one 
 string?
I sketched two possible encodings above, none of which would require "cross-encodings."
 We want monoculture! That is to understand each without all 
 these
 "par-le-vu-france?" and codepages of various 
 complexity(insanity).
I hate monoculture, but then I haven't had to decipher some screwed-up codepage in the middle of the night. ;)
So you never had trouble of internationalization? What languages do you use (read/speak/etc.)?
This was meant as a point in your favor, conceding that I haven't had to code with the terrible code pages system from the past. I can read and speak multiple languages, but I don't use anything other than English text.
That said, you could standardize
 on UCS for your code space without using a bad encoding like 
 UTF-8, as I
 said above.
UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into that trap (Java, Windows NT). You shouldn't.
UCS, the character set, as noted above. If that's a myth, Unicode is a myth. :)
 This is it but it's far more flexible in a sense that it allows 
 multi-linguagal strings just fine and lone full-with unicode 
 codepoints as well.
That's only because it uses a more complex header than a single byte for the language, which I noted could be done with my scheme, by adding a more complex header, long before you mentioned this unicode compression scheme.
 But I get the impression that it's only for sending over
 the wire, ie transmision, so all the processing issues that 
 UTF-8
 introduces would still be there.
Use mime-type etc. Standards are always a bit stringy and suboptimal, their acceptance rate is one of chief advantages they have. Unicode has horrifically large momentum now and not a single organization aside from them tries to do this dirty work (=i18n).
You misunderstand. I was saying that this unicode compression scheme doesn't help you with string processing, it is only for transmission and is probably fine for that, precisely because it seems to implement some version of my single-byte encoding scheme! You do raise a good point: the only reason why we're likely using such a bad encoding in UTF-8 is that nobody else wants to tackle this hairy problem.
 Consider adding another encoding for "Tuva" for isntance. Now 
 you have to add 2*n conversion routines to match it to other 
 codepages/locales.
Not sure what you're referring to here.
 Beyond that - there are many things to consider in 
 internationalization and you would have to special case them 
 all by codepage.
Not necessarily. But that is actually one of the advantages of single-byte encodings, as I have noted above. toUpper is a NOP for a single-byte encoding string with an Asian script, you can't do that with a UTF-8 string.
 If they're screwing up something so simple,
 imagine how much worse everyone is screwing up something 
 complex like
 UTF-8?
UTF-8 is pretty darn simple. BTW all it does is map [0..10FFFF] to a sequence of octets. It does it pretty well and compatible with ASCII, even the little rant you posted acknowledged that. Now you are either against Unicode as whole or what?
The BOM link I gave notes that UTF-8 isn't always ASCII-compatible. There are two parts to Unicode. I don't know enough about UCS, the character set, ;) to be for it or against it, but I acknowledge that a standardized character set may make sense. I am dead set against the UTF-8 variable-width encoding, for all the reasons listed above. On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
 25-May-2013 13:05, Joakim пишет:
 Nobody is talking about going back to code pages.  I'm talking 
 about
 going to single-byte encodings, which do not imply the 
 problems that you
 had with code pages way back when.
Problem is what you outline is isomorphic with code-pages. Hence the grief of accumulated experience against them.
They may seem superficially similar but they're not. For example, from the beginning, I have suggested a more complex header that can enable multi-language strings, as one possible solution. I don't think code pages provided that.
 Well if somebody get a quest to redefine UTF-8 they *might* 
 come up with something that is a bit faster to decode but 
 shares the same properties. Hardly a life saver anyway.
Perhaps not, but I suspect programmers will flock to a constant-width encoding that is much simpler and more efficient than UTF-8. Programmer productivity is the biggest loss from the complexity of UTF-8, as I've noted before.
 The world may not "abandon Unicode," but it will abandon 
 UTF-8, because
 it's a dumb idea.  Unfortunately, such dumb ideas- XML 
 anyone?- often
 proliferate until someone comes up with something better to 
 show how
 dumb they are.
Even children know XML is awful redundant shit as interchange format. The hierarchical document is a nice idea anyway.
_We_ both know that, but many others don't, or XML wouldn't be as popular as it is. ;) I'm making a similar point about the more limited success of UTF-8, ie it's still shit.
May 25 2013
next sibling parent "Juan Manuel Cabo" <juanmanuel.cabo gmail.com> writes:
░░░░░░░░░ⓌⓉⒻ░
╔╗░╔╗░╔╗╔════╗╔════╗░░
║║░║║░║║╚═╗╔═╝║╔═══╝░░
║║░║║░║║░░║║░░║╚═╗░░░░
║╚═╝╚═╝║╔╗║║╔╗║╔═╝╔╗░░
╚══════╝╚╝╚╝╚╝╚╝░░╚╝░░

░░░░░░░░░░░░░░░░░░░░░░░░
█░█░█░░░░░░▐░░░░░░░░░░▐░
█░█░█▐▀█▐▀█▐░█▐▀█▐▀█▐▀█░
█░█░█▐▄█▐▄█▐▄▀▐▄█▐░█▐░█░
█▄█▄█▐▄▄▐▄▄▐░█▐▄▄▐░█▐▄█░
░░░░░░░░░░░░░░░░░░░░░░░░


--jm
May 25 2013
prev sibling next sibling parent reply "Diggory" <diggsey googlemail.com> writes:
"limited success of UTF-8"

Becoming the de-facto standard encoding EVERYWERE except for 
windows which uses UTF-16 is hardly a failure...

I really don't understand your hatred for UTF-8 - it's simple to 
decode and encode, fast and space-efficient. Fixed width 
encodings are not inherently fast, the only thing they are faster 
at is if you want to randomly access the Nth character instead of 
the Nth byte. In the rare cases that you need to do a lot of this 
kind of random access there exists UTF-32...

Any fixed width encoding which can encode every unicode character 
must use at least 3 bytes, and using 4 bytes is probably going to 
be faster because of alignment, so I don't see what the great 
improvement over UTF-32 is going to be.

 slicing does require decoding
Nope.
 I didn't mean that people are literally keeping code pages.  I 
 meant that there's not much of a difference between code pages 
 with 2 bytes per char and the language character sets in UCS.
Unicode doesn't have "language character sets". The different planes only exist for organisational purposes they don't affect how characters are encoded.
 ?!  It's okay because you deem it "coherent in its scheme?"  I 
 deem headers much more coherent. :)
Sure if you change the word "coherent" to mean something completely different... Coherent means that you store related things together, ie. everything that you need to decode a character in the same place, not spread out between part of a character and a header.
 but I suspect substring search not requiring decoding is the 
 exception for UTF-8 algorithms, not the rule.
The only time you need to decode is when you need to do some transformation that depends on the code point such as converting case or identifying which character class a particular character belongs to. Appending, slicing, copying, searching, replacing, etc. basically all the most common text operations can all be done without any encoding or decoding.
May 25 2013
parent "Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 18:56:42 UTC, Diggory wrote:
 "limited success of UTF-8"

 Becoming the de-facto standard encoding EVERYWERE except for 
 windows which uses UTF-16 is hardly a failure...
So you admit that UTF-8 isn't used on the vast majority of computers since the inception of Unicode. That's what I call limited success, thank you for agreeing with me. :)
 I really don't understand your hatred for UTF-8 - it's simple 
 to decode and encode, fast and space-efficient. Fixed width 
 encodings are not inherently fast, the only thing they are 
 faster at is if you want to randomly access the Nth character 
 instead of the Nth byte. In the rare cases that you need to do 
 a lot of this kind of random access there exists UTF-32...
Space-efficient? Do you even understand what a single-byte encoding is? Suffice to say, a single-byte encoding beats UTF-8 on all these measures, not just one.
 Any fixed width encoding which can encode every unicode 
 character must use at least 3 bytes, and using 4 bytes is 
 probably going to be faster because of alignment, so I don't 
 see what the great improvement over UTF-32 is going to be.
Slaps head. You don't need "at least 3 bytes" because you're packing language info in the header. I don't think you even know what I'm talking about.
 slicing does require decoding
Nope.
Of course it does, at least partially. There is no other way to know where the code points are.
 I didn't mean that people are literally keeping code pages.  I 
 meant that there's not much of a difference between code pages 
 with 2 bytes per char and the language character sets in UCS.
Unicode doesn't have "language character sets". The different planes only exist for organisational purposes they don't affect how characters are encoded.
Nobody's talking about different planes. I'm talking about all the different language character sets in this list: http://en.wikipedia.org/wiki/List_of_Unicode_characters
 ?!  It's okay because you deem it "coherent in its scheme?"  I 
 deem headers much more coherent. :)
Sure if you change the word "coherent" to mean something completely different... Coherent means that you store related things together, ie. everything that you need to decode a character in the same place, not spread out between part of a character and a header.
Coherent means that the organizational pieces fit together and make sense conceptually, not that everything is stored together. My point is that putting the language info in a header seems much more coherent to me than ramming that info into every character.
 but I suspect substring search not requiring decoding is the 
 exception for UTF-8 algorithms, not the rule.
The only time you need to decode is when you need to do some transformation that depends on the code point such as converting case or identifying which character class a particular character belongs to. Appending, slicing, copying, searching, replacing, etc. basically all the most common text operations can all be done without any encoding or decoding.
Slicing by byte, which is the only way to slice without decoding, is useless, I have to laugh that you even include it. :) All these basic operations can be done very fast, often faster than UTF-8, in a single-byte encoding. Once you start talking code points, it's no contest: UTF-8 flat out loses. On Saturday, 25 May 2013 at 19:42:41 UTC, Diggory wrote:
 All a code page is is a table of mappings, UCS is just a much 
 larger, standardized table of such mappings.
UCS does have nothing to do with code pages, it was designed as a replacement for them. A codepage is a strict subset of the possible characters, UCS is the entire set of possible characters.
"[I]t was designed as a replacement for them" by combining several of them into a master code page and removing redundancies. Functionally, they are the same and historically they maintain the same layout in at least some cases. To then say, UCS has "nothing to do with code pages" is just dense.
 I see you've abandoned without note your claim that phobos 
 doesn't convert UTF-8 to UTF-32 internally.  Perhaps 
 converting to UTF-32 is "as fast as any variable width 
 encoding is going to get" but my claim is that single-byte 
 encodings will be faster.
I haven't "abandoned my claim". It's a simple fact that phobos does not convert UTF-8 string to UTF-32 strings before it uses them. ie. the difference between this: string mystr = ...; dstring temp = mystr.to!dstring; for (int i = 0; i < temp.length; ++i) process(temp[i]); and this: string mystr = ...; size_t i = 0; while (i < mystr.length) { dchar current = decode(mystr, i); process(current); } And if you can't see why the latter example is far more efficient I give up...
I take your point that phobos is often decoding by char as it iterates through, but there are still functions in std.string that convert the entire string, as in your first example. The point is that you are forced to decode everything to UTF-32, whether by char or the entire string. Your latter example may be marginally more efficient but it is only useful for functions that start from the beginning and walk the string in only one direction, which not all operations do.
 - Multiple code pages per string
 This just makes everything overly complicated and is far 
 slower to decode what the actual character is than UTF-8.
I disagree, this would still be far faster than UTF-8, particularly if you designed your header right.
The cache misses alone caused by simply accessing the separate headers would be a larger overhead than decoding UTF-8 which takes a few assembly instructions and has perfect locality and can be efficiently pipelined by the CPU.
Lol, you think a few potential cache misses is going to be slower than repeatedly decoding, whether in assembly and pipelined or not, every single UTF-8 character? :D
 Then there's all the extra processing involved combining the 
 headers when you concatenate strings. Plus you lose the one 
 benefit a fixed width encoding has because random access is no 
 longer possible without first finding out which header controls 
 the location you want to access.
There would be a few arithmetic operations on substring indices when concatenating strings, hardly anything. Random access is still not only possible, it is incredibly fast in most cases: you just have to check first if the header lists any two-byte encodings. This can be done once and cached as a property of the string (set a boolean no_two_byte_encoding once and simply have the slice operator check it before going ahead), just as you could add a property to UTF-8 strings to allow quick random access if they happen to be pure ASCII. The difference is that only strings that include the two-byte encoded Korean/Chinese/Japanese characters would require a bit more calculation for slicing in my scheme, whereas _every_ non-ASCII UTF-8 string requires full decoding to allow random access. This is a clear win for my single-byte encoding, though maybe not the complete demolition of UTF-8 you were hoping for. ;)
 No, it's not the same at all.  The contents of an 
 arbitrary-length file cannot be compressed to a single byte, 
 you would have collisions galore.  But since most non-english 
 alphabets are less than 256 characters, they can all be 
 uniquely encoded in a single byte per character, with the 
 header determining what language's code page to use.  I don't 
 understand your analogy whatsoever.
It's very simple - the more information about the type of data you are compressing you have at the time of writing the algorithm the better compression ration you can get, to the point that if you know exactly what the file is going to contain you can compress it to nothing. This is why you have specialised compression algorithms for images, video, audio, etc.
This may be mostly true in general, but your specific example of compressing down to a byte is nonsense. For any arbitrarily long data, there are always limits to compression. What any of this has to do with my single-byte encoding, I have no idea.
 It doesn't matter how few characters non-english alphabets have 
 - unless you know WHICH alphabet it is before-hand you can't 
 store it in a single byte. Since any given character could be 
 in any alphabet the best you can do is look at the 
 probabilities of different characters appearing and use shorter 
 representations for more common ones. (This is the basis for 
 all lossless compression) The english alphabet plus 0-9 and 
 basic punctuation are by far the most common characters used on 
 computers so it makes sense to use one byte for those and 
 multiple bytes for rarer characters.
How many times have I said that "you know WHICH alphabet it is before-hand" because that info is stored in the header? That is why I specifically said, from my first post, that multi-language strings would have more complex headers, which I later pointed out could list all the different language substrings within a multi-language string. Your silly exposition of how compression works makes me wonder if you understand anything about how a single-byte encoding would work. Perhaps it made sense to use one byte for ASCII characters and relegate _every other language_ to multiple bytes two decades ago. It doesn't make sense today.
 - As I've shown the only space-efficient way to do this is 
 using a variable length encoding like UTF-8
You haven't shown this.
If you had thought through your suggestion of multiple code pages per string you would see that I had.
You are not packaging and transmitting the code pages with the string, just as you do not ship the entire UCS with every UTF-8 string. A single-byte encoding is going to be more space-efficient for the vast majority of strings, everybody knows this.
 No, it does a very bad job of this.  Every non-ASCII character 
 takes at least two bytes to encode, whereas my single-byte 
 encoding scheme would encode every alphabet with less than 256 
 characters in a single byte.
And strings with mixed characters would use lots of memory and be extremely slow. Common when using proper names, quotes, inline translations, graphical characters, etc. etc. Not to mention the added complexity to actually implement the algorithms.
Ah, you have finally stumbled across the path to a good argument, though I'm not sure how, given your seeming ignorance of how single-byte encodings work. :) There _is_ a degenerate case with my particular single-byte encoding (not the ones you list, which would still be faster and use less memory than UTF-8): strings that use many, if not all, character sets. So the worst case scenario might be something like a string that had 100 characters, every one from a different language. In that case, I think it would still be smaller than the equivalent UTF-8 string, but not by much. There might be some complexity in implementing the algorithms, but on net, likely less than UTF-8, while being much more usable for most programmers. On Saturday, 25 May 2013 at 22:41:59 UTC, Diggory wrote:
 1) Take the byte at a particular offset in the string
 2) If it is ASCII then we're done
 3) Otherwise count the number of '1's at the start of the byte 
 - this is how many bytes make up the character (there's even an 
 ASM instruction to do this)
 4) This first byte will look like '1110xxxx' for a 3 byte 
 character, '11110xxx' for a 4 byte character, etc.
 5) All following bytes are of the form '10xxxxxx'
 6) Now just concatenate all the 'x's together and add an offset 
 to get the code point
Not sure why you chose to write this basic UTF-8 stuff out, other than to bluster on without much use.
 Note that this is CONSTANT TIME, O(1) with minimal branching so 
 well suited to pipelining (after the initial byte the other 
 bytes can all be processed in parallel by the CPU) and only 
 sequential memory access so no cache misses, and zero 
 additional memory requirements
It is constant time _per character_. You have to do it for _every_ non-ASCII character in your string, so the decoding adds up.
 Now compare your encoding:
 1) Look up the offset in the header using binary search: O(log 
 N) lots of branching
It is difficult to reason about the header, because it all depends on the number of languages used and how many substrings there are. There are worst-case scenarios that could approach something like log(n) but extremely unlikely in real-world use. Most of the time, this would be O(1).
 2) Look up the code page ID in a massive array of code pages to 
 work out how many bytes per character
Hardly, this could be done by a simple lookup function that simply checked if the language was one of the few alphabets that require two bytes.
 3) Hope this array hasn't been paged out and is still in the 
 cache
 4) Extract that many bytes from the string and combine them 
 into a number
Lol, I love how you think this is worth listing as a separate step for the few two-byte encodings, yet have no problem with doing this for every non-ASCII character in UTF-8.
 5) Look up this new number in yet another large array specific 
 to the code page
Why? The language byte and number uniquely specify the character, just like your Unicode code point above. If you were simply encoding the UCS in a single-byte encoding, you would arrange your scheme in such a way to trivially be able to generate the UCS code point using these two bytes.
 This is O(log N) has lots of branching so no pipelining (every 
 stage depends on the result of the stage before), lots of 
 random memory access so lots of cache misses, lots of 
 additional memory requirements to store all those tables, and 
 an algorithm that isn't even any easier to understand.
Wrong on practically every count, as detailed above.
 Plus every other algorithm to operate on it except for decoding 
 is insanely complicated.
They are still much _less_ complicated than UTF-8, that's the comparison that matters.
May 26 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-May-2013 22:26, Joakim пишет:
 On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
 25-May-2013 10:44, Joakim пишет:
 Yes, on the encoding, if it's a variable-length encoding like UTF-8, no,
 on the code space.  I was originally going to title my post, "Why
 Unicode?" but I have no real problem with UCS, which merely standardized
 a bunch of pre-existing code pages.  Perhaps there are a lot of problems
 with UCS also, I just haven't delved into it enough to know.
UCS is dead and gone. Next in line to "640K is enough for everyone".
I think you are confused. UCS refers to the Universal Character Set, which is the backbone of Unicode: http://en.wikipedia.org/wiki/Universal_Character_Set You might be thinking of the unpopular UCS-2 and UCS-4 encodings, which I have never referred to.
Yeah got confused. So sorry about that.
 Separate code spaces were the case before Unicode (and utf-8). The
 problem is not only that without header text is meaningless (no easy
 slicing) but the fact that encoding of data after header strongly
 depends a variety of factors -  a list of encodings actually. Now
 everybody has to keep a (code) page per language to at least know if
 it's 2 bytes per char or 1 byte per char or whatever. And you still
 work on a basis that there is no combining marks and regional specific
 stuff :)
Everybody is still keeping code pages, UTF-8 hasn't changed that.
Legacy. Hard to switch overnight. There are graphs that indicate that few years from now you might never encounter a legacy encoding anymore, only UTF-8/UTF-16.
I didn't mean that people are literally keeping code pages. I meant that there's not much of a difference between code pages with 2 bytes per char and the language character sets in UCS.
You can map a codepage to a subset of UCS :) That's what they do internally anyway. If I take you right you propose to define string as a header that denotes a set of windows in code space? I still fail to see how that would scale see below.
 It has to do that also. Everyone keeps talking about
 "easy slicing" as though UTF-8 provides it, but it doesn't. Phobos
 turns UTF-8 into UTF-32 internally for all that ease of use, at least
 doubling your string size in the process.  Correct me if I'm wrong, that
 was what I read on the newsgroup sometime back.
Indeed you are - searching for UTF-8 substring in UTF-8 string doesn't do any decoding and it does return you a slice of a balance of original.
Perhaps substring search doesn't strictly require decoding but you have changed the subject: slicing does require decoding and that's the use case you brought up to begin with. I haven't looked into it, but I suspect substring search not requiring decoding is the exception for UTF-8 algorithms, not the rule.
Mm... strictly speaking (let's turn that argument backwards) - what are algorithms that require slicing say [5..$] of string without ever looking at it left to right, searching etc.?
 ??? Simply makes no sense. There is no intersection between some
 legacy encodings as of now. Or do you want to add N*(N-1)
 cross-encodings for any combination of 2? What about 3 in one string?
I sketched two possible encodings above, none of which would require "cross-encodings."
 We want monoculture! That is to understand each without all these
 "par-le-vu-france?" and codepages of various complexity(insanity).
I hate monoculture, but then I haven't had to decipher some screwed-up codepage in the middle of the night. ;)
So you never had trouble of internationalization? What languages do you use (read/speak/etc.)?
This was meant as a point in your favor, conceding that I haven't had to code with the terrible code pages system from the past. I can read and speak multiple languages, but I don't use anything other than English text.
Okay then.
 That said, you could standardize
 on UCS for your code space without using a bad encoding like UTF-8, as I
 said above.
UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into that trap (Java, Windows NT). You shouldn't.
UCS, the character set, as noted above. If that's a myth, Unicode is a myth. :)
Yeah, that was a mishap on my behalf. I think I've seen your 2 byte argument way to often and it got concatenated to UCS forming UCS-2 :)
 This is it but it's far more flexible in a sense that it allows
 multi-linguagal strings just fine and lone full-with unicode
 codepoints as well.
That's only because it uses a more complex header than a single byte for the language, which I noted could be done with my scheme, by adding a more complex header,
How would it look like? Or how the processing will go?
 long before you mentioned this unicode compression
 scheme.
It does inline headers or rather tags. That hop between fixed char windows. It's not random-access nor claims to be.
 But I get the impression that it's only for sending over
 the wire, ie transmision, so all the processing issues that UTF-8
 introduces would still be there.
Use mime-type etc. Standards are always a bit stringy and suboptimal, their acceptance rate is one of chief advantages they have. Unicode has horrifically large momentum now and not a single organization aside from them tries to do this dirty work (=i18n).
You misunderstand. I was saying that this unicode compression scheme doesn't help you with string processing, it is only for transmission and is probably fine for that, precisely because it seems to implement some version of my single-byte encoding scheme! You do raise a good point: the only reason why we're likely using such a bad encoding in UTF-8 is that nobody else wants to tackle this hairy problem.
Yup, where have you been say almost 10 years ago? :)
 Consider adding another encoding for "Tuva" for isntance. Now you have
 to add 2*n conversion routines to match it to other codepages/locales.
Not sure what you're referring to here.
If you adopt the "map to UCS policy" then nothing.
 Beyond that - there are many things to consider in
 internationalization and you would have to special case them all by
 codepage.
Not necessarily. But that is actually one of the advantages of single-byte encodings, as I have noted above. toUpper is a NOP for a single-byte encoding string with an Asian script, you can't do that with a UTF-8 string.
But you have to check what encoding it's in and given that not all codepages are that simple to upper case some generic algorithm is required.
 If they're screwing up something so simple,
 imagine how much worse everyone is screwing up something complex like
 UTF-8?
UTF-8 is pretty darn simple. BTW all it does is map [0..10FFFF] to a sequence of octets. It does it pretty well and compatible with ASCII, even the little rant you posted acknowledged that. Now you are either against Unicode as whole or what?
The BOM link I gave notes that UTF-8 isn't always ASCII-compatible. There are two parts to Unicode. I don't know enough about UCS, the character set, ;) to be for it or against it, but I acknowledge that a standardized character set may make sense. I am dead set against the UTF-8 variable-width encoding, for all the reasons listed above.
Okay we are getting somewhere, now that I understand your position and got myself confused in the midway there.
 On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
 25-May-2013 13:05, Joakim пишет:
 Nobody is talking about going back to code pages.  I'm talking about
 going to single-byte encodings, which do not imply the problems that you
 had with code pages way back when.
Problem is what you outline is isomorphic with code-pages. Hence the grief of accumulated experience against them.
They may seem superficially similar but they're not. For example, from the beginning, I have suggested a more complex header that can enable multi-language strings, as one possible solution. I don't think code pages provided that.
The problem is how would you define an uppercase algorithm for multilingual string with 3 distinct 256 codespaces (windows)? I bet it's won't be pretty.
 Well if somebody get a quest to redefine UTF-8 they *might* come up
 with something that is a bit faster to decode but shares the same
 properties. Hardly a life saver anyway.
Perhaps not, but I suspect programmers will flock to a constant-width encoding that is much simpler and more efficient than UTF-8. Programmer productivity is the biggest loss from the complexity of UTF-8, as I've noted before.
I still don't see how your solution scales to beyond 256 different codepoints per string (= multiple pages/parts of UCS ;) ). -- Dmitry Olshansky
May 25 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
 You can map a codepage to a subset of UCS :)
 That's what they do internally anyway.
 If I take you right you propose to define string as a header 
 that denotes a set of windows in code space? I still fail to 
 see how that would scale see below.
Something like that. For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings. This is just off the top of my head, I'm not suggesting it is definitive.
 Mm... strictly speaking (let's turn that argument backwards) - 
 what are algorithms that require slicing say [5..$] of string 
 without ever looking at it left to right, searching etc.?
Don't know, I was just pointing out that all the claims of easy slicing with UTF-8 are wrong. But a single-byte encoding would be scanned much faster also, as I've noted above, no decoding necessary and single bytes will always be faster than multiple bytes, even without decoding.
 How would it look like? Or how the processing will go?
Detailed a bit above. As I mentioned earlier in this thread, functions like toUpper would execute much faster because you wouldn't have to scan substrings containing languages that don't have uppercase, which you have to scan in UTF-8.
 long before you mentioned this unicode compression
 scheme.
It does inline headers or rather tags. That hop between fixed char windows. It's not random-access nor claims to be.
I wasn't criticizing it, just saying that it seems to be superficially similar to my scheme. :)
 version of my single-byte encoding scheme!  You do raise a 
 good point:
 the only reason why we're likely using such a bad encoding in 
 UTF-8 is
 that nobody else wants to tackle this hairy problem.
Yup, where have you been say almost 10 years ago? :)
I was in grad school, avoiding writing my thesis. :) I'd never have thought I'd be discussing Unicode today, didn't even know what it was back then.
 Not necessarily.  But that is actually one of the advantages of
 single-byte encodings, as I have noted above.  toUpper is a 
 NOP for a
 single-byte encoding string with an Asian script, you can't do 
 that with
 a UTF-8 string.
But you have to check what encoding it's in and given that not all codepages are that simple to upper case some generic algorithm is required.
You have to check the language, but my point is that you can look at the header and know that toUpper has to do nothing for a single-byte-encoded string of an Asian script which doesn't have uppercase characters. With UTF-8, you have to decode the entire string to find that out.
 They may seem superficially similar but they're not.  For 
 example, from
 the beginning, I have suggested a more complex header that can 
 enable
 multi-language strings, as one possible solution.  I don't 
 think code
 pages provided that.
The problem is how would you define an uppercase algorithm for multilingual string with 3 distinct 256 codespaces (windows)? I bet it's won't be pretty.
How is it done now? It isn't pretty with UTF-8 now either, as some languages have uppercase characters and others don't. The version of toUpper for my encoding will be similar, but it will do less work, because it doesn't have to be invoked for every character in the string.
 I still don't see how your solution scales to beyond 256 
 different codepoints per string (= multiple pages/parts of UCS 
 ;) ).
I assume you're talking about Chinese, Korean, etc. alphabets? I mentioned those to Walter earlier, they would have a two-byte encoding. No way around that, but they would still be easier to deal with than UTF-8, because of the header.
May 25 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
25-May-2013 23:51, Joakim пишет:
 On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
 You can map a codepage to a subset of UCS :)
 That's what they do internally anyway.
 If I take you right you propose to define string as a header that
 denotes a set of windows in code space? I still fail to see how that
 would scale see below.
Something like that. For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings. This is just off the top of my head, I'm not suggesting it is definitive.
Runs away in horror :) It's mess even before you've got to details. Another point about using sometimes a 2-byte encoding - welcome to the nice world of BigEndian/LittleEndian i.e. the very trap UTF-16 has stepped into. -- Dmitry Olshansky
May 25 2013
parent "Joakim" <joakim airpost.net> writes:
On Saturday, 25 May 2013 at 19:58:25 UTC, Dmitry Olshansky wrote:
 Runs away in horror :) It's mess even before you've got to 
 details.
Perhaps it's fatally flawed, but I don't see an argument for why, so I'll assume you can't find such a flaw. It is still _much less_ messy than UTF-8, that is the critical distinction.
 Another point about using sometimes a 2-byte encoding - welcome 
 to the nice world of BigEndian/LittleEndian i.e. the very trap 
 UTF-16 has stepped into.
I don't think this is a sizable obstacle. It takes some coordination, but it is a minor issue. On Saturday, 25 May 2013 at 20:20:11 UTC, Juan Manuel Cabo wrote:
 You obviously are not thinking it through. Such encoding would 
 have a O(n^2) complexity for appending a character/symbol in a 
 different language to the string, since you would have to 
 update the beginning of the string, and move the contents 
 forward to make room. Not to mention that it wouldn't be 
 backwards compatible with ascii routines, and the complexity of 
 such a header would be have to be carried all the way to font 
 rendering routines in the OS.
You obviously have not read the rest of the thread, both your non-font-related assertions have been addressed earlier. I see no reason why a single-byte encoding of UCS would have to be carried to "font rendering routines" but UTF-8 wouldn't be.
 Multiple languages/symbols in one string is a blessing of 
 modern humane computing. It is the norm more than the exception 
 in most of the world.
I disagree, but in any case, most of this thread refers to multi-language strings. The argument is about how best to encode them. On Saturday, 25 May 2013 at 20:47:25 UTC, Peter Alexander wrote:
 On Saturday, 25 May 2013 at 14:58:02 UTC, Joakim wrote:
 On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander 
 wrote:
 I suggest you read up on UTF-8. You really don't understand 
 it. There is no need to decode, you just treat the UTF-8 
 string as if it is an ASCII string.
Not being aware of this shortcut doesn't mean not understanding UTF-8.
It's not just a shortcut, it is absolutely fundamental to the design of UTF-8. It's like saying you understand Lisp without being aware that everything is a list.
It is an accidental shortcut because of the encoding scheme chosen for UTF-8 and, as I've noted, still less efficient than similarly searching a single-byte encoding. The fact that you keep trumpeting this silly detail as somehow "fundamental" suggests you have no idea what you're talking about.
 Also, you continuously keep stating disadvantages to UTF-8 that 
 are completely false, like "slicing does require decoding". 
 Again, completely missing the point of UTF-8. I cannot conceive 
 how you can claim to understand how UTF-8 works yet repeatedly 
 demonstrating that you do not.
Slicing on code points requires decoding, I'm not sure how you don't know that. If you mean slicing by byte, that is not only useless, but _every_ encoding can do that. I cannot conceive how you claim to defend UTF-8, yet keep making such stupid points, that you don't even bother backing up.
 You are either ignorant or a successful troll. In either case, 
 I'm done here.
Must be nice to just insult someone who has demolished your arguments and leave. Good riddance, you weren't adding anything.
May 26 2013
prev sibling next sibling parent "Juan Manuel Cabo" <juanmanuel.cabo gmail.com> writes:
On Saturday, 25 May 2013 at 19:51:43 UTC, Joakim wrote:
 On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky 
 wrote:
 You can map a codepage to a subset of UCS :)
 That's what they do internally anyway.
 If I take you right you propose to define string as a header 
 that denotes a set of windows in code space? I still fail to 
 see how that would scale see below.
Something like that. For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings. This is just off the top of my head, I'm not suggesting it is definitive.
You obviously are not thinking it through. Such encoding would have a O(n^2) complexity for appending a character/symbol in a different language to the string, since you would have to update the beginning of the string, and move the contents forward to make room. Not to mention that it wouldn't be backwards compatible with ascii routines, and the complexity of such a header would be have to be carried all the way to font rendering routines in the OS. Multiple languages/symbols in one string is a blessing of modern humane computing. It is the norm more than the exception in most of the world. --jm
May 25 2013
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, May 25, 2013 at 09:51:42PM +0200, Joakim wrote:
 On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
If I take you right you propose to define string as a header that
denotes a set of windows in code space? I still fail to see how
that would scale see below.
Something like that. For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings. This is just off the top of my head, I'm not suggesting it is definitive.
[...] And just how exactly does that help with slicing? If anything, it makes slicing way hairier and error-prone than UTF-8. In fact, this one point alone already defeated any performance gains you may have had with a single-byte encoding. Now you can't do *any* slicing at all without convoluted algorithms to determine what encoding is where at the endpoints of your slice, and the resulting slice must have new headers to indicate the start/end of every different-language substring. By the time you're done with all that, you're going way slower than processing UTF-8. Again I say, I'm not 100% sold on UTF-8, but what you're proposing here is far worse. T -- The best compiler is between your ears. -- Michael Abrash
May 25 2013
parent reply "Joakim" <joakim airpost.net> writes:
For some reason this posting by H. S. Teoh shows up on the 
mailing list but not on the forum.

On Sat May 25 13:42:10 PDT 2013, H. S. Teoh wrote:
On Sat, May 25, 2013 at 10:07:41AM +0200, Joakim wrote:
 The vast majority of non-english alphabets in UCS can be 
 encoded in
 a single byte.  It is your exceptions that are not relevant.
I'll have you know that Chinese, Korean, and Japanese account for a significant percentage of the world's population, and therefore arguments about "vast majority" are kinda missing the forest for the trees. If you count the number of *alphabets* that can be encoded in a single byte, you can get a majority, but that in no way reflects actual usage.
Not just "a majority," the vast majority of alphabets, representing 85% of the world's population.
The only alternatives to a variable width encoding I can see 
are:
- Single code page per string
This is completely useless because now you can't concatenate
strings of different code pages.
I wouldn't be so fast to ditch this. There is a real argument to be made that strings of different languages are sufficiently different that there should be no multi-language strings. Is this the best route? I'm not sure, but I certainly wouldn't dismiss it out of hand.
This is so patently absurd I don't even know how to begin to answer... have you actually dealt with any significant amount of text at all? A large amount of text in today's digital world are at least bilingual, if not more. Even in pure English text, you occasionally need a foreign letter in order to transcribe a borrowed/quoted word, e.g., "cliché", "naïve", etc.. Under your scheme, it would be impossible to encode any text that contains even a single instance of such words. All it takes is *one* word in a 500-page text and your scheme breaks down, and we're back to the bad ole days of codepages. And yes you can say "well just include é and ï in the English code page". But then all it takes is a single math formula that requires a Greek letter, and your text is non-encodable anymore. By the time you pull in all the French, German, Greek letters and math symbols, you might as well just go back to UTF-8.
I think you misunderstand what this implies. I mentioned it earlier as another possibility to Walter, "keep all your strings in a single language, with a different format to compose them together." Nobody is talking about disallowing alphabets other than English or going back to code pages. The fundamental question is whether it makes sense to combine all these different alphabets and their idiosyncratic rules into a single string and encoding. There is a good argument to be made that the differences outweigh the similarities and you'd be better off keeping each language/alphabet in its own string. It's a question of modeling, just like a class hierarchy. As I said, I'm not sure this is the best route, but it has some real strengths.
 The alternative is to have embedded escape sequences for the 
 rare
 foreign letter/word that you might need, but then you're back 
 to being
 unable to slice the string at will, since slicing it at the 
 wrong place
 will produce gibberish.
No one has presented this as a viable option.
 I'm not saying UTF-8 (or UTF-16, etc.) is panacea -- there are 
 things
 about it that are annoying, but it's certainly better than the 
 scheme
 you're proposing.
I disagree. On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
 And just how exactly does that help with slicing? If anything, 
 it makes
 slicing way hairier and error-prone than UTF-8. In fact, this 
 one point
 alone already defeated any performance gains you may have had 
 with a
 single-byte encoding. Now you can't do *any* slicing at all 
 without
 convoluted algorithms to determine what encoding is where at the
 endpoints of your slice, and the resulting slice must have new 
 headers
 to indicate the start/end of every different-language 
 substring. By the
 time you're done with all that, you're going way slower than 
 processing
 UTF-8.
There are no convoluted algorithms, it's a simple check if the string contains any two-bye encodings, a check which can be done once and cached. If it's single-byte all the way through, no problems whatsoever with slicing. If there are two-byte languages included, the slice function will have to do a little arithmetic calculation before slicing. You will also need a few arithmetic ops to create the new header for the slice. The point is that these operations will be much faster than decoding every code point to slice UTF-8.
 Again I say, I'm not 100% sold on UTF-8, but what you're 
 proposing here
 is far worse.
Well, I'm glad you realize some problems with UTF-8, :) even if you dismiss my alternative out of hand.
May 26 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, May 26, 2013 at 11:59:19AM +0200, Joakim wrote:
 On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
And just how exactly does that help with slicing? If anything, it
makes slicing way hairier and error-prone than UTF-8. In fact, this
one point alone already defeated any performance gains you may have
had with a single-byte encoding. Now you can't do *any* slicing at
all without convoluted algorithms to determine what encoding is where
at the endpoints of your slice, and the resulting slice must have new
headers to indicate the start/end of every different-language
substring.  By the time you're done with all that, you're going way
slower than processing UTF-8.
There are no convoluted algorithms, it's a simple check if the string contains any two-bye encodings, a check which can be done once and cached.
IHBT. You said that to handle multilanguage strings, your header would have a list of starting/ending points indicating which encoding should be used for which substring(s). That has nothing to do with two-byte encodings. So, please show us the code: given a string containing, say, English and French substrings, what will the header look like? And what's the algorithm to take a slice of such a string?
 If it's single-byte all the way through, no problems whatsoever with
 slicing.
Huh?! How are there no problems with slicing? Let's say you have a string that contains both English and French. According to your scheme, you'll have some kind of header format that lets you say bytes 0-123 are English, bytes 124-129 are French, and bytes 130-200 are English. Now let's say I want a substring from 120 to 125. How would this be done? And what about if I want a substring from 120 to 140? Or 126 to 130? What if the string contains several runs of French? Please show us the code.
 If there are two-byte languages included, the slice function will have
 to do a little arithmetic calculation before slicing.  You will also
 need a few arithmetic ops to create the new header for the slice.  The
 point is that these operations will be much faster than decoding every
 code point to slice UTF-8.
You haven't proven that this "little arithmetic calculation" will be faster than manipulating UTF-8. What if I have an English text that contains quotations of Chinese, French, and Greek snippets? Math symbols? Please show us (1) how such a string should be encoded under your scheme, and (2) the code will slice such a string in an efficient way, according to your proposed encoding scheme. (And before you dismiss such a string as unlikely or write it off as rare, consider a technical math paper that cites the work of Chinese and French authors -- a rather common thing these days. You'd need the extra characters just to be able to cite their names, even if none of the actual Chinese or French is quoted verbatim. Greek in general is used all over math anyway, since for whatever reason mathematicians just love Greek symbols, so it pretty much needs to be included by default.)
Again I say, I'm not 100% sold on UTF-8, but what you're proposing
here is far worse.
Well, I'm glad you realize some problems with UTF-8, :) even if you dismiss my alternative out of hand.
Clearly, we're not seeing what you're seeing here. So instead of making general statements about the superiority of your scheme, you might want to show us the actual code. So far, I haven't seen anything that convinces me that your scheme is any better. In fact, from what I can see, it's a lot worse, and you're just evading pointed questions about how to address those problems. Maybe that's a wrong perception, but not having any actual code to look at, I'm having a hard time believing your claims. Right now I'm leaning towards agreeing with Walter that you're just trolling us (and rather successfully at that). So, please show us the code. Otherwise, I think I should just stop responding, as we're obviously not on the same page and this discussion isn't getting anywhere. T -- Some ideas are so stupid that only intellectuals could believe them. -- George Orwell
May 26 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 14:37:27 UTC, H. S. Teoh wrote:
 IHBT. You said that to handle multilanguage strings, your
Pretty funny how you claim you've been trolled and then go on to make a bunch of trolling arguments, which seem to imply you have no idea how a single-byte encoding works. I'm not going to bother explaining it to you, anyone who knows encodings can easily figure it out from what I've said so far.
 Clearly, we're not seeing what you're seeing here. So instead 
 of making
 general statements about the superiority of your scheme, you 
 might want
 to show us the actual code.  So far, I haven't seen anything 
 that
 convinces me that your scheme is any better.  In fact, from 
 what I can
 see, it's a lot worse, and you're just evading pointed 
 questions about
 how to address those problems.  Maybe that's a wrong 
 perception, but not
 having any actual code to look at, I'm having a hard time 
 believing your
 claims. Right now I'm leaning towards agreeing with Walter that 
 you're
 just trolling us (and rather successfully at that).
When someone makes arguments that fly over your head, that's not trolling, that's you not understanding what they're saying. I have demolished every claim that has been made about single-byte encoding being worse. If you can't understand my arguments, you need to go out and learn some more about these issues.
 So, please show us the code. Otherwise, I think I should just 
 stop
 responding, as we're obviously not on the same page and this 
 discussion
 isn't getting anywhere.
I've made my position clear: I don't write toy code. It will take too long for the kind of encoding I have in mind, so it isn't worth my time, and if you can't understand the higher-level technical language I'm using in these posts, you won't understand the code anyway. I have adequately sketched what I'd do, so that anyone proficient in the art can reason about what the consequences of such a scheme would be. Perhaps that doesn't include Walter and you. I don't know why you'd want to keep responding to someone you think is trolling you anyway.
May 26 2013
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 26 May 2013 at 15:23:33 UTC, Joakim wrote:
 On Sunday, 26 May 2013 at 14:37:27 UTC, H. S. Teoh wrote:
 IHBT.
 I've made my position clear: I don't write toy code.
1. Make extraordinary claims 2. Refuse to back up said claims with small examples because "I don't write toy code" 3. Refuse to back up said claims with elaborate examples because "It will take too long" 4. Use arrogant tone throughout thread, imply that you're smarter than the creators of UTF, and creators and long-time contributors of D (never contribute code to D yourself) Result: 70-post thread Conclusion: Successful troll is successful :)
May 26 2013
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
26-May-2013 20:54, Vladimir Panteleev пишет:
 On Sunday, 26 May 2013 at 15:23:33 UTC, Joakim wrote:
 On Sunday, 26 May 2013 at 14:37:27 UTC, H. S. Teoh wrote:
 IHBT.
 I've made my position clear: I don't write toy code.
1. Make extraordinary claims 2. Refuse to back up said claims with small examples because "I don't write toy code" 3. Refuse to back up said claims with elaborate examples because "It will take too long" 4. Use arrogant tone throughout thread, imply that you're smarter than the creators of UTF, and creators and long-time contributors of D (never contribute code to D yourself) Result: 70-post thread Conclusion: Successful troll is successful :)
+1 Result: 71-post thread ;) -- Dmitry Olshansky
May 26 2013
prev sibling parent reply "Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 16:54:53 UTC, Vladimir Panteleev wrote:
 1. Make extraordinary claims
What is extraordinary about "UTF-8 is shit?" It is obviously so.
 2. Refuse to back up said claims with small examples because "I 
 don't write toy code"
I never refused small examples. I have provided several analyses of how a single-byte encoding would compare to UTF-8, along with listing optimizations that make it much faster. I finally refused to analyze Teoh's examples because he accused me of trolling and demanded code as the only possible explanation.
 3. Refuse to back up said claims with elaborate examples 
 because "It will
 take too long"
You are confused. What I said is "I don't write toy code, non-toy code would take too long, and you wouldn't understand it anyway." The whole demand for code is idiotic anyway. If I outlined TCP/IP as a packet-switched network and briefly sketched what the header might look like and the queuing algorithms that I might use, I can just imagine you saying, "But there's no code... how can I possibly understand what you're saying without any code?" If you can't understand networking without seeing working code, you're not equipped to understand it anyway, same here.
 4. Use arrogant tone throughout thread, imply that you're 
 smarter than the creators of UTF, and creators and long-time 
 contributors of D (never contribute code to D yourself)
Hey, if the shoe fits. :) I actually had a lot of respect for Walter till I read this thread. I can only assume that his past experience with code pages was so maddening that he cannot be rational on the subject of going to any single-byte encoding that would be similar, same with others griping about code pages above. I also don't think he and others are paying much attention to the various points I'm raising, hence his recent claim that I wouldn't handle Chinese, when I addressed that from the beginning. Or it could just be that I'm much smarter than everybody else in this thread, ;) I can't rule it out given the often silly responses I've been getting.
 Result: 70-post thread

 Conclusion: Successful troll is successful :)
Conclusion: Vladimir trolls me because he doesn't understand what I'm talking about, which is why he doesn't raise a single technical point in this post.
May 26 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/26/13 1:45 PM, Joakim wrote:
 What is extraordinary about "UTF-8 is shit?" It is obviously so.
Congratulations, you are literally the only person on the Internet who said so: http://goo.gl/TFhUO On 5/26/13 1:45 PM, Joakim wrote:
 Or it could just be that I'm much smarter than everybody else in this
 thread, ;) I can't rule it out given the often silly responses I've been
 getting.
One odd thing about this thread is it's extremely rare that most everybody in this forum raises like one to the same opinion. Usually it's like whatever the topic, a debate will ensue between two ad-hoc groups. It has become clear that people involved in this have gotten too frustrated to have a constructive exchange. I suggest we collectively drop it. What you may want to do is to use D's modeling abilities to define a great string type pursuant to your ideas. If it is as good as you believe it could, then it will enjoy use and adoption and everybody will be better off. Andrei
May 26 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu wrote:
 On 5/26/13 1:45 PM, Joakim wrote:
 What is extraordinary about "UTF-8 is shit?" It is obviously 
 so.
Congratulations, you are literally the only person on the Internet who said so: http://goo.gl/TFhUO
Haha, that is funny, :D though "unicode is shit" returns at least 8 results. How many people even know how UTF-8 works? Given how few people use it, I'm not surprised most don't know enough about how it works to criticize it.
 On 5/26/13 1:45 PM, Joakim wrote:
 Or it could just be that I'm much smarter than everybody else 
 in this
 thread, ;) I can't rule it out given the often silly responses 
 I've been
 getting.
One odd thing about this thread is it's extremely rare that most everybody in this forum raises like one to the same opinion. Usually it's like whatever the topic, a debate will ensue between two ad-hoc groups.
I suspect it's because I'm presenting an original idea about a not well-understood technology, Unicode, not the usual "emacs vs vim" or "D should not have null references" argument. For example, how many here know what UCS is? Most people never dig into Unicode, it's just a black box that is annoying to deal with.
 It has become clear that people involved in this have gotten 
 too frustrated to have a constructive exchange. I suggest we 
 collectively drop it. What you may want to do is to use D's 
 modeling abilities to define a great string type pursuant to 
 your ideas. If it is as good as you believe it could, then it 
 will enjoy use and adoption and everybody will be better off.
I agree. I am enjoying your book, btw.
May 26 2013
next sibling parent reply "Mr. Anonymous" <mailnew4ster gmail.com> writes:
On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
 On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu 
 wrote:
 On 5/26/13 1:45 PM, Joakim wrote:
 What is extraordinary about "UTF-8 is shit?" It is obviously 
 so.
Congratulations, you are literally the only person on the Internet who said so: http://goo.gl/TFhUO
Haha, that is funny, :D though "unicode is shit" returns at least 8 results. How many people even know how UTF-8 works? Given how few people use it, I'm not surprised most don't know enough about how it works to criticize it.
On the other hand: https://www.google.com/search?q=%22utf-8+is+awesome%22 :D
May 26 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
 On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
 On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu 
 wrote:
 On 5/26/13 1:45 PM, Joakim wrote:
 What is extraordinary about "UTF-8 is shit?" It is obviously 
 so.
Congratulations, you are literally the only person on the Internet who said so: http://goo.gl/TFhUO
Haha, that is funny, :D though "unicode is shit" returns at least 8 results. How many people even know how UTF-8 works? Given how few people use it, I'm not surprised most don't know enough about how it works to criticize it.
On the other hand: https://www.google.com/search?q=%22utf-8+is+awesome%22
I'm not sure if you were trying to make my point, but you just did. There are only 19 results for that search string. If UTF-8 were such a rousing success and most developers found it easy to understand, you wouldn't expect only 19 results for it and 8 against it. The paucity of results suggests most don't know how it works or perhaps simply annoyed by it, liking the internationalization but disliking the complexity.
May 26 2013
next sibling parent reply "Mr. Anonymous" <mailnew4ster gmail.com> writes:
On Sunday, 26 May 2013 at 19:25:37 UTC, Joakim wrote:
 On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
 On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
 On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu 
 wrote:
 On 5/26/13 1:45 PM, Joakim wrote:
 What is extraordinary about "UTF-8 is shit?" It is 
 obviously so.
Congratulations, you are literally the only person on the Internet who said so: http://goo.gl/TFhUO
Haha, that is funny, :D though "unicode is shit" returns at least 8 results. How many people even know how UTF-8 works? Given how few people use it, I'm not surprised most don't know enough about how it works to criticize it.
On the other hand: https://www.google.com/search?q=%22utf-8+is+awesome%22
I'm not sure if you were trying to make my point, but you just did. There are only 19 results for that search string. If UTF-8 were such a rousing success and most developers found it easy to understand, you wouldn't expect only 19 results for it and 8 against it. The paucity of results suggests most don't know how it works or perhaps simply annoyed by it, liking the internationalization but disliking the complexity.
Man, you're a bullshit machine!
May 26 2013
parent "Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 19:38:21 UTC, Mr. Anonymous wrote:
 On Sunday, 26 May 2013 at 19:25:37 UTC, Joakim wrote:
 I'm not sure if you were trying to make my point, but you just 
 did.  There are only 19 results for that search string.  If 
 UTF-8 were such a rousing success and most developers found it 
 easy to understand, you wouldn't expect only 19 results for it 
 and 8 against it.  The paucity of results suggests most don't 
 know how it works or perhaps simply annoyed by it, liking the 
 internationalization but disliking the complexity.
Man, you're a bullshit machine!
What can I say? I'm very good at interpreting bad data. ;)
May 26 2013
prev sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Sun, 26 May 2013 21:25:36 +0200
schrieb "Joakim" <joakim airpost.net>:

 On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
 On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
 On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu 
 wrote:
 On 5/26/13 1:45 PM, Joakim wrote:
 What is extraordinary about "UTF-8 is shit?" It is obviously 
 so.
Congratulations, you are literally the only person on the Internet who said so: http://goo.gl/TFhUO
Haha, that is funny, :D though "unicode is shit" returns at least 8 results. How many people even know how UTF-8 works? Given how few people use it, I'm not surprised most don't know enough about how it works to criticize it.
On the other hand: https://www.google.com/search?q=%22utf-8+is+awesome%22
I'm not sure if you were trying to make my point, but you just did. There are only 19 results for that search string. If UTF-8 were such a rousing success and most developers found it easy to understand, you wouldn't expect only 19 results for it and 8 against it. The paucity of results suggests most don't know how it works or perhaps simply annoyed by it, liking the internationalization but disliking the complexity.
Lol, https://www.google.com/search?q=%22utf-8+is+the+best%22 -- Marco
May 29 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Wednesday, 29 May 2013 at 23:40:51 UTC, Marco Leise wrote:
 Am Sun, 26 May 2013 21:25:36 +0200
 schrieb "Joakim" <joakim airpost.net>:

 On Sunday, 26 May 2013 at 19:11:42 UTC, Mr. Anonymous wrote:
 On Sunday, 26 May 2013 at 19:05:32 UTC, Joakim wrote:
 On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu 
 wrote:
 On 5/26/13 1:45 PM, Joakim wrote:
 What is extraordinary about "UTF-8 is shit?" It is 
 obviously so.
Congratulations, you are literally the only person on the Internet who said so: http://goo.gl/TFhUO
Haha, that is funny, :D though "unicode is shit" returns at least 8 results. How many people even know how UTF-8 works? Given how few people use it, I'm not surprised most don't know enough about how it works to criticize it.
On the other hand: https://www.google.com/search?q=%22utf-8+is+awesome%22
I'm not sure if you were trying to make my point, but you just did. There are only 19 results for that search string. If UTF-8 were such a rousing success and most developers found it easy to understand, you wouldn't expect only 19 results for it and 8 against it. The paucity of results suggests most don't know how it works or perhaps simply annoyed by it, liking the internationalization but disliking the complexity.
Lol, https://www.google.com/search?q=%22utf-8+is+the+best%22
Your point is? 121 results, including false positives like "utf-8 is the best guess." If you look at the results, almost all make the pragmatic recommendation that UTF-8 is the best _for now_, because it is better supported than other multi-language formats. That's like saying Windows is the best OS because it's easier to find one in your local computer store. Yet again, the fact that even this somewhat ambiguous search string has only 121 results is damning of anyone liking UTF-8, nothing else, given the many thousands of programmmers that are forced to use Unicode if they want to internationalize.
May 30 2013
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 30 May 2013 09:19:32 +0200
schrieb "Joakim" <joakim airpost.net>:

 Your point is?  121 results, including false positives like
 "utf-8 is the best guess."  If you look at the results, almost
 all make the pragmatic recommendation that UTF-8 is the best _for
 now_, because it is better supported than other multi-language
 formats.  That's like saying Windows is the best OS because it's
 easier to find one in your local computer store.
 
 Yet again, the fact that even this somewhat ambiguous search
 string has only 121 results is damning of anyone liking UTF-8,
 nothing else, given the many thousands of programmmers that are
 forced to use Unicode if they want to internationalize.
Alright, for me it said ~6.570.000 results, which I found funny. I'm not trying to make a point, but to troll. If there is a point to be made, then that the count of search results is a _very_ rough estimate. -- Marco
May 30 2013
prev sibling parent reply Marcin Mstowski <marmyst gmail.com> writes:
Character Data Representation
Architecture<http://www-01.ibm.com/software/globalization/cdra/>by
IBM. It is what you want to do with additions and it is available
since
1995.
When you come up with an inventive idea, i suggest you to first check what
was already done in that area and then rethink this again to check if you
can do this better or improve existing solution. Other approaches are
usually waste of time and efforts, unless you are doing this for fun or you
can't use existing solutions due to problems with license, copyrights,
price, etc.


On Sun, May 26, 2013 at 9:05 PM, Joakim <joakim airpost.net> wrote:

 On Sunday, 26 May 2013 at 18:29:38 UTC, Andrei Alexandrescu wrote:

 On 5/26/13 1:45 PM, Joakim wrote:

 What is extraordinary about "UTF-8 is shit?" It is obviously so.
Congratulations, you are literally the only person on the Internet who said so: http://goo.gl/TFhUO
Haha, that is funny, :D though "unicode is shit" returns at least 8 results. How many people even know how UTF-8 works? Given how few people use it, I'm not surprised most don't know enough about how it works to criticize it. On 5/26/13 1:45 PM, Joakim wrote:
 Or it could just be that I'm much smarter than everybody else in this
 thread, ;) I can't rule it out given the often silly responses I've been
 getting.
One odd thing about this thread is it's extremely rare that most everybody in this forum raises like one to the same opinion. Usually it's like whatever the topic, a debate will ensue between two ad-hoc groups.
I suspect it's because I'm presenting an original idea about a not well-understood technology, Unicode, not the usual "emacs vs vim" or "D should not have null references" argument. For example, how many here know what UCS is? Most people never dig into Unicode, it's just a black box that is annoying to deal with. It has become clear that people involved in this have gotten too
 frustrated to have a constructive exchange. I suggest we collectively drop
 it. What you may want to do is to use D's modeling abilities to define a
 great string type pursuant to your ideas. If it is as good as you believe
 it could, then it will enjoy use and adoption and everybody will be better
 off.
I agree. I am enjoying your book, btw.
May 26 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 19:20:15 UTC, Marcin Mstowski wrote:
 Character Data Representation
 Architecture<http://www-01.ibm.com/software/globalization/cdra/>by
 IBM. It is what you want to do with additions and it is 
 available
 since
 1995.
 When you come up with an inventive idea, i suggest you to first 
 check what
 was already done in that area and then rethink this again to 
 check if you
 can do this better or improve existing solution. Other 
 approaches are
 usually waste of time and efforts, unless you are doing this 
 for fun or you
 can't use existing solutions due to problems with license, 
 copyrights,
 price, etc.
You might be right, but I gave it a quick look and can't make out what the encoding actually is. There is an appendix that lists several possible encodings, including UTF-8! Also, one of the first pages talks about representations of floating point and integer numbers, which are outside the purview of the text encodings we're talking about. I cannot possibly be expected to know about every dead format out there. If you can show that it is materially similar to my single-byte encoding idea, it might be worth looking into.
May 26 2013
parent reply Marcin Mstowski <marmyst gmail.com> writes:
On Sun, May 26, 2013 at 9:42 PM, Joakim <joakim airpost.net> wrote:

 On Sunday, 26 May 2013 at 19:20:15 UTC, Marcin Mstowski wrote:

 Character Data Representation
 Architecture<http://www-01.**ibm.com/software/**globalization/cdra/<http://www-01.ibm.com/software/globalization/cdra/>
by
IBM. It is what you want to do with additions and it is available since 1995. When you come up with an inventive idea, i suggest you to first check what was already done in that area and then rethink this again to check if you can do this better or improve existing solution. Other approaches are usually waste of time and efforts, unless you are doing this for fun or you can't use existing solutions due to problems with license, copyrights, price, etc.
You might be right, but I gave it a quick look and can't make out what the encoding actually is. There is an appendix that lists several possible encodings, including UTF-8!
Yes, because they didn't reinvent wheel from scratch and are reusing existing encodings as a base. There isn't any problem with adding another code page.
 Also, one of the first pages talks about representations of floating point
 and integer numbers, which are outside the purview of the text encodings
 we're talking about.
They are outside of scope of CDRA too. At least read picture description before making out of context assumptions.
 I cannot possibly be expected to know about every dead format out there.
Nobody expect that.
 If you can show that it is materially similar to my single-byte encoding
 idea, it might be worth looking into.
Spending ~15 min to read Introduction isn't worth your time, so why should i waste my time showing you anything ?
May 26 2013
parent reply "Joakim" <joakim airpost.net> writes:
On Sunday, 26 May 2013 at 21:08:40 UTC, Marcin Mstowski wrote:
 On Sun, May 26, 2013 at 9:42 PM, Joakim <joakim airpost.net> 
 wrote:
 Also, one of the first pages talks about representations of 
 floating point
 and integer numbers, which are outside the purview of the text 
 encodings
 we're talking about.
They are outside of scope of CDRA too. At least read picture description before making out of context assumptions.
Which picture description did you have in mind? They all seem fairly generic. I do see now that one paragraph does say that CDRA only deals with graphical characters and that they were only talking about numbers earlier to introduce the topic of data representation.
 If you can show that it is materially similar to my 
 single-byte encoding
 idea, it might be worth looking into.
Spending ~15 min to read Introduction isn't worth your time, so why should i waste my time showing you anything ?
You claimed that my encoding was reinventing the wheel, therefore the onus is on you to show which of the multiple encodings CDRA uses that I'm reinventing. I'm not interested in delving into the docs for some dead IBM format to prove _your_ point. More likely, you are just dead wrong and CDRA simply uses code pages, which are not the same as the single-byte encoding with a header idea that I've sketched in this thread.
May 26 2013
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 27 May 2013 at 06:11:20 UTC, Joakim wrote:
 You claimed that my encoding was reinventing the wheel, 
 therefore the onus is on you to show which of the multiple 
 encodings CDRA uses that I'm reinventing.  I'm not interested 
 in delving into the docs for some dead IBM format to prove 
 _your_ point.
It's your idea and project. Showing that it is original / doing your research on previous efforts is probably something that *you* should do, whether or not it's someone else's "point".
 More likely, you are just dead wrong and CDRA simply uses code 
 pages
Based on what?
May 27 2013
parent "Joakim" <joakim airpost.net> writes:
On Monday, 27 May 2013 at 12:25:06 UTC, John Colvin wrote:
 On Monday, 27 May 2013 at 06:11:20 UTC, Joakim wrote:
 You claimed that my encoding was reinventing the wheel, 
 therefore the onus is on you to show which of the multiple 
 encodings CDRA uses that I'm reinventing.  I'm not interested 
 in delving into the docs for some dead IBM format to prove 
 _your_ point.
It's your idea and project. Showing that it is original / doing your research on previous efforts is probably something that *you* should do, whether or not it's someone else's "point".
Sure, some research is necessary. However, software is littered with past projects that never really got started or bureaucratic efforts, like CDRA appears to be, that never went anywhere. I can hardly be expected to go rummaging through all these efforts in the hopes that what, someone else has already written the code? If you have a brain, you can look at the currently popular approaches, which CDRA isn't, and come up with something that makes more sense. I don't much care if my idea is original, I care that it is better.
 More likely, you are just dead wrong and CDRA simply uses code 
 pages
Based on what?
Based on the fact that his link lists EBCDIC and several other antiquated code page encodings in its list of proposed encodings. If Marcin believes one of those is similar to my scheme, he should say which one, otherwise his entire line of argument is irrelevant. It's not up to me to prove _his_ point. Without having looked any of the encodings in detail, I'm fairly certain he's wrong. If he feels otherwise, he can pipe up with which one he had in mind. The fact that he hasn't speaks volumes.
May 27 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 12:51 PM, Joakim wrote:
 For a multi-language string encoding, the header would
 contain a single byte for every language used in the string, along with
multiple
 index bytes to signify the start and finish of every run of single-language
 characters in the string. So, a list of languages and a list of pure
 single-language substrings.
Please implement the simple C function strstr() with this simple scheme, and post it here. http://www.digitalmars.com/rtl/string.html#strstr
May 25 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/25/2013 2:51 PM, Walter Bright wrote:
 On 5/25/2013 12:51 PM, Joakim wrote:
 For a multi-language string encoding, the header would
 contain a single byte for every language used in the string, along with
multiple
 index bytes to signify the start and finish of every run of single-language
 characters in the string. So, a list of languages and a list of pure
 single-language substrings.
Please implement the simple C function strstr() with this simple scheme, and post it here. http://www.digitalmars.com/rtl/string.html#strstr
I'll go first. Here's a simple UTF-8 version in C. It's not the fastest way to do it, but at least it is correct: ---------------------------------- char *strstr(const char *s1,const char *s2) { size_t len1 = strlen(s1); size_t len2 = strlen(s2); if (!len2) return (char *) s1; char c2 = *s2; while (len2 <= len1) { if (c2 == *s1) if (memcmp(s2,s1,len2) == 0) return (char *) s1; s1++; len1--; } return NULL; }
May 25 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, May 25, 2013 at 10:07:41AM +0200, Joakim wrote:
[...]
 The vast majority of non-english alphabets in UCS can be encoded in
 a single byte.  It is your exceptions that are not relevant.
I'll have you know that Chinese, Korean, and Japanese account for a significant percentage of the world's population, and therefore arguments about "vast majority" are kinda missing the forest for the trees. If you count the number of *alphabets* that can be encoded in a single byte, you can get a majority, but that in no way reflects actual usage. [...]
The only alternatives to a variable width encoding I can see are:
- Single code page per string
This is completely useless because now you can't concatenate
strings of different code pages.
I wouldn't be so fast to ditch this. There is a real argument to be made that strings of different languages are sufficiently different that there should be no multi-language strings. Is this the best route? I'm not sure, but I certainly wouldn't dismiss it out of hand.
This is so patently absurd I don't even know how to begin to answer... have you actually dealt with any significant amount of text at all? A large amount of text in today's digital world are at least bilingual, if not more. Even in pure English text, you occasionally need a foreign letter in order to transcribe a borrowed/quoted word, e.g., "cliché", "naïve", etc.. Under your scheme, it would be impossible to encode any text that contains even a single instance of such words. All it takes is *one* word in a 500-page text and your scheme breaks down, and we're back to the bad ole days of codepages. And yes you can say "well just include é and ï in the English code page". But then all it takes is a single math formula that requires a Greek letter, and your text is non-encodable anymore. By the time you pull in all the French, German, Greek letters and math symbols, you might as well just go back to UTF-8. The alternative is to have embedded escape sequences for the rare foreign letter/word that you might need, but then you're back to being unable to slice the string at will, since slicing it at the wrong place will produce gibberish. I'm not saying UTF-8 (or UTF-16, etc.) is panacea -- there are things about it that are annoying, but it's certainly better than the scheme you're proposing. T -- Живёшь только однажды.
May 25 2013
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2013 at 02:54:30AM +0200, Torje Digernes wrote:
 On Tuesday, 28 May 2013 at 00:34:20 UTC, Manu wrote:
On 28 May 2013 09:05, Walter Bright <newshound2 digitalmars.com>
wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... for
example:

        void main(string[] args) {
                int число = 1;
                foreach (и; 0..100)
                        число += и;
                writeln(число);
        }

Of course, whether that's a good practice is a different
story. :)
I've recently come to the opinion that that's a bad idea, and D should not support it.
Why? You said previously that you'd love to support extended operators ;)
I find features such as support for uncommon symbols in variables a strength as it makes some physics formulas a bit easier to read in code form, which in my opinion is a good thing.
I think there's a difference between allowing math symbols (which includes things like (a subset of) Greek letters that mathematicians love) in identifiers, and allowing full Unicode. What if you're assigned to maintain code containing identifiers that has letters that don't appear in any of your installed fonts? I think it's OK to allow math symbols, but allowing the entire set of Unicode characters is going a bit too far, IMO. For one thing, if some code has identifiers written in Arabic, I wouldn't be able to understand the code, simply because I'd have a hard time telling different identifiers apart. Besides, if the rest of the language (keywords, Phobos, etc.) are in English, then I don't see any compelling reason to use a different language in identifiers, other than to submit IODCC entries. :-P C doesn't support Unicode identifiers, for one thing, but I've seen working C code written by people who barely understand any English -- it didn't stop them at all. (The comments were of course in their native language -- the compiler ignores everything inside anyway so 8-bit native encodings or even UTF-8 can be sneaked in without provoking compiler errors.) T -- WINDOWS = Will Install Needless Data On Whole System -- CompuMan
May 27 2013
parent "Torje Digernes" <torjehoa pvv.org> writes:
On Tuesday, 28 May 2013 at 01:17:37 UTC, H. S. Teoh wrote:
 On Tue, May 28, 2013 at 02:54:30AM +0200, Torje Digernes wrote:
 On Tuesday, 28 May 2013 at 00:34:20 UTC, Manu wrote:
On 28 May 2013 09:05, Walter Bright 
<newshound2 digitalmars.com>
wrote:

On 5/27/2013 3:18 PM, H. S. Teoh wrote:

Well, D *does* support non-English identifiers, y'know... 
for
example:

        void main(string[] args) {
                int число = 1;
                foreach (и; 0..100)
                        число += и;
                writeln(число);
        }

Of course, whether that's a good practice is a different
story. :)
I've recently come to the opinion that that's a bad idea, and D should not support it.
Why? You said previously that you'd love to support extended operators ;)
I find features such as support for uncommon symbols in variables a strength as it makes some physics formulas a bit easier to read in code form, which in my opinion is a good thing.
I think there's a difference between allowing math symbols (which includes things like (a subset of) Greek letters that mathematicians love) in identifiers, and allowing full Unicode. What if you're assigned to maintain code containing identifiers that has letters that don't appear in any of your installed fonts? I think it's OK to allow math symbols, but allowing the entire set of Unicode characters is going a bit too far, IMO. For one thing, if some code has identifiers written in Arabic, I wouldn't be able to understand the code, simply because I'd have a hard time telling different identifiers apart. Besides, if the rest of the language (keywords, Phobos, etc.) are in English, then I don't see any compelling reason to use a different language in identifiers, other than to submit IODCC entries. :-P C doesn't support Unicode identifiers, for one thing, but I've seen working C code written by people who barely understand any English -- it didn't stop them at all. (The comments were of course in their native language -- the compiler ignores everything inside anyway so 8-bit native encodings or even UTF-8 can be sneaked in without provoking compiler errors.) T
I think there is very little difference, both cases are artificially limiting the allowable symbols. Other symbols relevant in other fields which does not happen to use Greek symbols primarily, are they to be treated differently? What you propose is a built in code standard for D, based on your feelings on a topic. If what you fear is that unicode will suddenly make cooperation impossible I doubt you are right, after all there is all kind of ways to make terrible variable names (q,w,e,r ... qq,qw). If any such identifiers show up in a project I assume they are cleaned up, why wouldn't the same happen to unicode if they are causing problems? Think about it, it should happen even faster because the symbol might not be accessible for everyone, where a single/double letter gibberish one is perfectly reproducible and might grow into the project confusing every new reader. Are you going to argue for disallowing variables that are not a compound word or a dictionary word in English?
May 29 2013