
digitalmars.D - Wide characters support in D

reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
Note: I posted this already on runtime D list, but I think that list was a
wrong one for this question. Sorry for duplication :-)

Hi. I am new to D. It looks like D supports 3 types of characters: char, wchar,
dchar. This is cool, however, I have some questions about it:

1. When we have 2 methods (one with wchar[] and another with char[]), how
will D determine which one to use if I pass the string "hello world"?
2. Many libraries (e.g. tango or phobos) don't provide functions/methods (or
have incomplete support) for wchar/dchar
e.g. writefln probably assumes char[] for strings like "Number %d..."
3. Even if they do support it, it is kind of annoying to provide methods for all 3
types of chars. Especially if we want to use the native mode (e.g. for Windows
wchar is better, for Linux char is better). E.g. Windows has _wopen, _wdirent,
_wreaddir, _wopendir, _wmain(int argc, wchar_t[] argv) and so on, and they
should be native (in the sense that no conversion is necessary when we do, for
instance, _wopen). Linux doesn't have them, as UTF-8 is used widely there.

Since the D language is targeted at system programming, why not try to use
whatever works better on a particular system (e.g. char would be 2 bytes on
Windows and 1 byte on Linux; it could be a compiler switch, and all libraries
could be compiled properly on a particular system)? It's still necessary to have
all 3 types of char for cooperation with C, but in those cases byte, short and
int will do their work. For this kind of situation, it would be nice to have
some built-in functions for transparent conversion from char to byte/short/int
and vice versa (especially if conversion only happens when needed on a
particular platform).

In my opinion, separating the notion of character from byte would be nice, and it
makes sense, as a particular platform natively uses either UTF-8 or UTF-16.
Programmers could then write universal code (like TCHAR on Windows). Unfortunately,
C uses 'char' and 'byte' interchangeably, but why does D have to make this mistake again?

Sorry if my suggestion sounds odd. Anyway, it would be great to hear something
from D gurus :-)

Ruslan.


      
Jun 07 2010
next sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Ruslan Nikolaev <nruslan_devel yahoo.com> wrote:

 1. When we have 2 methods (one with wchar[] and another with char[]),  
 how D will determine which one to use if I pass a string "hello world"?

String literals in D(2) are of type immutable(char)[] (char[] in D1) by default, and thus will be handled by the char[]-version of the function. Should you want a string literal of a different type, append a c, w, or d to specify char[], wchar[] or dchar[]. Or use a cast.
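To make the suffixes concrete, here is a minimal D2 sketch (the variable names are arbitrary):

```d
void main()
{
    auto a = "hello";   // defaults to immutable(char)[], i.e. string
    auto w = "hello"w;  // immutable(wchar)[], i.e. wstring
    auto d = "hello"d;  // immutable(dchar)[], i.e. dstring

    static assert(is(typeof(a) == string));
    static assert(is(typeof(w) == wstring));
    static assert(is(typeof(d) == dstring));
}
```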
 Since D language is targeted on system programming, why not to try to  
 use whatever works better on a particular system (e.g. char will be 2  
 bytes on Windows and 1 byte on Linux; it can be a compiler switch, and  
 all libraries can be compiled properly on a particular system).

Because this leads to unportable code, that fails in unexpected ways when moved from one system to another, thus increasing rather than decreasing the cognitive load on the hapless programmer.
 It's still necessary to have all 3 types of char for cooperation with C.  
 But in those cases byte, short and int will do their work.

Absolutely not. One of the things D tries to do is get strings right. For that purpose, all 3 types are needed.
 In my opinion, to separate notion of character from byte would be nice,  
 and it makes sense as a particular platform uses either UTF-8 or UTF-16  
 natively. Programmers may write universal code (like TCHAR on Windows).  
 Unfortunately, C uses 'char' and 'byte' interchangeably but why D has to  
 make this mistake again?

D has not. A char is a character, a possibly incomplete UTF-8 codepoint, while a byte is a byte, a humble number in the order of -128 to +127. Yes, it is possible to abuse char in D, and byte likewise. D aims to allow programmers to program close to the metal if the programmer so wishes, and thus does not pretend char is an opaque type about which nothing can be known.

-- Simen
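A minimal D2 sketch of that distinction: a char[] carries UTF-8 code units, which foreach can decode into whole code points, while byte has no encoding at all (the sample string is arbitrary):

```d
void main()
{
    string s = "héllo";     // 'é' (U+00E9) occupies two UTF-8 code units
    assert(s.length == 6);  // .length counts code units (bytes), not characters

    size_t codePoints = 0;
    foreach (dchar c; s)    // foreach over dchar decodes UTF-8 on the fly
        ++codePoints;
    assert(codePoints == 5); // five characters, despite six bytes
}
```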
Jun 07 2010
prev sibling next sibling parent Robert Clipsham <robert octarineparrot.com> writes:
On 07/06/10 22:48, Ruslan Nikolaev wrote:
 Note: I posted this already on runtime D list, but I think that list
 was a wrong one for this question. Sorry for duplication :-)

 Hi. I am new to D. It looks like D supports 3 types of characters:
 char, wchar, dchar. This is cool, however, I have some questions
 about it:

 1. When we have 2 methods (one with wchar[] and another with char[]),
 how D will determine which one to use if I pass a string "hello
 world"?

If you pass "Hello World", this is always a string (char[] in D1, immutable(char)[] in D2). If you want to specify a type with a string literal, you can use "Hello World"w or "Hello World"d for wstring and dstring respectively.
 2. Many libraries (e.g. tango or phobos) don't provide
 functions/methods (or have incomplete support) for wchar/dchar e.g.
 writefln probably assumes char[] for strings like "Number %d..."

In Tango most, if not all, string functions are templated, so they work with all string types: char[], wchar[] and dchar[]. I don't know how well Phobos supports other string types; I know Phobos 1 is extremely limited for types other than char[], and I don't know about Phobos 2.
 3.
 Even if they do support, it is kind of annoying to provide methods
 for all 3 types of chars. Especially, if we want to use native mode
 (e.g. for Windows wchar is better, for Linux char is better). E.g.
 Windows has _wopen, _wdirent, _wreaddir, _wopenddir, _wmain(int argc,
 wchar_t[] argv) and so on, and they should be native (in a sense that
 no conversion is necessary when we do, for instance, _wopen). Linux
 doesn't have them as UTF-8 is used widely there.

Enter templates! You can write the function once and have it work with all three string types with little effort involved. All the lower level functions that interact with the operating system are abstracted away nicely for you in both Tango and Phobos, so you'll never have to deal with this for basic functions. For your own it's a simple matter of templating them in most cases.
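As a sketch of that, here is one hypothetical templated function (the name countSpaces and its constraint are illustrative, not from Tango or Phobos) that accepts all three string types:

```d
// One templated implementation covering char[], wchar[] and dchar[].
size_t countSpaces(Char)(const(Char)[] s)
    if (is(Char == char) || is(Char == wchar) || is(Char == dchar))
{
    size_t n = 0;
    foreach (c; s)        // iterates code units; fine for the ASCII space
        if (c == ' ')
            ++n;
    return n;
}

void main()
{
    assert(countSpaces("hello world")  == 1); // string
    assert(countSpaces("hello world"w) == 1); // wstring
    assert(countSpaces("hello world"d) == 1); // dstring
}
```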
 Since D language is targeted on system programming, why not to try to
 use whatever works better on a particular system (e.g. char will be 2
 bytes on Windows and 1 byte on Linux; it can be a compiler switch,
 and all libraries can be compiled properly on a particular system).
 It's still necessary to have all 3 types of char for cooperation with
 C. But in those cases byte, short and int will do their work. For
 this kind of situation, it would be nice to have some built-in
 functions for transparent conversion from char to byte/short/int and
 vice versa (especially, if conversion only happens if needed on a
 particular platform).

This is something C did wrong. If compilers are free to choose their own width for the string type, you end up with the mess C has, where every library introduces its own custom types to make sure they're the expected length, e.g. uint32_t etc. Having things the other way around makes life far easier: int is always 32 bits signed, for example, and the same applies to strings. You can use version blocks if you want to specify a type which changes based on platform; I wouldn't recommend it though, as it just makes life harder in the long run.
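For completeness, a minimal sketch of the version-block approach just described (the alias name tstring is hypothetical, echoing Windows' TCHAR; as noted above, this is not recommended):

```d
// Hypothetical platform-dependent string alias via version blocks.
version (Windows)
    alias tstring = wstring; // UTF-16 is the native Windows encoding
else
    alias tstring = string;  // UTF-8 on Linux and most other systems

void main()
{
    // String literals convert implicitly to any of the three string
    // types, so this line compiles on either branch.
    tstring greeting = "hello";
    assert(greeting.length == 5); // ASCII: same code-unit count either way
}
```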
 In my opinion, to separate notion of character from byte would be
 nice, and it makes sense as a particular platform uses either UTF-8
 or UTF-16 natively. Programmers may write universal code (like TCHAR
 on Windows). Unfortunately, C uses 'char' and 'byte' interchangeably
 but why D has to make this mistake again?

They are different types in D, so I'm not sure what you mean. byte/ubyte have no encoding associated with them; char is always UTF-8, wchar UTF-16, etc.

Robert
Jun 07 2010
prev sibling next sibling parent Ali Çehreli <acehreli yahoo.com> writes:
Ruslan Nikolaev wrote:

 1. When we have 2 methods (one with wchar[] and another with char[]), 

I asked the same question on the D.learn group recently. Literals like that don't have a particular encoding. The programmer must specify explicitly to resolve ambiguities: "hello world"c or "hello world"w.
 3. Even if they do support, it is kind of annoying to provide methods 

I think the solution is to take advantage of templates, and to use template constraints if the template parameter is too flexible. Another approach might be to use dchar within the application and use other encodings at the interfaces.

Ali
Jun 07 2010
prev sibling next sibling parent justin <justin economicmodeling.com> writes:
This doesn't answer all your questions and suggestions, but here goes.

In answer to #1, "Hello world" is a literal of type char[] (or string). If you
want to use UTF-16 or 32, use "Hello world"w and "Hello world"d respectively.

In partial answer to #2 and #3, it's generally pretty easy to adapt a string
function to support string, wstring, and dstring by using templating and the
fact that D can do automatic conversions for you. For instance:

string blah = "hello world";
foreach (dchar c; blah)   // guaranteed to get a full code point
{
    // do something with c
}
Jun 07 2010
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Ruslan Nikolaev wrote:
 Note: I posted this already on runtime D list,

Although D is designed to be fairly agnostic about character types, in practice I recommend the following:

1. Use the string type for strings; it's char[] on D1 and immutable(char)[] on D2.
2. Use dchar to hold individual characters.

The problem with wchar is that everyone forgets about surrogate pairs. Most UTF-16 programs in the wild, including nearly all Java programs, are broken with regard to surrogate pairs. The problem with dchar is that strings of them consume memory at a prodigious rate.
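The surrogate-pair hazard can be shown with a short D2 sketch, using U+1D11E (musical G clef), which needs two UTF-16 code units:

```d
void main()
{
    wstring s = "\U0001D11E"w; // one code point, encoded as a surrogate pair
    assert(s.length == 2);     // .length counts UTF-16 code units

    size_t codePoints = 0;
    foreach (dchar c; s)       // decoding foreach yields whole code points
        ++codePoints;
    assert(codePoints == 1);   // one character, despite two code units
}
```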
Jun 07 2010
next sibling parent Kagamin <spam here.lot> writes:
Walter Bright Wrote:

 The problem with wchar's is that everyone forgets about surrogate pairs. Most 
 UTF-16 programs in the wild, including nearly all Java programs, are broken
with 
 regard to surrogate pairs.

I'm afraid it will be pretty hard to show the bug. I don't know whether Java is particularly nasty here, but for C code it will be hard.
Jun 07 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 The problem with dchar's is strings of them consume 
 memory at a prodigious rate.

Warning: lazy musings ahead.

I hope we'll soon have computers with 200+ GB of RAM, where using strings with less than 32-bit chars is in most cases a premature optimization (like today it is often a silly optimization to use arrays of 16-bit ints instead of 32-bit or 64-bit ints; only special situations found with the profiler can justify the use of arrays of shorts in a low-level language).

Even in PCs with 200 GB of RAM the first levels of CPU cache can be very small (like 32 KB), and cache misses are costly, so even when huge amounts of RAM are present it can be useful to reduce the size of strings to increase performance. A possible solution to this problem could be some kind of real-time hardware compression/decompression between the CPU and the RAM. UTF-8 can be a good enough way to compress 32-bit strings, but then we are back to writing low-level programs that have to deal with UTF-8. To avoid this, the CPU and RAM could compress/decompress the text transparently to the programmer. Unfortunately UTF-8 is a variable-length encoding, so maybe it can't be done transparently enough, in which case a smarter and better compression algorithm could be used to keep all this transparent enough (not fully transparent; some low-level situations can require code that deals with the compression).

Bye,
bearophile
Jun 08 2010
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 The problem with dchar's is strings of them consume memory at a prodigious
 rate.

Warning: lazy musings ahead. I hope we'll soon have computers with 200+ GB of RAM where using strings that use less than 32-bit chars is in most cases a premature optimization (like today is often a silly optimization to use arrays of 16-bit ints instead of 32-bit or 64-bit ints. Only special situations found with the profiler can justify the use of arrays of shorts in a low level language). Even in PCs with 200 GB of RAM the first levels of CPU caches can be very small (like 32 KB), and cache misses are costly, so even if huge amounts of RAMs are present, to increase performance it can be useful to reduce the size of strings. A possible solution to this problem can be some kind of real-time hardware compression/decompression between the CPU and the RAM. UTF-8 can be a good enough way to compress 32-bit strings. So we are back to writing low-level programs that have to deal with UTF-8. To avoid this, CPUs and RAM can compress/decompress the text transparently to the programmer. Unfortunately UTF-8 is a variable-length encoding, so maybe it can't be done transparently enough. So a smarter and better compression algorithm can be used to keep all this transparent enough (not fully transparent, some low-level situations can require code that deals with the compression).

I strongly suspect that the encode/decode time for UTF-8 is more than compensated for by the 4x reduction in memory usage. I did a large app 10 years ago using dchars throughout, and the effects of the memory consumption were murderous. (As the recent article on memory consumption shows, large data structures can have huge negative speed consequences due to virtual and cache memory, and multiple cores trying to access the same memory.)

https://lwn.net/Articles/250967/

Keep in mind that the overwhelming bulk of UTF-8 text is ascii, and requires only one cycle to "decode".
Jun 08 2010
prev sibling parent reply Rainer Deyke <rainerd eldwood.com> writes:
On 6/8/2010 13:57, bearophile wrote:
 I hope we'll soon have computers with 200+ GB of RAM where using
 strings that use less than 32-bit chars is in most cases a premature
 optimization (like today is often a silly optimization to use arrays
 of 16-bit ints instead of 32-bit or 64-bit ints. Only special
 situations found with the profiler can justify the use of arrays of
 shorts in a low level language).

Off-topic, but I don't need a profiler to tell me that my 1024x1024x1024 arrays should use shorts instead of ints. And even when 200GB becomes common, I'd still rather not waste that memory by using twice as much space as I have to just because I can.

-- Rainer Deyke - rainerd eldwood.com
Jun 08 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Rainer Deyke" <rainerd eldwood.com> wrote in message 
news:humes8$s8$1 digitalmars.com...
 On 6/8/2010 13:57, bearophile wrote:
 I hope we'll soon have computers with 200+ GB of RAM where using
 strings that use less than 32-bit chars is in most cases a premature
 optimization (like today is often a silly optimization to use arrays
 of 16-bit ints instead of 32-bit or 64-bit ints. Only special
 situations found with the profiler can justify the use of arrays of
 shorts in a low level language).

Off-topic, but I don't need a profiler to tell me that my 1024x1024x1024 arrays should use shorts instead of ints. And even when 200GB becomes common, I'd still rather not waste that memory by using twice as much space as I have to just because I can.

I think he was just musing that it would be nice to be able to ignore multiple encodings and multiple-code-units, and get back to something much closer to the blissful simplicity of ASCII. On that particular point, I concur ;)
Jun 08 2010
parent "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:humfrk$2gk$1 digitalmars.com...
 "Rainer Deyke" <rainerd eldwood.com> wrote in message 
 news:humes8$s8$1 digitalmars.com...
 On 6/8/2010 13:57, bearophile wrote:
 I hope we'll soon have computers with 200+ GB of RAM where using
 strings that use less than 32-bit chars is in most cases a premature
 optimization (like today is often a silly optimization to use arrays
 of 16-bit ints instead of 32-bit or 64-bit ints. Only special
 situations found with the profiler can justify the use of arrays of
 shorts in a low level language).

Off-topic, but I don't need a profiler to tell me that my 1024x1024x1024 arrays should use shorts instead of ints. And even when 200GB becomes common, I'd still rather not waste that memory by using twice as much space as I have to just because I can.

I think he was just musing that it would be nice to be able to ignore multiple encodings and multiple-code-units, and get back to something much closer to the blissful simplicity of ASCII. On that particular point, I concur ;)

Keep in mind too, that for an English-language app (and there are plenty), even using ASCII still wastes space, since you usually only need the 26 letters, 10 digits, a few whitespace characters, and a handful of punctuation. You could probably fit that in 6 bits per character, less if you're ballsy enough to use Huffman encoding internally. Yes, there are twice as many letters if you count uppercase/lowercase, but random casing is rare, so there are tricks you can use to just stick with 26 plus maybe a few special control characters.

But, of course, nobody actually does any of that, because with the amount of memory we have, and the amount of memory already used by other parts of a program, the savings wouldn't be worth the bother. But I agree with your point too. Just saying.
Jun 08 2010
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 07 Jun 2010 17:48:09 -0400, Ruslan Nikolaev  
<nruslan_devel yahoo.com> wrote:

 Note: I posted this already on runtime D list, but I think that list was  
 a wrong one for this question. Sorry for duplication :-)

 Hi. I am new to D. It looks like D supports 3 types of characters: char,  
 wchar, dchar. This is cool, however, I have some questions about it:

 1. When we have 2 methods (one with wchar[] and another with char[]),  
 how D will determine which one to use if I pass a string "hello world"?
 2. Many libraries (e.g. tango or phobos) don't provide functions/methods  
 (or have incomplete support) for wchar/dchar
 e.g. writefln probably assumes char[] for strings like "Number %d..."
 3. Even if they do support, it is kind of annoying to provide methods  
 for all 3 types of chars. Especially, if we want to use native mode  
 (e.g. for Windows wchar is better, for Linux char is better). E.g.  
 Windows has _wopen, _wdirent, _wreaddir, _wopenddir, _wmain(int argc,  
 wchar_t[] argv) and so on, and they should be native (in a sense that no  
 conversion is necessary when we do, for instance, _wopen). Linux doesn't  
 have them as UTF-8 is used widely there.

 Since D language is targeted on system programming, why not to try to  
 use whatever works better on a particular system (e.g. char will be 2  
 bytes on Windows and 1 byte on Linux; it can be a compiler switch, and  
 all libraries can be compiled properly on a particular system). It's  
 still necessary to have all 3 types of char for cooperation with C. But  
 in those cases byte, short and int will do their work. For this kind of  
 situation, it would be nice to have some built-in functions for  
 transparent conversion from char to byte/short/int and vice versa  
 (especially, if conversion only happens if needed on a particular  
 platform).

 In my opinion, to separate notion of character from byte would be nice,  
 and it makes sense as a particular platform uses either UTF-8 or UTF-16  
 natively. Programmers may write universal code (like TCHAR on Windows).  
 Unfortunately, C uses 'char' and 'byte' interchangeably but why D has to  
 make this mistake again?

One thing that may not be clear from your interpretation of D's docs: all strings representable by one character type are also representable by all the other character types. This means that a function that takes a char[] can also take a dchar[] if it is sent through a converter (i.e. toUtf8 in Tango, I think). So D's char is decidedly not like byte or ubyte, or C's char.

In general, I use char (utf8) because I am used to C and ASCII (which is exactly represented in utf-8). But because char is utf-8, it could potentially accept any unicode string.

-Steve
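In Phobos 2, such a conversion can be sketched with std.conv.to (Tango's equivalents live in its tango.text.convert.Utf module):

```d
import std.conv : to;

void main()
{
    string  s8  = "hello";
    wstring s16 = to!wstring(s8); // transcode UTF-8 -> UTF-16
    dstring s32 = to!dstring(s8); // transcode UTF-8 -> UTF-32

    assert(s16 == "hello"w);
    assert(s32 == "hello"d);
}
```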
Jun 07 2010
parent Ali Çehreli <acehreli yahoo.com> writes:
Steven Schveighoffer wrote:
 a function that takes 
 a char[] can also take a dchar[] if it is sent through a converter (i.e. 
 toUtf8 on Tango I think).

In Phobos, there are text, wtext, and dtext in std.conv:

/**
Convenience functions for converting any number and types of
arguments into _text (the three character widths).

Example:
----
assert(text(42, ' ', 1.5, ": xyz") == "42 1.5: xyz");
assert(wtext(42, ' ', 1.5, ": xyz") == "42 1.5: xyz"w);
assert(dtext(42, ' ', 1.5, ": xyz") == "42 1.5: xyz"d);
----
*/

Ali
Jun 07 2010
prev sibling parent "Jer" <jersey chicago.com> writes:
Ruslan Nikolaev wrote:
 Note: I posted this already on runtime D list, but I think that list
 was a wrong one for this question. Sorry for duplication :-)

 Hi. I am new to D. It looks like D supports 3 types of characters:
 char, wchar, dchar. This is cool,

It's wrong, actually.
Jun 10 2010