
digitalmars.D - YAUST v1.0

reply xs0 <xs0 xs0.com> writes:
YAUST - Yet Another Unified String Theory :)

Well, here's my proposal for cleaning up strings. I tried to

- be as practical as possible
- leave full control over encoding when one wants to have it
- remove any possible confusion as to what each type is
- allow efficiency where possible, without excessive effort

First, the proposed changes are listed, followed by rationale.

==============================

I. drop char and wchar

--

II.

create cchar (1-byte unsigned character of platform-specific encoding, 
C-equivalent)
create utf8  (1 byte of UTF8)
create utf16 (2 bytes of UTF16)
leave dchar as is

--

III.

version(Windows) {
	alias utf16[] string;
} else
version(Unix/Linux) {
	alias utf8[] string;
}

add suffix ""s for explicitly specifying the platform-specific encoding 
(i.e. the string type), and make auto type inference default to that 
same type (this applies to the auto keyword, not to undecorated strings). 
Add docs explaining that string is just a platform-dependent alias.

--

IV.

add the following implicit casts for interoperability

from: cchar[], utf8[], utf16[], dchar[]
to  : cchar*, utf8*, utf16*, dchar*

all of them ensure 0-termination. If a cchar is converted to any other 
form, it becomes the appropriate Unicode character. In the reverse direction, 
all unrepresentable characters become '?'. When runtime transcoding 
and/or reallocation is required, make them produce a warning.
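To illustrate, a call into properly-declared C code would then look 
roughly like the sketch below. Note this is purely hypothetical: cchar, 
utf8 and the cast behavior are the proposed types/rules from points II 
and IV, not anything an existing compiler implements.

```d
extern (C) int puts(cchar* s);  // C function taking the platform's 8-bit charset

void demo()
{
    utf8[] msg = "héllo";  // app-internal UTF-8
    puts(msg);             // implicit utf8[] -> cchar*: transcodes to the
                           // platform charset, replaces unrepresentable
                           // characters with '?', ensures a trailing '\0',
                           // and emits a warning (runtime transcoding)
}
```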

--

V.

add the following implicit (transcoding) casts

from: cchar[], utf8[], utf16[], dchar[]
to  : cchar[], utf8[], utf16[], dchar[]

when runtime transcoding is required, make them produce a warning (i.e. 
always, except when casting from T to T).
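For illustration, assuming the proposed types existed, the implicit 
transcoding casts would behave like this sketch:

```d
utf8[]  a = "abc";
utf16[] b = a;   // implicit transcoding cast: allocates and re-encodes,
                 // so the compiler emits a warning
dchar[] c = b;   // warns again; lossless, but still a runtime conversion
utf16[] d = b;   // T to T: no transcoding, no warning
```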

--

VI.

modify explicit casts between all the array and pointer types to
- transcode rather than paint
- use '?' for unrepresentable characters (applies to encoding into 
cchar*/cchar[] only)
- not produce the warnings from above

--

VII.

create compatibility kit:

module std.srccompatibility.oldchartypes;
// yes, it should be big and ugly

alias utf8 char;
alias utf16 wchar;

--

VIII.

add the following methods to all 4 array types

  utf8[] .asUTF8
utf16[] .asUTF16
dchar[] .asUTF32
cchar[] .asCchars

ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
ubyte[] .asUTF16LE(bool includeBOM)
ubyte[] .asUTF16BE(bool includeBOM)
ubyte[] .asUTF32LE(bool includeBOM)
ubyte[] .asUTF32BE(bool includeBOM)

--

IX.

modify the ~ operator between the 4 types to work as follows:

a) infer the result type from context, as with undecorated strings
b) if calling a function and there are multiple overloads
b.1) if both operand types are known, use that type
b.2) if one is known and the other is an undecorated literal, use the known type
b.3) if neither is known or both are known, but different, bork
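A sketch of how the rules above would resolve (again using the 
hypothetical utf8/utf16 types):

```d
void send(utf8[] s);
void send(utf16[] s);

utf8[]  u8  = "foo";
utf16[] u16 = "bar";

send(u8 ~ "baz");  // b.2: one operand known, the literal adapts -> send(utf8[])
send(u8 ~ u8);     // b.1: both operands utf8[] -> send(utf8[])
//send(u8 ~ u16);  // b.3: both known but different -> compile error
```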

--

X.

Disallow utf8 and utf16 as a stand-alone var type, only arrays and 
pointers allowed

========================

Point I. removes the confusion of "char" and "wchar" not actually 
representing characters.

Point II. explicitly states that the strings are either UTF-encoded, 
complete characters* or C-compatible characters.

Point III. makes the code

string abc="abc";
someOSFunc(abc);
someOtherOSFunc("qwe"s); // s only necessary if there is more than one 
option

least likely to produce any transcoding.

Point IV. makes it nearly impossible to do the wrong thing and doesn't 
require explicit casts when interfacing to C code, assuming the C 
functions are declared properly (i.e. with the correct one of the two 
1-byte types). When used with literals, the 0 can be appended at 
compile time, like it is now.

Point V. makes it easier to use different types without explicit 
casting, but will still produce warnings when transcoding happens. In 
most cases it will be obvious anyway.

Point VI. breaks behavior of other array casts (which only paint), but 
strings are getting special behavior anyway, and you can still paint via 
void[], and even more importantly, if you need to paint between 
UTF8/UTF16/UTF32/cchar, either the source or destination type is wrong 
in the first place.

Point VII. will make it somewhat easier to make the transition.

Point VIII. provides an alternative to casting and allows specifying 
endianness when writing to network and/or files. The methods should be 
compile-time resolvable when possible, so this would be both valid and 
evaluated in compile time:

ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);

Point IX. allows concatenation of strings in different encodings without 
significantly increasing the complexity of overloading rules, while also 
not requiring an inefficient toUTFxx followed by concatenation (which 
copies the result again).

Point X. prevents some invalid code:
- treating a UTF-8 code unit as a character
- treating a UTF-16 code unit as a character
- iterating over code units instead of characters

Note that it is still possible to iterate over the string using cchar or 
dchar, which actually do represent characters. Also note that for 
I/O purposes, which are the only thing one should be doing with code 
units, you can still paint the string as void[] or byte[] (or even 
better, call one of the methods above), but then you give up the view 
that it is a string and lose language support/special treatment.
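Such iteration could look like this sketch (modeled on D's existing 
foreach-over-dchar decoding, applied to the proposed utf8[] type):

```d
utf8[] s = "some UTF-8 text";
foreach (dchar ch; s)   // decodes code units into whole code points
{
    // ch is always a complete Unicode code point,
    // never a fragment of a multi-byte UTF-8 sequence
}
```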



So, what do you guys/gals think? :)


xs0

* note that even dchar[] still doesn't necessarily contain complete 
characters, at least as seen by the user. For example, the letter 
LATIN_C_WITH_CARON can also be written as LATIN_C + COMBINING_CARON, and 
they are in fact equivalent as far as Unicode is concerned (afaik). 
Splitting the string in between will thus produce a "wrong" result, but I 
don't think D should include any kind of full Unicode processing, as 
it's actually needed quite rarely, so that problem is ignored...
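As a concrete illustration of that footnote (the code points are from 
the Unicode charts; the arrays are ordinary dchar arrays):

```d
// Canonically equivalent as far as Unicode is concerned:
dchar[] precomposed = [ '\u010D' ];        // LATIN SMALL LETTER C WITH CARON
dchar[] decomposed  = [ 'c', '\u030C' ];   // 'c' followed by COMBINING CARON
// Slicing decomposed between index 0 and 1 separates the base letter
// from its accent - a well-formed string, but "wrong" to the user.
```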
Nov 24 2005
parent reply Jari-Matti Mäkelä <jmjmak invalid_utu.fi> writes:
xs0 wrote:
<snip>
 III.
 
 version(Windows) {
     alias utf16[] string;
 } else
 version(Unix/Linux) {
     alias utf8[] string;
 }
 
 add suffix ""s for explicitly specifying platform-specific encoding 
 (i.e. the string type), and make auto type inference default to that 
 same type (this applies to the auto keyword, not undecorated strings). 
 Add docs explaining that string is just a platform-dependant alias.
 

The idea (platform-independence) here is correct. :) The only thing is that you _don't_ need to know which utf-implementation the current compiler is using. If you are using Unicode to communicate with the user and/or native D libraries, you don't need to do any string conversions - they all use the same string representation, for god's sake.
 IV.
 
 add the following implicit casts for interoperability
 
 from: cchar[], utf8[], utf16[], dchar[]
 to  : cchar*, utf8*, utf16*, dchar*
 
 all of them ensure 0-termination. If cchar is converted to any other 
 form, it becomes the appropriate Unicode char. In the reverse direction, 
 all unrepresentable characters become '?'. when runtime transcoding 
 and/or reallocation is required, make them produce a warning.

You mean C/C++ -interoperability? Replacing all non-ASCII characters with '?'s means that we don't actually want to support all the legacy systems out there. So it would be impossible to write Unicode-compliant portable programs that supported 'ä' on the Windows 9x/NT/XP command line without version() {} -logic?
 V.
 
 add the following implicit (transcoding) casts
 
 from: cchar[], utf8[], utf16[], dchar[]
 to  : cchar[], utf8[], utf16[], dchar[]
 
 when runtime transcoding is required, make them produce a warning (i.e. 
 always, except when casting from T to T).

Again, the main reason for Unicode is that you don't need to transcode between several representations all the time.
 VI.
 
 modify explicit casts between all the array and pointer types to
 - transcode rather than paint
 - use '?' for unrepresentable characters (applies to encoding into 
 cchar*/cchar[] only)
 - not produce the warnings from above
 
 -- 
 
 VII.
 
 create compatibility kit:
 
 module std.srccompatibility.oldchartypes;
 // yes, it should be big and ugly
 
 alias utf8 char;
 alias utf16 wchar;
 

You know, sweeping the problem under the carpet doesn't help us much. char/wchar won't get any better by calling them by a different name. char still won't be able to store more than the first 127 Unicode symbols.
 VIII.
 
 add the following methods to all 4 array types
 
  utf8[] .asUTF8
 utf16[] .asUTF16
 dchar[] .asUTF32
 cchar[] .asCchars

Why? Section V. already allows you to transcode these implicitly.
 ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
 ubyte[] .asUTF16LE(bool includeBOM)
 ubyte[] .asUTF16BE(bool includeBOM)
 ubyte[] .asUTF32LE(bool includeBOM)
 ubyte[] .asUTF32BE(bool includeBOM)
 

This looks pretty familiar. My own proposal does this on a library level for a reason. You see, conversions from Unicode to ISO-8859-x/KOI8-R/... should be allowed. It's easier to maintain the conversion table in a separate library. This also saves Walter from a lot of unnecessary work. UTF-8 _does_ have a BOM.
 IX.
 
 modify the ~ operator between the 4 types to work as follows:
 
 a) infer the result type from context, as with undecorated strings
 b) if calling a function and there are multiple overloads
 b.1) if both operand types are known, use that type
 b.2) if one is known and the other is an undecorated literal, use the known type
 b.3) if neither is known or both are known, but different, bork
 

If we didn't have several types of strings, this all would be much easier.
 X.
 
 Disallow utf8 and utf16 as a stand-alone var type, only arrays and 
 pointers allowed
 

Yes, this is a 'working' solution. Although I would like to be able to slice strings and do things like:

  char[] s = "Älyttömämmäksi voinee mennä?";
  s[15..21] = "ei voi";
  writefln(s); // outputs: Älyttömämmäksi ei voi mennä?

Of course you can do this all using library functions, but tell me one thing: why should I do simple string slicing using library calls and much more complex Unicode conversion using language structures?
 Point I. removes the confusion of "char" and "wchar" not actually 
 representing characters.
 

 Point II. explicitly states that the strings are either UTF-encoded, 
 complete characters* or C-compatible characters.

 Point III. makes the code
 
 string abc="abc";
 someOSFunc(abc);
 someOtherOSFunc("qwe"s); // s only neccessary if there is more than one 
 option
 
 least likely to produce any transcoding.

Of course you need to do transcoding if the OS-function expects ISO-8859-x and your string has utf8/16.
 Point IV. makes it nearly impossible to do the wrong thing and doesn't 
 require explicit casts when interfacing to C code, assuming the C 
 functions are declared properly (i.e. the correct of the two 1-byte 
 types is declared). When used with literals, the 0 can be appended 
 compile-time, like it is now.

Why do you have to output Unicode strings using legacy non-Unicode C-APIs? AFAIK DUI / standard I/O and other libraries use standard Unicode, right? At least QT / GTK+ / Win32API / Linux console do support Unicode.
 Point V. makes it easier to use different types without explicit 
 casting, but will still produce warnings when transcoding happens. In 
 most cases it will be obvious anyway.

It would be easier with only a single Unicode-compliant string-type. Ask the Java guys.
 Point VI. breaks behavior of other array casts (which only paint), but 
 strings are getting special behavior anyway, and you can still paint via 
 void[], and even more importantly, if you need to paint between 
 UTF8/UTF16/UTF32/cchar, either the source or destination type is wrong 
 in the first place.

 Point VII. will make it somewhat easier to make the transition.

 Point VIII. provides an alternative to casting and allows specifying 
 endianness when writing to network and/or files.

Partly true. Still, I think it would be much better if we had these as a std.stream.UnicodeStream class. Again, Java does this well.
 The methods should be 
 compile-time resolvable when possible, so this would be both valid and 
 evaluated in compile time:
 
 ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);

Besides, if all our strings and i/o were utf-8, there wouldn't be any conversions, right?
 Point IX. allows concatenation of strings in different encodings without 
 significantly increasing the complexity of overloading rules, while also 
 not requiring an inefficient toUTFxx followed by concatenation (which 
 copies the result again).

True, but as I previously said, I don't believe we need to do a great amount of conversions at the runtime level. All conversions should be near network/file-interfaces, thus using Stream-classes, right?
 Point X. prevents some invalid code:

 Note that it is still possible to iterate over the string using a cchar 
 and dchar, which actually do represent characters. Also note that for 
 I/O purposes, which are the only thing one should be doing with code 
 units, you can still paint the string as void[] or byte[] (or even 
 better, call one of the methods above), but then you give up the view 
 that it is a string and lose language support/special treatment.

 Splitting the string inbetween will thus produce a "wrong" result, but I 
 don't think D should include any kind of full Unicode processing, as 
 it's actually needed quite rarely, so that problem is ignored...

Sigh. Maybe you're not doing full Unicode processing every day. What about the Chinese? And what is full Unicode processing?
Nov 24 2005
parent reply xs0 <xs0 xs0.com> writes:
Before anything else: while I agree that a (really well-thought out) 
string class would probably be a good solution, the D spec would seem to 
suggest an array-based approach is preferred, and Walter isn't one to 
change his mind easily :)
Besides, any kind of string class has its share of problems (one size 
never fits all), and with the array-based approach it's easy to add 
pseudo-methods doing all kinds of funky things, while a language-defined 
class makes it impossible.


Jari-Matti Mäkelä wrote:
 version(Windows) {
     alias utf16[] string;
 } else
 version(Unix/Linux) {
     alias utf8[] string;
 }

 add suffix ""s for explicitly specifying platform-specific encoding 
 (i.e. the string type), and make auto type inference default to that 
 same type (this applies to the auto keyword, not undecorated strings). 
 Add docs explaining that string is just a platform-dependant alias.

The idea (platform-independence) here is correct. :) The only thing is that you _don't_ need to know which utf-implementation the current compiler is using.

Well, sometimes you do and most times you don't (and it is often the case that at least some part of any app does need to know). I don't think it's wise to force anything down anyone's throat, so I tried to give options - you can use a specific UTF encoding, the native encoding for legacy OSes, or leave it to the compiler to choose the "best" one for you, where I believe best is what the underlying OS is using.
 If you are using Unicode to communicate with the user 
 and/or native D libraries, you don't need to do any string conversions - 
 they all use the same string representation, for god's sake.

Well, flexibility will definitely require some bloat in libraries, but for communicating with the user, you definitely need conversions if you're not using the OS-native type (which, again, you do have the option of using by being explicit about it).
 add the following implicit casts for interoperability

 from: cchar[], utf8[], utf16[], dchar[]
 to  : cchar*, utf8*, utf16*, dchar*

 all of them ensure 0-termination. If cchar is converted to any other 
 form, it becomes the appropriate Unicode char. In the reverse 
 direction, all unrepresentable characters become '?'. when runtime 
 transcoding and/or reallocation is required, make them produce a warning.

You mean C/C++ -interoperability?

Yup.
 Replacing all non-ASCII characters with '?'s means that we don't 
 actually want to support all the legacy systems out there. So it would 
 be impossible to write Unicode-compliant portable programs that 
 supported 'ä' on the Windows 9x/NT/XP command line without version() {} 
 -logic?

No, who mentioned ASCII? On windows, cchar would be exactly the legacy encoding each non-unicode app uses, and conversions between app's internal UTF-x and cchar[] would transcode into that charset. So, for example, a word processor on a non-unicode windows version could still use unicode internally, while automatically talking to the OS using all the characters its charset provides.
 add the following implicit (transcoding) casts

 from: cchar[], utf8[], utf16[], dchar[]
 to  : cchar[], utf8[], utf16[], dchar[]

 when runtime transcoding is required, make them produce a warning 
 (i.e. always, except when casting from T to T).

Again, the main reason for Unicode is that you don't need to transcode between several representations all the time.

Again, sometimes you do and most times you don't. But anyhow, painting casts between UTF types make no sense, and I don't think explicit casts are necessary, as there can't be any loss (ok, except to cchar[]).
 create compatibility kit:

 module std.srccompatibility.oldchartypes;
 // yes, it should be big and ugly

 alias utf8 char;
 alias utf16 wchar;

You know, sweeping the problem under the carpet doesn't help us much. char/wchar won't get any better by calling them by a different name. char still won't be able to store more than the first 127 Unicode symbols.

I'm not sure if you're referring to those aliases or not, but in YAUST, there is no single char (utf8) anymore, and I think there's quite a difference between "char[]" and "utf8[]", especially in the C-influenced world we live in :)
 add the following methods to all 4 array types

  utf8[] .asUTF8
 utf16[] .asUTF16
 dchar[] .asUTF32
 cchar[] .asCchars

Why, section V. already allows you to transcode these implicitely.

Yup, but with warnings; using one of these shows that you've thought about what you're doing, so the compiler is free to shut up :)
 ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
 ubyte[] .asUTF16LE(bool includeBOM)
 ubyte[] .asUTF16BE(bool includeBOM)
 ubyte[] .asUTF32LE(bool includeBOM)
 ubyte[] .asUTF32BE(bool includeBOM)

This looks pretty familiar. My own proposal does this on a library level for a reason. You see, conversions from Unicode to ISO-8859-x/KOI8-R/... should be allowed.

Sure they should be allowed, but D is supposed to be Unicode, so a D app should generally only deal with that, and other charsets should generally only exist in byte[] buffers before input or after output.
 It's easier to maintain the conversion table in a 
 separate library. This also saves Walter from a lot of unnecessary work.

Well, conversions between UTFs are done already, so the only thing remaining would be from/to cchar[], which shouldn't be too hard. Others definitely belong in some library, as they mostly won't be needed, I guess..
 UTF-8 _does_ have a BOM.

It does? What is it? I thought that single bytes have no Byte Order, so why would you need a Mark?
 modify the ~ operator between the 4 types to work as follows:

 a) infer the result type from context, as with undecorated strings
 b) if calling a function and there are multiple overloads
 b.1) if both operand types are known, use that type
 b.2) if one is known and the other is an undecorated literal, use the 
 known type
 b.3) if neither is known or both are known, but different, bork

If we didn't have several types of strings, this all would be much easier.

Agreed, but we do have several types of strings :)
 Disallow utf8 and utf16 as a stand-alone var type, only arrays and 
 pointers allowed

Yes, this is a 'working' solution. Although I would like to be able to slice strings and do things like:

  char[] s = "Älyttömämmäksi voinee mennä?";
  s[15..21] = "ei voi";
  writefln(s); // outputs: Älyttömämmäksi ei voi mennä?

Of course you can do this all using library functions, but tell me one thing: why should I do simple string slicing using library calls and much more complex Unicode conversion using language structures?

Because it's actually the opposite - Unicode conversions are simple, while slicing is hard (at least slicing on character boundaries). Even in the simple example you give, I have no idea whether the first Ä is one character or two, as both cases look the same.
 Point III. makes the code

 string abc="abc";
 someOSFunc(abc);
 someOtherOSFunc("qwe"s); // s only neccessary if there is more than 
 one option

 least likely to produce any transcoding.

Of course you need to do transcoding if the OS-function expects ISO-8859-x and your string has utf8/16.

True, I just said "least likely". But at least you can use the same (non-transcoding) code for both UTF-8 OSes and UTF-16 OSes.
 Point IV. makes it nearly impossible to do the wrong thing and doesn't 
 require explicit casts when interfacing to C code, assuming the C 
 functions are declared properly (i.e. the correct of the two 1-byte 
 types is declared). When used with literals, the 0 can be appended 
 compile-time, like it is now.

Why do you have to output Unicode strings using legacy non-Unicode C-APIs? AFAIK DUI / standard I/O and other libraries use standard Unicode, right? At least QT / GTK+ / Win32API / Linux console do support Unicode.

Well, your point is moot, because if there's no such function to call, then there is no problem. But when there is such a function, you would hope that the language/library does something sensible by default, wouldn't you?
 Point V. makes it easier to use different types without explicit 
 casting, but will still produce warnings when transcoding happens. In 
 most cases it will be obvious anyway.

It would be easier with only a single Unicode-compliant string-type. Ask the Java guys.

Well, I am one of the Java guys, and java.lang.String leaves a lot to be desired. Because it's language-defined the way it is, it's
1) immutable, which sucks if it's forced down your throat 100% of the time
2) UTF-16 for ever and ever, which sucks if you want it to either take less memory or don't want to worry about surrogates; just look at all the crappy functions they had to add in Java 5 to support the entire Unicode charset :)
 Point VI. breaks behavior of other array casts (which only paint), but 
 strings are getting special behavior anyway, and you can still paint 
 via void[], and even more importantly, if you need to paint between 
 UTF8/UTF16/UTF32/cchar, either the source or destination type is wrong 
 in the first place.

?

Well, a sequence of bytes can be either cchar[], UTF-8, UTF-16 or UTF-32, but not more than one at the same time (OK, unless it's ASCII only, which fits both the first two). So, for example, if you cast utf8[] to utf16[], either the data is UTF-8 and you don't get a UTF-16 string (but some mumbo jumbo), or it's UTF-16 and was never valid UTF-8 in the first place.
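To make the paint-vs-transcode distinction concrete (a sketch in the 
proposed types):

```d
utf8[] s = "é";                // stored as two code units: 0xC3 0xA9
// Painting those same bytes as UTF-16 would pair them into a single
// 16-bit code unit (0xA9C3 on a little-endian machine) - not 'é' at all.
utf16[] t = cast(utf16[]) s;   // under point VI: transcodes to [0x00E9]
```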
 Point VII. will make it somewhat easier to make the transition.

?

?
 Point VIII. provides an alternative to casting and allows specifying 
 endianness when writing to network and/or files.

Partly true. Still, I think it would be much better if we had these as a std.stream.UnicodeStream class. Again, Java does this well.

Why should you be forced to use a stream for something so simple? What if you want to use two encodings on the same stream (it's not even so far fetched - the first line in a HTTP request can only contain UTF-8, but you may want to send POST contents in UTF-16, for example). Etc. etc.
 The methods should be compile-time resolvable when possible, so this 
 would be both valid and evaluated in compile time:

 ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);

Why? Converting a 14 character string doesn't take much time.

Why would it not evaluate at compile time? Do you see any benefit in that? And while it doesn't take much time once, it does take some, and more importantly, allocates new memory each time. If you're trying to do more than one request (as in thousands), I'm sure it adds up..
 Besides, 
 if all our strings and i/o were utf-8, there wouldn't be any 
 conversions, right?

Except every time you'd call a Win32 function, which is what's on most computers?
 Point IX. allows concatenation of strings in different encodings 
 without significantly increasing the complexity of overloading rules, 
 while also not requiring an inefficient toUTFxx followed by 
 concatenation (which copies the result again).

True, but as I previously said, I don't believe we need to do a great amount of conversions at the runtime level. All conversions should be near network/file-interfaces, thus using Stream-classes, right?

I agree decent stream classes can solve many problems, but not all of them.
 Splitting the string inbetween will thus produce a "wrong" result, but 
 I don't think D should include any kind of full Unicode processing, as 
 it's actually needed quite rarely, so that problem is ignored...

Sigh. Maybe you're not doing full Unicode processing every day. What about the Chinese? And what is full Unicode processing?

Unicode is much more than a really large character set. There's UTFs, collation, bidirectionality, combining characters, locales, etc. etc., see http://www.unicode.org/reports/index.html

So, if you want to create a decent text editor according to Unicode specs, you'll have to implement "full Unicode processing", but a large majority of other apps just needs to be able to interface to the OS and libraries to get and display text, usually without even caring what's inside, so I see no point in including all that in D, not even as a standard library (or perhaps after many other things are implemented first)


xs0
Nov 24 2005
parent reply Jari-Matti Mäkelä <jmjmak invalid_utu.fi> writes:
xs0 wrote:
 Before anything else: while I agree that a (really well-thought out) 
 string class would probably be a good solution, the D spec would seem to 
 suggest an array-based approach is preferred, and Walter isn't one to 
 change his mind easily :)

I believe we can achieve quite much with just simple array-like syntax.
 Besides, any kind of string class has its share of problems (one size 
 never fits all), and with the array based approach it's easy to add 
 pseudo-methods doing all kinds of funky things, while a language-defined 
 class makes it impossible.

Although D is able to support some hard coded properties too.
 The idea (platform-independence) here is correct. :) The only thing is 
 that you _don't_ need to know, which utf-implementation the current 
 compiler is using. 

Well, sometimes you do and most times you don't (and it is often the case that at least some part of any app does need to know). I don't think it's wise to force anything down anyone's throat, so I tried to give options - you can use a specific UTF encoding, the native encoding for legacy OSes, or leave it to the compiler to choose the "best" one for you, where I believe best is what the underlying OS is using.

I'd give my vote for the "let compiler choose" option.
 If you are using Unicode to communicate with the user and/or native D 
 libraries, you don't need to do any string conversions - they all use 
 the same string representation, for god's sake.

Well, flexibility will definitely require some bloat in libraries, but for communicating with the user, you definitely need conversions if you're not using the OS-native type (which, again, you do have the option of using by being explicit about it).

But if you let the compiler vendor to decide the encoding, there's a high probability that you don't need any explicit transcoding.
 add the following implicit casts for interoperability

 from: cchar[], utf8[], utf16[], dchar[]
 to  : cchar*, utf8*, utf16*, dchar*

 all of them ensure 0-termination. If cchar is converted to any other 
 form, it becomes the appropriate Unicode char. In the reverse 
 direction, all unrepresentable characters become '?'. when runtime 
 transcoding and/or reallocation is required, make them produce a 
 warning.

You mean C/C++ -interoperability?


I was just thinking that once D has complete wrappers for all necessary stuff, you don't need these anymore. Library (wrapper) writers should be patient enough to use explicit conversion rules.
 Replacing all non-ASCII characters with '?'s means that we don't 
 actually want to support all the legacy systems out there. So it would 
 be impossible to write Unicode-compliant portable programs that 
 supported 'ä' on the Windows 9x/NT/XP command line without version() 
 {} -logic?

No, who mentioned ASCII? On windows, cchar would be exactly the legacy encoding each non-unicode app uses, and conversions between app's internal UTF-x and cchar[] would transcode into that charset. So, for example, a word processor on a non-unicode windows version could still use unicode internally, while automatically talking to the OS using all the characters its charset provides.

You said "In the reverse direction, all unrepresentable characters become '?'." The thing is that the D compiler doesn't know anything about your system character encoding. You can even change it on the fly, if your system is capable of doing that. Therefore this transcoding must use the lowest common denominator, which is probably 7-bit ASCII.
 add the following implicit (transcoding) casts

 from: cchar[], utf8[], utf16[], dchar[]
 to  : cchar[], utf8[], utf16[], dchar[]

 when runtime transcoding is required, make them produce a warning 
 (i.e. always, except when casting from T to T).

Again, the main reason for Unicode is that you don't need to transcode between several representations all the time.

Again, sometimes you do and most times you don't. But anyhow, painting casts between UTF types make no sense, and I don't think explicit casts are necessary, as there can't be any loss (ok, except to cchar[]).

You don't need to convert inside your own code unless you're really creating a program that is supposed to convert stuff. I mean you need the transcoding only when interfacing with foreign code / i/o.
 add the following methods to all 4 array types

  utf8[] .asUTF8
 utf16[] .asUTF16
 dchar[] .asUTF32
 cchar[] .asCchars

Why, section V. already allows you to transcode these implicitely.

Yup, but with warnings; using one of these shows that you've thought about what you're doing, so the compiler is free to shut up :)

Yes, now you're right. The programmer should _always_ explicitly declare all conversions.
 ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
 ubyte[] .asUTF16LE(bool includeBOM)
 ubyte[] .asUTF16BE(bool includeBOM)
 ubyte[] .asUTF32LE(bool includeBOM)
 ubyte[] .asUTF32BE(bool includeBOM)

This looks pretty familiar. My own proposal does this on a library level for a reason. You see, conversions from Unicode to ISO-8859-x/KOI8-R/... should be allowed.

Sure they should be allowed, but D is supposed to be Unicode, so a D app should generally only deal with that, and other charsets should generally only exist in byte[] buffers before input or after output.

Then tell me, how do I fill these buffers with your new functions? I would definitely want to explicitly define the character encoding. IMHO this is much better done using static classes (std.utf.e[n/de]code) than variable properties.
 It's easier to maintain the conversion table in a separate library. 
 This also saves Walter from a lot of unnecessary work.

Well, conversions between UTFs are done already, so the only thing remaining would be from/to cchar[], which shouldn't be too hard.

Yes, between UTFs, but conversions between legacy charsets and UTFs are not! They aren't that hard, but as you might know, there are maybe hundreds of possible encoding types.
 Others
 definitely belong in some library, as they mostly won't be needed, I
 guess..

This isn't a very consistent approach. Some functions belong in some library, others should be implemented in the language...wtf?
 UTF-8 _does_ have a BOM.

It does? What is it? I thought that single bytes have no Byte Order, so why would you need a Mark?

0xEF 0xBB 0xBF

http://www.unicode.org/faq/utf_bom.html#25
See also http://www.unicode.org/faq/utf_bom.html#29
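A minimal check for that signature - ordinary D operating on raw bytes, 
independent of the proposal:

```d
// The UTF-8 BOM is the UTF-8 encoding of U+FEFF: 0xEF 0xBB 0xBF.
bool hasUtf8Bom(ubyte[] data)
{
    return data.length >= 3
        && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
}
```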
 If we didn't have several types of strings, this all would be much 
 easier.

Agreed, but we do have several types of strings :)

I'm trying to say we don't need several types of strings :)
 Disallow utf8 and utf16 as a stand-alone var type, only arrays and 
 pointers allowed

Yes, this is a 'working' solution. Although I would like to be able to slice strings and do things like:

  char[] s = "Älyttömämmäksi voinee mennä?";
  s[15..21] = "ei voi";
  writefln(s); // outputs: Älyttömämmäksi ei voi mennä?

Of course you can do this all using library functions, but tell me one thing: why should I do simple string slicing using library calls and much more complex Unicode conversion using language structures?

Because it's actually the opposite - Unicode conversions are simple, while slicing is hard (at least slicing on character boundaries). Even in the simple example you give, I have no idea whether the first Ä is one character or two, as both cases look the same.

It's not really that hard. One downside is that you have to parse through the string (unless compiler uses UTF-16/32 as an internal string type). Slicing the string on the code unit level doesn't make any sense, now does it? Because char should be treated as a special type by the compiler, I see no other use for slicing than this. Like you said, the alternative slicing can be achieved by casting the string to void[] (for i/o data buffering, etc).
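The two spellings of Ä that make the slice ambiguous can be shown concretely; a sketch in Python, since the proposed utf8/utf16 types don't exist:

```python
import unicodedata

precomposed = "\u00c4"   # Ä as one code point (LATIN CAPITAL A WITH DIAERESIS)
decomposed = "A\u0308"   # A followed by COMBINING DIAERESIS -- renders identically

# Same visible character, different lengths, so an index like s[15..21]
# means different things depending on which spelling the literal used.
assert len(precomposed) == 1 and len(decomposed) == 2
assert precomposed != decomposed
# Normalization maps both spellings to the same canonical form:
assert unicodedata.normalize("NFC", decomposed) == precomposed
```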
 Point III. makes the code

 string abc="abc";
 someOSFunc(abc);
 someOtherOSFunc("qwe"s); // s only necessary if there is more than
 one option

 least likely to produce any transcoding.

Of course you need to do transcoding, if the OS function expects ISO-8859-x and your string is utf8/16.

True, I just said "least likely". But at least you can use the same (non-transcoding) code for both UTF-8 OSes and UTF-16 OSes.

Again, neither the compiler nor the compiled binary knows anything about the OS's standard encoding. Even some Linux systems still use iso-8859-x. If you're running Windows programs through vmware or wine on Linux, you can't tell if it's always faster to use UTF-16 instead of UTF-8.
 Point IV. makes it nearly impossible to do the wrong thing and 
 doesn't require explicit casts when interfacing to C code, assuming 
 the C functions are declared properly (i.e. the correct one of the two
 1-byte types is declared). When used with literals, the 0 can be 
 appended compile-time, like it is now.

Why do you have to output Unicode strings using legacy non-Unicode C APIs? AFAIK DUI / standard I/O and other libraries use standard Unicode, right? At least QT / GTK+ / Win32API / Linux console do support Unicode.

Well, your point is moot, because if there's no such function to call, then there is no problem. But when there is such a function, you would hope that the language/library does something sensible by default, wouldn't you?

No, this brilliant invention of yours causes problems even if we didn't have any 'legacy'-systems/APIs. You see, Library-writer 1 might use UTF-16 for his library because he uses Windows and thinks it's the fastest charset. Now Library-writer 2 has done his work using UTF-8 as an internal format. If you make a client program that links with these both, you (may) have to create unnecessary conversions just because one guy decided to create his own standards.
 Point V. makes it easier to use different types without explicit 
 casting, but will still produce warnings when transcoding happens. In 
 most cases it will be obvious anyway.

It would be easier with only a single Unicode-compliant string type. Ask the Java guys.

Well, I am one of the Java guys, and java.lang.String leaves a lot to be desired. Because it's defined by the language the way it is, it's 1) immutable, which sucks when it's forced down your throat 100% of the time

I agree.
 2) UTF-16 for ever and ever, which sucks if you want it to either take 
 less memory or don't want to worry about surrogates; just look at all 
 the crappy functions they had to add in Java 5 to support the entire 
 Unicode charset :)

Partly true. What I meant was that most Java programmers use only one kind of string class (because they don't have/need other types).
 Point VI. breaks behavior of other array casts (which only paint), 
 but strings are getting special behavior anyway, and you can still 
 paint via void[], and even more importantly, if you need to paint 
 between UTF8/UTF16/UTF32/cchar, either the source or destination type 
 is wrong in the first place.

?

Well, a sequence of bytes can be either cchar[], UTF-8, UTF-16 or UTF-32, but not more than one at the same time (OK, unless it's ASCII only, which fits both the first two). So, for example, if you cast utf8[] to utf16[], either the data is UTF-8 and you don't get a UTF-16 string (but some mumbo jumbo), or it's UTF-16 and was never valid UTF-8 in the first place.

Ok. But I thought you said utf8[] is implicitly converted to utf16[]. Then it's always valid whatever-type-it-is.
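The painting-vs-transcoding distinction can be reproduced with Python's codecs standing in for the proposed cast semantics (a sketch, not the proposal itself):

```python
s = "Älyttömämmäksi"
utf8_bytes = s.encode("utf-8")

# "Painting": reinterpret the same bytes as UTF-16 code units -- mumbo jumbo.
painted = utf8_bytes.decode("utf-16-le", errors="replace")
assert painted != s

# Transcoding: decode as UTF-8, re-encode as UTF-16 -- round-trips cleanly.
utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-le")
assert utf16_bytes.decode("utf-16-le") == s
```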
 Point VII. will make it somewhat easier to make the transition.



How? I don't believe it.
 Point VIII. provides an alternative to casting and allows specifying 
 endianness when writing to network and/or files.

Partly true. Still, I think it would be much better if we had these as a std.stream.UnicodeStream class. Again, Java does this well.

Why should you be forced to use a stream for something so simple?

So simple? Ahem, std.stream.File _is_ a stream. Here's my version:

  File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF8);
  f.writeLine("valid unicode text åäöü");
  f.close;

  File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF16LE);
  f.writeLine("valid unicode text åäöü");
  f.close;

Advantages:
-supports BOM values
-easy to use, right?
 What
 if you want to use two encodings on the same stream (it's not even so 
 far fetched - the first line in a HTTP request can only contain UTF-8, 
 but you may want to send POST contents in UTF-16, for example). Etc. etc.

Simple, just implement a method for changing the stream type:

  Stream s = UnicodeSocketStream(socket, mode, encoding);
  s.changeEncoding(encoding2);

If you want high-performance streams, you can convert the strings in a separate thread before you use them, right?
 The methods should be compile-time resolvable when possible, so this 
 would be both valid and evaluated in compile time:

 ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);

Why? Converting a 14 character string doesn't take much time.

Why would it not evaluate at compile time? Do you see any benefit in that? And while it doesn't take much time once, it does take some, and more importantly, allocates new memory each time. If you're trying to do more than one request (as in thousands), I'm sure it adds up..

You only need to convert once.
 Besides, if all our strings and i/o were utf-8, there wouldn't be any 
 conversions, right?

Except every time you'd call a Win32 function, which is what's on most computers?

My mistake, let's forget the utf-8 for a while. Actually I meant that if all strings were in the native OS format (let the compiler decide), there would be no need to convert.
 Point IX. allows concatenation of strings in different encodings 



Why do you want to do that?
 without significantly increasing the complexity of overloading rules, 
 while also not requiring an inefficient toUTFxx followed by 
 concatenation (which copies the result again).

True, but as I previously said, I don't believe we need to do a great amount of conversions at the runtime level. All conversions should be near network/file interfaces, thus using Stream classes, right?

I agree decent stream classes can solve many problems, but not all of them.

"Many, but not all of them." That's why we should have std.utf.encode/decode-functions.
 Splitting the string in between will thus produce a "wrong" result, 
 but I don't think D should include any kind of full Unicode 
 processing, as it's actually needed quite rarely, so that problem is 
 ignored...


So, if you want to create a decent text editor according to the Unicode specs, you'll have to implement "full Unicode processing". But the large majority of other apps just need to be able to interface with the OS and libraries to get and display text, usually without even caring what's inside, so I see no point in including all that in D, not even as a standard library (or perhaps only after many other things are implemented first).

Ok, now I see your point. I thought you didn't want full Unicode processing even as an add-on library. I agree, you don't need these 'advanced' algorithms in the core language, rather as a separate library. Time will tell, maybe someday when we haven't got anything else to do, Phobos will finally include some cool Unicode tricks.

Jari-Matti
Nov 24 2005
parent reply xs0 <xs0 xs0.com> writes:
 Well, flexibility will definitely require some bloat in libraries, but 
 for communicating with the user, you definitely need conversions, if 
 you're not using the OS-native type (which, again, you do have the 
 option of using with being explicit about it).

But if you let the compiler vendor decide the encoding, there's a high probability that you don't need any explicit transcoding.

Sure you may need transcoding; you may use 15 different libraries, each expecting its own thing. The one thing that can be done is to not require transcoding at least when talking to the OS, which all apps have to do at some point. But even then, you should have the option to choose otherwise - if you have a UTF-8 library that you use in 99% of string-related calls, it's still faster to use UTF-8 and transcode when talking to the OS.
 You mean C/C++ -interoperability?

Yup.

I was just thinking that once D has complete wrappers for all necessary stuff, you don't need these anymore. Library (wrapper) writers should be patient enough to use explicit conversion rules.

But why should one have to create wrappers in the first place? With my proposal, you can directly link to many libraries and the compiler will do the conversions for you.
 No, who mentioned ASCII? On windows, cchar would be exactly the legacy 
 encoding each non-unicode app uses, and conversions between app's 
 internal UTF-x and cchar[] would transcode into that charset. So, for 
 example, a word processor on a non-unicode windows version could still 
 use unicode internally, while automatically talking to the OS using 
 all the characters its charset provides.

You said "In the reverse direction, all unrepresentable characters become '?'." The thing is that the D compiler doesn't know anything about your system character encoding. You can even change it on the fly, if your system is capable of doing that. Therefore this transcoding must use the greatest common denominator, which is probably 7-bit ASCII.

While the compiler may not, I'm sure it's possible to figure it out in runtime. For example, many old apps use a different language based on your settings, browsers send different Accept-Language, etc. So, it is possible, I think.
 Again, sometimes you do and most times you don't. But anyhow, painting 
 casts between UTF types make no sense, and I don't think explicit 
 casts are necessary, as there can't be any loss (ok, except to cchar[]).

You don't need to convert inside your own code unless you're really creating a program that is supposed to convert stuff. I mean you need the transcoding only when interfacing with foreign code / i/o.

If you don't need to convert, fine. If you do need to convert, I see no harm in it being as easy/convenient as possible.
 Yup, but with warnings; using one of these shows that you've thought 
 about what you're doing, so the compiler is free to shut up :)

Yes, now you're right. The programmer should _always_ explicitly declare all conversions.

Why?
 ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
 ubyte[] .asUTF16LE(bool includeBOM)
 ubyte[] .asUTF16BE(bool includeBOM)
 ubyte[] .asUTF32LE(bool includeBOM)
 ubyte[] .asUTF32BE(bool includeBOM)

level for a reason. You see, conversions from Unicode to ISO-8859-x/KOI8-R/... should be allowed.

Sure they should be allowed, but D is supposed to be Unicode, so a D app should generally only deal with that, and other charsets should generally only exist in byte[] buffers before input or after output.

Then tell me, how do I fill these buffers with your new functions?

You don't. Only UTFs and one OS-native encoding are supported in the language, the latter for obvious convenience. Others have to be done with a library. Note that the compiler is free to use the same library, it's not like anything would have to be done twice.
 UTF-8 _does_ have a BOM.

It does? What is it? I thought that single bytes have no Byte Order, so why would you need a Mark?

0xEF 0xBB 0xBF

OK, then it's not a dummy parameter :)
 If we didn't have several types of strings, this all would be much 
 easier.

Agreed, but we do have several types of strings :)

I'm trying to say we don't need several types of strings :)

Why? I think if it's done properly, there are benefits from having a choice, while not complicating matters when one doesn't care.
 Because it's actually the opposite - Unicode conversions are simple, 
 while slicing is hard (at least slicing on character boundaries). Even 
 in the simple example you give, I have no idea whether the first Ä is 
 one character or two, as both cases look the same.

It's not really that hard. One downside is that you have to parse through the string (unless compiler uses UTF-16/32 as an internal string type).

It is "hard" - if you want to get the first character, as in the first character that the user sees, it can actually be from 1 to x characters, where x can be at least 5 (that case is actually in the unicode standard) and possibly more (and I don't mean code units, but characters).
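The 1-to-x case is easy to demonstrate with combining marks; Python's unicodedata can at least show the code-point count (a sketch, not from the thread):

```python
import unicodedata

# One user-visible character: a base letter plus four combining marks,
# five code points in total (and there is no upper limit on the marks).
cluster = "a\u0301\u0308\u0323\u0331"
assert len(cluster) == 5
# Everything after the base letter is a combining mark (class != 0).
assert all(unicodedata.combining(c) != 0 for c in cluster[1:])
```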
 Slicing the string on the code unit level doesn't make any sense, now 
 does it? Because char should be treated as a special type by the 
 compiler, I see no other use for slicing than this. Like you said, the 
 alternative slicing can be achieved by casting the string to void[] (for 
 i/o data buffering, etc).

Well, I sure don't have anything against making slicing strings slice on character boundaries... Although that complicates matters - which length should .length then return? It will surely bork all kinds of templates, so perhaps it should be done with a different operator, like {a..b} instead of [a..b], and length-in-characters should be .strlen.
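The two lengths really do differ; in Python terms, len() of the string counts code points while len() of the encoded form counts code units:

```python
s = "Älyttömämmäksi"
assert len(s) == 14                            # code points (the ".strlen" idea)
assert len(s.encode("utf-8")) == 18            # UTF-8 code units (bytes)
assert len(s.encode("utf-16-le")) // 2 == 14   # UTF-16 code units, all BMP here
```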
 Why do you have to output Unicode strings using legacy non-Unicode 
 C-APIs? AFAIK DUI / stardard I/O and other libraries use standard 
 Unicode, right? At least QT / GTK+ / Win32API / Linux console do 
 support Unicode.

Well, your point is moot, because if there's no such function to call, then there is no problem. But when there is such a function, you would hope that the language/library does something sensible by default, wouldn't you?

No, this brilliant invention of yours causes problems even if we didn't have any 'legacy'-systems/APIs. You see, Library-writer 1 might use UTF-16 for his library because he uses Windows and thinks it's the fastest charset. Now Library-writer 2 has done his work using UTF-8 as an internal format. If you make a client program that links with these both, you (may) have to create unnecessary conversions just because one guy decided to create his own standards.

Please don't get personal, as I and many others don't consider it polite. Anyhow, even if all D libraries use the same encoding, D is still directly linkable to C libraries and it's obvious one doesn't have control over what encoding they're using, so I fail to see what is wrong with supporting different ones, and I also fail to see how it will help to decree one of them The One and ignore all others.
 2) UTF-16 for ever and ever, which sucks if you want it to either take 
 less memory or don't want to worry about surrogates; just look at all 
 the crappy functions they had to add in Java 5 to support the entire 
 Unicode charset :)

Partly true. What I meant was that most Java programmers use only one kind of string class (because they don't have/need other types).

Well, writing something high-performance string-related in Java definitely takes a lot of code, because the built-in String class is often useless. I see no need to repeat that in D.
 Well, a sequence of bytes can be either cchar[], UTF-8, UTF-16 or 
 UTF-32, but not more than one at the same time (OK, unless it's ASCII 
 only, which fits both the first two). So, for example, if you cast 
 utf8[] to utf16[], either the data is UTF-8 and you don't get a UTF-16 
 string (but some mumbo jumbo), or it's UTF-16 and was never valid 
 UTF-8 in the first place.

Ok. But I thought you said utf8[] is implicitly converted to utf16[]. Then it's always valid whatever-type-it-is.

Yes I did, and that has nothing to do with the above paragraph, as it's referring to the current situation, where casts between char types actually don't transcode.
 Why should you be forced to use a stream for something so simple?

So simple? Ahem, std.stream.File _is_ a stream. Here's my version:

  File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF8);
  f.writeLine("valid unicode text åäöü");
  f.close;

  File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF16LE);
  f.writeLine("valid unicode text åäöü");
  f.close;

Advantages:
-supports BOM values
-easy to use, right?

Well, I sure don't think so :P Why do I need a special class just to be able to output strings? Where is the BOM placed? Does every string include a BOM, or just the file at the beginning? How can I change that? If the writeLine is 2000 lines away from the stream declaration, how can I tell what it will do? I'd certainly prefer

  File f = new File("foo", FileMode.Out);
  f.write("valid whatever".asUTF16LE);
  f.close;

Less typing, too :)
 If you want high-performance streams, you can convert the strings in a 
 separate thread before you use them, right?

I don't know why you need a thread, but in any case, is that the easiest solution (to code) you can think of?
 The methods should be compile-time resolvable when possible, so this 
 would be both valid and evaluated in compile time:

 ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);

Why? Converting a 14 character string doesn't take much time.

Why would it not evaluate at compile time? Do you see any benefit in that? And while it doesn't take much time once, it does take some, and more importantly, allocates new memory each time. If you're trying to do more than one request (as in thousands), I'm sure it adds up..

You only need to convert once.

Again, why would it not evaluate at compile time? Do you see any benefit in that?
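The convert-once point is the usual hoisting pattern; sketched here in Python with a hypothetical writer callback (D's proposed compile-time evaluation would go one step further):

```python
# Converting inside the loop allocates a fresh buffer per request:
def send_naive(write, n):
    for _ in range(n):
        write("GET / HTTP/1.0\r\n\r\n".encode("utf-8"))

# Hoisting the conversion performs it once, up front:
REQUEST = "GET / HTTP/1.0\r\n\r\n".encode("utf-8")

def send_hoisted(write, n):
    for _ in range(n):
        write(REQUEST)
```

Both send the same bytes; the second just never re-encodes.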
 Point IX. allows concatenation of strings in different encodings 



Why do you want to do that?

I don't, I want the whole world to use dchar[]s. But it doesn't, so using multiple encodings should be as easy as possible.

xs0
Nov 25 2005
next sibling parent reply =?ISO-8859-1?Q?Jari-Matti_M=E4kel=E4?= <jmjmak invalid_utu.fi> writes:
xs0 wrote:

 I was just thinking that once D has complete wrappers for all 
 necessary stuff, you don't need these anymore. Library (wrapper) 
 writers should be patient enough to use explicit conversion rules.

But why should one have to create wrappers in the first place? With my proposal, you can directly link to many libraries and the compiler will do the conversions for you.

Even D uses wrappers because they're easier to work with. If you think that a wrapper might be slow, the D specs allow the compiler to inline wrapper functions.
 No, who mentioned ASCII? On windows, cchar would be exactly the 
 legacy encoding each non-unicode app uses, and conversions between 
 app's internal UTF-x and cchar[] would transcode into that charset. 
 So, for example, a word processor on a non-unicode windows version 
 could still use unicode internally, while automatically talking to 
 the OS using all the characters its charset provides.

You said "In the reverse direction, all unrepresentable characters become '?'." The thing is that the D compiler doesn't know anything about your system character encoding. You can even change it on the fly, if your system is capable of doing that. Therefore this transcoding must use the greatest common denominator, which is probably 7-bit ASCII.

While the compiler may not, I'm sure it's possible to figure it out in runtime. For example, many old apps use a different language based on your settings, browsers send different Accept-Language, etc. So, it is possible, I think.

You can't be serious. Of course browsers use several encodings, but they also let the users choose them. You cannot achieve such functionality with a statically chosen cchar type. If you're going to change the cchar type on the fly, characters 128-255 become corrupted sooner than you think. That's why I would use conversion libraries.
 You don't need to convert inside your own code unless you're really 
 creating a program that is supposed to convert stuff. I mean you need 
 the transcoding only when interfacing with foreign code / i/o.

If you don't need to convert, fine. If you do need to convert, I see no point in it being as easy/convenient as possible.

But you don't need to convert inside your own code:

  utf8  foo(utf16 param) { return param.asUTF8; }
  utf32 bar(utf8  param) { return param.asUTF32LE; }
  utf16 zoo(utf32 param) { return param.asUTF16LE; }

  void main() {
    utf16 string = "something";
    writefln( utf16( utf32( utf8(string) ) ) );
  }

Doesn't look pretty useful to me, at least :) It's the same thing with implicit conversions. You don't need them in your 'own' code.
 Yes, now you're right. The programmer should _always_ explicitely 
 declare all conversions.


Because it will remove all 'hidden' (string) conversions.
 If we didn't have several types of strings, this all would be much 
 easier.



choice, while not complicating matters when one doesn't care.

Of course there's always a benefit, but it makes things more complex. Are you really saying that having 4 string types is easier than having just one? With only one type you don't need casting rules nor so many encumbering keywords etc. You always have to make a tradeoff somewhere. I'm not suggesting my own proposal just because I'm stubborn or something, I just know that you _can_ write Unicode-aware programs with just one string type and it doesn't cost much (in runtime performance/memory footprint). If you don't believe me, please try to simulate these proposals using custom string classes.
 Because it's actually the opposite - Unicode conversions are simple, 
 while slicing is hard (at least slicing on character boundaries). 
 Even in the simple example you give, I have no idea whether the first 
 Ä is one character or two, as both cases look the same.

It's not really that hard. One downside is that you have to parse through the string (unless compiler uses UTF-16/32 as an internal string type).

It is "hard" - if you want to get the first character, as in the first character that the user sees, it can actually be from 1 to x characters, where x can be at least 5

Oh, I thought that a UTF-16 character is always encoded using 16 bits, UTF-32 using 32 bits and UTF-8 using 8-32 bits? Am I wrong?
 (that case is actually in the unicode 
 standard) and possibly more (and I don't mean code units, but characters).

Slicing&indexing with UTF-16/32 is straightforward. Just multiply the index by 2/4. UTF-8 is only a bit harder - you need to iterate through the string, but it's not that hard. It's usually much faster than O(n).
 Slicing the string on the code unit level doesn't make any sense, now 
 does it? Because char should be treated as a special type by the 
 compiler, I see no other use for slicing than this. Like you said, the 
 alternative slicing can be achieved by casting the string to void[] 
 (for i/o data buffering, etc).

Well, I sure don't have anything against making slicing strings slice on character boundaries... Although that complicates matters - which length should .length then return? It will surely bork all kinds of templates, so perhaps it should be done with a different operator, like {a..b} instead of [a..b], and length-in-characters should be .strlen.

Yes, it's true. My solution is a bit inconsistent, but doesn't hurt anyone: it uses character boundaries inside the []-syntax (also .length might be the character version inside the brackets), but the code unit version elsewhere. I think D should use an internal counter for data type length and provide an intelligent (data type specific) .length for the programmer. {a..b} doesn't look good to me.
 Well, your point is moot, because if there's no such function to 
 call, then there is no problem. But when there is such a function, 
 you would hope that the language/library does something sensible by 
 default, wouldn't you?

No, this brilliant invention of yours causes problems even if we didn't have any 'legacy'-systems/APIs. You see, Library-writer 1 might use UTF-16 for his library because he uses Windows and thinks it's the fastest charset. Now Library-writer 2 has done his work using UTF-8 as an internal format. If you make a client program that links with these both, you (may) have to create unnecessary conversions just because one guy decided to create his own standards.

Please don't get personal, as I and many others don't consider it polite.

Sorry, trying to calm down a bit ;) You know, this thing is important to me as I write most of my programs using Unicode I/O.
 
 Anyhow, even if all D libraries use the same encoding, D is still 
 directly linkable to C libraries and it's obvious one doesn't have 
 control over what encoding they're using,

That's true.
 so I fail to see what is wrong 
 with supporting different ones, and I also fail to see how it will help 
 to decree one of them The One and ignore all others.

Surely you agree that all transcoding is bad for performance. Minimizing the need to transcode inside D code (by eliminating the unnecessary string types) maximizes the performance, right?
 Well, writing something high-performance string-related in Java 
 definitely takes a lot of code, because the built-in String class is 
 often useless. I see no need to repeat that in D.

IMHO forcing regular programmers to use high-performance strings everywhere as the only option is bad. All strings don't need to be that fast. It would look pretty funny if you really needed to choose a proper encoding just to create a valid 'Hello world!' example.
   File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF8);
   f.writeLine("valid unicode text åäöü");
   f.close;

   File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF16LE);
   f.writeLine("valid unicode text åäöü");
   f.close;

 Advantages:
 -supports BOM values
 -easy to use, right?

Well, I sure don't think so :P Why do I need a special class just to be able to output strings? Where is the BOM placed? Does every string include a BOM, or just the file at the beginning? How can I change that? If the writeLine is 2000 lines away from the stream declaration, how can I tell what it will do? I'd certainly prefer

  File f = new File("foo", FileMode.Out);
  f.write("valid whatever".asUTF16LE);
  f.close;

Less typing, too :)

Less typing? No, you're wrong. Your approach requires the programmer to remember the correct encoding every time (s)he writes to that file. In case you didn't know, valid UTF-x files use a BOM only at the beginning of the file. My UnicodeFile class knows this. Your solution writes the BOM every time you write a string (test it, if you don't believe). In addition, changing the BOM in the middle of a valid UTF-x stream is illegal. If you want to create a datafile that serializes the 'objects', you can use regular files just like you did here.
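The BOM-every-write problem can be checked directly; here with Python's `utf-8-sig` codec standing in for the UTF-16LE case from the example:

```python
line = "valid unicode text åäöü\n"

# Encoding every line with the BOM-emitting codec repeats the mark:
naive = line.encode("utf-8-sig") + line.encode("utf-8-sig")
assert naive.count(b"\xef\xbb\xbf") == 2   # two BOMs -- not a valid UTF-8 file

# An encoding-aware writer emits the BOM only at the start of the file:
correct = line.encode("utf-8-sig") + line.encode("utf-8")
assert correct.count(b"\xef\xbb\xbf") == 1
```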
 If you want high-performance streams, you can convert the strings in a 
 separate thread before you use them, right?

I don't know why you need a thread, but in any case, is that the easiest solution (to code) you can think of?

No, not the easiest. AFAIK in real life a high-performance web server uses separate threads for data processing. In case you're writing a single-threaded application, you can precalculate the string in the _same_ thread.
 The methods should be compile-time resolvable when possible, so 
 this would be both valid and evaluated in compile time:

 ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);

Why? Converting a 14 character string doesn't take much time.

Why would it not evaluate at compile time? Do you see any benefit in that? And while it doesn't take much time once, it does take some, and more importantly, allocates new memory each time. If you're trying to do more than one request (as in thousands), I'm sure it adds up..

You only need to convert once.

Again, why would it not evaluate at compile time? Do you see any benefit in that?

I think I already said that you really don't know what the best encoding to use would be at compile time. You're saying (by having several types) that the programmer should decide this. Now building portable multiplatform programs isn't that simple. Your approach requires you to define several version {} blocks for different architectures => it isn't that simple anymore. You need to use version blocks because if you decided to use utf-8, it would be fast on *nixes and slow on Windows. And if you used utf-16, the opposite would happen.
 Point IX. allows concatenation of strings in different encodings 



Why do you want to do that?

I don't, I want the whole world to use dchar[]s. But it doesn't, so using multiple encodings should be as easy as possible.

But I'm saying here that we don't need several string types.

Jari-Matti

P.S. I won't be reading the NG for the next couple of days. I'll try to answer your (potential) future posts as soon as I get back.
Nov 25 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Fri, 25 Nov 2005 15:50:13 +0200, Jari-Matti Mäkelä wrote:


[snip]


 Oh, I thought that UTF-16 character is always encoded using 16 bits, 
 UTF-32 using 32 bits and UTF-8 using 8-32 bits? Am I wrong?

Wrong, I'm afraid. Some characters use 32 bits in UTF16.

  UTF8:  1, 2, 3, and 4 byte characters.
  UTF16: 2 and 4 byte characters.
  UTF32: 4 byte characters (only)

-- 
Derek Parnell
Melbourne, Australia
26/11/2005 8:37:13 AM
Nov 25 2005
parent reply xs0 <xs0 xs0.com> writes:
Derek Parnell wrote:
 On Fri, 25 Nov 2005 15:50:13 +0200, Jari-Matti Mäkelä wrote:
 
 
 [snip]
 
 
 
Oh, I thought that UTF-16 character is always encoded using 16 bits, 
UTF-32 using 32 bits and UTF-8 using 8-32 bits? Am I wrong?

Wrong, I'm afraid. Some characters use 32 bits in UTF16.

  UTF8:  1, 2, 3, and 4 byte characters.
  UTF16: 2 and 4 byte characters.
  UTF32: 4 byte characters (only)

Furthermore, a single visible character can be encoded using more than one Unicode character (for example, a C with a caron can be both a single character and two characters, C + combining caron). Since there's no limit to how many combining characters a single "normal" char can have, slicing on char boundaries is not solved merely by finding UTF boundaries, which was my initial point. xs0
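Derek's byte counts, plus the surrogate-pair case, can be verified with Python's codecs (a quick check, not from the thread):

```python
# UTF-8: 1 to 4 bytes per character.
assert len("A".encode("utf-8")) == 1
assert len("Ä".encode("utf-8")) == 2
assert len("\u20ac".encode("utf-8")) == 3          # EURO SIGN
assert len("\U0001d11e".encode("utf-8")) == 4      # MUSICAL SYMBOL G CLEF

# UTF-16: 2 bytes in the BMP, 4 bytes for a surrogate pair.
assert len("A".encode("utf-16-le")) == 2
assert len("\U0001d11e".encode("utf-16-le")) == 4

# UTF-32: always 4 bytes.
assert len("\U0001d11e".encode("utf-32-le")) == 4
```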
Nov 26 2005
parent =?ISO-8859-1?Q?Jari-Matti_M=E4kel=E4?= <jmjmak invalid_utu.fi> writes:
xs0 wrote:
 Oh, I thought that UTF-16 character is always encoded using 16 bits, 
 UTF-32 using 32 bits and UTF-8 using 8-32 bits? Am I wrong?

Wrong, I'm afraid. Some characters use 32 bits in UTF16. UTF8: 1, 2, 3, and 4 byte characters. UTF16: 2 and 4 byte characters. UTF32: 4 byte characters (only)

Furthermore, a single visible character can be encoded using more than one Unicode character (for example, a C with a caron can be both a single character and two characters, C + combining caron). Since there's no limit to how many combining characters a single "normal" char can have, slicing on char boundaries is not solved merely by finding UTF boundaries, which was my initial point.

Thanks, I wasn't aware of this before. It seems that I have underestimated the performance issues (web servers, etc.) of having only one Unicode text type. I have to admit the current types in D are a suitable compromise. They're not always the "easiest" way to do things, but have no greater weaknesses either.

I guess the only thing I tried to say was that it really _is_ possible to write all programs with only a single encoding-independent Unicode type. But this approach has a few big downsides in some performance-critical applications and therefore shouldn't be the default behavior for a systems programming language like D. On a scripting language it would be a killer feature, though.

---

* IMO support for indexing & slicing on Unicode character boundaries is not that obligatory on the language syntax level, but it would be nice to have this functionality somewhere. :) At least there's little use for [d,w]char slicing now.

* I wish Walter could fix this [1] bug: (I know why it produces compile-time errors, but don't know why DMD allows you to do that)

[1] digitalmars.D/30566

I wish it worked like this:

  char foo = '\u0000' // ok (C-strings compatibility)
  char foo = '\u0001' - '\u007f' // ok
  char foo = '\u0080' - '\uffff' // compile error

* A fully Unicode-aware stream system [2] would also be a nice feature: (currently there's no convenient way to create valid UTF-encoded text files with BOM)

[2] digitalmars.D.bugs/5636

That would (perhaps) require Walter/us to reconsider the Phobos stream class hierarchy.
Nov 27 2005
prev sibling parent Georg Wrede <georg.wrede nospam.org> writes:
xs0 wrote:
 
 I'd certainly prefer
 
 File f=new File("foo", FileMode.Out);
 f.write("valid whatever".asUTF16LE);
 f.close;
 
 Less typing, too :)

I'd have hoped you'd prefer

  File f = new File("foo", FileMode.Out.UTF16LE);
  f.print("Just doit! Nike");
  f.close;

Save even more ink, in case you print more than once to the file, too. And it's smarter overall, right?
Nov 25 2005