digitalmars.D - Selectable encodings

"John C" <johnch_atms hotmail.com> writes:

I know of three ways to support a user-selected char encoding in a library, 
but each has its drawbacks.

1) Method overloading
Introduces conflicts with string literals (forcing a c/w/d suffix to be 
used) and you can't overload by return type.

2) Parameterising all types that use strings
Making every class a template just to get this functionality seems over the 
top.
class SomeClassT(TChar) {
    TChar[] getSomeString() { return null; } // stub body so the sketch compiles
}
alias SomeClassT!(char) SomeClass; // in library module
alias SomeClassT!(wchar) SomeClass; // in user module

3) A compiler version condition with aliases.
The version condition approach is the most attractive to me, but some people 
aren't fond of it.
version (utf8) alias mlchar char;
else version (utf16) alias mlchar wchar;
else version (utf32) alias mlchar dchar;
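
A minimal sketch of how the version-alias approach might look in present-day D syntax. The fallback to UTF-8 when no identifier is set, and the `greeting` function, are assumptions added for illustration; the identifier would be selected with the compiler's `-version=utf16` style switch.

```d
// Pick the library-wide code unit type at compile time.
version (utf16)      alias mlchar = wchar;
else version (utf32) alias mlchar = dchar;
else                 alias mlchar = char;   // assumed default: UTF-8

// Hypothetical library function written once against mlchar.
immutable(mlchar)[] greeting()
{
    return "hello";   // the literal converts to whichever string type is selected
}

unittest
{
    // Built without any -version flag, the fallback branch is taken.
    static assert(is(mlchar == char));
    assert(greeting().length == 5);
}
```

The library source is written once against `mlchar`, so the cost of the choice is paid at compile time rather than per call.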

There's a fourth way - encoding conversion, but there's a runtime cost.
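
For comparison, a sketch of the conversion approach using `std.utf.toUTF16`. The `backendLength` and `lengthViaConversion` names are hypothetical; the point is only that every call transcodes and allocates.

```d
import std.utf : toUTF8, toUTF16;

// Hypothetical backend that only understands UTF-16.
size_t backendLength(const(wchar)[] text)
{
    return text.length;   // counts UTF-16 code units
}

// Wrapper exposed to users who keep their strings in UTF-8.
size_t lengthViaConversion(string text)
{
    // toUTF16 allocates and transcodes on every call: this is the runtime cost.
    return backendLength(toUTF16(text));
}

unittest
{
    string s = "na\u00EFve";              // U+00EF takes two bytes in UTF-8
    assert(s.length == 6);                // UTF-8 code units
    assert(lengthViaConversion(s) == 5);  // UTF-16 code units
    assert(toUTF8(toUTF16(s)) == s);      // round trip preserves the text
}
```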

So does anyone use an alternative way to enable users to select which char 
encoding they want to use at compile time?

Apr 06 2006

Mike Capp <mike.capp gmail.com> writes:

In article <e12j34$2gi2$1 digitaldaemon.com>, John C says...
version (utf8) alias mlchar char;

Apologies for going off at a tangent to your question, but I've never quite
understood what D thinks it's doing here. If char[] is an array of characters,
then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding. So
is char[] an array of characters from some other charset (e.g. the subset of
UTF-8 representable in one byte), or is it an array of bytes encoding a UTF-8
string (in which case I suspect quite a lot of string-handling code is badly
broken)?

cheers
Mike

Apr 06 2006

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Mike Capp skrev:
 In article <e12j34$2gi2$1 digitaldaemon.com>, John C says...
 version (utf8) alias mlchar char;

 
 Apologies for going off at a tangent to your question, but I've never quite
 understood what D thinks it's doing here. If char[] is an array of characters,
 then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding. So
 is char[] an array of characters from some other charset (e.g. the subset of
 UTF-8 representable in one byte), or is it an array of bytes encoding a UTF-8
 string (in which case I suspect quite a lot of string-handling code is badly
 broken)?

It is the latter. But I don't think much of the string handling code is 
broken because of that.
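
To make "the latter" concrete, a small sketch (present-day D syntax, helper name invented here) showing that `.length` and indexing work on UTF-8 code units, while a `foreach` over `dchar` decodes whole characters:

```d
// Count code points by letting foreach decode the UTF-8 sequence.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)   // asking for dchar makes the loop decode UTF-8
        ++n;
    return n;
}

unittest
{
    string s = "caf\u00E9";          // U+00E9 is encoded as the bytes C3 A9
    assert(s.length == 5);           // .length counts bytes (code units)
    assert(countCodePoints(s) == 4); // four characters
    assert(s[3] == 0xC3);            // indexing yields a code unit, not a character
}
```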

/Oskar

Apr 06 2006

James Dunne <james.jdunne gmail.com> writes:

Oskar Linde wrote:
 Mike Capp skrev:
 
 In article <e12j34$2gi2$1 digitaldaemon.com>, John C says...

 version (utf8) alias mlchar char;


 Apologies for going off at a tangent to your question, but I've never quite
 understood what D thinks it's doing here. If char[] is an array of characters,
 then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding. So
 is char[] an array of characters from some other charset (e.g. the subset of
 UTF-8 representable in one byte), or is it an array of bytes encoding a UTF-8
 string (in which case I suspect quite a lot of string-handling code is badly
 broken)?

 
 
 It is the latter. But I don't think much of the string handling code is 
 broken because of that.
 
 /Oskar

The char type is really a misnomer for dealing with UTF-8 encoded 
strings.  It should be named closer to "code-unit for UTF-8 encoding". 
For my own research language I've chosen what I believe to be a nice 
type naming system:

     char            - 32-bit Unicode code point

     u8cu            - UTF-8 code unit
     u16cu           - UTF-16 code unit
     u32cu           - UTF-32 code unit

I could be wrong (and I bet I am) on the terminology used to describe 
char, but I really mean it to just store a full Unicode character such 
that strings of chars can safely assume character index == array index.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O 
M--  V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e 
h>--->++ r+++ y+++
------END GEEK CODE BLOCK------

James Dunne

Apr 06 2006

Anders F Björklund <afb algonet.se> writes:

James Dunne wrote:

 The char type is really a misnomer for dealing with UTF-8 encoded 
 strings.  It should be named closer to "code-unit for UTF-8 encoding". 

Yeah, but it does hold an *ASCII* character ?

Usually the D code handles char[] with dchar,
but with a "short path" for ASCII characters...
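
A rough sketch of that pattern, assuming `std.utf.decode` for the multibyte case. The function name and the checksum workload are placeholders, not library code:

```d
import std.utf : decode;

// Sum of code point values, decoding only when a byte is not ASCII.
// (The checksum itself is just a placeholder workload.)
ulong sumCodePoints(const(char)[] s)
{
    ulong total = 0;
    size_t i = 0;
    while (i < s.length)
    {
        if (s[i] < 0x80)
        {
            total += s[i];   // short path: one byte is one character
            ++i;
        }
        else
        {
            dchar c = decode(s, i);   // advances i past the whole sequence
            total += c;
        }
    }
    return total;
}

unittest
{
    assert(sumCodePoints("AB") == 65 + 66);
    assert(sumCodePoints("A\u00E9") == 65 + 0xE9);
}
```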

 I could be wrong (and I bet I am) on the terminology used to describe
 char, but I really mean it to just store a full Unicode character
 such that strings of chars can safely assume character index == array
 index.

For the general case, UTF-32 is a pretty wasteful
Unicode encoding just to have that privilege ?



--anders

Apr 06 2006

Mike Capp <mike.capp gmail.com> writes:

(Changing subject line since we seem to have rudely hijacked the OP's topic)

In article <e13b56$is0$1 digitaldaemon.com>,
Anders F Björklund says...
James Dunne wrote:

 The char type is really a misnomer for dealing with UTF-8 encoded 
 strings.  It should be named closer to "code-unit for UTF-8 encoding". 


(I fully agree with this statement, by the way.)

Yeah, but it does hold an *ASCII* character ?

I don't find that very helpful - seeing a char[] in code doesn't tell me
anything about whether it's byte-per-character ASCII or possibly-multibyte
UTF-8.

For the general case, UTF-32 is a pretty wasteful
Unicode encoding just to have that privilege ?

I'm not sure there is a "general case", so it's hard to say. Some programmers
have to deal with MBCS every day; others can go for years without ever having to
worry about anything but vanilla ASCII.

"Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but
UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth.
Finding the millionth character in a UTF-8 string means looping through at least
a million bytes, and executing some conditional logic for each one. Finding the
millionth character in a UTF-32 string is a simple pointer offset and one-word
fetch.
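
A sketch of the difference, with a hand-written scan (the function name is invented, and the input is assumed to be valid UTF-8):

```d
// Byte offset of character n in UTF-8: every lead byte must be inspected.
size_t offsetOfChar(const(char)[] s, size_t n)
{
    size_t i = 0;
    while (n > 0 && i < s.length)
    {
        const b = s[i];
        // Sequence length from the lead byte (input assumed to be valid UTF-8).
        if      (b < 0x80) i += 1;
        else if (b < 0xE0) i += 2;
        else if (b < 0xF0) i += 3;
        else               i += 4;
        --n;
    }
    return i;
}

unittest
{
    string  u8  = "a\u00E9\u4E2Dz";    // 1-, 2-, 3- and 1-byte sequences
    dstring u32 = "a\u00E9\u4E2Dz"d;
    assert(offsetOfChar(u8, 3) == 6);  // found by scanning: O(n)
    assert(u8[6] == 'z');
    assert(u32[3] == 'z');             // UTF-32: plain array index, O(1)
}
```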

At the risk of repeating James, I do think that spelling "string" as
"char[]"/"wchar[]" is grossly misleading, particularly to people coming from any
other C-family language. If I was doing any serious string-handling work in D
I'd almost certainly write an opaque String class that overloaded opIndex
(returning dchar) to do the right thing, and optimised the underlying storage to
suit the app's requirements.
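
Something along these lines, perhaps. This is only a sketch: it uses a struct rather than a class, keeps the storage as UTF-8, and the `Text` name is made up. `toUTFindex` and `decode` are taken from `std.utf`.

```d
import std.utf : decode, toUTFindex;

// Hypothetical wrapper: indexes by character, stores UTF-8 internally.
struct Text
{
    private string data;

    this(string s) { data = s; }

    // Returns the n-th code point; costs a scan from the start of the string.
    dchar opIndex(size_t n)
    {
        size_t i = toUTFindex(data, n);   // character index -> byte offset
        return decode(data, i);           // decode the sequence found there
    }
}

unittest
{
    auto t = Text("a\u00E9z");
    assert(t[1] == '\u00E9');
    assert(t[2] == 'z');
}
```

An implementation tuned for random access could instead store `dchar[]` behind the same `opIndex`, which is the "optimise the storage" part.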

cheers
Mike

Apr 06 2006

Georg Wrede <georg.wrede nospam.org> writes:

Mike Capp wrote:
 (Changing subject line since we seem to have rudely hijacked the OP's
 topic)
 
 In article <e13b56$is0$1 digitaldaemon.com>, 
 Anders F Björklund says...
 
 James Dunne wrote:
 
 The char type is really a misnomer for dealing with UTF-8 encoded
 strings.  It should be named closer to "code-unit for UTF-8
 encoding".


 
 (I fully agree with this statement, by the way.)

Yes. And it's a _gross_ misnomer.

And we who are used to D can't even _begin_ to appreciate the 
[unnecessary!] extra work and effort needed to gradually come to 
understand it "our way", for those new to D.

 Yeah, but it does hold an *ASCII* character ?

 
 I don't find that very helpful - seeing a char[] in code doesn't tell
 me anything about whether it's byte-per-character ASCII or
 possibly-multibyte UTF-8.

(( A dumb idea: the input stream has a flag that gets set as soon as the 
first non-ASCII character is found. ))

 For the general case, UTF-32 is a pretty wasteful Unicode encoding
 just to have that privilege ?

 
 I'm not sure there is a "general case", so it's hard to say. Some
 programmers have to deal with MBCS every day; others can go for years
 without ever having to worry about anything but vanilla ASCII.

True!! Folks in Boise, Idaho, vs. folks in the non-British Europe or the 
Far East.

 "Wasteful" is also relative. UTF-32 is certainly wasteful of memory
 space, but UTF-8 is potentially far more wasteful of CPU cycles and
 memory bandwidth.

It sure looks like it. Then again, studying the UTF-8 spec, and "why we 
did it this way" (sorry, no URL here. Anybody?), shows that it actually 
is _amazingly_ light on CPU cycles! Really.

(( I sure wish there was somebody in this NG who could write a 
Scientifically Valid test to compare the time needed to find the 
millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))

 Finding the millionth character in a UTF-8 string
 means looping through at least a million bytes, and executing some
 conditional logic for each one. Finding the millionth character in a
 UTF-32 string is a simple pointer offset and one-word fetch.

True. And even if we'd exclude any "character width logic" in the 
search, we still end up with sequential lookup O(n) vs. O(1).

Then again, when's the last time anyone here had to find the millionth 
character of anything?  :-)

So, of course for library writers, this appears as most relevant, but 
for real world programming tasks, I think after profiling, the time 
wasted may be minor, in practice.

(Ah, and of course, turning a UTF-8 input into UTF-32 and then straight 
shooting the millionth character, is way more expensive (both in time 
and size) than just a loop through the UTF-8 as such. Not to mention the 
losses if one were, instead, to have a million-character file on hard 
disk in UTF-32 (i.e. a 4MB file) to avoid the look-through. Probably the 
time reading in the file gets so much longer that this in itself defeats 
the "gain".)

 At the risk of repeating James, I do think that spelling "string" as 
 "char[]"/"wchar[]" is grossly misleading, particularly to people
 coming from any other C-family language.

No argument here. :-)

In the midst of The Great Character Width Brouhaha (about November last 
year), I tried to convince Walter on this particular issue.

Apr 06 2006

Jari-Matti Mäkelä <jmjmak utu.fi.invalid> writes:

Georg Wrede wrote:
 (( I sure wish there was somebody in this NG who could write a
 Scientifically Valid test to compare the time needed to find the
 millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))

It's O(n) vs O(n). :) You have to go through all the bytes in both
cases. I guess the conversion has a higher coefficient.

 So, of course for library writers, this appears as most relevant, but
 for real world programming tasks, I think after profiling, the time
 wasted may be minor, in practice.

Why not use the same encoding throughout the whole program and its
libraries? No need to convert anywhere.

 (Ah, and of course, turning a UTF-8 input into UTF-32 and then straight
 shooting the millionth character, is way more expensive (both in time
 and size) than just a loop through the UTF-8 as such. Not to mention the
 losses if one were, instead, to have a million-character file on hard
 disk in UTF-32 (i.e. a 4MB file) to avoid the look-through. Probably the
 time reading in the file gets so much longer that this in itself defeats
 the "gain".)

That's very true. A "normal" hard drive reads 60 MB/s. So, reading a 4
MB file takes at least 66 ms and a 1 MB UTF-8-file (only
ASCII-characters) is read in 17 ms (well, I'm a bit optimistic here :).
A modern processor executes 3 000 000 000 operations in a second. Going
through the UTF-8 stream takes 1 000 000 * 10 (perhaps?) operations and
thus costs 3 ms. So it's actually faster to read UTF-8.
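
The same back-of-the-envelope numbers written out, so the inputs are easy to change. Every constant is one of the guesses above, not a measurement:

```d
// Back-of-the-envelope figures from the post; all inputs are its assumptions.
enum double diskBytesPerSec = 60_000_000;     // "normal" hard drive, 60 MB/s
enum double opsPerSec       = 3_000_000_000;  // operations per second
enum double chars           = 1_000_000;      // characters in the file
enum double opsPerChar      = 10;             // guessed decoding cost per character

enum double utf32ReadMs = chars * 4 / diskBytesPerSec * 1000;     // about 66.7 ms
enum double utf8ReadMs  = chars * 1 / diskBytesPerSec * 1000;     // about 16.7 ms (ASCII only)
enum double utf8ScanMs  = chars * opsPerChar / opsPerSec * 1000;  // about 3.3 ms

// Reading plus scanning UTF-8 (about 20 ms) still beats reading UTF-32.
static assert(utf8ReadMs + utf8ScanMs < utf32ReadMs);
```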

-- 
Jari-Matti

Apr 06 2006

Thomas Kuehne <thomas-dloop kuehne.cn> writes:


Jari-Matti wrote:
 That's very true. A "normal" hard drive reads 60 MB/s. So,
 reading a 4 MB file takes at least 66 ms and a 1 MB UTF-8-file (only
 ASCII-characters) is read in 17 ms (well, I'm a bit optimistic here :).
 A modern processor executes 3 000 000 000 operations in a
 second. Going through the UTF-8 stream takes 1 000 000 * 10 (perhaps?)
 operations and thus costs 3 ms. So it's actually faster to read UTF-8.

1) your sample: English (consider Chinese)
2) magic word: seek

Thomas



Apr 06 2006

Jari-Matti Mäkelä <jmjmak utu.fi.invalid> writes:

Thomas Kuehne wrote:
 Jari-Matti wrote:
 That's very true. A "normal" hard drive reads 60 MB/s. So,
 reading a 4 MB file takes at least 66 ms and a 1 MB UTF-8-file (only
 ASCII-characters) is read in 17 ms (well, I'm a bit optimistic here :).
 A modern processor executes 3 000 000 000 operations in a
 second. Going through the UTF-8 stream takes 1 000 000 * 10 (perhaps?)
 operations and thus costs 3 ms. So it's actually faster to read UTF-8.


 
 1) your sample: English (consider Chinese)
 2) magic word: seek

Yes, I know. This was just an optimistic tongue-in-the-cheek analysis :)
A real world example would naturally have a lot of non-ASCII characters
too, but the point is that reading huge loads of uncompressed UTF-32
data will be usually slower than reading UTF-8 if we are also checking
against text corruptions. I wonder if it's any faster to read
UTF-32-files from a transparently compressed reiser4 drive?

-- 
Jari-Matti

Apr 07 2006

Thomas Kuehne <thomas-dloop kuehne.cn> writes:


Georg Wrede schrieb am 2006-04-06:
 Mike Capp wrote:
 "Wasteful" is also relative. UTF-32 is certainly wasteful of memory
 space, but UTF-8 is potentially far more wasteful of CPU cycles and
 memory bandwidth.

 It sure looks like it. Then again, studying the UTF-8 spec, and "why we 
 did it this way" (sorry, no URL here. Anybody?), shows that it actually 
 is _amazingly_ light on CPU cycles! Really.

Have a look at the encoding of Hangul (Korean) and polytonic Greek <g>

 (( I sure wish there was somebody in this NG who could write a 
 Scientifically Valid test to compare the time needed to find the 
 millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))

Challenge:
Provide a D implementation that first converts to UTF-32 and has
shorter runtime than the code below:


[code listing not preserved in this archive]

Thomas



Apr 06 2006

Walter Bright <newshound digitalmars.com> writes:

Thomas Kuehne wrote:
 Challenge:
 Provide a D implementation that first converts to UTF-32 and has
 shorter runtime than the code below:

I don't know about that, but the code below isn't optimal <g>. Replace 
the sar's with a lookup of the 'stride' of the UTF-8 character (see 
std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().

Apr 06 2006

Sean Kelly <sean f4.ca> writes:

Walter Bright wrote:
 Thomas Kuehne wrote:
 Challenge:
 Provide a D implementation that first converts to UTF-32 and has
 shorter runtime than the code below:

 
 I don't know about that, but the code below isn't optimal <g>. Replace 
 the sar's with a lookup of the 'stride' of the UTF-8 character (see 
 std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().

I've been wondering about this.  Will 'stride' be accurate for any 
arbitrary string position or input data?  I would assume so, but don't 
know enough about how UTF-8 is structured to be sure.


Sean

Apr 06 2006

Walter Bright <newshound digitalmars.com> writes:

Sean Kelly wrote:
 Walter Bright wrote:
 Thomas Kuehne wrote:
 Challenge:
 Provide a D implementation that first converts to UTF-32 and has
 shorter runtime than the code below:

 I don't know about that, but the code below isn't optimal <g>. Replace 
 the sar's with a lookup of the 'stride' of the UTF-8 character (see 
 std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().

 
 I've been wondering about this.  Will 'stride' be accurate for any 
 arbitrary string position or input data?  I would assume so, but don't 
 know enough about how UTF-8 is structured to be sure.

UTF8stride[] will give 0xFF for values that are not at the beginning of 
a valid UTF-8 sequence.

Apr 06 2006

Sean Kelly <sean f4.ca> writes:

Walter Bright wrote:
 Sean Kelly wrote:
 Walter Bright wrote:
 Thomas Kuehne wrote:
 Challenge:
 Provide a D implementation that first converts to UTF-32 and has
 shorter runtime than the code below:

 I don't know about that, but the code below isn't optimal <g>. 
 Replace the sar's with a lookup of the 'stride' of the UTF-8 
 character (see std.utf.UTF8stride[]). An implementation is 
 std.utf.toUTFindex().

 I've been wondering about this.  Will 'stride' be accurate for any 
 arbitrary string position or input data?  I would assume so, but don't 
 know enough about how UTF-8 is structured to be sure.

 
 UTF8stride[] will give 0xFF for values that are not at the beginning of 
 a valid UTF-8 sequence.

Thanks.  I saw the 0xFF entries in UTF8stride and mostly wanted to make 
sure an odd combination of bytes couldn't be mistaken as a valid 
character, as stride seems the best fit for an "is valid UTF-8 char" 
type function.  I've been giving the 0xFF choice some thought however, 
and while it would avoid stalling loops, the alternative is an access 
violation when evaluating short strings and just weird behavior for 
large strings.  If I had to track down a program bug I'd almost prefer 
it be a tight endless loop.


Sean

Apr 06 2006

Walter Bright <newshound digitalmars.com> writes:

Sean Kelly wrote:
 Walter Bright wrote:
 Sean Kelly wrote:
 Walter Bright wrote:
 Thomas Kuehne wrote:
 Challenge:
 Provide a D implementation that first converts to UTF-32 and has
 shorter runtime than the code below:

 I don't know about that, but the code below isn't optimal <g>. 
 Replace the sar's with a lookup of the 'stride' of the UTF-8 
 character (see std.utf.UTF8stride[]). An implementation is 
 std.utf.toUTFindex().

 I've been wondering about this.  Will 'stride' be accurate for any 
 arbitrary string position or input data?  I would assume so, but 
 don't know enough about how UTF-8 is structured to be sure.

 UTF8stride[] will give 0xFF for values that are not at the beginning 
 of a valid UTF-8 sequence.

 
 Thanks.  I saw the 0xFF entries in UTF8stride and mostly wanted to make 
 sure an odd combination of bytes couldn't be mistaken as a valid 
 character, as stride seems the best fit for an "is valid UTF-8 char" 
 type function.  I've been giving the 0xFF choice some thought however, 
 and while it would avoid stalling loops, the alternative is an access 
 violation when evaluating short strings and just weird behavior for 
 large strings.  If I had to track down a program bug I'd almost prefer 
 it be a tight endless loop.

Take a look at std.utf.toUTFindex(), which takes care of the problem (by 
throwing an exception).
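
A sketch of the idea only, not the actual std.utf source: a hand-written stride function standing in for the table, and an index routine that throws when it meets a 0xFF stride or runs out of input. Both names are invented for this example.

```d
// Hand-written stand-in for a stride table (not the Phobos one).
// Returns the sequence length implied by a byte, or 0xFF if the byte
// cannot start a UTF-8 sequence.
ubyte strideOf(char c)
{
    if (c < 0x80) return 1;      // 0xxxxxxx: ASCII
    if (c < 0xC0) return 0xFF;   // 10xxxxxx: continuation byte
    if (c < 0xE0) return 2;      // 110xxxxx
    if (c < 0xF0) return 3;      // 1110xxxx
    if (c < 0xF8) return 4;      // 11110xxx
    return 0xFF;                 // not a lead byte in this sketch
}

// Byte offset of character n; throws instead of running off the end.
size_t checkedIndex(const(char)[] s, size_t n)
{
    size_t i = 0;
    foreach (_; 0 .. n)
    {
        if (i >= s.length)
            throw new Exception("character index out of range");
        immutable stride = strideOf(s[i]);
        if (stride == 0xFF)
            throw new Exception("invalid UTF-8 sequence");
        i += stride;
    }
    return i;
}

unittest
{
    import std.exception : assertThrown;

    assert(checkedIndex("a\u00E9z", 2) == 3);
    char[] bad = [cast(char) 0x80, 'a'];   // starts with a stray continuation byte
    assertThrown(checkedIndex(bad, 1));
}
```

Checking the stride before adding it is what avoids both the endless loop and the out-of-bounds read.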

Apr 06 2006

Georg Wrede <georg.wrede nospam.org> writes:

Sean Kelly wrote:
 Walter Bright wrote:
 Sean Kelly wrote:
 Walter Bright wrote:
 Thomas Kuehne wrote:

 Challenge:
 Provide a D implementation that first converts to UTF-32 and has
 shorter runtime than the code below:

 I don't know about that, but the code below isn't optimal <g>. 
 Replace the sar's with a lookup of the 'stride' of the UTF-8 
 character (see std.utf.UTF8stride[]). An implementation is 
 std.utf.toUTFindex().

 I've been wondering about this.  Will 'stride' be accurate for any 
 arbitrary string position or input data?  I would assume so, but 
 don't know enough about how UTF-8 is structured to be sure.

 UTF8stride[] will give 0xFF for values that are not at the beginning 
 of a valid UTF-8 sequence.

 
 Thanks.  I saw the 0xFF entries in UTF8stride and mostly wanted to make 
 sure an odd combination of bytes couldn't be mistaken as a valid 
 character, 

No fear. Any UTF-8 byte that belongs to a stride is clearly marked as 
such in the most significant bits. Thus, you can enter a byte[] at any 
place, and immediately know if it's (1) a single-byte character, (2) the 
first in a stride, or (3) within a stride. Without looking at any of the 
other bytes.
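
In code, the three cases come down to two mask tests. A sketch with invented names, assuming the byte comes from valid UTF-8:

```d
enum ByteKind { single, lead, continuation }

// Classify one UTF-8 byte from its top bits alone.
// Assumes valid UTF-8; bytes 0xF8..0xFF would also be reported as lead.
ByteKind classify(char c)
{
    if ((c & 0x80) == 0x00) return ByteKind.single;        // 0xxxxxxx
    if ((c & 0xC0) == 0x80) return ByteKind.continuation;  // 10xxxxxx
    return ByteKind.lead;                                   // 11xxxxxx
}

unittest
{
    string s = "a\u00E9";   // bytes: 61 C3 A9
    assert(classify(s[0]) == ByteKind.single);
    assert(classify(s[1]) == ByteKind.lead);
    assert(classify(s[2]) == ByteKind.continuation);
}
```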

 as stride seems the best fit for an "is valid UTF-8 char" 
 type function.  I've been giving the 0xFF choice some thought however, 
 and while it would avoid stalling loops, the alternative is an access 
 violation when evaluating short strings and just weird behavior for 
 large strings.  If I had to track down a program bug I'd almost prefer 
 it be a tight endless loop.

UTF-8 is precisely designed to be used in very tight ASM loops, that 
don't need a lookup table.
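
As an illustration (in D rather than ASM, and assuming valid input), counting characters needs only one mask and compare per byte, with no table and no decoding:

```d
// Count characters by counting every byte that is not a continuation byte.
size_t countChars(const(char)[] s)
{
    size_t n = 0;
    foreach (char c; s)           // char: iterate raw code units, no decoding
        if ((c & 0xC0) != 0x80)   // skip 10xxxxxx bytes
            ++n;
    return n;
}

unittest
{
    assert(countChars("abc") == 3);
    assert(countChars("a\u00E9\u4E2D") == 3);   // 1 + 2 + 3 bytes, three characters
}
```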

Apr 06 2006

kris <foo bar.com> writes:

Walter Bright wrote:
 Thomas Kuehne wrote:
 
 Challenge:
 Provide a D implementation that first converts to UTF-32 and has
 shorter runtime than the code below:

 
 
 I don't know about that, but the code below isn't optimal <g>. Replace 
 the sar's with a lookup of the 'stride' of the UTF-8 character (see 
 std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().


It's not as simple as that any more. Lookup tables can sometimes cause 
more stalls than straight-line code, especially with designs such as the 
P4. Not to mention the possibility of a bit of cache-thrashing with 
other programs.

Thus, the lookup may be sub-optimal. Quite possibly less optimal.

Apr 06 2006

Anders F Björklund <afb algonet.se> writes:

Georg Wrede wrote:

 For the general case, UTF-32 is a pretty wasteful Unicode encoding
 just to have that privilege ?

 I'm not sure there is a "general case", so it's hard to say. Some
 programmers have to deal with MBCS every day; others can go for years
 without ever having to worry about anything but vanilla ASCII.

 
 True!! Folks in Boise, Idaho, vs. folks in the non-British Europe or the 
 Far East.

I don't think so. UTF-8 is good for us in "non-British" Europe, and
UTF-16 is good in the East. UTF-32 is good for... finding codepoints ?

As long as the "exceptions" (high code units) are taken care of, there 
is really no difference between the three (or five) - it's all Unicode.


I prefer UTF-8 - because it is ASCII-compatible and endian-independent,
but UTF-16 is not a bad choice if you handle a lot of non-ASCII chars.

Just as long as other layers play along with the embedded NULs, and you
have the proper BOM marks when storing it. It seemed to work for Java ?


The argument was just against *UTF-32* as a storage type, nothing more.
(As was rationalized in http://www.unicode.org/faq/utf_bom.html#UTF32)

--anders


PS.
Thought that having std UTF type aliases would have helped, but I dunno:

module std.stdutf;

/* UTF code units */

alias char   utf8_t; // UTF-8
alias wchar utf16_t; // UTF-16
alias dchar utf32_t; // UTF-32

It's a little confusing anyway, many "char*" routines don't accept UTF ?

Apr 07 2006

Sean Kelly <sean f4.ca> writes:

Mike Capp wrote:
 (Changing subject line since we seem to have rudely hijacked the OP's topic)
 
 In article <e13b56$is0$1 digitaldaemon.com>,
 Anders F Björklund says...
 James Dunne wrote:

 The char type is really a misnomer for dealing with UTF-8 encoded 
 strings.  It should be named closer to "code-unit for UTF-8 encoding". 


 
 (I fully agree with this statement, by the way.)
 
 Yeah, but it does hold an *ASCII* character ?

 
 I don't find that very helpful - seeing a char[] in code doesn't tell me
 anything about whether it's byte-per-character ASCII or possibly-multibyte
 UTF-8.

Since UTF-8 is compatible with ASCII, might it not be reasonable to 
assume char strings are always UTF-8?  I'll admit this suggests many of 
the D string functions are broken, but they can certainly be fixed. 
I've been considering rewriting find and rfind to support multibyte 
strings.  Fixing find is pretty straightforward, though rfind might be a 
tad messy.  As a related question, can anyone verify whether 
std.utf.stride will return a correct result for evaluating an arbitrary 
offset in all potential input strings?
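
For the arbitrary-offset case that rfind would need, one possible approach (a sketch with an invented helper name, assuming valid UTF-8 and an offset inside the string) is to back up over continuation bytes until a sequence start is reached:

```d
// Move an arbitrary byte offset back to the start of the sequence containing it.
// Requires i < s.length and valid UTF-8.
size_t sequenceStart(const(char)[] s, size_t i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)   // 10xxxxxx marks a continuation byte
        --i;
    return i;
}

unittest
{
    string s = "a\u4E2Db";   // bytes: 61 E4 B8 AD 62
    assert(sequenceStart(s, 3) == 1);   // last byte of the 3-byte sequence
    assert(sequenceStart(s, 2) == 1);
    assert(sequenceStart(s, 1) == 1);   // already on the lead byte
    assert(sequenceStart(s, 4) == 4);   // ASCII byte
}
```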

 For the general case, UTF-32 is a pretty wasteful
 Unicode encoding just to have that privilege ?

 
 I'm not sure there is a "general case", so it's hard to say. Some programmers
 have to deal with MBCS every day; others can go for years without ever having to
 worry about anything but vanilla ASCII.
 
 "Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but
 UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth.
 Finding the millionth character in a UTF-8 string means looping through at least
 a million bytes, and executing some conditional logic for each one. Finding the
 millionth character in a UTF-32 string is a simple pointer offset and one-word
 fetch.

For what it's worth, I believe the correct behavior for string/array 
operations is to provide overloads for char[] and wchar[] that require 
input to be valid UTF-8 and UTF-16, respectively.  If the user knows 
their data is pure ASCII or they otherwise want to process it as a 
fixed-width string they can cast to ubyte[] or ushort[].  This is what 
I'm planning for std.array in Ares.
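
Roughly what the opt-out could look like from the caller's side. This is a sketch, not the Ares API: `findByte` is a made-up name, and `representation` is the present-day `std.string` helper that reinterprets a string as bytes.

```d
import std.string : representation;

// Hypothetical byte-wise search for callers who know their data is pure ASCII.
ptrdiff_t findByte(const(ubyte)[] bytes, ubyte wanted)
{
    foreach (i, b; bytes)
        if (b == wanted)
            return i;
    return -1;
}

unittest
{
    string s = "key=value";
    // representation gives the same memory typed as immutable(ubyte)[].
    assert(findByte(s.representation, cast(ubyte) '=') == 3);
    // An explicit cast does the same thing.
    assert(findByte(cast(const(ubyte)[]) s, cast(ubyte) 'v') == 4);
}
```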


Sean

Apr 06 2006

Anders F Björklund <afb algonet.se> writes:

Mike Capp wrote:

 At the risk of repeating James, I do think that spelling "string" as
 "char[]"/"wchar[]" is grossly misleading, particularly to people coming from any
 other C-family language. If I was doing any serious string-handling work in D
 I'd almost certainly write an opaque String class that overloaded opIndex
 (returning dchar) to do the right thing, and optimised the underlying storage to
 suit the app's requirements.

I'm not sure that C guys would miss a string class (after all, char[]
is a lot better than the raw "undefined" char* they used to be using...)
but I do see how having an easy String class around is useful sometimes.

I even wrote a simple one myself, based on something Java-like:
http://www.algonet.se/~afb/d/dcaf/html/class_string.html
http://www.algonet.se/~afb/d/dcaf/html/class_string_buffer.html


But for wxD we use a simple char[] alias for strings, works just fine...
If the backend uses UTF-16, it will convert them at runtime when needed.
(wxWidgets can be built in a "ASCII"/UTF-8, or in "Unicode"/UTF-16 mode)

Then again it only does the occasional window title or dialog string etc

--anders

Apr 06 2006
