digitalmars.D - Selectable encodings

reply "John C" <johnch_atms hotmail.com> writes:
I know of three ways to support a user-selected char encoding in a library, 
but each has its drawbacks.

1) Method overloading
Introduces conflicts with string literals (forcing a c/w/d suffix to be 
used) and you can't overload by return type.

2) Parameterising all types that use strings
Making every class a template just to get this functionality seems over the 
top.
class SomeClassT(TChar) {
    TChar[] getSomeString() { return null; } // placeholder body
}
alias SomeClassT!(char) SomeClass; // in library module
alias SomeClassT!(wchar) SomeClass; // in user module

3) A compiler version condition with aliases.
The version condition approach is the most attractive to me, but some people 
aren't fond of it.
version (utf8) alias char mlchar;
else version (utf16) alias wchar mlchar;
else version (utf32) alias dchar mlchar;
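
Options 2 and 3 can also be combined. The following is only a rough sketch, written against present-day D rather than the 2006 compiler. The fallback to char when no version identifier is set, the stored text field, and the constructor are assumptions added for illustration; the version identifiers and SomeClassT are the names used above.

```d
// Option 3: choose the code unit type once, at compile time
// (for example by passing -version=utf16 to dmd).
version (utf16)      alias wchar mlchar;
else version (utf32) alias dchar mlchar;
else                 alias char  mlchar;   // assumed default: UTF-8

// Option 2: parameterise the class on the code unit type.
class SomeClassT(TChar)
{
    private immutable(TChar)[] text;       // illustrative storage

    this(immutable(TChar)[] text) { this.text = text; }

    immutable(TChar)[] getSomeString() { return text; }
}

// Library and user code both refer to this single alias.
alias SomeClassT!(mlchar) SomeClass;
```

With this arrangement only one module mentions the template, and string literals passed to the constructor take whichever width mlchar selects.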

There's a fourth way - encoding conversion, but there's a runtime cost.
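
For comparison, a minimal sketch of the conversion route, assuming the std.utf conversion functions of present-day Phobos (toUTF8, toUTF16, toUTF32). The three getSomeString* functions are hypothetical; the point is only that each wrapper re-encodes and allocates on every call.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

// Hypothetical library function that keeps its text as UTF-8.
string getSomeString() { return "sm\u00F6rg\u00E5s"; }

// Callers wanting another encoding convert at the boundary.
// Every call walks the whole string and allocates a new one.
wstring getSomeStringW() { return toUTF16(getSomeString()); }
dstring getSomeStringD() { return toUTF32(getSomeString()); }
```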

So does anyone use an alternative way to enable users to select which char 
encoding they want to use at compile time? 
Apr 06 2006
next sibling parent reply Mike Capp <mike.capp gmail.com> writes:
In article <e12j34$2gi2$1 digitaldaemon.com>, John C says...
version (utf8) alias char mlchar;

Apologies for going off at a tangent to your question, but I've never quite understood what D thinks it's doing here. If char[] is an array of characters, then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding. So is char[] an array of characters from some other charset (e.g. the subset of UTF-8 representable in one byte), or is it an array of bytes encoding a UTF-8 string (in which case I suspect quite a lot of string-handling code is badly broken)?

cheers
Mike
Apr 06 2006
parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Mike Capp skrev:
 In article <e12j34$2gi2$1 digitaldaemon.com>, John C says...
 version (utf8) alias char mlchar;

Apologies for going off at a tangent to your question, but I've never quite understood what D thinks it's doing here. If char[] is an array of characters, then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding. So is char[] an array of characters from some other charset (e.g. the subset of UTF-8 representable in one byte), or is it an array of bytes encoding a UTF-8 string (in which case I suspect quite a lot of string-handling code is badly broken)?

It is the latter. But I don't think much of the string handling code is broken because of that.

/Oskar
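
A small sketch of what "an array of bytes encoding a UTF-8 string" means in practice, in present-day D. The helper names codeUnits and codePoints are made up for illustration; the only language feature relied on is that foreach with a dchar loop variable decodes UTF-8 as it goes.

```d
// Number of UTF-8 code units (bytes): this is what .length reports.
size_t codeUnits(const(char)[] s)
{
    return s.length;
}

// Number of code points: the dchar loop variable makes foreach
// decode each UTF-8 sequence into one value.
size_t codePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}
```

For pure ASCII input the two counts agree, which is why code that indexes char[] directly often appears to work until non-ASCII text arrives.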
Apr 06 2006
parent reply James Dunne <james.jdunne gmail.com> writes:
Oskar Linde wrote:
 Mike Capp skrev:
 
 In article <e12j34$2gi2$1 digitaldaemon.com>, John C says...

 version (utf8) alias char mlchar;

Apologies for going off at a tangent to your question, but I've never quite understood what D thinks it's doing here. If char[] is an array of characters, then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding. So is char[] an array of characters from some other charset (e.g. the subset of UTF-8 representable in one byte), or is it an array of bytes encoding a UTF-8 string (in which case I suspect quite a lot of string-handling code is badly broken)?

It is the latter. But I don't think much of the string handling code is broken because of that. /Oskar

The char type is really a misnomer for dealing with UTF-8 encoded strings. It should be named closer to "code-unit for UTF-8 encoding". For my own research language I've chosen what I believe to be a nice type naming system:

char  - 32-bit Unicode code point
u8cu  - UTF-8 code unit
u16cu - UTF-16 code unit
u32cu - UTF-32 code unit

I could be wrong (and I bet I am) on the terminology used to describe char, but I really mean it to just store a full Unicode character such that strings of chars can safely assume character index == array index.

--
James Dunne
Apr 06 2006
parent reply Anders F Björklund <afb algonet.se> writes:
James Dunne wrote:

 The char type is really a misnomer for dealing with UTF-8 encoded 
 strings.  It should be named closer to "code-unit for UTF-8 encoding". 

Yeah, but it does hold an *ASCII* character ? Usually the D code handles char[] with dchar, but with a "short path" for ASCII characters...
 I could be wrong (and I bet I am) on the terminology used to describe
 char, but I really mean it to just store a full Unicode character
 such that strings of chars can safely assume character index == array
 index.

For the general case, UTF-32 is a pretty wasteful Unicode encoding just to have that privilege ? See http://www.unicode.org/faq/utf_bom.html#12

--anders
Apr 06 2006
parent reply Mike Capp <mike.capp gmail.com> writes:
(Changing subject line since we seem to have rudely hijacked the OP's topic)

In article <e13b56$is0$1 digitaldaemon.com>,
Anders F Björklund says...
James Dunne wrote:

 The char type is really a misnomer for dealing with UTF-8 encoded 
 strings.  It should be named closer to "code-unit for UTF-8 encoding". 


(I fully agree with this statement, by the way.)
Yeah, but it does hold an *ASCII* character ?

I don't find that very helpful - seeing a char[] in code doesn't tell me anything about whether it's byte-per-character ASCII or possibly-multibyte UTF-8.
For the general case, UTF-32 is a pretty wasteful
Unicode encoding just to have that privilege ?

I'm not sure there is a "general case", so it's hard to say. Some programmers have to deal with MBCS every day; others can go for years without ever having to worry about anything but vanilla ASCII.

"Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth. Finding the millionth character in a UTF-8 string means looping through at least a million bytes, and executing some conditional logic for each one. Finding the millionth character in a UTF-32 string is a simple pointer offset and one-word fetch.

At the risk of repeating James, I do think that spelling "string" as "char[]"/"wchar[]" is grossly misleading, particularly to people coming from any other C-family language. If I was doing any serious string-handling work in D I'd almost certainly write an opaque String class that overloaded opIndex (returning dchar) to do the right thing, and optimised the underlying storage to suit the app's requirements.

cheers
Mike
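
A very rough sketch of such an opaque wrapper, in present-day D. The String type below is hypothetical and deliberately naive: it stores UTF-8 and walks from the start on every index, so opIndex is O(n). A real implementation would presumably choose its storage to suit the application, as suggested above.

```d
// Hypothetical wrapper: UTF-8 storage, indexing by code point.
struct String
{
    private string data;   // UTF-8 code units

    this(string s) { data = s; }

    // Returns code point number n (0-based).
    // Linear time: decodes from the beginning on every call.
    dchar opIndex(size_t n) const
    {
        size_t seen = 0;
        foreach (dchar c; data)
        {
            if (seen == n)
                return c;
            ++seen;
        }
        throw new Exception("index past end of string");
    }
}
```

Indexing then yields whole characters regardless of how many bytes each occupies, e.g. s[1] on "a\u00E9b" gives U+00E9 rather than half of its encoding.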
Apr 06 2006
next sibling parent reply Georg Wrede <georg.wrede nospam.org> writes:
Mike Capp wrote:
 (Changing subject line since we seem to have rudely hijacked the OP's
 topic)
 
 In article <e13b56$is0$1 digitaldaemon.com>, 
 Anders F Björklund says...
 
 James Dunne wrote:
 
 The char type is really a misnomer for dealing with UTF-8 encoded
 strings.  It should be named closer to "code-unit for UTF-8
 encoding".


(I fully agree with this statement, by the way.)

Yes. And it's a _gross_ misnomer. And we who are used to D can't even _begin_ to appreciate the [unnecessary!] extra work and effort needed to gradually come to understand it "our way", for those new to D.
 Yeah, but it does hold an *ASCII* character ?

I don't find that very helpful - seeing a char[] in code doesn't tell me anything about whether it's byte-per-character ASCII or possibly-multibyte UTF-8.

(( A dumb idea: the input stream has a flag that gets set as soon as the first non-ASCII character is found. ))
 For the general case, UTF-32 is a pretty wasteful Unicode encoding
 just to have that privilege ?

I'm not sure there is a "general case", so it's hard to say. Some programmers have to deal with MBCS every day; others can go for years without ever having to worry about anything but vanilla ASCII.

True!! Folks in Boise, Idaho, vs. folks in the non-British Europe or the Far East.
 "Wasteful" is also relative. UTF-32 is certainly wasteful of memory
 space, but UTF-8 is potentially far more wasteful of CPU cycles and
 memory bandwidth.

It sure looks like it. Then again, studying the UTF-8 spec, and "why we did it this way" (sorry, no URL here. Anybody?), shows that it actually is _amazingly_ light on CPU cycles! Really. (( I sure wish there was somebody in this NG who could write a Scientifically Valid test to compare the time needed to find the millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))
 Finding the millionth character in a UTF-8 string
 means looping through at least a million bytes, and executing some
 conditional logic for each one. Finding the millionth character in a
 UTF-32 string is a simple pointer offset and one-word fetch.

True. And even if we'd exclude any "character width logic" in the search, we still end up with sequential lookup O(n) vs. O(1).

Then again, when's the last time anyone here had to find the millionth character of anything? :-)

So, of course for library writers, this appears as most relevant, but for real world programming tasks, I think after profiling, the time wasted may be minor, in practice.

(Ah, and of course, turning a UTF-8 input into UTF-32 and then straight shooting the millionth character, is way more expensive (both in time and size) than just a loop through the UTF-8 as such. Not to mention the losses if one were, instead, to have a million-character file on hard disk in UTF-32 (i.e. a 4MB file) to avoid the look-through. Probably the time reading in the file gets so much longer that this in itself defeats the "gain".)
 At the risk of repeating James, I do think that spelling "string" as 
 "char[]"/"wchar[]" is grossly misleading, particularly to people
 coming from any other C-family language.

No argument here. :-) In the midst of The Great Character Width Brouhaha (about November last year), I tried to convince Walter on this particular issue.
Apr 06 2006
next sibling parent reply Jari-Matti Mäkelä <jmjmak utu.fi.invalid> writes:
Georg Wrede wrote:
 (( I sure wish there was somebody in this NG who could write a
 Scientifically Valid test to compare the time needed to find the
 millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))

It's O(n) vs O(n). :) You have to go through all the bytes in both cases. I guess the conversion has a higher coefficient.
 So, of course for library writers, this appears as most relevant, but
 for real world programming tasks, I think after profiling, the time
 wasted may be minor, in practice.

Why not use the same encoding throughout the whole program and its libraries? No need to convert anywhere.
 (Ah, and of course, turning a UTF-8 input into UTF-32 and then straight
 shooting the millionth character, is way more expensive (both in time
 and size) than just a loop through the UTF-8 as such. Not to mention the
 losses if one were, instead, to have a million-character file on hard
 disk in UTF-32 (i.e. a 4MB file) to avoid the look-through. Probably the
 time reading in the file gets so much longer that this in itself defeats
 the "gain".)

That's very true. A "normal" hard drive reads 60 MB/s. So, reading a 4 MB file takes at least 66 ms and a 1 MB UTF-8 file (only ASCII characters) is read in 17 ms (well, I'm a bit optimistic here :). A modern processor executes 3 000 000 000 operations in a second. Going through the UTF-8 stream takes 1 000 000 * 10 (perhaps?) operations and thus costs 3 ms. So it's actually faster to read UTF-8.

--
Jari-Matti
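
The back-of-the-envelope figures can be reproduced with a few lines of present-day D. This is only the arithmetic from the post (60 MB/s disk, 3 000 000 000 operations per second, roughly 10 operations per character, one million ASCII characters); the function names are invented for illustration and seek time is ignored.

```d
// Milliseconds needed to read `bytes` sequentially at `bytesPerSecond`.
double readMs(double bytes, double bytesPerSecond)
{
    return bytes / bytesPerSecond * 1000.0;
}

// Milliseconds needed to run `ops` operations at `opsPerSecond`.
double cpuMs(double ops, double opsPerSecond)
{
    return ops / opsPerSecond * 1000.0;
}

// UTF-32 file: 4 bytes per character, then a direct index.
double utf32Ms()
{
    return readMs(4_000_000, 60_000_000);            // about 66.7 ms
}

// UTF-8 file: 1 byte per ASCII character, plus a scan of every byte.
double utf8Ms()
{
    return readMs(1_000_000, 60_000_000)             // about 16.7 ms
         + cpuMs(1_000_000 * 10, 3e9);               // about 3.3 ms
}
```

Under these assumptions the UTF-8 route comes to roughly 20 ms against roughly 67 ms for UTF-32; text with many multi-byte characters would narrow the gap.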
Apr 06 2006
parent Jari-Matti Mäkelä <jmjmak utu.fi.invalid> writes:
Thomas Kuehne wrote:
 Jari-Matti wrote:
 That's very true. A "normal" hard drive reads 60 MB/s. So,
 reading a 4 MB file takes at least 66 ms and a 1 MB UTF-8-file (only
 ASCII-characters) is read in 17 ms (well, I'm a bit optimistic here :).
 A modern processor executes 3 000 000 000 operations in a
 second. Going through the UTF-8 stream takes 1 000 000 * 10 (perhaps?)
 operations and thus costs 3 ms. So it's actually faster to read UTF-8.


1) your sample: English (consider Chinese)
2) magic word: seek

Yes, I know. This was just an optimistic tongue-in-cheek analysis :) A real world example would naturally have a lot of non-ASCII characters too, but the point is that reading huge loads of uncompressed UTF-32 data will usually be slower than reading UTF-8 if we are also checking against text corruption. I wonder if it's any faster to read UTF-32 files from a transparently compressed reiser4 drive?

--
Jari-Matti
Apr 07 2006
prev sibling next sibling parent reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:
Georg Wrede schrieb am 2006-04-06:
 Mike Capp wrote:
 "Wasteful" is also relative. UTF-32 is certainly wasteful of memory
 space, but UTF-8 is potentially far more wasteful of CPU cycles and
 memory bandwidth.

It sure looks like it. Then again, studying the UTF-8 spec, and "why we did it this way" (sorry, no URL here. Anybody?), shows that it actually is _amazingly_ light on CPU cycles! Really.

Have a look at the encoding of Hangul (Korean) and polytonic Greek <g>
 (( I sure wish there was somebody in this NG who could write a 
 Scientifically Valid test to compare the time needed to find the 
 millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))

Challenge: Provide a D implementation that first converts to UTF-32 and has a shorter runtime than the code below:

size_t codepoint_to_index(size_t codepoint_number, char[] data){
    char* start = data.ptr;
    char* end = start + data.length;
    size_t index;

    if(!data.length){
insufficent_input:
        throw new Exception("not enough input");
    }

    if(!codepoint_number){
        return 0;
    }

    asm{
        mov EDX, codepoint_number;
        mov ECX, start;
        mov EBX, end;

    next_codepoint:
        // load the lead byte and step past it
        mov AL, [ECX];
        inc ECX;
        // top bit clear: single-byte sequence
        sal AL, 1;
        jnc end_of_codepoint;
        // drop the second bit of the lead byte
        sal AL, 1;
    inner_loop:
        // skip one continuation byte per remaining length bit
        inc ECX;
        sal AL, 1;
        jc inner_loop;

    end_of_codepoint:
        // array bounds
        cmp ECX, EBX;
        jnb insufficent_input;

        // the interesting codepoint?
        dec EDX;
        jnz next_codepoint;

        // calculate index
        mov EBX, start;
        sub ECX, EBX;
        mov index, ECX;
    }

    return index;
}

Thomas
Apr 06 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Thomas Kuehne wrote:
 Challenge:
 Provide a D implementation that first converts to UTF-32 and has a
 shorter runtime than the code below:

I don't know about that, but the code below isn't optimal <g>. Replace the sar's with a lookup of the 'stride' of the UTF-8 character (see std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().
Apr 06 2006
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
I don't know about that, but the code below isn't optimal <g>. Replace the sar's with a lookup of the 'stride' of the UTF-8 character (see std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().

I've been wondering about this. Will 'stride' be accurate for any arbitrary string position or input data? I would assume so, but don't know enough about how UTF-8 is structured to be sure.

Sean
Apr 06 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Sean Kelly wrote:
I've been wondering about this. Will 'stride' be accurate for any arbitrary string position or input data? I would assume so, but don't know enough about how UTF-8 is structured to be sure.

UTF8stride[] will give 0xFF for values that are not at the beginning of a valid UTF-8 sequence.
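
To make the stride idea concrete, here is a minimal sketch in present-day D. It does not reproduce the actual UTF8stride[] table from Phobos: strideOf and codepointToIndex are hypothetical names, the stride is computed with comparisons instead of a table, sequences are assumed to be at most four bytes long, and only the 0xFF "not a lead byte" convention is taken from the post above.

```d
// Sequence length implied by a UTF-8 lead byte,
// or 0xFF if `b` cannot start a sequence.
ubyte strideOf(char b)
{
    if (b < 0x80) return 1;      // 0xxxxxxx: ASCII
    if (b < 0xC0) return 0xFF;   // 10xxxxxx: continuation byte
    if (b < 0xE0) return 2;      // 110xxxxx
    if (b < 0xF0) return 3;      // 1110xxxx
    if (b < 0xF8) return 4;      // 11110xxx
    return 0xFF;                 // 0xF8..0xFF: treated as invalid here
}

// Code unit index at which code point number n starts.
// Throws instead of reading past the end or trusting a bad lead byte.
size_t codepointToIndex(const(char)[] data, size_t n)
{
    size_t i = 0;
    foreach (_; 0 .. n)
    {
        if (i >= data.length)
            throw new Exception("not enough input");
        immutable s = strideOf(data[i]);
        if (s == 0xFF)
            throw new Exception("invalid UTF-8 lead byte");
        i += s;
    }
    if (i >= data.length)
        throw new Exception("not enough input");
    return i;
}
```

Checking the sentinel before adding it to the index is what keeps a stray continuation byte from sending the loop 255 bytes forward.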
Apr 06 2006
parent reply Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
UTF8stride[] will give 0xFF for values that are not at the beginning of a valid UTF-8 sequence.

Thanks. I saw the 0xFF entries in UTF8stride and mostly wanted to make sure an odd combination of bytes couldn't be mistaken as a valid character, as stride seems the best fit for an "is valid UTF-8 char" type function.

I've been giving the 0xFF choice some thought however, and while it would avoid stalling loops, the alternative is an access violation when evaluating short strings and just weird behavior for large strings. If I had to track down a program bug I'd almost prefer it be a tight endless loop.

Sean
Apr 06 2006
next sibling parent Walter Bright <newshound digitalmars.com> writes:
Sean Kelly wrote:
Thanks. I saw the 0xFF entries in UTF8stride and mostly wanted to make sure an odd combination of bytes couldn't be mistaken as a valid character, as stride seems the best fit for an "is valid UTF-8 char" type function. I've been giving the 0xFF choice some thought however, and while it would avoid stalling loops, the alternative is an access violation when evaluating short strings and just weird behavior for large strings. If I had to track down a program bug I'd almost prefer it be a tight endless loop.

Take a look at std.utf.toUTFindex(), which takes care of the problem (by throwing an exception).
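
As a usage sketch, assuming the signature found in present-day Phobos, where std.utf.toUTFindex(str, n) returns the code unit index at which code point n starts. The wrapper dropCodePoints is a hypothetical helper; how malformed input is reported may differ between Phobos versions, so the sketch is only meant for valid UTF-8.

```d
import std.utf : toUTFindex;

// Returns the part of `s` that follows its first n code points.
const(char)[] dropCodePoints(const(char)[] s, size_t n)
{
    return s[toUTFindex(s, n) .. $];
}
```

The slice boundaries always fall on code point starts, so the result is again valid UTF-8.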
Apr 06 2006
prev sibling parent Georg Wrede <georg.wrede nospam.org> writes:
Sean Kelly wrote:
Thanks. I saw the 0xFF entries in UTF8stride and mostly wanted to make sure an odd combination of bytes couldn't be mistaken as a valid character,

No fear. Any UTF-8 byte that belongs to a stride is clearly marked as such in the most significant bits. Thus, you can enter a byte[] at any place, and immediately know if it's (1) a single-byte character, (2) the first in a stride, or (3) within a stride. Without looking at any of the other bytes.
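
The three cases can be told apart with two bit masks. The sketch below (present-day D; ByteKind, classify and startOfCodePoint are illustrative names) only inspects the top two bits, as described above, so it does not reject byte values that can never occur in valid UTF-8.

```d
enum ByteKind { single, lead, continuation }

// Classifies one byte without looking at its neighbours.
ByteKind classify(char b)
{
    if ((b & 0x80) == 0x00) return ByteKind.single;        // 0xxxxxxx
    if ((b & 0xC0) == 0x80) return ByteKind.continuation;  // 10xxxxxx
    return ByteKind.lead;                                  // 11xxxxxx
}

// Backs up from an arbitrary offset to the start of the
// code point that contains it.
size_t startOfCodePoint(const(char)[] s, size_t i)
{
    while (i > 0 && classify(s[i]) == ByteKind.continuation)
        --i;
    return i;
}
```

The same backward step is what a reverse search would need in order to resynchronise after landing in the middle of a sequence.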
 as stride seems the best fit for an "is valid UTF-8 char" 
 type function.  I've been giving the 0xFF choice some thought however, 
 and while it would avoid stalling loops, the alternative is an access 
 violation when evaluating short strings and just weird behavior for 
 large strings.  If I had to track down a program bug I'd almost prefer 
 it be a tight endless loop.

UTF-8 is precisely designed to be used in very tight ASM loops, that don't need a lookup table.
Apr 06 2006
prev sibling parent kris <foo bar.com> writes:
Walter Bright wrote:
I don't know about that, but the code below isn't optimal <g>. Replace the sar's with a lookup of the 'stride' of the UTF-8 character (see std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().

It's not as simple as that any more. Lookup tables can sometimes cause more stalls than straight-line code, especially with designs such as the P4. Not to mention the possibility of a bit of cache-thrashing with other programs. Thus, the lookup may be sub-optimal. Quite possibly less optimal.
Apr 06 2006
prev sibling parent Anders F Björklund <afb algonet.se> writes:
Georg Wrede wrote:

 For the general case, UTF-32 is a pretty wasteful Unicode encoding
 just to have that privilege ?

I'm not sure there is a "general case", so it's hard to say. Some programmers have to deal with MBCS every day; others can go for years without ever having to worry about anything but vanilla ASCII.

True!! Folks in Boise, Idaho, vs. folks in the non-British Europe or the Far East.

I don't think so. UTF-8 is good for us in "non-British" Europe, and UTF-16 is good in the East. UTF-32 is good for... finding codepoints ?

As long as the "exceptions" (high code units) are taken care of, there is really no difference between the three (or five) - it's all Unicode.

I prefer UTF-8 - because it is ASCII-compatible and endian-independent, but UTF-16 is not a bad choice if you handle a lot of non-ASCII chars. Just as long as other layers play along with the embedded NULs, and you have the proper BOM marks when storing it. It seemed to work for Java ?

The argument was just against *UTF-32* as a storage type, nothing more. (As was rationalized in http://www.unicode.org/faq/utf_bom.html#UTF32)

--anders

PS. Thought that having std UTF type aliases would have helped, but I dunno:

module std.stdutf;

/* UTF code units */
alias char  utf8_t;  // UTF-8
alias wchar utf16_t; // UTF-16
alias dchar utf32_t; // UTF-32

It's a little confusing anyway, many "char*" routines don't accept UTF ?
Apr 07 2006
prev sibling next sibling parent Sean Kelly <sean f4.ca> writes:
Mike Capp wrote:
 (Changing subject line since we seem to have rudely hijacked the OP's topic)
 
 In article <e13b56$is0$1 digitaldaemon.com>,
 Anders F Björklund says...
 James Dunne wrote:

 The char type is really a misnomer for dealing with UTF-8 encoded 
 strings.  It should be named closer to "code-unit for UTF-8 encoding". 


(I fully agree with this statement, by the way.)
 Yeah, but it does hold an *ASCII* character ?

I don't find that very helpful - seeing a char[] in code doesn't tell me anything about whether it's byte-per-character ASCII or possibly-multibyte UTF-8.

Since UTF-8 is compatible with ASCII, might it not be reasonable to assume char strings are always UTF-8? I'll admit this suggests many of the D string functions are broken, but they can certainly be fixed. I've been considering rewriting find and rfind to support multibyte strings. Fixing find is pretty straightforward, though rfind might be a tad messy. As a related question, can anyone verify whether std.utf.stride will return a correct result for evaluating an arbitrary offset in all potential input strings?
 For the general case, UTF-32 is a pretty wasteful
 Unicode encoding just to have that privilege ?

I'm not sure there is a "general case", so it's hard to say. Some programmers have to deal with MBCS every day; others can go for years without ever having to worry about anything but vanilla ASCII. "Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth. Finding the millionth character in a UTF-8 string means looping through at least a million bytes, and executing some conditional logic for each one. Finding the millionth character in a UTF-32 string is a simple pointer offset and one-word fetch.

For what it's worth, I believe the correct behavior for string/array operations is to provide overloads for char[] and wchar[] that require input to be valid UTF-8 and UTF-16, respectively. If the user knows their data is pure ASCII or they otherwise want to process it as a fixed-width string they can cast to ubyte[] or ushort[]. This is what I'm planning for std.array in Ares.

Sean
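
A sketch of how that split might look, in present-day D. This is not the Ares std.array interface, which is not shown in this thread; findCodePoint and findByte are hypothetical names. The char[] version decodes UTF-8, while callers who know their data is fixed-width cast to ubyte[] and get a plain byte search.

```d
// UTF-8 aware search: returns the code unit index at which the
// first occurrence of `needle` starts, or -1 if it is absent.
ptrdiff_t findCodePoint(const(char)[] s, dchar needle)
{
    // With an index variable, foreach reports where each
    // decoded code point begins.
    foreach (size_t i, dchar c; s)
    {
        if (c == needle)
            return cast(ptrdiff_t) i;
    }
    return -1;
}

// Fixed-width search over raw bytes, for data known to be
// pure ASCII or otherwise one byte per character.
ptrdiff_t findByte(const(ubyte)[] s, ubyte needle)
{
    foreach (size_t i, ubyte b; s)
    {
        if (b == needle)
            return cast(ptrdiff_t) i;
    }
    return -1;
}
```

A caller opting out of UTF-8 handling would write something like findByte(cast(const(ubyte)[]) text, 'c'), making the choice visible at the call site.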
Apr 06 2006
prev sibling parent Anders F Björklund <afb algonet.se> writes:
Mike Capp wrote:

 At the risk of repeating James, I do think that spelling "string" as
 "char[]"/"wchar[]" is grossly misleading, particularly to people coming from
 any other C-family language. If I was doing any serious string-handling work
 in D I'd almost certainly write an opaque String class that overloaded opIndex
 (returning dchar) to do the right thing, and optimised the underlying storage
 to suit the app's requirements.

I'm not sure that C guys would miss a string class (after all, char[] is a lot better than the raw "undefined" char* they used to be using...) but I do see how having an easy String class around is useful sometimes.

I even wrote a simple one myself, based on something Java-like:
http://www.algonet.se/~afb/d/dcaf/html/class_string.html
http://www.algonet.se/~afb/d/dcaf/html/class_string_buffer.html

But for wxD we use a simple char[] alias for strings, works just fine... If the backend uses UTF-16, it will convert them at runtime when needed. (wxWidgets can be built in a "ASCII"/UTF-8, or in "Unicode"/UTF-16 mode) Then again it only does the occasional window title or dialog string etc

--anders
Apr 06 2006
prev sibling parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:
Jari-Matti wrote:
 That's very true. A "normal" hard drive reads 60 MB/s. So,
 reading a 4 MB file takes at least 66 ms and a 1 MB UTF-8-file (only
 ASCII-characters) is read in 17 ms (well, I'm a bit optimistic here :).
 A modern processor executes 3 000 000 000 operations in a
 second. Going through the UTF-8 stream takes 1 000 000 * 10 (perhaps?)
 operations and thus costs 3 ms. So it's actually faster to read UTF-8.

1) your sample: English (consider Chinese)
2) magic word: seek

Thomas
Apr 06 2006