digitalmars.D - performance of char vs wchar

reply Ben Hinkle <bhinkle4 juno.com> writes:

All the threads about char vs wchar got me thinking about char performance,
so I thought I'd test out the difference. I've attached a simple test that
allocates a bunch of strings (of type char[] or wchar[]) of random length
each with one random 'a' in them. Then it loops over the strings randomly
and counts the number of 'a's. So this test does not measure transcoding
speed - it just measures allocation and simple access speed across a large
number of long strings. My times are
 char: .36 seconds
 wchar: .78
 dchar: 1.3

I tried to make sure the dchar version didn't run into swap space. If I
scale the size of the problem up or down, the factor of 2 generally remains
until the times get so small that it doesn't matter. It looks like a lot
of the time is taken in initializing the strings, since when I modify
the test to time only access performance the scale factor goes down from 2
to something under 1.5 or so.

I built it using
 dmd charspeed.d -O -inline -release

It would be nice to try other tests that involve searching for sub-strings
(like multi-byte unicode strings) but I haven't done that. 
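
For readers without the attachment, here is a rough sketch of the kind of test described above. It is written in Python rather than D, it is not the attached charspeed.d, and the function name `run` and its parameters are made up for illustration. The `array` typecodes "B", "H" and "I" stand in for char, wchar and dchar; their sizes are platform-dependent (typically 1, 2 and 4 bytes). Python timings will not reproduce the D numbers; the point is only the shape of the test.

```python
import random
import time
from array import array

def run(typecode, n_strings=2000, max_len=1000, seed=1):
    """Allocate n_strings arrays of random length, each holding exactly
    one 'a', then visit them in random order and count the 'a's."""
    rng = random.Random(seed)
    a, filler = ord("a"), ord("b")
    start = time.perf_counter()
    strings = []
    for _ in range(n_strings):
        s = array(typecode, [filler]) * rng.randint(1, max_len)
        s[rng.randrange(len(s))] = a      # exactly one 'a' per string
        strings.append(s)
    count = 0
    for _ in range(n_strings):
        count += rng.choice(strings).count(a)
    return count, time.perf_counter() - start

for name, code in (("char", "B"), ("wchar", "H"), ("dchar", "I")):
    count, seconds = run(code)
    print(f"{name}: {count} 'a's in {seconds:.3f}s")
```

Both allocation and access are inside the timed region here, matching the version that produced the numbers above; moving `start` below the allocation loop would time access only.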

-Ben
Aug 27 2004
next sibling parent Ben Hinkle <bhinkle4 juno.com> writes:
Note: looks like the version that got attached was not the version I used to
run the tests - it was the one I was using to test access. I thought
attaching the file would make a copy right then, but it
looks like my newsreader makes a copy only when the message is posted. The
version I ran to get my original numbers had longer strings (like 10000
instead of 1000) and timed the allocation as well as the access.

-Ben
Aug 27 2004
prev sibling next sibling parent reply Berin Loritsch <bloritsch d-haven.org> writes:
Ben Hinkle wrote:
 All the threads about char vs wchar got me thinking about char performance,
 so I thought I'd test out the difference. I've attached a simple test that
 allocates a bunch of strings (of type char[] or wchar[]) of random length
 each with one random 'a' in them. Then it loops over the strings randomly
 and counts the number of 'a's. So this test does not measure transcoding
 speed - it just measures allocation and simple access speed across a large
 number of long strings. My times are
  char: .36 seconds
  wchar: .78
  dchar: 1.3
 
 I tried to make sure the dchar version didn't run into swap space. If I
 scale the size of the problem up or down, the factor of 2 generally remains
 until the times get so small that it doesn't matter. It looks like a lot
 of the time is taken in initializing the strings, since when I modify
 the test to time only access performance the scale factor goes down from 2
 to something under 1.5 or so.
 
 I built it using
  dmd charspeed.d -O -inline -release
 
 It would be nice to try other tests that involve searching for sub-strings
 (like multi-byte unicode strings) but I haven't done that. 

FWIW, the Java 'char' is a 16 bit value due to the unicode standards. The idea, of course, is that internally to the program all strings are encoded the same and translated on IO.

I would venture to say there are two observations about Java strings:

1) In most applications they are fast enough to handle a lot of work.

2) It can become a bottleneck when excessive string concatenation is happening, or logging is overdone.

As far as D is concerned, it should be expected that an array of the same length would take longer if the only difference between the arrays is the element size. In this case 1 byte, 2 bytes, 4 bytes. I don't know much about allocation routines, but I imagine getting something to run at constant cost is either beyond the realm of possibility or just way too difficult to be practical to pursue.

IMO, the benefits of using a standard unicode encoding within the D program outweigh the costs of allocating the arrays. Anytime we make it easier to provide internationalization (i18n), it is a step toward greater acceptance. To assume the world only works with European (or American) character sets is to stick your head in the sand.

Honestly, I would prefer something that made internationalized strings easier to manage rather than more difficult. If there are no multi-char graphemes (i.e. nothing takes up more than one code space) then that would be the easiest to work with and write libraries for.
Aug 27 2004
parent reply "Ben Hinkle" <bhinkle mathworks.com> writes:
"Berin Loritsch" <bloritsch d-haven.org> wrote in message
news:cgnc25$1l1f$1 digitaldaemon.com...
 Ben Hinkle wrote:
 All the threads about char vs wchar got me thinking about char performance,
 so I thought I'd test out the difference. I've attached a simple test that
 allocates a bunch of strings (of type char[] or wchar[]) of random length
 each with one random 'a' in them. Then it loops over the strings randomly
 and counts the number of 'a's. So this test does not measure transcoding
 speed - it just measures allocation and simple access speed across a large
 number of long strings. My times are
  char: .36 seconds
  wchar: .78
  dchar: 1.3

 I tried to make sure the dchar version didn't run into swap space. If I
 scale the size of the problem up or down, the factor of 2 generally remains
 until the times get so small that it doesn't matter. It looks like a lot
 of the time is taken in initializing the strings, since when I modify
 the test to time only access performance the scale factor goes down from 2
 to something under 1.5 or so.

 I built it using
  dmd charspeed.d -O -inline -release

 It would be nice to try other tests that involve searching for sub-strings
 (like multi-byte unicode strings) but I haven't done that.

FWIW, the Java 'char' is a 16 bit value due to the unicode standards. The idea of course, is that internally to the program all strings are encoded the same and translated on IO. I would venture to say there are two observations about Java strings: 1) In most applications they are fast enough to handle a lot of work. 2) It can become a bottleneck when excessive string concatenation is happening, or logging is overdone.

I wonder what Java strings in utf8 would be like... I wonder if anyone has tried that out.
 As far as D is concerned, it should be expected that a same sized array
 would take longer if the difference between arrays is the element size.
 In this case 1 byte, 2 bytes, 4 bytes.  I don't know much about
 allocation routines, but I imagine getting something to run at constant
 cost is either beyond the realm of possibility or it is just way too
 difficult to be practical to pursue.

agreed. There probably isn't any way to speed up the initialization much.
 IMO, the benefits of using a standard unicode encoding within the D
 program outweigh the costs of allocating the arrays.  Anytime we make
 it easier to provide internationalization (i18n), it is a step toward
 greater acceptance.  To assume the world only works with European (or
 American) character sets is to stick your head in the sand.

agreed. utf8 and utf16 are both unicode standards that require multi-byte handling to cover all of unicode so in terms of ease of use they shouldn't be any different.
 Honestly, I would prefer something that made internationalized strings
 easier to manage rather than more difficult.  If there are no multi-char
 graphemes (i.e. nothing takes up more than one code space) then that would be
 the easiest to work with and write libraries for.

dchar would be the choice for ease of use, but as you can see performance goes downhill significantly (at least for the naive test I ran). To me the performance of dchar is too poor to make it the standard, and the ease-of-use of utf8 and utf16 are essentially equivalent, so since utf8 has the best performance it should be the default. Hence my attempt to measure which is faster for typical usage: char or wchar?

-Ben
Aug 27 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgner6$1mh4$1 digitaldaemon.com>, Ben Hinkle says...

dchar would be the choice for ease of use

Actually, wchar and dchar are pretty much the same in terms of ease of use. In almost all cases, you can just /pretend/ that UTF-16 is not a multi-word encoding, and just write your code as if one wchar == one character. Except in very specialist cases, you'll be correct. Even if you're using characters beyond U+FFFF, regarding them as two characters instead of one is likely to be harmless.

Of course, there are applications for which you can't do this. Font rendering algorithms, for example, /must/ be able to distinguish true character boundaries. But even then, the cost of rendering greatly outweighs the (almost insignificant) cost of determining true character boundaries - a trivial task in UTF-16.

char, by contrast, is much more difficult to use, because you /cannot/ "pretend" it's a single-byte encoding, because there are too many characters beyond U+00FF which applications need to "understand".
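
To make the "trivial task in UTF-16" remark concrete, here is a minimal sketch in Python (the helper names `utf16_units` and `count_characters` are hypothetical, and "character" here means code point, not grapheme). It assumes well-formed UTF-16: every code unit starts a new character unless it is a low (trailing) surrogate.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def count_characters(units):
    """Count code points: every unit starts a character except a
    low (trailing) surrogate, 0xDC00..0xDFFF."""
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)

# 'a', EURO SIGN, and MUSICAL SYMBOL G CLEF (beyond U+FFFF, so two units)
units = utf16_units("a\u20ac\U0001d11e")
print(len(units), count_characters(units))
```

Code that pretends one wchar is one character sees four "characters" in this string; the boundary-aware count is three.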
but as you can see performance
goes downhill significantly (at least for the naive test I ran). To me the
performance of dchar is too poor to make it the standard

Agreed.
and the ease-of-use
of utf8 and utf16 are essentially equivalent

Not really. The ICU API is geared around UTF-16, so if you use wchar[]s, you can call ICU functions directly. With char[]s, there's a bit more faffing, so UTF-8 becomes less easy to use in this regard.

And don't forget - UTF-8 is a complex algorithm, while UTF-16 is a trivially simple algorithm. As an application programmer, you may be shielded from the complexity of UTF-8 by library implementation and implicit conversion, but it's all going on under the hood. UTF-8's "ease of use" vanishes as soon as you actually have to implement it.
so since utf8 has the best
performance it should be the default.

But it doesn't. Your tests were unfair. A UTF-16 array will not, in general, require twice as many bytes as a UTF-8 array, which is what you seemed to assume. That will only be true if the string is pure ASCII, but not otherwise, and for the majority of characters the performance will be worse. Check out this table:

 Codepoint            Number of bytes
 range           UTF-8    UTF-16   UTF-32
 ----------------------------------------
 0000 to 007F    1        2        4
 0080 to 07FF    2        2        4
 0800 to FFFF    3        2        4    <----- who wins on this row?
 10000+          4        4        4
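
The rows of that table can be checked mechanically. A small Python sketch (the function name `encoded_sizes` is hypothetical; the "-le" codec variants are used so that no byte-order mark is counted):

```python
def encoded_sizes(codepoint):
    """Bytes needed for one code point in UTF-8, UTF-16 and UTF-32."""
    ch = chr(codepoint)
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

# One sample code point from each row of the table.
for cp in (0x0041, 0x00E9, 0x20AC, 0x1D11E):
    print(f"U+{cp:04X}", encoded_sizes(cp))
```

Surrogate code points (D800 to DFFF) are deliberately not sampled, since they cannot be encoded on their own.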
Hence my attempt to measure which is
faster for typical usage: char or wchar?

You could try measuring it again, but this time without the assumption that all characters are ASCII.

Arcane Jill
Aug 27 2004
next sibling parent reply "Ben Hinkle" <bhinkle mathworks.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgngtc$1nep$1 digitaldaemon.com...
 In article <cgner6$1mh4$1 digitaldaemon.com>, Ben Hinkle says...

dchar would be the choice for ease of use

 Actually, wchar and dchar are pretty much the same in terms of ease of use. In
 almost all cases, you can just /pretend/ that UTF-16 is not a multi-word
 encoding, and just write your code as if one wchar == one character. Except in
 very specialist cases, you'll be correct. Even if you're using characters beyond
 U+FFFF, regarding them as two characters instead of one is likely to be
 harmless.

 Of course, there are applications for which you can't do this. Font rendering
 algorithms, for example, /must/ be able to distinguish true character
 boundaries. But even then, the cost of rendering greatly outweighs the (almost
 insignificant) cost of determining true character boundaries - a trivial task in
 UTF-16.

 char, by contrast, is much more difficult to use, because you /cannot/ "pretend"
 it's a single-byte encoding, because there are too many characters beyond U+00FF
 which applications need to "understand".

true - shortcuts can be taken if the application doesn't have to support all of unicode.
but as you can see performance
goes downhill significantly (at least for the naive test I ran). To me the
performance of dchar is too poor to make it the standard

Agreed.
and the ease-of-use
of utf8 and utf16 are essentially equivalent

 Not really. The ICU API is geared around UTF-16, so if you use wchar[]s, you can
 call ICU functions directly. With char[]s, there's a bit more faffing, so UTF-8
 becomes less easy to use in this regard.

I'm not sure what faffing is but yeah I agree - I just meant the ease-of-use without worrying about library APIs.
 And don't forget - UTF-8 is a complex algorithm, while UTF-16 is a trivially
 simple algorithm. As an application programmer, you may be shielded from the
 complexity of UTF-8 by library implementation and implicit conversion, but it's
 all going on under the hood. UTF-8's "ease of use" vanishes as soon as you
 actually have to implement it.

I meant ease-of-use for end-users. Calling toUTF8 is just as easy as calling toUTF16. Personally, as long as it is correct, the underlying library complexity doesn't affect me. Performance of the application becomes the key factor.
so since utf8 has the best
performance it should be the default.

 But it doesn't. Your tests were unfair. A UTF-16 array will not, in general,
 require twice as many bytes as a UTF-8 array, which is what you seemed to
 assume. That will only be true if the string is pure ASCII, but not otherwise,
 and for the majority of characters the performance will be worse. Check out this
 table:

 Codepoint            Number of bytes
 range           UTF-8    UTF-16   UTF-32
 ----------------------------------------
 0000 to 007F    1        2        4
 0080 to 07FF    2        2        4
 0800 to FFFF    3        2        4    <----- who wins on this row?
 10000+          4        4        4

I was going to start digging into non-ascii next. I remember reading somewhere that encoding asian languages in utf8 typically results in longer strings than utf16. That will definitely hamper utf8.
Hence my attempt to measure which is
faster for typical usage: char or wchar?

 You could try measuring it again, but this time without the assumption that all
 characters are ASCII.

 Arcane Jill

Aug 27 2004
parent Berin Loritsch <bloritsch d-haven.org> writes:
Ben Hinkle wrote:

 
 I was going to start digging into non-ascii next. I remember reading
 somewhere that encoding asian languages in utf8 typically results in longer
 strings than utf16. That will definitely hamper utf8.

Either that or hamper Japanese coders :) If it comes out to a performance draw when dealing with non-ascii text, then might I suggest using programming ease (for library writers as well) to be the tie breaker?
Aug 27 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgngtc$1nep$1 digitaldaemon.com...
 But it doesn't. Your tests were unfair. A UTF-16 array will not, in general,
 require twice as many bytes as a UTF-8 array, which is what you seemed to
 assume. That will only be true if the string is pure ASCII, but not otherwise,
 and for the majority of characters the performance will be worse.

The majority of characters are multibyte in UTF-8, that is true. But the distribution of characters is what matters for speed in real apps, and for those, the vast majority will be ASCII. Furthermore, many operations on strings can treat UTF-8 as if it were single byte, such as copying, sorting, and searching. Of course, there are still many situations where UTF-8 is not ideal, which is why D tries to be agnostic about whether the application programmer wants to use char[], wchar[], or dchar[] or any combination of the three.
Aug 27 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgobse$237t$1 digitaldaemon.com>, Walter says...

The majority of characters [within strings] are multibyte in UTF-8,
that is true. But the [frequency]
distribution of characters is what matters for speed in real apps, and for
those, the vast majority will be ASCII.

Are you sure? That's one helluva claim to make.

I've noticed that D seems to generate a lot of interest from Japan (judging from the existence of Japanese web sites). Of course, Japanese strings average 1.5 times as long in UTF-8 as those same strings would have been in UTF-16.

The whole "Most characters are ASCII" dogma is really only true if you happen to live in a certain part of the world, and to bias a computer language because of that assumption hurts performance for everyone else. /Please/ reconsider.
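
The 1.5 figure is easy to reproduce for text made up only of kana and kanji. A minimal Python sketch (the function name is hypothetical and the sample strings are arbitrary); real-world Japanese text that mixes in ASCII digits, spaces or markup would land somewhere between the two printed ratios:

```python
def utf8_to_utf16_ratio(text):
    """Size of text in UTF-8 divided by its size in UTF-16 (no BOM)."""
    return len(text.encode("utf-8")) / len(text.encode("utf-16-le"))

print(utf8_to_utf16_ratio("hello world"))      # pure ASCII
print(utf8_to_utf16_ratio("こんにちは世界"))     # kana and kanji only
```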
Furthermore, many operations on
strings can treat UTF-8 as if it were single byte, such as copying, sorting,
and searching.

Copying, yes. But of course you miss the point that this would be just as true in UTF-16 as it is in UTF-8.

Sorting? - Lexicographical sorting, maybe, but the only reason you can get away with that is because, in ASCII-only parts of the world, codepoint order happens to correspond to the order we find letters in the alphabet, and even then only if we're prepared to compromise on case ("Foo" sorts before "bar"). Stick an acute accent over one of the vowels and lexicographical sort order goes out the window. Lexicographical sorting may be good for purely mechanical things like eliminating duplicates in an AA, but if someone wants to look up all entries between "Alice" and "Bob" in a database, I think they would be very surprised to find that "Äaron" was not in the list. (And again, you miss the point that if lexicographical sorting /is/ what you want, it works just as well in UTF-16 as it does in UTF-8). /Real/ sorting, however, requires full understanding of Unicode, and for that, ASCII is just not good enough.

Searching? If you treat "UTF-8 as if it were single byte", there are an /awful lot/ of characters you can't search for, including the British Pound (currency) sign, the Euro currency sign and anything with an accent over it. And searching for the single byte 0x81 (for example) is not exactly useful.
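
On the sorting point, a short Python sketch of what plain code point (lexicographical) order does with case and accents. The word list is arbitrary. It also checks that comparing raw UTF-8 bytes, or big-endian UTF-16 bytes, gives the same order as comparing code points for these words. One caveat that is an addition here, not something from the thread: for UTF-16 that equivalence holds only while every character is below U+10000, because surrogate code units sort before U+E000..U+FFFF.

```python
words = ["bar", "Foo", "Zo\u00eb", "Zoo", "\u00c4rger"]

by_codepoint = sorted(words)
by_utf8 = sorted(words, key=lambda w: w.encode("utf-8"))
by_utf16 = sorted(words, key=lambda w: w.encode("utf-16-be"))

print(by_codepoint)
print(by_utf8 == by_codepoint, by_utf16 == by_codepoint)
```

Upper case sorts before lower case, and the accented words land after their unaccented neighbours, which is not what a dictionary would do. Language-aware ordering needs a collation library on top of any of the three encodings.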
Of course, there are still many situations where UTF-8 is not ideal, which
is why D tries to be agnostic about whether the application programmer wants
to use char[], wchar[], or dchar[] or any combination of the three.

Yes, agnostic is good. No problem there. I'm only talking about the /default/. I thought you were all /for/ internationalization and Unicode and all that?

I'm surprised to find myself arguing with you on this one. (Okay, I didn't really expect you to ditch the char, but to prefer wchar[] over char[] is /reasonable/). I have given many, many, many reasons why I think that wchar[] strings should be the default in D, and if I can't convince you, I think that would be a big shame.

Arcane Jill
Aug 27 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgp7c3$2e36$1 digitaldaemon.com>, Arcane Jill says...

if someone wants to look up all
entries between "Alice" and "Bob" in a database, I think they would be very
surprised to find that "Äaron" was not in the list.

Whoops - dumb brain fart! Please pretend I didn't say that. The reasoning is still sound - it's just my conception of alphabetical order that's up the spout.

Jill
Aug 27 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgp7c3$2e36$1 digitaldaemon.com...
 But it doesn't. Your tests were unfair. A UTF-16 array will not, in general,
 require twice as many bytes as a UTF-8 array, which is what you seemed to
 assume. That will only be true if the string is pure ASCII, but not otherwise,
 and for the majority of characters the performance will be worse.

The majority of characters [within strings] are multibyte in UTF-8,
that is true. But the [frequency]
distribution of characters is what matters for speed in real apps, and for
those, the vast majority will be ASCII.

 Are you sure? That's one helluva claim to make.


Yes. Nearly all the spam I get is in ascii <g>. When optimizing for speed, the first rule is optimize for what the bulk of the data will likely consist of for your application. For example, if you're writing a user interface for Chinese people, you'd be sensible to consider using dchar[] throughout. It probably makes sense to use wchar[] for the unicode library you're developing because programmers who have a need for such a library will most likely NOT be writing applications for ascii.
 I've noticed that D seems to generate a lot of interest from Japan (judging from
 the existence of Japanese web sites). Of course, Japanese strings average 1.5
 times as long in UTF-8 as those same strings would have been in UTF-16.

If I was building a Japanese word processor, I certainly wouldn't use UTF-8 internally in it for that reason.
 The whole "Most characters are ASCII" dogma is really only true if you happen to
 live in a certain part of the world, and to bias a computer language because of
 that assumption hurts performance for everyone else. /Please/ reconsider.

If everything is optimized for Japanese, it will hurt performance for ASCII users. The point is, there is no UTF encoding that is optimal for everyone. That's why D supports all three.
Furthermore, many operations on
strings can treat UTF-8 as if it were single byte, such as copying, sorting,
and searching.

 Copying, yes. But of course you miss the point that this would be just as true
 in UTF-16 as it is in UTF-8.

Of course. My point was that quite a few common string operations do not require decoding. For example, the D compiler processes source as UTF-8. It almost never has to do any decoding. The performance penalty for supporting multibyte encodings in D source is essentially zero.
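
A small illustration of why no decoding is needed when the delimiters are ASCII. This is a Python sketch, not how the D compiler is actually written, and `split_on_ascii` is a made-up helper. It relies on one property of UTF-8: every byte of a multi-byte sequence is 0x80 or above, so an ASCII byte can never occur inside another character.

```python
def split_on_ascii(data, delimiter):
    """Split UTF-8 bytes on an ASCII delimiter without decoding.
    Safe because every byte of a multi-byte UTF-8 sequence is >= 0x80,
    so an ASCII byte can never appear inside another character."""
    assert len(delimiter) == 1 and delimiter[0] < 0x80
    return data.split(delimiter)

source = "int gr\u00f6\u00dfe = 1; char[] \u540d\u524d;".encode("utf-8")
tokens = split_on_ascii(source, b";")
# Decoding happens here only to display the result.
print([t.decode("utf-8") for t in tokens])
```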
 Sorting? - Lexicographical sorting, maybe, but the only reason you can get away
 with that is because, in ASCII-only parts of the world, codepoint order happens
 to correspond to the order we find letters in the alphabet, and even then only
 if we're prepared to compromise on case ("Foo" sorts before "bar"). Stick an
 acute accent over one of the vowels and lexicographical sort order goes out the
 window. Lexicographical sorting may be good for purely mechanical things
 like eliminating duplicates in an AA, but if someone wants to look up all
 entries between "Alice" and "Bob" in a database, I think they would be very
 surprised to find that "Äaron" was not in the list. (And again, you miss the
 point that if lexicographical sorting /is/ what you want, it works just as well
 in UTF-16 as it does in UTF-8).

Sure - and my point is it wasn't necessary to decode UTF-8 to do that sort. It's not necessary for hashing the string, either.
 /Real/ sorting, however, requires full
 understanding of Unicode, and for that, ASCII is just not good enough.

There are many different ways to sort, and since the unicode characters are not always ordered the obvious way, you have to deal with that specially in each of UTF-8, -16, and -32.
 Searching? If you treat "UTF-8 as if it were single byte", there are an /awful
 lot/ of characters you can't search for, including the British Pound (currency)
 sign, the Euro currency sign and anything with an accent over it. And searching
 for the single byte 0x81 (for example) is not exactly useful.

That's why std.string.find() takes a dchar as its search argument. What you do is treat it as a substring search. There are a lot of very fast algorithms for doing such searches, such as Boyer-Moore, which get pretty close to the performance of a single character search. Furthermore, I'd optimize it so the first thing the search did was check if the search character was ascii. If so, it'd do the single character scan. Otherwise, it'd do the substring search.
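
The strategy described above can be sketched as follows. This is an illustrative Python version, not the source of std.string.find(), and `find_char` is a hypothetical name. The built-in `bytes.find` stands in for whatever fast substring algorithm a real implementation would pick.

```python
def find_char(haystack, ch):
    """Return the byte offset of character ch in UTF-8 bytes, or -1."""
    if ord(ch) < 0x80:
        # ASCII: a plain single-byte scan is enough.
        return haystack.find(bytes([ord(ch)]))
    # Otherwise search for the encoded form as a byte substring. UTF-8 is
    # self-synchronizing, so a match cannot start in the middle of
    # another character.
    return haystack.find(ch.encode("utf-8"))

# "price: £5 or €6"
text = "price: \u00a35 or \u20ac6".encode("utf-8")
print(find_char(text, ":"), find_char(text, "\u20ac"), find_char(text, "x"))
```

Note that the result is a byte offset, not a character index; the two differ once any multi-byte character precedes the match.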
Of course, there are still many situations where UTF-8 is not ideal, which
is why D tries to be agnostic about whether the application programmer wants
to use char[], wchar[], or dchar[] or any combination of the three.

 Yes, agnostic is good. No problem there. I'm only talking about the /default/. I
 thought you were all /for/ internationalization and Unicode and all that?

But D does not have a default. Programmers can use the encoding which is optimal for the data they expect to see. Even if UTF-8 were the default, UTF-8 still supports full internationalization and Unicode. I am certainly not talking about supporting only ASCII or having ASCII as the default.
 I'm surprised to find myself arguing with you on this one. (Okay, I didn't really
 expect you to ditch the char, but to prefer wchar[] over char[] is
 /reasonable/). I have given many, many, many reasons why I think that wchar[]
 strings should be the default in D, and if I can't convince you, I think that
 would be a big shame.

My experience with using UTF-16 throughout a program is that it sped up quite a bit when converted to UTF-8. There is no blanket advantage to UTF-16; it depends on your expected data. When your expected data will be mostly ASCII, then UTF-8 is the reasonable choice.
Aug 28 2004
parent reply Berin Loritsch <bloritsch d-haven.org> writes:
Walter wrote:

 
 But D does not have a default. Programmers can use the encoding which is
 optimal for the data they expect to see. Even if UTF-8 were the default,
 UTF-8 still supports full internationalization and Unicode. I am certainly
 not talking about supporting only ASCII or having ASCII as the default.
 

Umm, what about the toString() function? Doesn't that assume char[]? Hence, it is the default by example.

I'll be honest, I don't get why optimization is so important when a need for it hasn't been determined yet. I am sure there can be quicker ways of dealing with allocation and de-allocation--this would make the system faster for all objects, not just strings. If that can be done, why not concentrate on that?

More advanced memory utilization can mean better overall performance, and reduce the cost of one type of string over another. Heck, if a page of memory is being allocated for string storage (multiple strings mind you), what about a really fast bit blit for the whole page? That would make the strings default to initialization state and speed things up. Ideally, the difference between a char[] and a dchar[] would be how much of that page is allocated.
Aug 30 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Berin Loritsch" <bloritsch d-haven.org> wrote in message
news:cgv92r$2fvv$1 digitaldaemon.com...
 Umm, what about the toString() function?  Doesn't that assume char[]?
 Hence, it is the default by example.

Yes, but it isn't char(!)acteristic of D.
 I'll be honest, I don't get why optimization is so important when there
 hasn't been determined a need yet.

Efficiency, or at least potential efficiency, has always been a strong attraction that programmers have to C/C++. Since D is targeted at that market, efficiency will be a major consideration. If D acquires an early reputation for being "slow", like Java did, that reputation can be very, very hard to shake.
 I am sure there can be quicker ways
 of dealing with allocation and de-allocation--this would make the system
 faster for all objects, not just strings.  If that can be done, why not
 concentrate on that?

There's no way to just wipe away the costs of using double the storage.
 More advanced memory utilization can mean better overall performance,
 and reduce the cost of one type of string over another.  Heck, if a
 page of memory is being allocated for string storage (multiple strings
 mind you), what about a really fast bit blit for the whole page?  That
 would make the strings default to initialization state and speed things
 up.  Ideally, the difference between a char[] and a dchar[] would be
 how much of that page is allocated.

Aug 30 2004
parent "Roald Ribe" <rr.no spam.teikom.no> writes:
"Walter" <newshound digitalmars.com> wrote in message
news:ch048q$2uql$1 digitaldaemon.com...
 "Berin Loritsch" <bloritsch d-haven.org> wrote in message
 news:cgv92r$2fvv$1 digitaldaemon.com...
 Umm, what about the toString() function?  Doesn't that assume char[]?
 Hence, it is the default by example.

Yes, but it isn't char(!)acteristic of D.
 I'll be honest, I don't get why optimization is so important when there
 hasn't been determined a need yet.

Efficiency, or at least potential efficiency, has always been a strong attraction that programmers have to C/C++. Since D is targeted at that market, efficiency will be a major consideration. If D acquires an early reputation for being "slow", like Java did, that reputation can be very, very hard to shake.
 I am sure there can be quicker ways
 of dealing with allocation and de-allocation--this would make the system
 faster for all objects, not just strings.  If that can be done, why not
 concentrate on that?

There's no way to just wipe away the costs of using double the storage.

But this claim holds true only for those who have English as their only working language, and (maybe) for a few others in Europe. In all other markets (5 billion+) the utf8 storage will in fact (mostly) be _larger_ than the utf16 storage.

And as I proposed earlier, you could leave an option for English/Europeans in the form of char defined as in C/C++, which would, in addition to being just as fast, actually make the transition to the D language easier.

I think it all comes down to this: will D become a general purpose language for the international community, or will it mostly become a "better" C++ for English speakers only?

Roald
Sep 02 2004
prev sibling parent Berin Loritsch <bloritsch d-haven.org> writes:
Ben Hinkle wrote:

 "Berin Loritsch" <bloritsch d-haven.org> wrote in message
 news:cgnc25$1l1f$1 digitaldaemon.com...
 
FWIW, the Java 'char' is a 16 bit value due to the unicode standards.
The idea of course, is that internally to the program all strings are
encoded the same and translated on IO.

I would venture to say there are two observations about Java strings:

1) In most applications they are fast enough to handle a lot of work.

2) It can become a bottleneck when excessive string concatenation is
    happening, or logging is overdone.

I wonder what Java strings in utf8 would be like... I wonder if anyone has tried that out.

Internally there is no such thing. It's just easier to deal with that way. The translation happens with encoders and decoders on IO.
Honestly, I would prefer something that made internationalized strings
easier to manage rather than more difficult.  If there are no multi-char
graphemes (i.e. nothing takes up more than one code space) then that would be
the easiest to work with and write libraries for.

dchar would be the choice for ease of use but as you can see performance goes downhill significantly (at least for the naive test I ran). To me the performance of dchar is too poor to make it the standard and the ease-of-use of utf8 and utf16 are essentially equivalent so since utf8 has the best performance it should be the default. Hence my attempt to measure which is faster for typical usage: char or wchar?

Yea, but I've learned not to get hung up on strict performance. There is a difference between ultimately fast and fast enough. Sometimes to squeeze those extra cycles out we can cause more programming issues than need be. If the allocation routines could be done faster (big assumption here), would it be preferable to use a dchar if it is fast enough?

For the record, I believe Java uses UTF16 internally, which means for most things there is less of a need to worry about MB characters.

The interesting test would be to have strings of the same length, and then test the algorithm to get a substring from that string. For example, a fixed string that uses some multi-codespace characters here and there, and then getting the 15th through 20th characters of that string. This will show just how things might come out in the wash when we are dealing with that type of issue.

For example, it is not uncommon to have the system read in a block of text from a file into memory (say 4k worth at a time), and then just iterate one line at a time. Which gives us the substring scenario. Alternatively there is the regex algorithm that would need to account for the multi-codespace characters, which will take a performance hit as well.

There is a lot more to an overall more performant system than just string allocation--even though we are aware that it can be a significant cost.
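
To show what the substring test above would actually exercise, here is a minimal Python sketch of slicing by character position in UTF-8 data. The helper `utf8_char_slice` and the sample line are hypothetical, and "character" means code point. With a fixed-width encoding such as UTF-32 the same slice is a direct index; in UTF-8 the character boundaries have to be found by scanning.

```python
def utf8_char_slice(data, start, stop):
    """Return the bytes for code points start..stop-1 of UTF-8 data.
    Continuation bytes look like 0b10xxxxxx; any other byte begins
    a new code point, so boundaries are found with a linear scan."""
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    starts.append(len(data))
    start = min(start, len(starts) - 1)
    stop = min(max(stop, start), len(starts) - 1)
    return data[starts[start]:starts[stop]]

# "naïve café 日本語 text"
line = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e text".encode("utf-8")
print(utf8_char_slice(line, 6, 14).decode("utf-8"))
```

This version builds the whole boundary list for clarity; a tuned version would stop scanning at `stop`. Either way the cost grows with the position of the slice, which is the overhead the proposed test would measure.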
Aug 27 2004
prev sibling next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgn92o$1jj4$1 digitaldaemon.com>, Ben Hinkle says...

 char: .36 seconds
 wchar: .78
 dchar: 1.3

Yeah, I forgot about allocation time. Of course, D initializes all arrays, no matter whence they are allocated. char[]s will be filled with all FFs, and wchar[]s will be filled with all FFFFs. Twice as many bytes = twice as many bytes to initialize. Damn! A super-fast character array allocator would make a lot of difference here. There are probably many different ways of doing this very fast. I guess this has to happen within DMD.
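To make the initialization cost concrete, here is a small sketch in present-day D. The default values checked are the language's documented `.init` values; `std.array.uninitializedArray` is a Phobos helper that did not exist when this thread was written, so treat its use as an assumption about a newer toolchain, not as the fast allocator being asked for.

```d
import std.array : uninitializedArray;

void main()
{
    enum n = 1000;

    auto c = new char[n];   // every element is char.init  (0xFF)
    auto w = new wchar[n];  // every element is wchar.init (0xFFFF)
    auto d = new dchar[n];  // every element is dchar.init (0x0000FFFF)

    assert(c[0] == 0xFF && c[$ - 1] == 0xFF);
    assert(w[0] == 0xFFFF);
    assert(d[0] == 0xFFFF);

    // Same element count, but 1x, 2x and 4x the bytes to fill.
    assert(c.length * char.sizeof == 1000);
    assert(w.length * wchar.sizeof == 2000);
    assert(d.length * dchar.sizeof == 4000);

    // Skips the fill entirely; the contents are garbage until the
    // caller has written every element.
    auto raw = uninitializedArray!(char[])(n);
    assert(raw.length == n);
}
```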
Aug 27 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgncho$1la0$1 digitaldaemon.com...
 In article <cgn92o$1jj4$1 digitaldaemon.com>, Ben Hinkle says...

 char: .36 seconds
 wchar: .78
 dchar: 1.3

 Yeah, I forgot about allocation time. Of course, D initializes all arrays, no
 matter whence they are allocated. char[]s will be filled with all FFs, and
 wchar[]s will be filled with all FFFFs. Twice as many bytes = twice as many
 bytes to initialize. Damn!

There are also twice as many bytes to scan for the gc, and half the data until your machine starts thrashing the swap disk. The latter is a very real issue for server apps, since it means that you reach the point of having to double the hardware in half the time.
Aug 27 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgobse$237t$2 digitaldaemon.com>, Walter says...

There are also twice as many bytes to scan for the gc, and half the data
until your machine starts thrashing the swap disk. The latter is a very real
issue for server apps, since it means that you reach the point of having to
double the hardware in half the time.

There you go again, assuming that wchar[] strings are double the length of char[] strings. THIS IS NOT TRUE IN GENERAL. In Chinese, wchar[] strings are shorter than char[] strings. In Japanese, wchar[] strings are shorter than char[] strings. In Mongolian, wchar[] strings are shorter than char[] strings. In Tibetan, wchar[] strings are shorter than char[] strings. I assume I don't need to go on...?

<sarcasm>But I guess server apps never have to deliver text in those languages.</sarcasm>

Walter, servers are one of the places where internationalization matters most. XML and HTML documents, for example, could be (a) stored and (b) requested in any encodings whatsoever. A server would have to push them through a transcoding function. For this, wchar[]s are more sensible.

I don't understand the basis of your determination. It seems ill-founded.

Jill
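A quick way to check the size claim, sketched in present-day D using the `w` and `d` literal suffixes. The sample strings are arbitrary; the CJK example uses five characters from the Basic Multilingual Plane, each of which takes three bytes in UTF-8 and two in UTF-16.

```d
void main()
{
    // Five ASCII letters: UTF-16 takes twice the storage of UTF-8.
    assert("hello".length * char.sizeof == 5);
    assert("hello"w.length * wchar.sizeof == 10);
    assert("hello"d.length * dchar.sizeof == 20);

    // Five Chinese characters: UTF-16 is the smaller of the two.
    assert("中文字符串".length * char.sizeof == 15);
    assert("中文字符串"w.length * wchar.sizeof == 10);
    assert("中文字符串"d.length * dchar.sizeof == 20);
}
```

So the 2x figure only holds for pure ASCII. For text like this the ratio is 2:3 in favour of wchar[].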
Aug 27 2004
next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgp845$2ea2$1 digitaldaemon.com...
 In article <cgobse$237t$2 digitaldaemon.com>, Walter says...
There are also twice as many bytes to scan for the gc, and half the data
until your machine starts thrashing the swap disk. The latter is a very real
issue for server apps, since it means that you reach the point of having to
double the hardware in half the time.

 There you go again, assuming that wchar[] strings are double the length of
 char[] strings. THIS IS NOT TRUE IN GENERAL.

Are you sure? Even european languages are mostly ascii.
 In Chinese, wchar[] strings are
 shorter than char[] strings. In Japanese, wchar[] strings are shorter than
 char[] strings. In Mongolian, wchar[] strings are shorter than char[] strings.
 In Tibetan, wchar[] strings are shorter than char[] strings. I assume I don't
 need to go on...?

 <sarcasm>But I guess server apps never have to deliver text in those
 languages.</sarcasm>

Never is not the right word here. The right idea is what is the frequency distribution of the various types of data one's app will see. Once you know that, you optimize for the most common cases. Tibetan is still fully supported regardless.
 Walter, servers are one of the places where internationalization matters most. XML
 and HTML documents, for example, could be (a) stored and (b) requested in any
 encodings whatsoever.

Of course. But what are the frequencies of the requests for various encodings? Each of the 3 UTF encodings fully supports Unicode and is fully internationalized. Which one you pick depends on the frequency distribution of your data.
 A server would have to push them through a transcoding
 function. For this, wchar[]s are more sensible.

It is not optimal unless the person optimizing the server app has instrumented his data so he knows the frequency distribution of the various characters. Only then can he select the encoding that will deliver the best performance.
 I don't understand the basis of your determination. It seems ill-founded.

Experience optimizing apps. One of the most potent tools for optimization is analyzing the data patterns, and making the most common cases take the shortest path through the code. UTF-16 is not optimal for a great many applications - and I have experience with it.
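One way to "instrument the data" in the sense described above is to walk a representative sample and tally what each encoding would cost. This is only a sketch in present-day D; the struct `EncodingStats` and the function `measure` are hypothetical names, and `std.utf.codeLength` is the Phobos call that reports how many code units a code point needs.

```d
import std.utf : codeLength;

/// Tallies for one sample of text (hypothetical helper type).
struct EncodingStats
{
    size_t codePoints;
    size_t nonAscii;
    size_t utf8Bytes;
    size_t utf16Bytes;
}

EncodingStats measure(const(char)[] text)
{
    EncodingStats s;
    foreach (dchar c; text) // decodes the UTF-8 one code point at a time
    {
        ++s.codePoints;
        if (c >= 0x80)
            ++s.nonAscii;
        s.utf8Bytes += codeLength!char(c);
        s.utf16Bytes += codeLength!wchar(c) * wchar.sizeof;
    }
    return s;
}

void main()
{
    // Mostly-ASCII European text: UTF-8 is smaller.
    auto es = measure("año");
    assert(es.codePoints == 3 && es.nonAscii == 1);
    assert(es.utf8Bytes == 4 && es.utf16Bytes == 6);

    // CJK text: UTF-16 is smaller.
    auto zh = measure("中文");
    assert(zh.codePoints == 2 && zh.nonAscii == 2);
    assert(zh.utf8Bytes == 6 && zh.utf16Bytes == 4);
}
```

Run over real traffic, the totals show which encoding is smaller for that application's data.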
Aug 28 2004
parent reply Berin Loritsch <bloritsch d-haven.org> writes:
Walter wrote:

 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cgp845$2ea2$1 digitaldaemon.com...
 
 Are you sure? Even european languages are mostly ascii.

But not completely. There is the euro symbol (which I dare say would be quite common). In Spanish the enye (n with ~ on top; can't really do that well in Windows) is fairly common, and important. There is a big difference between an anus and a year, but the only difference in Spanish is n vs. enye. Not to mention all those words that use an accent to mark an abnormally stressed syllable. Then we get to French, which uses the circumflex, the acute accent, and the grave accent. Oh, then there's German, which uses those two little dots a lot. And I haven't even touched on Greek or Russian, both European languages.

You can only make that assumption about English-speaking countries. Yes, almost everyone is exposed to English in some way, and it is the current "lingua franca" (language of business, like French used to be--hence the term). The bottom line is that there are sufficient exceptions to your "rule" that it would be a shame to assume the world was America and Great Britain.
Aug 30 2004
next sibling parent reply "Roald Ribe" <rr.no spam.teikom.no> writes:
"Berin Loritsch" <bloritsch d-haven.org> wrote in message
news:cgv9m0$2g71$1 digitaldaemon.com...
 Walter wrote:

 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cgp845$2ea2$1 digitaldaemon.com...

 Are you sure? Even european languages are mostly ascii.

But not completely. There is the euro symbol (which I dare say would be quite common). In Spanish the enye (n with ~ on top; can't really do that well in Windows) is fairly common, and important. There is a big difference between an anus and a year, but the only difference in Spanish is n vs. enye. Not to mention all those words that use an accent to mark an abnormally stressed syllable. Then we get to French, which uses the circumflex, the acute accent, and the grave accent. Oh, then there's German, which uses those two little dots a lot. And I haven't even touched on Greek or Russian, both European languages. You can only make that assumption about English-speaking countries. Yes, almost everyone is exposed to English in some way, and it is the current "lingua franca" (language of business, like French used to be--hence the term). The bottom line is that there are sufficient exceptions to your "rule" that it would be a shame to assume the world was America and Great Britain.

Even Britain has a non-ASCII character that is used quite extensively: the pound sign, £.
Norway/Denmark/Sweden have three non-ASCII characters (used all the time). The Sami peoples have their own characters (they live in Norway, Sweden, Russia). Finland, Estonia, Lithuania, Poland, ++ all have their own characters in addition to ASCII. Russia has its own alphabet! All Latin family languages (French/Spanish/Italian/Portuguese) have all sorts of special characters (accents forwards/backwards ++)... And now I have not even gone through HALF of Europe. In Asia there are wildly different systems, and several systems in use, _in_ _each_ _country_.

As I have stated before: I agree with Walter's concern for performance. But where I think there is some disagreement in these discussions is where to put the effort to "adapt" the environment: on those who only need ASCII (most of the time), or on all those who would prefer the language to default to the more general need of application and server programmers all over the world.

My view is that speed freaks are used to tuning their tools for best speed, and the general case should reflect newbies and the 5 billion+ potential non-English-using markets. Everything else is selling D short, in a shortsighted quest for best speed as default as one of the language features.

I have a rather radical suggestion that may make sense, or it may happen that someone will shoot it down right away because of something I have not thought of:

1. Remove wchar and dchar from the language.
2. Make char mean 8-bit unsigned byte, containing
   US-ASCII/Latin1/ISO-8859/cp1252, with one character in each byte.
   Null termination is expected. AFAIK all the sets mentioned are compatible
   with each other. Char *may* contain characters from any
   8-bit based encoding, given that either an existing conv. table or the
   application can convert to/from one of the types below. This type makes
   for a clean, minimum effort port from C and C++, and interaction with the
   current crop of OS and libraries. It also takes care of US/Western Europe
   speed freaks.
3. New types, utf8, utf16 and utf32 as suggested by others.
4. String based on utf16 as default storage. With overridden storage type
   like:
   new String(200, utf8)   // 200 bytes
   new String(200, utf16)  // 400 bytes
   new String(200)         // 400 bytes
   new String(200, utf32)  // 800 bytes
   Anyone can use string with the optimal performance for them.
5. String literals in source, default assumed to be utf16 encoded.
   Can be changed by app programmer like:
   c"text"    // char[] 4 bytes
   u"text"    // String() 4 bytes
   w"text"    // String() 8 bytes
   "text"     // String() 8 bytes
   d"text"    // String() 16 bytes

I am open to the fact that I am not at all experienced in language design, but I hope this may bring the discussion along. I think making char the same as in C/C++ (but with a slightly better defined default char set) and going with an entirely different type for the rest is a sound idea.

Roald
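For comparison with the literal syntax proposed in point 5: present-day D spells this with suffixes rather than prefixes. The sketch below is not the proposed String class, only a check that the byte counts for "text" come out as listed in the proposal.

```d
void main()
{
    string  c = "text"c;  // UTF-8,  char code units
    wstring w = "text"w;  // UTF-16, wchar code units
    dstring d = "text"d;  // UTF-32, dchar code units

    assert(c.length * char.sizeof == 4);
    assert(w.length * wchar.sizeof == 8);
    assert(d.length * dchar.sizeof == 16);

    // An unsuffixed literal adapts to the type it is assigned to.
    wstring w2 = "text";
    assert(w2 == w);
}
```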
Aug 30 2004
parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
I couldn't agree more about Walter's ASCII argument. It's way out there 
and alienates all of us with non-English first languages (maybe I should 
start writing my messages using runes, just like my forefathers...).
If toString is only really useful for debugging anyway, it could as 
well return dchars. I'd rather remove it altogether, though.

Lars Ivar Igesund

Aug 30 2004
parent "Ben Hinkle" <bhinkle mathworks.com> writes:
Walter did use the word "most". Does anyone know of any studies on the
frequency of non-ASCII chars for different document content and languages?
There must be solid numbers about these things given all the zillions of
electronic documents out there. A quick google for French just dug up a
posting where someone scanned 86 million characters from Swiss-French
news agency reports and got 22M non-accented vowels (aeiou) and 1.8M accented
chars. That's a factor of roughly 12. That seems significant. But I don't
want to read too much into one posting found in a minute of googling - I'm
just curious what the data says.
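As a back-of-the-envelope reading of those figures, here is the storage arithmetic under two loudly stated assumptions that are not in the posting: that the 1.8M accented characters are counted within the 86M total, and that each of them takes two bytes in UTF-8 while everything else is ASCII.

```d
import std.stdio : writefln;

void main()
{
    enum ulong totalChars = 86_000_000; // figure quoted in the post
    enum ulong accented = 1_800_000;    // figure quoted in the post

    // Assumption: accented letters cost 2 bytes in UTF-8, the rest 1 byte.
    enum ulong utf8Bytes = (totalChars - accented) + 2 * accented;
    // Assumption: every character fits one UTF-16 code unit (2 bytes).
    enum ulong utf16Bytes = 2 * totalChars;

    static assert(utf8Bytes == 87_800_000);
    static assert(utf16Bytes == 172_000_000);

    writefln("UTF-8: %s bytes, UTF-16: %s bytes", utf8Bytes, utf16Bytes);
}
```

Under those assumptions the French sample would be a little under twice as large in UTF-16 as in UTF-8.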
Aug 30 2004
prev sibling parent reply Ilya Minkov <minkov cs.tum.edu> writes:
Berin Loritsch schrieb:

 Walter wrote:
 
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cgp845$2ea2$1 digitaldaemon.com...

 Are you sure? Even european languages are mostly ascii.

But not completely. There is the euro symbol (which I dare say would be quite common). In Spanish the enye (n with ~ on top; can't really do that well in Windows) is fairly common, and important. There is a big difference between an anus and a year, but the only difference in Spanish is n vs. enye. Not to mention all those words that use an accent to mark an abnormally stressed syllable. Then we get to French, which uses the circumflex, the acute accent, and the grave accent. Oh, then there's German, which uses those two little dots a lot. And I haven't even touched on Greek or Russian, both European languages.

When serving HTML, extended European characters are usually not served as Latin-1 or Unicode. Instead, the &sym; escape encoding is preferred. There are ASCII escapes for all Latin-1 characters, as far as i know.

But what bothers me with all Unicode is that Cyrillic languages cannot be handled with 8 bits as well. What would be nice is if we found an encoding which would work on 2 buffers: the primary one containing the ASCII and data in some codepage. The secondary one would contain packed codepage changes, so that Russian, English, Hebrew and other text can be mixed and would still need about one byte per character on average. For Asian languages, the encoding should use on average one symbol in the primary string and one symbol in the secondary per character. The length of the primary stream must be exactly the length of the string; all of the overhang must be placed in the secondary one. I have a feeling that this could be great for most uses and most efficient in total.

We should also not forget that the world is mostly Chinese, and soon the computer users will also be. The European race will lose its importance.

-eye
Aug 30 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ch07mc$30ai$1 digitaldaemon.com>, Ilya Minkov says...

But what bothers me with all Unicode, is that cyrillic languages cannot 
be handled with 8 bits as well.

One option would be the encoding WINDOWS-1251. Quote...

"The Cyrillic text used in the data sets are encoded using the CP1251 Cyrillic system. Users will require CP1251 fonts to read or print such text correctly. CP1251 is the Cyrillic encoding used in Windows products as developed by Microsoft. The system replaces the underused upper 128 characters of the typical Latin character set with Cyrillic characters, leaving the full set of Latin type in the lower 128 characters. Thus the user may mix Cyrillic and Latin text without changing fonts."
(-- source: http://polyglot.lss.wisc.edu/creeca/kaiser/cp1251.html)

But that's just a transcoding issue, surely? Internally, we'd use Unicode, no?
We should also not forget that the world is mostly chinese, and soon the 
computer users will also be.

Well, Chinese /certainly/ can't be handled with 8 bits. Traditionally, Chinese users have made use of encodings such as GB2312 and Big5, which, like the Japanese SHIFT-JIS, are (shock! horror!) /multi-byte-encodings/ (there being vastly more than 256 "letters" in the Chinese alphabet). Such encodings are seriously horrible to work with, compared with the elegant simplicity of UTF-16.

Arcane Jill
Aug 31 2004
parent Ilya Minkov <minkov cs.tum.edu> writes:
Arcane Jill schrieb:

 One option would be the encoding WINDOWS-1251. Quote...

Oh come on. Do you really think i don't know 1251 and all the other Windows codepages? Oh, how would a person who natively speaks Russian ever know that? They are all into typewriters and handwriting, aren't they?
 But that's just a transcoding issue, surely? Internally, we'd use Unicode, no?

You have apparently ignored what i tried to say. What is used externally is determined by external conditions, and is not the subject of this part of the post.

I have suggested to investigate and possibly develop another *internal* representation which would provide optimal performance. It should consist of 2 storages, the 8-bit primary storage and the variable-length "overhang" storage, and should be able to represent *all* Unicode characters. We are back at the question of an efficient String class or struct.

The idea is that characters are not self-contained, but instead context-dependent. For example, the most commonly used escape in the overhang string would be "select a new Unicode subrange to work on". Unicode documents are not just random data! They are words or sentences written in a combination of a few languages, with a change of the language happening perhaps every few words. But you don't have every symbol be in the new language. So why does every symbol need to carry the complete information, if most of it is more efficiently stored as a relatively rare state change?
We should also not forget that the world is mostly chinese, and soon the 
computer users will also be.

Well, Chinese /certainly/ can't be handled with 8 bits. Traditionally, Chinese users have made use of encodings such as GB2312 and Big5, which, like the Japanese SHIFT-JIS, are (shock! horror!) /multi-byte-encodings/ (there being vastly more than 256 "letters" in the Chinese alphabet). Such encodings are seriously horrible to work with, compared with the elegant simplicity of UTF-16.

Again, you have chosen to ignore my post. As you are much more familiar with Unicode than myself, could you possibly develop an encoding which takes:

 - amortized 1 byte per character for usual codepages (not including the fixed-length subrange select command in the beginning);
 - 2 bytes per character for all multibyte encodings which fit into UTF-16 (not including the fixed-length subrange select command in the beginning);
 - the rest of the Unicode characters should be representable as well.

Besides, i would like that only the first byte from the character encoding is stored in the primary string, and the rest in the "overhang". I have my reasons to suggest that, and *if* you care to pay attention i would also like to explain in detail.

-eye
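To make the two-storage idea easier to discuss, here is one possible reading of it as a sketch in present-day D. Everything in it is an assumption made for illustration and not the design proposed in the post: the names `TwoStream`, `WindowChange`, `encode` and `decode`, the choice of 128-code-point windows, and the rule that bytes below 0x80 are always plain ASCII. The wider two-byte mode asked for Asian scripts is not modelled at all. The windowing is loosely similar in spirit to the Standard Compression Scheme for Unicode.

```d
import std.utf : toUTF32;

/// One entry of the secondary ("overhang") stream: from code point `index`
/// on, primary bytes >= 0x80 are offsets from `base`.
struct WindowChange
{
    size_t index;
    uint base;
}

struct TwoStream
{
    ubyte[] primary;         // exactly one byte per code point
    WindowChange[] overhang; // the (hopefully rare) state changes
}

TwoStream encode(const(char)[] text)
{
    TwoStream r;
    uint current = 0;
    foreach (dchar c; text)
    {
        if (c < 0x80)
        {
            r.primary ~= cast(ubyte) c; // ASCII is stored as-is
            continue;
        }
        immutable uint base = c & ~0x7Fu; // start of the 128-wide window
        if (base != current)
        {
            r.overhang ~= WindowChange(r.primary.length, base);
            current = base;
        }
        r.primary ~= cast(ubyte)(0x80 | (c & 0x7F));
    }
    return r;
}

dchar[] decode(const TwoStream s)
{
    dchar[] result;
    uint base = 0;
    size_t next = 0; // next unapplied entry of s.overhang
    foreach (i, b; s.primary)
    {
        if (next < s.overhang.length && s.overhang[next].index == i)
            base = s.overhang[next++].base;
        result ~= cast(dchar)(b < 0x80 ? b : base + (b & 0x7F));
    }
    return result;
}

void main()
{
    string text = "Привет, world! Привет";
    auto enc = encode(text);

    assert(text.length == 33);        // UTF-8: 33 bytes
    assert(enc.primary.length == 21); // one byte per code point
    assert(enc.overhang.length == 1); // one switch into the Cyrillic window
    assert(decode(enc) == toUTF32(text));
}
```

For mixed Russian and English this reaches one byte per character plus a single state change. CJK text would switch windows on almost every character, so it would need the separate wider mode that the post asks for.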
Sep 01 2004
prev sibling parent "van eeshan" <vanee hotmail.net> writes:
What you fail to understand, Jill, is that such arguments are but pinpricks
upon the World's foremost authority on everything from language-design to
server-software to ease-of-use.

Better to just build a wchar-based String class (and all the supporting
goodies), and those who care about such things will naturally migrate to it;
they'll curse D for the short-sighted approach to Object.toString, leaving
the door further open for a D successor

V
Aug 28 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgobse$237t$2 digitaldaemon.com>, Walter says...

There are also twice as many (sic) bytes to scan for the gc,

Why are strings added to the GC root list anyway? It occurs to me that arrays of bit, byte, ubyte, short, ushort, char, wchar and dchar which are allocated on the heap can never contain pointers, and so should not be added to the GC's list of things to scan when created with new (or modification of .length).

I imagine that this one simple step would increase D's performance rather dramatically.

Arcane Jill
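For readers on a present-day toolchain, the idea of exempting pointer-free character data from scanning can be tried by hand through `core.memory`. This is a sketch of the newer druntime API, not of the runtime this thread is about; the helper name `allocNoScan` is made up.

```d
import core.memory : GC;

/// Allocates a char buffer that the collector will not scan for pointers.
char[] allocNoScan(size_t n)
{
    auto p = cast(char*) GC.malloc(n, GC.BlkAttr.NO_SCAN);
    return p[0 .. n];
}

void main()
{
    char[] buf = allocNoScan(4096);
    buf[] = 'a'; // GC.malloc does not initialize, so fill before reading

    assert(buf.length == 4096);
    assert(buf[0] == 'a' && buf[$ - 1] == 'a');
    assert((GC.getAttr(buf.ptr) & GC.BlkAttr.NO_SCAN) != 0);
}
```

The block is still owned and freed by the collector. The attribute only tells it not to look inside the block for references.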
Aug 27 2004
parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgp8di$2ec2$1 digitaldaemon.com...
 In article <cgobse$237t$2 digitaldaemon.com>, Walter says...

There are also twice as many (sic) bytes to scan for the gc,

 Why are strings added to the GC root list anyway? It occurs to me that arrays of
 bit, byte, ubyte, short, ushort, char, wchar and dchar which are allocated on
 the heap can never contain pointers, and so should not be added to the GC's list
 of things to scan when created with new (or modification of .length).

 I imagine that this one simple step would increase D's performance rather
 dramatically.

There's certainly potential in D to add type awareness to the gc. But that adds penalties of its own, and it's an open question whether on balance it will be faster or not.
Aug 28 2004
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cgn92o$1jj4$1 digitaldaemon.com>, Ben Hinkle says...
 char: .36 seconds
 wchar: .78
 dchar: 1.3

So:

wchar = char * 2
dchar = char * 4

It looks like the time complexity is a direct factor of element size, which stands to reason since D default-initializes all arrays. I would be interested in seeing performance comparisons for transcoding between different formats. For the sake of argument, perhaps using both std.utf and whatever Mango uses.

Sean
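A starting point for the transcoding comparison, sketched with present-day Phobos only (`std.utf` and `std.datetime.stopwatch`); Mango is not covered. The sample text and repetition count are arbitrary, and the timings it prints depend entirely on the machine and compiler flags, so none are shown here.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;
import std.utf : toUTF16, toUTF8;

void main()
{
    // Build a sample with a mix of ASCII and non-ASCII characters.
    string sample;
    foreach (i; 0 .. 100_000)
        sample ~= "naïve café ";

    auto sw = StopWatch(AutoStart.yes);
    wstring wide = toUTF16(sample); // UTF-8 -> UTF-16
    auto up = sw.peek();

    sw.reset(); // resets the elapsed time; the watch keeps running
    string back = toUTF8(wide);     // UTF-16 -> UTF-8
    auto down = sw.peek();

    assert(back == sample); // the round trip is lossless
    writefln("UTF-8 -> UTF-16: %s, UTF-16 -> UTF-8: %s", up, down);
}
```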
Aug 27 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgnh8j$1nl6$1 digitaldaemon.com>, Sean Kelly says...

So:

wchar = char * 2
dchar = char * 4

Only if there are the same number of chars in a char array as there are wchars in a wchar array, and dchars in a dchar array. This will /only/ be true if the string is pure ASCII. Jill
Aug 27 2004