digitalmars.D - performance of char vs wchar

Ben Hinkle <bhinkle4 juno.com> writes:

All the threads about char vs wchar got me thinking about char performance,
so I thought I'd test out the difference. I've attached a simple test that
allocates a bunch of strings (of type char[] or wchar[]) of random length
each with one random 'a' in them. Then it loops over the strings randomly
and counts the number of 'a's. So this test does not measure transcoding
speed - it just measures allocation and simple access speed across a large
number of long strings. My times are
 char: .36 seconds
 wchar: .78
 dchar: 1.3

I tried to make sure the dchar version didn't run into swap space. If I
scale up or down the size of the problem the factor of 2 generally remains
until the times get so small that it doesn't matter. It looks like a lot
of the time is taken in initializing the strings, since when I modify
the test to only time access performance the scale factor goes down from 2
to something under 1.5 or so.

I built it using
 dmd charspeed.d -O -inline -release

It would be nice to try other tests that involve searching for sub-strings
(like multi-byte unicode strings) but I haven't done that. 
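
The attached charspeed.d is not reproduced in this thread. Purely as an illustration of the kind of test described above (allocate arrays of 1-, 2- or 4-byte elements, put a single 'a' in each, then count 'a's over random picks), a minimal Python sketch might look like this. The names make_strings, count_a and run, and all of the sizes, are assumptions made for illustration, and Python timings say nothing about what dmd produces.

```python
import random
import time
from array import array


def make_strings(typecode, count, max_len, rng):
    """Allocate `count` zero-filled arrays of random length, each holding one 'a'."""
    itemsize = array(typecode).itemsize
    strings = []
    for _ in range(count):
        n = rng.randint(1, max_len)
        # bytes(k) is k zero bytes, so this yields n zero-initialized elements
        s = array(typecode, bytes(n * itemsize))
        s[rng.randrange(n)] = ord("a")
        strings.append(s)
    return strings


def count_a(strings, lookups, rng):
    """Pick strings at random and count the 'a' elements in each pick."""
    total = 0
    for _ in range(lookups):
        total += rng.choice(strings).count(ord("a"))
    return total


def run(typecode, count=200, max_len=1000, lookups=500, seed=1):
    """Time allocation plus access for one element width; returns (count, seconds)."""
    rng = random.Random(seed)
    start = time.perf_counter()
    strings = make_strings(typecode, count, max_len, rng)
    total = count_a(strings, lookups, rng)
    return total, time.perf_counter() - start


if __name__ == "__main__":
    # 'B' is 1 byte and 'H' is 2 bytes; 'I' is 4 bytes on most platforms
    for label, typecode in (("char", "B"), ("wchar", "H"), ("dchar", "I")):
        total, seconds = run(typecode)
        print(label, total, round(seconds, 4))
```

Because every array holds exactly one 'a', the count returned equals the number of lookups, which gives a quick sanity check that the three variants do the same work.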

-Ben

Aug 27 2004

Ben Hinkle <bhinkle4 juno.com> writes:

Note: it looks like the version that got attached was not the version I used
to run the tests - it was the one I was using to test access. I thought
attaching the file would make a copy right then, but it looks like my
newsreader makes a copy only when the message is posted. The version I ran
to get my original numbers had longer strings (like 10000 instead of 1000)
and timed the allocation as well as the access.

-Ben

Aug 27 2004

Berin Loritsch <bloritsch d-haven.org> writes:

Ben Hinkle wrote:
 All the threads about char vs wchar got me thinking about char performance,
 so I thought I'd test out the difference. I've attached a simple test that
 allocates a bunch of strings (of type char[] or wchar[]) of random length
 each with one random 'a' in them. Then it loops over the strings randomly
 and counts the number of 'a's. So this test does not measure transcoding
 speed - it just measures allocation and simple access speed across a large
 number of long strings. My times are
  char: .36 seconds
  wchar: .78
  dchar: 1.3
 
 I tried to make sure the dchar version didn't run into swap space. If I
 scale up or down the size of the problem the factor of 2 generally remains
 until the times get so small that it doesn't matter. It looks like a lot
 of the time is taken in initializing the strings, since when I modify
 the test to only time access performance the scale factor goes down from 2
 to something under 1.5 or so.
 
 I built it using
  dmd charspeed.d -O -inline -release
 
 It would be nice to try other tests that involve searching for sub-strings
 (like multi-byte unicode strings) but I haven't done that. 

FWIW, the Java 'char' is a 16 bit value due to the unicode standards.
The idea of course, is that internally to the program all strings are
encoded the same and translated on IO.

I would venture to say there are two observations about Java strings:

1) In most applications they are fast enough to handle a lot of work.

2) It can become a bottleneck when excessive string concatenation is
    happening, or logging is overdone.

As far as D is concerned, it should be expected that a same sized array
would take longer if the difference between arrays is the element size.
In this case 1 byte, 2 bytes, 4 bytes.  I don't know much about
allocation routines, but I imagine getting something to run at constant
cost is either beyond the realm of possibility or it is just way too
difficult to be practical to pursue.

IMO, the benefits of using a standard unicode encoding within the D
program outweigh the costs of allocating the arrays.  Anytime we make
it easier to provide internationalization (i18n), it is a step toward
greater acceptance.  To assume the world only works with European (or
American) character sets is to stick your head in the sand.

Honestly, I would prefer something that made internationalized strings
easier to manage than more difficult.  If there are no multi-char
graphemes (i.e. takes up more than one code space) then that would be
the easiest to work with and write libraries for.

Aug 27 2004

"Ben Hinkle" <bhinkle mathworks.com> writes:

"Berin Loritsch" <bloritsch d-haven.org> wrote in message
news:cgnc25$1l1f$1 digitaldaemon.com...
 Ben Hinkle wrote:
 All the threads about char vs wchar got me thinking about char performance,
 so I thought I'd test out the difference. I've attached a simple test that
 allocates a bunch of strings (of type char[] or wchar[]) of random length
 each with one random 'a' in them. Then it loops over the strings randomly
 and counts the number of 'a's. So this test does not measure transcoding
 speed - it just measures allocation and simple access speed across a large
 number of long strings. My times are
  char: .36 seconds
  wchar: .78
  dchar: 1.3

 I tried to make sure the dchar version didn't run into swap space. If I
 scale up or down the size of the problem the factor of 2 generally remains
 until the times get so small that it doesn't matter. It looks like a lot
 of the time is taken in initializing the strings, since when I modify
 the test to only time access performance the scale factor goes down from 2
 to something under 1.5 or so.

 I built it using
  dmd charspeed.d -O -inline -release

 It would be nice to try other tests that involve searching for sub-strings
 (like multi-byte unicode strings) but I haven't done that.

 FWIW, the Java 'char' is a 16 bit value due to the unicode standards.
 The idea of course, is that internally to the program all strings are
 encoded the same and translated on IO.

 I would venture to say there are two observations about Java strings:

 1) In most applications they are fast enough to handle a lot of work.

 2) It can become a bottleneck when excessive string concatenation is
     happening, or logging is overdone.

I wonder what Java strings in utf8 would be like... I wonder if anyone has
tried that out.

 As far as D is concerned, it should be expected that a same sized array
 would take longer if the difference between arrays is the element size.
 In this case 1 byte, 2 bytes, 4 bytes.  I don't know much about
 allocation routines, but I imagine getting something to run at constant
 cost is either beyond the realm of possibility or it is just way too
 difficult to be practical to pursue.

agreed. There probably isn't any way to speed up the initialization much.

 IMO, the benefits of using a standard unicode encoding within the D
 program outweigh the costs of allocating the arrays.  Anytime we make
 it easier to provide internationalization (i18n), it is a step toward
 greater acceptance.  To assume the world only works with European (or
 American) character sets is to stick your head in the sand.

agreed. utf8 and utf16 are both unicode standards that require multi-byte
handling to cover all of unicode so in terms of ease of use they shouldn't
be any different.

 Honestly, I would prefer something that made internationalized strings
 easier to manage than more difficult.  If there are no multi-char
 graphemes (i.e. takes up more than one code space) then that would be
 the easiest to work with and write libraries for.

dchar would be the choice for ease of use but as you can see performance
goes downhill significantly (at least for the naive test I ran). To me the
performance of dchar is too poor to make it the standard and the ease-of-use
of utf8 and utf16 are essentially equivalent so since utf8 has the best
performance it should be the default. Hence my attempt to measure which is
faster for typical usage: char or wchar?

-Ben

Aug 27 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cgner6$1mh4$1 digitaldaemon.com>, Ben Hinkle says...

dchar would be the choice for ease of use

Actually, wchar and dchar are pretty much the same in terms of ease of use. In
almost all cases, you can just /pretend/ that UTF-16 is not a multi-word
encoding, and just write your code as if one wchar == one character. Except in
very specialist cases, you'll be correct. Even if you're using characters beyond
U+FFFF, regarding them as two characters instead of one is likely to be
harmless.

Of course, there are applications for which you can't do this. Font rendering
algorithms, for example, /must/ be able to distinguish true character
boundaries. But even then, the cost of rendering greatly outweighs the (almost
insignificant) cost of determining true character boundaries - a trivial task in
UTF-16.
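
To make the "trivial task" concrete: in UTF-16, a code point outside the BMP is stored as a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF), so finding true boundaries is a pair of range checks. The following is a minimal Python sketch of that check; the function names utf16_units and count_code_points are made up for illustration.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def count_code_points(units):
    """Count code points in a UTF-16 sequence by skipping over surrogate pairs."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A well-formed pair occupies two units; everything else occupies one
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


if __name__ == "__main__":
    units = utf16_units("a\U0001d11e")  # 'a' plus U+1D11E (musical G clef)
    print([hex(u) for u in units], count_code_points(units))
```

Code that ignores this and treats each unit as a character simply sees U+1D11E as two units, which is the "pretend" shortcut described above.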

char, by contrast, is much more difficult to use, because you /cannot/ "pretend"
it's a single-byte encoding, because there are too many characters beyond U+00FF
which applications need to "understand".



but as you can see performance
goes downhill significantly (at least for the naive test I ran). To me the
performance of dchar is too poor to make it the standard

Agreed.


and the ease-of-use
of utf8 and utf16 are essentially equivalent

Not really. The ICU API is geared around UTF-16, so if you use wchar[]s, you can
call ICU functions directly. With char[]s, there's a bit more faffing, so UTF-8
becomes less easy to use in this regard.

And don't forget - UTF-8 is a complex algorithm, while UTF-16 is a trivially
simple algorithm. As an application programmer, you may be shielded from the
complexity of UTF-8 by library implementation and implicit conversion, but it's
all going on under the hood. UTF-8's "ease of use" vanishes as soon as you
actually have to implement it.



so since utf8 has the best
performance it should be the default.

But it doesn't. Your tests were unfair. A UTF-16 array will not, in general,
require twice as many bytes as a UTF-8 array, which is what you seemed to
assume. That will only be true if the string is pure ASCII, but not otherwise,
and for the majority of characters the performance will be worse. Check out this
table:

Codepoint            Number of bytes
range           UTF-8    UTF-16   UTF-32
----------------------------------------
0000 to 007F    1        2        4
0080 to 07FF    2        2        4
0800 to FFFF    3        2        4    <----- who wins on this row?
10000+          4        4        4
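
The rows of this table can be checked by encoding one sample character from each range. This is a small Python sketch; the helper name encoded_sizes and the choice of sample characters are assumptions for illustration.

```python
def encoded_sizes(ch):
    """Return (UTF-8, UTF-16, UTF-32) byte counts for a single character."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # explicit endianness avoids a BOM
        len(ch.encode("utf-32-le")),
    )


if __name__ == "__main__":
    samples = [
        ("0000 to 007F", "A"),           # U+0041
        ("0080 to 07FF", "\u00e9"),      # U+00E9, e with acute accent
        ("0800 to FFFF", "\u20ac"),      # U+20AC, euro sign
        ("10000+", "\U0001d11e"),        # U+1D11E, musical G clef
    ]
    for label, ch in samples:
        print(label, encoded_sizes(ch))
```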



Hence my attempt to measure which is
faster for typical usage: char or wchar?

You could try measuring it again, but this time without the assumption that all
characters are ASCII.

Arcane Jill

Aug 27 2004

"Ben Hinkle" <bhinkle mathworks.com> writes:

"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgngtc$1nep$1 digitaldaemon.com...
 In article <cgner6$1mh4$1 digitaldaemon.com>, Ben Hinkle says...

dchar would be the choice for ease of use

 Actually, wchar and dchar are pretty much the same in terms of ease of use. In
 almost all cases, you can just /pretend/ that UTF-16 is not a multi-word
 encoding, and just write your code as if one wchar == one character. Except in
 very specialist cases, you'll be correct. Even if you're using characters beyond
 U+FFFF, regarding them as two characters instead of one is likely to be
 harmless.

 Of course, there are applications for which you can't do this. Font rendering
 algorithms, for example, /must/ be able to distinguish true character
 boundaries. But even then, the cost of rendering greatly outweighs the (almost
 insignificant) cost of determining true character boundaries - a trivial task in
 UTF-16.

 char, by contrast, is much more difficult to use, because you /cannot/ "pretend"
 it's a single-byte encoding, because there are too many characters beyond U+00FF
 which applications need to "understand".

true - shortcuts can be taken if the application doesn't have to support all
of unicode.

but as you can see performance
goes downhill significantly (at least for the naive test I ran). To me the
performance of dchar is too poor to make it the standard

 Agreed.


and the ease-of-use
of utf8 and utf16 are essentially equivalent

 Not really. The ICU API is geared around UTF-16, so if you use wchar[]s, you can
 call ICU functions directly. With char[]s, there's a bit more faffing, so UTF-8
 becomes less easy to use in this regard.

I'm not sure what faffing is but yeah I agree - I just meant the ease-of-use
without worrying about library APIs.

 And don't forget - UTF-8 is a complex algorithm, while UTF-16 is a trivially
 simple algorithm. As an application programmer, you may be shielded from the
 complexity of UTF-8 by library implementation and implicit conversion, but it's
 all going on under the hood. UTF-8's "ease of use" vanishes as soon as you
 actually have to implement it.

I meant ease-of-use for end-users. Calling toUTF8 is just as easy as calling
toUTF16. Personally, as long as it is correct, the underlying library
complexity doesn't affect me. Performance of the application becomes the key
factor.

so since utf8 has the best
performance it should be the default.

 But it doesn't. Your tests were unfair. A UTF-16 array will not, in general,
 require twice as many bytes as a UTF-8 array, which is what you seemed to
 assume. That will only be true if the string is pure ASCII, but not otherwise,
 and for the majority of characters the performance will be worse. Check out this
 table:

 Codepoint            Number of bytes
 range           UTF-8    UTF-16   UTF-32
 ----------------------------------------
 0000 to 007F    1        2        4
 0080 to 07FF    2        2        4
 0800 to FFFF    3        2        4    <----- who wins on this row?
 10000+          4        4        4

I was going to start digging into non-ascii next. I remember reading
somewhere that encoding asian languages in utf8 typically results in longer
strings than utf16. That will definitely hamper utf8.

Hence my attempt to measure which is
faster for typical usage: char or wchar?

 You could try measuring it again, but this time without the assumption that all
 characters are ASCII.

 Arcane Jill

Aug 27 2004

Berin Loritsch <bloritsch d-haven.org> writes:

Ben Hinkle wrote:

 
 I was going to start digging into non-ascii next. I remember reading
 somewhere that encoding asian languages in utf8 typically results in longer
 strings than utf16. That will definitely hamper utf8.

Either that or hamper Japanese coders :)

If it comes out to a performance draw when dealing with non-ascii text,
then might I suggest using programming ease (for library writers as
well) to be the tie breaker?

Aug 27 2004

"Walter" <newshound digitalmars.com> writes:

"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgngtc$1nep$1 digitaldaemon.com...
 But it doesn't. Your tests were unfair. A UTF-16 array will not, in general,
 require twice as many bytes as a UTF-8 array, which is what you seemed to
 assume. That will only be true if the string is pure ASCII, but not otherwise,
 and for the majority of characters the performance will be worse.

The majority of characters are multibyte in UTF-8, that is true. But the
distribution of characters is what matters for speed in real apps, and for
those, the vast majority will be ASCII. Furthermore, many operations on
strings can treat UTF-8 as if it were single byte, such as copying, sorting,
and searching.

Of course, there are still many situations where UTF-8 is not ideal, which
is why D tries to be agnostic about whether the application programmer wants
to use char[], wchar[], or dchar[] or any combination of the three.

Aug 27 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cgobse$237t$1 digitaldaemon.com>, Walter says...

The majority of characters [within strings] are multibyte in UTF-8,
that is true. But the [frequency]
distribution of characters is what matters for speed in real apps, and for
those, the vast majority will be ASCII.

Are you sure? That's one helluva claim to make.

I've noticed that D seems to generate a lot of interest from Japan (judging from
the existence of Japanese web sites). Of course, Japanese strings average 1.5
times as long in UTF-8 as those same strings would have been in UTF-16.
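
The 1.5 figure follows from kana and most kanji taking 3 bytes in UTF-8 and 2 bytes in UTF-16. A quick way to see the ratio for any given string is sketched below in Python; the helper name utf8_to_utf16_ratio and the sample strings are assumptions for illustration, and real text mixing in ASCII digits or markup will land somewhere between 0.5 and 1.5.

```python
def utf8_to_utf16_ratio(text):
    """Return UTF-8 byte length divided by UTF-16 byte length for `text`."""
    utf8_len = len(text.encode("utf-8"))
    utf16_len = len(text.encode("utf-16-le"))  # explicit endianness, no BOM
    return utf8_len / utf16_len


if __name__ == "__main__":
    japanese = "\u65e5\u672c\u8a9e"  # three kanji: 9 bytes in UTF-8, 6 in UTF-16
    print(utf8_to_utf16_ratio(japanese))
    print(utf8_to_utf16_ratio("hello"))  # pure ASCII: 5 bytes versus 10
```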

The whole "Most characters are ASCII" dogma is really only true if you happen to
live in a certain part of the world, and to bias a computer language because of
that assumption hurts performance for everyone else. /Please/ reconsider.



Furthermore, many operations on
strings can treat UTF-8 as if it were single byte, such as copying, sorting,
and searching.

Copying, yes. But of course you miss the point that this would be just as true
in UTF-16 as it is in UTF-8.

Sorting? - Lexicographical sorting, maybe, but the only reason you can get away
with that is because, in ASCII-only parts of the world, codepoint order happens
to correspond to the order we find letters in the alphabet, and even then only
if we're prepared to compromise on case ("Foo" sorts before "bar"). Stick an
acute accent over one of the vowels and lexicographical sort order goes out the
window. Lexicographical sorting may be good for purely mechanical things
like eliminating duplicates in an AA, but if someone wants to look up all
entries between "Alice" and "Bob" in a database, I think they would be very
surprised to find that "Áaron" was not in the list. (And again, you miss the
point that if lexicographical sorting /is/ what you want, it works just as well
in UTF-16 as it does in UTF-8). /Real/ sorting, however, requires full
understanding of Unicode, and for that, ASCII is just not good enough.

Searching? If you treat "UTF-8 as if it were single byte", there are an /awful
lot/ of characters you can't search for, including the British Pound (currency)
sign, the Euro currency sign and anything with an accent over it. And searching
for the single byte 0x81 (for example) is not exactly useful.



Of course, there are still many situations where UTF-8 is not ideal, which
is why D tries to be agnostic about whether the application programmer wants
to use char[], wchar[], or dchar[] or any combination of the three.

Yes, agnostic is good. No problem there. I'm only talking about the /default/. I
thought you were all /for/ internationalization and Unicode and all that? I'm
surprised to find myself arguing with you on this one. (Okay, I didn't really
expect you to ditch the char, but to prefer wchar[] over char[] is
/reasonable/). I have given many, many, many reasons why I think that wchar[]
strings should be the default in D, and if I can't convince you, I think that
would be a big shame. 

Arcane Jill

Aug 27 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cgp7c3$2e36$1 digitaldaemon.com>, Arcane Jill says...

if someone wants to look up all
entries between "Alice" and "Bob" in a database, I think they would be very
surprised to find that "Áaron" was not in the list.

Whoops - dumb brain fart! Please pretend I didn't say that.

The reasoning is still sound - it's just my conception of alphabetical order
that's up the spout.

Jill

Aug 27 2004

"Walter" <newshound digitalmars.com> writes:

"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgp7c3$2e36$1 digitaldaemon.com...
 But it doesn't. Your tests were unfair. A UTF-16 array will not, in general,
 require twice as many bytes as a UTF-8 array, which is what you seemed to
 assume. That will only be true if the string is pure ASCII, but not otherwise,
 and for the majority of characters the performance will be worse.

 In article <cgobse$237t$1 digitaldaemon.com>, Walter says...
The majority of characters [within strings] are multibyte in UTF-8,
that is true. But the [frequency]
distribution of characters is what matters for speed in real apps, and for
those, the vast majority will be ASCII.

 Are you sure? That's one helluva claim to make.

Yes. Nearly all the spam I get is in ascii <g>.

When optimizing for speed, the first rule is optimize for what the bulk of
the data will likely consist of for your application. For example, if you're
writing a user interface for Chinese people, you'd be sensible to consider
using dchar[] throughout.

It probably makes sense to use wchar[] for the unicode library you're
developing because programmers who have a need for such a library will most
likely NOT be writing applications for ascii.

 I've noticed that D seems to generate a lot of interest from Japan (judging from
 the existence of Japanese web sites). Of course, Japanese strings average 1.5
 times as long in UTF-8 as those same strings would have been in UTF-16.

If I was building a Japanese word processor, I certainly wouldn't use UTF-8
internally in it for that reason.


 The whole "Most characters are ASCII" dogma is really only true if you happen to
 live in a certain part of the world, and to bias a computer language because of
 that assumption hurts performance for everyone else. /Please/ reconsider.

If everything is optimized for Japanese, it will hurt performance for ASCII
users. The point is, there is no UTF encoding that is optimal for everyone.
That's why D supports all three.

Furthermore, many operations on
strings can treat UTF-8 as if it were single byte, such as copying, sorting,
and searching.

 Copying, yes. But of course you miss the point that this would be just as true
 in UTF-16 as it is in UTF-8.

Of course. My point was that quite a few common string operations do not
require decoding. For example, the D compiler processes source as UTF-8. It
almost never has to do any decoding. The performance penalty for supporting
multibyte encodings in D source is essentially zero.

 Sorting? - Lexicographical sorting, maybe, but the only reason you can get away
 with that is because, in ASCII-only parts of the world, codepoint order happens
 to correspond to the order we find letters in the alphabet, and even then only
 if we're prepared to compromise on case ("Foo" sorts before "bar"). Stick an
 acute accent over one of the vowels and lexicographical sort order goes out the
 window. Lexicographical sorting may be good for purely mechanical things
 like eliminating duplicates in an AA, but if someone wants to look up all
 entries between "Alice" and "Bob" in a database, I think they would be very
 surprised to find that "Áaron" was not in the list. (And again, you miss the
 point that if lexicographical sorting /is/ what you want, it works just as well
 in UTF-16 as it does in UTF-8).

Sure - and my point is it wasn't necessary to decode UTF-8 to do that sort.
It's not necessary for hashing the string, either.
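
The reason no decoding is needed is that UTF-8 was designed so that comparing encoded strings byte by byte gives the same order as comparing them code point by code point. A minimal Python sketch of that property follows; the function name sort_by_utf8_bytes and the sample words are assumptions for illustration.

```python
def sort_by_utf8_bytes(words):
    """Sort strings by comparing their raw UTF-8 bytes, with no decoding step."""
    encoded = sorted(w.encode("utf-8") for w in words)
    # Decode only to return readable strings; the ordering used bytes alone
    return [b.decode("utf-8") for b in encoded]


if __name__ == "__main__":
    words = ["zebra", "\u00e9clair", "apple", "\u20ac5", "\U0001d11e", "Zulu"]
    by_bytes = sort_by_utf8_bytes(words)
    by_code_point = sorted(words)  # Python compares str values by code point
    print(by_bytes == by_code_point)
```

This is code point order, not dictionary order: "Zulu" still sorts before "apple", which is the case-sensitivity compromise mentioned in the quoted text.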

 /Real/ sorting, however, requires full
 understanding of Unicode, and for that, ASCII is just not good enough.

There are many different ways to sort, and since the unicode characters are
not always ordered the obvious way, you have to deal with that specially in
each of UTF-8, -16, and -32.

 Searching? If you treat "UTF-8 as if it were single byte", there are an /awful
 lot/ of characters you can't search for, including the British Pound (currency)
 sign, the Euro currency sign and anything with an accent over it. And searching
 for the single byte 0x81 (for example) is not exactly useful.

That's why std.string.find() takes a dchar as its search argument. What you
do is treat it as a substring search. There are a lot of very fast
algorithms for doing such searches, such as Boyer-Moore, which get pretty
close to the performance of a single character search. Furthermore, I'd
optimize it so the first thing the search did was check if the search
character was ascii. If so, it'd do the single character scan. Otherwise,
it'd do the substring search.
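
A rough sketch of that strategy, written in Python rather than D and not taken from Phobos: the function name find_code_point is hypothetical, and Python's built-in bytes.find stands in for whatever substring algorithm a real implementation would pick. Because UTF-8 lead bytes and continuation bytes occupy disjoint ranges, a match on the encoded bytes can only start at a real character boundary.

```python
def find_code_point(haystack, code_point):
    """Return the byte offset of `code_point` in UTF-8 `haystack`, or -1."""
    if code_point < 0x80:
        # ASCII fast path: a single-byte scan is enough
        return haystack.find(bytes([code_point]))
    # Otherwise encode the character once and search for it as a substring
    needle = chr(code_point).encode("utf-8")
    return haystack.find(needle)


if __name__ == "__main__":
    text = "price: \u00a3 5 or \u20ac 6".encode("utf-8")
    print(find_code_point(text, ord("o")))   # ASCII letter
    print(find_code_point(text, 0x00A3))     # pound sign, 2 bytes in UTF-8
    print(find_code_point(text, 0x20AC))     # euro sign, 3 bytes in UTF-8
```

The offsets returned are byte offsets into the UTF-8 data, not character counts.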

Of course, there are still many situations where UTF-8 is not ideal, which
is why D tries to be agnostic about whether the application programmer wants
to use char[], wchar[], or dchar[] or any combination of the three.

 Yes, agnostic is good. No problem there. I'm only talking about the /default/. I
 thought you were all /for/ internationalization and Unicode and all that?

But D does not have a default. Programmers can use the encoding which is
optimal for the data they expect to see. Even if UTF-8 were the default,
UTF-8 still supports full internationalization and Unicode. I am certainly
not talking about supporting only ASCII or having ASCII as the default.

 I'm surprised to find myself arguing with you on this one. (Okay, I didn't really
 expect you to ditch the char, but to prefer wchar[] over char[] is
 /reasonable/). I have given many, many, many reasons why I think that wchar[]
 strings should be the default in D, and if I can't convince you, I think that
 would be a big shame.

My experience with using UTF-16 throughout a program is that it sped up quite
a bit when converted to UTF-8. There is no blanket advantage to UTF-16; it
depends on your expected data. When your expected data will be mostly ASCII,
then UTF-8 is the reasonable choice.

Aug 28 2004

Berin Loritsch <bloritsch d-haven.org> writes:

Walter wrote:

 
 But D does not have a default. Programmers can use the encoding which is
 optimal for the data they expect to see. Even if UTF-8 were the default,
 UTF-8 still supports full internationalization and Unicode. I am certainly
 not talking about supporting only ASCII or having ASCII as the default.
 

Umm, what about the toString() function?  Doesn't that assume char[]?
Hence, it is the default by example.

I'll be honest, I don't get why optimization is so important when a need
for it hasn't been determined yet.  I am sure there can be quicker ways
of dealing with allocation and de-allocation--this would make the system
faster for all objects, not just strings.  If that can be done, why not
concentrate on that?

More advanced memory utilization can mean better overall performance,
and reduce the cost of one type of string over another.  Heck, if a
page of memory is being allocated for string storage (multiple strings
mind you), what about a really fast bit blit for the whole page?  That
would make the strings default to initialization state and speed things
up.  Ideally, the difference between a char[] and a dchar[] would be
how much of that page is allocated.

Aug 30 2004

"Walter" <newshound digitalmars.com> writes:

"Berin Loritsch" <bloritsch d-haven.org> wrote in message
news:cgv92r$2fvv$1 digitaldaemon.com...
 Umm, what about the toString() function?  Doesn't that assume char[]?
 Hence, it is the default by example.

Yes, but it isn't char(!)acteristic of D.

 I'll be honest, I don't get why optimization is so important when a need
 for it hasn't been determined yet.

Efficiency, or at least potential efficiency, has always been a strong
attraction that programmers have to C/C++. Since D is targeted at that
market, efficiency will be a major consideration. If D acquires an early
reputation for being "slow", like Java did, that reputation can be very,
very hard to shake.

 I am sure there can be quicker ways
 of dealing with allocation and de-allocation--this would make the system
 faster for all objects, not just strings.  If that can be done, why not
 concentrate on that?

There's no way to just wipe away the costs of using double the storage.

 More advanced memory utilization can mean better overall performance,
 and reduce the cost of one type of string over another.  Heck, if a
 page of memory is being allocated for string storage (multiple strings
 mind you), what about a really fast bit blit for the whole page?  That
 would make the strings default to initialization state and speed things
 up.  Ideally, the difference between a char[] and a dchar[] would be
 how much of that page is allocated.

Aug 30 2004

"Roald Ribe" <rr.no spam.teikom.no> writes:

"Walter" <newshound digitalmars.com> wrote in message
news:ch048q$2uql$1 digitaldaemon.com...
 "Berin Loritsch" <bloritsch d-haven.org> wrote in message
 news:cgv92r$2fvv$1 digitaldaemon.com...
 Umm, what about the toString() function?  Doesn't that assume char[]?
 Hence, it is the default by example.

 Yes, but it isn't char(!)acteristic of D.

 I'll be honest, I don't get why optimization is so important when a need
 for it hasn't been determined yet.

 Efficiency, or at least potential efficiency, has always been a strong
 attraction that programmers have to C/C++. Since D is targeted at that
 market, efficiency will be a major consideration. If D acquires an early
 reputation for being "slow", like Java did, that reputation can be very,
 very hard to shake.

 I am sure there can be quicker ways
 of dealing with allocation and de-allocation--this would make the system
 faster for all objects, not just strings.  If that can be done, why not
 concentrate on that?

 There's no way to just wipe away the costs of using double the storage.

But this claim holds true only for those who have English as their only
working language, and (maybe) for a few others in Europe. In all other
markets (5 billion+) the utf8 storage will in fact (mostly) be _larger_
than the utf16 storage.
And as I proposed earlier, you could leave an option for
English/Europeans in the form of char defined as in C/C++, which would,
in addition to being just as fast, actually make the transition to the
D language easier.
I think it all comes down to this: will D become a general purpose language
for the international community or will it mostly become a "better" C++ for
English speakers only?

Roald

Sep 02 2004

Berin Loritsch <bloritsch d-haven.org> writes:

Ben Hinkle wrote:

 "Berin Loritsch" <bloritsch d-haven.org> wrote in message
 news:cgnc25$1l1f$1 digitaldaemon.com...
 
FWIW, the Java 'char' is a 16 bit value due to the unicode standards.
The idea of course, is that internally to the program all strings are
encoded the same and translated on IO.

I would venture to say there are two observations about Java strings:

1) In most applications they are fast enough to handle a lot of work.

2) It can become a bottleneck when excessive string concatenation is
    happening, or logging is overdone.

 
 
 I wonder what Java strings in utf8 would be like... I wonder if anyone has
 tried that out.

Internally there is no such thing.  It's just easier to deal with that
way.  The translation happens with encoders and decoders on IO.

Honestly, I would prefer something that made internationalized strings
easier to manage rather than more difficult.  If there are no multi-char
graphemes (i.e. none that take up more than one code space) then that would be
the easiest to work with and write libraries for.

 
 dchar would be the choice for ease of use but as you can see performance
 goes downhill significantly (at least for the naive test I ran). To me the
 performance of dchar is too poor to make it the standard and the ease-of-use
 of utf8 and utf16 are essentially equivalent so since utf8 has the best
 performance it should be the default. Hence my attempt to measure which is
 faster for typical usage: char or wchar?

Yea, but I've learned not to get hung up on strict performance.  There
is a difference between ultimately fast and fast enough.  Sometimes to
squeeze those extra cycles out we can cause more programming issues
than needs be.

If the allocation routines could be done faster (big assumption here),
would it be preferable to use a dchar if it is fast enough?

For the record, I believe Java uses UTF-16 internally, which means for
most things there is less of a need to worry about multi-byte characters.

The interesting test would be to have strings of the same length, and
then test the algorithm to get a substring from that string.  For
example, a fixed string that uses some multi-codespace characters here
and there, and then getting the 15th through 20th characters of that
string.
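
The substring test described above could be sketched roughly as follows.
This is Python rather than D, used only because it makes the byte-level
behaviour easy to check; the function names and the sample text are made up
for illustration. What it tries to show: with UTF-8 you have to find the
character boundaries before you can slice by character position, while a
fixed-width encoding gets there with index arithmetic. UTF-16 sits in
between, since it only needs the scan when surrogate pairs are present.

```python
def utf8_char_slice(data: bytes, start: int, stop: int) -> bytes:
    """Return the bytes for characters [start, stop) of UTF-8 `data`.

    Character positions cannot be computed directly, so the bytes are
    examined to find where characters begin (this simple version scans
    the whole buffer rather than stopping early).
    """
    # A byte starts a character unless it is a continuation byte (10xxxxxx).
    offsets = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    offsets.append(len(data))  # sentinel: end of the last character
    return data[offsets[start]:offsets[stop]]


def utf32_char_slice(data: bytes, start: int, stop: int) -> bytes:
    """Same slice for UTF-32: fixed width, so plain index arithmetic."""
    return data[4 * start:4 * stop]


text = "naïve café, 日本語 and more plain text"

# 15th through 20th characters, i.e. zero-based positions 14..19.
from_utf8 = utf8_char_slice(text.encode("utf-8"), 14, 20).decode("utf-8")
from_utf32 = utf32_char_slice(text.encode("utf-32-le"), 14, 20).decode("utf-32-le")
print(from_utf8 == from_utf32 == text[14:20])
```

Timing the two functions on longer strings would give one data point for
the comparison, though the result would say more about Python than about D.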

This will show just how things might come out in the wash when we are
dealing with that type of issue.  For example, it is not uncommon to
have the system read in a block of text from a file into memory (say
4k worth at a time), and then just iterate one line at a time.  Which
gives us the substring scenario.

Alternatively there is the regex algorithm that would need to account
for the multi-codespace characters that will get a performance hit as
well.

There is a lot more to an overall more performant system than just
string allocation--even though we are aware that it can be a significant
cost.

Aug 27 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cgn92o$1jj4$1 digitaldaemon.com>, Ben Hinkle says...

 char: .36 seconds
 wchar: .78
 dchar: 1.3

Yeah, I forgot about allocation time. Of course, D initializes all arrays, no
matter whence they are allocated. char[]s will be filled with all FFs, and
wchar[]s will be filled with all FFFFs. Twice as many bytes = twice as many
bytes to initialize. Damn!

A super-fast character array allocator would make a lot of difference here.
There are probably many different ways of doing this very fast. I guess this has
to happen within DMD.

Aug 27 2004

"Walter" <newshound digitalmars.com> writes:

"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgncho$1la0$1 digitaldaemon.com...
 In article <cgn92o$1jj4$1 digitaldaemon.com>, Ben Hinkle says...

 char: .36 seconds
 wchar: .78
 dchar: 1.3

 Yeah, I forgot about allocation time. Of course, D initializes all arrays, no
 matter whence they are allocated. char[]s will be filled with all FFs, and
 wchar[]s will be filled with all FFFFs. Twice as many bytes = twice as many
 bytes to initialize. Damn!

There are also twice as many bytes to scan for the gc, and half the data
until your machine starts thrashing the swap disk. The latter is a very real
issue for server apps, since it means that you reach the point of having to
double the hardware in half the time.

Aug 27 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cgobse$237t$2 digitaldaemon.com>, Walter says...

There are also twice as many bytes to scan for the gc, and half the data
until your machine starts thrashing the swap disk. The latter is a very real
issue for server apps, since it means that you reach the point of having to
double the hardware in half the time.

There you go again, assuming that wchar[] strings are double the length of
char[] strings. THIS IS NOT TRUE IN GENERAL. In Chinese, wchar[] strings are
shorter than char[] strings. In Japanese, wchar[] strings are shorter than
char[] strings. In Mongolian, wchar[] strings are shorter than char[] strings.
In Tibetan, wchar[] strings are shorter than char[] strings. I assume I don't
need to go on...?
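
A quick way to check this is to count code units directly. The sketch
below is Python, not D, and the sample words are chosen here purely for
illustration; "char", "wchar" and "dchar" stand for the number of elements
the same text would need in a char[], wchar[] or dchar[].

```python
def code_units(text: str) -> dict:
    """Count the elements `text` needs as D's char[], wchar[] and dchar[]."""
    return {
        "char": len(text.encode("utf-8")),            # 1 byte per element
        "wchar": len(text.encode("utf-16-le")) // 2,  # 2 bytes per element
        "dchar": len(text),                           # 4 bytes per element
    }


samples = {"English": "character", "Chinese": "字符", "Japanese": "文字列"}
for language, word in samples.items():
    units = code_units(word)
    byte_sizes = (units["char"], 2 * units["wchar"], 4 * units["dchar"])
    print(language, units, "bytes (char, wchar, dchar):", byte_sizes)
```

For the CJK samples the wchar[] form has a third as many elements as the
char[] form and takes two thirds of the bytes; for the English sample the
element counts are equal and the wchar[] form takes twice the bytes.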

<sarcasm>But I guess server apps never have to deliver text in those
languages.</sarcasm>

Walter, servers are one of the places where internationalization matters most. XML
and HTML documents, for example, could be (a) stored and (b) requested in any
encodings whatsoever. A server would have to push them through a transcoding
function. For this, wchar[]s are more sensible.

I don't understand the basis of your determination. It seems ill-founded.
Jill

Aug 27 2004

"Walter" <newshound digitalmars.com> writes:

"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgp845$2ea2$1 digitaldaemon.com...
 In article <cgobse$237t$2 digitaldaemon.com>, Walter says...
There are also twice as many bytes to scan for the gc, and half the data
until your machine starts thrashing the swap disk. The latter is a very real
issue for server apps, since it means that you reach the point of having to
double the hardware in half the time.

 There you go again, assuming that wchar[] strings are double the length of
 char[] strings. THIS IS NOT TRUE IN GENERAL.

Are you sure? Even European languages are mostly ASCII.

 In Chinese, wchar[] strings are
 shorter than char[] strings. In Japanese, wchar[] strings are shorter than
 char[] strings. In Mongolian, wchar[] strings are shorter than char[] strings.
 In Tibetan, wchar[] strings are shorter than char[] strings. I assume I don't
 need to go on...?

 <sarcasm>But I guess server apps never have to deliver text in those
 languages.</sarcasm>

Never is not the right word here. The right idea is what is the frequency
distribution of the various types of data one's app will see. Once you know
that, you optimize for the most common cases. Tibetan is still fully
supported regardless.

 Walter, servers are one of the places where internationalization matters most. XML
 and HTML documents, for example, could be (a) stored and (b) requested in any
 encodings whatsoever.

Of course. But what are the frequencies of the requests for various
encodings? Each of the 3 UTF encodings fully supports Unicode and is fully
internationalized. Which one you pick depends on the frequency distribution
of your data.

 A server would have to push them through a transcoding
 function. For this, wchar[]s are more sensible.

It is not optimal unless the person optimizing the server app has
instrumented his data so he knows the frequency distribution of the various
characters. Only then can he select the encoding that will deliver the best
performance.

 I don't understand the basis of your determination. It seems ill-founded.

Experience optimizing apps. One of the most potent tools for optimization is
analyzing the data patterns, and making the most common cases take the
shortest path through the code. UTF-16 is not optimal for a great many
applications - and I have experience with it.
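
As a rough illustration of what "analyzing the data patterns" could mean
for this choice, the sketch below tallies a sample of text by how many
bytes each character needs in UTF-8 and totals the storage for UTF-8 and
UTF-16. It is Python, not D; the function names and the idea of feeding it
a list of sample strings are assumptions made for the example, not a
description of any real profiling tool.

```python
from collections import Counter


def utf8_width(cp: int) -> int:
    """Bytes UTF-8 needs for code point `cp`."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def profile(corpus):
    """Tally code points by UTF-8 width and total the bytes per encoding."""
    widths = Counter()
    for text in corpus:
        widths.update(utf8_width(ord(ch)) for ch in text)
    utf8 = sum(width * count for width, count in widths.items())
    # UTF-16: 2 bytes inside the Basic Multilingual Plane, 4 bytes outside it.
    utf16 = 2 * (widths[1] + widths[2] + widths[3]) + 4 * widths[4]
    return widths, utf8, utf16


widths, utf8, utf16 = profile(["plain ASCII log line", "año", "日本語"])
print(dict(widths), "UTF-8 bytes:", utf8, "UTF-16 bytes:", utf16)
```

Per character the difference is -1 byte for ASCII, 0 for the 2-byte and
4-byte classes, and +1 for the 3-byte class, so by size alone UTF-8 wins
exactly when ASCII characters outnumber 3-byte characters in the data.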

Aug 28 2004

Berin Loritsch <bloritsch d-haven.org> writes:

Walter wrote:

 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cgp845$2ea2$1 digitaldaemon.com...
 
 Are you sure? Even European languages are mostly ASCII.

But not completely.  There is the euro symbol (I dare say would
be quite common).  In Spanish the enye (n with ~ on top, can't
really do that well in Windows) is fairly common, and important.
There is a big difference between an anus and a year, but the
only difference in Spanish is n vs. enye.  Not to mention all
those words that use an accent to mark an abnormally stressed
syllable.

Then we get to French, which uses the circumflex, accents, and
accent grave.  Oh, then there's German which uses those two little
dots a lot.  And I haven't even touched on Greek or Russian, both
European languages.

You can only make that assumption about English speaking countries.
Yes almost everyone is exposed to English in some way, and it is
the current "lingua franca" (language of business, like French
used to be--hence the term).  The bottom line is that there are
sufficient exceptions to your "rule" that it would be a shame to
assume the world was America and Great Britain.

Aug 30 2004

"Roald Ribe" <rr.no spam.teikom.no> writes:

"Berin Loritsch" <bloritsch d-haven.org> wrote in message
news:cgv9m0$2g71$1 digitaldaemon.com...
 Walter wrote:

 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cgp845$2ea2$1 digitaldaemon.com...

 Are you sure? Even European languages are mostly ASCII.

 But not completely.  There is the euro symbol (I dare say would
 be quite common).  In Spanish the enye (n with ~ on top, can't
 really do that well in Windows) is fairly common, and important.
 There is a big difference between an anus and a year, but the
 only difference in Spanish is n vs. enye.  Not to mention all
 those words that use an accent to mark an abnormally stressed
 syllable.

 Then we get to French, which uses the circumflex, accents, and
 accent grave.  Oh, then there's German which uses those two little
 dots a lot.  And I haven't even touched on Greek or Russian, both
 European languages.

 You can only make that assumption about English speaking countries.
 Yes almost everyone is exposed to English in some way, and it is
 the current "lingua franca" (language of business, like French
 used to be--hence the term).  The bottom line is that there are
 sufficient exceptions to your "rule" that it would be a shame to
 assume the world was America and Great Britain.

Even Britain has a non-ASCII character used quite extensively: the pound sign, £.
Norway/Denmark/Sweden have three non-ASCII characters (used all the
time). The Sami peoples have their own characters (they live in
Norway, Sweden, Russia). Finland, Estonia, Lithuania, Poland,
++ all have their own characters in addition to ASCII. Russia has
its own alphabet! All Latin family languages (French/Spanish/Italian/
Portuguese) have all sorts of special characters (accents forwards/
backwards ++)... And now I have not even gone through HALF of Europe.
In Asia there are wildly different systems, and several systems in use,
_in_ _each_ _country_.

As I have stated before: I agree with Walter's concern for performance.
But where I think there is some disagreement in these discussions is
where to put the effort to "adapt" the environment: on those who
only need ASCII (most of the time), or on all those who would prefer
the language to default to the more general needs of application and
server programmers all over the world. My view is that speed freaks
are used to tuning their tools for best speed, and the general case should
reflect newbies and the 5 billion+ potential non-English-using markets.
Everything else is selling D short, in a shortsighted quest for best
speed by default as one of the language features.

I have a rather radical suggestion, that may make sense, or it may
happen that someone will shoot it down right away because of something
I have not thought of:

1. Remove wchar and dchar from the language.
2. Make char mean 8-bit unsigned byte, containing
   US-ASCII/Latin1/ISO-8859/cp1252, with one character in each byte.
   Null termination is expected. AFAIK all the sets mentioned are compatible
   with each other. Char *may* contain characters from any
   8-bit based encoding, given that either existing conv. table or application
   can convert to/from one of the types below. This type makes for a clean,
   minimum effort port, from C and C++, and interaction with current crop of
   OS and libraries. It also takes care of US/Western Europe speed freaks.
3. New types, utf8, utf16 and utf32 as suggested by others.
4. String based on utf16 as default storage. With overridden storage type like:
   new String(200, utf8)   // 200 bytes
   new String(200, utf16)  // 400 bytes
   new String(200)         // 400 bytes
   new String(200, utf32)  // 800 bytes
   Anyone can use string with the optimal performance for them.
5. String literals in source, default assumed to be utf16 encoded.
   Can be changed by app programmer like:
   c"text"    // char[] 4 bytes
   u"text"    // String() 4 bytes
   w"text"    // String() 8 bytes
   "text"     // String() 8 bytes
   d"text"    // String() 16 bytes

I am open to the fact that I am not at all experienced in language
design, but I hope this may bring the discussion along. I think making
char the same as in C/C++ (but slightly better defined default char set)
and go with entirely different type for the rest is a sound idea.

Roald

Aug 30 2004

Lars Ivar Igesund <larsivar igesund.net> writes:

I couldn't agree more about Walter's ASCII argument. It's way out there
and alienates all of us with non-English first languages (maybe I should
start writing my messages using runes, just like my forefathers...).
If toString is only really useful for debugging anyway, it could just as
well return dchars. I'd rather remove it altogether, though.

Lars Ivar Igesund

Aug 30 2004

"Ben Hinkle" <bhinkle mathworks.com> writes:

Walter did use the word "most". Does anyone know of any studies on the
frequency of non-ASCII chars for different document content and languages?
There must be solid numbers about these things given all the zillions of
electronic documents out there. A quick google for French just dug up a
posting where someone scanned 86 million characters from Swiss-French
news agency reports and got 22M non-accented vowels (aeiou) and 1.8M accented
chars. That's a factor of roughly 12. That seems significant. But I don't
want to read too much into one posting found in a minute of googling - I'm
just curious what the data says.
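
Taking the numbers from that posting at face value, a back-of-the-envelope
storage comparison might look like this. It is a sketch in Python (not D)
under two assumptions that the posting does not confirm: every accented
character costs 2 bytes in UTF-8, and every other character is plain ASCII.

```python
total_chars = 86_000_000  # characters scanned, as reported in the posting
accented = 1_800_000      # accented characters, as reported in the posting

# Assumption: accented characters take 2 bytes in UTF-8, all others 1 byte.
utf8_bytes = (total_chars - accented) + 2 * accented
# Assumption: everything is in the Basic Multilingual Plane (2 bytes each).
utf16_bytes = 2 * total_chars

print(f"accented share of all characters: {accented / total_chars:.1%}")
print(f"UTF-8:  {utf8_bytes:,} bytes")
print(f"UTF-16: {utf16_bytes:,} bytes")
```

Under those assumptions the accented characters add about 2% to the UTF-8
size, while UTF-16 doubles it. Other non-ASCII characters in the reports
(quotation marks, currency signs) would shift the UTF-8 figure up a little.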

Aug 30 2004

Ilya Minkov <minkov cs.tum.edu> writes:

Berin Loritsch schrieb:

 Walter wrote:
 
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cgp845$2ea2$1 digitaldaemon.com...

 Are you sure? Even European languages are mostly ASCII.

 
 
 But not completely.  There is the euro symbol (I dare say would
 be quite common).  In Spanish the enye (n with ~ on top, can't
 really do that well in Windows) is fairly common, and important.
 There is a big difference between an anus and a year, but the
 only difference in Spanish is n vs. enye.  Not to mention all
 those words that use an accent to mark an abnormally stressed
 syllable.

 Then we get to French, which uses the circumflex, accents, and
 accent grave.  Oh, then there's German which uses those two little
 dots a lot.  And I haven't even touched on Greek or Russian, both
 European languages.

When serving HTML, extended European characters are usually not served
as Latin-1 or Unicode. Instead, the &sym; escape encoding is preferred.
There are ASCII escapes for all Latin-1 characters, as far as I know.
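
To make the escape encoding concrete, here is a small sketch of it in
Python (not D). The function name is made up; `html.entities.codepoint2name`
is the standard-library table of HTML 4 entity names, and characters that
have no named entity fall back to a numeric reference. Note that this
sketch does not escape `&`, `<` or `>` themselves.

```python
from html.entities import codepoint2name


def to_ascii_html(text: str) -> str:
    """Replace every non-ASCII character with an HTML character reference."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            parts.append(ch)                         # ASCII passes through
        elif cp in codepoint2name:
            parts.append(f"&{codepoint2name[cp]};")  # named, e.g. &ntilde;
        else:
            parts.append(f"&#{cp};")                 # numeric fallback
    return "".join(parts)


for word in ("año", "Grüße"):
    escaped = to_ascii_html(word)
    print(word, "->", escaped, f"({len(escaped)} bytes as ASCII)")
```

The output is pure ASCII, but each escaped character costs several bytes
(`&ntilde;` is 8) against 1 byte in Latin-1 or 2 in UTF-8 and UTF-16.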

But what bothers me with all of Unicode is that Cyrillic languages cannot
be handled with 8 bits as well. It would be nice if we found an
encoding which would work on 2 buffers - the primary one containing the
ASCII and data in some codepage. The secondary one would contain packed
codepage changes, so that Russian, English, Hebrew and other text can be
mixed and would still need about one byte per character on average. For
Asian languages, the encoding should use on average per character one
symbol in the primary string, and one symbol in the secondary. The length of
the primary stream must be exactly the length of the string; all of the
overhang must be placed in the secondary one. I have a feeling that this
could be great for most uses and most efficient in total.

We should also not forget that the world is mostly Chinese, and soon the
computer users will also be. The European race will lose its importance.

-eye

Aug 30 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <ch07mc$30ai$1 digitaldaemon.com>, Ilya Minkov says...

But what bothers me with all of Unicode is that Cyrillic languages cannot
be handled with 8 bits as well.

One option would be the encoding WINDOWS-1251. Quote...

"The Cyrillic text used in the data sets are encoded using the CP1251 Cyrillic
system.  Users will require CP1251 fonts to read or print such text correctly.
CP1251 is the Cyrillic encoding used in Windows products as developed by
Microsoft. The system replaces the underused upper 128 characters of the typical
Latin character set with Cyrillic characters, leaving the full set of Latin type
in the lower 128 characters. Thus the user may mix Cyrillic and Latin text
without changing fonts."

(-- source: http://polyglot.lss.wisc.edu/creeca/kaiser/cp1251.html)

But that's just a transcoding issue, surely? Internally, we'd use Unicode, no?
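
The "transcoding issue" view can be sketched like this, in Python rather
than D and with a made-up sample string: the 8-bit form exists only at the
boundary, and the program works on Unicode in between. `cp1251` is the
name Python's codec registry uses for WINDOWS-1251.

```python
text = "Привет, world"  # mixed Cyrillic and Latin, 13 characters

external = text.encode("cp1251")      # 8-bit form, as stored or transmitted
internal = external.decode("cp1251")  # back to Unicode inside the program
assert internal == text               # the round trip is lossless here

sizes = {
    "cp1251": len(external),
    "utf-8": len(text.encode("utf-8")),
    "utf-16": len(text.encode("utf-16-le")),
}
print(sizes)
```

The codepage stores this string in one byte per character, but only
because every character happens to be in its repertoire; text outside
WINDOWS-1251 cannot be encoded with it at all.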



We should also not forget that the world is mostly Chinese, and soon the
computer users will also be.

Well, Chinese /certainly/ can't be handled with 8 bits. Traditionally, Chinese
users have made use of encodings such as GB2312 and Big5 (as Japanese users have
of SHIFT-JIS), which are (shock! horror!) /multi-byte encodings/ (there being
vastly more than 256 "letters" in the Chinese alphabet). Such encodings are
seriously horrible to work with, compared with the elegant simplicity of UTF-16.

Arcane Jill

Aug 31 2004

Ilya Minkov <minkov cs.tum.edu> writes:

Arcane Jill schrieb:

 One option would be the encoding WINDOWS-1251. Quote...

Oh come on. Do you really think I don't know 1251 and all the other
Windows codepages???? Oh, how would a person who natively speaks Russian
ever know that? They are all into typewriters and handwriting, aren't they?

 But that's just a transcoding issue, surely? Internally, we'd use Unicode, no?

You have apparently ignored what I tried to say. What is used externally
is determined by external conditions, and is not the subject of this 
part of the post. I have suggested to investigate and possibly develop 
another *internal* representation which would provide optimal 
performance. It should consist of 2 storages, the 8-bit primary storage 
and the variable length "overhang" storage, and should be able to 
represent *all* unicode characters. We are back at the question of an 
efficient String class or struct.

The idea is that characters are not self-contained, but instead
context-dependent. For example, the most commonly used escape in the
overhang string would be "select a new unicode subrange to work on".
Unicode documents are not just random data! They are words or sentences
written in a combination of a few languages, with a change of the
language happening perhaps every few words. You don't have every
symbol in a new language. So why does every symbol need to carry
the complete information, if most of it is more efficiently stored as a
relatively rare state change?
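
One possible reading of this idea, sketched in Python rather than D. The
details are assumptions made for the sketch and are not specified in the
post: bytes below 0x80 in the primary stream are ASCII, bytes from 0x80 up
index into a 128-code-point window, and the overhang is an unpacked list of
(character index, window base) pairs recording each window change.

```python
def encode(text: str):
    """Split `text` into a 1-byte-per-character primary stream and an overhang."""
    primary = bytearray()
    overhang = []   # (character index, window base) for every window change
    window = None   # base code point of the currently selected window
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp < 0x80:
            primary.append(cp)          # ASCII is always directly available
            continue
        base = cp & ~0x7F               # start of the 128-code-point block
        if base != window:
            overhang.append((i, base))  # the "select a new subrange" escape
            window = base
        primary.append(0x80 | (cp & 0x7F))
    return bytes(primary), overhang


def decode(primary: bytes, overhang) -> str:
    """Rebuild the text by replaying the window changes while scanning."""
    changes = dict(overhang)
    window = None
    chars = []
    for i, b in enumerate(primary):
        if i in changes:
            window = changes[i]
        chars.append(chr(b) if b < 0x80 else chr(window + (b & 0x7F)))
    return "".join(chars)


primary, overhang = encode("Привет, world")
print(len(primary), overhang, decode(primary, overhang))
```

For mixed Russian and English the overhang stays tiny, which is the case
the post has in mind. For Chinese text nearly every character selects a new
window, so this particular layout would do worse than UTF-16 there.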

We should also not forget that the world is mostly Chinese, and soon the
computer users will also be.

 
 Well, Chinese /certainly/ can't be handled with 8 bits. Traditionally, Chinese
 users have made use of encodings such as GB2312 and Big5 (as Japanese users have
 of SHIFT-JIS), which are (shock! horror!) /multi-byte encodings/ (there being
 vastly more than 256 "letters" in the Chinese alphabet). Such encodings are
 seriously horrible to work with, compared with the elegant simplicity of UTF-16.

Again, you have chosen to ignore my post. As you are much more familiar
with Unicode than myself, could you possibly develop an encoding which
takes, amortized:

1 byte per character for usual codepages (not including the fixed-length
subrange select command in the beginning)

2 bytes per character for all multibyte encodings which fit into UTF-16
(not including the fixed-length subrange select command in the beginning)

The rest of the Unicode characters should be representable as well.
Besides, I would like only the first byte of each character's
encoding to be stored in the primary string, and the rest in the "overhang".
I have my reasons to suggest that, and *if* you care to pay attention I
would also like to explain in detail.

-eye

Sep 01 2004

"van eeshan" <vanee hotmail.net> writes:

What you fail to understand, Jill, is that such arguments are but pinpricks
upon the World's foremost authority on everything from language-design to
server-software to ease-of-use.

Better to just build a wchar-based String class (and all the supporting
goodies), and those who care about such things will naturally migrate to it;
they'll curse D for the short-sighted approach to Object.toString, leaving
the door further open for a D successor.

V

Aug 28 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cgobse$237t$2 digitaldaemon.com>, Walter says...

There are also twice as many (sic) bytes to scan for the gc,

Why are strings added to the GC root list anyway? It occurs to me that arrays of
bit, byte, ubyte, short, ushort, char, wchar and dchar which are allocated on
the heap can never contain pointers, and so should not be added to the GC's list
of things to scan when created with new (or modification of .length).

I imagine that this one simple step would increase D's performance rather
dramatically.

Arcane Jill

Aug 27 2004

"Walter" <newshound digitalmars.com> writes:

"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgp8di$2ec2$1 digitaldaemon.com...
 In article <cgobse$237t$2 digitaldaemon.com>, Walter says...

There are also twice as many (sic) bytes to scan for the gc,

 Why are strings added to the GC root list anyway? It occurs to me that arrays
 of bit, byte, ubyte, short, ushort, char, wchar and dchar which are allocated
 on the heap can never contain pointers, and so should not be added to the GC's
 list of things to scan when created with new (or modification of .length).

 I imagine that this one simple step would increase D's performance rather
 dramatically.

There's certainly potential in D to add type awareness to the gc. But that
adds penalties of its own, and it's an open question whether on the balance
it will be faster or not.
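
For readers following the "don't scan pointer-free arrays" idea, here is a minimal sketch of how a buffer can be marked as non-scannable by hand. It assumes a present-day D runtime (core.memory with GC.BlkAttr.NO_SCAN), which is not the runtime discussed in this thread, and the helpers allocNoScan and isNoScan are hypothetical names made up for the illustration. It says nothing about whether automatic type awareness would be a net win, which is the open question raised above.

```d
import core.memory : GC;
import std.stdio : writeln;

// Hypothetical helper: allocate room for n code units of type C and
// ask the collector not to scan the block for pointers.
C[] allocNoScan(C)(size_t n)
{
    auto p = cast(C*) GC.malloc(n * C.sizeof, GC.BlkAttr.NO_SCAN);
    return p[0 .. n];
}

// Hypothetical helper: true if the block starting at p carries NO_SCAN.
bool isNoScan(const void* p)
{
    return (GC.getAttr(p) & GC.BlkAttr.NO_SCAN) != 0;
}

void main()
{
    auto text = allocNoScan!wchar(1024);
    text[] = 'x';                  // GC.malloc does not initialise the memory
    writeln(isNoScan(text.ptr));   // reports whether the attribute is set
}
```

The design choice here is explicit opt-in per allocation; the proposal in the thread is for the runtime to do the same thing automatically based on the element type.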

Aug 28 2004

Sean Kelly <sean f4.ca> writes:

In article <cgn92o$1jj4$1 digitaldaemon.com>, Ben Hinkle says...
 char: .36 seconds
 wchar: .78
 dchar: 1.3

So:

wchar = char * 2
dchar = char * 4

It looks like the time complexity is a direct factor of element size, which
stands to reason since D default initializes all arrays.  I would be interested
in seeing performance comparisons for transcoding between different formats.
For the sake of argument, perhaps using both std.utf and whatever Mango uses.


Sean
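
A rough sketch of the kind of allocation timing being discussed might look like the following. This is not Ben's benchmark (his attachment is not reproduced here); it is written against a present-day Phobos (std.datetime.stopwatch), and the function name timeAlloc and the loop sizes are assumptions chosen only for illustration. It measures allocation plus default initialization for equal element counts, which is exactly the case where element size dominates.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Hypothetical helper: allocate `count` arrays of `len` elements of type C
// and return the elapsed time in milliseconds.
long timeAlloc(C)(size_t count, size_t len)
{
    auto sw = StopWatch(AutoStart.yes);
    size_t sink;                   // keeps the allocations observable
    foreach (i; 0 .. count)
    {
        auto a = new C[len];       // default-initialised by the runtime
        sink += a.length;
    }
    sw.stop();
    assert(sink == count * len);
    return sw.peek.total!"msecs";
}

void main()
{
    enum count = 100_000;          // assumed workload, tune as needed
    enum len = 1_000;
    writefln("char:  %s ms", timeAlloc!char(count, len));
    writefln("wchar: %s ms", timeAlloc!wchar(count, len));
    writefln("dchar: %s ms", timeAlloc!dchar(count, len));
}
```

Absolute numbers will vary with machine, compiler flags and collector behaviour, so only the ratios between the three lines are of interest.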

Aug 27 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cgnh8j$1nl6$1 digitaldaemon.com>, Sean Kelly says...

So:

wchar = char * 2
dchar = char * 4

Only if there are the same number of chars in a char array as there are wchars
in a wchar array, and dchars in a dchar array. This will /only/ be true if the
string is pure ASCII.

Jill
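
The point about element counts can be checked with a small sketch. It assumes a present-day Phobos (std.utf.toUTF16 and toUTF32) and a UTF-8 source file; byteSize is a hypothetical helper and the sample strings are arbitrary choices, not taken from the thread.

```d
import std.stdio : writefln;
import std.utf : toUTF16, toUTF32;

// Hypothetical helper: bytes occupied by the code units of a string
// of any character width.
size_t byteSize(C)(const(C)[] s)
{
    return s.length * C.sizeof;
}

void main()
{
    foreach (s; ["hello", "こんにちは"])
    {
        writefln("%s: char[] %s bytes, wchar[] %s bytes, dchar[] %s bytes",
                 s, byteSize(s), byteSize(toUTF16(s)), byteSize(toUTF32(s)));
    }
    // hello: char[] 5 bytes, wchar[] 10 bytes, dchar[] 20 bytes
    // こんにちは: char[] 15 bytes, wchar[] 10 bytes, dchar[] 20 bytes
}
```

For the pure ASCII string the 1:2:4 ratio holds; for the Japanese string each character takes three bytes in UTF-8 but two in UTF-16, so the wchar[] form is the smaller one.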

Aug 27 2004
