D - string with encoding( suggestion )

Keisuke UEDA <Keisuke_member pathlink.com> writes:

Hello. I've read the D language specification and "D Strings vs C++ Strings". My 
impression is that D strings are not international strings. I think that strings 
should be independent of encoding, but D strings are arrays of char and resemble 
C strings. I think that array classes and string classes are different 
concepts, so we should specify the encoding when we make a string from an array 
of char, and again when we take an array of char out of a string. We should not 
silently assume a character encoding.

If the string class is internationalized, a programmer who does not know the 
encodings of a foreign language will not be forced into writing bugs.

In practice, a string class probably has no choice but to use an existing 
encoding such as Unicode. For example, the string classes of Java and 
Objective-C (NSString) use Unicode internally.

Dec 02 2003

Elias Martenson <elias-m algonet.se> writes:

Keisuke UEDA wrote:

 Hello. I've read the D language specification and "D Strings vs C++ Strings". My 
 impression is that D strings are not international strings. I think that strings 
 should be independent of encoding, but D strings are arrays of char and resemble 
 C strings. I think that array classes and string classes are different 
 concepts, so we should specify the encoding when we make a string from an array 
 of char, and again when we take an array of char out of a string. We should not 
 silently assume a character encoding.
 
 If the string class is internationalized, a programmer who does not know the 
 encodings of a foreign language will not be forced into writing bugs.
 
 In practice, a string class probably has no choice but to use an existing 
 encoding such as Unicode. For example, the string classes of Java and 
 Objective-C (NSString) use Unicode internally.

I agree almost completely with this.

Also, the three different char types of D scared me a bit. I don't think 
it should come as a surprise when we see developers use char arrays 
exclusively, and even though the documentation states that these arrays 
are supposed to use UTF-8 encoding, we will see a lot of people doing 
stuff like:

     char foo = bar[0]; // bar is an array of char

The above is completely useless at best, and quite possibly illegal, 
depending on the contents of the char array. What's worse is that the 
bug will probably only manifest itself when people try to use 
international characters in the array. Just think about what would 
happen if bar contained the string "€100".
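
To make the hazard concrete, here is a minimal C sketch. It assumes the string in question is "€100" stored as UTF-8 (the euro sign is U+20AC, which UTF-8 encodes as the three bytes E2 82 AC). The names utf8_seq_len and first_unit are invented for this illustration and are not part of any library.

```c
#include <stddef.h>

/* Hypothetical helper: given the first byte of a UTF-8 sequence, return how
   many bytes the whole sequence occupies, or 0 if the byte cannot start a
   sequence (a continuation byte or an invalid lead byte). */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                  /* 0xxxxxxx: ASCII      */
    if (lead >= 0xC2 && lead <= 0xDF) return 2; /* 110xxxxx             */
    if (lead >= 0xE0 && lead <= 0xEF) return 3; /* 1110xxxx             */
    if (lead >= 0xF0 && lead <= 0xF4) return 4; /* 11110xxx             */
    return 0;                                   /* not a valid lead byte */
}

/* Assumed contents of bar: "€100" as UTF-8 bytes. The literal is split so
   that the hex escape \xAC does not swallow the digits that follow it. */
static const char bar[] = "\xE2\x82\xAC" "100";

/* What `char foo = bar[0];` actually picks up: the lead byte 0xE2, which is
   only the first third of the euro sign and not a character by itself. */
unsigned char first_unit(void)
{
    return (unsigned char)bar[0];
}
```

Here first_unit() yields 0xE2, and utf8_seq_len reports that two more bytes are needed before a character is complete. With an all-ASCII string the same code happens to work, which is exactly why the bug hides until international text shows up.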

In fact, I feel that the char and wchar types are useless in that they 
serve no practical purpose. The documentation says "wchar - unsigned 8 
bit UTF-8". The only UTF-8 encoded characters that fit inside a char are 
those with a code point less than 128, i.e. ASCII characters.

This makes me wonder what possible use there can be for the char and 
wchar types? When manipulating individual characters you absolutely need 
a data type that is able to hold any character.

Either the char and wchar types should be dropped, or the documentation 
should be clear that a char is ASCII only.

I have worked for a very long time with internationalisation issues, and 
anyone who ever tried to fix a C program to do full Unicode everywhere 
knows how painful that can be. Actually, even writing new Unicode-aware 
code in C can be really difficult. The Unicode support in D seems not to 
be very well thought through, and I feel that it needs to be fixed 
before it's too late.

I would very much like to do what I can to help out. At the very least 
share my experiences and knowledge on the subject.

Regards

Elias

Dec 02 2003

I <I_member pathlink.com> writes:

In article <bqhnd6$15f8$1 digitaldaemon.com>, Elias Martenson says...

 In fact, I feel that the char and wchar types are useless in that they 
 serve no practical purpose. The documentation says "wchar - unsigned 8 
 bit UTF-8". The only UTF-8 encoded characters that fit inside a char is 
 a one with a code point less than 128, i.e. an ASCII character.

It must be a bug in documentation. Under Windows wchar = UTF16 and under Linux
wchar = UTF32.

 This makes me wonder what possible use there can be for the char and 
 wchar types? When manipulating individual characters you absolutely need 
 a data type that is able to hold any character.

wchar can hold any character that is allowed in the operating system. See above.

 I have worked for a very long time with internationalisation issues, and 
 anyone who ever tried to fix a C program to do full Unicode everywhere 
 knows how painful that can be. Actually, even writing new Unicode-aware 
 code in C can be real difficult. The Unicode support in D seems not to 
 be very well thought through, and I feel that it needs to be fixed 
 before it's too late.

We have standard conversions, don't we? Yet much to be done though.

 I would very much like to do what I can to help out. At the very least 
 share my experiences and knowledge on the subject.

Write a library, and let the standard library workgroup (currently defunct) take
it in.

I propose that streams and strings are somehow unified, which would allow both
to format strings and to iterate through them.

-eye

Dec 02 2003

Elias Martenson <elias-m algonet.se> writes:

I wrote:

 In article <bqhnd6$15f8$1 digitaldaemon.com>, Elias Martenson says...
 
 In fact, I feel that the char and wchar types are useless in that they 
 serve no practical purpose. The documentation says "wchar - unsigned 8 
 bit UTF-8". The only UTF-8 encoded characters that fit inside a char is 
 a one with a code point less than 128, i.e. an ASCII character.
 
 It must be a bug in documentation. Under Windows wchar = UTF16 and under Linux
 wchar = UTF32.

Well, if that's the case then it's even worse. In C and C++, Windows 
uses a 16-bit entity for wchar_t, which can cause a lot of grief since it 
requires you to deal with surrogate pairs more or less manually. Java 
has the same problems. They tried to deal with it in JDK1.5, but 
individual manipulation of characters is still very painful.
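
For readers who have not met surrogate pairs, the following C sketch shows the arithmetic that a 16-bit character type forces on the programmer. The function name utf16_encode is hypothetical; the constants are the ones defined by UTF-16 itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16. Returns the number of 16-bit
   units written to out (1 or 2), or 0 if cp is not a valid scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Reject values beyond U+10FFFF and the surrogate range itself. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {           /* fits in a single 16-bit unit */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                               /* 20 bits remain       */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low (trail) surrogate */
    return 2;
}
```

For example, U+1D11E (a musical symbol outside the first 64K code points) becomes the pair D834 DD1E. Any code that indexes a 16-bit string by unit can land on either half of such a pair.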

I fail to see any good arguments for having char be anything else than a 
32-bit type. The two arguments that do exist are:

     1) Storage. A 32-bit char is 4 times as large as an 8-bit char.

        Counter argument: Individual chars should be 32 bits. You could
                          have two string types, one UTF-32, and one
                          UTF-8 version. Both of which could have
                          identical interfaces. One would be fast,
                          the other would be small.

     2) Interoperability with legacy API's (i.e. linking with C and C++)

        Counter argument: I can sort of agree with this one.
                          Interoperability is necessary, but it
                          should not dictate the implementation.
                          Perhaps a type called legacy_char or
                          something like that. At least it would
                          prevent programmers from writing new code
                          that uses this type unless they really
                          need it?

 This makes me wonder what possible use there can be for the char and 
 wchar types? When manipulating individual characters you absolutely need 
 a data type that is able to hold any character.
 
 wchar can hold any character that is allowed in the operating system. See
 above.

Windows allows full Unicode 3.1 without problems. You do have to jump 
through some hoops to use Unicode codes >64K though, but that's caused 
by the legacy APIs, which were designed in an era when all Unicode values 
fit in 16 bits. This is (since 3.1) no longer the case, and sticking with 
that convention because of argument 2 above is not a good thing.

 I have worked for a very long time with internationalisation issues, and 
 anyone who ever tried to fix a C program to do full Unicode everywhere 
 knows how painful that can be. Actually, even writing new Unicode-aware 
 code in C can be real difficult. The Unicode support in D seems not to 
 be very well thought through, and I feel that it needs to be fixed 
 before it's too late.
 
 We have standard conversions, don't we?

We sure do. But in D, many of those conventions are (fortunately) not 
set in stone yet, and can be fixed.

 Yet much to be done though.

I can agree with this one.

 I would very much like to do what I can to help out. At the very least 
 share my experiences and knowledge on the subject.
 
 Write a library, and let the standard library workgroup (currently defunct)
 take it in.

Thanks for trusting me. I'd love to help out with exactly that. However, 
a single person can not create the perfect Unicode-aware string library. 
This can be proved by looking at the mountain of mistakes made when 
designing Java, a language that still can pride itself by being one of 
the best in terms of Unicode-awareness. They had to fix a lot of things 
along the way though, and the standard library is still riddled with 
legacy bugs that can't be fixed because of backward compatibility issues.

 I propose that streams and strings are somehow unified, which would allow both
 to format strings and to iterate through them.

I sort of agree with you. Although there should be a distinction, in the way 
that Java did it (which I believe is what the original poster requested):

     - A string should be a sequence of Unicode characters. That's it.
       Nothing more, nothing less.

     - A stream should provide an externalisation interface for
       strings. It should be the responsibility of the stream to
       provide encoding and decoding of the external encoding to
       and from Unicode.
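
The split described above can be sketched in C: the string is held as plain 32-bit code points, and only the output routine knows about UTF-8. Both function names (utf8_encode, write_utf8) are invented for this sketch, and a byte buffer stands in for a real stream.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8. Returns the number of bytes written
   (1 to 4), or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                      /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* The "stream" side: externalise n code points into buf as UTF-8.
   Returns the number of bytes written, or (size_t)-1 if a code point is
   invalid or the buffer (cap bytes) is too small. */
size_t write_utf8(const uint32_t *s, size_t n, unsigned char *buf, size_t cap)
{
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char tmp[4];
        size_t len = utf8_encode(s[i], tmp);
        if (len == 0 || len > cap - used)
            return (size_t)-1;
        for (size_t k = 0; k < len; k++)
            buf[used++] = tmp[k];
    }
    return used;
}
```

A second routine with the same signature could target UTF-16 or a legacy codepage without the string side changing at all, which is the point of keeping the encoding in the stream.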

This is what Java has done, but they still managed to mess it up 
originally. I think one of the reasons for this is that the people who 
designed it didn't understand Unicode and encodings entirely.

Originally I wasn't interested in this at all, and made a lot of 
mistakes. I'm Swedish, and while I do have the need for non-ASCII 
characters, I didn't understand the requirements of Unicode until I 
started studying Mandarin Chinese, and then Russian. Since my GF is 
Russian I now see a lot of the problems caused by badly written code, 
and I see D as an opportunity to lobby for a use of Unicode in a way 
that minimises the opportunities to write code that only works with English.

Regards

Elias

Dec 02 2003

Hauke Duden <H.NS.Duden gmx.net> writes:

Elias Martenson wrote:
 I fail to see any good arguments for having char be anything else than a 
 32-bit type. The two arguments that do exist are:

Well, I think the problem is that many (western) programmers have an 
ASCII bias. Most realize that Unicode is the best way to be able to 
write international code, but they don't want to change their existing 
code base. So UTF-8 seems like a nice solution - you have Unicode 
support on paper, but you can still treat everything as ASCII.

The problem, however, is that UTF-8 gets really complicated when you 
have non-ASCII strings. You cannot index the string directly, you need 
to decode multiple code points in a loop to get a single character, you 
have to deal with invalid code point sequences... the list goes on.
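
A rough C sketch of what that decoding loop involves is given below. The function name utf8_decode is hypothetical; the checks (continuation bytes, overlong forms, surrogates, the U+10FFFF limit) follow the UTF-8 definition, and the loop works on bytes, the encoding's individual units.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from the n bytes at s. On success, store it in *cp
   and return the number of bytes consumed (1 to 4). Return 0 if the input
   is empty, truncated, or malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;
    uint32_t v, min;
    unsigned char b0;

    if (n == 0)
        return 0;
    b0 = s[0];

    if (b0 < 0x80) {                       /* plain ASCII */
        *cp = b0;
        return 1;
    } else if ((b0 & 0xE0) == 0xC0) {      /* 2-byte sequence */
        len = 2; v = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {      /* 3-byte sequence */
        len = 3; v = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {      /* 4-byte sequence */
        len = 4; v = b0 & 0x07; min = 0x10000;
    } else {
        return 0;                          /* stray continuation or bad lead */
    }

    if (n < len)
        return 0;                          /* sequence is cut off */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                      /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }

    if (v < min)
        return 0;                          /* overlong encoding */
    if (v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;                          /* outside the Unicode range */

    *cp = v;
    return len;
}
```

Every caller that wants "the next character" has to go through something like this, and has to decide what to do when it returns 0. None of that is needed with a fixed-width 32-bit representation.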

UTF-16 is a little better if you know you won't ever have any surrogate 
pairs in there, but in my opinion that's a short-sighted view. These 
kinds of assumptions have a tendency to be proven wrong just when you 
are in a position where changing the encoding is not possible anymore.

In my opinion, memory strings should simply be UTF-32. Easy indexing and 
manipulation for everyone, no "discrimination" of multibyte languages.

But, I realise that some people do not share this view. And of course 
there's the problem of interacting with legacy code (e.g. printf and all 
the other C stuff, or the UTF-16 Windows API).

Which really only leaves one solution: we need an abstract string class 
with implementations for UTF-8, UTF-16 and UTF-32. That way you can 
choose the best encoding, depending on your needs. And such classes 
could also take care of the hassles of UTF-8 and UTF-16 
decoding/encoding/manipulation.

An abstract class would even allow users to add their own encoding, 
which is necessary if legacy code is not ASCII, but one of the other few 
dozen codepages that are popular around the world.

And last, but not least, I think the D character type should always be 
32 bit. Then it would be a real, decoded Unicode character, not a code 
point. Since the decoding is done internally by the string classes, 
there is really no need to have different character sizes.

Hauke

Dec 02 2003

Elias Martenson <elias-m algonet.se> writes:

Hauke Duden wrote:
 Elias Martenson wrote:

    [ a lot of very good reasoning snipped for space ]

 Which really only leaves one solution: we need an abstract string class 
 with implementations for UTF-8, UTF-16 and UTF-32. That way you can 
 choose the best encoding, depending on your needs. And such classes 
 could also take care of the hassles of UTF-8 and UTF-16 
 decoding/encoding/manipulation.
 
 An abstract class would even allow users to add their own encoding, 
 which is necessary if legacy code is not ASCII, but one of the other few 
 dozen codepages that are popular around the world.

Agreed. This is a very good suggestion, and it overlaps to a large 
degree with my ideas.

Taking your reasoning a little further, this means we have a need for:

     - An interface that represents a string (called "String"?)

     - Three concrete implementations of said class:
       UTF8String, UTF16String and UTF32String
       (or perhaps String8 etc...)

     - Yet another implementation called NativeString that implicitly
       uses the encoding of the environment that the program is
       running in. In Unix this would look at the environment
       variable LC_CTYPE.

     - A comprehensive set of string manipulation classes and methods
       that work with the String interface.

     - Making sure the external interfaces of all std-classes use
       String instead of char arrays.

     - Removing char and wchar, and renaming dchar to char.
       The old "char" was all wrong anyway since UTF-8 is defined
       as being a byte sequence, so we already have the types
       byte and short.
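
As a side note on the NativeString item in the list above: on POSIX systems the environment's encoding can be queried roughly as follows. This is only a sketch of the lookup, not of NativeString itself; setlocale and nl_langinfo are standard POSIX calls, while the wrapper name native_codeset is made up here.

```c
#include <langinfo.h>
#include <locale.h>

/* Adopt the locale named by the environment (LC_ALL, LC_CTYPE, LANG) for
   character handling, then ask which character encoding it implies.
   Returns a name such as "UTF-8" or "ISO-8859-1"; the exact spelling is
   platform-dependent, and the default "C" locale reports an ASCII codeset. */
const char *native_codeset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

A NativeString implementation would presumably call something like this once at start-up and pick its converter based on the result. This does not apply to Windows, which has no langinfo.h.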

All this is needed. wchar and dchar arrays are useless today anyway, 
since, from what I can tell, external interfaces seem to be using 
char[] for strings. If you decide you want to work with proper chars 
(i.e. dchar) you have to do UTF-32 <-> UTF-8 conversions on every method 
call that involves strings. Not a good thing, and an effective way of 
preventing any proper use of Unicode.

Besides, UTF-8 is highly inefficient for many operations. Its only 
advantages are the small size of mostly-ASCII data and compatibility with 
ASCII. Internally, 32-bit strings should be used.

 And last, but not least, I think the D character type should always be 
 32 bit. Then it would be a real, decoded Unicode character, not a code 
 point. Since the decoding is done internally by the string classes, 
 there is really no need to have different character sizes.

I think I agree with you, but I'm not sure what you mean by "real, 
decoded Unicode character, not a code point"? If you are referring to 
the bytes that make up a UTF-8 character, then I agree with you (but 
that's not called a code point).

A code point is an individual character "position" as defined by 
Unicode. Are you saying that the "char" type should be able to hold a 
completed composite character, including combining diacritical marks? 
In that case I don't agree with you, and no other languages even attempt 
this.

Regards

Elias

Dec 03 2003

Hauke Duden <H.NS.Duden gmx.net> writes:

Elias Martenson wrote:
 Hauke Duden wrote:
 And last, but not least, I think the D character type should always be 
 32 bit. Then it would be a real, decoded Unicode character, not a code 
 point. Since the decoding is done internally by the string classes, 
 there is really no need to have different character sizes.

 
 
 I think I agree with you, but I'm not sure what you mean by "real, 
 decoded Unicode character, not a code point"? If you are referring to 
 the bytes that make up a UTF-8 character, then I agree with you (but 
 that's not called a code point).

Never write such a thing when you're tired! ;) Where I wrote code point, 
I meant "encoding elements", i.e. bytes for UTF-8 and 16-bit ints for UTF-16.

 A code point is an individual character "position" as defined by 
 Unicode. Are you saying that the "char" type should be able to hold a 
 completed composite character, including combining diacritical marks? 
 In that case I don't agree with you, and no other languages even attempt 
 this.

God, no. Combining marks to what the end user thinks of as a "character" 
needs to be done on another layer, probably just before or while they 
are printed.

Hauke

Dec 03 2003

J C Calvarese <jcc7 cox.net> writes:

Elias Martenson wrote:

 I wrote:
 
 In article <bqhnd6$15f8$1 digitaldaemon.com>, Elias Martenson says...

...
 I would very much like to do what I can to help out. At the very 
 least share my experiences and knowledge on the subject.

 Write a library, and let the standard library workgroup (currently 
 defunct) take it in.

 Thanks for trusting me. I'd love to help out with exactly that. However, 
 a single person can not create the perfect Unicode-aware string library.

I think everyone would agree that the task would be a large one. Since 
Walter is only one person, you might judge the string-handling functions 
he developed to be simplistic.  (For my purposes they're fine, but I've 
never worked with Unicode.) If someone (or a group of people) offered to 
supply some more comprehensive functions/classes, I think he'd accept 
donations.

Personally, I know next to nothing about Unicode, so your discussion is 
way over my head.  I've noted similar criticisms before and I suspect 
D's library is somewhat lacking in this area.

I don't think the fundamental (C-inspired) types need to get any more 
complicated, but I think a fancy (Java-like) String class could help 
handle most of the messy things in the background.

Justin


Dec 02 2003

Elias Martenson <elias-m algonet.se> writes:

J C Calvarese wrote:

 I think everyone would agree that the task would be a large one. Since 
 Walter is only one person, you might judge the string-handling functions 
 he developed to be simplistic.

Yes, they are. However, like I said, no single person can do this right. 
A lot of very bright people (don't ask me to compare them to Walter, I 
don't know him, nor them :-) ) worked on Java and they made dozens of 
very bad mistakes in 1.0. Not until 1.5 (which isn't released yet) are 
they catching up.

 (For my purposes they're fine, but I've 
 never worked with Unicode.)

Exactly. Not to offend anyone by asking about nationality, but I assume 
you are an English-only speaker? This is a common situation. One uses 
the most obvious tools at hand (which in both C and D are char arrays) 
and everything works fine. Later, when it's time to localise the 
application, you realise you should have used char_t (dchar) instead. 
Congratulations on the task of changing all that code.

In fact, you don't even have to localise your app to end up with 
problems. My last name is actually Mårtenson, and in UTF-8 the second 
character encodes into two bytes. Guess what happens when a naïve 
implementation runs the following code to verify my name? Try to count 
the number of errors:

     for (int c = 0 ; c < strlen(name) ; c++) {
         char ch = name[c];

         // make uppercase chars into lowercase
         if (isupper(ch)) {
             name[c] += 'a' - 'A';
         }
     }

     // print the name with a series of stars below it
     printf ("%s\n", name);
     for (int c = 0 ; c < strlen(name) ; c++) {
         putchar ('*');
     }
     putchar ('\n');

Code like the one above is not particularly uncommon to see. Hell, I 
even wrote code like it myself. Not all of the above problems will be 
fixed if char is a 32-bit entity (at least one will remain), but most of 
the work will be done already.
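
One of the errors in the snippet above is that the row of stars is sized by strlen, which counts bytes rather than characters. A possible repair for that one error, assuming the name is well-formed UTF-8, is sketched here in C. The names utf8_count and underline are invented for this illustration.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string by counting every
   byte that is not a continuation byte (10xxxxxx). Assumes the input is
   well-formed UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Fill out with one '*' per code point of name, NUL-terminated. Returns
   the number of stars written, or 0 if out (cap bytes) is too small.
   An empty name also yields 0. */
size_t underline(const char *name, char *out, size_t cap)
{
    size_t n = utf8_count(name);
    if (n + 1 > cap)
        return 0;
    for (size_t i = 0; i < n; i++)
        out[i] = '*';
    out[n] = '\0';
    return n;
}
```

For "Mårtenson" in UTF-8, strlen reports 10 bytes while utf8_count reports 9 code points, so the original loop prints one star too many. Even the code point count is not the final answer: a name written with combining marks would still get more stars than visible characters, which is the kind of problem that remains with 32-bit chars.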

 If someone (or a group of people) offered to 
 supply some more comprehensive functions/classes, I think he'll accept 
 donations.

I would like to help out in such a group, but I certainly cannot do it 
myself. For one, I'm not skilled enough in D to even use correct 
"D-style" everywhere.

 Personally, I know next to nothing about Unicode, so your discussion is 
 way over my head.  I've noted similar criticisms before and I suspect 
 D's library is somewhat lacking in this area.

Indeed it is. Unicode has to be in from the start. Again and again we 
have seen languages struggle when trying to tack on Unicode after the 
fact. C, C++, Perl, PHP are just a few examples.

 I don't think the fundamental (C-inspired) types need to get any more 
 complicated, but I think a fancy (Java-like) String class could help 
 handle most of the messy things in the background.

Do you mean they shouldn't be more complicated syntax-wise, or 
implementation-wise?

Syntax-wise they already are extremely convoluted, at least if your 
intention is to write a program that works properly with Unicode. You 
need to manage the UTF-8 encoding yourself, which is actually next to 
impossible to get right.

Implementation-wise, I can conceive of an implementation where you still 
declare a string as a char[] (char being 32-bit of course) and the 
compiler has special knowledge about this type, such that it can actually 
implement it differently behind the scenes. An extremely poor example:

     char[] utf32string;
     char[] #UTF16 utf16string;
     char[] #UTF8 utf8string;
     char[] #NATIVE nativeString;

The above mentioned four strings would behave identically and it should 
be possible to use them interchangeably. In all cases s[0] would return 
a 32-bit char.
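
The same idea can be imitated at library level. The C sketch below is not the proposed compiler feature; it only illustrates "different storage, same answer from s[0]" with a tagged struct. All names (struct ustring, ustring_at, ENC_UTF8, ENC_UTF32) are hypothetical, and the UTF-8 branch assumes well-formed data.

```c
#include <stddef.h>
#include <stdint.h>

enum encoding { ENC_UTF8, ENC_UTF32 };

/* A string whose storage encoding is an implementation detail. */
struct ustring {
    enum encoding enc;
    const void *data;
    size_t units;      /* bytes for UTF-8, 32-bit units for UTF-32 */
};

/* Return the code point at character index i, whatever the storage is.
   Returns UINT32_MAX if i is out of range. */
uint32_t ustring_at(const struct ustring *s, size_t i)
{
    if (s->enc == ENC_UTF32) {
        const uint32_t *p = s->data;
        return i < s->units ? p[i] : UINT32_MAX;   /* direct indexing */
    }

    /* UTF-8: walk the sequences from the start until index i is reached. */
    const unsigned char *p = s->data;
    size_t pos = 0;
    while (pos < s->units) {
        unsigned char b = p[pos];
        size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        if (pos + len > s->units)
            break;                                 /* truncated tail */
        if (i == 0) {
            /* Keep the payload bits of the lead byte, then append six
               bits from each continuation byte. */
            uint32_t cp = len == 1 ? b : (uint32_t)(b & (0xFF >> (len + 1)));
            for (size_t k = 1; k < len; k++)
                cp = (cp << 6) | (p[pos + k] & 0x3F);
            return cp;
        }
        i--;
        pos += len;
    }
    return UINT32_MAX;
}
```

The caller sees 32-bit characters in both cases. The cost difference is visible in the code: the UTF-32 branch is a single array access, while the UTF-8 branch is a linear scan, which matches the fast versus small trade-off discussed earlier in the thread.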

This cannot be stressed enough: Being able to dereference individual 
components of a UTF-8 or UTF-16 string is a recipe for failure. There 
are hardly any situations where anyone needs this. The only time would 
be a UTF-8 en/decoder but that functionality should be built in anyway.

Regards

Elias

Dec 03 2003

"Roald Ribe" <rr.no spam.teikom.no> writes:

"Elias Martenson" <elias-m algonet.se> wrote in message
news:bqkc8e$233i$1 digitaldaemon.com...

[snip]

 This cannot be stressed enough: Being able to dereference individual
 components of a UTF-8 or UTF-16 string is a recipe for failure. There
 are hardly any situations where anyone needs this. The only time would
 be a UTF-8 en/decoder but that functionality should be built in anyway.

I will just add moral support to all that E.M. said above.
Even in languages written with chars fitting into 8 bits,
the current situation is a mess, in ALL computer languages,
but Java is getting there (slowly).

We have several groups of "customers" here:
- Native English-speaking developers
  (with a large industrialized "home" market)
- "8 bits is enough" developers, "most of the rest of the industrialized world"
  (still have problems with different codepages though)
- "16 bits is enough" developers,
  who account for most of the rest of developers (as of today) but not all.
- "32-bit rules"
  Groups of all of the above with experience and insight, plus a few not
  fitting within the 16-bit limitations.

If D is to be seen as a first-choice language around the world,
it has to *enforce* UNICODE for strings. If we all agree on that
(not today maybe, but what about 5 years from now?) a design
should be selected that caters for ALL, before the language hits
1.0. If not, it will just be "one more language", no matter what
other nice features it has. This can really set it apart.

32-bit UNICODE chars resolve all these problems. The only
"problem" they generate is more memory used. With the RAM
prices (still declining) we have today, I refuse to seriously
consider it a problem.

How about making this THE focus point of D?

"D 1st international language"
"D 1st language with real international focus"

Certainly an easy-to-communicate advantage, as all the
marketdroids I ever worked with seem to want.

For those who know next to nothing about these issues, I
would recommend: http://www.cl.cam.ac.uk/~mgk25/unicode.html
as an introduction to some of the issues (but I disagree
with their reason for basing efforts around UTF-8).

Those who decide to take this on should:
1. Grab the String interface from Java 1.5
2. Learn from the other Java based String implementation
   performance improvers.
3. Use as much as possible from the IBM lib for UNICODE.

So much to do, never enough time...

This feature alone has the potential to make D a
language with very widespread use, and should be a very
nice "hook" for getting people to write about it.

Roald

Dec 03 2003

Elias Martenson <no spam.spam> writes:

Den Wed, 03 Dec 2003 17:11:54 +0100 skrev Roald Ribe:

 For those who know next to nothing about these issues, I
 would recommend: http://www.cl.cam.ac.uk/~mgk25/unicode.html
 as an introduction to some of the issues (but I disagree
 with their reason for basing efforts around UTF-8).

Well, the efforts he refers to concern external data representation in the
operating system. There, UTF-8 is the only reasonable encoding to use, or
everything would break. This is because Unix is so focused around ASCII
files. UTF-8 is a very neat way to go Unicode without breaking anything.

 Those who decide to take this on should:
 1. Grab the String interface from Java 1.5
 2. Learn from the other Java based String implementation
    performance improvers.
 3. Use as much as possible from the IBM lib for UNICODE.

Agreed. This is very important. However, there are some fundamental
changes that need to be made to the language, some of which are going to
break a lot of existing code (although some people would argue that that
code is already broken since it doesn't handle Unicode).

Does Walter have anything to say on this subject?

 So much to do, never enough time...

Again, agreed. However, I believe it needs to be done now, because fixing
it later will be next to impossible.

 This feature alone has the potential to make D a
 language with very widespread use, and should be a very
 nice "hook" for getting people to write about it.

Indeed. But as some other person already mentioned, there is a lot more to
be done than just getting 32-bit chars. It's the first step though.

Regards

Elias

Dec 03 2003

"Walter" <walter digitalmars.com> writes:

"Roald Ribe" <rr.no spam.teikom.no> wrote in message
news:bql1dn$30vm$1 digitaldaemon.com...
 32 bit UNICODE chars, resolves all these problems. The only
 "problem" it generates, is more memory used. With the RAM
 prices (still declining) we have today, I refuse to seriously
 consider it a problem.

I agree with you that internationalization and support for Unicode are of
very large importance for the success of D. D's current support for it is
weak, but I think that is more a library issue than a language
definition one.

I've written a large server app that was internally 100% UTF-32. It used all
available memory and then went into swap. The 4x memory consumption from
UTF-32 cost plenty in performance; if I had done it with UTF-8 I'd have
had a happier customer.
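
The size difference is easy to quantify. The C sketch below compares the storage needed for the same text in the two encodings; the function names are made up for this illustration, and the code points passed in are assumed to be valid.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store n code points as UTF-32: always four per code point. */
size_t utf32_bytes(size_t n)
{
    return n * sizeof(uint32_t);
}

/* Bytes needed to store the same code points as UTF-8: between one and
   four per code point, depending on its value. Assumes valid code points. */
size_t utf8_bytes(const uint32_t *s, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = s[i];
        total += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }
    return total;
}
```

For pure ASCII data such as "hello" this gives 5 bytes against 20, the 4x factor mentioned above. For text made of CJK characters the gap narrows to 3 bytes against 4 per character, so how much UTF-8 saves depends heavily on the data.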

The reality is that D must offer the programmer a choice between UTF-8,
UTF-16, and UTF-32, and make it easy to use any of them, as the optimal
representation for one app will be suboptimal for the next. Since D is a
systems programming language, coming off poorly on benchmarks is not
affordable <g>.

Jan 03 2004

"Walter" <walter digitalmars.com> writes:

"I" <I_member pathlink.com> wrote in message
news:bqhohh$1781$1 digitaldaemon.com...
 In article <bqhnd6$15f8$1 digitaldaemon.com>, Elias Martenson says...

In fact, I feel that the char and wchar types are useless in that they
serve no pratcical purpose. The documentation says "wchar - unsigned 8
bit UTF-8". The only UTF-8 encoded characters that fit inside a char is
a one with a code point less than 128, i.e. an ASCII character.

 It must be a bug in documentation. Under Windows wchar = UTF16 and under
 Linux wchar = UTF32.

This is not correct. A D wchar is always UTF-16. A D dchar is always UTF-32.
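
A small sketch of what the fixed widths imply, assuming present-day D and
Phobos std.utf (the helper names utf16Units and utf32Units are hypothetical):
because wchar is always a UTF-16 code unit, a code point above U+FFFF occupies
two wchars but only one dchar.

```d
import std.utf : toUTF16, toUTF32;

// The code unit widths are the same on every platform.
static assert(char.sizeof == 1 && wchar.sizeof == 2 && dchar.sizeof == 4);

/// Number of UTF-16 code units (wchar) needed to hold `text`.
size_t utf16Units(string text) { return toUTF16(text).length; }

/// Number of UTF-32 code units (dchar) needed to hold `text`.
size_t utf32Units(string text) { return toUTF32(text).length; }

unittest
{
    // U+20AC EURO SIGN is inside the Basic Multilingual Plane.
    assert(utf16Units("\u20AC") == 1);
    // U+1D11E MUSICAL SYMBOL G CLEF is outside it: surrogate pair.
    assert(utf16Units("\U0001D11E") == 2);
    assert(utf32Units("\U0001D11E") == 1);
}
```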

Jan 03 2004

Keisuke UEDA <Keisuke_member pathlink.com> writes:

Thank you for replying.

I think that UTF-8 should not be handled directly.

UTF-8 is complicated enough that users may make mistakes. ASCII characters
(from 0x00 to 0x7f) are encoded as 1 byte, so programmers who use only ASCII
characters do not need to distinguish ASCII from UTF-8. But characters from
2-byte character sets (such as Japanese Shift JIS and Chinese Big5) are
encoded as multiple bytes. A certain character is encoded to 15 bytes. Unicode
strings will be corrupted if they are not handled correctly. Many programmers
do not know the circumstances of foreign languages well. I think that encoded
text data should be wrapped and programmers should use it indirectly.
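
As a rough illustration of the variable width, here is a sketch assuming
present-day D and Phobos std.utf; the helper names utf8Length and isValidUtf8
are hypothetical. Under RFC 3629 a single code point takes between one and
four bytes in UTF-8, and cutting a string in the middle of such a sequence
leaves invalid data.

```d
import std.utf : codeLength, validate, UTFException;

/// UTF-8 bytes needed for one code point (hypothetical helper).
size_t utf8Length(dchar c) { return codeLength!char(c); }

/// True if `s` is well-formed UTF-8 (hypothetical helper).
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    assert(utf8Length('A') == 1);          // ASCII
    assert(utf8Length('\u00E9') == 2);     // e with acute accent
    assert(utf8Length('\u3042') == 3);     // hiragana letter A
    assert(utf8Length('\U0001D11E') == 4); // outside the BMP

    // Slicing by byte index can split a multi-byte sequence.
    assert(isValidUtf8("\u3042"));
    assert(!isValidUtf8("\u3042"[0 .. 1]));
}
```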

I agree with Mr. Elias Martenson's opinion. I think that he understands many
languages. But I have no idea which encoding is best for a string class to
use.


With best kind regards.

Dec 02 2003

Elias Martenson <elias-m algonet.se> writes:

Keisuke UEDA wrote:

 Thank you for replying.
 
 I think that UTF-8 should not be handled directly.
 
 UTF-8 is complicated enough that users may make mistakes. ASCII characters
 (from 0x00 to 0x7f) are encoded as 1 byte, so programmers who use only ASCII
 characters do not need to distinguish ASCII from UTF-8. But characters from
 2-byte character sets (such as Japanese Shift JIS and Chinese Big5) are
 encoded as multiple bytes. A certain character is encoded to 15 bytes. Unicode
 strings will be corrupted if they are not handled correctly. Many programmers
 do not know the circumstances of foreign languages well. I think that encoded
 text data should be wrapped and programmers should use it indirectly.

Yes, agreed. Looking at much of the C code today, we see that even though
these issues are well known, very few people bother doing it right unless
the language makes it natural to do so.

I am currently working on a combined C++ and Java project. The product
will be exported to various countries using different languages. Still,
the developers who were doing the bulk of the work before me completely
disregarded the problem. Trying to fix the code now has proven to be
more or less impossible and is one of the reasons we have to rewrite a lot
of it.

The attitude of the developers was "we'll fix the language issues when
we do the translation". Suffice it to say that it wasn't that easy.

If we let the 8-bit char[] type be the "official" string type in D, then
the exact same mistakes that were made in C and C++ will be made again,
and I really don't want to see that happen. I really have high hopes for
D, it does so many things just "right", but this is a potential dealbreaker.

 I agree with Mr. Elias Martenson's opinion. I think that he understands many
 languages. But I have no idea which encoding is best for a string class to
 use.

Yes, I have basic skills in a lot of languages, and I have real needs to 
use multiple languages. However, when I read what you had to say I 
realise you also have good skills in this subject.

Regards

Elias

Dec 03 2003

Chris Paulson-Ellis <chris edesix.com> writes:

Elias Martenson wrote:

 If we let the 8-bit char[] type be the "official" string type in D, then 
 the exact same mistakes that were made in C and C++ will be made again,
 and I really don't want to see that happen. I really have high hopes for 
 D, it does so many things just "right", but this is a potential 
 dealbreaker.

There is an implicit assumption in these arguments that making char 32 
bits wide will magically make programs easier to internationalise. I 
don't buy this. Even 32 bit characters are not 'characters' in the old
fashioned ASCII sense. What about the combining diacritic marks, etc.?

What is really needed is a debate about which of the C Unicode libraries 
out there is the most appropriate for inclusion in the D library (via a 
wrapper to support the D string functionality, I assume).

Does anyone out there have experience with these libraries? I don't :-(.

Chris.

Dec 03 2003

Elias Martenson <elias-m algonet.se> writes:

Chris Paulson-Ellis wrote:

 Elias Martenson wrote:
 
 If we let the 8-bit char[] type be the "official" string type in D, 
 then the exact same mistakes that were made in C and C++ will be made
 again, and I really don't want to see that happen. I really have high 
 hopes for D, it does so many things just "right", but this is a 
 potential dealbreaker.

 
 
 There is an implicit assumption in these arguments that making char 32 
 bits wide will magically make programs easier to internationalise. I 
 don't buy this. Even 32 bit characters are not 'characters' in the old
 fashioned ASCII sense. What about the combining diacritic marks, etc.?

You are right of course. There are a number of issues, not only the 
diacritics, but also upper/lowercasing, equivalence (diacritic or 
combined char) etc...
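
To make the diacritics and equivalence issue concrete, here is a sketch
assuming present-day D and the Phobos module std.uni, which did not exist in
this form at the time of the thread. The helper names graphemeCount and
canonicallyEqual are hypothetical. Even in a 32-bit string, an "e" followed by
a combining acute accent is two code points but one user-perceived character.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

/// Number of user-perceived characters (grapheme clusters) in `text`.
size_t graphemeCount(const(dchar)[] text)
{
    return text.byGrapheme.walkLength;
}

/// True if both strings are equal after canonical composition (NFC).
bool canonicallyEqual(const(dchar)[] a, const(dchar)[] b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    dstring decomposed = "e\u0301"d; // 'e' + COMBINING ACUTE ACCENT
    dstring precomposed = "\u00E9"d; // LATIN SMALL LETTER E WITH ACUTE

    assert(decomposed.length == 2);        // two code points...
    assert(graphemeCount(decomposed) == 1); // ...one visible character
    assert(decomposed != precomposed);      // element-wise comparison fails
    assert(canonicallyEqual(decomposed, precomposed));
}
```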

However, just because a 32-bit char doesn't solve everything doesn't 
mean that we should just give up and use 8-bit chars for the basic 
string type.

Unicode works with code points. The concept of what a "character" is is 
well defined by Unicode as "The basic unit of encoding for the Unicode 
character encoding". It's only natural to actually deal with these for 
the char type. Don't you think?

An example:

     char[] myString = ...;

     for (int c = 0 ; c < myString.length ; c++) {
         if (myString[c] == '€') {
             // found the euro sign
         }
     }

This is perfectly reasonable code that doesn't work in D today, but 
would work if chars were 32-bit. Code like this should work in my opinion.

 What is really needed is a debate about which of the C Unicode libraries 
 out there is the most appropriate for inclusion in the D library (via a 
 wrapper to support the D string functionality, I assume).

The functionality is needed. Basing it off of existing libraries is a
good thing. Just grabbing them and including them in D is not. The
functionality provided by the string management functions in the D
standard library should be natural and helpful. In particular, it should
make the need to individually access characters as small as possible.

 Does anyone out there have experience with these libraries? I don't :-(.

As in "have I written real code using them"? No. The IBM libraries are 
good, that much I know. Might be worth taking a deeper look at.

     http://oss.software.ibm.com/icu/index.html

We have to keep in mind though, that even if all these features are 
included in the standard library, if they aren't easy and natural to use 
we will still see a lot of applications doing the wrong thing.

In my mind, Java is very good at making Unicode natural to use. But it 
can be better. That's why I'm soapboxing on this forum.

Regards

Elias

Dec 03 2003

"Walter" <walter digitalmars.com> writes:

"Elias Martenson" <elias-m algonet.se> wrote in message
news:bqkd9l$24fv$1 digitaldaemon.com...
 An example:

      char[] myString = ...;

      for (int c = 0 ; c < myString.length ; c++) {
          if (myString[c] == '€') {
              // found the euro sign
          }
      }

 This is perfectly reasonable code that doesn't work in D today, but
 would work if chars were 32-bit. Code like this should work in my opinion.

It will work as written if you define myString as dchar[] rather than
char[].
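
A sketch of both options, assuming present-day D (the function names
indexOfEuro and countEuros are hypothetical): indexing a dchar[] compares
whole code points, and a foreach loop with a dchar loop variable decodes a
UTF-8 char[] on the fly.

```d
/// Index of the first euro sign in a UTF-32 string, or -1 if absent.
ptrdiff_t indexOfEuro(const(dchar)[] text)
{
    foreach (i, c; text)
    {
        if (c == '\u20AC')
            return cast(ptrdiff_t) i;
    }
    return -1;
}

/// Counts euro signs in UTF-8 text; asking foreach for a dchar makes
/// the loop decode each multi-byte sequence into one code point.
size_t countEuros(const(char)[] text)
{
    size_t count;
    foreach (dchar c; text)
    {
        if (c == '\u20AC')
            ++count;
    }
    return count;
}

unittest
{
    assert(indexOfEuro("5 \u20AC"d) == 2);
    assert(indexOfEuro("none"d) == -1);
    assert(countEuros("\u20AC1 and \u20AC2") == 2);
}
```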

Jan 03 2004

Ilya Minkov <Ilya_member pathlink.com> writes:

In article <bqhkrg$11s8$1 digitaldaemon.com>, Keisuke UEDA says...
Hello. I've read the D language specification and "D Strings vs C++ Strings". I 
thought that D strings are not international strings.

D Strings are Unicode UTF-8. It is enough for internationalised exchange within
a program, but for processing we need a cursor struct which would allow to
extract Unicode UTF-32 characters, and its counterpart to create strings.
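
A minimal sketch of such a cursor and its counterpart, assuming present-day D
and the Phobos functions std.utf.decode and std.utf.encode. The names
Utf8Cursor and fromCodePoints are hypothetical and not an existing library
API.

```d
import std.utf : decode, encode;

/// Walks a UTF-8 string one code point at a time (hypothetical type).
struct Utf8Cursor
{
    const(char)[] text;
    size_t pos; // byte offset of the next undecoded code point

    bool empty() const { return pos >= text.length; }

    /// Decodes the code point at `pos` and advances past it.
    /// Throws a UTFException if the bytes are not valid UTF-8.
    dchar next() { return decode(text, pos); }
}

/// Counterpart: builds a UTF-8 string from UTF-32 code points.
string fromCodePoints(const(dchar)[] points)
{
    char[] result;
    foreach (c; points)
        encode(result, c); // appends 1 to 4 bytes per code point
    return result.idup;
}

unittest
{
    auto cursor = Utf8Cursor("a\u20AC");
    assert(cursor.next() == 'a');
    assert(cursor.next() == '\u20AC');
    assert(cursor.pos == 4); // 1 byte for 'a' plus 3 for the euro sign
    assert(cursor.empty);

    assert(fromCodePoints("a\u20AC"d) == "a\u20AC");
}
```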

-eye

Dec 02 2003

Elias Martenson <elias-m algonet.se> writes:

Ilya Minkov wrote:

 In article <bqhkrg$11s8$1 digitaldaemon.com>, Keisuke UEDA says...
 
Hello. I've read the D language specification and "D Strings vs C++ Strings". I 
thought that D strings are not international strings.

 
 D Strings are Unicode UTF-8. It is enough for internationalised exchange within
 a program, but for processing we need a cursor struct which would allow to
 extract Unicode UTF-32 characters, and its counterpart to create strings.

That is all fine, but that struct needs to be the natural way of dealing
with characters. Arrays of 8-bit entities are too appealing for people
who speak only Western languages to use for manual string
manipulation. I know, I see it all the time.

Regards

Elias

Dec 02 2003
