
D - string with encoding( suggestion )

reply Keisuke UEDA <Keisuke_member pathlink.com> writes:
Hello. I've read the D language specification and "D Strings vs C++ Strings", and I
do not think that D strings are international strings. Strings should be
independent of encoding, but D strings are arrays of char and resemble C strings.
Array classes and string classes are different concepts, so an encoding should be
specified whenever a string is made from an array of char, and again whenever an
array of char is taken out of a string. No character encoding should be assumed
implicitly.

If the string class is internationalized, a programmer who does not know the
encoding of a foreign language need not introduce a bug.

In practice, a string class probably has to use an existing encoding such as
Unicode. For example, the string classes of Java and Objective-C (NSString) use
Unicode internally.
Dec 02 2003
next sibling parent reply Elias Martenson <elias-m algonet.se> writes:
Keisuke UEDA wrote:

 Hello. I've read the D language specification and "D Strings vs C++ Strings".
I 
 thought that D strings are not international strings. I think that strings 
 should be independent of encoding, but D strings are array of char and
resemble 
 C strings. I think that array classes and string classes are different 
 concepts, so we should specify encoding, in case we make string from array of 
 char. And we should specify encoding, when we take out array of char from 
 string. We should not assume tacit encoding to a character encoding.
 
 If string class is internationalized, even if a certain programmer does not 
 know the encoding of a foreign language, it cannot be necessary to make a bug.
 
 Probably, as an actual problem, string class cannot but use the existing 
 encodings, such as Unicode. For example, in the  string class of Java and 
 Objective-C(NSString), Unicode is used internally.

I agree almost completely with this. Also, the three different char types of D scared me a bit. I don't think it should come as a surprise when we see developers use char-arrays exclusively, and even though the documentation states that these arrays are supposed to use UTF-8 encoding, we will see a lot of people doing stuff like:

    char foo = bar[0]; // bar is an array of char

The above is completely useless at best, or quite possibly illegal, depending on the contents of the char-array. What's worse is that the bug will probably only manifest itself when people try to use international characters in the array. Just think about what would happen if bar contained the string "€100".

In fact, I feel that the char and wchar types are useless in that they serve no practical purpose. The documentation says "wchar - unsigned 8 bit UTF-8". The only UTF-8 encoded characters that fit inside a char are those with a code point less than 128, i.e. ASCII characters. This makes me wonder what possible use there can be for the char and wchar types? When manipulating individual characters you absolutely need a data type that is able to hold any character. Either the char and wchar types should be dropped, or the documentation should be clear that a char is ASCII only.

I have worked for a very long time with internationalisation issues, and anyone who ever tried to fix a C program to do full Unicode everywhere knows how painful that can be. Actually, even writing new Unicode-aware code in C can be really difficult. The Unicode support in D seems not to be very well thought through, and I feel that it needs to be fixed before it's too late.

I would very much like to do what I can to help out. At the very least share my experiences and knowledge on the subject.

Regards Elias
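
To make the bar[0] problem concrete, here is a minimal sketch in C (C rather than D, to match the C snippet used elsewhere in this thread). It assumes bar holds the UTF-8 bytes of "€100"; the helper name utf8_first is made up for illustration and is not part of any D or C library. It also does not reject overlong encodings, so it is not a full validator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the first code point of a NUL-terminated UTF-8 string.
   Stores the number of bytes consumed in *len.
   Returns U+FFFD (replacement character) on malformed input.
   Hypothetical helper, for illustration only. */
uint32_t utf8_first(const char *s, size_t *len)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t cp;
    size_t n, i;

    assert(s != NULL && len != NULL);

    if (p[0] < 0x80)                { cp = p[0];        n = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; n = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; n = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; n = 4; }
    else { *len = 1; return 0xFFFD; }  /* stray continuation or invalid byte */

    for (i = 1; i < n; i++) {
        /* Every following byte must look like 10xxxxxx; a NUL fails this too. */
        if ((p[i] & 0xC0) != 0x80) { *len = i; return 0xFFFD; }
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *len = n;
    return cp;
}
```

With bar set to the bytes E2 82 AC 31 30 30, bar[0] is the lead byte 0xE2, which is not a character on its own, while utf8_first(bar, &n) yields U+20AC and consumes three bytes.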
Dec 02 2003
next sibling parent reply I <I_member pathlink.com> writes:
In article <bqhnd6$15f8$1 digitaldaemon.com>, Elias Martenson says...

In fact, I feel that the char and wchar types are useless in that they 
serve no practical purpose. The documentation says "wchar - unsigned 8 
bit UTF-8". The only UTF-8 encoded characters that fit inside a char is 
a one with a code point less than 128, i.e. an ASCII character.

It must be a bug in documentation. Under Windows wchar = UTF16 and under Linux wchar = UTF32.
This makes me wonder what possible use there can be for the char and 
wchar types? When manipulating individual characters you absolutely need 
a data type that is able to hold any character.

wchar can hold any character that is allowed in the operating system. See above.
I have worked for a very long time with internationalisation issues, and 
anyone who ever tried to fix a C program to do full Unicode everywhere 
knows how painful that can be. Actually, even writing new Unicode-aware 
code in C can be real difficult. The Unicode support in D seems not to 
be very well thought through, and I feel that it needs to be fixed 
before it's too late.

We have standard conversions, don't we? Yet much to be done though.
I would very much like to do what I can to help out. At the very least 
share my experiences and knowledge on the subject.

Write a library, and let the standard library workgroup (currently defunct) take it in. I propose that streams and strings are somehow unified, which would allow both to format strings and to iterate through them. -eye
Dec 02 2003
next sibling parent reply Elias Martenson <elias-m algonet.se> writes:
I wrote:

 In article <bqhnd6$15f8$1 digitaldaemon.com>, Elias Martenson says...
 
In fact, I feel that the char and wchar types are useless in that they 
serve no practical purpose. The documentation says "wchar - unsigned 8 
bit UTF-8". The only UTF-8 encoded characters that fit inside a char is 
a one with a code point less than 128, i.e. an ASCII character.

It must be a bug in documentation. Under Windows wchar = UTF16 and under Linux wchar = UTF32.

Well, if that's the case then it's even worse. In C and C++, Windows uses a 16-bit entity for wchar_t, which can cause a lot of grief since it requires you to deal with surrogate pairs more or less manually. Java has the same problems. They tried to deal with it in JDK 1.5, but individual manipulation of characters is still very painful.

I fail to see any good arguments for having char be anything else than a 32-bit type. The two arguments that do exist are:

1) Storage. A 32-bit char is 4 times as large as an 8-bit char.

Counter argument: Individual chars should be 32 bits. You could have two string types, one UTF-32 and one UTF-8 version, both of which could have identical interfaces. One would be fast, the other would be small.

2) Interoperability with legacy APIs (i.e. linking with C and C++)

Counter argument: I can sort of agree with this one. Interoperability is necessary, but it should not dictate the implementation. Perhaps a type called legacy_char or something like that. At least it would prevent programmers from writing new code that uses this type unless they really need it?
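
For readers who have not met surrogate pairs, the following C sketch shows what "dealing with them manually" amounts to when a 16-bit unit is the character type. The function name utf16_encode is an assumption made for this example, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written,
   or 0 if cp is not a valid Unicode scalar value.
   Hypothetical helper, for illustration only. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* reserved for surrogates */
    if (cp > 0x10FFFF) return 0;                 /* beyond the Unicode range */

    if (cp < 0x10000) {          /* fits in a single 16-bit unit */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;               /* 20 bits remain, split 10 + 10 */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

A code point such as U+1D11E needs two units (D834 DD1E), so indexing a 16-bit array at a single position can land on half of a character.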
This makes me wonder what possible use there can be for the char and 
wchar types? When manipulating individual characters you absolutely need 
a data type that is able to hold any character.

wchar can hold any character that is allowed in the operating system. See above.

Windows allows full Unicode 3.1 without problems. You do have to jump through some hoops to use Unicode code points above 64K though, but that's caused by the legacy APIs, which were designed in an era when all Unicode values fit in 16 bits. This is (since 3.1) no longer the case, and sticking with that convention because of argument 2 above is not a good thing.
I have worked for a very long time with internationalisation issues, and 
anyone who ever tried to fix a C program to do full Unicode everywhere 
knows how painful that can be. Actually, even writing new Unicode-aware 
code in C can be real difficult. The Unicode support in D seems not to 
be very well thought through, and I feel that it needs to be fixed 
before it's too late.

We have standard conversions, don't we?

We sure do. But in D, many of those conventions are (fortunately) not set in stone yet, and can be fixed.
 Yet much to be done though.

I can agree with this one.
I would very much like to do what I can to help out. At the very least 
share my experiences and knowledge on the subject.

Write a library, and let the standard library workgroup (currently defunct) take it in.

Thanks for trusting me. I'd love to help out with exactly that. However, a single person can not create the perfect Unicode-aware string library. This can be proved by looking at the mountain of mistakes made when designing Java, a language that still can pride itself by being one of the best in terms of Unicode-awareness. They had to fix a lot of things along the way though, and the standard library is still riddled with legacy bugs that can't be fixed because of backward compatibility issues.
 I propose that streams and strings are somehow unified, which would allow both
 to format strings and to iterate through them.

I sort of agree with you. Although there should be a distinction in the way that Java did it (which I believe is what the original poster requested):

- A string should be a sequence of Unicode characters. That's it. Nothing more, nothing less.
- A stream should provide an externalisation interface for strings. It should be the responsibility of the stream to provide encoding and decoding of the external encoding to and from Unicode.

This is what Java has done, but they still managed to mess it up originally. I think one of the reasons for this is that the people who designed it didn't understand Unicode and encodings entirely.

Originally I wasn't interested in this at all, and made a lot of mistakes. I'm Swedish, and while I do have the need for non-ASCII characters, I didn't understand the requirements of Unicode until I started studying Mandarin Chinese, and then Russian. Since my GF is Russian I now see a lot of the problems caused by badly written code, and I see D as an opportunity to lobby for a use of Unicode in a way that minimises the opportunities to write code that only works with English.

Regards Elias
Dec 02 2003
next sibling parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Elias Martenson wrote:
 I fail to see any good arguments for having char be anything else than a 
 32-bit type. The two arguments that do exist are:

Well, I think the problem is that many (western) programmers have an ASCII bias. Most realize that Unicode is the best way to be able to write international code, but they don't want to change their existing code base. So UTF-8 seems like a nice solution - you have Unicode support on paper, but you can still treat everything as ASCII.

The problem, however, is that UTF-8 gets really complicated when you have non ASCII strings. You cannot index the string directly, you need to decode multiple code points in a loop to get a single character, you have to deal with invalid code point sequences... the list goes on.

UTF-16 is a little better if you know you won't ever have any surrogate pairs in there, but in my opinion that's a short-sighted view. These kinds of assumptions have a tendency to be proven wrong just when you are in a position where changing the encoding is not possible anymore.

In my opinion, memory strings should simply be UTF-32. Easy indexing and manipulation for everyone, no "discrimination" of multibyte languages. But, I realise that some people do not share this view. And of course there's the problem of interacting with legacy code (e.g. printf and all the other C stuff, or the UTF-16 Windows API).

Which really only leaves one solution: we need an abstract string class with implementations for UTF-8, UTF-16 and UTF32. That way you can choose the best encoding, depending on your needs. And such classes could also take care of the hassles of UTF-8 and UTF-16 decoding/encoding/manipulation.

An abstract class would even allow users to add their own encoding, which is necessary if legacy code is not ASCII, but one of the other few dozen codepages that are popular around the world.

And last, but not least, I think the D character type should always be 32 bit. Then it would be a real, decoded Unicode character, not a code point. Since the decoding is done internally by the string classes, there is really no need to have different character sizes.

Hauke
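
The remark that a UTF-8 string cannot be indexed directly can be sketched in C as follows. Finding the n-th character means walking the bytes from the start, so what is a constant-time array access in UTF-32 becomes a linear scan. The name utf8_at is hypothetical, and the sketch assumes well-formed, NUL-terminated input.

```c
#include <assert.h>
#include <stddef.h>

/* Return a pointer to the first byte of the n-th code point (0-based)
   in a NUL-terminated UTF-8 string, or NULL if the string has fewer
   than n + 1 code points. Assumes well-formed UTF-8.
   Hypothetical helper, for illustration only. */
const char *utf8_at(const char *s, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(s != NULL);

    while (*p != 0) {
        /* Bytes of the form 10xxxxxx continue a character; anything
           else starts a new one. */
        if ((*p & 0xC0) != 0x80) {
            if (n == 0)
                return (const char *)p;
            n--;
        }
        p++;
    }
    return NULL;
}
```

For the bytes 61 C3 A5 62 ("aåb"), character 2 starts at byte offset 3, not 2.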
Dec 02 2003
parent reply Elias Martenson <elias-m algonet.se> writes:
Hauke Duden wrote:
 Elias Martenson wrote:

    [ a lot of very good reasoning snipped for space ]

 Which really only leaves one solution: we need an abstract string class 
 with implementations for UTF-8, UTF-16 and UTF32. That way you can 
 choose the best encoding, depending on your needs. And such classes 
 could also take care of the hassles of UTF-8 and UTF-16 
 decoding/encoding/manipulation.
 
 An abstract class would even allow users to add their own encoding, 
 which is necessary if legacy code is not ASCII, but one of the other few 
 dozen codepages that are popular around the world.

Agreed. This is a very good suggestion, and it overlaps to a large degree with my ideas. Taking your reasoning a little further, this means we have a need for:

- An interface that represents a string (called "String"?)
- Three concrete implementations of said interface: UTF8String, UTF16String and UTF32String (or perhaps String8 etc...)
- Yet another implementation called NativeString that implicitly uses the encoding of the environment that the program is running in. In Unix this would look at the environment variable LC_CTYPE.
- A comprehensive set of string manipulation classes and methods that work with the String interface.
- Making sure the external interfaces of all std-classes use String instead of char arrays.
- Removing char and wchar, and renaming dchar to char. The old "char" was all wrong anyway since UTF-8 is defined as being a byte sequence, so we already have the types byte and short.

All this is needed. wchar and dchar arrays are useless today anyway, since, from what I can tell, external interfaces seem to be using char[] for strings. If you decide you want to work with proper chars (i.e. dchar) you have to do UTF-32 <-> UTF-8 conversions on every method call that involves strings. Not a good thing, and an effective way of preventing any proper use of Unicode.

Besides, UTF-8 is highly inefficient for many operations. Its only advantages are the small size of mostly-ASCII data and compatibility with ASCII. Internally, 32-bit strings should be used.
 And last, but not least, I think the D character type should always be 
 32 bit. Then it would be a real, decoded Unicode character, not a code 
 point. Since the decoding is done internally by the string classes, 
 there is really no need to have different character sizes.

I think I agree with you, but I'm not sure what you mean by "real, decoded Unicode character, not a code point"? If you are referring to the bytes that make up a UTF-8 character, then I agree with you (but that's not called a code point).

A code point is an individual character "position" as defined by Unicode. Are you saying that the "char" type should be able to hold a complete composite character, including combining diacritical marks? In that case I don't agree with you, and no other languages even attempt this.

Regards Elias
Dec 03 2003
parent Hauke Duden <H.NS.Duden gmx.net> writes:
Elias Martenson wrote:
 Hauke Duden wrote:
 And last, but not least, I think the D character type should always be 
 32 bit. Then it would be a real, decoded Unicode character, not a code 
 point. Since the decoding is done internally by the string classes, 
 there is really no need to have different character sizes.

I think I agree with you, but I'm not sure what you mean by "real, decoded Unicode character, not a code point"? If you are referring to the bytes that make up a UTF-8 character, then I agree with you (but that's not called a code point).

Never write such a thing when you're tired! ;) Where I wrote code point, I meant "encoding elements", i.e. bytes for UTF8 and 16 bit ints for UTF-16.
 A code point is an individual character "position" as defined by 
 Unicode. Are you saying that the "char" type should be able to hold a 
 complete composite character, including combining diacritical marks? 
 In that case I don't agree with you, and no other languages even attempt 
 this.

God, no. Combining marks to what the end user thinks of as a "character" needs to be done on another layer, probably just before or while they are printed. Hauke
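
The distinction between a code point and what the end user sees as a character can be sketched in C. The letter "é" may be stored as the single code point U+00E9, or as "e" (U+0065) followed by the combining acute accent U+0301. The sketch below is a deliberate simplification: it treats only the Combining Diacritical Marks block (U+0300 to U+036F) as combining, whereas real Unicode data defines many more such ranges. Both function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified test: only the Combining Diacritical Marks block.
   A real implementation would consult the Unicode character database. */
int is_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Count the code points that are not combining marks, which roughly
   approximates the number of characters an end user would see.
   Hypothetical helper, for illustration only. */
size_t count_base_chars(const uint32_t *cps, size_t n)
{
    size_t i, count = 0;

    assert(cps != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (!is_combining_mark(cps[i]))
            count++;
    }
    return count;
}
```

Both {U+00E9} and {U+0065, U+0301} count as one visible character here, even though a 32-bit char type sees one element in the first case and two in the second.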
Dec 03 2003
prev sibling parent reply J C Calvarese <jcc7 cox.net> writes:
Elias Martenson wrote:

 I wrote:
 
 In article <bqhnd6$15f8$1 digitaldaemon.com>, Elias Martenson says...


 I would very much like to do what I can to help out. At the very 
 least share my experiences and knowledge on the subject.

Write a library, and let the standard library workgroup (currently defunct) take it in.

Thanks for trusting me. I'd love to help out with exactly that. However, a single person can not create the perfect Unicode-aware string library.

I think everyone would agree that the task would be a large one. Since Walter is only one person, you might judge the string-handling functions he developed to be simplistic. (For my purposes they're fine, but I've never worked with Unicode.) If someone (or a group of people) offered to supply some more comprehensive functions/classes, I think he'll accept donations. Personally, I know next to nothing about Unicode, so your discussion is way over my head. I've noted similar criticisms before and I suspect D's library is somewhat lacking in this area. I don't think the fundamental (C-inspired) types need to get any more complicated, but I think a fancy (Java-like) String class could help handle most of the messy things in the background. Justin
Dec 02 2003
parent reply Elias Martenson <elias-m algonet.se> writes:
J C Calvarese wrote:

 I think everyone would agree that the task would be a large one. Since 
 Walter is only one person, you might judge the string-handling functions 
 he developed to be simplistic.

Yes they are. However, like I said, no single person can do this right. A lot of very bright people (don't ask me to compare them to Walter, I don't know him, nor them :-) ) worked on Java and they made dozens of very bad mistakes in 1.0. Not until 1.5 (which isn't released yet) are they catching up.
 (For my purposes they're fine, but I've 
 never worked with Unicode.)

Exactly. Not to offend anyone by asking about nationality, but I assume you are an English-only speaker? This is a common situation. One uses the most obvious tools at hand (which in both C and D are char-arrays) and everything works fine. Later, when it's time to localise the application, you realise you should have used char_t (dchar) instead. Congratulations on the task of changing all that code.

In fact, you don't even have to localise your app to end up with problems. My last name is actually Mårtenson, and in UTF-8 the second character encodes into two bytes. Guess what happens when a naïve implementation runs the following code to verify my name? Try to count the number of errors:

    for (int c = 0 ; c < strlen(name) ; c++) {
        char ch = name[c];
        // make uppercase chars into lowercase
        if (isupper(ch)) {
            name[c] += 'a' - 'A';
        }
    }

    // print the name with a series of stars below it
    printf ("%s\n", name);
    for (int c = 0 ; c < strlen(name) ; c++) {
        putchar ('*');
    }
    putchar ('\n');

Code like the one above is not particularly uncommon to see. Hell, I even wrote code like it myself. Not all of the above problems will be fixed if char is a 32-bit entity (at least one will remain), but most of the work will be done already.
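
As a rough illustration of two of the errors in that snippet, the C sketch below counts characters rather than bytes and lowercases only ASCII letters, leaving the bytes of multi-byte sequences alone. It assumes name holds well-formed UTF-8; utf8_count and ascii_lower are hypothetical helper names. It does not address lowercasing of non-ASCII letters, which needs real Unicode case mapping tables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
   Unlike strlen, this gives the number of stars needed under the name.
   Assumes well-formed UTF-8. Hypothetical helper. */
size_t utf8_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Lowercase ASCII letters only. Bytes >= 0x80 belong to multi-byte
   sequences and are never in the range 'A'..'Z', so they stay intact.
   Hypothetical helper. */
void ascii_lower(char *s)
{
    assert(s != NULL);

    for (; *s != '\0'; s++) {
        if (*s >= 'A' && *s <= 'Z')
            *s = (char)(*s + ('a' - 'A'));
    }
}
```

For "Mårtenson" stored as UTF-8, strlen reports 10 bytes while utf8_count reports 9 characters, so the original loop would print one star too many.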
 If someone (or a group of people) offered to 
 supply some more comprehensive functions/classes, I think he'll accept 
 donations.

I would like to help out in such a group, but I certainly cannot do it myself. For one, I'm not skilled enough in D to even use correct "D-style" everywhere.
 Personally, I know next to nothing about Unicode, so your discussion is 
 way over my head.  I've noted similar criticisms before and I suspect 
 D's library is somewhat lacking in this area.

Indeed it is. Unicode has to be in from the start. Again and again we have seen languages struggle when trying to tack on Unicode after the fact. C, C++, Perl and PHP are just a few examples.
 I don't think the fundamental (C-inspired) types need to get any more 
 complicated, but I think a fancy (Java-like) String class could help 
 handle most of the messy things in the background.

They shouldn't be more complicated syntax-wise or implementation-wise? Syntax-wise they already are extremely convoluted, at least if your intention is to write a program that works properly with Unicode. You need to manage the UTF-8 encoding yourself, which is actually next to impossible to get right.

Implementation-wise, I can conceive of an implementation where you still declare a string as a char[] (char being 32-bit of course) and then the compiler has special knowledge about this type such that it actually can implement it differently behind the scenes. An extremely poor example:

    char[] utf32string;
    char[] #UTF16 utf16string;
    char[] #UTF8 utf8string;
    char[] #NATIVE nativeString;

The above mentioned four strings would behave identically and it should be possible to use them interchangeably. In all cases s[0] would return a 32-bit char.

This cannot be stressed enough: Being able to dereference individual components of a UTF-8 or UTF-16 string is a recipe for failure. There are hardly any situations where anyone needs this. The only time would be a UTF-8 en/decoder but that functionality should be built in anyway.

Regards Elias
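
One way to picture "the same s[0] regardless of storage" is the C sketch below. It is only an illustration of the idea, not a proposal for how a D compiler would do it: the String struct, the Encoding enum and string_at are all invented for this example, input is assumed to be well-formed, and indexing is a linear scan for the variable-width encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ENC_UTF8, ENC_UTF16, ENC_UTF32 } Encoding;

/* Hypothetical string value: raw code units plus a tag saying how
   they are encoded. */
typedef struct {
    Encoding enc;
    const void *data;  /* code units of the given encoding */
    size_t units;      /* number of code units, not characters */
} String;

/* Read the code point starting at unit index *i and advance *i past it.
   Assumes well-formed input. */
static uint32_t next_cp(const String *s, size_t *i)
{
    assert(s != NULL && i != NULL);

    if (s->enc == ENC_UTF32)
        return ((const uint32_t *)s->data)[(*i)++];

    if (s->enc == ENC_UTF16) {
        const uint16_t *u = s->data;
        uint32_t hi = u[(*i)++];
        if (hi >= 0xD800 && hi <= 0xDBFF && *i < s->units) {
            uint32_t lo = u[(*i)++];
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
        return hi;
    }

    {   /* UTF-8 */
        const unsigned char *b = s->data;
        uint32_t c = b[(*i)++];
        int extra = c < 0x80 ? 0 : c < 0xE0 ? 1 : c < 0xF0 ? 2 : 3;
        if (extra > 0)
            c &= 0x3Fu >> extra;  /* keep the payload bits of the lead byte */
        while (extra-- > 0 && *i < s->units)
            c = (c << 6) | (b[(*i)++] & 0x3F);
        return c;
    }
}

/* The equivalent of s[index]: always a 32-bit code point.
   Returns 0xFFFFFFFF when index is out of range. */
uint32_t string_at(const String *s, size_t index)
{
    size_t i = 0;

    assert(s != NULL);

    while (i < s->units) {
        uint32_t cp = next_cp(s, &i);
        if (index-- == 0)
            return cp;
    }
    return 0xFFFFFFFFu;
}
```

With "€100" stored three ways, string_at(&s, 0) returns U+20AC for all of them, which is the interchangeability described above.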
Dec 03 2003
parent reply "Roald Ribe" <rr.no spam.teikom.no> writes:
"Elias Martenson" <elias-m algonet.se> wrote in message
news:bqkc8e$233i$1 digitaldaemon.com...

[snip]

 This cannot be stressed enough: Being able to dereference individual
 components of a UTF-8 or UTF-16 string is a recipe for failure. There
 are hardly any situations where anyone needs this. The only time would
 be a UTF-8 en/decoder but that functionality should be built in anyway.

I will just add moral support to all that E.M. said above. Even in languages written with chars fitting into 8 bits, the current situation is a mess, in ALL computer languages, but Java is getting there (slowly).

We have several groups of "customers" here:

- Native English-speaking developers (with a large industrialized "home" market)
- "8-bits is enough" developers, "most of rest of industrialized world" (still has problems with different codepages though)
- "16-bits is enough" developers. Accounts for most of the rest of developers (as of today) but not all.
- "32-bit rules". Groups of all of the above with experience and insight, plus a few not fitting in the 16-bit limitations.

If D is to be seen as a 1st choice language around the world, it has to *enforce* UNICODE for strings. If we all agree on that (not today maybe, but what about 5 years from now?) a design should be selected that caters for ALL, before the language hits 1.0. If not, it will just be "one more language", no matter what other nice features it has. This can really set it apart.

32 bit UNICODE chars resolve all these problems. The only "problem" they generate is more memory used. With the RAM prices (still declining) we have today, I refuse to seriously consider it a problem.

How about making this THE focus point of D?
"D 1st international language"
"D 1st language with real international focus"
Certainly an easy to communicate advantage, as all the marketdroids I ever worked with seem to want.

For those who know next to nothing about these issues, I would recommend: http://www.cl.cam.ac.uk/~mgk25/unicode.html as an introduction to some of the issues (but I disagree with their reason for basing efforts around UTF-8).

Those who decide to take this on should:

1. Grab the String interface from Java 1.5
2. Learn from the other Java based String implementation performance improvers.
3. Use as much as possible from the IBM lib for UNICODE.

So much to do, never enough time...

This feature alone has the potential to make D a language with very widespread use, and should be a very nice "hook" for getting people to write about it.

Roald
Dec 03 2003
next sibling parent Elias Martenson <no spam.spam> writes:
Den Wed, 03 Dec 2003 17:11:54 +0100 skrev Roald Ribe:

 For those who knows next to nothing about these issues, I
 would recommend: http://www.cl.cam.ac.uk/~mgk25/unicode.html
 as an introduction to some of the issues (but I disagree
 with their reason for basing efforts around UTF-8).

Well, the efforts he refers to are about external data representation in the operating system. UTF-8 is the only reasonable encoding to use, or everything would break. This is because Unix is so focused around ASCII files. UTF-8 is a very neat way to go Unicode without breaking anything.
 Those who decides to take this on should:
 1. Grab the String interface from Java 1.5
 2. Learn from the other Java based String implementation
    performance improvers.
 3. Use as much as possible from the IBM lib for UNICODE.

Agreed. This is very important. However, there are some fundamental changes that need to be made to the language, some of which are going to break a lot of existing code (although some people would argue that that code is already broken since it doesn't handle Unicode). Does Walter have anything to say on this subject?
 So much to do, never enough time...

Again, agreed. However, I believe it needs to be done now, because fixing it later will be next to impossible.
 This feature alone has the potential to make D a
 language with very widespread use, and should be a very
 nice "hook" for getting people to write about it.

Indeed. But as some other person already mentioned, there is a lot more to be done than just getting 32-bit chars. It's the first step though. Regards Elias
Dec 03 2003
prev sibling parent "Walter" <walter digitalmars.com> writes:
"Roald Ribe" <rr.no spam.teikom.no> wrote in message
news:bql1dn$30vm$1 digitaldaemon.com...
 32 bit UNICODE chars, resolves all these problems. The only
 "problem" it generates, is more memory used. With the RAM
 prices (still declining) we have today, I refuse to seriously
 consider it a problem.

I agree with you that internationalization and support for unicode is of very large importance for the success of D. D's current support for it is weak, but I think that is more a library issue than the language definitional one. I've written a large server app that was internally 100% UTF-32. It used all available memory and then went into swap. The 4x memory consumption from UTF-32 cost plenty in performance, if I'd have done it with UTF-8 I'd have had a happier customer. The reality is that D must offer the programmer a choice in using UTF-8, UTF-16, and UTF-32, and make it easy to use either, as the optimal representation for one app will be suboptimal for the next. Since D is a systems programming language, coming off poorly on benchmarks is not affordable <g>.
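
The memory trade-off can be put in numbers with a small C sketch. It computes how many bytes a sequence of code points occupies in each of the three encodings. The type and function names are assumptions made for this example, and the inputs are assumed to be valid code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte counts for one piece of text in each encoding.
   Hypothetical type, for illustration only. */
typedef struct {
    size_t utf8;
    size_t utf16;
    size_t utf32;
} EncodedSizes;

/* Sum the storage needed for n code points in UTF-8, UTF-16 and UTF-32.
   Assumes every element is a valid Unicode scalar value. */
EncodedSizes encoded_sizes(const uint32_t *cps, size_t n)
{
    EncodedSizes total = { 0, 0, 0 };
    size_t i;

    assert(cps != NULL || n == 0);

    for (i = 0; i < n; i++) {
        uint32_t cp = cps[i];
        total.utf8  += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        total.utf16 += cp < 0x10000 ? 2 : 4;
        total.utf32 += 4;
    }
    return total;
}
```

Three ASCII letters take 3, 6 and 12 bytes respectively, which is the 4x cost mentioned above, while three CJK ideographs such as U+65E5 U+672C U+8A9E take 9, 6 and 12 bytes, so the smallest representation depends on the text.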
Jan 03 2004
prev sibling parent "Walter" <walter digitalmars.com> writes:
"I" <I_member pathlink.com> wrote in message
news:bqhohh$1781$1 digitaldaemon.com...
 In article <bqhnd6$15f8$1 digitaldaemon.com>, Elias Martenson says...

In fact, I feel that the char and wchar types are useless in that they
serve no practical purpose. The documentation says "wchar - unsigned 8
bit UTF-8". The only UTF-8 encoded characters that fit inside a char is
a one with a code point less than 128, i.e. an ASCII character.

It must be a bug in documentation. Under Windows wchar = UTF16 and under
Linux wchar = UTF32.

This is not correct. A D wchar is always UTF-16. A D dchar is always UTF-32.
Jan 03 2004
prev sibling parent reply Keisuke UEDA <Keisuke_member pathlink.com> writes:
Thank you for replying.

I think that UTF-8 should not be handled directly.

UTF-8 is so complicated that users may make mistakes. ASCII characters (from
0x00 to 0x7f) are encoded in 1 byte, so programmers who use only ASCII
characters do not need to distinguish ASCII from UTF-8. But characters from
2-byte character sets (such as Japanese Shift JIS and Chinese Big5) are encoded
as multiple bytes. A certain character is encoded to 15 bytes. Unicode strings
will be corrupted if they are not handled correctly. Many programmers do not
know the circumstances of foreign languages well. I think that encoded text data
should be wrapped and that programmers should use it indirectly.
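
To show how the byte count grows with the code point, here is a minimal UTF-8 encoder sketch in C. The function name utf8_encode is hypothetical. It follows the four-byte limit of UTF-8 as restricted by RFC 3629, so a single code point never takes more than four bytes; longer byte runs for one visible character can only come from several code points combining.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point.
   Hypothetical helper, for illustration only. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                       /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

The letter "A" stays a single byte, while the hiragana letter U+3042 becomes the three bytes E3 81 82.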

I agree with Mr. Elias Martenson's opinion. I think that he understands many
languages. But I have no idea which encoding is best for a string class to
use.


With best kind regards.
Dec 02 2003
parent reply Elias Martenson <elias-m algonet.se> writes:
Keisuke UEDA wrote:

 Thank you for replying.
 
 I think that UTF-8 should not be treated directly. 
 
 UTF-8 is so complicated that user may take mistakes. ASCII character sets ( 
 from 0x00 to 0x7f ) are encoded to 1 byte, so programmers who use only ASCII 
 characters do not need to distinguish ASCII and UTF-8. But 2 bytes character 
 sets ( such as Japanese Shift JIS and Chinese Big5 ) are encoded to multi 
 bytes. A certain character is encoded to 15 bytes. Unicode strings will be 
 destroyed if it do not treat correctly. Many programmers do not know the 
 circumstances of foreign language well. I think that encoded text data should 
 be wrapped and programmers should use them indirectly.

Yes, agreed. Looking at much of the C code today we see that even though these issues are well known, very few people bother doing it right unless the language makes it natural to do so.

I am currently working on a combined C++ and Java project. The product will be exported to various countries using different languages. Still, the developers who were doing the bulk of the work before me completely disregarded the problem. Trying to fix the code now has proven to be more or less impossible and is one of the reasons we have to rewrite a lot of it. The attitude of the developers was "we'll fix the language issues when we do the translation". Suffice it to say that it wasn't that easy.

If we let the 8-bit char[] type be the "official" string type in D, then the exact same mistakes that were made in C and C++ will be made again, and I really don't want to see that happen. I really have high hopes for D, it does so many things just "right", but this is a potential dealbreaker.
 I agree with Mr. Elias Martenson's opinion. I think that he has understood many
 languages. But I have no idea which is the best encoding which string class
 use.

Yes, I have basic skills in a lot of languages, and I have a real need to use multiple languages. However, when I read what you had to say I realise you also have good skills in this subject.

Regards

Elias
Dec 03 2003
parent reply Chris Paulson-Ellis <chris edesix.com> writes:
Elias Martenson wrote:

 If we let the 8-bit char[] type be the "official" string type in D, then
 the exact same mistakes that were made in C and C++ will be made again,
 and I really don't want to see that happen. I really have high hopes for
 D, it does so many things just "right", but this is a potential
 dealbreaker.

There is an implicit assumption in these arguments that making char 32 bits wide will magically make programs easier to internationalise. I don't buy this. Even 32 bit characters are not 'characters' in the old-fashioned ASCII sense. What about the combining diacritic marks, etc.?

What is really needed is a debate about which of the C Unicode libraries out there is the most appropriate for inclusion in the D library (via a wrapper to support the D string functionality, I assume).

Does anyone out there have experience with these libraries? I don't :-(.

Chris.
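
[Editor's sketch] The point about combining marks can be made concrete. The following is a minimal illustration, assuming a current D compiler and the Phobos modules std.uni and std.range, neither of which is part of the library discussed in this 2003 thread. The helper names codePointCount, graphemeCount and sameText are hypothetical and exist only for this example.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

/// Number of code points; for UTF-32 this is simply the array length.
size_t codePointCount(const(dchar)[] s)
{
    return s.length;
}

/// Number of grapheme clusters, i.e. user-perceived characters.
size_t graphemeCount(const(dchar)[] s)
{
    return s.byGrapheme.walkLength;
}

/// Compares two strings after normalising both to NFC.
bool sameText(const(dchar)[] a, const(dchar)[] b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    dstring composed   = "\u00E9"d;   // e with acute, one precomposed code point
    dstring decomposed = "e\u0301"d;  // 'e' followed by COMBINING ACUTE ACCENT

    assert(codePointCount(decomposed) == 2); // 32-bit units, yet two of them
    assert(graphemeCount(decomposed) == 1);  // still one visible character
    assert(composed != decomposed);          // element-wise comparison differs
    assert(sameText(composed, decomposed));  // equal once normalised
}
```

One way to run it, assuming the block is saved as a file named example.d, is `dmd -unittest -main -run example.d`. So even with 32-bit elements, indexing gives code points rather than characters, which seems to be the concern raised here.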
Dec 03 2003
parent reply Elias Martenson <elias-m algonet.se> writes:
Chris Paulson-Ellis wrote:

 Elias Martenson wrote:
 
 If we let the 8-bit char[] type be the "official" string type in D,
 then the exact same mistakes that were made in C and C++ will be made
 again, and I really don't want to see that happen. I really have high
 hopes for D, it does so many things just "right", but this is a
 potential dealbreaker.

 There is an implicit assumption in these arguments that making char 32 bits
 wide will magically make programs easier to internationalise. I don't buy
 this. Even 32 bit characters are not 'characters' in the old-fashioned ASCII
 sense. What about the combining diacritic marks, etc.?

You are right of course. There are a number of issues, not only the diacritics, but also upper/lowercasing, equivalence (diacritic or combined char) etc...

However, just because a 32-bit char doesn't solve everything doesn't mean that we should just give up and use 8-bit chars for the basic string type. Unicode works with code points. The concept of what a "character" is is well defined by Unicode as "The basic unit of encoding for the Unicode character encoding". It's only natural to actually deal with these for the char type. Don't you think?

An example:

    char[] myString = ...;

    for (int c = 0 ; c < myString.size ; c++) {
        if (myString[c] == '€') {
            // found the euro sign
        }
    }

This is perfectly reasonable code that doesn't work in D today, but would work if chars were 32-bit. Code like this should work in my opinion.

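
[Editor's sketch] A short illustration of why indexing a char[] cannot find the euro sign: in UTF-8 the sign occupies three 8-bit code units, so no single element equals it. This assumes a current D compiler and the Phobos function std.utf.count, which may not match the 2003 library. The helper names codeUnits and codePoints are hypothetical.

```d
import std.utf : count;

/// Number of 8-bit code units, which is what indexing a char[] steps through.
size_t codeUnits(const(char)[] s)
{
    return s.length;
}

/// Number of Unicode code points encoded in s.
size_t codePoints(const(char)[] s)
{
    return count(s);
}

unittest
{
    string s = "5 \u20AC";           // '5', a space, then the euro sign

    assert(codeUnits(s) == 5);       // 1 + 1 + 3 bytes
    assert(codePoints(s) == 3);      // three code points

    // The euro sign U+20AC is stored as the bytes E2 82 AC.
    assert(s[2] == 0xE2 && s[3] == 0x82 && s[4] == 0xAC);
}
```

For pure ASCII text the two counts agree, which is probably why the problem goes unnoticed in code written with only English in mind.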
 What is really needed is a debate about which of the C Unicode libraries 
 out there is the most appropriate for inclusion in the D library (via a 
 wrapper to support the D string functionality, I assume).

The functionality is needed. Basing it off of existing libraries is a good thing. Just grabbing them and including them in D is not. The functionality provided by the string management functions in the D standard library should be natural and helpful. In particular, it should make the need to individually access characters as small as possible.
 Does anyone out there have experience with these libraries. I don't :-(.

As in "have I written real code using them"? No. The IBM libraries are good, that much I know. They might be worth taking a deeper look at:

http://oss.software.ibm.com/icu/index.html

We have to keep in mind though, that even if all these features are included in the standard library, if they aren't easy and natural to use we will still see a lot of applications doing the wrong thing.

In my mind, Java is very good at making Unicode natural to use. But it can be better. That's why I'm soapboxing on this forum.

Regards

Elias
Dec 03 2003
parent "Walter" <walter digitalmars.com> writes:
"Elias Martenson" <elias-m algonet.se> wrote in message
news:bqkd9l$24fv$1 digitaldaemon.com...
 An example:

      char[] myString = ...;

      for (int c = 0 ; c < myString.size ; c++) {
          if (myString[c] == '€') {
              // found the euro sign
          }
      }

 This is perfectly reasonable code that doesn't work in D today, but
 would work if chars were 32-bit. Code like this should work in my opinion.

It will work as written if you define myString as dchar[] rather than char[].
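
[Editor's sketch] The suggestion above, written out for a current D compiler as an assumption about how it would look today. The first function is the quoted loop with the array declared as dchar[] and with `.length` used for the element count. The second is an alternative that keeps UTF-8 storage and lets foreach decode each code point. Both function names are hypothetical.

```d
/// The quoted loop over 32-bit elements: each element is one code point.
bool hasEuroIndexed(const(dchar)[] myString)
{
    for (size_t c = 0; c < myString.length; c++)
    {
        if (myString[c] == '€')
            return true;    // found the euro sign
    }
    return false;
}

/// Alternative: keep 8-bit storage and let foreach decode UTF-8 to dchar.
bool hasEuroDecoded(const(char)[] myString)
{
    foreach (dchar ch; myString)
    {
        if (ch == '€')
            return true;    // found the euro sign
    }
    return false;
}

unittest
{
    assert(hasEuroIndexed("5 €"d));
    assert(hasEuroDecoded("5 €"));
    assert(!hasEuroDecoded("5 $"));
}
```

The second form avoids converting the whole string to UTF-32 first, at the cost of losing random access by code point.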
Jan 03 2004
prev sibling parent reply Ilya Minkov <Ilya_member pathlink.com> writes:
In article <bqhkrg$11s8$1 digitaldaemon.com>, Keisuke UEDA says...
Hello. I've read the D language specification and "D Strings vs C++ Strings". I 
thought that D strings are not international strings.

D Strings are Unicode UTF-8. It is enough for internationalised exchange within a program, but for processing we need a cursor struct which would allow extracting Unicode UTF-32 characters, and its counterpart to create strings.

-eye
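
[Editor's sketch] One possible shape for the cursor and its counterpart, assuming a current D compiler and the Phobos functions std.utf.decode and std.utf.encode. The names Utf8Cursor, Utf8Builder and recode are hypothetical and are not taken from any proposal in this thread.

```d
import std.utf : decode, encode;

/// Hypothetical cursor over UTF-8 text; yields one code point per step.
struct Utf8Cursor
{
    const(char)[] text;
    size_t pos;                     // index of the next unread code unit

    bool empty() const
    {
        return pos >= text.length;
    }

    /// Decodes the code point at pos and advances past it.
    dchar next()
    {
        return decode(text, pos);   // decode advances pos by reference
    }
}

/// Hypothetical counterpart: appends code points to a UTF-8 buffer.
struct Utf8Builder
{
    char[] buffer;

    void put(dchar c)
    {
        encode(buffer, c);          // appends the UTF-8 code units for c
    }
}

/// Round-trips text through the cursor and the builder.
char[] recode(const(char)[] text)
{
    auto cursor = Utf8Cursor(text);
    Utf8Builder builder;
    while (!cursor.empty)
        builder.put(cursor.next());
    return builder.buffer;
}

unittest
{
    auto c = Utf8Cursor("€5");
    assert(c.next() == '€' && c.pos == 3);  // three code units consumed
    assert(c.next() == '5' && c.empty);
    assert(recode("5 €") == "5 €");
}
```

Note that decode throws on malformed UTF-8 by default, so a real design would need to decide how invalid input is reported.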
Dec 02 2003
parent Elias Martenson <elias-m algonet.se> writes:
Ilya Minkov wrote:

 In article <bqhkrg$11s8$1 digitaldaemon.com>, Keisuke UEDA says...
 
Hello. I've read the D language specification and "D Strings vs C++ Strings". I 
thought that D strings are not international strings.

D Strings are Unicode UTF-8. It is enough for internationalised exchange within a program, but for processing we need a cursor struct which would allow extracting Unicode UTF-32 characters, and its counterpart to create strings.

That is all fine, but that struct needs to be the natural way of dealing with characters. Arrays of 8-bit entities are too appealing for people who speak only Western languages to use for manual string manipulation. I know, I see it all the time.

Regards

Elias
Dec 02 2003