
D - String type.

reply Jakob Kemi <jakob.kemi telia.com> writes:
As I understand from the docs, D is supposed to use wchars 
(2 to 4 bytes) for representing non-ASCII strings. I think it would be
better to let all string functions only handle UTF-8 (which is fully
backwards compatible with ASCII). UTF-8 is slowly becoming standard in
UNIX. (just look at X and Gtk+ 2.0)

Just a thought.

	Jakob Kemi
Mar 11 2002
next sibling parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Jakob Kemi" <jakob.kemi telia.com> wrote in message
news:a6j4v5$1koq$1 digitaldaemon.com...

 As I understand from the docs, D is supposed to use wchars
 (2 to 4 bytes) for representing non-ASCII strings. I think it would be
 better to let all string functions only handle UTF-8 (which is fully
 backwards compatible with ASCII). UTF-8 is slowly becoming standard in
 UNIX. (just look at X and Gtk+ 2.0)

...while UNICODE is already a standard on at least Windows and BeOS (these, I know for sure; Linux?). I'd prefer to have both char and wchar flavors of each and every string-manipulation function, probably overloaded (so you don't really see the difference).

BTW, Walter, a question: string literals seem to be char[] by default; I guess they are wchar[] if the program is written in UNICODE, though? Also, are UNICODE literals allowed in ASCII programs?
Mar 11 2002
next sibling parent Jakob Kemi <jakob.kemi telia.com> writes:
On Mon, 11 Mar 2002 21:48:45 +0100, Pavel Minayev wrote:

 "Jakob Kemi" <jakob.kemi telia.com> wrote in message
 news:a6j4v5$1koq$1 digitaldaemon.com...
 
 As I understand from the docs, D is supposed to use wchars (2 to 4
 bytes) for representing non-ASCII strings. I think it would be better
 to let all string functions only handle UTF-8 (which is fully backwards
 compatible with ASCII). UTF-8 is slowly becoming standard in UNIX.
 (just look at X and Gtk+ 2.0)

...while UNICODE is already a standard on at least Windows and BeOS (these, I know for sure; Linux?). I'd prefer to have both char and wchar flavor for each and every string-manipulation function, probably overloaded (so you don't really see the difference). BTW, Walter, a question. String literals seem to be char[] by default, I guess they are wchar[] if the program is written in UNICODE, though? Also, are UNICODE literals allowed in ASCII programs?

Linux supports UTF-8 very well. You can use all your standard programs with UTF-8 encoding (cat, less, etc.) The best part is that if all string functions are written to handle UTF-8, they'll also work for ordinary (legacy) ASCII strings. There's no need to change the char type or anything to deal with UTF-8. wchar would still be useful for interacting with older C libs.
Mar 11 2002
prev sibling next sibling parent reply Jakob Kemi <jakob.kemi telia.com> writes:
On Mon, 11 Mar 2002 21:48:45 +0100, Pavel Minayev wrote:

 "Jakob Kemi" <jakob.kemi telia.com> wrote in message
 news:a6j4v5$1koq$1 digitaldaemon.com...
 
 As I understand from the docs, D is supposed to use wchars (2 to 4
 bytes) for representing non-ASCII strings. I think it would be better
 to let all string functions only handle UTF-8 (which is fully backwards
 compatible with ASCII). UTF-8 is slowly becoming standard in UNIX.
 (just look at X and Gtk+ 2.0)

...while UNICODE is already a standard on at least Windows and BeOS (these, I know for sure; Linux?). I'd prefer to have both char and wchar flavor for each and every string-manipulation function, probably overloaded (so you don't really see the difference). BTW, Walter, a question. String literals seem to be char[] by default, I guess they are wchar[] if the program is written in UNICODE, though? Also, are UNICODE literals allowed in ASCII programs?

I forgot to add: UNICODE is a very loose notion, as it includes UCS-2 (2 byte), UCS-4 (4 byte) and UTF-8 (variable width), among others. My first reaction was that variable-width characters are kinda gross and inelegant. However, one would waste memory (yeah, I know, it's cheap) with 4-byte characters in order to be UNICODE compliant. Also, the fact that UTF-8 works so well with the ASCII inheritance, and its fast acceptance in the UNIX world, make me feel all warm and fuzzy about it, despite its variable character size.

By sticking with UCS-2 and UCS-4 we'll still be in the present day's situation: all internationalization will be clumsy add-ons to the standard ASCII strings, and every program will have to decide which routines to use, etc. (a hell when you're developing big applications and/or exchanging data between different countries and different systems.)

The best thing would of course be if the whole world, including all legacy databases, every file ever written and every old and unsupported application, just magically and instantly were rewritten or converted to UCS-4. But there is _no_ way that that's ever going to happen. UTF-8 gives the best compromise IMO.

	Jakob Kemi
Mar 11 2002
parent "Walter" <walter digitalmars.com> writes:
Supporting utf8 would just be by using char[] arrays!

"Jakob Kemi" <jakob.kemi telia.com> wrote in message
news:a6j8hj$1m6n$1 digitaldaemon.com...
 On Mon, 11 Mar 2002 21:48:45 +0100, Pavel Minayev wrote:

 "Jakob Kemi" <jakob.kemi telia.com> wrote in message
 news:a6j4v5$1koq$1 digitaldaemon.com...

 As I understand from the docs, D is supposed to use wchars (2 to 4
 bytes) for representing non-ASCII strings. I think it would be better
 to let all string functions only handle UTF-8 (which is fully backwards
 compatible with ASCII). UTF-8 is slowly becoming standard in UNIX.
 (just look at X and Gtk+ 2.0)

...while UNICODE is already a standard on at least Windows and BeOS (these, I know for sure; Linux?). I'd prefer to have both char and wchar flavor for each and every string-manipulation function, probably overloaded (so you don't really see the difference). BTW, Walter, a question. String literals seem to be char[] by default, I guess they are wchar[] if the program is written in UNICODE, though? Also, are UNICODE literals allowed in ASCII programs?

I forgot to add: UNICODE is a very loose notion, as it includes UCS-2 (2 byte), UCS-4 (4 byte) and UTF-8 (variable width), among others. My first reaction was that variable-width characters are kinda gross and inelegant. However, one would waste memory (yeah, I know, it's cheap) with 4-byte characters in order to be UNICODE compliant. Also, the fact that UTF-8 works so well with the ASCII inheritance, and its fast acceptance in the UNIX world, make me feel all warm and fuzzy about it, despite its variable character size.

By sticking with UCS-2 and UCS-4 we'll still be in the present day's situation: all internationalization will be clumsy add-ons to the standard ASCII strings, and every program will have to decide which routines to use, etc. (a hell when you're developing big applications and/or exchanging data between different countries and different systems.)

The best thing would of course be if the whole world, including all legacy databases, every file ever written and every old and unsupported application, just magically and instantly were rewritten or converted to UCS-4. But there is _no_ way that that's ever going to happen. UTF-8 gives the best compromise IMO.

	Jakob Kemi

Mar 11 2002
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Pavel Minayev" <evilone omen.ru> wrote in message
news:a6j584$1kuq$1 digitaldaemon.com...
 "Jakob Kemi" <jakob.kemi telia.com> wrote in message
 news:a6j4v5$1koq$1 digitaldaemon.com...
 BTW, Walter, a question. String literals seem to be char[] by
 default, I guess they are wchar[] if the program is written
 in UNICODE, though? Also, are UNICODE literals allowed in ASCII
 programs?

Actually, string literals are uncommitted by default. They then get converted to char[], wchar[], char, or wchar depending on the context. You can insert unicode literals into strings with the \uUUUU syntax.
Mar 11 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a6jfpg$2vd$1 digitaldaemon.com...

 Actually, string literals are uncommitted by default. They then get
 converted to char[], wchar[], char, or wchar depending on the context.

So how is the context determined?

    void foo(char[] s)  { ... }
    void foo(wchar[] s) { ... }
    foo("Hello, world!");

My tests show that in the above snippet "Hello, world!" is passed to the function that takes the char[] argument. If the whole program text would be in UNICODE, would the string be UNICODE as well? And what if I insert some UNICODE chars into the literal? Will the compiler complain about "invalid characters"?
Mar 12 2002
parent reply "Walter" <walter digitalmars.com> writes:
"Pavel Minayev" <evilone omen.ru> wrote in message
news:a6l1sf$l7f$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:a6jfpg$2vd$1 digitaldaemon.com...

 Actually, string literals are uncommitted by default. They then get
 converted to char[], wchar[], char, or wchar depending on the context.

So how is the context determined?

    void foo(char[] s)  { ... }
    void foo(wchar[] s) { ... }
    foo("Hello, world!");

My tests show that in the above snippet "Hello, world!" is passed to the function that takes the char[] argument.

That's a bug, it should give an ambiguity error.
 If the whole program text would
 be in UNICODE, would the string be UNICODE as well?

Yes, but if the string doesn't contain any characters with the high bits set, it can be implicitly converted to ascii.
 And what if I insert some UNICODE chars into the literal? Will the
 compiler complain about "invalid characters"?

It won't implicitly convert it to char[], then.
Mar 12 2002
parent reply "Juan Carlos Arevalo Baeza" <jcab roningames.com> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a6lccr$psj$2 digitaldaemon.com...

 So how is the context determined?

     void foo(char[] s)  { ... }
     void foo(wchar[] s) { ... }
     foo("Hello, world!");

 My tests show that in the above snippet "Hello, world!" is passed to the
 function that takes char[] argument.

That's a bug, it should give an ambiguity error.

Hmmm... I'm thinking that flagging an ambiguity here would still be bad. How about using attributes to add the ability to resolve ambiguities in a user-defined manner? For example:

    priority(9) void foo(char[] s)  { ... }
    priority(5) void foo(wchar[] s) { ... }
    foo("Hello, world!"); // Calls using char[], as it's higher priority.

This way, ambiguities will only be flagged if multiple possibilities exist that have the same priority. The default priority could be 5 for all functions, and the range could be 0 to 9, so you can always define higher or lower ones as needed.

I admit that this might open a whole new can of worms, but I'd definitely be willing to explore this if the language supported it.

Salutaciones,
    JCAB
 If the whole program text would
 be in UNICODE, would the string be UNICODE as well?

Yes, but if the string doesn't contain any characters with the high bits set, it can be implicitly converted to ascii.
 And what if I insert some UNICODE chars into the literal? Will the
 compiler complain about "invalid characters"?

It won't implicitly convert it to char[], then.

Mar 15 2002
parent "Walter" <walter digitalmars.com> writes:
"Juan Carlos Arevalo Baeza" <jcab roningames.com> wrote in message
news:a6u823$bip$1 digitaldaemon.com...
 priority(9) void foo(char[] s)  { ... }
 priority(5) void foo(wchar[] s) { ... }
 foo("Hello, world!"); // Calls using char[], as it's higher priority.

     This way, ambiguities will only be flagged if multiple possibilities
 exist that have the same priority. The default priority could be 5 for all
 functions, and the range could be 0 to 9, so you can always define higher
 or lower ones as needed.

     I admit that this might open a whole new can of worms, but I'd definitely
 be willing to explore this if the language supported it.

D was trying to migrate to a simpler overloading scheme <g>. It can be less convenient at times, but I think it's more than made up for by having simple and obvious rules.
Mar 26 2002
prev sibling next sibling parent reply "J. Daniel Smith" <j_daniel_smith HoTMaiL.com> writes:
UTF-8 is fine for strings that are mostly ASCII with some UNICODE (source
code, Western European languages).  But if the string is entirely UNICODE
(something in Chinese for example), the UTF-8 encoding can consume MORE
memory since the UTF-8 transformation can be as many as six bytes long.

UTF-8 solves a lot of problems, but I'm not sure you want to wire it into
the language as the only option.

   Dan

"Jakob Kemi" <jakob.kemi telia.com> wrote in message
news:a6j4v5$1koq$1 digitaldaemon.com...
 As I understand from the docs, D is supposed to use wchars
 (2 to 4 bytes) for representing non-ASCII strings. I think it would be
 better to let all string functions only handle UTF-8 (which is fully
 backwards compatible with ASCII). UTF-8 is slowly becoming standard in
 UNIX. (just look at X and Gtk+ 2.0)

 Just a thought.

 Jakob Kemi

Mar 11 2002
next sibling parent Jakob Kemi <jakob.kemi telia.com> writes:
On Mon, 11 Mar 2002 22:52:46 +0100, J. Daniel Smith wrote:

 UTF-8 is fine for strings that are mostly ASCII with some UNICODE
 (source code, Western European languages).  But if the string is
 entirely UNICODE (something in Chinese for example), the UTF-8 encoding
 can consume MORE memory since the UTF-8 transformation can be as many as
 six bytes long.

Globally, however, UTF-8 will save memory compared to UCS-4 (no, UCS-2 isn't enough), since wide characters are rare. But I think the memory issue doesn't really matter, and it will matter even less as prices fall. Also, if someone is storing _huge_ amounts of text they will just compress it and remove most of the redundancy in the codeset.
 UTF-8 solves a lot of problems, but I'm not sure you want to wire it
 into the language as the only option.

You don't have to opt out of anything else. Just design all string functions to handle UTF-8 and you'll have the best of both worlds (ordinary ASCII char strings and UTF-8, that is). If there's a need, you can still have special UCS-4 functions and a ucs4_char type (or whatever).
    Dan

Mar 11 2002
prev sibling parent "Serge K" <skarebo programmer.net> writes:
"J. Daniel Smith" <j_daniel_smith HoTMaiL.com> wrote in message
news:a6j90i$1me0$1 digitaldaemon.com...
 UTF-8 is fine for strings that are mostly ASCII with some UNICODE (source
 code, Western European languages).  But if the string is entirely UNICODE
 (something in Chinese for example), the UTF-8 encoding can consume MORE
 memory since the UTF-8 tranformation can be as many as six bytes long.

Actually, UTF-8 can represent all Unicode 3.2 characters with 1..4 bytes, which means it simply cannot consume more memory than UTF-32. (ISO/IEC 10646 may require up to 6 bytes in UTF-8, but it is a superset of Unicode.)
Mar 11 2002
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Jakob Kemi" <jakob.kemi telia.com> wrote in message
news:a6j4v5$1koq$1 digitaldaemon.com...
 As I understand from the docs, D is supposed to use wchars
 (2 to 4 bytes) for representing non-ASCII strings. I think it would be
 better to let all string functions only handle UTF-8 (which is fully
 backwards compatible with ASCII). UTF-8 is slowly becoming standard in
 UNIX. (just look at X and Gtk+ 2.0)

At one time I had written a lexer that handled UTF-8 source. It turned out to cause a lot of problems, because strings could no longer be simply indexed by character position, nor could pointers be arbitrarily incremented and decremented. It turned out to be a lot of trouble :-( and I finally converted it to wchars.
Mar 11 2002
parent reply Jakob Kemi <jakob.kemi telia.com> writes:
On Tue, 12 Mar 2002 01:00:49 +0100, Walter wrote:


 "Jakob Kemi" <jakob.kemi telia.com> wrote in message
 news:a6j4v5$1koq$1 digitaldaemon.com...
 As I understand from the docs, D is supposed to use wchars (2 to 4
 bytes) for representing non-ASCII strings. I think it would be better
 to let all string functions only handle UTF-8 (which is fully backwards
 compatible with ASCII). UTF-8 is slowly becoming standard in UNIX.
 (just look at X and Gtk+ 2.0)

At one time I had written a lexer that handled UTF-8 source. It turned out to cause a lot of problems, because strings could no longer be simply indexed by character position, nor could pointers be arbitrarily incremented and decremented. It turned out to be a lot of trouble :-( and I finally converted it to wchars.

You already have this problem in Windows, with linebreaks being two bytes. Just use custom iterators for your string class implementation, and if you need to set/get positions in streams, use tell and seek (you're not supposed to assume that 1 character == 1 byte anyway, according to the standards.) There should be no real _need_ to index characters in strings with pointers.

	Jakob Kemi
Mar 11 2002
next sibling parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Jakob Kemi" <jakob.kemi telia.com> wrote in message
news:a6jitg$118$1 digitaldaemon.com...

 You already have this problem in windows with linebreaks being two
 bytes. Just use custom iterators for your string class

There are no iterators in D, nor there is a string class.
Mar 12 2002
parent Jakob Kemi <jakob.kemi telia.com> writes:
On Tue, 12 Mar 2002 20:06:34 +0100, Pavel Minayev wrote:


 "Jakob Kemi" <jakob.kemi telia.com> wrote in message
 news:a6jitg$118$1 digitaldaemon.com...
 
 You already have this problem in windows with linebreaks being two
 bytes. Just use custom iterators for your string class

There are no iterators in D, nor there is a string class.

I'm not talking about some STL iterators here; what I mean is that you just design your loops like this:

    for (char* s = string; get_char(s) != '\0'; s = next_char(s)) {
        ...
    }

Loops operating on strings are rare anyway; most string functions should be optimized library functions. get_char() and next_char() should be inlined and can use whatever syntactic sugar is applicable.

	Jakob
Mar 12 2002
prev sibling parent "Walter" <walter digitalmars.com> writes:
"Jakob Kemi" <jakob.kemi telia.com> wrote in message
news:a6jitg$118$1 digitaldaemon.com...
 On Tue, 12 Mar 2002 01:00:49 +0100, Walter wrote:


 "Jakob Kemi" <jakob.kemi telia.com> wrote in message
 news:a6j4v5$1koq$1 digitaldaemon.com...
 As I understand from the docs, D is supposed to use wchars (2 to 4
 bytes) for representing non-ASCII strings. I think it would be better
 to let all string functions only handle UTF-8 (which is fully backwards
 compatible with ASCII). UTF-8 is slowly becoming standard in UNIX.
 (just look at X and Gtk+ 2.0)

At one time I had written a lexer that handled UTF-8 source. It turned out to cause a lot of problems, because strings could no longer be simply indexed by character position, nor could pointers be arbitrarily incremented and decremented. It turned out to be a lot of trouble :-( and I finally converted it to wchars.

You already have this problem in Windows, with linebreaks being two bytes. Just use custom iterators for your string class implementation, and if you need to set/get positions in streams, use tell and seek (you're not supposed to assume that 1 character == 1 byte anyway, according to the standards.) There should be no real _need_ to index characters in strings with pointers.

That's true, but I was never comfortable using such things, and hiding the performance hit behind syntactic sugar doesn't make the hit go away. When you're trying to compile 100,000 lines of code, every cycle in the lexer matters.
Mar 12 2002