digitalmars.D - A grave Prayer

Georg Wrede <georg.wrede nospam.org> writes:

We've wasted truckloads of ink lately on the UTF issue.

Seems we're tangled in this like the new machine room boy who was later 
found dead. He got so wound up with cat-5 cables he got asphyxiated.


The prayer:

Please remove the *char*, *wchar* and *dchar* basic data types from the 
documentation!

Please remove ""c, ""w and ""d from the documentation!


The illusion they make is more than 70% responsible for the recent 
wastage of ink here. Further, they will cloud the minds of any new D 
user -- without anybody even noticing.

For example, there is no such thing as an 8 bit utf char. It simply does 
not exist. The mere hint of such in the documentation has to be the 
result of a slipup of the mind.

---

There are other things to fix regarding strings, utf, and such, but it 
will take some time before we develop a collective understanding about 
them.

But the documentation should be changed RIGHT NOW, so as to not cause 
further misunderstanding among ourselves, and to not introduce this to 
all new readers.

Anybody who reads these archives a year from now, can't help feeling we 
are a bunch of pathological cases harboring an incredible psychosis -- 
about something that is basically trivial.

---

ps, this is not a rhetorical joke, I'm really asking for the removals.

Nov 23 2005

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Georg Wrede wrote:
 We've wasted truckloads of ink lately on the UTF issue.
 
 Seems we're tangled in this like the new machine room boy who was later 
 found dead. He got so wound up with cat-5 cables he got asphyxiated.
 
 
 The prayer:
 
 Please remove the *char*, *wchar* and *dchar* basic data types from the 
 documentation!

What at least should be changed is the suggestion that the C char is 
replaceable by the D char, and a slight change in definition:

char: UTF-8 code unit
wchar: UTF-16 code unit
dchar: UTF-32 code unit and/or unicode character
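
As a minimal sketch of these definitions, the sizes and value ranges of the three types can be checked directly. This assumes a present-day D compiler rather than the DMD 0.139 discussed in this thread.

```d
import std.stdio;

void main()
{
    // Each type stores exactly one code unit of its encoding.
    static assert(char.sizeof  == 1); // UTF-8 code unit, 8 bits
    static assert(wchar.sizeof == 2); // UTF-16 code unit, 16 bits
    static assert(dchar.sizeof == 4); // UTF-32 code unit, 32 bits

    // Only dchar is wide enough for every Unicode code point (up to U+10FFFF).
    static assert(char.max  == 0xFF);
    static assert(wchar.max == 0xFFFF);
    static assert(dchar.max == 0x10FFFF);

    writeln("char/wchar/dchar sizes: ",
            char.sizeof, " ", wchar.sizeof, " ", dchar.sizeof);
}
```

The static asserts are evaluated at compile time, so the program only builds if the stated sizes and ranges hold.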

 
 Please remove ""c, ""w and ""d from the documentation!

So you suggest that users should use an explicit cast instead?

/Oskar

Nov 23 2005

Georg Wrede <georg.wrede nospam.org> writes:

Oskar Linde wrote:
 Georg Wrede wrote:
 
 We've wasted truckloads of ink lately on the UTF issue.

 The prayer:

 Please remove the *char*, *wchar* and *dchar* basic data types from 
 the documentation!

 
 What at least should be changed is the suggestion that the C char is 
 replaceable by the D char, and a slight change in definition:
 
 char: UTF-8 code unit
 wchar: UTF-16 code unit
 dchar: UTF-32 code unit and/or unicode character

A UTF-8 code unit can not be stored in char. Therefore UTF can't be 
mentioned at all in this context.

By the same token, and because of symmetry, the wchar and dchar things 
should vanish. Disappear. Get nuked.

The language manual (as opposed to the Phobos docs), should not once use 
the word UTF anywhere. Except possibly stating that a conformant 
compiler has to accept source code files that are stored in UTF.

---

If a programmer decides to do UTF mangling, then he should store the 
intermediate results in ubytes, ushorts and uints. This way he can avoid 
getting tangled and tripped up sooner or later.

And this way he keeps remembering that it's his responsibility to get 
things right.

 Please remove ""c, ""w and ""d from the documentation!

 
 So you suggest that users should use an explicit cast instead?

No.

How, or in what form the string literals get internally stored, is the 
compiler vendor's own private matter. And definitely not part of a 
language specification.

The string literal decorations only create a huge amount of distraction, 
sending every programmer on a wild goose chase, possibly several times, 
before they either find someone who explains things, or they switch 
away from D.

Behind the scenes, a smart compiler manufacturer probably stores all 
string literals as UTF-8. Nice. Or as some other representation that's 
convenient for him, be it UTF-whatever, or even a native encoding. In 
any case, the compiler knows what this representation is.

When the string gets used (gets assigned to a variable, or put in a 
structure, or printed to screen), then the compiler should implicitly 
cast (as in toUTFxxx and not the current "cast") the string to what's 
expected.

We can have "string types", like [c/w/d]char[], but not standalone UTF 
chars. When the string literal gets assigned to any of these, then it 
should get converted.
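
For comparison, here is a sketch of how an undecorated literal behaves when it initializes each of the three string types. It assumes a present-day D compiler, not the 2005 one, and the helper name `unitCounts` is made up for illustration. The literal is converted at compile time to the type it is assigned to.

```d
import std.stdio;

// Hypothetical helper: code unit counts of one literal in the three string types.
size_t[3] unitCounts()
{
    string  a = "h\u00E4t"; // stored as UTF-8:  'h', two units for U+00E4, 't'
    wstring b = "h\u00E4t"; // same literal, stored as UTF-16
    dstring c = "h\u00E4t"; // same literal, stored as UTF-32
    return [a.length, b.length, c.length];
}

void main()
{
    writeln(unitCounts()); // [4, 3, 3]
}
```

No suffix and no cast is needed here; the suffixes only matter where the context does not fix the type.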

Actually, a smart compiler would already store the string literal in the 
width the string will get used, it's got the info right there in the 
same source file. And in case the programmer is real dumb and assigns 
the same literal to more than one UTF width, it could be stored in all 
of those separately. -- But the brain cells of the compiler writer 
should be put to more productive issues than this.

Probably the easiest would be to just decide that UTF-8 is it. And use 
that while something else is not demanded. Or UTF-whatever else, 
depending on the platform. But the D programmer should never have to 
cast a string literal.

The D programmer should not necessarily even know that representation. 
The only situation he would need to know it is if his program goes 
directly to the executable memory image snooping for the literal. I'm 
not sure that should be legal.

Nov 23 2005

"Regan Heath" <regan netwin.co.nz> writes:

On Wed, 23 Nov 2005 21:16:03 +0200, Georg Wrede <georg.wrede nospam.org>  
wrote:
 Oskar Linde wrote:
 Georg Wrede wrote:

 We've wasted truckloads of ink lately on the UTF issue.

 The prayer:

 Please remove the *char*, *wchar* and *dchar* basic data types from  
 the documentation!

  What at least should be changed is the suggestion that the C char is  
 replaceable by the D char, and a slight change in definition:
  char: UTF-8 code unit
 wchar: UTF-16 code unit
 dchar: UTF-32 code unit and/or unicode character

 A UTF-8 code unit can not be stored in char. Therefore UTF can't be  
 mentioned at all in this context.

I suspect we're having terminology issues again.

http://en.wikipedia.org/wiki/Unicode
"each code point may be represented by a variable number of code values."

"code point" == "character"
"code value" == part of a (or in some cases a complete) character

I think Oskar is correct (because I believe "code unit" == "code value")  
and char does contain a UTF-8 code value/unit.

However, Georg is also correct (because I suspect he meant) that a char  
can not contain all Unicode(*) code points. It can contain some, those  
with values less than 255 but not others.

(*) Note I did not say UTF-8 here, I believe it's incorrect to do so. code  
points are universal across all the UTF encodings, they are simply  
represented by different code values/units depending on the encoding used.
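
A small sketch of that point, assuming present-day Phobos (`std.utf.toUTF8`, `std.utf.toUTF16` and `std.string.representation`): one code point, U+20AC, turns into different code units depending on the encoding.

```d
import std.stdio;
import std.string : representation;
import std.utf : toUTF8, toUTF16;

void main()
{
    dstring utf32 = "\u20AC"d;     // one code point: U+20AC (euro sign)
    string  utf8  = utf32.toUTF8;  // three code units: E2 82 AC
    wstring utf16 = utf32.toUTF16; // one code unit: 20AC

    assert(utf8.length == 3 && utf16.length == 1 && utf32.length == 1);

    // representation() exposes the raw code units as unsigned integers.
    writeln(utf8.representation);  // [226, 130, 172]
    writeln(utf16.representation); // [8364]
    writeln(utf32.representation); // [8364]
}
```

The code point is the same in all three cases; only the code units differ.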

Regan

Nov 23 2005

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Regan Heath wrote:
 On Wed, 23 Nov 2005 21:16:03 +0200, Georg Wrede 
 <georg.wrede nospam.org>  wrote:
 
 Oskar Linde wrote:

 Georg Wrede wrote:

 We've wasted truckloads of ink lately on the UTF issue.

 The prayer:

 Please remove the *char*, *wchar* and *dchar* basic data types from  
 the documentation!

  What at least should be changed is the suggestion that the C char is  
 replaceable by the D char, and a slight change in definition:
  char: UTF-8 code unit
 wchar: UTF-16 code unit
 dchar: UTF-32 code unit and/or unicode character


 A UTF-8 code unit can not be stored in char. Therefore UTF can't be  
 mentioned at all in this context.

 
 
 I suspect we're having terminology issues again.
 
 http://en.wikipedia.org/wiki/Unicode
 "each code point may be represented by a variable number of code values."
 
 "code point" == "character"
 "code value" == part of a (or in some cases a complete) character
 
 I think Oskar is correct (because I believe "code unit" == "code 
 value")  and char does contain a UTF-8 code value/unit.

 However, Georg is also correct (because I suspect he meant) that a char  
 can not contain all Unicode(*) code points. It can contain some, those  
 with values less than 255 but not others.

I think you are correct in your analysis Regan.

I was actually referring to your definitions in:
news://news.digitalmars.com:119/ops0bhkaaa23k2f5 nrage.netwin.co.nz
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/30029
and trying to be very careful to use the agreed terminology:

character: (Unicode character/symbol)
code unit: (part of the actual encoding)

I feel that I may have been a bit dense in my second reply to Georg. 
Apologies.

Regards,

Oskar

Nov 23 2005

Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:

Regan Heath wrote:
 On Wed, 23 Nov 2005 21:16:03 +0200, Georg Wrede 
 <georg.wrede nospam.org>  wrote:
 
 Oskar Linde wrote:

 Georg Wrede wrote:

 We've wasted truckloads of ink lately on the UTF issue.

 The prayer:

 Please remove the *char*, *wchar* and *dchar* basic data types from  
 the documentation!

  What at least should be changed is the suggestion that the C char is  
 replaceable by the D char, and a slight change in definition:
  char: UTF-8 code unit
 wchar: UTF-16 code unit
 dchar: UTF-32 code unit and/or unicode character


 A UTF-8 code unit can not be stored in char. Therefore UTF can't be  
 mentioned at all in this context.

 
 
 I suspect we're having terminology issues again.
 
 http://en.wikipedia.org/wiki/Unicode
 "each code point may be represented by a variable number of code values."
 
 "code point" == "character"
 "code value" == part of a (or in some cases a complete) character
 
 I think Oskar is correct (because I believe "code unit" == "code 
 value")  and char does contain a UTF-8 code value/unit.
 

 However, Georg is also correct (because I suspect he meant) that a char  
 can not contain all Unicode(*) code points. It can contain some, those  
 with values less than 255 but not others.

Wrong actually, it can only contain codepoints below 128. Above that it 
takes two or more bytes of storage.

"A character whose code point is below U+0080 is encoded with a single 
byte that contains its code point: these correspond exactly to the 128 
characters of 7-bit ASCII."
http://en.wikipedia.org/wiki/UTF-8

"a) Use UTF-8. This preserves ASCII, but not Latin-1, because the 
characters >127 are different from Latin-1. UTF-8 uses the bytes in the 
ASCII only for ASCII characters. "
http://www.unicode.org/faq/utf_bom.html
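
The boundary can be checked with a short sketch. It assumes present-day Phobos, where `std.utf.codeLength!char` reports how many UTF-8 code units a code point needs.

```d
import std.stdio;
import std.utf : codeLength;

void main()
{
    // Code points below U+0080 fit in one UTF-8 code unit (plain ASCII).
    assert(codeLength!char('A') == 1);
    assert(codeLength!char('\u007F') == 1);

    // From U+0080 upwards more than one code unit is needed.
    assert(codeLength!char('\u0080') == 2);
    assert(codeLength!char('\u00E7') == 2);     // c with cedilla
    assert(codeLength!char('\u0F00') == 3);     // Tibetan om
    assert(codeLength!char('\U0001F600') == 4); // outside the BMP

    writeln("all UTF-8 length checks passed");
}
```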

I've actually only found this today when trying to writefln('ç');


-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to 
be... unnatural."

Nov 23 2005

"Regan Heath" <regan netwin.co.nz> writes:

On Wed, 23 Nov 2005 20:50:16 +0000, Bruno Medeiros  
<daiphoenixNO SPAMlycos.com> wrote:
 Regan Heath wrote:
 However, Georg is also correct (because I suspect he meant) that a  
 char  can not contain all Unicode(*) code points. It can contain some,  
 those  with values less than 255 but not others.

 Wrong actually, it can only contain codepoints below 128. Above that it  
 takes two or more bytes of storage.

Thanks for the correction. I wasn't really sure.

Regan

Nov 23 2005

Jari-Matti Mäkelä <jmjmak invalid_utu.fi> writes:

Bruno Medeiros wrote:
 "a) Use UTF-8. This preserves ASCII, but not Latin-1, because the 
 characters >127 are different from Latin-1. UTF-8 uses the bytes in the 
 ASCII only for ASCII characters. "
 http://www.unicode.org/faq/utf_bom.html
 
 I've actually only found this today when trying to writefln('ç');

That's odd. I tried this on Linux (dmd .139):

writefln('ༀ');	// -> no error!
char a = 'ༀ';	// -> error!

It seems that DMD allows all Unicode-character values as character 
literals (IMO this is correct behavior). One problem is that while you 
cannot assign the om-symbol 'ༀ' to a char (this is also correct - 
3840>127), you can do this:

char a = 'ä';	// -> no error!
writefln(a);	// outputs: Error: 4invalid UTF-8 sequence (oops!)
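
A sketch of why the second case fails, assuming present-day Phobos (`std.utf.validate` and `std.exception.assertThrown`): the code point U+00E4 squeezed into a single char is the lone byte 0xE4, which is not a complete UTF-8 sequence.

```d
import std.exception : assertThrown;
import std.stdio;
import std.utf : UTFException, validate;

void main()
{
    // U+00E4 stored in one code unit: the lone byte 0xE4.
    char[] lone = [cast(char) 0xE4];
    assertThrown!UTFException(validate(lone));

    // The proper UTF-8 encoding of U+00E4 is the two code units 0xC3 0xA4.
    char[] proper = [cast(char) 0xC3, cast(char) 0xA4];
    validate(proper); // does not throw
    assert(proper == "\u00E4");

    writeln("lone 0xE4 rejected, 0xC3 0xA4 accepted");
}
```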

Nov 23 2005

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Jari-Matti Mäkelä wrote:
 Bruno Medeiros wrote:
 
 "a) Use UTF-8. This preserves ASCII, but not Latin-1, because the 
 characters >127 are different from Latin-1. UTF-8 uses the bytes in 
 the ASCII only for ASCII characters. "
 http://www.unicode.org/faq/utf_bom.html

 I've actually only found this today when trying to writefln('ç');

 
 
 That's odd. I tried this on Linux (dmd .139):
 
 writefln('ༀ');    // -> no error!
 char a = 'ༀ';    // -> error!
 
 It seems that DMD allows all Unicode-character values as character 
 literals (IMO this is correct behavior). One problem is that while you 
 cannot assign the om-symbol 'ༀ' to a char (this is also correct - 
 3840>127), you can do this:
 
 char a = 'ä';    // -> no error!
 writefln(a);    // outputs: Error: 4invalid UTF-8 sequence (oops!)

I think Bruno may have been using a non-unicode format like Latin-1 for 
his sources. 'ç' would then appear as garbage to the compiler.

/Oskar

Nov 23 2005

Jari-Matti Mäkelä <jmjmak invalid_utu.fi> writes:

Oskar Linde wrote:
 Jari-Matti Mäkelä wrote:
 
 Bruno Medeiros wrote:

 "a) Use UTF-8. This preserves ASCII, but not Latin-1, because the 
 characters >127 are different from Latin-1. UTF-8 uses the bytes in 
 the ASCII only for ASCII characters. "
 http://www.unicode.org/faq/utf_bom.html

 I've actually only found this today when trying to writefln('ç');



 That's odd. I tried this on Linux (dmd .139):

 writefln('ༀ');    // -> no error!
 char a = 'ༀ';    // -> error!

 It seems that DMD allows all Unicode-character values as character 
 literals (IMO this is correct behavior). One problem is that while you 
 cannot assign the om-symbol 'ༀ' to a char (this is also correct - 
 3840>127), you can do this:

 char a = 'ä';    // -> no error!
 writefln(a);    // outputs: Error: 4invalid UTF-8 sequence (oops!)

 
 
 I think Bruno may have been using a non-unicode format like Latin-1 for 
 his sources. 'ç' would then appear as garbage to the compiler.

Yes, that might be true if he's using Windows. On Linux the compiler 
input must be valid Unicode. Maybe it's hard to create Unicode-compliant 
programs on Windows since even Sun Java has some problems with Unicode 
class names on Windows XP.

--

My point here was that since char[] is a fully valid UTF-8 string and 
the index/slice-operators are intelligent enough to work on the 
Unicode-symbol level [code 1], we should be able to store Unicode 
symbols to a char-variable as well. You see, an UTF-8 symbol requires 
8-32 bits of storage space, so it would be perfectly possible to 
implement an UTF-8 symbol (char) using standard 32-bit integers or 1-4 x 
8-bit bytes.

The current implementation seems to be always using 1 x 8-bit byte for 
the char-type and n x 8-bit bytes for char[] - strings. The strings work 
well but it's impossible to store a single UTF-8 character now.

[code 1]:

   char[] a = "∇∆∈∋";
   writefln(a[2]);  //outputs: ∈ (a stupid implementation would output 0x87)

Nov 23 2005

Carlos Santander <csantander619 gmail.com> writes:

Jari-Matti Mäkelä escribió:
 
 Yes, that might be true if he's using Windows. On Linux the compiler 
 input must be valid Unicode. Maybe it's hard to create Unicode-compliant 

The compiler input must also be valid Unicode on Windows. Just try to feed it 
with some invalid characters and you'll see how it'll complain.

 programs on Windows since even Sun Java has some problems with Unicode 
 class names on Windows XP.
 
 -- 
 
 My point here was that since char[] is a fully valid UTF-8 string and 
 the index/slice-operators are intelligent enough to work on the 
 Unicode-symbol level [code 1], we should be able to store Unicode 
 symbols to a char-variable as well. You see, an UTF-8 symbol requires 
 8-32 bits of storage space, so it would be perfectly possible to 
 implement an UTF-8 symbol (char) using standard 32-bit integers or 1-4 x 
 8-bit bytes.
 
 The current implementation seems to be always using 1 x 8-bit byte for 
 the char-type and n x 8-bit bytes for char[] - strings. The strings work 
 well but it's impossible to store a single UTF-8 character now.
 
 [code 1]:
 
   char[] a = "∇∆∈∋";
   writefln(a[2]);  //outputs: ∈ (a stupid implementation would output 0x87)


-- 
Carlos Santander Bernal

Nov 23 2005

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Jari-Matti Mäkelä wrote:
 
 My point here was that since char[] is a fully valid UTF-8 string and 
 the index/slice-operators are intelligent enough to work on the 
 Unicode-symbol level [code 1], we should be able to store Unicode 
 symbols to a char-variable as well. 

No this is wrong. The index/slice operations have no idea about Unicode/UTF.

 You see, an UTF-8 symbol requires 
 8-32 bits of storage space, so it would be perfectly possible to 
 implement an UTF-8 symbol (char) using standard 32-bit integers or 1-4 x 
 8-bit bytes.
 
 The current implementation seems to be always using 1 x 8-bit byte for 
 the char-type and n x 8-bit bytes for char[] - strings. The strings work 
 well but it's impossible to store a single UTF-8 character now.

There is no such thing as a "UTF-8 character" or "UTF-8 symbol"... 
"Unicode character" or "UTF-8 code unit"

UTF is merely the encoding. Unicode is the character set.

 [code 1]:
 
   char[] a = "∇∆∈∋";
   writefln(a[2]);  //outputs: ∈ (a stupid implementation would output 0x87)

You really had me confused there for a moment (I had to test this). 
Fortunately you are wrong. On what platform with what source code 
encoding have you tested this? The following is on DMD 0.139 linux:

import std.stdio;

void main() {
   char[] a = "åäö";
   writef(a[1]);
}

Will print Error: 4invalid UTF-8 sequence

Which is correct. (The error message might be slightly confusing though)

import std.stdio;

void main() {
   char[] a = "åäö";
   writef(a[2..4]);
}

will print "ä"
Which is correct.
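
The two results can be restated as checks on the code units. This is a sketch for a present-day D compiler, with the string written using escapes so that the source file encoding does not matter.

```d
import std.stdio;

void main()
{
    string a = "\u00E5\u00E4\u00F6"; // three characters, two code units each

    // Indexing and slicing count code units, not characters.
    assert(a.length == 6);
    assert(a[2 .. 4] == "\u00E4");   // code units 2 and 3 form the second character

    // a[1] is the second half of the first character: a continuation byte
    // (bit pattern 10xxxxxx), which is not valid UTF-8 on its own.
    assert((a[1] & 0xC0) == 0x80);

    writeln("slice checks passed");
}
```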

/Oskar

Nov 23 2005

Jari-Matti Mäkelä <jmjmak invalid_utu.fi> writes:

Oskar Linde wrote:
 Jari-Matti Mäkelä wrote:
 
 My point here was that since char[] is a fully valid UTF-8 string and 
 the index/slice-operators are intelligent enough to work on the 
 Unicode-symbol level [code 1], we should be able to store Unicode 
 symbols to a char-variable as well. 

 
 
 No this is wrong. The index/slice operations have no idea about 
 Unicode/UTF.
 

Sorry, I didn't test this enough. At least the foreach-statement is 
Unicode-aware. Anyway, these operations and the char-type should know 
about Unicode.
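
A sketch of the foreach behaviour, assuming a present-day D compiler (the helper name `iterations` is made up for illustration): the element type of the loop variable decides whether the loop sees code units or decoded characters.

```d
import std.stdio;

// Hypothetical helper: number of loop iterations over s with element type T.
size_t iterations(T)(string s)
{
    size_t n;
    foreach (T c; s) // T = char: code units; T = dchar: decoded code points
        ++n;
    return n;
}

void main()
{
    string s = "\u00E5\u00E4\u00F6";
    writeln(iterations!char(s), " ", iterations!dchar(s)); // 6 3
}
```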

 You see, an UTF-8 symbol requires 8-32 bits of storage space, so it 
 would be perfectly possible to implement an UTF-8 symbol (char) using 
 standard 32-bit integers or 1-4 x 8-bit bytes.

 The current implementation seems to be always using 1 x 8-bit byte for 
 the char-type and n x 8-bit bytes for char[] - strings. The strings 
 work well but it's impossible to store a single UTF-8 character now.

 
 
 There is no such thing as a "UTF-8 character" or "UTF-8 symbol"... 
 "Unicode character" or "UTF-8 code unit"

Ok, "UTF-8 symbol" -> "the UTF-8 encoded bytestream of an Unicode symbol"

 [code 1]:

   char[] a = "∇∆∈∋";
   writefln(a[2]);  //outputs: ∈ (a stupid implementation would output 
 0x87)

 
 
 You really had me confused there for a moment (I had to test this). 
 Fortunately you are wrong. On what platform with what source code 
 encoding have you tested this? The following is on DMD 0.139 linux:

Sorry again, didn't test this. This ought to work, but it doesn't.

 import std.stdio;
 
 void main() {
   char[] a = "åäö";
   writef(a[2..4]);
 }
 
 will print "ä"
 Which is correct.

Actually it isn't correct behavior. There's no real need to change the 
string on a byte-level. Think now, you don't have to change individual 
bits on a ASCII-string either. In 7-bit ASCII the base unit is a 7-bit 
byte, in ISO-8859-x it is a 8-bit byte and in UTF-8 it is 8-32 bits. The 
smallest unit you need is the base unit.

Nov 24 2005

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Jari-Matti Mäkelä wrote:
 Oskar Linde wrote:
 
 There is no such thing as a "UTF-8 character" or "UTF-8 symbol"... 
 "Unicode character" or "UTF-8 code unit"

 
 Ok, "UTF-8 symbol" -> "the UTF-8 encoded bytestream of an Unicode symbol"

According to terminology, it is "UTF-8 code unit" or "UTF-8 code value".

 [code 1]:

   char[] a = "∇∆∈∋";
   writefln(a[2]);  //outputs: ∈ (a stupid implementation would output 
 0x87)



 You really had me confused there for a moment (I had to test this). 
 Fortunately you are wrong. On what platform with what source code 
 encoding have you tested this? The following is on DMD 0.139 linux:

 
 Sorry again, didn't test this. This ought to work, but it doesn't.
 
 import std.stdio;

 void main() {
   char[] a = "åäö";
   writef(a[2..4]);
 }

 will print "ä"
 Which is correct.

 
 
 Actually it isn't correct behavior. There's no real need to change the 
 string on a byte-level. Think now, you don't have to change individual 
 bits on a ASCII-string either. In 7-bit ASCII the base unit is a 7-bit 
 byte, in ISO-8859-x it is a 8-bit byte and in UTF-8 it is 8-32 bits. The 
 smallest unit you need is the base unit.

It's rather a matter of efficiency. Character indexing on a utf-8 array 
is O(n) (You can of course make it better, but with more complexity and 
higher memory requirements), while code unit indexing is O(1).

You actually very seldom need character indexing. Almost everything can 
be done on a code unit level.
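
To make the cost difference concrete, here is a sketch of character indexing on a UTF-8 string. It assumes present-day Phobos (`std.utf.stride` and `std.utf.decode`), and `codePointAt` is a made-up name.

```d
import std.stdio;
import std.utf : decode, stride;

// Hypothetical helper: returns the n-th code point (0-based) of s.
// It has to walk from the start of the string, so it is O(n).
dchar codePointAt(string s, size_t n)
{
    size_t i = 0;          // code unit index
    foreach (_; 0 .. n)
        i += stride(s, i); // skip one whole encoded code point
    return decode(s, i);   // decode the code point that starts at i
}

void main()
{
    string a = "\u00E5\u00E4\u00F6";
    assert(codePointAt(a, 1) == '\u00E4'); // second character, found by walking
    assert(a[1] == 0xA5);                  // O(1), but only a code unit
    writeln("indexing checks passed");
}
```

An index past the last character is not handled here; a real version would need to check the bounds.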

/Oskar

Nov 24 2005

Georg Wrede <georg.wrede nospam.org> writes:

Oskar Linde wrote:
 
 void main() {
   char[] a = "åäö";
   writef(a[2..4]);
 }
 
 will print "ä"
 Which is correct.

Hmmmmmmmmm. "Correct" is currently under intense debate here.

Nov 24 2005

Jari-Matti Mäkelä <jmjmak invalid_utu.fi> writes:

Georg Wrede wrote:
 Oskar Linde wrote:
 
 void main() {
   char[] a = "åäö";
   writef(a[2..4]);
 }

 will print "ä"
 Which is correct.

 
 
 Hmmmmmmmmm. "Correct" is currently under intense debate here.

Yes, the biggest problem here is that some people don't like the O(n) 
complexity of the 'correct' UTF-8 indexing.

I'm not a string handling expert but I don't like the current 
implementation:

char[] a = "åäö";
writefln(a.length);	// outputs: 6


char[] a = "tässä  tekstiä";
std.string.insert(a, 6, "vähän");
writefln(a);		// outputs: tässä  tekstiä
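
For what it's worth, a sketch of character-based counting and insertion on top of a code unit array. It assumes present-day Phobos (`std.utf.count` and `std.utf.toUTFindex`), and `insertAtChar` is a made-up helper, not a library function.

```d
import std.stdio;
import std.utf : count, toUTFindex;

// Hypothetical helper: insert piece before the charPos-th code point of s.
string insertAtChar(string s, size_t charPos, string piece)
{
    immutable i = toUTFindex(s, charPos); // code point index -> code unit index
    return s[0 .. i] ~ piece ~ s[i .. $];
}

void main()
{
    string a = "\u00E5\u00E4\u00F6";
    writeln(a.length, " ", count(a)); // 6 3

    // The text from the example above, with the word inserted at character 6.
    string b = "t\u00E4ss\u00E4  teksti\u00E4";
    writeln(insertAtChar(b, 6, "v\u00E4h\u00E4n"));
}
```

The translation from character position to code unit position is itself an O(n) walk, which is the cost under discussion.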

Nov 24 2005

Georg Wrede <georg.wrede nospam.org> writes:

Jari-Matti Mäkelä wrote:
 Yes, the biggest problem here is that some people don't like the O(n) 
 complexity of the 'correct' UTF-8 indexing.

Of course O(n) is worse than O(k). Then again, that may not be such a 
problem with actual applications, since the total time spent on this 
is usually (almost) negligible compared to the total runtime activity of 
the application.

And, if beta testing shows that the program is not "fast enough" with 
production data, then it is easy to profile a D program. If it turns out 
that precisely this causes the slowness, then one might use, say, UTF16 
in those situations. No biggie. Or UTF32.
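
A sketch of that switch, assuming present-day Phobos (`std.conv.to`): convert once, then character indexing is a plain array access.

```d
import std.conv : to;
import std.stdio;

void main()
{
    string  utf8  = "\u00E5\u00E4\u00F6";
    dstring utf32 = utf8.to!dstring; // one O(n) conversion, allocates a new array

    // In UTF-32 every code point is exactly one code unit,
    // so indexing by character is O(1).
    assert(utf32.length == 3);
    assert(utf32[1] == '\u00E4');

    writeln("UTF-32 indexing checks passed");
}
```

UTF-16 only gives the same guarantee for text that stays inside the Basic Multilingual Plane.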

It's actually a shame that we haven't tried all of this stuff already. 
The wc.d example might be a good candidate to try the three UTF 
encodings and time the results. And other examples would be quite easy 
to write.

(Ha, a chance for newbies to become "D gurus": publish your results here!)

Nov 24 2005

Georg Wrede <georg.wrede nospam.org> writes:

Regan Heath wrote:
 On Wed, 23 Nov 2005 21:16:03 +0200, Georg Wrede 
 <georg.wrede nospam.org>  wrote:
 
 A UTF-8 code unit can not be stored in char. Therefore UTF can't be  
 mentioned at all in this context.

 
 I suspect we're having terminology issues again.
 
 Regan

Thanks! Of course!

So "A UTF-8 code point".

Nov 23 2005

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Georg Wrede wrote:
 Regan Heath wrote:
 
 On Wed, 23 Nov 2005 21:16:03 +0200, Georg Wrede 
 <georg.wrede nospam.org>  wrote:

 A UTF-8 code unit can not be stored in char. Therefore UTF can't be  
 mentioned at all in this context.


 I suspect we're having terminology issues again.

 Regan

 
 
 Thanks! Of course!
 
 So "A UTF-8 code point".

No... Apparently code point is a point in the unicode space. I.e. a 
character. The guys that came up with that should be shot though. ;)

http://www.unicode.org/glossary/

There are no UTF-8 code points by terminology. Only

UTF-8 code unit == UTF-8 code value
Unicode code point == Unicode character == Unicode symbol

:)

/Oskar

Nov 23 2005

Georg Wrede <georg.wrede nospam.org> writes:

Oskar Linde wrote:
 Georg Wrede wrote:
 
 Regan Heath wrote:

 On Wed, 23 Nov 2005 21:16:03 +0200, Georg Wrede 
 <georg.wrede nospam.org>  wrote:

 A UTF-8 code unit can not be stored in char. Therefore UTF can't be  
 mentioned at all in this context.

 I suspect we're having terminology issues again.

 Regan

 Thanks! Of course!

 So "A UTF-8 code point".

 
 No... Apparently code point is a point in the unicode space. I.e. a 
 character. The guys that came up with that should be shot though. ;)
 
 http://www.unicode.org/glossary/
 
 There are no UTF-8 code points by terminology. Only
 
 UTF-8 code unit == UTF-8 code value
 Unicode code point == Unicode character == Unicode symbol
 
 :)
 
 /Oskar

Aaaarrrghhhhh!  :-P

Nov 23 2005

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Georg Wrede wrote:
 Oskar Linde wrote:
 
 Georg Wrede wrote:

 We've wasted truckloads of ink lately on the UTF issue.

 The prayer:

 Please remove the *char*, *wchar* and *dchar* basic data types from 
 the documentation!


 What at least should be changed is the suggestion that the C char is 
 replaceable by the D char, and a slight change in definition:

 char: UTF-8 code unit
 wchar: UTF-16 code unit
 dchar: UTF-32 code unit and/or unicode character

 
 
 A UTF-8 code unit can not be stored in char. [...]

Can it not? I thought I had been doing this all the time... Why?
(I know most Unicode code points aka characters can not be stored in a char)

 [...] Therefore UTF can't be
 mentioned at all in this context.

 By the same token, and because of symmetry, the wchar and dchar things 
 should vanish. Disappear. Get nuked.

A dchar can represent any Unicode character. Is that not enough reason 
to keep it?

 The language manual (as opposed to the Phobos docs), should not once use 
 the word UTF anywhere. Except possibly stating that a conformant 
 compiler has to accept source code files that are stored in UTF.

Why? It is often important to know the encoding of a string. (When 
dealing with system calls, c-library calls etc.) Doesn't D do the right 
thing to define Unicode as the character set and UTF-* as the supported 
encodings?

 
 ---
 
 If a programmer decides to do UTF mangling, then he should store the 
 intermediate results in ubytes, ushorts and uints. This way he can avoid 
 getting tangled and tripped up sooner or later.

Those are more or less equivalent to char, wchar, dchar. The difference 
is more a matter of convention. (and string/character literal handling)
Doesn't the confusion rather lie in the name? (char may imply character 
rather than code unit. I agree that this is unfortunate.)

 And this way he keeps remembering that it's his responsibility to get 
 things right.
 
 Please remove ""c, ""w and ""d from the documentation!


 So you suggest that users should use an explicit cast instead?

 No.

I think this may have been the reason for the implementation of the 
suffixes:

void print(char[] x) { printf("1\n"); }
void print(wchar[] x) { printf("2\n"); }
void main() { print("hello"); } // ambiguous: the literal matches both overloads

A remedy is very close to the current implementation. The following 
change in behaviour: string literals are always char[], but may be 
implicitly cast to wchar[] and dchar[]. print(char[] x) would therefore 
be a closer match.
(All casts of literals are of course done at compile time.)

Currently in D: string literals are __uncommitted char[] and may be 
implicitly cast to char[], wchar[] and dchar[]. No cast takes preference.
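
A sketch of the proposed rule in code. It is written for a present-day D compiler, where (to my understanding) an undecorated literal prefers the char-based overload; this says nothing about how DMD 0.139 behaves. The function name `which` is made up for illustration.

```d
import std.stdio;

// Hypothetical overload set, one overload per string type.
int which(string s)  { return 1; }
int which(wstring s) { return 2; }
int which(dstring s) { return 4; }

void main()
{
    assert(which("hello")  == 1); // undecorated literal: the char-based overload
    assert(which("hello"w) == 2); // suffix w selects the wchar-based overload
    assert(which("hello"d) == 4); // suffix d selects the dchar-based overload
    writeln("overload checks passed");
}
```

With this rule the suffixes are only needed to pick a non-default overload, never to resolve an ambiguity.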

D may be too agnostic about preferred encoding...

 How, or in what form the string literals get internally stored, is the 
 compiler vendor's own private matter. And definitely not part of a 
 language specification.

The form string literals are stored in affects the performance in many 
cases. Why remove this control from the programmer?
Rereading your post, trying to understand what you ask for, leads me to 
assume that under your suggestion, a poor but conforming compiler may 
include a hidden call to a conversion function and a hidden memory 
allocation in this function call:

foo("hello");
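
To show where such a conversion would sit, here is a sketch with the conversion written out. It assumes present-day Phobos (`std.utf.toUTF16`), and `wideUnits` is a made-up callee.

```d
import std.stdio;
import std.utf : toUTF16;

// Hypothetical callee that only accepts UTF-16 text.
size_t wideUnits(const(wchar)[] s)
{
    return s.length;
}

void main()
{
    // Literal: the compiler can emit it as UTF-16 directly, no run-time work.
    assert(wideUnits("hello") == 5);

    // Run-time string: the conversion, and its allocation, is spelled out.
    string s = "h\u00E4t";
    assert(wideUnits(s.toUTF16) == 3);

    writeln("conversion checks passed");
}
```

The concern above is about the second call happening silently; with an explicit call the cost stays visible in the source.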

 The string literal decorations only create a huge amount of distraction, 
 sending every programmer on a wild goose chase, possibly several times, 
 before they either find someone who explains things, or they switch 
 away from D.

I can not understand this confusion. It is currently very well specified 
how a character literal gets stored (and you get to pick between 3 
different encodings). The problem lies in the "" form, where you have 
not specified the encoding. The compiler will try to infer the encoding 
depending on context. This is a weakness I think should be fixed by:

- "" is char[], but may be implicitly cast to dchar[] and wchar[]

 Behind the scenes, a smart compiler manufacturer probably stores all 
 string literals as UTF-8. Nice. Or as some other representation that's 
 convenient for him, be it UTF-whatever, or even a native encoding. In 
 any case, the compiler knows what this representation is.
 
 When the string gets used (gets assigned to a variable, or put in a 
 structure, or printed to screen), then the compiler should implicitly 
 cast (as in toUTFxxx and not the current "cast") the string to what's 
 expected.
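
 A minimal sketch of the "convert as in toUTFxxx" idea, assuming current D and the Phobos functions `std.utf.toUTF8`, `toUTF16` and `toUTF32`. The sample word is only an illustration. The conversion transcodes the text, so the number of code units changes while the characters stay the same:

```d
import std.stdio;
import std.utf : toUTF8, toUTF16, toUTF32;

void main()
{
    string  s8  = "na\u00EFve";  // 6 UTF-8 code units (U+00EF takes two)
    wstring s16 = toUTF16(s8);   // transcoded to 5 UTF-16 code units
    dstring s32 = toUTF32(s8);   // transcoded to 5 UTF-32 code units

    writeln(s8.length, " ", s16.length, " ", s32.length);

    // Converting back yields the original UTF-8 text.
    assert(toUTF8(s16) == s8);
    assert(toUTF8(s32) == s8);
}
```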

I start to understand you now... (first time reading)

 We can have "string types", like [c/w/d]char[], but not standalone UTF 
 chars. When the string literal gets assigned to any of these, then it 
 should get converted.
 
 Actually, a smart compiler would already store the string literal in the 
 width the string will get used, it's got the info right there in the 
 same source file. [...]

But this is exactly what D does! Only in a well defined way. Not by 
hiding it as an implementation detail.

 [...] And in case the programmer is real dumb and assigns 
 the same literal to more than one UTF width, it could be stored in all 
 of those separately. -- But the brain cells of the compiler writer 
 should be put to more productive issues than this.

How do you assign _one_ literal to more than one variable?

 The D programmer should not necessarily even know that representation. 
 The only situation he would need to know it is if his program goes 
 directly to the executable memory image snooping for the literal. I'm 
 not sure that should be legal.

What is wrong with a well-defined representation?

Regards,

/Oskar

Nov 23 2005

"Kris" <fu bar.com> writes:

"Oskar Linde" <oskar.lindeREM OVEgmail.com> wrote
[snip]
 A remedy is very close to the current implementation. The following change 
 in behaviour: string literals are always char[], but may be implicitly 
 cast to wchar[] and dchar[]. print(char[] x) would therefore be a closer 
 match.
 (All casts of literals are of course done at compile time.)

 Currently in D: string literals are __uncommitted char[] and may be 
 implicitly cast to char[], wchar[] and dchar[]. No cast takes preference.

Yes, very close. Rather than be uncommitted they should default to char[]. A
suffix can be used to direct as appropriate. This makes the behaviour more 
consistent in two (known) ways.

There's the additional question as to whether the literal content should be 
used to imply the type (like literal chars and numerics do), but that's a 
different topic and probably not as important.

Nov 23 2005

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Kris wrote:

 There's the additional question as to whether the literal content should be 
 used to imply the type (like literal chars and numerics do), but that's a 
 different topic and probably not as important. 

How would the content imply the type? All Unicode strings are 
representable equally by char[], wchar[] and dchar[].
Do you mean that the most optimal (memory wise) encoding is used?
Or do you mean that file encoding could imply type?
(This means that transcoding the source code could change program 
behaviour.)

Char literals and numerals imply type because of limited 
representability. This is not the case with strings.

/Oskar

Nov 23 2005

"Kris" <fu bar.com> writes:

"Oskar Linde" <oskar.lindeREM OVEgmail.com> wrote
 Kris wrote:

 There's the additional question as to whether the literal content should 
 be used to imply the type (like literal chars and numerics do), but 
 that's a different topic and probably not as important.

 How would the content imply the type? All Unicode strings are 
 representable equally by char[], wchar[] and dchar[].
 Do you mean that the most optimal (memory wise) encoding is used?
 Or do you mean that file encoding could imply type?
 (This means that transcoding the source code could change program 
 behaviour.)

I meant in terms of implying the "default" type. We're suggesting that 
default be char[], but if the literal contained unicode chars then the 
default might be something else. This is the method used to assign type to a 
char literal, but Walter has noted it might be confusing to do something 
similar with string literals.

As indicated, I don't think this aspect is of much importance compared to 
the issues associated with an uncommitted literal type ~ that uncommitted
aspect needs to be fixed, IMO.

Nov 23 2005

"Regan Heath" <regan netwin.co.nz> writes:

On Wed, 23 Nov 2005 13:18:49 -0800, Kris <fu bar.com> wrote:
 "Oskar Linde" <oskar.lindeREM OVEgmail.com> wrote
 Kris wrote:

 There's the additional question as to whether the literal content  
 should
 be used to imply the type (like literal chars and numerics do), but
 that's a different topic and probably not as important.

 How would the content imply the type? All Unicode strings are
 representable equally by char[], wchar[] and dchar[].
 Do you mean that the most optimal (memory wise) encoding is used?
 Or do you mean that file encoding could imply type?
 (This means that transcoding the source code could change program
 behaviour.)

 I meant in terms of implying the "default" type. We're suggesting that
 default be char[], but if the literal contained unicode chars then the
 default might be something else. This is the method used to assign type  
 to a
 char literal, but Walter has noted it might be confusing to do something
 similar with string literals.

Did you see my reply to you in this other thread .. wait a minute ..  
where's it gone .. no wonder you didn't reply, it seems my post was never  
made. Let me try again:
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.learn/2216

What do you two think of what I have there?

 As indicated, I don't think this aspect is of much importance compared to
 the issues associated with an uncommitted literal type ~ that uncommitted
 aspect needs to be fixed, IMO.

I agree. While there exists a slight risk in doing so, it's no different  
to the risk involved with the current handling of integer literals. This  
change would essentially make string literal handling consistent with
integer literal handling.

Regan

Nov 23 2005

"Kris" <fu bar.com> writes:

"Regan Heath" <regan netwin.co.nz> wrote
[snip]
 As indicated, I don't think this aspect is of much importance compared to
 the issues associated with an uncommitted literal type ~ that uncommitted
 aspect needs to be fixed, IMO.

 I agree. While there exists a slight risk in doing so, it's no different 
 to the risk involved with the current handling of integer literals. This 
 change would essentially make string literal handling consistent with
 integer literal handling.

Consistent with auto string-literals too.

Nov 23 2005

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Regan Heath wrote:
 On Wed, 23 Nov 2005 13:18:49 -0800, Kris <fu bar.com> wrote:
 
 "Oskar Linde" <oskar.lindeREM OVEgmail.com> wrote

 Kris wrote:

 There's the additional question as to whether the literal content  
 should
 be used to imply the type (like literal chars and numerics do), but
 that's a different topic and probably not as important.


 How would the content imply the type? All Unicode strings are
 representable equally by char[], wchar[] and dchar[].
 Do you mean that the most optimal (memory wise) encoding is used?
 Or do you mean that file encoding could imply type?
 (This means that transcoding the source code could change program
 behaviour.)


 I meant in terms of implying the "default" type. We're suggesting that
 default be char[], but if the literal contained unicode chars then the
 default might be something else. This is the method used to assign 
 type  to a
 char literal, but Walter has noted it might be confusing to do something
 similar with string literals.


I think I see what you mean. By unicode char, you actually mean unicode 
character with code point > 127 (i.e. not representable in ASCII).

 Did you see my reply to you in this other thread .. wait a minute ..  
 where's it gone .. no wonder you didn't reply, it seems my post was 
 never  made. Let me try again:
 http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.learn/2216
 
 What do you two think of what I have there?
 
 As indicated, I don't think this aspect is of much importance compared to
 the issues associated with an uncommitted literal type ~ that uncommitted
 aspect needs to be fixed, IMO.

 
 
 I agree. While there exists a slight risk in doing so, it's no 
 different  to the risk involved with the current handling of integer 
 literals. This  change would essentially make string literal handling
 consistent with  integer literal handling.

Reasoning: The only risk in always defaulting to char[] is in efficiency?
Like the following scenario:

A user writes string literals, in say Chinese, without a suffix, making
those strings char[] by default. Those literals are fed to functions
having multiple implementations, optimised for char[], wchar[] and
dchar[]. Since the optimal encoding for Chinese is UTF-16 (guessing
here), the user's code will be suboptimal.

You are hypothesizing that a heuristic could be used to pick an encoding
depending on content. The heuristic you are considering seems to be:
If the string contains characters not representable by a single UTF-16 
code unit, make the string dchar[].
Else, if the string contains characters not representable by a single 
UTF-8 code unit, make the string wchar[].
Else, the string becomes char[].
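
The heuristic above can be sketched as follows, assuming current D. The names `Width` and `narrowest` are made up for the illustration, and the function works on already decoded text (dchar[]) instead of inside the compiler:

```d
import std.stdio;

// Hypothetical result type: the narrowest array type in which
// every character still fits in a single code unit.
enum Width { utf8, utf16, utf32 }

Width narrowest(const(dchar)[] text)
{
    Width w = Width.utf8;
    foreach (dchar c; text)
    {
        if (c > 0xFFFF)
            return Width.utf32;  // needs a surrogate pair in UTF-16
        if (c > 0x7F)
            w = Width.utf16;     // needs more than one UTF-8 code unit
    }
    return w;
}

void main()
{
    writeln(narrowest("hello"d));       // ASCII only
    writeln(narrowest("na\u00EFve"d));  // contains U+00EF
    writeln(narrowest("\U0001D11E"d));  // outside the 16-bit range
}
```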

Hmm... By having such a rule, you would also guarantee that:
a) all indexing of string literals give valid unicode characters
b) all slicing of string literals give valid unicode strings
c) counting characters in a Unicode aware editor gives you the correct index

Without such a rule, the following might give surprising results for
"naïve users"[0..5]

a), b) and c) above seem to me to be quite strong arguments. I have 
almost convinced myself that there might be a good reason to have string 
literal types depend on content... :)
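
A small sketch of the slicing surprise, assuming current D. The ï is written as the escape \u00EF so the example does not depend on the encoding of the source file:

```d
import std.stdio;

void main()
{
    string  s8  = "na\u00EFve users";   // UTF-8: the ï takes two code units
    wstring s16 = "na\u00EFve users"w;  // UTF-16: one code unit per character here

    writeln(s8.length);   // number of UTF-8 code units
    writeln(s16.length);  // number of UTF-16 code units

    // Slices count code units, not characters.
    writeln(s8[0 .. 5]);   // five code units cover only four characters
    writeln(s16[0 .. 5]);  // five code units cover five characters
}
```

The char[] slice comes out one character short, and a slice such as s8[0 .. 3] would end in the middle of the ï and no longer be valid UTF-8.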

/Oskar

Nov 23 2005

Georg Wrede <georg.wrede nospam.org> writes:

Oskar Linde wrote:
 Regan Heath wrote:
 On Wed, 23 Nov 2005 13:18:49 -0800, Kris <fu bar.com> wrote:
 "Oskar Linde" <oskar.lindeREM OVEgmail.com> wrote
 Kris wrote:
 
 There's the additional question as to whether the literal
 content should be used to imply the type (like literal chars
 and numerics do), but that's a different topic and probably
 not as important.

 
 How would the content imply the type? All Unicode strings are 
 representable equally by char[], wchar[] and dchar[]. Do you
 mean that the most optimal (memory wise) encoding is used? Or
 do you mean that file encoding could imply type? (This means
 that transcoding the source code could change program 
 behaviour.)

 
 I meant in terms of implying the "default" type. We're suggesting
 that default be char[], but if the literal contained unicode
 chars then the default might be something else. This is the
 method used to assign type  to a char literal, but Walter has
 noted it might be confusing to do something similar with string
 literals.


 
 I think I see what you mean. By unicode char, you actually mean
 unicode character with code point > 127 (i.e. not representable in
 ASCII).

I'd be willing to say that (see my other post in this thread, a couple 
of hours ago), the difference in execution time between having the 
string literal decoded from "suboptimal" to the needed type is negligible.

And deciding based on content just makes things complicated. Both the 
deciding, the storing, and the retrieving.

 From that follows, that the time for Walter to do the code for that, is 
wasted. Much better to just store the literal in (whatever we, or 
actually Walter decide) just some type, and be done with it. At usage 
time it'd then get cast as needed, if needed. Simple.

 Did you see my reply to you in this other thread .. wait a minute
 .. where's it gone .. no wonder you didn't reply, it seems my post
 was never  made. Let me try again: 
 http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.learn/2216
 
 What do you two think of what I have there?
 
 As indicated, I don't think this aspect is of much importance 
 compared to the issues associated with an uncommitted literal type
 ~ that uncommitted aspect needs to be fixed, IMO.

 
 I agree. While there exists a slight risk in doing so, it's no 
 different  to the risk involved with the current handling of
 integer literals. This  change would essentially make string
 literal handling consistent with  integer literal handling.

 
 Reasoning: The only risk always defaulting to char[] is in
 efficiency? Like the following scenario:
 
 A user writes string literals, in say Chinese, without a suffix,
 making those strings char[] by default. Those literals are fed to
 functions having multiple implementations, optimised for char[],
 wchar[] and dchar[]. Since the optimal encoding for Chinese is
 UTF-16 (guessing here), the user's code will be suboptimal.

Ok, let's have a vote:

Let's say we have 100 000 French names, and we want to sort them. We 
have the same names in a char[][], wchar[][] and a dchar[][].

Which is fastest to sort?

Say we have 100 000 Chinese names? Which is fastest?

What if we have 100 000 English names?


(Of course somebody will actually code and run this. But I'd like to 
hear it from folks _before_ that.)

Nov 23 2005

Georg Wrede <georg.wrede nospam.org> writes:

Oskar Linde wrote:
 Georg Wrede wrote:
 Oskar Linde wrote:
 Georg Wrede wrote:

 We've wasted truckloads of ink lately on the UTF issue.

 The prayer:

 Please remove the *char*, *wchar* and *dchar* basic data types from 
 the documentation!

 What at least should be changed is the suggestion that the C char is
 replaceable by the D char, and a slight change in definition:

 char: UTF-8 code unit
 wchar: UTF-16 code unit
 dchar: UTF-32 code unit and/or unicode character

 A UTF-8 code unit can not be stored in char. [...]

 
 Can it not? I thought I had been doing this all the time... Why?
 (I know most Unicode code points aka characters can not be stored in a 
 char)

Like Regan pointed out, I meant code point.

 [...] Therefore UTF can't be
 mentioned at all in this context.

 
 By the same token, and because of symmetry, the wchar and dchar things 
 should vanish. Disappear. Get nuked.

 
 A dchar can represent any Unicode character. Is that not enough reason 
 to keep it?

No.

What we need is to eradicate this entire issue from stealing bandwidth 
between the ears of every soul who reads the D documents.

 The language manual (as opposed to the Phobos docs), should not once 
 use the word UTF anywhere. Except possibly stating that a conformant 
 compiler has to accept source code files that are stored in UTF.

 
 Why? It is often important to know the encoding of a string. (When 
 dealing with system calls, c-library calls etc.) Doesn't D do the right 
 thing to define Unicode as the character set and UTF-* as the supported 
 encodings?

D docs, no.

Phobos docs, yes. But only in std.utf, nowhere else.

 If a programmer decides to do UTF mangling, then he should store the 
 intermediate results in ubytes, ushorts and uints. This way he can 
 avoid getting tangled and tripped up sooner or later.

 
 Those are more or less equivalent to char, wchar, dchar.

Those are exactly equivalent.

The issue is not technical, compiler related, language related, 
anything. But, having char, wchar, and dchar gives the impression that 
they have something inherently to do with UTF, and therefore there's 
something special with them in D, as the docs now stand.

That in turn, implies (to the reader, not necessarily to the writer), 
that there is some kind of difference in how D treats them, compared to 
ubyte, ushort and uint.

Well, there is a difference. But that difference exists _only_ in char[] 
and wchar[]. From the docs however, the reader gets the impression that 
there is something in char, wchar and dchar themselves that 
differentiates them from whatever the reader previously may have assumed.

 The difference is more a matter of convention. (and string/character
 literal handling) Doesn't the confusion rather lie in the name? (char
 may imply character rather than code unit. I agree that this is
 unfortunate.)

Yes, yes!

And that precisely is the issue here.

 And this way he keeps remembering that it's on his responsibility to 
 get things right.

 Please remove ""c, ""w and ""d from the documentation!

 So you suggest that users should use an explicit cast instead?

 No.

 
 I think this may have been the reason for the implementation of the 
 suffixes:
 
 void print(char[] x) { printf("1\n"); }
 void print(wchar[] x) { printf("2\n"); }
 void main() { print("hello"); }
 
 A remedy is very close to the current implementation. The following 
 change in behaviour: string literals are always char[], but may be 
 implicitly cast to wchar[] and dchar[]. print(char[] x) would
 therefore be a closer match. (All casts of literals are of course
 done at compile time.)

I'd have no problem with that.

As long as we all agree that this is Because We Just Happened to
Choose So -- and not because there'd be any Real Reason For Precisely 
This Choice.

(That does sound stupid, but it's actually a very important distinction 
here.)

 Currently in D: string literals are __uncommitted char[] and may be 
 implicitly cast to char[], wchar[] and dchar[]. No cast takes preference.
 
 D may be too agnostic about preferred encoding...

Yes. The specters were lurking behind the back, so the programmer got a 
little paranoid in the night.

 How, or in what form the string literals get internally stored, is the 
 compiler vendor's own private matter. And definitely not part of a 
 language specification.

 
 The form string literals are stored in affects the performance in many 
 cases. Why remove this control from the programmer?

Because he really should not have to bother.

(You might take an average program, count the literal strings, and then 
time the execution. Figure out the stupidest choice between UTF[8..32] 
and the OS's native encoding, and then calculate the time for decoding. 
I'd say that this amounts to such peanuts that it really is not worth 
thinking about.)

If a programmer writes a 10kloc program, half of which is string 
literals, then maybe yes, it could make a measurable difference. OTOH, 
such a programmer's code probably has bigger performance issues (and 
others) anyhow.

 Rereading your post, trying to understand what you ask for leads me to
 assume that by your suggestion, a poor, but conforming, compiler may 
 include a hidden call to a conversion function and a hidden memory 
 allocation in this function call:
 
 foo("hello");

Yes. Don't buy anything from that company.

 The string literal decorations only create a huge amount of 
 distraction, sending every programmer on a wild goose chase, possibly 
 several times, before they either find someone who explains things,
 or they switch away from D.

 
 I can not understand this confusion. It is currently very well specified 
 how a character literal gets stored (and you get to pick between 3 
 different encodings). 

This confusion is unnecessary, and based on a misconception. And I dare 
say, most of this confusion is directly a result from _having_ the 
string literal decorations in the first place.

 The problem lies in the "" form, where you have 
 not specified the encoding. The compiler will try to infer the encoding 
 depending on context. This is a weakness I think should be fixed by:
 
 - "" is char[], but may be implicitly cast to dchar[] and wchar[]

That would fix it technically. I'd vote yes.

The bigger thing to fix is the rest. (char not being code point, gotchas 
when having arrays of char which you then slice, etc. -- still have to 
be sorted out.)

 Behind the scenes, a smart compiler manufacturer probably stores all 
 string literals as UTF-8. Nice. Or as some other representation that's 
 convenient for him, be it UTF-whatever, or even a native encoding. In 
 any case, the compiler knows what this representation is.

 When the string gets used (gets assigned to a variable, or put in a 
 structure, or printed to screen), then the compiler should implicitly 
 cast (as in toUTFxxx and not the current "cast") the string to what's 
 expected.

 
 I start to understand you now... (first time reading)
 
 We can have "string types", like [c/w/d]char[], but not standalone UTF 
 chars. When the string literal gets assigned to any of these, then it 
 should get converted.

 Actually, a smart compiler would already store the string literal in 
 the width the string will get used, it's got the info right there in 
 the same source file. [...]


 [...] And in case the programmer is real dumb and assigns the same 
 literal to more than one UTF width, it could be stored in all of those 
 separately. -- But the brain cells of the compiler writer should be 
 put to more productive issues than this.

 
 How do you assign _one_ literal to more than one variable?

Gee, thanks! My bad.

Oh-oh, it's not my bad, after all. If another literal with the same 
content gets assigned in other places, it's customary for the compiler 
to only store one copy of it in the binary. So, of course then, _that_ 
compiler might then store it in different encodings if it is assigned to 
different width variables in the source.

 The D programmer should not necessarily even know that representation. 
 The only situation he would need to know it is if his program goes 
 directly to the executable memory image snooping for the literal. I'm 
 not sure that should be legal.

 
 What is wrong with a well-defined representation?

First of all, whether the machine is big endian or little endian. What if it
wants to store them in an unexpected place? Maybe it's a CIA certified 
compiler that encrypts all literals in binaries and inserts on-the-fly 
runtime decoding routines in all string fetches? (I REALLY wouldn't be 
surprised to hear they have one.)

Thousands of reasons why you'd be very interested in _not_ knowing the 
representation. "What you don't know doesn't hurt you, and at least it 
doesn't borrow brain cells from you."

    for(byte i=0; i<10; i++) {writefln("foo %d", i);}

Do you honestly know whether i is a byte or an int in the compiled 
program? Should you even care? How many have even thought about it?

---

Having said that, if somebody asks me, I'd vote for UTF-8 on Linux, and 
whatever is appropriate on Windows. Then I could look for the strings in 
the binary file with the unix "strings" command.

Nov 23 2005

Jari-Matti Mäkelä <jmjmak invalid_utu.fi> writes:

Georg Wrede wrote:

<snip>

 (You might take an average program, count the literal strings, and then 
 time the execution. Figure out the stupidest choice between UTF[8..32] 
 and the OS's native encoding, and then calculate the time for decoding. 
 I'd say that this amounts to such peanuts that it really is not worth 
 thinking about.)
 
 If a programmer writes a 10kloc program, half of which is string 
 literals, then maybe yes, it could make a measurable difference. OTOH, 
 such a programmer's code probably has bigger performance issues (and 
 others) anyhow.

The complexity of utf->utf conversions should be linear. I don't believe 
anyone wants to convert 200 kB strings while drawing a 3d scene etc.

IMHO this UTF-thing is a problem since some people here don't want to 
write their own libraries and would like the compiler to do everything 
for them.

I totally agree that we should have only one type of string structure. I 
write Unicode-compliant programs every week and have never used these 
wchar/dchar structures. What I would really like is a Unicode-stream 
class that could read/write valid Unicode text files.

 The problem lies in the "" form, where you have not specified the 
 encoding. The compiler will try to infer the encoding depending on 
 context. This is a weakness I think should be fixed by:

 - "" is char[], but may be implicitly cast to dchar[] and wchar[]

 
 
 That would fix it technically. I'd vote yes.
 
 The bigger thing to fix is the rest. (char not being code point, gotchas 
 when having arrays of char which you then slice, etc. -- still have to 
 be sorted out.)

I don't believe this is a big problem. This works correctly already:

foreach(wchar c; char[] unicodestring) { ... }

Walter should fix this so that the following would work too:

foreach(char c; char[] unicodestring) { ... }
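
For comparison, a minimal sketch of how the loop variable type changes what foreach visits, assuming current D (behaviour in the 2005 compiler may differ). The helper names are made up for the illustration. With a dchar variable the loop decodes whole code points, while a char variable walks the raw UTF-8 code units:

```d
import std.stdio;

// Counts raw UTF-8 code units: foreach with a char variable does not decode.
size_t countCodeUnits(const(char)[] s)
{
    size_t n = 0;
    foreach (char c; s)
        ++n;
    return n;
}

// Counts code points: foreach with a dchar variable decodes the UTF-8 on the fly.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "na\u00EFve";
    writeln(countCodeUnits(s), " code units, ",
            countCodePoints(s), " code points");
}
```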

BTW, should the sorting of char-arrays be a language or library feature?

 What is wrong with a well-defined representation?

 
 
 First of all, if the machine is big endian or little endian. What if it 
 wants to store them in an unexpected place? Maybe it's a CIA certified 
 compiler that encrypts all literals in binaries and inserts on-the-fly 
 runtime decoding routines in all string fetches? (I REALLY wouldn't be 
 surprised to hear they have one.)
 

Now wasn't that already the second conspiracy theory you have posted 
today... go get some sleep, man :)

 Having said that, if somebody asks me, I'd vote for UTF-8 on Linux, and 
 whatever is appropriate on Windows. Then I could look for the strings in 
 the binary file with the unix "strings" command.

UTF-8 on Linux. Maybe UTF-16 on Windows (performance)? OTOH, UTF-8 has 
serious advantages over UTF-7/16/32 [1], [2] (some of us appreciate 
other things than just raw performance)

[1] http://www.everything2.com/index.pl?node=UTF-8
[2] http://en.wikipedia.org/wiki/Utf-8

Nov 23 2005
