digitalmars.D - char[] initialization

reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
Could somebody shed light on the subject:

According to http://digitalmars.com/d/type.html

characters in D are initialized with the following values:

char -> 0xFF
wchar -> 0xFFFF
dchar -> 0x0000FFFF

What is the idea behind initializing strings with a valid character code
instead of 0?

And that 0xFFFF.... Why was this special character (see Basic
Multilingual Plane) selected?

To avoid use of strcat & co. on d strings?

(Sorry if it was discussed before)

Andrew Fedoniouk.
http://terrainformatica.com
Jul 29 2006
parent reply kris <foo bar.com> writes:
Andrew Fedoniouk wrote:
 Could somebody shed light on the subject:
 
 According to http://digitalmars.com/d/type.html
 
 characters in D are getting initialized by following values
 
 char -> 0xFF
 wchar -> 0xFFFF
 dchar -> 0x0000FFFF
 
 what is the idea to have string initialized by valid character code instead 
 of 0?

Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html
Jul 29 2006
next sibling parent reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
kris wrote:
 Andrew Fedoniouk wrote:
 
 Could somebody shed light on the subject:

 According to http://digitalmars.com/d/type.html

 characters in D are getting initialized by following values

 char -> 0xFF
 wchar -> 0xFFFF
 dchar -> 0x0000FFFF

 what is the idea to have string initialized by valid character code 
 instead of 0?

Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html

I don't understand why the compiler should initialize variables to illegal values!! OK, is it because you have to initialize variables explicitly? Just WHY? As far as I know, the notion that non-initialized variables are bad is a side-effect of the C (and C++) language, because non-inited variables are garbage. However, in D (and Java .. and others), vars are always initialized. So, if the compiler can init variables to good defaults, why should it still be considered a bad habit not to init variables explicitly? That just makes no sense to me.
Jul 29 2006
parent reply Derek <derek psyc.ward> writes:
On Sat, 29 Jul 2006 06:29:21 -0600, Hasan Aljudy wrote:

 kris wrote:
 Andrew Fedoniouk wrote:
 
 Could somebody shed light on the subject:

 According to http://digitalmars.com/d/type.html

 characters in D are getting initialized by following values

 char -> 0xFF
 wchar -> 0xFFFF
 dchar -> 0x0000FFFF

 what is the idea to have string initialized by valid character code 
 instead of 0?

Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html

I don't understand why the compiler should initialize variables to illegal values!! OK, is it because you have to initialize variables explicitly? Just WHY? As far as I know, the notion that non-initialized variables are bad is a side-effect of the C (and C++) language, because non-inited variables are garbage. However, in D (and Java .. and others), vars are always initialized. So, if the compiler can init variables to good defaults, why should it still be considered a bad habit not to init variables explicitly? That just makes no sense to me.

I believe that D's philosophy is that all datatypes are initialized to 'invalid' values if they possibly can be. The ones that can't are integers, bytes, and bools. References, floating point values, and characters are initialized to 'wrong' values.

--
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Jul 29 2006
parent reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
Derek wrote:
 On Sat, 29 Jul 2006 06:29:21 -0600, Hasan Aljudy wrote:
 
 
kris wrote:

Andrew Fedoniouk wrote:


Could somebody shed light on the subject:

According to http://digitalmars.com/d/type.html

characters in D are getting initialized by following values

char -> 0xFF
wchar -> 0xFFFF
dchar -> 0x0000FFFF

what is the idea to have string initialized by valid character code 
instead of 0?

Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html

I don't understand why the compiler should initialize variables to illegal values!! OK, is it because you have to initialize variables explicitly? Just WHY? As far as I know, the notion that non-initialized variables are bad is a side-effect of the C (and C++) language, because non-inited variables are garbage. However, in D (and Java .. and others), vars are always initialized. So, if the compiler can init variables to good defaults, why should it still be considered a bad habit not to init variables explicitly? That just makes no sense to me.

I believe that D's philopsophy is that all datatypes are initialized to 'invalid' values if they possibly can be. The ones that can't are integers, bytes, and bools. References, floating point values, and characters are initialized to 'wrong' values.

I know .. I was asking "but why?" :(
Jul 29 2006
parent reply Robert Atkinson <Robert.Atkinson NO.gmail.com.SPAM> writes:
Hasan Aljudy wrote:
 
 
 Derek wrote:
 On Sat, 29 Jul 2006 06:29:21 -0600, Hasan Aljudy wrote:


 kris wrote:

 Andrew Fedoniouk wrote:


 Could somebody shed light on the subject:

 According to http://digitalmars.com/d/type.html

 characters in D are getting initialized by following values

 char -> 0xFF
 wchar -> 0xFFFF
 dchar -> 0x0000FFFF

 what is the idea to have string initialized by valid character code 
 instead of 0?

Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html

I don't understand why the compiler should initialize variables to illegal values!! OK, is it because you have to initialize variables explicitly? Just WHY? As far as I know, the notion that non-initialized variables are bad is a side-effect of the C (and C++) language, because non-inited variables are garbage. However, in D (and Java .. and others), vars are always initialized. So, if the compiler can init variables to good defaults, why should it still be considered a bad habit not to init variables explicitly? That just makes no sense to me.

I believe that D's philopsophy is that all datatypes are initialized to 'invalid' values if they possibly can be. The ones that can't are integers, bytes, and bools. References, floating point values, and characters are initialized to 'wrong' values.

I know .. I was asking "but why?" :(

The intent, I believe, is to signal the programmer as soon as possible that they have missed something. In C/C++ an uninitialised variable can easily survive thousands of debug runs until it 'initialises' to a completely wrong value, most often in a release build on an end user's system. Take floats: by starting at NaN, you'll know from the very start that you missed initialising it, and you'll catch the error earlier in your debug process.
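
A minimal sketch of the NaN behaviour described above, assuming a current D2 compiler with Phobos. The function `addToTotal` and the variable names are hypothetical and only serve the illustration.

```d
import std.math : isNaN;

// Adds a price to a running total; the name is illustrative only.
float addToTotal(float total, float price)
{
    return total + price;
}

void main()
{
    float total;                 // no initializer: starts as float.nan, not 0.0
    assert(isNaN(total));

    // The forgotten initialization poisons every later result,
    // so the mistake shows up on the first run.
    assert(isNaN(addToTotal(total, 9.99f)));

    int count;                   // integers have no NaN, so they start at 0
    assert(count == 0);
}
```

With a zero default the sum would silently look plausible; with NaN the missing initializer is visible in the very first result.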
Jul 29 2006
parent reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
Robert Atkinson wrote:
 Hasan Aljudy wrote:
 
 Derek wrote:

 On Sat, 29 Jul 2006 06:29:21 -0600, Hasan Aljudy wrote:


 kris wrote:

 Andrew Fedoniouk wrote:


 Could somebody shed light on the subject:

 According to http://digitalmars.com/d/type.html

 characters in D are getting initialized by following values

 char -> 0xFF
 wchar -> 0xFFFF
 dchar -> 0x0000FFFF

 what is the idea to have string initialized by valid character 
 code instead of 0?

Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html

I don't understand why the compiler should initialize variables to illegal values!! OK, is it because you have to initialize variables explicitly? Just WHY? As far as I know, the notion that non-initialized variables are bad is a side-effect of the C (and C++) language, because non-inited variables are garbage. However, in D (and Java .. and others), vars are always initialized. So, if the compiler can init variables to good defaults, why should it still be considered a bad habit not to init variables explicitly? That just makes no sense to me.

I believe that D's philopsophy is that all datatypes are initialized to 'invalid' values if they possibly can be. The ones that can't are integers, bytes, and bools. References, floating point values, and characters are initialized to 'wrong' values.

I know .. I was asking "but why?" :(

The intent I believe is to signal the programmer as soon as possible showing they have missed something. In C/C++ an un-initialised variable can easily survive thousands of debug runs until it 'initialises' to a completely wrong value. Most often on a release build and a end-users system. Take floats. By starting at NaN, from the very start you'll know you missed initialising it. You'll catch the error earlier in your debug process.

Still missing my point. In C/C++ that's a problem because uninitialized variables carry garbage. In D, it's not; if you init them to a reasonable valid default, this problem won't exist anymore.

If leaving variables uninitialized is bad just for its own sake, then the compiler should detect it and issue an error/warning; otherwise it should default to a reasonable valid value: in this case, zero for chars and floats.
Jul 29 2006
parent reply Carlos Santander <csantander619 gmail.com> writes:
Hasan Aljudy escribi:
 
 
 Still missing my point.
 in C/C++ that's a problem because un-initialized variables carry garbage.
 in D, it's not; if you init them to a reasonable valid default, this 
 problem won't exist anymore.
 
 If un-initializing is bad just for its own sake .. then the compiler 
 should detect it and issue an error/warning, otherwise it should default 
 to a reasonable valid value; in this case, zero for chars and floats.

The issue here is that a "reasonable valid default" will change from one app to the other, one function to the next, one variable to another, so the intention here is to force the developer to be explicit about his/her intentions. Walter has said in the past that if there was a NaN for int/long/etc, he'd use that instead of 0.

--
Carlos Santander Bernal
Jul 29 2006
parent Walter Bright <newshound digitalmars.com> writes:
Carlos Santander wrote:
 Hasan Aljudy escribi:
 Still missing my point.
 in C/C++ that's a problem because un-initialized variables carry garbage.
 in D, it's not; if you init them to a reasonable valid default, this 
 problem won't exist anymore.

 If un-initializing is bad just for its own sake .. then the compiler 
 should detect it and issue an error/warning, otherwise it should 
 default to a reasonable valid value; in this case, zero for chars and 
 floats.

The issue here is, a "reasonable valid default" will change from one app to the other, one function to the next, one variable to another, so the intention here is force the developer to be explicit about his/her intentions. Walter has said in the past that if there was a NAN for int/long/etc, he'd use that instead of 0.

That's right. Also, given:

    int x;
    foo(x);

it is impossible for the maintenance programmer to distinguish between:

1) x is meant to be 0
2) the original programmer forgot to initialize x to 3, and there's a bug in the program

Ok, fine, so why doesn't the compiler just squawk about referencing uninitialized variables? Consider:

    int x;
    ...
    if (...)
    {   x = 3;
        ...
    }
    ...
    if (...)
    {   ...
        foo(x);
    }

There is no way for the compiler to determine that x in foo(x) is always initialized. So it must assume otherwise, and squawk about it. So how does our harried programmer fix it?

    int x = some-random-value;
    ...
    if (...)
    {   x = 3;
        ...
    }
    ...
    if (...)
    {   ...
        foo(x);
    }

The compiler is now happy, but pity the poor maintenance programmer. He notices the some-random-value, and wonders what that value means. He analyzes the code, and discovers that that value is never used. Was it intended to be used? Did some previous maintenance programmer break the code? What's going on here?

My take on programming languages is that the semantics should have the obvious meaning - i.e. if the programmer initializes a variable to a value, that value should have meaning. He should not have to initialize a variable because of some subtle *side effect* such initialization has.

Programmers should not be required to add dead assignments, unreachable code, etc., just to keep the compiler happy.
Jul 29 2006
prev sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"kris" <foo bar.com> wrote in message news:eaf9ei$2m7$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Could somebody shed light on the subject:

 According to http://digitalmars.com/d/type.html

 characters in D are getting initialized by following values

 char -> 0xFF
 wchar -> 0xFFFF
 dchar -> 0x0000FFFF

 what is the idea to have string initialized by valid character code 
 instead of 0?

Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html

Thanks, Kris.

To Walter:

Following assumption (http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):

"codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, it is guaranteed by the Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character. This codepoint will remain forever unassigned, precisely so that it may be used for purposes such as this."

is just wrong.

1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from R-zone: {U+FFF0..U+FFFF} - region assigned already.

2) For char[] selection of 0xFF is wrong and even worse. For example, the character with code 0xFF in Latin-1 encoding is "y diaeresis". In many European language and Far East encodings 0xFF is a valid code point. For example, in KOI-8 encoding 0xFF is an officially assigned value.

What is the point of the current initialization? If you are doing initialization already, and this initialization is part of the specification, why not use the official "Nul" values in this case? You are doing the same for floats - you are using NaNs there (Null value for floats). Why not use the same for chars?

I think I understand your intention: 0xFF is a sort of debug value, as in Visual C++:

0xCDCDCDCD - Allocated in heap, but not initialized
0xDDDDDDDD - Released heap memory.
0xFDFDFDFD - "NoMansLand" fences automatically placed at boundary of heap memory. Should never be overwritten. If you do overwrite one, you're probably walking off the end of an array.
0xCCCCCCCC - Allocated on stack, but not initialized

but this is far from the concept of a null codepoint in character encodings.

Andrew Fedoniouk.
http://terrainformatica.com
Jul 29 2006
next sibling parent reply Carlos Santander <csantander619 gmail.com> writes:
Andrew Fedoniouk escribi:
 2) For char[] selection of 0xFF is wrong and even worse.
 For example character with code 0xFF in Latin-I encoding is
 "y diaeresis". In many European languages and Far East encodings 0xFF is a 
 valid code point.
 For example in KOI-8 encoding 0xFF is officially assigned value.
 

But D's chars are UTF-8, not Latin-1 nor any other, so I don't think this applies.

--
Carlos Santander Bernal
Jul 29 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Carlos Santander" <csantander619 gmail.com> wrote in message 
news:eagiip$1lad$3 digitaldaemon.com...
 Andrew Fedoniouk escribi:
 2) For char[] selection of 0xFF is wrong and even worse.
 For example character with code 0xFF in Latin-I encoding is
 "y diaeresis". In many European languages and Far East encodings 0xFF is 
 a valid code point.
 For example in KOI-8 encoding 0xFF is officially assigned value.

But D's chars are UTF-8, not Latin-1 nor any other, so I don't think this applies.

UTF-8 is a multibyte transport encoding of the full 21-bit UNICODE codepoint. Strictly speaking, a single byte in a UTF-8 sequence cannot be named a char[acter].

char as a typename implies that a value of its type contains some complete codepoint (assuming that information about the codepage is stored somewhere or is known at the point of use).

I mean that a "UTF-8 character" (if it makes any sense at all) as a type is always char[] and not a single char.

0xFF as a char initialization value implies that D char is not supposed to handle single byte character encodings at all. Is this the original intention?

Andrew Fedoniouk.
http://terrainformatica.com
Jul 29 2006
parent Carlos Santander <csantander619 gmail.com> writes:
Andrew Fedoniouk escribi:
 "Carlos Santander" <csantander619 gmail.com> wrote in message 
 news:eagiip$1lad$3 digitaldaemon.com...
 Andrew Fedoniouk escribi:
 2) For char[] selection of 0xFF is wrong and even worse.
 For example character with code 0xFF in Latin-I encoding is
 "y diaeresis". In many European languages and Far East encodings 0xFF is 
 a valid code point.
 For example in KOI-8 encoding 0xFF is officially assigned value.

 But D's chars are UTF-8, not Latin-1 nor any other, so I don't think this applies.

UTF-8 is a multibyte transport encoding of full 21-bit UNICODE codepoint. Strictly speaking single byte in UTF-8 sequence cannot be named as char[acter] char as typename implies that value of its type contains some complete codepoint (assumed that information about codepage is stored somewhere or is known at the point of use) I mean that "UTF-8 characrter" (if it makes any sense at all) as type is always char[] and not a single char. 0xFF as a char initialization value implies that D char is not supposed to handle single byte character encodings at all. Is this an original intention? Andrew Fedoniouk. http://terrainformatica.com

My bad, then. I should've said char[] instead of char. Frits and Walter wrote better responses, anyway, so I'll leave this as is.

--
Carlos Santander Bernal
Jul 29 2006
prev sibling next sibling parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Andrew Fedoniouk wrote:
 To Walter:
 
 Following assumption ( 
 http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):
 
 "codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, 
 it is guaranteed by the
 Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character.
 This codepoint will remain forever unassigned, precisely so that it may be 
 used
 for purposes such as this."
 
 is just wrong.
 
 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
 R-zone: {U+FFF0..U+FFFF} - region assigned already.

Yep, 0xFFFF is in the "Specials" range. In fact, together with 0xFFFE it forms the subrange of the "Noncharacters" (see http://www.unicode.org/charts/PDF/UFFF0.pdf, at the end). These are "intended for process internal uses, but are not permitted for interchange". 0xFFFF specifically is marked "<not a character> - the value FFFF is guaranteed not to be a Unicode character at all". So yes, it's assigned - for exactly such a purpose as D is using it for :).
 2) For char[] selection of 0xFF is wrong and even worse.
 For example character with code 0xFF in Latin-I encoding is
 "y diaeresis". In many European languages and Far East encodings 0xFF is a 
 valid code point.
 For example in KOI-8 encoding 0xFF is officially assigned value.

First of all, non-Unicode encodings are irrelevant. 'char' is a UTF-8 codepoint (I think that's the correct term). It's not a Unicode character (though some Unicode characters are encoded as a single UTF-8 codepoint, specifically anything up to 0x80 IIRC). 0xFF is indeed a valid Unicode character, but that doesn't mean that character is encoded as a byte with value 0xFF in UTF-8 (which char[]s represent). 0xFF is in fact one of the byte values that *cannot* occur in a valid UTF-8 text.
Jul 29 2006
parent "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Frits van Bommel" <fvbommel REMwOVExCAPSs.nl> wrote in message 
news:eagjcd$1m1t$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 To Walter:

 Following assumption ( 
 http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):

 "codepoint U+FFFF is not a legitimate Unicode character, and, 
 furthermore, it is guaranteed by the
 Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode 
 character.
 This codepoint will remain forever unassigned, precisely so that it may 
 be used
 for purposes such as this."

 is just wrong.

 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
 R-zone: {U+FFF0..U+FFFF} - region assigned already.

Yep, 0xFFFF is in the "Specials" range. In fact, together with 0xFFFE it forms the subrange of the "Noncharacters" (see http://www.unicode.org/charts/PDF/UFFF0.pdf, at the end). These are "intended for process internal uses, but are not permitted for interchange". 0xFFFF specifically is marked "<not a character> - the value FFFF if guaranteed not to be a Unicode character at all". So yes, it's assigned - for exactly such a purpose as D is using it for :).
 2) For char[] selection of 0xFF is wrong and even worse.
 For example character with code 0xFF in Latin-I encoding is
 "y diaeresis". In many European languages and Far East encodings 0xFF is 
 a valid code point.
 For example in KOI-8 encoding 0xFF is officially assigned value.

First of all, non-Unicode encodings are irrelevant. 'char' is a UTF-8 codepoint (I think that's the correct term).

Sorry, but this is wrong. "UTF-8 codepoint" is nonsense. In common practice a code point is:

(1) A numerical index (or position) in an encoding table used for encoding characters.
(2) A synonym for Unicode scalar value.

As a rule, one code point is represented by a single glyph when presented to a human.
 It's not a Unicode character (though some Unicode characters are encoded 
 as a single UTF-8 codepoint, specifically anything up to 0x80 IIRC).
 0xFF is indeed a valid Unicode character, but that doesn't mean that 
 character is encoded as a byte with value 0xFF in UTF-8 (which char[]s 
 represent). 0xFF is in fact one of the byte values that *cannot* occur in 
 a valid UTF-8 text.

Sorry, but an element of a UTF-8 encoded sequence is a byte (octet) and not a char. char as a type historically means a type for storing character code points. 0xFF is an assigned and legal value in many encodings.

Either use a different name for this "D char" - let's say utf8byte - or use char in the meaning "code point value" and thus initialize it with the NUL value common to all known encodings.

Andrew Fedoniouk.
http://terrainformatica.com
Jul 29 2006
prev sibling next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 Following assumption ( 
 http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):
 
 "codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, 
 it is guaranteed by the
 Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character.
 This codepoint will remain forever unassigned, precisely so that it may be 
 used
 for purposes such as this."
 
 is just wrong.
 
 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
 R-zone: {U+FFF0..U+FFFF} - region assigned already.

"the value FFFF is guaranteed not to be a Unicode character at all" http://www.unicode.org/charts/PDF/UFFF0.pdf
 2) For char[] selection of 0xFF is wrong and even worse.
 For example character with code 0xFF in Latin-I encoding is
 "y diaeresis". In many European languages and Far East encodings 0xFF is a 
 valid code point.
 For example in KOI-8 encoding 0xFF is officially assigned value.

char[] is not Unicode, it is UTF-8. For UTF-8, 0xFF is not a valid value. The Unicode U+00FF is not encoded into UTF-8 as FF.

"The octet values C0, C1, F5 to FF never appear."
http://www.ietf.org/rfc/rfc3629.txt
 What is the point of current initializaton?

The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.
 If you are doing intialization already
 and this intialization is a part of specification so why not to use
 official "Nul" values in this case?

Because 0 is a valid UTF-8 character.
 You are doing the same for floats - you are using NaNs there
  (Null value for floats). Why not to use the same for chars?

The FF initialization does correspond (as close as we can get) with NaN for floats. 0 can masquerade as legitimate data, FF cannot.
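
The points about 0xFF and UTF-8 can be checked with a short sketch. It assumes a current D2 compiler with Phobos (`std.utf.validate` and `UTFException`); the helper `isValidUtf8` is a hypothetical name introduced only for this example.

```d
import std.utf : validate, UTFException;

// Returns true if s is well-formed UTF-8 (thin wrapper over std.utf.validate).
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // Default initializers of the three character types.
    assert(char.init == 0xFF);
    assert(wchar.init == 0xFFFF);
    assert(dchar.init == 0x0000FFFF);

    // U+00FF (y with diaeresis) is two bytes in UTF-8, neither of them 0xFF.
    string s = "\u00FF";
    assert(s.length == 2);
    assert(s[0] == 0xC3 && s[1] == 0xBF);
    assert(isValidUtf8(s));

    // A char[] that was never filled holds 0xFF bytes and fails validation.
    char[] forgotten = new char[3];
    assert(!isValidUtf8(forgotten));

    // A zero byte, by contrast, is well-formed UTF-8 (it encodes U+0000).
    assert(isValidUtf8("a\0b"));
}
```

The last two assertions show the asymmetry argued in the post: a buffer of zeros passes as legitimate text, while a buffer of 0xFF bytes cannot.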
Jul 29 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eagk1o$1mph$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Following assumption ( 
 http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):

 "codepoint U+FFFF is not a legitimate Unicode character, and, 
 furthermore, it is guaranteed by the
 Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode 
 character.
 This codepoint will remain forever unassigned, precisely so that it may 
 be used
 for purposes such as this."

 is just wrong.

 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
 R-zone: {U+FFF0..U+FFFF} - region assigned already.

"the value FFFF is guaranteed not to be a Unicode character at all" http://www.unicode.org/charts/PDF/UFFF0.pdf
 2) For char[] selection of 0xFF is wrong and even worse.
 For example character with code 0xFF in Latin-I encoding is
 "y diaeresis". In many European languages and Far East encodings 0xFF is 
 a valid code point.
 For example in KOI-8 encoding 0xFF is officially assigned value.

char[] is not Unicode, it is UTF-8. For UTF-8, 0xFF is not a valid value. The Unicode U00FF is not encoded into UTF-8 as FF. "The octet values C0, C1, F5 to FF never appear." http://www.ietf.org/rfc/rfc3629.txt
 What is the point of current initializaton?

The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.
 If you are doing intialization already
 and this intialization is a part of specification so why not to use
 official "Nul" values in this case?

Because 0 is a valid UTF-8 character.

1) What does "UTF-8 character" mean exactly?

2) In ASCII char(0) is officially NUL. Why not initialize strings with null?
 You are doing the same for floats - you are using NaNs there
  (Null value for floats). Why not to use the same for chars?

The FF initialization does correspond (as close as we can get) with NaN for floats. 0 can masquerade as legitimate data, FF cannot.

I don't get it, sorry. In KOI-8R (Russian) encoding 0xFF is the letter 'Ъ'. Are you saying that I cannot use char[] to represent Russian text in D?

Andrew Fedoniouk.
http://terrainformatica.com
Jul 29 2006
next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 What is the point of current initializaton?

 The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.
 If you are doing intialization already
 and this intialization is a part of specification so why not to use
 official "Nul" values in this case?


1) What "UTF-8 character" means exactly?

For an exact answer, the spec is: http://www.ietf.org/rfc/rfc3629.txt There isn't much to it.
 2) In ASCII char(0) is officially NUL. Why not to initialize strings
 by null?

Because 0 characters are valid UTF-8 values. By using an invalid UTF-8 value, we can flush out bugs from uninitialized data.
 I don't get it, sorry. In KOI-8R (Russian) enconding 0xFF is letter '?'
 Are you saying that I cannot use char[] to represen russian text in D?

char[] is for UTF-8 encoded text only. For other encoding systems, use ubyte[]. But rest assured that Russian (and every other language) has a defined encoding in UTF-8, which is why it was selected for D.
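
A rough sketch of that split between `ubyte[]` for legacy bytes and `char[]`/`string` for UTF-8, assuming a current D2 compiler with Phobos. The functions `koi8rToCodePoint` and `toUtf8` are hypothetical, and the lookup is deliberately incomplete: it maps only the ASCII range plus byte 0xFF, which in the standard KOI8-R table is U+042A (CYRILLIC CAPITAL LETTER HARD SIGN).

```d
import std.utf : encode;

// Hypothetical, deliberately tiny KOI8-R lookup: only the ASCII range and the
// byte discussed in the thread are mapped. A real converter needs the full table.
dchar koi8rToCodePoint(ubyte b)
{
    if (b < 0x80)
        return cast(dchar) b;   // KOI8-R shares the ASCII range
    if (b == 0xFF)
        return '\u042A';        // CYRILLIC CAPITAL LETTER HARD SIGN
    throw new Exception("byte not covered by this sketch");
}

// Converts legacy bytes to UTF-8 text, one code point at a time.
string toUtf8(const(ubyte)[] legacy)
{
    char[] result;
    foreach (b; legacy)
    {
        char[4] buf;
        immutable n = encode(buf, koi8rToCodePoint(b));
        result ~= buf[0 .. n];
    }
    return result.idup;
}

void main()
{
    ubyte[] legacy = [0x41, 0xFF];  // "A" followed by KOI8-R byte 0xFF
    string text = toUtf8(legacy);
    assert(text == "A\u042A");
    assert(text.length == 3);       // 1 byte for 'A', 2 bytes for U+042A
}
```

The raw 0xFF byte never reaches a `char`; once converted, the same letter is stored as the two UTF-8 bytes 0xD0 0xAA.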
Jul 29 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eagmrk$1pn9$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 What is the point of current initializaton?

 The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.
 If you are doing intialization already
 and this intialization is a part of specification so why not to use
 official "Nul" values in this case?


1) What "UTF-8 character" means exactly?

For an exact answer, the spec is: http://www.ietf.org/rfc/rfc3629.txt There isn't much to it.

Sorry, I understand what a UCS character means, but what exactly is the "UTF-8 character" you are using? Is it

1) a single octet in a UTF-8 sequence, or
2) a sequence of octets representing one Unicode character (21-bit value)?
 2) In ASCII char(0) is officially NUL. Why not to initialize strings
 by null?

Because 0 characters are valid UTF-8 values. By using an invalid UTF-8 value, we can flush out bugs from uninitialized data.

Oh.... 0 as a value of UTF-8 octet can represent only single value character with codepoint 0x00000000. In plain English: UTF-8 encoded strings cannot contain zeros in the middle.
 I don't get it, sorry. In KOI-8R (Russian) enconding 0xFF is letter '?'
 Are you saying that I cannot use char[] to represen russian text in D?

char[] is for UTF-8 encoded text only. For other encoding systems, use ubyte[]. But rest assured that Russian (and every other language) has a defined encoding in UTF-8, which is why it was selected for D.

Sorry, but char[acter] in plain English means character - the index of some human readable glyph in some table like ASCII, KOI-8, MAC-ASCII, whatever.

An element of a UTF-8 sequence is an octet. I think you should rename the 'char' type to 'octet' if D/Phobos is intended to support only UTF-8.

Andrew.
Jul 29 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 Element of UTF-8 sequence is an octet.  I think you should rename
 'char' type to 'octet' if D/Phobos intended to support only UTF-8.

This was all hashed out years ago. It's too late to start renaming basic types.
Jul 29 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eagufo$2knt$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Element of UTF-8 sequence is an octet.  I think you should rename
 'char' type to 'octet' if D/Phobos intended to support only UTF-8.

This was all hashed out years ago. It's too late to start renaming basic types.

I am not asking to rename anything.

Could you please just remove this weird 0xFF initialization for char arrays? (as it was prior to the .162 build)

This is the whole point. If you do this, then the current char type can be used for representation of single byte encodings as it stands - character.

Andrew Fedoniouk.
http://terrainformatica.com
Jul 29 2006
next sibling parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
But even prior, this:

char c;
writefln(cast(size_t) c);

Would have given you 255, not 0.  This has been true for quite some 
time.  The fact that it did not happen for arrays in the same way was, 
as far as I know, a bug.  Actually, I didn't even realize that got fixed.

-[Unknown]


 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eagufo$2knt$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Element of UTF-8 sequence is an octet.  I think you should rename
 'char' type to 'octet' if D/Phobos intended to support only UTF-8.

 This was all hashed out years ago. It's too late to start renaming basic types.

I am not asking to rename anything. Could you please just remove this weird 0xFF initialization for char arrays? ( as it was prior to .162 buld ) This is the whole point. If you will do this then current char type can be used for representation of single byte encodings as it stands - character. Andrew Fedoniouk. http://terrainformatica.com

Jul 29 2006
prev sibling parent Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eagufo$2knt$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Element of UTF-8 sequence is an octet.  I think you should rename
 'char' type to 'octet' if D/Phobos intended to support only UTF-8.

 This was all hashed out years ago. It's too late to start renaming basic types.


Ok, but you did say "I think you should rename..." <g>
 Could you please just remove this weird 0xFF initialization
 for char arrays? ( as it was prior to .162 buld )

char's have been initialized to 0xFF for years now, it was a bug that some array initializations didn't do it.
 This is the whole point. If you will do this
 then current char type can be used for
 representation of single byte encodings as it stands -
 character.

? I don't understand what's standing in the way of that now. And values from 0..7F are single byte UTF-8 encodings and can be stored in a char.

BTW, you can do this:

    typedef char mychar = 0;

    mychar[] a = new mychar[100];  // mychar[] will be initialized to 0
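
The `typedef` form above is the D of that era; as far as I know the keyword was later removed from D2, so it may not compile with a current compiler. A minimal alternative sketch, assuming a current D2 compiler, is to overwrite the default explicitly with a slice assignment. The function name `zeroedChars` is hypothetical.

```d
// Allocates n chars and overwrites the 0xFF default with NUL.
char[] zeroedChars(size_t n)
{
    auto a = new char[n];   // every element starts as char.init (0xFF)
    a[] = '\0';             // slice assignment sets all elements at once
    return a;
}

void main()
{
    char[] untouched = new char[100];
    assert(untouched[0] == 0xFF);

    auto a = zeroedChars(100);
    assert(a.length == 100);
    foreach (c; a)
        assert(c == 0);
}
```

The explicit fill keeps the intent visible at the point of use, in line with the thread's argument that an initial value should carry meaning.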
Jul 29 2006
prev sibling parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
Andrew,

I think it will make a lot more sense if you keep these things in 
mind... (I'm sure you already know all of them, I'm just listing them 
out since they're crucial and must be thought of together):

1. char, wchar, and dchar are separate types.

2. char contains UTF-8 bytes.  It may not contain UTF-16, UCS-2, KOI-8R, 
or any other encoding.  It must contain UTF-8.

3. wchar contains UTF-16.  It is similar to char in every other way (may 
not contain any other encoding than UTF-16, not even UCS-2.)

4. dchar contains UTF-32 code points.  It may not contain any other sort 
of encoding, again.

5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use 
ubyte/byte or some other method.  It is not valid to use char.

6. The FF byte (8-bit octet sequence) may never appear in any valid 
UTF-8 string.  Since char can only contain UTF-8 strings, it represents 
invalid data if it contains such an 8-bit octet.

7. Code points are the characters in Unicode; they are "compressed", so 
to speak, in encodings such as UTF-8 and UTF-16.  UCS-2 and UCS-4 
(UTF-32) contain full code points.

8. If you were to examine the bytes in a wchar string, it may be 
possible that the 8-bit octet sequence "FF" might show up.  Nonetheless, 
since char cannot be used for UTF-16, this doesn't matter.

9. For the above reason, wchar (UTF-16) uses FFFF.  This character is 
similar to FF for UTF-8.
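
The points in the list above can be sketched in code. This is only an illustration, assuming current Phobos (`std.utf.validate`, `std.exception.assertThrown`); the sample code points are arbitrary choices, not taken from the thread:

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // One code point, U+044F (a Cyrillic letter), in the three string types.
    string  s8  = "\u044F";  // UTF-8: two code units (chars)
    wstring s16 = "\u044F"w; // UTF-16: one code unit (wchar)
    dstring s32 = "\u044F"d; // UTF-32: one code unit (dchar)
    assert(s8.length == 2);
    assert(s16.length == 1);
    assert(s32.length == 1);

    // A code point outside the BMP needs several chars and two wchars.
    assert("\U0001D11E".length == 4);
    assert("\U0001D11E"w.length == 2);
    assert("\U0001D11E"d.length == 1);

    // A lone 0xFF byte is never valid UTF-8, so validation rejects it.
    char[] bad = [cast(char) 0xFF];
    assertThrown!UTFException(validate(bad));
}
```

The `.length` of each string counts code units of that type, not characters, which is the distinction points 2 to 4 are drawing.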

Given the above, I think I might answer your questions:

1. UTF-8 character here could mean an 8-bit octet of code point.  In 
this case, they are both the same and represent a perfectly valid 
character in a string.

2. ASCII does not matter; char is not ASCII.  It happens that ASCII 
bytes 0 to 127 correspond to the same code points in Unicode, and the 
same characters in UTF-8.

3. It does not matter; KOI-8R encoded strings should not be placed in 
char arrays.  You should use UTF-8 or another encoding for your Russian 
text.

4. If you wish to use KOI-8R (or any other encoding not based on 
Unicode) you should not be using char arrays, which are meant for 
Unicode-related encodings only.

Obviously this is by far different from C, but that's the good thing 
about D in many ways ;).

Thanks,
-[Unknown]



 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eagk1o$1mph$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Following assumption ( 
 http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):

 "codepoint U+FFFF is not a legitimate Unicode character, and, 
 furthermore, it is guaranteed by the
 Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode 
 character.
 This codepoint will remain forever unassigned, precisely so that it may 
 be used
 for purposes such as this."

 is just wrong.

 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
 R-zone: {U+FFF0..U+FFFF} - region assigned already.

http://www.unicode.org/charts/PDF/UFFF0.pdf
 2) For char[] selection of 0xFF is wrong and even worse.
 For example character with code 0xFF in Latin-I encoding is
 "y diaeresis". In many European languages and Far East encodings 0xFF is 
 a valid code point.
 For example in KOI-8 encoding 0xFF is officially assigned value.

The Unicode U00FF is not encoded into UTF-8 as FF. "The octet values C0, C1, F5 to FF never appear." http://www.ietf.org/rfc/rfc3629.txt
 What is the point of current initializaton?

uninitialized data errors.
 If you are doing intialization already
 and this intialization is a part of specification so why not to use
 official "Nul" values in this case?


1) What "UTF-8 character" means exactly? 2) In ASCII char(0) is officially NUL. Why not to initialize strings by null?
 You are doing the same for floats - you are using NaNs there
  (Null value for floats). Why not to use the same for chars?

for floats. 0 can masquerade as legitimate data, FF cannot.

I don't get it, sorry. In KOI-8R (Russian) encoding 0xFF is letter '?'. Are you saying that I cannot use char[] to represent Russian text in D?

Andrew Fedoniouk.
http://terrainformatica.com

Jul 29 2006
next sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
news:eagn4d$1q1t$1 digitaldaemon.com...
 Andrew,

 I think it will make a lot more sense if you keep these things in mind... 
 (I'm sure you already know all of them, I'm just listing them out since 
 they're crucial and must be thought of together):

 1. char, wchar, and dchar are separate types.

No objections with this.
 2. char contains UTF-8 bytes.  It may not contain UTF-16, UCS-2, KOI-8R, 
 or any other encoding.  It must contain UTF-8.

Sorry but plural form "char contains UTF-8 bytes" is wrong. What do you think char means:
1) char is an octet (byte) - member of a UTF-8 sequence
-or-
2) char is the code point of some character in some character table.
?
Probably I am treating English too literally but char(acter) is not a UTF-8 byte. And never was. char is an index of some glyph in some encoding table. This is the common definition used everywhere.
 3. wchar contains UTF-16.  It is similar to char in every other way (may 
 not contain any other encoding than UTF-16, not even UCS-2.)

The same problem as in #2. What is wchar (uint16) for you:
1) wchar is an index of a Unicode scalar value in the Basic Multilingual Plane (BMP)
-or-
2) it is a uint16 value - member of a UTF-16 sequence.
?
 4. dchar contains UTF-32 code points.  It may not contain any other sort 
 of encoding, again.

Oh..... UTF-32 (like any other UTF) is a transformation format - the group name of two different encodings, UTF-32BE and UTF-32LE. "UTF-32 code point" is nonsense. UTF-32 defines how to encode a Unicode code point as, again, a sequence of four bytes - octets.

I would define this thing as: dchar (a better name is uchar) is a type for representing the full set of Unicode code points (21-bit value).

Please note: "transformation format" (UTF) is not by any means a "manipulation format". The representation of text in memory suitable for manipulation (e.g. text processing) is different as a rule. You cannot use UTF-8 encoded Russian text for analysis. No way.
 5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use 
 ubyte/byte or some other method.  It is not valid to use char.

Vice versa. For utf-8 encoded strings you should use byte[] and for strings using single byte encodings you should use char.
 6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8 
 string.  Since char can only contain UTF-8 strings, it represents invalid 
 data if it contains such an 8-bit octet.

No objections with that, for UTF-8 octet sequences 0xFF is invalid value of octet in the sequence. But please note: in the sequence of octets.
 7. Code points are the characters in Unicode; they are "compressed", so to 
 speak, in encodings such as UTF-8 and UTF-16.  UCS-2 and UCS-4 (UTF-32) 
 contain full code points.

Sorry, but UCS-4 *is not* UTF-32:
http://www.unicode.org/reports/tr19/tr19-9.html

I will ask again:

What:
char c = 'a';
means for you?
And following in C/C++:

#pragma(encoding,"KOI-8R")

char c = '?';

?
 8. If you were to examine the bytes in a wchar string, it may be possible 
 that the 8-bit octet sequence "FF" might show up.  Nonetheless, since char 
 cannot be used for UTF-16, this doesn't matter.

Not clear what you mean here. Could you clarify? Especially last statement.
 9. For the above reason, wchar (UTF-16) uses FFFF.  This character is 
 similar to FF for UTF-8.

 Given the above, I think I might answer your questions:

 1. UTF-8 character here could mean an 8-bit octet of code point.  In this 
 case, they are both the same and represent a perfectly valid character in 
 a string.

Sorry I am not buying following: "UTF-8 character" and "8-bit octet of code point"
 2. ASCII does not matter; char is not ASCII.  It happens that ASCII bytes 
 0 to 127 correspond to the same code points in Unicode, and the same 
 characters in UTF-8.

"ASCII does not matter"... for whom?
 3. It does not matter; KOI-8R encoded strings should not be placed in char 
 arrays.  You should use UTF-8 or another encoding for your Russian text.

"You should use UTF-8 or another encoding for your Russian text." Thanks. Advice from my side: Let me know when you will visit Russia. I will ask representatives of russian developer community and web authors to meet you. Advice per se: You should wear a helmet.
 4. If you wish to use KOI-8R (or any other encoding not based on Unicode) 
 you should not be using char arrays, which are meant for Unicode-related 
 encodings only.

The same advice as above.
 Obviously this is by far different from C, but that's the good thing about 
 D in many ways ;).

In Israel they have an old saying: "Not a human for Saturday but Saturday for human". I do have practical experience in writing text processing software in encodings other than "US-ASCII" and have heard your advice about UTF-8 usage with interest.

Please don't take any of this personally - no intention to harm anybody. Honestly and with a smile :)

Andrew.
Jul 29 2006
next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 I will ask again:
 
 What:
 char c = 'a';
 means for you?
 And following in C/C++:
 
 #pragma(encoding,"KOI-8R")
 
 char c = '?';
 
 ?

Pragmas are implementation defined behavior in C and C++, meaning they are unportable and rather useless. Not only that, char's themselves are implementation defined, and so it is very difficult to write portable code that deals with anything other than a-zA-Z0-9 and a few other characters. In D, char[] is a UTF-8 sequence. It's well defined, and therefore portable. It supports every human language.
Jul 29 2006
next sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eagut9$2l96$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 I will ask again:

 What:
 char c = 'a';
 means for you?
 And following in C/C++:

 #pragma(encoding,"KOI-8R")

 char c = '?';

 ?

Pragmas are implementation defined behavior in C and C++, meaning they are unportable and rather useless. Not only that, char's themselves are implementation defined, and so it is very difficult to write portable code that deals with anything other than a-zA-Z0-9 and a few other characters. In D, char[] is a UTF-8 sequence. It's well defined, and therefore portable. It supports every human language.

What does it mean "UTF-8 ... supports ...every human language" ? It allows to encode - yes. But runtime support means quite a different thing and I am pretty sure you know what I mean here.

In Java as we know UTF-8 is used for representing string literals inside .class files but once loaded they become vectors of Java chars - Unicode BMP code points (ushort). And this serves almost all character cases. Exceptions like: it is not trivial to do effective processing of single-byte encoded things there - you need to rewrite the whole set of functions to handle this.

Please don't think that UTF-8 is a panacea. For example in China they use GB2312 encoding to represent almost 7000 Chinese characters in active use now. This is strictly a 2-byte encoding and don't even try to ask them to switch to UTF-8 (3 bytes as a rule). This will increase their internet traffic by 1/3.

The same applies to Europe. E.g. in Russia there are 32 characters in the alphabet and it is just enough to have a one-byte encoding for English/Russian text. It makes no sense to send two bytes over the wire (Russian in UTF-8) instead of one for sites like lib.ru.

Sorry but guys there are paying for each byte downloaded from the Internet. This applies to almost all countries except the US and Canada.

Andrew Fedoniouk.
http://terrainformatica.com
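
The byte counts in this argument can be checked with a short sketch. It assumes current Phobos (`std.utf.codeLength`); `utf8Size` is a hypothetical helper written for this illustration, and the sample text is arbitrary:

```d
import std.utf : codeLength;

/// Hypothetical helper: number of bytes `text` occupies once encoded as UTF-8.
size_t utf8Size(dstring text)
{
    size_t bytes;
    foreach (dchar c; text)
        bytes += codeLength!char(c); // 1 to 4 bytes per code point
    return bytes;
}

void main()
{
    // Three Cyrillic letters: 3 bytes in a single-byte code page, 6 in UTF-8.
    assert(utf8Size("\u043C\u0438\u0440"d) == 6);

    // Two CJK ideographs: 4 bytes in a double-byte encoding, 6 in UTF-8.
    assert(utf8Size("\u4E2D\u6587"d) == 6);

    // Plain ASCII text costs the same in UTF-8 as in any ASCII-based code page.
    assert(utf8Size("lib.ru"d) == 6);
}
```

The sketch only measures the size cost being described; it says nothing about whether that cost matters for a given application.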
Jul 29 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 In D, char[] is a UTF-8 sequence. It's well defined, and therefore 
 portable. It supports every human language.

What does it mean "UTF-8 ... supports ...every human language" ? It allows to encode - yes.

We both know what UTF-8 is and does.
 But in runtime support means quite different thing
 and I am pretty sure you know what I mean here.

I'm sure there are bugs in the library UTF-8 support. But they are bugs, are fixable, and not fundamental problems. As you find any, please post them to bugzilla.
 In Java as we know UTF-8 is used for representing
 string literals inside .class files but being loaded they
 became vectors of Java chars - unicode BMP codepoints
 (ushort). And this serves almost all character cases.
 Exceptions like: it is not trivial to do effectively
 processing of single byte encoded things there - you need
 to rewrite the whole set of functions to handle this.
 
 Please don't think that UTF-8 is a panacea.

I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.
 For example in China they use GB2312 encoding
 to represent almost 7000 Chinese characters in active use now.
 This is strictly 2 bytes encoding and
 don't even try to ask them to switch to UTF-8
 (3 bytes as a rule). This will increase their internet
 traffic by 1/3.
 
 Same apply to Europe. E.g. in Russia
 there are 32 characters in alphabet and it is
 just enough to have one byte encoding for
 English/Russian text. It makes no sense
 to send over the wire two bytes (russian in utf-8)
 instead of one for the sites like lib.ru.
 
 Sorry but guys are paying there for each byte
 downloaded from Internet. This apply
 to almost all countries except of US and Canada.

If one needs to use a custom encoding, use ubyte[] or ushort[]. If one needs to be universal, use char[], wchar[], or dchar[]. And for what it's worth, D isn't a web transmission protocol. I don't see any problem with a D program converting its input from Format X to UTF for internal processing, and then converting its output back to X or Y or Z.
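
The convert-on-input, convert-on-output approach described above might look like the following. This is a sketch under stated assumptions: ISO-8859-1 stands in for "Format X" because its byte values coincide with Unicode code points, so no lookup table is needed (KOI8-R or GB2312 would need one); `fromLatin1` and `toLatin1` are hypothetical helpers, and `std.utf.toUTF8` is the Phobos function assumed for the encoding step:

```d
import std.utf : toUTF8;

/// Hypothetical helper: decode ISO-8859-1 bytes into a UTF-8 string.
string fromLatin1(const(ubyte)[] input)
{
    dchar[] codePoints;
    foreach (b; input)
        codePoints ~= cast(dchar) b; // Latin-1 byte value equals the code point
    return toUTF8(codePoints);
}

/// Hypothetical helper: encode a UTF-8 string back to ISO-8859-1 bytes.
/// Throws if a code point has no Latin-1 representation.
ubyte[] toLatin1(string text)
{
    ubyte[] output;
    foreach (dchar c; text) // foreach with dchar decodes the UTF-8
    {
        if (c > 0xFF)
            throw new Exception("code point not representable in ISO-8859-1");
        output ~= cast(ubyte) c;
    }
    return output;
}

void main()
{
    ubyte[] raw = [0x66, 0xFF]; // 'f' followed by Latin-1 y-diaeresis
    string s = fromLatin1(raw);

    // The 0xFF input byte becomes the two-byte sequence C3 BF, never a raw FF.
    assert(s.length == 3);
    assert(s[1] == 0xC3 && s[2] == 0xBF);

    // Round trip back to the external encoding.
    assert(toLatin1(s) == raw);
}
```

Keeping the external bytes in `ubyte[]` and only the decoded text in `string` is what lets the 0xFF default for `char` stay unambiguous.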
Jul 29 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
 Please don't think that UTF-8 is a panacea.

I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.

Sorry but this is a bit optimistic. D/samples/wc.exe out of the box will fail on Russian texts. It will fail on almost all Eastern texts, even if they are in UTF-8 encoding. The meaning of 'word' is different there.

Having the statement "string literals in D are only UTF-8 encoded" is not conceptually better than "string literals in C are encoded by using the codepage defined by pragma(codepage,...)". The same, by the way, applies to most Java compilers: they accept texts in various single-byte encodings. (Why am *I* telling this to *you*? :-)

Andrew.
Jul 29 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 Please don't think that UTF-8 is a panacea.

your code will work with different languages out of the box.

Sorry but this is a bit optimistic. D/samples/wc.exe from the box will fail on russian texts. It will fail on almost all Eastern texts. Even they will be in UTF-8 encoding. Meaning of 'word' is different there.

No matter, it is far easier to write a UTF-8 isword function than one that will work on all possible character encoding methods.
 Having statement "string literals in D are only
 UTF-8 encoded" is not conceptually better than
 "string literals in C are encoded by using codepage defined
 by pragma(codepage,...)".

It is conceptually better because UTF-8 is completely defined and covers all human languages. Codepages are not completely defined, do not cover Asian languages, rely on non-standard compiler extensions, and in fact you cannot even rely on *ASCII* being supported by any particular C or C++ compiler. (It could be EBCDIC or any encoding invented by the compiler vendor.)

Code pages have another disastrous problem - it's impossible to mix languages. I have an academic text in front of me written in a mixture of German, French, and Latin. How's that going to work with code pages?

Code pages are obsolete yesterday's technology, and I'm not sorry to see them go.
 Same by the way applied to most of Java compilers
 they accepts texts in various singlebyte encodings.
 (Why *I* am telling this to *you*? :-)

The compiler may accept it as an extension, but the Java *language* is defined to work with UTF-16 source text only. (Java calls them 'char's, even though there may be multi-char encodings.)
Jul 29 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eah9st$2v1o$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Please don't think that UTF-8 is a panacea.

your code will work with different languages out of the box.

Sorry but this is a bit optimistic. D/samples/wc.exe from the box will fail on russian texts. It will fail on almost all Eastern texts. Even they will be in UTF-8 encoding. Meaning of 'word' is different there.

No matter, it is far easier to write a UTF-8 isword function than one that will work on all possible character encoding methods.

Sorry, did you try to write such a function (isword)? (You need the whole set of character classification tables to accomplish this - utf-8 will not help you)
 Having statement "string literals in D are only
 UTF-8 encoded" is not conceptually better than
 "string literals in C are encoded by using codepage defined
 by pragma(codepage,...)".

It is conceptually better because UTF-8 is completely defined and covers all human languages. Codepages are not completely defined, do not cover asian languages, rely on non-standard compiler extensions, and in fact you cannot even rely on *ASCII* being supported by any particular C or C++ compiler. (It could be EBCDIC or any encoding invented by the compiler vendor.) Code pages have another disastrous problem - it's impossible to mix languages. I have an academic text in front of me written in a mixture of german, french, and latin. How's that going to work with code pages?

I am not saying that you shall avoid use of UTF-8 encoding. If you have a mix of, say, English, Russian and Chinese on some page, the only way to deliver this to the user is to use some (universal) Unicode transport encoding. But rendering this thing on the screen is a completely different story.

Consider this: attribute names in HTML (SGML) are represented by ASCII codes only - you don't need UTF-8 processing to deal with them at all. You also cannot use UTF-8 for storing attribute values, generally speaking. Attribute values participate in CSS selector analysis and some selectors require char by char (char as a code point and not a D char) access.

There are only a few academic cases where you can use UTF-8 literally (as a sequence of UTF-8 bytes) *at runtime*. D source code compilation is one such thing - you can store the content of string literals in UTF-8 form - you don't need to analyze their content.
 Code pages are obsolete yesterday's technology, and I'm not sorry to see 
 them go.

Sorry but the US is the first country which will ask "what a ...?" on a demand to always send four bytes instead of one. UTF-8 encoding is "traffic friendly" only for 1/10 of the population of the Earth (English-speaking people). Others just don't want to pay that price.

Whether you are sorry or not is irrelevant to the existence of code pages. They will be around forever, until all of us speak Esperanto.

( Currently I am doing right-to-left support in the engine - Arabic and Hebrew - trust me - probably I have more things to say "sorry" about )
 Same by the way applied to most of Java compilers
 they accepts texts in various singlebyte encodings.
 (Why *I* am telling this to *you*? :-)

The compiler may accept it as an extension, but the Java *language* is defined to work with UTF-16 source text only. (Java calls them 'char's, even though there may be multi-char encodings.)

Walter, where did you get that magic UTF-16?

Doc: http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html
mentions that the input of the Java compiler is a sequence of Unicode (code points). And how this input sequence is encoded - UTF-8, UTF-16, KOI8-R - does not matter at all and the spec is silent about this - a human is within his/her rights to choose the encoding his/her terminal/keyboard supports.

Andrew Fedoniouk.
http://terrainformatica.com
Jul 29 2006
next sibling parent kris <foo bar.com> writes:
Is there a doctor in the house?



Jul 29 2006
prev sibling next sibling parent Hasan Aljudy <hasan.aljudy gmail.com> writes:
Andrew Fedoniouk wrote:
 ( Currently I am doing right-to-left support in the engine - Arabic and 
 Hebrew -
 trust me - probably I have more things to say "sorry" about )
 

That's great, I'd be glad to help with anything if you need help with regard to Arabic (I'm a native Arabic speaker).
 
 Andrew Fedoniouk.
 http://terrainformatica.com
 
 

Jul 30 2006
prev sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eah9st$2v1o$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Please don't think that UTF-8 is a panacea.

your code will work with different languages out of the box.

D/samples/wc.exe from the box will fail on russian texts. It will fail on almost all Eastern texts. Even they will be in UTF-8 encoding. Meaning of 'word' is different there.

will work on all possible character encoding methods.


I have written isUniAlpha, which is the same thing.
 (You need the whole set of character classification tables
 to accomplish this - utf-8 will not help you)

With code pages, it isn't so straightforward (especially if you've got things like shift-JIS too). With code pages, a program can't even accept a text file unless you tell it what page the text is in.
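
A word test built on a Unicode alphabetic check can be sketched as follows. It assumes `std.uni.isAlpha` from current Phobos as the counterpart of the isUniAlpha function mentioned above; `countWords` is a hypothetical helper, and it deliberately treats a word as a run of alphabetic code points, which does not address scripts written without spaces between words:

```d
import std.uni : isAlpha;

/// Hypothetical helper: counts maximal runs of alphabetic code points.
size_t countWords(string text)
{
    size_t words;
    bool inWord;
    foreach (dchar c; text) // decodes UTF-8 into code points
    {
        immutable alpha = isAlpha(c);
        if (alpha && !inWord)
            ++words; // a new run of letters starts here
        inWord = alpha;
    }
    return words;
}

void main()
{
    assert(countWords("hello world") == 2);
    // Two Russian words written with escapes, separated by a space.
    assert(countWords("\u043F\u0440\u0438\u0432\u0435\u0442 \u043C\u0438\u0440") == 2);
    assert(countWords("") == 0);
}
```

The classification tables live behind `isAlpha`, so the same loop handles Latin and Cyrillic input without knowing which code page the text originally came from.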
 I am not saying that you shall avoid use of UTF-8 encoding.
 If you have mix of say english, russian and chinese on some page
 the only way to deliver this to the user is to use some (universal)
 unicode transport encoding.
 But to render this thing on the screen is completely different
 story.

Fortunately, rendering is the job of the operating system - and I don't see how rendering with code pages would be any easier.
 Consider this: attribute names in html (sgml) represented by
 ascii codes only - you don't need utf-8 processing to deal with them at all.
 You also cannot use utf-8 for storing attribute values generally speaking.
 Attribute values participate in CSS selector analysis and some selectors
 require char by char (char as a code point and not a D char) access.

I'd be surprised at that, since UTF-8 is a documented, supported HTML page encoding method. But if UTF-8 doesn't work for you, you can use wchar (UTF-16) or dchar (UTF-32), or ubyte (for anything else).
 There are only few academic cases where you can use utf-8 literally
 (as a sequence of utf-8 bytes) *in runtime*. D source code compilation
 is one of such things - you can store content of string literals in utf-8 
 form -
 you don't need to analyze their content.

D identifiers can be unicode alphas, which means the UTF-8 must be decoded. The DMC++ compiler supports various code page source file possibilities, including some of the asian language multibyte encodings. I find that UTF-8 is a lot easier to work with, as the UTF-8 designers learned from the mistakes of the earlier multibyte encodings.
 Code pages are obsolete yesterday's technology, and I'm not sorry to see 
 them go.

to send always four bytes instead of one. UTF-8 encoding is "traffic friendly" only for 1/10 of population on the Earth (English speaking people). Others just don't want to pay that price.

I'll make a prediction that the huge benefits of UTF will outweigh the downside, and that code pages will increasingly fall into disuse. Note that JavaScript, Java, C#, Ruby, etc., are all Unicode languages (Ruby also supports EUC or SJIS, but not other code pages). Windows is (internally) completely Unicode (the code page face it shows is done by a translation layer on I/O).

In an increasingly multicultural and global economy, applications that cannot simultaneously handle multiple languages are going to be at a severe disadvantage.

Another problem with code pages is when you're presented with a text file, what code page is it in? There's no way for a program to tell, unless there's some other transmission of associated metadata. With UTF, that's no problem.
 Sorry you or not sorry it is irrelevant for code pages existence.
 They will be forever until all of us will not speak on Esperanto.
 
 ( Currently I am doing right-to-left support in the engine - Arabic and 
 Hebrew -
 trust me - probably I have more things to say "sorry" about )

No problem, I believe you <g>.
 Same by the way applied to most of Java compilers
 they accepts texts in various singlebyte encodings.
 (Why *I* am telling this to *you*? :-)

defined to work with UTF-16 source text only. (Java calls them 'char's, even though there may be multi-char encodings.)

Walter, where did you get that magic UTF-16 ? Doc: http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html mentions that input of Java compiler is sequence of Unicode (Code Points). And how this input sequence is encoded, utf-8, utf-16, koi8r - it does not matter at all and spec is silent about this - human is in its rights to choose encoding his/her terminal/keyboard supports.

Java Language Specification Third Edition Chapter 3.2: "The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding." It is, of course, entirely reasonable for a Java compiler to have extensions to recognize other encodings and automatically convert them internally to UTF-16 before lexical analysis.

"One Encoding to rule them all, One Encoding to replace them,
One Encoding to handle them all and in the darkness bind them"
-- UTF Tolkien
Jul 30 2006
next sibling parent reply Paolo Invernizzi <arathorn NOSPAM_fastwebnet.it> writes:
LOL!!!

---
Paolo

Walter Bright wrote:

 "One Encoding to rule them all, One Encoding to replace them,
 One Encoding to handle them all and in the darkness bind them"
 -- UTF Tolkien

Jul 30 2006
parent reply "John Reimer" <terminal.node gmail.com> writes:
On Sun, 30 Jul 2006 03:25:22 -0700, Paolo Invernizzi  
<arathorn NOSPAM_fastwebnet.it> wrote:

 LOL!!!

 ---
 Paolo

 Walter Bright wrote:

 "One Encoding to rule them all, One Encoding to replace them,
 One Encoding to handle them all and in the darkness bind them"
 -- UTF Tolkien


Okay, that clears things up. Now we know that UTF is a conspiracy for world domination. ;)

-JJR
Jul 30 2006
parent kris <foo bar.com> writes:
John Reimer wrote:
 On Sun, 30 Jul 2006 03:25:22 -0700, Paolo Invernizzi  
 <arathorn NOSPAM_fastwebnet.it> wrote:
 
 LOL!!!

 ---
 Paolo

 Walter Bright wrote:

 "One Encoding to rule them all, One Encoding to replace them,
 One Encoding to handle them all and in the darkness bind them"
 -- UTF Tolkien


Okay, that clears things up. Now we know that UTF is a conspiracy for world domination. ;) -JJR

And created on the back of a napkin in a New Jersey diner ... way to go, Ken
Jul 30 2006
prev sibling parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
It's true that in HTML, attribute names were limited to a subset of 
characters available for use in the document.  Namely, as mentioned, 
alpha-type characters (/[A-Za-z][A-Za-z0-9\.\-]*/.)  You couldn't even 
use accented chars.

However (in the case of HTML), you were required to use specific 
(English) attribute names anyway for HTML to validate; it's really not a 
significant limitation.  Few people used SGML for anything else.

XML allows for Unicode attribute and element names... PIs, CDATA, 
PCDATA, etc.  And, of course, allows you to reference any Unicode code 
point (e.g. &#1234;.)

We could also talk about the limitations of horse driven carriages, and 
how they can only go a certain speed... nonetheless, we have cars now, 
so I'm not terribly worried about HTML's technical limitations anymore.

-[Unknown]


 Consider this: attribute names in html (sgml) represented by
 ascii codes only - you don't need utf-8 processing to deal with them 
 at all.
 You also cannot use utf-8 for storing attribute values generally 
 speaking.
 Attribute values participate in CSS selector analysis and some selectors
 require char by char (char as a code point and not a D char) access.

I'd be surprised at that, since UTF-8 is a documented, supported HTML page encoding method. But if UTF-8 doesn't work for you, you can use wchar (UTF-16) or dchar (UTF-32), or ubyte (for anything else).

Jul 30 2006
prev sibling parent "Chris Miller" <chris dprogramming.com> writes:
On Sat, 29 Jul 2006 20:37:56 -0400, Walter Bright  
<newshound digitalmars.com> wrote:

 In D, char[] is a UTF-8 sequence. It's well defined, and therefore  
 portable. It supports every human language.

Even body language? :)
Jul 30 2006
prev sibling parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
2. Sorry, an array of char (a single char is one single 8 bit octet) 
contains UTF-8 bytes which are 8-bit octets.

A single character, in UTF-8 encoding, may be 1 byte, 2 bytes, etc. 
Thus, one char MAY NOT hold every single Unicode code point.  You may 
need an array of multiple chars (bytes) to hold a single code point.

This is not what it means to me; this is what it means.  A char is a 
single 8-bit octet in a UTF-8 sequence.  They ARE NOT by any means code 
points.

I'm sorry that I did not specify "array", but I fear you are being 
pedantic here; I'm sure you knew what I meant.

A char is a single byte in a UTF-8 sequence.  I'm afraid I think calling 
it an index to a glyph is dangerous, because it could be misunderstood. 
Again, a single char CANNOT represent code points 128 and above 
because it is only ONE byte.

A single char therefore may not represent a glyph all of the time, but 
rather will represent a byte in the UTF-8 sequence which may be used 
to decode (along with other necessary bytes) the entirety of the code point.

I hope I'm not being overly pedantic here, but I think your definition 
is either lax or wrong.  But, that is only by its reading in English.

3. It is #2, as above.  wchars are not UCS-2.  They cannot always 
represent full code points alone.  Arrays of wchars must be used for 
some code points.  As I read your question, #1 is UCS-2 (fixed length 
16-bit encoding) and #2 is UTF-16 (dynamic length, 16-bit baseline 
encoding.)

4. I was ignoring endianness issues for simplicity.  My point here is 
that a UTF-32 character directly represents a code point.  Sorry again 
for the non-pedantic laxness in my wording.

5. Wrong.  There is no vice versa.  You may use byte or ubyte arrays for 
your UTF-8 encoded strings and so forth.

In case you didn't realize I was trying to say this:

*char is not for single byte encodings.  char is ONLY for UTF-8.  char 
may not be used for any other encoding unless you wish to have problems. 
  char is not the same as in other languages, e.g. C.*

If you want an 8-bit octet value (such as a character in any 
encoding; single byte or otherwise) you should not be using a char. 
That is not a correct usage for it; that is what byte and ubyte are for.

It is expected that chars in an array will follow a specific sequence; 
that is, that they will be encoded in UTF-8.  It is not possible to 
guarantee this if you use other encodings, which is why writefln() will 
fail in such cases.

6.  Correct.  And a single char (8-bit octet in a sequence of UTF-8 
octets encoded such) may never be FF because no single 8-bit octet 
anywhere in a valid UTF-8 sequence may be FF.  Remember, char is not a 
code point.  It is a single 8-bit octet in a sequence.

7. My mistake.  I always consider them roughly the same (and for some 
reason I thought that they had been made the same; but I assume your 
link is current.)

Your first code sample defines a single UTF-8 character, 'a'.  It is 
lucky you did not try:

char c = '蝿';

(hopefully this character gets sent through to you properly; I will be 
sending this message UTF-8 if my client allows it.)

Because that would have failed.  A char cannot hold such a character, 
which has a code point outside the range 0 - 127.  You would need 
either an array of chars or a wider type such as wchar or dchar.

Your second example means nothing to me.  I don't really care for such 
pragmas or putting untranslated text directly in source code, and have 
never dealt with it.

8. You may not use a single char or an array of chars to represent 
UTF-16.  It may only represent UTF-8.  If you wish to use UTF-16, you 
must use wchars.

1 (the second #1): but for the code point 0, as encoded in UTF-8, they 
are the same - do you not agree?  A 0 is a zero is a zero.  It doesn't 
matter what he means.

2 (the second): rules about ASCII do not apply to char.  Just as rules 
in Portugal do not dissuade me here in Los Angeles.

3 (the second): I have led the development of multi-lingual software 
which was used by quite a large number of people.  I also helped 
coordinate, and later interface with the assigned coordinator of, 
translation.  This software was translated into Thai, Chinese (simplified 
and traditional), Russian, Italian, Spanish, Japanese, Catalan, and 
several other languages.  More than twenty anyway.

At first I was suggesting that everyone use their own encoding and 
handling that (sometimes painfully) in the code.  I would sometimes get 
comments about using Unicode instead (from the translators who would 
have preferred this.)  This software now uses UTF-8 and remains 
translated in these languages.

So, while I have not been to Russia (although I have worked with 
numerous Russian developers, consumers, and translators) I would tend to 
disagree with your assertion.  Also I do not like helmets.

Obviously, I mean nothing to be taken personally as well; we are only 
talking about UTF-8, Unicode, its usage in D, and being pedantic ;). 
And helmets, we touched that subject too.  But not about each other, really.
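
Points 2 and 6 above can be checked with a few lines of D. This is only a sketch, assuming a current D2 compiler with Phobos; `isValidUtf8` is a hypothetical helper name, not something from this thread, and it simply wraps `std.utf.validate`.

```d
import std.stdio : writefln;
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `s` is a well-formed UTF-8 sequence.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);          // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // One code point, but three UTF-8 code units (three chars).
    string fly = "蝿";
    writefln("code units: %s", fly.length);

    // A char that is declared but not assigned holds char.init, i.e. 0xFF.
    char c;
    writefln("char.init = 0x%02X", cast(ubyte) c);

    // 0xFF never occurs in well-formed UTF-8, so this array fails validation.
    char[] junk = ['a', c, 'b'];
    writefln("valid: %s / %s", isValidUtf8(fly), isValidUtf8(junk));
}
```

The second array is exactly what a forgotten initialization would look like, which is the case the 0xFF default is meant to expose.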

Thanks,
-[Unknown]


 "Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
 news:eagn4d$1q1t$1 digitaldaemon.com...
 Andrew,

 I think it will make a lot more sense if you keep these things in mind... 
 (I'm sure you already know all of them, I'm just listing them out since 
 they're crucial and must be thought of together):

 1. char, wchar, and dchar are separate types.

No objections with this.
 2. char contains UTF-8 bytes.  It may not contain UTF-16, UCS-2, KOI-8R, 
 or any other encoding.  It must contain UTF-8.

Sorry, but the plural form "char contains UTF-8 bytes" is wrong. What do you think char means:
1) char is an octet (byte) - a member of a UTF-8 sequence
-or-
2) char is the code point of some character in some character table?
Probably I am treating English too literally, but a char(acter) is not a UTF-8 byte. And never was. A char is an index of some glyph in some encoding table. This is the common definition used everywhere.
 3. wchar contains UTF-16.  It is similar to char in every other way (may 
 not contain any other encoding than UTF-16, not even UCS-2.)

The same problem as in #2. What is wchar (uint16) for you:
1) wchar is an index of a Unicode scalar value in the Basic Multilingual Plane (BMP)
-or-
2) wchar is a uint16 value - a member of a UTF-16 sequence?
 4. dchar contains UTF-32 code points.  It may not contain any other sort 
 of encoding, again.

Oh..... UTF-32 (like any other UTF) is a transformation format - the group name of two different encodings, UTF-32BE and UTF-32LE. "UTF-32 code point" is nonsense. UTF-32 defines how to encode a Unicode code point in, again, a sequence of four bytes - octets.

I would define this thing as: dchar (a better name is uchar) is the type for representing the full set of Unicode Code Points (21-bit value).

Please note: a "transformation format" (UTF) is not by any means a "manipulation format". The representation of text in memory suitable for manipulation (e.g. text processing) is different as a rule. You cannot use UTF-8 encoded Russian text for analysis. No way.
 5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use 
 ubyte/byte or some other method.  It is not valid to use char.

Vice versa. For utf-8 encoded strings you should use byte[] and for strings using single byte encodings you should use char.
 6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8 
 string.  Since char can only contain UTF-8 strings, it represents invalid 
 data if it contains such an 8-bit octet.

No objections with that, for UTF-8 octet sequences 0xFF is invalid value of octet in the sequence. But please note: in the sequence of octets.
 7. Code points are the characters in Unicode; they are "compressed", so to 
 speak, in encodings such as UTF-8 and UTF-16.  UCS-2 and UCS-4 (UTF-32) 
 contain full code points.

Sorry, but UCS-4 *is not* UTF-32:
http://www.unicode.org/reports/tr19/tr19-9.html

I will ask again. What does

char c = 'a';

mean for you? And the following in C/C++:

#pragma(encoding,"KOI-8R")
char c = '?';

?
 8. If you were to examine the bytes in a wchar string, it may be possible 
 that the 8-bit octet sequence "FF" might show up.  Nonetheless, since char 
 cannot be used for UTF-16, this doesn't matter.

Not clear what you mean here. Could you clarify? Especially last statement.
 9. For the above reason, wchar (UTF-16) uses FFFF.  This character is 
 similar to FF for UTF-8.

 Given the above, I think I might answer your questions:

 1. UTF-8 character here could mean an 8-bit octet of code point.  In this 
 case, they are both the same and represent a perfectly valid character in 
 a string.

Sorry I am not buying following: "UTF-8 character" and "8-bit octet of code point"
 2. ASCII does not matter; char is not ASCII.  It happens that ASCII bytes 
 0 to 127 correspond to the same code points in Unicode, and the same 
 characters in UTF-8.

"ASCII does not matter"... for whom?
 3. It does not matter; KOI-8R encoded strings should not be placed in char 
 arrays.  You should use UTF-8 or another encoding for your Russian text.

"You should use UTF-8 or another encoding for your Russian text." Thanks. Advice from my side: Let me know when you will visit Russia. I will ask representatives of russian developer community and web authors to meet you. Advice per se: You should wear a helmet.
 4. If you wish to use KOI-8R (or any other encoding not based on Unicode) 
 you should not be using char arrays, which are meant for Unicode-related 
 encodings only.

The same advice as above.
 Obviously this is by far different from C, but that's the good thing about 
 D in many ways ;).

In Israel they have an old saying: "Not a human for Saturday but Saturday for human".

I do have practical experience in writing text processing software in encodings other than "US-ASCII", and I have heard your advice about UTF-8 usage with interest.

Please don't take any of this personally - no intention to harm anybody. Honestly and with a smile :)

Andrew.

Jul 29 2006
next sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
news:eah49h$2pi8$1 digitaldaemon.com...
 2. Sorry, an array of char (a single char is one single 8 bit octet) 
 contains UTF-8 bytes which are 8-bit octets.

 A single character, in UTF-8 encoding, may be 1 byte, 2 bytes, etc. Thus, 
 one char MAY NOT hold every single Unicode code point.  You may need an 
 array of multiple chars (bytes) to hold a single code point.

 This is not what it means to me; this is what it means.  A char is a 
 single 8-bit octet in a UTF-8 sequence.  They ARE NOT by any means code 
 points.

 I'm sorry that I did not specify "array", but I fear you are being 
 pedantic here; I'm sure you knew what I meant.

 A char is a single byte in a UTF-8 sequence.  I'm afraid I think calling 
 it an index to a glyph is dangerous, because it could be misunderstood. Again, 
 a single char CANNOT represent code points 128 and above because 
 it is only ONE byte.

 A single char therefore may not represent a glyph all of the time, but 
 rather will represent a byte in the UTF-8 sequence which may be used to 
 decode (along with other necessary bytes) the entirety of the code point.

 I hope I'm not being overly pedantic here, but I think your definition is 
 either lax or wrong.  But, that is only by its reading in English.

"your definition is either lax or wrong" Which one?
 3. It is #2, as above.  wchars are not UCS-2.  They cannot always 
 represent full code points alone.  Arrays of wchars must be used for some 
 code points.  As I read your question, #1 is UCS-2 (fixed length 16-bit 
 encoding) and #2 is UTF-16 (dynamic length, 16-bit baseline encoding.)

 4. I was ignoring endianness issues for simplicity.  My point here is that 
 a UTF-32 character directly represents a code point.  Sorry again for the 
 non-pedantic laxness in my wording.

 5. Wrong.  There is no vice versa.  You may use byte or ubyte arrays for 
 your UTF-8 encoded strings and so forth.

 In case you didn't realize I was trying to say this:

 *char is not for single byte encodings.  char is ONLY for UTF-8.  char may 
 not be used for any other encoding unless you wish to have problems. char 
 is not the same as in other languages, e.g. C.*

 If you want an 8-bit octet value (such as a character in any encoding; 
 single byte or otherwise) you should not be using a char. That is not a 
 correct usage for it; that is what byte and ubyte are for.

 It is expected that chars in an array will follow a specific sequence; 
 that is, that they will be encoded in UTF-8.  It is not possible to 
 guarantee this if you use other encodings, which is why writefln() will 
 fail in such cases.

 6.  Correct.  And a single char (8-bit octet in a sequence of UTF-8 octets 
 encoded such) may never be FF because no single 8-bit octet anywhere in a 
 valid UTF-8 sequence may be FF.  Remember, char is not a code point.  It 
 is a single 8-bit octet in a sequence.

 7. My mistake.  I always consider them roughly the same (and for some 
 reason I thought that they had been made the same; but I assume your link 
 is current.)

 Your first code sample defines a single UTF-8 character, 'a'.  It is lucky 
 you did not try:

 char c = '?';

 (hopefully this character gets sent through to you properly; I will be 
 sending this message UTF-8 if my client allows it.)

 Because that would have failed.  A char cannot hold such a character, 
 which has a code point outside the range 0 - 127.  You would either need 
 to use an array of chars, or etc.

 Your second example means nothing to me.  I don't really care for such 
 pragmas or putting untranslated text directly in source code, and have 
 never dealt with it.

 8. You may not use a single char or an array of chars to represent UTF-16. 
 It may only represent UTF-8.  If you wish to use UTF-16, you must use 
 wchars.

 1 (the second #1): but for the code point 0, as encoded in UTF-8, they are 
 the same - do you not agree?  A 0 is a zero is a zero.  It doesn't matter 
 what he means.

 2 (the second): rules about ASCII do not apply to char.  Just as rules in 
 Portugal do not dissuade me here in Los Angeles.

 3 (the second): I have led the development of multi-lingual software 
 which was used by quite a large number of people.  I also helped coordinate, 
 and later interface with the assigned coordinator of, translation.  This 
 software was translated into Thai, Chinese (simplified and traditional), 
 Russian, Italian, Spanish, Japanese, Catalan, and several other languages. 
 More than twenty anyway.

 At first I was suggesting that everyone use their own encoding and 
 handling that (sometimes painfully) in the code.  I would sometimes get 
 comments about using Unicode instead (from the translators who would have 
 preferred this.)  This software now uses UTF-8 and remains translated in 
 these languages.

 So, while I have not been to Russia (although I have worked with numerous 
 Russian developers, consumers, and translators) I would tend to disagree 
 with your assertion.  Also I do not like helmets.

 Obviously, I mean nothing to be taken personally as well; we are only 
 talking about UTF-8, Unicode, its usage in D, and being pedantic ;). And 
 helmets, we touched that subject too.  But not about each other, really.

 Thanks,
 -[Unknown]

Ok. Let's make a second round.

Some definitions:

A Unicode Code Point is an integer value (21 bits used) - an index in the global Unicode table. This global encoding table is maintained by the international Unicode Consortium. With some exceptions, each code point there has a corresponding glyph in the "global super font".

There are two types of encodings used for Unicode Code Points:

1) transport encodings - example: UTF. Main purpose - transport/transfer.
2) manipulation encodings - mappings of ranges of Unicode Code Points to the ranges 0..0xFF, 0..0xFFFF and 0..0xFFFFFFFF.

Transport encodings are used for transfer and long term storage of character data - texts.

Manipulation encodings are used in programming for effective implementation of text processing functions. As a rule, a manipulation encoding maps some fragment (or two) of the Unicode Code Point set to the range 0..0xFF or 0..0xFFFF. The main characteristic of such a mapping: each value of the character vector (string) is in a 1:1 relationship with the corresponding code point in the Unicode set. The main idea of such an encoding - the character at some index in the string (vector) represents one code point in full.

I think that the motivation for having manipulation encodings is simple and everyone understands it. Think about how you would implement caret positioning in an editbox, for example.

So the statement "char[] in D supposed to hold only UTF-8 encoded text" immediately leads us to "D is not designed for effective text processing".

Is this logic clear?

Again - let char be a char in D as it is now. Just don't initialize it with 0xFF, please. And let us be a bit careful with our UTF-8 expectations - yes, it is an almost ideal transport encoding, but it is completely useless for text manipulation purposes - too expensive.

(last message on the subject)

Andrew Fedoniouk.
http://terrainformatica.com
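
The cost difference behind the caret-positioning argument can be sketched in D. This assumes a current D2 compiler with Phobos; `offsetOfCodePoint` is a hypothetical helper written for this illustration, and it relies on `std.utf.stride` to get the length of each UTF-8 sequence.

```d
import std.stdio : writeln;
import std.utf : stride, toUTF32;

/// Hypothetical helper: byte offset at which the n-th code point of `s` starts.
/// Assumes `s` is valid UTF-8 and contains more than `n` code points.
size_t offsetOfCodePoint(const(char)[] s, size_t n)
{
    size_t offset = 0;
    foreach (i; 0 .. n)
        offset += stride(s, offset); // length of the sequence starting here
    return offset;
}

void main()
{
    string text = "привет, мир";

    // UTF-8: reaching code point 8 means stepping over every earlier sequence.
    size_t at = offsetOfCodePoint(text, 8);
    writeln(at, " ", text[at .. at + stride(text, at)]);

    // UTF-32: one array element per code point, so indexing is direct.
    dstring wide = toUTF32(text);
    writeln(wide[8]);
}
```

The walk is linear in the index for char[], while dchar[] pays for constant-time indexing with four bytes per code point. Neither form indexes by user-perceived character once combining marks are involved.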
Jul 29 2006
parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
It really sounds to me like you're looking for UCS-2, then (e.g. as used 
in JavaScript, etc.)  For that, length calculation (which is what I 
presume you mean) is inexpensive.

As to your below assertion, I disagree.  What I think you meant was:

"char[] is not designed for effective multi-byte text processing."

I will agree that wchar[] would be much better in that case, and even 
that limiting it to UCS-2 (which is, afaik, a subset of UTF-16) would 
probably make things significantly easier to work with.

Nonetheless, I was only commenting on how D is currently designed and 
implemented.  Perhaps there was some misunderstanding here.

Even so, I don't see how initializing it to FF makes any problem.  I 
think everyone understands that char[] is meant to hold UTF-8, and if 
you don't like that or don't want to use it, there are other methods 
available to you (heh, you can even use UTF-32!)

I don't see that the initialization of these variables will cause anyone 
any problems.  The only time I want such a variable initialized to 0 is 
when I use a numeric type, not a character type (and then, I try to use 
= 0 anyway.)

It seems like what you may want to do is simply this:

typedef ushort ucs2_t = 0;

And use that type.  Mission accomplished.  Or, use various different 
encodings - in which case I humbly suggest:

typedef ubyte latin1_t = 0;
typedef ushort ucs2_t = 0;
typedef ubyte koi8r_t = 0;
typedef ubyte big5_t = 0;

And so on, so on, so on...
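
For readers on D2, where `typedef` no longer exists, roughly the same effect can be had with thin wrapper structs. This is a sketch under that assumption; the type and field names are hypothetical.

```d
import std.stdio : writeln;

// Hypothetical wrapper types: distinct from each other and from char,
// and zero-initialized because ubyte.init and ushort.init are both 0.
struct Koi8r { ubyte  unit; }
struct Ucs2  { ushort unit; }

void main()
{
    Koi8r[4] buf;                      // every element starts as 0, not 0xFF
    writeln(buf[0].unit);

    // The wrappers do not convert into one another implicitly.
    static assert(!is(Koi8r == Ucs2));
    static assert(!is(Koi8r : char));

    // char keeps its own default.
    writeln(char.init == 0xFF);
}
```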

-[Unknown]


 So statement: "char[] in D supposed to hold only UTF-8 encoded text"
 immediately leads us to "D is not designed for effective text processing".
 
 Is this logic clear?

Jul 29 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
news:eahcqu$4d$1 digitaldaemon.com...
 It really sounds to me like you're looking for UCS-2, then (e.g. as used 
 in JavaScript, etc.)  For that, length calculation (which is what I 
 presume you mean) is inexpensive.

Well, let's speak in terms of JavaScript if it is easier:

String.substr(start, end)...

What do these start and end mean for you? I don't think that you will be interested in indexes of bytes in a UTF-8 sequence.
 As to your below assertion, I disagree.  What I think you meant was:

 "char[] is not designed for effective multi-byte text processing."

What is "multi-byte text processing"? Processing of text - a sequence of code points of the alphabet? What is 'multi-byte' doing there? By multi-byte I believe you mean a method of encoding code points for transmission. Is this correct?

You need real code points to do something meaningful with them... How these code points are stored in memory: as byte, word or dword depends on your task, the amount of memory you have and the alphabet you are using. E.g. if you are counting the frequency of Russian words used on the internet, you'd better not do this in Java - twice as expensive as in C without any need.

So the phrase "multi-byte text processing" is fuzzy on this end. (Seems like I am not clear enough with my subset of English.)
 I will agree that wchar[] would be much better in that case, and even that 
 limiting it to UCS-2 (which is, afaik, a subset of UTF-16) would probably 
 make things significantly easier to work with.

 Nonetheless, I was only commenting on how D is currently designed and 
 implemented.  Perhaps there was some misunderstanding here.

 Even so, I don't see how initializing it to FF makes any problem.  I think 
 everyone understands that char[] is meant to hold UTF-8, and if you don't 
 like that or don't want to use it, there are other methods available to 
 you (heh, you can even use UTF-32!)

 I don't see that the initialization of these variables will cause anyone 
 any problems.  The only time I want such a variable initialized to 0 is 
 when I use a numeric type, not a character type (and then, I try to use = 
 0 anyway.)

 It seems like what you may want to do is simply this:

 typedef ushort ucs2_t = 0;

 And use that type.  Mission accomplished.  Or, use various different 
 encodings - in which case I humbly suggest:

 typedef ubyte latin1_t = 0;
 typedef ushort ucs2_t = 0;
 typedef ubyte koi8r_t = 0;
 typedef ubyte big5_t = 0;

 And so on, so on, so on...

 -[Unknown]

I like the last statement "..., so on, so on..." Sounds promising enough.

Just for information: strlen(const char* str) works with *all* single byte encodings in C. For multi-byte encodings (e.g. UTF-8) it returns the length of the sequence in octets. But strictly speaking these are not chars in terms of C but bytes - unsigned chars.
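
The same distinction exists in D and is easy to see side by side. A minimal sketch, assuming a current D2 compiler with Phobos: `.length` on a char[] counts code units, much as strlen counts octets, while `std.utf.count` decodes the sequence and counts code points.

```d
import std.stdio : writeln;
import std.utf : count;

void main()
{
    string word = "слово";     // five Cyrillic letters

    writeln(word.length);      // UTF-8 code units (bytes)
    writeln(count(word));      // code points, obtained by walking the string
}
```

For this string the two numbers differ by a factor of two, since each of these Cyrillic letters takes two bytes in UTF-8.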
 So statement: "char[] in D supposed to hold only UTF-8 encoded text"
 immediately leads us to "D is not designed for effective text 
 processing".

 Is this logic clear? 


Jul 29 2006
parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
Yes, you're right, most of the time I wouldn't (although a significant 
portion of the time, I would.)  Even so, this is why I would use UCS-2, 
and not UTF-8.  Why are you held up on char[]?

My point is that char[] is only trouble when you're dealing with text 
that is not ISO-8859-1.  I'm a great fan of localization and 
internationalization, but in all honesty the largest part of my text 
processing/analysis is with such strings.

Generally, user input I don't analyze.  Caret placement I leave to be 
handled by the libraries I use.  That is, when I use char[].

So again, I will agree that, in D, char[] is not a good choice for 
strings you are expecting to contain possibly-internationalized data.

I'm perfectly aware of what strlen (and str.length in D) do... it's 
similar to what they do in practically all other languages (unless you 
know the encoding is UCS-2, etc.)  For example, I work with PHP a lot 
and it doesn't even have (with the versions I support) built-in support 
for Unicode.  This makes text processing fun!

-[Unknown]


 "Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
 news:eahcqu$4d$1 digitaldaemon.com...
 It really sounds to me like you're looking for UCS-2, then (e.g. as used 
 in JavaScript, etc.)  For that, length calculation (which is what I 
 presume you mean) is inexpensive.

Well, let's speak in terms of JavaScript if it is easier:

String.substr(start, end)...

What do these start and end mean for you? I don't think that you will be interested in indexes of bytes in a UTF-8 sequence.
 As to your below assertion, I disagree.  What I think you meant was:

 "char[] is not designed for effective multi-byte text processing."

What is "multi-byte text processing"? Processing of text - a sequence of code points of the alphabet? What is 'multi-byte' doing there? By multi-byte I believe you mean a method of encoding code points for transmission. Is this correct?

You need real code points to do something meaningful with them... How these code points are stored in memory: as byte, word or dword depends on your task, the amount of memory you have and the alphabet you are using. E.g. if you are counting the frequency of Russian words used on the internet, you'd better not do this in Java - twice as expensive as in C without any need.

So the phrase "multi-byte text processing" is fuzzy on this end. (Seems like I am not clear enough with my subset of English.)
 I will agree that wchar[] would be much better in that case, and even that 
 limiting it to UCS-2 (which is, afaik, a subset of UTF-16) would probably 
 make things significantly easier to work with.

 Nonetheless, I was only commenting on how D is currently designed and 
 implemented.  Perhaps there was some misunderstanding here.

 Even so, I don't see how initializing it to FF makes any problem.  I think 
 everyone understands that char[] is meant to hold UTF-8, and if you don't 
 like that or don't want to use it, there are other methods available to 
 you (heh, you can even use UTF-32!)

 I don't see that the initialization of these variables will cause anyone 
 any problems.  The only time I want such a variable initialized to 0 is 
 when I use a numeric type, not a character type (and then, I try to use = 
 0 anyway.)

 It seems like what you may want to do is simply this:

 typedef ushort ucs2_t = 0;

 And use that type.  Mission accomplished.  Or, use various different 
 encodings - in which case I humbly suggest:

 typedef ubyte latin1_t = 0;
 typedef ushort ucs2_t = 0;
 typedef ubyte koi8r_t = 0;
 typedef ubyte big5_t = 0;

 And so on, so on, so on...

 -[Unknown]

I like the last statement "..., so on, so on..." Sounds promising enough.

Just for information: strlen(const char* str) works with *all* single byte encodings in C. For multi-byte encodings (e.g. UTF-8) it returns the length of the sequence in octets. But strictly speaking these are not chars in terms of C but bytes - unsigned chars.
 So statement: "char[] in D supposed to hold only UTF-8 encoded text"
 immediately leads us to "D is not designed for effective text 
 processing".

 Is this logic clear? 



Jul 30 2006
prev sibling parent reply Bruno Medeiros <brunodomedeirosATgmail SPAM.com> writes:
Unknown W. Brackets wrote:
 
 char c = '蝿';
 
 
 Because that would have failed.  A char cannot hold such a character, 
 which has a code point outside the range 0 - 127.  You would either need 
 to use an array of chars, or etc.

Speaking of which, shouldn't that be a compile time error? The compiler allows all kinds of *char mingling:

dchar dc = '蝿';
char sc = dc; // :-(

--
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
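
Rather than assigning a dchar to a char, the conversion can be made explicit. A sketch, assuming a current D2 compiler with Phobos: `std.utf.encode` writes the UTF-8 code units for one code point into a four-element buffer and returns how many it wrote. `toUtf8Units` is a hypothetical wrapper added for this illustration.

```d
import std.stdio : writeln;
import std.utf : encode;

/// Hypothetical helper: the UTF-8 code units for a single code point.
char[] toUtf8Units(dchar c)
{
    char[4] buf;
    immutable n = encode(buf, c);   // number of code units written
    return buf[0 .. n].dup;         // copy out of the stack buffer
}

void main()
{
    dchar dc = '蝿';
    writeln(toUtf8Units(dc).length);    // more than one char is needed
    writeln(toUtf8Units('a').length);   // ASCII still fits in a single char
}
```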
Jul 30 2006
parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
Eek!  Yes, I would say (in my humble opinion) that this should be a 
compile-time error.

Obviously down-casting is more complicated.  I think the case of chars 
is much more obvious/clear than the case of ints, but then it's also a 
special-case.

-[Unknown]


 Unknown W. Brackets wrote:
 char c = '蝿';


 Because that would have failed.  A char cannot hold such a character, 
 which has a code point outside the range 0 - 127.  You would either 
 need to use an array of chars, or etc.

Speaking of which, shouldn't that be a compile time error? The compiler allows all kinds of *char mingling:

dchar dc = '蝿';
char sc = dc; // :-(

Jul 30 2006
prev sibling parent reply Bruno Medeiros <brunodomedeirosATgmail SPAM.com> writes:
Unknown W. Brackets wrote:
 6. The FF byte (8-bit octet sequence) may never appear in any valid 
 UTF-8 string.  Since char can only contain UTF-8 strings, it represents 
 invalid data if it contains such an 8-bit octet.
 

"8-bit octet" is redundant: an "octet" is an 8-bit value. There are no "16-bit octets" and no "8-bit hextets" or stuff like that :P . I hope you knew that and were just distracted, but you kept saying that :) .
 1. UTF-8 character here could mean an 8-bit octet of code point.  In 
 this case, they are both the same and represent a perfectly valid 
 character in a string.
 

A "UTF-8 octet" is also called a UTF-8 'code unit'. Similarly, a "UTF-16 hextet" is called a UTF-16 'code unit'. A UTF-8 code unit holds a Unicode code point if the code point is <128. Otherwise multiple UTF-8 code units are needed to encode that code point. The confusion between 'code unit' and 'code point' is a long standing one.

A "UTF-8 character" is a slightly ambiguous term. Does it mean a UTF-8 code unit, or does it mean a Unicode character/code point encoded in a UTF-8 sequence?

--
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
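
The code unit / code point distinction shows up in all three D string types once a character outside the Basic Multilingual Plane is involved. A minimal sketch, assuming a current D2 compiler with Phobos; the character chosen (U+1D11E, the musical G clef) is just an example.

```d
import std.stdio : writeln;
import std.utf : count;

void main()
{
    // U+1D11E lies outside the Basic Multilingual Plane.
    string  s8  = "\U0001D11E";
    wstring s16 = "\U0001D11E"w;
    dstring s32 = "\U0001D11E"d;

    // Code units: four chars, two wchars (a surrogate pair), one dchar.
    writeln(s8.length, " ", s16.length, " ", s32.length);

    // Code points: one in every encoding.
    writeln(count(s8), " ", count(s16), " ", count(s32));
}
```

This is also why wchar[] is UTF-16 rather than UCS-2: a single wchar cannot hold this code point.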
Jul 30 2006
parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
I use that terminology because I've read too many RFCs (consider the FTP 
RFC) - they all say "8-bit octet".  Anyway, I'm trying to be completely 
clear.

Code unit.  Yeah, I knew it was code something but it slipped my mind. 
I was sure that he'd either correct me or 8-bit octet/etc. would remain 
clear.  I hate it when I forget such obvious terms.

Anyway, my point in what you're quoting is very context-dependent. 
Walter mentioned that "0 is a valid UTF-8 character."  Andrew asked what 
this meant, so I explained that in this case (as you also clarified) it 
doesn't make any difference.  Regardless, it's a valid [whatever it is] 
and that meaning is not unclear.

-[Unknown]


 Unknown W. Brackets wrote:
 6. The FF byte (8-bit octet sequence) may never appear in any valid 
 UTF-8 string.  Since char can only contain UTF-8 strings, it 
 represents invalid data if it contains such an 8-bit octet.

"8-bit octet" is redundant: an "octet" is an 8-bit value. There are no "16-bit octets" and no "8-bit hextets" or stuff like that :P . I hope you knew that and were just distracted, but you kept saying that :) .
 1. UTF-8 character here could mean an 8-bit octet of code point.  In 
 this case, they are both the same and represent a perfectly valid 
 character in a string.

A "UTF-8 octet" is also called a UTF-8 'code unit'. Similarly, a "UTF-16 hextet" is called a UTF-16 'code unit'. A UTF-8 code unit holds a Unicode code point if the code point is <128. Otherwise multiple UTF-8 code units are needed to encode that code point. The confusion between 'code unit' and 'code point' is a long standing one.

A "UTF-8 character" is a slightly ambiguous term. Does it mean a UTF-8 code unit, or does it mean a Unicode character/code point encoded in a UTF-8 sequence?

Jul 30 2006
parent Walter Bright <newshound digitalmars.com> writes:
Unknown W. Brackets wrote:
 Walter mentioned that "0 is a valid UTF-8 character."  Andrew asked what 
 this meant, so I explained that in this case (as you also clarified) it 
 doesn't make any difference.  Regardless, it's a valid [whatever it is] 
 and that meaning is not unclear.

I confess I often misuse the terminology.
Jul 30 2006
prev sibling parent reply Derek <derek psyc.ward> writes:
On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:


 ... but this is far from concept of null codepoint in character encodings.

Andrew and others,
I've read through these posts a few times now, trying to understand the various points of view being presented. I keep getting the feeling that some people are deliberately trying *not* to understand what other people are saying. This is a sad situation.

Andrew seems to be stating ...
(a) char[] arrays should be allowed to hold encodings other than UTF-8, and thus initializing them with hex-FF byte values is not useful.
(b) UTF-8 encoding is not an efficient encoding for text analysis.
(c) UTF encodings are not optimized for data transmission (they contain redundant data in many contexts).
(d) The D type called 'char' may not have been the best name to use if it is meant to be used to contain only UTF-8 octets.

I, and many others including Walter, would probably agree to (b), (c) and (d). However, considering (b) and (c), UTF has benefits that outweigh these issues and there are ways to compensate for these too. Point (d) is a casualty of history, and changing the language now to rename 'char' to anything else would be counterproductive. But feel free to implement your own flavour of D.<g>

Back to point (a)... The fact is, char[] is designed to hold UTF-8 encodings so don't try to force anything else into such arrays. If you wish to use some other encodings, then use a more appropriate data structure for it. For example, to hold 'KOI-8' encodings of Russian text, I would recommend using ubyte[] instead. To transform char[] to any other encoding you will have to provide the functions to do that, as I don't think it is Walter's or D's responsibility to do it.

The point of initializing UTF-8 strings with illegal values is to help detect coding or logical mistakes. And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode codepoint *is* illegal. If you must store an octet of hex-FF then use ubyte[] arrays to do it.

--
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
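
The error-detection argument can be illustrated with a short sketch. It assumes a current Phobos, where `std.utf.validate` throws `UTFException` on malformed input; `isValidUtf8` is a hypothetical helper, not a library function:

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // Default initializers, as listed in the type documentation.
    assert(char.init  == 0xFF);
    assert(wchar.init == 0xFFFF);
    assert(dchar.init == 0x0000FFFF);

    // A forgotten assignment leaves 0xFF octets behind, which a
    // validating consumer can reject instead of silently accepting.
    char[3] forgotten;
    assert(!isValidUtf8(forgotten[]));
    assert(isValidUtf8("ok"));

    // ubyte carries no encoding promise and starts at zero.
    ubyte raw;
    assert(raw == 0);
}
```

With zero as the default, the forgotten buffer would have validated as three NUL characters and the mistake would have gone unnoticed.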
Jul 30 2006
next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Derek wrote:
 Andrew seems to be stating ...
 (a) char[] arrays should be allowed to hold encodings other than UTF-8, and
 thus initializing them with hex-FF byte values is not useful.
 (b) UTF-8 encoding is not an efficient encoding for text analysis.
 (c) UTF encodings are not optimized for data transmission (they contain
 redundant data in many contexts).
 (d) The D type called 'char' may not have been the best name to use if it
 is meant to be used to contain only UTF-8 octets.
 
 I, and many others including Walter, would probably agree to (b), (c) and
 (d). However, considering (b) and (c), UTF has benefits that outweigh these
 issues and there are ways to compensate for these too. Point (d) is a
 casualty of history and to change the language now to rename 'char' to
 anything else would be counter productive now. But feel free to implement
 your own flavour of D.<g>
 
 Back to point (a)... The fact is, char[] is designed to hold UTF-8
 encodings so don't try to force anything else into such arrays. If you wish
 to use some other encodings, then use a more appropriate data structure for
 it. For example, to hold 'KOI-8' encodings of Russian text, I would
 recommend using ubyte[] instead. To transform char[] to any other encoding
 you will have to provide the functions to do that, as I don't think it is
 Walter's or D's responsibility to do it. The point of initializing UTF-8
 strings with illegal values is to help detect coding or logical mistakes.
 And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode
 codepoint *is* illegal. If you must store an octet of hex-FF then use
 ubyte[] arrays to do it.

Thank you for the insightful summary of the situation. I suspect, though, that (c) might be moot since it is my understanding that most actual data transmission equipment automatically compresses the data stream, and so the redundancy of the UTF-8 is minimized. Text itself tends to be highly compressible on top of that.

Furthermore, because of the rate of expansion and declining costs of bandwidth, the cost of extra bytes is declining at the same time that the cost of the inflexibility of code pages is increasing.
Jul 30 2006
parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
Indeed; this is the same situation as with XML transmission over the 
web.  It contains a huge amount of redundancy, and compresses so well 
that I've seen it do better than binary-based formats.

Although, I'm afraid that most of the time this compression isn't 
necessarily automatic, and too often is not done.

-[Unknown]


 I suspect, though, that (c) might be moot since it is my understanding 
 that most actual data transmission equipment automatically compresses 
 the data stream, and so the redundancy of the UTF-8 is minimized. Text 
 itself tends to be highly compressible on top of that.
 
 Furthermore, because of the rate of expansion and declining costs of 
 bandwidth, the cost of extra bytes is declining at the same time that 
 the cost of the inflexibility of code pages is increasing.

Jul 30 2006
prev sibling next sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Thank you for the clear summary. Apart from the obvious (d), I think there are two reasons this char confusion comes up now and then.

1. The documentation may not be clear enough on the point that char is really only meant to represent a UTF-8 code unit (or ASCII character) and that char[] is a UTF-8 encoded string. This needs to be stressed more. People coming from C will automatically assume the D char is a C char equivalent. It should be mentioned that dchar is the only type that can represent any Unicode character, while char is a character only in ASCII. The C to D type conversion table doesn't help either:

http://www.digitalmars.com/d/ctod.html

It should say something like:

    char => char (UTF-8 and ASCII strings)
            ubyte (other byte based encodings)

2. All string functions in Phobos work only on char[] (and in some cases wchar[] and dchar[]), making the tools for working with other string encodings extremely limited. This is easily remedied by a templated string library, such as what I have proposed earlier.

/Oskar
Jul 30 2006
prev sibling next sibling parent Bruno Medeiros <brunodomedeirosATgmail SPAM.com> writes:
Good summary. Additionally I'd like to say that, to hold 'KOI-8' encodings, you could create a typedef instead of just using a ubyte:

    typedef ubyte koi8char;

Thus you are able to express in the code what the encoding of such a ubyte is, as it is part of the type information. And then the program is able to work with it:

    koi8char toUpper(koi8char ch) { ...
    int wordCount(koi8char[] str) { ...
    dchar[] toUTF32(koi8char[] str) { ...

--
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
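
A rough sketch of the same idea for a current compiler, where the `typedef` keyword no longer exists. The `Koi8Char` struct, the function names and the deliberately tiny three-letter mapping table are assumptions made for illustration; a real converter would need the full KOI8-R table:

```d
// Hypothetical distinct type: the encoding becomes part of the type.
struct Koi8Char
{
    ubyte code;
}

// Partial KOI8-R to Unicode mapping, just enough for the word "тест".
dchar toDchar(Koi8Char ch)
{
    if (ch.code < 0x80)
        return cast(dchar) ch.code;   // the ASCII range is shared
    switch (ch.code)
    {
        case 0xC5: return '\u0435';   // е
        case 0xD3: return '\u0441';   // с
        case 0xD4: return '\u0442';   // т
        default:   return '\uFFFD';   // not covered by this sketch
    }
}

dstring toUTF32(const(Koi8Char)[] str)
{
    dstring result;
    foreach (ch; str)
        result ~= toDchar(ch);
    return result;
}

void main()
{
    auto koi = [Koi8Char(0xD4), Koi8Char(0xC5), Koi8Char(0xD3), Koi8Char(0xD4)];
    assert(toUTF32(koi) == "тест"d);
    assert(toDchar(Koi8Char('a')) == 'a');
}
```

Because `Koi8Char[]` is a different type from `char[]`, passing KOI-8 data to a function that expects UTF-8 is rejected at compile time.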
Jul 30 2006
prev sibling next sibling parent reply Serg Kovrov <user domain.invalid> writes:
Maybe I missed the point here, correct me if I misunderstood.

This is how I see the problem with char[] as utf-8 *string*. The length 
of array of chars is not always count of characters, but rather size of 
array in bytes. Which makes no sense for me. For that purpose I would 
like to see separate properties.

For example,
char[] str = "тест";
the word "test" in Russian, 4 Cyrillic characters, would give you 
str.length 8, which makes this length property useless unless you are sure 
that the string contains Latin characters only.
Jul 31 2006
parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Serg Kovrov wrote:
 Maybe I missed the point here, correct me if I misunderstood.

You have understood correctly.
 This is how I see the problem with char[] as utf-8 *string*. The length 
 of array of chars is not always count of characters, but rather size of 
 array in bytes. Which makes no sense for me. For that purpose I would 
 like to see separate properties.

Having char[].length return something other than the actual number of char-units would break its array semantics.
 For example,
 char[] str = "тест";
 word "test" in russian - 4 cyrillic characters, would give you 
 str.length 8, which make no use of this length property if you not sure 
 that string is latin characters only.

It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units. Counting the number of characters is also a rather expensive operation. All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters.

It is easy to implement your own character count though:

    size_t count(char[] arr) {
        size_t n = 0;
        foreach (dchar c; arr)
            n++;
        return n;
    }

    assert("тест".count() == 4);

Also note that:

    assert("тест"d.length == 4);

/Oskar
Jul 31 2006
next sibling parent reply Serg Kovrov <user domain.invalid> writes:
* Oskar Linde:
 Having char[].length return something other than the actual number
 of char-units would break it's array semantics.

Yes, I see. That's why I don't much like char[] as a substitute for a string type.
 It is actually not very often that you need to count the number
 of characters as opposed to the number of (UTF-8) code units.

Why not use separate properties for that?
 Counting the number of characters is also a rather expensive
 operation. 

Indeed. Store once as property (and update as needed) is better than calculate it each time you need it.
 All the ordinary operations (searching, slicing, concatenation, 
 sub-string  search, etc) operate on code units rather than
 characters.

Yes, that's a tough one. If you want to slice an array, use the array unit count for that. But if you want to slice a *string* (substring, search, etc.), use the character count for that.

Maybe there should be interchangeable types, string and char[], with different length, slice, find, etc. behaviors? I mean it could be the same actual type, but with different contexts for properties.

And besides, string as opposed to char[] is more pleasant for my eyes =)
Jul 31 2006
next sibling parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Serg Kovrov wrote:
 * Oskar Linde:
 Counting the number of characters is also a rather expensive
 operation. 

Indeed. Store once as property (and update as needed) is better than calculate it each time you need it.

Store where? You can't put it in the array data itself without breaking slicing, and putting it in the reference introduces problems with it getting out of date if the array is modified through another reference (without enforcing COW, that is).
Jul 31 2006
parent reply Serg Kovrov <user domain.invalid> writes:
I have to say that I do not have an idea where to store it, nor where the current length property is stored. I'm really glad that the compiler does it for me. As a language user I just want to be confident that the compiler does it wisely, so I can focus on my domain problems.
Jul 31 2006
parent Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
The length is stored in the reference, but the character count would not only depend on the memory location and size (which the reference holds) but also the data it holds (at least for char and wchar) which may be accessed through different references as well. That's the problem I was pointing out.
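
A minimal sketch of that aliasing problem, assuming a current Phobos where `std.utf.count` returns the number of code points; `staleCountDemo` is a made-up name:

```d
import std.utf : count;

// Returns true if a character count cached through one reference
// went stale after a write through another reference.
bool staleCountDemo()
{
    char[] a = "ab".dup;        // 2 code units, 2 code points
    char[] b = a;               // second reference to the same memory
    size_t cached = count(a);   // what a stored "character count" would hold

    // Overwrite through b with the two-octet UTF-8 encoding of U+0442.
    b[0] = '\xD1';
    b[1] = '\x82';

    assert(a.length == 2);      // the stored length is still right
    assert(count(a) == 1);      // but the text is now a single code point
    return cached != count(a);
}

void main()
{
    assert(staleCountDemo());
}
```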
Jul 31 2006
prev sibling next sibling parent Hasan Aljudy <hasan.aljudy gmail.com> writes:
Serg Kovrov wrote:
 * Oskar Linde:
 
 Having char[].length return something other than the actual number
 of char-units would break it's array semantics.

Yes, I see. Thats why I do not like much char[] as substitute for string type.
 It is actually not very often that you need to count the number
 of characters as opposed to the number of (UTF-8) code units.

Why not use separate properties for that?
 Counting the number of characters is also a rather expensive
 operation. 

Indeed. Store once as property (and update as needed) is better than calculate it each time you need it.
 All the ordinary operations (searching, slicing, concatenation, 
 sub-string  search, etc) operate on code units rather than
 characters.

Yes that's tough one. If you want to slice an array - use array unit's count for that. But if you want to slice a *string* (substring, search, etc) - use character's count for that. Maybe there should be interchangeable types - string and char[]. For different length, slice, find, etc. behaviors? I mean it could be same actual type, but different contexts for properties. And besides, string as opposite to char[] is more pleasant for my eyes =)

I say this calls for a proper *standard* String class ... <g>
Jul 31 2006
prev sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Serg Kovrov wrote:
 * Oskar Linde:
 Having char[].length return something other than the actual number
 of char-units would break it's array semantics.

Yes, I see. Thats why I do not like much char[] as substitute for string type.
 It is actually not very often that you need to count the number
 of characters as opposed to the number of (UTF-8) code units.

Why not use separate properties for that?
 Counting the number of characters is also a rather expensive
 operation. 

Indeed. Store once as property (and update as needed) is better than calculate it each time you need it.

The question is, how often do you need it? Especially if you are not indexing by character.
 All the ordinary operations (searching, slicing, concatenation, 
 sub-string  search, etc) operate on code units rather than
 characters.

Yes that's tough one. If you want to slice an array - use array unit's count for that. But if you want to slice a *string* (substring, search, etc) - use character's count for that.

Why? Code unit indices will work equally well for substrings, searching etc.
 Maybe there should be interchangeable types - string and char[]. For 
 different length, slice, find, etc. behaviors? I mean it could be same 
 actual type, but different contexts for properties.

Indexing an UTF-8 encoded string by character rather than code unit is expensive in either time or memory. If you for some reason need character indexing, use a dchar[].
 And besides, string as opposite to char[] is more pleasant for my eyes =)

There is always alias.
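
As a rough illustration of both points, assuming a current Phobos (`std.string.indexOf` returns a code unit index, `std.conv.to` converts between string types); the helper names are hypothetical:

```d
import std.conv : to;
import std.string : indexOf;

// Hypothetical helper: the tail of haystack starting at needle,
// found and sliced purely by code unit index.
string tailFrom(string haystack, string needle)
{
    auto i = haystack.indexOf(needle);
    return i < 0 ? "" : haystack[cast(size_t) i .. $];
}

// Hypothetical helper: the n-th code point, via conversion to UTF-32.
dchar codePointAt(string s, size_t n)
{
    return s.to!dstring[n];
}

void main()
{
    string s = "привет мир";

    // Searching yields a code unit index (13 here), and slicing with
    // that same index lands exactly on a character boundary.
    assert(s.indexOf("мир") == 13);
    assert(tailFrom(s, "мир") == "мир");

    // Indexing by code point needs the conversion to dchar units.
    assert(codePointAt(s, 7) == 'м');
}
```

The index returned by the search is never interpreted as a character position, so the variable-width encoding does not get in the way.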
Jul 31 2006
parent Serg Kovrov <user domain.invalid> writes:
You've got some valid points, I just showed mine.
Jul 31 2006
prev sibling next sibling parent Walter Bright <newshound digitalmars.com> writes:
Oskar Linde wrote:
 It is easy to implement your own character count though:
 
 size_t count(char[] arr) {
     size_t n = 0;
     foreach (dchar c; arr)
         n++;
     return n;
 }
 
 assert("тест".count() == 4);

std.utf.toUCSindex(s, s.length) will also give the character count.
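
For readers on a current Phobos, a sketch of two further library routes to the same number, `std.utf.count` and `std.range.walkLength`; the wrapper names are made up for illustration:

```d
import std.range : walkLength;
import std.utf : count;

// Number of code points, via the dedicated std.utf function.
size_t codePoints(const(char)[] s)
{
    return count(s);
}

// Same result by walking the string as a range of decoded dchar values.
size_t codePointsByWalking(string s)
{
    return s.walkLength;
}

void main()
{
    assert("тест".length == 8);              // code units
    assert(codePoints("тест") == 4);         // code points
    assert(codePointsByWalking("тест") == 4);
}
```

Both routes decode the whole string, so they are linear in its length, which matches the remark that counting characters is comparatively expensive.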
Jul 31 2006
prev sibling parent reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:
Oskar Linde wrote on 2006-07-31:
 Serg Kovrov wrote:

 For example,
 char[] str = "тест";
 word "test" in russian - 4 cyrillic characters, would give you 
 str.length 8, which make no use of this length property if you not sure 
 that string is latin characters only.

It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units. Counting the number of characters is also a rather expensive operation. All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters.

It is easy to implement your own character count though:

    size_t count(char[] arr) {
        size_t n = 0;
        foreach (dchar c; arr)
            n++;
        return n;
    }

    assert("тест".count() == 4);

Also note that:

    assert("тест"d.length == 4);

I hate to be pedantic but dchar[] can only be used to count the code points - not the characters. A "character" can be composed of more than one code point/dchar. This feature is frequently used for accents, marks and some Asian scripts.

-> http://www.unicode.org

Thomas
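
A small sketch of the difference, assuming a current Phobos where `std.uni.byGrapheme` groups code points into user-perceived characters; the helper names are hypothetical:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

size_t codePoints(string s) { return s.walkLength; }
size_t graphemes(string s)  { return s.byGrapheme.walkLength; }

void main()
{
    string composed   = "\u00E9";    // e-acute as one precomposed code point
    string decomposed = "e\u0301";   // 'e' followed by a combining acute accent

    // The code point counts differ ...
    assert(codePoints(composed) == 1);
    assert(codePoints(decomposed) == 2);

    // ... but both are one character to the reader.
    assert(graphemes(composed) == 1);
    assert(graphemes(decomposed) == 1);
}
```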
Jul 31 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
Right, Thomas, an umlaut can exist as a separate code point, so A with umlaut can be represented by two code points. But as far as I remember the intention was and is to have in Unicode also all full forms like "A-with-umlaut". So you can always "compress" multi code point forms into single point counterparts. This way "тест"d.length == 4 will be true - it just depends on your text parser.

Andrew.
Jul 31 2006
parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:
Andrew Fedoniouk wrote on 2006-07-31:
Right, Thomas, umlaut as a separate code point can exist so A with umlaut can be represented by two code points. But as far as I remember the intention was and is to have in Unicode also all full forms like "A-with-umlaut"

http://www.unicode.org/faq/char_combmark.html#13

I won't argue about the intention here. Post this statement on <unicode unicode.org> (http://www.unicode.org/consortium/distlist.html) and let's see the various responses ;)
 So you can always "compress" multi code point forms into
 single point counterparts.

Not always. For a common use case see
http://www.unicode.org/faq/han_cjk.html#7
http://www.unicode.org/faq/han_cjk.html#9

Thomas
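
Both sides of this exchange can be checked with a sketch, assuming a current Phobos where `std.uni.normalize` implements the Unicode normalization forms. The Latin letter 'q' with a combining tilde is chosen here only as an example of a pair that has no precomposed code point:

```d
import std.range : walkLength;
import std.uni : normalize, NFC;

// Hypothetical helper: code point count after composing as far as NFC allows.
size_t codePointsAfterNFC(string s)
{
    return normalize!NFC(s).walkLength;
}

void main()
{
    // 'A' + combining diaeresis composes into the single code point U+00C4.
    assert(normalize!NFC("A\u0308") == "\u00C4");
    assert(codePointsAfterNFC("A\u0308") == 1);

    // 'q' + combining tilde has no precomposed form, so it stays two code points.
    assert(codePointsAfterNFC("q\u0303") == 2);
}
```

So composition shrinks some sequences to one code point, but a code point count after normalization is still not a character count in general.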
Aug 02 2006
prev sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
Derek thanks for summarizing all this but I will put it as following.

There are two types of text encodings for two distinct use cases:
  1) transport/storage encodings - one Unicode code point is
      represented as one or more code units of the encoded sequence (e.g. UTF);
      string.length returns the length in code units of the encoding, not
      in characters.

  2) manipulation encodings - one Unicode code point is represented
      as one and only one element of the sequence (e.g. one byte, word or
      dword);
      string.length here returns the length in code points (mapped character
      glyphs).

The problem as I see it is this:
D proposes to use a transport encoding for manipulation purposes,
which is the main problem here, IMO - transport encodings are not
designed for manipulation, and it is extremely difficult to use
them for manipulation in practice, as we can see.

One more problem:

Encodings like UTF-8 and UTF-16 are almost useless
with, say, the Windows API and its TextOutA and TextOutW functions.
Neither of them will accept D's char[] and wchar[] directly.

- ***A  functions in Windows take byte string (LPSTR) and current
  codepage id  to render text. ( byte + codepage = Unicode Code Point )

- ***W functions in Windows use LPWSTR things which are
  sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
  (  cast(dword) word  = Unicode Code Point )
  Only few functions in Windows API treat LPWSTR as UTF-16.

-----------------
"D strings are utf encoded sequences only" is a design mistake, IMO.
On disk (serialized form) - yes. But not in memory for manipulation please.

Andrew Fedoniouk.
http://terrainformatica.com

Jul 31 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 The problem as I can see is this:
 D propose to use transport encoding for manipulation purposes
 which is main problem imo here - transport encodings are not
 designed for the manipulation - it is extremely difficult to use
 them for manipulation in practice as we may see.

I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem. It's also certainly easier than codepage based multibyte designs like shift-JIS (I used to write code for shift-JIS).
 Encoding like UTF-8 and UTF-16 are almost useless
 with let's say Windows API, say TextOutA and TextOutW functions.
 Neither one of them will accept D's char[] and wchar[] directly.
 
 - ***A  functions in Windows take byte string (LPSTR) and current
   codepage id  to render text. ( byte + codepage = Unicode Code Point )

Win9x only supports the A functions, and Phobos does a translation of the output into the Win9x code page when running on Win9x. Of course, this fails when one has characters not supported by Win9x, but code pages aren't going to help that either. Win9x is obsolete anyway, and there's no reason to cripple a new language by accommodating the failures of an obsolete system. When running on NT or later Windows, the W functions are used instead which work directly with UTF-16. Later Windows also support UTF-8 with the A functions.
 - ***W functions in Windows use LPWSTR things which are
   sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
   (  cast(dword) word  = Unicode Code Point )
   Only few functions in Windows API treat LPWSTR as UTF-16.

BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages. So, the W functions can and do take UTF-16 directly, and in fact the Phobos implementation does use the W functions, transmitting wchar[] to them, and it works fine. The neat thing about Phobos is it adapts to whether you are using Win9x, full 32 bit Windows, or Linux, and adjusts the char output accordingly so it "just works."
 -----------------
 "D strings are utf encoded sequences only" is a design mistake, IMO.
 On disk (serialized form) - yes. But not in memory for manipulation please.

There isn't any better method of handling international character sets in a portable way. Code pages have serious, crippling, unfixable problems - including all the downsides of multibyte systems (because the asian code pages are multibyte).
Jul 31 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eam1ec$10e1$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 The problem as I can see is this:
 D propose to use transport encoding for manipulation purposes
 which is main problem imo here - transport encodings are not
 designed for the manipulation - it is extremely difficult to use
 them for manipulation in practice as we may see.

I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.

Sorry, but strings in DMDScript are quite different, in that:

0) There is no such thing as char in JavaScript.
1) Strings are Strings, not vectors of octets: js::string[] and d::char[] are different things.
2) They are not supposed to be used by any OS API.
3) There are 12 or so methods of the String class in JS, a limited perimeter. What model you've chosen to store them is irrelevant; in some implementations they are even represented by a list of fixed runs.
 It's also certainly easier than codepage based multibyte designs like 
 shift-JIS (I used to write code for shift-JIS).

 Encoding like UTF-8 and UTF-16 are almost useless
 with let's say Windows API, say TextOutA and TextOutW functions.
 Neither one of them will accept D's char[] and wchar[] directly.

 - ***A  functions in Windows take byte string (LPSTR) and current
   codepage id  to render text. ( byte + codepage = Unicode Code Point )

Win9x only supports the A functions,

You are not right here. TextOutA and TextOutW are both supported by Win98. And the intention in Harmonia was to use only those ***W functions which come out of the box on Win98 (without the need for MSLU).
 and Phobos does a translation of the output into the Win9x code page when 
 running on Win9x. Of course, this fails when one has characters not 
 supported by Win9x, but code pages aren't going to help that either.

 Win9x is obsolete anyway, and there's no reason to cripple a new language 
 by accommodating the failures of an obsolete system.

There is a huge market of embedded devices. If you think that computer evolution expands only in the more-ram-speed direction then you are in trouble. http://www.litepc.com/graphics/eossystem.jpg
 When running on NT or later Windows, the W functions are used instead 
 which work directly with UTF-16. Later Windows also support UTF-8 with the 
 A functions.

http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspx
 - ***W functions in Windows use LPWSTR things which are
   sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
   (  cast(dword) word  = Unicode Code Point )
   Only few functions in Windows API treat LPWSTR as UTF-16.

BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages.

Sorry, this scares me: "BMP is a proper subset of UTF-16". UTF-16 is a group name for the *byte stream encodings* (UTF-16LE and UTF-16BE) of the Unicode code set. BTW: which one of these UTFs does D use? Platform dependent, I believe.
 So, the W functions can and do take UTF-16 directly, and in fact the 
 Phobos implementation does use the W functions, transmitting wchar[] to 
 them, and it works fine.

 The neat thing about Phobos is it adapts to whether you are using Win9x, 
 full 32 bit Windows, or Linux, and adjusts the char output accordingly so 
 it "just works."

It should work well. Efficiently, I mean. The language shall be agnostic to the meaning of char as much as possible. It shall not prevent you from writing effective algorithms.
 -----------------
 "D strings are utf encoded sequences only" is a design mistake, IMO.
 On disk (serialized form) - yes. But not in memory for manipulation 
 please.

There isn't any better method of handling international character sets in a portable way. Code pages have serious, crippling, unfixable problems - including all the downsides of multibyte systems (because the asian code pages are multibyte).

We are speaking different languages:

A: "strings are utf encoded sequences only" is a design mistake.
W: "use any encoding other than utf" is a design mistake.

Different meaning, eh?

Forget about codepages. Let those who are aware of them deal with them efficiently. "Codepage" (c) Walter (e.g. ASCII) is an efficient way of representing text. That is it. Others who can afford the full set will work with full 21-bit values. Practically it is enough to have 16 (BMP) but...

Andrew Fedoniouk.
http://terrainformatica.com
Jul 31 2006
next sibling parent reply Derek Parnell <derek nomail.afraid.org> writes:
On Mon, 31 Jul 2006 18:23:19 -0700, Andrew Fedoniouk wrote:

 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eam1ec$10e1$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 The problem as I can see is this:
 D propose to use transport encoding for manipulation purposes
 which is main problem imo here - transport encodings are not
 designed for the manipulation - it is extremely difficult to use
 them for manipulation in practice as we may see.

I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.

Sorry, but strings in DMDScript are quite different, in that:

0) There is no such thing as char in JavaScript.
1) Strings are Strings, not vectors of octets: js::string[] and d::char[] are different things.
2) They are not supposed to be used by any OS API.
3) There are 12 or so methods of the String class in JS, a limited perimeter. What model you've chosen to store them is irrelevant; in some implementations they are even represented by a list of fixed runs.

For what it's worth, to do *character* manipulation I convert strings to UTF-32, do my stuff and convert back to the initial format.

  char[] somefunc(char[] x)
  {
      return std.utf.toUTF8( somefunc( std.utf.toUTF32(x) ) );
  }

  wchar[] somefunc(wchar[] x)
  {
      return std.utf.toUTF16( somefunc( std.utf.toUTF32(x) ) );
  }

  dchar[] somefunc(dchar[] x)
  {
      dchar[] result;
      ...
      return result;
  }

This seems to work fast enough for my purposes. DBuild (nee Build) uses this a lot.

-- 
Derek (skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
1/08/2006 11:45:36 AM
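The convert-to-UTF-32 pattern described above might look roughly like this with a concrete operation filled in. This is only a sketch: it assumes a current D2 Phobos (where toUTF8 and toUTF16 return string and wstring rather than char[] and wchar[]), and reversed is a hypothetical stand-in for the elided "do my stuff" step, not anything taken from DBuild.

```d
import std.stdio : writeln;
import std.utf : toUTF8, toUTF16, toUTF32;

// Hypothetical operation: reverse a string code point by code point.
// All real work happens on UTF-32, where one element is one code point.
dchar[] reversed(const(dchar)[] x)
{
    auto result = new dchar[x.length];
    foreach (i, c; x)
        result[x.length - 1 - i] = c;
    return result;
}

// UTF-8 wrapper: widen to UTF-32, do the work, narrow back to UTF-8.
string reversed(const(char)[] x)
{
    return toUTF8(reversed(toUTF32(x)));
}

// UTF-16 wrapper: same idea.
wstring reversed(const(wchar)[] x)
{
    return toUTF16(reversed(toUTF32(x)));
}

void main()
{
    string s = "a\u00E9\U0001D11E";     // 1-, 2- and 4-byte UTF-8 sequences
    assert(reversed(s) == "\U0001D11E\u00E9a");

    wstring w = "a\u00E9\U0001D11E"w;   // the last code point is a surrogate pair
    assert(reversed(w) == "\U0001D11E\u00E9a"w);
    writeln("ok");
}
```

Reversing the raw code units instead would split the multi-unit sequences, which is why the work is done on the UTF-32 form.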
Jul 31 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Derek Parnell" <derek nomail.afraid.org> wrote in message 
news:8n0koj5wjiio.qwc8ok4mrvr3$.dlg 40tude.net...
 On Mon, 31 Jul 2006 18:23:19 -0700, Andrew Fedoniouk wrote:

 "Walter Bright" <newshound digitalmars.com> wrote in message
 news:eam1ec$10e1$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 The problem as I can see is this:
 D propose to use transport encoding for manipulation purposes
 which is main problem imo here - transport encodings are not
 designed for the manipulation - it is extremely difficult to use
 them for manipulation in practice as we may see.

I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.

Sorry, but strings in DMDScript are quite different, in that:

0) There is no such thing as char in JavaScript.
1) Strings are Strings, not vectors of octets: js::string[] and d::char[] are different things.
2) They are not supposed to be used by any OS API.
3) There are 12 or so methods of the String class in JS, a limited perimeter. What model you've chosen to store them is irrelevant; in some implementations they are even represented by a list of fixed runs.

For what its worth, to do *character* manipulation I convert strings to UTF-32, do my stuff and convert back to the initial format. char[] somefunc(char[] x) { return std.utf.toUTF8( somefunc( std.utf.toUTF32(x) ) ); } wchar[] somefunc(wchar[] x) { return std.utf.toUTF16( somefunc( std.utf.toUTF32(x) ) ); } dchar[] somefunc(dchar[] x) { dchar[] result; ... return result; } This seems to work fast enough for my purposes. DBuild (nee Build) uses this a lot. --

Derek, using dchar (ultimate char) is perfectly fine in DBuild(*) circumstances: you are parsing, not dealing with the OS in each line. Using dchar has a drawback: you need to recreate all string primitive ops from scratch, including RegExp, etc. Again, dchar is OK; the only thing that is not OK is the strange choice of value for dchar null/nothing/nihil/nil/whatever.

(* dbuild does not sound good in Russian: it is very close to "idiot" in the medical sense. Consider builDer/buildDer/creaDor for example, with a red D in the middle; stylish at least.)

Andrew.
Jul 31 2006
parent reply "John Reimer" <terminal.node gmail.com> writes:
On Mon, 31 Jul 2006 20:46:53 -0700, Andrew Fedoniouk  
<news terrainformatica.com> wrote:

 (* dbuild does not sound good in russian - very close to idiot in medical
 meaning
 consider builDer/buildDer/creaDor for example - with red D in the middle  
 -
 stylish at least)

 Andrew.

Really, Andrew, you are getting carried away in your demands. You almost sound self-centered :). dbuild is not made for Russians only. Almost any English word conceived for a name might just have some sort of bad connotation in any one of the thousands of languages in this world. Why should anyone feel obligated to accommodate your culture here? I know Russians tend to be quite proud of their heritage, which is fine... but really, you are being quite silly to make these demands here.

That aside... my personal, self-centered feeling is that the name "bud" is quite adequate. :D

-JJR
Jul 31 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"John Reimer" <terminal.node gmail.com> wrote in message 
news:op.tdlccd0b6gr7xp epsilon-alpha...
 On Mon, 31 Jul 2006 20:46:53 -0700, Andrew Fedoniouk 
 <news terrainformatica.com> wrote:

 (* dbuild does not sound good in russian - very close to idiot in medical
 meaning
 consider builDer/buildDer/creaDor for example - with red D in the 
 middle  -
 stylish at least)

 Andrew.

Really, Andrew, you are getting carried away in your demands. You almost sound self-centered :). dbuild is not made for Russians only. Almost any English word conceived for a name might just have some sort of bad connotation in any one of the thousands of languages in this world. Why should anyone feel obligated to accommodate your culture here? I know Russians tend to be quite proud of their heritage, which is fine... but really, you are being quite silly to make these demands here. That aside... my personal, self-centered feeling is that the name "bud" is quite adequate. :D

:D

BTW: debilita [lat.] as a word with many variations is used in almost all languages directly derived from Latin. You can say d'buil' on the streets of, say, Munich and they will understand you. Trust me, free beer will be yours.

So it is far from Russian-centric :-P

Andrew.
Aug 01 2006
parent "Rémy J. A. Mouëza" writes:
Andrew Fedoniouk wrote:
 "John Reimer" <terminal.node gmail.com> wrote in message 
 news:op.tdlccd0b6gr7xp epsilon-alpha...
 
On Mon, 31 Jul 2006 20:46:53 -0700, Andrew Fedoniouk 
<news terrainformatica.com> wrote:


(* dbuild does not sound good in russian - very close to idiot in medical
meaning
consider builDer/buildDer/creaDor for example - with red D in the 
middle  -
stylish at least)

Andrew.

Really, Andrew, you are getting carried away in your demands. You almost sound self-centered :). dbuild is not made for Russians only. Almost any English word conceived for a name might just have some sort of bad connotation in any one of the thousands of languages in this world. Why should anyone feel obligated to accommodate your culture here? I know Russians tend to be quite proud of their heritage, which is fine... but really, you are being quite silly to make these demands here. That aside... my personal, self-centered feeling is that the name "bud" is quite adequate. :D

:D BTW: debilita [lat.] as a word with many variations is used in almost all languages directly derived from Latin. You can say d'buil' on the streets of, say, Munich and they will understand you. Trust me, free beer will be yours. So it is far from Russian-centric :-P Andrew.

As with "débile" in French, pronounced something like "day bill". One has to pronounce the ending D of dbuild correctly to disambiguate it, but since we generally know what we're talking about in an IT-related discussion, it should be OK, or even funny if we pronounce it ambiguously in the presence of people with enough of a sense of humour.
Aug 01 2006
prev sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eam1ec$10e1$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 The problem as I can see is this:
 D propose to use transport encoding for manipulation purposes
 which is main problem imo here - transport encodings are not
 designed for the manipulation - it is extremely difficult to use
 them for manipulation in practice as we may see.

I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.

Sorry but strings in DMDScript are quite different in terms of 0) there are no such thing as char in JavaScript.

ECMAScript 262-3 (Javascript) defines the source character set to be UTF-16, and the source character set is what JS programs manipulate for strings and characters.
 1) strings are Strings - not vectors of octets - js::string[] and d::char[] 
 are different things.
 2) are not supposed to be used by any OS API.
 3) there are 12 or so methods of String class in JS - limited perimeter -
 what model you've choosen to store them is irrelevant -
 in some implementations they represented even by list of fixed runs.

I agree how it's stored in the JS implementation is irrelevant. My point was that in DMDScript they are stored as utf-8 strings, and they work with only minor extra effort - DMDScript implements all the string handling functions JS defines.
 - ***A  functions in Windows take byte string (LPSTR) and current
   codepage id  to render text. ( byte + codepage = Unicode Code Point )


You are not right here. TextOutA and TextOutW are both supported by Win98. And intention in Harmonia was to use only those ***W functions which come out of the box on Win98 (without need of MSLU)

You're right in that Win98 exports a small handful of W functions without MSLU - but what those W functions actually do under the hood is translate the data based on the current code page and then call the corresponding A function. In other words, the Win9x W functions are rather pointless and don't support characters that are not in the current code page anyway. MSLU extends the same poor behavior to a bunch more pseudo W functions. This is why Phobos does not call W functions under Win9x. Conversely, the A functions under NT and later translate the characters to - you guessed it - UTF-16 and then call the corresponding W function. This is why Phobos under NT does not call the A functions.
 Win9x is obsolete anyway, and there's no reason to cripple a new language 
 by accommodating the failures of an obsolete system.

There is a huge market of embedded devices. If you think that computer evolution expands only in the more-ram-speed direction then you are in trouble. http://www.litepc.com/graphics/eossystem.jpg

I agree there's a huge ecosystem of 32 bit embedded processors. And D works fine with Win9x - it just isn't crippled by Win9x's shortcomings.
 When running on NT or later Windows, the W functions are used instead 
 which work directly with UTF-16. Later Windows also support UTF-8 with the 
 A functions.


That is consistent with what I wrote about it.
 - ***W functions in Windows use LPWSTR things which are
   sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
   (  cast(dword) word  = Unicode Code Point )
   Only few functions in Windows API treat LPWSTR as UTF-16.

BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages.

Sorry, this scares me: "BMP is a proper subset of UTF-16". UTF-16 is a group name for the *byte stream encodings* (UTF-16LE and UTF-16BE) of the Unicode code set. BTW: which one of these UTFs does D use? Platform dependent, I believe.

D has been used for many years with foreign languages under Windows. If UTF-16 didn't work with Windows, I think it would have come up by now <g>. As for whether it is LE or BE, it is whatever the local platform is, just like ints, shorts, longs, etc. are.
 So, the W functions can and do take UTF-16 directly, and in fact the 
 Phobos implementation does use the W functions, transmitting wchar[] to 
 them, and it works fine.

 The neat thing about Phobos is it adapts to whether you are using Win9x, 
 full 32 bit Windows, or Linux, and adjusts the char output accordingly so 
 it "just works."

It should work well. Efficiently, I mean.

Yes.
 The language shall be agnostic to the meaning of char as much as possible.

That's C/C++'s approach, and it does not work very well. Check out tchar.h, there's a lovely disaster <g>. For another, just try using std::string with shift-JIS.
 It shall not prevent you to write effective algorithms.

Does UTF-8 prevent writing effective algorithms? I don't see how. DMDScript works, and is faster than any other JS implementation out there, including my own C++ version <g>. And frankly, my struggles with trying to internationalize C++ code for DMDScript are what led to D's support for UTF. The D implementation is shorter, simpler, and faster than the C++ one (which uses wchar's).
 Practically it is enough to have 16 (BMP) but...

I agree you can write code using BMP and ignore surrogate pairs today, and you'll probably never notice the bugs. But sooner or later, the surrogate pair problem is going to show up. Windows, Java, and Javascript have all had to go back and redo to deal with surrogate pairs.
Jul 31 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eamql8$1jgc$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eam1ec$10e1$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 The problem as I can see is this:
 D propose to use transport encoding for manipulation purposes
 which is main problem imo here - transport encodings are not
 designed for the manipulation - it is extremely difficult to use
 them for manipulation in practice as we may see.

I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.

Sorry but strings in DMDScript are quite different in terms of 0) there are no such thing as char in JavaScript.

ECMAScript 262-3 (Javascript) defines the source character set to be UTF-16, and the source character set is what JS programs manipulate for strings and characters.

Walter, please, forget about such a thing as "character set is UTF-16"; it is nonsense.

Regarding ECMA-262:

"A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 2.1 or later, and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form..."

It is quite different from your interpretation. The compiler accepts the input stream as either BMP codes or the full Unicode set encoded using UTF-16. There is no mention that String[n] will return you a UTF-16 code unit. That would be weird.
 1) strings are Strings - not vectors of octets - js::string[] and 
 d::char[] are different things.
 2) are not supposed to be used by any OS API.
 3) there are 12 or so methods of String class in JS - limited perimeter -
 what model you've choosen to store them is irrelevant -
 in some implementations they represented even by list of fixed runs.

I agree how it's stored in the JS implementation is irrelevant. My point was that in DMDScript they are stored as utf-8 strings, and they work with only minor extra effort - DMDScript implements all the string handling functions JS defines.

Again, it is up to you how they are stored internally and what you did there. In D the situation is completely different: there are char and char[], open to all winds.
 - ***A  functions in Windows take byte string (LPSTR) and current
   codepage id  to render text. ( byte + codepage = Unicode Code Point )


You are not right here. TextOutA and TextOutW are both supported by Win98. And intention in Harmonia was to use only those ***W functions which come out of the box on Win98 (without need of MSLU)

You're right in that Win98 exports a small handful of W functions without MSLU - but what those W functions actually do under the hood is translate the data based on the current code page and then call the corresponding A function. In other words, the Win9x W functions are rather pointless and don't support characters that are not in the current code page anyway. MSLU extends the same poor behavior to a bunch more pseudo W functions. This is why Phobos does not call W functions under Win9x.

I wouldn't be so pessimistic about Win98 :)
 Conversely, the A functions under NT and later translate the characters 
 to - you guessed it - UTF-16 and then call the corresponding W function. 
 This is why Phobos under NT does not call the A functions.

Ok. And how do you call A functions? Do you use proposed koi8chars, latin1chars, etc.? You are using char for that. But wait, char cannot contain anything other than utf-8 :-P
 Win9x is obsolete anyway, and there's no reason to cripple a new 
 language by accommodating the failures of an obsolete system.

There is a huge market of embedded devices. If you think that computer evolution expands only in the more-ram-speed direction then you are in trouble. http://www.litepc.com/graphics/eossystem.jpg

I agree there's a huge ecosystem of 32 bit embedded processors. And D works fine with Win9x - it just isn't crippled by Win9x's shortcomings.
 When running on NT or later Windows, the W functions are used instead 
 which work directly with UTF-16. Later Windows also support UTF-8 with 
 the A functions.


That is consistent with what I wrote about it.

No doubts about it.
 - ***W functions in Windows use LPWSTR things which are
   sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
   (  cast(dword) word  = Unicode Code Point )
   Only few functions in Windows API treat LPWSTR as UTF-16.

BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages.

Sorry, this scares me: "BMP is a proper subset of UTF-16". UTF-16 is a group name for the *byte stream encodings* (UTF-16LE and UTF-16BE) of the Unicode code set. BTW: which one of these UTFs does D use? Platform dependent, I believe.

D has been used for many years with foreign languages under Windows. If UTF-16 didn't work with Windows, I think it would have come up by now <g>. As for whether it is LE or BE, it is whatever the local platform is, just like ints, shorts, longs, etc. are.

 So, the W functions can and do take UTF-16 directly, and in fact the 
 Phobos implementation does use the W functions, transmitting wchar[] to 
 them, and it works fine.

 The neat thing about Phobos is it adapts to whether you are using Win9x, 
 full 32 bit Windows, or Linux, and adjusts the char output accordingly 
 so it "just works."

It should work well. Efficiently, I mean.

Yes.
 The language shall be agnostic to the meaning of char as much as 
 possible.

That's C/C++'s approach, and it does not work very well. Check out tchar.h, there's a lovely disaster <g>. For another, just try using std::string with shift-JIS.
 It shall not prevent you to write effective algorithms.

Does UTF-8 prevent writing effective algorithms? I don't see how. DMDScript works, and is faster than any other JS implementation out there, including my own C++ version <g>. And frankly, my struggles with trying to internationalize C++ code for DMDScript is what led to D's support for UTF. The D implementation is shorter, simpler, and faster than the C++ one (which uses wchar's).
 Practically it is enough to have 16 (BMP) but...

I agree you can write code using BMP and ignore surrogate pairs today, and you'll probably never notice the bugs. But sooner or later, the surrogate pair problem is going to show up. Windows, Java, and Javascript have all had to go back and redo to deal with surrogate pairs.

Why? JavaScript, for example, has no such thing as char. String.charAt() returns guess what? Correct - a String object. No char - no problem :D

Why do they need to redefine anything then?

Again - let people decide what char is and how to interpret it, and that will be it.

Phobos can work with utf-8/16 and satisfy you and other UTF-masochists (no offence implied). Ordinary people will do their own strings anyway. Just give them opAssign and dtor in structs and you will see an explosion of perfect strings. That char#[] (read-only arrays) will also benefit here. oh.....

Changing the char init value to 0 will not harm anybody, but will allow char to be used for purposes other than utf-8 - it is only one of some 40 encodings in active use anyway.

For persistence purposes (in a compiled EXE) utf is probably the best choice. But at runtime - please, not at the language level.

Educated IMO, of course.

Andrew.
Aug 01 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 Compiler accepts input stream as either BMP codes or full unicode set 

BMP is a subset of UTF-16.
 There is no mentioning that String[n] will return you utf-16 code
 unit. That will be weird.

String.charCodeAt() will give you the utf-16 code unit.
 Conversely, the A functions under NT and later translate the characters 
 to - you guessed it - UTF-16 and then call the corresponding W function. 
 This is why Phobos under NT does not call the A functions.


Take a look at std.file for an example.
 Windows, Java, and Javascript have all 
 had to go back and redo to deal with surrogate pairs.

String.charAt() returns guess what? Correct - String object. No char - no problem :D

See String.fromCharCode() and String.charCodeAt()
 Again - let people decide of what char is and how to interpret it And that 
 will be it.

I've already explained the problems C/C++ have with that. They're real problems, bad and unfixable enough that there are official proposals to add new UTF basic types to C++.
 Phobos can work with utf-8/16 and satisfy you and other UTF-masochists (no 
 offence implied).

C++'s experience with this demonstrates that char* does not work very well with UTF-8. It's not just my experience, it's why new types were proposed for C++ (and not by me).
 Ordinary people will do their own strings anyway. Just 
 give them opAssign and dtor in structs and you will see explosion of perfect 
 strings. That char#[] (read-only arrays) will also benefit here. oh.....
 
 Changing char init value to 0 will not harm anybody but will allow to use 
 char for other than
 
 utf-8 purposes - it is only one from 40 in active use encodings anyway.
 
 For persistence purposes (in compiled EXE) utf is the best choice probably. 
 But in runtime - please not on language level.

ubyte[] will enable you to use any encoding you wish - and that's what it's there for.
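To make the char[] versus ubyte[] distinction concrete, here is a minimal sketch, assuming a current D2 compiler and Phobos (std.utf.validate and UTFException are taken from there; the 2006 library differed). The helper isValidUtf8 is a made-up name for this example:

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

// Hypothetical helper: true if the buffer is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    char c;                      // default-initialized
    ubyte b;
    assert(c == 0xFF);           // char.init is not a valid UTF-8 code unit
    assert(b == 0);              // ubyte.init is plain zero

    char[] text = new char[3];   // every element starts as 0xFF
    assert(!isValidUtf8(text));  // a forgotten initialization is detectable
    text[] = 'a';
    assert(isValidUtf8(text));

    // Arbitrary bytes, e.g. from some single-byte code page: ubyte[]
    // carries no UTF-8 promise, so 0xFF is an ordinary value here.
    ubyte[] raw = [0xF0, 0xD2, 0xFF];
    assert(raw.length == 3);
    writeln("ok");
}
```

Which element type to use is then a statement of intent: char[] promises UTF-8, ubyte[] promises nothing about encoding.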
Aug 01 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
(Hope this long dialog will help all of us to better understand what UNICODE 
is)

"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Compiler accepts input stream as either BMP codes or full unicode set

BMP is a subset of UTF-16.

Walter, with deepest respect, but it is not. They are two different things.

UTF-16 is a variable-length encoding - a byte stream. Unicode BMP is, strictly speaking, a range of numbers.

If you treat a UTF-16 sequence as a sequence of UCS-2 (BMP) codes you are in trouble. See: the sequence of two words D834 DD1E as UTF-16 will give you one Unicode character with code 0x1D11E (musical G clef). And the same sequence interpreted as a UCS-2 sequence will give you two (invalid, non-printable, but still) character codes. You will get a different length of the string, at least.
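The D834 DD1E example can be checked directly. A small sketch, assuming a current D2 compiler and Phobos:

```d
import std.stdio : writeln;
import std.utf : toUTF32;

void main()
{
    // U+1D11E (musical G clef); the compiler encodes the literal as UTF-16.
    wstring clef = "\U0001D11E"w;
    assert(clef.length == 2);                       // two 16-bit code units
    assert(clef[0] == 0xD834 && clef[1] == 0xDD1E); // the surrogate pair

    dstring decoded = toUTF32(clef);
    assert(decoded.length == 1);                    // one code point
    assert(decoded[0] == 0x1D11E);
    writeln(clef.length, " code units, ", decoded.length, " code point");
}
```

Code that reads wchar[] one element at a time sees length 2 here; code that decodes sees length 1, which is the length mismatch described above.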
 There is no mentioning that String[n] will return you utf-16 code
 unit. That will be weird.

String.charCodeAt() will give you the utf-16 code unit.
 Conversely, the A functions under NT and later translate the characters 
 to - you guessed it - UTF-16 and then call the corresponding W function. 
 This is why Phobos under NT does not call the A functions.


Take a look at std.file for an example.

You mean here?:

    char* namez = toMBSz(name);
    h = CreateFileA(namez,GENERIC_WRITE,0,null,CREATE_ALWAYS,
        FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,cast(HANDLE)null);

char* here is far from UTF-8 sequence.
 Windows, Java, and Javascript have all had to go back and redo to deal 
 with surrogate pairs.

String.charAt() returns guess what? Correct - String object. No char - no problem :D

See String.fromCharCode() and String.charCodeAt()

ECMA-262 String.prototype.charCodeAt (pos) Returns a number (a nonnegative integer less than 2^16) representing the code point value of the character at position pos in the string.... As you may see it is returning (unicode) *code point* from BMP set but it is far from UTF-16 code unit you've declared above. Relaxing "a nonnegative integer less than 2^16" to "a nonnegative integer less than 2^21" will not harm anybody. Or at least such probability is vanishingly small.
 Again - let people decide of what char is and how to interpret it And 
 that will be it.

I've already explained the problems C/C++ have with that. They're real problems, bad and unfixable enough that there are official proposals to add new UTF basic types to C++.

Basic types of what?
 Phobos can work with utf-8/16 and satisfy you and other UTF-masochists 
 (no offence implied).

C++'s experience with this demonstrates that char* does not work very well with UTF-8. It's not just my experience, it's why new types were proposed for C++ (and not by me).

Because char in C is not supposed to hold multi-byte encodings. At least std::string is strictly a single-byte thing by definition. And this is perfectly fine. There is wchar_t for holding the OS supported range in full. On Win32, wchar_t is 16-bit (UCS-2 legacy) and in GCC/*nix it is 32-bit.
 Ordinary people will do their own strings anyway. Just give them opAssign 
 and dtor in structs and you will see explosion of perfect strings. That 
 char#[] (read-only arrays) will also benefit here. oh.....

 Changing char init value to 0 will not harm anybody but will allow to use 
 char for other than

 utf-8 purposes - it is only one from 40 in active use encodings anyway.

 For persistence purposes (in compiled EXE) utf is the best choice 
 probably. But in runtime - please not on language level.

ubyte[] will enable you to use any encoding you wish - and that's what it's there for.

Thus the whole set of Windows API headers (and std.c.string for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea? Andrew.
Aug 01 2006
next sibling parent reply Derek Parnell <derek nomail.afraid.org> writes:
On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:

 (Hope this long dialog will help all of us to better understand what UNICODE 
 is)
 
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Compiler accepts input stream as either BMP codes or full unicode set

BMP is a subset of UTF-16.

Walter, with deepest respect, but it is not. Two different things. UTF-16 is a variable-length encoding - byte stream. Unicode BMP is a range of numbers, strictly speaking.

Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now. ...
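A minimal sketch of those lengths, assuming current Phobos where `std.utf.codeLength` reports how many code units a code point needs in a given encoding. The sample characters are my own choice for illustration.

```d
import std.stdio : writefln;
import std.utf : codeLength;

void main()
{
    // 'A', e-acute, euro sign, musical G clef (outside the BMP).
    dchar[] samples = ['A', '\u00E9', '\u20AC', '\U0001D11E'];

    foreach (dchar c; samples)
    {
        // UTF-8 code units are 1 byte each, UTF-16 code units are 2 bytes each.
        writefln("U+%05X: %s byte(s) in UTF-8, %s byte(s) in UTF-16",
                 cast(uint) c, codeLength!char(c), codeLength!wchar(c) * 2);
    }

    // Only the character outside the BMP needs two UTF-16 code units.
    assert(codeLength!wchar('\u20AC') == 1);
    assert(codeLength!wchar('\U0001D11E') == 2);
    assert(codeLength!char('\U0001D11E') == 4);
}
```

Every BMP character takes exactly one 16-bit unit in UTF-16, which is why BMP-only text looks the same in UCS-2 and UTF-16.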
 ubyte[] will enable you to use any encoding you wish - and that's what 
 it's there for.

Thus the whole set of Windows API headers (and std.c.string for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea?

Yes. I believe this is how it now should be done. The Phobos library is not correctly using char, char[], and ubyte[] when interfacing with Windows and C functions. My guess is that Walter originally used 'char' to make things easier for C coders to move over to D, but in doing so, now with UTF support built-in, has caused more problems than the idea was supposed to solve. The move to UTF support is good, but the choice of 'char' for the name of a UTF-8 code-unit was, and still is, a big mistake. I would have liked something more like ...

  char  ==> An unsigned 8-bit byte. An alias for ubyte.
  schar ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  char[] ==> A 'C' string
  schar[] ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

And then have built-in conversions between the UTF encodings. So if people want to continue to use code from C/C++ that uses code-pages or similar they can stick with char[].

-- Derek (skype: derek.j.parnell) Melbourne, Australia "Down with mediocrity!" 2/08/2006 1:08:51 PM
Aug 01 2006
next sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Derek Parnell" <derek nomail.afraid.org> wrote in message 
news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg 40tude.net...
 On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:

 (Hope this long dialog will help all of us to better understand what 
 UNICODE
 is)

 "Walter Bright" <newshound digitalmars.com> wrote in message
 news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Compiler accepts input stream as either BMP codes or full unicode set

BMP is a subset of UTF-16.

Walter, with deepest respect, but it is not. Two different things. UTF-16 is a variable-length encoding - byte stream. Unicode BMP is a range of numbers, strictly speaking.

Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now. ...
 ubyte[] will enable you to use any encoding you wish - and that's what
 it's there for.

Thus the whole set of Windows API headers (and std.c.string for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea?

Yes. I believe this is how it now should be done. The Phobos library is not correctly using char, char[], and ubyte[] when interfacing with Windows and C functions. My guess is that Walter originally used 'char' to make things easier for C coders to move over to D, but in doing so, now with UTF support built-in, has caused more problems than the idea was supposed to solve. The move to UTF support is good, but the choice of 'char' for the name of a UTF-8 code-unit was, and still is, a big mistake. I would have liked something more like ...

  char  ==> An unsigned 8-bit byte. An alias for ubyte.
  schar ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  char[] ==> A 'C' string
  schar[] ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

And then have built-in conversions between the UTF encodings. So if people want to continue to use code from C/C++ that uses code-pages or similar they can stick with char[].

Yes, Derek, this will be probably near the ideal. Andrew.
Aug 01 2006
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Tue, 1 Aug 2006 21:04:10 -0700, Andrew Fedoniouk  
<news terrainformatica.com> wrote:
 "Derek Parnell" <derek nomail.afraid.org> wrote in message
 news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg 40tude.net...
 On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:

 (Hope this long dialog will help all of us to better understand what
 UNICODE
 is)

 "Walter Bright" <newshound digitalmars.com> wrote in message
 news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Compiler accepts input stream as either BMP codes or full unicode set

BMP is a subset of UTF-16.

Walter, with deepest respect, but it is not. Two different things. UTF-16 is a variable-length encoding - byte stream. Unicode BMP is a range of numbers, strictly speaking.

Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now. ...
 ubyte[] will enable you to use any encoding you wish - and that's what
 it's there for.

Thus the whole set of Windows API headers (and std.c.string for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea?

Yes. I believe this is how it now should be done. The Phobos library is not correctly using char, char[], and ubyte[] when interfacing with Windows and C functions. My guess is that Walter originally used 'char' to make things easier for C coders to move over to D, but in doing so, now with UTF support built-in, has caused more problems than the idea was supposed to solve. The move to UTF support is good, but the choice of 'char' for the name of a UTF-8 code-unit was, and still is, a big mistake. I would have liked something more like ...

  char  ==> An unsigned 8-bit byte. An alias for ubyte.
  schar ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  char[] ==> A 'C' string
  schar[] ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

And then have built-in conversions between the UTF encodings. So if people want to continue to use code from C/C++ that uses code-pages or similar they can stick with char[].

Yes, Derek, this will be probably near the ideal.

Yet, I don't find it at all difficult to think of them like so:

  ubyte ==> An unsigned 8-bit byte.
  char  ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  ubyte[] ==> A 'C' string
  char[]  ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

If you want to program in D you _will_ have to readjust your thinking in some areas, this is one of them. All you have to realise is that 'char' in D is not the same as 'char' in C.

In quick and dirty ASCII only applications I can adjust my thinking further:

   char   ==> An ASCII character
   char[] ==> An ASCII string

I do however agree that C functions used in D should be declared like:
   int strlen(ubyte* s);

and not like (as they currently are):
   int strlen(char* s);

The problem with this is that the code:
   char[] s = "test";
   strlen(s)

would produce a compile error, and require a cast or a conversion function (toMBSz perhaps, which in many cases will not need to do anything).

Of course the purists would say "That's perfectly correct, strlen cannot tell you the length of a UTF-8 string, only its byte count", but at the same time it would be nice (for quick and dirty ASCII only programs) if it worked.

Is it possible to declare them like this?
   int strlen(void* s);

and for char[] to be implicitly 'paintable' as void* as char[] is already implicitly 'paintable' as void[]?

It seems like it would nicely solve the problem of people seeing:
   int strlen(char* s);

and thinking D's char is the same as C's char without introducing a painful need for cast or conversion in simple ASCII only situations.

Regan
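To make the "byte count, not character count" point concrete, here is a small sketch written against current D and Phobos, which is an assumption on my part and not the 2006 library discussed in this thread. It uses `core.stdc.string.strlen`, `std.string.toStringz` and `std.utf.count`; `cByteLength` is a hypothetical wrapper added for illustration.

```d
import core.stdc.string : strlen;
import std.string : toStringz;
import std.utf : count;

/// Length as C sees it: bytes up to the first zero terminator.
/// (Hypothetical wrapper; toStringz supplies the terminator.)
size_t cByteLength(string s)
{
    return strlen(toStringz(s));
}

void main()
{
    string ascii = "test";
    string accented = "t\u00E9st"; // e-acute takes 2 bytes in UTF-8

    // For ASCII-only text the byte count and the character count agree.
    assert(cByteLength(ascii) == 4);
    assert(count(ascii) == 4);

    // For non-ASCII text strlen reports UTF-8 code units, not characters.
    assert(cByteLength(accented) == 5);
    assert(accented.length == 5);
    assert(count(accented) == 4);
}
```

The explicit `toStringz` call is the kind of conversion step the post above would like to avoid for quick ASCII-only programs; it also takes care of the zero terminator that a D array slice does not carry.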
Aug 01 2006
next sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message 
news:optdm2gghi23k2f5 nrage...
 On Tue, 1 Aug 2006 21:04:10 -0700, Andrew Fedoniouk 
 <news terrainformatica.com> wrote:
 "Derek Parnell" <derek nomail.afraid.org> wrote in message
 news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg 40tude.net...
 On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:

 (Hope this long dialog will help all of us to better understand what
 UNICODE
 is)

 "Walter Bright" <newshound digitalmars.com> wrote in message
 news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Compiler accepts input stream as either BMP codes or full unicode set

BMP is a subset of UTF-16.

Walter, with deepest respect, but it is not. Two different things. UTF-16 is a variable-length encoding - byte stream. Unicode BMP is a range of numbers, strictly speaking.

Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now. ...
 ubyte[] will enable you to use any encoding you wish - and that's what
 it's there for.

Thus the whole set of Windows API headers (and std.c.string for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea?

Yes. I believe this is how it now should be done. The Phobos library is not correctly using char, char[], and ubyte[] when interfacing with Windows and C functions. My guess is that Walter originally used 'char' to make things easier for C coders to move over to D, but in doing so, now with UTF support built-in, has caused more problems than the idea was supposed to solve. The move to UTF support is good, but the choice of 'char' for the name of a UTF-8 code-unit was, and still is, a big mistake. I would have liked something more like ...

  char  ==> An unsigned 8-bit byte. An alias for ubyte.
  schar ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  char[] ==> A 'C' string
  schar[] ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

And then have built-in conversions between the UTF encodings. So if people want to continue to use code from C/C++ that uses code-pages or similar they can stick with char[].

Yes, Derek, this will be probably near the ideal.

Yet, I don't find it at all difficult to think of them like so:

  ubyte ==> An unsigned 8-bit byte.
  char  ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  ubyte[] ==> A 'C' string
  char[]  ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

If you want to program in D you _will_ have to readjust your thinking in some areas, this is one of them. All you have to realise is that 'char' in D is not the same as 'char' in C.

In quick and dirty ASCII only applications I can adjust my thinking further:

   char   ==> An ASCII character
   char[] ==> An ASCII string

I do however agree that C functions used in D should be declared like:
   int strlen(ubyte* s);

and not like (as they currently are):
   int strlen(char* s);

The problem with this is that the code:
   char[] s = "test";
   strlen(s)

would produce a compile error, and require a cast or a conversion function (toMBSz perhaps, which in many cases will not need to do anything).

Of course the purists would say "That's perfectly correct, strlen cannot tell you the length of a UTF-8 string, only its byte count", but at the same time it would be nice (for quick and dirty ASCII only programs) if it worked.

Is it possible to declare them like this?
   int strlen(void* s);

and for char[] to be implicitly 'paintable' as void* as char[] is already implicitly 'paintable' as void[]?

It seems like it would nicely solve the problem of people seeing:
   int strlen(char* s);

and thinking D's char is the same as C's char without introducing a painful need for cast or conversion in simple ASCII only situations.

Regan

Another option will be to change char.init to 0 and forget about the problem, leaving it as it is now. Some good string implementation will contain an encoding field in the string instance if needed. Andrew.
Aug 01 2006
parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
I'm trying to understand why this 0 thing is such an issue.  If your 
second statement is valid, it makes the first moot - 0 or no 0.  Why 
does it matter, then?

-[Unknown]


 Another option will be to change char.init to 0 and forget about the problem
 left it as it is now.  Some good string implementation will
 contain encoding field in string instance if needed.
 
 Andrew.
 
 
 

Aug 01 2006
next sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
news:eapdsg$qeo$1 digitaldaemon.com...
 I'm trying to understand why this 0 thing is such an issue.  If your 
 second statement is valid, it makes the first moot - 0 or no 0.  Why does 
 it matter, then?

Declaration of char.init == 0 pretty much means that D has no strict requirement that char[] shall contain only UTF-8 encoded sequences; it may hold any other encoding suitable for the application. char.init == 0 will resolve the situation we see in Phobos now: char[] de facto is used for other than UTF-8 encodings. char.init == 0 tells everybody that char can also be used for representing Unicode *code points*, with the assumption that the offset value (mapping onto the full Unicode set, aka codepage) is stored somewhere in the application or is well known to it. char.init == 0 also highlights the fact that it is safe to use char[] with C string processing functions and to pass them to non-D modules and libraries. Is it UTF-8 encoded or not - does not matter - the type is universal enough. Andrew.
 -[Unknown]


 Another option will be to change char.init to 0 and forget about the 
 problem
 left it as it is now.  Some good string implementation will
 contain encoding field in string instance if needed.

 Andrew.

 


Aug 01 2006
next sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Andrew Fedoniouk wrote:
 "Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
 news:eapdsg$qeo$1 digitaldaemon.com...
 I'm trying to understand why this 0 thing is such an issue.  If your 
 second statement is valid, it makes the first moot - 0 or no 0.  Why does 
 it matter, then?

Declaration of char.init == 0 pretty much means that D has no strict requirement that char[] shall contain only UTF-8 encoded sequences but any other encodings suitable for the application.

Why is this good?
 char.init == 0 will resolve situation we see in Phobos now.
 char[] de facto is used for other than utf-8 encodings.

You mean data with other encodings that still want to use the std.string functions? I have written template versions that replace (almost) all std.string functions that do not rely on encoding.
 char.init == 0 tells everybody that char can also be used
 for representing unicode *code points* with assumption
 that offset value (mapping on full Unicode set, aka codepage) is stored
 somewhere in application or well known to it.

Maybe it would tell people that. A good thing it isn't so then. Again, why do you want to store non utf-8 data in a char[]? What is wrong with ubyte[] or a suitable typedef?
 char.init == 0 also highlights the fact that it is safe to
 use char[] as C string processing functions and passing them to non D 
 modules and libraries.
 Is it UTF-8 encoded or not - does not matter - type is universal enough.

I can't see how that would make it considerably safer. /Oskar
Aug 02 2006
prev sibling parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
I fail to understand why I want another ambiguous type in my 
programming.  I am glad that when I type "int", I know I have a number 
and not a pointer.

I am glad that when I type char, I again know what I have.  No 
guesswork.  Your proposals sound like shooting myself in the foot.

No fun.  I'll take that helmet you offered first.

-[Unknown]


 "Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
 news:eapdsg$qeo$1 digitaldaemon.com...
 I'm trying to understand why this 0 thing is such an issue.  If your 
 second statement is valid, it makes the first moot - 0 or no 0.  Why does 
 it matter, then?

Declaration of char.init == 0 pretty much means that D has no strict requirement that char[] shall contain only UTF-8 encoded sequences; it may hold any other encoding suitable for the application. char.init == 0 will resolve the situation we see in Phobos now: char[] de facto is used for other than UTF-8 encodings. char.init == 0 tells everybody that char can also be used for representing Unicode *code points*, with the assumption that the offset value (mapping onto the full Unicode set, aka codepage) is stored somewhere in the application or is well known to it. char.init == 0 also highlights the fact that it is safe to use char[] with C string processing functions and to pass them to non-D modules and libraries. Is it UTF-8 encoded or not - does not matter - the type is universal enough. Andrew.
 -[Unknown]


 Another option will be to change char.init to 0 and forget about the 
 problem
 left it as it is now.  Some good string implementation will
 contain encoding field in string instance if needed.

 Andrew.



Aug 02 2006
prev sibling parent reply Derek Parnell <derek nomail.afraid.org> writes:
On Tue, 01 Aug 2006 22:40:56 -0700, Unknown W. Brackets wrote:

 I'm trying to understand why this 0 thing is such an issue.  If your 
 second statement is valid, it makes the first moot - 0 or no 0.  Why 
 does it matter, then?

I think the issue is more that Andrew wants to have hex-FF as a legitimate byte value anywhere in a char[] variable. He misses the point that the purpose of not allowing it is so we can detect uninitialized UTF-8 strings at run-time. Andrew, just use ubyte[] variables and you won't have a problem, apart from conversions between code-pages and Unicode <G>. In D, ubyte[] is the data structure designed to hold variable length arrays of unsigned bytes, which is exactly what you need to implement the type of strings you have in KOI-8 encoding. -- Derek (skype: derek.j.parnell) Melbourne, Australia "Down with mediocrity!" 2/08/2006 4:24:27 PM
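A rough sketch of the detection being described, assuming current D and Phobos (`std.utf.validate`, `UTFException`). `isValidUtf8` is a hypothetical helper, and the half-filled buffer is a made-up bug for illustration.

```d
import std.utf : validate, UTFException;

/// True if the slice is well-formed UTF-8.
/// (Hypothetical helper wrapping std.utf.validate.)
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // Default initializers: deliberately not valid text.
    static assert(char.init == 0xFF);
    static assert(wchar.init == 0xFFFF);
    static assert(dchar.init == 0x0000FFFF);

    // Simulated bug: only half of the buffer is ever assigned.
    char[] buf = new char[4];
    buf[0] = 'o';
    buf[1] = 'k';

    // The leftover 0xFF bytes are never legal in UTF-8, so validation fails.
    assert(!isValidUtf8(buf));
    assert(isValidUtf8(buf[0 .. 2]));

    // ubyte[] makes no UTF-8 promise, so 0xFF is an ordinary value there.
    ubyte[] koi8 = [0xFF, 0xC1];
    assert(koi8[0] == 0xFF);
}
```

With a 0 initializer the half-filled buffer would have validated cleanly, since 0x00 is a legal UTF-8 code unit.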
Aug 01 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
 I think the issue is more that Andrew wants to have hex-FF as a legitimate
 byte value anywhere in a char[] variable. He misses the point that the
 purpose of not allowing it is so we can detect uninitialized UTF-8
 strings at run-time.

What does it mean, uninitialized? They *are* initialized. This is the main point. For any type you can declare an initial value. I bet you are not choosing non-existent values for, say, enums, but some really meaningful default values. Having strings filled with ff's means that you will get problems of a different kind: partially initialized strings. Could you tell me, did you ever have a situation when ffffff strings helped you to find a problem? And if yes, how is it in principle different from catching strings with 00000? Can anyone here say that these fffffffs helped to find a problem? Andrew.
Aug 02 2006
next sibling parent Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 Can anyone here say that this fffffffs helped to find
 problem?

Yes, I found two bugs in my own code with it that would have been hidden with the 0 initialization.
Aug 02 2006
prev sibling parent Derek Parnell <derek nomail.afraid.org> writes:
On Wed, 2 Aug 2006 00:08:42 -0700, Andrew Fedoniouk wrote:

 I think the issue is more that Andrew wants to have hex-FF as a legitimate
 byte value anywhere in a char[] variable. He misses the point that the
 purpose of not allowing it is so we can detect uninitialized UTF-8
 strings at run-time.

What does it mean uninitialized? They *are* initialized.

Andrew, I will assume you are not trying to be difficult but that maybe your English is a bit too literal. Of course in the clinical sense they are initialized because data is moved into them before your code has a chance to do anything. However, when I say "detected uninitialized UTF-8 strings" I mean "detect UTF-8 strings that have not been initialized by your own code". Is that better?
 This is the main point. For any types you can declare
 initial value. I bet you are choosing not non existent values
 for say enums but some really meaningful default values.

Huh??? Now you are being difficult. The purpose of enums is to have them initialized to values that make sense in their context. But the default values for enum generally work for me as the exact value doesn't really matter in most cases. enum AccountType { Savings, Investment, FixedLoan, Club, LineOfCredit } I really don't care what values the compiler assigns to these enums. Sure I could choose specific values but it doesn't really matter.
 having strings filled by ff's means that you will get problems
 of different kinds - partially initialized strings.

Huh???? Why would I always get partially initialized strings, as you imply? And even if I did, then having 0xFF in them is going to help me track down some stupid code that I wrote.
 Could you tell me do you ever had situation when
 ffffff strings helped you to find problem?

No. I haven't made that kind of mistake yet with my code.
 And if yes how it is in principle different from
 catching strings with 00000?

Because if I found a 0x00 in a string, I wouldn't know if its legitimate or not.
 Can anyone here say that this fffffffs helped to find
 problem?

But if I found 0xFF I would know straight away that I've made a mistake somewhere. Actually, come to think about it, I did make a mistake once when my code was incorrectly interpreting a BOM in a text file. I loaded the file as if it was UTF-8 but it should have been UTF-16. DMD correctly told me I had a bad UTF string when I tried to write it out. -- Derek (skype: derek.j.parnell) Melbourne, Australia "Down with mediocrity!" 2/08/2006 5:49:46 PM
Aug 02 2006
prev sibling parent reply Derek Parnell <derek nomail.afraid.org> writes:
On Wed, 02 Aug 2006 16:22:54 +1200, Regan Heath wrote:

  char  ==> An unsigned 8-bit byte. An alias for ubyte.
  schar ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  char[] ==> A 'C' string
  schar[] ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

 And then have built-in conversions between the UTF encodings. So if  
 people
 want to continue to use code from C/C++ that uses code-pages or similar
 they can stick with char[].

Yes, Derek, this will be probably near the ideal.

Yet, I don't find it at all difficult to think of them like so:

  ubyte ==> An unsigned 8-bit byte.
  char  ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  ubyte[] ==> A 'C' string
  char[]  ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

Me too, but that's probably because I've not been immersed in C/C++ for the last 20 odd years ;-) I "think in D" now and char[] is a UTF-8 string in my mind.
 If you want to program in D you _will_ have to readjust your thinking in  
 some areas, this is one of them.
 All you have to realise is that 'char' in D is not the same as 'char' in C.

True, but Walter seems hell bent on easing the transition to D for C/C++ refugees.
 In quick and dirty ASCII only applications I can adjust my thinking  
 further:
 
    char   ==> An ASCII character
    char[] ==> An ASCII string
 
 I do however agree that C functions used in D should be declared like:
    int strlen(ubyte* s);
 
 and not like (as they currently are):
    int strlen(char* s);
 
 The problem with this is that the code:
    char[] s = "test";
    strlen(s)
 
 would produce a compile error, and require a cast or a conversion function  
 (toMBSz perhaps, which in many cases will not need to do anything).
 
 Of course the purists would say "That's perfectly correct, strlen cannot  
 tell you the length of a UTF-8 string, only it's byte count", but at the  
 same time it would be nice (for quick and dirty ASCII only programs) if it  
 worked.

And I'm a wannabe purist <G>
 Is it possible to declare them like this?
    int strlen(void* s);
 
 and for char[] to be implicitly 'paintable' as void* as char[] is already  
 implicitly 'paintable' as void[]?
 
 It seems like it would nicely solve the problem of people seeing:
    int strlen(char* s);
 
 and thinking D's char is the same as C's char without introducing a  
 painful need for cast or conversion in simple ASCII only situations.

It's the zero-terminator for C strings that will get in the way. We need a nice way of getting the compiler to ensure C-strings are always terminated correctly. -- Derek (skype: derek.j.parnell) Melbourne, Australia "Down with mediocrity!" 2/08/2006 2:48:43 PM
Aug 01 2006
parent "Regan Heath" <regan netwin.co.nz> writes:
On Wed, 2 Aug 2006 14:55:11 +1000, Derek Parnell <derek nomail.afraid.org>  
wrote:
 It's the zero-terminator for C strings that will get in the way. We need a
 nice way of getting the compiler to ensure C-strings are always  
 terminated
 correctly.

Good point. I neglected to mention that. Regan
Aug 01 2006
prev sibling next sibling parent kris <foo bar.com> writes:
Derek Parnell wrote:
[snip]
   char  ==> An unsigned 8-bit byte. An alias for ubyte.
   schar ==> A UTF-8 code unit.
   wchar ==> A UTF-16 code unit.
   dchar ==> A UTF-32 code unit.
 
   char[] ==> A 'C' string 
   schar[] ==> A UTF-8 string
   wchar[] ==> A UTF-16 string
   dchar[] ==> A UTF-32 string

Sure, although char, utf8, utf16, utf32 are much better choices, IMHO :) I'd be game to have them changed at this stage. It's not much more than some (extensive) global replacements. Don't think there's much need to check each instance. There's a nice shareware tool called "Active Search & Replace" which I've recently found to be very helpful in this regard.
Aug 01 2006
prev sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
 On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
 
 (Hope this long dialog will help all of us to better understand what UNICODE 
 is)

 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Compiler accepts input stream as either BMP codes or full unicode set

BMP is a subset of UTF-16.

UTF-16 is a variable-length encoding - byte stream. Unicode BMP is a range of numbers, strictly speaking.

Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.

If UCS-2 is not a subset of UTF-16, what UCS-2 sequences are not valid UTF-16?
Aug 02 2006
parent reply Derek Parnell <derek nomail.afraid.org> writes:
On Wed, 02 Aug 2006 00:11:26 -0700, Walter Bright wrote:

 Derek Parnell wrote:
 On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
 
 (Hope this long dialog will help all of us to better understand what UNICODE 
 is)

 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Compiler accepts input stream as either BMP codes or full unicode set

BMP is a subset of UTF-16.

UTF-16 is a variable-length encoding - byte stream. Unicode BMP is a range of numbers, strictly speaking.

Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.

If UCS-2 is not a subset of UTF-16, what UCS-2 sequences are not valid UTF-16?

Huh??? I said "UCS-2 is a subset of Unicode characters" Did you miss that? UTF-16 is not a subset as it can be used to encode every Unicode code point. UCS-2 is a subset as it can *not* encode code points that are outside of the "basic multilingual plane" (aka BMP). -- Derek (skype: derek.j.parnell) Melbourne, Australia "Down with mediocrity!" 2/08/2006 5:43:18 PM
Aug 02 2006
parent Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
 On Wed, 02 Aug 2006 00:11:26 -0700, Walter Bright wrote:
 Derek Parnell wrote:
 On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Compiler accepts input stream as either BMP codes or full unicode set

BMP is a subset of UTF-16.

UTF-16 is a variable-length encoding - byte stream. Unicode BMP is a range of numbers strictly speaking.

Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.

If UCS-2 is not a subset of UTF-16, what UCS-2 sequences are not valid UTF-16?

Huh??? I said "UCS-2 is a subset of Unicode characters" Did you miss that?

I saw it, but that statement is not the same as "UCS-2 is a subset of UTF-16". The issue I was talking about is "BMP [UCS-2] is a subset of UTF-16", which Andrew keeps replying "it is not". You said "Andrew is correct", so I inferred you were agreeing that UCS-2 is not a subset of UTF-16.
 UTF-16 is not a subset as it can be used to encode every Unicode code
 point. UCS-2 is a subset as it can *not* encode code points that are
 outside of the "basic multilingual plane" (aka BMP). 

I think you and I are in agreement.
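
The point both posters converge on can be sketched in C. This is only an illustration, not code from anyone in the thread: utf16_encode is a hypothetical helper name. For any code point inside the BMP the UTF-16 output is a single 16-bit unit whose value equals the code point, which is exactly what UCS-2 would store; only code points above U+FFFF need two units.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
   Returns the number of 16-bit code units written (1 or 2),
   or 0 if the value is not an encodable code point. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                      /* surrogate values are not characters */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;         /* BMP: same 16-bit value UCS-2 would use */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;                      /* beyond the Unicode range */
    cp -= 0x10000;                     /* 20 bits left to split across two units */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low (trail) surrogate */
    return 2;
}
```

Calling it with 0x20AC (the euro sign) yields one unit, 0x20AC; calling it with 0x1D11E yields the pair D834 DD1E mentioned elsewhere in this thread.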
Aug 02 2006
prev sibling next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 BMP is a subset of UTF-16.

Walter, with deepest respect, but it is not. Two different things. UTF-16 is a variable-length encoding - byte stream. Unicode BMP is a range of numbers strictly speaking. If you will treat utf-16 sequence as a sequence of UCS-2 (BMP) codes you are in trouble. See: Sequence of two words D834 DD1E as UTF-16 will give you one unicode character with code 0x1D11E ( musical G clef ). And the same sequence interpreted as UCS-2 sequence will give you two (invalid, non-printable but still) character codes. You will get different length of the string at least.

The only thing that UTF-16 adds are semantics for characters that are invalid for BMP. That makes UTF-16 a superset. It doesn't matter if you're strictly speaking, or if the jargon is different. UTF-16 is a superset of BMP, once you cut past the jargon and look at the underlying reality.
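
For readers following the D834 DD1E example, here is a minimal C sketch of how a UTF-16 reader combines a surrogate pair, and why the character count differs from the code-unit count. The names utf16_decode and utf16_count are made up for this illustration; they are not part of Phobos or any C library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from a UTF-16 code-unit array.
   Returns the number of code units consumed (1 or 2),
   or 0 on empty input or an unpaired surrogate. */
size_t utf16_decode(const uint16_t *s, size_t len, uint32_t *out)
{
    if (len == 0)
        return 0;
    uint16_t u = s[0];
    if (u < 0xD800 || u > 0xDFFF) {    /* ordinary BMP unit, same as UCS-2 */
        *out = u;
        return 1;
    }
    if (u <= 0xDBFF && len >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        /* lead + trail surrogate: 10 bits from each, offset by 0x10000 */
        *out = 0x10000u + (((uint32_t)(u - 0xD800) << 10)
                           | (uint32_t)(s[1] - 0xDC00));
        return 2;
    }
    return 0;                          /* lone or misordered surrogate */
}

/* Hypothetical helper: count code points in a UTF-16 sequence.
   Returns (size_t)-1 if the sequence is malformed. */
size_t utf16_count(const uint16_t *s, size_t len)
{
    size_t n = 0, i = 0;
    while (i < len) {
        uint32_t cp;
        size_t used = utf16_decode(s + i, len - i, &cp);
        if (used == 0)
            return (size_t)-1;
        i += used;
        n++;
    }
    return n;
}
```

Fed the two units D834 DD1E, utf16_decode consumes both and produces 0x1D11E, so utf16_count reports one character where a UCS-2 reader would report two. For a sequence with no surrogates the two readings agree unit for unit.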
 Ok. And how do you call A functions?


You mean here?:

    char* namez = toMBSz(name);
    h = CreateFileA(namez,GENERIC_WRITE,0,null,CREATE_ALWAYS,
        FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,cast(HANDLE)null);

char* here is far from UTF-8 sequence.

You could argue that for clarity namez should have been written as a ubyte*, but in the above code it would make no difference.
 Windows, Java, and Javascript have all had to go back and redo to deal 
 with surrogate pairs.

String.charAt() returns guess what? Correct - String object. No char - no problem :D


ECMA-262 String.prototype.charCodeAt (pos) Returns a number (a nonnegative integer less than 2^16) representing the code point value of the character at position pos in the string.... As you may see it is returning (unicode) *code point* from BMP set but it is far from UTF-16 code unit you've declared above.

There is no difference.
 Relaxing "a nonnegative integer less than 2^16" to
 "a nonnegative integer less than 2^21" will not harm anybody.
 Or at least such probability is vanishingly small.

It'll break any code trying to deal with surrogate pairs.
 Again - let people decide of what char is and how to interpret it And 
 that will be it.

problems, bad and unfixable enough that there are official proposals to add new UTF basic types to C++.

Basic types of what?

Basic types for utf-8 and utf-16. Ironically, they wind up being very much like D's char and wchar types.
 Phobos can work with utf-8/16 and satisfy you and other UTF-masochists 
 (no offence implied).

with UTF-8. It's not just my experience, it's why new types were proposed for C++ (and not by me).


Standard functions in the C standard library to deal with multibyte encodings have been there since 1989. Compiler extensions to deal with shift-JIS and other multibyte encodings have been there since the mid 80's. They don't work very well, but nevertheless, are there and supported.
 At least std::string is strictly single byte thing by definition. And this
 is perfectly fine.

As long as you're dealing with ASCII only <g>. That world has been left behind, though.
 There is wchar_t for holding OS supported range in full.
 On Win32 - wchar_t is 16bit (UCS-2 legacy) and in GCC/*nix it is 32bit.

That's just the trouble with wchar_t. It's implementation defined, which means its use is non-portable. The Win32 version cannot handle surrogate pairs as a single character. Linux has the opposite problem - you can't have UTF-16 strings in any non-kludgy way. Trying to write internationalized code with wchar_t that works correctly on both Win32 and Linux is an exercise in frustration. What you wind up doing is abstracting away the char type - giving up on help from the standard libraries and writing your own text processing code from scratch. I've been through this with real projects. It doesn't work just fine, and is a lot of extra work. Translating the code to D is nice, you essentially give that whole mess a heave-ho. BTW, you talked earlier a lot about memory efficiency. Linux's 32 bit wchar_t eats memory like nothing else.
 Thus the whole set of Windows API headers (and std.c.string for example)
 seen in D has to be rewritten to accept ubyte[]. As char in D is not char in 
 C

You're right that a C char isn't a D char. All that means is one must be careful when calling C functions that take char*'s to pass it data in the form that particular C function expects. This is true for all C's data types - even int.
 Is this the idea?

The vast majority (perhaps even all) of C standard string handling functions that accept char* will work with UTF-8 without modification. No rewrite required. You've implied all this doesn't work, by saying things must be rewritten, that it's extremely difficult to deal with UTF-8, that BMP is not a subset of UTF-16, etc. This is not my experience at all. If you've got some persuasive code examples, I'd like to see them.
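
As a small illustration of the claim about the C string functions, here is a C sketch. It assumes the input is valid UTF-8; utf8_codepoints is a hypothetical name, not a standard function. Because UTF-8 never uses a zero byte inside a multi-byte sequence, and no character's encoding can appear inside another's, strlen, strchr and strstr work on the bytes unmodified. What they cannot tell you is the number of characters, which needs the extra loop below.

```c
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string
   by skipping continuation bytes (bit pattern 10xxxxxx).
   Assumes the input is already valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;                 /* lead byte or ASCII byte starts a character */
    }
    return n;
}
```

With the string "caf\xC3\xA9 au lait" (the é is two bytes), strlen returns 13 bytes, utf8_codepoints returns 12 characters, and strstr(s, "\xC3\xA9") points at byte offset 3.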
Aug 01 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
 As you may see it is returning (unicode) *code point* from BMP set
 but it is far from UTF-16 code unit you've declared above.

There is no difference.
 Relaxing "a nonnegative integer less than 2^16" to
 "a nonnegative integer less than 2^21" will not harm anybody.
 Or at least such probability is vanishingly small.

It'll break any code trying to deal with surrogate pairs.

There is no such thing as surrogate pair in UCS-2. JS string is not holding UTF-16 code units - only full code points. See spec.
 Phobos can work with utf-8/16 and satisfy you and other UTF-masochists


well with UTF-8. It's not just my experience, it's why new types were proposed for C++ (and not by me).


Standard functions in the C standard library to deal with multibyte encodings have been there since 1989. Compiler extensions to deal with shift-JIS and other multibyte encodings have been there since the mid 80's. They don't work very well, but nevertheless, are there and supported.
 At least std::string is strictly single byte thing by definition. And 
 this
 is perfectly fine.

As long as you're dealing with ASCII only <g>. That world has been left behind, though.

C string functions can be used with multibyte encodings for one sole reason: all byte encodings have the char with code 0 defined as the NUL character. In all encodings in practical use, no byte with code 0 appears in the middle of a sequence. They are all built with C string processing in mind.
 There is wchar_t for holding OS supported range in full.
 On Win32 - wchar_t is 16bit (UCS-2 legacy) and in GCC/*nix it is 32bit.

That's just the trouble with wchar_t. It's implementation defined, which means its use is non-portable. The Win32 version cannot handle surrogate pairs as a single character. Linux has the opposite problem - you can't have UTF-16 strings in any non-kludgy way. Trying to write internationalized code with wchar_t that works correctly on both Win32 and Linux is an exercise in frustration. What you wind up doing is abstracting away the char type - giving up on help from the standard libraries and writing your own text processing code from scratch. I've been through this with real projects. It doesn't work just fine, and is a lot of extra work. Translating the code to D is nice, you essentially give that whole mess a heave-ho. BTW, you talked earlier a lot about memory efficiency. Linux's 32 bit wchar_t eats memory like nothing else.

Agree. As I said - if you need efficiency use byte/word encodings + mapping. dchar is no better than wchar_t/linux. Please don't say that I shall use UTF-8 for that - simply does not work in my cases - too expensive.
 Thus the whole set of Windows API headers (and std.c.string for example)
 seen in D has to be rewritten to accept ubyte[]. As char in D is not char 
 in C

You're right that a C char isn't a D char. All that means is one must be careful when calling C functions that take char*'s to pass it data in the form that particular C function expects. This is true for all C's data types - even int.
 Is this the idea?

The vast majority (perhaps even all) of C standard string handling functions that accept char* will work with UTF-8 without modification. No rewrite required.

Correct. As I said because of 0 is NUL in UTF-8 too. Not 0xFF or anything else exotic.
 You've implied all this doesn't work, by saying things must be rewritten, 
 that it's extremely difficult to deal with UTF-8, that BMP is not a subset 
 of UTF-16, etc. This is not my experience at all. If you've got some 
 persuasive code examples, I'd like to see them.

I am not saying that "must be rewritten". Sorry but this is you who propose to rewrite all string processing functions of standard library mankind has for today. Or I don't quite understand your idea with UTFs. Java did change string world by introducing just char (single UCS-2 code point) And no variations. Good it or bad? From uniformity point of view - good. For efficiency - bad. I've seen a lot of reinvented char as byte wheels in professional packages. Andrew.
Aug 01 2006
parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
Andrew, I think there's a misunderstanding here.  Perhaps it's a 
language thing.

Let me define two things for you, in English, by my understanding of 
them.  I was born in Utah and raised in Los Angeles as a native speaker, 
so hopefully these definitions aren't far from the standard understanding.

Default: a setting, value, or situation which persists unless action is 
taken otherwise; such a thing that happens unless overridden or canceled.

Null: something which has no current setting, value, or situation (but 
could have one); the absence of a setting, value, or situation.

Therefore, I should conclude that "default" and "null" are very 
different concepts.

The fact that C strings are null terminated, and that encodings provide 
for a "null" character (or code point or muffin or whatever they care to 
call them) does not logically necessitate that this provides for a 
default, or logically default, value.

It is true that, as the above definitions, it would not be wrong for the 
default to be null.  That would fit the definitions above perfectly. 
However, so would a value of ' ' (which might be the default in some 
language out there.)

It would seem logical that 0 could be used as the default, but then as 
Walter pointed out... this can (and tends to) hide bugs which will bite 
you eventually.

Let us suppose you were to have a string displayed in a place.  It is 
possible, were it blank, that you might not notice it.  Next let us 
suppose this space were filled with "?", "`", "ﮘ", or "ß" characters.

Do you think you would be more, or less likely to notice it?

Next, let us suppose that this character could be (in cases) detectable 
as invalid.  Again note that 0 is not invalid, and may appear in 
strings.  This sounds even better.

So a default value of 0 does not, from an implementation or practical 
point of view, seem to make much sense to me.  In fact, I think a 
default value of "42" for int makes sense (surely it reminds you of what 
six by nine is.)

But maybe that's because I never leave things at their defaults.  It's 
like writing a story where you expect the reader to think everyone has 
brown eyes unless you say otherwise.

-[Unknown]


 Correct. As I said because of 0 is NUL in UTF-8 too. Not
 0xFF or anything else exotic.

Aug 01 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
 But maybe that's because I never leave things at their defaults.  It's
 like writing a story where you expect the reader to think everyone has 
 brown eyes unless you say otherwise.

Consider this:

    char[6] buf;
    strncpy(buf, "1234567", 5);

What will be the content of your buffer? Answer is: 12345\xff . Surprise? It is.

In modern D a reliable implementation of this shall be:

    char[6] buf; // memset(buf,0xFF,6); under the hood.
    uint n = strncpy(buf, "1234567", 5);
    buf[n] = 0;

if you are going to use this with non D modules.

Needless to say that this is a bit redundant. If D in any case initializes that memory, why do you need this uint n and buf[n] = 0; ?

Don't tell me please that this is because you spent your childhood in boyscout camps and got some high principles. Let's put those matters aside - it is purely technical discussion.

Andrew.
Aug 01 2006
next sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Andrew Fedoniouk wrote:
 But maybe that's because I never leave things at their defaults.  It's
 like writing a story where you expect the reader to think everyone has 
 brown eyes unless you say otherwise.

Consider this:

    char[6] buf;
    strncpy(buf, "1234567", 5);

What will be the content of your buffer? Answer is: 12345\xff . Surprise? It is.

Not really surprising. Had you compiled this in a C program (you are using C functions after all), you would have gotten: 12345\x?? <- some garbage. Not a zero terminated string.

My manual for strncpy explicitly states: "if there is no null byte among the first n bytes of src, the result will not be null-terminated."

/Oskar
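
The behaviour described in the manual can be reproduced in plain C. The sketch below is only illustrative; prefill_and_strncpy and copy_terminated are hypothetical names. The first function fills the buffer with 0xFF to mimic D's char default and then calls strncpy as in the post; the second shows one way to get a terminator regardless of what the buffer held before.

```c
#include <string.h>

/* Hypothetical demo: mimic D's 0xFF prefill of char[6], then call
   strncpy exactly as in the post. Five bytes are copied and no
   terminator is written, so buf[5] keeps whatever it held (0xFF here). */
void prefill_and_strncpy(char buf[6])
{
    memset(buf, 0xFF, 6);
    strncpy(buf, "1234567", 5);
}

/* Hypothetical helper: copy at most cap-1 bytes of src into dst and
   always write a terminating NUL. Returns the number of bytes copied,
   not counting the terminator. */
size_t copy_terminated(char *dst, size_t cap, const char *src)
{
    if (cap == 0)
        return 0;
    size_t n = strlen(src);
    if (n > cap - 1)
        n = cap - 1;             /* leave room for the terminator */
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

After prefill_and_strncpy the sixth byte is still 0xFF; in a C program with an uninitialized buffer it would be arbitrary. Either way the caller has to terminate explicitly, which copy_terminated does.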
Aug 02 2006
prev sibling next sibling parent Derek Parnell <derek nomail.afraid.org> writes:
On Tue, 1 Aug 2006 23:45:26 -0700, Andrew Fedoniouk wrote:

 But maybe that's because I never leave things at their defaults.  It's
 like writing a story where you expect the reader to think everyone has 
 brown eyes unless you say otherwise.

Consider this:

    char[6] buf;
    strncpy(buf, "1234567", 5);

What will be the content of your buffer? Answer is: 12345\xff . Surprise? It is.

No, not surprised, just wondering why you didn't code it correctly though. If you insist on using C functions then it should be coded ...

    extern(C) uint strncpy(ubyte *, ubyte *, uint );
    ubyte[6] buf;
    strncpy(buf.ptr, cast(ubyte*)"1234567", 5);
 In modern D reliable implementation of this shall be as:
 
 char[6] buf; // memset(buf,0xFF,6); under the hood.
 uint n = strncpy(buf, "1234567", 5);
 buf[n] = 0;

Well that is debatable. I'd do it more like ...

    char[6] buf; // An array of UTF-8 code units.
    uint n = strncpy(buf, "1234567", 5); // Replace the first 5 code-units.
    buf[n..$] = 0; // Set remaining code-units to zero.
 if you are going to use this with non D modules.
 
 Needless to say that this is a bit redundant.
 
 If D in any case initializes that memory why you need
 this uint n and buf[n] = 0; ?
 
 Don't tell me please that this is because your spent
 your childhood in boyscout camps and got some high principles.
 Lets' put aside that matters - it is purely technical discussion.

Exactly. And technically you should be using ubyte[] and not char[].

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 4:57:15 PM
Aug 02 2006
prev sibling parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
Why would I ever use strncat() in a D program?

Consider this: if you do not wear a helmet while riding a motorcycle 
(read: I don't like helmets) you could break your head and die.  Guess 
what?  I don't ride motorcycles.  Problem solved.

I don't like null terminated strings.  I think they are the root of much 
evil.  Describing why having 0 as a default benefits null terminated 
strings is like describing how having less police help burglars to me. 
Obviously I'm being over-dramatic, but I remain unconvinced...

Also I did spend (some of) my childhood in Boy Scout camps and I did 
learn many principles (none of which related to programming in the 
slightest.)  I mean that literally.  But you're right, that's beside the 
point.

-[Unknown]


 But maybe that's because I never leave things at their defaults.  It's
 like writing a story where you expect the reader to think everyone has 
 brown eyes unless you say otherwise.

Consider this:

    char[6] buf;
    strncpy(buf, "1234567", 5);

What will be the content of your buffer? Answer is: 12345\xff . Surprise? It is.

In modern D a reliable implementation of this shall be:

    char[6] buf; // memset(buf,0xFF,6); under the hood.
    uint n = strncpy(buf, "1234567", 5);
    buf[n] = 0;

if you are going to use this with non D modules.

Needless to say that this is a bit redundant. If D in any case initializes that memory, why do you need this uint n and buf[n] = 0; ?

Don't tell me please that this is because you spent your childhood in boyscout camps and got some high principles. Let's put those matters aside - it is purely technical discussion.

Andrew.

Aug 02 2006
parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
Correction: strncpy().  They're all evil.

-[Unknown]


 Why would I ever use strncat() in a D program?
 
 Consider this: if you do not wear a helmet while riding a motorcycle 
 (read: I don't like helmets) you could break your head and die.  Guess 
 what?  I don't ride motorcycles.  Problem solved.
 
 I don't like null terminated strings.  I think they are the root of much 
 evil.  Describing why having 0 as a default benefits null terminated 
 strings is like describing how having less police help burglars to me. 
 Obviously I'm being over-dramatic, but I remain unconvinced...
 
 Also I did spend (some of) my childhood in Boy Scout camps and I did 
 learn many principles (none of which related to programming in the 
 slightest.)  I mean that literally.  But you're right, that's beside the 
 point.
 
 -[Unknown]
 
 
 But maybe that's because I never leave things at their defaults.  It's
 like writing a story where you expect the reader to think everyone 
 has brown eyes unless you say otherwise.

Consider this:

    char[6] buf;
    strncpy(buf, "1234567", 5);

What will be the content of your buffer? Answer is: 12345\xff . Surprise? It is.

In modern D a reliable implementation of this shall be:

    char[6] buf; // memset(buf,0xFF,6); under the hood.
    uint n = strncpy(buf, "1234567", 5);
    buf[n] = 0;

if you are going to use this with non D modules.

Needless to say that this is a bit redundant. If D in any case initializes that memory, why do you need this uint n and buf[n] = 0; ?

Don't tell me please that this is because you spent your childhood in boyscout camps and got some high principles. Let's put those matters aside - it is purely technical discussion.

Andrew.


Aug 02 2006
prev sibling next sibling parent kris <foo bar.com> writes:
Andrew Fedoniouk wrote:
 (Hope this long dialog will help all of us to better understand what UNICODE 
 is)

Actually, it doesn't help at all, Andrew ~ some of it is thoroughly misguided, and some is "cleverly" slanted purely for the benefit of the author. In truth, this thread would be the last place one would look to learn from an entirely unbiased opinion; one with only the reader's education in mind. There are infinitely more useful places to go for that sort of thing.

For those who have an interest, this tiny selection may help:

http://icu.sourceforge.net/docs/papers/forms_of_unicode/
http://www.hackcraft.net/xmlUnicode/
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.unicode.org/unicode/faq/utf_bom.html
http://en.wikipedia.org/wiki/UTF-8
http://www.joelonsoftware.com/articles/Unicode.html
Aug 01 2006
prev sibling parent Bruno Medeiros <brunodomedeirosATgmail SPAM.com> writes:
Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Compiler accepts input stream as either BMP codes or full unicode set

BMP is a subset of UTF-16.

Walter, with deepest respect, but it is not. Two different things. UTF-16 is a variable-length encoding - byte stream. Unicode BMP is a range of numbers strictly speaking. If you will treat utf-16 sequence as a sequence of UCS-2 (BMP) codes you are in trouble. See:

Uh, the statement "BMP is a subset of UTF-16" means that you can read a BMP sequence as a UTF-16 sequence, not the opposite as you said: "If you will treat utf-16 sequence as a sequence of UCS-2 (BMP)".
 Ordinary people will do their own strings anyway. Just give them opAssign 
 and dtor in structs and you will see explosion of perfect strings. That 
 char#[] (read-only arrays) will also benefit here. oh.....

 Changing char init value to 0 will not harm anybody but will allow to use 
 char for other than

 utf-8 purposes - it is only one from 40 in active use encodings anyway.

 For persistence purposes (in compiled EXE) utf is the best choice 
 probably. But in runtime - please not on language level.

it's there for.

Thus the whole set of Windows API headers (and std.c.string for example) seen in D has to be rewritten to accept ubyte[]. As char in D is not char in C. Is this the idea?

Andrew.

Just a note, not to ubyte[] but to ubyte* .

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Aug 03 2006