digitalmars.D - Casting between char[]/wchar[]/dchar[]

reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
What are the rules for implicit/explicit casting between char[] and 
wchar[] and dchar[] ?

When one casts (explicitly or implicitly) does the compiler 
automatically invoke std.utf.toUTF*()?

Here's an idea that should simplify much of string handling in D:
allow char[] and wchar[] and dchar[] to be castable implicitly to each 
other, provided that the compiler invokes the appropriate std.utf.toUTF* 
method.
I think this is perfectly safe; no data is lost, and string handling can 
become much more flexible.

Instead of writing three versions of the same function, one for each of 
char[], wchar[] and dchar[], one can just write a wchar[] version (for 
example) and the compiler will handle the conversion from/to char[] and 
dchar[].

This also relieves developers from writing templatized functions/classes 
when they deal with strings.

Thoughts?
Aug 04 2006
parent reply kris <foo bar.com> writes:
This one was beaten soundly around the head & shoulders in the past :)

In a systems language like D, one could argue that hidden conversions 
and/or translations (a) can mask what would otherwise be unintended 
compile-time errors (b) can be terribly detrimental to performance where 
multiple conversions are implicitly applied. Such an environment could 
potentially put CoW to shame in terms of heap abuse -- recall some of the 
recent CoW examples, and sprinkle in a few unintended conversions for good 
measure :)

IIRC, the last time this came up there was a pretty strong feeling that 
such things should be explicit (partly because it can be an expensive 
operation ~ likely sucking on the heap also). Although foreach() will 
convert on the fly, that's perhaps not something one should do with 
extensive chunks of text?

One approach would be to make the Unicode converters more attractive for 
daily use. There are libraries other than Phobos which attempt to do just 
that.

On the other hand, if you're writing some kind of platform where 
convenience is more important than, say, performance, being able to /add/ 
the implicit conversion might be of real value. One might, for example, 
implement such a platform using a String class to abstract the encoding 
differences. Functions could accept said String rather than one of the 
three stooges^H^H^H^H^H^H^H Unicode types.

If I recall correctly, I think Regan was quite keen on implicit Unicode 
conversions (during function calls also), so a google on the subject along 
with his name might get you to the prior threads?

Either way, having the compiler tell you at compile time when you're 
mixing metaphors is a-good-thing (tm). Being able to 'extend' the language 
(via classes or whatever) to implement higher level abstractions such as 
String is also a-good-thing. Having both provides for differing uses of D 
without stepping on toes, or hitting said appendages with a hammer

- Kris
Aug 04 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
kris wrote:
 This one was beaten soundly around the head & shoulders in the past :)
 
 In a systems language like D, one could argue that hidden conversions 
 and/or translations (a) can mask what would otherwise be unintended 
 compile-time errors (b) can be terribly detrimental to performance where 
 multiple conversions are implicitly applied. Such an environment could 
 potentially put CoW to shame in terms of heap abuse -- recall some of the 
 recent CoW examples, and sprinkle in a few unintended conversions for good 
 measure :)
 
 IIRC, the last time this came up there was a pretty strong feeling that 
 such things should be explicit (partly because it can be an expensive 
 operation ~ likely sucking on the heap also).

Yes. It's hard to judge where the line is, but too many implicit conversions lead to programs that are very hard to understand and debug.
 Although foreach() will 
 convert on the fly, that's perhaps not something one should do with 
 extensive chunks of text?

foreach also doesn't consume memory for the conversion.
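
The foreach behaviour mentioned here can be sketched roughly as follows. This is a minimal illustration, assuming a current D compiler (hence `string` rather than the mutable `char[]` literals used elsewhere in this thread); `countCodePoints` is a hypothetical helper name, not something from Phobos.

```d
import std.stdio : writeln;

// Hypothetical helper: counts code points by letting foreach decode
// the UTF-8 input one dchar at a time. No dchar[] copy of the text
// is built along the way.
size_t countCodePoints(const(char)[] text)
{
    size_t n = 0;
    foreach (dchar c; text)   // asking for dchar triggers on-the-fly decoding
        ++n;
    return n;
}

void main()
{
    string s = "héllo";
    writeln(s.length);            // number of UTF-8 code units
    writeln(countCodePoints(s));  // number of decoded code points
}
```

The decoding work is still done on every pass, so this trades repeated CPU cost for the absence of a temporary array.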
Aug 05 2006
parent reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
Walter Bright wrote:
 Yes. It's hard to judge where the line is, but too many implicit 
 conversions lead to programs that are very hard to understand and debug.

Can I ask you at least to simplify the conversion by adding properties utf* 
to char/wchar/dchar arrays?

so, if I have:
----
char[] process( char[] str ) { ... }

...

dchar[] my32str = .....;

//I can write
my32str = process( my32str.utf8 ).utf32;

//instead of
//my32str = toUTF32( process( toUTF8( my32str ) ) );
----
Aug 05 2006
next sibling parent reply kris <foo bar.com> writes:
Hasan Aljudy wrote:
 
 
 Can I ask you at least to simplify the conversion by adding properties utf* 
 to char/wchar/dchar arrays?
 
 so, if I have:

er, you can do that yourself, Hasan?

char[] utf8 (dchar[] s)
{
  ...
}

dchar[] utf32 (char[] s)
{
  ...
}

etc, followed by:
 char[] process( char[] str ) { ... }

 ...

 dchar[] my32str = .....;

 //I can write
 my32str = process( my32str.utf8 ).utf32;

 //instead of
 //my32str = toUTF32( process( toUTF8( my32str ) ) );

However, this is sucking on the heap, since you're not providing anywhere 
for the conversion to occur. Hence it is expensive (heap allocation is 
several times slower than a 'typical' utf conversion, and there's 
potential lock-contention to deal with also). This is partly why there was 
some pushback against such properties in the past; especially when you can 
add them yourself using the funky array-prop syntax (demonstrated above).

There's nothing wrong with convenience props and so on, but if the ones 
built-in to the compiler are expensive to use, D will inevitably get a 
reputation for being slow and/or heap-bound; just like Java did ~ deserved 
or otherwise. D currently offers a number of alternatives anyway.

Again, why not use a String aggregate instead? To hide/abstract the 
distinction between Unicode types? I suspect that would be both more 
efficient and more convenient? Having written just such a class, I can 
attest to these attributes.
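
The point about giving the conversion somewhere to happen can be sketched like this. It is only an illustration under stated assumptions: `toUTF8Into` is a hypothetical helper, not part of std.utf, and the code targets a current D compiler. The only library call used is `std.utf.encode`, which writes one code point into a `char[4]` and returns the number of code units written.

```d
import std.utf : encode;

// Hypothetical helper (not in std.utf): encodes UTF-32 text as UTF-8
// into a caller-supplied buffer and returns the filled slice.
// Returns null if the buffer is too small.
char[] toUTF8Into(char[] dst, const(dchar)[] src)
{
    size_t used = 0;
    foreach (dchar c; src)
    {
        char[4] tmp;                 // one code point needs at most 4 UTF-8 units
        size_t n = encode(tmp, c);   // throws on an invalid code point
        if (used + n > dst.length)
            return null;             // caller's buffer cannot hold the result
        dst[used .. used + n] = tmp[0 .. n];
        used += n;
    }
    return dst[0 .. used];
}

void main()
{
    char[64] stackBuf;               // stack storage, so no heap allocation here
    dchar[] text = "héllo"d.dup;
    char[] utf8 = toUTF8Into(stackBuf[], text);
    assert(utf8 == "héllo");
}
```

The caller decides where the result lives (stack, a reused scratch buffer, or the heap), which is the control that an implicit or property-style conversion would take away.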
Aug 05 2006
next sibling parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"kris" <foo bar.com> wrote in message news:eb322c$sml$1 digitaldaemon.com...

 er, you can do that yourself, Hasan?

 char[] utf8 (dchar[] s)
 {
   ...
 }

 dchar[] utf32 (char[] s)
 {
   ...
 }

lol :)
Aug 05 2006
prev sibling parent reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
kris wrote:
 er, you can do that yourself, Hasan?
 
 char[] utf8 (dchar[] s)
 {
   ...
 }
 
 dchar[] utf32 (char[] s)
 {
   ...
 }
 
 etc, followed by:
 
 > char[] process( char[] str ) { ... }
 >
 > ...
 >
 > dchar[] my32str = .....;
 >
 > //I can write
 > my32str = process( my32str.utf8 ).utf32;
 >
 > //instead of
 > //my32str = toUTF32( process( toUTF8( my32str ) ) );

I know, but
1: The syntax is still not documented..
2: I'm talking about making these properties a part of the standard.

actually, I think:

alias toUTF8 utf8;
alias toUTF16 utf16;
alias toUTF32 utf32;

would do the trick.
 
 However, this is sucking on the heap, since you're not providing 
 anywhere for the conversion to occur. Hence it is expensive (heap 
 allocation is several times slower than a 'typical' utf conversion, and 
 there's potential lock-contention to deal with also). This is partly why 
 there was some pushback against such properties in the past; especially 
 when you can add them yourself using the funky array-prop syntax 
 (demonstrated above).
 
 There's nothing wrong with convenience props and so on, but if the ones 
 built-in to the compiler are expensive to use, D will inevitably get a 
 reputation for being slow and/or heap-bound; just like Java did ~ 
 deserved or otherwise. D currently offers a number of alternatives anyway.

Doesn't COW suck on the heap? object allocation? array concatenation? 
increasing the length property?

I suppose one could write custom allocators for these "temporary" 
conversions. For example, pre-allocate a chunk of heap for temporary utf 
conversions (10 K would suffice, I think) and use it like a stack to make 
the allocation faster?

Honestly, I don't know how that would work, but I bet someone else does, 
and I bet that person can write such an allocator. Then, integrating that 
allocator into std.utf would make it faster to use the standard utf 
conversion properties. No?
 
 Again, why not use a String aggregate instead? To hide/abstract the 
 distinction between Unicode types? I suspect that would be both more 
 efficient and more convenient? Having written just such a class, I can 
 attest to these attributes.

Because the standard library functions always expect a char[]. What you did with mango was write a whole library, not just a String class. BTW, are there tutorials for using mango Strings?
Aug 05 2006
parent Hasan Aljudy <hasan.aljudy gmail.com> writes:
Hasan Aljudy wrote:
 
 I know, but
 1: The syntax is still not documented..
 2: I'm talking about making these properties a part of the standard.
 
 actually, I think:
 
 alias toUTF8 utf8;
 alias toUTF16 utf16;
 alias toUTF32 utf32;
 
 would do the trick.
 
 

even that doesn't always work now ..
Aug 05 2006
prev sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Hasan Aljudy" <hasan.aljudy gmail.com> wrote in message 
news:eb2u9n$psv$1 digitaldaemon.com...

 Can I ask you at least to simplify the conversion by adding properties utf* 
 to char/wchar/dchar arrays?

 so, if I have:
 ----
 char[] process( char[] str ) { ... }

 ...

 dchar[] my32str = .....;

 //I can write
 my32str = process( my32str.utf8 ).utf32;

 //instead of
 //my32str = toUTF32( process( toUTF8( my32str ) ) );

import utf = std.utf;

wchar[] utf16(char[] s)
{
    return utf.toUTF16(s);
}

...

char[] s = "hello";
wchar[] t = s.utf16;

 ;)

Aren't first-array-param-as-a-property functions cool?
Aug 05 2006
next sibling parent Serg Kovrov <kovrov no.spam> writes:
Jarrett Billingsley wrote:
 Aren't first-array-param-as-a-property functions cool? 

Cool indeed =) Is it documented?

-- 
serg.
Aug 05 2006
prev sibling next sibling parent Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Jarrett Billingsley wrote:
 "Hasan Aljudy" <hasan.aljudy gmail.com> wrote in message 
 news:eb2u9n$psv$1 digitaldaemon.com...
 
 Can I ask you at least to simplify the conversion by adding properties utf* 
 to char/wchar/dchar arrays?

 so, if I have:
 ----
 char[] process( char[] str ) { ... }

 ...

 dchar[] my32str = .....;

 //I can write
 my32str = process( my32str.utf8 ).utf32;

 //instead of
 //my32str = toUTF32( process( toUTF8( my32str ) ) );

 import utf = std.utf;
 
 wchar[] utf16(char[] s)
 {
     return utf.toUTF16(s);
 }
 
 ...
 
 char[] s = "hello";
 wchar[] t = s.utf16;
 
  ;)
 
 Aren't first-array-param-as-a-property functions cool?

In fact, "raw" toUTF* functions work without the wrapper functions 
(though they're obviously named differently):

import std.utf;

void main()
{
    char[] s = "hello";
    wchar[] t = s.toUTF16();

    // Or, if you prefer:
    alias toUTF16 utf16;
    wchar[] u = s.utf16();
}
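
For comparison, here is roughly how the same property-style call looks on a current D compiler. This is a sketch under assumptions rather than a statement about the 2006 compiler: the aliases use the newer `alias name = target;` form and sit at module scope, because current compilers do not consider function-local symbols for this call syntax. `toUTF8` and `toUTF16` are the existing std.utf functions; `utf8` and `utf16` are just the short names proposed in this thread.

```d
import std.utf : toUTF8, toUTF16;

// Short names as suggested in the thread, declared at module scope
// so that the member-call syntax can find them.
alias utf8  = toUTF8;
alias utf16 = toUTF16;

void main()
{
    string s = "hello";

    wstring t = s.toUTF16;   // std.utf function called with member syntax
    wstring u = s.utf16;     // same conversion through the alias

    assert(t == u);
    assert(u.utf8 == s);     // converting back gives the original text
}
```

Each call still produces a freshly allocated string, so the cost concerns raised earlier in the thread apply unchanged; only the spelling gets shorter.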
Aug 05 2006
prev sibling next sibling parent Derek Parnell <derek psyc.ward> writes:
On Sat, 5 Aug 2006 17:23:08 -0400, Jarrett Billingsley wrote:

 import utf = std.utf;
 
 wchar[] utf16(char[] s)
 {
     return utf.toUTF16(s);
 }
 
 ...
 
 char[] s = "hello";
 wchar[] t = s.utf16;
 
  ;)
 
 Aren't first-array-param-as-a-property functions cool?

I don't want to rain on anyone's parade, but the new import formats kill 
off this undocumented feature.

This works ...

 import std.utf;
 void main()
 {
     wchar[] w;
     dchar[] d;
     d = w.toUTF32();
 }

This doesn't ....

 static import std.utf;
 void main()
 {
     wchar[] w;
     dchar[] d;
     d = w.std.utf.toUTF32();
 }

And neither does this ...

 import utf = std.utf;
 void main()
 {
     wchar[] w;
     dchar[] d;
     d = w.utf.toUTF32();
 }

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Aug 05 2006
prev sibling parent reply Derek Parnell <derek psyc.ward> writes:
On Sat, 5 Aug 2006 17:23:08 -0400, Jarrett Billingsley wrote:

 import utf = std.utf;
 
 wchar[] utf16(char[] s)
 {
     return utf.toUTF16(s);
 }
 
 ...
 
 char[] s = "hello";
 wchar[] t = s.utf16;
 
  ;)
 
 Aren't first-array-param-as-a-property functions cool?

Actually, that doesn't compile any more either. Instead of

   wchar[] t = s.utf16;

you have to code ...

   wchar[] t = s.utf16();

I'm sure it used to work the way you wrote it.

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Aug 05 2006
parent Hasan Aljudy <hasan.aljudy gmail.com> writes:
Derek Parnell wrote:
 On Sat, 5 Aug 2006 17:23:08 -0400, Jarrett Billingsley wrote:
 
 
 import utf = std.utf;
 
 wchar[] utf16(char[] s)
 {
     return utf.toUTF16(s);
 }
 
 ...
 
 char[] s = "hello";
 wchar[] t = s.utf16;
 
  ;)
 
 Aren't first-array-param-as-a-property functions cool?
 
 Actually, that doesn't compile any more either. Instead of
 
    wchar[] t = s.utf16;
 
 you have to code ...
 
    wchar[] t = s.utf16();
 
 I'm sure it used to work the way you wrote it.

even worse, if abc has property foo which is dchar[], then

abc.foo.utf8();

will also fail; you'd have to use:

abc.foo().utf8();
Aug 05 2006