digitalmars.D - Casting between char[]/wchar[]/dchar[]


Hasan Aljudy <hasan.aljudy gmail.com> writes:

What are the rules for implicit/explicit casting between char[] and 
wchar[] and dchar[] ?

When one casts (explicitly or implicitly) does the compiler 
automatically invoke std.utf.toUTF*()?
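
A minimal sketch of the difference, assuming a current D2 compiler and Phobos (which postdate this thread): as far as I know, an explicit array cast only reinterprets the existing bytes, while a transcoding has to be requested by calling std.utf by name.

```d
import std.utf : toUTF16;

void main()
{
    char[] s = "abcd".dup;

    // An explicit cast reinterprets the same memory as another element type;
    // no UTF conversion function is called.
    wchar[] reinterpreted = cast(wchar[]) s;
    assert(reinterpreted.length == 2);   // 4 bytes viewed as 2 UTF-16 code units

    // A real transcoding is an explicit library call.
    wstring converted = toUTF16(s);
    assert(converted == "abcd"w);
}
```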

Here's an idea that should simplify much of string handling in D:
allow char[] and wchar[] and dchar[] to be castable implicitly to each 
other, provided that the compiler invokes the appropriate std.utf.toUTF* 
method.
I think this is perfectly safe; no data is lost, and string handling can 
become much more flexible.

Instead of writing three versions of the same function for each of char[], 
wchar[] and dchar[], one can just write a wchar[] version (for example) 
and the compiler will handle the conversion from/to char[] and dchar[].

This also relieves developers from writing templatized 
functions/classes when they deal with strings.
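
For reference, a sketch of what the explicit version of this looks like, assuming current D2 and Phobos (`string`, `wstring` and `dstring` stand in for the mutable arrays used in this thread; `shout` is a made-up example function):

```d
import std.utf : toUTF8, toUTF16, toUTF32;

// One implementation, written against UTF-16 only (hypothetical example).
wstring shout(wstring s)
{
    return s ~ "!"w;
}

void main()
{
    string  s8  = "h\u00E9llo";
    dstring s32 = "h\u00E9llo"d;

    // Callers holding other encodings convert explicitly on the way in and out.
    string  r8  = toUTF8(shout(toUTF16(s8)));
    dstring r32 = toUTF32(shout(toUTF16(s32)));

    assert(r8 == "h\u00E9llo!");
    assert(r32 == "h\u00E9llo!"d);
}
```

The proposal amounts to having the compiler insert the toUTF* calls that are spelled out here.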

Thoughts?

Aug 04 2006

kris <foo bar.com> writes:

Hasan Aljudy wrote:
 What are the rules for implicit/explicit casting between char[] and 
 wchar[] and dchar[] ?
 
 When one casts (explicitly or implicitly) does the compiler 
 automatically invoke std.utf.toUTF*()?
 
 Here's an idea that should simplify much of string handling in D:
 allow char[] and wchar[] and dchar[] to be castable implicitly to each 
 other, provided that the compiler invokes the appropriate std.utf.toUTF* 
 method.
 I think this is perfectly safe; no data is lost, and string handling can 
 become much more flexible.
 
 Instead of writing three versions of the same function for each of char[], 
 wchar[] and dchar[], one can just write a wchar[] version (for example) 
 and the compiler will handle the conversion from/to char[] and dchar[].
 
 This also relieves developers from writing templatized 
 functions/classes when they deal with strings.
 
 Thoughts?

This one was beaten soundly around the head & shoulders in the past :)

In a systems language like D, one could argue that hidden conversions 
and/or translations (a) can mask what would otherwise be unintended 
compile-time errors (b) can be terribly detrimental to performance where 
multiple conversions are implicitly applied. Such an environment could 
potentially put CoW to shame in terms of heap abuse -- recall some of 
the recent CoW examples, and sprinkle in a few unintended conversions 
for good measure :)

IIRC, the last time this came up there was a pretty strong feeling that 
such things should be explicit (partly because it can be an expensive 
operation ~ likely sucking on the heap also). Although foreach() will 
convert on the fly, that's perhaps not something one should do with 
extensive chunks of text?

One approach would be to make the Unicode converters more attractive for 
daily use. There are libraries other than Phobos which attempt to do 
just that.

On the other hand, if you're writing some kind of platform where 
convenience is more important than, say, performance, being able to 
/add/ the implicit conversion might be of real value. One might, for 
example, implement such a platform using a String class to abstract the 
encoding differences. Functions could accept said String rather than one 
of the three stooges^H^H^H^H^H^H^H Unicode types.
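
A rough sketch of such an aggregate, assuming current D2 and Phobos. The name `Text`, its storage choice (UTF-8) and its methods are all hypothetical and are not taken from any existing library:

```d
import std.utf : toUTF8, toUTF16, toUTF32;

// Hypothetical aggregate hiding the encoding from its callers.
struct Text
{
    private string data;   // stored internally as UTF-8 (an arbitrary choice)

    this(string s)  { data = s; }
    this(wstring s) { data = toUTF8(s); }
    this(dstring s) { data = toUTF8(s); }

    // Conversions happen only where a caller explicitly asks for them.
    string  utf8()  const { return data; }
    wstring utf16() const { return toUTF16(data); }
    dstring utf32() const { return toUTF32(data); }
}

// Functions accept Text instead of one of the three array types.
size_t codePoints(Text t)
{
    size_t n;
    foreach (dchar c; t.utf8())
        ++n;
    return n;
}

void main()
{
    dstring d = "h\u00E9llo"d;
    auto t = Text(d);
    assert(codePoints(t) == 5);
    assert(t.utf16() == "h\u00E9llo"w);
}
```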

If I recall correctly, I think Regan was quite keen on implicit Unicode 
conversions (during function calls also), so a google on the subject 
along with his name might get you to the prior threads?

Either way, having the compiler tell you at compile time when you're 
mixing metaphors is a-good-thing (tm). Being able to 'extend' the 
language (via classes or whatever) to implement higher level 
abstractions such as String is also a-good-thing. Having both provides 
for differing uses of D without stepping on toes, or hitting said 
appendages with a hammer

- Kris

Aug 04 2006

Walter Bright <newshound digitalmars.com> writes:

kris wrote:
 Hasan Aljudy wrote:
 What are the rules for implicit/explicit casting between char[] and 
 wchar[] and dchar[] ?

 When one casts (explicitly or implicitly) does the compiler 
 automatically invoke std.utf.toUTF*()?

 Here's an idea that should simplify much of string handling in D:
 allow char[] and wchar[] and dchar[] to be castable implicitly to each 
 other, provided that the compiler invokes the appropriate 
 std.utf.toUTF* method.
 I think this is perfectly safe; no data is lost, and string handling 
 can become much more flexible.

 Instead of writing three versions of the same function for each of 
 char[], wchar[] and dchar[], one can just write a wchar[] version (for 
 example) and the compiler will handle the conversion from/to char[] 
 and dchar[].

 This also relieves developers from writing templatized 
 functions/classes when they deal with strings.

 Thoughts?

 
 This one was beaten soundly around the head & shoulders in the past :)
 
 In a systems language like D, one could argue that hidden conversions 
 and/or translations (a) can mask what would otherwise be unintended 
 compile-time errors (b) can be terribly detrimental to performance where 
 multiple conversions are implicitly applied. Such an environment could 
 potentially put CoW to shame in terms of heap abuse -- recall some of 
 the recent CoW examples, and sprinkle in a few unintended conversions 
 for good measure :)
 
 IIRC, the last time this came up there was a pretty strong feeling that 
 such things should be explicit (partly because it can be an expensive 
 operation ~ likely sucking on the heap also).

Yes. It's hard to judge where the line is, but too many implicit 
conversions lead to very hard to understand/debug programs.

 Although foreach() will 
 convert on the fly, that's perhaps not something one should do with 
 extensive chunks of text?

foreach also doesn't consume memory for the conversion.
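
A small sketch of that behaviour, assuming a current D2 compiler: iterating a UTF-8 array with a `dchar` loop variable decodes one code point at a time, without building a converted array first. The two helper names are made up for the example.

```d
// Counts UTF-8 code units (no decoding).
size_t countCodeUnits(const(char)[] s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

// Counts code points; foreach decodes the UTF-8 on the fly.
size_t countCodePoints(const(char)[] s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // 'a' is 1 code unit, U+00E9 is 2, U+20AC is 3.
    string s = "a\u00E9\u20AC";
    assert(countCodeUnits(s) == 6);
    assert(countCodePoints(s) == 3);
}
```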

Aug 05 2006

Hasan Aljudy <hasan.aljudy gmail.com> writes:

Walter Bright wrote:
 kris wrote:
 
 Hasan Aljudy wrote:

 What are the rules for implicit/explicit casting between char[] and 
 wchar[] and dchar[] ?

 When one casts (explicitly or implicitly) does the compiler 
 automatically invoke std.utf.toUTF*()?

 Here's an idea that should simplify much of string handling in D:
 allow char[] and wchar[] and dchar[] to be castable implicitly to 
 each other, provided that the compiler invokes the appropriate 
 std.utf.toUTF* method.
 I think this is perfectly safe; no data is lost, and string handling 
 can become much more flexible.

 Instead of writing three versions of the same function for each of 
 char[], wchar[] and dchar[], one can just write a wchar[] version (for 
 example) and the compiler will handle the conversion from/to char[] 
 and dchar[].

 This also relieves developers from writing templatized 
 functions/classes when they deal with strings.

 Thoughts?


 This one was beaten soundly around the head & shoulders in the past :)

 In a systems language like D, one could argue that hidden conversions 
 and/or translations (a) can mask what would otherwise be unintended 
 compile-time errors (b) can be terribly detrimental to performance 
 where multiple conversions are implicitly applied. Such an environment 
 could potentially put CoW to shame in terms of heap abuse -- recall 
 some of the recent CoW examples, and sprinkle in a few unintended 
 conversions for good measure :)

 IIRC, the last time this came up there was a pretty strong feeling 
 that such things should be explicit (partly because it can be an 
 expensive operation ~ likely sucking on the heap also).

 
 
 Yes. It's hard to judge where the line is, but too many implicit 
 conversions lead to very hard to understand/debug programs.

Can I ask you at least to simplify the conversion by adding properties 
utf* to char/wchar/dchar arrays?

so, if I have:
----
char[] process( char[] str ) { ... }

...

dchar[] my32str = .....;

//I can write
my32str = process( my32str.utf8 ).utf32;

//instead of
//my32str = toUTF32( process( toUTF8( my32str ) ) );
----

Aug 05 2006

kris <foo bar.com> writes:

Hasan Aljudy wrote:
 
 
 Walter Bright wrote:
 
 kris wrote:

 Hasan Aljudy wrote:

 What are the rules for implicit/explicit casting between char[] and 
 wchar[] and dchar[] ?

 When one casts (explicitly or implicitly) does the compiler 
 automatically invoke std.utf.toUTF*()?

 Here's an idea that should simplify much of string handling in D:
 allow char[] and wchar[] and dchar[] to be castable implicitly to 
 each other, provided that the compiler invokes the appropriate 
 std.utf.toUTF* method.
 I think this is perfectly safe; no data is lost, and string handling 
 can become much more flexible.

 Instead of writing three versions of the same function for each of 
 char[], wchar[] and dchar[], one can just write a wchar[] version 
 (for example) and the compiler will handle the conversion from/to 
 char[] and dchar[].

 This also relieves developers from writing templatized 
 functions/classes when they deal with strings.

 Thoughts?



 This one was beaten soundly around the head & shoulders in the past :)

 In a systems language like D, one could argue that hidden conversions 
 and/or translations (a) can mask what would otherwise be unintended 
 compile-time errors (b) can be terribly detrimental to performance 
 where multiple conversions are implicitly applied. Such an 
 environment could potentially put CoW to shame in terms of heap abuse 
 -- recall some of the recent CoW examples, and sprinkle in a few 
 unintended conversions for good measure :)

 IIRC, the last time this came up there was a pretty strong feeling 
 that such things should be explicit (partly because it can be an 
 expensive operation ~ likely sucking on the heap also).



 Yes. It's hard to judge where the line is, but too many implicit 
 conversions lead to very hard to understand/debug programs.

 
 
 Can I ask you at least to simplify the conversion by adding properties 
 utf* to char/wchar/dchar arrays?
 
 so, if I have:
 ----
 char[] process( char[] str ) { ... }
 
 ...
 
 dchar[] my32str = .....;
 
 //I can write
 my32str = process( my32str.utf8 ).utf32;
 
 //instead of
 //my32str = toUTF32( process( toUTF8( my32str ) ) );
 ----
 
 


er, you can do that yourself, Hasan?

char[] utf8 (dchar[] s)
{
   ...
}

dchar[] utf32 (char[] s)
{
   ...
}

etc, followed by:

 char[] process( char[] str ) { ... }

 ...

 dchar[] my32str = .....;

 //I can write
 my32str = process( my32str.utf8 ).utf32;

 //instead of
 //my32str = toUTF32( process( toUTF8( my32str ) ) );


However, this is sucking on the heap, since you're not providing 
anywhere for the conversion to occur. Hence it is expensive (heap 
allocation is several times slower than a 'typical' utf conversion, and 
there's potential lock-contention to deal with also). This is partly why 
there was some pushback against such properties in the past; especially 
when you can add them yourself using the funky array-prop syntax 
(demonstrated above).
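
To illustrate what "providing somewhere for the conversion to occur" could look like, here is a sketch assuming current D2 and Phobos. `toUTF8Into` is a hypothetical helper, not a Phobos function; it relies only on `std.utf.encode` and `std.exception.enforce`.

```d
import std.exception : enforce;
import std.utf : encode;

// Hypothetical helper: encodes `src` as UTF-8 into the caller's buffer and
// returns the slice of `buf` that was filled. Nothing is allocated here.
char[] toUTF8Into(char[] buf, const(dchar)[] src)
{
    size_t used = 0;
    foreach (dchar c; src)
    {
        char[4] tmp;                       // one code point needs at most 4 code units
        immutable n = encode(tmp, c);      // number of code units written to tmp
        enforce(used + n <= buf.length, "output buffer too small");
        buf[used .. used + n] = tmp[0 .. n];
        used += n;
    }
    return buf[0 .. used];
}

void main()
{
    char[64] stackBuf;                     // lives on the stack, not the heap
    char[] result = toUTF8Into(stackBuf[], "h\u00E9llo"d);
    assert(result == "h\u00E9llo");
}
```

The returned slice points into the caller's buffer, so it is only valid for as long as that buffer is.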

There's nothing wrong with convenience props and so on, but if the ones 
built-in to the compiler are expensive to use, D will inevitably get a 
reputation for being slow and/or heap-bound; just like Java did ~ 
deserved or otherwise. D currently offers a number of alternatives anyway.

Again, why not use a String aggregate instead? To hide/abstract the 
distinction between Unicode types? I suspect that would be both more 
efficient and more convenient? Having written just such a class, I can 
attest to these attributes.

Aug 05 2006

"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:

"kris" <foo bar.com> wrote in message news:eb322c$sml$1 digitaldaemon.com...

 er, you can do that yourself, Hasan?

 char[] utf8 (dchar[] s)
 {
   ...
 }

 dchar[] utf32 (char[] s)
 {
   ...
 }

lol :)

Aug 05 2006

Hasan Aljudy <hasan.aljudy gmail.com> writes:

kris wrote:
 Hasan Aljudy wrote:

 Walter Bright wrote:

 kris wrote:

 Hasan Aljudy wrote:

 What are the rules for implicit/explicit casting between char[] and 
 wchar[] and dchar[] ?

 When one casts (explicitly or implicitly) does the compiler 
 automatically invoke std.utf.toUTF*()?

 Here's an idea that should simplify much of string handling in D:
 allow char[] and wchar[] and dchar[] to be castable implicitly to 
 each other, provided that the compiler invokes the appropriate 
 std.utf.toUTF* method.
 I think this is perfectly safe; no data is lost, and string 
 handling can become much more flexible.

 Instead of writing three versions of the same function for each of 
 char[], wchar[] and dchar[], one can just write a wchar[] version 
 (for example) and the compiler will handle the conversion from/to 
 char[] and dchar[].

 This also relieves developers from writing templatized 
 functions/classes when they deal with strings.

 Thoughts?

 This one was beaten soundly around the head & shoulders in the past :)

 In a systems language like D, one could argue that hidden 
 conversions and/or translations (a) can mask what would otherwise be 
 unintended compile-time errors (b) can be terribly detrimental to 
 performance where multiple conversions are implicitly applied. Such 
 an environment could potentially put CoW to shame in terms of heap 
 abuse -- recall some of the recent CoW examples, and sprinkle in a 
 few unintended conversions for good measure :)

 IIRC, the last time this came up there was a pretty strong feeling 
 that such things should be explicit (partly because it can be an 
 expensive operation ~ likely sucking on the heap also).

 Yes. It's hard to judge where the line is, but too many implicit 
 conversions lead to very hard to understand/debug programs.

 Can I ask you at least to simplify the conversion by adding properties 
 utf* to char/wchar/dchar arrays?

 so, if I have:
 ----
 char[] process( char[] str ) { ... }

 ...

 dchar[] my32str = .....;

 //I can write
 my32str = process( my32str.utf8 ).utf32;

 //instead of
 //my32str = toUTF32( process( toUTF8( my32str ) ) );
 ----

 er, you can do that yourself, Hasan?

 char[] utf8 (dchar[] s)
 {
   ...
 }

 dchar[] utf32 (char[] s)
 {
   ...
 }

 etc, followed by:

  > char[] process( char[] str ) { ... }
  >
  > ...
  >
  > dchar[] my32str = .....;
  >
  > //I can write
  > my32str = process( my32str.utf8 ).utf32;
  >
  > //instead of
  > //my32str = toUTF32( process( toUTF8( my32str ) ) );

I know, but
1: The syntax is still not documented..
2: I'm talking about making these properties a part of the standard.

actually, I think:

alias toUTF8 utf8;
alias toUTF16 utf16;
alias toUTF32 utf32;

would do the trick.

 However, this is sucking on the heap, since you're not providing 
 anywhere for the conversion to occur. Hence it is expensive (heap 
 allocation is several times slower than a 'typical' utf conversion, and 
 there's potential lock-contention to deal with also). This is partly why 
 there was some pushback against such properties in the past; especially 
 when you can add them yourself using the funky array-prop syntax 
 (demonstrated above).

 There's nothing wrong with convenience props and so on, but if the ones 
 built-in to the compiler are expensive to use, D will inevitably get a 
 reputation for being slow and/or heap-bound; just like Java did ~ 
 deserved or otherwise. D currently offers a number of alternatives anyway.

Doesn't COW suck on the heap? object allocation? array concatenation? 
increasing the length property?

I suppose one could write custom allocators for these "temporary" 
conversions. For example, pre-allocate a chunk of heap for temporary utf 
conversions (10 K would suffice, I think) and use it like a stack to 
make the allocation faster?

Honestly, I don't know how that would work, but I bet someone else does, 
and I bet that person can write such an allocator.
Then, integrating that allocator into std.utf would make it faster to 
use the standard utf conversion properties. No?
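
One possible shape for such an allocator, as a sketch only and assuming a current D2 compiler. `Utf8Scratch` and its methods are hypothetical; nothing like this is claimed to exist in std.utf, and the sketch ignores threading entirely.

```d
// Hypothetical scratch area: one up-front allocation, handed out stack-style.
struct Utf8Scratch
{
    private char[] pool;
    private size_t top;

    this(size_t size) { pool = new char[size]; }

    // Hands out the next n code units, or falls back to the ordinary heap
    // when the pool is exhausted.
    char[] take(size_t n)
    {
        if (top + n > pool.length)
            return new char[n];
        auto slice = pool[top .. top + n];
        top += n;
        return slice;
    }

    // mark/release give the stack discipline: everything taken after a
    // mark becomes reusable once that mark is released.
    size_t mark() const { return top; }
    void release(size_t m) { top = m; }
}

void main()
{
    auto scratch = Utf8Scratch(10 * 1024);   // the "10 K" suggested above
    immutable m = scratch.mark();

    char[] a = scratch.take(16);
    char[] b = scratch.take(16);
    assert(a.ptr + 16 == b.ptr);             // consecutive slices of one block

    scratch.release(m);                      // a and b must not be used after this
    assert(scratch.mark() == 0);
}
```

The catch is the lifetime rule in the last comment: a temporary that escapes past its release silently aliases later conversions.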

 Again, why not use a String aggregate instead? To hide/abstract the 
 distinction between Unicode types? I suspect that would be both more 
 efficient and more convenient? Having written just such a class, I can 
 attest to these attributes.

Because the standard library functions always expect a char[].
What you did with mango was write a whole library, not just a String class.

BTW, are there tutorials for using mango Strings?

Aug 05 2006

Hasan Aljudy <hasan.aljudy gmail.com> writes:

Hasan Aljudy wrote:
 
 I know, but
 1: The syntax is still not documented..
 2: I'm talking about making these properties a part of the standard.
 
 actually, I think:
 
 alias toUTF8 utf8;
 alias toUTF16 utf16;
 alias toUTF32 utf32;
 
 would do the trick.
 
 

even that doesn't always work now ..

Aug 05 2006

"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:

"Hasan Aljudy" <hasan.aljudy gmail.com> wrote in message 
news:eb2u9n$psv$1 digitaldaemon.com...

 Can I ask you at least to simplify the conversion by adding properties utf* 
 to char/wchar/dchar arrays?

 so, if I have:
 ----
 char[] process( char[] str ) { ... }

 ...

 dchar[] my32str = .....;

 //I can write
 my32str = process( my32str.utf8 ).utf32;

 //instead of
 //my32str = toUTF32( process( toUTF8( my32str ) ) );

import utf = std.utf;

wchar[] utf16(char[] s)
{
    return utf.toUTF16(s);
}

...

char[] s = "hello";
wchar[] t = s.utf16;

 ;)

Aren't first-array-param-as-a-property functions cool?
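
For comparison, a sketch of the same trick under a current D2 compiler (an assumption; this thread predates it), where module-level functions can be called with member syntax on their first argument. The wrapper names follow the ones proposed earlier in the thread; `process` is a stand-in.

```d
import utf = std.utf;

// Module-level wrappers; callable as s.utf8, s.utf16, s.utf32.
string  utf8(const(dchar)[] s) { return utf.toUTF8(s); }
wstring utf16(const(char)[] s) { return utf.toUTF16(s); }
dstring utf32(const(char)[] s) { return utf.toUTF32(s); }

// Stand-in for a function that only exists in a UTF-8 version.
string process(string s)
{
    return s ~ "!";
}

void main()
{
    dstring my32str = "hello"d;

    // Same shape as the line asked for above.
    my32str = process(my32str.utf8).utf32;
    assert(my32str == "hello!"d);

    wstring t = "hello".utf16;
    assert(t == "hello"w);
}
```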

Aug 05 2006

Serg Kovrov <kovrov no.spam> writes:

Jarrett Billingsley wrote:
 Aren't first-array-param-as-a-property functions cool? 

Cool indeed =)
Is it documented?

--
serg.

Aug 05 2006

Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:

Jarrett Billingsley wrote:
 "Hasan Aljudy" <hasan.aljudy gmail.com> wrote in message 
 news:eb2u9n$psv$1 digitaldaemon.com...
 
 Can I ask you at least to simplify the conversion by adding properties utf* 
 to char/wchar/dchar arrays?

 so, if I have:
 ----
 char[] process( char[] str ) { ... }

 ...

 dchar[] my32str = .....;

 //I can write
 my32str = process( my32str.utf8 ).utf32;

 //instead of
 //my32str = toUTF32( process( toUTF8( my32str ) ) );

 
 import utf = std.utf;
 
 wchar[] utf16(char[] s)
 {
     return utf.toUTF16(s);
 }
 
 ...
 
 char[] s = "hello";
 wchar[] t = s.utf16;
 
  ;)
 
 Aren't first-array-param-as-a-property functions cool? 

In fact, "raw" toUTF* functions work without the wrapper functions 
(though they're obviously named differently):

import std.utf;

void main()
{
     char[] s = "hello";
     wchar[] t = s.toUTF16();

     // Or, if you prefer:
     alias toUTF16 utf16;
     wchar[] u = s.utf16();
}

Aug 05 2006

Derek Parnell <derek psyc.ward> writes:

On Sat, 5 Aug 2006 17:23:08 -0400, Jarrett Billingsley wrote:

 import utf = std.utf;
 
 wchar[] utf16(char[] s)
 {
     return utf.toUTF16(s);
 }
 
 ...
 
 char[] s = "hello";
 wchar[] t = s.utf16;
 
  ;)
 
 Aren't first-array-param-as-a-property functions cool?

I don't want to rain on anyone's parade, but the new import formats kill
off this undocumented feature.

This works ...

 import std.utf;
 void main()
 {
    wchar[] w;
    dchar[] d;
    d = w.toUTF32();
 }
This doesn't ....

 static import std.utf;
 void main()
 {
    wchar[] w;
    dchar[] d;
    d = w.std.utf.toUTF32();
 }
And neither does this ...

 import utf = std.utf;
 void main()
 {
    wchar[] w;
    dchar[] d;
    d = w.utf.toUTF32();
 }
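
A sketch of how the same situation looks with a current D2 compiler (an assumption; the thread predates it): member-style calls need the bare function name in scope, so a selective import works, while a renamed import falls back to the ordinary call form. The two function names are made up for the example.

```d
import std.utf : toUTF32;   // selective import: the bare name is in scope
import utf = std.utf;       // renamed import: only utf.toUTF32 is in scope

dstring viaMemberSyntax(wstring w)
{
    return w.toUTF32();     // resolves to the selectively imported toUTF32
}

dstring viaQualifiedCall(wstring w)
{
    return utf.toUTF32(w);  // w.utf.toUTF32() is not accepted
}

void main()
{
    assert(viaMemberSyntax("hello"w) == "hello"d);
    assert(viaQualifiedCall("hello"w) == "hello"d);
}
```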

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"

Aug 05 2006

Derek Parnell <derek psyc.ward> writes:

On Sat, 5 Aug 2006 17:23:08 -0400, Jarrett Billingsley wrote:

 import utf = std.utf;
 
 wchar[] utf16(char[] s)
 {
     return utf.toUTF16(s);
 }
 
 ...
 
 char[] s = "hello";
 wchar[] t = s.utf16;
 
  ;)
 
 Aren't first-array-param-as-a-property functions cool?

Actually, that doesn't compile any more either.

Instead of 

   wchar[] t = s.utf16;

you have to code ...

   wchar[] t = s.utf16();

I'm sure it used to work the way you wrote it.

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"

Aug 05 2006

Hasan Aljudy <hasan.aljudy gmail.com> writes:

Derek Parnell wrote:
 On Sat, 5 Aug 2006 17:23:08 -0400, Jarrett Billingsley wrote:
 
 
import utf = std.utf;

wchar[] utf16(char[] s)
{
    return utf.toUTF16(s);
}

...

char[] s = "hello";
wchar[] t = s.utf16;

 ;)

Aren't first-array-param-as-a-property functions cool?

 
 
 Actually, that doesn't compile any more either.
 
 Instead of 
 
    wchar[] t = s.utf16;
 
 you have to code ...
 
    wchar[] t = s.utf16();
 
 I'm sure it used to work the way you wrote it.
 


even worse, if abc has property foo which is dchar[], then

     abc.foo.utf8();

will also fail; you'd have to use:
abc.foo().utf8();

Aug 05 2006
