
digitalmars.D - Re: char[] vs. ubyte[]

reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opses3k8mv5a2sq9 digitalmars.com>, Regan Heath says...

It appears to me that Walter has decided on having only 3 types with a 
specified encoding, and all other encodings will be handled by using 
ubyte[]/byte[] and conversion functions.

I think this is the right choice. I see unicode as the future and other 
encodings as legacy encodings, whose use I hope gradually disappears.

Of course, if there is a valid reason for a certain encoding to remain, for 
speed/space/other reasons, and D wanted the same sort of built-in support 
as we do for utf8/16/32, then a new type might emerge.

Completely in agreement with you there. However, Stewart did actually ask a question which I couldn't answer, and which we shouldn't ignore. Maybe you have some ideas. The question is:
 Then how would I write the C call
 strcat(qwert, "yuiop");
 in D?

I hope you see the problem. If strcat is declared to accept ubyte[]s, then Stewart's code won't compile. He would instead need:

#    strcat(qwert, cast(ubyte*)"yuiop");

which I hope we can all agree looks ugly. So maybe we should accept that strcat() should accept char* parameters, not ubyte* parameters? After all, strcat() /allows/ the encoding to be UTF-8 - it merely does not /require/ it.

I think maybe this sort of thing is where all the confusion comes in. There doesn't really seem to be any need for a cast in the above statement. Perhaps the solution is to map legacy C functions which take char* parameters to char* in D after all, provided that such functions will always do the right thing with UTF-8 (as strcat() does). I don't think this will break anything.

However, we must still map legacy C functions which take char* parameters and which /don't/ do the right thing with UTF-8 to D's ubyte*. So for example, I reckon we should end up with the following APIs in D:

#    char* strcat(char* dest, char* source);
#    ubyte* fgets(ubyte* dest, int len, FILE* stream);

because strcat() works fine in UTF-8, but fgets() does not. This, to me, makes it very clear (self-documenting) what's going on, and requires minimal casting.

What do you guys think of this problem or this possible solution?

Arcane Jill

PS. Just to add further confusion - obviously D strings are allowed to contain the character U+0000, whereas C strings are not ('\0' being the null-terminator), so even strcat() can't really be used for /all/ UTF-8 D strings. But I guess that's just the price you pay for using C functions.
Sep 24 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <cj0h93$2o8o$1 digitaldaemon.com>, Arcane Jill says...
However, we must still map legacy C functions which take char* parameters and
which /don't/ do the right thing with UTF-8 to D's ubyte*. So for example, I
reckon we should end up with the following APIs in D:

#    char* strcat(char* dest, char* source);
#    ubyte* fgets(ubyte* dest, int len, FILE* stream);

because strcat() works fine in UTF-8, but fgets() does not.

This, to me, makes it very clear (self-documenting) what's going on, and
requires minimal casting. What do you guys think of this problem or this
possible solution?

Why not just expose the C library functions as-is and user beware? I have a feeling it will be hard to decide between parameter types for a bunch of functions if we start changing things, and any changed prototype would just confuse people.
PS. Just to add further confusion - obviously D strings are allowed to contain
the character U+0000, whereas C strings are not ('\0' being the
null-terminator), so even strcat() can't really be used for /all/ UTF-8 D
strings. But I guess that's just the price you pay for using C functions.

C++ ran into the same problem, but in both cases there's really little reason to use C strings so I've never seen it as much of an issue. Sean
Sep 24 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cj0k8i$2qm7$1 digitaldaemon.com>, Sean Kelly says...
In article <cj0h93$2o8o$1 digitaldaemon.com>, Arcane Jill says...
However, we must still map legacy C functions which take char* parameters and
which /don't/ do the right thing with UTF-8 to D's ubyte*. So for example, I
reckon we should end up with the following APIs in D:

#    char* strcat(char* dest, char* source);
#    ubyte* fgets(ubyte* dest, int len, FILE* stream);

because strcat() works fine in UTF-8, but fgets() does not.

This, to me, makes it very clear (self-documenting) what's going on, and
requires minimal casting. What do you guys think of this problem or this
possible solution?

Why not just expose the C library functions as-is and user beware?

By "as is" I assume you mean, assuming that C char* == D char*?
I have a
feeling it will be hard to decide between parameter types for a bunch of
functions if we start changing things, and any changed prototype would just
confuse people.

Fair enough. So, maybe then legacy C functions involving strings should use char* and may the caller beware, but any D functions should enforce the UTF-8 rule?
but in both cases there's really little reason to
use C strings so I've never seen it as much of an issue.

Sean

I think Stewart was primarily concerned with Windows API functions (which of course are legacy C functions). These do use C strings, so I'm not sure what can be done about that. Jill
Sep 24 2004
next sibling parent "Roald Ribe" <rr.no spam.teikom.no> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cj0sf7$2vfg$1 digitaldaemon.com...
 In article <cj0k8i$2qm7$1 digitaldaemon.com>, Sean Kelly says...
In article <cj0h93$2o8o$1 digitaldaemon.com>, Arcane Jill says...
However, we must still map legacy C functions which take char* parameters and
which /don't/ do the right thing with UTF-8 to D's ubyte*. So for example, I
reckon we should end up with the following APIs in D:

#    char* strcat(char* dest, char* source);
#    ubyte* fgets(ubyte* dest, int len, FILE* stream);

because strcat() works fine in UTF-8, but fgets() does not.

This, to me, makes it very clear (self-documenting) what's going on, and
requires minimal casting. What do you guys think of this problem or this
possible solution?

Why not just expose the C library functions as-is and user beware?

By "as is" I assume you mean, assuming that C char* == D char*?

I have a
feeling it will be hard to decide between parameter types for a bunch of
functions if we start changing things, and any changed prototype would just
confuse people.

Fair enough. So, maybe then legacy C functions involving strings should use
char* and may the caller beware, but any D functions should enforce the UTF-8
rule?

but in both cases there's really little reason to
use C strings so I've never seen it as much of an issue.

Sean

I think Stewart was primarily concerned with Windows API functions (which of
course are legacy C functions). These do use C strings, so I'm not sure what can
be done about that.

On newer versions of MS-Windows there is a *W API for each of the old API calls, and they take wchar's. Roald
Sep 24 2004
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cj0sf7$2vfg$1 digitaldaemon.com>, Arcane Jill says...
In article <cj0k8i$2qm7$1 digitaldaemon.com>, Sean Kelly says...
Why not just expose the C library functions as-is and user beware?

By "as is" I assume you mean, assuming that C char* == D char*?

Yes.
I have a
feeling it will be hard to decide between parameter types for a bunch of
functions if we start changing things, and any changed prototype would just
confuse people.

Fair enough. So, maybe then legacy C functions involving strings should use char* and may the caller beware, but any D functions should enforce the UTF-8 rule?

Yes.
I think Stewart was primarily concerned with Windows API functions (which of
course are legacy C functions). These do use C strings, so I'm not sure what can
be done about that.

Not much I suppose. What encoding do the Windows wide char functions expect? Sean
Sep 24 2004
parent Stewart Gordon <Stewart_member pathlink.com> writes:
In article <cj1cs4$6h6$1 digitaldaemon.com>, Sean Kelly says...
<snip>
 I think Stewart was primarily concerned with Windows API functions 
 (which of course are legacy C functions).  These do use C strings, 
 so I'm not sure what can be done about that.

Not much I suppose. What encoding do the Windows wide char functions expect?

UTF-16 AIUI. At least, I'm not sure if Win9x actually supports the codepoints above 0xFFFF.

I'm continually pondering over what to do with SDWF in this regard. At the moment, it uses char[] for all strings, but for Windows compatibility they're in the Windows charset instead of UTF-8. I'm guessing the solution is to switch to using wchar and the W versions of API functions/structs.

Of course, setters can be overloaded to take either char[] or wchar[]. (Hang on ... what's meant to happen with literal strings here?) Getters would need to be versioned, at least if partial backward compatibility is desired. The versions would then use UTF-8 and UTF-16 respectively. That, of course, brings us back to the complication of giving the wchar[] setters return values in the UTF-8 version: digitalmars.D/10199

How (if at all) have the other GUI libs dealt with UTF-8/UTF-16/ANSI, for that matter?

Stewart.
Sep 24 2004