
digitalmars.D - Re: char[] vs. ubyte[]

reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opses3k8mv5a2sq9 digitalmars.com>, Regan Heath says...

It appears to me that Walter has decided on having only 3 types with a 
specified encoding, and all other encodings will be handled by using 
ubyte[]/byte[] and conversion functions.

I think this is the right choice. I see unicode as the future and other 
encodings as legacy encodings, whose use I hope gradually disappears.

Of course, if there is a valid reason for a certain encoding to remain, for 
speed/space/other reasons, and D wanted the same sort of built-in support 
as we do for utf8/16/32, then a new type might emerge.

Completely in agreement with you there. However, Stewart did actually ask a question which I couldn't answer, and which we shouldn't ignore. Maybe you have some ideas. The question is:
 Then how would I write the C call
 strcat(qwert, "yuiop");
 in D?

I hope you see the problem. If strcat is declared to accept ubyte[]s, then Stewart's code won't compile. He would instead need:

#    strcat(qwert, cast(ubyte*)"yuiop");

which I hope we can all agree looks ugly. So maybe we should accept that strcat() should accept char* parameters, not ubyte* parameters? After all, strcat() /allows/ the encoding to be UTF-8 - it merely does not /require/ it.

I think maybe this sort of thing is where all the confusion comes in. There doesn't really seem to be any need for a cast in the above statement. Perhaps the solution is to map legacy C functions which take char* parameters to char* in D after all, provided that such functions will always do the right thing with UTF-8 (as strcat() does). I don't think this will break anything.

However, we must still map legacy C functions which take char* parameters and which /don't/ do the right thing with UTF-8 to D's ubyte*. So for example, I reckon we should end up with the following APIs in D:

#    char* strcat(char* dest, char* source);
#    ubyte* fgets(ubyte* dest, int len, FILE* stream);

because strcat() works fine in UTF-8, but fgets() does not. This, to me, makes it very clear (self-documenting) what's going on, and requires minimal casting.

What do you guys think of this problem or this possible solution?

Arcane Jill

PS. Just to add further confusion - obviously D strings are allowed to contain the character U+0000, whereas C strings are not ('\0' being the null-terminator), so even strcat() can't really be used for /all/ UTF-8 D strings. But I guess that's just the price you pay for using C functions.
Sep 24 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <cj0h93$2o8o$1 digitaldaemon.com>, Arcane Jill says...
However, we must still map legacy C functions which take char* parameters and
which /don't/ do the right thing with UTF-8 to D's ubyte*. So for example, I
reckon we should end up with the following APIs in D:

#    char* strcat(char* dest, char* source);
#    ubyte* fgets(ubyte* dest, int len, FILE* stream);

because strcat() works fine in UTF-8, but fgets() does not.

This, to me, makes it very clear (self-documenting) what's going on, and
requires minimal casting. What do you guys think of this problem or this
possible solution?

Why not just expose the C library functions as-is and user beware? I have a feeling it will be hard to decide between parameter types for a bunch of functions if we start changing things, and any changed prototype would just confuse people.
PS. Just to add further confusion - obviously D strings are allowed to contain
the character U+0000, whereas C strings are not ('\0' being the
null-terminator), so even strcat() can't really be used for /all/ UTF-8 D
strings. But I guess that's just the price you pay for using C functions.

C++ ran into the same problem, but in both cases there's really little reason to use C strings so I've never seen it as much of an issue. Sean
Sep 24 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cj0k8i$2qm7$1 digitaldaemon.com>, Sean Kelly says...
In article <cj0h93$2o8o$1 digitaldaemon.com>, Arcane Jill says...
However, we must still map legacy C functions which take char* parameters and
which /don't/ do the right thing with UTF-8 to D's ubyte*. So for example, I
reckon we should end up with the following APIs in D:

#    char* strcat(char* dest, char* source);
#    ubyte* fgets(ubyte* dest, int len, FILE* stream);

because strcat() works fine in UTF-8, but fgets() does not.

This, to me, makes it very clear (self-documenting) what's going on, and
requires minimal casting. What do you guys think of this problem or this
possible solution?

Why not just expose the C library functions as-is and user beware?

By "as is" I assume you mean, assuming that C char* == D char*?
I have a
feeling it will be hard to decide between parameter types for a bunch of
functions if we start changing things, and any changed prototype would just
confuse people.

Fair enough. So, maybe then legacy C functions involving strings should use char* and may the caller beware, but any D functions should enforce the UTF-8 rule?
but in both cases there's really little reason to
use C strings so I've never seen it as much of an issue.

Sean

I think Stewart was primarily concerned with Windows API functions (which of course are legacy C functions). These do use C strings, so I'm not sure what can be done about that. Jill
Sep 24 2004
next sibling parent "Roald Ribe" <rr.no spam.teikom.no> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cj0sf7$2vfg$1 digitaldaemon.com...
 In article <cj0k8i$2qm7$1 digitaldaemon.com>, Sean Kelly says...
In article <cj0h93$2o8o$1 digitaldaemon.com>, Arcane Jill says...
However, we must still map legacy C functions which take char* parameters and
which /don't/ do the right thing with UTF-8 to D's ubyte*. So for example, I
reckon we should end up with the following APIs in D:

#    char* strcat(char* dest, char* source);
#    ubyte* fgets(ubyte* dest, int len, FILE* stream);

because strcat() works fine in UTF-8, but fgets() does not.

This, to me, makes it very clear (self-documenting) what's going on, and
requires minimal casting. What do you guys think of this problem or this
possible solution?

Why not just expose the C library functions as-is and user beware?

By "as is" I assume you mean, assuming that C char* == D char*?

I have a
feeling it will be hard to decide between parameter types for a bunch of
functions if we start changing things, and any changed prototype would just
confuse people.

Fair enough. So, maybe then legacy C functions involving strings should use
char* and may the caller beware, but any D functions should enforce the UTF-8
rule?

but in both cases there's really little reason to
use C strings so I've never seen it as much of an issue.

Sean

I think Stewart was primarily concerned with Windows API functions (which of
course are legacy C functions). These do use C strings, so I'm not sure what can
be done about that.

On newer versions of MS-Windows there is a *W API for each of the old API calls, and they take wchar's. Roald
Sep 24 2004
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cj0sf7$2vfg$1 digitaldaemon.com>, Arcane Jill says...
In article <cj0k8i$2qm7$1 digitaldaemon.com>, Sean Kelly says...
Why not just expose the C library functions as-is and user beware?

By "as is" I assume you mean, assuming that C char* == D char*?

Yes.
I have a
feeling it will be hard to decide between parameter types for a bunch of
functions if we start changing things, and any changed prototype would just
confuse people.

Fair enough. So, maybe then legacy C functions involving strings should use char* and may the caller beware, but any D functions should enforce the UTF-8 rule?

Yes.
I think Stewart was primarily concerned with Windows API functions (which of
course are legacy C functions). These do use C strings, so I'm not sure what can
be done about that.

Not much I suppose. What encoding do the Windows wide char functions expect? Sean
Sep 24 2004
parent Stewart Gordon <Stewart_member pathlink.com> writes:
In article <cj1cs4$6h6$1 digitaldaemon.com>, Sean Kelly says...
<snip>
 I think Stewart was primarily concerned with Windows API functions 
 (which of course are legacy C functions).  These do use C strings, 
 so I'm not sure what can be done about that.

Not much I suppose. What encoding do the Windows wide char functions expect?

UTF-16 AIUI. At least, I'm not sure if Win9x actually supports the codepoints above 0xFFFF.

I'm continually pondering over what to do with SDWF in this regard. At the moment, it uses char[] for all strings, but for Windows compatibility they're in the Windows charset instead of UTF-8. I'm guessing the solution is to switch to using wchar and the W versions of API functions/structs.

Of course, setters can be overloaded to take either char[] or wchar[]. (Hang on ... what's meant to happen with literal strings here?) Getters would need to be versioned, at least if partial backward compatibility is desired. The versions would then use UTF-8 and UTF-16 respectively. That, of course, brings us back to the complication of giving the wchar[] setters return values in the UTF-8 version: digitalmars.D/10199

How (if at all) have the other GUI libs dealt with UTF-8/UTF-16/ANSI, for that matter?

Stewart.
Sep 24 2004