
digitalmars.D - toString vs. toUtf8

reply Sean Kelly <sean f4.ca> writes:
I was looking at converting Tango's use of toUtf8 to toString today and 
ran into a bit of a quandary.  Currently, Tango's use of toUtf8 as the 
member function for returning char strings is consistent with all use of 
string operations in Tango.  Routines that return wchar strings are 
named toUtf16 whether they are members of the String class or whether 
they are intended to perform UTF conversions, and so on.  Thus, the 
convention is consistent and pervasive.

What I discovered during a test conversion of Tango was that converting 
all uses of toUtf8 to toString /except/ those intended to perform UTF 
conversions reduced code clarity, and left me unsure as to which name I 
would actually use in a given situation.  For example, there is quite a 
bit of code in the text and io packages which converts an arbitrary type 
to a char[] for output, etc.  So by making this change I was left with 
some conversions using toString and others using toUtf8, toUtf16, and 
toUtf32, not to mention the fromXxx versions of these same functions. 
As this is template code, the choice between toString and toUtf8 in a 
given situation was unclear.  Given this, I decided to look to Phobos 
for a model to follow.

What I found in Phobos was that it suffers from the same situation I 
found Tango in during my test conversion.  Routines that convert any 
type but a string to a char[] are named toString, while the string 
equivalent is named toUTF8.  Given this, I surmised that the naming 
convention in D is that all strings are assumed to be Unicode, except 
when they're not.  String literals are required to be Unicode, foreach 
assumes strings to be UTF encoded when performing its automatic 
conversions, and all of the toString functions in std.string assume 
UTF-8 as the output format.  So why bother with the name toUTF8 in std.utf?

As near as I can tell, the reason for text conversion routines to be 
named differently is to simplify the use of routines which convert to 
another format.  std.windows.charset, for example, has a routine called 
toMBSz, to distinguish from the toUTF8 routine.  What I find significant 
about this is that it suggests that while the transport mechanism for 
strings is the same in each case (both routines return a char[], ie. a 
string), the underlying encoding is different.  Thus there seems a clear 
disconnect between the name of the transport mechanism (string), and 
routines that generate them.  With this in mind, I begin to question the 
point of having toString as the common name for routines that generate 
char strings.  The encoding clearly matters in some instances and cannot 
be ignored, so ignoring it in others just seems to confuse things.
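
The disconnect can be illustrated with a hypothetical sketch (the signatures below are simplified and invented for illustration; the real std.windows.charset.toMBSz actually returns a zero-terminated char*):

```d
// Both routines use char[] as the transport mechanism, but the
// underlying encoding differs -- simplified, hypothetical signatures.
char[] toUTF8(wchar[] s);  // result guaranteed to be UTF-8
char[] toMBS(char[] s);    // result in the current Windows code page

void example(wchar[] w, char[] c)
{
    char[] a = toUTF8(w);  // UTF-8 text
    char[] b = toMBS(c);   // possibly non-Unicode text
    // a and b have the same static type; nothing marks the
    // difference in encoding except the function name.
}
```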

With this in mind, I will admit that I am questioning the merit of 
changing Tango's toUtf8 routines to be named toString.  Doing so seems 
to sacrifice both operational consistency and clarity in an attempt to 
maintain consistency with the name of the transport mechanism: string. 
And as I have said above, while strings in D are generally expected to 
be Unicode, they are clearly not always Unicode, as the existence of 
std.windows.charset can attest.  So I am left wondering whether someone 
can explain why toString is the preferred name for string-producing 
routines in D?  I feel it is very important to establish a consistent 
naming convention for D, and as Phobos seems to be the model in this case 
I may well have no choice in the matter of toUtf8 vs. toString.  But I 
would feel much better about the change if someone could provide a sound 
reason for doing so, since my first attempt at a conversion has left me 
somewhat worried about its long-term effect on code clarity.

As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 
be named toString, toWString, and toDString, respectively, and Unicode 
should be assumed as the standard encoding format in D.


Sean
Nov 19 2007
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Sean Kelly" wrote
 What I discovered during a test conversion of Tango was that converting 
 all uses of toUtf8 to toString /except/ those intended to perform UTF 
 conversions reduced code clarity, and left me unsure as to which name I 
 would actually use in a given situation.  For example, there is quite a 
 bit of code in the text and io packages which convert an arbitrary type to 
 a char[] for output, etc.  So by making this change I was left with some 
 conversions using toString and others using toUtf8, toUtf16, and toUtf32, 
 not to mention the fromXxx versions of these same functions. As this is 
 template code, the choice between toString and toUtf8 in a given situation 
 was unclear.

Can you give an example file for this problem?  It would be easier to understand your problem if I knew exactly what you were talking about.  An actual example is fine, it doesn't need to be minimized (i.e. "take a look at tango/io/X.d")

-Steve
Nov 19 2007
parent reply Sean Kelly <sean f4.ca> writes:
Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 What I discovered during a test conversion of Tango was that converting 
 all uses of toUtf8 to toString /except/ those intended to perform UTF 
 conversions reduced code clarity, and left me unsure as to which name I 
 would actually use in a given situation.  For example, there is quite a 
 bit of code in the text and io packages which convert an arbitrary type to 
 a char[] for output, etc.  So by making this change I was left with some 
 conversions using toString and others using toUtf8, toUtf16, and toUtf32, 
 not to mention the fromXxx versions of these same functions. As this is 
 template code, the choice between toString and toUtf8 in a given situation 
 was unclear.

Can you give an example file for this problem? It would be easier to understand your problem if I knew exactly what you were talking about. An actual example is fine, it doesn't need to be minimized (i.e. "take a look at tango/io/X.d")

tango.text.convert.Layout Sean
Nov 19 2007
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Sean Kelly" wrote
 Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 What I discovered during a test conversion of Tango was that converting 
 all uses of toUtf8 to toString /except/ those intended to perform UTF 
 conversions reduced code clarity, and left me unsure as to which name I 
 would actually use in a given situation.  For example, there is quite a 
 bit of code in the text and io packages which convert an arbitrary type 
 to a char[] for output, etc.  So by making this change I was left with 
 some conversions using toString and others using toUtf8, toUtf16, and 
 toUtf32, not to mention the fromXxx versions of these same functions. As 
 this is template code, the choice between toString and toUtf8 in a given 
 situation was unclear.

Can you give an example file for this problem? It would be easier to understand your problem if I knew exactly what you were talking about. An actual example is fine, it doesn't need to be minimized (i.e. "take a look at tango/io/X.d")

tango.text.convert.Layout

I can't say I see a problem.

I'd say use toUtf8 when doing a conversion from one type of encoded string to another (i.e. utf-16 to utf-8), and use toString when overriding Object's toString, OR when converting a native type (i.e. int, float, etc).  For example tango.text.convert.Integer.toUtf8 should be toString.

In the case of tango.text.convert.Layout, I don't see any overriding of Object.toUtf8?  The Unicode.toUtf8 should be left alone since it is a conversion between utf encodings.  In any case, Unicode.toUtf8 is a global function, and is not overriding Object.toUtf8, so there is no conflict there.

-Steve
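
That rule could be sketched like so (hypothetical names with elided bodies, loosely modeled on D1-era Tango; Point and the free functions are invented for illustration):

```d
// Encoding conversions keep encoding-specific names:
char[]  toUtf8 (wchar[] s) { /* transcode UTF-16 -> UTF-8 */ return null; }
wchar[] toUtf16(char[]  s) { /* transcode UTF-8 -> UTF-16 */ return null; }

// Formatting a native type, or overriding Object, uses toString:
char[] toString(int v) { /* format the integer as text */ return null; }

class Point
{
    int x, y;
    char[] toString() { return "Point"; } // overrides Object.toString
}
```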
Nov 19 2007
parent Sean Kelly <sean f4.ca> writes:
Steven Schveighoffer wrote:
 
 I'd say use toUtf8 when doing a conversion from one type of encoded string 
 to another (i.e. utf-16 to utf-8), and use toString when overriding Object's 
 toString, OR when converting a native type (i.e. int, float, etc).  For 
 example tango.text.convert.Integer.toUtf8 should be toString.
 
 In the case of tango.text.convert.Layout, I don't see any overriding of 
 Object.toUtf8?  The Unicode.toUtf8 should be left alone since it is a 
 conversion between utf encodings.  In any case, Unicode.toUtf8 is a global 
 function, and is not overriding Object.toUtf8, so there is no conflict 
 there.

There's no conflict, it's just more difficult to understand.  Also, in template code, having a consistent rule for overloaded functions can be a valuable asset.  I found myself wanting to simply change everything to toString, toWString, and toDString rather than change only the Object member function as originally planned.  And the conflict with other encodings worried me so I posted here.

Sean
Nov 19 2007
prev sibling parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
Sean Kelly wrote:
 Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 What I discovered during a test conversion of Tango was that 
 converting all uses of toUtf8 to toString /except/ those intended to 
 perform UTF conversions reduced code clarity, and left me unsure as to 
 which name I would actually use in a given situation.  For example, 
 there is quite a bit of code in the text and io packages which 
 convert an arbitrary type to a char[] for output, etc.  So by making 
 this change I was left with some conversions using toString and 
 others using toUtf8, toUtf16, and toUtf32, not to mention the fromXxx 
 versions of these same functions. As this is template code, the 
 choice between toString and toUtf8 in a given situation was unclear.

Can you give an example file for this problem? It would be easier to understand your problem if I knew exactly what you were talking about. An actual example is fine, it doesn't need to be minimized (i.e. "take a look at tango/io/X.d")

tango.text.convert.Layout

I think you are right that the meanings of toString and toUtf8 are subtly different.  My take is that toString promises to produce some textual form of the input (and it happens to use the utf8 encoding).  This transformation might be wildly lossy and non-reversible, as is the case with the default implementation of toString for classes, which just prints the class name.  toUtf8, on the other hand, promises to do a conversion.  It's probably lossless, or nearly so, and since the encoding is mentioned specifically, it's probably a conversion between different string encodings.

The thing is, sometimes A is B.  The best textual representation of a Utf32 string as Utf8 is going to be the Utf8-converted version of it.  So in that case toString and toUtf8 happen to do the same thing.

So to me, the logical thing to do is to "alias toUtf8 toString;" in the cases where there's a converter that also suffices as a textual representation generator.  That way everything that can be represented as text has a toString method, and things that deal with encoding conversions have toUtf-blah methods.

So in that case I don't see any reason for toWString, toDString.  toString generates your canonical "textual representation" for whatever it is.  If you need that in a different encoding for whatever reason then you need to run an encoding converter on it.

--bb
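
The "alias toUtf8 toString;" idea might look like this in practice (String32 is a hypothetical class invented for illustration; the transcoding body is elided):

```d
class String32
{
    dchar[] data;

    // The lossless encoding conversion...
    char[] toUtf8() { /* transcode data from UTF-32 to UTF-8 */ return null; }

    // ...also happens to be the canonical textual representation,
    // so one name simply forwards to the other (D1 alias syntax).
    alias toUtf8 toString;
}
```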
Nov 19 2007
parent reply Daniel Keep <daniel.keep.lists gmail.com> writes:
That's roughly what I've suggested before, except that I also suggested
the following interface:

interface UtfConversion
{
    char[] toUtf8();
    wchar[] toUtf16();
    dchar[] toUtf32();
}

This would allow all objects to have a distinct set of methods for
lossless conversion to different encodings, whilst still preserving the
"just give me something to throw at the user" toString method.

Incidentally, given that to(T)'s entire purpose is to do generalised
*value-preserving* conversions, is this really a problem?  Using a
formatter will always give you something, whilst to!(charT[])(v) will
always preserve the value of the conversion.

	-- Daniel
Nov 19 2007
parent "Kris" <foo bar.com> writes:
"Daniel Keep" <daniel.keep.lists gmail.com> wrote in >
 That's roughly what I've suggested before, except that I also suggested
 the following interface:

 interface UtfConversion
 {
    char[] toUtf8();
    wchar[] toUtf16();
    dchar[] toUtf32();
 }

 This would allow all objects to have a distinct set of methods for
 lossless conversion to different encodings, whilst still preserving the
 "just give me something to throw at the user" toString method.

 Incidentally, given that to(T)'s entire purpose is to do generalised
 *value-preserving* conversions, is this really a problem?  Using a
 formatter will always give you something, whilst to!(charT[])(v) will
 always preserve the value of the conversion.

Tango already has this ....
 -- Daniel 

Nov 19 2007
prev sibling next sibling parent reply BCS <ao pathlink.com> writes:
Reply to Sean,

 I was looking at converting Tango's use of toUtf8 to toString today
 and ran into a bit of a quandry.  

 
 Sean
 

why? What is to be gained by having Tango use toString rather than toUTF*? IIRC there was a big thing, back when Tango started using toUTF8 in place of toString, about how using toUTF8 would solve a number of these issues. If it is part of the Phobos/Tango collaboration project, I think that going the other way would be better (add toUTF8 to Phobos's Object and an alias to make old things compile).
Nov 19 2007
parent Lars Ivar Igesund <larsivar igesund.net> writes:
BCS wrote:

 Reply to Sean,
 
 I was looking at converting Tango's use of toUtf8 to toString today
 and ran into a bit of a quandry.

 
 Sean
 

why? What is to be gained by having Tango use toString rather than toUTF*? IIRC there was a big thing, back when Tango started using toUTF8 in place of toString, about how using toUTF8 would solve a number of these issues. If it is part of the Phobos/Tango collaboration project, I think that going the other way would be better (add toUTF8 to Phobos's Object and an alias to make old things compile).

Ooh, that would be nice. Apparently toString is the only thing not up for discussion at all on Walter's end.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 19 2007
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Phobos (and D) has undergone some evolution in the thinking about 
unicode strings, and it certainly has a few anachronisms in its names. 
But I think we've evolved to the point where going forward, we know what 
to do:

char[] => string
wchar[] => wstring
dchar[] => dstring

These are all unicode strings. Putting non-unicode encodings in them, 
even temporarily, should be discouraged. Non-unicode encodings should 
use ubyte[], ushort[], etc.
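
In D1 terms, the proposed names amount to simple aliases (a sketch; the variable names are invented for illustration, and D 2.0 later defined string over invariant chars instead):

```d
// The names are just aliases for the unicode character arrays:
alias char[]  string;
alias wchar[] wstring;
alias dchar[] dstring;

// Non-unicode data should travel in plain integer arrays instead:
ubyte[]  latin1Buffer;  // e.g. Latin-1 encoded bytes
ushort[] ucs2Buffer;    // e.g. a legacy 16-bit encoding
```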
Nov 19 2007
next sibling parent Lars Ivar Igesund <larsivar igesund.net> writes:
Walter Bright wrote:

 Phobos (and D) has undergone some evolution in the thinking about
 unicode strings, and it certainly has a few anachronisms in its names.
 But I think we've evolved to the point where going forward, we know what
 to do:

That "we" certainly doesn't include all of the D community (of which I would say Tango is a large part, although some of Tango's users probably like your suggestions). Were these names ever really up for discussion?
 
 char[] => string
 wchar[] => wstring
 dchar[] => dstring
 
 These are all unicode strings. Putting non-unicode encodings in them,
 even temporarily, should be discouraged. Non-unicode encodings should
 use ubyte[], ushort[], etc.

This doesn't at all address the points Sean pulled forth on the naming of the functions returning various encodings, regardless of the types returned.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 19 2007
prev sibling next sibling parent reply Gregor Richards <Richards codu.org> writes:
Walter Bright wrote:
 Phobos (and D) has undergone some evolution in the thinking about 
 unicode strings, and it certainly has a few anachronisms in its names. 
 But I think we've evolved to the point where going forward, we know what 
 to do:
 
 char[] => string
 wchar[] => wstring
 dchar[] => dstring
 
 These are all unicode strings. Putting non-unicode encodings in them, 
 even temporarily, should be discouraged. Non-unicode encodings should 
 use ubyte[], ushort[], etc.

I believe that this naming convention would be best in Tango (toString, toWString, toDString). Naming them toUtf8, toUtf16, toUtf32 not only means that the coder has to understand what character encodings are (which would be nice but shouldn't be necessary), but that the familiar terminology "string" we take from literally every other language is lost.

If we have to define "strings" as being a bit more confined than "arrays of bytes which presumably have some sort of form", even better. Bytes encoding random-arsed character sets into WTF-17 don't need to be called "strings", they can be called WTF-17 arrays.

- Gregor Richards
Nov 19 2007
parent Robert Fraser <fraserofthenight gmail.com> writes:
Gregor Richards wrote:
 Walter Bright wrote:
 Phobos (and D) has undergone some evolution in the thinking about 
 unicode strings, and it certainly has a few anachronisms in its names. 
 But I think we've evolved to the point where going forward, we know 
 what to do:

 char[] => string
 wchar[] => wstring
 dchar[] => dstring

 These are all unicode strings. Putting non-unicode encodings in them, 
 even temporarily, should be discouraged. Non-unicode encodings should 
 use ubyte[], ushort[], etc.

I believe that this naming convention would be best in Tango (toString, toWString, toDString). Naming them toUtf8, toUtf16, toUtf32 not only means that the coder has to understand what character encodings are (which would be nice but shouldn't be necessary), but that the familiar terminology "string" we take from literally every other language is lost. If we have to define "strings" as being a bit more confined than "arrays of bytes which presumably have some sort of form", even better. Bytes encoding random-arsed character sets in to WTF-17 don't need to be called "strings", they can be called WTF-17 arrays. - Gregor Richards

Agreed. It's also worth noting that toString as the name of a method/function has some precedent, so people familiar with Java, etc. will be able to get used to it right away, possibly without ever looking it up.
Nov 19 2007
prev sibling next sibling parent reply Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
 Phobos (and D) has undergone some evolution in the thinking about 
 unicode strings, and it certainly has a few anachronisms in its names. 
 But I think we've evolved to the point where going forward, we know what 
 to do:
 
 char[] => string
 wchar[] => wstring
 dchar[] => dstring
 
 These are all unicode strings. Putting non-unicode encodings in them, 
 even temporarily, should be discouraged. Non-unicode encodings should 
 use ubyte[], ushort[], etc.

This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-) Sean
Nov 19 2007
parent reply Gregor Richards <Richards codu.org> writes:
Sean Kelly wrote:
 Walter Bright wrote:
 Phobos (and D) has undergone some evolution in the thinking about 
 unicode strings, and it certainly has a few anachronisms in its names. 
 But I think we've evolved to the point where going forward, we know 
 what to do:

 char[] => string
 wchar[] => wstring
 dchar[] => dstring

 These are all unicode strings. Putting non-unicode encodings in them, 
 even temporarily, should be discouraged. Non-unicode encodings should 
 use ubyte[], ushort[], etc.

This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-) Sean

Worse looking than toUtf16?

Would you prefer if int => int32, long => int64, short => int16, byte => int8, real => float80 (portability be damned), double => float64, float => float32? They'd certainly be more obvious, but I can tell you I'd go crazy.

- Gregor Richards
Nov 19 2007
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Gregor Richards wrote:
 Would you prefer if int => int32, long => 
 int64, short => int16, byte => int8, real => float80 (portability be 
 damned), double => float64, float => float32? They'd certainly be more 
 obvious, but I can tell you I'd go crazy.

Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.
Nov 19 2007
parent reply Jason House <jason.james.house gmail.com> writes:
Walter Bright wrote:

 Gregor Richards wrote:
 Would you prefer if int => int32, long =>
 int64, short => int16, byte => int8, real => float80 (portability be
 damned), double => float64, float => float32? They'd certainly be more
 obvious, but I can tell you I'd go crazy.

Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.

The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue. I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size.

I'd love to see both a fixed and variable size option available. Maybe:

int - variable size
int32 - fixed size
int64 - fixed size

If that's done, the size of types becomes obvious when the programmer cares about them and may make size-sensitive code more obvious.
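
D has no variable-sized int, but something close to the proposal can already be approximated with version blocks (a hypothetical sketch, not the proposed syntax; nativeInt is an invented name):

```d
// Fixed sizes, which D already guarantees:
alias int  int32;  // always 32 bits in D
alias long int64;  // always 64 bits in D

// A "native-sized" integer selected per target:
version (X86_64)
    alias long nativeInt;
else
    alias int  nativeInt;
```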
Nov 19 2007
next sibling parent reply Robert DaSilva <sp.unit.262+digitalmars gmail.com> writes:
Jason House wrote:
 Walter Bright wrote:
 
 Gregor Richards wrote:
 Would you prefer if int => int32, long =>
 int64, short => int16, byte => int8, real => float80 (portability be
 damned), double => float64, float => float32? They'd certainly be more
 obvious, but I can tell you I'd go crazy.

Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.

The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue. I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size. I'd love to see both a fixed and variable size option available. Maybe: int - variable size int32 - fixed size int64 - fixed size If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.

Even on 64-bit systems int is 32-bit.
Nov 19 2007
parent reply Jason House <jason.james.house gmail.com> writes:
Robert DaSilva wrote:

 Jason House wrote:
 Walter Bright wrote:
 
 Gregor Richards wrote:
 Would you prefer if int => int32, long =>
 int64, short => int16, byte => int8, real => float80 (portability be
 damned), double => float64, float => float32? They'd certainly be more
 obvious, but I can tell you I'd go crazy.

Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.

The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue. I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size. I'd love to see both a fixed and variable size option available. Maybe: int - variable size int32 - fixed size int64 - fixed size If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.

Even on 64-bit systems int is 32-bit.

Are you talking about what D does or what is most efficient on a 64 bit system? If 32-bit integers are less efficient, then it's a crime to make size-tolerant code use an inefficient size.
Nov 19 2007
next sibling parent Sean Kelly <sean f4.ca> writes:
Jason House wrote:
 Robert DaSilva wrote:
 
 Jason House wrote:
 If that's done, the size of types become obvious when the programmer
 cares about them and may make size-sensitive code more obvious.


Are you talking about what D does or what is most efficient on a 64 bit system? If 32-bit integers are less efficient, then it's a crime to make size-tolerant code use an inefficient size.

You could always use tango.stdc.stdint.int_fast32_t ;-) Sean
Nov 19 2007
prev sibling parent Robert DaSilva <sp.unit.262+digitalmars gmail.com> writes:
Jason House wrote:
 Robert DaSilva wrote:
 
 Jason House wrote:
 Walter Bright wrote:

 Gregor Richards wrote:
 Would you prefer if int => int32, long =>
 int64, short => int16, byte => int8, real => float80 (portability be
 damned), double => float64, float => float32? They'd certainly be more
 obvious, but I can tell you I'd go crazy.

Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.

The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue. I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size. I'd love to see both a fixed and variable size option available. Maybe: int - variable size int32 - fixed size int64 - fixed size If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.


Are you talking about what D does or what is most efficient on a 64 bit system? If 32-bit integers are less efficient, then it's a crime to make size-tolerant code use an inefficient size.

C doesn't specify the sizes, but it does specify the sizes relative to each other: sizeof(short) <= sizeof(int) && sizeof(int) <= sizeof(long)
Nov 19 2007
prev sibling parent reply renoX <renosky free.fr> writes:
Jason House a écrit :
 Walter Bright wrote:
 
 Gregor Richards wrote:
 Would you prefer if int => int32, long =>
 int64, short => int16, byte => int8, real => float80 (portability be
 damned), double => float64, float => float32? They'd certainly be more
 obvious, but I can tell you I'd go crazy.

legacy from the C world where the sizes of basic types are unknown.

The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue. I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size. I'd love to see both a fixed and variable size option available. Maybe: int - variable size int32 - fixed size int64 - fixed size If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.

No! Naturally programmers would use 'int' everywhere and this would again create portability issues.

var_int, int_be32 (bigger than or equal to 32 bits), int_word: those would be ok though.

renoX
Nov 19 2007
parent reply Jason House <jason.james.house gmail.com> writes:
renoX wrote:

 Jason House a Ă©crit :
 I'd love to see both a fixed and variable size option available.  Maybe:
 int - variable size
 int32 - fixed size
 int64 - fixed size
 
 If that's done, the size of types become obvious when the programmer
 cares about them and may make size-sensitive code more obvious.

No! Naturally programmers would use 'int' everywhere and this would create again portability issue.

What would the portability issue be? If they use int and don't care about the true size, it'll port fine.
 var_int, int_be32 (bigger or equal 32 bit), int_word: would be ok though.

I'm assuming programmers won't use long-winded type names if they can avoid it.
Nov 22 2007
parent renoX <renosky free.fr> writes:
Jason House a Ă©crit :
 renoX wrote:
 
 Jason House a Ă©crit :
 I'd love to see both a fixed and variable size option available.  Maybe:
 int - variable size
 int32 - fixed size
 int64 - fixed size

 If that's done, the size of types become obvious when the programmer
 cares about them and may make size-sensitive code more obvious.

Naturally programmers would use 'int' everywhere and this would create again portability issue.

What would the portability issue be? If they use int and don't care about the true size, it'll port fine.

Sure, it's the same as C, except that if you look at the real world, you'll see that there are many portability issues in C due to this. There are many not-very-good|overworked programmers who care only about their current target, so if you have integers with a varying size as a default, portability will be poor.

IMHO, that's a case of 'premature optimisation': providing machine-sized integers for optimisations is nice, but using them as a default sucks, especially since it's not that obvious that they are always faster: 64bit integers on a 64bit CPU can be slower than 32bit integers due to the increased memory & cache usage.

renoX
 
 
 var_int, int_be32 (bigger or equal 32 bit), int_word: would be ok though.

I'm assuming programmers won't use long windes type definitions if they can avoid it.

Nov 25 2007
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Gregor Richards wrote:
 Sean Kelly wrote:
 Walter Bright wrote:
 Phobos (and D) has undergone some evolution in the thinking about 
 unicode strings, and it certainly has a few anachronisms in its 
 names. But I think we've evolved to the point where going forward, we 
 know what to do:

 char[] => string
 wchar[] => wstring
 dchar[] => dstring

 These are all unicode strings. Putting non-unicode encodings in them, 
 even temporarily, should be discouraged. Non-unicode encodings should 
 use ubyte[], ushort[], etc.

This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-)

Worse looking than toUtf16?

Yes. I find the 'W' or 'D' in the middle of the name difficult to read. It literally hurts my eyes to look at that particular word. Something about the single capital letter in the middle of the word as the distinguishing characteristic, and the fact that the 'W' and 'D' do not correlate to anything meaningful in English.

Didn't someone post recently that the mind is trained to recognize words by their first and last letter? I tihnk its smoehtnig lkie taht. With toUtf8, etc, I basically just see the trailing '8' and I know what it is. Trying to pick out a 'W' or 'D' in the middle of a word is much more difficult, particularly since it is next to another capital letter.
 Would you prefer if int => int32, long => 
 int64, short => int16, byte => int8, real => float80 (portability be 
 damned), double => float64, float => float32? They'd certainly be more 
 obvious, but I can tell you I'd go crazy.

No, but I feel that this is an invalid comparison. We are talking about function names concerning type transformations, not type names. Sean
Nov 19 2007
next sibling parent "Kris" <foo bar.com> writes:
"Sean Kelly" <sean f4.ca> wrote ...
 Gregor Richards wrote:
 Sean Kelly wrote:
 Walter Bright wrote:
 Phobos (and D) has undergone some evolution in the thinking about 
 unicode strings, and it certainly has a few anachronisms in its names. 
 But I think we've evolved to the point where going forward, we know 
 what to do:

 char[] => string
 wchar[] => wstring
 dchar[] => dstring

 These are all unicode strings. Putting non-unicode encodings in them, 
 even temporarily, should be discouraged. Non-unicode encodings should 
 use ubyte[], ushort[], etc.

This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-)

Worse looking than toUtf16?

Yes. I find the 'W' or 'D' in the middle of the name difficult to read. It literally hurts my eyes to look at that particular word.

Hear hear! :o
 Something about the single capital letter in the middle of the word as the 
 distinguishing characteristic, and the fact that the 'W' and 'D' do not 
 correlate to anything meaningful in English.  Didn't someone post recently 
 that the mind is trained to recognize words by their first and last 
 letter?  I tihnk its smoehtnig lkie taht.  With toUtf8, etc, I basically 
 just see the trailing '8' and I know what it is.  Trying to pick out a 'W' 
 or 'D' in the middle of a word is much more difficult, particularly since 
 it is next to another capital letter.

Yes, it looks more akin to GoBbleDeGOOk than other options. I find such things to be as distasteful as Walter finds toUtf8 <g>
 Would you prefer if int => int32, long => int64, short => int16, byte => 
 int8, real => float80 (portability be damned), double => float64, float 
 => float32? They'd certainly be more obvious, but I can tell you I'd go 
 crazy.

No, but I feel that this is an invalid comparison. We are talking about function names concerning type transformations, not type names.

Good point
Nov 19 2007
prev sibling parent Regan Heath <regan netmail.co.nz> writes:
Sean Kelly wrote:
 Gregor Richards wrote:
 Sean Kelly wrote:
 Walter Bright wrote:
 Phobos (and D) has undergone some evolution in the thinking about 
 unicode strings, and it certainly has a few anachronisms in its 
 names. But I think we've evolved to the point where going forward, 
 we know what to do:

 char[] => string
 wchar[] => wstring
 dchar[] => dstring

 These are all unicode strings. Putting non-unicode encodings in 
 them, even temporarily, should be discouraged. Non-unicode encodings 
 should use ubyte[], ushort[], etc.

This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-)

Worse looking than toUtf16?

Yes. I find the 'W' or 'D' in the middle of the name difficult to read. It literally hurts my eyes to look at that particular word. Something about the single capital letter in the middle of the word as the distinguishing characteristic, and the fact that the 'W' and 'D' do not correlate to anything meaningful in English. Didn't someone post recently that the mind is trained to recognize words by their first and last letter? I tihnk its smoehtnig lkie taht. With toUtf8, etc, I basically just see the trailing '8' and I know what it is. Trying to pick out a 'W' or 'D' in the middle of a word is much more difficult, particularly since it is next to another capital letter.

I agree, I think I'd prefer:

  toString, toStringW, toStringD

or:

  toString, toString16, toString32

maybe with an alias for toString to toStringA, and/or toString8. There is some precedent, as the Unicode versions of Windows functions have a trailing W, i.e. CreateFileA, CreateFileW.

Regan
Nov 20 2007
prev sibling next sibling parent reply Roberto Mariottini <rmariottini mail.com> writes:
Walter Bright wrote:
[...]
 Non-unicode encodings should use ubyte[], ushort[], etc.

Are you saying that the toMBSz() function should return ubyte* not char*?

Ciao
Nov 20 2007
parent Walter Bright <newshound1 digitalmars.com> writes:
Roberto Mariottini wrote:
 Walter Bright wrote:
 [...]
 Non-unicode encodings should use ubyte[], ushort[], etc.

Are you saying that the toMBSz() function should return ubyte* not char*?

Probably.
Nov 20 2007
prev sibling parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Walter Bright wrote:
 char[] => string
 wchar[] => wstring
 dchar[] => dstring
 
 These are all unicode strings. Putting non-unicode encodings in them,
 even temporarily, should be discouraged. Non-unicode encodings should
 use ubyte[], ushort[], etc.

At last! This is the way I've been thinking it should be for a long time.

However, this requires a change to the language - make char/wchar/dchar types implicitly convertible to ubyte/ushort/uint - and a bunch of library changes - functions that don't require UTF should use ubyte/ushort/uint - in order to be practically usable. Details follow.

Assume you have an ubyte[] named iso_8859_1_string which contains a string encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying cast.

The same thing applies the other way, of course - assume the C standard library accepts ubyte* instead of char* for all the C string functions. This is more correct than the current situation, as the C standard library is encoding-independent. Now, if you have a UTF-8 string which you wish to pass to a C string handling function, you need to do, for instance: "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.

If encoding-independent functions accept only char, then it's the former case for _every_ call to a string function when you're dealing with non-UTF strings, which quickly becomes onerous. I actually tried this, but the code ended up so unreadable that I was forced to change it back, thus having arbitrarily-encoded bytes stored in char[], just for the convenience of being able to use string functions on them.

Here're the details of the solution to this problem that I've thought of:

Make char, char*, char[], etc. all implicitly castable to the corresponding ubyte types, and equivalently for wchar/ushort and dchar/uint. Then, functions which require UTF-x can continue to use [dw]char while functions which work regardless of encoding (most functions in std.string) should use ubyte. This way, the functions transparently work for [dw]string whilst still working for non-UTF.
To be precise, in the above, "work regardless of encoding" should be read as "works on more than one encoding": even a simple function like std.string.strip would have to be changed to work on EBCDIC, for instance. I would assume ASCII, especially given that D doesn't target machines older than relatively modern 32-bit computers, to be the common subset. This way ubyte[] would mean "ASCII or something else" and it's up to the programmer to not call it on functions which require ASCII. I don't think this is a problem.

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
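[Editor's note: the split being proposed above could be sketched as follows. This is a hypothetical illustration in D1-era syntax - stripBytes is an invented name, not an actual Phobos function - showing an encoding-agnostic routine typed over ubyte[] and the casts it implies today.]

```d
// Hypothetical sketch of the proposal, not actual Phobos code.
// An encoding-agnostic trim: only the ASCII whitespace values are
// inspected, so any ASCII-superset encoding works unchanged.
ubyte[] stripBytes(ubyte[] s)
{
    size_t i = 0, j = s.length;
    while (i < j && (s[i] == ' ' || s[i] == '\t')) ++i;
    while (j > i && (s[j - 1] == ' ' || s[j - 1] == '\t')) --j;
    return s[i .. j];
}

void main()
{
    // Raw ISO-8859-1 bytes: 0xE9 is e-acute, which is invalid UTF-8.
    ubyte[] iso_8859_1_string = cast(ubyte[]) "  caf\xe9  ";
    ubyte[] trimmed = stripBytes(iso_8859_1_string);

    // Today a char[] needs an explicit cast to reach the same routine;
    // under the proposed implicit char[] -> ubyte[] conversion the
    // cast below would disappear.
    char[] utf = "  hello  ".dup;
    ubyte[] trimmed2 = stripBytes(cast(ubyte[]) utf);
}
```

The point of the sketch: the function never assumes UTF-8, so it is honest for it to take ubyte[]; the implicit conversion would let UTF strings call it without ceremony.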
Nov 20 2007
next sibling parent reply Regan Heath <regan netmail.co.nz> writes:
Matti Niemenmaa wrote:
 Walter Bright wrote:
 char[] => string
 wchar[] => wstring
 dchar[] => dstring

 These are all unicode strings. Putting non-unicode encodings in them,
 even temporarily, should be discouraged. Non-unicode encodings should
 use ubyte[], ushort[], etc.

At last! This is the way I've been thinking it should be for a long time. However, this requires a change to the language - make char/wchar/dchar types implicitly convertible to ubyte/ushort/uint - and a bunch of library changes - functions that don't require UTF should use ubyte/ushort/uint - in order to be practically usable. Details follow. Assume you have an ubyte[] named iso_8859_1_string which contains a string encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying cast.

I think we should be encouraging people to convert this data to UTF-8 before calling any D string handling functions on it (those that accept w/d/char[]). Which implies all D string handling functions should only operate on UTF-8/16/32.

If they want to call a C function like those in std.c.<whatever> on it, it should just work as expected. Which implies std.c.<whatever> functions should accept ubyte* or void* or something, not char*.
 The same thing applies the other way, of course - assume the C standard library
 accepts ubyte* instead of char* for all the C string functions. This is more
 correct than the current situation, as the C standard library is
 encoding-independent. Now, if you have a UTF-8 string which you wish to pass to
 a C string handling function, you need to do, for instance:
 "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.

w/d/char[] arrays are implicitly convertable to void[] (and void*?) so perhaps C functions should accept void* instead? I mean, void* means "pointer to something/anything"... Regan
Nov 20 2007
parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Regan Heath wrote:
 I think we should be encouraging people to convert this data to UTF-8
 before calling any D string handling functions on it (those that accept
 w/d/char[]).  Which implies all D string handling functions should only
 operate on UTF-8/16/32.

This is an impossible task. Given a plaintext file, you cannot know what encoding it is in. If you assume an encoding and convert it to UTF-8 for internal use and then recode it back to that encoding for output, you may lose information.
 w/d/char[] arrays are implicitly convertable to void[] (and void*?) so
 perhaps C functions should accept void* instead?  I mean, void* means
 "pointer to something/anything"...

void* means "pointer to anything", as you say. ubyte* means "pointer to unsigned byte(s)", which is a different thing entirely. To me, ubyte[] means either integers in the range 0-255 or "arbitrary data". void[] is more like "arbitrary memory": used for hacking around language restrictions or for extremely low-level stuff such as memory management. Would you consider malloc as returning the same type of data which mbstrlen accepts? -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Nov 20 2007
parent reply Regan Heath <regan netmail.co.nz> writes:
Matti Niemenmaa wrote:
 Regan Heath wrote:
 I think we should be encouraging people to convert this data to UTF-8
 before calling any D string handling functions on it (those that accept
 w/d/char[]).  Which implies all D string handling functions should only
 operate on UTF-8/16/32.

This is an impossible task. Given a plaintext file, you cannot know what encoding it is in. If you assume an encoding and convert it to UTF-8 for internal use and then recode it back to that encoding for output, you may lose information.

Yep, but the same thing may occur calling a D string function, as it expects UTF-8 and may even convert to dchar[] internally (which would probably throw an invalid UTF exception). Worse, it might work in one version of the library and fail in another due to internal changes of that sort. Meaning, the function cannot guarantee to operate on your 'could be any encoding' data.

You'd be better off passing this data to the C function that does what you want. Convert input early and output late, I reckon.
 w/d/char[] arrays are implicitly convertable to void[] (and void*?) so
 perhaps C functions should accept void* instead?  I mean, void* means
 "pointer to something/anything"...

void* means "pointer to anything", as you say. ubyte* means "pointer to unsigned byte(s)", which is a different thing entirely. To me, ubyte[] means either integers in the range 0-255 or "arbitrary data". void[] is more like "arbitrary memory": used for hacking around language restrictions or for extremely low-level stuff such as memory management. Would you consider malloc as returning the same type of data which mbstrlen accepts?

Not the same type of data, but they could give/accept the same pointer.

  void *p = malloc(100);
  strcpy((char*)p, "test");
  printf("%d", mbstrlen(p));

Memory is memory; the only difference between char* and void* is that char* knows (thinks) it's pointing at a char.

What about other text encodings which do not have 8 bit sized 'character' pieces, like UCS-2 (but not, because UCS-2 is a subset of UTF-16 and we can handle it as such). I'm not sure any exist, so this point may be invalid, but if one did exist then ubyte[] would not be the correct way to store it; perhaps ushort[] would.

Or.. we could use void[]/void* for all types of unknown data and be done with it. Using void* basically says "we don't know the type/format of the data but we assume the function receiving the data does".

Regan
Nov 20 2007
parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Regan Heath wrote:
 Matti Niemenmaa wrote:
 Regan Heath wrote:
 I think we should be encouraging people to convert this data to UTF-8 
 before calling any D string handling functions on it (those that accept 
 w/d/char[]).  Which implies all D string handling functions should only 
 operate on UTF-8/16/32.

This is an impossible task. Given a plaintext file, you cannot know what encoding it is in. If you assume an encoding and convert it to UTF-8 for internal use and then recode it back to that encoding for output, you may lose information.

Yep, but the same thing may occur calling a D string function as it expects UTF-8 and may even convert to dchar[] internally (which would probably throw an invalid UTF exception).

Which is why I think that unless you know it's UTF-8, you should use ubyte[]. Functions which expect UTF-8 would require char[], thus causing a type error.
 You'd be better of passing this data to the C function that does what you 
 want.

There's not always a C function that does what you want available. Both Phobos's and Tango's string processing capabilities are greater than the C standard library's even for plain ASCII. The point is to make it easy to use non-UTF strings when necessary, without having to resort to huge amounts of casts or writing your own functions with the correct type signatures.
 What about other text encodings which do not have 8 bit sized 'character' 
 pieces, like UCS-2 (but not because UCS-2 is a subset of UTF-16 and we can 
 handle it as such).  I'm not sure any exist, so this point may be invalid, 
 but if one did exist then ubyte[] would not be the correct way to store it, 
 perhaps ushort[] would.

Walter mentioned ushort[] in his post, as did I in mine.
 Or.. we could use void[]/void* for all types of unknown data and be done with
 it.  Using void* basically says "we don't know the type/format of the data 
 but we assume the function receiving the data does".

I just think "void" means "typeless" or "I don't know the type". "ubyte" means something like "byte-oriented data" or "I don't care about the type". It all depends on your point of view, but I think it's nice to have a semantic difference between void and ubyte. The meaning of plain byte, on the other hand, eludes me, beyond just "integer from -128 to 127".

The problem with using void to store data is also that the garbage collectors assume it may contain pointers, and thus scan it for uncollected memory. It may also be that if they find a valid pointer (small, but nonzero, probability) they do not free memory which should be released, thus retaining it as long as the data lives, which could be as long as the program runs.

Hell, we /could/ use void[] to replace char[], byte[], and ubyte[], and why not the rest of the types, too. But this isn't asm. This is D!

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Nov 20 2007
parent Regan Heath <regan netmail.co.nz> writes:
Matti Niemenmaa wrote:
 The meaning of plain byte, on the other hand, eludes me, beyond just "integer
 from -128 to 127".

To my mind byte = "signed interpretation of 8 bits". Regan
Nov 21 2007
prev sibling next sibling parent reply Julio César Carrascal Urquijo writes:
Matti Niemenmaa wrote:
 Assume you have an ubyte[] named iso_8859_1_string which contains a string
 encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to
 work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)"
-
 note the annoying cast.

You can't assume that a function designed to work on UTF-8 strings works with ISO-8859-1 strings. Beyond the ASCII range UTF-8 isn't compatible with any other charset.
 The same thing applies the other way, of course - assume the C standard library
 accepts ubyte* instead of char* for all the C string functions. This is more
 correct than the current situation, as the C standard library is
 encoding-independent. Now, if you have a UTF-8 string which you wish to pass to
 a C string handling function, you need to do, for instance:
 "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.

This is probably the actual problem: C string functions should accept ubyte* instead of char* because a ubyte doesn't have an implied encoding while char does.
 If encoding-independent functions accept only char, then it's the former case
 for _every_ call to a string function when you're dealing with non-UTF strings,
 which quickly becomes onerous.

Unless you are referring to a conversion library like ICU, I don't understand your point on "encoding-independent functions". Phobos' string functions aren't "encoding-independent".
 I actually tried this, but the code ended up so unreadable that I was forced to
 change it back, thus having arbitrarily-encoded bytes stored in char[], just
for
 the convenience of being able to use string functions on them.

If you've done that I fear you'll see lots of exceptions appearing in your string handling code once you deliver your program to any non-english speaking user.
 Here're the details of the solution to this problem that I've thought of:
 
 Make char, char*, char[], etc. all implicitly castable to the corresponding
 ubyte types, and equivalently for wchar/ushort and dchar/uint. Then, functions
 which require UTF-x can continue to use [dw]char while functions which work
 regardless of encoding (most functions in std.string) should use ubyte. This
 way, the functions transparently work for [dw]string whilst still working for
 non-UTF.

Most functions in std.string *require* UTF-8 or they'll blow up with an "Error: invalid UTF-8 sequence" message.

Actually, I think the implicit casting would be useful for string literals:

  byte[] foo = "Julio César";    // In ISO-8859-1.

But then I need some way to tell the compiler that the string is in ISO-8859-1. What I don't see is where your proposal helps with the example you were giving. For example, if I try to uppercase foo I would get an exception:

  toupper(foo);    // BOOM!
 To be precise, in the above, "work regardless of encoding" should be read as
 "works on more than one encoding": even a simple function like std.string.strip
 would have to be changed to work on EBCDIC, for instance. I would assume ASCII,
 especially given that D doesn't target machines older than relatively modern
 32-bit computers, to be the common subset. This way ubyte[] would mean "ASCII
or
 something else" and it's up to the programmer to not call it on functions which
 require ASCII. I don't think this is a problem.

I think this is unrealistic unless you want to change std.string to be something more like ICU. There are just too many (popular) encodings and variations in use today... and you'll have to support most of them once you start promising to "work on more than one encoding". Even Unicode has UCS, which is the not-quite-UTF encoding used in Windows NT4 (yes, there are still lots of machines using NT4).

-- 
Julio César Carrascal Urquijo
http://jcesar.artelogico.com/
Nov 20 2007
next sibling parent Regan Heath <regan netmail.co.nz> writes:
Julio César Carrascal Urquijo wrote:
 Even Unicode has UCS which is the not-quite-UTF encoding used in Windows 
 NT4 (yes, there are still lots of machines using NT4).

FYI: You probably already know this but I wanted to be sure, plus others might find it of interest..

http://en.wikipedia.org/wiki/UTF-16

UCS-2 is not quite UTF-16, but UCS-2 is a subset of UTF-16 ("upwards compatibility from UCS-2 to UTF-16"); it's essentially UTF-16 without the surrogate pairs. So, in D you can generally* say:

  wchar[] data = cast(wchar[]) std.file.read("filename");

and it should work without throwing any invalid UTF errors.

* this may depend on whether it's UCS-2, UCS-2BE, or UCS-2LE. I'm not sure which format D's UTF-16 is in.

Regan
Nov 21 2007
prev sibling parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Julio César Carrascal Urquijo wrote:
 Matti Niemenmaa wrote:
 Assume you have an ubyte[] named iso_8859_1_string which contains a string
  encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it
  to work, you need to call 
 "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying
  cast.

You can't assume that a function designed to work on an UTF-8 strings works with ISO-8859-1 strings. Beyond the ASCII range UTF-8 isn't compatible with any other charset.

I am well aware of this. I chose strip as an example because it does work on any encoding: it simply calls std.ctype.isspace on each char.
 This is probably the actual problem: C string functions should accept ubyte* 
 instead of char* because a ubyte doesn't have an implied encoding while char 
 does.

Yes. But there are also many D string functions which would work on any encoding.
 If encoding-independent functions accept only char, then it's the former 
 case for _every_ call to a string function when you're dealing with non-UTF
  strings, which quickly becomes onerous.

Unless you are referring to a conversion library like ICU, I don't understand your point on "encoding-independent functions". Phobos' string functions aren't "encoding-independent".

Most are, actually, except for the fact that D character constants are always ASCII. Almost all the std.string functions will work for any "extended ASCII" encoding. And that's what I mean. Given that D doesn't target the kind of machines that use EBCDIC, I use "encoding-independent" to mean either "works on any encoding" or "works on any encoding with ASCII as the lower 128 values".
 I actually tried this, but the code ended up so unreadable that I was 
 forced to change it back, thus having arbitrarily-encoded bytes stored in 
 char[], just for the convenience of being able to use string functions on 
 them.

If you've done that I fear you'll see lots of exceptions appearing in your string handling code once you deliver your program to any non-english speaking user.

Trust me, I know what I'm doing. For instance, the integer conversion functions in std.conv only look for values in the range '0' to '9', ignoring all others. If the encoding has the digits in the same place as ASCII, it will work, regardless of what all the other bytes in the encoding are. If the encoding has the digits in a different place than ASCII, then it won't work, true. But I think you'll find that using EBCDIC or another non-ASCII-based encoding will confuse most of the programs you've got installed on your computer.
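[Editor's note: the digit scan described above could be sketched like so. This is a hypothetical helper in the spirit of std.conv, not its actual code; scanDigits is an invented name.]

```d
// Hypothetical sketch, not the real std.conv implementation: only the
// ASCII codes for '0'..'9' are recognized, so the scan behaves the same
// for any encoding that keeps the digits at their ASCII positions
// (ISO-8859-*, UTF-8, ...) and fails only for the likes of EBCDIC.
int scanDigits(ubyte[] s)
{
    int value = 0;
    foreach (ubyte b; s)
    {
        if (b < '0' || b > '9')
            break;                  // any non-digit byte ends the scan
        value = value * 10 + (b - '0');
    }
    return value;
}
```

Passing it the ISO-8859-1 bytes for "42°C" yields 42, even though the tail bytes are not valid UTF-8; the function simply never looks at them as UTF.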
 Most function in std.string *require* UTF-8 or they'll blow up with a "Error:
  4invalid UTF-8 sequence" message.

No, they do not. Some do, but not most. Of all the functions that take char[] or char* in std.string:

  Functions requiring UTF-8: 22
  Functions not requiring UTF-8: 35
 Actually, I think the implicit casting would be useful for string literals:
 
 byte[] foo = "Julio César";    // In ISO-8859-1.
 
 But then I need some way to tell the compiler that the string is in 
 ISO-8859-1. What I don't see is where does your proposal helps with the 
 example you were giving. For example, if I try to uppercase foo I would get 
 an exception:
 
 toupper(foo);    // BOOM!

True, you would, because std.string.toupper assumes UTF-8. Hence, its type should be string(string), which you couldn't call with byte[], since byte[] doesn't implicitly convert to char[].

But consider what happens now with char[]. The following program compiles, but blows up at runtime:

  import std.string;
  void main()
  {
      char[] foo = "Julio C\xe9sar";
      toupper(foo);
  }

An amendment to my proposal to correct this would be that hex strings, and any string which contains a byte sequence which is not valid UTF, would become ubyte/ushort/uint. Thus the above would fail with a type error because the type of the literal is ubyte[], and it cannot be assigned to a char[]. If the type of foo were ubyte[], calling toupper would fail with a type error. Thereby the only way to get the program above to compile, aside from changing the string literal to UTF-8, would be with a cast, which shows that there's something unsafe going on.
 I think this is unrealistic unless you want to change std.string to be 
 something more like ICU. There are just too many (popular) encodings and 
 variations in use today... and you'll have to support most of them once you 
 start promising to "works on more than one encoding".

By "works on more than one encoding" I meant "works for anything with ASCII as the lower 128 bytes". You'll find that covers the majority of encodings in common use today. -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Nov 21 2007
parent reply Regan Heath <regan netmail.co.nz> writes:
Matti Niemenmaa wrote:
 Julio César Carrascal Urquijo wrote:
 Matti Niemenmaa wrote:
 Assume you have an ubyte[] named iso_8859_1_string which contains a string
  encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it
  to work, you need to call 
 "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying
  cast.

You can't assume that a function designed to work on UTF-8 strings works with ISO-8859-1 strings. Beyond the ASCII range UTF-8 isn't compatible with any other charset.

I am well aware of this. I chose strip as an example because it does work on any encoding: it simply calls std.ctype.isspace on each char.

But, this behaviour isn't guaranteed. In fact I would expect that in future a library like iconv will be leveraged to determine if a character 'is a space' and it will assume the input data is UTF-8. So, if your ASCII based encoding has characters outside the ASCII range and they just happen to match a valid 'is a space' character from the UTF-8 set, then .. whoops.

Now, I don't have a canonical knowledge of character sets so it may be that there are no space characters outside the ASCII range defined in UTF-8... (perhaps when you include surrogate pairs?) or, even if they exist, the chance of an ASCII based character set using that value may be pretty small.

Who knows; all I'm saying is that if a function says it accepts char[] then it is saying "I accept valid UTF-8" and not "I accept any ASCII based character data", so all bets are off if you pass it anything other than UTF-8.
 This is probably the actual problem: C string functions should accept ubyte* 
 instead of char* because a ubyte doesn't have an implied encoding while char 
 does.

Yes. But there are also many D string functions which would work on any encoding.

At present. But that's not guaranteed and it may change in the future, in fact, I expect it to. As far as I can see the only guaranteed thing is that the C functions will not change and will continue to accept ASCII based character sets without possible future gotchas. So, if you must perform string manipulation on non UTF data then you should either write your own functions, or use the C ones. Regan
Nov 21 2007
parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Regan Heath wrote:
 But, this behvaiour isn't guaranteed.  In fact I would expect that in
 future a library like iconv will be leveraged to determine if a
 character 'is a space' and it will assume the input data is UTF-8.

You're right. See below.
 So, if your ASCII based encoding has characters outside the ASCII range
 and they just happen to match a valid 'is a space' character from the
 UTF-8 set, then .. whoops.
 
 Now, I don't have a canonical knowledge of character sets so it may be
 that there are no space characters outside the ASCII range defined in
 UTF-8... (perhaps when you include surogate pairs?) or, even if they
 exist the chance of an ASCII based character set using that value may be
 pretty small.

std.string.LS and std.string.PS are two examples of Unicode whitespace characters. Strip, for some reason, does not strip them.
 Who knows, all I'm saying is that if a function says it accepts char[]
 then it is saying "I accept valid UTF-8" and not "I accept any ASCII
 based character data" so all bets are off if you pass it anything other
 than UTF-8.

You are correct, which is exactly my point: char[] should mean UTF-8 whereas currently many functions use it to mean "text with single-byte characters". That std.string.strip uses char[] currently says nothing about whether it expects UTF-8 or not. Were the std.c package converted to use ubyte[] everywhere, there would be a clear distinction between UTF-8 and "anything". Then, as you say, one should interpret std.string.* as accepting only UTF-8.
 As far as I can see the only guaranteed thing is that the C functions
 will not change and will continue to accept ASCII based character sets
 without possible future gotchas.
 
 So, if you must perform string manipulation on non UTF data then you
 should either write your own functions, or use the C ones.

Correct. The point is that storing non-UTF data in ubyte/ushort/uint is a difficult task because even the C functions take char (or wchar_t, which I think is wchar on Windows and dchar elsewhere) and thus the code quickly becomes castville: cast here, cast there, everywhere a cast cast - and for no good reason.

Thus I believe, as per my original proposal, that library functions be converted to use ubyte[] where they are not meant to accept char[]. This may or may not mean changes in std.string - it's up to the Phobos maintainers to make the choice as to whether a function will ever require UTF-8, and whether to type it as taking char[] or ubyte[]. In any case, at least the C functions should take ubyte[].

The implicit casting from char-whatever to ubyte-whatever is useful when you want to call C functions with D strings. Once again the code would rapidly become castville if it would have to be done explicitly.

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
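[Editor's note: the retyped C bindings argued for above might look like this. These are hypothetical declarations for illustration; the real std.c headers of the time used char*.]

```d
// Hypothetical sketch: C bindings retyped over ubyte*, so non-UTF
// buffers pass without casts. With the proposed implicit
// char[] -> ubyte[] conversion, UTF-8 strings would pass uncast too.
extern (C)
{
    size_t strlen(ubyte* s);            // byte-string length
    ubyte* strchr(ubyte* s, int c);     // find a byte in a string
}

// Caller side: no casts, whatever the encoding of the bytes.
size_t lengthOf(ubyte[] buf)
{
    // Assumes buf is null-terminated, as the C functions require.
    return strlen(buf.ptr);
}
```

The types now document the contract: C's byte-string functions are encoding-agnostic, and only functions that genuinely assume UTF-8 keep char* in their signatures.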
Nov 21 2007
parent Regan Heath <regan netmail.co.nz> writes:
Matti Niemenmaa wrote:
 Regan Heath wrote:
 The point is that storing non-UTF data in ubyte/ushort/uint is a difficult task
 because even the C functions take char (or wchar_t, which I think is wchar on
 Windows and dchar elsewhere) and thus the code quickly becomes castville. cast
 here, cast there, everywhere a cast cast - and for no good reason.

Yeah, agreed 100%
 Thus I believe, as per my original proposal, that library functions be
converted
 to use ubyte[] where they are not meant to accept char[]. This may or may not
 mean changes in std.string - it's up to the Phobos maintainers to make the
 choice as to whether a function will ever require UTF-8, and whether to type it
 as taking char[] or ubyte[]. In any case, at least the C functions should take
 ubyte[].

Agreed. I would tend to leave the std.string functions taking char[] so that when they finally step up and have complete UTF compatibility their signatures do not change.

If we need some functions, like strip, as a stop gap for other encodings then I reckon we add them, perhaps to a different module, and we use ubyte* (or whatever) instead of char[] for the input parameter.
 The implicit casting from char-whatever to ubyte-whatever is useful when you
 want to call C functions with D strings. Once again the code would rapidly
 become castville if it would have to be done explicitly.

The only problem I have with implicit cast to ubyte-whatever is that I worry it will have an unexpected side effect somewhere... Perhaps I am being alarmist.

Regan
Nov 21 2007
prev sibling parent reply Robert DaSilva <sp.unit.262+digitalmars gmail.com> writes:
Matti Niemenmaa wrote:
 Walter Bright wrote:
 char[] => string
 wchar[] => wstring
 dchar[] => dstring

 These are all unicode strings. Putting non-unicode encodings in them,
 even temporarily, should be discouraged. Non-unicode encodings should
 use ubyte[], ushort[], etc.

At last! This is the way I've been thinking it should be for a long time. However, this requires a change to the language - make char/wchar/dchar types implicitly convertible to ubyte/ushort/uint - and a bunch of library changes - functions that don't require UTF should use ubyte/ushort/uint - in order to be practically usable. Details follow.

Assume you have an ubyte[] named iso_8859_1_string which contains a string encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying cast.

The same thing applies the other way, of course - assume the C standard library accepts ubyte* instead of char* for all the C string functions. This is more correct than the current situation, as the C standard library is encoding-independent. Now, if you have a UTF-8 string which you wish to pass to a C string handling function, you need to do, for instance: "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.

If encoding-independent functions accept only char, then it's the former case for _every_ call to a string function when you're dealing with non-UTF strings, which quickly becomes onerous. I actually tried this, but the code ended up so unreadable that I was forced to change it back, thus having arbitrarily-encoded bytes stored in char[], just for the convenience of being able to use string functions on them.

Here're the details of the solution to this problem that I've thought of: make char, char*, char[], etc. all implicitly castable to the corresponding ubyte types, and equivalently for wchar/ushort and dchar/uint. Then, functions which require UTF-x can continue to use [dw]char while functions which work regardless of encoding (most functions in std.string) should use ubyte. This way, the functions transparently work for [dw]string whilst still working for non-UTF.

To be precise, in the above, "work regardless of encoding" should be read as "works on more than one encoding": even a simple function like std.string.strip would have to be changed to work on EBCDIC, for instance. I would assume ASCII, especially given that D doesn't target machines older than relatively modern 32-bit computers, to be the common subset. This way ubyte[] would mean "ASCII or something else" and it's up to the programmer to not call it on functions which require ASCII. I don't think this is a problem.
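Matti's proposal can be sketched as follows; stripBytes is a hypothetical helper, not an existing std.string function, and it assumes an ASCII-compatible encoding as he describes:

```d
// Hypothetical encoding-agnostic strip, typed ubyte[] as proposed.
// It only inspects bytes below 0x80, so it works on any ASCII-compatible
// encoding (Latin-1, UTF-8, ...) without casts at the call site.
ubyte[] stripBytes(ubyte[] s)
{
    size_t lo = 0, hi = s.length;
    while (lo < hi && (s[lo] == 0x20 || s[lo] == 0x09)) ++lo;
    while (hi > lo && (s[hi - 1] == 0x20 || s[hi - 1] == 0x09)) --hi;
    return s[lo .. hi];
}

void main()
{
    // "  café  " in ISO-8859-1: 0xE9 is not valid UTF-8, so this data
    // has no business in a char[] - but ubyte[] holds it happily.
    ubyte[] iso_8859_1_string =
        [0x20, 0x20, 0x63, 0x61, 0x66, 0xE9, 0x20, 0x20];
    assert(stripBytes(iso_8859_1_string) == [0x63, 0x61, 0x66, 0xE9]);
}
```

No *cast(char[]*) gymnastics are needed at the call site, which is the whole point of typing encoding-independent functions as ubyte[].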

Perhaps {,w,d}char should become typedefs of u{byte,short,int} and be dropped as keywords?
Nov 20 2007
parent Matti Niemenmaa <see_signature for.real.address> writes:
Robert DaSilva wrote:
 Perhaps {,w,d}char should become typedefs of u{byte,short,int} and be
 dropped as keywords?

There would still need to be special handling for char and string literals, at least (since they're defined as UTF), but yes, this is a possibility. -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Nov 21 2007
prev sibling next sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message 
news:fhsts6$5nn$1 digitalmars.com...
 As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 be 
 named toString, toWString, and toDString, respectively, and Unicode should 
 be assumed as the standard encoding format in D.

Do you want to know my single overriding reason for wanting toString instead of toUtf8? Because it's nicer-looking and easier to type.

My other reasons include consistency (Java uses .toString, .Net uses .ToString, Phobos uses .toString) and that "toUtf8" screams "I'm a string class and this method converts my encoding!" while "toString" says "convert this object, whatever it is, to some kind of string."

votes += 8 for toString, toWString, and toDString.
Nov 19 2007
next sibling parent reply Gregor Richards <Richards codu.org> writes:
Jarrett Billingsley wrote:
 Do you want to know my single overriding reason for wanting toString instead 
 of toUtf8?  Because it's nicer-looking and easier to type.

Hear hear.
Nov 19 2007
next sibling parent reply "Kris" <foo bar.com> writes:
With respect to all, you're perhaps not addressing Sean's deeper questions? 
Instead, this seems like another bunch of "toUtf8! NO! U ... toString() 
dammit!"

Which is kinda superficial at this point?


"Gregor Richards" <Richards codu.org> wrote in message 
news:fht0qs$9rk$1 digitalmars.com...
 Jarrett Billingsley wrote:
 Do you want to know my single overriding reason for wanting toString 
 instead of toUtf8?  Because it's nicer-looking and easier to type.

Hear hear.

Nov 19 2007
parent Gregor Richards <Richards codu.org> writes:
Kris wrote:
 With respect to all, you're perhaps not addressing Sean's deeper questions? 
 Instead, this seems like another bunch of "toUtf8! NO! U ... toString() 
 dammit!"
 
 Which is kinda superficial at this point?
 
 
 "Gregor Richards" <Richards codu.org> wrote in message 
 news:fht0qs$9rk$1 digitalmars.com...
 Jarrett Billingsley wrote:
 Do you want to know my single overriding reason for wanting toString 
 instead of toUtf8?  Because it's nicer-looking and easier to type.



Confusing it's, post top don't. Actually, we're not even addressing Sean's surficial questions. - Gregor Richards
Nov 19 2007
prev sibling parent =?ISO-8859-1?Q?Julio_C=E9sar_Carrascal_Urquijo?= writes:
Gregor Richards wrote:
 Jarrett Billingsley wrote:
 Do you want to know my single overriding reason for wanting toString 
 instead of toUtf8?  Because it's nicer-looking and easier to type.

Hear hear.

Actually, Jarrett's other arguments seemed more compelling to me. Jarrett Billingsley wrote:
 My other reasons include consistency (Java uses .toString, .Net uses
 .ToString, phobos uses .toString) and that "toUtf8" screams "I'm a string
 class and this method converts my encoding!" while "toString" says "convert
 this object, whatever it is, to some kind of string."

Stating intent seems more important to me than the stylistic issues between toString and toUtf8. I'm all for toString / toWString / toDString for a readable representation of a class and toUtf8 / 16 / 32 for converting encodings. Also, toStringW seems more readable than toWString, but for me it's not a big deal which one the Tango developers choose.

-- 
Julio César Carrascal Urquijo
http://jcesar.artelogico.com/
Nov 20 2007
prev sibling parent BCS <ao pathlink.com> writes:
Reply to Jarrett,

 "Sean Kelly" <sean f4.ca> wrote in message
 news:fhsts6$5nn$1 digitalmars.com...
 
 As an alternative, I can only suggest that toUTF8, toUTF16, and
 toUTF32 be named toString, toWString, and toDString, respectively,
 and Unicode should be assumed as the standard encoding format in D.
 

Do you want to know my single overriding reason for wanting toString instead of toUtf8? Because it's nicer-looking and easier to type. My other reasons include consistency (Java uses .toString, .Net uses .ToString, phobos uses .toString) and that "toUtf8" screams "I'm a string class and this method converts my encoding!" while "toString" says "convert this object, whatever it is, to some kind of string." votes += 8 for toString, toWString, and toDString.

shouldn't that be?

votes += 8 for toString
votes += 16 for toWString
votes += 32 for toDString
Nov 19 2007
prev sibling next sibling parent reply Christopher Wright <dhasenan gmail.com> writes:
Sean Kelly wrote:
 I was looking at converting Tango's use of toUtf8 to toString today and 
 ran into a bit of a quandry....

toUtf8 is ugly.

toString/toWString/toDString are opaque and ugly, hard to distinguish from each other.

toString, toStringW, toStringD? Still ugly.

toUtf, toUtf16, toUtf32? Slightly less clear, but easier to type.

toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits well with other conventions.
Nov 19 2007
parent reply Sean Kelly <sean f4.ca> writes:
Christopher Wright wrote:
 Sean Kelly wrote:
 I was looking at converting Tango's use of toUtf8 to toString today 
 and ran into a bit of a quandry....

toUtf8 is ugly. toString/toWString/toDString are opaque and ugly, hard to distinguish from each other. toString, toStringW, toStringD? Still ugly. toUtf, toUtf16, toUtf32? Slightly less clear, but easier to type. toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits well with other conventions.

I tend to place a tremendous amount of value on consistency, because the more consistent an API is, the more likely my guesses about it are to be correct. In my opinion, that precludes using the option you suggest.

In my opinion, Walter's suggestion that alternate encodings not be stored in strings is sufficient reason to not bother with the encoding format in the function name (ie. toUtf8/toUtf16/toUtf32). I might counter that I don't see any reason to lose meaning where it is so easily provided, but on the other hand, I agree that new users are more likely to know what a function named toString does than were it named toUtf8. These two points are a wash in my opinion.

The remaining concerns are less substantive. I find toWString and toDString difficult to read, but those feelings hold little more weight than "toUtf8 is ugly." I also feel that the term "string" is largely meaningless in programming. But I certainly couldn't win a debate with either point.

I don't suppose there is anyone who does a lot of internationalization programming who can comment on the utility of one convention vs. the other? I would love to hear some more practical concerns regarding the naming convention for these functions.

Sean
Nov 19 2007
next sibling parent Bill Baxter <dnewsgroup billbaxter.com> writes:
Sean Kelly wrote:
 Christopher Wright wrote:
 Sean Kelly wrote:
 I was looking at converting Tango's use of toUtf8 to toString today 
 and ran into a bit of a quandry....

toUtf8 is ugly. toString/toWString/toDString are opaque and ugly, hard to distinguish from each other. toString, toStringW, toStringD? Still ugly. toUtf, toUtf16, toUtf32? Slightly less clear, but easier to type. toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits well with other conventions.

I tend to place a tremendous amount of value on consistency, because the more consistent an API is, the more likely my guesses about it are to be correct. In my opinion, that precludes using the option you suggest. In my opinion, Walter's suggestion that alternate encodings not be stored in strings is sufficient reason to not bother with the encoding format in the function name (ie. toUtf8/toUtf16/toUtf32). I might counter that I don't see any reason to lose meaning where it is so easily provided, but on the other hand, I agree that new users are more likely to know what a function named toString does than were it named toUtf8. These two points are a wash in my opinion. The remaining concerns are less substantive. I find toWString and toDString difficult to read, but those feelings hold little more weight than "toUtf8 is ugly." I also feel that the term "string" is largely meaningless in programming. But I certainly couldn't win a debate with either point. I don't suppose there is anyone who does a lot of internationalization programming who can comment on the utility of one convention vs. the other? I would love to hear some more practical concerns regarding the naming convention for these functions.

My just formed opinion :-) is that any sort of toWstring/toDstring functions should be standalone things that only accept type "string" or "char" as input. Yes there will be some performance penalty in some cases, but I don't think that's significant enough to warrant creating lots of functions that do exactly the same thing, just with different encodings. --bb
Nov 19 2007
prev sibling next sibling parent "David B. Held" <dheld codelogicconsulting.com> writes:
Sean Kelly wrote:
 [...]
 I don't suppose there is anyone who does a lot of internationalization 
 programming who can comment on the utility of one convention vs. the 
 other?  I would love to hear some more practical concerns regarding the 
 naming convention for these functions.

I certainly don't qualify as someone who does a "lot" of i18n programming, but I do some. Regardless, I would have to say that when I see a function called toUtfXX(), I think "Oh, that must convert a string from Latin-1 or something", rather than "Oh, that must give me the UTF-XX representation of an object".

Perl is a bad example because it didn't get righteous UTF-8 support until 5.8, but whenever you see "utf8" or similar in a Perl program, it almost invariably involves an encoding/decoding operation. Perhaps it is worth noting that whenever you see "UTF-8" in Java, it most likely has to do with encoding/decoding. And the same is true of C#, etc. So it appears that the precedent is that for most other languages, when "UTF-8" is spelled out explicitly, it is usually in a transcoding context.

I don't think toWString() is an ideal name, but it seems to have the right connotations to the naive programmer.

Dave
Nov 20 2007
prev sibling next sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Sean Kelly wrote:
 Christopher Wright wrote:
 toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits 
 well with other conventions.

I tend to place a tremendous amount of value on consistency, because the more consistent an API is, the more likely my guesses about it are to be correct. In my opinion, that precludes using the option you suggest.

IMHO, the consistent alternative is pretty clear:

char  -> string  -> toString
wchar -> wstring -> toWString
dchar -> dstring -> toDString

The only problem seems to lie in the aesthetics of the camelCase convention, but doesn't consistency trump aesthetics?
 In my opinion, Walter's suggestion that alternate encodings not be 
 stored in strings is sufficient reason to not bother with the encoding 
 format in the function name (ie. toUtf8/toUtf16/toUtf32). 

I agree, but this is hardly a new suggestion. I think it has always been pretty clear that one should never store anything but UTF-encoded data in {,w,d}char[]s. Also, I have always felt Tango's toUtf{8,16,32} are a bit too explicitly named. Almost like using toSingleIEEE754 instead of toFloat.
 I don't suppose there is anyone who does a lot of internationalization 
 programming who can comment on the utility of one convention vs. the 
 other?  I would love to hear some more practical concerns regarding the 
 naming convention for these functions.

I have done quite a bit of text processing and handling of different encodings in D, and while naming doesn't matter much as long as it is consistent, what I do is:

* use {,w,d}char strictly for UTF data (I have sometimes cheated here, mainly to be able to use certain std.string functions, but with a good templated string/array library (such as in Tango), that is not necessary)
* use Unicode internally as much as possible, transcoding as early and as late as possible.
* when there is a reason not to use UTF internally, use typedefs like "typedef char lat1", and keep unknown encodings as ubyte[]s.

Knowing that {,w,d}chars always contain UTF has never been a problem. The problems that arise come instead from mistakenly using char rather than {,u}byte in C APIs, and from D's horrible behavior of by default crashing instead of recovering from UTF errors.

A much better default behavior would be to simply substitute illegal UTF units with a '?' and keep going. Having to remember to sanitize all untrusted Unicode strings is a chore, and forgetting that at any point will lead to crashes in running code at inconvenient situations.

-- 
Oskar
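Oskar's "substitute a '?' and keep going" policy can be sketched with today's std.utf names (sanitize is an assumed helper, not a library function):

```d
import std.utf : decode, UTFException;

// Re-encode a string, replacing each byte that breaks decoding with '?'
// instead of letting the UTF error propagate and crash the program.
string sanitize(string s)
{
    char[] result;
    size_t i = 0;
    while (i < s.length)
    {
        size_t j = i;
        try
        {
            decode(s, j);           // advances j past one valid code point
            result ~= s[i .. j];
        }
        catch (UTFException)
        {
            result ~= '?';          // substitute for the offending byte
            j = i + 1;              // resynchronize one byte later
        }
        i = j;
    }
    return cast(string) result;
}
```

Whether replacement should be the default, as Oskar argues, or an opt-in like this helper, is exactly the policy question the post raises.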
Nov 20 2007
parent Sean Kelly <sean f4.ca> writes:
Oskar Linde wrote:
 Sean Kelly wrote:
 Christopher Wright wrote:
 toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits 
 well with other conventions.

I tend to place a tremendous amount of value on consistency, because the more consistent an API is, the more likely my guesses about it are to be correct. In my opinion, that precludes using the option you suggest.

IMHO, the consistent alternative is pretty clear:

char  -> string  -> toString
wchar -> wstring -> toWString
dchar -> dstring -> toDString

The only problem seems to lie in the aesthetics of the camelCase convention, but doesn't consistency trump aesthetics?

It depends :-) I prefer the suggested toStringW and toStringD convention. While it doesn't exactly match the returned type name in letter order, the same information is communicated and is done in what I feel is a more readable format. Also, if the words were placed in a larger list and then sorted, they would end up adjacent to one another.
 In my opinion, Walter's suggestion that alternate encodings not be 
 stored in strings is sufficient reason to not bother with the encoding 
 format in the function name (ie. toUtf8/toUtf16/toUtf32). 

I agree, but this is hardly a new suggestion. I think it has always been pretty clear that one should never store anything but UTF-encoded data in {,w,d}char[]s.

Yup. But to me, this is different from a semi-official declaration to this effect. With the latter, the suggestion is more likely to be enforceable.
 Also, I have always felt Tangos toUtf{8,16,32} are a 
 bit too explicitly named. Almost like using toSingleIEEE754 instead of 
 toFloat.

Fair enough :-)
 I don't suppose there is anyone who does a lot of internationalization 
 programming who can comment on the utility of one convention vs. the 
 other?  I would love to hear some more practical concerns regarding 
 the naming convention for these functions.

I have done quite a bit of text processing and handling of different encodings in D, and while naming doesn't matter much as long as it is consistent, what I do is:

* use {,w,d}char strictly for UTF data (I have sometimes cheated here, mainly to be able to use certain std.string functions, but with a good templated string/array library (such as in Tango), that is not necessary)
* use Unicode internally as much as possible, transcoding as early and as late as possible.
* when there is a reason not to use UTF internally, use typedefs like "typedef char lat1", and keep unknown encodings as ubyte[]s.

Knowing that {,w,d}chars always contain UTF has never been a problem. The problems that arise come instead from mistakenly using char rather than {,u}byte in C APIs, and from D's horrible behavior of by default crashing instead of recovering from UTF errors.

Darnit, I forgot about the C APIs. I'll have to replace their use of char with char_t or c_char (the latter matches c_long but the former matches wchar_t).
 A much better default behavior would be to simply substitute illegal 
 UTF-units with a '?' and keep going. Having to remember to sanitize all 
 untrusted unicode strings is a chore, and forgetting that at any point 
 will lead to crashes in running code at inconvenient situations.
 

This is useful information. Thanks. Sean
Nov 20 2007
prev sibling parent James Dennett <jdennett acm.org> writes:
Sean Kelly wrote:
 I don't suppose there is anyone who does a lot of internationalization
 programming who can comment on the utility of one convention vs. the
 other?  I would love to hear some more practical concerns regarding the
 naming convention for these functions.

A D-wide (at least optionally *enforced*) specification that the various types of "character" arrays are really strings, not just arrays of the underlying storage types, would mean that no convention would be needed to convey the meaning, and the simpler name could be used safely (as the type system would imply the encoding). (I can't help but think that this is one more reason why string types should *not* be built-in arrays, even if they are known to the compiler, but I think my chances of persuading Walter that string==array is a mistake are three quarters of ten percent of none at all.)

In the absence of a language-enforced/mandated encoding, it's up to the library to force programmers to consider these issues; in that case, names making the encoding clear (even names as ugly as toUtf8) are better than a more readable, more generic but less intention-conveying name like toString.

Most of the code I see (in C, C++, Java and more) is far too sloppy about knowing which encoding is used for a given string. Unicode is now mature enough to make some sense for the default in programming languages. Ideally I'd make the encoding something akin to a template parameter, so that the compiler's type-checking could help out -- but I digress into language design (as is inevitable when high level facilities like strings are made part of the language rather than being "just" standard library features).

-- 
James
Dec 16 2007
prev sibling next sibling parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
Sean Kelly wrote:
 I was looking at converting Tango's use of toUtf8 to toString today and 
 ran into a bit of a quandary.  Currently, Tango's use of toUtf8 as the 
 member function for returning char strings is consistent with all use of 
 string operations in Tango.  Routines that return wchar strings are 
 named toUtf16 whether they are members of the String class or whether 
 they are intended to perform UTF conversions, and so on.  Thus, the 
 convention is consistent and pervasive.
 
 What I discovered during a test conversion of Tango was that converting 
 all uses of toUtf8 to toString /except/ those intended to perform UTF 
 conversions reduced code clarity, and left me unsure as to which name I 
 would actually use in a given situation.  For example, there is quite a 
 bit of code in the text and io packages which convert an arbitrary type 
 to a char[] for output, etc.  So by making this change I was left with 
 some conversions using toString and others using toUtf8, toUtf16, and 
 toUtf32, not to mention the fromXxx versions of these same functions. As 
 this is template code, the choice between toString and toUtf8 in a given 
 situation was unclear.  Given this, I decided to look to Phobos for a 
 model to follow.
 
 What I found in Phobos was that it suffers from the same situation as I 
 found Tango in during my test conversion.  Routines that convert any 
 type but a string to a char[] are named toString, while the string 
 equivalent is named toUTF8.  Given this, I surmised that the naming 
 convention in D is that all strings are assumed to be Unicode, except 
 when they're not.  String literals are required to be Unicode, foreach 
 assumes strings to be UTF encoded when performing its automatic 
 conversions, and all of the toString functions in std.string assume 
 UTF-8 as the output format.  So why bother with the name toUTF8 in std.utf?
 
 As near as I can tell, the reason for text conversion routines to be 
 named differently is to simplify the use of routines which convert to 
 another format.  std.windows.charset, for example, has a routine called 
 toMBSz, to distinguish from the toUTF8 routine.  What I find significant 
 about this is that it suggests that while the transport mechanism for 
 strings is the same in each case (both routines return a char[], ie. a 
 string), 

Does that even work? I would think there are some valid MBSz's that are invalid UTF sequences, and so toMBSz would have to return byte[].
 the underlying encoding is different.  Thus there seems a clear 
 disconnect between the name of the transport mechanism (string), and 
 routines that generate them.  With this in mind, I begin to question the 
 point of having toString as the common name for routines that generate 
 char strings.  The encoding clearly matters in some instances and cannot 
 be ignored, so ignoring it in others just seems to confuse things.

As far as I'm concerned Utf8 is *the* encoding for text in D. Anything else is only for some special purpose like ease of manipulation (dstring for I18N text that needs fast searching / slicing) or interchange with external APIs (utf16 for working with windows).
 With this in mind, I will admit that I am questioning the merit of 
 changing Tango's toUtf8 routines to be named toString.  Doing so seems 
 to sacrifice both operational consistency and clarity in an attempt to 
 maintain consistency with the name of the transport mechanism: string. 
 And as I have said above, while strings in D are generally expected to 
 be Unicode, they are clearly not always Unicode, as the existence of 
 std.windows.charset can attest.  

I really think toMBSz should be returning byte[] and fromMBSz should be taking a byte*. The doc for types says char is unsigned 8 bit UTF-8. Period. And you get errors from the compiler if you try to initialize a string with something that's not valid UTF-8. So MBSz data has no business parading around dressed up as char[].
 So I am left wondering whether someone 
 can explain why toString is the preferred name for string-producing 
 routines in D?  I feel it is very important to establish a consistent 
 naming mechanism for D, and as Phobos seems to be the model in this case 
 I may well have no choice in the matter of toUtf8 vs. toString.  But I 
 would feel much better about the change if someone could provide a sound 
 reason for doing so, since my first attempt at a conversion has left me 
 somewhat worried about its long-term effect on code clarity.
 
 As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 
 be named toString, toWString, and toDString, respectively, and Unicode 
 should be assumed as the standard encoding format in D.

Since the Tango convention is to treat acronyms as single words (the actual Tango UTF methods are called toUtf8, toUtf16, and toUtf32), it seems there's an argument for treating wstring and dstring as single entities too. So then it would be:

    toString, toWstring, toDstring

Don't know if that hurts your eyes less or not, but it seems more consistent with Tango's existing naming convention to me than toWString, etc.

--bb
Nov 19 2007
parent reply Sean Kelly <sean f4.ca> writes:
Bill Baxter wrote:
 Sean Kelly wrote:
 I was looking at converting Tango's use of toUtf8 to toString today 
 and ran into a bit of a quandary.  Currently, Tango's use of toUtf8 as 
 the member function for returning char strings is consistent with all 
 use of string operations in Tango.  Routines that return wchar strings 
 are named toUtf16 whether they are members of the String class or 
 whether they are intended to perform UTF conversions, and so on.  
 Thus, the convention is consistent and pervasive.

 What I discovered during a test conversion of Tango was that 
 converting all uses of toUtf8 to toString /except/ those intended to 
 perform UTF conversions reduced code clarity, and left me unsure as to 
 which name I would actually use in a given situation.  For example, 
 there is quite a bit of code in the text and io packages which convert 
 an arbitrary type to a char[] for output, etc.  So by making this 
 change I was left with some conversions using toString and others 
 using toUtf8, toUtf16, and toUtf32, not to mention the fromXxx 
 versions of these same functions. As this is template code, the choice 
 between toString and toUtf8 in a given situation was unclear.  Given 
 this, I decided to look to Phobos for a model to follow.

 What I found in Phobos was that it suffers from the same situation as 
 I found Tango in during my test conversion.  Routines that convert any 
 type but a string to a char[] are named toString, while the string 
 equivalent is named toUTF8.  Given this, I surmised that the naming 
 convention in D is that all strings are assumed to be Unicode, except 
 when they're not.  String literals are required to be Unicode, foreach 
 assumes strings to be UTF encoded when performing its automatic 
 conversions, and all of the toString functions in std.string assume 
 UTF-8 as the output format.  So why bother with the name toUTF8 in 
 std.utf?

 As near as I can tell, the reason for text conversion routines to be 
 named differently is to simplify the use of routines which convert to 
 another format.  std.windows.charset, for example, has a routine 
 called toMBSz, to distinguish from the toUTF8 routine.  What I find 
 significant about this is that it suggests that while the transport 
 mechanism for strings is the same in each case (both routines return a 
 char[], ie. a string), 

Does that even work? I would think there are some valid MBSz's that are invalid UTF sequences, and so toMBSz would have to return byte[].

It works because D performs no run-time verification that what's in a char[] is actually Unicode. You could dump binary data in a string if you really wanted to.
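A small illustration of this point, assuming modern std.utf for the validation step: the cast compiles and runs, and the invalid data is only noticed when something finally tries to decode it.

```d
import std.utf : validate, UTFException;

void main()
{
    // Arbitrary binary data dressed up as a char[]: the cast compiles,
    // and nothing complains at this point.
    ubyte[] raw = [0xFF, 0xFE, 0x41];
    char[] s = cast(char[]) raw;

    // Only an explicit decode/validate (or an implicit one, e.g. a
    // decoding foreach) surfaces the problem.
    bool caught = false;
    try validate(s);
    catch (UTFException) caught = true;
    assert(caught);
}
```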
 I really think toMBSz should be returning byte[] and fromMBSz should be 
 taking a byte*.  The doc for types says char is unsigned 8 bit UTF-8. 
 Period.  And you get errors from the compiler if you try to initialize a 
 string with something that's not valid UTF-8.  So MBSz data has no 
 business parading around dressed up as char[].

I think you're right about toMBSz.
 Since the tango convention is to treat acronyms as single words, (the 
 actual tango utf methods are called toUtf8 toUtf16 and toUtf32) it seems 
 there's an argument for treating wstring and dstring as single entities 
 too.  So then it would be:
     toString, toWstring, toDstring
 
 Don't know if that hurts your eyes less or not, but it seems more 
 consistent with Tango's existing naming convention to me than toWString, 
 etc.

Yeah I was thinking the same thing. It's certainly easier for me to read than the other form. Sean
Nov 19 2007
parent reply "Kris" <foo bar.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message
[snip]
 Don't know if that hurts your eyes less or not, but it seems more 
 consistent with Tango's existing naming convention to me than toWString, 
 etc.

Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.

Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring
Nov 19 2007
next sibling parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
Kris wrote:
 "Sean Kelly" <sean f4.ca> wrote in message
 [snip]
 Don't know if that hurts your eyes less or not, but it seems more 
 consistent with Tango's existing naming convention to me than toWString, 
 etc.

 Yeah I was thinking the same thing.  It's certainly easier for me to 
 read than the other form.

Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

How so? toString returns a string. toInt returns an int. toFloat returns a float. to??? returns a wstring. Seems whatever goes in the ??? place should include the letters "w-s-t-r-i-n-g" in that order. --bb
Nov 19 2007
parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Bill Baxter wrote:

 Kris wrote:
 "Sean Kelly" <sean f4.ca> wrote in message
 [snip]
 Don't know if that hurts your eyes less or not, but it seems more
 consistent with Tango's existing naming convention to me than
 toWString, etc.

 Yeah I was thinking the same thing.  It's certainly easier for me to 
 read than the other form.

Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

How so? toString returns a string. toInt returns an int. toFloat returns a float. to??? returns a wstring. Seems whatever goes in the ??? place should include the letters "w-s-t-r-i-n-g" in that order.

Only if you have recognized wstring and dstring as good names for those aliases <g> -- Lars Ivar Igesund blog at http://larsivi.net DSource, #d.tango & #D: larsivi Dancing the Tango
Nov 20 2007
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Lars Ivar Igesund wrote:
 Only if you have recognized wstring and dstring as good names for those
 aliases <g>

They'd be consistent with wchar and dchar.
Nov 20 2007
parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Walter Bright wrote:

 Lars Ivar Igesund wrote:
 Only if you have recognized wstring and dstring as good names for those
 aliases <g>

They'd be consistent with wchar and dchar.

Right ... now I don't like those either ;)

--
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 20 2007
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Lars Ivar Igesund wrote:
 Walter Bright wrote:
 
 Lars Ivar Igesund wrote:
 Only if you have recognized wstring and dstring as good names for those
 aliases <g>


Right ... now I don't like those either ;)

What can I say? !!
Nov 20 2007
parent reply "Kris" <foo bar.com> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote
 Lars Ivar Igesund wrote:
 Walter Bright wrote:

 Lars Ivar Igesund wrote:
 Only if you have recognized wstring and dstring as good names for those
 aliases <g>


Right ... now I don't like those either ;)

What can I say? !!

hehe

Well, perhaps it's worth noting that all of these names are probably a cousin of "hungarian notation", since the name is being decorated with some kind of indicator of what it represents? The question perhaps should be - why is that?

If we speculate, for a moment, that the language supported overload on return type:

char[] toString();
wchar[] toString();
dchar[] toString();

then, there would be no issue here. Right? However, we don't have overload-on-return-type, so it seems to me that the decorated names are a means to work around that. Does that seem logical?

Perhaps what we're seeing here, Walter, is a measure of distaste for the notion of decorated-names?
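For what it's worth, absent overload-on-return-type, the same effect can be had by moving the result type into a template argument rather than into the function name. This is only a sketch - the name toStr is hypothetical, not an existing Phobos or Tango symbol:

    // Sketch only: select the result type at the call site instead of
    // decorating the function name. "toStr" is a made-up name.
    T toStr(T)(int value)
    {
        static if (is(T == char[]))
            return "42".dup;        // real code would format `value`
        else static if (is(T == wchar[]))
            return "42"w.dup;
        else
            return "42"d.dup;
    }

    unittest
    {
        // the "decoration" moves into the template argument
        char[]  s = toStr!(char[])(42);
        wchar[] w = toStr!(wchar[])(42);
    }

The call site still spells out the type, but the function itself keeps a single undecorated name.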
Nov 20 2007
parent reply Christopher Wright <dhasenan gmail.com> writes:
Kris wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote
 Lars Ivar Igesund wrote:
 Walter Bright wrote:

 Lars Ivar Igesund wrote:
 Only if you have recognized wstring and dstring as good names for those
 aliases <g>




 hehe

 Well, perhaps it's worth noting that all of these names are probably a cousin of "hungarian notation", since the name is being decorated with some kind of indicator of what it represents? The question perhaps should be - why is that?

 If we speculate, for a moment, that the language supported overload on return type:

 char[] toString();
 wchar[] toString();
 dchar[] toString();

 then, there would be no issue here. Right? However, we don't have overload-on-return-type, so it seems to me that the decorated names are a means to work around that. Does that seem logical? Perhaps what we're seeing here, Walter, is a measure of distaste for the notion of decorated-names?

class String {
   char[] opImplicitCast () {}
   wchar[] opImplicitCast () {}
   dchar[] opImplicitCast () {}
}

String toString () {}

How does that look?
Nov 20 2007
parent reply Sean Kelly <sean f4.ca> writes:
Christopher Wright wrote:
 
 class String {
    char[] opImplicitCast () {}
    wchar[] opImplicitCast () {}
    dchar[] opImplicitCast () {}
 }
 
 String toString () {}
 
 How does that look?

Tango already has a String class with toUtf8, toUtf16, and toUtf32 member functions. This was one of our original objections to the idea of toString as a member function that must return a char[]. We will have to rename the class to something else if this change goes through.

Sean
Nov 20 2007
parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Sean Kelly wrote:

 Christopher Wright wrote:
 
 class String {
    char[] opImplicitCast () {}
    wchar[] opImplicitCast () {}
    dchar[] opImplicitCast () {}
 }
 
 String toString () {}
 
 How does that look?

Tango already has a String class with toUtf8, toUtf16, and toUtf32 member functions. This was one of our original objections to the idea of toString as a member function that must return a char[]. We will have to rename the class to something else if this change goes through. Sean

It is already renamed to Text.

--
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 20 2007
parent Sean Kelly <sean f4.ca> writes:
Lars Ivar Igesund wrote:
 Sean Kelly wrote:
 
 Christopher Wright wrote:
 class String {
    char[] opImplicitCast () {}
    wchar[] opImplicitCast () {}
    dchar[] opImplicitCast () {}
 }

 String toString () {}

 How does that look?

Tango already has a String class with toUtf8, toUtf16, and toUtf32 member functions. This was one of our original objections to the idea of toString as a member function that must return a char[]. We will have to rename the class to something else if this change goes through.

It is already renamed to Text.

Oops!
Nov 20 2007
prev sibling next sibling parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Kris wrote:

 "Sean Kelly" <sean f4.ca> wrote in message
 [snip]
 Don't know if that hurts your eyes less or not, but it seems more
 consistent with Tango's existing naming convention to me than toWString,
 etc.

Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.

Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

FWIW, this would be preferable to me too.

--
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 20 2007
parent Regan Heath <regan netmail.co.nz> writes:
Lars Ivar Igesund wrote:
 Kris wrote:
 
 "Sean Kelly" <sean f4.ca> wrote in message
 [snip]
 Don't know if that hurts your eyes less or not, but it seems more
 consistent with Tango's existing naming convention to me than toWString,
 etc.

Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.

Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

FWIW, this would be preferable to me too.

+votes
Nov 20 2007
prev sibling next sibling parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Kris" <foo bar.com> wrote in message news:fhtru8$1no5$1 digitalmars.com...
 "Sean Kelly" <sean f4.ca> wrote in message
 [snip]
 Don't know if that hurts your eyes less or not, but it seems more 
 consistent with Tango's existing naming convention to me than toWString, 
 etc.

Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.

Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

Now that I've seen toWString and toStringW, I'll have to say I do like the toStringW/toStringD version better.

// retract previous votes
toWString.votes -= 8;
toDString.votes -= 8;
toStringW.votes += 334;
toStringD.votes += 334;
Nov 20 2007
prev sibling parent Chad J <gamerChad _spamIsBad_gmail.com> writes:
Kris wrote:
 "Sean Kelly" <sean f4.ca> wrote in message
 [snip]
 Don't know if that hurts your eyes less or not, but it seems more 
 consistent with Tango's existing naming convention to me than toWString, 
 etc.

Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.

Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

This conversation caught my eye and I cringed at toWString and toDString. toStringW and toStringD are acceptable though.

Sean made a brief argument from psychology earlier. It made me remember this thing:

Olny srmat poelpe can raed tihs. I cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Amzanig huh? yaeh and I awlyas tghuhot slpeling was ipmorantt!

Perhaps this is important for naming conventions in general? Any similarly named entities must differ at the beginning or end of the name. I'm not sure how deeply this affects existing APIs or if it causes problems ;)

It is also noteworthy that char, wchar, dchar are consistent with that naming constraint, but not with toStringW and toStringD. IMO the former matters more than the latter, simply because it is ingrained into our minds. Still, I am not entirely convinced that such a constraint is wise in general, though I do like its application here.
Nov 20 2007
prev sibling parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
Sean Kelly wrote:
 As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 
 be named toString, toWString, and toDString, respectively, and Unicode 
 should be assumed as the standard encoding format in D.

1) On the question of toWString vs toWstring and consistency:

I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

2) On the question of toWString vs toStringW:

It seems to be pretty well agreed in this thread that toWString is more consistent but toStringW is prettier.

I could be wrong but I think usage pattern of these W and D variants of the functions will be bimodal: either very frequent or very infrequent. In the former case I'd probably want to make a simpler alias like 'wstr'. In the latter case I'd want it to be the most consistent thing possible to be easy to remember for the few times I use it.

--bb
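On the frequent-use case: in D the shorter spelling really is a one-line alias, whichever full name wins. A sketch, assuming Phobos' std.utf.toUTF16 as the underlying conversion; the name wstr is hypothetical:

    import std.utf;    // Phobos UTF conversion routines

    // "wstr" is a made-up short alias for the frequent-use case;
    // toUTF16 stands in for whatever the function ends up being called
    alias toUTF16 wstr;

    unittest
    {
        wchar[] w = wstr("hello");
    }

So the consistent-but-long name and the convenient short name need not compete: the library can pick the consistent one and users can alias it locally.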
Nov 20 2007
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Bill Baxter wrote:
 Sean Kelly wrote:
 As an alternative, I can only suggest that toUTF8, toUTF16, and 
 toUTF32 be named toString, toWString, and toDString, respectively, and 
 Unicode should be assumed as the standard encoding format in D.

1) On the question of toWString vs toWstring and consistency: I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

Good question. Probably toUInt, though I don't like it much :-)

For these conversion routines, I'll admit I like the idea that the type name should be repeated exactly, which suggests something like to_wstring, but I don't imagine anyone finds that appealing.
 2) On the question of toWString vs toStringW
 
 It seems to be pretty well agreed in this thread that toWString is more 
 consistent but toStringW is prettier.
 
 I could be wrong but I think usage pattern of these W and D variants of 
 the functions will be bimodal:  either very frequent or very infrequent. 
  In the former case I'd probably want to make a simpler alias like 
 'wstr'.  In the latter case I'd want it to be the most consistent thing 
 possible to be easy to remember for the few times I use it.

Agreed. Sean
Nov 20 2007
parent "Kris" <foo bar.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message 
news:fhvsaf$2k6t$1 digitalmars.com...
 Bill Baxter wrote:
 Sean Kelly wrote:
 As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 
 be named toString, toWString, and toDString, respectively, and Unicode 
 should be assumed as the standard encoding format in D.

1) On the question of toWString vs toWstring and consistency: I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

Good question. Probably toUInt, though I don't like it much :-) For these conversion routines, I'll admit I find the idea that the type name should be repeated exactly, which suggests something like to_wstring, but I don't imagine anyone finds that appealing.

It was resolved by having a Float module and an Integer module, containing relevant parse/format methods. The toUtf/toString() family is the only one where the type is decorated in the name (hungarian style)
Nov 20 2007
prev sibling parent reply Christopher Wright <dhasenan gmail.com> writes:
Bill Baxter wrote:
 Sean Kelly wrote:
 As an alternative, I can only suggest that toUTF8, toUTF16, and 
 toUTF32 be named toString, toWString, and toDString, respectively, and 
 Unicode should be assumed as the standard encoding format in D.

1) On the question of toWString vs toWstring and consistency: I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

If it had a to uint function and a to int function and a to 'sint' function, what then? If it's only uint, then you can tell the difference quite easily. Also, 'int' is shorter than 'string'. Not a very good comparison.
Nov 20 2007
parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
Christopher Wright wrote:
 Bill Baxter wrote:
 Sean Kelly wrote:
 As an alternative, I can only suggest that toUTF8, toUTF16, and 
 toUTF32 be named toString, toWString, and toDString, respectively, 
 and Unicode should be assumed as the standard encoding format in D.

1) On the question of toWString vs toWstring and consistency: I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

If it had a to uint function and a to int function and a to 'sint' function, what then? If it's only uint, then you can tell the difference quite easily.

I don't understand you. Tell the difference between what?
 Also, 'int' is shorter than 'string'. Not a very good comparison.

What does length have to do with whether or not the naming scheme is consistent?

--bb
Nov 20 2007
parent Christopher Wright <dhasenan gmail.com> writes:
Bill Baxter wrote:
 Christopher Wright wrote:
 Bill Baxter wrote:
 Sean Kelly wrote:
 As an alternative, I can only suggest that toUTF8, toUTF16, and 
 toUTF32 be named toString, toWString, and toDString, respectively, 
 and Unicode should be assumed as the standard encoding format in D.

1) On the question of toWString vs toWstring and consistency: I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

If it had a to uint function and a to int function and a to 'sint' function, what then? If it's only uint, then you can tell the difference quite easily.

I don't understand you. Tell the difference between what?

Sorry, mistyped. If it were 'to uint' and 'to int', that would be rather clear. 'to uint' and 'to sint' would be less clear, since they're the same number of letters and would have the same capitalization pattern.
 Also, 'int' is shorter than 'string'. Not a very good comparison.

What does length have to do with whether or not the naming scheme is consistent?

Readability. I'd rather sacrifice a bit of consistency -- I can memorize a *few* inconsistencies -- for readability, whose lack will cause more trouble in the future. With shorter identifiers, smaller differences are more noticeable, but 'toWString' is a relatively long identifier.
 --bb

Nov 20 2007