
digitalmars.D - toString vs. toUtf8

reply Sean Kelly <sean f4.ca> writes:
I was looking at converting Tango's use of toUtf8 to toString today and 
ran into a bit of a quandary.  Currently, Tango's use of toUtf8 as the 
member function for returning char strings is consistent with all use of 
string operations in Tango.  Routines that return wchar strings are 
named toUtf16 whether they are members of the String class or whether 
they are intended to perform UTF conversions, and so on.  Thus, the 
convention is consistent and pervasive.

What I discovered during a test conversion of Tango was that converting 
all uses of toUtf8 to toString /except/ those intended to perform UTF 
conversions reduced code clarity, and left me unsure as to which name I 
would actually use in a given situation.  For example, there is quite a 
bit of code in the text and io packages which converts an arbitrary type 
to a char[] for output, etc.  So by making this change I was left with 
some conversions using toString and others using toUtf8, toUtf16, and 
toUtf32, not to mention the fromXxx versions of these same functions. 
As this is template code, the choice between toString and toUtf8 in a 
given situation was unclear.  Given this, I decided to look to Phobos 
for a model to follow.

What I found in Phobos was that it suffers from the same situation I 
found Tango in during my test conversion.  Routines that convert any 
type but a string to a char[] are named toString, while the string 
equivalent is named toUTF8.  Given this, I surmised that the naming 
convention in D is that all strings are assumed to be Unicode, except 
when they're not.  String literals are required to be Unicode, foreach 
assumes strings to be UTF encoded when performing its automatic 
conversions, and all of the toString functions in std.string assume 
UTF-8 as the output format.  So why bother with the name toUTF8 in std.utf?

As near as I can tell, the reason for text conversion routines to be 
named differently is to simplify the use of routines which convert to 
another format.  std.windows.charset, for example, has a routine called 
toMBSz, to distinguish from the toUTF8 routine.  What I find significant 
about this is that it suggests that while the transport mechanism for 
strings is the same in each case (both routines return a char[], ie. a 
string), the underlying encoding is different.  Thus there seems a clear 
disconnect between the name of the transport mechanism (string), and 
routines that generate them.  With this in mind, I begin to question the 
point of having toString as the common name for routines that generate 
char strings.  The encoding clearly matters in some instances and cannot 
be ignored, so ignoring it in others just seems to confuse things.
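
The disconnect can be illustrated with a hypothetical sketch (the signatures below are simplified and invented for illustration; the real std.windows.charset.toMBSz actually returns a zero-terminated char*):

```d
// Both routines use char[] as the transport mechanism, but the
// underlying encoding differs -- simplified, hypothetical signatures.
char[] toUTF8(wchar[] s);  // result guaranteed to be UTF-8
char[] toMBS(char[] s);    // result in the current Windows code page

void example(wchar[] w, char[] c)
{
    char[] a = toUTF8(w);  // UTF-8 text
    char[] b = toMBS(c);   // possibly non-Unicode text
    // a and b have the same static type; nothing marks the
    // difference in encoding except the function name.
}
```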

With this in mind, I will admit that I am questioning the merit of 
changing Tango's toUtf8 routines to be named toString.  Doing so seems 
to sacrifice both operational consistency and clarity in an attempt to 
maintain consistency with the name of the transport mechanism: string. 
And as I have said above, while strings in D are generally expected to 
be Unicode, they are clearly not always Unicode, as the existence of 
std.windows.charset can attest.  So I am left wondering whether someone 
can explain why toString is the preferred name for string-producing 
routines in D?  I feel it is very important to establish a consistent 
naming convention for D, and as Phobos seems to be the model in this case 
I may well have no choice in the matter of toUtf8 vs. toString.  But I 
would feel much better about the change if someone could provide a sound 
reason for doing so, since my first attempt at a conversion has left me 
somewhat worried about its long-term effect on code clarity.

As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 
be named toString, toWString, and toDString, respectively, and Unicode 
should be assumed as the standard encoding format in D.


Sean
Nov 19 2007
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Sean Kelly" wrote
 What I discovered during a test conversion of Tango was that converting 
 all uses of toUtf8 to toString /except/ those intended to perform UTF 
 conversions reduced code clarity, and left me unsure as to which name I 
 would actually use in a given situation.  For example, there is quite a 
 bit of code in the text and io packages which convert an arbitrary type to 
 a char[] for output, etc.  So by making this change I was left with some 
 conversions using toString and others using toUtf8, toUtf16, and toUtf32, 
 not to mention the fromXxx versions of these same functions. As this is 
 template code, the choice between toString and toUtf8 in a given situation 
 was unclear.

Can you give an example file for this problem?  It would be easier to understand your problem if I knew exactly what you were talking about.  An actual example is fine, it doesn't need to be minimized (i.e. "take a look at tango/io/X.d")

-Steve
Nov 19 2007
parent reply Sean Kelly <sean f4.ca> writes:
Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 What I discovered during a test conversion of Tango was that converting 
 all uses of toUtf8 to toString /except/ those intended to perform UTF 
 conversions reduced code clarity, and left me unsure as to which name I 
 would actually use in a given situation.  For example, there is quite a 
 bit of code in the text and io packages which convert an arbitrary type to 
 a char[] for output, etc.  So by making this change I was left with some 
 conversions using toString and others using toUtf8, toUtf16, and toUtf32, 
 not to mention the fromXxx versions of these same functions. As this is 
 template code, the choice between toString and toUtf8 in a given situation 
 was unclear.

Can you give an example file for this problem? It would be easier to understand your problem if I knew exactly what you were talking about. An actual example is fine, it doesn't need to be minimized (i.e. "take a look at tango/io/X.d")

tango.text.convert.Layout Sean
Nov 19 2007
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Sean Kelly" wrote
 Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 What I discovered during a test conversion of Tango was that converting 
 all uses of toUtf8 to toString /except/ those intended to perform UTF 
 conversions reduced code clarity, and left me unsure as to which name I 
 would actually use in a given situation.  For example, there is quite a 
 bit of code in the text and io packages which convert an arbitrary type 
 to a char[] for output, etc.  So by making this change I was left with 
 some conversions using toString and others using toUtf8, toUtf16, and 
 toUtf32, not to mention the fromXxx versions of these same functions. As 
 this is template code, the choice between toString and toUtf8 in a given 
 situation was unclear.

Can you give an example file for this problem? It would be easier to understand your problem if I knew exactly what you were talking about. An actual example is fine, it doesn't need to be minimized (i.e. "take a look at tango/io/X.d")

tango.text.convert.Layout

I can't say I see a problem.

I'd say use toUtf8 when doing a conversion from one type of encoded string to another (i.e. utf-16 to utf-8), and use toString when overriding Object's toString, OR when converting a native type (i.e. int, float, etc).  For example tango.text.convert.Integer.toUtf8 should be toString.

In the case of tango.text.convert.Layout, I don't see any overriding of Object.toUtf8?  The Unicode.toUtf8 should be left alone since it is a conversion between utf encodings.  In any case, Unicode.toUtf8 is a global function, and is not overriding Object.toUtf8, so there is no conflict there.

-Steve
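
That rule could be sketched like so (hypothetical names with elided bodies, loosely modeled on D1-era Tango; Point and the free functions are invented for illustration):

```d
// Encoding conversions keep encoding-specific names:
char[]  toUtf8 (wchar[] s) { /* transcode UTF-16 -> UTF-8 */ return null; }
wchar[] toUtf16(char[]  s) { /* transcode UTF-8 -> UTF-16 */ return null; }

// Formatting a native type, or overriding Object, uses toString:
char[] toString(int v) { /* format the integer as text */ return null; }

class Point
{
    int x, y;
    char[] toString() { return "Point"; } // overrides Object.toString
}
```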
Nov 19 2007
parent Sean Kelly <sean f4.ca> writes:
Steven Schveighoffer wrote:
 
 I'd say use toUtf8 when doing a conversion from one type of encoded string 
 to another (i.e. utf-16 to utf-8), and use toString when overriding Object's 
 toString, OR when converting a native type (i.e. int, float, etc).  For 
 example tango.text.convert.Integer.toUtf8 should be toString.
 
 In the case of tango.text.convert.Layout, I don't see any overriding of 
 Object.toUtf8?  The Unicode.toUtf8 should be left alone since it is a 
 conversion between utf encodings.  In any case, Unicode.toUtf8 is a global 
 function, and is not overriding Object.toUtf8, so there is no conflict 
 there.

There's no conflict, it's just more difficult to understand.  Also, in template code, having a consistent rule for overloaded functions can be a valuable asset.  I found myself wanting to simply change everything to toString, toWString, and toDString rather than change only the Object member function as originally planned.  And the conflict with other encodings worried me so I posted here.

Sean
Nov 19 2007
prev sibling parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
Sean Kelly wrote:
 Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 What I discovered during a test conversion of Tango was that 
 converting all uses of toUtf8 to toString /except/ those intended to 
 perform UTF conversions reduced code clarity, and left me unsure as to 
 which name I would actually use in a given situation.  For example, 
 there is quite a bit of code in the text and io packages which 
 convert an arbitrary type to a char[] for output, etc.  So by making 
 this change I was left with some conversions using toString and 
 others using toUtf8, toUtf16, and toUtf32, not to mention the fromXxx 
 versions of these same functions. As this is template code, the 
 choice between toString and toUtf8 in a given situation was unclear.

Can you give an example file for this problem? It would be easier to understand your problem if I knew exactly what you were talking about. An actual example is fine, it doesn't need to be minimized (i.e. "take a look at tango/io/X.d")

tango.text.convert.Layout

I think you are right that the meanings of toString and toUtf8 are subtly different.  My take is that toString promises to produce some textual form of the input (and it happens to use the utf8 encoding).  This transformation might be wildly lossy and non-reversible, as is the case with the default implementation of toString for classes, which just prints the class name.  toUtf8, on the other hand, promises to do a conversion.  It's probably lossless, or nearly so, and since the encoding is mentioned specifically, it's probably a conversion between different string encodings.

The thing is, sometimes A is B.  The best textual representation of a Utf32 string as Utf8 is going to be the Utf8-converted version of it.  So in that case toString and toUtf8 happen to do the same thing.

So to me, the logical thing to do is to "alias toUtf8 toString;" in the cases where there's a converter that also suffices as a textual representation generator.  That way everything that can be represented as text has a toString method, and things that deal with encoding conversions have toUtf-blah methods.

So in that case I don't see any reason for toWString, toDString.  toString generates your canonical "textual representation" for whatever it is.  If you need that in a different encoding for whatever reason then you need to run an encoding converter on it.

--bb
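
The "alias toUtf8 toString;" idea might look like this in practice (String32 is a hypothetical class invented for illustration; the transcoding body is elided):

```d
class String32
{
    dchar[] data;

    // The lossless encoding conversion...
    char[] toUtf8() { /* transcode data from UTF-32 to UTF-8 */ return null; }

    // ...also happens to be the canonical textual representation,
    // so one name simply forwards to the other (D1 alias syntax).
    alias toUtf8 toString;
}
```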
Nov 19 2007
parent reply Daniel Keep <daniel.keep.lists gmail.com> writes:
That's roughly what I've suggested before, except that I also suggested
the following interface:

interface UtfConversion
{
    char[] toUtf8();
    wchar[] toUtf16();
    dchar[] toUtf32();
}

This would allow all objects to have a distinct set of methods for
lossless conversion to different encodings, whilst still preserving the
"just give me something to throw at the user" toString method.

Incidentally, given that to(T)'s entire purpose is to do generalised
*value-preserving* conversions, is this really a problem?  Using a
formatter will always give you something, whilst to!(charT[])(v) will
always preserve the value of the conversion.

	-- Daniel
Nov 19 2007
parent "Kris" <foo bar.com> writes:
"Daniel Keep" <daniel.keep.lists gmail.com> wrote in >
 That's roughly what I've suggested before, except that I also suggested
 the following interface:

 interface UtfConversion
 {
    char[] toUtf8();
    wchar[] toUtf16();
    dchar[] toUtf32();
 }

 This would allow all objects to have a distinct set of methods for
 lossless conversion to different encodings, whilst still preserving the
 "just give me something to throw at the user" toString method.

 Incidentally, given that to(T)'s entire purpose is to do generalised
 *value-preserving* conversions, is this really a problem?  Using a
 formatter will always give you something, whilst to!(charT[])(v) will
 always preserve the value of the conversion.

Tango already has this ....
 -- Daniel 

Nov 19 2007
prev sibling next sibling parent reply BCS <ao pathlink.com> writes:
Reply to Sean,

 I was looking at converting Tango's use of toUtf8 to toString today
 and ran into a bit of a quandry.  

 
 Sean
 

why? What is to be gained by having Tango use toString rather than toUTF*? IIRC there was a big thing, back when Tango started using toUTF8 in place of toString, about how using toUTF8 would solve a number of these issues. If it is part of the Phobos/Tango collaboration project, I think that going the other way would be better (add toUTF8 to Phobos's Object and an alias to make old things compile).
Nov 19 2007
parent Lars Ivar Igesund <larsivar igesund.net> writes:
BCS wrote:

 Reply to Sean,
 
 I was looking at converting Tango's use of toUtf8 to toString today
 and ran into a bit of a quandry.

 
 Sean
 

why? What is to be gained by having Tango use toString rather than toUTF*? IIRC there was a big thing, back when Tango started using toUTF8 in place of toString, about how using toUTF8 would solve a number of these issues. If it is part of the Phobos/Tango collaboration project, I think that going the other way would be better (add toUTF8 to Phobos's Object and an alias to make old things compile).

Ooh, that would be nice. Apparently toString is the only thing not up for discussion at all on Walter's end.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 19 2007
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Phobos (and D) has undergone some evolution in the thinking about 
unicode strings, and it certainly has a few anachronisms in its names. 
But I think we've evolved to the point where going forward, we know what 
to do:

char[] => string
wchar[] => wstring
dchar[] => dstring

These are all unicode strings. Putting non-unicode encodings in them, 
even temporarily, should be discouraged. Non-unicode encodings should 
use ubyte[], ushort[], etc.
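
In D1 terms, the proposed names amount to simple aliases (a sketch; the variable names are invented for illustration, and D 2.0 later defined string over invariant chars instead):

```d
// The names are just aliases for the unicode character arrays:
alias char[]  string;
alias wchar[] wstring;
alias dchar[] dstring;

// Non-unicode data should travel in plain integer arrays instead:
ubyte[]  latin1Buffer;  // e.g. Latin-1 encoded bytes
ushort[] ucs2Buffer;    // e.g. a legacy 16-bit encoding
```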
Nov 19 2007
next sibling parent Lars Ivar Igesund <larsivar igesund.net> writes:
Walter Bright wrote:

 Phobos (and D) has undergone some evolution in the thinking about
 unicode strings, and it certainly has a few anachronisms in its names.
 But I think we've evolved to the point where going forward, we know what
 to do:

That "we" certainly doesn't include all of the D community (of which I would say Tango is a large part, although some of Tango's users probably like your suggestions). Were these names ever really up for discussion?
 
 char[] => string
 wchar[] => wstring
 dchar[] => dstring
 
 These are all unicode strings. Putting non-unicode encodings in them,
 even temporarily, should be discouraged. Non-unicode encodings should
 use ubyte[], ushort[], etc.

This doesn't at all address the points Sean pulled forth on the naming of the functions returning various encodings, regardless of the types returned.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 19 2007
prev sibling next sibling parent reply Gregor Richards <Richards codu.org> writes:
Walter Bright wrote:
 Phobos (and D) has undergone some evolution in the thinking about 
 unicode strings, and it certainly has a few anachronisms in its names. 
 But I think we've evolved to the point where going forward, we know what 
 to do:
 
 char[] => string
 wchar[] => wstring
 dchar[] => dstring
 
 These are all unicode strings. Putting non-unicode encodings in them, 
 even temporarily, should be discouraged. Non-unicode encodings should 
 use ubyte[], ushort[], etc.

I believe that this naming convention would be best in Tango (toString, toWString, toDString). Naming them toUtf8, toUtf16, toUtf32 not only means that the coder has to understand what character encodings are (which would be nice but shouldn't be necessary), but that the familiar terminology "string" we take from literally every other language is lost.

If we have to define "strings" as being a bit more confined than "arrays of bytes which presumably have some sort of form", even better. Bytes encoding random-arsed character sets into WTF-17 don't need to be called "strings", they can be called WTF-17 arrays.

- Gregor Richards
Nov 19 2007
parent Robert Fraser <fraserofthenight gmail.com> writes:
Gregor Richards wrote:
 Walter Bright wrote:
 Phobos (and D) has undergone some evolution in the thinking about 
 unicode strings, and it certainly has a few anachronisms in its names. 
 But I think we've evolved to the point where going forward, we know 
 what to do:

 char[] => string
 wchar[] => wstring
 dchar[] => dstring

 These are all unicode strings. Putting non-unicode encodings in them, 
 even temporarily, should be discouraged. Non-unicode encodings should 
 use ubyte[], ushort[], etc.

I believe that this naming convention would be best in Tango (toString, toWString, toDString). Naming them toUtf8, toUtf16, toUtf32 not only means that the coder has to understand what character encodings are (which would be nice but shouldn't be necessary), but that the familiar terminology "string" we take from literally every other language is lost. If we have to define "strings" as being a bit more confined than "arrays of bytes which presumably have some sort of form", even better. Bytes encoding random-arsed character sets in to WTF-17 don't need to be called "strings", they can be called WTF-17 arrays. - Gregor Richards

Agreed. It's also worth noting that toString as the name of a method/function has some precedent, so people familiar with Java, etc. will be able to get used to it right away, possibly without ever looking it up.
Nov 19 2007
prev sibling next sibling parent reply Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
 Phobos (and D) has undergone some evolution in the thinking about 
 unicode strings, and it certainly has a few anachronisms in its names. 
 But I think we've evolved to the point where going forward, we know what 
 to do:
 
 char[] => string
 wchar[] => wstring
 dchar[] => dstring
 
 These are all unicode strings. Putting non-unicode encodings in them, 
 even temporarily, should be discouraged. Non-unicode encodings should 
 use ubyte[], ushort[], etc.

This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-) Sean
Nov 19 2007
parent reply Gregor Richards <Richards codu.org> writes:
Sean Kelly wrote:
 Walter Bright wrote:
 Phobos (and D) has undergone some evolution in the thinking about 
 unicode strings, and it certainly has a few anachronisms in its names. 
 But I think we've evolved to the point where going forward, we know 
 what to do:

 char[] => string
 wchar[] => wstring
 dchar[] => dstring

 These are all unicode strings. Putting non-unicode encodings in them, 
 even temporarily, should be discouraged. Non-unicode encodings should 
 use ubyte[], ushort[], etc.

This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-) Sean

Worse looking than toUtf16?

Would you prefer if int => int32, long => int64, short => int16, byte => int8, real => float80 (portability be damned), double => float64, float => float32? They'd certainly be more obvious, but I can tell you I'd go crazy.

- Gregor Richards
Nov 19 2007
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Gregor Richards wrote:
 Would you prefer if int => int32, long => 
 int64, short => int16, byte => int8, real => float80 (portability be 
 damned), double => float64, float => float32? They'd certainly be more 
 obvious, but I can tell you I'd go crazy.

Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.
Nov 19 2007
parent reply Jason House <jason.james.house gmail.com> writes:
Walter Bright wrote:

 Gregor Richards wrote:
 Would you prefer if int => int32, long =>
 int64, short => int16, byte => int8, real => float80 (portability be
 damned), double => float64, float => float32? They'd certainly be more
 obvious, but I can tell you I'd go crazy.

Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.

The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue. I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size.

I'd love to see both a fixed and variable size option available. Maybe:

int - variable size
int32 - fixed size
int64 - fixed size

If that's done, the size of types becomes obvious when the programmer cares about them and may make size-sensitive code more obvious.
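
D has no variable-sized int, but something close to the proposal can already be approximated with version blocks (a hypothetical sketch, not the proposed syntax; nativeInt is an invented name):

```d
// Fixed sizes, which D already guarantees:
alias int  int32;  // always 32 bits in D
alias long int64;  // always 64 bits in D

// A "native-sized" integer selected per target:
version (X86_64)
    alias long nativeInt;
else
    alias int  nativeInt;
```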
Nov 19 2007
next sibling parent reply Robert DaSilva <sp.unit.262+digitalmars gmail.com> writes:
Jason House wrote:
 Walter Bright wrote:
 
 Gregor Richards wrote:
 Would you prefer if int => int32, long =>
 int64, short => int16, byte => int8, real => float80 (portability be
 damned), double => float64, float => float32? They'd certainly be more
 obvious, but I can tell you I'd go crazy.

Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.

The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue. I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size. I'd love to see both a fixed and variable size option available. Maybe: int - variable size int32 - fixed size int64 - fixed size If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.

Even on 64-bit systems int is 32-bit.
Nov 19 2007
parent reply Jason House <jason.james.house gmail.com> writes:
Robert DaSilva wrote:

 Jason House wrote:
 Walter Bright wrote:
 
 Gregor Richards wrote:
 Would you prefer if int => int32, long =>
 int64, short => int16, byte => int8, real => float80 (portability be
 damned), double => float64, float => float32? They'd certainly be more
 obvious, but I can tell you I'd go crazy.

Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.

The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue. I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size. I'd love to see both a fixed and variable size option available. Maybe: int - variable size int32 - fixed size int64 - fixed size If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.

Even on 64-bit systems int is 32-bit.

Are you talking about what D does or what is most efficient on a 64 bit system? If 32-bit integers are less efficient, then it's a crime to make size-tolerant code use an inefficient size.
Nov 19 2007
next sibling parent Sean Kelly <sean f4.ca> writes:
Jason House wrote:
 Robert DaSilva wrote:
 
 Jason House wrote:
 If that's done, the size of types become obvious when the programmer
 cares about them and may make size-sensitive code more obvious.


Are you talking about what D does or what is most efficient on a 64 bit system? If 32-bit integers are less efficient, then it's a crime to make size-tolerant code use an inefficient size.

You could always use tango.stdc.stdint.int_fast32_t ;-) Sean
Nov 19 2007
prev sibling parent Robert DaSilva <sp.unit.262+digitalmars gmail.com> writes:
Jason House wrote:
 Robert DaSilva wrote:
 
 Jason House wrote:
 Walter Bright wrote:

 Gregor Richards wrote:
 Would you prefer if int => int32, long =>
 int64, short => int16, byte => int8, real => float80 (portability be
 damned), double => float64, float => float32? They'd certainly be more
 obvious, but I can tell you I'd go crazy.

Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.

The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue. I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size. I'd love to see both a fixed and variable size option available. Maybe: int - variable size int32 - fixed size int64 - fixed size If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.


Are you talking about what D does or what is most efficient on a 64 bit system? If 32-bit integers are less efficient, then it's a crime to make size-tolerant code use an inefficient size.

C doesn't specify the sizes, but it does specify the sizes relative to each other: sizeof(short) <= sizeof(int) && sizeof(int) <= sizeof(long)
Nov 19 2007
prev sibling parent reply renoX <renosky free.fr> writes:
Jason House a écrit :
 Walter Bright wrote:
 
 Gregor Richards wrote:
 Would you prefer if int => int32, long =>
 int64, short => int16, byte => int8, real => float80 (portability be
 damned), double => float64, float => float32? They'd certainly be more
 obvious, but I can tell you I'd go crazy.

legacy from the C world where the sizes of basic types are unknown.

The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue. I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size. I'd love to see both a fixed and variable size option available. Maybe: int - variable size int32 - fixed size int64 - fixed size If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.

No! Naturally programmers would use 'int' everywhere and this would again create portability issues.

var_int, int_be32 (bigger than or equal to 32 bits), int_word: those would be ok though.

renoX
Nov 19 2007
parent reply Jason House <jason.james.house gmail.com> writes:
renoX wrote:

 Jason House a Ă©crit :
 I'd love to see both a fixed and variable size option available.  Maybe:
 int - variable size
 int32 - fixed size
 int64 - fixed size
 
 If that's done, the size of types become obvious when the programmer
 cares about them and may make size-sensitive code more obvious.

No! Naturally programmers would use 'int' everywhere and this would create again portability issue.

What would the portability issue be? If they use int and don't care about the true size, it'll port fine.
 var_int, int_be32 (bigger or equal 32 bit), int_word: would be ok though.

I'm assuming programmers won't use long-winded type names if they can avoid it.
Nov 22 2007
parent renoX <renosky free.fr> writes:
Jason House a Ă©crit :
 renoX wrote:
 
 Jason House a Ă©crit :
 I'd love to see both a fixed and variable size option available.  Maybe:
 int - variable size
 int32 - fixed size
 int64 - fixed size

 If that's done, the size of types become obvious when the programmer
 cares about them and may make size-sensitive code more obvious.

Naturally programmers would use 'int' everywhere and this would create again portability issue.

What would the portability issue be? If they use int and don't care about the true size, it'll port fine.

Sure, it's the same as C, except that if you look at the real world, you'll see that there are many portability issues in C due to this. There are many not-very-good|overworked programmers who care only about their current target, so if you have integers with a varying size as a default, portability will be poor.

IMHO, that's a case of 'premature optimisation': providing machine-sized integers for optimisations is nice, but using them as a default sucks, especially since it's not that obvious that they are always faster: 64bit integers on a 64bit CPU can be slower than 32bit integers due to the increased memory & cache usage.

renoX
 
 
 var_int, int_be32 (bigger or equal 32 bit), int_word: would be ok though.

I'm assuming programmers won't use long windes type definitions if they can avoid it.

Nov 25 2007
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Gregor Richards wrote:
 Sean Kelly wrote:
 Walter Bright wrote:
 Phobos (and D) has undergone some evolution in the thinking about 
 unicode strings, and it certainly has a few anachronisms in its 
 names. But I think we've evolved to the point where going forward, we 
 know what to do:

 char[] => string
 wchar[] => wstring
 dchar[] => dstring

 These are all unicode strings. Putting non-unicode encodings in them, 
 even temporarily, should be discouraged. Non-unicode encodings should 
 use ubyte[], ushort[], etc.

This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-)

Worse looking than toUtf16?

Yes. I find the 'W' or 'D' in the middle of the name difficult to read. It literally hurts my eyes to look at that particular word. Something about the single capital letter in the middle of the word as the distinguishing characteristic, and the fact that the 'W' and 'D' do not correlate to anything meaningful in English.

Didn't someone post recently that the mind is trained to recognize words by their first and last letter? I tihnk its smoehtnig lkie taht. With toUtf8, etc, I basically just see the trailing '8' and I know what it is. Trying to pick out a 'W' or 'D' in the middle of a word is much more difficult, particularly since it is next to another capital letter.
 Would you prefer if int => int32, long => 
 int64, short => int16, byte => int8, real => float80 (portability be 
 damned), double => float64, float => float32? They'd certainly be more 
 obvious, but I can tell you I'd go crazy.

No, but I feel that this is an invalid comparison. We are talking about function names concerning type transformations, not type names. Sean
Nov 19 2007
next sibling parent "Kris" <foo bar.com> writes:
"Sean Kelly" <sean f4.ca> wrote ...
 Gregor Richards wrote:
 Sean Kelly wrote:
 Walter Bright wrote:
 Phobos (and D) has undergone some evolution in the thinking about 
 unicode strings, and it certainly has a few anachronisms in its names. 
 But I think we've evolved to the point where going forward, we know 
 what to do:

 char[] => string
 wchar[] => wstring
 dchar[] => dstring

 These are all unicode strings. Putting non-unicode encodings in them, 
 even temporarily, should be discouraged. Non-unicode encodings should 
 use ubyte[], ushort[], etc.

This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-)

Worse looking than toUtf16?

Yes. I find the 'W' or 'D' in the middle of the name difficult to read. It literally hurts my eyes to look at that particular word.

Hear hear! :o
 Something about the single capital letter in the middle of the word as the 
 distinguishing characteristic, and the fact that the 'W' and 'D' do not 
 correlate to anything meaningful in English.  Didn't someone post recently 
 that the mind is trained to recognize words by their first and last 
 letter?  I tihnk its smoehtnig lkie taht.  With toUtf8, etc, I basically 
 just see the trailing '8' and I know what it is.  Trying to pick out a 'W' 
 or 'D' in the middle of a word is much more difficult, particularly since 
 it is next to another capital letter.

Yes, it looks more akin to GoBbleDeGOOk than other options. I find such things to be as distasteful as Walter finds toUtf8 <g>
 Would you prefer if int => int32, long => int64, short => int16, byte => 
 int8, real => float80 (portability be damned), double => float64, float 
 => float32? They'd certainly be more obvious, but I can tell you I'd go 
 crazy.

No, but I feel that this is an invalid comparison. We are talking about function names concerning type transformations, not type names.

Good point
Nov 19 2007
prev sibling parent Regan Heath <regan netmail.co.nz> writes:
Sean Kelly wrote:
 Gregor Richards wrote:
 Sean Kelly wrote:
 Walter Bright wrote:
 Phobos (and D) has undergone some evolution in the thinking about 
 unicode strings, and it certainly has a few anachronisms in its 
 names. But I think we've evolved to the point where going forward, 
 we know what to do:

 char[] => string
 wchar[] => wstring
 dchar[] => dstring

 These are all unicode strings. Putting non-unicode encodings in 
 them, even temporarily, should be discouraged. Non-unicode encodings 
 should use ubyte[], ushort[], etc.

This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-)

Worse looking than toUtf16?

Yes. I find the 'W' or 'D' in the middle of the name difficult to read. It literally hurts my eyes to look at that particular word. Something about the single capital letter in the middle of the word as the distinguishing characteristic, and the fact that the 'W' and 'D' do not correlate to anything meaningful in English. Didn't someone post recently that the mind is trained to recognize words by their first and last letter? I tihnk its smoehtnig lkie taht. With toUtf8, etc, I basically just see the trailing '8' and I know what it is. Trying to pick out a 'W' or 'D' in the middle of a word is much more difficult, particularly since it is next to another capital letter.

I agree, I think I'd prefer:

  toString, toStringW, toStringD

or:

  toString, toString16, toString32

maybe with an alias for toString to toStringA, and/or toString8. There is some precedent, as the Unicode versions of Windows functions have a trailing W, i.e. CreateFileA, CreateFileW.

Regan
Nov 20 2007
prev sibling next sibling parent reply Roberto Mariottini <rmariottini mail.com> writes:
Walter Bright wrote:
[...]
 Non-unicode encodings should use ubyte[], ushort[], etc.

Are you saying that the toMBSz() function should return ubyte* not char*?

Ciao
Nov 20 2007
parent Walter Bright <newshound1 digitalmars.com> writes:
Roberto Mariottini wrote:
 Walter Bright wrote:
 [...]
 Non-unicode encodings should use ubyte[], ushort[], etc.

Are you saying that the toMBSz() function should return ubyte* not char*?

Probably.
Nov 20 2007
prev sibling parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Walter Bright wrote:
 char[] => string
 wchar[] => wstring
 dchar[] => dstring
 
 These are all unicode strings. Putting non-unicode encodings in them,
 even temporarily, should be discouraged. Non-unicode encodings should
 use ubyte[], ushort[], etc.

At last! This is the way I've been thinking it should be for a long time.

However, this requires a change to the language - make char/wchar/dchar types implicitly convertible to ubyte/ushort/uint - and a bunch of library changes - functions that don't require UTF should use ubyte/ushort/uint - in order to be practically usable. Details follow.

Assume you have an ubyte[] named iso_8859_1_string which contains a string encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying cast.

The same thing applies the other way, of course - assume the C standard library accepts ubyte* instead of char* for all the C string functions. This is more correct than the current situation, as the C standard library is encoding-independent. Now, if you have a UTF-8 string which you wish to pass to a C string handling function, you need to do, for instance: "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.

If encoding-independent functions accept only char, then it's the former case for _every_ call to a string function when you're dealing with non-UTF strings, which quickly becomes onerous. I actually tried this, but the code ended up so unreadable that I was forced to change it back, thus having arbitrarily-encoded bytes stored in char[], just for the convenience of being able to use string functions on them.

Here're the details of the solution to this problem that I've thought of:

Make char, char*, char[], etc. all implicitly castable to the corresponding ubyte types, and equivalently for wchar/ushort and dchar/uint. Then, functions which require UTF-x can continue to use [dw]char while functions which work regardless of encoding (most functions in std.string) should use ubyte. This way, the functions transparently work for [dw]string whilst still working for non-UTF.
To be precise, in the above, "work regardless of encoding" should be read as "works on more than one encoding": even a simple function like std.string.strip would have to be changed to work on EBCDIC, for instance. I would assume ASCII, especially given that D doesn't target machines older than relatively modern 32-bit computers, to be the common subset. This way ubyte[] would mean "ASCII or something else" and it's up to the programmer to not call it on functions which require ASCII. I don't think this is a problem.

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
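[Editor's note: the split being proposed above could be sketched as follows. This is a hypothetical illustration in D1-era syntax - stripBytes is an invented name, not an actual Phobos function - showing an encoding-agnostic routine typed over ubyte[] and the casts it implies today.]

```d
// Hypothetical sketch of the proposal, not actual Phobos code.
// An encoding-agnostic trim: only the ASCII whitespace values are
// inspected, so any ASCII-superset encoding works unchanged.
ubyte[] stripBytes(ubyte[] s)
{
    size_t i = 0, j = s.length;
    while (i < j && (s[i] == ' ' || s[i] == '\t')) ++i;
    while (j > i && (s[j - 1] == ' ' || s[j - 1] == '\t')) --j;
    return s[i .. j];
}

void main()
{
    // Raw ISO-8859-1 bytes: 0xE9 is e-acute, which is invalid UTF-8.
    ubyte[] iso_8859_1_string = cast(ubyte[]) "  caf\xe9  ";
    ubyte[] trimmed = stripBytes(iso_8859_1_string);

    // Today a char[] needs an explicit cast to reach the same routine;
    // under the proposed implicit char[] -> ubyte[] conversion the
    // cast below would disappear.
    char[] utf = "  hello  ".dup;
    ubyte[] trimmed2 = stripBytes(cast(ubyte[]) utf);
}
```

The point of the sketch: the function never assumes UTF-8, so it is honest for it to take ubyte[]; the implicit conversion would let UTF strings call it without ceremony.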
Nov 20 2007
next sibling parent reply Regan Heath <regan netmail.co.nz> writes:
Matti Niemenmaa wrote:
 Walter Bright wrote:
 char[] => string
 wchar[] => wstring
 dchar[] => dstring

 These are all unicode strings. Putting non-unicode encodings in them,
 even temporarily, should be discouraged. Non-unicode encodings should
 use ubyte[], ushort[], etc.

At last! This is the way I've been thinking it should be for a long time. However, this requires a change to the language - make char/wchar/dchar types implicitly convertible to ubyte/ushort/uint - and a bunch of library changes - functions that don't require UTF should use ubyte/ushort/uint - in order to be practically usable. Details follow. Assume you have an ubyte[] named iso_8859_1_string which contains a string encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying cast.

I think we should be encouraging people to convert this data to UTF-8 before calling any D string handling functions on it (those that accept w/d/char[]). Which implies all D string handling functions should only operate on UTF-8/16/32.

If they want to call a C function like those in std.c.<whatever> on it, it should just work as expected. Which implies std.c.<whatever> functions should accept ubyte* or void* or something, not char*.
 The same thing applies the other way, of course - assume the C standard library
 accepts ubyte* instead of char* for all the C string functions. This is more
 correct than the current situation, as the C standard library is
 encoding-independent. Now, if you have a UTF-8 string which you wish to pass to
 a C string handling function, you need to do, for instance:
 "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.

w/d/char[] arrays are implicitly convertable to void[] (and void*?) so perhaps C functions should accept void* instead? I mean, void* means "pointer to something/anything"... Regan
Nov 20 2007
parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Regan Heath wrote:
 I think we should be encouraging people to convert this data to UTF-8
 before calling any D string handling functions on it (those that accept
 w/d/char[]).  Which implies all D string handling functions should only
 operate on UTF-8/16/32.

This is an impossible task. Given a plaintext file, you cannot know what encoding it is in. If you assume an encoding and convert it to UTF-8 for internal use and then recode it back to that encoding for output, you may lose information.
 w/d/char[] arrays are implicitly convertable to void[] (and void*?) so
 perhaps C functions should accept void* instead?  I mean, void* means
 "pointer to something/anything"...

void* means "pointer to anything", as you say. ubyte* means "pointer to unsigned byte(s)", which is a different thing entirely. To me, ubyte[] means either integers in the range 0-255 or "arbitrary data". void[] is more like "arbitrary memory": used for hacking around language restrictions or for extremely low-level stuff such as memory management. Would you consider malloc as returning the same type of data which mbstrlen accepts? -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Nov 20 2007
parent reply Regan Heath <regan netmail.co.nz> writes:
Matti Niemenmaa wrote:
 Regan Heath wrote:
 I think we should be encouraging people to convert this data to UTF-8
 before calling any D string handling functions on it (those that accept
 w/d/char[]).  Which implies all D string handling functions should only
 operate on UTF-8/16/32.

This is an impossible task. Given a plaintext file, you cannot know what encoding it is in. If you assume an encoding and convert it to UTF-8 for internal use and then recode it back to that encoding for output, you may lose information.

Yep, but the same thing may occur calling a D string function, as it expects UTF-8 and may even convert to dchar[] internally (which would probably throw an invalid UTF exception). Worse, it might work in one version of the library and fail in another due to internal changes of that sort. Meaning, the function cannot guarantee to operate on your 'could be any encoding' data.

You'd be better off passing this data to the C function that does what you want. Convert input early and output late, I reckon.
 w/d/char[] arrays are implicitly convertable to void[] (and void*?) so
 perhaps C functions should accept void* instead?  I mean, void* means
 "pointer to something/anything"...

void* means "pointer to anything", as you say. ubyte* means "pointer to unsigned byte(s)", which is a different thing entirely. To me, ubyte[] means either integers in the range 0-255 or "arbitrary data". void[] is more like "arbitrary memory": used for hacking around language restrictions or for extremely low-level stuff such as memory management. Would you consider malloc as returning the same type of data which mbstrlen accepts?

Not the same type of data, but they could give/accept the same pointer.

  void *p = malloc(100);
  strcpy((char*)p, "test");
  printf("%d", mbstrlen(p));

Memory is memory; the only difference between char* and void* is that char* knows (thinks) it's pointing at a char.

What about other text encodings which do not have 8 bit sized 'character' pieces, like UCS-2 (but not, because UCS-2 is a subset of UTF-16 and we can handle it as such). I'm not sure any exist, so this point may be invalid, but if one did exist then ubyte[] would not be the correct way to store it; perhaps ushort[] would.

Or.. we could use void[]/void* for all types of unknown data and be done with it. Using void* basically says "we don't know the type/format of the data but we assume the function receiving the data does".

Regan
Nov 20 2007
parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Regan Heath wrote:
 Matti Niemenmaa wrote:
 Regan Heath wrote:
 I think we should be encouraging people to convert this data to UTF-8 
 before calling any D string handling functions on it (those that accept 
 w/d/char[]).  Which implies all D string handling functions should only 
 operate on UTF-8/16/32.

This is an impossible task. Given a plaintext file, you cannot know what encoding it is in. If you assume an encoding and convert it to UTF-8 for internal use and then recode it back to that encoding for output, you may lose information.

Yep, but the same thing may occur calling a D string function as it expects UTF-8 and may even convert to dchar[] internally (which would probably throw an invalid UTF exception).

Which is why I think that unless you know it's UTF-8, you should use ubyte[]. Functions which expect UTF-8 would require char[], thus causing a type error.
 You'd be better of passing this data to the C function that does what you 
 want.

There's not always a C function that does what you want available. Both Phobos's and Tango's string processing capabilities are greater than the C standard library's even for plain ASCII. The point is to make it easy to use non-UTF strings when necessary, without having to resort to huge amounts of casts or writing your own functions with the correct type signatures.
 What about other text encodings which do not have 8 bit sized 'character' 
 pieces, like UCS-2 (but not because UCS-2 is a subset of UTF-16 and we can 
 handle it as such).  I'm not sure any exist, so this point may be invalid, 
 but if one did exist then ubyte[] would not be the correct way to store it, 
 perhaps ushort[] would.

Walter mentioned ushort[] in his post, as did I in mine.
 Or.. we could use void[]/void* for all types of unknown data and be done with
 it.  Using void* basically says "we don't know the type/format of the data 
 but we assume the function receiving the data does".

I just think "void" means "typeless" or "I don't know the type". "ubyte" means something like "byte-oriented data" or "I don't care about the type". It all depends on your point of view, but I think it's nice to have a semantic difference between void and ubyte. The meaning of plain byte, on the other hand, eludes me, beyond just "integer from -128 to 127".

The problem with using void to store data is also that the garbage collectors assume it may contain pointers, and thus scan it for uncollected memory. It may also be that if they find a valid pointer (small, but nonzero, probability) they do not free memory which should be released, thus retaining it as long as the data lives, which could be as long as the program runs.

Hell, we /could/ use void[] to replace char[], byte[], and ubyte[], and why not the rest of the types, too. But this isn't asm. This is D!

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Nov 20 2007
parent Regan Heath <regan netmail.co.nz> writes:
Matti Niemenmaa wrote:
 The meaning of plain byte, on the other hand, eludes me, beyond just "integer
 from -128 to 127".

To my mind byte = "signed interpretation of 8 bits". Regan
Nov 21 2007
prev sibling next sibling parent reply Julio César Carrascal Urquijo writes:
Matti Niemenmaa wrote:
 Assume you have an ubyte[] named iso_8859_1_string which contains a string
 encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to
 work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)"
-
 note the annoying cast.

You can't assume that a function designed to work on UTF-8 strings works with ISO-8859-1 strings. Beyond the ASCII range UTF-8 isn't compatible with any other charset.
 The same thing applies the other way, of course - assume the C standard library
 accepts ubyte* instead of char* for all the C string functions. This is more
 correct than the current situation, as the C standard library is
 encoding-independent. Now, if you have a UTF-8 string which you wish to pass to
 a C string handling function, you need to do, for instance:
 "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.

This is probably the actual problem: C string functions should accept ubyte* instead of char* because a ubyte doesn't have an implied encoding while char does.
 If encoding-independent functions accept only char, then it's the former case
 for _every_ call to a string function when you're dealing with non-UTF strings,
 which quickly becomes onerous.

Unless you are referring to a conversion library like ICU, I don't understand your point on "encoding-independent functions". Phobos' string functions aren't "encoding-independent".
 I actually tried this, but the code ended up so unreadable that I was forced to
 change it back, thus having arbitrarily-encoded bytes stored in char[], just
for
 the convenience of being able to use string functions on them.

If you've done that I fear you'll see lots of exceptions appearing in your string handling code once you deliver your program to any non-english speaking user.
 Here're the details of the solution to this problem that I've thought of:
 
 Make char, char*, char[], etc. all implicitly castable to the corresponding
 ubyte types, and equivalently for wchar/ushort and dchar/uint. Then, functions
 which require UTF-x can continue to use [dw]char while functions which work
 regardless of encoding (most functions in std.string) should use ubyte. This
 way, the functions transparently work for [dw]string whilst still working for
 non-UTF.

Most functions in std.string *require* UTF-8 or they'll blow up with an "Error: invalid UTF-8 sequence" message.

Actually, I think the implicit casting would be useful for string literals:

  byte[] foo = "Julio César";    // In ISO-8859-1.

But then I need some way to tell the compiler that the string is in ISO-8859-1. What I don't see is where your proposal helps with the example you were giving. For example, if I try to uppercase foo I would get an exception:

  toupper(foo);    // BOOM!
 To be precise, in the above, "work regardless of encoding" should be read as
 "works on more than one encoding": even a simple function like std.string.strip
 would have to be changed to work on EBCDIC, for instance. I would assume ASCII,
 especially given that D doesn't target machines older than relatively modern
 32-bit computers, to be the common subset. This way ubyte[] would mean "ASCII
or
 something else" and it's up to the programmer to not call it on functions which
 require ASCII. I don't think this is a problem.

I think this is unrealistic unless you want to change std.string to be something more like ICU. There are just too many (popular) encodings and variations in use today... and you'll have to support most of them once you start promising to "work on more than one encoding". Even Unicode has UCS, which is the not-quite-UTF encoding used in Windows NT4 (yes, there are still lots of machines using NT4).

-- 
Julio César Carrascal Urquijo
http://jcesar.artelogico.com/
Nov 20 2007
next sibling parent Regan Heath <regan netmail.co.nz> writes:
Julio César Carrascal Urquijo wrote:
 Even Unicode has UCS which is the not-quite-UTF encoding used in Windows 
 NT4 (yes, there are still lots of machines using NT4).

FYI: You probably already know this but I wanted to be sure, plus others might find it of interest..

http://en.wikipedia.org/wiki/UTF-16

UCS-2 is not quite UTF-16, but UCS-2 is a subset of UTF-16 ("upwards compatibility from UCS-2 to UTF-16"); it's essentially UTF-16 without the surrogate pairs. So, in D you can generally* say:

  wchar[] data = cast(wchar[]) std.file.read("filename");

and it should work without throwing any invalid UTF errors.

* this may depend on whether it's UCS-2, UCS-2BE, or UCS-2LE. I'm not sure which format D's UTF-16 is in.

Regan
Nov 21 2007
prev sibling parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Julio César Carrascal Urquijo wrote:
 Matti Niemenmaa wrote:
 Assume you have an ubyte[] named iso_8859_1_string which contains a string
  encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it
  to work, you need to call 
 "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying
  cast.

You can't assume that a function designed to work on an UTF-8 strings works with ISO-8859-1 strings. Beyond the ASCII range UTF-8 isn't compatible with any other charset.

I am well aware of this. I chose strip as an example because it does work on any encoding: it simply calls std.ctype.isspace on each char.
 This is probably the actual problem: C string functions should accept ubyte* 
 instead of char* because a ubyte doesn't have an implied encoding while char 
 does.

Yes. But there are also many D string functions which would work on any encoding.
 If encoding-independent functions accept only char, then it's the former 
 case for _every_ call to a string function when you're dealing with non-UTF
  strings, which quickly becomes onerous.

Unless you are referring to a conversion library like ICU, I don't understand your point on "encoding-independent functions". Phobos' string functions aren't "encoding-independent".

Most are, actually, except for the fact that D character constants are always ASCII. Almost all the std.string functions will work for any "extended ASCII" encoding. And that's what I mean. Given that D doesn't target the kind of machines that use EBCDIC, I use "encoding-independent" to mean either "works on any encoding" or "works on any encoding with ASCII as the lower 128 values".
 I actually tried this, but the code ended up so unreadable that I was 
 forced to change it back, thus having arbitrarily-encoded bytes stored in 
 char[], just for the convenience of being able to use string functions on 
 them.

If you've done that I fear you'll see lots of exceptions appearing in your string handling code once you deliver your program to any non-english speaking user.

Trust me, I know what I'm doing. For instance, the integer conversion functions in std.conv only look for values in the range '0' to '9', ignoring all others. If the encoding has the digits in the same place as ASCII, it will work, regardless of what all the other bytes in the encoding are. If the encoding has the digits in a different place than ASCII, then it won't work, true. But I think you'll find that using EBCDIC or another non-ASCII-based encoding will confuse most of the programs you've got installed on your computer.
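[Editor's note: the digit scan described above could be sketched like so. This is a hypothetical helper in the spirit of std.conv, not its actual code; scanDigits is an invented name.]

```d
// Hypothetical sketch, not the real std.conv implementation: only the
// ASCII codes for '0'..'9' are recognized, so the scan behaves the same
// for any encoding that keeps the digits at their ASCII positions
// (ISO-8859-*, UTF-8, ...) and fails only for the likes of EBCDIC.
int scanDigits(ubyte[] s)
{
    int value = 0;
    foreach (ubyte b; s)
    {
        if (b < '0' || b > '9')
            break;                  // any non-digit byte ends the scan
        value = value * 10 + (b - '0');
    }
    return value;
}
```

Passing it the ISO-8859-1 bytes for "42°C" yields 42, even though the tail bytes are not valid UTF-8; the function simply never looks at them as UTF.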
 Most function in std.string *require* UTF-8 or they'll blow up with a "Error:
  4invalid UTF-8 sequence" message.

No, they do not. Some do, but not most. Of all the functions that take char[] or char* in std.string:

  Functions requiring UTF-8: 22
  Functions not requiring UTF-8: 35
 Actually, I think the implicit casting would be useful for string literals:
 
 byte[] foo = "Julio César";    // In ISO-8859-1.
 
 But then I need some way to tell the compiler that the string is in 
 ISO-8859-1. What I don't see is where does your proposal helps with the 
 example you were giving. For example, if I try to uppercase foo I would get 
 an exception:
 
 toupper(foo);    // BOOM!

True, you would, because std.string.toupper assumes UTF-8. Hence, its type should be string(string), which you couldn't call with byte[], since byte[] doesn't implicitly convert to char[].

But consider what happens now with char[]. The following program compiles, but blows up at runtime:

  import std.string;
  void main()
  {
      char[] foo = "Julio C\xe9sar";
      toupper(foo);
  }

An amendment to my proposal to correct this would be that hex strings, and any string which contains a byte sequence which is not valid UTF, would become ubyte/ushort/uint. Thus the above would fail with a type error because the type of the literal is ubyte[], and it cannot be assigned to a char[]. If the type of foo were ubyte[], calling toupper would fail with a type error. Thereby the only way to get the program above to compile, aside from changing the string literal to UTF-8, would be with a cast, which shows that there's something unsafe going on.
 I think this is unrealistic unless you want to change std.string to be 
 something more like ICU. There are just too many (popular) encodings and 
 variations in use today... and you'll have to support most of them once you 
 start promising to "works on more than one encoding".

By "works on more than one encoding" I meant "works for anything with ASCII as the lower 128 bytes". You'll find that covers the majority of encodings in common use today. -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Nov 21 2007
parent reply Regan Heath <regan netmail.co.nz> writes:
Matti Niemenmaa wrote:
 Julio César Carrascal Urquijo wrote:
 Matti Niemenmaa wrote:
 Assume you have an ubyte[] named iso_8859_1_string which contains a string
  encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it
  to work, you need to call 
 "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying
  cast.

You can't assume that a function designed to work on UTF-8 strings works with ISO-8859-1 strings. Beyond the ASCII range UTF-8 isn't compatible with any other charset.

I am well aware of this. I chose strip as an example because it does work on any encoding: it simply calls std.ctype.isspace on each char.

But, this behaviour isn't guaranteed. In fact I would expect that in future a library like iconv will be leveraged to determine if a character 'is a space' and it will assume the input data is UTF-8. So, if your ASCII based encoding has characters outside the ASCII range and they just happen to match a valid 'is a space' character from the UTF-8 set, then .. whoops.

Now, I don't have a canonical knowledge of character sets so it may be that there are no space characters outside the ASCII range defined in UTF-8... (perhaps when you include surrogate pairs?) or, even if they exist, the chance of an ASCII based character set using that value may be pretty small.

Who knows; all I'm saying is that if a function says it accepts char[] then it is saying "I accept valid UTF-8" and not "I accept any ASCII based character data", so all bets are off if you pass it anything other than UTF-8.
 This is probably the actual problem: C string functions should accept ubyte* 
 instead of char* because a ubyte doesn't have an implied encoding while char 
 does.

Yes. But there are also many D string functions which would work on any encoding.

At present. But that's not guaranteed and it may change in the future, in fact, I expect it to. As far as I can see the only guaranteed thing is that the C functions will not change and will continue to accept ASCII based character sets without possible future gotchas. So, if you must perform string manipulation on non UTF data then you should either write your own functions, or use the C ones. Regan
Nov 21 2007
parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Regan Heath wrote:
 But, this behvaiour isn't guaranteed.  In fact I would expect that in
 future a library like iconv will be leveraged to determine if a
 character 'is a space' and it will assume the input data is UTF-8.

You're right. See below.
 So, if your ASCII based encoding has characters outside the ASCII range
 and they just happen to match a valid 'is a space' character from the
 UTF-8 set, then .. whoops.
 
 Now, I don't have a canonical knowledge of character sets so it may be
 that there are no space characters outside the ASCII range defined in
 UTF-8... (perhaps when you include surogate pairs?) or, even if they
 exist the chance of an ASCII based character set using that value may be
 pretty small.

std.string.LS and std.string.PS are two examples of Unicode whitespace characters. Strip, for some reason, does not strip them.
 Who knows, all I'm saying is that if a function says it accepts char[]
 then it is saying "I accept valid UTF-8" and not "I accept any ASCII
 based character data" so all bets are off if you pass it anything other
 than UTF-8.

You are correct, which is exactly my point: char[] should mean UTF-8 whereas currently many functions use it to mean "text with single-byte characters". That std.string.strip uses char[] currently says nothing about whether it expects UTF-8 or not. Were the std.c package converted to use ubyte[] everywhere, there would be a clear distinction between UTF-8 and "anything". Then, as you say, one should interpret std.string.* as accepting only UTF-8.
 As far as I can see the only guaranteed thing is that the C functions
 will not change and will continue to accept ASCII based character sets
 without possible future gotchas.
 
 So, if you must perform string manipulation on non UTF data then you
 should either write your own functions, or use the C ones.

Correct. The point is that storing non-UTF data in ubyte/ushort/uint is a difficult task because even the C functions take char (or wchar_t, which I think is wchar on Windows and dchar elsewhere) and thus the code quickly becomes castville: cast here, cast there, everywhere a cast cast - and for no good reason.

Thus I believe, as per my original proposal, that library functions be converted to use ubyte[] where they are not meant to accept char[]. This may or may not mean changes in std.string - it's up to the Phobos maintainers to make the choice as to whether a function will ever require UTF-8, and whether to type it as taking char[] or ubyte[]. In any case, at least the C functions should take ubyte[].

The implicit casting from char-whatever to ubyte-whatever is useful when you want to call C functions with D strings. Once again the code would rapidly become castville if it would have to be done explicitly.

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
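[Editor's note: the retyped C bindings argued for above might look like this. These are hypothetical declarations for illustration; the real std.c headers of the time used char*.]

```d
// Hypothetical sketch: C bindings retyped over ubyte*, so non-UTF
// buffers pass without casts. With the proposed implicit
// char[] -> ubyte[] conversion, UTF-8 strings would pass uncast too.
extern (C)
{
    size_t strlen(ubyte* s);            // byte-string length
    ubyte* strchr(ubyte* s, int c);     // find a byte in a string
}

// Caller side: no casts, whatever the encoding of the bytes.
size_t lengthOf(ubyte[] buf)
{
    // Assumes buf is null-terminated, as the C functions require.
    return strlen(buf.ptr);
}
```

The types now document the contract: C's byte-string functions are encoding-agnostic, and only functions that genuinely assume UTF-8 keep char* in their signatures.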
Nov 21 2007
parent Regan Heath <regan netmail.co.nz> writes:
Matti Niemenmaa wrote:
 Regan Heath wrote:
 The point is that storing non-UTF data in ubyte/ushort/uint is a difficult task
 because even the C functions take char (or wchar_t, which I think is wchar on
 Windows and dchar elsewhere) and thus the code quickly becomes castville. cast
 here, cast there, everywhere a cast cast - and for no good reason.

Yeah, agreed 100%
 Thus I believe, as per my original proposal, that library functions be
converted
 to use ubyte[] where they are not meant to accept char[]. This may or may not
 mean changes in std.string - it's up to the Phobos maintainers to make the
 choice as to whether a function will ever require UTF-8, and whether to type it
 as taking char[] or ubyte[]. In any case, at least the C functions should take
 ubyte[].

Agreed. I would tend to leave the std.string functions taking char[] so that when they finally step up and have complete UTF compatibility their signatures do not change.

If we need some functions, like strip, as a stop gap for other encodings then I reckon we add them, perhaps to a different module, and we use ubyte* (or whatever) instead of char[] for the input parameter.
 The implicit casting from char-whatever to ubyte-whatever is useful when you
 want to call C functions with D strings. Once again the code would rapidly
 become castville if it would have to be done explicitly.

The only problem I have with implicit cast to ubyte-whatever is that I worry it will have an unexpected side effect somewhere... Perhaps I am being alarmist.

Regan
Nov 21 2007
prev sibling parent reply Robert DaSilva <sp.unit.262+digitalmars gmail.com> writes:
Matti Niemenmaa wrote:
 Walter Bright wrote:
 char[] => string
 wchar[] => wstring
 dchar[] => dstring

 These are all unicode strings. Putting non-unicode encodings in them,
 even temporarily, should be discouraged. Non-unicode encodings should
 use ubyte[], ushort[], etc.

At last! This is the way I've been thinking it should be for a long time. However, this requires a change to the language - make char/wchar/dchar types implicitly convertible to ubyte/ushort/uint - and a bunch of library changes - functions that don't require UTF should use ubyte/ushort/uint - in order to be practically usable. Details follow.

Assume you have an ubyte[] named iso_8859_1_string which contains a string encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying cast.

The same thing applies the other way, of course - assume the C standard library accepts ubyte* instead of char* for all the C string functions. This is more correct than the current situation, as the C standard library is encoding-independent. Now, if you have a UTF-8 string which you wish to pass to a C string handling function, you need to do, for instance: "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.

If encoding-independent functions accept only char, then it's the former case for _every_ call to a string function when you're dealing with non-UTF strings, which quickly becomes onerous. I actually tried this, but the code ended up so unreadable that I was forced to change it back, thus having arbitrarily-encoded bytes stored in char[], just for the convenience of being able to use string functions on them.

Here're the details of the solution to this problem that I've thought of: make char, char*, char[], etc. all implicitly castable to the corresponding ubyte types, and equivalently for wchar/ushort and dchar/uint. Then, functions which require UTF-x can continue to use [dw]char while functions which work regardless of encoding (most functions in std.string) should use ubyte. This way, the functions transparently work for [dw]string whilst still working for non-UTF.

To be precise, in the above, "work regardless of encoding" should be read as "works on more than one encoding": even a simple function like std.string.strip would have to be changed to work on EBCDIC, for instance. I would assume ASCII, especially given that D doesn't target machines older than relatively modern 32-bit computers, to be the common subset. This way ubyte[] would mean "ASCII or something else" and it's up to the programmer to not call it on functions which require ASCII. I don't think this is a problem.
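Matti's proposal can be sketched as follows; stripBytes is a hypothetical helper, not an existing std.string function, and it assumes an ASCII-compatible encoding as he describes:

```d
// Hypothetical encoding-agnostic strip, typed ubyte[] as proposed.
// It only inspects bytes below 0x80, so it works on any ASCII-compatible
// encoding (Latin-1, UTF-8, ...) without casts at the call site.
ubyte[] stripBytes(ubyte[] s)
{
    size_t lo = 0, hi = s.length;
    while (lo < hi && (s[lo] == 0x20 || s[lo] == 0x09)) ++lo;
    while (hi > lo && (s[hi - 1] == 0x20 || s[hi - 1] == 0x09)) --hi;
    return s[lo .. hi];
}

void main()
{
    // "  café  " in ISO-8859-1: 0xE9 is not valid UTF-8, so this data
    // has no business in a char[] - but ubyte[] holds it happily.
    ubyte[] iso_8859_1_string =
        [0x20, 0x20, 0x63, 0x61, 0x66, 0xE9, 0x20, 0x20];
    assert(stripBytes(iso_8859_1_string) == [0x63, 0x61, 0x66, 0xE9]);
}
```

No *cast(char[]*) gymnastics are needed at the call site, which is the whole point of typing encoding-independent functions as ubyte[].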

Perhaps {,w,d}char should become typedefs of u{byte,short,int} and be dropped as keywords?
Nov 20 2007
parent Matti Niemenmaa <see_signature for.real.address> writes:
Robert DaSilva wrote:
 Perhaps {,w,d}char should become typedefs of u{byte,short,int} and be
 dropped as keywords?

There would still need to be special handling for char and string literals, at least (since they're defined as UTF), but yes, this is a possibility. -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Nov 21 2007
prev sibling next sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message 
news:fhsts6$5nn$1 digitalmars.com...
 As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 be 
 named toString, toWString, and toDString, respectively, and Unicode should 
 be assumed as the standard encoding format in D.

Do you want to know my single overriding reason for wanting toString instead of toUtf8? Because it's nicer-looking and easier to type.

My other reasons include consistency (Java uses .toString, .Net uses .ToString, Phobos uses .toString) and that "toUtf8" screams "I'm a string class and this method converts my encoding!" while "toString" says "convert this object, whatever it is, to some kind of string."

votes += 8 for toString, toWString, and toDString.
Nov 19 2007
next sibling parent reply Gregor Richards <Richards codu.org> writes:
Jarrett Billingsley wrote:
 Do you want to know my single overriding reason for wanting toString instead 
 of toUtf8?  Because it's nicer-looking and easier to type.

Hear hear.
Nov 19 2007
next sibling parent reply "Kris" <foo bar.com> writes:
With respect to all, you're perhaps not addressing Sean's deeper questions? 
Instead, this seems like another bunch of "toUtf8! NO! U ... toString() 
dammit!"

Which is kinda superficial at this point?


"Gregor Richards" <Richards codu.org> wrote in message 
news:fht0qs$9rk$1 digitalmars.com...
 Jarrett Billingsley wrote:
 Do you want to know my single overriding reason for wanting toString 
 instead of toUtf8?  Because it's nicer-looking and easier to type.

Hear hear.

Nov 19 2007
parent Gregor Richards <Richards codu.org> writes:
Kris wrote:
 With respect to all, you're perhaps not addressing Sean's deeper questions? 
 Instead, this seems like another bunch of "toUtf8! NO! U ... toString() 
 dammit!"
 
 Which is kinda superficial at this point?
 
 
 "Gregor Richards" <Richards codu.org> wrote in message 
 news:fht0qs$9rk$1 digitalmars.com...
 Jarrett Billingsley wrote:
 Do you want to know my single overriding reason for wanting toString 
 instead of toUtf8?  Because it's nicer-looking and easier to type.



Confusing it's, post top don't. Actually, we're not even addressing Sean's surficial questions. - Gregor Richards
Nov 19 2007
prev sibling parent =?ISO-8859-1?Q?Julio_C=E9sar_Carrascal_Urquijo?= writes:
Gregor Richards wrote:
 Jarrett Billingsley wrote:
 Do you want to know my single overriding reason for wanting toString 
 instead of toUtf8?  Because it's nicer-looking and easier to type.

Hear hear.

Actually, Jarrett's other arguments seemed more compelling to me. Jarrett Billingsley wrote:
 My other reasons include consistency (Java uses .toString, .Net uses
 .ToString, phobos uses .toString) and that "toUtf8" screams "I'm a string
 class and this method converts my encoding!" while "toString" says "convert
 this object, whatever it is, to some kind of string."

Stating intent seems more important to me than the stylistic issues between toString and toUtf8. I'm all for toString / toWString / toDString for a readable representation of a class and toUtf8 / 16 / 32 for converting encodings. Also, toStringW seems more readable than toWString, but for me it's not a big deal which one the Tango developers choose.

-- 
Julio César Carrascal Urquijo
http://jcesar.artelogico.com/
Nov 20 2007
prev sibling parent BCS <ao pathlink.com> writes:
Reply to Jarrett,

 "Sean Kelly" <sean f4.ca> wrote in message
 news:fhsts6$5nn$1 digitalmars.com...
 
 As an alternative, I can only suggest that toUTF8, toUTF16, and
 toUTF32 be named toString, toWString, and toDString, respectively,
 and Unicode should be assumed as the standard encoding format in D.
 

Do you want to know my single overriding reason for wanting toString instead of toUtf8? Because it's nicer-looking and easier to type. My other reasons include consistency (Java uses .toString, .Net uses .ToString, phobos uses .toString) and that "toUtf8" screams "I'm a string class and this method converts my encoding!" while "toString" says "convert this object, whatever it is, to some kind of string." votes += 8 for toString, toWString, and toDString.

shouldn't that be?

votes += 8 for toString
votes += 16 for toWString
votes += 32 for toDString
Nov 19 2007
prev sibling next sibling parent reply Christopher Wright <dhasenan gmail.com> writes:
Sean Kelly wrote:
 I was looking at converting Tango's use of toUtf8 to toString today and 
 ran into a bit of a quandry....

toUtf8 is ugly.

toString/toWString/toDString are opaque and ugly, hard to distinguish from each other.

toString, toStringW, toStringD? Still ugly.

toUtf, toUtf16, toUtf32? Slightly less clear, but easier to type.

toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits well with other conventions.
Nov 19 2007
parent reply Sean Kelly <sean f4.ca> writes:
Christopher Wright wrote:
 Sean Kelly wrote:
 I was looking at converting Tango's use of toUtf8 to toString today 
 and ran into a bit of a quandry....

toUtf8 is ugly. toString/toWString/toDString are opaque and ugly, hard to distinguish from each other. toString, toStringW, toStringD? Still ugly. toUtf, toUtf16, toUtf32? Slightly less clear, but easier to type. toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits well with other conventions.

I tend to place a tremendous amount of value on consistency, because the more consistent an API is, the more likely my guesses about it are to be correct. In my opinion, that precludes using the option you suggest.

In my opinion, Walter's suggestion that alternate encodings not be stored in strings is sufficient reason to not bother with the encoding format in the function name (ie. toUtf8/toUtf16/toUtf32). I might counter that I don't see any reason to lose meaning where it is so easily provided, but on the other hand, I agree that new users are more likely to know what a function named toString does than were it named toUtf8. These two points are a wash in my opinion.

The remaining concerns are less substantive. I find toWString and toDString difficult to read, but those feelings hold little more weight than "toUtf8 is ugly." I also feel that the term "string" is largely meaningless in programming. But I certainly couldn't win a debate with either point.

I don't suppose there is anyone who does a lot of internationalization programming who can comment on the utility of one convention vs. the other? I would love to hear some more practical concerns regarding the naming convention for these functions.

Sean
Nov 19 2007
next sibling parent Bill Baxter <dnewsgroup billbaxter.com> writes:
Sean Kelly wrote:
 Christopher Wright wrote:
 Sean Kelly wrote:
 I was looking at converting Tango's use of toUtf8 to toString today 
 and ran into a bit of a quandry....

toUtf8 is ugly. toString/toWString/toDString are opaque and ugly, hard to distinguish from each other. toString, toStringW, toStringD? Still ugly. toUtf, toUtf16, toUtf32? Slightly less clear, but easier to type. toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits well with other conventions.

I tend to place a tremendous amount of value on consistency, because the more consistent an API is, the more likely my guesses about it are to be correct. In my opinion, that precludes using the option you suggest. In my opinion, Walter's suggestion that alternate encodings not be stored in strings is sufficient reason to not bother with the encoding format in the function name (ie. toUtf8/toUtf16/toUtf32). I might counter that I don't see any reason to lose meaning where it is so easily provided, but on the other hand, I agree that new users are more likely to know what a function named toString does than were it named toUtf8. These two points are a wash in my opinion. The remaining concerns are less substantive. I find toWString and toDString difficult to read, but those feelings hold little more weight than "toUtf8 is ugly." I also feel that the term "string" is largely meaningless in programming. But I certainly couldn't win a debate with either point. I don't suppose there is anyone who does a lot of internationalization programming who can comment on the utility of one convention vs. the other? I would love to hear some more practical concerns regarding the naming convention for these functions.

My just formed opinion :-) is that any sort of toWstring/toDstring functions should be standalone things that only accept type "string" or "char" as input. Yes there will be some performance penalty in some cases, but I don't think that's significant enough to warrant creating lots of functions that do exactly the same thing, just with different encodings. --bb
Nov 19 2007
prev sibling next sibling parent "David B. Held" <dheld codelogicconsulting.com> writes:
Sean Kelly wrote:
 [...]
 I don't suppose there is anyone who does a lot of internationalization 
 programming who can comment on the utility of one convention vs. the 
 other?  I would love to hear some more practical concerns regarding the 
 naming convention for these functions.

I certainly don't qualify as someone who does a "lot" of i18n programming, but I do some. Regardless, I would have to say that when I see a function called toUtfXX(), I think "Oh, that must convert a string from Latin-1 or something", rather than "Oh, that must give me the UTF-XX representation of an object".

Perl is a bad example because it didn't get righteous UTF-8 support until 5.8, but whenever you see "utf8" or similar in a Perl program, it almost invariably involves an encoding/decoding operation. Perhaps it is worth noting that whenever you see "UTF-8" in Java, it most likely has to do with encoding/decoding. And the same is true of C#, etc. So it appears that the precedent is that for most other languages, when "UTF-8" is spelled out explicitly, it is usually in a transcoding context.

I don't think toWString() is an ideal name, but it seems to have the right connotations to the naive programmer.

Dave
Nov 20 2007
prev sibling next sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Sean Kelly wrote:
 Christopher Wright wrote:
 toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits 
 well with other conventions.

I tend to place a tremendous amount of value on consistency, because the more consistent an API is, the more likely my guesses about it are to be correct. In my opinion, that precludes using the option you suggest.

IMHO, the consistent alternative is pretty clear:

char  -> string  -> toString
wchar -> wstring -> toWString
dchar -> dstring -> toDString

The only problem seems to lie in the aesthetics of the camelCase convention, but doesn't consistency trump aesthetics?
 In my opinion, Walter's suggestion that alternate encodings not be 
 stored in strings is sufficient reason to not bother with the encoding 
 format in the function name (ie. toUtf8/toUtf16/toUtf32). 

I agree, but this is hardly a new suggestion. I think it has always been pretty clear that one should never store anything but UTF-encoded data in {,w,d}char[]s. Also, I have always felt Tango's toUtf{8,16,32} are a bit too explicitly named. Almost like using toSingleIEEE754 instead of toFloat.
 I don't suppose there is anyone who does a lot of internationalization 
 programming who can comment on the utility of one convention vs. the 
 other?  I would love to hear some more practical concerns regarding the 
 naming convention for these functions.

I have done quite a bit of text processing and handling of different encodings in D, and while naming doesn't matter much as long as it is consistent, what I do is:

* use {,w,d}char strictly for UTF data (I have sometimes cheated here, mainly to be able to use certain std.string functions, but with a good templated string/array library (such as in Tango), that is not necessary)
* use Unicode internally as much as possible, transcoding as early and as late as possible.
* when there is a reason not to use UTF internally, use typedefs like "typedef char lat1", and keep unknown encodings as ubyte[]s.

Knowing that {,w,d}chars always contain UTF has never been a problem. The problems that arise come instead from mistakenly using char rather than {,u}byte in C APIs, and from D's horrible behavior of by default crashing instead of recovering from UTF errors.

A much better default behavior would be to simply substitute illegal UTF units with a '?' and keep going. Having to remember to sanitize all untrusted Unicode strings is a chore, and forgetting that at any point will lead to crashes in running code at inconvenient situations.

-- 
Oskar
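Oskar's "substitute a '?' and keep going" policy can be sketched with today's std.utf names (sanitize is an assumed helper, not a library function):

```d
import std.utf : decode, UTFException;

// Re-encode a string, replacing each byte that breaks decoding with '?'
// instead of letting the UTF error propagate and crash the program.
string sanitize(string s)
{
    char[] result;
    size_t i = 0;
    while (i < s.length)
    {
        size_t j = i;
        try
        {
            decode(s, j);           // advances j past one valid code point
            result ~= s[i .. j];
        }
        catch (UTFException)
        {
            result ~= '?';          // substitute for the offending byte
            j = i + 1;              // resynchronize one byte later
        }
        i = j;
    }
    return cast(string) result;
}
```

Whether replacement should be the default, as Oskar argues, or an opt-in like this helper, is exactly the policy question the post raises.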
Nov 20 2007
parent Sean Kelly <sean f4.ca> writes:
Oskar Linde wrote:
 Sean Kelly wrote:
 Christopher Wright wrote:
 toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits 
 well with other conventions.

I tend to place a tremendous amount of value on consistency, because the more consistent an API is, the more likely my guesses about it are to be correct. In my opinion, that precludes using the option you suggest.

IMHO, the consistent alternative is pretty clear:

char  -> string  -> toString
wchar -> wstring -> toWString
dchar -> dstring -> toDString

The only problem seems to lie in the aesthetics of the camelCase convention, but doesn't consistency trump aesthetics?

It depends :-) I prefer the suggested toStringW and toStringD convention. While it doesn't exactly match the returned type name in letter order, the same information is communicated and is done in what I feel is a more readable format. Also, if the words were placed in a larger list and then sorted, they would end up adjacent to one another.
 In my opinion, Walter's suggestion that alternate encodings not be 
 stored in strings is sufficient reason to not bother with the encoding 
 format in the function name (ie. toUtf8/toUtf16/toUtf32). 

I agree, but this is hardly a new suggestion. I think it has always been pretty clear that one should never store anything but UTF-encoded data in {,w,d}char[]s.

Yup. But to me, this is different from a semi-official declaration to this effect. With the latter, the suggestion is more likely to be enforceable.
 Also, I have always felt Tangos toUtf{8,16,32} are a 
 bit too explicitly named. Almost like using toSingleIEEE754 instead of 
 toFloat.

Fair enough :-)
 I don't suppose there is anyone who does a lot of internationalization 
 programming who can comment on the utility of one convention vs. the 
 other?  I would love to hear some more practical concerns regarding 
 the naming convention for these functions.

I have done quite a bit of text processing and handling of different encodings in D, and while naming doesn't matter much as long as it is consistent, what I do is:

* use {,w,d}char strictly for UTF data (I have sometimes cheated here, mainly to be able to use certain std.string functions, but with a good templated string/array library (such as in Tango), that is not necessary)
* use Unicode internally as much as possible, transcoding as early and as late as possible.
* when there is a reason not to use UTF internally, use typedefs like "typedef char lat1", and keep unknown encodings as ubyte[]s.

Knowing that {,w,d}chars always contain UTF has never been a problem. The problems that arise come instead from mistakenly using char rather than {,u}byte in C APIs, and from D's horrible behavior of by default crashing instead of recovering from UTF errors.

Darnit, I forgot about the C APIs. I'll have to replace their use of char with char_t or c_char (the latter matches c_long but the former matches wchar_t).
 A much better default behavior would be to simply substitute illegal 
 UTF-units with a '?' and keep going. Having to remember to sanitize all 
 untrusted unicode strings is a chore, and forgetting that at any point 
 will lead to crashes in running code at inconvenient situations.
 

This is useful information. Thanks. Sean
Nov 20 2007
prev sibling parent James Dennett <jdennett acm.org> writes:
Sean Kelly wrote:
 I don't suppose there is anyone who does a lot of internationalization
 programming who can comment on the utility of one convention vs. the
 other?  I would love to hear some more practical concerns regarding the
 naming convention for these functions.

A D-wide (at least optionally *enforced*) specification that the various types of "character" arrays are really strings, not just arrays of the underlying storage types, would mean that no convention would be needed to convey the meaning, and the simpler name could be used safely (as the type system would imply the encoding). (I can't help but think that this is one more reason why string types should *not* be built-in arrays, even if they are known to the compiler, but I think my chances of persuading Walter that string==array is a mistake are three quarters of ten percent of none at all.)

In the absence of a language-enforced/mandated encoding, it's up to the library to force programmers to consider these issues; in that case, names making the encoding clear (even names as ugly as toUtf8) are better than a more readable, more generic but less intention-conveying name like toString.

Most of the code I see (in C, C++, Java and more) is far too sloppy about knowing which encoding is used for a given string. Unicode is now mature enough to make some sense for the default in programming languages. Ideally I'd make the encoding something akin to a template parameter, so that the compiler's type-checking could help out -- but I digress into language design (as is inevitable when high level facilities like strings are made part of the language rather than being "just" standard library features).

-- 
James
Dec 16 2007
prev sibling next sibling parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
Sean Kelly wrote:
 I was looking at converting Tango's use of toUtf8 to toString today and 
 ran into a bit of a quandary.  Currently, Tango's use of toUtf8 as the 
 member function for returning char strings is consistent with all use of 
 string operations in Tango.  Routines that return wchar strings are 
 named toUtf16 whether they are members of the String class or whether 
 they are intended to perform UTF conversions, and so on.  Thus, the 
 convention is consistent and pervasive.
 
 What I discovered during a test conversion of Tango was that converting 
 all uses of toUtf8 to toString /except/ those intended to perform UTF 
 conversions reduced code clarity, and left me unsure as to which name I 
 would actually use in a given situation.  For example, there is quite a 
 bit of code in the text and io packages which convert an arbitrary type 
 to a char[] for output, etc.  So by making this change I was left with 
 some conversions using toString and others using toUtf8, toUtf16, and 
 toUtf32, not to mention the fromXxx versions of these same functions. As 
 this is template code, the choice between toString and toUtf8 in a given 
 situation was unclear.  Given this, I decided to look to Phobos for a 
 model to follow.
 
 What I found in Phobos was that it suffers from the same situation as I 
 found Tango in during my test conversion.  Routines that convert any 
 type but a string to a char[] are named toString, while the string 
 equivalent is named toUTF8.  Given this, I surmised that the naming 
 convention in D is that all strings are assumed to be Unicode, except 
 when they're not.  String literals are required to be Unicode, foreach 
 assumes strings to be UTF encoded when performing its automatic 
 conversions, and all of the toString functions in std.string assume 
 UTF-8 as the output format.  So why bother with the name toUTF8 in std.utf?
 
 As near as I can tell, the reason for text conversion routines to be 
 named differently is to simplify the use of routines which convert to 
 another format.  std.windows.charset, for example, has a routine called 
 toMBSz, to distinguish from the toUTF8 routine.  What I find significant 
 about this is that it suggests that while the transport mechanism for 
 strings is the same in each case (both routines return a char[], ie. a 
 string), 

Does that even work? I would think there are some valid MBSz's that are invalid UTF sequences, and so toMBSz would have to return byte[].
 the underlying encoding is different.  Thus there seems a clear 
 disconnect between the name of the transport mechanism (string), and 
 routines that generate them.  With this in mind, I begin to question the 
 point of having toString as the common name for routines that generate 
 char strings.  The encoding clearly matters in some instances and cannot 
 be ignored, so ignoring it in others just seems to confuse things.

As far as I'm concerned Utf8 is *the* encoding for text in D. Anything else is only for some special purpose like ease of manipulation (dstring for I18N text that needs fast searching / slicing) or interchange with external APIs (utf16 for working with windows).
 With this in mind, I will admit that I am questioning the merit of 
 changing Tango's toUtf8 routines to be named toString.  Doing so seems 
 to sacrifice both operational consistency and clarity in an attempt to 
 maintain consistency with the name of the transport mechanism: string. 
 And as I have said above, while strings in D are generally expected to 
 be Unicode, they are clearly not always Unicode, as the existence of 
 std.windows.charset can attest.  

I really think toMBSz should be returning byte[] and fromMBSz should be taking a byte*. The doc for types says char is unsigned 8 bit UTF-8. Period. And you get errors from the compiler if you try to initialize a string with something that's not valid UTF-8. So MBSz data has no business parading around dressed up as char[].
 So I am left wondering whether someone 
 can explain why toString is the preferred name for string-producing 
 routines in D?  I feel it is very important to establish a consistent 
 naming mechanism for D, and as Phobos seems to be the model in this case 
 I may well have no choice in the matter of toUtf8 vs. toString.  But I 
 would feel much better about the change if someone could provide a sound 
 reason for doing so, since my first attempt at a conversion has left me 
 somewhat worried about its long-term effect on code clarity.
 
 As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 
 be named toString, toWString, and toDString, respectively, and Unicode 
 should be assumed as the standard encoding format in D.

Since the Tango convention is to treat acronyms as single words (the actual Tango UTF methods are called toUtf8, toUtf16, and toUtf32), it seems there's an argument for treating wstring and dstring as single entities too. So then it would be:

    toString, toWstring, toDstring

Don't know if that hurts your eyes less or not, but it seems more consistent with Tango's existing naming convention to me than toWString, etc.

--bb
Nov 19 2007
parent reply Sean Kelly <sean f4.ca> writes:
Bill Baxter wrote:
 Sean Kelly wrote:
 I was looking at converting Tango's use of toUtf8 to toString today 
 and ran into a bit of a quandary.  Currently, Tango's use of toUtf8 as 
 the member function for returning char strings is consistent with all 
 use of string operations in Tango.  Routines that return wchar strings 
 are named toUtf16 whether they are members of the String class or 
 whether they are intended to perform UTF conversions, and so on.  
 Thus, the convention is consistent and pervasive.

 What I discovered during a test conversion of Tango was that 
 converting all uses of toUtf8 to toString /except/ those intended to 
 perform UTF conversions reduced code clarity, and left me unsure as to 
 which name I would actually use in a given situation.  For example, 
 there is quite a bit of code in the text and io packages which convert 
 an arbitrary type to a char[] for output, etc.  So by making this 
 change I was left with some conversions using toString and others 
 using toUtf8, toUtf16, and toUtf32, not to mention the fromXxx 
 versions of these same functions. As this is template code, the choice 
 between toString and toUtf8 in a given situation was unclear.  Given 
 this, I decided to look to Phobos for a model to follow.

 What I found in Phobos was that it suffers from the same situation as 
 I found Tango in during my test conversion.  Routines that convert any 
 type but a string to a char[] are named toString, while the string 
 equivalent is named toUTF8.  Given this, I surmised that the naming 
 convention in D is that all strings are assumed to be Unicode, except 
 when they're not.  String literals are required to be Unicode, foreach 
 assumes strings to be UTF encoded when performing its automatic 
 conversions, and all of the toString functions in std.string assume 
 UTF-8 as the output format.  So why bother with the name toUTF8 in 
 std.utf?

 As near as I can tell, the reason for text conversion routines to be 
 named differently is to simplify the use of routines which convert to 
 another format.  std.windows.charset, for example, has a routine 
 called toMBSz, to distinguish from the toUTF8 routine.  What I find 
 significant about this is that it suggests that while the transport 
 mechanism for strings is the same in each case (both routines return a 
 char[], ie. a string), 

Does that even work? I would think there are some valid MBSz's that are invalid UTF sequences, and so toMBSz would have to return byte[].

It works because D performs no run-time verification that what's in a char[] is actually Unicode. You could dump binary data in a string if you really wanted to.
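A small illustration of this point, assuming modern std.utf for the validation step: the cast compiles and runs, and the invalid data is only noticed when something finally tries to decode it.

```d
import std.utf : validate, UTFException;

void main()
{
    // Arbitrary binary data dressed up as a char[]: the cast compiles,
    // and nothing complains at this point.
    ubyte[] raw = [0xFF, 0xFE, 0x41];
    char[] s = cast(char[]) raw;

    // Only an explicit decode/validate (or an implicit one, e.g. a
    // decoding foreach) surfaces the problem.
    bool caught = false;
    try validate(s);
    catch (UTFException) caught = true;
    assert(caught);
}
```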
 I really think toMBSz should be returning byte[] and fromMBSz should be 
 taking a byte*.  The doc for types says char is unsigned 8 bit UTF-8. 
 Period.  And you get errors from the compiler if you try to initialize a 
 string with something that's not valid UTF-8.  So MBSz data has no 
 business parading around dressed up as char[].

I think you're right about toMBSz.
 Since the tango convention is to treat acronyms as single words, (the 
 actual tango utf methods are called toUtf8 toUtf16 and toUtf32) it seems 
 there's an argument for treating wstring and dstring as single entities 
 too.  So then it would be:
     toString, toWstring, toDstring
 
 Don't know if that hurts your eyes less or not, but it seems more 
 consistent with Tango's existing naming convention to me than toWString, 
 etc.

Yeah I was thinking the same thing. It's certainly easier for me to read than the other form. Sean
Nov 19 2007
parent reply "Kris" <foo bar.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message
[snip]
 Don't know if that hurts your eyes less or not, but it seems more 
 consistent with Tango's existing naming convention to me than toWString, 
 etc.

Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.

Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring
Nov 19 2007
next sibling parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
Kris wrote:
 "Sean Kelly" <sean f4.ca> wrote in message
 [snip]
 Don't know if that hurts your eyes less or not, but it seems more 
 consistent with Tango's existing naming convention to me than toWString, 
 etc.

 Yeah I was thinking the same thing.  It's certainly easier for me to 
 read than the other form.

Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

How so? toString returns a string. toInt returns an int. toFloat returns a float. to??? returns a wstring. Seems whatever goes in the ??? place should include the letters "w-s-t-r-i-n-g" in that order. --bb
Nov 19 2007
parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Bill Baxter wrote:

 Kris wrote:
 "Sean Kelly" <sean f4.ca> wrote in message
 [snip]
 Don't know if that hurts your eyes less or not, but it seems more
 consistent with Tango's existing naming convention to me than
 toWString, etc.

 Yeah I was thinking the same thing.  It's certainly easier for me to 
 read than the other form.

Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

How so? toString returns a string. toInt returns an int. toFloat returns a float. to??? returns a wstring. Seems whatever goes in the ??? place should include the letters "w-s-t-r-i-n-g" in that order.

Only if you have recognized wstring and dstring as good names for those aliases <g> -- Lars Ivar Igesund blog at http://larsivi.net DSource, #d.tango & #D: larsivi Dancing the Tango
Nov 20 2007
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Lars Ivar Igesund wrote:
 Only if you have recognized wstring and dstring as good names for those
 aliases <g>

They'd be consistent with wchar and dchar.
Nov 20 2007
parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Walter Bright wrote:

 Lars Ivar Igesund wrote:
 Only if you have recognized wstring and dstring as good names for those
 aliases <g>

They'd be consistent with wchar and dchar.

Right ... now I don't like those either ;)

--
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 20 2007
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Lars Ivar Igesund wrote:
 Walter Bright wrote:
 
 Lars Ivar Igesund wrote:
 Only if you have recognized wstring and dstring as good names for those
 aliases <g>


Right ... now I don't like those either ;)

What can I say? !!
Nov 20 2007
parent reply "Kris" <foo bar.com> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote
 Lars Ivar Igesund wrote:
 Walter Bright wrote:

 Lars Ivar Igesund wrote:
 Only if you have recognized wstring and dstring as good names for those
 aliases <g>


Right ... now I don't like those either ;)

What can I say? !!

hehe

Well, perhaps it's worth noting that all of these names are probably a cousin of "hungarian notation", since the name is being decorated with some kind of indicator of what it represents? The question perhaps should be - why is that?

If we speculate, for a moment, that the language supported overload on return type:

char[] toString();
wchar[] toString();
dchar[] toString();

then, there would be no issue here. Right? However, we don't have overload-on-return-type, so it seems to me that the decorated names are a means to work around that. Does that seem logical?

Perhaps what we're seeing here, Walter, is a measure of distaste for the notion of decorated-names?
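For what it's worth, absent overload-on-return-type, the same effect can be had by moving the result type into a template argument rather than into the function name. This is only a sketch - the name toStr is hypothetical, not an existing Phobos or Tango symbol:

    // Sketch only: select the result type at the call site instead of
    // decorating the function name. "toStr" is a made-up name.
    T toStr(T)(int value)
    {
        static if (is(T == char[]))
            return "42".dup;        // real code would format `value`
        else static if (is(T == wchar[]))
            return "42"w.dup;
        else
            return "42"d.dup;
    }

    unittest
    {
        // the "decoration" moves into the template argument
        char[]  s = toStr!(char[])(42);
        wchar[] w = toStr!(wchar[])(42);
    }

The call site still spells out the type, but the function itself keeps a single undecorated name.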
Nov 20 2007
parent reply Christopher Wright <dhasenan gmail.com> writes:
Kris wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote
 Lars Ivar Igesund wrote:
 Walter Bright wrote:

 Lars Ivar Igesund wrote:
 Only if you have recognized wstring and dstring as good names for those
 aliases <g>




 hehe

 Well, perhaps it's worth noting that all of these names are probably a cousin of "hungarian notation", since the name is being decorated with some kind of indicator of what it represents? The question perhaps should be - why is that?

 If we speculate, for a moment, that the language supported overload on return type:

 char[] toString();
 wchar[] toString();
 dchar[] toString();

 then, there would be no issue here. Right? However, we don't have overload-on-return-type, so it seems to me that the decorated names are a means to work around that. Does that seem logical? Perhaps what we're seeing here, Walter, is a measure of distaste for the notion of decorated-names?

class String {
   char[] opImplicitCast () {}
   wchar[] opImplicitCast () {}
   dchar[] opImplicitCast () {}
}

String toString () {}

How does that look?
Nov 20 2007
parent reply Sean Kelly <sean f4.ca> writes:
Christopher Wright wrote:
 
 class String {
    char[] opImplicitCast () {}
    wchar[] opImplicitCast () {}
    dchar[] opImplicitCast () {}
 }
 
 String toString () {}
 
 How does that look?

Tango already has a String class with toUtf8, toUtf16, and toUtf32 member functions. This was one of our original objections to the idea of toString as a member function that must return a char[]. We will have to rename the class to something else if this change goes through.

Sean
Nov 20 2007
parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Sean Kelly wrote:

 Christopher Wright wrote:
 
 class String {
    char[] opImplicitCast () {}
    wchar[] opImplicitCast () {}
    dchar[] opImplicitCast () {}
 }
 
 String toString () {}
 
 How does that look?

Tango already has a String class with toUtf8, toUtf16, and toUtf32 member functions. This was one of our original objections to the idea of toString as a member function that must return a char[]. We will have to rename the class to something else if this change goes through. Sean

It is already renamed to Text.

--
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 20 2007
parent Sean Kelly <sean f4.ca> writes:
Lars Ivar Igesund wrote:
 Sean Kelly wrote:
 
 Christopher Wright wrote:
 class String {
    char[] opImplicitCast () {}
    wchar[] opImplicitCast () {}
    dchar[] opImplicitCast () {}
 }

 String toString () {}

 How does that look?

Tango already has a String class with toUtf8, toUtf16, and toUtf32 member functions. This was one of our original objections to the idea of toString as a member function that must return a char[]. We will have to rename the class to something else if this change goes through.

It is already renamed to Text.

Oops!
Nov 20 2007
prev sibling next sibling parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Kris wrote:

 "Sean Kelly" <sean f4.ca> wrote in message
 [snip]
 Don't know if that hurts your eyes less or not, but it seems more
 consistent with Tango's existing naming convention to me than toWString,
 etc.

Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.

Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

FWIW, this would be preferable to me too.

--
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 20 2007
parent Regan Heath <regan netmail.co.nz> writes:
Lars Ivar Igesund wrote:
 Kris wrote:
 
 "Sean Kelly" <sean f4.ca> wrote in message
 [snip]
 Don't know if that hurts your eyes less or not, but it seems more
 consistent with Tango's existing naming convention to me than toWString,
 etc.

Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.

Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

FWIW, this would be preferable to me too.

+votes
Nov 20 2007
prev sibling next sibling parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Kris" <foo bar.com> wrote in message news:fhtru8$1no5$1 digitalmars.com...
 "Sean Kelly" <sean f4.ca> wrote in message
 [snip]
 Don't know if that hurts your eyes less or not, but it seems more 
 consistent with Tango's existing naming convention to me than toWString, 
 etc.

Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.

Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

Now that I've seen toWString and toStringW, I'll have to say I do like the toStringW/toStringD version better.

// retract previous votes
toWString.votes -= 8;
toDString.votes -= 8;
toStringW.votes += 334;
toStringD.votes += 334;
Nov 20 2007
prev sibling parent Chad J <gamerChad _spamIsBad_gmail.com> writes:
Kris wrote:
 "Sean Kelly" <sean f4.ca> wrote in message
 [snip]
 Don't know if that hurts your eyes less or not, but it seems more 
 consistent with Tango's existing naming convention to me than toWString, 
 etc.

Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.

Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

This conversation caught my eye and I cringed at toWString and toDString. toStringW and toStringD are acceptable though.

Sean made a brief argument from psychology earlier. It made me remember this thing:

Olny srmat poelpe can raed tihs. I cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Amzanig huh? yaeh and I awlyas tghuhot slpeling was ipmorantt!

Perhaps this is important for naming conventions in general? Any similarly named entities must differ at the beginning or end of the name. I'm not sure how deeply this affects existing APIs or if it causes problems ;)

It is also noteworthy that char, wchar, dchar are consistent with that naming constraint, but not with toStringW and toStringD. IMO the former matters more than the latter, simply because it is ingrained into our minds. Still, I am not entirely convinced that such a constraint is wise in general, though I do like its application here.
Nov 20 2007
prev sibling parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
Sean Kelly wrote:
 As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 
 be named toString, toWString, and toDString, respectively, and Unicode 
 should be assumed as the standard encoding format in D.

1) On the question of toWString vs toWstring and consistency:

I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

2) On the question of toWString vs toStringW:

It seems to be pretty well agreed in this thread that toWString is more consistent but toStringW is prettier.

I could be wrong but I think usage pattern of these W and D variants of the functions will be bimodal: either very frequent or very infrequent. In the former case I'd probably want to make a simpler alias like 'wstr'. In the latter case I'd want it to be the most consistent thing possible to be easy to remember for the few times I use it.

--bb
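On the frequent-use case: in D the shorter spelling really is a one-line alias, whichever full name wins. A sketch, assuming Phobos' std.utf.toUTF16 as the underlying conversion; the name wstr is hypothetical:

    import std.utf;    // Phobos UTF conversion routines

    // "wstr" is a made-up short alias for the frequent-use case;
    // toUTF16 stands in for whatever the function ends up being called
    alias toUTF16 wstr;

    unittest
    {
        wchar[] w = wstr("hello");
    }

So the consistent-but-long name and the convenient short name need not compete: the library can pick the consistent one and users can alias it locally.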
Nov 20 2007
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Bill Baxter wrote:
 Sean Kelly wrote:
 As an alternative, I can only suggest that toUTF8, toUTF16, and 
 toUTF32 be named toString, toWString, and toDString, respectively, and 
 Unicode should be assumed as the standard encoding format in D.

1) On the question of toWString vs toWstring and consistency: I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

Good question. Probably toUInt, though I don't like it much :-)

For these conversion routines, I'll admit I like the idea that the type name should be repeated exactly, which suggests something like to_wstring, but I don't imagine anyone finds that appealing.
 2) On the question of toWString vs toStringW
 
 It seems to be pretty well agreed in this thread that toWString is more 
 consistent but toStringW is prettier.
 
 I could be wrong but I think usage pattern of these W and D variants of 
 the functions will be bimodal:  either very frequent or very infrequent. 
  In the former case I'd probably want to make a simpler alias like 
 'wstr'.  In the latter case I'd want it to be the most consistent thing 
 possible to be easy to remember for the few times I use it.

Agreed. Sean
Nov 20 2007
parent "Kris" <foo bar.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message 
news:fhvsaf$2k6t$1 digitalmars.com...
 Bill Baxter wrote:
 Sean Kelly wrote:
 As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 
 be named toString, toWString, and toDString, respectively, and Unicode 
 should be assumed as the standard encoding format in D.

1) On the question of toWString vs toWstring and consistency: I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

Good question. Probably toUInt, though I don't like it much :-) For these conversion routines, I'll admit I find the idea that the type name should be repeated exactly, which suggests something like to_wstring, but I don't imagine anyone finds that appealing.

It was resolved by having a Float module and an Integer module, containing relevant parse/format methods. The toUtf/toString() family is the only one where the type is decorated in the name (hungarian style)
Nov 20 2007
prev sibling parent reply Christopher Wright <dhasenan gmail.com> writes:
Bill Baxter wrote:
 Sean Kelly wrote:
 As an alternative, I can only suggest that toUTF8, toUTF16, and 
 toUTF32 be named toString, toWString, and toDString, respectively, and 
 Unicode should be assumed as the standard encoding format in D.

1) On the question of toWString vs toWstring and consistency: I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

If it had a to uint function and a to int function and a to 'sint' function, what then? If it's only uint, then you can tell the difference quite easily. Also, 'int' is shorter than 'string'. Not a very good comparison.
Nov 20 2007
parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
Christopher Wright wrote:
 Bill Baxter wrote:
 Sean Kelly wrote:
 As an alternative, I can only suggest that toUTF8, toUTF16, and 
 toUTF32 be named toString, toWString, and toDString, respectively, 
 and Unicode should be assumed as the standard encoding format in D.

1) On the question of toWString vs toWstring and consistency: I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

If it had a to uint function and a to int function and a to 'sint' function, what then? If it's only uint, then you can tell the difference quite easily.

I don't understand you. Tell the difference between what?
 Also, 'int' is shorter than 'string'. Not a very good comparison.

What does length have to do with whether or not the naming scheme is consistent?

--bb
Nov 20 2007
parent Christopher Wright <dhasenan gmail.com> writes:
Bill Baxter wrote:
 Christopher Wright wrote:
 Bill Baxter wrote:
 Sean Kelly wrote:
 As an alternative, I can only suggest that toUTF8, toUTF16, and 
 toUTF32 be named toString, toWString, and toDString, respectively, 
 and Unicode should be assumed as the standard encoding format in D.

1) On the question of toWString vs toWstring and consistency: I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

If it had a to uint function and a to int function and a to 'sint' function, what then? If it's only uint, then you can tell the difference quite easily.

I don't understand you. Tell the difference between what?

Sorry, mistyped. If it were 'to uint' and 'to int', that would be rather clear. 'to uint' and 'to sint' would be less clear, since they're the same number of letters and would have the same capitalization pattern.
 Also, 'int' is shorter than 'string'. Not a very good comparison.

What does length have to do with whether or not the naming scheme is consistent?

Readability. I'd rather sacrifice a bit of consistency -- I can memorize a *few* inconsistencies -- for readability, whose lack will cause more trouble in the future. With shorter identifiers, smaller differences are more noticeable, but 'toWString' is a relatively long identifier.
 --bb

Nov 20 2007