digitalmars.D - TDPL reaches Thermopylae level

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
303 pages and counting!

Andrei
Oct 25 2009
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 303 pages and counting!

Come and get them!
Oct 25 2009
prev sibling next sibling parent reply Jeremie Pelletier <jeremiep gmail.com> writes:
Andrei Alexandrescu wrote:
 303 pages and counting!
 
 Andrei

Soon the PI level, or at least 10 times PI!
Oct 26 2009
next sibling parent reply Bill Baxter <wbaxter gmail.com> writes:
On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:
 Andrei Alexandrescu wrote:
 303 pages and counting!

 Andrei

Soon the PI level, or at least 10 times PI!

A hundred even. ;-) --bb
Oct 26 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:
 Andrei Alexandrescu wrote:
 303 pages and counting!

 Andrei


A hundred even. ;-)

Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.) Andrei
Oct 26 2009
next sibling parent reply Jeremie Pelletier <jeremiep gmail.com> writes:
Andrei Alexandrescu wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier 
 <jeremiep gmail.com> wrote:
 Andrei Alexandrescu wrote:
 303 pages and counting!

 Andrei


A hundred even. ;-)

Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.) Andrei

I don't know if that's a good idea; it's better when string encoding is explicit so you know where your reallocations are.

I.e., if I know some routine will have to convert a UTF-16 parameter to UTF-8 to append it to a string, then I'll try to either make it output UTF-16 or input UTF-8. If it's implicit, it's much harder to find and optimize these cases.

to!string() is easy enough to use anyways.

But it could be good to add a range type that does this with multiple opAppend/opAppendAssign overloads.
Oct 26 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Jeremie Pelletier wrote:
 Andrei Alexandrescu wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier 
 <jeremiep gmail.com> wrote:
 Andrei Alexandrescu wrote:
 303 pages and counting!

 Andrei


A hundred even. ;-)

Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.) Andrei

I don't know if that's a good idea; it's better when string encoding is explicit so you know where your reallocations are.

The beauty of it is that reallocation with ~ occurs anyway, and with ~= is anyway imminent, regardless of the character width you're reallocating. Allowing concatenation of strings of different widths is a nice way of acknowledging at the language level that all character widths are encodings of abstract characters.
 I.e., if I know some routine will have to convert a UTF-16 parameter to UTF-8
 to append it to a string, then I'll try to either make it output UTF-16
 or input UTF-8. If it's implicit, it's much harder to find and optimize
 these cases.
 
 to!string() is easy enough to use anyways.
 
 But it could be good to add a range type that does this with multiple 
 opAppend/opAppendAssign overloads.

One problem with s ~= to!string(someDstring); is that it does two allocations instead of one. Andrei
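
(An illustrative sketch of the point above, not from the original post, assuming current D2 with Phobos. The helper names appendViaTo and appendDirect are hypothetical. The first builds a temporary UTF-8 string and then appends it; the second appends one code point at a time, which skips the temporary, although the runtime may still reallocate as the array grows.)

```d
import std.conv : to;

// Transcode first, then append: to!string creates a temporary
// UTF-8 string, and ~= then copies it onto the end of s.
string appendViaTo(string s, dstring d)
{
    s ~= to!string(d);
    return s;
}

// Hypothetical alternative: append each code point directly.
// Appending a dchar to a char array encodes it as UTF-8 in place,
// so no intermediate string is built.
string appendDirect(string s, dstring d)
{
    foreach (dchar c; d)
        s ~= c;
    return s;
}

void main()
{
    // Both routes produce the same UTF-8 text.
    assert(appendViaTo("abc", "d\u00E9"d) == "abcd\u00E9");
    assert(appendDirect("abc", "d\u00E9"d) == "abcd\u00E9");
}
```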
Oct 26 2009
parent Jeremie Pelletier <jeremiep gmail.com> writes:
Andrei Alexandrescu wrote:
 Jeremie Pelletier wrote:
 Andrei Alexandrescu wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier 
 <jeremiep gmail.com> wrote:
 Andrei Alexandrescu wrote:
 303 pages and counting!

 Andrei


A hundred even. ;-)

Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.) Andrei

I don't know if that's a good idea; it's better when string encoding is explicit so you know where your reallocations are.

The beauty of it is that reallocation with ~ occurs anyway, and with ~= is anyway imminent, regardless of the character width you're reallocating. Allowing concatenation of strings of different widths is a nice way of acknowledging at the language level that all character widths are encodings of abstract characters.
 I.e., if I know some routine will have to convert a UTF-16 parameter to
 UTF-8 to append it to a string, then I'll try to either make it output
 UTF-16 or input UTF-8. If it's implicit, it's much harder to find and
 optimize these cases.

 to!string() is easy enough to use anyways.

 But it could be good to add a range type that does this with multiple 
 opAppend/opAppendAssign overloads.

One problem with s ~= to!string(someDstring); is that it does two allocations instead of one. Andrei

Good points, I didn't think of the separation between characters and encodings or the extra allocation from to. You have my vote for this feature then! Jeremie
Oct 26 2009
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com>
 wrote:
 Andrei Alexandrescu wrote:
 303 pages and counting!

 Andrei



Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)

So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd.

Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...
  Just using something like
 to!(char[])(myWcharString) seems less goofy to me.

Problem is, an append + one transcoding requires two allocations. We could always define routines in std.string or std.utf:

    append(s, ws); // s ~= ws

but really it's quite unambiguous what ~= should do. A nod from the language is a nice touch.

Andrei
Oct 26 2009
parent reply Chris Nicholson-Sauls <ibisbasenji gmail.com> writes:
Andrei Alexandrescu wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com>
 wrote:
 Andrei Alexandrescu wrote:
 303 pages and counting!

 Andrei



Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)

So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd.

Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...

My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types. -- Chris Nicholson-Sauls
Oct 27 2009
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Chris Nicholson-Sauls wrote:
 Andrei Alexandrescu wrote:

 Well, I guess. In particular, to me it's not clear what type we should 
 assign to a concatenation between a string and a wstring. With ~=, 
 it's much easier...

My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types. -- Chris Nicholson-Sauls

Yah, I agree. The problem is, there's a big difference too: all encodings are able to represent the same information, unlike numeric widths where there's a clear inclusion relationship. It could even be argued that in pure theory UTF-16 is the least general of the three (I dislike UTF-16 from an engineering standpoint; unlike UTF-8 which I think is brilliant, I find UTF-16 is forced and uninspired - the typical outcome of a committee.) My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in case lhs is a string type. If lhs is a character type, the result type is obviously the same as rhs. Andrei
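
(A sketch of how the proposed typing rule could be mimicked in library code, not from the original post, assuming current D2 with Phobos. The function name concatAsLhs is hypothetical, and it covers only the case where the left-hand side is a string type: the result takes the type of lhs and the right-hand side is transcoded to match.)

```d
import std.conv : to;
import std.traits : isSomeString;

// Hypothetical stand-in for the proposed "lhs ~ rhs has the type of lhs"
// rule: transcode rhs to the encoding of lhs, then concatenate.
L concatAsLhs(L, R)(L lhs, R rhs)
    if (isSomeString!L && isSomeString!R)
{
    return lhs ~ to!L(rhs);
}

void main()
{
    // string on the left: the result is a string (UTF-8).
    static assert(is(typeof(concatAsLhs("a", "b"w)) == string));
    assert(concatAsLhs("a", "b"w) == "ab");

    // wstring on the left: the result is a wstring (UTF-16).
    static assert(is(typeof(concatAsLhs("a"w, "b"d)) == wstring));
    assert(concatAsLhs("a"w, "b"d) == "ab"w);
}
```

With this rule, a library-level lhs = concatAsLhs(lhs, rhs) always type-checks, which is the consistency with ~= that the post argues for.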
Oct 27 2009
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 My current thought is to ascribe lhs ~ rhs the same type as lhs 
 (thereby making ~ consistent with ~= by making lhs ~= rhs same as lhs = 
 lhs ~ rhs) in case lhs is a string type. If lhs is a character type, 
 the result type is obviously the same as rhs.

Seems the most intuitive option to me. Also, it makes "a ~= b" equivalent to "a = a ~ b" which is always nice. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Oct 27 2009
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Bill Baxter wrote:
 On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin
 <michel.fortin michelf.com> wrote:
 On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby
 making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in
 case lhs is a string type. If lhs is a character type, the result type is
 obviously the same as rhs.

Seems the most intuitive option to me. Also, it makes "a ~= b" equivalent to "a = a ~ b" which is always nice.

And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error.

I agree. This one, however, will be very difficult to slide by Walter's watchful eye. He doesn't like hidden allocations, and a width adjustment does involve one.

Andrei

P.S. I got the green light from my editor's marketing folks. Will release The Thermopylae Excerpt of TDPL today for free off my website. Stay tuned. It's a rough draft but I hope you will enjoy it.
Oct 27 2009
prev sibling parent reply Pelle Månsson <pelle.mansson gmail.com> writes:
Bill Baxter wrote:
 On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin
 <michel.fortin michelf.com> wrote:
 On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby
 making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in
 case lhs is a string type. If lhs is a character type, the result type is
 obviously the same as rhs.

Seems the most intuitive option to me. Also, it makes "a ~= b" equivalent to "a = a ~ b" which is always nice.

And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error. --bb

float b = 2.1; a = b; also unambiguous?
Oct 27 2009
parent reply Pelle Månsson <pelle.mansson gmail.com> writes:
Bill Baxter wrote:
 On Tue, Oct 27, 2009 at 12:48 PM, Pelle Månsson <pelle.mansson gmail.com>
wrote:
 Bill Baxter wrote:
 On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin
 <michel.fortin michelf.com> wrote:
 On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby
 making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~
 rhs) in
 case lhs is a string type. If lhs is a character type, the result type
 is
 obviously the same as rhs.

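Seems the most intuitive option to me. Also, it makes "a ~= b" equivalent to "a = a ~ b" which is always nice.

And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error. --bb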

float b = 2.1; a = b; also unambiguous?

I'm not sure what point you're trying to make, but wstring <-> string <-> dstring are all lossless conversions. That isn't the case with int and float. --bb

...Then what is the point of wstring, dstring?
Oct 27 2009
next sibling parent Pelle Månsson <pelle.mansson gmail.com> writes:
Bill Baxter wrote:
 On Tue, Oct 27, 2009 at 1:06 PM, Pelle Månsson <pelle.mansson gmail.com> wrote:
 Bill Baxter wrote:
 On Tue, Oct 27, 2009 at 12:48 PM, Pelle Månsson <pelle.mansson gmail.com>
 wrote:
 Bill Baxter wrote:
 On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin
 <michel.fortin michelf.com> wrote:
 On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 My current thought is to ascribe lhs ~ rhs the same type as lhs
 (thereby
 making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~
 rhs) in
 case lhs is a string type. If lhs is a character type, the result type
 is
 obviously the same as rhs.

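Seems the most intuitive option to me. Also, it makes "a ~= b" equivalent to "a = a ~ b" which is always nice.

And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error. --bb

float b = 2.1; a = b; also unambiguous?

I'm not sure what point you're trying to make, but wstring <-> string <-> dstring are all lossless conversions. That isn't the case with int and float. --bb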

...Then what is the point of wstring, dstring?

They are all just different representations of Unicode.

string, which is Unicode in UTF-8, is good because it's the least wasteful for mostly ASCII text. And has a nice ASCII backwards compatibility story.

dstring, which is Unicode in UTF-32, is good because you have one element = one character. So it's good for doing substring and other text manipulations.

wstring, which is UTF-16, is good because it lets you call Windows Unicode functions.

Here's Daniel Keep's nice explanation: http://docs.google.com/View?docid=dtqh79k_1rbxfmb

--bb
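
(A small sketch of the three representations, not from the original post, assuming current D2 with Phobos. The helper names utf16Units and utf32Units are hypothetical. It counts code units for the same text in each encoding and checks that converting back to UTF-8 is lossless. In UTF-32, one element is one code point.)

```d
import std.conv : to;

// Number of UTF-16 code units needed for the text in s.
size_t utf16Units(string s) { return to!wstring(s).length; }

// Number of UTF-32 code units (one per code point) for the text in s.
size_t utf32Units(string s) { return to!dstring(s).length; }

void main()
{
    // "héllo " followed by U+1D11E (musical G clef, outside the BMP).
    string s = "h\u00E9llo \U0001D11E";

    assert(s.length == 11);       // UTF-8: 1 + 2 + 1 + 1 + 1 + 1 + 4 bytes
    assert(utf16Units(s) == 8);   // UTF-16: 6 units + 1 surrogate pair
    assert(utf32Units(s) == 7);   // UTF-32: 7 code points

    // Round trips through the wider encodings give back the same text.
    assert(to!string(to!wstring(s)) == s);
    assert(to!string(to!dstring(s)) == s);
}
```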

Oct 27 2009
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Leandro Lucarella wrote:
 Bill Baxter, on October 27 at 13:12, you wrote to me:
 They are?

 ...Then what is the point of wstring, dstring?

string, which is unicode in UTF-8, is good because it's the least wasteful for mostly ASCII text. And has a nice ASCII backwards compatibility story. dstring, which is unicode in UTF-32, is good because you have one element = one character. So it's good for doing substring and other text manipulations. wstring, which is UTF-16, is good because it lets you call Windows Unicode functions. Here's Daniel Keep's nice explanation: http://docs.google.com/View?docid=dtqh79k_1rbxfmb

And here is a nice article about Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html

Damn guys, with these good explanations, nobody's going to use the one in TDPL! Andrei
Oct 27 2009
prev sibling next sibling parent reply Justin Johansson <no spam.com> writes:
Chris Nicholson-Sauls Wrote:

 Andrei Alexandrescu wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com>
 wrote:
 Andrei Alexandrescu wrote:
 303 pages and counting!

 Andrei



Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)

So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd.

Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...

My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types. -- Chris Nicholson-Sauls

Though I'm sure Shannon would say that the number of bits of intrinsic information contained in the same sequence of Unicode codepoints is exactly the same whether it be encoded as a string or a wstring.

Accordingly, my intuition is that some rule based upon left-to-right associativity would be more apt. You could then concatenate a wstring (on the rhs) to an empty string (on the lhs) to convert the wstring to a string, or vice versa.

Cheers
Justin Johansson
Oct 27 2009
parent reply Chris Nicholson-Sauls <ibisbasenji gmail.com> writes:
Justin Johansson wrote:
 Chris Nicholson-Sauls Wrote:
 
 Andrei Alexandrescu wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com>
 wrote:
 Andrei Alexandrescu wrote:
 303 pages and counting!

 Andrei



Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)

So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd.

Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...

My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types. -- Chris Nicholson-Sauls

Though I'm sure Shannon would say that the number of bits of intrinsic information contained in the same sequence of Unicode codepoints is exactly the same whether it be encoded as a string or a wstring. Accordingly, my intuition is that some rule based upon left-to-right associativity would be more apt. You could then concatenate a wstring (on the rhs) to an empty string (on the lhs) to convert the wstring to a string, or vice versa. Cheers Justin Johansson

Granted LTR is common enough to be expectable and acceptable. To be perfectly honest, I don't believe I have *ever* even used wchar/wstring. Char/string gosh yes; dchar/dstring quite a bit as well, where I need the simplicity; but I've yet to feel much need for the "weirdo" middle child of UTF.

I would argue that string ~ wstring returning string is fine, but would suggest it be a warning for those like myself who might have first guessed it would "upscale to fit". Just so long as the foreach(dchar;string) trick is still around, char/string can cover an awful lot of ground.

All that said, though, I don't think I would ever use ""~wstring as a means of conversion. It just feels like "there wasn't any other way to do this, so here's a cheap hack" -- which just isn't the case.

-- Chris Nicholson-Sauls
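
(A minimal sketch of the foreach(dchar; string) trick mentioned above, not from the original post, assuming current D2. The function name countCodePoints is hypothetical. Iterating a UTF-8 string with a dchar loop variable makes the language decode one code point per iteration.)

```d
// Count code points in a UTF-8 string by letting foreach decode it.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s) // each iteration yields one decoded code point
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo"; // "héllo": the é takes two UTF-8 code units

    assert(s.length == 6);             // code units (bytes)
    assert(countCodePoints(s) == 5);   // code points
}
```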
Oct 29 2009
next sibling parent Justin Johansson <no spam.com> writes:
Chris Nicholson-Sauls Wrote:

 
 Granted LTR is common enough to be expectable and acceptable. To be perfectly honest, I don't believe I have *ever* even used wchar/wstring. Char/string gosh yes; dchar/dstring quite a bit as well, where I need the simplicity; but I've yet to feel much need for the "weirdo" middle child of UTF.
 
 I would argue that string ~ wstring returning string is fine, but would suggest it be a warning for those like myself who might have first guessed it would "upscale to fit". Just so long as the foreach(dchar;string) trick is still around, char/string can cover an awful lot of ground.
 
 All that said, though, I don't think I would ever use ""~wstring as a means of conversion. It just feels like "there wasn't any other way to do this, so here's a cheap hack" -- which just isn't the case.

Your overall reply well put. On last point: agree; cheap hacks should be avoided. cheers, Justin
Oct 29 2009
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Chris Nicholson-Sauls" <ibisbasenji gmail.com> wrote in message 
news:hcctuf$140a$1 digitalmars.com...
 Granted LTR is common enough to be expectable and acceptable.  To be 
 perfectly honest, I don't believe I have *ever* even used wchar/wstring. 
 Char/string gosh yes; dchar/dstring quite a bit as well, where I need the 
 simplicity; but I've yet to feel much need for the "weirdo" middle child 
 of UTF.

Given that just about anything outside of D (at least as far as I've seen) that attempts to use unicode does so with UTF-16 (or just uses UCS-2 and pretends that's UTF-16...), wchar and wstring are great for dealing with that. For instance, my Goldie engine for GOLD currently uses wchar in a number of places because GOLD's .cfg format stores text in...well, presumably UTF-16 (I haven't tested to see if it's really UCS-2). But yea, as long as you're not dealing with anything that's already in UTF-16 or that expects it, then it does seem to be somewhat questionable.
Oct 29 2009
parent reply "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
Nick Sabalausky wrote:
 "Chris Nicholson-Sauls" <ibisbasenji gmail.com> wrote in message 
 news:hcctuf$140a$1 digitalmars.com...
 Granted LTR is common enough to be expectable and acceptable.  To be 
 perfectly honest, I don't believe I have *ever* even used wchar/wstring. 
 Char/string gosh yes; dchar/dstring quite a bit as well, where I need the 
 simplicity; but I've yet to feel much need for the "weirdo" middle child 
 of UTF.

Given that just about anything outside of D (at least as far as I've seen) that attempts to use unicode does so with UTF-16 (or just uses UCS-2 and pretends that's UTF-16...), wchar and wstring are great for dealing with that. For instance, my Goldie engine for GOLD currently uses wchar in a number of places because GOLD's .cfg format stores text in...well, presumably UTF-16 (I haven't tested to see if it's really UCS-2). But yea, as long as you're not dealing with anything that's already in UTF-16 or that expects it, then it does seem to be somewhat questionable.

I think this says it all: http://en.wikipedia.org/wiki/Utf-16#Use_in_major_operating_systems_and_environments -Lars :)
Oct 30 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Lars T. Kyllingstad wrote:
 Nick Sabalausky wrote:
 "Chris Nicholson-Sauls" <ibisbasenji gmail.com> wrote in message 
 news:hcctuf$140a$1 digitalmars.com...
 Granted LTR is common enough to be expectable and acceptable.  To be 
 perfectly honest, I don't believe I have *ever* even used 
 wchar/wstring. Char/string gosh yes; dchar/dstring quite a bit as 
 well, where I need the simplicity; but I've yet to feel much need for 
 the "weirdo" middle child of UTF.

Given that just about anything outside of D (at least as far as I've seen) that attempts to use unicode does so with UTF-16 (or just uses UCS-2 and pretends that's UTF-16...), wchar and wstring are great for dealing with that. For instance, my Goldie engine for GOLD currently uses wchar in a number of places because GOLD's .cfg format stores text in...well, presumably UTF-16 (I haven't tested to see if it's really UCS-2). But yea, as long as you're not dealing with anything that's already in UTF-16 or that expects it, then it does seem to be somewhat questionable.

I think this says it all: http://en.wikipedia.org/wiki/Utf-16#Use_in_major_operating_systems_and_environments -Lars :)

Yep, there was a frenzy when UCS-2 came about: everybody thought two bytes will be enough for everyone. So UCS-2 was widely adopted - who wouldn't love to have constant character width? Then, the UTF-16 surrogate business came about, and the only logical step they could take was to migrate to UTF-16, which was upward compatible to UCS-2. I personally think UTF-8 is a better overall design though. Andrei
Oct 30 2009
parent reply Justin Johansson <no spam.com> writes:
Andrei Alexandrescu Wrote:

 Lars T. Kyllingstad wrote:
 Nick Sabalausky wrote:
 "Chris Nicholson-Sauls" <ibisbasenji gmail.com> wrote in message 
 news:hcctuf$140a$1 digitalmars.com...
 Granted LTR is common enough to be expectable and acceptable.  To be 
 perfectly honest, I don't believe I have *ever* even used 
 wchar/wstring. Char/string gosh yes; dchar/dstring quite a bit as 
 well, where I need the simplicity; but I've yet to feel much need for 
 the "weirdo" middle child of UTF.

Given that just about anything outside of D (at least as far as I've seen) that attempts to use unicode does so with UTF-16 (or just uses UCS-2 and pretends that's UTF-16...), wchar and wstring are great for dealing with that. For instance, my Goldie engine for GOLD currently uses wchar in a number of places because GOLD's .cfg format stores text in...well, presumably UTF-16 (I haven't tested to see if it's really UCS-2). But yea, as long as you're not dealing with anything that's already in UTF-16 or that expects it, then it does seem to be somewhat questionable.

I think this says it all: http://en.wikipedia.org/wiki/Utf-16#Use_in_major_operating_systems_and_environments -Lars :)

Yep, there was a frenzy when UCS-2 came about: everybody thought two bytes will be enough for everyone. So UCS-2 was widely adopted - who wouldn't love to have constant character width? Then, the UTF-16 surrogate business came about, and the only logical step they could take was to migrate to UTF-16, which was upward compatible to UCS-2. I personally think UTF-8 is a better overall design though. Andrei

"I personally think UTF-8 is a better overall design though."

Unicode Technical Note #12 by The Unicode Consortium apparently disagrees, recommending UTF-16 for Processing. http://unicode.org/notes/tn12/

The major claim in the TN is that Unicode is optimized for UTF-16. The rest of the argument looks like a VHS (everyone is using it, i.e. UTF-16) versus Beta argument.

So who's right? My personal view is that whilst they are the *Unicode Consortium*, I have great difficulty in accepting UTF-16 as the one-and-holy encoding.

FWIW, there was a subthread during a discussion about the ordained features of programming languages on LtU a while back: "What Are The Resolved Debates in General Purpose Language Design?" http://lambda-the-ultimate.org/node/3166#comment-46233

It's a long discussion, so it's easier to search for UTF or Unicode on the page if you're interested.

cheers
Justin Johansson
Oct 30 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Justin Johansson wrote:
 Andrei Alexandrescu Wrote:
 
 Lars T. Kyllingstad wrote:
 Nick Sabalausky wrote:
 "Chris Nicholson-Sauls" <ibisbasenji gmail.com> wrote in message 
 news:hcctuf$140a$1 digitalmars.com...
 Granted LTR is common enough to be expectable and acceptable.  To be 
 perfectly honest, I don't believe I have *ever* even used 
 wchar/wstring. Char/string gosh yes; dchar/dstring quite a bit as 
 well, where I need the simplicity; but I've yet to feel much need for 
 the "weirdo" middle child of UTF.

Given that just about anything outside of D (at least as far as I've seen) that attempts to use unicode does so with UTF-16 (or just uses UCS-2 and pretends that's UTF-16...), wchar and wstring are great for dealing with that. For instance, my Goldie engine for GOLD currently uses wchar in a number of places because GOLD's .cfg format stores text in...well, presumably UTF-16 (I haven't tested to see if it's really UCS-2). But yea, as long as you're not dealing with anything that's already in UTF-16 or that expects it, then it does seem to be somewhat questionable.

I think this says it all: http://en.wikipedia.org/wiki/Utf-16#Use_in_major_operating_systems_and_environments -Lars :)

Yep, there was a frenzy when UCS-2 came about: everybody thought two bytes will be enough for everyone. So UCS-2 was widely adopted - who wouldn't love to have constant character width? Then, the UTF-16 surrogate business came about, and the only logical step they could take was to migrate to UTF-16, which was upward compatible to UCS-2. I personally think UTF-8 is a better overall design though. Andrei

"I personally think UTF-8 is a better overall design though." Unicode Technical Note #12 by The Unicode Consortium apparently disagrees, recommending UTF-16 for Processing. http://unicode.org/notes/tn12/ The major claim in the TN is that Unicode is optimized for UTF-16. The rest of the argument looks like a VHS (everyone is using it, i.e. UTF-16) versus Beta argument. So who's right? My personal view is that whilst they are the *Unicode Consortium*, I have great difficulty in accepting UTF-16 as the one-and-holy encoding. FWIW, there was a subthread during a discussion about the ordained features of programming languages on LtU a while back. http://lambda-the-ultimate.org/node/3166#comment-46233 What Are The Resolved Debates in General Purpose Language Design? It's a long discussion, so it's easier to search for UTF or Unicode on the page if you're interested. cheers Justin Johansson

Thanks for the pointers. One of the reasons for which I like the design of UTF-8 is its generality: it's a variable-length code for any number of up to 31 bits. In contrast, UTF-16 relies on specific dead zones inside the assigned space.

But the authors of the unicode.org article do make a few good points, such as there not being any invalid UTF-16 symbol. But then that actually can be seen as a strength of UTF-8: binary files that happen to be valid UTF-8 are statistically so scarce that UTF-8 has a very solid method of checking whether a file is UTF-8 or something else.

Andrei
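
(A sketch of the "is this UTF-8 or something else" check described above, not from the original post, assuming current D2 with Phobos. The function name looksLikeUtf8 is hypothetical; std.utf.validate is the Phobos routine that throws a UTFException on malformed input. Passing the check only means the bytes are well-formed UTF-8, not that they were intended as text.)

```d
import std.utf : validate, UTFException;

// Returns true if the bytes form well-formed UTF-8, false otherwise.
bool looksLikeUtf8(const(ubyte)[] bytes)
{
    try
    {
        // Reinterpret the bytes as UTF-8 code units and validate them.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    ubyte[] text = [0x68, 0xC3, 0xA9];           // "hé" encoded as UTF-8
    ubyte[] binary = [0xFF, 0xFE, 0x00, 0x41];   // 0xFF never occurs in UTF-8

    assert(looksLikeUtf8(text));
    assert(!looksLikeUtf8(binary));
}
```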
Oct 30 2009
prev sibling next sibling parent Leandro Lucarella <llucax gmail.com> writes:
Andrei Alexandrescu, on October 27 at 19:32, you wrote to me:
 Leandro Lucarella wrote:
Bill Baxter, on October 27 at 13:12, you wrote to me:
They are?

...Then what is the point of wstring, dstring?

string, which is unicode in UTF-8, is good because it's the least wasteful for mostly ASCII text. And has a nice ASCII backwards compatibility story. dstring, which is unicode in UTF-32, is good because you have one element = one character. So it's good for doing substring and other text manipulations. wstring, which is UTF-16, is good because it lets you call Windows Unicode functions. Here's Daniel Keep's nice explanation: http://docs.google.com/View?docid=dtqh79k_1rbxfmb

And here is a nice article about Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html

Damn guys, with these good explanations, nobody's going to use the one in TDPL!

:)

-- 
Leandro Lucarella (AKA luca) http://llucax.com.ar/
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
We live in a very contemporary era, Don Inodoro... -- Mendieta
Oct 27 2009
prev sibling parent Leandro Lucarella <llucax gmail.com> writes:
Andrei Alexandrescu, on October 27 at 19:32, you wrote to me:
 Leandro Lucarella wrote:
Bill Baxter, on October 27 at 13:12, you wrote to me:
They are?

...Then what is the point of wstring, dstring?

string, which is unicode in UTF-8, is good because it's the least wasteful for mostly ASCII text. And has a nice ASCII backwards compatibility story. dstring, which is unicode in UTF-32, is good because you have one element = one character. So it's good for doing substring and other text manipulations. wstring, which is UTF-16, is good because it lets you call Windows Unicode functions. Here's Daniel Keep's nice explanation: http://docs.google.com/View?docid=dtqh79k_1rbxfmb

And here is a nice article about Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html

Damn guys, with these good explanations, nobody's going to use the one in TDPL!

BTW, seeing the explanation about Unicode in your book, one wonders why the UTF-8, UTF-16 and UTF-32 character types are not simply called utf8, utf16 and utf32...

-- 
Leandro Lucarella (AKA luca) http://llucax.com.ar/
----------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------
Not even heaven wants me anymore, not even death visits me
Not even the sun warms me anymore, not even the wind caresses me
Oct 29 2009
prev sibling next sibling parent reply Bill Baxter <wbaxter gmail.com> writes:
On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com>
 wrote:
 Andrei Alexandrescu wrote:
 303 pages and counting!

 Andrei

Soon the PI level, or at least 10 times PI!

A hundred even. ;-)

Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)

So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd. Just using something like to!(char[])(myWcharString) seems less goofy to me. But that subjective reaction is all I have against it. --bb
Oct 26 2009
parent Leandro Lucarella <llucax gmail.com> writes:
Bill Baxter, on October 27 at 13:12, you wrote to me:
 They are?

 ...Then what is the point of wstring, dstring?

They are all just different representations of Unicode. string, which is unicode in UTF-8, is good because it's the least wasteful for mostly ASCII text. And has a nice ASCII backwards compatibility story. dstring, which is unicode in UTF-32, is good because you have one element = one character. So it's good for doing substring and other text manipulations. wstring, which is UTF-16, is good because it lets you call Windows Unicode functions. Here's Daniel Keep's nice explanation: http://docs.google.com/View?docid=dtqh79k_1rbxfmb

And here is a nice article about Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html

-- 
Leandro Lucarella (AKA luca) http://llucax.com.ar/
----------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------
I have committed sins, I have done evil, I have been a victim of envy, selfishness, ambition, lies and frivolity, but I have always been an Argentine father who wants his son to succeed in life.
 -- Ricardo Vaporeso
Oct 27 2009
prev sibling next sibling parent Bill Baxter <wbaxter gmail.com> writes:
On Mon, Oct 26, 2009 at 4:05 PM, Jeremie Pelletier <jeremiep gmail.com> wrote:
 Andrei Alexandrescu wrote:
 Jeremie Pelletier wrote:
 Andrei Alexandrescu wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com>
 wrote:
 Andrei Alexandrescu wrote:
 303 pages and counting!

 Andrei

Soon the PI level, or at least 10 times PI!

A hundred even. ;-)

Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.) Andrei

I don't know if that's a good idea; it's better when string encoding is explicit so you know where your reallocations are.

The beauty of it is that reallocation with ~ occurs anyway, and with ~= is anyway imminent, regardless of the character width you're reallocating. Allowing concatenation of strings of different widths is a nice way of acknowledging at the language level that all character widths are encodings of abstract characters.
 ie if I know some routine will have to convert a utf16 parameter to utf8
 to append it to a string, then I'll try and either make it output utf16 or
 input utf8. If it's implicit it's much harder to find and optimize these
 cases.

 to!string() is easy enough to use anyways.

 But it could be good to add a range type that does this with multiple
 opAppend/opAppendAssign overloads.

One problem with s ~= to!string(someDstring); is that it does two allocations instead of one. Andrei

Good points, I didn't think of the separation between characters and encodings or the extra allocation from to. You have my vote for this feature then! Jeremie

Yeah, me too. Saving an allocation is good. And I agree that having ~= do a conversion is much more useful than just getting an error. It's one of those things you might try just hoping it will work, and it's always nice when something like that does just what you hope it will. I guess the only other thing I could worry about is that in generic array code it might cause someone headaches that for some T[], T[] ~= S[] is legal and the length of the result is not the sum of the lengths of the inputs. But I can't think of any real situation where that would cause trouble. --bb
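A minimal sketch of the allocation point raised above, assuming a current D compiler in which a `foreach` with a `dchar` loop variable decodes a `wstring` one code point at a time and `~=` of a `dchar` onto a `string` encodes it as UTF-8. `appendTranscoded` is a hypothetical helper, not a Phobos function; it only illustrates what a transcoding `~=` could do without first building an intermediate `string`.

```d
import std.conv : to;

// Hypothetical helper (not part of Phobos): append UTF-16 text to a
// UTF-8 string without materialising a temporary string first.
void appendTranscoded(ref string dst, const(wchar)[] src)
{
    // The dchar loop variable makes foreach decode src code point by code point;
    // appending a dchar to a string encodes it as UTF-8.
    foreach (dchar c; src)
        dst ~= c;
}

void main()
{
    string s = "hello, ";
    wstring w = "w\u00F6rld"w;

    // Explicit route: to!string allocates a temporary, then ~ allocates the result.
    string viaTo = s ~ to!string(w);

    // Direct route: transcode while appending.
    appendTranscoded(s, w);

    assert(s == viaTo);
    assert(s == "hello, w\u00F6rld");
}
```

The destination may still be reallocated as it grows, so this avoids only the temporary, not every allocation.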
Oct 26 2009
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Tue, 27 Oct 2009 10:04:33 +0300, Chris Nicholson-Sauls  
<ibisbasenji gmail.com> wrote:

 Andrei Alexandrescu wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier  
 <jeremiep gmail.com>
 wrote:
 Andrei Alexandrescu wrote:
 303 pages and counting!

 Andrei



Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)

So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd.

Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...

My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types. -- Chris Nicholson-Sauls

ubyte i = 42;
int j = 1;
i += j; // still ubyte

same here:

string a = "hello";
wstring b = "world"w;
a ~= b; // still string
Oct 27 2009
prev sibling next sibling parent Bill Baxter <wbaxter gmail.com> writes:
On Tue, Oct 27, 2009 at 4:37 AM, Denis Koroskin <2korden gmail.com> wrote:
 On Tue, 27 Oct 2009 10:04:33 +0300, Chris Nicholson-Sauls
 <ibisbasenji gmail.com> wrote:

 Andrei Alexandrescu wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Bill Baxter wrote:
 On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier
 <jeremiep gmail.com>
 wrote:
 Andrei Alexandrescu wrote:
 303 pages and counting!

 Andrei

Soon the PI level, or at least 10 times PI!


Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)

So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd.

Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...

My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types.

 -- Chris Nicholson-Sauls

ubyte i = 42;
int j = 1;
i += j; // still ubyte

same here:

string a = "hello";
wstring b = "world"w;
a ~= b; // still string

As Andrei said (and maybe you missed) "With ~=, it's much easier...". The only question is about what "a ~ b" should do. --bb
Oct 27 2009
prev sibling parent Bill Baxter <wbaxter gmail.com> writes:
On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin
<michel.fortin michelf.com> wrote:
 On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby
 making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in
 case lhs is a string type. If lhs is a character type, the result type is
 obviously the same as rhs.

Seems the most intuitive option to me. Also, it makes "a ~= b" equivalent to "a = a ~ b" which is always nice.

And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error. --bb
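The typing rule quoted above (the result of lhs ~ rhs takes the type of lhs when lhs is a string type) can be sketched as a library function. This is only an illustration under stated assumptions: `concatAsLhs` is a hypothetical name, and the language itself does not define ~ this way for mixed widths; the sketch leans on `std.conv.to` for the transcoding.

```d
import std.conv : to;
import std.traits : isSomeString;

// Hypothetical helper: mixed-width concatenation whose result has the
// type of the left-hand operand, mirroring the proposed rule for ~.
L concatAsLhs(L, R)(L lhs, R rhs)
    if (isSomeString!L && isSomeString!R)
{
    // Transcode rhs to the encoding of lhs, then concatenate as usual.
    return lhs ~ to!L(rhs);
}

void main()
{
    string a = "hello ";
    wstring b = "world"w;

    auto r1 = concatAsLhs(a, b);            // string ~ wstring
    static assert(is(typeof(r1) == string));
    assert(r1 == "hello world");

    auto r2 = concatAsLhs(b, a);            // wstring ~ string
    static assert(is(typeof(r2) == wstring));
    assert(r2 == "worldhello "w);
}
```

With this rule, `a = concatAsLhs(a, b)` has the same type and value that a transcoding `a ~= b` would produce, which is the consistency argued for in the quoted text.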
Oct 27 2009
prev sibling next sibling parent Bill Baxter <wbaxter gmail.com> writes:
On Tue, Oct 27, 2009 at 12:48 PM, Pelle Månsson <pelle.mansson gmail.com> wrote:
 Bill Baxter wrote:
 On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin
 <michel.fortin michelf.com> wrote:
 On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby
 making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in
 case lhs is a string type. If lhs is a character type, the result type is
 obviously the same as rhs.

Seems the most intuitive option to me. Also, it makes "a ~= b" equivalent to "a = a ~ b" which is always nice.

And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error. --bb

int a; float b = 2.1; a = b; also unambiguous?

I'm not sure what point you're trying to make, but wstring <-> string <-> dstring are all lossless conversions. That isn't the case with int and float. --bb
Oct 27 2009
prev sibling parent Bill Baxter <wbaxter gmail.com> writes:
On Tue, Oct 27, 2009 at 1:06 PM, Pelle Månsson <pelle.mansson gmail.com> wrote:
 Bill Baxter wrote:
 On Tue, Oct 27, 2009 at 12:48 PM, Pelle Månsson <pelle.mansson gmail.com>
 wrote:
 Bill Baxter wrote:
 On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin
 <michel.fortin michelf.com> wrote:
 On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby
 making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in
 case lhs is a string type. If lhs is a character type, the result type is
 obviously the same as rhs.

Seems the most intuitive option to me. Also, it makes "a ~= b" equivalent to "a = a ~ b" which is always nice.

And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error. --bb

int a; float b = 2.1; a = b; also unambiguous?

I'm not sure what point you're trying to make, but wstring <-> string <-> dstring are all lossless conversions. That isn't the case with int and float. --bb

They are? ...Then what is the point of wstring, dstring?

They are all just different representations of Unicode. string, which is unicode in UTF-8, is good because it's the least wasteful for mostly ASCII text. And has a nice ASCII backwards compatibility story. dstring, which is unicode in UTF-32, is good because you have one element = one character. So it's good for doing substring and other text manipulations. wstring, which is UTF-16, is good because it lets you call Windows Unicode functions. Here's Daniel Keep's nice explanation: http://docs.google.com/View?docid=dtqh79k_1rbxfmb --bb
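A small sketch of the point about the three string types, assuming `std.conv.to` for the transcoding. The sample text is made up for illustration and is written with escapes so the listing does not depend on the file's encoding. Note that `.length` counts code units, and that in a dstring one element is one code point (a user-perceived character built from combining marks can still span several elements).

```d
import std.conv : to;

void main()
{
    // 'a', e-acute (U+00E9), euro sign (U+20AC), musical G clef (U+1D11E)
    string  s = "a\u00E9\u20AC\U0001D11E"; // UTF-8
    wstring w = to!wstring(s);             // UTF-16
    dstring d = to!dstring(s);             // UTF-32

    // Same four code points, different numbers of code units.
    assert(s.length == 10); // 1 + 2 + 3 + 4 bytes
    assert(w.length == 5);  // 1 + 1 + 1 + 2 (surrogate pair)
    assert(d.length == 4);  // one element per code point

    // The conversions are lossless: every round trip restores the original.
    assert(to!string(w) == s);
    assert(to!string(d) == s);
    assert(to!wstring(d) == w);

    // Only the dstring can be indexed by code point directly.
    assert(d[3] == '\U0001D11E');
}
```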
Oct 27 2009