digitalmars.D - TDPL reaches Thermopylae level
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 25 2009
- Walter Bright <newshound1 digitalmars.com> Oct 25 2009
- Jeremie Pelletier <jeremiep gmail.com> Oct 26 2009
- Bill Baxter <wbaxter gmail.com> Oct 26 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 26 2009
- Jeremie Pelletier <jeremiep gmail.com> Oct 26 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 26 2009
- Jeremie Pelletier <jeremiep gmail.com> Oct 26 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 26 2009
- Chris Nicholson-Sauls <ibisbasenji gmail.com> Oct 27 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 27 2009
- Michel Fortin <michel.fortin michelf.com> Oct 27 2009
- Justin Johansson <no spam.com> Oct 27 2009
- Chris Nicholson-Sauls <ibisbasenji gmail.com> Oct 29 2009
- Justin Johansson <no spam.com> Oct 29 2009
- "Nick Sabalausky" <a a.a> Oct 29 2009
- "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> Oct 30 2009
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 30 2009
- Justin Johansson <no spam.com> Oct 30 2009
- Leandro Lucarella <llucax gmail.com> Oct 27 2009
- Leandro Lucarella <llucax gmail.com> Oct 29 2009
- Bill Baxter <wbaxter gmail.com> Oct 26 2009
- Leandro Lucarella <llucax gmail.com> Oct 27 2009
- Bill Baxter <wbaxter gmail.com> Oct 26 2009
- "Denis Koroskin" <2korden gmail.com> Oct 27 2009
- Bill Baxter <wbaxter gmail.com> Oct 27 2009
- Bill Baxter <wbaxter gmail.com> Oct 27 2009
- Bill Baxter <wbaxter gmail.com> Oct 27 2009
- Bill Baxter <wbaxter gmail.com> Oct 27 2009
303 pages and counting! Andrei
Oct 25 2009
Andrei Alexandrescu wrote:303 pages and counting!
Come and get them!
Oct 25 2009
Andrei Alexandrescu wrote:303 pages and counting! Andrei
Soon the PI level, or at least 10 times PI!
Oct 26 2009
On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:Andrei Alexandrescu wrote:303 pages and counting! Andrei
Soon the PI level, or at least 10 times PI!
A hundred even. ;-) --bb
Oct 26 2009
Bill Baxter wrote:On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:Andrei Alexandrescu wrote:303 pages and counting! Andrei
A hundred even. ;-)
Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.) Andrei
Oct 26 2009
Andrei Alexandrescu wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:Andrei Alexandrescu wrote:303 pages and counting! Andrei
A hundred even. ;-)
Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.) Andrei
I don't know if that's a good idea; it's better when string encoding is explicit so you know where your reallocations are. I.e., if I know some routine will have to convert a UTF-16 parameter to UTF-8 to append it to a string, then I'll try to either make it output UTF-16 or input UTF-8. If it's implicit, it's much harder to find and optimize these cases. to!string() is easy enough to use anyway. But it could be good to add a range type that does this with multiple opAppend/opAppendAssign overloads.
Oct 26 2009
Jeremie Pelletier wrote:Andrei Alexandrescu wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:Andrei Alexandrescu wrote:303 pages and counting! Andrei
A hundred even. ;-)
Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.) Andrei
I don't know if that's a good idea; it's better when string encoding is explicit so you know where your reallocations are.
The beauty of it is that reallocation with ~ occurs anyway, and with ~= is imminent anyway, regardless of the character width you're reallocating. Allowing concatenation of strings of different widths is a nice way of acknowledging at the language level that all character widths are encodings of abstract characters.
I.e., if I know some routine will have to convert a UTF-16 parameter to UTF-8 to append it to a string, then I'll try to either make it output UTF-16 or input UTF-8. If it's implicit, it's much harder to find and optimize these cases. to!string() is easy enough to use anyway. But it could be good to add a range type that does this with multiple opAppend/opAppendAssign overloads.
One problem with s ~= to!string(someDstring); is that it does two allocations instead of one. Andrei
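To make the "two allocations" point concrete, here is a minimal sketch of the explicit route. The helper name appendTranscoded is made up for illustration and is not from the thread; only std.conv.to is a real library call. The allocation comments describe typical behaviour, not a guarantee.

```d
import std.conv : to;

/// Illustrative helper (hypothetical name): transcode first, then append.
string appendTranscoded(string s, dstring d)
{
    string tmp = to!string(d); // allocation 1: a temporary UTF-8 copy of d
    s ~= tmp;                  // allocation 2 (typically): s grows to hold tmp
    return s;
}

void main()
{
    assert(appendTranscoded("TDPL: ", "Thermopylae"d) == "TDPL: Thermopylae");
}
```

The temporary exists only to be copied once more, which is the cost a built-in mixed-width ~= could avoid.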
Oct 26 2009
Andrei Alexandrescu wrote:Jeremie Pelletier wrote:Andrei Alexandrescu wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:Andrei Alexandrescu wrote:303 pages and counting! Andrei
A hundred even. ;-)
Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.) Andrei
I don't know if that's a good idea; it's better when string encoding is explicit so you know where your reallocations are.
The beauty of it is that reallocation with ~ occurs anyway, and with ~= is imminent anyway, regardless of the character width you're reallocating. Allowing concatenation of strings of different widths is a nice way of acknowledging at the language level that all character widths are encodings of abstract characters.
I.e., if I know some routine will have to convert a UTF-16 parameter to UTF-8 to append it to a string, then I'll try to either make it output UTF-16 or input UTF-8. If it's implicit, it's much harder to find and optimize these cases. to!string() is easy enough to use anyway. But it could be good to add a range type that does this with multiple opAppend/opAppendAssign overloads.
One problem with s ~= to!string(someDstring); is that it does two allocations instead of one. Andrei
Good points, I didn't think of the separation between characters and encodings or the extra allocation from to. You have my vote for this feature then! Jeremie
Oct 26 2009
Bill Baxter wrote:On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:Andrei Alexandrescu wrote:303 pages and counting! Andrei
wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)
So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd.
Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...
Just using something like to!(char[])(myWcharString) seems less goofy to me.
Problem is, an append + one transcoding requires two allocations. We could always define routines in std.string or std.utf: append(s, ws); // s ~= ws but really it's quite unambiguous what ~= should do. A nod from the language is a nice touch. Andrei
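A library routine along the lines of append(s, ws) could be sketched as below. This is only an assumption about what such a helper might look like; no function with this name is claimed to exist in std.string or std.utf. It relies on two language features: foreach with a dchar loop variable decodes the source string, and appending a dchar to a string encodes it as UTF-8.

```d
/// Hypothetical append(s, ws): transcodes while appending, with no
/// intermediate UTF-8 temporary.
void append(ref string s, const(wchar)[] ws)
{
    foreach (dchar c; ws) // decodes UTF-16, surrogate pairs included
        s ~= c;           // encodes the code point as UTF-8 onto s
}

void main()
{
    string s = "caf";
    wstring ws = "\u00E9 \U0001D11E"w; // 'é', space, musical G clef
    append(s, ws);
    assert(s == "caf\u00E9 \U0001D11E");
}
```

This sketch may still reallocate s more than once as it grows; a tuned version would compute the required length first.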
Oct 26 2009
Andrei Alexandrescu wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:Andrei Alexandrescu wrote:303 pages and counting! Andrei
wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)
So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd.
Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...
My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types. -- Chris Nicholson-Sauls
Oct 27 2009
Chris Nicholson-Sauls wrote:Andrei Alexandrescu wrote:
Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...
My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types. -- Chris Nicholson-Sauls
Yah, I agree. The problem is, there's a big difference too: all encodings are able to represent the same information, unlike numeric widths where there's a clear inclusion relationship. It could even be argued that in pure theory UTF-16 is the least general of the three (I dislike UTF-16 from an engineering standpoint; unlike UTF-8 which I think is brilliant, I find UTF-16 is forced and uninspired - the typical outcome of a committee.) My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in case lhs is a string type. If lhs is a character type, the result type is obviously the same as rhs. Andrei
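The two candidate typing rules can be compared in a short sketch. The function concatLhs is a hypothetical stand-in for the proposed behaviour of ~ (result typed like the left-hand side); it is not an existing operator or library function. The byte + int line shows the numeric promotion that the "wider type wins" intuition comes from.

```d
import std.conv : to;

/// Hypothetical stand-in for the proposed rule: the result has the type of lhs.
L concatLhs(L, R)(L lhs, R rhs)
{
    return lhs ~ to!L(rhs); // transcode rhs to lhs's encoding, then concatenate
}

void main()
{
    byte b = 1;
    int i = 2;
    static assert(is(typeof(b + i) == int)); // numeric promotion widens

    string s = "abc";
    wstring w = "def"w;
    static assert(is(typeof(concatLhs(s, w)) == string));  // lhs decides
    static assert(is(typeof(concatLhs(w, s)) == wstring)); // lhs decides
    assert(concatLhs(s, w) == "abcdef");
    assert(concatLhs(w, s) == "defabc"w);
}
```

With this rule, lhs ~= rhs and lhs = lhs ~ rhs produce the same type, which is the consistency argued for above.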
Oct 27 2009
On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in case lhs is a string type. If lhs is a character type, the result type is obviously the same as rhs.
Seems the most intuitive option to me. Also, it makes "a ~= b" equivalent to "a = a ~ b" which is always nice. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Oct 27 2009
Bill Baxter wrote:On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin <michel.fortin michelf.com> wrote:On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in case lhs is a string type. If lhs is a character type, the result type is obviously the same as rhs.
"a = a ~ b" which is always nice.
And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error.
I agree. This one, however, will be very difficult to slide by Walter's watchful eye. He doesn't like hidden allocations, and a width adjustment does involve one. Andrei P.S. I got the green light from my editor's marketing folks. Will release The Thermopylae Excerpt of TDPL today for free off my website. Stay tuned. It's a rough draft but I hope you will enjoy it.
Oct 27 2009
Bill Baxter wrote:On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin <michel.fortin michelf.com> wrote:On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in case lhs is a string type. If lhs is a character type, the result type is obviously the same as rhs.
"a = a ~ b" which is always nice.
And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error. --bb
float b = 2.1; a = b; also unambiguous?
Oct 27 2009
Bill Baxter wrote:On Tue, Oct 27, 2009 at 12:48 PM, Pelle Månsson <pelle.mansson gmail.com> wrote:Bill Baxter wrote:On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin <michel.fortin michelf.com> wrote:On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in case lhs is a string type. If lhs is a character type, the result type is obviously the same as rhs.
to "a = a ~ b" which is always nice.
It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error. --bb
float b = 2.1; a = b; also unambiguous?
I'm not sure what point you're trying to make, but wstring <-> string <-> dstring are all lossless conversions. That isn't the case with int and float. --bb
...Then what is the point of wstring, dstring?
Oct 27 2009
Bill Baxter wrote:On Tue, Oct 27, 2009 at 1:06 PM, Pelle Månsson <pelle.mansson gmail.com> wrote:Bill Baxter wrote:On Tue, Oct 27, 2009 at 12:48 PM, Pelle Månsson <pelle.mansson gmail.com> wrote:Bill Baxter wrote:On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin <michel.fortin michelf.com> wrote:On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in case lhs is a string type. If lhs is a character type, the result type is obviously the same as rhs.
equivalent to "a = a ~ b" which is always nice.
It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error. --bb
float b = 2.1; a = b; also unambiguous?
<-> dstring are all lossless conversions. That isn't the case with int and float. --bb
...Then what is the point of wstring, dstring?
They are all just different representations of Unicode. string, which is Unicode in UTF-8, is good because it's the least wasteful for mostly ASCII text, and it has a nice ASCII backwards-compatibility story. dstring, which is Unicode in UTF-32, is good because you have one element = one character, so it's good for doing substring and other text manipulations. wstring, which is Unicode in UTF-16, is good because it lets you call Windows Unicode functions. Here's Daniel Keep's nice explanation: http://docs.google.com/View?docid=dtqh79k_1rbxfmb --bb
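A small sketch of the trade-offs described above: the same three code points stored in each string type, with .length counting code units rather than characters. The sample text is an arbitrary choice for illustration. Strictly speaking, one dstring element is one code point.

```d
import std.conv : to;

void main()
{
    string  s = "a\u00E9\U0001D11E"; // 'a', 'é', musical G clef (outside the BMP)
    wstring w = to!wstring(s);
    dstring d = to!dstring(s);

    assert(s.length == 7); // 1 + 2 + 4 UTF-8 code units
    assert(w.length == 4); // 1 + 1 + 2 UTF-16 code units
    assert(d.length == 3); // one UTF-32 code unit per code point

    // The conversions are lossless: a round trip gives the original back.
    assert(to!string(w) == s);
    assert(to!string(d) == s);
}
```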
Oct 27 2009
Leandro Lucarella wrote:Bill Baxter, on October 27 at 13:12, wrote to me:They are? ...Then what is the point of wstring, dstring?
string, which is unicode in UTF-8, is good because it's the least wasteful for mostly ASCII text. And has a nice ASCII backwards compatibility story. dstring, which is unicode in UTF-32, is good because you have one element = one character. So it's good for doing substring and other text manipulations. wstring, which is UTF-16, is good because it lets you call Windows Unicode functions. Here's Daniel Keep's nice explanation: http://docs.google.com/View?docid=dtqh79k_1rbxfmb
And here is a nice article about Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html
Damn guys, with these good explanations, nobody's going to use the one in TDPL! Andrei
Oct 27 2009
Chris Nicholson-Sauls Wrote:Andrei Alexandrescu wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:Andrei Alexandrescu wrote:303 pages and counting! Andrei
wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)
So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd.
Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...
My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types. -- Chris Nicholson-Sauls
Though I'm sure Shannon would say that the number of bits of intrinsic information contained in the same sequence of Unicode codepoints is exactly the same whether it be encoded as a string or a wstring. Accordingly my intuition is that some rule based upon left-to-right associativity would be more apt. You could then concatenate a wstring (on the rhs) to an empty string (on the lhs) to convert the wstring to a string or vice versa. Cheers Justin Johansson
Oct 27 2009
Justin Johansson wrote:Chris Nicholson-Sauls Wrote:Andrei Alexandrescu wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:Andrei Alexandrescu wrote:303 pages and counting! Andrei
wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)
""~myWcharString? That seems kind of odd.
assign to a concatenation between a string and a wstring. With ~=, it's much easier...
Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types. -- Chris Nicholson-Sauls
Though I'm sure Shannon would say that the number of bits of intrinsic information contained in the same sequence of Unicode codepoints is exactly the same whether it be encoded as a string or a wstring. Accordingly my intuition is that some rule based upon left-to-right associativity would be more apt. You could then concatenate a wstring (on the rhs) to an empty string (on the lhs) to convert the wstring to a string or vice versa. Cheers Justin Johansson
Granted LTR is common enough to be expectable and acceptable. To be perfectly honest, I don't believe I have *ever* even used wchar/wstring. Char/string gosh yes; dchar/dstring quite a bit as well, where I need the simplicity; but I've yet to feel much need for the "weirdo" middle child of UTF. I would argue that string ~ wstring returning string is fine, but would suggest it be a warning for those like myself who might have first guessed it would "upscale to fit". Just so long as the foreach(dchar;string) trick is still around, char/string can cover an awful lot of ground. All that said, though, I don't think I would ever use ""~wstring as a means of conversion. It just feels like "there wasn't any other way to do this, so here's a cheap hack" -- which just isn't the case. -- Chris Nicholson-Sauls
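The foreach(dchar; string) trick mentioned above looks roughly like this. The function name countCodePoints is invented for the example; the decoding itself is done by the language when the loop variable is declared as dchar.

```d
/// Illustrative helper (hypothetical name): counts code points in a UTF-8 string.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s) // each iteration yields one decoded code point
        ++n;
    return n;
}

void main()
{
    string s = "na\u00EFve"; // "naïve": the 'ï' takes two UTF-8 code units
    assert(s.length == 6);            // code units
    assert(countCodePoints(s) == 5);  // code points
}
```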
Oct 29 2009
Chris Nicholson-Sauls Wrote:Granted LTR is common enough to be expectable and acceptable. To be perfectly honest, I don't believe I have *ever* even used wchar/wstring. Char/string gosh yes; dchar/dstring quite a bit as well, where I need the simplicity; but I've yet to feel much need for the "weirdo" middle child of UTF. I would argue that string ~ wstring returning string is fine, but would suggest it be a warning for those like myself who might have first guessed it would "upscale to fit". Just so long as the foreach(dchar;string) trick is still around, char/string can cover an awful lot of ground. All that said, though, I don't think I would ever use ""~wstring as a means of conversion. It just feels like "there wasn't any other way to do this, so here's a cheap hack" -- which just isn't the case.
Your overall reply is well put. On the last point: agreed; cheap hacks should be avoided. cheers, Justin
Oct 29 2009
"Chris Nicholson-Sauls" <ibisbasenji gmail.com> wrote in message news:hcctuf$140a$1 digitalmars.com...Granted LTR is common enough to be expectable and acceptable. To be perfectly honest, I don't believe I have *ever* even used wchar/wstring. Char/string gosh yes; dchar/dstring quite a bit as well, where I need the simplicity; but I've yet to feel much need for the "weirdo" middle child of UTF.
Given that just about anything outside of D (at least as far as I've seen) that attempts to use unicode does so with UTF-16 (or just uses UCS-2 and pretends that's UTF-16...), wchar and wstring are great for dealing with that. For instance, my Goldie engine for GOLD currently uses wchar in a number of places because GOLD's .cfg format stores text in...well, presumably UTF-16 (I haven't tested to see if it's really UCS-2). But yea, as long as you're not dealing with anything that's already in UTF-16 or that expects it, then it does seem to be somewhat questionable.
Oct 29 2009
Nick Sabalausky wrote:"Chris Nicholson-Sauls" <ibisbasenji gmail.com> wrote in message news:hcctuf$140a$1 digitalmars.com...Granted LTR is common enough to be expectable and acceptable. To be perfectly honest, I don't believe I have *ever* even used wchar/wstring. Char/string gosh yes; dchar/dstring quite a bit as well, where I need the simplicity; but I've yet to feel much need for the "weirdo" middle child of UTF.
Given that just about anything outside of D (at least as far as I've seen) that attempts to use unicode does so with UTF-16 (or just uses UCS-2 and pretends that's UTF-16...), wchar and wstring are great for dealing with that. For instance, my Goldie engine for GOLD currently uses wchar in a number of places because GOLD's .cfg format stores text in...well, presumably UTF-16 (I haven't tested to see if it's really UCS-2). But yea, as long as you're not dealing with anything that's already in UTF-16 or that expects it, then it does seem to be somewhat questionable.
I think this says it all: http://en.wikipedia.org/wiki/Utf-16#Use_in_major_operating_systems_and_environments -Lars :)
Oct 30 2009
Lars T. Kyllingstad wrote:Nick Sabalausky wrote:"Chris Nicholson-Sauls" <ibisbasenji gmail.com> wrote in message news:hcctuf$140a$1 digitalmars.com...Granted LTR is common enough to be expectable and acceptable. To be perfectly honest, I don't believe I have *ever* even used wchar/wstring. Char/string gosh yes; dchar/dstring quite a bit as well, where I need the simplicity; but I've yet to feel much need for the "weirdo" middle child of UTF.
Given that just about anything outside of D (at least as far as I've seen) that attempts to use unicode does so with UTF-16 (or just uses UCS-2 and pretends that's UTF-16...), wchar and wstring are great for dealing with that. For instance, my Goldie engine for GOLD currently uses wchar in a number of places because GOLD's .cfg format stores text in...well, presumably UTF-16 (I haven't tested to see if it's really UCS-2). But yea, as long as you're not dealing with anything that's already in UTF-16 or that expects it, then it does seem to be somewhat questionable.
I think this says it all: http://en.wikipedia.org/wiki/Utf-16#Use_in_major_operating_systems_and_environments -Lars :)
Yep, there was a frenzy when UCS-2 came about: everybody thought two bytes would be enough for everyone. So UCS-2 was widely adopted - who wouldn't love to have constant character width? Then the UTF-16 surrogate business came about, and the only logical step they could take was to migrate to UTF-16, which was upward compatible with UCS-2. I personally think UTF-8 is a better overall design though. Andrei
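The "surrogate business" can be seen directly in a wstring. In this sketch the code point U+1D11E (musical G clef) is an arbitrary example of a character outside the range that a single 16-bit unit can hold.

```d
void main()
{
    wstring bmp = "\u00E9"w;      // fits in one 16-bit unit, as in UCS-2
    wstring sup = "\U0001D11E"w;  // needs a surrogate pair in UTF-16

    assert(bmp.length == 1);
    assert(sup.length == 2);
    assert(sup[0] == 0xD834); // high (lead) surrogate
    assert(sup[1] == 0xDD1E); // low (trail) surrogate

    // Decoding with a dchar loop variable recombines the pair.
    size_t n = 0;
    foreach (dchar c; sup)
    {
        assert(c == 0x1D11E);
        ++n;
    }
    assert(n == 1);
}
```

So code that assumed a constant width under UCS-2 has to handle two-unit characters once it moves to UTF-16.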
Oct 30 2009
Andrei Alexandrescu Wrote:Lars T. Kyllingstad wrote:Nick Sabalausky wrote:"Chris Nicholson-Sauls" <ibisbasenji gmail.com> wrote in message news:hcctuf$140a$1 digitalmars.com...Granted LTR is common enough to be expectable and acceptable. To be perfectly honest, I don't believe I have *ever* even used wchar/wstring. Char/string gosh yes; dchar/dstring quite a bit as well, where I need the simplicity; but I've yet to feel much need for the "weirdo" middle child of UTF.
Given that just about anything outside of D (at least as far as I've seen) that attempts to use unicode does so with UTF-16 (or just uses UCS-2 and pretends that's UTF-16...), wchar and wstring are great for dealing with that. For instance, my Goldie engine for GOLD currently uses wchar in a number of places because GOLD's .cfg format stores text in...well, presumably UTF-16 (I haven't tested to see if it's really UCS-2). But yea, as long as you're not dealing with anything that's already in UTF-16 or that expects it, then it does seem to be somewhat questionable.
I think this says it all: http://en.wikipedia.org/wiki/Utf-16#Use_in_major_operating_systems_and_environments -Lars :)
Yep, there was a frenzy when UCS-2 came about: everybody thought two bytes would be enough for everyone. So UCS-2 was widely adopted - who wouldn't love to have constant character width? Then the UTF-16 surrogate business came about, and the only logical step they could take was to migrate to UTF-16, which was upward compatible with UCS-2. I personally think UTF-8 is a better overall design though. Andrei
"I personally think UTF-8 is a better overall design though." Unicode Technical Note #12 by The Unicode Consortium apparently disagrees, recommending UTF-16 for Processing. http://unicode.org/notes/tn12/ The major claim in the TN is that Unicode is optimized for UTF-16. The rest of the argument looks like a VHS (everyone is using it, i.e. UTF-16) versus Beta argument. So who's right? My personal view is that whilst they are the *Unicode Consortium*, I have great difficulty in accepting UTF-16 as the one-and-holy encoding. FWIW, there was a subthread during a discussion about the ordained features of programming languages on LtU a while back: "What Are The Resolved Debates in General Purpose Language Design?" http://lambda-the-ultimate.org/node/3166#comment-46233 It's a long discussion, so it is easier to search for UTF or Unicode on the page if you're interested. cheers Justin Johansson
Oct 30 2009
Justin Johansson wrote:Andrei Alexandrescu Wrote:Lars T. Kyllingstad wrote:Nick Sabalausky wrote:"Chris Nicholson-Sauls" <ibisbasenji gmail.com> wrote in message news:hcctuf$140a$1 digitalmars.com...Granted LTR is common enough to be expectable and acceptable. To be perfectly honest, I don't believe I have *ever* even used wchar/wstring. Char/string gosh yes; dchar/dstring quite a bit as well, where I need the simplicity; but I've yet to feel much need for the "weirdo" middle child of UTF.
seen) that attempts to use unicode does so with UTF-16 (or just uses UCS-2 and pretends that's UTF-16...), wchar and wstring are great for dealing with that. For instance, my Goldie engine for GOLD currently uses wchar in a number of places because GOLD's .cfg format stores text in...well, presumably UTF-16 (I haven't tested to see if it's really UCS-2). But yea, as long as you're not dealing with anything that's already in UTF-16 or that expects it, then it does seem to be somewhat questionable.
http://en.wikipedia.org/wiki/Utf-16#Use_in_major_operating_systems_and_environments -Lars :)
bytes would be enough for everyone. So UCS-2 was widely adopted - who wouldn't love to have constant character width? Then the UTF-16 surrogate business came about, and the only logical step they could take was to migrate to UTF-16, which was upward compatible with UCS-2. I personally think UTF-8 is a better overall design though. Andrei
"I personally think UTF-8 is a better overall design though." Unicode Technical Note #12 by The Unicode Consortium apparently disagrees, recommending UTF-16 for Processing. http://unicode.org/notes/tn12/ The major claim in the TN is that Unicode is optimized for UTF-16. The rest of the argument looks like a VHS (everyone is using it, i.e. UTF-16) versus Beta argument. So who's right? My personal view is that whilst they are the *Unicode Consortium*, I have great difficulty in accepting UTF-16 as the one-and-holy encoding. FWIW, there was a subthread during a discussion about the ordained features of programming languages on LtU a while back: "What Are The Resolved Debates in General Purpose Language Design?" http://lambda-the-ultimate.org/node/3166#comment-46233 It's a long discussion, so it is easier to search for UTF or Unicode on the page if you're interested. cheers Justin Johansson
Thanks for the pointers. One of the reasons for which I like the design of UTF-8 is its generality: it's a variable-length code for any number of up to 31 bits. In contrast, UTF-16 relies on specific dead zones inside the assigned space. But the authors of the unicode.org article do make a few good points, such as there not being any invalid UTF-16 symbol. But then that actually can be seen as a strength of UTF-8: binary files that happen to be valid UTF-8 are statistically so scarce that UTF-8 has a very solid method of checking whether a file is UTF-8 or something else. Andrei
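The "is this UTF-8 or something else" check can be sketched with std.utf.validate, which throws UTFException on malformed input. The wrapper name looksLikeUtf8 is an assumption made for this example, and the two-byte sample is just one instance of an ill-formed sequence.

```d
import std.utf : validate, UTFException;

/// Illustrative helper (hypothetical name): true if data is well-formed UTF-8.
bool looksLikeUtf8(const(char)[] data)
{
    try
    {
        validate(data); // throws UTFException on malformed sequences
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    assert(looksLikeUtf8("caf\u00E9"));

    // 0xC3 announces a two-byte sequence, but 0x28 is not a continuation byte.
    char[] bad = [cast(char) 0xC3, cast(char) 0x28];
    assert(!looksLikeUtf8(bad));
}
```

Because arbitrary binary data rarely satisfies the lead-byte and continuation-byte rules throughout, a check like this rejects most non-UTF-8 input quickly.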
Oct 30 2009
Andrei Alexandrescu, on October 27 at 19:32, wrote to me:Leandro Lucarella wrote:Bill Baxter, on October 27 at 13:12, wrote to me:They are? ...Then what is the point of wstring, dstring?
string, which is unicode in UTF-8, is good because it's the least wasteful for mostly ASCII text. And has a nice ASCII backwards compatibility story. dstring, which is unicode in UTF-32, is good because you have one element = one character. So it's good for doing substring and other text manipulations. wstring, which is UTF-16, is good because it lets you call Windows Unicode functions. Here's Daniel Keep's nice explanation: http://docs.google.com/View?docid=dtqh79k_1rbxfmb
And here is a nice article about Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html
Damn guys, with these good explanations, nobody's going to use the one in TDPL!
:) -- Leandro Lucarella (AKA luca) http://llucax.com.ar/ ---------------------------------------------------------------------- GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05) ---------------------------------------------------------------------- We live in very contemporary times, Don Inodoro... -- Mendieta
Oct 27 2009
Andrei Alexandrescu, on October 27 at 19:32, wrote to me:Leandro Lucarella wrote:Bill Baxter, on October 27 at 13:12, wrote to me:They are? ...Then what is the point of wstring, dstring?
string, which is unicode in UTF-8, is good because it's the least wasteful for mostly ASCII text. And has a nice ASCII backwards compatibility story. dstring, which is unicode in UTF-32, is good because you have one element = one character. So it's good for doing substring and other text manipulations. wstring, which is UTF-16, is good because it lets you call Windows Unicode functions. Here's Daniel Keep's nice explanation: http://docs.google.com/View?docid=dtqh79k_1rbxfmb
And here is a nice article about Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html
Damn guys, with these good explanations, nobody's going to use the one in TDPL!
BTW, seeing the explanation about Unicode in your book, one wonders why the UTF-8, UTF-16 and UTF-32 character types are not simply called utf8, utf16 and utf32... -- Leandro Lucarella (AKA luca) http://llucax.com.ar/ ---------------------------------------------------------------------- GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05) ---------------------------------------------------------------------- Not even heaven loves me anymore, not even death visits me; not even the sun warms me anymore, not even the wind caresses me
Oct 29 2009
On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:Andrei Alexandrescu wrote:303 pages and counting! Andrei
Soon the PI level, or at least 10 times PI!
A hundred even. ;-)
Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)
So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd. Just using something like to!(char[])(myWcharString) seems less goofy to me. But that subjective reaction is all I have against it. --bb
Oct 26 2009
Bill Baxter, on October 27 at 13:12 you wrote:They are? ...Then what is the point of wstring, dstring?
They are all just different representations of Unicode. string, which is Unicode in UTF-8, is good because it's the least wasteful for mostly ASCII text. And has a nice ASCII backwards compatibility story. dstring, which is Unicode in UTF-32, is good because you have one element = one character. So it's good for doing substring and other text manipulations. wstring, which is UTF-16, is good because it lets you call Windows Unicode functions. Here's Daniel Keep's nice explanation: http://docs.google.com/View?docid=dtqh79k_1rbxfmb
And here is a nice article about Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html
-- Leandro Lucarella (AKA luca) http://llucax.com.ar/
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
I have committed sins, I have done evil, I have been a victim of envy, selfishness, ambition, lies and frivolity, but I have always been an Argentine father who wants his son to succeed in life. -- Ricardo Vaporeso
Oct 27 2009
On Mon, Oct 26, 2009 at 4:05 PM, Jeremie Pelletier <jeremiep gmail.com> wrote:Andrei Alexandrescu wrote:Jeremie Pelletier wrote:Andrei Alexandrescu wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:Andrei Alexandrescu wrote:303 pages and counting! Andrei
Soon the PI level, or at least 10 times PI!
A hundred even. ;-)
Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.) Andrei
I don't know if that's a good idea; it's better when string encoding is explicit so you know where your reallocations are.
The beauty of it is that reallocation with ~ occurs anyway, and with ~= is anyway imminent, regardless of the character width you're reallocating. Allowing concatenation of strings of different widths is a nice way of acknowledging at the language level that all character widths are encodings of abstract characters.
ie if I know some routine will have to convert a utf16 parameter to utf8 to append it to a string, then I'll try and either make it output utf16 or input utf8. If it's implicit it's much harder to find and optimize these cases. to!string() is easy enough to use anyways. But it could be good to add a range type that does this with multiple opAppend/opAppendAssign overloads.
One problem with s ~= to!string(someDstring); is that it does two allocations instead of one. Andrei
Good points, I didn't think of the separation between characters and encodings or the extra allocation from to. You have my vote for this feature then! Jeremie
Yeah, me too. Saving an allocation is good. And I agree that having ~= do a conversion is much more useful than just getting an error. It's one of those things you might try just hoping it will work, and it's always nice when something like that does just what you hope it will. I guess the only other thing I could worry about is that in generic array code it might cause someone headaches that for some T[], T[] ~= S[] is legal and the length of the result is not the sum of the lengths of the inputs. But I can't think of any real situation where that would cause trouble. --bb
Oct 26 2009
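To make the allocation point above concrete, here is a minimal D sketch, assuming only what current Phobos offers (std.conv.to) and not the mixed-width ~= proposed in the thread. The helper names appendViaTo and appendPerCodePoint are made up for illustration. The first builds a temporary UTF-8 copy before appending; the second appends one code point at a time, which skips the temporary but may still grow the destination more than once.

```d
import std.conv : to;

// Hypothetical helper: transcode first, then append.
// to!string allocates a temporary UTF-8 copy, and ~= may allocate again.
string appendViaTo(string s, dstring d)
{
    s ~= d.to!string;
    return s;
}

// Hypothetical helper: append each code point directly.
// Appending a dchar to a string encodes it as UTF-8 in place,
// so no intermediate string is built.
string appendPerCodePoint(string s, dstring d)
{
    foreach (dchar c; d)
        s ~= c;
    return s;
}

void main()
{
    string s = "caf";
    dstring d = "\u00E9 au lait"d;

    // Both routes produce the same UTF-8 text.
    assert(appendViaTo(s, d) == appendPerCodePoint(s, d));
    assert(appendViaTo(s, d) == "caf\u00E9 au lait");
}
```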
On Tue, 27 Oct 2009 10:04:33 +0300, Chris Nicholson-Sauls <ibisbasenji gmail.com> wrote:Andrei Alexandrescu wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:Andrei Alexandrescu wrote:303 pages and counting! Andrei
Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)
So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd.
Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...
My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types. -- Chris Nicholson-Sauls
ubyte i = 42;
int j = 1;
i += j; // still ubyte

same here:

string a = "hello";
wstring b = "world"w;
a ~= b; // still string
Oct 27 2009
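The integer half of the analogy above can be checked directly. This is a small sketch of how D treats op-assign on a ubyte, with a made-up helper name (bump); it says nothing about how mixed-width string appends would actually be implemented.

```d
// Hypothetical helper: adds an int to a ubyte with op-assign.
ubyte bump(ubyte i, int j)
{
    i += j; // accepted: the int-typed sum is converted back to ubyte
    return i;
}

void main()
{
    ubyte i = 42;
    int j = 1;

    // Plain assignment from the int-typed sum is rejected without a cast,
    // while the op-assign form compiles.
    static assert(!__traits(compiles, i = i + j));
    static assert( __traits(compiles, i += j));

    // The left-hand side keeps its type.
    static assert(is(typeof(bump(i, j)) == ubyte));
    assert(bump(i, j) == 43);
}
```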
On Tue, Oct 27, 2009 at 4:37 AM, Denis Koroskin <2korden gmail.com> wrote:On Tue, 27 Oct 2009 10:04:33 +0300, Chris Nicholson-Sauls <ibisbasenji gmail.com> wrote:Andrei Alexandrescu wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Bill Baxter wrote:On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:Andrei Alexandrescu wrote:303 pages and counting! Andrei
Soon the PI level, or at least 10 times PI!
Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)
So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd.
Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...
My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types. -- Chris Nicholson-Sauls
ubyte i = 42;
int j = 1;
i += j; // still ubyte

same here:

string a = "hello";
wstring b = "world"w;
a ~= b; // still string
As Andrei said (and maybe you missed) "With ~=, it's much easier...". The only question is about what "a ~ b" should do. --bb
Oct 27 2009
On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin <michel.fortin michelf.com> wrote:On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in case lhs is a string type. If lhs is a character type, the result type is obviously the same as rhs.
Seems the most intuitive option to me. Also, it makes "a ~= b" equivalent to "a = a ~ b" which is always nice.
And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error. --bb
Oct 27 2009
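The rule discussed above (the result of lhs ~ rhs takes the type of lhs) can be approximated today with a library function. The following is only a sketch under that assumption: concat is a hypothetical helper, not part of Phobos or the language, and unlike a built-in operator it goes through std.conv.to and so pays for a temporary.

```d
import std.conv : to;
import std.traits : isSomeString;

// Hypothetical helper: concatenates two strings of possibly different
// character widths and returns the type of the left operand.
L concat(L, R)(L lhs, R rhs)
if (isSomeString!L && isSomeString!R)
{
    // Transcode the right operand to the left operand's type, then join.
    return lhs ~ rhs.to!L;
}

void main()
{
    string a = "hello ";
    wstring b = "world"w;

    auto c = concat(a, b); // string ~ wstring
    static assert(is(typeof(c) == string));
    assert(c == "hello world");

    auto d = concat(b, a); // wstring ~ string
    static assert(is(typeof(d) == wstring));
    assert(d == "worldhello "w);
}
```

With this rule, a = concat(a, b) always type-checks, which mirrors the "a ~= b is the same as a = a ~ b" property mentioned in the thread.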
On Tue, Oct 27, 2009 at 12:48 PM, Pelle Månsson <pelle.mansson gmail.com> wrote:Bill Baxter wrote:On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin <michel.fortin michelf.com> wrote:On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in case lhs is a string type. If lhs is a character type, the result type is obviously the same as rhs.
Seems the most intuitive option to me. Also, it makes "a ~= b" equivalent to "a = a ~ b" which is always nice.
And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error. --bb
int a;
float b = 2.1;
a = b;

also unambiguous?
I'm not sure what point you're trying to make, but wstring <-> string <-> dstring are all lossless conversions. That isn't the case with int and float. --bb
Oct 27 2009
On Tue, Oct 27, 2009 at 1:06 PM, Pelle Månsson <pelle.mansson gmail.com> wrote:Bill Baxter wrote:On Tue, Oct 27, 2009 at 12:48 PM, Pelle Månsson <pelle.mansson gmail.com> wrote:Bill Baxter wrote:On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin <michel.fortin michelf.com> wrote:On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in case lhs is a string type. If lhs is a character type, the result type is obviously the same as rhs.
Seems the most intuitive option to me. Also, it makes "a ~= b" equivalent to "a = a ~ b" which is always nice.
And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error. --bb
int a;
float b = 2.1;
a = b;

also unambiguous?
I'm not sure what point you're trying to make, but wstring <-> string <-> dstring are all lossless conversions. That isn't the case with int and float. --bb
They are? ...Then what is the point of wstring, dstring?
They are all just different representations of Unicode. string, which is Unicode in UTF-8, is good because it's the least wasteful for mostly ASCII text. And has a nice ASCII backwards compatibility story. dstring, which is Unicode in UTF-32, is good because you have one element = one character. So it's good for doing substring and other text manipulations. wstring, which is UTF-16, is good because it lets you call Windows Unicode functions. Here's Daniel Keep's nice explanation: http://docs.google.com/View?docid=dtqh79k_1rbxfmb --bb
Oct 27 2009
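The trade-offs described above show up in the .length of the same text held in each string type, since .length counts code units rather than characters. This is a small sketch using std.conv.to for the transcoding; the sample text and the helper name unitCounts are arbitrary choices for illustration.

```d
import std.conv : to;
import std.range : walkLength;

// Hypothetical helper: code-unit counts of the same text as
// UTF-8, UTF-16 and UTF-32.
size_t[3] unitCounts(string s)
{
    return [s.length, s.to!wstring.length, s.to!dstring.length];
}

void main()
{
    // 'a' (ASCII), U+00E9 (e with acute), U+1D11E (outside the BMP)
    string s = "a\u00E9\U0001D11E";

    // UTF-8: 1 + 2 + 4 units; UTF-16: 1 + 1 + 2 units; UTF-32: one per code point.
    assert(unitCounts(s) == [7, 4, 3]);

    // Iterating a string as a range decodes it, giving the code-point count.
    assert(s.walkLength == 3);

    // The conversions round-trip without loss.
    assert(s.to!wstring.to!string == s);
    assert(s.to!dstring.to!string == s);
}
```

Note that only dstring gives one element per code point here; the UTF-16 form needs a surrogate pair for the last character.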








