
digitalmars.D - dchar unicode phobos

reply Johan Granberg <lijat.meREM OVEgmail.com> writes:
That D supports UTF is great, and by using dchar[] all Unicode code 
points can be used. But Phobos does not support dchar[]s adequately 
(or wchar[]s, for that matter). Wouldn't the language's standard 
library be expected to support all of the language's string encodings?

Proposal: add wchar[] and dchar[] versions of the string functions in Phobos.

(Should this be filed as a bug?)
Jun 07 2006
next sibling parent "Derek Parnell" <derek psych.ward> writes:
On Wed, 07 Jun 2006 22:40:00 +1000, Johan Granberg  
<lijat.meREM OVEgmail.com> wrote:

 That D supports UTF is great, and by using dchar[] all Unicode code  
 points can be used. But Phobos does not support dchar[]s adequately  
 (or wchar[]s, for that matter). Wouldn't the language's standard  
 library be expected to support all of the language's string encodings?

 Proposal: add wchar[] and dchar[] versions of the string functions in  
 Phobos.

 (Should this be filed as a bug?)

YES! I've had to recode many of them for dchar/wchar support.

-- 
Derek Parnell
Melbourne, Australia
Jun 07 2006
prev sibling parent reply pragma <pragma_member pathlink.com> writes:
In article <e66hf1$otn$1 digitaldaemon.com>, Johan Granberg says...
 That D supports UTF is great, and by using dchar[] all Unicode code 
 points can be used. But Phobos does not support dchar[]s adequately 
 (or wchar[]s, for that matter). Wouldn't the language's standard 
 library be expected to support all of the language's string encodings?

 Proposal: add wchar[] and dchar[] versions of the string functions in Phobos.

 (Should this be filed as a bug?)

Ya know, I never really thought about this, but you're right: D has three character types yet only has full library support for one of them. If you ask me, there's only so many ways to go about this:

1. Refactor std.string to use implicit templates
2. Branch std.string into three modules, one for each char type
3. Support all three char types via overloads within std.string

Personally, I like #1 since it would be seamless to implement, and would require almost exactly as much code as is in use now. The only drawback here centers around problems with distributing template code in libraries.

Also, do you personally need this kind of support in your project? Have you looked at Mango?

- EricAnderton at yahoo
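
To make the trade-off concrete, here is a rough sketch of what option 3 might look like, written for a present-day D compiler rather than the 2006 one under discussion. The name stripLeft is hypothetical and chosen only for illustration; it is not an existing std.string function. The three bodies are identical apart from the character type, which is the duplication that option 1 would avoid.

```d
// Option 3 sketch: one hand-written overload per character type.
// stripLeft is a hypothetical function, not part of std.string.

const(char)[] stripLeft(const(char)[] s)
{
    size_t i = 0;
    while (i < s.length && s[i] == ' ')
        ++i;
    return s[i .. $];
}

const(wchar)[] stripLeft(const(wchar)[] s)
{
    size_t i = 0;
    while (i < s.length && s[i] == ' ')
        ++i;
    return s[i .. $];
}

const(dchar)[] stripLeft(const(dchar)[] s)
{
    size_t i = 0;
    while (i < s.length && s[i] == ' ')
        ++i;
    return s[i .. $];
}

void main()
{
    string  a = "  abc";
    wstring b = "  abc"w;
    dstring c = "  abc"d;

    // Overload resolution picks the version matching the element type.
    assert(stripLeft(a) == "abc");
    assert(stripLeft(b) == "abc"w);
    assert(stripLeft(c) == "abc"d);
}
```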
Jun 07 2006
next sibling parent reply Johan Granberg <lijat.meREM OVEgmail.com> writes:
pragma wrote:

 Ya know, I never really thought about this, but you're right: D has three character types yet only has full library support for one of them. If you ask me, there's only so many ways to go about this:

 1. Refactor std.string to use implicit templates
 2. Branch std.string into three modules, one for each char type
 3. Support all three char types via overloads within std.string

 Personally, I like #1 since it would be seamless to implement, and would require almost exactly as much code as is in use now. The only drawback here centers around problems with distributing template code in libraries.

 Also, do you personally need this kind of support in your project? Have you looked at Mango?

 - EricAnderton at yahoo

Yes, I have needed support for dchar[] with functions like split, splitline and strip in std.string.
Yes, your ways of doing the support look OK; I would choose 3 though, instead of 1. It may be because I'm not 100% sure about how 1 would work. (Care to give an example?)
No, I have not looked closely at Mango yet. (Will do.)
Jun 07 2006
next sibling parent reply pragma <pragma_member pathlink.com> writes:
In article <e66qqr$1er5$1 digitaldaemon.com>, Johan Granberg says...

 Yes, I have needed support for dchar[] with functions like split, splitline and strip in std.string.
 Yes, your ways of doing the support look OK; I would choose 3 though, instead of 1. It may be because I'm not 100% sure about how 1 would work. (Care to give an example?)

Sure. D will now try to implicitly instantiate templates where it finds them. So you can do this:

/**/ template trim(TChar){
/**/   TChar[] trim(TChar[] src){ /* ... */ }
/**/ }

..and the call to trim will still be as simple as the non-templated version:

/**/ dchar[] foo,bar;
/**/ foo = trim(bar);

So we get to have our cake and eat it too. The onus is now placed on the compiler, as it will generate a distinct version of each template as needed.

The astute observer will notice that any array type can be used as a parameter in the above example. Proper use of static if() and the 'is' operator can easily ensure that only char, wchar and dchar are being used. Template overloads, while verbose, are another way to go.

- EricAnderton at yahoo
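
As a minimal sketch of the restriction described above, the following fills in a body for trim and rejects non-character element types at compile time. It assumes a present-day D compiler (function-template shorthand, std.traits.Unqual), not the 2006 compiler. The helper isCharType and the space-stripping behavior are illustrative assumptions, and a static assert stands in for the static if mentioned in the post.

```d
import std.traits : Unqual;

// Hypothetical helper (not a Phobos symbol): true for char, wchar and dchar,
// ignoring const/immutable qualifiers. Built on the 'is' expression.
enum bool isCharType(T) = is(Unqual!T == char)
                       || is(Unqual!T == wchar)
                       || is(Unqual!T == dchar);

// Illustrative trim: removes leading and trailing ASCII spaces.
TChar[] trim(TChar)(TChar[] src)
{
    static assert(isCharType!TChar,
                  "trim requires char, wchar or dchar elements");

    size_t lo = 0, hi = src.length;
    while (lo < hi && src[lo] == ' ')
        ++lo;
    while (hi > lo && src[hi - 1] == ' ')
        --hi;
    return src[lo .. hi];
}

void main()
{
    dchar[] bar = "  hello  "d.dup;
    dchar[] foo = trim(bar);      // TChar is deduced as dchar
    assert(foo == "hello"d);

    string s = " x ";
    assert(trim(s) == "x");       // TChar is deduced as immutable(char)

    // An int[] argument is rejected at compile time.
    static assert(!__traits(compiles, trim([1, 2, 3])));
}
```

The call sites look the same as for a non-templated function; the compiler generates one instance per element type actually used.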
Jun 07 2006
next sibling parent Johan Granberg <lijat.meREM OVEgmail.com> writes:
pragma wrote:
 So we get to have our cake and eat it too.  The onus is now placed on the
 compiler, as it will generate a distinct version of each template as needed.
 
 The astute observer will notice that any array type can be used as a parameter
 in the above example.  Proper use of static if() and the 'is' operator can
 easily ensure that only char, wchar and dchar are being used.  Template
 overloads, while verbose, are another way to go.
 
 - EricAnderton at yahoo

OK, that was neat. (I have to look a bit more into templates.) Are there any special cases that need to be handled? Could UTF-32 have more possible symbols for line endings or whitespace than UTF-8, or anything like that? It could be handled with static if on a case-by-case basis, though.
Jun 07 2006
prev sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
pragma wrote:

 Sure. D will now try to implicitly instantiate templates where it finds them. So you can do this:

 /**/ template trim(TChar){
 /**/   TChar[] trim(TChar[] src){ /* ... */ }
 /**/ }

D is unfortunately not really that smart yet. You need exactly the same function argument types and in the same order as the template arguments.

template trim(MyString) {
  MyString trim(MyString src) { /* */ }
}

works.
 
 ..and the call to trim will still be as simple as the non-templated version:
 
 /**/ dchar[] foo,bar;
 /**/ foo = trim(bar);

and even bar.trim() will work.
 The astute observer will notice that any array type can be used as a parameter
 in the above example.  Proper use of static if() and the 'is' operator can
 easily ensure that only char, wchar and dchar are being used.  Template
 overloads, while verbose, are another way to go.

I don't really see any reason to limit string functions to char, wchar and dchar. Strings in other encodings (for instance latin1, iso8859-1) are readily encoded as ubyte[] or with a typedef'ed type. It would be useful to be able to work with such strings too. I have myself several times cast latin1 strings into char[], just to be able to use one of the std.string functions on them before casting the result back into a ubyte[].

/Oskar
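
A small sketch of the point about other encodings, assuming a present-day D compiler: a search template that is generic over the element type can be applied to a Latin-1 ubyte[] directly, with no cast to char[] and back. The name findElem is hypothetical and is not the std.string or std.array API.

```d
// Hypothetical generic search: index of the first element equal to
// needle, or -1 if there is none. Works for any element type T.
ptrdiff_t findElem(T)(const(T)[] haystack, const(T) needle)
{
    foreach (i, e; haystack)
        if (e == needle)
            return cast(ptrdiff_t) i;
    return -1;
}

void main()
{
    // "café" in Latin-1: 0xE9 is 'é' stored as a single byte.
    ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9];
    ubyte eAcute = 0xE9;
    assert(findElem(latin1, eAcute) == 3);

    // The same template also serves ordinary D strings.
    string utf8 = "abc";
    assert(findElem(utf8, 'c') == 2);
    assert(findElem(utf8, 'z') == -1);
}
```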
Jun 07 2006
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Johan Granberg wrote:
 Yes, I have needed support for dchar[] with functions like split, 
 splitline and strip in std.string.
 Yes, your ways of doing the support look OK; I would choose 3 though, 
 instead of 1. It may be because I'm not 100% sure about how 1 would 
 work. (Care to give an example?)
 No, I have not looked closely at Mango yet. (Will do.)

Oskar has an array template library that can do much of this, and I have the beginnings of one in Ares as well. The source is here:

http://svn.dsource.org/projects/ares/trunk/src/ares/std/array.d

As you can see, however, half the functions are commented out because template function overloading basically just doesn't work yet. Eventually, however, I plan to add split, join, etc. These will probably all assume fixed-width elements, with improved support for char and wchar strings in a std.string module, as supporting variable width encoding will slow down the algorithms.

Sean
Jun 07 2006
parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Sean Kelly wrote:

 Oskar has an array template library that can do much of this, and I have the beginnings of one in Ares as well. The source is here:

 http://svn.dsource.org/projects/ares/trunk/src/ares/std/array.d

 As you can see, however, half the functions are commented out because template function overloading basically just doesn't work yet.

I agree that it would be really nice if those types of templates worked today, but all of those functions can be rewritten in a way that works with current D. Considering the amount of time it took us to get the current (most basic) IFTI support, I would rather use a solution that works today than wait an indefinite amount of time for something that may never happen. :)

I fully appreciate your standpoint though, and would love to hear something from Walter regarding future IFTI support. Some things that I would like to see improved (in descending order of importance) are IFTI support for:

1. template member functions
2. mixed explicit/implicit arguments: f!(int)('x') => f!(int,char)('x')
3. template specializations
4. better template function overloading
5. generic matching: template t(X) { void t(X[] a, X b) {}}
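
For readers who want to see what items 1, 2 and 5 of this wish list mean in code, here is a short sketch. It is written for a present-day D compiler, which as far as I can tell accepts these forms; it says nothing about the 2006 compiler. All names (convert, countEqual, Buffer) are hypothetical.

```d
// Item 2: mixed explicit/implicit arguments.
// To is given explicitly, From is deduced from the call argument.
To convert(To, From)(From x)
{
    return cast(To) x;
}

// Item 5: generic matching. X is deduced consistently from both
// the array parameter and the element parameter.
size_t countEqual(X)(const(X)[] a, const(X) b)
{
    size_t n = 0;
    foreach (e; a)
        if (e == b)
            ++n;
    return n;
}

// Item 1: a template member function whose type parameter is deduced.
struct Buffer
{
    int[] data;

    void put(T)(T value)
    {
        data ~= cast(int) value;
    }
}

void main()
{
    assert(convert!int('x') == 120);       // From deduced as char
    assert(countEqual([1, 2, 1], 1) == 2); // X deduced as int

    Buffer b;
    b.put('a');                            // T deduced as char
    b.put(2L);                             // T deduced as long
    assert(b.data == [97, 2]);
}
```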
 Eventually however, I plan to add split, join, etc.  These will probably 
 all assume fixed-width elements, with improved support for char and 
 wchar strings in a std.string module, as supporting variable width 
 encoding will slow down the algorithms.

It sounds reasonable to avoid any variable length awareness in std.array, but I don't really see how supporting that will make split or join any slower. For instance

(char[]).split(char)
(char[]).split(char[])
(char[]).split(bool delegate(char))

aren't affected by variable length encodings. Only:

(char[]).split(dchar)
(char[]).split(bool delegate(dchar))

are (by using a dchar foreach over a char[]), but here, the user is explicit about wanting a multi byte implementation. Putting the implementation of the last two versions in std.string gives a neat std.string/std.array separation, but risks confusing the user:

- Why would "abc".split('a') be in std.array while "abc".split('') requires std.string?

Regards,

Oskar
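
The distinction drawn above can be sketched as two functions, assuming a present-day D compiler. splitUnit and splitPoint are hypothetical names, not the split signatures discussed in the thread. The first works on raw code units; the second uses a dchar foreach over the char array, which decodes UTF-8 as it goes, and std.utf.codeLength to step over the delimiter.

```d
import std.utf : codeLength;

// Split on a single code unit: no decoding is needed.
const(char)[][] splitUnit(const(char)[] text, char delim)
{
    const(char)[][] parts;
    size_t start = 0;
    foreach (i, char c; text)
    {
        if (c == delim)
        {
            parts ~= text[start .. i];
            start = i + 1;
        }
    }
    parts ~= text[start .. $];
    return parts;
}

// Split on a code point: the dchar loop variable makes foreach decode
// UTF-8; i is the code unit offset where each code point starts.
const(char)[][] splitPoint(const(char)[] text, dchar delim)
{
    const(char)[][] parts;
    size_t start = 0;
    foreach (size_t i, dchar c; text)
    {
        if (c == delim)
        {
            parts ~= text[start .. i];
            start = i + codeLength!char(c); // skip the encoded delimiter
        }
    }
    parts ~= text[start .. $];
    return parts;
}

void main()
{
    auto a = splitUnit("a,\u00E9,c", ',');
    assert(a.length == 3 && a[0] == "a" && a[1] == "\u00E9" && a[2] == "c");

    auto b = splitPoint("a\u2192b\u2192c", '\u2192');
    assert(b.length == 3 && b[0] == "a" && b[1] == "b" && b[2] == "c");
}
```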
Jun 07 2006
parent Sean Kelly <sean f4.ca> writes:
Oskar Linde wrote:
 Sean Kelly wrote:
 
 Eventually however, I plan to add split, join, etc.  These will 
 probably all assume fixed-width elements, with improved support for 
 char and wchar strings in a std.string module, as supporting variable 
 width encoding will slow down the algorithms.

 It sounds reasonable to avoid any variable length awareness in std.array, but I don't really see how supporting that will make split or join any slower. For instance

 (char[]).split(char)
 (char[]).split(char[])
 (char[]).split(bool delegate(char))

 aren't affected by variable length encodings.

The most obvious performance issue with variable width encodings is with searching and matching routines. And most routines in std.array ultimately rely on searching and matching in some form. However, I wasn't going to go so far as to support type conversion for this stuff:

size_t find( char[] str, dchar elem );

which does help a bit.
 Only:
 
 (char[]).split(dchar)
 (char[]).split(bool delegate(dchar))
 
 are, (by using a dchar foreach over a char[]), but here, the user is 
 explicit about wanting a multi byte implementation. Putting the 
 implementation of the last two versions in std.string gives a neat 
 std.string/std.array separation, but risk confusing the user:
 
 - Why would "abc".split('a') be in std.array while "abc".split('') 
 requires std.string?

I had initially thought that std.utf.stride would be required to avoid false matches for search routines but have since been told otherwise, so there may be no reason for the specialized std.string functions I'd mentioned. I forgot about this bit while writing my last post :-)

Sean
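
The likely reasoning behind this is that UTF-8 lead bytes and continuation bytes occupy disjoint ranges, so a validly encoded code point can only match at a code point boundary and a plain code-unit search cannot produce a false hit. The sketch below illustrates that under the assumption of a present-day D compiler and std.utf.encode; findCodePoint is a hypothetical name, not the find signature quoted earlier.

```d
import std.utf : encode;

// Hypothetical search: offset (in code units) of the first occurrence
// of elem in str, or -1. Compares raw code units and never calls stride.
ptrdiff_t findCodePoint(const(char)[] str, dchar elem)
{
    char[4] buf;
    immutable len = encode(buf, elem);  // UTF-8 encode the needle
    const needle = buf[0 .. len];

    if (str.length < len)
        return -1;

    foreach (i; 0 .. str.length - len + 1)
        if (str[i .. i + len] == needle)
            return cast(ptrdiff_t) i;
    return -1;
}

void main()
{
    // Code units: 'a' (1), U+00E9 (2), 'b' (1), U+20AC (3).
    string s = "a\u00E9b\u20AC";

    assert(findCodePoint(s, 'b') == 3);
    assert(findCodePoint(s, '\u20AC') == 4);
    // U+00E8 shares its lead byte with U+00E9 but must not match.
    assert(findCodePoint(s, '\u00E8') == -1);
}
```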
Jun 07 2006
prev sibling next sibling parent Sean Kelly <sean f4.ca> writes:
pragma wrote:

 Ya know, I never really thought about this, but you're right: D has three character types yet only has full library support for one of them. If you ask me, there's only so many ways to go about this:

 1. Refactor std.string to use implicit templates
 2. Branch std.string into three modules, one for each char type
 3. Support all three char types via overloads within std.string

 Personally, I like #1 since it would be seamless to implement, and would require almost exactly as much code as is in use now. The only drawback here centers around problems with distributing template code in libraries.

And the fact that template overloading and implicit templates just aren't ready for this kind of use. But I believe this is ultimately the correct solution.

Sean
Jun 07 2006
prev sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
pragma wrote:

 Ya know, I never really thought about this, but you're right: D has three character types yet only has full library support for one of them. If you ask me, there's only so many ways to go about this:

 1. Refactor std.string to use implicit templates

In http://www.digitalmars.com/d/archives/digitalmars/D/35455.html and other posts, I suggested a rough specification and a proof of concept implementation of implicit array templates that replace many of the functions in std.string with generic versions. If there is a definite interest in taking this path, I will gladly write a full generic replacement for std.string.

Most of the functions in my earlier suggestion were aimed at a std.array module, and it is hard to draw a definite line between std.string and std.array. My current divider is something along the lines of: anything that only makes sense for text strings goes in std.string, the rest in std.array. One suggestion was to make the std.string functions aliases to the generic functions in std.array (for instance std.string.find -> std.array.find).
 2. Branch std.string into three modules, one for each char type
 3. Support all three char types via overloads within std.string

 Personally, I like #1 since it would be seamless to implement, and would
 require almost exactly as much code as is in use now.  The only drawback
 here centers around problems with distributing template code in libraries.

The template/library issues really need to be resolved.

/Oskar
Jun 07 2006