digitalmars.D - dchar unicode phobos


Johan Granberg <lijat.meREM OVEgmail.com> writes:

That D supports UTF is great, and by using dchar[] all Unicode code 
points can be used. But Phobos does not support dchar[]s adequately 
(or wchar[]s, for that matter). Wouldn't it be expected of the language's 
standard library to support all of the language's string encodings?

Proposal: add wchar[] and dchar[] versions of the string functions in Phobos.

(Should this be filed as a bug?)

Jun 07 2006

"Derek Parnell" <derek psych.ward> writes:

On Wed, 07 Jun 2006 22:40:00 +1000, Johan Granberg  
<lijat.meREM OVEgmail.com> wrote:

 That D supports UTF is great, and by using dchar[] all Unicode code  
 points can be used. But Phobos does not support dchar[]s adequately  
 (or wchar[]s, for that matter). Wouldn't it be expected of the language's  
 standard library to support all of the language's string encodings?

 Proposal: add wchar[] and dchar[] versions of the string functions in  
 Phobos.

 (Should this be filed as a bug?)

YES! I've had to recode many of them for dchar/wchar support.

-- 
Derek Parnell
Melbourne, Australia

Jun 07 2006

pragma <pragma_member pathlink.com> writes:

In article <e66hf1$otn$1 digitaldaemon.com>, Johan Granberg says...
That D supports UTF is great, and by using dchar[] all Unicode code 
points can be used. But Phobos does not support dchar[]s adequately 
(or wchar[]s, for that matter). Wouldn't it be expected of the language's 
standard library to support all of the language's string encodings?

Proposal: add wchar[] and dchar[] versions of the string functions in Phobos.

(Should this be filed as a bug?)

Ya know, I never really thought about this, but you're right: D has three
character types yet only has full library support for one of them.

If you ask me, there's only so many ways to go about this:

1. Refactor std.string to use implicit templates
2. Branch std.string into three modules, one for each char type
3. Support all three char types via overloads within std.string


[...] require almost exactly as much code as is in use now.  The only drawback here
centers around problems with distributing template code in libraries.

Also, do you personally need this kind of support in your project?  Have you
looked at Mango?

- EricAnderton at yahoo

Jun 07 2006

Johan Granberg <lijat.meREM OVEgmail.com> writes:

pragma wrote:
 Ya know, I never really thought about this, but you're right: D has three
 character types yet only has full library support for one of them.
 
 If you ask me, there's only so many ways to go about this:
 
 1. Refactor std.string to use implicit templates
 2. Branch std.string into three modules, one for each char type
 3. Support all three char types via overloads within std.string
 

 [...] require almost exactly as much code as is in use now.  The only drawback here
 centers around problems with distributing template code in libraries.
 
 Also, do you personally need this kind of support in your project?  Have you
 looked at Mango?
 
 - EricAnderton at yahoo

Yes, I have needed support for dchar[] in functions like split, 
splitlines and strip in std.string.
Yes, your ways of adding the support look OK; I would choose 3, though, 
instead of 1. It may be because I'm not 100% sure how 1 would 
work. (Care to give an example?)
No, I have not looked closely at Mango yet. (Will do.)

Jun 07 2006

pragma <pragma_member pathlink.com> writes:

In article <e66qqr$1er5$1 digitaldaemon.com>, Johan Granberg says...
Yes, I have needed support for dchar[] in functions like split, 
splitlines and strip in std.string.
Yes, your ways of adding the support look OK; I would choose 3, though, 
instead of 1. It may be because I'm not 100% sure how 1 would 
work. (Care to give an example?)

Sure.  D will now try to implicitly instantiate templates where it finds them.
So you can do this:

/**/ template trim(TChar){
/**/   TChar[] trim(TChar[] src){ /* ... */ }
/**/ }

..and the call to trim will still be as simple as the non-templated version:

/**/ dchar[] foo,bar;
/**/ foo = trim(bar);

So we get to have our cake and eat it too.  The onus is now placed on the
compiler, as it will generate a distinct version of each template as needed.

The astute observer will notice that any array type can be used as a parameter
in the above example.  Proper use of static if() and the 'is' operator can
easily ensure that only char, wchar and dchar are being used.  Template
overloads, while verbose, are another way to go.
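
A minimal sketch of that restriction, assuming a current D2 compiler (the shorthand function-template syntax is newer than the form shown above). The name `trim` follows the example; the body and the choice to strip only ASCII spaces and tabs are assumptions made for illustration, not Phobos code. Because D2 string literals are immutable, the sketch calls `.dup` to get the mutable arrays this thread is talking about.

```d
/// Strips leading and trailing ASCII spaces and tabs.
/// Hypothetical sketch, not the std.string implementation.
TChar[] trim(TChar)(TChar[] src)
{
    // Reject anything that is not one of the three character types.
    static if (!(is(TChar == char) || is(TChar == wchar) || is(TChar == dchar)))
        static assert(false, "trim expects char[], wchar[] or dchar[]");

    size_t lo = 0, hi = src.length;
    while (lo < hi && (src[lo] == ' ' || src[lo] == '\t')) ++lo;
    while (hi > lo && (src[hi - 1] == ' ' || src[hi - 1] == '\t')) --hi;
    return src[lo .. hi];
}

void main()
{
    dchar[] bar = "  hello\t"d.dup;
    dchar[] foo = trim(bar);        // TChar is deduced as dchar
    assert(foo == "hello"d);

    wchar[] w = " x "w.dup;
    assert(trim(w) == "x"w);        // a second instance is generated for wchar

    // int[] nums = [1, 2, 3];
    // trim(nums);                  // would be rejected by the static assert
}
```

The result is a slice of the input rather than a copy, which is a design choice of the sketch.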

- EricAnderton at yahoo

Jun 07 2006

Johan Granberg <lijat.meREM OVEgmail.com> writes:

pragma wrote:
 So we get to have our cake and eat it too.  The onus is now placed on the
 compiler, as it will generate a distinct version of each template as needed.
 
 The astute observer will notice that any array type can be used as a parameter
 in the above example.  Proper use of static if() and the 'is' operator can
 easily ensure that only char, wchar and dchar are being used.  Template
 overloads, while verbose, are another way to go.
 
 - EricAnderton at yahoo

OK, that was neat. (I have to look a bit more into templates.)
Are there any special cases that need to be handled?
Could UTF-32 have more possible symbols for line endings or whitespace than 
UTF-8, or anything like that? It could be handled with static if on a 
case-by-case basis, though.
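
On that question: UTF-8, UTF-16 and UTF-32 all represent the same set of code points, so the set of whitespace and line-break characters is the same in each. What differs is how many code units each character occupies. A minimal sketch, assuming a current D2 compiler and Phobos's `std.uni.isWhite`; `countWhite` is a hypothetical helper name.

```d
import std.uni : isWhite;

/// Counts Unicode whitespace code points in a string of any of the
/// three encodings. foreach with a dchar variable decodes as it goes.
size_t countWhite(S)(S s)
{
    size_t n;
    foreach (dchar c; s)
        if (isWhite(c))
            ++n;
    return n;
}

void main()
{
    // U+2028 LINE SEPARATOR is a non-ASCII line break.
    string  s8  = "a\u2028b c";
    wstring s16 = "a\u2028b c"w;
    dstring s32 = "a\u2028b c"d;

    // Same two whitespace code points in every encoding...
    assert(countWhite(s8) == 2);
    assert(countWhite(s16) == 2);
    assert(countWhite(s32) == 2);

    // ...but a different number of code units.
    assert(s8.length == 7);   // U+2028 takes three bytes in UTF-8
    assert(s16.length == 5);
    assert(s32.length == 5);
}
```

So a dchar[] version has no extra cases to handle; the char[] and wchar[] versions only have to cope with those characters being spread over several code units.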

Jun 07 2006

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

pragma wrote:
 Sure.  D will now try to implicitly instantiate templates where it finds them.
 So you can do this:
 
 /**/ template trim(TChar){
 /**/   TChar[] trim(TChar[] src){ /* ... */ }
 /**/ }

D is unfortunately not really that smart yet. You need exactly the same 
function argument types and in the same order as the template arguments.

template trim(MyString) {
	MyString trim(MyString src) { /* */ }
}

works.

 
 ..and the call to trim will still be as simple as the non-templated version:
 
 /**/ dchar[] foo,bar;
 /**/ foo = trim(bar);

and even bar.trim() will work.

 The astute observer will notice that any array type can be used as a parameter
 in the above example.  Proper use of static if() and the 'is' operator can
 easily ensure that only char, wchar and dchar are being used.  Template
 overloads, while verbose, are another way to go.

I don't really see any reason to limit string functions to char, wchar 
and dchar. Strings in other encodings (for instance Latin-1, i.e. ISO 8859-1) 
are readily encoded as ubyte[] or with a typedef'd type. It would be 
useful to be able to work with such strings too. I have myself several 
times cast Latin-1 strings into char[], just to be able to use one of the 
std.string functions on them before casting the result back into a ubyte[].
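
A rough sketch of what that could look like, assuming a current D2 compiler. `splitOn` is a hypothetical name, not an existing Phobos or Ares function; it is unconstrained on purpose, so the same template serves ubyte[] and dchar[] without any cast.

```d
/// Splits `src` at every element equal to `sep`.
/// Works on slices of any element type; the parts are slices of `src`.
T[][] splitOn(T)(T[] src, T sep)
{
    T[][] parts;
    size_t start = 0;
    foreach (i, e; src)
    {
        if (e == sep)
        {
            parts ~= src[start .. i];
            start = i + 1;
        }
    }
    parts ~= src[start .. $];
    return parts;
}

void main()
{
    // "café bar" in Latin-1: one byte per character.
    ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9, 0x20, 0x62, 0x61, 0x72];
    ubyte space = 0x20;
    auto words = splitOn(latin1, space);      // T deduced as ubyte
    ubyte[] first = [0x63, 0x61, 0x66, 0xE9];
    assert(words.length == 2);
    assert(words[0] == first);

    dchar[] text = "a,b,,c"d.dup;
    dchar comma = ',';
    auto fields = splitOn(text, comma);       // T deduced as dchar
    assert(fields.length == 4);
    assert(fields[2].length == 0);            // empty field between the commas
}
```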

/Oskar

Jun 07 2006

Sean Kelly <sean f4.ca> writes:

Johan Granberg wrote:
 Yes, I have needed support for dchar[] in functions like split, 
 splitlines and strip in std.string.
 Yes, your ways of adding the support look OK; I would choose 3, though, 
 instead of 1. It may be because I'm not 100% sure how 1 would 
 work. (Care to give an example?)
 No, I have not looked closely at Mango yet. (Will do.)

Oskar has an array template library that can do much of this, and I have 
the beginnings of one in Ares as well.  The source is here:

http://svn.dsource.org/projects/ares/trunk/src/ares/std/array.d

As you can see however, half the functions are commented out because 
template function overloading basically just doesn't work yet. 
Eventually however, I plan to add split, join, etc.  These will probably 
all assume fixed-width elements, with improved support for char and 
wchar strings in a std.string module, as supporting variable width 
encoding will slow down the algorithms.


Sean

Jun 07 2006

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Sean Kelly wrote:
 Oskar has an array template library that can do much of this, and I have 
 the beginnings of one in Ares as well.  The source is here:
 
 http://svn.dsource.org/projects/ares/trunk/src/ares/std/array.d
 
 As you can see however, half the functions are commented out because 
 template function overloading basically just doesn't work yet.

I agree that it would be really nice if those types of templates worked 
today, but all of those functions can be rewritten in a way that works 
with current D. Considering the amount of time it took us to get the 
current (most basic) IFTI support, I would rather use a solution that 
works today than wait an indefinite amount of time for something that 
may never happen. :) I fully appreciate your standpoint though, and 
would love to hear something from Walter regarding future IFTI support.

Some things that I would like to see improved (in descending order of 
importance) are IFTI support for:
1. template member functions
2. mixed explicit/implicit arguments: f!(int)('x') => f!(int,char)('x')
3. template specializations
4. better template function overloading
5. generic matching: template t(X) { void t(X[] a, X b) {}}
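
For reference, a sketch of what items 1, 2 and 5 look like in code. To my knowledge a current D2 compiler accepts all three, whereas the 2006 compiler discussed in this thread did not. `Buffer`, `convert` and `countOf` are hypothetical names invented for the illustration.

```d
// Item 1: a template member function whose argument type is deduced.
struct Buffer
{
    int[] data;
    void put(T)(T v) { data ~= cast(int) v; }
}

// Item 2: the first template argument is given explicitly,
// the second is deduced from the call.
To convert(To, From)(From x) { return cast(To) x; }

// Item 5: X is deduced from the element type of `a` and must agree with `b`.
size_t countOf(X)(X[] a, X b)
{
    size_t n;
    foreach (e; a)
        if (e == b)
            ++n;
    return n;
}

void main()
{
    Buffer buf;
    buf.put('a');                       // T deduced as char
    assert(buf.data == [97]);

    assert(convert!int('x') == 120);    // To = int explicit, From = char deduced

    int[] xs = [1, 2, 2, 3];
    assert(countOf(xs, 2) == 2);        // X deduced as int

    dchar[] s = "abca"d.dup;
    dchar a = 'a';
    assert(countOf(s, a) == 2);         // X deduced as dchar
}
```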

 Eventually however, I plan to add split, join, etc.  These will probably 
 all assume fixed-width elements, with improved support for char and 
 wchar strings in a std.string module, as supporting variable width 
 encoding will slow down the algorithms.

It sounds reasonable to avoid any variable length awareness in 
std.array, but I don't really see how supporting that will make split or 
join any slower. For instance

(char[]).split(char)
(char[]).split(char[])
(char[]).split(bool delegate(char))

aren't affected by variable-length encodings. Only:

(char[]).split(dchar)
(char[]).split(bool delegate(dchar))

are (by using a dchar foreach over a char[]), but here the user is 
explicit about wanting a multi-byte implementation. Putting the 
implementation of the last two versions in std.string gives a neat 
std.string/std.array separation, but risks confusing the user:

- Why would "abc".split('a') be in std.array while "abc".split('å') 
requires std.string?

Regards,

Oskar

Jun 07 2006

Sean Kelly <sean f4.ca> writes:

Oskar Linde wrote:
 Sean Kelly wrote:
 
 Eventually however, I plan to add split, join, etc.  These will 
 probably all assume fixed-width elements, with improved support for 
 char and wchar strings in a std.string module, as supporting variable 
 width encoding will slow down the algorithms.

 
 It sounds reasonable to avoid any variable length awareness in 
 std.array, but I don't really see how supporting that will make split or 
 join any slower. For instance
 
 (char[]).split(char)
 (char[]).split(char[])
 (char[]).split(bool delegate(char))
 
 Aren't affected by variable length encodings.

The most obvious performance issue with variable width encodings is with 
searching and matching routines.  And most routines in std.array 
ultimately rely on searching and matching in some form.  However, I 
wasn't going to go so far as to support type conversion for this stuff:

     size_t find( char[] str, dchar elem );

which does help a bit.

 Only:
 
 (char[]).split(dchar)
 (char[]).split(bool delegate(dchar))
 
 are (by using a dchar foreach over a char[]), but here the user is 
 explicit about wanting a multi-byte implementation. Putting the 
 implementation of the last two versions in std.string gives a neat 
 std.string/std.array separation, but risks confusing the user:
 
 - Why would "abc".split('a') be in std.array while "abc".split('å') 
 requires std.string?

I had initially thought that std.utf.stride would be required to avoid 
false matches for search routines but have since been told otherwise, so 
there may be no reason for the specialized std.string functions I'd 
mentioned.  I forgot about this bit while writing my last post :-)
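
A small sketch of why no stride logic is needed, assuming a current D2 compiler and `std.utf.encode`. In UTF-8 the lead bytes and continuation bytes come from disjoint ranges, so the encoded form of one character can only match at a real character boundary. `findChar` is a hypothetical name standing in for the `find( char[] str, dchar elem )` signature above.

```d
import std.utf : encode;

/// Returns the code-unit index of the first occurrence of `elem`
/// in the UTF-8 string `str`, or -1 if it does not occur.
ptrdiff_t findChar(const(char)[] str, dchar elem)
{
    char[4] buf;
    immutable n = encode(buf, elem);      // UTF-8 encoding of the needle
    const(char)[] needle = buf[0 .. n];

    if (str.length < n)
        return -1;
    // Plain code-unit comparison, no decoding of the haystack.
    foreach (i; 0 .. str.length - n + 1)
        if (str[i .. i + n] == needle)
            return cast(ptrdiff_t) i;
    return -1;
}

void main()
{
    string s = "na\u00EFve caf\u00E9";    // contains two 2-byte characters

    assert(findChar(s, 'v') == 4);        // index is in code units, not characters
    assert(findChar(s, '\u00E9') == 10);
    assert(findChar(s, 'z') == -1);

    // U+00A9 encodes as C2 A9 and U+00E9 as C3 A9: they share the
    // continuation byte A9, yet there is no false match.
    assert(findChar(s, '\u00A9') == -1);
}
```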


Sean

Jun 07 2006

Sean Kelly <sean f4.ca> writes:

pragma wrote:
 Ya know, I never really thought about this, but you're right: D has three
 character types yet only has full library support for one of them.
 
 If you ask me, there's only so many ways to go about this:
 
 1. Refactor std.string to use implicit templates
 2. Branch std.string into three modules, one for each char type
 3. Support all three char types via overloads within std.string
 

 [...] require almost exactly as much code as is in use now.  The only drawback here
 centers around problems with distributing template code in libraries.

And the fact that template overloading and implicit templates just 
aren't ready for this kind of use.  But I believe this is ultimately the 
correct solution.


Sean

Jun 07 2006

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

pragma wrote:
 Ya know, I never really thought about this, but you're right: D has three
 character types yet only has full library support for one of them.
 
 If you ask me, there's only so many ways to go about this:
 
 1. Refactor std.string to use implicit templates

In http://www.digitalmars.com/d/archives/digitalmars/D/35455.html and 
other posts, I suggested a rough specification and a proof of concept 
implementation of implicit array templates that replace many of the 
functions in std.string with generic versions. If there is a definite 
interest in taking this path, I will gladly write a full generic 
replacement for std.string.

Most of the functions in my earlier suggestion were aimed at a std.array 
module, and it is hard to draw a definite line between std.string and 
std.array. My current divider is something along the lines of: anything 
that only makes sense for text strings is in std.string, the rest in 
std.array. One suggestion was to make the std.string functions aliases of the 
generic functions in std.array (for instance std.string.find -> std.array.find).
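
The alias idea could be sketched roughly as follows, assuming a current D2 compiler. To keep the example in one file, `genericFind` is a hypothetical stand-in for a std.array-style function, and the alias plays the part of the std.string name; in a real two-module layout the alias would live in the other module.

```d
/// Hypothetical generic search: index of the first occurrence of
/// `needle` in `haystack`, or -1. Works for any element type.
ptrdiff_t genericFind(T)(const(T)[] haystack, const(T)[] needle)
{
    if (needle.length > haystack.length)
        return -1;
    foreach (i; 0 .. haystack.length - needle.length + 1)
        if (haystack[i .. i + needle.length] == needle)
            return cast(ptrdiff_t) i;
    return -1;
}

// The string-flavoured name is only an alias; no code is duplicated.
alias find = genericFind;

void main()
{
    assert(find("hello world", "world") == 6);   // char strings
    assert(find("hello"d, "xyz"d) == -1);        // dchar strings

    int[] nums = [1, 2, 3, 4];
    int[] sub = [3, 4];
    assert(genericFind(nums, sub) == 2);         // any other array type
}
```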

 2. Branch std.string into three modules, one for each char type
 3. Support all three char types via overloads within std.string


 [...] require almost exactly as much code as is in use now.  The only drawback here
 centers around problems with distributing template code in libraries.

The template/library issues really need to be resolved.

/Oskar

Jun 07 2006
