digitalmars.D - Rename std.ctype to std.ascii?

reply Jonathan M Davis <jmdavisProg gmx.com> writes:
std.ctype is modeled after C's ctype.h. It has functions for operating on 
characters - particularly functions which indicate the type of a character (I 
believe that ctype stands for character type, so that makes sense). For 
instance, isdigit will tell you whether a particular character is a digit. It 
only works on ASCII characters (non-ASCII characters return false for 
functions like isdigit and functions like toupper do nothing to non-ASCII 
characters).

std.uni, on the other hand, operates on characters just like std.ctype does, 
but it extends its charter to unicode characters (e.g. it has isUniUpper which 
_does_ work on unicode characters, unlike std.ctype's isupper).

The thing is that aside from those familiar with C/C++, most programmers are 
likely to find the module name ctype to be rather uninformative. If they're 
looking for something like isdigit, they're not terribly likely to go looking 
at std.ctype first. And I'm not sure that std.ascii will be all that much more 
obvious to them, but it fits in much better with std.uni. std.ascii gets the 
character functions which operate only on ASCII characters, and std.uni gets 
the character functions which operate on unicode characters in addition to 
ASCII characters.
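
To make the split concrete, here is a minimal sketch in D. It assumes the camelcased names found in current Phobos (std.ascii.isUpper, std.uni.isUpper and so on) rather than the isupper/isUniUpper names under discussion in this post, and the renamed imports ascii and uni are only there to keep the two modules apart.

```d
import ascii = std.ascii;
import uni = std.uni;

void main()
{
    // ASCII input: both modules agree.
    assert(ascii.isDigit('7'));
    assert(ascii.isUpper('A') && uni.isUpper('A'));
    assert(ascii.toUpper('a') == 'A');

    // Non-ASCII input: the ASCII-only functions answer false and leave
    // the character unchanged, while the Unicode-aware ones handle it.
    dchar eAcute = '\u00E9';        // é
    dchar capitalEAcute = '\u00C9'; // É
    assert(!ascii.isUpper(capitalEAcute));
    assert(uni.isUpper(capitalEAcute));
    assert(ascii.toUpper(eAcute) == eAcute);
    assert(uni.toUpper(eAcute) == capitalEAcute);
}
```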

I don't think that the change of module name is enough of an improvement to 
merit changing the name just because ctype is arguably bad. However, as it 
turns out, _no_ function in std.ctype is properly camelcased, and many of them 
return int instead of bool (which the C functions they're modeled after do but 
which is not particularly D-like and can cause problems when you actually _need_ 
them to return bool). And it has been made very clear in past discussions in 
this newsgroup that the consensus is that we prefer that Phobos functions 
follow Phobos' naming conventions (which means camelcasing) rather than 
matching the casing of functions in other languages. So, all of the functions 
in std.ctype need to be renamed.

I now have a pull request which creates properly camelcased versions of all of 
them ( https://github.com/D-Programming-Language/phobos/pull/101 ). The thing 
is though that because _every_ function in std.ctype is renamed, the cost of 
renaming the entire module (as far as people updating their code to use 
functions such as isDigit instead of isdigit goes) is essentially the same 
as just renaming the functions in-place. In either case, the old functions 
will go through the full deprecation process before they're actually gone, so 
no one's code will suddenly break because of the changes, but any code that 
uses the old functions will eventually have to be changed to use the properly 
named ones. And since the cost to making those changes is essentially the same 
whether we replace the whole std.ctype module or whether we replace all of its 
functions, I'm wondering whether it would be worthwhile to take this 
opportunity to rename std.ctype?
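
As a rough illustration of keeping an old name alive during a deprecation period, the sketch below uses the deprecated attribute with an alias as it works in current D. It is not the code from the pull request: the function body is a stand-in, and the whole thing is a hypothetical module written only for this example. Note also that the alias returns bool, whereas the old isdigit returned int.

```d
// Hypothetical module contents, for illustration only.

/// New, properly camelcased function returning bool.
bool isDigit(dchar c) @safe pure nothrow @nogc
{
    return '0' <= c && c <= '9';
}

/// Old name kept as a deprecated alias, so existing callers still
/// compile but get a deprecation message pointing at the new name.
deprecated("use isDigit instead")
alias isdigit = isDigit;

void main()
{
    assert(isDigit('7'));
    assert(!isDigit('x'));
    assert(isdigit('7')); // still compiles; the compiler reports the deprecation
}
```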

I don't think that the name change is enough of an improvement to do it if 
it's going to break everyone's code, but given that fixing all of its 
functions gives us a perfect opportunity to rename it at no additional cost, I 
feel that the question should be posed.

Should we rename std.ctype to std.ascii? Or should we just keep the old name, 
which is familiar to C programmers?

- Jonathan M Davis
Jun 13 2011
next sibling parent reply "Jouko Koski" <joukokoskispam101 netti.fi> writes:
"Jonathan M Davis" <jmdavisProg gmx.com> wrote:
 std.ctype is modeled after C's ctype.h. It has functions for operating on
 characters - particularly functions which indicate the type of a character 
 (I
 believe that ctype stands for character type, so that makes sense). For
 instance, isdigit will tell you whether a particular character is a digit. 
 It
 only works on ASCII characters (non-ASCII characters return false for
 functions like isdigit and functions like toupper do nothing to non-ASCII
 characters).

What is your definition for ASCII character?

Most of the <ctype.h> functions (or macros) are locale dependent, see setlocale() and <locale.h>. And there is the <wctype.h>, too.

While the C standardized ways of doing things might not be the most appropriate approach in the D domain, we must not base our design decisions on deficient analysis. "I just want this text uppercase" is one of the hardest things in the _world_. The problem is not just the header or package naming.

-- 
Jouko
Jun 13 2011
next sibling parent KennyTM~ <kennytm gmail.com> writes:
On Jun 14, 11 14:23, Jonathan M Davis wrote:
 On 2011-06-13 22:48, Jouko Koski wrote:
 "Jonathan M Davis"<jmdavisProg gmx.com>  wrote:
 std.ctype is modeled after C's ctype.h. It has functions for operating on
 characters - particularly functions which indicate the type of a
 character (I
 believe that ctype stands for character type, so that makes sense). For
 instance, isdigit will tell you whether a particular character is a
 digit. It
 only works on ASCII characters (non-ASCII characters return false for
 functions like isdigit and functions like toupper do nothing to non-ASCII
 characters).

What is your definition for ASCII character? Most of the<ctype.h> functions (or macros) are locale dependent, see setlocale() and<locale.h>. And there is the<wctype.h>, too. While the C standardized ways of doing things might not be most appropriate approach in D domain, we must not base our design decisions on deficient analysis. "I just want this text uppercase" is one of the hardest things in the _world_. The problem is not just the header or package naming.

??? std.ctype does _nothing_ with localization. And even if it did, that doesn't change what ASCII is. ASCII is made up of the values 0 through 127. And honestly, I have no clue how _those_ characters could be affected by locale. Extended-ASCII might be, but I wouldn't think that ASCII would be. Regardless, std.ctype does nothing with locale. - Jonathan M Davis

std.ctype does not, but <ctype.h> does. (which could be another reason it shouldn't be called std.ctype.)
Jun 14 2011
prev sibling parent reply David Nadlinger <see klickverbot.at> writes:
On 6/14/11 8:23 AM, Jonathan M Davis wrote:
 What is your definition for ASCII character?

 Most of the<ctype.h>  functions (or macros) are locale dependent, see
 setlocale() and<locale.h>. And there is the<wctype.h>, too.

 While the C standardized ways of doing things might not be most appropriate
 approach in D domain, we must not base our design decisions on deficient
 analysis. "I just want this text uppercase" is one of the hardest things in
 the _world_. The problem is not just the header or package naming.

??? std.ctype does _nothing_ with localization. And even if it did, that doesn't change what ASCII is. ASCII is made up of the values 0 through 127. And honestly, I have no clue how _those_ characters could be affected by locale. Extended-ASCII might be, but I wouldn't think that ASCII would be. Regardless, std.ctype does nothing with locale.

But the functions in <ctype.h> do. And there can be some locale-dependent problems even if you use only ASCII, the most prominent being the different handling of »i« in the Turkish locale: http://www.i18nguy.com/unicode/turkish-i18n.html This is probably another reason why it shouldn't be called std.ctype… David
Jun 14 2011
parent reply David Nadlinger <see klickverbot.at> writes:
On 6/14/11 11:20 AM, Jonathan M Davis wrote:
 On 2011-06-14 01:51, David Nadlinger wrote:
 But the functions in<ctype.h>  do. And there can be some
 locale-dependent problems even if you use only ASCII, the most prominent
 being the different handling of »i« in the Turkish locale:
 http://www.i18nguy.com/unicode/turkish-i18n.html

 This is probably another reason why it shouldn't be called std.ctype…

From the looks of it, that affects extended ASCII but not ASCII (since the Turkish uppercase I isn't even in ASCII). It's definitely a great link though. Thanks!

Oh, I was probably a bit unclear – what I meant is that it affects you also if you use only ASCII input, since toupper('i') == 221 when your locale is tr_TR.ISO-8859-9. David
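
For anyone who wants to try this, here is a small sketch that calls the C functions from D through core.stdc. The helper name upperInLocale is made up for this example. Whether the Turkish call prints 221 as reported above depends on the platform's locale data, and the locale may not be installed at all, in which case the helper returns -1.

```d
import core.stdc.ctype : toupper;
import core.stdc.locale : LC_CTYPE, setlocale;
import std.stdio : writeln;

/// Hypothetical helper: uppercases c with C's toupper under the named
/// locale, then switches LC_CTYPE back to "C".
/// Returns -1 if the locale cannot be selected.
int upperInLocale(const(char)* localeName, int c)
{
    if (setlocale(LC_CTYPE, localeName) is null)
        return -1; // locale not available on this system
    scope (exit) setlocale(LC_CTYPE, "C");
    return toupper(c);
}

void main()
{
    // In the "C" locale, 'i' maps to 'I' (73).
    writeln(upperInLocale("C", 'i'));

    // Result depends on the installed locale data; -1 if it is missing.
    writeln(upperInLocale("tr_TR.ISO-8859-9", 'i'));
}
```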
Jun 14 2011
parent reply Daniel Gibson <metalcaedes gmail.com> writes:
Am 14.06.2011 20:58, schrieb Andrej Mitrovic:
 Why does std.ctype exist anyway? Can't you use std.uni for both ASCII
 and UTF? Or is there some overhead in using the uni functions?

I haven't looked at either implementation, but on ASCII everything is really simple.. isalpha, isdigit, isupper and islower are just simple checks of whether the value is between two values, tolower(dchar c) is just return isupper(c) ? c+32 : c; etc. For Unicode this is most probably *much* harder (=> more expensive).

Cheers,
- Daniel
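
The range checks described here can be sketched as follows. The helper names are hypothetical and are not the Phobos implementations. The explicit cast is an addition: in current D, c + 32 is promoted to uint and does not convert back to dchar implicitly.

```d
/// Hypothetical ASCII-only helpers, written as plain range checks.
bool isAsciiUpper(dchar c) @safe pure nothrow @nogc
{
    return 'A' <= c && c <= 'Z';
}

bool isAsciiDigit(dchar c) @safe pure nothrow @nogc
{
    return '0' <= c && c <= '9';
}

dchar asciiToLower(dchar c) @safe pure nothrow @nogc
{
    // 'a' - 'A' == 32 in ASCII; anything outside A-Z is returned as-is.
    return isAsciiUpper(c) ? cast(dchar)(c + 32) : c;
}

void main()
{
    assert(isAsciiDigit('5') && !isAsciiDigit('x'));
    assert(asciiToLower('Q') == 'q');
    assert(asciiToLower('q') == 'q');
    assert(asciiToLower('\u00C9') == '\u00C9'); // non-ASCII is left alone
}
```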
Jun 14 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
Daniel Gibson wrote:
 Am 14.06.2011 20:58, schrieb Andrej Mitrovic:
 Why does std.ctype exist anyway? Can't you use std.uni for both ASCII
 and UTF? Or is there some overhead in using the uni functions?

I haven't looked at either implementation, but on ASCII everything is really simple.. isalpha, isdigit, isupper and islower are just simple checks of whether the value is between two values, tolower(dchar c) is just return isupper(c) ? c+32 : c; etc. For Unicode this is most probably *much* harder (=> more expensive). Cheers, - Daniel

The implementation of toUniLower shortcuts on ASCII characters. I don't expect it to be any slower if not for inlineability. And if somebody really needs the speed, I feel manually writing if('A' <= c && c <= 'Z') c+=32; (or similar) is just good enough. Timon
Jun 14 2011
parent Daniel Gibson <metalcaedes gmail.com> writes:
Am 14.06.2011 21:29, schrieb Timon Gehr:
 Daniel Gibson wrote:
 Am 14.06.2011 20:58, schrieb Andrej Mitrovic:
 Why does std.ctype exist anyway? Can't you use std.uni for both ASCII
 and UTF? Or is there some overhead in using the uni functions?

I haven't looked at either implementation, but on ASCII everything is really simple.. isalpha, isdigit, isupper and islower are just simple checks of whether the value is between two values, tolower(dchar c) is just return isupper(c) ? c+32 : c; etc. For Unicode this is most probably *much* harder (=> more expensive). Cheers, - Daniel

The implementation of toUniLower shortcuts on ASCII characters. I don't expect it to be any slower if not for inlineability. And if somebody really needs the speed, I feel manually writing if('A' <= c && c <= 'Z') c+=32; (or similar) is just good enough. Timon

OK. I just looked at the implementation and it seems like there are ASCII-shortcuts in all those unicode functions. So I agree with Andrej, std.ctype isn't really needed. Cheers, - Daniel
Jun 14 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-06-13 22:48, Jouko Koski wrote:
 "Jonathan M Davis" <jmdavisProg gmx.com> wrote:
 std.ctype is modeled after C's ctype.h. It has functions for operating on
 characters - particularly functions which indicate the type of a
 character (I
 believe that ctype stands for character type, so that makes sense). For
 instance, isdigit will tell you whether a particular character is a
 digit. It
 only works on ASCII characters (non-ASCII characters return false for
 functions like isdigit and functions like toupper do nothing to non-ASCII
 characters).

What is your definition for ASCII character? Most of the <ctype.h> functions (or macros) are locale dependent, see setlocale() and <locale.h>. And there is the <wctype.h>, too. While the C standardized ways of doing things might not be most appropriate approach in D domain, we must not base our design decisions on deficient analysis. "I just want this text uppercase" is one of the hardest things in the _world_. The problem is not just the header or package naming.

??? std.ctype does _nothing_ with localization. And even if it did, that doesn't change what ASCII is. ASCII is made up of the values 0 through 127. And honestly, I have no clue how _those_ characters could be affected by locale. Extended-ASCII might be, but I wouldn't think that ASCII would be. Regardless, std.ctype does nothing with locale. - Jonathan M Davis
Jun 13 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-06-14 01:51, David Nadlinger wrote:
 On 6/14/11 8:23 AM, Jonathan M Davis wrote:
 What is your definition for ASCII character?

 Most of the <ctype.h> functions (or macros) are locale dependent, see
 setlocale() and <locale.h>. And there is the <wctype.h>, too.

 While the C standardized ways of doing things might not be most
 appropriate approach in D domain, we must not base our design decisions
 on deficient analysis. "I just want this text uppercase" is one of the
 hardest things in the _world_. The problem is not just the header or
 package naming.

??? std.ctype does _nothing_ with localization. And even if it did, that doesn't change what ASCII is. ASCII is made up of the values 0 through 127. And honestly, I have no clue how _those_ characters could be affected by locale. Extended-ASCII might be, but I wouldn't think that ASCII would be. Regardless, std.ctype does nothing with locale.

But the functions in <ctype.h> do. And there can be some locale-dependent problems even if you use only ASCII, the most prominent being the different handling of »i« in the Turkish locale: http://www.i18nguy.com/unicode/turkish-i18n.html This is probably another reason why it shouldn't be called std.ctype…

From the looks of it, that affects extended ASCII but not ASCII (since the Turkish uppercase I isn't even in ASCII). It's definitely a great link though. Thanks!

It may be that we'll want to improve std.uni to deal with locales in some manner (either by providing new functions which handle them or altering the current ones to handle them), but std.ctype is pure ASCII. And while I don't see how locales can affect pure ASCII, ctype.h appears to actually deal with extended ASCII rather than just ASCII (where locales _do_ matter). So, all in all, std.ctype definitely has different behavior than ctype.h, which makes the name std.ctype that much worse.

So, given the arguably poor name of ctype and the fact that std.ctype does not actually match ctype.h's behavior, unless someone comes up with a really good reason not to fairly soon, I'm going to schedule std.ctype for deprecation and put the properly camelcased functions in std.ascii.

- Jonathan M Davis
Jun 14 2011
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-06-14 02:51, David Nadlinger wrote:
 On 6/14/11 11:20 AM, Jonathan M Davis wrote:
 On 2011-06-14 01:51, David Nadlinger wrote:
 But the functions in <ctype.h> do. And there can be some
 locale-dependent problems even if you use only ASCII, the most prominent
 being the different handling of »i« in the Turkish locale:
 http://www.i18nguy.com/unicode/turkish-i18n.html

 This is probably another reason why it shouldn't be called std.ctype…

 From the looks of it, that affects extended ASCII but not ASCII (since the
 Turkish uppercase I isn't even in ASCII). It's definitely a great link
 though. Thanks!

 Oh, I was probably a bit unclear – what I meant is that it affects you
 also if you use only ASCII input, since toupper('i') == 221 when your
 locale is tr_TR.ISO-8859-9.

Yes, but the result is extended ASCII, so it doesn't affect anything which only deals with pure ASCII. ctype.h deals with extended ASCII, so locales actually affect what it's doing. std.ctype only deals in pure ASCII, so it wouldn't do anything which would result in a non-ASCII character, and so locales shouldn't matter at all. However, if you _do_ want to bring locales into it, then a locale like tr_TR.ISO-8859-9 is not going to be able to operate purely in ASCII, since the uppercase value of i is 221, which is extended ASCII.

So, yes I understood. It's just that as far as I can tell, locales don't matter if you're completely restricting yourself to ASCII like std.ctype does. And std.ctype is not going to try and deal with locales at this point (and likely not ever). I think that that is far better left to unicode. The Turkish locale is a great example of why you _want_ to be dealing with unicode when dealing with locales. std.ctype is for when you're specifically restricting yourself to ASCII (which sometimes can be very useful - e.g. with formatting strings or regex strings where all of the special characters are ASCII; using unicode functions would just make them slower at no benefit and would risk changing behavior based on locale if you brought locales into it). If you're not restricting yourself to ASCII, then std.uni is the way to go.

- Jonathan M Davis
Jun 14 2011
parent "Jouko Koski" <joukokoskispam101 netti.fi> writes:
"Jonathan M Davis" <jmdavisProg gmx.com> wrote:

 So, yes I understood. It's just that as far as I can tell, locales don't
 matter if you're completely restricting yourself to ASCII like std.ctype
 does.

I would not consider it a good idea to include this kind of ascii-only utilities in the standard-ish library. It might be best to rename the module to std.ascii_for_insular_yankees_others_keep_away so that nobody would use it by accident. This way the name would also remind us about the historical terms which were used a quarter of a century ago when ascii-only <ctype.h> utilities were first suggested to the international C standardization committee.

-- 
Jouko
Jun 14 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Why does std.ctype exist anyway? Can't you use std.uni for both ASCII
and UTF? Or is there some overhead in using the uni functions?
Jun 14 2011
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-06-14 11:53, Jouko Koski wrote:
 "Jonathan M Davis" <jmdavisProg gmx.com> wrote:
 So, yes I understood. It's just that as far as I can tell, locales don't

matter if you're completely restricting yourself to ASCII like std.ctype does. I would not consider it a good idea to include this kind of ascii-only utilities in the standard-ish library. It might be best to rename the module to std.ascii_for_insular_yankees_others_keep_away so that nobody would use it by accident. This way the name would also remind us about the historical terms which were used a quarter of a century ago when ascii-only <ctype.h> utilities were first suggested to the international C standardization committee.

For some classes of operations, it makes perfect sense to be checking for ASCII characters only. For others, it's just people not worrying about internationalization like they should be.

For instance, format strings don't care about unicode as far as their escape sequences go. %a, %d, etc. are all pure ASCII. So, worrying about unicode with them just wouldn't make sense. In most cases, isDigit working on the arabic numerals 0 through 9 is _exactly_ what people want and need. But if you were to try and make it more unicode-friendly, would Greek or Chinese numbers count as digits? Maybe, maybe not. It gets much more complicated. In some cases, all you care about with isUpper or toUpper is ASCII. In others, you want it to deal with unicode (and probably locales as well) properly.

std.ctype/std.ascii deals with ASCII for those situations where you really do only care about ASCII. It accepts unicode characters, but every function which returns a bool returns false for them, and it never tries to change their case. std.uni actually deals with unicode and worries about things like whether a unicode character is uppercase or not. They're for two different use cases.

Most of Phobos should be dealing with unicode (e.g. pretty much everything in std.string should be using the std.uni functions rather than the std.ascii functions if there's a function which is in both), but there are cases where unicode doesn't matter, and you might as well have the efficiency available of just dealing with ASCII. Ultimately, it's up to the programmer to do the right thing.

- Jonathan M Davis
Jun 14 2011
parent "Jouko Koski" <joukokoskispam101 netti.fi> writes:
"Jonathan M Davis" <jmdavisProg gmx.com> wrote:
 On 2011-06-14 11:53, Jouko Koski wrote:

 I would not consider it being good idea to include this kind of 
 ascii-only
 utilities in the standard-ish library.


 For some classes of operations, it makes perfect sense to be checking for
 ASCII characters only. For others, it's just people not worrying about
 internationalization like they should be. For instance, format strings 
 don't
 care about unicode as far as their escape sequences go. %a, %d, etc. are 
 all
 pure ASCII.

Do we really need a common library utility for such a bounded domain? I would vote dropping ascii-only std.ctype altogether. Those who know and ensure that they are dealing with ascii-only, ebcdic-only or whatever-only representations can easily write their own utilities to their particular domains - maybe even better optimized than std.ctype because the domain may be even more restricted. A common use ascii-only utility will be used inevitably in places where it shouldn't.
 std.ctype/std.ascii deals with ASCII for those situations where you really 
 do
 only care about ASCII. It deals with unicode characters, but it returns 
 false
 for everything with them which returns a bool, and it never tries to 
 change
 their case. std.uni actually deals with unicode and worries about things 
 like
 whether a unicode character is uppercase or not.

That is what <ctype.h> (or <wctype.h>) utilities do when the default locale setting is in effect. Some other posters seem to suggest that a more generalized library module does this, too, without losing performance. -- Jouko
Jun 16 2011
prev sibling next sibling parent "Regan Heath" <regan netmail.co.nz> writes:
On Tue, 14 Jun 2011 10:20:48 +0100, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:

 So, given the arguably poor name of ctype and the fact that std.ctype  
 does not actually match ctype.h's behavior, unless someone comes up with  
 a really good reason not to fairly soon, I'm going to schedule std.ctype  
 for deprecation and put the properly camelcased functions in std.ascii.

I reckon this is the best option.
Jun 16 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-06-16 12:51, Jouko Koski wrote:
 "Jonathan M Davis" <jmdavisProg gmx.com> wrote:
 On 2011-06-14 11:53, Jouko Koski wrote:
 I would not consider it being good idea to include this kind of
 ascii-only
 utilities in the standard-ish library.

For some classes of operations, it makes perfect sense to be checking for ASCII characters only. For others, it's just people not worrying about internationalization like they should be. For instance, format strings don't care about unicode as far as their escape sequences go. %a, %d, etc. are all pure ASCII.

Do we really need a common library utility for such a bounded domain? I would vote dropping ascii-only std.ctype altogether. Those who know and ensure that they are dealing with ascii-only, ebcdic-only or whatever-only representations can easily write their own utilities to their particular domains - maybe even better optimized than std.ctype because the domain may be even more restricted. A common use ascii-only utility will be used inevitably in places where it shouldn't.
 std.ctype/std.ascii deals with ASCII for those situations where you
 really do
 only care about ASCII. It deals with unicode characters, but it returns
 false
 for everything with them which returns a bool, and it never tries to
 change
 their case. std.uni actually deals with unicode and worries about things
 like
 whether a unicode character is uppercase or not.

That is what <ctype.h> (or <wctype.h>) utilities do when the default locale setting is in effect. Some other posters seem to suggest that a more generalized library module does this, too, without losing performance.

You actually do get a performance loss for a number of functions. They do tend to shortcut on ASCII in many cases, but they tend to become too large to be inlined, and if all you care about is ASCII, even if there are unicode characters in the string (which is common enough in domains that have nothing to do with English - e.g. regular expressions), you take a performance hit for all characters which aren't ASCII.

There are also a number of functions which arguably don't make much sense to try and turn into unicode functions (e.g. isDigit) but are heavily used. Another fun one is isWhite vs isUniWhite. In most cases, you _don't_ care about unicode whitespace, and it is definitely more expensive to call isUniWhite than isWhite, because there are a _lot_ of extraneous whitespace characters in unicode.

std.ctype/std.ascii is _not_ going away. Too many people find those functions to be useful. I grant you that too many programmers don't worry about unicode when they should, but there are so many issues surrounding the proper handling of unicode that programmers aren't going to get it right unless they're actually trying to get it right. D provides a lot of the tools to make unicode mostly work correctly out of the box, but it's still complicated enough that you can't expect it to "just work" without programmers having some clue of what they're doing. And forcing people to come up with their own functions for basic ASCII operations (which pretty much every other programming language has) isn't going to help any.

- Jonathan M Davis
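
A short sketch of the isDigit and whitespace points, assuming the names in current Phobos: std.uni.isWhite plays the role of the isUniWhite mentioned here, and std.uni.isNumber is used as the nearest Unicode counterpart to isDigit. The two code points were picked only as examples.

```d
import ascii = std.ascii;
import uni = std.uni;

void main()
{
    dchar emSpace = '\u2003';          // EM SPACE, a Unicode space separator
    dchar arabicIndicThree = '\u0663'; // ARABIC-INDIC DIGIT THREE

    // ASCII whitespace is recognised by both modules.
    assert(ascii.isWhite(' ') && ascii.isWhite('\t'));
    assert(uni.isWhite(' '));

    // Unicode-only whitespace is seen by std.uni but not by std.ascii.
    assert(!ascii.isWhite(emSpace));
    assert(uni.isWhite(emSpace));

    // isDigit stays with 0 through 9; isNumber also accepts other scripts.
    assert(ascii.isDigit('3'));
    assert(!ascii.isDigit(arabicIndicThree));
    assert(uni.isNumber(arabicIndicThree));
}
```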
Jun 16 2011