digitalmars.D - Rename std.ctype to std.ascii?
- Jonathan M Davis (48/48) Jun 13 2011 std.ctype is modeled after C's ctype.h. It has functions for operating o...
- Jouko Koski (10/19) Jun 13 2011 What is your definition for ASCII character?
- Jonathan M Davis (7/27) Jun 13 2011 ??? std.ctype does _nothing_ with localization. And even if it did, that...
- KennyTM~ (3/30) Jun 14 2011 std.ctype does not, but <ctype.h> does. (which could be another reason
- David Nadlinger (7/21) Jun 14 2011 But the functions in <ctype.h> do. And there can be some
- Jonathan M Davis (27/51) Jun 14 2011 …
- David Nadlinger (5/15) Jun 14 2011 Oh, I was probably a bit unclear – what I meant is that it affects you...
- Jonathan M Davis (36/54) Jun 14 2011 …
- Jouko Koski (12/13) Jun 14 2011 matter if you're completely restricting yourself to ASCII like std.ctype...
- Andrej Mitrovic (2/2) Jun 14 2011 Why does std.ctype exist anyway? Can't you use std.uni for both ASCII
- Daniel Gibson (8/10) Jun 14 2011 I haven't looked at either implementation, but on ASCII everything is
- Timon Gehr (6/16) Jun 14 2011 The implementation of toUniLower shortcuts on ASCII characters. I don't ...
- Daniel Gibson (6/27) Jun 14 2011 OK. I just looked at the implementation and it seems like there are
- Jonathan M Davis (24/37) Jun 14 2011 For some classes of operations, it makes perfect sense to be checking fo...
- Jouko Koski (13/33) Jun 16 2011 Do we really need a common library utility for such a bounded domain? I
- Jonathan M Davis (23/58) Jun 16 2011 You actually do get a performance loss for a number of functions. They d...
- Regan Heath (5/9) Jun 16 2011 I reckon this is the best option.
std.ctype is modeled after C's ctype.h. It has functions for operating on characters - particularly functions which indicate the type of a character (I believe that ctype stands for character type, so that makes sense). For instance, isdigit will tell you whether a particular character is a digit. It only works on ASCII characters (non-ASCII characters return false for functions like isdigit and functions like toupper do nothing to non-ASCII characters).

std.uni, on the other hand, operates on characters just like std.ctype does, but it extends its charter to unicode characters (e.g. it has isUniUpper which _does_ work on unicode characters, unlike std.ctype's isupper).

The thing is that aside from those familiar with C/C++, most programmers are likely to find the module name ctype to be rather uninformative. If they're looking for something like isdigit, they're not terribly likely to go looking at std.ctype first. And I'm not sure that std.ascii will be all that much more obvious to them, but it fits in much better with std.uni. std.ascii gets the character functions which operate only on ASCII characters, and std.uni gets the character functions which operate on unicode characters in addition to ASCII characters.

I don't think that the change of module name is enough of an improvement to merit changing the name just because ctype is arguably bad. However, as it turns out, _no_ function in std.ctype is properly camelcased, and many of them return int instead of bool (which the C functions they're modeled after do but which is not particularly D-like and can cause problems when you actually _need_ them to return bool). And it has been made very clear in past discussions in this newsgroup that the consensus is that we prefer that Phobos functions follow Phobos' naming conventions (which means camelcasing) rather than matching the casing of functions in other languages. So, all of the functions in std.ctype need to be renamed.

I now have a pull request which creates properly camelcased versions of all of them ( https://github.com/D-Programming-Language/phobos/pull/101 ). The thing is though that because _every_ function in std.ctype is renamed, the cost of renaming the entire module (as far as people updating their code to use functions such as isDigit instead of isdigit goes) is essentially the same as just renaming the functions in-place. In either case, the old functions will go through the full deprecation process before they're actually gone, so no one's code will suddenly break because of the changes, but any code that uses the old functions will eventually have to be changed to use the properly named ones.

And since the cost of making those changes is essentially the same whether we replace the whole std.ctype module or whether we replace all of its functions, I'm wondering whether it would be worthwhile to take this opportunity to rename std.ctype? I don't think that the name change is enough of an improvement to do it if it's going to break everyone's code, but given that fixing all of its functions gives us a perfect opportunity to rename it at no additional cost, I feel that the question should be posed.

Should we rename std.ctype to std.ascii? Or should we just keep the old name, which is familiar to C programmers?

- Jonathan M Davis
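A minimal sketch of the int-versus-bool point raised in the post above. The function names here are hypothetical stand-ins written for illustration; they are not the std.ctype functions or the ones in the pull request. The sketch only assumes that a C-style predicate returns some nonzero int for "true".

```d
// Hypothetical names for illustration; these are not the Phobos functions.

// C-style predicate: returns a nonzero int for "true", as ctype.h functions do.
int isdigitCStyle(dchar c)
{
    return (c >= '0' && c <= '9') ? 4 : 0;
}

// D-style predicate: camelCased name and a real bool result.
bool isDigitDStyle(dchar c)
{
    return c >= '0' && c <= '9';
}

unittest
{
    // bool viaInt = isdigitCStyle('7');   // rejected: int does not implicitly convert to bool
    bool viaInt = isdigitCStyle('7') != 0; // an explicit comparison is needed
    bool viaBool = isDigitDStyle('7');     // usable directly wherever a bool is required
    assert(viaInt && viaBool);
    assert(!isDigitDStyle('x'));
}
```

The unittest block can be run with something like `dmd -unittest -main`; the commented-out line shows the kind of call site that an int-returning predicate makes awkward.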
Jun 13 2011
"Jonathan M Davis" <jmdavisProg gmx.com> wrote:
> std.ctype is modeled after C's ctype.h. It has functions for operating on characters - particularly functions which indicate the type of a character (I believe that ctype stands for character type, so that makes sense). For instance, isdigit will tell you whether a particular character is a digit. It only works on ASCII characters (non-ASCII characters return false for functions like isdigit and functions like toupper do nothing to non-ASCII characters).

What is your definition for ASCII character?

Most of the <ctype.h> functions (or macros) are locale dependent, see setlocale() and <locale.h>. And there is the <wctype.h>, too.

While the C standardized ways of doing things might not be the most appropriate approach in the D domain, we must not base our design decisions on deficient analysis. "I just want this text uppercase" is one of the hardest things in the _world_. The problem is not just the header or package naming.

-- Jouko
Jun 13 2011
On 2011-06-13 22:48, Jouko Koski wrote:
> What is your definition for ASCII character?
> Most of the <ctype.h> functions (or macros) are locale dependent, see setlocale() and <locale.h>. And there is the <wctype.h>, too.
> While the C standardized ways of doing things might not be the most appropriate approach in the D domain, we must not base our design decisions on deficient analysis. "I just want this text uppercase" is one of the hardest things in the _world_. The problem is not just the header or package naming.

??? std.ctype does _nothing_ with localization. And even if it did, that doesn't change what ASCII is. ASCII is made up of the values 0 through 127. And honestly, I have no clue how _those_ characters could be affected by locale. Extended-ASCII might be, but I wouldn't think that ASCII would be. Regardless, std.ctype does nothing with locale.

- Jonathan M Davis
Jun 13 2011
On Jun 14, 11 14:23, Jonathan M Davis wrote:
> ??? std.ctype does _nothing_ with localization. And even if it did, that doesn't change what ASCII is. ASCII is made up of the values 0 through 127. And honestly, I have no clue how _those_ characters could be affected by locale. Extended-ASCII might be, but I wouldn't think that ASCII would be. Regardless, std.ctype does nothing with locale.

std.ctype does not, but <ctype.h> does. (which could be another reason it shouldn't be called std.ctype.)
Jun 14 2011
On 6/14/11 8:23 AM, Jonathan M Davis wrote:
> ??? std.ctype does _nothing_ with localization. And even if it did, that doesn't change what ASCII is. ASCII is made up of the values 0 through 127. And honestly, I have no clue how _those_ characters could be affected by locale. Extended-ASCII might be, but I wouldn't think that ASCII would be. Regardless, std.ctype does nothing with locale.

But the functions in <ctype.h> do. And there can be some locale-dependent problems even if you use only ASCII, the most prominent being the different handling of »i« in the Turkish locale: http://www.i18nguy.com/unicode/turkish-i18n.html

This is probably another reason why it shouldn't be called std.ctype…

David
Jun 14 2011
On 2011-06-14 01:51, David Nadlinger wrote:
> But the functions in <ctype.h> do. And there can be some locale-dependent problems even if you use only ASCII, the most prominent being the different handling of »i« in the Turkish locale: http://www.i18nguy.com/unicode/turkish-i18n.html
> This is probably another reason why it shouldn't be called std.ctype…

From the looks of it, that affects extended ASCII but not ASCII (since the Turkish uppercase I isn't even in ASCII). It's definitely a great link though. Thanks!

It may be that we'll want to improve std.uni to deal with locales in some manner (either by providing new functions which handle them or altering the current ones to handle them), but std.ctype is pure ASCII. And while I don't see how locales can affect pure ASCII, ctype.h appears to actually deal with extended ASCII rather than just ASCII (where locales _do_ matter). So, all in all, std.ctype definitely has different behavior than ctype.h, which makes the name std.ctype that much worse.

So, given the arguably poor name of ctype and the fact that std.ctype does not actually match ctype.h's behavior, unless someone comes up with a really good reason not to fairly soon, I'm going to schedule std.ctype for deprecation and put the properly camelcased functions in std.ascii.

- Jonathan M Davis
Jun 14 2011
On 6/14/11 11:20 AM, Jonathan M Davis wrote:
> From the looks of it, that affects extended ASCII but not ASCII (since the Turkish uppercase I isn't even in ASCII). It's definitely a great link though. Thanks!

Oh, I was probably a bit unclear – what I meant is that it affects you also if you use only ASCII input, since toupper('i') == 221 when your locale is tr_TR.ISO-8859-9.

David
Jun 14 2011
On 2011-06-14 02:51, David Nadlinger wrote:
> Oh, I was probably a bit unclear – what I meant is that it affects you also if you use only ASCII input, since toupper('i') == 221 when your locale is tr_TR.ISO-8859-9.

Yes, but the result is extended ASCII, so it doesn't affect anything which only deals with pure ASCII. ctype.h deals with extended ASCII, so locales actually affect what it's doing. std.ctype only deals in pure ASCII, so it wouldn't do anything which would result in a non-ASCII character, and so locales shouldn't matter at all. However, if you _do_ want to bring locales into it, then a locale like tr_TR.ISO-8859-9 is not going to be able to operate purely in ASCII, since the uppercase value of i is 221, which is extended ASCII.

So, yes I understood. It's just that as far as I can tell, locales don't matter if you're completely restricting yourself to ASCII like std.ctype does. And std.ctype is not going to try and deal with locales at this point (and likely not ever). I think that that is far better left to unicode. The Turkish locale is a great example of why you _want_ to be dealing with unicode when dealing with locales. std.ctype is for when you're specifically restricting yourself to ASCII (which sometimes can be very useful - e.g. with formatting strings or regex strings where all of the special characters are ASCII; using unicode functions would just make them slower at no benefit and would risk changing behavior based on locale if you brought locales into it). If you're not restricting yourself to ASCII, then std.uni is the way to go.

- Jonathan M Davis
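The argument above can be sketched in a few lines of D. The helper names are hypothetical and written only for this illustration (they are not the std.ctype implementation); the value 221 is the one quoted in the thread for toupper('i') under tr_TR.ISO-8859-9. The sketch shows that an uppercasing routine restricted to the range 0 through 127 never produces or alters a value outside that range.

```d
// Hypothetical helpers for illustration; not the std.ctype implementation.

// ASCII is the range 0 through 127, as stated in the thread.
bool isAsciiChar(dchar c)
{
    return c <= 0x7F;
}

// Uppercases only 'a'..'z'; every other value is returned unchanged.
dchar asciiToUpper(dchar c)
{
    return (c >= 'a' && c <= 'z') ? cast(dchar)(c - 32) : c;
}

unittest
{
    assert(asciiToUpper('i') == 'I');             // result stays inside ASCII
    assert(isAsciiChar(asciiToUpper('i')));
    assert(!isAsciiChar(cast(dchar) 221));        // 221 lies outside 0..127
    assert(asciiToUpper(cast(dchar) 221) == 221); // non-ASCII input passes through untouched
}
```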
Jun 14 2011
"Jonathan M Davis" <jmdavisProg gmx.com> wrote:
> So, yes I understood. It's just that as far as I can tell, locales don't matter if you're completely restricting yourself to ASCII like std.ctype does.

I would not consider it a good idea to include this kind of ascii-only utilities in the standard-ish library. It might be best to rename the module to std.ascii_for_insular_yankees_others_keep_away so that nobody would use it by accident. This way the name would also remind us about the historical terms which were used a quarter of a century ago when ascii-only <ctype.h> utilities were first suggested to the international C standardization committee.

-- Jouko
Jun 14 2011
Why does std.ctype exist anyway? Can't you use std.uni for both ASCII and UTF? Or is there some overhead in using the uni functions?
Jun 14 2011
On 14.06.2011 20:58, Andrej Mitrovic wrote:
> Why does std.ctype exist anyway? Can't you use std.uni for both ASCII and UTF? Or is there some overhead in using the uni functions?

I haven't looked at either implementation, but on ASCII everything is really simple: isalpha, isdigit, isupper and islower are just simple checks of whether the value is between two values, tolower(dchar c) is just return isupper(c) ? c+32 : c; etc. For Unicode this is most probably *much* harder (=> more expensive).

Cheers,
- Daniel
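The range checks described above can be sketched as follows. This is a hedged illustration with hypothetical function names, not the actual std.ctype or std.uni source; it only assumes the ASCII layout, in which the upper- and lowercase letters are 32 code points apart.

```d
// Hypothetical names for illustration; not the Phobos source.

bool asciiIsUpper(dchar c) { return c >= 'A' && c <= 'Z'; }
bool asciiIsLower(dchar c) { return c >= 'a' && c <= 'z'; }
bool asciiIsDigit(dchar c) { return c >= '0' && c <= '9'; }
bool asciiIsAlpha(dchar c) { return asciiIsUpper(c) || asciiIsLower(c); }

// 'a' - 'A' == 32 in ASCII, so lowercasing is a single addition.
dchar asciiToLower(dchar c)
{
    return asciiIsUpper(c) ? cast(dchar)(c + 32) : c;
}

unittest
{
    assert(asciiIsAlpha('q') && asciiIsAlpha('Q'));
    assert(asciiIsDigit('0') && !asciiIsDigit('a'));
    assert(asciiToLower('Q') == 'q');
    assert(asciiToLower('7') == '7');           // non-letters are unchanged
    assert(asciiToLower('\u00C9') == '\u00C9'); // non-ASCII letters are unchanged
}
```

Each predicate is two comparisons, which is why such functions are cheap and easy for a compiler to inline.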
Jun 14 2011
Daniel Gibson wrote:
> I haven't looked at either implementation, but on ASCII everything is really simple.. isalpha, isdigit, isupper and islower are just a simple checks if the value is between two values, tolower(dchar c) is just return isupper(c) ? c+32 : c; etc. For Unicode this is most probably *much* harder (=> more expensive).

The implementation of toUniLower shortcuts on ASCII characters. I don't expect it to be any slower if not for inlineability. And if somebody really needs the speed, I feel manually writing if('A' <= c && c <= 'Z') c+=32; (or similar) is just good enough.

Timon
Jun 14 2011
On 14.06.2011 21:29, Timon Gehr wrote:
> The implementation of toUniLower shortcuts on ASCII characters. I don't expect it to be any slower if not for inlineability. And if somebody really needs the speed, I feel manually writing if('A' <= c && c <= 'Z') c+=32; (or similar) is just good enough.

OK. I just looked at the implementation and it seems like there are ASCII-shortcuts in all those unicode functions. So I agree with Andrej, std.ctype isn't really needed.

Cheers,
- Daniel
Jun 14 2011
On 2011-06-14 11:53, Jouko Koski wrote:
> I would not consider it a good idea to include this kind of ascii-only utilities in the standard-ish library. It might be best to rename the module to std.ascii_for_insular_yankees_others_keep_away so that nobody would use it by accident. This way the name would also remind us about the historical terms which were used a quarter of a century ago when ascii-only <ctype.h> utilities were first suggested to the international C standardization committee.

For some classes of operations, it makes perfect sense to be checking for ASCII characters only. For others, it's just people not worrying about internationalization like they should be. For instance, format strings don't care about unicode as far as their escape sequences go. %a, %d, etc. are all pure ASCII. So, worrying about unicode with them just wouldn't make sense.

In most cases, isDigit working on the arabic numerals 0 through 9 is _exactly_ what people want and need. But if you were to try and make it more unicode-friendly, would Greek or Chinese numbers count as digits? Maybe, maybe not. It gets much more complicated. In some cases, all you care about with isUpper or toUpper is ASCII. In others, you want it to deal with unicode (and probably locales as well) properly.

std.ctype/std.ascii deals with ASCII for those situations where you really do only care about ASCII. It deals with unicode characters, but it returns false for everything with them which returns a bool, and it never tries to change their case. std.uni actually deals with unicode and worries about things like whether a unicode character is uppercase or not. They're for two different use cases.

Most of Phobos should be dealing with unicode (e.g. pretty much everything in std.string should be using the std.uni functions rather than the std.ascii functions if there's a function which is in both), but there are cases where unicode doesn't matter, and you might as well have the efficiency available of just dealing with ASCII. Ultimately, it's up to the programmer to do the right thing.

- Jonathan M Davis
Jun 14 2011
"Jonathan M Davis" <jmdavisProg gmx.com> wrote:
> On 2011-06-14 11:53, Jouko Koski wrote:
>> I would not consider it a good idea to include this kind of ascii-only utilities in the standard-ish library.
> For some classes of operations, it makes perfect sense to be checking for ASCII characters only. For others, it's just people not worrying about internationalization like they should be. For instance, format strings don't care about unicode as far as their escape sequences go. %a, %d, etc. are all pure ASCII.

Do we really need a common library utility for such a bounded domain? I would vote for dropping ascii-only std.ctype altogether. Those who know and ensure that they are dealing with ascii-only, ebcdic-only or whatever-only representations can easily write their own utilities for their particular domains - maybe even better optimized than std.ctype because the domain may be even more restricted. A common-use ascii-only utility will inevitably be used in places where it shouldn't.

> std.ctype/std.ascii deals with ASCII for those situations where you really do only care about ASCII. It deals with unicode characters, but it returns false for everything with them which returns a bool, and it never tries to change their case. std.uni actually deals with unicode and worries about things like whether a unicode character is uppercase or not.

That is what <ctype.h> (or <wctype.h>) utilities do when the default locale setting is in effect. Some other posters seem to suggest that a more generalized library module does this, too, without losing performance.

-- Jouko
Jun 16 2011
On 2011-06-16 12:51, Jouko Koski wrote:
> Do we really need a common library utility for such a bounded domain? I would vote for dropping ascii-only std.ctype altogether. Those who know and ensure that they are dealing with ascii-only, ebcdic-only or whatever-only representations can easily write their own utilities for their particular domains - maybe even better optimized than std.ctype because the domain may be even more restricted. A common-use ascii-only utility will inevitably be used in places where it shouldn't.
> That is what <ctype.h> (or <wctype.h>) utilities do when the default locale setting is in effect. Some other posters seem to suggest that a more generalized library module does this, too, without losing performance.

You actually do get a performance loss for a number of functions. They do tend to shortcut on ASCII in many cases, but they tend to become too large to be inlined, and if all you care about is ASCII, even if there are unicode characters in the string (which is common enough in domains that have nothing to do with English - e.g. regular expressions), you take a performance hit for all characters which aren't ASCII. There are also a number of functions which arguably don't make much sense to try and turn into unicode functions (e.g. isDigit) but are heavily used. Another fun one is isWhite vs isUniWhite. In most cases, you _don't_ care about unicode whitespace, and it is definitely more expensive to call isUniWhite than isWhite, because there are a _lot_ of extraneous whitespace characters in unicode.

std.ctype/std.ascii is _not_ going away. Too many people find those functions to be useful. I grant you that too many programmers don't worry about unicode when they should, but there are so many issues surrounding the proper handling of unicode that programmers aren't going to get it right unless they're actually trying to get it right. D provides a lot of the tools to make unicode mostly work correctly out of the box, but it's still complicated enough that you can't expect it to "just work" without programmers having some clue of what they're doing. And forcing people to come up with their own functions for basic ASCII operations (which pretty much every other programming language has) isn't going to help any.

- Jonathan M Davis
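A small sketch of the isWhite versus isUniWhite distinction mentioned above. It assumes a current Phobos release in which the ASCII module is named std.ascii and the Unicode-aware function is std.uni.isWhite rather than isUniWhite; those names are an assumption about later releases, not something stated in the thread. The helper onlyUnicodeWhite is hypothetical.

```d
// Assumes a current Phobos: std.ascii.isWhite and std.uni.isWhite.
import ascii = std.ascii;
import uni = std.uni;

// Hypothetical helper: true for whitespace that only the Unicode-aware check recognises.
bool onlyUnicodeWhite(dchar c)
{
    return uni.isWhite(c) && !ascii.isWhite(c);
}

unittest
{
    enum dchar lineSeparator = '\u2028'; // U+2028 LINE SEPARATOR

    assert(ascii.isWhite(' ') && uni.isWhite(' ')); // both agree on ASCII whitespace
    assert(!ascii.isWhite(lineSeparator));          // the ASCII check ignores it
    assert(uni.isWhite(lineSeparator));             // the Unicode check accepts it
    assert(onlyUnicodeWhite(lineSeparator));
}
```

Code that splits on whitespace in a format string or a regex pattern would typically want the first behaviour; code that tokenises arbitrary user text may want the second.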
Jun 16 2011
On Tue, 14 Jun 2011 10:20:48 +0100, Jonathan M Davis <jmdavisProg gmx.com> wrote:
> So, given the arguably poor name of ctype and the fact that std.ctype does not actually match ctype.h's behavior, unless someone comes up with a really good reason not to fairly soon, I'm going to schedule std.ctype for deprecation and put the properly camelcased functions in std.ascii.

I reckon this is the best option.
Jun 16 2011