digitalmars.D - numericValue for (unicode) characters

reply "monarch_dodra" <monarchdodra gmail.com> writes:
There is an ER (enhancement request) that would allow converting characters to numbers:
http://d.puremagic.com/issues/show_bug.cgi?id=5543

For example: '1' => 1
Or, unicode considered: 'Ⅶ' => 7

Long story short, it was decided that it wasn't std.conv.to's job 
to do this conversion, but rather, there should be a function 
called "numericValue" inside std.uni and std.ascii that would do 
this job.

What remains is defining how these functions should work. Things 
to keep in mind:
- ASCII to int should be fast.
- unicode numeric values span from -0.5 to 1.0e12.
- unicode numeric values can be fractional.
- ALL unicode numeric values can be EXACTLY represented in a 
double.

Given these observations, I'd like to propose these:

//------------------------------
//std.ascii.numericValue
/** Given an ascii character, returns that character's
     numeric value if it is numeric ($(D isNumeric)),
     and -1 otherwise
  */
pure @safe nothrow
int numericValue(dchar c);
//------------------------------
//std.uni.numericValue
/** Given a unicode character, returns that character's
     numeric value if it is numeric ($(D isNumeric)),
     and throws an exception otherwise
  */
pure @safe
double numericValue(dchar c);
//------------------------------

The rationale for this:
std.ascii: I think returning -1 as a magic number should help 
keep the code faster and with less clutter than with exceptions. 
Returning an int is the obvious choice for numbers that span -1 
to 9.

std.uni: double is the only type that can hold all ranges of 
unicode's numeric values.
This time, uni throws exceptions. This is for two reasons:
1. Choosing a magic number is difficult, and error prone. Correct 
code would have to look like: "if (std.uni.numericValue(c) > 
-0.7) {...}"
2. When dealing with unicode, throwing an exception is probably 
cleaner, and its overhead is not as critical as with ascii.
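
A minimal sketch of the sentinel-based ASCII behavior described above. The name `asciiNumericValue` is hypothetical and chosen so that it does not look like an existing Phobos function; only the "-1 for non-digits" contract comes from the proposal.

```d
/// Hypothetical helper, not a Phobos function: returns 0..9 for the
/// ASCII digits '0'..'9' and -1 for every other character.
int asciiNumericValue(dchar c) pure nothrow @safe @nogc
{
    // Digits are contiguous in ASCII, so the offset from '0' is the value.
    return ('0' <= c && c <= '9') ? cast(int)(c - '0') : -1;
}
```

A caller then checks the result against -1 (or simply `< 0`) before using it, which is the "magic number" trade-off discussed in the rationale.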

***********************************************
Thoughts?

I wanted to get this ER moved forward. I don't think 
uni.numericValue will be finished soon, but I would like 
std.ascii's version done sooner rather than later.
Jan 02 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
monarch_dodra:

 The rationale for this:
 std.ascii: I think returning -1 as a magic number should help 
 keep the code faster and with less clutter than with exceptions.
For the ASCII version I have two use cases:
- Where I want to go fast&unsafe I just use "c - '0'".
- When I want more safety I'd like to use something like to!(), which raises exceptions in case of errors.

A function that works on ASCII and returns -1 doesn't give me much more than "c - '0'". So maybe exceptions are good in the ASCII case too.

There is also std.typecons.nullable, it's a possibility for std.uni.numericValue. Generally Phobos should eat more of its dog food :-)

Bye,
bearophile
Jan 02 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
1/2/2013 7:24 PM, bearophile wrote:
 monarch_dodra:

 The rationale for this:
 std.ascii: I think returning -1 as a magic number should help keep the
 code faster and with less clutter than with exceptions.
For the ASCII version I have two use cases: - Where I want to go fast&unsafe I just use "c - '0'". - When I want more safety I'd like to use something as to!(), that raises exceptions in case of errors. A function that works on ASCII and returns -1 doesn't give me much more than "c - '0'". So maybe exceptions are good in the ASCII case too.
Then we can maybe just drop this function? What's wrong with:

if(std.ascii.isNumeric(a))
    a -= '0';
else
    enforce(false);

I mean that the time to look it up in the std library is much bigger than the time to roll your own with either of the 2 semantics. Unlike the unicode version, of course.

Then IMO having the std.ascii one is mostly just for symmetry, and thus I think that both should just use some sentinel value.
 There is also std.typecons.nullable, it's a possibility for
 std.uni.numericValue. Generally Phobos should eat more of its dog food :-)
double.nan sounds more like it.
 Bye,
 bearophile
-- Dmitry Olshansky
Jan 02 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/2/13 3:13 PM, Dmitry Olshansky wrote:
 1/2/2013 7:24 PM, bearophile wrote:
 monarch_dodra:

 The rationale for this:
 std.ascii: I think returning -1 as a magic number should help keep the
 code faster and with less clutter than with exceptions.
For the ASCII version I have two use cases: - Where I want to go fast&unsafe I just use "c - '0'". - When I want more safety I'd like to use something as to!(), that raises exceptions in case of errors. A function that works on ASCII and returns -1 doesn't give me much more than "c - '0'". So maybe exceptions are good in the ASCII case too.
Then we can maybe just drop this function? What's wrong with if(std.ascii.isNumeric(a)) a -= '0'; else enforce(false);
Unnecessary flow :o).

enforce(std.ascii.isNumeric(a));
a -= '0';

Andrei
Jan 02 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
1/3/2013 12:21 AM, Andrei Alexandrescu wrote:
 On 1/2/13 3:13 PM, Dmitry Olshansky wrote:
 1/2/2013 7:24 PM, bearophile wrote:
 monarch_dodra:

 The rationale for this:
 std.ascii: I think returning -1 as a magic number should help keep the
 code faster and with less clutter than with exceptions.
For the ASCII version I have two use cases: - Where I want to go fast&unsafe I just use "c - '0'". - When I want more safety I'd like to use something as to!(), that raises exceptions in case of errors. A function that works on ASCII and returns -1 doesn't give me much more than "c - '0'". So maybe exceptions are good in the ASCII case too.
Then we can maybe just drop this function? What's wrong with if(std.ascii.isNumeric(a)) a -= '0'; else enforce(false);
Unnecessary flow :o). enforce(std.ascii.isNumeric(a)); a -= '0';
Yup, and it's 2 lines then. And if one really wants to chain it:
map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

Hardly makes it Phobos candidate then ;)

--
Dmitry Olshansky
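
For comparison, a hedged sketch of the throwing variant that compiles as written. The helper names `checkedDigit` and `digitsOf` are hypothetical. The thread writes `std.ascii.isNumeric` informally; the sketch assumes `std.ascii.isDigit`, which is the Phobos predicate for '0'..'9'.

```d
import std.algorithm : map;
import std.array : array;
import std.ascii : isDigit;
import std.exception : enforce;

/// Hypothetical helper: throws an Exception on anything that is not '0'..'9'.
int checkedDigit(dchar c)
{
    enforce(isDigit(c), "not an ASCII digit");
    return cast(int)(c - '0');
}

/// Chained form: converts every character of a string, throwing on the
/// first character that is not an ASCII digit.
int[] digitsOf(string s)
{
    return s.map!checkedDigit.array;
}
```

Wrapping the check and the subtraction in one named function is what makes the chained `map` form readable, which is roughly the argument for having such a function in a library at all.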
Jan 02 2013
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 2 January 2013 at 20:49:38 UTC, Dmitry Olshansky 
wrote:
 Yup, and it's 2 lines then. And if one really wants to chain it:
 map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

 Hardly makes it Phobos candidate then ;)
Well, just because it's almost trivial to us doesn't mean it hurts to have it. The fact that you can even operate on chars in such a fashion (c - '0') is not obvious to everyone: I've seen time and time again code such as:

//----
if (97 <= c && c <= 122)
    c -= 97;
//----

numericValue helps keep things clean and self-documented.

What's more, it helps keep ascii complete. Code originally written for ascii is easily upgradable to support uni (and vice versa).

Furthermore, *writing* "std.ascii.numericValue" self-documents ascii-only support, which is less obvious in code using "c - '0'": in the original pull request to "improve" conv.to, the fact that it did not support unicode didn't even cross our minds. Seeing "std.ascii.numericValue" raises the eyebrow. It *forces* unicode consideration (regardless of which is right, it can't be ignored).

Really, by the rationale of "it's 2 lines", we shouldn't even have "std.ascii.isNumeric" at all...

On Wednesday, 2 January 2013 at 20:13:32 UTC, Dmitry Olshansky wrote:
 1/2/2013 7:24 PM, bearophile wrote:
 There is also std.typecons.nullable, it's a possibility for
 std.uni.numericValue. Generally Phobos should eat more of its 
 dog food :-)
double.nan sounds more like it.
Hum... nan. I like it.
Jan 02 2013
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Jan 02, 2013 at 11:15:31PM +0100, monarch_dodra wrote:
 On Wednesday, 2 January 2013 at 20:49:38 UTC, Dmitry Olshansky
 wrote:
Yup, and it's 2 lines then. And if one really wants to chain it:
map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

Hardly makes it Phobos candidate then ;)
Well, just because its almost trivial to us doesn't mean it hurts to have it. The fact that you can even operate on chars in such a fashion (c - '0') is not obvious to everyone: I've seen time and time again code such as: //---- if (97 <= c && c <= 122) c -= 97; //---- numericValue helps keep things clean and self documented.
+1. Code intent is important. [...]
 On Wednesday, 2 January 2013 at 20:13:32 UTC, Dmitry Olshansky
 wrote:
1/2/2013 7:24 PM, bearophile wrote:
There is also std.typecons.nullable, it's a possibility for
std.uni.numericValue. Generally Phobos should eat more of its
dog food :-)
double.nan sounds more like it.
Hum... nan. I like it.
+1 for nan. It's about time we used nan for something useful beyond just an annoying default value for floating-point variables. :)

T

--
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
Jan 02 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
1/3/2013 2:15 AM, monarch_dodra wrote:
 On Wednesday, 2 January 2013 at 20:49:38 UTC, Dmitry Olshansky wrote:
 Yup, and it's 2 lines then. And if one really wants to chain it:
 map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

 Hardly makes it Phobos candidate then ;)
Well, just because its almost trivial to us doesn't mean it hurts to have it. The fact that you can even operate on chars in such a fashion (c - '0') is not obvious to everyone: I've seen time and time again code such as: //---- if (97 <= c && c <= 122) c -= 97; //---- numericValue helps keep things clean and self documented. What's more, it helps keep ascii complete. Code originally written for ascii is easily upgreable to support uni (and vice-versa). Further more, *writing* "std.ascii.numericValue" self documents ascii only support, which is less obvious than code using "c - '0'": In the original pull request to "improve" conv.to, the fact that it did not support unicode didn't even cross our minds. Seeing "std.ascii.numericValue" raises the eyebrow. It *forces* unicode consideration (regardless of which is right, it can't be ignored). Really, by the rationale of "it's 2 lines", we shouldn't even have "std.ascii.isNumeric" at all...
I don't mind adding it from a completeness and/or symmetry standpoint, as I said.

I do see another cool issue popping up though. It's a problem of how the anti-hijacking works. Say we add numericValue right now to std.ascii but not std.uni. A release later we have numericValue in std.uni (well, hopefully they are both in the same 2.062 ;) ).

Now take this code:
map!numericValue(...)

If the code also happens to import std.uni it's going to stop compiling.

That's one of the reasons I think our hopes for stability (as in "compiles 5 years from now") are ill-placed, as we can't have it until the library is essentially set in stone.

--
Dmitry Olshansky
Jan 03 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky 
wrote:
 Now take this code:
 map!numericValue(...)

 If the code also happens to import std.uni it's going to stop 
 compiling.
Hum... We could always "camp" the std.uni's numericValue function?

//----
double numericValue()(dchar c) nothrow @safe
{
    static assert(false, "Sorry, std.uni.numericValue is not yet implemented");
}
//----

Because it is a template, the static assert only fires when someone actually calls it. This would avoid the breakage you mentioned.
Jan 03 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Jan-2013 21:13, monarch_dodra wrote:
 On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky wrote:
 Now take this code:
 map!numericValue(...)

 If the code also happens to import std.uni it's going to stop compiling.
Hum... We could always "camp" the std.uni's numericValue function? //---- double numericValue()(dchar c) const nothrow safe { static assert(false, "Sorry, std.uni.numericValue is not yet implemented"); } //----
We'd pretty much have to. -- Dmitry Olshansky
Jan 03 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 3 January 2013 at 18:11:45 UTC, Dmitry Olshansky 
wrote:
 03-Jan-2013 21:13, monarch_dodra wrote:
 On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky 
 wrote:
 Now take this code:
 map!numericValue(...)

 If the code also happens to import std.uni it's going to stop 
 compiling.
Hum... We could always "camp" the std.uni's numericValue function? [SNIP]
We'd pretty much have to.
Or, you know... I could just implement both at the same time. It's not like there's an *urgency* for the ascii version or anything. I think I'll just do that.

So... do we agree on
ascii: int - not found => -1
uni: double - not found => nan
?

I can still get started anyways, even if it isn't definite.
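
A hedged sketch of what the nan convention looks like from the caller's side. `sketchNumericValue` is a hypothetical stand-in that knows only a handful of characters; a real std.uni version would be generated from the Unicode data files. Only `std.math.isNaN` is an existing Phobos call here.

```d
import std.math : isNaN;

/// Hypothetical stand-in for the proposed std.uni.numericValue.
/// It covers ASCII digits plus two hand-picked characters and returns
/// double.nan for everything else.
double sketchNumericValue(dchar c) pure nothrow @safe @nogc
{
    if ('0' <= c && c <= '9')
        return c - '0';
    switch (c)
    {
        case '\u00BD': return 0.5;        // VULGAR FRACTION ONE HALF
        case '\u2166': return 7;          // ROMAN NUMERAL SEVEN
        default:       return double.nan; // "not found"
    }
}

/// Callers must test with isNaN: comparing against double.nan with ==
/// is always false, so `v == double.nan` would never detect the sentinel.
bool hasNumericValue(dchar c) pure nothrow @safe @nogc
{
    return !isNaN(sketchNumericValue(c));
}
```

Unlike -1, nan cannot collide with any legitimate numeric value, which is what makes it attractive for the double-returning version.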
Jan 03 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Jan-2013 23:40, monarch_dodra wrote:
 On Thursday, 3 January 2013 at 18:11:45 UTC, Dmitry Olshansky wrote:
 03-Jan-2013 21:13, monarch_dodra wrote:
 On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky wrote:
 Now take this code:
 map!numericValue(...)

 If the code also happens to import std.uni it's going to stop
 compiling.
Hum... We could always "camp" the std.uni's numericValue function? [SNIP]
We'd pretty much have to.
Or, you know... I could just implement both at the same time. It's not like there's an *urgency* for the ascii version or anything. I think I'll just do that. So... do we agree on ascii: int - not found => -1 uni: double - not found => nan ?
Me fine.
 I can still get started anyways, even if it isn't definite.
It's just that I have an exceptionally fast version for Unicode just around the corner, but I wouldn't mind some competition ;)

--
Dmitry Olshansky
Jan 03 2013
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 3 January 2013 at 20:14:43 UTC, Dmitry Olshansky 
wrote:
 It's just an idea that I have exceptionally fast version for 
 Unicode just around the corner, but I wouldn't mind some 
 competition ;)
Well, I already mentioned to you how I was planning to do it: Just stupid binary search over ranges of numbers indexed on 0. The "big" chunk of work, actually (IMO), is just creating the raw data...
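
One possible reading of "binary search over ranges of numbers indexed on 0", sketched under assumptions: each table entry is a run of consecutive code points whose first member has the value 0, so the value of a character is its offset within the run. The struct, the table and the function name are hypothetical, and the table is deliberately tiny; the actual pull request may organize its data differently.

```d
/// A run of consecutive code points whose first member has the value 0.
struct DigitRun
{
    dchar first;
    dchar last;
}

/// Hypothetical, deliberately incomplete table, sorted by code point.
immutable DigitRun[] digitRuns = [
    DigitRun('0', '9'),           // ASCII digits
    DigitRun('\u0660', '\u0669'), // Arabic-Indic digits
    DigitRun('\u06F0', '\u06F9'), // Extended Arabic-Indic digits
    DigitRun('\u0966', '\u096F'), // Devanagari digits
];

/// Binary search for the run containing `c`; nan when no run matches.
double runNumericValue(dchar c) pure nothrow @safe @nogc
{
    size_t lo = 0, hi = digitRuns.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (c < digitRuns[mid].first)
            hi = mid;
        else if (c > digitRuns[mid].last)
            lo = mid + 1;
        else
            return c - digitRuns[mid].first; // offset within the run
    }
    return double.nan;
}
```

Fractions and isolated values such as Roman numerals would need a second table mapping single code points to values, which is where most of the raw-data work mentioned above would go.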
Jan 04 2013
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jan 03, 2013 at 08:40:47PM +0100, monarch_dodra wrote:
[...]
 Or, you know... I could just implement both at the same time. It's
 not like there's an *urgency* for the ascii version or anything. I
 think I'll just do that.
 
 So... do we agree on
 ascii: int - not found => -1
 uni: double - not found => nan
[...]

LGTM. :)

I did think of what might happen if somebody wrote an int cast for std.uni.numericValue:

void sloppyProgrammersFunction(dchar ch) {
    // First attempt: compiler error: can't implicitly
    // convert double -> int ...
    //int val = std.uni.numericValue(ch);

    // ... so sloppy programmer inserts a cast
    int val = cast(int)std.uni.numericValue(ch);

    // On Linux/64, if numericValue returns nan, this prints
    // -int.max.
    writeln(val);

    // So this should work:
    if (val < 0) {
        // (In fact, it will still work if
        // std.ascii.numericValue were used instead.)
        writeln("Sloppy code caught the problem correctly!");
    }
}

So it seems that everything should be alright.

This particular example occurred to me, 'cos I'm thinking of how often one wishes to extract an integral value from a string, and usually one doesn't think that floating point is necessary(!), so the cast from double is a rather big temptation (even though it's wrong!).

T

--
Tell me and I forget. Teach me and I remember. Involve me and I understand. -- Benjamin Franklin
Jan 03 2013
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 3 January 2013 at 21:51:14 UTC, H. S. Teoh wrote:
 On Thu, Jan 03, 2013 at 08:40:47PM +0100, monarch_dodra wrote:
 [...]
 Or, you know... I could just implement both at the same time. 
 It's
 not like there's an *urgency* for the ascii version or 
 anything. I
 think I'll just do that.
 
 So... do we agree on
 ascii: int - not found => -1
 uni: double - not found => nan
[...] LGTM. :) I did think of what might happen if somebody wrote an int cast for std.uni.numericValue [SNIP] writeln("Sloppy code caught the problem correctly!");
... almost! 1e12 will have a negative value when cast to int. To be 100% correct with regard to converting, the end user would have to use long. But that'd be a *really exceptional* case...

Even with long, the only problem with the code is that the user would not know the difference between an exact integral value and an inexact one. Well, that's what the user gets for being sloppy I guess.

In any case, I think we'd have to provide an example section with a "recommended" way for casting to integral.
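
A sketch of what such a "recommended" conversion could look like, assuming the nan-for-not-found convention agreed on earlier in the thread. The helper `toIntegral` is hypothetical; `isNaN`, `fabs` and `floor` are existing std.math functions.

```d
import std.math : fabs, floor, isNaN;

/// Hypothetical helper: returns true and sets `result` only when `v` is an
/// exact integer that a double represents precisely. nan (no numeric value)
/// and fractional values such as 0.5 are rejected.
bool toIntegral(double v, out long result) pure nothrow @safe @nogc
{
    enum double maxExact = 9_007_199_254_740_992.0; // 2^53
    if (isNaN(v) || fabs(v) > maxExact || floor(v) != v)
        return false;
    result = cast(long) v;
    return true;
}
```

Using long rather than int matters for the largest values mentioned at the top of the thread: 1.0e12 is an exact integer but does not fit in a 32-bit int.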
Jan 04 2013
prev sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
 So... do we agree on
 ascii: int - not found => -1
 uni: double - not found => nan
I'm not a fan of the ASCII version returning -1, but I don't really have a better suggestion. I suppose that you could throw instead, but I don't know if that's a good idea or not. It _would_ be more consistent with our other conversion functions however.

- Jonathan M Davis
Jan 04 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
04-Jan-2013 15:58, Jonathan M Davis wrote:
 On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
 So... do we agree on
 ascii: int - not found => -1
 uni: double - not found => nan
I'm not a fan of the ASCII version returning -1, but I don't really have a better suggestion. I suppose that you could throw instead, but I don't know if that's a good idea or not. It _would_ be more consistent with our other conversion functions however. - Jonathan M Davis
I find low-level stuff that throws to be overly awkward to deal with (not to mention performance problems).

Hm... I've found a brilliant primitive, Expected!T, that could be of great help in the error code vs exceptions problem. See Andrei's recent talk that went live not long ago:
http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Andrei-Alexandrescu-Systematic-Error-Handling-in-C

Time to put the analogous stuff into Phobos?

--
Dmitry Olshansky
Jan 04 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 4 January 2013 at 13:18:48 UTC, Dmitry Olshansky wrote:
 04-Jan-2013 15:58, Jonathan M Davis wrote:
 On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
 So... do we agree on
 ascii: int - not found => -1
 uni: double - not found => nan
I'm not a fan of the ASCII version returning -1, but I don't really have a better suggestion. I suppose that you could throw instead, but I don't know if that's a good idea or not. It _would_ be more consistent with our other conversion functions however. - Jonathan M Davis
I find low-level stuff that throws to be overly awkward to deal with (not to mention performance problems). Hm... I've found an brilliant primitive Expected!T that could be of great help in error code vs exceptions problem. See the recent Andrei's talk that went live not long ago: http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Andrei-Alexandrescu-Systematic-Error-Handling-in-C Time to put the analogous stuff into Phobos?
I finished an implementation:
https://github.com/D-Programming-Language/phobos/pull/1052

It is not "pull ready", so we can still discuss it.

I raised a couple of issues in the pull, which I'll copy here:

//----
I did run into a couple of issues, namely that I'm not getting 100% equivalence between chars that are numeric, and chars with numeric value... Is this normal...?

* There's a fair bit of chars that have numeric value, but aren't isNumber. I think they might be new in 6.1.0. But I'm not sure. I decided it was best to have them return nan, instead of having inconsistent behavior.
* There's a couple of characters in tableLo that have numeric values. These aren't considered in isNumber either. I think this might be a bug though.
* There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC SIGN". These return wild values, and in particular two of them return -1. I *think* this should actually return nan for us, because (AFAIK), -1 is just wild for invalid :/

Maybe we should just return -1 on invalid unicode? Or maybe it's just my input file:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
It doesn't have a separate field for isNumber/numericValue, so it is forced to write a wild number. Maybe these four chars should return nan?
//----

Oh yeah, I also added isNumber to std.ascii. Feels wrong to not have it if we have numericValue.
Jan 04 2013
next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 4 January 2013 at 17:48:28 UTC, monarch_dodra wrote:
 //----
 Maybe we should just return -1 on invalid unicode? Or maybe 
 it's just my input file:
 http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
 It doesn't have a separate field for isNumber/numericValue, so 
 it is forced to write a wild number. Maybe these four chars 
 should return nan?
Wait: I figured it out: They are just non-numbers that happen to be inside Nl (Number Letter):
http://unicode.org/cldr/utility/character.jsp?a=12433

Documentation on this is not very clear, nor consistent, so sorry for any confusion.

Well, I guess there is a bug in std.uni.isNumber then...
Jan 04 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
04-Jan-2013 21:48, monarch_dodra wrote:
 On Friday, 4 January 2013 at 13:18:48 UTC, Dmitry Olshansky wrote:
 04-Jan-2013 15:58, Jonathan M Davis wrote:
 On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
 So... do we agree on
 ascii: int - not found => -1
 uni: double - not found => nan
I'm not a fan of the ASCII version returning -1, but I don't really have a better suggestion. I suppose that you could throw instead, but I don't know if that's a good idea or not. It _would_ be more consistent with our other conversion functions however. - Jonathan M Davis
I find low-level stuff that throws to be overly awkward to deal with (not to mention performance problems). Hm... I've found an brilliant primitive Expected!T that could be of great help in error code vs exceptions problem. See the recent Andrei's talk that went live not long ago: http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Andrei-Alexandrescu-Systematic-Error-Handling-in-C Time to put the analogous stuff into Phobos?
I finished an implementation: https://github.com/D-Programming-Language/phobos/pull/1052 It is not "pull ready", so we can still discuss it.
Well, for a start, it features tons of code duplication. But I'm replacing the whole std.uni anyway...
 I raised a couple of issues in the pull, which I'll copy here:

 //----
 I did run into a couple of issues, namelly that I'm not getting 100%
 equivalence between chars that are numeric, and chars with numeric
 value... Is this normal...?
Yes, it's called Unicode ;)
 * There's a fair bit of chars that have numeric value, but aren't
 isNumber. I think they might be new in 6.1.0. But I'm not sure. I
 decided it was best to have them return nan, instead of having
 inconsistent behavior.
You also might be using 6.2. It was released in the fall of 2012.
 * There's a couple characters in tableLo that have numeric values. These
 aren't considered in isNumber either. I think this might be a bug though.
 * There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC
 SIGN". These return wild values, and in particular two of them return
 -1. I *think* this should actually return nan for us, because (AFAIK),
 -1 is just wild for invalid :/
Some have a numeric value of '-1', I think.

The truth of the matter is that, as usual with Unicode, things are rather complicated. So 'numeric character' is a (general) category and 'has numeric value' is some other property of a codepoint that may or may not correlate directly with the category.

Thus I think (looking ahead into your other post) that isNumber is correct, as it follows its documented behavior.
 Maybe we should just return -1 on invalid unicode? Or maybe it's just my
 input file:
 http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
 It doesn't have a separate field for isNumber/numericValue, so it is
 forced to write a wild number. Maybe these four chars should return nan?
Nope. Does letter 'A' return a wild number?
 //----

 Oh yeah, I also added isNumber to std.ascii. Feels wrong to not have it
 if we have numericValue.
-- Dmitry Olshansky
Jan 04 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 4 January 2013 at 20:33:12 UTC, Dmitry Olshansky wrote:
 04-Jan-2013 21:48, monarch_dodra wrote:
 I finished an implementation:

 https://github.com/D-Programming-Language/phobos/pull/1052

 It is not "pull ready", so we can still discuss it.
Well, for start it features tons of code duplication. But I'm replacing the whole std.uni anyway...
Well, I wrote that with duplication, keeping in mind you would probably replace both. I thought it would be cleaner to have some duplication than a warped single implementation. I could also make the extra effort.

I was really concerned with first having an implementation that is unicode correct.

I also thought that, at worst, you could use my parsed data ;) to submit your own (superior?) pull.
 * There's a couple characters in tableLo that have numeric 
 values. These
 aren't considered in isNumber either. I think this might be a 
 bug though.
 * There are 4 "non-number numeric" characters in "CUNEIFORM 
 NUMERIC
 SIGN". These return wild values, and in particular two of them 
 return
 -1. I *think* this should actually return nan for us, because 
 (AFAIK),
 -1 is just wild for invalid :/
Some have numeric value of '-1' I think. The truth of the matter is as usual with Unicode things are rather complicated. So 'numeric character' is a category (general) and 'has numeric value' is some other property of codepoint that may or may not correlate directly with category. Thus I think (looking ahead into your other post) that isNumber is correct as it follows its documented behavior.
 Maybe we should just return -1 on invalid unicode? Or maybe 
 it's just my
 input file:
 http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
 It doesn't have a separate field for isNumber/numericValue, so 
 it is
 forced to write a wild number. Maybe these four chars should 
 return nan?
Nope. Does letter 'A' return a wild number?
Well, the thing is that I'm getting contradictory info from the consortium itself:

Given 0x12456: "CUNEIFORM NUMERIC SIGN NIGIDAMIN"

According to "UnicodeData.txt", its numeric value is -1.

According to the "Unicode utilities", it is not a numeric type, and its value is null:
http://unicode.org/cldr/utility/character.jsp?a=12456

Also according to the consortium, "-1" is an illegal numeric value:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Value=-1:]

Really, all the info seems to indicate a bug in UnicodeData.txt: they really seem like 4 entries in Nl that aren't numbers. I've found a couple of people on the internet discussing this, but no hard conclusion :/

****

Anyways, those 4 CUNEIFORM aside, what do you make of the entries in Lo:
http://unicode.org/cldr/utility/character.jsp?a=F96B
These appear to be numeric, but aren't inside Nd/No/Nl. They should return true to isNumber, no?

Maybe isNumber's "documented behavior" is wrong?
Jan 04 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
05-Jan-2013 00:51, monarch_dodra wrote:
 On Friday, 4 January 2013 at 20:33:12 UTC, Dmitry Olshansky wrote:
 04-Jan-2013 21:48, monarch_dodra wrote:
 I finished an implementation:

 https://github.com/D-Programming-Language/phobos/pull/1052

 It is not "pull ready", so we can still discuss it.
Well, for start it features tons of code duplication. But I'm replacing the whole std.uni anyway...
Well, I wrote that with duplication, keeping in mind you would probably replace both. I thought it be cleaner to have some duplication, than a warped single implementation. I could also make the extra effort. I was really concerned with first having an implementation that is unicode correct. I also though that, at worst, you could use my parsed data ;) to submit your module that is well due for peer review.
Fixed ;)
 * There's a couple characters in tableLo that have numeric values. These
 aren't considered in isNumber either. I think this might be a bug
 though.
 * There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC
 SIGN". These return wild values, and in particular two of them return
 -1. I *think* this should actually return nan for us, because (AFAIK),
 -1 is just wild for invalid :/
Some have numeric value of '-1' I think. The truth of the matter is as usual with Unicode things are rather complicated. So 'numeric character' is a category (general) and 'has numeric value' is some other property of codepoint that may or may not correlate directly with category. Thus I think (looking ahead into your other post) that isNumber is correct as it follows its documented behavior.
 Maybe we should just return -1 on invalid unicode? Or maybe it's just my
 input file:
 http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
 It doesn't have a separate field for isNumber/numericValue, so it is
 forced to write a wild number. Maybe these four chars should return nan?
Nope. Does letter 'A' return a wild number?
Well, the thing is that I'm getting contradictory info from the consortium itself: Given 0x12456: "CUNEIFORM NUMERIC SIGN NIGIDAMIN" According to the "UnicodeData.txt", its numeric value is -1. According to The "Unocide utilities", it is not a numeric type, and it's value is null: http://unicode.org/cldr/utility/character.jsp?a=12456 Also according to the consortium: "-1" is an illegal numeric value. http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Value=-1:] Really, all the info seems to indicate a bug in UnicodeData.txt: They really seem like 4 entries in Nl that aren't numbers. I've found a couple people on internet discussing this, but no hard conclusion :/
Basically, check the bottom of that page:

....
See also: Unicode Display Problems.
Version 3.6; ICU version: 50.0.1.0; Unicode version: 6.1.0.0

So it's not up to date. The file is. I can test with ICU 51 to see what it reports.
 ****

 Anyways, those 4 CUNEIFORM asside, what do you make of the
 entries in Lo:
 http://unicode.org/cldr/utility/character.jsp?a=F96B
 These appear to be numeric, but aren't inside Nd/No/Nl. They
 should return true to isNumber, no?
Hmmm. Take a look here:
http://unicode.org/cldr/utility/properties.jsp

There is a section called Numeric that has 3 properties, and then there is a General section. General has Category, which in turn has a 'Number' category.

Bottom line is that I believe that std.uni isXXX queries the category of a symbol and not some other property. Let any mishaps between properties and general category be the consortium's headache.
 Maybe isNumber's "documented behavior" is wrong?
Problem is I can't come up with a good description of some other behavior. Maybe this one:
[^[:Numeric_Type=None:]]
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5E%5B%3ANumeric_Type%3DNone%3A%5D%5D&g=

--
Dmitry Olshansky
Jan 04 2013
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 4 January 2013 at 22:00:02 UTC, Dmitry Olshansky wrote:
 05-Jan-2013 00:51, monarch_dodra wrote:
 Anyways, those 4 CUNEIFORM asside, what do you make of the
 entries in Lo:
 http://unicode.org/cldr/utility/character.jsp?a=F96B
 These appear to be numeric, but aren't inside Nd/No/Nl. They
 should return true to isNumber, no?
Hmmm. Take a look here: http://unicode.org/cldr/utility/properties.jsp There is a section called Numeric that has 3 properties, and then there is a General section. The General has Category which in turn has 'Number' category. Bottom line is that I believe that std.uni isXXX queries the category of a symbol and not some other property. Let any mishaps in between properties and general category be consortium's headache.
 Maybe isNumber's "documented behavior" is wrong?
Problem is I can't come up with a good description of some other behavior. Maybe this one [^[:Numeric_Type=None:]] http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5E%5B%3ANumeric_Type%3DNone%3A%5D%5D&g=
Sounds like the root of the problem is that isNumber != Numeric_Type[Decimal, Digit, Numeric]

Ergo, there is no correlation between isNumber and numericValue.

Feels like there is a lot missing from std.uni, but at the same time, Unicode is really huge.

At the very least, I think we should have a Category enum, along with a (get) "category" function.

I was just saying to jmdavis in the pull that std.ascii had "isDigit", but that uni didn't. In truth, both also lack isDecimal and isNumeric.

There would just be a bit of ambiguity now between the broad "isNumeric", and "all the chars that have a numeric value"... :/

Damn. Unicode is complicated.

Anyways, taking my weekend break.
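
For readers following along, the mismatch can be observed directly. This is only a minimal sketch, assuming a Phobos in which std.uni.isNumber tests the Number general category (Nd/Nl/No), as described in this thread; U+F96B is the Lo code point linked above.

```d
import std.uni : isNumber;

void main()
{
    // Nd: an ordinary decimal digit.
    assert(isNumber('5'));
    // Nl: ROMAN NUMERAL SEVEN (U+2166).
    assert(isNumber('\u2166'));
    // Lo: U+F96B has a numeric value in the Unicode data, but its
    // general category is "Letter, other", so isNumber says no.
    assert(!isNumber('\uF96B'));
}
```

So a numericValue function that only accepted characters passing isNumber would miss code points such as U+F96B.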
Jan 04 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Jan 04, 2013 at 11:48:39PM +0100, monarch_dodra wrote:
[...]
 Sounds like the root of the problem is that isNumber !=
 Numeric_Type[Decimal, Digit, Numeric]
 
 Ergo, there is no correlation between isNumber and numericValue.
Yikes. That's pretty ... nasty. :-(
 Feels like there is a lot missing from std.uni, but at the same
 time, unicode is really huge.
Yeah, Unicode is a lot more complex than most people realize. Recently I read through TR14 (proper line-breaking in Unicode), and I was gaping in awe at the insane complexity of such a seemingly-simple task.
 At the very least, I think we should have Category enum, along with a
 (get) "category" function.
Yes! We need that!!
 I was just saying to jmdavis in the pull that std.ascii had
 "isDigit", but that uni didn't. In truth, both also lack isDecimal
 and isNumeric.
 
 There would just be a bit of ambiguity now between the broad
 "isNumeric", and "all the chars that have a numeric value"... :/
 
 Damn. Unicode is complicated.
[...]

I, for one, would love to know why isNumeric != hasNumericValue.

T

-- 
Valentine's Day: an occasion for florists to reach into the wallets of nominal lovers in dire need of being reminded to profess their hypothetical love for their long-forgotten.
Jan 04 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
 [...]

 I, for one, would love to know why isNumeric != hasNumericValue.


 T
I guess it's just bad wording from the standard.

The standard defined 3 groups that make up Number:
[Nd] 	Number, Decimal Digit
[Nl] 	Number, Letter
[No] 	Number, Other

However, there are a couple of characters that *are* numbers, but aren't in those groups.

The "Good" news is that the standard *does* define number_types to classify the kind of number a char is:
* Null: Not a number
* Digit: Obvious
* Decimal: Any decimal number that is NOT a digit
* Numeric: Everything else.

So they used "Numeric" as a wildcard, and "Number" as their general category.

This leaves us with ambiguity when choosing our word: Technically '5' does not classify as "numeric", although you could consider it "has a numeric value".

I hope that makes sense.
Jan 07 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
 On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
[...]
I, for one, would love to know why isNumeric != hasNumericValue.
[...]
 I guess it's just bad wording from the standard.
 
 The standard defined 3 groups that make up Number:
 [Nd] 	Number, Decimal Digit
 [Nl] 	Number, Letter
 [No] 	Number, Other
 
 However, there are a couple of characters that *are* numbers, but
 aren't in those groups.
 
 The "Good" news is that the standard, *does* define number_types to
 classify the kind of number a char is:
 * Null: Not a number
 * Digit: Obvious
 * Decimal: Any decimal number that is NOT a digit
 * Numeric: Everything else.
 
 So they used "Numeric" as wild, and "Number" as their general
 category.
 
 This leaves us with ambiguity when choosing our word:
 Technically '5' does not classify as "numeric", although you could
 consider it "has a numeric value".
 
 I hope that makes sense.
Hmph. I guess we need to differentiate between the Unicode category called "numeric", and the property of having a numerical value. So we'd need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's what the standard is, then that's what it is.

Anyway, I'd love to see std.uni cover all Unicode categories.

Offhanded note: should we unify the various isX() functions into:

	bool inCategory(string category)(dchar ch)

where category is the Unicode designation, say "Nl", "Nd", etc.? That way, it's more future-proof in case the Unicode guys add more categories. Also makes it easier to remember which function to call; else you'd always have to remember "N" -> isNumeric, "L" -> isAlpha, etc. The current names of course can be left as aliases.

T

-- 
The fact that anyone still uses AOL shows that even the presence of options doesn't stop some people from picking the pessimal one. - Mike Ellis
Jan 09 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
10-Jan-2013 03:21, H. S. Teoh wrote:
 On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
 On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
 [...]
 I, for one, would love to know why isNumeric != hasNumericValue.
[...]
 I guess it's just bad wording from the standard.

 The standard defined 3 groups that make up Number:
 [Nd] 	Number, Decimal Digit
 [Nl] 	Number, Letter
 [No] 	Number, Other

 However, there are a couple of characters that *are* numbers, but
 aren't in those groups.

 The "Good" news is that the standard, *does* define number_types to
 classify the kind of number a char is:
 * Null: Not a number
 * Digit: Obvious
 * Decimal: Any decimal number that is NOT a digit
 * Numeric: Everything else.

 So they used "Numeric" as wild, and "Number" as their general
 category.

 This leaves us with ambiguity when choosing our word:
 Technically '5' does not classify as "numeric", although you could
 consider it "has a numeric value".

 I hope that makes sense.
Hmph. I guess we need to differentiate between the unicode category called "numeric", and the property of having a numerical value. So we'd need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's what the standard is, then that's what it is.
isNumber - _Number_ General category (as defined by Unicode 1:1)
isNumeric - as having NumericType != None (again going by the definition of Unicode properties)

And that's all, correct and to the letter.
 Anyway, I'd love to see std.uni cover all unicode categories.

 Offhanded note: should we unify the various isX() functions into:

 	bool inCategory(string category)(dchar ch)
No, no, no! It's a horrible idea.

The main problem with it is: a huge catalog of data has to be stored in Phobos (object code) of no (even niche) use. Also, to be practical for use cases other than casual observation it has to be fast... and it can't be for any of the useful cases. Just count the number of bits to store per codepoint and the fairly irregular structure of the whole set of properties (unlike individual combinations that do have a nice distribution, e.g. Scripts as in Cyrillic).

I've been shoulder-deep in Unicode for about half a year now, reading through the TR-xx algorithms, and *none* of them requires queries of the sort that test all (more than 1-2?) of the properties. In all cases the algorithm itself defines a set (or sets) of codepoints with different meanings/values for this use case.

These (useful) sets could be compressed to a fast multi-stage table; the whole catalog of properties could not, as it packs enormous heaps of unused junk (Unicode_Age anyone??). This junk is not fit for the std library, but the goal is to provide a tool for the user to work with sets/data beyond the commonly useful ones in std.
 where category is the Unicode designation, say "Nl", "Nd", etc.? That
 way, it's more future-proof in case the Unicode guys add more
 categories.
I'm posting my work on std.uni as ready for review today or tomorrow. It includes a type for a set of codepoints and a ton of predefined sets for Nl, Nd and almost everything sensible (blocks, scripts, properties). The user can then conjure whatever combination is required. And it's still way smaller than having the full 'query the database' thing.

To check the full madness of all of the properties just use the web interface of unicode.org.

P.S. Hopefully, nobody raises the point of codepoint _names_; they are, after all, also part of the Unicode standard (and character database).

-- 
Dmitry Olshansky
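
As an illustration of the "conjure whatever combination" idea, here is a minimal sketch. It assumes the std.uni that later shipped in Phobos, where the `unicode` helper exposes predefined sets by property name (for example `unicode.Nd`) and sets support `|`, `-` and indexing; the API posted for review at the time may have differed.

```d
import std.uni : unicode;

void main()
{
    // Predefined sets for individual general categories.
    auto decimals = unicode.Nd;
    auto letterNumbers = unicode.Nl;

    // Sets combine with ordinary operators: | is union, - is difference.
    auto numbers = unicode.Nd | unicode.Nl | unicode.No;
    auto nonDecimalNumbers = numbers - unicode.Nd;

    assert(decimals['7']);
    assert(letterNumbers['\u2166']);   // ROMAN NUMERAL SEVEN (Nl)
    assert(numbers['\u00BD']);         // VULGAR FRACTION ONE HALF (No)
    assert(!nonDecimalNumbers['7']);   // decimal digits were subtracted
    assert(!numbers['\uF96B']);        // general category Lo, not a Number
}
```

The user builds only the sets actually needed, which is the size argument made above against a general `inCategory` query.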
Jan 10 2013
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 10 January 2013 at 18:09:31 UTC, Dmitry Olshansky 
wrote:
 10-Jan-2013 03:21, H. S. Teoh wrote:
 On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
 On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
 [...]
 I, for one, would love to know why isNumeric != 
 hasNumericValue.
[...]
 I guess it's just bad wording from the standard.

 The standard defined 3 groups that make up Number:
 [Nd] 	Number, Decimal Digit
 [Nl] 	Number, Letter
 [No] 	Number, Other

 However, there are a couple of characters that *are* numbers, 
 but
 aren't in those groups.

 The "Good" news is that the standard, *does* define 
 number_types to
 classify the kind of number a char is:
 * Null: Not a number
 * Digit: Obvious
 * Decimal: Any decimal number that is NOT a digit
 * Numeric: Everything else.

 So they used "Numeric" as wild, and "Number" as their general
 category.

 This leaves us with ambiguity when choosing our word:
 Technically '5' does not classify as "numeric", although you 
 could
 consider it "has a numeric value".

 I hope that makes sense.
Hmph. I guess we need to differentiate between the unicode category called "numeric", and the property of having a numerical value. So we'd need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's what the standard is, then that's what it is.
isNumber - _Number_ General category (as defined by Unicode 1:1)
isNumeric - as having NumericType != None (again going by the definition of Unicode properties)

And that's all, correct and to the letter.
Are you sure about that? The four values of Numeric_Type are:
* Decimal
* Digit
* None
* Numeric <= !!!
http://unicode.org/cldr/utility/properties.jsp?a=Numeric_Type#Numeric_Type

Hopefully, we'll have "isDecimal", "isDigit", and eventually "isNumeric", which, according to the definition, would simply be "Numeric_Type == Numeric_Type.Numeric".

The problem is that, by the definitions of the Unicode properties, there is no name for "not in Numeric_Type.None".

"hasNumericValue" is the best name I could come up with to differentiate between "Not Numeric_Type.None" and "Numeric_Type.Numeric".
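
The naming distinction can be sketched as follows. Every name here (NumericType, numericType, isNumeric, hasNumericValue) is hypothetical and not an existing std.uni API, and the lookup is a toy that covers only a handful of code points rather than the real Unicode tables.

```d
// Hypothetical names throughout; nothing here is an existing std.uni API.
enum NumericType { none, decimal, digit, numeric }

/// Toy lookup covering only a few code points, for illustration.
NumericType numericType(dchar c) pure nothrow @safe
{
    if (c >= '0' && c <= '9')
        return NumericType.decimal;
    switch (c)
    {
        case '\u00B2': return NumericType.digit;   // SUPERSCRIPT TWO
        case '\u00BD': return NumericType.numeric; // VULGAR FRACTION ONE HALF
        case '\u2166': return NumericType.numeric; // ROMAN NUMERAL SEVEN
        case '\uF96B': return NumericType.numeric; // CJK compatibility ideograph (Lo)
        default:       return NumericType.none;
    }
}

/// Narrow sense: Numeric_Type == Numeric.
bool isNumeric(dchar c) pure nothrow @safe
{
    return numericType(c) == NumericType.numeric;
}

/// Broad sense: Numeric_Type != None.
bool hasNumericValue(dchar c) pure nothrow @safe
{
    return numericType(c) != NumericType.none;
}

void main()
{
    assert(!isNumeric('5'));        // '5' is Decimal, not Numeric
    assert(hasNumericValue('5'));   // but it does have a numeric value
    assert(isNumeric('\u00BD') && hasNumericValue('\u00BD'));
    assert(!hasNumericValue('x'));
}
```

With these definitions '5' fails isNumeric yet passes hasNumericValue, which is exactly the ambiguity described above.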
Jan 10 2013
prev sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 4 January 2013 at 22:00:02 UTC, Dmitry Olshansky wrote:
 [SNIP]
Thank you for all your feedback. *Everything* makes sense now. However, the conclusion I'm coming to is that some groundwork is needed before doing numeric value, which I am currently doing.
Jan 07 2013
prev sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Dmitry Olshansky:

 Yup, and it's 2 lines then. And if one really wants to chain it:
 map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

 Hardly makes it Phobos candidate then ;)
I think you meant to write:

map(a => enforce(std.ascii.isNumeric(a)), a - '0')(...);

To avoid some bugs I try to not use the comma expression like that. Compare that code with:

map!numericValue(...);

Bye,
bearophile
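
To make the comparison concrete, here is a minimal sketch of the `map!numericValue` form. The local `numericValue` below only mimics the std.ascii function proposed at the top of this thread (returning -1 for non-digits); it is an assumption for illustration, not an existing Phobos function.

```d
import std.algorithm : equal, map;

// Local sketch of the proposed std.ascii.numericValue:
// the digit's value for '0'..'9', and -1 for anything else.
int numericValue(dchar c) pure nothrow @safe
{
    return (c >= '0' && c <= '9') ? cast(int)(c - '0') : -1;
}

void main()
{
    // Each character of the string is mapped to its numeric value.
    assert("2013".map!numericValue.equal([2, 0, 1, 3]));
    // Non-digits yield the -1 sentinel instead of throwing.
    assert(numericValue('x') == -1);
}
```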
Jan 02 2013