## digitalmars.D - numericValue for (unicode) characters

"monarch_dodra" <monarchdodra gmail.com> writes:
```There is an ER that would allow converting characters to numbers:
http://d.puremagic.com/issues/show_bug.cgi?id=5543

For example: '1' => 1
Or, unicode considered: 'Ⅶ' => 7

Long story short, it was decided that it wasn't std.conv.to's job
to do this conversion, but rather, there should be a function
called "numericValue" inside std.uni and std.ascii that would do
this job.

What remains is defining how these methods should work. Things
to keep in mind:
- ASCII to int should be fast.
- unicode numeric values span from -0.5 to 1.0e12.
- unicode numeric values can be fractional.
- ALL unicode numeric values can be EXACTLY represented in a
double.

Given these observations, I'd like to propose these:

//------------------------------
//std.ascii.numericValue
/** Given an ascii character, returns that character's
numeric value if it is numeric ($(D isNumeric)),
and -1 otherwise
*/
pure @safe nothrow
int numericValue(dchar c);
//------------------------------
//std.uni.numericValue
/** Given a unicode character, returns that character's
numeric value if it is numeric ($(D isNumeric)),
and throws an exception otherwise
*/
pure @safe
double numericValue(dchar c);
//------------------------------

The rationale for this:
std.ascii: I think returning -1 as a magic number should help
keep the code faster and with less clutter than with exceptions.
Returning an int is the obvious choice for numbers that span -1
to 10.

std.uni: double is the only type that can hold all ranges of
unicode's numeric values.
This time, uni throws exceptions. This is for two reasons:
1. Choosing a magic number is difficult, and error prone. Correct
code would have to look like: "if (std.uni.numericValue(c) >
-0.7) {...}"
2. When dealing with unicode, overhead of the exception is
probably cleaner and not as critical as with ascii.

***********************************************
Thoughts?

I wanted to get this ER moved forward. I don't think
uni.numericValue will be finished soon, but I would like
std.ascii's done sooner rather than later.
```
Jan 02 2013
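A minimal sketch of the two signatures proposed above, assuming hypothetical names `asciiNumericValue` and `uniNumericValue` so that nothing here is mistaken for the eventual Phobos implementation. The Unicode version is backed by three hand-written cases rather than real Unicode data; it only shows the difference between the -1 sentinel and the throwing behavior.

```d
import std.exception : assertThrown;

/// Hypothetical stand-in for the proposed std.ascii.numericValue:
/// returns 0..9 for '0'..'9' and the sentinel -1 for anything else.
int asciiNumericValue(dchar c) pure nothrow @safe
{
    return ('0' <= c && c <= '9') ? cast(int)(c - '0') : -1;
}

/// Hypothetical stand-in for the proposed std.uni.numericValue.
/// Only a few hand-picked characters are covered (an assumption made
/// for illustration); everything else throws, as in the proposal.
double uniNumericValue(dchar c) pure @safe
{
    switch (c)
    {
        case '0': .. case '9':
            return c - '0';
        case '\u00BD':              // VULGAR FRACTION ONE HALF
            return 0.5;
        case '\u2166':              // ROMAN NUMERAL SEVEN
            return 7;
        default:
            throw new Exception("character has no numeric value");
    }
}

void main()
{
    assert(asciiNumericValue('1') == 1);
    assert(asciiNumericValue('x') == -1);     // sentinel, no exception
    assert(uniNumericValue('\u2166') == 7);
    assert(uniNumericValue('\u00BD') == 0.5); // fractional value needs double
    assertThrown(uniNumericValue('x'));
}
```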
"bearophile" <bearophileHUGS lycos.com> writes:
```monarch_dodra:

The rationale for this:
std.ascii: I think returning -1 as a magic number should help
keep the code faster and with less clutter than with exceptions.

For the ASCII version I have two use cases:
- Where I want to go fast&unsafe I just use "c - '0'".
- When I want more safety I'd like to use something like to!(),
that raises exceptions in case of errors.

A function that works on ASCII and returns -1 doesn't give me
much more than "c - '0'". So maybe exceptions are good in the
ASCII case too.

There is also std.typecons.nullable, it's a possibility for
std.uni.numericValue. Generally Phobos should eat more of its dog
food :-)

Bye,
bearophile
```
Jan 02 2013
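The `std.typecons.Nullable` alternative mentioned above could look roughly like this. The function name `tryAsciiNumericValue` is hypothetical; the point is only that "no value" becomes part of the return type instead of a magic number or an exception.

```d
import std.typecons : Nullable;

/// Hypothetical Nullable-based variant: a null result means
/// "not an ASCII digit", so no sentinel value is needed.
Nullable!int tryAsciiNumericValue(dchar c)
{
    if ('0' <= c && c <= '9')
        return Nullable!int(cast(int)(c - '0'));
    return Nullable!int.init;   // isNull is true
}

void main()
{
    auto seven = tryAsciiNumericValue('7');
    assert(!seven.isNull && seven.get == 7);
    assert(tryAsciiNumericValue('x').isNull);
}
```

The caller has to check `isNull` before calling `get`, which is more verbose than `c - '0'` but cannot silently yield a wrong number.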
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
```1/2/2013 7:24 PM, bearophile wrote:
monarch_dodra:

The rationale for this:
std.ascii: I think returning -1 as a magic number should help keep the
code faster and with less clutter than with exceptions.

For the ASCII version I have two use cases:
- Where I want to go fast&unsafe I just use "c - '0'".
- When I want more safety I'd like to use something like to!(), that
raises exceptions in case of errors.

A function that works on ASCII and returns -1 doesn't give me much more
than "c - '0'". So maybe exceptions are good in the ASCII case too.

Then we can maybe just drop this function? What's wrong with
if(std.ascii.isNumeric(a))
a -= '0';
else
enforce(false);

I mean that the time to look it up in the standard library is much bigger than the time to
roll your own with either of the 2 semantics.

Unlike the unicode version, of course. Then IMO having the std.ascii one
is mostly just for symmetry and thus I think that both should just use
some sentinel value.

There is also std.typecons.nullable, it's a possibility for
std.uni.numericValue. Generally Phobos should eat more of its dog food :-)

double.nan sounds more like it.

Bye,
bearophile

--
Dmitry Olshansky
```
Jan 02 2013
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 1/2/13 3:13 PM, Dmitry Olshansky wrote:
1/2/2013 7:24 PM, bearophile wrote:
monarch_dodra:

The rationale for this:
std.ascii: I think returning -1 as a magic number should help keep the
code faster and with less clutter than with exceptions.

For the ASCII version I have two use cases:
- Where I want to go fast&unsafe I just use "c - '0'".
- When I want more safety I'd like to use something like to!(), that
raises exceptions in case of errors.

A function that works on ASCII and returns -1 doesn't give me much more
than "c - '0'". So maybe exceptions are good in the ASCII case too.

Then we can maybe just drop this function? What's wrong with
if(std.ascii.isNumeric(a))
a -= '0';
else
enforce(false);

Unnecessary flow :o).

enforce(std.ascii.isNumeric(a));
a -= '0';

Andrei
```
Jan 02 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
```1/3/2013 12:21 AM, Andrei Alexandrescu wrote:
On 1/2/13 3:13 PM, Dmitry Olshansky wrote:
1/2/2013 7:24 PM, bearophile wrote:
monarch_dodra:

The rationale for this:
std.ascii: I think returning -1 as a magic number should help keep the
code faster and with less clutter than with exceptions.

For the ASCII version I have two use cases:
- Where I want to go fast&unsafe I just use "c - '0'".
- When I want more safety I'd like to use something like to!(), that
raises exceptions in case of errors.

A function that works on ASCII and returns -1 doesn't give me much more
than "c - '0'". So maybe exceptions are good in the ASCII case too.

Then we can maybe just drop this function? What's wrong with
if(std.ascii.isNumeric(a))
a -= '0';
else
enforce(false);

Unnecessary flow :o).

enforce(std.ascii.isNumeric(a));
a -= '0';

Yup, and it's 2 lines then. And if one really wants to chain it:
map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

Hardly makes it Phobos candidate then ;)

--
Dmitry Olshansky
```
Jan 02 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
```1/3/2013 2:15 AM, monarch_dodra wrote:
On Wednesday, 2 January 2013 at 20:49:38 UTC, Dmitry Olshansky wrote:
Yup, and it's 2 lines then. And if one really wants to chain it:
map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

Hardly makes it Phobos candidate then ;)

Well, just because it's almost trivial to us doesn't mean it hurts to
have it. The fact that you can even operate on chars in such a fashion
(c - '0') is not obvious to everyone: I've seen time and time again code
such as:
//----
if (97 <= c && c <= 122)
c -= 97;
//----

numericValue helps keep things clean and self documented.

What's more, it helps keep ascii complete. Code originally written for
ascii is easily upgradable to support uni (and vice-versa). Furthermore,
*writing* "std.ascii.numericValue" self documents ascii only support,
which is less obvious than code using "c - '0'":

In the original pull request to "improve" conv.to, the fact that it did
not support unicode didn't even cross our minds. Seeing
"std.ascii.numericValue" raises the eyebrow. It *forces* unicode
consideration (regardless of which is right, it can't be ignored).

Really, by the rationale of "it's 2 lines", we shouldn't even have
"std.ascii.isNumeric" at all...

I don't mind adding it from a completeness and/or symmetry standpoint,
as I said.

I do see another cool issue popping up though. It's a problem of how the
anti-hijacking works. Say we add numericValue right now to std.ascii but
not std.uni. A release later we have numericValue in std.uni (well
hopefully they are both in the same 2.062 ;) ).

Now take this code:
map!numericValue(...)

If the code also happens to import std.uni it's going to stop compiling.

That's one of reasons I think our hopes on stability (as in compiles in
5 years from now) are ill placed as we can't have it until the library

--
Dmitry Olshansky
```
Jan 03 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
```03-Jan-2013 21:13, monarch_dodra wrote:
On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky wrote:
Now take this code:
map!numericValue(...)

If the code also happens to import std.uni it's going to stop compiling.

Hum... We could always "camp" the std.uni's numericValue function?

//----
double numericValue()(dchar c) pure nothrow @safe
{
static assert(false, "Sorry, std.uni.numericValue is not yet implemented");
}
//----

We'd pretty much have to.

--
Dmitry Olshansky
```
Jan 03 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
```03-Jan-2013 23:40, monarch_dodra wrote:
On Thursday, 3 January 2013 at 18:11:45 UTC, Dmitry Olshansky wrote:
03-Jan-2013 21:13, monarch_dodra wrote:
On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky wrote:
Now take this code:
map!numericValue(...)

If the code also happens to import std.uni it's going to stop
compiling.

Hum... We could always "camp" the std.uni's numericValue function?
[SNIP]

We'd pretty much have to.

Or, you know... I could just implement both at the same time. It's not
like there's an *urgency* for the ascii version or anything. I think
I'll just do that.

So... do we agree on
?

I can still get started anyways, even if it isn't definite.

It's just an idea that I have an exceptionally fast version for Unicode
just around the corner, but I wouldn't mind some competition ;)

--
Dmitry Olshansky
```
Jan 03 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
```04-Jan-2013 15:58, Jonathan M Davis wrote:
On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
So... do we agree on

I'm not a fan of the ASCII version returning -1, but I don't really have a
better suggestion. I suppose that you could throw instead, but I don't know if
that's a good idea or not. It _would_ be more consistent with our other
conversion functions however.

- Jonathan M Davis

I find low-level stuff that throws to be overly awkward to deal with
(not to mention performance problems).

Hm... I've found a brilliant primitive, Expected!T, that could be of
great help in the error code vs exceptions problem. See Andrei's recent
talk that went live not long ago:

http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Andrei-Alexandrescu-Systematic-Error-Handling-in-C

Time to put the analogous stuff into Phobos?

--
Dmitry Olshansky
```
Jan 04 2013
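For readers unfamiliar with the `Expected!T` idea referenced above, here is a rough sketch of what an analogous type might look like in D. This is not the design from the talk and not a Phobos type; the struct, its members, and `parseDigit` are all hypothetical and kept deliberately small.

```d
import std.exception : assertThrown;

/// Hypothetical Expected!T: holds either a value or the exception
/// that explains why no value is available.
struct Expected(T)
{
    private T value_;
    private Exception error_;

    this(T value) { value_ = value; }
    this(Exception error) { error_ = error; }

    /// True when a value is stored, false when an error is stored.
    bool hasValue() const { return error_ is null; }

    /// Returns the value, or throws the stored exception.
    T get()
    {
        if (error_ !is null)
            throw error_;
        return value_;
    }
}

/// Hypothetical low-level function that never throws by itself.
Expected!int parseDigit(dchar c)
{
    if ('0' <= c && c <= '9')
        return Expected!int(cast(int)(c - '0'));
    return Expected!int(new Exception("not an ASCII digit"));
}

void main()
{
    auto ok = parseDigit('4');
    assert(ok.hasValue && ok.get == 4);

    auto bad = parseDigit('x');
    assert(!bad.hasValue);      // error-code style: just test it
    assertThrown(bad.get);      // exception style: ask for the value
}
```

The caller chooses between the two styles at the call site, which is what makes this attractive for low-level functions such as numericValue.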
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
```04-Jan-2013 21:48, monarch_dodra wrote:
On Friday, 4 January 2013 at 13:18:48 UTC, Dmitry Olshansky wrote:
04-Jan-2013 15:58, Jonathan M Davis wrote:
On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
So... do we agree on

I'm not a fan of the ASCII version returning -1, but I don't really
have a
better suggestion. I suppose that you could throw instead, but I
don't know if
that's a good idea or not. It _would_ be more consistent with our other
conversion functions however.

- Jonathan M Davis

I find low-level stuff that throws to be overly awkward to deal with
(not to mention performance problems).

Hm... I've found a brilliant primitive, Expected!T, that could be of
great help in the error code vs exceptions problem. See
Andrei's recent talk that went live not long ago:

http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Andrei-Alexandrescu-Systematic-Error-Handling-in-C

Time to put the analogous stuff into Phobos?

I finished an implementation:

https://github.com/D-Programming-Language/phobos/pull/1052

It is not "pull ready", so we can still discuss it.

Well, for a start it features tons of code duplication. But I'm replacing
the whole std.uni anyway...

I raised a couple of issues in the pull, which I'll copy here:

//----
I did run into a couple of issues, namely that I'm not getting 100%
equivalence between chars that are numeric, and chars with numeric
value... Is this normal...?

Yes, it's called Unicode ;)

* There are a fair number of chars that have a numeric value, but aren't
isNumber. I think they might be new in 6.1.0. But I'm not sure. I
decided it was best to have them return nan, instead of having
inconsistent behavior.

You also might be using 6.2. It was released in the fall of 2012.

* There are a couple of characters in tableLo that have numeric values. These
aren't considered in isNumber either. I think this might be a bug though.
* There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC
SIGN". These return wild values, and in particular two of them return
-1. I *think* this should actually return nan for us, because (AFAIK),
-1 is just wild for invalid :/

Some have a numeric value of '-1', I think. The truth of the matter is that, as
usual with Unicode, things are rather complicated.
So 'numeric character' is a category (general) and 'has numeric value'
is some other property of codepoint that may or may not correlate
directly with category.

Thus I think (looking ahead into your other post) that isNumber is
correct as it follows its documented behavior.

Maybe we should just return -1 on invalid unicode? Or maybe it's just my
input file:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
It doesn't have a separate field for isNumber/numericValue, so it is
forced to write a wild number. Maybe these four chars should return nan?

Nope. Does letter 'A' return a wild number?

//----

Oh yeah, I also added isNumber to std.ascii. Feels wrong to not have it
if we have numericValue.

--
Dmitry Olshansky
```
Jan 04 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
```05-Jan-2013 00:51, monarch_dodra wrote:
On Friday, 4 January 2013 at 20:33:12 UTC, Dmitry Olshansky wrote:
04-Jan-2013 21:48, monarch_dodra wrote:
I finished an implementation:

https://github.com/D-Programming-Language/phobos/pull/1052

It is not "pull ready", so we can still discuss it.

Well, for start it features tons of code duplication. But I'm
replacing the whole std.uni anyway...

Well, I wrote that with duplication, keeping in mind you would
probably replace both. I thought it would be cleaner to have some duplication,
than a warped single implementation. I could also make the extra effort.
I was really concerned with first having an implementation that is
unicode correct.

I also thought that, at worst, you could use my parsed data ;) to submit
your module that is well due for peer review.

Fixed ;)

* There are a couple of characters in tableLo that have numeric values. These
aren't considered in isNumber either. I think this might be a bug
though.
* There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC
SIGN". These return wild values, and in particular two of them return
-1. I *think* this should actually return nan for us, because (AFAIK),
-1 is just wild for invalid :/

Some have a numeric value of '-1', I think. The truth of the matter is that, as
usual with Unicode, things are rather complicated.
So 'numeric character' is a category (general) and 'has numeric value'
is some other property of codepoint that may or may not correlate
directly with category.

Thus I think (looking ahead into your other post) that isNumber is
correct as it follows its documented behavior.

Maybe we should just return -1 on invalid unicode? Or maybe it's just my
input file:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
It doesn't have a separate field for isNumber/numericValue, so it is
forced to write a wild number. Maybe these four chars should return nan?

Nope. Does letter 'A' return a wild number?

Well, the thing is that I'm getting contradictory info from the
consortium itself:
Given 0x12456: "CUNEIFORM NUMERIC SIGN NIGIDAMIN"
According to the "UnicodeData.txt", its numeric value is -1.
According to the "Unicode utilities", it is not a numeric type,
and its value is null:
http://unicode.org/cldr/utility/character.jsp?a=12456

Also according to the consortium: "-1" is an illegal numeric
value.
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Value=-1:]

Really, all the info seems to indicate a bug in UnicodeData.txt:
They really seem like 4 entries in Nl that aren't numbers.

I've found a couple people on internet discussing this, but no
hard conclusion :/

Basically check the bottom of that page:
....
Version 3.6; ICU version: 50.0.1.0; Unicode version: 6.1.0.0

So it's not up to date. The file is. I can test with ICU 51 to see what
it reports.

****

Anyways, those 4 CUNEIFORM aside, what do you make of the
entries in Lo:
http://unicode.org/cldr/utility/character.jsp?a=F96B
These appear to be numeric, but aren't inside Nd/No/Nl. They
should return true to isNumber, no?

Hmmm. Take a look here:
http://unicode.org/cldr/utility/properties.jsp

There is a section called Numeric that has 3 properties,
and then there is a General section.
The General has Category which in turn has 'Number' category.

Bottom line is that I believe that std.uni isXXX queries the category of
a symbol and not some other property. Let any mishaps in between
properties and general category be consortium's headache.

Maybe isNumber's "documented behavior" is wrong?

Problem is I can't come up with a good description of some other
behavior. Maybe this one [^[:Numeric_Type=None:]]
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5E%5B%3ANumeric_Type%3DNone%3A%5D%5D&g=

--
Dmitry Olshansky
```
Jan 04 2013
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
```10-Jan-2013 03:21, H. S. Teoh wrote:
On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
[...]
I, for one, would love to know why isNumeric != hasNumericValue.

I guess it's just bad wording from the standard.

The standard defined 3 groups that make up Number:
[Nd] 	Number, Decimal Digit
[Nl] 	Number, Letter
[No] 	Number, Other

However, there are a couple of characters that *are* numbers, but
aren't in those groups.

The "Good" news is that the standard, *does* define number_types to
classify the kind of number a char is:
* Null: Not a number
* Digit: Obvious
* Decimal: Any decimal number that is NOT a digit
* Numeric: Everything else.

So they used "Numeric" as wild, and "Number" as their general
category.

This leaves us with ambiguity when choosing our word:
Technically '5' does not classify as "numeric", although you could
consider it "has a numeric value".

I hope that makes sense.

Hmph. I guess we need to differentiate between the unicode category
called "numeric", and the property of having a numerical value. So we'd
need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's
what the standard is, then that's what it is.

isNumber - _Number_ General category (as defined by Unicode 1:1)

isNumeric - as having NumericType != None (again going by the definition of
Unicode properties)

And that's all, correct and to the letter.

Anyway, I'd love to see std.uni cover all unicode categories.

Offhanded note: should we unify the various isX() functions into:

bool inCategory(string category)(dchar ch)

No, no, no! It's a horrible idea. The main problem with it is: huge
catalog of data has to be stored in Phobos (object code) of no (even
niche) use. Also to be practical for use cases other than casual
observation it has to be fast.. and it can't for any of the useful cases.

Just count the number of bits to store per codepoint and fairly
irregular structure of the whole set of properties (unlike individual
combinations that do have nice distribution e.g. Scripts as in Cyrillic).

I've been shoulder-deep in Unicode for about half a year now, and
reading through TR-xx algorithms and *none* of them requires queries of
the sort that tests all (more than 1-2?) of the properties.

In all cases the algorithm itself defines a set(s) of codepoints with
different meanings/values for this use case. These (useful) sets could
be compressed to a fast multi-stage table, the whole catalog of
properties - no, as it packs enormous heaps of unused junk (Unicode_Age
anyone??). This junk is not fit for std library but the goal is to
provide tool for the user to work with sets/data beyond the commonly
useful in std.

where category is the Unicode designation, say "Nl", "Nd", etc.? That
way, it's more future-proof in case the Unicode guys add more
categories.

I'm posting my work on std.uni as ready for review today or tomorrow.
It includes a type for a set of codepoints and ton of predefined sets
for Nl, Nd and almost everything sensible (blocks, scripts, properties).
The user can then conjure whatever combination required.

And it's still way smaller than having the full 'query the database' thing. To
check the full madness of all of the properties just use the web
interface of unicode.org.

P.S. Hopefully, nobody raises the point of codepoint _names_; they are,
after all, also part of the Unicode standard (and character database).

--
Dmitry Olshansky
```
Jan 10 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
```On Wednesday, 2 January 2013 at 20:49:38 UTC, Dmitry Olshansky
wrote:
Yup, and it's 2 lines then. And if one really wants to chain it:
map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

Hardly makes it Phobos candidate then ;)

Well, just because it's almost trivial to us doesn't mean it hurts
to have it. The fact that you can even operate on chars in such a
fashion (c - '0') is not obvious to everyone: I've seen time and
time again code such as:
//----
if (97 <= c && c <= 122)
c -= 97;
//----

numericValue helps keep things clean and self documented.

What's more, it helps keep ascii complete. Code originally
written for ascii is easily upgradable to support uni (and
vice-versa). Furthermore, *writing* "std.ascii.numericValue"
self documents ascii only support, which is less obvious than
code using "c - '0'":

In the original pull request to "improve" conv.to, the fact that
it did not support unicode didn't even cross our minds. Seeing
"std.ascii.numericValue" raises the eyebrow. It *forces* unicode
consideration (regardless of which is right, it can't be ignored).

Really, by the rationale of "it's 2 lines", we shouldn't even
have "std.ascii.isNumeric" at all...

On Wednesday, 2 January 2013 at 20:13:32 UTC, Dmitry Olshansky
wrote:
1/2/2013 7:24 PM, bearophile wrote:
There is also std.typecons.nullable, it's a possibility for
std.uni.numericValue. Generally Phobos should eat more of its
dog food :-)

double.nan sounds more like it.

Hum... nan. I like it.
```
Jan 02 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
```On Wed, Jan 02, 2013 at 11:15:31PM +0100, monarch_dodra wrote:
On Wednesday, 2 January 2013 at 20:49:38 UTC, Dmitry Olshansky
wrote:
Yup, and it's 2 lines then. And if one really wants to chain it:
map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

Hardly makes it Phobos candidate then ;)

Well, just because it's almost trivial to us doesn't mean it hurts to
have it. The fact that you can even operate on chars in such a
fashion (c - '0') is not obvious to everyone: I've seen time and
time again code such as:
//----
if (97 <= c && c <= 122)
c -= 97;
//----

numericValue helps keep things clean and self documented.

+1. Code intent is important.

[...]
On Wednesday, 2 January 2013 at 20:13:32 UTC, Dmitry Olshansky
wrote:
1/2/2013 7:24 PM, bearophile wrote:
There is also std.typecons.nullable, it's a possibility for
std.uni.numericValue. Generally Phobos should eat more of its
dog food :-)

double.nan sounds more like it.

Hum... nan. I like it.

+1 for nan. It's about time we used nan for something useful beyond just
an annoying default value for floating-point variables. :)

T

--
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
```
Jan 02 2013
"bearophile" <bearophileHUGS lycos.com> writes:
```Dmitry Olshansky:

Yup, and it's 2 lines then. And if one really wants to chain it:
map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

Hardly makes it Phobos candidate then ;)

I think you meant to write:

map(a => enforce(std.ascii.isNumeric(a)), a - '0')(...);

To avoid some bugs I try to not use the comma expression like
that.

Compare that code with:

map!numericValue(...);

Bye,
bearophile
```
Jan 02 2013
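To make the comparison above concrete, here is a small sketch of both styles. It assumes a hypothetical throwing helper `digitValue`, and it uses `std.ascii.isDigit` for the check, since that is the ASCII digit predicate known to exist in Phobos (the thread writes `std.ascii.isNumeric`). The lambda uses two statements instead of the comma expression.

```d
import std.algorithm : equal, map;
import std.ascii : isDigit;
import std.exception : assertThrown, enforce;

/// Hypothetical helper: the two-line enforce idiom behind a name.
int digitValue(dchar c)
{
    enforce(isDigit(c), "not an ASCII digit");
    return cast(int)(c - '0');
}

void main()
{
    // Inline lambda: validation and conversion as separate statements.
    auto viaLambda = "2013".map!((dchar c) {
        enforce(isDigit(c), "not an ASCII digit");
        return cast(int)(c - '0');
    });

    // Named function passed as a template alias argument.
    auto viaName = "2013".map!digitValue;

    assert(equal(viaLambda, [2, 0, 1, 3]));
    assert(equal(viaName, [2, 0, 1, 3]));
    assertThrown(digitValue('x'));
}
```

Both pipelines produce the same values; the named version states the intent at the call site, which is the argument made in the thread.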
"monarch_dodra" <monarchdodra gmail.com> writes:
```On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky
wrote:
Now take this code:
map!numericValue(...)

If the code also happens to import std.uni it's going to stop
compiling.

Hum... We could always "camp" the std.uni's numericValue function?

//----
double numericValue()(dchar c) pure nothrow @safe
{
static assert(false, "Sorry, std.uni.numericValue is not yet implemented");
}
//----

This would avoid the breakage you mentioned.
```
Jan 03 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
```On Thursday, 3 January 2013 at 18:11:45 UTC, Dmitry Olshansky
wrote:
03-Jan-2013 21:13, monarch_dodra wrote:
On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky
wrote:
Now take this code:
map!numericValue(...)

If the code also happens to import std.uni it's going to stop
compiling.

Hum... We could always "camp" the std.uni's numericValue
function?
[SNIP]

We'd pretty much have to.

Or, you know... I could just implement both at the same time.
It's not like there's an *urgency* for the ascii version or
anything. I think I'll just do that.

So... do we agree on
?

I can still get started anyways, even if it isn't definite.
```
Jan 03 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
```On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
[...]
I, for one, would love to know why isNumeric != hasNumericValue.

I guess it's just bad wording from the standard.

The standard defined 3 groups that make up Number:
[Nd] 	Number, Decimal Digit
[Nl] 	Number, Letter
[No] 	Number, Other

However, there are a couple of characters that *are* numbers, but
aren't in those groups.

The "Good" news is that the standard, *does* define number_types to
classify the kind of number a char is:
* Null: Not a number
* Digit: Obvious
* Decimal: Any decimal number that is NOT a digit
* Numeric: Everything else.

So they used "Numeric" as wild, and "Number" as their general
category.

This leaves us with ambiguity when choosing our word:
Technically '5' does not classify as "numeric", although you could
consider it "has a numeric value".

I hope that makes sense.

Hmph. I guess we need to differentiate between the unicode category
called "numeric", and the property of having a numerical value. So we'd
need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's
what the standard is, then that's what it is.

Anyway, I'd love to see std.uni cover all unicode categories.

Offhanded note: should we unify the various isX() functions into:

bool inCategory(string category)(dchar ch)

where category is the Unicode designation, say "Nl", "Nd", etc.? That
way, it's more future-proof in case the Unicode guys add more
categories. Also makes it easier to remember which function to call;
else you'd always have to remember "N" -> isNumeric, "L" -> isAlpha,
etc..

The current names of course can be left as aliases.

T

--
The fact that anyone still uses AOL shows that even the presence of
options doesn't stop some people from picking the pessimal one. - Mike
Ellis
```
Jan 09 2013
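The `inCategory` idea above can be sketched as a template that dispatches at compile time. This version only forwards to the existing `std.uni.isNumber` and `std.uni.isAlpha`, following the "N" and "L" mapping mentioned in the post; the two supported strings and the error message are assumptions made for illustration, not a proposed API.

```d
import std.uni : isAlpha, isNumber;

/// Hypothetical unified query: the category is a compile-time string,
/// so an unsupported category is rejected when the code is compiled.
bool inCategory(string category)(dchar ch)
{
    static if (category == "N")
        return isNumber(ch);
    else static if (category == "L")
        return isAlpha(ch);
    else
        static assert(false, "unsupported category: " ~ category);
}

void main()
{
    assert(inCategory!"N"('5'));
    assert(!inCategory!"N"('a'));
    assert(inCategory!"L"('a'));
    // An unknown category does not compile.
    static assert(!__traits(compiles, inCategory!"Zz"('a')));
}
```

Because each instantiation pulls in only the table it needs, this form does not by itself require storing the whole property catalog; the cost depends on how many categories end up being supported.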
"monarch_dodra" <monarchdodra gmail.com> writes:
```On Thursday, 10 January 2013 at 18:09:31 UTC, Dmitry Olshansky
wrote:
10-Jan-2013 03:21, H. S. Teoh wrote:
On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
[...]
I, for one, would love to know why isNumeric !=
hasNumericValue.

I guess it's just bad wording from the standard.

The standard defined 3 groups that make up Number:
[Nd] 	Number, Decimal Digit
[Nl] 	Number, Letter
[No] 	Number, Other

However, there are a couple of characters that *are* numbers,
but
aren't in those groups.

The "Good" news is that the standard, *does* define
number_types to
classify the kind of number a char is:
* Null: Not a number
* Digit: Obvious
* Decimal: Any decimal number that is NOT a digit
* Numeric: Everything else.

So they used "Numeric" as wild, and "Number" as their general
category.

This leaves us with ambiguity when choosing our word:
Technically '5' does not classify as "numeric", although you
could
consider it "has a numeric value".

I hope that makes sense.

Hmph. I guess we need to differentiate between the unicode
category
called "numeric", and the property of having a numerical
value. So we'd
need both isNumeric and hasNumericValue. Ugh. It's ugly but if
that's
what the standard is, then that's what it is.

isNumber - _Number_ General category (as defined by Unicode 1:1)

isNumeric - as having NumericType != None (again going by the
definition of Unicode properties)

And that's all, correct and to the letter.

Are you sure about that? The four values of Numeric_Type are:
* Decimal
* Digit
* None
* Numeric <= !!!
http://unicode.org/cldr/utility/properties.jsp?a=Numeric_Type#Numeric_Type

Hopefully, we'll have "isDecimal", "isDigit", and eventually
"isNumeric", which according to definition, would simply be
"Numeric_Type == Numeric_Type.Numeric"

The problem is that by the definitions of Unicode properties,
there is no name for "not in Numeric_Type.None"

"hasNumericValue" is the best name I could come up with to
differentiate between "Not Numeric_Type.None" and
"Numeric_Type.Numeric"
```
Jan 10 2013
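The naming distinction drawn above can be written down as a small sketch. The enum mirrors the four Numeric_Type values listed in the post; `numericTypeOf` is a hypothetical lookup covering only a handful of hand-picked characters, where real code would use tables generated from the Unicode Character Database.

```d
/// The four values of the Unicode Numeric_Type property.
enum NumericType { none, decimal, digit, numeric }

/// Hypothetical lookup with a few illustrative entries only.
NumericType numericTypeOf(dchar c) pure nothrow @safe
{
    switch (c)
    {
        case '0': .. case '9':
            return NumericType.decimal;  // ASCII digits
        case '\u00B2':
            return NumericType.digit;    // SUPERSCRIPT TWO
        case '\u00BD':
            return NumericType.numeric;  // VULGAR FRACTION ONE HALF
        default:
            return NumericType.none;
    }
}

/// "isNumeric" in the strict sense: Numeric_Type == Numeric.
bool isNumeric(dchar c) pure nothrow @safe
{
    return numericTypeOf(c) == NumericType.numeric;
}

/// "hasNumericValue": anything that is not Numeric_Type == None.
bool hasNumericValue(dchar c) pure nothrow @safe
{
    return numericTypeOf(c) != NumericType.none;
}

void main()
{
    assert(hasNumericValue('5') && !isNumeric('5'));
    assert(hasNumericValue('\u00BD') && isNumeric('\u00BD'));
    assert(!hasNumericValue('a'));
}
```

With these definitions '5' has a numeric value without being "Numeric", which is exactly the ambiguity the post describes.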
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
```On Thu, Jan 03, 2013 at 08:40:47PM +0100, monarch_dodra wrote:
[...]
Or, you know... I could just implement both at the same time. It's
not like there's an *urgency* for the ascii version or anything. I
think I'll just do that.

So... do we agree on

LGTM. :)

I did think of what might happen if somebody wrote an int cast for
std.uni.numericValue:

void sloppyProgrammersFunction(dchar ch) {
// First attempt: compiler error: can't implicitly
// convert double -> int ...
//int val = std.uni.numericValue(ch);

// ... so sloppy programmer inserts a cast
int val = cast(int)std.uni.numericValue(ch);

// On Linux/64, if numericValue returns nan, this prints
// -int.max.
writeln(val);

// So this should work:
if (val < 0) {
// (In fact, it will still work if
writeln("Sloppy code caught the problem correctly!");
}
}

So it seems that everything should be alright.

This particular example occurred to me, 'cos I'm thinking of how often
one wishes to extract an integral value from a string, and usually one
doesn't think that floating point is necessary(!), so the cast from
double is a rather big temptation (even though it's wrong!).

T

--
Tell me and I forget. Teach me and I remember. Involve me and I understand. --
Benjamin Franklin
```
Jan 03 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
```On Thursday, 3 January 2013 at 21:51:14 UTC, H. S. Teoh wrote:
On Thu, Jan 03, 2013 at 08:40:47PM +0100, monarch_dodra wrote:
[...]
Or, you know... I could just implement both at the same time.
It's
not like there's an *urgency* for the ascii version or
anything. I
think I'll just do that.

So... do we agree on

LGTM. :)

I did think of what might happen if somebody wrote an int cast
for
std.uni.numericValue
[SNIP]
writeln("Sloppy code caught the problem correctly!");

... almost! 1e12 will have a negative value when cast to int. To
be 100% correct in regards to converting, the end user would have
to use long.

But that'd be a *really exceptional* case behavior...

Even with long, the only problem with the code is that the user
would not know the difference between exact integral, and inexact
integral. Well, that's what the user gets for being sloppy I
guess.

In any case, I think we'd have to provide an example section with
a "recommended" way for casting to integral.
```
Jan 04 2013
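One possible shape for such a "recommended" conversion is sketched below. The helper name `toExactIntegral` and the choice of `Nullable!long` as the return type are assumptions; the idea is simply to reject NaN, fractional values, and anything too large before casting, instead of relying on what a raw cast happens to produce.

```d
import std.math : floor, isNaN;
import std.typecons : Nullable;

/// Hypothetical helper: yields a long only when `value` is an exact
/// integer that fits comfortably in a long; otherwise yields null.
Nullable!long toExactIntegral(double value)
{
    // NaN marks "no numeric value" in the design discussed above.
    if (isNaN(value))
        return Nullable!long.init;
    // Conservative bound; also rejects +/- infinity.
    if (value < -9.0e18 || value > 9.0e18)
        return Nullable!long.init;
    // Fractions such as 0.5 are not integral.
    if (value != floor(value))
        return Nullable!long.init;
    return Nullable!long(cast(long) value);
}

void main()
{
    assert(toExactIntegral(7.0).get == 7);
    assert(toExactIntegral(1.0e12).get == 1_000_000_000_000L);
    assert(toExactIntegral(0.5).isNull);
    assert(toExactIntegral(double.nan).isNull);
}
```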
"monarch_dodra" <monarchdodra gmail.com> writes:
```On Thursday, 3 January 2013 at 20:14:43 UTC, Dmitry Olshansky
wrote:
It's just an idea that I have an exceptionally fast version for
Unicode just around the corner, but I wouldn't mind some
competition ;)

Well, I already mentioned to you how I was planning to do it:
Just stupid binary search over ranges of numbers indexed on 0.

The "big" chunk of work, actually (IMO), is just creating the raw
data...
```
Jan 04 2013
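The binary search described above might look roughly like the sketch below. This is not the code from the pull request. The table is an assumption for illustration and holds only three runs of decimal digits whose first code point has value 0; `DigitRun` and `sketchNumericValue` are hypothetical names. A real table would be generated from the Unicode data files.

```d
import std.math : isNaN;
import std.stdio : writeln;

/// A run of consecutive code points whose first member has value 0.
struct DigitRun
{
    dchar first;
    dchar last;
}

// Illustrative subset only; must stay sorted by code point.
immutable DigitRun[] runs = [
    DigitRun('0', '9'),             // ASCII digits
    DigitRun('\u0660', '\u0669'),   // Arabic-Indic digits
    DigitRun('\u0966', '\u096F'),   // Devanagari digits
];

/// Binary search over the runs; returns nan when `c` is in none of them.
double sketchNumericValue(dchar c)
{
    size_t lo = 0, hi = runs.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (c < runs[mid].first)
            hi = mid;
        else if (c > runs[mid].last)
            lo = mid + 1;
        else
            return cast(double)(c - runs[mid].first); // offset from the 0 digit
    }
    return double.nan;
}

void main()
{
    assert(sketchNumericValue('7') == 7);
    assert(sketchNumericValue('\u0663') == 3);
    assert(isNaN(sketchNumericValue('A')));
    writeln("lookup sketch ok");
}
```

Characters with isolated or fractional values would need a second table that stores the value explicitly, which is where most of the raw-data work would go.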
Jonathan M Davis <jmdavisProg gmx.com> writes:
```On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
So... do we agree on

I'm not a fan of the ASCII version returning -1, but I don't really have a
better suggestion. I suppose that you could throw instead, but I don't know if
that's a good idea or not. It _would_ be more consistent with our other
conversion functions however.

- Jonathan M Davis
```
Jan 04 2013
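For reference, the -1 convention under discussion for the ASCII version could be sketched as follows. The name `asciiNumericValue` is hypothetical and only follows the signature proposed at the top of the thread; it is not an existing std.ascii function.

```d
import std.stdio : writeln;

/// Hypothetical sketch of the proposed std.ascii behaviour:
/// 0 through 9 for the ASCII digits, -1 for everything else.
int asciiNumericValue(dchar c) pure nothrow @safe
{
    return (c >= '0' && c <= '9') ? cast(int)(c - '0') : -1;
}

void main()
{
    assert(asciiNumericValue('0') == 0);
    assert(asciiNumericValue('9') == 9);
    assert(asciiNumericValue('a') == -1);
    writeln("ascii sketch ok");
}
```

Since no ASCII digit has a negative value, a caller can test `< 0` without the function needing to throw.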
"monarch_dodra" <monarchdodra gmail.com> writes:
```On Friday, 4 January 2013 at 13:18:48 UTC, Dmitry Olshansky wrote:
04-Jan-2013 15:58, Jonathan M Davis пишет:
On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
So... do we agree on

I'm not a fan of the ASCII version returning -1, but I don't
really have a better suggestion. I suppose that you could throw
instead, but I don't know if that's a good idea or not. It
_would_ be more consistent with our other conversion functions
however.

- Jonathan M Davis

I find low-level stuff that throws to be overly awkward to deal
with (not to mention performance problems).

Hm... I've found a brilliant primitive Expected!T that could
be of great help in error code vs exceptions problem. See the
recent Andrei's talk that went live not long ago:

http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Andrei-Alexandrescu-Systematic-Error-Handling-in-C

Time to put the analogous stuff into Phobos?

I finished an implementation:

https://github.com/D-Programming-Language/phobos/pull/1052

It is not "pull ready", so we can still discuss it.

I raised a couple of issues in the pull, which I'll copy here:

//----
I did run into a couple of issues, namely that I'm not getting
100% equivalence between chars that are numeric, and chars with
numeric value... Is this normal...?

* There's a fair bit of chars that have numeric value, but aren't
isNumber. I think they might be new in 6.1.0. But I'm not sure. I
decided it was best to have them return nan, instead of having
inconsistent behavior.
* There's a couple characters in tableLo that have numeric
values. These aren't considered in isNumber either. I think this
might be a bug though.
* There are 4 "non-number numeric" characters in "CUNEIFORM
NUMERIC SIGN". These return wild values, and in particular two of
them return -1. I *think* this should actually return nan for us,
because (AFAIK), -1 is just wild for invalid :/

Maybe we should just return -1 on invalid unicode? Or maybe it's
just my input file:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
It doesn't have a separate field for isNumber/numericValue, so it
is forced to write a wild number. Maybe these four chars should
return nan?
//----

Oh yeah, I also added isNumber to std.ascii. Feels wrong to not
have it if we have numericValue.
```
Jan 04 2013
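As background for reading UnicodeData.txt: each record is a semicolon-separated line, and in the documented layout the fields at index 6, 7 and 8 hold the decimal, digit and numeric values, with fractions written as "1/2" and an empty field meaning no value. The sketch below parses the numeric field under that assumption. `parseNumericField` is a hypothetical helper, and the two sample records are the entries for U+0035 and U+00BD.

```d
import std.array : split;
import std.conv : to;
import std.math : isNaN;
import std.stdio : writeln;
import std.string : indexOf;

/// Hypothetical helper: turns the numeric field ("5", "1/2", "-1/2" or "")
/// into a double, with nan for an empty field.
double parseNumericField(string field)
{
    if (field.length == 0)
        return double.nan;
    immutable slash = field.indexOf('/');
    if (slash < 0)
        return field.to!double;
    return field[0 .. slash].to!double / field[slash + 1 .. $].to!double;
}

void main()
{
    enum five = "0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;";
    enum half = "00BD;VULGAR FRACTION ONE HALF;No;0;ON;"
              ~ "<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;";

    auto f = five.split(";");
    assert(f[2] == "Nd" && parseNumericField(f[8]) == 5);

    auto h = half.split(";");
    assert(h[2] == "No" && h[6].length == 0 && parseNumericField(h[8]) == 0.5);

    assert(isNaN(parseNumericField("")));
    writeln("record parsing sketch ok");
}
```

The general category sits in a different field (index 2) from the numeric value, so the two can disagree for a given code point.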
"monarch_dodra" <monarchdodra gmail.com> writes:
```On Friday, 4 January 2013 at 17:48:28 UTC, monarch_dodra wrote:
//----
Maybe we should just return -1 on invalid unicode? Or maybe
it's just my input file:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
It doesn't have a separate field for isNumber/numericValue, so
it is forced to write a wild number. Maybe these four chars
should return nan?

Wait: I figured it out: They are just non-numbers that happen to
be inside Nl (Number Letter):
http://unicode.org/cldr/utility/character.jsp?a=12433

Documentation on this is not very clear, nor consistent, so sorry
for any confusion.

Well, I guess there is a bug in std.uni.isNumber then...
```
Jan 04 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
```On Friday, 4 January 2013 at 20:33:12 UTC, Dmitry Olshansky wrote:
04-Jan-2013 21:48, monarch_dodra пишет:
I finished an implementation:

https://github.com/D-Programming-Language/phobos/pull/1052

It is not "pull ready", so we can still discuss it.

Well, for start it features tons of code duplication. But I'm
replacing the whole std.uni anyway...

Well, I wrote that with duplication, keeping in mind you would
probably replace both. I thought it would be cleaner to have some
duplication than a warped single implementation. I could also
make the extra effort. I was really concerned with first having
an implementation that is unicode correct.

I also thought that, at worst, you could use my parsed data ;) to

* There's a couple characters in tableLo that have numeric
values. These aren't considered in isNumber either. I think this
might be a bug though.
* There are 4 "non-number numeric" characters in "CUNEIFORM
NUMERIC SIGN". These return wild values, and in particular two of
them return -1. I *think* this should actually return nan for us,
because (AFAIK), -1 is just wild for invalid :/

Some have numeric value of '-1' I think. The truth of the
matter is as usual with Unicode things are rather complicated.
So 'numeric character' is a category (general) and 'has numeric
value' is some other property of codepoint that may or may not
correlate directly with category.

isNumber is correct as it follows its documented behavior.

Maybe we should just return -1 on invalid unicode? Or maybe it's
just my input file:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
It doesn't have a separate field for isNumber/numericValue, so it
is forced to write a wild number. Maybe these four chars should
return nan?

Nope. Does letter 'A' return a wild number?

Well, the thing is that I'm getting contradictory info from the
consortium itself:
Given 0x12456: "CUNEIFORM NUMERIC SIGN NIGIDAMIN"
According to the "UnicodeData.txt", its numeric value is -1.
According to the "Unicode utilities", it is not a numeric type,
and its value is null:
http://unicode.org/cldr/utility/character.jsp?a=12456

Also according to the consortium: "-1" is an illegal numeric
value.
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Value=-1:]

Really, all the info seems to indicate a bug in UnicodeData.txt:
They really seem like 4 entries in Nl that aren't numbers.

I've found a couple of people on the internet discussing this,
but no hard conclusion :/

****

Anyways, those 4 CUNEIFORM aside, what do you make of the
entries in Lo:
http://unicode.org/cldr/utility/character.jsp?a=F96B
These appear to be numeric, but aren't inside Nd/No/Nl. They
should return true to isNumber, no?

Maybe isNumber's "documented behavior" is wrong?
```
Jan 04 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
```On Friday, 4 January 2013 at 22:00:02 UTC, Dmitry Olshansky wrote:
05-Jan-2013 00:51, monarch_dodra пишет:
Anyways, those 4 CUNEIFORM aside, what do you make of the
entries in Lo:
http://unicode.org/cldr/utility/character.jsp?a=F96B
These appear to be numeric, but aren't inside Nd/No/Nl. They
should return true to isNumber, no?

Hmmm. Take a look here:
http://unicode.org/cldr/utility/properties.jsp

There is a section called Numeric that has 3 properties,
and then there is a General section.
The General has Category which in turn has 'Number' category.

Bottom line is that I believe that std.uni isXXX queries the
category of a symbol and not some other property. Let any
mishaps in between properties and general category be
Maybe isNumber's "documented behavior" is wrong?

Problem is I can't come up with a good description of some
other behavior. Maybe this one [^[:Numeric_Type=None:]]
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5E%5B%3ANumeric_Type%3DNone%3A%5D%5D&g=

Sounds like the root of the problem is that isNumber !=
Numeric_Type[Decimal, Digit, Numeric]

Ergo, there is no correlation between isNumber and numericValue.

Feels like there is a lot missing from std.uni, but at the same
time, unicode is really huge.

At the very least, I think we should have Category enum, along
with a (get) "category" function.

I was just saying to jmdavis in the pull that std.ascii had
"isDigit", but that uni didn't. In truth, both also lack
isDecimal and isNumeric.

There would just be a bit of ambiguity now between the broad
"isNumeric", and "all the chars that have a numeric value"... :/

Damn. Unicode is complicated.

Anyways, taking my weekend break.
```
Jan 04 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
```On Fri, Jan 04, 2013 at 11:48:39PM +0100, monarch_dodra wrote:
[...]
Sounds like the root of the problem is that isNumber !=
Numeric_Type[Decimal, Digit, Numeric]

Ergo, there is no correlation between isNumber and numericValue.

Yikes. That's pretty ... nasty. :-(

Feels like there is a lot missing from std.uni, but at the same
time, unicode is really huge.

Yeah, Unicode is a lot more complex than most people realize. Recently I
read through TR14 (proper line-breaking in Unicode), and I was gaping in
awe at the insane complexity of such a seemingly-simple task.

At the very least, I think we should have Category enum, along with a
(get) "category" function.

Yes! We need that!!

I was just saying to jmdavis in the pull that std.ascii had
"isDigit", but that uni didn't. In truth, both also lack isDecimal
and isNumeric.

There would just be a bit of ambiguity now between the broad
"isNumeric", and "all the chars that have a numeric value"... :/

Damn. Unicode is complicated.

I, for one, would love to know why isNumeric != hasNumericValue.

T

--
Valentine's Day: an occasion for florists to reach into the wallets of
nominal lovers in dire need of being reminded to profess their
hypothetical love for their long-forgotten.
```
Jan 04 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
```On Friday, 4 January 2013 at 22:00:02 UTC, Dmitry Olshansky wrote:
[SNIP]

Thank you for all your feedback.

*everything* makes sense now.

However, the conclusion I'm coming to is that some groundwork is
needed before doing numericValue, which I am currently doing.
```
Jan 07 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
```On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
[...]

I, for one, would love to know why isNumeric != hasNumericValue.

T

I guess it's just bad wording from the standard.

The standard defined 3 groups that make up Number:
[Nd] 	Number, Decimal Digit
[Nl] 	Number, Letter
[No] 	Number, Other

However, there are a couple of characters that *are* numbers, but
aren't in those groups.

The "Good" news is that the standard *does* define a Numeric_Type
property to classify the kind of number a char is:
* None: Not a number
* Decimal: A digit used in a decimal positional system
* Digit: A digit-valued char that is NOT Decimal (superscripts,
for example)
* Numeric: Everything else that has a numeric value.

So they used "Numeric" as the catch-all type, and "Number" as
their general category.

This leaves us with ambiguity when choosing our word:
Technically '5' does not classify as "Numeric", although you
could consider that it "has a numeric value".

I hope that makes sense.
```
Jan 07 2013