digitalmars.D - numericValue for (unicode) characters

monarch_dodra (51/51) Jan 02 2013 There is an ER that would allow converting characters to numbers:

bearophile (13/16) Jan 02 2013 For the ASCII version I have two use cases:

Dmitry Olshansky (14/28) Jan 02 2013 Then we can maybe just drop this function? What's wrong with

Andrei Alexandrescu (5/25) Jan 02 2013 Unnecessary flow :o).

Dmitry Olshansky (6/31) Jan 02 2013 Yup, and it's 2 lines then. And if one really wants to chain it:

monarch_dodra (25/34) Jan 02 2013 Well, just because it's almost trivial to us doesn't mean it hurts

H. S. Teoh (8/37) Jan 02 2013 +1. Code intent is important.
Dmitry Olshansky (15/40) Jan 03 2013 I don't mind adding because of completeness and/or symmetry stand point

monarch_dodra (11/15) Jan 03 2013 Hum... We could always "camp" the std.uni's numericValue function?

Dmitry Olshansky (4/17) Jan 03 2013 We'd pretty much have to.

monarch_dodra (10/23) Jan 03 2013 Or, you know... I could just implement both at the same time.

Dmitry Olshansky (6/27) Jan 03 2013 It's just an idea that I have exceptionally fast version for Unicode

monarch_dodra (6/9) Jan 04 2013 Well, I already mentioned to you how I was planning to do it:

H. S. Teoh (30/37) Jan 03 2013 [...]

monarch_dodra (11/29) Jan 04 2013 ... almost! 1e12 will have a negative value when cast to int. To

Jonathan M Davis (6/9) Jan 04 2013 I'm not a fan of the ASCII version returning -1, but I don't really have...

Dmitry Olshansky (10/19) Jan 04 2013 I find low-level stuff that throws to be overly awkward to deal with

monarch_dodra (29/51) Jan 04 2013 I finished an implementation:

monarch_dodra (7/14) Jan 04 2013 Wait: I figured it out: They are just non-numbers that happen to
Dmitry Olshansky (15/68) Jan 04 2013 Well, for start it features tons of code duplication. But I'm replacing

monarch_dodra (29/67) Jan 04 2013 Well, I wrote that with duplication, keeping in mind you would

Dmitry Olshansky (21/87) Jan 04 2013 Basically check the bottom of that page:

monarch_dodra (15/35) Jan 04 2013 Sounds like the root of the problem is that isNumber !=

H. S. Teoh (14/30) Jan 04 2013 Yikes. That's pretty ... nasty. :-(

monarch_dodra (20/23) Jan 07 2013 I guess it's just bad wording from the standard.

H. S. Teoh (20/48) Jan 09 2013 Hmph. I guess we need to differentiate between the unicode category

Dmitry Olshansky (33/73) Jan 10 2013 isNumber - _Number_ General category (as defined by Unicode 1:1)

monarch_dodra (16/63) Jan 10 2013 Are you sure about that? The four values of Numeric_Type are:

monarch_dodra (6/7) Jan 07 2013 Thank you for all your feed back.

bearophile (9/12) Jan 02 2013 I think you meant to write:

"monarch_dodra" <monarchdodra gmail.com> writes:

There is an ER that would allow converting characters to numbers:
http://d.puremagic.com/issues/show_bug.cgi?id=5543

For example: '1' => 1
Or, unicode considered: 'Ⅶ' => 7

Long story short, it was decided that it wasn't std.conv.to's job 
to do this conversion, but rather, there should be a function 
called "numericValue" inside std.uni and std.ascii that would do 
this job.

What remains is defining how these functions should work. Things
to keep in mind:
- ASCII to int should be fast.
- unicode numeric values span from -0.5 to 1.0e12.
- unicode numeric values can be fractional.
- ALL unicode numeric values can be EXACTLY represented in a 
double.

Given these observations, I'd like to propose these:

//------------------------------
//std.ascii.numericValue
/** Given an ascii character, returns that character's
     numeric value if it is numeric ($(D isNumeric)),
     and -1 otherwise
  */
pure @safe nothrow
int numericValue(dchar c);
//------------------------------
//std.uni.numericValue
/** Given a unicode character, returns that character's
     numeric value if it is numeric ($(D isNumeric)),
     and throws an exception otherwise
  */
pure @safe
double numericValue(dchar c);
//------------------------------

The rationale for this:
std.ascii: I think returning -1 as a magic number should help 
keep the code faster and with less clutter than with exceptions. 
Returning an int is the obvious choice for numbers that span -1
to 9.

std.uni: double is the only type that can hold all ranges of 
unicode's numeric values.
This time, uni throws exceptions. This is for two reasons:
1. Choosing a magic number is difficult, and error prone. Correct 
code would have to look like: "if (std.uni.numericValue(c) > 
-0.7) {...}"
2. When dealing with unicode, an exception is probably cleaner,
and its overhead is not as critical as with ascii.

***********************************************
Thoughts?

I wanted to get this ER moved forward. I don't think 
uni.numericValue will be finished soon, but I would have wanted 
std.ascii's done sooner rather than later.
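
As a rough illustration of the proposed ASCII behaviour (not the eventual Phobos code), the -1 convention could be sketched as follows. The name asciiNumericValue is hypothetical, and std.ascii.isDigit stands in for the isNumeric check mentioned above:

```d
import std.ascii : isDigit;

/// Hypothetical sketch of the proposed std.ascii.numericValue:
/// returns 0..9 for an ASCII digit and -1 for anything else.
int asciiNumericValue(dchar c) pure nothrow @safe
{
    // c - '0' is only meaningful once we know c is in '0'..'9'.
    return isDigit(c) ? cast(int)(c - '0') : -1;
}
```

A caller would then test the result against -1 before using it, which is the "magic number" trade-off discussed in the rationale.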

Jan 02 2013

"bearophile" <bearophileHUGS lycos.com> writes:

monarch_dodra:

 The rationale for this:
 std.ascii: I think returning -1 as a magic number should help 
 keep the code faster and with less clutter than with exceptions.

For the ASCII version I have two use cases:
- Where I want to go fast&unsafe I just use "c - '0'".
- When I want more safety I'd like to use something like to!(),
that raises exceptions in case of errors.

A function that works on ASCII and returns -1 doesn't give me 
much more than "c - '0'". So maybe exceptions are good in the 
ASCII case too.

There is also std.typecons.nullable, it's a possibility for 
std.uni.numericValue. Generally Phobos should eat more of its dog 
food :-)

Bye,
bearophile

Jan 02 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

1/2/2013 7:24 PM, bearophile wrote:
 monarch_dodra:

 The rationale for this:
 std.ascii: I think returning -1 as a magic number should help keep the
 code faster and with less clutter than with exceptions.

 For the ASCII version I have two use cases:
 - Where I want to go fast&unsafe I just use "c - '0'".
 - When I want more safety I'd like to use something like to!(), that
 raises exceptions in case of errors.

 A function that works on ASCII and returns -1 doesn't give me much more
 than "c - '0'". So maybe exceptions are good in the ASCII case too.

Then we can maybe just drop this function? What's wrong with
if(std.ascii.isNumeric(a))
    a -= '0';
else
    enforce(false);

I mean that the time to look it up in the standard library is much bigger
than the time to roll your own with either of the 2 semantics.

Unlike the unicode version, of course. Then IMO having the std.ascii one 
is mostly just for symmetry and thus I think that both should just use 
some sentinel value.

 There is also std.typecons.nullable, it's a possibility for
 std.uni.numericValue. Generally Phobos should eat more of its dog food :-)

double.nan sounds more like it.

 Bye,
 bearophile


-- 
Dmitry Olshansky

Jan 02 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/2/13 3:13 PM, Dmitry Olshansky wrote:
 1/2/2013 7:24 PM, bearophile wrote:
 monarch_dodra:

 The rationale for this:
 std.ascii: I think returning -1 as a magic number should help keep the
 code faster and with less clutter than with exceptions.

 For the ASCII version I have two use cases:
 - Where I want to go fast&unsafe I just use "c - '0'".
 - When I want more safety I'd like to use something like to!(), that
 raises exceptions in case of errors.

 A function that works on ASCII and returns -1 doesn't give me much more
 than "c - '0'". So maybe exceptions are good in the ASCII case too.

 Then we can maybe just drop this function? What's wrong with
 if(std.ascii.isNumeric(a))
 a -= '0';
 else
 enforce(false);

Unnecessary flow :o).

enforce(std.ascii.isNumeric(a));
a -= '0';


Andrei

Jan 02 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

1/3/2013 12:21 AM, Andrei Alexandrescu wrote:
 On 1/2/13 3:13 PM, Dmitry Olshansky wrote:
 1/2/2013 7:24 PM, bearophile wrote:
 monarch_dodra:

 The rationale for this:
 std.ascii: I think returning -1 as a magic number should help keep the
 code faster and with less clutter than with exceptions.

 For the ASCII version I have two use cases:
 - Where I want to go fast&unsafe I just use "c - '0'".
 - When I want more safety I'd like to use something like to!(), that
 raises exceptions in case of errors.

 A function that works on ASCII and returns -1 doesn't give me much more
 than "c - '0'". So maybe exceptions are good in the ASCII case too.

 Then we can maybe just drop this function? What's wrong with
 if(std.ascii.isNumeric(a))
 a -= '0';
 else
 enforce(false);

 Unnecessary flow :o).

 enforce(std.ascii.isNumeric(a));
 a -= '0';

Yup, and it's 2 lines then. And if one really wants to chain it:
map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

Hardly makes it Phobos candidate then ;)
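
For reference, here is a compilable sketch of the enforce-then-subtract idiom, together with a chained variant. The helper names checkedDigit and digitsOf are made up for illustration, and std.ascii.isDigit is used as the digit check; this is one possible spelling, not a proposed Phobos API:

```d
import std.algorithm : map;
import std.array : array;
import std.ascii : isDigit;
import std.exception : enforce;

/// Hypothetical helper: throws unless c is an ASCII digit,
/// otherwise returns its value 0..9.
int checkedDigit(dchar c)
{
    enforce(isDigit(c), "not an ASCII digit");
    return cast(int)(c - '0');
}

/// Chained form: convert every character of s, throwing on the
/// first character that is not an ASCII digit.
int[] digitsOf(string s)
{
    return s.map!checkedDigit.array;
}
```

With these definitions, digitsOf("2013") yields the four digit values, while an input such as "20x3" throws once the 'x' is reached.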


-- 
Dmitry Olshansky

Jan 02 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Wednesday, 2 January 2013 at 20:49:38 UTC, Dmitry Olshansky 
wrote:
 Yup, and it's 2 lines then. And if one really wants to chain it:
 map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

 Hardly makes it Phobos candidate then ;)

Well, just because it's almost trivial to us doesn't mean it hurts 
to have it. The fact that you can even operate on chars in such a 
fashion (c - '0') is not obvious to everyone: I've seen time and 
time again code such as:
//----
if (97 <= c && c <= 122)
     c -= 97;
//----

numericValue helps keep things clean and self documented.

What's more, it helps keep ascii complete. Code originally
written for ascii is easily upgradable to support uni (and
vice-versa). Furthermore, *writing* "std.ascii.numericValue"
self-documents ascii-only support, which is less obvious in
code using "c - '0'":

In the original pull request to "improve" conv.to, the fact that 
it did not support unicode didn't even cross our minds. Seeing 
"std.ascii.numericValue" raises the eyebrow. It *forces* unicode 
consideration (regardless of which is right, it can't be ignored).

Really, by the rationale of "it's 2 lines", we shouldn't even 
have "std.ascii.isNumeric" at all...

On Wednesday, 2 January 2013 at 20:13:32 UTC, Dmitry Olshansky 
wrote:
 1/2/2013 7:24 PM, bearophile wrote:
 There is also std.typecons.nullable, it's a possibility for
 std.uni.numericValue. Generally Phobos should eat more of its 
 dog food :-)

 double.nan sounds more like it.

Hum... nan. I like it.

Jan 02 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Jan 02, 2013 at 11:15:31PM +0100, monarch_dodra wrote:
 On Wednesday, 2 January 2013 at 20:49:38 UTC, Dmitry Olshansky
 wrote:
Yup, and it's 2 lines then. And if one really wants to chain it:
map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

Hardly makes it Phobos candidate then ;)

 
 Well, just because it's almost trivial to us doesn't mean it hurts to
 have it. The fact that you can even operate on chars in such a
 fashion (c - '0') is not obvious to everyone: I've seen time and
 time again code such as:
 //----
 if (97 <= c && c <= 122)
     c -= 97;
 //----
 
 numericValue helps keep things clean and self documented.

+1. Code intent is important.


[...]
 On Wednesday, 2 January 2013 at 20:13:32 UTC, Dmitry Olshansky
 wrote:
1/2/2013 7:24 PM, bearophile wrote:
There is also std.typecons.nullable, it's a possibility for
std.uni.numericValue. Generally Phobos should eat more of its
dog food :-)

double.nan sounds more like it.

 
 Hum... nan. I like it.

+1 for nan. It's about time we used nan for something useful beyond just
an annoying default value for floating-point variables. :)


T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG

Jan 02 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

1/3/2013 2:15 AM, monarch_dodra wrote:
 On Wednesday, 2 January 2013 at 20:49:38 UTC, Dmitry Olshansky wrote:
 Yup, and it's 2 lines then. And if one really wants to chain it:
 map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

 Hardly makes it Phobos candidate then ;)

 Well, just because it's almost trivial to us doesn't mean it hurts to
 have it. The fact that you can even operate on chars in such a fashion
 (c - '0') is not obvious to everyone: I've seen time and time again code
 such as:
 //----
 if (97 <= c && c <= 122)
      c -= 97;
 //----

 numericValue helps keep things clean and self documented.

 What's more, it helps keep ascii complete. Code originally written for
 ascii is easily upgradable to support uni (and vice-versa). Furthermore,
 *writing* "std.ascii.numericValue" self-documents ascii-only support,
 which is less obvious in code using "c - '0'":

 In the original pull request to "improve" conv.to, the fact that it did
 not support unicode didn't even cross our minds. Seeing
 "std.ascii.numericValue" raises the eyebrow. It *forces* unicode
 consideration (regardless of which is right, it can't be ignored).

 Really, by the rationale of "it's 2 lines", we shouldn't even have
 "std.ascii.isNumeric" at all...

I don't mind adding it from a completeness and/or symmetry standpoint,
as I said.

I do see another cool issue popping up though. It's a problem of how the 
anti-hijacking works. Say we add numericValue right now to std.ascii but 
not std.uni. A release later we have numericValue in std.uni (well 
hopefully they are both in the same 2.062 ;) ).

Now take this code:
map!numericValue(...)

If the code also happens to import std.uni it's going to stop compiling.

That's one of the reasons I think our hopes for stability (as in: compiles
5 years from now) are misplaced, as we can't have it until the library
is essentially set in stone.


-- 
Dmitry Olshansky

Jan 03 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky 
wrote:
 Now take this code:
 map!numericValue(...)

 If the code also happens to import std.uni it's going to stop 
 compiling.

Hum... We could always "camp" the std.uni's numericValue function?

//----
double numericValue()(dchar c) nothrow @safe
{
     static assert(false, "Sorry, std.uni.numericValue is not yet implemented");
}
//----

This would avoid the breakage you mentioned.

Jan 03 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

03-Jan-2013 21:13, monarch_dodra wrote:
 On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky wrote:
 Now take this code:
 map!numericValue(...)

 If the code also happens to import std.uni it's going to stop compiling.

 Hum... We could always "camp" the std.uni's numericValue function?

 //----
 double numericValue()(dchar c) nothrow @safe
 {
      static assert(false, "Sorry, std.uni.numericValue is not yet implemented");
 }
 //----

We'd pretty much have to.


-- 
Dmitry Olshansky

Jan 03 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 3 January 2013 at 18:11:45 UTC, Dmitry Olshansky 
wrote:
 03-Jan-2013 21:13, monarch_dodra wrote:
 On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky 
 wrote:
 Now take this code:
 map!numericValue(...)

 If the code also happens to import std.uni it's going to stop 
 compiling.

 Hum... We could always "camp" the std.uni's numericValue 
 function?
 [SNIP]

 We'd pretty much have to.

Or, you know... I could just implement both at the same time. 
It's not like there's an *urgency* for the ascii version or 
anything. I think I'll just do that.

So... do we agree on
ascii: int - not found => -1
uni: double - not found => nan
?

I can still get started anyways, even if it isn't definite.

Jan 03 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

03-Jan-2013 23:40, monarch_dodra wrote:
 On Thursday, 3 January 2013 at 18:11:45 UTC, Dmitry Olshansky wrote:
 03-Jan-2013 21:13, monarch_dodra wrote:
 On Thursday, 3 January 2013 at 08:23:06 UTC, Dmitry Olshansky wrote:
 Now take this code:
 map!numericValue(...)

 If the code also happens to import std.uni it's going to stop
 compiling.

 Hum... We could always "camp" the std.uni's numericValue function?
 [SNIP]

 We'd pretty much have to.

 Or, you know... I could just implement both at the same time. It's not
 like there's an *urgency* for the ascii version or anything. I think
 I'll just do that.

 So... do we agree on
 ascii: int - not found => -1
 uni: double - not found => nan
 ?

Fine by me.

 I can still get started anyways, even if it isn't definite.

It's just that I have an exceptionally fast version for Unicode
just around the corner, but I wouldn't mind some competition ;)

-- 
Dmitry Olshansky

Jan 03 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 3 January 2013 at 20:14:43 UTC, Dmitry Olshansky 
wrote:
 It's just that I have an exceptionally fast version for
 Unicode just around the corner, but I wouldn't mind some
 competition ;)

Well, I already mentioned to you how I was planning to do it: 
Just stupid binary search over ranges of numbers indexed on 0.

The "big" chunk of work, actually (IMO), is just creating the raw 
data...
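
To make the idea concrete, here is a minimal sketch of a binary search over sorted ranges of code points. Everything in it is an assumption for illustration: the NumericRange layout, the three-entry sampleTable (a real table would be generated from UnicodeData.txt), and the function names. Each range records the value of its first code point, and values increase by one along the range:

```d
import std.math : isNaN;

/// One contiguous run of code points whose values increase by one.
struct NumericRange
{
    dchar first;   // first code point of the run
    dchar last;    // last code point of the run (inclusive)
    double base;   // numeric value of `first`
}

// Tiny hand-written table for illustration; must stay sorted by `first`.
immutable NumericRange[] sampleTable = [
    NumericRange('0', '9', 0),            // ASCII digits
    NumericRange('\u0660', '\u0669', 0),  // ARABIC-INDIC DIGIT ZERO..NINE
    NumericRange('\u2160', '\u216B', 1),  // ROMAN NUMERAL ONE..TWELVE
];

/// Binary search for the range containing c; nan when there is none.
double lookupNumericValue(dchar c) pure nothrow @safe
{
    size_t lo = 0, hi = sampleTable.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        immutable r = sampleTable[mid];
        if (c < r.first)
            hi = mid;
        else if (c > r.last)
            lo = mid + 1;
        else
            return r.base + (c - r.first);
    }
    return double.nan;
}

/// Convenience wrapper around the nan sentinel.
bool hasNumericValue(dchar c) pure nothrow @safe
{
    return !isNaN(lookupNumericValue(c));
}
```

Fractional or irregular values would need single-element ranges in such a table, which is one reason generating the raw data is the larger part of the work.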

Jan 04 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Jan 03, 2013 at 08:40:47PM +0100, monarch_dodra wrote:
[...]
 Or, you know... I could just implement both at the same time. It's
 not like there's an *urgency* for the ascii version or anything. I
 think I'll just do that.
 
 So... do we agree on
 ascii: int - not found => -1
 uni: double - not found => nan

[...]

LGTM. :)

I did think of what might happen if somebody wrote an int cast for
std.uni.numericValue:

	void sloppyProgrammersFunction(dchar ch) {
		// First attempt: compiler error: can't implicitly
		// convert double -> int ...
		//int val = std.uni.numericValue(ch);

		// ... so sloppy programmer inserts a cast
		int val = cast(int)std.uni.numericValue(ch);

		// On Linux/64, if numericValue returns nan, this prints
		// int.min.
		writeln(val);

		// So this should work:
		if (val < 0) {
			// (In fact, it will still work if
			// std.ascii.numericValue were used instead.)
			writeln("Sloppy code caught the problem correctly!");
		}
	}

So it seems that everything should be alright.

This particular example occurred to me, 'cos I'm thinking of how often
one wishes to extract an integral value from a string, and usually one
doesn't think that floating point is necessary(!), so the cast from
double is a rather big temptation (even though it's wrong!).


T

-- 
Tell me and I forget. Teach me and I remember. Involve me and I understand. --
Benjamin Franklin

Jan 03 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 3 January 2013 at 21:51:14 UTC, H. S. Teoh wrote:
 On Thu, Jan 03, 2013 at 08:40:47PM +0100, monarch_dodra wrote:
 [...]
 Or, you know... I could just implement both at the same time. 
 It's
 not like there's an *urgency* for the ascii version or 
 anything. I
 think I'll just do that.
 
 So... do we agree on
 ascii: int - not found => -1
 uni: double - not found => nan

 [...]

 LGTM. :)

 I did think of what might happen if somebody wrote an int cast 
 for
 std.uni.numericValue
 [SNIP]
 writeln("Sloppy code caught the problem correctly!");

... almost! 1e12 will have a negative value when cast to int. To 
be 100% correct in regards to converting, the end user would have 
to use long.

But that'd be a *really exceptional* case behavior...

Even with long, the only problem with the code is that the user 
would not know the difference between exact integral, and inexact 
integral. Well, that's what the user gets for being sloppy I 
guess.

In any case, I think we'd have to provide an example section with 
a "recommended" way for casting to integral.
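
One possible shape for such an example, as a sketch only: the helper name toExactIntegral and the chosen bounds are assumptions, not part of any agreed API. It rejects nan, fractional values, and anything outside a conservative long range before casting:

```d
import std.math : floor, isNaN;

/// Hypothetical helper: stores value in result and returns true only
/// when value is a finite whole number that fits comfortably in a long.
bool toExactIntegral(double value, out long result) pure nothrow @safe
{
    if (isNaN(value))
        return false;               // "not a number" sentinel
    if (value < -1e18 || value > 1e18)
        return false;               // also rejects +/- infinity
    if (value != floor(value))
        return false;               // fractional, e.g. 0.5
    result = cast(long) value;
    return true;
}
```

The bounds are deliberately narrower than long.min and long.max; the largest value mentioned in this thread, 1e12, is well inside them.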

Jan 04 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
 So... do we agree on
 ascii: int - not found => -1
 uni: double - not found => nan

I'm not a fan of the ASCII version returning -1, but I don't really have a 
better suggestion. I suppose that you could throw instead, but I don't know if 
that's a good idea or not. It _would_ be more consistent with our other 
conversion functions however.

- Jonathan M Davis

Jan 04 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

04-Jan-2013 15:58, Jonathan M Davis wrote:
 On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
 So... do we agree on
 ascii: int - not found => -1
 uni: double - not found => nan

 I'm not a fan of the ASCII version returning -1, but I don't really have a
 better suggestion. I suppose that you could throw instead, but I don't know if
 that's a good idea or not. It _would_ be more consistent with our other
 conversion functions however.

 - Jonathan M Davis

I find low-level stuff that throws to be overly awkward to deal with 
(not to mention performance problems).

Hm... I've found a brilliant primitive, Expected!T, that could be of
great help with the error codes vs exceptions problem. See Andrei's
recent talk that went live not long ago:

http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Andrei-Alexandrescu-Systematic-Error-Handling-in-C

Time to put the analogous stuff into Phobos?
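
A very reduced sketch of what an analogous type might look like in D is given below. It is not the design from the talk: this version simply stores either a payload or an Exception object, and the names Expected, ok, fail, hasValue, get and expectedDigit are all chosen here for illustration:

```d
import std.ascii : isDigit;

/// Simplified sketch: holds either a value or a stored exception.
struct Expected(T)
{
    private T payload;
    private Exception error;   // null means a value is present

    static Expected ok(T value)
    {
        Expected e;
        e.payload = value;
        return e;
    }

    static Expected fail(string msg)
    {
        Expected e;
        e.error = new Exception(msg);
        return e;
    }

    /// Lets the caller check for success without any throwing.
    bool hasValue() const { return error is null; }

    /// Returns the value, or throws the stored exception.
    T get()
    {
        if (error !is null)
            throw error;
        return payload;
    }
}

/// Hypothetical digit conversion returning Expected!int.
Expected!int expectedDigit(dchar c)
{
    return isDigit(c)
        ? Expected!int.ok(cast(int)(c - '0'))
        : Expected!int.fail("not an ASCII digit");
}
```

The caller decides: test hasValue for the error-code style, or call get directly and let the exception propagate.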

-- 
Dmitry Olshansky

Jan 04 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 4 January 2013 at 13:18:48 UTC, Dmitry Olshansky wrote:
 04-Jan-2013 15:58, Jonathan M Davis wrote:
 On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
 So... do we agree on
 ascii: int - not found => -1
 uni: double - not found => nan

 I'm not a fan of the ASCII version returning -1, but I don't 
 really have a
 better suggestion. I suppose that you could throw instead, but 
 I don't know if
 that's a good idea or not. It _would_ be more consistent with 
 our other
 conversion functions however.

 - Jonathan M Davis

 I find low-level stuff that throws to be overly awkward to deal 
 with (not to mention performance problems).

 Hm... I've found a brilliant primitive, Expected!T, that could
 be of great help with the error codes vs exceptions problem. See
 Andrei's recent talk that went live not long ago:

 http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Andrei-Alexandrescu-Systematic-Error-Handling-in-C

 Time to put the analogous stuff into Phobos?

I finished an implementation:

https://github.com/D-Programming-Language/phobos/pull/1052

It is not "pull ready", so we can still discuss it.

I raised a couple of issues in the pull, which I'll copy here:

//----
I did run into a couple of issues, namely that I'm not getting 
100% equivalence between chars that are numeric, and chars with 
numeric value... Is this normal...?

* There's a fair bit of chars that have numeric value, but aren't 
isNumber. I think they might be new in 6.1.0. But I'm not sure. I 
decided it was best to have them return nan, instead of having 
inconsistent behavior.
* There's a couple characters in tableLo that have numeric 
values. These aren't considered in isNumber either. I think this 
might be a bug though.
* There are 4 "non-number numeric" characters in "CUNEIFORM 
NUMERIC SIGN". These return wild values, and in particular two of 
them return -1. I *think* this should actually return nan for us, 
because (AFAIK), -1 is just wild for invalid :/

Maybe we should just return -1 on invalid unicode? Or maybe it's 
just my input file:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
It doesn't have a separate field for isNumber/numericValue, so it 
is forced to write a wild number. Maybe these four chars should 
return nan?
//----

Oh yeah, I also added isNumber to std.ascii. Feels wrong to not 
have it if we have numericValue.

Jan 04 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 4 January 2013 at 17:48:28 UTC, monarch_dodra wrote:
 //----
 Maybe we should just return -1 on invalid unicode? Or maybe 
 it's just my input file:
 http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
 It doesn't have a separate field for isNumber/numericValue, so 
 it is forced to write a wild number. Maybe these four chars 
 should return nan?

Wait: I figured it out: They are just non-numbers that happen to 
be inside Nl (Number Letter): 
http://unicode.org/cldr/utility/character.jsp?a=12433

Documentation on this is not very clear, nor consistent, so sorry 
for any confusion.

Well, I guess there is a bug in std.uni.isNumber then...

Jan 04 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

04-Jan-2013 21:48, monarch_dodra wrote:
 On Friday, 4 January 2013 at 13:18:48 UTC, Dmitry Olshansky wrote:
 04-Jan-2013 15:58, Jonathan M Davis wrote:
 On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
 So... do we agree on
 ascii: int - not found => -1
 uni: double - not found => nan

 I'm not a fan of the ASCII version returning -1, but I don't really
 have a
 better suggestion. I suppose that you could throw instead, but I
 don't know if
 that's a good idea or not. It _would_ be more consistent with our other
 conversion functions however.

 - Jonathan M Davis

 I find low-level stuff that throws to be overly awkward to deal with
 (not to mention performance problems).

 Hm... I've found a brilliant primitive, Expected!T, that could be of
 great help with the error codes vs exceptions problem. See Andrei's
 recent talk that went live not long ago:

 http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Andrei-Alexandrescu-Systematic-Error-Handling-in-C


 Time to put the analogous stuff into Phobos?

 I finished an implementation:

 https://github.com/D-Programming-Language/phobos/pull/1052

 It is not "pull ready", so we can still discuss it.

Well, for a start it features tons of code duplication. But I'm replacing 
the whole std.uni anyway...

 I raised a couple of issues in the pull, which I'll copy here:

 //----
 I did run into a couple of issues, namely that I'm not getting 100%
 equivalence between chars that are numeric, and chars with numeric
 value... Is this normal...?

Yes, it's called Unicode ;)

 * There's a fair bit of chars that have numeric value, but aren't
 isNumber. I think they might be new in 6.1.0. But I'm not sure. I
 decided it was best to have them return nan, instead of having
 inconsistent behavior.

You also might be using 6.2. It was released in the fall of 2012.

 * There's a couple characters in tableLo that have numeric values. These
 aren't considered in isNumber either. I think this might be a bug though.
 * There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC
 SIGN". These return wild values, and in particular two of them return
 -1. I *think* this should actually return nan for us, because (AFAIK),
 -1 is just wild for invalid :/

Some have numeric value of '-1' I think. The truth of the matter is that,
as usual with Unicode, things are rather complicated.
So 'numeric character' is a category (general) and 'has numeric value' 
is some other property of codepoint that may or may not correlate 
directly with category.

Thus I think (looking ahead into your other post) that isNumber is 
correct as it follows its documented behavior.

 Maybe we should just return -1 on invalid unicode? Or maybe it's just my
 input file:
 http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
 It doesn't have a separate field for isNumber/numericValue, so it is
 forced to write a wild number. Maybe these four chars should return nan?

Nope. Does letter 'A' return a wild number?

 //----

 Oh yeah, I also added isNumber to std.ascii. Feels wrong to not have it
 if we have numericValue.


-- 
Dmitry Olshansky

Jan 04 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 4 January 2013 at 20:33:12 UTC, Dmitry Olshansky wrote:
 04-Jan-2013 21:48, monarch_dodra wrote:
 I finished an implementation:

 https://github.com/D-Programming-Language/phobos/pull/1052

 It is not "pull ready", so we can still discuss it.

 Well, for a start it features tons of code duplication. But I'm 
 replacing the whole std.uni anyway...

Well, I wrote that with duplication, keeping in mind you would
probably replace both. I thought it would be cleaner to have some 
duplication, than a warped single implementation. I could also 
make the extra effort. I was really concerned with first having 
an implementation that is unicode correct.

I also thought that, at worst, you could use my parsed data ;) to 
submit your own (superior?) pull.

 * There's a couple characters in tableLo that have numeric 
 values. These
 aren't considered in isNumber either. I think this might be a 
 bug though.
 * There are 4 "non-number numeric" characters in "CUNEIFORM 
 NUMERIC
 SIGN". These return wild values, and in particular two of them 
 return
 -1. I *think* this should actually return nan for us, because 
 (AFAIK),
 -1 is just wild for invalid :/

 Some have numeric value of '-1' I think. The truth of the
 matter is that, as usual with Unicode, things are rather complicated.
 So 'numeric character' is a category (general) and 'has numeric 
 value' is some other property of codepoint that may or may not 
 correlate directly with category.

 Thus I think (looking ahead into your other post) that isNumber 
 is correct as it follows its documented behavior.

 Maybe we should just return -1 on invalid unicode? Or maybe 
 it's just my
 input file:
 http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
 It doesn't have a separate field for isNumber/numericValue, so 
 it is
 forced to write a wild number. Maybe these four chars should 
 return nan?

 Nope. Does letter 'A' return a wild number?

Well, the thing is that I'm getting contradictory info from the
consortium itself:
Given 0x12456: "CUNEIFORM NUMERIC SIGN NIGIDAMIN"
According to the "UnicodeData.txt", its numeric value is -1.
According to the "Unicode utilities", it is not a numeric type,
and its value is null:
http://unicode.org/cldr/utility/character.jsp?a=12456

Also according to the consortium: "-1" is an illegal numeric
value.
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Value=-1:]

Really, all the info seems to indicate a bug in UnicodeData.txt:
They really seem like 4 entries in Nl that aren't numbers.

I've found a couple of people on the internet discussing this, but no
hard conclusion :/

****

Anyways, those 4 CUNEIFORM aside, what do you make of the
entries in Lo:
http://unicode.org/cldr/utility/character.jsp?a=F96B
These appear to be numeric, but aren't inside Nd/No/Nl. They
should return true to isNumber, no?

Maybe isNumber's "documented behavior" is wrong?

Jan 04 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

05-Jan-2013 00:51, monarch_dodra wrote:
 On Friday, 4 January 2013 at 20:33:12 UTC, Dmitry Olshansky wrote:
 04-Jan-2013 21:48, monarch_dodra wrote:
 I finished an implementation:

 https://github.com/D-Programming-Language/phobos/pull/1052

 It is not "pull ready", so we can still discuss it.

 Well, for a start it features tons of code duplication. But I'm
 replacing the whole std.uni anyway...

 Well, I wrote that with duplication, keeping in mind you would
 probably replace both. I thought it would be cleaner to have some duplication,
 than a warped single implementation. I could also make the extra effort.
 I was really concerned with first having an implementation that is
 unicode correct.

 I also thought that, at worst, you could use my parsed data ;) to submit
 your module that is well due for peer review.

Fixed ;)

 * There's a couple characters in tableLo that have numeric values. These
 aren't considered in isNumber either. I think this might be a bug
 though.
 * There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC
 SIGN". These return wild values, and in particular two of them return
 -1. I *think* this should actually return nan for us, because (AFAIK),
 -1 is just wild for invalid :/

 Some have numeric value of '-1' I think. The truth of the matter is that,
 as usual with Unicode, things are rather complicated.
 So 'numeric character' is a category (general) and 'has numeric value'
 is some other property of codepoint that may or may not correlate
 directly with category.

 Thus I think (looking ahead into your other post) that isNumber is
 correct as it follows its documented behavior.

 Maybe we should just return -1 on invalid unicode? Or maybe it's just my
 input file:
 http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
 It doesn't have a separate field for isNumber/numericValue, so it is
 forced to write a wild number. Maybe these four chars should return nan?

 Nope. Does letter 'A' return a wild number?

 Well, the thing is that I'm getting contradictory info from the
 consortium itself:
 Given 0x12456: "CUNEIFORM NUMERIC SIGN NIGIDAMIN"
 According to the "UnicodeData.txt", its numeric value is -1.
 According to the "Unicode utilities", it is not a numeric type,
 and its value is null:
 http://unicode.org/cldr/utility/character.jsp?a=12456

 Also according to the consortium: "-1" is an illegal numeric
 value.
 http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Value=-1:]

 Really, all the info seems to indicate a bug in UnicodeData.txt:
 They really seem like 4 entries in Nl that aren't numbers.

 I've found a couple of people on the internet discussing this, but no
 hard conclusion :/

Basically check the bottom of that page:
....
See also: Unicode Display Problems.
Version 3.6; ICU version: 50.0.1.0; Unicode version: 6.1.0.0

So it's not up to date. The file is. I can test with ICU 51 to see what 
it reports.

 ****

 Anyways, those 4 CUNEIFORM aside, what do you make of the
 entries in Lo:
 http://unicode.org/cldr/utility/character.jsp?a=F96B
 These appear to be numeric, but aren't inside Nd/No/Nl. They
 should return true to isNumber, no?

Hmmm. Take a look here:
http://unicode.org/cldr/utility/properties.jsp

There is a section called Numeric that has 3 properties,
and then there is a General section.
The General has Category which in turn has 'Number' category.

Bottom line is that I believe that std.uni isXXX queries the category of 
a symbol and not some other property. Let any mishaps in between 
properties and general category be consortium's headache.
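To make the distinction concrete, here is a small sketch. It only uses isNumber and isAlpha from std.uni; the choice of sample characters is mine, and the comments state the general category I expect each one to have.

```d
import std.uni : isAlpha, isNumber;

unittest
{
    // isNumber answers a general-category question: Nd, Nl or No?
    assert(isNumber('5'));        // DIGIT FIVE, category Nd
    assert(isNumber('\u2167'));   // ROMAN NUMERAL EIGHT, category Nl
    assert(isNumber('\u00BD'));   // VULGAR FRACTION ONE HALF, category No

    // U+F96B sits in category Lo, so the isXXX family treats it as a
    // letter, even though the character database gives it a numeric value.
    assert(!isNumber('\uF96B'));
    assert(isAlpha('\uF96B'));
}
```

Whether U+F96B "should" count as a number is then a question about a different property, not about isNumber.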

 Maybe isNumber's "documented behavior" is wrong?

Problem is I can't come up with a good description of some other 
behavior. Maybe this one [^[:Numeric_Type=None:]]
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5E%5B%3ANumeric_Type%3DNone%3A%5D%5D&g=


-- 
Dmitry Olshansky

Jan 04 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 4 January 2013 at 22:00:02 UTC, Dmitry Olshansky wrote:
05-Jan-2013 00:51, monarch_dodra wrote:
Anyways, those 4 CUNEIFORM aside, what do you make of the
entries in Lo:
http://unicode.org/cldr/utility/character.jsp?a=F96B
These appear to be numeric, but aren't inside Nd/No/Nl. They
should return true to isNumber, no?

Hmmm. Take a look here:
http://unicode.org/cldr/utility/properties.jsp

There is a section called Numeric that has 3 properties,
and then there is a General section.
The General has Category which in turn has 'Number' category.

Bottom line is that I believe that std.uni isXXX queries the
category of a symbol and not some other property. Let any
mishaps in between properties and general category be
consortium's headache.
Maybe isNumber's "documented behavior" is wrong?

Problem is I can't come up with a good description of some
other behavior. Maybe this one [^[:Numeric_Type=None:]]
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5E%5B%3ANumeric_Type%3DNone%3A%5D%5D&g=

Sounds like the root of the problem is that isNumber !=
Numeric_Type[Decimal, Digit, Numeric]

Ergo, there is no correlation between isNumber and numericValue.

Feels like there is a lot missing from std.uni, but at the same
time, unicode is really huge.

At the very least, I think we should have Category enum, along
with a (get) "category" function.

I was just saying to jmdavis in the pull that std.ascii had
"isDigit", but that uni didn't. In truth, both also lack
isDecimal and isNumeric.

There would just be a bit of ambiguity now between the broad
"isNumeric", and "all the chars that have a numeric value"... :/

Damn. Unicode is complicated.

Anyways, taking my weekend break.

Jan 04 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Jan 04, 2013 at 11:48:39PM +0100, monarch_dodra wrote:
[...]
 Sounds like the root of the problem is that isNumber !=
 Numeric_Type[Decimal, Digit, Numeric]
 
 Ergo, there is no correlation between isNumber and numericValue.

Yikes. That's pretty ... nasty. :-(


 Feels like there is a lot missing from std.uni, but at the same
 time, unicode is really huge.

Yeah, Unicode is a lot more complex than most people realize. Recently I
read through TR14 (proper line-breaking in Unicode), and I was gaping in
awe at the insane complexity of such a seemingly-simple task.


 At the very least, I think we should have Category enum, along with a
 (get) "category" function.

Yes! We need that!!


 I was just saying to jmdavis in the pull that std.ascii had
 "isDigit", but that uni didn't. In truth, both also lack isDecimal
 and isNumeric.
 
 There would just be a bit of ambiguity now between the broad
 "isNumeric", and "all the chars that have a numeric value"... :/
 
 Damn. Unicode is complicated.

[...]

I, for one, would love to know why isNumeric != hasNumericValue.


T

-- 
Valentine's Day: an occasion for florists to reach into the wallets of
nominal lovers in dire need of being reminded to profess their
hypothetical love for their long-forgotten.

Jan 04 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
 [...]

 I, for one, would love to know why isNumeric != hasNumericValue.


 T

I guess it's just bad wording from the standard.

The standard defined 3 groups that make up Number:
[Nd] 	Number, Decimal Digit
[Nl] 	Number, Letter
[No] 	Number, Other

However, there are a couple of characters that *are* numbers, but 
aren't in those groups.

The "Good" news is that the standard, *does* define number_types 
to classify the kind of number a char is:
* Null: Not a number
* Digit: Obvious
* Decimal: Any decimal number that is NOT a digit
* Numeric: Everything else.

So they used "Numeric" as wild, and "Number" as their general 
category.

This leaves us with ambiguity when choosing our word:
Technically '5' does not classify as "numeric", although you could 
consider it "has a numeric value".

I hope that makes sense.

Jan 07 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
 On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
[...]
I, for one, would love to know why isNumeric != hasNumericValue.


[...]
 I guess it's just bad wording from the standard.
 
 The standard defined 3 groups that make up Number:
 [Nd] 	Number, Decimal Digit
 [Nl] 	Number, Letter
 [No] 	Number, Other
 
 However, there are a couple of characters that *are* numbers, but
 aren't in those groups.
 
 The "Good" news is that the standard, *does* define number_types to
 classify the kind of number a char is:
 * Null: Not a number
 * Digit: Obvious
 * Decimal: Any decimal number that is NOT a digit
 * Numeric: Everything else.
 
 So they used "Numeric" as wild, and "Number" as their general
 category.
 
 This leaves us with ambiguity when choosing our word:
 Technically '5' does not classify as "numeric", although you could
 consider it "has a numeric value".
 
 I hope that makes sense.

Hmph. I guess we need to differentiate between the unicode category
called "numeric", and the property of having a numerical value. So we'd
need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's
what the standard is, then that's what it is.

Anyway, I'd love to see std.uni cover all unicode categories.

Offhanded note: should we unify the various isX() functions into:

	bool inCategory(string category)(dchar ch)

where category is the Unicode designation, say "Nl", "Nd", etc.? That
way, it's more future-proof in case the Unicode guys add more
categories. Also makes it easier to remember which function to call;
else you'd always have to remember "N" -> isNumeric, "L" -> isAlpha,
etc..

The current names of course can be left as aliases.
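A minimal sketch of what I have in mind. It assumes a std.uni that offers a by-name lookup of code point sets, spelled `unicode("Nd")` here; the body is only one possible way to back the signature above, and it redoes the lookup on every call, so treat it as an illustration of the interface rather than an implementation.

```d
import std.uni : unicode;

/// Sketch only: looks the set up by name on every call, which is far
/// too slow for real use. An unknown category name throws at run time.
bool inCategory(string category)(dchar ch)
{
    return unicode(category)[ch];
}

unittest
{
    assert(inCategory!"Nd"('7'));          // DIGIT SEVEN
    assert(inCategory!"Nl"('\u2167'));     // ROMAN NUMERAL EIGHT
    assert(!inCategory!"Nd"('\u2167'));    // a letter number, not a decimal digit
}
```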


T

-- 
The fact that anyone still uses AOL shows that even the presence of
options doesn't stop some people from picking the pessimal one. - Mike
Ellis

Jan 09 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

10-Jan-2013 03:21, H. S. Teoh wrote:
 On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
 On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
 [...]
 I, for one, would love to know why isNumeric != hasNumericValue.


 [...]
 I guess it's just bad wording from the standard.

 The standard defined 3 groups that make up Number:
 [Nd] 	Number, Decimal Digit
 [Nl] 	Number, Letter
 [No] 	Number, Other

 However, there are a couple of characters that *are* numbers, but
 aren't in those groups.

 The "Good" news is that the standard, *does* define number_types to
 classify the kind of number a char is:
 * Null: Not a number
 * Digit: Obvious
 * Decimal: Any decimal number that is NOT a digit
 * Numeric: Everything else.

 So they used "Numeric" as wild, and "Number" as their general
 category.

 This leaves us with ambiguity when choosing our word:
 Technically '5' does not classify as "numeric", although you could
 consider it "has a numeric value".

 I hope that makes sense.

 Hmph. I guess we need to differentiate between the unicode category
 called "numeric", and the property of having a numerical value. So we'd
 need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's
 what the standard is, then that's what it is.

isNumber - _Number_ General category (as defined by Unicode 1:1)

isNumeric - as having NumericType != None (again going by the definition of 
Unicode properties)

And that's all, correct and to the letter.

 Anyway, I'd love to see std.uni cover all unicode categories.

 Offhanded note: should we unify the various isX() functions into:

 	bool inCategory(string category)(dchar ch)

No, no, no! It's a horrible idea. The main problem with it is that a huge 
catalog of data of no (even niche) use has to be stored in Phobos (object 
code). Also, to be practical for use cases other than casual observation 
it has to be fast... and it can't be for any of the useful cases.

Just count the number of bits to store per codepoint and fairly 
irregular structure of the whole set of properties (unlike individual 
combinations that do have nice distribution e.g. Scripts as in Cyrillic).

I've been shoulder-deep in Unicode for about half a year now, reading 
through the TR-xx algorithms, and *none* of them requires queries of 
the sort that test all (more than 1-2?) of the properties.

In all cases the algorithm itself defines a set(s) of codepoints with 
different meanings/values for this use case. These (useful) sets could 
be compressed to a fast multi-stage table, the whole catalog of 
properties - no, as it packs enormous heaps of unused junk (Unicode_Age 
anyone??). This junk is not fit for the standard library, but the goal 
is to provide a tool for the user to work with sets/data beyond what is 
commonly useful in std.

 where category is the Unicode designation, say "Nl", "Nd", etc.? That
 way, it's more future-proof in case the Unicode guys add more
 categories.

I'm posting my work on std.uni as ready for review today or tomorrow.
It includes a type for a set of codepoints and a ton of predefined sets 
for Nl, Nd and almost everything sensible (blocks, scripts, properties).
The user can then conjure whatever combination is required.
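Roughly, usage would look like the sketch below. The set type and the lookup are spelled `CodepointSet` and `unicode.Nd` / `unicode.Cyrillic`, as in the std.uni module; the two helper functions and their names are made up purely for illustration.

```d
import std.uni : CodepointSet, unicode;

/// Illustrative user-defined set: decimal digits plus letter numbers.
CodepointSet digitsAndLetterNumbers()
{
    return unicode.Nd | unicode.Nl;
}

/// Illustrative user-defined set: the Cyrillic script minus the set above.
CodepointSet cyrillicNonNumbers()
{
    return unicode.Cyrillic - digitsAndLetterNumbers();
}

unittest
{
    auto nums = digitsAndLetterNumbers();
    assert(nums['5']);          // DIGIT FIVE is Nd
    assert(nums['\u2167']);     // ROMAN NUMERAL EIGHT is Nl
    assert(!nums['\u00BD']);    // VULGAR FRACTION ONE HALF is No

    assert(cyrillicNonNumbers()['\u042F']); // CYRILLIC CAPITAL LETTER YA
}
```

The point is that only the sets a program actually names get pulled in, instead of one table answering every possible property query.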

And it's still way smaller than having the full 'query the database' thing. 
To check the full madness of all of the properties just use the web 
interface of unicode.org.

P.S. Hopefully, nobody raises the point of codepoint _names_; they are, 
after all, also part of the Unicode standard (and character database).

-- 
Dmitry Olshansky

Jan 10 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 10 January 2013 at 18:09:31 UTC, Dmitry Olshansky 
wrote:
 10-Jan-2013 03:21, H. S. Teoh wrote:
 On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
 On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
 [...]
 I, for one, would love to know why isNumeric != 
 hasNumericValue.


 [...]
 I guess it's just bad wording from the standard.

 The standard defined 3 groups that make up Number:
 [Nd] 	Number, Decimal Digit
 [Nl] 	Number, Letter
 [No] 	Number, Other

 However, there are a couple of characters that *are* numbers, 
 but
 aren't in those groups.

 The "Good" news is that the standard, *does* define 
 number_types to
 classify the kind of number a char is:
 * Null: Not a number
 * Digit: Obvious
 * Decimal: Any decimal number that is NOT a digit
 * Numeric: Everything else.

 So they used "Numeric" as wild, and "Number" as their general
 category.

 This leaves us with ambiguity when choosing our word:
 Technically '5' does not classify as "numeric", although you 
 could
 consider it "has a numeric value".

 I hope that makes sense.

 Hmph. I guess we need to differentiate between the unicode 
 category
 called "numeric", and the property of having a numerical 
 value. So we'd
 need both isNumeric and hasNumericValue. Ugh. It's ugly but if 
 that's
 what the standard is, then that's what it is.

 isNumber - _Number_ General category (as defined by Unicode 1:1)

 isNumeric - as having NumericType != None (again going by the 
 definition of Unicode properties)

 And that's all, correct and to the letter.

Are you sure about that? The four values of Numeric_Type are:
* Decimal
* Digit
* None
* Numeric <= !!!
http://unicode.org/cldr/utility/properties.jsp?a=Numeric_Type#Numeric_Type

Hopefully, we'll have "isDecimal", "isDigit", and eventually 
"isNumeric", which according to definition, would simply be 
"Numeric_Type == Numeric_Type.Numeric"

The problem is that by the definitions of Unicode properties, 
there is no name for "not in Numeric_Type.None"

"hasNumericValue" is the best name I could come up with to 
differentiate between "Not Numeric_Type.None" and 
"Numeric_Type.Numeric"
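To spell the naming problem out in code, here is a rough sketch. Nothing in it is existing Phobos API: the enum, the three functions and the tiny hand-written lookup are all hypothetical, and a real version would be generated from the character database. Any code point not listed simply falls through to `none`.

```d
/// The four Numeric_Type values from the Unicode property list.
enum NumericType { none, decimal, digit, numeric }

/// Hand-written lookup for a few code points, for illustration only.
NumericType numericType(dchar c)
{
    switch (c)
    {
        case '5':      return NumericType.decimal;  // DIGIT FIVE
        case '\u00B2': return NumericType.digit;    // SUPERSCRIPT TWO
        case '\u00BD': return NumericType.numeric;  // VULGAR FRACTION ONE HALF
        case '\uF96B': return NumericType.numeric;  // the Lo entry discussed earlier
        default:       return NumericType.none;     // not listed in this sketch
    }
}

/// Strict reading: Numeric_Type is exactly Numeric.
bool isNumeric(dchar c) { return numericType(c) == NumericType.numeric; }

/// Broad reading: Numeric_Type is anything but None.
bool hasNumericValue(dchar c) { return numericType(c) != NumericType.none; }

unittest
{
    assert(hasNumericValue('5') && !isNumeric('5'));
    assert(hasNumericValue('\u00BD') && isNumeric('\u00BD'));
    assert(!hasNumericValue('A'));
}
```

So '5' passes hasNumericValue but not isNumeric, which is exactly the ambiguity I'd like the names to avoid.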

Jan 10 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 4 January 2013 at 22:00:02 UTC, Dmitry Olshansky wrote:
 [SNIP]

Thank you for all your feedback.

*everything* makes sense now.

However, the conclusion I'm coming to is that some groundwork is 
needed before doing numeric value, which I am currently 
doing.

Jan 07 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Dmitry Olshansky:

 Yup, and it's 2 lines then. And if one really wants to chain it:
 map(a => enforce(std.ascii.isNumeric(a)), a -= '0')(...);

 Hardly makes it Phobos candidate then ;)

I think you meant to write:

map(a => enforce(std.ascii.isNumeric(a)), a - '0')(...);

To avoid some bugs I try to not use the comma expression like 
that.

Compare that code with:

map!numericValue(...);
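For the ASCII case, a minimal sketch of what that one-liner relies on. The `numericValue` below is hypothetical, not an existing Phobos function, and returning -1 for non-digits is just one of the conventions discussed in this thread.

```d
import std.algorithm : equal, map;
import std.ascii : isDigit;

/// Hypothetical ASCII-only numericValue: the digit's value, or -1 otherwise.
int numericValue(dchar c)
{
    return isDigit(c) ? cast(int)(c - '0') : -1;
}

unittest
{
    assert("2013".map!numericValue.equal([2, 0, 1, 3]));
    assert(numericValue('x') == -1);
}
```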

Bye,
bearophile

Jan 02 2013
