digitalmars.D.learn - bearophile

"bearophile" <bearophileHUGS lycos.com> writes:

This is the signature of a function of std.ascii:

http://dlang.org/phobos/std_ascii.html#toLower

pure nothrow @safe dchar toLower(dchar c);

If this function is supposed to be used on ASCII strings, what's 
the point of returning a dchar? When I use it I usually have to 
cast its result back to char, and I prefer to avoid casts in my 
D code.

Bye,
bearophile

Sep 20 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Sorry, the thread title was "About std.ascii.toLower"...

Sep 20 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 20 September 2012 at 16:00:18 UTC, bearophile wrote:
 This is the signature of a function of std.ascii:

 http://dlang.org/phobos/std_ascii.html#toLower

 pure nothrow @safe dchar toLower(dchar c);

 If this function is supposed to be used on ASCII strings, 
 what's the point of returning a dchar? When I use it I 
 usually have to cast its result back to char, and I prefer to 
 avoid casts in my D code.

 Bye,
 bearophile

It's not, it only *operates* on ASCII, but non ascii is still a 
legal arg:

----
import std.stdio;
import std.ascii;

void main(){
     string s = "héllö";
     write("\"");
     foreach(dchar c; s) // decode code points; é and ö are multi-byte in UTF-8
         write(c.toUpper);
     write("\"");
}
----
"HéLLö"
----

Sep 20 2012

"bearophile" <bearophileHUGS lycos.com> writes:

monarch_dodra:

 It's not, it only *operates* on ASCII, but non ascii is still a 
 legal arg:

Then maybe std.ascii.toLower needs a pre-condition that 
constrains it to just ASCII inputs, so it's free to return a 
char.

Bye,
bearophile

Sep 20 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 20 September 2012 at 16:34:22 UTC, bearophile wrote:
 monarch_dodra:

 It's not, it only *operates* on ASCII, but non ascii is still 
 a legal arg:

 Then maybe std.ascii.toLower needs a pre-condition that 
 constrains it to just ASCII inputs, so it's free to return a 
 char.

 Bye,
 bearophile

I was thinking the exact same thing right after replying actually.

Would that actually change anything though? I mean what with 
alignment and everything, wouldn't returning a char be just as 
expensive? I'm not 100% sure. What is your use case that would 
require this?

Sep 20 2012

"bearophile" <bearophileHUGS lycos.com> writes:

monarch_dodra:

 Would that actually change anything though? I mean what with 
 alignment and everything, wouldn't returning a char be just as 
 expensive? I'm not 100% sure.

If you are thinking about the number of operations, then it's the 
same, as both a char and dchar value go in a register. The run 
time is the same, especially after inlining.


 What is your use case that would require this?

I have a char[] like:

['a','x','b','a','c','x','f']

Every char encodes something. Putting it to upper case means that 
that data was already used:

['a','X','b','a','C','x','f']

In this case to use toUpper I have to use:

cast(char)toUpper(foo[1])

What I am trying to minimize is the number of cast(). On the 
other hand even in C toupper returns a type larger than char:

http://www.acm.uiuc.edu/webmonkeys/book/c_guide/2.2.html

It's just that D has contract programming, and this module is written 
for ASCII, so it's able to be smarter than C functions, and 
return a char.

Bye,
bearophile

Sep 20 2012
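
The use case described above can be sketched as a small D program. This is only an illustration: `markUsed` is a hypothetical helper name, and the explicit cast is written out on the assumption that `std.ascii.toUpper` returns a `dchar` (as in the signature quoted at the start of the thread); where the function returns the argument's own type, the cast is simply redundant.

```d
import std.ascii : toUpper;

// Marks the entry at `index` as used by upper-casing it in place.
// `markUsed` is a hypothetical name chosen for this illustration.
void markUsed(char[] data, size_t index)
{
    // The cast is required if toUpper returns dchar, harmless otherwise.
    data[index] = cast(char) toUpper(data[index]);
}

void main()
{
    char[] foo = ['a', 'x', 'b', 'a', 'c', 'x', 'f'];
    markUsed(foo, 1);
    markUsed(foo, 4);
    assert(foo == "aXbaCxf");
}
```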

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 20 September 2012 at 17:05:18 UTC, bearophile wrote:
 monarch_dodra:

 Would that actually change anything though? I mean what with 
 alignment and everything, wouldn't returning a char be just as 
 expensive? I'm not 100% sure.

 If you are thinking about the number of operations, then it's 
 the same, as both a char and dchar value go in a register. The 
 run time is the same, especially after inlining.


 What is your use case that would require this?

 I have a char[] like:

 ['a','x','b','a','c','x','f']

 Every char encodes something. Putting it to upper case means 
 that that data was already used:

 ['a','X','b','a','C','x','f']

 In this case to use toUpper I have to use:

 cast(char)toUpper(foo[1])

 What I am trying to minimize is the number of cast(). On the 
 other hand even in C toupper returns a type larger than char:

 http://www.acm.uiuc.edu/webmonkeys/book/c_guide/2.2.html

 It's just D has contract programming, and this module is 
 written for ASCII, so it's able to be smarter than C functions, 
 and return a char.

 Bye,
 bearophile

That's what I thought. You have a valid point (IMO) but at the 
same time, using the ASCII methods on non-ASCII characters is 
also a legitimate operation.

I guess we'd need an extra "std.strictascii" module (!) for 
operations that would accept an ASCII char and return an ASCII char.

I'd support such an ER. I think it would allow users (such as 
you) to have tighter constraints if needed, while still keeping 
std.ascii for "safer" ASCII operations.

Sep 20 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, September 20, 2012 18:35:21 bearophile wrote:
 monarch_dodra:
 It's not, it only *operates* on ASCII, but non ascii is still a
 legal arg:

 Then maybe std.ascii.toLower needs a pre-condition that
 constrains it to just ASCII inputs, so it's free to return a
 char.

Goodness no.

1. Operating on a char is almost always the wrong thing to do. If you really 
want to do that, then cast. It should _not_ be encouraged.

2. It would be disastrous if std.ascii's functions didn't work on unicode. 
Right now, you can use them with ranges on strings which are unicode, which 
can be very useful. I grant you that that's more obvious with something like 
isDigit than toLower, but regardless, std.ascii is designed such that its 
functions will all operate on unicode strings. It just doesn't alter unicode 
characters and returns false for them with any of the query functions.

- Jonathan M Davis

Sep 20 2012
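
The pass-through behaviour described above can be seen in a short sketch. `asciiLower` is a hypothetical helper written for this illustration, not a Phobos function; it relies only on `std.ascii.toLower` and `std.ascii.isDigit`.

```d
import std.ascii : isDigit, toLower;
import std.stdio : writeln;

// Lower-cases only the ASCII letters of a UTF-8 string; every other
// code point is copied through unchanged.
string asciiLower(string s)
{
    string result;
    foreach (dchar c; s)      // iterate code points, not UTF-8 code units
        result ~= toLower(c); // appending a dchar re-encodes it as UTF-8
    return result;
}

void main()
{
    writeln(asciiLower("HÉLLO Ü")); // prints: hÉllo Ü
    writeln(isDigit('\u0663'));     // Arabic-Indic digit three, prints: false
}
```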

"bearophile" <bearophileHUGS lycos.com> writes:

Jonathan M Davis:

 Goodness no.

:-)


 1. Operating on a char is almost always the wrong thing to do.

A single char is often not so useful but I have to keep many 
mutable chars, keeping them as char[] instead of dchar[] saves 
both memory and reduces cache misses. The same is true for types 
like short or float, single ones are not so useful, but they 
sometimes become useful when you have many of them in arrays.

If I have to modify such a char[], using toUpper() requires a 
cast. And in my opinion it's not a good idea to return a dchar if 
you know that both the input and output of the function are a char.


 If you really want to do that, then cast.

On the other hand casts in D have a certain risk, so reducing 
their number as much as possible is a good idea.


 It should _not_ be encouraged.

This is silly, see the above explanation.


 2. It would be disastrous if std.ascii's functions didn't work 
 on unicode.
 Right now, you can use them with ranges on strings which are 
 unicode, which
 can be very useful. [...] but regardless, std.ascii is designed 
 such that its
 functions will all operate on unicode strings. It just doesn't 
 alter unicode
 characters and returns false for them with any of the query 
 functions.

I see, and I didn't know this, I have misunderstood. I have 
thought of std.ascii functions as functions meant to work on just 
ASCII characters/text. But they are better defined as 
Unicode-passing functions. And yeah, it's written at the top of 
the module:

"Functions which operate on ASCII characters. All of the 
functions in std.ascii accept unicode characters but effectively 
ignore them. All isX functions return false for unicode 
characters, and all toX functions do nothing to unicode 
characters."

So now I'd like a new set of functions designed for ASCII text, 
with contracts to refuse not-ASCII things ;-)

Thank you for the answers Jonathan.

Bye,
bearophile

Sep 20 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 20 September 2012 at 17:32:52 UTC, bearophile wrote:
 Jonathan M Davis:
 "Functions which operate on ASCII characters. All of the 
 functions in std.ascii accept unicode characters but 
 effectively ignore them. All isX functions return false for 
 unicode characters, and all toX functions do nothing to unicode 
 characters."

 So now I'd like a new set of functions designed for ASCII text, 
 with contracts to refuse not-ASCII things ;-)

 Thank you for the answers Jonathan.

 Bye,
 bearophile

What do you (you two) think of my proposition for a 
"std.strictascii" module?

The signatures would be:
char toLower(dchar c);

And the implementations would look like:

----
static import std.ascii;

char toLower(dchar c)
in
{
     // contract: only ASCII input is accepted
     assert(std.ascii.isASCII(c));
}
body
{
     return cast(char) std.ascii.toLower(c);
}
----

The rationale for taking a dchar as input is so that its own 
input can be correctly validated, and so that it can easily 
operate with foreach etc, doing the cast internally. The returned 
value would be pre-cast to char.

Usage:

----
import std.stdio;
import std.strictascii;

void main(){
     string s1 = "axbacxf";
     string s2 = "àxbécxf";
     char[] cs = new char[](7);

     //bearophile use case: no casts
     foreach(i, c; s1)
         cs[i] = c.toUpper();

     //illegal use case: correct input validation
     foreach(i, c; s2)
         cs[i] = c.toUpper(); //fails the in contract on the first non-ASCII code unit
}
----

It doesn't add *much* functionality, and arguably, it is a 
specialized functionality, but there are use cases where you want 
to operate ONLY on ASCII, as pointed out by bearophile.

Just curious if I should even consider investing some effort in 
this.

Sep 21 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, September 21, 2012 11:00:31 monarch_dodra wrote:
 What do you (you two) think of my proposition for a
 "std.strictascii" module?

I don't think that it's at all worth it. It's just duplicate functionality in 
order to avoid a cast.

- Jonathan M Davis

Sep 21 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 21 September 2012 at 10:23:39 UTC, Jonathan M Davis 
wrote:
 On Friday, September 21, 2012 11:00:31 monarch_dodra wrote:
 What do you (you two) think of my proposition for a
 "std.strictascii" module?

 I don't think that it's at all worth it. It's just duplicate 
 functionality in
 order to avoid a cast.

(and contract)

 - Jonathan M Davis

Somehow, I expected that reply, but I had to ask anyways :D

Thanks.

Sep 21 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, September 21, 2012 12:38:07 monarch_dodra wrote:
 On Friday, 21 September 2012 at 10:23:39 UTC, Jonathan M Davis
 wrote:
 On Friday, September 21, 2012 11:00:31 monarch_dodra wrote:
 What do you (you two) think of my proposition for a
 "std.strictascii" module?

 I don't think that it's at all worth it. It's just duplicate
 functionality in
 order to avoid a cast.

 (and contract)

If that's what you want, it's easy enough to create a helper function which 
you use instead of a cast which does the contract check as well. e.g.

char toChar(dchar c)
{
    assert(isASCII(c)); // std.ascii.isASCII; the assert is compiled out with -release
    return cast(char)c;
}

foreach(ref char c; str)
    c = toChar(std.ascii.toLower(c));

It should be completely optimized out with -release and -inline.

- Jonathan M Davis

Sep 21 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 21 September 2012 at 10:45:42 UTC, Jonathan M Davis 
wrote:
 On Friday, September 21, 2012 12:38:07 monarch_dodra wrote:
 On Friday, 21 September 2012 at 10:23:39 UTC, Jonathan M Davis
 wrote:
 On Friday, September 21, 2012 11:00:31 monarch_dodra wrote:
 What do you (you two) think of my proposition for a
 "std.strictascii" module?

 I don't think that it's at all worth it. It's just duplicate
 functionality in
 order to avoid a cast.

 (and contract)

 If that's what you want, it's easy enough to create a helper 
 function which
 you use instead of a cast which does the contract check as 
 well. e.g.

 char toChar(dchar c)
 {
      assert(isASCII(c));
     return cast(char)c;
 }

 foreach(ref char c; str)
     c = toChar(std.ascii.toLower(c));

 It should be completely optimized out with -release and -inline.

 - Jonathan M Davis

That's a real good idea. Also, I find it is these kinds of 
situations where UFCS really shines (IMO):

     foreach(i, c; s1)
         cs[i] = c.toUpper().toChar();

I love this syntax.

Related, could "toChar" be considered for inclusion? I think it 
would be a convenient tool for validation.

/*
  * Casts dchar to a char.
  *
  * Preconditions:
  *   $(D c) must be representable in a single char.
  */
char toChar(dchar c)
{
     assert(c < 256, "toChar: Input too large for char");
     return cast(char)c;
}

That said, if we go that way, we might as well just have a more 
generic safeCast in std.conv or something:

T safeCast(T, U)(U i)
     if(isBasicType!T && isBasicType!U)
{
     assert(cast(T)i == i, "safeCast: Cast failed");
     return cast(T)i;
}

     foreach(i, c; s1)
         cs[i] = c.toUpper().safeCast!char();

Hum... yeah... I don't know...

I seem to be typing faster than I can really think of the 
consequences of such a function.

Sep 21 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, September 21, 2012 13:18:01 monarch_dodra wrote:
 Related, could "toChar" be considered for inclusion? I think it
 would be a convenient tool for validation.

I certainly would be against adding it. I think that it's a relatively 
uncommon use case and considering how easy it is to just write the function 
yourself, the functionality gain is minimal. I just don't think that it 
carries its weight as far as the standard library goes. But I don't know how 
others like Andrei would feel.

 That said, if we go that way, we might as well just have a more
 generic safeCast in std.conv or something:

You're basically asking for a version of std.conv.to which uses assertions 
instead of exceptions.

- Jonathan M Davis

Sep 21 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 21 September 2012 at 11:25:54 UTC, Jonathan M Davis 
wrote:
 On Friday, September 21, 2012 13:18:01 monarch_dodra wrote:
 Related, could "toChar" be considered for inclusion? I think it
 would be a convenient tool for validation.

 I certainly would be against adding it. I think that it's a 
 relatively
 uncommon use case and considering how easy it is to just write 
 the function
 yourself, the functionality gain is minimal. I just don't think 
 that it
 carries its weight as far as the standard library goes. But I 
 don't know how
 others like Andrei would feel.

 That said, if we go that way, we might as well just have a more
 generic safeCast in std.conv or something:

 You're basically asking for a version of std.conv.to which uses 
 assertions
 instead of exceptions.

 - Jonathan M Davis

I know my ideas usually get shot down, but I usually learn a LOT 
from your answers, so sorry for insisting.

I did not know that std.conv's to did cast validation. In my 
defense, its documentation is actually missing:

http://dlang.org/phobos/std_conv.html

I made a doc pull request so that it would appear.

https://github.com/D-Programming-Language/phobos/pull/811

Sep 21 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, September 21, 2012 14:10:25 monarch_dodra wrote:
 I did not know conv's to did cast validation.

For conversions which can be done with both casting and std.conv.to, 
std.conv.to does runtime checks wherever a narrowing conversion would take 
place and throws if the conversion would lose precision.

- Jonathan M Davis

Sep 21 2012
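
The difference between a checked `std.conv.to` conversion and a plain cast can be sketched as follows. The example uses `int` to `ubyte` rather than `dchar` to `char`, purely as an assumption-free illustration of the narrowing check; the values 200 and 300 are arbitrary.

```d
import std.conv : to, ConvOverflowException;
import std.exception : assertThrown;

void main()
{
    int small = 200;
    int large = 300;

    // 200 fits in a ubyte, so the value is returned unchanged.
    ubyte ok = to!ubyte(small);
    assert(ok == 200);

    // 300 does not fit in a ubyte: to throws instead of truncating.
    assertThrown!ConvOverflowException(to!ubyte(large));

    // A plain cast silently truncates: 300 % 256 == 44.
    assert(cast(ubyte) large == 44);
}
```

The check happens at run time and reports failure with an exception, which is the contrast with the assertion-based helpers proposed earlier in the thread.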

Don Clugston <dac nospam.com> writes:

On 20/09/12 18:57, Jonathan M Davis wrote:
 On Thursday, September 20, 2012 18:35:21 bearophile wrote:
 monarch_dodra:
 It's not, it only *operates* on ASCII, but non ascii is still a
 legal arg:

 Then maybe std.ascii.toLower needs a pre-condition that
 constrains it to just ASCII inputs, so it's free to return a
 char.

 Goodness no.

 1. Operating on a char is almost always the wrong thing to do. If you really
 want to do that, then cast. It should _not_ be encouraged.

  2. It would be disastrous if std.ascii's functions didn't work on unicode.
 Right now, you can use them with ranges on strings which are unicode, which
 can be very useful.
 I grant you that that's more obvious with something like
 isDigit than toLower, but regardless, std.ascii is designed such that its
 functions will all operate on unicode strings. It just doesn't alter unicode
 characters and returns false for them with any of the query functions.

Are there any use cases of toLower() on non-ASCII strings?
Seriously? I think it's _always_ a bug.

At the very least that function should have a name like 
toLowerIgnoringNonAscii() to indicate that it is performing a really, 
really foul operation.

The fact that toLower("Ü") doesn't generate an error, but doesn't return 
"ü" is a wrong-code bug IMHO. It isn't any better than if it returned a 
random garbage character (eg, it's OK in my opinion for ASCII toLower to 
consider only the lower 7 bits).

OTOH I can see some value in a cased ASCII vs unicode comparison.
ie, given an ASCII string and a unicode string, do a case-insensitive 
comparison, eg look for
"<HTML>" inside "öähaøſ€đ <html>ſŋħŋ€ł¶"

Sep 27 2012
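
The cased ASCII-versus-Unicode comparison mentioned above could be sketched like this. `containsAsciiIgnoreCase` is a hypothetical name chosen for this illustration; the sketch assumes the haystack is iterated as decoded code points, so that `std.ascii.toLower` only ever folds ASCII letters and leaves every other character alone.

```d
import std.algorithm : canFind;
import std.ascii : toLower;

// True if `needle` occurs in `haystack`, ignoring the case of ASCII
// letters only; non-ASCII code points must match exactly.
bool containsAsciiIgnoreCase(string haystack, string needle)
{
    return haystack.canFind!((a, b) => toLower(a) == toLower(b))(needle);
}

void main()
{
    string text = "öähaøſ€đ <html>ſŋħŋ€ł¶";
    assert(containsAsciiIgnoreCase(text, "<HTML>"));
    assert(!containsAsciiIgnoreCase(text, "<BODY>"));
}
```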
