digitalmars.D - DIP76: Autodecode Should Not Throw

Walter Bright <newshound2 digitalmars.com> writes:

http://wiki.dlang.org/DIP76

Apr 06 2015

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76

I am against this. It can lead to silent irreversible data 
corruption.

Apr 06 2015

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Tuesday, 7 April 2015 at 04:05:38 UTC, Vladimir Panteleev 
wrote:
 On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76

 I am against this. It can lead to silent irreversible data 
 corruption.

Instead, I would like to suggest promoting the use of `handle` 
and the like:

http://dlang.org/phobos/std_exception.html#handle

This way, code that needs to be nothrow can opt in to be nothrow 
via such composition, which is also aligned with the principle that 
introducing the risk of silent data corruption should be opt-in.
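
For illustration, here is a minimal sketch of that opt-in composition, adapted from the example in the std.exception documentation. The helper name `decodeLossy` and the `replacement` constant are made up for this sketch; only `handle`, `RangePrimitive` and `UTFException` come from Phobos.

```d
import std.algorithm.comparison : equal;
import std.exception : handle, RangePrimitive;
import std.utf : UTFException;

// U+FFFD REPLACEMENT CHARACTER, typed as dchar to match what `front` yields.
enum dchar replacement = '\uFFFD';

// Hypothetical helper: wraps a string so that a UTFException thrown while
// decoding `front`/`back` is replaced by U+FFFD instead of propagating.
auto decodeLossy(string s)
{
    return s.handle!(UTFException, RangePrimitive.access,
            (e, r) => replacement);
}

void main()
{
    // 0xFF can never appear in well-formed UTF-8.
    auto lossy = decodeLossy("hello\xFFworld");
    assert(lossy.equal("hello\uFFFDworld"));
}
```

Code that does not wrap its input this way keeps the throwing behaviour, so the lossy path has to be requested explicitly.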

Apr 06 2015

"w0rp" <devw0rp gmail.com> writes:

On Tuesday, 7 April 2015 at 04:05:38 UTC, Vladimir Panteleev 
wrote:
 On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76

 I am against this. It can lead to silent irreversible data 
 corruption.

I can see the value in both.

With something like Objective C on iOS, basically everything is 
nothrow. They don't do any cleanup for references when exceptions 
happen, so they don't generate slower reference counting code. 
Exceptions in Objective C on iOS are not supposed to be caught 
ever. So you don't use exceptions and garbage collection, your 
code runs pretty fast, and your applications are smooth.

On the other hand, not throwing the exceptions leads to silent 
failures, which can lead to creating garbage data. Objective C in 
particular is designed to tolerate failure, given that messages 
run on nil objects simply do nothing and return cast(T) 0 for the 
message's return type. You're in a world of checking return 
codes, validating data, etc.

Maybe autodecoding could throw an Error (No 'new' allowed) when 
debug mode is on, and use replacement characters in release mode. 
I haven't thought it through, but that's an idea.

Apr 07 2015

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Tuesday, 7 April 2015 at 07:42:02 UTC, w0rp wrote:
 Maybe autodecoding could throw an Error (No 'new' allowed) when 
 debug mode is on, and use replacement characters in release 
 mode. I haven't thought it through, but that's an idea.

No no no, terrible idea. This means your program will pass your 
test suite in debug mode (which, of course, is never going to 
test behavior with bad UTF in all the relevant places), but 
silently corrupt real-world data in release mode. Errors and 
asserts are for logic errors, not for validating user input!

Apr 07 2015

"Marc Schütz" <schuetzm gmx.net> writes:

On Tuesday, 7 April 2015 at 07:50:40 UTC, Vladimir Panteleev 
wrote:
 On Tuesday, 7 April 2015 at 07:42:02 UTC, w0rp wrote:
 Maybe autodecoding could throw an Error (No 'new' allowed) 
 when debug mode is on, and use replacement characters in 
 release mode. I haven't thought it through, but that's an idea.

 No no no, terrible idea. This means your program will pass your 
 test suite in debug mode (which, of course, is never going to 
 test behavior with bad UTF in all the relevant places), but 
 silently corrupt real-world data in release mode. Errors and 
 asserts are for logic errors, not for validating user input!

I'd say that invalid UTF8 in `string`s _is_ a logic error, 
because these are defined to be valid UTF8. If they aren't, 
someone didn't correctly validate their inputs.

Unfortunately, not even the runtime cares about UTF correctness:

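     void main(string[] args) {
         import std.utf : validate;
         // args come straight from the OS, unchecked, so this
         // throws a UTFException if args[1] is not valid UTF-8
         args[1].validate;
     }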

Apr 07 2015

"John Carter" <john.carter taitradio.com> writes:

On Tuesday, 7 April 2015 at 04:05:38 UTC, Vladimir Panteleev 
wrote:
 On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76

 I am against this. It can lead to silent irreversible data 
 corruption.

Sigh!

99.99% of the time when I'm processing text.... my program didn't 
create the text.

An eclectic mob of text editors driven by a herd of cats each 
having wildly different concepts of encoding wrote it.

99.999% of the time when I hit one of these cases... the 
"irreversible data corruption" is _already_ there.

Tough.

It's there, it's irreversible, I have to live with it and make 
forward progress.

Sure, on some tasks I want to know it is there.... but by far in 
most tasks all I can do is shrug, slap it to something sensible, 
and carry on.

One of the first things I had to do in D was write code to do 
this.... and it all seemed way harder and slower than it needed to 
be.

(Oh for The Simple Fun Good Bad Old Days of everything is 7 bit 
ASCII... except for the funny stuff above 127 which you ignored 
anyway.)

Apr 07 2015

"Kagamin" <spam here.lot> writes:

On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76

Deprecation can be reported by checking version:

version(EnableNothrowAutodecoding)
    alias autodecode = autodecodeImpl;
else
    deprecated("compile with -version=EnableNothrowAutodecoding")
    alias autodecode = autodecodeImpl;

Apr 07 2015

"Dicebot" <public dicebot.lv> writes:

On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76

I have doubts about it similar to Vladimir. Main problem is that 
I have no idea what actually happens if replacement characters 
appear in some unicode text my program processes. So far I have 
that calming feeling that if something goes wrong in this regard, 
exception will slap me right into my face.

Also it is worrying to see so much effort put into `nothrow` in 
language which endorses exceptions as its main error reporting 
mechanism.

Apr 07 2015

Walter Bright <newshound2 digitalmars.com> writes:

On 4/7/2015 1:19 AM, Dicebot wrote:
 I have doubts about it similar to Vladimir. Main problem is that I have no idea
 what actually happens if replacement characters appear in some unicode text my
 program processes.

It's much like floating point NaN values, which are 'sticky'.

 So far I have that calming feeling that if something goes
 wrong in this regard, exception will slap me right into my face.

With UTF strings, if you care about invalid UTF (a surprisingly large amount of 
operations done on strings simply don't care about invalid UTF) the validation 
can be done as a separate step. Then, the program logic is divided into 
operating on "validated" and "unvalidated" data.
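
A minimal sketch of validation as a separate step, using `std.utf.validate`. The helper name `isValidUtf8` is an assumption made for this example, not an existing Phobos function.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on a malformed sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    assert(isValidUtf8("plain ASCII"));
    assert(isValidUtf8("na\u00EFve"));   // multi-byte but well-formed
    assert(!isValidUtf8("bad\xFFbyte")); // 0xFF is never valid UTF-8
}
```

Input that passes such a check once at the boundary can then be treated as "validated" by the rest of the program.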

 Also it is worrying to see so much effort put into `nothrow` in language which
 endorses exceptions as its main error reporting mechanism.

There is definitely a tug of war going on there. Exceptions are great, except 
they aren't free.

What I've tried to do is design things so that erroneous input is not possible - 
that all possible input has straightforward output. In other words, try to 
define the problem out of existence. Then there are no errors.

Apr 07 2015

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Tuesday, 7 April 2015 at 09:04:09 UTC, Walter Bright wrote:
 On 4/7/2015 1:19 AM, Dicebot wrote:
 I have doubts about it similar to Vladimir. Main problem is 
 that I have no idea
 what actually happens if replacement characters appear in some 
 unicode text my
 program processes.

 It's much like floating point NaN values, which are 'sticky'.

Yes, but std.conv doesn't return NaN if you try to convert 
"banana" to a double.

 With UTF strings, if you care about invalid UTF (a surprisingly 
 large amount of operations done on strings simply don't care 
 about invalid UTF) the validation can be done as a separate 
 step.

So can converting invalid UTF to replacement characters.

 Also it is worrying to see so much effort put into `nothrow` 
 in language which
 endorses exceptions as its main error reporting mechanism.

 There is definitely a tug of war going on there. Exceptions are 
 great, except they aren't free.

 What I've tried to do is design things so that erroneous input 
 is not possible - that all possible input has straightforward 
 output. In other words, try to define the problem out of 
 existence. Then there are no errors.

I think the correct solution to that is to kill auto-decoding :) 
Then all decoding is explicit, and since it is explicit, it is 
trivial to allow specifying the desired behavior upon 
encountering invalid UTF-8.

Apr 07 2015

"bearophile" <bearophileHUGS lycos.com> writes:

Vladimir Panteleev:

 std.conv doesn't return NaN if you try to convert "banana" to a 
 double.

I have suggested to add a nothrow function like "maybeTo" that 
returns a Nullable result.
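
A rough sketch of what such a function could look like. `maybeTo` is the hypothetical name from the suggestion above and does not exist in Phobos. As written it only catches `ConvException`, so it is not actually `nothrow`; catching `Exception` instead would probably be needed for that.

```d
import std.conv : to, ConvException;
import std.typecons : Nullable;

// Hypothetical `maybeTo`: like `to!T`, but a failed conversion
// yields a null Nullable instead of an exception.
Nullable!T maybeTo(T, S)(S value)
{
    try
    {
        return Nullable!T(value.to!T);
    }
    catch (ConvException)
    {
        return Nullable!T.init; // null state
    }
}

void main()
{
    assert(maybeTo!double("1.5").get == 1.5);
    assert(maybeTo!double("banana").isNull);
}
```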

Bye,
bearophile

Apr 07 2015

Walter Bright <newshound2 digitalmars.com> writes:

On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
 On Tuesday, 7 April 2015 at 09:04:09 UTC, Walter Bright wrote:
 On 4/7/2015 1:19 AM, Dicebot wrote:
 I have doubts about it similar to Vladimir. Main problem is that I have no idea
 what actually happens if replacement characters appear in some unicode text my
 program processes.

 It's much like floating point NaN values, which are 'sticky'.

 Yes, but std.conv doesn't return NaN if you try to convert "banana" to a
 double.

Maybe it should :-)


 With UTF strings, if you care about invalid UTF (a surprisingly large amount
 of operations done on strings simply don't care about invalid UTF) the
 validation can be done as a separate step.

 So can converting invalid UTF to replacement characters.

I know, I read your post. The machinery to allocate, throw, catch, and replace 
is still there.


 I think the correct solution to that is to kill auto-decoding :) Then all
 decoding is explicit, and since it is explicit, it is trivial to allow
 specifying the desired behavior upon encountering invalid UTF-8.

I agree autodecoding is a mistake, but we're stuck with it.

Apr 07 2015

"Ulrich Küttler" <kuettler gmail.com> writes:

On Tuesday, 7 April 2015 at 09:21:52 UTC, Walter Bright wrote:
 On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
 On Tuesday, 7 April 2015 at 09:04:09 UTC, Walter Bright wrote:
 On 4/7/2015 1:19 AM, Dicebot wrote:
 I have doubts about it similar to Vladimir. Main problem is 
 that I have no idea
 what actually happens if replacement characters appear in 
 some unicode text my
 program processes.

 It's much like floating point NaN values, which are 'sticky'.

 Yes, but std.conv doesn't return NaN if you try to convert 
 "banana" to a double.

 Maybe it should :-)

There was a time when operations on NaNs were painfully slow. 
Also, since NaNs tend to spread, once a NaN appears, there usually 
is not much of a result left. Debugging used to be painfully hard 
if NaNs were enabled. We used to rely on floating point exceptions 
instead.

This might or might not be relevant.

Apr 07 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Tue, Apr 07, 2015 at 02:21:50AM -0700, Walter Bright via Digitalmars-d wrote:
 On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:

 [...]
 I think the correct solution to that is to kill auto-decoding :) Then
 all decoding is explicit, and since it is explicit, it is trivial to
 allow specifying the desired behavior upon encountering invalid
 UTF-8.

 
 I agree autodecoding is a mistake, but we're stuck with it.

How so? There *are* possible options we can consider to migrate away
from autodecoding. AFAICT the real roadblock here is that some people
strongly disagree with this, so it's more a community barrier than a
technical one.


T

-- 
Unix is my IDE. -- Justin Whear

Apr 07 2015

"w0rp" <devw0rp gmail.com> writes:

On Tuesday, 7 April 2015 at 09:21:52 UTC, Walter Bright wrote:
 On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
 I think the correct solution to that is to kill auto-decoding 
 :) Then all
 decoding is explicit, and since it is explicit, it is trivial 
 to allow
 specifying the desired behavior upon encountering invalid 
 UTF-8.

 I agree autodecoding is a mistake, but we're stuck with it.

I don't think we are stuck with it. I think we can change it. I 
think a lot of the automatic decoding happens inside of Phobos, 
while people care mostly about the boundaries of the API. If we 
do get rid of it, then as Vladimir says, you can opt in to 
whether or not you want a non-throwing conversion, or a throwing 
one.

I was going to write about how the auto decoding doesn't solve 
the problem of comparing strings, given that you need to look at 
ranges of characters, subject to normalisation, unless you're 
dealing with just ASCII. I think all of that has been said to 
death, though. I think it's possible for us to get rid of 
automatic decoding.

Apr 07 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Tue, Apr 07, 2015 at 06:00:12PM +0000, w0rp via Digitalmars-d wrote:
 On Tuesday, 7 April 2015 at 09:21:52 UTC, Walter Bright wrote:
 On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
 I think the correct solution to that is to kill auto-decoding :)
 Then all decoding is explicit, and since it is explicit, it is
 trivial to allow specifying the desired behavior upon encountering
 invalid UTF-8.

 I agree autodecoding is a mistake, but we're stuck with it.

 
 I don't think we are stuck with it. I think we can change it. I think
 a lot of the automatic decoding happens inside of Phobos, while people
 care mostly about the boundaries of the API. If we do get rid of it,
 then as Vladimir says, you can opt in to whether or not you want a
 non-throwing conversion, or a throwing one.
 
 I was going to write about how the auto decoding doesn't solve the
 problem of comparing strings, given that you need to look at ranges of
 characters, subject to normalisation, unless you're dealing with just
 ASCII. I think all of that has been said to death, though. I think
 it's possible for us to get rid of automatic decoding.

If somebody were to write a DIP for killing autodecoding, I'd vote in
favor.

Getting it past Andrei, OTOH, is a different story. ;-)


T

-- 
Never trust an operating system you don't have source for! -- Martin Schulze

Apr 07 2015

"Kagamin" <spam here.lot> writes:

On Tuesday, 7 April 2015 at 18:18:55 UTC, H. S. Teoh wrote:
 If somebody were to write a DIP for killing autodecoding, I'd 
 vote in
 favor.

 Getting it past Andrei, OTOH, is a different story. ;-)

http://forum.dlang.org/post/luonbfghopyrtcoejjsu@forum.dlang.org
But how can a DIP address a non-technical issue?

Apr 08 2015

Daniel Kozak via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Tue, 7 Apr 2015 11:16:16 -0700
"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> wrote:

 On Tue, Apr 07, 2015 at 06:00:12PM +0000, w0rp via Digitalmars-d wrote:
 On Tuesday, 7 April 2015 at 09:21:52 UTC, Walter Bright wrote:
 On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
 I think the correct solution to that is to kill auto-decoding :)
 Then all decoding is explicit, and since it is explicit, it is
 trivial to allow specifying the desired behavior upon encountering
 invalid UTF-8.

 I agree autodecoding is a mistake, but we're stuck with it.

 
 I don't think we are stuck with it. I think we can change it. I think
 a lot of the automatic decoding happens inside of Phobos, while people
 care mostly about the boundaries of the API. If we do get rid of it,
 then as Vladimir says, you can opt in to whether or not you want a
 non-throwing conversion, or a throwing one.
 
 I was going to write about how the auto decoding doesn't solve the
 problem of comparing strings, given that you need to look at ranges of
 characters, subject to normalisation, unless you're dealing with just
 ASCII. I think all of that has been said to death, though. I think
 it's possible for us to get rid of automatic decoding.

 
 If somebody were to write a DIP for killing autodecoding, I'd vote in
 favor.
 

me too

Apr 07 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Tue, Apr 07, 2015 at 09:10:32AM +0000, Vladimir Panteleev via Digitalmars-d
wrote:
[...]
 I think the correct solution to that is to kill auto-decoding :) Then
 all decoding is explicit, and since it is explicit, it is trivial to
 allow specifying the desired behavior upon encountering invalid UTF-8.

I used to be pro-autodecoding... nowadays, I'm starting to lean towards
killing it. This is another nail in the coffin.


T

-- 
He who sacrifices functionality for ease of use, loses both and deserves
neither. -- Slashdotter

Apr 07 2015

"Abdulhaq" <alynch4047 gmail.com> writes:

On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76

The DIP lists the benefits but does not mention any cons.

A con that I can see is that it is violating the 'fail fast' 
principle. By silently replacing data the developer will be 
presented with a probably-hard-to-debug problem later down the 
application lifecycle (probably in an unrelated area), wasting 
developer time.

Apr 07 2015

Walter Bright <newshound2 digitalmars.com> writes:

On 4/7/2015 5:04 AM, Abdulhaq wrote:
 On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76

 The DIP lists the benefits but does not mention any cons.

 A con that I can see is that it is violating the 'fail fast' principle. By
 silently replacing data the developer will be presented with a
 probably-hard-to-debug problem later down the application lifecycle (probably in
 an unrelated area), wasting developer time.


On the other hand, if there's any place where people demand the highest 
performance, it's string processing.

Apr 07 2015

Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Monday, April 06, 2015 20:16:19 Walter Bright via Digitalmars-d wrote:
 http://wiki.dlang.org/DIP76

I am fully in favor of this. Most code really doesn't care about invalid
unicode, and if it does, it can check explicitly. Using the replacement
character is much cleaner and follows the Unicode standard.

And in my experience, if I run into invalid Unicode, I generally have to
process it regardless, forcing me to do something like use the replacement
character anyway. The fact that std.utf.decode throws just becomes an
annoyance.

About the only real downside to this that I can think of is that if you're
writing a new string algorithm, and you botch it such that it mangles the
Unicode, right now, you'd quickly get exceptions, whereas with this change,
you wouldn't. But if you're testing your string-based code with Unicode
rather than just ASCII, then that should still get caught. Regardless, I
think that this is the way to go.

- Jonathan M Davis

Apr 19 2015
