digitalmars.D - DIP76: Autodecode Should Not Throw

reply Walter Bright <newshound2 digitalmars.com> writes:
http://wiki.dlang.org/DIP76
Apr 06 2015
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76
I am against this. It can lead to silent irreversible data corruption.
Apr 06 2015
next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Tuesday, 7 April 2015 at 04:05:38 UTC, Vladimir Panteleev 
wrote:
 On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76
I am against this. It can lead to silent irreversible data corruption.
Instead, I would like to suggest promoting the use of `handle` and the like: http://dlang.org/phobos/std_exception.html#handle This way, code that needs to be nothrow can opt in to being nothrow via such composition, which also fits the principle that introducing the risk of silent data corruption should be opt-in.
Apr 06 2015
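The opt-in composition suggested above can be sketched with `std.exception.handle`. This is a minimal illustration, not part of the DIP: `decodeLeniently` is a hypothetical helper name, and the choice of a space as the substitute character is an assumption made only for the example.

```d
import std.algorithm.comparison : equal;
import std.exception : handle, RangePrimitive;
import std.utf : UTFException;

// Hypothetical helper: wrap a string so that a UTFException raised while
// accessing an element is handled by substituting a space.
auto decodeLeniently(string s)
{
    return s.handle!(UTFException, RangePrimitive.access, (e, r) => ' ');
}

void main()
{
    auto str = "hello\xFFworld"; // 0xFF is an invalid UTF-8 code unit
    // The caller explicitly opted in to lossy decoding at this point.
    assert(decodeLeniently(str).equal("hello world"));
}
```

Code that does not wrap its input this way keeps the throwing behaviour, so any lossy substitution stays visible at the call site.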
prev sibling next sibling parent reply "w0rp" <devw0rp gmail.com> writes:
On Tuesday, 7 April 2015 at 04:05:38 UTC, Vladimir Panteleev 
wrote:
 On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76
I am against this. It can lead to silent irreversible data corruption.
I can see the value in both. With something like Objective C on iOS, basically everything is nothrow. They don't do any cleanup for references when exceptions happen, so they don't generate slower reference counting code. Exceptions in Objective C on iOS are not supposed to be caught ever. So you don't use exceptions and garbage collection, your code runs pretty fast, and your applications are smooth. On the other hand, not throwing the exceptions leads to silent failures, which can lead to creating garbage data. Objective C in particular is designed to tolerate failure, given that messages run on nil objects simply do nothing and return cast(T) 0 for the message's return type. You're in a world of checking return codes, validating data, etc. Maybe autodecoding could throw an Error (No 'new' allowed) when debug mode is on, and use replacement characters in release mode. I haven't thought it through, but that's an idea.
Apr 07 2015
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Tuesday, 7 April 2015 at 07:42:02 UTC, w0rp wrote:
 Maybe autodecoding could throw an Error (No 'new' allowed) when 
 debug mode is on, and use replacement characters in release 
 mode. I haven't thought it through, but that's an idea.
No no no, terrible idea. This means your program will pass your test suite in debug mode (which, of course, is never going to test behavior with bad UTF in all the relevant places), but silently corrupt real-world data in release mode. Errors and asserts are for logic errors, not for validating user input!
Apr 07 2015
parent "Marc Schütz" <schuetzm gmx.net> writes:
On Tuesday, 7 April 2015 at 07:50:40 UTC, Vladimir Panteleev 
wrote:
 On Tuesday, 7 April 2015 at 07:42:02 UTC, w0rp wrote:
 Maybe autodecoding could throw an Error (No 'new' allowed) 
 when debug mode is on, and use replacement characters in 
 release mode. I haven't thought it through, but that's an idea.
No no no, terrible idea. This means your program will pass your test suite in debug mode (which, of course, is never going to test behavior with bad UTF in all the relevant places), but silently corrupt real-world data in release mode. Errors and asserts are for logic errors, not for validating user input!
I'd say that invalid UTF8 in `string`s _is_ a logic error, because these are defined to be valid UTF8. If they aren't, someone didn't correctly validate their inputs.

Unfortunately, not even the runtime cares about UTF correctness:

    void main(string[] args) {
        import std.utf;
        args[1].validate; // throws
    }

    # ./testutf8 `echo 'äöü' | recode utf8..latin1`
Apr 07 2015
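Validation at the program boundary, as the post above implies, might look like the following sketch. `isValidUtf8` is a hypothetical helper name; `std.utf.validate` and `UTFException` are the Phobos pieces it relies on.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: report whether `s` is well-formed UTF-8 by
// running std.utf.validate and translating its exception into a bool.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    assert(isValidUtf8("äöü"));           // UTF-8 source literal
    assert(!isValidUtf8("\xE4\xF6\xFC")); // the same letters in Latin-1
}
```

A program could run such a check once on `args` and other external input, then treat every `string` past that point as validated.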
prev sibling parent "John Carter" <john.carter taitradio.com> writes:
On Tuesday, 7 April 2015 at 04:05:38 UTC, Vladimir Panteleev 
wrote:
 On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76
I am against this. It can lead to silent irreversible data corruption.
Sigh! 99.99% of the time when I'm processing text.... my program didn't create the text. An eclectic mob of text editors driven by a herd of cats each having wildly different concepts of encoding wrote it. 99.999% of the time when I hit one of these cases... the "irreversible data corruption" is _already_ there. Tough. It's there, it's irreversible, I have to live with it and make forward progress. Sure, on some tasks I want to know it is there.... but by far in most tasks all I can do is shrug, slap it to something sensible, and carry on. One of the first things I had to do in D was write code to do this.... and it all seemed way harder and slower than it needed to be. (Oh for The Simple Fun Good Bad Old Days of everything is 7 bit ASCII... except for the funny stuff above 127 which you ignored anyway.)
Apr 07 2015
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76
Deprecation can be reported by checking version:

    version(EnableNothrowAutodecoding)
        alias autodecode=autodecodeImpl;
    else
        deprecated("compile with -version=EnableNothrowAutodecoding")
        alias autodecode=autodecodeImpl;
Apr 07 2015
prev sibling next sibling parent reply "Dicebot" <public dicebot.lv> writes:
On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76
I have doubts about it similar to Vladimir's. The main problem is that I have no idea what actually happens if replacement characters appear in some Unicode text my program processes. So far I have had the calming feeling that if something goes wrong in this regard, an exception will slap me right in the face. It is also worrying to see so much effort put into `nothrow` in a language which endorses exceptions as its main error reporting mechanism.
Apr 07 2015
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2015 1:19 AM, Dicebot wrote:
 I have doubts about it similar to Vladimir. Main problem is that I have no idea
 what actually happens if replacement characters appear in some unicode text my
 program processes.
It's much like floating point NaN values, which are 'sticky'.
 So far I have that calming feeling that if something goes
 wrong in this regard, exception will slap me right into my face.
With UTF strings, if you care about invalid UTF (a surprisingly large number of operations done on strings simply don't care about invalid UTF) the validation can be done as a separate step. Then, the program logic is divided into operating on "validated" and "unvalidated" data.
 Also it is worrying to see so much effort put into `nothrow` in language which
 endorses exceptions as its main error reporting mechanism.
There is definitely a tug of war going on there. Exceptions are great, except they aren't free. What I've tried to do is design things so that erroneous input is not possible - that all possible input has straightforward output. In other words, try to define the problem out of existence. Then there are no errors.
Apr 07 2015
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Tuesday, 7 April 2015 at 09:04:09 UTC, Walter Bright wrote:
 On 4/7/2015 1:19 AM, Dicebot wrote:
 I have doubts about it similar to Vladimir. Main problem is 
 that I have no idea
 what actually happens if replacement characters appear in some 
 unicode text my
 program processes.
It's much like floating point NaN values, which are 'sticky'.
Yes, but std.conv doesn't return NaN if you try to convert "banana" to a double.
 With UTF strings, if you care about invalid UTF (a surprisingly 
 large amount of operations done on strings simply don't care 
 about invalid UTF) the validation can be done as a separate 
 step.
So can converting invalid UTF to replacement characters.
 Also it is worrying to see so much effort put into `nothrow` 
 in language which
 endorses exceptions as its main error reporting mechanism.
There is definitely a tug of war going on there. Exceptions are great, except they aren't free. What I've tried to do is design things so that erroneous input is not possible - that all possible input has straightforward output. In other words, try to define the problem out of existence. Then there are no errors.
I think the correct solution to that is to kill auto-decoding :) Then all decoding is explicit, and since it is explicit, it is trivial to allow specifying the desired behavior upon encountering invalid UTF-8.
Apr 07 2015
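Explicit decoding with a chosen policy, as argued for above, can be sketched with `std.utf.byDchar`, which substitutes `std.utf.replacementDchar` (U+FFFD) for malformed sequences instead of throwing. `hasInvalidUtf` is a hypothetical helper name used only for this illustration.

```d
import std.algorithm.searching : canFind;
import std.utf : byDchar, replacementDchar;

// Hypothetical helper: decode explicitly with the replacement-character
// policy and report whether any U+FFFD came out.
// Caveat: input that legitimately contains U+FFFD is also reported.
bool hasInvalidUtf(string s)
{
    return s.byDchar.canFind(replacementDchar);
}

void main()
{
    assert(hasInvalidUtf("hello\xFFworld")); // 0xFF is not valid UTF-8
    assert(!hasInvalidUtf("hello world"));
}
```

Because the decoding call is spelled out, a reader can see which policy applies here, and a different call site is free to pick a throwing decode instead.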
next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Vladimir Panteleev:

 std.conv doesn't return NaN if you try to convert "banana" to a 
 double.
I have suggested adding a nothrow function like "maybeTo" that returns a Nullable result. Bye, bearophile
Apr 07 2015
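A rough sketch of what such a function could look like follows. `maybeTo` is not an existing Phobos function; the name comes from the post above, and this body is only an assumption about its behaviour, built from `std.conv.to` and `std.typecons.Nullable`.

```d
import std.conv : to;
import std.typecons : Nullable;

// Hypothetical maybeTo: attempt the conversion and return an empty
// Nullable instead of letting the exception escape to the caller.
Nullable!T maybeTo(T, S)(S value)
{
    try
    {
        return Nullable!T(value.to!T);
    }
    catch (Exception)
    {
        return Nullable!T.init; // the null (empty) state
    }
}

void main()
{
    assert(maybeTo!double("banana").isNull);
    assert(maybeTo!double("1.5").get == 1.5);
}
```

This version still pays for the exception internally; a real implementation would presumably avoid throwing in the first place.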
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
 On Tuesday, 7 April 2015 at 09:04:09 UTC, Walter Bright wrote:
 On 4/7/2015 1:19 AM, Dicebot wrote:
 I have doubts about it similar to Vladimir. Main problem is that I have no idea
 what actually happens if replacement characters appear in some unicode text my
 program processes.
It's much like floating point NaN values, which are 'sticky'.
Yes, but std.conv doesn't return NaN if you try to convert "banana" to a double.
Maybe it should :-)
 With UTF strings, if you care about invalid UTF (a surprisingly large amount
 of operations done on strings simply don't care about invalid UTF) the
 validation can be done as a separate step.
So can converting invalid UTF to replacement characters.
I know, I read your post. The machinery to allocate, throw, catch, and replace is still there.
 I think the correct solution to that is to kill auto-decoding :) Then all
 decoding is explicit, and since it is explicit, it is trivial to allow
 specifying the desired behavior upon encountering invalid UTF-8.
I agree autodecoding is a mistake, but we're stuck with it.
Apr 07 2015
next sibling parent "Ulrich Küttler" <kuettler gmail.com> writes:
On Tuesday, 7 April 2015 at 09:21:52 UTC, Walter Bright wrote:
 On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
 On Tuesday, 7 April 2015 at 09:04:09 UTC, Walter Bright wrote:
 On 4/7/2015 1:19 AM, Dicebot wrote:
 I have doubts about it similar to Vladimir. Main problem is 
 that I have no idea
 what actually happens if replacement characters appear in 
 some unicode text my
 program processes.
It's much like floating point NaN values, which are 'sticky'.
Yes, but std.conv doesn't return NaN if you try to convert "banana" to a double.
Maybe it should :-)
There was a time when operations on NaNs were painfully slow. Also, since NaNs tend to spread, once a NaN appears, there usually is not much of a result left. Debugging used to be painfully hard if NaNs were enabled. We used to rely on floating point exceptions instead. This might or might not be relevant.
Apr 07 2015
prev sibling next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, Apr 07, 2015 at 02:21:50AM -0700, Walter Bright via Digitalmars-d wrote:
 On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
[...]
I think the correct solution to that is to kill auto-decoding :) Then
all decoding is explicit, and since it is explicit, it is trivial to
allow specifying the desired behavior upon encountering invalid
UTF-8.
I agree autodecoding is a mistake, but we're stuck with it.
How so? There *are* possible options we can consider to migrate away from autodecoding. AFAICT the real roadblock here is that some people strongly disagree with this, so it's more a community barrier than a technical one.
T
--
Unix is my IDE. -- Justin Whear
Apr 07 2015
prev sibling parent reply "w0rp" <devw0rp gmail.com> writes:
On Tuesday, 7 April 2015 at 09:21:52 UTC, Walter Bright wrote:
 On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
 I think the correct solution to that is to kill auto-decoding 
 :) Then all
 decoding is explicit, and since it is explicit, it is trivial 
 to allow
 specifying the desired behavior upon encountering invalid 
 UTF-8.
I agree autodecoding is a mistake, but we're stuck with it.
I don't think we are stuck with it. I think we can change it. I think a lot of the automatic decoding happens inside of Phobos, while people care mostly about the boundaries of the API. If we do get rid of it, then as Vladimir says, you can opt in to whether or not you want a non-throwing conversion, or a throwing one. I was going to write about how the auto decoding doesn't solve the problem of comparing strings, given that you need to look at ranges of characters, subject to normalisation, unless you're dealing with just ASCII. I think all of that has been said to death, though. I think it's possible for us to get rid of automatic decoding.
Apr 07 2015
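The point above about comparison needing normalisation can be illustrated with `std.uni.normalize`. `sameText` is a hypothetical helper name; the example only shows that decoding to code points does not by itself make equivalent strings compare equal.

```d
import std.uni : normalize, NFC;

// Hypothetical helper: compare two strings after bringing both to
// Unicode Normalization Form C.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed = "\u00E9";    // é as a single code point
    string decomposed = "e\u0301"; // e followed by a combining acute accent
    assert(composed != decomposed);         // differ as code units and code points
    assert(sameText(composed, decomposed)); // equal after normalisation
}
```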
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, Apr 07, 2015 at 06:00:12PM +0000, w0rp via Digitalmars-d wrote:
 On Tuesday, 7 April 2015 at 09:21:52 UTC, Walter Bright wrote:
On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
I think the correct solution to that is to kill auto-decoding :)
Then all decoding is explicit, and since it is explicit, it is
trivial to allow specifying the desired behavior upon encountering
invalid UTF-8.
I agree autodecoding is a mistake, but we're stuck with it.
I don't think we are stuck with it. I think we can change it. I think a lot of the automatic decoding happens inside of Phobos, while people care mostly about the boundaries of the API. If we do get rid of it, then as Vladimir says, you can opt in to whether or not you want a non-throwing conversion, or a throwing one. I was going to write about how the auto decoding doesn't solve the problem of comparing strings, given that you need to look at ranges of characters, subject to normalisation, unless you're dealing with just ASCII. I think all of that has been said to death, though. I think it's possible for us to get rid of automatic decoding.
If somebody were to write a DIP for killing autodecoding, I'd vote in favor. Getting it past Andrei, OTOH, is a different story. ;-)
T
--
Never trust an operating system you don't have source for! -- Martin Schulze
Apr 07 2015
parent "Kagamin" <spam here.lot> writes:
On Tuesday, 7 April 2015 at 18:18:55 UTC, H. S. Teoh wrote:
 If somebody were to write a DIP for killing autodecoding, I'd 
 vote in
 favor.

 Getting it past Andrei, OTOH, is a different story. ;-)
http://forum.dlang.org/post/luonbfghopyrtcoejjsu forum.dlang.org But how can a DIP address a non-technical issue?
Apr 08 2015
prev sibling parent Daniel Kozak via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tue, 7 Apr 2015 11:16:16 -0700
"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> wrote:

 On Tue, Apr 07, 2015 at 06:00:12PM +0000, w0rp via Digitalmars-d wrote:
 On Tuesday, 7 April 2015 at 09:21:52 UTC, Walter Bright wrote:
On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
I think the correct solution to that is to kill auto-decoding :)
Then all decoding is explicit, and since it is explicit, it is
trivial to allow specifying the desired behavior upon encountering
invalid UTF-8.
I agree autodecoding is a mistake, but we're stuck with it.
I don't think we are stuck with it. I think we can change it. I think a lot of the automatic decoding happens inside of Phobos, while people care mostly about the boundaries of the API. If we do get rid of it, then as Vladimir says, you can opt in to whether or not you want a non-throwing conversion, or a throwing one. I was going to write about how the auto decoding doesn't solve the problem of comparing strings, given that you need to look at ranges of characters, subject to normalisation, unless you're dealing with just ASCII. I think all of that has been said to death, though. I think it's possible for us to get rid of automatic decoding.
If somebody were to write a DIP for killing autodecoding, I'd vote in favor.
me too
Apr 07 2015
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, Apr 07, 2015 at 09:10:32AM +0000, Vladimir Panteleev via Digitalmars-d
wrote:
[...]
 I think the correct solution to that is to kill auto-decoding :) Then
 all decoding is explicit, and since it is explicit, it is trivial to
 allow specifying the desired behavior upon encountering invalid UTF-8.
I used to be pro-autodecoding... nowadays, I'm starting to lean towards killing it. This is another nail in the coffin.
T
--
He who sacrifices functionality for ease of use, loses both and deserves neither. -- Slashdotter
Apr 07 2015
prev sibling next sibling parent reply "Abdulhaq" <alynch4047 gmail.com> writes:
On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76
The DIP lists the benefits but does not mention any cons. A con that I can see is that it violates the 'fail fast' principle. By silently replacing data, the developer will be presented with a probably-hard-to-debug problem later in the application lifecycle (probably in an unrelated area), wasting developer time.
Apr 07 2015
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2015 5:04 AM, Abdulhaq wrote:
 On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
 http://wiki.dlang.org/DIP76
The DIP lists the benefits but does not mention any cons. A con that I can see is that it violates the 'fail fast' principle. By silently replacing data, the developer will be presented with a probably-hard-to-debug problem later in the application lifecycle (probably in an unrelated area), wasting developer time.
On the other hand, if there's any place where people demand the highest performance, it's string processing.
Apr 07 2015
prev sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Monday, April 06, 2015 20:16:19 Walter Bright via Digitalmars-d wrote:
 http://wiki.dlang.org/DIP76
I am fully in favor of this. Most code really doesn't care about invalid unicode, and if it does, it can check explicitly. Using the replacement character is much cleaner and follows the Unicode standard. And in my experience, if I run into invalid Unicode, I generally have to process it regardless, forcing me to do something like use the replacement character anyway. The fact that std.utf.decode throws just becomes an annoyance. About the only real downside to this that I can think of is that if you're writing a new string algorithm, and you botch it such that it mangles the Unicode, right now, you'd quickly get exceptions, whereas with this change, you wouldn't. But if you're testing your string-based code with Unicode rather than just ASCII, then that should still get caught. Regardless, I think that this is the way to go. - Jonathan M Davis
Apr 19 2015