
digitalmars.D - dmd foreach loops throw exceptions on invalid UTF sequences, use

reply Walter Bright <newshound2 digitalmars.com> writes:
https://issues.dlang.org/show_bug.cgi?id=22473

I've tried to fix this before, but too many people objected.

Are we fed up with this yet? I sure am.

Who wants to take up this cudgel and fix the durned thing once and for all?

(It's unclear if it would even break existing code.)
Nov 03 2021
next sibling parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 I've tried to fix this before, but too many people objected.
I proposed a few days ago that Phobos autodecoding, if not completely removed, do this exact same thing too. I agree it is a good idea. If you want an exception, it is easy enough to just check it in the loop and throw then. Let's do it.
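A minimal sketch of that check-and-throw pattern, assuming std.utf's byDchar substitutes U+FFFD for invalid sequences as documented:

```d
import std.utf : byDchar, replacementDchar, UTFException;

void process(string s)
{
    foreach (dchar c; s.byDchar)        // never throws, yields U+FFFD on bad UTF
    {
        if (c == replacementDchar)      // opt back in to the exception yourself
            throw new UTFException("invalid UTF-8 in input");
        // note: this also fires on a legitimate U+FFFD already in the input
        // ... use c ...
    }
}
```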
Nov 03 2021
parent Dukc <ajieskola gmail.com> writes:
On Thursday, 4 November 2021 at 02:34:54 UTC, Adam D Ruppe wrote:
 I agree it is a good idea. If you want an exception, it is easy 
 enough to just check it in the loop and throw then.

 Let's do it.
Plus, the present behaviour is inconsistent with the rest of the language's features. Implicit language-level conversions in D do not usually throw.
Nov 04 2021
prev sibling next sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473

 I've tried to fix this before, but too many people objected.

 Are we fed up with this yet? I sure am.

 Who wants to take up this cudgel and fix the durned thing once 
 and for all?

 (It's unclear if it would even break existing code.)
I still think this is a mistake. One may disagree about autodecoding; I for one think it's a sensible idea. However, a program should either process data correctly or, if that is impossible, not at all. It should not, ever, silently modify it "for you" while reading! I predict this will lead to cryptic, hair-pulling bugs in user code involving replacement characters appearing far downstream of the error site.
Nov 03 2021
parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Thursday, 4 November 2021 at 05:34:29 UTC, FeepingCreature 
wrote:
 One may disagree about autodecoding; I for one think it's a 
 sensible idea. However, a program should either process data 
 correctly or, if that is impossible, not at all. It should not, 
 ever, silently modify it "for you" while reading! I predict 
 this will lead to cryptic, hair-pulling bugs in user code 
 involving replacement characters appearing far downstream of 
 the error site.
(This is floating point NaN all over again!)
Nov 03 2021
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/3/2021 10:41 PM, FeepingCreature wrote:
 On Thursday, 4 November 2021 at 05:34:29 UTC, FeepingCreature wrote:
 One may disagree about autodecoding; I for one think it's a sensible idea. 
 However, a program should either process data correctly or, if that is 
 impossible, not at all. It should not, ever, silently modify it "for you" 
 while reading! I predict this will lead to cryptic, hair-pulling bugs in user 
 code involving replacement characters appearing far downstream of the error
site.
Surprisingly, the reverse seems to be true. Suppose you're writing a text editor. Then read a file with some bad UTF in it. The editor dies with an exception. You can't even edit the file to fix it.

If you need to display user provided text, like in a browser, or all sorts of tools, you don't want to die with an exception. What are you going to do in an exception handler? You're just going to replace the offending bytes with ReplacementChar and go render it anyway.
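A sketch of what that handler ends up looking like, assuming std.utf.decode and replacementDchar behave as documented; the catch just does the substitution the non-throwing path would have done anyway:

```d
import std.utf : decode, replacementDchar, UTFException;

dchar[] toRenderable(string s)
{
    dchar[] result;
    for (size_t i = 0; i < s.length; )
    {
        immutable start = i;
        dchar c;
        try
            c = decode(s, i);           // advances i; throws on invalid UTF
        catch (UTFException)
        {
            c = replacementDchar;       // ...which is what we'd render anyway
            i = start + 1;              // skip the offending byte
        }
        result ~= c;
    }
    return result;
}
```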
 (This is floating point NaN all over again!)
Poor NaNs are terribly misunderstood.

Suppose you have an array of sensors. One goes bad. The "bad" value is 0.0. So now your data analyzer is happily averaging 0.0 into the results, silently skewing them.

Now, if a NaN is returned instead, your "average" will be NaN. You know it's no good. It won't be hidden.

Uninitialized variables are sensors giving bad data. Having a NaN in your result is a *good* thing.
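A small illustration, assuming one bad sensor in the batch: 0.0 quietly skews the average, while NaN makes the damage visible.

```d
import std.algorithm.iteration : sum;
import std.math : isNaN;
import std.stdio : writeln;

void main()
{
    double[] readings = [20.1, 19.8, 0.0, 20.3];   // failed sensor reports 0.0
    writeln(readings.sum / readings.length);        // ~15.05, silently skewed

    readings[2] = double.nan;                       // failed sensor reports NaN
    immutable avg = readings.sum / readings.length;
    writeln(avg, " ", avg.isNaN);                   // nan true: the error is visible
}
```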
Nov 04 2021
next sibling parent reply Mathias LANG <geod24 gmail.com> writes:
On Friday, 5 November 2021 at 00:38:59 UTC, Walter Bright wrote:
 Surprisingly, the reverse seems to be true. Suppose you're 
 writing a text editor. Then read a file with some bad UTF in 
 it. The editor dies with an exception. You can't even edit the 
 file to fix it.

 If you need to display user provided text, like in a browser, 
 or all sorts of tools, you don't want to die with an exception. 
 What are you going to do in an exception handler? You're just 
 going to replace the offending bytes with ReplacementChar and 
 go render it anyway.
If you handle user input, you take it as `ubyte[]` and validate it. Any decent editor will try to detect the encoding instead of blindly assuming UTF-8.

If you want to fix it, just deprecate the special case and tell people to use `foreach (dchar d; someString.byUTF!(dchar, No.useReplacementDchar))` and voilà.

And if they don't want it to throw, it's shorter: `foreach (dchar d; someString.byUTF!dchar)` (or `byDChar`).
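A minimal sketch of that take-it-as-`ubyte[]`-and-validate approach, assuming std.utf.validate throws UTFException on bad input as documented:

```d
import std.utf : validate, UTFException;

string accept(immutable(ubyte)[] raw)
{
    auto s = cast(string) raw;   // reinterpret the bytes, no copy
    validate(s);                 // throws UTFException if not valid UTF-8
    return s;                    // past this point the string is known-good
}
```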
Nov 04 2021
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/4/2021 7:41 PM, Mathias LANG wrote:
 If you want to fix it, just deprecate the special case and tell people to use 
 `foreach (dchar d; someString.byUTF!(dchar, No.useReplacementDchar))` and
voilà.
 And if they don't want it to throw, it's shorter:
 `foreach (dchar d; someString.byUTF!dchar)` (or `byDChar`).
People will always gravitate towards the smaller, simpler syntax. Like [] instead of std::vector<>.
Nov 04 2021
parent reply max haughton <maxhaton gmail.com> writes:
On Friday, 5 November 2021 at 04:02:44 UTC, Walter Bright wrote:
 On 11/4/2021 7:41 PM, Mathias LANG wrote:
 If you want to fix it, just deprecate the special case and 
 tell people to use `foreach (dchar d; someString.byUTF!(dchar, 
 No.useReplacementDchar))` and voilà.
 And if they don't want it to throw, it's shorter:
 `foreach (dchar d; someString.byUTF!dchar)` (or `byDChar`).
People will always gravitate towards the smaller, simpler syntax. Like [] instead of std::vector<>.
I have never observed this mistake in any C++ code, unless you mean as a point of language design. This decision should be guided by how current D programmers act rather than a hyperreal ideal of someone encountering the language.
Nov 04 2021
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/4/2021 9:11 PM, max haughton wrote:
 On Friday, 5 November 2021 at 04:02:44 UTC, Walter Bright wrote:
 On 11/4/2021 7:41 PM, Mathias LANG wrote:
 If you want to fix it, just deprecate the special case and tell people to use 
 `foreach (dchar d; someString.byUTF!(dchar, No.useReplacementDchar))` and
voilà.
 And if they don't want it to throw, it's shorter:
 `foreach (dchar d; someString.byUTF!dchar)` (or `byDChar`).
People will always gravitate towards the smaller, simpler syntax. Like [] instead of std::vector<>.
I have never observed this mistake in any C++ code,
You've never observed people write: int array[3]; in C++ code?
 unless you mean as a point of language design.
D (still) has a rather verbose way of doing lambdas. People constantly complained that D didn't have lambdas. Until the => syntax was added, and suddenly lambdas in D became noticed and useful.
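A before/after sketch of that change, with the long-form function literal next to the => shorthand:

```d
unittest
{
    import std.algorithm.comparison : equal;
    import std.algorithm.iteration : map;

    auto a = [1, 2, 3].map!(function int(int x) { return x + 1; }); // verbose form
    auto b = [1, 2, 3].map!(x => x + 1);                            // => shorthand
    assert(equal(a, b));
}
```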
 This decision should be guided by how current D programmers act rather than a 
 hyperreal ideal of someone encountering the language.
The only reason D's associative arrays continue to exist is because they are so darned syntactically convenient. I've seen over and over and over that syntactic convenience matters a lot.
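A tiny example of that convenience, using the usual word-count idiom: no imports, no container type to pick.

```d
void main()
{
    int[string] counts;                       // built-in associative array
    foreach (word; ["the", "quick", "the"])
        counts[word]++;                       // missing keys start at int.init (0)
    assert(counts["the"] == 2 && counts["quick"] == 1);
}
```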
Nov 04 2021
next sibling parent Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Friday, 5 November 2021 at 06:15:44 UTC, Walter Bright wrote:
 On 11/4/2021 9:11 PM, max haughton wrote:
 On Friday, 5 November 2021 at 04:02:44 UTC, Walter Bright 
 wrote:
 On 11/4/2021 7:41 PM, Mathias LANG wrote:
 If you want to fix it, just deprecate the special case and 
 tell people to use `foreach (dchar d; 
 someString.byUTF!(dchar, No.useReplacementDchar))` and voilà.
 And if they don't want it to throw, it's shorter:
 `foreach (dchar d; someString.byUTF!dchar)` (or `byDChar`).
People will always gravitate towards the smaller, simpler syntax. Like [] instead of std::vector<>.
I have never observed this mistake in any C++ code,
You've never observed people write: int array[3]; in C++ code?
 unless you mean as a point of language design.
D (still) has a rather verbose way of doing lambdas. People constantly complained that D didn't have lambdas. Until the => syntax was added, and suddenly lambdas in D became noticed and useful.
 This decision should be guided by how current D programmers 
 act rather than a hyperreal ideal of someone encountering the 
 language.
The only reason D's associative arrays continue to exist is because they are so darned syntactically convenient. I've seen over and over and over that syntactic convenience matters a lot.
*The value of convenience should not be underestimated*

It's what enables productivity, which in my opinion should be *the* main metric of success. Everything else is just "fluff". In how many seconds can you transform idea A into program B? That is how you measure success imo.

It doesn't matter if you have a cool or super interesting way of achieving something; if person X is still trying to figure out how to do some cool thing while person Y is already done and focusing on the next thing, person X has lost. Because person Y can always optimize and refactor later (before the deadline), but person X can't because the deadline is already over.

*The value of convenience should not be underestimated*
Nov 05 2021
prev sibling parent reply max haughton <maxhaton gmail.com> writes:
On Friday, 5 November 2021 at 06:15:44 UTC, Walter Bright wrote:
 On 11/4/2021 9:11 PM, max haughton wrote:
 On Friday, 5 November 2021 at 04:02:44 UTC, Walter Bright 
 wrote:
 On 11/4/2021 7:41 PM, Mathias LANG wrote:
 If you want to fix it, just deprecate the special case and 
 tell people to use `foreach (dchar d; 
 someString.byUTF!(dchar, No.useReplacementDchar))` and voilà.
 And if they don't want it to throw, it's shorter:
 `foreach (dchar d; someString.byUTF!dchar)` (or `byDChar`).
People will always gravitate towards the smaller, simpler syntax. Like [] instead of std::vector<>.
I have never observed this mistake in any C++ code,
You've never observed people write: int array[3]; in C++ code?
 unless you mean as a point of language design.
D (still) has a rather verbose way of doing lambdas. People constantly complained that D didn't have lambdas. Until the => syntax was added, and suddenly lambdas in D became noticed and useful.
 This decision should be guided by how current D programmers 
 act rather than a hyperreal ideal of someone encountering the 
 language.
The only reason D's associative arrays continue to exist is because they are so darned syntactically convenient. I've seen over and over and over that syntactic convenience matters a lot.
I have never ever seen someone use a static array by mistake, is what I meant; vector doesn't do the same thing as []. It's more common in (so-called) modern C++ to see std::array these days than a raw static array in certain contexts, since you still want a constant-length buffer but want iterators etc.
Nov 05 2021
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/5/2021 5:38 AM, max haughton wrote:
 I have never ever seen someone use a static array by mistake, is what I meant,
I didn't mean by mistake. I mean using it as a matter of convenience.
 since you still  want a constant length buffer but want iterators etc..
This is why D has special support for turning arrays seamlessly into ranges. An early goal of D is to encourage use of [ ], rather than deprecate it.
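A minimal sketch of what that looks like in practice; slicing a static array with [] hands the algorithms a range:

```d
unittest
{
    import std.algorithm.comparison : equal;
    import std.algorithm.iteration : map;

    int[3] buf = [1, 2, 3];                             // fixed-length, no allocation
    assert(equal(buf[].map!(x => x * 2), [2, 4, 6]));   // buf[] is a range over it
}
```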
Nov 05 2021
prev sibling next sibling parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 11/5/21 5:38 AM, max haughton wrote:

 I have never ever seen someone use a static array by mistake
Related, although safe, vector::at is almost never used because the more convenient (but unsafe) vector.operator[] exists:

  v[42]     // What Ali saw in the wild
  v.at(42)  // What Ali did not see as much in the wild

Ali
Nov 05 2021
parent reply max haughton <maxhaton gmail.com> writes:
On Friday, 5 November 2021 at 23:01:24 UTC, Ali Çehreli wrote:
 On 11/5/21 5:38 AM, max haughton wrote:

 I have never ever seen someone use a static array by mistake
Related, although safe, vector::at is almost never used because the more convenient (but unsafe) vector.operator[] exists: v[42] // What Ali saw in the wild v.at(42) // What Ali did not see as much in the wild Ali
Although I understand what Walter is trying to say, he picked a poor example; this one does actually make sense. Although in the world of sanitizers and such it is not a hard thing to catch, bounds checking by default is a win.
Nov 05 2021
parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/5/2021 9:25 PM, max haughton wrote:
 Although I understand what Walter is trying to say, he picked a poor example, 
 this one does actually make sense. Although in the world of sanitizers and
such 
 it is not a hard thing to catch, bounds checking by default is a win.
Not sure what your point is, as D has bounds checking by default with [ ].
Nov 06 2021
prev sibling parent reply Atila Neves <atila.neves gmail.com> writes:
On Friday, 5 November 2021 at 12:38:36 UTC, max haughton wrote:
 On Friday, 5 November 2021 at 06:15:44 UTC, Walter Bright wrote:
 On 11/4/2021 9:11 PM, max haughton wrote:
 On Friday, 5 November 2021 at 04:02:44 UTC, Walter Bright 
 wrote:
 On 11/4/2021 7:41 PM, Mathias LANG wrote:
 If you want to fix it, just deprecate the special case and 
 tell people to use `foreach (dchar d; 
 someString.byUTF!(dchar, No.useReplacementDchar))` and 
 voilà.
 And if they don't want it to throw, it's shorter:
 `foreach (dchar d; someString.byUTF!dchar)` (or `byDChar`).
People will always gravitate towards the smaller, simpler syntax. Like [] instead of std::vector<>.
I have never observed this mistake in any C++ code,
You've never observed people write: int array[3]; in C++ code?
 unless you mean as a point of language design.
D (still) has a rather verbose way of doing lambdas. People constantly complained that D didn't have lambdas. Until the => syntax was added, and suddenly lambdas in D became noticed and useful.
 This decision should be guided by how current D programmers 
 act rather than a hyperreal ideal of someone encountering the 
 language.
The only reason D's associative arrays continue to exist is because they are so darned syntactically convenient. I've seen over and over and over that syntactic convenience matters a lot.
is what I meant, vector doesn't do the same thing as [].
Aside from not depending on GC-allocated memory, what does vector do that [] doesn't?
 It's more common in (so-called) modern C++ to see std::array 
 these days than a raw static array in certain contexts since 
 you still  want a constant length buffer but want iterators 
 etc..
  int src[10]{};
  int dst[10]{};
  transform(begin(src), end(src), begin(dst),
            [](int i) { return i + 1; });
  for(const auto i: dst) cout << i << " ";
  cout << endl;

But yes, std::array is an option that's better, but legacy code means C arrays have to be supported.
Nov 08 2021
parent reply max haughton <maxhaton gmail.com> writes:
On Monday, 8 November 2021 at 14:29:47 UTC, Atila Neves wrote:
 On Friday, 5 November 2021 at 12:38:36 UTC, max haughton wrote:
 On Friday, 5 November 2021 at 06:15:44 UTC, Walter Bright 
 wrote:
 On 11/4/2021 9:11 PM, max haughton wrote:
 On Friday, 5 November 2021 at 04:02:44 UTC, Walter Bright 
 wrote:
 [...]
I have never observed this mistake in any C++ code,
You've never observed people write: int array[3]; in C++ code?
 unless you mean as a point of language design.
D (still) has a rather verbose way of doing lambdas. People constantly complained that D didn't have lambdas. Until the => syntax was added, and suddenly lambdas in D became noticed and useful.
 This decision should be guided by how current D programmers 
 act rather than a hyperreal ideal of someone encountering 
 the language.
The only reason D's associative arrays continue to exist is because they are so darned syntactically convenient. I've seen over and over and over that syntactic convenience matters a lot.
is what I meant, vector doesn't do the same thing as [].
Aside from not depending on GC-allocated memory, what does vector do that [] doesn't?
 It's more common in (so-called) modern C++ to see std::array 
 these days than a raw static array in certain contexts since 
 you still  want a constant length buffer but want iterators 
 etc..
int src[10]{}; int dst[10]{}; transform(begin(src), end(src), begin(dst), [](int i) { return i + 1; }); for(const auto i: dst) cout << i << " "; cout << endl; But yes, std::array is an option that's better, but legacy code means C arrays have to be supported.
In my post I was referring to a C-style array (in C++) rather than a D slice, to be clear. It's entirely possible Walter originally meant a slice, but the point about following the syntactic path of least resistance seems to be referring to a [] in C++ rather than a slice. I.e. I was intending to get across that I've never seen someone make this mistake in practice (either using a mere [] to pass data around, or using a vector in place of a static array / vice versa).
Nov 08 2021
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Monday, 8 November 2021 at 22:12:15 UTC, max haughton wrote:
 In my post I was referring to a C style array (in C++) rather 
 than a D slice, to be clear. It's entirely possible Walter 
 originally meant a slice, but the point about following the 
 syntactic path of least resistance seem to be referring to a [] 
 in C++ rather than a slice i.e. I was intending to get across 
 that I've never seen someone making this mistake in practice 
 (either using a mere [] to pass data around, or using a vector 
 in place of a static array / vice versa )
Could happen in C. It does not happen in C++; you use std::span for passing around data.
Nov 08 2021
prev sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Friday, 5 November 2021 at 00:38:59 UTC, Walter Bright wrote:
 On 11/3/2021 10:41 PM, FeepingCreature wrote:
 On Thursday, 4 November 2021 at 05:34:29 UTC, FeepingCreature 
 wrote:
 One may disagree about autodecoding; I for one think it's a 
 sensible idea. However, a program should either process data 
 correctly or, if that is impossible, not at all. It should 
 not, ever, silently modify it "for you" while reading! I 
 predict this will lead to cryptic, hair-pulling bugs in user 
 code involving replacement characters appearing far 
 downstream of the error site.
Surprisingly, the reverse seems to be true. Suppose you're writing a text editor. Then read a file with some bad UTF in it. The editor dies with an exception. You can't even edit the file to fix it. If you need to display user provided text, like in a browser, or all sorts of tools, you don't want to die with an exception. What are you going to do in an exception handler? You're just going to replace the offending bytes with ReplacementChar and go render it anyway.
 (This is floating point NaN all over again!)
Poor NaNs are terribly misunderstood. Suppose you have an array of sensors. One goes bad. The "bad" value is 0.0. So now your data analyzer is happily averaging 0.0 into the results, silently skewing them. Now, if a NaN is returned instead, your "average" will be NaN. You know it's no good. It won't be hidden. Uninitialized variables are sensors giving bad data. Having a NaN in your result is a *good* thing.
I think the program should crash in all these cases. The text editor should crash. The browser should crash. The analyzer should see a NaN, and crash.

These programs are *wrong.* They thought they could only get Unicode and they've gotten non-Unicode. So we know they're written on wrong assumptions; why do we want to continue running code we know is untrustworthy? Let them crash, let them be fixed to make fewer assumptions.

Automagically handling errors by propagating them in an inert form robs the developers and users of a chance to avoid a mistake. It's no better than 0.0.
Nov 04 2021
next sibling parent Paolo Invernizzi <paolo.invernizzi gmail.com> writes:
On Friday, 5 November 2021 at 06:30:02 UTC, FeepingCreature wrote:
 On Friday, 5 November 2021 at 00:38:59 UTC, Walter Bright wrote:
 [...]
I think the program should crash in all these cases. The text editor should crash. The browser should crash. The analyzer should see a NaN, and crash. These programs are *wrong.* They thought they could only get Unicode and they've gotten non-Unicode. So we know they're written on wrong assumptions; why do we want to continue running code we know is untrustworthy? Let them crash, let them be fixed to make fewer assumptions. Automagically handling errors by propagating them in an inert form robs the developers and users of a chance to avoid a mistake. It's no better than 0.0.
+1000
Nov 05 2021
prev sibling next sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 5 November 2021 at 06:30:02 UTC, FeepingCreature wrote:
 I think the program should crash in all these cases. The text 
 editor should crash. The browser should crash. The analyzer 
 should see a NaN, and crash.
No, NaN is completely different. You have two types of NaN: one is for signalling that data is missing in a dataset (received from the outside); the other is to convey that a computation failed (often caused by roundoff errors). Removing NaN from floating point is unworkable in the general case.
Nov 05 2021
parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Friday, 5 November 2021 at 10:08:30 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 5 November 2021 at 06:30:02 UTC, FeepingCreature 
 wrote:
 I think the program should crash in all these cases. The text 
 editor should crash. The browser should crash. The analyzer 
 should see a NaN, and crash.
No, NaN is completely different. You have two types of NaN, one is for signalling that data is missing in a dataset (received from the outside). The other is to convey that a computation failed (often caused by roundoff errors). To remove NaN from floating point is unworkable in the general case.
When I have to do numeric work and suspect NaNs in play, I like to `feenableexcept(FE_INVALID)`. Then every time a NaN arises in a computation, I get a nice SIGFPE.
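The rough D-level equivalent, assuming std.math's FloatingPointControl and its invalidException mask behave as documented on the target CPU:

```d
void main()
{
    import std.math : FloatingPointControl;

    FloatingPointControl ctrl;                   // restores the previous masks on scope exit
    if (FloatingPointControl.hasExceptionTraps)
        ctrl.enableExceptions(FloatingPointControl.invalidException);

    double zero = 0.0;
    auto oops = zero / zero;                     // 0/0 is invalid; with the trap enabled, SIGFPE
}
```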
Nov 05 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 5 November 2021 at 11:44:42 UTC, FeepingCreature wrote:
 When I have to do numeric work and suspect NaNs in play, I like 
 to `feenableexcept(FE_INVALID)`. Then every time a NaN arises 
 in a computation, I get a nice SIGFPE.
Yes, and the IEEE spec suggests that one should be able to choose whether to get exceptions or compute with NaNs based on the nature of the application/computation. Regardless, as long as hardware follows IEEE and supports using NaN in calculations, you are better off playing up to the IEEE standard (for a modern system-level language that means you should have easy access to both approaches).
Nov 05 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 5 November 2021 at 11:54:21 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 5 November 2021 at 11:44:42 UTC, FeepingCreature 
 wrote:
 When I have to do numeric work and suspect NaNs in play, I 
 like to `feenableexcept(FE_INVALID)`. Then every time a NaN 
 arises in a computation, I get a nice SIGFPE.
Yes, and the IEEE spec suggests that ones should be able to choose whether you get exceptions or compute with NaNs based on the nature of the application/computation. Regardless, as long as hardware follow IEEE and supports using NaN in calculations, you are better off playing up to the IEEE standard (for a modern system level language that means you should have easy access to both approaches).
To put some meat on this: the ideal is that you can have two implementations of the same computation, one fast and one robust. So ideally you should be able to do the computations with NaNs in expressions where the NaNs can disappear, and use exceptions where they cannot disappear. If an exception occurs, you fall back to the slower, robust implementation. In reality you have to weigh in the performance characteristics of the hardware, so… this is very much system-level programming and not only a choice that can be made at the language level.

For instance, in raytracing I would want NaNs. Then I can make a choice based on neighbouring pixels whether I want to compute the pixel again using a slower method or simply fill it in with the average of the neighbours (if all the neighbours have roughly the same colour).
Nov 05 2021
parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Friday, 5 November 2021 at 12:03:24 UTC, Ola Fosheim Grøstad 
wrote:
 For instance in raytracing I would want NaNs. Then I can make a 
 choice based on neighbouring pixels whether I want to compute 
 it again using a slower method or simply fill it in with the 
 average of the neighbours (if all the neighbours have roughly 
 the same colour).
I can't imagine wanting NaNs in raytracing. Just the idea of an FPU slowpath-provoking NaN making its way into my nice wide SSE vectors gives me hives. Any sensible raytracing routine should just never produce a NaN to begin with. (For denormals there's FTZ/DAZ, at least.)
Nov 05 2021
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 5 November 2021 at 12:13:17 UTC, FeepingCreature wrote:
 (For denormals there's FTZ/DAZ, at least.)
Not IEEE?
Nov 05 2021
prev sibling next sibling parent reply norm <norm.rowtree gmail.com> writes:
On Friday, 5 November 2021 at 06:30:02 UTC, FeepingCreature wrote:
 On Friday, 5 November 2021 at 00:38:59 UTC, Walter Bright wrote:
 [...]
I think the program should crash in all these cases. The text editor should crash. The browser should crash. The analyzer should see a NaN, and crash. These programs are *wrong.* They thought they could only get Unicode and they've gotten non-Unicode. So we know they're written on wrong assumptions; why do we want to continue running code we know is untrustworthy? Let them crash, let them be fixed to make fewer assumptions. Automagically handling errors by propagating them in an inert form robs the developers and users of a chance to avoid a mistake. It's no better than 0.0.
It isn't always that simple. E.g. when working on medical devices, crashing isn't an option when it comes to how we're going to deal with bad data.
Nov 05 2021
next sibling parent reply Dennis <dkorpel gmail.com> writes:
On Friday, 5 November 2021 at 10:09:40 UTC, norm wrote:
 It isn't always that simple, e.g. working on medical devices 
 crashing isn't an option when it comes to how we're going to 
 deal with bad data.
Oh no, let's not go there again. See this 44-page discussion: [Program logic bugs vs input/environmental errors](https://forum.dlang.org/post/m07gf1$18jl$1 digitalmars.com)
Nov 05 2021
parent Paolo Invernizzi <paolo.invernizzi gmail.com> writes:
On Friday, 5 November 2021 at 10:27:05 UTC, Dennis wrote:
 On Friday, 5 November 2021 at 10:09:40 UTC, norm wrote:
 It isn't always that simple, e.g. working on medical devices 
 crashing isn't an option when it comes to how we're going to 
 deal with bad data.
Oh no, let's not go there again. See this 44-page discussion: [Program logic bugs vs input/environmental errors](https://forum.dlang.org/post/m07gf1$18jl$1 digitalmars.com)
Ehehe, the old good times :-P
Nov 05 2021
prev sibling next sibling parent Dukc <ajieskola gmail.com> writes:
On Friday, 5 November 2021 at 10:09:40 UTC, norm wrote:
 It isn't always that simple, e.g. working on medical devices 
 crashing isn't an option when it comes to how we're going to 
 deal with bad data.
You can always validate the UTF beforehand if you don't want to crash.
Nov 05 2021
prev sibling parent Abdulhaq <alynch4047 gmail.com> writes:
On Friday, 5 November 2021 at 10:09:40 UTC, norm wrote:
 It isn't always that simple, e.g. working on medical devices 
 crashing isn't an option when it comes to how we're going to 
 deal with bad data.
Mm, I have a totally different take on this. In my view all incoming data should be sanitised on entry into the application; this takes place at what I think of as leaf nodes in the application. This sanitisation includes conversion of all measurements into standard units, checking validity of strings etc. Once data has entered the main application then the application should **fail fast**. This is **especially important** for medical devices. This allows the developers of the application to see, early in development, problems with their code and the logic thereof.

Signs of developers ignoring the fail fast principle include a disease I've identified where ```if (x is null)``` is seen to start proliferating through the code. This happens when you are calling a function that you did not write and one day you find it has returned null and you don't know why. So you add an ```if (x is null) return null``` to your code and carry on. This allows the program to stagger on in the face of being in a state that is not understood by the developer.

If I am on a ventilator and the program enters a state that the programmer did not anticipate, then life can start to get very uncomfortable for me. I would far prefer that it stopped, coughed up an error code, and the medical staff can unplug it and (quickly, I hope) replace it with another one.

If there is actually a scenario where staggering on is considered better, then at the very least it should be under instruction from the programmer. The idea of the language runtime silently modifying application data is somewhat frightening for me in this scenario.
Nov 07 2021
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/4/2021 11:30 PM, FeepingCreature wrote:
 These programs are *wrong.* They thought they could only get Unicode and
they've 
 gotten non-Unicode. So we know they're written on wrong assumptions; why do we 
 want to continue running code we know is untrustworthy? Let them crash, let
them 
 be fixed to make fewer assumptions. Automagically handling errors by
propagating 
 them in an inert form robs the developers and users of a chance to avoid a 
 mistake. It's no better than 0.0.
It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value. NaN and ReplacementChar are not valid and are easily distinguished.
Nov 06 2021
next sibling parent reply kdevel <kdevel vogtner.de> writes:
On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
 On 11/4/2021 11:30 PM, FeepingCreature wrote:
 [...] Let them crash, [...] Automagically handling errors by 
 propagating them in an inert form robs the developers and 
 users of a chance to avoid a mistake. It's no better than 0.0.
It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value.
Technically it makes no difference whether you fail to check for 0.0 or fail to check for NaN. What makes a difference is using "out of band signalling" (exceptions) whose default behavior is process termination.
 NaN and ReplacementChar are not valid
The replacement character '�' is a valid Unicode codepoint (U+FFFD).
 and are easily distinguished.
Someone may forget to write explicit code to handle this case, which most likely leads to data corruption. I choose a stack trace over potentially corrupted data.
Nov 07 2021
next sibling parent reply Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Sunday, 7 November 2021 at 16:28:33 UTC, kdevel wrote:
 On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
 On 11/4/2021 11:30 PM, FeepingCreature wrote:
 [...]
It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value.
Technically it makes no difference if you do not check for 0.0 or not for NaN. What makes a difference is using "out of band signalling" (exceptions) if its default behavior is process termination.
 NaN and ReplacementChar are not valid
The replacement character '�' is a valid Unicode codepoint (U+FFFD).
 and are easily distinguished.
Someone may forget to write explicit code to handle this case which most likely leads to data corruption. I choose stack trace over potentially corrupted data.
https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
Nov 07 2021
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/7/2021 8:46 AM, Imperatorn wrote:
 https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
The money quote: "By default, each object uses replacement fallback to handle strings that it cannot encode and bytes that it cannot decode, but you can specify that an exception should be thrown instead. For more information, see Replacement fallback and Exception fallback."
Nov 07 2021
parent Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Sunday, 7 November 2021 at 23:29:39 UTC, Walter Bright wrote:
 On 11/7/2021 8:46 AM, Imperatorn wrote:
 https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
The money quote: "By default, each object uses replacement fallback to handle strings that it cannot encode and bytes that it cannot decode, but you can specify that an exception should be thrown instead. For more information, see Replacement fallback and Exception fallback."
💲💲💲
Nov 07 2021
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/7/2021 8:28 AM, kdevel wrote:
 Technically it makes no difference if you do not check for 0.0 or not for NaN.
Yes, it does. 0.0 is not distinguishable from valid data. NaN is.
Nov 07 2021
prev sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
 It's much better than 0.0. 0.0 is indistinguishable from valid 
 data, and is a very common valid value.

 NaN and ReplacementChar are not valid and are easily 
 distinguished.
No, that's exactly the problem. ReplacementChar is not easily distinguished, because it's a valid Unicode character - that's the whole point of it. So just like NaN, it can propagate arbitrarily far through your processing pipeline before some downstream process decides that it actually doesn't like it. And at that point you generally have no chance to recover the source of the issue - you know that something maybe has gone wrong, but you don't even know if it was in your process or in the input data. After all, if you were screening your input data for ReplacementChar, you could as easily have been screening it for invalid UTF-8 to begin with.

So while yes it's marginally better than 0.0, because at least you know that *something* is wrong, it does as little as possible to help you locate the problem while technically informing you. And all the workarounds for that take the form of "throw everywhere where a ReplacementChar could be generated." So imo just do the equivalent of turning on FE_INVALID, and do that to begin with. There's no point in getting rid of throw sites when you just force the user to re-add them manually, because they fulfill a genuine need.

IMO if you want to get rid of the exception overhead, I'd go the other way and make invalid Unicode an abort(). Check your input data, people.
Nov 08 2021
parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 8 November 2021 at 08:11:12 UTC, FeepingCreature wrote:
 On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
 It's much better than 0.0. 0.0 is indistinguishable from valid 
 data, and is a very common valid value.

 NaN and ReplacementChar are not valid and are easily 
 distinguished.
No, that's exactly the problem. ReplacementChar is not easily distinguished, because it's a valid Unicode character - that's the whole point of it. So just like nan, it can propagate arbitrarily far through your processing pipeline before some downstream process decides that it actually doesn't like it.
Sorry, let me expand on this because I think it's the very core of the disagreement.

I feel you have two options with NaN/ReplacementChar. You can either just accept that this is what you get, and let it propagate throughout your entire pipeline. In that case it's no better than 0.0 - actually, NaN would be *worse*, because your process would be completely broken with no way to fix it, whereas at least with 0.0 you can maybe get some reasonably-usable data out.

Or you can say that "we don't want to be generating NaN/ReplacementChar." Then where do you draw the line? At the process input/output boundary? But then the process needs to be fixed if it generates nans/fffds. So you want to move your signaling as close to the production site as possible. Preferably, you want to fail at the exact line that the problematic data was produced. So we're back at exceptions in foreach. (Actually, an exception in cast(string) would be the best.)

And that's why I think ReplacementChar/NaN are no better than 0.0. You either embrace them fully as "valid" data, or you handle them at the site of origin; any compromise just makes you worse off than either extreme.
Nov 08 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Monday, 8 November 2021 at 08:18:51 UTC, FeepingCreature wrote:
 (Actually, an exception in cast(string) would be the best.)
D should distinguish more clearly between strong and weak casting at the language level.

UTF-8 is now so dominant that D really should reconsider the string type and make it required to be valid UTF-8 (like Python 3 did). C++ has even introduced a new character type to signify UTF-8; I use it all the time.
 And that's why I think ReplacementChar/NaN are no better than 
 0.0. You either embrace them fully as "valid" data, or you 
 handle them at the site of origin; any compromise just makes 
 you worse off than either extreme.
It is very difficult to follow your line of reasoning, because ReplacementChar is nothing like qNaN; it is more like sNaN. ReplacementChar is not the result of an approximation failure, it is corruption of the input (or maybe a foreign encoding).

Getting a 0.0 instead of qNaN in a signal is absolutely disastrous. Walter is 100% right on that one. 0.0 will introduce a peak across the frequency range. qNaN can be removed with no distortion.

Should you express your types strongly? Yes, but then you also should include things like negative numbers, denormal numbers, ±infinity, ranges [1.0-0.0] and so on.
Nov 08 2021
next sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 8 November 2021 at 12:02:12 UTC, Ola Fosheim Grøstad 
wrote:
 It is very difficult to follow your line of reasoning, because 
 ReplacementChar is nothing like qNaN, it is more like sNaN. 
 ReplacementChar is not the result of an approximation failure, 
 it is corruption of the input (or maybe a foreign encoding).

 Getting a 0.0 instead of qNaN in a signal is absolutely 
 disastrous. Walter is 100% right on that one. 0.0 will 
 introduce a peak across the frequency range. qNan can be 
 removed with no distortion.

 Should you express your types strongly? Yes, but then you also 
 should include things like negative numbers, denormal numbers, 
 ±infity, ranges [1.0-0.0] and so on.
Yeah I noticed this after I clicked post, but I didn't want to add a third comment.

I think the difference is fundamentally one of "time-series vs progressive data". I don't think that's the right word, but I don't know a better one. Like, if you have a measuring series of values interspersed with NaNs, you can know for instance that the values are assigned to times, or to positions, and then you can semantically decide what to do with the data. For instance you may mark the NaNs with an error, or drop them and interpolate.

However, it is much harder to see where such a behavior would be useful for ReplacementCharacter. Generally, you're reading data that someone wrote for a reason, and ReplacementCharacter would almost universally indicate that there was something you were meant to pick up on but failed to handle. As such, it's much less clear to me whether there are even cases where "text with replacement characters" or "text with replacement characters removed" is useful.
Nov 08 2021
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Monday, 8 November 2021 at 12:32:08 UTC, FeepingCreature wrote:
 Generally, you're reading data that someone wrote for a reason, 
 and ReplacementCharacter would almost universally indicate that 
 there was something you were meant to pick up on but failed to 
 handle. As such, it's much less clear to me whether there even 
 are cases where "text with replacement characters" or "text 
 with replacement characters removed" is even useful.
It could mean that someone did a cut'n'paste of text from a more recent version of the Unicode standard. ReplacementCharacter makes it possible for you to use the input regardless (replacing it with a question mark in a square or something).

I think this is an application-level feature, not a language-level feature, so it doesn't make sense for the language to do this IMO. That we can agree on. (D is not a scripting language.)
Nov 08 2021
prev sibling parent kdevel <kdevel vogtner.de> writes:
On Monday, 8 November 2021 at 12:02:12 UTC, Ola Fosheim Grøstad 
wrote:
[...]
 ReplacementChar is not the result of an approximation failure, 
 it is corruption of the input (or maybe a foreign encoding).
As in this line I can write down the replacement character '�' since it is a valid Unicode codepoint (U+FFFD). It even round-trips correctly.

I think the iconv library [1] has a nice approach: it stops the conversion, among other things, if it encounters an invalid input sequence.

The ideal conversion, without throwing or using the replacement character, is IMHO generating a list of pairs of ranges, named "left" and "right". Left contains successfully parsed data, right invalid data. For valid UTF-8 input this list has only one element: the left element of this pair contains the conversion and the right is empty. From this representation one can easily compute all required presentations.

[1] https://man7.org/linux/man-pages/man3/iconv.3.html
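A rough sketch of that left/right pairing, using std.encoding.validLength (assumed here to return the length of the longest valid prefix) to split the input:

```d
import std.encoding : validLength;
import std.string : representation;

struct Chunk
{
    string left;                 // longest valid UTF-8 run
    immutable(ubyte)[] right;    // the invalid bytes that follow it (may be empty)
}

Chunk[] splitValid(string s)
{
    Chunk[] result;
    while (s.length)
    {
        immutable n = s.validLength;          // valid prefix, in code units
        auto left = s[0 .. n];
        s = s[n .. $];

        size_t bad = 0;                        // gather bytes until decoding can resume
        while (bad < s.length && s[bad .. $].validLength == 0)
            ++bad;

        result ~= Chunk(left, s[0 .. bad].representation);
        s = s[bad .. $];
    }
    return result;
}
```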
Nov 08 2021
prev sibling parent User <user blah.com> writes:
 (This is floating point NaN all over again!)
Did you try the Pony language? It's so user friendly, it even allows division by zero. https://www.reddit.com/r/programming/comments/7al9s2/pony_a_programming_language_that_allows_dividing/
Nov 05 2021
prev sibling next sibling parent reply Elronnd <elronnd elronnd.net> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473

 I've tried to fix this before, but too many people objected.

 Are we fed up with this yet? I sure am.

 Who wants to take up this cudgel and fix the durned thing once 
 and for all?

 (It's unclear if it would even break existing code.)
Assuming the comment by Ali on the linked bug is right, I think the current behaviour is correct. Your complaints:
 It can't be turned off
Sure it can. You can choose to iterate in another fashion; say, by creating your own iterator which folds invalid utf8 into replacement characters.
 it throws
Is it better to produce an incorrect result? A high-quality, non-throwing mechanism for error handling exists. It consists of an _optional_ value which must be explicitly unwrapped. It is also an out-of-band signal; how will I distinguish invalid utf8 from a correctly-encoded replacement character?
 it may allocate with the gc
So? If that is the sort of thing you care about, then you will use @nogc and find an alternate solution. Lots of core language features allocate, like arrays and hash tables.
 it's slow
In the hot path it's the same speed. In the slow path, performance doesn't matter. In any case, it's useless to give an incorrect result faster.

(Notably, this is not exactly _auto_ decoding; it is explicitly requested decoding. And your proposed modification doesn't change that fact.)

What is (potentially) questionable imo is that given foreach (c; a), c will be inferred to be dchar; you have to explicitly ask for char. Perhaps that default should be reversed. (This will definitely break code, though, and may not be worth it.)

If you want an iterator that generates replacement characters for invalid utf8, just create one. But the default translation should be faithful, and that means not generating any result if none can be generated.
Nov 04 2021
next sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 04/11/2021 8:51 PM, Elronnd wrote:
 What is (potentially) questionable imo is that given foreach (c; a), c 
 will be inferred to be dchar; you have to explicitly ask for char.  
 Perhaps that default should be reversed.  (This will definitely break 
 code, though, and may not be worth it.)
I think this is the right answer. Fix the default. Fewer surprises, fewer headaches, everyone is happy.
Nov 04 2021
parent rikki cattermole <rikki cattermole.co.nz> writes:
On 05/11/2021 12:59 AM, rikki cattermole wrote:
 
 On 04/11/2021 8:51 PM, Elronnd wrote:
 What is (potentially) questionable imo is that given foreach (c; a), c 
 will be inferred to be dchar; you have to explicitly ask for char. 
 Perhaps that default should be reversed.  (This will definitely break 
 code, though, and may not be worth it.)
I think this is the right answer. Fix the default. Less surprises, less head aches, everyone is happy.
Correction: the default is correct, I checked.
Nov 04 2021
prev sibling next sibling parent Adam D Ruppe <destructionator gmail.com> writes:
On Thursday, 4 November 2021 at 07:51:11 UTC, Elronnd wrote:
 What is (potentially) questionable imo is that given foreach 
 (c; a), c will be inferred to be dchar; you have to explicitly 
 ask for char.  Perhaps that default should be reversed.  (This 
 will definitely break code, though, and may not be worth it.)
That's not true. It will always be the type of the thing:

void main() {
    foreach(a; "test")
        pragma(msg, typeof(a)); // immutable(char) NOT dchar
}
Nov 04 2021
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/4/2021 12:51 AM, Elronnd wrote:
 In the hot path it's the same speed.
C++ sold everyone the myth that exceptions not thrown are zero cost. This has been thoroughly debunked, though the myth persists :-(
Nov 04 2021
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.com> writes:
On 2021-11-04 20:40, Walter Bright wrote:
 On 11/4/2021 12:51 AM, Elronnd wrote:
 In the hot path it's the same speed.
C++ sold everyone the myth that exceptions not thrown are zero cost. This has been thoroughly debunked, though the myth persists :-(
I've been doing a fair amount of benchmarking for https://amazon.com/Embracing-Modern-Safely-John-Lakos/dp/0137380356 and surprisingly enough the myth holds true in most cases.
Nov 05 2021
next sibling parent deadalnix <deadalnix gmail.com> writes:
On Friday, 5 November 2021 at 13:25:06 UTC, Andrei Alexandrescu 
wrote:
 On 2021-11-04 20:40, Walter Bright wrote:
 On 11/4/2021 12:51 AM, Elronnd wrote:
 In the hot path it's the same speed.
C++ sold everyone the myth that exceptions not thrown are zero cost. This has been thoroughly debunked, though the myth persists :-(
I've been doing a fair amount of benchmarking for https://amazon.com/Embracing-Modern-Safely-John-Lakos/dp/0137380356 and surprisingly enough the myth holds true in most cases.
It really depends on the exact specification of the myth. You get executables that are bigger by about 20%, and some constructs such as ref-counted smart pointers become harder to optimize, but indeed, the runtime cost when you don't throw isn't remotely as high as people seem to think.
Nov 05 2021
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/5/2021 6:25 AM, Andrei Alexandrescu wrote:
 I've been doing a fair amount of benchmarking for 
 https://amazon.com/Embracing-Modern-Safely-John-Lakos/dp/0137380356 and 
 surprisingly enough the myth holds true in most cases.
All the compilers I know of abandon many optimizations in the presence of unwind blocks. For example, register allocation of variables is not done across unwind blocks. This is because the unwinder does not restore register contents. A further problem is data flow analysis becomes largely ineffective because any operation that may throw (such as a function call to a throwing function) produces an edge from there to the catch block.
Nov 05 2021
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Friday, 5 November 2021 at 20:41:34 UTC, Walter Bright wrote:
 On 11/5/2021 6:25 AM, Andrei Alexandrescu wrote:
 I've been doing a fair amount of benchmarking for 
 https://amazon.com/Embracing-Modern-Safely-John-Lakos/dp/0137380356 and
surprisingly enough the myth holds true in most cases.
All the compilers I know of abandon many optimizations in the presence of unwind blocks. For example, register allocation of variables is not done across unwind blocks. This is because the unwinder does not restore register contents. A further problem is data flow analysis becomes largely ineffective because any operation that may throw (such as a function call to a throwing function) produces an edge from there to the catch block.
I have not checked for GCC, but modern versions of LLVM are pretty good at optimizing in the presence of landing pads. Not so good at optimizing the landing pads themselves, but hey, if you get there often, something has gone horribly wrong and optimization is the least of your concerns. While I have not checked GCC, I'm fairly confident it does a good job.

That being said, on Windows it's another can of worms, because their exception ABI is some special level of crazy.
Nov 05 2021
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/5/2021 2:43 PM, deadalnix wrote:
 I have not checked for GCC, but modern version of LLVM are pretty good at 
 optimizing in the presence of landing pads.
I saw a presentation by Chandler Carruth at CppCon three years back or so where he said that LLVM abandoned much of the optimizations in the presence of unwind blocks.

Optimizations will do better, of course, if your tight loops don't call functions that might throw.

You'll also lose simply because the extra bulk of the EH code will push more of your hot code out of the cache.
Nov 05 2021
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.com> writes:
On 2021-11-05 20:03, Walter Bright wrote:
 On 11/5/2021 2:43 PM, deadalnix wrote:
 I have not checked for GCC, but modern version of LLVM are pretty good 
 at optimizing in the presence of landing pads.
I saw a presentation by Chandler Carruth at CPPCON 3 years back or so where he said that LLVM abandoned much of the optimizations in the presence of rewind blocks.
Three years is a long time in this industry.
 Optimizations will do better, of course, if your tight loops don't call 
 functions that might throw.
 
 You'll also lose simply because the extra bulk of the EH code will push 
 more of your hot code out of the cache.
Turns out the EH code is very well separated. Gcc goes so far as to generate two separate functions, one for hot and one for cold. Clang also does a good job separating the paths.

It happens in our metier that good judgment becomes prejudice. It seems that's what's happening with "exceptions are expensive" right now.
Nov 05 2021
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/5/2021 5:40 PM, Andrei Alexandrescu wrote:
 Turns out the EH code is very well separated. Gcc goes so far as to generate
two 
 separate functions, one for hot and one for cold. Clang also does a good job 
 separating the paths.
How does one decide in advance to call the non-throwing function?
 It happens in our metier that good judgment becomes prejudice. It seems that's 
 what happening with "exceptions are expensive" right now.
I remain skeptical. My playing with gcc shows it moves the unwind blocks past the end of the function, which keeps them somewhat out of the hot path. Doesn't fix the register allocation problem, though. BTW, dmd also moves the unwind blocks past the end.
Nov 05 2021
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Saturday, 6 November 2021 at 00:40:41 UTC, Andrei Alexandrescu 
wrote:
 Turns out the EH code is very well separated. Gcc goes so far 
 as to generate two separate functions, one for hot and one for 
 cold. Clang also does a good job separating the paths.
You bet, I wrote the code that separates the two :)
 It happens in our metier that good judgment becomes prejudice. 
 It seems that's what happening with "exceptions are expensive" 
 right now.
It is on Windows, due to the whole funclet business, and it is in some specific conditions (for instance if icache pressure is the bottleneck), but in most cases the impact is fairly minimal beyond binary size.
Nov 05 2021
parent deadalnix <deadalnix gmail.com> writes:
On Saturday, 6 November 2021 at 01:55:47 UTC, deadalnix wrote:
 On Saturday, 6 November 2021 at 00:40:41 UTC, Andrei 
 Alexandrescu wrote:
 Turns out the EH code is very well separated. Gcc goes so far 
 as to generate two separate functions, one for hot and one for 
 cold. Clang also does a good job separating the paths.
You bet, I wrote the code that separates the two :)
To expand on that, I also wrote code that sends all the exception handling code to a cold section of the executable (and, if PGO is enabled, also really cold code paths). The impact on benchmarks was fairly minimal, so this ended up not being merged.
Nov 05 2021
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.com> writes:
On 2021-11-05 16:41, Walter Bright wrote:
 On 11/5/2021 6:25 AM, Andrei Alexandrescu wrote:
 I've been doing a fair amount of benchmarking for 
 https://amazon.com/Embracing-Modern-Safely-John-Lakos/dp/0137380356 
 and surprisingly enough the myth holds true in most cases.
All the compilers I know of abandon many optimizations in the presence of unwind blocks. For example, register allocation of variables is not done across unwind blocks. This is because the unwinder does not restore register contents. A further problem is data flow analysis becomes largely ineffective because any operation that may throw (such as a function call to a throwing function) produces an edge from there to the catch block.
I know the story. It is aging. I'm telling the facts. It turns out that modern compilers have made a lot of progress in the area.
Nov 05 2021
prev sibling next sibling parent reply Elronnd <elronnd elronnd.net> writes:
Part of the problem, as mentioned, is that this throws away 
information, because text may legitimately contain replacement 
characters.  (And this makes the 'check if replacement char and 
throw yourself' approach a non-starter).  But there are lossless 
encodings.  I think if we are really going to go this route, we 
should use something like raku's utf8-c8 
(https://docs.raku.org/language/unicode#UTF8-C8).
Nov 04 2021
parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/4/2021 12:55 AM, Elronnd wrote:
 Part of the problem, as mentioned, is that this throws away information,
because 
 text may legitimately contain replacement characters.  (And this makes the 
 'check if replacement char and throw yourself' approach a non-starter).  But 
 there are lossless encodings.  I think if we are really going to go this
route, 
 we should use something like raku's utf8-c8 
 (https://docs.raku.org/language/unicode#UTF8-C8).
There's only one replacement character, and this use is officially what it is for. If you're using it for other porpoises, you've got a whale of a problem.
Nov 04 2021
prev sibling next sibling parent reply zjh <fqbqrr 163.com> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
`string`, as a language-level type, should not be encoded at all; it should just be raw (8-bit) bytes. The standard library should implement the required 'encoded string' types. That way, people who need various "encoded strings" can import them from the standard library. It's terrible that you can't write a D program just because the text isn't UTF-8.
Nov 04 2021
parent reply zjh <fqbqrr 163.com> writes:
On Thursday, 4 November 2021 at 08:24:59 UTC, zjh wrote:

The `fundamental` problem is that we should provide users with 
`options` at compile time, not `choose` for users.
If you `choose` for users, there will always be dissatisfaction.
You provide options, and users choose according to their needs.

`auto decoding` and `utf8 string encoding` are both like this. If 
you choose for users, some people will always be unhappy.
Nov 06 2021
parent reply jfondren <julian.fondren gmail.com> writes:
On Sunday, 7 November 2021 at 01:12:19 UTC, zjh wrote:
 On Thursday, 4 November 2021 at 08:24:59 UTC, zjh wrote:

 The `fundamental` problem is that we should provide users with 
 `options` at compile time, not `choose` for users.
 If you `choose` for users, there will always be dissatisfaction.
 You provide options, and users choose according to their needs.

 `auto decoding` and `utf8 string encoding` are both like this. 
 If you choose for users, some people will always be unhappy.
d index with range checking: `arr[ind]`
d index without range checking: `arr.ptr[ind]`
c++ index with range checking: `arr.at(ind)`
c++ index without range checking: `arr[ind]`

There are two ways to index, and both D and C++ offer both ways. Neither language removes a choice. If whether `arr[ind]` should range-check were up for debate, what would be for debate is what the language should encourage by making it the default--the option that's more naturally expressed, that requires less typing.

The question here of "what should a foreach over the dchar of a char[] do?" is the same kind of question.

default: `str`
throwing: `str.byUTF!(dchar, UseReplacementDchar.no)`
asserting: `std.encoding.codePoints(str)`
replacement: `std.utf.byDchar(str)`
truncation: `str[0 .. std.encoding.validLength(str)]`
promotion: `std.string.representation(str)`

Put one of those inside `foreach (dchar; ...) { }` and you get that handling of bad UTF. Changing the default doesn't make the other options go away, and the default has to do *something* (even a compile-time error of "this is not supported behavior" is *something*), so you have to make a choice about the default and make some users unhappy.
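
As a minimal sketch of the "replacement" entry from that list (assuming `std.utf.byDchar` and `std.utf.replacementDchar`, both already referenced in this thread), the loop below never throws; the bad byte simply shows up as U+FFFD:

```d
unittest
{
    import std.utf : byDchar, replacementDchar;

    string invalid = "hello\247there"; // one stray continuation byte
    size_t replaced;
    foreach (dchar c; invalid.byDchar) // never throws on bad UTF
        if (c == replacementDchar)
            ++replaced;
    assert(replaced >= 1); // the invalid sequence became U+FFFD
}
```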
Nov 06 2021
parent reply zjh <fqbqrr 163.com> writes:
On Sunday, 7 November 2021 at 01:59:47 UTC, jfondren wrote:
 On Sunday, 7 November 2021 at 01:12:19 UTC, zjh wrote:
Rust has more than ten `kinds` of strings. Maybe we could add `2/3` more.
Nov 06 2021
parent jfondren <julian.fondren gmail.com> writes:
On Sunday, 7 November 2021 at 02:12:36 UTC, zjh wrote:
 On Sunday, 7 November 2021 at 01:59:47 UTC, jfondren wrote:
 On Sunday, 7 November 2021 at 01:12:19 UTC, zjh wrote:
Rust has more than ten `kinds` of strings. Maybe we could add `2/3` more.
Meanwhile, in Rust:

```rust
mod tests {
    fn type_of<T>(_: T) -> &'static str {
        core::any::type_name::<T>()
    }

    const INVALID: &'static str = unsafe {
        std::str::from_utf8_unchecked(&[
            0x68, 0x65, 0x6c, 0x6c, 0x6f, 0xa7, 0x85, 0xaf, 0x74, 0x68, 0x65, 0x72, 0x65,
        ])
    };

    fn iter_invalid() {
        for c in INVALID.chars() {
            println!("{} {}, {}", type_of(c), c as u32, c);
        }
    }
}
```

If you smuggle invalid UTF into a type that Rust expects to be valid UTF (the same case as `string` in D, allegedly), then Rust's equivalent of `foreach (dchar c; str) { }` just emits invalid chars -- two of 'em, somehow.

104, 101, 108, 108, 111 - "hello"
453, 1012 - ???
104, 101, 114, 101 - "here" (the 't' is lost)

This is similar to `foreach (dchar c; std.encoding.codePoints(str)) { }` which emits three dchars between "hello" and "there", but which also has an assert failure in non-release builds.
Nov 07 2021
prev sibling next sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 11/3/21 10:26 PM, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
 
 I've tried to fix this before, but too many people objected.
 
 Are we fed up with this yet? I sure am.
 
 Who wants to take up this cudgel and fix the durned thing once and for all?
 
 (It's unclear if it would even break existing code.)
Honestly, I'd say `foreach(dchar c; somestr)` should not work.

1. It's slow and calls opaque functions
2. Adds more requirements to runtime that are simply solved by basic wrappers.
3. If writing wrappers, you can decide what you want.
4. It gets people used to language-magic character conversion, when this doesn't work on ranges of `char` that aren't arrays -- which then performs integer promotion (see the sketch below).

What I would *not* suggest though, is to just disable the feature. If it falls back to integer promotion (which is the worst thing ever for characters), then tons and tons of code will break, and much code will just work for English strings.

Autodecoding might be a huge problem with Phobos, but character promotion is a huge problem with the language.

-Steve
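
A minimal sketch of the pitfall in point 4, using `std.utf.byCodeUnit` purely as an example of a non-array range of `char` (any such range behaves the same way): asking for `dchar` no longer decodes, each code unit is just value-converted.

```d
unittest
{
    import std.utf : byCodeUnit;

    string s = "é"; // one code point, two UTF-8 code units (0xC3, 0xA9)

    dchar[] viaArray;
    foreach (dchar c; s)            // char[]: the language decodes (or throws)
        viaArray ~= c;
    assert(viaArray == "é"d);       // one dchar

    dchar[] viaRange;
    foreach (dchar c; s.byCodeUnit) // non-array char range: each code unit is
        viaRange ~= c;              // promoted to dchar, nothing is decoded
    assert(viaRange.length == 2);   // two mojibake dchars: U+00C3 and U+00A9
}
```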
Nov 04 2021
prev sibling next sibling parent reply jfondren <julian.fondren gmail.com> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
This doesn't throw, actually:

```d
unittest
{
    import std.stdio : writeln;

    enum invalid = "hello\247\205\257there";
    foreach (c; invalid)
        writeln(cast(ubyte) c);
}
```

Which is per usual in D:

```d
@("std.utf.byUTF 2/3 (throwing)")
@safe unittest
{
    import std.utf : byUTF, UTFException, UseReplacementDchar;
    import std.exception : assertThrown, assertNotThrown;
    import std.algorithm : count;

    string partial = "hello\247\205\257there";
    // byChar misses the bad UTF8 ...
    assertNotThrown!UTFException(partial.byUTF!(char, UseReplacementDchar.no).count);
    // byDchar objects to it
    assertThrown!UTFException(partial.byUTF!(dchar, UseReplacementDchar.no).count);
}
```

This does throw:

```d
unittest
{
    import std.stdio : writeln;

    enum invalid = "hello\247\205\257there";
    foreach (dchar c; invalid)
        writeln(cast(int) c);
}
```

but by asking for dchars from an immutable(char)[] you're asking for some unicode work to happen, so throwing is a reasonable default IMO. Emitting the replacement character is also a reasonable default, and objections in the thread can be answered the same way that objections to throwing can be: if you don't like it, iterate some other way:

```d
// throw on invalid UTF
unittest
{
    import std.utf : byUTF, UseReplacementDchar, UTFException;

    enum invalid = "hello\247\205\257there";
    int sum;
    try
    {
        foreach (dchar c; invalid.byUTF!(dchar, UseReplacementDchar.no))
            sum += cast(int) c;
        assert(sum == 197667);
    }
    catch (UTFException e)
    {
        assert(sum == 532);
    }
}

// AssertError on invalid UTF
// (release behavior: "\247\205\257" is three dchars!)
unittest
{
    import std.stdio : writeln;
    import std.encoding : codePoints;

    enum invalid = "hello\247\205\257there";
    foreach (dchar c; invalid.codePoints)
        writeln(cast(int) c);
}

// stop iterating on invalid UTF
unittest
{
    import std.encoding : validLength;

    enum invalid = "hello\247\205\257there";
    char[] s;
    foreach (dchar c; invalid[0 .. invalid.validLength])
        s ~= c;
    assert(s == "hello");
}
```
Nov 04 2021
parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/4/2021 7:52 AM, jfondren wrote:
 Emitting the replacement character is also a reasonable default, and objections 
 in the thread can be answered the same way that objections to throwing can be: 
 if you don't like it, iterate some other way:
Technically, you are correct. But experience shows this does not work, because people will be human.

Two things are abundantly clear:

1. throwing exceptions must not be default behavior
2. allocating with the GC must not be the default behavior

and pushing against that is like trying to get people to eat their vegetables.
Nov 04 2021
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473

 I've tried to fix this before, but too many people objected.

 Are we fed up with this yet? I sure am.

 Who wants to take up this cudgel and fix the durned thing once 
 and for all?

 (It's unclear if it would even break existing code.)
For the love of god, if you are going to make a breaking change there, just remove autodecoding altogether. Trying to fix what shouldn't exist is by far the biggest time sink engineers involve themselves in.
Nov 04 2021
parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Friday, 5 November 2021 at 02:06:01 UTC, deadalnix wrote:
 On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright 
 wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
For the love of god, if you are going to make a breaking change there, just remove autodecoding altogether.
This post isn't about autodecoding. With foreach, you opt into the decoding by specifically asking for it.
Nov 04 2021
parent reply deadalnix <deadalnix gmail.com> writes:
On Friday, 5 November 2021 at 02:38:51 UTC, Adam D Ruppe wrote:
 On Friday, 5 November 2021 at 02:06:01 UTC, deadalnix wrote:
 On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright 
 wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
For the love of god, if you are going to make a breaking change there, just remove autodecoding altogether.
This post isn't about autodecoding. With foreach, you opt into the decoding by specifically asking for it.
Very clearly it is, because if you don't decode, then you don't do replacement chars or exceptions.
Nov 04 2021
next sibling parent Dukc <ajieskola gmail.com> writes:
On Friday, 5 November 2021 at 03:02:07 UTC, deadalnix wrote:
 On Friday, 5 November 2021 at 02:38:51 UTC, Adam D Ruppe wrote:
 This post isn't about autodecoding. With foreach, you opt into 
 the decoding by specifically asking for it.
Very clearly it is, because if you don't decode, then you don't do replacement chars or exceptions.
It's about decoding, but not autodecoding. Or at least not the same autodecoding we usually refer to. Autodecoding is the way Phobos v1 treats character arrays when they are used as ranges. This is about an implicit conversion in the language itself.
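
A small sketch of that distinction, assuming Phobos autodecoding is still in place: the range primitives turn a `string` into a `dchar` range on their own, while the `foreach` conversion discussed here only happens because the loop variable asks for `dchar`.

```d
unittest
{
    import std.range.primitives : ElementType, front;

    string s = "é";

    // Phobos autodecoding: the range primitives present strings as dchar ranges.
    static assert(is(ElementType!string == dchar));
    assert(s.front == 'é');

    // Language-level conversion: foreach decodes only when asked for dchar.
    size_t units, points;
    foreach (char c; s) ++units;   // code units, untouched
    foreach (dchar c; s) ++points; // decoded code points
    assert(units == 2 && points == 1);
}
```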
Nov 05 2021
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.com> writes:
On 2021-11-04 23:02, deadalnix wrote:
 On Friday, 5 November 2021 at 02:38:51 UTC, Adam D Ruppe wrote:
 On Friday, 5 November 2021 at 02:06:01 UTC, deadalnix wrote:
 On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
For the love of god, if you are going to make a breaking change there, just remove autodecoding altogether.
This post isn't about autodecoding. With foreach, you opt into the decoding by specifically asking for it.
Very clearly it is, because if you don't decode, then you don't do replacement chars or exceptions.
"On demand" is not "auto".
Nov 05 2021
parent reply deadalnix <deadalnix gmail.com> writes:
On Friday, 5 November 2021 at 13:26:00 UTC, Andrei Alexandrescu 
wrote:
 "On demand" is not "auto".
From the bug repport:
 A simple foreach loop:
 
     void test(char[] a)
     {
         foreach (char c; a) { }
     }
 
 will throw a UtfException if `a` is not a valid UTF string. 
 Instead, it should replace the invalid sequence with 
 replacementDchar.
This shouldn't do anything related to unicode at all.
Nov 05 2021
next sibling parent Adam D Ruppe <destructionator gmail.com> writes:
On Friday, 5 November 2021 at 14:10:35 UTC, deadalnix wrote:
 This shouldn't do anything related to unicode at all.
Well, it doesn't. That was apparently just a typo, as the comments in the bug report quickly pointed out; the issue only arises when you specifically request dchar out of char, NOT when you are just working on chars (which is the default).
Nov 05 2021
prev sibling parent reply jfondren <julian.fondren gmail.com> writes:
On Friday, 5 November 2021 at 14:10:35 UTC, deadalnix wrote:
 On Friday, 5 November 2021 at 13:26:00 UTC, Andrei Alexandrescu 
 wrote:
 "On demand" is not "auto".
From the bug repport:
 A simple foreach loop:
 
     void test(char[] a)
     {
         foreach (char c; a) { }
     }
 
 will throw a UtfException if `a` is not a valid UTF string. 
 Instead, it should replace the invalid sequence with 
 replacementDchar.
This shouldn't do anything related to unicode at all.
It doesn't. This does:

```d
unittest
{
    enum invalid = "hello\247\205\257there";
    foreach (dchar c; invalid) { }
}
```

Looping over the dchar of a char[] requires one of

1. throwing an error on invalid UTF (current behavior)
2. doing something else in that case (proposed: replacementDchar; also possible: silently doing something invalid like iterating over three dchars between "hello" and "there")
3. a compile-time error (also proposed in the thread)
Nov 05 2021
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.com> writes:
On 2021-11-05 10:22, jfondren wrote:
 3. a compile-time error (also proposed in the thread)
Speaking of which, I was thinking std2x should simply reject mixed-sign min and max during compilation instead of cleverly figuring out the "right" comparison. Now we have signed() and unsigned() that make it trivial for the user to steer min and max toward doing the right thing, and it's clearer too.
Nov 05 2021
prev sibling parent ag0aep6g <anonymous example.com> writes:
On 05.11.21 15:22, jfondren wrote:
 Looping over the dchar of a char[] requires one of
 
 1. throwing an error on invalid UTF (current behavior)
 2. doing something else in that case (proposed: replacementDchar; also 
 possible: silently doing something invalid like iterating over three 
 dchars between "hello" and "there")
 3. a compile-time error (also proposed in the thread)
4. Don't decode. Just do an implicit conversion from char to dchar. Just like `char c; dchar d = c;`. It's horrible, but D usually allows it. So let foreach do it too. Or get rid of the implicit conversion as well while you're at it.
Nov 05 2021
prev sibling next sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473

 I've tried to fix this before, but too many people objected.

 Are we fed up with this yet? I sure am.

 Who wants to take up this cudgel and fix the durned thing once 
 and for all?

 (It's unclear if it would even break existing code.)
How about just assert(false)? It is @nogc and foreach over invalid utf-8 is a logic error (as you didn't sanitize).
Nov 05 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 5 November 2021 at 09:34:31 UTC, Guillaume Piolat 
wrote:
 How about just assert(false)? It is @nogc and foreach over 
 invalid utf-8 is a logic error (as you didn't sanitize).
It is even worse, it is a type error. If "utf-8" is to be a meaningful type you should be allowed to assume that it follows the spec.
Nov 05 2021
parent reply Guillaume Piolat <first.last gmail.com> writes:
On Friday, 5 November 2021 at 09:57:45 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 5 November 2021 at 09:34:31 UTC, Guillaume Piolat 
 wrote:
 How about just assert(false)? It is @nogc and foreach over 
 invalid utf-8 is a logic error (as you didn't sanitize).
It is even worse, it is a type error. If "utf-8" is to be a meaningful type you should be allowed to assume that it follows the spec.
Well, you only know that it is meant to be utf8 in the context of the auto-decoding foreach (which must still exist). Strings in actual programs may contain binary files, or strings in other codepage encodings.
Nov 05 2021
next sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 5 November 2021 at 10:13:13 UTC, Guillaume Piolat 
wrote:
 Well, you only know that it is meant to be utf8 in the context 
 of the auto-decoding foreach (which must still exist). Strings 
 in actual programs may contain binary files, or strings in other 
 codepage encodings.
D needs to rethink strings. Newbies going for "scripty" programming really need an encapsulated, strongly typed string type, accessed only through functions that do the right thing.

I think the @safe/@system distinction would be more useful if @safe was for those who wanted a more "scripty" programming style and @system was for those who wanted a more "low level" programming style.

On a related note, I also think it would be useful to have something stronger than @safe, like a non-trojan marker for libraries, which basically says that it is impossible for that library to do evil, and have that statically checked by the compiler. Then you could import libraries without caring about bad code. One issue I have with packages in smaller languages is that you don't have enough eyeballs on them, so it is too easy for "evil" code to slip through (intentionally or not).
Nov 05 2021
parent reply Elronnd <elronnd elronnd.net> writes:
On Friday, 5 November 2021 at 10:30:27 UTC, Ola Fosheim Grøstad 
wrote:
 I also think it would be useful to have something stronger than 
 @safe, like a non-trojan marker for libraries, which 
 basically says that it is impossible for that library to do 
 evil and have that statically checked by the compiler.
pure
Nov 05 2021
parent Elronnd <elronnd elronnd.net> writes:
On Friday, 5 November 2021 at 22:31:59 UTC, Elronnd wrote:
 On Friday, 5 November 2021 at 10:30:27 UTC, Ola Fosheim Grøstad 
 wrote:
 I also think it would be useful to have something stronger 
 than @safe, like a non-trojan marker for libraries, which 
 basically says that it is impossible for that library to do 
 evil and have that statically checked by the compiler.
pure
Hmm, technically pure code can infinite loop and cause a DOS. But any useful language will be able to get arbitrary recursion depth even if it is proved to terminate (e.g. cpp), sooo... And there is also the obvious pitfall of debug-in-pure.
Nov 05 2021
prev sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 5 November 2021 at 10:13:13 UTC, Guillaume Piolat 
wrote:
 Well, you only know that it is meant to be utf8 in the context 
 of the auto-decoding foreach (which must still exist). Strings 
 in actual programs may contain binary files, or strings in other 
 codepage encodings.
I had a look at the [documentation](https://dlang.org/spec/arrays.html#strings) today, and it said:

«char[] strings are in UTF-8 format.»

I would assume that this is normative? Maybe change the documentation to use more forceful specification language so that it says: «char[] strings MUST be in UTF-8 format.»

So, I think a messed up ```string``` should be considered a type error, and it would be good if the compiler checked this statically where possible (e.g. literals) and simply assumed it to hold when parsing strings (like in a ```for``` loop).

In C++ I use ```span<uint8_t>``` for raw string-slices and ```span<char8_t>``` for utf8 string-slices. I find that to be quite clear. In C++ these are distinct types. (Newbies need a wrapper that is foolproof.)
Nov 10 2021
next sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Wednesday, 10 November 2021 at 10:23:31 UTC, Ola Fosheim 
Grøstad wrote:
 On Friday, 5 November 2021 at 10:13:13 UTC, Guillaume Piolat 
 wrote:
 Well, you only know that it is meant to be utf8 in the context 
 of the auto-decoding foreach (which must still exist). Strings 
 in actual programs may contain binary files, or strings in other 
 codepage encodings.
I had a look at the [documentation]( https://dlang.org/spec/arrays.html#strings ) today, and it said: «char[] strings are in UTF-8 format.» I would assume that this is normative? Maybe change the documentation to use more forceful specification language so that it says: «char[] strings MUST be in UTF-8 format.»
I'm not sure what is intended.

import("file.stuff") yields string. So there is at least one gap, as it is often used with binary files that ain't UTF-8.

Also look at that signature: https://dlang.org/phobos/std_utf.html#validate
By that spec, it could then never fail.

It seems in practice it doesn't have to be utf-8 until you use something that assumes it is. Which is ok for me.
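
A minimal sketch of that gap (it assumes a file named `file.stuff` on the string import path, i.e. the compiler is run with a suitable `-J` flag): the import expression is typed as `string` no matter what the file contains.

```d
// Hypothetical file name; compile with -J pointing at its directory.
enum raw = import("file.stuff");
static assert(is(typeof(raw) == string)); // typed as string...
// ...but nothing forces file.stuff to actually be valid UTF-8.
```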
Nov 10 2021
next sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 10 November 2021 at 11:47:16 UTC, Guillaume Piolat 
wrote:
 It seems in practice it doesn't have to be utf-8 until you use 
 something that assume it is. Which is ok for me.
Hm… for me the key advantage of stricter typing is that you can make more functions free of exceptions and error-handling without using much human judgment. The ideal is to only do error handling in I/O call-trees.
Nov 10 2021
prev sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 10 November 2021 at 11:47:16 UTC, Guillaume Piolat 
wrote:
 import("file.stuff") yields string.
 So there is at least one gap, as it is often used with binary 
 files that ain't UTF-8.
Maybe a «binary_import!T("file.data")» that yields a slice of type T?
Nov 10 2021
prev sibling parent reply Elronnd <elronnd elronnd.net> writes:
On Wednesday, 10 November 2021 at 10:23:31 UTC, Ola Fosheim 
Grøstad wrote:
 I had a look at the [documentation]( 
 https://dlang.org/spec/arrays.html#strings ) today, and it said:

 «char[] strings are in UTF-8 format.»

 I would assume that this is normative? Maybe change the 
 documentation to use more forceful specification language so 
 that it says: «char[] strings MUST be in UTF-8 format.»

 So, I think a messed up ```string``` should be considered a 
 type error and it would be good if the compiler checked this 
 statically where possible (e.g. literals) and simply assumed it 
 to hold when parsing strings (like in a ```for``` loop).
I agree this should be required.  If you want something which is not valid UTF-8, _do not put it into a string_.  Use ubyte[].

Go further: require a runtime check on cast from ubyte[] to char[] (expensive), and on slicing char[] (cheap).  (If you abuse unions you are on your own; but obviously that is not allowed in @safe code, so it has the same limitations as e.g. boundschecking.)
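
A minimal sketch of such a checked conversion, using a hypothetical helper name (`toValidatedString`); the check itself is just Phobos' `std.utf.validate`:

```d
string toValidatedString(immutable(ubyte)[] bytes)
{
    import std.utf : validate;

    auto s = cast(string) bytes; // reinterpret the bytes as char data
    validate(s);                 // throws UTFException if not valid UTF-8
    return s;
}

unittest
{
    import std.exception : assertThrown;
    import std.utf : UTFException;

    immutable(ubyte)[] good = [0x68, 0x69]; // "hi"
    assert(toValidatedString(good) == "hi");

    immutable(ubyte)[] bad = [0x68, 0x80, 0x69]; // invalid byte in the middle
    assertThrown!UTFException(toValidatedString(bad));
}
```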
Nov 10 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
 I agree this should be required.  If you want something which 
 is not valid UTF-8, _do not put it into a string_.  Use ubyte[].
Exactly.
 Go further: require a runtime check on cast from ubyte[] to 
 char[] (expensive), and on slicing char[] (cheap).  (If you 
 abuse unions you are on your own; but obviously that is not 
 allowed in  safe code, so has the same limitations as e.g. 
 boundschecking.)
The compiler could do such checks in an extra-solid-debug-mode. That could certainly improve unit-testing and other testing. In such a mode you could also do overflow checks for signed integers (if they are changed so they don't wrap).
Nov 10 2021
parent reply kdevel <kdevel vogtner.de> writes:
On Thursday, 11 November 2021 at 07:58:54 UTC, Ola Fosheim 
Grøstad wrote:
 On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
 I agree this should be required.  If you want something which 
 is not valid UTF-8, _do not put it into a string_.  Use 
 ubyte[].
Exactly.
[...]
 The compiler could do such checks in an extra-solid-debug-mode.
This requires lots of changes or additions

```d
import std.stdio;
import std.file;

void main ()
{
   ubyte [] filename = [ 'a', 0x80, 'b', '\0' ]; // valid filename in some OS
   auto s = readText (filename);
}
```

This does not yet compile:

```
[...]
       R = ubyte[]
  must satisfy one of the following constraints:
       isSomeChar!(ElementType!R)
       is(StringTypeOf!R)
```
Nov 12 2021
next sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
 On Thursday, 11 November 2021 at 07:58:54 UTC, Ola Fosheim 
 Grøstad wrote:
 On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
 I agree this should be required.  If you want something which 
 is not valid UTF-8, _do not put it into a string_.  Use 
 ubyte[].
Exactly.
[...]
 The compiler could do such checks in an extra-solid-debug-mode.
This requires lots of changes or additions

```d
import std.stdio;
import std.file;

void main ()
{
   ubyte [] filename = [ 'a', 0x80, 'b', '\0' ]; // valid filename in some OS
   auto s = readText (filename);
}
```

This does not yet compile:

```
[...]
       R = ubyte[]
  must satisfy one of the following constraints:
       isSomeChar!(ElementType!R)
       is(StringTypeOf!R)
```
Yes, because `readText` is typed in a way that it excludes valid filenames. But it's *already* wrong - this feature would only expose the wrongness, as `filename` is already not a validly typed string. File a bug?
Nov 14 2021
parent kdevel <kdevel vogtner.de> writes:
On Monday, 15 November 2021 at 07:17:03 UTC, FeepingCreature 
wrote:
[...]
 Yes, because `readText` is typed in a way that it excludes 
 valid filenames. But it's *already* wrong - this feature would 
 only expose the wrongness, as `filename` is already not a 
 validly typed string. File a bug?
May I ask you for a bug title? "readText shall accept natively typed filename"?
Nov 15 2021
prev sibling next sibling parent reply user1234 <user1234 12.de> writes:
On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
 This does not yet compile:

    [...]
           R = ubyte[]`
      must satisfy one of the following constraints:
    `       isSomeChar!(ElementType!R)
           is(StringTypeOf!R)`
auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.
Nov 15 2021
parent reply user1234 <user1234 12.de> writes:
On Monday, 15 November 2021 at 08:20:57 UTC, user1234 wrote:
 On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
 This does not yet compile:

    [...]
           R = ubyte[]`
      must satisfy one of the following constraints:
    `       isSomeChar!(ElementType!R)
           is(StringTypeOf!R)`
auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.
I meant decode then re-enc to utf
Nov 15 2021
next sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 15 November 2021 at 08:22:13 UTC, user1234 wrote:
 On Monday, 15 November 2021 at 08:20:57 UTC, user1234 wrote:
 On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
 This does not yet compile:

    [...]
           `R = ubyte[]`
      must satisfy one of the following constraints:
    `       isSomeChar!(ElementType!R)
           is(StringTypeOf!R)`
auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.
I meant decode then re-enc to utf
I don't see how that could work. `readText` would need to encode it to the OS codepage, but `readText` has no idea what encoding you intend. And the encoding of a filename isn't even always determined by the locale; consider trying to access filenames saved in a different locale, ie. what iconv does. There's no way around `readText` taking `ubyte[]`.
Nov 15 2021
parent reply user1234 <user1234 12.de> writes:
On Monday, 15 November 2021 at 11:20:04 UTC, FeepingCreature 
wrote:
 On Monday, 15 November 2021 at 08:22:13 UTC, user1234 wrote:
 On Monday, 15 November 2021 at 08:20:57 UTC, user1234 wrote:
 On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
 This does not yet compile:

    [...]
           `R = ubyte[]`
      must satisfy one of the following constraints:
    `       isSomeChar!(ElementType!R)
           is(StringTypeOf!R)`
auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.
I meant decode then re-enc to utf
I don't see how that could work. `readText` would need to encode it to the OS codepage, but `readText` has no idea what encoding you intend. And the encoding of a filename isn't even always determined by the locale; consider trying to access filenames saved in a different locale, ie. what iconv does. There's no way around `readText` taking `ubyte[]`.
I think I was off-topic; my reply was about the filename, e.g. `fname.fromAnsi(cp).toUTF!char.readText()`. You were talking more about the file content, apparently? Sorry about that.
Nov 15 2021
parent kdevel <kdevel vogtner.de> writes:
On Monday, 15 November 2021 at 11:26:41 UTC, user1234 wrote:
[...]
 I think I was off-topic, my reply was about the filename, e.g

 `fname.fromAnsi(cp).toUTF!char.readText()`

 you were more talking about the file content apparently ? sorry 
 about that.
I /am/ talking about the filename. On POSIX systems the bytes do not mean anything (to the operating system) except for the three values '\0', '/' and '.' [1].

[1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_170
Nov 15 2021
prev sibling parent reply kdevel <kdevel vogtner.de> writes:
On Monday, 15 November 2021 at 08:22:13 UTC, user1234 wrote:
 On Monday, 15 November 2021 at 08:20:57 UTC, user1234 wrote:
 On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
 This does not yet compile:

    [...]
           R = ubyte[]`
      must satisfy one of the following constraints:
    `       isSomeChar!(ElementType!R)
           is(StringTypeOf!R)`
auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.
I meant decode then re-enc to utf
You can only decode what has been (or is meant to be) encoded. Except for '.', '\0', and '/', the character values (0 .. 255) have no meaning within a filename.
Nov 15 2021
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Monday, 15 November 2021 at 19:59:40 UTC, kdevel wrote:
 You can only decode what has been (or is ment to be) encoded. 
 Except for '.', '\0', and '/' the character values (0 .. 255) 
 have no meaning within a filename.
It should probably be a system-specific string type that validates using the rules of the specific OS.
Nov 15 2021
prev sibling parent Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
 On Thursday, 11 November 2021 at 07:58:54 UTC, Ola Fosheim 
 Grøstad wrote:
 On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
 I agree this should be required.  If you want something which 
 is not valid UTF-8, _do not put it into a string_.  Use 
 ubyte[].
Exactly.
[...]
 The compiler could do such checks in an extra-solid-debug-mode.
This requires lots of changes or additions

```d
import std.stdio;
import std.file;

void main ()
{
   ubyte [] filename = [ 'a', 0x80, 'b', '\0' ]; // valid filename in some OS
   auto s = readText (filename);
}
```

This does not yet compile:

```
[...]
       R = ubyte[]
  must satisfy one of the following constraints:
       isSomeChar!(ElementType!R)
       is(StringTypeOf!R)
```
One idea that has come up would be compile-time checking of strings. But thinking about the garbage-in-garbage-out concept in general, maybe functions should really just accept data, and it's the caller's responsibility that it's valid.

This becomes a philosophical discussion, but could maybe be interesting (increased compile times ofc, but could be worth it). This would be more of a D3 thing.

The Erlang path is fail fast. Fix the error at its root.

Don't get me wrong, I understand why phobos is the way it is now, and it works. It's more in the "ideas to explore" category.

One might say "but what about external data, I don't know if that's valid". The answer there would be to sanitize it before passing it to the function. It would also be better from a composability viewpoint.

In summary: Keep the functions themselves short and friendly. Make the incoming data correct. Put the constraints outside the function.

Pros and cons as with everything ofc
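
A small sketch of the "sanitize it before passing it to the function" idea, assuming Phobos' `std.encoding.sanitize` and `std.encoding.isValid` are acceptable for the job:

```d
unittest
{
    import std.encoding : isValid, sanitize;

    string dirty = "hello\247there"; // one invalid byte in the middle
    assert(!dirty.isValid);

    string clean = dirty.sanitize;   // invalid sequences become U+FFFD
    assert(clean.isValid);           // now safe to hand to UTF-expecting code
}
```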
Nov 15 2021
prev sibling next sibling parent reply Alexey <invalid email.address> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473

 I've tried to fix this before, but too many people objected.

 Are we fed up with this yet? I sure am.

 Who wants to take up this cudgel and fix the durned thing once 
 and for all?

 (It's unclear if it would even break existing code.)
I didn't read the thread. And I'm not an expert in D or Unicode, of course. But if I needed to solve the problem of unicode handling, I would do the following:

1. define a type for the 'grapheme' - so a grapheme could store any unicode symbol;
2. define a string of graphemes as an array of graphemes, so the programmer could at any time use the usual array tools on those, and things like .length and slicing [x..y] work as usual. Call this, for instance, 'gstring' or 'graphstring';
3. IMHO, one grapheme should be an alias to ubyte[] or to one BigInt;
4. conversion from string/wstring/dstring/ubyte[]/BigInt[]/etc to ['gstring' or 'graphstring'] should be automatic and this should be stated in documentation;
5. ['gstring' or 'graphstring'] should have functions to convert to string/wstring/dstring/ubyte[]/BigInt[]/etc
Nov 05 2021
parent reply Alexey <invalid email.address> writes:
On Saturday, 6 November 2021 at 04:07:35 UTC, Alexey wrote:

 3. IMHO, one grapheme should be an alias to ubyte[] or to one 
 BigInt;
Or maybe even define one grapheme as dchar[]. Or maybe even define a new separate type for 'codepoint' and define one grapheme as codepoint[].
Nov 05 2021
next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
https://dlang.org/phobos/std_uni.html#Grapheme
Nov 05 2021
prev sibling next sibling parent Alexey <invalid email.address> writes:
On Saturday, 6 November 2021 at 04:18:51 UTC, Alexey wrote:
 On Saturday, 6 November 2021 at 04:07:35 UTC, Alexey wrote:
And as for Ranges: Ranges should not do any automatic string conversions.
Nov 05 2021
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Nov 06, 2021 at 04:18:51AM +0000, Alexey via Digitalmars-d wrote:
 On Saturday, 6 November 2021 at 04:07:35 UTC, Alexey wrote:
 
 3. IMHO, one grapheme should be an alias to ubyte[] or to one BigInt;
Or maybe even define one grapheme as dchar[]. Or maybe even define a new separate type for 'codepoint' and define one grapheme as codepoint[].
Unfortunately, codepoint != grapheme. This was the fundamental error with autodecoding that made it so bad. It costs us a performance hit but doesn't even produce the right results in return.

And even more unfortunately, grapheme segmentation is an extremely convoluted (i.e. slow) operation that normally you would *not* want to do unless your code absolutely has to.


T

-- 
Let's eat some disquits while we format the biskettes.
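
A quick sketch of "codepoint != grapheme" with the usual Phobos tools (`std.uni.byGrapheme`, autodecoded `walkLength`): one user-perceived character, two code points.

```d
unittest
{
    import std.range : walkLength;
    import std.uni : byGrapheme;

    string s = "e\u0301"; // 'e' followed by a combining acute accent
    assert(s.walkLength == 2);            // two code points (autodecoded)
    assert(s.byGrapheme.walkLength == 1); // but only one grapheme
}
```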
Nov 05 2021
next sibling parent reply Alexey <invalid email.address> writes:
On Saturday, 6 November 2021 at 05:36:07 UTC, H. S. Teoh wrote:
 Unfortunately, codepoint != grapheme. This was the fundamental 
 error with autodecoding that made it so bad. It costs us a 
 performance hit but doesn't even produce the right results in 
 return.

 And even more unfortunately, grapheme segmentation is an 
 extremely convoluted (i.e. slow) operation that normally you 
 would *not* want to do it unless your code absolutely has to.


 T
```D
struct graphstring
{
    grapheme[] grapheme_elements;
}

struct grapheme
{
    dchar[] codepoints;
}
```

Would this really be _that_ slow?

Also, there is no need to do error checks on every action the user may do with graphstrings: no need to check on concatenations or slicings, for instance, but do checks on conversions from other string/ubyte[] types and to those types.
Nov 05 2021
next sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Saturday, 6 November 2021 at 06:17:55 UTC, Alexey wrote:
 On Saturday, 6 November 2021 at 05:36:07 UTC, H. S. Teoh wrote:
 Unfortunately, codepoint != grapheme. This was the fundamental 
 error with autodecoding that made it so bad. It costs us a 
 performance hit but doesn't even produce the right results in 
 return.

 And even more unfortunately, grapheme segmentation is an 
 extremely convoluted (i.e. slow) operation that normally you 
 would *not* want to do it unless your code absolutely has to.


 T
```D struct graphstring { grapheme[] grapheme_elements; } struct grapheme { dchar[] codepoints; } ``` Would this really be _that_ slow? also, there is no need to do error checks on every action which user may do with graphstrings: no need to check on concatenations or slicings, for instance. but do checks on conversions from other string/ubyte[] types and to those types.
This is 1 grapheme: A̶͙̜͚̫̬̻ͅ (U+0041 U+0336 U+0359 U+0345 U+031c U+035a U+032b U+032c U+033b), but 9 codepoints (9 dchar, 9 wchar, 17 char: 0x41 0xcc 0xb6 0xcd 0x99 0xcd 0x85 0xcc 0x9c 0xcd 0x9a 0xcc 0xab 0xcc 0xac 0xcc 0xbb).
Nov 06 2021
prev sibling parent reply jfondren <julian.fondren gmail.com> writes:
On Saturday, 6 November 2021 at 06:17:55 UTC, Alexey wrote:
 ```D
 struct graphstring
 {
     grapheme[] grapheme_elements;
 }

 struct grapheme
 {
     dchar[] codepoints;
 }

 ```
std.uni.Grapheme is more complex than a dchar[] (it tries to avoid allocating and it owns the dchars) but it has .length and opIndex that work like dchar[] (but read the warning on opSlice).

A Grapheme[] you can get with just `s1.byGrapheme.array`.

Round-trip example from std.uni:

```d
@safe unittest
{
    import std.array : array;
    import std.conv : text;
    import std.range : retro;
    import std.uni : byGrapheme, byCodePoint;

    string s = "noe\u0308l"; // noël

    // reverse it and convert the result to a string
    string reverse = s.byGrapheme
        .array
        .retro
        .byCodePoint
        .text;

    assert(reverse == "le\u0308on"); // lëon
}
```
Nov 06 2021
parent Alexey <invalid email.address> writes:
On Saturday, 6 November 2021 at 13:07:53 UTC, jfondren wrote:
 ...
I doubt that std.uni.Grapheme works faster than dchar[]. Also, I doubt that all the checks and things std.uni.Grapheme does are really necessary in the context of a hypothetical 'graphstring'.
Nov 06 2021
prev sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Saturday, 6 November 2021 at 05:36:07 UTC, H. S. Teoh wrote:
 And even more unfortunately, grapheme segmentation is an 
 extremely convoluted (i.e. slow) operation that normally you 
 would *not* want to do it unless your code absolutely has to.
It is suitable for a library though.
Nov 06 2021
prev sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
Previous discussions:

- https://wiki.dlang.org/DIP76
- https://forum.dlang.org/post/mfvi86$10ml$1 digitalmars.com
- https://issues.dlang.org/show_bug.cgi?id=14519
- https://github.com/dlang/druntime/pull/1240
- https://github.com/dlang/druntime/pull/1279
- https://issues.dlang.org/show_bug.cgi?id=20134
- https://github.com/dlang/phobos/pull/7144
Nov 06 2021
parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/6/2021 9:09 AM, Vladimir Panteleev wrote:
 On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
Previous discussions:
- https://wiki.dlang.org/DIP76
- https://forum.dlang.org/post/mfvi86$10ml$1 digitalmars.com
- https://issues.dlang.org/show_bug.cgi?id=14519
- https://github.com/dlang/druntime/pull/1240
- https://github.com/dlang/druntime/pull/1279
- https://issues.dlang.org/show_bug.cgi?id=20134
- https://github.com/dlang/phobos/pull/7144
Thanks, Vladimir.
Nov 06 2021