
digitalmars.D - dmd foreach loops throw exceptions on invalid UTF sequences, use

reply Walter Bright <newshound2 digitalmars.com> writes:
https://issues.dlang.org/show_bug.cgi?id=22473

I've tried to fix this before, but too many people objected.

Are we fed up with this yet? I sure am.

Who wants to take up this cudgel and fix the durned thing once and for all?

(It's unclear if it would even break existing code.)
Nov 03 2021
next sibling parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 I've tried to fix this before, but too many people objected.
I proposed a few days ago that Phobos autodecoding, if not completely removed, do this exact same thing too. I agree it is a good idea. If you want an exception, it is easy enough to just check it in the loop and throw then. Let's do it.
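A minimal sketch of that check-and-throw pattern, assuming std.utf's byDchar substitutes U+FFFD for invalid sequences as documented:

```d
import std.utf : byDchar, replacementDchar, UTFException;

void process(string s)
{
    foreach (dchar c; s.byDchar)        // never throws, yields U+FFFD on bad UTF
    {
        if (c == replacementDchar)      // opt back in to the exception yourself
            throw new UTFException("invalid UTF-8 in input");
        // note: this also fires on a legitimate U+FFFD already in the input
        // ... use c ...
    }
}
```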
Nov 03 2021
parent Dukc <ajieskola gmail.com> writes:
On Thursday, 4 November 2021 at 02:34:54 UTC, Adam D Ruppe wrote:
 I agree it is a good idea. If you want an exception, it is easy 
 enough to just check it in the loop and throw then.

 Let's do it.
Plus, the present behaviour is inconsistent with the rest of the language's features. Implicit language-level conversions in D do not usually throw.
Nov 04 2021
prev sibling next sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473

 I've tried to fix this before, but too many people objected.

 Are we fed up with this yet? I sure am.

 Who wants to take up this cudgel and fix the durned thing once 
 and for all?

 (It's unclear if it would even break existing code.)
I still think this is a mistake. One may disagree about autodecoding; I for one think it's a sensible idea. However, a program should either process data correctly or, if that is impossible, not at all. It should not, ever, silently modify it "for you" while reading! I predict this will lead to cryptic, hair-pulling bugs in user code involving replacement characters appearing far downstream of the error site.
Nov 03 2021
parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Thursday, 4 November 2021 at 05:34:29 UTC, FeepingCreature 
wrote:
 One may disagree about autodecoding; I for one think it's a 
 sensible idea. However, a program should either process data 
 correctly or, if that is impossible, not at all. It should not, 
 ever, silently modify it "for you" while reading! I predict 
 this will lead to cryptic, hair-pulling bugs in user code 
 involving replacement characters appearing far downstream of 
 the error site.
(This is floating point NaN all over again!)
Nov 03 2021
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/3/2021 10:41 PM, FeepingCreature wrote:
 On Thursday, 4 November 2021 at 05:34:29 UTC, FeepingCreature wrote:
 One may disagree about autodecoding; I for one think it's a sensible idea. 
 However, a program should either process data correctly or, if that is 
 impossible, not at all. It should not, ever, silently modify it "for you" 
 while reading! I predict this will lead to cryptic, hair-pulling bugs in user 
 code involving replacement characters appearing far downstream of the error
site.
Surprisingly, the reverse seems to be true. Suppose you're writing a text editor. Then read a file with some bad UTF in it. The editor dies with an exception. You can't even edit the file to fix it.

If you need to display user provided text, like in a browser, or all sorts of tools, you don't want to die with an exception. What are you going to do in an exception handler? You're just going to replace the offending bytes with ReplacementChar and go render it anyway.
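A sketch of what that handler ends up looking like, assuming std.utf.decode and replacementDchar behave as documented; the catch just does the substitution the non-throwing path would have done anyway:

```d
import std.utf : decode, replacementDchar, UTFException;

dchar[] toRenderable(string s)
{
    dchar[] result;
    for (size_t i = 0; i < s.length; )
    {
        immutable start = i;
        dchar c;
        try
            c = decode(s, i);           // advances i; throws on invalid UTF
        catch (UTFException)
        {
            c = replacementDchar;       // ...which is what we'd render anyway
            i = start + 1;              // skip the offending byte
        }
        result ~= c;
    }
    return result;
}
```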
 (This is floating point NaN all over again!)
Poor NaNs are terribly misunderstood.

Suppose you have an array of sensors. One goes bad. The "bad" value is 0.0. So now your data analyzer is happily averaging 0.0 into the results, silently skewing them.

Now, if a NaN is returned instead, your "average" will be NaN. You know it's no good. It won't be hidden.

Uninitialized variables are sensors giving bad data. Having a NaN in your result is a *good* thing.
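A small illustration, assuming one bad sensor in the batch: 0.0 quietly skews the average, while NaN makes the damage visible.

```d
import std.algorithm.iteration : sum;
import std.math : isNaN;
import std.stdio : writeln;

void main()
{
    double[] readings = [20.1, 19.8, 0.0, 20.3];   // failed sensor reports 0.0
    writeln(readings.sum / readings.length);        // ~15.05, silently skewed

    readings[2] = double.nan;                       // failed sensor reports NaN
    immutable avg = readings.sum / readings.length;
    writeln(avg, " ", avg.isNaN);                   // nan true: the error is visible
}
```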
Nov 04 2021
next sibling parent reply Mathias LANG <geod24 gmail.com> writes:
On Friday, 5 November 2021 at 00:38:59 UTC, Walter Bright wrote:
 Surprisingly, the reverse seems to be true. Suppose you're 
 writing a text editor. Then read a file with some bad UTF in 
 it. The editor dies with an exception. You can't even edit the 
 file to fix it.

 If you need to display user provided text, like in a browser, 
 or all sorts of tools, you don't want to die with an exception. 
 What are you going to do in an exception handler? You're just 
 going to replace the offending bytes with ReplacementChar and 
 go render it anyway.
If you handle user input, you take it as `ubyte[]` and validate it. Any decent editor will try to detect the encoding instead of blindly assuming UTF-8.

If you want to fix it, just deprecate the special case and tell people to use `foreach (dchar d; someString.byUTF!(dchar, No.useReplacementDchar))` and voilà.

And if they don't want it to throw, it's shorter: `foreach (dchar d; someString.byUTF!dchar)` (or `byDChar`).
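A minimal sketch of that take-it-as-`ubyte[]`-and-validate approach, assuming std.utf.validate throws UTFException on bad input as documented:

```d
import std.utf : validate, UTFException;

string accept(immutable(ubyte)[] raw)
{
    auto s = cast(string) raw;   // reinterpret the bytes, no copy
    validate(s);                 // throws UTFException if not valid UTF-8
    return s;                    // past this point the string is known-good
}
```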
Nov 04 2021
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/4/2021 7:41 PM, Mathias LANG wrote:
 If you want to fix it, just deprecate the special case and tell people to use 
 `foreach (dchar d; someString.byUTF!(dchar, No.useReplacementDchar))` and
voilà.
 And if they don't want it to throw, it's shorter:
 `foreach (dchar d; someString.byUTF!dchar)` (or `byDChar`).
People will always gravitate towards the smaller, simpler syntax. Like [] instead of std::vector<>.
Nov 04 2021
parent reply max haughton <maxhaton gmail.com> writes:
On Friday, 5 November 2021 at 04:02:44 UTC, Walter Bright wrote:
 On 11/4/2021 7:41 PM, Mathias LANG wrote:
 If you want to fix it, just deprecate the special case and 
 tell people to use `foreach (dchar d; someString.byUTF!(dchar, 
 No.useReplacementDchar))` and voilà.
 And if they don't want it to throw, it's shorter:
 `foreach (dchar d; someString.byUTF!dchar)` (or `byDChar`).
People will always gravitate towards the smaller, simpler syntax. Like [] instead of std::vector<>.
I have never observed this mistake in any C++ code, unless you mean as a point of language design. This decision should be guided by how current D programmers act rather than a hyperreal ideal of someone encountering the language.
Nov 04 2021
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/4/2021 9:11 PM, max haughton wrote:
 On Friday, 5 November 2021 at 04:02:44 UTC, Walter Bright wrote:
 On 11/4/2021 7:41 PM, Mathias LANG wrote:
 If you want to fix it, just deprecate the special case and tell people to use 
 `foreach (dchar d; someString.byUTF!(dchar, No.useReplacementDchar))` and
voilà.
 And if they don't want it to throw, it's shorter:
 `foreach (dchar d; someString.byUTF!dchar)` (or `byDChar`).
People will always gravitate towards the smaller, simpler syntax. Like [] instead of std::vector<>.
I have never observed this mistake in any C++ code,
You've never observed people write: int array[3]; in C++ code?
 unless you mean as a point of language design.
D (still) has a rather verbose way of doing lambdas. People constantly complained that D didn't have lambdas. Until the => syntax was added, and suddenly lambdas in D became noticed and useful.
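A before/after sketch of that change, with the long-form function literal next to the => shorthand:

```d
unittest
{
    import std.algorithm.comparison : equal;
    import std.algorithm.iteration : map;

    auto a = [1, 2, 3].map!(function int(int x) { return x + 1; }); // verbose form
    auto b = [1, 2, 3].map!(x => x + 1);                            // => shorthand
    assert(equal(a, b));
}
```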
 This decision should be guided by how current D programmers act rather than a 
 hyperreal ideal of someone encountering the language.
The only reason D's associative arrays continue to exist is because they are so darned syntactically convenient. I've seen over and over and over that syntactic convenience matters a lot.
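A tiny example of that convenience, using the usual word-count idiom: no imports, no container type to pick.

```d
void main()
{
    int[string] counts;                       // built-in associative array
    foreach (word; ["the", "quick", "the"])
        counts[word]++;                       // missing keys start at int.init (0)
    assert(counts["the"] == 2 && counts["quick"] == 1);
}
```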
Nov 04 2021
next sibling parent Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Friday, 5 November 2021 at 06:15:44 UTC, Walter Bright wrote:
 On 11/4/2021 9:11 PM, max haughton wrote:
 On Friday, 5 November 2021 at 04:02:44 UTC, Walter Bright 
 wrote:
 On 11/4/2021 7:41 PM, Mathias LANG wrote:
 If you want to fix it, just deprecate the special case and 
 tell people to use `foreach (dchar d; 
 someString.byUTF!(dchar, No.useReplacementDchar))` and voilà.
 And if they don't want it to throw, it's shorter:
 `foreach (dchar d; someString.byUTF!dchar)` (or `byDChar`).
People will always gravitate towards the smaller, simpler syntax. Like [] instead of std::vector<>.
I have never observed this mistake in any C++ code,
You've never observed people write: int array[3]; in C++ code?
 unless you mean as a point of language design.
D (still) has a rather verbose way of doing lambdas. People constantly complained that D didn't have lambdas. Until the => syntax was added, and suddenly lambdas in D became noticed and useful.
 This decision should be guided by how current D programmers 
 act rather than a hyperreal ideal of someone encountering the 
 language.
The only reason D's associative arrays continue to exist is because they are so darned syntactically convenient. I've seen over and over and over that syntactic convenience matters a lot.
*The value of convenience should not be underestimated*

It's what enables productivity, which in my opinion should be *the* main metric of success. Everything else is just "fluff". In how many seconds can you transform idea A into program B? That is how you measure success imo.

It doesn't matter if you have a cool or super interesting way of achieving something; if person X is still trying to figure out how to do some cool thing while person Y is already done and focusing on the next thing, person X has lost. Because person Y can always optimize and refactor later (before the deadline), but person X can't because the deadline is already over.

*The value of convenience should not be underestimated*
Nov 05 2021
prev sibling parent reply max haughton <maxhaton gmail.com> writes:
On Friday, 5 November 2021 at 06:15:44 UTC, Walter Bright wrote:
 On 11/4/2021 9:11 PM, max haughton wrote:
 On Friday, 5 November 2021 at 04:02:44 UTC, Walter Bright 
 wrote:
 On 11/4/2021 7:41 PM, Mathias LANG wrote:
 If you want to fix it, just deprecate the special case and 
 tell people to use `foreach (dchar d; 
 someString.byUTF!(dchar, No.useReplacementDchar))` and voilà.
 And if they don't want it to throw, it's shorter:
 `foreach (dchar d; someString.byUTF!dchar)` (or `byDChar`).
People will always gravitate towards the smaller, simpler syntax. Like [] instead of std::vector<>.
I have never observed this mistake in any C++ code,
You've never observed people write: int array[3]; in C++ code?
 unless you mean as a point of language design.
D (still) has a rather verbose way of doing lambdas. People constantly complained that D didn't have lambdas. Until the => syntax was added, and suddenly lambdas in D became noticed and useful.
 This decision should be guided by how current D programmers 
 act rather than a hyperreal ideal of someone encountering the 
 language.
The only reason D's associative arrays continue to exist is because they are so darned syntactically convenient. I've seen over and over and over that syntactic convenience matters a lot.
I have never ever seen someone use a static array by mistake, is what I meant; vector doesn't do the same thing as []. It's more common in (so-called) modern C++ to see std::array these days than a raw static array in certain contexts, since you still want a constant-length buffer but want iterators etc.
Nov 05 2021
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/5/2021 5:38 AM, max haughton wrote:
 I have never ever seen someone use a static array by mistake, is what I meant,
I didn't mean by mistake. I mean using it as a matter of convenience.
 since you still  want a constant length buffer but want iterators etc..
This is why D has special support for turning arrays seamlessly into ranges. An early goal of D is to encourage use of [ ], rather than deprecate it.
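A minimal sketch of what that looks like in practice; slicing a static array with [] hands the algorithms a range:

```d
unittest
{
    import std.algorithm.comparison : equal;
    import std.algorithm.iteration : map;

    int[3] buf = [1, 2, 3];                             // fixed-length, no allocation
    assert(equal(buf[].map!(x => x * 2), [2, 4, 6]));   // buf[] is a range over it
}
```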
Nov 05 2021
prev sibling next sibling parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 11/5/21 5:38 AM, max haughton wrote:

 I have never ever seen someone use a static array by mistake
Related, although safe, vector::at is almost never used because the more convenient (but unsafe) vector.operator[] exists:

  v[42]     // What Ali saw in the wild
  v.at(42)  // What Ali did not see as much in the wild

Ali
Nov 05 2021
parent reply max haughton <maxhaton gmail.com> writes:
On Friday, 5 November 2021 at 23:01:24 UTC, Ali Çehreli wrote:
 On 11/5/21 5:38 AM, max haughton wrote:

 I have never ever seen someone use a static array by mistake
Related, although safe, vector::at is almost never used because the more convenient (but unsafe) vector.operator[] exists: v[42] // What Ali saw in the wild v.at(42) // What Ali did not see as much in the wild Ali
Although I understand what Walter is trying to say, he picked a poor example; this one does actually make sense. Although in the world of sanitizers and such it is not a hard thing to catch, bounds checking by default is a win.
Nov 05 2021
parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/5/2021 9:25 PM, max haughton wrote:
 Although I understand what Walter is trying to say, he picked a poor example, 
 this one does actually make sense. Although in the world of sanitizers and
such 
 it is not a hard thing to catch, bounds checking by default is a win.
Not sure what your point is, as D has bounds checking by default with [ ].
Nov 06 2021
prev sibling parent reply Atila Neves <atila.neves gmail.com> writes:
On Friday, 5 November 2021 at 12:38:36 UTC, max haughton wrote:
 On Friday, 5 November 2021 at 06:15:44 UTC, Walter Bright wrote:
 On 11/4/2021 9:11 PM, max haughton wrote:
 On Friday, 5 November 2021 at 04:02:44 UTC, Walter Bright 
 wrote:
 On 11/4/2021 7:41 PM, Mathias LANG wrote:
 If you want to fix it, just deprecate the special case and 
 tell people to use `foreach (dchar d; 
 someString.byUTF!(dchar, No.useReplacementDchar))` and 
 voilà.
 And if they don't want it to throw, it's shorter:
 `foreach (dchar d; someString.byUTF!dchar)` (or `byDChar`).
People will always gravitate towards the smaller, simpler syntax. Like [] instead of std::vector<>.
I have never observed this mistake in any C++ code,
You've never observed people write: int array[3]; in C++ code?
 unless you mean as a point of language design.
D (still) has a rather verbose way of doing lambdas. People constantly complained that D didn't have lambdas. Until the => syntax was added, and suddenly lambdas in D became noticed and useful.
 This decision should be guided by how current D programmers 
 act rather than a hyperreal ideal of someone encountering the 
 language.
The only reason D's associative arrays continue to exist is because they are so darned syntactically convenient. I've seen over and over and over that syntactic convenience matters a lot.
is what I meant, vector doesn't do the same thing as [].
Aside from not depending on GC-allocated memory, what does vector do that [] doesn't?
 It's more common in (so-called) modern C++ to see std::array 
 these days than a raw static array in certain contexts since 
 you still  want a constant length buffer but want iterators 
 etc..
  int src[10]{};
  int dst[10]{};
  transform(begin(src), end(src), begin(dst),
            [](int i) { return i + 1; });
  for(const auto i: dst) cout << i << " ";
  cout << endl;

But yes, std::array is an option that's better, but legacy code means C arrays have to be supported.
Nov 08 2021
parent reply max haughton <maxhaton gmail.com> writes:
On Monday, 8 November 2021 at 14:29:47 UTC, Atila Neves wrote:
 On Friday, 5 November 2021 at 12:38:36 UTC, max haughton wrote:
 On Friday, 5 November 2021 at 06:15:44 UTC, Walter Bright 
 wrote:
 On 11/4/2021 9:11 PM, max haughton wrote:
 On Friday, 5 November 2021 at 04:02:44 UTC, Walter Bright 
 wrote:
 [...]
I have never observed this mistake in any C++ code,
You've never observed people write: int array[3]; in C++ code?
 unless you mean as a point of language design.
D (still) has a rather verbose way of doing lambdas. People constantly complained that D didn't have lambdas. Until the => syntax was added, and suddenly lambdas in D became noticed and useful.
 This decision should be guided by how current D programmers 
 act rather than a hyperreal ideal of someone encountering 
 the language.
The only reason D's associative arrays continue to exist is because they are so darned syntactically convenient. I've seen over and over and over that syntactic convenience matters a lot.
is what I meant, vector doesn't do the same thing as [].
Aside from not depending on GC-allocated memory, what does vector do that [] doesn't?
 It's more common in (so-called) modern C++ to see std::array 
 these days than a raw static array in certain contexts since 
 you still  want a constant length buffer but want iterators 
 etc..
int src[10]{}; int dst[10]{}; transform(begin(src), end(src), begin(dst), [](int i) { return i + 1; }); for(const auto i: dst) cout << i << " "; cout << endl; But yes, std::array is an option that's better, but legacy code means C arrays have to be supported.
In my post I was referring to a C-style array (in C++) rather than a D slice, to be clear. It's entirely possible Walter originally meant a slice, but the point about following the syntactic path of least resistance seems to be referring to a [] in C++ rather than a slice. I.e. I was intending to get across that I've never seen someone make this mistake in practice (either using a mere [] to pass data around, or using a vector in place of a static array / vice versa).
Nov 08 2021
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Monday, 8 November 2021 at 22:12:15 UTC, max haughton wrote:
 In my post I was referring to a C style array (in C++) rather 
 than a D slice, to be clear. It's entirely possible Walter 
 originally meant a slice, but the point about following the 
 syntactic path of least resistance seem to be referring to a [] 
 in C++ rather than a slice i.e. I was intending to get across 
 that I've never seen someone making this mistake in practice 
 (either using a mere [] to pass data around, or using a vector 
 in place of a static array / vice versa )
Could happen in C. It does not happen in C++; you use std::span for passing around data.
Nov 08 2021
prev sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Friday, 5 November 2021 at 00:38:59 UTC, Walter Bright wrote:
 On 11/3/2021 10:41 PM, FeepingCreature wrote:
 On Thursday, 4 November 2021 at 05:34:29 UTC, FeepingCreature 
 wrote:
 One may disagree about autodecoding; I for one think it's a 
 sensible idea. However, a program should either process data 
 correctly or, if that is impossible, not at all. It should 
 not, ever, silently modify it "for you" while reading! I 
 predict this will lead to cryptic, hair-pulling bugs in user 
 code involving replacement characters appearing far 
 downstream of the error site.
Surprisingly, the reverse seems to be true. Suppose you're writing a text editor. Then read a file with some bad UTF in it. The editor dies with an exception. You can't even edit the file to fix it. If you need to display user provided text, like in a browser, or all sorts of tools, you don't want to die with an exception. What are you going to do in an exception handler? You're just going to replace the offending bytes with ReplacementChar and go render it anyway.
 (This is floating point NaN all over again!)
Poor NaNs are terribly misunderstood. Suppose you have an array of sensors. One goes bad. The "bad" value is 0.0. So now your data analyzer is happily averaging 0.0 into the results, silently skewing them. Now, if a NaN is returned instead, your "average" will be NaN. You know it's no good. It won't be hidden. Uninitialized variables are sensors giving bad data. Having a NaN in your result is a *good* thing.
I think the program should crash in all these cases. The text editor should crash. The browser should crash. The analyzer should see a NaN, and crash.

These programs are *wrong.* They thought they could only get Unicode and they've gotten non-Unicode. So we know they're written on wrong assumptions; why do we want to continue running code we know is untrustworthy? Let them crash, let them be fixed to make fewer assumptions.

Automagically handling errors by propagating them in an inert form robs the developers and users of a chance to avoid a mistake. It's no better than 0.0.
Nov 04 2021
next sibling parent Paolo Invernizzi <paolo.invernizzi gmail.com> writes:
On Friday, 5 November 2021 at 06:30:02 UTC, FeepingCreature wrote:
 On Friday, 5 November 2021 at 00:38:59 UTC, Walter Bright wrote:
 [...]
I think the program should crash in all these cases. The text editor should crash. The browser should crash. The analyzer should see a NaN, and crash. These programs are *wrong.* They thought they could only get Unicode and they've gotten non-Unicode. So we know they're written on wrong assumptions; why do we want to continue running code we know is untrustworthy? Let them crash, let them be fixed to make fewer assumptions. Automagically handling errors by propagating them in an inert form robs the developers and users of a chance to avoid a mistake. It's no better than 0.0.
+1000
Nov 05 2021
prev sibling next sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 5 November 2021 at 06:30:02 UTC, FeepingCreature wrote:
 I think the program should crash in all these cases. The text 
 editor should crash. The browser should crash. The analyzer 
 should see a NaN, and crash.
No, NaN is completely different. You have two types of NaN: one is for signalling that data is missing in a dataset (received from the outside); the other is to convey that a computation failed (often caused by roundoff errors). Removing NaN from floating point is unworkable in the general case.
Nov 05 2021
parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Friday, 5 November 2021 at 10:08:30 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 5 November 2021 at 06:30:02 UTC, FeepingCreature 
 wrote:
 I think the program should crash in all these cases. The text 
 editor should crash. The browser should crash. The analyzer 
 should see a NaN, and crash.
No, NaN is completely different. You have two types of NaN, one is for signalling that data is missing in a dataset (received from the outside). The other is to convey that a computation failed (often caused by roundoff errors). To remove NaN from floating point is unworkable in the general case.
When I have to do numeric work and suspect NaNs in play, I like to `feenableexcept(FE_INVALID)`. Then every time a NaN arises in a computation, I get a nice SIGFPE.
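The rough D-level equivalent, assuming std.math's FloatingPointControl and its invalidException mask behave as documented on the target CPU:

```d
void main()
{
    import std.math : FloatingPointControl;

    FloatingPointControl ctrl;                   // restores the previous masks on scope exit
    if (FloatingPointControl.hasExceptionTraps)
        ctrl.enableExceptions(FloatingPointControl.invalidException);

    double zero = 0.0;
    auto oops = zero / zero;                     // 0/0 is invalid; with the trap enabled, SIGFPE
}
```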
Nov 05 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 5 November 2021 at 11:44:42 UTC, FeepingCreature wrote:
 When I have to do numeric work and suspect NaNs in play, I like 
 to `feenableexcept(FE_INVALID)`. Then every time a NaN arises 
 in a computation, I get a nice SIGFPE.
Yes, and the IEEE spec suggests that one should be able to choose whether to get exceptions or compute with NaNs based on the nature of the application/computation. Regardless, as long as hardware follows IEEE and supports using NaN in calculations, you are better off playing up to the IEEE standard (for a modern system-level language that means you should have easy access to both approaches).
Nov 05 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 5 November 2021 at 11:54:21 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 5 November 2021 at 11:44:42 UTC, FeepingCreature 
 wrote:
 When I have to do numeric work and suspect NaNs in play, I 
 like to `feenableexcept(FE_INVALID)`. Then every time a NaN 
 arises in a computation, I get a nice SIGFPE.
Yes, and the IEEE spec suggests that ones should be able to choose whether you get exceptions or compute with NaNs based on the nature of the application/computation. Regardless, as long as hardware follow IEEE and supports using NaN in calculations, you are better off playing up to the IEEE standard (for a modern system level language that means you should have easy access to both approaches).
To put some meat on this: the ideal is that you can have two implementations of the same computation, one fast and one robust. So ideally you should be able to do the computations with NaNs in expressions where the NaNs can disappear, and use exceptions where they cannot disappear. If an exception occurs, you fall back to the slower, robust implementation. In reality you have to weigh in the performance characteristics of the hardware, so… this is very much system-level programming and not only a choice that can be made at the language level.

For instance, in raytracing I would want NaNs. Then I can make a choice based on neighbouring pixels whether I want to compute the pixel again using a slower method or simply fill it in with the average of the neighbours (if all the neighbours have roughly the same colour).
Nov 05 2021
parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Friday, 5 November 2021 at 12:03:24 UTC, Ola Fosheim Grøstad 
wrote:
 For instance in raytracing I would want NaNs. Then I can make a 
 choice based on neighbouring pixels whether I want to compute 
 it again using a slower method or simply fill it in with the 
 average of the neighbours (if all the neighbours have roughly 
 the same colour).
I can't imagine wanting NaNs in raytracing. Just the idea of an FPU slowpath-provoking NaN making its way into my nice wide SSE vectors gives me hives. Any sensible raytracing routine should just never produce a NaN to begin with. (For denormals there's FTZ/DAZ, at least.)
Nov 05 2021
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 5 November 2021 at 12:13:17 UTC, FeepingCreature wrote:
 (For denormals there's FTZ/DAZ, at least.)
Not IEEE?
Nov 05 2021
prev sibling next sibling parent reply norm <norm.rowtree gmail.com> writes:
On Friday, 5 November 2021 at 06:30:02 UTC, FeepingCreature wrote:
 On Friday, 5 November 2021 at 00:38:59 UTC, Walter Bright wrote:
 [...]
I think the program should crash in all these cases. The text editor should crash. The browser should crash. The analyzer should see a NaN, and crash. These programs are *wrong.* They thought they could only get Unicode and they've gotten non-Unicode. So we know they're written on wrong assumptions; why do we want to continue running code we know is untrustworthy? Let them crash, let them be fixed to make fewer assumptions. Automagically handling errors by propagating them in an inert form robs the developers and users of a chance to avoid a mistake. It's no better than 0.0.
It isn't always that simple. E.g. when working on medical devices, crashing isn't an option when it comes to how we're going to deal with bad data.
Nov 05 2021
next sibling parent reply Dennis <dkorpel gmail.com> writes:
On Friday, 5 November 2021 at 10:09:40 UTC, norm wrote:
 It isn't always that simple, e.g. working on medical devices 
 crashing isn't an option when it comes to how we're going to 
 deal with bad data.
Oh no, let's not go there again. See this 44-page discussion: [Program logic bugs vs input/environmental errors](https://forum.dlang.org/post/m07gf1$18jl$1 digitalmars.com)
Nov 05 2021
parent Paolo Invernizzi <paolo.invernizzi gmail.com> writes:
On Friday, 5 November 2021 at 10:27:05 UTC, Dennis wrote:
 On Friday, 5 November 2021 at 10:09:40 UTC, norm wrote:
 It isn't always that simple, e.g. working on medical devices 
 crashing isn't an option when it comes to how we're going to 
 deal with bad data.
Oh no, let's not go there again. See this 44-page discussion: [Program logic bugs vs input/environmental errors](https://forum.dlang.org/post/m07gf1$18jl$1 digitalmars.com)
Ehehe, the old good times :-P
Nov 05 2021
prev sibling next sibling parent Dukc <ajieskola gmail.com> writes:
On Friday, 5 November 2021 at 10:09:40 UTC, norm wrote:
 It isn't always that simple, e.g. working on medical devices 
 crashing isn't an option when it comes to how we're going to 
 deal with bad data.
You can always validate the UTF beforehand if you don't want to crash.
Nov 05 2021
prev sibling parent Abdulhaq <alynch4047 gmail.com> writes:
On Friday, 5 November 2021 at 10:09:40 UTC, norm wrote:
 It isn't always that simple, e.g. working on medical devices 
 crashing isn't an option when it comes to how we're going to 
 deal with bad data.
Mm, I have a totally different take on this. In my view all incoming data should be sanitised on entry into the application; this takes place at what I think of as leaf nodes in the application. This sanitisation includes conversion of all measurements into standard units, checking validity of strings etc. Once data has entered the main application then the application should **fail fast**. This is **especially important** for medical devices. This allows the developers of the application to see, early in development, problems with their code and the logic thereof.

Signs of developers ignoring the fail fast principle include a disease I've identified where ```if (x is null)``` is seen to start proliferating through the code. This happens when you are calling a function that you did not write and one day you find it has returned null and you don't know why. So you add an ```if (x is null) return null``` to your code and carry on. This allows the program to stagger on in the face of being in a state that is not understood by the developer.

If I am on a ventilator and the program enters a state that the programmer did not anticipate, then life can start to get very uncomfortable for me. I would far prefer that it stopped, coughed up an error code, and the medical staff can unplug it and (quickly, I hope) replace it with another one.

If there is actually a scenario where staggering on is considered better, then at the very least it should be under instruction from the programmer. The idea of the language runtime silently modifying application data is somewhat frightening for me in this scenario.
Nov 07 2021
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/4/2021 11:30 PM, FeepingCreature wrote:
 These programs are *wrong.* They thought they could only get Unicode and
they've 
 gotten non-Unicode. So we know they're written on wrong assumptions; why do we 
 want to continue running code we know is untrustworthy? Let them crash, let
them 
 be fixed to make fewer assumptions. Automagically handling errors by
propagating 
 them in an inert form robs the developers and users of a chance to avoid a 
 mistake. It's no better than 0.0.
It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value. NaN and ReplacementChar are not valid and are easily distinguished.
Nov 06 2021
next sibling parent reply kdevel <kdevel vogtner.de> writes:
On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
 On 11/4/2021 11:30 PM, FeepingCreature wrote:
 [...] Let them crash, [...] Automagically handling errors by 
 propagating them in an inert form robs the developers and 
 users of a chance to avoid a mistake. It's no better than 0.0.
It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value.
Technically it makes no difference whether you fail to check for 0.0 or fail to check for NaN. What makes a difference is using "out of band signalling" (exceptions) whose default behavior is process termination.
 NaN and ReplacementChar are not valid
The replacement character '�' is a valid Unicode codepoint (U+FFFD).
 and are easily distinguished.
Someone may forget to write explicit code to handle this case, which most likely leads to data corruption. I choose a stack trace over potentially corrupted data.
Nov 07 2021
next sibling parent reply Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Sunday, 7 November 2021 at 16:28:33 UTC, kdevel wrote:
 On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
 On 11/4/2021 11:30 PM, FeepingCreature wrote:
 [...]
It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value.
Technically it makes no difference if you do not check for 0.0 or not for NaN. What makes a difference is using "out of band signalling" (exceptions) if its default behavior is process termination.
 NaN and ReplacementChar are not valid
The replacement character '�' is a valid Unicode codepoint (U+FFFD).
 and are easily distinguished.
Someone may forget to write explicit code to handle this case which most likely leads to data corruption. I choose stack trace over potentially corrupted data.
https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
Nov 07 2021
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/7/2021 8:46 AM, Imperatorn wrote:
 https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
The money quote: "By default, each object uses replacement fallback to handle strings that it cannot encode and bytes that it cannot decode, but you can specify that an exception should be thrown instead. For more information, see Replacement fallback and Exception fallback."
Nov 07 2021
parent Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Sunday, 7 November 2021 at 23:29:39 UTC, Walter Bright wrote:
 On 11/7/2021 8:46 AM, Imperatorn wrote:
 https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
The money quote: "By default, each object uses replacement fallback to handle strings that it cannot encode and bytes that it cannot decode, but you can specify that an exception should be thrown instead. For more information, see Replacement fallback and Exception fallback."
💲💲💲
Nov 07 2021
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/7/2021 8:28 AM, kdevel wrote:
 Technically it makes no difference if you do not check for 0.0 or not for NaN.
Yes, it does. 0.0 is not distinguishable from valid data. NaN is.
Nov 07 2021
prev sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
 It's much better than 0.0. 0.0 is indistinguishable from valid 
 data, and is a very common valid value.

 NaN and ReplacementChar are not valid and are easily 
 distinguished.
No, that's exactly the problem. ReplacementChar is not easily distinguished, because it's a valid Unicode character - that's the whole point of it. So just like NaN, it can propagate arbitrarily far through your processing pipeline before some downstream process decides that it actually doesn't like it. And at that point you generally have no chance to recover the source of the issue - you know that something maybe has gone wrong, but you don't even know if it was in your process or in the input data. After all, if you were screening your input data for ReplacementChar, you could as easily have been screening it for invalid UTF-8 to begin with.

So while yes it's marginally better than 0.0, because at least you know that *something* is wrong, it does as little as possible to help you locate the problem while technically informing you. And all the workarounds for that take the form of "throw everywhere where a ReplacementChar could be generated." So imo just do the equivalent of turning on FE_INVALID, and do that to begin with. There's no point in getting rid of throw sites when you just force the user to re-add them manually, because they fulfill a genuine need.

IMO if you want to get rid of the exception overhead, I'd go the other way and make invalid Unicode an abort(). Check your input data, people.
Nov 08 2021
parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 8 November 2021 at 08:11:12 UTC, FeepingCreature wrote:
 On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
 It's much better than 0.0. 0.0 is indistinguishable from valid 
 data, and is a very common valid value.

 NaN and ReplacementChar are not valid and are easily 
 distinguished.
No, that's exactly the problem. ReplacementChar is not easily distinguished, because it's a valid Unicode character - that's the whole point of it. So just like nan, it can propagate arbitrarily far through your processing pipeline before some downstream process decides that it actually doesn't like it.
Sorry, let me expand on this because I think it's the very core of the disagreement.

I feel you have two options with NaN/ReplacementChar. You can either just accept that this is what you get, and let it propagate throughout your entire pipeline. In that case it's no better than 0.0 - actually, NaN would be *worse*, because your process would be completely broken with no way to fix it, whereas at least with 0.0 you can maybe get some reasonably-usable data out.

Or you can say that "we don't want to be generating NaN/ReplacementChar." Then where do you draw the line? At the process input/output boundary? But then the process needs to be fixed if it generates nans/fffds. So you want to move your signaling as close to the production site as possible. Preferably, you want to fail at the exact line that the problematic data was produced. So we're back at exceptions in foreach. (Actually, an exception in cast(string) would be the best.)

And that's why I think ReplacementChar/NaN are no better than 0.0. You either embrace them fully as "valid" data, or you handle them at the site of origin; any compromise just makes you worse off than either extreme.
Nov 08 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Monday, 8 November 2021 at 08:18:51 UTC, FeepingCreature wrote:
 (Actually, an exception in cast(string) would be the best.)
D should distinguish more clearly between strong and weak casting at the language level.

UTF-8 is now so dominant that D really should reconsider the string type and make it required to be valid UTF-8 (like Python 3 did). C++ has even introduced a new character type to signify UTF-8; I use it all the time.
 And that's why I think ReplacementChar/NaN are no better than 
 0.0. You either embrace them fully as "valid" data, or you 
 handle them at the site of origin; any compromise just makes 
 you worse off than either extreme.
It is very difficult to follow your line of reasoning, because ReplacementChar is nothing like qNaN; it is more like sNaN. ReplacementChar is not the result of an approximation failure, it is corruption of the input (or maybe a foreign encoding).

Getting a 0.0 instead of qNaN in a signal is absolutely disastrous. Walter is 100% right on that one. 0.0 will introduce a peak across the frequency range. qNaN can be removed with no distortion.

Should you express your types strongly? Yes, but then you also should include things like negative numbers, denormal numbers, ±infinity, ranges [1.0-0.0] and so on.
Nov 08 2021
next sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 8 November 2021 at 12:02:12 UTC, Ola Fosheim Grøstad 
wrote:
 It is very difficult to follow your line of reasoning, because 
 ReplacementChar is nothing like qNaN, it is more like sNaN. 
 ReplacementChar is not the result of an approximation failure, 
 it is corruption of the input (or maybe a foreign encoding).

 Getting a 0.0 instead of qNaN in a signal is absolutely 
 disastrous. Walter is 100% right on that one. 0.0 will 
 introduce a peak across the frequency range. qNan can be 
 removed with no distortion.

 Should you express your types strongly? Yes, but then you also 
 should include things like negative numbers, denormal numbers, 
 ±infity, ranges [1.0-0.0] and so on.
Yeah I noticed this after I clicked post, but I didn't want to add a third comment.

I think the difference is fundamentally one of "time-series vs progressive data". I don't think that's the right word, but I don't know a better one. Like, if you have a measuring series of values interspersed with NaNs, you can know for instance that the values are assigned to times, or to positions, and then you can semantically decide what to do with the data. For instance you may mark the NaNs with an error, or drop them and interpolate.

However, it is much harder to see where such a behavior would be useful for ReplacementCharacter. Generally, you're reading data that someone wrote for a reason, and ReplacementCharacter would almost universally indicate that there was something you were meant to pick up on but failed to handle. As such, it's much less clear to me whether there are even cases where "text with replacement characters" or "text with replacement characters removed" is useful.
Nov 08 2021
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Monday, 8 November 2021 at 12:32:08 UTC, FeepingCreature wrote:
 Generally, you're reading data that someone wrote for a reason, 
 and ReplacementCharacter would almost universally indicate that 
 there was something you were meant to pick up on but failed to 
 handle. As such, it's much less clear to me whether there even 
 are cases where "text with replacement characters" or "text 
 with replacement characters removed" is even useful.
It could mean that someone did a cut'n'paste of text from a more recent version of the Unicode standard. ReplacementCharacter makes it possible for you to use the input regardless (replacing it with a question mark in a square or something).

I think this is an application-level feature, not a language-level feature, so it doesn't make sense for the language to do this IMO. That we can agree on. (D is not a scripting language.)
Nov 08 2021
prev sibling parent kdevel <kdevel vogtner.de> writes:
On Monday, 8 November 2021 at 12:02:12 UTC, Ola Fosheim Grøstad 
wrote:
[...]
 ReplacementChar is not the result of an approximation failure, 
 it is corruption of the input (or maybe a foreign encoding).
As in this line I can write down the replacement character '�' since it is a valid Unicode codepoint (U+FFFD). It even round-trips correctly.

I think the iconv library [1] has a nice approach: it stops the conversion, among other things, if it encounters an invalid input sequence.

The ideal conversion, without throwing or using the replacement character, is IMHO generating a list of pairs of ranges, named "left" and "right". Left contains successfully parsed data, right invalid data. For valid UTF-8 input this list has only one element: the left element of this pair contains the conversion and the right is empty. From this representation one can easily compute all required presentations.

[1] https://man7.org/linux/man-pages/man3/iconv.3.html
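A rough sketch of that left/right pairing, using std.encoding.validLength (assumed here to return the length of the longest valid prefix) to split the input:

```d
import std.encoding : validLength;
import std.string : representation;

struct Chunk
{
    string left;                 // longest valid UTF-8 run
    immutable(ubyte)[] right;    // the invalid bytes that follow it (may be empty)
}

Chunk[] splitValid(string s)
{
    Chunk[] result;
    while (s.length)
    {
        immutable n = s.validLength;          // valid prefix, in code units
        auto left = s[0 .. n];
        s = s[n .. $];

        size_t bad = 0;                        // gather bytes until decoding can resume
        while (bad < s.length && s[bad .. $].validLength == 0)
            ++bad;

        result ~= Chunk(left, s[0 .. bad].representation);
        s = s[bad .. $];
    }
    return result;
}
```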
Nov 08 2021
prev sibling parent User <user blah.com> writes:
 (This is floating point NaN all over again!)
Did you try the Pony language? It's so user friendly, it even allows division by zero. https://www.reddit.com/r/programming/comments/7al9s2/pony_a_programming_language_that_allows_dividing/
Nov 05 2021
prev sibling next sibling parent reply Elronnd <elronnd elronnd.net> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473

 I've tried to fix this before, but too many people objected.

 Are we fed up with this yet? I sure am.

 Who wants to take up this cudgel and fix the durned thing once 
 and for all?

 (It's unclear if it would even break existing code.)
Assuming the comment by Ali on the linked bug is right, I think the current behaviour is correct. Your complaints:
 It can't be turned off
Sure it can. You can choose to iterate in another fashion; say, by creating your own iterator which folds invalid utf8 into replacement characters.
 it throws
Is it better to produce an incorrect result? A high-quality, non-throwing mechanism for error handling exists. It consists of an _optional_ value which must be explicitly unwrapped. It is also an out-of-band signal; how will I distinguish invalid utf8 from a correctly-encoded replacement character?
 it may allocate with the gc
So? If that is the sort of thing you care about, then you will use @nogc and find an alternate solution. Lots of core language features allocate, like arrays and hash tables.
 it's slow
In the hot path it's the same speed. In the slow path, performance doesn't matter. In any case, it's useless to give an incorrect result faster.

(Notably, this is not exactly _auto_ decoding; it is explicitly requested decoding. And your proposed modification doesn't change that fact.)

What is (potentially) questionable imo is that given foreach (c; a), c will be inferred to be dchar; you have to explicitly ask for char. Perhaps that default should be reversed. (This will definitely break code, though, and may not be worth it.)

If you want an iterator that generates replacement characters for invalid utf8, just create one. But the default translation should be faithful, and that means not generating any result if none can be generated.
Nov 04 2021
next sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 04/11/2021 8:51 PM, Elronnd wrote:
 What is (potentially) questionable imo is that given foreach (c; a), c 
 will be inferred to be dchar; you have to explicitly ask for char.  
 Perhaps that default should be reversed.  (This will definitely break 
 code, though, and may not be worth it.)
I think this is the right answer. Fix the default. Fewer surprises, fewer headaches, everyone is happy.
Nov 04 2021
parent rikki cattermole <rikki cattermole.co.nz> writes:
On 05/11/2021 12:59 AM, rikki cattermole wrote:
 
 On 04/11/2021 8:51 PM, Elronnd wrote:
 What is (potentially) questionable imo is that given foreach (c; a), c 
 will be inferred to be dchar; you have to explicitly ask for char. 
 Perhaps that default should be reversed.  (This will definitely break 
 code, though, and may not be worth it.)
I think this is the right answer. Fix the default. Less surprises, less head aches, everyone is happy.
Correction: the default is correct, I checked.
Nov 04 2021
prev sibling next sibling parent Adam D Ruppe <destructionator gmail.com> writes:
On Thursday, 4 November 2021 at 07:51:11 UTC, Elronnd wrote:
 What is (potentially) questionable imo is that given foreach 
 (c; a), c will be inferred to be dchar; you have to explicitly 
 ask for char.  Perhaps that default should be reversed.  (This 
 will definitely break code, though, and may not be worth it.)
That's not true. It will always be the type of the thing:

void main() {
    foreach(a; "test")
        pragma(msg, typeof(a)); // immutable(char) NOT dchar
}
Nov 04 2021
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/4/2021 12:51 AM, Elronnd wrote:
 In the hot path it's the same speed.
C++ sold everyone the myth that exceptions not thrown are zero cost. This has been thoroughly debunked, though the myth persists :-(
Nov 04 2021
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.com> writes:
On 2021-11-04 20:40, Walter Bright wrote:
 On 11/4/2021 12:51 AM, Elronnd wrote:
 In the hot path it's the same speed.
C++ sold everyone the myth that exceptions not thrown are zero cost. This has been thoroughly debunked, though the myth persists :-(
I've been doing a fair amount of benchmarking for https://amazon.com/Embracing-Modern-Safely-John-Lakos/dp/0137380356 and surprisingly enough the myth holds true in most cases.
Nov 05 2021
next sibling parent deadalnix <deadalnix gmail.com> writes:
On Friday, 5 November 2021 at 13:25:06 UTC, Andrei Alexandrescu 
wrote:
 On 2021-11-04 20:40, Walter Bright wrote:
 On 11/4/2021 12:51 AM, Elronnd wrote:
 In the hot path it's the same speed.
C++ sold everyone the myth that exceptions not thrown are zero cost. This has been thoroughly debunked, though the myth persists :-(
I've been doing a fair amount of benchmarking for https://amazon.com/Embracing-Modern-Safely-John-Lakos/dp/0137380356 and surprisingly enough the myth holds true in most cases.
It really depends on the exact specification of the myth. You get executables that are bigger by about 20%, and some constructs such as ref-counted smart pointers become harder to optimize, but indeed, the runtime cost when you don't throw isn't remotely as high as people seem to think.
Nov 05 2021
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/5/2021 6:25 AM, Andrei Alexandrescu wrote:
 I've been doing a fair amount of benchmarking for 
 https://amazon.com/Embracing-Modern-Safely-John-Lakos/dp/0137380356 and 
 surprisingly enough the myth holds true in most cases.
All the compilers I know of abandon many optimizations in the presence of unwind blocks. For example, register allocation of variables is not done across unwind blocks. This is because the unwinder does not restore register contents. A further problem is data flow analysis becomes largely ineffective because any operation that may throw (such as a function call to a throwing function) produces an edge from there to the catch block.
Nov 05 2021
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Friday, 5 November 2021 at 20:41:34 UTC, Walter Bright wrote:
 On 11/5/2021 6:25 AM, Andrei Alexandrescu wrote:
 I've been doing a fair amount of benchmarking for 
 https://amazon.com/Embracing-Modern-Safely-John-Lakos/dp/0137380356 and
surprisingly enough the myth holds true in most cases.
All the compilers I know of abandon many optimizations in the presence of unwind blocks. For example, register allocation of variables is not done across unwind blocks. This is because the unwinder does not restore register contents. A further problem is data flow analysis becomes largely ineffective because any operation that may throw (such as a function call to a throwing function) produces an edge from there to the catch block.
I have not checked for GCC, but modern versions of LLVM are pretty good at optimizing in the presence of landing pads. Not so good at optimizing the landing pads themselves, but hey, if you get there often, something has gone horribly wrong and optimization is the least of your concerns. While I have not checked GCC, I'm fairly confident it does a good job.

That being said, on Windows it's another can of worms, because their exception ABI is some special level of crazy.
Nov 05 2021
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/5/2021 2:43 PM, deadalnix wrote:
 I have not checked for GCC, but modern version of LLVM are pretty good at 
 optimizing in the presence of landing pads.
I saw a presentation by Chandler Carruth at CppCon three years back or so where he said that LLVM abandoned much of the optimizations in the presence of unwind blocks.

Optimizations will do better, of course, if your tight loops don't call functions that might throw.

You'll also lose simply because the extra bulk of the EH code will push more of your hot code out of the cache.
Nov 05 2021
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.com> writes:
On 2021-11-05 20:03, Walter Bright wrote:
 On 11/5/2021 2:43 PM, deadalnix wrote:
 I have not checked for GCC, but modern version of LLVM are pretty good 
 at optimizing in the presence of landing pads.
I saw a presentation by Chandler Carruth at CPPCON 3 years back or so where he said that LLVM abandoned much of the optimizations in the presence of rewind blocks.
Three years is a long time in this industry.
 Optimizations will do better, of course, if your tight loops don't call 
 functions that might throw.
 
 You'll also lose simply because the extra bulk of the EH code will push 
 more of your hot code out of the cache.
Turns out the EH code is very well separated. Gcc goes so far as to generate two separate functions, one for hot and one for cold. Clang also does a good job separating the paths.

It happens in our metier that good judgment becomes prejudice. It seems that's what's happening with "exceptions are expensive" right now.
Nov 05 2021
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/5/2021 5:40 PM, Andrei Alexandrescu wrote:
 Turns out the EH code is very well separated. Gcc goes so far as to generate
two 
 separate functions, one for hot and one for cold. Clang also does a good job 
 separating the paths.
How does one decide in advance to call the non-throwing function?
 It happens in our metier that good judgment becomes prejudice. It seems that's 
 what happening with "exceptions are expensive" right now.
I remain skeptical. My playing with gcc shows it moves the unwind blocks past the end of the function, which keeps them somewhat out of the hot path. Doesn't fix the register allocation problem, though. BTW, dmd also moves the unwind blocks past the end.
Nov 05 2021
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Saturday, 6 November 2021 at 00:40:41 UTC, Andrei Alexandrescu 
wrote:
 Turns out the EH code is very well separated. Gcc goes so far 
 as to generate two separate functions, one for hot and one for 
 cold. Clang also does a good job separating the paths.
You bet, I wrote the code that separates the two :)
 It happens in our metier that good judgment becomes prejudice. 
 It seems that's what happening with "exceptions are expensive" 
 right now.
It is on Windows, due to the whole funclet business, and it is in some specific conditions (for instance if icache pressure is the bottleneck), but in most cases the impact is fairly minimal beyond binary size.
Nov 05 2021
parent deadalnix <deadalnix gmail.com> writes:
On Saturday, 6 November 2021 at 01:55:47 UTC, deadalnix wrote:
 On Saturday, 6 November 2021 at 00:40:41 UTC, Andrei 
 Alexandrescu wrote:
 Turns out the EH code is very well separated. Gcc goes so far 
 as to generate two separate functions, one for hot and one for 
 cold. Clang also does a good job separating the paths.
You bet, I wrote the code that separates the two :)
To expand on that, I also wrote code that sends all the exception handling code to a cold section of the executable (and, if PGO is enabled, also really cold code paths). The impact on benchmarks was fairly minimal, so this ended up not being merged.
Nov 05 2021
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.com> writes:
On 2021-11-05 16:41, Walter Bright wrote:
 On 11/5/2021 6:25 AM, Andrei Alexandrescu wrote:
 I've been doing a fair amount of benchmarking for 
 https://amazon.com/Embracing-Modern-Safely-John-Lakos/dp/0137380356 
 and surprisingly enough the myth holds true in most cases.
All the compilers I know of abandon many optimizations in the presence of unwind blocks. For example, register allocation of variables is not done across unwind blocks. This is because the unwinder does not restore register contents. A further problem is data flow analysis becomes largely ineffective because any operation that may throw (such as a function call to a throwing function) produces an edge from there to the catch block.
I know the story. It is aging. I'm telling the facts. It turns out that modern compilers have made a lot of progress in the area.
Nov 05 2021
prev sibling next sibling parent reply Elronnd <elronnd elronnd.net> writes:
Part of the problem, as mentioned, is that this throws away 
information, because text may legitimately contain replacement 
characters.  (And this makes the 'check if replacement char and 
throw yourself' approach a non-starter).  But there are lossless 
encodings.  I think if we are really going to go this route, we 
should use something like raku's utf8-c8 
(https://docs.raku.org/language/unicode#UTF8-C8).
Nov 04 2021
parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/4/2021 12:55 AM, Elronnd wrote:
 Part of the problem, as mentioned, is that this throws away information,
because 
 text may legitimately contain replacement characters.  (And this makes the 
 'check if replacement char and throw yourself' approach a non-starter).  But 
 there are lossless encodings.  I think if we are really going to go this
route, 
 we should use something like raku's utf8-c8 
 (https://docs.raku.org/language/unicode#UTF8-C8).
There's only one replacement character, and this use is officially what it is for. If you're using it for other porpoises, you've got a whale of a problem.
Nov 04 2021
prev sibling next sibling parent reply zjh <fqbqrr 163.com> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
`string`, as a language-level type, should not be encoded at all; it should just be raw (8-bit) bytes. The standard library should implement the required 'encoded string' types. That way, people who need various "encoded strings" can import them from the standard library. It's terrible that you can't write a D program just because the text isn't UTF-8.
Nov 04 2021
parent reply zjh <fqbqrr 163.com> writes:
On Thursday, 4 November 2021 at 08:24:59 UTC, zjh wrote:

The `fundamental` problem is that we should provide users with 
`options` at compile time, not `choose` for users.
If you `choose` for users, there will always be dissatisfaction.
You provide options, and users choose according to their needs.

`auto decoding` and `utf8 string encoding` are both like this. If 
you choose for users, some people will always be unhappy.
Nov 06 2021
parent reply jfondren <julian.fondren gmail.com> writes:
On Sunday, 7 November 2021 at 01:12:19 UTC, zjh wrote:
 On Thursday, 4 November 2021 at 08:24:59 UTC, zjh wrote:

 The `fundamental` problem is that we should provide users with 
 `options` at compile time, not `choose` for users.
 If you `choose` for users, there will always be dissatisfaction.
 You provide options, and users choose according to their needs.

 `auto decoding` and `utf8 string encoding` are both like this. 
 If you choose for users, some people will always be unhappy.
d index with range checking: `arr[ind]`
d index without range checking: `arr.ptr[ind]`
c++ index with range checking: `arr.at(ind)`
c++ index without range checking: `arr[ind]`

There are two ways to index, and both D and C++ offer both ways. Neither language removes a choice. If whether `arr[ind]` should range-check were up for debate, what would be for debate is what the language should encourage by making it the default--the option that's more naturally expressed, that requires less typing.

The question here of "what should a foreach over the dchar of a char[] do?" is the same kind of question.

default: `str`
throwing: `str.byUTF!(dchar, UseReplacementDchar.no)`
asserting: `std.encoding.codePoints(str)`
replacement: `std.utf.byDchar(str)`
truncation: `str[0 .. std.encoding.validLength(str)]`
promotion: `std.string.representation(str)`

Put one of those inside `foreach (dchar; ...) { }` and you get that handling of bad UTF. Changing the default doesn't make the other options go away, and the default has to do *something* (even a compile-time error of "this is not supported behavior" is *something*), so you have to make a choice about the default and make some users unhappy.
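
As a minimal sketch of the "replacement" entry from that list (assuming `std.utf.byDchar` and `std.utf.replacementDchar`, both already referenced in this thread), the loop below never throws; the bad byte simply shows up as U+FFFD:

```d
unittest
{
    import std.utf : byDchar, replacementDchar;

    string invalid = "hello\247there"; // one stray continuation byte
    size_t replaced;
    foreach (dchar c; invalid.byDchar) // never throws on bad UTF
        if (c == replacementDchar)
            ++replaced;
    assert(replaced >= 1); // the invalid sequence became U+FFFD
}
```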
Nov 06 2021
parent reply zjh <fqbqrr 163.com> writes:
On Sunday, 7 November 2021 at 01:59:47 UTC, jfondren wrote:
 On Sunday, 7 November 2021 at 01:12:19 UTC, zjh wrote:
Rust has more than ten `kinds` of strings. Maybe we could add `2/3` more.
Nov 06 2021
parent jfondren <julian.fondren gmail.com> writes:
On Sunday, 7 November 2021 at 02:12:36 UTC, zjh wrote:
 On Sunday, 7 November 2021 at 01:59:47 UTC, jfondren wrote:
 On Sunday, 7 November 2021 at 01:12:19 UTC, zjh wrote:
Rust has more than ten `kinds` of strings. Maybe we could add `2/3` more.
Meanwhile, in Rust:

```rust
mod tests {
    fn type_of<T>(_: T) -> &'static str {
        core::any::type_name::<T>()
    }

    const INVALID: &'static str = unsafe {
        std::str::from_utf8_unchecked(&[
            0x68, 0x65, 0x6c, 0x6c, 0x6f, 0xa7, 0x85, 0xaf, 0x74, 0x68, 0x65, 0x72, 0x65,
        ])
    };

    fn iter_invalid() {
        for c in INVALID.chars() {
            println!("{} {}, {}", type_of(c), c as u32, c);
        }
    }
}
```

If you smuggle invalid UTF into a type that Rust expects to be valid UTF (the same case as `string` in D, allegedly), then Rust's equivalent of `foreach (dchar c; str) { }` just emits invalid chars -- two of 'em, somehow.

104, 101, 108, 108, 111 - "hello"
453, 1012 - ???
104, 101, 114, 101 - "here" (the 't' is lost)

This is similar to `foreach (dchar c; std.encoding.codePoints(str)) { }` which emits three dchars between "hello" and "there", but which also has an assert failure in non-release builds.
Nov 07 2021
prev sibling next sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 11/3/21 10:26 PM, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
 
 I've tried to fix this before, but too many people objected.
 
 Are we fed up with this yet? I sure am.
 
 Who wants to take up this cudgel and fix the durned thing once and for all?
 
 (It's unclear if it would even break existing code.)
Honestly, I'd say `foreach(dchar c; somestr)` should not work.

1. It's slow and calls opaque functions
2. Adds more requirements to runtime that are simply solved by basic wrappers.
3. If writing wrappers, you can decide what you want.
4. It gets people used to language-magic character conversion, when this doesn't work on ranges of `char` that aren't arrays -- which then performs integer promotion (see the sketch below).

What I would *not* suggest though, is to just disable the feature. If it falls back to integer promotion (which is the worst thing ever for characters), then tons and tons of code will break, and much code will just work for English strings.

Autodecoding might be a huge problem with Phobos, but character promotion is a huge problem with the language.

-Steve
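
A minimal sketch of the pitfall in point 4, using `std.utf.byCodeUnit` purely as an example of a non-array range of `char` (any such range behaves the same way): asking for `dchar` no longer decodes, each code unit is just value-converted.

```d
unittest
{
    import std.utf : byCodeUnit;

    string s = "é"; // one code point, two UTF-8 code units (0xC3, 0xA9)

    dchar[] viaArray;
    foreach (dchar c; s)            // char[]: the language decodes (or throws)
        viaArray ~= c;
    assert(viaArray == "é"d);       // one dchar

    dchar[] viaRange;
    foreach (dchar c; s.byCodeUnit) // non-array char range: each code unit is
        viaRange ~= c;              // promoted to dchar, nothing is decoded
    assert(viaRange.length == 2);   // two mojibake dchars: U+00C3 and U+00A9
}
```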
Nov 04 2021
prev sibling next sibling parent reply jfondren <julian.fondren gmail.com> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
This doesn't throw, actually:

```d
unittest
{
    import std.stdio : writeln;

    enum invalid = "hello\247\205\257there";
    foreach (c; invalid)
        writeln(cast(ubyte) c);
}
```

Which is per usual in D:

```d
@("std.utf.byUTF 2/3 (throwing)")
@safe unittest
{
    import std.utf : byUTF, UTFException, UseReplacementDchar;
    import std.exception : assertThrown, assertNotThrown;
    import std.algorithm : count;

    string partial = "hello\247\205\257there";
    // byChar misses the bad UTF8 ...
    assertNotThrown!UTFException(partial.byUTF!(char, UseReplacementDchar.no).count);
    // byDchar objects to it
    assertThrown!UTFException(partial.byUTF!(dchar, UseReplacementDchar.no).count);
}
```

This does throw:

```d
unittest
{
    import std.stdio : writeln;

    enum invalid = "hello\247\205\257there";
    foreach (dchar c; invalid)
        writeln(cast(int) c);
}
```

but by asking for dchars from an immutable(char)[] you're asking for some unicode work to happen, so throwing is a reasonable default IMO. Emitting the replacement character is also a reasonable default, and objections in the thread can be answered the same way that objections to throwing can be: if you don't like it, iterate some other way:

```d
// throw on invalid UTF
unittest
{
    import std.utf : byUTF, UseReplacementDchar, UTFException;

    enum invalid = "hello\247\205\257there";
    int sum;
    try
    {
        foreach (dchar c; invalid.byUTF!(dchar, UseReplacementDchar.no))
            sum += cast(int) c;
        assert(sum == 197667);
    }
    catch (UTFException e)
    {
        assert(sum == 532);
    }
}

// AssertError on invalid UTF
// (release behavior: "\247\205\257" is three dchars!)
unittest
{
    import std.stdio : writeln;
    import std.encoding : codePoints;

    enum invalid = "hello\247\205\257there";
    foreach (dchar c; invalid.codePoints)
        writeln(cast(int) c);
}

// stop iterating on invalid UTF
unittest
{
    import std.encoding : validLength;

    enum invalid = "hello\247\205\257there";
    char[] s;
    foreach (dchar c; invalid[0 .. invalid.validLength])
        s ~= c;
    assert(s == "hello");
}
```
Nov 04 2021
parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/4/2021 7:52 AM, jfondren wrote:
 Emitting the replacement character is also a reasonable default, and objections 
 in the thread can be answered the same way that objections to throwing can be: 
 if you don't like it, iterate some other way:
Technically, you are correct. But experience shows this does not work, because people will be human.

Two things are abundantly clear:

1. throwing exceptions must not be default behavior
2. allocating with the GC must not be the default behavior

and pushing against that is like trying to get people to eat their vegetables.
Nov 04 2021
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473

 I've tried to fix this before, but too many people objected.

 Are we fed up with this yet? I sure am.

 Who wants to take up this cudgel and fix the durned thing once 
 and for all?

 (It's unclear if it would even break existing code.)
For the love of god, if you are going to make a breaking change there, just remove autodecoding altogether. Trying to fix what shouldn't exist is by far the biggest time sink engineers involve themselves in.
Nov 04 2021
parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Friday, 5 November 2021 at 02:06:01 UTC, deadalnix wrote:
 On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright 
 wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
For the love of god, if you are going to make a breaking change there, just remove autodecoding altogether.
This post isn't about autodecoding. With foreach, you opt into the decoding by specifically asking for it.
Nov 04 2021
parent reply deadalnix <deadalnix gmail.com> writes:
On Friday, 5 November 2021 at 02:38:51 UTC, Adam D Ruppe wrote:
 On Friday, 5 November 2021 at 02:06:01 UTC, deadalnix wrote:
 On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright 
 wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
For the love of god, if you are going to make a breaking change there, just remove autodecoding altogether.
This post isn't about autodecoding. With foreach, you opt into the decoding by specifically asking for it.
Very clearly it is, because if you don't decode, then you don't do replacement chars or exceptions.
Nov 04 2021
next sibling parent Dukc <ajieskola gmail.com> writes:
On Friday, 5 November 2021 at 03:02:07 UTC, deadalnix wrote:
 On Friday, 5 November 2021 at 02:38:51 UTC, Adam D Ruppe wrote:
 This post isn't about autodecoding. With foreach, you opt into 
 the decoding by specifically asking for it.
Very clearly it is, because if you don't decode, then you don't do replacement chars or exceptions.
It's about decoding, but not autodecoding. Or at least not the same autodecoding we usually refer to. Autodecoding is the way Phobos v1 treats character arrays when they are used as ranges. This is about an implicit conversion in the language itself.
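
A small sketch of that distinction, assuming Phobos autodecoding is still in place: the range primitives turn a `string` into a `dchar` range on their own, while the `foreach` conversion discussed here only happens because the loop variable asks for `dchar`.

```d
unittest
{
    import std.range.primitives : ElementType, front;

    string s = "é";

    // Phobos autodecoding: the range primitives present strings as dchar ranges.
    static assert(is(ElementType!string == dchar));
    assert(s.front == 'é');

    // Language-level conversion: foreach decodes only when asked for dchar.
    size_t units, points;
    foreach (char c; s) ++units;   // code units, untouched
    foreach (dchar c; s) ++points; // decoded code points
    assert(units == 2 && points == 1);
}
```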
Nov 05 2021
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.com> writes:
On 2021-11-04 23:02, deadalnix wrote:
 On Friday, 5 November 2021 at 02:38:51 UTC, Adam D Ruppe wrote:
 On Friday, 5 November 2021 at 02:06:01 UTC, deadalnix wrote:
 On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
For the love of god, if you are going to make a breaking change there, just remove autodecoding altogether.
This post isn't about autodecoding. With foreach, you opt into the decoding by specifically asking for it.
Very clearly it is, because if you don't decode, then you don't do replacement chars or exceptions.
"On demand" is not "auto".
Nov 05 2021
parent reply deadalnix <deadalnix gmail.com> writes:
On Friday, 5 November 2021 at 13:26:00 UTC, Andrei Alexandrescu 
wrote:
 "On demand" is not "auto".
From the bug repport:
 A simple foreach loop:
 
     void test(char[] a)
     {
         foreach (char c; a) { }
     }
 
 will throw a UtfException if `a` is not a valid UTF string. 
 Instead, it should replace the invalid sequence with 
 replacementDchar.
This shouldn't do anything related to unicode at all.
Nov 05 2021
next sibling parent Adam D Ruppe <destructionator gmail.com> writes:
On Friday, 5 November 2021 at 14:10:35 UTC, deadalnix wrote:
 This shouldn't do anything related to unicode at all.
Well, it doesn't. That was apparently just a typo, as the comments in the bug report quickly pointed out; the issue only arises when you specifically request dchar out of char, NOT when you are just working on chars (which is the default).
Nov 05 2021
prev sibling parent reply jfondren <julian.fondren gmail.com> writes:
On Friday, 5 November 2021 at 14:10:35 UTC, deadalnix wrote:
 On Friday, 5 November 2021 at 13:26:00 UTC, Andrei Alexandrescu 
 wrote:
 "On demand" is not "auto".
From the bug repport:
 A simple foreach loop:
 
     void test(char[] a)
     {
         foreach (char c; a) { }
     }
 
 will throw a UtfException if `a` is not a valid UTF string. 
 Instead, it should replace the invalid sequence with 
 replacementDchar.
This shouldn't do anything related to unicode at all.
It doesn't. This does:

```d
unittest
{
    enum invalid = "hello\247\205\257there";
    foreach (dchar c; invalid) { }
}
```

Looping over the dchar of a char[] requires one of

1. throwing an error on invalid UTF (current behavior)
2. doing something else in that case (proposed: replacementDchar; also possible: silently doing something invalid like iterating over three dchars between "hello" and "there")
3. a compile-time error (also proposed in the thread)
Nov 05 2021
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.com> writes:
On 2021-11-05 10:22, jfondren wrote:
 3. a compile-time error (also proposed in the thread)
Speaking of which, I was thinking std2x should simply reject mixed-sign min and max during compilation instead of cleverly figuring out the "right" comparison. Now we have signed() and unsigned() that make it trivial for the user to steer min and max toward doing the right thing, and it's clearer too.
Nov 05 2021
prev sibling parent ag0aep6g <anonymous example.com> writes:
On 05.11.21 15:22, jfondren wrote:
 Looping over the dchar of a char[] requires one of
 
 1. throwing an error on invalid UTF (current behavior)
 2. doing something else in that case (proposed: replacementDchar; also 
 possible: silently doing something invalid like iterating over three 
 dchars between "hello" and "there")
 3. a compile-time error (also proposed in the thread)
4. Don't decode. Just do an implicit conversion from char to dchar. Just like `char c; dchar d = c;`. It's horrible, but D usually allows it. So let foreach do it too. Or get rid of the implicit conversion as well while you're at it.
Nov 05 2021
prev sibling next sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473

 I've tried to fix this before, but too many people objected.

 Are we fed up with this yet? I sure am.

 Who wants to take up this cudgel and fix the durned thing once 
 and for all?

 (It's unclear if it would even break existing code.)
How about just assert(false)? It is @nogc and foreach over invalid utf-8 is a logic error (as you didn't sanitize).
Nov 05 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 5 November 2021 at 09:34:31 UTC, Guillaume Piolat 
wrote:
 How about just assert(false)? It is @nogc and foreach over 
 invalid utf-8 is a logic error (as you didn't sanitize).
It is even worse, it is a type error. If "utf-8" is to be a meaningful type you should be allowed to assume that it follows the spec.
Nov 05 2021
parent reply Guillaume Piolat <first.last gmail.com> writes:
On Friday, 5 November 2021 at 09:57:45 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 5 November 2021 at 09:34:31 UTC, Guillaume Piolat 
 wrote:
 How about just assert(false)? It is @nogc and foreach over 
 invalid utf-8 is a logic error (as you didn't sanitize).
It is even worse, it is a type error. If "utf-8" is to be a meaningful type you should be allowed to assume that it follows the spec.
Well, you only know that it is meant to be utf8 in the context of the auto-decoding foreach (which must still exist). Strings in actual programs may contain binary files, or strings in other codepage encodings.
Nov 05 2021
next sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 5 November 2021 at 10:13:13 UTC, Guillaume Piolat 
wrote:
 Well, you only know that it is meant to be utf8 in the context 
 of the auto-decoding foreach (which must still exist). Strings 
 in actual programs may contain binary files, or strings in other 
 codepage encodings.
D needs to rethink strings. Newbies going for "scripty" programming really need an encapsulated, strongly typed string type, accessed only through functions that do the right thing.

I think the @safe/@system distinction would be more useful if @safe was for those who wanted a more "scripty" programming style and @system was for those who wanted a more "low level" programming style.

On a related note, I also think it would be useful to have something stronger than @safe, like a non-trojan marker for libraries, which basically says that it is impossible for that library to do evil, and have that statically checked by the compiler. Then you could import libraries without caring about bad code. One issue I have with packages in smaller languages is that you don't have enough eyeballs on them, so it is too easy for "evil" code to slip through (intentionally or not).
Nov 05 2021
parent reply Elronnd <elronnd elronnd.net> writes:
On Friday, 5 November 2021 at 10:30:27 UTC, Ola Fosheim Grøstad 
wrote:
 I also think it would be useful to have something stronger than 
 @safe, like a non-trojan marker for libraries, which 
 basically says that it is impossible for that library to do 
 evil and have that statically checked by the compiler.
pure
Nov 05 2021
parent Elronnd <elronnd elronnd.net> writes:
On Friday, 5 November 2021 at 22:31:59 UTC, Elronnd wrote:
 On Friday, 5 November 2021 at 10:30:27 UTC, Ola Fosheim Grøstad 
 wrote:
 I also think it would be useful to have something stronger 
 than @safe, like a non-trojan marker for libraries, which 
 basically says that it is impossible for that library to do 
 evil and have that statically checked by the compiler.
pure
Hmm, technically pure code can infinite loop and cause a DOS. But any useful language will be able to get arbitrary recursion depth even if it is proved to terminate (e.g. cpp), sooo... And there is also the obvious pitfall of debug-in-pure.
Nov 05 2021
prev sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Friday, 5 November 2021 at 10:13:13 UTC, Guillaume Piolat 
wrote:
 Well, you only know that it is meant to be utf8 in the context 
 of the auto-decoding foreach (which must still exist). Strings 
 in actual programs may contain binary files, or strings in other 
 codepage encodings.
I had a look at the [documentation](https://dlang.org/spec/arrays.html#strings) today, and it said:

«char[] strings are in UTF-8 format.»

I would assume that this is normative? Maybe change the documentation to use more forceful specification language so that it says: «char[] strings MUST be in UTF-8 format.»

So, I think a messed up ```string``` should be considered a type error, and it would be good if the compiler checked this statically where possible (e.g. literals) and simply assumed it to hold when parsing strings (like in a ```for``` loop).

In C++ I use ```span<uint8_t>``` for raw string-slices and ```span<char8_t>``` for utf8 string-slices. I find that to be quite clear. In C++ these are distinct types. (Newbies need a wrapper that is foolproof.)
Nov 10 2021
next sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Wednesday, 10 November 2021 at 10:23:31 UTC, Ola Fosheim 
Grøstad wrote:
 On Friday, 5 November 2021 at 10:13:13 UTC, Guillaume Piolat 
 wrote:
 Well, you only know that it is meant to be utf8 in the context 
 of the auto-decoding foreach (which must still exist). Strings 
 in actual programs may contain binary files, or strings in other 
 codepage encodings.
I had a look at the [documentation]( https://dlang.org/spec/arrays.html#strings ) today, and it said: «char[] strings are in UTF-8 format.» I would assume that this is normative? Maybe change the documentation to use more forceful specification language so that it says: «char[] strings MUST be in UTF-8 format.»
I'm not sure what is intended.

import("file.stuff") yields string. So there is at least one gap, as it is often used with binary files that ain't UTF-8.

Also look at that signature: https://dlang.org/phobos/std_utf.html#validate
By that spec, it could then never fail.

It seems in practice it doesn't have to be utf-8 until you use something that assumes it is. Which is ok for me.
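
A minimal sketch of that gap (it assumes a file named `file.stuff` on the string import path, i.e. the compiler is run with a suitable `-J` flag): the import expression is typed as `string` no matter what the file contains.

```d
// Hypothetical file name; compile with -J pointing at its directory.
enum raw = import("file.stuff");
static assert(is(typeof(raw) == string)); // typed as string...
// ...but nothing forces file.stuff to actually be valid UTF-8.
```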
Nov 10 2021
next sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 10 November 2021 at 11:47:16 UTC, Guillaume Piolat 
wrote:
 It seems in practice it doesn't have to be utf-8 until you use 
 something that assume it is. Which is ok for me.
Hm… for me the key advantage of stricter typing is that you can make more functions free of exceptions and error-handling without using much human judgment. The ideal is to only do error handling in I/O call-trees.
Nov 10 2021
prev sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 10 November 2021 at 11:47:16 UTC, Guillaume Piolat 
wrote:
 import("file.stuff") yields string.
 So there is at least one gap, as it is often used with binary 
 files that ain't UTF-8.
Maybe a «binary_import!T("file.data")» that yields a slice of type T?
Nov 10 2021
prev sibling parent reply Elronnd <elronnd elronnd.net> writes:
On Wednesday, 10 November 2021 at 10:23:31 UTC, Ola Fosheim 
Grøstad wrote:
 I had a look at the [documentation]( 
 https://dlang.org/spec/arrays.html#strings ) today, and it said:

 «char[] strings are in UTF-8 format.»

 I would assume that this is normative? Maybe change the 
 documentation to use more forceful specification language so 
 that it says: «char[] strings MUST be in UTF-8 format.»

 So, I think a messed up ```string``` should be considered a 
 type error and it would be good if the compiler checked this 
 statically where possible (e.g. literals) and simply assumed it 
 to hold when parsing strings (like in a ```for``` loop).
I agree this should be required.  If you want something which is not valid UTF-8, _do not put it into a string_.  Use ubyte[].

Go further: require a runtime check on cast from ubyte[] to char[] (expensive), and on slicing char[] (cheap).  (If you abuse unions you are on your own; but obviously that is not allowed in @safe code, so it has the same limitations as e.g. boundschecking.)
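
A minimal sketch of such a checked conversion, using a hypothetical helper name (`toValidatedString`); the check itself is just Phobos' `std.utf.validate`:

```d
string toValidatedString(immutable(ubyte)[] bytes)
{
    import std.utf : validate;

    auto s = cast(string) bytes; // reinterpret the bytes as char data
    validate(s);                 // throws UTFException if not valid UTF-8
    return s;
}

unittest
{
    import std.exception : assertThrown;
    import std.utf : UTFException;

    immutable(ubyte)[] good = [0x68, 0x69]; // "hi"
    assert(toValidatedString(good) == "hi");

    immutable(ubyte)[] bad = [0x68, 0x80, 0x69]; // invalid byte in the middle
    assertThrown!UTFException(toValidatedString(bad));
}
```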
Nov 10 2021
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
 I agree this should be required.  If you want something which 
 is not valid UTF-8, _do not put it into a string_.  Use ubyte[].
Exactly.
 Go further: require a runtime check on cast from ubyte[] to 
 char[] (expensive), and on slicing char[] (cheap).  (If you 
 abuse unions you are on your own; but obviously that is not 
 allowed in  safe code, so has the same limitations as e.g. 
 boundschecking.)
The compiler could do such checks in an extra-solid-debug-mode. That could certainly improve unit-testing and other testing. In such a mode you could also do overflow checks for signed integers (if they are changed so they don't wrap).
Nov 10 2021
parent reply kdevel <kdevel vogtner.de> writes:
On Thursday, 11 November 2021 at 07:58:54 UTC, Ola Fosheim 
Grøstad wrote:
 On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
 I agree this should be required.  If you want something which 
 is not valid UTF-8, _do not put it into a string_.  Use 
 ubyte[].
Exactly.
[...]
 The compiler could do such checks in an extra-solid-debug-mode.
This requires lots of changes or additions

```d
import std.stdio;
import std.file;

void main ()
{
   ubyte [] filename = [ 'a', 0x80, 'b', '\0' ]; // valid filename in some OS
   auto s = readText (filename);
}
```

This does not yet compile:

```
[...]
       R = ubyte[]
  must satisfy one of the following constraints:
       isSomeChar!(ElementType!R)
       is(StringTypeOf!R)
```
Nov 12 2021
next sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
 On Thursday, 11 November 2021 at 07:58:54 UTC, Ola Fosheim 
 Grøstad wrote:
 On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
 I agree this should be required.  If you want something which 
 is not valid UTF-8, _do not put it into a string_.  Use 
 ubyte[].
Exactly.
[...]
 The compiler could do such checks in an extra-solid-debug-mode.
This requires lots of changes or additions

```d
import std.stdio;
import std.file;

void main ()
{
   ubyte [] filename = [ 'a', 0x80, 'b', '\0' ]; // valid filename in some OS
   auto s = readText (filename);
}
```

This does not yet compile:

```
[...]
       R = ubyte[]
  must satisfy one of the following constraints:
       isSomeChar!(ElementType!R)
       is(StringTypeOf!R)
```
Yes, because `readText` is typed in a way that it excludes valid filenames. But it's *already* wrong - this feature would only expose the wrongness, as `filename` is already not a validly typed string. File a bug?
Nov 14 2021
parent kdevel <kdevel vogtner.de> writes:
On Monday, 15 November 2021 at 07:17:03 UTC, FeepingCreature 
wrote:
[...]
 Yes, because `readText` is typed in a way that it excludes 
 valid filenames. But it's *already* wrong - this feature would 
 only expose the wrongness, as `filename` is already not a 
 validly typed string. File a bug?
May I ask you for a bug title? "readText shall accept natively typed filename"?
Nov 15 2021
prev sibling next sibling parent reply user1234 <user1234 12.de> writes:
On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
 This does not yet compile:

    [...]
           R = ubyte[]`
      must satisfy one of the following constraints:
    `       isSomeChar!(ElementType!R)
           is(StringTypeOf!R)`
auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.
Nov 15 2021
parent reply user1234 <user1234 12.de> writes:
On Monday, 15 November 2021 at 08:20:57 UTC, user1234 wrote:
 On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
 This does not yet compile:

    [...]
           R = ubyte[]`
      must satisfy one of the following constraints:
    `       isSomeChar!(ElementType!R)
           is(StringTypeOf!R)`
auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.
I meant decode then re-enc to utf
Nov 15 2021
next sibling parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 15 November 2021 at 08:22:13 UTC, user1234 wrote:
 On Monday, 15 November 2021 at 08:20:57 UTC, user1234 wrote:
 On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
 This does not yet compile:

    [...]
           `R = ubyte[]`
      must satisfy one of the following constraints:
    `       isSomeChar!(ElementType!R)
           is(StringTypeOf!R)`
auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.
I meant decode then re-enc to utf
I don't see how that could work. `readText` would need to encode it to the OS codepage, but `readText` has no idea what encoding you intend. And the encoding of a filename isn't even always determined by the locale; consider trying to access filenames saved in a different locale, ie. what iconv does. There's no way around `readText` taking `ubyte[]`.
Nov 15 2021
parent reply user1234 <user1234 12.de> writes:
On Monday, 15 November 2021 at 11:20:04 UTC, FeepingCreature 
wrote:
 On Monday, 15 November 2021 at 08:22:13 UTC, user1234 wrote:
 On Monday, 15 November 2021 at 08:20:57 UTC, user1234 wrote:
 On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
 This does not yet compile:

    [...]
           `R = ubyte[]`
      must satisfy one of the following constraints:
    `       isSomeChar!(ElementType!R)
           is(StringTypeOf!R)`
auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.
I meant decode then re-enc to utf
I don't see how that could work. `readText` would need to encode it to the OS codepage, but `readText` has no idea what encoding you intend. And the encoding of a filename isn't even always determined by the locale; consider trying to access filenames saved in a different locale, ie. what iconv does. There's no way around `readText` taking `ubyte[]`.
I think I was off-topic; my reply was about the filename, e.g. `fname.fromAnsi(cp).toUTF!char.readText()`. You were talking more about the file content, apparently? Sorry about that.
Nov 15 2021
parent kdevel <kdevel vogtner.de> writes:
On Monday, 15 November 2021 at 11:26:41 UTC, user1234 wrote:
[...]
 I think I was off-topic, my reply was about the filename, e.g

 `fname.fromAnsi(cp).toUTF!char.readText()`

 you were more talking about the file content apparently ? sorry 
 about that.
I /am/ talking about the filename. On POSIX systems the bytes do not mean anything (to the operating system) except for the three values '\0', '/' and '.' [1].

[1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_170
Nov 15 2021
prev sibling parent reply kdevel <kdevel vogtner.de> writes:
On Monday, 15 November 2021 at 08:22:13 UTC, user1234 wrote:
 On Monday, 15 November 2021 at 08:20:57 UTC, user1234 wrote:
 On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
 This does not yet compile:

    [...]
           R = ubyte[]`
      must satisfy one of the following constraints:
    `       isSomeChar!(ElementType!R)
           is(StringTypeOf!R)`
auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.
I meant decode then re-enc to utf
You can only decode what has been (or is meant to be) encoded. Except for '.', '\0', and '/', the character values (0 .. 255) have no meaning within a filename.
Nov 15 2021
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Monday, 15 November 2021 at 19:59:40 UTC, kdevel wrote:
 You can only decode what has been (or is ment to be) encoded. 
 Except for '.', '\0', and '/' the character values (0 .. 255) 
 have no meaning within a filename.
It should probably be a system-specific string type that validates using the rules of the specific OS.
Nov 15 2021
prev sibling parent Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
 On Thursday, 11 November 2021 at 07:58:54 UTC, Ola Fosheim 
 Grøstad wrote:
 On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
 I agree this should be required.  If you want something which 
 is not valid UTF-8, _do not put it into a string_.  Use 
 ubyte[].
Exactly.
[...]
 The compiler could do such checks in an extra-solid-debug-mode.
This requires lots of changes or additions

```d
import std.stdio;
import std.file;

void main ()
{
   ubyte [] filename = [ 'a', 0x80, 'b', '\0' ]; // valid filename in some OS
   auto s = readText (filename);
}
```

This does not yet compile:

```
[...]
       R = ubyte[]
  must satisfy one of the following constraints:
       isSomeChar!(ElementType!R)
       is(StringTypeOf!R)
```
One idea that has come up would be compile-time checking of strings. But thinking about the garbage-in-garbage-out concept in general, maybe functions should really just accept data, and it's the caller's responsibility that it's valid.

This becomes a philosophical discussion, but could maybe be interesting (increased compile times ofc, but could be worth it). This would be more of a D3 thing.

The Erlang path is fail fast. Fix the error at its root.

Don't get me wrong, I understand why phobos is the way it is now, and it works. It's more in the "ideas to explore" category.

One might say "but what about external data, I don't know if that's valid". The answer there would be to sanitize it before passing it to the function. It would also be better from a composability viewpoint.

In summary: Keep the functions themselves short and friendly. Make the incoming data correct. Put the constraints outside the function.

Pros and cons as with everything ofc
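
A small sketch of the "sanitize it before passing it to the function" idea, assuming Phobos' `std.encoding.sanitize` and `std.encoding.isValid` are acceptable for the job:

```d
unittest
{
    import std.encoding : isValid, sanitize;

    string dirty = "hello\247there"; // one invalid byte in the middle
    assert(!dirty.isValid);

    string clean = dirty.sanitize;   // invalid sequences become U+FFFD
    assert(clean.isValid);           // now safe to hand to UTF-expecting code
}
```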
Nov 15 2021
prev sibling next sibling parent reply Alexey <invalid email.address> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473

 I've tried to fix this before, but too many people objected.

 Are we fed up with this yet? I sure am.

 Who wants to take up this cudgel and fix the durned thing once 
 and for all?

 (It's unclear if it would even break existing code.)
I didn't read the thread. And I'm not an expert in D or Unicode, of course. But if I needed to solve the problem of unicode handling, I would do the following:

1. define a type for the 'grapheme' - so a grapheme could store any unicode symbol;
2. define a string of graphemes as an array of graphemes, so the programmer could at any time use the usual array tools on those, and things like .length and slicing [x..y] work as usual. Call this, for instance, 'gstring' or 'graphstring';
3. IMHO, one grapheme should be an alias to ubyte[] or to one BigInt;
4. conversion from string/wstring/dstring/ubyte[]/BigInt[]/etc to ['gstring' or 'graphstring'] should be automatic and this should be stated in documentation;
5. ['gstring' or 'graphstring'] should have functions to convert to string/wstring/dstring/ubyte[]/BigInt[]/etc
Nov 05 2021
parent reply Alexey <invalid email.address> writes:
On Saturday, 6 November 2021 at 04:07:35 UTC, Alexey wrote:

 3. IMHO, one grapheme should be an alias to ubyte[] or to one 
 BigInt;
Or maybe even define one grapheme as dchar[]. Or maybe even define a new separate type for 'codepoint' and define one grapheme as codepoint[].
Nov 05 2021
next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
https://dlang.org/phobos/std_uni.html#Grapheme
Nov 05 2021
prev sibling next sibling parent Alexey <invalid email.address> writes:
On Saturday, 6 November 2021 at 04:18:51 UTC, Alexey wrote:
 On Saturday, 6 November 2021 at 04:07:35 UTC, Alexey wrote:
And as for Ranges: Ranges should not do any automatic string conversions.
Nov 05 2021
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Nov 06, 2021 at 04:18:51AM +0000, Alexey via Digitalmars-d wrote:
 On Saturday, 6 November 2021 at 04:07:35 UTC, Alexey wrote:
 
 3. IMHO, one grapheme should be an alias to ubyte[] or to one BigInt;
Or maybe even define one grapheme as dchar[]. Or maybe even define a new separate type for 'codepoint' and define one grapheme as codepoint[].
Unfortunately, codepoint != grapheme. This was the fundamental error with autodecoding that made it so bad. It costs us a performance hit but doesn't even produce the right results in return.

And even more unfortunately, grapheme segmentation is an extremely convoluted (i.e. slow) operation that normally you would *not* want to do unless your code absolutely has to.


T

-- 
Let's eat some disquits while we format the biskettes.
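
A quick sketch of "codepoint != grapheme" with the usual Phobos tools (`std.uni.byGrapheme`, autodecoded `walkLength`): one user-perceived character, two code points.

```d
unittest
{
    import std.range : walkLength;
    import std.uni : byGrapheme;

    string s = "e\u0301"; // 'e' followed by a combining acute accent
    assert(s.walkLength == 2);            // two code points (autodecoded)
    assert(s.byGrapheme.walkLength == 1); // but only one grapheme
}
```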
Nov 05 2021
next sibling parent reply Alexey <invalid email.address> writes:
On Saturday, 6 November 2021 at 05:36:07 UTC, H. S. Teoh wrote:
 Unfortunately, codepoint != grapheme. This was the fundamental 
 error with autodecoding that made it so bad. It costs us a 
 performance hit but doesn't even produce the right results in 
 return.

 And even more unfortunately, grapheme segmentation is an 
 extremely convoluted (i.e. slow) operation that normally you 
 would *not* want to do it unless your code absolutely has to.


 T
```D
struct graphstring
{
    grapheme[] grapheme_elements;
}

struct grapheme
{
    dchar[] codepoints;
}
```

Would this really be _that_ slow?

Also, there is no need to do error checks on every action the user may do with graphstrings: no need to check on concatenations or slicings, for instance, but do checks on conversions from other string/ubyte[] types and to those types.
Nov 05 2021
next sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Saturday, 6 November 2021 at 06:17:55 UTC, Alexey wrote:
 On Saturday, 6 November 2021 at 05:36:07 UTC, H. S. Teoh wrote:
 Unfortunately, codepoint != grapheme. This was the fundamental 
 error with autodecoding that made it so bad. It costs us a 
 performance hit but doesn't even produce the right results in 
 return.

 And even more unfortunately, grapheme segmentation is an 
 extremely convoluted (i.e. slow) operation that normally you 
 would *not* want to do it unless your code absolutely has to.


 T
```D struct graphstring { grapheme[] grapheme_elements; } struct grapheme { dchar[] codepoints; } ``` Would this really be _that_ slow? also, there is no need to do error checks on every action which user may do with graphstrings: no need to check on concatenations or slicings, for instance. but do checks on conversions from other string/ubyte[] types and to those types.
This is 1 grapheme: A̶͙̜͚̫̬̻ͅ (U+0041 U+0336 U+0359 U+0345 U+031c U+035a U+032b U+032c U+033b), but 9 codepoints (9 dchar, 9 wchar, 17 char: 0x41 0xcc 0xb6 0xcd 0x99 0xcd 0x85 0xcc 0x9c 0xcd 0x9a 0xcc 0xab 0xcc 0xac 0xcc 0xbb).
Nov 06 2021
prev sibling parent reply jfondren <julian.fondren gmail.com> writes:
On Saturday, 6 November 2021 at 06:17:55 UTC, Alexey wrote:
 ```D
 struct graphstring
 {
     grapheme[] grapheme_elements;
 }

 struct grapheme
 {
     dchar[] codepoints;
 }

 ```
std.uni.Grapheme is more complex than a dchar[] (it tries to avoid allocating and it owns the dchars) but it has .length and opIndex that work like dchar[] (but read the warning on opSlice).

A Grapheme[] you can get with just `s1.byGrapheme.array`.

Round-trip example from std.uni:

```d
@safe unittest
{
    import std.array : array;
    import std.conv : text;
    import std.range : retro;
    import std.uni : byGrapheme, byCodePoint;

    string s = "noe\u0308l"; // noël

    // reverse it and convert the result to a string
    string reverse = s.byGrapheme
        .array
        .retro
        .byCodePoint
        .text;

    assert(reverse == "le\u0308on"); // lëon
}
```
Nov 06 2021
parent Alexey <invalid email.address> writes:
On Saturday, 6 November 2021 at 13:07:53 UTC, jfondren wrote:
 ...
I doubt that std.uni.Grapheme works faster than dchar[]. Also, I doubt that all the checks and things std.uni.Grapheme does are really necessary in the context of a hypothetical 'graphstring'.
Nov 06 2021
prev sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Saturday, 6 November 2021 at 05:36:07 UTC, H. S. Teoh wrote:
 And even more unfortunately, grapheme segmentation is an 
 extremely convoluted (i.e. slow) operation that normally you 
 would *not* want to do it unless your code absolutely has to.
It is suitable for a library though.
Nov 06 2021
prev sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
Previous discussions:

- https://wiki.dlang.org/DIP76
- https://forum.dlang.org/post/mfvi86$10ml$1 digitalmars.com
- https://issues.dlang.org/show_bug.cgi?id=14519
- https://github.com/dlang/druntime/pull/1240
- https://github.com/dlang/druntime/pull/1279
- https://issues.dlang.org/show_bug.cgi?id=20134
- https://github.com/dlang/phobos/pull/7144
Nov 06 2021
parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/6/2021 9:09 AM, Vladimir Panteleev wrote:
 On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
 https://issues.dlang.org/show_bug.cgi?id=22473
Previous discussions:
- https://wiki.dlang.org/DIP76
- https://forum.dlang.org/post/mfvi86$10ml$1 digitalmars.com
- https://issues.dlang.org/show_bug.cgi?id=14519
- https://github.com/dlang/druntime/pull/1240
- https://github.com/dlang/druntime/pull/1279
- https://issues.dlang.org/show_bug.cgi?id=20134
- https://github.com/dlang/phobos/pull/7144
Thanks, Vladimir.
Nov 06 2021