digitalmars.D - redefining "put" and "OutputRange"

monarch_dodra (51/51) Aug 30 2013 I'm starting this thread, first, as a shamless plug to a brand

Dmitry Olshansky (37/71) Aug 30 2013 Which is all good and well, but seeing this:

monarch_dodra (26/121) Aug 30 2013 It's a special case for sinks, yes. I'm not a fan of this, but I

Dmitry Olshansky (21/96) Aug 30 2013 You put too much faith in the source code alone. Not every assumption is...

monarch_dodra (10/17) Aug 30 2013 Not "aliases to slices", but slices themselves. For example, a

Dmitry Olshansky (11/28) Aug 30 2013 The interesting observation is that one can safely alias string or for

"monarch_dodra" <monarchdodra gmail.com> writes:

I'm starting this thread, first, as a shamless plug to a brand 
new pull I just did, which (amongst others) allows "put" to 
transcode on the fly any string/char of any length, to any 
string/char of any length.

This fixes issues mostly in std.format, and also allows things 
like formattedWrite to work with pure delegates (it didn't 
before), as well as output directly in wstring or dstring format 
(awesome for writing to a UTF-16 file, for example).

--------

The real reason I'm starting this thread is I believe the current 
way "put" leads to a *MASSIVE*, *HORRIFYING* issue. I dare not 
say it: Escaping references to local stack variables (!!!).

Basically, if R accepts an "E[]", than put will accept a single E 
element as input, and convert it to an "E[]" on the fly, using 
"put(r, (&e)[0 .. 1]);". I'm sure you can see the problem. It 
allows things such as:

//----
void main()
{
     Appender!(int[][]) app; //A type that accumulates slices
     put(app, 1);
     put(app, 2);
     put(app, 3);
     writeln(app.data); //prints: [[3], [3], [3]]
}
//----
Oops!

I'd like to make a proposition: "put" needs to be changed to 
*not* accept putting an E into something that accepts E[]. There 
is simply *no way* to do this safely, and without allocating 
(both of which, IMO, are non-negotiable).

For objects that define put/opCall, then it is not very 
complicated to have two different signatures for 
"put(E[])"/"opCall(E[])" *and* "put(E)"/"opCall(E)". This makes 
it explicit what is and isn't accepted.

Lucky enough, the problem never existed with input ranges: 
"int[][]" never accepted "int", so there is no problem there.

The last thing remaining are "sinks" (delegates and functions). 
As a convenience, and *only* for characters sinks, because we 
*trust* them to make local copies of elements, we can allow 
things like:
put((const(char)[]){}, 'a');// OK
put((const(char)[]){}, '本');// OK
put((const(char)[]){}, "hello"d);// OK
put((const(wchar)[]){}, "hello"c);// OK

This, I think, is what is safest, but still leaves a little open 
door, exceptionally, for easy formatting.

--------

So, the idea is now to yay or nay my change proposal, and/or 
discuss my pull:
https://github.com/D-Programming-Language/phobos/pull/1534

Aug 30 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

30-Aug-2013 14:53, monarch_dodra пишет:
 I'm starting this thread, first, as a shamless plug to a brand new pull
 I just did, which (amongst others) allows "put" to transcode on the fly
 any string/char of any length, to any string/char of any length.

 This fixes issues mostly in std.format, and also allows things like
 formattedWrite to work with pure delegates (it didn't before), as well
 as output directly in wstring or dstring format (awesome for writing to
 a UTF-16 file, for example).

Which is all good and well, but seeing this:

static
if (is (R == T function(const( char)[]), T) || is (R == T 
delegate(const( char)[]), T))
         enum isSink = 1;
     else static if (is (R == T function(const(wchar)[]), T) || is (R == 
T delegate(const(wchar)[]), T))
         enum isSink = 2;
     else static if (is (R == T function(const(dchar)[]), T) || is (R == 
T delegate(const(dchar)[]), T))
         enum isSink = 4;
     else
         enum isSink = 0;

Doesn't inspire confidence - it's special casing on (w|d)char arrays, 
again... Let's hopefully stop spreading this plague throughout 
especially under banners of generality.

 --------

 The real reason I'm starting this thread is I believe the current way
 "put" leads to a *MASSIVE*, *HORRIFYING* issue. I dare not say it:
 Escaping references to local stack variables (!!!).

It is a dangerous primitive. It's not a good idea to wrap everything 
with safe bags and specialize a single case - arrays and not even 
appender of (w|d)char[].

Instead it's once again a case where primitive needs better high-level 
contract inexpressible in simply terms such as  safe-ty provides.

The rule is: OutputRange must not hold references to any slices given.
And is trivially true for many of current ranges.

 Basically, if R accepts an "E[]", than put will accept a single E
 element as input, and convert it to an "E[]" on the fly, using "put(r,
 (&e)[0 .. 1]);". I'm sure you can see the problem. It allows things such
 as:

 //----
 void main()
 {
      Appender!(int[][]) app; //A type that accumulates slices
      put(app, 1);
      put(app, 2);
      put(app, 3);
      writeln(app.data); //prints: [[3], [3], [3]]
 }
 //----

Bad code regardless. The bug is in Appender of slices aliasing them 
instead of copying the data thus breaking the above rule. BTW it will 
break in similar fashion with any alias-able type I've no idea how one 
would help it.


 I'd like to make a proposition: "put" needs to be changed to *not*
 accept putting an E into something that accepts E[]. There is simply *no
 way* to do this safely, and without allocating (both of which, IMO, are
 non-negotiable).

Just relax and step back for a moment. The bug in question is painfully 
easy to blowup so chances for it being HORRIBLE are quite low (it's a 
loud bug). Safety is cool but I expect that output ranges are designed 
with idea of copying something somewhere or absorbing or accumulating.

 For objects that define put/opCall, then it is not very complicated to
 have two different signatures for "put(E[])"/"opCall(E[])" *and*
 "put(E)"/"opCall(E)". This makes it explicit what is and isn't accepted.

And that will subtly break some genuinely fine code...

 Lucky enough, the problem never existed with input ranges: "int[][]"
 never accepted "int", so there is no problem there.

This is it - a confusion between output range of int[]'s accepting them 
one by one and of int and accepting them in chunks.

-- 
Dmitry Olshansky

Aug 30 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 30 August 2013 at 14:03:36 UTC, Dmitry Olshansky wrote:
 30-Aug-2013 14:53, monarch_dodra пишет:
 I'm starting this thread, first, as a shamless plug to a brand 
 new pull
 I just did, which (amongst others) allows "put" to transcode 
 on the fly
 any string/char of any length, to any string/char of any 
 length.

 This fixes issues mostly in std.format, and also allows things 
 like
 formattedWrite to work with pure delegates (it didn't before), 
 as well
 as output directly in wstring or dstring format (awesome for 
 writing to
 a UTF-16 file, for example).

 Which is all good and well, but seeing this:

 static
 if (is (R == T function(const( char)[]), T) || is (R == T 
 delegate(const( char)[]), T))
         enum isSink = 1;
     else static if (is (R == T function(const(wchar)[]), T) || 
 is (R == T delegate(const(wchar)[]), T))
         enum isSink = 2;
     else static if (is (R == T function(const(dchar)[]), T) || 
 is (R == T delegate(const(dchar)[]), T))
         enum isSink = 4;
     else
         enum isSink = 0;

 Doesn't inspire confidence - it's special casing on (w|d)char 
 arrays, again... Let's hopefully stop spreading this plague 
 throughout especially under banners of generality.

It's a special case for sinks, yes. I'm not a fan of this, but I 
think it is the *single* cases we can trust. (More on this bellow)

 --------

 The real reason I'm starting this thread is I believe the 
 current way
 "put" leads to a *MASSIVE*, *HORRIFYING* issue. I dare not say 
 it:
 Escaping references to local stack variables (!!!).

 It is a dangerous primitive. It's not a good idea to wrap 
 everything with safe bags and specialize a single case - arrays 
 and not even appender of (w|d)char[].

 Instead it's once again a case where primitive needs better 
 high-level contract inexpressible in simply terms such as 
  safe-ty provides.

 The rule is: OutputRange must not hold references to any slices 
 given.
 And is trivially true for many of current ranges.

OutputRange really just means that put(r, e) resolves one way or 
another. And it also fundamentally depends on what you consider 
the "element type".

For example, int[][] is an output range for the element int[]. It 
makes a copy of said element (int[]), but it certainly *won't* 
copy the contents of that slice.

 Basically, if R accepts an "E[]", than put will accept a 
 single E
 element as input, and convert it to an "E[]" on the fly, using 
 "put(r,
 (&e)[0 .. 1]);". I'm sure you can see the problem. It allows 
 things such
 as:

 //----
 void main()
 {
     Appender!(int[][]) app; //A type that accumulates slices
     put(app, 1);
     put(app, 2);
     put(app, 3);
     writeln(app.data); //prints: [[3], [3], [3]]
 }
 //----

 Bad code regardless. The bug is in Appender of slices aliasing 
 them instead of copying the data thus breaking the above rule. 
 BTW it will break in similar fashion with any alias-able type 
 I've no idea how one would help it.

The bug most certainly isn't in Appender. Appender's job is to 
accumulate *slices*, and is exactly what it is doing. The caller 
code might be incorrect, but std.range.put *is* accepting it, and 
is doing a terrible job at it.

 I'd like to make a proposition: "put" needs to be changed to 
 *not*
 accept putting an E into something that accepts E[]. There is 
 simply *no
 way* to do this safely, and without allocating (both of which, 
 IMO, are
 non-negotiable).

 Just relax and step back for a moment. The bug in question is 
 painfully easy to blowup so chances for it being HORRIBLE are 
 quite low (it's a loud bug). Safety is cool but I expect that 
 output ranges are designed with idea of copying something 
 somewhere or absorbing or accumulating.

I'd agree, if output ranges were actually "designed". Right now, 
the basic definition is that an "OutputRange" collects 
"Elements". "put" extends the supported "Elements".

The truth is that format sinks "(const(char)[]){}" is the *only* 
OutputRange that collects "Elements", but whose' signture is one 
that accepts a slice. This "flaws" the slice/element notion.

If format sinks were defined as "(char){}" to begin with, then 
everything would work fine (and *does*), but this is not the case 
today, and that is the *only* reason I made an exception for them.

 For objects that define put/opCall, then it is not very 
 complicated to
 have two different signatures for "put(E[])"/"opCall(E[])" 
 *and*
 "put(E)"/"opCall(E)". This makes it explicit what is and isn't 
 accepted.

 And that will subtly break some genuinely fine code...

It would "explicitly" break code that may (or may *not*) be fine.

 Lucky enough, the problem never existed with input ranges: 
 "int[][]"
 never accepted "int", so there is no problem there.

 This is it - a confusion between output range of int[]'s 
 accepting them one by one and of int and accepting them in 
 chunks.

I think the problem is "put" overstepping its boundaries. If 
"r.put(someSlice)" compiles, "put" has no reason to think that R 
actually owns the elements in the slice.

Aug 30 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

30-Aug-2013 18:38, monarch_dodra пишет:
 Which is all good and well, but seeing this:

 static
 if (is (R == T function(const( char)[]), T) || is (R == T
 delegate(const( char)[]), T))
         enum isSink = 1;
     else static if (is (R == T function(const(wchar)[]), T) || is (R
 == T delegate(const(wchar)[]), T))
         enum isSink = 2;
     else static if (is (R == T function(const(dchar)[]), T) || is (R
 == T delegate(const(dchar)[]), T))
         enum isSink = 4;
     else
         enum isSink = 0;

 Doesn't inspire confidence - it's special casing on (w|d)char arrays,
 again... Let's hopefully stop spreading this plague throughout
 especially under banners of generality.

 It's a special case for sinks, yes. I'm not a fan of this, but I think
 it is the *single* cases we can trust. (More on this bellow)

No thanks. Full functionality outweighs trusted but crippled.

 The real reason I'm starting this thread is I believe the current way
 "put" leads to a *MASSIVE*, *HORRIFYING* issue. I dare not say it:
 Escaping references to local stack variables (!!!).

 It is a dangerous primitive. It's not a good idea to wrap everything
 with safe bags and specialize a single case - arrays and not even
 appender of (w|d)char[].

 Instead it's once again a case where primitive needs better high-level
 contract inexpressible in simply terms such as  safe-ty provides.

 The rule is: OutputRange must not hold references to any slices given.
 And is trivially true for many of current ranges.

 OutputRange really just means that put(r, e) resolves one way or
 another. And it also fundamentally depends on what you consider the
 "element type".

You put too much faith in the source code alone. Not every assumption is 
written in the source (while it should be probably).

 For example, int[][] is an output range for the element int[]. It makes
 a copy of said element (int[]), but it certainly *won't* copy the
 contents of that slice.

The main reason for output range is to absorb data one by one or in 
chunks (= slices). In that sense int[][] is a bad output range.

I do not really care for formalism that defines what is an element type 
here.

 I'd like to make a proposition: "put" needs to be changed to *not*
 accept putting an E into something that accepts E[]. There is simply *no
 way* to do this safely, and without allocating (both of which, IMO, are
 non-negotiable).

 Just relax and step back for a moment. The bug in question is
 painfully easy to blowup so chances for it being HORRIBLE are quite
 low (it's a loud bug). Safety is cool but I expect that output ranges
 are designed with idea of copying something somewhere or absorbing or
 accumulating.

 I'd agree, if output ranges were actually "designed".

And they were.

 Right now, the
 basic definition is that an "OutputRange" collects "Elements". "put"
 extends the supported "Elements".

 The truth is that format sinks "(const(char)[]){}" is the *only*
 OutputRange that collects "Elements", but whose' signture is one that
 accepts a slice. This "flaws" the slice/element notion.

Because it was lacking in performance the most.

 If format sinks were defined as "(char){}" to begin with, then
 everything would work fine (and *does*),

And would slowly crawling into oblivion, that said std.stdio is slow 
even w/o put-ing char by char (+char is not complete thus would require 
buffering on the other side of fence).

but this is not the case today,
 and that is the *only* reason I made an exception for them.

Chances are you missed ubyte/ubyte[] of std.digest.

 For objects that define put/opCall, then it is not very complicated to
 have two different signatures for "put(E[])"/"opCall(E[])" *and*
 "put(E)"/"opCall(E)". This makes it explicit what is and isn't accepted.

 And that will subtly break some genuinely fine code...

 It would "explicitly" break code

... and that is bad ...

 that may (or may *not*) be fine.

The point is if it wasn't fine then it wouldn't survive a day in the wilds.

 Lucky enough, the problem never existed with input ranges: "int[][]"
 never accepted "int", so there is no problem there.

 This is it - a confusion between output range of int[]'s accepting
 them one by one and of int and accepting them in chunks.

 I think the problem is "put" overstepping its boundaries. If
 "r.put(someSlice)" compiles, "put" has no reason to think that R
 actually owns the elements in the slice.

It should and this is where we differ I guess. I can't think of a useful 
output range that stores away aliases to slices it takes.

-- 
Dmitry Olshansky

Aug 30 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 30 August 2013 at 15:16:15 UTC, Dmitry Olshansky wrote:
 I think the problem is "put" overstepping its boundaries. If
 "r.put(someSlice)" compiles, "put" has no reason to think that 
 R
 actually owns the elements in the slice.

 It should and this is where we differ I guess. I can't think of 
 a useful output range that stores away aliases to slices it 
 takes.

Not "aliases to slices", but slices themselves. For example, a 
dictionary, which is a container of "words" (strings) could 
define a sink that accepts strings to feed it word. Or, well 
anything that defines the *element* itself as the object. For 
example, something that accumulates *lists* of ints.

In any case, I get your point about functionality. I can rework 
my pull to make it work as before, while still keeping the 
trans-coding functionality :/

But I'm not a huge fan.

Aug 30 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

30-Aug-2013 21:20, monarch_dodra пишет:
 On Friday, 30 August 2013 at 15:16:15 UTC, Dmitry Olshansky wrote:
 I think the problem is "put" overstepping its boundaries. If
 "r.put(someSlice)" compiles, "put" has no reason to think that R
 actually owns the elements in the slice.

 It should and this is where we differ I guess. I can't think of a
 useful output range that stores away aliases to slices it takes.

 Not "aliases to slices", but slices themselves. For example, a
 dictionary, which is a container of "words" (strings) could define a
 sink that accepts strings to feed it word.

The interesting observation is that one can safely alias string or for 
that matter any immutable slices. However any such activity with mutable 
slices is prone to some funky issues, continuing my point that int[][] 
is bad output range :)

 Or, well anything that
 defines the *element* itself as the object. For example, something that
 accumulates *lists* of ints.

Correcting myself - anyway it should deal with potential _mutable_ 
aliasing if it is to keep them around for some reason. If the output 
range can just use it in place - fine.


 In any case, I get your point about functionality. I can rework my pull
 to make it work as before, while still keeping the trans-coding
 functionality :/

 But I'm not a huge fan.

Me neither.

-- 
Dmitry Olshansky

Aug 30 2013

D Programming

C/C++ Programming

Other

digitalmars.D - redefining "put" and "OutputRange"