digitalmars.D - A safer File.readln

Markus Laker (77/77) Jan 22 2017 It's pretty easy to DoS a D program that uses File.readln or

Chris Wright (3/5) Jan 22 2017 The /dev/zero version at least could be solved by calling stat on the fi...

Andrei Alexandrescu (3/5) Jan 22 2017 I recall reported size for special files is zero. We special case

Markus Laker (16/18) Jan 23 2017 Special files are a bit of a distraction, in any case, because

Shachar Shemesh (44/45) Jan 23 2017 When extending arrays, a common approach is to double the size every

Markus Laker (10/12) Jan 23 2017 Yes, you're right, of course: expansion of strings and other

Shachar Shemesh (9/21) Jan 23 2017 It would mean we consume an order of magnitude of the amount of memory

Shachar Shemesh (7/54) Jan 23 2017 One more thing.

Markus Laker (8/12) Jan 23 2017 I'm going to bow out of the discussion about array-expansion,

Andrei Alexandrescu (4/8) Jan 23 2017 Heh, I have a talk about it. The limit is the golden cut,

Shachar Shemesh (3/11) Jan 23 2017 What does D use when we keep appending?
Shachar Shemesh (5/13) Jan 23 2017 I was going to ask for the proof, but I first went to refresh my memory
TheGag96 (3/14) Jan 24 2017 Andrei, could you link this talk? Thanks!

Andrei Alexandrescu (2/13) Jan 25 2017 Not public. -- Andrei

Jens Mueller (7/26) Jan 25 2017 Have you done measurements on the matter? Because I'm not sold on

Andrei Alexandrescu (7/33) Jan 25 2017 Wasn't selling anything.

Kagamin (5/11) Jan 25 2017 An issue I had with low default buffer limits: they are difficult

Markus Laker <d20160302.20.mlaker spamgourmet.com> writes:

It's pretty easy to DoS a D program that uses File.readln or 
File.byLine:

msl james:~/d$ prlimit --as=4000000000 time ./tinycat.d tinycat.d


import std.stdio;

void main(in string[] argv) {
     foreach (const filename; argv[1..$])
         foreach (line; File(filename).byLine)
             writeln(line);
}

0.00user 0.00system 0:00.00elapsed 66%CPU (0avgtext+0avgdata 
4280maxresident)k
0inputs+0outputs (0major+292minor)pagefaults 0swaps
msl james:~/d$ prlimit --as=4000000000 time ./tinycat.d /dev/zero
0.87user 1.45system 0:02.51elapsed 92%CPU (0avgtext+0avgdata 
2100168maxresident)k
0inputs+0outputs (0major+524721minor)pagefaults 0swaps
msl james:~/d$

This trivial program that runs in about 4MiB when asked to print 
itself chewed up 2GiB of memory in about three seconds when 
handed an infinitely long input line, and would have kept going 
if prlimit hadn't killed it.

D is in good company: C++'s getline() and Perl's diamond operator 
have the same vulnerability.

msl james:~/d$ prlimit --as=4000000000 time ./a.out tinycat.cpp
#include <fstream>
#include <iostream>
#include <string>

int main(int const argc, char const *argv[]) {
     for (auto i = 1;  i < argc;  ++i) {
         std::ifstream fh {argv[i]};
         for (std::string line;  getline(fh, line, '\n');  )
             std::cout << line << '\n';
     }

     return 0;
}

0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 
2652maxresident)k
0inputs+0outputs (0major+113minor)pagefaults 0swaps
msl james:~/d$ prlimit --as=4000000000 time ./a.out /dev/zero
1.12user 1.76system 0:02.92elapsed 98%CPU (0avgtext+0avgdata 
1575276maxresident)k
0inputs+0outputs (0major+786530minor)pagefaults 0swaps
msl james:~/d$ prlimit --as=4000000000 time perl -wpe '' tinycat.d


import std.stdio;

void main(in string[] argv) {
     foreach (const filename; argv[1..$])
         foreach (line; File(filename).byLine)
             writeln(line);
}

0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 
3908maxresident)k
0inputs+0outputs (0major+192minor)pagefaults 0swaps
msl james:~/d$ prlimit --as=4000000000 time perl -wpe '' /dev/zero
Out of memory!
Command exited with non-zero status 1
4.82user 2.34system 0:07.43elapsed 96%CPU (0avgtext+0avgdata 
3681400maxresident)k
0inputs+0outputs (0major+919578minor)pagefaults 0swaps
msl james:~/d$

But I digress.

What would a safer API look like?  Perhaps we'd slip in a maximum 
line length as an optional argument to readln, byLine and friends:

enum size_t MaxLength = 1 << 20;    // 1MiB
fh.readln(buf, MaxLength);
buf = fh.readln(MaxLength);
auto range = fh.byLine(MaxLength);

Obviously, we wouldn't want to break compatibility with existing 
code by demanding a maximum line length at every call site.  
Perhaps the default maximum length should change from its current 
value -- infinity -- to something like 4MiB: longer than lines in 
most text files, but still affordably small on most modern 
machines.

What should happen if readln encountered an excessively long 
line?  Throw an exception?

Markus

Jan 22 2017

Chris Wright <dhasenan gmail.com> writes:

On Sun, 22 Jan 2017 21:29:39 +0000, Markus Laker wrote:

 It's pretty easy to DoS a D program that uses File.readln or
 File.byLine:

The /dev/zero version at least could be solved by calling stat on the file 
and limiting reads to the reported size.

Jan 22 2017

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/22/17 8:52 PM, Chris Wright wrote:
 The /dev/zero version at least could be solved by calling stat on the file
 and limiting reads to the reported size.

I recall reported size for special files is zero. We special case 
std.file.read for those. -- Andrei

Jan 22 2017

Markus Laker <d20160302.20.mlaker spamgourmet.com> writes:

On Monday, 23 January 2017 at 01:55:59 UTC, Andrei Alexandrescu 
wrote:
 I recall reported size for special files is zero. We special 
 case std.file.read for those. -- Andrei

Special files are a bit of a distraction, in any case, because 
it's easy to create a large disk file full of zeroes:

msl james:~/d$ dd if=/dev/zero of=zeroes count=2048 bs=1048576
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 11.1792 s, 192 MB/s
msl james:~/d$ /usr/bin/time ./tinycat.d zeroes > /dev/null
1.67user 3.14system 0:04.83elapsed 99%CPU (0avgtext+0avgdata 
4199944maxresident)k
0inputs+0outputs (0major+1049634minor)pagefaults 0swaps
msl james:~/d$

A 2GiB disk file caused tinycat.d to use over 4GiB of memory.

Cheers,

Markus

Jan 23 2017

Shachar Shemesh <shachar weka.io> writes:

On 23/01/17 11:15, Markus Laker wrote:

 A 2GiB disk file caused tinycat.d to use over 4GiB of memory.

When extending arrays, a common approach is to double the size every 
time you run out of space. This guarantees an amortized O(1) cost of append.

Unfortunately, this also guarantees that we will never have enough space 
freed by previous copies to reuse existing memory:

100 byte array

increase

100 bytes free
200 bytes array

increase

300 free
400 array

etc. The array will always be bigger than the total amount of space we 
freed.

If, instead of increasing its size by 100%, we increase it by a smaller 
percentage of its previous size, we still maintain the amortized O(1) 
cost (with a multiplier that might be a little higher, but see the trade 
off). On the other hand, we can now reuse memory. Let's say we increase 
by 50% each time:

100 array

increase

100 free
150 array

increase

250 free
225 array

increase

475 free
338 array

813 free
507 array

The next increase will require 761 bytes, but we already have 813 free, 
so we can allocate the new array over the memory already freed from 
before, reducing the heap size.

Of course, if we integrate the allocating and the move, we could have 
reused previously used memory starting from allocation 3, but I'm 
assuming that would also be possible when increasing by 100%, so I'm 
assuming we can't do that.

Of course, if, instead of 50% we increase by less (say, 20%), we could 
reuse previously used memory even sooner.

I am assuming that this is the problem that manifests itself in this use 
scenario. I would suggest solving it at the language level, rather than 
the library level.

Shachar

Jan 23 2017

Markus Laker <d20160302.20.mlaker spamgourmet.com> writes:

On Monday, 23 January 2017 at 10:44:50 UTC, Shachar Shemesh wrote:
 Of course, if, instead of 50% we increase by less (say, 20%), 
 we could reuse previously used memory even sooner.

Yes, you're right, of course: expansion of strings and other 
arrays is a classic time-versus-space trade-off.  However, 
expanding strings more slowly is a much bigger change than I have 
the D experience or credentials to suggest.  And I don't think it 
really solves the problem: it just requires the attacker to wait 
another few seconds for /dev/zero to deliver enough data to fill 
up memory.  A simple length-check in readln, in contrast, would 
prevent an attacker from flooding us with data in the first place.

Markus

Jan 23 2017

Shachar Shemesh <shachar weka.io> writes:

On 23/01/17 13:05, Markus Laker wrote:
 On Monday, 23 January 2017 at 10:44:50 UTC, Shachar Shemesh wrote:
 Of course, if, instead of 50% we increase by less (say, 20%), we could
 reuse previously used memory even sooner.

 Yes, you're right, of course: expansion of strings and other arrays is a
 classic time-versus-space trade-off.  However, expanding strings more
 slowly is a much bigger change than I have the D experience or
 credentials to suggest.  And I don't think it really solves the problem:
 it just requires the attacker to wait another few seconds for /dev/zero
 to deliver enough data to fill up memory.  A simple length-check in
 readln, in contrast, would prevent an attacker from flooding us with
 data in the first place.

 Markus

It would mean we consume an order of magnitude of the amount of memory 
the "attacker" sends.

There is a huge difference between "I send an unterminated string 2GB 
long, and it takes 2GB of memory, causing trouble", and "I send an 
unterminated string 2GB long, and it takes 4GB of memory, causing trouble".

The second is a problem. The first might be obvious and/or benign, 
depending on the use case.

Shachar

Jan 23 2017

Shachar Shemesh <shachar weka.io> writes:

One more thing.

It is possible to tweak the numbers based on the overall use. For 
example, add 100% for arrays smaller than 1MB, 50% for 1MB - 100MB, and 
20% for arrays above 100MB big. This would eliminate the performance 
degradation for almost all users.

Shachar

On 23/01/17 12:44, Shachar Shemesh wrote:
 On 23/01/17 11:15, Markus Laker wrote:

 A 2GiB disk file caused tinycat.d to use over 4GiB of memory.

 When extending arrays, a common approach is to double the size every
 time you run out of space. This guarantees an amortized O(1) cost of
 append.

 Unfortunately, this also guarantees that we will never have enough space
 freed by previous copies to reuse existing memory:

 100 byte array

 increase

 100 bytes free
 200 bytes array

 increase

 300 free
 400 array

 etc. The array will always be bigger than the total amount of space we
 freed.

 If, instead of increasing its size by 100%, we increase it by a smaller
 percentage of its previous size, we still maintain the amortized O(1)
 cost (with a multiplier that might be a little higher, but see the trade
 off). On the other hand, we can now reuse memory. Let's say we increase
 by 50% each time:

 100 array

 increase

 100 free
 150 array

 increase

 250 free
 225 array

 increase

 475 free
 338 array

 813 free
 507 array

 The next increase will require 761 bytes, but we already have 813 free,
 so we can allocate the new array over the memory already freed from
 before, reducing the heap size.

 Of course, if we integrate the allocating and the move, we could have
 reused previously used memory starting from allocation 3, but I'm
 assuming that would also be possible when increasing by 100%, so I'm
 assuming we can't do that.

 Of course, if, instead of 50% we increase by less (say, 20%), we could
 reuse previously used memory even sooner.

 I am assuming that this is the problem that manifests itself in this use
 scenario. I would suggest solving it at the language level, rather than
 the library level.

 Shachar

Jan 23 2017

Markus Laker <d20160302.20.mlaker spamgourmet.com> writes:

On Monday, 23 January 2017 at 11:30:35 UTC, Shachar Shemesh wrote:
 It is possible to tweak the numbers based on the overall use. 
 For example, add 100% for arrays smaller than 1MB, 50% for 1MB 
 - 100MB, and 20% for arrays above 100MB big. This would 
 eliminate the performance degradation for almost all users.

I'm going to bow out of the discussion about array-expansion, 
because I think it's a separate topic from readln and I don't 
know enough about D's internals to contribute meaningfully.  It 
might be worth raising your proposal in a separate thread in 
order to ensure visibility.

Cheers,

Markus

Jan 23 2017

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/23/17 5:44 AM, Shachar Shemesh wrote:
 If, instead of increasing its size by 100%, we increase it by a smaller
 percentage of its previous size, we still maintain the amortized O(1)
 cost (with a multiplier that might be a little higher, but see the trade
 off). On the other hand, we can now reuse memory.

Heh, I have a talk about it. The limit is the golden cut, 
1.6180339887498948482... The proof is fun. Anything larger prevents you 
from reusing previously used space. -- Andrei

Jan 23 2017

Shachar Shemesh <shachar weka.io> writes:

On 23/01/17 15:18, Andrei Alexandrescu wrote:
 On 1/23/17 5:44 AM, Shachar Shemesh wrote:
 If, instead of increasing its size by 100%, we increase it by a smaller
 percentage of its previous size, we still maintain the amortized O(1)
 cost (with a multiplier that might be a little higher, but see the trade
 off). On the other hand, we can now reuse memory.

 Heh, I have a talk about it. The limit is the golden cut,
 1.6180339887498948482... The proof is fun. Anything larger prevents you
 from reusing previously used space. -- Andrei

What does D use when we keep appending?

Shachar

Jan 23 2017

Shachar Shemesh <shachar weka.io> writes:

On 23/01/17 15:18, Andrei Alexandrescu wrote:
 On 1/23/17 5:44 AM, Shachar Shemesh wrote:
 If, instead of increasing its size by 100%, we increase it by a smaller
 percentage of its previous size, we still maintain the amortized O(1)
 cost (with a multiplier that might be a little higher, but see the trade
 off). On the other hand, we can now reuse memory.

 Heh, I have a talk about it. The limit is the golden cut,
 1.6180339887498948482... The proof is fun. Anything larger prevents you
 from reusing previously used space. -- Andrei

I was going to ask for the proof, but I first went to refresh my memory 
a little about the golden ratio, at which point the proof became 
somewhat trivial.

Shachar

Jan 23 2017

TheGag96 <thegag96 gmail.com> writes:

On Monday, 23 January 2017 at 13:18:57 UTC, Andrei Alexandrescu 
wrote:
 On 1/23/17 5:44 AM, Shachar Shemesh wrote:
 If, instead of increasing its size by 100%, we increase it by 
 a smaller
 percentage of its previous size, we still maintain the 
 amortized O(1)
 cost (with a multiplier that might be a little higher, but see 
 the trade
 off). On the other hand, we can now reuse memory.

 Heh, I have a talk about it. The limit is the golden cut, 
 1.6180339887498948482... The proof is fun. Anything larger 
 prevents you from reusing previously used space. -- Andrei

Andrei, could you link this talk? Thanks!

Jan 24 2017

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 01/25/2017 12:58 AM, TheGag96 wrote:
 On Monday, 23 January 2017 at 13:18:57 UTC, Andrei Alexandrescu wrote:
 On 1/23/17 5:44 AM, Shachar Shemesh wrote:
 If, instead of increasing its size by 100%, we increase it by a smaller
 percentage of its previous size, we still maintain the amortized O(1)
 cost (with a multiplier that might be a little higher, but see the trade
 off). On the other hand, we can now reuse memory.

 Heh, I have a talk about it. The limit is the golden cut,
 1.6180339887498948482... The proof is fun. Anything larger prevents
 you from reusing previously used space. -- Andrei

 Andrei, could you link this talk? Thanks!

Not public. -- Andrei

Jan 25 2017

Jens Mueller <jens.k.mueller gmx.de> writes:

On Wednesday, 25 January 2017 at 14:18:15 UTC, Andrei 
Alexandrescu wrote:
 On 01/25/2017 12:58 AM, TheGag96 wrote:
 On Monday, 23 January 2017 at 13:18:57 UTC, Andrei 
 Alexandrescu wrote:
 On 1/23/17 5:44 AM, Shachar Shemesh wrote:
 If, instead of increasing its size by 100%, we increase it 
 by a smaller
 percentage of its previous size, we still maintain the 
 amortized O(1)
 cost (with a multiplier that might be a little higher, but 
 see the trade
 off). On the other hand, we can now reuse memory.

 Heh, I have a talk about it. The limit is the golden cut,
 1.6180339887498948482... The proof is fun. Anything larger 
 prevents
 you from reusing previously used space. -- Andrei

 Andrei, could you link this talk? Thanks!

 Not public. -- Andrei

Have you done measurements on the matter? Because I'm not sold on 
the idea. To me at this point this is just a theoretical 
observation. There are also arguments indicating it is less 
useful. Any numbers on how it affects e.g. memory usage?

Jens

Jan 25 2017

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 01/25/2017 02:12 PM, Jens Mueller wrote:
 On Wednesday, 25 January 2017 at 14:18:15 UTC, Andrei Alexandrescu wrote:
 On 01/25/2017 12:58 AM, TheGag96 wrote:
 On Monday, 23 January 2017 at 13:18:57 UTC, Andrei Alexandrescu wrote:
 On 1/23/17 5:44 AM, Shachar Shemesh wrote:
 If, instead of increasing its size by 100%, we increase it by a
 smaller
 percentage of its previous size, we still maintain the amortized O(1)
 cost (with a multiplier that might be a little higher, but see the
 trade
 off). On the other hand, we can now reuse memory.

 Heh, I have a talk about it. The limit is the golden cut,
 1.6180339887498948482... The proof is fun. Anything larger prevents
 you from reusing previously used space. -- Andrei

 Andrei, could you link this talk? Thanks!

 Not public. -- Andrei

 Have you done measurements on the matter?

Affirmative.

 Because I'm not sold on the
 idea.

Wasn't selling anything.

 To me at this point this is just a theoretical observation.

No.

 There
 are also arguments indicating it is less useful.

That is correct.

 Any numbers on how it
 affects e.g. memory usage?

Depends on the application. You'd do good to collect your own.


Andrei

Jan 25 2017

Kagamin <spam here.lot> writes:

On Sunday, 22 January 2017 at 21:29:39 UTC, Markus Laker wrote:
 Obviously, we wouldn't want to break compatibility with 
 existing code by demanding a maximum line length at every call 
 site.  Perhaps the default maximum length should change from 
 its current value -- infinity -- to something like 4MiB: longer 
 than lines in most text files, but still affordably small on 
 most modern machines.

An issue I had with low default buffer limits: they are difficult 
to discover and usually start to fail only in production where 
you hit the actual big data, which gets only bigger with time. 
You find and bump one limit, deploy, only hit another later.

Jan 25 2017

D Programming

C/C++ Programming

Other

digitalmars.D - A safer File.readln