digitalmars.D - [OT] parsing with sscanf is accidentally quadratic due to strlen

Kagamin (4/4) Mar 03 2021 Parsers based on sscanf choke on big strings:

Patrick Schluter (8/12) Mar 03 2021 Yes, sscanf() calls strlen(). I got bitten by it also some years

Kagamin (4/6) Mar 03 2021 I think you had another bug there? Memory mapped files are not

Patrick Schluter (11/17) Mar 03 2021 No, that bug was tested for. If the file size was a multiple of

Simen =?UTF-8?B?S2rDpnLDpXM=?= (8/20) Mar 03 2021 Dawson’s first law of computing: O(n^2) is the sweet spot of

Nick Treleaven (5/9) Mar 03 2021 I think some years back, Andrei proposed using a user-defined

H. S. Teoh (11/23) Mar 03 2021 Ouch.

Kagamin (3/6) Mar 04 2021 It's used for number parsing, which isn't easy to replace in C:

deadalnix (3/7) Mar 03 2021 An interesting follow up:
user1234 (5/9) Mar 04 2021 The JSON part makes me think to D-YAML which had a similar issue
user1234 (5/9) Mar 04 2021 The JSON part makes me thinks to D-YAMl which used to have a

Kagamin <spam here.lot> writes:

Parsers based on sscanf choke on big strings: 
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
Source: 
https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47

Mar 03 2021

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
 Parsers based on sscanf choke on big strings: 
 https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
 Source: 
 https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47

Yes, sscanf() calls strlen(). I got bitten by it also some years 
ago when I memory mapped some log files to parse and had the my 
program hog the CPU when I went in production. On my test files 
that were not that big, memory mapping and changing fscanf() to 
sscanf() was a no brainer. When it went in production and started 
to map megabytes or gigabyte sized files, I rediscovered what a 
O(n²) algorithm looked like...

Mar 03 2021

Kagamin <spam here.lot> writes:

On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter 
wrote:
 Yes, sscanf() calls strlen(). I got bitten by it also some 
 years ago when I memory mapped some log files to parse

I think you had another bug there? Memory mapped files are not 
null terminated with probability 1/4096.

Mar 03 2021

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Wednesday, 3 March 2021 at 11:36:54 UTC, Kagamin wrote:
 On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter 
 wrote:
 Yes, sscanf() calls strlen(). I got bitten by it also some 
 years ago when I memory mapped some log files to parse

 I think you had another bug there? Memory mapped files are not 
 null terminated with probability 1/4096.

No, that bug was tested for. If the file size was a multiple of 
pagesize (it was on SPARC so 8192 not 4096) I added 1 byte to it 
which forced mmap to add an empty page at the end => guaranteed 0 
at the end.
If you're guaranteed to run on Linux you can force the adding of 
a supplemental page just by passing a size bigger than the file 
size to mmap. This was not possible on newer versions of Solaris.
This said, I had a library to handle these details, also by 
replacing mmap by a malloc/read when the file is below a certain 
size as mmap has a non negligible overhead.

Mar 03 2021

Simen =?UTF-8?B?S2rDpnLDpXM=?= <simen.kjaras gmail.com> writes:

On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter 
wrote:
 On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
 Parsers based on sscanf choke on big strings: 
 https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
 Source: 
 https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47

 Yes, sscanf() calls strlen(). I got bitten by it also some 
 years ago when I memory mapped some log files to parse and had 
 the my program hog the CPU when I went in production. On my 
 test files that were not that big, memory mapping and changing 
 fscanf() to sscanf() was a no brainer. When it went in 
 production and started to map megabytes or gigabyte sized 
 files, I rediscovered what a O(n²) algorithm looked like...

Dawson’s first law of computing: O(n^2) is the sweet spot of 
badly scaling algorithms: fast enough to make it into production, 
but slow enough to make things fall down once it gets there.

(https://twitter.com/BruceDawson0xB/status/1120381406700429312)

--
   Simen

Mar 03 2021

Nick Treleaven <nick geany.org> writes:

On Wednesday, 3 March 2021 at 15:16:03 UTC, Simen Kjærås wrote:
 Dawson’s first law of computing: O(n^2) is the sweet spot of 
 badly scaling algorithms: fast enough to make it into 
 production, but slow enough to make things fall down once it 
 gets there.

I think some years back, Andrei proposed using a user-defined 
attribute to tag functions with their big-O time complexity and 
maybe even use those attributes to calculate complexity of 
chained function calls.

Mar 03 2021

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Mar 03, 2021 at 11:09:04AM +0000, Patrick Schluter via Digitalmars-d
wrote:
 On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
 Parsers based on sscanf choke on big strings:
 https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
 Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47

 
 Yes, sscanf() calls strlen(). I got bitten by it also some years ago
 when I memory mapped some log files to parse and had the my program
 hog the CPU when I went in production. On my test files that were not
 that big, memory mapping and changing fscanf() to sscanf() was a no
 brainer. When it went in production and started to map megabytes or
 gigabyte sized files, I rediscovered what a O(n�) algorithm looked
 like...

Ouch.

I myself am no fan of sscanf: too limited and hard to fine-tune parsing
behaviour. If it were up to me, I wouldn't run any large data sets
through sscanf.  Now this adds one more reason for not using sscanf.

Fortunately in D slices eliminate the strlen problem, and slice-based
std.array.split, et al, are generally better for simple parsing tasks
IMO than *scanf functions.


T

-- 
Heads I win, tails you lose.

Mar 03 2021

Kagamin <spam here.lot> writes:

On Wednesday, 3 March 2021 at 16:58:27 UTC, H. S. Teoh wrote:
 Fortunately in D slices eliminate the strlen problem, and 
 slice-based std.array.split, et al, are generally better for 
 simple parsing tasks IMO than *scanf functions.

It's used for number parsing, which isn't easy to replace in C: 
https://github.com/biojppm/rapidyaml/issues/40

Mar 04 2021

deadalnix <deadalnix gmail.com> writes:

On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
 Parsers based on sscanf choke on big strings: 
 https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
 Source: 
 https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47

An interesting follow up:
https://www.mattkeeter.com/blog/2021-03-01-happen/

Mar 03 2021

user1234 <user1234 12.de> writes:

On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
 Parsers based on sscanf choke on big strings: 
 https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
 Source: 
 https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47

The JSON part makes me think to D-YAML which had a similar issue 
([1], [2]], i.e absurd double checks on AA insertion.

[1]: https://github.com/dlang-community/D-YAML/issues/78
[2]: https://github.com/dlang-community/D-YAML/pull/112

Mar 04 2021

user1234 <user1234 12.de> writes:

On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
 Parsers based on sscanf choke on big strings: 
 https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
 Source: 
 https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47

The JSON part makes me thinks to D-YAMl which used to have a 
similar problem ([1],[2]), i.e absurd check on insertion.

[1] https://github.com/dlang-community/D-YAML/issues/78
[2] https://github.com/dlang-community/D-YAML/pull/112

Mar 04 2021

D Programming

C/C++ Programming

Other

digitalmars.D - [OT] parsing with sscanf is accidentally quadratic due to strlen