www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - [OT] parsing with sscanf is accidentally quadratic due to strlen

reply Kagamin <spam here.lot> writes:
Parsers based on sscanf choke on big strings: 
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
Source: 
https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47
Mar 03
next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
 Parsers based on sscanf choke on big strings: 
 https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
 Source: 
 https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47
Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse and had the my program hog the CPU when I went in production. On my test files that were not that big, memory mapping and changing fscanf() to sscanf() was a no brainer. When it went in production and started to map megabytes or gigabyte sized files, I rediscovered what a O(n²) algorithm looked like...
Mar 03
next sibling parent reply Kagamin <spam here.lot> writes:
On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter 
wrote:
 Yes, sscanf() calls strlen(). I got bitten by it also some 
 years ago when I memory mapped some log files to parse
I think you had another bug there? Memory mapped files are not null terminated with probability 1/4096.
Mar 03
parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Wednesday, 3 March 2021 at 11:36:54 UTC, Kagamin wrote:
 On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter 
 wrote:
 Yes, sscanf() calls strlen(). I got bitten by it also some 
 years ago when I memory mapped some log files to parse
I think you had another bug there? Memory mapped files are not null terminated with probability 1/4096.
No, that bug was tested for. If the file size was a multiple of pagesize (it was on SPARC so 8192 not 4096) I added 1 byte to it which forced mmap to add an empty page at the end => guaranteed 0 at the end. If you're guaranteed to run on Linux you can force the adding of a supplemental page just by passing a size bigger than the file size to mmap. This was not possible on newer versions of Solaris. This said, I had a library to handle these details, also by replacing mmap by a malloc/read when the file is below a certain size as mmap has a non negligible overhead.
Mar 03
prev sibling next sibling parent reply Simen =?UTF-8?B?S2rDpnLDpXM=?= <simen.kjaras gmail.com> writes:
On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter 
wrote:
 On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
 Parsers based on sscanf choke on big strings: 
 https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
 Source: 
 https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47
Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse and had the my program hog the CPU when I went in production. On my test files that were not that big, memory mapping and changing fscanf() to sscanf() was a no brainer. When it went in production and started to map megabytes or gigabyte sized files, I rediscovered what a O(n²) algorithm looked like...
Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there. (https://twitter.com/BruceDawson0xB/status/1120381406700429312) -- Simen
Mar 03
parent Nick Treleaven <nick geany.org> writes:
On Wednesday, 3 March 2021 at 15:16:03 UTC, Simen Kjærås wrote:
 Dawson’s first law of computing: O(n^2) is the sweet spot of 
 badly scaling algorithms: fast enough to make it into 
 production, but slow enough to make things fall down once it 
 gets there.
I think some years back, Andrei proposed using a user-defined attribute to tag functions with their big-O time complexity and maybe even use those attributes to calculate complexity of chained function calls.
Mar 03
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Mar 03, 2021 at 11:09:04AM +0000, Patrick Schluter via Digitalmars-d
wrote:
 On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
 Parsers based on sscanf choke on big strings:
 https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
 Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47
Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse and had the my program hog the CPU when I went in production. On my test files that were not that big, memory mapping and changing fscanf() to sscanf() was a no brainer. When it went in production and started to map megabytes or gigabyte sized files, I rediscovered what a O(n) algorithm looked like...
Ouch. I myself am no fan of sscanf: too limited and hard to fine-tune parsing behaviour. If it were up to me, I wouldn't run any large data sets through sscanf. Now this adds one more reason for not using sscanf. Fortunately in D slices eliminate the strlen problem, and slice-based std.array.split, et al, are generally better for simple parsing tasks IMO than *scanf functions. T -- Heads I win, tails you lose.
Mar 03
parent Kagamin <spam here.lot> writes:
On Wednesday, 3 March 2021 at 16:58:27 UTC, H. S. Teoh wrote:
 Fortunately in D slices eliminate the strlen problem, and 
 slice-based std.array.split, et al, are generally better for 
 simple parsing tasks IMO than *scanf functions.
It's used for number parsing, which isn't easy to replace in C: https://github.com/biojppm/rapidyaml/issues/40
Mar 04
prev sibling next sibling parent deadalnix <deadalnix gmail.com> writes:
On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
 Parsers based on sscanf choke on big strings: 
 https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
 Source: 
 https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47
An interesting follow up: https://www.mattkeeter.com/blog/2021-03-01-happen/
Mar 03
prev sibling next sibling parent user1234 <user1234 12.de> writes:
On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
 Parsers based on sscanf choke on big strings: 
 https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
 Source: 
 https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47
The JSON part makes me think to D-YAML which had a similar issue ([1], [2]], i.e absurd double checks on AA insertion. [1]: https://github.com/dlang-community/D-YAML/issues/78 [2]: https://github.com/dlang-community/D-YAML/pull/112
Mar 04
prev sibling parent user1234 <user1234 12.de> writes:
On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
 Parsers based on sscanf choke on big strings: 
 https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
 Source: 
 https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47
The JSON part makes me thinks to D-YAMl which used to have a similar problem ([1],[2]), i.e absurd check on insertion. [1] https://github.com/dlang-community/D-YAML/issues/78 [2] https://github.com/dlang-community/D-YAML/pull/112
Mar 04