digitalmars.D - [OT] parsing with sscanf is accidentally quadratic due to strlen
- Kagamin (4/4) Mar 03 2021 Parsers based on sscanf choke on big strings:
- Patrick Schluter (8/12) Mar 03 2021 Yes, sscanf() calls strlen(). I got bitten by it also some years
- Kagamin (4/6) Mar 03 2021 I think you had another bug there? Memory mapped files are not
- Patrick Schluter (11/17) Mar 03 2021 No, that bug was tested for. If the file size was a multiple of
- Simen Kjærås (8/20) Mar 03 2021 Dawson’s first law of computing: O(n^2) is the sweet spot of
- Nick Treleaven (5/9) Mar 03 2021 I think some years back, Andrei proposed using a user-defined
- H. S. Teoh (11/23) Mar 03 2021 Ouch.
- Kagamin (3/6) Mar 04 2021 It's used for number parsing, which isn't easy to replace in C:
- deadalnix (3/7) Mar 03 2021 An interesting follow up:
- user1234 (5/9) Mar 04 2021 The JSON part makes me think to D-YAML which had a similar issue
Parsers based on sscanf choke on big strings:
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/

Source:
https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47
Mar 03 2021
On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
> Parsers based on sscanf choke on big strings:
> https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
> Source:
> https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47

Yes, sscanf() calls strlen(). I got bitten by it too, some years ago, when I memory-mapped some log files to parse and had my program hog the CPU when it went into production. On my test files, which were not that big, memory mapping and changing fscanf() to sscanf() was a no-brainer. When it went into production and started to map megabyte- or gigabyte-sized files, I rediscovered what an O(n²) algorithm looks like...
Mar 03 2021
On Wednesday, 3 March 2021 at 11:36:54 UTC, Kagamin wrote (quoting Patrick Schluter):
> Yes, sscanf() calls strlen(). I got bitten by it also some years
> ago when I memory mapped some log files to parse

I think you had another bug there? Memory-mapped files are not null-terminated (with probability 1/4096).
Mar 03 2021
On Wednesday, 3 March 2021 at 11:36:54 UTC, Kagamin wrote:
> I think you had another bug there? Memory mapped files are not
> null terminated with probability 1/4096.

No, that bug was tested for. If the file size was a multiple of the page size (it was on SPARC, so 8192 rather than 4096), I added 1 byte to the file, which forced mmap to add an empty page at the end => guaranteed 0 at the end. If you're guaranteed to run on Linux, you can force the addition of a supplemental page just by passing a size bigger than the file size to mmap; this was not possible on newer versions of Solaris.
That said, I had a library to handle these details, which also replaced mmap with a malloc/read when the file was below a certain size, as mmap has a non-negligible overhead.
Mar 03 2021
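The library Patrick describes could look roughly like the sketch below. This is an assumption-laden reconstruction, not his actual code: when the file size is not a page-size multiple, the zero-filled tail of the last mapped page already NUL-terminates the mapping; when it is an exact multiple, the sketch falls back to malloc()+read() and appends the '\0' itself, since relying on an extra zero page past EOF is platform-dependent (as the Solaris remark above shows).

```c
/* Hedged sketch: obtain a NUL-terminated, read-only view of a file,
 * so sscanf()-style parsers can safely run over it. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns the file contents, or NULL on error.
 * *mapped tells the caller whether to munmap() or free() the result. */
char *load_nul_terminated(const char *path, size_t *len, int *mapped) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    size_t size = (size_t)st.st_size;
    long page = sysconf(_SC_PAGESIZE);
    char *p;
    if (size > 0 && size % (size_t)page != 0) {
        /* The tail of the last page is zero-filled: '\0' is already there. */
        p = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
        *mapped = (p != MAP_FAILED);
        if (p == MAP_FAILED) p = NULL;
    } else {
        /* Exact page-size multiple (or empty file): read into heap memory
         * and terminate explicitly instead of relying on an extra page. */
        *mapped = 0;
        p = malloc(size + 1);
        if (p) {
            ssize_t n = read(fd, p, size);
            if (n < 0 || (size_t)n != size) { free(p); p = NULL; }
            else p[size] = '\0';
        }
    }
    close(fd);
    if (p) *len = size;
    return p;
}
```

The malloc/read fallback doubles as the small-file path Patrick mentions, since mmap's setup cost only pays off on larger files.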
On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter wrote:
> Yes, sscanf() calls strlen(). I got bitten by it also some years ago
> when I memory mapped some log files to parse and had my program hog
> the CPU when I went in production. [...]

Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
(https://twitter.com/BruceDawson0xB/status/1120381406700429312)

--
Simen
Mar 03 2021
On Wednesday, 3 March 2021 at 15:16:03 UTC, Simen Kjærås wrote:
> Dawson’s first law of computing: O(n^2) is the sweet spot of
> badly scaling algorithms: fast enough to make it into production,
> but slow enough to make things fall down once it gets there.

I think some years back, Andrei proposed using a user-defined attribute to tag functions with their big-O time complexity, and maybe even using those attributes to calculate the complexity of chained function calls.
Mar 03 2021
On Wed, Mar 03, 2021 at 11:09:04AM +0000, Patrick Schluter via Digitalmars-d wrote:
> Yes, sscanf() calls strlen(). I got bitten by it also some years ago
> when I memory mapped some log files to parse and had my program hog
> the CPU when I went in production. [...] When it went in production
> and started to map megabytes or gigabyte sized files, I rediscovered
> what a O(n²) algorithm looked like...

Ouch.

I myself am no fan of sscanf: it's too limited, and it's hard to fine-tune its parsing behaviour. If it were up to me, I wouldn't run any large data sets through sscanf. Now this adds one more reason not to use it.

Fortunately, in D, slices eliminate the strlen problem, and slice-based std.array.split, et al., are generally better for simple parsing tasks IMO than the *scanf functions.

T

--
Heads I win, tails you lose.
Mar 03 2021
On Wednesday, 3 March 2021 at 16:58:27 UTC, H. S. Teoh wrote:
> Fortunately, in D, slices eliminate the strlen problem, and slice-based
> std.array.split, et al., are generally better for simple parsing tasks
> IMO than the *scanf functions.

It's used for number parsing, which isn't easy to replace in C:
https://github.com/biojppm/rapidyaml/issues/40
Mar 04 2021
On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
> Parsers based on sscanf choke on big strings:
> https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/

An interesting follow-up:
https://www.mattkeeter.com/blog/2021-03-01-happen/
Mar 03 2021
On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
> Parsers based on sscanf choke on big strings:
> https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/

The JSON part makes me think of D-YAML, which had a similar issue ([1], [2]), i.e. absurd double checks on AA insertion.

[1]: https://github.com/dlang-community/D-YAML/issues/78
[2]: https://github.com/dlang-community/D-YAML/pull/112
Mar 04 2021