digitalmars.D.learn - std.regex is fat

reply Chris Katko <ckatko gmail.com> writes:
Like, insanely fat.

All I wanted was a simple regex. The second I included a regex 
function, my program would no longer compile: "out of memory for 
fork".

/usr/bin/time -v reports it went from 150MB of RAM for D, 
DAllegro, and Allegro5 to over 650MB of RAM, and from 1.5 
seconds to >5.5 seconds to compile. Now I have to close all my 
Chrome tabs just to compile.

Just for one line of regex. And I get it, it's the overhead of 
the library import, not the single line. But good gosh, more than 
3X the RAM of the entire project for a single library import?

Something doesn't add up!
Oct 12 2018
parent reply Alex <sascha.orlov gmail.com> writes:
On Friday, 12 October 2018 at 13:25:33 UTC, Chris Katko wrote:
 [...]
Hm... maybe you are running into this: https://forum.dlang.org/post/mailman.3091.1517866806.9493.digitalmars-d puremagic.com
Oct 12 2018
parent reply Chris Katko <ckatko gmail.com> writes:
On Friday, 12 October 2018 at 13:42:34 UTC, Alex wrote:
 On Friday, 12 October 2018 at 13:25:33 UTC, Chris Katko wrote:
 [...]
Hm... maybe, you run into this: https://forum.dlang.org/post/mailman.3091.1517866806.9493.digitalmars-d puremagic.com
So wait, if their solution was to simply REMOVE std.regex from isEmail. That doesn't solve the regex problem at all. And from what I read in that thread, this penalty is paid per template INSTANTIATION which could explode.

1 - Does anyone know WHY it's so incredibly fat?

2 - If this isn't going to be fixed anytime soon, shouldn't there be a DISCLAIMER on the documentation? (+potential workarounds like keeping regex queries in their own file.)

I mean, this kind of thing shouldn't require looking through forums. It's a clear bug, and if it's a WONTFIX (even temporarily), it should be documented clearly as such.

If I'm running into this issue, how many other people already did, and possibly even gave up on using D?
Oct 13 2018
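
A minimal sketch of the "keep regex queries in their own file" workaround suggested in the post above. The module name `textmatch` and the function `lineMatches` are hypothetical, and whether this actually lowers peak compiler memory is an assumption that would need measuring with /usr/bin/time -v. The idea is only that one small module touches std.regex and everything else calls a plain, non-template function:

```d
// textmatch.d -- hypothetical module; the only file that uses std.regex.
module textmatch;

/// Returns true if `line` contains a match for `pattern`.
/// The signature is non-template with an explicit return type, so
/// callers need only this declaration, not the regex machinery.
bool lineMatches(string line, string pattern)
{
    // Function-local import: std.regex is needed only when this
    // function body itself is compiled.
    import std.regex : regex, matchFirst;

    auto re = regex(pattern);            // run-time regex, not ctRegex
    return !matchFirst(line, re).empty;  // empty captures mean no match
}
```

If the module is built separately (for example `dmd -c textmatch.d`, then `dmd app.d textmatch.o`), the regex cost should be paid when textmatch.d is compiled rather than on every rebuild of the rest of the project.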
next sibling parent reply Chris Katko <ckatko gmail.com> writes:
On Sunday, 14 October 2018 at 02:44:55 UTC, Chris Katko wrote:
 On Friday, 12 October 2018 at 13:42:34 UTC, Alex wrote:
 [...]
So wait, if their solution was to simply REMOVE std.regex from isEmail. That doesn't solve the regex problem at all. And from what I read in that thread, this penalty is paid per template INSTANTIATION which could explode. [...]
For comparison, I just tested and grep uses about 4 MB of RAM to run. So it's not the regex. It's the dmd / templates / CTFE, right?
Oct 13 2018
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Sunday, 14 October 2018 at 03:07:59 UTC, Chris Katko wrote:
 For comparison, I just tested and grep uses about 4 MB of RAM 
 to run.
Running and compiling are two entirely different things. Running the D regex code should be comparable, but compiling it is slow, in great part because of internal templates... There was an effort to speed up the template code, but it is still not complete.
Oct 13 2018
parent Chris Katko <ckatko gmail.com> writes:
On Sunday, 14 October 2018 at 03:26:33 UTC, Adam D. Ruppe wrote:
 On Sunday, 14 October 2018 at 03:07:59 UTC, Chris Katko wrote:
 For comparison, I just tested and grep uses about 4 MB of RAM 
 to run.
Running and compiling are two entirely different things. Running the D regex code should be comparable, but compiling it is slow, in great part because of internal templates... There was an effort to speed up the template code, but it is still not complete.
I know that. I figured people would miss my point on it though so I should have clarified. That's why I said it's likely the templates/DMD that's exploding--not the actual regex action.

From a simple program, it takes ~100-150MB of RAM to compile. Adding a single regex (not compiled regex) balloons to 550MB at 5 seconds of compile time.

-----------

Anyhow, I wrote my own simple "dgrep" and compared the results with grep, it's very competitive:

(NOT to be confused with the above RAM stats for COMPILING)

    Command being timed: "sh -c cat dgrep.d | ./dgrep 'write' "
    User time (seconds): 0.00
    System time (seconds): 0.00
    Percent of CPU this job got: 0%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3192
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 301
    Voluntary context switches: 5
    Involuntary context switches: 124
    Swaps: 0
    File system inputs: 8
    File system outputs: 8
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

    Command being timed: "sh -c cat dgrep.d | grep 'write'"
    User time (seconds): 0.00
    System time (seconds): 0.00
    Percent of CPU this job got: 0%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 2224
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 2
    Minor (reclaiming a frame) page faults: 282
    Voluntary context switches: 10
    Involuntary context switches: 0
    Swaps: 0
    File system inputs: 760
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

So I have to say I'm impressed with the actual performance of the regular expressions engine--especially considering "grep" is, IIRC, considered a fine-tuned beast.
Oct 14 2018
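
The source of the "dgrep" mentioned above was not posted, so the following is only a guess at what such a tool might look like, not the author's code. It assumes the pattern arrives as the single command-line argument and the input comes from stdin, matching the `cat dgrep.d | ./dgrep 'write'` invocation; the helper name `containsMatch` is made up for illustration:

```d
import std.regex : Regex, regex, matchFirst;
import std.stdio : stdin, stderr, writeln;

/// True if `line` contains at least one match for `re`.
bool containsMatch(const(char)[] line, Regex!char re)
{
    return !matchFirst(line, re).empty;
}

int main(string[] args)
{
    if (args.length != 2)
    {
        stderr.writeln("usage: dgrep PATTERN < input");
        return 2;
    }

    auto re = regex(args[1]);   // pattern is compiled once, at run time
    bool found = false;

    // byLine reuses its buffer, so each line is used before the next read.
    foreach (line; stdin.byLine)
    {
        if (containsMatch(line, re))
        {
            writeln(line);
            found = true;
        }
    }
    return found ? 0 : 1;       // mimic grep: 0 on a match, 1 otherwise
}
```

Because the pattern is only known at run time, this uses regex rather than ctRegex.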
prev sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Sunday, 14 October 2018 at 02:44:55 UTC, Chris Katko wrote:
 So wait, if their solution was to simply REMOVE std.regex from 
 isEmail.
That was ctRegex, which is different than regex.
 That doesn't solve the regex problem at all. And from what I 
 read in that thread, this penalty is paid per template 
 INSTANTIATION which could explode.
Template instantiation is a big issue for ctRegex, but not for regular regex.
Oct 13 2018
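
To make the regex/ctRegex distinction above concrete, here is a small sketch that uses both on the same pattern. The function names and the pattern are illustrative assumptions; `regex`, `ctRegex` and `matchFirst` are the std.regex entry points. The two functions should behave the same at run time and differ in when the pattern is processed:

```d
import std.regex : regex, ctRegex, matchFirst;

/// Pattern is parsed when the program runs.
string firstNumberRuntime(string text)
{
    auto re = regex(r"\d+");
    auto m = matchFirst(text, re);
    return m.empty ? null : m.hit;
}

/// Pattern is a template argument: the compiler processes it and
/// generates a specialized matcher, which costs compile time and memory.
string firstNumberCompileTime(string text)
{
    auto re = ctRegex!(r"\d+");
    auto m = matchFirst(text, re);
    return m.empty ? null : m.hit;
}

void main()
{
    assert(firstNumberRuntime("abc 123 def") == "123");
    assert(firstNumberCompileTime("abc 123 def") == "123");
    assert(firstNumberRuntime("no digits") is null);
}
```

Each distinct pattern passed to ctRegex is a separate template instantiation, which is the per-instantiation cost discussed in the linked thread.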