digitalmars.D.announce - Fuzzed - a program to find DMDFE parser crash

Basile B. <b2.temp gmx.com> writes:
Fuzzed [1] is a simple fuzzer for the D programming language. It 
detects sequences of tokens that crash the parser. While 
the D front end is not yet used to make tools, if this ever 
happens the parser will have to accept invalid code. As 
experienced with dparse, invalid code tends to crash a parser more 
often, because of a cognitive bias that leads us, "hoomans", to prove 
that things work rather than the opposite.

You can run it on one of your cores, then report the crasher programs to 
the project issue tracker or fix them yourself:

 gdb dmd
 run <the_crasher>
 bt

Then try to see what happens in the parser at the location shown at the top of the backtrace. Note that you'll need a debug build of dmd.

In the time it took to write this announcement, 5 "crashers" were already found.

[1] https://github.com/BBasile/fuzzed
Dec 15 2018
Johan Engelen <j j.nl> writes:
On Saturday, 15 December 2018 at 11:29:45 UTC, Basile B. wrote:
 Fuzzed [1] is a simple fuzzer for the D programming language.
Are you familiar with libFuzzer and LDC's integration?
https://johanengelen.github.io/ldc/2018/01/14/Fuzzing-with-LDC.html

You can feed libFuzzer with a dictionary of keywords to speed up the initial fuzzing phase, where the keywords are the token strings that you use.
Besides finding crashes, it's also good to enable ASan to find memory-related bugs that by luck didn't crash the program.
 The time to write this announce, already 5 "crashers" found.
Great :)

The other day I was reminded of OSS Fuzz and that it'd be nice if we set up fuzzing for the frontend and Phobos there...

-Johan
Dec 15 2018
Basile B. <b2.temp gmx.com> writes:
On Saturday, 15 December 2018 at 14:22:48 UTC, Johan Engelen 
wrote:
 On Saturday, 15 December 2018 at 11:29:45 UTC, Basile B. wrote:
 Fuzzed [1] is a simple fuzzer for the D programming language.
 Are you familiar with libFuzzer and LDC's integration? https://johanengelen.github.io/ldc/2018/01/14/Fuzzing-with-LDC.html
No, but I'm not that surprised to see that a fuzzer already exists. I may even have seen this article but completely forgot about it.
 You can feed libFuzzer with a dictionary of keywords to speed 
 up the initial fuzzing phase, where the keywords are the tokens 
 strings that you use.
 Besides finding crashes, it's also good to enable ASan to find 
 memory-related bugs that by luck didn't crash the program.

 The time to write this announce, already 5 "crashers" found.
 Great :)
I have about 40 now
 The other day I was reminded of OSS Fuzz and that it'd be nice 
 if we would setup fuzzing for the frontend and phobos there...

 -Johan
I started looking at a crasher:

 typeof function function in

which crashes in hdrgen. Actually I realize that I don't like the D parser. In many cases it checks for errors but continues parsing unconditionally. In the example, "in" leads to a null contract that the pretty formatter dereferences at some point, but parsing should have stopped after "typeof" since there is no left paren.

Now take a look at the typeof sub-parser:

 AST.TypeQualified parseTypeof()
 {
     AST.TypeQualified t;
     const loc = token.loc;

     nextToken();
     check(TOK.leftParentheses); // <-- why continue if the check fails?
     if (token.value == TOK.return_)
     {
         nextToken();
         t = new AST.TypeReturn(loc);
     }
     else
     {
         AST.Expression exp = parseExpression();
         t = new AST.TypeTypeof(loc, exp);
     }
     check(TOK.rightParentheses);
     return t;
 }

I think this is what Walter calls "AST poisoning" (never understood how it worked before today). And the whole parser is like this.

This poisoning kills the interest of using a fuzzer. 99% of the crashes will be in hdrgen.
Dec 15 2018
Sebastiaan Koppe <mail skoppe.eu> writes:
On Saturday, 15 December 2018 at 15:37:19 UTC, Basile B. wrote:
 I think this is what Walter calls "AST poisoning" (never 
 understood how it worked before today). And the whole parser is 
 like this.

 This poisoning kills the interest of using a fuzzer. 99% of the 
 crashes will be in hdrgen.
As is common with fuzzing, you'll need to ensure the program crashes. Sometimes that requires some tweaking. Regardless, you still have the input to investigate.
Dec 15 2018
Neia Neutuladh <neia ikeran.org> writes:
On Sat, 15 Dec 2018 21:09:12 +0000, Sebastiaan Koppe wrote:
 On Saturday, 15 December 2018 at 15:37:19 UTC, Basile B. wrote:
 I think this is what Walter calls "AST poisoning" (never understood how
 it worked before today). And the whole parser is like this.

 This poisoning kills the interest of using a fuzzer. 99% of the crashes
 will be in hdrgen.
 As is common with fuzzing, you'll need to ensure the program crashes. Sometimes that requires some tweaking. Regardless, you still have the input to investigate.
I think the point is that DMD tries to recover from parsing failures in order to provide additional error messages. But those parsing failures leave the parser in an invalid state, and invalid states are fertile ground for crashes.

The way to fix this is to replace the entire parser and get rid of the idea of AST poisoning; at the first error, you give up on parsing the entire file. From there, you can try recovering from specific errors with proper testing.
Dec 15 2018
Walter Bright <newshound2 digitalmars.com> writes:
On 12/15/2018 2:48 PM, Neia Neutuladh wrote:
 The way to fix this is to replace the entire parser and get rid of the
 idea of AST poisoning; at the first error, you give up on parsing the
 entire file. From there, you can try recovering from specific errors with
 proper testing.
DMD tries to continue parsing after a syntax error, but it does not attempt semantic analysis if there were any errors.
Dec 15 2018
Basile B. <b2.temp gmx.com> writes:
On Sunday, 16 December 2018 at 01:57:17 UTC, Walter Bright wrote:
 On 12/15/2018 2:48 PM, Neia Neutuladh wrote:
 The way to fix this is to replace the entire parser and get 
 rid of the
 idea of AST poisoning; at the first error, you give up on 
 parsing the
 entire file. From there, you can try recovering from specific 
 errors with
 proper testing.
 DMD tries to continue parsing after a syntax error, but it does not attempt semantic analysis if there were any errors.
The problem I pointed out is rather that, as in the code that parses typeof, a non-null node is returned even when some expectations are not met during parsing. I'm not sure what the right fix is: fixing the AST pretty printer or the parser?
Dec 15 2018
Basile B. <b2.temp gmx.com> writes:
On Saturday, 15 December 2018 at 22:48:01 UTC, Neia Neutuladh 
wrote:
 The way to fix this is to replace the entire parser and get rid 
 of the idea of AST poisoning; at the first error, you give up 
 on parsing the entire file. From there, you can try recovering 
 from specific errors with proper testing.
You can still continue parsing after an error, but right now many sub-parsers always return an AstNode instead of null. When a sub-parser returns null, the parser could skip to the end of the scope or to the next statement, depending on what it expected, and continue from there. That being said, this wouldn't always work, e.g. when a semicolon or a curly brace is missing.

Simple example:

 struct Foo
 {
     int a, b
     string c; // error because a type identifier part wasn't expected ...
 } // ... we're in an aggr body so consume toks past the curly brace
 struct Bar
 {
 }
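
To make the idea above concrete, here is a minimal toy sketch, not DMD code. It assumes a pre-split token list and declarations of the form `type name ;` only, and every name in it (`Parser`, `expect`, `synchronize`, `parseDecl`, `parseBody`) is hypothetical. The sub-parser returns null when an expectation fails, and the caller skips to the next statement instead of keeping a half-built node.

```d
import std.stdio : writeln;

// Toy parser over pre-split tokens. All names are hypothetical, not DMD's.
struct Parser
{
    string[] toks;
    size_t pos;
    int errors;

    string peek() { return pos < toks.length ? toks[pos] : "<eof>"; }
    void next() { if (pos < toks.length) ++pos; }

    // Reports failure to the caller instead of silently continuing.
    bool expect(string t)
    {
        if (peek() == t) { next(); return true; }
        ++errors;
        return false;
    }

    // Skip past the next ';', or stop at '}' or at the end of input.
    void synchronize()
    {
        while (pos < toks.length && peek() != ";" && peek() != "}")
            next();
        if (peek() == ";")
            next();
    }

    // Parses `type name ;` and returns the name, or null on error.
    string parseDecl()
    {
        next();                 // skip the type
        string name = peek();
        next();                 // skip the name
        return expect(";") ? name : null;
    }

    // Collects declared names until '}' and resynchronizes after an error.
    string[] parseBody()
    {
        string[] names;
        while (pos < toks.length && peek() != "}")
        {
            string n = parseDecl();
            if (n !is null)
                names ~= n;
            else
                synchronize();
        }
        return names;
    }
}

void main()
{
    // `int b` lacks its semicolon; the declarations around it are fine.
    auto p = Parser(["int", "a", ";", "int", "b", "string", "c", ";",
                     "int", "d", ";", "}"]);
    auto names = p.parseBody();
    assert(names == ["a", "d"]);
    assert(p.errors == 1);
    writeln(names, " with ", p.errors, " error(s)");
}
```

Note that `string c` is lost during resynchronization. That is the price of this kind of recovery: one error is reported and no poisoned node reaches later passes.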
Dec 15 2018
Basile B. <b2.temp gmx.com> writes:
On Saturday, 15 December 2018 at 21:09:12 UTC, Sebastiaan Koppe 
wrote:
 On Saturday, 15 December 2018 at 15:37:19 UTC, Basile B. wrote:
 I think this is what Walter calls "AST poisoning" (never 
 understood how it worked before today). And the whole parser 
 is like this.

 This poisoning kills the interest of using a fuzzer. 99% of 
 the crashes will be in hdrgen.
 As is common with fuzzing, you'll need to ensure the program crashes.
Yes, this is done by piping the random code to dmd (I don't use dmd as a library for now). If the process returns something other than 0 (ok) or 1 (normal compiler error), then the random code is saved in a file:

 ...
 ProcessPipes pp = pipeProcess([Options.dc, "-"]);
 pp.stdin.writeln(src);
 pp.stdin.close;
 if (!pp.pid.wait.among(0, 1))
     fileName.write(src);
 ...

Actually it would be less convenient to do that with the front end as a library, since SEGFAULTs are supposed to kill the program...
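
For readers who want to try the exit-status check on its own, here is a self-contained sketch of the same idea. It is not the code from Fuzzed: `isCrasher`, `compileFromStdin` and the output name `crasher.d` are hypothetical, and the compiler path is assumed to be passed as the first command-line argument. On POSIX, `wait` from std.process returns a negative number when the child is killed by a signal, so a segfault typically shows up as -11.

```d
import std.algorithm.comparison : among;
import std.file : write;
import std.process : pipeProcess, wait, Redirect;

// Hypothetical helper: 0 is success and 1 is a regular compile error;
// any other status is treated as a crash.
bool isCrasher(int status)
{
    return !status.among(0, 1);
}

// Feeds `src` to the compiler on stdin and returns its exit status.
int compileFromStdin(string compiler, string src)
{
    // Only stdin is piped. The compiler's output goes to the parent's
    // streams, so no unread pipe can fill up while we wait.
    auto pp = pipeProcess([compiler, "-"], Redirect.stdin);
    pp.stdin.writeln(src);
    pp.stdin.close();
    return wait(pp.pid);
}

void main(string[] args)
{
    assert(!isCrasher(0) && !isCrasher(1));
    assert(isCrasher(2) && isCrasher(-11)); // -11: SIGSEGV on Linux

    if (args.length < 2)
        return; // no compiler given, nothing to run

    string src = "typeof function function in";
    if (isCrasher(compileFromStdin(args[1], src)))
        write("crasher.d", src); // keep the input for later investigation
}
```

Running the compiler as a child process is what makes this check possible: a segfault kills only the child, and the fuzzer survives to record the input.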
Dec 15 2018
Jacob Carlborg <doob me.com> writes:
On 2018-12-15 16:37, Basile B. wrote:

 This poisoning kills the interest of using a fuzzer. 99% of the crashes 
 will be in hdrgen.
Does that matter as long as the bug is found?

-- 
/Jacob Carlborg
Dec 16 2018
Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 16 December 2018 at 14:24:54 UTC, Jacob Carlborg wrote:
 On 2018-12-15 16:37, Basile B. wrote:

 This poisoning kills the interest of using a fuzzer. 99% of 
 the crashes will be in hdrgen.
 Does that matter as long as the bug is found?
Well it's hard to tell if it's begin. Generally, in a compiler that is focused on speed, you accept crashing on really bogus input if it makes the parser run faster, because there is no chance of accepting corrupted code, which is what you need to be worried about.
Dec 17 2018
Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 17 December 2018 at 10:12:44 UTC, Stefan Koch wrote:
 On Sunday, 16 December 2018 at 14:24:54 UTC, Jacob Carlborg 
 wrote:
 On 2018-12-15 16:37, Basile B. wrote:

 This poisoning kills the interest of using a fuzzer. 99% of 
 the crashes will be in hdrgen.
 Does that matter as long as the bug is found?
 Well it's hard to tell if it's begin.
meant to say benign.
Dec 17 2018
Sebastiaan Koppe <mail skoppe.eu> writes:
On Saturday, 15 December 2018 at 11:29:45 UTC, Basile B. wrote:
 Fuzzed [1] is a simple fuzzer for the D programming language. 
 It allows to detect sequences of tokens that crash the parser. 
 While the D front end is not yet used to make tools, if this 
 ever happens the parser will have to accept invalid code. As 
 experienced with dparse, invalid code tend to crash more a 
 parser because of a cognitive bias that lead us, "hoomans", to 
 prove that things work rather than the opposite.
Nice. In my experience fuzzing parsers works very well. I have good memories of afl. So much so that I once wrote a wrapper around it to handle running it distributed. See https://github.com/skoppe/afl-dist

It could use a readme and a how-to though.
Dec 15 2018
Walter Bright <newshound2 digitalmars.com> writes:
On 12/15/2018 3:29 AM, Basile B. wrote:
 The time to write this announce, already 5 "crashers" found.
Great! Please post them to bugzilla.
Dec 15 2018
Jacob Carlborg <doob me.com> writes:
On 2018-12-15 12:29, Basile B. wrote:
 While the D front end is not yet used to make tools
I've used it to make a tool, DLP [1].

[1] http://github.com/jacob-carlborg/dlp

-- 
/Jacob Carlborg
Dec 16 2018