I've been thinking a lot recently about what it means to do research in fuzzing. Most of what we do is a loose collection of "things we know to generally work pretty well", but without a true understanding of why.

I do often wonder what this means for us as a topic. We use fuzzing in many important projects, and yet we don't really understand its bounds. What can fuzzers really do, what can they not do, and how do we tell which restrictions are general limitations of fuzzing versus limitations specific to particular targets?

I'm not an expert in this regard, but I care deeply about making fuzzing widely available, and I worry about how we actually measure how "good" fuzzers are. Moreover, I'll be running this upcoming year's SBFT fuzzing competition, and I want to think about this out loud some more.

If you have opinions on this, please do contact me. I want to know what others think about fuzzing research.

A contest by any other name

What is the purpose of an evaluation in the context of software testing? How, and more importantly, what, do we quantitatively measure about different software testing techniques?

For many, it is about winning. That is, answering "what is the best testing technique today?" Frankly, I think this is the wrong question. We spend so long worrying about which technique is best that we forget that none of them is complete.

Many modern fuzzing papers measure effectiveness comparatively, showing that their fuzzer is somehow better than another. This is important to know, of course; if your strategy entirely supersedes another, fantastic, let's use yours instead. But what does it mean to supersede another? Merely covering more isn't enough, and even if it were, we rarely observe that a fuzzer which hits more edges by count actually covers a superset of the edges reached by the fuzzers it "beats". You can see this effect in FuzzBench, where even the best-performing fuzzers miss hundreds of edges discovered by worse-performing fuzzers.
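To make that concrete, here is a minimal sketch of the set comparison I mean. The file names and format (one covered edge identifier per line) are hypothetical placeholders, not any particular tool's output:

```python
# Sketch: compare the *sets* of edges two fuzzers cover, not just the counts.
# Input format (one edge ID per line) and file names are assumptions.

def load_edges(path: str) -> set[str]:
    """Read one covered edge identifier per line."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

fuzzer_a = load_edges("fuzzer_a_edges.txt")
fuzzer_b = load_edges("fuzzer_b_edges.txt")

print(f"A covers {len(fuzzer_a)} edges, B covers {len(fuzzer_b)} edges")
# Even if A's count is higher, B often still reaches edges A never saw:
only_b = fuzzer_b - fuzzer_a
print(f"edges only B found: {len(only_b)}")
print(f"A covers a superset of B's edges: {fuzzer_b <= fuzzer_a}")
```

When that superset check fails, as it commonly does, a raw count comparison is hiding real coverage that only the "losing" fuzzer reached.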

Clearly, just counting the edges isn't enough. Our ultimate goal is to find the most bugs, so: What about measuring bug discovery?

Bug benchmarks?

a screenshot from Discord from the user eqv

eqv, screenshotted above, is Cornelius Aschermann, who is similarly concerned with how well recent fuzzer evaluations actually work. Recently, he concluded that evaluating based on the discovery of bugs is not a good predictor of discovering other bugs -- or new ones.

To quote for those with screenreaders:

(unless I have fucked up somewhere) the eval on edges predicts finding more bugs with the better fuzzer in 67.9% of the cases, while bug eval predicted right in only 64.1% of the cases.

i.e. the user would be substantially better [off] to listen to the edges eval than the bugs eval

mostly because bugs aren't a good predictor of bugs, because they are too noisy

Similarly, he agrees that most fuzzing evaluations are ineffective at best and often misleading:

a second screenshot from Discord from the user eqv

To quote for those with screenreaders, or for those who want to copy/paste:

so a couple of caveats:

  1. if you look at the actual results of Fuzzbench, for most fuzzers the coverage is VERY similar, this result would probably get a lot stronger if we look at instances where coverage is actually different by a statistically [significant] amount
  2. bugs are inherently quite noisy - we don't really expect anything to predict bugfinding very well, since for a lot of targets & fuzzers, the same fuzzer finds the bug only in ~50% of the runs. It's really hard to predict something that's almost completely random.
  3. A substantial fraction of fuzzing papers are pointless and bordering on fraud in evaluation-overclaiming so the authors get their PhD 😛
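The first two caveats are easy to demonstrate. Below is a small sketch, with invented numbers rather than real FuzzBench data, of how one might approach them: a Mann-Whitney U test on per-trial edge counts to check whether a coverage difference is statistically meaningful, and a coin-flip simulation showing how noisy a bug that reproduces in ~50% of runs really is:

```python
# Sketch with invented numbers; not taken from any real evaluation.
from scipy.stats import mannwhitneyu
import random

# Caveat 1: is a coverage difference across repeated trials significant,
# or just run-to-run noise?
fuzzer_a_edges = [10412, 10388, 10450, 10397, 10421, 10433, 10405, 10418]
fuzzer_b_edges = [10420, 10401, 10415, 10390, 10437, 10408, 10426, 10395]
stat, p = mannwhitneyu(fuzzer_a_edges, fuzzer_b_edges, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.3f}")  # p well above 0.05: no evidence of a difference

# Caveat 2: a bug found in ~50% of runs is essentially a coin flip, so a
# handful of trials cannot reliably rank fuzzers by bug discovery.
random.seed(0)
hits = sum(random.random() < 0.5 for _ in range(10))
print(f"bug reproduced in {hits}/10 runs of the *same* fuzzer")
```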

And, again, this is all still asking the wrong question. The goal is not to find the best fuzzer, but to find bugs we haven't previously been able to find.

Where to start?

Another question, not listed here but often discussed in the same Discord server: how do we know what actually represents a meaningful improvement? Many fuzzers are evaluated on how quickly they find edges, but this is largely irrelevant in practice when you consider the sheer amount of time and compute Google already spends on fuzzing. What matters is whether bugs are discoverable at all. We must shift our focus from covering code quickly to testing it thoroughly, and that means not evaluating from scratch but designing evaluations that tell us whether we can realistically find bugs at all. Perhaps we should develop fuzzers specifically to find known bugs that existing fuzzers have so far failed to discover.

I make fun of myself somewhat for so frequently referencing this, but I believe the libwebp bug to be the greatest failure of fuzzing research. A critical vulnerability, not easily found by manual analysis, which was theoretically reachable by fuzzing -- but not with the exploration strategies that existed, or at least were actively used, at the time. I fear that many such latent vulnerabilities still await us, and understanding the failures which led to the webp bug being missed is critical to finding them.

The influence of trying to be "the best"

When the goal is to be "the best" rather than to advance the state of the art, it inspires problematic evaluations. If we merely need to show that we are "better" than some existing solution, we commit ourselves to incremental change at best and, more frequently, to selective, non-scientific evaluations which do not meaningfully contribute to the world. For the third time, I will link to our analyses of many papers in this regard, but even that work does not cover all the issues. In particular, we mostly describe how to better perform the "better than X" evaluations -- because we don't yet know how to compare fuzzers more meaningfully.

I worry also that we stifle research which is capable of exploring programs better, but which performs worse by aggregated metrics like bug counts and code coverage. For example, if a fuzzer were written that explored programs under test in meaningfully different ways and wasn't just "AFL++, but a little different", I wouldn't expect it to magically find new bugs or to necessarily beat the aggregate edge counts of fuzzers built on nearly a decade of research into coverage-guided fuzzing.

Similarly, if you just test targets which have never been tested, you will of course find new bugs. That doesn't mean that the fuzzer is better, merely that we have failed as practitioners to distribute the tools to the masses.

We need better ways of comparing fuzzers so we can find strategies which do something new, not just something a little better. That means making room for fuzzers which may not find new bugs or new coverage, but which explore programs in meaningfully different ways. We still need a way to work out what is meaningfully and interestingly new versus what is strictly worse -- and I don't know if such a metric yet exists.

There is no magic bullet -- and we shouldn't expect there to be

There is a greater question at play, unspoken through all of this: is fuzzing enough? The answer is obviously no, but it's easy to let ourselves think "yes". Fuzzing is nice because it allows us to explore the program progressively, and find bugs automatically.

But finding bugs automatically isn't always possible. Perhaps we need to look instead to the horizon: if we want to make better tools, more capable of finding bugs, we should invest in strategies that enable the discovery of bugs. It seems obvious, but the way to do that is not mass automation; it is improving usability and making these tools more available to those who need them. It is developing fuzzers which are specialised to certain targets, and giving others the means to do the same for their own targets.

Similarly, we should invest further in understanding the limits of fuzzing, at least as it currently stands. If we can offer clearer indications of what the fuzzer has not explored, or cannot explore or test, then other forms of testing can be applied to what remains untested.

The future of fuzzing, and of fuzzing evaluations, must lie in the pursuit of new capabilities and better usability, not in marginal improvements over what we already have.