I have come to realise (or rather, I have become more and more convinced) that fuzzing research has stalled not because we have no further contributions to make, but because the contributions we are making are either incremental and merely sound impressive, or are presented in ways that obscure their utility. To be more concrete: we are spending our time trying to "improve" fuzzing in general rather than identifying what can be improved; everyone is trying to be "the best" rather than trying to understand what is actually happening. This is not the first time I have felt this, but my understanding of the problem has improved in the last two years. It's time for a revisit!
Last year, I was involved in a paper which tried to standardise fuzzer evaluation. While I still think that this paper is incredibly important in providing baseline evaluation requirements, something that I've only realised in the last year or so is that it asks the wrong questions.
Significance?
Statistical significance is the gold standard for scientific advancement: it shows that there is indeed a difference between two experimental configurations. The only problem is that it is incredibly trivial to achieve statistical significance in fuzzing.
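To make "trivial" concrete, here is a minimal sketch of the comparison most evaluations boil down to: repeat each fuzzer for a handful of trials, take the final edge coverage of each trial, and run a Mann-Whitney U test (often alongside a Vargha-Delaney A12 effect size). The coverage numbers below are invented purely for illustration.

```python
# Hypothetical final edge-coverage counts from 10 repeated trials per
# fuzzer. The values are made up; only the shape of the comparison matters.
from scipy.stats import mannwhitneyu

fuzzer_a = [10215, 10230, 10198, 10244, 10221, 10237, 10209, 10226, 10233, 10218]
fuzzer_b = [10190, 10205, 10181, 10212, 10197, 10208, 10186, 10201, 10194, 10199]

# Mann-Whitney U: "are these two samples drawn from different distributions?"
_, p_value = mannwhitneyu(fuzzer_a, fuzzer_b, alternative="two-sided")

# Vargha-Delaney A12: probability that a random trial of A beats a random
# trial of B (0.5 means no difference).
wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
           for a in fuzzer_a for b in fuzzer_b)
a12 = wins / (len(fuzzer_a) * len(fuzzer_b))

print(f"p = {p_value:.4f}, A12 = {a12:.2f}")
# A consistent ~0.25% coverage difference sails past p < 0.05 with only
# ten trials, and even the effect size reads as "large" despite the gain
# being roughly 25 edges out of ~10,000. Neither number tells you whether
# the difference matters.
```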
Case study 1: SBFT'25
Last time, I mentioned that I would be running the fuzzing competition for SBFT'25. Despite drawing only two contestants, I think this competition truly highlighted the problem of fuzzer evaluation. The first contestant ensembled AFL++ and LibAFL and applied corpus minimisation at fixed intervals during execution. The second ensembled AFL++, LibAFL, FOX, and ZTaint-Havoc. This second submission is an engineering marvel, utilising program-specific knowledge to give the fuzzers far better guidance in their search.
The kicker? The first contestant, Kraken, won, though this is in part due to a bug that caused the second contestant, HFuzz, to crash on one of the targets. In raw scores, Kraken beat HFuzz on 3 of the targets (4, if we include the crashing target), and HFuzz beat Kraken on 4. The improvements shown by these fuzzers are statistically significant by classical evaluation metrics and tests.
Yet, I suspect that if you handed these tools to a bunch of reviewers, they would reject Kraken and accept HFuzz. Why? The fuzzers involved in HFuzz are much more technically interesting; indeed, FOX was accepted at CCS'24 and ZTaint-Havoc was accepted at ISSTA'25, an A* and an A venue, respectively.
Nevertheless, by classical evaluation metrics, these fuzzers contribute equally. Kraken demonstrates something known in the "folklore" literature: intermittent, rather than continuous, corpus minimisation and synchronisation massively benefits ensemble performance. HFuzz exploits the ensembling of technically advanced fuzzer strategies to gain its advantage. These are orthogonal contributions of equal importance to the design of fuzzing campaigns.
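For readers unfamiliar with the pattern, here is a toy sketch of what intermittent minimisation and synchronisation looks like in an ensemble. Everything below (the fake coverage function, the interval count, the two-fuzzer setup) is invented for illustration; it is not Kraken's code.

```python
import random

def coverage_of(seed):
    # Stand-in for running the target: derive a fake edge set from the seed.
    rng = random.Random(seed)
    return frozenset(rng.sample(range(100), 10))

def minimise(corpus):
    # Greedy corpus minimisation: keep only seeds that add new edges.
    kept, covered = set(), set()
    for seed in corpus:
        new_edges = coverage_of(seed) - covered
        if new_edges:
            kept.add(seed)
            covered |= new_edges
    return kept

def fuzz_interval(local_corpus):
    # Stand-in for one interval of fuzzing: a component fuzzer grows its
    # own local corpus independently.
    return local_corpus | {random.randrange(10**6) for _ in range(50)}

shared = {0}
local_corpora = [set(shared) for _ in range(2)]   # two component fuzzers
for interval in range(4):                         # four fixed-length intervals
    local_corpora = [fuzz_interval(c) for c in local_corpora]
    # Only now do we synchronise and minimise -- not on every execution.
    shared = minimise(shared.union(*local_corpora))
    local_corpora = [set(shared) for _ in local_corpora]
    print(f"interval {interval}: {len(shared)} seeds retained in shared corpus")
```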
Looking closer: HFuzz
HFuzz is composed of many fuzzers, which somewhat obscures the contributions of its individual components. Looking closer, FOX and ZTaint-Havoc, both by the same group that submitted HFuzz, demonstrate some interesting contributions. In particular, they show extreme coverage improvements on certain targets, improvements which go beyond what can be ascribed to performance gains or to basic, but clever, changes to the algorithm. That is meaningful and important! But it is sadly underemphasised due to our field's obsession with general improvement.
One thing I admire in particular about these papers is that they present other metrics alongside the baseline typically used. The effect is that we get to see how their contributions affect things that we as a community may be turning a blind eye to; after all, there is still no conclusive proof that edge coverage is the primary signal for finding bugs in real programs. Indeed, we can prove that there are some programs for which there is no relationship between coverage and test efficacy (e.g., crypto algorithms).
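To see why the caveat about crypto code matters, consider a contrived stand-in (my own toy, not drawn from either paper): a bug guarded by a hash comparison. Edge coverage saturates after a handful of executions, yet coverage feedback offers no gradient whatsoever towards the bug.

```python
import hashlib, os

SECRET = hashlib.sha256(b"secret").digest()[:8]    # hypothetical embedded digest

def target(data: bytes) -> str:
    if len(data) < 8:
        return "too_short"                          # branch A
    if hashlib.sha256(data).digest()[:8] == SECRET:
        return "bug"                                # branch C: guarded by the hash
    return "rejected"                               # branch B

branches_hit = set()
for i in range(10_000):
    branches_hit.add(target(os.urandom(i % 16)))    # random inputs, varying length

print(branches_hit)
# Prints {'too_short', 'rejected'} (in some order): coverage plateaus almost
# immediately, and no coverage-guided mutation strategy moves us towards
# "bug", because there is no intermediate edge to reward along the way.
```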
Looking even closer
On the other hand, these papers are tainted with extremely subtle evaluation errors that make me question whether they are even generally better. This is not the first time this has happened, and it certainly won't be the last; it is effectively impossible to get fuzzer evaluations right. A follow-up with the errors corrected showed that the improvements were at least halved, though the results remained significant.
That's just accounting for experimental error, too; there are some recent findings (which I sadly cannot share here) that make me question the validity of nearly any evaluation performed with benchmarking, even if it is done "perfectly".
Case study 2: DARWIN
DARWIN won a distinguished paper award at NDSS'23. This paper proposed to use evolutionary algorithms to optimise the selection of mutations used during a fuzzing campaign. For many readers (i.e., "fellow researchers I hang out with on Discord"), this set off alarm bells; mutation selection strategies based on EAs/GAs are almost universally disproven mere weeks after publication.
While acknowledging my personal bias against such papers, I investigated whether the EA used in DARWIN was meaningfully contributing to its improvement over its chosen baseline. Not only was I unable to reproduce the results presented in the paper (15/19 reported improvements over the baseline; I found 4/18), but using a random reweighting instead of their EA made no statistically significant difference. In other words: the reported improvement was not reproduced, and, even in the cases where it was, the improvement could just as well have come from a random reweighting as from a clever algorithm. This result ended up as section 4.3 of the paper I mentioned at the start of this post. It's worth noting that, while the difference in findings (15/19 vs. 4/18) is large, it could still be the result of subtle and benign differences in the experimental environment. Yet this is not something we can determine, as neither the data nor the full artifact was made available in the end.
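For clarity, this is roughly what "random reweighting instead of their EA" means as an ablation. The mutator names and scaffolding below are hypothetical stand-ins, not DARWIN's implementation or the evaluation harness.

```python
import random

MUTATORS = ["bitflip", "byteflip", "arith", "havoc", "splice"]

def random_reweight():
    # The null-hypothesis scheduler: weights drawn uniformly at random,
    # refreshed at the same cadence the learned weights would be.
    weights = [random.random() for _ in MUTATORS]
    total = sum(weights)
    return [w / total for w in weights]

def pick_mutator(weights):
    # The fuzzing loop itself is untouched; only where `weights` comes
    # from differs between the "clever" and the baseline configuration.
    return random.choices(MUTATORS, weights=weights, k=1)[0]

weights = random_reweight()
print([pick_mutator(weights) for _ in range(10)])
# If campaigns driven by random_reweight() are statistically
# indistinguishable from campaigns driven by the learned weights, the
# learning component is not what produces the improvement.
```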
Zooming out
Fuzzer evaluation is remarkably hard and fragile; the slightest errors can propagate into the difference between significant and not significant, and are often not even due to human error but simply to differences in execution environment. Fuzzer development is similarly fraught; the improvements demonstrated by simple strategies can be as significant as those of technically complex ones. Ask around in the fuzzing community: results that seem promising on the surface are often of no more consequence than things we have known for a long time. Papers are published that show "improvement", but with little regard for whether the work is actually meaningful for any real target beyond slightly increasing coverage over a 24-hour window. After all: if it takes more than 24 hours to implement a new technique, or even just to get it running, then it's not worth using; just run the original fuzzer for longer!
Moreover, consider this: statistical significance shows only that two things are different. Given infinite trials, we would be able to distinguish any two fuzzers. Significance alone is therefore basically irrelevant; our findings must not only be statistically significant, but meaningful.
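A quick synthetic simulation of that point: hold a practically negligible coverage difference fixed and only grow the number of trials. The numbers are synthetic and exist purely to show the mechanics of the test, not to represent any real fuzzer.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
for trials in (10, 100, 1_000, 10_000):
    a = rng.normal(10_000, 50, trials)   # fuzzer A: ~10,000 edges on average
    b = rng.normal(10_010, 50, trials)   # fuzzer B: ~0.1% more, on average
    p = mannwhitneyu(a, b, alternative="two-sided").pvalue
    print(f"{trials:6d} trials per fuzzer -> p = {p:.3g}")
# With enough repetitions, any fixed difference eventually comes out
# "significant"; the test tells us the fuzzers differ, not that the
# difference is worth anything.
```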
Call to action
These problems merely scratch the surface, and they are things the fuzzing community has known about for some time. Yet they persist.
If you are writing new fuzzing papers, please: identify whether your contribution is meaningful beyond the classic understanding of statistical significance. We have shown, again and again at this point, that an incremental coverage improvement over a 24-hour period from scratch is just not meaningful. Even if your fuzzer is amazing at just one or two targets, or if it finds one or two new bugs that have never been found before in something that has been thoroughly fuzzed: that is a meaningful result! If you are able to fuzz something that could never be fuzzed before, and can show that you are meaningfully testing that new target: that is a meaningful contribution, even if you don't find any bugs. If you have new ways of measuring whether a program has been tested effectively, and can show that your measure is a stronger guidance metric than coverage: that is a meaningful contribution. Any of these contributions is far more meaningful than a 2% coverage "increase" on targets that have been beaten to death at this point, where we've already seen that coverage a million times in OSS-Fuzz or similar.
If you are reviewing fuzzing papers (or artifacts): you must ask the hard questions. Meaningful, general improvements on the benchmarkable targets are going to be rare, if not utterly impossible, without major paradigm shifts in the core principles of fuzzing, so don't focus on or demand them when they're simply irrelevant. Instead, ask yourself (and the authors): do the results presented in the paper actually suggest something meaningful for our field, and why? Work that improves our understanding of the relationships between programs and inputs, that makes fuzzers more usable, or that addresses fundamental issues with classical fuzzers is, and must be, the future of our field. We must ensure that these papers receive the support (and the scrutiny) that they deserve, and begin to dismiss papers with lofty goals and claims of general improvement but little evidence that their findings are a direct result of (and only of) their specific contributions. Moreover, we must become more aggressive in our evaluation of artifacts; too many papers have been accepted with fundamental evaluation errors (in both design and implementation) that could have been caught earlier. In many ways, our field is embroiled in a reproducibility crisis because of lax review, much of which is due to larger problems affecting the security and software engineering conferences as a whole, but I digress.
We must change the status quo of the papers in our field. It is tragic that the stark majority of the papers shared in my communities are met with doubt rather than pride, and with contempt rather than excitement, because of this trend in research. More than anything else, there is so much opportunity for real progress that goes untaken because we depend so heavily on "trying to make it past the reviewers" rather than "trying to advance the state of the art meaningfully". We, as authors, must deeply investigate the gaps in our knowledge, and we, as reviewers, must be exacting, inquisitive, and open to papers that truly push the envelope rather than offer meaningless incremental progress.
Acknowledgements
Much of this is founded in conversations I had on the fuzzing Discord. I would like to thank Cornelius Aschermann for jading me further about fuzzer evaluation. Conversely, I want to thank Marcel Böhme for making me less jaded about the field when I raised similar topics with him in a conversation a few months ago.