I'm working on building new benchmarking tools for fuzzing. As part of this, we need to root out potential sources of measurement bias. One such source is described in Section 3.6.2 of the fuzzer evaluation guidelines, but I thought I'd reiterate it here with an example.
The "unwitting student" statement
Suppose that Alex has a new metric, M, that they want to build a fuzzer around.
This new metric isn't strongly related to classical coverage metrics, but Alex is very confident that it's a good indicator of test performance.
Alex develops a fuzzer guided by both this metric M and classical coverage metrics.
Alex runs their fuzzer and AFL++ for 24 hours each and, by comparing the produced corpora, shows that their fuzzer is better on the new metric M and slightly, though not severely, worse in code coverage.
With this result, Alex declares that their fuzzer is superior to AFL++ on metric M.
Can you see what Alex did wrong?
The coupon collector's measurement error
AFL++ guides on classical coverage metrics alone, whereas Alex's fuzzer guides on both classical coverage metrics and M.
As a result, Alex's fuzzer will add entries to the corpus (a.k.a. the "queue") that increase M, but AFL++ will only do so by chance.
If M isn't strongly related to classical coverage metrics, AFL++ becomes a coupon collector of M over the inputs which happen to be retained.
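To get a sense of the scale of the effect, here's a back-of-the-envelope sketch in Python. The numbers are purely illustrative (not from any real campaign), and it assumes each input hits a single uniformly random feature of M; the corpus then behaves like a coupon collector making far fewer draws than the campaign made executions:

```python
def expected_distinct_features(k, n):
    """Expected number of distinct features observed after n independent,
    uniform draws from k features (the standard coupon-collector expectation)."""
    return k * (1 - (1 - 1 / k) ** n)

k = 10_000  # hypothetical number of distinct M features
print(expected_distinct_features(k, n=1_000_000))  # ~10,000: executions see nearly all of M
print(expected_distinct_features(k, n=2_000))      # ~1,813: the retained corpus sees a fraction
```

Because retention ignores M, the corpus is effectively a much smaller random sample of M draws than the full set of executions, so M coverage measured over the corpus undercounts what the campaign actually exercised.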
To demonstrate this, I've prepared a little demo for you below (requires JS and WASM to run).
In this demo, we simulate the execution and corpus retention of inputs.
M1 and M2 are each computed as a random walk over a balanced binary tree of the specified size.
Inputs are retained if they increase M1; M2 plays no part in retention.
We then measure M2 over all executed inputs and over only the retained inputs.
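If you can't run the demo, here's a rough Python sketch of the same simulation (my own approximation, not the WebAssembly source; the parameter names are made up). Each tree node is a feature of the corresponding metric, a random walk marks the nodes it visits, retention looks only at M1, and M2 is measured both over all executions and over the retained corpus:

```python
import random

def random_walk(depth, rng):
    """Walk from the root of a balanced binary tree to a leaf and return the
    set of nodes visited (heap numbering: the children of node i are 2i and 2i+1)."""
    node = 1
    visited = {node}
    for _ in range(depth):
        node = 2 * node + rng.randint(0, 1)
        visited.add(node)
    return visited

def simulate(n_inputs=100_000, depth_m1=12, depth_m2=12, seed=0):
    rng = random.Random(seed)
    m1_corpus = set()    # M1 features covered by the retained corpus (guidance)
    m2_executed = set()  # M2 features over every executed input
    m2_retained = set()  # M2 features over retained inputs only
    for _ in range(n_inputs):
        # Each input yields independent M1 and M2 walks, modelling a metric
        # that isn't correlated with the guidance metric.
        m1 = random_walk(depth_m1, rng)
        m2 = random_walk(depth_m2, rng)
        m2_executed |= m2
        if not m1 <= m1_corpus:  # retain iff the input increases M1; M2 is ignored
            m1_corpus |= m1
            m2_retained |= m2
    total_m2 = 2 ** (depth_m2 + 1) - 1  # nodes in a full tree of this depth
    print(f"M2 over executed inputs: {len(m2_executed)}/{total_m2}")
    print(f"M2 over retained inputs: {len(m2_retained)}/{total_m2}")

simulate()
```

With these defaults you should see M2 coverage over the retained inputs land well below M2 coverage over all executed inputs, and raising `depth_m2` widens the gap.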
(Interactive demo)
As you can see, M2 coverage is lower over the retained inputs than over the executed inputs.
Worse, the more precise M2 is (i.e., the larger its feature space), the more pronounced the effect.
As a result, we cannot rely on the corpora as a source of truth for the coverage of metrics which aren't tracked by the corresponding fuzzer.
In practice
There are some papers which make this mistake, resulting in unfair comparisons; this is something to be wary of when reviewing.
This also appears in standard fuzzing benchmarks like FuzzBench, where saved inputs are used as the source of truth for measurement. If a fuzzer guides on a different metric, it may appear to perform worse than other fuzzers because the only measurement is coverage, when in reality it might be testing the regions it does cover more effectively. Moreover, if we were to extend FuzzBench to measure new metrics, we would need to take measurements over all executed inputs, not just those which were saved. In the same vein, if you want to use historical fuzzing data to measure how different fuzzers perform on a new metric, you cannot rely on the fuzzer snapshot archives to do so!
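In code, the fix is structural: hook the measurement into the execution loop rather than post-processing the corpus archive. Below is a minimal sketch of that separation (hypothetical names throughout; this is not FuzzBench's actual API):

```python
import random

class MetricRecorder:
    """Accumulates features of the metric under evaluation for every execution,
    regardless of whether the fuzzer decides to keep the input."""
    def __init__(self):
        self.features = set()

    def record(self, features):
        self.features |= set(features)

def run_campaign(generate_input, execute, recorder, n_execs):
    """Toy campaign loop: retention is driven by guidance coverage only, while
    the recorder measures the new metric over all executions."""
    corpus = []
    guidance_cov = set()
    for _ in range(n_execs):
        inp = generate_input()
        cov, metric = execute(inp)        # (guidance features, new-metric features)
        recorder.record(metric)           # measured at execution time, always
        if not set(cov) <= guidance_cov:  # retain only if guidance coverage grows
            guidance_cov |= set(cov)
            corpus.append(inp)
    return corpus

# Toy usage with random "executions" standing in for a real target.
rng = random.Random(0)
recorder = MetricRecorder()
corpus = run_campaign(
    generate_input=lambda: rng.getrandbits(32),
    execute=lambda inp: ({rng.randrange(500)}, {rng.randrange(500)}),
    recorder=recorder,
    n_execs=10_000,
)
print(len(recorder.features), "metric features over all executions")
```

Reading `recorder.features` at the end gives the metric over everything executed; measuring only the inputs left in `corpus` reproduces exactly the bias described above.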
Takeaways
Be very careful when measuring fuzzer performance. Measuring over the inputs in the saved corpus is only valid if the metric being measured is the one the fuzzer uses for guidance. Otherwise, you need to measure every executed input.