To make sure LLVM continues to optimize code well, we use both post-commit performance tracking and pre-commit evaluation of new optimization patches. As compiler writers, we wish that the performance of code generated could be characterized by a single number, making it straightforward to decide from an experiment whether code generation is better or worse. Unfortunately, performance of generated code needs to be characterized as a distribution, since effects not completely under control of the compiler, such as heap, stack and code layout or initial state in the processors prediction tables, have a potentially large influence on performance. For example, it's not uncommon when benchmarking a new optimization pass that clearly makes code better, the performance results do show some regressions. But are these regressions due to a problem with the patch, or due to noise effects not under the control of the compiler? Often, the noise levels in performance results are much larger than the expected improvement a patch will make. How can we properly conclude what the true effect of a patch is when the noise is larger than the signal we're looking for?
When we see an experiment that shows a regression while we know that on theoretical grounds the generated code is better, we see a symptom of only measuring a single sample out of the theoretical space of all not-under-the-compiler's-control factors, e.g. code and data layout variation.
In this presentation I'll explain this problem in a bit more detail; I'll summarize suggestions for solving this problem from academic literature; I'll indicate what features in LNT we already have to try and tackle this problem; and I'll show the results of my own experiments on randomizing code layout to try and avoid measurement bias.