Benchmarking caveats
When running or writing benchmarks, one should be very careful.
Here are some problems to watch for:
The blatant cheat

In some cases, compilers just plain cheat on benchmarks, usually "industry-standard" ones. They can search for comments in the code, do a textual match on the entire body of a subroutine, or apply pattern-matching to the internal flowgraph representation of the benchmark itself. Once a benchmark is recognized, it is replaced with hand-crafted code designed to yield superlative performance.

To guard against this, it is always a good idea to perturb benchmarks a little bit to see if their performance changes. This won't help with compilers that pattern-match the flowgraph; for that, it may actually be necessary to change what the benchmark computes, so that the compiler will not recognize it. Examples of benchmarks that have been pattern-matched in this fashion in the past include Linpack and eqntott. A somewhat milder form of "blatant cheat" is to silently enable the best set of optimizer options when a benchmark is recognized (see "Flag fiascos" below).

The targeted optimization

Targeted optimization is a slightly less outrageous form of pattern-matching. There are many optimizations that a compiler might apply to code to make it go faster, and only a finite amount of time to implement and debug them. In practice, many (most?) don't actually get used. "Targeted" optimizations are not especially useful in general, but they improve performance in some important, quirky corner of an influential benchmark.
One targeted optimization to be especially
careful of is the kind that undoes obfuscations
added to microbenchmarks to prevent other optimizations.
For instance, in the NaturalBridge microbenchmarks, the
constant 1 is sometimes written like this:
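(The original NaturalBridge snippet is not reproduced here; the sketch below, with invented names, shows the general idea: the value is always 1, but it is not a compile-time constant, so a compiler can only recover it by specifically targeting the pattern.)

    // Hypothetical sketch: the value is always 1, but it is read from a
    // non-constant source, so an ordinary optimizer cannot fold it away.
    // A "targeted" optimization might recognize the pattern and substitute
    // the literal 1, undoing the obfuscation.
    public class ObfuscatedOne {
        static int one = Integer.parseInt("1");    // not a compile-time constant

        public static void main(String[] args) {
            long sum = 0;
            for (int i = 0; i < 1000000; i++) {
                sum += one;                        // the "constant" being measured
            }
            System.out.println(sum);               // keep the result live
        }
    }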
Flag fiascos

Another great way to improve performance on a benchmark is to search far and wide for just the right set of option flags to tweak the optimizer. Ten independent flags can be set in 1024 different ways, so if you really wanted the very best performance, you could recompile and benchmark all 1024 combinations to see which one gave the best performance. Since most people don't do this, benchmark numbers obtained in this fashion are suspect. To guard against this, try recompiling the benchmark with flag settings similar to those you normally use. Note that if you are actually motivated to go searching for just the right combination of flags, then this sort of thing is OK, because it matches your development style.

The special case

Benchmarks, especially microbenchmarks, typically run in only one thread. Systems designed to exploit this with special-case code for synchronization and memory allocation look fast when running tiny benchmarks, but not when running real Java™ programs that use threads for finalization, network connections, and AWT. This difference in performance can be even more striking if a multiprocessor is used, since synchronization on a multiprocessor must make use of bus-locking instructions that are not necessary in the uniprocessor case.
To guard against this, try adding a second idle thread to any benchmark, try running it on a multiprocessor, and compare the modified performance with the original performance. We use this class for an idle thread in our microbenchmarks.
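(The original class is not reproduced here; the following is a plausible reconstruction: a daemon thread that starts and then blocks forever, which is enough to push the VM off its single-threaded fast paths.)

    // Reconstruction of an idle-thread helper: the thread exists but does no
    // work, defeating single-thread special cases for synchronization and
    // allocation.
    public class IdleThread extends Thread {
        public IdleThread() {
            setDaemon(true);        // don't keep the VM alive on our account
        }
        public void run() {
            synchronized (this) {
                try {
                    wait();         // block indefinitely
                } catch (InterruptedException e) {
                    // interrupted: just exit
                }
            }
        }
        public static void main(String[] args) {
            new IdleThread().start();
            // ... run the microbenchmark here, with the idle thread alive ...
        }
    }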
Another special case sometimes used is noticing when a program's output is being discarded, and skipping some or all of the I/O processing that takes place before the actual output. At least one workstation vendor applied this trick to the SPEC sc benchmark.

Bait and switch

Modern high-performance just-in-time compilers (JITs) operate in at least two modes: a low-speed mode for infrequently executed code and a high-speed mode for the performance-critical parts of an application. However, conformance test technology has not really caught up with JIT technology, and few conformance tests seem to be available to test all modes. This leaves an opportunity to use non-conforming but faster techniques in the high-speed mode.

Bending the rules

Sometimes it is possible to make code run faster if it doesn't quite run correctly. For the Java language, getting floating point exactly right on Intel is one instance of this. The best known bit-for-bit compatible implementation of double-precision multiplication requires two scaling operations (FSCALE)
above and beyond the actual multiplication. Most
"Java" implementations aren't quite compatible, in that they
merely load and store all FP operands before and after
use. (Note that this changes with Java 2; the
new default floating point mode allows use of
a wider exponent range, and the new strictfp
keyword specifies bit-for-bit compatible floating
point arithmetic. Unfortunately, most Java
Virtual Machines on Intel don't implement this
either, though BulletTrain does as of release 1.1.0.)
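(A minimal illustration of the Java 2 rules just described; the class name is invented. The strictfp method must produce bit-for-bit identical IEEE 754 double results on every conforming JVM, while the default method may use a wider exponent range for intermediates on Intel.)

    // Sketch: strictfp vs. default floating point under the Java 2 rules.
    public class FpModes {
        static strictfp double strictProduct(double a, double b) {
            return a * b;   // bit-for-bit reproducible everywhere
        }
        static double defaultProduct(double a, double b) {
            return a * b;   // may use a wider exponent range on Intel
        }
        public static void main(String[] args) {
            System.out.println(strictProduct(1e308, 1e10));
            System.out.println(defaultProduct(1e308, 1e10));
        }
    }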
Rigging the game

Rigging the game is only feasible for vendors able to influence benchmarking organizations, but it can happen. For example, at least one supercomputer vendor once (privately) claimed that TPC-D was rigged because a particular database organization was required, and (this is what distinguishes a rigged game from one that is not) their happy customers used a different data organization. For another example, consider that the SPEC JVM benchmark does not allow results to be quoted for static compilers. This might make sense if it were never useful to statically compile bytecodes into a faster form, but that is not the case; for many applications, static compilation is a perfectly acceptable way to obtain improved performance.

Bogus baselines

Some benchmarks, in a misguided attempt to measure a particular aspect of a system's performance, take the difference of two performance measures. For example, they might try to measure the cost of computing the sine of a number by running two loops, one containing a sine calculation, the other empty. This is particularly vulnerable to simple error and outright subversion. Simple error can result from well-intentioned optimization gone awry. For example, an optimizer might more aggressively unroll and optimize a small loop; this would reduce the (mis)calculated net performance of the large loop's body. If, however, the optimizer does a poor job optimizing the smaller calibration loop, the net performance will be overstated instead. In the benchmarking world there are limits to how quickly the big loop can be executed, but the calibration loop can run arbitrarily slowly.

Competing optimizations

If you are trying to measure the effect of optimization X, it is very important to ensure that some other optimization Y is not skewing the results. The three usual "surprise" optimizations are dead code elimination, constant propagation, and inlining.
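Consider, for example, a microbenchmark loop along these lines (a hypothetical reconstruction; the original snippet is not shown, and the variable name x is taken from the discussion below):

    // The loop allocates an object on every iteration, but the result x is
    // never read afterwards.
    static void allocLoop(int n) {
        for (int i = 0; i < n; i++) {
            Object x = new Object();   // never used
        }
    }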
The loop doesn't use x, so a compiler might do anything from eliminating the allocation to eliminating the loop altogether. These effects can sometimes be subtle, depending upon what other things the compiler can prove about the code in question.
Constant propagation and inlining often work hand-in-hand with dead code elimination to undermine measurement. For example, suppose that loop were wrapped in a subroutine so as to "use" the result.
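A hypothetical sketch of that attempt (the helper names are invented; the original code is not reproduced here):

    // The loop now returns a count that appears to "use" the work, but a
    // caller that ignores the return value, combined with inlining and
    // constant propagation, lets dead code elimination remove it all anyway.
    static int allocLoopUsed(int n) {
        int count = 0;
        for (int i = 0; i < n; i++) {
            Object x = new Object();   // still unobservable
            count++;
        }
        return count;
    }

    static void measure() {
        allocLoopUsed(1000000);        // return value discarded
    }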
So, trying one more time to make that code not be dead, you decide to use a virtual call to invoke it; everybody knows that those can't be inlined, right?
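Again a hypothetical sketch: the loop is now invoked through an overridable method, but with only one implementation loaded a modern JIT can devirtualize and inline the call, and the dead code disappears just the same.

    // Virtual in the bytecode, but easy prey for devirtualization + inlining.
    class Worker {
        int work(int n) {
            int count = 0;
            for (int i = 0; i < n; i++) {
                Object x = new Object();
                count++;
            }
            return count;
        }
    }

    class Harness {
        static void measure(Worker w) {
            w.work(1000000);   // a virtual call, but still dead after inlining
        }
    }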
Here's another example, taken directly from the JavaGrande Fork-Join benchmark.
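(The verbatim JavaGrande source is not reproduced here; the sketch below reconstructs the pattern from the surrounding discussion, and the class names and thread count are assumptions.)

    // Worker threads are created, started, and joined, but their run() bodies
    // do no observable work, so everything except the fork/join machinery is
    // removable dead code.
    class ForkJoinWorker extends Thread {
        public void run() {
            // intentionally empty
        }
    }

    class ForkJoinDemo {
        public static void main(String[] args) throws InterruptedException {
            int nthreads = 4;                                // assumed
            Thread[] workers = new Thread[nthreads];
            for (int i = 0; i < nthreads; i++) workers[i] = new ForkJoinWorker();
            long start = System.currentTimeMillis();
            for (int i = 0; i < nthreads; i++) workers[i].start();
            for (int i = 0; i < nthreads; i++) workers[i].join();
            System.out.println("fork+join ms: " + (System.currentTimeMillis() - start));
        }
    }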
Lessons learned

The first lesson here is that even when writing your own benchmarks, especially microbenchmarks, you must be very careful about all the optimizations that you aren't necessarily trying to measure. It is important to use results (ideally, incorporate their value into something that is ultimately printed, as in the sketch below), and it is important to either thwart or preemptively apply any optimizations that you don't want contaminating your results. The second, and more important, lesson is that when running benchmarks written or selected by someone else, you should watch your wallet. The best benchmark is always your own application.
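As a minimal illustration of the first lesson (the names below are invented): fold every result into a value that is eventually printed, so the measured work cannot legally be discarded.

    // The checksum makes the loop's work observable, so dead code elimination
    // cannot remove what we are trying to time.
    public class UseYourResults {
        public static void main(String[] args) {
            long checksum = 0;
            long start = System.currentTimeMillis();
            for (int i = 0; i < 1000000; i++) {
                checksum += Math.sin(i) > 0 ? 1 : 0;   // the work being measured
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("elapsed ms: " + elapsed + ", checksum: " + checksum);
        }
    }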