
Benchmarking caveats

When running or writing benchmarks, one should be very careful. Here are some problems to watch for:

The blatant cheat

In some cases, compilers just plain cheat on benchmarks, usually "industry-standard" ones. They can search for comments in the code, or simply do a textual match on the entire body of the subroutine, or they can apply pattern-matching to the internal flowgraph representation of the benchmark itself. Once a benchmark is discovered, it is replaced with hand-crafted code designed to yield superlative performance.

To guard against this, it is always a good idea to perturb benchmarks a little bit to see if their performance changes. This won't help with compilers that pattern-match the flowgraph; for that, it may actually be necessary to change what the benchmark computes, so that the compiler will not recognize it.

Examples of benchmarks that have been pattern-matched in this fashion in the past include Linpack and eqntott. A somewhat milder form of "blatant cheat" is to silently enable the best set of optimizer options when a benchmark is recognized (see "flag fiascos").
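The perturbation test suggested above can be sketched as follows. This is a hypothetical harness (the kernels, names, and timing style are mine, not from any benchmark suite): time a kernel and a slightly perturbed variant that computes something different; on an honest compiler the two should run at nearly the same speed, so a suspiciously large gap hints at pattern-matching.

```java
import java.util.function.LongSupplier;

public class PerturbCheck {
    // Original kernel.
    static long sum(int n, int c) {
        long s = 0;
        for (int i = 0; i < n; i++) s += i * c;
        return s;
    }

    // Perturbed kernel: changes what is computed, so a flowgraph
    // pattern-matcher should no longer recognize it.
    static long sumPerturbed(int n, int c) {
        long s = 1;
        for (int i = 0; i < n; i++) s += i * c + (i & 1);
        return s;
    }

    static long time(LongSupplier body) {
        long t0 = System.nanoTime();
        body.getAsLong();
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        final int n = 10_000_000;
        long tA = time(() -> sum(n, 3));
        long tB = time(() -> sumPerturbed(n, 3));
        // A large ratio between these two suggests something fishy.
        System.out.println("original=" + tA + "ns perturbed=" + tB + "ns");
    }
}
```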

The targeted optimization

Targeted optimization is a slightly less outrageous form of pattern-matching. There are many optimizations that a compiler might apply to code to make it go faster, and only a finite amount of time to implement and debug them. In practice, many (most?) don't actually get used. "Targeted" optimizations are not especially useful in general, but they improve performance in some important, quirky corner of an influential benchmark.

One targeted optimization to be especially careful of is the kind that undoes obfuscations added to microbenchmarks to prevent other optimizations. For instance, in the NaturalBridge microbenchmarks, the constant 1 is sometimes written like this:

  static int oddCounter;
  static int one() {
    if ((oddCounter & 1) == 0) return 1;
    else return (oddCounter * 3) & 1;
  }
If you look carefully at this code, you'll see that it always returns the value one, but not in a way that a compiler should figure out. Amazing as it may seem, this code is potentially vulnerable to the right set of general, non-targeted optimizations, so it should probably be improved. Constant propagation at the bit level, which some C and Fortran compilers employ to avoid the overhead of sign-extension operations, could turn this into "return 1", and then inline that where it is called.
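As a quick sanity check (a hypothetical harness, not part of the original suite), one can brute-force both parities and both signs of oddCounter to confirm that the obfuscation changes form but not meaning:

```java
public class OneCheck {
    static int oddCounter;

    // The obfuscated constant from the text: if oddCounter is even,
    // return 1 directly; if it is odd, odd * 3 is still odd, so its
    // low bit is 1 (this holds for negative odd values too, since
    // two's-complement negative odd numbers also have low bit 1).
    static int one() {
        if ((oddCounter & 1) == 0) return 1;
        else return (oddCounter * 3) & 1;
    }

    public static void main(String[] args) {
        for (int c = -4; c <= 4; c++) {
            oddCounter = c;
            if (one() != 1) throw new AssertionError("one() != 1 for " + c);
        }
        System.out.println("one() == 1 for all tested counters");
    }
}
```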

Flag fiascos

Another great way to improve performance on a benchmark is to search far and wide for just the right set of option flags to tweak the optimizer. Ten independent flags can be set in 1024 different ways; so if you really wanted the very best performance, you could recompile and benchmark 1024 different ways to see which combination gave the best performance. Since most people don't do this, benchmark numbers obtained in this fashion are suspect. To guard against this, try recompiling the benchmark with flag settings similar to those you normally use.

Note that if you are actually motivated to go searching for just the right combinations of flags, then this sort of thing is ok, because it matches your development style.
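The size of the search space is easy to see concretely. The sketch below (the flag names are hypothetical, and real compilers take flags on the command line, not like this) just enumerates every subset of a flag set; with n independent flags there are 2^n combinations, hence 1024 for ten flags:

```java
public class FlagSearch {
    public static void main(String[] args) {
        // Hypothetical optimizer flags, for illustration only.
        String[] flags = {"-O2", "-inline", "-unroll", "-vectorize"};
        int n = flags.length;
        // Each bit of mask selects one flag on or off.
        for (int mask = 0; mask < (1 << n); mask++) {
            StringBuilder cmd = new StringBuilder("compile");
            for (int i = 0; i < n; i++)
                if ((mask & (1 << i)) != 0) cmd.append(' ').append(flags[i]);
            System.out.println(cmd);
        }
        System.out.println("combinations: " + (1 << n)); // 16 here; 1024 for 10 flags
    }
}
```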

The special case

Benchmarks, especially microbenchmarks, typically run in only one thread. Systems designed to exploit this with special-case code for synchronization and memory allocation look fast when running tiny benchmarks, but not when running real Java™ programs that use threads for finalization, network connections, and AWT. This difference in performance can be even more striking on a multiprocessor, since synchronization on a multiprocessor must use bus-locking instructions that are unnecessary in the uniprocessor case.

To guard against this, try adding a second idle thread to any benchmark, and try running it on a multiprocessor, and compare the modified performance with the original performance. We use this class for an idle thread in our microbenchmarks:

class SleepyThread extends Thread {
  boolean die = false;
  Object lock = new Object();
  public void kill() {
    synchronized (lock) {
      die = true;
      lock.notify();
    }
  }
  public void run() {
    try {
      synchronized (lock) {
        while (! die) {
          lock.wait();
        }
      }
    } catch (InterruptedException e) {
    }
  }
}
and we surround benchmark runs with this code:
  SleepyThread st = new SleepyThread();
  st.start();
  // insert benchmark here
  st.kill();
Because the thread is idle, this modification shouldn't change performance much, but we have observed large differences using some Java systems.

Another special case sometimes used is noting when a program's output is being discarded, and skipping some or all of the I/O processing that takes place before the actual output. At least one workstation vendor applied this trick to the SPEC sc benchmark.
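How a runtime might detect discarded output can be sketched in a few lines. This is an illustrative assumption, not a description of any particular vendor's trick: in Java, System.console() returns null whenever standard output is redirected, e.g. to /dev/null, which is exactly the signal a cheating implementation would look for.

```java
public class OutputCheck {
    public static void main(String[] args) {
        // System.console() is non-null only when attached to an
        // interactive terminal.
        boolean interactive = System.console() != null;
        System.out.println("interactive=" + interactive);
        if (!interactive) {
            System.out.println(
                "output redirected; a cheating runtime might skip I/O work here");
        }
    }
}
```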

Bait and switch

Modern high performance just-in-time compilers (JITs) operate in at least two modes, a low-speed mode for infrequently executed code and a high-speed mode for the performance-critical parts of an application. However, conformance test technology has not really caught up with JIT technology and few conformance tests seem to be available to test all modes. This leaves an opportunity to use non-conforming but faster techniques in the high-speed mode.

Bending the rules

Sometimes, it is possible to make code run faster if it doesn't quite run correctly. For the Java language, getting floating point exactly right on Intel is one instance of this. The best known bit-for-bit compatible implementation of double-precision multiplication requires two scaling operations (FSCALE) above and beyond the actual multiplication. Most "Java" implementations aren't quite compatible, in that they merely load and store all FP operands before and after use. (Note that this changes with Java 2; the new default floating point mode allows use of a wider exponent range, and the new strictfp keyword specifies bit-for-bit compatible floating point arithmetic. Unfortunately, most Java Virtual Machines on Intel don't implement this either, though BulletTrain does as of release 1.1.0.)
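The strictfp distinction can be illustrated with a small sketch (class and method names are mine). Under strictfp, every operation must round to IEEE 754 double range, so an intermediate overflow goes to infinity; the pre-strictfp default allowed a wider exponent range on x87. Note that on modern JVMs using SSE2, and on Java 17+ where strict semantics became the default everywhere, both methods behave identically:

```java
public class FpModes {
    // strictfp requests bit-for-bit reproducible IEEE 754 arithmetic,
    // independent of the platform's extended-precision registers.
    static strictfp double strictProduct(double a, double b) {
        return a * b;
    }

    // Default-mode multiply: on an old x87 JVM this could use a wider
    // exponent range for intermediates.
    static double defaultProduct(double a, double b) {
        return a * b;
    }

    public static void main(String[] args) {
        double a = 1.0e308, b = 10.0;
        // 1e309 exceeds double range, so the strict result is Infinity.
        System.out.println(strictProduct(a, b));
        System.out.println(defaultProduct(a, b));
    }
}
```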

Rigging the game

Rigging the game is only feasible for vendors able to influence benchmarking organizations, but it can happen. For example, at least one supercomputer vendor once (privately) claimed that TPC-D was rigged because a particular database organization was required, and (this is what distinguishes a rigged game from one that is not) their happy customers used a different data organization. For another, consider that the SPEC JVM benchmark does not allow results to be quoted for static compilers. This might make sense if it was never useful to statically compile bytecodes into a faster form, but that is not the case; for many applications, static compilation is a perfectly acceptable way to obtain improved performance.

Bogus baselines

Some benchmarks, in a misguided attempt to measure a particular aspect of a system's performance, take the difference of two performance measures. For example, they might try to measure the cost of computing the sine of a number by running two loops, one containing a sine calculation, the other empty. This is particularly vulnerable to simple error and outright subversion.

Simple error can result from well-intentioned optimization gone awry. For example, an optimizer might more aggressively unroll and optimize a small loop; this would reduce the (mis)calculated net performance of the large loop's body. If, however, the optimizer does a poor job optimizing the smaller calibration loop, the net performance will be overstated instead. In the benchmarking world there are limits to how quickly the big loop can be executed, but the calibration loop can run arbitrarily slowly.
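The differencing methodology looks like the deliberately naive harness below (a sketch of the flawed technique, not a recommendation). If the optimizer deletes the "empty" calibration loop, or over-optimizes one loop relative to the other, the subtraction measures nothing meaningful:

```java
public class BogusBaseline {
    public static void main(String[] args) {
        int n = 1_000_000;

        long t0 = System.nanoTime();
        double s = 0;
        for (int i = 0; i < n; i++) s += Math.sin(i);
        long withSin = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 0; i < n; i++) { /* empty calibration loop */ }
        long empty = System.nanoTime() - t0;

        // The "net" cost per sine; wildly wrong if either loop was
        // eliminated or unrolled differently from the other.
        System.out.println("net ns/sin approx: " + (withSin - empty) / (double) n);
        System.out.println("checksum " + s); // use the result to keep the loop alive
    }
}
```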

Competing optimizations

If you are trying to measure the effect of optimization X, it is very important to ensure that some other optimization Y is not skewing the results. The three usual "surprise" optimizations are
  • dead code elimination
  • constant propagation
  • inlining
The most common error in writing benchmarks is failing to use the result of what is having its performance measured. This often leads to complete elimination of the benchmark kernel. For example,
// Measure allocation/GC performance
for (int i = 0; i < 1000000; i++) {
   int[] x = new int[1000];
}
doesn't use x, so a compiler might do anything from eliminating the allocation, to eliminating the loop altogether. These can sometimes be subtle, depending upon what other things the compiler can prove about the code in question.

Constant propagation and inlining often work hand-in-hand with dead code elimination to undermine measurement. For example, suppose that loop were wrapped in a subroutine to "use" the result:

static int[] measureGC() {
  // Measure allocation/GC performance
  int[] x = null;
  for (int i = 0; i < 1000000; i++)
     x = new int[1000];
  return x;
}
That's still pretty dubious, since it would make perfect sense to perform only the last allocation, and simple unrolling by two would reveal that half the allocations are dead. But never mind that: suppose the call were inlined and its result not used. Once again, dead code.

So, trying one more time to keep that code from being dead, you decide to use a virtual call to invoke it; everybody knows those can't be inlined, right?

class Foo {
  public int[] measureGCVirtual() {
    // Measure allocation/GC performance
    int[] x = null;
    for (int i = 0; i < 1000000; i++)
       x = new int[1000];
    return x;
  }
}

static void measureGC() {
  Foo f = new Foo();
  int[] x = f.measureGCVirtual();
}
Of course, the type of f is a constant, and the method bindings are known, so in fact the syntactically virtual call is not virtual at all, and thus can be inlined and optimized.
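One way to actually resist these optimizations (a sketch under my own naming, not a guaranteed defense against every JIT) is to make the result observable by printing a value derived from it, and to make the receiver's dynamic type genuinely unknown, for example by selecting it from program input at run time:

```java
public class KeepAlive {
    interface Kernel { int[] run(); }

    static class AllocKernel implements Kernel {
        public int[] run() {
            // Measure allocation/GC performance (iteration count reduced
            // here so the sketch runs quickly).
            int[] x = null;
            for (int i = 0; i < 100_000; i++) x = new int[1000];
            return x;
        }
    }

    public static void main(String[] args) {
        // In a real harness, choose the implementation from args so the
        // compiler cannot prove the receiver's type.
        Kernel k = new AllocKernel();
        int[] x = k.run();
        // Using the result keeps the final allocation, at least, alive.
        System.out.println("len=" + x.length + " last=" + x[999]);
    }
}
```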

Here's another example, taken directly from the JavaGrande Fork-Join benchmark:

// do something trivial but which won't be optimised away! 
double theta=37.2, sint, res; 
sint = Math.sin(theta);
res = sint*sint; 
//defeat dead code elimination 
if(res <= 0) System.out.println(
     "Benchmark exited with unrealistic res value " + res);
Remember, this is Java, where the effects of floating point arithmetic are completely specified in strictfp mode. The correct result (not what you will obtain from Sun's VM on Pentium, see bending the rules above) is -0.478645918588415, or (in its unambiguous hexadecimal representation) 0xbfdea2227dacdf2a. We can take that constant and square it, and compare that obviously positive result with zero. What remains of this method that "won't be optimized away"? Nothing!
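The claimed value is easy to check directly (a small harness of my own): StrictMath.sin gives the fdlibm-defined answer that a bit-for-bit correct implementation must match, whereas Math.sin is merely required to be within 1 ulp of the true result.

```java
public class SinBits {
    public static void main(String[] args) {
        // StrictMath pins down the exact fdlibm result.
        double sint = StrictMath.sin(37.2);
        System.out.println(sint);
        // Print the unambiguous bit pattern of the double.
        System.out.println(Long.toHexString(Double.doubleToLongBits(sint)));
    }
}
```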

Lessons learned

The first lesson here is that even when writing your own benchmarks, especially microbenchmarks, you must be very careful about all the optimizations that you aren't necessarily trying to measure. It's important to use results (ideally, incorporate their value into something that is ultimately printed), and it is important to either thwart or preemptively apply any optimizations that you don't want contaminating your results.

The second, and more important lesson, is that when running benchmarks written or selected by someone else, that you should watch your wallet. The best benchmark is always your own application.