Best-case benchmarks
D. J. Bernstein
2005.01.31

One way to deceive people with benchmarks is to report _best-case_ timings. For example, here's the best-case scenario for cryptographic message authentication:

* The input---a key, a nonce, and a message---is in L1 cache. The output---an authenticator---is also in L1 cache. The code is in L1 cache. DRAM access time is ignored. Some applications fit comfortably into L1 cache. Other applications work with many simultaneous keys. Those other applications are ignored.

* Expanded keys---any convenient tables that don't depend on the message---are also in L1 cache. Expansion time is ignored. Time to load tables from DRAM is ignored. The expansion means that even fewer keys can fit into cache at once. But, as above, multiple-key applications are ignored. For analogous reasons, code size is ignored, as long as the benchmark program can fit into L1 cache.

* Messages have a convenient length, typically a multiple of 16 bytes. The time to handle inconvenient lengths, in applications that don't stick to the convenient lengths, is ignored. Messages, authenticators, etc. are also aligned to 16-byte boundaries in memory. Unaligned memory access often takes time (and extra code for some CPUs); this time is ignored. Some applications are faced with unaligned messages; these applications are ignored.

* Messages are very long, as long as will fit into L1 cache. Even better, the L1 cache is extremely large; the benchmarker simulates this by modifying the authenticator computation to not move the message pointer through memory. All that matters is the asymptotic time per byte. The average packet size in a typical Internet protocol is only a few hundred bytes. In these applications, per-message overhead is easily visible. These applications are ignored.

* The CPU is the one that makes the authenticator computation take the smallest possible number of clock cycles. Other CPUs are ignored.

Real-world timings are often higher, sometimes _much_ higher, than the best-case benchmarks indicate. Even worse, the best-case benchmarks encourage implementors to optimize for the best case rather than for the real world: modifying an authentication function to make it slightly faster for its favorite CPU, for example, even if this makes it much slower on other CPUs.

I'm guilty of optimizing for the best case. I advertised my hash127 authentication function as being faster than MD5---which it is _if_ the application doesn't mind expanding each key into a precomputed table occupying several kilobytes. Unfortunately, for any application that handles more than a few keys simultaneously, the tables start missing the cache, costing many CPU cycles. It's much better to use my new Poly1305 function, which provides _consistent_ high speed.

I don't mean to suggest that we should ignore the best case and optimize solely for the worst case. Some applications _are_ in the best case. The most useful benchmarks show the best-case behavior _and_ the worst-case behavior _and_ any important cases in between.

Example: Functions for cryptographic message authentication should be timed separately for 0-byte messages, 1-byte messages, 2-byte messages, and so on up through 8192-byte messages; timed separately for every available CPU; etc. With modern graphing tools---or, let's be honest, decades-old graphing tools---you can take the tables for two such functions, graph the tables, superimpose the graphs, and focus on the CPUs, the range of message lengths, etc. that you care about.
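Here is a minimal sketch, in C, of the per-length timing loop described above. Everything specific in it is an assumption made for illustration, not something prescribed by this note: the authenticator is a placeholder named auth() with an invented key/nonce/message signature, and the cycle counts come from __rdtsc(), which exists only on x86 CPUs. A real benchmark would repeat the whole sweep on every available CPU and would also cover the unaligned and multiple-key cases discussed earlier.

    /* Sketch of a per-length timing harness; assumptions as noted above. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <x86intrin.h>

    #define MAXLEN 8192
    #define TRIALS 31

    /* Placeholder authenticator with a made-up signature; replace it
       with the real function being measured. */
    static void auth(unsigned char out[16],
                     const unsigned char key[32],
                     const unsigned char nonce[16],
                     const unsigned char *message, unsigned long long len)
    {
      unsigned long long i;
      memset(out, 0, 16);
      for (i = 0; i < len; ++i) out[i & 15] ^= message[i];
      (void) key;
      (void) nonce;
    }

    static int cmp(const void *a, const void *b)
    {
      unsigned long long x = *(const unsigned long long *) a;
      unsigned long long y = *(const unsigned long long *) b;
      return (x > y) - (x < y);
    }

    int main(void)
    {
      static unsigned char message[MAXLEN];
      unsigned char key[32] = {0};
      unsigned char nonce[16] = {0};
      unsigned char out[16];
      unsigned long long t[TRIALS];
      unsigned long long len;
      int i;

      for (len = 0; len <= MAXLEN; ++len) {
        for (i = 0; i < TRIALS; ++i) {
          unsigned long long before = __rdtsc();
          auth(out, key, nonce, message, len);
          t[i] = __rdtsc() - before;
        }
        /* Median of several trials, to reduce timer and interrupt noise. */
        qsort(t, TRIALS, sizeof t[0], cmp);
        printf("%llu %llu\n", len, t[TRIALS / 2]);
      }
      return 0;
    }

Each output line is a message length and a median cycle count, so the resulting table can be graphed directly and superimposed on the table for a competing function.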
It's natural to look for patterns in data. If a large table of timings can be _accurately_ compressed down to a relatively small set of numbers, wonderful! But it's clear that we lose far too much information when we compress the table down to a single best-case benchmark.
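To make the compression idea concrete: one small set of numbers, under the purely illustrative assumption that timings grow roughly linearly with message length, is a per-message overhead plus a per-byte cost. The sketch below, again in C, fits those two numbers by least squares to a table such as the one produced by the harness above, and reports the worst residual, which says how much accuracy the compression gives up.

    /* Sketch: compress a "length cycles" table into two numbers, a
       per-message overhead and a per-byte cost, via least squares.
       The linear model is an assumption, not a claim from this note. */
    #include <stdio.h>
    #include <math.h>

    #define MAXPOINTS 8193

    int main(void)
    {
      static double len[MAXPOINTS], cyc[MAXPOINTS];
      double x, y, n = 0, sx = 0, sy = 0, sxx = 0, sxy = 0;
      long points = 0, i;

      /* Read "length cycles" pairs, e.g. the output of the harness above. */
      while (points < MAXPOINTS && scanf("%lf %lf", &x, &y) == 2) {
        len[points] = x;
        cyc[points] = y;
        ++points;
        n += 1; sx += x; sy += y; sxx += x * x; sxy += x * y;
      }
      if (points < 2 || n * sxx - sx * sx == 0) return 1;

      /* Least-squares fit: cycles ~ overhead + perbyte * length. */
      double perbyte = (n * sxy - sx * sy) / (n * sxx - sx * sx);
      double overhead = (sy - perbyte * sx) / n;

      /* The worst residual measures how much the two-number summary loses. */
      double worst = 0;
      for (i = 0; i < points; ++i) {
        double r = fabs(cyc[i] - (overhead + perbyte * len[i]));
        if (r > worst) worst = r;
      }

      printf("overhead %.1f cycles; %.3f cycles/byte; worst residual %.1f cycles\n",
             overhead, perbyte, worst);
      return 0;
    }

A large worst residual is the warning sign: the table never really fit the two-number summary, let alone a single best-case number.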