
Poly1305-AES speed tables

Speed tables
What do the speed tables mean?
Why is the first cycle count much larger than the typical cycle count?
Why do adjacent lines show different timings?
Speed graphs
Best-case benchmarks
Audiences for this page: (1) People who want to know how much time message authentication will take. (2) People who want to compare Poly1305-AES to other authentication functions. (3) People interested in benchmarking as a general topic: specifically, in the difference between best-case benchmarks and thorough benchmarks.

Speed tables

Here are some speed tables printed by my poly1305aes-speed.c benchmarking program:

Code sizes, assuming padding to 16-byte boundaries:

What do the speed tables mean?

In the speed tables, each line reports timings for several consecutive computations, with an instruction-cache/data-cache flush before the first computation. The first number in each line is the message length; then there are symbols indicating what's being computed; then there are cycle counts for the individual computations, including timing overhead. Here are the symbols:

For example, the Pentium-III line

     1024 V0000KD 18325  5334  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217
reports timings for several verifications of authenticators on 1024-byte messages: 18325 CPU cycles for the first verification, 5334 CPU cycles for the second, 5217 CPU cycles for the third, and so on. The keys, nonces, messages, and authenticators were in cache but they weren't aligned.
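
As a rough illustration of this timing pattern, here is a minimal sketch, not the actual poly1305aes-speed.c code, assuming an x86 CPU (rdtsc) and gcc-style inline assembly; verify() is a placeholder standing in for one Poly1305-AES verification of a 1024-byte message:

     #include <stdio.h>

     static long long cpucycles(void)
     {
       unsigned int lo, hi;
       __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
       return ((long long) hi << 32) | lo;
     }

     static int verify(void)
     {
       /* placeholder: one verification of a 1024-byte message */
       volatile int x = 0;
       return x;
     }

     int main(void)
     {
       long long t[21];
       int i;

       /* (an instruction-cache/data-cache flush would go here) */
       t[0] = cpucycles();
       for (i = 0; i < 20; ++i) {
         verify();
         t[i + 1] = cpucycles();
       }

       printf("%6d V0000KD", 1024);
       for (i = 0; i < 20; ++i)
         printf(" %lld", t[i + 1] - t[i]); /* includes timing overhead */
       printf("\n");
       return 0;
     }

Each printed number is the difference between consecutive timestamp readings, so it includes the timing overhead, as described above.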

Why is the first cycle count much larger than the typical cycle count?

The first cycle count in each line of the speed table (the third column) includes the extra time needed to load the Poly1305-AES code from DRAM: roughly ten thousand cycles on an UltraSPARC II, for example.

There's a quantum leap in real-world packet-processing performance when all relevant code fits into cache (at least L2 cache, preferably L1 cache), because this DRAM load can be skipped. This is why code size is important, although it's difficult to measure the importance without also seeing the sizes of other pieces of software.

Even after the first cycle count, the timings don't instantly settle down. One reason is that the AES tables haven't been completely loaded from DRAM. See my paper on cache-timing attacks for further discussion.
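
To make the effect of the flush concrete, here is a minimal sketch of one crude way to evict cached data before the first timed computation, assuming 64-byte cache lines and caches smaller than 8 megabytes; this is not the flush used by poly1305aes-speed.c, and it does not fully evict the instruction cache or the AES tables:

     #define FLUSH_BYTES (8 * 1024 * 1024) /* assumed larger than L2 cache */

     static unsigned char flushbuf[FLUSH_BYTES];

     static void flushcache(void)
     {
       unsigned long i;
       /* touch one byte per (assumed) 64-byte cache line so that
          previously cached data is evicted and must be reloaded from DRAM */
       for (i = 0; i < FLUSH_BYTES; i += 64)
         flushbuf[i] += 1;
     }

     int main(void)
     {
       flushcache(); /* before the first timed computation */
       return 0;
     }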

Why do adjacent lines show different timings?

Each speed table includes three 1024 V0000KD lines:
     1024 V0000KD 24846  5213  5314  5204  5204  5204  5204  5204  5204  5204  5204  5204  5204  5204  5204  5204  5204  5204  5204  5204
     1024 V0000KD 23589  5709  5207  5207  5207  5207  5207  5207  5207  5207  5207  5207 15287  5207  5207  5207  5207  5207  5207  5207
     1024 V0000KD 18325  5334  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217  5217
The three repetitions are unrolled in the timing program. The location of a function call in memory can affect performance (thanks to branch-prediction subtleties, for example) even when the function is called in exactly the same way. This causes small differences between lines: for example, the first line settles down to 5204 while the third line settles down to 5217.

In theory, a Poly1305-AES user can adjust the position of his function calls to obtain the best performance. I generally presume that the user has not done this, so I report the worst of the three lines when space prevents me from reporting all of them.

Occasional large burps (such as the 15287 in the fifteenth column of the second line above) are caused by the timing computer pausing to do something else, such as handle a network packet. These burps have nothing to do with Poly1305-AES performance and can be safely ignored.
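
One way to make "safely ignored" concrete: summarize a line by the median of its 20 cycle counts, which discards both the cold first count and occasional burps. This is only an illustration of how to read the tables, not how the tables or the graphs below are produced:

     #include <stdio.h>
     #include <stdlib.h>

     static int cmp(const void *a, const void *b)
     {
       long long x = *(const long long *) a;
       long long y = *(const long long *) b;
       return (x > y) - (x < y);
     }

     int main(void)
     {
       /* the second 1024 V0000KD line above */
       long long t[20] = { 23589, 5709, 5207, 5207, 5207, 5207, 5207,
                           5207, 5207, 5207, 5207, 5207, 15287, 5207,
                           5207, 5207, 5207, 5207, 5207, 5207 };
       qsort(t, 20, sizeof t[0], cmp);
       printf("%lld\n", t[9]); /* median: 5207, despite 23589 and 15287 */
       return 0;
     }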

Speed graphs

The following graphs show the time to verify an authenticator in various situations, with a different color for each situation. The graphs include code in cache and code out of cache; the out-of-cache case appears as a faint cloud above the in-cache case, drawn with the same color scheme.

The horizontal axis on the graphs is message length, from 0 bytes to 8192 bytes. The vertical axis on the graphs is time, from 0 CPU cycles to 49152 CPU cycles. The bottom-left-to-top-right diagonal is therefore 6 CPU cycles per byte (49152/8192).

Lengths divisible by 16 are slightly faster than lengths not divisible by 16. The best case in (almost) every graph is length divisible by 16, everything in cache, everything aligned; this case is visible as 512 black dots at the bottom of the graph.

To convert a speed table to a speed graph, feed it through speed-graph.sh.
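
The real conversion is done by speed-graph.sh; as a simplified sketch of the idea, the following C program reads lines in the format shown above (length, symbol string, 20 cycle counts per line), drops the cold first count, and prints one "length cycles" point per remaining timing, suitable for a generic plotting tool:

     #include <stdio.h>

     int main(void)
     {
       long length;
       long long cycles;
       char symbols[64];

       while (scanf("%ld %63s", &length, symbols) == 2) {
         int column;
         for (column = 0; column < 20; ++column) {
           if (scanf("%lld", &cycles) != 1) return 0;
           if (column > 0) /* skip the cold first count */
             printf("%ld %lld\n", length, cycles);
         }
       }
       return 0;
     }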

AMD Athlon cpuid AuthenticAMD-00000622-0183f9ff, 900MHz, gcc 2.95.4 -O3:

IBM PowerPC RS64 IV (Sstar), 668MHz (one of eight CPUs):

Intel Pentium III cpuid GenuineIntel-00000673-0383fbff, 500MHz, gcc 2.95.4 -O3 -mcpu=pentiumpro:

Intel Pentium III, 850MHz:

Intel Pentium III, 1000MHz (one of two CPUs):

Intel Pentium 4, 1900MHz:

Intel Pentium 4 cpuid GenuineIntel-00000f29-bfebfbff, 3400MHz, gcc 2.95.4 -O3 -mcpu=pentiumpro:

Intel Pentium M cpuid GenuineIntel-00000695-a7e9f9bf, 1300MHz, gcc 2.95.4 -O1:

Motorola PowerPC 7410 (G4), 533MHz (one of two CPUs):

Sun UltraSPARC II, 296MHz (one of four CPUs):

Sun UltraSPARC IIi, 360MHz, gcc 3.0.4 -m64 -O3 -mcpu=ultrasparc:

Sun UltraSPARC III, 900MHz (one of eight CPUs):