AES speed D. J. Bernstein
Hash functions and ciphers

AES speed

Update: Peter Schwabe and I now have a paper on this topic: The software is now available as part of the estreambench toolkit. We have placed the software into the public domain; feel free to integrate it into your own AES applications!

Information below this line has not yet been updated.


This document describes various speedups in AES software. This document assumes that the software is going to be used in an application where timing information is not exposed to attackers.

The reader is expected to already know the standard structure of AES software:

See Section 5.2.1 of "AES Proposal: Rijndael" by Daemen and Rijmen.

Endianness

On a little-endian CPU, extracting the first byte of a 32-bit word is an &0xff arithmetic instruction; on a big-endian CPU, extracting the first byte of a 32-bit word is a >>24 arithmetic instruction. Similar comments apply to the other bytes.

One can write AES software that uses arithmetic instructions as if the CPU were little-endian. If the CPU is actually big-endian, the software swaps the bytes of the AES key, input, and output (at run time). The software also swaps the bytes of the table (at compile time), for example by expressing the table as a sequence of 32-bit integers.

Matched endianness. One can easily eliminate the byte-swapping time for the AES key, input, and output: simply use the appropriate arithmetic instructions for the endianness of the CPU. In this case the table must not be swapped.

Table structure

All else being equal, smaller AES tables are faster: they take less time to load into cache and are more likely to stay in cache. Beware that most benchmarking tools preload caches and thus can't see this speedup.

Daemen and Rijmen suggest "4 KBytes of tables." There are 4 tables. Each table has 256 words occupying 1024 bytes. The loads are spread evenly across the tables.

Rotated lookups. Daemen and Rijmen suggest an alternative "with a total table size of 1KByte" but with extra arithmetic. The point is that the tables are rotations of each other: for example, the first word of the first table is (0xc6,0x63,0x63,0xa5), the first word of the second table is (0xa5,0xc6,0x63,0x63), the first word of the third table is (0x63,0xa5,0xc6,0x63), and the first word of the fourth table is (0x63,0x63,0xa5,0xc6). One can store the first table, and simulate a lookup in another table at the cost of an extra rotation.

Unaligned loads. One can instead use a single 2KB table having 256 8-byte entries such as (0x00,0x63,0xa5,0xc6,0x63,0x63,0xa5,0xc6). There are many reasonable choices of pattern here; what's important is that the pattern includes the desired (0xc6,0x63,0x63,0xa5) and (0xa5,0xc6,0x63,0x63) and so on as substrings. On the Pentium, the PowerPC, et al., one can load 4-byte words from memory addresses that aren't divisible by 4, and there's no penalty when the word doesn't cross an 8-byte boundary.

Masked loads

16 of the 160 table lookups in 10-round AES are masked. The 40 table lookups in 10-round AES key expansion are also masked. The masks are 0x000000ff, 0x0000ff00, 0x00ff0000, and 0xff000000, each used equally often.

The simplest way to compute a mask is with an arithmetic instruction: for example, &0xff00.

Byte loads. One can eliminate 25% of the masks, namely the bottom-byte masks, by combining them with load instructions. All popular CPUs have single-byte-load instructions.

Two-byte loads. One can eliminate another 25% of the masks on CPUs with two-byte-load instructions. This constrains the table pattern: it's important to have (0x00,0x63) on little-endian CPUs, and (0x63,0x00) on big-endian CPUs.

Masked tables. One can eliminate all of the masks by precomputing masked tables, using extra table space. The simplest table structure uses a total of 8KB. Two tables, one with entries such as (0x00,0x63,0xa5,0xc6,0x63,0x63,0xa5,0xc6) and another with entries such as (0x00,0x00,0x00,0x00,0x63,0x00,0x00,0x00), use a total of 4KB. In my experience, the cost of larger tables outweighs the benefit of eliminating a few masks.

Key expansion

A 4-word (128-bit) key is expanded in 40 steps. Each step produces a new word, totalling 44 words in the expanded key. A step has a byte extraction (see below), a masked load, and two xors. The total work is 40 byte extractions, 40 masked loads, and 80 xors. For comparison, the subsequent work to encrypt a block involves 160 byte extractions, 160 loads (of which 16 are masked), and 160 xors.

Daemen and Rijmen say (Section 4.3.2) that key expansion involves "almost no computational overhead." Obviously key expansion is less expensive than encrypting a block. On the other hand, the cost of key expansion is still quite noticeable.

Expanded keys. A typical AES implementation precomputes and stores an expanded key. The 40 byte extractions, 40 masked loads, and 80 xors aren't repeated for every block; they are done only once, along with 44 stores. Each block then involves 44 extra loads for the expanded key. Some stores and loads can be eliminated if many blocks are handled at once and some extra registers are available.

Long-term storage of an expanded key can slow down applications that handle many keys: the expanded keys take more time to load into cache than the original keys and are less likely to stay in cache.

Partially expanded keys. An alternative is to precompute and store a partially expanded key, only 14 words instead of 44 words. The partially expanded key consists of words 0, 1, 2, 3, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40 from the expanded key. Loading the partially expanded key, and converting it into the fully expanded key, takes only 14 loads and 30 xors.

One can interpolate between partial expansion and full expansion, using various amounts of storage per key and achieving various balances between load and xor.

Index extraction

The 16 xor operations in an AES round produce 4 words in 4 integer registers. The 16 bytes of these words are then extracted and used as indices for the next round.

The simplest way to extract 4 bytes is using 6 instructions, namely 3 shifts and 3 bottom-byte extractions: &255; (>>8)&255; (>>16)&255; >>24.

Using a byte as an index then requires multiplying the byte by a constant that depends on the table structure. Let's assume the 2KB tables described above; then the constant is 8. The multiplications use 4 shifts: <<3; <<3; <<3; <<3.

Scaled-index loads. Many CPUs can multiply an index register by 8 for free as part of a load.

Scaled-index extractions. What about CPUs that can't multiply an index register by 8 for free? Two of the multiplications can nevertheless be eliminated, because they can be combined with shifts. The overall extract-and-scale sequence has 8 instructions: (<<3)&2040; (>>5)&2040; (>>13)&2040; (>>21)&2040. The PowerPC has a combined rotate-and-mask instruction, making this sequence take only 4 instructions.

Scaled tables. One can rotate table entries by 3 bits, reducing the above 8 instructions to 7 instructions.

Second-byte instructions. The x86 architecture (Pentium, Athlon, etc.) includes a combined (>>8)&255 instruction. This means that extracting 4 bytes takes only 5 instructions: &255; (>>8)&255; >>16; &255; >>8. Alternate 5-instruction sequence: &255; (>>8)&255; >>16; &255; (>>8)&255.

Of course, the ultimate measure of performance is a cycle count, not an instruction count. Matsui states that the (>>8)&255; instruction is "a bit expensive" on the Pentium 4 Prescott (f33, f34, f41); presumably this means that the instruction takes more cycles than, e.g., a mere &255. But all of the measurements I've seen indicate the opposite. I'm not sure what I'm missing here.

32-bit shifts on 64-bit architectures. The amd64 architecture (P4E, Athlon 64, Core 2, etc.) can right-shift a 64-bit register, but Matsui comments that this operation is extremely slow on the P4E. It's much better to use the amd64's x86-compatible right-shift instruction; this instruction sets the top 32 bits of its 64-bit input to 0 before shifting.

Byte extraction via loads. A completely different way to extract 4 bytes is with 1 store and 4 loads. One can mix this with the previous approaches to achieve various balances between load and arithmetic.

Consider, for example, the UltraSPARC, which has 2 integer units and 1 load/store unit. A traditional sequence of 14 partially-expanded-key loads (see below), 30 key-expansion xors, 160 scaled-index extractions, 160 table-lookup loads, 160 xors, 16 masks, 4 input loads, and 4 output stores occupies a total of 526 integer instructions (at least 263 cycles) and 182 loads (at least 182 cycles). Using loads for some byte extractions, replacing 36 scaled-index extractions with 9 stores and 36 loads, means a total of 454 integer instructions (at least 227 cycles) and 227 loads/stores (at least 227 cycles).

Unrolling

A typical 9-iteration AES loop involves 9 increments of a loop index, 9 comparisons, and 9 branches, one of which is mispredicted on most CPUs. The loop index also consumes a register, forcing an extra 9 stores and 9 loads on CPUs that don't have registers to spare.

Full unrolling. One can eliminate all of these costs by fully unrolling the loop. Beware, however, that full unrolling costs a few kilobytes of code-cache space.

Partial unrolling. CPUs are more likely to correctly predict a 4-iteration loop than a 9-iteration loop.

Instruction scheduling

The 16 table lookups in an AES round are independent and can be scheduled in many different ways. One can, for example, perform all the table lookups for the first input from bottom byte to top (outputs 0, 3, 2, 1), then perform all the table lookups for the second input from bottom byte to top (outputs 1, 0, 3, 2), then perform all the table lookups for the third input from bottom byte to top (outputs 2, 1, 0, 3), then perform all the table lookups for the fourth input from bottom byte to top (outputs 3, 2, 1, 0). One can, as another example, first perform all the table lookups for the first output in order of the inputs, then perform all the table lookups for the second output in order of the inputs, etc.

Maximum parallelism. The overall depth of the AES round is one byte extraction plus one table lookup plus two xors: a mythical CPU offering extensive parallelism could perform all sixteen byte extractions in parallel, then all sixteen table lookups in parallel, then eight xors in parallel, then four xors in parallel. Note that each output is obtained by xor'ing two parallel xor's, rather than by three serial xor's.

Deferring loads. The amd64 architecture poses several challenges to AES instruction scheduling. First, most integer instructions require the output register to be one of the input registers. Second, typical amd64 CPUs handle a load and xor most efficiently as a unified load-xor, but a unified load-xor gives no opportunity to switch registers. Third, only 4 registers (eax, ebx, ecx, edx) allow second-byte instructions.

Matsui concludes that, on amd64 (and x86), keeping each round's inputs y0, y1, y2, y3 and outputs z0, z1, z2, z3 in eax, ebx, ecx, edx, to allow second-byte instructions, is "impossible without saving/restoring." But that's incorrect. No extra copies are required. A careful instruction sequence uses the minimal conceivable number of instructions: 20 for byte extraction, 16 for table lookups, and 4 for handling the expanded key. The idea is to extract all the bytes from an input, freeing the input's register for an output, before doing any table lookups involving that output:

The maximum number of live registers here is 9, fitting easily into the amd64 instruction set.

Squeezing inputs and outputs into 7 32-bit registers. The x86 architecture poses an additional challenge to AES instruction scheduling: there are only 7 general-purpose integer registers.

It's still possible to handle a round with 0 stores, 4 expanded-key loads, and 16 loads for table lookups. The shortest instruction sequence that I know has a total of 46 instructions, 6 more than what would be possible with extra registers; 1 of the 46 instructions can be eliminated if the key expansion is changed.

The idea of this instruction sequence is to rotate y0 by 16 bits, use the bottom two bytes of both y0 and y2, and then merge the remaining four bytes of y0 and y2 into a single register (for example, shifting y0 down 16 bits, masking y1, and adding the results), freeing a register at the cost of 3 extra instructions (the rotate, the mask, and the add); splitting 3 load-xor instructions into 3 loads and 3 xors then easily puts all outputs into suitable registers. The rotation can be eliminated if the expanded-key word that corresponds to y0 is rotated by 16 bits.

Speed reports

Speed reports vary in whether they use CTR, CBC, etc., and in the exact rules for measuring speeds. The "eSTREAM" cycles/byte counts are for counter-mode AES measured by the eSTREAM benchmarking toolkit; future implementors are encouraged to support the eSTREAM interface for direct comparability.
ArchitectureCPUeSTREAM cycles/byteAd-hoc cycles/byteSoftware
amd64Intel Core 2 Duo (6f6)?9.2Matsui/Nakajima (CHES 2007)
amd64AMD Athlon 64 (15,75,2)?10.625 (170/block)Matsui (FSE 2006)
amd64AMD Athlon 64 (15,75,2)?12.4375 (199/block)Lipmaa
amd64Intel Core 2 Duo (6f6); katana12.56hongjun/v1/1
amd64Intel Core 2 Quad Q6600 (6fb); latour12.57hongjun/v1/1
amd64AMD Athlon 64 (15,75,2)?13.125 (210/block)Osvik
amd64AMD Athlon 64 X2 (15,75,2); mace13.32hongjun/v1/1
amd64AMD Opteron 240 (f58); nmisles8amd6413.45bernstein/amd64-1/1
x86Intel Pentium III (68a)?14 (224/block)Osvik
x86AMD Athlon (622)?14.0625 (225/block)Osvik
x86Intel Pentium III (68a)?14.125 (226/block)Lipmaa
x86Intel Pentium 4 (f12)?15 (240/block)Osvik
x86Intel Pentium 4 (f12)?15.875 (254/block)Lipmaa
x86Intel Pentium M (695); whisper15.96bernstein/x86-mmx-1/1
amd64Intel Pentium 4 (f64)?16 (256/block)Matsui (FSE 2006)
x86Intel Pentium III (68a)?16.25 (260/block)Gladman
amd64Intel Pentium D (f64); nmi016116.74bernstein/amd64-2/1
amd64Intel Pentium D (f64); svlin00116.75bernstein/amd64-2/1
amd64Intel Xeon (f41); nmi005616.75bernstein/amd64-2/1
amd64Intel Xeon (f4a); nmi009016.77bernstein/amd64-2/1
sparcSun UltraSPARC III16.875 (270/block)Lipmaa
amd64Intel Xeon (f41); nmi005716.89bernstein/amd64-2/1
amd64Intel Pentium D (f64); speed16.90bernstein/amd64-2/1
amd64Intel Pentium D (f64); nmi010416.90bernstein/amd64-2/1
amd64Intel Pentium D (f64); nmi024116.93bernstein/amd64-2/1
ppc64IBM POWER5; nmi015416.93bernstein/big-1/1
x86Intel Pentium 4 (f24); nmi008616.96bernstein/x86-mmx-1/1
x86Intel Pentium 4 (f12); fireball16.98bernstein/x86-mmx-1/1
x86Intel Pentium 4 (f24); nmitest417.01bernstein/x86-mmx-1/1
ppc64IBM PowerPC G5 970; nmi004817.17bernstein/big-1/1
x86Intel Pentium 2 (652); boris17.33bernstein/x86-mmx-1/1
x86Intel Pentium 3 (68a)17.49Bernstein aes-128/x86-mmx-1
x86Intel Pentium 3 (672); orpheus17.55bernstein/x86-mmx-1/1
x86Intel Pentium M (6d8)17.57Wu v0/1
x86Intel Pentium 4 (f33)?17.75 (284/block)Matsui/Fukuda (FSE 2005)
x86Intel Xeon (f29); nmibuild4017.79bernstein/x86-mmx-1/1
x86Intel Xeon (f27); nmi005917.79bernstein/x86-mmx-1/1
x86Intel Xeon (f25); nmibuild1617.79bernstein/x86-mmx-1/1
x86Intel Xeon (f25); nmi001317.79bernstein/x86-mmx-1/1
x86Intel Xeon (f29); nmi005917.80bernstein/x86-mmx-1/1
x86Intel Xeon (f29); nmibuild1717.81bernstein/x86-mmx-1/1
x86Intel Xeon (f25); nmibuild1517.82bernstein/x86-mmx-1/1
x86Intel Xeon (f25); nmibuild2617.83bernstein/x86-mmx-1/1
x86Intel Xeon (f25); nmibuild2117.83bernstein/x86-mmx-1/1
x86Intel Xeon (f25); nmi003617.84bernstein/x86-mmx-1/1
x86Intel Xeon (f25); nmibuild2217.84bernstein/x86-mmx-1/1
x86AMD Athlon (622); thoth18.38bernstein/x86-mmx-1/1
ppc32IBM POWER4; nmibuild1418.55bernstein/little-1/1
x86Intel Xeon (f41); nmi007918.88bernstein/x86-mmx-1/1
x86Intel Xeon (f41); nmi006218.89bernstein/x86-mmx-1/1
amd64Intel Core 2 Duo (6f6)18.9OpenSSL 0.9.8e
x86Intel Xeon (f41); nmi006118.91bernstein/x86-mmx-1/1
x86Intel Pentium 4 (f41); svlin00218.94bernstein/x86-mmx-1/1
x86Intel Xeon (f41); nmi007618.96bernstein/x86-mmx-1/1
x86Intel Xeon (f4a); nmi010218.97bernstein/x86-mmx-1/1
x86Intel Xeon (f41); nmi006018.97bernstein/x86-mmx-1/1
x86Intel Xeon (f41); nmi006318.95bernstein/x86-mmx-1/1
x86Intel Pentium 3 (68a)19.06Wu v1/1
ppc32Motorola PowerPC G4 7410; gggg19.11bernstein/big-1/1
amd64Intel Core 2 Duo (6f6)19.5OpenSSL 0.9.8a
x86AMD Athlon (622)?19.9375 (319/block)Lipmaa
x86Intel Pentium 1 (52c)20 (320/block)Lipmaa
sparcSun UltraSPARC III20.75Bernstein big-1/1
amd64AMD Athlon 64 (15,75,2)20.9OpenSSL 0.9.8e
ppc32Motorola PowerPC G4 7400; nmi004220.92bernstein/big-1/1
x86Intel Pentium M (6d8)21OpenSSL 0.9.8a
x86Intel Pentium D (f47); shell21.58bernstein/x86-mmx-1/1
x86AMD Athlon (622)22OpenSSL 0.9.8a
x86Intel Pentium 4 (f29)22OpenSSL 0.9.8b
amd64AMD Athlon 64 (15,75,2)?23.5OpenSSL 0.9.7e
x86Intel Pentium 4 (f41)23.5OpenSSL 0.9.8a
x86Intel Pentium 3 (672); orpheus23.62OpenSSL 0.9.8e
ppc32Motorola PowerPC G4 741024.0625 (385/block)Ahrens
x86Intel Pentium 4 (f12)24.4OpenSSL 0.9.8a
sparcSun UltraSPARC III25OpenSSL
ppc32Motorola PowerPC G4 741025.0625 (401/block)Ahrens
x86Intel Core Duo; nmi006825.74gladman/1
amd64Intel Pentium D (f64); speed27.33OpenSSL 0.9.8e
ppc32Motorola PowerPC G4 7410; gggg29.32OpenSSL 0.9.8c
sparcv9Sun UltraSPARC III; nmi005129.45bernstein/big-1/1
sparcv9Sun UltraSPARC III; nmisolaris1029.46bernstein/big-1/1
ppc64IBM Cell PPE; nmips335.20bernstein/big-1/1
amd64Intel Pentium 4 (f64)37OpenSSL 0.9.7f
x86Intel Pentium 4 (f29)39OpenSSL 0.9.7e
sparcSun UltraSPARC III46.875 (750/block)Bassham
x86Intel Pentium 1 (52c); cruncher38.20hongjun/v1/1

Regarding amd64 Intel Pentium 4, Matsui writes: "The number of memory reads for one block encryption of AES is 4 (for plaintext loads) + 11 x 4 (for subkey loads) + 16 x 10 (for table lookups) = 208, which means that Pentium 4 takes at least 208 cycles/block for one block encryption." But this lower bound ignores the possibility of loading partially expanded keys, saving as many as 30 loads, and using 64-bit loads for keys and plaintext, saving 9 more loads.

Regarding amd64 AMD Athlon 64, Matsui writes: "Considering an instruction latency of Athlon 64, the theoretical limit of AES performance on this processor seems around 16 cycles/round = 160 cycles/block. Our result is hence reaching closely this limit."