AES speed D. J. Bernstein
Hash functions and ciphers

AES speed

Update: Peter Schwabe and I now have a paper on this topic:

[aesspeed] 15pp. (PDF) D. J. Bernstein, Peter Schwabe. New AES software speed records. Document ID: b90c51d2f7eef86b78068511135a231f. URL: https://cr.yp.to/papers.html#aesspeed. Date: 2008.09.26. Supersedes: (PDF) 2008.09.08.

The software is now available as part of the estreambench toolkit. We have placed the software into the public domain; feel free to integrate it into your own AES applications!

Information below this line has not yet been updated.

This document describes various speedups in AES software. This document assumes that the software is going to be used in an application where timing information is not exposed to attackers.

The reader is expected to already know the standard structure of AES software:

each of the 16 state bytes is used as an index for a table lookup producing a 32-bit word;
16 xors combine these 16 words and 4 expanded key words into 4 new state words;
those 4 words are viewed as the starting 16 bytes for the next round.

See Section 5.2.1 of "AES Proposal: Rijndael" by Daemen and Rijmen.

Endianness

On a little-endian CPU, extracting the first byte of a 32-bit word is an &0xff arithmetic instruction; on a big-endian CPU, extracting the first byte of a 32-bit word is a >>24 arithmetic instruction. Similar comments apply to the other bytes.

One can write AES software that uses arithmetic instructions as if the CPU were little-endian. If the CPU is actually big-endian, the software swaps the bytes of the AES key, input, and output (at run time). The software also swaps the bytes of the table (at compile time), for example by expressing the table as a sequence of 32-bit integers.

Matched endianness. One can easily eliminate the byte-swapping time for the AES key, input, and output: simply use the appropriate arithmetic instructions for the endianness of the CPU. In this case the table must not be swapped.

Table structure

All else being equal, smaller AES tables are faster: they take less time to load into cache and are more likely to stay in cache. Beware that most benchmarking tools preload caches and thus can't see this speedup.

Daemen and Rijmen suggest "4 KBytes of tables." There are 4 tables. Each table has 256 words occupying 1024 bytes. The loads are spread evenly across the tables.

Rotated lookups. Daemen and Rijmen suggest an alternative "with a total table size of 1KByte" but with extra arithmetic. The point is that the tables are rotations of each other: for example, the first word of the first table is (0xc6,0x63,0x63,0xa5), the first word of the second table is (0xa5,0xc6,0x63,0x63), the first word of the third table is (0x63,0xa5,0xc6,0x63), and the first word of the fourth table is (0x63,0x63,0xa5,0xc6). One can store the first table, and simulate a lookup in another table at the cost of an extra rotation.

Unaligned loads. One can instead use a single 2KB table having 256 8-byte entries such as (0x00,0x63,0xa5,0xc6,0x63,0x63,0xa5,0xc6). There are many reasonable choices of pattern here; what's important is that the pattern includes the desired (0xc6,0x63,0x63,0xa5) and (0xa5,0xc6,0x63,0x63) and so on as substrings. On the Pentium, the PowerPC, et al., one can load 4-byte words from memory addresses that aren't divisible by 4, and there's no penalty when the word doesn't cross an 8-byte boundary.

Masked loads

16 of the 160 table lookups in 10-round AES are masked. The 40 table lookups in 10-round AES key expansion are also masked. The masks are 0x000000ff, 0x0000ff00, 0x00ff0000, and 0xff000000, each used equally often.

The simplest way to compute a mask is with an arithmetic instruction: for example, &0xff00.

Byte loads. One can eliminate 25% of the masks, namely the bottom-byte masks, by combining them with load instructions. All popular CPUs have single-byte-load instructions.

Two-byte loads. One can eliminate another 25% of the masks on CPUs with two-byte-load instructions. This constrains the table pattern: it's important to have (0x00,0x63) on little-endian CPUs, and (0x63,0x00) on big-endian CPUs.

Masked tables. One can eliminate all of the masks by precomputing masked tables, using extra table space. The simplest table structure uses a total of 8KB. Two tables, one with entries such as (0x00,0x63,0xa5,0xc6,0x63,0x63,0xa5,0xc6) and another with entries such as (0x00,0x00,0x00,0x00,0x63,0x00,0x00,0x00), use a total of 4KB. In my experience, the cost of larger tables outweighs the benefit of eliminating a few masks.

Key expansion

A 4-word (128-bit) key is expanded in 40 steps. Each step produces a new word, totalling 44 words in the expanded key. A step has a byte extraction (see below), a masked load, and two xors. The total work is 40 byte extractions, 40 masked loads, and 80 xors. For comparison, the subsequent work to encrypt a block involves 160 byte extractions, 160 loads (of which 16 are masked), and 160 xors.

Daemen and Rijmen say (Section 4.3.2) that key expansion involves "almost no computational overhead." Obviously key expansion is less expensive than encrypting a block. On the other hand, the cost of key expansion is still quite noticeable.

Expanded keys. A typical AES implementation precomputes and stores an expanded key. The 40 byte extractions, 40 masked loads, and 80 xors aren't repeated for every block; they are done only once, along with 44 stores. Each block then involves 44 extra loads for the expanded key. Some stores and loads can be eliminated if many blocks are handled at once and some extra registers are available.

Long-term storage of an expanded key can slow down applications that handle many keys: the expanded keys take more time to load into cache than the original keys and are less likely to stay in cache.

Partially expanded keys. An alternative is to precompute and store a partially expanded key, only 14 words instead of 44 words. The partially expanded key consists of words 0, 1, 2, 3, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40 from the expanded key. Loading the partially expanded key, and converting it into the fully expanded key, takes only 14 loads and 30 xors.

One can interpolate between partial expansion and full expansion, using various amounts of storage per key and achieving various balances between load and xor.

Index extraction

The 16 xor operations in an AES round produce 4 words in 4 integer registers. The 16 bytes of these words are then extracted and used as indices for the next round.

The simplest way to extract 4 bytes is using 6 instructions, namely 3 shifts and 3 bottom-byte extractions: &255; (>>8)&255; (>>16)&255; >>24.

Using a byte as an index then requires multiplying the byte by a constant that depends on the table structure. Let's assume the 2KB tables described above; then the constant is 8. The multiplications use 4 shifts: <<3; <<3; <<3; <<3.

Scaled-index loads. Many CPUs can multiply an index register by 8 for free as part of a load.

Scaled-index extractions. What about CPUs that can't multiply an index register by 8 for free? Two of the multiplications can nevertheless be eliminated, because they can be combined with shifts. The overall extract-and-scale sequence has 8 instructions: (<<3)&2040; (>>5)&2040; (>>13)&2040; (>>21)&2040. The PowerPC has a combined rotate-and-mask instruction, making this sequence take only 4 instructions.

Scaled tables. One can rotate table entries by 3 bits, reducing the above 8 instructions to 7 instructions.

Second-byte instructions. The x86 architecture (Pentium, Athlon, etc.) includes a combined (>>8)&255 instruction. This means that extracting 4 bytes takes only 5 instructions: &255; (>>8)&255; >>16; &255; >>8. Alternate 5-instruction sequence: &255; (>>8)&255; >>16; &255; (>>8)&255.

Of course, the ultimate measure of performance is a cycle count, not an instruction count. Matsui states that the (>>8)&255; instruction is "a bit expensive" on the Pentium 4 Prescott (f33, f34, f41); presumably this means that the instruction takes more cycles than, e.g., a mere &255. But all of the measurements I've seen indicate the opposite. I'm not sure what I'm missing here.

32-bit shifts on 64-bit architectures. The amd64 architecture (P4E, Athlon 64, Core 2, etc.) can right-shift a 64-bit register, but Matsui comments that this operation is extremely slow on the P4E. It's much better to use the amd64's x86-compatible right-shift instruction; this instruction sets the top 32 bits of its 64-bit input to 0 before shifting.

Byte extraction via loads. A completely different way to extract 4 bytes is with 1 store and 4 loads. One can mix this with the previous approaches to achieve various balances between load and arithmetic.

Consider, for example, the UltraSPARC, which has 2 integer units and 1 load/store unit. A traditional sequence of 14 partially-expanded-key loads (see below), 30 key-expansion xors, 160 scaled-index extractions, 160 table-lookup loads, 160 xors, 16 masks, 4 input loads, and 4 output stores occupies a total of 526 integer instructions (at least 263 cycles) and 182 loads (at least 182 cycles). Using loads for some byte extractions, replacing 36 scaled-index extractions with 9 stores and 36 loads, means a total of 454 integer instructions (at least 227 cycles) and 227 loads/stores (at least 227 cycles).

Unrolling

A typical 9-iteration AES loop involves 9 increments of a loop index, 9 comparisons, and 9 branches, one of which is mispredicted on most CPUs. The loop index also consumes a register, forcing an extra 9 stores and 9 loads on CPUs that don't have registers to spare.

Full unrolling. One can eliminate all of these costs by fully unrolling the loop. Beware, however, that full unrolling costs a few kilobytes of code-cache space.

Partial unrolling. CPUs are more likely to correctly predict a 4-iteration loop than a 9-iteration loop.

Instruction scheduling

The 16 table lookups in an AES round are independent and can be scheduled in many different ways. One can, for example, perform all the table lookups for the first input from bottom byte to top (outputs 0, 3, 2, 1), then perform all the table lookups for the second input from bottom byte to top (outputs 1, 0, 3, 2), then perform all the table lookups for the third input from bottom byte to top (outputs 2, 1, 0, 3), then perform all the table lookups for the fourth input from bottom byte to top (outputs 3, 2, 1, 0). One can, as another example, first perform all the table lookups for the first output in order of the inputs, then perform all the table lookups for the second output in order of the inputs, etc.

Maximum parallelism. The overall depth of the AES round is one byte extraction plus one table lookup plus two xors: a mythical CPU offering extensive parallelism could perform all sixteen byte extractions in parallel, then all sixteen table lookups in parallel, then eight xors in parallel, then four xors in parallel. Note that each output is obtained by xor'ing two parallel xor's, rather than by three serial xor's.

Deferring loads. The amd64 architecture poses several challenges to AES instruction scheduling. First, most integer instructions require the output register to be one of the input registers. Second, typical amd64 CPUs handle a load and xor most efficiently as a unified load-xor, but a unified load-xor gives no opportunity to switch registers. Third, only 4 registers (eax, ebx, ecx, edx) allow second-byte instructions.

Matsui concludes that, on amd64 (and x86), keeping each round's inputs y0, y1, y2, y3 and outputs z0, z1, z2, z3 in eax, ebx, ecx, edx, to allow second-byte instructions, is "impossible without saving/restoring." But that's incorrect. No extra copies are required. A careful instruction sequence uses the minimal conceivable number of instructions: 20 for byte extraction, 16 for table lookups, and 4 for handling the expanded key. The idea is to extract all the bytes from an input, freeing the input's register for an output, before doing any table lookups involving that output:

Extract the 4 bytes from y0. At this point y1, y2, y3, and the 4 bytes are live.
Feed 1 byte into z0. At this point y1, y2, y3, z0, and 3 more bytes are live.
Extract the 4 bytes from y1, immediately feeding 1 into z0. At this point y2, y3, z0, and 6 more bytes are live.
Feed 2 bytes into z1. At this point y2, y3, z0, z1, and 4 more bytes are live.
Extract the 4 bytes from y2, immediately feeding 2 into z0 and z1. At this point y3, z0, z1, and 6 more bytes are live.
Feed 3 bytes into z2. At this point y3, z0, z1, z2, and 3 more bytes are live.
Extract the 4 bytes from y3, immediately feeding 3 into z0, z1, and z2. At this point z0, z1, z2, and 4 more bytes are live.
Feed 4 bytes into z3. At this point z0, z1, z2, and z3 are live.
Handle 4 words of the expanded key.

The maximum number of live registers here is 9, fitting easily into the amd64 instruction set.

Squeezing inputs and outputs into 7 32-bit registers. The x86 architecture poses an additional challenge to AES instruction scheduling: there are only 7 general-purpose integer registers.

It's still possible to handle a round with 0 stores, 4 expanded-key loads, and 16 loads for table lookups. The shortest instruction sequence that I know has a total of 46 instructions, 6 more than what would be possible with extra registers; 1 of the 46 instructions can be eliminated if the key expansion is changed.

The idea of this instruction sequence is to rotate y0 by 16 bits, use the bottom two bytes of both y0 and y2, and then merge the remaining four bytes of y0 and y2 into a single register (for example, shifting y0 down 16 bits, masking y1, and adding the results), freeing a register at the cost of 3 extra instructions (the rotate, the mask, and the add); splitting 3 load-xor instructions into 3 loads and 3 xors then easily puts all outputs into suitable registers. The rotation can be eliminated if the expanded-key word that corresponds to y0 is rotated by 16 bits.

Speed reports

Speed reports vary in whether they use CTR, CBC, etc., and in the exact rules for measuring speeds. The "eSTREAM" cycles/byte counts are for counter-mode AES measured by the eSTREAM benchmarking toolkit; future implementors are encouraged to support the eSTREAM interface for direct comparability.

Architecture	CPU	eSTREAM cycles/byte	Ad-hoc cycles/byte	Software
amd64	Intel Core 2 Duo (6f6)?		9.2	Matsui/Nakajima (CHES 2007)
amd64	AMD Athlon 64 (15,75,2)?		10.625 (170/block)	Matsui (FSE 2006)
amd64	AMD Athlon 64 (15,75,2)?		12.4375 (199/block)	Lipmaa
amd64	Intel Core 2 Duo (6f6); katana	12.56		hongjun/v1/1
amd64	Intel Core 2 Quad Q6600 (6fb); latour	12.57		hongjun/v1/1
amd64	AMD Athlon 64 (15,75,2)?		13.125 (210/block)	Osvik
amd64	AMD Athlon 64 X2 (15,75,2); mace	13.32		hongjun/v1/1
amd64	AMD Opteron 240 (f58); nmisles8amd64	13.45		bernstein/amd64-1/1
x86	Intel Pentium III (68a)?		14 (224/block)	Osvik
x86	AMD Athlon (622)?		14.0625 (225/block)	Osvik
x86	Intel Pentium III (68a)?		14.125 (226/block)	Lipmaa
x86	Intel Pentium 4 (f12)?		15 (240/block)	Osvik
x86	Intel Pentium 4 (f12)?		15.875 (254/block)	Lipmaa
x86	Intel Pentium M (695); whisper	15.96		bernstein/x86-mmx-1/1
amd64	Intel Pentium 4 (f64)?		16 (256/block)	Matsui (FSE 2006)
x86	Intel Pentium III (68a)?		16.25 (260/block)	Gladman
amd64	Intel Pentium D (f64); nmi0161	16.74		bernstein/amd64-2/1
amd64	Intel Pentium D (f64); svlin001	16.75		bernstein/amd64-2/1
amd64	Intel Xeon (f41); nmi0056	16.75		bernstein/amd64-2/1
amd64	Intel Xeon (f4a); nmi0090	16.77		bernstein/amd64-2/1
sparc	Sun UltraSPARC III		16.875 (270/block)	Lipmaa
amd64	Intel Xeon (f41); nmi0057	16.89		bernstein/amd64-2/1
amd64	Intel Pentium D (f64); speed	16.90		bernstein/amd64-2/1
amd64	Intel Pentium D (f64); nmi0104	16.90		bernstein/amd64-2/1
amd64	Intel Pentium D (f64); nmi0241	16.93		bernstein/amd64-2/1
ppc64	IBM POWER5; nmi0154	16.93		bernstein/big-1/1
x86	Intel Pentium 4 (f24); nmi0086	16.96		bernstein/x86-mmx-1/1
x86	Intel Pentium 4 (f12); fireball	16.98		bernstein/x86-mmx-1/1
x86	Intel Pentium 4 (f24); nmitest4	17.01		bernstein/x86-mmx-1/1
ppc64	IBM PowerPC G5 970; nmi0048	17.17		bernstein/big-1/1
x86	Intel Pentium 2 (652); boris	17.33		bernstein/x86-mmx-1/1
x86	Intel Pentium 3 (68a)	17.49		Bernstein aes-128/x86-mmx-1
x86	Intel Pentium 3 (672); orpheus	17.55		bernstein/x86-mmx-1/1
x86	Intel Pentium M (6d8)	17.57		Wu v0/1
x86	Intel Pentium 4 (f33)?		17.75 (284/block)	Matsui/Fukuda (FSE 2005)
x86	Intel Xeon (f29); nmibuild40	17.79		bernstein/x86-mmx-1/1
x86	Intel Xeon (f27); nmi0059	17.79		bernstein/x86-mmx-1/1
x86	Intel Xeon (f25); nmibuild16	17.79		bernstein/x86-mmx-1/1
x86	Intel Xeon (f25); nmi0013	17.79		bernstein/x86-mmx-1/1
x86	Intel Xeon (f29); nmi0059	17.80		bernstein/x86-mmx-1/1
x86	Intel Xeon (f29); nmibuild17	17.81		bernstein/x86-mmx-1/1
x86	Intel Xeon (f25); nmibuild15	17.82		bernstein/x86-mmx-1/1
x86	Intel Xeon (f25); nmibuild26	17.83		bernstein/x86-mmx-1/1
x86	Intel Xeon (f25); nmibuild21	17.83		bernstein/x86-mmx-1/1
x86	Intel Xeon (f25); nmi0036	17.84		bernstein/x86-mmx-1/1
x86	Intel Xeon (f25); nmibuild22	17.84		bernstein/x86-mmx-1/1
x86	AMD Athlon (622); thoth	18.38		bernstein/x86-mmx-1/1
ppc32	IBM POWER4; nmibuild14	18.55		bernstein/little-1/1
x86	Intel Xeon (f41); nmi0079	18.88		bernstein/x86-mmx-1/1
x86	Intel Xeon (f41); nmi0062	18.89		bernstein/x86-mmx-1/1
amd64	Intel Core 2 Duo (6f6)		18.9	OpenSSL 0.9.8e
x86	Intel Xeon (f41); nmi0061	18.91		bernstein/x86-mmx-1/1
x86	Intel Pentium 4 (f41); svlin002	18.94		bernstein/x86-mmx-1/1
x86	Intel Xeon (f41); nmi0076	18.96		bernstein/x86-mmx-1/1
x86	Intel Xeon (f4a); nmi0102	18.97		bernstein/x86-mmx-1/1
x86	Intel Xeon (f41); nmi0060	18.97		bernstein/x86-mmx-1/1
x86	Intel Xeon (f41); nmi0063	18.95		bernstein/x86-mmx-1/1
x86	Intel Pentium 3 (68a)	19.06		Wu v1/1
ppc32	Motorola PowerPC G4 7410; gggg	19.11		bernstein/big-1/1
amd64	Intel Core 2 Duo (6f6)		19.5	OpenSSL 0.9.8a
x86	AMD Athlon (622)?		19.9375 (319/block)	Lipmaa
x86	Intel Pentium 1 (52c)		20 (320/block)	Lipmaa
sparc	Sun UltraSPARC III	20.75		Bernstein big-1/1
amd64	AMD Athlon 64 (15,75,2)		20.9	OpenSSL 0.9.8e
ppc32	Motorola PowerPC G4 7400; nmi0042	20.92		bernstein/big-1/1
x86	Intel Pentium M (6d8)		21	OpenSSL 0.9.8a
x86	Intel Pentium D (f47); shell	21.58		bernstein/x86-mmx-1/1
x86	AMD Athlon (622)		22	OpenSSL 0.9.8a
x86	Intel Pentium 4 (f29)		22	OpenSSL 0.9.8b
amd64	AMD Athlon 64 (15,75,2)?		23.5	OpenSSL 0.9.7e
x86	Intel Pentium 4 (f41)		23.5	OpenSSL 0.9.8a
x86	Intel Pentium 3 (672); orpheus		23.62	OpenSSL 0.9.8e
ppc32	Motorola PowerPC G4 7410		24.0625 (385/block)	Ahrens
x86	Intel Pentium 4 (f12)		24.4	OpenSSL 0.9.8a
sparc	Sun UltraSPARC III		25	OpenSSL
ppc32	Motorola PowerPC G4 7410		25.0625 (401/block)	Ahrens
x86	Intel Core Duo; nmi0068	25.74		gladman/1
amd64	Intel Pentium D (f64); speed		27.33	OpenSSL 0.9.8e
ppc32	Motorola PowerPC G4 7410; gggg		29.32	OpenSSL 0.9.8c
sparcv9	Sun UltraSPARC III; nmi0051	29.45		bernstein/big-1/1
sparcv9	Sun UltraSPARC III; nmisolaris10	29.46		bernstein/big-1/1
ppc64	IBM Cell PPE; nmips3	35.20		bernstein/big-1/1
amd64	Intel Pentium 4 (f64)		37	OpenSSL 0.9.7f
x86	Intel Pentium 4 (f29)		39	OpenSSL 0.9.7e
sparc	Sun UltraSPARC III		46.875 (750/block)	Bassham
x86	Intel Pentium 1 (52c); cruncher	38.20		hongjun/v1/1

Regarding amd64 Intel Pentium 4, Matsui writes: "The number of memory reads for one block encryption of AES is 4 (for plaintext loads) + 11 x 4 (for subkey loads) + 16 x 10 (for table lookups) = 208, which means that Pentium 4 takes at least 208 cycles/block for one block encryption." But this lower bound ignores the possibility of loading partially expanded keys, saving as many as 30 loads, and using 64-bit loads for keys and plaintext, saving 9 more loads.

Regarding amd64 AMD Athlon 64, Matsui writes: "Considering an instruction latency of Athlon 64, the theoretical limit of AES performance on this processor seems around 16 cycles/round = 160 cycles/block. Our result is hence reaching closely this limit."