Information below this line has not yet been updated.
The reader is expected to already know the standard structure of AES software:
One can write AES software that uses arithmetic instructions as if the CPU were little-endian. If the CPU is actually big-endian, the software swaps the bytes of the AES key, input, and output (at run time). The software also swaps the bytes of the table (at compile time), for example by expressing the table as a sequence of 32-bit integers.
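The run-time swap described above can be sketched in C as follows (a minimal illustration; the function name is mine):

```c
#include <stdint.h>

/* Byte-swap one 32-bit word of the key, input, or output at run time,
   as needed on a big-endian CPU.  The compile-time table swap is the
   same permutation applied to each 32-bit table word. */
uint32_t swap32(uint32_t x) {
  return (x << 24)
       | ((x & 0xff00) << 8)
       | ((x >> 8) & 0xff00)
       | (x >> 24);
}
```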
Matched endianness. One can easily eliminate the byte-swapping time for the AES key, input, and output: simply use the appropriate arithmetic instructions for the endianness of the CPU. In this case the table must not be swapped.
Daemen and Rijmen suggest "4 KBytes of tables." There are 4 tables. Each table has 256 words occupying 1024 bytes. The loads are spread evenly across the tables.
Rotated lookups. Daemen and Rijmen suggest an alternative "with a total table size of 1KByte" but with extra arithmetic. The point is that the tables are rotations of each other: for example, the first word of the first table is (0xc6,0x63,0x63,0xa5), the first word of the second table is (0xa5,0xc6,0x63,0x63), the first word of the third table is (0x63,0xa5,0xc6,0x63), and the first word of the fourth table is (0x63,0x63,0xa5,0xc6). One can store the first table, and simulate a lookup in another table at the cost of an extra rotation.
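On a little-endian CPU, reading the byte sequences above as 32-bit integers, each table is the previous one rotated by 8 bits. A sketch (only entry 0 of the table is filled in here):

```c
#include <stdint.h>

/* First word of the first table, (0xc6,0x63,0x63,0xa5), read as a
   little-endian 32-bit integer.  The other 255 entries are omitted
   (zero) in this sketch. */
static const uint32_t T0[256] = { 0xa56363c6u };

static uint32_t rotl32(uint32_t x, int n) {
  return (x << n) | (x >> (32 - n));
}

/* Simulated lookups in the second, third, and fourth tables, each at
   the cost of one extra rotation. */
uint32_t T1(uint8_t i) { return rotl32(T0[i], 8); }
uint32_t T2(uint8_t i) { return rotl32(T0[i], 16); }
uint32_t T3(uint8_t i) { return rotl32(T0[i], 24); }
```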
Unaligned loads. One can instead use a single 2KB table having 256 8-byte entries such as (0x00,0x63,0xa5,0xc6,0x63,0x63,0xa5,0xc6). There are many reasonable choices of pattern here; what's important is that the pattern includes the desired (0xc6,0x63,0x63,0xa5) and (0xa5,0xc6,0x63,0x63) and so on as substrings. On the Pentium, the PowerPC, et al., one can load 4-byte words from memory addresses that aren't divisible by 4, and there's no penalty when the word doesn't cross an 8-byte boundary.
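With the 8-byte entry above, unaligned 4-byte loads at offsets 1 through 4 pick out the four rotated words, and none of these loads crosses the entry's 8-byte boundary. A hedged sketch, with memcpy standing in for the unaligned load:

```c
#include <stdint.h>
#include <string.h>

/* One 8-byte entry of the 2KB table, using the pattern from the text. */
static const uint8_t entry[8] = {0x00,0x63,0xa5,0xc6,0x63,0x63,0xa5,0xc6};

/* A 4-byte load from an address that need not be 4-byte aligned.
   memcpy is the portable C spelling; on x86, PowerPC, et al. it
   compiles to a single load instruction. */
uint32_t load4(const uint8_t *p) {
  uint32_t w;
  memcpy(&w, p, 4);
  return w;
}
```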
The simplest way to compute a mask is with an arithmetic instruction: for example, &0xff00.
Byte loads. One can eliminate 25% of the masks, namely the bottom-byte masks, by combining them with load instructions. All popular CPUs have single-byte-load instructions.
Two-byte loads. One can eliminate another 25% of the masks on CPUs with two-byte-load instructions. This constrains the table pattern: it's important to have (0x00,0x63) on little-endian CPUs, and (0x63,0x00) on big-endian CPUs.
Masked tables. One can eliminate all of the masks by precomputing masked tables, using extra table space. The simplest table structure uses a total of 8KB. Two tables, one with entries such as (0x00,0x63,0xa5,0xc6,0x63,0x63,0xa5,0xc6) and another with entries such as (0x00,0x00,0x00,0x00,0x63,0x00,0x00,0x00), use a total of 4KB. In my experience, the cost of larger tables outweighs the benefit of eliminating a few masks.
Daemen and Rijmen say (Section 4.3.2) that key expansion involves "almost no computational overhead." Obviously key expansion is less expensive than encrypting a block. On the other hand, the cost of key expansion is still quite noticeable.
Expanded keys. A typical AES implementation precomputes and stores an expanded key. The 40 byte extractions, 40 masked loads, and 80 xors aren't repeated for every block; they are done only once, along with 44 stores. Each block then involves 44 extra loads for the expanded key. Some stores and loads can be eliminated if many blocks are handled at once and some extra registers are available.
Long-term storage of an expanded key can slow down applications that handle many keys: the expanded keys take more time to load into cache than the original keys and are less likely to stay in cache.
Partially expanded keys. An alternative is to precompute and store a partially expanded key, only 14 words instead of 44 words. The partially expanded key consists of words 0, 1, 2, 3, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40 from the expanded key. Loading the partially expanded key, and converting it into the fully expanded key, takes only 14 loads and 30 xors.
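Since the AES-128 expanded-key words at indices not divisible by 4 satisfy w[i] = w[i-4] ^ w[i-1], the conversion needs no S-box work. A sketch:

```c
#include <stdint.h>

/* Rebuild the 44-word AES-128 expanded key from the 14 stored words
   0,1,2,3,4,8,12,...,40.  Each of the 30 missing words costs one xor. */
void expand_partial(const uint32_t stored[14], uint32_t w[44]) {
  for (int j = 0; j < 4; j++) w[j] = stored[j];
  for (int j = 4; j < 14; j++) w[4 * (j - 3)] = stored[j]; /* words 4,8,...,40 */
  for (int i = 4; i < 44; i++)
    if (i % 4) w[i] = w[i - 4] ^ w[i - 1];  /* 30 xors */
}
```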
One can interpolate between partial expansion and full expansion, using various amounts of storage per key and achieving various balances between load and xor.
The simplest way to extract 4 bytes is using 6 instructions, namely 3 shifts and 3 bottom-byte extractions: &255; (>>8)&255; (>>16)&255; >>24.
Using a byte as an index then requires multiplying the byte by a constant that depends on the table structure. Let's assume the 2KB table described above; then the constant is 8. The multiplications use 4 shifts: <<3; <<3; <<3; <<3.
Scaled-index loads. Many CPUs can multiply an index register by 8 for free as part of a load.
Scaled-index extractions. What about CPUs that can't multiply an index register by 8 for free? Two of the multiplications can nevertheless be eliminated, because they can be combined with shifts. The overall extract-and-scale sequence has 8 instructions: (<<3)&2040; (>>5)&2040; (>>13)&2040; (>>21)&2040. The PowerPC has a combined rotate-and-mask instruction, making this sequence take only 4 instructions.
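In C the 8-instruction sequence looks like this (2040 = 255 * 8, so each result is a byte value pre-multiplied by 8):

```c
#include <stdint.h>

/* Extract the four bytes of y, each already scaled by 8 for indexing
   a table with 8-byte entries: 4 shifts and 4 masks. */
void extract_scaled(uint32_t y, uint32_t idx[4]) {
  idx[0] = (y << 3) & 2040;   /* bottom byte * 8 */
  idx[1] = (y >> 5) & 2040;   /* second byte * 8 */
  idx[2] = (y >> 13) & 2040;  /* third byte * 8 */
  idx[3] = (y >> 21) & 2040;  /* top byte * 8 */
}
```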
Scaled tables. One can rotate table entries by 3 bits, reducing the above 8 instructions to 7 instructions.
Second-byte instructions. The x86 architecture (Pentium, Athlon, etc.) includes a combined (>>8)&255 instruction. This means that extracting 4 bytes takes only 5 instructions: &255; (>>8)&255; >>16; &255; >>8. Alternate 5-instruction sequence: &255; (>>8)&255; >>16; &255; (>>8)&255.
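A C rendering of the first 5-instruction sequence; the (>>8)&255 steps are where an x86 second-byte instruction (reading, e.g., %ah) would be used:

```c
#include <stdint.h>

/* Extract 4 bytes in 5 operations: &255; (>>8)&255; >>16; &255; >>8. */
void extract4(uint32_t y, uint32_t b[4]) {
  b[0] = y & 255;         /* bottom byte */
  b[1] = (y >> 8) & 255;  /* second byte: one instruction on x86 */
  y >>= 16;
  b[2] = y & 255;         /* third byte */
  b[3] = y >> 8;          /* top byte */
}
```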
Of course, the ultimate measure of performance is a cycle count, not an instruction count. Matsui states that the (>>8)&255 instruction is "a bit expensive" on the Pentium 4 Prescott (f33, f34, f41); presumably this means that the instruction takes more cycles than, e.g., a mere &255. But all of the measurements I've seen indicate the opposite. I'm not sure what I'm missing here.
32-bit shifts on 64-bit architectures. The amd64 architecture (P4E, Athlon 64, Core 2, etc.) can right-shift a 64-bit register, but Matsui comments that this operation is extremely slow on the P4E. It's much better to use the amd64's x86-compatible right-shift instruction; this instruction sets the top 32 bits of its 64-bit input to 0 before shifting.
Byte extraction via loads. A completely different way to extract 4 bytes is with 1 store and 4 loads. One can mix this with the previous approaches to achieve various balances between load and arithmetic.
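A sketch in C, assuming a scratch stack slot (the order in which the bytes come out depends on the CPU's endianness):

```c
#include <stdint.h>
#include <string.h>

/* Extract 4 bytes with 1 store and 4 loads instead of shifts and
   masks.  In assembly the temporary would be a stack slot; here
   memcpy spells the 32-bit store. */
void extract_via_loads(uint32_t y, uint8_t b[4]) {
  uint8_t tmp[4];
  memcpy(tmp, &y, 4);  /* one 4-byte store */
  b[0] = tmp[0];       /* four single-byte loads */
  b[1] = tmp[1];
  b[2] = tmp[2];
  b[3] = tmp[3];
}
```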
Consider, for example, the UltraSPARC, which has 2 integer units and 1 load/store unit. A traditional sequence of 14 partially-expanded-key loads (see above), 30 key-expansion xors, 160 scaled-index extractions, 160 table-lookup loads, 160 xors, 16 masks, 4 input loads, and 4 output stores occupies a total of 526 integer instructions (at least 263 cycles) and 182 loads (at least 182 cycles). Using loads for some byte extractions, replacing 36 scaled-index extractions with 9 stores and 36 loads, means a total of 454 integer instructions (at least 227 cycles) and 227 loads/stores (at least 227 cycles).
Full unrolling. One can eliminate all of these costs by fully unrolling the loop. Beware, however, that full unrolling costs a few kilobytes of code-cache space.
Partial unrolling. CPUs are more likely to correctly predict a 4-iteration loop than a 9-iteration loop.
Maximum parallelism. The overall depth of the AES round is one byte extraction plus one table lookup plus two xors: a mythical CPU offering extensive parallelism could perform all sixteen byte extractions in parallel, then all sixteen table lookups in parallel, then eight xors in parallel, then four xors in parallel. Note that each output is obtained by xor'ing two parallel xor's, rather than by three serial xor's.
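For one output word, the balanced tree looks like this (the expanded-key xor is left out of the sketch):

```c
#include <stdint.h>

/* Combine four table-lookup results with a depth-2 xor tree:
   two xors in parallel, then one final xor, rather than a serial
   chain t0^t1, then ^t2, then ^t3 of depth 3. */
uint32_t combine4(uint32_t t0, uint32_t t1, uint32_t t2, uint32_t t3) {
  return (t0 ^ t1) ^ (t2 ^ t3);
}
```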
Deferring loads. The amd64 architecture poses several challenges to AES instruction scheduling. First, most integer instructions require the output register to be one of the input registers. Second, typical amd64 CPUs handle a load and xor most efficiently as a unified load-xor, but a unified load-xor gives no opportunity to switch registers. Third, only 4 registers (eax, ebx, ecx, edx) allow second-byte instructions.
Matsui concludes that, on amd64 (and x86), keeping each round's inputs y0, y1, y2, y3 and outputs z0, z1, z2, z3 in eax, ebx, ecx, edx, to allow second-byte instructions, is "impossible without saving/restoring." But that's incorrect. No extra copies are required. A careful instruction sequence uses the minimal conceivable number of instructions: 20 for byte extraction, 16 for table lookups, and 4 for handling the expanded key. The idea is to extract all the bytes from an input, freeing the input's register for an output, before doing any table lookups involving that output.
Squeezing inputs and outputs into 7 32-bit registers. The x86 architecture poses an additional challenge to AES instruction scheduling: there are only 7 general-purpose integer registers.
It's still possible to handle a round with 0 stores, 4 expanded-key loads, and 16 loads for table lookups. The shortest instruction sequence that I know has a total of 46 instructions, 6 more than what would be possible with extra registers; 1 of the 46 instructions can be eliminated if the key expansion is changed.
The idea of this instruction sequence is to rotate y0 by 16 bits, use the bottom two bytes of both y0 and y2, and then merge the remaining four bytes of y0 and y2 into a single register (for example, shifting y0 down 16 bits, masking y2, and adding the results), freeing a register at the cost of 3 extra instructions (the rotate, the mask, and the add); splitting 3 load-xor instructions into 3 loads and 3 xors then easily puts all outputs into suitable registers. The rotation can be eliminated if the expanded-key word that corresponds to y0 is rotated by 16 bits.
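A hypothetical C rendering of the merge step, assuming the leftover bytes of both registers sit in their top 16 bits (the exact layout depends on the chosen instruction sequence):

```c
#include <stdint.h>

/* Merge the leftover 16 bits of y0 and y2 into one register: one
   shift, one mask, one add.  This frees a register at the cost of
   those 3 extra instructions. */
uint32_t merge_halves(uint32_t y0, uint32_t y2) {
  return (y0 >> 16) + (y2 & 0xffff0000u);
}
```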
|Architecture||CPU||Cycles/byte (eSTREAM or ad-hoc measurement)||Software|
|amd64||Intel Core 2 Duo (6f6)?||9.2||Matsui/Nakajima (CHES 2007)|
|amd64||AMD Athlon 64 (15,75,2)?||10.625 (170/block)||Matsui (FSE 2006)|
|amd64||AMD Athlon 64 (15,75,2)?||12.4375 (199/block)||Lipmaa|
|amd64||Intel Core 2 Duo (6f6); katana||12.56||hongjun/v1/1|
|amd64||Intel Core 2 Quad Q6600 (6fb); latour||12.57||hongjun/v1/1|
|amd64||AMD Athlon 64 (15,75,2)?||13.125 (210/block)||Osvik|
|amd64||AMD Athlon 64 X2 (15,75,2); mace||13.32||hongjun/v1/1|
|amd64||AMD Opteron 240 (f58); nmisles8amd64||13.45||bernstein/amd64-1/1|
|x86||Intel Pentium III (68a)?||14 (224/block)||Osvik|
|x86||AMD Athlon (622)?||14.0625 (225/block)||Osvik|
|x86||Intel Pentium III (68a)?||14.125 (226/block)||Lipmaa|
|x86||Intel Pentium 4 (f12)?||15 (240/block)||Osvik|
|x86||Intel Pentium 4 (f12)?||15.875 (254/block)||Lipmaa|
|x86||Intel Pentium M (695); whisper||15.96||bernstein/x86-mmx-1/1|
|amd64||Intel Pentium 4 (f64)?||16 (256/block)||Matsui (FSE 2006)|
|x86||Intel Pentium III (68a)?||16.25 (260/block)||Gladman|
|amd64||Intel Pentium D (f64); nmi0161||16.74||bernstein/amd64-2/1|
|amd64||Intel Pentium D (f64); svlin001||16.75||bernstein/amd64-2/1|
|amd64||Intel Xeon (f41); nmi0056||16.75||bernstein/amd64-2/1|
|amd64||Intel Xeon (f4a); nmi0090||16.77||bernstein/amd64-2/1|
|sparc||Sun UltraSPARC III||16.875 (270/block)||Lipmaa|
|amd64||Intel Xeon (f41); nmi0057||16.89||bernstein/amd64-2/1|
|amd64||Intel Pentium D (f64); speed||16.90||bernstein/amd64-2/1|
|amd64||Intel Pentium D (f64); nmi0104||16.90||bernstein/amd64-2/1|
|amd64||Intel Pentium D (f64); nmi0241||16.93||bernstein/amd64-2/1|
|ppc64||IBM POWER5; nmi0154||16.93||bernstein/big-1/1|
|x86||Intel Pentium 4 (f24); nmi0086||16.96||bernstein/x86-mmx-1/1|
|x86||Intel Pentium 4 (f12); fireball||16.98||bernstein/x86-mmx-1/1|
|x86||Intel Pentium 4 (f24); nmitest4||17.01||bernstein/x86-mmx-1/1|
|ppc64||IBM PowerPC G5 970; nmi0048||17.17||bernstein/big-1/1|
|x86||Intel Pentium 2 (652); boris||17.33||bernstein/x86-mmx-1/1|
|x86||Intel Pentium 3 (68a)||17.49||Bernstein aes-128/x86-mmx-1|
|x86||Intel Pentium 3 (672); orpheus||17.55||bernstein/x86-mmx-1/1|
|x86||Intel Pentium M (6d8)||17.57||Wu v0/1|
|x86||Intel Pentium 4 (f33)?||17.75 (284/block)||Matsui/Fukuda (FSE 2005)|
|x86||Intel Xeon (f29); nmibuild40||17.79||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f27); nmi0059||17.79||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f25); nmibuild16||17.79||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f25); nmi0013||17.79||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f29); nmi0059||17.80||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f29); nmibuild17||17.81||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f25); nmibuild15||17.82||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f25); nmibuild26||17.83||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f25); nmibuild21||17.83||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f25); nmi0036||17.84||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f25); nmibuild22||17.84||bernstein/x86-mmx-1/1|
|x86||AMD Athlon (622); thoth||18.38||bernstein/x86-mmx-1/1|
|ppc32||IBM POWER4; nmibuild14||18.55||bernstein/little-1/1|
|x86||Intel Xeon (f41); nmi0079||18.88||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f41); nmi0062||18.89||bernstein/x86-mmx-1/1|
|amd64||Intel Core 2 Duo (6f6)||18.9||OpenSSL 0.9.8e|
|x86||Intel Xeon (f41); nmi0061||18.91||bernstein/x86-mmx-1/1|
|x86||Intel Pentium 4 (f41); svlin002||18.94||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f41); nmi0063||18.95||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f41); nmi0076||18.96||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f4a); nmi0102||18.97||bernstein/x86-mmx-1/1|
|x86||Intel Xeon (f41); nmi0060||18.97||bernstein/x86-mmx-1/1|
|x86||Intel Pentium 3 (68a)||19.06||Wu v1/1|
|ppc32||Motorola PowerPC G4 7410; gggg||19.11||bernstein/big-1/1|
|amd64||Intel Core 2 Duo (6f6)||19.5||OpenSSL 0.9.8a|
|x86||AMD Athlon (622)?||19.9375 (319/block)||Lipmaa|
|x86||Intel Pentium 1 (52c)||20 (320/block)||Lipmaa|
|sparc||Sun UltraSPARC III||20.75||Bernstein big-1/1|
|amd64||AMD Athlon 64 (15,75,2)||20.9||OpenSSL 0.9.8e|
|ppc32||Motorola PowerPC G4 7400; nmi0042||20.92||bernstein/big-1/1|
|x86||Intel Pentium M (6d8)||21||OpenSSL 0.9.8a|
|x86||Intel Pentium D (f47); shell||21.58||bernstein/x86-mmx-1/1|
|x86||AMD Athlon (622)||22||OpenSSL 0.9.8a|
|x86||Intel Pentium 4 (f29)||22||OpenSSL 0.9.8b|
|amd64||AMD Athlon 64 (15,75,2)?||23.5||OpenSSL 0.9.7e|
|x86||Intel Pentium 4 (f41)||23.5||OpenSSL 0.9.8a|
|x86||Intel Pentium 3 (672); orpheus||23.62||OpenSSL 0.9.8e|
|ppc32||Motorola PowerPC G4 7410||24.0625 (385/block)||Ahrens|
|x86||Intel Pentium 4 (f12)||24.4||OpenSSL 0.9.8a|
|sparc||Sun UltraSPARC III||25||OpenSSL|
|ppc32||Motorola PowerPC G4 7410||25.0625 (401/block)||Ahrens|
|x86||Intel Core Duo; nmi0068||25.74||gladman/1|
|amd64||Intel Pentium D (f64); speed||27.33||OpenSSL 0.9.8e|
|ppc32||Motorola PowerPC G4 7410; gggg||29.32||OpenSSL 0.9.8c|
|sparcv9||Sun UltraSPARC III; nmi0051||29.45||bernstein/big-1/1|
|sparcv9||Sun UltraSPARC III; nmisolaris10||29.46||bernstein/big-1/1|
|ppc64||IBM Cell PPE; nmips3||35.20||bernstein/big-1/1|
|amd64||Intel Pentium 4 (f64)||37||OpenSSL 0.9.7f|
|x86||Intel Pentium 1 (52c); cruncher||38.20||hongjun/v1/1|
|x86||Intel Pentium 4 (f29)||39||OpenSSL 0.9.7e|
|sparc||Sun UltraSPARC III||46.875 (750/block)||Bassham|
Regarding amd64 Intel Pentium 4, Matsui writes: "The number of memory reads for one block encryption of AES is 4 (for plaintext loads) + 11 x 4 (for subkey loads) + 16 x 10 (for table lookups) = 208, which means that Pentium 4 takes at least 208 cycles/block for one block encryption." But this lower bound ignores the possibility of loading partially expanded keys, saving as many as 30 loads, and using 64-bit loads for keys and plaintext, saving 9 more loads.
Regarding amd64 AMD Athlon 64, Matsui writes: "Considering an instruction latency of Athlon 64, the theoretical limit of AES performance on this processor seems around 16 cycles/round = 160 cycles/block. Our result is hence reaching closely this limit."