Poly1305-AES for the Athlon D. J. Bernstein
Authenticators and signatures
A state-of-the-art message-authentication code

Poly1305-AES for the Athlon

poly1305aes_athlon is a Poly1305-AES implementation aimed at the Athlon, Athlon MP, Athlon XP, and Duron. It's also the best option right now for the Opteron and Athlon 64.

Requirements: poly1305aes_athlon must be run on an x86 CPU (for example, the Athlon XP). It sets the CPU's floating-point mode to extended-precision round-to-nearest; programs must not assume that the floating-point mode is preserved by function calls.

Here are the poly1305aes_athlon files:

Supplementary files useful for timing:

If you want to know how fast the code is, see my separate page of speed tables. If you want a rough idea of how the code works, see the original Poly1305-AES paper, Section 4. If you want to know what improvements are possible in poly1305aes_athlon, read the rest of this page.

How much can the Poly1305 code be improved?

The current version of poly1305_athlon is the same as poly1305_ppro, without even minimal attempts at proper 0x66 padding for the Athlon. The main loop takes around 50 cycles. The obvious bottleneck is 36 cycles for 36 floating-point additions.

How much can the AES code be improved?

The code takes 476 cycles on an Athlon, including timing overhead. This figure is very close to constant for inputs in L1 cache; so far I've found no evidence of input-dependent timings.

My original Athlon AES code (October 2004, summarized in the Poly1305 paper) took about 100 fewer cycles using large tables. The aes_ppro code takes about 70 fewer cycles using small tables. But those implementations (and every other AES implementation I've seen!) have input-dependent Athlon timings, notably because of address-dependent restrictions on two loads in the same cycle, as discussed in my paper on cache-timing attacks.

The potential workaround suggested in my paper would have required large tables. I instead decided to try putting all loads and stores into Athlon execution unit #0. That unit can't do more than one load/store operation in a cycle. (I allowed another unit to do a load/store if the address was input-independent and the previous load/store address was also input-independent.)

I arranged all instructions into groups of 3, with the load/store instructions first in each group; this required inserting some no-ops. I then arranged the groups of 3 into aligned 16-byte chunks, extending instructions by one or two bytes to exactly fill 16 bytes, and adding 3 no-ops if necessary. My extensions were a 1-byte irrelevant DS prefix and a 2-byte irrelevant CS DS prefix; experiments indicate that the Athlon can decode these for free, but can't decode a third irrelevant prefix for free. (Future plan: change 1-byte immediates into 4-byte immediates.) I used 0x90 for a 1-byte no-op, 0x66 0x90 for a 2-byte no-op, and various LEAs for 3-byte and 4-byte no-ops.

I could probably make the code substantially faster if AMD did a better job of documenting the Athlon.