Requirements: poly1305aes_athlon must be run on an x86 CPU (for example, the Athlon XP). It sets the CPU's floating-point mode to extended-precision round-to-nearest; programs must not assume that the floating-point mode is preserved by function calls.
Here are the poly1305aes_athlon files:
If you want to know how fast the code is, see my separate page of speed tables. If you want a rough idea of how the code works, see the original Poly1305-AES paper, Section 4. If you want to know what improvements are possible in poly1305aes_athlon, read the rest of this page.
My original Athlon AES code (October 2004, summarized in the Poly1305 paper) took about 100 fewer cycles using large tables. The aes_ppro code takes about 70 fewer cycles using small tables. But those implementations (and every other AES implementation I've seen!) have input-dependent Athlon timings, notably because of address-dependent restrictions on two loads in the same cycle, as discussed in my paper on cache-timing attacks.
The potential workaround suggested in my paper would have required large tables. I instead decided to try putting all loads and stores into Athlon execution unit #0. That unit can't do more than one load/store operation in a cycle. (I allowed another unit to do a load/store if the address was input-independent and the previous load/store address was also input-independent.)
I arranged all instructions into groups of 3, with the load/store instructions first in each group; this required inserting some no-ops. I then arranged the groups of 3 into aligned 16-byte chunks, extending instructions by one or two bytes to exactly fill 16 bytes, and adding 3 no-ops if necessary. My extensions were a 1-byte irrelevant DS prefix and a 2-byte irrelevant CS DS prefix; experiments indicate that the Athlon can decode these for free, but can't decode a third irrelevant prefix for free. (Future plan: change 1-byte immediates into 4-byte immediates.) I used 0x90 for a 1-byte no-op, 0x66 0x90 for a 2-byte no-op, and various LEAs for 3-byte and 4-byte no-ops.
I could probably make the code substantially faster if AMD did a better job of documenting the Athlon.