Requirements: poly1305aes_ppro must be run on an x86 CPU (for example, the Pentium III). It sets the CPU's floating-point mode to extended-precision round-to-nearest; programs must not assume that the floating-point mode is preserved by function calls.
Here are the poly1305aes_ppro files:
If you want to know how fast the code is, see my separate page of speed tables. If you want a rough idea of how the code works, see the original Poly1305-AES paper, Section 4. If you want to know what improvements are possible in poly1305aes_ppro, read the rest of this page.
The main loop actually takes slightly under 61 cycles on a Pentium M, and slightly over 73 cycles on a Pentium III.
The code outside the main loop takes about 200 more cycles for message lengths divisible by 16. This code is not carefully scheduled. It doesn't even skip the unnecessary carry on the first loop.
The scrambling loop takes 23 cycles on a Pentium M; this loop occurs 10 times, with some extra work on the 10th iteration. The obvious bottleneck is 20 cycles for 20 loads: specifically, 16 table loads and 4 expanded-key loads. There are 38 integer operations, which in theory could be done in only 19 cycles.
Most of the remaining time is key expansion, around 160 cycles. The only obvious bottleneck is 105 cycles for integer operations: my expansion loop involves 21 integer operations, 4 table loads, and 4 expanded-key stores. This includes 1 integer operation (a rotation) that could be moved to the scrambling loop, and 3 integer operations that aren't necessary but that reduce latency. Some of those 3 integer operations could be merged with operations in the next loop at no cost in latency, but this would increase register pressure.
Perhaps better scheduling could close part of the gap from 230+160 down to 200+105. You don't want to tackle this unless you have an extremely accurate Pentium-M simulator. The main problem is to convince the Pentium-M to delay non-critical-path operations, when it has a choice between critical-path operations and non-critical-path operations.
I kept the AES table at 2K. The code carefully stores its temporary variables at stack positions that don't overlap the table modulo 4K. (Otherwise there are slowdowns from Pentium Pro/II/III/M cache conflicts; even worse, data-dependent slowdowns, leaking secret information through timing.) I could save about 20 cycles with a larger table, but I'd have to use 4K for the table (with a gap in the middle, say 1K, to leave space free modulo 4K). A safe way to achieve part of the same speedup without a larger table is to use byte loads at 14 spots and (with some table modifications) 2-byte loads at 14 more spots; this will wait for the next qhasm prototype.
My loops are completely unrolled. This avoids all spills in the scrambling loop, at the expense of 2 integer operations per scrambling loop (a mask-insert to merge 2 live bytes in one register with 2 live bytes in another register) and about 3K of code space. Another way to avoid those spills and the expanded-key spills, on the Pentium II and higher, is to shuffle data back and forth to MMX registers, although the shuffling costs integer operations.
A completely different approach is to merge the expansion loop with the scrambling loop, as in my non-x86 implementations. This should free up some load/store slots for the scrambling loop, and should eliminate the latency problems for the expansion loop. One can freely replace 5 integer operations for byte extraction (movzbl, movzbl, shr, movzbl, movzbl) with 5 load-store operations. The obvious bottlenecks in total micro-operations are 160 for expansion (per loop: 5 for byte extractions, 4 table lookups, 7 xors) and 520 for scrambling (per loop: 20 for byte extractions, 16 table lookups, 16 xors).
Other AES speed results: OpenSSL takes 1122 Pentium-M cycles (425 expansion). Gladman's code reportedly takes 482 Pentium-III cycles (202 expansion); this figure does not include timing overhead, function-call costs, and the costs of switching in and out of MMX mode, so it's worse than my 482 cycles. Lipmaa reports that unpublished code takes 394 Pentium-III cycles (168 expansion); again, many costs are ignored here.
(Time variability: OpenSSL and Gladman's C code have obvious cache-timing leaks on the scale of 15 cycles. I don't know about Gladman's asm code or Lipmaa's code.)
Code sizes: My code is 5632 bytes, including a 2048-byte table. The OpenSSL code is about 6400 bytes, including a 5120-byte table. Gladman's code is reportedly over 10000 bytes, including an 8192-byte table.