Requirements: poly1305aes_aix must be run on a PowerPC CPU, under AIX (as opposed to, e.g., Mac OS X), inside a 32-bit program (as opposed to a program compiled with -maix64). It sets the CPU's floating-point mode to round-to-nearest; programs must not assume that the floating-point mode is preserved by function calls.
(32 versus 64: 64-bit code requires larger register saves, larger pointer variables, and different carry handling.)
Here are the poly1305aes_aix files:
If you want to know how fast the code is, see my separate page of speed tables. If you want a rough idea of how the code works, see poly1305aes_53. If you want to know what improvements are possible in poly1305aes_aix (and poly1305aes_macos), read the rest of this page.
The main loop actually takes 120 cycles on the G4 and 79 cycles on the Sstar. Presumably the same scheduling is pretty good for the G4e, but I don't have a G4e to measure. The code isn't as carefully scheduled outside the main loop.
The gap between 101 and 120 needs further investigation. I've noticed some weird effects in the G4 timings, with -D often considerably faster than KD. Must be a memory-layout issue. I wonder whether alpha0 is being bumped out of L1 cache.
Scheduling code for the PowerPC 970 (G5) is a completely different problem. There are two FPUs (up from 1 on the G4, G4e, Sstar), each with 6-cycle-per-operation latency (up from 3 on the G4, 5 on the G4e, 4 on the Sstar), so I should be trying to run twelve (2 × 6) operations in parallel. Obviously a tuned poly1305aes_g5 should be split from poly1305aes_macos.
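The parallelism target is just the number of FPUs times the latency. Here is a minimal C sketch of that arithmetic, assuming each FPU is fully pipelined and can start one operation per cycle (an assumption, not something measured here). The names fpu_model, fpu_models, and ops_in_flight are hypothetical, chosen for illustration; the FPU counts and latencies are the ones quoted above.

```c
#include <assert.h>

/* One row per CPU; FPU counts and latencies are the figures quoted in the text. */
struct fpu_model {
    const char *cpu;
    int fpus;      /* number of floating-point units */
    int latency;   /* cycles from issue to result for one operation */
};

static const struct fpu_model fpu_models[] = {
    { "G4",    1, 3 },
    { "G4e",   1, 5 },
    { "Sstar", 1, 4 },
    { "G5",    2, 6 },
};

/* Independent operations that must be in flight to keep every FPU busy,
   assuming (hypothetically) one new operation per FPU per cycle. */
static int ops_in_flight(const struct fpu_model *m)
{
    assert(m->fpus > 0 && m->latency > 0);
    return m->fpus * m->latency;
}
```

Under this model a schedule needs 3, 5, and 4 independent operations in flight for the G4, G4e, and Sstar, against 12 for the G5, which is why a schedule tuned for the older chips is unlikely to carry over.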
The AES main loop has 21 loads and only 48 other instructions; the PowerPC's rotate-and-mask operation saves quite a few instructions compared to the UltraSPARC. The most obvious G4 bottleneck is the 2-instruction-per-cycle limit, forcing the computation to take at least 360 cycles overall. The most obvious G4e bottleneck is the 3-instruction-per-cycle limit.
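To illustrate the rotate-and-mask saving: the PowerPC rlwinm instruction rotates a 32-bit word left and ANDs it with a contiguous bit mask, all in one instruction. The C sketch below emulates rlwinm and uses it to turn each byte of a word into a byte offset into a table of 32-bit entries, a step that takes a separate shift and mask on a CPU without such an instruction. This assumes the loads in question are table lookups indexed by bytes of the state; the helper names rotl32, ppc_mask, and table_offset are hypothetical, and this is not taken from the poly1305aes_aix source.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (n taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Mask with bits mb..me set, in PowerPC numbering (bit 0 = most significant).
   When mb > me the mask wraps around, as it does for rlwinm. */
static uint32_t ppc_mask(unsigned mb, unsigned me)
{
    assert(mb < 32 && me < 32);
    uint32_t from_mb = 0xFFFFFFFFu >> mb;         /* bits mb..31 */
    uint32_t upto_me = 0xFFFFFFFFu << (31 - me);  /* bits 0..me  */
    return mb <= me ? (from_mb & upto_me) : (from_mb | upto_me);
}

/* Emulation of rlwinm rA,rS,SH,MB,ME: rotate left, then AND with mask. */
static uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(rs, sh) & ppc_mask(mb, me);
}

/* Byte k of x (k = 0 is the most significant byte), multiplied by 4:
   a byte offset into a table of 256 four-byte entries.
   One rlwinm per byte; the mask 22..29 is 0x3fc. */
static uint32_t table_offset(uint32_t x, unsigned k)
{
    assert(k < 4);
    return rlwinm(x, (10 + 8 * k) & 31, 22, 29);
}
```

For example, with x = 0x12345678 the four offsets are 0x48, 0xd0, 0x158, and 0x1e0, i.e. 4 times each of the bytes 0x12, 0x34, 0x56, 0x78.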
Measurements on a G4 show the AES computation taking about 490 cycles, including timing overhead. The G4's low-precision ``time base'' makes it difficult to see what's happening in more detail.
Other AES speed results: According to Helger Lipmaa, unpublished software by Dennis Ahrens takes 1027 cycles (including 626 expansion) on a G4 and 673 cycles (including 288 expansion) on a G4e. These figures ignore function-call overhead, which is quite severe on a PowerPC (unlike an UltraSPARC) thanks to all the PowerPC callee-save registers.