Requirements: poly1305aes_sparc must be run on a SPARCv9 CPU (e.g., UltraSPARC I), under a 64-bit operating system (e.g., Solaris 2.8), inside a 64-bit program (e.g., a program compiled with gcc -m64 or cc -xarch=v9). It sets the CPU's floating-point mode to round-to-nearest; programs must not assume that the floating-point mode is preserved by function calls.
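If the calling program itself relies on a different rounding mode, the safe pattern is to re-establish that mode after each call. Here is a minimal C sketch; the poly1305aes_authenticate prototype is an assumption for illustration, so check poly1305aes.h for the real declaration:

    #include <fenv.h>

    /* Prototype assumed for illustration; the real declaration is in poly1305aes.h. */
    extern void poly1305aes_authenticate(unsigned char out[16],
                                         const unsigned char kr[32],
                                         const unsigned char n[16],
                                         const unsigned char *m,
                                         unsigned int mlen);

    void authenticate_and_restore_rounding(unsigned char out[16],
                                           const unsigned char kr[32],
                                           const unsigned char n[16],
                                           const unsigned char *m,
                                           unsigned int mlen)
    {
      int mode = fegetround();            /* remember the caller's rounding mode */
      poly1305aes_authenticate(out, kr, n, m, mlen);
      fesetround(mode);                   /* the call may have left the FPU in round-to-nearest */
    }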
Here are the poly1305aes_sparc files:
If you want to know how fast the code is, see my separate page of speed tables. If you want a rough idea of how the code works, see poly1305aes_53. If you want to know what improvements are possible in poly1305aes_sparc, read the rest of this page.
The code scheduling outside the main loop isn't very tight but it's not bad. The overall cost for a 16n-byte message on an UltraSPARC II, including timing overhead, is 68n+234 cycles. A message between 16n-15 and 16n-1 bytes takes a flat 68n+302 cycles, because I used branchless code rather than bothering with separate code for different message lengths modulo 16. Separate code might be a little faster.
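For example, a 1024-byte message has n = 64, so it costs 68*64 + 234 = 4586 cycles, about 4.5 cycles per byte; the asymptotic rate is 68/16 = 4.25 cycles per byte.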
Pre-expansion of r into a 112-byte table would eliminate roughly a quarter of the 234-cycle overhead. A much larger precomputation, as in hash127, would allow a dot-product main loop, the most obvious bottleneck being 12 loads for every 4 bytes.
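To illustrate what the dot-product restructuring buys, here is a toy C sketch over a small 61-bit prime standing in for 2^130 - 5 (unsigned __int128 is a gcc/clang extension). It only shows why the multiplications become independent; it is not how hash127 or this code represents field elements:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy modulus standing in for 2^130 - 5, just to show the restructuring. */
    #define P ((uint64_t)0x1fffffffffffffff)   /* 2^61 - 1 */

    static uint64_t mulmod(uint64_t a, uint64_t b) {
      return (uint64_t)(((unsigned __int128)a * b) % P);
    }
    static uint64_t addmod(uint64_t a, uint64_t b) {
      uint64_t s = a + b;                  /* a, b < 2^61, so no 64-bit overflow */
      return s >= P ? s - P : s;
    }

    /* Horner evaluation: ((c[0]*r + c[1])*r + ...)*r; one multiplication per
       chunk, each depending on the result of the previous one. */
    static uint64_t horner(const uint64_t *c, int len, uint64_t r) {
      uint64_t h = 0;
      for (int i = 0; i < len; i++) h = mulmod(addmod(h, c[i]), r);
      return h;
    }

    /* Dot-product evaluation with precomputed powers rpow[i] = r^(len-i):
       the multiplications are independent, so they can be scheduled freely. */
    static uint64_t dotproduct(const uint64_t *c, int len, const uint64_t *rpow) {
      uint64_t h = 0;
      for (int i = 0; i < len; i++) h = addmod(h, mulmod(c[i], rpow[i]));
      return h;
    }

    int main(void) {
      uint64_t r = 123456789, c[4] = {11, 22, 33, 44};
      uint64_t rpow[4];
      rpow[3] = r;
      for (int i = 2; i >= 0; i--) rpow[i] = mulmod(rpow[i + 1], r);  /* r^4..r */
      printf("%llu %llu\n",
             (unsigned long long)horner(c, 4, r),
             (unsigned long long)dotproduct(c, 4, rpow));             /* equal */
      return 0;
    }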
Input costs: I don't assume k alignment, so my k load uses 16 byte loads, 12 shifts, and 12 xors, rather than the 4 uint32 loads that one might expect. Same for n. Load contention adds a few extra cycles here; it isn't an obvious problem elsewhere.
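In C terms the byte-load pattern is the following sketch, 3 shifts and 3 xors per 32-bit word, hence 12 of each for the 16-byte k:

    #include <stdint.h>

    /* Alignment-safe little-endian load: 4 one-byte loads, 3 shifts, 3 xors.
       Loading the 16-byte k this way costs 16 byte loads, 12 shifts, 12 xors. */
    static uint32_t load32_le(const unsigned char *p)
    {
      return (uint32_t)p[0]
           ^ ((uint32_t)p[1] << 8)
           ^ ((uint32_t)p[2] << 16)
           ^ ((uint32_t)p[3] << 24);
    }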
Output costs: I'm using 16 byte stores and 12 shifts. I could eliminate 16 previous table masks in light of the stores. Inside Poly1305-AES it's safe to always assume aligned (stack) output, reducing the cost to 4 stores; or to simply return results in registers, eliminating the cost entirely. (I'd have to flip the code and tables from little-endian to big-endian; right now the endianness doesn't matter.) There's some store contention (and shift contention on the UltraSPARC II); I should spread out the stores more.
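The corresponding byte-store pattern, again as a C sketch: 4 one-byte stores and 3 shifts per word, hence 16 stores and 12 shifts for the 16-byte result, versus 4 word stores when aligned output is guaranteed.

    #include <stdint.h>

    /* Alignment-safe little-endian store: 4 one-byte stores, 3 shifts. */
    static void store32_le(unsigned char *p, uint32_t x)
    {
      p[0] = (unsigned char) x;
      p[1] = (unsigned char)(x >> 8);
      p[2] = (unsigned char)(x >> 16);
      p[3] = (unsigned char)(x >> 24);
    }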
Total integer operations: 782, obviously taking at least 782/2 = 391 CPU cycles at the CPU's limit of 2 integer operations per cycle:
Actual cycle counts on a 900MHz UltraSPARC III for timing overhead plus 1, 2, 3, 4, 5 function calls, once everything is in L1 cache: 465, 924, 1379, 1832, 2288. Eliminating one inner-loop iteration saves 38 cycles per call. Timings are consistent across repeated function calls but sometimes vary (between two different values) from one program run to another.
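The successive differences are 459, 455, 453, 456 cycles, so each call costs roughly 455 cycles on this machine.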
A 296MHz UltraSPARC II produces cycle counts smaller by a factor of about 1.08. Across program runs, its counts vary among many different values, not just two. Perhaps the program layout in physical memory is hurting branch prediction or causing L1 cache misses; the UltraSPARC II's L1 cache is only 16K and 1-way associative.
I scheduled my code for the UltraSPARC III load latency, so I don't know why the UltraSPARC III is taking more cycles than the UltraSPARC II. I also don't know why the inner loop is taking 38 cycles rather than the expected 35.
Comparison to other AES speed results: A Bassham paper reports software taking something like 2450 cycles, of which about 1700 are for key expansion. OpenSSL takes slightly over 700 cycles, of which about 300 are for key expansion. Lipmaa reports that unpublished software takes 270 cycles (improving on Gladman's 334 cycles) without key expansion and without timing overhead; presumably this code is unrolled, uses large tables, uses precomputed constants, and assumes alignment, so the only obvious bottleneck is 242 cycles for integer operations.