Requirements: poly1305aes_sparc must be run on a SPARCv9 CPU (e.g., UltraSPARC I), under a 64-bit operating system (e.g., Solaris 2.8), inside a 64-bit program (e.g., a program compiled with gcc -m64 or cc -xarch=v9). It sets the CPU's floating-point mode to round-to-nearest; programs must not assume that the floating-point mode is preserved by function calls.
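If the calling program itself relies on a different rounding mode, the safe pattern is to re-establish that mode after each call. Here is a minimal C sketch; the poly1305aes_authenticate prototype is an assumption for illustration, so check poly1305aes.h for the real declaration:

    #include <fenv.h>

    /* Prototype assumed for illustration; the real declaration is in poly1305aes.h. */
    extern void poly1305aes_authenticate(unsigned char out[16],
                                         const unsigned char kr[32],
                                         const unsigned char n[16],
                                         const unsigned char *m,
                                         unsigned int mlen);

    void authenticate_and_restore_rounding(unsigned char out[16],
                                           const unsigned char kr[32],
                                           const unsigned char n[16],
                                           const unsigned char *m,
                                           unsigned int mlen)
    {
      int mode = fegetround();            /* remember the caller's rounding mode */
      poly1305aes_authenticate(out, kr, n, m, mlen);
      fesetround(mode);                   /* the call may have left the FPU in round-to-nearest */
    }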
Here are the poly1305aes_sparc files:
If you want to know how fast the code is, see my separate page of speed tables. If you want a rough idea of how the code works, see poly1305aes_53. If you want to know what improvements are possible in poly1305aes_sparc, read the rest of this page.
The code scheduling outside the main loop isn't very tight but it's not bad. The overall cost for a 16n-byte message on an UltraSPARC II, including timing overhead, is 68n+234 cycles. A message between 16n-15 and 16n-1 bytes takes a flat 68n+302 cycles, because I used branchless code rather than bothering with separate code for different message lengths modulo 16. Separate code might be a little faster.
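For example, a 1024-byte message has n = 64, so it costs 68*64 + 234 = 4586 cycles, about 4.5 cycles per byte; the asymptotic rate is 68/16 = 4.25 cycles per byte.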
Pre-expansion of r into a 112-byte table would eliminate roughly a quarter of the 234-cycle overhead. A much larger precomputation, as in hash127, would allow a dot-product main loop, the most obvious bottleneck being 12 loads for every 4 bytes.
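To illustrate what the dot-product restructuring buys, here is a toy C sketch over a small 61-bit prime standing in for 2^130 - 5 (unsigned __int128 is a gcc/clang extension). It only shows why the multiplications become independent; it is not how hash127 or this code represents field elements:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy modulus standing in for 2^130 - 5, just to show the restructuring. */
    #define P ((uint64_t)0x1fffffffffffffff)   /* 2^61 - 1 */

    static uint64_t mulmod(uint64_t a, uint64_t b) {
      return (uint64_t)(((unsigned __int128)a * b) % P);
    }
    static uint64_t addmod(uint64_t a, uint64_t b) {
      uint64_t s = a + b;                  /* a, b < 2^61, so no 64-bit overflow */
      return s >= P ? s - P : s;
    }

    /* Horner evaluation: ((c[0]*r + c[1])*r + ...)*r; one multiplication per
       chunk, each depending on the result of the previous one. */
    static uint64_t horner(const uint64_t *c, int len, uint64_t r) {
      uint64_t h = 0;
      for (int i = 0; i < len; i++) h = mulmod(addmod(h, c[i]), r);
      return h;
    }

    /* Dot-product evaluation with precomputed powers rpow[i] = r^(len-i):
       the multiplications are independent, so they can be scheduled freely. */
    static uint64_t dotproduct(const uint64_t *c, int len, const uint64_t *rpow) {
      uint64_t h = 0;
      for (int i = 0; i < len; i++) h = addmod(h, mulmod(c[i], rpow[i]));
      return h;
    }

    int main(void) {
      uint64_t r = 123456789, c[4] = {11, 22, 33, 44};
      uint64_t rpow[4];
      rpow[3] = r;
      for (int i = 2; i >= 0; i--) rpow[i] = mulmod(rpow[i + 1], r);  /* r^4..r */
      printf("%llu %llu\n",
             (unsigned long long)horner(c, 4, r),
             (unsigned long long)dotproduct(c, 4, rpow));             /* equal */
      return 0;
    }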
Input costs: I don't assume k alignment, so my k load uses 16 byte loads, 12 shifts, and 12 xors, rather than the 4 uint32 loads that one might expect. Same for n. Load contention adds a few extra cycles here; it isn't an obvious problem elsewhere.
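In C terms the byte-load pattern is the following sketch, 3 shifts and 3 xors per 32-bit word, hence 12 of each for the 16-byte k:

    #include <stdint.h>

    /* Alignment-safe little-endian load: 4 one-byte loads, 3 shifts, 3 xors.
       Loading the 16-byte k this way costs 16 byte loads, 12 shifts, 12 xors. */
    static uint32_t load32_le(const unsigned char *p)
    {
      return (uint32_t)p[0]
           ^ ((uint32_t)p[1] << 8)
           ^ ((uint32_t)p[2] << 16)
           ^ ((uint32_t)p[3] << 24);
    }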
Output costs: I'm using 16 byte stores and 12 shifts. I could eliminate 16 previous table masks in light of the stores. Inside Poly1305-AES it's safe to always assume aligned (stack) output, reducing the cost to 4 stores; or to simply return results in registers, eliminating the cost entirely. (I'd have to flip the code and tables from little-endian to big-endian; right now the endianness doesn't matter.) There's some store contention (and shift contention on the UltraSPARC II); I should spread out the stores more.
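The corresponding byte-store pattern, again as a C sketch: 4 one-byte stores and 3 shifts per word, hence 16 stores and 12 shifts for the 16-byte result, versus 4 word stores when aligned output is guaranteed.

    #include <stdint.h>

    /* Alignment-safe little-endian store: 4 one-byte stores, 3 shifts. */
    static void store32_le(unsigned char *p, uint32_t x)
    {
      p[0] = (unsigned char) x;
      p[1] = (unsigned char)(x >> 8);
      p[2] = (unsigned char)(x >> 16);
      p[3] = (unsigned char)(x >> 24);
    }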
Total integer operations: 782, obviously taking at least 782/2 = 391 CPU cycles at the CPU's limit of 2 integer operations per cycle:
Actual cycle counts on a 900MHz UltraSPARC III for timing overhead plus 1, 2, 3, 4, 5 function calls, once everything is in L1 cache: 465, 924, 1379, 1832, 2288. Eliminating one inner-loop iteration saves 38 cycles per call. Timings are consistent across repeated function calls but sometimes vary (between two different values) from one program run to another.
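The successive differences are 459, 455, 453, 456 cycles, so each call costs roughly 455 cycles on this machine.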
A 296MHz UltraSPARC II produces cycle counts smaller by a factor of about 1.08. Across program runs, its counts vary among many different values, not just two. Perhaps the program layout in physical memory is hurting branch prediction or causing L1 cache misses; the UltraSPARC II's L1 cache is only 16K and 1-way associative.
I scheduled my code for the UltraSPARC III load latency, so I don't know why the UltraSPARC III is taking more cycles than the UltraSPARC II. I also don't know why the inner loop is taking 38 cycles rather than the expected 35.
Comparison to other AES speed results: A Bassham paper reports software taking something like 2450 cycles, of which about 1700 are for key expansion. OpenSSL takes slightly over 700 cycles, of which about 300 are for key expansion. Lipmaa reports that unpublished software takes 270 cycles (improving on Gladman's 334 cycles) without key expansion and without timing overhead; presumably this code is unrolled, uses large tables, uses precomputed constants, and assumes alignment, so the only obvious bottleneck is 242 cycles for integer operations.