Poly1305-AES for generic computers with IEEE floating point
D. J. Bernstein
poly1305aes_53 is a catch-all Poly1305-AES implementation
that works on a wide variety of computers.
Requirements:
poly1305aes_53
requires a C compiler where
int is 32-bit two's-complement,
long long is 64-bit two's-complement,
and
double is IEEE double-precision floating-point.
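For what it's worth, the integer and floating-point format assumptions
can be checked at build time. Here is a minimal sketch using C11
_Static_assert; poly1305aes_53 itself (which predates C11) does not do this:

    /* Sketch (assuming a C11 compiler) of compile-time checks for the
       type requirements above; not part of poly1305aes_53. */
    #include <limits.h>
    #include <float.h>

    _Static_assert(sizeof(int) == 4 && INT_MIN == -INT_MAX - 1,
                   "int must be 32-bit two's complement");
    _Static_assert(sizeof(long long) == 8 && LLONG_MIN == -LLONG_MAX - 1,
                   "long long must be 64-bit two's complement");
    _Static_assert(FLT_RADIX == 2 && DBL_MANT_DIG == 53,
                   "double must be IEEE double precision");
    /* Note: DBL_MANT_DIG describes the storage format; it does not catch
       x87-style extended-precision evaluation (see the next paragraph). */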
Arithmetic on double variables is required to be
standard IEEE double-precision arithmetic
with 53-bit mantissas;
this means that poly1305aes_53 will not work
on x86 computers running Linux,
which by default leaves the x87 floating-point unit
in extended-precision mode with 64-bit mantissas.
(Those computers will use, e.g., poly1305aes_athlon.)
Arithmetic on double variables is also required to be
round-to-nearest;
this means that,
on computers that support global rounding modes,
poly1305aes_53 will not work
with programs that set rounding modes other than round-to-nearest.
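A program that does change the rounding mode can switch back
before calling poly1305aes_53. Here is a minimal sketch
using the C99 fenv.h interface; the helper name is mine:

    /* Sketch: restore round-to-nearest before any poly1305aes_53 call.
       Assumes a C99 <fenv.h>; not part of the library itself. */
    #include <fenv.h>

    void restore_round_to_nearest(void)
    {
      if (fegetround() != FE_TONEAREST)
        fesetround(FE_TONEAREST);
    }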
How does this Poly1305-AES implementation work?
The Poly1305 evaluation point r
is represented as a sum of four integers r0, r1, r2, r3,
which are multiples of 2^0, 2^34, 2^66, 2^98 respectively.
These integers are split into 16-bit pieces
r0low, r0high, r1low, r1high, r2low, r2high, r3low, and r3high,
which are multiples of 2^0, 2^16, 2^34, 2^50, 2^66, 2^82, 2^98, 2^114
respectively.
The numbers 5 2^{-130} r1, 5 2^{-130} r2, 5 2^{-130} r3
are split into 16-bit pieces sr1low, sr1high, sr2low, sr2high, sr3low, sr3high,
which are multiples of 2^{-96}, 2^{-80}, 2^{-64}, 2^{-48}, 2^{-32}, 2^{-16}
respectively.
These 16-bit pieces are stored in double-precision floating-point variables.
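One way to build these pieces in C is the following sketch. This is an
illustration of the representation just described, not the
poly1305aes_53 source; all names are hypothetical, and C99 hexadecimal
floating constants are assumed. Every conversion and multiplication
below is exact in 53-bit arithmetic:

    #include <stdint.h>

    static uint32_t le32(const unsigned char *p)
    {
      return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
           | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    }

    void poly1305_precompute_r(double pre[14], const unsigned char r[16])
    {
      /* r is assumed already clamped as Poly1305 specifies, so each
         32-bit word is below 2^28 and t1, t2, t3 are multiples of 4. */
      uint32_t t0 = le32(r), t1 = le32(r + 4);
      uint32_t t2 = le32(r + 8), t3 = le32(r + 12);
      double scale = 5.0 * 0x1p-130;  /* 5 * 2^-130, exactly representable */

      double r0low  = (double)(t0 & 0xffff);               /* multiple of 2^0   */
      double r0high = (double)(t0 & 0xfff0000);            /* multiple of 2^16  */
      double r1low  = (double)(t1 & 0x3fffc)   * 0x1p32;   /* multiple of 2^34  */
      double r1high = (double)(t1 & 0xffc0000) * 0x1p32;   /* multiple of 2^50  */
      double r2low  = (double)(t2 & 0x3fffc)   * 0x1p64;   /* multiple of 2^66  */
      double r2high = (double)(t2 & 0xffc0000) * 0x1p64;   /* multiple of 2^82  */
      double r3low  = (double)(t3 & 0x3fffc)   * 0x1p96;   /* multiple of 2^98  */
      double r3high = (double)(t3 & 0xffc0000) * 0x1p96;   /* multiple of 2^114 */

      pre[0] = r0low;  pre[1] = r0high;
      pre[2] = r1low;  pre[3] = r1high;
      pre[4] = r2low;  pre[5] = r2high;
      pre[6] = r3low;  pre[7] = r3high;
      /* the 5 * 2^-130 multiples; these products are also exact */
      pre[8]  = r1low * scale;   /* sr1low:  multiple of 2^-96 */
      pre[9]  = r1high * scale;  /* sr1high: multiple of 2^-80 */
      pre[10] = r2low * scale;   /* sr2low:  multiple of 2^-64 */
      pre[11] = r2high * scale;  /* sr2high: multiple of 2^-48 */
      pre[12] = r3low * scale;   /* sr3low:  multiple of 2^-32 */
      pre[13] = r3high * scale;  /* sr3high: multiple of 2^-16 */
    }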
The Poly1305 accumulator
is a vector of eight integers h = (h0,h1,h2,h3,h4,h5,h6,h7)
stored in double-precision floating-point variables.
These integers are multiples of
2^0, 2^16, 2^32, 2^48, 2^64, 2^80, 2^96, and 2^112 respectively.
The sum of the integers is congruent modulo 2^130-5
to 0 at first, then c_1, then (c_1)r, then (c_1)r+c_2, then ((c_1)r+c_2)r, etc.
Before h is multiplied by r,
it is replaced with carry(h)=(x0,x2,x4,x6),
where x0, x2, x4, x6 are small multiples of
2^0, 2^32, 2^64, 2^96 respectively.
One can then safely compute, for example,
the new h2 = r0low x2 + r1low x0 + sr2low x6 + sr3low x4
using floating-point multiplications and additions.
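Concretely, a carry and the h2 accumulation might look like the
following C sketch. The split function is the standard round-to-nearest
splitting trick for floating-point code of this kind (and the reason
the round-to-nearest requirement above matters); mul_h2 is the h2 line
just quoted. All names are hypothetical; this is not the
poly1305aes_53 source:

    /* Round-to-nearest carry: with alpha = 1.5 * 2^(52+k), the addition
       rounds away the bits of h below 2^k, so hi is a multiple of 2^k
       and lo = h - hi satisfies |lo| <= 2^(k-1). Both subtractions are
       exact. */
    static void split(double h, double alpha, double *hi, double *lo)
    {
      *hi = (h + alpha) - alpha;
      *lo = h - *hi;
    }

    /* The h2 line quoted above: x0, x2, x4, x6 are the carried
       accumulator limbs (small multiples of 2^0, 2^32, 2^64, 2^96).
       Every product is a multiple of 2^32, the weight of h2, and the
       verified ranges guarantee that no product or sum is rounded. */
    static double mul_h2(double r0low, double r1low,
                         double sr2low, double sr3low,
                         double x0, double x2, double x4, double x6)
    {
      return r0low * x2 + r1low * x0 + sr2low * x6 + sr3low * x4;
    }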
The output range of every floating-point operation
has been verified by one of my qhasm tools
and is reported inside poly1305_sparc.s
in poly1305aes_sparc.
How does this differ from other Poly1305-AES implementations?
poly1305aes_sparc,
poly1305aes_aix,
and
poly1305aes_macos
perform essentially the same computations as poly1305aes_53
but achieve further speedups in three ways:
- Instruction selection.
For example,
my qhasm tools offer direct access to PowerPC instructions such as
"r bits 0xff0000 = s" and "carry r = r + s";
poly1305aes_macos takes advantage of these instructions
(see the sketch after this list).
In theory,
a C compiler could automatically recognize the relevance of these instructions
to poly1305aes_53;
in practice,
convincing a C compiler to produce these instructions
ranges from unnecessarily difficult to practically impossible.
- Instruction scheduling.
For example,
on the UltraSPARC III,
the result of a floating-point operation
is not ready until 4 cycles later;
my qhasm tools make it easy to schedule other operations
to happen in the meantime.
In theory,
a C compiler could automatically figure out a good order of instructions;
in practice,
even a good C compiler (such as Sun's)
for an easily predictable computer (such as the UltraSPARC)
often benefits from human input.
- Register assignment.
With my qhasm tools
I've been able to fit every declared register variable into a CPU register.
Every C compiler I've tried has given up and spilled variables.
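To illustrate the instruction-selection point, here are hypothetical C
equivalents of the two PowerPC instructions named above; this is an
illustration only, not poly1305aes code, and neither pattern is
reliably recognized by C compilers:

    #include <stdint.h>

    /* "r bits 0xff0000 = s": insert the 0xff0000 bits of s into r.
       On PowerPC this is a single rlwimi instruction. */
    uint32_t bits_insert(uint32_t r, uint32_t s)
    {
      return (r & ~0xff0000u) | (s & 0xff0000u);
    }

    /* "carry r = r + s": a 32-bit add that also produces the carry bit,
       as PowerPC's addc does for multiword arithmetic. */
    uint64_t add_carry(uint32_t r, uint32_t s)
    {
      return (uint64_t)r + (uint64_t)s;  /* low 32 bits: sum; bit 32: carry */
    }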
There is a much larger difference between poly1305aes_53
and, e.g., poly1305aes_athlon:
in most of my x86 implementations,
the Poly1305 accumulator
is a vector of four integers stored in 80-bit floating-point registers,
rather than eight integers stored in 64-bit floating-point registers.
The extra precision means that there are considerably fewer
floating-point operations.
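The arithmetic behind this difference: a 53-bit mantissa cannot hold
the product of two 32-bit integers exactly, which is what forces the
16-bit pieces of r and the eight-limb accumulator above, while a 64-bit
mantissa does hold a 32-bit-by-32-bit product exactly, so the pieces
can be roughly twice as wide and half as numerous, and the number of
products and additions drops accordingly.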
How much can the Poly1305 code be improved?
There are,
beyond the machine-specific issues discussed above,
some ways that various operations in
poly1305_53, poly1305_sparc, etc.
could be eliminated or parallelized.
There are six parts of the Poly1305 code, outlined in the sketch after this list:
- Prologue: preparing r and setting h to 0.
- Initial addition of full chunk: setting h to h + c_1, with a 16-byte c_1.
This happens (once) for message lengths 16 and up.
Some additions could be skipped here and in the subsequent carry,
since h is 0.
- Main loop: setting h to carry(h) r + c, with a 16-byte c.
This happens once for message lengths 32 through 47,
twice for message lengths 48 through 63, etc.
- Multiplication after final full chunk: setting h to carry(h) r.
This happens (once) for message lengths 16 and up.
- Partial-chunk handling:
setting h to carry(h + c_q) r, with c_q between 1 and 15 bytes.
This happens (once) for message lengths not divisible by 16.
- Epilogue: setting h to carry(h),
completely reducing modulo 2^130-5, and adding s.
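In outline, the control flow looks like the following hypothetical C
skeleton. Every name here is invented for illustration; the real code
is not factored into functions like this:

    void prepare_r(double pre[14], const unsigned char r[16]);
    void set_h_zero(double h[8]);
    void add_full_chunk(double h[8], const unsigned char c[16]);
    void add_partial_chunk(double h[8], const unsigned char *c,
                           unsigned int len);
    void carry_h(double h[8]);                  /* h := carry(h) */
    void mul_h_r(double h[8], const double pre[14]);
    void freeze_and_add_s(unsigned char out[16], const double h[8],
                          const unsigned char s[16]);

    void poly1305_53_outline(unsigned char out[16],
                             const unsigned char r[16],
                             const unsigned char s[16],
                             const unsigned char *m, unsigned int len)
    {
      double pre[14], h[8];

      prepare_r(pre, r);                        /* prologue */
      set_h_zero(h);
      if (len >= 16) {
        add_full_chunk(h, m);                   /* initial addition */
        m += 16; len -= 16;
        while (len >= 16) {                     /* main loop */
          carry_h(h); mul_h_r(h, pre);          /* h := carry(h) r ... */
          add_full_chunk(h, m);                 /* ... + c */
          m += 16; len -= 16;
        }
        carry_h(h); mul_h_r(h, pre);            /* mul after final full chunk */
      }
      if (len > 0) {                            /* partial-chunk handling */
        add_partial_chunk(h, m, len);
        carry_h(h); mul_h_r(h, pre);
      }
      carry_h(h);                               /* epilogue */
      freeze_and_add_s(out, h, s);
    }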
This isn't the smallest possible code:
there are two different additions of 16-byte message chunks,
three different multiplications by r,
and four different carries.
The point is to allow the main loop to overlap the computation of c
(primarily integer operations)
with the computation of carry(h) r
(primarily floating-point operations).
The same improvement is possible outside the main loop,
with relatively small speed benefits and further cost in code size:
- The initial setting of h to c_1 can be handled in parallel
with the preparation of r.
This would require two additional prologues:
a tiny prologue for message length 0
and a long prologue for message lengths 1 through 15.
- The addition of c_q, when c_q is between 1 and 15 bytes,
can be handled in parallel
with the multiplication of carry(h) by r.
This would require duplicating the carry(h) r block.
On some computers it's also possible to save some cycles
by overlapping part of the AES computation
with the Poly1305 prologue and epilogue.
One can even imagine overlapping
the entire AES computation with the Poly1305 main loop
for messages longer than a few hundred bytes.
How much can the AES code be improved?
aes_big
converts the input bytes to big-endian integers,
works with big-endian tables,
and then converts the big-endian results back to bytes.
The 48 single-byte loads and stores (and the accompanying shifts)
can be replaced with 12 uint32 loads and stores
on a computer that supports unaligned big-endian loads,
such as the PowerPC.
(Aligned big-endian loads are sufficient
if the user always provides aligned input,
but alignment is inconvenient for some applications.)
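The difference looks like this sketch (hypothetical names, not the
aes_big source): the first function is what the byte-loads-plus-shifts
amount to; the second compiles to a single 32-bit load on a big-endian
machine with unaligned-load support, such as the PowerPC:

    #include <stdint.h>
    #include <string.h>

    /* 4 byte loads plus shifts, as in aes_big. */
    uint32_t load_bigendian_portable(const unsigned char *p)
    {
      return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
           | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    }

    /* One lwz on big-endian PowerPC, aligned or not. */
    uint32_t load_bigendian_powerpc(const unsigned char *p)
    {
      uint32_t x;
      memcpy(&x, p, 4);
      return x;
    }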
The bulk of the work inside AES is 200 table lookups
inside eight 256-entry uint32 tables, totalling 8 kilobytes.
Each table lookup is a load surrounded by a few integer operations.
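In the usual T-table formulation (a generic textbook sketch, not
aes_big's exact code), one output word of a round costs four lookups:

    #include <stdint.h>

    extern const uint32_t T0[256], T1[256], T2[256], T3[256];

    /* One word of one AES round: four table lookups, each a load
       surrounded by shifts, masks, and xors; rk is the round-key word. */
    uint32_t round_word(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
                        uint32_t rk)
    {
      return T0[x0 >> 24] ^ T1[(x1 >> 16) & 0xff]
           ^ T2[(x2 >> 8) & 0xff] ^ T3[x3 & 0xff] ^ rk;
    }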
aes_big reduces the tables to 4 kilobytes
at the expense of an extra mask
(to extract one of the four bytes of the uint32)
in 56 of the table lookups.
I could absorb 16 of those masks into subsequent byte stores.
It's easy to further compress the tables from 4 kilobytes to 2 kilobytes
at the expense of unaligned loads within an 8-byte block
(whether or not the input is aligned);
aes_big and aes_sparc don't do this,
but aes_aix and aes_macos do.
The aes_big loop is not unrolled.
Complete unrolling would
eliminate nine loop-index modifications, nine branches, and nine loads,
at the expense of a few kilobytes of code size.