Poly1305-AES for generic computers with IEEE floating point D. J. Bernstein

poly1305aes_53 is a catch-all Poly1305-AES implementation that works on a wide variety of computers.

Requirements: poly1305aes_53 requires a C compiler where int is 32-bit two's-complement, long long is 64-bit two's-complement, and double is IEEE double-precision floating-point. Arithmetic on double variables is required to be standard IEEE double-precision arithmetic with 53-bit mantissas; this means that poly1305aes_53 will not work on x86 computers running Linux. (Those computers will use, e.g., poly1305aes_athlon.) Arithmetic on double variables is also required to be round-to-nearest; this means that, on computers that support global rounding modes, poly1305aes_53 will not work with programs that set rounding modes other than round-to-nearest.

Here are the poly1305aes_53 files:

How does this Poly1305-AES implementation work?

The Poly1305 evaluation point r is represented as a sum of four integers r0, r1, r2, r3, which are multiples of 2^0, 2^34, 2^66, 2^98 respectively. These integers are split into 16-bit pieces r0low, r0high, r1low, r1high, r2low, r2high, r3low, and r3high, which are multiples of 2^0, 2^18, 2^34, 2^50, 2^66, 2^82, 2^98, 2^112 respectively. The numbers 5·2^-130 r1, 5·2^-130 r2, 5·2^-130 r3 are split into 16-bit pieces sr1low, sr1high, sr2low, sr2high, sr3low, sr3high, which are multiples of 2^-96, 2^-80, 2^-64, 2^-48, 2^-32, 2^-16 respectively. These 16-bit pieces are stored in double-precision floating-point variables.

The Poly1305 accumulator is a vector of eight integers h = (h0,h1,h2,h3,h4,h5,h6,h7) stored in double-precision floating-point variables. These integers are multiples of 2^0, 2^16, 2^32, 2^48, 2^64, 2^80, 2^96, and 2^112 respectively. The sum of the integers is congruent modulo 2^130-5 to 0 at first, then c_1, then (c_1)r, then (c_1)r+c_2, then ((c_1)r+c_2)r, etc.

Before h is multiplied by r, it is replaced with carry(h)=(x0,x2,x4,x6), where x0, x2, x4, x6 are small multiples of 2^0, 2^32, 2^64, 2^96 respectively. One can then safely compute, for example, the new h2 = r0low x2 + r1low x0 + sr2low x6 + sr3low x4 using floating-point multiplications and additions. The output range of every floating-point operation has been verified by one of my qhasm tools and is reported inside poly1305_sparc.s in poly1305aes_sparc.

How does this differ from other Poly1305-AES implementations?

poly1305aes_sparc, poly1305aes_aix, and poly1305aes_macos perform essentially the same computations as poly1305aes_53 but achieve further speedups in three ways:

There is a much larger difference between poly1305aes_53 and, e.g., poly1305aes_athlon: in most of my x86 implementations, the Poly1305 accumulator is a vector of four integers stored in 80-bit floating-point registers, rather than eight integers stored in 64-bit floating-point registers. The extra precision means that there are considerably fewer floating-point operations.

How much can the Poly1305 code be improved?

There are, beyond the machine-specific issues discussed above, some ways that various operations in poly1305_53, poly1305_sparc, etc. could be eliminated or parallelized.

There are six parts of the Poly1305 code:

This isn't the smallest possible code: there are two different additions of 16-byte message chunks, three different multiplications by r, and four different carries. The point is to allow the main loop to overlap the computation of c (primarily integer operations) with the computation of carry(h) r (primarily floating-point operations). The same improvement is possible outside the main loop, with relatively small speed benefits and a further cost in code size. On some computers it's also possible to save some cycles by overlapping part of the AES computation with the Poly1305 prologue and epilogue. One can even imagine overlapping the entire AES computation with the Poly1305 main loop for messages longer than a few hundred bytes.

How much can the AES code be improved?

aes_big converts the input bytes to big-endian integers, works with big-endian tables, and then converts the big-endian results back to bytes. The 48 byte loads and stores (and the accompanying shifts) can be replaced with 12 uint32 loads and stores on a computer that supports unaligned big-endian loads, such as the PowerPC. (Aligned big-endian loads are sufficient if the user always provides aligned input, but alignment is inconvenient for some applications.)

The bulk of the work inside AES is 200 table lookups inside eight 256-entry uint32 tables, totalling 8 kilobytes. Each table lookup is a load surrounded by a few integer operations. aes_big reduces the tables to 4 kilobytes at the expense of an extra mask (to extract one of the four bytes of the uint32) in 56 of the table lookups. I could absorb 16 of those masks into subsequent byte stores.

It's easy to further compress the tables from 4 kilobytes to 2 kilobytes at the expense of unaligned loads within an 8-byte block (whether or not the input is aligned); aes_big and aes_sparc don't do this, but aes_aix and aes_macos do.

The aes_big loop is not unrolled. Complete unrolling would eliminate nine loop-index modifications, nine branches, and nine loads, at the expense of a few kilobytes of code size.