Authenticators and signatures

A state-of-the-art message-authentication code

Requirements:
`poly1305aes_53`
requires a C compiler where
`int` is 32-bit twos-complement,
`long long` is 64-bit twos-complement,
and
`double` is IEEE double-precision floating-point.
Arithmetic on `double` variables is required to be
standard IEEE double-precision arithmetic
with 53-bit mantissas;
this means that `poly1305aes_53` will *not* work
on x86 computers running Linux.
(Those computers will use, e.g., `poly1305aes_athlon`.)
Arithmetic on `double` variables is also required to be
round-to-nearest;
this means that,
on computers that support global rounding modes,
`poly1305aes_53` will *not* work
with programs that set rounding modes other than round-to-nearest.

Here are the `poly1305aes_53` files:

`poly1305_53.h``poly1305_53.c``poly1305_53_constants.c``aes_big.h``aes_big.c``aes_big_constants.c``poly1305aes_53.h``poly1305aes_53_clamp.c``poly1305aes_53_authenticate.c``poly1305aes_53_isequal.c``poly1305aes_53_verify.c`

The Poly1305 accumulator is a vector of eight integers h = (h0,h1,h2,h3,h4,h5,h6,h7) stored in double-precision floating-point variables. These integers are multiples of 2^0, 2^16, 2^32, 2^48, 2^64, 2^80, 2^96, and 2^112 respectively. The sum of the integers is congruent modulo 2^130-5 to 0 at first, then c_1, then (c_1)r, then (c_1)r+c_2, then ((c_1)r+c_2)r, etc.

Before h is multiplied by r,
it is replaced with carry(h)=(x0,x2,x4,x6),
where x0, x2, x4, x6 are *small* multiples of
2^0, 2^32, 2^64, 2^96 respectively.
One can then safely compute, for example,
the new h2 = r0low x2 + r1low x0 + sr2low x6 + sr3low x4
using floating-point multiplications and additions.
The output range of every floating-point operation
has been verified by one of my qhasm tools
and is reported inside `poly1305_sparc.s`
in `poly1305aes_sparc`.

**Instruction selection.**For example, my qhasm tools offer direct access to PowerPC instructions such as`r bits 0xff0000 = s`and`carry r = r + s`;`poly1305aes_macos`takes advantage of these instructions. In theory, a C compiler could automatically recognize the relevance of these instructions to`poly1305aes_53`; in practice, convincing a C compiler to produce these instructions ranges from unnecessarily difficult to practically impossible.**Instruction scheduling.**For example, on the UltraSPARC III, the result of a floating-point operation is not ready until 4 cycles later; my qhasm tools make it easy to schedule other operations to happen in the meantime. In theory, a C compiler could automatically figure out a good order of instructions; in practice, even a good C compiler (such as Sun's) for an easily predictable computer (such as the UltraSPARC) often benefits from human input.**Register assignment.**With my qhasm tools I've been able to fit every declared register variable into a CPU register. Every C compiler I've tried has given up and spilled variables.

There is a much larger difference between `poly1305aes_53`
and, e.g., `poly1305aes_athlon`:
in most of my x86 implementations,
the Poly1305 accumulator
is a vector of four integers stored in 80-bit floating-point registers,
rather than eight integers stored in 64-bit floating-point registers.
The extra precision means that there are considerably fewer
floating-point operations.

There are six parts of the Poly1305 code:

- Prologue: preparing r and setting h to 0.
- Initial addition of full chunk: setting h to h + c_1, with a 16-byte c_1. This happens (once) for message lengths 16 and up. Some additions could be skipped here and in the subsequent carry, since h is 0.
- Main loop: setting h to carry(h) r + c, with a 16-byte c. This happens once for message lengths 32 through 47, twice for message lengths 48 through 63, etc.
- Multiplication after final full chunk: setting h to carry(h) r. This happens (once) for message lengths 16 and up.
- Partial-chunk handling: setting h to carry(h + c_q) r, with c_q between 1 and 15 bytes. This happens (once) for message lengths not divisible by 16.
- Epilogue: setting h to carry(h), completely reducing modulo 2^130-5, and adding s.

- The initial setting of h to c_1 can be handled in parallel with the preparation of r. This would require two additional prologues: a tiny prologue for message length 0 and a long prologue for message lengths 1 through 15.
- The addition of c_q, when c_q is between 1 and 15 bytes, can be handled in parallel with the multiplication of carry(h) by r. This would require duplicating the carry(h) r block.

The bulk of the work inside AES is 200 table lookups
inside eight 256-entry uint32 tables, totalling 8 kilobytes.
Each table lookup is a load surrounded by a few integer operations.
`aes_big` reduces the tables to 4 kilobytes
at the expense of an extra mask
(to extract one of the four bytes of the uint32)
in 56 of the table lookups.
I could absorb 16 of those masks into subsequent byte stores.

It's easy to further compress the tables from 4 kilobytes to 2 kilobytes
at the expense of unaligned loads within an 8-byte block
(whether or not the input is aligned);
`aes_big` and `aes_sparc` don't do this,
but `aes_aix` and `aes_macos` do.

The `aes_big` loop is not unrolled.
Complete unrolling would
eliminate nine loop-index modifications, nine branches, and nine loads,
at the expense of a few kilobytes of code size.