Salsa10
D. J. Bernstein

Hash functions and ciphers
# The Salsa10 hash function

Salsa10 is a function from 64-byte strings to 64-byte strings.
This web page discusses how quickly Salsa10 can be computed.
See also
Salsa20.

## Overview

The 64-byte input x to Salsa10
is viewed in little-endian form
as 16 integers
x0, x1, x2, ..., x15 in {0,1,...,2^32-1}.
These 16 integers are fed through 320 invertible modifications,
where each modification changes one integer.
The modifications involve, overall,
- 10 additions of constants modulo 2^32;
- 320 more additions modulo 2^32;
- 80 ``or'' operations;
- 240 ``xor'' operations; and
- 320 constant-distance rotations.

The resulting 16 integers are added
to the original x0, x1, x2, ..., x15 respectively modulo 2^32,
producing, in little-endian form, the 64-byte output Salsa10(x).
The sequence of modifications is defined by the
original Salsa10 specification,
published 2004.11.11 as part of my paper
``Cache-timing attacks on AES.''

The 970 integer operations in Salsa10 take at least 485 CPU cycles
on a CPU limited to 2 integer operations per cycle,
and at least 324 CPU cycles
on a CPU limited to 3 integer operations per cycle.

## Rounds, half-rounds, lines

The 320 modifications in Salsa10
are grouped into 10 rounds.
Each round consists of 2 half-rounds.
Each half-round consists of 4 parallel lines.
Each line modifies 4 integers independently of the other 12 integers.
Here is a typical line:
x4 ^= (x0 + x12) <<< 6;
x8 ^= (x4 + x0 ) <<< 17;
x12 += (x8 | x4 ) <<< 16;
x0 += (x12 ^ x8 ) <<< 5;

On a typical CPU,
this chain of 12 operations has a latency of at least 12 cycles,
so at least 2 lines must be performed in parallel
to keep up with 2 integer operations per cycle,
and at least 3 lines must be performed in parallel
to keep up with 3 integer operations per cycle.
## Fitting a line into two registers

The typical line shown above can be decomposed into
the following sequence of 17 instructions:
r = x12;
s = x0;
r += s; r <<<= 6;
r ^= x4; x4 = r;
s += r; s <<<= 17;
s ^= x8; x8 = s;
r |= s; r <<<= 16;
r += x12; x12 = r;
s ^= r; s <<<= 5;
x0 += s;

If r, s are registers while x0, x4, x8, x12 are memory locations
then this sequence involves 12 integer operations,
6 loads (4 combined with integer operations),
and 4 stores (1 combined with an integer operation).
On a CPU limited to 3 instructions per cycle,
these 17 instructions take at least 5.66666 cycles.
On a CPU limited to 3 operations per cycle,
these 22 operations take at least 7.33333 cycles.
On a CPU limited to 3 operations per cycle
where a store counts as two operations,
these 26 operations take at least 8.66666 cycles.

Using one or two extra registers eliminates one or two load operations,
and does not cost any extra copies if the CPU has three-operand instructions.
The x86 architecture has a three-operand addition, called LEA,
but not a three-operand xor.

## Athlon notes

The Athlon should be able to sustain 3 instructions per cycle
with proper instruction alignment.
Target: under 500 cycles.
Bottleneck: instructions.
## Pentium notes

Two lines in parallel should be sufficient to keep the Pentium III's
integer units busy.
Target: around 700 cycles.
Bottleneck: operations.
The Pentium M handles stores more efficiently.
Target: around 600 cycles.
Bottleneck: operations.

## PowerPC notes

The PowerPC has enough registers to avoid all loads and stores
in the main loop.
Target: around 500 cycles on PowerPCs with 2 integer units.
Bottleneck: integer operations.
## UltraSPARC notes

The UltraSPARC has enough registers
but does not have a rotate instruction.
Each rotation turns into three integer operations: shift, shift, or.
Target: slightly over 800 cycles.
Bottleneck: integer operations.