Salsa20 D. J. Bernstein
Hash functions and ciphers
The Salsa20 core

Software speed

This web page discusses how quickly the Salsa20 core can be computed, and includes public-domain asm software for several CPUs.

This page is out of date. Several further speed improvements appear in the Salsa20 stream cipher software. Some of the optimizations are described in more detail in the ``Salsa20 speed'' document.

Athlon: 9.27 cycles/byte

My salsa20_word_athlon implementation of Salsa20 is aimed at the AMD Athlon. It will work on any x86 CPU. Here's the traditional asm, slightly tweaked from salsa20_word_pm.s: salsa20_word_athlon.s.

salsa20_word_athlon takes 604 Athlon cycles, including function-call overhead and 11 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. An average round takes 26.75 cycles. The compiled code occupies 1280 bytes.
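
As a rough check on the arithmetic, the cycles/byte figure can be reproduced from the measured cycle count. This sketch assumes the figure is the measured cycles minus timing overhead, divided by the 64 bytes of Salsa20 core output; the helper name cycles_per_byte is made up for illustration.

```python
from fractions import Fraction

BLOCK_BYTES = 64   # size of the Salsa20 core output block
ROUNDS = 20        # rounds in the Salsa20 core

def cycles_per_byte(measured_cycles, timing_overhead):
    """Exact cycles/byte after subtracting timing overhead (hypothetical helper)."""
    return Fraction(measured_cycles - timing_overhead, BLOCK_BYTES)

athlon = cycles_per_byte(604, 11)
print(float(athlon))                 # 9.265625

# 20 rounds at an average of 26.75 cycles each
round_cycles = ROUNDS * Fraction(107, 4)
print(round_cycles)                  # 535
print((604 - 11) - round_cycles)     # 58
```

The exact quotient is 9.265625, so the 9.27 in the heading appears to be rounded up to two decimal places. The remaining 58 cycles would cover everything outside the rounds, such as function-call overhead.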

One could shoot for 500 Athlon cycles, considering the total number of instructions that need to be carried out.

Pentium III: 13.08 cycles/byte

My salsa20_word_pii implementation of Salsa20 is aimed at the Intel Pentium II and Intel Pentium III. It will work on any x86 CPU with MMX instructions. Here's the qhasm version: salsa20_word_pii.q. Translated to traditional asm: salsa20_word_pii.s.

salsa20_word_pii takes 872 Pentium III cycles, including function-call overhead and 35 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. The compiled code occupies 1280 bytes.

salsa20_word_pii actually takes 859 cycles most of the time but 908 cycles on every fourth call, presumably because of branch mispredictions. An average double-round takes 75 cycles.
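
The 872-cycle figure is consistent with averaging over this four-call pattern. Here is a small sketch of that average, assuming exactly three fast calls for every slow one; the helper name average_cycles is hypothetical.

```python
from fractions import Fraction

def average_cycles(fast, slow, period):
    """Mean cost per call when one call in every `period` calls is slow."""
    return Fraction(fast * (period - 1) + slow, period)

avg = average_cycles(859, 908, 4)
print(float(avg))        # 871.25
print(float(avg - 35))   # 836.25 after subtracting the 35 cycles of timing overhead
```

Both values are within one cycle of the 872 and 837 cycles used for the cycles/byte figure above.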

For comparison, salsa20_word_pm takes about 1300 Pentium III cycles. One could shoot for about 730 Pentium III cycles, considering the total number of operations and the total number of integer operations that need to be carried out; maybe a bit better with PADDD etc.

Pentium 4 Willamette: 16.44 cycles/byte

My salsa20_word_p4 implementation of Salsa20 is aimed at the Intel Pentium 4. It will work on any x86 CPU with XMM instructions. Here's the qhasm version: salsa20_word_p4.q. Translated to traditional asm: salsa20_word_p4.s.

salsa20_word_p4 takes 1136 Pentium 4 Willamette cycles, including function-call overhead and 84 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. The compiled code occupies 1144 bytes.

Pentium M: 11.57 cycles/byte

My salsa20_word_pm implementation of Salsa20 is aimed at the Intel Pentium M. It will work on any x86 CPU. Here's the qhasm version: salsa20_word_pm.q. Translated to traditional asm: salsa20_word_pm.s.

salsa20_word_pm takes 790 Pentium M cycles, including function-call overhead and 50 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. The compiled code occupies 1248 bytes.

salsa20_word_pm actually takes 780 or 781 cycles most of the time but 856 cycles on every eighth call, presumably because of branch mispredictions. An average double-round takes 67.5 cycles.

For comparison, salsa20_word_pii takes about 830 Pentium M cycles. One could shoot for about 650 Pentium M cycles with x86 instructions; maybe a bit better with MMX/XMM instructions.

PowerPC RS64 IV (Sstar): 11.82 cycles/byte

My salsa20_word_aix implementation of Salsa20 is aimed at the IBM PowerPC RS64 IV (Sstar) running AIX. It will work on any PowerPC CPU under AIX inside a 32-bit program. Here's the qhasm version: salsa20_word_aix.q. Translated to traditional asm: salsa20_word_aix.s.

salsa20_word_aix takes 770 PowerPC RS64 IV cycles, including function-call overhead and 14 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. The compiled code occupies 768 bytes.

Each double-round takes 66 PowerPC RS64 IV cycles. There is an obvious bottleneck of 64 cycles for 128 integer operations in each double-round: each rotation instruction counts as 2 integer operations on the PowerPC RS64 IV (unlike IBM's newer chips such as the 970), so each quarter-round needs 16 integer operations.
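
The bottleneck arithmetic can be written out as a small cost model. It relies on the structure of the Salsa20 quarter-round (4 additions, 4 xors, 4 rotations) and on 8 quarter-rounds per double-round. The throughput of 2 integer operations per cycle is inferred from the 128-operations-in-64-cycles figure above, not taken from CPU documentation, and the function name is hypothetical.

```python
def double_round_bottleneck(rotate_cost, ops_per_cycle=2):
    """Return (ops per quarter-round, ops per double-round, minimum cycles)."""
    adds, xors, rotations = 4, 4, 4            # one Salsa20 quarter-round
    per_quarter = adds + xors + rotations * rotate_cost
    per_double = 8 * per_quarter               # 4 column + 4 row quarter-rounds
    return per_quarter, per_double, per_double // ops_per_cycle

# On the RS64 IV each rotation counts as 2 integer operations.
print(double_round_bottleneck(rotate_cost=2))  # (16, 128, 64)
```

The measured 66 cycles per double-round is within 2 cycles of this bound.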

PowerPC 7410 (G4): 8.91 cycles/byte

My salsa20_word_macos implementation of Salsa20 is aimed at the Motorola PowerPC 7410 (G4) and PowerPC 7450 (G4e) running Mac OS X. It will work on any PowerPC CPU under Mac OS X inside a 32-bit program. Here's the qhasm version: salsa20_word_macos.q. Translated to traditional asm: salsa20_word_macos.s.

salsa20_word_macos takes approximately 584 PowerPC 7410 cycles, including function-call overhead and approximately 14 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. The compiled code occupies 768 bytes.

Each double-round takes 49 PowerPC 7410 cycles. There is an obvious bottleneck of 48 cycles for 96 integer operations in each double-round.
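
The count of 96 integer operations follows from the shape of a double-round: a column round followed by a row round, each made of four quarter-rounds of 12 operations. The following is a plain reference sketch of that structure with an operation counter added. It is not the asm implementation, and the counter and helper names are illustrative assumptions.

```python
MASK = 0xFFFFFFFF
ops = 0  # running count of add/xor/rotate operations

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def step(target, a, b, n):
    """target ^ ((a + b) <<< n), counted as one add, one rotate, one xor."""
    global ops
    ops += 3
    return target ^ rotl32((a + b) & MASK, n)

def quarterround(y0, y1, y2, y3):
    y1 = step(y1, y0, y3, 7)
    y2 = step(y2, y1, y0, 9)
    y3 = step(y3, y2, y1, 13)
    y0 = step(y0, y3, y2, 18)
    return y0, y1, y2, y3

# Word indices for the four column quarter-rounds, then the four row ones.
COLUMNS = [(0, 4, 8, 12), (5, 9, 13, 1), (10, 14, 2, 6), (15, 3, 7, 11)]
ROWS = [(0, 1, 2, 3), (5, 6, 7, 4), (10, 11, 8, 9), (15, 12, 13, 14)]

def doubleround(x):
    """Apply one column round and one row round to 16 words."""
    x = list(x)
    for a, b, c, d in COLUMNS + ROWS:
        x[a], x[b], x[c], x[d] = quarterround(x[a], x[b], x[c], x[d])
    return x

state = doubleround([1] + [0] * 15)
print(ops)        # 96
print(ops // 2)   # 48, assuming 2 integer operations per cycle as the text implies
```

The four quarter-rounds within each round touch disjoint words, so updating the state in place gives the same result as computing them in parallel.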

Matthijs van Duin reports that an AltiVec implementation of Salsa20 is ``almost twice as fast as djb's non-altivec G4-tuned assembly implementation.''

UltraSPARC II: 13.77 cycles/byte

My salsa20_word_sparc implementation of Salsa20 is aimed at the Sun UltraSPARC II and Sun UltraSPARC III. It will work on any SPARCv9 CPU under a 64-bit operating system inside a 64-bit program. Here's the qhasm version: salsa20_word_sparc.q. Translated to traditional asm: salsa20_word_sparc.s.

salsa20_word_sparc takes 892 UltraSPARC II cycles, including function-call overhead and 11 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. The compiled code occupies 936 bytes.

Each double-round takes 81 UltraSPARC II cycles. There is an obvious bottleneck of 80 cycles for 160 integer instructions in each double-round: each rotation needs 3 integer instructions, so each quarter-round needs 20 integer instructions.
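
The three-instruction rotation mentioned above can be pictured as a shift left, a shift right, and an or. The sketch below shows that decomposition in portable form and redoes the instruction count. It is an illustration under that assumption, not a transcription of salsa20_word_sparc.s, and the helper name is hypothetical.

```python
MASK = 0xFFFFFFFF

def rotl32_three_ops(x, n):
    """Rotate a 32-bit word left by n using shift, shift, or."""
    hi = (x << n) & MASK      # instruction 1: shift left (mask emulates 32-bit width)
    lo = x >> (32 - n)        # instruction 2: shift right
    return hi | lo            # instruction 3: or

print(hex(rotl32_three_ops(0x80000001, 1)))   # 0x3

ROTATE_COST = 3
per_quarter = 4 + 4 + 4 * ROTATE_COST   # adds + xors + rotations
per_double = 8 * per_quarter            # 8 quarter-rounds per double-round
print(per_quarter, per_double, per_double // 2)   # 20 160 80
```

The final division assumes 2 integer instructions per cycle, which is what the 80-cycle figure for 160 instructions implies.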

UltraSPARC III: 13.90 cycles/byte

salsa20_word_sparc takes 905 UltraSPARC III cycles, including function-call overhead and 16 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. The compiled code occupies 936 bytes.

Each double-round takes 82 UltraSPARC III cycles. There is an obvious bottleneck of 80 cycles for 160 integer instructions in each double-round: each rotation needs 3 integer instructions, so each quarter-round needs 20 integer instructions.