Hash functions and ciphers

The Salsa20 core

**This page is out of date.**
Several further speed improvements appear in the
Salsa20 stream cipher software.
Some of the optimizations
are described in more detail in the "Salsa20 speed" document.

`salsa20_word_athlon` takes 604 Athlon cycles,
including function-call overhead and 11 cycles timing overhead;
timing overhead was subtracted from the cycles/byte figure above.
An average round takes 26.75 cycles.
The compiled code occupies 1280 bytes.
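
As a rough consistency check, the Athlon figures above can be split into time spent inside the rounds and everything else. The sketch below assumes the standard 20 rounds of the Salsa20 core and assumes that the "average round" figure covers only the round computation; the helper name `cycle_budget` is hypothetical and not part of the software.

```python
def cycle_budget(measured, timing_overhead, cycles_per_round, rounds=20):
    """Split a measured call time into round work and everything else.

    rounds=20 is an assumption (the standard Salsa20 round count).
    """
    net = measured - timing_overhead          # cycles for the call itself
    in_rounds = rounds * cycles_per_round     # cycles spent in the rounds
    return net, in_rounds, net - in_rounds    # remainder: load/store, call, etc.

net, in_rounds, rest = cycle_budget(604, 11, 26.75)
print(net, in_rounds, rest)  # 593 535.0 58.0
```

Under these assumptions about 58 cycles per call fall outside the rounds; how they divide between function-call overhead, loads, stores, and the final additions is not stated in the text.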

One could shoot for 500 Athlon cycles, considering the total number of instructions that need to be carried out.

`salsa20_word_pii` takes 872 Pentium III cycles,
including function-call overhead and 35 cycles timing overhead;
timing overhead was subtracted from the cycles/byte figure above.
The compiled code occupies 1280 bytes.

`salsa20_word_pii` actually takes 859 cycles most of the time
but 908 cycles on every fourth call,
presumably because of branch mispredictions.
An average double-round takes 75 cycles.

For comparison, `salsa20_word_pm` takes about 1300 Pentium III cycles.
One could shoot for about 730 Pentium III cycles,
considering the total number of operations
and the total number of integer operations
that need to be carried out;
maybe a bit better with PADDD etc.

`salsa20_word_p4` takes 1136 Pentium 4 Willamette cycles,
including function-call overhead and 84 cycles timing overhead;
timing overhead was subtracted from the cycles/byte figure above.
The compiled code occupies 1144 bytes.

`salsa20_word_pm` takes 790 Pentium M cycles,
including function-call overhead and 50 cycles timing overhead;
timing overhead was subtracted from the cycles/byte figure above.
The compiled code occupies 1248 bytes.

`salsa20_word_pm` actually takes 780 or 781 cycles most of the time
but 856 cycles on every eighth call,
presumably because of branch mispredictions.
An average double-round takes 67.5 cycles.
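
The headline figures appear to be averages over the fast and slow calls. A minimal sketch of that arithmetic, assuming exactly one slow call per period and taking 780.5 as the midpoint of "780 or 781" (both are assumptions, and `mean_cycles` is a hypothetical helper):

```python
def mean_cycles(common, slow, period):
    """Average cost per call when one call in every `period` is slow."""
    return ((period - 1) * common + slow) / period

# Pentium III: 859 cycles usually, 908 on every fourth call.
pii = mean_cycles(859, 908, 4)      # 871.25, reported as 872
# Pentium M: 780 or 781 cycles usually, 856 on every eighth call.
pm = mean_cycles(780.5, 856, 8)     # 789.9375, reported as 790
print(pii, pm)
```

Both averages land within one cycle of the reported totals.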

For comparison, `salsa20_word_pii` takes about 830 Pentium M cycles.
One could shoot for about 650 Pentium M cycles with x86 instructions;
maybe a bit better with MMX/XMM instructions.

`salsa20_word_aix` takes
770 PowerPC RS64 IV cycles,
including function-call overhead and 14 cycles timing overhead;
timing overhead was subtracted from the cycles/byte figure above.
The compiled code occupies 768 bytes.

Each double-round takes 66 PowerPC RS64 IV cycles. There is an obvious bottleneck of 64 cycles for 128 integer operations in each double-round: each rotation instruction counts as 2 integer operations on the PowerPC RS64 IV (unlike IBM's newer chips such as the 970), so each quarter-round (4 additions, 4 xors, 4 rotations) needs 16 integer operations, and a double-round consists of 8 quarter-rounds.
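
The operation count can be checked against the quarter-round itself. The following is an illustrative Python sketch, not the tuned assembly: it follows the quarter-round of the Salsa20 specification (rotation distances 7, 9, 13, 18) and tallies each operation as it runs. The `counts` dictionary and `ops_per_quarterround` are hypothetical names added for the tally.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def quarterround(y0, y1, y2, y3, counts=None):
    """Salsa20 quarter-round; optionally tallies operations in `counts`."""
    if counts is None:
        counts = {}

    def step(target, a, b, n):
        # One step: target ^= rotl(a + b, n), i.e. 1 add, 1 rotate, 1 xor.
        for op in ("add", "rotate", "xor"):
            counts[op] = counts.get(op, 0) + 1
        return target ^ rotl32((a + b) & MASK32, n)

    z1 = step(y1, y0, y3, 7)
    z2 = step(y2, z1, y0, 9)
    z3 = step(y3, z2, z1, 13)
    z0 = step(y0, z3, z2, 18)
    return z0, z1, z2, z3

def ops_per_quarterround(rotate_cost):
    """Integer operations per quarter-round if a rotate costs `rotate_cost`."""
    counts = {}
    quarterround(1, 0, 0, 0, counts)
    return counts["add"] + counts["xor"] + rotate_cost * counts["rotate"]

print(ops_per_quarterround(2))      # 16
print(8 * ops_per_quarterround(2))  # 128 per double-round
```

With a rotation counted as 2 operations this gives 16 per quarter-round and 128 per double-round; the 64-cycle figure then corresponds to 2 integer operations per cycle, which is inferred from the numbers in the text rather than stated there.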

`salsa20_word_macos` takes
approximately 584 PowerPC 7410 cycles,
including function-call overhead and approximately 14 cycles timing overhead;
timing overhead was subtracted from the cycles/byte figure above.
The compiled code occupies 768 bytes.

Each double-round takes 49 PowerPC 7410 cycles. There is an obvious bottleneck of 48 cycles for 96 integer operations in each double-round.

Matthijs van Duin reports that an AltiVec implementation of Salsa20 is "almost twice as fast as djb's non-altivec G4-tuned assembly implementation."

`salsa20_word_sparc` takes
892 UltraSPARC II cycles,
including function-call overhead and 11 cycles timing overhead;
timing overhead was subtracted from the cycles/byte figure above.
The compiled code occupies 936 bytes.

Each double-round takes 81 UltraSPARC II cycles. There is an obvious bottleneck of 80 cycles for 160 integer instructions in each double-round: each rotation needs 3 integer instructions, so each quarter-round needs 20 integer instructions.

Each double-round takes 82 UltraSPARC III cycles, against the same bottleneck of 80 cycles for 160 integer instructions in each double-round.
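
The three bottleneck figures on this page (PowerPC 7410, PowerPC RS64 IV, UltraSPARC) follow one pattern that differs only in the cost of a rotation. A small sketch of that pattern, assuming 2 integer operations per cycle on each chip (an inference from the stated cycle counts, not a documented fact here); the function name and the table are illustrative:

```python
def double_round_bound(rotate_cost, ops_per_cycle=2):
    """Return (integer ops, lower-bound cycles) for one double-round.

    A quarter-round has 4 additions, 4 xors, and 4 rotations;
    a double-round has 8 quarter-rounds.
    ops_per_cycle=2 is an assumption inferred from the page's figures.
    """
    per_quarter = 4 + 4 + 4 * rotate_cost
    total = 8 * per_quarter
    return total, total // ops_per_cycle

# chip: (cost of one rotation, measured cycles per double-round from the text)
chips = {
    "PowerPC 7410": (1, 49),
    "PowerPC RS64 IV": (2, 66),
    "UltraSPARC II": (3, 81),
    "UltraSPARC III": (3, 82),
}

for name, (rotate_cost, measured) in chips.items():
    ops, bound = double_round_bound(rotate_cost)
    print(f"{name}: {ops} ops, bound {bound}, measured {measured}")
```

In each case the measured double-round is within 1 or 2 cycles of the bound.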