Snuffle 2005 D. J. Bernstein
Hash functions and ciphers

Snuffle 2005: the Salsa20 encryption function

The Salsa20 encryption function, also known as Snuffle 2005, uses the Salsa20 core to encrypt data.
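
As a rough illustration of what "uses the Salsa20 core to encrypt data" means, here is a minimal Python sketch. It is an illustration written from the published Salsa20 specification, not one of the implementations listed on this page, and it is far too slow for real use. The function names (salsa20_core, salsa20_block, salsa20_xor) are hypothetical, and the sketch assumes a 256-bit key, an 8-byte nonce, and a 64-bit block counter starting at zero. The core hashes a 16-word block built from constants, key, nonce, and counter; the 64-byte outputs are XORed into the data.

```python
import struct

MASK32 = 0xFFFFFFFF
# The four constant words spell "expand 32-byte k" in little-endian ASCII.
SIGMA = (0x61707865, 0x3320646E, 0x79622D32, 0x6B206574)

# Word indices touched by each quarterround in a column round and a row round.
COLUMNS = ((0, 4, 8, 12), (5, 9, 13, 1), (10, 14, 2, 6), (15, 3, 7, 11))
ROWS = ((0, 1, 2, 3), (5, 6, 7, 4), (10, 11, 8, 9), (15, 12, 13, 14))


def rotl32(v, n):
    """Rotate a 32-bit word left by n bits."""
    return ((v << n) & MASK32) | (v >> (32 - n))


def quarterround(y0, y1, y2, y3):
    """Add-rotate-xor step on four words; each line uses the freshly updated words."""
    y1 ^= rotl32((y0 + y3) & MASK32, 7)
    y2 ^= rotl32((y1 + y0) & MASK32, 9)
    y3 ^= rotl32((y2 + y1) & MASK32, 13)
    y0 ^= rotl32((y3 + y2) & MASK32, 18)
    return y0, y1, y2, y3


def salsa20_core(words, rounds=20):
    """Apply rounds/2 double rounds to 16 words, then add the input words back."""
    x = list(words)
    for _ in range(rounds // 2):
        for a, b, c, d in COLUMNS + ROWS:  # column round, then row round
            x[a], x[b], x[c], x[d] = quarterround(x[a], x[b], x[c], x[d])
    return [(xi + wi) & MASK32 for xi, wi in zip(x, words)]


def salsa20_block(key, nonce, counter, rounds=20):
    """Produce one 64-byte keystream block (assumes 32-byte key, 8-byte nonce)."""
    assert len(key) == 32 and len(nonce) == 8
    k = struct.unpack("<8I", key)
    n = struct.unpack("<2I", nonce)
    ctr = (counter & MASK32, (counter >> 32) & MASK32)
    state = [SIGMA[0], *k[:4], SIGMA[1], *n, *ctr, SIGMA[2], *k[4:], SIGMA[3]]
    return struct.pack("<16I", *salsa20_core(state, rounds))


def salsa20_xor(key, nonce, data, rounds=20):
    """Encrypt or decrypt data by XORing it with the keystream."""
    out = bytearray()
    for offset in range(0, len(data), 64):
        keystream = salsa20_block(key, nonce, offset // 64, rounds)
        chunk = data[offset:offset + 64]
        out.extend(c ^ s for c, s in zip(chunk, keystream))
    return bytes(out)
```

Because encryption is an XOR with the keystream, calling salsa20_xor a second time with the same key and nonce recovers the plaintext. Passing rounds=12 or rounds=8 gives the reduced-round variants that appear in the speed table as Salsa20/12 and Salsa20/8. Any serious use should rely on a reviewed implementation checked against the official test vectors.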

Salsa20 specification; Salsa20 design

The following paper explains how Salsa20 works:

The original Salsa20 documentation split this information into two documents (including many extra examples in the specification):

The following paper defines XSalsa20, a variant of Salsa20 with a longer nonce, and proves that XSalsa20 is secure if Salsa20 is secure:

Salsa20 speed; Salsa20 software

There are two documents analyzing Salsa20 performance:

There are several Salsa20 implementations:

Some older implementations, limited to the Salsa20 core and not including stream generation, appear on the Salsa20-core speed page.

ECRYPT's test framework reports the following speeds for encrypting a 576-byte packet (or a long stream) with a 256-bit key:
Salsa20        Salsa20/12     Salsa20/8    4-round      Implementation           Machine
cycles/byte    cycles/byte    cycles/byte  cycles/byte
4.25 (3.93)    3.25 (2.80)    2.07 (1.88)  0.73 (0.68)  amd64-xmm6 in 20070618   amd64 3000MHz Intel Xeon 5160 (6f6) named td162
4.33 (3.91)    2.80 (2.57)    2.07 (1.88)  0.75 (0.68)  amd64-xmm6 in 20070618   amd64 2137MHz Intel Core 2 Duo (6f6) named katana
4.39 (4.24)    2.88 (2.74)    2.14 (1.99)  0.75 (0.75)  ppc-altivec in 20070618  ppc32 533MHz Motorola PowerPC G4 7410 named gggg
4.70 (4.32)    3.15 (2.80)    2.28 (2.06)  0.81 (0.75)  x86-xmm5 in 20070618     x86 2137MHz Intel Core 2 Duo (6f6) named katana32
7.84 (7.64)    5.04 (4.86)    3.65 (3.47)  1.39 (1.39)  amd64-3 in 20070618      amd64 2000MHz AMD Athlon 64 X2 (15,75,2) named mace
8.04 (7.82)    4.87 (4.83)    3.48 (3.28)  1.52 (1.51)  ppc-altivec in 20070618  ppc64 2000MHz IBM PowerPC G5 970 named geespaz
8.62 (8.42)    5.51 (5.33)    3.96 (3.78)  1.55 (1.55)  amd64-3 in 20070618      amd64 2391MHz AMD Opteron (f5a) named td159
8.78 (8.42)    5.73 (5.35)    4.18 (3.82)  1.53 (1.53)  amd64-3 in 20070618      amd64 2192MHz AMD Opteron (f58) named td189
10.07 (9.80)   6.55 (6.27)    4.78 (4.50)  1.76 (1.77)  x86-1 in 20070618        x86 2000MHz AMD Athlon 64 X2 (15,75,2) named mace32
10.24 (10.04)  6.65 (6.44)    4.84 (4.61)  1.80 (1.81)  x86-athlon in 20070618   x86 900MHz AMD Athlon (622) named thoth
11.47 (11.29)  8.51 (8.35)    7.00 (6.83)  1.49 (1.49)  merged in 20070618       ppc64 1452MHz IBM POWER4 named tigger
11.56 (11.39)  7.85 (7.68)    5.97 (5.82)  1.86 (1.86)  merged in 20070618       hppa 1000MHz HP PA-RISC 8900 named td191
11.73 (10.69)  7.84 (7.19)    5.87 (5.38)  1.95 (1.77)  amd64-xmm6 in 20070618   amd64 3000MHz Intel Pentium D (f64) named svlin001
11.98 (11.70)  7.70 (7.44)    5.53 (5.30)  2.15 (2.13)  x86-xmm5 in 20070618     x86 1300MHz Intel Pentium M (695) named whisper
12.55 (11.64)  8.21 (7.41)    5.86 (5.30)  2.23 (2.11)  x86-xmm5 in 20070618     x86 3000MHz Intel Xeon (f26) named td185
12.59 (11.63)  8.15 (7.40)    5.84 (5.30)  2.25 (2.11)  x86-xmm5 in 20070618     x86 3200MHz Intel Xeon (f25) named td186
12.65 (11.67)  8.20 (7.44)    5.95 (5.33)  2.23 (2.11)  x86-xmm5 in 20070618     x86 2800MHz Intel Xeon (f29) named svlin003
13.40 (11.84)  9.33 (8.12)    6.92 (5.76)  2.16 (2.03)  x86-xmm5 in 20070618     x86 3000MHz Intel Pentium 4 (f41) named svlin002
14.29 (13.88)  9.29 (8.88)    6.79 (6.37)  2.50 (2.50)  x86-mmx in 20070618      x86 1400MHz Intel Pentium III (6b1) named td152
14.45 (14.34)  9.33 (9.21)    6.76 (6.65)  2.56 (2.56)  sparc in 20070618        sparc 1050MHz Sun UltraSPARC IV named hald
15.94 (15.29)  10.31 (9.90)   7.66 (7.13)  2.76 (2.72)  x86-athlon in 20070618   x86 3200MHz Intel Pentium D (f47) named shell
18.27 (18.07)  12.62 (12.42)  8.87 (8.49)  3.13 (3.19)  merged in 20070618       ia64 1500MHz HP Itanium II named td178
18.40 (18.21)  12.76 (12.56)  8.65 (8.28)  3.25 (3.31)  merged in 20070618       ia64 1400MHz HP Itanium II named td156
19.93          12.73          9.14         3.59         x86-1                    x86 133MHz Intel Pentium 1 (52c) named cruncher
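
To relate the cycles/byte figures in the table to more familiar units, the following small sketch converts them to throughput and per-packet time. The helper names are hypothetical, and the conversion assumes a single core running at the nominal clock speed listed in the Machine column, which the table itself does not guarantee.

```python
def throughput_mbytes_per_s(clock_mhz, cycles_per_byte):
    """MHz divided by cycles/byte gives millions of bytes per second."""
    return clock_mhz / cycles_per_byte


def packet_time_us(clock_mhz, cycles_per_byte, packet_bytes=576):
    """Microseconds to process one packet: total cycles / cycles per microsecond."""
    return packet_bytes * cycles_per_byte / clock_mhz


# First table row: 4.25 cycles/byte for a 576-byte packet on the 3000MHz Xeon 5160.
print(round(throughput_mbytes_per_s(3000, 4.25), 1))  # about 705.9 MB/s
print(round(packet_time_us(3000, 4.25), 3))           # about 0.816 microseconds
```

The same helpers apply to the embedded results below: for example, 292 cycles/byte on an 8MHz ATmega8 corresponds to roughly 0.027 MB/s.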

AVR speeds: At the SASC 2007 workshop, Gordon Meiser reported a Salsa20 implementation for an 8MHz ATmega8 taking 292 cycles/byte and using 1514 bytes of flash memory. For comparison, Meiser reported an AES implementation for an 8MHz ATmega16 taking 786 cycles/byte and using 6664 bytes of flash memory.

ARM speeds: At the SASC 2007 workshop, Cedric Lauradoux reported a Salsa20 implementation for a 200MHz ARM920T taking 69 cycles/byte and using just 868 bytes of code. For comparison, Lauradoux reported an AES implementation taking 101 cycles/byte with 15920 bytes of code.

FPGA speeds: At the SASC 2007 workshop, Marcin Rogawski reported an unrolled-double-round Salsa20 implementation using 3510 logic elements on an Altera Cyclone EP1C20F324C6 (130nm process). The implementation is estimated to drain 450.14 mW at 30MHz and produce 1280 Mbps. For comparison, Rogawski reported an AES implementation using 5053 logic elements; the implementation is estimated to drain 1191.01 mW at 105MHz and produce 611 Mbps.
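
The power and throughput estimates above can be combined into energy per bit and bits per clock cycle, which makes the comparison easier to read. This is only arithmetic on the reported figures; the helper names are hypothetical, and the results inherit whatever uncertainty the original estimates carry.

```python
def energy_per_bit_nj(power_mw, throughput_mbps):
    """mW divided by Mbps is nanojoules per bit (1e-3 W / 1e6 bit/s = 1e-9 J/bit)."""
    return power_mw / throughput_mbps


def bits_per_cycle(throughput_mbps, clock_mhz):
    """Mbps divided by MHz is bits produced per clock cycle."""
    return throughput_mbps / clock_mhz


# Reported FPGA estimates: (power in mW, clock in MHz, throughput in Mbps).
reported = {
    "Salsa20": (450.14, 30, 1280),
    "AES": (1191.01, 105, 611),
}

for name, (power_mw, clock_mhz, mbps) in reported.items():
    print(name,
          round(energy_per_bit_nj(power_mw, mbps), 2), "nJ/bit,",
          round(bits_per_cycle(mbps, clock_mhz), 2), "bits/cycle")
```

On these figures Salsa20 comes out at about 0.35 nJ/bit and AES at about 1.95 nJ/bit.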

At the IEEE CCECE 2007 workshop, Yan and Heys reported a "compact" Salsa20 implementation using 194 CLB slices and 4 Block RAMs on a Xilinx 2V250fg256. The implementation is estimated to produce 38 Mbps.

ASIC speeds: At the SASC 2007 workshop, Tim Good reported an unrolled-double-round Salsa20 implementation for a 130nm ASIC, with area estimated as 18626 gate equivalents and speed estimated as 668 Mbps at 35.2 MHz. It's quite clear to me that these speeds are highly suboptimal; I look forward to seeing better Salsa20 ASIC implementations. For comparison, Good reported an AES implementation with area estimated as 5398 gate equivalents and speed estimated as 311 Mbps at 131.2 MHz.

At the IEEE CCECE 2007 workshop, Yan and Heys reported three Salsa20 implementations for a 180nm ASIC: a "compact" implementation with area estimated as 14100 gate equivalents and speed estimated as 71.2 Mbps, a "basic" implementation with area estimated as 23408 gate equivalents and speed estimated as 255 Mbps, and a "fast" implementation with area estimated as 470000 gate equivalents and speed estimated as 4800 Mbps.

Other languages: Larry Bugbee has implemented a Python wrapper for Salsa20: https://www.seanet.com/~bugbee/crypto/salsa20/.

Salsa20 security

I have several documents discussing the security of Salsa20:

See also the Salsa20 diffusion page.

Cryptanalysts are strongly encouraged

Every attack should be fast when w and r are small; and there is no excuse for failing to have a computer verify that a fast attack works.

Independent cryptanalysis:

Salsa20 prizes

In May 2005 I announced that, at the end of 2005, I would award a $1000 prize for the public Salsa20 cryptanalysis that I considered most interesting. I didn't make any promises regarding what I'd find interesting, but I posted the following guidelines:

I awarded the prize to Paul Crowley for his paper "Truncated differential cryptanalysis of five rounds of Salsa20."

Salsa20 formalities

My submission of Salsa20 to the ECRYPT Stream Cipher project (eSTREAM) included the following formalities: