D. J. Bernstein / NearSHA D. J. Bernstein
Hash functions and ciphers

The NearSHA software

The Engine Yard SHA-1 contest lasted for 30 hours. Engine Yard announced a challenge string at the beginning of the contest, 2009.07.20 12:00 PDT. By the end of the contest, 30 hours later, the winning team had found another string whose hash had Hamming distance only 30 from the challenge string.

See http://www.win.tue.nl/cccc/sha-1-challenge.html for more information about the contest, the winning team, the CPUs and GPUs used, the difficulty of finding a string at distance 30, etc. The same team also found a string at distance 31 (matching two other submissions), two strings at distance 32 (matching four other submissions), and several strings at distance 33.

This page is for programmers who would like to see NearSHA, the software used by the winning team. I wrote NearSHA and hereby place it into the public domain. Warning: I wrote NearSHA in a rush (starting 2009.07.18), and compromised quality in several ways for the sake of minimizing development time.

Core 2 implementation

The following cycle counts were measured on one core of a 2.4GHz Intel Core 2 Quad Q6600. Infrastructure: target.h, cpucycles.c, cpucycles.h, Makefile.

search2.cpp, 2725 cycles/hash: first serious search program with a built-in SHA-1 implementation.

search3.cpp, 2082 cycles/hash (1492 cycles/hash in SHA-1 block processing): unrolled inner loop.

search4.cpp, 1430 cycles/hash (1162 cycles/hash in SHA-1 block processing): moved endianness and padding to the caller.

search5.cpp, 553 cycles/hash: moved endianness and padding out of the main loop; and precomputed first block.

search6.cpp, 235 cycles/hash: vectorized.

search7.cpp, 210 cycles/hash: tweaked for better instruction scheduling.

search8.cpp, 201 cycles/hash: precomputed some second-block results, under the assumption that the string has length 69.

GPU implementation

search5.cu: I started from search5.cpp and split a GPU function doit() out of the CPU function main(), using explicit CPU-GPU data transfers. At a lower level, I copied "class hash" (for the CPU) to "class gpu_hash" (for the GPU) and put a "__device__" in front of each method. Perhaps there's some way to convince nvcc to automatically figure out which functions need to be compiled for the CPU, which functions need to be compiled for the GPU, and which functions need to be compiled for both.