From: "D. J. Bernstein"
To: hash-forum@nist.gov
Cc: keccak@noekeon.org
Subject: Second preimages for 6 (7? (8??)) rounds of Keccak?

At the Crypto 2010 rump session, Dinur and Shamir announced an algebraic
attack on Hamsi-256. They've now posted their paper:

   http://eprint.iacr.org/2010/602

It turns out that Dinur and Shamir are looking through 2^256 possible
second preimages, but they argue that this still qualifies as an attack
because they've reduced the number of bit operations in the inner loop.

The simplest form of the attack is as follows. Split a possible preimage
into a 224-bit prefix and a 32-bit suffix. The structure of Hamsi allows
(e.g.) output bits 5, 156, 214, 221, 249 to be written as polynomials of
degree at most 6 in the suffix bits, where the polynomial coefficients
are particular functions of the prefix. Compute these polynomials, and
then evaluate them on all 2^32 suffixes; only about 2^27 suffixes will
match the target in the selected 5 output bits. Standard fast algorithms
for multipoint evaluation take only about 5*32*2^31 bit operations per
prefix, for a total of 80*2^256 bit operations across the 2^224
prefixes. Computing the polynomials in the first place is not a
bottleneck, since the polynomials have low degree. The number of
compression-function evaluations is reduced by a factor of almost 2^5.

Now let's think about what this means for Keccak. Out of the five SHA-3
candidates specified in the Keccak submission, let's focus on the
maximum-security candidate keccakc1024: Keccak[r=576,c=1024,nr=24] with
512-bit output. Each 576-bit input block is fed through 24 rounds; let's
weaken Keccak by reducing the 24 rounds to 6 rounds. Because the block
size is already above 512 bits, we can expect to find a single-block
second preimage. Here's how we'll do that:

   Choose 64 bits (b[1],...,b[64]) not matching the first preimage.
   For each pattern of 128 bits (c[1],...,c[128]):
     Compute a polynomial representation of the function f that maps
     (x[1],...,x[384]) to the first 10 bits of
     Keccak6(b[1],...,b[64],c[1],...,c[128],x[1],...,x[384]).
     Use fast multipoint evaluation to compute f for all
     (x[1],...,x[384]).
     For each (x[1],...,x[384]) where f matches the target:
       Compute Keccak6(b[1],...,b[64],c[1],...,c[128],x[1],...,x[384]).

This procedure searches through 2^128*2^384 = 2^512 Keccak inputs, so it
has an excellent chance of finding a second preimage, and adding early
termination with a restart effectively increases this chance to 1. So
the only question is how long the procedure takes.

Each Keccak round is quadratic in the input bits, so Keccak6 has degree
at most 2^6 = 64, so each bit of f has degree at most 64 in
x[1],...,x[384]; i.e., f can be written as sum_S c_S prod_{i in S} x_i,
where S ranges over subsets of {1,2,...,384} having size at most 64, and
each c_S is a 10-bit string.

We'll compute c_S using a standard identity, namely higher-order
differentiation modulo 2: c_S = sum_{T subset S} f(T). There are
sum(j=0,64,binomial(384,j)) < 2^246 choices of T, so the initial
computation of the f(T) values uses only 2^246 evaluations of Keccak6.
Across the whole attack that's just 2^128*2^246 = 2^374 evaluations of
Keccak6.

Separate computation of c_S = sum_{T subset S} f(T) for each S now uses
sum(j=0,64,2^j*binomial(384,j)) < 2^310 additions of 10-bit strings.
Across the whole attack that's a total of 10*2^438 bit operations. This
is clearly highly suboptimal---standard optimizations (CSE, Paar, etc.)
make a huge difference in the number of additions, and this becomes
important as the degree increases---but I'll stick to it for simplicity.

Fast multipoint evaluation now takes just 384*2^383 additions of 10-bit
strings. Across the whole attack that's a total of 1920*2^512 bit
operations. This is again suboptimal, but it's already much faster than
2^512 evaluations of Keccak6.
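Both tools used above---coefficient extraction via c_S = sum_{T subset S}
f(T), and evaluation of the resulting polynomial at all points---are
instances of the subset-XOR (Moebius) transform over GF(2), which costs
n*2^(n-1) XORs for n variables (the 384*2^383 figure). Here is a toy
illustration of my own, not part of the attack description, with n = 10
variables instead of 384 and 1-bit instead of 10-bit values:

```python
import random

def moebius_transform(table):
    # Subset-XOR butterfly over GF(2): returns t with
    # t[S] = XOR of table[T] over all subsets T of S
    # (subsets encoded as bitmasks). Cost: n*2^(n-1) XORs.
    t = list(table)
    n = len(t).bit_length() - 1
    for i in range(n):
        bit = 1 << i
        for x in range(len(t)):
            if x & bit:
                t[x] ^= t[x ^ bit]
    return t

n = 10
random.seed(1)
f = [random.randrange(2) for _ in range(1 << n)]  # truth table: f(T) for all T

# Coefficient extraction, i.e. higher-order differentiation mod 2:
# coeffs[S] = c_S = sum_{T subset S} f(T).
coeffs = moebius_transform(f)

# Spot-check the identity directly for one subset S, enumerating
# all T subset of S with the standard (T - 1) & S trick:
S = 0b1011001101
acc, T = 0, S
while True:
    acc ^= f[T]
    if T == 0:
        break
    T = (T - 1) & S
assert acc == coeffs[S]

# Evaluation at all points is the same transform in the other
# direction: f(x) = XOR of c_S over S subset of supp(x), and over
# GF(2) the Moebius transform is an involution, so applying it to
# the coefficients recovers the whole truth table at once.
assert moebius_transform(coeffs) == f
```

The involution property is why one routine serves for both steps; the
attack's actual multipoint evaluation would of course work with 10-bit
strings and heavy common-subexpression sharing rather than this naive
loop.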
This initial filter eliminates most of the 2^384 input strings, leaving
just 2^384/2^10 = 2^374 input strings that require full Keccak6
evaluations. Across the whole attack that's just 2^128*2^374 = 2^502
Keccak6 evaluations.

Overall the attack involves 2^502 Keccak6 evaluations plus 1920*2^512
bit operations plus negligible extra work. Can someone double-check the
details and see whether I've calculated all this correctly?

For comparison, there are several papers using Keccak degree bounds to
"distinguish" more rounds of Keccak from a random oracle, but nobody has
managed to give a proper definition of a hash-function "distinguisher"
(a definition that's mathematically clear and that doesn't allow every
function to be trivially distinguished), and it's certainly not possible
to turn these "distinguishers" into attacks against accepted security
notions such as preimage resistance. The SHA-3 Zoo doesn't list any
preimage attacks against reduced-round Keccak, and the only occurrences
of "preimage" on the Keccak cryptanalysis page point to

   * Lathrop finding partial preimages in 4 rounds of Keccak and
     speculating that the attacks could be pushed to more rounds, and

   * Morawiecki and Srebrny finding more restricted partial preimages
     in 3 rounds of Keccak.

I think that this very simple 6-round Keccak attack can be pushed to 7
rounds (possibly even 8?), if CSE is used to accelerate the computation
of c_S; if that's correct then Keccak will need at least 8 (or even 9?)
rounds to meet its preimage-resistance goals. Full 24-round Keccak
therefore has <=200% extra computation as a security margin.

For people who evaluate attack cost by counting compression-function
evaluations, this 2^10 reduction in the cost of second preimages can
easily be improved to a reduction factor of 2^100 or more. Of course,
this would actually make the attack much slower; to me this is yet
another clear demonstration of how silly it is to evaluate attack cost
by counting compression-function evaluations.
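Regarding the double-check request: the binomial-sum bounds in the cost
estimates can be verified exactly with big-integer arithmetic. A quick
sanity check of my own, using Python's math.comb:

```python
from math import comb

deg, nvars = 64, 384  # Keccak6 degree bound 2^6 = 64; 384 free input bits

# Subsets T of {1,...,384} of size at most 64: this bounds the number
# of f(T) evaluations needed before coefficient extraction.
n_points = sum(comb(nvars, j) for j in range(deg + 1))
assert n_points < 2**246

# Naive coefficient extraction: each subset S costs 2^|S| additions.
n_xors = sum(2**j * comb(nvars, j) for j in range(deg + 1))
assert n_xors < 2**310

# The 10-bit filter: 2^384 candidates leave about 2^374 survivors per
# (c[1],...,c[128]) pattern, hence 2^(128+374) = 2^502 full Keccak6
# evaluations across the whole attack.
assert 2**128 * 2**(384 - 10) == 2**502
```

Both sums are within a bit or two of the stated powers of 2, so the
bounds are tight but correct.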
In the opposite direction, for people who have even the slightest
understanding of the physical reality of attack cost, this attack makes
no sense: it's inherently memory-intensive and communication-intensive,
and for the same machine size it's clearly much slower than parallel
exhaustive search, even though it has somewhat fewer bit operations. The
same is true for the Dinur--Shamir attack on Hamsi-256.

One can also reasonably ask why anyone should care about the difference
between 2^520 bit operations and 2^525 bit operations, or between 2^260
bit operations and 2^270 bit operations. If we're really trying to
maximize the epsilon in 2^(512+epsilon), shouldn't we be selecting the
SHA-3 candidate with the largest number of bit operations in its
compression function and with no known inner-loop optimizations? Is
there any evidence that this type of superficial "security" metric has
any value in predicting real-world security?

---D. J. Bernstein
   Research Professor, Computer Science, University of Illinois at Chicago

P.S. I'm also curious whether anyone has tried message modification
against Keccak. I wouldn't expect anything to get past 12 rounds, but
the lack of reduced-round collision examples is a bit embarrassing!