D. J. Bernstein
Authenticators and signatures

MCS 590, High-Speed Cryptography, Spring 2005

Public messages need to be protected against forgery. Private messages need to be protected against both forgery and espionage. Cryptography solves these two fundamental problems by scrambling messages; an attacker cannot figure out what a scrambled message means, and cannot figure out how to properly scramble a forged message.

A cryptographic system is defined by mathematical functions to scramble and unscramble messages. Many different cryptographic systems have been proposed. This course will not attempt to survey those systems. The goal of this course is to understand the fastest system to protect public messages and the fastest system to protect private messages: in particular, to understand exactly how the underlying mathematical functions are computed at top speed on typical computers.

This course has very few prerequisites. For example, this course will explain from scratch how to compute multiples of points on a particular elliptic curve; students will not be assumed to know what the phrase ``elliptic curve'' means and will not need to have taken any particular mathematics courses. This course will explain from scratch the speed of this elliptic-curve computation on real computers; students will not need to have taken previous courses on computer architecture or supercomputing. This course will also explain from scratch how this elliptic-curve computation protects messages; students will not need to have taken previous courses on cryptography.

Students are required to have taken MCS 401 or an equivalent course on the design and analysis of fast computer algorithms.

Grading: This is an advanced graduate course, so the workload required for an A will be light. No exams; occasional homework assignments checking that everyone understands the definitions. Students interested in devoting extra time to the course material will be given a variety of projects to explore.

Course location: MWF 01:00-01:50 PM, 210 2DH. Call number 20389.

Office hours: M 02:00-04:50 PM, 410 SEO.

10 January 2005

I have some notes from today's lecture.

12 January 2005

Class is extended by 4 minutes each day. Class is cancelled on 21 February, 23 February, and 25 February.

I've noticed that Clint Whaley is teaching a course at FSU on computing basic linear-algebra functions as quickly as possible.

14 January 2005

Let's define the architecture of a computer as the computer's instruction set, the lowest-level instructions that the computer knows how to follow. Let's define the microarchitecture of the computer as the time taken by the computer to follow those instructions. Understanding exactly how fast a computer is means understanding both the architecture and the microarchitecture.

When I talk about machine instructions in class, I'll use an instruction syntax that looks like C. Unfortunately, if you look for outside information about architecture, you'll find that machine instructions are usually expressed in an ancient syntax that isn't nearly as easy to read: ``Sun assembly language,'' for example, or ``AT&T assembly language.'' You can learn a ton about Pentium 1/MMX/Pro/II/III/4 optimization from Agner Fog's Pentium optimization manual, for example, but you have to know ``MASM assembly language'' first.

I recommend, as a starting point, the lecture notes by David Yates from a course at Boston University. The notes explain the basics of the SPARC instruction set; they have many examples; they use Sun assembly language, which will be a bit painful at first but will help you understand other SPARC-related documents.

Of course, there's no substitute for experiment. Write a short foo.c, type gcc -g -S foo.c, and look at the assembly-language foo.s. Try adding optimization levels: gcc -g -O3 -S foo.c. Try modifying foo.s to make it do something different; compile with gcc -o foo foo.s and then watch the results in gdb, as in Yates's sample debugging session.
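
For example, a tiny foo.c might look like this (my quick example, nothing from class):

     /* foo.c: a small function to inspect in foo.s */
     int sum(int *p, int n)
     {
       int i;
       int s = 0;
       for (i = 0; i < n; ++i) s += p[i];
       return s;
     }

Compare the foo.s produced with and without -O3; the optimized version should have a noticeably tighter loop.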

Two UltraSPARCs here at UIC are icarus.cc.uic.edu, which has several 900MHz UltraSPARC III processors, and galois.math.uic.edu, which has several 296MHz UltraSPARC II processors.

Meanwhile, here's another homework assignment regarding computer speed. (To repeat what I said on Monday: ``I'll sometimes label a homework assignment as being required. Other homework assignments, such as this one, are optional and have nothing to do with your grade, but they're worth exploring if you'd like to devote extra time to the course material.'') On icarus, with gcc -O2, I tried calling the following function several times in a row, and found that each call takes nearly 60000 CPU cycles:

     int x[1048576];

     int doit(void)
     {
       int i;
       static int j;
       int sum = 0;
       for (i = 0;i < 100;++i) {
         sum += x[j]; j += 654321; j &= 1048575;
         sum += x[j]; j += 654321; j &= 1048575;
         sum += x[j]; j += 654321; j &= 1048575;
         sum += x[j]; j += 654321; j &= 1048575;
         sum += x[j]; j += 654321; j &= 1048575;
         sum += x[j]; j += 654321; j &= 1048575;
         sum += x[j]; j += 654321; j &= 1048575;
         sum += x[j]; j += 654321; j &= 1048575;
         sum += x[j]; j += 654321; j &= 1048575;
         sum += x[j]; j += 654321; j &= 1048575;
       }
       return sum;
     }
On the other hand, if I change 1048575 to 4095, each call takes only 3000 CPU cycles, nearly twenty times faster. Why?
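
If you want to reproduce the measurement, here's a minimal driver sketch (mine, not the code I used); it assumes the cpucycles() function described in the 20 January entry below, so compile it together with a file containing the x array and doit above, plus cpucycles-sparc.S:

     #include <stdio.h>

     extern long long cpucycles(void);   /* see the 20 January entry below */
     extern int doit(void);              /* the function above */

     int main(void)
     {
       long long before, after;
       int i;
       for (i = 0; i < 10; ++i) {
         before = cpucycles();
         doit();
         after = cpucycles();
         printf("call %d: %lld cycles\n", i, after - before);
       }
       return 0;
     }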

20 January 2005

Yesterday I presented C++ code for Poly1305 (online: poly1305_gmpxx.h, poly1305_gmpxx.cpp), discussed its speed, presented a protocol using Poly1305 to protect one message against forgery, and said how secure the protocol is. Tomorrow I'll present a more general protocol that handles several messages, and I'll discuss attacks in more detail.

Several of you have asked how to measure speed. The quick answer is to use an

     extern long long cpucycles(void);
function that returns the number of CPU cycles since boot, or
     extern "C" {
       extern long long cpucycles(void);
     } ;
from C++. This isn't a standard function; you'll need cpucycles-sparc.S, cpucycles-powerpc.S, or cpucycles-x86.S. (Warning: the PowerPC version actually counts the ``time base,'' which is 1 cycle on the PowerPC RS64 but 16 cycles on the PowerPC G4.) Look at speed1.cpp for a complete example of using cpucycles to print CPU cycles for 64 consecutive calls to poly1305_gmpxx. Here's how I compiled and ran this on icarus:
     icarus% gcc -m64 -O2 -o speed1 \
     speed1.cpp poly1305_gmpxx.cpp cpucycles-sparc.S \
     -lgmpxx -lgmp /usr/local/gcc-3.3.1/lib/sparcv9/libstdc++.a \
     -I/homes/home12/koobera/include -L/homes/home12/koobera/lib
     icarus% ./speed1
	1396496   1194949   1172826   1131245   1146634   ...
Explanation of the gcc line: -m64 asks for 64-bit code; -O2 turns on optimization; speed1.cpp, poly1305_gmpxx.cpp, and cpucycles-sparc.S are compiled and linked together into speed1; -lgmpxx -lgmp link the GMP libraries; the explicit sparcv9 libstdc++.a path links the 64-bit C++ standard library statically; -I and -L tell gcc where to find the GMP headers and libraries.

Here's another homework assignment regarding computer speed. The output of speed1 showed the first call to poly1305_gmpxx taking 1.4 million cycles, while the remaining calls were all under 1.2 million. Why was the first call slower?

Anyway, the course topic next week (perhaps starting at the end tomorrow) is how to compute Poly1305 much more quickly. What are the UltraSPARC's machine instructions? (The important ones, anyway.) How long do those instructions take? How quickly can we perform 130-bit arithmetic using those instructions? What about the PowerPC, the Pentium III, and the Athlon?

24 January 2005

I'm told that some UltraSPARCs are available in a CS lab (SEL 2260) and also in an ECE lab.

Today I started explaining UltraSPARC speed. I discussed the three fundamental bottlenecks that often prevent an UltraSPARC from following four instructions in a CPU cycle: execution-unit contention, operation latency, and load latency.

26 January 2005

More details on the CS UltraSPARCs, thanks to Ariel Berkman: There are a bunch of Suns in SEL 2260, each with a 502MHz UltraSPARC-IIe. There are five Suns on the west wall of SEL 2254, each with a 333MHz UltraSPARC-IIi. The login server ernie.cs.uic.edu is a 296MHz UltraSPARC-II. The login server bert.cs.uic.edu is a dual 296MHz UltraSPARC-II.

Today I listed the traditional names for the UltraSPARC registers: the 32 integer registers are traditionally called %g0 through %g7 (the ``globals,'' with %g0 permanently 0), %o0 through %o7 (the ``outs''), %l0 through %l7 (the ``locals''), and %i0 through %i7 (the ``ins'').

I described the int64 and uint64 interpretations of a register and presented the simple integer instructions. Here are the simple integer instructions (which I'm writing on the board with a left arrow in place of =) and their translation into Sun's SPARC assembly language:
     r = t                      add   %g0,t,r
     r = s + t                  add   s,t,r
     flags r = s + t            addcc s,t,r
     r = s - t                  sub   s,t,r
     flags r = s - t            subcc s,t,r
     r = s & t                  and   s,t,r
     r = s & ~t                 andn  s,t,r
     r = s | t                  or    s,t,r
     r = s | ~t                 orn   s,t,r
     r = s ^ t                  xor   s,t,r
     r = s ^ ~t                 xnor  s,t,r
     r = 10000 * 1024           sethi 10000,r
     r = s << 13                sllx  s,13,r
     r = (uint64) s >> 13       srlx  s,13,r
     r = (int64) s >> 13        srax  s,13,r
In these instructions, t can be replaced by -4096,-4095,...,4095; 13 can be replaced by 0,1,2,...,63; 10000 can be replaced by 0,1,2,...,4194303. Don't assume that this list is complete; I didn't include, for example, r = s << (t & 31).
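
If you'd like to see these instructions coming out of the compiler, here's a small example of my own (not from class) to feed to gcc -m64 -O2 -S; gcc may rearrange or simplify the arithmetic, and it will add a return instruction at the end, but the computational instructions it emits should look like entries in the table above:

     typedef long long int64;
     typedef unsigned long long uint64;

     /* exercises andn, sllx, add, srlx, srax, sub */
     int64 f(int64 s, int64 t)
     {
       int64 r = (s & ~t) + (t << 13);
       uint64 u = (uint64) r >> 13;
       return (int64) u - (r >> 50);
     }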

Required homework assigned today: The instructions

     r = -1
     s = (uint64) r >> 60
     t = (int64) r >> 60
set s to 15 and t to -1. Explain why.

Optional homework (tangential to the course material, and not easy, but educational): Write a program that, given an integer n in {0,1,...,2^64-1}, prints a minimum-length sequence of simple integer instructions to place n into an UltraSPARC register.

2 February 2005

Today I defined 53-bit floating-point numbers and the fp_53 function. Next time I'll explain exactly how limited-exponent 53-bit floating-point numbers are represented inside registers, and I'll explain the UltraSPARC floating-point instructions. The exponent limitations won't be important for the Poly1305 computation, but the 53-bit rounding of floating-point operations will be very important.
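
Here's a small illustration of 53-bit rounding (my example, not from class): 2^53 + 1 is not a 53-bit floating-point number, so adding 1 to 2^53 in double-precision arithmetic rounds the result back to 2^53, while adding 2 gives an exact answer:

     #include <stdio.h>

     int main(void)
     {
       double x = 9007199254740992.0;   /* 2^53 */
       double y = x + 1;                /* rounds back to 2^53 */
       double z = x + 2;                /* 2^53 + 2 is representable exactly */
       printf("%.1f %.1f %.1f %d\n", x, y, z, y == x);   /* last value is 1 */
       return 0;
     }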

Recommended reading: David Goldberg's paper What every computer scientist should know about floating-point arithmetic.

Homework: Write an int64 sum1000(int64 *p) function that returns p[0]+p[1]+...+p[999] in only about 1000 cycles on an UltraSPARC. For extra credit, add 10000 numbers, or 100000 numbers, or 1000000 numbers; even better, take the number of array entries as a parameter.

10 February 2005

Required homework (assigned 4 February, due 14 February): What real number is represented by the bytes (69,56,0,0,1,0,0,0) in an UltraSPARC floating-point register? In other words, what is double(69,56,0,0,1,0,0,0)? The computer can tell you the answer---
     #include <stdio.h>

     int main(void)
     {
       /* store the 8 bytes (69,56,0,0,1,0,0,0) into x and print the result */
       double x;
       char *c = (char *) &x;
       c[0] = 69;
       c[1] = 56;
       c[2] = 0;
       c[3] = 0;
       c[4] = 1;
       c[5] = 0;
       c[6] = 0;
       c[7] = 0;
       printf("%f\n",x);
       return 0;
     }
---but the homework is to show how that answer arises from the 69 and 56 and so on. (Optional followup: Can you modify this program so that it also works on little-endian computers?)

Here are some references for building higher-precision arithmetic from floating-point arithmetic:

Meanwhile, for those of you who are interested in reading outside information on computer architecture: You should have worked through enough UltraSPARC examples at this point to be comfortable with UltraSPARC assembly language. You're ready to branch out to more assembly languages:

I'll explain various features of these instruction sets in class; but, as with the UltraSPARC, I'll use a syntax that's meant to be comprehensible to any C programmer, rather than a syntax that will help you read the previous literature.

The ultimate references, if you want to know what a particular instruction does, are the official architecture manuals:

19 February 2005

Optional homework: Write a program that, given a sequence of floating-point additions and multiplications, figures out how many cycles the sequence takes on an UltraSPARC III. For example, given the input
     x = b + c
     u = x * x
     v = b * c
your program should figure out that x is computed on cycle 1, u is computed on cycle 5 (since x has 4 cycles latency), and v is computed on cycle 6 (since there was already a multiplication on cycle 5). Don't worry about how many registers there are.

Followup: Write a program that randomly swaps adjacent instructions (but not instructions that depend on each other!) and sees what effect that has on the number of cycles. If an order of instructions has fewer cycles than anything seen before, print it out. Don't do a swap that pushes the number of cycles beyond the best seen plus 5. How effective is this strategy at finding a good order for a long sequence of instructions?

7 March 2005

Today I explained how the extra precision of the x86 floating-point registers (fp_64 instead of fp_53) drastically reduces the number of floating-point operations in the Poly1305 main loop (36 add + 17 mul instead of 68 add + 33 mul). The limited number of x86 registers poses latency problems beyond the number of operations, but the bottom line is that the Athlon and the Pentium M handle the main loop in fewer cycles than the UltraSPARC and the PowerPC.

Some research directions: Are there better ways to break large integer operations into floating-point operations? Is there a reasonably fast register-assignment/instruction-scheduling algorithm that achieves minimum cycle counts for these functions without machine-specific help from the user? If this is still beyond computer capabilities (but within human capabilities), how can we make it as easy as possible for the user to work with the optimizer to achieve minimum cycle counts?

Concrete examples of these questions: How could I have written my 68-UltraSPARC-II-cycles loop within a minimum of effort? Could a clever scheduling algorithm have achieved 68 cycles without my help? Can we do better than 68 cycles by, for example, splitting 130-bit integers into 5 26-bit pieces?

9 March 2005

Today I drew the CPU-time-versus-programmer-time picture and discussed verification of floating-point ranges. Next time I'll add our second cryptographic tool: AES, also known as Rijndael. Recommended reading:

11 March 2005

Today I explained the Poly1305-AES authentication protocol, presented an extremely slow attack against Poly1305-AES, and explained how a fast attack on Poly1305-AES can be converted into a fast AES distinguisher.

Optional homework: Explain an attack against Poly1305-AES that inspects 3 authenticators, performs ``only'' 2^128 easy computations, and generates a successful forgery.

18 March 2005

Today I finished the discussion of AES speed. To summarize: AES has 200 lines. Four lines can be implemented as 4 shifts, 4 masks, 4 loads, 4 xors; or as 1 store, 4 loads, 4 shifts, 4 loads, 4 xors. The PowerPC has a combined shift-mask instruction. The x86 architecture has some trouble fitting AES into registers.
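
For concreteness, here's a sketch (mine, not the code from class) of what a group of four such lines looks like in a typical table-based AES implementation, in the shift/mask/load/xor form; on 32-bit words the shift by 0 and the mask after the shift by 24 are redundant, but the per-line pattern is one shift, one mask, one load, one xor:

     #include <stdint.h>

     uint32_t four_lines(uint32_t y0,
                         uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
                         const uint32_t T0[256], const uint32_t T1[256],
                         const uint32_t T2[256], const uint32_t T3[256])
     {
       y0 ^= T0[(x0 >> 24) & 255];
       y0 ^= T1[(x1 >> 16) & 255];
       y0 ^= T2[(x2 >>  8) & 255];
       y0 ^= T3[(x3 >>  0) & 255];
       return y0;
     }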

Required homework: What is the minimum number of UltraSPARC cycles for these 200 lines, with the right number of stores? Assume that there are no latency problems; consider only integer-execution-unit contention (at most two integer instructions in each cycle) and load/store-unit contention (at most one memory instruction in each cycle). Example, not the minimum: 12 lines can be implemented as 3 stores, 12 loads, 12 shifts, 12 loads, 12 xors, while the other 188 lines are implemented as 188 shifts, 188 masks, 188 loads, 188 xors, for a total of 215 load/store instructions and 588 integer instructions, taking 294 cycles.

Optional homework: Figure out exactly how serious the store-to-load latency problems are on the UltraSPARC II (hint: nasty) and the UltraSPARC III (hint: better). This will require some experimentation; neither chip is adequately documented.

I spent the last few minutes briefly discussing cache-timing attacks on AES. If you're interested in learning more about this, check out my pictures and graphs of time variability. I also have a paper with more information on the topic, but I'm finishing a heavily revised version of the paper, so don't read it now.

29 March 2005

Pollard's kangaroo method, which I briefly mentioned yesterday and will say more about tomorrow, is the most important attack method in public-key cryptography, and in particular the fastest known attack against the Curve function that I'm going to define. Recommended reading: the Stein-Teske paper The parallelized Pollard kangaroo method in real quadratic function fields, Section 2, through the end of 2.2.

30 March 2005

There is a widespread myth that parallelization does not improve price-performance ratio. Most of the cryptanalytic literature analyzes the time of computations on a huge serial computer, as if this said something about the time of a computation on a comparably priced parallel computer. In fact, a parallel computer can achieve price-performance ratios several orders of magnitude better than a serial computer.

Consider, for example, a computer multiplying two n-bit numbers. A serial computer can't possibly do this in time below n^{1+o(1)}, and it needs n^{1+o(1)} bits of memory. A 2-dimensional parallel computer of size n^{1+o(1)} can do the job in time n^{1/2+o(1)}. The price-performance ratio drops from n^{2+o(1)} dollar-seconds to n^{1.5+o(1)} dollar-seconds.

Problems similar to n-bit multiplication show up all over the place in cryptanalysis. Focusing on serial computers makes the problems sound more difficult than they actually are.

It's particularly embarrassing when a ``record-setting'' serial computation is slower, more expensive, and much more complicated than a well-known parallel computation. Here are two examples:

A parallel computer that simply searches through all 2^128 or 2^534 inputs, doing 2^50 function evaluations at once, would be vastly less expensive and would finish more quickly, so you'd have to be an idiot to use the serial computer. Parallel versions of the complicated algorithms are better than serial versions, and there are some situations where they save time, but they are still worse than brute-force search in both of these situations.

One cryptanalytic paper that doesn't make the same mistake is Wiener's recent paper The full cost of cryptanalytic attacks. This paper displays a shocking level of ignorance of prior work (its main theorem follows trivially from a famous theorem in a 1981 paper by Brent and Kung, for example) but is not bad as an introduction to parallel cryptanalysis. In particular, Section 4, on parallel discrete-logarithm computations, is a good explanation of why the rho method, a predecessor to the kangaroo method, is better than baby-step-giant-step. The Stein-Teske paper is a better introduction to the kangaroo method but says only a tiny bit about previous algorithms.

4 April 2005

Today I gave the following recursive definition of the Curve function.

Define p as the prime 2^255-19, and define A=257290. The input to Curve is a positive integer. The output is an element of {0,1,...,p-1,infty}.

Curve(1) is defined as 2.

If g = Curve(n) then Curve(2n) is defined as the quotient mod p of (g^2-1)^2 and 4g(g^2+Ag+1). This means infty when g = infty; it means infty when g(g^2+Ag+1) mod p = 0; otherwise it means the unique solution to (g^2-1)^2-4g(g^2+Ag+1) Curve(2n) mod p = 0.

If g = Curve(n) and h = Curve(n+1) then Curve(2n+1) is defined as the quotient mod p of (gh-1)^2 and (g-h)^2 Curve(1). This means infty when g = h (which actually can't happen); it means the unique solution to h^2-Curve(1) Curve(2n+1) mod p = 0 if g = infty and h != infty; it means the unique solution to g^2-Curve(1) Curve(2n+1) mod p = 0 if g != infty and h = infty; otherwise it means the unique solution to (gh-1)^2-(g-h)^2 Curve(1) Curve(2n+1) mod p = 0.

To compute Curve(nS) given n,Curve(S), use the fact that Curve(2nS) = (g^2-1)^2/4g(g^2+Ag+1) mod p where g = Curve(nS), and the magical fact that Curve((2n+1)S) = (gh-1)^2/(g-h)^2 Curve(S) mod p where g = Curve(nS) and h = Curve((n+1)S).
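
To make the recursion concrete, here's a straight translation of the definition into C using GMP (my sketch, not code for the course; link with -lgmp). It follows the recursion blindly, without the Curve(nS) trick above, and omits the infty cases, which don't arise for the small inputs tried here:

     #include <stdio.h>
     #include <gmp.h>

     static mpz_t p, A;

     /* out = Curve(n), straight from the recursive definition above */
     static void curve(mpz_t out, unsigned long n)
     {
       mpz_t g, h, num, den, t;
       if (n == 1) { mpz_set_ui(out, 2); return; }
       mpz_init(g); mpz_init(h); mpz_init(num); mpz_init(den); mpz_init(t);
       if (n % 2 == 0) {
         /* Curve(2m) = (g^2-1)^2 / 4g(g^2+Ag+1), g = Curve(m) */
         curve(g, n / 2);
         mpz_mul(num, g, g); mpz_sub_ui(num, num, 1);
         mpz_mul(num, num, num);                              /* (g^2-1)^2 */
         mpz_mul(den, g, g); mpz_addmul(den, A, g); mpz_add_ui(den, den, 1);
         mpz_mul(den, den, g); mpz_mul_ui(den, den, 4);       /* 4g(g^2+Ag+1) */
       } else {
         /* Curve(2m+1) = (gh-1)^2 / (g-h)^2 Curve(1), g = Curve(m), h = Curve(m+1) */
         curve(g, n / 2); curve(h, n / 2 + 1);
         mpz_mul(num, g, h); mpz_sub_ui(num, num, 1);
         mpz_mul(num, num, num);                              /* (gh-1)^2 */
         mpz_sub(den, g, h); mpz_mul(den, den, den);
         mpz_mul_ui(den, den, 2);                             /* (g-h)^2 Curve(1) */
       }
       mpz_invert(t, den, p);                                 /* 1/den mod p; p is prime */
       mpz_mul(out, num, t); mpz_mod(out, out, p);
       mpz_clear(g); mpz_clear(h); mpz_clear(num); mpz_clear(den); mpz_clear(t);
     }

     int main(void)
     {
       mpz_t c;
       unsigned long n;
       mpz_init_set_ui(p, 1); mpz_mul_2exp(p, p, 255); mpz_sub_ui(p, p, 19);  /* 2^255-19 */
       mpz_init_set_ui(A, 257290);   /* the A from the definition above */
       mpz_init(c);
       for (n = 1; n <= 8; ++n) {
         curve(c, n);
         gmp_printf("Curve(%lu) = %Zd\n", n, c);
       }
       return 0;
     }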

The traditional name for Curve(n) is ``the x coordinate of the nth multiple of (2,...) on the elliptic curve y^2=x^3+Ax^2+x over the field of p elements.'' Some relevant sections of the Hankerson-Menezes-Vanstone book ``Guide to elliptic curve cryptography'':

13 April 2005

Today I focused on the problem of size-n polynomial multiplication: given the coefficients u[0],u[1],...,u[n-1] of a polynomial u = u[0] + u[1]x + ... + u[n-1]x^{n-1}, and the coefficients v[0],v[1],...,v[n-1] of a polynomial v = v[0] + v[1]x + ... + v[n-1]x^{n-1}, compute the coefficients u[0]v[0],u[0]v[1]+u[1]v[0],...,u[n-1]v[n-1] of the product uv.

The straightforward algorithm uses (n-1)^2 additions.

A fancier algorithm decomposes a size-2n problem into four size-n problems. Write u as A_0 + A_n x^n, and write v as B_0 + B_n x^n. Then uv = A_0 B_0 + (A_0 B_n + A_n B_0) x^n + A_n B_n x^{2n}. The four size-n multiplications take 4(n-1)^2 additions, if they're done with the straightforward algorithm; the addition of A_0 B_n to A_n B_0 takes 2n-1 additions; the addition of A_0 B_0 to (A_0 B_n + A_n B_0) x^n takes n-1 additions, since there's an overlap from x^n through x^{2n-2}; the addition of A_n B_n also takes n-1 additions. The total is (2n-1)^2 additions, just like the straightforward algorithm.

Karatsuba's algorithm uses the same decomposition but computes A_0 B_n + A_n B_0 as (A_0 + A_n)(B_0 + B_n) - A_0 B_0 - A_n B_n. The additions A_0 + A_n and B_0 + B_n take 2n additions; the three size-n multiplications take 3(n-1)^2 additions, if they're done with the straightforward algorithm; the subtractions of A_0 B_0 and A_n B_n take 2(2n-1) additions; the final combinations take 2(n-1) additions. The total is 3n^2+2n-1 additions, which is better than (2n-1)^2 when n is 6 or larger. Next time I'll chop out some more additions.
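
Here's a small C program (my sketch, not code from class) implementing one level of this decomposition on double-precision coefficients and checking it against the straightforward algorithm:

     #include <stdio.h>

     #define N 4   /* half-size: the input polynomials have 2*N coefficients */

     /* straightforward product of two size-n polynomials: r[0..2n-2] */
     static void schoolbook(double *r, const double *u, const double *v, int n)
     {
       int i, j;
       for (i = 0; i < 2 * n - 1; ++i) r[i] = 0;
       for (i = 0; i < n; ++i)
         for (j = 0; j < n; ++j)
           r[i + j] += u[i] * v[j];
     }

     /* one level of Karatsuba: u,v have 2N coefficients; r gets 4N-1 coefficients */
     static void karatsuba(double *r, const double *u, const double *v)
     {
       double u01[N], v01[N], low[2 * N - 1], high[2 * N - 1], mid[2 * N - 1];
       int i;
       for (i = 0; i < N; ++i) { u01[i] = u[i] + u[i + N]; v01[i] = v[i] + v[i + N]; }
       schoolbook(low, u, v, N);            /* A_0 B_0 */
       schoolbook(high, u + N, v + N, N);   /* A_n B_n */
       schoolbook(mid, u01, v01, N);        /* (A_0 + A_n)(B_0 + B_n) */
       for (i = 0; i < 2 * N - 1; ++i) mid[i] -= low[i] + high[i];   /* A_0 B_n + A_n B_0 */
       for (i = 0; i < 4 * N - 1; ++i) r[i] = 0;
       for (i = 0; i < 2 * N - 1; ++i) {
         r[i] += low[i]; r[i + N] += mid[i]; r[i + 2 * N] += high[i];
       }
     }

     int main(void)
     {
       double u[2 * N] = { 1, 2, 3, 4, 5, 6, 7, 8 };
       double v[2 * N] = { 8, 7, 6, 5, 4, 3, 2, 1 };
       double r1[4 * N - 1], r2[4 * N - 1];
       int i;
       schoolbook(r1, u, v, 2 * N);
       karatsuba(r2, u, v);
       for (i = 0; i < 4 * N - 1; ++i)
         printf("%g %g\n", r1[i], r2[i]);   /* the two columns should agree */
       return 0;
     }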

Required homework, due 25 April: Draw the computation graph for Karatsuba's algorithm for n = 2; in particular, figure out the latency of the algorithm, assuming infinite execution units.

Optional followup: Investigate the Toom-Cook algorithm and the fast Fourier transform.

14 April 2005

Recommended reading: The revised version of my paper Cache-timing attacks on AES is now available.

20 April 2005

Here's the prime-generation method that I discussed in class:
  1. Generate a uniform random integer p in {2^1023,2^1023+1,...,2^1024-1}.
  2. (This step starts about 2800 times on average.) If p mod 4 is not 3, start over.
  3. (This step starts about 700 times on average.) If p mod 5 is not 2 or 3, start over.
  4. (This step starts about 280 times on average.) If p is divisible by a prime below 1000003, start over.
  5. (This step starts about 30 times on average.) If 2^((p-1)/2) mod p is not 1 or p-1, start over.
  6. (This step starts about once on average.) If x^((p+1)/2) mod x^2-3x+1 mod p is not 1 or p-1, start over.
  7. (This step starts once.) Declare p to be prime. Nobody has ever found a non-prime p, although there's no proof.
In the computation of x^((p+1)/2) mod x^2-3x+1 mod p, x is a polynomial variable, not a number. Polynomials a+bx are represented inside the computer as pairs (a,b). The product (a,b)(c,d) is (ac-bd mod p,ad+bc+3bd mod p). Compute the (p+1)/2 power of (0,1), and see whether it's in {(1,0),(p-1,0)}. For example, for p = 23, compute the 12th power of (0,1) as follows: (0,1)^2 = (22,3); (0,1)^3 = (0,1)(22,3) = (20,8); (0,1)^6 = (20,8)^2 = (14,6); (0,1)^12 = (14,6)^2 = (22,0). In other words, x^2 = 22 + 3x; x^3 = 20 + 8x; x^6 = 14 + 6x; x^12 = 22.
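
Here's a short C program (my sketch) that reproduces the p = 23 example by computing the (p+1)/2 power of (0,1) with plain square-and-multiply under the rule (a,b)(c,d) = (ac-bd, ad+bc+3bd) mod p:

     #include <stdio.h>

     static const long long p = 23;

     /* (a,b)(c,d) = (ac - bd, ad + bc + 3bd) mod p; (a,b) represents a + bx */
     static void mul(long long *r0, long long *r1,
                     long long a, long long b, long long c, long long d)
     {
       long long e = ((a * c - b * d) % p + p) % p;
       long long f = (a * d + b * c + 3 * b * d) % p;
       *r0 = e;
       *r1 = f;
     }

     int main(void)
     {
       long long acc0 = 1, acc1 = 0;   /* accumulator, starts at (1,0), i.e. 1 */
       long long b0 = 0, b1 = 1;       /* base (0,1), i.e. x */
       long long n = (p + 1) / 2;      /* 12 for p = 23 */
       while (n > 0) {                 /* square-and-multiply, low bit first */
         if (n & 1) mul(&acc0, &acc1, acc0, acc1, b0, b1);
         mul(&b0, &b1, b0, b1, b0, b1);
         n >>= 1;
       }
       printf("(0,1)^%lld mod %lld = (%lld,%lld)\n", (p + 1) / 2, p, acc0, acc1);
       /* prints (22,0), matching x^12 = 22 above */
       return 0;
     }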

There are many techniques to prove numbers prime, although they take more time than the above test. If you'd like to understand how the basic techniques work, read the Crandall-Pomerance book ``Prime numbers,'' Chapter 4. If you'd like to understand how fast the state-of-the-art methods are, read my paper Distinguishing prime numbers from composite numbers.

27 April 2005

Here's the base-8 method of computing h^n for n >= 1. Start by computing h^2, h^3, h^5, h^7. Compute h^2n recursively as (h^n)^2 for n >= 2; compute h^(8n+1) as ((((h^n)^2)^2)^2)h for n >= 1; compute h^(8n+3) as ((((h^n)^2)^2)^2)h^3 for n >= 1; compute h^(8n+5) as ((((h^n)^2)^2)^2)h^5 for n >= 1; compute h^(8n+7) as ((((h^n)^2)^2)^2)h^7 for n >= 1.
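
Here's a short C program (my sketch, with a small modulus just for illustration) that follows this recursion literally and checks h^1000 mod p against repeated multiplication:

     #include <stdio.h>

     static const unsigned long long p = 1000003;   /* small modulus for the example */

     static unsigned long long h2, h3, h5, h7;      /* precomputed h^2, h^3, h^5, h^7 */

     static unsigned long long pow8(unsigned long long h, unsigned long long n)
     {
       unsigned long long r;
       if (n == 1) return h;
       if (n == 2) return h2;
       if (n == 3) return h3;
       if (n == 5) return h5;
       if (n == 7) return h7;
       if (n % 2 == 0) { r = pow8(h, n / 2); return r * r % p; }
       r = pow8(h, n / 8);
       r = r * r % p; r = r * r % p; r = r * r % p;   /* three squarings */
       switch (n % 8) {                               /* one multiplication */
         case 1: return r * h % p;
         case 3: return r * h3 % p;
         case 5: return r * h5 % p;
         default: return r * h7 % p;
       }
     }

     int main(void)
     {
       unsigned long long h = 2, check = 1, i;
       /* the 4 precomputation multiplications */
       h2 = h * h % p; h3 = h2 * h % p; h5 = h3 * h2 % p; h7 = h5 * h2 % p;
       for (i = 0; i < 1000; ++i) check = check * h % p;   /* h^1000 the slow way */
       printf("%llu %llu\n", pow8(h, 1000), check);        /* should agree */
       return 0;
     }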

How many multiplications does the base-8 method need? Let's call it M(n). Then M(1) = M(2) = M(3) = M(5) = M(7) = 4; M(2n) = M(n) + 1 for n >= 2, so M(4) = M(6) = 5; M(8n+1) = M(n) + 4 for n >= 1; M(8n+3) = M(n) + 4 for n >= 1; M(8n+5) = M(n) + 4 for n >= 1; M(8n+7) = M(n) + 4 for n >= 1.

Define S(n) = M(1) + M(2) + ... + M(n - 1) for n >= 1. Then S(1) = 0, S(2) = 4, S(4) = 12, and
     S(8n) = (M(1) + M(2) + M(3) + M(4) + M(5) + M(6) + M(7))
              + M(8) + M(9) + M(10) + M(11) + M(12) + M(13) + M(14) + M(15)
              + M(16) + M(17) + ... + M(8n - 2) + M(8n - 1)
           = (M(1) + M(2) + M(3) + 18)
              + (M(4)+1) + (M(1)+4) + (M(5)+1) + (M(1)+4)
              + (M(6)+1) + (M(1)+4) + (M(7)+1) + (M(1)+4)
              + (M(8)+1) + (M(2)+4) + ... + (M(4n-1)+1) + (M(n-1)+4)
           = S(4n) + 4S(n) + (1+4)(4n-4) + 18
           = S(4n) + 4S(n) + 20n - 2.

S(n) is close to (5/4) n lg n + (3/16) n. (The difference T(n) = S(n) - (5/4) n lg n - (3/16) n satisfies T(16) = 1, T(32) = 4, T(64) = -4, and T(8n) = T(4n) + 4T(n) - 2. One can prove with generating functions that T(n) is on the scale of sqrt(n), since the complex roots of x^3-x^2-4 have magnitude sqrt(2).) Thus the average of M(k), for k between 1 and n-1, is close to (5/4) lg n + 3/16; e.g., about 1280 when n = 2^1024.

(I often miscount the number of multiplications needed for exponentiation, and I often see other people miscounting too, so I highly recommend letting a computer count the number of multiplications for various inputs n. Computer calculation of the average of many samples is easier and less error-prone than hand calculation.)
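
Following that advice, here's a short C program (my sketch) that tabulates M(n) from the recurrences above and prints the average for n below a bound small enough to fit in memory; the (5/4) lg n + 3/16 estimate can be checked against its output:

     #include <stdio.h>

     #define BOUND 1048576   /* 2^20; big enough to see the asymptotics */

     static int M[BOUND];

     int main(void)
     {
       long long total = 0;
       int n;
       M[1] = M[2] = M[3] = M[5] = M[7] = 4;
       M[4] = M[6] = 5;
       for (n = 8; n < BOUND; ++n) {
         if (n % 2 == 0)
           M[n] = M[n / 2] + 1;   /* h^(2m) = (h^m)^2 */
         else
           M[n] = M[n / 8] + 4;   /* h^(8m+r), r in {1,3,5,7}: 3 squarings + 1 mult */
       }
       for (n = 1; n < BOUND; ++n) total += M[n];
       printf("average M(n) for 1 <= n < %d: %f\n", BOUND,
              total / (double) (BOUND - 1));
       printf("(5/4) lg %d + 3/16 = %f\n", BOUND, 1.25 * 20 + 0.1875);
       return 0;
     }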

Brauer in 1939 introduced the idea of jumping from M(n) to M(2^k n + {0,1,...,2^k-1}) with k+1 multiplications, and proved that this idea uses (1+o(1)) lg n multiplications when k grows slowly with n. Thurber in 1973 introduced the idea of jumping from M(n) to M(2^k n + {1,3,...,2^k-1}); this saves time by handling even numbers more efficiently and by reducing the number of initial powers to be computed. Thurber's method is often called ``sliding-windows exponentiation.''