A cryptographic system is defined by mathematical functions to scramble and unscramble messages. Many different cryptographic systems have been proposed. This course will not attempt to survey those systems. The goal of this course is to understand the fastest system to protect public messages and the fastest system to protect private messages: in particular, to understand exactly how the underlying mathematical functions are computed at top speed on typical computers.
This course has very few prerequisites. For example, this course will explain from scratch how to compute multiples of points on a particular elliptic curve; students will not be assumed to know what the phrase ``elliptic curve'' means and will not need to have taken any particular mathematics courses. This course will explain from scratch the speed of this elliptic-curve computation on real computers; students will not need to have taken previous courses on computer architecture or supercomputing. This course will also explain from scratch how this elliptic-curve computation protects messages; students will not need to have taken previous courses on cryptography.
Students are required to have taken MCS 401 or an equivalent course on the design and analysis of fast computer algorithms.
Grading: This is an advanced graduate course, so the workload required for an A will be light. No exams; occasional homework assignments checking that everyone understands the definitions. Students interested in devoting extra time to the course material will be given a variety of projects to explore.
Course location: MWF 01:00-01:50 PM, 210 2DH. Call number 20389.
Office hours: M 02:00-04:50 PM, 410 SEO.
I've noticed that Clint Whaley is teaching a course at FSU on computing basic linear-algebra functions as quickly as possible.
When I talk about machine instructions in class, I'll use an instruction syntax that looks like C. Unfortunately, if you look for outside information about architecture, you'll find that machine instructions are usually expressed in an ancient syntax that isn't nearly as easy to read: ``Sun assembly language,'' for example, or ``AT&T assembly language.'' You can learn a ton about Pentium 1/MMX/Pro/II/III/4 optimization from Agner Fog's Pentium optimization manual, for example, but you have to know ``MASM assembly language'' first.
I recommend, as a starting point, the lecture notes by David Yates from a course at Boston University. The notes explain the basics of the SPARC instruction set; they have many examples; they use Sun assembly language, which will be a bit painful at first but will help you understand other SPARC-related documents.
Of course, there's no substitute for experiment. Write a short foo.c, type gcc -g -S foo.c, and look at the assembly-language foo.s. Try adding optimization levels: gcc -g -O3 -S foo.c. Try modifying foo.s to make it do something different; compile with gcc -o foo foo.s and then watch the results in gdb, as in Yates's sample debugging session.
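For example, here's a tiny foo.c to start from; the function itself is arbitrary, and anything small will do:

   /* foo.c: a tiny function to compile with gcc -g -S foo.c and inspect in foo.s. */
   #include <stdio.h>

   int foo(int a, int b)
   {
     return (a + b) & 4095;
   }

   int main(void)
   {
     printf("%d\n", foo(12345, 67890));
     return 0;
   }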
Two UltraSPARCs here at UIC are icarus.cc.uic.edu, which has several 900MHz UltraSPARC III processors, and galois.math.uic.edu, which has several 296MHz UltraSPARC II processors.
Meanwhile, here's another homework assignment regarding computer speed. (To repeat what I said on Monday: ``I'll sometimes label a homework assignment as being required. Other homework assignments, such as this one, are optional and have nothing to do with your grade, but they're worth exploring if you'd like to devote extra time to the course material.'') On icarus, with gcc -O2, I tried calling the following function several times in a row, and found that each call takes nearly 60000 CPU cycles:
   int x[1048576];

   int doit(void)
   {
     int i;
     static int j;
     int sum = 0;
     for (i = 0; i < 100; ++i) {
       sum += x[j]; j += 654321; j &= 1048575;
       sum += x[j]; j += 654321; j &= 1048575;
       sum += x[j]; j += 654321; j &= 1048575;
       sum += x[j]; j += 654321; j &= 1048575;
       sum += x[j]; j += 654321; j &= 1048575;
       sum += x[j]; j += 654321; j &= 1048575;
       sum += x[j]; j += 654321; j &= 1048575;
       sum += x[j]; j += 654321; j &= 1048575;
       sum += x[j]; j += 654321; j &= 1048575;
       sum += x[j]; j += 654321; j &= 1048575;
     }
     return sum;
   }

On the other hand, if I change 1048575 to 4095, each call takes only 3000 CPU cycles, nearly twenty times faster. Why?
Several of you have asked how to measure speed. The quick answer is to use an

   extern long long cpucycles(void);

function that returns the number of CPU cycles since boot, or

   extern "C" { extern long long cpucycles(void); }

from C++. This isn't a standard function; you'll need cpucycles-sparc.S, cpucycles-powerpc.S, or cpucycles-x86.S. (Warning: the PowerPC version actually counts the ``time base,'' which is 1 cycle on the PowerPC RS64 but 16 cycles on the PowerPC G4.) Look at speed1.cpp for a complete example of using cpucycles to print CPU cycles for 64 consecutive calls to poly1305_gmpxx. Here's how I compiled and ran this on icarus:
   icarus% gcc -m64 -O2 -o speed1 \
     speed1.cpp poly1305_gmpxx.cpp cpucycles-sparc.S \
     -lgmpxx -lgmp /usr/local/gcc-3.3.1/lib/sparcv9/libstdc++.a \
     -I/homes/home12/koobera/include -L/homes/home12/koobera/lib
   icarus% ./speed1
   1396496
   1194949
   1172826
   1131245
   1146634
   ...

Explanation of the gcc line: -m64 produces 64-bit code; -O2 turns on optimization; speed1.cpp, poly1305_gmpxx.cpp, and cpucycles-sparc.S are the files being compiled and linked together; -lgmpxx and -lgmp link the GMP C++ and C libraries; the explicit sparcv9 libstdc++.a path links the 64-bit C++ standard library; -I and -L add my own include and library directories to the search path.
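If you want a smaller starting point than speed1.cpp, here's a minimal sketch of the same measurement idea. It assumes one of the cpucycles-*.S files above is linked in, and it times 64 consecutive calls to a placeholder function; dosomething below is not part of the course code, so substitute whatever you want to measure:

   #include <stdio.h>

   /* cpucycles() comes from cpucycles-sparc.S, cpucycles-powerpc.S,
      or cpucycles-x86.S, as described above. */
   extern long long cpucycles(void);

   /* Placeholder workload; replace with the function you want to time. */
   static int dosomething(void)
   {
     static volatile int j;
     int i, sum = 0;
     for (i = 0; i < 1000; ++i) { sum += j; j += i; }
     return sum;
   }

   int main(void)
   {
     long long t[65];
     int i, sum = 0;
     for (i = 0; i < 65; ++i) {
       t[i] = cpucycles();
       sum += dosomething();
     }
     for (i = 0; i < 64; ++i)
       printf("%lld\n", t[i + 1] - t[i]);
     printf("ignore: %d\n", sum);   /* keep the calls from being optimized away */
     return 0;
   }

Compile it along the same lines as speed1, e.g. something like gcc -m64 -O2 -o speed0 speed0.c cpucycles-sparc.S on icarus (the file name speed0.c is just for illustration).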
Anyway, the course topic next week (perhaps starting at the end of tomorrow's class) is how to compute Poly1305 much more quickly. What are the UltraSPARC's machine instructions? (The important ones, anyway.) How long do those instructions take? How quickly can we perform 130-bit arithmetic using those instructions? What about the PowerPC, the Pentium III, and the Athlon?
Today I started explaining UltraSPARC speed. I discussed the three fundamental bottlenecks that often prevent an UltraSPARC from completing four instructions in a CPU cycle: execution-unit contention, operation latency, and load latency.
Today I listed the traditional names for the UltraSPARC registers and went through the basic integer instructions:
   r = t                   add %g0,t,r
   r = s + t               add s,t,r
   flags, r = s + t        addcc s,t,r
   r = s - t               sub s,t,r
   flags, r = s - t        subcc s,t,r
   r = s & t               and s,t,r
   r = s & ~t              andn s,t,r
   r = s | t               or s,t,r
   r = s | ~t              orn s,t,r
   r = s ^ t               xor s,t,r
   r = s ^ ~t              xnor s,t,r
   r = 10000 * 1024        sethi 10000,r
   r = s << 13             sllx s,13,r
   r = (uint64) s >> 13    srlx s,13,r
   r = (int64) s >> 13     srax s,13,r

In these instructions, t can be replaced by -4096,-4095,...,4095; 13 can be replaced by 0,1,2,...,63; 10000 can be replaced by 0,1,2,...,4194303. Don't assume that this list is complete; I didn't include, for example, r = s << (t & 31).
Required homework assigned today: The instructions
   r = -1
   s = (uint64) r >> 60
   t = (int64) r >> 60

set s to 15 and t to -1. Explain why.
Optional homework (tangential to the course material, and not easy, but educational): Write a program that, given an integer n in {0,1,...,2^64-1}, prints a minimum-length sequence of simple integer instructions to place n into an UltraSPARC register.
Recommended reading: David Goldberg's paper What every computer scientist should know about floating-point arithmetic.
Homework: Write an int64 sum1000(int64 *p) function that returns p[0]+p[1]+...+p[999] in only about 1000 cycles on an UltraSPARC. For extra credit, add 10000 numbers, or 100000 numbers, or 1000000 numbers; even better, take the number of array entries as a parameter.
   #include <stdio.h>

   int main(void)
   {
     double x;
     char *c = (char *) &x;
     c[0] = 69; c[1] = 56; c[2] = 0; c[3] = 0;
     c[4] = 1; c[5] = 0; c[6] = 0; c[7] = 0;
     printf("%f\n", x);
     return 0;
   }

---but the homework is to show how that answer arises from the 69 and 56 and so on. (Optional followup: Can you modify this program so that it also works on little-endian computers?)
Here are some references for building higher-precision arithmetic from floating-point arithmetic:
Meanwhile, for those of you who are interested in reading outside information on computer architecture: You should have worked through enough UltraSPARC examples at this point to be comfortable with UltraSPARC assembly language. You're ready to branch out to more assembly languages:
The ultimate references, if you want to know what a particular instruction does, are the official architecture manuals:
   x = b + c
   u = x * x
   v = b * c

your program should figure out that x is computed on cycle 1, u is computed on cycle 5 (since x has a 4-cycle latency), and v is computed on cycle 6 (since there was already a multiplication on cycle 5). Don't worry about how many registers there are.
Followup: Write a program that randomly swaps adjacent instructions (but not instructions that depend on each other!) and sees what effect that has on the number of cycles. If an order of instructions has fewer cycles than anything seen before, print it out. Don't do a swap that pushes the number of cycles beyond the best seen plus 5. How effective is this strategy at finding a good order for a long sequence of instructions?
Some research directions: Are there better ways to break large integer operations into floating-point operations? Is there a reasonably fast register-assignment/instruction-scheduling algorithm that achieves minimum cycle counts for these functions without machine-specific help from the user? If this is still beyond computer capabilities (but within human capabilities), how can we make it as easy as possible for the user to work with the optimizer to achieve minimum cycle counts?
Concrete examples of these questions: How could I have written my 68-UltraSPARC-II-cycles loop with a minimum of effort? Could a clever scheduling algorithm have achieved 68 cycles without my help? Can we do better than 68 cycles by, for example, splitting 130-bit integers into 5 26-bit pieces?
Optional homework: Explain an attack against Poly1305-AES that inspects 3 authenticators, performs ``only'' 2^128 easy computations, and generates a successful forgery.
Required homework: What is the minimum number of UltraSPARC cycles for these 200 lines, with the right number of stores? Assume that there are no latency problems; consider only integer-execution-unit contention (at most two integer instructions in each cycle) and load/store-unit contention (at most one memory instruction in each cycle). Example, not the minimum: 12 lines can be implemented as 3 stores, 12 loads, 12 shifts, 12 loads, 12 xors, while the other 188 lines are implemented as 188 shifts, 188 masks, 188 loads, 188 xors, for a total of 215 load/store instructions and 588 integer instructions, taking 294 cycles.
Optional homework: Figure out exactly how serious the store-to-load latency problems are on the UltraSPARC II (hint: nasty) and the UltraSPARC III (hint: better). This will require some experimentation; neither chip is adequately documented.
I spent the last few minutes briefly discussing cache-timing attacks on AES. If you're interested in learning more about this, check out my pictures and graphs of time variability. I also have a paper with more information on the topic, but I'm finishing a heavily revised version of the paper, so don't read it now.
Consider, for example, a computer multiplying two n-bit numbers. A serial computer can't possibly do this in time below n^{1+o(1)}, and it needs n^{1+o(1)} bits of memory. A 2-dimensional parallel computer of size n^{1+o(1)} can do the job in time n^{1/2+o(1)}. The price-performance ratio, machine cost times computation time, drops from n^{2+o(1)} dollar-seconds to n^{1.5+o(1)} dollar-seconds.
Problems similar to n-bit multiplication show up all over the place in cryptanalysis. Focusing on serial computers makes the problems sound more difficult than they actually are.
It's particularly embarrassing when a ``record-setting'' serial computation is slower, more expensive, and much more complicated than a well-known parallel computation. Here are two examples:
One cryptanalytic paper that doesn't make the same mistake is Wiener's recent paper The full cost of cryptanalytic attacks. This paper displays a shocking level of ignorance of prior work (its main theorem follows trivially from a famous theorem in a 1981 paper by Brent and Kung, for example) but is not bad as an introduction to parallel cryptanalysis. In particular, Section 4, on parallel discrete-logarithm computations, is a good explanation of why the rho method, a predecessor to the kangaroo method, is better than baby-step-giant-step. The Stein-Teske paper is a better introduction to the kangaroo method but says only a tiny bit about previous algorithms.
Define p as the prime 2^255-19, and define A=257290. The input to Curve is a positive integer. The output is an element of {0,1,...,p-1,infty}.
Curve(1) is defined as 2.
If g = Curve(n) then Curve(2n) is defined as the quotient mod p of (g^2-1)^2 and 4g(g^2+Ag+1). This means infty when g = infty; it means infty when g(g^2+Ag+1) mod p = 0; otherwise it means the unique solution to (g^2-1)^2-4g(g^2+Ag+1) Curve(2n) mod p = 0.
If g = Curve(n) and h = Curve(n+1) then Curve(2n+1) is defined as the quotient mod p of (gh-1)^2 and (g-h)^2 Curve(1). This means infty when g = h (which actually can't happen); it means the unique solution to h^2-Curve(1) Curve(2n+1) mod p = 0 if g = infty and h != infty; it means the unique solution to g^2-Curve(1) Curve(2n+1) mod p = 0 if g != infty and h = infty; otherwise it means the unique solution to (gh-1)^2-(g-h)^2 Curve(1) Curve(2n+1) mod p = 0.
To compute Curve(nS) given n and Curve(S), use the fact that Curve(2nS) = (g^2-1)^2/(4g(g^2+Ag+1)) mod p where g = Curve(nS), and the magical fact that Curve((2n+1)S) = (gh-1)^2/((g-h)^2 Curve(S)) mod p where g = Curve(nS) and h = Curve((n+1)S).
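If you want to see the recursion in action before worrying about speed, here's a minimal sketch in C using GMP (the same library used for poly1305_gmpxx) for the arithmetic mod p. Division mod p is done with mpz_invert; the infty cases are ignored, and nothing here is optimized, so treat it purely as a way to make the definition concrete:

   #include <stdio.h>
   #include <gmp.h>

   /* p = 2^255 - 19 and A = 257290, as defined above; both are set in main(). */
   static mpz_t p, A;

   /* Curve(2n) from g = Curve(n):  (g^2-1)^2 / (4g(g^2+Ag+1)) mod p. */
   static void curve_double(mpz_t r, const mpz_t g)
   {
     mpz_t num, den, t;
     mpz_init(num); mpz_init(den); mpz_init(t);
     mpz_mul(t, g, g); mpz_mod(t, t, p);             /* t = g^2 */
     mpz_sub_ui(num, t, 1);
     mpz_mul(num, num, num); mpz_mod(num, num, p);   /* num = (g^2-1)^2 */
     mpz_mul(den, A, g);
     mpz_add(den, den, t);
     mpz_add_ui(den, den, 1);                        /* den = g^2 + Ag + 1 */
     mpz_mul(den, den, g);
     mpz_mul_ui(den, den, 4); mpz_mod(den, den, p);  /* den = 4g(g^2+Ag+1) */
     mpz_invert(den, den, p);                        /* infty case ignored */
     mpz_mul(r, num, den); mpz_mod(r, r, p);
     mpz_clear(num); mpz_clear(den); mpz_clear(t);
   }

   /* Curve(2n+1) from g = Curve(n), h = Curve(n+1), with Curve(1) = 2:
      (gh-1)^2 / ((g-h)^2 Curve(1)) mod p. */
   static void curve_add(mpz_t r, const mpz_t g, const mpz_t h)
   {
     mpz_t num, den;
     mpz_init(num); mpz_init(den);
     mpz_mul(num, g, h);
     mpz_sub_ui(num, num, 1);
     mpz_mul(num, num, num); mpz_mod(num, num, p);   /* num = (gh-1)^2 */
     mpz_sub(den, g, h);
     mpz_mul(den, den, den);
     mpz_mul_ui(den, den, 2); mpz_mod(den, den, p);  /* den = (g-h)^2 * Curve(1) */
     mpz_invert(den, den, p);                        /* infty case ignored */
     mpz_mul(r, num, den); mpz_mod(r, r, p);
     mpz_clear(num); mpz_clear(den);
   }

   /* Compute the pair (Curve(n), Curve(n+1)) by the recursion above. */
   static void curve_pair(mpz_t out0, mpz_t out1, unsigned long n)
   {
     mpz_t g, h;
     if (n == 1) {
       mpz_set_ui(out0, 2);          /* Curve(1) = 2 */
       curve_double(out1, out0);     /* Curve(2) */
       return;
     }
     mpz_init(g); mpz_init(h);
     curve_pair(g, h, n / 2);        /* g = Curve(m), h = Curve(m+1), m = floor(n/2) */
     if (n % 2 == 0) {
       curve_double(out0, g);        /* Curve(2m) */
       curve_add(out1, g, h);        /* Curve(2m+1) */
     } else {
       curve_add(out0, g, h);        /* Curve(2m+1) */
       curve_double(out1, h);        /* Curve(2m+2) */
     }
     mpz_clear(g); mpz_clear(h);
   }

   int main(void)
   {
     mpz_t c0, c1;
     mpz_init(p); mpz_init(A);
     mpz_ui_pow_ui(p, 2, 255);
     mpz_sub_ui(p, p, 19);           /* p = 2^255 - 19 */
     mpz_set_ui(A, 257290);
     mpz_init(c0); mpz_init(c1);
     curve_pair(c0, c1, 1000);       /* Curve(1000), for example */
     gmp_printf("Curve(1000) = %Zd\n", c0);
     mpz_clear(c0); mpz_clear(c1);
     return 0;
   }

Compile with something like gcc -o curve curve.c -lgmp.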
The traditional name for Curve(n) is ``the x coordinate of the nth multiple of (2,...) on the elliptic curve y^2=x^3+Ax^2+x over the field of p elements.'' Some relevant sections of the Hankerson-Menezes-Vanstone book ``Guide to elliptic curve cryptography'':
The straightforward algorithm uses (n-1)^2 additions.
A fancier algorithm decomposes a size-2n problem into four size-n problems. Write u as A_0 + A_n x^n, and write v as B_0 + B_n x^n. Then uv = A_0 B_0 + (A_0 B_n + A_n B_0) x^n + A_n B_n x^{2n}. The four size-n multiplications take 4(n-1)^2 additions, if they're done with the straightforward algorithm; the addition of A_0 B_n to A_n B_0 takes 2n-1 additions; the addition of A_0 B_0 to (A_0 B_n + A_n B_0) x^n takes n-1 additions, since there's an overlap from x^n through x^{2n-2}; the addition of A_n B_n also takes n-1 additions. The total is (2n-1)^2 additions, just like the straightforward algorithm.
Karatsuba's algorithm uses the same decomposition but computes A_0 B_n + A_n B_0 as (A_0 + A_n)(B_0 + B_n) - A_0 B_0 - A_n B_n. The additions A_0 + A_n and B_0 + B_n take 2n additions; the three size-n multiplications take 3(n-1)^2 additions, if they're done with the straightforward algorithm; the subtractions of A_0 B_0 and A_n B_n take 2(2n-1) additions; the final combinations take 2(n-1) additions. The total is 3n^2+2n-1 additions, which is better than (2n-1)^2 when n is 6 or larger. Next time I'll chop out some more additions.
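To make the bookkeeping concrete, here's a minimal sketch in C of one level of this decomposition, with the three size-n products done by the straightforward method; the coefficient type and the sizes in main are arbitrary choices for illustration:

   #include <stdio.h>

   /* Straightforward product of two size-n polynomials (coefficient arrays);
      the result r has 2n-1 coefficients. */
   static void schoolbook(long long *r, const long long *u, const long long *v, int n)
   {
     int i, j;
     for (i = 0; i < 2 * n - 1; ++i) r[i] = 0;
     for (i = 0; i < n; ++i)
       for (j = 0; j < n; ++j)
         r[i + j] += u[i] * v[j];
   }

   /* One level of Karatsuba for size-2n inputs u and v:
      uv = A0 B0 + ((A0+An)(B0+Bn) - A0 B0 - An Bn) x^n + An Bn x^{2n}.
      The result r has 4n-1 coefficients. */
   static void karatsuba1(long long *r, const long long *u, const long long *v, int n)
   {
     long long a[n], b[n], lo[2 * n - 1], hi[2 * n - 1], mid[2 * n - 1];
     int i;
     for (i = 0; i < n; ++i) { a[i] = u[i] + u[n + i]; b[i] = v[i] + v[n + i]; }
     schoolbook(lo, u, v, n);            /* A0 B0 */
     schoolbook(hi, u + n, v + n, n);    /* An Bn */
     schoolbook(mid, a, b, n);           /* (A0+An)(B0+Bn) */
     for (i = 0; i < 2 * n - 1; ++i) mid[i] -= lo[i] + hi[i];
     for (i = 0; i < 4 * n - 1; ++i) r[i] = 0;
     for (i = 0; i < 2 * n - 1; ++i) {
       r[i] += lo[i];
       r[i + n] += mid[i];
       r[i + 2 * n] += hi[i];
     }
   }

   int main(void)
   {
     long long u[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
     long long v[8] = { 8, 7, 6, 5, 4, 3, 2, 1 };
     long long r[15];
     int i;
     karatsuba1(r, u, v, 4);             /* size-8 inputs, n = 4 */
     for (i = 0; i < 15; ++i) printf("%lld ", r[i]);
     printf("\n");
     return 0;
   }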
Required homework, due 25 April: Draw the computation graph for Karatsuba's algorithm for n = 2; in particular, figure out the latency of the algorithm, assuming infinite execution units.
Optional followup: Investigate the Toom-Cook algorithm and the fast Fourier transform.
There are many techniques to prove numbers prime, although they take more time than the above test. If you'd like to understand how the basic techniques work, read the Crandall-Pomerance book ``Prime numbers,'' Chapter 4. If you'd like to understand how fast the state-of-the-art methods are, read my paper Distinguishing prime numbers from composite numbers.
How many multiplications does the base-8 method need? Let's call it M(n). Then M(1) = M(2) = M(3) = M(5) = M(7) = 4; M(2n) = M(n) + 1 for n >= 2, so M(4) = M(6) = 5; M(8n+1) = M(n) + 4 for n >= 1; M(8n+3) = M(n) + 4 for n >= 1; M(8n+5) = M(n) + 4 for n >= 1; M(8n+7) = M(n) + 4 for n >= 1.
Define S(n) = M(1) + M(2) + ... + M(n - 1) for n >= 1. Then S(1) = 0, S(2) = 4, S(4) = 12, and

   S(8n) = (M(1) + M(2) + M(3) + M(4) + M(5) + M(6) + M(7))
           + M(8) + M(9) + M(10) + M(11) + M(12) + M(13) + M(14) + M(15)
           + M(16) + M(17) + ... + M(8n - 2) + M(8n - 1)
         = (M(1) + M(2) + M(3) + 18)
           + (M(4)+1) + (M(1)+4) + (M(5)+1) + (M(1)+4)
           + (M(6)+1) + (M(1)+4) + (M(7)+1) + (M(1)+4)
           + (M(8)+1) + (M(2)+4) + ... + (M(4n-1)+1) + (M(n-1)+4)
         = S(4n) + 4S(n) + (1+4)(4n-4) + 18
         = S(4n) + 4S(n) + 20n - 2.
S(n) is close to (5/4) n lg n + (3/16) n. (The difference T(n) = S(n) - (5/4) n lg n - (3/16) n satisfies T(16) = 1, T(32) = 4, T(64) = -4, and T(8n) = T(4n) + 4T(n) - 2. One can prove with generating functions that T(n) is on the scale of sqrt(n), since the complex roots of x^3-x^2-4 have magnitude sqrt(2).) Thus the average of M(k), for k between 1 and n-1, is close to (5/4) lg n + 3/16; e.g., about 1280 when n = 2^1024.
(I often miscount the number of multiplications needed for exponentiation, and I often see other people miscounting too, so I highly recommend letting a computer count the number of multiplications for various inputs n. Computer calculation of the average of many samples is easier and less error-prone than hand calculation.)
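Here's a minimal sketch of such a computer check in C: it fills in M(k) from the recurrence above and prints the average of M(k) for k between 1 and n-1, next to (5/4) lg n + 3/16. The choice n = 2^20 is arbitrary; compile with -lm:

   #include <stdio.h>
   #include <stdlib.h>
   #include <math.h>

   int main(void)
   {
     long n = 1L << 20;   /* average over k = 1, 2, ..., n-1; adjust as desired */
     long k;
     int *M = malloc((n + 1) * sizeof(int));
     double sum = 0;

     /* Base cases from the recurrence above. */
     M[1] = 4; M[2] = 4; M[3] = 4; M[4] = 5; M[5] = 4; M[6] = 5; M[7] = 4;
     /* M(2k) = M(k) + 1; M(8k+r) = M(k) + 4 for r in {1,3,5,7}. */
     for (k = 8; k <= n; ++k)
       M[k] = (k % 2 == 0) ? M[k / 2] + 1 : M[k / 8] + 4;
     for (k = 1; k < n; ++k) sum += M[k];
     printf("average of M(k) for 1 <= k <= %ld: %f\n", n - 1, sum / (n - 1));
     printf("(5/4) lg n + 3/16:                 %f\n", 1.25 * log2((double) n) + 0.1875);
     free(M);
     return 0;
   }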
Brauer in 1939 introduced the idea of jumping from M(n) to M(2^k n + {0,1,...,2^k-1}) with k+1 multiplications, and proved that this idea uses (1+o(1)) lg n multiplications when k grows slowly with n. Thurber in 1973 introduced the idea of jumping from M(n) to M(2^k n + {1,3,...,2^k-1}); this saves time by handling even numbers more efficiently and by reducing the number of initial powers to be computed. Thurber's method is often called ``sliding-windows exponentiation.''