D. J. Bernstein
Data structures and program structures

# MCS 501 and CS 501, Algorithms II, Fall 2005

## 7 May 2005

The prerequisite for this course is a baby algorithms course, such as CS 401 or MCS 401. In a baby algorithms course, you saw how to save time in some fundamental computations, such as sorting small integers and finding shortest paths. Some algorithm-design techniques, such as dynamic programming, are helpful in a wide variety of computations; other techniques are specific to particular problems.

There are several ways to continue your study of algorithms:

• Look at mathematical techniques helpful in understanding algorithm speed. This is the subject of MCS 503.
• Look at algorithms that perform the same computations in even less time. Of course, as the algorithms become faster and faster, you have to look at them more and more closely to spot any further speedups.
• Look at other complexity measures: for example, space instead of time. When I taught MCS 501 in Spring 2002, I started by explaining how to sort n small integers in time n^(1/2+o(1)) on a 2-dimensional parallel machine of size n^(1+o(1)). Parallelism is an increasingly important topic in algorithm research and was a major theme of the course.
• Look at other computational problems. For example, in Spring 2002, I explained the Burrows-Wheeler data-compression algorithm; this is the algorithm used in bzip2. One of the major course topics was essentially-linear-time arithmetic.

Here are some questions that I wrote for the MCS preliminary exam in 2002 based on the material I had taught in MCS 501:

• Assume that two n-digit integers, represented as strings in the usual way, can be multiplied in time n^{1+o(1)}. Explain how to compute 1 x 2 x 3 x ... x n in time n^{1+o(1)}.
• State an algebraic algorithm to multiply two degree-n polynomials using n^{1+o(1)} coefficient operations.
• State a parallel algorithm for n computers that computes the sum of two n-bit numbers in time (log n)^{O(1)}.
• Let f be a function from 128-bit strings to 128-bit strings. Explain how 2^{20} computers, each computer evaluating f at approximately 2^{44} inputs, can reasonably attempt to find a collision in f. (A collision is a pair (x,y) such that x!=y but f(x)=f(y).)
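To illustrate what the last question is asking for: here is a serial, toy-scale collision finder that simply remembers every output (function name is mine). The exam question asks for the interesting part, namely how 2^{20} computers can cooperate without storing 2^{64} values; this sketch only shows what a collision is.

```python
def find_collision(f, bits):
    # Evaluate f on all `bits`-bit inputs, remembering every output;
    # stop at the first repeated output value.
    seen = {}
    for x in range(2 ** bits):
        y = f(x)
        if y in seen:
            return (seen[y], x)   # (earlier input, current input)
        seen[y] = x
    return None                    # f is injective on this domain
```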

There's no reason for me to cover the same topics in Fall 2005. My own interests in algorithms are quite broad, and I'm happy to select course topics to match student interests. Let me know what you're interested in hearing about!

## 25 June 2005

The lecture is scheduled for 14:00-14:58 MWF; note the 8 extra minutes. There will be no lectures on the following dates:
• Monday 19 September.
• Wednesday 21 September.
• Friday 23 September.
• Monday 7 November.
• Wednesday 9 November.
• Friday 11 November.
This information overrides the information in the schedule online.

The class location is listed as 2AH, which I think means Addams Hall.

Call numbers for registration: 12364 for MCS 501; 21116 for CS 501.

My office hours will be Monday 15:10-16:00, 16:10-17:00, and 17:10-18:00 in 410 SEO.

## 22 August 2005

The class location was 306 Addams Hall but is now changing to 303 Stevenson Hall. Please go directly to 303 Stevenson Hall starting Wednesday 24 August 2005.

Today's class material: How fast is sorting? Several difficulties with an answer such as ``n log n comparisons'': comparison time isn't specified; time for other operations isn't specified; the sorting machine isn't specified. More precise question: a machine sorts n integers in {0,1,2,...,n^2}, each integer represented in the usual way as a string of bits in binary; how long does this machine take? Sorting machine #1: single-tape Turing machine using insertion sort. Machine finishes in n^(2+o(1)) seconds; machine costs n^(1+o(1)) dollars. Sorting machine #2: two-dimensional RAM using merge sort. Machine finishes in n^(1.5+o(1)) seconds; machine costs n^(1+o(1)) dollars.
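The two sorting algorithms mentioned above, sketched in Python with the machine models ignored (what the machines make precise is how much each comparison and each data movement costs):

```python
def insertion_sort(a):
    # Theta(n^2) comparisons in the worst case; what machine #1 runs.
    a = list(a)
    for i in range(1, len(a)):
        x = a[i]
        j = i
        while j > 0 and a[j - 1] > x:   # shift larger elements right
            a[j] = a[j - 1]
            j -= 1
        a[j] = x
    return a

def merge_sort(a):
    # Theta(n log n) comparisons; what machine #2 runs.
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]
```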

## 25 August 2005

Yesterday's class material: Physical constraints requiring high latency for RAM. Sorting machine #3: pipelined two-dimensional RAM using radix-2 sort. Machine finishes in n^(1+o(1)) seconds; machine costs n^(1+o(1)) dollars. Beginning of description of sorting machine #4: two-dimensional mesh.

## 28 August 2005

One good reference for Schimmler sort is Schimmler's 1987 article ``Fast sorting on the instruction systolic array.'' The second column-sorting step in Schimmler's algorithm uses a slightly more complicated algorithm than I explained in class, saving sqrt(n) parallel compare-exchange steps out of about 8 sqrt(n).

Class material from Friday: Odd-even transposition sort of n^(0.5) integers on a one-dimensional mesh. Example of odd-even transposition sort. Sorting machine #4: two-dimensional mesh using Schimmler sort. Example of Schimmler sort. Time analysis of Schimmler sort. Machine finishes in n^(0.5+o(1)) seconds; machine costs n^(1+o(1)) dollars. What's covered in a prerequisite course such as MCS/CS 401:

• Some reductions of ``time'' for a few important computational problems, such as sorting and finding shortest paths.
• Some algorithm design techniques, such as divide-and-conquer and dynamic programming.
Research directions after the prerequisite course.

## 31 August 2005

Class material from Monday: This is not a course about constant factors. This is not a course about mathematical methods of algorithm analysis. This course will consider more computational problems. This course will consider more cost measures. Examples of algorithm cost measures. Will prove that n steps of odd-even transposition sort are enough for n numbers. Enough to prove that small numbers come before large numbers. Enough to prove that all arrays of 0's and 7's are sorted. Inductive proof, case 1: last number is a 7. Inductive proof, case 2: last number is a 0.
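A serial simulation of odd-even transposition sort on a one-dimensional mesh (round r performs, in parallel on the mesh, the compare-exchanges at positions with the parity of r), together with an exhaustive check of the claim from Monday's class for small n: n rounds sort every array of 0's and 7's of length n. Function names are mine.

```python
from itertools import product

def odd_even_transposition_sort(a):
    # n rounds; round r compare-exchanges pairs (i, i+1) with i = r mod 2
    a = list(a)
    n = len(a)
    for r in range(n):
        for i in range(r % 2, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

def check_all_07_arrays(n):
    # every length-n array of 0's and 7's is sorted after n rounds
    return all(odd_even_transposition_sort(a) == sorted(a)
               for a in product((0, 7), repeat=n))
```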

Today's class material: Schimmler sort on typical balanced array of 0's and 7's. Schimmler sort on typical unbalanced array of 0's and 7's. The parallel price-performance myth. Reality. Example: Schimmler sort. Example: dual-core CPUs.

## 9 September 2005

Class material from last Friday: How expensive is addition of two n-bit integers in the usual binary representation, producing an (n+1)-bit integer in the usual binary representation? Answer depends on addition machine and cost measure. First cost measure: number of instructions; ``time'' in a baby algorithms course. Algorithm performing Theta(n) instructions. Second cost measure: wall-clock time, i.e., real time. Previous algorithm takes time n^(1.5+o(1)). Pipelined algorithm takes time n^(1+o(1)). Third cost measure: circuit depth. Converting carry-chain recurrence to product of many matrices. Parallel multiplication of many matrices.

Class material from Wednesday: Example of integer addition. Matrix view of same example. Simplest algorithm taking depth n^(o(1)) is very bad in other cost measures: number of instructions, mesh price-performance ratio, etc. Fourth cost measure: circuit gates times circuit depth. Serial algorithm has cost n^(2+o(1)). Parallel algorithm has cost n^(2+o(1)). Example for n=6. Better parallel algorithm, merging intermediate results. Cost n^(1+o(1)); exponent 1 is optimal. Constant factors: look up ``parallel prefix.'' Increasing cost measure by expanding circuit depth to circuit delay, accounting for distance in circuit; raises cost to n^(3/2+o(1)). Further increasing cost measure by expanding number of gates to length of wires, accounting for price of wiring; again cost n^(3/2+o(1)). Mesh price-performance ratio is also n^(3/2+o(1)).
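The ``parallel prefix'' view of carrying, sketched serially (names are mine): position i has a generate bit g[i] = a[i] AND b[i] and a propagate bit p[i] = a[i] XOR b[i]; the (g,p) pairs combine under the associative operator below, and the prefix values of that operator are exactly the carries. A circuit evaluates the prefix scan in depth O(log n); this sketch only checks the algebra.

```python
def combine(low, high):
    # (g, p) for a low bit-span followed by a high bit-span:
    # the joint span generates a carry if the high span does, or if the
    # high span propagates a carry generated by the low span.
    gl, pl = low
    gh, ph = high
    return (gh | (ph & gl), ph & pl)

def add_bits(a_bits, b_bits):
    # a_bits, b_bits: equal-length little-endian 0/1 lists;
    # returns the (n+1)-bit little-endian sum.
    gp = [(x & y, x ^ y) for x, y in zip(a_bits, b_bits)]
    carries = [0]        # carry into position 0
    span = (0, 1)        # identity: empty span generates 0, propagates 1
    for g, p in gp:      # serial prefix scan over the combine operator
        span = combine(span, (g, p))
        carries.append(span[0])
    return [p ^ c for (g, p), c in zip(gp, carries)] + [carries[-1]]
```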

Today's class material: Why use price-performance ratio, i.e., price-time product? Minimizing time without regard to price is unrealistic. But why price-performance ratio rather than price + time, for example, or price times time^2, or some other combination? Using price-performance ratio as cost has two nice features: k independent computations in serial are, together, k times as expensive as one computation; and k independent computations in parallel are, together, k times as expensive as one computation. Can also graph (price,time) pairs. Advantages and disadvantages of various time measures. Advantages and disadvantages of various price measures. Back to addition: can reduce mesh price-performance ratio to n^(1+o(1)) by changing the problem, using redundant representation of integers. Moving on to integer multiplication: How expensive is multiplication of two n-bit integers in the usual binary representation, producing a 2n-bit integer in the usual binary representation? Simple solution, n^{2+o(1)} instructions: multiply each bit of the first integer u by the second integer v, and add the results. 5-bit example.
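The simple n^{2+o(1)}-instruction solution above, sketched for integers in binary (function name is mine): each set bit of u contributes a shifted copy of v.

```python
def schoolbook_multiply(u, v, n):
    # multiply each bit of the n-bit integer u by v, and add the results
    result = 0
    for i in range(n):
        if (u >> i) & 1:
            result += v << i   # bit i of u times v, shifted into place
    return result
```

For a 5-bit example, schoolbook_multiply(21, 19, 5) adds 19, 19 shifted by 2, and 19 shifted by 4, since 21 is 10101 in binary.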

## 12 September 2005

Today's class material: Multiplication algorithm parallelizes easily. Tree of sums achieves depth n^{o(1)}. Simple tree layout: area n^{2+o(1)}, including wires; depth n^{0+o(1)}, not counting wire delay; time n^{1+o(1)}, counting wire delay. If massively parallel computation is so great, why don't existing CPUs provide it? Answer: for multiplication, they do! Notes on Duron multiplication performance. Atrubin 1-dimensional mesh: area n^{1+o(1)}, no long wires; time n^{1+o(1)}. We'll see Preparata 2-dimensional mesh: area n^{1+o(1)}, no long wires; time n^{0.5+o(1)}. The Brent-Kung ``area-time theorem'': every circuit for n-bit integer multiplication has area times depth^2 larger than a constant times n^2; here area includes wire area, while depth does not include wire delay. Optimality of the Preparata mesh.

If you're interested in learning more about the Brent-Kung area-time theorem, read the original paper.

If you're interested in the Atrubin mesh, look for it in Section 4.3.3 of Knuth's Art of Computer Programming.

If you're interested in achieving multiplication depth Theta(lg n) rather than Theta((lg n)^2), look up ``Wallace trees.''

## 25 September 2005

There are several ways to view the fast Fourier transform (FFT). I presented it as a method of multiplying polynomials: specifically, an algorithm to multiply two polynomials with complex coefficients mod x^n-1, where n is a power of 2.

The FFT is an ``algebraic algorithm'': it produces the output coefficients using complex additions, complex subtractions, and complex multiplications starting from the input coefficients and various constants. I took ``algebraic complexity'' as my cost measure; this is the total number of additions, subtractions, and multiplications. This cost measure avoids the question of how complex numbers are represented inside a computer, and the question of how much time is used by a complex operation.

The FFT is given two size-n polynomials u,v and produces the size-n polynomial w = uv mod x^n-1. In more detail, the FFT has two inputs: a sequence (u[0],u[1],u[2],...,u[n-1]) of n complex numbers representing the polynomial u = u[0] + u[1]x + u[2]x^2 + ... + u[n-1]x^{n-1}, and a sequence (v[0],v[1],v[2],...,v[n-1]) of n complex numbers representing the polynomial v = v[0] + v[1]x + v[2]x^2 + ... + v[n-1]x^{n-1}. The FFT has one output: a sequence (w[0],w[1],w[2],...,w[n-1]) of n complex numbers representing the unique polynomial w = w[0] + w[1]x + w[2]x^2 + ... + w[n-1]x^{n-1} of degree below n such that w-uv is a multiple of x^n-1.

The FFT output can be described without polynomials as the ``cyclic convolution'' of the inputs:

• the first output w[0] is u[0]v[0]+u[1]v[n-1]+u[2]v[n-2]+...+u[n-1]v[1];
• the second output w[1] is u[0]v[1]+u[1]v[0]+u[2]v[n-1]+...+u[n-1]v[2];
• the third output w[2] is u[0]v[2]+u[1]v[1]+u[2]v[0]+...+u[n-1]v[3];
• ...
• the last output w[n-1] is u[0]v[n-1]+u[1]v[n-2]+u[2]v[n-3]+...+u[n-1]v[0].

For example, for n=2, the inputs are two complex vectors (u[0],u[1]) and (v[0],v[1]) representing the linear polynomials u = u[0]+u[1]x and v = v[0]+v[1]x. The output is the complex vector (w[0],w[1])=(u[0]v[0]+u[1]v[1],u[0]v[1]+u[1]v[0]) representing the linear polynomial w = w[0]+w[1]x = (u[0]v[0]+u[1]v[1])+(u[0]v[1]+u[1]v[0])x = uv - u[1]v[1](x^2-1).

As a larger example, for n=4, the inputs are two complex vectors (u[0],u[1],u[2],u[3]) and (v[0],v[1],v[2],v[3]) representing the polynomials u = u[0]+u[1]x+u[2]x^2+u[3]x^3 and v = v[0]+v[1]x+v[2]x^2+v[3]x^3. The product uv is (u[0]v[0]) + (u[0]v[1]+u[1]v[0])x + (u[0]v[2]+u[1]v[1]+u[2]v[0])x^2 + (u[0]v[3]+u[1]v[2]+u[2]v[1]+u[3]v[0])x^3 + (u[1]v[3]+u[2]v[2]+u[3]v[1])x^4 + (u[2]v[3]+u[3]v[2])x^5 + (u[3]v[3])x^6, so uv mod x^4-1 is uv - ((u[1]v[3]+u[2]v[2]+u[3]v[1]) + (u[2]v[3]+u[3]v[2])x + (u[3]v[3])x^2)(x^4-1) = (u[0]v[0]+u[1]v[3]+u[2]v[2]+u[3]v[1]) + (u[0]v[1]+u[1]v[0]+u[2]v[3]+u[3]v[2])x + (u[0]v[2]+u[1]v[1]+u[2]v[0]+u[3]v[3])x^2 + (u[0]v[3]+u[1]v[2]+u[2]v[1]+u[3]v[0])x^3. The output is the vector (u[0]v[0]+u[1]v[3]+u[2]v[2]+u[3]v[1],u[0]v[1]+u[1]v[0]+u[2]v[3]+u[3]v[2],u[0]v[2]+u[1]v[1]+u[2]v[0]+u[3]v[3],u[0]v[3]+u[1]v[2]+u[2]v[1]+u[3]v[0]).
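The obvious algorithm for the cyclic convolution, directly from the bullet list above (function name is mine): n^2 multiplications, with u[i]v[j] contributing to w[(i+j) mod n].

```python
def cyclic_convolution(u, v):
    # w[k] = sum of u[i]*v[j] over all (i, j) with i+j = k (mod n)
    n = len(u)
    w = [0] * n
    for i in range(n):
        for j in range(n):
            w[(i + j) % n] += u[i] * v[j]
    return w
```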

The idea of the FFT, in a nutshell, is to compute uv mod x^n-1 from uv mod x^(n/2)-1 and uv mod x^(n/2)+1, each of which is computed recursively. In more detail:

1. If n=1: Compute u[0]v[0] and stop.
2. Compute u mod x^(n/2)-1.
3. Compute v mod x^(n/2)-1.
4. Recursively multiply mod x^(n/2)-1, obtaining (u mod x^(n/2)-1)(v mod x^(n/2)-1) mod x^(n/2)-1 = uv mod x^(n/2)-1.
5. Compute u mod x^(n/2)+1.
6. Compute v mod x^(n/2)+1.
7. Recursively multiply mod x^(n/2)+1, obtaining (u mod x^(n/2)+1)(v mod x^(n/2)+1) mod x^(n/2)+1 = uv mod x^(n/2)+1.
8. Compute ((uv mod x^(n/2)-1)+(uv mod x^(n/2)+1))/2 + x^(n/2)((uv mod x^(n/2)-1)-(uv mod x^(n/2)+1))/2, which is exactly uv mod x^n-1.
There's a gap in this recursion: how does one multiply mod x^(n/2)+1? One solution is to generalize the FFT to x^n-c^2 for any nonzero complex number c:
1. If n=1: Compute u[0]v[0] and stop.
2. Compute u mod x^(n/2)-c. This polynomial is represented by the vector (u[0]+u[n/2]c,u[1]+u[n/2+1]c,...,u[n/2-1]+u[n-1]c); computing this vector takes n/2 multiplications by c and n/2 additions.
3. Compute v mod x^(n/2)-c.
4. Recursively multiply mod x^(n/2)-c, obtaining uv mod x^(n/2)-c.
5. Compute u mod x^(n/2)+c. This polynomial is represented by the vector (u[0]-u[n/2]c,u[1]-u[n/2+1]c,...,u[n/2-1]-u[n-1]c); computing this vector takes n/2 subtractions, reusing the previous products u[n/2]c etc.
6. Compute v mod x^(n/2)+c.
7. Recursively multiply mod x^(n/2)+c, obtaining uv mod x^(n/2)+c.
8. Compute ((uv mod x^(n/2)-c)+(uv mod x^(n/2)+c))/2 + x^(n/2)((uv mod x^(n/2)-c)-(uv mod x^(n/2)+c))/2c, which is exactly uv mod x^n-c^2. This takes n/2 additions, n/2 subtractions, n/2 divisions by 2, and n/2 divisions by 2c.
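The generalized recursion above, sketched with complex floating-point arithmetic (function name is mine; a serious implementation would precompute the roots rather than extract square roots on the fly): multiply_mod(u, v, c) returns uv mod x^n-c^2 for coefficient vectors of power-of-2 length n, so multiplication mod x^n-1 is multiply_mod(u, v, 1).

```python
def multiply_mod(u, v, c):
    # returns uv mod x^n - c^2, where n = len(u) = len(v) is a power of 2
    n = len(u)
    if n == 1:
        return [u[0] * v[0]]                        # step 1
    h = n // 2
    uc = [x * c for x in u[h:]]                     # shared products u[h+i]c
    vc = [x * c for x in v[h:]]
    u_minus = [u[i] + uc[i] for i in range(h)]      # u mod x^h - c
    v_minus = [v[i] + vc[i] for i in range(h)]
    u_plus = [u[i] - uc[i] for i in range(h)]       # u mod x^h + c
    v_plus = [v[i] - vc[i] for i in range(h)]
    r = complex(c) ** 0.5                           # r^2 = c
    s = 1j * r                                      # s^2 = -c
    a = multiply_mod(u_minus, v_minus, r)           # uv mod x^h - c
    b = multiply_mod(u_plus, v_plus, s)             # uv mod x^h + c
    return ([(a[i] + b[i]) / 2 for i in range(h)] +         # step 8
            [(a[i] - b[i]) / (2 * c) for i in range(h)])
```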

The FFT has algebraic complexity 5n lg n + n:

• (3/2)n lg n additions;
• (3/2)n lg n subtractions;
• n lg n multiplications by constants (many of which are 1 and thus can be eliminated);
• n lg n divisions by nonzero constants (half of which are divisions by 2, i.e., multiplications by 1/2, which can be combined with subsequent multiplications); and
• n multiplications at the bottom of the recursion.
For comparison, the obvious algorithm has algebraic complexity 2n^2-n: specifically, n^2 multiplications and n^2-n additions.
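As a check on the total 5n lg n + n: each size-n call in the recursion uses n multiplications by c and n additions (steps 2-3), n subtractions (steps 5-6), and n/2 additions, n/2 subtractions, and n divisions (step 8), i.e., 5n operations plus two size-n/2 recursive calls. Encoding that recurrence (function name is mine):

```python
def fft_mul_ops(n):
    # total algebraic operations for the recursion above; n a power of 2
    if n == 1:
        return 1                        # the one multiplication at the bottom
    return 5 * n + 2 * fft_mul_ops(n // 2)
```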

Homework due 28 September 2005 is to analyze in more detail the n lg n multiplications by constants. How many of the n lg n constants are primitive 4th roots of 1, namely i or -i? How many of the n lg n constants are primitive 8th roots of 1, namely sqrt(i), -sqrt(i), sqrt(-i), or -sqrt(-i)? (For example, there are 4 multiplications by primitive 8th roots of 1 if n=8, and 12 multiplications by primitive 8th roots of 1 if n=16; of course, your answer should cover all values of n.) Same question for 16th, 32nd, etc.

I also described another solution to the problem of handling x^(n/2)+1 recursively: namely, to ``twist'' the multiplication mod x^(n/2)+1 into a multiplication mod x^(n/2)-1. A careful combination of these two solutions produces the ``split-radix FFT,'' which is about 20% faster than the original solutions, and which held the speed records for decades; see, e.g., Section 2 of my paper ``Fast multiplication and its applications.'' Last year, Van Buskirk introduced what I call the ``tangent FFT,'' which is about 5% faster than the split-radix FFT; see the paper ``A modified split-radix FFT with reduced arithmetic complexity'' by Johnson and Frigo.

I then explained one way to use the FFT for integer multiplication. The problem is to multiply the integers u[0] + 2u[1] + 4u[2] + ... and v[0] + 2v[1] + 4v[2] + ..., given bits u[0],v[0],u[1],v[1],...; the basic idea is to multiply the polynomials u[0] + u[1]x + u[2]x^2 + ... and v[0] + v[1]x + v[2]x^2 + ... using the FFT. There are three complications in making this work:

• The FFT computes uv mod x^n-1. What we actually want is the polynomial product uv. Solution (easy): choose n as a power of 2 known to be larger than the degree of uv; then uv mod x^n-1 is exactly uv.
• The polynomial product has O(lg n) bits in each coefficient; what we actually want is 1 bit in each coefficient. In other words, we still have to do carries. Solution (fairly easy): split the polynomial coefficients into O(lg n) integers, each having O(n) bits; add those integers.
• The cost measure is different. The FFT uses n^(1+o(1)) additions, subtractions, and multiplications; what we actually want is n^(1+o(1)) low-level machine instructions. Solution (natural but difficult): choose a floating-point precision, let's say 3 lg n bits; represent each real number inside the computer as a nearby floating-point number; prove, by careful numerical analysis, that the floating-point FFT output is within 0.25 of the integers u[0]v[0], u[0]v[1]+u[1]v[0], etc., allowing those integers to be computed; and observe that an n^(o(1))-bit floating-point approximation to a complex operation uses n^(o(1)) machine instructions.
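The carrying step (the second complication above), sketched with exact integer arithmetic standing in for the floating-point FFT (function name is mine): multiply the bit polynomials, then evaluate the coefficient polynomial at x = 2.

```python
def multiply_via_polynomials(u, v):
    if u == 0 or v == 0:
        return 0
    ub = [(u >> i) & 1 for i in range(u.bit_length())]   # bits of u
    vb = [(v >> i) & 1 for i in range(v.bit_length())]   # bits of v
    # polynomial product: each coefficient is a sum of bit products,
    # so it is O(lg n) bits long rather than 1 bit
    coeffs = [0] * (len(ub) + len(vb) - 1)
    for i, x in enumerate(ub):
        for j, y in enumerate(vb):
            coeffs[i + j] += x * y
    # the carries: evaluate the coefficient polynomial at x = 2
    return sum(c << k for k, c in enumerate(coeffs))
```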
Next time I'll explain better solutions to the third complication.