[Talk took slightly over 20 minutes.] Thanks. When you're waiting for a computer-algebra system like Magma to finish a number-theoretic computation, you might wonder what Magma is spending its time doing. Often the answer is that Magma is doing a huge number of multiplications. To save time Magma uses impressively fast multiplication algorithms that were developed at the end of the 1960s. Recently there have been some extremely surprising speedups in these fundamental algorithms for multiplying polynomials and for multiplying integers. I should say at the outset that these are not my results, but I'm told that Bourbaki established a tradition of giving talks about exciting results from other people, so my main goal today is to tell you what's going on here. In particular, classic fast Fourier transforms, FFTs, have been sped up asymptotically by more than 5 percent. And integer multiplication has been sped up asymptotically by an even _larger_ factor. Now 5 percent might not sound like much, but the previous algorithms were studied and used by an incredible number of people with no improvements for 36 years. Any improvement is an amazing breakthrough. Now I realize that some of you just don't care about the complexity of multiplication and you would rather hear some more talks on, for example, elliptic curves. So to keep you awake here's a fun little exercise involving elliptic curves.

[slide: Advertisement] Pick a field k. Please don't pick a field k of characteristic 2. And then pick an element d of that field. Please don't pick a square. Look at the curve x^2 + y^2 = 1 + dx^2y^2. And now consider the set of affine points of this curve defined over k. The exercise is to _prove_ that this set of points forms a group under the binary operation shown at the bottom of the slide. If you want to add two points (x1, y1) and (x2, y2), the x-coordinate of the answer is (x1 y2 + x2 y1) / (1 + d x1 x2 y1 y2) and the y-coordinate of the answer is (y1 y2 - x1 x2) / (1 - d x1 x2 y1 y2). That's it.

[slide: More on Edwards coordinates] This is actually an elliptic-curve group. If you're interested in elliptic-curve computations, or if you want some hints on the exercise, or if you think that the exercise violates a theorem of Bosma and Lenstra on complete sets of addition laws for elliptic curves, here's a URL which gives you all sorts of additional information, including the two papers listed here: cr.yp.to/newelliptic.html.

[slide: Algebraic complexity] Okay. Now I'll start my talk. First topic is multiplying polynomials in one variable over the real numbers. For example, this diagram is an algorithm by Karatsuba that multiplies two linear polynomials f and g, producing a quadratic polynomial. The algorithm takes the coefficients f0 and f1 of the input polynomial f and the coefficients g0 and g1 of the input polynomial g and feeds them through the directed acyclic graph shown here to produce the coefficients h0, h1, h2 of the product f times g. We measure the complexity of these algebraic algorithms by saying that multiplication costs 1, addition costs 1, and subtraction costs 1 (one input here is negated, so this addition is actually a subtraction), and any real constant that you want costs 0; for example, if you wanted to multiply by 100, that would be just one multiplication, you don't have to pay for the constant 100. Anyway, _this_ algorithm has three multiplications and three additions and one subtraction, for a total complexity of 7.
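Here is a minimal Python sketch, not the slide's diagram itself but one way to realize the same operation count, of Karatsuba's method for multiplying two linear polynomials with 3 multiplications, 3 additions, and 1 subtraction.

```python
# A sketch of Karatsuba multiplication of two linear polynomials, written to
# match the operation count quoted above (3 multiplications, 3 additions,
# 1 subtraction, total complexity 7); the slide's diagram may be wired
# differently.

def karatsuba_linear(f0, f1, g0, g1):
    """Coefficients h0, h1, h2 of (f0 + f1*x) * (g0 + g1*x)."""
    m0 = f0 * g0                   # multiplication 1
    m2 = f1 * g1                   # multiplication 2
    s = (f0 + f1) * (g0 + g1)      # additions 1, 2 and multiplication 3
    h1 = s - (m0 + m2)             # addition 3 and the single subtraction
    return m0, h1, m2

# quick check: (2 + 3x)(5 + 7x) = 10 + 29x + 21x^2
assert karatsuba_linear(2, 3, 5, 7) == (10, 29, 21)
```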
For those of you familiar with algebraic complexity theory I should emphasize that I'm looking at the total complexity, not just the multiplicative complexity. Now this particular algorithm is _not_ very good. It's just an _example_ of what algebraic complexity means. I'll leave it to you as an exercise to figure out the minimum complexity of computing this quadratic.

[slide: FFT news] More generally, let's look at multiplying polynomials of arbitrary degree over the real numbers. Let's say we have two polynomials f and g of degree below n, with coefficients of x^0, x^1, and so on up through x^(n-1). The problem is to compute the 2n-1 coefficients of f times g. After Gauss died he published an algorithm which he did not call the fast Fourier transform. He actually wrote it down in 1805. Fourier's first publication on the topic was in 1807. But nowadays everyone calls this algorithm the FFT. Anyway, you can easily use Gauss's FFT to multiply two polynomials with complexity that is essentially _linear_ in n. Specifically, it's 15 n lg n, where lg is the logarithm base 2. Now the FFT was rediscovered several times after Gauss published it. The last rediscovery was in 1965, and then finally people paid attention. Everybody had reasonably fast computers, everybody was looking at reasonably large n's, and everybody realized how much time was saved by the FFT compared to the naive quadratic-time approach. There were tons of papers in the late 1960s on the FFT, and it turned out that Gauss had missed a trick. Yavne in 1968 published an algorithm which he did not call the split-radix FFT, but nowadays everyone calls it the split-radix FFT. This algorithm reduces Gauss's 15 down to 12. Gauss promptly issued a statement saying how embarrassed he was. After that, 12 was the state of the art for 36 years. People conjectured that 12 was optimal. But recently James Van Buskirk published a new algorithm which he did not call the tangent FFT. I'm trying to popularize that name. This algorithm reduces 12 down to 34/3, which of course is smaller than 12. This is the 5-percent speedup that I mentioned at the beginning. Again, not a huge speedup, but it's really quite amazing that any speedup is possible.

[slide: The FFT trick] I'd like to spend a few minutes saying what's going on inside the 15 and 12 and 34/3. Let's start with Gauss's FFT. If you want to multiply, for example, in the ring of polynomials modulo x^1024 - 1, you can observe that this ring factors into the ring modulo x^512 - 1 and the ring modulo x^512 + 1. This is a very easy isomorphism to evaluate: you simply substitute x^512 = 1 or x^512 = -1, which boils down to some additions and subtractions of coefficients. Inverting the isomorphism is also very easy. Now the reason I switched my base field from the real numbers to the complex numbers is so that I could apply the same trick again. The ring of polynomials modulo x^512 + 1 factors into the rings modulo x^256 - i and x^256 + i. And you keep going in this very easy way until you've reached 1024 separate copies of the complex numbers. Instead of separately factoring x^512 - 1 and x^512 + 1, you might think it's simpler to twist x^512 + 1 into y^512 - 1, by mapping x to zeta y where zeta is a primitive 1024th root of unity. So at the next step you see y^256 - 1 and y^256 + 1 with no i's showing up; on the other hand, in this step you had to multiply by powers of zeta.
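Here is a minimal Python sketch, in notation of my choosing rather than the talk's slides, of the two ingredients just described: the split of the ring modulo x^(2m) - 1 into the rings modulo x^m - 1 and x^m + 1, which uses only additions and subtractions, and the twist of the ring modulo x^m + 1 into the ring modulo y^m - 1, which multiplies coefficient j by zeta^j.

```python
import cmath

def split(c):
    """Image of a polynomial mod x^(2m) - 1 in the two factor rings.

    c is the list of its 2m complex coefficients.  Substituting x^m = 1 gives
    low[j] + high[j]; substituting x^m = -1 gives low[j] - high[j]."""
    m = len(c) // 2
    low, high = c[:m], c[m:]
    mod_minus = [low[j] + high[j] for j in range(m)]   # image mod x^m - 1
    mod_plus  = [low[j] - high[j] for j in range(m)]   # image mod x^m + 1
    return mod_minus, mod_plus

def twist(c):
    """Twist a polynomial mod x^m + 1 into one mod y^m - 1 via x = zeta*y,
    where zeta = exp(i*pi/m) is a primitive 2m-th root of 1 (zeta^m = -1)."""
    m = len(c)
    zeta = cmath.exp(1j * cmath.pi / m)
    return [c[j] * zeta**j for j in range(m)]

# Recursing on split (twisting each "plus 1" modulus back into a "minus 1"
# modulus, either immediately or at the moments discussed below) all the way
# down to size-1 pieces is exactly an FFT of the original 2m coefficients.
```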
[Question from audience about cost of multiplication by i. Answer: R-complexity of multiplying by a+bi is usually 6, but drops if, e.g., b=0.] If you apply this twisting recursively, it turns out to have exactly the same complexity as the original FFT, exactly the same number of multiplications by any particular root of 1. You might guess that there's some simple explanation for this, and you would be right. You might also guess that the explanation is some fundamental limit on the power of twisting, and you would be quite wrong.

[slide: Split-radix FFT] Suppose you wait for a moment on twisting. Go ahead and factor x^512 + 1. That involves multiplications by i, but those multiplications are very easy. And _then_ twist x^256 - i into y^256 - 1. Remember, don't twist until you see the whites of their i's. Okay, bad joke. This algorithm involves _more_ multiplications by i than Gauss's algorithm does, and it has _fewer_ multiplications by complicated roots of 1. It ends up with a 20% smaller complexity. If there's one trick that you should remember from this talk, it's this idea of timing your twists to maximize the alignment of your roots of 1, to maximize the number of easy constants that show up in the FFT. This is what gave us the best FFT complexity from 1968 until 2004. [Question from audience about definition of alignment. Answer: Definition is parametrized and depends on context. For split-radix FFT we define the 4th roots of 1 as being aligned. But you might want to allow larger groups, as we'll see later. Terminology comes from, e.g., alignment of computer words.]

[slide: The tangent FFT] And then the news is Van Buskirk coming along with the tangent FFT, which still uses this idea of aligning roots and then does even better by _scaling_ x^0 and x^1 and so on to expose some redundancies in the split-radix FFT. This slide is meant just to give you the flavor of what's going on; I'm not going to explain how the symmetries here save time, but I _am_ curious whether anyone _recognizes_ this scaling factor, this particular sine-cosine product s_{n,k}.

[slide: Three expositions] If you do want to see how the tangent FFT works, there are now three writeups. The original 2004 publication by Van Buskirk was simply some software posted online for a particular input size, which was quite convincing but not structured in a way that's easy to understand. There's now an exposition by Frigo and Johnson, the authors of the FFTW software, that's the self-proclaimed Fastest Fourier Transform in the West. There's an exposition by Lundy and Van Buskirk. I have another exposition which is more concise because it uses the algebra structure of the algorithms rather than just the module structure. So you might prefer that.

[slide: Bit complexity] All I have left to do is tell you the news about integer multiplication. A moment ago we were splitting polynomial multiplication into additions and multiplications of coefficients. Now we're going to split integer multiplication into bit operations. There's just one bit operation that you need, called a NAND gate, which given two bits a and b, two elements a and b of F_2 if you like, produces 1 - ab. And then my example uses 15 gates to compute the 4-bit product of a couple of 2-bit integers. I'll leave it to you as an exercise to figure out the minimum number of gates.
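Here is a minimal Python sketch, my own construction rather than the circuit on the slide, of a 2-bit multiplier built from NAND gates alone. This straightforward wiring spends 2 NAND gates per AND and 4 per XOR, 20 gates in total, already more than the slide's 15, which gives some feeling for the exercise.

```python
def nand(a, b):
    """The one allowed bit operation: 1 - a*b for bits a, b."""
    return 1 - a * b

def not_(a):    return nand(a, a)              # 1 gate
def and_(a, b): return not_(nand(a, b))        # 2 gates
def xor(a, b):                                 # 4 gates
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def mul2(a1, a0, b1, b0):
    """Bits p3 p2 p1 p0 of the product (2*a1 + a0) * (2*b1 + b0)."""
    p0 = and_(a0, b0)                   # 2 gates
    t1 = and_(a1, b0)                   # 2 gates
    t2 = and_(a0, b1)                   # 2 gates
    p1 = xor(t1, t2)                    # 4 gates
    c  = and_(t1, t2)                   # 2 gates: carry into bit 2
    t3 = and_(a1, b1)                   # 2 gates
    p2 = xor(t3, c)                     # 4 gates
    p3 = and_(t3, c)                    # 2 gates: carry into bit 3
    return p3, p2, p1, p0               # 20 gates in total

# exhaustive check against ordinary integer multiplication
for a in range(4):
    for b in range(4):
        p3, p2, p1, p0 = mul2(a >> 1, a & 1, b >> 1, b & 1)
        assert 8*p3 + 4*p2 + 2*p1 + p0 == a * b
```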
[slide: Integer-multiplication news] Now everyone quotes the Schoenhage-Strassen algorithm from 1971, which multiplies n-bit integers using n lg n lg lg n bit operations times some constant. I don't have to tell you the constant, because the news is actually _more_ than a constant-factor speedup. With apologies to the German speakers in the audience I will say that Martin Fuerer has dramatically reduced the log log factor down to only some constant to the power log star n. That might sound worse than log log until you realize that log star is a very very very slowly growing function. Even if you put it in the exponent, the whole factor still grows very very very slowly.

[slide: Bit complexity n (lg n)^O(1)] I have just two slides left. I'd like to briefly say how this new algorithm works. Let me start with something simpler. If you look at prime numbers p on the same scale as n, then some of them have big powers of 2 dividing p - 1, for example by Linnik's theorem. So you can do an FFT modulo those primes p; there's some appropriate root of 1. And now you can easily convert your integer multiplication into a multiplication of polynomials modulo p. You split your n-bit integers into something like n over lg n coefficients modulo p. Now this simple idea ends up being really slow; the bit complexity is something like n times log n times log log n times log log log n times log log log log n etc., clearly _much_ slower than Schoenhage-Strassen. I feel like an analytic number theorist. The problem is that an FFT of size n over lg n ends up using something like n multiplications in this field, each of which has to be done recursively, which turns out to be the bottleneck in the whole algorithm. Now Schoenhage and Strassen suggested replacing this prime modulus by some Fermat number, not necessarily prime. What's good about this ring is that it allows very easy multiplications by powers of 2, including all the roots of 1 that you need for the FFT. What's bad about this ring is that it's relatively large, about square root of n bits, so you need _many_ levels of recursion in the algorithm, lg lg n levels of recursion, each of which involves n lg n bit operations. So in the obvious algorithm there's a tiny base ring k but the multiplications by roots of 1 are difficult. In the Schoenhage-Strassen algorithm there's a much larger ring k but the multiplications by roots of 1 are easy.

[slide: Fuerer idea] The new algorithm finds a happy compromise between these two extremes. You start with a tiny base ring k and then extend it a little to put in _some_ easy roots of 1. Now you do an FFT over this extension. If you remember to time your twists appropriately, which I told you was the one trick you're supposed to remember from this talk, then you can align most of the multiplications in the FFT to be very easy. The extension is large enough so that these multiplications are no longer a bottleneck. But the extension is still _much_ smaller than the Schoenhage-Strassen ring. The number of levels of recursion drops quite dramatically, and if you count carefully then you end up with the bit complexity that I stated earlier. Let me close with some advice to the graduate students in the audience. If you want a really cool paper in 2008 then obviously you should look for some widely used algorithm that was published in 1972. Thank you for your attention. [Question from audience about finding rings k more easily than fields k. Answer: The algorithm works with rings k, but the only ways we find _small_ k's are by finding fields.]
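To make the "simple idea" described above concrete, here is a minimal Python sketch of integer multiplication via an FFT modulo a prime p with a large power of 2 dividing p - 1. The particular prime 998244353 = 119 * 2^23 + 1 and its primitive root 3 are illustrative choices of mine, not values from the talk, and the recursion, the careful splitting into about n/lg n coefficients, and everything that makes Schoenhage-Strassen or Fuerer fast are deliberately left out.

```python
P = 998244353        # prime with 2^23 dividing P - 1
G = 3                # primitive root modulo P

def ntt(a, invert=False):
    """In-place radix-2 FFT over Z/P (a number-theoretic transform).
    len(a) must be a power of 2 dividing 2^23."""
    n = len(a)
    j = 0
    for i in range(1, n):                      # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w = pow(G, (P - 1) // length, P)       # root of 1 of order `length`
        if invert:
            w = pow(w, P - 2, P)
        for start in range(0, n, length):
            wk = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * wk % P
                a[k] = (u + v) % P
                a[k + length // 2] = (u - v) % P
                wk = wk * w % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        for i in range(n):
            a[i] = a[i] * n_inv % P

def multiply(x, y, word_bits=8):
    """Multiply nonnegative integers by splitting them into base-2^word_bits
    digits, multiplying the digit polynomials with the FFT mod P, and then
    propagating carries.  Coefficients of the product polynomial must stay
    below P, which limits this toy to moderately sized inputs."""
    def digits(z):
        out = []
        while z:
            out.append(z & ((1 << word_bits) - 1))
            z >>= word_bits
        return out or [0]
    fx, fy = digits(x), digits(y)
    n = 1
    while n < len(fx) + len(fy):
        n <<= 1
    fx += [0] * (n - len(fx))
    fy += [0] * (n - len(fy))
    ntt(fx)
    ntt(fy)
    h = [u * v % P for u, v in zip(fx, fy)]
    ntt(h, invert=True)
    result = 0
    for j, c in enumerate(h):          # evaluate the product at 2^word_bits
        result += c << (j * word_bits)
    return result

assert multiply(123456789, 987654321) == 123456789 * 987654321
```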