[This is approximately the talk I gave on Friday 14 May 2004.] [It took 25 minutes.] [title slide on computer left, computer right] Thank you. [cfrac slide on overhead] This is a talk on cryptography, specifically breaking cryptography, specifically factoring large integers. Let me begin by giving you an example of how modern factorization algorithms work. I used a particular algorithm, the continued-fraction method, to factor a small number n printed out by my computer's obviously not very competent random number generator. First step was to expand square root of n into a continued fraction. I subtracted off the integer part of square root of n, took the reciprocal, subtracted off the integer part of that, took the reciprocal, et cetera. In other words, I wrote square root of n as an integer plus the reciprocal of an integer plus the reciprocal of, et cetera. Truncating that fraction after one step, two steps, three steps, and so on produced increasingly precise rational approximations to the square root of n, a few of which are shown here. [p^2 small slide on computer left] Next step was to observe that these rational numbers p over q close to the square root of n have a small value of p squared minus n q squared. You can easily prove that this small number has at most about half as many digits as n. So each numerator squared is congruent modulo n to something small. Here I've written down a few of those small numbers. Third step was to find a bunch of these small numbers that have a square product. I'll come back to the question of _how_ I did this. Assuming that you can multiply a bunch of these congruences to get a square on the right side, of course you also have a square on the left side, so you have two congruent squares modulo n. At this point, final step, it is not unreasonable to hope that the difference of the square roots reveals a factor of n. If you take some of the right sides, with square product, and then take the square root of the product, and subtract off the square root of the corresponding product of left sides, and take the greatest common divisor of the result with n, then hopefully you will succeed in factoring n. Here's why this is a reasonable thing to hope. Let's assume that n has, say, two odd prime divisors. Let's also assume that the right sides are all units modulo n. If the p's are _random_ square roots of the right sides, uniform random square roots modulo n, then this gcd has a 50% chance of revealing a factor of n. Of course, the p's are not random numbers, not even close, but this strategy does succeed in factoring almost every composite n that people have tried. The remaining numbers n are easily handled by a slight tweak to the algorithm. [smooth slide on computer right] So for the rest of this talk I'll focus on the third step. I took out a sheet of paper and wrote down the first couple thousand of these congruences modulo n. I claim that, out of the first two thousand numbers on the right side here, some nonempty subset has a square product. And in a moment I'm going to show you one of those subsets. Obviously I did not have time to consider _every_ subset, so what did I do? Well, first of all, take any big prime, let's say 1000003. What happens if exactly one of my two thousand numbers is divisible by 1000003? Well, that number, the number divisible by 1000003, is useless. It cannot participate in a square. Any product involving that number has exactly one factor 1000003 and therefore is not a square. Most numbers are useless for this reason.
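[A minimal sketch, in Python, of the first, second, and final steps just described: the continued-fraction convergents p/q of the square root of n, the small residues p squared minus n q squared, and the closing gcd. The number 1927 below is a toy placeholder, not the n from the slides, and the function names are my own.]

    from math import gcd, isqrt, prod

    def cfrac_convergents(n, steps):
        """Convergents p/q of sqrt(n), with the residues p*p - n*q*q.
        Assumes n is not a perfect square."""
        a0 = isqrt(n)
        m, d, a = 0, 1, a0                  # standard recurrence for the continued fraction of sqrt(n)
        p_prev, p = 1, a0                   # numerators of the convergents
        q_prev, q = 0, 1                    # denominators of the convergents
        out = [(p, q, p * p - n * q * q)]
        for _ in range(steps):
            m = d * a - m
            d = (n - m * m) // d            # exact division: d always divides n - m*m
            a = (a0 + m) // d               # next partial quotient
            p_prev, p = p, a * p + p_prev
            q_prev, q = q, a * q + q_prev
            out.append((p, q, p * p - n * q * q))
        return out

    def combine(n, ps, residues):
        """Final step: given numerators ps whose residues multiply to a
        (positive) perfect square, hope that this gcd is a proper factor of n."""
        return gcd(abs(prod(ps) - isqrt(prod(residues))), n)

    # Toy run: each printed residue is small compared to n.
    for p, q, r in cfrac_convergents(1927, 5):
        print(p, q, r)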
If a number has a big prime divisor then it would be surprising for the same prime to show up in any of the other numbers. So chances are that the only useful numbers are smooth numbers, numbers that are products of powers of small primes. For example, the fifth number, 145120806, is 2 times 3 squared times 17 times 647 times 733. So maybe that could be part of a square. I went through the two thousand numbers and figured out which of the numbers were smooth. I will come back to the question of how. I factored those smooth numbers as -1 to some power times 2 to some power times 3 to some power times 5 to some power etc. And then I took the exponent vectors and did some linear algebra modulo 2, a nullspace computation, to find a nontrivial linear dependency among the vectors, a subset of the vectors adding up to 0 modulo 2. Once I had enough vectors I was guaranteed to find a subset. [990371647 slide on overhead] Here's the first subset I found. The numbers listed here are some of the smooth numbers among the first couple thousand small numbers on the right side from before. Obviously the product of these numbers is a square. [point to audience member] You count the number of minuses. [point to audience member] You count the number of 2's. [point to audience member] 3's. [point to audience member] 89's. [point to audience member] Pick your own prime randomly. The product is a square. Anyway, I took the square root of the product of these numbers, subtracted off the product of the corresponding p's, which I did not copy onto my slides, took the gcd with n, and found a factor of n. Yay. [``14,'' Oliver Schirokauer piped in at this point, apparently having counted the 3's as instructed. There are 28 3's.] [given-set-P slide on computer left] Now, how long does it take to find and factor the smooth numbers? Well, a few years ago, I set a speed record for this. Let's assume that you have a reasonably large batch of numbers to check for smoothness. Reasonably large means that the sequence S of numbers has at least as many bits as the set of primes that you're interested in. The set of small primes. 2, 3, 89, etc. Then the time for my algorithm, per input bit, is essentially cubic in the log of the total number of bits. In other words, essentially cubic in the log of the maximum prime you're interested in. Previous algorithms, which I'm not going to describe, were asymptotically a lot slower. Instead of 3 they had something converging to infinity. The best exponent was something like square root of log over log log from the elliptic-curve method. At this point you should be thinking that my result makes no sense. The elements of S are allowed to be completely independent. Why does it save time to simultaneously find the small factors of a batch of independent numbers? Well, the answer lies in the first step of this algorithm. If you want to factor all the numbers in S, the first step is to multiply them all together! [batch-factorization slide on computer right] At this point you should be even more confused. You should be saying WHY? In response to which I say, yes, the product of the numbers you're trying to factor is called y. Next compute y modulo each of the primes, and throw away the primes that do not divide y. In other words, throw away the primes that have no relevance to the numbers x that we are trying to factor. 
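[A minimal sketch of these first two steps, written as plain loops rather than with the fast tree arithmetic that comes next; the batch S and the prime set P below are placeholders, not the numbers from the slides.]

    from math import prod

    def relevant_primes(S, P):
        """Multiply the whole batch S together, then keep only the primes that
        divide the product, i.e. the primes that can divide some element of S."""
        y = prod(S)                              # multiply them all together
        return [p for p in P if y % p == 0]      # throw away the irrelevant primes

    # Toy usage:
    S = [145120806, 35, 97]
    P = [2, 3, 5, 7, 17, 89, 647, 733]
    print(relevant_primes(S, P))                 # 89 is irrelevant; the rest survive

[Written this way, the big product and the reductions are the slow part; the point of what follows is that both can be organized as balanced trees.]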
And then, last step of the algorithm, split the sequence S into two halves, recursively factor the first half using the smaller set of primes and recursively factor the second half using the smaller set of primes. I skipped step 4 which is the bottom of the recursion, if there's only one element in S then we're basically done, we've just figured out the primes dividing that element. I'm glossing over the difference between prime divisors and prime factors, we need to know the exponents in the factorizations. Usually the exponents are small so they are easy to calculate within my time bound. Handling arbitrary exponents requires a little more work, but it's doable. So that's the algorithm to find all the small factors of a batch of numbers. The reason that it's so fast is that these multiplications and divisions, combining a bunch of numbers, can be done very quickly. It's a theorem of Toom from 1966 that big numbers can be multiplied in essentially linear time. It's a theorem of several people from 1971 that the log exponent for multiplication can be reduced to 1. Then if you want to multiply a bunch of small numbers, you can multiply pairs of numbers, then multiply pairs of products, and so on, forming a balanced tree of products. The tree has a logarithmic number of levels, so Step 1 has a log exponent of 2. Similar comments apply to the divisions in Step 2. The entire algorithm has a logarithmic number of levels of recursion, and it's easy to see that each level has about the same amount of data because we keep throwing away the irrelevant primes, so the entire algorithm has a log exponent of 3. So that was from a few years ago. The reason I'm talking about this subject again today is that there was a really exciting development just a few months ago. [not-the-best-way slide on overhead] There's a new paper by Franke, Kleinjung, Morain, and Wirth on the latest advances in using elliptic curves to prove primality. One of the steps in that elliptic-curve primality-proving method is looking at a bunch of numbers and identifying the smooth numbers, actually the _nearly_ smooth numbers, smooth with one big prime factor. Now in _that_ context they presented a variant of the algorithm that typically has log exponent only 2 instead of 3. Typically saves an entire log factor, an order of magnitude in the run time of this algorithm. Unfortunately, they did not actually _say_ that there was this improvement. They said that their algorithm was just a simplification of mine, which it's not. It's a different algorithm, and it's usually much faster. The downside is that it does not give the _factorizations_ of these smooth numbers, it only _recognizes_ the smooth numbers, but then you can use the _original_ algorithm to factor those smooth numbers. Typically only a small fraction of the numbers are smooth, so this takes negligible time compared to recognizing the smooth numbers in the first place. I have a diagram here [diagram slide on computer] to emphasize the difference between various things that can be computed. We're starting with a positive integer x, actually a whole big batch of x's but let me just focus on one of them. I originally thought of the goal as finding the small factors of x, which my algorithm does with this exponent 3. Of course the actual goal is to find the small factors of x _if_ x is smooth. Otherwise we throw the factors away. Now Franke, Kleinjung, Morain, and Wirth get to the bottom by a different path. 
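[Before following their path, a sketch of the product tree, the remainder tree, and the recursion just described; function names are my own, and exponents are again glossed over.]

    def product_tree(xs):
        """Balanced tree of products: level 0 is xs itself, the root is the product of all of xs."""
        tree = [list(xs)]
        while len(tree[-1]) > 1:
            level = tree[-1]
            tree.append([level[i] * level[i + 1] if i + 1 < len(level) else level[i]
                         for i in range(0, len(level), 2)])
        return tree

    def remainder_tree(y, xs):
        """y modulo every element of xs, computed by pushing y down a product tree of xs."""
        if not xs:
            return []
        tree = product_tree(xs)
        rems = [y % tree[-1][0]]                      # y mod the root
        for level in reversed(tree[:-1]):
            rems = [rems[i // 2] % m for i, m in enumerate(level)]
        return rems                                   # rems[i] == y % xs[i]

    def batch_prime_divisors(S, P):
        """For each element of the nonempty sequence S, the primes in P dividing it
        (prime divisors only, not the exponents)."""
        y = product_tree(S)[-1][0]                    # product of the whole batch
        Q = [p for p, r in zip(P, remainder_tree(y, P)) if r == 0]   # surviving primes
        if len(S) == 1:
            return [Q]                                # bottom of the recursion
        mid = len(S) // 2                             # split S and recurse with the smaller prime set
        return batch_prime_divisors(S[:mid], Q) + batch_prime_divisors(S[mid:], Q)

    # Toy usage:
    print(batch_prime_divisors([145120806, 35, 97], [2, 3, 5, 7, 17, 89, 647, 733]))
    # [[2, 3, 17, 647, 733], [5, 7], []]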
They first figure out whether x is smooth, and then _if_ it is smooth they find the small factors. Their algorithm typically has exponent only 2, and I'll tell you a modification that always has exponent 2, to discover whether x is smooth. Then getting to the goal is another invocation of my algorithm, but only on the smooth x's, which in the factorization context is a much smaller volume of data. [usually-better slide on overhead] So let me show you how their algorithm works and also my little improvement of it. Instead of taking the product of x's and dividing by each of the p's, they take the product of p's and divide by each of the x's. Then they repeatedly divide each x by its gcd with this reduced product of the p's, until the gcd is 1. The result is exactly the non-smooth part of x. It's 1 if and only if x is smooth. In theory Step 3 could be the bottleneck, because x could have a huge power of a prime. It's easy to eliminate that bottleneck by doing something a little different in Step 3. Compute a power of z modulo x where the exponent is at least the number of bits in x. Then obviously that's 0 if and only if x is smooth. I should comment that the non-batch special case of this algorithm, where you have just a single x to handle, is a perfectly standard algorithm, particularly in the function-field case. I certainly don't have to tell a bunch of coding theorists what the product of all the small irreducibles is. The novelty is combining this with product trees and remainder trees to get a fast batch algorithm. Anyway, in practice _my_ new part of this makes things a tiny bit faster but the _big_ improvement is from Franke, Kleinjung, Morain, and Wirth. I'm really giving this talk to advertise _their_ idea. I should also mention that there are many other ways to speed up this algorithm. In particular, there is a cute little idea of Kramer that saves for example a factor of 1.5 in building product trees and a similar factor in remainder trees, speeding up the whole algorithm. [newer-algorithms slide on computer left] Okay. I have three more slides here. Two of them address how these batch algorithms interact with sieving. You're already supposed to know what sieving is. For example, in the sieve of Eratosthenes, you write down a bunch of consecutive integers, you mark every second integer as being divisible by 2, you mark every third integer as being divisible by 3, you mark every 89th integer as being divisible by 89, et cetera. More generally, if you have a low-degree polynomial evaluated at consecutive integers, then you can easily locate the values divisible by any particular small prime. In 1977, Schroeppel introduced a factorization algorithm called the linear sieve in which the numbers have this property. They are sieveable numbers. They are a low-degree polynomial evaluated at consecutive integers. We now have a whole bunch of algorithms along these lines. Now sieving is extremely fast according to people who have never actually carried it out. You can find all sorts of textbooks saying that in order to check the smoothness of sieveable numbers, you simply sieve them to discover their small prime factors. For reasons I will get to on the next slide, that is not the right thing to do. It is not what people actually do. What people actually do is something more complicated. People sieve only up to let's say the theta power of the largest prime. Typically, theta is between 0.4 and 0.8. 
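[A sketch of the recognition step just described, for a single x, writing z for the product of the primes reduced mod x; in the batch setting those reductions of the product modulo each x are themselves done with a remainder tree. The second function is the power variant mentioned above. The numbers below are toys.]

    from math import gcd, prod

    def nonsmooth_part(x, primes):
        """Divide x repeatedly by its gcd with z, the product of the primes
        reduced mod x, until the gcd is 1; what is left is the non-smooth
        part of x, which is 1 exactly when x is smooth."""
        z = prod(primes) % x
        g = gcd(x, z)
        while g > 1:
            x //= g
            g = gcd(x, z)
        return x

    def is_smooth(x, primes):
        """Variant: z to a power at least the bit length of x, reduced mod x,
        is 0 exactly when x is smooth."""
        z = prod(primes) % x
        return pow(z, x.bit_length(), x) == 0

    # Toy checks:
    primes = [2, 3, 5, 7]
    print(nonsmooth_part(2**10 * 3 * 11, primes))    # 11, so not smooth
    print(is_smooth(2**10 * 3 * 5, primes))          # True

[In the factorization context described next, a test like this is applied only to the x's that survive the partial sieving.]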
For example, the 512-bit RSA factorization allowed primes up to 2 to the 30, but only sieved up to 2 to the 24. So theta was 0.8. Actually it should have been even smaller. [Something I realized during the questions after the talk: I should have written something on the slides about the 512-bit RSA factorization history.] Okay. So you have sieveable numbers x. You use sieving to discover the prime factors of each x up to say 2 to the 24. You then look at the remaining piece of x, the part of x composed of primes above 2 to the 24, and you throw x away if that remaining piece is too big. Maybe it's smooth, but we won't spend time on it. The x's that survive this test, the ones that have enough factors below 2 to the 24, are then fed to some other smoothness tester, such as the algorithms I've been talking about today. The cutoff in the middle, the allowable size of the remaining piece of x, is of course chosen so that the time spent in this last step is about the same as the time spent in sieving. Now there's an analysis by Pomerance from twenty years ago that tells you roughly how many numbers and how many smooth numbers survive this type of early abort, and that therefore tells you how this cutoff should be chosen and how long this takes to find enough smooth numbers. The bottom line has a remarkably simple approximation. It's R times S to the theta power times T to the 1 minus theta power. R is the ratio between how many smooth numbers you need and the probability of any particular number being smooth, so R is how many numbers you'd expect to look at if you have an instantaneous smoothness detector that never throws numbers away. R is what the textbooks report as the total run time of the algorithm. S is the time taken for sieving per number, and T is the time taken for the final step per number. T is what has been improved by a log factor this year, as I've told you. [annoyingly-high slide on computer right] Now T would be irrelevant here if theta were 1. The reason that nobody chooses theta equal to 1 is that sieving _randomly_ accesses an array of numbers where, for efficiency, the size of the array is several times _bigger_ than the largest sieving prime. For example, the RSA-155 factorization, as I mentioned, sieved up to 2 to the 24, because that's what reasonably fit into memory. After aborting most of the x's it then looked for primes up to 2 to the 30, and actually should have allowed considerably larger primes; theta should have been even smaller. So theta is noticeably smaller than 1, which means that big improvements in T, such as what's happened this year, make quite a noticeable improvement in the run time of all of these algorithms. Sometimes the improvement is even bigger because there are actually smaller values of theta in competition with the value of theta that fits the sieve into memory. Real computers have several layers in the memory hierarchy. The time S for sieving per number changes dramatically between sieving on disk (theta equals 1), sieving in dynamic RAM (main memory), sieving in level-2 cache (static RAM), and sieving in level-1 cache. So there are three interesting values of theta for the sieve to fit into DRAM or level-2 cache or level-1 cache. Decreasing theta reduces S. But it puts more pressure on T. If T is small enough then that's a good thing. I pointed this out in 2000 along with my original batch smoothness tester, and it becomes even more important with every improvement in T. Final slide is simply references.
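[Before those references, a toy sketch of the early-abort filter and the cost approximation just quoted; trial division stands in for the sieve here, and all bounds and names are illustrative.]

    def early_abort(xs, sieving_primes, cofactor_bound):
        """Divide out of each x the prime factors that sieving would have found,
        and keep x only if the remaining piece is small enough; the survivors
        and their remaining pieces go on to some other smoothness tester."""
        survivors = []
        for x in xs:
            remaining = x
            for p in sieving_primes:
                while remaining % p == 0:
                    remaining //= p
            if remaining <= cofactor_bound:
                survivors.append((x, remaining))
        return survivors

    def early_abort_cost(R, S, T, theta):
        """The approximation quoted above: roughly R * S^theta * T^(1 - theta)."""
        return R * S**theta * T**(1 - theta)

    # Toy usage: early_abort([2**5 * 101, 7 * 10007], [2, 3, 5, 7], 1000) keeps only the first x.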
[refs slide on overhead] [should have gone back to diagram slide on computer] Normally I don't do this but there are enough different pieces of the story here that I'd like to briefly point out where you can read more. First paper presents an algorithm for a grand unified problem that unfortunately I do not have time to talk about. The time for this algorithm is the number of input bits times a log power. This paper has all the fundamental techniques although it makes no attempt to optimize the log power. The second paper has log exponent 3 and recognizes the importance of this problem for factorization. The third paper explains the underlying techniques: fast multiplication, division, gcd, product trees, et cetera. The fourth paper, from just a few days ago, is on log exponent 2. Finally, the fifth paper is not ready yet but it will explain that formula R S to the theta T to the 1 minus theta and it will give a bunch of concrete examples. Thank you for your attention.