[Talk notes, D. J. Bernstein, 15 June 2004.]
[This is approximately the talk I gave on Monday 14 June 2004.]
[It took 70 minutes. Oops.]
[Visual: title slide on computer right]
Thank you.
[Visual: sieving-c slide on computer right, shifting]
This is a talk about
modern algorithms for factoring difficult integers.
Let me begin by showing you an example of the sieve of Eratosthenes.
The column on the left
shows the integers from 1 through 20.
I marked every second integer as being divisible by 2,
every fourth integer as being divisible by another 2,
every third integer as being divisible by 3,
et cetera.
Of course,
you don't have to start at 1.
On the right side here
I have sieved 20 integers starting at 612.
Again every second integer is divisible by 2,
although the starting point for _this_ arithmetic progression has changed
since 611 is odd.
Well, it's the same arithmetic progression,
but it starts at a different line on the slide.
Anyway, I found many other small prime factors of these input numbers.
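The two sieves above can be sketched in a few lines of Python. This is illustration only, not memory-tuned sieving code; the function name and the repeated-division shortcut (in place of one arithmetic progression per prime power) are mine:

```python
def sieve_interval(start, length, primes):
    """Mark multiples of each prime in [start, start+length), dividing
    them out so that rem[i] == 1 means entry i factored completely."""
    rem = list(range(start, start + length))
    factors = [[] for _ in range(length)]
    for p in primes:
        for i in range((-start) % p, length, p):
            while rem[i] % p == 0:       # a real sieve walks p, p^2, ...
                rem[i] //= p             # as separate progressions instead
                factors[i].append(p)
    return factors, rem

factors, rem = sieve_interval(612, 20, [2, 3, 5, 7])
print(rem[13])    # -> 1: 625 = 5*5*5*5 is completely factored
```

Running it again with `start=1` reproduces the left column, where 14 = 2 times 7 comes out completely factored the same way.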
I should point out that you can save lots of memory here
by eliminating white space,
a principle that I have also attempted to apply to my name tag.
I wrote some TeX macros that, given a name,
automatically choose the largest possible font for the name tag.
[Audience member: ``Are you going blind?'' Me: ``Yes, I am.'']
For those of you who are not going blind,
you can use the official ANTS conference macros,
which, given a name,
automatically choose the _smallest_ possible font for the name tag.
[Visual: factoring-611 slide on computer right, shifting]
Anyway, starting from these two Eratosthenes sieves,
the second one offset by 611,
we can actually factor the number 611.
At the bottom of this slide I have discovered the factor 47 of 611
by computing the greatest common divisor
of 611 with the difference of two numbers.
Let me trace where these numbers came from.
First of all, notice
that many of these inputs
to the sieve of Eratosthenes
have been completely factored,
and sometimes we have two complete factorizations on the same line
in both the left sieve and the right sieve.
For example 14 equals 2 times 7
and 625 equals 5 times 5 times 5 times 5.
We know the complete prime factorization
of the product 14 times 625.
I have noticed
that here at ANTS
people sometimes point out the impressive computations that they have done,
after which the audience says
ooh aah.
[Audience member: ``Ooh aah.'' Me: ``Hold on a moment.'']
Now it is not shown on this slide,
but I am pleased to report
that I continued each of these sieves for one _hundred_ rows.
[Audience joins in:] Ooh aah.
So I found some other products
of the form c times 611+c,
64 times 675 and 75 times 686,
where the complete prime factorization is revealed by these sieves.
Now multiplying these three lines together produces a square.
This is clear from the prime factorizations:
the sum of the exponent vectors is even.
1 plus 6 plus 1 is even,
0 plus 3 plus 1 is even,
et cetera.
If you change 611 to a larger number
and write down many of these rows involving many primes
then in general you will _not_ have
the product of _all_ the rows magically being a square.
You have to find a nonempty subset of the rows with square product
which is easy to do by linear algebra on the exponent vectors.
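That linear algebra can be sketched as Gaussian elimination over GF(2), with the exponent vectors reduced mod 2 and packed into bits. This routine and its name are mine, purely for illustration:

```python
def square_subset(vectors):
    """Return indices of a nonempty subset of exponent vectors whose
    componentwise sum is even, or None if no dependency exists."""
    basis = {}                              # pivot bit -> (bits, subset mask)
    for i, v in enumerate(vectors):
        bits = sum(1 << j for j, e in enumerate(v) if e & 1)
        mask = 1 << i                       # records which rows were combined
        while bits and (bits.bit_length() - 1) in basis:
            b, m = basis[bits.bit_length() - 1]
            bits, mask = bits ^ b, mask ^ m
        if bits == 0:                       # the combined rows form a square
            return [j for j in range(len(vectors)) if (mask >> j) & 1]
        basis[bits.bit_length() - 1] = (bits, mask)
    return None

# exponent vectors of 14*625, 64*675, 75*686 over the primes 2, 3, 5, 7
print(square_subset([[1, 0, 4, 1], [6, 3, 2, 0], [1, 1, 2, 3]]))
# -> [0, 1, 2]: the product of all three rows is a square
```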
So anyway,
the product of all of _these_ numbers is a square.
The square root is one of the numbers inside the gcd.
The other number inside the gcd
is something that has the same square modulo 611.
14 squared is congruent to 14 times 625 modulo 611,
64 squared is congruent to 64 times 675 and so on.
When you take a gcd of 611
with the difference of two numbers that have the same square modulo 611,
then you should not be surprised,
although you are certainly not guaranteed,
to find a factorization of 611,
as I did in this case,
allowing me to steal money from my bank.
This algorithm
for integer factorization
is the number-field sieve when the number field is Q.
So it is the Q sieve.
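The whole 611 finale fits in a few lines of Python, just to check the arithmetic:

```python
from math import gcd, isqrt

left = 14 * 64 * 75                             # product of the c's
right = (14 * 625) * (64 * 675) * (75 * 686)    # product of the c*(611+c)'s
root = isqrt(right)
assert root * root == right     # even exponent vectors, so right is a square
# left^2 is congruent to right mod 611, since 625 = 611 + 14, and so on
print(gcd(611, root - left))    # -> 47, and indeed 611 = 13 * 47
```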
Okay.
What I would like to do in this talk
is point out certain assumptions that people make
about _how_ you should find large factors of integers,
certain structural rules that people follow in designing these algorithms.
Sometimes people say that you _should_ follow these rules;
sometimes people say that you _must_ follow these rules.
These are myths.
I am going to convince you that all of the rules _can_ be violated
and that _most_ of the rules _should_ be violated.
[Visual: myth-1 slide on computer right, shifting]
So without further ado,
here's a warm-up myth,
myth #1,
which says
that the goal of sieving is to find all relations among the given inputs.
The word relation means, by definition,
a smooth product c times 611 plus c.
Smooth means a product of small primes.
The myth again is that our goal is to identify which inputs are smooth,
14 625 et cetera,
so that we find all of the relations.
A slightly wimpier form of this myth
is that we want to find _almost_ all of the relations
or at least _most_ of the relations.
For example,
here's a quote from a 1994 paper on the number-field sieve
saying that we try to not lose any relations.
That is not even close to the truth.
In fact, the notion of finding _every_ relation is patently absurd,
because we've already thrown relations away
by focusing on the smallest inputs.
For example,
by looking at only a hundred values of c,
I certainly threw away some relations.
[Visual: reality slide on computer right, shifting]
Let me review what we're actually trying to do.
We need to find a nontrivial square product of inputs.
Now to recognize square products
we need to know the complete prime factorizations of the inputs.
[Coprime factorizations are sufficient, of course,
but I didn't want to mention factoring into coprimes until later in the talk.]
Furthermore,
if an input has a _large_ prime factor,
then that input is probably not useful in forming a square,
because that prime probably does not appear anywhere else.
So our goal is to find many completely factored inputs,
preferably inputs whose largest prime factor is small.
Some of those useful inputs are easier to identify than others.
I will be more precise about this later.
For the moment let me just say that
the difficulty of identifying a useful input
depends much more on the size of its _smallest_ prime factors
than on the size of its largest prime factor.
To optimize the algorithm you have to pay attention to all of these criteria.
Something that is going to become increasingly clear over the next few slides
is that you end up finding only a tiny fraction of the relations.
So what you should remember from this slide
is that _most_ of the inputs will be thrown away
because that makes it _easier_
to find enough useful inputs.
[Visual: myth-2 slide on computer right, shifting]
Myth number 2
says that the way we identify these fully factored inputs
is by sieving,
marking numbers in arithmetic progressions,
as I showed you in my 611 example.
Sieving
is an amazingly fast algorithm
according to people who have never actually done it.
Now in the old days sieving was not possible.
For example, CFRAC,
the continued-fraction algorithm for integer factorization,
does not produce sieveable inputs.
So you have to use other methods of finding small factors of these inputs.
You can use trial division, Pollard's rho method,
more recently Lenstra's elliptic curve method, and so on.
The myth says that all of those auxiliary algorithms
became obsolete in this context
as soon as Rich Schroeppel introduced
the linear sieve.
Inputs are sieveable
in the linear sieve,
the quadratic sieve,
the number-field sieve.
I have a quote here from one author
saying that sieving is much faster than the elliptic-curve method.
Once again, the myth says
that when numbers are sieveable we simply sieve them.
End of story.
[Visual: reality slide on computer right, shifting]
Now the reality is that sieving is merely a preprocessing step
for another algorithm.
Sieving by itself is _slow_.
The reason is that random access to memory is slow.
Your computer has what's called a storage hierarchy,
it has under a thousand bytes of very fast registers,
then perhaps 64 kilobytes of level-1 cache
that can be accessed in one or two clock cycles,
then perhaps 512 kilobytes of level-2 cache
that is typically an order of magnitude slower,
then quite a lot of dynamic RAM
that is another order of magnitude slower,
and then there's disk.
Now if you do a tiny little sieve
that fits into level-1 cache
then you might think that sieving is fast.
But sieving in level-2 cache is considerably slower.
Sieving in dynamic RAM is really slow.
And sieving on disk,
well,
sieving on disk is actually not as bad as it sounds
if you switch from a simple distribution sort
to a more complicated radix sort,
but it is certainly not an amazingly fast operation.
Because memory access time is variable,
the best strategy turns out to be doing _some_ sieving,
let's say sieving primes up to the theta power
of the largest prime you care about.
Then once you've removed those primes from the inputs,
you keep only the smallest inputs,
you abort the other inputs.
Then with that small fraction of the inputs,
you try some test other than sieving.
So this is a combination of two tests for small factors.
The first test is sieving.
The second test is some non-sieving test.
[Visual: 1994-golliver slide on computer right, shifting]
For example,
one paper uses sieving to discover prime factors up to 2 to the 21,
then they divide those primes out of the inputs
and keep only the remaining pieces below 2 to the 60.
Then they use SQUFOF,
which is a special case of CFRAC,
and the elliptic-curve method
to discover prime factors up to 2 to the 30.
Now this is the same paper
where three paragraphs earlier
the authors said they were trying to keep _every_ relation.
The reason they can say this
is that they redefine the word relation.
They say
that a relation involving a prime above 2 to the 21
is not actually a relation.
It is a _partial_ relation.
They focus on 2 to the 21 as the important number
rather than 2 to the 30.
_I_ think
that primes go up to 2 to the 30,
maybe even beyond,
and these authors are aborting most inputs early at 2 to the 21.
_They_ say
that the normal state of affairs is to stop at 2 to the 21.
They're not stopping early;
they're simply failing to work late.
Whatever.
They are, in fact, aborting most inputs,
as they should.
[Visual: myth-3 slide on computer right, shifting]
Myth number 3
is that the second test is not a bottleneck.
Remember the overall picture at this point
is that we sieve _some_ primes,
and then we abort _most_ of the inputs,
and then we use a second test to find small factors
of the inputs that survive the abort.
The myth is that sieving is the only bottleneck here.
For example,
here are some authors reporting that sieving took almost all of their time.
Well,
I look at this and say that the parameters were obviously not optimized.
[Visual: reality slide on computer right, shifting]
If the second test is not a bottleneck
then you have time to feed a lot more inputs to the second test.
You should allow larger inputs to survive the abort.
This _will_ find more relations from the same initial set of inputs.
So you could have started with fewer inputs.
Obviously the abort parameters should be chosen
so that the second test is balanced with sieving.
I've written on this slide a useful formula to remember,
this is very much like the
u to the minus u formula for smoothness probabilities,
it's not terribly accurate,
it's occasionally misleading,
but it's in the right ballpark.
The time for finding enough smooth inputs
is roughly R times S to the theta times T to the 1 minus theta.
Here R is what textbooks report for the run time of the algorithm,
it is the ratio between the number of smooth inputs that you need to find
and the probability of one input being smooth.
S is the time per input to sieve.
T is the time per input to apply the second test,
for example the elliptic-curve method.
Theta you remember from before,
you sieve up to the theta power of the largest prime.
For example if theta is 0.7
then this is S to the 0.7 times T to the 0.3.
You might be wondering where this approximation comes from.
[Visual: why-s-t slide on computer right, shifting]
The quick answer is that it comes from a paper by Pomerance.
Pomerance has heuristics from twenty years ago,
actually theorems in certain parameter ranges,
saying asymptotically how many inputs survive an abort
and how many smooth inputs survive an abort.
I'm not going to give you either formula,
I'm just going to compare the formulas,
if you keep the smallest 1 out of every A inputs
then you end up keeping 1 out of every A to the 1 minus theta
smooth inputs.
You can think of A as the parameter if you like.
Now if sieving and the subsequent test are balanced
then you take time S for every input
and time T for one out of every A inputs
so A should be about T over S
and that's all you need to know.
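Here is that balancing arithmetic as a tiny Python sketch. The values of R, S, T, and theta are invented for illustration; the point is only that the balanced-abort cost lands within a constant of the R times S to the theta times T to the 1 minus theta formula:

```python
S, T, theta = 1.0, 50.0, 0.7  # invented: sieve cost and second-test cost per input
R = 1e6                       # textbook count: inputs a full test would scan
A = T / S                     # balanced abort: keep the smallest 1 in A inputs
inputs_needed = R * A ** (1 - theta)   # smooth yield drops by A**(1 - theta)
cost = inputs_needed * (S + T / A)     # sieve everyone, test only the survivors
approx = R * S ** theta * T ** (1 - theta)
print(cost / approx)          # -> 2.0: the formula is right up to a constant
```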
[Visual: better-analysis slide on computer right, shifting]
This is a rather crude analysis for many reasons.
One of my hobbies for many years has been
analyzing number-field-sieve run time much more precisely.
For example,
you can compute extremely tight bounds on smoothness probabilities,
you can measure exactly how big S and T are,
you can analyze the distribution of inputs,
there's all sorts of fun you can have here.
But even from the crude analysis it is clear
that the second test has a quite serious impact
on the overall run time of these algorithms.
[Visual: myth-4 slide on computer right, shifting]
So which second test should we use?
Myth number 4
is that the answer is the elliptic-curve method.
For example,
one recent factorization
aborted numbers above 2 to the 90
and then used ECM to find small factors of the remaining numbers.
A more general form of this myth
is that once we're done sieving
we have to factor each input independently.
However,
there are faster algorithms
that factor a bunch of inputs at the same time.
[Visual: given-set-q slide on computer right, shifting]
I pointed this out a few years ago,
and there's some really exciting news just a few months ago.
My algorithm from 2000
takes a bunch of primes and a bunch of inputs
and factors all the inputs simultaneously using those primes.
The time per input bit
is log cubed.
For comparison, the elliptic-curve method,
instead of exponent 3,
has an exponent that grows something like square root of 2 log y over log log y,
and y doesn't have to be very big before
the elliptic-curve method becomes slower than this batch test.
The exciting recent news
is from Franke, Kleinjung, Morain, and Wirth,
who pointed out that you can save an entire log factor here
if you simply want to know _which_ inputs are smooth.
And then you can apply my original algorithm
to quickly factor those smooth inputs;
normally there aren't very many of them.
Actually, the Franke Kleinjung Morain Wirth algorithm
behaves badly for some extreme inputs,
but that's easy to fix.
Anyway, what's important is that for typical inputs
they reduced the exponent from 3 to 2.
Now I am more than a little bit disappointed
in how they presented this algorithm.
They said that it was a _simplification_ of my algorithm.
It is _not_ a simplification, it is a new algorithm.
It _is_ simpler, certainly;
it produces a subset of the output
using a subset of my techniques.
But it also uses _another_ idea
that's important for saving this log factor.
Another problem with their presentation
is that they don't even say that they're saving time.
This is an entire order of magnitude speedup
and they don't point that out.
The worst part
is that they buried this algorithm
inside a paper on elliptic-curve primality proving.
Now proving primality is lots of fun
and I am very much looking forward to the talk by Morain.
[Morain comments at this point: ``So am I.'']
But this speedup has many applications other than primality proving.
In particular,
for factoring integers on conventional computers,
this is the biggest news in years.
It belongs in a separate paper.
So they did not advertise this algorithm very well.
Fortunately,
I get to stand up here and advertise it for them.
This is a wonderful little algorithm.
The only reason that I am not going to tell you how these three algorithms work
is that in a few minutes I am going to show you a newer algorithm
that is even more fun.
So you will get the idea of what these algorithms do.
But first let me state
[Visual: myth-5 slide on computer right, shifting]
myth number 5.
Even more fundamental than the question of whether there are early aborts
is the question of whether there is a factor base.
The myth
is that you have to create a factor base.
This myth is built into how everybody talks about factoring algorithms,
until my next slide.
In these algorithms you choose a set of primes,
a factor base,
for example all the primes up to 2 to the 30.
You then look for smooth inputs,
inputs that factor completely using those primes.
And then you do a cull.
You weed out
the non-repeated primes.
A lot of inputs will have primes that do not appear in any other inputs.
So throw those inputs away.
Repeat that process until you have a core of inputs
where all the prime divisors of each input
are divisors of other core inputs.
Now we really want to allow primes up to 2 to the 40,
2 to the 50,
_any_ size,
_if_ they are repeated.
But the myth says
that we have to choose some cutoff,
and for efficiency the cutoff cannot be very large.
[Visual: reality slide on computer right, shifting]
In fact,
this structure is completely unnecessary.
We _can_
allow primes of any size.
We do _not_ have to specify the primes in advance.
We can quickly find all the inputs that are smooth
with respect to the prime divisors of other inputs.
Obviously this algorithm
has to consider all of the inputs at once.
If you have say y squared inputs,
you are not allowed to break them into y chunks with y inputs per chunk.
There's a serious communication cost
in the algorithm I'm going to show you,
just like the communication cost
that we are familiar with
from the linear-algebra stage of these algorithms.
In theory
the slowdown here is just a small constant
compared to the factor-base algorithms.
Maybe the benefit is larger than that constant.
I simply don't know.
This algorithm is very new.
I have not had a chance to do any experiments.
But I do want to convince you
that this structure
of choosing a factor base,
_then_ looking for smooth numbers,
_then_ weeding out the non-repeated primes,
then doing linear algebra,
this structure is not necessary.
[Visual: what's-the-algorithm slide on computer right, shifting]
So here's how this algorithm works.
It also gives you a flavor of how the previous batch algorithms work.
You start with a bunch of input numbers x1 x2 et cetera.
These could be 1 times 612, 2 times 613, et cetera.
Actually they should be the numbers that survived an abort after sieving.
Since the goal is to factor these numbers
we simplify the problem by multiplying all the numbers together,
forming a huge number y.
We then divide y by x1 and reduce modulo x1,
divide y by x2 and reduce modulo x2,
and so on.
I will come back to the question of how fast this is.
Finishing off the algorithm.
If a power of y over xj is zero modulo xj then print xj.
You can choose the exponent so that, for example,
2 to that power is bigger than xj.
So this prints xj exactly when
every prime divisor of xj
is a divisor of y over xj
in other words a prime divisor of the other inputs.
This throws away exactly the inputs
that have non-repeated prime divisors.
That's exactly what we want.
Feeding the outputs through the same algorithm again a few times
produces the core set of inputs,
which we can then factor into coprimes in essentially linear time.
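Here is a direct transcription into Python, with the product tree and remainder tree exactly as described; the names are mine, and this ignores all the constant-factor tricks:

```python
from math import prod

def batch_filter(xs):
    """Keep exactly the inputs all of whose prime divisors also divide
    some other input, without ever choosing a factor base."""
    layers = [list(xs)]                     # product tree, leaves first
    while len(layers[-1]) > 1:
        cur = layers[-1]
        layers.append([prod(cur[i:i + 2]) for i in range(0, len(cur), 2)])
    y = layers[-1][0]                       # y = product of all inputs
    rem = [y]                               # remainder tree: y mod node^2
    for layer in reversed(layers[:-1]):
        rem = [rem[i // 2] % (v * v) for i, v in enumerate(layer)]
    out = []
    for x, r in zip(xs, rem):
        s = (r // x) % x                    # (y/x) mod x, since x divides y
        if pow(s, x.bit_length(), x) == 0:  # 2**bit_length(x) exceeds x
            out.append(x)
    return out

print(batch_filter([14, 625, 64, 675, 75, 686, 22]))
# -> [14, 625, 64, 675, 75, 686]
# 22 = 2 * 11 is dropped: the prime 11 appears in no other input
```

As the slide says, feeding the output back through `batch_filter` a few times shrinks it down to the core set.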
[Visual: why-is-this-so-fast slide on computer right, shifting]
Now I claimed earlier that this algorithm is actually quite fast.
Computing the product y takes essentially linear time.
You multiply pairs of x's,
then pairs of pairs, and so on until you have the product of all of them.
The bottleneck here is multiplying large integers,
which you do with the FFT and with more advanced tricks.
Once you have y
you then form a similar tree of remainders
to compute y modulo x1 squared, y modulo x2 squared, and so on,
again in essentially linear time,
and then dividing y modulo x1 squared by x1
produces exactly the y over x1 modulo x1 that I wanted.
These are standard algorithms,
although there's a continuing game of improving these algorithms
by constant factors.
In particular,
I'd like to draw attention to a technique introduced recently by Kramer,
which I call FFT doubling,
that speeds all of this up by a factor of about 1.5.
[Visual: myth-6 slide on computer right, shifting]
Okay.
Enough about these batch algorithms.
Let me move on to myth number 6.
The myth is that the number field sieve
involves two sieves.
For example, you sieve c and you sieve 611+c.
You could instead do a single sieve,
checking which 611+c's are smooth,
and then for those c's check which c's are smooth.
But here are some authors saying that two sieves are much faster.
Another way to say this myth is that Coppersmith's variant
of the number-field sieve is not practical.
[Pomerance commented after the talk that,
at the time, it wasn't clear that any form of the number-field sieve
was practical, except for special numbers having small polynomials.]
[Visual: reality slide on computer right, shifting]
But the reality is that Coppersmith's variant _is_ practical.
Instead of doing two sieves and then one or two second tests
you can do one sieve, one second test,
and then maybe another second test.
You check the smoothness of c
only for the occasional values of c where 611+c is smooth.
Actually the time to check the smoothness of c is negligible
so you can afford to check the smoothness of other functions of c.
Expand your notion of relations
to expand the set of useful inputs.
You can also look for very large prime factors in the numbers c.
The point is that you're spending only negligible time per c
for which 611+c is smooth,
so you can afford to do a lot more work on each of those c's.
This is certainly practical.
It obviously saves time as soon as the smoothness chance for 611+c
drops below S/T,
the ratio between the sieving time and the second-test time
that I mentioned before.
The smoothness chance drops subexponentially
while the ratio S/T drops as a power of the logarithm.
Once you've reached moderately large factorizations,
Coppersmith's variant obviously saves time.
Of course,
to do a more precise analysis,
you have to optimize a lot of parameters,
along the lines I mentioned before.
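In Python, the single-sieve idea looks like this for the 611 toy example. Here `smooth_part` is my own toy trial-division stand-in for both the sieve and the second test; a real implementation would sieve the 611+c side and run a fast batch test on the surviving c's:

```python
def smooth_part(n, primes):
    """Divide out the given primes; returns 1 exactly when n is smooth."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n

primes = [2, 3, 5, 7]
relations = []
for c in range(1, 101):
    if smooth_part(611 + c, primes) == 1:   # one sieve, over 611+c only
        # negligible extra work: reached only for the rare smooth 611+c
        if smooth_part(c, primes) == 1:
            relations.append(c)
print(relations)                            # -> [14, 64, 75]
```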
[Visual: myth-7 slide on computer right, shifting]
Myth number 7
is not a question of computer time.
This is a question of programmer time.
The myth says that the direct square-root method
is a bottleneck.
You remember the first square root that I mentioned,
I wrote 14 times 64 times et cetera
as a product of powers of primes,
each exponent being even,
and then divided each exponent in half.
There are three papers saying how you generalize this to number fields.
Instead of using prime factorizations
you can use the direct method
of simply multiplying the inputs together as a big integer
and computing the square root of that integer.
The myth says that you should not use the direct method,
maybe not that it is a complete disaster for the run time
but certainly that it is a serious part of the run time.
For example,
here are some authors
saying that using prime factorizations
is of great consequence for the overall running time.
[Visual: reality slide on computer right, shifting]
Well,
this is simply not true.
It is _not_ of great consequence for the overall running time.
You can simply use the direct square-root method.
With standard algorithms
that takes time y to the 1 plus epsilon
where y is the prime bound.
For comparison,
linear algebra takes time y to the 2 plus epsilon.
If you actually try this out
you find it's millions of times faster.
[Depending on y, obviously.]
The prime-factorization methods
might be marginally faster
but in this context it simply doesn't matter.
[Visual: myth-8 slide on computer right, shifting]
Myth number 8.
Until now I've been imagining
doing these computations on a desktop computer.
I used to imagine a Pentium,
now I imagine an Athlon, because it's better.
Anyway, it's a conventional computer.
A small processor attached to a large randomly accessible memory.
The processor follows various instructions to modify the memory.
The myth is that we want to minimize time
on a conventional computer.
The myth is that this minimizes real time.
Now when people say this
they realize that it's not true,
they make an exception for parallel computers.
People realize that
running a bunch of Athlons at the same time
can reduce the real time for the computation.
But people say that achieving a speedup factor of p from parallelism
requires that you buy p Athlons.
There's supposed to be some sort of theorem guaranteeing this.
[Visual: reality slide on computer right, shifting]
The reality is that
conventional computers do not minimize price-performance ratio.
The problem with this theorem about parallel speedups
is that the price of p computers
does not have to be p times as large as the price of one computer.
You can often split one computer
into two computers
where each computer has half the size.
More generally split into p computers
where each computer is only 1 over p of the size.
There are communication costs between the computers
but those costs do _not_ grow out of control
compared to the computation you're doing.
Eventually you end up with a mesh architecture,
where for a given computer size
you achieve asymptotically smaller time exponents
than you do on a conventional computer.
[Visual: vlsi-literature slide on computer right, shifting]
There are thousands of articles in the VLSI literature,
that's very-large-scale-integration,
pointing out speedups from mesh architectures
in all sorts of applications.
For example,
here's an example that's maybe even more convincing
than my usual example.
Let's consider multiplying two integers with n bits,
obviously an important problem.
The usual fast-multiplication algorithm
by Schoenhage and Strassen using the fast Fourier transform
takes time n to the 1 plus epsilon
on a conventional computer
with something like 8n bits of memory,
maybe a little more.
[Visual: Knuth slide on computer right, shifting]
Now something I saw in Knuth's book years ago,
but I never really thought through the consequences,
was a non-conventional computer that does the same thing.
In Knuth's words,
we leave the domain of conventional computer programming.
There's a mesh by Atrubin,
a bunch of processors operating in parallel,
with a total of I think 11n bits of memory,
that multiplies two n-bit numbers in time proportional to n
with a small constant.
Now you might not be too surprised by the log factor disappearing,
there are all sorts of machine-model changes
that can eliminate log factors,
but it's more interesting that you achieve this competitive performance
without using the FFT.
Later in the literature
you can find a 2-dimensional mesh of about the same total size
that takes time only n to the one half plus epsilon,
using the FFT.
This is obviously much faster for large n
than anything you can do with a conventional computer.
A couple of log factors cannot compete with a square root of n
once n is moderately large.
[Visual: similar-speedups slide on computer right, shifting]
For integer factorization we get similar exponent reductions.
Let me show you the numbers.
If you look at Coppersmith's 1993 article
you see a typical factorization machine stated in some detail.
That machine takes time L to the 1.901 something power,
where L is e to the product of the cube root of log n
and the two-thirds power of log log n.
There's also a little o of 1 in the exponent,
so 1.901 et cetera is the limiting exponent.
The space used by Coppersmith's machine
is the 0.950 power of L.
That's half of the 1.901.
If you use a 2-dimensional mesh
instead of a conventional computer
then you can reduce the time exponent from 1.901 to 1.426
with the same cost of the computer.
None of these numbers are optimal.
[Visual: new-parameters slide on computer right, shifting]
Let me skip to the middle of the slide.
If you change parameters
then you can reduce the time to the 1.185 power of L
and reduce the machine cost to the 0.790 power of L.
Also, Pomerance pointed out for conventional computers
that you can improve the price-performance ratio a little bit
by reducing the smoothness bound
and reducing the machine size to the 0.748 power,
while the time grows by a smaller amount.
So those are the final numbers.
Now some people say that my improvements here
from conventional computers to mesh computers
come from redefining the notion of cost
and inventing the notion of price-performance ratio.
But you can see that the numbers are simply better.
The time drops from L to the 2 point something power
down to L to the 1.185 power.
To summarize,
mesh architectures
are much more cost-effective than conventional architectures.
In the past I've said that the improvement is from circuits,
but what really matters is the mesh architecture.
[Visual: myth-9 slide on computer right, shifting]
Myth number 9 is that we don't have to think about this improvement
in designing algorithms.
Mesh computers simply speed up computations.
At some point we'll all have mesh Athlons on our desks
and our computations will magically run faster.
Later on we'll have quantum Athlons on our desks
for even higher speed.
To optimize algorithms for mesh computers
we can simply optimize algorithms for conventional computers
and then put them on mesh computers.
For example,
the 2-dimensional multiplication mesh
that I mentioned before
is a completely straightforward mesh implementation of an FFT.
It's something that a mesh compiler can do automatically.
[Visual: reality slide on computer right, shifting]
The reality is that switching from conventional computers to mesh computers
changes the comparison between algorithms.
Here's a simple example.
I'll have an even more frightening example later.
Take the elliptic-curve method
versus my original batch factoring algorithm.
My algorithm is faster on conventional computers than ECM.
But on mesh computers ECM has a smaller price-performance ratio.
[Visual: algorithm-analysis-culture slide on computer right, shifting]
I have a kind of philosophical slide here.
But somebody has to say this.
The current literature on analysis of algorithms
has to be completely rewritten.
Right now people talk about time on conventional computers,
count the number of operations.
Space is sometimes mentioned
but there's a clear lex order
where time is considered more important.
Well, that does not make economic sense.
You have to rethink all your algorithm analysis
to account equally for the performance and the _price_ of your computers.
All sorts of tradeoffs,
for example baby-step-giant-step,
have different effects once you start paying attention to price.
[Visual: adventures-in-mesh-programming slide on computer right, shifting]
Let me say a little bit about what you have to do
if you actually want to start building machines and algorithms
for mesh architectures,
which all of us are going to be doing eventually.
There's a programming language for building circuits,
a programming language for mesh architectures,
called Verilog.
You can think of Verilog as being like C.
There are a few of these languages,
not nearly as many as for conventional computers.
I made the mistake of starting with a language called VHDL.
Nobody told me that Verilog was better.
Don't use VHDL unless you're an Ada programmer.
Anyway,
there are a few free tools to run programs in this Verilog language,
the best developed seems to be Icarus Verilog.
This program simulates a mesh architecture on your desktop computer,
for example your Linux computer.
Obviously you can't get the kind of price-performance exponents
I was talking about before if you do this,
but at least you can see that your programs work,
and you can also predict how fast they will be on a mesh architecture.
One level of building circuits for a computation
is writing Verilog programs and simulating them on a conventional computer.
[Visual: fpga slide on computer right, shifting]
A more convincing level of building circuits
is actually using a mesh computer
and carrying out your computations at high speed.
An FPGA, a field-programmable gate array,
is a general-purpose mesh computer
that you can buy right now
and run your Verilog programs on
at much higher speeds than what you can do by simulation.
An FPGA is an order of magnitude slower than a dedicated circuit,
but it has the right cost exponents.
One problem here is that compilers aren't very smart.
They're fairly smart but not _very_ smart;
just as for conventional computers,
you gain some speed by writing code in assembly language.
There are tools for circuits
called manual-place-and-route tools
that you can use to say
this part of your circuit is over here
and this part is over there
to make sure you're taking full advantage of the mesh.
Now if you want to predict what a well-funded attacker will do---
I see in the audience a few representatives of a well-funded attacker---
then you have to imagine building ASICs.
An ASIC is a dedicated circuit
that runs a single Verilog program at the best possible speed.
Building _one_ ASIC is really expensive.
This doesn't become cost-effective until you have
many copies of the chip to build.
But if you do build a lot of ASICs
then you get an order of magnitude better price-performance than for FPGAs.
[Visual: several-companies slide on computer right, shifting]
It's a little annoying how much work it takes
to design a circuit.
You can buy some higher-level programming tools,
for example from SRC,
which is Cray's last company,
tools that make it easier to write programs for mesh architectures,
as long as you're willing to accept a constant-factor slowdown.
I've been getting into this game.
I want it to be as easy as possible to experiment,
to prototype new circuits.
So I've been building tools
that simplify the creation of large meshes for the operations I care about.
These meshes might be a factor of 50 slower than an ASIC
but I'm willing to sacrifice that,
at least initially,
just to get something working.
What I'm aiming for
is completely convincing concrete numbers
for a well-optimized circuit for integer factorization.
Here's the circuit,
here's a demonstration that it works,
here's how fast it is.
At least at the simulation level.
I haven't decided yet whether I'm going to go ahead
and build actual physical circuits
to factor some specific numbers really quickly,
although obviously that would make the results even more convincing.
Until I'm done I don't want to speculate as to what the results will be.
There are already more than enough articles
with speculations at various levels of detail.
[Visual: myth-10 slide on computer right, shifting]
Let me finish off with myth number 10.
The myth here is that the quadratic sieve is faster than
the elliptic-curve method.
Earlier on
I was talking about ECM as a method of finding small factors
of the auxiliary numbers that show up
inside the quadratic sieve, the number field sieve, and so on.
But much simpler is to apply ECM directly
to the number you want to factor.
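Stage 1 of ECM applied directly to n can be sketched in a few lines of Python. This is a toy version with affine coordinates; all names and parameter choices here are mine, not from the talk. The key event is a failed modular inversion, which exposes a factor of n through a gcd.

```python
from math import gcd
from random import randrange

def ecm_stage1(n, B=100, curves=50):
    # Toy stage 1 of Lenstra's elliptic-curve method, affine coordinates.
    # For each random curve y^2 = x^3 + a*x + b (mod n) through a random
    # point, multiply the point by every prime power up to B; a failed
    # modular inversion along the way exposes a factor of n via a gcd.
    # Requires Python 3.8+ for pow(den, -1, n).
    def add(P, Q, a):
        if P is None: return Q
        if Q is None: return P
        x1, y1 = P
        x2, y2 = Q
        if x1 == x2 and (y1 + y2) % n == 0:
            return None                        # point at infinity
        if P == Q:
            num, den = 3 * x1 * x1 + a, 2 * y1
        else:
            num, den = y2 - y1, x2 - x1
        g = gcd(den % n, n)
        if g > 1:
            raise ZeroDivisionError(g)         # g divides n (maybe trivially)
        lam = num * pow(den, -1, n) % n
        x3 = (lam * lam - x1 - x2) % n
        return (x3, (lam * (x1 - x3) - y1) % n)

    def mul(k, P, a):                          # double-and-add
        R = None
        while k:
            if k & 1:
                R = add(R, P, a)
            P = add(P, P, a)
            k >>= 1
        return R

    primes = [p for p in range(2, B + 1)
              if all(p % d for d in range(2, int(p ** 0.5) + 1))]
    for _ in range(curves):
        x, y, a = randrange(n), randrange(n), randrange(n)
        P = (x, y)                   # b = y^2 - x^3 - a*x, implicitly
        try:
            for p in primes:
                pk = p
                while pk * p <= B:   # largest power of p not exceeding B
                    pk *= p
                P = mul(pk, P, a)
                if P is None:
                    break
        except ZeroDivisionError as e:
            g = e.args[0]
            if 1 < g < n:
                return g
    return None
```

Each curve is an independent trial, which is exactly why ECM parallelizes with essentially no communication.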
The myth is that this is slower than the quadratic sieve
for finding really big factors,
for factoring RSA keys.
Specifically,
people say that the elliptic curve method
is slower by a log factor
because it has to do multiprecision arithmetic
while the quadratic sieve is mostly doing just single-precision arithmetic.
Actually,
there are several other run-time factors to compare,
but they all roughly cancel.
In particular,
the quadratic sieve
wants slightly larger numbers to be smooth,
maybe not as large as I wrote here
if you optimize the early aborts,
but certainly at most y times as large,
which reduces the smoothness probability
by a log y factor.
Then the quadratic sieve needs something like y over log y smooth inputs,
while ECM needs only one smooth input,
but ECM takes about y operations per input.
So the bottom line is that the quadratic sieve is faster
by roughly a log factor.
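In symbols, the myth's bookkeeping might be summarized like this (the notation p, y, M is mine, not from the talk):

```latex
% p = smoothness probability of ECM-sized auxiliary numbers
% y = smoothness bound: QS needs about y/log y smooth inputs,
%     and its candidates are about y times larger, so their
%     smoothness probability drops to about p/log y
% M = cost of one multiprecision operation, roughly a log factor
\begin{align*}
\text{ECM cost} &\approx \frac{1}{p} \cdot y \cdot M
                 = \frac{y}{p}\,M,\\
\text{QS cost}  &\approx \frac{y}{\log y} \cdot \frac{\log y}{p} \cdot O(1)
                 = \frac{y}{p}\,O(1).
\end{align*}
```

The log-y factors cancel, leaving the ratio at about M: the multiprecision penalty that the myth charges against ECM.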
Well, that's the myth.
[Visual: reality slide on computer right, shifting]
The reality
is that ECM has a better price-performance ratio
than the quadratic sieve.
The reason is that there's a communication cost
in the linear-algebra stage of the quadratic sieve
whereas the elliptic-curve method has basically no communication at all.
If you optimize price-performance on mesh architectures,
instead of number of operations on conventional architectures,
then you see that the comparison between ECM and the quadratic sieve
is reversed.
Instead of a polylog factor in favor of the quadratic sieve
there's a _subexponential_ factor in favor of _ECM_.
I've been building circuits for ECM.
The bottleneck, of course, is integer multiplication.
The story of my life.
It's important to realize
that the original Schoenhage-Strassen fast multiplication method
is actually pretty damn good for circuits.
There are some subsequent improvements that you should use,
this is a subject of continuing research,
but even the _original_ method
is better than what chip designers are doing in their multipliers today.
If you're going to build a multiplication circuit
you should definitely use Schoenhage-Strassen.
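The underlying idea can be illustrated in Python. This is not Schoenhage-Strassen itself, which recurses modulo Fermat numbers; it is the same FFT-multiplication principle, sketched with a number-theoretic transform modulo one FFT-friendly prime of my choosing.

```python
# FFT-based integer multiplication, the principle behind Schoenhage-
# Strassen (the real method recurses modulo Fermat numbers). Here:
# a number-theoretic transform mod a single FFT-friendly prime.
P = 998244353          # = 119 * 2^23 + 1, primitive root 3

def ntt(a, invert=False):
    # In-place iterative Cooley-Tukey transform over Z/PZ; len(a) must
    # be a power of 2.
    n = len(a)
    j = 0
    for i in range(1, n):              # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w = pow(3, (P - 1) // length, P)
        if invert:
            w = pow(w, P - 2, P)
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * wn % P
                a[k] = (u + v) % P
                a[k + length // 2] = (u - v) % P
                wn = wn * w % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        for i in range(n):
            a[i] = a[i] * n_inv % P

def multiply(x, y, base=10):
    # Split the integers into base-10 digits, convolve via the NTT,
    # then let Horner evaluation at the base absorb the carries.
    def digits(v):
        d = []
        while v:
            d.append(v % base)
            v //= base
        return d or [0]
    a, b = digits(x), digits(y)
    n = 1
    while n < len(a) + len(b):
        n <<= 1
    a += [0] * (n - len(a))
    b += [0] * (n - len(b))
    ntt(a); ntt(b)
    c = [a[i] * b[i] % P for i in range(n)]
    ntt(c, invert=True)
    result = 0
    for coeff in reversed(c):
        result = result * base + coeff
    return result
```

This toy version only works while the convolution coefficients stay below P; the serious methods avoid that limit, but the transform structure, which is what maps so well onto a mesh, is the same.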
I should mention that years ago
Pomerance gave a talk pointing out how scary it was
to have a bunch of desktop computers running the elliptic-curve method
for factoring big numbers,
512 bits, maybe 1024 bits.
The average run time is pretty good and the variance is huge.
At any moment, one of those computers could suddenly say here's a factor.
Well,
what I'm proposing is even more extreme.
Imagine replacing your desktop computer
with thousands of little elliptic-curve units.
If you were scared of ECM before
then you should be _really_ scared of it now.
[Edlyn Teske's baby is crying in the background.]
I see that _someone_ is scared already
so I'd better stop there.
Thank you for your attention.
[Question from Eric Bach about catching up to Intel.]
[Question from Peter Montgomery about transistor counts.]
[Question from Carl Pomerance about Coppersmith implementation.]