D. J. Bernstein

# Historical notes on mesh routing in NFS

My October 2001 grant proposal briefly describes a wide variety of NFS circuits and proposes finding the fastest one. Treating my sample ECM+mesh-sorting circuit as if it were my final result, and as if every trivial improvement to it were worth pointing out, is a serious misrepresentation of my work.

In particular, the June 2002 Lenstra-Shamir-Tomlinson-Tromer paper is completely incorrect when it claims that the use of mesh routing in this context is an ``improvement'' over my grant proposal by ``several orders of magnitude.''

The use of mesh routing in this context is one of many obvious possibilities. In March 2001, in my first ``NSA sieving circuit'' talk, I used mesh routing:

```
#1     #2     #3
(2,1)  (2,2)  (2,3)
(5,2)  (7,1)  (7,2)

#4     #5     #6
(2,4)  (2,5)  (3,1)

#7     #8     #9
(3,2)  (3,3)  (5,1)

For each (p,i), processor
generates ith multiple s of p
in {n+1,n+2,...,n+y},
if there is one,
and sends (p,s) to #(s-n)
through the mesh.
```
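The slide above can be turned into a small executable sketch. The talk leaves n unspecified; n = 41 below is my choice, picked so that the interval {42, ..., 50} really does contain five multiples of 2, three of 3, and two each of 5 and 7, matching the (p,i) pairs on the slide. The helper names are mine, not from the talk.

```python
# A 3x3 mesh of cells #1..#9 covers the sieve interval {n+1, ..., n+y}.
# The cell holding (p, i) computes the i-th multiple s of p in that
# interval, if any; the pair (p, s) is then routed through the mesh to
# cell #(s - n), which records p as a divisor of s.
n, y = 41, 9            # n = 41 is my illustrative choice

# The (p, i) pairs as distributed over cells #1..#9 on the slide.
cells = {
    1: [(2, 1), (5, 2)],
    2: [(2, 2), (7, 1)],
    3: [(2, 3), (7, 2)],
    4: [(2, 4)],
    5: [(2, 5)],
    6: [(3, 1)],
    7: [(3, 2)],
    8: [(3, 3)],
    9: [(5, 1)],
}

def ith_multiple(p, i):
    """The i-th multiple of p in {n+1, ..., n+y}, or None."""
    first = (n + 1) + (-(n + 1)) % p    # smallest multiple of p > n
    s = first + (i - 1) * p
    return s if s <= n + y else None

messages = []           # (source cell, (p, s), destination cell)
for cell, pairs in cells.items():
    for p, i in pairs:
        s = ith_multiple(p, i)
        if s is not None:
            messages.append((cell, (p, s), s - n))
```

For example, cell #3 holds (7,2); the second multiple of 7 in {42, ..., 50} is 49, so (7,49) is routed to cell #(49-41) = #8.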
In my October 2001 grant proposal, I used mesh sorting for my sample circuit, but on page 7 I specifically mentioned the literature on ``mesh routing and mesh sorting.'' Furthermore, in an April 2002 email message to Arjen Lenstra, as part of explaining why conventional implementations aren't cost-effective, I stated details of a counterclockwise mesh-routing algorithm for NFS sieving and NFS linear algebra:
```
Say you have a lot of money and a 1024-bit number to factor.

Traditional sieving and linear algebra will want hundreds of gigabytes
of RAM for each CPU. That's maybe \$100000 of DRAM+interconnections
sitting around waiting for a \$50 CPU. (Never mind the fact that this
isn't the kind of machine you can actually buy right now.)

Replacement, with the same amount of memory: a 20x20 array of \$750
machines, each with \$250 of DRAM, a \$50 CPU, and four Gigabit Ethernet
connections to its neighbors. (These _are_ machines that you can buy
right now.) Take a batch of, say, 1000 hits generated by each processor;
route the hits 19 steps down, then 19 steps right, then 19 steps up,
then 19 steps left, to reach the proper memory locations.

Bottom line: The time for 20^2 * 1000 sieve hits has been replaced by
the time for 1000 sieve hits plus 4 * (20-1) * 1000 Ethernet
transmissions of
(say) 64-bit chunks, at triple the cost. The transmissions occupy the
wires for about 5 milliseconds; that should also be ample time for the
CPUs to handle sending, routing, and receiving. (There are many ways to
improve this, obviously.) Do you have a machine that can do a DRAM hit
in under 38 nanoseconds?

The point isn't that these particular choices are a speedup. The point
is that the time ratio grows linearly with the 20, while the cost ratio
does not. The same comparison applies with special-purpose circuitry,
allowing the 20 to be raised by half as many orders of magnitude. (And
that's without tossing in a parallel early-abort rho+ECM stage.)

Here's another way to think about the speedup. General-purpose computers
are normally bought with a rough balance between CPU costs and DRAM
costs. DRAM has been getting faster and larger. CPUs have also been
getting faster _and larger_.

Where is the extra size in CPUs going? Answer: fast FPUs, huge caches,
out-of-order operations on hundreds of registers, etc. A CPU equally
useful for factorization could be built at much lower cost---and the
cost gap will widen as DRAM continues getting larger.

If you assume that a 2x improvement in DRAM speed _and_ size will be
accompanied by only a 2x improvement in CPU speed, you're overestimating
the cost of the CPU, or underestimating the cost of DRAM. Bob Silverman
was right: memory is not free.

Someday hundreds of gigabytes of DRAM will cost \$100, and \$100 CPUs will
have expanded to tens of billions of transistors. What will all those
transistors be accomplishing for factorization, or for other serious
computational problems, if the entire CPU is bottlenecked sending its
results to DRAM?

It's obviously more cost-effective to have millions of processing cells,
each with a limited amount of DRAM, each with a few connections to
nearby cells. Whether this is called ``building a VLSI circuit'' or
``programming a computational RAM'' or ``designing a parallel algorithm
that uses very little memory per processor,'' it's the right model of
computation, and not one for which NFS was previously optimized.
```
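The counterclockwise pass in the email can be simulated. The email gives the step counts but not the per-packet rule, so the rule below is my assumption, as are all names: in each of the four 19-step phases (down, right, up, left), a packet moves only while that phase's direction brings it closer to its destination, and sits still otherwise. Any displacement on a 20x20 grid decomposes into at most 19 steps in one vertical direction plus at most 19 in one horizontal direction, so every hit arrives by the end of the pass.

```python
# Sketch of one reading of the email's counterclockwise routing pass
# on a 20x20 array of machines (rule and names are my assumptions).
import random

N = 20                                  # the email's 20x20 array
random.seed(1)

# In the email each machine injects a batch of ~1000 sieve hits;
# a smaller random batch suffices to check the routing rule.
packets = []                            # [current position, destination]
for _ in range(5000):
    src = (random.randrange(N), random.randrange(N))
    dst = (random.randrange(N), random.randrange(N))
    packets.append([src, dst])

for dr, dc in [(1, 0), (0, 1), (-1, 0), (0, -1)]:  # down, right, up, left
    for _ in range(N - 1):                         # 19 steps per phase
        for pkt in packets:
            (r, c), (tr, tc) = pkt
            # Move only if this phase's direction reduces the distance.
            if (tr - r) * dr > 0 or (tc - c) * dc > 0:
                pkt[0] = (r + dr, c + dc)

assert all(pos == dst for pos, dst in packets)

# The email's arithmetic checks out: 4*(20-1)*1000 transmissions of
# 64 bits is about 4.9 Mbit, roughly 5 ms at Gigabit Ethernet rates;
# tripling that (the cost factor) against 20^2 * 1000 = 400,000 serial
# DRAM hits gives 15 ms / 400,000 = 37.5 ns per hit, the "under 38
# nanoseconds" threshold.
```

Note that under this rule each phase uses every link in only one direction, which fits the email's picture of four point-to-point Ethernet connections per machine.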

How could Lenstra, Shamir, Tomlinson, and Tromer publish this obvious use of mesh routing in June 2002 as if it were an interesting new result? I realize that they didn't see my March 2001 talk, but they certainly saw my grant proposal and my April 2002 email.

I am annoyed that my grant proposal has been characterized as presenting slower algorithms than it actually does. I informed Lenstra, Shamir, Tomlinson, and Tromer in plain language that their claim of an ``improvement'' was false; I am disappointed that they have refused to properly assign credit. Are they next going to claim that an early-abort combination of rho and ECM is an ``improvement'' over my grant proposal?