D. J. Bernstein / Integer factorization / Circuits for integer factorization

Historical notes on mesh routing in NFS

My October 2001 grant proposal briefly describes a wide variety of NFS circuits and proposes finding the fastest one. Treating my sample ECM+mesh-sorting circuit as if it were my final result, and as if every trivial improvement to it were worth pointing out, is a serious misrepresentation of my work.

In particular, the June 2002 Lenstra-Shamir-Tomlinson-Tromer paper is completely incorrect when it claims that the use of mesh routing in this context is an ``improvement'' over my grant proposal by ``several orders of magnitude.''

The use of mesh routing in this context is one of many obvious possibilities. In March 2001, in my first ``NSA sieving circuit'' talk, I used mesh routing:

     #1     #2     #3
     (2,1)  (2,2)  (2,3)
     (5,2)  (7,1)  (7,2)

     #4     #5     #6
     (2,4)  (2,5)  (3,1)

     #7     #8     #9
     (3,2)  (3,3)  (5,1)

     For each (p,i), processor
     generates ith multiple s of p
     in {n+1,n+2,...,n+y},
     if there is one,
     and sends (p,s) to #(s-n)
     through the mesh.
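
Here is a minimal sketch, in Python, of the step described on that slide: each processor takes its (p,i) pairs, finds the ith multiple s of p in {n+1,...,n+y} if there is one, and delivers (p,s) to processor #(s-n). The value n = 13 is an assumption chosen only so that every pair shown above produces a hit, and the mesh hops are collapsed into a single delivery; neither detail is taken from the talk.

   # Sketch (not from the talk): generate sieve hits for the 3x3 example
   # above and route each (p,s) to processor #(s-n).  The assignment of
   # (p,i) pairs to processors is copied from the slide; n = 13 is an
   # assumed value, and the mesh hops are collapsed into a single delivery.
   n, y = 13, 9                                  # interval {n+1, ..., n+y}
   pairs = {1: [(2,1),(5,2)], 2: [(2,2),(7,1)], 3: [(2,3),(7,2)],
            4: [(2,4)],       5: [(2,5)],       6: [(3,1)],
            7: [(3,2)],       8: [(3,3)],       9: [(5,1)]}

   inbox = {proc: [] for proc in pairs}
   for proc, assigned in pairs.items():
       for p, i in assigned:
           first = (n + 1) + (-(n + 1)) % p      # smallest multiple of p > n
           s = first + (i - 1) * p               # ith multiple of p above n
           if s <= n + y:                        # "if there is one"
               inbox[s - n].append((p, s))       # deliver (p,s) to #(s-n)

   for proc in sorted(inbox):
       print("#%d receives %s" % (proc, inbox[proc]))
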
In my October 2001 grant proposal, I used mesh sorting for my sample circuit, but on page 7 I specifically mentioned the literature on ``mesh routing and mesh sorting.'' Furthermore, in an April 2002 email message to Arjen Lenstra, as part of explaining why conventional implementations aren't cost-effective, I stated details of a counterclockwise mesh-routing algorithm for NFS sieving and NFS linear algebra (sketches of the routing schedule and of the arithmetic follow the quoted message):
   Say you have a lot of money and a 1024-bit number to factor.

   Traditional sieving and linear algebra will want hundreds of gigabytes
   of RAM for each CPU. That's maybe $100000 of DRAM+interconnections
   sitting around waiting for a $50 CPU. (Never mind the fact that this
   isn't the kind of machine you can actually buy right now.)

   Replacement, with the same amount of memory: a 20x20 array of $750
   machines, each with $250 of DRAM, a $50 CPU, and four Gigabit Ethernet
   connections to its neighbors. (These _are_ machines that you can buy
   right now.) Take a batch of, say, 1000 hits generated by each processor;
   route the hits 19 steps down, then 19 steps right, then 19 steps up,
   then 19 steps left, to reach the proper memory locations.

   Bottom line: The time for 20^2 1000 sieve hits has been replaced by the
   time for 1000 sieve hits plus 4(20-1)1000 Ethernet transmissions of
   (say) 64-bit chunks, at triple the cost. The transmissions occupy the
   wires for about 5 milliseconds; that should also be ample time for the
   CPUs to handle sending, routing, and receiving. (There are many ways to
   improve this, obviously.) Do you have a machine that can do a DRAM hit
   in under 38 nanoseconds?

   The point isn't that these particular choices are a speedup. The point
   is that the time ratio grows linearly with the 20, while the cost ratio
   is simply 1 plus overhead. The overhead can be dramatically reduced
   with special-purpose circuitry, allowing the 20 to be raised by half as
   many orders of magnitude. (And that's without tossing in a parallel
   early-abort rho+ECM stage.)

   Here's another way to think about the speedup. General-purpose computers
   are normally bought with a rough balance between CPU costs and DRAM
   costs. DRAM has been getting faster and larger. CPUs have also been
   getting faster _and larger_.

   Where is the extra size in CPUs going? Answer: fast FPUs, huge caches,
   out-of-order operations on hundreds of registers, etc. A CPU equally
   useful for factorization could be built at much lower cost---and the
   cost gap will widen as DRAM continues getting larger.

   If you assume that a 2x improvement in DRAM speed _and_ size will be
   accompanied by only a 2x improvement in CPU speed, you're overestimating
   the cost of the CPU, or underestimating the cost of DRAM. Bob Silverman
   was right: memory is not free. 

   Someday hundreds of gigabytes of DRAM will cost $100, and $100 CPUs will
   have expanded to tens of billions of transistors. What will all those
   transistors be accomplishing for factorization, or for other serious
   computational problems, if the entire CPU is bottlenecked sending its
   results to DRAM?

   It's obviously more cost-effective to have millions of processing cells,
   each with a limited amount of DRAM, each with a few connections to
   nearby cells. Whether this is called ``building a VLSI circuit'' or
   ``programming a computational RAM'' or ``designing a parallel algorithm
   that uses very little memory per processor,'' it's the right model of
   computation, and not one for which NFS was previously optimized.
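
One way to read the ``19 steps down, then 19 steps right, then 19 steps up, then 19 steps left'' schedule in that message: in each phase a hit moves only if that phase's direction takes it toward its destination, so a hit headed below and to the left, say, moves during the down phase and the left phase and otherwise sits still. The message does not spell out such a per-phase rule; the Python sketch below fills it in with the obvious greedy choice, on random destinations, and checks that four phases of at most 19 steps each deliver every hit on a 20x20 grid.

   # Sketch, not quoted from the email: simulate the four-phase schedule
   # "down, right, up, left" on a 20x20 grid.  Destinations are random and
   # the batch size is shrunk from 1000 to 10 hits per node; both are
   # assumptions made only to keep the check small.
   import random

   N = 20
   random.seed(1)
   hits = [[(r, c), (random.randrange(N), random.randrange(N))]
           for r in range(N) for c in range(N) for _ in range(10)]

   def phase(dr, dc):
       """Move each hit toward its target along (dr,dc) if that helps;
          return the longest move, which is never more than N-1 = 19."""
       longest = 0
       for hit in hits:
           (r, c), (tr, tc) = hit
           if dr and (tr - r) * dr > 0:          # vertical phase, target ahead
               longest = max(longest, abs(tr - r))
               hit[0] = (tr, c)
           elif dc and (tc - c) * dc > 0:        # horizontal phase, target ahead
               longest = max(longest, abs(tc - c))
               hit[0] = (r, tc)
       return longest

   for name, dr, dc in [("down",1,0), ("right",0,1), ("up",-1,0), ("left",0,-1)]:
       print("%5s phase: longest move %d steps" % (name, phase(dr, dc)))

   assert all(pos == dest for pos, dest in hits)  # every hit has arrived

Each phase takes at most 20-1 = 19 steps, which is where the 4(20-1) factor in the transmission count above comes from.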
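
And a small arithmetic check of the ``about 5 milliseconds'' and ``38 nanoseconds'' figures, assuming 1 gigabit per second per link and reading ``at triple the cost'' as a straight price-performance comparison; both readings are assumptions, not statements from the message.

   # Sketch: where the "about 5 milliseconds" and "38 nanoseconds" figures
   # plausibly come from.  Link speed (1 Gbit/s) and the way the 3x cost
   # factor is applied are assumptions, not statements from the email.
   m, batch, bits, link = 20, 1000, 64, 1e9

   wire_time = 4 * (m - 1) * batch * bits / link          # seconds per batch
   hits_replaced = m * m * batch                          # 20^2 * 1000 hits
   breakeven_dram = 3 * wire_time / hits_replaced         # 3x cost factor

   print("wire time per batch: %.1f ms" % (wire_time * 1e3))       # prints 4.9 ms
   print("break-even DRAM hit: %.1f ns" % (breakeven_dram * 1e9))  # prints 36.5 ns

Rounding the wire time up to 5 milliseconds first gives 3 times 5 ms divided by 400000 hits, i.e. 37.5 ns, roughly the 38 nanoseconds mentioned above.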

How could Lenstra, Shamir, Tomlinson, and Tromer publish this obvious use of mesh routing in June 2002 as if it were an interesting new result? I realize that they didn't see my March 2001 talk, but they certainly saw my grant proposal and my April 2002 email.

I am annoyed that my grant proposal has been characterized as presenting slower algorithms than it actually does. I informed Lenstra, Shamir, Tomlinson, and Tromer in plain language that their claim of an ``improvement'' was false; I am disappointed that they have refused to properly assign credit. Are they next going to claim that an early-abort combination of rho and ECM is an ``improvement'' over my grant proposal?