[Talk notes, D. J. Bernstein, 15 June 2004.]
[This is approximately the talk I gave on Monday 14 June 2004.]
[It took 70 minutes. Oops.]
[Visual: title slide on computer right]
Thank you.
[Visual: sieving-c slide on computer right, shifting]
This is a talk about
modern algorithms for factoring difficult integers.
Let me begin by showing you an example of the sieve of Eratosthenes.
The column on the left
shows the integers from 1 through 20.
I marked every second integer as being divisible by 2,
every fourth integer as being divisible by another 2,
every third integer as being divisible by 3,
et cetera.
Of course,
you don't have to start at 1.
On the right side here
I have sieved 20 integers starting at 612.
Again every second integer is divisible by 2,
although the starting point for _this_ arithmetic progression has changed
since 611 is odd.
Well, it's the same arithmetic progression,
but it starts at a different line on the slide.
Anyway, I found many other small prime factors of these input numbers.
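The two sieves above can be sketched in a few lines of Python. This is illustration only, not memory-tuned sieving code; the function name and the repeated-division shortcut (in place of one arithmetic progression per prime power) are mine:

```python
def sieve_interval(start, length, primes):
    """Mark multiples of each prime in [start, start+length), dividing
    them out so that rem[i] == 1 means entry i factored completely."""
    rem = list(range(start, start + length))
    factors = [[] for _ in range(length)]
    for p in primes:
        for i in range((-start) % p, length, p):
            while rem[i] % p == 0:       # a real sieve walks p, p^2, ...
                rem[i] //= p             # as separate progressions instead
                factors[i].append(p)
    return factors, rem

factors, rem = sieve_interval(612, 20, [2, 3, 5, 7])
print(rem[13])    # -> 1: 625 = 5*5*5*5 is completely factored
```

Running it again with `start=1` reproduces the left column, where 14 = 2 times 7 comes out completely factored the same way.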
I should point out that you can save lots of memory here
by eliminating white space,
a principle that I have also attempted to apply to my name tag.
I wrote some TeX macros that, given a name,
automatically choose the largest possible font for the name tag.
[Audience member: ``Are you going blind?'' Me: ``Yes, I am.'']
For those of you who are not going blind,
you can use the official ANTS conference macros,
which, given a name,
automatically choose the _smallest_ possible font for the name tag.
[Visual: factoring-611 slide on computer right, shifting]
Anyway, starting from these two Eratosthenes sieves,
the second one offset by 611,
we can actually factor the number 611.
At the bottom of this slide I have discovered the factor 47 of 611
by computing the greatest common divisor
of 611 with the difference of two numbers.
Let me trace where these numbers came from.
First of all, notice
that many of these inputs
to the sieve of Eratosthenes
have been completely factored,
and sometimes we have two complete factorizations on the same line
in both the left sieve and the right sieve.
For example 14 equals 2 times 7
and 625 equals 5 times 5 times 5 times 5.
We know the complete prime factorization
of the product 14 times 625.
I have noticed
that here at ANTS
people sometimes point out the impressive computations that they have done,
after which the audience says
ooh aah.
[Audience member: ``Ooh aah.'' Me: ``Hold on a moment.'']
Now it is not shown on this slide,
but I am pleased to report
that I continued each of these sieves for one _hundred_ rows.
[Audience joins in:] Ooh aah.
So I found some other products
of the form c times 611+c,
64 times 675 and 75 times 686,
where the complete prime factorization is revealed by these sieves.
Now multiplying these three lines together produces a square.
This is clear from the prime factorizations:
the sum of the exponent vectors is even.
1 plus 6 plus 1 is even,
0 plus 3 plus 1 is even,
et cetera.
If you change 611 to a larger number
and write down many of these rows involving many primes
then in general you will _not_ have
the product of _all_ the rows magically being a square.
You have to find a nonempty subset of the rows with square product
which is easy to do by linear algebra on the exponent vectors.
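That linear algebra can be sketched as Gaussian elimination over GF(2), with the exponent vectors reduced mod 2 and packed into bits. This routine and its name are mine, purely for illustration:

```python
def square_subset(vectors):
    """Return indices of a nonempty subset of exponent vectors whose
    componentwise sum is even, or None if no dependency exists."""
    basis = {}                              # pivot bit -> (bits, subset mask)
    for i, v in enumerate(vectors):
        bits = sum(1 << j for j, e in enumerate(v) if e & 1)
        mask = 1 << i                       # records which rows were combined
        while bits and (bits.bit_length() - 1) in basis:
            b, m = basis[bits.bit_length() - 1]
            bits, mask = bits ^ b, mask ^ m
        if bits == 0:                       # the combined rows form a square
            return [j for j in range(len(vectors)) if (mask >> j) & 1]
        basis[bits.bit_length() - 1] = (bits, mask)
    return None

# exponent vectors of 14*625, 64*675, 75*686 over the primes 2, 3, 5, 7
print(square_subset([[1, 0, 4, 1], [6, 3, 2, 0], [1, 1, 2, 3]]))
# -> [0, 1, 2]: the product of all three rows is a square
```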
So anyway,
the product of all of _these_ numbers is a square.
The square root is one of the numbers inside the gcd.
The other number inside the gcd
is something that has the same square modulo 611.
14 squared is congruent to 14 times 625 modulo 611,
64 squared is congruent to 64 times 675 and so on.
When you take a gcd of 611
with the difference of two numbers that have the same square modulo 611,
then you should not be surprised,
although you are certainly not guaranteed,
to find a factorization of 611,
as I did in this case,
allowing me to steal money from my bank.
This algorithm
for integer factorization
is the number-field sieve when the number field is Q.
So it is the Q sieve.
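The whole 611 finale fits in a few lines of Python, just to check the arithmetic:

```python
from math import gcd, isqrt

left = 14 * 64 * 75                             # product of the c's
right = (14 * 625) * (64 * 675) * (75 * 686)    # product of the c*(611+c)'s
root = isqrt(right)
assert root * root == right     # even exponent vectors, so right is a square
# left^2 is congruent to right mod 611, since 625 = 611 + 14, and so on
print(gcd(611, root - left))    # -> 47, and indeed 611 = 13 * 47
```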
Okay.
What I would like to do in this talk
is point out certain assumptions that people make
about _how_ you should find large factors of integers,
certain structural rules that people follow in designing these algorithms.
Sometimes people say that you _should_ follow these rules;
sometimes people say that you _must_ follow these rules.
These are myths.
I am going to convince you that all of the rules _can_ be violated
and that _most_ of the rules _should_ be violated.
[Visual: myth-1 slide on computer right, shifting]
So without further ado,
here's a warm-up myth,
myth #1,
which says
that the goal of sieving is to find all relations among the given inputs.
The word relation means, by definition,
a smooth product c times 611 plus c.
Smooth means a product of small primes.
The myth again is that our goal is to identify which inputs are smooth,
14 625 et cetera,
so that we find all of the relations.
A slightly wimpier form of this myth
is that we want to find _almost_ all of the relations
or at least _most_ of the relations.
For example,
here's a quote from a 1994 paper on the number-field sieve
saying that we try to not lose any relations.
That is not even close to the truth.
In fact, the notion of finding _every_ relation is patently absurd,
because we've already thrown relations away
by focusing on the smallest inputs.
For example,
by looking at only a hundred values of c,
I certainly threw away some relations.
[Visual: reality slide on computer right, shifting]
Let me review what we're actually trying to do.
We need to find a nontrivial square product of inputs.
Now to recognize square products
we need to know the complete prime factorizations of the inputs.
[Coprime factorizations are sufficient, of course,
but I didn't want to mention factoring into coprimes until later in the talk.]
Furthermore,
if an input has a _large_ prime factor,
then that input is probably not useful in forming a square,
because that prime probably does not appear anywhere else.
So our goal is to find many completely factored inputs,
preferably inputs whose largest prime factor is small.
Some of those useful inputs are easier to identify than others.
I will be more precise about this later.
For the moment let me just say that
the difficulty of identifying a useful input
depends much more on the size of its _smallest_ prime factors
than on the size of its largest prime factor.
To optimize the algorithm you have to pay attention to all of these criteria.
Something that is going to become increasingly clear over the next few slides
is that you end up finding only a tiny fraction of the relations.
So what you should remember from this slide
is that _most_ of the inputs will be thrown away
because that makes it _easier_
to find enough useful inputs.
[Visual: myth-2 slide on computer right, shifting]
Myth number 2
says that the way we identify these fully factored inputs
is by sieving,
marking numbers in arithmetic progressions,
as I showed you in my 611 example.
Sieving
is an amazingly fast algorithm
according to people who have never actually done it.
Now in the old days sieving was not possible.
For example, CFRAC,
the continued-fraction algorithm for integer factorization,
does not produce sieveable inputs.
So you have to use other methods of finding small factors of these inputs.
You can use trial division, Pollard's rho method,
more recently Lenstra's elliptic curve method, and so on.
The myth says that all of those auxiliary algorithms
became obsolete in this context
as soon as Rich Schroeppel introduced
the linear sieve.
Inputs are sieveable
in the linear sieve,
the quadratic sieve,
the number-field sieve.
I have a quote here from one author
saying that sieving is much faster than the elliptic-curve method.
Once again, the myth says
that when numbers are sieveable we simply sieve them.
End of story.
[Visual: reality slide on computer right, shifting]
Now the reality is that sieving is merely a preprocessing step
for another algorithm.
Sieving by itself is _slow_.
The reason is that random access to memory is slow.
Your computer has what's called a storage hierarchy,
it has under a thousand bytes of very fast registers,
then perhaps 64 kilobytes of level-1 cache
that can be accessed in one or two clock cycles,
then perhaps 512 kilobytes of level-2 cache
that is typically an order of magnitude slower,
then quite a lot of dynamic RAM
that is another order of magnitude slower,
and then there's disk.
Now if you do a tiny little sieve
that fits into level-1 cache
then you might think that sieving is fast.
But sieving in level-2 cache is considerably slower.
Sieving in dynamic RAM is really slow.
And sieving on disk,
well,
sieving on disk is actually not as bad as it sounds
if you switch from a simple distribution sort
to a more complicated radix sort,
but it is certainly not an amazingly fast operation.
Because memory access time is variable,
the best strategy turns out to be doing _some_ sieving,
let's say sieving primes up to the theta power
of the largest prime you care about.
Then once you've removed those primes from the inputs,
you keep only the smallest inputs,
you abort the other inputs.
Then with that small fraction of the inputs,
you try some test other than sieving.
So this is a combination of two tests for small factors.
The first test is sieving.
The second test is some non-sieving test.
[Visual: 1994-golliver slide on computer right, shifting]
For example,
one paper uses sieving to discover prime factors up to 2 to the 21,
then they divide those primes out of the inputs
and keep only the remaining pieces below 2 to the 60.
Then they use SQUFOF,
which is a special case of CFRAC,
and the elliptic-curve method
to discover prime factors up to 2 to the 30.
Now this is the same paper
where three paragraphs earlier
the authors said they were trying to keep _every_ relation.
The reason they can say this
is that they redefine the word relation.
They say
that a relation involving a prime above 2 to the 21
is not actually a relation.
It is a _partial_ relation.
They focus on 2 to the 21 as the important number
rather than 2 to the 30.
_I_ think
that primes go up to 2 to the 30,
maybe even beyond,
and these authors are aborting most inputs early at 2 to the 21.
_They_ say
that the normal state of affairs is to stop at 2 to the 21.
They're not stopping early;
they're simply failing to work late.
Whatever.
They are, in fact, aborting most inputs,
as they should.
[Visual: myth-3 slide on computer right, shifting]
Myth number 3
is that the second test is not a bottleneck.
Remember the overall picture at this point
is that we sieve _some_ primes,
and then we abort _most_ of the inputs,
and then we use a second test to find small factors
of the inputs that survive the abort.
The myth is that sieving is the only bottleneck here.
For example,
here are some authors reporting that sieving took almost all of their time.
Well,
I look at this and say that the parameters were obviously not optimized.
[Visual: reality slide on computer right, shifting]
If the second test is not a bottleneck
then you have time to feed a lot more inputs to the second test.
You should allow larger inputs to survive the abort.
This _will_ find more relations from the same initial set of inputs.
So you could have started with fewer inputs.
Obviously the abort parameters should be chosen
so that the second test is balanced with sieving.
I've written on this slide a useful formula to remember,
this is very much like the
u to the minus u formula for smoothness probabilities,
it's not terribly accurate,
it's occasionally misleading,
but it's in the right ballpark.
The time for finding enough smooth inputs
is roughly R times S to the theta times T to the 1 minus theta.
Here R is what textbooks report for the run time of the algorithm,
it is the ratio between the number of smooth inputs that you need to find
and the probability of one input being smooth.
S is the time per input to sieve.
T is the time per input to apply the second test,
for example the elliptic-curve method.
Theta you remember from before,
you sieve up to the theta power of the largest prime.
For example if theta is 0.7
then this is S to the 0.7 times T to the 0.3.
You might be wondering where this approximation comes from.
[Visual: why-s-t slide on computer right, shifting]
The quick answer is that it comes from a paper by Pomerance.
Pomerance has heuristics from twenty years ago,
actually theorems in certain parameter ranges,
saying asymptotically how many inputs survive an abort
and how many smooth inputs survive an abort.
I'm not going to give you either formula,
I'm just going to compare the formulas,
if you keep the smallest 1 out of every A inputs
then you end up keeping 1 out of every A to the 1 minus theta
smooth inputs.
You can think of A as the parameter if you like.
Now if sieving and the subsequent test are balanced
then you take time S for every input
and time T for one out of every A inputs
so A should be about T over S
and that's all you need to know.
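Here is that balancing arithmetic as a tiny Python sketch. The values of R, S, T, and theta are invented for illustration; the point is only that the balanced-abort cost lands within a constant of the R times S to the theta times T to the 1 minus theta formula:

```python
S, T, theta = 1.0, 50.0, 0.7  # invented: sieve cost and second-test cost per input
R = 1e6                       # textbook count: inputs a full test would scan
A = T / S                     # balanced abort: keep the smallest 1 in A inputs
inputs_needed = R * A ** (1 - theta)   # smooth yield drops by A**(1 - theta)
cost = inputs_needed * (S + T / A)     # sieve everyone, test only the survivors
approx = R * S ** theta * T ** (1 - theta)
print(cost / approx)          # -> 2.0: the formula is right up to a constant
```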
[Visual: better-analysis slide on computer right, shifting]
This is a rather crude analysis for many reasons.
One of my hobbies for many years has been
analyzing number-field-sieve run time much more precisely.
For example,
you can compute extremely tight bounds on smoothness probabilities,
you can measure exactly how big S and T are,
you can analyze the distribution of inputs,
there's all sorts of fun you can have here.
But even from the crude analysis it is clear
that the second test has a quite serious impact
on the overall run time of these algorithms.
[Visual: myth-4 slide on computer right, shifting]
So which second test should we use?
Myth number 4
is that the answer is the elliptic-curve method.
For example,
one recent factorization
aborted numbers above 2 to the 90
and then used ECM to find small factors of the remaining numbers.
A more general form of this myth
is that once we're done sieving
we have to factor each input independently.
However,
there are faster algorithms
that factor a bunch of inputs at the same time.
[Visual: given-set-q slide on computer right, shifting]
I pointed this out a few years ago,
and there's some really exciting news just a few months ago.
My algorithm from 2000
takes a bunch of primes and a bunch of inputs
and factors all the inputs simultaneously using those primes.
The time per input bit
is log cubed.
For comparison, the elliptic-curve method,
instead of exponent 3,
has an exponent that grows something like square root of 2 log y over log log y,
and y doesn't have to be very big before
the elliptic-curve method becomes slower than this batch test.
The exciting recent news
is from Franke, Kleinjung, Morain, and Wirth,
who pointed out that you can save an entire log factor here
if you simply want to know _which_ inputs are smooth.
And then you can apply my original algorithm
to quickly factor those smooth inputs;
normally there aren't very many of them.
Actually, the Franke Kleinjung Morain Wirth algorithm
behaves badly for some extreme inputs,
but that's easy to fix.
Anyway, what's important is that for typical inputs
they reduced the exponent from 3 to 2.
Now I am more than a little bit disappointed
in how they presented this algorithm.
They said that it was a _simplification_ of my algorithm.
It is _not_ a simplification, it is a new algorithm.
It _is_ simpler, certainly;
it produces a subset of the output
using a subset of my techniques.
But it also uses _another_ idea
that's important for saving this log factor.
Another problem with their presentation
is that they don't even say that they're saving time.
This is an entire order of magnitude speedup
and they don't point that out.
The worst part
is that they buried this algorithm
inside a paper on elliptic-curve primality proving.
Now proving primality is lots of fun
and I am very much looking forward to the talk by Morain.
[Morain comments at this point: ``So am I.'']
But this speedup has many applications other than primality proving.
In particular,
for factoring integers on conventional computers,
this is the biggest news in years.
It belongs in a separate paper.
So they did not advertise this algorithm very well.
Fortunately,
I get to stand up here and advertise it for them.
This is a wonderful little algorithm.
The only reason that I am not going to tell you how these three algorithms work
is that in a few minutes I am going to show you a newer algorithm
that is even more fun.
So you will get the idea of what these algorithms do.
But first let me state
[Visual: myth-5 slide on computer right, shifting]
myth number 5.
Even more fundamental than the question of whether there are early aborts
is the question of whether there is a factor base.
The myth
is that you have to create a factor base.
This myth is built into how everybody talks about factoring algorithms,
until my next slide.
In these algorithms you choose a set of primes,
a factor base,
for example all the primes up to 2 to the 30.
You then look for smooth inputs,
inputs that factor completely using those primes.
And then you do a cull.
You weed out
the non-repeated primes.
A lot of inputs will have primes that do not appear in any other inputs.
So throw those inputs away.
Repeat that process until you have a core of inputs
where all the prime divisors of each input
are divisors of other core inputs.
Now we really want to allow primes up to 2 to the 40,
2 to the 50,
_any_ size,
_if_ they are repeated.
But the myth says
that we have to choose some cutoff,
and for efficiency the cutoff cannot be very large.
[Visual: reality slide on computer right, shifting]
In fact,
this structure is completely unnecessary.
We _can_
allow primes of any size.
We do _not_ have to specify the primes in advance.
We can quickly find all the inputs that are smooth
with respect to the prime divisors of other inputs.
Obviously this algorithm
has to consider all of the inputs at once.
If you have say y squared inputs,
you are not allowed to break them into y chunks with y inputs per chunk.
There's a serious communication cost
in the algorithm I'm going to show you,
just like the communication cost
that we are familiar with
from the linear-algebra stage of these algorithms.
In theory
the slowdown here is just a small constant
compared to the factor-base algorithms.
Maybe the benefit is larger than that constant.
I simply don't know.
This algorithm is very new.
I have not had a chance to do any experiments.
But I do want to convince you
that this structure
of choosing a factor base,
_then_ looking for smooth numbers,
_then_ weeding out the non-repeated primes,
then doing linear algebra,
this structure is not necessary.
[Visual: what's-the-algorithm slide on computer right, shifting]
So here's how this algorithm works.
It also gives you a flavor of how the previous batch algorithms work.
You start with a bunch of input numbers x1 x2 et cetera.
These could be 1 times 612, 2 times 613, et cetera.
Actually they should be the numbers that survived an abort after sieving.
Since the goal is to factor these numbers
we simplify the problem by multiplying all the numbers together,
forming a huge number y.
We then divide y by x1 and reduce modulo x1,
divide y by x2 and reduce modulo x2,
and so on.
I will come back to the question of how fast this is.
Finishing off the algorithm.
If a power of y over xj is zero modulo xj then print xj.
You can choose the exponent so that, for example,
2 to that power is bigger than xj.
So this prints xj exactly when
every prime divisor of xj
is a divisor of y over xj
in other words a prime divisor of the other inputs.
This throws away exactly the inputs
that have non-repeated prime divisors.
That's exactly what we want.
Feeding the outputs through the same algorithm again a few times
produces the core set of inputs,
which we can then factor into coprimes in essentially linear time.
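Here is a direct transcription into Python, with the product tree and remainder tree exactly as described; the names are mine, and this ignores all the constant-factor tricks:

```python
from math import prod

def batch_filter(xs):
    """Keep exactly the inputs all of whose prime divisors also divide
    some other input, without ever choosing a factor base."""
    layers = [list(xs)]                     # product tree, leaves first
    while len(layers[-1]) > 1:
        cur = layers[-1]
        layers.append([prod(cur[i:i + 2]) for i in range(0, len(cur), 2)])
    y = layers[-1][0]                       # y = product of all inputs
    rem = [y]                               # remainder tree: y mod node^2
    for layer in reversed(layers[:-1]):
        rem = [rem[i // 2] % (v * v) for i, v in enumerate(layer)]
    out = []
    for x, r in zip(xs, rem):
        s = (r // x) % x                    # (y/x) mod x, since x divides y
        if pow(s, x.bit_length(), x) == 0:  # 2**bit_length(x) exceeds x
            out.append(x)
    return out

print(batch_filter([14, 625, 64, 675, 75, 686, 22]))
# -> [14, 625, 64, 675, 75, 686]
# 22 = 2 * 11 is dropped: the prime 11 appears in no other input
```

As the slide says, feeding the output back through `batch_filter` a few times shrinks it down to the core set.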
[Visual: why-is-this-so-fast slide on computer right, shifting]
Now I claimed earlier that this algorithm is actually quite fast.
Computing the product y takes essentially linear time.
You multiply pairs of x's,
then pairs of pairs, and so on until you have the product of all of them.
The bottleneck here is multiplying large integers,
which you do with the FFT and with more advanced tricks.
Once you have y
you then form a similar tree of remainders
to compute y modulo x1 squared, y modulo x2 squared, and so on,
again in essentially linear time,
and then dividing y modulo x1 squared by x1
produces exactly the y over x1 modulo x1 that I wanted.
These are standard algorithms,
although there's a continuing game of improving these algorithms
by constant factors.
In particular,
I'd like to draw attention to a technique introduced recently by Kramer,
which I call FFT doubling,
that speeds all of this up by a factor of about 1.5.
[Visual: myth-6 slide on computer right, shifting]
Okay.
Enough about these batch algorithms.
Let me move on to myth number 6.
The myth is that the number field sieve
involves two sieves.
For example, you sieve c and you sieve 611+c.
You could instead do a single sieve,
checking which 611+c's are smooth,
and then for those c's check which c's are smooth.
But here are some authors saying that two sieves are much faster.
Another way to say this myth is that Coppersmith's variant
of the number-field sieve is not practical.
[Pomerance commented after the talk that,
at the time, it wasn't clear that any form of the number-field sieve
was practical, except for special numbers having small polynomials.]
[Visual: reality slide on computer right, shifting]
But the reality is that Coppersmith's variant _is_ practical.
Instead of doing two sieves and then one or two second tests
you can do one sieve, one second test,
and then maybe another second test.
You check the smoothness of c
only for the occasional values of c where 611+c is smooth.
Actually the time to check the smoothness of c is negligible
so you can afford to check the smoothness of other functions of c.
Expand your notion of relations
to expand the set of useful inputs.
You can also look for very large prime factors in the numbers c.
The point is that you're spending only negligible time per c
for which 611+c is smooth,
so you can afford to do a lot more work on each of those c's.
This is certainly practical.
It obviously saves time as soon as the smoothness chance for 611+c
drops below S/T,
the ratio between the sieving time and the second-test time
that I mentioned before.
The smoothness chance drops subexponentially
while the ratio S/T drops as a power of the logarithm.
Once you've reached moderately large factorizations,
Coppersmith's variant obviously saves time.
Of course,
to do a more precise analysis,
you have to optimize a lot of parameters,
along the lines I mentioned before.
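In Python, the single-sieve idea looks like this for the 611 toy example. Here `smooth_part` is my own toy trial-division stand-in for both the sieve and the second test; a real implementation would sieve the 611+c side and run a fast batch test on the surviving c's:

```python
def smooth_part(n, primes):
    """Divide out the given primes; returns 1 exactly when n is smooth."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n

primes = [2, 3, 5, 7]
relations = []
for c in range(1, 101):
    if smooth_part(611 + c, primes) == 1:   # one sieve, over 611+c only
        # negligible extra work: reached only for the rare smooth 611+c
        if smooth_part(c, primes) == 1:
            relations.append(c)
print(relations)                            # -> [14, 64, 75]
```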
[Visual: myth-7 slide on computer right, shifting]
Myth number 7
is not a question of computer time.
This is a question of programmer time.
The myth says that the direct square-root method
is a bottleneck.
You remember the first square root that I mentioned,
I wrote 14 times 64 times et cetera
as a product of powers of primes,
each exponent being even,
and then divided each exponent in half.
There are three papers saying how you generalize this to number fields.
Instead of using prime factorizations
you can use the direct method
of simply multiplying the inputs together as a big integer
and computing the square root of that integer.
The myth says that you should not use the direct method,
maybe not that it is a complete disaster for the run time
but certainly that it is a serious part of the run time.
For example,
here are some authors
saying that using prime factorizations
is of great consequence for the overall running time.
[Visual: reality slide on computer right, shifting]
Well,
this is simply not true.
It is _not_ of great consequence for the overall running time.
You can simply use the direct square-root method.
With standard algorithms
that takes time y to the 1 plus epsilon
where y is the prime bound.
For comparison,
linear algebra takes time y to the 2 plus epsilon.
If you actually try this out
you find it's millions of times faster.
[Depending on y, obviously.]
The prime-factorization methods
might be marginally faster
but in this context it simply doesn't matter.
[Visual: myth-8 slide on computer right, shifting]
Myth number 8.
Until now I've been imagining
doing these computations on a desktop computer.
I used to imagine a Pentium,
now I imagine an Athlon, because it's better.
Anyway, it's a conventional computer.
A small processor attached to a large randomly accessible memory.
The processor follows various instructions to modify the memory.
The myth is that we want to minimize time
on a conventional computer.
The myth is that this minimizes real time.
Now when people say this
they realize that it's not true,
they make an exception for parallel computers.
People realize that
running a bunch of Athlons at the same time
can reduce the real time for the computation.
But people say that achieving a speedup factor of p from parallelism
requires that you buy p Athlons.
There's supposed to be some sort of theorem guaranteeing this.
[Visual: reality slide on computer right, shifting]
The reality is that
conventional computers do not minimize price-performance ratio.
The problem with this theorem about parallel speedups
is that the price of p computers
does not have to be p times as large as the price of one computer.
You can often split one computer
into two computers
where each computer has half the size.
More generally split into p computers
where each computer is only 1 over p of the size.
There are communication costs between the computers
but those costs do _not_ grow out of control
compared to the computation you're doing.
Eventually you end up with a mesh architecture,
where for a given computer size
you achieve asymptotically smaller time exponents
than you do on a conventional computer.
[Visual: vlsi-literature slide on computer right, shifting]
There are thousands of articles in the VLSI literature,
that's very-large-scale-integration,
pointing out speedups from mesh architectures
in all sorts of applications.
For example,
here's an example that's maybe even more convincing
than my usual example.
Let's consider multiplying two integers with n bits,
obviously an important problem.
The usual fast-multiplication algorithm
by Schoenhage and Strassen using the fast Fourier transform
takes time n to the 1 plus epsilon
on a conventional computer
with something like 8n bits of memory,
maybe a little more.
[Visual: Knuth slide on computer right, shifting]
Now something I saw in Knuth's book years ago,
but I never really thought through the consequences,
was a non-conventional computer that does the same thing.
In Knuth's words,
we leave the domain of conventional computer programming.
There's a mesh by Atrubin,
a bunch of processors operating in parallel,
with a total of I think 11n bits of memory,
that multiplies two n-bit numbers in time proportional to n
with a small constant.
Now you might not be too surprised by the log factor disappearing,
there are all sorts of machine-model changes
that can eliminate log factors,
but it's more interesting that you achieve this competitive performance
without using the FFT.
Later in the literature
you can find a 2-dimensional mesh of about the same total size
that takes time only n to the one half plus epsilon,
using the FFT.
This is obviously much faster for large n
than anything you can do with a conventional computer.
A couple of log factors cannot compete with a square root of n
once n is moderately large.
[Visual: similar-speedups slide on computer right, shifting]
For integer factorization we get similar exponent reductions.
Let me show you the numbers.
If you look at Coppersmith's 1993 article
you see a typical factorization machine stated in some detail.
That machine takes time L to the 1.901 something power,
where L is e to the product of the cube root of log n
and the two-thirds power of log log n.
There's also a little o of 1 in the exponent,
so 1.901 et cetera is the limiting exponent.
The space used by Coppersmith's machine
is the 0.950 power of L.
That's half of the 1.901.
If you use a 2-dimensional mesh
instead of a conventional computer
then you can reduce the time exponent from 1.901 to 1.426
with the same cost of the computer.
None of these numbers are optimal.
[Visual: new-parameters slide on computer right, shifting]
Let me skip to the middle of the slide.
If you change parameters
then you can reduce the time to the 1.185 power of L
and reduce the machine cost to the 0.790 power of L.
Also, Pomerance pointed out for conventional computers
that you can improve the price-performance ratio a little bit
by reducing the smoothness bound
and reducing the machine size to the 0.748 power,
while the time grows by a smaller amount.
So those are the final numbers.
Now some people say that my improvements here
from conventional computers to mesh computers
come from redefining the notion of cost
and inventing the notion of price-performance ratio.
But you can see that the numbers are simply better.
The time drops from L to the 2 point something power
down to L to the 1.185 power.
To summarize,
mesh architectures
are much more cost-effective than conventional architectures.
In the past I've said that the improvement is from circuits,
but what really matters is the mesh architecture.
[Visual: myth-9 slide on computer right, shifting]
Myth number 9 is that we don't have to think about this improvement
in designing algorithms.
Mesh computers simply speed up computations.
At some point we'll all have mesh Athlons on our desks
and our computations will magically run faster.
Later on we'll have quantum Athlons on our desks
for even higher speed.
To optimize algorithms for mesh computers
we can simply optimize algorithms for conventional computers
and then put them on mesh computers.
For example,
the 2-dimensional multiplication mesh
that I mentioned before
is a completely straightforward mesh implementation of an FFT.
It's something that a mesh compiler can do automatically.
[Visual: reality slide on computer right, shifting]
The reality is that switching from conventional computers to mesh computers
changes the comparison between algorithms.
Here's a simple example.
I'll have an even more frightening example later.
Take the elliptic-curve method
versus my original batch factoring algorithm.
My algorithm is faster on conventional computers than ECM.
But on mesh computers ECM has a smaller price-performance ratio.
[Visual: algorithm-analysis-culture slide on computer right, shifting]
I have a kind of philosophical slide here.
But somebody has to say this.
The current literature on analysis of algorithms
has to be completely rewritten.
Right now people talk about time on conventional computers,
count the number of operations.
Space is sometimes mentioned
but there's a clear lex order
where time is considered more important.
Well, that does not make economic sense.
You have to rethink all your algorithm analysis
to account equally for the performance and the _price_ of your computers.
All sorts of tradeoffs,
for example baby-step-giant-step,
have different effects once you start paying attention to price.
[Visual: adventures-in-mesh-programming slide on computer right, shifting]
Let me say a little bit about what you have to do
if you actually want to start building machines and algorithms
for mesh architectures,
which all of us are going to be doing eventually.
There's a programming language for building circuits,
a programming language for mesh architectures,
called Verilog.
You can think of Verilog as being like C.
There are a few of these languages,
not nearly as many as for conventional computers.
I made the mistake of starting with a language called VHDL.
Nobody told me that Verilog was better.
Don't use VHDL unless you're an Ada programmer.
Anyway,
there are a few free tools to run programs in this Verilog language,
the best developed seems to be Icarus Verilog.
This program simulates a mesh architecture on your desktop computer,
for example your Linux computer.
Obviously you can't get the kind of price-performance exponents
I was talking about before if you do this,
but at least you can see that your programs work,
and you can also predict how fast they will be on a mesh architecture.
One level of building circuits for a computation
is writing Verilog programs and simulating them on a conventional computer.
[Visual: fpga slide on computer right, shifting]
A more convincing level of building circuits
is actually using a mesh computer
and carrying out your computations at high speed.
An FPGA, a field-programmable gate array,
is a general-purpose mesh computer
that you can buy right now
and run your Verilog programs on
at much higher speeds than what you can do by simulation.
An FPGA is an order of magnitude slower than a dedicated circuit,
but it has the right cost exponents.
One problem here is that compilers aren't very smart.
They're fairly smart but not _very_ smart;
just as for conventional computers,
you gain some speed by writing code in assembly language.
There are tools for circuits
called manual-place-and-route tools
that you can use to say
this part of your circuit is over here
and this part is over there
to make sure you're taking full advantage of the mesh.
Now if you want to predict what a well-funded attacker will do---
I see in the audience a few representatives of a well-funded attacker---
then you have to imagine building ASICs.
An ASIC is a dedicated circuit
that runs a single Verilog program at the best possible speed.
Building _one_ ASIC is really expensive.
This doesn't become cost-effective until you have
many copies of the chip to build.
But if you do build a lot of ASICs
then you get an order of magnitude better price-performance than for FPGAs.
[Visual: several-companies slide on computer right, shifting]
It's a little annoying how much work it takes
to design a circuit.
You can buy some higher-level programming tools,
for example from SRC,
which is Cray's last company,
tools that make it easier to write programs for mesh architectures,
as long as you're willing to accept a constant-factor slowdown.
I've been getting into this game.
I want it to be as easy as possible to experiment,
to prototype new circuits.
So I've been building tools
that simplify the creation of large meshes for the operations I care about.
These meshes might be a factor of 50 slower than an ASIC
but I'm willing to sacrifice that,
at least initially,
just to get something working.
What I'm aiming for
is completely convincing concrete numbers
for a well-optimized circuit for integer factorization.
Here's the circuit,
here's a demonstration that it works,
here's how fast it is.
At least at the simulation level.
I haven't decided yet whether I'm going to go ahead
and build actual physical circuits
to factor some specific numbers really quickly,
although obviously that would make the results even more convincing.
Until I'm done I don't want to speculate as to what the results will be.
There are already more than enough articles
with speculations at various levels of detail.
[Visual: myth-10 slide on computer right, shifting]
Let me finish off with myth number 10.
The myth here is that the quadratic sieve is faster than
the elliptic-curve method.
Earlier on
I was talking about ECM as a method of finding small factors
of the auxiliary numbers that show up
inside the quadratic sieve, the number field sieve, and so on.
But much simpler is to apply ECM directly
to the number you want to factor.
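Stage 1 of ECM applied directly to n can be sketched in a few lines of Python. This is a toy version with affine coordinates; all names and parameter choices here are mine, not from the talk. The key event is a failed modular inversion, which exposes a factor of n through a gcd.

```python
from math import gcd
from random import randrange

def ecm_stage1(n, B=100, curves=50):
    # Toy stage 1 of Lenstra's elliptic-curve method, affine coordinates.
    # For each random curve y^2 = x^3 + a*x + b (mod n) through a random
    # point, multiply the point by every prime power up to B; a failed
    # modular inversion along the way exposes a factor of n via a gcd.
    # Requires Python 3.8+ for pow(den, -1, n).
    def add(P, Q, a):
        if P is None: return Q
        if Q is None: return P
        x1, y1 = P
        x2, y2 = Q
        if x1 == x2 and (y1 + y2) % n == 0:
            return None                        # point at infinity
        if P == Q:
            num, den = 3 * x1 * x1 + a, 2 * y1
        else:
            num, den = y2 - y1, x2 - x1
        g = gcd(den % n, n)
        if g > 1:
            raise ZeroDivisionError(g)         # g divides n (maybe trivially)
        lam = num * pow(den, -1, n) % n
        x3 = (lam * lam - x1 - x2) % n
        return (x3, (lam * (x1 - x3) - y1) % n)

    def mul(k, P, a):                          # double-and-add
        R = None
        while k:
            if k & 1:
                R = add(R, P, a)
            P = add(P, P, a)
            k >>= 1
        return R

    primes = [p for p in range(2, B + 1)
              if all(p % d for d in range(2, int(p ** 0.5) + 1))]
    for _ in range(curves):
        x, y, a = randrange(n), randrange(n), randrange(n)
        P = (x, y)                   # b = y^2 - x^3 - a*x, implicitly
        try:
            for p in primes:
                pk = p
                while pk * p <= B:   # largest power of p not exceeding B
                    pk *= p
                P = mul(pk, P, a)
                if P is None:
                    break
        except ZeroDivisionError as e:
            g = e.args[0]
            if 1 < g < n:
                return g
    return None
```

Each curve is an independent trial, which is exactly why ECM parallelizes with essentially no communication.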
The myth is that this is slower than the quadratic sieve
for finding really big factors,
for factoring RSA keys.
Specifically,
people say that the elliptic curve method
is slower by a log factor
because it has to do multiprecision arithmetic
while the quadratic sieve is mostly doing just single-precision arithmetic.
Actually,
there are several other run-time factors to compare,
but they all roughly cancel.
In particular,
the quadratic sieve
wants slightly larger numbers to be smooth,
maybe not as large as I wrote here
if you optimize the early aborts,
but certainly at most y times as large,
which reduces the smoothness probability
by a log y factor.
Then the quadratic sieve needs something like y over log y smooth inputs,
while ECM needs only one smooth input,
but ECM takes about y operations per input.
So the bottom line is that the quadratic sieve is faster
by roughly a log factor.
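In symbols, the myth's bookkeeping might be summarized like this (the notation p, y, M is mine, not from the talk):

```latex
% p = smoothness probability of ECM-sized auxiliary numbers
% y = smoothness bound: QS needs about y/log y smooth inputs,
%     and its candidates are about y times larger, so their
%     smoothness probability drops to about p/log y
% M = cost of one multiprecision operation, roughly a log factor
\begin{align*}
\text{ECM cost} &\approx \frac{1}{p} \cdot y \cdot M
                 = \frac{y}{p}\,M,\\
\text{QS cost}  &\approx \frac{y}{\log y} \cdot \frac{\log y}{p} \cdot O(1)
                 = \frac{y}{p}\,O(1).
\end{align*}
```

The log-y factors cancel, leaving the ratio at about M: the multiprecision penalty that the myth charges against ECM.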
Well, that's the myth.
[Visual: reality slide on computer right, shifting]
The reality
is that ECM has a better price-performance ratio
than the quadratic sieve.
The reason is that there's a communication cost
in the linear-algebra stage of the quadratic sieve
whereas the elliptic-curve method has basically no communication at all.
If you optimize price-performance on mesh architectures,
instead of number of operations on conventional architectures,
then you see that the comparison between ECM and the quadratic sieve
is reversed.
Instead of a polylog factor in favor of the quadratic sieve
there's a _subexponential_ factor in favor of _ECM_.
I've been building circuits for ECM.
The bottleneck, of course, is integer multiplication.
The story of my life.
It's important to realize
that the original Schoenhage-Strassen fast multiplication method
is actually pretty damn good for circuits.
There are some subsequent improvements that you should use,
this is a subject of continuing research,
but even the _original_ method
is better than what chip designers are doing in their multipliers today.
If you're going to build a multiplication circuit
you should definitely use Schoenhage-Strassen.
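The underlying idea can be illustrated in Python. This is not Schoenhage-Strassen itself, which recurses modulo Fermat numbers; it is the same FFT-multiplication principle, sketched with a number-theoretic transform modulo one FFT-friendly prime of my choosing.

```python
# FFT-based integer multiplication, the principle behind Schoenhage-
# Strassen (the real method recurses modulo Fermat numbers). Here:
# a number-theoretic transform mod a single FFT-friendly prime.
P = 998244353          # = 119 * 2^23 + 1, primitive root 3

def ntt(a, invert=False):
    # In-place iterative Cooley-Tukey transform over Z/PZ; len(a) must
    # be a power of 2.
    n = len(a)
    j = 0
    for i in range(1, n):              # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w = pow(3, (P - 1) // length, P)
        if invert:
            w = pow(w, P - 2, P)
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * wn % P
                a[k] = (u + v) % P
                a[k + length // 2] = (u - v) % P
                wn = wn * w % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        for i in range(n):
            a[i] = a[i] * n_inv % P

def multiply(x, y, base=10):
    # Split the integers into base-10 digits, convolve via the NTT,
    # then let Horner evaluation at the base absorb the carries.
    def digits(v):
        d = []
        while v:
            d.append(v % base)
            v //= base
        return d or [0]
    a, b = digits(x), digits(y)
    n = 1
    while n < len(a) + len(b):
        n <<= 1
    a += [0] * (n - len(a))
    b += [0] * (n - len(b))
    ntt(a); ntt(b)
    c = [a[i] * b[i] % P for i in range(n)]
    ntt(c, invert=True)
    result = 0
    for coeff in reversed(c):
        result = result * base + coeff
    return result
```

This toy version only works while the convolution coefficients stay below P; the serious methods avoid that limit, but the transform structure, which is what maps so well onto a mesh, is the same.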
I should mention that years ago
Pomerance gave a talk pointing out how scary it was
to have a bunch of desktop computers running the elliptic-curve method
for factoring big numbers,
512 bits, maybe 1024 bits.
The average run time is pretty good and the variance is huge.
At any moment, one of those computers could suddenly say here's a factor.
Well,
what I'm proposing is even more extreme.
Imagine replacing your desktop computer
with thousands of little elliptic-curve units.
If you were scared of ECM before
then you should be _really_ scared of it now.
[Edlyn Teske's baby is crying in the background.]
I see that _someone_ is scared already
so I'd better stop there.
Thank you for your attention.
[Question from Eric Bach about catching up to Intel.]
[Question from Peter Montgomery about transistor counts.]
[Question from Carl Pomerance about Coppersmith implementation.]