D. J. Bernstein
High-speed cryptography

Fall 2006 course at the Fields Institute

2006.10.13

I'm teaching a block course ``High-speed cryptography'' in Toronto starting in a week and a half. Here's the lecture schedule:

Week  Day  Date        Time
1     Mon  2006.10.23  15:30-17:00
1     Tue  2006.10.24  11:00-12:30, 15:30-17:00
1     Wed  2006.10.25  11:00-12:30, 15:30-17:00
1     Thu  2006.10.26  11:00-12:30, 15:30-17:00
1     Fri  2006.10.27  11:00-12:30
2     Mon  2006.10.30  no lectures (students attend CCANTC workshop)
2     Tue  2006.10.31  no lectures (students attend CCANTC workshop)
2     Wed  2006.11.01  no lectures (students attend CCANTC workshop)
2     Thu  2006.11.02  no lectures (students attend CCANTC workshop)
2     Fri  2006.11.03  no lectures (students attend CCANTC workshop)
3     Mon  2006.11.06  15:30-17:00
3     Tue  2006.11.07  11:00-12:30, 15:30-17:00
3     Wed  2006.11.08  11:00-12:30, 15:30-17:00
3     Thu  2006.11.09  11:00-12:30, 15:30-17:00
3     Fri  2006.11.10  11:00-12:30
4     Mon  2006.11.13  no lectures (students work on course projects)
4     Tue  2006.11.14  no lectures (students work on course projects)
4     Wed  2006.11.15  no lectures (students work on course projects)
4     Thu  2006.11.16  15:30-17:00 (project reports and discussion)
4     Fri  2006.11.17  11:00-12:30 (project reports and discussion)

All lectures will take place in room 230 at the Fields Institute at 222 College Street at the University of Toronto.

Course contents. Anyone with access to the Internet can intercept mail messages and web pages and other packets of data to see what they say. He can also send fake messages that look like real messages.

Cryptography protects messages sent through the Internet. Cryptography scrambles and unscrambles packets to protect against forgery and against espionage. An attacker who forges a message won't be able to scramble it in the right way, so when we unscramble it we'll see that it's a forgery and that we should throw it away. An attacker who intercepts a scrambled credit-card number won't be able to figure out the original number.

High-speed cryptography is important for busy network servers. For example, here's a quote from Cricket Liu, author of several books on the Internet's Domain Name System: ``Verifying signed resource records [i.e., performing cryptographic operations] is computationally intensive [i.e., is painfully slow]. This has slowed deployment of DNS security.'' As a practical matter, DNS security still doesn't exist; attackers can seize control over incoming mail and outgoing web pages by forging DNS records. The users need faster software to scramble DNS records.

This course will focus on the four fundamental tasks in cryptography:

For each task, the course will explain a state-of-the-art function solving the task, explain what is known and what is believed about the security of the function, explain state-of-the-art methods to compute the function on several common computer architectures, and survey the alternative functions discussed in the literature.

Prerequisites and non-prerequisites. Students are expected to have some experience with algorithm analysis (figuring out why a computer hasn't finished a computation yet) and algorithm improvement (making the computer finish the computation more quickly). A typical student will have studied some chapters of a textbook by Cormen, Leiserson, Rivest, and Stein; will have heard that heapsort ``takes time n log n times, um, some constant''; and will have seen many algorithm improvements incorrectly described as ``optimizations.''

The most obvious difference in flavor between this course and a typical algorithm course is that this course will pay attention to constants. For example, this course will explain how a state-of-the-art function takes half a millisecond on my laptop to compute a high-security shared secret from a public conversation. Ignoring constant factors would remove all content from this statement.

Of course, these constants depend on the choice of computer. Sometimes one computer takes 10 clock cycles for an operation while another computer takes just 1 clock cycle; this affects the time for any algorithm built on top of that operation. Students might have seen some computer-dependent algorithm analysis and algorithm improvement in courses on architecture, assembly language, ``optimizing'' compilers, supercomputing, etc. None of those courses are prerequisites; this course will include a detailed introduction to the speed of real computers.

Many higher-level speedups are also constant-factor speedups. In fact, most of the speedups in the algorithm-research literature are constant-factor speedups. However, students will not be presumed to have any previous experience handling constant factors in algorithm analysis.

The other obvious difference in flavor between this course and other algorithm courses is the choice of functions being computed. A typical introductory algorithm course spends time on sorting, for example, and on various graph operations such as shortest paths. This course instead focuses on a few critical cryptographic operations. Those operations rely on some mathematics that students might have seen in previous courses on algebra, number theory, or cryptography. None of those are prerequisites; this course will include an introduction to the relevant mathematics.

2006.10.25

Here's what I've done so far:

2006.10.27

All of my recent cryptographic speed records have been set with the help of qhasm, a private set of programming tools whose sole goal is to reduce the amount of time needed to write high-speed software. Writing a high-speed computation using qhasm is

Here's a portion of the latest qhasm rewrite: qhasm-20061027.tar.gz. This doesn't include qhasm's cycle counter, range analyzer, and scheduler, but it does include qhasm's instruction decoder, register allocator (completely rewritten this week to be much faster), x86 floating-point stack handler, and assembler. What this means, in a nutshell, is that you can write .q files and feed them to qhasm-x86 to produce .s files in traditional x86 assembly language. Those .s files, in turn, can be linked easily with C and C++ programs.

These tools are nowhere near the polished state that I like to reach before releasing software. I haven't even had time for minimal tests! I nevertheless feel completely comfortable recommending qhasm as the programming environment for the course projects, simply because the alternative environments do such a bad job of producing high-speed software. If something in qhasm doesn't work, let me know, and I'll have it fixed promptly.

An example of a .q file is addvector.q. To turn this into addvector.s, first download and compile qhasm:

     wget https://cr.yp.to/qhasm/qhasm-20061027.tar.gz
     gunzip < qhasm-20061027.tar.gz | tar -xf -
     cd qhasm-20061027
     ./do
Then download addvector.q in the qhasm-20061027 directory and feed it through qhasm-x86:
     wget https://cr.yp.to/qhasm/addvector.q
     ./qhasm-x86 < addvector.q > addvector.s
There's an addvector.c program that tests the result:
     wget https://cr.yp.to/qhasm/addvector.c
     gcc -o addvector addvector.c addvector.s
     ./addvector
To see more of what can be in a .q file, look at the X86 file in the qhasm-20061027 directory. The X86 file lists .q instructions and then, after a colon, information about the translation of those instructions into traditional assembly language.

2006.10.28

Here's what I did in the rest of the first week (Thursday morning, Thursday afternoon, Friday morning):

2006.11.03

Course lectures continue Monday!

In response to requests for supporting additional CPUs, I've posted a new version of qhasm that includes qhasm-ppc32-linux, qhasm-ppc32-macos, qhasm-sparc, and qhasm-x86. Here's how to download and compile the new version:

     wget https://cr.yp.to/qhasm/qhasm-20061102.tar.gz
     gunzip < qhasm-20061102.tar.gz | tar -xf -
     cd qhasm-20061102
     ./do
Beware that this version also has a new syntax for handling CPU flags. Previously you were allowed to say things like
     counter -= 1
     ...
     goto topofloop if !=
where the counter -= 1 instruction implicitly sets the CPU's equality flag and the goto instruction explicitly uses the flag. You now have to explicitly indicate the flags that you'll be testing later:
     =? counter -= 1
     ...
     goto topofloop if !=
This allows qhasm to check for a common assembly-language error, namely intermediate instructions (...) that implicitly set the flag; there are quite a few flag-setting instructions, especially on the x86 architecture. The new syntax also allows comprehensible selection of different flag options on architectures providing many options, such as the PowerPC.

This version of qhasm needs even more testing than the previous version. I'll be doing a lot of testing this weekend: one of my projects for the next few days is writing a new implementation of the Curve25519 elliptic-curve-Diffie-Hellman function, one of the functions that we'll explore next week. The original Curve25519 software is written in x86 (Pentium, Athlon, etc.) assembly language and relies heavily on 80-bit (extended-precision, 64-bit mantissa) floating-point registers; the new project is to handle other CPUs (UltraSPARC, PowerPC, etc.) using 64-bit (double-precision, 53-bit mantissa) floating-point registers. I'm planning to post code samples here---stay tuned.

2006.11.03, continued

We'll see next week that the Curve25519 computation involves thousands of multiplications modulo the prime 2^255-19. The computation also has thousands of additions and some other work, but those multiplications are the main bottleneck.

How can I efficiently decompose multiplication mod 2^255-19 into the smaller instructions available on a real CPU? The answer depends on the CPU. Let's start with my teaching tool, the UltraSPARC III, before moving on to more popular CPUs.

Recall that the computational highlight of the UltraSPARC is the floating-point multiplier, which can very quickly compute the product of two double-precision floating-point numbers. The multiplier rounds its output to 53-bit precision, but it guarantees exact results for products that already fit into 53 bits, such as integers in the range {-2^53+1,...,-1,0,1,...,2^53-1}. The multiplier's maximum throughput is one product per CPU cycle.

Experience tells me that the floating-point adder is another bottleneck. The adder's maximum throughput is one addition per CPU cycle. So I'll carefully count the number of floating-point additions and the number of floating-point multiplications. Hopefully there won't be any other bottlenecks.

Integer representation. I'll try breaking 255-bit numbers into 12 pieces, each piece having 22 bits. Two pieces will have a 44-bit product, so I should be able to add quite a few of those products without overflowing the 53-bit mantissa of a double-precision floating-point number. Of course, I'll have to carefully check the final code to guarantee that there aren't any accidental overflows, and in the end it might turn out that more than 12 pieces are necessary, but for the moment I'll simply assume that 12 pieces will work.

You might object that 255 isn't divisible by 12: the ratio 255/12 = 21.25 isn't an integer. How exactly do I break a 255-bit number into 12 pieces? Answer: I'll write a 255-bit number f as a sum f_0 + f_22 + f_43 + f_64 + f_85 + f_107 + f_128 + f_149 + f_170 + f_192 + f_213 + f_234 where each f_i is a small multiple of 2^i. The indices i here, namely 0 22 43 etc., are ceilings of multiples of 21.25.

Each f_i fits easily into a floating-point register, so f is represented as a vector of 12 values (f_0,f_22,...,f_234) stored in floating-point registers. All required operations on integers modulo 2^255-19 will be split into floating-point operations.
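
As a quick sanity check (this little program is mine, not part of the course software), the 12 exponents are exactly the ceilings of 21.25 i for i = 0, 1, ..., 11:
     #include <stdio.h>

     int main(void)
     {
       int i;
       for (i = 0;i < 12;++i)
         printf("%d\n",(85 * i + 3) / 4); /* ceiling of 21.25 i: 0 22 43 64 85 107 128 149 170 192 213 234 */
       return 0;
     }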

Addition: 12 fadd + 0 fmul. If f = f_0 + f_22 + f_43 + f_64 + f_85 + f_107 + f_128 + f_149 + f_170 + f_192 + f_213 + f_234 and g = g_0 + g_22 + g_43 + g_64 + g_85 + g_107 + g_128 + g_149 + g_170 + g_192 + g_213 + g_234 then f + g = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234 where h_i = f_i + g_i.

Computing (h_0,h_22,...,h_234) given (f_0,f_22,...,f_234) and (g_0,g_22,...,g_234) takes 12 floating-point additions if f_i and g_i are sufficiently small. Of course, the bound on h_i is larger than the bounds on f_i and g_i, so a series of additions requires occasional carries, but there aren't any chains of additions in Curve25519.

Here's a simple C function that reads (f_0,f_22,...,f_234) from f[0],f[1],...,f[11], reads (g_0,g_22,...,g_234) from g[0],g[1],...,g[11], and puts (h_0,h_22,...,h_234) into h[0],h[1],...,h[11]:

     void curve25519_add(double h[12],const double f[12],const double g[12])
     {
       h[0] = f[0] + g[0];
       h[1] = f[1] + g[1];
       h[2] = f[2] + g[2];
       h[3] = f[3] + g[3];
       h[4] = f[4] + g[4];
       h[5] = f[5] + g[5];
       h[6] = f[6] + g[6];
       h[7] = f[7] + g[7];
       h[8] = f[8] + g[8];
       h[9] = f[9] + g[9];
       h[10] = f[10] + g[10];
       h[11] = f[11] + g[11];
     }
Perhaps I should comment that, in C, array parameters are actually pointers, so h[0],...,h[11] are stored in the location specified by the function caller. I could have written the same function as
     void curve25519_add(double *h,const double *f,const double *g)
     {
       h[0] = f[0] + g[0];
       h[1] = f[1] + g[1];
       h[2] = f[2] + g[2];
       h[3] = f[3] + g[3];
       h[4] = f[4] + g[4];
       h[5] = f[5] + g[5];
       h[6] = f[6] + g[6];
       h[7] = f[7] + g[7];
       h[8] = f[8] + g[8];
       h[9] = f[9] + g[9];
       h[10] = f[10] + g[10];
       h[11] = f[11] + g[11];
     }
and would have had to do so to make the function also work in C++. On the other hand, I like seeing [12] as a reminder that I'm working with 12-component vectors.

Anyway, the line

       h[0] = f[0] + g[0]
actually produces four CPU instructions: load f[0] into a register, load g[0] into a register, add the two registers, and store the result in h[0]. To see these CPU instructions more clearly in the code, I split the line into several lines:
       double f0; /* set aside double-precision storage location f0 */
       double g0; /* set aside double-precision storage location g0 */
       double h0; /* set aside double-precision storage location h0 */
       f0 = f[0]; /* copy 8 bytes from memory location f[0] to register f0 */
       g0 = g[0]; /* copy 8 bytes from memory location g[0] to register g0 */
       h0 = f0 + g0; /* add register f0 to register g0 */
       h[0] = h0; /* copy result from register h0 to memory location h[0] */
I did the same for the other lines, producing the following equivalent C function:
     void curve25519_add(double *h,const double *f,const double *g)
     {
       double f0;
       double f1;
       double f2;
       double f3;
       double f4;
       double f5;
       double f6;
       double f7;
       double f8;
       double f9;
       double f10;
       double f11;

       double g0;
       double g1;
       double g2;
       double g3;
       double g4;
       double g5;
       double g6;
       double g7;
       double g8;
       double g9;
       double g10;
       double g11;

       double h0;
       double h1;
       double h2;
       double h3;
       double h4;
       double h5;
       double h6;
       double h7;
       double h8;
       double h9;
       double h10;
       double h11;
     
       f0 = f[0];
       f1 = f[1];
       f2 = f[2];
       f3 = f[3];
       f4 = f[4];
       f5 = f[5];
       f6 = f[6];
       f7 = f[7];
       f8 = f[8];
       f9 = f[9];
       f10 = f[10];
       f11 = f[11];
     
       g0 = g[0];
       g1 = g[1];
       g2 = g[2];
       g3 = g[3];
       g4 = g[4];
       g5 = g[5];
       g6 = g[6];
       g7 = g[7];
       g8 = g[8];
       g9 = g[9];
       g10 = g[10];
       g11 = g[11];
     
       h0 = f0 + g0;
       h1 = f1 + g1;
       h2 = f2 + g2;
       h3 = f3 + g3;
       h4 = f4 + g4;
       h5 = f5 + g5;
       h6 = f6 + g6;
       h7 = f7 + g7;
       h8 = f8 + g8;
       h9 = f9 + g9;
       h10 = f10 + g10;
       h11 = f11 + g11;
     
       h[0] = h0;
       h[1] = h1;
       h[2] = h2;
       h[3] = h3;
       h[4] = h4;
       h[5] = h5;
       h[6] = h6;
       h[7] = h7;
       h[8] = h8;
       h[9] = h9;
       h[10] = h10;
       h[11] = h11;
     }
You might be wondering how the UltraSPARC, with only 32 double-precision floating-point registers, can handle the 36 values f0,f1,...,g0,g1,...,h0,h1,... declared here. The answer is that several values can be assigned to the same register if they aren't ``alive'' at the same time. For example, I peeked at the C compiler's output and saw that it had translated
       f0 = f[0];
       g0 = g[0];
       h0 = f0 + g0;
       h[0] = h0;
into (essentially)
       %f8 = f[0];
       %f32 = g[0];
       %f8 = %f8 + %f32;
       h[0] = %f8;
putting both f0 and h0 into register %f8.

The above C code is still hiding an important aspect of the corresponding CPU instructions, namely the multiplication involved in array indexing. Loading f[i] from memory means taking the supplied address f, adding 8*i bytes to it to find the address of f[i], and then loading the 8-byte floating-point number at that address. To see the CPU instructions more clearly, I replaced the double pointers by byte pointers:

     void curve25519_add(char *h,char *f,char *g)
     {
       double f0;
       double f1;
       double f2;
       double f3;
       double f4;
       double f5;
       double f6;
       double f7;
       double f8;
       double f9;
       double f10;
       double f11;
     
       double g0;
       double g1;
       double g2;
       double g3;
       double g4;
       double g5;
       double g6;
       double g7;
       double g8;
       double g9;
       double g10;
       double g11;
     
       double h0;
       double h1;
       double h2;
       double h3;
       double h4;
       double h5;
       double h6;
       double h7;
       double h8;
       double h9;
       double h10;
       double h11;
     
       f0 = *(double *) (f + 0);
       f1 = *(double *) (f + 8);
       f2 = *(double *) (f + 16);
       f3 = *(double *) (f + 24);
       f4 = *(double *) (f + 32);
       f5 = *(double *) (f + 40);
       f6 = *(double *) (f + 48);
       f7 = *(double *) (f + 56);
       f8 = *(double *) (f + 64);
       f9 = *(double *) (f + 72);
       f10 = *(double *) (f + 80);
       f11 = *(double *) (f + 88);
     
       g0 = *(double *) (g + 0);
       g1 = *(double *) (g + 8);
       g2 = *(double *) (g + 16);
       g3 = *(double *) (g + 24);
       g4 = *(double *) (g + 32);
       g5 = *(double *) (g + 40);
       g6 = *(double *) (g + 48);
       g7 = *(double *) (g + 56);
       g8 = *(double *) (g + 64);
       g9 = *(double *) (g + 72);
       g10 = *(double *) (g + 80);
       g11 = *(double *) (g + 88);
     
       h0 = f0 + g0;
       h1 = f1 + g1;
       h2 = f2 + g2;
       h3 = f3 + g3;
       h4 = f4 + g4;
       h5 = f5 + g5;
       h6 = f6 + g6;
       h7 = f7 + g7;
       h8 = f8 + g8;
       h9 = f9 + g9;
       h10 = f10 + g10;
       h11 = f11 + g11;
     
       *(double *) (h + 0) = h0;
       *(double *) (h + 8) = h1;
       *(double *) (h + 16) = h2;
       *(double *) (h + 24) = h3;
       *(double *) (h + 32) = h4;
       *(double *) (h + 40) = h5;
       *(double *) (h + 48) = h6;
       *(double *) (h + 56) = h7;
       *(double *) (h + 64) = h8;
       *(double *) (h + 72) = h9;
       *(double *) (h + 80) = h10;
       *(double *) (h + 88) = h11;
     }
The corresponding qhasm code, curve25519_add.q, is very similar:
     enter curve25519_add
     
     int64 h
     int64 f
     int64 g
     
     input h
     input f
     input g
     
     float64 f0
     float64 f1
     float64 f2
     float64 f3
     float64 f4
     float64 f5
     float64 f6
     float64 f7
     float64 f8
     float64 f9
     float64 f10
     float64 f11

     float64 g0
     float64 g1
     float64 g2
     float64 g3
     float64 g4
     float64 g5
     float64 g6
     float64 g7
     float64 g8
     float64 g9
     float64 g10
     float64 g11

     float64 h0
     float64 h1
     float64 h2
     float64 h3
     float64 h4
     float64 h5
     float64 h6
     float64 h7
     float64 h8
     float64 h9
     float64 h10
     float64 h11
     
     f0 = *(float64 *) (f + 0)
     f1 = *(float64 *) (f + 8)
     f2 = *(float64 *) (f + 16)
     f3 = *(float64 *) (f + 24)
     f4 = *(float64 *) (f + 32)
     f5 = *(float64 *) (f + 40)
     f6 = *(float64 *) (f + 48)
     f7 = *(float64 *) (f + 56)
     f8 = *(float64 *) (f + 64)
     f9 = *(float64 *) (f + 72)
     f10 = *(float64 *) (f + 80)
     f11 = *(float64 *) (f + 88)
    
     g0 = *(float64 *) (g + 0)
     g1 = *(float64 *) (g + 8)
     g2 = *(float64 *) (g + 16)
     g3 = *(float64 *) (g + 24)
     g4 = *(float64 *) (g + 32)
     g5 = *(float64 *) (g + 40)
     g6 = *(float64 *) (g + 48)
     g7 = *(float64 *) (g + 56)
     g8 = *(float64 *) (g + 64)
     g9 = *(float64 *) (g + 72)
     g10 = *(float64 *) (g + 80)
     g11 = *(float64 *) (g + 88)
   
     h0 = f0 + g0
     h1 = f1 + g1
     h2 = f2 + g2
     h3 = f3 + g3
     h4 = f4 + g4
     h5 = f5 + g5
     h6 = f6 + g6
     h7 = f7 + g7
     h8 = f8 + g8
     h9 = f9 + g9
     h10 = f10 + g10
     h11 = f11 + g11
     
     *(float64 *) (h + 0) = h0
     *(float64 *) (h + 8) = h1
     *(float64 *) (h + 16) = h2
     *(float64 *) (h + 24) = h3
     *(float64 *) (h + 32) = h4
     *(float64 *) (h + 40) = h5
     *(float64 *) (h + 48) = h6
     *(float64 *) (h + 56) = h7
     *(float64 *) (h + 64) = h8
     *(float64 *) (h + 72) = h9
     *(float64 *) (h + 80) = h10
     *(float64 *) (h + 88) = h11
     
     leave
This qhasm code says int64 h to set aside a 64-bit integer register for h, float64 h3 to set aside a 64-bit floating-point register for h3, etc. Like a C compiler, the qhasm register allocator can put several values into one register if the values' lifetimes don't overlap; the qhasm register allocator is extremely good at this. Unlike a C compiler, the qhasm register allocator doesn't silently insert loads and stores if it can't figure out how to fit all live values into registers; loads and stores have to be explicitly indicated by higher-level tools or by users.

After writing curve25519_add.q I wrote curve25519_addtest.c to test the addition function on a not-very-random input:

     #include <stdio.h>
     #include <stdlib.h>
     
     double powers[12] = {
       1.0 /* 2^0 */
     , 4194304.0 /* 2^22 */
     , 8796093022208.0 /* 2^43 */
     , 18446744073709551616.0 /* 2^64 */
     , 38685626227668133590597632.0 /* 2^85 */
     , 162259276829213363391578010288128.0 /* 2^107 */
     , 340282366920938463463374607431768211456.0 /* 2^128 */
     , 713623846352979940529142984724747568191373312.0 /* 2^149 */
     , 1496577676626844588240573268701473812127674924007424.0 /* 2^170 */
     , 6277101735386680763835789423207666416102355444464034512896.0 /* 2^192 */
     , 13164036458569648337239753460458804039861886925068638906788872192.0 /* 2^213 */
     , 27606985387162255149739023449108101809804435888681546220650096895197184.0 /* 2^234 */
     } ;
     
     double f[12];
     double g[12];
     double h[12];
     
     extern void curve25519_add(double [12],const double [12],const double [12]);
     
     int main(int argc,char **argv)
     {
       int i;
       for (i = 0;i < 12;++i) f[i] = random() * powers[i];
       printf("f=0"); for (i = 0;i < 12;++i) printf("%+.0f",f[i]); printf(";\n");
       for (i = 0;i < 12;++i) g[i] = random() * powers[i];
       printf("g=0"); for (i = 0;i < 12;++i) printf("%+.0f",g[i]); printf(";\n");
       curve25519_add(h,f,g);
       printf("h=0"); for (i = 0;i < 12;++i) printf("%+.0f",h[i]); printf(";\n");
       printf("h-(f+g)\n");
       return 0;
     }
I compiled curve25519_addtest.c together with curve25519_add.q, and fed the output of curve25519_addtest through gp:
     ./qhasm-sparc < curve25519_add.q > curve25519_add.s
     gcc -m64 -o curve25519_addtest curve25519_addtest.c curve25519_add.s
     ./curve25519_addtest | gp
The -m64 option to gcc is essential for the UltraSPARC; otherwise you end up with code compiled for the original 32-bit SPARCs. Anyway, gp reported that h-(f+g) was indeed 0. I won't worry about doing more serious tests at this point.

More operations next time.

2006.11.04

I decided to rename f0,f1,f2,... as f0,f22,f43,... for consistency with the f_0, f_22, ... notation that I'm using in formulas:
     float64 f0
     float64 f22
     float64 f43
     ...
     float64 g0
     float64 g22
     float64 g43
     ...
     float64 h0
     float64 h22
     float64 h43
     ...
     f0 = *(float64 *) (f + 0)
     f22 = *(float64 *) (f + 8)
     f43 = *(float64 *) (f + 16)
     ...
     g0 = *(float64 *) (g + 0)
     g22 = *(float64 *) (g + 8)
     g43 = *(float64 *) (g + 16)
     ...
     h0 = f0 + g0
     h22 = f22 + g22
     h43 = f43 + g43
     ...
     *(float64 *) (h + 0) = h0
     *(float64 *) (h + 8) = h22
     *(float64 *) (h + 16) = h43
qhasm also allows underscores in names, but I'm not interested in typing quite that much. :-)

Here's how to download and test the revised addition software on an UltraSPARC:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_add.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_addtest.c
     ./qhasm-sparc < curve25519_add.q > curve25519_add.s
     gcc -m64 -o curve25519_addtest curve25519_addtest.c curve25519_add.s
     ./curve25519_addtest | gp
Now let's look at some other integer operations.

Subtraction: 12 fadd + 0 fmul. Subtraction has the same cost as addition: simply replace

     h0 = f0 + g0
with
     h0 = f0 - g0
etc. Of course, the resulting h_i is often negative, but that's okay; floating-point numbers are allowed to be negative.

Each cycle, the UltraSPARC's floating-point adder handles one addition or one subtraction. I'll count subtractions as if they were additions.

Here's the software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_sub.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_subtest.c
     ./qhasm-sparc < curve25519_sub.q > curve25519_sub.s
     gcc -m64 -o curve25519_subtest curve25519_subtest.c curve25519_sub.s
     ./curve25519_subtest | gp

Multiplication, schoolbook method: 121 fadd + 144 fmul. If f = f_0 + f_22 + f_43 + f_64 + f_85 + f_107 + f_128 + f_149 + f_170 + f_192 + f_213 + f_234 and g = g_0 + g_22 + g_43 + g_64 + g_85 + g_107 + g_128 + g_149 + g_170 + g_192 + g_213 + g_234 then fg = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234 + h_255 + h_277 + h_298 + h_319 + h_340 + h_362 + h_383 + h_404 + h_425 + h_447 + h_468 where each h is the sum of the products f_i g_j at matching positions: h_0 = f_0 g_0; h_22 = f_0 g_22 + f_22 g_0; h_43 = f_0 g_43 + f_22 g_22 + f_43 g_0; and so on, up to h_468 = f_234 g_234.

In fact, the polynomial h_0 x^0 + h_22 x^1 + ... + h_468 x^22 is the product of the polynomials f_0 x^0 + f_22 x^1 + ... + f_234 x^11 and g_0 x^0 + g_22 x^1 + ... + g_234 x^11. Notice that h_i, like f_i and g_i, is a small multiple of 2^i: for example, f_0 g_43 is a small multiple of 2^0 2^43 = 2^43, f_22 g_22 is a small multiple of 2^22 2^22 = 2^44, and f_43 g_0 is a small multiple of 2^43 2^0 = 2^43, so the sum h_43 is a small multiple of 2^43.

(It's important here that I chose my indices as ceilings of multiples of 21.25. If I had instead used the floors 0, 21, 42, 63, 85, ... then I would have run into trouble with f_0 g_85 + f_21 g_63 + f_42 g_42 + f_63 g_21 + f_85 g_0 being a small multiple of 2^84 but often not a multiple of the desired 2^85.)

The above formulas for h_0, h_22, ..., h_468 involve 12^2 = 144 multiplications (each f_i times each g_j) and 11^2 = 121 additions (which I think of as 144 for the number of products minus 23 for the number of outputs).
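
Here's a plain C sketch of the schoolbook product on this representation (mine, purely for illustration; the real code is curve25519_mul.q). The 23 additions into an initial zero disappear in straight-line code, leaving the 144 fmul + 121 fadd counted above:
     void schoolbook12(double h[23],const double f[12],const double g[12])
     {
       int i;
       int j;
       for (i = 0;i < 23;++i) h[i] = 0;
       for (i = 0;i < 12;++i)
         for (j = 0;j < 12;++j)
           h[i + j] += f[i] * g[j]; /* 144 products summed into 23 coefficients */
     }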

Multiplication, Karatsuba's method: 114 fadd + 142 fmul. I'm going to use ``Karatsuba's method'' to compute the same quantities h_0, h_22, ..., h_468 more efficiently. Warning: There are several different polynomial-product and integer-product formulas called ``Karatsuba's method,'' and some of them (including the formulas in Karatsuba's original 1963 paper) are slightly more expensive than the formulas shown here. There's no standard terminology for the different formulas.

Anyway, define F_0 = f_0 x^0 + f_22 x^1 + ... + f_107 x^5, F_1 = f_128 x^0 + f_149 x^1 + ... + f_234 x^5, G_0 = g_0 x^0 + g_22 x^1 + ... + g_107 x^5, and G_1 = g_128 x^0 + g_149 x^1 + ... + g_234 x^5. The goal is to compute the polynomial product (F_0 + x^6 F_1)(G_0 + x^6 G_1).

I'll compute (F_0 + x^6 F_1)(G_0 + x^6 G_1) as K + 2^127 x^6 ((F_0 + 2^(-127) F_1)(G_0 + 2^(-127) G_1) - K) where K = F_0 G_0 - 2^(-127) x^6 F_1 G_1. Observe that there are only three half-size polynomial products here, rather than the four that appear in the obvious formula F_0 G_0 + x^6 (F_0 G_1 + F_1 G_0) + x^12 F_1 G_1; so I'll end up with only 3 6^2 = 108 coefficient multiplications, rather than 4 6^2 = 144.

Here are the intermediate quantities computed: the coefficients of F_0 + 2^(-127) F_1; the coefficients of G_0 + 2^(-127) G_1; the coefficients of the product (F_0 + 2^(-127) F_1)(G_0 + 2^(-127) G_1); the coefficients of K, namely h_0, h_22, h_43, h_64, h_85, h_107, k_128, k_149, k_170, k_192, k_213, -2^(-127) h_362, -2^(-127) h_383, -2^(-127) h_404, -2^(-127) h_425, -2^(-127) h_447, -2^(-127) h_468, where each k combines a coefficient of F_0 G_0 with a 2^(-127)-scaled coefficient of F_1 G_1; and finally the coefficients of (F_0 + x^6 F_1)(G_0 + x^6 G_1), namely h_0, ..., h_468. Overall there are 142 floating-point multiplications here: the aforementioned 108 coefficient multiplications, 23 multiplications by 2^(-127), and 11 multiplications by 2^127. There are also 114 floating-point additions (again counting subtractions as additions).
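
Here's a C sketch of this Karatsuba evaluation (again mine, purely for illustration; it performs some redundant operations and doesn't match the hand-counted 114 fadd + 142 fmul of curve25519_mul.q):
     static void mul6(double r[11],const double a[6],const double b[6])
     {
       int i;
       int j;
       for (i = 0;i < 11;++i) r[i] = 0;
       for (i = 0;i < 6;++i)
         for (j = 0;j < 6;++j)
           r[i + j] += a[i] * b[j];
     }

     void karatsuba12(double h[23],const double f[12],const double g[12])
     {
       const double c127 = 170141183460469231731687303715884105728.0; /* 2^127 */
       const double cm127 = 1.0 / c127;                               /* 2^-127, exact */
       double fs[6];
       double gs[6];
       double p00[11];
       double p11[11];
       double m[11];
       double k[17];
       int i;

       for (i = 0;i < 6;++i) fs[i] = f[i] + cm127 * f[i + 6]; /* F_0 + 2^-127 F_1 */
       for (i = 0;i < 6;++i) gs[i] = g[i] + cm127 * g[i + 6]; /* G_0 + 2^-127 G_1 */
       mul6(m,fs,gs);         /* (F_0 + 2^-127 F_1)(G_0 + 2^-127 G_1) */
       mul6(p00,f,g);         /* F_0 G_0 */
       mul6(p11,f + 6,g + 6); /* F_1 G_1 */

       for (i = 0;i < 17;++i) k[i] = 0;
       for (i = 0;i < 11;++i) k[i] += p00[i];
       for (i = 0;i < 11;++i) k[i + 6] -= cm127 * p11[i]; /* K = F_0 G_0 - 2^-127 x^6 F_1 G_1 */

       for (i = 0;i < 23;++i) h[i] = (i < 17) ? k[i] : 0;
       for (i = 0;i < 17;++i)                              /* K + 2^127 x^6 (M - K) */
         h[i + 6] += c127 * ((i < 11 ? m[i] : 0) - k[i]);
     }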

How does a constant such as 2^127 end up in a floating-point register? The standard answer has three steps. First, create an array of constants in memory, for example with the following C code:

     const double curve25519_constants[2] = {
       170141183460469231731687303715884105728.0 /* 2^127 */
     , 0.0000000000000000000000000000000000000058774717541114375398436826861112283890933277838604376075437585313920862972736358642578125 /* 2^-127 */
     } ;
Second, copy the address of the array into a register, for example with the following standard sequence of UltraSPARC instructions:
     int64 constants
     int64 constants_low
     constants = (curve25519_constants & 0xfffffc0000000000) >> 32
     constants_low = curve25519_constants & 0xfffffc00
     constants |= (curve25519_constants & 0x3ff00000000) >> 32
     constants_low |= curve25519_constants & 0x3ff
     constants <<= 32
     constants |= constants_low
Third, load the constants from memory:
     float64 two127
     float64 twom127
     two127 = *(float64 *) (constants + 0)
     twom127 = *(float64 *) (constants + 8)

I wrote a straightforward curve25519_mul.q and immediately ran into errors from the assembler: I was trying to use more than 32 floating-point registers at once, keeping all 12 f's, all 12 g's, all 23 h's, etc. live through the entire computation. So I juggled the code to kill values earlier: for example, I computed h0, h22, h43, h64, h85, h107, stored them, and subtracted them from w0, w22, w43, w64, w85, w107, to free up the registers that they were using. I also reloaded the f's and g's when they were needed, rather than keeping them continuously live.

Here's the resulting software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_mul.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_multest.c
     ./qhasm-sparc < curve25519_mul.q > curve25519_mul.s
     gcc -m64 -o curve25519_multest curve25519_multest.c curve25519_mul.s
     ./curve25519_multest | gp

Squaring, schoolbook method: 55 fadd + 89 fmul. In Curve25519, many of the multiplications modulo 2^255-19 are actually squarings: f_i = g_i for each i. This means that, for example, the equation h_107 = f_0 g_107 + f_22 g_85 + f_43 g_64 + f_64 g_43 + f_85 g_22 + f_107 g_0 can be replaced by the equation h_107 = 2 (f_0 f_107 + f_22 f_85 + f_43 f_64), replacing 6 multiplications and 5 additions with 4 multiplications and 2 additions.

(I'm assuming here that doublings are carried out as multiplications by the constant 2. If floating-point multiplications turn out to be a bottleneck later then these multiplications can be replaced by additions.)

The equation h_107 = (2 f_0) f_107 + (2 f_22) f_85 + (2 f_43) f_64 is slightly better, costing only 3 multiplications and 2 additions if 2f_0, 2f_22, 2f_43 are precomputed. It's easy to see that the overall benefit of these precomputations outweighs the overall cost: there are only 11 doublings (of f_0, f_22, ..., f_213) rather than 21 doublings (in h_22, h_43, ..., h_447).
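
Here's a C sketch (mine, for illustration) of schoolbook squaring with the doubled coefficients precomputed; the resulting operation counts are tallied in the next paragraph:
     void square12(double h[23],const double f[12])
     {
       double f2[11]; /* precomputed 2 f_i; f_234 never needs doubling */
       int i;
       int j;
       for (i = 0;i < 11;++i) f2[i] = f[i] * 2.0; /* 11 doublings */
       for (i = 0;i < 23;++i) h[i] = 0;
       for (i = 0;i < 12;++i) {
         h[i + i] += f[i] * f[i];                 /* 12 diagonal products */
         for (j = i + 1;j < 12;++j)
           h[i + j] += f2[i] * f[j];              /* 66 doubled off-diagonal products */
       }
     }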

What happens when these speedups are applied to schoolbook multiplication? There are 144 fmul, of which 132 are f_i g_j for distinct i,j, of which 132/2 = 66 are eliminated at the expense of 11 doublings. The remaining 144 - 66 = 78 fmul results require only 78 - 23 = 55 fadds to be compressed into 23 outputs. Overall cost: 55 fadd + 78 fmul + 11 fmul. Software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_sq.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_sqtest.c
     ./qhasm-sparc < curve25519_sq.q > curve25519_sq.s
     gcc -m64 -o curve25519_sqtest curve25519_sqtest.c curve25519_sq.s
     ./curve25519_sqtest | gp

Can one do better? Not to my knowledge. The lowest cost I've been able to achieve for Karatsuba squaring is 63 fadd + 106 fmul. Squaring saves considerably more in the schoolbook method than it does in Karatsuba's method.

Reduction modulo 2^255-19: 11 fadd + 11 fmul. If h = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234 + h_255 + h_277 + h_298 + h_319 + h_340 + h_362 + h_383 + h_404 + h_425 + h_447 + h_468, with h_i a multiple of 2^i, then h is congruent to r_0 + r_22 + r_43 + r_64 + r_85 + r_107 + r_128 + r_149 + r_170 + r_192 + r_213 + r_234 modulo 2^255 - 19, where r_i = h_i + 19 2^(-255) h_(i+255) for each i in {0, 22, 43, 64, 85, 107, 128, 149, 170, 192, 213}, and r_234 = h_234.

Indeed, the difference (r_0 + r_22 + r_43 + r_64 + r_85 + r_107 + r_128 + r_149 + r_170 + r_192 + r_213 + r_234) - h is a sum of terms such as (19 2^(-255) - 1) h_255 = (2^255 - 19) 2^(-255) h_255. This term is a multiple of 2^255 - 19, since by hypothesis 2^(-255) h_255 is an integer; same for the other terms.
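
In array terms (h stored as 23 coefficients at positions 0 through 22, r as 12 coefficients), this reduction is the following C sketch (mine, for illustration; curve25519_modp.q is the actual code):
     void reduce23(double r[12],const double h[23])
     {
       const double scale = 19.0 / 57896044618658097711785492504343953926634992332820282019728792003956564819968.0; /* 19 * 2^-255, exact */
       int i;
       for (i = 0;i < 11;++i) r[i] = h[i] + scale * h[i + 12]; /* 11 fadd + 11 fmul */
       r[11] = h[11];
     }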

Computing the r_i's from the h_i's takes 11 multiplications by the constant 19 2^(-255) and 11 additions. Software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_modp.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_modptest.c
     ./qhasm-sparc < curve25519_modp.q > curve25519_modp.s
     gcc -m64 -o curve25519_modptest curve25519_modptest.c \
     curve25519_modp.s curve25519_mul.s
     ./curve25519_modptest | gp

Carrying and reduction modulo 2^255-19: 63 fadd + 11 fmul. ``Carrying r_0 to r_22'' means subtracting the closest multiple of 2^22 from r_0 and adding it to r_22. This doesn't change the sum r_0 + r_22, but it guarantees that r_0 is in the range {-2^21,...,-1,0,1,...,2^21}.

Similarly, ``carrying r_22 to r_43'' means subtracting the closest multiple of 2^43 from r_22 and adding it to r_43. This doesn't change the sum r_22 + r_43, but it guarantees that 2^(-22) r_22 is in the range {-2^20,...,-1,0,1,...,2^20}. By carrying r_0 to r_22, then carrying r_22 to r_43, then carrying r_43 to r_64, and so on, we guarantee that each 2^(-i) r_i is in a narrow range, making the r's suitable inputs for subsequent multiplications.

A carry takes four floating-point additions:

     closest43 = r22 + alpha43    # rounded
     closest43 -= alpha43         # exact
     r22 -= closest43             # exact
     r43 += closest43             # exact
Here alpha_43 is the constant 2^43 (2^52 + 2^51). The point is that floating-point numbers around alpha_43 are exactly the multiples of 2^43: ..., 2^43 (2^52 + 2^51 - 1), 2^43 (2^52 + 2^51), 2^43 (2^52 + 2^51 + 1), ... The sum r_22 + alpha_43 is thus rounded to the closest multiple of 2^43; subtracting alpha_43 produces the multiple of 2^43 closest to r_22, exactly as desired.
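
Here's a tiny C demonstration of the rounding trick (my illustration; it assumes IEEE double precision with round-to-nearest, and the volatile qualifiers are there to keep the compiler from evaluating in extended precision):
     #include <stdio.h>

     int main(void)
     {
       volatile double alpha43 = 59421121885698253195157962752.0; /* 2^43 (2^52 + 2^51) */
       volatile double r22 = 3000000.0 * 4194304.0;               /* some multiple of 2^22 */
       volatile double closest43 = r22 + alpha43;                 /* rounds to a multiple of 2^43 */
       closest43 -= alpha43;                                      /* exact: the multiple of 2^43 closest to r22 */
       printf("%.0f %.0f\n",closest43,r22 - closest43);           /* remainder has absolute value at most 2^42 */
       return 0;
     }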

What happens to the carry from r_234? The easy answer is to carry from r_234 to r_0, reducing the carry modulo 2^255-19:

     closest255 = r234 + alpha255
     closest255 -= alpha255
     r234 -= closest255
     closest255 *= scale
     r0 += closest255
where scale is 19 2^(-255).

This might make r_0 too big again. If r_234 starts with 50 bits, for example, then closest_255 can be as large as 2^284; so 19 2^(-255) 2^284, more than 2^33, is added to r_0. Another carry from r_0 to r_22 is required. But the carries can stop after that: a chain of 13 carries from r_0 to r_22 to r_43 to ... to r_234 to r_0 to r_22 guarantees that all r_i's are reasonably small.
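
Here's a C sketch (mine, for illustration) of this straightforward 13-carry strategy, assuming IEEE double-precision arithmetic in round-to-nearest mode and no extended-precision evaluation; the interleaved strategy described next eliminates the extra scale multiplication:
     void carry12(double r[12])
     {
       static const int e[13] = {0,22,43,64,85,107,128,149,170,192,213,234,255};
       const double alpha0 = 6755399441055744.0;  /* 2^52 + 2^51 */
       double pow2[13];                           /* pow2[j] = 2^e[j] */
       double scale;                              /* 19 * 2^-255 */
       double alpha;
       double c;
       int j;
       int b;

       for (j = 0;j < 13;++j) {
         pow2[j] = 1.0;
         for (b = 0;b < e[j];++b) pow2[j] *= 2.0; /* exact powers of 2 */
       }
       scale = 19.0 / pow2[12];                   /* exact */

       for (j = 0;j < 11;++j) {                   /* 11 carries: r_0 -> r_22 -> ... -> r_234 */
         alpha = pow2[j + 1] * alpha0;
         c = (r[j] + alpha) - alpha;              /* multiple of 2^e[j+1] closest to r[j] */
         r[j] -= c;
         r[j + 1] += c;
       }

       alpha = pow2[12] * alpha0;                 /* carry r_234 -> r_0, reduced mod 2^255-19 */
       c = (r[11] + alpha) - alpha;               /* multiple of 2^255 closest to r[11] */
       r[11] -= c;
       r[0] += c * scale;

       alpha = pow2[1] * alpha0;                  /* one more carry: r_0 -> r_22 */
       c = (r[0] + alpha) - alpha;
       r[0] -= c;
       r[1] += c;
     }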

One can eliminate the extra scale multiplication by carefully interleaving the carrying into the original reduction modulo 2^255-19. Here's one reasonable carry/reduce strategy, starting from h = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234 + h_255 + h_277 + h_298 + h_319 + h_340 + h_362 + h_383 + h_404 + h_425 + h_447 + h_468:

The 13 carries here cost 52 fadd + 0 fmul, and the 11 reductions cost 11 fadd + 11 fmul, for a total of 63 fadd + 11 fmul.

Software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_carry.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_carrytest.c
     ./qhasm-sparc < curve25519_carry.q > curve25519_carry.s
     gcc -m64 -o curve25519_carrytest curve25519_carrytest.c \
     curve25519_carry.s curve25519_mul.s curve25519_sq.s
     ./curve25519_carrytest | gp

Multiplication, carrying, reduction: 177 fadd + 153 fmul. A complete integer multiplication modulo 2^255-19, producing small results suitable for subsequent multiplications, costs 114 fadd + 142 fmul for the polynomial multiplication, and an additional 63 fadd + 11 fmul for carrying and reduction.

Squaring, carrying, reduction: 118 fadd + 100 fmul. A complete integer squaring modulo 2^255-19, producing small results suitable for subsequent multiplications, costs 55 fadd + 89 fmul for the polynomial squaring, and an additional 63 fadd + 11 fmul for carrying and reduction.

What's coming up. I'll need a few more functions to build a complete Curve25519 computation:

Once the code is working, I'll transform it to eliminate load/store bottlenecks, latency bottlenecks, etc., hopefully bringing the number of cycles close to the number of fadd instructions.

2006.11.05

New qhasm release, expanding the SPARC machine description to include a few additional UltraSPARC instructions that I'm using today:
     wget https://cr.yp.to/qhasm/qhasm-20061105.tar.gz
     gunzip < qhasm-20061105.tar.gz | tar -xf -
     cd qhasm-20061105
     ./do
Now back to the Curve25519 operations.

Small-constant multiplication, carrying, reduction: 48 fadd + 13 fmul. As we'll see later, the Curve25519 main loop has 8 additions/subtractions and 10 multiplications. Half of the 10 multiplications are special: 4 of them are squarings, and 1 of them is a multiplication by a particular integer, namely 121665.

Multiplying by a 17-bit integer is of course easier than multiplying by a 255-bit integer: one can simply multiply each of the 12 floating-point inputs by 121665, costing 0 fadd + 12 fmul. But the results need 12 carries, costing 48 fadd + 1 fmul, so the overall benefit is not as dramatic as one might initially expect.
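
As a C sketch (mine, for illustration), the pre-carry part is simply:
     void mul121665(double h[12],const double f[12])
     {
       int i;
       for (i = 0;i < 12;++i) h[i] = f[i] * 121665.0; /* 12 fmul; the 12 carries follow */
     }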

Software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_121665.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_121665test.c
     ./qhasm-sparc < curve25519_121665.q > curve25519_121665.s
     gcc -m64 -o curve25519_121665test curve25519_121665test.c curve25519_121665.s
     ./curve25519_121665test | gp

Selection: 0 fadd + 0 fmul. There's one additional operation in the Curve25519 main loop: given four 255-bit integers x[0], z[0], x[1], z[1] and a secret bit b, compute x[b], z[b] and x[1-b], z[1-b].

Here are three strategies for performing this operation:

I'll try the third strategy using integer bit operations:

     r3 = *(uint64 *) (r + 24)
     s3 = *(uint64 *) (s + 24)
     s3 ^= r3
     q3 = s3 & bminus1
     q3 ^= r3
     p3 = q3 ^ s3
     *(uint64 *) (p + 24) = p3
     *(uint64 *) (q + 24) = q3
A similar sequence of 3 fadd + 1 fmul, or alternatively 2 fadd + 4 fmul, would have the virtue of working with floating-point registers, but I'm guessing that the loads and stores to move to integer registers won't be a problem, and I don't want to consume precious fadd resources.
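
Here's the same masking idea as a one-word C sketch (mine, for illustration), assuming b is 0 or 1 so that b - 1 is an all-ones mask exactly when b is 0:
     #include <stdint.h>

     void select64(uint64_t *p,uint64_t *q,uint64_t r,uint64_t s,uint64_t b)
     {
       uint64_t mask = b - 1;     /* all ones if b = 0, all zeros if b = 1 */
       uint64_t t = (r ^ s) & mask;
       *q = r ^ t;                /* r if b = 1, s if b = 0 */
       *p = s ^ t;                /* s if b = 1, r if b = 0 */
     }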

Software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_select.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_selecttest.c
     ./qhasm-sparc < curve25519_select.q > curve25519_select.s
     gcc -m64 -o curve25519_selecttest curve25519_selecttest.c curve25519_select.s
     ./curve25519_selecttest | gp

Conversion from 32-byte string: 44 fadd + 0 fmul. Say f is in {-2^51,...,-1,0,1,...,2^51-1}. Here's how to take f out of an UltraSPARC integer register and put it into an UltraSPARC floating-point register: add the constant 2^52 1075 + 2^51 to f in the integer register; store the resulting 8 bytes to memory; load those 8 bytes into a floating-point register; and finally subtract the floating-point constant 2^52 + 2^51.

Here's why this works: Say s is in {0,1}, e is in {1,2,...,2046}, and f is in {-2^51,...,-1,0,1,...,2^51-1}. The UltraSPARC follows the IEEE 754 standard, which specifies that the 8 bytes representing the 64-bit integer 2^63 s + 2^52 e + 2^51 + f also represent the 64-bit floating-point number (-1)^s 2^(e-1023) (1 + 2^(-1) + 2^(-52) f). Now substitute s = 0 and e = 1075. (IEEE 754 also specifies interpretations of e = 0 and e = 2047, but we won't need to worry about those.)
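
Here's a small C demonstration of that bit-pattern identity for s = 0 and e = 1075 (my illustration, assuming an IEEE 754 platform; memcpy reinterprets the 8 bytes):
     #include <stdint.h>
     #include <stdio.h>
     #include <string.h>

     int main(void)
     {
       int64_t f = -123456789;                  /* any f in {-2^51,...,2^51-1} */
       uint64_t bits = ((uint64_t) 1075 << 52) + ((uint64_t) 1 << 51) + (uint64_t) f;
       double d;
       memcpy(&d,&bits,8);                      /* reinterpret the 8 bytes as a double */
       printf("%.0f\n",d - 6755399441055744.0); /* subtract 2^52 + 2^51; prints f */
       return 0;
     }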

It's just as easy to take f out of an UltraSPARC integer register and put a scaled version of f, let's say 2^32 f, into an UltraSPARC floating-point register: use exponent 1075 + 32 = 1107 instead of 1075, i.e., add the constant 2^52 1107 + 2^51 to f, store, load, and subtract the floating-point constant 2^84 + 2^83.

What if f is in memory rather than in an integer register? To be concrete, let's say bytes (x[0],x[1],x[2],x[3],x[4],x[5]) in memory represent an integer f in {0,1,...,2^48-1} in the usual little-endian way: f = x[0] + 2^8 x[1] + 2^16 x[2] + 2^24 x[3] + 2^32 x[4] + 2^40 x[5]. Here's how to put the same integer f into a floating-point register. First load the bytes into registers (with separate byte loads; loading *(uint64 *) (x+0) doesn't work on the UltraSPARC unless x is 0 modulo 8):

     int64 f0
     int64 f1
     int64 f2
     int64 f3
     int64 f4
     int64 f5
     f0 = *(uint8 *) (x + 0) 
     f1 = *(uint8 *) (x + 1)
     f2 = *(uint8 *) (x + 2)
     f3 = *(uint8 *) (x + 3)
     f4 = *(uint8 *) (x + 4)
     f5 = *(uint8 *) (x + 5)
Shift and add to compute x0 = 2^52 1075 + 2^51 + f = 2^51 2151 + f:
     f1 <<= 8
     f2 <<= 16
     f3 <<= 24
     f4 <<= 32
     f5 <<= 40
     int64 x0
     x0 = 2151
     x0 <<= 51
     x0 += f0
     x0 += f1
     x0 += f2
     x0 += f3
     x0 += f4
     x0 += f5
Set aside 8 bytes of temporary storage in memory, and store x0 there:
     stack64 y0
     y0 = x0
Load the 8 bytes into a floating-point register, and finally subtract 2^52 + 2^51:
     float64 z0
     z0 = y0
     z0 -= alpha0

One of the Curve25519 inputs is a 255-bit integer stored in 32 bytes of memory. I convert

and then carry z_0 to z_22 to z_43 to z_64 to z_85 to z_107 to z_128 to z_149 to z_170 to z_192 to z_213 to z_234. I then have 12 floating-point registers representing the same integer, at the cost of 44 fadd. Software:
     wget https://cr.yp.to/highspeed/fall2006/curve25519_todouble.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_todoubletest.c
     ./qhasm-sparc < curve25519_todouble.q > curve25519_todouble.s
     gcc -m64 -o curve25519_todoubletest curve25519_todoubletest.c curve25519_todouble.s
     ./curve25519_todoubletest | gp

One can save time here by using more integer operations, first to partition the input bits more precisely and second to create the integer representing f rather than 2^52 + 2^51 + f. But I won't bother: conversion to double is a very small part of the Curve25519 time and doesn't warrant any extra programming effort.

Conversion to 32-byte string: 12 fadd + 0 fmul. The output of Curve25519 is an integer in {0,1,...,2^255-20}, represented as a 32-byte string in the same way as the input. This isn't what's naturally produced by arithmetic:

The results of arithmetic have been squeezed to save time in subsequent operations, but they haven't been frozen into a unique result.

Here's a reasonably straightforward freezing strategy. Let's say that the final carrying and reduction produce r = r_0 + r_22 + r_43 + r_64 + r_85 + r_107 + r_128 + r_149 + r_170 + r_192 + r_213 + r_234. Carrying has put each r_i into a fairly limited range:

Evidently the sum r can be slightly outside, but not much outside, the range {-2^254,...,2^254}. I'll simplify matters by making everything positive: 2^255-19 + r is the sum of positive quantities s_0, s_16, ..., s_232, where each s_i is a multiple of 2^i and each 2^(-i) s_i is a reasonably small integer. Putting the small integers 2^(-i) s_i into integer registers is easy. Converting floating-point r_85 into integer 2^(-80) s_80, for example, is a simple matter of adding the constant 2^80 (2^52 + 2^51) + 2^108 - 2^86, storing the result into 8 bytes of memory, and loading the bottom 4 bytes into an integer register. This procedure, like the procedure for converting to double, exploits the IEEE 754 representation of floating-point numbers: the 8-byte string that represents the 64-bit integer 2^52 1155 + 2^51 + 2^(-80) s_80 also represents the floating-point number 2^80 (2^52 + 2^51) + 2^108 - 2^86 + r_85.

A series of shifts and adds now converts s_0, 2^(-16) s_16, ..., 2^(-232) s_232 to integers t_0, t_32, t_64, t_96, t_128, t_160, t_192, t_224 in {0,1,...,2^32-1} with 2^255-19 + r = t_0 + 2^32 t_32 + ... + 2^224 t_224. To see whether 2^255-19 + r is below 2^255-19, I'll add 19 and inspect the 255th bit, taking advantage of the UltraSPARC's 32-bit-carry instructions:

     carry32? u0 = t0 + 19
     carry32? u32 = t32 + 0 + carry32
     carry32? u64 = t64 + 0 + carry32
     carry32? u96 = t96 + 0 + carry32
     carry32? u128 = t128 + 0 + carry32
     carry32? u160 = t160 + 0 + carry32
     carry32? u192 = t192 + 0 + carry32
     carry32? u224 = t224 + 0 + carry32
     wantt = (uint32) u224 >> 31
     wantt -= 1
     wantu = ~wantt
Now it's easy to select either r or 2^255-19 + r as the final Curve25519 output. Here's the software:
     wget https://cr.yp.to/highspeed/fall2006/curve25519_freeze.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_freezetest.c
     ./qhasm-sparc < curve25519_freeze.q > curve25519_freeze.s
     gcc -m64 -o curve25519_freezetest curve25519_freezetest.c \
     curve25519_freeze.s curve25519_carry.s curve25519_mul.s curve25519_sq.s \
     curve25519_todouble.s
     ./curve25519_freezetest | gp

The complete Curve25519 computation: 414907 fadd + 327626 fmul. The Curve25519 computation starts with two 32-byte inputs. It

The most important part of the computation is the main loop:

     for (pos = 254;pos >= 0;--pos) {
       b = e[pos / 8] >> (pos & 7);
       b &= 1;
       curve25519_select(xzmb,xzm1b,xzm,xzm1,b);
       curve25519_add(a0,xzmb,xzmb + 12);
       curve25519_sub(a0 + 12,xzmb,xzmb + 12);
       curve25519_add(a1,xzm1b,xzm1b + 12);
       curve25519_sub(a1 + 12,xzm1b,xzm1b + 12);
       curve25519_sqcarry(b0,a0);
       curve25519_sqcarry(b0 + 12,a0 + 12);
       curve25519_mulcarry(b1,a1,a0 + 12);
       curve25519_mulcarry(b1 + 12,a1 + 12,a0);
       curve25519_add(c1,b1,b1 + 12);
       curve25519_sub(c1 + 12,b1,b1 + 12);
       curve25519_sqcarry(r,c1 + 12);
       curve25519_sub(s,b0,b0 + 12);
       curve25519_121665(t,s);
       curve25519_add(u,t,b0);
       curve25519_mulcarry(xznb,b0,b0 + 12);
       curve25519_mulcarry(xznb + 12,s,u);
       curve25519_sqcarry(xzn1b,c1);
       curve25519_mulcarry(xzn1b + 12,r,x1);
       curve25519_select(xzm,xzm1,xznb,xzn1b,b);
     }
Each of the 255 iterations involves two selections (which can easily be merged across loops into one selection), four additions, four subtractions, one multiplication by 121665 (with carry-reduce), four squarings (with carry-reduce), and five more multiplications (with carry-reduce). Here are the tallies of floating-point operations:
operation   per operation         per iteration          per 255 iterations
select      0 fadd + 0 fmul       0 fadd + 0 fmul        0 fadd + 0 fmul
add         12 fadd + 0 fmul      48 fadd + 0 fmul       12240 fadd + 0 fmul
sub         12 fadd + 0 fmul      48 fadd + 0 fmul       12240 fadd + 0 fmul
121665      48 fadd + 13 fmul     48 fadd + 13 fmul      12240 fadd + 3315 fmul
sqcarry     118 fadd + 100 fmul   472 fadd + 400 fmul    120360 fadd + 102000 fmul
mulcarry    177 fadd + 153 fmul   885 fadd + 765 fmul    225675 fadd + 195075 fmul
total                             1501 fadd + 1178 fmul  382755 fadd + 300390 fmul
Some operations could be removed in the leading and trailing iterations, but this would produce an extremely small speedup and would cost extra code-cache misses.

The second most important part of the computation is the division. I compute the (p-2)nd = (2^255-21)th power of z[n], and multiply by x[n], with a straightforward sequence of 254 squarings and 12 extra multiplications, costing 32096 fadd + 27236 fmul. There are other ways to compute the reciprocal of z[n], such as Euclid's algorithm and extended binary-gcd algorithms, but these algorithms (especially when modified to avoid leaking information through timing) would take considerable extra work to implement and would have at most a small impact on the Curve25519 speed, so I won't bother with them this month.

The remaining operations are conversion to and from floating point, costing 56 fadd + 0 fmul.

Total operation count (not computer-verified at this point): 414907 fadd + 327626 fmul. Evidently these instructions will take at least 414907 UltraSPARC cycles.

Here's the software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_complete.c
     wget https://cr.yp.to/highspeed/fall2006/curve25519_completetest.c
     gcc -m64 -o curve25519_completetest curve25519_completetest.c \
     curve25519_complete.c curve25519_carry.s curve25519_mul.s curve25519_sq.s \
     curve25519_todouble.s curve25519_121665.s curve25519_add.s curve25519_sub.s \
     curve25519_freeze.s curve25519_select.s
     ./curve25519_completetest | gp

Counting cycles. I've timed this Curve25519 implementation using the cpucycles library:

     int i;
     long long timings[10];
     unsigned char x[32];
     unsigned char y[32];
     unsigned char z[32];

     for (i = 0;i < 10;++i) { timings[i] = cpucycles(); curve25519(z,x,y); }
     for (i = 0;i < 9;++i) printf("%lld ",timings[i + 1] - timings[i]); printf("\n");
The output shows that this implementation is taking over 2 million UltraSPARC III cycles:
     2251848 2122560 2060876 2694852 2061201 2060841 2061101 3322788 2065352 
There's a big gap, almost a factor of 5, between the 414907-cycle lower bound from fadd throughput and this 2-million-cycle upper bound. That's a gigantic gap compared to, e.g., my NIST P-224 elliptic-curve-Diffie-Hellman software from 2001, which used about 678000 UltraSPARC II cycles for 523578 floating-point additions, or my more recent Poly1305 software, which uses 68 UltraSPARC III cycles for 68 floating-point additions in the main loop.

Why is the software failing so miserably at keeping the adder busy? Here's an illustrative 3-fadd excerpt from curve25519_mul.q:

     h64 = f0 * g64       # starts on cycle 0
     f22g43 = f22 * g43   # starts on cycle 1 (delayed by fmul throughput)
     h64 += f22g43        # starts on cycle 5 (delayed by fmul latency)
     f43g22 = f43 * g22   # starts on cycle 5
     h64 += f43g22        # starts on cycle 9 (delayed by fmul throughput)
     f64g0 = f64 * g0     # starts on cycle 9
     h64 += f64g0         # starts on cycle 13
I didn't have to add these 4 quantities serially. I could instead add them in a tree:
     h64 = f0 * g64       # starts on cycle 0
     f22g43 = f22 * g43   # starts on cycle 1 (delayed by fmul throughput)
     f43g22 = f43 * g22   # starts on cycle 2 (delayed by fmul throughput)
     f64g0 = f64 * g0     # starts on cycle 3 (delayed by fmul throughput)
     h64 += f22g43        # starts on cycle 5 (delayed by fmul latency)
     f43g22 += f64g0      # starts on cycle 7 (delayed by fmul latency)
     h64 += f43g22        # starts on cycle 11 (delayed by fadd latency)
Even better, I could interleave these operations with subsequent additions in the software. There's tremendous parallelism in the Curve25519 computation graph, with far more than 4 independent additions on each level of the graph; I simply have to make this parallelism visible to the CPU.

To be continued.

2006.11.06

Today's lecture topics: Oops, didn't quite get to the last two bits. Will start with them tomorrow.

There are many other ideas in the authentication literature, but most of them are slowdowns! See Section 10 of my hash127 paper for a survey (dating from 2000) and tons of references. The main advance in the field since 2000 is a new focus on minimizing precomputation (not just having r short but also using it directly in the main computation), as I emphasized in the lecture.

2006.11.07

WARNING: There's a proposal to reschedule next week's lectures from Thursday+Friday to Friday+Friday. Hasn't happened yet; we're checking for conflicts.

Today's lecture topics:

Here's what I said about cycles/byte to authenticate a 1024-byte message, including a slight slowdown from computing AES:
CPU               Poly1305-AES   UMAC-P4-128-AES
PowerPC G4 7410   8.27           21.72
UltraSPARC III    5.47           51.06
Pentium 4         5.33           3.12
Pentium M         4.50           8.48
Athlon            3.75           7.38

As for Salsa20, the best source of information is my Salsa20 page. The leading source of information about stream ciphers generally is eSTREAM (the ECRYPT Stream Cipher Project), particularly eSTREAM's collection of design papers and cryptanalysis papers; I have a web page with links and further comments.

2006.11.08

SCHEDULE CHANGE: The afternoon session on Thursday 16 November has been rescheduled for Friday 17 November. This means that the remaining sessions are as follows:

Week  Day  Date        Time
3     Thu  2006.11.09  11:00-12:30, 15:30-17:00
3     Fri  2006.11.10  11:00-12:30
4     Mon  2006.11.13  no lectures (students work on course projects)
4     Tue  2006.11.14  no lectures (students work on course projects)
4     Wed  2006.11.15  no lectures (students work on course projects)
4     Thu  2006.11.16  no lectures (students work on course projects)
4     Fri  2006.11.17  11:00-12:30, 15:30-17:00 (project reports and discussion)

Today's lecture topics:

2006.11.09

Today's lecture topics:

2006.11.13

Friday's lecture topics: We'll meet again for course-project discussions on Friday 17 November 2006 at 11:00. Feel free to send me email in the meantime.

Here's a challenge to cryptographers. Right now, attackers all over the Internet can easily forge packets that appear to come from Google; the challenge is to stop these forgeries, guaranteeing that the Google results on the users' screens really are from Google. Perhaps the most obvious approach (to the cryptographic side of the problem; we also need to fix browser buffer overflows and other relevant software bugs) is to have Google add a signature to each Google-results page. This costs, among other things, one signature generation for each of the W Google-results pages that Google serves.

The problem for Google is that W is huge; generating signatures can easily be a bottleneck in generating Google-results pages. Fortunately, there's a much less expensive approach: use the Diffie-Hellman protocol to share a secret between Google and each user, and then use secret-key cryptography to stop forgery (and eavesdropping) for any number of web pages sent to that user. This costs, among other things, one Diffie-Hellman evaluation (one Curve25519 computation) for each of the U Google users, plus very fast secret-key operations for each page. Presumably the number U of Google users is on the scale of a billion, so U evaluations of Curve25519 are on the scale of a few days of time on one computer, and of course Google can spread computations across many computers. One might complain about the cost of Google storing shared secrets in a big central database, but the standard technique of ``remote storage'' (Google stores the user's shared secret as an authenticated encrypted cookie on the user's machine) removes the need for a central database. Anyway, the challenge is to write and deploy all software necessary to actually make this work.

2006.11.16

New qhasm release, including AMD64 (Opteron, Athlon 64, Core, Core 2) support:
     wget https://cr.yp.to/qhasm/qhasm-20061116.tar.gz
     gunzip < qhasm-20061116.tar.gz | tar -xf -
     cd qhasm-20061116
     ./do
Still needs lots of testing but can at least handle an AMD64 XMM version of the Salsa20 software, setting new speed records for Salsa20 on the Intel Core 2.