D. J. Bernstein
High-speed cryptography

Fall 2006 course at the Fields Institute

2006.10.13

I'm teaching a block course ``High-speed cryptography'' in Toronto starting in a week and a half. Here's the lecture schedule:

Week  Day  Date        Time
1     Mon  2006.10.23  15:30-17:00
1     Tue  2006.10.24  11:00-12:30, 15:30-17:00
1     Wed  2006.10.25  11:00-12:30, 15:30-17:00
1     Thu  2006.10.26  11:00-12:30, 15:30-17:00
1     Fri  2006.10.27  11:00-12:30
2     Mon  2006.10.30  no lectures (students attend CCANTC workshop)
2     Tue  2006.10.31  no lectures (students attend CCANTC workshop)
2     Wed  2006.11.01  no lectures (students attend CCANTC workshop)
2     Thu  2006.11.02  no lectures (students attend CCANTC workshop)
2     Fri  2006.11.03  no lectures (students attend CCANTC workshop)
3     Mon  2006.11.06  15:30-17:00
3     Tue  2006.11.07  11:00-12:30, 15:30-17:00
3     Wed  2006.11.08  11:00-12:30, 15:30-17:00
3     Thu  2006.11.09  11:00-12:30, 15:30-17:00
3     Fri  2006.11.10  11:00-12:30
4     Mon  2006.11.13  no lectures (students work on course projects)
4     Tue  2006.11.14  no lectures (students work on course projects)
4     Wed  2006.11.15  no lectures (students work on course projects)
4     Thu  2006.11.16  15:30-17:00 (project reports and discussion)
4     Fri  2006.11.17  11:00-12:30 (project reports and discussion)

All lectures will take place in room 230 at the Fields Institute at 222 College Street at the University of Toronto.

Course contents. Anyone with access to the Internet can intercept mail messages and web pages and other packets of data to see what they say. He can also send fake messages that look like real messages.

Cryptography protects messages sent through the Internet. Cryptography scrambles and unscrambles packets to protect against forgery and against espionage. An attacker who forges a message won't be able to scramble it in the right way, so when we unscramble it we'll see that it's a forgery and that we should throw it away. An attacker who intercepts a scrambled credit-card number won't be able to figure out the original number.

High-speed cryptography is important for busy network servers. For example, here's a quote from Cricket Liu, author of several books on the Internet's Domain Name System: ``Verifying signed resource records [i.e., performing cryptographic operations] is computationally intensive [i.e., is painfully slow]. This has slowed deployment of DNS security.'' As a practical matter, DNS security still doesn't exist; attackers can seize control over incoming mail and outgoing web pages by forging DNS records. The users need faster software to scramble DNS records.

This course will focus on the four fundamental tasks in cryptography:

For each task, the course will explain a state-of-the-art function solving the task, explain what is known and what is believed about the security of the function, explain state-of-the-art methods to compute the function on several common computer architectures, and survey the alternative functions discussed in the literature.

Prerequisites and non-prerequisites. Students are expected to have some experience with algorithm analysis (figuring out why a computer hasn't finished a computation yet) and algorithm improvement (making the computer finish the computation more quickly). A typical student will have studied some chapters of a textbook by Cormen, Leiserson, Rivest, and Stein; will have heard that heapsort ``takes time n log n times, um, some constant''; and will have seen many algorithm improvements incorrectly described as ``optimizations.''

The most obvious difference in flavor between this course and a typical algorithm course is that this course will pay attention to constants. For example, this course will explain how a state-of-the-art function takes half a millisecond on my laptop to compute a high-security shared secret from a public conversation. Ignoring constant factors would remove all content from this statement.

Of course, these constants depend on the choice of computer. Sometimes one computer takes 10 clock cycles for an operation while another computer takes just 1 clock cycle; this affects the time for any algorithm built on top of that operation. Students might have seen some computer-dependent algorithm analysis and algorithm improvement in courses on architecture, assembly language, ``optimizing'' compilers, supercomputing, etc. None of those courses are prerequisites; this course will include a detailed introduction to the speed of real computers.

Many higher-level speedups are also constant-factor speedups. In fact, most of the speedups in the algorithm-research literature are constant-factor speedups. However, students will not be presumed to have any previous experience handling constant factors in algorithm analysis.

The other obvious difference in flavor between this course and other algorithm courses is the choice of functions being computed. A typical introductory algorithm course spends time on sorting, for example, and on various graph operations such as shortest paths. This course instead focuses on a few critical cryptographic operations. Those operations rely on some mathematics that students might have seen in previous courses on algebra, number theory, or cryptography. None of those are prerequisites; this course will include an introduction to the relevant mathematics.

2006.10.25

Here's what I've done so far:

2006.10.27

All of my recent cryptographic speed records have been set with the help of qhasm, a private set of programming tools whose sole goal is to reduce the amount of time needed to write high-speed software. Writing a high-speed computation using qhasm is

Here's a portion of the latest qhasm rewrite: qhasm-20061027.tar.gz. This doesn't include qhasm's cycle counter, range analyzer, and scheduler, but it does include qhasm's instruction decoder, register allocator (completely rewritten this week to be much faster), x86 floating-point stack handler, and assembler. What this means, in a nutshell, is that you can write .q files and feed them to qhasm-x86 to produce .s files in traditional x86 assembly language. Those .s files, in turn, can be linked easily with C and C++ programs.

These tools are nowhere near the polished state that I like to reach before releasing software. I haven't even had time for minimal tests! I nevertheless feel completely comfortable recommending qhasm as the programming environment for the course projects, simply because the alternative environments do such a bad job of producing high-speed software. If something in qhasm doesn't work, let me know, and I'll have it fixed promptly.

An example of a .q file is addvector.q. To turn this into addvector.s, first download and compile qhasm:

     wget https://cr.yp.to/qhasm/qhasm-20061027.tar.gz
     gunzip < qhasm-20061027.tar.gz | tar -xf -
     cd qhasm-20061027
     ./do
Then download addvector.q in the qhasm-20061027 directory and feed it through qhasm-x86:
     wget https://cr.yp.to/qhasm/addvector.q
     ./qhasm-x86 < addvector.q > addvector.s
There's an addvector.c program that tests the result:
     wget https://cr.yp.to/qhasm/addvector.c
     gcc -o addvector addvector.c addvector.s
     ./addvector
To see more of what can be in a .q file, look at the X86 file in the qhasm-20061027 directory. The X86 file lists .q instructions and then, after a colon, information about the translation of those instructions into traditional assembly language.

2006.10.28

Here's what I did in the rest of the first week (Thursday morning, Thursday afternoon, Friday morning):

2006.11.03

Course lectures continue Monday!

In response to requests for supporting additional CPUs, I've posted a new version of qhasm that includes qhasm-ppc32-linux, qhasm-ppc32-macos, qhasm-sparc, and qhasm-x86. Here's how to download and compile the new version:

     wget https://cr.yp.to/qhasm/qhasm-20061102.tar.gz
     gunzip < qhasm-20061102.tar.gz | tar -xf -
     cd qhasm-20061102
     ./do
Beware that this version also has a new syntax for handling CPU flags. Previously you were allowed to say things like
     counter -= 1
     ...
     goto topofloop if !=
where the counter -= 1 instruction implicitly sets the CPU's equality flag and the goto instruction explicitly uses the flag. You now have to explicitly indicate the flags that you'll be testing later:
     =? counter -= 1
     ...
     goto topofloop if !=
This allows qhasm to check for a common assembly-language error, namely intermediate instructions (...) that implicitly set the flag; there are quite a few flag-setting instructions, especially on the x86 architecture. The new syntax also allows comprehensible selection of different flag options on architectures providing many options, such as the PowerPC.

This version of qhasm needs even more testing than the previous version. I'll be doing a lot of testing this weekend: one of my projects for the next few days is writing a new implementation of the Curve25519 elliptic-curve-Diffie-Hellman function, one of the functions that we'll explore next week. The original Curve25519 software is written in x86 (Pentium, Athlon, etc.) assembly language and relies heavily on 80-bit (extended-precision, 64-bit mantissa) floating-point registers; the new project is to handle other CPUs (UltraSPARC, PowerPC, etc.) using 64-bit (double-precision, 53-bit mantissa) floating-point registers. I'm planning to post code samples here---stay tuned.

2006.11.03, continued

We'll see next week that the Curve25519 computation involves thousands of multiplications modulo the prime 2^255-19. The computation also has thousands of additions and some other work, but those multiplications are the main bottleneck.

How can I efficiently decompose multiplication mod 2^255-19 into the smaller instructions available on a real CPU? The answer depends on the CPU. Let's start with my teaching tool, the UltraSPARC III, before moving on to more popular CPUs.

Recall that the computational highlight of the UltraSPARC is the floating-point multiplier, which can very quickly compute the product of two double-precision floating-point numbers. The multiplier rounds its output to 53-bit precision, but it guarantees exact results for products that already fit into 53 bits, such as integers in the range {-2^53+1,...,-1,0,1,...,2^53-1}. The multiplier's maximum throughput is one product per CPU cycle.

Experience tells me that the floating-point adder is another bottleneck. The adder's maximum throughput is one addition per CPU cycle. So I'll carefully count the number of floating-point additions and the number of floating-point multiplications. Hopefully there won't be any other bottlenecks.

Integer representation. I'll try breaking 255-bit numbers into 12 pieces, each piece having 22 bits. Two pieces will have a 44-bit product, so I should be able to add quite a few of those products without overflowing the 53-bit mantissa of a double-precision floating-point number. Of course, I'll have to carefully check the final code to guarantee that there aren't any accidental overflows, and in the end it might turn out that more than 12 pieces are necessary, but for the moment I'll simply assume that 12 pieces will work.

You might object that 255 isn't divisible by 12: the ratio 255/12 = 21.25 isn't an integer. How exactly do I break a 255-bit number into 12 pieces? Answer: I'll write a 255-bit number f as a sum f_0 + f_22 + f_43 + f_64 + f_85 + f_107 + f_128 + f_149 + f_170 + f_192 + f_213 + f_234 where each f_i is a small multiple of 2^i. The indices i here, namely 0 22 43 etc., are ceilings of multiples of 21.25.

Each f_i fits easily into a floating-point register, so f is represented as a vector of 12 values (f_0,f_22,...,f_234) stored in floating-point registers. All required operations on integers modulo 2^255-19 will be split into floating-point operations.
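
As a quick sanity check (this little program is mine, not part of the course software), the 12 exponents are exactly the ceilings of 21.25 i for i = 0, 1, ..., 11:
     #include <stdio.h>

     int main(void)
     {
       int i;
       for (i = 0;i < 12;++i)
         printf("%d\n",(85 * i + 3) / 4); /* ceiling of 21.25 i: 0 22 43 64 85 107 128 149 170 192 213 234 */
       return 0;
     }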

Addition: 12 fadd + 0 fmul. If f = f_0 + f_22 + f_43 + f_64 + f_85 + f_107 + f_128 + f_149 + f_170 + f_192 + f_213 + f_234 and g = g_0 + g_22 + g_43 + g_64 + g_85 + g_107 + g_128 + g_149 + g_170 + g_192 + g_213 + g_234 then f + g = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234 where h_i = f_i + g_i.

Computing (h_0,h_22,...,h_234) given (f_0,f_22,...,f_234) and (g_0,g_22,...,g_234) takes 12 floating-point additions if f_i and g_i are sufficiently small. Of course, the bound on h_i is larger than the bounds on f_i and g_i, so a series of additions requires occasional carries, but there aren't any chains of additions in Curve25519.

Here's a simple C function that reads (f_0,f_22,...,f_234) from f[0],f[1],...,f[11], reads (g_0,g_22,...,g_234) from g[0],g[1],...,g[11], and puts (h_0,h_22,...,h_234) into h[0],h[1],...,h[11]:

     void curve25519_add(double h[12],const double f[12],const double g[12])
     {
       h[0] = f[0] + g[0];
       h[1] = f[1] + g[1];
       h[2] = f[2] + g[2];
       h[3] = f[3] + g[3];
       h[4] = f[4] + g[4];
       h[5] = f[5] + g[5];
       h[6] = f[6] + g[6];
       h[7] = f[7] + g[7];
       h[8] = f[8] + g[8];
       h[9] = f[9] + g[9];
       h[10] = f[10] + g[10];
       h[11] = f[11] + g[11];
     }
Perhaps I should comment that, in C, array parameters are actually pointers, so h[0],...,h[11] are stored in the location specified by the function caller. I could have written the same function as
     void curve25519_add(double *h,const double *f,const double *g)
     {
       h[0] = f[0] + g[0];
       h[1] = f[1] + g[1];
       h[2] = f[2] + g[2];
       h[3] = f[3] + g[3];
       h[4] = f[4] + g[4];
       h[5] = f[5] + g[5];
       h[6] = f[6] + g[6];
       h[7] = f[7] + g[7];
       h[8] = f[8] + g[8];
       h[9] = f[9] + g[9];
       h[10] = f[10] + g[10];
       h[11] = f[11] + g[11];
     }
and would have had to do so to make the function also work in C++. On the other hand, I like seeing [12] as a reminder that I'm working with 12-component vectors.

Anyway, the line

       h[0] = f[0] + g[0]
actually produces four CPU instructions: load f[0] into a register, load g[0] into a register, add the two registers, and store the result in h[0]. To see these CPU instructions more clearly in the code, I split the line into several lines:
       double f0; /* set aside double-precision storage location f0 */
       double g0; /* set aside double-precision storage location g0 */
       double h0; /* set aside double-precision storage location h0 */
       f0 = f[0]; /* copy 8 bytes from memory location f[0] to register f0 */
       g0 = g[0]; /* copy 8 bytes from memory location g[0] to register g0 */
       h0 = f0 + g0; /* add register f0 to register g0 */
       h[0] = h0; /* copy result from register h0 to memory location h[0] */
I did the same for the other lines, producing the following equivalent C function:
     void curve25519_add(double *h,const double *f,const double *g)
     {
       double f0;
       double f1;
       double f2;
       double f3;
       double f4;
       double f5;
       double f6;
       double f7;
       double f8;
       double f9;
       double f10;
       double f11;

       double g0;
       double g1;
       double g2;
       double g3;
       double g4;
       double g5;
       double g6;
       double g7;
       double g8;
       double g9;
       double g10;
       double g11;

       double h0;
       double h1;
       double h2;
       double h3;
       double h4;
       double h5;
       double h6;
       double h7;
       double h8;
       double h9;
       double h10;
       double h11;
     
       f0 = f[0];
       f1 = f[1];
       f2 = f[2];
       f3 = f[3];
       f4 = f[4];
       f5 = f[5];
       f6 = f[6];
       f7 = f[7];
       f8 = f[8];
       f9 = f[9];
       f10 = f[10];
       f11 = f[11];
     
       g0 = g[0];
       g1 = g[1];
       g2 = g[2];
       g3 = g[3];
       g4 = g[4];
       g5 = g[5];
       g6 = g[6];
       g7 = g[7];
       g8 = g[8];
       g9 = g[9];
       g10 = g[10];
       g11 = g[11];
     
       h0 = f0 + g0;
       h1 = f1 + g1;
       h2 = f2 + g2;
       h3 = f3 + g3;
       h4 = f4 + g4;
       h5 = f5 + g5;
       h6 = f6 + g6;
       h7 = f7 + g7;
       h8 = f8 + g8;
       h9 = f9 + g9;
       h10 = f10 + g10;
       h11 = f11 + g11;
     
       h[0] = h0;
       h[1] = h1;
       h[2] = h2;
       h[3] = h3;
       h[4] = h4;
       h[5] = h5;
       h[6] = h6;
       h[7] = h7;
       h[8] = h8;
       h[9] = h9;
       h[10] = h10;
       h[11] = h11;
     }
You might be wondering how the UltraSPARC, with only 32 double-precision floating-point registers, can handle the 36 values f0,f1,...,g0,g1,...,h0,h1,... declared here. The answer is that several values can be assigned to the same register if they aren't ``alive'' at the same time. For example, I peeked at the C compiler's output and saw that it had translated
       f0 = f[0];
       g0 = g[0];
       h0 = f0 + g0;
       h[0] = h0;
into (essentially)
       %f8 = f[0];
       %f32 = g[0];
       %f8 = %f8 + %f32;
       h[0] = %f8;
putting both f0 and h0 into register %f8.

The above C code is still hiding an important aspect of the corresponding CPU instructions, namely the multiplication involved in array indexing. Loading f[i] from memory means taking the supplied address f, adding 8*i bytes to it to find the address of f[i], and then loading the 8-byte floating-point number at that address. To see the CPU instructions more clearly, I replaced the double pointers by byte pointers:

     void curve25519_add(char *h,char *f,char *g)
     {
       double f0;
       double f1;
       double f2;
       double f3;
       double f4;
       double f5;
       double f6;
       double f7;
       double f8;
       double f9;
       double f10;
       double f11;
     
       double g0;
       double g1;
       double g2;
       double g3;
       double g4;
       double g5;
       double g6;
       double g7;
       double g8;
       double g9;
       double g10;
       double g11;
     
       double h0;
       double h1;
       double h2;
       double h3;
       double h4;
       double h5;
       double h6;
       double h7;
       double h8;
       double h9;
       double h10;
       double h11;
     
       f0 = *(double *) (f + 0);
       f1 = *(double *) (f + 8);
       f2 = *(double *) (f + 16);
       f3 = *(double *) (f + 24);
       f4 = *(double *) (f + 32);
       f5 = *(double *) (f + 40);
       f6 = *(double *) (f + 48);
       f7 = *(double *) (f + 56);
       f8 = *(double *) (f + 64);
       f9 = *(double *) (f + 72);
       f10 = *(double *) (f + 80);
       f11 = *(double *) (f + 88);
     
       g0 = *(double *) (g + 0);
       g1 = *(double *) (g + 8);
       g2 = *(double *) (g + 16);
       g3 = *(double *) (g + 24);
       g4 = *(double *) (g + 32);
       g5 = *(double *) (g + 40);
       g6 = *(double *) (g + 48);
       g7 = *(double *) (g + 56);
       g8 = *(double *) (g + 64);
       g9 = *(double *) (g + 72);
       g10 = *(double *) (g + 80);
       g11 = *(double *) (g + 88);
     
       h0 = f0 + g0;
       h1 = f1 + g1;
       h2 = f2 + g2;
       h3 = f3 + g3;
       h4 = f4 + g4;
       h5 = f5 + g5;
       h6 = f6 + g6;
       h7 = f7 + g7;
       h8 = f8 + g8;
       h9 = f9 + g9;
       h10 = f10 + g10;
       h11 = f11 + g11;
     
       *(double *) (h + 0) = h0;
       *(double *) (h + 8) = h1;
       *(double *) (h + 16) = h2;
       *(double *) (h + 24) = h3;
       *(double *) (h + 32) = h4;
       *(double *) (h + 40) = h5;
       *(double *) (h + 48) = h6;
       *(double *) (h + 56) = h7;
       *(double *) (h + 64) = h8;
       *(double *) (h + 72) = h9;
       *(double *) (h + 80) = h10;
       *(double *) (h + 88) = h11;
     }
The corresponding qhasm code, curve25519_add.q, is very similar:
     enter curve25519_add
     
     int64 h
     int64 f
     int64 g
     
     input h
     input f
     input g
     
     float64 f0
     float64 f1
     float64 f2
     float64 f3
     float64 f4
     float64 f5
     float64 f6
     float64 f7
     float64 f8
     float64 f9
     float64 f10
     float64 f11

     float64 g0
     float64 g1
     float64 g2
     float64 g3
     float64 g4
     float64 g5
     float64 g6
     float64 g7
     float64 g8
     float64 g9
     float64 g10
     float64 g11

     float64 h0
     float64 h1
     float64 h2
     float64 h3
     float64 h4
     float64 h5
     float64 h6
     float64 h7
     float64 h8
     float64 h9
     float64 h10
     float64 h11
     
     f0 = *(float64 *) (f + 0)
     f1 = *(float64 *) (f + 8)
     f2 = *(float64 *) (f + 16)
     f3 = *(float64 *) (f + 24)
     f4 = *(float64 *) (f + 32)
     f5 = *(float64 *) (f + 40)
     f6 = *(float64 *) (f + 48)
     f7 = *(float64 *) (f + 56)
     f8 = *(float64 *) (f + 64)
     f9 = *(float64 *) (f + 72)
     f10 = *(float64 *) (f + 80)
     f11 = *(float64 *) (f + 88)
    
     g0 = *(float64 *) (g + 0)
     g1 = *(float64 *) (g + 8)
     g2 = *(float64 *) (g + 16)
     g3 = *(float64 *) (g + 24)
     g4 = *(float64 *) (g + 32)
     g5 = *(float64 *) (g + 40)
     g6 = *(float64 *) (g + 48)
     g7 = *(float64 *) (g + 56)
     g8 = *(float64 *) (g + 64)
     g9 = *(float64 *) (g + 72)
     g10 = *(float64 *) (g + 80)
     g11 = *(float64 *) (g + 88)
   
     h0 = f0 + g0
     h1 = f1 + g1
     h2 = f2 + g2
     h3 = f3 + g3
     h4 = f4 + g4
     h5 = f5 + g5
     h6 = f6 + g6
     h7 = f7 + g7
     h8 = f8 + g8
     h9 = f9 + g9
     h10 = f10 + g10
     h11 = f11 + g11
     
     *(float64 *) (h + 0) = h0
     *(float64 *) (h + 8) = h1
     *(float64 *) (h + 16) = h2
     *(float64 *) (h + 24) = h3
     *(float64 *) (h + 32) = h4
     *(float64 *) (h + 40) = h5
     *(float64 *) (h + 48) = h6
     *(float64 *) (h + 56) = h7
     *(float64 *) (h + 64) = h8
     *(float64 *) (h + 72) = h9
     *(float64 *) (h + 80) = h10
     *(float64 *) (h + 88) = h11
     
     leave
This qhasm code says int64 h to set aside a 64-bit integer register for h, float64 h3 to set aside a 64-bit floating-point register for h3, etc. Like a C compiler, the qhasm register allocator can put several values into one register if the values' lifetimes don't overlap; the qhasm register allocator is extremely good at this. Unlike a C compiler, the qhasm register allocator doesn't silently insert loads and stores if it can't figure out how to fit all live values into registers; loads and stores have to be explicitly indicated by higher-level tools or by users.

After writing curve25519_add.q I wrote curve25519_addtest.c to test the addition function on a not-very-random input:

     #include <stdio.h>
     #include <stdlib.h>
     
     double powers[12] = {
       1.0 /* 2^0 */
     , 4194304.0 /* 2^22 */
     , 8796093022208.0 /* 2^43 */
     , 18446744073709551616.0 /* 2^64 */
     , 38685626227668133590597632.0 /* 2^85 */
     , 162259276829213363391578010288128.0 /* 2^107 */
     , 340282366920938463463374607431768211456.0 /* 2^128 */
     , 713623846352979940529142984724747568191373312.0 /* 2^149 */
     , 1496577676626844588240573268701473812127674924007424.0 /* 2^170 */
     , 6277101735386680763835789423207666416102355444464034512896.0 /* 2^192 */
     , 13164036458569648337239753460458804039861886925068638906788872192.0 /* 2^213 */
     , 27606985387162255149739023449108101809804435888681546220650096895197184.0 /* 2^234 */
     } ;
     
     double f[12];
     double g[12];
     double h[12];
     
     extern void curve25519_add(double [12],const double [12],const double [12]);
     
     int main(int argc,char **argv)
     {
       int i;
       for (i = 0;i < 12;++i) f[i] = random() * powers[i];
       printf("f=0"); for (i = 0;i < 12;++i) printf("%+.0f",f[i]); printf(";\n");
       for (i = 0;i < 12;++i) g[i] = random() * powers[i];
       printf("g=0"); for (i = 0;i < 12;++i) printf("%+.0f",g[i]); printf(";\n");
       curve25519_add(h,f,g);
       printf("h=0"); for (i = 0;i < 12;++i) printf("%+.0f",h[i]); printf(";\n");
       printf("h-(f+g)\n");
       return 0;
     }
I compiled curve25519_addtest.c together with curve25519_add.q, and fed the output of curve25519_addtest through gp:
     ./qhasm-sparc < curve25519_add.q > curve25519_add.s
     gcc -m64 -o curve25519_addtest curve25519_addtest.c curve25519_add.s
     ./curve25519_addtest | gp
The -m64 option to gcc is essential for the UltraSPARC; otherwise you end up with code compiled for the original 32-bit SPARCs. Anyway, gp reported that h-(f+g) was indeed 0. I won't worry about doing more serious tests at this point.

More operations next time.

2006.11.04

I decided to rename f0,f1,f2,... as f0,f22,f43,... for consistency with the f_0, f_22, ... notation that I'm using in formulas:
     float64 f0
     float64 f22
     float64 f43
     ...
     float64 g0
     float64 g22
     float64 g43
     ...
     float64 h0
     float64 h22
     float64 h43
     ...
     f0 = *(float64 *) (f + 0)
     f22 = *(float64 *) (f + 8)
     f43 = *(float64 *) (f + 16)
     ...
     g0 = *(float64 *) (g + 0)
     g22 = *(float64 *) (g + 8)
     g43 = *(float64 *) (g + 16)
     ...
     h0 = f0 + g0
     h22 = f22 + g22
     h43 = f43 + g43
     ...
     *(float64 *) (h + 0) = h0
     *(float64 *) (h + 8) = h22
     *(float64 *) (h + 16) = h43
qhasm also allows underscores in names, but I'm not interested in typing quite that much. :-)

Here's how to download and test the revised addition software on an UltraSPARC:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_add.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_addtest.c
     ./qhasm-sparc < curve25519_add.q > curve25519_add.s
     gcc -m64 -o curve25519_addtest curve25519_addtest.c curve25519_add.s
     ./curve25519_addtest | gp
Now let's look at some other integer operations.

Subtraction: 12 fadd + 0 fmul. Subtraction has the same cost as addition: simply replace

     h0 = f0 + g0
with
     h0 = f0 - g0
etc. Of course, the resulting h_i is often negative, but that's okay; floating-point numbers are allowed to be negative.

Each cycle, the UltraSPARC's floating-point adder handles one addition or one subtraction. I'll count subtractions as if they were additions.

Here's the software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_sub.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_subtest.c
     ./qhasm-sparc < curve25519_sub.q > curve25519_sub.s
     gcc -m64 -o curve25519_subtest curve25519_subtest.c curve25519_sub.s
     ./curve25519_subtest | gp

Multiplication, schoolbook method: 121 fadd + 144 fmul. If f = f_0 + f_22 + f_43 + f_64 + f_85 + f_107 + f_128 + f_149 + f_170 + f_192 + f_213 + f_234 and g = g_0 + g_22 + g_43 + g_64 + g_85 + g_107 + g_128 + g_149 + g_170 + g_192 + g_213 + g_234 then fg = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234 + h_255 + h_277 + h_298 + h_319 + h_340 + h_362 + h_383 + h_404 + h_425 + h_447 + h_468 where each h is the sum of the products f_i g_j at matching positions: h_0 = f_0 g_0; h_22 = f_0 g_22 + f_22 g_0; h_43 = f_0 g_43 + f_22 g_22 + f_43 g_0; and so on, up to h_468 = f_234 g_234.

In fact, the polynomial h_0 x^0 + h_22 x^1 + ... + h_468 x^22 is the product of the polynomials f_0 x^0 + f_22 x^1 + ... + f_234 x^11 and g_0 x^0 + g_22 x^1 + ... + g_234 x^11. Notice that h_i, like f_i and g_i, is a small multiple of 2^i: for example, f_0 g_43 is a small multiple of 2^0 2^43 = 2^43, f_22 g_22 is a small multiple of 2^22 2^22 = 2^44, and f_43 g_0 is a small multiple of 2^43 2^0 = 2^43, so the sum h_43 is a small multiple of 2^43.

(It's important here that I chose my indices as ceilings of multiples of 21.25. If I had instead used the floors 0, 21, 42, 63, 85, ... then I would have run into trouble with f_0 g_85 + f_21 g_63 + f_42 g_42 + f_63 g_21 + f_85 g_0 being a small multiple of 2^84 but often not a multiple of the desired 2^85.)

The above formulas for h_0, h_22, ..., h_468 involve 12^2 = 144 multiplications (each f_i times each g_j) and 11^2 = 121 additions (which I think of as 144 for the number of products minus 23 for the number of outputs).
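
Here's a plain C sketch of the schoolbook product on this representation (mine, purely for illustration; the real code is curve25519_mul.q). The 23 additions into an initial zero disappear in straight-line code, leaving the 144 fmul + 121 fadd counted above:
     void schoolbook12(double h[23],const double f[12],const double g[12])
     {
       int i;
       int j;
       for (i = 0;i < 23;++i) h[i] = 0;
       for (i = 0;i < 12;++i)
         for (j = 0;j < 12;++j)
           h[i + j] += f[i] * g[j]; /* 144 products summed into 23 coefficients */
     }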

Multiplication, Karatsuba's method: 114 fadd + 142 fmul. I'm going to use ``Karatsuba's method'' to compute the same quantities h_0, h_22, ..., h_468 more efficiently. Warning: There are several different polynomial-product and integer-product formulas called ``Karatsuba's method,'' and some of them (including the formulas in Karatsuba's original 1963 paper) are slightly more expensive than the formulas shown here. There's no standard terminology for the different formulas.

Anyway, define F_0 = f_0 x^0 + f_22 x^1 + ... + f_107 x^5, F_1 = f_128 x^0 + f_149 x^1 + ... + f_234 x^5, G_0 = g_0 x^0 + g_22 x^1 + ... + g_107 x^5, and G_1 = g_128 x^0 + g_149 x^1 + ... + g_234 x^5. The goal is to compute the polynomial product (F_0 + x^6 F_1)(G_0 + x^6 G_1).

I'll compute (F_0 + x^6 F_1)(G_0 + x^6 G_1) as K + 2^127 x^6 ((F_0 + 2^(-127) F_1)(G_0 + 2^(-127) G_1) - K) where K = F_0 G_0 - 2^(-127) x^6 F_1 G_1. Observe that there are only three half-size polynomial products here, rather than the four that appear in the obvious formula F_0 G_0 + x^6 (F_0 G_1 + F_1 G_0) + x^12 F_1 G_1; so I'll end up with only 3 6^2 = 108 coefficient multiplications, rather than 4 6^2 = 144.

Here are the intermediate quantities computed: the coefficients of F_0 + 2^(-127) F_1; the coefficients of G_0 + 2^(-127) G_1; the coefficients of the product (F_0 + 2^(-127) F_1)(G_0 + 2^(-127) G_1); the coefficients of K, namely h_0, h_22, h_43, h_64, h_85, h_107, k_128, k_149, k_170, k_192, k_213, -2^(-127) h_362, -2^(-127) h_383, -2^(-127) h_404, -2^(-127) h_425, -2^(-127) h_447, -2^(-127) h_468, where each k combines a coefficient of F_0 G_0 with a 2^(-127)-scaled coefficient of F_1 G_1; and finally the coefficients of (F_0 + x^6 F_1)(G_0 + x^6 G_1), namely h_0, ..., h_468. Overall there are 142 floating-point multiplications here: the aforementioned 108 coefficient multiplications, 23 multiplications by 2^(-127), and 11 multiplications by 2^127. There are also 114 floating-point additions (again counting subtractions as additions).
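
Here's a C sketch of this Karatsuba evaluation (again mine, purely for illustration; it performs some redundant operations and doesn't match the hand-counted 114 fadd + 142 fmul of curve25519_mul.q):
     static void mul6(double r[11],const double a[6],const double b[6])
     {
       int i;
       int j;
       for (i = 0;i < 11;++i) r[i] = 0;
       for (i = 0;i < 6;++i)
         for (j = 0;j < 6;++j)
           r[i + j] += a[i] * b[j];
     }

     void karatsuba12(double h[23],const double f[12],const double g[12])
     {
       const double c127 = 170141183460469231731687303715884105728.0; /* 2^127 */
       const double cm127 = 1.0 / c127;                               /* 2^-127, exact */
       double fs[6];
       double gs[6];
       double p00[11];
       double p11[11];
       double m[11];
       double k[17];
       int i;

       for (i = 0;i < 6;++i) fs[i] = f[i] + cm127 * f[i + 6]; /* F_0 + 2^-127 F_1 */
       for (i = 0;i < 6;++i) gs[i] = g[i] + cm127 * g[i + 6]; /* G_0 + 2^-127 G_1 */
       mul6(m,fs,gs);         /* (F_0 + 2^-127 F_1)(G_0 + 2^-127 G_1) */
       mul6(p00,f,g);         /* F_0 G_0 */
       mul6(p11,f + 6,g + 6); /* F_1 G_1 */

       for (i = 0;i < 17;++i) k[i] = 0;
       for (i = 0;i < 11;++i) k[i] += p00[i];
       for (i = 0;i < 11;++i) k[i + 6] -= cm127 * p11[i]; /* K = F_0 G_0 - 2^-127 x^6 F_1 G_1 */

       for (i = 0;i < 23;++i) h[i] = (i < 17) ? k[i] : 0;
       for (i = 0;i < 17;++i)                              /* K + 2^127 x^6 (M - K) */
         h[i + 6] += c127 * ((i < 11 ? m[i] : 0) - k[i]);
     }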

How does a constant such as 2^127 end up in a floating-point register? The standard answer has three steps. First, create an array of constants in memory, for example with the following C code:

     const double curve25519_constants[2] = {
       170141183460469231731687303715884105728.0 /* 2^127 */
     , 0.0000000000000000000000000000000000000058774717541114375398436826861112283890933277838604376075437585313920862972736358642578125 /* 2^-127 */
     } ;
Second, copy the address of the array into a register, for example with the following standard sequence of UltraSPARC instructions:
     int64 constants
     int64 constants_low
     constants = (curve25519_constants & 0xfffffc0000000000) >> 32
     constants_low = curve25519_constants & 0xfffffc00
     constants |= (curve25519_constants & 0x3ff00000000) >> 32
     constants_low |= curve25519_constants & 0x3ff
     constants <<= 32
     constants |= constants_low
Third, load the constants from memory:
     float64 two127
     float64 twom127
     two127 = *(float64 *) (constants + 0)
     twom127 = *(float64 *) (constants + 8)

I wrote a straightforward curve25519_mul.q and immediately ran into errors from the assembler: I was trying to use more than 32 floating-point registers at once, keeping all 12 f's, all 12 g's, all 23 h's, etc. live through the entire computation. So I juggled the code to kill values earlier: for example, I computed h0, h22, h43, h64, h85, h107, stored them, and subtracted them from w0, w22, w43, w64, w85, w107, to free up the registers that they were using. I also reloaded the f's and g's when they were needed, rather than keeping them continuously live.

Here's the resulting software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_mul.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_multest.c
     ./qhasm-sparc < curve25519_mul.q > curve25519_mul.s
     gcc -m64 -o curve25519_multest curve25519_multest.c curve25519_mul.s
     ./curve25519_multest | gp

Squaring, schoolbook method: 55 fadd + 89 fmul. In Curve25519, many of the multiplications modulo 2^255-19 are actually squarings: f_i = g_i for each i. This means that, for example, the equation h_107 = f_0 g_107 + f_22 g_85 + f_43 g_64 + f_64 g_43 + f_85 g_22 + f_107 g_0 can be replaced by the equation h_107 = 2 (f_0 f_107 + f_22 f_85 + f_43 f_64), replacing 6 multiplications and 5 additions with 4 multiplications and 2 additions.

(I'm assuming here that doublings are carried out as multiplications by the constant 2. If floating-point multiplications turn out to be a bottleneck later then these multiplications can be replaced by additions.)

The equation h_107 = (2 f_0) f_107 + (2 f_22) f_85 + (2 f_43) f_64 is slightly better, costing only 3 multiplications and 2 additions if 2f_0, 2f_22, 2f_43 are precomputed. It's easy to see that the overall benefit of these precomputations outweighs the overall cost: there are only 11 doublings (of f_0, f_22, ..., f_213) rather than 21 doublings (in h_22, h_43, ..., h_447).
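
Here's a C sketch (mine, for illustration) of schoolbook squaring with the doubled coefficients precomputed; the resulting operation counts are tallied in the next paragraph:
     void square12(double h[23],const double f[12])
     {
       double f2[11]; /* precomputed 2 f_i; f_234 never needs doubling */
       int i;
       int j;
       for (i = 0;i < 11;++i) f2[i] = f[i] * 2.0; /* 11 doublings */
       for (i = 0;i < 23;++i) h[i] = 0;
       for (i = 0;i < 12;++i) {
         h[i + i] += f[i] * f[i];                 /* 12 diagonal products */
         for (j = i + 1;j < 12;++j)
           h[i + j] += f2[i] * f[j];              /* 66 doubled off-diagonal products */
       }
     }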

What happens when these speedups are applied to schoolbook multiplication? There are 144 fmul, of which 132 are f_i g_j for distinct i,j, of which 132/2 = 66 are eliminated at the expense of 11 doublings. The remaining 144 - 66 = 78 fmul results require only 78 - 23 = 55 fadds to be compressed into 23 outputs. Overall cost: 55 fadd + 78 fmul + 11 fmul. Software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_sq.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_sqtest.c
     ./qhasm-sparc < curve25519_sq.q > curve25519_sq.s
     gcc -m64 -o curve25519_sqtest curve25519_sqtest.c curve25519_sq.s
     ./curve25519_sqtest | gp

Can one do better? Not to my knowledge. The lowest cost I've been able to achieve for Karatsuba squaring is 63 fadd + 106 fmul. Squaring saves considerably more in the schoolbook method than it does in Karatsuba's method.

Reduction modulo 2^255-19: 11 fadd + 11 fmul. If h = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234 + h_255 + h_277 + h_298 + h_319 + h_340 + h_362 + h_383 + h_404 + h_425 + h_447 + h_468, with h_i a multiple of 2^i, then h is congruent to r_0 + r_22 + r_43 + r_64 + r_85 + r_107 + r_128 + r_149 + r_170 + r_192 + r_213 + r_234 modulo 2^255 - 19, where r_i = h_i + 19 2^(-255) h_(i+255) for each i in {0, 22, 43, 64, 85, 107, 128, 149, 170, 192, 213}, and r_234 = h_234.

Indeed, the difference (r_0 + r_22 + r_43 + r_64 + r_85 + r_107 + r_128 + r_149 + r_170 + r_192 + r_213 + r_234) - h is a sum of terms such as (19 2^(-255) - 1) h_255 = (2^255 - 19) 2^(-255) h_255. This term is a multiple of 2^255 - 19, since by hypothesis 2^(-255) h_255 is an integer; same for the other terms.
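
In array terms (h stored as 23 coefficients at positions 0 through 22, r as 12 coefficients), this reduction is the following C sketch (mine, for illustration; curve25519_modp.q is the actual code):
     void reduce23(double r[12],const double h[23])
     {
       const double scale = 19.0 / 57896044618658097711785492504343953926634992332820282019728792003956564819968.0; /* 19 * 2^-255, exact */
       int i;
       for (i = 0;i < 11;++i) r[i] = h[i] + scale * h[i + 12]; /* 11 fadd + 11 fmul */
       r[11] = h[11];
     }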

Computing the r_i's from the h_i's takes 11 multiplications by the constant 19 2^(-255) and 11 additions. Software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_modp.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_modptest.c
     ./qhasm-sparc < curve25519_modp.q > curve25519_modp.s
     gcc -m64 -o curve25519_modptest curve25519_modptest.c \
     curve25519_modp.s curve25519_mul.s
     ./curve25519_modptest | gp

Carrying and reduction modulo 2^255-19: 63 fadd + 11 fmul. ``Carrying r_0 to r_22'' means subtracting the closest multiple of 2^22 from r_0 and adding it to r_22. This doesn't change the sum r_0 + r_22, but it guarantees that r_0 is in the range {-2^21,...,-1,0,1,...,2^21}.

Similarly, ``carrying r_22 to r_43'' means subtracting the closest multiple of 2^43 from r_22 and adding it to r_43. This doesn't change the sum r_22 + r_43, but it guarantees that 2^(-22) r_22 is in the range {-2^20,...,-1,0,1,...,2^20}. By carrying r_0 to r_22, then carrying r_22 to r_43, then carrying r_43 to r_64, and so on, we guarantee that each 2^(-i) r_i is in a narrow range, making the r's suitable inputs for subsequent multiplications.

A carry takes four floating-point additions:

     closest43 = r22 + alpha43    # rounded
     closest43 -= alpha43         # exact
     r22 -= closest43             # exact
     r43 += closest43             # exact
Here alpha_43 is the constant 2^43 (2^52 + 2^51). The point is that floating-point numbers around alpha_43 are exactly the multiples of 2^43: ..., 2^43 (2^52 + 2^51 - 1), 2^43 (2^52 + 2^51), 2^43 (2^52 + 2^51 + 1), ... The sum r_22 + alpha_43 is thus rounded to the closest multiple of 2^43; subtracting alpha_43 produces the multiple of 2^43 closest to r_22, exactly as desired.
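
Here's a tiny C demonstration of the rounding trick (my illustration; it assumes IEEE double precision with round-to-nearest, and the volatile qualifiers are there to keep the compiler from evaluating in extended precision):
     #include <stdio.h>

     int main(void)
     {
       volatile double alpha43 = 59421121885698253195157962752.0; /* 2^43 (2^52 + 2^51) */
       volatile double r22 = 3000000.0 * 4194304.0;               /* some multiple of 2^22 */
       volatile double closest43 = r22 + alpha43;                 /* rounds to a multiple of 2^43 */
       closest43 -= alpha43;                                      /* exact: the multiple of 2^43 closest to r22 */
       printf("%.0f %.0f\n",closest43,r22 - closest43);           /* remainder has absolute value at most 2^42 */
       return 0;
     }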

What happens to the carry from r_234? The easy answer is to carry from r_234 to r_0, reducing the carry modulo 2^255-19:

     closest255 = r234 + alpha255
     closest255 -= alpha255
     r234 -= closest255
     closest255 *= scale
     r0 += closest255
where scale is 19 2^(-255).

This might make r_0 too big again. If r_234 starts with 50 bits, for example, then closest_255 can be as large as 2^284; so 19 2^(-255) 2^284, more than 2^33, is added to r_0. Another carry from r_0 to r_22 is required. But the carries can stop after that: a chain of 13 carries from r_0 to r_22 to r_43 to ... to r_234 to r_0 to r_22 guarantees that all r_i's are reasonably small.
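
Here's a C sketch (mine, for illustration) of this straightforward 13-carry strategy, assuming IEEE double-precision arithmetic in round-to-nearest mode and no extended-precision evaluation; the interleaved strategy described next eliminates the extra scale multiplication:
     void carry12(double r[12])
     {
       static const int e[13] = {0,22,43,64,85,107,128,149,170,192,213,234,255};
       const double alpha0 = 6755399441055744.0;  /* 2^52 + 2^51 */
       double pow2[13];                           /* pow2[j] = 2^e[j] */
       double scale;                              /* 19 * 2^-255 */
       double alpha;
       double c;
       int j;
       int b;

       for (j = 0;j < 13;++j) {
         pow2[j] = 1.0;
         for (b = 0;b < e[j];++b) pow2[j] *= 2.0; /* exact powers of 2 */
       }
       scale = 19.0 / pow2[12];                   /* exact */

       for (j = 0;j < 11;++j) {                   /* 11 carries: r_0 -> r_22 -> ... -> r_234 */
         alpha = pow2[j + 1] * alpha0;
         c = (r[j] + alpha) - alpha;              /* multiple of 2^e[j+1] closest to r[j] */
         r[j] -= c;
         r[j + 1] += c;
       }

       alpha = pow2[12] * alpha0;                 /* carry r_234 -> r_0, reduced mod 2^255-19 */
       c = (r[11] + alpha) - alpha;               /* multiple of 2^255 closest to r[11] */
       r[11] -= c;
       r[0] += c * scale;

       alpha = pow2[1] * alpha0;                  /* one more carry: r_0 -> r_22 */
       c = (r[0] + alpha) - alpha;
       r[0] -= c;
       r[1] += c;
     }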

One can eliminate the extra scale multiplication by carefully interleaving the carrying into the original reduction modulo 2^255-19. Here's one reasonable carry/reduce strategy, starting from h = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234 + h_255 + h_277 + h_298 + h_319 + h_340 + h_362 + h_383 + h_404 + h_425 + h_447 + h_468:

The 13 carries here cost 52 fadd + 0 fmul, and the 11 reductions cost 11 fadd + 11 fmul, for a total of 63 fadd + 11 fmul.

Software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_carry.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_carrytest.c
     ./qhasm-sparc < curve25519_carry.q > curve25519_carry.s
     gcc -m64 -o curve25519_carrytest curve25519_carrytest.c \
     curve25519_carry.s curve25519_mul.s curve25519_sq.s
     ./curve25519_carrytest | gp

Multiplication, carrying, reduction: 177 fadd + 153 fmul. A complete integer multiplication modulo 2^255-19, producing small results suitable for subsequent multiplications, costs 114 fadd + 142 fmul for the polynomial multiplication, and an additional 63 fadd + 11 fmul for carrying and reduction.

Squaring, carrying, reduction: 118 fadd + 100 fmul. A complete integer squaring modulo 2^255-19, producing small results suitable for subsequent multiplications, costs 55 fadd + 89 fmul for the polynomial squaring, and an additional 63 fadd + 11 fmul for carrying and reduction.

What's coming up. I'll need a few more functions to build a complete Curve25519 computation:

Once the code is working, I'll transform it to eliminate load/store bottlenecks, latency bottlenecks, etc., hopefully bringing the number of cycles close to the number of fadd instructions.

2006.11.05

New qhasm release, expanding the SPARC machine description to include a few additional UltraSPARC instructions that I'm using today:
     wget https://cr.yp.to/qhasm/qhasm-20061105.tar.gz
     gunzip < qhasm-20061105.tar.gz | tar -xf -
     cd qhasm-20061105
     ./do
Now back to the Curve25519 operations.

Small-constant multiplication, carrying, reduction: 48 fadd + 13 fmul. As we'll see later, the Curve25519 main loop has 8 additions/subtractions and 10 multiplications. Half of the 10 multiplications are special: 4 of them are squarings, and 1 of them is a multiplication by a particular integer, namely 121665.

Multiplying by a 17-bit integer is of course easier than multiplying by a 255-bit integer: one can simply multiply each of the 12 floating-point inputs by 121665, costing 0 fadd + 12 fmul. But the results need 12 carries, costing 48 fadd + 1 fmul, so the overall benefit is not as dramatic as one might initially expect.
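
As a C sketch (mine, for illustration), the pre-carry part is simply:
     void mul121665(double h[12],const double f[12])
     {
       int i;
       for (i = 0;i < 12;++i) h[i] = f[i] * 121665.0; /* 12 fmul; the 12 carries follow */
     }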

Software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_121665.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_121665test.c
     ./qhasm-sparc < curve25519_121665.q > curve25519_121665.s
     gcc -m64 -o curve25519_121665test curve25519_121665test.c curve25519_121665.s
     ./curve25519_121665test | gp

Selection: 0 fadd + 0 fmul. There's one additional operation in the Curve25519 main loop: given four 255-bit integers x[0], z[0], x[1], z[1] and a secret bit b, compute x[b], z[b] and x[1-b], z[1-b].

Here are three strategies for performing this operation:

I'll try the third strategy using integer bit operations:

     r3 = *(uint64 *) (r + 24)
     s3 = *(uint64 *) (s + 24)
     s3 ^= r3
     q3 = s3 & bminus1
     q3 ^= r3
     p3 = q3 ^ s3
     *(uint64 *) (p + 24) = p3
     *(uint64 *) (q + 24) = q3
A similar sequence of 3 fadd + 1 fmul, or alternatively 2 fadd + 4 fmul, would have the virtue of working with floating-point registers, but I'm guessing that the loads and stores to move to integer registers won't be a problem, and I don't want to consume precious fadd resources.
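
Here's the same masking idea as a one-word C sketch (mine, for illustration), assuming b is 0 or 1 so that b - 1 is an all-ones mask exactly when b is 0:
     #include <stdint.h>

     void select64(uint64_t *p,uint64_t *q,uint64_t r,uint64_t s,uint64_t b)
     {
       uint64_t mask = b - 1;     /* all ones if b = 0, all zeros if b = 1 */
       uint64_t t = (r ^ s) & mask;
       *q = r ^ t;                /* r if b = 1, s if b = 0 */
       *p = s ^ t;                /* s if b = 1, r if b = 0 */
     }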

Software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_select.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_selecttest.c
     ./qhasm-sparc < curve25519_select.q > curve25519_select.s
     gcc -m64 -o curve25519_selecttest curve25519_selecttest.c curve25519_select.s
     ./curve25519_selecttest | gp

Conversion from 32-byte string: 44 fadd + 0 fmul. Say f is in {-2^51,...,-1,0,1,...,2^51-1}. Here's how to take f out of an UltraSPARC integer register and put it into an UltraSPARC floating-point register: add the constant 2^52 1075 + 2^51 to f in the integer register; store the resulting 8 bytes to memory; load those 8 bytes into a floating-point register; and finally subtract the floating-point constant 2^52 + 2^51.

Here's why this works: Say s is in {0,1}, e is in {1,2,...,2046}, and f is in {-2^51,...,-1,0,1,...,2^51-1}. The UltraSPARC follows the IEEE 754 standard, which specifies that the 8 bytes representing the 64-bit integer 2^63 s + 2^52 e + 2^51 + f also represent the 64-bit floating-point number (-1)^s 2^(e-1023) (1 + 2^(-1) + 2^(-52) f). Now substitute s = 0 and e = 1075. (IEEE 754 also specifies interpretations of e = 0 and e = 2047, but we won't need to worry about those.)
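
Here's a small C demonstration of that bit-pattern identity for s = 0 and e = 1075 (my illustration, assuming an IEEE 754 platform; memcpy reinterprets the 8 bytes):
     #include <stdint.h>
     #include <stdio.h>
     #include <string.h>

     int main(void)
     {
       int64_t f = -123456789;                  /* any f in {-2^51,...,2^51-1} */
       uint64_t bits = ((uint64_t) 1075 << 52) + ((uint64_t) 1 << 51) + (uint64_t) f;
       double d;
       memcpy(&d,&bits,8);                      /* reinterpret the 8 bytes as a double */
       printf("%.0f\n",d - 6755399441055744.0); /* subtract 2^52 + 2^51; prints f */
       return 0;
     }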

It's just as easy to take f out of an UltraSPARC integer register and put a scaled version of f, let's say 2^32 f, into an UltraSPARC floating-point register: use exponent 1075 + 32 = 1107 instead of 1075, i.e., add the constant 2^52 1107 + 2^51 to f, store, load, and subtract the floating-point constant 2^84 + 2^83.

What if f is in memory rather than in an integer register? To be concrete, let's say bytes (x[0],x[1],x[2],x[3],x[4],x[5]) in memory represent an integer f in {0,1,...,2^48-1} in the usual little-endian way: f = x[0] + 2^8 x[1] + 2^16 x[2] + 2^24 x[3] + 2^32 x[4] + 2^40 x[5]. Here's how to put the same integer f into a floating-point register. First load the bytes into registers (with separate byte loads; loading *(uint64 *) (x+0) doesn't work on the UltraSPARC unless x is 0 modulo 8):

     int64 f0
     int64 f1
     int64 f2
     int64 f3
     int64 f4
     int64 f5
     f0 = *(uint8 *) (x + 0) 
     f1 = *(uint8 *) (x + 1)
     f2 = *(uint8 *) (x + 2)
     f3 = *(uint8 *) (x + 3)
     f4 = *(uint8 *) (x + 4)
     f5 = *(uint8 *) (x + 5)
Shift and add to compute x0 = 2^52 1075 + 2^51 + f = 2^51 2151 + f:
     f1 <<= 8
     f2 <<= 16
     f3 <<= 24
     f4 <<= 32
     f5 <<= 40
     int64 x0
     x0 = 2151
     x0 <<= 51
     x0 += f0
     x0 += f1
     x0 += f2
     x0 += f3
     x0 += f4
     x0 += f5
Set aside 8 bytes of temporary storage in memory, and store x0 there:
     stack64 y0
     y0 = x0
Load the 8 bytes into a floating-point register, and finally subtract 2^52 + 2^51:
     float64 z0
     z0 = y0
     z0 -= alpha0

One of the Curve25519 inputs is a 255-bit integer stored in 32 bytes of memory. I convert

and then carry z_0 to z_22 to z_43 to z_64 to z_85 to z_107 to z_128 to z_149 to z_170 to z_192 to z_213 to z_234. I then have 12 floating-point registers representing the same integer, at the cost of 44 fadd. Software:
     wget https://cr.yp.to/highspeed/fall2006/curve25519_todouble.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_todoubletest.c
     ./qhasm-sparc < curve25519_todouble.q > curve25519_todouble.s
     gcc -m64 -o curve25519_todoubletest curve25519_todoubletest.c curve25519_todouble.s
     ./curve25519_todoubletest | gp

One can save time here by using more integer operations, first to partition the input bits more precisely and second to create the integer representing f rather than 2^52 + 2^51 + f. But I won't bother: conversion to double is a very small part of the Curve25519 time and doesn't warrant any extra programming effort.

Conversion to 32-byte string: 12 fadd + 0 fmul. The output of Curve25519 is an integer in {0,1,...,2^255-20}, represented as a 32-byte string in the same way as the input. This isn't what's naturally produced by arithmetic:

The results of arithmetic have been squeezed to save time in subsequent operations, but they haven't been frozen into a unique result.

Here's a reasonably straightforward freezing strategy. Let's say that the final carrying and reduction produce r = r_0 + r_22 + r_43 + r_64 + r_85 + r_107 + r_128 + r_149 + r_170 + r_192 + r_213 + r_234. Carrying has put each r_i into a fairly limited range:

Evidently the sum r can be slightly outside, but not much outside, the range {-2^254,...,2^254}. I'll simplify matters by making everything positive: 2^255-19 + r is the sum of positive quantities s_0, s_16, ..., s_232, where each s_i is a multiple of 2^i and each 2^(-i) s_i is a reasonably small integer. Putting the small integers 2^(-i) s_i into integer registers is easy. Converting floating-point r_85 into integer 2^(-80) s_80, for example, is a simple matter of adding the constant 2^80 (2^52 + 2^51) + 2^108 - 2^86, storing the result into 8 bytes of memory, and loading the bottom 4 bytes into an integer register. This procedure, like the procedure for converting to double, exploits the IEEE 754 representation of floating-point numbers: the 8-byte string that represents the 64-bit integer 2^52 1155 + 2^51 + 2^(-80) s_80 also represents the floating-point number 2^80 (2^52 + 2^51) + 2^108 - 2^86 + r_85.

A series of shifts and adds now converts s_0, 2^(-16) s_16, ..., 2^(-232) s_232 to integers t_0, t_32, t_64, t_96, t_128, t_160, t_192, t_224 in {0,1,...,2^32-1} with 2^255-19 + r = t_0 + 2^32 t_32 + ... + 2^224 t_224. To see whether 2^255-19 + r is below 2^255-19, I'll add 19 and inspect the 255th bit, taking advantage of the UltraSPARC's 32-bit-carry instructions:

     carry32? u0 = t0 + 19
     carry32? u32 = t32 + 0 + carry32
     carry32? u64 = t64 + 0 + carry32
     carry32? u96 = t96 + 0 + carry32
     carry32? u128 = t128 + 0 + carry32
     carry32? u160 = t160 + 0 + carry32
     carry32? u192 = t192 + 0 + carry32
     carry32? u224 = t224 + 0 + carry32
     wantt = (uint32) u224 >> 31
     wantt -= 1
     wantu = ~wantt
Now it's easy to select either r or 2^255-19 + r as the final Curve25519 output. Here's the software:
     wget https://cr.yp.to/highspeed/fall2006/curve25519_freeze.q
     wget https://cr.yp.to/highspeed/fall2006/curve25519_freezetest.c
     ./qhasm-sparc < curve25519_freeze.q > curve25519_freeze.s
     gcc -m64 -o curve25519_freezetest curve25519_freezetest.c \
     curve25519_freeze.s curve25519_carry.s curve25519_mul.s curve25519_sq.s \
     curve25519_todouble.s
     ./curve25519_freezetest | gp

The complete Curve25519 computation: 414907 fadd + 327626 fmul. The Curve25519 computation starts with two 32-byte inputs. It

The most important part of the computation is the main loop:

     for (pos = 254;pos >= 0;--pos) {
       b = e[pos / 8] >> (pos & 7);
       b &= 1;
       curve25519_select(xzmb,xzm1b,xzm,xzm1,b);
       curve25519_add(a0,xzmb,xzmb + 12);
       curve25519_sub(a0 + 12,xzmb,xzmb + 12);
       curve25519_add(a1,xzm1b,xzm1b + 12);
       curve25519_sub(a1 + 12,xzm1b,xzm1b + 12);
       curve25519_sqcarry(b0,a0);
       curve25519_sqcarry(b0 + 12,a0 + 12);
       curve25519_mulcarry(b1,a1,a0 + 12);
       curve25519_mulcarry(b1 + 12,a1 + 12,a0);
       curve25519_add(c1,b1,b1 + 12);
       curve25519_sub(c1 + 12,b1,b1 + 12);
       curve25519_sqcarry(r,c1 + 12);
       curve25519_sub(s,b0,b0 + 12);
       curve25519_121665(t,s);
       curve25519_add(u,t,b0);
       curve25519_mulcarry(xznb,b0,b0 + 12);
       curve25519_mulcarry(xznb + 12,s,u);
       curve25519_sqcarry(xzn1b,c1);
       curve25519_mulcarry(xzn1b + 12,r,x1);
       curve25519_select(xzm,xzm1,xznb,xzn1b,b);
     }
Each of the 255 iterations involves two selections (which can easily be merged across loops into one selection), four additions, four subtractions, one multiplication by 121665 (with carry-reduce), four squarings (with carry-reduce), and five more multiplications (with carry-reduce). Here are the tallies of floating-point operations:
operation   per operation         per iteration          per 255 iterations
select      0 fadd + 0 fmul       0 fadd + 0 fmul        0 fadd + 0 fmul
add         12 fadd + 0 fmul      48 fadd + 0 fmul       12240 fadd + 0 fmul
sub         12 fadd + 0 fmul      48 fadd + 0 fmul       12240 fadd + 0 fmul
121665      48 fadd + 13 fmul     48 fadd + 13 fmul      12240 fadd + 3315 fmul
sqcarry     118 fadd + 100 fmul   472 fadd + 400 fmul    120360 fadd + 102000 fmul
mulcarry    177 fadd + 153 fmul   885 fadd + 765 fmul    225675 fadd + 195075 fmul
total                             1501 fadd + 1178 fmul  382755 fadd + 300390 fmul
Some operations could be removed in the leading and trailing iterations, but this would produce an extremely small speedup and would cost extra code-cache misses.

The second most important part of the computation is the division. I compute the (p-2)nd = (2^255-21)th power of z[n], and multiply by x[n], with a straightforward sequence of 254 squarings and 12 extra multiplications, costing 32096 fadd + 27236 fmul. There are other ways to compute the reciprocal of z[n], such as Euclid's algorithm and extended binary-gcd algorithms, but these algorithms (especially when modified to avoid leaking information through timing) would take considerable extra work to implement and would have at most a small impact on the Curve25519 speed, so I won't bother with them this month.

The remaining operations are conversion to and from floating point, costing 56 fadd + 0 fmul.

Total operation count (not computer-verified at this point): 414907 fadd + 327626 fmul. Evidently these instructions will take at least 414907 UltraSPARC cycles.

Here's the software:

     wget https://cr.yp.to/highspeed/fall2006/curve25519_complete.c
     wget https://cr.yp.to/highspeed/fall2006/curve25519_completetest.c
     gcc -m64 -o curve25519_completetest curve25519_completetest.c \
     curve25519_complete.c curve25519_carry.s curve25519_mul.s curve25519_sq.s \
     curve25519_todouble.s curve25519_121665.s curve25519_add.s curve25519_sub.s \
     curve25519_freeze.s curve25519_select.s
     ./curve25519_completetest | gp

Counting cycles. I've timed this Curve25519 implementation using the cpucycles library:

     int i;
     long long timings[10];
     unsigned char x[32];
     unsigned char y[32];
     unsigned char z[32];

     for (i = 0;i < 10;++i) { timings[i] = cpucycles(); curve25519(z,x,y); }
     for (i = 0;i < 9;++i) printf("%lld ",timings[i + 1] - timings[i]); printf("\n");
The output shows that this implementation is taking over 2 million UltraSPARC III cycles:
     2251848 2122560 2060876 2694852 2061201 2060841 2061101 3322788 2065352 
There's a big gap, almost a factor of 5, between the 414907-cycle lower bound from fadd throughput and this 2-million-cycle upper bound. That's a gigantic gap compared to, e.g., my NIST P-224 elliptic-curve-Diffie-Hellman software from 2001, which used about 678000 UltraSPARC II cycles for 523578 floating-point additions, or my more recent Poly1305 software, which uses 68 UltraSPARC III cycles for 68 floating-point additions in the main loop.

Why is the software failing so miserably at keeping the adder busy? Here's an illustrative 3-fadd excerpt from curve25519_mul.q:

     h64 = f0 * g64       # starts on cycle 0
     f22g43 = f22 * g43   # starts on cycle 1 (delayed by fmul throughput)
     h64 += f22g43        # starts on cycle 5 (delayed by fmul latency)
     f43g22 = f43 * g22   # starts on cycle 5
     h64 += f43g22        # starts on cycle 9 (delayed by fmul throughput)
     f64g0 = f64 * g0     # starts on cycle 9
     h64 += f64g0         # starts on cycle 13
I didn't have to add these 4 quantities serially. I could instead add them in a tree:
     h64 = f0 * g64       # starts on cycle 0
     f22g43 = f22 * g43   # starts on cycle 1 (delayed by fmul throughput)
     f43g22 = f43 * g22   # starts on cycle 2 (delayed by fmul throughput)
     f64g0 = f64 * g0     # starts on cycle 3 (delayed by fmul throughput)
     h64 += f22g43        # starts on cycle 5 (delayed by fmul latency)
     f43g22 += f64g0      # starts on cycle 7 (delayed by fmul latency)
     h64 += f43g22        # starts on cycle 11 (delayed by fadd latency)
Even better, I could interleave these operations with subsequent additions in the software. There's tremendous parallelism in the Curve25519 computation graph, with far more than 4 independent additions on each level of the graph; I simply have to make this parallelism visible to the CPU.

To be continued.

2006.11.06

Today's lecture topics: Oops, didn't quite get to the last two bits. Will start with them tomorrow.

There are many other ideas in the authentication literature, but most of them are slowdowns! See Section 10 of my hash127 paper for a survey (dating from 2000) and tons of references. The main advance in the field since 2000 is a new focus on minimizing precomputation (not just having r short but also using it directly in the main computation), as I emphasized in the lecture.

2006.11.07

WARNING: There's a proposal to reschedule next week's lectures from Thursday+Friday to Friday+Friday. Hasn't happened yet; we're checking for conflicts.

Today's lecture topics:

Here's what I said about cycles/byte to authenticate a 1024-byte message, including a slight slowdown from computing AES:
CPU               Poly1305-AES   UMAC-P4-128-AES
PowerPC G4 7410   8.27           21.72
UltraSPARC III    5.47           51.06
Pentium 4         5.33           3.12
Pentium M         4.50           8.48
Athlon            3.75           7.38

As for Salsa20, the best source of information is my Salsa20 page. The leading source of information about stream ciphers generally is eSTREAM (the ECRYPT Stream Cipher Project), particularly eSTREAM's collection of design papers and cryptanalysis papers; I have a web page with links and further comments.

2006.11.08

SCHEDULE CHANGE: The afternoon session on Thursday 16 November has been rescheduled for Friday 17 November. This means that the remaining sessions are as follows:

Week  Day  Date        Time
3     Thu  2006.11.09  11:00-12:30, 15:30-17:00
3     Fri  2006.11.10  11:00-12:30
4     Mon  2006.11.13  no lectures (students work on course projects)
4     Tue  2006.11.14  no lectures (students work on course projects)
4     Wed  2006.11.15  no lectures (students work on course projects)
4     Thu  2006.11.16  no lectures (students work on course projects)
4     Fri  2006.11.17  11:00-12:30, 15:30-17:00 (project reports and discussion)

Today's lecture topics:

2006.11.09

Today's lecture topics:

2006.11.13

Friday's lecture topics: We'll meet again for course-project discussions on Friday 17 November 2006 at 11:00. Feel free to send me email in the meantime.

Here's a challenge to cryptographers. Right now, attackers all over the Internet can easily forge packets that appear to come from Google; the challenge is to stop these forgeries, guaranteeing that the Google results on the users' screens really are from Google. Perhaps the most obvious approach (to the cryptographic side of the problem; we also need to fix browser buffer overflows and other relevant software bugs) is to have Google add a signature to each Google-results page. This costs, among other things, one signature generation for each of the W Google-results pages that Google serves.

The problem for Google is that W is huge; generating signatures can easily be a bottleneck in generating Google-results pages. Fortunately, there's a much less expensive approach: use the Diffie-Hellman protocol to share a secret between Google and each user, and then use secret-key cryptography to stop forgery (and eavesdropping) for any number of web pages sent to that user. This costs, among other things, one Diffie-Hellman evaluation (one Curve25519 computation) for each of the U Google users, plus very fast secret-key operations for each page. Presumably the number U of Google users is on the scale of a billion, so U evaluations of Curve25519 are on the scale of a few days of time on one computer, and of course Google can spread computations across many computers. One might complain about the cost of Google storing shared secrets in a big central database, but the standard technique of ``remote storage'' (Google stores the user's shared secret as an authenticated encrypted cookie on the user's machine) removes the need for a central database. Anyway, the challenge is to write and deploy all software necessary to actually make this work.

2006.11.16

New qhasm release, including AMD64 (Opteron, Athlon 64, Core, Core 2) support:
     wget https://cr.yp.to/qhasm/qhasm-20061116.tar.gz
     gunzip < qhasm-20061116.tar.gz | tar -xf -
     cd qhasm-20061116
     ./do
Still needs lots of testing but can at least handle an AMD64 XMM version of the Salsa20 software, setting new speed records for Salsa20 on the Intel Core 2.