High-speed cryptography

Week | Day | Date | Time |
---|---|---|---|
1 | Mon | 2006.10.23 | 15:30-17:00 |
1 | Tue | 2006.10.24 | 11:00-12:30, 15:30-17:00 |
1 | Wed | 2006.10.25 | 11:00-12:30, 15:30-17:00 |
1 | Thu | 2006.10.26 | 11:00-12:30, 15:30-17:00 |
1 | Fri | 2006.10.27 | 11:00-12:30 |
2 | Mon | 2006.10.30 | no lectures (students attend CCANTC workshop) |
2 | Tue | 2006.10.31 | no lectures (students attend CCANTC workshop) |
2 | Wed | 2006.11.01 | no lectures (students attend CCANTC workshop) |
2 | Thu | 2006.11.02 | no lectures (students attend CCANTC workshop) |
2 | Fri | 2006.11.03 | no lectures (students attend CCANTC workshop) |
3 | Mon | 2006.11.06 | 15:30-17:00 |
3 | Tue | 2006.11.07 | 11:00-12:30, 15:30-17:00 |
3 | Wed | 2006.11.08 | 11:00-12:30, 15:30-17:00 |
3 | Thu | 2006.11.09 | 11:00-12:30, 15:30-17:00 |
3 | Fri | 2006.11.10 | 11:00-12:30 |
4 | Mon | 2006.11.13 | no lectures (students work on course projects) |
4 | Tue | 2006.11.14 | no lectures (students work on course projects) |
4 | Wed | 2006.11.15 | no lectures (students work on course projects) |
4 | Thu | 2006.11.16 | 15:30-17:00 (project reports and discussion) |
4 | Fri | 2006.11.17 | 11:00-12:30 (project reports and discussion) |

All lectures will take place in room 230 of the Fields Institute, 222 College Street, at the University of Toronto.

**Course contents.**
Anyone with access to the Internet
can intercept mail messages and web pages
and other packets of data to see what they say.
He can also send fake messages that look like real messages.

Cryptography protects messages sent through the Internet. Cryptography scrambles and unscrambles packets to protect against forgery and against espionage. An attacker who forges a message won't be able to scramble it in the right way, so when we unscramble it we'll see that it's a forgery and that we should throw it away. An attacker who intercepts a scrambled credit-card number won't be able to figure out the original number.

High-speed cryptography is important for busy network servers. For example, here's a quote from Cricket Liu, author of several books on the Internet's Domain Name System: ``Verifying signed resource records [i.e., performing cryptographic operations] is computationally intensive [i.e., is painfully slow]. This has slowed deployment of DNS security.'' As a practical matter, DNS security still doesn't exist; attackers can seize control over incoming mail and outgoing web pages by forging DNS records. The users need faster software to scramble DNS records.

This course will focus on the four fundamental tasks in cryptography:

- using a long shared secret to protect a long series of private messages;
- expanding a short shared secret into a long shared secret;
- sharing a secret through a public conversation; and
- using a sender secret, with no shared secrets, to protect a long series of public messages.

**Prerequisites and non-prerequisites.**
Students are expected to have some experience with algorithm analysis
(figuring out why a computer hasn't finished a computation yet)
and algorithm improvement (making the computer finish the computation more quickly).
A typical student
will have studied some chapters of a textbook by Cormen, Leiserson, Rivest, and Stein;
will have heard that heapsort ``takes time n log n times, um, some constant'';
and will have seen many algorithm improvements
incorrectly described as ``optimizations.''

The most obvious difference in flavor between this course and a typical algorithm course is that this course will pay attention to constants. For example, this course will explain how a state-of-the-art function takes half a millisecond on my laptop to compute a high-security shared secret from a public conversation. Ignoring constant factors would remove all content from this statement.

Of course, these constants depend on the choice of computer. Sometimes one computer takes 10 clock cycles for an operation while another computer takes just 1 clock cycle; this affects the time for any algorithm built on top of that operation. Students might have seen some computer-dependent algorithm analysis and algorithm improvement in courses on architecture, assembly language, ``optimizing'' compilers, supercomputing, etc. None of those courses are prerequisites; this course will include a detailed introduction to the speed of real computers.

Many higher-level speedups are also constant-factor speedups. In fact, most of the speedups in the algorithm-research literature are constant-factor speedups. However, students will not be presumed to have any previous experience handling constant factors in algorithm analysis.

The other obvious difference in flavor between this course and other algorithm courses is the choice of functions being computed. A typical introductory algorithm course spends time on sorting, for example, and on various graph operations such as shortest paths. This course instead focuses on a few critical cryptographic operations. Those operations rely on some mathematics that students might have seen in previous courses on algebra, number theory, or cryptography. None of those are prerequisites; this course will include an introduction to the relevant mathematics.

- Motivation for high-speed cryptography
- The Internet is a security disaster
- Secrecy violations: traffic analysis, eavesdropping
- Reliability/integrity violations: denial of service, forgery
- Packets transported by 802.11 wireless, Ethernet wires, ISPs, etc.
- Attacker hears packets, forges packets
- Fix: physically secure network; slow, expensive
- Alternative: cryptography scrambles packets
- Most Internet packets are unprotected; why not?
- Lack of software support; but what about HTTPS?
- Setup problems; but what about sites where it's set up?
- Speed problems; okay, let's try to fix
- Warning: also have to fix buffer overflows etc.

- Introduction to constant-factor analysis
- Square root mod n is a signing bottleneck
- Details of square-root function
- Typical algorithm, high level
- Typical algorithm, lower level
- Why algorithm works
- Baby algorithms course
- Baby analysis of this algorithm
- Baby-visible improvement from fast multiplication
- CRT speedup
- 2+o(1) savings from CRT speedup
- 512-bit example
- The windowing speedup
- Asymptotics; another constant factor
- List of several more constant-factor speedups
- Serious analysis vs. baby analysis
- Other constant-factor courses

- Course contents
- Could study constants in sorting etc.
- Instead focus on four critical crypto operations
- Long shared secret protects long series of messages
- Short shared secret expands into long shared secret
- Public conversation shares short secret
- Sender secret protects public messages without sharing secrets

- Selected state-of-the-art functions:
- Poly1305
- Salsa20
- Curve25519
- Square2048

- Course projects
- Set new speed records
- Doable? Certainly
- Example: OpenSSL solidly beaten by GMP
- Cryptographers can *change* functions
- For starting projects I recommend *existing* functions
- Some functions: see eSTREAM, Barreto hashing lounge
- Post-quantum public-key systems
- Start with functions that don't have asm software
- For public-key systems, submit to eBATS

- Simplified Poly1305 for one message
- Unprotected one-message Internet protocol
- What parties do
- Forgery-protected protocol
- What parties do differently
- Authenticator formula
- Definitions: Z/p etc.
- Definitions: random
- Definitions: uniform etc.
- Example of protocol
- Protocol works and is secure

- Security analysis of simplified Poly1305 for one message
- Naive attack
- Better attack
- Generalization of better attack
- Optimality of attack
- Counting roots
- How to formalize such proofs
- Independence in this proof

- Channel conversion 1
- Where did shared secret come from?
- Old channel
- Why not reuse that channel?
- Advantages of new channel
- Channel-conversion picture
- Combining with one-time pad
- Channel-conversion picture

- Un-simplified Poly1305: real-world tweaks
- Messages are byte strings
- Poly1305 encoding of byte strings
- Example of encoding
- Squeezing 130 bits to 128 bits
- Possible subtractions of 2^128
- Revised security conclusion
- Reduced r range for speed

- Poly1305 for multiple messages
- Unprotected multiple-message Internet protocol
- Nonces
- Why nonces?
- Protected multiple-message protocol
- Simplified multiple-message authenticator formula
- Protocol works and is secure
- Un-simplified formula; security
- (Swapped to later:) Channel-conversion picture
- (Swapped to later:) Combining with one-time pad; picture
- (Swapped to later:) Channel improvements we'll see later

- Choosing authenticator security levels
- Poly1305 surviving massive attack
- Argument that this is overkill
- Flaws in the argument
- Incorrect-forgery-count quote
- Correct forgery count

- Bit operations
- Definition of bit operations
- XOR with 4 bit operations
- Baby analysis of bit operations for b-bit evaluation

- much easier than writing the same computation in a traditional assembly language: qhasm includes a smart register allocator, for example, whereas a traditional assembly language forces you to allocate registers by hand; and
- much easier than writing the same computation using C: you don't have to fight the C compiler's limited instruction selection, confused instruction scheduling, etc.

Here's a portion of the latest qhasm rewrite:
qhasm-20061027.tar.gz.
This doesn't include qhasm's cycle counter, range analyzer, and scheduler,
but it does include
qhasm's instruction decoder,
register allocator (completely rewritten this week to be much faster),
x86 floating-point stack handler,
and assembler.
What this means, in a nutshell, is that you can write `.q` files
and feed them to `qhasm-x86` to produce `.s` files in traditional x86 assembly language.
Those `.s` files, in turn, can be linked easily with C and C++ programs.

These tools are nowhere near the polished state that I like to reach before releasing software. I haven't even had time for minimal tests! I nevertheless feel completely comfortable recommending qhasm as the programming environment for the course projects, simply because the alternative environments do such a bad job of producing high-speed software. If something in qhasm doesn't work, let me know, and I'll have it fixed promptly.

An example of a `.q` file is addvector.q.
To turn this into `addvector.s`,
first download and compile qhasm:

    wget https://cr.yp.to/qhasm/qhasm-20061027.tar.gz
    gunzip < qhasm-20061027.tar.gz | tar -xf -
    cd qhasm-20061027
    ./do

Then download the example and convert it:

    wget https://cr.yp.to/qhasm/addvector.q
    ./qhasm-x86 < addvector.q > addvector.s

There's an addvector.c program that tests the result:

    wget https://cr.yp.to/qhasm/addvector.c
    gcc -o addvector addvector.c addvector.s
    ./addvector

To see more of what can be in a

- UltraSPARC III integer instructions
- Will start with UltraSPARC, then move to popular CPUs
- Picture of UltraSPARC III
- Two 64-bit additions per clock cycle
- Some instructions: +, -, left shift
- More instructions: right shift, bit operations
- More instructions: constants
- 24 integer registers
- Computing 4a+b mod 2^64
- Two integer execution units
- 2 integer instructions per cycle
- Lower bound on cycles
- 1-cycle latency for integer instructions
- Example: f2=f0+f1,...,f6=f4+f5
- Alternate computation
- Dependency graph for first computation
- Dependency graph for second computation

- More about the UltraSPARC III
- More storage: memory
- 64-bit load, 64-bit store
- Big-endian, little-endian
- 8-bit load, 32-bit load, etc.
- Load throughput, latency
- Picture of the memory hierarchy
- Typical benchmarks avoid cache misses
- Example of cache misses in cryptography
- Double-precision floating-point numbers
- Floating-point arithmetic
- Floating-point rounding
- Floating-point addition throughput, latency
- Floating-point multiplication throughput
- Importance of multiplier
- Floating-point multiplication latency; example of CPU variation
- Instructions start in order
- Loads finish in order

- Cycle-counting example
- Want approximate product of 100 floating-point numbers
- First implementation: simple loop, ~500 cycles
- Second implementation: unrolling loop, ~400 cycles
- Picture explaining the 400
- Third implementation: 4 parallel products
- Picture showing desired performance
- ~300 cycles; why?
- Picture showing extra delays
- Fourth implementation: separating loads from stores
- ~100 cycles

- Poly1305 on the UltraSPARC
- How to split +c, *r, mod 2^130-5 into UltraSPARC insns?
- Current software: 34 fp mults, 68 fp adds, 69 cycles
- Splitting r into 4 fp regs
- Splitting intermediate results into 4 fp regs
- Computing rx
- Computing y congruent to rx mod 2^130-5
- Haven't minimized y; not a problem
- Haven't minimized pieces of y; solve by carries
- Carries via floating-point rounding
- Haven't fit pieces of y into fp regs; solve by splitting r more

- Poly1305 on the Pentium Pro/II/III/M
- Extended precision; 17 fp mults, 36 fp adds
- Difficulties: one fp unit; 8 fp regs

- Software-engineering issues
- Make fp functions work; then merge functions; then handle latency
- Graph of CPU time vs. programmer time
- Comments on qhasm

In response to requests for supporting additional CPUs,
I've posted a new version of qhasm that includes
`qhasm-ppc32-linux`,
`qhasm-ppc32-macos`,
`qhasm-sparc`,
and
`qhasm-x86`.
Here's how to download and compile the new version:

    wget https://cr.yp.to/qhasm/qhasm-20061102.tar.gz
    gunzip < qhasm-20061102.tar.gz | tar -xf -
    cd qhasm-20061102
    ./do

Beware that this version also has a new syntax for handling CPU flags. Previously you were allowed to say things like

    counter -= 1
    ...
    goto topofloop if !=

where the `counter -= 1` sets the flag inspected by the `goto`. The new syntax requires an explicit flag request:

    =? counter -= 1
    ...
    goto topofloop if !=

This allows qhasm to check for a common assembly-language error, namely intermediate instructions (...) that implicitly set the flag; there are quite a few flag-setting instructions, especially on the x86 architecture. The new syntax also allows comprehensible selection of different flag options on architectures providing many options, such as the PowerPC.

This version of qhasm needs even more testing than the previous version. I'll be doing a lot of testing this weekend: one of my projects for the next few days is writing a new implementation of the Curve25519 elliptic-curve-Diffie-Hellman function, one of the functions that we'll explore next week. The original Curve25519 software is written in x86 (Pentium, Athlon, etc.) assembly language and relies heavily on 80-bit (extended-precision, 64-bit mantissa) floating-point registers; the new project is to handle other CPUs (UltraSPARC, PowerPC, etc.) using 64-bit (double-precision, 53-bit mantissa) floating-point registers. I'm planning to post code samples here---stay tuned.

How can I efficiently decompose multiplication mod 2^255-19 into the smaller instructions available on a real CPU? The answer depends on the CPU. Let's start with my teaching tool, the UltraSPARC III, before moving on to more popular CPUs.

Recall that the computational highlight of the UltraSPARC is the floating-point multiplier, which can very quickly compute the product of two double-precision floating-point numbers. The multiplier rounds its output to 53-bit precision, but it guarantees exact results for products that already fit into 53 bits, such as integers in the range {-2^53+1,...,-1,0,1,...,2^53-1}. The multiplier's maximum throughput is one product per CPU cycle.

Experience tells me that the floating-point adder is another bottleneck. The adder's maximum throughput is one addition per CPU cycle. So I'll carefully count the number of floating-point additions and the number of floating-point multiplications. Hopefully there won't be any other bottlenecks.

**Integer representation.**
I'll try breaking 255-bit numbers into 12 pieces,
each piece having 22 bits.
Two pieces will have a 44-bit product,
so I should be able to add quite a few of those products
without overflowing the 53-bit mantissa of a double-precision floating-point number.
Of course,
I'll have to carefully check the final code
to guarantee that there aren't any accidental overflows,
and in the end it might turn out that more than 12 pieces are necessary,
but for the moment I'll simply assume that 12 pieces will work.

You might object that 255 isn't divisible by 12:
the ratio 255/12 = 21.25 isn't an integer.
How exactly do I break a 255-bit number into 12 pieces?
Answer:
I'll write a 255-bit number f as a sum
f_0 + f_22 + f_43 + f_64 + f_85 + f_107 + f_128 + f_149 + f_170 + f_192 + f_213 + f_234
where each f_i is a small multiple of 2^i.
The indices i here, namely 0 22 43 etc.,
are *ceilings* of multiples of 21.25.

Each f_i fits easily into a floating-point register, so f is represented as a vector of 12 values (f_0,f_22,...,f_234) stored in floating-point registers. All required operations on integers modulo 2^255-19 will be split into floating-point operations.

**Addition: 12 fadd + 0 fmul.**
If
f = f_0 + f_22 + f_43 + f_64 + f_85 + f_107 + f_128 + f_149 + f_170 + f_192 + f_213 + f_234
and
g = g_0 + g_22 + g_43 + g_64 + g_85 + g_107 + g_128 + g_149 + g_170 + g_192 + g_213 + g_234
then
f + g = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234
where h_i = f_i + g_i.

Computing (h_0,h_22,...,h_234) given (f_0,f_22,...,f_234) and (g_0,g_22,...,g_234) takes 12 floating-point additions if f_i and g_i are sufficiently small. Of course, the bound on h_i is larger than the bounds on f_i and g_i, so a series of additions requires occasional carries, but there aren't any chains of additions in Curve25519.

Here's a simple C function that reads (f_0,f_22,...,f_234) from f[0],f[1],...,f[11], reads (g_0,g_22,...,g_234) from g[0],g[1],...,g[11], and puts (h_0,h_22,...,h_234) into h[0],h[1],...,h[11]:

    void curve25519_add(double h[12],const double f[12],const double g[12])
    {
      h[0] = f[0] + g[0];
      h[1] = f[1] + g[1];
      h[2] = f[2] + g[2];
      h[3] = f[3] + g[3];
      h[4] = f[4] + g[4];
      h[5] = f[5] + g[5];
      h[6] = f[6] + g[6];
      h[7] = f[7] + g[7];
      h[8] = f[8] + g[8];
      h[9] = f[9] + g[9];
      h[10] = f[10] + g[10];
      h[11] = f[11] + g[11];
    }

Perhaps I should comment that, in C, array parameters are actually pointers, so I could equally well have written

    void curve25519_add(double *h,const double *f,const double *g)
    {
      h[0] = f[0] + g[0];
      h[1] = f[1] + g[1];
      h[2] = f[2] + g[2];
      h[3] = f[3] + g[3];
      h[4] = f[4] + g[4];
      h[5] = f[5] + g[5];
      h[6] = f[6] + g[6];
      h[7] = f[7] + g[7];
      h[8] = f[8] + g[8];
      h[9] = f[9] + g[9];
      h[10] = f[10] + g[10];
      h[11] = f[11] + g[11];
    }

and would have had to do so to make the function also work in C++. On the other hand, I like seeing the array lengths in the prototype.

Anyway, the line

    h[0] = f[0] + g[0];

actually produces four CPU instructions: load f[0] into a register, load g[0] into a register, add the two registers, and store the result into h[0]. To make those four instructions visible in C, I rewrote the line as

    double f0; /* set aside double-precision storage location f0 */
    double g0; /* set aside double-precision storage location g0 */
    double h0; /* set aside double-precision storage location h0 */
    f0 = f[0]; /* copy 8 bytes from memory location f[0] to register f0 */
    g0 = g[0]; /* copy 8 bytes from memory location g[0] to register g0 */
    h0 = f0 + g0; /* add register f0 to register g0 */
    h[0] = h0; /* copy result from register h0 to memory location h[0] */

I did the same for the other lines, producing the following equivalent C function:

    void curve25519_add(double *h,const double *f,const double *g)
    {
      double f0;  double f1;  double f2;  double f3;
      double f4;  double f5;  double f6;  double f7;
      double f8;  double f9;  double f10; double f11;
      double g0;  double g1;  double g2;  double g3;
      double g4;  double g5;  double g6;  double g7;
      double g8;  double g9;  double g10; double g11;
      double h0;  double h1;  double h2;  double h3;
      double h4;  double h5;  double h6;  double h7;
      double h8;  double h9;  double h10; double h11;
      f0 = f[0];   f1 = f[1];   f2 = f[2];   f3 = f[3];
      f4 = f[4];   f5 = f[5];   f6 = f[6];   f7 = f[7];
      f8 = f[8];   f9 = f[9];   f10 = f[10]; f11 = f[11];
      g0 = g[0];   g1 = g[1];   g2 = g[2];   g3 = g[3];
      g4 = g[4];   g5 = g[5];   g6 = g[6];   g7 = g[7];
      g8 = g[8];   g9 = g[9];   g10 = g[10]; g11 = g[11];
      h0 = f0 + g0;    h1 = f1 + g1;    h2 = f2 + g2;
      h3 = f3 + g3;    h4 = f4 + g4;    h5 = f5 + g5;
      h6 = f6 + g6;    h7 = f7 + g7;    h8 = f8 + g8;
      h9 = f9 + g9;    h10 = f10 + g10; h11 = f11 + g11;
      h[0] = h0;   h[1] = h1;   h[2] = h2;   h[3] = h3;
      h[4] = h4;   h[5] = h5;   h[6] = h6;   h[7] = h7;
      h[8] = h8;   h[9] = h9;   h[10] = h10; h[11] = h11;
    }

You might be wondering how the UltraSPARC, with only 32 double-precision floating-point registers, can handle the 36 values f0, f1, ..., h11. Answer: a register allocator reuses registers, turning

    f0 = f[0]; g0 = g[0]; h0 = f0 + g0; h[0] = h0;

into (essentially)

    %f8 = f[0]; %f32 = g[0]; %f8 = %f8 + %f32; h[0] = %f8;

putting both f0 and h0 into the same register, %f8.

The above C code is still hiding an important aspect of the corresponding CPU instructions,
namely the multiplication involved in array indexing.
Loading `f[i]` from memory
means taking the supplied address `f`,
adding `8*i` bytes to it to find the address of `f[i]`,
and then loading the 8-byte floating-point number at that address.
To see the CPU instructions more clearly,
I replaced the `double` pointers by byte pointers:

    void curve25519_add(char *h,char *f,char *g)
    {
      double f0;  double f1;  double f2;  double f3;
      double f4;  double f5;  double f6;  double f7;
      double f8;  double f9;  double f10; double f11;
      double g0;  double g1;  double g2;  double g3;
      double g4;  double g5;  double g6;  double g7;
      double g8;  double g9;  double g10; double g11;
      double h0;  double h1;  double h2;  double h3;
      double h4;  double h5;  double h6;  double h7;
      double h8;  double h9;  double h10; double h11;
      f0 = *(double *) (f + 0);
      f1 = *(double *) (f + 8);
      f2 = *(double *) (f + 16);
      f3 = *(double *) (f + 24);
      f4 = *(double *) (f + 32);
      f5 = *(double *) (f + 40);
      f6 = *(double *) (f + 48);
      f7 = *(double *) (f + 56);
      f8 = *(double *) (f + 64);
      f9 = *(double *) (f + 72);
      f10 = *(double *) (f + 80);
      f11 = *(double *) (f + 88);
      g0 = *(double *) (g + 0);
      g1 = *(double *) (g + 8);
      g2 = *(double *) (g + 16);
      g3 = *(double *) (g + 24);
      g4 = *(double *) (g + 32);
      g5 = *(double *) (g + 40);
      g6 = *(double *) (g + 48);
      g7 = *(double *) (g + 56);
      g8 = *(double *) (g + 64);
      g9 = *(double *) (g + 72);
      g10 = *(double *) (g + 80);
      g11 = *(double *) (g + 88);
      h0 = f0 + g0;    h1 = f1 + g1;    h2 = f2 + g2;
      h3 = f3 + g3;    h4 = f4 + g4;    h5 = f5 + g5;
      h6 = f6 + g6;    h7 = f7 + g7;    h8 = f8 + g8;
      h9 = f9 + g9;    h10 = f10 + g10; h11 = f11 + g11;
      *(double *) (h + 0) = h0;
      *(double *) (h + 8) = h1;
      *(double *) (h + 16) = h2;
      *(double *) (h + 24) = h3;
      *(double *) (h + 32) = h4;
      *(double *) (h + 40) = h5;
      *(double *) (h + 48) = h6;
      *(double *) (h + 56) = h7;
      *(double *) (h + 64) = h8;
      *(double *) (h + 72) = h9;
      *(double *) (h + 80) = h10;
      *(double *) (h + 88) = h11;
    }

The corresponding qhasm code:

    enter curve25519_add
    int64 h
    int64 f
    int64 g
    input h
    input f
    input g
    float64 f0
    float64 f1
    float64 f2
    float64 f3
    float64 f4
    float64 f5
    float64 f6
    float64 f7
    float64 f8
    float64 f9
    float64 f10
    float64 f11
    float64 g0
    float64 g1
    float64 g2
    float64 g3
    float64 g4
    float64 g5
    float64 g6
    float64 g7
    float64 g8
    float64 g9
    float64 g10
    float64 g11
    float64 h0
    float64 h1
    float64 h2
    float64 h3
    float64 h4
    float64 h5
    float64 h6
    float64 h7
    float64 h8
    float64 h9
    float64 h10
    float64 h11
    f0 = *(float64 *) (f + 0)
    f1 = *(float64 *) (f + 8)
    f2 = *(float64 *) (f + 16)
    f3 = *(float64 *) (f + 24)
    f4 = *(float64 *) (f + 32)
    f5 = *(float64 *) (f + 40)
    f6 = *(float64 *) (f + 48)
    f7 = *(float64 *) (f + 56)
    f8 = *(float64 *) (f + 64)
    f9 = *(float64 *) (f + 72)
    f10 = *(float64 *) (f + 80)
    f11 = *(float64 *) (f + 88)
    g0 = *(float64 *) (g + 0)
    g1 = *(float64 *) (g + 8)
    g2 = *(float64 *) (g + 16)
    g3 = *(float64 *) (g + 24)
    g4 = *(float64 *) (g + 32)
    g5 = *(float64 *) (g + 40)
    g6 = *(float64 *) (g + 48)
    g7 = *(float64 *) (g + 56)
    g8 = *(float64 *) (g + 64)
    g9 = *(float64 *) (g + 72)
    g10 = *(float64 *) (g + 80)
    g11 = *(float64 *) (g + 88)
    h0 = f0 + g0
    h1 = f1 + g1
    h2 = f2 + g2
    h3 = f3 + g3
    h4 = f4 + g4
    h5 = f5 + g5
    h6 = f6 + g6
    h7 = f7 + g7
    h8 = f8 + g8
    h9 = f9 + g9
    h10 = f10 + g10
    h11 = f11 + g11
    *(float64 *) (h + 0) = h0
    *(float64 *) (h + 8) = h1
    *(float64 *) (h + 16) = h2
    *(float64 *) (h + 24) = h3
    *(float64 *) (h + 32) = h4
    *(float64 *) (h + 40) = h5
    *(float64 *) (h + 48) = h6
    *(float64 *) (h + 56) = h7
    *(float64 *) (h + 64) = h8
    *(float64 *) (h + 72) = h9
    *(float64 *) (h + 80) = h10
    *(float64 *) (h + 88) = h11
    leave

This qhasm code says the same thing as the byte-pointer C code, with each load, addition, and store written as one CPU instruction.

After writing `curve25519_add.q`
I wrote `curve25519_addtest.c` to test the addition function
on a not-very-random input:

    #include <stdio.h>
    #include <stdlib.h>

    double powers[12] = {
      1.0 /* 2^0 */
    , 4194304.0 /* 2^22 */
    , 8796093022208.0 /* 2^43 */
    , 18446744073709551616.0 /* 2^64 */
    , 38685626227668133590597632.0 /* 2^85 */
    , 162259276829213363391578010288128.0 /* 2^107 */
    , 340282366920938463463374607431768211456.0 /* 2^128 */
    , 713623846352979940529142984724747568191373312.0 /* 2^149 */
    , 1496577676626844588240573268701473812127674924007424.0 /* 2^170 */
    , 6277101735386680763835789423207666416102355444464034512896.0 /* 2^192 */
    , 13164036458569648337239753460458804039861886925068638906788872192.0 /* 2^213 */
    , 27606985387162255149739023449108101809804435888681546220650096895197184.0 /* 2^234 */
    } ;

    double f[12];
    double g[12];
    double h[12];

    extern void curve25519_add(double [12],const double [12],const double [12]);

    int main(int argc,char **argv)
    {
      int i;
      for (i = 0;i < 12;++i) f[i] = random() * powers[i];
      printf("f=0"); for (i = 0;i < 12;++i) printf("%+.0f",f[i]); printf(";\n");
      for (i = 0;i < 12;++i) g[i] = random() * powers[i];
      printf("g=0"); for (i = 0;i < 12;++i) printf("%+.0f",g[i]); printf(";\n");
      curve25519_add(h,f,g);
      printf("h=0"); for (i = 0;i < 12;++i) printf("%+.0f",h[i]); printf(";\n");
      printf("h-(f+g)\n");
      return 0;
    }

I compiled and ran the test:

    ./qhasm-sparc < curve25519_add.q > curve25519_add.s
    gcc -m64 -o curve25519_addtest curve25519_addtest.c curve25519_add.s
    ./curve25519_addtest | gp

The test output defines f, g, and h inside `gp`, which then evaluates h-(f+g) and prints 0 if the addition worked.

More operations next time.

I revised `curve25519_add.q` to use variable names matching the indices in the mathematical notation:

    float64 f0
    float64 f22
    float64 f43
    ...
    float64 g0
    float64 g22
    float64 g43
    ...
    float64 h0
    float64 h22
    float64 h43
    ...
    f0 = *(float64 *) (f + 0)
    f22 = *(float64 *) (f + 8)
    f43 = *(float64 *) (f + 16)
    ...
    g0 = *(float64 *) (g + 0)
    g22 = *(float64 *) (g + 8)
    g43 = *(float64 *) (g + 16)
    ...
    h0 = f0 + g0
    h22 = f22 + g22
    h43 = f43 + g43
    ...
    *(float64 *) (h + 0) = h0
    *(float64 *) (h + 8) = h22
    *(float64 *) (h + 16) = h43

qhasm also allows underscores in names, but I'm not interested in typing quite that much. :-)

Here's how to download and test the revised addition software on an UltraSPARC:

    wget https://cr.yp.to/highspeed/fall2006/curve25519_add.q
    wget https://cr.yp.to/highspeed/fall2006/curve25519_addtest.c
    ./qhasm-sparc < curve25519_add.q > curve25519_add.s
    gcc -m64 -o curve25519_addtest curve25519_addtest.c curve25519_add.s
    ./curve25519_addtest | gp

Now let's look at some other integer operations.

**Subtraction: 12 fadd + 0 fmul.**
Subtraction has the same cost as addition:
simply replace

    h0 = f0 + g0

with

    h0 = f0 - g0

etc. Of course, the resulting h_i is often negative, but that's okay; floating-point numbers are allowed to be negative.

Each cycle, the UltraSPARC's floating-point adder handles one addition *or* one subtraction.
I'll count subtractions as if they were additions.

Here's the software:

    wget https://cr.yp.to/highspeed/fall2006/curve25519_sub.q
    wget https://cr.yp.to/highspeed/fall2006/curve25519_subtest.c
    ./qhasm-sparc < curve25519_sub.q > curve25519_sub.s
    gcc -m64 -o curve25519_subtest curve25519_subtest.c curve25519_sub.s
    ./curve25519_subtest | gp

**Multiplication, schoolbook method: 121 fadd + 144 fmul.**
If
f = f_0 + f_22 + f_43 + f_64 + f_85 + f_107 + f_128 + f_149 + f_170 + f_192 + f_213 + f_234
and
g = g_0 + g_22 + g_43 + g_64 + g_85 + g_107 + g_128 + g_149 + g_170 + g_192 + g_213 + g_234
then
fg = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234
+ h_255 + h_277 + h_298 + h_319 + h_340 + h_362 + h_383 + h_404 + h_425 + h_447 + h_468
where

- h_0 = f_0 g_0,
- h_22 = f_0 g_22 + f_22 g_0,
- h_43 = f_0 g_43 + f_22 g_22 + f_43 g_0,
- h_64 = f_0 g_64 + f_22 g_43 + f_43 g_22 + f_64 g_0,
- h_85 = f_0 g_85 + f_22 g_64 + f_43 g_43 + f_64 g_22 + f_85 g_0,
- h_107 = f_0 g_107 + f_22 g_85 + f_43 g_64 + f_64 g_43 + f_85 g_22 + f_107 g_0,
- h_128 = f_0 g_128 + f_22 g_107 + f_43 g_85 + f_64 g_64 + f_85 g_43 + f_107 g_22 + f_128 g_0,
- h_149 = f_0 g_149 + f_22 g_128 + f_43 g_107 + f_64 g_85 + f_85 g_64 + f_107 g_43 + f_128 g_22 + f_149 g_0,
- h_170 = f_0 g_170 + f_22 g_149 + f_43 g_128 + f_64 g_107 + f_85 g_85 + f_107 g_64 + f_128 g_43 + f_149 g_22 + f_170 g_0,
- h_192 = f_0 g_192 + f_22 g_170 + f_43 g_149 + f_64 g_128 + f_85 g_107 + f_107 g_85 + f_128 g_64 + f_149 g_43 + f_170 g_22 + f_192 g_0,
- h_213 = f_0 g_213 + f_22 g_192 + f_43 g_170 + f_64 g_149 + f_85 g_128 + f_107 g_107 + f_128 g_85 + f_149 g_64 + f_170 g_43 + f_192 g_22 + f_213 g_0,
- h_234 = f_0 g_234 + f_22 g_213 + f_43 g_192 + f_64 g_170 + f_85 g_149 + f_107 g_128 + f_128 g_107 + f_149 g_85 + f_170 g_64 + f_192 g_43 + f_213 g_22 + f_234 g_0,
- h_255 = f_22 g_234 + f_43 g_213 + f_64 g_192 + f_85 g_170 + f_107 g_149 + f_128 g_128 + f_149 g_107 + f_170 g_85 + f_192 g_64 + f_213 g_43 + f_234 g_22,
- h_277 = f_43 g_234 + f_64 g_213 + f_85 g_192 + f_107 g_170 + f_128 g_149 + f_149 g_128 + f_170 g_107 + f_192 g_85 + f_213 g_64 + f_234 g_43,
- h_298 = f_64 g_234 + f_85 g_213 + f_107 g_192 + f_128 g_170 + f_149 g_149 + f_170 g_128 + f_192 g_107 + f_213 g_85 + f_234 g_64,
- h_319 = f_85 g_234 + f_107 g_213 + f_128 g_192 + f_149 g_170 + f_170 g_149 + f_192 g_128 + f_213 g_107 + f_234 g_85,
- h_340 = f_107 g_234 + f_128 g_213 + f_149 g_192 + f_170 g_170 + f_192 g_149 + f_213 g_128 + f_234 g_107,
- h_362 = f_128 g_234 + f_149 g_213 + f_170 g_192 + f_192 g_170 + f_213 g_149 + f_234 g_128,
- h_383 = f_149 g_234 + f_170 g_213 + f_192 g_192 + f_213 g_170 + f_234 g_149,
- h_404 = f_170 g_234 + f_192 g_213 + f_213 g_192 + f_234 g_170,
- h_425 = f_192 g_234 + f_213 g_213 + f_234 g_192,
- h_447 = f_213 g_234 + f_234 g_213,
- h_468 = f_234 g_234.

(It's important here that I chose my indices as
*ceilings* of multiples of 21.25.
If I had instead used the *floors* 0, 21, 42, 63, 85, ...
then I would have run into trouble with
f_0 g_85 + f_21 g_63 + f_42 g_42 + f_63 g_21 + f_85 g_0
being a small multiple of 2^84 but often not a multiple of the desired 2^85.)

The above formulas for h_0, h_22, ..., h_468 involve 12^2 = 144 multiplications (each f_i times each g_j) and 11^2 = 121 additions (which I think of as 144 for the number of products minus 23 for the number of outputs).

**Multiplication, Karatsuba's method: 114 fadd + 142 fmul.**
I'm going to use ``Karatsuba's method'' to compute
the same quantities h_0, h_22, ..., h_468 more efficiently.
Warning: There are
several different polynomial-product and integer-product formulas called ``Karatsuba's method,''
and some of them (including the formulas in Karatsuba's original 1963 paper)
are slightly more expensive than the formulas shown here.
There's no standard terminology for the different formulas.

Anyway, define F_0 = f_0 x^0 + f_22 x^1 + ... + f_107 x^5, F_1 = f_128 x^0 + f_149 x^1 + ... + f_234 x^5, G_0 = g_0 x^0 + g_22 x^1 + ... + g_107 x^5, and G_1 = g_128 x^0 + g_149 x^1 + ... + g_234 x^5. The goal is to compute the polynomial product (F_0 + x^6 F_1)(G_0 + x^6 G_1).

I'll compute (F_0 + x^6 F_1)(G_0 + x^6 G_1) as
K + 2^127 x^6 ((F_0 + 2^(-127) F_1)(G_0 + 2^(-127) G_1) - K)
where K = F_0 G_0 - 2^(-127) x^6 F_1 G_1.
Observe that there are only *three* half-size polynomial products here,
rather than the *four* that appear in the obvious formula
F_0 G_0 + x^6 (F_0 G_1 + F_1 G_0) + x^12 F_1 G_1;
so I'll end up with only 3·6^2 = 108 coefficient multiplications,
rather than 4·6^2 = 144.

Here are the intermediate quantities computed. The coefficients of F_0 + 2^(-127) F_1 are

- u_0 = f_0 + 2^(-127) f_128,
- u_22 = f_22 + 2^(-127) f_149,
- u_43 = f_43 + 2^(-127) f_170,
- u_64 = f_64 + 2^(-127) f_192,
- u_85 = f_85 + 2^(-127) f_213,
- u_107 = f_107 + 2^(-127) f_234.

- v_0 = g_0 + 2^(-127) g_128,
- v_22 = g_22 + 2^(-127) g_149,
- v_43 = g_43 + 2^(-127) g_170,
- v_64 = g_64 + 2^(-127) g_192,
- v_85 = g_85 + 2^(-127) g_213,
- v_107 = g_107 + 2^(-127) g_234.

The coefficients of the product (F_0 + 2^(-127) F_1)(G_0 + 2^(-127) G_1) are

- w_0 = u_0 v_0,
- w_22 = u_0 v_22 + u_22 v_0,
- w_43 = u_0 v_43 + u_22 v_22 + u_43 v_0,
- w_64 = u_0 v_64 + u_22 v_43 + u_43 v_22 + u_64 v_0,
- w_85 = u_0 v_85 + u_22 v_64 + u_43 v_43 + u_64 v_22 + u_85 v_0,
- w_107 = u_0 v_107 + u_22 v_85 + u_43 v_64 + u_64 v_43 + u_85 v_22 + u_107 v_0,
- w_128 = u_22 v_107 + u_43 v_85 + u_64 v_64 + u_85 v_43 + u_107 v_22,
- w_149 = u_43 v_107 + u_64 v_85 + u_85 v_64 + u_107 v_43,
- w_170 = u_64 v_107 + u_85 v_85 + u_107 v_64,
- w_192 = u_85 v_107 + u_107 v_85,
- w_213 = u_107 v_107.

The coefficients of K = F_0 G_0 - 2^(-127) x^6 F_1 G_1, together with the top coefficients of F_1 G_1, are

- h_0 = f_0 g_0,
- h_22 = f_0 g_22 + f_22 g_0,
- h_43 = f_0 g_43 + f_22 g_22 + f_43 g_0,
- h_64 = f_0 g_64 + f_22 g_43 + f_43 g_22 + f_64 g_0,
- h_85 = f_0 g_85 + f_22 g_64 + f_43 g_43 + f_64 g_22 + f_85 g_0,
- h_107 = f_0 g_107 + f_22 g_85 + f_43 g_64 + f_64 g_43 + f_85 g_22 + f_107 g_0,
- k_128 = (f_22 g_107 + f_43 g_85 + f_64 g_64 + f_85 g_43 + f_107 g_22) - 2^(-127)(f_128 g_128),
- k_149 = (f_43 g_107 + f_64 g_85 + f_85 g_64 + f_107 g_43) - 2^(-127)(f_128 g_149 + f_149 g_128),
- k_170 = (f_64 g_107 + f_85 g_85 + f_107 g_64) - 2^(-127)(f_128 g_170 + f_149 g_149 + f_170 g_128),
- k_192 = (f_85 g_107 + f_107 g_85) - 2^(-127)(f_128 g_192 + f_149 g_170 + f_170 g_149 + f_192 g_128),
- k_213 = (f_107 g_107) - 2^(-127)(f_128 g_213 + f_149 g_192 + f_170 g_170 + f_192 g_149 + f_213 g_128),
- h_362 = f_128 g_234 + f_149 g_213 + f_170 g_192 + f_192 g_170 + f_213 g_149 + f_234 g_128,
- h_383 = f_149 g_234 + f_170 g_213 + f_192 g_192 + f_213 g_170 + f_234 g_149,
- h_404 = f_170 g_234 + f_192 g_213 + f_213 g_192 + f_234 g_170,
- h_425 = f_192 g_234 + f_213 g_213 + f_234 g_192,
- h_447 = f_213 g_234 + f_234 g_213,
- h_468 = f_234 g_234.

Finally, the middle coefficients of the product are

- h_128 = 2^127 (w_0 - h_0) + k_128,
- h_149 = 2^127 (w_22 - h_22) + k_149,
- h_170 = 2^127 (w_43 - h_43) + k_170,
- h_192 = 2^127 (w_64 - h_64) + k_192,
- h_213 = 2^127 (w_85 - h_85) + k_213,
- h_234 = 2^127 (w_107 - h_107) - 2^(-127) h_362,
- h_255 = 2^127 (w_128 - k_128) - 2^(-127) h_383,
- h_277 = 2^127 (w_149 - k_149) - 2^(-127) h_404,
- h_298 = 2^127 (w_170 - k_170) - 2^(-127) h_425,
- h_319 = 2^127 (w_192 - k_192) - 2^(-127) h_447,
- h_340 = 2^127 (w_213 - k_213) - 2^(-127) h_468.

How does a constant such as 2^127 end up in a floating-point register? The standard answer has three steps. First, create an array of constants in memory, for example with the following C code:

```
const double curve25519_constants[2] = {
  170141183460469231731687303715884105728.0 /* 2^127 */ ,
  0.0000000000000000000000000000000000000058774717541114375398436826861112283890933277838604376075437585313920862972736358642578125 /* 2^-127 */
} ;
```

Second, copy the address of the array into a register, for example with the following standard sequence of UltraSPARC instructions:

```
int64 constants
int64 constants_low

constants = (curve25519_constants & 0xfffffc0000000000) >> 32
constants_low = curve25519_constants & 0xfffffc00
constants |= (curve25519_constants & 0x3ff00000000) >> 32
constants_low |= curve25519_constants & 0x3ff
constants <<= 32
constants |= constants_low
```

Third, load the constants from memory:

```
float64 two127
float64 twom127

two127 = *(float64 *) (constants + 0)
twom127 = *(float64 *) (constants + 8)
```

I wrote a straightforward `curve25519_mul.q`
and immediately ran into errors from the assembler:
I was trying to use more than 32 floating-point registers at once,
keeping all 12 f's, all 12 g's, all 23 h's, etc. live through the entire computation.
So I juggled the code to kill values earlier:
for example,
I computed h0, h22, h43, h64, h85, h107, stored them,
and subtracted them from w0, w22, w43, w64, w85, w107,
to free up the registers that they were using.
I also reloaded the f's and g's when they were needed,
rather than keeping them continuously live.

Here's the resulting software:

```
wget https://cr.yp.to/highspeed/fall2006/curve25519_mul.q
wget https://cr.yp.to/highspeed/fall2006/curve25519_multest.c
./qhasm-sparc < curve25519_mul.q > curve25519_mul.s
gcc -m64 -o curve25519_multest curve25519_multest.c curve25519_mul.s
./curve25519_multest | gp
```

**Squaring, schoolbook method: 55 fadd + 89 fmul.**
In Curve25519,
many of the multiplications modulo 2^255-19
are actually squarings:
f_i = g_i for each i.
This means that, for example,
the equation
h_107 = f_0 g_107 + f_22 g_85 + f_43 g_64 + f_64 g_43 + f_85 g_22 + f_107 g_0
can be replaced by the equation
h_107 = 2 (f_0 f_107 + f_22 f_85 + f_43 f_64),
replacing 6 multiplications and 5 additions
with 4 multiplications and 2 additions.

(I'm assuming here that doublings are carried out as multiplications by the constant 2. If floating-point multiplications turn out to be a bottleneck later then these multiplications can be replaced by additions.)

The equation h_107 = (2 f_0) f_107 + (2 f_22) f_85 + (2 f_43) f_64 is slightly better, costing only 3 multiplications and 2 additions if 2f_0, 2f_22, 2f_43 are precomputed. It's easy to see that the overall benefit of these precomputations outweighs the overall cost: there are only 11 doublings (of f_0, f_22, ..., f_213) rather than 21 doublings (in h_22, h_43, ..., h_447).
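As a toy check (my own C, with 6 coefficients rather than 12), the doubling-based squaring agrees with schoolbook squaring while replacing the 36 products by 6 squares, 15 off-diagonal products, and 5 precomputed doublings:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Squaring a 6-coefficient polynomial: schoolbook uses 36 products;
   precomputing 2*f[0..4] leaves 6 squares plus 15 products f2[i]*f[j]
   with i < j, since h[i+j] picks up f[i]f[j] twice. */

int square_doubling_matches_schoolbook(void)
{
  int64_t f[6] = {3, 1, 4, 1, 5, 9};
  int64_t f2[5], h[11], href[11];
  int i, j;
  memset(href, 0, sizeof href);
  for (i = 0; i < 6; ++i)
    for (j = 0; j < 6; ++j)
      href[i + j] += f[i] * f[j];         /* 36 multiplications */
  for (i = 0; i < 5; ++i) f2[i] = 2 * f[i];   /* 5 doublings */
  memset(h, 0, sizeof h);
  for (i = 0; i < 6; ++i) {
    h[2 * i] += f[i] * f[i];              /* 6 diagonal squares */
    for (j = i + 1; j < 6; ++j)
      h[i + j] += f2[i] * f[j];           /* 15 off-diagonal products */
  }
  for (i = 0; i < 11; ++i)
    if (h[i] != href[i]) return 0;
  return 1;
}
```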

What happens when these speedups are applied to schoolbook multiplication? There are 144 fmul, of which 132 are f_i g_j for distinct i,j, of which 132/2 = 66 are eliminated at the expense of 11 doublings. The remaining 144 - 66 = 78 fmul results require only 78 - 23 = 55 fadds to be compressed into 23 outputs. Overall cost: 55 fadd + 78 fmul + 11 fmul = 55 fadd + 89 fmul. Software:

```
wget https://cr.yp.to/highspeed/fall2006/curve25519_sq.q
wget https://cr.yp.to/highspeed/fall2006/curve25519_sqtest.c
./qhasm-sparc < curve25519_sq.q > curve25519_sq.s
gcc -m64 -o curve25519_sqtest curve25519_sqtest.c curve25519_sq.s
./curve25519_sqtest | gp
```

Can one do better? Not to my knowledge. The lowest cost I've been able to achieve for Karatsuba squaring is 63 fadd + 106 fmul. Squaring saves considerably more in the schoolbook method than it does in Karatsuba's method.

**Reduction modulo 2^255-19: 11 fadd + 11 fmul.**
If
h = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234
+ h_255 + h_277 + h_298 + h_319 + h_340 + h_362 + h_383 + h_404 + h_425 + h_447 + h_468,
with h_i a multiple of 2^i,
then h is congruent to
r_0 + r_22 + r_43 + r_64 + r_85 + r_107 + r_128 + r_149 + r_170 + r_192 + r_213 + r_234
modulo 2^255 - 19,
where

- r_0 = h_0 + 19 2^(-255) h_255,
- r_22 = h_22 + 19 2^(-255) h_277,
- r_43 = h_43 + 19 2^(-255) h_298,
- r_64 = h_64 + 19 2^(-255) h_319,
- r_85 = h_85 + 19 2^(-255) h_340,
- r_107 = h_107 + 19 2^(-255) h_362,
- r_128 = h_128 + 19 2^(-255) h_383,
- r_149 = h_149 + 19 2^(-255) h_404,
- r_170 = h_170 + 19 2^(-255) h_425,
- r_192 = h_192 + 19 2^(-255) h_447,
- r_213 = h_213 + 19 2^(-255) h_468,
- r_234 = h_234.

Computing the r_i's from the h_i's takes 11 multiplications by the constant 19 2^(-255) and 11 additions. Software:

```
wget https://cr.yp.to/highspeed/fall2006/curve25519_modp.q
wget https://cr.yp.to/highspeed/fall2006/curve25519_modptest.c
./qhasm-sparc < curve25519_modp.q > curve25519_modp.s
gcc -m64 -o curve25519_modptest curve25519_modptest.c \
  curve25519_modp.s curve25519_mul.s
./curve25519_modptest | gp
```
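The congruence behind these formulas can be spot-checked with exact double-precision arithmetic; this is my own toy example, not part of the course software:

```c
#include <assert.h>

/* A coefficient h_255 that is a multiple of 2^255 folds into the bottom
   as 19 * 2^(-255) * h_255, since 2^255 is congruent to 19 modulo
   2^255-19.  With only a few mantissa bits in play, every operation
   here is exact in double precision. */

int reduce255_example(void)
{
  double h0 = 12345.0;
  double h255 = 7.0 * 0x1p255;               /* 7 * 2^255 */
  double r0 = h0 + 19.0 * 0x1p-255 * h255;   /* r_0 = h_0 + 19 2^(-255) h_255 */
  return r0 == 12345.0 + 19.0 * 7.0;         /* 7 * 2^255 folds to 7 * 19 */
}
```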

**Carrying and reduction modulo 2^255-19: 63 fadd + 11 fmul.**
``Carrying r_0 to r_22'' means subtracting the closest multiple of 2^22 from r_0
and adding it to r_22.
This doesn't change the sum r_0 + r_22,
but it guarantees that r_0 is in the range {-2^21,...,-1,0,1,...,2^21}.

Similarly, ``carrying r_22 to r_43'' means subtracting the closest multiple of 2^43 from r_22 and adding it to r_43. This doesn't change the sum r_22 + r_43, but it guarantees that 2^(-22) r_22 is in the range {-2^20,...,-1,0,1,...,2^20}. By carrying r_0 to r_22, then carrying r_22 to r_43, then carrying r_43 to r_64, and so on, we guarantee that each 2^(-i) r_i is in a narrow range, making the r's suitable inputs for subsequent multiplications.

A carry takes four floating-point additions:

```
closest43 = r22 + alpha43   # rounded
closest43 -= alpha43        # exact
r22 -= closest43            # exact
r43 += closest43            # exact
```

Here alpha_43 is the constant 2^43 (2^52 + 2^51). The point is that floating-point numbers around alpha_43 are exactly the multiples of 2^43: ..., 2^43 (2^52 + 2^51 - 1), 2^43 (2^52 + 2^51), 2^43 (2^52 + 2^51 + 1), ... The sum r_22 + alpha_43 is thus rounded to the closest multiple of 2^43; subtracting alpha_43 produces the multiple of 2^43 closest to r_22, exactly as desired.

What happens to the carry from r_234? The easy answer is to carry from r_234 to r_0, reducing the carry modulo 2^255-19:

```
closest255 = r234 + alpha255
closest255 -= alpha255
r234 -= closest255
closest255 *= scale
r0 += closest255
```

where alpha_255 is the constant 2^255 (2^52 + 2^51) and `scale` is the constant 19 2^(-255).

This might make r_0 too big again. If r_234 starts with 50 bits, for example, then closest_255 can be as large as 2^284; so 19 2^(-255) 2^284, more than 2^33, is added to r_0. Another carry from r_0 to r_22 is required. But the carries can stop after that: a chain of 13 carries from r_0 to r_22 to r_43 to ... to r_234 to r_0 to r_22 guarantees that all r_i's are reasonably small.

One can eliminate the extra `scale` multiplication
by carefully interleaving the carrying into the original reduction modulo 2^255-19.
Here's one reasonable carry/reduce strategy, starting from
h = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234
+ h_255 + h_277 + h_298 + h_319 + h_340 + h_362 + h_383 + h_404 + h_425 + h_447 + h_468:

- Reduce from h_447 to h_192.
- Reduce from h_468 to h_213.
- Carry from h_192 to h_213.
- Carry from h_213 to h_234.
- Carry from h_234 to h_255.
- Reduce from h_255 to h_0.
- Reduce from h_277 to h_22.
- Reduce from h_298 to h_43.
- Reduce from h_319 to h_64.
- Reduce from h_340 to h_85.
- Reduce from h_362 to h_107.
- Reduce from h_383 to h_128.
- Reduce from h_404 to h_149.
- Reduce from h_425 to h_170.
- Carry from h_0 to h_22.
- Carry from h_22 to h_43.
- Carry from h_43 to h_64.
- Carry from h_64 to h_85.
- Carry from h_85 to h_107.
- Carry from h_107 to h_128.
- Carry from h_128 to h_149.
- Carry from h_149 to h_170.
- Carry from h_170 to h_192.
- Carry from h_192 to h_213.

Software:

```
wget https://cr.yp.to/highspeed/fall2006/curve25519_carry.q
wget https://cr.yp.to/highspeed/fall2006/curve25519_carrytest.c
./qhasm-sparc < curve25519_carry.q > curve25519_carry.s
gcc -m64 -o curve25519_carrytest curve25519_carrytest.c \
  curve25519_carry.s curve25519_mul.s curve25519_sq.s
./curve25519_carrytest | gp
```

**Multiplication, carrying, reduction: 177 fadd + 153 fmul.**
A complete integer multiplication modulo 2^255-19,
producing small results suitable for subsequent multiplications,
costs 114 fadd + 142 fmul for the polynomial multiplication,
and an additional 63 fadd + 11 fmul for carrying and reduction.

**Squaring, carrying, reduction: 118 fadd + 100 fmul.**
A complete integer squaring modulo 2^255-19,
producing small results suitable for subsequent multiplications,
costs 55 fadd + 89 fmul for the polynomial squaring,
and an additional 63 fadd + 11 fmul for carrying and reduction.

**What's coming up.**
I'll need a few more functions to build a complete Curve25519 computation:

- Converting a 255-bit integer from the traditional 32-byte representation to the 12-floating-point-register representation. This will rely on the details of IEEE 754 floating-point format, which fortunately are rather straightforward.
- Converting an integer from floating-point back to 32 bytes, making sure that the result is in {0,1,...,2^255-20}.
- Multiplication by a small constant. This, like squaring, could be handled as a general multiplication, but that would waste time.
- Conditional exchange. This means computing x[b],x[1-b] given x[0],x[1],b.
- The main loop.

Here's how to fetch and build the qhasm snapshot used below:

```
wget https://cr.yp.to/qhasm/qhasm-20061105.tar.gz
gunzip < qhasm-20061105.tar.gz | tar -xf -
cd qhasm-20061105
./do
```

Now back to the Curve25519 operations.

**Small-constant multiplication, carrying, reduction: 48 fadd + 13 fmul.**
As we'll see later, the Curve25519 main loop has 8 additions/subtractions
and 10 multiplications.
Half of the 10 multiplications are special:
4 of them are squarings,
and 1 of them is a multiplication by a particular integer, namely 121665.

Multiplying by a 17-bit integer is of course easier than multiplying by a 255-bit integer: one can simply multiply each of the 12 floating-point inputs by 121665, costing 0 fadd + 12 fmul. But the results need 12 carries, costing 48 fadd + 1 fmul, so the overall benefit is not as dramatic as one might initially expect.

Software:

```
wget https://cr.yp.to/highspeed/fall2006/curve25519_121665.q
wget https://cr.yp.to/highspeed/fall2006/curve25519_121665test.c
./qhasm-sparc < curve25519_121665.q > curve25519_121665.s
gcc -m64 -o curve25519_121665test curve25519_121665test.c curve25519_121665.s
./curve25519_121665test | gp
```

**Selection: 0 fadd + 0 fmul.**
There's one additional operation in the Curve25519 main loop:
given four 255-bit integers x[0], z[0], x[1], z[1] and a secret bit b,
compute x[b], z[b] and x[1-b], z[1-b].

Here are three strategies for performing this operation:

- Put b into the instruction pointer. In other words, conditionally branch on b, jumping to code that uses x[0], z[0], x[1], z[1] if b is 0, or to code that uses x[1], z[1], x[0], z[0] if b is 1. There's a security problem here: the conditional branch can take a b-dependent amount of time, perhaps leaking b to an attacker who can see the total Curve25519 computation time. I strongly recommend against secret conditional branches.
- Put b onto the address bus, taking advantage of the ability of loads to perform computation via variable memory locations. In other words, use b as an array index, loading x[b], z[b] and x[1-b], z[1-b]. There's a security problem here too: the array access can take a b-dependent amount of time, perhaps leaking b to an attacker who can see the total Curve25519 computation time. I strongly recommend against secret array indices.
- Compute x[b] by simple arithmetic starting from x[0], x[1], b: for example, using the equation x[b] = b x[1] + (1-b) x[0]. This isn't quite as fast as the previous strategies, but it avoids the timing attacks inherent in the previous strategies.

I'll try the third strategy using integer bit operations:

```
r3 = *(uint64 *) (r + 24)
s3 = *(uint64 *) (s + 24)
s3 ^= r3
q3 = s3 & bminus1
q3 ^= r3
p3 = q3 ^ s3
*(uint64 *) (p + 24) = p3
*(uint64 *) (q + 24) = q3
```

A similar sequence of 3 fadd + 1 fmul, or alternatively 2 fadd + 4 fmul, would have the virtue of working with floating-point registers, but I'm guessing that the loads and stores to move to integer registers won't be a problem, and I don't want to consume precious fadd resources.
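Here's the same masked selection in portable C, one 64-bit word at a time (function name mine), which can be checked for both values of b:

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free selection: bminus1 = b - 1 is all ones when b = 0 and
   all zeros when b = 1, so no branch and no secret array index is
   needed.  r3 holds a word of x[0], s3 a word of x[1]. */

void select_word(uint64_t *p, uint64_t *q, uint64_t r3, uint64_t s3, uint64_t b)
{
  uint64_t bminus1 = b - 1;
  uint64_t q3, p3;
  s3 ^= r3;                  /* s3 = x[0] ^ x[1] */
  q3 = (s3 & bminus1) ^ r3;  /* x[1] if b = 0, x[0] if b = 1: x[1-b] */
  p3 = q3 ^ s3;              /* the other word: x[b] */
  *p = p3;
  *q = q3;
}
```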

Software:

```
wget https://cr.yp.to/highspeed/fall2006/curve25519_select.q
wget https://cr.yp.to/highspeed/fall2006/curve25519_selecttest.c
./qhasm-sparc < curve25519_select.q > curve25519_select.s
gcc -m64 -o curve25519_selecttest curve25519_selecttest.c curve25519_select.s
./curve25519_selecttest | gp
```

**Conversion from 32-byte string: 44 fadd + 0 fmul.**
Say f is in {-2^51,...,-1,0,1,...,2^51-1}.
Here's how to take f out of an UltraSPARC integer register
and put it into an UltraSPARC floating-point register:

- Add 2^52 1075 + 2^51 to f.
- Store the contents of the integer register.
- Wait a little while for the store to complete.
- Load the same bytes into the floating-point register.
- Subtract 2^52 + 2^51 from the result.

It's just as easy to take f out of an UltraSPARC integer register and put a scaled version of f, let's say 2^32 f, into an UltraSPARC floating-point register:

- Add 2^52 (1075+32) + 2^51 to f.
- Store the contents of the integer register.
- Wait a little while for the store to complete.
- Load the same bytes into the floating-point register.
- Subtract 2^32 (2^52 + 2^51) from the result.

What if f is in memory rather than in an integer register?
To be concrete,
let's say bytes (x[0],x[1],x[2],x[3],x[4],x[5]) in memory
represent an integer f in {0,1,...,2^48-1} in the usual little-endian way:
f = x[0] + 2^8 x[1] + 2^16 x[2] + 2^24 x[3] + 2^32 x[4] + 2^40 x[5].
Here's how to put the same integer f into a floating-point register.
First load the bytes into registers
(with separate byte loads;
loading `*(uint64 *) (x+0)`
doesn't work on the UltraSPARC unless `x` is 0 modulo 8):

```
int64 f0
int64 f1
int64 f2
int64 f3
int64 f4
int64 f5

f0 = *(uint8 *) (x + 0)
f1 = *(uint8 *) (x + 1)
f2 = *(uint8 *) (x + 2)
f3 = *(uint8 *) (x + 3)
f4 = *(uint8 *) (x + 4)
f5 = *(uint8 *) (x + 5)
```

Shift and add to compute x0 = 2^52 1075 + 2^51 + f = 2^51 2151 + f:

```
f1 <<= 8
f2 <<= 16
f3 <<= 24
f4 <<= 32
f5 <<= 40

int64 x0
x0 = 2151
x0 <<= 51
x0 += f0
x0 += f1
x0 += f2
x0 += f3
x0 += f4
x0 += f5
```

Set aside 8 bytes of temporary storage in memory, and store x0 there:

```
stack64 y0
y0 = x0
```

Load the 8 bytes into a floating-point register, and finally subtract 2^52 + 2^51:

```
float64 z0
z0 = y0
z0 -= alpha0
```

(Here alpha0 is the constant 2^52 + 2^51.)

One of the Curve25519 inputs is a 255-bit integer stored in 32 bytes of memory. I convert

- the first 6 bytes into a floating-point register z_0 as above,
- the next 5 bytes into z_43,
- the next 5 bytes into z_85,
- the next 6 bytes into z_128,
- the next 5 bytes into z_170, and
- the last 5 bytes into z_213.

Software:

```
wget https://cr.yp.to/highspeed/fall2006/curve25519_todouble.q
wget https://cr.yp.to/highspeed/fall2006/curve25519_todoubletest.c
./qhasm-sparc < curve25519_todouble.q > curve25519_todouble.s
gcc -m64 -o curve25519_todoubletest curve25519_todoubletest.c curve25519_todouble.s
./curve25519_todoubletest | gp
```

One can save time here by using more integer operations, first to partition the input bits more precisely and second to create the integer representing f rather than 2^52 + 2^51 + f. But I won't bother: conversion to double is a very small part of the Curve25519 time and doesn't warrant any extra programming effort.
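The exponent trick above can be mimicked in portable C, with `memcpy` standing in for the store/load sequence (my own sketch, assuming IEEE 754 doubles and matching integer/double endianness):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Adding 2^51 * 2151 = 2^52 * 1075 + 2^51 to f builds the bit pattern
   of the double 2^52 + 2^51 + f (exponent field 1075, top mantissa bit
   set); subtracting 2^52 + 2^51 = 0x1.8p52 afterwards leaves exactly f. */

double int48_to_double(uint64_t f)   /* f in {0, 1, ..., 2^48 - 1} */
{
  uint64_t x0 = (2151ULL << 51) + f;
  double z0;
  memcpy(&z0, &x0, sizeof z0);       /* stands in for the store/load */
  return z0 - 0x1.8p52;
}
```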

**Conversion to 32-byte string: 12 fadd + 0 fmul.**
The output of Curve25519 is an integer in {0,1,...,2^255-20},
represented as a 32-byte string in the same way as the input.
This isn't what's naturally produced by arithmetic:

- I've been working with integers as sums of floating-point numbers, rather than as byte strings.
- I've been allowing redundant representations: different floating-point vectors that represent the same integer.
- I've been allowing integers outside the range {0,1,...,2^255-20}.

Here's a reasonably straightforward freezing strategy. Let's say that the final carrying and reduction produce r = r_0 + r_22 + r_43 + r_64 + r_85 + r_107 + r_128 + r_149 + r_170 + r_192 + r_213 + r_234. Carrying has put each r_i into a fairly limited range:

- r_0 is in {-2^21,...,2^21}.
- 2^(-22) r_22 is in {-2^20,...,2^20}.
- 2^(-43) r_43 is in {-2^20,...,2^20}.
- 2^(-64) r_64 is in {-2^20,...,2^20}.
- 2^(-85) r_85 is in {-2^21,...,2^21}.
- 2^(-107) r_107 is in {-2^20,...,2^20}.
- 2^(-128) r_128 is in {-2^20,...,2^20}.
- 2^(-149) r_149 is in {-2^20,...,2^20}.
- 2^(-170) r_170 is in {-2^21,...,2^21}.
- 2^(-192) r_192 is in {-2^20,...,2^20}.
- 2^(-213) r_213 is in a slightly wider range, thanks to the final carry from r_192 to r_213, but surely in {-2^21,...,2^21}.
- 2^(-234) r_234 is in {-2^20,...,2^20}.

Now add 2^255-19 to r, splitting the constant across the terms so that each of the following sums is nonnegative:

- s_0 = 2^23 - 19 + r_0,
- s_16 = 2^44 - 2^23 + r_22,
- s_40 = 2^65 - 2^44 + r_43,
- s_64 = 2^86 - 2^65 + r_64,
- s_80 = 2^108 - 2^86 + r_85,
- s_104 = 2^129 - 2^108 + r_107,
- s_128 = 2^150 - 2^129 + r_128,
- s_144 = 2^171 - 2^150 + r_149,
- s_168 = 2^193 - 2^171 + r_170,
- s_192 = 2^214 - 2^193 + r_192,
- s_208 = 2^235 - 2^214 + r_213,
- s_232 = 2^255 - 2^235 + r_234.

A series of shifts and adds now converts s_0, 2^(-16) s_16, ..., 2^(-232) s_232 to integers t_0, t_32, t_64, t_96, t_128, t_160, t_192, t_224 in {0,1,...,2^32-1} with 2^255-19 + r = t_0 + 2^32 t_32 + ... + 2^224 t_224. To see whether 2^255-19 + r is below 2^255-19, I'll add 19 and inspect the 255th bit, taking advantage of the UltraSPARC's 32-bit-carry instructions:

```
carry32? u0 = t0 + 19
carry32? u32 = t32 + 0 + carry32
carry32? u64 = t64 + 0 + carry32
carry32? u96 = t96 + 0 + carry32
carry32? u128 = t128 + 0 + carry32
carry32? u160 = t160 + 0 + carry32
carry32? u192 = t192 + 0 + carry32
carry32? u224 = t224 + 0 + carry32

wantt = (uint32) u224 >> 31
wantt -= 1
wantu = ~wantt
```

Now it's easy to select either r or 2^255-19 + r as the final Curve25519 output. Here's the software:

```
wget https://cr.yp.to/highspeed/fall2006/curve25519_freeze.q
wget https://cr.yp.to/highspeed/fall2006/curve25519_freezetest.c
./qhasm-sparc < curve25519_freeze.q > curve25519_freeze.s
gcc -m64 -o curve25519_freezetest curve25519_freezetest.c \
  curve25519_freeze.s curve25519_carry.s curve25519_mul.s curve25519_sq.s \
  curve25519_todouble.s
./curve25519_freezetest | gp
```
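The add-19-and-inspect-bit-255 test can be simulated in portable C with an explicit 32-bit carry chain (a sketch of mine, not course code):

```c
#include <assert.h>
#include <stdint.h>

/* Add 19 to t_0 + 2^32 t_32 + ... + 2^224 t_224 with a chain of 32-bit
   adds and inspect bit 255.  The bit is set exactly when the original
   256-bit value is at least 2^255 - 19. */

int freeze_wants_t(const uint32_t t[8])
{
  uint64_t carry = 19;
  uint32_t u = 0;
  int i;
  for (i = 0; i < 8; ++i) {
    carry += t[i];
    u = (uint32_t) carry;   /* u0, u32, ..., u224 in turn */
    carry >>= 32;
  }
  return u >> 31;           /* wantt before the -1 and ~ mask steps */
}
```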

**The complete Curve25519 computation: 414907 fadd + 327626 fmul.**
The Curve25519 computation starts with two 32-byte inputs.
It

- extracts from one input a 255-bit exponent e;
- extracts from the other input an integer k modulo the prime p = 2^255-19;
- converts k to floating-point representation;
- sets up a vector of 4 255-bit integers in floating-point representation;
- ``main loop'': transforms the vector 255 times, once for each bit in e;
- extracts two components, x and z, from the vector;
- ``division'': computes x z^(p-2) mod p; and
- freezes the result to produce a 32-byte output.

The most important part of the computation is the main loop:

```
for (pos = 254;pos >= 0;--pos) {
  b = e[pos / 8] >> (pos & 7); b &= 1;
  curve25519_select(xzmb,xzm1b,xzm,xzm1,b);
  curve25519_add(a0,xzmb,xzmb + 12);
  curve25519_sub(a0 + 12,xzmb,xzmb + 12);
  curve25519_add(a1,xzm1b,xzm1b + 12);
  curve25519_sub(a1 + 12,xzm1b,xzm1b + 12);
  curve25519_sqcarry(b0,a0);
  curve25519_sqcarry(b0 + 12,a0 + 12);
  curve25519_mulcarry(b1,a1,a0 + 12);
  curve25519_mulcarry(b1 + 12,a1 + 12,a0);
  curve25519_add(c1,b1,b1 + 12);
  curve25519_sub(c1 + 12,b1,b1 + 12);
  curve25519_sqcarry(r,c1 + 12);
  curve25519_sub(s,b0,b0 + 12);
  curve25519_121665(t,s);
  curve25519_add(u,t,b0);
  curve25519_mulcarry(xznb,b0,b0 + 12);
  curve25519_mulcarry(xznb + 12,s,u);
  curve25519_sqcarry(xzn1b,c1);
  curve25519_mulcarry(xzn1b + 12,r,x1);
  curve25519_select(xzm,xzm1,xznb,xzn1b,b);
}
```

Each of the 255 iterations involves two selections (which can easily be merged across iterations into one selection), four additions, four subtractions, one multiplication by 121665 (with carry-reduce), four squarings (with carry-reduce), and five more multiplications (with carry-reduce). Here are the tallies of floating-point operations:

operation | per operation | per iteration | per 255 iterations |
---|---|---|---|
select | 0 fadd + 0 fmul | 0 fadd + 0 fmul | 0 fadd + 0 fmul |
add | 12 fadd + 0 fmul | 48 fadd + 0 fmul | 12240 fadd + 0 fmul |
sub | 12 fadd + 0 fmul | 48 fadd + 0 fmul | 12240 fadd + 0 fmul |
121665 | 48 fadd + 13 fmul | 48 fadd + 13 fmul | 12240 fadd + 3315 fmul |
sqcarry | 118 fadd + 100 fmul | 472 fadd + 400 fmul | 120360 fadd + 102000 fmul |
mulcarry | 177 fadd + 153 fmul | 885 fadd + 765 fmul | 225675 fadd + 195075 fmul |
total | | 1501 fadd + 1178 fmul | 382755 fadd + 300390 fmul |

The second most important part of the computation is the division. I compute the (p-2)nd = (2^255-21)th power of z[n], and multiply by x[n], with a straightforward sequence of 254 squarings and 12 extra multiplications, costing 32096 fadd + 27236 fmul. There are other ways to compute the reciprocal of z[n], such as Euclid's algorithm and extended binary-gcd algorithms, but these algorithms (especially when modified to avoid leaking information through timing) would take considerable extra work to implement and would have at most a small impact on the Curve25519 speed, so I won't bother with them this month.
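The branch-free flavor of Fermat inversion can be illustrated at toy scale (my own sketch; the prime here is 2^31 - 1, standing in for 2^255 - 19, and the chain below is generic square-and-multiply rather than the hand-tuned 254-squaring chain):

```c
#include <assert.h>
#include <stdint.h>

/* Division via Fermat's little theorem: modulo a prime p, z^(p-2) is
   the inverse of z.  The branch below is on the public exponent p - 2,
   not on secret data, unlike Euclid's algorithm. */

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p)
{
  return (a * b) % p;   /* a, b < 2^31, so the product fits in 64 bits */
}

uint64_t invmod(uint64_t z, uint64_t p)   /* z^(p-2) mod p */
{
  uint64_t result = 1, base = z % p, e = p - 2;
  while (e > 0) {       /* square-and-multiply, low exponent bit first */
    if (e & 1) result = mulmod(result, base, p);
    base = mulmod(base, base, p);
    e >>= 1;
  }
  return result;
}
```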

The remaining operations are conversion to and from floating point, costing 56 fadd + 0 fmul.

Total operation count (not computer-verified at this point): 414907 fadd + 327626 fmul. Evidently these instructions will take at least 414907 UltraSPARC cycles.

Here's the software:

```
wget https://cr.yp.to/highspeed/fall2006/curve25519_complete.c
wget https://cr.yp.to/highspeed/fall2006/curve25519_completetest.c
gcc -m64 -o curve25519_completetest curve25519_completetest.c \
  curve25519_complete.c curve25519_carry.s curve25519_mul.s curve25519_sq.s \
  curve25519_todouble.s curve25519_121665.s curve25519_add.s curve25519_sub.s \
  curve25519_freeze.s curve25519_select.s
./curve25519_completetest | gp
```

**Counting cycles.**
I've timed this Curve25519 implementation using the
cpucycles library:

```
long long timings[10];
unsigned char x[32];
unsigned char y[32];
unsigned char z[32];

for (i = 0;i < 10;++i) {
  timings[i] = cpucycles();
  curve25519(z,x,y);
}
for (i = 0;i < 9;++i)
  printf("%lld ",timings[i + 1] - timings[i]);
printf("\n");
```

The output shows that this implementation is taking over 2 million UltraSPARC III cycles:

```
2251848 2122560 2060876 2694852 2061201 2060841 2061101 3322788 2065352
```

There's a big gap, almost a factor of 5, between the 414907-cycle lower bound from fadd throughput and this 2-million-cycle upper bound. That's a gigantic gap compared to, e.g., my NIST P-224 elliptic-curve-Diffie-Hellman software from 2001, which used about 678000 UltraSPARC II cycles for 523578 floating-point additions, or my more recent Poly1305 software, which uses 68 UltraSPARC III cycles for 68 floating-point additions in the main loop.

Why is the software failing so miserably at keeping the adder busy?
Here's an illustrative 3-fadd excerpt from `curve25519_mul.q`:

```
h64 = f0 * g64       # starts on cycle 0
f22g43 = f22 * g43   # starts on cycle 1 (delayed by fmul throughput)
h64 += f22g43        # starts on cycle 5 (delayed by fmul latency)
f43g22 = f43 * g22   # starts on cycle 5
h64 += f43g22        # starts on cycle 9 (delayed by fmul latency)
f64g0 = f64 * g0     # starts on cycle 9
h64 += f64g0         # starts on cycle 13
```

I didn't have to add these 4 quantities serially. I could instead add them in a tree:

```
h64 = f0 * g64       # starts on cycle 0
f22g43 = f22 * g43   # starts on cycle 1 (delayed by fmul throughput)
f43g22 = f43 * g22   # starts on cycle 2 (delayed by fmul throughput)
f64g0 = f64 * g0     # starts on cycle 3 (delayed by fmul throughput)
h64 += f22g43        # starts on cycle 5 (delayed by fmul latency)
f43g22 += f64g0      # starts on cycle 7 (delayed by fmul latency)
h64 += f43g22        # starts on cycle 11 (delayed by fadd latency)
```

Even better, I could interleave these operations with subsequent additions in the software. There's tremendous parallelism in the Curve25519 computation graph, with far more than 4 independent additions on each level of the graph; I simply have to make this parallelism visible to the CPU.

To be continued.

- Poly1305 recap
- Channel-conversion picture
- Formulas, summary of modifications
- How stream ciphers will help
- Software speed
- Change prime? Analysis

- Historical development
- 1974 Gilbert MacWilliams Sloane
- Costs of long r
- 1979 Wegman Carter
- Stinson's simplified formula
- 1981/1987 Karp Rabin
- Generation of r, expansion of r
- 1993 den Boer, 1994 Taylor, 1994 Bierbrauer Johansson Kabatianskii Smeets
- Implementors moving to large characteristic
- Continued expansion of r
- Short r without precomputation; finally consistent high speed

- Possibilities for future development
- UMAC-P4: double speed if Pentium 4, long r, long message
- Adapt Winograd dot-product trick to short r?
- The debate regarding processor-specific MACs
- Select processor-specific MACs dynamically?

There are many other ideas in the authentication literature, but most of them are slowdowns! See Section 10 of my hash127 paper for a survey (dating from 2000) and tons of references. The main advance in the field since 2000 is a new focus on minimizing precomputation (not just having r short but also using it directly in the main computation), as I emphasized in the lecture.

Today's lecture topics:

- CPU-specific message-authentication codes?
- MMX instructions
- Can use for big integers; or use directly?
- Not terribly fast
- SSE2 instructions
- Timings: UMAC-P4-128-AES vs. Poly1305-AES
- P4 sender+P4 receiver could dynamically select UMAC-P4
- UMAC seems obsolete but general issue remains

- Introduction to stream ciphers
- Salsa20 expands 256 bits into 512N bits, gigantic N
- Indistinguishability conjecture
- Brute-force attack
- Combining Salsa20, Poly1305
- Security conjecture
- Indistinguishability implies security
- Improved channel-conversion picture

- Salsa20 speed
- Cycles per output byte on various CPUs
- Comparison to Salsa20/8 and to AES
- Fast random access to output
- Generating Poly1305's s_n on the fly
- Can use r_n; free; stops multiple forgeries
- Review of receiver's actions
- Speed of Salsa20 is much more important for encryption
- Encrypt-then-authenticate to avoid denial of service

- Salsa20 pictures
- Embedding k, n into 4x4 array of 32-bit words
- Modifying one word
- Modifying word in each column
- Modifying another word in each column
- Modifying another word in each column
- Modifying last word in each column
- Modifying rows similarly
- Continuing through 20 rounds
- Adding final 4x4 array to original 4x4 array
- Difference in initial array for second block
- Difference after 1 round
- Difference after 2 rounds
- Difference after 3 rounds
- Difference after 4 rounds
- Diffusion pictures for 256 blocks

- Salsa20 attacks
- Brute force on one CPU
- Brute force on many CPUs, many ASICs
- Multiple-target attack
- Better attacks on Salsa20 reduced to 5 or 6 rounds
- Idea of differential attacks

- What's new on the course web page

Here's what I said about cycles/byte to authenticate a 1024-byte message, including a slight slowdown from computing AES:

CPU | Poly1305-AES | UMAC-P4-128-AES |
---|---|---|
PowerPC G4 7410 | 8.27 | 21.72 |
UltraSPARC III | 5.47 | 51.06 |
Pentium 4 | 5.33 | 3.12 |
Pentium M | 4.50 | 8.48 |
Athlon | 3.75 | 7.38 |
As for Salsa20, the best source of information is my Salsa20 page. The leading source of information about stream ciphers generally is eSTREAM (the ECRYPT Stream Cipher Project), particularly eSTREAM's collection of design papers and cryptanalysis papers; I have a web page with links and further comments.

Week | Day | Date | Time |
---|---|---|---|
3 | Thu | 2006.11.09 | 11:00-12:30, 15:30-17:00 |
3 | Fri | 2006.11.10 | 11:00-12:30 |
4 | Mon | 2006.11.13 | no lectures (students work on course projects) |
4 | Tue | 2006.11.14 | no lectures (students work on course projects) |
4 | Wed | 2006.11.15 | no lectures (students work on course projects) |
4 | Thu | 2006.11.16 | no lectures (students work on course projects) |
4 | Fri | 2006.11.17 | 11:00-12:30, 15:30-17:00 (project reports and discussion) |

Today's lecture topics:

- Introduction to stream-cipher design
- Mathematicians find patterns
- Inversion examples: factoring et al.
- Mathematicians focus on *easy* functions; *most* functions have no patterns
- Conjectured compute-cost-vs-distinguish-cost graph

- Non-controversial design issues
- Must have nonlinear operations
- Many reasonable selections of operations
- 32 bits vs. 64 bits: advantages, disadvantages
- Every input bit must affect every output bit
- Computation graph must be heavily connected
- Example of bad connectivity: simplified Snefru
- MD5 graph (on computer)
- SHA-1 graph (on computer)
- SHA-256 graph (on computer)
- TEA graph (on computer)
- Salsa20 graph (on computer)
- Many reasonable graphs; consider parallelism, register use
- Break other symmetries

- Controversial design issues
- Multiplication?
- Selection?
- S-box lookups?
- Secret array indices leak through timing
- AES software leaks AES keys
- Plugs are horribly fragile

- Last words on stream ciphers and authenticated encryption
- Many, many, many alternatives

- Introduction to Diffie-Hellman
- Share a new secret through a public channel
- Channel-conversion picture
- Diffie-Hellman data flow
- Public key is reused, so major cost is computing shared secret
- Combining Curve25519 with Salsa20+Poly1305
- Exponentiation mod 2^255-19
- Performance of index calculus
- High-level definition of Curve25519
- Fastest attack known
- Some cycle counts; comparison to Salsa20+Poly1305
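
As a warm-up for Curve25519, the Diffie-Hellman data flow can be shown in the classic multiplicative group modulo a prime. The prime and secrets below are illustrative toys chosen for readability, not security; Curve25519 replaces `pow(g, a, p)` with elliptic-curve scalar multiplication, and the shared secret then keys Salsa20+Poly1305.

```python
p = 2**127 - 1      # a Mersenne prime -- far too small for real use
g = 3

alice_secret = 31415926535897932384626433832795028841
bob_secret = 27182818284590452353602874713526624977

# Each public key is computed once and then reused for many partners,
# so the dominant per-partner cost is computing the shared secret.
alice_public = pow(g, alice_secret, p)
bob_public = pow(g, bob_secret, p)

# Each side combines its own secret with the other side's public key;
# both arrive at g^(alice_secret * bob_secret) mod p.
alice_shared = pow(bob_public, alice_secret, p)
bob_shared = pow(alice_public, bob_secret, p)
```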

- Curve25519 formulas
- The field k; the elliptic curve E(k)
- X-coordinate functions; Curve25519
- Montgomery's 2-isogeny formulas
- Montgomery's formulas as recursion for X(nQ)
- 510 divisions in Z/(2^255-19)

- Curve25519 computation
- Recap of inputs, outputs
- Recap of Montgomery's recursion
- Division by Fermat's little theorem
- Better: Euclid's algorithm
- Better: Delay divisions
- Montgomery's numerators, denominators
- Can view this as the definition of Curve25519
- Preliminary operation count
- Montgomery's optimizations
- Montgomery's computation graph
- Montgomery's operation count
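
Montgomery's recursion can be sketched directly with Python bignums, leaving the limb-level field arithmetic to the next section. Divisions are delayed: everything is tracked as a fraction X/Z, with one inversion at the very end by Fermat's little theorem. The conditional swaps are written as plain branches here, whereas real implementations swap in constant time.

```python
P = 2**255 - 19
A24 = 121665                 # (486662 - 2)/4, from Montgomery's formulas

def x25519_ladder(n, x1):
    """Compute X(n*Q) from X(Q) = x1 via Montgomery's recursion."""
    x1 %= P
    x2, z2 = 1, 0            # X/Z for the "point at infinity"
    x3, z3 = x1, 1
    for i in reversed(range(255)):
        bit = (n >> i) & 1
        if bit:              # branch for clarity; real code avoids this
            x2, x3, z2, z3 = x3, x2, z3, z2
        a, b = (x2 + z2) % P, (x2 - z2) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        da, cb = d * a % P, c * b % P
        s = (da + cb) % P
        t = (da - cb) % P
        x3 = s * s % P                   # differential addition ...
        z3 = x1 * t % P * t % P          # ... using the base X-coordinate
        x2 = aa * bb % P                 # doubling
        z2 = e * (aa + A24 * e) % P
        if bit:
            x2, x3, z2, z3 = x3, x2, z3, z2
    return x2 * pow(z2, P - 2, P) % P    # the one delayed division
```

A quick self-check is the Diffie-Hellman property: two parties starting from the base X-coordinate 9 reach the same shared X-coordinate.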

- Curve25519 field arithmetic
- Comparing 2^255-19 to 2^130-5
- 177 floating-point additions for multiplication
- 48 floating-point additions for multiplication by 121665
- Squaring speedups
- Tally of UltraSPARC costs for Curve25519
- Basic Pentium M advantage, disadvantage
- Tally of Pentium M costs for Curve25519
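
The reason for primes of the shape 2^255-19 (and Poly1305's 2^130-5) is that reduction is cheap: 2^255 ≡ 19 modulo p, so the high bits of a product fold down into the low bits after a multiplication by 19. A bignum-level sketch; real implementations do the same fold limb by limb:

```python
P = 2**255 - 19

def reduce_25519(x):
    """Reduce 0 <= x < 2^510 (a double-length product) modulo 2^255 - 19,
    by repeatedly folding the bits above position 255 down, times 19."""
    while x >= 2**255:
        x = (x & (2**255 - 1)) + 19 * (x >> 255)
    return x - P if x >= P else x
```

Two or three folds suffice even for a full 510-bit product, which is what makes multiplication modulo such primes barely more expensive than plain multiplication.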

- Alternatives to Curve25519
- Security constraints on number of points
- y^2 = x^3 - 3x + ...? Slower
- Other curve shapes, notably genus-2 hyperelliptic curves
- Other characteristics? Slower, maybe less secure
- XTR? Slower
- RSA public-key encryption? Slower

- Introduction to public-key signatures
- Protect a public message sent to *many* receivers
- Channel-conversion picture
- Sender computations, receiver computations
- Data-flow diagram
- Cost evaluation; comparison to Curve25519
- Extra efficiency features
- Will focus on systems with fastest verification

- Building a signature system
- Critical RSA features
- 1977 Rivest Shamir Adleman: roots mod pq, large exponent
- 1977 RSA was a security disaster
- 1979 Rabin: add hashing; reasonable security conjecture
- Currently recommended pq size: 2048 bits
- 1979 Rabin: use *small* e, preferably e=2; much faster verification

- Building a signature system, continued
- Interlude: elliptic-curve version of 1989 Schnorr
- 1979 Rabin: randomize signatures
- 1980 Williams: {1,-1,2,-2}
- Recommendations regarding randomness
- Generating keys to have matching top 1/2
- Generating keys to have matching top 2/3
- Compressing keys to 1/3

- Verification speed
- Square, subtract, check divisibility
- Division with remainder via two multiplications
- Using compression to speed up reciprocal
- More speedups, including expansion
- Probabilistic verification of expanded signatures
- Cost estimate for verifying expanded signatures
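
The verification pattern, and "division with remainder via two multiplications," can be sketched as follows. The modulus and message hash below are fabricated toys, and the valid signature is built backwards by choosing s first; the point is only to exercise the arithmetic.

```python
def make_reducer(n, max_bits):
    """Return a function computing x mod n for 0 <= x < 2**max_bits,
    using a precomputed reciprocal of n: one multiplication estimates
    the quotient, one produces the remainder, and the estimate is off
    by at most a couple of corrective subtractions."""
    recip = (1 << max_bits) // n
    def mod_n(x):
        q = (x * recip) >> max_bits    # multiplication 1: quotient estimate
        r = x - q * n                  # multiplication 2: remainder estimate
        while r >= n:
            r -= n
        return r
    return mod_n

def rabin_verify(n, h, s, mod_n):
    """Rabin-style verification with e = 2: square the signature,
    subtract the hash, check divisibility by n."""
    return mod_n(s * s) == h
```

With e=2, verification is a single modular squaring; that is the speed advantage over large RSA exponents that the course keeps returning to.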

- Signature compression
- Compressing a signature via continued fractions
- Uncompressing a signature
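
The compression step can be sketched with the extended Euclidean algorithm: running Euclid on (n, s) until the remainder drops below sqrt(n) produces u ≡ v·s (mod n) with u and |v| both around sqrt(n), so the pair (u, v) takes roughly half the space of s, and uncompression recovers s = u/v mod n. A toy sketch of the mechanism; the precise bounds and encodings in the lecture's construction may differ.

```python
from math import isqrt

def compress(s, n):
    """Find (u, v) with u == v*s (mod n) and u, |v| roughly sqrt(n),
    via the extended Euclidean algorithm on (n, s)."""
    r0, r1 = n, s % n
    t0, t1 = 0, 1
    bound = isqrt(n)
    while r1 > bound:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
    return r1, t1        # invariant: r1 == t1*s (mod n)

def uncompress(u, v, n):
    """Recover s = u / v mod n."""
    return u * pow(v % n, -1, n) % n
```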

- Closing comments

Here's a challenge to cryptographers. Right now, attackers all over the Internet can easily forge packets that appear to come from Google; the challenge is to stop these forgeries, guaranteeing that the Google results on the users' screens really are from Google. Perhaps the most obvious approach (to the cryptographic side of the problem; we also need to fix browser buffer overflows and other relevant software bugs) is to have Google add a signature to each Google-results page. This costs

- total for the Google users: R signature verifications, where R is the total number of downloads of Google-results pages;
- for Google: W signatures, where W is the number of *different* Google-results pages (or, more precisely, the number of Google-results pages that Google doesn't cache).

An alternative is to skip signatures and instead protect each page with Salsa20+Poly1305, using Curve25519 to share a secret key with each user. This costs

- total for the Google users: U evaluations of Curve25519, where U is the number of users, plus R applications of Salsa20+Poly1305;
- for Google: U evaluations of Curve25519, plus R evaluations of Salsa20+Poly1305.

To try the latest qhasm snapshot:

    wget https://cr.yp.to/qhasm/qhasm-20061116.tar.gz
    gunzip < qhasm-20061116.tar.gz | tar -xf -
    cd qhasm-20061116
    ./do

qhasm still needs lots of testing but can at least handle an AMD64 XMM version of the Salsa20 software, setting new speed records for Salsa20 on the Intel Core 2.