D. J. Bernstein
Authenticators and signatures
nistp224

Timings

Here are some typical cycle counts (for typical exponents, not worst-case exponents), with all data in cache.
Version Compressed
(28-byte keys)
Uncompressed
(56-byte keys)
Chip MHz Compiler
0.75 678633 595537 Athlon (642) 850 gcc 2.95.3
0.70 747418 644972 Athlon (642) 850 gcc 2.95.3
0.75 785900 668566 UltraSPARC II 168 egcs 2.91.66
0.75 823862 724776 Pentium III (672) 449 egcs 2.91.66
0.75 826955 726922 Pentium III (686) 853 gcc 2.95.3
0.75 835530 734731 Pentium II (652) 401 gcc 2.8.1
0.70 838988 716298 UltraSPARC II 296 egcs 2.91.66
0.70 854587 710991 UltraSPARC II 168 egcs 2.91.66
0.70 913905 798360 Pentium III (672) 449 egcs 2.91.66
0.70 916561 798847 Pentium III (686) 853 gcc 2.95.3
0.70 926777 809653 Pentium Pro (619) 180 gcc 2.95.3
0.70 928230 806682 Pentium II (652) 401 gcc 2.8.1
0.75 943244 827360 Pentium 4 (f05) 1406 gcc 2.95.3
0.70 1080160 933272 Pentium 4 (f05) 1406 gcc 2.95.3
0.75 1120824 985097 Pentium (525) 90 egcs 2.91.66
0.75 1166080 1019027 PowerPC RS64-III 668 gcc 3.0.1
0.70 1332077 1164829 Pentium (525) 90 gcc 2.8.1
0.70 1334910 1146536 Pentium (525) 90 egcs 2.91.66
0.75 1355344 1170368 PowerPC 7410 533 gcc 2.95.2
0.70 1355530 1187282 Pentium (52c) 133 gcc 2.8.1
0.70 2453408 2133621 K6-2 (58c) 501 gcc 2.95.2
0.70 ~4100000 ~3600000 SuperSPARC 50 gcc 2.95.2
0.65 ~4300000 ~3800000 486 33 egcs 2.91.66
nistp224 is subject to a continuing series of performance improvements. Please report the nistp224 version number in any comments on nistp224.

Understanding the output of compile/speed

From fastest to slowest: The next number on each line is the time taken by one function call. The speed program tries to flush the L1 instruction cache and L1 data cache before each line, so this time includes cache penalties.

The 31 remaining numbers on each line are times for 31 more function calls in quick succession. Variations in the 56 and 28 lines reflect the number of nibbles equal to 0 in different exponents. Much smaller variations in the 56high and 28high lines reflect branch-prediction penalties, instruction-reordering penalties, etc. Occasional jumps in timing reflect interrupts.

Times are expressed in ticks. There are several possibilities for ticks, from most informative to least informative:

How much can the times be improved?

The following table shows the minimum possible number of cycles to perform floating-point operations on various chips. + means addition; +* means a fused multiplication-addition; +,* means an independent addition and multiplication. 53 and 64 refer to mantissa lengths; the non-x86 chips in the table don't have hardware support for 64. The fractional numbers for the PowerPC reflect the fact that the PowerPC stalls after several consecutive floating-point operations.
53 + 53 +* 53 +,* 64 + 64 +* 64 +,*
Athlon 111111
Pentium 4 11 (SSE2)1 (SSE2)122
Pentium Pro/II/III 122122
Pentium 122122
Alpha 111
UltraSPARC 111
PowerPC 7450 1.251.252.50
Except for the original Pentium, all of these chips can overlap floating-point loads and stores with floating-point additions and multiplications. The goal of floating-point scheduling is to reduce the number of clock cycles for a computation to the number of clock cycles used by floating-point operations.

Most of the time in my 28high and 56high computations is spent in thousands of calls to the core f2, f2g, fg, fgh, fg8h2, and fghi functions:
f2 f2g fg fgh fg8h2 fghi
28high 1015 511 1031 119 225 59
56high 799 511 882 120 225 59
(For sparc and powerpc, 28high involves two extra calls to f2 and one extra call to f2g.) The number of calls could be reduced slightly.

The core functions each involve hundreds of floating-point additions and multiplications:
ppro 64 pentium 64 sparc 53 powerpc 53
f2 +/+*/+,*63/0/5263/0/5260/0/10060/80/20
f2g +/+*/+,*71/0/5263/0/5260/0/10060/80/20
fg +/+*/+,*56/0/8056/0/8049/0/16649/139/27
fgh +/+*/+,*64/0/8056/0/8049/0/16649/139/27
fg8h2 +/+*/+,*64/0/11648/0/14437/0/31037/273/37
fghi +/+*/+,*48/0/14448/0/14437/0/31037/273/37
For example, fgh under powerpc has 139 fused multiplication-additions, 27 more multiplications, and 27+49=76 more additions; these operations require 1.25*(49+139)+2.50*27=302.5 PowerPC-7450 cycles. fgh under sparc has 166 multiplications and 166+49=215 additions; these operations require 215 UltraSPARC cycles. fgh under ppro has 80 multiplications and 80+64=144 additions; these operations require 224 Pentium-III cycles, or 144 Athlon cycles. (ppro performs more additions than pentium here to reduce the fgh latency.)

The core functions involve fewer floating-point operations under ppro and pentium than under sparc and powerpc, because 64-bit operations do more work than 53-bit operations.

pentium, sparc, and powerpc handle f2, fg, and fg8h2 using f2g, fgh, and fghi. I could save about 4% of the floating-point operations by implementing f2, fg, and fg8h2 separately. This would be a bad idea on the original Pentium, which has an 8K instruction cache, and it would probably be a bad idea on the UltraSPARC, which has a 16K instruction cache and a verbose instruction encoding. It would be a good idea on the Pentium MMX and on the PowerPC.

Athlon. The 56high target is 343798 cycles for 343798 additions in the core functions. I'm currently using about 605000 cycles.

I've started optimizing code for this chip. Cycle counts for the core functions in version 0.75 (including function call and timing overhead) are 183, 192, 252, 263, 328, and 340, including 115, 123, 136, 144, 180, and 192 additions. Cycle counts for the development version are 173, 178, 207, 215, 287, and 297, including 115, 123, 136, 144, 188, and 200 additions; about a 1.14x improvement overall.

UltraSPARC. The target is 523578 cycles for 523578 additions. I'm currently using about 678000 cycles.

Pentium Pro/II/III. The target is 526674 cycles for 182876 multiplications and 343798 additions. I'm currently using about 736000 cycles.

Pentium 4. The target is 526674 cycles, as on the Pentium Pro/II/III. The operation latencies are much worse, so it's not a surprise that I'm currently using about 839000 cycles. I don't know whether SSE2 would be faster.

Pentium. The current code is pretty damn good. According to my scheduling tools, the core functions spend only 2.5% of their time in stalls. It might be possible to remove a few more loads and stores.

RS64-III. The target is, if I understand this chip correctly, 734175 cycles. I'm currently using about 1050000 cycles.

More than 100000 cycles are being wasted by the PowerPC callee-save convention. This problem is fairly tricky to work around by hand, and fairly easy to fix in the compiler. I've complained to the gcc authors.

Better scheduling might replace 9, 16, and 16 +,*'s with +*'s in f2g, fgh, and fghi respectively. Also, as noted above, current PowerPC chips have a big enough cache for separate implementations of f2, fg, and fg8h2. These changes would reduce the target.

Other people's results at a 224-bit security level

All these results are for typical exponents, uncompressed points. (Compressed is not necessarily slower; for example, it is the same speed in the binary case.) Many of these numbers are copied from a survey by Lopez and Dahab. In a few cases I don't know whether the curves and primes were chosen randomly.

1.2 million Pentium-II cycles by Brown, Hankerson, Hernandez, and Menezes (2000) for NIST P-224, a random curve over a field of size 2^224-2^96+1. (The same code takes 1.8 million Pentium cycles, or 2.7 million Pentium-4 cycles.)

5.1 million UltraSPARC cycles by Cohen, Miyaji, and Ono (1998) for a random curve over a field of size 2^224-1025.

7.7 million UltraSPARC cycles by Lopez and Dahab (1999) for a random curve over a field of size 2^239.

1.2 million Pentium-II cycles by Hankerson, Hernandez, and Menezes (2000) for NIST K-233, a structured curve over a field of size 2^233.

Other people's results at lower security levels

0.78 million Pentium-II cycles by Brown, Hankerson, Hernandez, and Menezes (2000) for NIST P-192, a random curve over a field of size 2^192-2^64-1.

3.6 million UltraSPARC cycles by Cohen, Miyaji, and Ono (1998) for a random curve over a field of size 2^192-3345.

4.2 million Pentium Pro cycles by De Win, Bosselaers, Vanderberghe, De Gersem, and Vandewalle (1998) for a random curve over a field of prime size near 2^192.

10 million Pentium Pro cycles by De Win, Bosselaers, Vanderberghe, De Gersem, and Vandewalle (1998) for a random curve over a field of size 2^191.

4.8 million UltraSPARC cycles by Lopez and Dahab (1999) for a random curve over a field of size 2^191.

0.78 million Pentium-II cycles by Kobayashi, Morita, Kobayashi, and Horito (1999) for a curve over a field of size (2^31-1)^6.

3.1 million Pentium cycles by Bailey and Paar (1999) for a random curve over a field of size (2^31-1)^6.

0.50 million Alpha-21264 cycles by Kobayashi, Morita, Kobayashi, and Horito (1999) for a curve over a field of size (2^61-1)^3.

0.65 million Alpha-21264 cycles by Bailey and Paar (1999) for a random curve over a field of size (2^61-1)^3.

12.8 million Pentium cycles by De Win, Mister, Preneel, and Wiener (1996) for a random curve over a field of size 2^177.

9.6 million Pentium cycles by De Win, Mister, Preneel, and Wiener (1996) for a random curve over a field of size 2^176. Note that 176 has a factor of 16.

7.5 million Pentium-II cycles by Aydos, Savas, and Koc (1999) for a random curve over a field of size 2^176.

11.8 million Alpha cycles by Guajardo and Paar (1998) for a curve over a field of size 2^176.

0.61 million Pentium-III cycles by Harley (2001) for a random curve over a field of size 2^163.

4.1 million UltraSPARC cycles by Lopez and Dahab (1999) for a random curve over a field of size 2^163.

2.3 million UltraSPARC cycles by Cohen, Miyaji, and Ono (1998) for a random curve over a field of size 2^160-2933.

0.58 million Pentium-II cycles by Hankerson, Hernandez, and Menezes (2000) for NIST K-163, a structured curve over a field of size 2^163.

1.2 million UltraSPARC cycles by Certicom (1998) for a curve over a field of size 2^163, perhaps NIST K-163. (This is the most recent Certicom timing I've found, from a Security Builder 1.2 advertisement. Certicom appears to have stopped publishing any concrete timings.)