D. J. Bernstein
Authenticators and signatures
nistp224

Timings

Here are some typical cycle counts (for typical exponents, not worst-case exponents), with all data in cache.

Version	Compressed (28-byte keys)	Uncompressed (56-byte keys)	Chip	MHz	Compiler
0.75	678633	595537	Athlon (642)	850	gcc 2.95.3
0.70	747418	644972	Athlon (642)	850	gcc 2.95.3
0.75	785900	668566	UltraSPARC II	168	egcs 2.91.66
0.75	823862	724776	Pentium III (672)	449	egcs 2.91.66
0.75	826955	726922	Pentium III (686)	853	gcc 2.95.3
0.75	835530	734731	Pentium II (652)	401	gcc 2.8.1
0.70	838988	716298	UltraSPARC II	296	egcs 2.91.66
0.70	854587	710991	UltraSPARC II	168	egcs 2.91.66
0.70	913905	798360	Pentium III (672)	449	egcs 2.91.66
0.70	916561	798847	Pentium III (686)	853	gcc 2.95.3
0.70	926777	809653	Pentium Pro (619)	180	gcc 2.95.3
0.70	928230	806682	Pentium II (652)	401	gcc 2.8.1
0.75	943244	827360	Pentium 4 (f05)	1406	gcc 2.95.3
0.70	1080160	933272	Pentium 4 (f05)	1406	gcc 2.95.3
0.75	1120824	985097	Pentium (525)	90	egcs 2.91.66
0.75	1166080	1019027	PowerPC RS64-III	668	gcc 3.0.1
0.70	1332077	1164829	Pentium (525)	90	gcc 2.8.1
0.70	1334910	1146536	Pentium (525)	90	egcs 2.91.66
0.75	1355344	1170368	PowerPC 7410	533	gcc 2.95.2
0.70	1355530	1187282	Pentium (52c)	133	gcc 2.8.1
0.70	2453408	2133621	K6-2 (58c)	501	gcc 2.95.2
0.70	~4100000	~3600000	SuperSPARC	50	gcc 2.95.2
0.65	~4300000	~3800000	486	33	egcs 2.91.66

nistp224 is subject to a continuing series of performance improvements. Please report the nistp224 version number in any comments on nistp224.

Understanding the output of `compile/speed`

From fastest to slowest:

56 means uncompressed point, typical exponent.
56high means uncompressed point, worst-case exponent.
28 means compressed point, typical exponent.
28high means compressed point, worst-case exponent.

The next number on each line is the time taken by one function call. The speed program tries to flush the L1 instruction cache and L1 data cache before each line, so this time includes cache penalties.

The 31 remaining numbers on each line are times for 31 more function calls in quick succession. Variations in the 56 and 28 lines reflect the number of nibbles equal to 0 in different exponents. Much smaller variations in the 56high and 28high lines reflect branch-prediction penalties, instruction-reordering penalties, etc. Occasional jumps in timing reflect interrupts.

Times are expressed in ticks. There are several possibilities for ticks, from most informative to least informative:

On sensible computers, one tick is one clock cycle.
On the PowerPC, one tick is n clock cycles. where n is a reasonably small number, such as 16. You can convert times to clock cycles with high accuracy if you know n.
On some computers, one tick is one nanosecond, and times are based directly on a cycle counter. You can convert times to clock cycles with high accuracy if you know the computer's clock speed; unfortunately, many manufacturers round their clock speeds to nice nearby numbers.
On other computers, one tick is one nanosecond, and times are based on glacial one-microsecond clocks.

How much can the times be improved?

The following table shows the minimum possible number of cycles to perform floating-point operations on various chips. + means addition; +* means a fused multiplication-addition; +,* means an independent addition and multiplication. 53 and 64 refer to mantissa lengths; the non-x86 chips in the table don't have hardware support for 64. The fractional numbers for the PowerPC reflect the fact that the PowerPC stalls after several consecutive floating-point operations.

	53 +	53 +*	53 +,*	64 +	64 +*	64 +,*
Athlon	1	1	1	1	1	1
Pentium 4	1	1 (SSE2)	1 (SSE2)	1	2	2
Pentium Pro/II/III	1	2	2	1	2	2
Pentium	1	2	2	1	2	2
Alpha	1	1	1
UltraSPARC	1	1	1
PowerPC 7450	1.25	1.25	2.50

Except for the original Pentium, all of these chips can overlap floating-point loads and stores with floating-point additions and multiplications. The goal of floating-point scheduling is to reduce the number of clock cycles for a computation to the number of clock cycles used by floating-point operations.

Most of the time in my 28high and 56high computations is spent in thousands of calls to the core f2, f2g, fg, fgh, fg8h2, and fghi functions:

f2 f2g fg fgh fg8h2 fghi

28high 1015 511 1031 119 225 59

56high 799 511 882 120 225 59

(For sparc and powerpc, 28high involves two extra calls to f2 and one extra call to f2g.) The number of calls could be reduced slightly.

	f2	f2g	fg	fgh	fg8h2	fghi
28high	1015	511	1031	119	225	59
56high	799	511	882	120	225	59

The core functions each involve hundreds of floating-point additions and multiplications:

ppro 64 pentium 64 sparc 53 powerpc 53

f2 +/+*/+,* 63/0/52 63/0/52 60/0/100 60/80/20

f2g +/+*/+,* 71/0/52 63/0/52 60/0/100 60/80/20

fg +/+*/+,* 56/0/80 56/0/80 49/0/166 49/139/27

fgh +/+*/+,* 64/0/80 56/0/80 49/0/166 49/139/27

fg8h2 +/+*/+,* 64/0/116 48/0/144 37/0/310 37/273/37

fghi +/+*/+,* 48/0/144 48/0/144 37/0/310 37/273/37

For example, fgh under powerpc has 139 fused multiplication-additions, 27 more multiplications, and 27+49=76 more additions; these operations require 1.25*(49+139)+2.50*27=302.5 PowerPC-7450 cycles. fgh under sparc has 166 multiplications and 166+49=215 additions; these operations require 215 UltraSPARC cycles. fgh under ppro has 80 multiplications and 80+64=144 additions; these operations require 224 Pentium-III cycles, or 144 Athlon cycles. (ppro performs more additions than pentium here to reduce the fgh latency.)

	ppro 64	pentium 64	sparc 53	powerpc 53
f2 +/+/+,	63/0/52	63/0/52	60/0/100	60/80/20
f2g +/+/+,	71/0/52	63/0/52	60/0/100	60/80/20
fg +/+/+,	56/0/80	56/0/80	49/0/166	49/139/27
fgh +/+/+,	64/0/80	56/0/80	49/0/166	49/139/27
fg8h2 +/+/+,	64/0/116	48/0/144	37/0/310	37/273/37
fghi +/+/+,	48/0/144	48/0/144	37/0/310	37/273/37

The core functions involve fewer floating-point operations under ppro and pentium than under sparc and powerpc, because 64-bit operations do more work than 53-bit operations.

pentium, sparc, and powerpc handle f2, fg, and fg8h2 using f2g, fgh, and fghi. I could save about 4% of the floating-point operations by implementing f2, fg, and fg8h2 separately. This would be a bad idea on the original Pentium, which has an 8K instruction cache, and it would probably be a bad idea on the UltraSPARC, which has a 16K instruction cache and a verbose instruction encoding. It would be a good idea on the Pentium MMX and on the PowerPC.

Athlon. The 56high target is 343798 cycles for 343798 additions in the core functions. I'm currently using about 605000 cycles.

I've started optimizing code for this chip. Cycle counts for the core functions in version 0.75 (including function call and timing overhead) are 183, 192, 252, 263, 328, and 340, including 115, 123, 136, 144, 180, and 192 additions. Cycle counts for the development version are 173, 178, 207, 215, 287, and 297, including 115, 123, 136, 144, 188, and 200 additions; about a 1.14x improvement overall.

UltraSPARC. The target is 523578 cycles for 523578 additions. I'm currently using about 678000 cycles.

Pentium Pro/II/III. The target is 526674 cycles for 182876 multiplications and 343798 additions. I'm currently using about 736000 cycles.

Pentium 4. The target is 526674 cycles, as on the Pentium Pro/II/III. The operation latencies are much worse, so it's not a surprise that I'm currently using about 839000 cycles. I don't know whether SSE2 would be faster.

Pentium. The current code is pretty damn good. According to my scheduling tools, the core functions spend only 2.5% of their time in stalls. It might be possible to remove a few more loads and stores.

RS64-III. The target is, if I understand this chip correctly, 734175 cycles. I'm currently using about 1050000 cycles.

More than 100000 cycles are being wasted by the PowerPC callee-save convention. This problem is fairly tricky to work around by hand, and fairly easy to fix in the compiler. I've complained to the gcc authors.

Better scheduling might replace 9, 16, and 16 +,*'s with +*'s in f2g, fgh, and fghi respectively. Also, as noted above, current PowerPC chips have a big enough cache for separate implementations of f2, fg, and fg8h2. These changes would reduce the target.

Other people's results at a 224-bit security level

All these results are for typical exponents, uncompressed points. (Compressed is not necessarily slower; for example, it is the same speed in the binary case.) Many of these numbers are copied from a survey by Lopez and Dahab. In a few cases I don't know whether the curves and primes were chosen randomly.

1.2 million Pentium-II cycles by Brown, Hankerson, Hernandez, and Menezes (2000) for NIST P-224, a random curve over a field of size 2^224-2^96+1. (The same code takes 1.8 million Pentium cycles, or 2.7 million Pentium-4 cycles.)

5.1 million UltraSPARC cycles by Cohen, Miyaji, and Ono (1998) for a random curve over a field of size 2^224-1025.

7.7 million UltraSPARC cycles by Lopez and Dahab (1999) for a random curve over a field of size 2^239.

1.2 million Pentium-II cycles by Hankerson, Hernandez, and Menezes (2000) for NIST K-233, a structured curve over a field of size 2^233.

Other people's results at lower security levels

0.78 million Pentium-II cycles by Brown, Hankerson, Hernandez, and Menezes (2000) for NIST P-192, a random curve over a field of size 2^192-2^64-1.

3.6 million UltraSPARC cycles by Cohen, Miyaji, and Ono (1998) for a random curve over a field of size 2^192-3345.

4.2 million Pentium Pro cycles by De Win, Bosselaers, Vanderberghe, De Gersem, and Vandewalle (1998) for a random curve over a field of prime size near 2^192.

10 million Pentium Pro cycles by De Win, Bosselaers, Vanderberghe, De Gersem, and Vandewalle (1998) for a random curve over a field of size 2^191.

4.8 million UltraSPARC cycles by Lopez and Dahab (1999) for a random curve over a field of size 2^191.

0.78 million Pentium-II cycles by Kobayashi, Morita, Kobayashi, and Horito (1999) for a curve over a field of size (2^31-1)^6.

3.1 million Pentium cycles by Bailey and Paar (1999) for a random curve over a field of size (2^31-1)^6.

0.50 million Alpha-21264 cycles by Kobayashi, Morita, Kobayashi, and Horito (1999) for a curve over a field of size (2^61-1)^3.

0.65 million Alpha-21264 cycles by Bailey and Paar (1999) for a random curve over a field of size (2^61-1)^3.

12.8 million Pentium cycles by De Win, Mister, Preneel, and Wiener (1996) for a random curve over a field of size 2^177.

9.6 million Pentium cycles by De Win, Mister, Preneel, and Wiener (1996) for a random curve over a field of size 2^176. Note that 176 has a factor of 16.

7.5 million Pentium-II cycles by Aydos, Savas, and Koc (1999) for a random curve over a field of size 2^176.

11.8 million Alpha cycles by Guajardo and Paar (1998) for a curve over a field of size 2^176.

0.61 million Pentium-III cycles by Harley (2001) for a random curve over a field of size 2^163.

4.1 million UltraSPARC cycles by Lopez and Dahab (1999) for a random curve over a field of size 2^163.

2.3 million UltraSPARC cycles by Cohen, Miyaji, and Ono (1998) for a random curve over a field of size 2^160-2933.

0.58 million Pentium-II cycles by Hankerson, Hernandez, and Menezes (2000) for NIST K-163, a structured curve over a field of size 2^163.

1.2 million UltraSPARC cycles by Certicom (1998) for a curve over a field of size 2^163, perhaps NIST K-163. (This is the most recent Certicom timing I've found, from a Security Builder 1.2 advertisement. Certicom appears to have stopped publishing any concrete timings.)

Timings

Understanding the output of compile/speed

How much can the times be improved?

Other people's results at a 224-bit security level

Other people's results at lower security levels

Understanding the output of `compile/speed`