Authenticators and signatures

nistp224

Version | Compressed (28-byte keys) |
Uncompressed (56-byte keys) |
Chip | MHz | Compiler |
---|---|---|---|---|---|

0.75 | 678633 | 595537 | Athlon (642) | 850 | gcc 2.95.3 |

0.70 | 747418 | 644972 | Athlon (642) | 850 | gcc 2.95.3 |

0.75 | 785900 | 668566 | UltraSPARC II | 168 | egcs 2.91.66 |

0.75 | 823862 | 724776 | Pentium III (672) | 449 | egcs 2.91.66 |

0.75 | 826955 | 726922 | Pentium III (686) | 853 | gcc 2.95.3 |

0.75 | 835530 | 734731 | Pentium II (652) | 401 | gcc 2.8.1 |

0.70 | 838988 | 716298 | UltraSPARC II | 296 | egcs 2.91.66 |

0.70 | 854587 | 710991 | UltraSPARC II | 168 | egcs 2.91.66 |

0.70 | 913905 | 798360 | Pentium III (672) | 449 | egcs 2.91.66 |

0.70 | 916561 | 798847 | Pentium III (686) | 853 | gcc 2.95.3 |

0.70 | 926777 | 809653 | Pentium Pro (619) | 180 | gcc 2.95.3 |

0.70 | 928230 | 806682 | Pentium II (652) | 401 | gcc 2.8.1 |

0.75 | 943244 | 827360 | Pentium 4 (f05) | 1406 | gcc 2.95.3 |

0.70 | 1080160 | 933272 | Pentium 4 (f05) | 1406 | gcc 2.95.3 |

0.75 | 1120824 | 985097 | Pentium (525) | 90 | egcs 2.91.66 |

0.75 | 1166080 | 1019027 | PowerPC RS64-III | 668 | gcc 3.0.1 |

0.70 | 1332077 | 1164829 | Pentium (525) | 90 | gcc 2.8.1 |

0.70 | 1334910 | 1146536 | Pentium (525) | 90 | egcs 2.91.66 |

0.75 | 1355344 | 1170368 | PowerPC 7410 | 533 | gcc 2.95.2 |

0.70 | 1355530 | 1187282 | Pentium (52c) | 133 | gcc 2.8.1 |

0.70 | 2453408 | 2133621 | K6-2 (58c) | 501 | gcc 2.95.2 |

0.70 | ~4100000 | ~3600000 | SuperSPARC | 50 | gcc 2.95.2 |

0.65 | ~4300000 | ~3800000 | 486 | 33 | egcs 2.91.66 |

`56`means uncompressed point, typical exponent.`56high`means uncompressed point, worst-case exponent.`28`means compressed point, typical exponent.`28high`means compressed point, worst-case exponent.

The 31 remaining numbers on each line are times for 31 more function calls
in quick succession.
Variations in the `56` and `28` lines
reflect the number of nibbles equal to 0 in different exponents.
Much smaller variations in the `56high` and `28high` lines
reflect branch-prediction penalties, instruction-reordering penalties, etc.
Occasional jumps in timing reflect interrupts.

Times are expressed in ticks. There are several possibilities for ticks, from most informative to least informative:

- On sensible computers, one tick is one clock cycle.
- On the PowerPC, one tick is n clock cycles. where n is a reasonably small number, such as 16. You can convert times to clock cycles with high accuracy if you know n.
- On some computers, one tick is one nanosecond, and times are based directly on a cycle counter. You can convert times to clock cycles with high accuracy if you know the computer's clock speed; unfortunately, many manufacturers round their clock speeds to nice nearby numbers.
- On other computers, one tick is one nanosecond, and times are based on glacial one-microsecond clocks.

53 + | 53 +* | 53 +,* | 64 + | 64 +* | 64 +,* | |
---|---|---|---|---|---|---|

Athlon | 1 | 1 | 1 | 1 | 1 | 1 |

Pentium 4 | 1 | 1 (SSE2) | 1 (SSE2) | 1 | 2 | 2 |

Pentium Pro/II/III | 1 | 2 | 2 | 1 | 2 | 2 |

Pentium | 1 | 2 | 2 | 1 | 2 | 2 |

Alpha | 1 | 1 | 1 | |||

UltraSPARC | 1 | 1 | 1 | |||

PowerPC 7450 | 1.25 | 1.25 | 2.50 |

Most of the time in my `28high` and `56high` computations
is spent in thousands of calls
to the core `f2`, `f2g`, `fg`, `fgh`,
`fg8h2`, and `fghi` functions:

f2 | f2g | fg | fgh | fg8h2 | fghi | |
---|---|---|---|---|---|---|

28high | 1015 | 511 | 1031 | 119 | 225 | 59 |

56high | 799 | 511 | 882 | 120 | 225 | 59 |

The core functions each involve hundreds of floating-point additions and multiplications:

ppro 64 | pentium 64 | sparc 53 | powerpc 53 | |
---|---|---|---|---|

f2 +/+*/+,* | 63/0/52 | 63/0/52 | 60/0/100 | 60/80/20 |

f2g +/+*/+,* | 71/0/52 | 63/0/52 | 60/0/100 | 60/80/20 |

fg +/+*/+,* | 56/0/80 | 56/0/80 | 49/0/166 | 49/139/27 |

fgh +/+*/+,* | 64/0/80 | 56/0/80 | 49/0/166 | 49/139/27 |

fg8h2 +/+*/+,* | 64/0/116 | 48/0/144 | 37/0/310 | 37/273/37 |

fghi +/+*/+,* | 48/0/144 | 48/0/144 | 37/0/310 | 37/273/37 |

The core functions
involve fewer floating-point operations
under `ppro` and `pentium`
than under `sparc` and `powerpc`,
because 64-bit operations do more work than 53-bit operations.

`pentium`, `sparc`, and `powerpc`
handle `f2`, `fg`, and `fg8h2`
using `f2g`, `fgh`, and `fghi`.
I could save about 4% of the floating-point operations
by implementing `f2`, `fg`, and `fg8h2` separately.
This would be a bad idea on the original Pentium,
which has an 8K instruction cache,
and it would probably be a bad idea on the UltraSPARC,
which has a 16K instruction cache and a verbose instruction encoding.
It would be a good idea on the Pentium MMX and on the PowerPC.

**Athlon.**
The `56high` target is 343798 cycles
for 343798 additions in the core functions.
I'm currently using about 605000 cycles.

I've started optimizing code for this chip. Cycle counts for the core functions in version 0.75 (including function call and timing overhead) are 183, 192, 252, 263, 328, and 340, including 115, 123, 136, 144, 180, and 192 additions. Cycle counts for the development version are 173, 178, 207, 215, 287, and 297, including 115, 123, 136, 144, 188, and 200 additions; about a 1.14x improvement overall.

**UltraSPARC.**
The target is 523578 cycles for 523578 additions.
I'm currently using about 678000 cycles.

**Pentium Pro/II/III.**
The target is 526674 cycles
for 182876 multiplications and 343798 additions.
I'm currently using about 736000 cycles.

**Pentium 4.**
The target is 526674 cycles, as on the Pentium Pro/II/III.
The operation latencies are much worse,
so it's not a surprise that I'm currently using about 839000 cycles.
I don't know whether SSE2 would be faster.

**Pentium.**
The current code is pretty damn good.
According to my scheduling tools,
the core functions spend only 2.5% of their time in stalls.
It might be possible to remove a few more loads and stores.

**RS64-III.**
The target is, if I understand this chip correctly, 734175 cycles.
I'm currently using about 1050000 cycles.

More than 100000 cycles are being wasted by the PowerPC callee-save convention. This problem is fairly tricky to work around by hand, and fairly easy to fix in the compiler. I've complained to the gcc authors.

Better scheduling might replace 9, 16, and 16 +,*'s with +*'s
in `f2g`, `fgh`, and `fghi` respectively.
Also, as noted above, current PowerPC chips have a big enough cache
for separate implementations of `f2`, `fg`, and `fg8h2`.
These changes would reduce the target.

1.2 million Pentium-II cycles by Brown, Hankerson, Hernandez, and Menezes (2000) for NIST P-224, a random curve over a field of size 2^224-2^96+1. (The same code takes 1.8 million Pentium cycles, or 2.7 million Pentium-4 cycles.)

5.1 million UltraSPARC cycles by Cohen, Miyaji, and Ono (1998) for a random curve over a field of size 2^224-1025.

7.7 million UltraSPARC cycles by Lopez and Dahab (1999) for a random curve over a field of size 2^239.

1.2 million Pentium-II cycles by Hankerson, Hernandez, and Menezes (2000) for NIST K-233, a structured curve over a field of size 2^233.

3.6 million UltraSPARC cycles by Cohen, Miyaji, and Ono (1998) for a random curve over a field of size 2^192-3345.

4.2 million Pentium Pro cycles by De Win, Bosselaers, Vanderberghe, De Gersem, and Vandewalle (1998) for a random curve over a field of prime size near 2^192.

10 million Pentium Pro cycles by De Win, Bosselaers, Vanderberghe, De Gersem, and Vandewalle (1998) for a random curve over a field of size 2^191.

4.8 million UltraSPARC cycles by Lopez and Dahab (1999) for a random curve over a field of size 2^191.

0.78 million Pentium-II cycles by Kobayashi, Morita, Kobayashi, and Horito (1999) for a curve over a field of size (2^31-1)^6.

3.1 million Pentium cycles by Bailey and Paar (1999) for a random curve over a field of size (2^31-1)^6.

0.50 million Alpha-21264 cycles by Kobayashi, Morita, Kobayashi, and Horito (1999) for a curve over a field of size (2^61-1)^3.

0.65 million Alpha-21264 cycles by Bailey and Paar (1999) for a random curve over a field of size (2^61-1)^3.

12.8 million Pentium cycles by De Win, Mister, Preneel, and Wiener (1996) for a random curve over a field of size 2^177.

9.6 million Pentium cycles by De Win, Mister, Preneel, and Wiener (1996) for a random curve over a field of size 2^176. Note that 176 has a factor of 16.

7.5 million Pentium-II cycles by Aydos, Savas, and Koc (1999) for a random curve over a field of size 2^176.

11.8 million Alpha cycles by Guajardo and Paar (1998) for a curve over a field of size 2^176.

0.61 million Pentium-III cycles by Harley (2001) for a random curve over a field of size 2^163.

4.1 million UltraSPARC cycles by Lopez and Dahab (1999) for a random curve over a field of size 2^163.

2.3 million UltraSPARC cycles by Cohen, Miyaji, and Ono (1998) for a random curve over a field of size 2^160-2933.

0.58 million Pentium-II cycles by Hankerson, Hernandez, and Menezes (2000) for NIST K-163, a structured curve over a field of size 2^163.

1.2 million UltraSPARC cycles by Certicom (1998) for a curve over a field of size 2^163, perhaps NIST K-163. (This is the most recent Certicom timing I've found, from a Security Builder 1.2 advertisement. Certicom appears to have stopped publishing any concrete timings.)