Zmodexp 0.51, compiled with egcs 2.90.29 -O1 -fofp -malign-double -mpentiumpro -fschedule-insns -fschedule-insns2, can compute any 512-bit power modulo any 512-bit integer in 1627698 Pentium-II cycles. (In other words, 4.65 milliseconds on a Pentium II-350. This is faster than Rainbow's $2000 CryptoSwift hardware.) I'm not aware of any other library that takes fewer than 3000000 cycles.
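For readers who want to see exactly which operation is being timed, here is a minimal Python sketch of a 512-bit modular exponentiation and of the cycles-to-milliseconds conversion. It is not Zmodexp's code and says nothing about how Zmodexp works internally: it uses the textbook binary square-and-multiply method on Python integers, and the names modexp and cycles_to_ms are made up for this illustration.

```python
import random


def modexp(base, exponent, modulus):
    """Return base**exponent mod modulus by binary square-and-multiply.

    Textbook method, shown only to pin down the operation being
    benchmarked; it is an assumption-free stand-in, not Zmodexp's method.
    """
    if modulus <= 0 or exponent < 0:
        raise ValueError("need modulus > 0 and exponent >= 0")
    if modulus == 1:
        return 0
    result = 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:                      # low bit set: multiply it in
            result = (result * base) % modulus
        base = (base * base) % modulus        # square for the next bit
        exponent >>= 1
    return result


def cycles_to_ms(cycles, clock_mhz):
    """Convert a cycle count to milliseconds at the given clock rate."""
    # clock_mhz * 1000 is the number of cycles per millisecond.
    return cycles / (clock_mhz * 1000.0)


if __name__ == "__main__":
    rng = random.Random(0)                    # fixed seed, repeatable run
    n = rng.getrandbits(512) | (1 << 511)     # force a full 512-bit modulus
    x = rng.getrandbits(512) % n
    e = rng.getrandbits(512)
    assert modexp(x, e, n) == pow(x, e, n)    # agrees with Python's built-in
    print("%.2f ms" % cycles_to_ms(1627698, 350))
```

The built-in three-argument pow does the same job and is what you would normally call in Python; the explicit loop is only there to show the roughly 512 squarings and, on average, 256 multiplications that any such computation has to organize.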
Most libraries are much slower on the original Pentium than on the Pentium II. Zmodexp is not. Zmodexp 0.51 can compute any 512-bit power modulo any 512-bit integer in 1819000 Pentium cycles. Zmodexp will provide excellent performance on any modern CPU.
I expect Zmodexp to change the way people implement some common cryptographic tools, notably public-key signatures. However, Zmodexp 0.51 is not ready for integration into other programs: it relies on some seat-of-the-pants numerical analysis that has not yet been mathematically verified; it doesn't support any sizes other than 512 bits; it doesn't support non-x86 chips; and it isn't fully optimized. If you're not interested in the details of how fast arithmetic works, then you should probably wait for the next release.