D. J. Bernstein
Fast arithmetic
djbfft

Links

FFT libraries from CPU manufacturers

Intel's Performance Library Suite includes x86 FFT routines, shipped as Windows DLLs, in the Signal Processing Library and the Math Kernel Library. My guess is that these routines are carefully scheduled radix-2 routines; a 1024-point single-precision real FFT reportedly takes 56000 Pentium cycles or 42000 Pentium Pro cycles, about 1.3 times slower than djbfft. These libraries also include MMX and SSE routines (see Intel's Application Note 555), which are faster than djbfft for low-precision FFTs. (If you're from Intel, and you have more comprehensive benchmarks, please let me know.)
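To compare the reported Intel cycle counts with timings for other transform sizes, it can help to divide by N log2 N, a common size-independent measure of FFT cost. The following is a minimal sketch of that arithmetic, using only the two reported figures above. The helper name `cycles_per_nlogn` is hypothetical, and the choice of N log2 N as the normalizer is an assumption of this example, not something the benchmark source specifies.

```python
import math

def cycles_per_nlogn(cycles, n):
    """Divide a cycle count by n*log2(n) (hypothetical helper for illustration)."""
    return cycles / (n * math.log2(n))

N = 1024  # transform size from the reported benchmark

# Reported cycle counts for a 1024-point single-precision real FFT.
reported = {"Pentium": 56000, "Pentium Pro": 42000}

normalized = {cpu: cycles_per_nlogn(c, N) for cpu, c in reported.items()}

for cpu, value in normalized.items():
    print(f"{cpu}: {value:.2f} cycles per N*log2(N)")
```

This works out to roughly 5.5 cycles per N log2 N on the Pentium and roughly 4.1 on the Pentium Pro. The numbers are only as reliable as the reported cycle counts they come from.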

Sun's Performance Workshop Fortran includes UltraSPARC FFT routines in the Performance Library. This library is reportedly faster than FFTW for double precision but still not as fast as djbfft. Sun's mediaLib includes UltraSPARC VIS FFT routines, which are presumably faster than djbfft for low-precision FFTs. (If you're from Sun, and you have more comprehensive benchmarks, please let me know.)

Compaq's DIGITAL Extended Math Library includes Alpha FFT routines.

IBM's Engineering and Scientific Subroutine Library includes PowerPC FFT routines.

SGI's SGI/Cray Scientific Library includes MIPS R10000 and MIPS R12000 FFT routines. There also appear to be separate FFT routines as part of the SGI C library and the Cray C library, but I haven't found home pages for these libraries.

Other high-performance FFT libraries

The prime95 Mersenne-testing program by George Woltman includes a carefully scheduled radix-4 Pentium asm FFT routine. Woltman's routine is reportedly close to the speed of djbfft.

The FFTW authors have been working on pfftw, an asm imitation of djbfft.

I've heard about several unpublished asm FFT projects for various chips.