djbfft

Frequently asked questions

Can I use djbfft in my own code?

Yes. Please tell me what programs you're using it in so that I can let NSF know.
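Here's a minimal calling sketch. It assumes the fftc4 interface in the package: complex4 is a struct of two floats, re and im; fftc4_256 is the size-256 single-precision forward transform, with output in djbfft's permuted order; fftc4_un256 and fftc4_scale256 undo it. Check fftc4.h in your copy for the exact declarations.

     #include <fftc4.h>

     complex4 a[256];

     int main(void)
     {
       int i;

       for (i = 0;i < 256;++i) { a[i].re = i; a[i].im = 0; }

       fftc4_256(a);       /* forward transform; output order is permuted */
       fftc4_un256(a);     /* unscaled inverse of fftc4_256 */
       fftc4_scale256(a);  /* multiply by 1/256; a[] is back to the input */

       return 0;
     }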

Is djbfft faster than FFTW?

UltraSPARC-I, UltraSPARC-II: djbfft 0.76 has been tuned for these chips. It is substantially faster than FFTW 2. For example, djbfft computes a size-256 complex FFT in about 6300 cycles; FFTW needs more than 10000 cycles, according to the UltraSPARC-I/167 results reported by the FFTW authors on the FFTW web pages. (Web pages checked in September 1999.)

Pentium, Pentium MMX: djbfft 0.76 has been carefully tuned for these chips. It is more than twice as fast as FFTW 2.

Pentium Pro, Pentium II, Pentium III: djbfft 0.76 has been partially tuned for these chips. It is noticeably faster than FFTW 2.

PowerPC 603, PowerPC 604: djbfft 0.76 has not been tuned for these chips. Nevertheless it is as fast as FFTW 2.

Alpha 21264: djbfft 0.76 has not been tuned for this chip. Nevertheless it appears to be faster than FFTW 2. For example, on a 21264-500, djbfft 0.76 computes a size-256 complex FFT in under 6400 cycles, while FFTW 2 needs more than 7300 cycles.

PA-8200, PA-8500: djbfft 0.76 has not been tuned for these chips. Nevertheless it appears to be faster than FFTW 2. For example, on an 8200-240 under HP-UX B.11.0 with cc +O3 +Oall, djbfft 0.76 computes a size-256 complex FFT in about 4800 cycles, while FFTW 2 needs more than 5400 cycles.
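If you want to repeat this kind of measurement, here's a rough harness; it's my sketch, not the benchmark code behind the numbers above, and it assumes the same fftc4 interface as before. It averages over many calls with clock() and converts to cycles given your clock rate; on chips with an accessible cycle counter you can read that directly instead.

     #include <time.h>
     #include <stdio.h>
     #include <fftc4.h>

     #define HZ 167000000.0  /* your clock rate; 167MHz here */
     #define LOOPS 100000

     complex4 a[256];

     int main(void)
     {
       clock_t t0, t1;
       long i;

       t0 = clock();
       for (i = 0;i < LOOPS;++i) fftc4_256(a);
       t1 = clock();

       printf("about %.0f cycles per size-256 complex FFT\n",
         HZ * (t1 - t0) / CLOCKS_PER_SEC / LOOPS);
       return 0;
     }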

Why is djbfft so fast?

It's easier to answer the opposite question: why are other FFT libraries so slow? There are several common problems: too many arithmetic operations, too many loads and stores, and poor instruction scheduling.

Don't modern compilers take care of instruction scheduling?

Sure, if you don't care about speed.
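Here's a toy illustration; this is not code from djbfft. Each butterfly in the first loop is a chain of dependent floating-point operations, so a pipelined floating-point adder with multi-cycle latency sits partly idle. Interleaving two independent butterflies, as in the second loop, fills the gaps. Compilers routinely fail to make this transformation, or make it badly.

     /* one pass of size-2 butterflies: each iteration is one
        dependent chain, so the FP pipeline stalls between steps */
     void pass(register double *a,register int n)
     {
       register double t;
       register int i;

       for (i = 0;i < n;i += 2) {
         t = a[i + 1];
         a[i + 1] = a[i] - t;
         a[i] = a[i] + t;
       }
     }

     /* same pass, two independent butterflies in flight at once:
        the second chain fills the first chain's latency gaps */
     void pass2(register double *a,register int n)
     {
       register double t, u;
       register int i;

       for (i = 0;i < n;i += 4) {
         t = a[i + 1]; u = a[i + 3];
         a[i + 1] = a[i] - t; a[i + 3] = a[i + 2] - u;
         a[i] = a[i] + t; a[i + 2] = a[i + 2] + u;
       }
     }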

Is there any room for further speed improvements?

UltraSPARC-I, UltraSPARC-II: My size-256 FFT uses about 6300 cycles. The most obvious bottleneck is the floating-point adder, which needs 5008 cycles to carry out 5008 floating-point additions. The integer units are mostly idle; some floating-point operations could be simulated by integer operations, or an independent integer computation could be carried out in the meantime. The overhead can be reduced; my current scheduling has some obvious gaps, especially in the real transforms. The code can be scheduled to work directly from the L2 cache, speeding up large transforms.
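As one example of simulating a floating-point operation with integer operations (my illustration, not something djbfft currently does): negating an IEEE-754 double is just flipping its top bit, which an otherwise idle integer unit can handle. The sketch below goes through memory for portability; properly scheduled code would keep the value in an integer register.

     #include <string.h>

     /* negate x by XORing the IEEE-754 sign bit, using only
        integer arithmetic */
     double intneg(double x)
     {
       unsigned long long u;

       memcpy(&u,&x,sizeof x);
       u ^= 0x8000000000000000ULL;
       memcpy(&x,&u,sizeof x);
       return x;
     }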

Pentium, Pentium MMX: My size-256 FFT uses 17196 cycles. The most obvious bottleneck is the floating-point unit, which is busy for 16112 of those cycles.

Some of the loads and stores can be eliminated. The function-call overhead can be reduced in several ways with an asm implementation. The number of cache misses can be reduced further for large transforms.
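To illustrate the load/store point (a sketch, not djbfft's code): doing two butterfly passes separately stores every intermediate value and reloads it; fusing the passes keeps the intermediates in registers. The example below is a 4-point Hadamard transform, two size-2 passes fused into one step; an FFT adds twiddle factors, but the load/store pattern is the same.

     /* two separate passes over a[0..3] would cost 8 loads and
        8 stores; the fused version costs 4 loads and 4 stores,
        with the intermediates s0..s3 held in registers */
     void fused4(register double *a)
     {
       register double t0 = a[0], t1 = a[1], t2 = a[2], t3 = a[3];
       register double s0 = t0 + t2, s1 = t1 + t3;
       register double s2 = t0 - t2, s3 = t1 - t3;

       a[0] = s0 + s1;
       a[1] = s0 - s1;
       a[2] = s2 + s3;
       a[3] = s2 - s3;
     }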

Pentium Pro, Pentium II, Pentium III: My size-256 FFT takes under 12000 cycles. I'm still trying to figure out the bottlenecks here; I don't have an accurate Pentium Pro simulator. I wouldn't be surprised if big speedups are still possible.

Will MMX, SSE, 3DNow!, VIS, AltiVec, etc. lead to faster FFTs?

Typical circuitry to compute a double-precision floating-point product can be adapted to compute several low-precision products simultaneously. So it's reasonable to expect low-precision FFTs to run more quickly than double-precision FFTs. Vector hardware is often accompanied by faster memory access, so it's possible that double-precision FFTs can be sped up too.
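For example, here's a sketch using Intel's SSE intrinsics (introduced with the Pentium III; this is not part of djbfft). A single SSE instruction adds four single-precision pairs at once, so four butterflies proceed in roughly the time of one.

     #include <xmmintrin.h>

     /* four single-precision size-2 butterflies at once:
        b[j] <- a[j] - b[j], a[j] <- a[j] + b[j], j = 0..3 */
     void butterfly4(float *a,float *b)
     {
       __m128 x = _mm_loadu_ps(a);
       __m128 y = _mm_loadu_ps(b);

       _mm_storeu_ps(b,_mm_sub_ps(x,y));
       _mm_storeu_ps(a,_mm_add_ps(x,y));
     }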