Fast arithmetic

djbfft

Pentium, Pentium MMX: djbfft 0.76 has been carefully tuned for these chips. It is more than twice as fast as FFTW 2.

Pentium Pro, Pentium II, Pentium III: djbfft 0.76 has been partially tuned for these chips. It is noticeably faster than FFTW 2.

PowerPC 603, PowerPC 604: djbfft 0.76 has not been tuned for these chips. Nevertheless it is as fast as FFTW 2.

Alpha 21264: djbfft 0.76 has not been tuned for this chip. Nevertheless it appears to be faster than FFTW 2. For example, on a 21264-500, djbfft 0.76 computes a size-256 complex FFT in under 6400 cycles, while FFTW 2 needs more than 7300 cycles.

PA-8200, PA-8500:
djbfft 0.76 has not been tuned for these chips.
Nevertheless it appears to be faster than FFTW 2.
For example,
on an 8200-240 under HP-UX B.11.0 with cc `+O3 +Oall`,
djbfft 0.76 computes a size-256 complex FFT in about 4800 cycles,
while FFTW 2 needs more than 5400 cycles.

- Arithmetic operations. Many libraries use a radix-2 FFT, performing about (5k-10)2^k floating-point operations for a size-2^k transform. djbfft uses a split-radix-2/4 FFT, performing about (4k-6)2^k floating-point operations. (You can think of split-radix 2/4 as the limit of radix 2, radix 4, radix 8, etc.)
- Root computation. Some libraries compute roots of 1 on the fly. djbfft uses a precomputed root table.
- Root loads. Each step of a standard split-radix FFT involves a root of 1 and the cube of that root. djbfft uses the inverse instead of the cube, saving half the loads. (I spotted this improvement in September 1997; I call it ``exponent -1 split-radix.'' It changes the output order, but that doesn't matter for convolution.)
- Memory use. Many libraries use one or more temporary arrays as large as the input array, missing the cache unnecessarily. djbfft carries out all computations in place.
- Cache thrashing. Most libraries use an iterative FFT, passing repeatedly over the entire input array, missing the cache at every step for large transforms. djbfft uses a recursive decimation-in-frequency FFT. (The cache-friendliness of this type of FFT was pointed out by Singleton in 1967.)
- Cache associativity. The UltraSPARC, for example, can save 512 32-byte lines of memory in its level-1 cache, but it can't save two lines with the same address modulo 16384. This produces seemingly random slowdowns in many programs. djbfft stores its root tables consecutively in memory so that users can easily move their arrays to safe portions of the cache.
- Unaligned data.
The Pentium, Pentium II, etc. need extra time to access
8-byte floating-point variables that are not aligned to 8-byte boundaries.
This produces seemingly random slowdowns in many floating-point programs
(including FFTW
before I explained this problem to the FFTW authors in December 1997).
djbfft automatically uses
gcc's
`-malign-double`option to align 8-byte floating-point variables in the data segment, and it generally avoids storing floating-point variables on the stack. - Floating-point latency. On an UltraSPARC, for example, a floating-point addition takes just one cycle, but the result cannot be used during the next two cycles. Most FFT libraries try to use the result immediately, and end up pausing for two cycles. The instructions in djbfft are scheduled for both the Pentium and the UltraSPARC to avoid most of these delays without any register spills.

Pentium, Pentium MMX: My size-256 FFT uses 17196 cycles. The most obvious bottleneck is the floating-point unit, which is using 16112 cycles:

- 1696 cycles for multiplications,
- 5008 cycles for additions,
- 1560 cycles for register copies,
- 2672 cycles for loads, and
- 5176 cycles for 2588 stores.

Pentium Pro, Pentium II, Pentium III: My size-256 FFT is under 12000 cycles. I'm still trying to figure out the bottlenecks here; I don't have an accurate Pentium Pro simulator. I wouldn't be surprised if big speedups are still possible.