D. J. Bernstein
Fast arithmetic


I have speed reports for djbfft 0.76 on In each case the compiler options are the default options in the djbfft installation: -O1 -fomit-frame-pointer with -malign-double added automatically on the x86 processors.

I also have some speed reports for djbfft 0.75 under alternate compilers:

Contents of the speed reports

Codes used in the reports: Each line shows the individual tick counts for eight iterations of the routine being benchmarked. The first iteration is normally slower than the rest; instructions may not be in cache (or even memory), inputs may not be in cache, etc. The first few iterations may wobble a bit because of branch prediction hysteresis. All the iterations will usually have different speeds for inputs larger than cache. Individual iterations may occasionally be much slower if the operating system happens to perform a context switch.

For example, the Pentium-133 lines

     Using RDTSC, pentium/*.c.
        nothing      27      17      17      18      17      17      17      18
        256 r8-   11288    8127    8102    8102    8102    8102    8102    8102
show that a 256-point in-cache double-precision real inverse DFT, with a tiny amount of timing overhead, normally takes 8102 Pentium cycles.

Notes on previous versions of djbfft

19970916: First version of djbfft. I wrote this code to prove to the FFTW authors that a simple split-radix FFT would run faster than their complicated code on a Pentium. My unscheduled code, 86 lines long, did a size-256 single-precision transform in about 35000 Pentium cycles, faster than FFTW. A few days later, after some casual instruction scheduling, I had the time down to about 24000 Pentium cycles.

19971116: djbfft 0.50. About 23000 Pentium cycles for a size-256 double-precision transform. I was still learning about the Pentium FPU at this point.

19971218: djbfft 0.55. About 20000 Pentium cycles. New in this version: inverse transforms.

19971226: djbfft 0.60. About 20000 Pentium cycles. New in this version: simultaneous support for single precision and double precision.

19980923: djbfft 0.70. About 18000 Pentium cycles, or 15000 UltraSPARC-I cycles. New in this version: multiplication routines to support complex convolution and real convolution.

19990914: djbfft 0.75. About 17000 Pentium cycles, or 6300 UltraSPARC-I cycles, or 13000 Pentium-II cycles. New in this version: real FFTs and UltraSPARC tuning.

19990930: djbfft 0.76. About 17000 Pentium cycles, or 6300 UltraSPARC-I cycles, or 12000 Pentium-II cycles. New in this version: some Pentium-II tuning.