D. J. Bernstein

Fast arithmetic

djbfft
# Benchmarks

I have speed reports for djbfft 0.76 on
In each case the compiler options are
the default options in the djbfft installation:
`-O1 -fomit-frame-pointer`
with `-malign-double` added automatically on the x86 processors.
I also have some speed reports for djbfft 0.75 under alternate compilers:

- a 240MHz HP PA-8200
under HP-UX B.11.0 cc
`+O2 -Dinline`
and
- a 240MHz HP PA-8200
under HP-UX B.11.0 cc
`+O3 +Oall -Dinline`.

## Contents of the speed reports

Codes used in the reports:
`r`: Real transform.
`c`: Complex transform.
`4`: Single-precision transform.
`8`: Double-precision transform.
`+`: Forward DFT.
`-`: Inverse DFT.
`m`: Multiplication.
Convolution against a precomputed filter
takes one forward DFT, one multiplication, and one inverse DFT.
`s`: Scaling.
Precomputation of a filter takes one forward DFT and one scaling.
`nothing`: No computation.
This shows the overhead of the tick-counting mechanism.
`RDTSC`:
Tick counts are obtained from the Pentium cycle counter.
`gethrtime`:
Tick counts are obtained from the Solaris gethrtime() nanosecond counter.

Each line shows the individual tick counts
for eight iterations of the routine being benchmarked.
The first iteration is normally slower than the rest;
instructions may not be in cache (or even memory),
inputs may not be in cache, etc.
The first few iterations may wobble a bit
because of branch prediction hysteresis.
All the iterations will usually have different speeds
for inputs larger than cache.
Individual iterations may occasionally be much slower
if the operating system happens to perform a context switch.
For example, the Pentium-133 lines

Using RDTSC, pentium/*.c.
nothing 27 17 17 18 17 17 17 18
256 r8- 11288 8127 8102 8102 8102 8102 8102 8102

show that a 256-point in-cache double-precision real inverse DFT,
with a tiny amount of timing overhead,
normally takes 8102 Pentium cycles.
## Notes on previous versions of djbfft

19970916:
First version of djbfft.
I wrote this code to prove to the FFTW authors that
a simple split-radix FFT would run faster than
their complicated code on a Pentium.
My unscheduled code, 86 lines long,
did a size-256 single-precision transform in about 35000 Pentium cycles,
faster than FFTW.
A few days later, after some casual instruction scheduling,
I had the time down to about 24000 Pentium cycles.
19971116:
djbfft 0.50.
About 23000 Pentium cycles for a size-256 double-precision transform.
I was still learning about the Pentium FPU at this point.

19971218:
djbfft 0.55.
About 20000 Pentium cycles.
New in this version: inverse transforms.

19971226:
djbfft 0.60.
About 20000 Pentium cycles.
New in this version: simultaneous support for
single precision and double precision.

19980923:
djbfft 0.70.
About 18000 Pentium cycles, or 15000 UltraSPARC-I cycles.
New in this version: multiplication routines
to support complex convolution and real convolution.

19990914:
djbfft 0.75.
About 17000 Pentium cycles,
or 6300 UltraSPARC-I cycles,
or 13000 Pentium-II cycles.
New in this version: real FFTs and UltraSPARC tuning.

19990930:
djbfft 0.76.
About 17000 Pentium cycles,
or 6300 UltraSPARC-I cycles,
or 12000 Pentium-II cycles.
New in this version: some Pentium-II tuning.