D. J. Bernstein

Fast arithmetic

djbfft
# Complex convolution library interface

djbfft currently supports
sizes 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, and 8192.
This document focuses on 256 for simplicity.
## Single-precision transforms

#include <fftc4.h>
complex4 *a*[256];
fftc4_256(*a*);
fftc4_scale256(*a*);
fftc4_un256(*a*);

`fftc4_256`
computes a 256-point complex discrete Fourier transform.
It evaluates the polynomial
*a*[0] + *a*[1] x + *a*[2] x^2 + ... + *a*[255] x^255

at all the 256th roots of 1,
and puts the values into *a*,
overwriting the input.
(Beware that the results are stored in an
unusual order.)
Each *a*[*n*] is a complex number
with 4-byte real part
*a*[*n*].re
and 4-byte imaginary part
*a*[*n*].im.
To compute the inverse transform,
reconstructing a polynomial from its values,
call `fftc4_scale256`
and then `fftc4_un256`.
(`fftc4_scale256` multiplies each
*a*[*n*] by 1/256.)

Note that the position of *a* in memory
can affect performance.

## Single-precision convolutions

#include <fftc4.h>
complex4 *a*[256];
complex4 *b*[256];
fftc4_mul256(*a*,*b*);

`fftc4_mul256` multiplies each
*a*[*n*] by
*b*[*n*]
and puts the result into
*a*[*n*].
The sequence of operations

fftc4_256(*a*);
fftc4_256(*b*);
fftc4_mul256(*a*,*b*);
fftc4_scale256(*a*);
fftc4_un256(*a*);

convolves
*a* with
*b*:
it multiplies the polynomial
*a*[0] + *a*[1] x + *a*[2] x^2 + ... + *a*[255] x^255

by
*b*[0] + *b*[1] x + *b*[2] x^2 + ... + *b*[255] x^255

modulo x^256-1
and puts the result back into *a*.
The sequence of operations
fftc4_256(*b*);
fftc4_scale256(*b*);
fftc4_256(*a*);
fftc4_mul256(*a*,*b*);
fftc4_un256(*a*);

has the same effect.
If you have many polynomials to multiply by the same *b*,
you can save time by reusing the transformed (and scaled) *b*.
## Double-precision transforms and convolutions

#include <fftc8.h>
complex8 *a*[256];
complex8 *b*[256];
fftc8_256(*a*);
fftc8_scale256(*a*);
fftc8_un256(*a*);
fftc8_mul256(*a*,*b*);

The `fftc8` functions are just like the `fftc4` functions
except that they work with 8-byte floating-point numbers
instead of 4-byte floating-point numbers.
**WARNING:**
Some compilers,
notably gcc without the `-malign-double` option,
do not guarantee 8-byte alignment for 8-byte floating-point variables.
The Pentium, Pentium II, et al. will slow down dramatically
if your arrays are not aligned to 8-byte boundaries.

#include <fftfreq.h>
unsigned int *n*;
unsigned int *f*;
*f* = fftfreq_c(*n*,256);

What `fftc4_256` and `fftc8_256`
put into *a*[*n*]
is the value of the input polynomial at
exp(2 pi *i**f*/256)
where *f* = fftfreq_c(*n*,256).