D. J. Bernstein
Fast arithmetic
djbfft

Real convolution library interface

djbfft currently supports sizes 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, and 8192. This document focuses on 256 for simplicity.

Single-precision transforms

```     #include <fftr4.h>

real4 a[256];

fftr4_256(a);
fftr4_scale256(a);
fftr4_un256(a);
```
fftr4_256 computes a 256-point real-to-complex discrete Fourier transform. It evaluates the polynomial
```     a[0] + a[2] x + a[4] x^2 + ... + a[254] x^127
+ a[1] x^128 + a[3] x^129 + ... + a[255] x^255
```
at all the 256th roots of 1 modulo conjugation, and puts the values into a, overwriting the input. (Beware that the results are stored in an unusual order; also note the input interleaving.) Each a[n] is a 4-byte real number.

To compute the inverse transform, reconstructing a polynomial from its values, call fftr4_scale256 and then fftr4_un256. (fftr4_scale256 multiplies a[0] and a[1] by 1/256, and each remaining a[n] by 1/128.)

Note that the position of a in memory can affect performance.

Single-precision convolutions

```     #include <fftr4.h>

real4 a[256];
real4 b[256];

fftr4_mul256(a,b);
```
fftr4_mul256 multiplies each a[n] by b[n] and puts the result into a[n].

The sequence of operations

```     fftr4_256(a);
fftr4_256(b);
fftr4_mul256(a,b);
fftr4_scale256(a);
fftr4_un256(a);
```
convolves a with b: it multiplies the polynomial
```     a[0] + a[2] x + a[4] x^2 + ... + a[254] x^127
+ a[1] x^128 + a[3] x^129 + ... + a[255] x^255
```
by
```     b[0] + b[2] x + b[4] x^2 + ... + b[254] x^127
+ b[1] x^128 + b[3] x^129 + ... + b[255] x^255
```
modulo x^256-1 and puts the result back into a. The sequence of operations
```     fftr4_256(b);
fftr4_scale256(b);
fftr4_256(a);
fftr4_mul256(a,b);
fftr4_un256(a);
```
has the same effect. If you have many polynomials to multiply by the same b, you can save time by reusing the transformed (and scaled) b.

Double-precision transforms and convolutions

```     #include <fftr8.h>

real8 a[256];
real8 b[256];

fftr8_256(a);
fftr8_scale256(a);
fftr8_un256(a);

fftr8_mul256(a,b);
```
The fftr8 functions are just like the fftr4 routines except that they work with 8-byte floating-point numbers instead of 4-byte floating-point numbers.

WARNING: Some compilers, notably gcc without the -malign-double option, do not guarantee 8-byte alignment for 8-byte floating-point variables. The Pentium, Pentium II, et al. will slow down dramatically if your arrays are not aligned to 8-byte boundaries.

Frequencies

```     #include <fftfreq.h>

unsigned int n;
unsigned int f;

f = fftfreq_r(n,256);
```
Here is what fftr4_256 and fftr8_256 put into a, in terms of the input polynomial p:
• a[0] = p(1);
• a[1] = p(-1);
• a[2] = re p(exp(2 pi i fftfreq_r(2,256)/256));
• a[3] = im p(exp(2 pi i fftfreq_r(2,256)/256));
• a[4] = re p(exp(2 pi i fftfreq_r(4,256)/256));
• a[5] = im p(exp(2 pi i fftfreq_r(4,256)/256));
• a[6] = re p(exp(2 pi i fftfreq_r(6,256)/256));
• a[7] = im p(exp(2 pi i fftfreq_r(6,256)/256));
• etc.