D. J. Bernstein
Fast arithmetic
djbfft

# Real convolution library interface

djbfft currently supports sizes 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, and 8192. This document focuses on 256 for simplicity.

## Single-precision transforms

```     #include <fftr4.h>

real4 a;

fftr4_256(a);
fftr4_scale256(a);
fftr4_un256(a);
```
fftr4_256 computes a 256-point real-to-complex discrete Fourier transform. It evaluates the polynomial
```     a + a x + a x^2 + ... + a x^127
+ a x^128 + a x^129 + ... + a x^255
```
at all the 256th roots of 1 modulo conjugation, and puts the values into a, overwriting the input. (Beware that the results are stored in an unusual order; also note the input interleaving.) Each a[n] is a 4-byte real number.

To compute the inverse transform, reconstructing a polynomial from its values, call fftr4_scale256 and then fftr4_un256. (fftr4_scale256 multiplies a and a by 1/256, and each remaining a[n] by 1/128.)

Note that the position of a in memory can affect performance.

## Single-precision convolutions

```     #include <fftr4.h>

real4 a;
real4 b;

fftr4_mul256(a,b);
```
fftr4_mul256 multiplies each a[n] by b[n] and puts the result into a[n].

The sequence of operations

```     fftr4_256(a);
fftr4_256(b);
fftr4_mul256(a,b);
fftr4_scale256(a);
fftr4_un256(a);
```
convolves a with b: it multiplies the polynomial
```     a + a x + a x^2 + ... + a x^127
+ a x^128 + a x^129 + ... + a x^255
```
by
```     b + b x + b x^2 + ... + b x^127
+ b x^128 + b x^129 + ... + b x^255
```
modulo x^256-1 and puts the result back into a. The sequence of operations
```     fftr4_256(b);
fftr4_scale256(b);
fftr4_256(a);
fftr4_mul256(a,b);
fftr4_un256(a);
```
has the same effect. If you have many polynomials to multiply by the same b, you can save time by reusing the transformed (and scaled) b.

## Double-precision transforms and convolutions

```     #include <fftr8.h>

real8 a;
real8 b;

fftr8_256(a);
fftr8_scale256(a);
fftr8_un256(a);

fftr8_mul256(a,b);
```
The fftr8 functions are just like the fftr4 routines except that they work with 8-byte floating-point numbers instead of 4-byte floating-point numbers.

WARNING: Some compilers, notably gcc without the -malign-double option, do not guarantee 8-byte alignment for 8-byte floating-point variables. The Pentium, Pentium II, et al. will slow down dramatically if your arrays are not aligned to 8-byte boundaries.

## Frequencies

```     #include <fftfreq.h>

unsigned int n;
unsigned int f;

f = fftfreq_r(n,256);
```
Here is what fftr4_256 and fftr8_256 put into a, in terms of the input polynomial p:
• a = p(1);
• a = p(-1);
• a = re p(exp(2 pi i fftfreq_r(2,256)/256));
• a = im p(exp(2 pi i fftfreq_r(2,256)/256));
• a = re p(exp(2 pi i fftfreq_r(4,256)/256));
• a = im p(exp(2 pi i fftfreq_r(4,256)/256));
• a = re p(exp(2 pi i fftfreq_r(6,256)/256));
• a = im p(exp(2 pi i fftfreq_r(6,256)/256));
• etc.