Numerical Analysis
Fast Fourier Transform
Cooley-Tukey 1965 cut DFT cost from O(n²) to O(n log n) — the algorithm that makes MP3, JPEG, and 5G possible
The Fast Fourier Transform computes the discrete Fourier transform of an N-sample signal in O(N log N) operations instead of the naïve O(N²). At N = 10⁶ that gap is fifty thousand times. Without the FFT, real-time digital audio, image compression and wireless communication would all be impossible at consumer cost.
- PublishedCooley & Tukey, 1965
- Earliest known formGauss, ~1805
- Naïve costO(N²)
- FFT costO(N log N)
- Speedup at N = 10⁶~50,000×
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
What the FFT actually computes
Start from the discrete Fourier transform itself. Given N complex samples x₀, x₁, …, x_{N−1}, the DFT produces N complex output coefficients X_k:
X_k = Σ_{n=0..N-1} x_n · e^(-2πi · k · n / N) for k = 0..N-1
Each X_k tells you how much of frequency k cycles per N samples is present in the signal. Computed straight from this definition, every output requires N complex multiplications and N additions, and there are N outputs — so the total cost grows like N². At N = 1024 that's about a million complex multiplies; at N = 10⁶ it's a trillion. The signal has not changed; the bottleneck is purely the algorithm.
The FFT computes the same N complex numbers, to within floating-point error, in a number of operations that scales like N log₂ N. The mathematics is identical. The savings come from noticing that most of the N² products in the naïve sum are not independent and can be folded into a recursion.
The Cooley-Tukey radix-2 recursion
Assume N is a power of two. Split the input into its even-indexed samples (x₀, x₂, x₄, …) and its odd-indexed samples (x₁, x₃, x₅, …). Plug those into the DFT sum, separate the two groups, and pull a twiddle factor out of the odd half:
X_k = Σ_{n=0..N/2-1} x_{2n} · e^(-2πi k (2n) / N)
+ Σ_{n=0..N/2-1} x_{2n+1} · e^(-2πi k (2n+1) / N)
= E_k + W_N^k · O_k for k = 0..N/2-1
= E_{k-N/2} - W_N^{k-N/2} · O_{k-N/2} for k = N/2..N-1
Here E_k is the length-(N/2) DFT of the even-indexed samples, O_k is the length-(N/2) DFT of the odd-indexed samples, and W_N^k = e^(-2πi k / N) is the twiddle factor. Two key facts make this useful: (1) E_k and O_k each cost only T(N/2) by recursion, and (2) the same E_k and O_k feed both the lower half and the upper half of the output, with a sign flip. Each output costs one twiddle multiplication and one addition.
The recurrence is T(N) = 2 · T(N/2) + O(N), and the master theorem gives T(N) = O(N log N). Concretely, log₂(1024) = 10, so a 1024-point FFT does about 10 · 1024 = 10⁴ butterfly operations — a hundred times less than the 10⁶ multiplications of the naïve version.
The butterfly diagram
Drawn as a flow graph the recursion looks like a stack of butterflies. At each level the algorithm takes two values, a and b, multiplies one of them by a twiddle factor, and produces two outputs:
top: out₀ = a + W · b
bottom: out₁ = a - W · b
The crossing of inputs to outputs gives each pair the visual shape of a butterfly's wings. A length-N FFT consists of log₂ N stages, each containing N/2 butterflies. The first stage processes pairs separated by 1, the second by 2, the third by 4, and so on up to N/2. The input is read in bit-reversed order: index 0b0110 swaps with 0b0110 read backwards. Reading bit-reversed allows the butterflies to write their output in natural order — a standard implementation trick.
Worked example: FFT of [1, 2, 3, 4]
Take N = 4 and x = (1, 2, 3, 4). The twiddle factors are W₄^0 = 1, W₄^1 = e^(-iπ/2) = −i, W₄^2 = −1, W₄^3 = i. Split into even (1, 3) and odd (2, 4) halves and run length-2 DFTs first.
Stage 1 — length-2 DFTs:
E = DFT₂(1, 3) = (1 + 3, 1 - 3) = (4, -2)
O = DFT₂(2, 4) = (2 + 4, 2 - 4) = (6, -2)
Stage 2 — combine with twiddles W₄^k:
X₀ = E₀ + W₄^0 · O₀ = 4 + 1·6 = 10
X₁ = E₁ + W₄^1 · O₁ = -2 + (-i)·(-2) = -2 + 2i
X₂ = E₀ - W₄^0 · O₀ = 4 - 1·6 = -2
X₃ = E₁ - W₄^1 · O₁ = -2 - (-i)·(-2) = -2 - 2i
Output: X = (10, -2 + 2i, -2, -2 - 2i)
Sanity check the first bin: X₀ is just the sum of all samples, 1 + 2 + 3 + 4 = 10. The second bin |X₁|² + |X₃|² captures the amplitude at one cycle per four samples — the linear ramp pattern in the input. The third bin X₂ is the alternating-sign component, which is also the smallest in magnitude here. Numpy's numpy.fft.fft([1,2,3,4]) returns the same vector to machine precision.
How big is the speedup?
| Length N | Naïve DFT operations (~N²) | FFT operations (~N log₂ N) | Speedup | Real-world signal |
|---|---|---|---|---|
| 64 | 4,096 | 384 | ~11× | One frame of 8×8 JPEG |
| 1,024 | 1,048,576 | 10,240 | ~100× | One MP3 granule |
| 4,096 | 16,777,216 | 49,152 | ~340× | One audio FFT bin block at 44.1 kHz |
| 262,144 (2¹⁸) | ~6.9 × 10¹⁰ | ~4.7 × 10⁶ | ~14,500× | One 512×512 image, row-by-row |
| 1,048,576 (2²⁰) | ~1.1 × 10¹² | ~2.1 × 10⁷ | ~50,000× | 3-D MRI slice volume |
| 10⁹ | ~10¹⁸ | ~3 × 10¹⁰ | ~30 million× | LIGO matched-filter chunk |
| 10¹² | ~10²⁴ | ~4 × 10¹³ | ~25 billion× | Whole-sky radio survey panel |
The takeaway is that the speedup is not constant — it grows. For small problems both algorithms are fast and the choice barely matters. As N climbs into the millions and billions, the gap becomes the difference between "runs in milliseconds" and "would take longer than the age of the universe."
Where the FFT shows up
- MP3 and AAC audio compression. Each 1152-sample MP3 granule is converted to the frequency domain by a modified discrete cosine transform, computed under the hood by an FFT. A song at 44.1 kHz feeds through several thousand FFTs per second; without the speedup, real-time encoding on a 1990s CPU would not have been possible.
- JPEG image compression. JPEG splits a photograph into 8×8 blocks and runs a DCT on each one. The DCT can be implemented as an FFT of a length-16 mirrored signal followed by a small linear combination. JPEG decoders in browsers handle billions of these transforms per day.
- OFDM in 4G LTE, 5G NR, Wi-Fi 6 and DOCSIS 3.1. Orthogonal frequency-division multiplexing carves a wide channel into 1024 to 4096 narrow subcarriers. Encoding and decoding each symbol is exactly an FFT of that size, run dozens of times per millisecond on every base station and handset on Earth.
- MRI image reconstruction. An MRI scanner physically samples the Fourier transform of the patient's spin density (k-space). The image you see on the radiologist's screen is the inverse FFT of that grid. A 512×512×512 volume is one million 512-point FFTs, computed in seconds.
- Gravitational-wave detection at LIGO. Matched filtering against template waveforms is implemented as multiplication in the frequency domain, which requires an FFT and an inverse FFT. LIGO ingests 16 kHz strain data and runs 4 to 16 second analysis windows in real time across thousands of templates — only feasible because of FFTW-class libraries.
The de facto reference implementation is FFTW (the "Fastest Fourier Transform in the West"), written by Frigo and Johnson at MIT. NumPy's numpy.fft, MATLAB's fft(), and Apple's vDSP framework all wrap FFTW or equivalent algorithms. On modern CPUs an FFTW double-precision FFT of length 4096 runs in roughly 25 microseconds.
Variants and extensions
- Radix-4 and radix-8 FFTs. Group four (or eight) samples per butterfly instead of two. Fewer twiddle multiplications per output, at the cost of more arithmetic per butterfly. Common in signal-processing chips where multiplies are expensive but adds are cheap.
- Mixed-radix Cooley-Tukey. When N has multiple small prime factors (2, 3, 5, 7), factor N as a product and recurse on each factor. FFTW's codelets are hand-tuned mixed-radix kernels for every prime up to about 13.
- Bluestein's chirp-z transform. Handles arbitrary length N — including primes — by rewriting the DFT as a convolution and computing the convolution with a power-of-two FFT. Costs about 3× a same-length power-of-two FFT.
- Bruun's algorithm and Rader's algorithm. Two specialised approaches for prime-length DFTs based on polynomial factorisation and on group theory respectively. Largely of theoretical interest now that Bluestein covers the same ground.
- Stockham auto-sort FFT. Avoids the bit-reversal permutation by interleaving data movement with butterflies. Friendly to GPU memory hierarchies and the basis of cuFFT's batched mode.
Common pitfalls
- Forgetting bit reversal. Decimation-in-time FFTs read input in bit-reversed order and write output in natural order; decimation-in-frequency does the opposite. Mixing the two without a permutation step gives garbled results.
- Sign convention. Engineering literature uses e^(−2πi…) for the forward transform; physics literature often uses e^(+2πi…). Library implementations differ. Cross-check the sign or your spectrum will be reflected in time.
- Normalisation. Some libraries scale by 1/N on the forward transform, some on the inverse, some by 1/√N on each. Round-trip
ifft(fft(x))must equal x — if it doesn't, you have a normalisation mismatch. - Aliasing from too-coarse sampling. The FFT does not create the spectrum, it only measures it. If your sample rate is below 2× the highest frequency present (Nyquist), high-frequency content folds down and corrupts low-frequency bins. Anti-alias filter before sampling.
- Spectral leakage from non-periodic inputs. The DFT implicitly treats the input as one period of a periodic signal. A discontinuity at the boundary spreads energy across all bins. Window the input (Hann, Hamming, Blackman) before transforming when the signal is not exactly periodic in N samples.
Why it changed computing
It is rare for a single algorithm to be load-bearing for an entire industry, let alone several at once. The FFT is. The compact-disc audio standard, the JPEG image standard, the OFDM physical layer of 4G and 5G, MRI scanners, radar, sonar, oil-and-gas seismic imaging, particle-physics trigger systems and gravitational-wave detectors all collapse into "compute an FFT" at their core. Computing in Science & Engineering ranked Cooley-Tukey among the top ten algorithms of the 20th century in 2000, alongside the simplex method and the Metropolis algorithm.
What makes the FFT special is that it sits exactly between the physical world (continuous signals at antennas, microphones, MRI coils) and the digital one (samples in memory). Every time analog meets digital at scale, a Fourier transform is the bridge — and only the FFT keeps that bridge cheap enough to cross billions of times a day on consumer hardware.
Frequently asked questions
Why is the FFT faster than the DFT if it computes the same answer?
Both produce identical output (within floating-point error). The FFT exploits algebraic redundancy in the DFT matrix: many of the N² complex multiplications are duplicates that can be factored out by recursion. Cooley and Tukey showed in 1965 that splitting the input into even and odd indices reduces a length-N DFT to two length-N/2 DFTs plus N/2 multiplications, giving the recurrence T(N) = 2T(N/2) + O(N), which solves to O(N log N).
What is a twiddle factor?
A twiddle factor is the complex root of unity W_N^k = e^(-2πi k/N) that appears when combining the even and odd half-size DFTs. It rotates the odd part by k/N of a full turn before adding it to the even part. The name dates to Gentleman and Sande in the 1960s and stuck because the factors literally twiddle phases.
Does N have to be a power of two?
The classic Cooley-Tukey radix-2 algorithm requires N to be a power of two. Mixed-radix variants handle any composite N by factoring it into small prime factors. For prime N, Bluestein's chirp-z algorithm and Rader's algorithm reduce the prime-length DFT to a power-of-two convolution. Production libraries like FFTW pick the best decomposition automatically.
How big is the speedup in practice?
At N = 1024 the naïve DFT does about 10⁶ complex multiplications versus 10⁴ for the FFT — 100× faster. At N = 10⁶ (typical for an MRI volume slice) the gap is 10¹² versus 2×10⁷ — about 50,000×. At N = 10⁹ the FFT remains tractable on a laptop while the naïve DFT would take centuries. The speedup is the entire reason real-time signal processing exists.
Is the FFT exact?
It is mathematically exact: the FFT computes the same complex numbers as the DFT definition. In finite-precision arithmetic the FFT actually accumulates less rounding error than the naïve DFT because it performs fewer operations on each input. Standard double-precision FFTs achieve relative error around 10⁻¹⁵, which is well below any physical noise floor.
How does the FFT relate to the JPEG and MP3 algorithms?
JPEG uses the discrete cosine transform (DCT-II), which is computed via an FFT after a real-to-complex packing trick. MP3 uses the modified discrete cosine transform (MDCT), again FFT-based. Both compress because most signal energy lives in a small set of frequency bins; the FFT moves the data into that representation cheaply enough to run in real time. Without the FFT speedup, lossy media compression at consumer cost would not exist.
Was Cooley-Tukey actually first?
No. Gauss described the same recursion around 1805 in unpublished notes on asteroid orbits. Several others reinvented pieces of it through the 19th and 20th centuries. The 1965 Cooley-Tukey paper became canonical because it appeared at the moment digital computers were ready to exploit it — and because IBM and Princeton actively promoted the idea.