Transformer Architecture
The stacked block — attention, feed-forward, residual, layer norm — that scaled into every modern LLM
The transformer is the neural network architecture behind modern LLMs. Introduced in 2017 as an encoder-decoder model, it is built from stacked blocks of multi-head self-attention and feed-forward layers, wrapped in residual connections and layer normalization, that process every token in a sequence in parallel.
- Introduced: Vaswani et al., 2017
- Per-layer cost: O(n²·d)
- Sublayers per block: Attention + FFN
- FFN expansion: 4× d_model
- Params: ≈ 12 · L · d_model²
How a transformer stacks into a working model
In 2017 a team at Google Brain and Google Research published "Attention Is All You Need" and threw away recurrence entirely. The recurrent networks that dominated language modeling read a sentence one word at a time — word t could not be computed until word t−1 was done — so they could not exploit the thousands of parallel cores on a GPU. The transformer's bet was that you could replace the sequential loop with a single operation, self-attention, that relates every token to every other token at once, and then stack that operation deep enough to learn anything.
The architecture is built from one repeated unit, the transformer block, and almost the entire model is just that block copied L times. Each block contains two sublayers:
- Multi-head self-attention — the only place where tokens exchange information. Each token forms a query, compares it against every key, and pulls a weighted blend of every value. h heads do this in parallel subspaces, then concatenate.
- Position-wise feed-forward network (FFN) — the same two-layer MLP applied to each token independently. This is where the bulk of the parameters and the per-token "thinking" live.
Each sublayer is wrapped the same way: a residual connection adds the sublayer's input back to its output, and layer normalization rescales the result. Written out, a modern (pre-norm) block is exactly:
x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))
That two-line recipe, repeated 12 times (GPT-2 small; BERT-base has the same depth but uses the older post-norm ordering), 96 times (GPT-3), or 80–126 times in current frontier models, is the whole engine. The x that threads through every block is called the residual stream: a running sum that each sublayer reads from and writes to.
The full data flow, end to end
A token starts as an integer ID and ends as a probability distribution over the vocabulary. The journey:
- Tokenize + embed. The input text is split into subword tokens; each ID indexes a learned embedding table of shape [vocab, d_model], turning the token into a d_model-dimensional vector (768 in BERT-base, 12288 in GPT-3).
- Add positional information. Attention has no built-in notion of word order (permute the input tokens and the outputs are simply permuted the same way), so position must be injected. The 2017 paper used fixed sinusoidal encodings; modern models use learned positions or rotary embeddings (RoPE).
- Run the stack. The vector flows through L identical blocks. Attention mixes across tokens; the FFN transforms each token; residuals carry the signal forward.
- Project to logits. A final layer norm, then a matrix of shape [d_model, vocab] (often tied to the embedding table) produces one score per vocabulary word. Softmax turns the scores into next-token probabilities.
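The four steps above can be sketched end to end in plain Python. This is a toy under stated assumptions: the sizes, the random embedding table, and the helper names (`sinusoidal_position`, `forward`) are hypothetical, and the stack of blocks is stubbed out as an identity function so that only the plumbing from token IDs to probabilities is visible.

```python
import math
import random

random.seed(0)
VOCAB, D_MODEL = 10, 8  # toy sizes, chosen only for illustration

# Step 1: a [vocab, d_model] embedding table (random here, learned in practice).
embedding = [[random.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(VOCAB)]

def sinusoidal_position(pos, d_model):
    """Fixed sinusoidal encoding: sin on even dimensions, cos on odd ones."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(token_ids, blocks=lambda xs: xs):
    # Steps 1 and 2: look up each ID, then add its position vector.
    xs = [[e + p for e, p in zip(embedding[t], sinusoidal_position(pos, D_MODEL))]
          for pos, t in enumerate(token_ids)]
    xs = blocks(xs)            # Step 3: the stack (identity stub in this sketch)
    last = layer_norm(xs[-1])  # Step 4: final norm on the last position
    # Tied output projection: one logit per vocabulary row of the embedding table.
    logits = [sum(a * b for a, b in zip(last, row)) for row in embedding]
    return softmax(logits)

probs = forward([3, 1, 4])
print(len(probs), round(sum(probs), 6))
```

Passing a real stack as `blocks` would slot in without touching the embedding or projection code, which is the point of the design: everything outside the blocks is a lookup, an add, and one matrix multiply.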
The encoder and decoder differ only in two details. The encoder attends bidirectionally — every token sees every other token, ideal for understanding. The decoder uses a causal mask so position i can only attend to positions ≤ i (you can't peek at the word you're trying to predict), and in the full encoder-decoder setup it adds a third sublayer, cross-attention, whose queries come from the decoder but whose keys and values come from the encoder output.
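The masking rule described above (position i attends only to positions ≤ i) is easiest to see as a small boolean matrix. The sketch below is illustrative only; `causal_mask` and `bidirectional_mask` are hypothetical helper names, not part of any library.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

# Rows are queries, columns are keys; "x" marks an allowed pair.
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

The lower-triangular shape is the whole trick: row i is allowed to look at columns 0 through i and nothing later.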
The precise mechanism and complexity
Inside attention, three learned projections turn the input X ∈ ℝ^{n×d} into queries, keys, and values, and the output is the scaled dot-product:
Q = X·W_Q, K = X·W_K, V = X·W_V
Attention(Q,K,V) = softmax( Q·Kᵀ / √d_k ) · V
The Q·Kᵀ step produces an n×n matrix of pairwise scores — this is the quadratic term. The √d_k divisor keeps the dot products from growing with dimension and pushing softmax into a near-one-hot, zero-gradient regime. The FFN, by contrast, is per-token:
FFN(x) = W_2 · GELU(W_1·x + b_1) + b_2 // d → 4d → d
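To see what the √d_k divisor in the attention formula buys, the following sketch draws one random query and a handful of random keys, then reports the largest softmax weight with and without the scale. It is a toy experiment, not a proof: the function name `max_attention_weight`, the key count, and the seed are assumptions made for illustration.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def max_attention_weight(d_k, n_keys=16, scaled=True, seed=0):
    """Largest softmax weight one random query puts on any of n_keys random keys."""
    rng = random.Random(seed)
    q = [rng.gauss(0, 1) for _ in range(d_k)]
    keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    if scaled:
        scores = [s / math.sqrt(d_k) for s in scores]
    return max(softmax(scores))

for d_k in (16, 64, 256, 1024):
    print(d_k,
          round(max_attention_weight(d_k, scaled=False), 3),
          round(max_attention_weight(d_k, scaled=True), 3))
```

With unit-variance inputs the raw dot products have a standard deviation of about √d_k, so the unscaled column should drift toward 1.0 (near one-hot) as d_k grows, while the scaled column stays spread across several keys.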
Per layer, for sequence length n and model width d:
- Attention: the QKᵀ and the score×V multiplies are O(n²·d) compute and O(n²) memory for the score matrix. This is the bottleneck that makes long contexts expensive.
- FFN: O(n·d²) compute (each of n tokens hits a d×4d and a 4d×d matrix).
So whether attention or the FFN dominates depends on the ratio of n to d. For a 2048-token context with d=12288, the FFN's d² term still dwarfs the n² attention term, so at that length nearly all of the FLOPs sit in the weight matrices, and the FFN holds two-thirds of those weights (8d² of the 12d² per block). It's only at very long context (tens of thousands of tokens at this width) that the n² attention term takes over and motivates FlashAttention and linear-attention variants.
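The comparison above can be checked with a rough FLOP count. The sketch below counts only matrix multiplies at 2 FLOPs per multiply-add and ignores softmax, norms, and biases; the function names are hypothetical and the 4× expansion is the default assumed throughout this article.

```python
def attention_score_flops(n, d):
    # Q·K^T and scores·V: two matmuls of n*n*d multiply-adds each.
    return 2 * (2 * n * n * d)

def projection_flops(n, d):
    # Q, K, V, and output projections: four [n, d] @ [d, d] matmuls.
    return 4 * (2 * n * d * d)

def ffn_flops(n, d, expansion=4):
    # Two matmuls: d -> expansion*d, then expansion*d -> d.
    return 2 * (2 * n * d * expansion * d)

def crossover_length(d, expansion=4):
    """Sequence length where score FLOPs equal FFN FLOPs (4n²d = 4·expansion·n·d²)."""
    return expansion * d

d = 12288
for n in (2048, 8192, 49152, 131072):
    ratio = attention_score_flops(n, d) / ffn_flops(n, d)
    print(f"n={n:>6}  scores/FFN = {ratio:.3f}")
```

Under these assumptions the ratio simplifies to n / (4d), so at d=12288 the score matrix costs about 4% of the FFN at 2,048 tokens and only draws level at 49,152 tokens.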
When to reach for a transformer (and when not)
- Language and sequence modeling at scale — translation, summarization, chat, code. The parallelism lets you train on trillions of tokens.
- Anywhere relationships are long-range — attention connects token 1 and token 10000 in a single hop, where an RNN would have to carry the signal through 9999 sequential steps.
- Vision, audio, proteins — split an image into patches (ViT), an audio clip into frames, a protein into residues, and the same architecture applies.
Where transformers are a poor fit: very long sequences on a tight memory budget (the O(n²) score matrix blows up — a state-space model like Mamba is O(n)); tiny datasets (transformers are data-hungry and under-regularized, so a small CNN or gradient-boosted tree often wins); and ultra-low-latency edge inference, where the quadratic attention and large weight matrices are too heavy without distillation or quantization.
Transformer vs the sequence models it replaced
| | Transformer | RNN / LSTM | 1-D CNN | State-space (Mamba) |
|---|---|---|---|---|
| Path length between distant tokens | O(1) | O(n) | O(n / kernel) | O(n) steps through a fixed-size recurrent state |
| Compute per layer | O(n²·d) | O(n·d²) | O(k·n·d²) | O(n·d) |
| Training parallelism | Full (whole sequence at once) | None (sequential in t) | Full | Parallel scan |
| Memory in sequence length | O(n²) scores | O(n) states | O(n) | O(n) |
| Handles variable length | Yes (up to context limit) | Yes (unbounded) | Fixed receptive field | Yes (unbounded) |
| Inductive bias | Weak — needs lots of data | Recency / sequential | Locality | Decaying memory |
| Long-context cost | Quadratic — the pain point | Linear but forgets | Linear but local | Linear, no forgetting cliff |
| Production use | GPT, BERT, Llama, Claude, ViT | Older NMT, time series | WaveNet, some NLP | Emerging long-context LLMs |
The headline trade is path length for compute. The transformer makes any two tokens one step apart — which is exactly what learning long-range dependencies needs — at the price of an O(n²) score matrix. Every architecture in the rightmost columns is essentially trying to buy back the transformer's quality while paying only linear cost in sequence length.
What the numbers actually say
- The block is ≈12·d² parameters. Attention's four projections (Q, K, V, O) are 4·d² weights; the FFN's two matrices are 2·(d·4d) = 8d². Total 12d² per layer, so a model is roughly 12 · L · d² parameters. Plug in GPT-3's L=96, d=12288: 12 × 96 × 12288² ≈ 174 billion, matching its published 175B.
- FFN holds two-thirds of the weights. 8d² of the 12d² per-block parameters are in the feed-forward layers. Attention is the famous part; the FFN is where the model is mostly storing what it knows.
- Training FLOPs ≈ 6 · N · D. A standard estimate: training a model with N parameters on D tokens costs about 6ND floating-point operations. GPT-3 (175B params, 300B tokens) → ≈3.15×10²³ FLOPs (the GPT-3 paper reports 3.14×10²³), on the order of hundreds of GPU-years at a sustained few tens of TFLOP/s per GPU, compressed onto a large cluster.
- The score matrix is the memory wall. A naive attention over n=8192 tokens with h=64 heads materializes 64 × 8192² ≈ 4.3 billion floats per layer for each sequence: about 8.6 GB in fp16 or 17 GB in fp32. FlashAttention (Dao et al., 2022) avoids ever writing the full matrix to HBM, cutting attention memory from O(n²) to O(n) and giving a reported 2–4× wall-clock speedup.
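The arithmetic in the list above is short enough to run. The sketch below is a back-of-the-envelope calculator, not an exact accounting: it ignores biases, norms, and embeddings, the function names are hypothetical, and the 2-byte default assumes fp16 storage.

```python
def block_params(d_model, ffn_expansion=4):
    attention = 4 * d_model * d_model              # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * (ffn_expansion * d_model)  # W_1 and W_2
    return attention + ffn                         # biases and norms ignored

def model_params(n_layers, d_model):
    return n_layers * block_params(d_model)        # embeddings ignored

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens                 # the 6·N·D rule of thumb

def score_matrix_bytes(n_tokens, n_heads, bytes_per_float=2):
    # One [n, n] score matrix per head, for a single sequence and a single layer.
    return n_heads * n_tokens * n_tokens * bytes_per_float

print(f"{model_params(96, 12288) / 1e9:.1f}B parameters")
print(f"{training_flops(175e9, 300e9):.2e} training FLOPs")
print(f"{score_matrix_bytes(8192, 64) / 1e9:.1f} GB of fp16 scores per layer")
```

Swapping in another model's depth and width gives a quick sanity check on a published parameter count before reaching for a profiler.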
JavaScript implementation
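A single pre-norm transformer block in plain arrays, with no framework, so the data flow is explicit. Weights are passed in through the parameter object p (random in a demo, learned in a real model). To keep it short, attention is single-head with no output projection, so Wv must be d×d for the residual add to line up.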
// row-vector helpers
const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
const add = (a, b) => a.map((x, i) => x + b[i]);
const matvec = (W, x) => W.map(row => dot(row, x)); // W is [out][in]

function softmax(v) {
  const m = Math.max(...v);
  const e = v.map(x => Math.exp(x - m)); // subtract max for stability
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(x => x / s);
}

function layerNorm(x, eps = 1e-5) {
  const mu = x.reduce((a, b) => a + b, 0) / x.length;
  const v = x.reduce((a, b) => a + (b - mu) ** 2, 0) / x.length;
  return x.map(z => (z - mu) / Math.sqrt(v + eps));
}

const gelu = x => 0.5 * x * (1 + Math.tanh(0.7978845608 * (x + 0.044715 * x ** 3)));

// Single-head self-attention over a sequence of d-dim token vectors.
function selfAttention(X, Wq, Wk, Wv, causal = false) {
  const dk = Wq.length; // output width of the query projection
  const Q = X.map(x => matvec(Wq, x));
  const K = X.map(x => matvec(Wk, x));
  const V = X.map(x => matvec(Wv, x));
  const scale = 1 / Math.sqrt(dk);
  return X.map((_, i) => {
    const scores = K.map((k, j) =>
      (causal && j > i) ? -Infinity : dot(Q[i], k) * scale); // mask future
    const w = softmax(scores);
    // weighted sum of value vectors
    return V[0].map((_, d) => w.reduce((s, wj, j) => s + wj * V[j][d], 0));
  });
}

function ffn(x, W1, b1, W2, b2) {
  const h = add(matvec(W1, x), b1).map(gelu); // d -> 4d
  return add(matvec(W2, h), b2);              // 4d -> d
}

// Pre-norm block: x + sublayer(LayerNorm(x)), twice.
function transformerBlock(X, p, causal = false) {
  const att = selfAttention(X.map(x => layerNorm(x)), p.Wq, p.Wk, p.Wv, causal);
  const X1 = X.map((x, i) => add(x, att[i])); // residual #1
  return X1.map(x => add(x, ffn(layerNorm(x), p.W1, p.b1, p.W2, p.b2))); // residual #2
}
Two details that the picture in your head often gets wrong. First, attention is the only step that reads across tokens — ffn and layerNorm operate on one token vector at a time. Second, the residual add happens outside the normalized branch: you normalize a copy, run the sublayer on it, and add the result back to the untouched x. That untouched path is what lets gradients flow straight through.
Python implementation
The same block in NumPy, batched as full matrices the way a GPU actually runs it, plus multi-head attention.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(0.7978845608 * (x + 0.044715 * x**3)))

def multi_head_attention(X, Wq, Wk, Wv, Wo, h, causal=False):
    n, d = X.shape
    dk = d // h
    Q = (X @ Wq).reshape(n, h, dk).transpose(1, 0, 2)  # [h, n, dk]
    K = (X @ Wk).reshape(n, h, dk).transpose(1, 0, 2)
    V = (X @ Wv).reshape(n, h, dk).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dk)    # [h, n, n]
    if causal:                                         # forbid looking ahead
        scores += np.triu(np.full((n, n), -1e9), k=1)
    out = softmax(scores) @ V                          # [h, n, dk]
    out = out.transpose(1, 0, 2).reshape(n, d)         # concat heads
    return out @ Wo

def ffn(x, W1, b1, W2, b2):
    return gelu(x @ W1 + b1) @ W2 + b2                 # d -> 4d -> d

def transformer_block(X, p, h=8, causal=False):
    X = X + multi_head_attention(layer_norm(X), p["Wq"], p["Wk"],
                                 p["Wv"], p["Wo"], h, causal)       # residual 1
    X = X + ffn(layer_norm(X), p["W1"], p["b1"], p["W2"], p["b2"])  # residual 2
    return X

# A decoder-only stack: just run the same block L times, causal-masked.
def gpt_stack(X, params, L, h=8):
    for layer in range(L):
        X = transformer_block(X, params[layer], h=h, causal=True)
    return layer_norm(X)  # final norm before the logit projection
Notice gpt_stack is barely longer than the single block: a decoder-only LLM really is the same masked block run L times, then a final norm and a projection to vocabulary logits. In this code the np.triu(..., k=1) mask is the only architectural difference between an encoder-style stack (BERT) and a decoder-style stack (GPT): strip it out and the same block does bidirectional attention. (The real BERT also differs in its training objective and uses post-norm.)
Variants worth knowing
Encoder-only (BERT, RoBERTa). Bidirectional attention, no causal mask, no generation head. Trained with masked-language-modeling to produce embeddings for classification, search, and retrieval.
Decoder-only (GPT, Llama, Mistral, Claude). Causal mask only, no separate encoder. This is the dominant LLM shape today — one stack, trained to predict the next token, scales cleanly to hundreds of billions of parameters.
Encoder-decoder (the original 2017 model, T5, BART). Both stacks plus cross-attention. Natural fit for sequence-to-sequence tasks where input and output are distinct, like translation and summarization.
Mixture-of-Experts (Switch Transformer, Mixtral, DeepSeek). Replace the single FFN with many expert FFNs and a router that sends each token to only a couple of them. You get the capacity of a huge model while only paying compute for the activated experts.
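The routing idea can be sketched in a few lines. This is a simplified illustration, not any particular model's router: the router logits are passed in directly (a real router computes them from the token with a learned linear layer), the experts are stand-in functions rather than full FFNs, and renormalizing the gate weights over the selected experts is one common choice among several.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    top = ranked[:k]
    gates = softmax([router_logits[e] for e in top])
    return list(zip(top, gates))

def moe_ffn(x, router_logits, experts, k=2):
    """Gate-weighted sum of the selected experts' outputs; the rest never run."""
    out = [0.0] * len(x)
    for expert_id, gate in route_top_k(router_logits, k):
        y = experts[expert_id](x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out

# Four stand-in "experts" that just scale the input (real ones are full FFNs).
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]

print(route_top_k(logits))                  # [(1, 0.5), (3, 0.5)]
print(moe_ffn([1.0, 2.0], logits, experts)) # [3.0, 6.0]
```

Only two of the four experts execute for this token, which is where the compute saving comes from: parameter count scales with the number of experts, per-token FLOPs scale with k.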
Efficient-attention variants. FlashAttention (IO-aware exact attention), grouped-query and multi-query attention (share K/V across heads to shrink the KV cache), and linear-attention / state-space hybrids (Mamba, RWKV) that trade the O(n²) cost for linear scaling at some quality cost.
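The KV-cache saving from sharing K/V across heads is simple arithmetic. The sketch below uses a hypothetical configuration (32 layers, head size 128, 8192-token context, fp16), chosen for round numbers rather than taken from any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    # Two cached tensors (K and V) per layer, each [n_tokens, n_kv_heads, head_dim].
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Same 32 query heads in every case; only the number of K/V heads changes.
for name, kv_heads in (("multi-head", 32), ("grouped-query", 8), ("multi-query", 1)):
    gib = kv_cache_bytes(32, kv_heads, 128, 8192) / 2**30
    print(f"{name:14s} {gib:.3f} GiB per sequence")
# multi-head     4.000 GiB per sequence
# grouped-query  1.000 GiB per sequence
# multi-query    0.125 GiB per sequence
```

The cache shrinks in direct proportion to the number of K/V heads, which is why grouped-query attention is a popular middle ground between full multi-head and multi-query.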
Common bugs and edge cases
- Forgetting positional information. Drop the positional encoding from a bidirectional model and attention can no longer tell word order apart: "dog bites man" and "man bites dog" produce the same per-token outputs, just reordered. The bug is silent: training loss is just mysteriously high.
- Adding the residual to the wrong tensor. The residual must add the sublayer's input, not the normalized copy: x + Sublayer(LayerNorm(x)), never LayerNorm(x) + Sublayer(LayerNorm(x)). The second corrupts the residual stream and quietly hurts deep models.
- Skipping the √d_k scale. Without it, dot products grow with dimension, softmax saturates to near one-hot, and gradients vanish, so large models stop learning.
- A leaky causal mask. In a decoder, an off-by-one in the mask lets a position attend to the very token it is supposed to predict, so the model "cheats" during training and collapses to a near-zero loss that never generalizes.
- Numerically unstable softmax. Always subtract the row max before exponentiating; otherwise large logits overflow to inf and produce NaN.
- Assuming attention is the expensive part. At typical context lengths the FFN dominates compute and parameters. Optimizing attention while ignoring the FFN leaves most of the cost on the table.
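A leaky causal mask is one of the few bugs in this list that a unit test catches directly: perturb the last token and confirm that no earlier output moves. The sketch below is a minimal version of that check, using a toy attention with Q = K = V = X and a hypothetical `offset` parameter that exists only to inject the off-by-one on purpose.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(X, offset=1):
    """Toy single-head attention with Q = K = V = X.

    offset=1 masks every j > i (correct). offset=2 is a deliberate
    off-by-one that lets position i see position i + 1.
    """
    d = len(X[0])
    out = []
    for i, q in enumerate(X):
        scores = [-math.inf if j >= i + offset
                  else sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for j, k in enumerate(X)]
        w = softmax(scores)
        out.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return out

def leaks_future(attn, X, eps=1e-12):
    """True if changing the last token alters the output at any earlier position."""
    base = attn(X)
    changed = [row[:] for row in X]
    changed[-1] = [v + 1.0 for v in changed[-1]]
    after = attn(changed)
    return any(abs(a - b) > eps
               for i in range(len(X) - 1)
               for a, b in zip(base[i], after[i]))

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(leaks_future(causal_attention, X))                         # False
print(leaks_future(lambda X: causal_attention(X, offset=2), X))  # True
```

The same perturbation test works against a full model: it needs no knowledge of the mask's internals, only the guarantee that earlier outputs must not depend on later inputs.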
Frequently asked questions
What is the difference between the encoder and the decoder in a transformer?
The encoder uses bidirectional self-attention — every token sees every other token — and produces context-rich representations of the input. The decoder uses masked (causal) self-attention so each position only sees earlier positions, plus a cross-attention layer that reads the encoder output. Encoder-only models (BERT) classify and embed; decoder-only models (GPT, Llama) generate; encoder-decoder models (T5, the original 2017 transformer) translate and summarize.
Why does a transformer need residual connections and layer normalization?
Residual connections (x + Sublayer(x)) give gradients a direct path back through dozens of layers, preventing the vanishing-gradient collapse that would otherwise make deep stacks untrainable. Layer normalization rescales each token vector to zero mean and unit variance, keeping activation magnitudes stable so the residual stream doesn't explode. Without both, a 96-layer model like GPT-3 is extremely hard to train to convergence.
What does the feed-forward network in a transformer block do?
The position-wise feed-forward network (FFN) applies the same two-layer MLP to each token independently: expand to a hidden dimension d_ff (usually 4×d_model), apply a nonlinearity (ReLU or GELU), then project back down. Attention mixes information across tokens; the FFN is where most per-token computation and most of the parameters live — about two-thirds of the weights in a typical block.
Why are transformers faster to train than RNNs?
An RNN processes a sequence one step at a time — step t depends on step t−1, so the n steps cannot be parallelized. A transformer computes attention over all n positions in a single batched matrix multiply, so the whole sequence runs in parallel on a GPU. The trade-off is O(n²) memory and compute in sequence length versus the RNN's O(n).
What is the pre-norm versus post-norm difference in transformers?
The original 2017 transformer used post-norm: LayerNorm(x + Sublayer(x)). Modern LLMs use pre-norm: x + Sublayer(LayerNorm(x)), normalizing the input to each sublayer instead of the output. Pre-norm keeps the residual stream clean and trains stably without the learning-rate warmup that post-norm needs, which is why GPT-2 onward, Llama, and most current models use it.
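The two orderings differ by where one function call sits. The sketch below makes that concrete with a sublayer that outputs zeros, an artificial case chosen only to expose the residual path; `pre_norm_step` and `post_norm_step` are hypothetical names, and the layer norm omits the learned gain and bias.

```python
import math

def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def post_norm_step(x, sublayer):
    # Original 2017 ordering: LayerNorm(x + Sublayer(x))
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_step(x, sublayer):
    # Modern ordering: x + Sublayer(LayerNorm(x))
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

zero_sublayer = lambda v: [0.0] * len(v)
x = [1.0, 2.0, 6.0]
print(pre_norm_step(x, zero_sublayer))   # [1.0, 2.0, 6.0]: x passes through untouched
print(post_norm_step(x, zero_sublayer))  # rescaled to zero mean, unit variance
```

Even when the sublayer contributes nothing, post-norm rewrites the stream at every layer, while pre-norm leaves an identity path from input to output. That untouched path is what "keeps the residual stream clean" refers to.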
How many parameters does a transformer have?
Roughly 12 · n_layers · d_model² for the transformer blocks, plus the embedding and output matrices. For GPT-3: 96 layers, d_model 12288, giving about 175 billion parameters. The 4× FFN expansion means the feed-forward layers hold most of the weight; attention projections (Q, K, V, O) account for the remaining third.