Positional Encoding
How a transformer learns that word order matters
Positional encoding injects word order into a transformer, which otherwise sees an unordered bag of tokens, by adding sinusoidal or learned vectors to each token embedding so attention can tell position 1 from position 100.
- Introduced: Vaswani et al., 2017
- Sinusoidal params: 0 (fixed)
- Learned params: max_len × d
- Sinusoid base: 10000
- Modern default: RoPE
Why attention is blind to order
A recurrent network reads a sentence one token at a time, so order is baked into the computation — token 5 is processed after token 4, full stop. A transformer throws that away. Every token attends to every other token simultaneously, in one parallel matrix multiply. That parallelism is the whole reason transformers train fast on GPUs, but it comes with a sharp cost: self-attention is permutation-equivariant. Shuffle the input tokens and the output is the exact same set of vectors, merely reordered. The math has no notion of "first" or "last."
Concretely, the attention score between two tokens is a dot product of their query and key vectors, and those vectors are computed from token content alone. So "the dog bit the man" and "the man bit the dog" contain the same multiset of tokens and would produce identical representations — the model literally cannot tell who bit whom. Positional encoding is the fix: give every position its own distinct vector and fold it into the token, so position 1 no longer looks like position 100.
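The permutation claim can be checked numerically. The following is a minimal sketch, assuming a single attention head with random weights, no mask, and no positional signal; the function name `self_attention` and the toy sizes are illustrative choices, not taken from any library.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional signal and no mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
L, d = 5, 8                        # toy sizes, chosen arbitrarily
x = rng.normal(size=(L, d))        # stand-in token embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(L)
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the inputs only shuffles the outputs: same vectors, new order.
equivariant = np.allclose(out[perm], out_shuffled)
```

Here `equivariant` comes out true for any permutation, which is the symmetry that positional encoding has to break.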
How sinusoidal encoding works
The 2017 "Attention Is All You Need" paper used a fixed, parameter-free encoding. For a position pos and an embedding dimension index i in a model of width d:
PE(pos, 2i) = sin( pos / 10000^(2i/d) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )
Read it as a stack of clocks. Each pair of dimensions is one clock, and its hand sweeps at a frequency that drops geometrically as i grows. The fastest clock (i = 0) has wavelength 2π ≈ 6.28, so it completes a full turn about every six positions. The slowest clock has a wavelength approaching 10000·2π ≈ 62,800 positions, so over a normal sentence its hand barely moves. Together the clock readings form a unique fingerprint for every position, the way hours-minutes-seconds uniquely stamp a moment in a day.
The resulting d-dimensional vector is simply added to the token embedding before the first attention layer. Two properties make this elegant:
- Relative offsets become linear. Because of the sine/cosine angle-addition identity, PE(pos + k) is a fixed rotation of PE(pos) that depends only on the offset k, not on pos. So the model can learn a single weight pattern meaning "attend three tokens back" and reuse it everywhere in the sequence.
- It is defined for any position. The formula is just trigonometry, so position 5,000 has a perfectly valid encoding even if training never went past 512, at least in principle.
Computing the table is O(L · d) time and space for length L, and it can be precomputed once and cached. It adds nothing to attention's O(L² · d) cost.
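The "fixed rotation" property follows from the angle-addition identities and is easy to verify. This is a sketch under the interleaved sin/cos layout of the formula above, assuming an even width d; `pe_row` and `offset_matrix` are hypothetical helper names introduced only for this check.

```python
import numpy as np

def pe_row(pos, d, base=10000.0):
    """Sinusoidal encoding of a single position, interleaved sin/cos."""
    freq = 1.0 / base ** (np.arange(0, d, 2) / d)    # one frequency per pair
    row = np.empty(d)
    row[0::2] = np.sin(pos * freq)
    row[1::2] = np.cos(pos * freq)
    return row

def offset_matrix(k, d, base=10000.0):
    """Block-diagonal matrix M_k such that M_k @ PE(pos) == PE(pos + k)."""
    freq = 1.0 / base ** (np.arange(0, d, 2) / d)
    m = np.zeros((d, d))
    for j, w in enumerate(freq):
        c, s = np.cos(k * w), np.sin(k * w)
        # sin(a + b) = sin a cos b + cos a sin b
        # cos(a + b) = cos a cos b - sin a sin b
        m[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return m

d, k = 16, 3
m = offset_matrix(k, d)            # built from the offset k only, never from pos
same_everywhere = all(
    np.allclose(m @ pe_row(p, d), pe_row(p + k, d)) for p in (0, 7, 500)
)
```

The matrix is built without ever looking at pos, yet it maps PE(pos) to PE(pos + k) at every position tested, which is what lets one learned weight pattern serve the whole sequence.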
Learned vs fixed, and where the position lives
You do not have to hand-derive the encoding. BERT and the original GPT use a learned absolute embedding instead: a trainable lookup table of shape max_len × d, indexed by position, trained by gradient descent like any other parameter. The Vaswani paper reported sinusoidal and learned absolute encodings perform "nearly identically" on translation, so they shipped sinusoidal for one reason — it might extrapolate to longer sequences.
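A learned absolute embedding is little more than an indexed table. The sketch below assumes NumPy only and leaves out the training loop; the class name `LearnedPositionalEmbedding` and the 0.02 initialisation scale are illustrative assumptions, not the exact setup of BERT or GPT.

```python
import numpy as np

class LearnedPositionalEmbedding:
    """Lookup table of shape (max_len, d); the rows would be trained by gradient descent."""

    def __init__(self, max_len, d, seed=0):
        rng = np.random.default_rng(seed)
        # Random start; a real model updates these values during training.
        self.table = rng.normal(scale=0.02, size=(max_len, d))

    def __call__(self, token_embeddings):
        seq_len = token_embeddings.shape[0]
        if seq_len > self.table.shape[0]:
            # There is simply no row for positions the table was not built for.
            raise ValueError(f"sequence length {seq_len} exceeds max_len {self.table.shape[0]}")
        return token_embeddings + self.table[:seq_len]

pos_emb = LearnedPositionalEmbedding(max_len=512, d=64)
tokens = np.zeros((10, 64))        # stand-in token embeddings
with_position = pos_emb(tokens)    # shape (10, 64)
```

The explicit length check makes the hard cap visible: unlike the sinusoidal formula, nothing can be computed for position max_len or beyond.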
There is a deeper axis than learned-vs-fixed: where position enters the network.
- Absolute, added at the input (sinusoidal, learned-absolute): position is mixed into the token once, then carried through residual connections. Simple, but the only signal of "token 7 is two ahead of token 5" is whatever survives in the residual stream.
- Relative / multiplicative, applied inside attention (RoPE, ALiBi, T5 relative bias): position is re-injected at every attention layer, directly shaping the query–key score by the distance between the two tokens. This is why these schemes dominate modern long-context LLMs.
Choosing a scheme
- Fixed-length classification (sentiment, NER, ≤512 tokens): learned absolute is fine and is what BERT uses. You will never see positions outside training range.
- Long-context language models: use rotary (RoPE). It is the default in LLaMA, Mistral, GPT-NeoX, PaLM, and Qwen, and plays well with length-extension tricks.
- You need cheap length extrapolation with no retraining: ALiBi adds a linear distance penalty to attention scores and was designed so a model trained at 1,024 tokens runs at 2,048+ with little loss.
- Encoder–decoder text-to-text (T5 family): learned relative position buckets added as a scalar bias to attention logits, shared across layers.
Comparison of positional schemes
| | Sinusoidal | Learned absolute | RoPE (rotary) | ALiBi | T5 relative bias |
|---|---|---|---|---|---|
| Parameters | 0 | max_len × d | 0 | 0 (fixed slopes) | ~heads × buckets |
| Absolute or relative | Absolute | Absolute | Relative (via rotation) | Relative | Relative |
| Where applied | Added at input | Added at input | Rotates Q,K each layer | Bias on scores each layer | Bias on scores each layer |
| Length extrapolation | Weak | None (hard cap) | Good with scaling | Strong by design | Moderate |
| Used by | Transformer 2017 | BERT, GPT-1/2 | LLaMA, Mistral, PaLM | BLOOM, MPT | T5, mT5 |
The headline trend: the field migrated from absolute-added-at-input toward relative-applied-inside-attention, because that is what lets a 4k-token model stretch to 128k without retraining from scratch.
What the numbers actually say
- Sinusoidal costs zero parameters. A learned absolute table for a 2,048-token, 4,096-wide model is 2048 × 4096 ≈ 8.4 million parameters — small next to the billions in the rest of the model, but a hard wall at 2,048 tokens.
- The base 10000 is just a hyperparameter. NTK-aware scaling raises it to stretch context, and YaRN builds on that idea by rescaling the frequencies unevenly across dimensions. A widely reproduced result: rescaling RoPE lets a LLaMA-2 model trained at 4,096 tokens run coherently at 32k or even 128k with only a few hundred steps of fine-tuning.
- ALiBi was tuned for extrapolation: its authors trained at 1,024 tokens and evaluated at 2,048, matching the perplexity of a sinusoidal model trained directly at 2,048 while training faster and using less memory — position generalization for free.
- Adding position is essentially free at runtime. The O(L·d) add (or the per-element rotation in RoPE) is dwarfed by attention's O(L²·d) and the feed-forward layers.
JavaScript implementation
The sinusoidal table, exactly as in the 2017 paper:

    // Returns an L × d matrix of fixed sinusoidal positional encodings.
    function sinusoidalPE(maxLen, d, base = 10000) {
      const pe = Array.from({ length: maxLen }, () => new Float32Array(d));
      for (let pos = 0; pos < maxLen; pos++) {
        for (let i = 0; i < d; i += 2) {
          // i is the even dimension index, so base^(i/d) matches 10000^(2i/d) in the formula
          const denom = Math.pow(base, i / d);
          const angle = pos / denom;
          pe[pos][i] = Math.sin(angle);
          if (i + 1 < d) pe[pos][i + 1] = Math.cos(angle);
        }
      }
      return pe;
    }

    // Add it to a batch of token embeddings (shape L × d), in place.
    function addPositional(embeddings, pe) {
      for (let pos = 0; pos < embeddings.length; pos++) {
        for (let i = 0; i < embeddings[pos].length; i++) {
          embeddings[pos][i] += pe[pos][i];
        }
      }
      return embeddings;
    }

Two things to flag. First, the encoding is computed once and added — it is not part of the gradient (sinusoidal has no parameters). Second, the even index gets sin and the odd index its cos partner at the same frequency; that sine/cosine pairing is exactly what makes a position offset a clean rotation.
Python implementation
Sinusoidal in vectorized NumPy, plus the rotary (RoPE) variant that modern LLMs actually use:

    import numpy as np

    def sinusoidal_pe(max_len, d, base=10000.0):
        pos = np.arange(max_len)[:, None]          # (L, 1)
        i = np.arange(0, d, 2)[None, :]            # (1, ceil(d/2)), even dimension indices
        denom = base ** (i / d)                    # (1, ceil(d/2))
        angle = pos / denom                        # (L, ceil(d/2))
        pe = np.zeros((max_len, d))
        pe[:, 0::2] = np.sin(angle)
        pe[:, 1::2] = np.cos(angle[:, : d // 2])   # drop the unpaired column when d is odd
        return pe                                  # (L, d)

    # ---- Rotary positional embedding (RoPE) ----
    # Instead of ADDING position, rotate each (x_2i, x_2i+1) pair of the
    # query/key by an angle proportional to position. Relative offset between
    # two tokens then falls out of the dot product automatically.
    def rope(x, base=10000.0):
        L, d = x.shape
        assert d % 2 == 0                          # rotation needs complete pairs
        pos = np.arange(L)[:, None]                # (L, 1)
        i = np.arange(0, d, 2)[None, :]            # (1, d/2)
        theta = pos / (base ** (i / d))            # (L, d/2)
        cos, sin = np.cos(theta), np.sin(theta)
        x_even, x_odd = x[:, 0::2], x[:, 1::2]
        out = np.empty_like(x)
        out[:, 0::2] = x_even * cos - x_odd * sin  # rotate each pair
        out[:, 1::2] = x_even * sin + x_odd * cos
        return out

    # Usage: apply rope() to Q and K BEFORE the attention dot product,
    # at every attention layer, never to the value vectors.

Note the key structural difference: sinusoidal_pe is added to embeddings once at the input, while rope rotates the query and key vectors inside every attention layer and is never applied to the values.
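To see why rotating queries and keys yields a relative encoding, it helps to compare attention scores at the same offset but different absolute positions. The sketch below is self-contained and assumes the same interleaved pairing as the rotary code in this section; `rope_at` is a hypothetical variant that accepts explicit positions, introduced only for this check.

```python
import numpy as np

def rope_at(x, positions, base=10000.0):
    """Rotate each (even, odd) pair of row t by positions[t] times that pair's frequency."""
    d = x.shape[-1]
    assert d % 2 == 0
    freq = 1.0 / base ** (np.arange(0, d, 2) / d)     # (d/2,)
    theta = np.asarray(positions)[:, None] * freq      # (L, d/2)
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))        # one query vector
k = rng.normal(size=(1, 8))        # one key vector

def score(q_pos, k_pos):
    """Query-key dot product after rotating each vector to its position."""
    return (rope_at(q, [q_pos]) @ rope_at(k, [k_pos]).T).item()

near = score(5, 2)                 # offset 3, early in the sequence
far = score(1005, 1002)            # offset 3, a thousand tokens later
other = score(5, 3)                # offset 2, for contrast
```

`near` and `far` agree to floating-point precision because the two rotations cancel down to the offset inside the dot product, while `other` generally differs.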
Variants worth knowing
Rotary Positional Embedding (RoPE). Su et al., 2021. Encodes absolute position as a rotation but makes the query–key dot product depend only on the relative offset. The dominant choice in 2023+ open LLMs (LLaMA, Mistral, Qwen, DeepSeek).
ALiBi (Attention with Linear Biases). Press et al., 2022. No positional vectors at all — it just subtracts a per-head linear penalty proportional to the distance between two tokens from the attention score. Built for length extrapolation; used in BLOOM and MPT.
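A minimal sketch of the ALiBi bias for causal attention follows. It assumes the head count is a power of two, for which the paper's slopes form the geometric sequence 2^(-8/n), 2^(-16/n), and so on; the function names are illustrative, and causal masking of future keys is assumed to happen elsewhere.

```python
import numpy as np

def alibi_slopes(n_heads):
    """Per-head slopes as a geometric sequence; assumes n_heads is a power of two."""
    start = 2.0 ** (-8.0 / n_heads)
    return start ** np.arange(1, n_heads + 1)

def alibi_bias(n_heads, seq_len):
    """(n_heads, L, L) bias to add to attention scores before the softmax."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]     # query index minus key index
    distance = np.maximum(distance, 0)         # future keys are handled by the causal mask
    return -alibi_slopes(n_heads)[:, None, None] * distance[None, :, :]

bias = alibi_bias(n_heads=8, seq_len=6)        # shape (8, 6, 6)
```

Because the penalty depends only on distance, the same bias formula applies unchanged at sequence lengths never seen in training.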
T5 relative position bias. A learned scalar added to attention logits, looked up from a bucketed function of token distance, shared across all layers. Logarithmic bucketing lets it cover long ranges with few parameters.
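The bucketing idea can be sketched for the causal (one-directional) case: small distances each get their own bucket, and larger ones share logarithmically spaced buckets. This is a simplified illustration, not the T5 reference code; the function name and the defaults of 32 buckets and a maximum distance of 128 are assumptions made for the example.

```python
import math

def t5_causal_bucket(distance, num_buckets=32, max_distance=128):
    """Map a non-negative query-minus-key distance to a bucket index."""
    if distance < 0:
        raise ValueError("causal sketch: keys must not be ahead of the query")
    max_exact = num_buckets // 2
    if distance < max_exact:
        return distance                        # one bucket per small distance
    # Remaining buckets are spaced logarithmically up to max_distance.
    log_ratio = math.log(distance / max_exact) / math.log(max_distance / max_exact)
    bucket = max_exact + int(log_ratio * (num_buckets - max_exact))
    return min(bucket, num_buckets - 1)        # everything farther shares the last bucket

buckets = [t5_causal_bucket(dist) for dist in (0, 1, 15, 16, 32, 127, 10_000)]
```

The learned bias is then a table indexed by head and bucket, so a few dozen scalars per head cover arbitrarily long distances.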
NTK-aware scaling and YaRN. Not new encodings but ways to stretch RoPE past its training length by rescaling the base / interpolating frequencies — how a 4k-context model is fine-tuned up to 32k–128k.
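The difference between plain position interpolation and NTK-aware base scaling shows up directly in the per-pair frequencies. The sketch below assumes the commonly cited NTK-aware rule base' = base · s^(d/(d−2)); the helper names and the values d = 128 and s = 8 are illustrative, and YaRN's per-frequency ramp is not modelled here.

```python
import numpy as np

def rope_freqs(d, base=10000.0):
    """Rotation frequency of each (even, odd) pair."""
    return 1.0 / base ** (np.arange(0, d, 2) / d)

def ntk_scaled_base(base, scale, d):
    """Raise the base so that the slowest pair's wavelength stretches by `scale`."""
    return base * scale ** (d / (d - 2))

d, scale = 128, 8.0                # e.g. a 4k -> 32k stretch; illustrative values
orig = rope_freqs(d)
ntk = rope_freqs(d, ntk_scaled_base(10000.0, scale, d))
interp = orig / scale              # plain position interpolation: slow every pair equally

fast_unchanged = np.isclose(ntk[0], orig[0])            # local detail is preserved
slow_stretched = np.isclose(ntk[-1], orig[-1] / scale)  # long range is stretched
```

Position interpolation slows every clock by the same factor, whereas the base change leaves the fastest pair untouched and stretches only the slow ones, which is the usual argument for why it needs less fine-tuning.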
NoPE (no positional encoding). Recent finding that decoder-only models with causal masking can sometimes infer position from the mask alone, learning order implicitly — a reminder that the mask itself carries some positional information.
Common bugs and edge cases
- Feeding positions beyond max_len to a learned table. Learned absolute embeddings have no row for unseen positions, so you get an index-out-of-range crash or, worse, garbage from a wrapped index. Sinusoidal/RoPE won't crash but still degrade past training length.
- Mismatched sin/cos pairing. If you fill all even-and-odd dims with sin using different frequencies, you lose the rotation property and relative-offset attention stops working. Each frequency needs its matched sin/cos pair.
- Applying RoPE to value vectors. RoPE goes on queries and keys only. Rotating the values corrupts the content that attention is supposed to mix.
- Scaling embeddings but not the encoding (or vice versa). The 2017 model multiplies token embeddings by √d before adding the encoding so the two have comparable magnitude; skip it and position drowns out content or vice versa.
- Forgetting padding offsets. If you left-pad sequences, position 0 lands on a pad token and real tokens get shifted indices. Either right-pad or compute positions over non-pad tokens only.
- Assuming "longer context" is free. Doubling sequence length quadruples attention cost (O(L²)); the positional encoding extends cheaply, but the attention it feeds does not.
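For the padding bug in particular, one common remedy is to derive position indices from the attention mask rather than from raw array offsets. The sketch below assumes a mask with 1 for real tokens and 0 for padding; `position_ids_from_mask` is a hypothetical helper name.

```python
import numpy as np

def position_ids_from_mask(attention_mask):
    """attention_mask: (batch, L) array with 1 for real tokens and 0 for padding."""
    mask = np.asarray(attention_mask)
    ids = np.cumsum(mask, axis=1) - 1      # the first real token in each row gets 0
    return np.where(mask == 1, ids, 0)     # pad slots get a dummy position of 0

mask = np.array([[0, 0, 1, 1, 1],          # left-padded sequence
                 [1, 1, 1, 1, 1]])         # full-length sequence
ids = position_ids_from_mask(mask)
# ids is [[0, 0, 0, 1, 2],
#         [0, 1, 2, 3, 4]]
```

The pad slots still need to be masked out of attention; the dummy position only keeps the lookup in range.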
Frequently asked questions
Why does a transformer need positional encoding at all?
Self-attention is permutation-equivariant: shuffle the input tokens and the output is the same set of vectors, just shuffled. Nothing in the attention math depends on order, so "the dog bit the man" and "the man bit the dog" would produce identical token representations. Positional encoding breaks that symmetry by making each position's vector distinct.
Why sinusoids of geometrically decreasing frequency?
Each dimension is a sine or cosine whose wavelength grows geometrically from 2π up to about 10000·2π. Low dimensions oscillate fast and encode fine position; high dimensions oscillate slowly and encode coarse position — like the hand positions of nested clocks. Crucially, the encoding of position pos+k is a fixed linear (rotation) function of the encoding of pos, so the model can learn to attend by relative offset.
What is the difference between sinusoidal, learned, and rotary positional encoding?
Sinusoidal is a fixed formula added to embeddings: zero parameters, and it extrapolates in principle. Learned absolute embeddings are a trainable lookup table of size max_len × d; they perform about the same in-distribution (the 2017 paper reported nearly identical results) but are capped at the max_len positions the table was built with. Rotary (RoPE) rotates the query and key vectors by an angle proportional to position, injecting position multiplicatively inside attention rather than additively at the input; it is the default in LLaMA, GPT-NeoX, PaLM, and most modern LLMs.
Can a transformer handle sequences longer than it was trained on?
Not for free. Learned absolute embeddings have no entry for positions beyond max_len and simply fail. Sinusoidal and RoPE are defined for any position, but models trained on length L still degrade sharply past L because they never saw those rotation angles. ALiBi and RoPE-scaling tricks like NTK-aware interpolation and YaRN exist specifically to extend context — for example stretching a 4k-trained LLaMA to 32k or 128k tokens.
Do positional encodings get added to every layer?
In the original 2017 transformer, the absolute encoding is added once, to the input embeddings, and then carried through residual connections. Rotary and relative schemes are different: they re-apply position at every attention layer, rotating or biasing the query–key dot product each time, which is part of why they generalize to longer contexts better.
Why is the constant 10000 used in the sinusoidal formula?
It sets the maximum wavelength. With base 10000, the slowest dimension completes one cycle over roughly 10000·2π ≈ 62800 positions, so even very long sequences map to distinct points on that slow wave. The choice is a hyperparameter, not magic — RoPE-scaling methods literally change this base to stretch the effective context window.
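The wavelength range implied by the base can be computed directly. This is a small sketch assuming d = 512, the width used in the 2017 base model; `wavelengths` is an illustrative helper name.

```python
import numpy as np

def wavelengths(d, base=10000.0):
    """Positions per full cycle for each sin/cos pair."""
    return 2 * np.pi * base ** (np.arange(0, d, 2) / d)

w = wavelengths(512)
shortest = w[0]                 # exactly 2π, about 6.28 positions
longest = w[-1]                 # approaches 2π · base from below
ratio = w[1:] / w[:-1]          # constant ratio, i.e. a geometric progression
```

Changing `base` moves only the long end of this range: the shortest wavelength stays at 2π while the longest scales with the base.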