Rotary Position Embedding (RoPE)
Encode position by spinning the vectors — so attention sees distance, not coordinates
Rotary Position Embedding (RoPE) encodes a token's position by rotating its query and key vectors by an angle proportional to position, so the attention dot product depends only on the relative distance between tokens — the scheme behind LLaMA, GPT-NeoX, and Mistral.
- Introduced: RoFormer, Su et al. 2021
- Learned parameters: 0
- Applied to: Q and K, not V
- Cost per token: O(d)
- Frequency schedule: θ_i = 10000^(−2i/d)
The problem RoPE solves
Self-attention on its own is blind to order (formally, it is permutation-equivariant). Feed a transformer the tokens dog bit man and every pair of tokens receives exactly the same attention score as in man bit dog, because the query·key dot product never looks at where a token sits, only at what it is. Without some injection of position, a language model is a bag-of-words machine.
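A toy check makes the order-blindness concrete. This is only a sketch: the three embedding vectors are made-up numbers, and the query and key projections are assumed to be the identity so that the score is a plain dot product.

```python
# Toy attention scores with no positional signal.
# EMB holds made-up embeddings, purely for illustration.
EMB = {
    "dog": [1.0, 0.0, 2.0],
    "bit": [0.5, 1.5, 0.0],
    "man": [0.0, 2.0, 1.0],
}

def scores(tokens):
    """Map (query_token, key_token) to their dot product (identity Q/K projections)."""
    return {
        (a, b): sum(x * y for x, y in zip(EMB[a], EMB[b]))
        for a in tokens
        for b in tokens
    }

forward = scores(["dog", "bit", "man"])
reverse = scores(["man", "bit", "dog"])

# Every token pair gets the same score in both sentences: word order is invisible.
print(forward == reverse)
```

Reordering the sentence only reorders the rows and columns of the score matrix; no individual score changes.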
The original 2017 transformer fixed this by adding a fixed sinusoidal position vector to each token embedding before the first layer. It works, but it has a clumsy quality: position and content get summed into the same numbers, the signal must survive every layer of the network unaided, and the dot product between two tokens depends on their absolute positions, not the gap between them.
RoPE takes a different route. Instead of adding anything, it rotates each query and key vector by an angle proportional to its position, applied freshly inside every attention layer. The payoff is a single, elegant property: after rotation, the dot product of a query at position m with a key at position n depends only on the difference m − n. Attention becomes naturally relative — it sees distance, not coordinates.
The mechanism, precisely
Take a query vector q of dimension d (per head). RoPE pairs up its coordinates: (q₀, q₁), (q₂, q₃), and so on — d/2 pairs in total. Each pair is treated as a point in a 2D plane and rotated by an angle that scales with the token's position.
Pair i rotates at its own frequency:
θ_i = base^(−2i/d), i = 0 … d/2 − 1, base = 10000
So pair 0 spins fast (θ₀ = 1 radian per position), and the last pair crawls (θ tiny). At position m, pair i is rotated by angle m·θ_i using the standard 2D rotation matrix, where θ is shorthand for θ_i:
[ q'₀ ] [ cos(mθ) −sin(mθ) ] [ q₀ ]
[ q'₁ ] = [ sin(mθ) cos(mθ) ] [ q₁ ]
Keys are rotated identically using their own position. Here is the whole point, in one line of algebra. Taking the inner product of a query rotated by mθ with a key rotated by nθ is the same as applying a single relative rotation by (n − m)θ between the unrotated vectors:
⟨ R(mθ)·q , R(nθ)·k ⟩ = qᵀ · R((n − m)θ) · k
The absolute positions m and n cancel; only their difference survives. Every attention score is automatically a function of relative distance, with no relative-position lookup table and no learned bias. Because rotation preserves vector length, RoPE also leaves the norm of every query and key untouched — it only changes their angle, never their magnitude.
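For readers who want the cancellation spelled out, here is the derivation for a single pair. It is a sketch that assumes only two standard facts about 2D rotations: the transpose of a rotation is the reverse rotation, R(α)ᵀ = R(−α), and rotations compose by adding angles, R(α)R(β) = R(α + β).

```latex
\begin{aligned}
\langle R(m\theta)\,q,\; R(n\theta)\,k \rangle
  &= \bigl(R(m\theta)\,q\bigr)^{\top} R(n\theta)\,k \\
  &= q^{\top} R(m\theta)^{\top} R(n\theta)\,k \\
  &= q^{\top} R(-m\theta)\, R(n\theta)\,k \\
  &= q^{\top} R\bigl((n-m)\theta\bigr)\,k ,
\\[6pt]
\lVert R(m\theta)\,q \rVert^{2}
  &= q^{\top} R(m\theta)^{\top} R(m\theta)\,q
   = q^{\top} R(0)\,q
   = \lVert q \rVert^{2} .
\end{aligned}
```

The full d-dimensional score is the sum of this expression over all d/2 pairs, each with its own θ_i, so the same cancellation holds for the whole vector.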
In practice nobody multiplies by a sparse rotation matrix. The same result comes from two precomputed tables cos and sin of shape [seq_len, d] and a cheap element-wise formula, which we'll see in the code below.
When to reach for RoPE
- Decoder-only language models. RoPE is the de facto standard for autoregressive LLMs; it is what LLaMA, Mistral, Qwen, and Gemma all use.
- When you care about extrapolation. Relative encoding generalizes to gaps the model has seen, and the frequency structure makes context extension (interpolation, YaRN) tractable.
- When you want zero parameter overhead. RoPE adds nothing to the parameter count and nothing to memorize per position.
- Long-context retrieval. Combined with a scaling trick, RoPE-based models stretch from a few thousand tokens of training context to hundreds of thousands at inference.
If you are training a small encoder where absolute position is genuinely meaningful and sequences never exceed the training length, a learned absolute embedding or ALiBi may be simpler. RoPE earns its keep at scale and at length.
RoPE vs other position schemes
| | RoPE | Sinusoidal (absolute) | Learned absolute | T5 relative bias | ALiBi |
|---|---|---|---|---|---|
| How it injects position | Rotate Q, K | Add to embedding | Add learned vector | Add bias to scores | Add linear distance penalty |
| Relative or absolute | Relative | Absolute | Absolute | Relative | Relative |
| Learned parameters | 0 | 0 | seq_len × d | ~32 buckets/head | 0 (fixed slopes) |
| Applied where | Inside every layer | Before layer 0 | Before layer 0 | Inside every layer | Inside every layer |
| Extrapolation | Good with scaling | Poor | None (out of table) | Moderate | Strong |
| Cost | O(d) per token | O(d) once | O(d) once | O(L²) bias add | O(L²) bias add |
| Used by | LLaMA, Mistral, GPT-NeoX | Original Transformer | BERT, GPT-2 | T5 | BLOOM, MPT |
The headline contrast is RoPE vs ALiBi for long context. ALiBi bakes in a fixed distance penalty that extrapolates beautifully but cannot represent "attend strongly to a token far away." RoPE keeps full expressivity and reaches long context through angle rescaling instead — which is why frontier models lean on RoPE plus YaRN rather than ALiBi.
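To put numbers on that fixed distance penalty, here is a minimal sketch of an ALiBi-style bias. The slope schedule (a geometric sequence from 2^(−8/n) down to 2^(−8) for n heads, n a power of two) is an assumption based on the commonly described ALiBi setup, and `alibi_slopes` and `alibi_bias` are hypothetical helper names.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (assumes num_heads is a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(slope, query_pos, key_pos):
    """Linear penalty added to the raw attention score; it grows with distance."""
    return -slope * (query_pos - key_pos)

slopes = alibi_slopes(8)
for distance in (1, 100, 10_000):
    steepest = alibi_bias(slopes[0], distance, 0)
    gentlest = alibi_bias(slopes[-1], distance, 0)
    print(f"distance {distance:>6}: steepest head {steepest:.1f}, gentlest head {gentlest:.3f}")
```

Because the penalty is subtracted before the softmax and keeps growing with distance, a far-away key has to overcome an ever larger handicap. A RoPE score carries no such built-in decay, since rotation leaves the norms of q and k unchanged.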
What the numbers actually say
- Default base is 10000. With d = 128 per head, the fastest pair turns about 1 radian per token and the slowest completes one full turn only every ≈ 54,000 tokens: a built-in range of "clocks" from fine to coarse.
- Zero added parameters. A learned absolute embedding for a 4096-token context at d_model = 4096 costs about 16.8 million parameters; RoPE costs none.
- Cost is a rounding error. RoPE is O(d) per token per head versus the O(d²) of the Q/K/V projections already present, typically well under 1% of attention FLOPs.
- Context multiplied 8–16×. Position Interpolation extended LLaMA from 2K to 32K context with only 1,000 fine-tuning steps; YaRN pushed RoPE models past 128K tokens, where the original 2K-trained model would produce gibberish beyond ~2K.
- Raising the base helps length. Long-context LLaMA variants bump base from 10000 to 500000 or higher, slowing every clock so the longest period covers the new context window.
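The figures in this list can be rechecked in a few lines. The sketch below assumes the schedule θ_i = base^(−2i/d) from earlier, so the slowest pair is i = d/2 − 1; `slowest_period` is a hypothetical helper name.

```python
import math

def slowest_period(d, base):
    """Tokens needed for the slowest pair (i = d/2 - 1) to complete one full turn."""
    theta_last = base ** (-(d - 2) / d)
    return 2 * math.pi / theta_last

print(round(slowest_period(128, 10_000)))    # roughly 54 thousand tokens
print(round(slowest_period(128, 500_000)))   # a few million tokens
print(4096 * 4096)                           # entries in a learned table at d_model = 4096
```

Raising the base from 10000 to 500000 stretches the longest period by a factor of about 47, which is the sense in which a larger base slows every clock.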
JavaScript implementation
RoPE applied to a single vector. We use the "interleaved pairs" convention (x₀, x₁), (x₂, x₃), …; many LLaMA checkpoints instead split into halves, which we cover in the variants section.
```javascript
// Precompute inverse frequencies once: theta_i = base^(-2i/d)
function invFreqs(d, base = 10000) {
  const f = new Float64Array(d / 2);
  for (let i = 0; i < d / 2; i++) f[i] = Math.pow(base, -(2 * i) / d);
  return f;
}

// Rotate one query/key vector x at integer position `pos`.
// Pairs are interleaved: (x[0],x[1]), (x[2],x[3]), ...
function applyRoPE(x, pos, freqs) {
  const out = new Float64Array(x.length);
  for (let i = 0; i < x.length / 2; i++) {
    const angle = pos * freqs[i];
    const c = Math.cos(angle), s = Math.sin(angle);
    const a = x[2 * i], b = x[2 * i + 1];
    out[2 * i] = a * c - b * s;     // rotated first coordinate
    out[2 * i + 1] = a * s + b * c; // rotated second coordinate
  }
  return out;
}

// The relative property in action: rotate q at m, k at n,
// and the dot product equals q rotated by (m - n) dotted with k.
const d = 8, freqs = invFreqs(d);
const q = Float64Array.from({ length: d }, () => Math.random());
const k = Float64Array.from({ length: d }, () => Math.random());
const dot = (u, v) => u.reduce((s, _, i) => s + u[i] * v[i], 0);

const lhs = dot(applyRoPE(q, 7, freqs), applyRoPE(k, 3, freqs));
const rhs = dot(applyRoPE(q, 4, freqs), k); // 4 = 7 - 3
console.log(lhs.toFixed(6), rhs.toFixed(6)); // identical
```
The closing check is the proof, made concrete: positions 7 and 3 give the same score as positions 4 and 0, because both gaps are 4. That equality is what "attention sees relative distance" means in code.
Python / PyTorch implementation
The production form precomputes cos and sin tables for the whole sequence and applies them to the entire [batch, heads, seq, dim] tensor at once. This is the half-split (GPT-NeoX/LLaMA) convention.
```python
import torch

def rope_tables(seq_len, dim, base=10000.0, device="cpu"):
    # theta_i = base^(-2i/dim) for the first half of the dimensions
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device) / dim))
    pos = torch.arange(seq_len, device=device).float()
    freqs = torch.outer(pos, inv_freq)            # [seq_len, dim/2]
    emb = torch.cat([freqs, freqs], dim=-1)       # [seq_len, dim] (duplicate)
    return emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)                   # split, not interleave
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(q, k, cos, sin):
    # q, k: [batch, heads, seq, dim]; cos, sin: [seq, dim] -> broadcast
    cos, sin = cos[None, None], sin[None, None]
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

# usage inside attention, before computing scores
B, H, S, D = 2, 8, 16, 64
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
cos, sin = rope_tables(S, D)
q, k = apply_rope(q, k, cos, sin)
scores = (q @ k.transpose(-1, -2)) / D**0.5      # relative by construction
```
Note rotate_half: it negates the second half and swaps, which reproduces the 2D rotation when paired with the duplicated cos/sin. It is the vectorized equivalent of the per-pair multiply in the JavaScript version — just organized around a split rather than interleaved layout.
Variants worth knowing
Interleaved vs half-split pairing. The original RoFormer pairs adjacent dimensions; LLaMA-style code pairs dimension i with i + d/2. The two are a fixed permutation apart and are mathematically equivalent, but you must apply RoPE with the same convention the weights were trained under or scores come out scrambled.
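The equivalence, and the cost of getting it wrong, can be checked directly. This is a dependency-free sketch with made-up input values; `interleaved_to_half` is a hypothetical name for the fixed permutation that moves even-indexed entries to the first half and odd-indexed entries to the second.

```python
import math

def inv_freqs(d, base=10000.0):
    return [base ** (-2 * i / d) for i in range(d // 2)]

def rope_interleaved(x, pos, freqs):
    """Rotate adjacent pairs (x[0], x[1]), (x[2], x[3]), ..."""
    out = list(x)
    for i, f in enumerate(freqs):
        c, s = math.cos(pos * f), math.sin(pos * f)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = a * c - b * s, a * s + b * c
    return out

def rope_half_split(x, pos, freqs):
    """Rotate pairs (x[i], x[i + d/2])."""
    half = len(x) // 2
    out = list(x)
    for i, f in enumerate(freqs):
        c, s = math.cos(pos * f), math.sin(pos * f)
        a, b = x[i], x[i + half]
        out[i], out[i + half] = a * c - b * s, a * s + b * c
    return out

def interleaved_to_half(x):
    """Fixed permutation: even-indexed entries first, then odd-indexed entries."""
    return x[0::2] + x[1::2]

d, pos = 8, 5
freqs = inv_freqs(d)
x = [0.3, -1.2, 0.7, 2.0, -0.5, 0.1, 1.4, -0.9]

same_layout = interleaved_to_half(rope_interleaved(x, pos, freqs))
other_layout = rope_half_split(interleaved_to_half(x), pos, freqs)
mismatch = rope_half_split(x, pos, freqs)  # wrong convention for this layout

# Matching convention and layout: the two schemes agree after the permutation.
print(all(math.isclose(a, b) for a, b in zip(same_layout, other_layout)))
# Mismatched convention: the rotated vector is simply different.
print(all(math.isclose(a, b) for a, b in zip(rope_interleaved(x, pos, freqs), mismatch)))
```

The first check passes and the second fails, which is the silent-nonsense failure mode: nothing crashes, the numbers are just wrong.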
Position Interpolation (PI). Linearly compress positions by dividing by a scale factor so a model trained on length L can address length L·s, mapping new positions back into angles the model already saw. Cheap, needs brief fine-tuning, but compresses fine-grained position resolution uniformly.
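As a minimal sketch of the idea, assuming a training length of 2048 and a scale factor of 4 (both illustrative numbers), Position Interpolation amounts to dividing the position by the scale before computing angles. `rope_angles` is a hypothetical helper.

```python
def rope_angles(pos, d, base=10000.0, scale=1.0):
    """Angles for every pair at position `pos`; scale > 1 applies Position Interpolation."""
    return [(pos / scale) * base ** (-2 * i / d) for i in range(d // 2)]

d, scale = 64, 4.0

# Position 8188 lies far outside a 2048-token training range, but after
# compression it produces exactly the angles of position 2047.
extended = rope_angles(8188, d, scale=scale)
trained = rope_angles(2047, d)
print(extended == trained)

# The price: neighbouring tokens are now only 1/scale of a position apart.
step = rope_angles(1, d, scale=scale)[0] - rope_angles(0, d, scale=scale)[0]
print(step)  # 0.25 radians for the fastest pair instead of 1.0
```

The second print shows the uniform loss of resolution: the fastest pair, which separates adjacent tokens by 1 radian at training time, now separates them by a quarter of that.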
NTK-aware scaling. Instead of compressing positions, increase the RoPE base. This stretches the low-frequency (long-period) dimensions while leaving high-frequency dimensions nearly intact, preserving local resolution better than PI — often usable without fine-tuning at all.
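One commonly quoted form of the rule is base' = base · s^(d/(d−2)) for a context scale factor s. Treat that exact exponent as an assumption of this sketch rather than the only variant in use; `ntk_scaled_base` is a hypothetical helper name.

```python
def ntk_scaled_base(base, scale, d):
    """Assumed NTK-aware rule: base' = base * scale^(d / (d - 2))."""
    return base * scale ** (d / (d - 2))

def inv_freqs(d, base):
    return [base ** (-2 * i / d) for i in range(d // 2)]

d, scale = 128, 4.0
old = inv_freqs(d, 10000.0)
new = inv_freqs(d, ntk_scaled_base(10000.0, scale, d))

print(new[0] / old[0])    # fastest pair: unchanged, local resolution untouched
print(new[-1] / old[-1])  # slowest pair: slowed by a factor close to 1 / scale
```

With this exponent the slowest pair ends up stretched by the full scale factor while the fastest pair does not move at all, and the pairs in between are slowed by intermediate amounts.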
YaRN. A refined per-frequency scheme that interpolates low frequencies, extrapolates high frequencies, and adds a temperature correction to attention. It reaches 128K+ context with far less fine-tuning data than PI and is widely used in long-context releases.
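The per-frequency blend can be sketched as a ramp over how many full rotations each pair completes within the original training length. This is a simplified illustration, not the reference implementation: the thresholds `alpha` and `beta` are assumed values, `yarn_like_freqs` is a hypothetical name, and the attention temperature correction is left out.

```python
import math

def yarn_like_freqs(d, base, train_len, scale, alpha=1.0, beta=32.0):
    """Blend interpolated and original frequencies per pair.

    alpha and beta are assumed ramp thresholds, counted in full rotations
    completed within the original training length.
    """
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        rotations = train_len * theta / (2 * math.pi)
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        # ramp = 1: keep theta as is (high frequency); ramp = 0: divide by scale.
        out.append(ramp * theta + (1.0 - ramp) * theta / scale)
    return out

d, base, train_len, scale = 128, 10000.0, 4096, 8.0
original = [base ** (-2 * i / d) for i in range(d // 2)]
blended = yarn_like_freqs(d, base, train_len, scale)

print(blended[0] / original[0])    # fastest pair is left unchanged
print(blended[-1] / original[-1])  # slowest pair is fully interpolated
```

Pairs that spin many times inside the training window already cover their full range of angles, so they are left alone; pairs that never complete a turn are compressed like Position Interpolation; everything in between is mixed linearly.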
3D / multimodal RoPE. Vision and video models extend RoPE to multiple position axes (row, column, time), assigning a slice of the frequency dimensions to each. Models such as Qwen2-VL use this to encode 2D image patches and temporal order.
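A two-axis version shows the slicing idea. The sketch below assumes an even split (the first half of the pairs rotate by row, the second half by column) and a vector length divisible by 4; real models choose their own split, and `rope_2d` is a hypothetical helper with made-up inputs.

```python
import math

def rope_2d(x, row, col, base=10000.0):
    """Rotate the first half of the pairs by `row` and the second half by `col`.

    The even split between the two axes is an assumption made for illustration.
    """
    d = len(x)
    pairs_per_axis = d // 4
    out = list(x)
    for p in range(d // 2):
        axis_pos = row if p < pairs_per_axis else col
        i = p % pairs_per_axis  # frequency index within the axis
        angle = axis_pos * base ** (-2 * i / (d // 2))
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * p], x[2 * p + 1]
        out[2 * p], out[2 * p + 1] = a * c - b * s, a * s + b * c
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.2, -0.4, 1.0, 0.3, -0.7, 0.9, 0.5, -0.1]
k = [1.1, 0.6, -0.3, 0.8, 0.4, -1.2, 0.0, 0.7]

# Same (row, col) offset of (2, -3) at two different absolute locations.
near = dot(rope_2d(q, 5, 4), rope_2d(k, 3, 7))
far = dot(rope_2d(q, 12, 20), rope_2d(k, 10, 23))
print(math.isclose(near, far, abs_tol=1e-12))
```

The score depends only on the row offset and the column offset, so the relative property carries over to each axis independently.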
Common bugs and edge cases
- Pairing convention mismatch. Loading LLaMA weights but applying interleaved RoPE (or vice versa) produces a model that runs without error and outputs nonsense. This is the single most common RoPE bug when porting checkpoints.
- Rotating values. RoPE belongs on Q and K only. Apply it to V and you inject position into the mixed output for no benefit and hurt quality.
- Forgetting RoPE in the KV cache. During generation the cached keys must already be rotated at their absolute positions. Rotate before caching, or rotate consistently on read — but never mix the two.
- Wrong dtype for the angles. Compute cos/sin in float32 even for a bfloat16 model; doing the trig in low precision at large positions accumulates angle error that degrades long-context accuracy.
- Naive length extrapolation. Running a 4K-trained model at 16K with no scaling gives the slow-frequency dimensions angles they never saw; perplexity explodes. Always pair long context with PI, NTK, or YaRN.
- Changing base without re-tuning. Bumping the base extends range but shifts every angle; a model trained at base 10000 needs fine-tuning to adapt to base 500000.
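The dtype bullet is easy to underestimate, so here is a rough illustration of what a bfloat16 cast does to a position index. It approximates bfloat16 by truncating a float32 to its top 16 bits; real hardware rounds to nearest, so treat `to_bfloat16` as a hypothetical, simplified stand-in.

```python
import math
import struct

def to_bfloat16(x):
    """Approximate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    Real hardware rounds to nearest; truncation is a simplification that is
    good enough to show the loss of integer resolution.
    """
    top_bytes = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top_bytes + b"\x00\x00")[0]

pos = 4097
low_precision_pos = to_bfloat16(float(pos))
print(low_precision_pos)  # the position index that survives the cast

# The fastest pair has theta = 1, so the position error is the angle error in radians.
angle_error = abs(pos - low_precision_pos) * 1.0
print(angle_error, math.cos(pos) - math.cos(low_precision_pos))
```

With only 8 significant bits, bfloat16 cannot tell positions 4096 and 4097 apart, so the fastest pair is off by a full radian. Building the tables in float32 and casting only the final cos/sin values avoids this.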
Frequently asked questions
Why does RoPE use rotation instead of adding a position vector?
Because rotation makes the attention dot product depend only on relative position. When you rotate a query at position m and a key at position n by angles mθ and nθ, their inner product depends on (m − n)θ, the gap, not on the absolute positions. Adding a position vector, the way the original sinusoidal scheme does, mixes content and position into the same value and loses that clean relative property.
Does RoPE add any learned parameters?
No. RoPE is a fixed, parameter-free transformation. The rotation angles come from a deterministic frequency schedule θ_i = base^(−2i/d), with base = 10000 by default. Nothing about the rotation is trained, which is one reason it replaced learned absolute position embeddings in most large models.
Why is RoPE applied to queries and keys but not values?
Position only needs to influence which tokens attend to which — and that decision lives entirely in the query·key dot product that produces attention scores. Values carry the content that gets mixed once the scores are set, so rotating them would inject position into the output without any benefit. RoPE rotates Q and K, leaves V untouched.
How does RoPE handle sequences longer than it was trained on?
Out of the box it extrapolates poorly: positions beyond the training length produce rotation angles the model never saw, and quality collapses. The fixes rescale the angles. Position Interpolation linearly compresses positions into the trained range; NTK-aware scaling raises the base, and YaRN rescales each frequency band separately, so low frequencies stretch while high frequencies stay nearly intact. These let a 4K-trained model run at 32K or more with light fine-tuning.
What is the time and memory cost of RoPE?
Applying RoPE is O(d) per token per head — a handful of multiply-adds over d/2 coordinate pairs, with no extra matrix and no parameters to store. The cos and sin tables are precomputed once per sequence length, sized seq_len × d, and reused across all layers and heads. Compared to the O(d²) projections already in attention, RoPE is essentially free.
Who invented RoPE and which models use it?
Jianlin Su and colleagues introduced it in the 2021 RoFormer paper. It is now the default position scheme in most open large language models: LLaMA and LLaMA 2/3, GPT-NeoX, Mistral, Qwen, and Gemma all use RoPE, as does Google's PaLM. Long-context releases often pair it with a context-extension technique such as a larger base, Position Interpolation, or YaRN.