Low-Rank Adaptation (LoRA)

Fine-tune a giant model by training two tiny matrices instead of all the weights

LoRA fine-tunes a giant model by freezing its weights and training two tiny low-rank matrices, B and A, whose product BA is added to selected weight matrices. On GPT-3 175B this cuts trainable parameters by up to 10,000× while matching full fine-tuning quality on the benchmarks reported in the original paper.

  • Introduced: Hu et al., 2021
  • Trainable params: ~0.01–1% of base
  • Typical rank r: 1–64 (8 default)
  • Inference overhead (merged): 0
  • Extra params per matrix: r·(d + k)

The intuition: don't relearn what the model already knows

A modern language model has already spent millions of GPU-hours learning grammar, facts, reasoning, and the shape of human language. When you fine-tune it for your task — legal summaries, a support chatbot, your company's tone — you are not teaching it language from scratch. You are nudging it. The question LoRA asks is: how big does that nudge actually have to be?

Full fine-tuning answers "as big as the whole model" — it makes every one of the billions of weights trainable, then saves a complete new copy. That is wasteful in two ways. You store a full duplicate of a multi-gigabyte model per task, and you compute and store optimizer state (Adam keeps two extra numbers per parameter) for every single weight, which is what actually blows up GPU memory.

Low-Rank Adaptation (LoRA), introduced by Edward Hu and colleagues at Microsoft in 2021, makes a sharper bet. It freezes the original weights entirely and learns a small correction for each weight matrix. The key insight: that correction has a low intrinsic rank. A 4096×4096 weight matrix has roughly 16.8 million entries, but the change you need to adapt it to a new task can often be captured by a matrix of rank 8 — described by just 4096·8 + 8·4096 ≈ 65,000 numbers. That is a 256× reduction for that one matrix, and the savings compound across the whole network.
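The arithmetic in that example is easy to check. A minimal sketch in plain Python; the helper name lora_params is illustrative and not taken from any library.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for one LoRA adapter: B is d×r, A is r×k."""
    return r * (d + k)

d = k = 4096
r = 8

full = d * k                  # entries in the dense weight matrix
lora = lora_params(d, k, r)   # entries in B plus entries in A

print(full)           # 16777216
print(lora)           # 65536
print(full // lora)   # 256
```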

The mechanism: freeze W, add BA

Take any weight matrix W of shape d × k inside the frozen model. Its forward pass is h = Wx. LoRA replaces this with:

h = Wx + ΔW·x
  = Wx + (α / r)·B·A·x

where the learned update ΔW = (α / r)·BA is built from two thin matrices:

  • A has shape r × k — it projects the input down into an r-dimensional bottleneck.
  • B has shape d × r — it projects that bottleneck back up to the output dimension.
  • r is the rank, the inner dimension, typically 1 to 64. With r ≪ min(d, k), the product BA is forced to be low-rank.
  • α / r is a fixed scaling factor that decouples the update magnitude from the choice of r.

Only B and A receive gradients. W never moves. Two initialization details make this work cleanly: A is filled with small random Gaussian values and B is initialized to all zeros. So at step zero, BA = 0 and the adapted model's output is identical to the pre-trained one: you start from the strong prior, not from noise. Training can still get going because the gradient with respect to B depends on Ax, which is non-zero. The gradient with respect to A is zero at the very first step (it is multiplied by B) and becomes non-zero as soon as B moves.
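To make the initialization argument concrete, here is a small sketch in plain Python that writes out the gradients by hand for one layer. The upstream gradient g and the tiny shapes are arbitrary choices for illustration. It suggests that at step zero the adapter contributes nothing to the output, B already receives a non-zero gradient (through Ax), and A's gradient stays at zero until B has moved.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]

def transpose(M):
    return [list(col) for col in zip(*M)]

random.seed(0)
d, k, r, alpha = 3, 4, 2, 4
scale = alpha / r

A = [[random.gauss(0.0, 0.01) for _ in range(k)] for _ in range(r)]  # r×k, small random
B = [[0.0] * r for _ in range(d)]                                    # d×r, all zeros
x = [1.0, -2.0, 0.5, 3.0]   # example input, length k
g = [1.0, 1.0, 1.0]         # upstream gradient dL/dh, length d (arbitrary)

Ax = matvec(A, x)
delta_h = [scale * v for v in matvec(B, Ax)]   # adapter's contribution to h

# For h = Wx + scale * B(Ax):
#   dL/dB = scale * g (Ax)^T       shape d×r
#   dL/dA = scale * (B^T g) x^T    shape r×k
grad_B = [[scale * v for v in row] for row in outer(g, Ax)]
grad_A = [[scale * v for v in row] for row in outer(matvec(transpose(B), g), x)]

print(all(v == 0.0 for v in delta_h))                  # True: output unchanged at step zero
print(any(v != 0.0 for row in grad_B for v in row))    # True: B receives a gradient
print(all(v == 0.0 for row in grad_A for v in row))    # True: A waits until B is non-zero
```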

Complexity. For a single d × k matrix, full fine-tuning trains d·k parameters; LoRA trains r·(d + k). The forward pass adds two small matmuls — A·x is O(r·k) and B·(Ax) is O(d·r) — versus O(d·k) for the frozen path, so the extra compute is negligible when r ≪ d, k. The big win is memory: optimizer state scales with trainable parameters, so an Adam run that needed memory for billions of weights now needs it for a few million.
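A rough sketch of those costs for one 4096×4096 matrix at rank 8. The function names are illustrative, and the memory estimate assumes Adam stores two fp32 moment estimates per trainable parameter; real training runs also hold gradients and activations, which are not counted here.

```python
def macs_frozen(d, k):
    """Multiply-adds for the frozen path W @ x."""
    return d * k

def macs_lora(d, k, r):
    """Multiply-adds for the adapter: A @ x costs r*k, then B @ (Ax) costs d*r."""
    return r * k + d * r

def adam_state_bytes(n_trainable, bytes_per_value=4):
    # Assumption: two fp32 moment estimates per trainable parameter.
    return 2 * n_trainable * bytes_per_value

d = k = 4096
r = 8

extra = macs_lora(d, k, r) / macs_frozen(d, k)
print(f"{extra:.4%}")                               # 0.3906%

print(adam_state_bytes(d * k) // 2**20)             # 128  (MiB, full fine-tuning)
print(adam_state_bytes(r * (d + k)) / 2**20)        # 0.5  (MiB, LoRA)
```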

Why merged LoRA is free at inference

During training you keep W, B, and A separate and pay for two small extra matmuls. But BA has exactly the same shape as W. So once training finishes you can fold the adapter in permanently:

W' = W + (α / r) · B·A

Now you ship a single matrix W' with the identical shape, dtype, and FLOP count as the original. Merged LoRA has zero inference overhead. This is its decisive advantage over adapter layers (Houlsby et al., 2019), which insert new modules into the network and therefore add latency you can never remove.
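The equivalence between the merged and unmerged forms can be checked numerically. A minimal sketch in plain Python with small random matrices; B is filled with random values here to stand in for a trained adapter.

```python
import random

def matmul(X, Y):
    """Dense matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def rand(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(1)
d, k, r, alpha = 4, 5, 2, 4
scale = alpha / r

W, B, A = rand(d, k), rand(d, r), rand(r, k)   # B is non-zero: "after training"
x = rand(k, 1)                                  # column vector, k×1

# Unmerged: h = Wx + scale * B(Ax)
Wx = matmul(W, x)
BAx = matmul(B, matmul(A, x))
h_unmerged = [Wx[i][0] + scale * BAx[i][0] for i in range(d)]

# Merged: W' = W + scale * BA, then h = W'x
BA = matmul(B, A)
W_merged = [[W[i][j] + scale * BA[i][j] for j in range(k)] for i in range(d)]
h_merged = [row[0] for row in matmul(W_merged, x)]

max_diff = max(abs(a - b) for a, b in zip(h_unmerged, h_merged))
print(max_diff < 1e-12)   # True: equal up to floating-point rounding
```

W_merged has the same d×k shape as W, so nothing about the deployed layer changes except its values.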

The flip side: keep the adapter unmerged and you can hot-swap tasks at runtime. A server can hold one frozen 70B base in memory and switch between a "legal" adapter, a "medical" adapter, and a "coding" adapter — each a few megabytes — without reloading the base. Frameworks like S-LoRA and vLLM exploit exactly this to serve thousands of fine-tunes from one copy of the weights.
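The hot-swap idea can be sketched with a single layer: one shared base matrix and a dictionary of named adapters. The AdapterServer class and its method names are hypothetical and only illustrate the principle; production systems such as S-LoRA and vLLM batch and schedule adapters very differently.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class AdapterServer:
    """One frozen base matrix shared by many named LoRA adapters (illustrative only)."""

    def __init__(self, W):
        self.W = W               # frozen base weight, d×k
        self.adapters = {}       # name -> (B, A, scale)

    def register(self, name, B, A, alpha):
        r = len(A)               # A is r×k, so its row count is the rank
        self.adapters[name] = (B, A, alpha / r)

    def forward(self, x, adapter=None):
        h = matvec(self.W, x)    # base path, always computed with the same W
        if adapter is not None:
            B, A, scale = self.adapters[adapter]
            up = matvec(B, matvec(A, x))   # B(Ax): the intermediate has width r
            h = [hi + scale * ui for hi, ui in zip(h, up)]
        return h

server = AdapterServer(W=[[1.0, 0.0], [0.0, 1.0]])
server.register("legal",   B=[[1.0], [0.0]], A=[[1.0, 1.0]],  alpha=1)
server.register("medical", B=[[0.0], [1.0]], A=[[1.0, -1.0]], alpha=2)

x = [1.0, 2.0]
print(server.forward(x))                      # [1.0, 2.0]  (base only)
print(server.forward(x, adapter="legal"))     # [4.0, 2.0]
print(server.forward(x, adapter="medical"))   # [1.0, 0.0]
```

Switching tasks is a dictionary lookup; the base matrix is never copied or modified.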

When to reach for LoRA

  • You have one big base and many tasks. Store one frozen model plus a tiny adapter per task instead of N full copies.
  • You are GPU-memory bound. LoRA's main practical payoff is fitting fine-tuning into the VRAM you have. Pair it with quantization (QLoRA) to go further.
  • Your task is "in distribution" for the base. Instruction-following, tone, format, domain vocabulary — nudges the base already half-knows.
  • You need fast, cheap iteration. Adapters train in minutes-to-hours on modest hardware and produce megabyte-sized artifacts you can version in git.

Reach instead for full fine-tuning when the target task is genuinely far from pre-training (a brand-new modality, a different language the base barely saw) — the low-rank assumption breaks and you may need the model's full capacity. And if you only need behavior changes at the prompt level, plain prompting or RAG can beat any fine-tuning.

LoRA vs other adaptation methods

Property | LoRA | Full fine-tune | QLoRA | Adapter layers | Prefix / prompt tuning | BitFit
Base weights | frozen | all trained | frozen, 4-bit | frozen | frozen | frozen
Trainable params | r·(d+k) per matrix | 100% | same as LoRA | bottleneck MLPs | virtual tokens only | biases only (~0.1%)
Extra inference latency | 0 (merged) | 0 | 0 (merged) | yes (new layers) | longer sequence | 0
Memory for training | low | very high | lowest | low | lowest | low
Quality vs full FT | matches on most tasks | baseline | ~matches LoRA | slightly below LoRA | weaker on hard tasks | limited capacity
Hot-swappable | yes | no | yes | yes | yes | yes

LoRA's sweet spot is a combination few other methods hit: quality close to full fine-tuning on most tasks, a small fraction of the optimizer and gradient memory, and zero inference cost once merged. Prompt tuning is even lighter to train but tends to lag on hard reasoning tasks and eats context length; adapter layers come close on quality but add latency to every forward pass.

What the numbers actually say

  • 10,000× fewer trainable parameters. The original paper's headline claim is that LoRA reduces trainable parameters by up to 10,000× on GPT-3 175B (175 billion → roughly 17.5 million implied). In its GPT-3 experiments the actual budgets reported are 4.7M parameters (rank 1–2) and 37.7M (rank 8 on the query/value projections), i.e. roughly 37,000× and 4,600× fewer than 175 billion, while matching or beating full fine-tuning on WikiSQL, MultiNLI, and SAMSum.
  • 3× less GPU memory. Because optimizer and gradient state vanish for the frozen weights, the same paper reported VRAM during training dropping by about 2/3 on the 175B model.
  • QLoRA fits 65B on one 48 GB GPU. Dettmers et al. (2023) fine-tuned a 65B model that would need 780+ GB for 16-bit full fine-tuning, on a single 48 GB card, by 4-bit NF4 quantizing the frozen base.
  • Adapters are megabytes, not gigabytes. A rank-8 adapter on a 7B model is on the order of 10–30 MB versus a ~14 GB full checkpoint — small enough to email or commit.
  • Rank can be tiny. The paper found rank 1 or 2 on the right matrices already captured most of the gain; raising the rank all the way to 64 gave little further improvement, evidence the update really is low-rank.
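Some of these figures can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes GPT-3 175B's published shape (96 layers, model width 12288, taken from the GPT-3 paper rather than from this article) and assumes the adapter is applied to the query and value projections of every layer. The QLoRA line counts only the 4-bit base weights and ignores quantization constants, adapters, and activations.

```python
def lora_trainable(layers, d_model, r, matrices_per_layer=2):
    """Adapter parameters when each adapted matrix is d_model × d_model."""
    return layers * matrices_per_layer * r * (d_model + d_model)

GPT3_PARAMS = 175_000_000_000
LAYERS, D_MODEL = 96, 12288      # assumed GPT-3 175B shape

counts = {r: lora_trainable(LAYERS, D_MODEL, r) for r in (1, 8)}
print(f"{counts[1]:,}")   # 4,718,592   (the ~4.7M budget)
print(f"{counts[8]:,}")   # 37,748,736  (the ~37.7M budget)

for r, n in counts.items():
    print(f"rank {r}: about {GPT3_PARAMS / n:,.0f}x fewer trainable parameters")

# QLoRA: weights-only footprint of a 65B base at 4 bits (0.5 bytes) per parameter.
qlora_base_gb = 65_000_000_000 * 0.5 / 1e9
print(qlora_base_gb)      # 32.5
```

At 32.5 GB the quantized base leaves headroom on a 48 GB card for adapters, optimizer state, and activations.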

JavaScript implementation

A self-contained LoRA layer in plain JavaScript. W is frozen; only A and B would receive gradient updates. This shows the forward pass and the merge.

// Row-major matrices as flat Float64Array. matmul: (m×n)·(n×p) -> m×p
function matmul(a, b, m, n, p) {
  const out = new Float64Array(m * p);
  for (let i = 0; i < m; i++)
    for (let j = 0; j < p; j++) {
      let s = 0;
      for (let t = 0; t < n; t++) s += a[i * n + t] * b[t * p + j];
      out[i * p + j] = s;
    }
  return out;
}

class LoRALinear {
  // W is the FROZEN base weight, shape d×k (out×in).
  constructor(W, d, k, rank = 8, alpha = 16) {
    this.W = W; this.d = d; this.k = k; this.r = rank;
    this.scale = alpha / rank;
    // A: r×k random Gaussian, B: d×r zeros  =>  BA = 0 at init
    this.A = new Float64Array(rank * k).map(() => gauss() * 0.01);
    this.B = new Float64Array(d * rank);          // all zeros
  }

  // x is a column vector length k (treated as k×1).
  forward(x) {
    const base = matmul(this.W, x, this.d, this.k, 1);   // Wx        d×1
    const Ax   = matmul(this.A, x, this.r, this.k, 1);   // Ax        r×1
    const BAx  = matmul(this.B, Ax, this.d, this.r, 1);  // B(Ax)     d×1
    const out  = new Float64Array(this.d);
    for (let i = 0; i < this.d; i++) out[i] = base[i] + this.scale * BAx[i];
    return out;
  }

  // Fold the adapter into W permanently: W' = W + scale·BA. Zero overhead after.
  merge() {
    const BA = matmul(this.B, this.A, this.d, this.r, this.k); // d×k
    const Wm = new Float64Array(this.d * this.k);
    for (let i = 0; i < Wm.length; i++) Wm[i] = this.W[i] + this.scale * BA[i];
    return Wm; // drop B and A; ship Wm with the original shape
  }
}

function gauss() { // Box-Muller
  let u = 0, v = 0;
  while (u === 0) u = Math.random();
  while (v === 0) v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

The forward pass never materializes the full d×k update — it computes A·x first (cheap, output size r) and only then expands with B. That ordering is the whole point: you stay in the low-rank bottleneck for as long as possible.

Python implementation (PyTorch)

import torch
import torch.nn as nn
import math

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16,
                 dropout: float = 0.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # FREEZE W (and bias)

        d_out, d_in = base.weight.shape       # W is d_out × d_in
        self.r = r
        self.scaling = alpha / r              # decouple magnitude from rank
        self.drop = nn.Dropout(dropout)

        # A: r × d_in  (Kaiming uniform),  B: d_out × r  (zeros) -> BA starts at 0
        self.A = nn.Parameter(torch.empty(r, d_in))
        self.B = nn.Parameter(torch.zeros(d_out, r))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))

    def forward(self, x):
        # frozen path + low-rank update; @ broadcasts over batch/seq dims
        update = (self.drop(x) @ self.A.T) @ self.B.T
        return self.base(x) + update * self.scaling

    @torch.no_grad()
    def merge(self):
        """Fold BA into the base weight; afterwards inference == base model."""
        delta = (self.B @ self.A) * self.scaling      # d_out × d_in
        self.base.weight.add_(delta)
        self.B.zero_()                                # idempotent if re-called
        return self.base

# Freeze the whole model, then wrap the query/value projections of each attention block:
def inject_lora(model, r=8, alpha=16, targets=("q_proj", "v_proj")):
    for p in model.parameters():
        p.requires_grad = False          # freeze everything; only the new A/B will train
    # Snapshot the module list so replacing children does not disturb the iteration.
    for _, module in list(model.named_modules()):
        for child_name, child in list(module.named_children()):
            if isinstance(child, nn.Linear) and child_name in targets:
                setattr(module, child_name, LoRALinear(child, r, alpha))
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable params: {trainable:,}")
    return model

Note (x @ A.T) @ B.T rather than x @ (A.T @ B.T): associativity lets you avoid ever building the dense d_in × d_out product during the forward pass, keeping the intermediate at width r. In practice you would use the peft library, which is built around the same core computation and adds options such as per-module targeting and multiple adapters.

Variants worth knowing

QLoRA (Dettmers et al., 2023). Quantize the frozen base to 4-bit NormalFloat (NF4), keep LoRA adapters in bf16, and add double-quantization plus paged optimizers. Because the base never updates, quantization error never compounds through gradients. This is what made single-GPU fine-tuning of 30B–65B models routine.

DoRA (Weight-Decomposed LoRA, 2024). Decomposes each weight into a magnitude vector and a direction, and applies LoRA only to the direction. Closes much of the remaining gap to full fine-tuning, especially at low rank, for a small extra cost.
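A rough sketch of the magnitude/direction split, assuming the per-column norm used in the DoRA paper. The function names are illustrative, the α/r scaling is omitted, and in real training both the magnitude vector m and the low-rank factors would be learned.

```python
import math

def column_norms(M):
    """Euclidean norm of each column of a matrix given as a list of rows."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*M)]

def dora_weight(m, W0, delta):
    """m * (W0 + delta) / ||W0 + delta||, with the norm taken per column."""
    V = [[w + dv for w, dv in zip(w_row, d_row)] for w_row, d_row in zip(W0, delta)]
    norms = column_norms(V)
    return [[m[j] * V[i][j] / norms[j] for j in range(len(m))] for i in range(len(V))]

W0 = [[3.0, 0.0], [4.0, 2.0]]
m = column_norms(W0)           # magnitude vector, initialized from the frozen weight
print(m)                       # [5.0, 2.0]

# Rank-1 direction update BA with B = [[1], [0]] and A = [[0.5, 1.0]].
delta = [[0.5, 1.0], [0.0, 0.0]]
W_new = dora_weight(m, W0, delta)

# The low-rank update changed each column's direction, but its length is still set by m.
print([round(n, 9) for n in column_norms(W_new)])   # [5.0, 2.0]
```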

LoRA+ and rsLoRA. LoRA+ uses a higher learning rate for B than for A (they play asymmetric roles). rsLoRA changes the scaling factor from α/r to α/√r, so the update does not shrink as quickly when r grows and higher ranks keep helping instead of saturating.
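The difference between the two scaling rules is easiest to see side by side. A small sketch with α fixed at 16; the function names are illustrative.

```python
import math

def lora_scale(alpha, r):
    """Standard LoRA scaling factor."""
    return alpha / r

def rslora_scale(alpha, r):
    """rsLoRA scaling factor: divide by sqrt(r) instead of r."""
    return alpha / math.sqrt(r)

alpha = 16
for r in (4, 16, 64, 256):
    print(r, lora_scale(alpha, r), rslora_scale(alpha, r))
# 4 4.0 8.0
# 16 1.0 4.0
# 64 0.25 2.0
# 256 0.0625 1.0
```

With α held fixed, the standard factor falls by 64× between rank 4 and rank 256, while the rsLoRA factor falls by only 8×.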

AdaLoRA. Allocates the rank budget adaptively across matrices using an SVD-style parameterization, pruning singular values that don't matter — more capacity where the task needs it, less where it doesn't.

VeRA. Shares a single pair of frozen random A/B across all layers and learns only tiny per-layer scaling vectors, shrinking the adapter by another order of magnitude.

Common bugs and edge cases

  • Initializing B randomly. If BA ≠ 0 at step zero you inject random noise into a trained network and training starts from a worse point. B must be zeros (or A zeros, but not both — then no gradient flows).
  • Forgetting to freeze the base. If requires_grad stays on for W, you silently fall back to full fine-tuning and lose every memory benefit. Always print the trainable-parameter count to confirm.
  • Mismatched α and r across runs. Changing r without keeping α/r in mind silently rescales the update. Pin a convention (e.g. α = 2r) so results are comparable.
  • Merging into a quantized base. You cannot add a bf16 BA directly into 4-bit weights without dequantizing first — naively merging a QLoRA adapter corrupts the weights. Dequantize, merge, then optionally re-quantize.
  • Double-merging. Calling merge() twice adds BA to W a second time. Zero out B after merging, or track a merged flag.
  • Adapting too few matrices on a hard task. Query/value-only is great for light adaptation; for instruction-tuning, also targeting MLP and output projections (and raising r) often matters more than people expect.
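The double-merge bug from the list above can be guarded against with a simple flag. A minimal sketch in plain Python; the class and method names are hypothetical and the matrices are tiny lists of rows.

```python
class MergeableLoRA:
    """Tracks whether the adapter is already folded into W (illustrative sketch)."""

    def __init__(self, W, B, A, alpha):
        self.W, self.B, self.A = W, B, A
        self.scale = alpha / len(A)      # len(A) is the rank r
        self.merged = False

    def _apply(self, sign):
        # Add sign * scale * (B @ A) to W in place.
        for i, b_row in enumerate(self.B):
            for j in range(len(self.W[0])):
                ba = sum(b_row[t] * self.A[t][j] for t in range(len(self.A)))
                self.W[i][j] += sign * self.scale * ba

    def merge(self):
        if not self.merged:              # guard: a second call is a no-op
            self._apply(+1.0)
            self.merged = True

    def unmerge(self):
        if self.merged:
            self._apply(-1.0)
            self.merged = False

layer = MergeableLoRA(W=[[1.0, 0.0], [0.0, 1.0]],
                      B=[[1.0], [2.0]], A=[[0.5, 0.5]], alpha=2)
layer.merge()
layer.merge()          # ignored thanks to the flag
print(layer.W)         # [[2.0, 1.0], [2.0, 3.0]]
layer.unmerge()
print(layer.W)         # [[1.0, 0.0], [0.0, 1.0]]
```

The round trip is exact here because the values are small and exactly representable; with real weights, unmerging recovers W only up to floating-point rounding.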

Frequently asked questions

Why can a model with billions of weights be fine-tuned with so few parameters?

The LoRA paper showed that the weight change needed to adapt a pre-trained model to a new task has a low intrinsic rank — the update ΔW lives in a tiny subspace. So instead of learning a full d×k matrix, you learn its factorization BA where B is d×r and A is r×k, with r as small as 1 to 16. The pre-trained weights already contain most of what the task needs; you only nudge them in a few directions.

Does LoRA make inference slower than the base model?

Not if you merge. During training the model computes Wx + (α/r)·BAx, which costs two small extra matrix multiplies. But because BA has the same shape as W, you can fold it in once with W' = W + (α/r)·BA and ship a single merged weight matrix. Merged LoRA has exactly zero inference overhead versus the original model. The overhead only exists if you keep the adapter separate to hot-swap tasks.

What are the rank r and scaling factor alpha in LoRA?

r is the rank of the bottleneck — the inner dimension shared by B and A. It controls capacity: r=8 is a common default, r=1 often works for narrow tasks. alpha is a constant that scales the update by alpha/r before adding it, decoupling the learning-rate-like magnitude from the rank so you can change r without re-tuning. A frequent convention is alpha = 2r.

How is QLoRA different from LoRA?

QLoRA quantizes the frozen base weights to 4-bit (NF4) and trains LoRA adapters in higher precision on top. The base never updates, so the lossy quantization never compounds through gradients. This let researchers fine-tune a 65-billion-parameter model on a single 48 GB GPU — full fine-tuning would have needed over 780 GB.

Why is the B matrix initialized to zero?

At the start of training BA must equal zero so the adapted model is identical to the pre-trained model — otherwise you'd inject random noise into a carefully trained network on step one. A is initialized with small random Gaussian values and B is initialized to all zeros, which makes the product BA exactly zero while still giving gradients a non-zero path to flow through A.

Which weight matrices should LoRA be applied to?

In a Transformer the original paper adapted only the attention query and value projections (Wq, Wv) and froze everything else, getting strong results. Later work found that also adapting the MLP and output projections helps on harder tasks. The trade-off is parameter count: every d × k matrix you adapt adds another r·(d + k) trainable parameters. Targeting all linear layers is now a common default for instruction-tuning.
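To get a feel for that trade-off, the sketch below counts adapter parameters for two targeting choices at rank 8. The layer shapes are an assumption modeled on a LLaMA-7B-style block (hidden size 4096, MLP width 11008, 32 layers, seven linear layers per block); other architectures will give different totals.

```python
def adapter_params(shapes, r):
    """Sum r·(d + k) over a list of (d, k) weight shapes."""
    return sum(r * (d + k) for d, k in shapes)

# Assumed LLaMA-7B-style block: hidden 4096, MLP width 11008, 32 layers.
H, F, LAYERS, R = 4096, 11008, 32, 8

attn_qv = [(H, H)] * 2                                # q_proj, v_proj
all_linear = [(H, H)] * 4 + [(F, H)] * 2 + [(H, F)]   # q, k, v, o, gate, up, down

qv_total = LAYERS * adapter_params(attn_qv, R)
all_total = LAYERS * adapter_params(all_linear, R)

print(f"{qv_total:,}")    # 4,194,304
print(f"{all_total:,}")   # 19,988,480
```

Under these assumptions, targeting every linear layer costs a little under five times as many trainable parameters as query/value only, which is still a tiny fraction of the 7B base.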