Contrastive Learning

Teach a model what's the same and what's different — and skip the labels

Contrastive learning trains an encoder to map matching pairs (augmentations of the same image) close together and mismatched pairs far apart in embedding space — learning useful representations from unlabeled data using the InfoNCE loss.

  • Supervision: Self-supervised
  • Core loss: InfoNCE
  • Positive pair: Two views of one sample
  • Negatives: Rest of the batch
  • Temperature τ: ≈ 0.07–0.1

The intuition: same vs different

Supervised learning needs a teacher. To learn that a photo is a "cat," someone first had to type the word "cat" next to a million photos. Contrastive learning throws that requirement out. It learns by answering a much cheaper question: which of these two things came from the same source?

Take one photo of a dog. Crop it two different ways, jitter the colors, flip one horizontally. You now have two views of the same underlying thing — a positive pair. Every other photo in the batch is a negative. Train an encoder so that the two views land near each other in a vector space and the negatives land far away, and a remarkable thing happens: the encoder is forced to discover the features that survive cropping and color shifts — shape, texture, the actual semantics of "dog" — because those are the only things the two views still share. The augmentations are doing the labeling for free.
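
To make "two views of the same underlying thing" concrete, here is a minimal sketch of a view generator. It assumes a toy image stored as a list of rows and uses only a random crop and a horizontal flip; the helper names (`random_crop`, `random_flip`, `two_views`) are illustrative, and a real pipeline would add color jitter, blur and resizing.

```python
import random

def random_crop(image, size, rng):
    """Cut a size x size window at a random position."""
    height, width = len(image), len(image[0])
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]

def random_flip(image, rng):
    """Mirror the image horizontally with probability 0.5."""
    return [row[::-1] for row in image] if rng.random() < 0.5 else image

def two_views(image, size, rng):
    """Run the same random pipeline twice to get a positive pair."""
    return tuple(random_flip(random_crop(image, size, rng), rng) for _ in range(2))

rng = random.Random(0)                                      # seeded only for repeatability
image = [[r * 4 + c for c in range(4)] for r in range(4)]   # toy 4x4 "image"
view_a, view_b = two_views(image, size=3, rng=rng)
print(view_a)
print(view_b)
```

Both views are drawn from the same pixels but generally differ in position and orientation, which is exactly the nuisance variation the encoder has to learn to ignore.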

The payoff is a general-purpose representation. After contrastive pre-training on unlabeled images, a single linear layer trained on top of the frozen encoder (a linear probe) reaches accuracy close to what used to require full supervised training. The expensive labels are needed only for that final, inexpensive step.

How it works: pull positives, push negatives

The machinery has four moving parts:

  1. Augmentation. Each input x is transformed twice by random operations (crop, resize, color jitter, blur, grayscale) to produce two correlated views x̃ᵢ and x̃ⱼ. SimCLR's ablations show that the composition of augmentations is critical, and random crop + color distortion is the workhorse combination.
  2. Encoder. A backbone f(·) (a ResNet, a ViT) maps each view to a representation h. This is the thing you keep and reuse downstream.
  3. Projection head. A small MLP g(·) maps h to a lower-dimensional z where the loss is computed. Crucially, you throw this head away after pre-training — representations one layer before the loss are measurably better for transfer.
  4. L2 normalization. Each z is projected onto the unit hypersphere, so "similarity" becomes a cosine — a dot product of unit vectors, bounded in [-1, 1].

Then the loss pulls the positive together and pushes the negatives apart. The geometric picture is a tug-of-war on a sphere: every positive pair is a spring contracting, every negative is a spring expanding, and the encoder settles into a configuration where related inputs cluster and unrelated inputs spread out to fill the space.
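
The normalization step is easy to check by hand. The sketch below is a plain-Python illustration, not a library API: the vectors are made-up stand-ins for projection-head outputs, and `l2_normalize` and `cosine_sim` are hypothetical helper names.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine_sim(u, v):
    """Dot product of two already-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))

# Made-up outputs for two views of one image, plus one unrelated image.
z_view_a = l2_normalize([3.0, 4.0])
z_view_b = l2_normalize([6.0, 8.5])
z_other = l2_normalize([-4.0, 3.0])

print(cosine_sim(z_view_a, z_view_b))  # close to 1: the positive pair nearly coincides
print(cosine_sim(z_view_a, z_other))   # about 0: orthogonal to the anchor
```

Because every vector sits on the unit sphere, the loss can only change angles, never lengths, which keeps the similarity scale fixed and lets the temperature do the scaling.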

The math: InfoNCE and temperature

Define cosine similarity between two normalized vectors as sim(u, v) = uᵀv. For a positive pair (i, j) in a batch where every other view is a negative, the InfoNCE (also called NT-Xent — normalized temperature-scaled cross-entropy) loss for anchor i is:

$$
\ell_i = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1,\, k \neq i}^{2N} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}
$$

Read it as a classification problem. Among all 2N − 1 candidate partners for anchor i, exactly one — its positive j — is the correct "class," and the loss is the cross-entropy of a softmax that should put all its mass there. The temperature τ divides every logit: small τ sharpens the softmax so the gradient concentrates on hard negatives sitting close to the anchor; large τ flattens it. SimCLR uses τ = 0.1; CLIP learns it.
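
A small numeric sketch shows what "sharpens" and "flattens" mean here. The similarities below are invented for illustration (one positive, one hard negative, two easy negatives), and `softmax_over_candidates` is a hypothetical helper, not part of any library.

```python
import math

def softmax_over_candidates(sims, tau):
    """Softmax of similarities divided by temperature (max subtracted for stability)."""
    logits = [s / tau for s in sims]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented cosine similarities from one anchor to:
# its positive, a hard negative, and two easy negatives.
sims = [0.9, 0.7, 0.1, -0.3]

for tau in (0.07, 0.5, 1.0):
    probs = softmax_over_candidates(sims, tau)
    print(tau, [round(p, 3) for p in probs])
```

At τ = 0.07 almost all of the non-positive mass sits on the hard negative, so it receives nearly all of the "push apart" gradient; at τ = 1.0 the easy negatives get a comparable share.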

The total loss averages ℓᵢ over all 2N views (each view serves as anchor once). The deep reason this works: minimizing InfoNCE maximizes a lower bound on the mutual information I(zᵢ; zⱼ) between the two views, and the bound's tightness improves with the number of negatives K — roughly I ≥ log K − ℓ. That is the formal justification for big batches.
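
As a rough worked example of that bound, the snippet below evaluates log K − ℓ for a few negative counts. It takes the approximate form quoted above at face value (natural log, so the result is in nats); the function name is illustrative and the counts are just examples.

```python
import math

def infonce_mi_lower_bound(num_negatives, loss):
    """Rough bound from the text: I >= log K - loss, in nats."""
    return math.log(num_negatives) - loss

for k in (510, 8190, 65536):
    # Even with a perfect loss of 0, the bound cannot exceed log K.
    print(k, round(infonce_mi_lower_bound(k, 0.0), 2))
```

The ceiling grows only logarithmically: going from 510 to 8190 negatives raises it from about 6.2 to about 9.0 nats, which fits the observation that bigger batches help but with diminishing returns.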

Complexity. Building the full 2N × 2N pairwise similarity matrix for a batch of N images costs O(N²·d) time and O(N²) memory (with d the embedding dimension). The encoder forward/backward pass over the batch is the dominant compute cost, O(N·C) where C is per-image network FLOPs. What forces multi-GPU training is mostly the activation memory of a large batch; the similarity matrix adds to it but stays comparatively small (at N = 4096 in float32 it is 8192² × 4 bytes = 256 MiB).
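
The quadratic terms are easy to size with back-of-the-envelope arithmetic. This sketch assumes float32 storage and an embedding dimension of 128 (both are assumptions for illustration), and it counts only the similarity matrix itself, not encoder activations.

```python
def sim_matrix_bytes(batch_size, bytes_per_value=4):
    """Memory for the (2N x 2N) similarity matrix; float32 assumed by default."""
    m = 2 * batch_size
    return m * m * bytes_per_value

def sim_matrix_multiply_adds(batch_size, dim):
    """Multiply-adds for z @ z.T with z of shape (2N, d)."""
    m = 2 * batch_size
    return m * m * dim

for n in (256, 4096, 8192):
    mib = sim_matrix_bytes(n) / 2**20
    print(f"N={n}: {mib:.0f} MiB, {sim_matrix_multiply_adds(n, 128):.2e} multiply-adds")
```

Doubling N quadruples both numbers, so the matrix grows from 1 MiB at N = 256 to 1024 MiB at N = 8192.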

When to use contrastive learning

  • Abundant unlabeled data, scarce labels. Medical imaging, satellite imagery, industrial defect detection — millions of unlabeled images, a few thousand annotated ones. Pre-train contrastively, fine-tune on the labeled slice.
  • Retrieval and similarity search. The embedding space is the product: face verification, image search, recommendation, deduplication. Cosine distance in the learned space ranks neighbors.
  • Multimodal alignment. CLIP aligns images and text so you can search images with words. The same recipe aligns audio-text, video-text, code-docstring.
  • Pre-training a backbone you'll reuse across many downstream tasks, where labels for each task are expensive but a generic feature extractor amortizes.
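
For the retrieval case, "cosine distance ranks neighbors" can be sketched in a few lines. The gallery embeddings and item names below are invented, and `top_k` is a hypothetical helper; a production system would use an approximate nearest-neighbor index rather than a full scan.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k(query, gallery, k=2):
    """Rank gallery items by cosine similarity to the query, highest first."""
    q = normalize(query)
    scored = []
    for name, emb in gallery.items():
        e = normalize(emb)
        scored.append((sum(a * b for a, b in zip(q, e)), name))
    scored.sort(reverse=True)
    return [(name, round(score, 3)) for score, name in scored[:k]]

# Invented 3-d embeddings standing in for encoder outputs.
gallery = {
    "dog_1": [0.9, 0.1, 0.0],
    "dog_2": [0.8, 0.3, 0.1],
    "car_1": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.2, 0.0], gallery))
```

The two dog embeddings rank ahead of the car because they point in nearly the same direction as the query, regardless of vector length.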

When not to: if you already have plentiful labels for your exact task, end-to-end supervised training is usually simpler and at least as accurate. And contrastive methods are sensitive to the augmentation choice — a bad augmentation pipeline (one that destroys the signal you care about) teaches the encoder to ignore exactly the wrong thing.

Contrastive vs other representation methods

| | Contrastive (SimCLR) | MoCo | Supervised | Autoencoder | BYOL / SimSiam | Masked (MAE / BERT) |
|---|---|---|---|---|---|---|
| Needs labels | No | No | Yes | No | No | No |
| Needs negatives | Yes | Yes (queue) | N/A | No | No | No |
| Collapse risk | Low (negatives block it) | Low | None | None | Needs stop-grad + predictor | None |
| Batch-size sensitivity | High (≥ 4096 ideal) | Low (decoupled queue) | Low | Low | Moderate | Low |
| Memory bottleneck | O(N²) sim matrix | Queue of K keys | Activations | Decoder | Two branches | Decoder / mask tokens |
| Reconstructs input? | No | No | No | Yes (pixels) | No | Yes (masked tokens) |
| Canonical use | Image pre-training, CLIP | Detection backbones | Direct task accuracy | Denoising, anomaly | Label-free SSL | NLP, vision transformers |

The headline split is negatives vs no negatives. InfoNCE-style methods (SimCLR, MoCo, CLIP) need negatives to prevent collapse and pay for them with batch size or a memory queue. BYOL and SimSiam dropped negatives entirely and avoid collapse with an asymmetric architecture instead — proving the negatives were a means, not the end.

What the numbers actually say

  • SimCLR (2020) closed most of the supervised gap. A linear classifier on frozen SimCLR features hit 76.5% top-1 on ImageNet with a ResNet-50 (4×), within a couple of points of the fully-supervised baseline of the same architecture — using zero labels during pre-training.
  • Label efficiency is the real win. Fine-tuned on just 1% of ImageNet labels (≈13 images per class), SimCLR reached ~63% top-1, beating the previous label-efficient state of the art by a wide margin. With 10% of labels it cleared 73%.
  • Batch size pays off, then plateaus. SimCLR's accuracy rose monotonically with batch size from 256 up to 4096–8192, then flattened: more negatives sharpen the mutual-information bound only up to a point.
  • MoCo decouples that cost. Instead of a giant batch, MoCo maintains a queue of 65,536 momentum-encoded keys as negatives, getting SimCLR-level negatives from an ordinary 256-image batch on a single 8-GPU machine.
  • CLIP scaled to 400M pairs. Trained on 400 million image–text pairs, CLIP reached a zero-shot ImageNet accuracy that matches a fully-supervised ResNet-50 (~76%) without ever training on an ImageNet label.

JavaScript implementation

The full NT-Xent loss over a batch of embeddings. We assume z is a 2N × d array where rows i and i + N form a positive pair (two views of sample i).

// z: Float arrays, already L2-normalized rows. 2N rows total.
function ntXentLoss(z, N, tau = 0.1) {
  const M = 2 * N;
  const dot = (a, b) => a.reduce((s, v, k) => s + v * b[k], 0);

  // Pairwise similarity / temperature
  const sim = Array.from({ length: M }, (_, i) =>
    Array.from({ length: M }, (_, j) => dot(z[i], z[j]) / tau)
  );

  let total = 0;
  for (let i = 0; i < M; i++) {
    const pos = (i + N) % M;            // the matching view

    // log-sum-exp over all k != i  (exclude self-similarity)
    let max = -Infinity;
    for (let k = 0; k < M; k++) if (k !== i) max = Math.max(max, sim[i][k]);
    let denom = 0;
    for (let k = 0; k < M; k++) if (k !== i) denom += Math.exp(sim[i][k] - max);

    // ℓ_i = -log( exp(sim_pos) / Σ_{k≠i} exp(sim_k) )
    total += -(sim[i][pos] - (max + Math.log(denom)));
  }
  return total / M;                     // average over all 2N anchors
}

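Two details that bite people. First, you must exclude self-similarity sim(i, i) from the denominator: it's always 1/τ, the largest possible logit, and would dominate the softmax. Second, keep the log-sum-exp trick (subtracting the row max before exponentiating) even though 64-bit floats don't strictly need it at this temperature: with τ = 0.1 the logits are bounded by ±10 and exp(10) ≈ 22,026 is nowhere near overflow, but the margin disappears at lower precision or lower temperature. Three such terms already exceed float16's maximum of about 65,504, and at τ = 0.01 a similarity of 1 becomes exp(100) ≈ 2.7 × 10⁴³, past float32's maximum of about 3.4 × 10³⁸.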
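
The effect of subtracting the row max is easiest to see side by side. The sketch below uses plain Python floats (64-bit) and a deliberately extreme temperature of 0.001 to provoke the failure; both function names are illustrative.

```python
import math

def naive_log_sum_exp(logits):
    """Direct formula: exponentiate first, then sum and take the log."""
    return math.log(sum(math.exp(l) for l in logits))

def stable_log_sum_exp(logits):
    """Subtract the max so the largest exponent is exp(0) = 1."""
    top = max(logits)
    return top + math.log(sum(math.exp(l - top) for l in logits))

# tau = 0.1: logits are bounded by 1/tau = 10, so both versions agree in float64.
logits = [10.0, 7.0, -3.0]
print(naive_log_sum_exp(logits), stable_log_sum_exp(logits))

# tau = 0.001 (extreme on purpose): a similarity of 0.8 becomes a logit of 800.
big = [800.0, 790.0]
try:
    naive_log_sum_exp(big)
except OverflowError as err:
    print("naive version failed:", err)
print(stable_log_sum_exp(big))  # just above 800
```

The stable version never exponentiates anything larger than zero, so it cannot overflow whatever the temperature is.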

Python implementation (PyTorch)

The same loss, but expressed as a single cross-entropy call — the trick that makes it fast on a GPU.

import torch
import torch.nn.functional as F

def nt_xent(z_i, z_j, tau=0.1):
    # z_i, z_j: (N, d) embeddings of the two views, NOT yet normalized
    N = z_i.shape[0]
    z = torch.cat([z_i, z_j], dim=0)          # (2N, d)
    z = F.normalize(z, dim=1)                 # unit hypersphere

    sim = (z @ z.T) / tau                      # (2N, 2N) cosine / temperature
    sim.fill_diagonal_(float('-inf'))         # drop self-similarity

    # For row i in [0,N), the positive is row i+N, and vice versa.
    targets = torch.cat([torch.arange(N, 2 * N),
                         torch.arange(0, N)]).to(z.device)

    # InfoNCE == cross-entropy where the "class" is the positive's index
    return F.cross_entropy(sim, targets)

The elegance: once you set the diagonal to -inf and point targets[i] at the positive's row index, the whole InfoNCE loss is F.cross_entropy. PyTorch's cross-entropy already does a numerically stable log-softmax internally, so you get the log-sum-exp safety for free. The minimal training loop wraps it:

for x in loader:                              # x: a batch of raw images
    xi, xj = augment(x), augment(x)           # two random views
    zi = head(encoder(xi))                    # projection head g(f(·))
    zj = head(encoder(xj))
    loss = nt_xent(zi, zj, tau=0.1)
    loss.backward(); opt.step(); opt.zero_grad()
# After training: discard `head`, keep `encoder` for downstream tasks.

Variants worth knowing

Triplet loss (2015, FaceNet). The ancestor of InfoNCE. One anchor, one positive, exactly one negative, with a margin: max(0, d(a,p) − d(a,n) + margin). InfoNCE generalizes it to many negatives at once via the softmax, which is why it trains far more efficiently.
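
The margin formula translates directly into code. This is a minimal sketch that assumes d is plain Euclidean distance and uses an illustrative margin of 0.2; the points are toy 2-d values.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

a, p = [0.0, 0.0], [0.1, 0.0]
print(triplet_loss(a, p, [1.0, 0.0]))  # negative already far away: loss is 0.0
print(triplet_loss(a, p, [0.2, 0.0]))  # negative inside the margin: loss is about 0.1
```

Once a negative is farther than the positive by at least the margin, it contributes no gradient at all, which is one reason a single sampled negative per anchor is a weak training signal.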

SimCLR (2020). The clean baseline above: in-batch negatives, strong augmentation, a projection head, NT-Xent. Established that the augmentation pipeline and a nonlinear head matter enormously.

MoCo (2019). Replaces in-batch negatives with a queue of features from recent batches, encoded by a slow momentum-updated copy of the encoder. Decouples the negative count from the batch size, so thousands of negatives no longer require thousands of images held in accelerator memory at once.
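
The two mechanisms, a momentum-updated key encoder and a first-in-first-out queue of keys, can be sketched without any deep-learning library. Everything here is a toy stand-in: parameters are flat lists of floats, keys are strings, and `momentum_update` and `NegativeQueue` are hypothetical names rather than MoCo's actual code.

```python
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

class NegativeQueue:
    """Fixed-size FIFO of key embeddings; the oldest keys fall out first."""

    def __init__(self, max_size):
        self.keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        self.keys.extend(batch_keys)

queue = NegativeQueue(max_size=4)        # toy size for readability
for step in range(3):
    queue.enqueue([f"key_{step}_{i}" for i in range(2)])
print(list(queue.keys))                  # only the 4 most recent keys remain

# With m = 0.9 the key parameters move 10% of the way toward the query parameters.
print(momentum_update([1.0, 1.0], [0.0, 2.0], m=0.9))
```

A momentum close to 1 keeps the key encoder changing slowly, so keys produced many steps ago are still roughly comparable to fresh ones.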

BYOL / SimSiam (2020–2021). Negative-free. An online network predicts the output of a target network; a stop-gradient on the target branch (plus a predictor MLP) prevents the trivial collapse to a constant. Proof that negatives aren't strictly necessary.

SupCon — supervised contrastive (2020). If you do have labels, treat every same-class sample as a positive (not just augmentations of one image). Often beats plain cross-entropy on clean classification.
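
The only structural change from the self-supervised setup is which entries count as positives. A minimal sketch of that mask, with made-up labels and a hypothetical helper name:

```python
def supcon_positive_mask(labels):
    """mask[i][j] is True when j != i and samples i and j share a class label."""
    n = len(labels)
    return [[i != j and labels[i] == labels[j] for j in range(n)] for i in range(n)]

labels = ["cat", "dog", "cat", "dog", "cat"]
mask = supcon_positive_mask(labels)
print([j for j, is_pos in enumerate(mask[0]) if is_pos])  # positives for anchor 0: [2, 4]
```

Each anchor can now have several positives, so the loss averages the log-probability over all True entries in its row instead of a single target index.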

CLIP (2021). Cross-modal contrastive learning. Image and text encoders trained so the true (image, caption) pairs lie on the diagonal of the similarity matrix — enabling zero-shot classification by comparing an image to text prompts.
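
The "positives on the diagonal" picture corresponds to a symmetric cross-entropy: once over rows (image to text) and once over columns (text to image). The sketch below works on a precomputed similarity matrix with invented values and a fixed τ = 0.07, whereas CLIP learns its temperature; the function names are illustrative.

```python
import math

def cross_entropy_row(logits, target):
    """Stable -log softmax(logits)[target]."""
    top = max(logits)
    lse = top + math.log(sum(math.exp(l - top) for l in logits))
    return lse - logits[target]

def clip_style_loss(sim, tau=0.07):
    """sim[i][j]: cosine similarity of image i and caption j; true pairs on the diagonal."""
    n = len(sim)
    logits = [[s / tau for s in row] for row in sim]
    image_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

aligned = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
shuffled = [aligned[1], aligned[2], aligned[0]]   # captions no longer match their images
print(clip_style_loss(aligned), clip_style_loss(shuffled))
```

The aligned matrix gives a loss near zero, while the shuffled one is heavily penalized because the largest similarities now sit off the diagonal.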

Common bugs and edge cases

  • Forgetting to normalize embeddings. InfoNCE assumes cosine similarity; without L2 normalization the dot product mixes magnitude and direction, and the temperature stops meaning what you think it means. Normalize before the loss.
  • Leaving self-similarity in the denominator. sim(i, i) = 1/τ is the largest possible logit and will swamp the softmax. Mask the diagonal to -inf.
  • Augmentations too weak (or too strong). Too weak and the two views are nearly identical — the encoder solves the task with low-level cues (color histograms) and learns nothing semantic. Too strong and the views share no real signal. SimCLR's whole result hinges on getting this dial right.
  • Keeping the projection head for downstream tasks. The representation before the head transfers better; the head is trained to discard task-irrelevant information for the contrastive objective. Throw it away.
  • Temperature set blindly. Too small (≈0.01) makes training unstable and over-focused on a few hard negatives; too large (≈1.0) treats all negatives equally and weakens the signal. Start at 0.07–0.1.
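  • Silent representation collapse. If downstream accuracy is at chance even though training looks healthy, the encoder may have collapsed to a constant. In negative-free setups a collapsed model can drive the loss all the way to its minimum; with InfoNCE a collapsed encoder instead leaves the loss stuck near log(2N − 1), the value of a uniform softmax. With negatives this is rare; in negative-free setups always verify the stop-gradient is actually stopping the gradient.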
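
One cheap way to catch collapse is to look at how much the embeddings vary across a batch. This is a heuristic sketch rather than a standard API: the threshold of 1e-4 is an arbitrary assumption, and the function names are illustrative.

```python
import math

def per_dimension_std(embeddings):
    """Standard deviation of each embedding coordinate across the batch."""
    n, d = len(embeddings), len(embeddings[0])
    stds = []
    for k in range(d):
        col = [e[k] for e in embeddings]
        mean = sum(col) / n
        stds.append(math.sqrt(sum((x - mean) ** 2 for x in col) / n))
    return stds

def looks_collapsed(embeddings, threshold=1e-4):
    """Heuristic: every coordinate is (nearly) constant across the batch."""
    return max(per_dimension_std(embeddings)) < threshold

healthy = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
collapsed = [[0.6, 0.8]] * 4
print(looks_collapsed(healthy), looks_collapsed(collapsed))  # False True
```

Logging this statistic alongside the loss during training makes a collapse visible long before the downstream evaluation does.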

Frequently asked questions

How does contrastive learning work without labels?

The labels are manufactured for free. Two random augmentations of the same image — a crop here, a color jitter there — are declared a positive pair that should land close together; every other image in the batch is a negative that should be pushed apart. No human annotation is needed because the supervision signal is the augmentation pipeline itself.

What is the InfoNCE loss?

InfoNCE (named for its roots in Noise-Contrastive Estimation and its link to mutual information) is a softmax-style cross-entropy loss over similarities. For an anchor, the positive's cosine similarity (divided by a temperature τ) is the logit you want to maximize, and all the negatives are the competing classes. Minimizing it is equivalent to maximizing a lower bound on the mutual information between the two views.

Why does contrastive learning need such large batches?

In SimCLR the negatives come from the rest of the batch, so a batch of 256 gives each anchor only 2·256 − 2 = 510 negatives, while a batch of 4096 gives 8190. More negatives tighten the mutual-information bound that InfoNCE optimizes, which is consistent with SimCLR's ImageNet accuracy climbing up to batch size 4096–8192. MoCo sidesteps this with a momentum-updated queue of negatives instead.

What does the temperature parameter τ do?

τ scales the similarities before the softmax. A small τ (≈0.07–0.1) sharpens the distribution so the loss focuses on the hardest negatives, the ones already close to the anchor. Too small and training becomes unstable; too large and all negatives are weighted almost equally, so the gradient no longer singles out the hard ones and the learning signal weakens.

What is representation collapse and how is it avoided?

Collapse is when the encoder outputs the same constant vector for everything — trivially making positives close. Negatives prevent it in InfoNCE-style methods by punishing similarity between unrelated samples. Negative-free methods (BYOL, SimSiam) avoid collapse instead with a predictor head plus a stop-gradient on the target branch.

Is CLIP contrastive learning?

Yes. CLIP applies the same InfoNCE idea across two modalities: an image encoder and a text encoder are trained so that an image and its true caption are a positive pair, while the image paired with every other caption in the batch is a negative. The positives lie on the diagonal of the image-by-text similarity matrix.