Residual Networks (ResNet)

The trick that made 100-layer networks trainable — by learning the difference, not the whole

A residual network (ResNet) adds skip connections that let a layer learn a residual F(x) on top of its input x, so the output is F(x) + x. The identity path gives gradients a direct route back through the network, which is what makes 100+ layer networks trainable.

  • Block output: y = F(x) + x
  • Introduced: He et al., 2015
  • ImageNet winner: ILSVRC 2015
  • Deepest trained: 1001 layers
  • ResNet-50 params: ≈ 25.5M

The intuition: learn the difference, not the whole answer

Stacking more layers should never hurt. A deeper network can always copy a shallower one and set the extra layers to do nothing — the identity mapping. So in theory a 56-layer network is at least as expressive as a 20-layer network. In practice, before 2015, it wasn't. Kaiming He and colleagues at Microsoft Research trained plain CNNs of increasing depth and found that the 56-layer model had higher training and test error than the 20-layer one. The extra layers weren't overfitting — they were failing to optimize at all. They called this the degradation problem.

The fix is almost insultingly simple. Instead of asking a stack of layers to learn the full transformation H(x), ask it to learn only the residual — the difference from the input — and then add the input back:

  H(x) = F(x) + x      so      F(x) = H(x) − x

If the best thing a block can do is nothing — pass the input through unchanged — then it just has to drive its weights toward zero so that F(x) ≈ 0, which optimizers do easily. Pushing a stack of nonlinear layers to exactly reproduce the identity is hard; pushing them toward zero is trivial. That reframing is the entire idea of a Residual Network (ResNet).
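
A minimal sketch of that claim in plain Python, assuming a toy two-layer dense sub-network stands in for F. The helper names (`dense`, `residual_block`) are illustrative, and the final ReLU is left out so the identity also holds for negative inputs. With every weight and bias at zero, F(x) = 0 and the block returns its input unchanged:

```python
def relu(v):
    return [max(0.0, a) for a in v]

def dense(x, weights, bias):
    """weights[out][in] applied to x, plus bias, no activation."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def residual_block(x, w1, b1, w2, b2):
    """y = F(x) + x, with F = dense -> ReLU -> dense (no final ReLU in this toy)."""
    f = dense(relu(dense(x, w1, b1)), w2, b2)
    return [fi + xi for fi, xi in zip(f, x)]

n = 3
zero_weights = [[0.0] * n for _ in range(n)]
zero_bias = [0.0] * n
x = [1.5, -2.0, 0.25]

# All-zero parameters: the branch contributes nothing, the shortcut carries x.
y = residual_block(x, zero_weights, zero_bias, zero_weights, zero_bias)
print(y == x)
```

A plain stack with the same zero parameters would output all zeros instead, which is why "do nothing" is cheap for a residual block and expensive for a plain one.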

The residual block and why gradients survive

A residual block has two paths. The main path runs the input through two or three convolutional layers with batch normalization and ReLU. The shortcut path (the skip connection) carries the unchanged input forward. At the end of the block the two paths are summed element-wise, then a final ReLU is applied:

x ──────────────────────────────┐  (identity shortcut)
 │                               │
 └─► conv ─► BN ─► ReLU ─► conv ─► BN ─►(+)─► ReLU ─► out
                                       ▲
                                  F(x) │ added to x

The magic is in the backward pass. With y = F(x) + x, differentiate by the chain rule:

∂L/∂x = ∂L/∂y · ∂y/∂x = ∂L/∂y · (1 + ∂F/∂x)

That 1 is everything. In a plain deep network the gradient is a long product of Jacobians; if each factor has magnitude below 1, the product decays exponentially with depth and the early layers stop learning. This is the vanishing gradient problem. The skip connection adds a +1 at every block, so the gradient always has a direct path back toward the input. Even if the learned term ∂F/∂x shrinks to zero, the gradient that reaches x is still ∂L/∂y, passed through unchanged. Across N blocks the per-block factors are (1 + ∂F/∂x) rather than ∂F/∂x alone, so expanding the product always leaves a term that carries ∂L/∂y straight through the identity path. That is why a 152-layer ResNet trains where a 152-layer plain net stalls.
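
The effect can be illustrated with a scalar toy, assuming every layer has the same small local derivative. Real Jacobians are matrices, so this is only a sketch of the scaling behaviour, and the function names are illustrative:

```python
import math

def plain_gradient(local_derivs):
    """Backprop factor through a plain chain: product of dF/dx per layer."""
    return math.prod(local_derivs)

def residual_gradient(local_derivs):
    """Backprop factor through a residual chain: product of (1 + dF/dx) per block."""
    return math.prod(1.0 + d for d in local_derivs)

depth = 50
small = [0.01] * depth   # assumed per-layer derivative, chosen for illustration

plain = plain_gradient(small)                    # collapses toward zero
residual = residual_gradient(small)              # stays of order 1
residual_dead = residual_gradient([0.0] * depth) # branch contributes nothing

print(plain, residual, residual_dead)
```

With these assumed values the plain product is vanishingly small while the residual product stays near 1, and a completely dead branch still passes the gradient through with factor exactly 1. The same toy also shows that the skip is not a bound: large positive derivatives would make the residual product grow.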

There's a subtlety. When a block changes the number of channels or downsamples spatially, the input x and the output F(x) have different shapes and can't be added. ResNet handles this with a projection shortcut: a 1×1 convolution (with matching stride) reshapes x before the sum. Within a stage where shapes already match, it uses the parameter-free identity shortcut. He et al. labelled these option A (zero-pad identity), option B (projection only on shape change), and option C (projection everywhere); option B is the standard and the best accuracy/cost trade-off.
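
The shape bookkeeping can be sketched with the standard convolution output-size formula. The helper names are hypothetical, shapes are written as (channels, side length) for square feature maps, and the 256 → 512 channel, 56 → 28 resolution step is an assumed example of a stage transition:

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (no dilation, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1

def main_path_shape(in_size, out_ch, stride):
    # The strided 3x3 conv (padding 1) sets the spatial size of F(x);
    # the block's last conv sets its channel count.
    return (out_ch, conv_out_size(in_size, kernel=3, stride=stride, padding=1))

def projection_shape(in_size, out_ch, stride):
    # 1x1 conv, no padding: changes channels, and resolution when stride > 1.
    return (out_ch, conv_out_size(in_size, kernel=1, stride=stride, padding=0))

identity = (256, 56)                                     # x as it enters the block
main = main_path_shape(56, out_ch=512, stride=2)         # shape of F(x)
good_proj = projection_shape(56, out_ch=512, stride=2)   # stride matches
bad_proj = projection_shape(56, out_ch=512, stride=1)    # stride forgotten

print(main, good_proj, bad_proj, identity)
```

Only the projection with the matching stride lines up with F(x); the raw identity and the stride-1 projection both disagree with it, so neither can be added element-wise.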

When to reach for residual connections

  • Any network deeper than ~20 layers. Below that, plain stacks train fine; above it, the degradation problem bites and skip connections are essentially mandatory.
  • Image classification, detection, and segmentation backbones. ResNet-50 and ResNet-101 are still default feature extractors in Faster R-CNN, Mask R-CNN, and countless downstream models.
  • Transformers and modern sequence models. Every Transformer block wraps its attention and feed-forward sublayers in a residual connection — x + Sublayer(x) — for exactly the same gradient-flow reason. ResNet's idea outlived its original computer-vision context.
  • As a default, not a last resort. Residual connections cost almost nothing (identity shortcuts add zero parameters) and rarely hurt, so they're a near-free architectural insurance policy against depth.

When not to bother: very shallow networks (a few layers) gain little. The addition also requires that intermediate tensors share a channel count within a stage, which constrains some exotic architectures.

ResNet vs other deep CNN architectures

Architecture              ResNet-50          VGG-16                 Plain-34 (no skips)  DenseNet-121           Highway Net      Inception-v3
Year                      2015               2014                   2015 (baseline)      2017                   2015             2015
Skip mechanism            additive identity  none                   none                 concatenation          gated (learned)  none
Max trainable depth       1000+ layers       ~19 (degrades beyond)  fails past ~30       250+ layers            100+ layers      ~48 layers
Parameters                ≈ 25.5M            ≈ 138M                 ≈ 21.8M              ≈ 8M                   varies           ≈ 23.8M
ImageNet top-5 error      ≈ 5.3%             ≈ 7.3%                 ≈ 11%+               ≈ 5.3%                 -                ≈ 5.6%
Extra params per skip     0 (identity)       -                      -                    0 (concat)             2 gate matrices  -
Memory cost of shortcut   low (reuses x)     -                      -                    high (keeps all maps)  medium           -

VGG is the cautionary tale: it stacked plain 3×3 convolutions and paid with 138 million parameters (over 5× ResNet-50) for worse accuracy and a depth ceiling. The Highway Network (Srivastava et al., 2015) anticipated ResNet with learned gates y = T·F(x) + (1−T)·x; ResNet dropped the gates, adding both paths unscaled as y = F(x) + x, and found it trained deeper and better. DenseNet pushed the idea further by concatenating every earlier feature map rather than adding, at the cost of memory.
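
To make the gated and ungated forms concrete, here is a small element-wise sketch. The gate values `t` are passed in directly as an assumption; in a Highway layer they would come from a learned sigmoid transform, which is omitted here:

```python
def highway(x, f, t):
    """Highway combination: y = T*F(x) + (1 - T)*x, element-wise, T in (0, 1)."""
    return [ti * fi + (1.0 - ti) * xi for xi, fi, ti in zip(x, f, t)]

def residual(x, f):
    """Residual combination: y = F(x) + x, no gate on either path."""
    return [fi + xi for xi, fi in zip(x, f)]

x = [2.0, -1.0]    # block input
f = [0.5, 0.5]     # assumed branch output F(x)
t = [0.25, 0.25]   # assumed gate values

y_highway = highway(x, f, t)   # shortcut scaled by 0.75, branch by 0.25
y_residual = residual(x, f)    # both paths at full strength

print(y_highway, y_residual)
```

In the gated form the shortcut is attenuated by (1 − T) whenever the gate opens, whereas the residual form always carries x through at full strength.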

What the numbers actually say

  • Depth that was previously impossible. The ILSVRC 2015 winning ensemble used ResNets up to 152 layers — 8× deeper than VGG-19 — yet at lower computational complexity (ResNet-152 is ≈ 11.3 GFLOPs vs VGG-19's ≈ 19.6 GFLOPs).
  • 1001 layers, trained successfully. The 2016 pre-activation ResNet trained a 1001-layer model on CIFAR-10 to 4.62% error, a depth at which plain networks fail to optimize.
  • Parameter efficiency. ResNet-50's ≈ 25.5M parameters beat VGG-16's ≈ 138M on ImageNet. The savings come from bottleneck blocks and replacing VGG's giant fully-connected head (its FC layers alone hold ≈ 123M of its parameters) with global average pooling.
  • The degradation gap. On CIFAR-10, a plain 56-layer net reached worse training error than a plain 20-layer net; the residual versions reversed it, with deeper consistently beating shallower.
  • Bottleneck FLOP win. A bottleneck block doing 256→64→64→256 channels costs roughly the same FLOPs as a two-layer 3×3 basic block on 64 channels, but operates on 4× the channel width — that's how ResNet-50/101/152 stay affordable.
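
The bottleneck and VGG figures above can be checked with a few lines of arithmetic. This sketch counts convolution weights without biases or batch-norm parameters; at a fixed spatial resolution, multiply-adds scale with that weight count. The VGG-16 head is assumed to be the usual 7×7×512 → 4096 → 4096 → 1000 stack:

```python
def conv_weights(in_ch, out_ch, kernel):
    """Weight count of a conv layer without bias: in * out * k * k."""
    return in_ch * out_ch * kernel * kernel

def dense_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out

# Bottleneck: 1x1 reduce (256->64), 3x3 (64->64), 1x1 restore (64->256)
bottleneck = (conv_weights(256, 64, 1)
              + conv_weights(64, 64, 3)
              + conv_weights(64, 256, 1))

# Basic block: two 3x3 convs, first on 64 channels, then on 256 for comparison
basic_64 = 2 * conv_weights(64, 64, 3)
basic_256 = 2 * conv_weights(256, 256, 3)

# VGG-16 fully-connected head
vgg_fc = (dense_params(512 * 7 * 7, 4096)
          + dense_params(4096, 4096)
          + dense_params(4096, 1000))

print(bottleneck, basic_64, basic_256, vgg_fc)
```

The bottleneck comes out at 69,632 weights against 73,728 for the 64-channel basic block, while a basic block at the full 256-channel width would need 1,179,648. The VGG-16 head alone accounts for 123,642,856 parameters.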

JavaScript implementation

A residual block is just an addition wrapped around any sub-network. Here is the forward pass in plain JavaScript on flat arrays, with the projection handled when shapes differ:

// Element-wise ReLU
const relu = v => v.map(a => Math.max(0, a));

// A toy "layer": dense matmul (weights[out][in]) + bias, no activation
function dense(x, weights, bias) {
  return weights.map((row, o) =>
    row.reduce((sum, w, i) => sum + w * x[i], bias[o]));
}

// One residual block: y = ReLU( F(x) + shortcut(x) )
function residualBlock(x, block) {
  // main path: dense -> relu -> dense
  let h = relu(dense(x, block.w1, block.b1));
  h = dense(h, block.w2, block.b2);

  // shortcut: identity if shapes match, else a 1x1-style projection
  const shortcut = block.proj
    ? dense(x, block.proj.w, block.proj.b)   // projection shortcut
    : x;                                     // identity shortcut

  if (h.length !== shortcut.length) {
    throw new Error(`shape mismatch ${h.length} vs ${shortcut.length} — need a projection`);
  }

  const summed = h.map((v, i) => v + shortcut[i]);  // F(x) + x
  return relu(summed);
}

// Stack blocks — depth no longer hurts because each adds its input back
function resnetForward(x, blocks) {
  return blocks.reduce((acc, block) => residualBlock(acc, block), x);
}

The only line that matters conceptually is h.map((v, i) => v + shortcut[i]) — that single addition is what every ResNet, and every Transformer, is built on.

Python implementation (PyTorch)

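In practice you write this with convolutions and batch norm. This is the bottleneck block in its common form. It follows the 2015 paper except for one detail: the stride sits on the 3×3 convolution (the variant torchvision calls ResNet v1.5), whereas the paper placed it on the first 1×1 convolution: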

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    expansion = 4  # output channels = mid * 4

    def __init__(self, in_ch, mid_ch, stride=1, downsample=None):
        super().__init__()
        out_ch = mid_ch * self.expansion
        # 1x1 reduce -> 3x3 spatial -> 1x1 restore
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.bn1   = nn.BatchNorm2d(mid_ch)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, 1, bias=False)
        self.bn3   = nn.BatchNorm2d(out_ch)
        self.relu  = nn.ReLU(inplace=True)
        # projection shortcut when channels or stride change (option B)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))          # no ReLU before the add

        if self.downsample is not None:          # reshape x to match out
            identity = self.downsample(x)

        out += identity                          # F(x) + x  — the skip connection
        return self.relu(out)                    # ReLU after the add

Two details people get wrong: the final ReLU comes after the addition (so the shortcut isn't squashed), and conv3 has no activation before the sum. The downsample module is the 1×1 projection that matches shapes on the first block of each stage.

Variants worth knowing

Pre-activation ResNet (ResNet v2, 2016). Reorders the block to BN → ReLU → conv, so the shortcut path is a clean, unactivated signal from input to output. This made the 1001-layer model trainable and slightly improved accuracy. Many later architectures adopt the same ordering, although torchvision's stock ResNets keep the original post-activation layout.
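
The difference in ordering can be seen in a scalar toy, assuming a branch that outputs zero and leaving batch norm out entirely. The function names are illustrative, not taken from any library:

```python
def relu(a):
    return max(0.0, a)

def post_activation_block(x, branch):
    """v1 ordering: y = ReLU(F(x) + x). The ReLU sits on the shortcut's path."""
    return relu(branch(x) + x)

def pre_activation_block(x, branch):
    """v2 ordering: y = x + F(ReLU(x)). The activation lives inside the branch."""
    return x + branch(relu(x))

def stack(block, x, branch, depth):
    for _ in range(depth):
        x = block(x, branch)
    return x

zero_branch = lambda a: 0.0   # assumed: the branch has learned to do nothing

v1_out = stack(post_activation_block, -1.0, zero_branch, depth=10)
v2_out = stack(pre_activation_block, -1.0, zero_branch, depth=10)

print(v1_out, v2_out)
```

With the v1 ordering a negative input is clipped to zero by the first block's ReLU, so the stack is not a true identity. With the v2 ordering the input reaches the output untouched at any depth.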

Bottleneck blocks. The 1×1 → 3×3 → 1×1 sandwich (ResNet-50/101/152) versus the two-3×3 basic block (ResNet-18/34). Bottlenecks buy depth and width cheaply by doing the expensive 3×3 convolution at a reduced channel count.

ResNeXt (2017). Replaces the single 3×3 with grouped convolutions — a "cardinality" dimension of parallel paths summed together. Same FLOPs as ResNet, better accuracy.

Wide ResNet (2016). Argues depth has diminishing returns; instead widen the blocks (more channels) and stay shallower. A 16-layer Wide ResNet can beat a 1000-layer thin one while training faster.

DenseNet (2017). Replaces the additive skip with concatenation: each layer receives every previous layer's feature maps. Stronger feature reuse and fewer parameters, but higher memory because nothing is discarded.
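
The memory cost follows from how channel counts evolve. A rough sketch, assuming a dense block that starts at 64 channels with 6 layers and a growth rate of 32 (the values commonly cited for the first block of DenseNet-121):

```python
def additive_channels(channels, num_blocks):
    """Additive skips: F(x) + x keeps the channel count fixed."""
    return [channels] * (num_blocks + 1)

def concat_channels(channels, num_layers, growth_rate):
    """Concatenative skips: each layer appends growth_rate new feature maps."""
    return [channels + i * growth_rate for i in range(num_layers + 1)]

resnet_style = additive_channels(64, num_blocks=6)
densenet_style = concat_channels(64, num_layers=6, growth_rate=32)

print(resnet_style)
print(densenet_style)
```

The additive version stays at 64 channels throughout, while the concatenated version grows from 64 to 256 within a single block because every earlier feature map is kept alive.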

Residual connections in Transformers. Not a CNN at all, but the same trick: every attention and feed-forward sublayer is wrapped as x + Sublayer(x) with layer normalization. This is why the residual idea is arguably the single most reused building block in modern deep learning.
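
A minimal sketch of the two common ways the residual and the normalization are combined, written on plain lists with the learnable gain and bias of layer normalization omitted. The names `post_norm` and `pre_norm` are descriptive labels, not API calls:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]

def post_norm(x, sublayer):
    """Original Transformer layout: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + s for a, s in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    """Pre-norm layout: x + Sublayer(LayerNorm(x))."""
    return [a + s for a, s in zip(x, sublayer(layer_norm(x)))]

zero_sublayer = lambda v: [0.0] * len(v)   # assumed stand-in for attention or FFN
x = [1.0, 2.0, 4.0]

post = post_norm(x, zero_sublayer)
pre = pre_norm(x, zero_sublayer)

print(post, pre)
```

The pre-norm layout mirrors pre-activation ResNet: with a silent sublayer the input passes through unchanged, whereas the post-norm layout still renormalizes the sum.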

Common bugs and edge cases

  • Adding tensors of mismatched shape. If a block changes channels or downsamples and you forget the projection shortcut, the element-wise add fails (or silently broadcasts wrong). Always insert a 1×1 conv with matching stride on the identity path when shapes change.
  • Applying ReLU before the addition. Squashing F(x) with ReLU before adding x blocks negative residuals and partly defeats the design. The ReLU belongs after the sum (v1) or the whole block is pre-activated (v2).
  • Putting batch norm on the shortcut. Normalizing the identity path rescales the very signal you wanted to preserve, weakening the gradient highway. Keep identity shortcuts clean; only projection shortcuts carry a BN.
  • Forgetting to zero-initialize the last BN's gamma. A well-known trick (setting the final BN gamma in each block to zero at initialization) makes each block start as the identity, which stabilizes early training of very deep nets. Skipping it can slow convergence.
  • Assuming skip connections cure overfitting. They solve the optimization degradation problem, not generalization. A 152-layer ResNet still needs data augmentation, weight decay, and regularization like any large model.
  • Mismatched strides on the two paths. If the main path uses stride 2 in conv2 but the projection shortcut uses stride 1, the spatial dimensions won't line up. The stride must match on both paths.
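
The zero-gamma trick from the list above can be sketched on a single channel. This is a toy batch norm over a list of scalars with an arbitrary stand-in branch; in torchvision's ResNet constructor the corresponding switch is the `zero_init_residual` argument:

```python
import math

def batch_norm(values, gamma, beta, eps=1e-5):
    """Normalize one channel over the batch, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

def block(x_batch, branch, gamma):
    """y = ReLU(BN(branch(x)) + x), with the last BN's gamma exposed."""
    f = batch_norm([branch(x) for x in x_batch], gamma=gamma, beta=0.0)
    return [max(0.0, fi + xi) for fi, xi in zip(f, x_batch)]

x_batch = [0.5, 1.0, 2.0, 4.0]      # non-negative, as after a previous ReLU
branch = lambda x: 3.0 * x - 1.0    # arbitrary stand-in for the conv layers

out_zero = block(x_batch, branch, gamma=0.0)   # block starts as the identity
out_one = block(x_batch, branch, gamma=1.0)    # default init perturbs the input

print(out_zero, out_one)
```

With gamma at zero the branch is silenced regardless of its weights, so a freshly initialized stack behaves like a much shallower network and the branches switch on gradually as gamma is learned.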

Frequently asked questions

Why does adding more layers to a plain network make it worse?

This is the degradation problem, not overfitting. He et al. (2015) showed a plain 56-layer CNN had higher training AND test error than a 20-layer one. The extra layers couldn't even learn the identity mapping that would have preserved the shallower network's accuracy, so optimization — not capacity — was the bottleneck.

How does a skip connection fix the vanishing gradient problem?

The block computes y = F(x) + x, so the gradient of the loss with respect to x is ∂L/∂y · (1 + ∂F/∂x). The identity branch contributes the constant 1 at every block, so even if the ∂F/∂x terms shrink toward zero, the gradient still has a clear path straight back to the input.

What is the difference between a basic block and a bottleneck block?

A basic block (ResNet-18/34) stacks two 3×3 convolutions. A bottleneck block (ResNet-50/101/152) uses 1×1 → 3×3 → 1×1 convolutions: the first 1×1 reduces channels (e.g. 256→64), the 3×3 does spatial work cheaply, and the last 1×1 restores channels (64→256). Its FLOPs are close to those of a basic block on 64 channels, so it works at 4× the channel width for roughly the same cost.

What happens to the shortcut when the input and output have different shapes?

When a block changes channel count or spatial resolution (a stride-2 block), the identity x can't be added directly. ResNet uses a projection shortcut — a 1×1 convolution (option B) with matching stride — to reshape x before the addition. Within a stage where shapes match, it uses a parameter-free identity shortcut.

Is the skip connection added before or after the activation?

In the original 2015 ResNet the order is conv → BN → ReLU → conv → BN, then add x, then a final ReLU. The 2016 pre-activation variant moves BN and ReLU before each conv so the shortcut carries a clean, unactivated signal end to end — this trained ResNet-1001 and slightly improved accuracy.

Does ResNet have more parameters than VGG because it is deeper?

No — it has far fewer. VGG-16 has about 138 million parameters; ResNet-50 has about 25.5 million and ResNet-152 about 60 million. Skip connections add zero parameters for identity shortcuts, and bottleneck blocks plus global average pooling (instead of huge fully-connected layers) keep the count low despite the depth.