Machine Learning
Convolutional Neural Network (CNN)
Slide a tiny filter, share its weights, and let edges grow into objects
A convolutional neural network slides small learnable filters across an image, sharing weights to build a hierarchy of features — edges, then textures, then shapes — that powers state-of-the-art image recognition.
- Core operation: Convolution (sliding dot product)
- Key trick: Weight sharing + local connectivity
- 3×3 conv params: 9·Cin·Cout + Cout
- Output size: ⌊(W − K + 2P)/S⌋ + 1
- Invariance: Approximate, to translation only
How a CNN sees an image
A photograph is just a grid of numbers — for a color image, three numbers per pixel (red, green, blue). A naïve approach would flatten that grid into one long vector and feed it to a fully-connected network. That throws away the single most important fact about an image: pixels that sit next to each other are related, and a feature like an edge is useful no matter where it appears. A convolutional neural network bakes both facts into its architecture.
The core operation is the convolution. Take a small grid of learnable weights — the filter or kernel, often just 3×3 — and slide it across the image. At every position, multiply each weight by the pixel beneath it, sum the results, add a bias, and write that single number into an output grid called a feature map. Mathematically the value at output position (i, j) is:
out[i][j] = bias + Σ_m Σ_n W[m][n] · input[i·S + m][j·S + n]
where W is the K×K filter and S is the stride. (Technically this is cross-correlation; deep-learning libraries call it convolution but skip the kernel flip that signal processing uses — the network learns the flipped weights anyway, so it makes no practical difference.) A filter that happens to have large positive weights on one side and negative on the other becomes an edge detector; the feature map lights up exactly where the image transitions from dark to light.
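To make the formula concrete, here is a minimal pure-Python sketch of the sliding dot product applied to a tiny dark-to-light step edge. The function name `cross_correlate2d`, the toy image, and the hand-picked filter are illustrative assumptions, not part of any library.

```python
def cross_correlate2d(image, kernel, bias=0.0, stride=1):
    """Valid cross-correlation:
    out[i][j] = bias + sum_m sum_n W[m][n] * input[i*S + m][j*S + n]."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = bias
            for m in range(k):
                for n in range(k):
                    total += kernel[m][n] * image[i * stride + m][j * stride + n]
            row.append(total)
        out.append(row)
    return out


# Toy 4x6 image: dark (0) on the left half, light (1) on the right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Negative weights on the left, positive on the right: fires on dark-to-light.
edge_filter = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

feature_map = cross_correlate2d(image, edge_filter)
print(feature_map)  # [[0.0, 3.0, 3.0, 0.0], [0.0, 3.0, 3.0, 0.0]]

# Flipping the kernel (what signal-processing convolution would do) simply
# negates the response here; a trained network would learn the other sign.
flipped = [row[::-1] for row in edge_filter[::-1]]
print(cross_correlate2d(image, flipped))  # [[0.0, -3.0, -3.0, 0.0], [0.0, -3.0, -3.0, 0.0]]
```

The non-zero columns of the feature map are exactly the window positions that straddle the edge.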
Two properties make this powerful. Local connectivity: each output neuron looks at only a small patch, not the whole image. Weight sharing: the same filter weights are reused at every position. A single 3×3 filter over one input channel is 9 weights plus a bias, whether the image is 28×28 or 4000×3000. That is what keeps CNNs small enough to train.
The layers stacked together
A convolution alone is linear, so after each conv layer you apply a nonlinearity — almost always ReLU, max(0, x) — which lets the network model curved decision boundaries and rectifies the feature map to keep only positive activations.
Between conv blocks you usually pool. A 2×2 max-pool with stride 2 slides a window and keeps the largest value, halving each spatial dimension. Pooling has no learnable weights; its jobs are to shrink the map (so later layers process a quarter as many positions), enlarge the effective receptive field, and grant a little tolerance to small shifts. Many modern architectures drop pooling in favor of strided convolutions, which downsample with learned weights.
Stack several conv → ReLU → pool blocks and the abstraction climbs. The first layer learns edges and color blobs. The second combines edges into corners and textures. The third assembles textures into object parts — an eye, a wheel, a wing. This emergent feature hierarchy is the central idea, and you do not design it; gradient descent discovers it. Finally you flatten the last feature maps (or global-average-pool them), pass them through one or two dense layers, and apply softmax to produce class probabilities.
When to reach for a CNN
- Data with grid topology. Images obviously, but also spectrograms, video frames, Go boards, and even 1D time series (a 1D conv is a learned FIR filter).
- Limited training data. The locality and weight-sharing priors are strong inductive biases, so CNNs generalize from far fewer examples than transformers, which must learn those priors from scratch.
- On-device and real-time inference. Efficient CNNs (MobileNet, EfficientNet) run in milliseconds on phones; their regular, local computation pattern maps cleanly onto hardware.
- What is in the frame matters but where it sits doesn't. A cat is a cat in any corner of the frame.
Reach for something else when position is semantically critical (a fully-connected head or coordinate channels), when you need to model long-range dependencies across a sequence (attention / transformers), or when your data has no spatial grid at all (tabular data — use gradient-boosted trees).
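The remark above that a 1D conv is a learned FIR filter can be illustrated with fixed taps. This is a small sketch in plain Python; the `conv1d` helper, the toy signal, and the hand-chosen taps are assumptions for illustration, whereas a CNN would learn the taps by gradient descent.

```python
def conv1d(signal, taps):
    """Valid 1D cross-correlation: slide the taps over the signal."""
    k = len(taps)
    return [sum(taps[n] * signal[i + n] for n in range(k))
            for i in range(len(signal) - k + 1)]


signal = [0, 0, 3, 0, 0, 6, 0]

# A 3-tap moving average (a classic low-pass FIR filter) spreads each spike out.
smoothed = conv1d(signal, [1 / 3, 1 / 3, 1 / 3])

# A 2-tap difference filter (a simple high-pass FIR filter) marks each jump.
diff = conv1d(signal, [-1, 1])
print(diff)  # [0, 3, -3, 0, 6, -6]
```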
CNN vs other vision architectures
| | CNN | Fully-connected MLP | Vision Transformer (ViT) | MLP-Mixer |
|---|---|---|---|---|
| Inductive bias | Locality + translation equivariance | None (sees flat vector) | Weak; learns from data | Weak; per-patch + per-channel mixing |
| Params for a vision task | Low (weight sharing) | Huge (≈150M into one dense layer) | Moderate–high | Moderate |
| Data efficiency | High | Very low | Low (needs 14M+ images or pretraining) | Low |
| Long-range dependencies | Weak (limited by receptive field) | Global but unstructured | Global from layer 1 (self-attention) | Global per token |
| Compute scaling | O(K²·C²·HW) per layer | O(input·output) | O(N²·d) in tokens | O(N·d) mixing |
| Real-world use | Medical imaging, mobile vision, detection | Rarely for images | Large-scale classification, multimodal | Research / efficiency studies |
The headline trade-off is bias versus scale. A CNN hard-codes assumptions that are correct for images, so it wins when data is scarce. A ViT assumes almost nothing and must learn structure from data, so it wins at massive scale. Recent designs like ConvNeXt deliberately port transformer training tricks back onto a pure-convolutional backbone and match ViTs, which is strong evidence the convolution itself was never the bottleneck.
What the numbers actually say
- Parameter savings are dramatic. A dense layer mapping a 224×224×3 image to 1,000 units needs about 150 million weights. A 3×3 conv with 3 input and 64 output channels needs 3·9·64 + 64 = 1,792 parameters (1,728 weights plus 64 biases): about 84,000 times fewer, or nearly five orders of magnitude, and reused across all 50,176 spatial positions.
- Two small filters beat one big one. Stacking two 3×3 convs gives the same 5×5 receptive field with 2·(9C²) = 18C² parameters versus 25C² for a single 5×5 — 28% fewer params, plus an extra nonlinearity. This is the VGG insight that ended the era of large kernels.
- Convolution dominates the FLOP budget. A single conv layer costs roughly K²·Cin·Cout·H·W multiply-adds. For ResNet-50 the conv layers account for the overwhelming majority of its ≈4 billion FLOPs per image, while the final dense classifier is a rounding error.
- Depthwise separable convolutions cut cost ≈8–9×. MobileNet factors a standard conv into a depthwise (per-channel) conv plus a 1×1 pointwise conv, replacing K²·Cin·Cout multiply-adds per position with K²·Cin + Cin·Cout. The ratio of the two is 1/Cout + 1/K², so for K=3 and typical channel counts the cost falls to roughly 1/8 to 1/9 of a standard conv.
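The arithmetic in the bullets above can be checked in a few lines. This is a sketch; the helper names are made up for illustration, and the channel count `c = 64` is an arbitrary choice since the two-3×3 saving does not depend on it.

```python
def dense_weights(n_in, n_out):
    """Weights in a fully-connected layer (biases ignored)."""
    return n_in * n_out


def conv_params(k, c_in, c_out):
    """Weights plus one bias per output channel for a k x k convolution."""
    return k * k * c_in * c_out + c_out


def separable_cost_ratio(k, c_out):
    """(K^2*Cin + Cin*Cout) / (K^2*Cin*Cout), which simplifies to 1/Cout + 1/K^2."""
    return 1 / c_out + 1 / (k * k)


dense = dense_weights(224 * 224 * 3, 1000)
conv = conv_params(3, 3, 64)
print(f"dense: {dense:,}  conv: {conv:,}  ratio: {dense // conv:,}x")
# dense: 150,528,000  conv: 1,792  ratio: 84,000x

c = 64  # any channel count gives the same percentage
two_3x3 = 2 * 9 * c * c
one_5x5 = 25 * c * c
saving = 1 - two_3x3 / one_5x5
print(f"two 3x3 vs one 5x5: {saving:.0%} fewer weights")  # 28% fewer weights

for c_out in (64, 256):
    speedup = 1 / separable_cost_ratio(3, c_out)
    print(f"depthwise-separable speed-up at Cout={c_out}: {speedup:.1f}x")
```

The speed-up approaches K² = 9 from below as the output channel count grows, which is where the 8 to 9× figure comes from.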
JavaScript implementation
A 2D valid convolution over a single-channel image with a single K×K filter, plus 2×2 max-pooling. No framework — just to show the arithmetic.
// image: H×W array of numbers; kernel: K×K array.
// Returns the feature map: (H-K+1)×(W-K+1) at stride 1, smaller at larger strides.
function conv2d(image, kernel, { stride = 1, relu = true } = {}) {
  const H = image.length, W = image[0].length, K = kernel.length;
  const outH = Math.floor((H - K) / stride) + 1;
  const outW = Math.floor((W - K) / stride) + 1;
  const out = Array.from({ length: outH }, () => new Float32Array(outW));
  for (let i = 0; i < outH; i++) {
    for (let j = 0; j < outW; j++) {
      let sum = 0;
      for (let m = 0; m < K; m++)
        for (let n = 0; n < K; n++)
          sum += kernel[m][n] * image[i * stride + m][j * stride + n];
      out[i][j] = relu ? Math.max(0, sum) : sum; // ReLU rectifies the map
    }
  }
  return out;
}

// Non-overlapping 2×2 max-pool; a trailing odd row or column is dropped.
function maxPool2x2(fmap) {
  const H = fmap.length, W = fmap[0].length;
  const out = Array.from({ length: H >> 1 }, () => new Float32Array(W >> 1));
  for (let i = 0; i + 1 < H; i += 2)
    for (let j = 0; j + 1 < W; j += 2)
      out[i >> 1][j >> 1] = Math.max(
        fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1]);
  return out;
}

// A Sobel-style vertical-edge detector, the kind of filter a CNN learns on its own.
const sobelX = [[1, 0, -1], [2, 0, -2], [1, 0, -1]];
// myImage: your own H×W array of pixel intensities (not defined here).
const edges = maxPool2x2(conv2d(myImage, sobelX)); // detect → rectify → downsample
Two details worth flagging. First, the four nested loops (two over output positions, two over the kernel) are exactly why naïve convolution is slow; production libraries reshape the patches into a matrix (im2col) and call a single tuned matrix multiply (GEMM), or use Winograd/FFT algorithms. Second, this is a valid convolution: at stride 1 the output is smaller than the input by K−1 on each axis. To keep the size with an odd K, pad the image with (K−1)/2 zeros on every side first.
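Both details can be sketched in a few lines of plain Python. The helper names `pad2d`, `im2col`, and `conv2d_im2col` are illustrative assumptions; a real library would run the multiply as one tuned GEMM over many filters at once, whereas with a single filter the matrix multiply below degenerates to a matrix-vector product.

```python
def pad2d(image, p):
    """Surround the image with p rows and columns of zeros on every side."""
    width = len(image[0]) + 2 * p
    padded = [[0.0] * width for _ in range(p)]
    for row in image:
        padded.append([0.0] * p + list(row) + [0.0] * p)
    padded.extend([0.0] * width for _ in range(p))
    return padded


def im2col(image, k):
    """Unroll every k x k patch into one row of a (positions x k*k) matrix."""
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    cols = [[image[i + m][j + n] for m in range(k) for n in range(k)]
            for i in range(out_h) for j in range(out_w)]
    return cols, out_h, out_w


def conv2d_im2col(image, kernel):
    """Stride-1 valid convolution as (patch matrix) x (flattened kernel)."""
    k = len(kernel)
    cols, out_h, out_w = im2col(image, k)
    flat_kernel = [w for row in kernel for w in row]
    flat_out = [sum(p * w for p, w in zip(patch, flat_kernel)) for patch in cols]
    return [flat_out[r * out_w:(r + 1) * out_w] for r in range(out_h)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
sobel_x = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

valid = conv2d_im2col(image, sobel_x)           # 2 x 2: shrinks by K-1
same = conv2d_im2col(pad2d(image, 1), sobel_x)  # 4 x 4: (K-1)/2 = 1 zero per side
print(valid)  # [[-8, -8], [-8, -8]]
```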
Python implementation (PyTorch)
The whole point of a CNN is that you don't hand-pick the filters — gradient descent learns them. Here is a small but real classifier whose conv weights are trainable.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # in_channels, out_channels, kernel_size; padding=1 keeps H,W for a 3x3 kernel
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)   # 32 learned filters
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)                            # halves H and W
        self.fc1 = nn.Linear(64 * 8 * 8, 128)                     # for 32x32 input
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):                        # x: (batch, 3, 32, 32)
        x = self.pool(F.relu(self.conv1(x)))     # -> (batch, 32, 16, 16)
        x = self.pool(F.relu(self.conv2(x)))     # -> (batch, 64, 8, 8)
        x = torch.flatten(x, 1)                  # -> (batch, 4096)
        x = F.relu(self.fc1(x))
        return self.fc2(x)                       # raw logits; loss does the softmax


model = SmallCNN()

# How the filters get learned: standard supervised loop.
# train_loader is assumed to be a DataLoader yielding (images, labels) batches.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for images, labels in train_loader:              # images: (B,3,32,32)
    opt.zero_grad()
    logits = model(images)
    loss = loss_fn(logits, labels)
    loss.backward()                              # backprop through conv layers
    opt.step()                                   # updates the shared filter weights

# Count parameters to see weight-sharing pay off:
n = sum(p.numel() for p in model.parameters())
print(f"{n:,} parameters")                       # ~545k, tiny for an image model
Notice CrossEntropyLoss applies the softmax internally, so the network returns raw logits — feeding it pre-softmaxed values is a classic double-softmax bug. Also note that backward() propagates gradients through the convolutions; because the weights are shared, the gradient at each filter weight is the sum of contributions from every spatial position it touched.
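The claim about shared-weight gradients can be checked numerically without any framework. Below is a minimal sketch on a 1D convolution with a toy squared loss; the function names, the input values, and the choice of loss are assumptions made for illustration.

```python
def conv1d(x, w):
    """Valid 1D cross-correlation with shared weights w."""
    k = len(w)
    return [sum(w[n] * x[i + n] for n in range(k))
            for i in range(len(x) - k + 1)]


def loss(x, w):
    """Toy loss: half the sum of squared outputs."""
    return 0.5 * sum(y * y for y in conv1d(x, w))


def analytic_grad(x, w):
    """dL/dw[n] = sum over positions i of (dL/dy[i]) * x[i + n]."""
    y = conv1d(x, w)  # for this loss, dL/dy[i] is simply y[i]
    return [sum(y[i] * x[i + n] for i in range(len(y))) for n in range(len(w))]


def numeric_grad(x, w, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grads = []
    for n in range(len(w)):
        up, down = list(w), list(w)
        up[n] += eps
        down[n] -= eps
        grads.append((loss(x, up) - loss(x, down)) / (2 * eps))
    return grads


x = [1.0, 2.0, -1.0, 0.5, 3.0]
w = [0.5, -1.0, 2.0]
print(analytic_grad(x, w))  # [-2.5, -7.5, 20.0]
print(numeric_grad(x, w))   # agrees up to floating-point rounding
```

Each entry of the analytic gradient is a sum with one term per output position, which is the per-position accumulation that backpropagation performs for a shared filter weight.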
Variants and landmark architectures
LeNet-5 (1998). Yann LeCun's digit reader — two conv layers, two pools, three dense layers — ran in production reading checks. The blueprint every CNN still follows.
AlexNet (2012). Won ImageNet by a 10-point margin and ignited the deep-learning boom: ReLU, dropout, GPU training, data augmentation. Same recipe as LeNet, far bigger.
VGG (2014). Showed that stacking many 3×3 convs beats fewer large kernels, standardizing the small-filter design.
ResNet (2015). Added skip connections so gradients flow through identity shortcuts, finally making 100+ layer networks trainable by easing the optimization difficulties (degradation, vanishing gradients) that kept very deep plain networks from training well. Among the most-cited architectures in the field.
Inception / GoogLeNet. Runs multiple filter sizes in parallel and concatenates them, letting the network choose its own receptive field per block.
MobileNet & EfficientNet. Depthwise-separable convolutions and principled depth/width/resolution scaling for phones and edge devices.
U-Net. An encoder–decoder with skip connections for pixel-wise segmentation — the workhorse of medical imaging and the backbone of modern image-diffusion models.
Dilated (atrous) convolutions. Insert gaps between kernel taps to enlarge the receptive field without extra parameters or downsampling — central to semantic segmentation.
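As a rough illustration of dilation, here is a 1D sketch in plain Python. The function name `dilated_conv1d` and the toy inputs are assumptions; the point is only that the same three weights cover a span of (K − 1)·d + 1 input samples.

```python
def dilated_conv1d(x, w, dilation=1):
    """Valid 1D cross-correlation with gaps of `dilation` between kernel taps."""
    k = len(w)
    span = (k - 1) * dilation + 1  # effective kernel size
    return [sum(w[n] * x[i + n * dilation] for n in range(k))
            for i in range(len(x) - span + 1)]


x = list(range(10))  # 0, 1, ..., 9
w = [1, 1, 1]        # three weights in both cases

dense_taps = dilated_conv1d(x, w, dilation=1)   # each output sees 3 neighbours
spread_taps = dilated_conv1d(x, w, dilation=2)  # each output spans 5 samples
print(dense_taps)   # [3, 6, 9, 12, 15, 18, 21, 24]
print(spread_taps)  # [6, 9, 12, 15, 18, 21]
```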
Common bugs and edge cases
- Off-by-one in output size. Forgetting padding shrinks every layer; a deep "valid" stack can drive the feature map to zero size. Always verify that ⌊(W − K + 2P)/S⌋ + 1 matches the dense layer's expected input.
- Channel-order mismatch. PyTorch is NCHW, TensorFlow defaults to NHWC. Loading weights or images in the wrong order silently produces garbage features.
- Double softmax. Applying softmax before a loss that already includes it flattens the gradient and stalls training. Return raw logits.
- Assuming full translation invariance. Convolution is equivariant, not invariant. Without data augmentation a 20-pixel shift, or a feature that straddles two pool windows, can flip the prediction. CNNs also have no built-in tolerance to rotation or scale changes unless you augment.
- Forgetting to normalize inputs. Feeding raw 0–255 pixels instead of normalized values makes first-layer activations and gradients large and poorly scaled, which slows or destabilizes training; subtract the dataset mean and divide by its standard deviation.
- Receptive field too small. If the final layer can't "see" the whole object, the network classifies on texture, not shape. Grow the receptive field with depth, stride, or dilation before blaming the data.
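The double-softmax bug from the list above is easy to reproduce with the standard library alone. In this sketch the `cross_entropy` helper mimics a loss that applies softmax internally; the helper names and the example logits are assumptions for illustration.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target):
    """Cross-entropy that applies softmax internally to whatever it is given."""
    return -math.log(softmax(scores)[target])


logits = [4.0, 0.0, -2.0]  # a confident, correct prediction for class 0
once = softmax(logits)     # what the loss computes internally
twice = softmax(once)      # what it computes if the model already applied softmax

loss_single = cross_entropy(logits, 0)  # correct: pass raw logits
loss_double = cross_entropy(once, 0)    # bug: pass probabilities

# Probabilities live in [0, 1], so the second softmax can never become sharper
# than softmax of a one-hot vector. That puts a floor under the loss.
n_classes = len(logits)
loss_floor = -math.log(math.e / (math.e + n_classes - 1))

print(f"single softmax: p={once[0]:.3f}, loss={loss_single:.3f}")
print(f"double softmax: p={twice[0]:.3f}, loss={loss_double:.3f}, floor={loss_floor:.3f}")
```

With the bug in place the loss cannot fall below the floor no matter how confident the model becomes, which is one way the stall shows up in training curves.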
Frequently asked questions
Why does a CNN beat a fully-connected network on images?
Two structural priors: local connectivity (each neuron sees only a small patch) and weight sharing (the same filter is reused everywhere). A dense layer on a 224×224×3 image has about 150 million weights into a 1,000-unit layer; a 3×3 convolution with 64 filters over the same 3 input channels has 1,792 parameters. Fewer parameters means less overfitting and far less compute, and the sharing bakes in the assumption that a feature is useful wherever it appears.
What is the difference between convolution and pooling?
Convolution is a learned, parameterized operation: it slides filters and computes weighted sums, producing feature maps that detect patterns. Pooling has no learnable weights — it just downsamples, typically by taking the maximum (max-pool) or average over a 2×2 window, halving the spatial size to add a little translation tolerance and cut compute. Many modern nets replace pooling with strided convolutions.
How do you compute the output size of a convolution layer?
Output = floor((W − K + 2P) / S) + 1, where W is input size, K is kernel size, P is padding, and S is stride. A "same" convolution uses P = (K − 1) / 2 with stride 1 to keep the output the same spatial size as the input. A "valid" convolution uses P = 0 and shrinks the map by K − 1 on each axis.
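The formula is short enough to keep as a helper. The sketch below, with an assumed function name `conv_output_size`, traces a 32×32 input through a conv → pool → conv → pool stack to confirm the size a following dense layer should expect.

```python
def conv_output_size(w, k, p=0, s=1):
    """floor((W - K + 2P) / S) + 1 for one spatial axis."""
    return (w - k + 2 * p) // s + 1


print(conv_output_size(32, 3, p=1))        # 32: "same" 3x3 conv
print(conv_output_size(32, 3))             # 30: "valid" 3x3 conv shrinks by K-1
print(conv_output_size(224, 7, p=3, s=2))  # 112: strided 7x7 conv

# Trace 32x32 through two (3x3 same conv -> 2x2 pool) blocks.
size = 32
for name, k, p, s in [("conv", 3, 1, 1), ("pool", 2, 0, 2),
                      ("conv", 3, 1, 1), ("pool", 2, 0, 2)]:
    size = conv_output_size(size, k, p, s)
    print(f"{name}: {size}x{size}")

flattened = 64 * size * size  # with 64 channels in the last feature map
print(flattened)  # 4096
```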
Is a CNN actually translation invariant?
Only approximately. Convolution is translation equivariant: shift the input and the feature map shifts the same way. Pooling adds a little local invariance, and global pooling before the classifier mostly discards position. But CNNs are notoriously not invariant to large shifts unless trained with data augmentation; a feature that lands between pool windows can flip the prediction. They also have essentially no built-in invariance to rotation or scale, so that too must come from augmentation.
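A 1D toy example shows all three behaviours at once: equivariance of the convolution, the fragility of strided pooling, and the invariance of a global max. The helper names and the signal are assumptions for illustration.

```python
def conv1d(x, w):
    """Valid 1D cross-correlation."""
    k = len(w)
    return [sum(w[n] * x[i + n] for n in range(k))
            for i in range(len(x) - k + 1)]


def max_pool1d(y, size=2):
    """Non-overlapping max-pool with stride equal to the window size."""
    return [max(y[i:i + size]) for i in range(0, len(y) - size + 1, size)]


def shift_right(x, steps=1):
    """Shift a list right by `steps`, filling with zeros on the left."""
    return [0] * steps + x[:len(x) - steps]


x = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]  # a small bump
w = [-1, 2, -1]                     # a peak detector

y = conv1d(x, w)
y_shifted = conv1d(shift_right(x), w)

# Equivariance: shifting the input shifts the feature map by the same amount.
print(y_shifted == shift_right(y))            # True

# Strided pooling: the pooled maps differ, and not by a clean shift.
print(max_pool1d(y), max_pool1d(y_shifted))   # [-1, 4, 0, 0] [0, 4, -1, 0]

# Global max pooling: the position is discarded, so the summary is unchanged.
print(max(y) == max(y_shifted))               # True
```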
What is the receptive field and why does it matter?
The receptive field is the region of the original image that influences one output neuron. Each layer enlarges it: stacking two 3×3 convolutions gives a 5×5 receptive field with fewer parameters than one 5×5 filter. To classify a whole object the final layer's receptive field must cover most of the image, which is why deep stacks, strides, and dilated convolutions exist — to grow the receptive field fast.
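Receptive-field growth follows a simple recurrence: each layer adds (K − 1) · dilation · jump, where jump is the product of the strides of all earlier layers. The sketch below uses that standard recurrence; the function name and the example layer stacks are assumptions for illustration.

```python
def receptive_field(layers):
    """layers: (kernel, stride, dilation) tuples, ordered from input to output."""
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump  # growth measured in input pixels
        jump *= stride                        # later layers step further apart
    return rf


# Two stacked 3x3 convs see a 5x5 patch.
print(receptive_field([(3, 1, 1), (3, 1, 1)]))             # 5

# conv3 -> pool2 -> conv3 -> pool2: strides make later layers count for more.
print(receptive_field([(3, 1, 1), (2, 2, 1),
                       (3, 1, 1), (2, 2, 1)]))             # 10

# Three 3x3 convs with dilations 1, 2, 4: fast growth, no downsampling.
print(receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)]))  # 15
```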
Have transformers made CNNs obsolete?
No. Vision Transformers match or beat CNNs at very large data and model scales, but CNNs remain dominant when data is limited because their locality and weight-sharing priors are strong inductive biases. On-device and real-time vision still leans on efficient CNNs like MobileNet and EfficientNet, and hybrid designs (the pure-convolutional ConvNeXt trained with transformer-era recipes, convolutional stems in ViTs) show the two families are converging rather than one replacing the other.