Neural Network

Layered matrix multiplications + non-linearities, learning by gradient descent

A neural network is a stack of layers, each performing a matrix multiplication followed by a non-linear activation, that learns to map inputs to outputs by minimizing a loss function via gradient descent. Modern deep networks have millions to trillions of parameters and power image recognition, language models, and game-playing AIs. The math is simple — calculus and linear algebra — but the scale is what makes them work.

Forward pass	O(W) per layer, where W = number of weights in that layer
Backward pass (training)	O(W) per layer, same order as the forward pass
Universal approximation	1 hidden layer with enough neurons can approximate any continuous function on a compact domain
Common activations	ReLU, GELU, tanh, sigmoid, softmax (output)
Gradient descent variants	SGD, Adam, AdamW, RMSprop, LAMB
Modern scale	GPT-4 ~1.8T params (unconfirmed estimate); Llama 3 70B; ResNet-50 25M

How a neural network works

Strip away the metaphors and a neural network is layered matrix multiplications. A single layer does:

output = activation(input × weights + bias)

Where input is a vector, weights is a matrix, bias is a vector, and activation is a non-linear function applied element-wise. Stack many such layers and you get a "deep" neural network. The weights are the parameters that get trained; the inputs flow forward through the stack.

Without the non-linearity, stacking layers is mathematically pointless — multiple matrix multiplications collapse into a single matrix. The activation function (ReLU, sigmoid, tanh, GELU) introduces non-linearity that lets the network represent complex functions.
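The collapse can be checked numerically. The sketch below uses two small weight matrices invented for illustration (biases omitted) and shows that two stacked linear layers give the same result as one merged matrix, while inserting ReLU between them changes the output.

```javascript
// Multiply a row vector by a matrix: (1 x n) * (n x m) -> (1 x m).
function vecMat(x, W) {
  return W[0].map((_, col) => x.reduce((sum, xi, row) => sum + xi * W[row][col], 0));
}

// Multiply two matrices: (n x k) * (k x m) -> (n x m).
function matMul(A, B) {
  return A.map(row => vecMat(row, B));
}

// Illustrative values only; not taken from any trained network.
const W1 = [[1, 2], [3, 4]];
const W2 = [[0, 1], [1, -1]];
const x = [5, -2];

const relu = v => v.map(z => Math.max(0, z));

const twoLayers = vecMat(vecMat(x, W1), W2);       // layer 1, then layer 2
const oneLayer = vecMat(x, matMul(W1, W2));        // one merged matrix
const withRelu = vecMat(relu(vecMat(x, W1)), W2);  // non-linearity in between

console.log(twoLayers, oneLayer, withRelu); // [2, -3] [2, -3] [2, -2]
```

The first two results match because matrix multiplication is associative; the third differs because ReLU zeroes the negative intermediate value before the second layer sees it.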

Layers and activations

Activation	Function	Used in	Notes
ReLU	max(0, x)	Hidden layers (default)	Fast, no vanishing gradient on positive side
Sigmoid	1 / (1 + e⁻ˣ)	Binary output, gates in LSTMs	Saturates → vanishing gradient in deep nets
tanh	(eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)	Hidden layers (older networks)	Like sigmoid but centered at 0
GELU	x × Φ(x)	Transformers (BERT, GPT)	Smooth, slightly outperforms ReLU
Softmax	eˣⁱ / Σ eˣʲ	Multi-class output layer	Outputs sum to 1; interpret as probabilities
Linear (none)	x	Regression output, attention scores	For continuous outputs

ReLU has been the default since 2012. Variants (Leaky ReLU, GELU, Swish) outperform plain ReLU by small amounts in some benchmarks; transformers mostly use GELU or gated variants such as SwiGLU, and ReLU remains a common default elsewhere.
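As a rough sketch, the activations in the table can be written in a few lines each. JavaScript has no built-in error function, so GELU here uses the widely used tanh approximation rather than the exact x × Φ(x); that substitution is an assumption of this example.

```javascript
const relu = x => Math.max(0, x);
const sigmoid = x => 1 / (1 + Math.exp(-x));
const tanh = x => Math.tanh(x);

// GELU, tanh approximation (stands in for the exact x * Phi(x)).
const gelu = x =>
  0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x ** 3)));

// Softmax over a vector of logits; subtracting the max avoids overflow in exp.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map(v => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / sum);
}

const probs = softmax([2.0, 1.0, 0.1]);
console.log(relu(-2), sigmoid(0), gelu(1).toFixed(3), probs);
```

Softmax differs from the others in that it acts on a whole vector rather than element-wise, which is why it appears only at the output layer.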

Training — gradient descent + backpropagation

Training adjusts the weights to minimize a loss function. The loss measures how wrong the network is on a given example (e.g., cross-entropy for classification, mean squared error for regression). The procedure:

Forward pass. Run input through the network, compute output.
Compute loss. Compare output to ground truth.
Backward pass. Compute gradient of loss with respect to every weight using the chain rule (this is backpropagation).
Update weights. Take a small step in the direction that reduces loss: w ← w − learning_rate × gradient.
Repeat over batches of training examples until the loss stops decreasing.
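The steps above can be sketched on the smallest possible model: a single weight w fitting y = w × x. The dataset, learning rates, and step counts are made up for illustration, and the gradient is derived by hand rather than by a general backpropagation routine.

```javascript
// Toy dataset (invented for this example): the true relationship is y = 2x.
const xs = [1, 2, 3, 4];
const ys = [2, 4, 6, 8];

// Mean squared error of the one-weight model y = w * x.
function loss(w) {
  return xs.reduce((sum, x, i) => sum + (w * x - ys[i]) ** 2, 0) / xs.length;
}

// Derivative of the loss with respect to w (chain rule applied by hand).
function gradient(w) {
  return xs.reduce((sum, x, i) => sum + 2 * (w * x - ys[i]) * x, 0) / xs.length;
}

function train(w, learningRate, steps) {
  const losses = [];
  for (let step = 0; step < steps; step++) {
    losses.push(loss(w));             // forward pass and loss
    w -= learningRate * gradient(w);  // gradient and weight update
  }
  return { w, losses };
}

const fitted = train(0, 0.01, 100);   // small steps: w approaches 2
const overshoot = train(0, 0.2, 20);  // steps too large: loss grows every update

console.log(fitted.w, overshoot.losses[19] > overshoot.losses[0]);
```

With the larger learning rate each update jumps past the minimum by more than it started from it, so the loss grows instead of shrinking.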

Each update is a small step against the gradient, which is what makes this gradient descent. Take steps too large and you overshoot the minimum; too small and training never finishes. Adaptive optimizers (Adam, AdamW) adjust the step size per parameter automatically, using running averages of each parameter's past gradients and squared gradients.
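A minimal sketch of the Adam update rule shows what "per-parameter step size" means in practice. The helper name createAdam and the two-parameter test function are hypothetical; the defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the commonly used ones.

```javascript
// Returns a function that applies one Adam update to a parameter vector.
function createAdam(nParams, { lr = 1e-3, beta1 = 0.9, beta2 = 0.999, eps = 1e-8 } = {}) {
  const m = new Array(nParams).fill(0); // running average of gradients
  const v = new Array(nParams).fill(0); // running average of squared gradients
  let t = 0;                            // update counter, used for bias correction
  return (params, grads) => {
    t += 1;
    return params.map((p, i) => {
      m[i] = beta1 * m[i] + (1 - beta1) * grads[i];
      v[i] = beta2 * v[i] + (1 - beta2) * grads[i] * grads[i];
      const mHat = m[i] / (1 - beta1 ** t);
      const vHat = v[i] / (1 - beta2 ** t);
      return p - (lr * mHat) / (Math.sqrt(vHat) + eps);
    });
  };
}

// Illustrative loss f(a, b) = 100 * a^2 + 0.01 * b^2: gradients differ by 10,000x.
const grad = ([a, b]) => [200 * a, 0.02 * b];

const adamStep = createAdam(2, { lr: 0.1 });
const start = [1, 1];
const afterOne = adamStep(start, grad(start));

console.log(afterOne); // both parameters move by roughly lr despite the scale gap
```

Plain gradient descent with one shared learning rate would move the first parameter 10,000 times further than the second; Adam's division by the root of the squared-gradient average normalizes that away.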

The universal approximation theorem

A theorem from 1989 (Cybenko; Hornik et al.) states that a neural network with a single hidden layer of sufficient width can approximate any continuous function on a compact domain to arbitrary accuracy. Why isn't every network just one wide layer, then?

Because "sufficient width" can mean exponentially many neurons. Deeper networks are exponentially more parameter-efficient at representing certain functions — they exploit hierarchical structure (low-level features compose into mid-level features into high-level concepts). Empirically, depth wins decisively for vision, language, and audio.
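To make "sufficient width" concrete, here is a hand-built (not trained) one-hidden-layer ReLU network that approximates x² on [0, 1]. The construction, a piecewise-linear interpolation with one hidden neuron per knot, is chosen for illustration; it is one of many ways a wide layer can fit a function.

```javascript
const relu = z => Math.max(0, z);

// One hidden layer with `width` ReLU neurons. Neuron k computes relu(x - k/width);
// the output weights are the slope changes needed to interpolate x^2 at each knot.
function buildNet(width) {
  const knots = Array.from({ length: width }, (_, k) => k / width);
  const outWeights = knots.map((_, k) => (k === 0 ? 1 : 2) / width);
  return x => knots.reduce((sum, t, k) => sum + outWeights[k] * relu(x - t), 0);
}

// Largest absolute error against x^2 on an evenly spaced grid over [0, 1].
function maxError(net, samples = 1000) {
  let worst = 0;
  for (let i = 0; i <= samples; i++) {
    const x = i / samples;
    worst = Math.max(worst, Math.abs(net(x) - x * x));
  }
  return worst;
}

const errNarrow = maxError(buildNet(4));
const errWide = maxError(buildNet(64));

console.log(errNarrow, errWide);
```

For this construction the worst-case error is 1 / (4 × width²), so accuracy improves with width but every extra digit of precision costs more neurons. That trade-off is the practical catch in the theorem.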

Architecture types

Type	Best for	Distinguishing feature	Examples
Multi-layer perceptron (MLP)	Tabular data	Fully connected layers	Generic baseline
Convolutional (CNN)	Images	Shared filters slid across the image (translation-equivariant)	ResNet, VGG, EfficientNet
Recurrent (RNN, LSTM, GRU)	Sequences (older)	State carried token-to-token	seq2seq, language models pre-2017
Transformer	Sequences (modern)	Self-attention over all positions	BERT, GPT, T5, Llama
Graph neural network (GNN)	Graph-structured data	Message passing along edges	Drug discovery, social networks
Diffusion model	Image generation	Iterative denoising	DALL-E 2/3, Stable Diffusion

JavaScript: minimal MLP from scratch

function relu(x)        { return Math.max(0, x); }
function reluDeriv(x)   { return x > 0 ? 1 : 0; }
function softmax(arr)   {
  const max = Math.max(...arr);
  const exp = arr.map(v => Math.exp(v - max));
  const sum = exp.reduce((a, b) => a + b);
  return exp.map(v => v / sum);
}

class Layer {
  constructor(nIn, nOut) {
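    // He (Kaiming) uniform initialization, suited to ReLU layers: sampling from
    // [-limit, limit] with limit = sqrt(6 / nIn) gives weight variance 2 / nIn,
    // which keeps activations from exploding or shrinking layer to layer.
    const limit = Math.sqrt(6 / nIn);
    this.W = Array.from({ length: nIn }, () =>
      Array.from({ length: nOut }, () => (Math.random() - 0.5) * 2 * limit)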
    );
    this.b = new Array(nOut).fill(0);
  }

  forward(x) {
    this.input = x;
    const out = new Array(this.b.length).fill(0);
    for (let i = 0; i < this.b.length; i++) {
      let sum = this.b[i];
      for (let j = 0; j < x.length; j++) sum += x[j] * this.W[j][i];
      out[i] = relu(sum);
    }
    this.output = out;
    return out;
  }

  // Backward pass — compute gradients and update weights
  backward(dOut, lr) {
    const dIn = new Array(this.input.length).fill(0);
    for (let i = 0; i < this.b.length; i++) {
      const dPreAct = dOut[i] * reluDeriv(this.output[i]);
      this.b[i] -= lr * dPreAct;
      for (let j = 0; j < this.input.length; j++) {
        dIn[j] += dPreAct * this.W[j][i];
        this.W[j][i] -= lr * dPreAct * this.input[j];
      }
    }
    return dIn;
  }
}

class MLP {
  constructor(sizes) {
    this.layers = [];
    for (let i = 0; i < sizes.length - 1; i++) {
      this.layers.push(new Layer(sizes[i], sizes[i+1]));
    }
  }
  forward(x) { return this.layers.reduce((a, l) => l.forward(a), x); }
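  // Propagate dLoss/dOutput back through the layers, last to first.
  backward(dOut, lr) { return this.layers.reduceRight((g, l) => l.backward(g, lr), dOut); }
  // Cross-entropy loss is left as an exercise. Note that Layer applies ReLU to every
  // layer, including the last; a softmax classifier would normally skip it there.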
}

const net = new MLP([784, 128, 64, 10]); // MNIST-shaped
// Then training loop: forward → compute loss → backward → repeat
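For the loss that the listing leaves as an exercise, one possible starting point is softmax cross-entropy, whose gradient with respect to the logits has the convenient form softmax(logits) minus the one-hot target. The sketch below is standalone and assumes the final layer emits raw logits with no ReLU; the function names are illustrative, and a finite-difference check is included to confirm the formula.

```javascript
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map(v => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / sum);
}

// Cross-entropy loss for one example whose true class index is `target`.
function crossEntropy(logits, target) {
  return -Math.log(softmax(logits)[target]);
}

// Analytic gradient with respect to the logits: softmax(logits) - oneHot(target).
function crossEntropyGrad(logits, target) {
  return softmax(logits).map((p, i) => p - (i === target ? 1 : 0));
}

// Central finite differences, used only to check the analytic gradient.
function numericalGrad(logits, target, h = 1e-5) {
  return logits.map((_, i) => {
    const up = logits.slice();
    const down = logits.slice();
    up[i] += h;
    down[i] -= h;
    return (crossEntropy(up, target) - crossEntropy(down, target)) / (2 * h);
  });
}

const logits = [2.0, 1.0, 0.1]; // illustrative values
const analytic = crossEntropyGrad(logits, 0);
const numeric = numericalGrad(logits, 0);

console.log(analytic, numeric);
```

The vector returned by crossEntropyGrad is what would be passed as dOut to the last layer's backward call.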

Production deployments use frameworks (PyTorch, JAX, TensorFlow) that handle GPU parallelism, automatic differentiation, distributed training, and a thousand performance-critical details. Implementing from scratch is purely educational.

Modern scale

Model	Year	Parameters	Trained on
LeNet-5	1998	60K	MNIST handwritten digits
AlexNet	2012	60M	ImageNet (1.2M images)
BERT-Large	2018	340M	3.3B words (Wikipedia + BookCorpus)
GPT-3	2020	175B	~300B tokens (drawn from a ~500B-token corpus)
GPT-4	2023	~1.8T (unofficial estimate)	Multi-modal corpus (details undisclosed)
Llama 3 70B	2024	70B	15T tokens

The trend has been compute-driven: as more compute became available (GPUs from 2012, larger clusters since 2017), the field built bigger models. Scaling laws (Kaplan et al., 2020) showed that loss falls predictably, roughly as a power law, as model size, dataset size, and compute grow together; falling short on any one axis caps performance.
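Parameter counts like those in the table come from simple arithmetic: a fully connected layer with nIn inputs and nOut outputs has nIn × nOut weights plus nOut biases. A small sketch, using the MNIST-shaped layer sizes from the listing above (the helper name countParams is illustrative):

```javascript
// Total trainable parameters of a fully connected network with the given layer sizes.
function countParams(sizes) {
  let total = 0;
  for (let i = 0; i < sizes.length - 1; i++) {
    total += sizes[i] * sizes[i + 1] + sizes[i + 1]; // weights + biases
  }
  return total;
}

const mnistMlp = countParams([784, 128, 64, 10]);

// Storage for a parameter count at 4 bytes each (32-bit floats), in gigabytes.
const fp32Gigabytes = params => (params * 4) / 1e9;

console.log(mnistMlp);             // 109386
console.log(fp32Gigabytes(175e9)); // 700, for a 175B-parameter model
```

The same arithmetic explains why the largest models cannot fit on a single accelerator: at 32-bit precision, 175B parameters take 700 GB before any activations or optimizer state are counted.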

Common pitfalls

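Vanishing gradients in deep networks. Backpropagation multiplies one activation derivative per layer. Sigmoid's derivative never exceeds 0.25 and tanh's never exceeds 1, and both approach 0 when the unit saturates; even a per-layer factor of 0.5 shrinks the gradient to 0.5⁵⁰ ≈ 10⁻¹⁵ after 50 layers. Use ReLU, batch normalization, and residual connections (ResNet's contribution), which together enable training networks 100+ layers deep.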
Exploding gradients. The opposite — gradients grow without bound through layers, weights diverge. Solutions: gradient clipping, careful weight initialization, smaller learning rate.
Wrong learning rate. Too high → loss oscillates or diverges. Too low → training takes forever. Practical default — Adam optimizer with lr=3e-4 to 1e-3, learning rate scheduling (cosine decay or warmup-then-decay).
Data leakage. The test set accidentally contains training examples (or close variants). The network can then score well by memorizing specific examples rather than generalizing: it looks near-perfect on the test set and fails in production. De-duplicate carefully; split by user or time, not by random shuffle, when applicable.
Confusing training loss with held-out loss. Training loss can go to 0 by memorization; loss on data the network has not seen reveals actual generalization. Always monitor both, and stop training when validation loss starts rising even if training loss keeps falling. Keep the test set out of that decision so it remains an unbiased final check.
Imbalanced classes silently breaking everything. If 99% of transactions in a fraud dataset are non-fraud, a network that always predicts "non-fraud" gets 99% accuracy and is useless. Use class weights, focal loss, or appropriate metrics (precision/recall, F1, AUC).
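The class-imbalance pitfall is easy to reproduce with synthetic labels. The sketch below builds 1,000 made-up transactions, 10 of them fraudulent, and scores a "model" that always predicts non-fraud; the helper name and data are illustrative.

```javascript
// Binary classification metrics; label 1 is the positive (fraud) class.
function classificationMetrics(yTrue, yPred) {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  yTrue.forEach((truth, i) => {
    if (truth === 1 && yPred[i] === 1) tp++;
    else if (truth === 0 && yPred[i] === 1) fp++;
    else if (truth === 1 && yPred[i] === 0) fn++;
    else tn++;
  });
  const accuracy = (tp + tn) / yTrue.length;
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { accuracy, precision, recall, f1 };
}

// Synthetic data: 10 fraud cases out of 1,000 transactions.
const yTrue = Array.from({ length: 1000 }, (_, i) => (i < 10 ? 1 : 0));
const alwaysNonFraud = new Array(1000).fill(0);

const metrics = classificationMetrics(yTrue, alwaysNonFraud);
console.log(metrics); // accuracy 0.99, recall 0, f1 0
```

Accuracy rewards the majority class, while recall and F1 expose that not a single fraud case was caught.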

Frequently asked questions

How does a neural network learn?

By gradient descent on a loss function. (1) Forward pass — feed input through layers, get output. (2) Compute loss — how wrong is the output vs the target? (3) Backpropagation — apply chain rule to compute the gradient of loss with respect to each weight. (4) Update weights — move each weight a tiny step in the direction that reduces loss. Repeat over millions of examples. The weights converge to values that produce useful outputs.

Why ReLU instead of sigmoid?

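ReLU (max(0, x)) is fast to compute and is far less prone to vanishing gradients. Sigmoid saturates for large positive or negative inputs, where its output sits near 1 or 0 and its derivative gets tiny (it never exceeds 0.25), so the gradient barely propagates back through deep networks (the "vanishing gradient problem"). ReLU's derivative is exactly 1 for positive inputs, so gradients flow cleanly through many layers. Its derivative is 0 for negative inputs, so a unit that is never active stops learning (the "dying ReLU" problem), which variants such as Leaky ReLU address. ReLU was one of the changes that made networks deeper than ~5 layers practical to train.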
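A quick calculation shows the size of the effect. This sketch multiplies one activation derivative per layer along a single path through 50 layers; it deliberately ignores the weights and assumes every unit sees the same input, so treat the numbers as an order-of-magnitude illustration only.

```javascript
const sigmoid = x => 1 / (1 + Math.exp(-x));
const sigmoidDeriv = x => sigmoid(x) * (1 - sigmoid(x));
const reluDeriv = x => (x > 0 ? 1 : 0);

// Product of `depth` identical activation derivatives (weights ignored).
function gradientScale(deriv, x, depth) {
  return Math.pow(deriv(x), depth);
}

const sigmoidBest = gradientScale(sigmoidDeriv, 0, 50);      // derivative at its peak, 0.25
const sigmoidSaturated = gradientScale(sigmoidDeriv, 5, 50); // saturated units
const reluActive = gradientScale(reluDeriv, 1, 50);          // active ReLU units

console.log(sigmoidBest, sigmoidSaturated, reluActive);
```

Even in sigmoid's best case the factor is 0.25⁵⁰, below 10⁻³⁰, while a path of active ReLU units passes the gradient through unchanged.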

What's backpropagation?

An efficient algorithm for computing gradients in computational graphs. Forward pass: compute outputs and store intermediate values. Backward pass: start with d(loss)/d(output) and apply the chain rule layer by layer to compute d(loss)/d(weight) for every weight. Total work is the same order as the forward pass, O(W). Numerical differentiation would instead need at least one extra forward pass per weight, O(W²) in total, which is impossibly slow for a network with billions of parameters.
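The chain rule can be traced by hand on a two-weight network, y = w2 × relu(w1 × x) with squared-error loss. The sketch below computes both gradients in one backward pass and compares them with finite differences, which need two extra loss evaluations per weight. The network shape and input values are chosen only for illustration.

```javascript
// Forward pass, keeping the intermediate values the backward pass will need.
function forward(w1, w2, x) {
  const pre = w1 * x;          // pre-activation of the hidden unit
  const h = Math.max(0, pre);  // ReLU
  const y = w2 * h;            // output
  return { pre, h, y };
}

function loss(w1, w2, x, target) {
  return (forward(w1, w2, x).y - target) ** 2;
}

// Backward pass: apply the chain rule from the loss back to each weight.
function backprop(w1, w2, x, target) {
  const { pre, h, y } = forward(w1, w2, x);
  const dY = 2 * (y - target);         // d(loss)/d(y)
  const dW2 = dY * h;                  // y = w2 * h
  const dH = dY * w2;
  const dPre = dH * (pre > 0 ? 1 : 0); // ReLU derivative
  const dW1 = dPre * x;                // pre = w1 * x
  return { dW1, dW2 };
}

// Central finite differences, for comparison only.
function finiteDiff(w1, w2, x, target, eps = 1e-6) {
  return {
    dW1: (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps),
    dW2: (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps),
  };
}

const analytic = backprop(0.5, -1.5, 2, 1);
const numeric = finiteDiff(0.5, -1.5, 2, 1);

console.log(analytic, numeric); // analytic: { dW1: 15, dW2: -5 }
```

Both methods agree here, but the finite-difference cost grows with the number of weights, whereas backpropagation reuses the stored intermediates to get every gradient from one pass.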

What's overfitting and how do you prevent it?

Overfitting — the network memorizes training examples but fails on new data. Caused by too many parameters relative to data, or too many training epochs. Prevention — more training data, smaller model, regularization (L2 weight decay, dropout), early stopping (monitor validation loss), data augmentation. Modern large language models avoid overfitting partly through enormous datasets and deduplication.
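Early stopping is the easiest of these to sketch. The function below takes a list of per-epoch validation losses and stops once the loss has failed to improve for `patience` epochs in a row; the function name, the patience value, and the loss curve are all illustrative assumptions.

```javascript
// Scan validation losses epoch by epoch (0-based) and report where training
// would stop and which epoch's weights should be kept.
function earlyStopping(valLosses, patience = 2) {
  let bestEpoch = 0;
  let bestLoss = Infinity;
  let stoppedAt = valLosses.length - 1; // default: ran through every epoch
  for (let epoch = 0; epoch < valLosses.length; epoch++) {
    if (valLosses[epoch] < bestLoss) {
      bestLoss = valLosses[epoch];
      bestEpoch = epoch;
    } else if (epoch - bestEpoch >= patience) {
      stoppedAt = epoch;
      break;
    }
  }
  return { bestEpoch, bestLoss, stoppedAt };
}

// Made-up curve: validation loss bottoms out at epoch 3, then rises.
const valLosses = [0.9, 0.7, 0.55, 0.5, 0.52, 0.56, 0.61];
const decision = earlyStopping(valLosses, 2);

console.log(decision); // { bestEpoch: 3, bestLoss: 0.5, stoppedAt: 5 }
```

In a real training loop the weights from bestEpoch would be saved as a checkpoint and restored once training stops.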

What's the difference between a neural network and a deep neural network?

Just depth. "Neural network" can mean a single-layer perceptron. "Deep" usually means more than 1 hidden layer (so 3+ layers counting the output). Modern "deep learning" architectures range from tens to over a thousand layers: ResNet variants run from 18 to 1000+ layers, and GPT-3 stacks 96 Transformer blocks. The word "deep" is partly historical: it once distinguished researchers using many layers from the dominant 1-2 layer approaches.

How do transformers (the architecture behind GPT) differ from older networks?

Transformers use "self-attention" — each position in a sequence can attend to all other positions, allowing the network to model long-range dependencies in parallel. RNNs and LSTMs processed sequentially (token by token). Transformers scaled better to GPU parallelism and larger models, which is why they took over NLP after 2017 (Vaswani et al., "Attention Is All You Need").
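A bare-bones sketch of scaled dot-product attention for a single head shows the "every position attends to every other position" idea. It omits the learned query, key, and value projection matrices, multiple heads, and masking that a real transformer uses, and the token vectors are made up.

```javascript
function softmax(scores) {
  const max = Math.max(...scores);
  const exps = scores.map(v => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / sum);
}

const dot = (a, b) => a.reduce((sum, ai, i) => sum + ai * b[i], 0);

// queries, keys, values: arrays of vectors, one vector per sequence position.
function attention(queries, keys, values) {
  const scale = Math.sqrt(keys[0].length);
  return queries.map(q => {
    // How strongly this position attends to every position; weights sum to 1.
    const weights = softmax(keys.map(k => dot(q, k) / scale));
    // Output is the attention-weighted average of the value vectors.
    return values[0].map((_, d) =>
      weights.reduce((sum, w, pos) => sum + w * values[pos][d], 0)
    );
  });
}

// Self-attention: the same token vectors act as queries, keys, and values.
const tokens = [[1, 0], [0, 1], [1, 1]];
const attended = attention(tokens, tokens, tokens);

console.log(attended);
```

Each output row depends on all input rows at once, and the rows can be computed independently of one another, which is the parallelism an RNN's token-by-token recurrence lacks.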

Why do neural networks need so much data?

They have many parameters, from millions to trillions. Each parameter is a degree of freedom that the training data must pin down. A rough classical rule of thumb is on the order of one example per parameter (in practice often less, thanks to inductive biases and regularization). Modern LLMs are trained on hundreds of billions to trillions of tokens; image models on hundreds of millions of images. Data scaling laws relate model size to required dataset size.