Machine Learning
Graph Neural Network
Nodes that learn by listening to their neighbors
A graph neural network learns on graph-structured data by repeatedly passing messages between connected nodes: each node aggregates its neighbors' feature vectors, transforms them with a shared weight matrix, and updates its own embedding — stacking k layers to capture k-hop structure.
- Core operationMessage passing
- Cost per layerO(|E|·d + |V|·d²)
- Receptive fieldk hops at k layers
- Typical depth2–3 layers
- Key failure modeOversmoothing
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How a graph neural network learns
Most deep learning assumes a tidy grid. A CNN slides a filter over an image because every pixel has the same number of neighbors in the same fixed positions. An RNN walks a sequence because every token has exactly one predecessor. But a social network, a molecule, a road map, or a citation graph has no grid — each node has a different number of neighbors, in no particular order, and there's no "left" or "up." A graph neural network is the architecture that drops the grid assumption and learns directly on this irregular structure.
The trick, introduced by Scarselli et al. in 2009 and made practical by Kipf and Welling's 2017 Graph Convolutional Network, is message passing. Every node holds a feature vector — its current "embedding." In one layer, three things happen at every node simultaneously:
- Message. Each neighbor sends its current embedding (optionally transformed) along the edge.
- Aggregate. The node combines all incoming messages with a permutation-invariant function — sum, mean, or max — so the result doesn't depend on the arbitrary order neighbors are listed in.
- Update. The aggregated message is multiplied by a shared learned weight matrix, optionally mixed with the node's own previous embedding, passed through a nonlinearity (ReLU), and stored as the node's new embedding.
Run that once and every node now knows something about its immediate neighbors. Run it again and the messages it receives already carry second-hand information, so each node now sees two hops out. Stack k layers and every node's embedding summarizes its entire k-hop neighborhood. That growing "receptive field" is the whole idea: a node classifies itself not by its own features alone but by the structure of the community around it.
The precise mechanism and the math
Write the graph as N nodes with feature matrix H ∈ ℝ^(N×d) and adjacency matrix A ∈ ℝ^(N×N). A single GCN layer is one compact matrix expression:
H⁽ˡ⁺¹⁾ = σ( Â · H⁽ˡ⁾ · W⁽ˡ⁾ )
where  = D̃^(−½) · (A + I) · D̃^(−½)
D̃ = degree matrix of (A + I)
W⁽ˡ⁾ ∈ ℝ^(d×d') shared, learned
σ = ReLU (nonlinearity)
Reading it left to right: A + I adds a self-loop so a node keeps a copy of its own features. The D̃^(−½) · … · D̃^(−½) symmetric normalization scales each edge by 1/√(deg(i)·deg(j)), which stops high-degree hub nodes from drowning everyone else out. Multiplying by H gathers the neighborhood, and W projects it into the next feature space.
The same thing written per-node — the "message passing" view — makes the local computation explicit:
hᵢ⁽ˡ⁺¹⁾ = σ( W · AGG{ hⱼ⁽ˡ⁾ / cᵢⱼ : j ∈ N(i) ∪ {i} } )
Complexity. One layer costs O(|E|·d + |V|·d²): the |E|·d term gathers a d-dimensional message along every edge once, and the |V|·d² term applies the d×d weight matrix to every node. Because the gather follows edges rather than all node pairs, sparse graphs (where |E| ≪ |V|²) stay cheap — this is exactly why you implement ·H as a sparse matrix multiply, never a dense N×N one. Memory is O(|V|·d + |E|) for activations plus the graph itself.
When to reach for a GNN
- The data is relational. If the connections between examples matter as much as the examples themselves — fraud rings, drug-target interactions, "users who follow users" — a GNN exploits the edges that a plain MLP throws away.
- Node, edge, or whole-graph prediction. Classify accounts (node), predict missing links (edge), or score an entire molecule's toxicity (graph) — all three reuse the same message-passing backbone with a different readout head.
- Variable-size, irregular inputs. Molecules have different atom counts; social graphs grow daily. Weight sharing lets one model handle them all.
- You want to inject structure as a prior. Telling the model "these two things are connected" is a strong, cheap inductive bias.
Skip a GNN when the graph is a near-complete clique (the structure carries no signal — use a transformer or set model), when nodes are effectively independent (a gradient-boosted tree on tabular features will beat it), or when you only need 1-hop reasoning (a single aggregation, not a deep network, will do).
GNN variants compared
| GCN | GraphSAGE | GAT | GIN | MPNN (general) | |
|---|---|---|---|---|---|
| Aggregator | Degree-normalized mean | Mean / max / LSTM | Learned attention | Sum + MLP | Arbitrary |
| Neighbor weighting | Fixed (1/√degᵢdegⱼ) | Uniform | Learned per edge | Uniform (sum) | Defined by you |
| Inductive (new graphs)? | No — transductive | Yes | Yes | Yes | Depends |
| Extra params / layer | 1 matrix W | W per aggregator | W + attention vector a | MLP + ε | Varies |
| Cost per layer | O(|E|d + |V|d²) | O(|E|d + |V|d²) | O(|E|d + |V|d²) | O(|E|d + |V|d²) | ≥ that |
| Expressive power | ≤ 1-WL test | ≤ 1-WL test | ≤ 1-WL test | = 1-WL test (provably max) | Varies |
| Best for | Citation / homophilous graphs | Huge graphs, sampling | Noisy or heterophilous neighbors | Graph-level (molecules) | Custom messages (chemistry) |
The headline difference is how much each neighbor counts. GCN bakes the weights into the degrees; GAT learns them from the data. GIN's choice of sum (not mean or max) is theoretically important: Xu et al. proved in 2019 that sum-aggregation makes a GNN as discriminative as the Weisfeiler-Leman graph isomorphism test — the provable ceiling for this whole family of architectures. Mean and max throw away multiplicity, so they can't tell some structurally different graphs apart.
What the numbers actually say
- 2–3 layers is the sweet spot. On the standard Cora citation benchmark (2,708 papers, 5,429 citation edges, 7 classes), a 2-layer GCN reaches about 81.5% test accuracy. Push it to 4+ layers without tricks and accuracy falls — repeated averaging blurs the embeddings together. That's oversmoothing in one sentence.
- Sparsity is the win. Cora's adjacency has 2,708² ≈ 7.3 million possible entries but only ~10,500 nonzero (each edge counted both ways). A sparse
·Htouches ~10K edges instead of 7.3M pairs — roughly a 700× reduction in multiply-adds per layer. - Neighbor explosion limits depth. With average degree 10, a node's k-hop neighborhood holds ≈ 10ᵏ nodes. By 6 hops that's a million — most of the graph. This exponential blow-up, not just oversmoothing, is why full-batch deep GNNs run out of memory and why GraphSAGE samples a fixed number of neighbors per layer.
- Attention isn't free. A GAT layer computes a score for every edge and stores per-edge coefficients, so on a graph with 100K edges and 8 attention heads you hold 800K extra floats per layer versus a GCN's single shared matrix.
JavaScript implementation
A minimal forward pass of one GCN layer, message-passing style, with no ML framework. The graph is an adjacency list; each node has a feature vector.
// adj: Array<number[]> — adj[i] = neighbor indices of node i
// H: Array<number[]> — H[i] = d-dim feature vector of node i
// W: number[][] — d x d' learned weight matrix
const relu = x => Math.max(0, x);
function gcnLayer(adj, H, W) {
const N = H.length, d = H[0].length, dOut = W[0].length;
// 1. precompute degrees (with self-loop) for symmetric normalization
const deg = adj.map((nbrs, i) => nbrs.length + 1); // +1 for self-loop
const out = [];
for (let i = 0; i < N; i++) {
// 2. AGGREGATE: degree-normalized sum over neighbors ∪ {self}
const agg = new Array(d).fill(0);
const neighborhood = [...adj[i], i]; // add self-loop
for (const j of neighborhood) {
const c = 1 / Math.sqrt(deg[i] * deg[j]); // 1/√(degᵢ·degⱼ)
for (let k = 0; k < d; k++) agg[k] += c * H[j][k];
}
// 3. TRANSFORM by W, then UPDATE through ReLU
const h = new Array(dOut).fill(0);
for (let p = 0; p < dOut; p++) {
let s = 0;
for (let k = 0; k < d; k++) s += agg[k] * W[k][p];
h[p] = relu(s);
}
out.push(h);
}
return out; // H for next layer
}
// Stack layers = deeper receptive field
function gcnForward(adj, H, weights /* W per layer */) {
return weights.reduce((feats, W) => gcnLayer(adj, feats, W), H);
}
Two details to notice. First, the inner loop only ever visits adj[i] — it follows edges, never iterates over all N nodes, which is what keeps it O(|E|·d + |V|·d²) instead of O(|V|²·d). Second, the aggregation is a plain sum after weighting, so the result is identical no matter what order adj[i] happens to be in — that permutation invariance is mandatory, not optional.
Python implementation
The same layer in vectorized NumPy as the textbook sparse matrix form ·H·W, plus a 2-layer node classifier — the GCN you'd actually train.
import numpy as np
def normalize_adj(A):
"""Â = D̃^(-1/2) (A + I) D̃^(-1/2) — symmetric normalization with self-loops."""
A_hat = A + np.eye(A.shape[0])
deg = A_hat.sum(axis=1) # row sums = degrees (incl. self-loop)
d_inv_sqrt = np.power(deg, -0.5)
d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0 # isolated nodes -> 0, avoid div-by-zero
D = np.diag(d_inv_sqrt)
return D @ A_hat @ D
def relu(x):
return np.maximum(0, x)
def gcn_layer(A_hat, H, W):
return relu(A_hat @ H @ W) # one message-passing step
class GCN:
def __init__(self, d_in, d_hid, d_out, seed=0):
rng = np.random.default_rng(seed)
# Xavier-ish init keeps activations from exploding across layers
self.W1 = rng.normal(0, np.sqrt(2 / d_in), (d_in, d_hid))
self.W2 = rng.normal(0, np.sqrt(2 / d_hid), (d_hid, d_out))
def forward(self, A, H):
A_hat = normalize_adj(A)
H = gcn_layer(A_hat, H, self.W1) # layer 1: 1-hop receptive field
logits = A_hat @ H @ self.W2 # layer 2: 2-hop, no ReLU before softmax
return logits # feed to softmax + cross-entropy
# --- tiny worked example: a 4-node path graph 0-1-2-3 ---
A = np.array([[0,1,0,0],
[1,0,1,0],
[0,1,0,1],
[0,0,1,0]], dtype=float)
H = np.eye(4) # one-hot features
model = GCN(d_in=4, d_hid=8, d_out=2)
print(model.forward(A, H).shape) # (4, 2) -- 2 class scores per node
In a real pipeline you'd swap the dense A_hat for a SciPy sparse matrix and let autograd (PyTorch Geometric's GCNConv or DGL's GraphConv) handle backprop — but the forward math is exactly the three lines above.
Variants worth knowing
GraphSAGE (2017). Instead of needing the whole graph, it learns an aggregator (mean, max-pool, or LSTM) and samples a fixed number of neighbors per layer. This makes it inductive — it can embed nodes or graphs it never saw in training — and lets it scale to billion-edge graphs by mini-batching.
Graph Attention Network (GAT, 2018). Replaces fixed degree weights with learned attention: a small scoring function over each edge's endpoints, softmaxed over a node's neighbors. Often uses multiple attention heads, like a transformer, and shines when some neighbors are noise.
Graph Isomorphism Network (GIN, 2019). Uses sum aggregation followed by an MLP and proves this is as powerful as the Weisfeiler-Leman test — the most discriminative a message-passing GNN can be. The go-to baseline for graph-level (molecule) classification.
Relational GCN (R-GCN). Handles graphs with multiple edge types (knowledge graphs) by learning a separate weight matrix per relation, with basis decomposition to keep the parameter count sane.
Graph Transformers. Drop hard message passing and let every node attend to every other (with positional/structural encodings of the graph). They sidestep oversmoothing and long-range bottlenecks at O(|V|²) cost — practical only with sparsification or on small graphs.
Common bugs and edge cases
- Forgetting the self-loop. Without adding
A + I, a node's update ignores its own features and depends only on neighbors — a node with no neighbors becomes a zero vector. Add the self-loop before normalizing. - Stacking too many layers. The classic mistake: "deeper is better" from CNNs doesn't transfer. Past 3–4 layers, oversmoothing collapses embeddings. If you need long-range reasoning, use residual connections, jumping-knowledge layers, or a graph transformer — not raw depth.
- Using a non-permutation-invariant aggregator. Concatenating neighbor features in list order, or feeding them to a plain RNN without sorting, makes the output depend on arbitrary node ordering. Aggregation must be sum/mean/max (or attention) — symmetric by construction.
- Dividing by zero on isolated nodes. A degree-0 node makes
deg^(−½)blow up. The self-loop fixes most cases (degree becomes 1); still guard the normalization against inf/NaN. - Train/test leakage in transductive setups. A vanilla GCN sees the whole graph, including test nodes' edges, during training. That's expected for transductive node classification — but don't report it as if the model generalized to an unseen graph. Use GraphSAGE for a true inductive split.
- Dense adjacency on a big graph. Materializing the full N×N matrix to compute
A_hat @ Hturns an O(|E|) operation into O(|V|²) memory and instantly OOMs on graphs past a few tens of thousands of nodes. Always use sparse tensors.
Frequently asked questions
What is message passing in a graph neural network?
Message passing is the core operation: in each layer, every node collects the feature vectors of its direct neighbors, combines them with a permutation-invariant aggregator (sum, mean, or max), transforms the result with a shared learned weight matrix, and uses it to update its own embedding. Stacking k layers lets information flow k hops across the graph.
Why does a GNN use the same weights for every node?
Weight sharing is what makes a GNN scale to graphs of any size and generalize to unseen graphs. A single weight matrix W per layer is applied to every node's aggregated message, exactly the way a CNN slides one filter over every pixel. Without sharing you would need a parameter per node and could not transfer the model to a new graph.
What is oversmoothing in graph neural networks?
Oversmoothing is the failure mode where stacking too many message-passing layers makes every node's embedding converge to nearly the same vector, because repeated neighborhood averaging acts like a low-pass filter. Most GCNs peak at 2–3 layers; beyond that, accuracy drops as nodes become indistinguishable.
How is a graph convolutional network (GCN) different from a graph attention network (GAT)?
A GCN weights each neighbor by a fixed factor derived from node degrees (1/√(deg(i)·deg(j))). A GAT instead learns the weights: it computes an attention score for each edge from the two endpoints' features, softmaxes over neighbors, and uses those as the aggregation coefficients — so important neighbors count more, at the cost of extra parameters and compute.
What's the time complexity of one GNN layer?
One message-passing layer costs O(|E|·d + |V|·d²): the |E|·d term gathers a d-dimensional message along every edge, and the |V|·d² term applies the d×d weight matrix to every node. Because real graphs are sparse (|E| ≪ |V|²), this is far cheaper than the O(|V|²) of treating the graph as a dense matrix.
Can a GNN make predictions on a graph it has never seen?
Yes, if it's inductive. Models like GraphSAGE learn aggregation functions rather than a fixed embedding per node, so they can embed brand-new nodes or entire new graphs at inference time. The original GCN is transductive — it needs the whole graph, including test nodes, present during training.