Networking

Spanning Tree Protocol

How switches agree to switch off cables — so the network stops eating itself

The Spanning Tree Protocol (STP, IEEE 802.1D) is a Layer-2 algorithm where switches elect a root bridge and cooperatively block redundant links, leaving a single loop-free spanning tree so broadcasts can't circulate forever while every switch stays reachable.

  • StandardIEEE 802.1D (1990)
  • Invented byRadia Perlman, 1985
  • Classic convergence30–50 s
  • RSTP convergence< 1 s
  • ResultExactly one active path

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

Why a redundant cable can take down the whole LAN

Network engineers run extra cables between switches on purpose — if one link dies, traffic should reroute over the spare. But Ethernet has a fatal flaw the redundant cable exposes: a Layer-2 frame has no time-to-live field. When a switch receives a broadcast frame (an ARP request, say), it floods a copy out every other port. If two switches are joined by two cables, that broadcast comes back around, gets flooded again, comes back again — and now there are two copies, then four, then eight. Within milliseconds the loop is amplifying frames at line rate. This is a broadcast storm, and it melts the whole broadcast domain, not just the looped link.

Radia Perlman solved this at DEC in 1985 with an algorithm she famously described in a poem ("I think that I shall never see / a graph more lovely than a tree"). The idea: the switches form a physical graph with cycles, and they cooperatively compute a spanning tree over that graph — a connected, acyclic subgraph that touches every switch. Links that aren't part of the tree get logically blocked: still physically plugged in, still monitored, but not forwarding data. The result has exactly one active path between any two points, so no frame can loop. Standardized as IEEE 802.1D in 1990, it's been running quietly in nearly every managed switch since.

The elegant part is that no switch has a map of the network. Each one only exchanges small messages — Bridge Protocol Data Units (BPDUs) — with its direct neighbors, and the tree emerges from those local comparisons. It's a distributed minimum-spanning-tree computation rooted at one elected switch.

The algorithm: elect, measure, assign roles

STP converges in three logical phases, all driven by BPDUs flooded every 2 seconds (the hello time):

  1. Elect the root bridge. Every switch boots up believing it is the root and advertises its Bridge ID — a 2-byte priority (default 32768) concatenated with its 6-byte MAC address, 8 bytes total. When a switch hears a BPDU with a lower Bridge ID than its own, it stops claiming root and starts relaying the superior BPDU. The switch with the globally lowest Bridge ID wins. Because priority is usually identical everywhere, the lowest MAC address breaks the tie — which is why the oldest switch in the rack often becomes root by accident.
  2. Measure cost to root. Each switch computes its root path cost by adding the cost of the link a BPDU arrived on to the cost already in that BPDU. Costs are inversely related to bandwidth: in the modern (long) cost table a 100 Mbps link is 200,000, 1 Gbps is 20,000, 10 Gbps is 2,000. The port offering the lowest total cost becomes that switch's root port — its one chosen path toward root.
  3. Assign port roles per segment. For each LAN segment, the switch offering that segment the cheapest path to root owns the segment's single designated port. Every port that is neither a root port nor a designated port is put into the blocking state. Those blocked ports are exactly the redundant links — the cycle-closing edges removed to make the graph a tree.

Ties at every step are broken by the same lexicographic ordering: lowest root Bridge ID, then lowest root path cost, then lowest sender Bridge ID, then lowest sender Port ID. This total order is what guarantees every switch independently converges on the same tree without a coordinator.

The cost minimization makes STP a genuine spanning tree, but note it is not a global minimum spanning tree of the whole graph — it's a shortest-path tree from the root, like running Dijkstra outward from the elected root bridge. The total of all link costs is not what STP minimizes; the cost from each switch to the root is.

The five port states

A port can't jump straight from "blocked" to "forwarding" — that could create a transient loop while different switches still hold stale views. Classic 802.1D walks each port through timed states:

StateForwards data?Learns MACs?Processes BPDUs?Dwell time
DisabledNoNoNoAdmin down
BlockingNoNoYes (listens)Until needed
ListeningNoNoYes15 s (forward delay)
LearningNoYesYes15 s (forward delay)
ForwardingYesYesYesSteady state

That's the source of STP's notorious latency: a port that needs to start forwarding waits 15 s in Listening + 15 s in Learning = 30 s, and if a stale BPDU has to age out first it can add up to 20 s of Max Age, for the worst-case ~50 s. For 30+ seconds after you plug in a cable, the port carries no user traffic at all.

When STP matters — and when you've moved past it

  • Any switched network with physical redundancy. The moment two switches share more than one path, you need a loop-prevention protocol or a storm is one ARP broadcast away.
  • Mixed-vendor Layer-2 fabrics. STP/RSTP is the interoperable baseline every managed switch speaks, so it's the safe default at the edge.
  • Access-layer fault tolerance where you want a backup uplink that activates automatically when the primary fails — STP unblocks the standby link on its own.

You move past STP when blocking half your links becomes unacceptable. In a leaf-spine data center you want every uplink carrying traffic, so operators replace STP with Layer-3 routing to the host, or with fabrics like TRILL, SPB (802.1aq), or EVPN/VXLAN that compute equal-cost multipath instead of pruning to a single tree. STP wastes bandwidth by design — every blocked link is paid-for capacity sitting idle.

STP vs RSTP vs MSTP vs Layer-3 fabrics

STP (802.1D)RSTP (802.1w)PVST+ / Rapid-PVSTMSTP (802.1s)L3 / SPB / TRILL
Convergence30–50 s< 1 s (typ. 1–2 s)< 1 s per VLAN< 1 s< 1 s, ms with BFD
Trees1 for whole bridged LAN11 per VLAN1 per VLAN group (instance)None — multipath
Uses all links?No (blocks redundant)NoLoad-balance per VLANLoad-balance per instanceYes (ECMP)
Convergence mechanismTimers (forward delay, max age)Proposal/agreement handshakeRSTP per VLANRSTP per instanceIS-IS / routing
CPU / BPDU overheadLowLowHigh (1 instance × N VLANs)Low (few instances)Moderate (routing)
Scales to large fabric?PoorlyModeratelyPoorly (per-VLAN blowup)WellExcellently
Year19902001Cisco, 1990s20022010s

RSTP (Rapid STP) is the headline upgrade: instead of waiting out fixed timers, a switch and its neighbor negotiate a port's new role directly with a proposal/agreement exchange, cutting convergence from tens of seconds to about one. RSTP is backward-compatible — drop it into an 802.1D network and it falls back to timer behavior on legacy links. MSTP adds the ability to run a handful of independent rapid trees and map VLANs onto them, getting per-VLAN load balancing without the per-VLAN CPU cost of Cisco's PVST+.

What the numbers actually cost you

  • A broadcast storm doubles every loop traversal. On a 1 Gbps switch a loop can hit line rate — about 1.49 million 64-byte frames per second per direction — within a few milliseconds, freezing the control plane so you often can't even SSH in to fix it. The only cure is to physically pull a cable.
  • Classic STP costs you ~30 s of dead link on every topology change. If an access switch reboots, hosts behind a newly-forwarding port wait 30 s for connectivity. Across a building of 200 access ports that flap during a maintenance window, that's real downtime — and the reason PortFast exists.
  • RSTP buys back ~29 of those seconds. Sub-second convergence is the difference between a TCP session surviving a link failure (retransmit and continue) and timing out (RTO is typically 1–3 s, then exponential backoff).
  • STP strands ~50% of link capacity in a two-uplink design. Each access switch with redundant uplinks blocks one of them, so half the money spent on uplink optics and ports earns nothing until a failure. MSTP or L3-ECMP recovers it.
  • BPDUs are tiny but constant: 35–60 bytes every 2 s per port. Negligible bandwidth, but on a switch with thousands of PVST+ instances the CPU cost of generating them per-VLAN is the actual scaling wall.

JavaScript: computing the spanning tree's blocked links

STP's port-role assignment is, at its core, a shortest-path tree from the elected root, with every non-tree edge blocked. Here's the computation a switch network converges to — root election by lowest Bridge ID, then Dijkstra-style cost propagation, then marking the redundant edges blocked:

// switches: [{ id, priority, mac }]   links: [{ a, b, cost }]
function spanningTree(switches, links) {
  // 1. Elect root: lowest (priority, mac) = lowest Bridge ID.
  const bridgeId = s => s.priority * 2 ** 48 + s.mac;     // 8-byte BID
  const root = switches.reduce((r, s) => bridgeId(s) < bridgeId(r) ? s : r);

  // 2. Dijkstra from root over link costs → each switch's root path cost
  //    and the link it uses as its root port.
  const adj = new Map(switches.map(s => [s.id, []]));
  for (const e of links) {
    adj.get(e.a).push({ to: e.b, cost: e.cost, edge: e });
    adj.get(e.b).push({ to: e.a, cost: e.cost, edge: e });
  }
  const dist = new Map(switches.map(s => [s.id, Infinity]));
  const rootPortEdge = new Map();        // switch id → edge it reaches root through
  dist.set(root.id, 0);
  const pq = [[0, root.id]];
  while (pq.length) {
    pq.sort((x, y) => x[0] - y[0]);
    const [d, u] = pq.shift();
    if (d > dist.get(u)) continue;
    for (const { to, cost, edge } of adj.get(u)) {
      const nd = d + cost;
      if (nd < dist.get(to)) {
        dist.set(to, nd);
        rootPortEdge.set(to, edge);       // tree edge toward root
        pq.push([nd, to]);
      }
    }
  }

  // 3. Tree edges = each non-root switch's root-port edge. Everything else blocks.
  const treeEdges = new Set([...rootPortEdge.values()]);
  return links.map(e => ({
    link: `${e.a}-${e.b}`,
    state: treeEdges.has(e) ? 'forwarding' : 'blocking',
  }));
}

Real STP doesn't run Dijkstra centrally — each switch only compares the BPDUs it hears from neighbors — but the fixed point those local comparisons reach is exactly this shortest-path tree. The blocked links are the graph's cycle-closing edges relative to the root.

Python: detecting whether a topology even needs blocking

A switched topology only risks a loop when it has more links than a tree would — that is, when edges ≥ vertices. Counting how many links STP must block is just the cyclomatic number of the graph, E − V + C (edges minus vertices plus connected components):

def links_to_block(switches, links):
    """How many redundant links STP will put into blocking state."""
    parent = {s: s for s in switches}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    components = len(switches)
    redundant = 0
    for a, b in links:
        ra, rb = find(a), find(b)
        if ra == rb:
            redundant += 1          # this link closes a cycle → STP blocks it
        else:
            parent[ra] = rb
            components -= 1

    # A spanning tree of a connected graph has V-1 edges; extras are blocked.
    return redundant, components    # (blocked links, isolated islands)

sw = ['A', 'B', 'C', 'D']
ln = [('A','B'), ('B','C'), ('C','A'), ('C','D')]   # triangle + tail
print(links_to_block(sw, ln))      # (1, 1) → one link blocked, all connected

This union-find pass is the same structural test Kruskal's algorithm uses to reject cycle-closing edges. STP's job is the networking incarnation of it: keep V − 1 links forwarding, block the rest, and the broadcast storm can never form.

Variants worth knowing

RSTP — Rapid Spanning Tree (802.1w, 2001). Replaces timer-based transitions with an explicit proposal/agreement handshake between neighbors and a notion of edge ports and point-to-point links. Converges in well under a second. It is the modern default and subsumes plain STP.

MSTP — Multiple Spanning Tree (802.1s, 2002). Runs several independent RSTP instances and maps VLANs onto them. You might put odd VLANs on instance 1 (root = switch A) and even VLANs on instance 2 (root = switch B), so both uplinks forward and load is shared — without one tree per VLAN.

PVST+ / Rapid-PVST+ (Cisco). One spanning-tree instance per VLAN. Gives per-VLAN load balancing and tuning but explodes CPU and BPDU overhead at hundreds of VLANs; MSTP was standardized partly to fix that.

TRILL and SPB (802.1aq). Post-STP "Layer-2 routing" fabrics that run IS-IS to compute shortest paths and forward over all links with equal-cost multipath, eliminating the blocked-link waste entirely. TRILL was co-designed by Radia Perlman — the same person who gave us STP — as its explicit successor.

UDLD and Loop Guard. Companion safety features. A unidirectional fiber fault can make a blocked port stop hearing BPDUs and wrongly transition to forwarding, creating a loop; Loop Guard keeps such a port blocked, and UDLD detects the one-way link directly.

Common bugs and operational gotchas

  • Accidental root election. Leaving every switch at the default priority of 32768 hands the root role to whichever box has the lowest MAC — often an old access switch in a closet, forcing all traffic through a slow path. Always set the core switch's priority low (e.g. 4096) explicitly.
  • Forgetting PortFast on host ports. Without it, every workstation and server waits ~30 s for link, and DHCP/PXE often times out before the port forwards. But never enable PortFast on a port that might reach another switch.
  • PortFast without BPDU Guard. A user plugs a cheap switch into a PortFast access port; it can win the root election or form a loop. BPDU Guard err-disables any PortFast port that receives a BPDU, shutting that down instantly.
  • Unidirectional link loops. A fiber that transmits but doesn't receive can silently unblock a port. This is the failure UDLD and Loop Guard exist to catch — pure STP can't detect it.
  • Mismatched timers or path-cost tables. Mixing the old short cost table (10 Mbps = 100, 100 Mbps = 19) with the long table on different switches yields inconsistent trees. Standardize the cost method fabric-wide.
  • TCN-triggered MAC-table flushes. A flapping port generates Topology Change Notifications that make switches flush learned MAC addresses, causing brief unicast flooding. Frequent flaps (a bad cable) can keep a network in near-constant relearning. PortFast suppresses TCNs from host ports for this reason.

Frequently asked questions

Why do Layer-2 loops cause a broadcast storm but Layer-3 routing doesn't?

Ethernet frames have no time-to-live field, so a broadcast frame caught in a switching loop is duplicated and forwarded forever, doubling on every pass until it saturates every link. IP packets carry a TTL that decrements at each hop and drops the packet at zero, so a routing loop wastes hops but self-terminates. STP exists precisely because Ethernet has no TTL to save it.

How does STP elect the root bridge?

Every switch starts claiming to be root and floods BPDUs advertising its Bridge ID — a 2-byte priority (default 32768) followed by its 6-byte MAC address. The switch with the numerically lowest Bridge ID wins. Since priority is usually identical, the lowest MAC address breaks the tie, which is why the oldest switch often becomes root by accident unless you set priority deliberately.

What's the difference between STP, RSTP, and MSTP?

Classic STP (802.1D) converges in 30–50 seconds using timer-based port states. RSTP (802.1w) converges in under a second by negotiating port roles directly with neighbors via proposal/agreement handshakes. MSTP (802.1s) lets you map groups of VLANs to a handful of independent spanning-tree instances, so you get RSTP-speed convergence and per-VLAN load balancing without running one tree per VLAN.

Why does classic STP take 30 to 50 seconds to converge?

A port that may need to forward must pass through Listening (15 s, forward delay) then Learning (15 s) before Forwarding, to be sure no transient loop forms. If a topology change invalidates cached information, a port can also wait up to Max Age (20 s) for stale BPDUs to expire. Worst case is roughly 20 + 15 + 15 = 50 seconds; the common case is about 30.

What is a designated port versus a root port?

Each non-root switch has exactly one root port — the single port with the lowest accumulated path cost back to the root, used to reach the root. Each LAN segment has exactly one designated port — the port on whichever switch offers that segment the cheapest path to root, used to serve traffic onto the segment. Every remaining port is blocked. The root bridge's ports are all designated.

What do PortFast and BPDU Guard do?

PortFast skips the Listening and Learning states on access ports connected to end hosts, so a laptop or server gets link in about a second instead of 30. BPDU Guard protects those PortFast ports: if one ever receives a BPDU — meaning someone plugged a switch into a host port — it err-disables the port immediately, stopping a rogue switch from hijacking the root election or forming a loop.