Systems

Reference Counting

Free the object the instant nobody is looking — unless they're looking at each other

Reference counting tracks how many references point to each object and frees it the instant the count hits zero — giving deterministic, incremental memory reclamation, but it leaks reference cycles unless paired with weak references or a cycle collector.

  • Reclaim timingImmediate at count → 0
  • Cost per assignmentO(1) inc + dec
  • Pause timeNo stop-the-world
  • Per-object overhead1 counter word
  • Blind spotReference cycles

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How reference counting works

Every object carries a small integer — its reference count — that records how many references currently point at it. The invariant is the whole idea: the count always equals the number of live pointers to the object. Whenever you copy a pointer, you increment (often called retain). Whenever a pointer goes away — a variable leaves scope, a field is overwritten, an array slot is cleared — you decrement (release). The moment a decrement drives the count to zero, nobody can reach the object anymore, so you free it immediately and decrement everything it pointed to.

That last clause is what makes the technique recursive. Freeing a list node releases its next pointer, which may take that node to zero, which releases its next, and so the whole tail unwinds. The cleanup is local — you never scan the heap, never trace from roots, never pause the program. George Collins published the idea in 1960, the same year John McCarthy introduced tracing garbage collection for Lisp; the two approaches have competed ever since.

The defining property is determinism. An object dies at a known, reproducible point — the assignment that drops its last reference — not at some unpredictable future moment when a tracing collector happens to run. That is why C++'s std::shared_ptr, Rust's Rc/Arc, Swift's ARC, Python's CPython runtime, COM, and Objective-C all lean on it: destructors fire predictably, file handles close on time, and there is no garbage-collection pause in the middle of a frame.

The precise mechanism and its cost

Three operations define the algorithm. Let rc(o) be the count stored in object o:

retain(o):    rc(o) += 1                        // O(1)
release(o):   rc(o) -= 1
              if rc(o) == 0:                     // amortized O(1) per object
                  for child in references(o):
                      release(child)             // recursive cascade
                  free(o)
assign(p, o): retain(o); release(*p); *p = o     // order matters: retain first

Each pointer write is O(1) in the common case: one increment, one decrement. The subtlety is the release cascade. Freeing one object can free a long chain, but each object is freed exactly once over the program's life, so the amortized cost stays O(1) per object — the total free work is proportional to the total allocation, just like tracing. What differs is when that work happens: spread continuously across mutations rather than batched into a collection cycle.

The assign ordering is a real bug magnet: you must retain the new target before releasing the old one. If *p and o are the same object with count 1, releasing first frees it, and the subsequent retain touches freed memory. Retain-first makes self-assignment a harmless +1/−1.

Counts are not free in space either. A 64-bit counter word adds 8 bytes per object — on a heap of small objects that can be a measurable tax. Implementations shrink it: CPython packs the count next to the type pointer; Rust's Rc stores strong and weak counts side by side in the heap allocation, not in the value itself.

When to choose reference counting

  • Latency-sensitive systems — games, audio, real-time control — where a multi-millisecond GC pause is unacceptable. The cost is paid in tiny slices.
  • Deterministic resource release — closing sockets, unlocking mutexes, flushing buffers at a known program point (RAII in C++ is built on this).
  • Acyclic data — trees, DAGs, ropes, immutable structures — where cycles simply cannot form, so the one weakness never bites.
  • Shared ownership across modules — when several subsystems hold a resource and the last one out should clean up, without a central owner.
  • Environments without a runtime — C and C++ have no tracing GC; reference counting via smart pointers is the idiomatic automatic-memory tool.

Avoid it as a sole strategy when your object graph is naturally cyclic — doubly linked lists, parent↔child trees, observer graphs, arbitrary user data — unless you are disciplined about weak references or bolt on a cycle collector. Avoid it on hot multi-threaded paths where atomic count churn dominates; there a tracing collector or arena often wins.

Reference counting vs other memory strategies

Reference countingMark-and-sweepCopying/generational GCManual (malloc/free)Arena/region
Reclaim timingImmediate at count 0Periodic, on collectionPeriodic, on collectionExplicit, programmer-timedBulk at region end
Pause behaviorNone (incremental)Stop-the-worldStop-the-world (often shorter)NoneNone
Collects cyclesNo (needs help)YesYesProgrammer's jobN/A — freed wholesale
Per-mutation costinc + dec (atomic if shared)Write barrier (sometimes)Write barrierNoneNone
Per-object overhead1 counter word1 mark bitCopy + forwarding pointerNoneNone
ThroughputLower (steady tax)HigherHighest for short-lived dataHighestHighest
Memory tightnessTight (frees promptly)Loose (waits for collection)Needs ~2× heap for copy spaceAs tight as you codeHolds until region drops
Real-world useCPython, Swift ARC, C++ shared_ptr, COMBoehm GC, early RubyJVM, Go, V8, .NETC, embeddedCompilers, request handlers

The headline trade is latency vs throughput. Reference counting smears the bill across every pointer write so no single operation stalls, but the steady increment/decrement tax and poor cache behavior of touching counts usually leave it behind a good tracing collector on total work. Many production runtimes therefore hybridize: CPython reference-counts everything and runs a periodic cycle detector to mop up the cycles that counting alone leaks.

What the numbers actually say

  • Atomic counts are the expensive part. An uncontended atomic increment is a handful of nanoseconds, but under cross-core contention a fetch-add that bounces a cache line between cores costs roughly 30–100 ns each — orders of magnitude more than the ~1 ns non-atomic version. This is precisely why CPython keeps its counts non-atomic and serializes object access behind the Global Interpreter Lock.
  • Counter overhead is one machine word. 8 bytes per object on a 64-bit build. On a workload of 16-byte objects that is a 50% space surcharge before payload; on large objects it vanishes into the noise.
  • CPython spends a real fraction of its time on counts. Measurements of the reference-counting and the GIL machinery put their combined cost in the high single-digit-to-low-double-digit percent of interpreter time on pointer-heavy code — the motivation behind the no-GIL (PEP 703) work and biased reference counting.
  • Cascading frees can recurse to the heap's depth. Releasing the head of an n-node singly linked list naively recurses n deep; at a few hundred thousand nodes that overflows a default 1–8 MB stack, so robust implementations free iteratively via a deferred work list.
  • Cycle leakage is unbounded. A single leaked cycle of k objects never returns its memory; a server that creates one leaked cycle per request leaks linearly with traffic until it OOMs. The leak is silent — every count is a perfectly valid positive number.

JavaScript implementation

JavaScript has its own tracing GC, so you'd never hand-roll this in production — but modeling the counter explicitly makes the mechanism and the cycle blind spot concrete:

class RCObject {
  constructor(name) {
    this.name = name;
    this.rc = 0;            // reference count
    this.refs = new Map();  // field name -> RCObject (strong edges)
    this.alive = true;
  }
}

function retain(o) { if (o) o.rc += 1; return o; }

function release(o, freed = []) {
  if (!o || !o.alive) return freed;
  o.rc -= 1;
  if (o.rc === 0) {
    o.alive = false;
    freed.push(o.name);
    // cascade: drop every outgoing strong edge
    for (const child of o.refs.values()) release(child, freed);
    o.refs.clear();
  }
  return freed;
}

// assign field `key` of `owner` to point at `target` (retain-before-release)
function setField(owner, key, target) {
  retain(target);                  // retain FIRST
  release(owner.refs.get(key));    // then drop the old target
  if (target) owner.refs.set(key, target);
  else owner.refs.delete(key);
}

// --- acyclic case: frees cleanly ---
const root = retain(new RCObject('root'));   // rc 1 (a stack root holds it)
const a = new RCObject('a');
const b = new RCObject('b');
setField(root, 'left', a);                   // a.rc = 1
setField(a, 'next', b);                      // b.rc = 1
console.log(release(root));                  // -> ['root','a','b']  full cascade

// --- the cycle leak: counts never reach zero ---
const x = retain(new RCObject('x'));
const y = new RCObject('y');
setField(x, 'peer', y);                      // y.rc = 1
setField(y, 'peer', x);                      // x.rc = 2 (root + y)
console.log(release(x));                     // -> []  x.rc drops 2->1, nothing freed
// x and y are now unreachable from any root, yet both have rc 1. Leaked.

The two scenarios share identical machinery; the only difference is the back-edge y.peer = x. That single edge keeps both counts pinned at 1 forever. No local decision can ever tell that the pair, taken together, is dead.

Python implementation

CPython is a reference counter — sys.getrefcount exposes the live value (it reads one higher than you expect, because the argument itself is a temporary reference):

import sys, gc

class Node:
    def __init__(self, name):
        self.name = name
        self.peer = None
    def __del__(self):
        print(f"freed {self.name}")

# Deterministic free: the moment the last name binding drops, __del__ fires.
n = Node("solo")
print(sys.getrefcount(n))   # 2: the variable `n` plus the getrefcount argument
del n                       # prints "freed solo" immediately

# The cycle leak that reference counting alone cannot reclaim:
a = Node("a")
b = Node("b")
a.peer = b                  # b.rc: 1 -> 2
b.peer = a                  # a.rc: 1 -> 2
del a, b                    # each drops to 1, neither hits 0 -> NOTHING printed

# CPython ships a cycle detector exactly to mop this up:
collected = gc.collect()    # finds the unreachable cycle, breaks it, frees both
print("cycle collector reclaimed", collected, "objects")  # prints 2; __del__ fires

This is the canonical hybrid: reference counting handles the 90%+ of objects that die acyclically and promptly, while a periodic mark-sweep cycle detector (gc module) scans only the container objects that could possibly form cycles. The detector uses a clever trick — it subtracts internal references from each object's count; any object left with a positive "external" count is a root, and everything not reachable from a root is a dead cycle.

Variants worth knowing

Atomic vs non-atomic counts. Single-threaded counters use plain integer ops (Rust's Rc, CPython under the GIL). Shared counters need atomic fetch-add/sub (Rust's Arc, C++ shared_ptr). The atomic version is correct across threads but pays cache-coherence traffic on every retain/release.

Deferred reference counting. Don't count references from the stack and registers — only from the heap — and reconcile the stack periodically. This kills the vast majority of count churn (locals are touched constantly) at the price of slightly delayed reclamation, used in several research and production runtimes.

Biased reference counting. Give each object an "owner" thread that uses cheap non-atomic ops, and a separate atomic count for accesses from other threads. Introduced by Choi et al. (2018) and prototyped on the Swift runtime, the technique was adopted by the no-GIL CPython (PEP 703) to recover most of the single-thread speed in a concurrent setting.

1-bit / limited-field counts. Most objects are referenced exactly once, so store a single "shared?" bit instead of a full word; spill to a side table only for the rare highly shared object. Saves space at the cost of a fallback path.

Weak references and cycle collectors. The two ways to handle cycles: make programmer-chosen back-edges weak (no count, auto-nulled on free), or run a trial-deletion cycle detector (CPython, the Bacon-Rajan recycler) that finds dead cycles automatically. Most real systems use both.

Common bugs and edge cases

  • Reference cycles. The headline failure. A parent holding a child that holds the parent never frees. Fix: make one edge a weak reference (Swift weak/unowned, Rust Weak, C++ weak_ptr) or run a cycle collector.
  • Release-before-retain on self-assignment. x = x where you release first can free a count-1 object before retaining it. Always retain the new target first.
  • Forgetting a decrement. A missed release leaks; an extra one frees a still-live object and yields a use-after-free. Manual counting (raw COM AddRef/Release) is notoriously error-prone — smart pointers and ARC exist to automate it.
  • Stack overflow on cascade. Recursive freeing of a long chain blows the stack; defer the cascade onto an explicit work list to make deletion iterative.
  • Atomic correctness vs cost. Use non-atomic counts on a shared object and a lost update frees live memory; use atomic counts everywhere and pay coherence traffic you didn't need. Choose per object based on whether it crosses threads.
  • Resurrection in finalizers. If a destructor re-publishes a reference to the dying object (stores this somewhere reachable), the count goes back above zero after you decided to free — a classic source of dangling pointers. Most runtimes forbid or specially handle this.

Frequently asked questions

Why can't reference counting collect cycles?

A cycle keeps every object's count above zero by mutual reference: A points to B and B points to A, so each has count 1 even when the program can no longer reach either. Since the count never reaches zero, neither is ever freed. Reference counting only sees local edges, never global reachability, so an isolated cycle is invisible to it.

Is reference counting faster than tracing garbage collection?

It has lower latency — work is spread across every pointer assignment instead of bunched into a stop-the-world pause — but usually lower throughput. Each assignment costs an increment plus a decrement, and naive counters need atomic operations across threads, which can cost 30–100 ns under contention. Tracing GC touches only live objects in bulk, so it often wins on raw throughput.

What is a weak reference?

A weak reference points to an object without incrementing its count. It lets you observe an object without keeping it alive, which is how you break cycles: make the back-edge weak. When the object is freed, weak references are nulled out so you can detect the dangling case instead of crashing.

What is the difference between reference counting and ARC?

ARC (Automatic Reference Counting), used by Swift and Objective-C, is reference counting where the compiler inserts the retain and release calls for you at compile time instead of a runtime scanning objects. The runtime mechanism is identical; ARC just removes the manual bookkeeping and still cannot collect cycles on its own.

Why do reference counts need atomic operations?

If two threads increment the same count at once, a non-atomic read-modify-write can lose an update, dropping the count too low and freeing a live object. Atomic fetch-add and fetch-sub prevent the lost update, but they force cache-line synchronization between cores, which is why CPython instead protects counts with a global interpreter lock rather than paying per-operation atomics.

What is cascading deletion in reference counting?

When freeing an object drops a child's count to zero, that child is freed too, which can drop its children to zero, and so on. Freeing the head of a million-node list can recurse a million deep and blow the stack. Production counters defer the chain onto an explicit work list to keep deletion iterative and bounded.