Control Groups (cgroups)

Limit, account, and isolate CPU, memory, I/O, PIDs for groups of processes — the kernel side of Docker

Control groups (cgroups) are a Linux kernel feature that limits, accounts for, and isolates the resource usage of a group of processes: CPU time (cpu, cpuset), memory (memory), block I/O (blkio in v1, io in v2), network traffic classification (net_cls, which tags packets so that tc can shape them), PID count (pids), and device access (devices). They were introduced in 2007 by Paul Menage at Google and restructured as cgroups v2 in 2016, which uses a unified hierarchy where each process belongs to exactly one cgroup. Together with namespaces (PID, network, mount), cgroups are the kernel primitive behind Docker, Kubernetes pods, systemd slices, and LXC. For example, a Kubernetes pod's CPU limit translates directly to a cpu.max write in cgroup v2.

  • Introduced: 2007 (cgroups v1)
  • Unified hierarchy: cgroups v2 (2016)
  • Controllers: cpu, memory, io, pids, devices
  • Used by: Docker, k8s, systemd, LXC
  • CPU limit format: cpu.max = "quota period"
  • Memory limit: memory.max

Why cgroups matter

  • Containers. Docker, podman, containerd all use cgroups to enforce per-container CPU and memory caps. docker run --memory=512m --cpus=1.5 writes memory.max=536870912 and cpu.max="150000 100000".
  • Multi-tenant isolation. A buggy tenant cannot starve neighbours. Each tenant lives in a sibling cgroup with hard caps; the kernel enforces the partition.
  • Fair-share scheduling. cpu.weight (default 100, range 1–10000) gives proportional CPU under contention without hard caps — critical for batch workloads on shared clusters.
  • Kubernetes accounting. kubelet polls cgroup stats every 10 seconds for cAdvisor; HPA scales based on cgroup memory/CPU usage, not host stats.
  • systemd slices. Every systemd unit lives in a cgroup. systemctl set-property nginx.service MemoryMax=2G writes memory.max on the unit's cgroup at runtime.
  • OOM containment. A container exceeding its memory limit dies via per-cgroup OOM killer; the host stays up. Without cgroups, the global OOM killer might pick any process.
  • Pressure Stall Information (PSI). v2 exposes cpu.pressure, memory.pressure, io.pressure — fraction of time tasks were stalled. Drives modern autoscalers and oomd.
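
The pressure files mentioned in the last bullet are plain text, so they are easy to consume from a script. The sketch below parses the usual two-line PSI format (a "some" line and a "full" line, each with avg10, avg60, avg300, and total fields). The sample values and the 1% alert threshold are illustrative assumptions, not recommendations from this article.

```python
def parse_psi(text):
    """Parse the contents of a PSI file such as memory.pressure.

    Returns a dict like {"some": {"avg10": 1.25, ..., "total": 123456}, "full": {...}}.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # "total" is a stall time in microseconds; the avg* fields are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


# Made-up sample in the format the kernel uses for *.pressure files.
SAMPLE = (
    "some avg10=1.25 avg60=0.40 avg300=0.10 total=123456\n"
    "full avg10=0.50 avg60=0.10 avg300=0.02 total=45678\n"
)

psi = parse_psi(SAMPLE)
# Hypothetical alert rule: some task was stalled for more than 1% of the last 10 s.
under_pressure = psi["some"]["avg10"] > 1.0
print(under_pressure)
```

On a real host the input would come from a file such as /sys/fs/cgroup/<group>/memory.pressure rather than from the embedded sample.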

Anatomy of a cgroup

Every cgroup is a directory in the cgroup virtual filesystem (mounted at /sys/fs/cgroup). Creating a cgroup is mkdir /sys/fs/cgroup/myapp; assigning a process is echo $PID > /sys/fs/cgroup/myapp/cgroup.procs. Limits are written as plaintext files. A pod cgroup tree looks like:

  • /sys/fs/cgroup/kubepods.slice/ (root pod cgroup)
    • kubepods-burstable.slice/ (Burstable QoS class)
      • kubepods-burstable-pod<UUID>.slice/ (one pod)
        • cri-containerd-<CONTAINER_ID>.scope/ (one container)
          • cgroup.procs (list of PIDs in this cgroup)
          • cpu.max, cpu.weight, cpu.stat
          • memory.max, memory.current, memory.events, memory.pressure
          • io.max, io.stat, io.pressure
          • pids.max, pids.current

Reading a counter file is cheap because the kernel maintains the counters incrementally instead of scanning processes on demand; writes are validated synchronously. A typical Kubernetes node has 1,000–5,000 cgroup directories, each backed by a few hundred bytes of in-kernel state plus VFS inodes, so the overhead is negligible compared to the workloads.
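
Because a cgroup is just a directory, ordinary file APIs are enough to inspect the tree. The following sketch counts cgroups and reads cgroup.procs. It runs against a temporary directory that imitates the layout above, so it needs no privileges; the helper names are illustrative, and on a real host the root would be /sys/fs/cgroup.

```python
import os
import tempfile


def count_cgroups(root):
    """Count cgroups below root: every subdirectory is one cgroup."""
    return sum(len(dirnames) for _path, dirnames, _files in os.walk(root))


def read_procs(cgroup_dir):
    """Return the PIDs listed in a cgroup's cgroup.procs file."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        return [int(line) for line in f if line.strip()]


# Throwaway directory mimicking part of the pod tree (not a real cgroupfs).
demo_root = tempfile.mkdtemp()
burstable = os.path.join(demo_root, "kubepods.slice", "kubepods-burstable.slice")
os.makedirs(burstable)
with open(os.path.join(burstable, "cgroup.procs"), "w") as f:
    f.write("4242\n4243\n")

print(count_cgroups(demo_root), read_procs(burstable))
```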

The CPU controller in detail

Two parameters shape CPU behaviour. cpu.max is a hard cap enforced by the CFS bandwidth controller, written as "quota period" in microseconds. "50000 100000" means 50 ms of CPU time per 100 ms wall-clock window, equivalent to 0.5 CPUs on average. The scheduler charges each task's runtime against the cgroup's quota; when the quota is exhausted mid-period, every task in the cgroup is throttled until the next period begins. cpu.stat exposes nr_throttled and throttled_usec; large and growing values mean a latency-sensitive workload is hitting its cap and stalling. This is common in Kubernetes, and the usual fixes are raising the limit, lowering the period (controversial), or removing the limit entirely for non-bursty workloads.
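
To judge whether throttling is a problem, it helps to turn the raw cpu.stat counters into ratios. The sketch below assumes the v2 flat "key value" format and uses made-up sample numbers; the function names are illustrative.

```python
def parse_flat_keyed(text):
    """Parse 'key value' lines, the format used by cpu.stat."""
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats


def throttle_summary(stats):
    """Return (fraction of periods throttled, mean throttled us per throttled period)."""
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    ratio = throttled / periods if periods else 0.0
    mean_us = stats.get("throttled_usec", 0) / throttled if throttled else 0.0
    return ratio, mean_us


# Made-up sample; on a real host read /sys/fs/cgroup/<group>/cpu.stat instead.
SAMPLE_CPU_STAT = """\
usage_usec 73000000
user_usec 60000000
system_usec 13000000
nr_periods 1000
nr_throttled 250
throttled_usec 5000000
"""

ratio, mean_us = throttle_summary(parse_flat_keyed(SAMPLE_CPU_STAT))
print(f"throttled in {ratio:.0%} of periods, {mean_us:.0f} us on average")
```

Because the counters are cumulative, a monitoring loop would normally compare two readings taken some seconds apart rather than look at absolute values.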

cpu.weight is soft: a relative share enforced only under contention. Two cgroups with weights 100 and 200 split CPU 1:2 when both are runnable; if one is idle, the other gets all available CPU. The kernel scales cpu.weight linearly into the scheduler's internal load weight, so the default of 100 corresponds to the v1 cpu.shares default of 1024.
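
The proportional split can be written down as a small idealized model: each runnable sibling gets its weight divided by the sum of the weights of all runnable siblings. The function below is only that arithmetic, not a model of the scheduler itself, and the cgroup names are hypothetical.

```python
def cpu_split(weights, runnable):
    """Idealized CPU fraction per sibling cgroup under full contention.

    weights:  dict of cgroup name -> cpu.weight
    runnable: set of cgroup names that currently have runnable tasks
    """
    active = {name: w for name, w in weights.items() if name in runnable}
    total = sum(active.values())
    return {
        name: (active[name] / total if name in active else 0.0)
        for name in weights
    }


weights = {"batch": 100, "web": 200}

both_busy = cpu_split(weights, {"batch", "web"})  # 1:2 split
web_idle = cpu_split(weights, {"batch"})          # batch takes everything
print(both_busy, web_idle)
```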

The memory controller in detail

Memory accounting tracks anonymous pages, page cache, kernel slab, sockets, and swap. memory.max is the hard limit; the kernel will reclaim pages (push to swap, drop clean cache) when usage approaches the limit. If reclaim fails to free enough, the per-cgroup OOM killer fires. memory.high is a soft cap that triggers reclaim before the hard limit — useful for preventing latency spikes from sudden OOM events. memory.swap.max bounds swap usage independently. memory.events counts low, high, max, oom, and oom_kill events; alerts on these are the standard SLO signal for "container is being squeezed."
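
Since memory.events holds cumulative counters, an alert is usually based on the difference between two readings. A minimal sketch, with made-up before and after samples and illustrative function names:

```python
def parse_events(text):
    """Parse 'key value' lines as found in memory.events."""
    return {key: int(value) for key, value in
            (line.split() for line in text.strip().splitlines())}


def new_events(previous, current):
    """Return only the counters that increased between two readings."""
    return {
        key: current[key] - previous.get(key, 0)
        for key in current
        if current[key] > previous.get(key, 0)
    }


# Two hypothetical readings of the same memory.events file.
BEFORE = "low 0\nhigh 10\nmax 2\noom 0\noom_kill 0\n"
AFTER = "low 0\nhigh 14\nmax 5\noom 1\noom_kill 1\n"

delta = new_events(parse_events(BEFORE), parse_events(AFTER))
print(delta)
```

A rising high counter suggests reclaim pressure from memory.high, while any increase in oom_kill means a process in the cgroup was killed.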

Common misconceptions

  • "cgroups isolate processes." They limit and account; isolation (separate PID space, network stack, mount tree) is the job of namespaces. A container needs both.
  • "v2 is incompatible with v1." The kernel can mount both simultaneously (hybrid mode); a process can be in v1 controllers and v2 controllers at the same time. Most distros now ship v2-only because the cleanup is worth it.
  • "cgroups have no overhead." Every fork and exit does extra bookkeeping proportional to the number of enabled controllers, and the per-cgroup counters add some cache traffic on hot paths. The cost is real and measurable but small (roughly 1–3% on most workloads).
  • "cpu.max on Kubernetes is always good." CFS throttling on a multi-CPU machine can cause p99 latency spikes even when average usage is low: threads running in parallel on N cores consume the quota N times faster than wall-clock time, so a short multi-threaded burst can exhaust it early in the period and stall the whole cgroup until the next one. Many shops disable CPU limits and rely only on requests.
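  • "OOM kill means the cgroup is at fault." Not necessarily. The limit also covers page cache, kernel slab, and socket buffers charged to the cgroup, none of which show up in the application's own heap metrics. Check the breakdown in memory.stat, and consider raising the limit, before blaming the workload.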
  • "systemd slices are different from cgroups." A systemd slice is a cgroup. systemctl status shows the cgroup path; cat /sys/fs/cgroup/<slice>/cpu.max shows the live limit.
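
The throttling misconception above comes down to simple arithmetic. The sketch below is an idealized model that assumes every busy thread runs flat out on its own core from the start of the period; real schedulers are messier, and the function name is illustrative.

```python
def throttled_time_us(quota_us, period_us, busy_threads):
    """Wall-clock microseconds per period that the cgroup spends throttled.

    With busy_threads threads running in parallel, the quota is consumed
    busy_threads times faster than wall-clock time.
    """
    time_to_exhaust_us = quota_us / busy_threads
    return max(0.0, period_us - time_to_exhaust_us)


# A limit of 1 CPU (100 ms quota per 100 ms period):
one_thread = throttled_time_us(100_000, 100_000, 1)    # never throttled
four_threads = throttled_time_us(100_000, 100_000, 4)  # quota gone after 25 ms
print(one_thread, four_threads)
```

In this model a four-thread burst under a one-CPU limit is stalled for three quarters of every period, even though the limit looks generous for a mostly idle service.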

A concrete example

Run a 256 MB-capped, 0.5-CPU-capped shell under cgroups v2:

  • sudo mkdir /sys/fs/cgroup/demo
  • echo "+cpu +memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
  • echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max
  • echo "256M" | sudo tee /sys/fs/cgroup/demo/memory.max
  • echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs
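  • Then stress-ng --vm 1 --vm-bytes 300M exceeds the memory cap and is OOM-killed (if swap is enabled, the kernel may swap instead unless memory.swap.max is set to 0); stress-ng --cpu 4 is throttled to about 50% of one core in total.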
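
The same steps can be scripted. This sketch mirrors the shell commands with plain file writes and takes the cgroup root as a parameter, so it can be dry-run against a temporary directory, as done here. On a real host you would pass /sys/fs/cgroup, run as root, and still enable the controllers in cgroup.subtree_control first. The helper names are illustrative.

```python
import os
import tempfile


def create_cgroup(root, name, cpu_max, memory_max):
    """Create <root>/<name> and write its CPU and memory limits."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    for filename, value in (("cpu.max", cpu_max), ("memory.max", memory_max)):
        with open(os.path.join(path, filename), "w") as f:
            f.write(value + "\n")
    return path


def attach(cgroup_path, pid):
    """Move a process into the cgroup by writing its PID to cgroup.procs."""
    with open(os.path.join(cgroup_path, "cgroup.procs"), "w") as f:
        f.write(f"{pid}\n")


# Dry run: a temporary directory stands in for /sys/fs/cgroup, so the
# kernel enforces nothing here and the files are ordinary files.
fake_root = tempfile.mkdtemp()
demo = create_cgroup(fake_root, "demo", "50000 100000", "256M")
attach(demo, os.getpid())
print(sorted(os.listdir(demo)))
```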

Frequently asked questions

What is the difference between cgroups v1 and v2?

cgroups v1 (2007) used multiple independent hierarchies, typically one per controller (cpu, memory, blkio, etc.), so a process could live in different cgroups across controllers, which made coordinated accounting difficult and led to inconsistent semantics. cgroups v2 (merged in Linux 4.5, 2016) uses a single unified hierarchy where each process belongs to exactly one cgroup, and controllers are enabled for a cgroup's children through the parent's cgroup.subtree_control. v2 cleaned up controller behavior (e.g. memory.max replaces memory.limit_in_bytes), added pressure stall information (PSI), and unified I/O accounting across block devices. Both can coexist; systemd has defaulted to v2 since 2019, and most modern distros (Fedora 31+, RHEL 9, Ubuntu 22.04+) default to v2-only.
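
One way to see which layout a process is in is to read /proc/self/cgroup, where each line has the form hierarchy-ID:controller-list:path and the v2 entry has ID 0 with an empty controller list. The sketch below classifies sample contents; the sample paths and function names are made up for illustration.

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.strip().splitlines():
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append(
            (int(hierarchy_id), controllers.split(",") if controllers else [], path)
        )
    return entries


def cgroup_mode(entries):
    """Classify the layout as 'v1', 'v2', or 'hybrid'."""
    has_v2 = any(hid == 0 and not ctrls for hid, ctrls, _path in entries)
    has_v1 = any(ctrls for _hid, ctrls, _path in entries)
    if has_v1 and has_v2:
        return "hybrid"
    return "v2" if has_v2 else "v1"


# Hypothetical file contents for a v2-only host and a hybrid host.
V2_ONLY = "0::/user.slice/user-1000.slice/session-1.scope\n"
HYBRID = "5:memory:/user.slice\n4:cpu,cpuacct:/user.slice\n0::/user.slice\n"

print(cgroup_mode(parse_proc_cgroup(V2_ONLY)), cgroup_mode(parse_proc_cgroup(HYBRID)))
```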

How do cgroups relate to Linux namespaces?

They are orthogonal kernel primitives. Namespaces (PID, network, mount, UTS, IPC, user, cgroup) virtualize what a process can see — its own /proc, its own network stack, its own filesystem mounts. cgroups limit what a process can use — how much CPU time, how much RAM, how many open PIDs. A container (Docker, podman, Kubernetes pod) is just the combination: the runtime creates new namespaces for isolation and assigns the processes to a cgroup for resource caps. Without cgroups you have isolation but a noisy neighbor can starve the host; without namespaces you have caps but processes can see and signal each other.

What does cpu.cfs_quota_us actually do?

It is the cgroups v1 CPU bandwidth control knob. The Completely Fair Scheduler (CFS) measures runtime in microseconds within a fixed period (cpu.cfs_period_us, default 100000 = 100 ms). cpu.cfs_quota_us sets how many microseconds of CPU time the cgroup may consume in each period. quota=50000, period=100000 means 0.5 CPU; quota=200000, period=100000 means 2.0 CPUs. When a cgroup exceeds its quota mid-period, all its tasks are throttled until the next period boundary — which is the source of the infamous CFS throttling latency spikes in Kubernetes (visible as nr_throttled and throttled_time in cpu.stat). cgroups v2 collapses these into a single cpu.max file with the format 'quota period'.
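
The mapping from the two v1 files to the single v2 file is mechanical. A small sketch, assuming the v1 convention that a quota of -1 means "no limit", which v2 spells as the literal word max; the function names are illustrative.

```python
def v1_to_cpu_max(cfs_quota_us, cfs_period_us=100_000):
    """Build the v2 cpu.max string from v1 cpu.cfs_quota_us / cpu.cfs_period_us."""
    quota = "max" if cfs_quota_us < 0 else str(cfs_quota_us)
    return f"{quota} {cfs_period_us}"


def cpus_allowed(cfs_quota_us, cfs_period_us=100_000):
    """Average number of CPUs the quota allows, or None if unlimited."""
    if cfs_quota_us < 0:
        return None
    return cfs_quota_us / cfs_period_us


print(v1_to_cpu_max(50_000), cpus_allowed(200_000), v1_to_cpu_max(-1))
```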

Why is the OOM killer invoked per memory cgroup?

When a cgroup's memory.max is reached and reclaim cannot free enough, the kernel invokes a per-cgroup OOM killer that selects a victim from inside that cgroup — not the global OOM killer. This is critical for multi-tenant systems: a container that exceeds its limit dies without disturbing other tenants. The selection uses oom_score_adj plus RSS to pick a victim. memory.oom.group=1 (v2) escalates the kill to the entire cgroup, mirroring how Kubernetes treats a pod as the failure unit. The decision shows up in dmesg as 'Memory cgroup out of memory: Killed process N (name)'. To investigate, read memory.events for oom_kill counters.
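
Victim selection can be approximated with a toy model. The sketch below loosely follows the kernel's badness heuristic as commonly described: a task's memory footprint in pages plus oom_score_adj scaled by one thousandth of the limit, with an adjustment of -1000 exempting the task. It is a simplification for intuition, not the kernel's code, and the task list is invented.

```python
def badness(footprint_pages, oom_score_adj, limit_pages):
    """Simplified OOM score; None means the task is exempt from killing."""
    if oom_score_adj == -1000:
        return None
    return footprint_pages + oom_score_adj * (limit_pages // 1000)


def pick_victim(tasks, limit_pages):
    """Return the PID with the highest score, or None if every task is exempt."""
    scored = []
    for task in tasks:
        score = badness(task["pages"], task["adj"], limit_pages)
        if score is not None:
            scored.append((score, task["pid"]))
    return max(scored)[1] if scored else None


# Hypothetical cgroup with a 256 MiB limit and 4 KiB pages.
LIMIT_PAGES = 256 * 1024 * 1024 // 4096
TASKS = [
    {"pid": 10, "pages": 40_000, "adj": 0},
    {"pid": 11, "pages": 20_000, "adj": 500},    # smaller, but marked expendable
    {"pid": 12, "pages": 60_000, "adj": -1000},  # largest, but exempt
]

victim = pick_victim(TASKS, LIMIT_PAGES)
print(victim)
```

The example shows why the largest process is not always the one killed: oom_score_adj can outweigh a sizeable difference in footprint.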

How does Kubernetes set resource limits via cgroups?

kubelet translates a pod spec into cgroup writes. resources.limits.cpu: '500m' becomes a cpu.max write of '50000 100000' (50 ms quota in a 100 ms period = 0.5 CPU). resources.limits.memory: '256Mi' becomes memory.max = 268435456. requests.cpu becomes cpu.weight (v2) or cpu.shares (v1), a relative weight that is only honored under contention. Each pod gets its own cgroup under /sys/fs/cgroup/kubepods.slice/: Burstable and BestEffort pods sit in per-class subdirectories, while Guaranteed pods sit directly under kubepods.slice, and the QoS class also shapes eviction order. Each container in a pod gets its own child cgroup inside the pod cgroup for its individual limits.
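
The arithmetic behind those translations is shown below as a sketch. It handles only a subset of Kubernetes quantity syntax (millicores or plain numbers for CPU; Ki, Mi, Gi, or a plain byte count for memory) and is not the kubelet's actual code; the function names are illustrative.

```python
PERIOD_US = 100_000  # 100 ms period, as in the example above

BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def cpu_to_millicores(quantity):
    """'500m' -> 500, '2' -> 2000, '1.5' -> 1500."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def cpu_limit_to_cpu_max(quantity, period_us=PERIOD_US):
    """Turn a CPU limit into the 'quota period' string written to cpu.max."""
    quota_us = cpu_to_millicores(quantity) * period_us // 1000
    return f"{quota_us} {period_us}"


def memory_to_bytes(quantity):
    """Turn a memory limit such as '256Mi' into a byte count for memory.max."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # already a plain byte count


print(cpu_limit_to_cpu_max("500m"), memory_to_bytes("256Mi"))
```

The same arithmetic reproduces the Docker example earlier in the article: a limit of 1.5 CPUs gives "150000 100000", and 512Mi gives 536870912.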

What is a freezer cgroup?

The freezer pauses (freezes) every task in a cgroup: the tasks keep all their state but consume no CPU until they are thawed. In v1 it was a separate controller mounted under /sys/fs/cgroup/freezer/; in v2 it is a core feature exposed via cgroup.freeze (write 1 to freeze, 0 to thaw). It is used by CRIU (checkpoint/restore in userspace) to snapshot containers consistently, by systemctl freeze and thaw to suspend service units, by Android to put background apps to sleep, and by docker pause/unpause. Unlike SIGSTOP, it applies to an entire cgroup subtree at once, is not observable by the frozen tasks or their parents, and cannot be undone by a stray SIGCONT.
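
In v2 the freeze and thaw operations are single file writes, and the cgroup.events file reports a frozen key once the freeze has taken effect. The sketch below takes the cgroup directory as a parameter and is exercised against a temporary directory, so nothing is actually frozen here; the function names and sample text are illustrative.

```python
import os
import tempfile


def set_frozen(cgroup_dir, frozen):
    """Write 1 (freeze) or 0 (thaw) to the cgroup's cgroup.freeze file."""
    with open(os.path.join(cgroup_dir, "cgroup.freeze"), "w") as f:
        f.write("1\n" if frozen else "0\n")


def is_frozen(events_text):
    """Check the 'frozen' key in the contents of cgroup.events."""
    for line in events_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "frozen":
            return value.strip() == "1"
    return False


# Dry run against a throwaway directory instead of a real cgroup.
fake_cgroup = tempfile.mkdtemp()
set_frozen(fake_cgroup, True)

# Hypothetical cgroup.events contents after the freeze completes.
SAMPLE_EVENTS = "populated 1\nfrozen 1\n"
print(is_frozen(SAMPLE_EVENTS))
```

On a real cgroup, a caller would write to cgroup.freeze and then poll cgroup.events until frozen reads 1, since freezing a large subtree is not instantaneous.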