Computer Architecture

CUDA & the SIMT Model

Thousands of threads, one instruction stream — and a warp of 32 that moves as one

CUDA's SIMT model runs thousands of GPU threads in lockstep groups of 32 called warps — one instruction stream drives every thread, so divergent branches serialize and coalesced memory access is everything.

  • Warp size (NVIDIA)32 threads
  • Threads per block≤ 1024
  • Resident warps per SMup to 64
  • Fully divergent if/else≈ 2× slower
  • Uncoalesced access penaltyup to 32×

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How SIMT actually works

You write CUDA code as if each thread were an ordinary scalar program: int i = blockIdx.x * blockDim.x + threadIdx.x; c[i] = a[i] + b[i];. Every one of your thousands of threads runs that same body with a different index. It feels like ordinary multithreading. It is not.

Under the hood the GPU does not give each thread its own instruction fetch. It bundles threads into warps of 32 and runs each warp in lockstep: the warp scheduler fetches one instruction and issues it to all 32 lanes in the same cycle. This is the SIMT model — Single Instruction, Multiple Threads — coined by NVIDIA with the Tesla architecture in 2006 and the G80 chip behind the GeForce 8800. SIMT is the bridge between the convenience of writing scalar per-thread code and the efficiency of wide SIMD hardware: you think in threads, the hardware executes in vectors.

The hierarchy you program against has four levels:

  • Thread — one scalar instance of your kernel body, with its own registers.
  • Warp — 32 threads that share an instruction stream and a scheduler. This is the unit the hardware actually schedules. You never name it in code, but it governs everything that matters for performance.
  • Block (CTA) — up to 1024 threads (so up to 32 warps) that run on a single streaming multiprocessor (SM), can share fast on-chip __shared__ memory, and can synchronize with __syncthreads().
  • Grid — all the blocks of one kernel launch. Blocks are independent and may run in any order, on any SM, concurrently or not.

When you write kernel<<<numBlocks, threadsPerBlock>>>(...), the GPU's GigaThread engine hands blocks to SMs as they free up. Each SM partitions its block into warps and time-slices among all resident warps, switching every cycle to hide the ~400–800 cycle latency of a global-memory load. That is the trick: you launch far more threads than can execute at once so the scheduler always has a ready warp to run while others wait on memory.

Warps, lockstep, and divergence

Because all 32 lanes share one program counter (conceptually — see the Volta note below), a warp can only ever be executing one instruction. So what happens at an if?

if (threadIdx.x < 16) {
    x = expensive_path_A();   // lanes 0..15 want this
} else {
    x = expensive_path_B();   // lanes 16..31 want this
}

The warp cannot run both branches simultaneously. The hardware sets an active mask — a 32-bit bitmap of which lanes are "on" — and runs path A with lanes 0–15 active and 16–31 idle, then runs path B with the mask flipped. The two halves execute serially. This is warp divergence, and it is the SIMT model's defining cost.

The arithmetic is blunt. If a warp splits evenly across an if/else and both sides cost the same, you do roughly twice the work for the result of one — about 2× slowdown for that region. A 32-way switch where every lane lands on a different case can run up to 32× slower than a uniform warp. Crucially, divergence between warps is free: if warp 0 takes the if and warp 1 takes the else, each is internally uniform and pays nothing. Only divergence within a single warp serializes. That is why aligning your branch granularity to multiples of 32 threads is a real optimization, not a micro-tweak.

Memory coalescing: the other half of the model

SIMT's second hard rule is about loads and stores. When a warp issues a global-memory access, the memory controller looks at all 32 lane addresses at once and tries to coalesce them into the fewest possible aligned 32-, 64-, or 128-byte transactions.

If lane i reads base + i*4 (32 consecutive floats, aligned), the whole warp's read collapses into a single 128-byte transaction — peak bandwidth. If instead each lane reads a scattered or large-stride address, the request can shatter into up to 32 separate transactions, each fetching a full cache line of which you use only 4 bytes. That is up to a 32× reduction in effective bandwidth. On a memory-bound kernel — and most real kernels are memory-bound — coalescing is usually the single biggest lever you have. The classic fix is the array-of-structs to struct-of-arrays transformation, so consecutive threads touch consecutive addresses.

When SIMT wins — and when it doesn't

  • Massively data-parallel work — the same operation over millions of independent elements: dense linear algebra, image and signal processing, neural-network matmuls, particle and fluid simulation, ray tracing.
  • High arithmetic intensity or high bandwidth demand — workloads that either do many FLOPs per byte (compute-bound, great fit) or stream huge contiguous arrays (bandwidth-bound, still great if coalesced).
  • Latency-tolerant pipelines — you can amortize PCIe transfer and kernel-launch overhead across enough work.

SIMT is a poor fit when control flow is wildly data-dependent (heavy per-element branching that diverges within warps), when the working set is tiny (launch and transfer overhead dominates), when memory access is irreducibly random (pointer-chasing graph traversal kills coalescing), or when you need low single-task latency rather than throughput. A branchy state machine on a handful of items belongs on the CPU.

SIMT vs SIMD vs SMT

SIMT (CUDA)SIMD (AVX/NEON)SMT (Hyper-Threading)
Programming modelScalar per-thread codeExplicit vector intrinsics / autovecIndependent scalar threads
Lane width / group32-thread warp (fixed)128–512 bit register (4–16 lanes)2 threads per core (typical)
BranchingMasked, divergent halves serializeManual masks/blend; both sides computedFully independent, no penalty
Who packs the lanesHardware groups your threadsYou (or the compiler)N/A — threads are separate
Latency hidingSwitch warps every cycle (deep)Out-of-order core, prefetchFill bubbles from the other thread
Scale per chip10,000s resident threads1 warp-equivalent per core port2–8 threads per core
Memory penaltyUncoalesced access up to 32×Unaligned / gather slowdownsCache contention between siblings
Best forThroughput on huge data-parallel setsTight inner loops on the CPUHiding CPU stalls, mixed workloads

The headline: SIMD makes you manage the vector and the masks; SIMT lets you write scalar code and has the hardware vectorize 32 threads for you — at the price of divergence and coalescing being implicit performance cliffs instead of explicit ones. Internally, a warp is a 32-wide SIMD unit. SIMT is the programming abstraction layered over SIMD execution.

What the numbers actually say

  • Occupancy is bounded by resources, not just thread count. An SM can hold up to 64 warps (2,048 threads) on recent architectures, but registers and shared memory are the real limits. If your kernel uses 64 registers per thread and an SM has a 64K-register file, you cap at 1,024 resident threads — 50% occupancy — no matter how many you launch.
  • Latency hiding needs over-subscription. A global load costs hundreds of cycles. To hide it, the scheduler needs enough other ready warps to issue in the meantime — this is why launching tens of thousands of threads for a few thousand cores is correct, not wasteful.
  • Coalescing can swing a kernel by ~32×. The same kernel reading contiguous vs. fully scattered global memory differs by the worst-case ratio of one 128-byte transaction to thirty-two of them.
  • Divergence is bounded by warp size. Worst case is one path per lane: 32× for that region. A binary if/else is at most 2×. Divergence between separate warps is free.
  • Warp size has been 32 since 2006. Every NVIDIA architecture from Tesla through Hopper and Blackwell uses a 32-thread warp. AMD wavefronts were 64 on GCN and are 32 or 64 on RDNA.

A minimal CUDA kernel

The canonical "hello, parallelism" is SAXPY — y = a*x + y over a big array. Note the index math and the bounds guard, which are the two things every kernel gets right or wrong:

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id
    if (i < n)                                        // guard: n may not divide blockDim
        y[i] = a * x[i] + y[i];                       // coalesced: lane i touches address i
}

void launch(int n, float a, float* x, float* y) {
    int threads = 256;                                // multiple of 32 (warp size)
    int blocks  = (n + threads - 1) / threads;        // ceil-divide to cover all n
    saxpy<<<blocks, threads>>>(n, a, x, y);
    cudaDeviceSynchronize();
}

Three things to notice. The block size is a multiple of 32 so no warp is partially populated. The if (i < n) guard handles the last block, where n rarely divides evenly — and because the inactive lanes are all in the tail, that divergence is cheap. And lane i touching address i is what makes the load coalesce.

A divergence-free parallel reduction

Summing an array is the textbook lesson in writing for the warp. The naive tree reduction uses if (tid % stride == 0), which leaves scattered lanes active and diverges badly. The fix is to keep the active lanes contiguous so whole warps go idle together instead of half of every warp:

__global__ void reduce(const float* in, float* out, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    s[tid]  = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the stride each step; active lanes stay contiguous (0..stride-1),
    // so once stride < 32 only one warp is active and the rest exit cleanly.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

The __syncthreads() is a block-wide barrier: every thread in the block must reach it before any moves on, which is how the shared array is safely consumed across steps. Modern code goes further and uses warp-level primitives like __shfl_down_sync() to do the final 32-element reduction entirely inside one warp's registers, skipping shared memory and the barrier altogether.

The same idea in Python (Numba)

You do not need C++ to touch SIMT. Numba's @cuda.jit compiles Python to PTX, and the thread/grid model is identical:

from numba import cuda
import numpy as np

@cuda.jit
def saxpy(n, a, x, y):
    i = cuda.grid(1)                 # blockIdx.x * blockDim.x + threadIdx.x
    if i < n:                        # same tail guard
        y[i] = a * x[i] + y[i]

n = 1 << 20
x = cuda.to_device(np.ones(n, np.float32))
y = cuda.to_device(np.zeros(n, np.float32))

threads = 256                        # multiple of 32
blocks  = (n + threads - 1) // threads
saxpy[blocks, threads](n, 2.0, x, y)
cuda.synchronize()
print(y.copy_to_host()[:5])          # -> [2. 2. 2. 2. 2.]

The warp, divergence, and coalescing rules are exactly the same underneath — Numba just hides the launch boilerplate. The same kernel body, the same index math, the same 32-lane lockstep.

Variants and the wider landscape

Independent Thread Scheduling (Volta, 2017). Before Volta, all 32 lanes shared a single program counter and a hand-rolled lock inside a warp could deadlock. Since Volta each thread carries its own PC and call stack, so divergent threads make independent forward progress and can even synchronize mid-divergence. It fixes correctness footguns; it does not remove the cost of divergence, because the warp still issues one instruction per cycle. With it came __syncwarp() and the _sync family of warp intrinsics, which require an explicit lane mask.

Tensor Cores. Since Volta, SMs include matrix-multiply-accumulate units that operate on small tiles (e.g. 16×16) per warp in one operation — the engine behind modern deep-learning throughput. They sit alongside the SIMT CUDA cores, not instead of them.

Cooperative Groups. A CUDA API that lets you name and synchronize thread groups explicitly — sub-warp tiles, whole blocks, even grid-wide — rather than relying on the implicit warp and the blunt __syncthreads().

AMD ROCm/HIP and SYCL. HIP mirrors CUDA almost line-for-line; the warp becomes a "wavefront" of 32 or 64. SYCL and OpenCL express the same hierarchy with vendor-neutral names (work-item, sub-group, work-group, NDRange). Same SIMT execution model, different spelling.

Common bugs and edge cases

  • Calling __syncthreads() inside divergent control flow. If only some threads in a block reach the barrier, the others wait forever — undefined behavior or hang. The barrier must be reached by all threads in the block, unconditionally.
  • Forgetting the if (i < n) tail guard. The last block almost always launches more threads than there are elements; without the guard those lanes read or write out of bounds.
  • Block size not a multiple of 32. A block of 100 threads still allocates 4 warps (128 lanes); the 28 phantom lanes waste scheduling slots every cycle.
  • Strided / array-of-structs access. Storing struct {float x,y,z;} and having thread i read element i de-coalesces every load. Switch to struct-of-arrays so consecutive lanes hit consecutive addresses.
  • Assuming implicit warp synchronization (pre-Volta habit). Relying on lockstep without __syncwarp() or the _sync intrinsics is a data race on Volta and later. Always pass an explicit mask.
  • Shared-memory bank conflicts. Shared memory has 32 banks; if multiple lanes of a warp hit the same bank with different addresses, those accesses serialize — a quieter cousin of uncoalesced global access.
  • Treating divergence between warps as a problem. It isn't. Only divergence within a warp serializes; aligning data so a whole warp takes the same branch is the cure, not eliminating branches everywhere.

Frequently asked questions

What is the difference between SIMT and SIMD?

SIMD applies one instruction to a fixed-width vector register — the programmer (or compiler) explicitly packs N lanes and any branching is handled with masks by hand. SIMT exposes each lane as an independent scalar thread with its own registers and program counter view, and the hardware groups 32 of them into a warp that issues one instruction at a time. SIMT is SIMD execution with a scalar programming model bolted on top — you write per-thread code, the hardware vectorizes it.

Why is a CUDA warp 32 threads?

32 is the SIMT width NVIDIA has shipped on every architecture since Tesla (2006). The warp scheduler issues one instruction for all 32 lanes per cycle, so 32 is the granularity of both scheduling and divergence. AMD's equivalent (a wavefront) was 64 on GCN and is now configurable to 32 or 64 on RDNA; Intel's SIMD width varies. There is nothing magic about 32 — it is a fixed hardware constant you must design around.

What is warp divergence and why does it cost performance?

When threads in a warp take different sides of a branch, the warp cannot run both paths at once — it executes the if-block with the else-threads masked off, then the else-block with the if-threads masked off. The two halves run serially, so a fully divergent if/else inside a warp can roughly halve throughput. A 32-way switch with every thread taking a different case is up to 32× slower. Divergence between warps is free; only divergence within a warp serializes.

What is memory coalescing in CUDA?

When the 32 threads of a warp access global memory, the hardware tries to merge their requests into as few 32-, 64-, or 128-byte cache-line transactions as possible. If thread i reads address base + i (a contiguous, aligned run), one warp load becomes a single 128-byte transaction. If the threads read scattered or strided addresses, it can fragment into up to 32 separate transactions, cutting effective bandwidth by up to 32×. Coalescing is usually the single biggest lever on a memory-bound kernel.

How many threads can a GPU actually run at once?

You launch far more threads than execute simultaneously. A modern data-center GPU has on the order of 100+ streaming multiprocessors (SMs), each holding up to 64 resident warps (2,048 threads), so tens of thousands of threads are resident and the scheduler hides memory latency by switching warps every cycle. The number physically executing in a given cycle is the number of warp schedulers times their issue width — far smaller. You over-subscribe on purpose: latency hiding, not raw concurrency, is the point.

Does the independent thread scheduling in Volta+ remove warp divergence?

No. Since Volta (2017) each thread has its own program counter, so divergent threads can make forward progress and even synchronize within a divergent region — which fixes deadlocks that plagued pre-Volta lock code. But the warp still issues only one instruction per cycle, so divergent paths still execute serially. Independent thread scheduling improves correctness and flexibility, not the raw cost of divergence.