Computer Architecture

SIMD

One instruction, many lanes — 4× to 16× speedup for data-parallel loops

SIMD packs many values into one register and applies one instruction to all of them, for a nominal 4× to 16× speedup on data-parallel loops. Covers SSE, AVX, AVX-512, NEON, and SVE.

  • Stands for: Single Instruction, Multiple Data
  • AVX2 width: 256 bits → 8 floats
  • AVX-512 width: 512 bits → 16 floats
  • AVX-512 FMA throughput: 32 flops/cycle/core
  • NEON width: 128 bits → 4 floats
  • SVE width: 128–2048 bits (length-agnostic)

How SIMD works

A scalar add instruction takes two source registers and writes one result. A SIMD add takes two wide registers, each holding several values laid end to end, and produces several independent results, one per lane. The trick is that the back-end ALU is built wide enough to compute every lane in parallel, so the wide add typically has the same latency and throughput as a scalar add instead of costing one step per lane.
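The lane-by-lane behaviour can be sketched in plain C. This is only a model of the semantics, not real SIMD code: the vec8f type and the function names are hypothetical, and the loop stands in for what the hardware does across all lanes at once.

```c
#include <stddef.h>

#define LANES 8  /* assumption: 8 float lanes, as in a 256-bit register */

/* Hypothetical model of a wide register: LANES floats laid end to end. */
typedef struct { float lane[LANES]; } vec8f;

/* Scalar add: one pair of inputs, one result. */
static float add_scalar(float x, float y) {
    return x + y;
}

/* Model of a SIMD add: lane i of the result depends only on lane i of
   the inputs. Hardware computes all lanes together; this loop only
   models the result, not the timing. */
static vec8f add_simd_model(vec8f x, vec8f y) {
    vec8f r;
    for (size_t i = 0; i < LANES; i++) {
        r.lane[i] = x.lane[i] + y.lane[i];
    }
    return r;
}
```

Because no lane reads another lane's input, the eight additions have no ordering constraints, which is what lets the hardware run them side by side.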

The register width is the value width times the lane count. A 256-bit AVX2 register (ymm; typed in intrinsics as __m256 for floats, __m256d for doubles, and __m256i for integers) can hold 8 single-precision floats, 4 doubles, 32 bytes, 16 shorts, 8 32-bit ints, or 4 64-bit ints. A 512-bit AVX-512 register (__m512 and its d/i variants) doubles every count: 16 floats, 8 doubles, 64 bytes. NEON's float32x4_t is 128 bits, which is 4 floats. SVE registers are vector-length-agnostic: the same binary works on hardware with any vector length from 128 to 2048 bits, in 128-bit steps.
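The lane counts quoted above all follow from one division. A minimal sketch, where lanes_per_register is a hypothetical helper name and the element sizes assume the usual 4-byte float and 8-byte double:

```c
#include <stddef.h>
#include <stdint.h>

/* Lane count = register width in bits / element width in bits. */
static size_t lanes_per_register(size_t register_bits, size_t element_bytes) {
    return register_bits / (element_bytes * 8);
}
```

For example, lanes_per_register(256, sizeof(float)) gives the 8 floats of an AVX2 register, and lanes_per_register(512, sizeof(int8_t)) gives the 64 bytes of an AVX-512 register.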

Each ISA exposes a parallel family of instructions for each lane width and data type: paddd (packed add of 32-bit doublewords), addps (packed single-precision add), vaddps zmm0{k1}, zmm1, zmm2 (AVX-512 add under mask register k1), vfmadd231ps (fused multiply-add), vpgatherdd (gather from arbitrary indices), vpermps (permute lanes). Modern x86 has thousands of these once every width, type, and encoding is counted.
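As a rough illustration of what an AVX-512 mask does, the plain-C model below emulates a merge-masked 16-lane add, where a lane is written only if its mask bit is set. The names masked_add_model and tail_mask are hypothetical and exist only for this sketch; they are not intrinsics.

```c
#include <stdint.h>

/* Model of a merge-masked 16-lane add: lane i of dst is overwritten
   with a[i] + b[i] only when bit i of mask is set; other lanes keep
   their old value. */
static void masked_add_model(float dst[16], uint16_t mask,
                             const float a[16], const float b[16]) {
    for (int i = 0; i < 16; i++) {
        if (mask & (1u << i)) {
            dst[i] = a[i] + b[i];
        }
    }
}

/* Mask with the low `rem` bits set, for 0 <= rem <= 16. This is the
   usual way to select the leftover elements at the end of an array. */
static uint16_t tail_mask(int rem) {
    return (uint16_t)((1u << rem) - 1u);
}
```

With mask 0x0005 only lanes 0 and 2 are updated. The same bit pattern idea lets a final loop iteration process, say, 3 leftover elements with tail_mask(3) instead of falling back to scalar code.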

SSE vs AVX vs AVX-512 vs NEON vs SVE

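ISA | Width | Floats per reg | Peak FP32 throughput | Introduced | Where
SSE | 128 bit | 4 | 8 flops/cycle | 1999 (Pentium III) | All x86 since 2003
AVX | 256 bit | 8 | 16 flops/cycle | 2011 (Sandy Bridge) | x86 client + server
AVX2 | 256 bit + int | 8 | 16 flops/cycle | 2013 (Haswell) | x86 client + server
AVX-512 | 512 bit | 16 | 32 flops/cycle | 2016 (Knights Landing) | Server + select client
NEON | 128 bit | 4 | 8 flops/cycle | 2005 (Cortex-A8) | Every ARM Cortex-A
SVE / SVE2 | 128–2048 bit | 4–64 | scales with width | 2016 spec; A64FX 2019 | Fugaku, Graviton 3+, Neoverse V1+

Throughput assumes one multiply and one add per lane per cycle. SSE and the original AVX have no fused multiply-add instruction (on Intel, FMA arrived with Haswell alongside AVX2), so their figures count a separate packed multiply and packed add issued in the same cycle.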

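An AVX-512 core issuing one 16-wide fused multiply-add per cycle at 3 GHz hits 32 flops × 3e9 = 96 gigaflops. A 64-core server socket sustaining that rate would therefore top 6 teraflops in FP32 (64 × 96 ≈ 6.1), all from one vector instruction per core per cycle. This is a back-of-envelope figure: cores with two 512-bit FMA pipes double the per-cycle number, while sustained clocks under heavy AVX-512 load are usually below 3 GHz.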
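The peak-throughput arithmetic above can be written as a small formula. In this sketch, peak_gflops_per_core is a hypothetical helper, and the number of FMA pipes is an explicit parameter because it varies between cores.

```c
/* Peak FP32 GFLOPS per core:
   lanes * 2 flops per FMA (multiply + add) * FMA pipes * clock in GHz. */
static double peak_gflops_per_core(int lanes, int fma_pipes, double clock_ghz) {
    return (double)lanes * 2.0 * (double)fma_pipes * clock_ghz;
}
```

With 16 lanes, one pipe, and 3.0 GHz this returns 96.0, and multiplying by 64 cores gives 6144 GFLOPS, matching the roughly 6 teraflops quoted in the text.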

When SIMD pays off

  • Tight numeric loops over contiguous arrays. Dot products, matrix multiply, convolutions, image filters, RGB-to-YUV color conversion, audio resampling. These are the canonical wins.
  • Bitwise reductions over packed data. Population count, byte-level search (memchr, simdjson's parser), checksum (CRC32 instruction), regex character classes, base64 encode/decode.
  • Vector math libraries. sin, cos, exp, log over arrays — written once in intrinsics, called from every numerical kernel.
  • Bulk compare and shuffle. String search using pcmpestri, JSON tokenization with byte-class lookup tables, sort networks for small N.

And when not: pointer-chasing structures (linked lists, trees), branch-heavy code, loops with strong cross-iteration dependencies, and anything already bottlenecked on cache misses. SIMD multiplies the arithmetic done per cycle but does nothing about the memory wall.
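To make the distinction concrete, here are three scalar loops, sketched under the assumption that they represent the categories above. All function and type names are illustrative. The first has independent iterations and is a natural SIMD candidate; the other two are not.

```c
#include <stddef.h>

/* Independent iterations: out[i] depends only on inputs at index i,
   so several iterations can be computed in one vector operation. */
static void scale_add(float *out, const float *x, const float *y,
                      float k, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = k * x[i] + y[i];
    }
}

/* Cross-iteration dependency: a[i] needs the a[i - 1] produced by the
   previous iteration, so iterations cannot simply run side by side. */
static void prefix_sum(float *a, size_t n) {
    for (size_t i = 1; i < n; i++) {
        a[i] += a[i - 1];
    }
}

/* Pointer chasing: the address of the next element is unknown until
   the current node has been loaded. */
struct node { float value; struct node *next; };

static float list_sum(const struct node *head) {
    float total = 0.0f;
    for (const struct node *p = head; p != NULL; p = p->next) {
        total += p->value;
    }
    return total;
}
```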

Writing SIMD code

// AVX2: sum n floats with 8-wide vectors.
#include <immintrin.h>

float sum_avx2(const float *a, int n) {
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(a + i);
        acc = _mm256_add_ps(acc, v);
    }
    // Horizontal sum of 8 lanes.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float total = _mm_cvtss_f32(s);
    // Tail loop for the leftover <8 elements.
    for (; i < n; i++) total += a[i];
    return total;
}

// AVX-512 with mask — handles the tail in one shot.
float sum_avx512(const float *a, int n) {
    __m512 acc = _mm512_setzero_ps();
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        acc = _mm512_add_ps(acc, _mm512_loadu_ps(a + i));
    }
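    int rem = n - i;                              // 0..15 leftover elements
    __mmask16 m = (__mmask16)((1U << rem) - 1U);  // low `rem` bits set
    // Masked-off lanes are not loaded (no fault past the end of the array)
    // and leave the corresponding lanes of acc unchanged.
    acc = _mm512_mask_add_ps(acc, m, acc, _mm512_maskz_loadu_ps(m, a + i));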
    return _mm512_reduce_add_ps(acc);
}

// NEON — same idea on ARM.
#include <arm_neon.h>

float sum_neon(const float *a, int n) {
    float32x4_t acc = vdupq_n_f32(0);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = vaddq_f32(acc, vld1q_f32(a + i));
    }
    float total = vaddvq_f32(acc);  // horizontal add across lanes; AArch64 only
    for (; i < n; i++) total += a[i];
    return total;
}
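One way to check kernels such as sum_avx2, sum_avx512, and sum_neon above is to compare them against a scalar reference with a tolerance, since the vector versions add the elements in a different order and will not match bit for bit. The helpers below are a sketch; sum_reference and sums_agree are hypothetical names, and the tolerance is something you would choose for your data.

```c
#include <stddef.h>

/* Scalar reference: accumulate in double to keep rounding error small. */
static double sum_reference(const float *a, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += (double)a[i];
    }
    return total;
}

/* Returns 1 if candidate is within rel_tol of reference, measured
   relative to max(1, |reference|) so sums near zero still compare
   sensibly. */
static int sums_agree(double reference, float candidate, double rel_tol) {
    double diff = reference - (double)candidate;
    if (diff < 0.0) diff = -diff;
    double scale = reference < 0.0 ? -reference : reference;
    if (scale < 1.0) scale = 1.0;
    return diff <= rel_tol * scale;
}
```

A typical check would be sums_agree(sum_reference(a, n), sum_avx2(a, n), 1e-5), with the tolerance loosened for long arrays or values of mixed magnitude.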

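Auto-vectorization can sometimes generate code that's competitive with hand-written intrinsics. Compile with -O3 -march=native on GCC and add -fopt-info-vec to list the loops that were vectorized, or -fopt-info-vec-missed to see why others were not. Common blockers: possible aliasing (mark pointers restrict), function calls inside the loop, index arithmetic the compiler cannot prove free of overflow, and floating-point reductions, whose reordering is only allowed under relaxed floating-point flags such as -ffast-math. Unaligned pointers rarely prevent vectorization on current x86, but they can add peeling code or unaligned-access overhead.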
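A loop in the shape compilers vectorize most readily looks like the sketch below: a counted loop, contiguous access, no calls, and restrict on every pointer. The function name add_arrays is illustrative, and whether a given compiler version actually vectorizes it should be confirmed with the report flags mentioned above.

```c
#include <stddef.h>

/* restrict promises the compiler that out, b, and c never overlap,
   which removes the aliasing question. The caller must uphold that
   promise. Each iteration is independent, so the loop can be
   processed several elements at a time. */
void add_arrays(float *restrict out, const float *restrict b,
                const float *restrict c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = b[i] + c[i];
    }
}
```

Compiling this file with something like gcc -O3 -march=native -fopt-info-vec -c should print a note for the loop if it was vectorized.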

Performance numbers

  • AVX-512 fused multiply-add: one 16-wide single-precision FMA issued per cycle per FMA pipe (that is throughput; latency is about 4 cycles), which gives 32 flops per cycle per core with one pipe.
  • A simple a[i] = b[i] + c[i] loop: scalar ~1 ns/element, AVX2 ~0.14 ns/element (7× speedup), AVX-512 ~0.08 ns/element (12× speedup). Memory bandwidth caps further gains.
  • simdjson parses JSON at ~3 GB/s on a single core using SSE4.2 plus AVX2 — 5× the next-fastest non-SIMD parser.
  • memchr with AVX2: ~64 bytes per cycle compared to ~1 byte per cycle scalar. ffmpeg, glibc, and the Go runtime all ship hand-tuned AVX2/AVX-512 memchr.
  • AVX-512 frequency throttle on Skylake-SP and Skylake-X: 100–300 MHz drop on heavy 512-bit FP; largely eliminated on Ice Lake-SP and AMD Zen 4.
  • NEON dot product on Apple M1: ~24 GFLOPS per core sustained; ~1.5 ns to multiply-add 16 floats.
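The remark that memory bandwidth caps further gains can be turned into a quick estimate. For a[i] = b[i] + c[i] on floats, every element moves at least 12 bytes (two loads and one store), so the time per element cannot drop below 12 bytes divided by the sustained bandwidth. The sketch below encodes that bound; min_ns_per_element is a hypothetical helper, and the bandwidth you pass in is an assumption about your machine, not a measured figure from this page.

```c
/* Bytes moved per element of a[i] = b[i] + c[i] with 4-byte floats:
   two loads plus one store. Extra traffic such as write-allocate is
   ignored, so this is an optimistic lower bound. */
enum { BYTES_PER_ELEMENT = 3 * sizeof(float) };

/* Lower bound on ns per element when memory bandwidth is the limit.
   1 GB/s equals 1 byte per nanosecond, so the units cancel directly. */
static double min_ns_per_element(double bandwidth_gb_per_s) {
    return (double)BYTES_PER_ELEMENT / bandwidth_gb_per_s;
}
```

With an assumed 100 GB/s of sustained bandwidth the floor is 0.12 ns per element, no matter how wide the vectors are, once the arrays no longer fit in cache.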

Common pitfalls

  • Unaligned loads. Older SSE required 16-byte alignment for movaps; movups handles unaligned at a small cost. AVX is forgiving but still benefits from alignment. Use _mm_malloc or posix_memalign to force 32 or 64-byte alignment.
  • Tail handling. Most loop trip counts don't divide cleanly by the lane count. The options are a scalar epilogue, a mask-loaded final iteration (AVX-512 / SVE), or peeling iterations off the front.
  • AVX-SSE transition penalty. Mixing legacy SSE (addps xmm0) and AVX (vaddps ymm0) without vzeroupper causes a state-save penalty of ~70 cycles. Compilers handle this; hand-asm doesn't always.
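  • Aliasing assumptions. No compiler can vectorize for (i) a[i] = b[i] + a[i-1] as written, because each iteration reads the previous iteration's result: a real loop-carried dependency. The compiler must also assume a dependency whenever two pointers might overlap, so use restrict on pointers that genuinely don't alias.
  • Reduction order matters for floats. Different SIMD widths produce different rounding because floating-point addition is non-associative and each width groups the partial sums differently. Pick a fixed reduction order if reproducibility matters.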
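The reduction-order pitfall is easy to reproduce without any intrinsics. The sketch below sums the same array twice: once left to right, and once the way a 2-lane SIMD loop would, with each lane accumulating every second element. Both function names are hypothetical, and the example assumes float arithmetic is evaluated in IEEE 754 single precision with no fast-math style reordering.

```c
#include <stddef.h>

/* Left-to-right scalar order. */
static float sum_sequential(const float *a, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += a[i];
    }
    return total;
}

/* Models a 2-lane SIMD reduction: lane 0 accumulates a[0], a[2], ...
   and lane 1 accumulates a[1], a[3], ...; the lanes are combined once
   at the end. Assumes n is even to keep the sketch short. */
static float sum_two_lanes(const float *a, size_t n) {
    float lane0 = 0.0f, lane1 = 0.0f;
    for (size_t i = 0; i + 2 <= n; i += 2) {
        lane0 += a[i];
        lane1 += a[i + 1];
    }
    return lane0 + lane1;
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f}, the sequential version loses the first 1.0f when it is added to 1e8f and returns 1, while the two-lane version cancels the large values inside lane 0 and returns 2. Neither is wrong; they are different roundings of the same mathematical sum.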

Frequently asked questions

What does SIMD stand for?

Single Instruction, Multiple Data. One instruction operates on a wide register holding many independent values — 4, 8, or 16 floats, or larger counts of smaller integers.

What's the difference between SSE, AVX, AVX-512, NEON, and SVE?

SSE: 128-bit x86 SIMD (1999). AVX: 256-bit (2011). AVX-512: 512-bit + mask registers (2016). NEON: ARM's 128-bit SIMD on every Cortex-A. SVE: vector-length-agnostic ARM ISA — same binary on 128 to 2048-bit registers.

How much faster is SIMD than scalar code?

Headline: 4× / 8× / 16× by lane count. Real-world: 3–10× because of load/store and tail overhead. Hand-tuned libraries such as BLIS, MKL, and FFTW routinely show 6–10× over plain scalar code.

What's auto-vectorization?

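The compiler turning scalar loops into SIMD automatically. It is on at -O3 in GCC (and at -O2 since GCC 12, with a conservative cost model) and at -O2 in Clang, and it works best on simple counted loops with no aliasing, no early exit, and contiguous access. Add #pragma omp simd (compiled with -fopenmp or -fopenmp-simd) for stubborn cases.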
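A floating-point reduction is a typical stubborn case, because the compiler may not reorder the additions on its own. The sketch below uses the OpenMP simd directive with a reduction clause to grant that permission for one loop; the function name dot is illustrative. Without an OpenMP flag the pragma is simply ignored and the loop remains a correct scalar loop.

```c
#include <stddef.h>

/* Dot product. The reduction clause tells the compiler it may keep
   per-lane partial sums of `total` and combine them at the end, which
   changes the rounding order but allows vectorization.
   Build with -fopenmp-simd (GCC or Clang) to honour the pragma
   without pulling in the OpenMP runtime. */
float dot(const float *restrict x, const float *restrict y, size_t n) {
    float total = 0.0f;
    #pragma omp simd reduction(+:total)
    for (size_t i = 0; i < n; i++) {
        total += x[i] * y[i];
    }
    return total;
}
```

Because the pragma changes the summation order, results can differ in the last bits from the plain scalar loop, as with any SIMD reduction.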

Why does AVX-512 throttle the clock?

Heavy 512-bit FP on Skylake-SP and Skylake-X dropped the clock by 100–300 MHz to stay within the power budget. Later silicon (Ice Lake-SP, Zen 4) largely closed the gap. Profile before assuming AVX-512 wins.

How do I write SIMD code today?

Auto-vectorization first. Portable SIMD abstractions (std::simd, Highway, xsimd, ISPC) second. Architecture-specific intrinsics (_mm256_, vaddq_, svadd_) third. Hand-written assembly is the last resort.

When is SIMD a bad fit?

Branch-heavy code, pointer chasing, cross-iteration dependencies, tiny vectors under 16 elements, anything already cache-miss bound. SIMD widens compute, not memory bandwidth.