Question 1

What does SIMD stand for?

Accepted Answer

Single Instruction, Multiple Data. One instruction operates on a wide register that holds many independent values: typically 4, 8, or 16 floats, or larger counts of smaller integers. It is one of the four classes in Flynn's 1966 taxonomy (SISD, SIMD, MISD, MIMD). In mainstream hardware it appears as the vector units inside each CPU core, alongside the MIMD parallelism of multiple cores, and GPUs are built around a closely related execution model.
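As a rough illustration of "one instruction, many values", the sketch below adds two groups of four floats. The function name add4 is made up for this example. It assumes the compiler predefines __SSE2__ or _M_X64 on x86 and __ARM_NEON on ARM; on any other target it falls back to four scalar additions.

```cpp
#include <array>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   // also pulls in the SSE float intrinsics
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: adds two groups of four floats.
// On x86 and ARM the four sums come from a single vector add;
// elsewhere a scalar loop produces the same result lane by lane.
std::array<float, 4> add4(const std::array<float, 4>& a,
                          const std::array<float, 4>& b) {
    std::array<float, 4> out{};
#if defined(__SSE2__) || defined(_M_X64)
    __m128 va = _mm_loadu_ps(a.data());
    __m128 vb = _mm_loadu_ps(b.data());
    _mm_storeu_ps(out.data(), _mm_add_ps(va, vb));    // one vector add, four sums
#elif defined(__ARM_NEON)
    float32x4_t va = vld1q_f32(a.data());
    float32x4_t vb = vld1q_f32(b.data());
    vst1q_f32(out.data(), vaddq_f32(va, vb));         // one vector add, four sums
#else
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i]; // four scalar adds
#endif
    return out;
}
```

Calling add4 on {1, 2, 3, 4} and {10, 20, 30, 40} yields {11, 22, 33, 44} on every path; only the number of instructions issued differs.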

Question 2

What's the difference between SSE, AVX, AVX-512, NEON, and SVE?

Accepted Answer

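SSE is x86's 128-bit SIMD (4 floats), introduced with the Pentium III in 1999. AVX (2011, Sandy Bridge) widened floating-point vectors to 256 bits (8 floats); AVX2 (2013, Haswell) extended the integer operations to 256 bits and added gathers, with FMA arriving in the same generation. AVX-512 (2016, Knights Landing) doubled the width to 512 bits (16 floats) and added mask registers. NEON is ARM's 128-bit SIMD, introduced with the Cortex-A8 in 2005; it was optional on some ARMv7-A designs and is present on essentially every AArch64 core. SVE is ARM's vector-length-agnostic ISA: the same binary runs on hardware with registers anywhere from 128 to 2048 bits wide. SVE2 extends SVE with the integer and fixed-point operations needed to cover the workloads NEON handles.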
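One practical way to see which of these extensions a build targets is to check the compiler's predefined macros. The sketch below does that. The struct SimdTarget and the function compiled_simd_target are names invented for this example. The macros are the ones GCC, Clang, MSVC, and the Arm C Language Extensions define. Note that this reports what the compiler was told to target, not what the CPU running the binary actually supports.

```cpp
#include <string>

// Hypothetical result type for this sketch.
struct SimdTarget {
    std::string name;   // widest SIMD extension enabled at compile time
    int float_lanes;    // 32-bit floats per register; 0 = decided at run time
};

// Reports the widest SIMD extension enabled by the compiler flags,
// checked from widest to narrowest.
SimdTarget compiled_simd_target() {
#if defined(__AVX512F__)
    return {"AVX-512", 16};
#elif defined(__AVX2__)
    return {"AVX2", 8};
#elif defined(__AVX__)
    return {"AVX", 8};
#elif defined(__SSE2__) || defined(_M_X64)
    return {"SSE2", 4};
#elif defined(__ARM_FEATURE_SVE)
    return {"SVE", 0};      // vector length is a property of the hardware
#elif defined(__ARM_NEON)
    return {"NEON", 4};
#else
    return {"scalar", 1};
#endif
}
```

A default x86-64 build would typically report SSE2 here; adding a flag such as -mavx2 (GCC/Clang) or /arch:AVX2 (MSVC) changes the answer.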

Question 3

How much faster is SIMD than scalar code?

Accepted Answer

The headline number is the lane count: 4× for 128-bit SSE/NEON over scalar single precision, 8× for AVX, 16× for AVX-512. Real-world speedups are smaller because of load and store overhead, alignment, tail handling, and the fact that the scalar baseline already benefits from the same superscalar, out-of-order back end. Hand-tuned numerical libraries such as Intel MKL, BLIS, and FFTW routinely show 6-10× over straightforward scalar loops, although part of that gap comes from cache blocking and other tuning rather than from SIMD alone.
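The arithmetic behind the headline number, and one reason measured speedups fall short of it, can be sketched as follows. The Amdahl-style model is an assumption of this example rather than something the answer claims: it supposes that only a fraction of the scalar run time vectorizes perfectly and the rest stays scalar. Both function names are hypothetical.

```cpp
// Lanes per register for a given register width and element size, in bits.
constexpr int lanes(int register_bits, int element_bits) {
    return register_bits / element_bits;
}

// Amdahl-style estimate (an assumption of this sketch, not a measurement):
// `vector_fraction` of the scalar run time speeds up by `lane_count`,
// the remainder keeps running at scalar speed.
constexpr double estimated_speedup(double vector_fraction, int lane_count) {
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / lane_count);
}
```

With 32-bit floats, lanes(128, 32), lanes(256, 32), and lanes(512, 32) give 4, 8, and 16. If 90% of the run time vectorizes across 8 lanes, estimated_speedup(0.9, 8) comes out at about 4.7×, well short of 8×.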

Question 4

What's auto-vectorization?

Accepted Answer

Auto-vectorization is the compiler turning scalar loops into SIMD instructions automatically. Clang does it at -O2 and above, GCC at -O3 (and, since GCC 12, at -O2 with a conservative cost model), and MSVC at /O2. It works best on simple counted loops with no loop-carried dependencies, no early exit, no opaque function calls, and contiguous access; alignment helps but is rarely required on current hardware. Add #pragma omp simd (enabled with -fopenmp or -fopenmp-simd) or #pragma GCC ivdep when you know the loop is safe but the compiler cannot prove it. For complex code, drop to intrinsics or ISPC.
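A loop shaped the way auto-vectorizers like might look like the sketch below: a counted loop over contiguous arrays with no aliasing. The function name saxpy is chosen for this example. __restrict is a compiler extension accepted by GCC, Clang, and MSVC. The pragma only takes effect when OpenMP SIMD support is enabled and is otherwise ignored, possibly with a warning.

```cpp
#include <cstddef>

// y[i] = a * x[i] + y[i] for i in [0, n).
// __restrict promises the compiler that x and y do not overlap,
// which removes the main obstacle to vectorizing this loop.
void saxpy(std::size_t n, float a,
           const float* __restrict x, float* __restrict y) {
    // Honored only when OpenMP SIMD is enabled (for example
    // -fopenmp-simd on GCC or Clang); ignored otherwise.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

To check whether the loop was actually vectorized, ask the compiler for a report: -fopt-info-vec on GCC, -Rpass=loop-vectorize on Clang, or /Qvec-report:1 on MSVC.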

Question 5

Why does AVX-512 throttle the clock?

Accepted Answer

Heavy 512-bit floating-point code on early Skylake-SP and Skylake-X parts drew enough current that the core had to drop to a lower frequency level, the so-called AVX-512 frequency. The drop was often a few hundred MHz and larger on some SKUs, depending on the part and the number of active cores. Light 512-bit code (mostly data movement) triggered a smaller reduction or none at all. Later silicon (Ice Lake-SP, Sapphire Rapids, AMD Zen 4) narrowed the gap or eliminated the throttle. The lesson: profile, and do not assume AVX-512 is always faster than AVX2 for a given hot loop.
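Profiling here can start with something as small as the timing helper sketched below, used to compare an AVX2 build and an AVX-512 build of the same hot loop on the target machine. The name average_ns is invented for this example, and the helper is deliberately minimal: it measures wall-clock time only and does nothing to stop the compiler from optimizing away a kernel whose result is unused.

```cpp
#include <chrono>

// Hypothetical helper: runs `kernel` `reps` times and returns the
// average wall-clock time per call in nanoseconds.
template <typename Kernel>
double average_ns(Kernel&& kernel, int reps) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < reps; ++i) {
        kernel();
    }
    const auto stop = clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / reps;
}
```

Because frequency changes take effect over time, short runs can mislead; measuring the loop inside the real workload, for long enough that the clock settles, is the safer comparison.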

Question 6

How do I write SIMD code today?

Accepted Answer

Three layers, from highest level to lowest. (1) Auto-vectorization with -O3 plus restrict, alignment hints, and contiguous access. (2) Portable abstractions: std::simd (C++26), std::experimental::simd, Highway, xsimd, or the ISPC language. (3) Architecture-specific intrinsics: _mm256_add_ps for AVX, vaddq_f32 for NEON, svadd_f32_z for SVE. Below all three sits the last resort, hand-written assembly, which still beats intrinsics on a few kernels in OpenSSL's libcrypto and FFmpeg.
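For the third layer, a minimal sketch of an intrinsics loop is shown below. It uses _mm256_add_ps when the build targets AVX (for example with -mavx), vaddq_f32 when it targets NEON, and a scalar loop for the tail or for any other target. The function name add_arrays is made up for this example, and unaligned loads are used so the caller does not have to guarantee alignment.

```cpp
#include <cstddef>

#if defined(__AVX__)
#include <immintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    // Main loop: 8 floats per iteration in 256-bit registers.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#elif defined(__ARM_NEON)
    // Main loop: 4 floats per iteration in 128-bit registers.
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    // Scalar tail; also the whole loop when neither path is compiled in.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The duplicated main loop is the cost of this layer: each ISA needs its own code path, which is what the portable libraries in layer (2) are meant to hide.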

Question 7

When is SIMD a bad fit?

Accepted Answer

Anything branch-heavy, dependent on previous iterations, irregularly addressed (pointer chasing), or working on mostly cold data. Algorithms with non-trivial control flow inside the loop need masked operations (AVX-512 mask registers, ARM SVE predicates), which add complexity. Loops that run only a handful of iterations amortize SIMD setup badly: on tiny vectors (under ~16 elements), the overhead of load/store and tail handling can erase the gain.
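For simple per-element conditions, the control flow can often be rewritten as a select, which is the scalar counterpart of a masked or blended vector operation. The sketch below is one such rewrite. The function name scale_negatives is hypothetical, and whether a given compiler actually turns the loop into a compare plus blend depends on the compiler and its flags.

```cpp
#include <cstddef>

// Negative inputs are multiplied by `slope`; all others pass through.
// Both candidate results exist for every element, so the choice between
// them is a data select rather than a jump.
void scale_negatives(const float* x, float* y, std::size_t n, float slope) {
    for (std::size_t i = 0; i < n; ++i) {
        const float scaled = x[i] * slope;      // computed for every element
        y[i] = (x[i] < 0.0f) ? scaled : x[i];   // select between two values
    }
}
```

This only works when evaluating both sides is cheap and free of side effects; conditions that guard a memory access or a function call generally need real masking.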

ISA	Width	Floats per register	Peak single-precision flops/cycle/core (typical)	Introduced	Where
SSE	128 bit	4	8 (separate add + multiply)	1999 (Pentium III)	All x86-64 CPUs (2003 onward)
AVX	256 bit	8	16 (separate add + multiply)	2011 (Sandy Bridge)	x86 client + server
AVX2 + FMA	256 bit + int	8	32 (two FMA units)	2013 (Haswell)	x86 client + server
AVX-512	512 bit	16	32–64 (one or two FMA units)	2016 (Knights Landing)	Server + select client
NEON	128 bit	4	8 per FMA pipe	2005 (Cortex-A8)	Nearly every ARM Cortex-A
SVE / SVE2	128–2048 bit	4–64	scales with width	2016 spec; A64FX 2019	Fugaku, Graviton 3+, Neoverse V1+
