
SIMD

Before GPUs dominated machine learning, CPUs ran all the math. What made CPUs fast enough for linear algebra was not clock speed -- it was SIMD (Single Instruction, Multiple Data), the ability to process multiple data elements with a single instruction. Understanding SIMD explains why aligned memory is faster, why contiguous arrays outperform random access, why quantization to INT8 speeds up CPU inference, and how CPU BLAS libraries achieve near-peak throughput. SIMD is also the CPU analog of GPU SIMT -- the same fundamental idea (data parallelism) expressed at a different scale.

What Is SIMD?

SIMD extends the CPU with wide vector registers that hold multiple data elements. A single instruction operates on all elements simultaneously:

| ISA | Year | Register Width | Registers | Floats/Op | Doubles/Op | Key Addition |
|-----|------|----------------|-----------|-----------|------------|--------------|
| SSE | 1999 | 128-bit (XMM) | XMM0-XMM15 | 4 | — | Floating-point SIMD |
| SSE2 | 2001 | 128-bit | XMM0-XMM15 | 4 | 2 | Double precision, integer SIMD |
| AVX | 2011 | 256-bit (YMM) | YMM0-YMM15 | 8 | 4 | 3-operand form (non-destructive) |
| AVX2 | 2013 | 256-bit | YMM0-YMM15 | 8 | 4 | 256-bit integer ops, FMA, gather |
| AVX-512 | 2017 | 512-bit (ZMM) | ZMM0-ZMM31 | 16 | 8 | Masking, scatter, 32 registers |

#include <immintrin.h>
#include <stdalign.h>

int n = 1024;
alignas(32) float a[1024], b[1024], c[1024];  // 32-byte alignment required for _mm256_load_ps

// ── Scalar: 1 element per instruction ──
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];                       // 1 add per iteration
}
// Total: ~1024 add instructions

// ── SIMD (AVX): 8 elements per instruction ──
for (int i = 0; i < n; i += 8) {
    __m256 va = _mm256_load_ps(&a[i]);        // Load 8 floats (32 bytes, aligned)
    __m256 vb = _mm256_load_ps(&b[i]);        // Load 8 floats
    __m256 vc = _mm256_add_ps(va, vb);        // 8 parallel adds
    _mm256_store_ps(&c[i], vc);               // Store 8 floats
}
// Total: ~128 add instructions (8x fewer)

**Theoretical throughput.** A modern CPU core with AVX-512 can execute 2 FMA (fused multiply-add) instructions per cycle, each operating on 16 FP32 values. That is $2 \times 16 \times 2 = 64$ FP32 FLOPS per cycle (FMA counts as 2 FLOPS). At 3 GHz, one core delivers 192 GFLOPS -- comparable to a low-end GPU. But unlike a GPU, a CPU has only 8-64 cores, so aggregate throughput is much lower.

Common AVX Intrinsics

Intrinsics are C functions that map 1:1 to SIMD instructions. The naming convention is `_mm<width>_<op>_<type>`:

| Component | Values | Meaning |
|-----------|--------|---------|
| Width | (none), 256, 512 | 128-bit, 256-bit, 512-bit |
| Operation | add, mul, fmadd, load, store, etc. | The operation |
| Type | ps (packed single), pd (packed double), epi32 (packed int32), si256 (256-bit integer) | Data type |

#include <immintrin.h>

// ── Load and store ──
__m256 va = _mm256_load_ps(ptr);                 // Load 8 floats (ptr must be 32-byte aligned)
__m256 vu = _mm256_loadu_ps(ptr);                // Load 8 floats (unaligned; slightly slower on older CPUs)
_mm256_store_ps(ptr, va);                        // Store 8 floats (aligned)
_mm256_storeu_ps(ptr, vu);                       // Store 8 floats (unaligned)

// ── Arithmetic ──
__m256 sum  = _mm256_add_ps(a, b);               // a + b (8 floats)
__m256 diff = _mm256_sub_ps(a, b);               // a - b
__m256 prod = _mm256_mul_ps(a, b);               // a * b
__m256 fma  = _mm256_fmadd_ps(a, b, c);          // a*b + c (fused, same latency as mul)

// ── Comparison and selection ──
__m256 mask   = _mm256_cmp_ps(a, b, _CMP_GT_OQ); // Per lane: a > b ? all-1s : all-0s
__m256 result = _mm256_blendv_ps(b, a, mask);    // mask ? a : b (branchless select)

// ── Reduction (horizontal) ──
__m256 mx = _mm256_max_ps(a, b);                 // Element-wise max
// Full horizontal sum requires multiple shuffles (no single instruction)

// ── Broadcast and set ──
__m256 pi    = _mm256_set1_ps(3.14f);            // All 8 elements = 3.14
__m256 zeros = _mm256_setzero_ps();              // All zeros
__m256 bcast = _mm256_broadcast_ss(&scalar);     // Broadcast single float to all lanes

float simd_dot(const float* a, const float* b, int n) {
    __m256 sum0 = _mm256_setzero_ps();
    __m256 sum1 = _mm256_setzero_ps();        // Two accumulators to hide FMA latency

    int i;
    for (i = 0; i + 15 < n; i += 16) {
        __m256 a0 = _mm256_loadu_ps(a + i);
        __m256 b0 = _mm256_loadu_ps(b + i);
        sum0 = _mm256_fmadd_ps(a0, b0, sum0); // sum0 += a0 * b0

        __m256 a1 = _mm256_loadu_ps(a + i + 8);
        __m256 b1 = _mm256_loadu_ps(b + i + 8);
        sum1 = _mm256_fmadd_ps(a1, b1, sum1); // sum1 += a1 * b1
    }

    sum0 = _mm256_add_ps(sum0, sum1);         // Combine accumulators

    // Horizontal sum: reduce 8 floats to 1
    __m128 hi = _mm256_extractf128_ps(sum0, 1);
    __m128 lo = _mm256_castps256_ps128(sum0);
    __m128 s  = _mm_add_ps(lo, hi);           // 4 floats
    s = _mm_hadd_ps(s, s);                    // 2 floats
    s = _mm_hadd_ps(s, s);                    // 1 float
    float result = _mm_cvtss_f32(s);

    // Handle remaining elements (scalar cleanup)
    for (; i < n; i++) result += a[i] * b[i];
    return result;
}

**Why use multiple accumulators?** FMA has a latency of 4-5 cycles but a throughput of 2 per cycle. A single accumulator creates a dependency chain: each FMA must wait for the previous one to finish. Using 2-4 independent accumulators lets the CPU pipeline execute multiple FMAs in flight, achieving near-peak throughput.
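The same dependency-chain reasoning applies even in plain scalar code. A sketch of the accumulator-splitting pattern (function names are ours); note that splitting changes the floating-point summation order slightly, which is usually acceptable:

```c
#include <stddef.h>

// Single accumulator: each add must wait for the previous one (serial chain)
double sum_serial(const double* x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

// Four independent accumulators: the CPU can keep several adds in flight
double sum_pipelined(const double* x, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += x[i];       // Four independent chains --
        s1 += x[i + 1];   // each add waits only on its own accumulator
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++) s += x[i];  // Scalar cleanup for the tail
    return s;
}
```

This is also why compilers only vectorize reductions like this under flags that permit reassociation (e.g. `-ffast-math` or `-fassociative-math`): the split version does not add in strict left-to-right order.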

Auto-Vectorization

Modern compilers can automatically generate SIMD code from scalar loops. This is the preferred approach -- let the compiler handle the intrinsics:


// The compiler will vectorize this with -O2 -mavx2
void add_arrays(float* restrict a, const float* restrict b,
                const float* restrict c, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
// 'restrict' tells the compiler: a, b, c do not overlap in memory
// Without restrict, the compiler may not vectorize (aliasing could cause bugs)
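To see why the compiler must be conservative without `restrict`, consider a call where the destination overlaps the source. This is a contrived sketch (the function name is ours):

```c
// No 'restrict': the compiler must assume dst may overlap src,
// so it cannot safely load a whole vector of src before storing.
void shift_add(float* dst, const float* src, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i] + 1.0f;
    }
}

// With float x[5] = {1,1,1,1,1}, the call shift_add(x + 1, x, 4)
// must produce {1,2,3,4,5}: each iteration reads the value the
// previous iteration just wrote. A naively vectorized version that
// loaded all of src up front would produce {1,2,2,2,2} instead.
// Without 'restrict', the compiler either stays scalar or emits a
// runtime overlap check before taking the vector path.
```

Declaring the parameters `restrict` is a promise that no such overlapping call happens; if you break the promise, the behavior is undefined.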

Check if vectorization happened:


# GCC: show vectorization report
gcc -O2 -mavx2 -ftree-vectorize -fopt-info-vec-optimized source.c
# Output: "loop vectorized using 32 byte vectors"

gcc -O2 -mavx2 -fopt-info-vec-missed source.c
# Output: explains WHY a loop was NOT vectorized

# Clang: show vectorization report
clang -O2 -mavx2 -Rpass=loop-vectorize source.c
clang -O2 -mavx2 -Rpass-missed=loop-vectorize source.c

# Compile for current CPU (enables all supported ISA extensions)
gcc -O3 -march=native source.c

| Blocker | Example | Why It Blocks | Fix |
|---------|---------|---------------|-----|
| Pointer aliasing | `a` and `c` might overlap | Compiler cannot reorder loads/stores | Use `restrict` keyword |
| Loop-carried dependency | `a[i] = a[i-1] + 1` | Each iteration depends on the previous | Restructure algorithm (e.g., prefix sum) |
| Function calls in loop | `a[i] = sin(b[i])` | Compiler cannot inline/vectorize arbitrary functions | Use vectorized math library (`-lmvec`, Intel SVML) |
| Complex control flow | `if (b[i] > 0) ...` in loop | Branches prevent uniform SIMD execution | Use branchless/mask operations, `_mm256_blendv_ps` |
| Non-unit stride | `a[i*3]` | Non-contiguous access cannot be vectorized efficiently | Restructure to unit-stride access |
| Unknown trip count | `while (p != end)` | Compiler cannot plan vector iterations | Use counted loop with known bounds |

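For the control-flow blocker in particular, a branchy loop can often be rewritten as a ternary on the value, which compilers lower to a compare-plus-blend when vectorizing. A sketch (function names are ours):

```c
// Branchy form: the if statement can block vectorization
void clamp_branchy(float* restrict out, const float* restrict in, int n) {
    for (int i = 0; i < n; i++) {
        if (in[i] > 0.0f) out[i] = in[i];
        else              out[i] = 0.0f;
    }
}

// Branchless form: a ternary on the value, which the vectorizer
// turns into a compare + masked select (e.g. vblendvps)
void clamp_branchless(float* restrict out, const float* restrict in, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }
}
```

Modern compilers can often if-convert the branchy form too, but the ternary form vectorizes more reliably; check the vectorization report to confirm.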
**Practical tips for auto-vectorization:**

1. **Always use `-O2` or higher.** Vectorization is disabled at `-O0` and `-O1`.
2. **Use `restrict` on all pointer parameters** that do not alias.
3. **Prefer simple counted loops** (`for (int i = 0; i < n; i++)`) over complex iteration patterns.
4. **Avoid early exits** (`break`, `return`) inside loops.
5. **Use `-march=native`** to enable the widest SIMD available on your CPU.
6. **Check the vectorization report** before writing manual intrinsics -- the compiler may already be doing what you want.

SIMD and SIMT: CPU vs GPU Parallelism

SIMD (CPU) and SIMT (GPU) both exploit data parallelism, but at different scales and with different tradeoffs:

| Property | CPU SIMD (AVX-512) | GPU SIMT (CUDA) |
|----------|--------------------|-----------------|
| Width | 8-16 elements per instruction | 32 threads per warp |
| Parallelism model | One instruction stream, wide operands | 32 program counters, lockstep execution |
| Divergence | Not possible (same instruction for all lanes) | Possible but expensive (serialized branches) |
| Memory access | Uniform load/store, hardware prefetch | Coalesced access required for efficiency |
| Programming | Intrinsics, auto-vectorization, or compiler | Write per-thread scalar code |
| Latency hiding | Out-of-order execution, prefetch | Massive thread count (warp switching) |
| Register file | 16-32 vector registers per core | 65,536 registers per SM (shared across warps) |
| Typical scale | 8-64 cores, each with SIMD | 100+ SMs, each running 1000+ threads |

**The unifying mental model.** Both SIMD and SIMT express the same idea: take a computation that applies independently to many data elements, and execute it on all elements simultaneously using wide hardware. On a CPU, you pack 8-16 values into a vector register and operate on them with one instruction. On a GPU, you launch thousands of threads that each operate on one element, and the hardware groups them into warps of 32 that execute in lockstep. The key constraint is the same: **divergence** (different elements needing different operations) kills performance on both architectures.

SIMD in ML Libraries

You rarely need to write SIMD intrinsics for ML work. The BLAS and runtime libraries handle this internally:

| Library | CPU Backend | SIMD Used | What It Accelerates |
|---------|-------------|-----------|---------------------|
| NumPy | OpenBLAS or MKL | AVX2/AVX-512 | Matrix multiply, elementwise ops |
| PyTorch (CPU) | MKL/MKL-DNN (oneDNN) | AVX2/AVX-512 | Convolutions, GEMM, batch norm |
| ONNX Runtime | CPU EP | AVX-512, VNNI | INT8/FP16 inference |
| llama.cpp | Custom kernels | AVX2/AVX-512, NEON | Quantized LLM inference |
| TensorFlow | Eigen, oneDNN | AVX2/AVX-512 | All CPU ops |

Narrower data types pack more elements per register, multiplying throughput:

| Data type | Bytes | Elements per AVX-512 register | Relative throughput |
|-----------|-------|-------------------------------|---------------------|
| FP64 | 8 | 8 | 1x |
| FP32 | 4 | 16 | 2x |
| FP16/BF16 | 2 | 32 | 4x |
| INT8 | 1 | 64 | 8x |
| INT4 | 0.5 | 128 | 16x (with packing) |

**Why understanding SIMD matters even if you never write intrinsics.** SIMD explains four things that every ML engineer encounters:
  1. Aligned memory is faster -- SIMD loads from aligned addresses avoid crossing cache line boundaries and use faster instructions (vmovaps vs vmovups).
  2. Contiguous arrays beat random access -- SIMD instructions operate on consecutive elements; scatter/gather from random addresses is 5-10x slower.
  3. CPU quantization works -- INT8 inference fits 4x more elements per SIMD register than FP32, providing roughly 4x speedup for memory-bound operations (and the actual VNNI instruction computes INT8 dot products directly).
  4. Power-of-2 sizes are faster -- SIMD widths are powers of 2, so arrays with lengths that are multiples of 16 (AVX-512) or 8 (AVX2) avoid the scalar cleanup loop that handles remaining elements.
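Points 1 and 4 can be combined in allocation code. A hedged sketch using C11 `aligned_alloc` (the helper names are ours): pad the length to a multiple of the vector width so the SIMD loop needs no scalar cleanup, and align the buffer so aligned loads are legal.

```c
#include <stdlib.h>

// Round n up to the next multiple of 'lanes' so a SIMD loop covers
// every element and the scalar cleanup loop disappears
static size_t pad_to_lanes(size_t n, size_t lanes) {
    return (n + lanes - 1) / lanes * lanes;
}

// Allocate a float buffer safe for 32-byte-aligned AVX loads/stores.
// aligned_alloc (C11) requires size to be a multiple of the alignment;
// padding to 8 floats (32 bytes) guarantees that. Free with free().
float* alloc_simd_floats(size_t n) {
    size_t padded = pad_to_lanes(n, 8);  // 8 floats per __m256
    return aligned_alloc(32, padded * sizeof(float));
}
```

A buffer from `alloc_simd_floats(n)` can be processed with `_mm256_load_ps`/`_mm256_store_ps` in full 8-float steps, even when `n` itself is not a multiple of 8, because the padding bytes belong to the allocation.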