SIMD Performance — From Scalar to AVX2
Scalar was slow. AVX2 made it less slow.
Inference worked. It was just painfully slow.
The bottleneck was obvious: every dot product between quantized weights and f32 activations was running one element at a time, in scalar Rust. For a 7B model at Q4_0, that's billions of multiply-accumulates per token, each burning an individual CPU instruction.
The goal wasn't maximum theoretical throughput — that's what the GPU backend is for. The goal was to stop leaving 75% of the CPU on the table. AVX2 gives 256-bit registers: 8 f32s or 32 int8s per instruction. For quantized GEMV, the win comes from processing an entire Q4_0 block (32 weights) in a handful of instructions instead of a loop of 32.
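For reference, one Q4_0 block under the standard GGUF layout is an f16 scale plus 16 bytes of packed nibbles. A minimal illustrative type, not this project's actual struct:

```rust
/// Roughly how one Q4_0 block sits in memory under the standard GGUF layout:
/// one f16 scale plus 32 four-bit weights packed two per byte.
#[repr(C)]
struct BlockQ40 {
    d: u16,       // block scale, stored as raw f16 bits
    qs: [u8; 16], // 32 quantized weights, two nibbles per byte
}
```

Once the nibbles are widened to bytes, a single 256-bit register covers the whole block, which is why one block maps onto a handful of instructions.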
I went with std::arch intrinsics over portable SIMD (std::simd is still unstable). Raw intrinsics are ugly but predictable — you know exactly what's hitting the execution ports.
Runtime Detection
Can't assume AVX2 is available. The result of a one-time probe at startup via std::is_x86_feature_detected!("avx2") is cached in an AtomicU8. Every hot path checks this flag and branches:
```
if has_avx2() → avx2::qmv_q4_0()
else          → scalar::qmv_q4_0()
```

The branch predictor locks onto the right path after the first call. Zero overhead in steady state.
The Kernels
Q4_0 AVX2 — 3.9x
The heart of it: _mm256_maddubs_epi16 multiplies 32 unsigned×signed int8 pairs and accumulates to int16. One instruction does what took 32 scalar multiplies.
The tricky part was nibble extraction. Each byte holds two 4-bit weights. Low nibble is weight[2i], high nibble is weight[2i+1]. Masked with _mm256_and_si256 for lows, _mm256_srli_epi16 + mask for highs.
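The extraction itself is only a few instructions. A sketch of that step alone, not the project's exact kernel; which weight index each output byte corresponds to depends on the block layout, which is exactly where the bug described further down came from:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Widen one block's 16 packed bytes into 32 nibble values, one per byte.
/// In the full kernel these bytes then feed _mm256_maddubs_epi16 against the
/// matching activation bytes, with the Q4_0 zero point handled separately.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn nibbles_to_bytes(qs: *const u8) -> __m256i {
    let packed = _mm_loadu_si128(qs as *const __m128i);
    let shifted = _mm_srli_epi16::<4>(packed);       // bytes shifted right by 4
    // Low 128-bit lane: raw bytes (low nibbles after masking).
    // High 128-bit lane: shifted bytes (high nibbles after masking).
    let bytes = _mm256_set_m128i(shifted, packed);
    _mm256_and_si256(bytes, _mm256_set1_epi8(0x0F))  // keep 4 bits per byte
}
```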
Q8_0 AVX2 — 5.0x
Simplest kernel. Q8_0 weights are already int8, so it's just load, _mm256_maddubs_epi16, horizontal sum. 4 instructions per block. The 5x speedup over scalar comes almost entirely from processing 32 elements per instruction instead of 1.
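The horizontal sum at the end is the same handful of instructions in every kernel. A sketch, assuming the accumulator has already been widened to i32 lanes (for example via _mm256_madd_epi16 against a vector of ones):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Reduce eight i32 lanes to a single scalar. A generic sketch of the
/// reduction step, not the project's exact code.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn hsum_i32(v: __m256i) -> i32 {
    let lo = _mm256_castsi256_si128(v);
    let hi = _mm256_extracti128_si256::<1>(v);
    let s4 = _mm_add_epi32(lo, hi);                      // 8 lanes -> 4
    let s2 = _mm_add_epi32(s4, _mm_srli_si128::<8>(s4)); // 4 lanes -> 2
    let s1 = _mm_add_epi32(s2, _mm_srli_si128::<4>(s2)); // 2 lanes -> 1
    _mm_cvtsi128_si32(s1)
}
```

Because the lanes hold integers, the reduction order doesn't affect the result, which is what makes a bit-exact match against scalar accumulation possible in the first place.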
Q4_K AVX2 — 1.9x
Most complex kernel. Q4_K has 256-weight super-blocks with 6-bit scale extraction from a sharded 12-byte metadata array. The scale unpacking alone takes more instructions than the entire Q8_0 kernel. 1.9x is lower than Q4_0 because the scale extraction overhead amortizes poorly — the dot product itself is fast, but half the time is spent extracting and broadcasting sub-block scales.
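For reference, the 6-bit unpacking in scalar form, mirroring the GGUF Q4_K packing as implemented in llama.cpp's get_scale_min_k4 and assuming this project's loader keeps that layout:

```rust
/// Unpack the 6-bit scale and min for sub-block `j` (0..8) of a Q4_K
/// super-block from its 12-byte metadata array. A sketch following the
/// llama.cpp packing; adjust if the loader re-packs the metadata.
fn q4_k_scale_min(j: usize, meta: &[u8; 12]) -> (u8, u8) {
    if j < 4 {
        // First four sub-blocks: scale and min each live in one byte.
        (meta[j] & 63, meta[j + 4] & 63)
    } else {
        // Last four: the 6 bits are split across two bytes.
        let scale = (meta[j + 4] & 0x0F) | ((meta[j - 4] >> 6) << 4);
        let min = (meta[j + 4] >> 4) | ((meta[j] >> 6) << 4);
        (scale, min)
    }
}
```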
Q6_K — Fused Scalar
I wrote the AVX2 kernel but the speedup was marginal (<1.2x) because Q6_K's 6-bit extraction involves so much bit manipulation that the SIMD advantage drowns in shuffle instructions. Kept the fused scalar path instead — simpler and only 15% slower than the AVX2 version.
The Q4_0 Nibble Bug
This one was subtle. I initially treated Q4_0 nibbles the same way as Q4_K — grouped (first 128 low nibbles, then 128 high nibbles). But Q4_0 is interleaved: byte[i] holds weight[2i] in the low nibble and weight[2i+1] in the high nibble.
The scalar code happened to get the right answer because it iterated weight-by-weight. The AVX2 code loaded 16 bytes at once and applied the wrong extraction pattern. The differential test caught it immediately.
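In scalar form the two orderings look like this (illustrative, following the layouts as described above):

```rust
/// Q4_0 here: byte[i] holds weight[2i] (low nibble) and weight[2i+1] (high nibble).
fn unpack_interleaved(qs: &[u8; 16], out: &mut [u8; 32]) {
    for (i, &b) in qs.iter().enumerate() {
        out[2 * i] = b & 0x0F;
        out[2 * i + 1] = b >> 4;
    }
}

/// Q4_K-style grouping: all low nibbles first, then all high nibbles.
fn unpack_grouped(qs: &[u8; 16], out: &mut [u8; 32]) {
    for (i, &b) in qs.iter().enumerate() {
        out[i] = b & 0x0F;
        out[i + 16] = b >> 4;
    }
}
```

Applying one extraction pattern to the other format produces plausible-looking but scrambled weights, which is why only a differential test catches it.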
Allocation Elimination
ScratchBuffers
Before this, every token decode allocated fresh Vec<f32> buffers for intermediate results. For a 22-layer model, that's 13+ allocations per token. At 5 TPS, 65 heap allocs/frees per second — not catastrophic, but unnecessary.
ScratchBuffers pre-allocates everything at runner construction. Normed hidden state, Q/K/V projections, attention scores, FFN intermediates, logits — all allocated once. Zero per-token allocations in the decode hot path.
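The struct is nothing clever. Roughly this, with illustrative field names and shapes:

```rust
/// Pre-allocated per-token scratch space, sized once at runner construction.
struct ScratchBuffers {
    normed: Vec<f32>,      // hidden_dim
    q: Vec<f32>,           // n_heads * head_dim
    k: Vec<f32>,           // n_kv_heads * head_dim
    v: Vec<f32>,           // n_kv_heads * head_dim
    attn_scores: Vec<f32>, // n_heads * max_seq_len
    ffn_hidden: Vec<f32>,  // intermediate_dim
    logits: Vec<f32>,      // vocab_size
}

impl ScratchBuffers {
    fn new(hidden: usize, n_heads: usize, n_kv_heads: usize, head_dim: usize,
           inter: usize, vocab: usize, max_seq: usize) -> Self {
        // Every buffer gets its final size here; the decode loop only writes
        // into these slices and never allocates.
        Self {
            normed: vec![0.0; hidden],
            q: vec![0.0; n_heads * head_dim],
            k: vec![0.0; n_kv_heads * head_dim],
            v: vec![0.0; n_kv_heads * head_dim],
            attn_scores: vec![0.0; n_heads * max_seq],
            ffn_hidden: vec![0.0; inter],
            logits: vec![0.0; vocab],
        }
    }
}
```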
KV Cache Pre-allocation
Previously grew dynamically. Now allocated to max_seq_len at init. For a 4096-context model with 8 KV heads and 128-dim heads, that's ~576 MB. Allocated once, never touched by the allocator again.
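A sketch of the pre-allocation, assuming f32 entries and illustrative names:

```rust
/// KV cache sized for the full context window at init; never resized.
struct KvCache {
    k: Vec<f32>, // layers * max_seq_len * n_kv_heads * head_dim
    v: Vec<f32>, // same shape as k
}

impl KvCache {
    fn new(layers: usize, max_seq_len: usize, n_kv_heads: usize, head_dim: usize) -> Self {
        let elems = layers * max_seq_len * n_kv_heads * head_dim;
        // One allocation each for K and V, up front.
        KvCache { k: vec![0.0; elems], v: vec![0.0; elems] }
    }
}
```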
RoPE Lookup Table
RoPE sin/cos values are position-dependent but deterministic. Pre-compute the entire max_seq_len × head_dim table at init. During inference, RoPE application is just a multiply-and-add from the table — no trig functions in the hot path.
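A sketch of the table, assuming the usual base of 10000 and an interleaved pair convention; the names and API are illustrative, not the project's exact code:

```rust
/// Pre-computed RoPE angles for every (position, frequency) pair.
struct RopeTable {
    cos: Vec<f32>, // max_seq_len * head_dim / 2
    sin: Vec<f32>,
    half_dim: usize,
}

impl RopeTable {
    fn new(max_seq_len: usize, head_dim: usize, theta: f32) -> Self {
        let half = head_dim / 2;
        let mut cos = Vec::with_capacity(max_seq_len * half);
        let mut sin = Vec::with_capacity(max_seq_len * half);
        for pos in 0..max_seq_len {
            for i in 0..half {
                // freq_i = theta^(-2i / head_dim), angle = pos * freq_i
                let freq = theta.powf(-2.0 * i as f32 / head_dim as f32);
                let angle = pos as f32 * freq;
                cos.push(angle.cos());
                sin.push(angle.sin());
            }
        }
        RopeTable { cos, sin, half_dim: half }
    }

    /// Rotate one head in place: only multiply-adds, no trig in the hot path.
    fn apply(&self, x: &mut [f32], pos: usize) {
        let base = pos * self.half_dim;
        for i in 0..self.half_dim {
            let (c, s) = (self.cos[base + i], self.sin[base + i]);
            let (a, b) = (x[2 * i], x[2 * i + 1]);
            x[2 * i] = a * c - b * s;
            x[2 * i + 1] = a * s + b * c;
        }
    }
}
```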
The Borrow Checker Problem
Rust's borrow checker created an annoying pattern: self.weights and self.scratch are both fields of TransformerRunner, so a method can't borrow both mutably. Refactored the core math into free functions:
```rust
fn qmv(weights: &[u8], activations: &[f32], output: &mut [f32], ...)
fn rms_norm(input: &[f32], weight: &[f32], output: &mut [f32], eps: f32)
fn apply_rope(q: &mut [f32], k: &mut [f32], pos: usize, ...)
```

No &self, no borrow conflicts. The runner's forward() method just calls these with the right slices.
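A minimal illustration of why this works: disjoint fields of self can be borrowed independently at the call site, as long as nothing takes &self as a whole. All names here are illustrative:

```rust
struct Weights { wq: Vec<u8> }
struct Scratch { normed: Vec<f32>, q: Vec<f32> }

struct TransformerRunner { weights: Weights, scratch: Scratch }

/// Stand-in for the real quantized matvec; takes plain slices, never &self.
fn qmv(weights: &[u8], activations: &[f32], output: &mut [f32]) {
    let _ = (weights, activations, output);
}

impl TransformerRunner {
    fn forward_step(&mut self) {
        // One field borrowed shared, another borrowed mutably: no conflict,
        // because the borrows target disjoint fields rather than all of self.
        qmv(&self.weights.wq, &self.scratch.normed, &mut self.scratch.q);
    }
}
```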
Differential Testing
10,000+ random iterations per kernel. For each iteration: generate random block data (scale + quantized weights), run scalar dequant, run AVX2 dequant, compare element-by-element with f32 epsilon tolerance.
This caught the nibble interleaving bug on the first run. It also validated that the horizontal sum reduction produces bit-identical results to the scalar accumulation path.
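One iteration looks roughly like this, written generically over a reference and a candidate kernel so the sketch stays self-contained; the dequant signature (f32 scale plus 16 packed bytes expanding to 32 f32s) is an assumption, not the project's API:

```rust
/// One differential-test iteration: random block, both dequant paths,
/// element-by-element comparison with a relative f32 tolerance.
fn check_q4_0_block<R, C>(seed: &mut u64, reference: R, candidate: C)
where
    R: Fn(f32, &[u8; 16], &mut [f32; 32]),
    C: Fn(f32, &[u8; 16], &mut [f32; 32]),
{
    // Tiny xorshift PRNG keeps the test reproducible without extra crates.
    let mut next = || {
        *seed ^= *seed << 13;
        *seed ^= *seed >> 7;
        *seed ^= *seed << 17;
        *seed
    };

    let scale = (next() % 2000) as f32 / 500.0 - 2.0; // random scale in [-2, 2)
    let mut qs = [0u8; 16];
    for b in qs.iter_mut() {
        *b = (next() & 0xFF) as u8; // random packed nibbles
    }

    let mut expected = [0.0f32; 32];
    let mut actual = [0.0f32; 32];
    reference(scale, &qs, &mut expected); // scalar path
    candidate(scale, &qs, &mut actual);   // AVX2 path

    for (i, (e, a)) in expected.iter().zip(&actual).enumerate() {
        let tol = f32::EPSILON * e.abs().max(1.0);
        assert!((e - a).abs() <= tol, "mismatch at weight {i}: scalar={e} avx2={a}");
    }
}
```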