GPU Performance — Killing the Round-Trips
The shaders worked but everything was drowning in PCIe traffic. Fixed that, got 8.5x.
The Vulkan backend worked. Shaders compiled, weights uploaded, tokens came out. But it was almost certainly slower than CPU. Not because the shaders were slow — because every single operation was drowning in PCIe traffic and driver overhead.
The numbers tell the story. Single-token decode has an arithmetic intensity of 3.57 FLOPs/byte. The ridge point on an RTX 5090 is 1872 FLOPs/byte. The GPU is memory-bandwidth-bound, and VRAM bandwidth is the whole advantage — 960+ GB/s vs 32 GB/s over PCIe. Every time I downloaded data to CPU and re-uploaded it, I threw away that 30:1 bandwidth advantage.
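Here is the back-of-envelope form of that roofline argument as a small sketch. Only the figures quoted above go into it; the snippet itself is illustrative, not project code.

```rust
// Roofline sanity check: below the ridge point, attainable throughput is
// arithmetic intensity x feed bandwidth, so the only lever is which bus
// feeds the kernel. Numbers are the ones quoted in the text.
fn main() {
    let vram_bw = 960e9_f64; // bytes/s over VRAM
    let pcie_bw = 32e9_f64;  // bytes/s over PCIe
    let intensity = 3.57;    // FLOPs/byte for single-token decode
    let ridge = 1872.0;      // FLOPs/byte ridge point for this GPU

    assert!(intensity < ridge); // decode is firmly bandwidth-bound
    println!("VRAM-fed: {:.2e} FLOP/s", intensity * vram_bw); // ~3.4e12
    println!("PCIe-fed: {:.2e} FLOP/s", intensity * pcie_bw); // ~1.1e11
    println!("ratio: {:.0}x", vram_bw / pcie_bw);             // 30x
}
```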
The initial implementation had four problems. This round fixed all of them.
The KV Cache Catastrophe
The worst offender. Every token, every layer, the forward pass would:
- Download the entire KV cache from VRAM to CPU
- Patch one position (a few KB)
- Re-upload the entire cache back to VRAM
For a model with 32 KV heads, 128 head_dim, and 22 layers — that's roughly 2.8 GB of PCIe traffic per token. For data that never needed to leave the GPU.
The fix was embarrassingly simple. Vulkan's vkCmdCopyBuffer supports offset copies natively. I added copy_buffer_offset to the ComputeBackend trait and the Vulkan command engine. Now the KV cache write is a single GPU-to-GPU copy of just the new position's slice — a few hundred bytes instead of gigabytes.
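A minimal sketch of that fix, assuming a byte-offset copy signature. The trait and method names follow the text; the exact signature, the handle type, and the layout constants are my guesses, not the repo's API.

```rust
// Stand-in handle type and an assumed signature for the new trait method.
type BufferId = u64;

trait ComputeBackend {
    /// Copy `size` bytes from `src[src_off..]` into `dst[dst_off..]`, entirely on-device.
    fn copy_buffer_offset(&self, src: BufferId, src_off: usize,
                          dst: BufferId, dst_off: usize, size: usize);
}

/// Append one position's K (or V) row to the cache without touching the host.
fn append_kv(be: &dyn ComputeBackend, k_new: BufferId, k_cache: BufferId,
             pos: usize, n_kv_heads: usize, head_dim: usize) {
    let row_bytes = n_kv_heads * head_dim * std::mem::size_of::<f32>();
    // One GPU-to-GPU copy of a few hundred bytes, instead of downloading,
    // patching, and re-uploading the whole cache.
    be.copy_buffer_offset(k_new, 0, k_cache, pos * row_bytes, row_bytes);
}
```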
CPU Attention Fallback
The attention shaders (attn_score.comp, attn_value.comp, softmax.comp) were compiled and registered, but never wired up. The forward pass downloaded Q, the entire K cache, and the entire V cache to CPU, ran attention in a Rust loop, and uploaded the result. The comment said "GQA grouping is complex to wire through single dispatches." It wasn't that complex.
The shaders needed two fixes for GQA. The original attn_score.comp indexed the key cache as pos * head_dim + i — flat layout, no concept of multiple KV heads sharing positions. With GQA, the cache layout is pos * kv_stride + kv_h * head_dim + i where kv_stride = n_kv_heads * head_dim. Same deal for attn_value.comp, plus it needed an out_offset push constant so each Q head writes to its own slice of the output buffer.
The dispatch loop in gpu_runner.rs iterates over Q heads, maps each to its KV head via kv_h = h / kv_group, and fires three shader dispatches per head: attn_score, softmax, attn_value. All on GPU. No CPU downloads.
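The shape of that loop, as a sketch: the closures stand in for recording the three shaders into the command buffer, and every name here is illustrative rather than the repo's exact API.

```rust
/// Record per-head GQA attention: one score, one softmax, and one value
/// dispatch per Q head, with each Q head mapped to its shared KV head.
fn record_attention(
    n_heads_q: usize, n_kv_heads: usize, head_dim: usize, pos: usize,
    mut score: impl FnMut(usize, usize, usize, usize),  // (h, kv_h, kv_stride, pos)
    mut softmax: impl FnMut(usize),                     // (len)
    mut value: impl FnMut(usize, usize, usize, usize),  // (h, kv_h, kv_stride, out_offset)
) {
    let kv_stride = n_kv_heads * head_dim; // cache row stride per position
    let kv_group = n_heads_q / n_kv_heads; // Q heads sharing one KV head
    for h in 0..n_heads_q {
        let kv_h = h / kv_group;           // kv_h = h / kv_group, as in the text
        score(h, kv_h, kv_stride, pos);    // scores over positions 0..=pos
        softmax(pos + 1);                  // normalize this head's scores
        value(h, kv_h, kv_stride, h * head_dim); // write this head's output slice
    }
}
```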
RoPE Only Rotated One Head
Found this while reading the shader code. rope.comp dispatched head_dim / 2 threads total. That rotates exactly one head's worth of Q and one head's worth of K. A 32-head model was only rotating head 0.
Rewrote the shader to take n_heads_q and n_heads_k as push constants. Each thread computes which head it belongs to and applies the rotation to the correct pair. Dispatch count is ceil((n_heads_q + n_heads_k) * half_dim / 64) — one dispatch covers all heads. Also matched the interleaved pair layout (x[2i], x[2i+1]) that the CPU model runner uses, instead of the split-half layout the old shader had.
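Mirrored in plain Rust, the per-thread index math looks roughly like this. The push-constant names and the interleaved layout follow the text; the frequency base handling and the flat buffer layout are assumptions for illustration.

```rust
/// One "thread" of the rewritten RoPE kernel: gid picks a (head, pair) slot
/// out of the (n_heads_q + n_heads_k) * half_dim flat range and rotates the
/// interleaved pair (x[2i], x[2i+1]) for that head.
fn rope_thread(
    gid: usize,
    q: &mut [f32], k: &mut [f32],
    n_heads_q: usize, n_heads_k: usize, head_dim: usize,
    pos: usize, theta_base: f32,
) {
    let half = head_dim / 2;
    if gid >= (n_heads_q + n_heads_k) * half {
        return; // threads past the last head do nothing
    }
    let head = gid / half; // which head this thread rotates
    let i = gid % half;    // which pair inside that head
    // Standard RoPE frequency for pair i; the base (e.g. 10000) is an assumption.
    let freq = theta_base.powf(-2.0 * i as f32 / head_dim as f32);
    let (sin, cos) = (pos as f32 * freq).sin_cos();
    // Q heads come first in the flat thread range, then K heads.
    let (buf, h) = if head < n_heads_q { (q, head) } else { (k, head - n_heads_q) };
    let base = h * head_dim + 2 * i; // interleaved pair layout, like the CPU runner
    let (x0, x1) = (buf[base], buf[base + 1]);
    buf[base] = x0 * cos - x1 * sin;
    buf[base + 1] = x0 * sin + x1 * cos;
}
```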
330 Submits Per Token
The original code called begin_command_buffer → dispatch → submit_and_wait for every single shader invocation. With ~15 dispatches per layer across 22 layers, that's 330+ GPU submissions per token. At 5-10μs of driver overhead each, that's 1.6-3.3ms of pure waste.
Added a batched command buffer API to VulkanBackend: begin_batch(), dispatch_batched(), barrier(), transfer_barrier(), copy_buffer_batched(), submit_batch(). The entire forward pass — all layers, all attention heads, final norm, output projection — records into a single command buffer and submits once.
The barrier placement matters. Independent operations (Q, K, V projections all reading from the same normed buffer) don't need barriers between them. Barriers go at dependency edges: after norm → before matmuls, after matmuls → before RoPE, after RoPE → before KV cache write, and so on. Transfer barriers separate vkCmdCopyBuffer from compute dispatches.
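A sketch of what one layer's recording looks like under that rule. The batch calls are the ones named above, wrapped in a stub trait so the example stands alone; the string arguments are placeholders for real pipeline and buffer handles.

```rust
// Stub for the batch calls named above, so the sketch compiles on its own.
trait Batch {
    fn dispatch_batched(&mut self, shader: &str);
    fn barrier(&mut self);
    fn transfer_barrier(&mut self);
    fn copy_buffer_batched(&mut self, what: &str);
}

// Record one layer's attention block with barriers only at dependency edges.
fn record_layer(b: &mut dyn Batch) {
    b.dispatch_batched("rmsnorm");      // norm the hidden state
    b.barrier();                        // edge: norm result -> projections
    b.dispatch_batched("qmv (Q proj)"); // Q, K, V all read the same normed
    b.dispatch_batched("qmv (K proj)"); // buffer and are independent, so no
    b.dispatch_batched("qmv (V proj)"); // barriers between them
    b.barrier();                        // edge: projections -> RoPE
    b.dispatch_batched("rope");
    b.transfer_barrier();               // edge: compute -> transfer
    b.copy_buffer_batched("KV append"); // offset vkCmdCopyBuffer into the cache
    b.transfer_barrier();               // edge: transfer -> compute
    // ...per-head attention, output projection, and the FFN follow the same
    // pattern; everything lands in one command buffer, submitted once per token.
}
```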
GPU Buffer Copies
The residual connection at the start of each layer and FFN block copies hidden → residual. Previously this downloaded to a CPU Vec<u8> and re-uploaded. Added copy_buffer to the ComputeBackend trait — on Vulkan it's a single vkCmdCopyBuffer, on CPU it's a memcpy. Two PCIe round-trips per layer eliminated.
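As a sketch (the method name follows the text; the stub trait and handle type are illustrative):

```rust
type BufferId = u64; // stand-in handle type

trait BufferCopy {
    /// On Vulkan: one recorded vkCmdCopyBuffer. On the CPU backend: a memcpy.
    fn copy_buffer(&self, src: BufferId, dst: BufferId, bytes: usize);
}

/// Snapshot the hidden state into the residual buffer without leaving VRAM.
fn snapshot_residual(be: &dyn BufferCopy, hidden: BufferId, residual: BufferId, bytes: usize) {
    be.copy_buffer(hidden, residual, bytes);
}
```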
First End-to-End Test: Four Bugs
With all the architectural changes in place, I built with --features vulkan and ran TinyLlama. It segfaulted immediately. Fixing it took four rounds.
Bug 1: SiluMul double-application. The SwiGLU activation in the FFN block is supposed to be SiLU(gate) * up — one fused operation. The silu_mul.comp shader does this correctly. But gpu_runner.rs was calling it twice: once with gate as both inputs (computing SiLU(gate) * gate), then again to multiply by up (which applied SiLU a second time). The CPU reference does it in one step. The fix was a single fused dispatch: SiluMul(gate, up, gate).
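For reference, the intended semantics in plain Rust (a sketch of what the fused dispatch computes, not the shader itself):

```rust
// SiLU(x) = x * sigmoid(x); the fused op applies it once to `gate`, then
// multiplies elementwise by `up`, writing the result back into `gate`.
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

fn silu_mul(gate: &mut [f32], up: &[f32]) {
    for (g, &u) in gate.iter_mut().zip(up) {
        // The buggy two-dispatch path effectively computed silu(silu(*g) * *g) * u.
        *g = silu(*g) * u;
    }
}
```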
Bug 2: Missing Q6_K shader. TinyLlama Q4_K_M uses mixed quantization — Q4_K for most weights, but Q6_K for sensitive tensors like attn_v and ffn_down. The GPU backend only had shaders for Q4_0, Q4_K, and Q8_0. When it hit a Q6_K tensor, the error propagated up and crashed during cleanup. Wrote a new qmv_q6_k.comp shader — Q6_K is 210 bytes per super-block (256 weights) with a split layout: ql[128] for low 4 bits, qh[64] for high 2 bits, scales[16] as int8, and d as f16. The dequant processes two halves of 128 weights, extracting four groups of 32 from each (ql, qh) pair, then subtracting 32 for the symmetric bias.
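A CPU sketch of the per-super-block dequant, following the layout just described and the ggml reference dequantization. It illustrates what the shader computes, it isn't the shader; the f16 super-block scale is assumed to be converted to f32 by the caller.

```rust
/// Dequantize one Q6_K super-block: 256 weights from 128 ql bytes (low 4 bits),
/// 64 qh bytes (high 2 bits), 16 int8 scales, and the super-block scale `d`.
fn dequant_q6_k_block(ql: &[u8; 128], qh: &[u8; 64], scales: &[i8; 16], d: f32, out: &mut [f32; 256]) {
    for half in 0..2 {
        // Each half covers 128 weights: 64 ql bytes, 32 qh bytes, 8 scales.
        let (ql, qh, sc) = (&ql[half * 64..], &qh[half * 32..], &scales[half * 8..]);
        let y = &mut out[half * 128..];
        for l in 0..32 {
            let is = l / 16;
            // Four groups of 32 weights share each (ql, qh) byte pair: low 4 bits
            // from ql, high 2 bits from qh, then subtract 32 for the symmetric bias.
            let q1 = ((ql[l] & 0xF) | ((qh[l] & 3) << 4)) as i8 - 32;
            let q2 = ((ql[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) as i8 - 32;
            let q3 = ((ql[l] >> 4) | (((qh[l] >> 4) & 3) << 4)) as i8 - 32;
            let q4 = ((ql[l + 32] >> 4) | (((qh[l] >> 6) & 3) << 4)) as i8 - 32;
            y[l] = d * sc[is] as f32 * q1 as f32;
            y[l + 32] = d * sc[is + 2] as f32 * q2 as f32;
            y[l + 64] = d * sc[is + 4] as f32 * q3 as f32;
            y[l + 96] = d * sc[is + 6] as f32 * q4 as f32;
        }
    }
}
```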
Bug 3: Descriptor pool exhaustion. Each forward pass allocates ~2,400 descriptor sets (22 layers × ~110 sets per layer, dominated by 32-head attention). After 3 forward passes the pool was full. The pool had FREE_DESCRIPTOR_SET but I never freed anything. Added reset_descriptor_pool() — called at the start of each forward pass after the previous fence wait guarantees all sets are stale.
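If the backend sits on the ash bindings (an assumption, as are the names here), the reset is a single call, and it is safe precisely because the fence wait means no set allocated in the previous pass can still be in use by the GPU.

```rust
use ash::vk;

/// Return every descriptor set allocated from `pool` in one shot. Called at
/// the start of a forward pass, after waiting on the previous pass's fence.
unsafe fn reset_descriptor_pool(device: &ash::Device, pool: vk::DescriptorPool) {
    device
        .reset_descriptor_pool(pool, vk::DescriptorPoolResetFlags::empty())
        .expect("vkResetDescriptorPool failed");
}
```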
Bug 4: Vulkan drop order segfault. The VulkanBackend struct had fields ordered as vk_device, memory, pipelines, commands. Rust drops in declaration order. So vk_device (which destroys the logical device) was dropped first, then memory, pipelines, and commands tried to destroy their Vulkan handles on an already-dead device. Reordered to commands, pipelines, memory, vk_device — child objects die before parents.
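A compile-time illustration of the rule; the field types are placeholder stubs, and only the declaration order matters.

```rust
// Stand-ins for the real wrapper types.
struct CommandEngine;
struct Pipelines;
struct MemoryManager;
struct VkDevice;

// Rust drops struct fields in declaration order, so every child object must
// be declared (and therefore destroyed) before the logical device it came from.
struct VulkanBackend {
    commands: CommandEngine, // dropped first: command pool and buffers
    pipelines: Pipelines,    // then pipelines / descriptor machinery
    memory: MemoryManager,   // then buffers and device memory
    vk_device: VkDevice,     // dropped last: destroys the logical device itself
}
```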
Benchmarks
TinyLlama 1.1B Q4_0, 50 tokens generated, temperature 0:
| Backend | Prefill (tok/s) | Decode (tok/s) | Speedup |
|---|---|---|---|
| CPU (AVX2) | 2.9 | 2.83 | 1.0x |
| Vulkan | 24.3 | 23.99 | 8.5x |
That 8.5x is on a first-pass implementation with no kernel fusion, no cooperative matrix, and single-threaded attention dispatches. The theoretical peak for this GPU at this model size is higher — but this already validates that moving everything to VRAM works.
What Changed
Infrastructure:
- ComputeBackend trait: added copy_buffer, copy_buffer_offset, attn_score, attn_value. Updated rope with head count params.
- VulkanBackend: batched command API (begin_batch, dispatch_batched, barrier, submit_batch).
- CommandEngine: offset-aware buffer copies.
- Pipeline push constant sizes updated for the three modified shaders.
Shaders (3 recompiled + 1 new):
- rope.comp — multi-head rotation with interleaved pair layout
- attn_score.comp — kv_stride push constant for GQA cache indexing
- attn_value.comp — kv_stride + out_offset for GQA multi-head output
- qmv_q6_k.comp — new, Q6_K quantized matmul (210B super-blocks, split ql/qh layout)
Forward pass (gpu_runner.rs):
- Single command buffer submit for entire forward pass
- KV cache stays in VRAM — offset copies only
- Attention runs entirely on GPU with per-head GQA dispatch
- Buffer copies are GPU-to-GPU
- SwiGLU fixed: single fused SiLU(gate) * up dispatch
- Descriptor pool reset between forward passes
Vulkan lifecycle:
- Drop order fixed: commands → pipelines → memory → vk_device (children before parents)
- Descriptor pool reset via vkResetDescriptorPool at start of each forward pass
New shader: ShaderKey::QmvQ6K registered in pipeline (12 total shaders now). Supports Q4_K_M mixed quantization models.
Tests: 7 new tests covering copy_buffer, copy_buffer_offset (including KV cache pattern), multi-head RoPE, non-zero position RoPE, full attention pipeline, and GQA attention with 2:1 head ratio. Total test count: 98.
What's Left
- Cooperative matrix — VK_KHR_cooperative_matrix for tensor core access. The quantized GEMV shaders are the hot path and would benefit most.
- Staging buffer reuse — embedding upload and logits download allocate fresh staging buffers every token. Could pre-allocate at init.
- More quant shaders — Q5_K, Q3_K, Q2_K for broader model support.
- Correctness validation — need a model that produces coherent output to diff CPU vs Vulkan token-by-token. The Q4_0 TinyLlama produces gibberish on both backends (model quality issue, not code), so correctness can't be verified yet.