Vulkan GPU Backend — Moving the Math to VRAM
GPU wins on bandwidth, not compute. Time to move everything to VRAM.
CPU inference works. It's correct, well-tested, and reasonably fast with AVX2. But single-token decode is memory-bandwidth-bound (arithmetic intensity = 3.57 FLOPs/byte, vs a ridge point of 1872 on modern GPUs). The GPU wins not because it computes faster — it wins because it reads memory 10-17x faster. DDR5 gives ~102 GB/s. An RTX 3060 gives 960 GB/s. That's the entire argument for GPU inference.
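Concretely: single-token decode has to stream every weight once per token. Assuming, purely for illustration, a ~4 GB quantized model:

- 102 GB/s ÷ 4 GB/token ≈ 25 tokens/s ceiling on DDR5
- 960 GB/s ÷ 4 GB/token ≈ 240 tokens/s ceiling on the GPU

At 3.57 FLOPs/byte the ALUs are mostly idle either way; the memory bus is the whole budget.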
Crate Choice
Three options:
- wgpu — portable, well-maintained, but no cooperative matrix or subgroup operations. Can't access tensor cores. Out.
- vulkano — safe Rust wrapper over Vulkan. Lags the spec by 1-2 years. Out.
- ash — raw Vulkan bindings. Unsafe, verbose, but full access to compute shaders, push constants, descriptor sets, and eventually VK_KHR_cooperative_matrix. This is what I went with.
gpu-allocator handles VRAM allocation (VMA-style). I don't want to write a memory allocator — I want to write inference kernels.
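For flavor, a minimal sketch of a device-local allocation. Struct fields follow recent gpu-allocator releases and shift slightly between versions, so treat this as a shape, not gospel:

```rust
use ash::vk;
use gpu_allocator::vulkan::{Allocation, AllocationCreateDesc, AllocationScheme, Allocator};
use gpu_allocator::MemoryLocation;

// Create a buffer, let gpu-allocator carve out device-local VRAM for it,
// and bind the two. The returned Allocation must stay alive (and later be
// freed through the allocator) as long as the buffer does.
fn alloc_device_local(
    device: &ash::Device,
    allocator: &mut Allocator,
    info: &vk::BufferCreateInfo,
) -> Result<(vk::Buffer, Allocation), Box<dyn std::error::Error>> {
    let buffer = unsafe { device.create_buffer(info, None)? };
    let requirements = unsafe { device.get_buffer_memory_requirements(buffer) };
    let allocation = allocator.allocate(&AllocationCreateDesc {
        name: "weights",
        requirements,
        location: MemoryLocation::GpuOnly, // device-local VRAM
        linear: true,                      // plain buffer, not an optimally-tiled image
        allocation_scheme: AllocationScheme::GpuAllocatorManaged,
    })?;
    unsafe { device.bind_buffer_memory(buffer, allocation.memory(), allocation.offset())? };
    Ok((buffer, allocation))
}
```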
Vulkan Infrastructure
Five modules under yule-gpu/src/vulkan/:
device.rs — Instance creation with validation layers in debug. Physical device enumeration — prefer discrete GPUs, fall back to integrated. Logical device with a single compute queue. is_available() does a lightweight probe without full initialization — the CLI calls this to decide whether --backend auto should use Vulkan.
memory.rs — gpu-allocator wrapper. Allocate device-local VRAM, copy to/from device via staging buffers.
pipeline.rs — SPIR-V loading from include_bytes!(). Descriptor set layouts, pipeline layouts with push constants, compute pipeline creation and caching.
commands.rs — Command buffer allocation, record dispatches with push constants and pipeline barriers, submit + fence wait.
mod.rs — VulkanBackend struct implementing the ComputeBackend trait. Maps high-level operations to shader dispatches.
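The trait's real signatures live in yule-infer and aren't reproduced in this post; here's a hypothetical sketch of the shape, just to anchor the idea:

```rust
// Hypothetical shape of the ComputeBackend trait, not the real signatures.
// The point: the runner speaks tensor ops; the backend decides whether each
// op is an AVX2 loop or a vkCmdDispatch.
struct QTensor; // stand-in for a quantized weight tensor

trait ComputeBackend {
    /// out = W @ x, with W stored quantized (dequant happens inside the kernel).
    fn qmv(&mut self, w: &QTensor, x: &[f32], out: &mut [f32]);
    /// In-place RMS normalization with a learned scale.
    fn rms_norm(&mut self, x: &mut [f32], weight: &[f32], eps: f32);
    /// Fused out[i] = SiLU(gate[i]) * up[i].
    fn silu_mul(&mut self, gate: &mut [f32], up: &[f32]);
}
```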
Compute Shaders
11 GLSL 450 compute shaders, compiled to SPIR-V at dev time via glslc. Pre-compiled .spv files loaded with include_bytes!() — no runtime shader compilation needed.
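Loading them is a few lines of ash. A sketch, assuming ash 0.38's Default-plus-setters builder style and a hypothetical path:

```rust
use std::io::Cursor;
use ash::vk;

// SPIR-V is a stream of u32 words; include_bytes!() gives raw bytes, and
// ash::util::read_spv handles the width and alignment conversion.
fn load_shader(device: &ash::Device) -> Result<vk::ShaderModule, Box<dyn std::error::Error>> {
    let bytes: &[u8] = include_bytes!("shaders/rms_norm.spv"); // path hypothetical
    let words = ash::util::read_spv(&mut Cursor::new(bytes))?;
    let info = vk::ShaderModuleCreateInfo::default().code(&words);
    Ok(unsafe { device.create_shader_module(&info, None)? })
}
```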
Element-wise
- add.comp — residual connections. 256 threads.
- silu_mul.comp — fused SiLU activation with element-wise multiply.
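The fused kernel's semantics, as a scalar Rust reference (the shader does the same thing, one element per invocation):

```rust
/// SiLU(x) = x * sigmoid(x). The fused kernel computes
/// out[i] = silu(gate[i]) * up[i] in one pass, saving a round trip
/// through VRAM for the intermediate activation.
fn silu_mul(gate: &[f32], up: &[f32], out: &mut [f32]) {
    for ((o, &g), &u) in out.iter_mut().zip(gate).zip(up) {
        let silu = g / (1.0 + (-g).exp()); // x * sigmoid(x), rearranged
        *o = silu * u;
    }
}
```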
Normalization
- rms_norm.comp — two-pass in shared memory. Parallel sum of squares, then normalize.
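What the two passes compute, as a scalar reference:

```rust
/// RMSNorm: x[i] <- x[i] / rms(x) * weight[i], where
/// rms(x) = sqrt(mean(x^2) + eps). Pass one is the sum-of-squares
/// reduction; pass two is the scale.
fn rms_norm(x: &mut [f32], weight: &[f32], eps: f32) {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    for (v, &w) in x.iter_mut().zip(weight) {
        *v *= inv_rms * w;
    }
}
```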
Position
- rope.comp — rotary position embedding. 64 threads per head.
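The rotation itself, as a scalar reference. Pair layout conventions differ between implementations; this sketch rotates adjacent pairs, which may not match the shader's layout:

```rust
/// Rotate each (even, odd) pair of a head's query/key vector by an angle
/// that depends on the token position and the pair index. x.len() (the
/// head dimension) must be even; theta_base is typically 10000.0.
fn rope(x: &mut [f32], pos: usize, theta_base: f32) {
    let d = x.len();
    for i in 0..d / 2 {
        let freq = theta_base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}
```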
Attention
- softmax.comp — three-pass stable softmax (max, exp+sum, normalize). Scalar reference after this list.
- attn_score.comp — Q@K^T dot products. One workgroup per position.
- attn_value.comp — weighted V aggregation. One workgroup per output dimension.
- embed_lookup.comp — token embedding dequant.
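The three softmax passes, as a scalar reference:

```rust
/// Numerically stable softmax, mirroring the shader's structure:
/// 1) find the max, 2) exponentiate shifted values while accumulating
/// the sum, 3) normalize. Subtracting the max keeps exp() from overflowing.
fn softmax(x: &mut [f32]) {
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0;
    for v in x.iter_mut() {
        *v = (*v - max).exp();
        sum += *v;
    }
    for v in x.iter_mut() {
        *v /= sum;
    }
}
```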
Quantized GEMV (the hot path)
- qmv_q4_0.comp — custom unpack_f16() because GLSL 450 has no native f16 load. 256 threads per row, shared memory reduction.
- qmv_q4_k.comp — most complex shader. 6-bit scale extraction from the sharded 12-byte scales array. Matches the corrected extraction logic from my research.
- qmv_q8_0.comp — simplest quantized kernel. Sign extension via int(qs) - 256 * int(qs > 127).
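That Q8_0 trick is worth unpacking: the shader reads bytes as unsigned, so anything above 127 is really a negative two's-complement value and needs 256 subtracted. A quick Rust check that the arithmetic form matches a native signed cast:

```rust
/// GLSL-side trick: int(qs) - 256 * int(qs > 127).
/// Equivalent to reinterpreting the unsigned byte as a signed one.
fn sign_extend(qs: u8) -> i32 {
    i32::from(qs) - 256 * i32::from(qs > 127)
}

fn main() {
    for qs in 0..=255u8 {
        assert_eq!(sign_extend(qs), qs as i8 as i32);
    }
    println!("sign extension matches the native cast for all 256 bytes");
}
```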
Quantized weights stored as raw bytes in SSBO. Dequant happens inside the shader — no VRAM bloat from f32 expansion.
GpuTransformerRunner
Hybrid CPU/GPU forward pass right now. The split:
On CPU: embedding lookup (single row, negligible), attention score/value computation (GQA grouping makes GPU dispatch complex — next phase), KV cache writes, sampling.
On GPU: RMSNorm, all QKV/output/gate/up/down projections, RoPE, SiLU multiply, residual add, final norm + output projection.
Weight upload happens at init — all quantized weight tensors go to VRAM as raw bytes. The shaders dequantize during the dot product, same as the CPU path.
CLI Wiring
--backend auto|cpu|vulkan on yule run and yule serve.
- auto — probe for a Vulkan device. If found, use it. Otherwise CPU.
- cpu — force the CPU TransformerRunner.
- vulkan — force the GPU GpuTransformerRunner. Errors if unavailable.
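Modeled the obvious way with clap's derive API (names hypothetical; this is the shape, not the actual CLI code):

```rust
use clap::ValueEnum;

/// Hypothetical mirror of the --backend flag.
#[derive(Clone, Copy, Debug, ValueEnum)]
enum Backend {
    /// Probe for a Vulkan device; fall back to CPU.
    Auto,
    /// Always use the CPU TransformerRunner.
    Cpu,
    /// Always use the GPU runner; error if no device exists.
    Vulkan,
}

fn use_vulkan(backend: Backend, vulkan_available: bool) -> Result<bool, String> {
    match backend {
        Backend::Auto => Ok(vulkan_available),
        Backend::Cpu => Ok(false),
        Backend::Vulkan if vulkan_available => Ok(true),
        Backend::Vulkan => Err("no Vulkan device available".into()),
    }
}
```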
Feature flag chain: yule-cli/vulkan → yule-gpu/vulkan + yule-infer/vulkan. Default builds compile out the entire Vulkan path — no ash or gpu-allocator in the dependency tree.
The Shader Compilation Story
Needed glslc to compile .comp → .spv. It doesn't ship with Rust. Tried in order:
- glslc — not installed
- glslangValidator — not installed
- pip install glslang — failed
- shaderc as a cargo build dep — needed cmake, then ninja
- Installed Vulkan SDK via winget install KhronosGroup.VulkanSDK

Found glslc at /c/VulkanSDK/1.4.341.1/Bin/glslc.exe. Compiled all 11 shaders. Created shaders/compile.sh for future recompilation.
The .spv files are checked in. build.rs has rerun-if-changed directives but does NOT compile shaders — that's a manual step. I considered compile-time shader compilation via the shaderc crate but the cmake/ninja dependency chain was too heavy for a build requirement.
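So the build script's only job is cache invalidation. Roughly (a sketch, not the actual build.rs):

```rust
// build.rs (sketch). Tells Cargo to rebuild when anything under shaders/
// changes, but deliberately never invokes glslc: recompiling the .spv
// files stays a manual shaders/compile.sh step.
fn main() {
    println!("cargo:rerun-if-changed=shaders");
}
```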
What's Left
Everything deferred above was resolved in the follow-up GPU Performance work: GPU attention with GQA, KV cache offset writes, a single command buffer submit, GPU-side buffer copies, and a multi-head RoPE fix. Still remaining: cooperative matrix (VK_KHR_cooperative_matrix) and benchmarks.