Vulkan GPU Backend — Moving the Math to VRAM
GPU wins on bandwidth, not compute. Time to move everything to VRAM.
CPU inference works. It's correct, well-tested, and reasonably fast with AVX2. But single-token decode is memory-bandwidth-bound (arithmetic intensity = 3.57 FLOPs/byte, vs a ridge point of 1872 on modern GPUs). The GPU wins not because it computes faster — it wins because it reads memory 10-17x faster. DDR5 gives ~102 GB/s. An RTX 3060 gives 960 GB/s. That's the entire argument for GPU inference.
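Concretely: single-token decode has to stream every weight once per token. Assuming, purely for illustration, a ~4 GB quantized model:

- 102 GB/s ÷ 4 GB/token ≈ 25 tokens/s ceiling on DDR5
- 960 GB/s ÷ 4 GB/token ≈ 240 tokens/s ceiling on the GPU

At 3.57 FLOPs/byte the ALUs are mostly idle either way; the memory bus is the whole budget.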
Crate Choice
Three options:
- wgpu — portable, well-maintained, but no cooperative matrix or subgroup operations. Can't access tensor cores. Out.
- vulkano — safe Rust wrapper over Vulkan. Lags the spec by 1-2 years. Out.
- ash — raw Vulkan bindings. Unsafe, verbose, but full access to compute shaders, push constants, descriptor sets, and eventually VK_KHR_cooperative_matrix. This is what I went with.
gpu-allocator handles VRAM allocation (VMA-style). I don't want to write a memory allocator — I want to write inference kernels.
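For flavor, a minimal sketch of a device-local allocation. Struct fields follow recent gpu-allocator releases and shift slightly between versions, so treat this as a shape, not gospel:

```rust
use ash::vk;
use gpu_allocator::vulkan::{Allocation, AllocationCreateDesc, AllocationScheme, Allocator};
use gpu_allocator::MemoryLocation;

// Create a buffer, let gpu-allocator carve out device-local VRAM for it,
// and bind the two. The returned Allocation must stay alive (and later be
// freed through the allocator) as long as the buffer does.
fn alloc_device_local(
    device: &ash::Device,
    allocator: &mut Allocator,
    info: &vk::BufferCreateInfo,
) -> Result<(vk::Buffer, Allocation), Box<dyn std::error::Error>> {
    let buffer = unsafe { device.create_buffer(info, None)? };
    let requirements = unsafe { device.get_buffer_memory_requirements(buffer) };
    let allocation = allocator.allocate(&AllocationCreateDesc {
        name: "weights",
        requirements,
        location: MemoryLocation::GpuOnly, // device-local VRAM
        linear: true,                      // plain buffer, not an optimally-tiled image
        allocation_scheme: AllocationScheme::GpuAllocatorManaged,
    })?;
    unsafe { device.bind_buffer_memory(buffer, allocation.memory(), allocation.offset())? };
    Ok((buffer, allocation))
}
```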
Vulkan Infrastructure
Five modules under yule-gpu/src/vulkan/:
device.rs — Instance creation with validation layers in debug. Physical device enumeration — prefer discrete GPUs, fall back to integrated. Logical device with a single compute queue. is_available() does a lightweight probe without full initialization — the CLI calls this to decide whether --backend auto should use Vulkan.
memory.rs — gpu-allocator wrapper. Allocate device-local VRAM, copy to/from device via staging buffers.
pipeline.rs — SPIR-V loading from include_bytes!(). Descriptor set layouts, pipeline layouts with push constants, compute pipeline creation and caching.
commands.rs — Command buffer allocation, record dispatches with push constants and pipeline barriers, submit + fence wait.
mod.rs — VulkanBackend struct implementing the ComputeBackend trait. Maps high-level operations to shader dispatches.
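The trait's real signatures live in yule-infer and aren't reproduced in this post; here's a hypothetical sketch of the shape, just to anchor the idea:

```rust
// Hypothetical shape of the ComputeBackend trait, not the real signatures.
// The point: the runner speaks tensor ops; the backend decides whether each
// op is an AVX2 loop or a vkCmdDispatch.
struct QTensor; // stand-in for a quantized weight tensor

trait ComputeBackend {
    /// out = W @ x, with W stored quantized (dequant happens inside the kernel).
    fn qmv(&mut self, w: &QTensor, x: &[f32], out: &mut [f32]);
    /// In-place RMS normalization with a learned scale.
    fn rms_norm(&mut self, x: &mut [f32], weight: &[f32], eps: f32);
    /// Fused out[i] = SiLU(gate[i]) * up[i].
    fn silu_mul(&mut self, gate: &mut [f32], up: &[f32]);
}
```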
Compute Shaders
11 GLSL 450 compute shaders, compiled to SPIR-V at dev time via glslc. Pre-compiled .spv files loaded with include_bytes!() — no runtime shader compilation needed.
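Loading them is a few lines of ash. A sketch, assuming ash 0.38's Default-plus-setters builder style and a hypothetical path:

```rust
use std::io::Cursor;
use ash::vk;

// SPIR-V is a stream of u32 words; include_bytes!() gives raw bytes, and
// ash::util::read_spv handles the width and alignment conversion.
fn load_shader(device: &ash::Device) -> Result<vk::ShaderModule, Box<dyn std::error::Error>> {
    let bytes: &[u8] = include_bytes!("shaders/rms_norm.spv"); // path hypothetical
    let words = ash::util::read_spv(&mut Cursor::new(bytes))?;
    let info = vk::ShaderModuleCreateInfo::default().code(&words);
    Ok(unsafe { device.create_shader_module(&info, None)? })
}
```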
Element-wise
- add.comp — residual connections. 256 threads.
- silu_mul.comp — fused SiLU activation with element-wise multiply.
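The fused kernel's semantics, as a scalar Rust reference (the shader does the same thing, one element per invocation):

```rust
/// SiLU(x) = x * sigmoid(x). The fused kernel computes
/// out[i] = silu(gate[i]) * up[i] in one pass, saving a round trip
/// through VRAM for the intermediate activation.
fn silu_mul(gate: &[f32], up: &[f32], out: &mut [f32]) {
    for ((o, &g), &u) in out.iter_mut().zip(gate).zip(up) {
        let silu = g / (1.0 + (-g).exp()); // x * sigmoid(x), rearranged
        *o = silu * u;
    }
}
```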
Normalization
- rms_norm.comp — two-pass in shared memory. Parallel sum of squares, then normalize.
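What the two passes compute, as a scalar reference:

```rust
/// RMSNorm: x[i] <- x[i] / rms(x) * weight[i], where
/// rms(x) = sqrt(mean(x^2) + eps). Pass one is the sum-of-squares
/// reduction; pass two is the scale.
fn rms_norm(x: &mut [f32], weight: &[f32], eps: f32) {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    for (v, &w) in x.iter_mut().zip(weight) {
        *v *= inv_rms * w;
    }
}
```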
Position
- rope.comp — rotary position embedding. 64 threads per head.
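The rotation itself, as a scalar reference. Pair layout conventions differ between implementations; this sketch rotates adjacent pairs, which may not match the shader's layout:

```rust
/// Rotate each (even, odd) pair of a head's query/key vector by an angle
/// that depends on the token position and the pair index. x.len() (the
/// head dimension) must be even; theta_base is typically 10000.0.
fn rope(x: &mut [f32], pos: usize, theta_base: f32) {
    let d = x.len();
    for i in 0..d / 2 {
        let freq = theta_base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}
```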
Attention
- softmax.comp — three-pass stable softmax (max, exp+sum, normalize). Scalar reference after this list.
- attn_score.comp — Q@K^T dot products. One workgroup per position.
- attn_value.comp — weighted V aggregation. One workgroup per output dimension.
- embed_lookup.comp — token embedding dequant.
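The three softmax passes, as a scalar reference:

```rust
/// Numerically stable softmax, mirroring the shader's structure:
/// 1) find the max, 2) exponentiate shifted values while accumulating
/// the sum, 3) normalize. Subtracting the max keeps exp() from overflowing.
fn softmax(x: &mut [f32]) {
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0;
    for v in x.iter_mut() {
        *v = (*v - max).exp();
        sum += *v;
    }
    for v in x.iter_mut() {
        *v /= sum;
    }
}
```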
Quantized GEMV (the hot path)
- qmv_q4_0.comp — custom unpack_f16() because GLSL 450 has no native f16 load. 256 threads per row, shared memory reduction.
- qmv_q4_k.comp — most complex shader. 6-bit scale extraction from the sharded 12-byte scales array. Matches the corrected extraction logic from my research.
- qmv_q8_0.comp — simplest quantized kernel. Sign extension via int(qs) - 256 * int(qs > 127).
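That Q8_0 trick is worth unpacking: the shader reads bytes as unsigned, so anything above 127 is really a negative two's-complement value and needs 256 subtracted. A quick Rust check that the arithmetic form matches a native signed cast:

```rust
/// GLSL-side trick: int(qs) - 256 * int(qs > 127).
/// Equivalent to reinterpreting the unsigned byte as a signed one.
fn sign_extend(qs: u8) -> i32 {
    i32::from(qs) - 256 * i32::from(qs > 127)
}

fn main() {
    for qs in 0..=255u8 {
        assert_eq!(sign_extend(qs), qs as i8 as i32);
    }
    println!("sign extension matches the native cast for all 256 bytes");
}
```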
Quantized weights stored as raw bytes in SSBO. Dequant happens inside the shader — no VRAM bloat from f32 expansion.
GpuTransformerRunner
Hybrid CPU/GPU forward pass right now. The split:
On CPU: embedding lookup (single row, negligible), attention score/value computation (GQA grouping makes GPU dispatch complex — next phase), KV cache writes, sampling.
On GPU: RMSNorm, all QKV/output/gate/up/down projections, RoPE, SiLU multiply, residual add, final norm + output projection.
Weight upload happens at init — all quantized weight tensors go to VRAM as raw bytes. The shaders dequantize during the dot product, same as the CPU path.
CLI Wiring
--backend auto|cpu|vulkan on yule run and yule serve.
- auto — probe for a Vulkan device. If found, use it. Otherwise CPU.
- cpu — force the CPU TransformerRunner.
- vulkan — force the GPU GpuTransformerRunner. Errors if unavailable.
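Modeled the obvious way with clap's derive API (names hypothetical; this is the shape, not the actual CLI code):

```rust
use clap::ValueEnum;

/// Hypothetical mirror of the --backend flag.
#[derive(Clone, Copy, Debug, ValueEnum)]
enum Backend {
    /// Probe for a Vulkan device; fall back to CPU.
    Auto,
    /// Always use the CPU TransformerRunner.
    Cpu,
    /// Always use the GPU runner; error if no device exists.
    Vulkan,
}

fn use_vulkan(backend: Backend, vulkan_available: bool) -> Result<bool, String> {
    match backend {
        Backend::Auto => Ok(vulkan_available),
        Backend::Cpu => Ok(false),
        Backend::Vulkan if vulkan_available => Ok(true),
        Backend::Vulkan => Err("no Vulkan device available".into()),
    }
}
```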
Feature flag chain: yule-cli/vulkan → yule-gpu/vulkan + yule-infer/vulkan. Default builds compile out the entire Vulkan path — no ash or gpu-allocator in the dependency tree.
The Shader Compilation Story
Needed glslc to compile .comp → .spv. It doesn't ship with Rust. Tried in order:
- glslc — not installed
- glslangValidator — not installed
- pip install glslang — failed
- shaderc as a cargo build dep — needed cmake, then ninja
- Installed Vulkan SDK via winget install KhronosGroup.VulkanSDK

Found glslc at /c/VulkanSDK/1.4.341.1/Bin/glslc.exe. Compiled all 11 shaders. Created shaders/compile.sh for future recompilation.
The .spv files are checked in. build.rs has rerun-if-changed directives but does NOT compile shaders — that's a manual step. I considered compile-time shader compilation via the shaderc crate but the cmake/ninja dependency chain was too heavy for a build requirement.
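So the build script's only job is cache invalidation. Roughly (a sketch, not the actual build.rs):

```rust
// build.rs (sketch). Tells Cargo to rebuild when anything under shaders/
// changes, but deliberately never invokes glslc: recompiling the .spv
// files stays a manual shaders/compile.sh step.
fn main() {
    println!("cargo:rerun-if-changed=shaders");
}
```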
What's Left
Everything deferred above was resolved in the follow-up GPU Performance work: GPU attention with GQA, KV cache offset writes, a single command buffer submit, GPU-side buffer copies, and a multi-head RoPE fix. Still remaining: cooperative matrix (VK_KHR_cooperative_matrix) and benchmarks.