Vulkan GPU Backend — Moving the Math to VRAM

GPU wins on bandwidth, not compute. Time to move everything to VRAM.

CPU inference works. It's correct, well-tested, and reasonably fast with AVX2. But single-token decode is memory-bandwidth-bound (arithmetic intensity = 3.57 FLOPs/byte, vs a ridge point of 1872 on modern GPUs). The GPU wins not because it computes faster — it wins because it reads memory 10-17x faster. DDR5 gives ~102 GB/s. An RTX 3060 gives 960 GB/s. That's the entire argument for GPU inference.
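
As a sanity check on that 3.57 figure, here is a back-of-envelope derivation under the assumption that it refers to a Q4_0 GEMV; the block layout (32 weights in 18 bytes) is the standard GGUF Q4_0 format, but the attribution to Q4_0 is my guess, not something the numbers above state.

```rust
// Arithmetic intensity of a Q4_0 matrix-vector product (my reconstruction
// of where ~3.57 FLOPs/byte could come from): each Q4_0 block packs
// 32 weights into 18 bytes (16 bytes of 4-bit quants + one f16 scale),
// and each weight costs one multiply-add = 2 FLOPs.
fn main() {
    let flops_per_block = 32.0 * 2.0; // 64 FLOPs per block
    let bytes_per_block = 16.0 + 2.0; // 18 bytes per block
    let intensity = flops_per_block / bytes_per_block;
    println!("{intensity:.2} FLOPs/byte"); // ~3.56, close to the quoted 3.57
}
```

Whatever the exact format, the intensity sits orders of magnitude below any GPU ridge point, so decode throughput is effectively a function of memory bandwidth.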


Crate Choice

Three options:

  • wgpu — portable, well-maintained, but no cooperative matrix or subgroup operations. Can't access tensor cores. Out.
  • vulkano — safe Rust wrapper over Vulkan. Lags the spec by 1-2 years. Out.
  • ash — raw Vulkan bindings. Unsafe, verbose, but full access to compute shaders, push constants, descriptor sets, and eventually VK_KHR_cooperative_matrix. This is what I went with.

gpu-allocator handles VRAM allocation (VMA-style). I don't want to write a memory allocator — I want to write inference kernels.


Vulkan Infrastructure

Five modules under yule-gpu/src/vulkan/:

device.rs — Instance creation with validation layers in debug. Physical device enumeration — prefer discrete GPUs, fall back to integrated. Logical device with a single compute queue. is_available() does a lightweight probe without full initialization — the CLI calls this to decide whether --backend auto should use Vulkan.
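
For flavor, here is roughly what a lightweight probe like that looks like with ash. This is a sketch assuming ash 0.38-style struct builders; the actual device.rs almost certainly differs.

```rust
use ash::vk;

// Minimal Vulkan probe: can we load the loader, create an instance, and
// find at least one physical device? No full initialization needed.
pub fn is_available() -> bool {
    // No loader on the system means no usable Vulkan driver.
    let Ok(entry) = (unsafe { ash::Entry::load() }) else {
        return false;
    };
    let app_info = vk::ApplicationInfo::default().api_version(vk::API_VERSION_1_2);
    let create_info = vk::InstanceCreateInfo::default().application_info(&app_info);
    // A bare instance with no extensions is enough for enumeration.
    let Ok(instance) = (unsafe { entry.create_instance(&create_info, None) }) else {
        return false;
    };
    let found = unsafe { instance.enumerate_physical_devices() }
        .map(|devices| !devices.is_empty())
        .unwrap_or(false);
    unsafe { instance.destroy_instance(None) };
    found
}
```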

memory.rs — gpu-allocator wrapper. Allocate device-local VRAM, copy to/from device via staging buffers.

pipeline.rs — SPIR-V loading from include_bytes!(). Descriptor set layouts, pipeline layouts with push constants, compute pipeline creation and caching.
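
The embed-and-load step is small enough to sketch. The shader path and function name here are illustrative, not yule's actual code:

```rust
use ash::{util::read_spv, vk, Device};
use std::io::Cursor;

// Pre-compiled at dev time by glslc and embedded in the binary, so there
// is no runtime shader compilation and no file I/O.
const ADD_SPV: &[u8] = include_bytes!("shaders/add.spv"); // hypothetical path

unsafe fn load_add_shader(device: &Device) -> Result<vk::ShaderModule, vk::Result> {
    // SPIR-V is a stream of 32-bit words; read_spv handles the conversion.
    let words = read_spv(&mut Cursor::new(ADD_SPV)).expect("embedded SPIR-V is well-formed");
    let info = vk::ShaderModuleCreateInfo::default().code(&words);
    unsafe { device.create_shader_module(&info, None) }
}
```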

commands.rs — Command buffer allocation, record dispatches with push constants and pipeline barriers, submit + fence wait.

mod.rs — VulkanBackend struct implementing the ComputeBackend trait. Maps high-level operations to shader dispatches.
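
To make that boundary concrete, here is a guess at the trait's shape. These signatures are hypothetical, inferred from the operations listed, not read out of yule-infer:

```rust
/// Stand-in for a quantized tensor handle (illustrative only).
pub struct QuantTensor {
    pub bytes: Vec<u8>,
    pub rows: usize,
    pub cols: usize,
}

/// Hypothetical trait boundary; the real ComputeBackend may differ.
pub trait ComputeBackend {
    /// y = dequant(W) @ x, with W stored as raw quantized bytes.
    fn qmv(&mut self, w: &QuantTensor, x: &[f32], y: &mut [f32]);
    /// In-place RMSNorm with a learned scale vector.
    fn rms_norm(&mut self, x: &mut [f32], weight: &[f32], eps: f32);
    /// Residual add: acc[i] += x[i].
    fn add_assign(&mut self, acc: &mut [f32], x: &[f32]);
}
```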


Compute Shaders

11 GLSL 450 compute shaders, compiled to SPIR-V at dev time via glslc. Pre-compiled .spv files loaded with include_bytes!() — no runtime shader compilation needed.

Element-wise

  • add.comp — residual connections. 256 threads.
  • silu_mul.comp — fused SiLU activation with element-wise multiply.

Normalization

  • rms_norm.comp — two-pass in shared memory. Parallel sum of squares, then normalize.
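
For reference, the math those two passes implement, as a CPU sketch. This is standard llama-style RMSNorm; the exact epsilon placement is my assumption:

```rust
// CPU reference for what the two shader passes compute: pass 1 reduces
// sum(x^2), pass 2 scales by 1/sqrt(mean + eps) and applies the weight.
fn rms_norm(x: &mut [f32], weight: &[f32], eps: f32) {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    for (v, w) in x.iter_mut().zip(weight) {
        *v *= inv_rms * w;
    }
}
```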

Position

  • rope.comp — rotary position embedding. 64 threads per head.

Attention

  • softmax.comp — three-pass stable softmax (max, exp+sum, normalize).
  • attn_score.comp — Q@K^T dot products. One workgroup per position.
  • attn_value.comp — weighted V aggregation. One workgroup per output dimension.
  • embed_lookup.comp — token embedding dequant.
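
The three passes in softmax.comp map onto the usual numerically stable formulation. A CPU reference for the same computation:

```rust
// Three-pass stable softmax: pass 1 finds the max, pass 2 exponentiates
// the shifted values and accumulates the sum, pass 3 normalizes.
// Subtracting the max keeps exp() from overflowing on large logits.
fn softmax(x: &mut [f32]) {
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for v in x.iter_mut() {
        *v = (*v - max).exp();
        sum += *v;
    }
    for v in x.iter_mut() {
        *v /= sum;
    }
}
```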

Quantized GEMV (the hot path)

  • qmv_q4_0.comp — custom unpack_f16() because GLSL 450 has no native f16 load. 256 threads per row, shared memory reduction.
  • qmv_q4_k.comp — most complex shader. 6-bit scale extraction from the sharded 12-byte scales array. Matches the corrected extraction logic from my research.
  • qmv_q8_0.comp — simplest quantized kernel. Sign extension via int(qs) - 256 * int(qs > 127).

Quantized weights stored as raw bytes in SSBO. Dequant happens inside the shader — no VRAM bloat from f32 expansion.
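
The qmv_q8_0 sign-extension trick is worth unpacking: GLSL 450 reads bytes out of the SSBO as unsigned values, so anything above 127 has to be folded back into i8 range. A quick CPU check of the same expression:

```rust
// Rust mirror of the shader expression `int(qs) - 256 * int(qs > 127)`:
// reinterprets an unsigned byte as the signed i8 it originally encoded.
fn sign_extend(qs: u32) -> i32 {
    qs as i32 - 256 * (qs > 127) as i32
}

fn main() {
    assert_eq!(sign_extend(0), 0);
    assert_eq!(sign_extend(127), 127);
    assert_eq!(sign_extend(128), -128); // 0x80 wraps to i8::MIN
    assert_eq!(sign_extend(255), -1);   // 0xFF wraps to -1
    println!("matches i8 reinterpretation");
}
```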


GpuTransformerRunner

The forward pass is a CPU/GPU hybrid for now. The split:

On CPU: embedding lookup (single row, negligible), attention score/value computation (GQA grouping makes GPU dispatch complex — next phase), KV cache writes, sampling.

On GPU: RMSNorm, all QKV/output/gate/up/down projections, RoPE, SiLU multiply, residual add, final norm + output projection.

Weight upload happens at init — all quantized weight tensors go to VRAM as raw bytes. The shaders dequantize during the dot product, same as the CPU path.


CLI Wiring

--backend auto|cpu|vulkan on yule run and yule serve.

  • auto — probes for Vulkan device. If found, use it. Otherwise CPU.
  • cpu — force CPU TransformerRunner.
  • vulkan — force GPU GpuTransformerRunner. Errors if unavailable.
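
The selection logic is presumably something like the following sketch; the actual flag parsing and error type in yule-cli will differ:

```rust
// Hypothetical shape of the --backend dispatch (illustrative only).
#[derive(Clone, Copy)]
enum Backend { Auto, Cpu, Vulkan }

fn vulkan_available() -> bool {
    false // stand-in; the real probe is yule-gpu's is_available()
}

fn select_backend(requested: Backend) -> Result<Backend, String> {
    match requested {
        Backend::Cpu => Ok(Backend::Cpu),
        Backend::Vulkan if vulkan_available() => Ok(Backend::Vulkan),
        Backend::Vulkan => Err("--backend vulkan: no Vulkan device available".into()),
        Backend::Auto if vulkan_available() => Ok(Backend::Vulkan),
        Backend::Auto => Ok(Backend::Cpu),
    }
}
```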

Feature flag chain: yule-cli/vulkan → yule-gpu/vulkan + yule-infer/vulkan. Default builds compile out the entire Vulkan path — no ash or gpu-allocator in the dependency tree.
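
The vulkan_available() stub in the earlier sketch would really be cfg-gated. A sketch of the pattern (the crate path is mine, not yule's):

```rust
// With the feature off, nothing here references ash or gpu-allocator,
// so default builds never pull them into the dependency graph.
#[cfg(feature = "vulkan")]
pub fn vulkan_available() -> bool {
    yule_gpu::vulkan::device::is_available() // hypothetical path
}

#[cfg(not(feature = "vulkan"))]
pub fn vulkan_available() -> bool {
    false // --backend auto falls through to CPU
}
```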


The Shader Compilation Story

Needed glslc to compile .comp → .spv. It doesn't ship with Rust. Tried in order:

  1. glslc — not installed
  2. glslangValidator — not installed
  3. pip install glslang — failed
  4. shaderc as a cargo build dep — needed cmake, then ninja
  5. Installed Vulkan SDK via winget install KhronosGroup.VulkanSDK

Found glslc at /c/VulkanSDK/1.4.341.1/Bin/glslc.exe. Compiled all 11 shaders. Created shaders/compile.sh for future recompilation.

The .spv files are checked in. build.rs has rerun-if-changed directives but does NOT compile shaders — that's a manual step. I considered compile-time shader compilation via the shaderc crate but the cmake/ninja dependency chain was too heavy for a build requirement.
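
A build.rs that watches but doesn't compile is only a few lines; this is a guess at the shape, not the checked-in file:

```rust
// Plausible build.rs: invalidate the build when anything under shaders/
// changes, but leave compilation to the manual shaders/compile.sh step.
fn main() {
    println!("cargo:rerun-if-changed=shaders");
}
```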


What's Left — Done

All resolved in the later GPU Performance entry: GPU attention with GQA, KV cache offset writes, single command buffer submit, GPU buffer copies, multi-head RoPE fix. Remaining: cooperative matrix (VK_KHR_cooperative_matrix) and benchmarks.
