
Inference Debugging — Seven Bugs, Zero Compiler Warnings

Going from "NaN logits" to a working forward pass, one painful bug at a time.


The foundation was done — GGUF parser, Merkle verification, CLI. Next up was supposed to be "just wire up the math." I had dequant kernels, a tokenizer, and a model runner. First test: load TinyLlama 1.1B Q4_0, feed it a prompt, get logits.

The logits were all NaN.


Bug #1: GGUF Type ID Mapping

Every single logit was NaN. The embedding lookup was returning garbage because the dequant function was being called with the wrong block size — the type ID mapping didn't match what GGUF actually stores.

My DType enum assigned integer values that didn't match the GGUF spec's type IDs. When the parser read type_id = 2 (Q4_0 in GGUF), my code mapped it to something else entirely. Every tensor was being dequantized with the wrong kernel.

Rewrote the DType enum to match GGUF spec type IDs exactly. Lesson learned: when you're parsing a binary format, the enum values aren't suggestions — they're the spec.
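
For reference, the corrected mapping looks roughly like this. A minimal sketch: the enum name and derives are illustrative, but the discriminants follow the GGUF/ggml type enumeration (the post confirms Q4_0 = 2; note that IDs 4 and 5 are unused):

// Sketch (illustrative names): pin the discriminants to the GGUF type IDs
// instead of letting declaration order decide them. The parser then maps
// the raw u32 from the tensor header through this enum.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u32)]
enum DType {
    F32 = 0,
    F16 = 1,
    Q4_0 = 2,
    Q4_1 = 3,
    Q5_0 = 6,
    Q5_1 = 7,
    Q8_0 = 8,
    Q8_1 = 9,
    Q2_K = 10,
    Q3_K = 11,
    Q4_K = 12,
    Q5_K = 13,
    Q6_K = 14,
    Q8_K = 15,
}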


Bug #2: Q4_K Nibble Layout

After fixing type IDs, logits were no longer NaN but completely wrong. Wrote a Python reference dequant, compared block-by-block with Rust output for the same tensor data.

Q4_K uses a grouped nibble layout, not interleaved. The low nibble of each byte gives the first 32 weights, and the high nibble gives the next 32. I was doing interleaved (even/odd indices from the same byte go to adjacent weights), which is how Q4_0 works but not Q4_K.

// WRONG (interleaved, like Q4_0):
nibble = if i % 2 == 0 { byte & 0xF } else { byte >> 4 }

// RIGHT (grouped):
q = ql[l] & 0xF    // first 32 weights from low nibbles
q = ql[l] >> 4     // next 32 weights from high nibbles
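
Spelled out as a loop, the grouped layout looks roughly like this. A sketch with illustrative names; the per-sub-block scales and mins that Q4_K also carries are applied elsewhere:

// Sketch: unpack one 64-weight group of a Q4_K super-block from its
// 32 packed bytes. Low nibbles give the first 32 weights, high nibbles
// the next 32. Scales/mins handled elsewhere; names illustrative.
fn unpack_q4k_group(ql: &[u8]) -> [u8; 64] {
    let mut out = [0u8; 64];
    for l in 0..32 {
        out[l] = ql[l] & 0x0F;       // weights 0..32: low nibbles
        out[l + 32] = ql[l] >> 4;    // weights 32..64: high nibbles
    }
    out
}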

Bug #3: Q6_K Scale Sub-grouping

Q6_K tensors (used for the output projection) gave wrong values. Compared dequant output against the ggml reference code: values were in the right ballpark but consistently off. Classic scale mismatch.

Q6_K has 16 int8 scales for 256 weights, but they're not applied uniformly. The ggml reference uses is = l / 16 to index into sub-groups of scales. I was using the same scale for all 32 iterations of the inner loop. Added the is = l / 16 offset to scale indexing in both dequant and vec_dot.
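
The indexing change, boiled down. A simplified sketch with illustrative names; the 6-bit quantized values are assumed to be already unpacked into q here, and the real kernel does more per iteration:

// Sketch: apply Q6_K scales with the is = l / 16 sub-group offset.
// Before the fix, scales[scale_base] was used for all 32 weights.
fn apply_q6k_scales(d: f32, scales: &[i8], scale_base: usize, q: &[i8; 32], out: &mut [f32; 32]) {
    for l in 0..32 {
        let is = l / 16; // 0 for weights 0..16, 1 for weights 16..32
        out[l] = d * f32::from(scales[scale_base + is]) * f32::from(q[l]);
    }
}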


Bug #4: Q5_K High-Bit Mask Pattern

Q5_K stores a 5th bit per weight in a separate qh byte array, using a shifting mask pattern. The mask starts at u1=1, u2=2 and shifts left by 2 for each group of 64 weights. I had the mask pattern wrong. Rewrote the dequant with the correct grouped layout and shifting qh masks.
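
A sketch of how that layout reads, scales and mins omitted and names illustrative, following the grouped layout and shifting masks described above:

// Sketch: reconstruct the 5-bit values for a 256-weight Q5_K super-block.
// ql holds packed low nibbles, qh the fifth bits. The two mask bits
// (u1, u2) shift left by 2 for every group of 64 weights.
fn unpack_q5k_values(ql: &[u8], qh: &[u8]) -> [u8; 256] {
    let mut out = [0u8; 256];
    let (mut u1, mut u2) = (1u8, 2u8);
    for g in 0..4 {
        for l in 0..32 {
            let lo = ql[32 * g + l] & 0x0F;
            let hi = ql[32 * g + l] >> 4;
            out[64 * g + l] = lo + if qh[l] & u1 != 0 { 16 } else { 0 };
            out[64 * g + l + 32] = hi + if qh[l] & u2 != 0 { 16 } else { 0 };
        }
        u1 <<= 2;
        u2 <<= 2;
    }
    out
}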


Bug #5: Q8_K Scale Is f32, Not f16

Looked at the actual block layout. Q8_K is 292 bytes: 4 bytes d + 256 bytes qs + 32 bytes bsums. I was reading d as f16 (2 bytes) and starting qs at offset 2, but Q8_K's d is f32 (4 bytes), so qs starts at offset 4. The result: two bytes of the float were read as the scale, and every quantized value after it was misaligned. Changed read_f16(block, 0) to f32::from_le_bytes at offset 0, with qs starting at offset 4.
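
The corrected read, roughly. A sketch; the function name is illustrative and bsums handling is omitted:

// Sketch: dequantize one Q8_K block (292 bytes = 4-byte f32 d +
// 256 i8 qs + 32 bytes bsums). The bug was reading d as an f16 at
// offset 0 and starting qs at offset 2.
fn dequant_q8k_block(block: &[u8], out: &mut [f32; 256]) {
    let d = f32::from_le_bytes([block[0], block[1], block[2], block[3]]);
    let qs = &block[4..4 + 256];
    for (i, &q) in qs.iter().enumerate() {
        out[i] = d * f32::from(q as i8);
    }
}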


Bug #6: SentencePiece Tokenizer — Greedy vs BPE

With dequant fixed, logits were valid numbers but the model was producing gibberish. "ricericericericrice..."

Printed my token IDs and compared with llama-cpp-python for the same input string. They were different. My tokenizer used greedy longest-match: scan the vocabulary for the longest prefix, emit it, advance. SentencePiece doesn't work that way — it splits the input into individual UTF-8 codepoints first, then merges adjacent tokens upward based on merge scores.

The difference: greedy found ▁capital as a single token (id=7483). SentencePiece split it into ▁, c, a, p, i, t, a, l, then merged upward into three tokens: ▁ca, pi, tal.

Rewrote encode_bpe() to prepend a space, replace spaces with ▁ (SentencePiece normalization), split into individual UTF-8 codepoints, then apply score-based BPE merges upward. After the fix, tokenization matched llama-cpp-python exactly for every test string.
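
The merge loop itself is short once the pieces are split out. A rough sketch of the idea: the function name matches the post, but the vocab shape (piece string to id and merge score) is an assumption, and byte-fallback and unknown-piece handling are omitted:

use std::collections::HashMap;

// Sketch of score-based SentencePiece-style BPE: start from individual
// codepoints and repeatedly merge the adjacent pair whose combined piece
// has the highest vocab score, until no merge applies.
// `vocab` maps piece -> (token id, score); illustrative only.
fn encode_bpe(text: &str, vocab: &HashMap<String, (u32, f32)>) -> Vec<u32> {
    // SentencePiece normalization: leading space, spaces become '▁' (U+2581).
    let normalized = format!(" {}", text).replace(' ', "\u{2581}");
    let mut pieces: Vec<String> = normalized.chars().map(|c| c.to_string()).collect();

    loop {
        // Find the adjacent pair whose merged piece has the best score.
        let mut best: Option<(usize, f32)> = None;
        for i in 0..pieces.len().saturating_sub(1) {
            let merged = format!("{}{}", pieces[i], pieces[i + 1]);
            if let Some(&(_, score)) = vocab.get(&merged) {
                if best.map_or(true, |(_, s)| score > s) {
                    best = Some((i, score));
                }
            }
        }
        match best {
            Some((i, _)) => {
                let merged = format!("{}{}", pieces[i], pieces[i + 1]);
                pieces[i] = merged;
                pieces.remove(i + 1);
            }
            None => break,
        }
    }

    // Map final pieces to ids (unknown handling omitted).
    pieces
        .iter()
        .filter_map(|p| vocab.get(p).map(|&(id, _)| id))
        .collect()
}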


Bug #7: ff_dim Was 2048 Instead of 5632

Layer 0 output was wrong after the FFN block. Attention output matched the Python reference perfectly, but post-FFN hidden state diverged.

Wrote a Python script that does the full forward pass for one token through one layer. Compared Rust vs Python at each stage: embedding, attention norm, QKV, RoPE, attention, output projection — all matched. Then FFN: broke.

GGUF tensor shapes are [ne[0], ne[1]] where ne[0] is the contiguous/inner dimension. For the gate weight [dim, ff_dim], shape[0] = dim = 2048 (input features) and shape[1] = ff_dim = 5632 (output features). I used shape[0] to get ff_dim. Got 2048. The FFN was computing 2048 outputs instead of 5632.

GGUF shape convention is the opposite of what you'd expect from PyTorch's [out_features, in_features] weight layout. The shape array is [in_features, out_features], and ne[0] is always the fast/contiguous dimension.
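
In code the fix is a one-liner. A sketch with illustrative names:

// Sketch: GGUF stores shapes as [ne0, ne1] with ne0 the contiguous
// in_features axis, so ff_dim comes from shape[1], not shape[0].
fn ffn_dims(gate_shape: &[usize]) -> (usize, usize) {
    let dim = gate_shape[0];    // in_features, e.g. 2048 for TinyLlama
    let ff_dim = gate_shape[1]; // out_features, e.g. 5632 (the bug read [0])
    (dim, ff_dim)
}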


Validation

I wrote a series of Python scripts implementing the exact same math as the Rust code, using only numpy and the raw GGUF file:

  1. validate.py — Embedding lookup, basic sanity checks
  2. validate_layer.py — Full layer 0 (attention + FFN) for BOS token
  3. validate_full.py — All 22 layers for BOS token
  4. validate_pos1.py — 2-token attention in layer 0
  5. validate_pos1_full.py — All 22 layers for 2-token sequence
  6. validate_logits.py — llama-cpp-python reference logit comparison

The methodology: run both Python and Rust with identical inputs, compare at every intermediate point. When they diverge, binary search for the exact operation that breaks.
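
On the Rust side, that comparison only needs a helper that reports the first element where the two runs disagree beyond a tolerance. A sketch; the reference values come from the Python scripts:

// Sketch: locate the first element where the Rust activations diverge
// from the Python reference beyond an absolute tolerance.
fn first_divergence(rust: &[f32], reference: &[f32], tol: f32) -> Option<(usize, f32, f32)> {
    for (i, (&a, &b)) in rust.iter().zip(reference.iter()).enumerate() {
        if (a - b).abs() > tol {
            return Some((i, a, b));
        }
    }
    None
}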

Final results: at both pos=0 and pos=1, all 22 layers matched to 6+ decimal places, and hidden state norms matched exactly.


The Remaining Divergence

After all fixes, the model still produces different top predictions than llama-cpp-python for some prompts. Spent a while chasing this before concluding it's not a bug:

  1. llama.cpp uses SIMD (AVX2/NEON) with different accumulation order. Over 30 tokens × 22 layers × thousands of dot products, small f32 precision differences compound.
  2. llama-cpp-python uses CPU_REPACK — a runtime tensor repacking optimization that changes the memory layout and computation path for Q4_0 tensors.
  3. TinyLlama 1.1B at Q4_0 is inherently noisy. With 4-bit weights and only 1.1B parameters, the model's confidence on any given token is low.

The forward pass math is correct. The model generates coherent text. Time to make it fast.


Bug Summary

Bug                          | Component       | Severity              | Detection Method
Type ID mapping              | GGUF parser     | Critical (NaN)        | First inference attempt
Q4_K nibble layout           | dequant.rs      | High (wrong values)   | Python reference comparison
Q6_K scale indexing          | dequant.rs      | High                  | ggml source code review
Q5_K qh mask pattern         | dequant.rs      | Medium                | ggml source code review
Q8_K d is f32 not f16        | dequant.rs      | Medium                | Block size calculation
SentencePiece encoding       | tokenizer.rs    | High (wrong tokens)   | llama-cpp-python token comparison
ff_dim shape[0] vs shape[1]  | model_runner.rs | Critical (wrong FFN)  | Layer-by-layer Python validation

Seven bugs, zero of which showed up as compilation errors. All silent correctness issues that required numerical validation to find. That's the nature of ML inference code — the math compiles fine, it just gives you wrong numbers.
