Multi-Architecture — One Runner, Five Models
I had a LlamaRunner. Then I needed Mistral. Then Phi-3. Then Qwen2. Then Gemma2. Copy-pasting and tweaking wasn't going to scale — five runners with 90% shared code is five bugs to fix instead of one.
Config Flags, Not Code Duplication
TransformerRunner replaces LlamaRunner. Architecture differences are captured in a config struct:
```rust
struct RunnerConfig {
    dim, n_layers, n_heads, n_kv_heads, head_dim, vocab_size, ff_dim, norm_eps,
    rope_freq_base, max_seq_len,
    sliding_window: Option<usize>,     // Mistral, Gemma2
    partial_rotary_dim: Option<usize>, // Phi-3
    has_qkv_bias: bool,                // Qwen2
    activation: Activation,            // SwiGLU or GeGLU
    has_post_attn_norm: bool,          // Gemma2
    has_post_ffn_norm: bool,           // Gemma2
    logit_softcap: Option<f32>,        // Gemma2
    attn_logit_softcap: Option<f32>,   // Gemma2
    norm_weight_offset: f32,           // Gemma: 1.0, others: 0.0
}
```

The forward pass reads these flags at runtime. Conditional branches for optional features are branch-predicted away after the first token — zero overhead for architectures that don't use them.
Architecture-Specific Features
Sliding Window Attention (Mistral, Gemma2)
Standard attention lets every token attend to every previous token. Sliding window restricts it to the last W tokens (W=4096 for Mistral 7B), which bounds the KV cache size. Once you've generated W tokens, old entries get overwritten.
The attention score loop clamps start = max(0, pos - window). KV cache position wraps modulo window size.
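A minimal sketch of those two rules, assuming sliding_window comes straight from the config (the function names are mine, not the runner's):

```rust
/// First position a token at `pos` may attend to.
fn attn_start(pos: usize, sliding_window: Option<usize>) -> usize {
    match sliding_window {
        Some(w) => pos.saturating_sub(w), // max(0, pos - window)
        None => 0,                        // full causal attention
    }
}

/// KV-cache slot for the token at `pos`; entries older than `w` get overwritten.
fn kv_slot(pos: usize, sliding_window: Option<usize>) -> usize {
    match sliding_window {
        Some(w) => pos % w,
        None => pos,
    }
}
```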
Partial RoPE (Phi-3)
Most models apply rotary embeddings to the full head dimension. Phi-3 only rotates a subset — rotary_dim from GGUF metadata. The RopeTable is sized to rotary_dim instead of head_dim, and dimensions beyond that pass through unchanged.
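Roughly, the per-head rotation just stops at rotary_dim. The pairing convention in this sketch (adjacent elements) is illustrative; the real loop depends on how the RopeTable is laid out:

```rust
/// Rotate only the first `rotary_dim` elements of a head; the rest pass through.
/// `cos`/`sin` come from a RopeTable with rotary_dim / 2 entries per position.
fn rope_partial(head: &mut [f32], cos: &[f32], sin: &[f32], rotary_dim: usize) {
    for i in (0..rotary_dim).step_by(2) {
        let (x0, x1) = (head[i], head[i + 1]);
        let (c, s) = (cos[i / 2], sin[i / 2]);
        head[i] = x0 * c - x1 * s;
        head[i + 1] = x0 * s + x1 * c;
    }
    // head[rotary_dim..] is untouched.
}
```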
QKV Bias (Qwen2)
Qwen2 adds a learned bias vector after each Q, K, and V projection. Detection is automatic: if the GGUF file contains tensors named blk.{i}.attn_q.bias, the bias gets applied. No explicit architecture flag needed — the tensor existence is the flag.
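A sketch of that detection, with a plain map standing in for the parsed GGUF tensor table (the real lookup goes through the GGUF reader):

```rust
use std::collections::HashMap;

/// The tensor's existence is the flag: if layer 0 has a Q bias, the model uses QKV bias.
fn detect_qkv_bias(tensors: &HashMap<String, Vec<u8>>) -> bool {
    tensors.contains_key("blk.0.attn_q.bias")
}
```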
GeGLU (Gemma2)
Most models use SwiGLU: SiLU(gate) * up. Gemma2 uses GeGLU: GELU(gate) * up. One enum toggle.
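Sketched as a scalar helper (the enum mirrors the activation field in the config above; the GELU here is the common tanh approximation, and the function name is assumed):

```rust
enum Activation { SwiGLU, GeGLU }

/// Gated activation applied element-wise to the FFN's gate/up projections.
fn gated(gate: f32, up: f32, act: &Activation) -> f32 {
    match act {
        // SiLU(gate) * up
        Activation::SwiGLU => (gate / (1.0 + (-gate).exp())) * up,
        // GELU(gate) * up, tanh approximation of GELU
        Activation::GeGLU => {
            let g = 0.5 * gate
                * (1.0 + (0.797_884_6 * (gate + 0.044_715 * gate * gate * gate)).tanh());
            g * up
        }
    }
}
```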
Gemma Quirks
Gemma2 is the most "different" architecture I support. It stacks four non-standard behaviors:
- Norm weight offset — all RMSNorm weights have 1.0 added. Without this, the model outputs garbage because the norms suppress the activations instead of scaling them.
- Embedding scaling — the embedding output is multiplied by sqrt(dim). Every other architecture uses the raw embedding. Miss this and the hidden-state magnitudes are wrong from layer 0.
- Logit softcapping — both attention logits and final output logits are capped via tanh(x / cap) * cap. Prevents extreme values from dominating softmax.
- Post-attention and post-FFN norms — additional RMSNorm layers after the output projection and after the down projection. Detected by checking for extra norm tensors in the GGUF file.
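As tiny helpers (names assumed), the scalar quirks look like this; the extra norm layers are just additional RMSNorm calls gated on the optional weights:

```rust
/// tanh(x / cap) * cap squashes a logit into (-cap, cap).
fn softcap(x: f32, cap: f32) -> f32 {
    (x / cap).tanh() * cap
}

/// RMSNorm weight as applied: the stored weight plus norm_weight_offset
/// (1.0 for Gemma, 0.0 for everything else).
fn effective_norm_weight(stored: f32, norm_weight_offset: f32) -> f32 {
    stored + norm_weight_offset
}

/// Gemma scales the embedding output by sqrt(dim) before the first layer.
fn scale_embedding(hidden: &mut [f32], dim: usize) {
    let s = (dim as f32).sqrt();
    for h in hidden.iter_mut() {
        *h *= s;
    }
}
```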
TransformerWeights
The weight structure mirrors the config's optional features:
```rust
struct LayerWeights<'a> {
    attn_norm: &'a [u8],
    wq, wk, wv, wo: &'a [u8],
    q_bias, k_bias, v_bias: Option<&'a [u8]>, // Qwen2
    post_attn_norm: Option<&'a [u8]>,         // Gemma2
    ffn_norm: &'a [u8],
    gate, up, down: &'a [u8],
    post_ffn_norm: Option<&'a [u8]>,          // Gemma2
}
```

Weight lookups are tensor-name-based from the GGUF file. Optional weights are None when the tensor doesn't exist.
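A sketch of the lookup, again with a plain map standing in for the GGUF tensor table; the helpers are assumed names, and the weight names follow the same blk.{i}.* convention the bias detection above relies on:

```rust
use std::collections::HashMap;

fn required<'a>(tensors: &'a HashMap<String, Vec<u8>>, name: &str) -> &'a [u8] {
    tensors.get(name).map(Vec::as_slice).expect("missing required tensor")
}

fn optional<'a>(tensors: &'a HashMap<String, Vec<u8>>, name: &str) -> Option<&'a [u8]> {
    tensors.get(name).map(Vec::as_slice)
}

// Building layer i (illustrative):
//   wq:     required(&tensors, &format!("blk.{i}.attn_q.weight")),
//   q_bias: optional(&tensors, &format!("blk.{i}.attn_q.bias")), // None unless Qwen2
```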
Chat Templates
Each architecture has its own prompt format. I hardcoded the five most common:
- Llama3: <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>
- Mistral: [INST] {content} [/INST]
- Phi-3: <|user|>\n{content}<|end|>\n<|assistant|>\n
- Qwen2: <|im_start|>user\n{content}<|im_end|>\n<|im_start|>assistant\n
- Gemma2: <start_of_turn>user\n{content}<end_of_turn>\n<start_of_turn>model\n
These are hardcoded strings, not a Jinja2 parser. A full template parser is on the list but hardcoded works for the five architectures I support now.
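Concretely, template selection is just a match over the architecture. The enum and function below are assumed names; the strings are the ones listed above:

```rust
enum Arch { Llama3, Mistral, Phi3, Qwen2, Gemma2 }

/// Wrap a single user turn in the architecture's prompt format.
fn user_turn(arch: &Arch, content: &str) -> String {
    match arch {
        Arch::Llama3 => format!(
            "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>"
        ),
        Arch::Mistral => format!("[INST] {content} [/INST]"),
        Arch::Phi3 => format!("<|user|>\n{content}<|end|>\n<|assistant|>\n"),
        Arch::Qwen2 => format!("<|im_start|>user\n{content}<|im_end|>\n<|im_start|>assistant\n"),
        Arch::Gemma2 => format!("<start_of_turn>user\n{content}<end_of_turn>\n<start_of_turn>model\n"),
    }
}
```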
GGUF Metadata Extraction
RunnerConfig is populated entirely from GGUF metadata keys. Missing keys get sensible defaults (norm_eps=1e-5, rope_freq_base=10000.0, no sliding window, no softcapping). The config-building logic auto-detects the architecture from what's in the file — no manual --arch flag needed.
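For example, the defaulting looks roughly like this; the map is a simplified stand-in for the parsed metadata, and the exact key strings are assumptions following the usual GGUF {arch}.* naming rather than a copy of the runner's code:

```rust
use std::collections::HashMap;

/// Sketch: read three values from a simplified metadata map, with defaults.
fn scalar_config_with_defaults(md: &HashMap<String, f64>, arch: &str) -> (f32, f32, Option<usize>) {
    let norm_eps = md
        .get(&format!("{arch}.attention.layer_norm_rms_epsilon"))
        .map(|v| *v as f32)
        .unwrap_or(1e-5);
    let rope_freq_base = md
        .get(&format!("{arch}.rope.freq_base"))
        .map(|v| *v as f32)
        .unwrap_or(10_000.0);
    let sliding_window = md
        .get(&format!("{arch}.attention.sliding_window"))
        .map(|v| *v as usize); // absent => None => full attention
    (norm_eps, rope_freq_base, sliding_window)
}
```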