Multi-Architecture — One Runner, Five Models
I had a LlamaRunner. Then I needed Mistral. Then Phi-3. Then Qwen2. Then Gemma2. Copy-pasting and tweaking wasn't going to scale — five runners with 90% shared code is five bugs to fix instead of one.
Config Flags, Not Code Duplication
TransformerRunner replaces LlamaRunner. Architecture differences are captured in a config struct:
```rust
struct RunnerConfig {
    dim, n_layers, n_heads, n_kv_heads, head_dim, vocab_size, ff_dim, norm_eps,
    rope_freq_base, max_seq_len,
    sliding_window: Option<usize>,     // Mistral, Gemma2
    partial_rotary_dim: Option<usize>, // Phi-3
    has_qkv_bias: bool,                // Qwen2
    activation: Activation,            // SwiGLU or GeGLU
    has_post_attn_norm: bool,          // Gemma2
    has_post_ffn_norm: bool,           // Gemma2
    logit_softcap: Option<f32>,        // Gemma2
    attn_logit_softcap: Option<f32>,   // Gemma2
    norm_weight_offset: f32,           // Gemma: 1.0, others: 0.0
}
```

The forward pass reads these flags at runtime. Conditional branches for optional features are branch-predicted away after the first token — zero overhead for architectures that don't use them.
Architecture-Specific Features
Sliding Window Attention (Mistral, Gemma2)
Standard attention lets every token attend to every previous token. Sliding window restricts it to the last W tokens (W=4096 for Mistral 7B), which bounds the KV cache size. Once you've generated W tokens, old entries get overwritten.
The attention score loop clamps start = max(0, pos - window). KV cache position wraps modulo window size.
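A minimal sketch of those two rules, assuming sliding_window comes straight from the config (the function names are mine, not the runner's):

```rust
/// First position a token at `pos` may attend to.
fn attn_start(pos: usize, sliding_window: Option<usize>) -> usize {
    match sliding_window {
        Some(w) => pos.saturating_sub(w), // max(0, pos - window)
        None => 0,                        // full causal attention
    }
}

/// KV-cache slot for the token at `pos`; entries older than `w` get overwritten.
fn kv_slot(pos: usize, sliding_window: Option<usize>) -> usize {
    match sliding_window {
        Some(w) => pos % w,
        None => pos,
    }
}
```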
Partial RoPE (Phi-3)
Most models apply rotary embeddings to the full head dimension. Phi-3 only rotates a subset — rotary_dim from GGUF metadata. The RopeTable is sized to rotary_dim instead of head_dim, and dimensions beyond that pass through unchanged.
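Roughly, the per-head rotation just stops at rotary_dim. The pairing convention in this sketch (adjacent elements) is illustrative; the real loop depends on how the RopeTable is laid out:

```rust
/// Rotate only the first `rotary_dim` elements of a head; the rest pass through.
/// `cos`/`sin` come from a RopeTable with rotary_dim / 2 entries per position.
fn rope_partial(head: &mut [f32], cos: &[f32], sin: &[f32], rotary_dim: usize) {
    for i in (0..rotary_dim).step_by(2) {
        let (x0, x1) = (head[i], head[i + 1]);
        let (c, s) = (cos[i / 2], sin[i / 2]);
        head[i] = x0 * c - x1 * s;
        head[i + 1] = x0 * s + x1 * c;
    }
    // head[rotary_dim..] is untouched.
}
```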
QKV Bias (Qwen2)
Qwen2 adds a learned bias vector after each Q, K, and V projection. Detection is automatic: if the GGUF file contains tensors named blk.{i}.attn_q.bias, the bias gets applied. No explicit architecture flag needed — the tensor existence is the flag.
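A sketch of that detection, with a plain map standing in for the parsed GGUF tensor table (the real lookup goes through the GGUF reader):

```rust
use std::collections::HashMap;

/// The tensor's existence is the flag: if layer 0 has a Q bias, the model uses QKV bias.
fn detect_qkv_bias(tensors: &HashMap<String, Vec<u8>>) -> bool {
    tensors.contains_key("blk.0.attn_q.bias")
}
```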
GeGLU (Gemma2)
Most models use SwiGLU: SiLU(gate) * up. Gemma2 uses GeGLU: GELU(gate) * up. One enum toggle.
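Sketched as a scalar helper (the enum mirrors the activation field in the config above; the GELU here is the common tanh approximation, and the function name is assumed):

```rust
enum Activation { SwiGLU, GeGLU }

/// Gated activation applied element-wise to the FFN's gate/up projections.
fn gated(gate: f32, up: f32, act: &Activation) -> f32 {
    match act {
        // SiLU(gate) * up
        Activation::SwiGLU => (gate / (1.0 + (-gate).exp())) * up,
        // GELU(gate) * up, tanh approximation of GELU
        Activation::GeGLU => {
            let g = 0.5 * gate
                * (1.0 + (0.797_884_6 * (gate + 0.044_715 * gate * gate * gate)).tanh());
            g * up
        }
    }
}
```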
Gemma Quirks
Gemma2 is the most "different" architecture I support. It stacks four non-standard behaviors:
- Norm weight offset — all RMSNorm weights have 1.0 added. Without this, the model outputs garbage because the norms suppress the activations instead of scaling them.
- Embedding scaling — the embedding output is multiplied by sqrt(dim). Every other architecture uses the raw embedding. Miss this and the hidden-state magnitudes are wrong from layer 0.
- Logit softcapping — both attention logits and final output logits are capped via tanh(x / cap) * cap. Prevents extreme values from dominating softmax.
- Post-attention and post-FFN norms — additional RMSNorm layers after the output projection and after the down projection. Detected by checking for extra norm tensors in the GGUF file.
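As tiny helpers (names assumed), the scalar quirks look like this; the extra norm layers are just additional RMSNorm calls gated on the optional weights:

```rust
/// tanh(x / cap) * cap squashes a logit into (-cap, cap).
fn softcap(x: f32, cap: f32) -> f32 {
    (x / cap).tanh() * cap
}

/// RMSNorm weight as applied: the stored weight plus norm_weight_offset
/// (1.0 for Gemma, 0.0 for everything else).
fn effective_norm_weight(stored: f32, norm_weight_offset: f32) -> f32 {
    stored + norm_weight_offset
}

/// Gemma scales the embedding output by sqrt(dim) before the first layer.
fn scale_embedding(hidden: &mut [f32], dim: usize) {
    let s = (dim as f32).sqrt();
    for h in hidden.iter_mut() {
        *h *= s;
    }
}
```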
TransformerWeights
The weight structure mirrors the config's optional features:
```rust
struct LayerWeights<'a> {
    attn_norm: &'a [u8],
    wq, wk, wv, wo: &'a [u8],
    q_bias, k_bias, v_bias: Option<&'a [u8]>, // Qwen2
    post_attn_norm: Option<&'a [u8]>,         // Gemma2
    ffn_norm: &'a [u8],
    gate, up, down: &'a [u8],
    post_ffn_norm: Option<&'a [u8]>,          // Gemma2
}
```

Weight lookups are tensor-name-based from the GGUF file. Optional weights are None when the tensor doesn't exist.
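A sketch of the lookup, again with a plain map standing in for the GGUF tensor table; the helpers are assumed names, and the weight names follow the same blk.{i}.* convention the bias detection above relies on:

```rust
use std::collections::HashMap;

fn required<'a>(tensors: &'a HashMap<String, Vec<u8>>, name: &str) -> &'a [u8] {
    tensors.get(name).map(Vec::as_slice).expect("missing required tensor")
}

fn optional<'a>(tensors: &'a HashMap<String, Vec<u8>>, name: &str) -> Option<&'a [u8]> {
    tensors.get(name).map(Vec::as_slice)
}

// Building layer i (illustrative):
//   wq:     required(&tensors, &format!("blk.{i}.attn_q.weight")),
//   q_bias: optional(&tensors, &format!("blk.{i}.attn_q.bias")), // None unless Qwen2
```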
Chat Templates
Each architecture has its own prompt format. I hardcoded the five most common:
- Llama3: <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>
- Mistral: [INST] {content} [/INST]
- Phi-3: <|user|>\n{content}<|end|>\n<|assistant|>\n
- Qwen2: <|im_start|>user\n{content}<|im_end|>\n<|im_start|>assistant\n
- Gemma2: <start_of_turn>user\n{content}<end_of_turn>\n<start_of_turn>model\n
These are hardcoded strings, not a Jinja2 parser. A full template parser is on the list but hardcoded works for the five architectures I support now.
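Concretely, template selection is just a match over the architecture. The enum and function below are assumed names; the strings are the ones listed above:

```rust
enum Arch { Llama3, Mistral, Phi3, Qwen2, Gemma2 }

/// Wrap a single user turn in the architecture's prompt format.
fn user_turn(arch: &Arch, content: &str) -> String {
    match arch {
        Arch::Llama3 => format!(
            "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>"
        ),
        Arch::Mistral => format!("[INST] {content} [/INST]"),
        Arch::Phi3 => format!("<|user|>\n{content}<|end|>\n<|assistant|>\n"),
        Arch::Qwen2 => format!("<|im_start|>user\n{content}<|im_end|>\n<|im_start|>assistant\n"),
        Arch::Gemma2 => format!("<start_of_turn>user\n{content}<end_of_turn>\n<start_of_turn>model\n"),
    }
}
```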
GGUF Metadata Extraction
RunnerConfig is populated entirely from GGUF metadata keys. Missing keys get sensible defaults (norm_eps=1e-5, rope_freq_base=10000.0, no sliding window, no softcapping). The config-building logic auto-detects the architecture from what's in the file — no manual --arch flag needed.
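For example, the defaulting looks roughly like this; the map is a simplified stand-in for the parsed metadata, and the exact key strings are assumptions following the usual GGUF {arch}.* naming rather than a copy of the runner's code:

```rust
use std::collections::HashMap;

/// Sketch: read three values from a simplified metadata map, with defaults.
fn scalar_config_with_defaults(md: &HashMap<String, f64>, arch: &str) -> (f32, f32, Option<usize>) {
    let norm_eps = md
        .get(&format!("{arch}.attention.layer_norm_rms_epsilon"))
        .map(|v| *v as f32)
        .unwrap_or(1e-5);
    let rope_freq_base = md
        .get(&format!("{arch}.rope.freq_base"))
        .map(|v| *v as f32)
        .unwrap_or(10_000.0);
    let sliding_window = md
        .get(&format!("{arch}.attention.sliding_window"))
        .map(|v| *v as usize); // absent => None => full attention
    (norm_eps, rope_freq_base, sliding_window)
}
```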