Supported Models
Architectures, quantizations, and SIMD acceleration
Architectures
Yule uses a unified TransformerRunner that handles multiple architectures through config flags rather than separate code paths:
| Architecture | Models | Special Features |
|---|---|---|
| Llama | Llama 2, Llama 3, TinyLlama, CodeLlama | Standard transformer |
| Mistral | Mistral 7B, Mixtral (partial) | Sliding window attention |
| Phi | Phi-3, Phi-3.5 | Partial RoPE, QKV bias |
| Qwen | Qwen2, Qwen2.5 | QKV bias |
| Gemma | Gemma 2 | GeGLU activation, post-norms, softcapping |
Architecture is auto-detected from GGUF metadata. No manual configuration needed.
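For illustration, here is roughly what flag-based detection can look like. This is a hypothetical sketch, not Yule's actual API: the struct, field names, and the specific default values (window sizes, softcap constants) are assumptions. The `general.architecture` metadata key is standard GGUF, though the exact string values vary by converter.

```rust
// Hypothetical sketch: map the GGUF `general.architecture` string to one
// config struct with per-architecture flags, instead of separate code paths.
#[derive(Debug, Default)]
struct ArchConfig {
    sliding_window: Option<usize>, // Mistral/Gemma2: mask attention beyond this window
    partial_rope: bool,            // Phi-3: RoPE on only part of each head dim
    qkv_bias: bool,                // Qwen2/Phi-3: bias terms after QKV projection
    geglu: bool,                   // Gemma2: GELU(gate) * up instead of SwiGLU
    post_norms: bool,              // Gemma2: post-attention / post-FFN norms
    logit_softcap: Option<f32>,    // Gemma2: tanh(logits / cap) * cap
}

fn detect_arch(gguf_architecture: &str) -> Option<ArchConfig> {
    let mut cfg = ArchConfig::default();
    match gguf_architecture {
        "llama" => {}                                 // standard transformer
        "mistral" => cfg.sliding_window = Some(4096), // illustrative window size
        "phi3" => {
            cfg.partial_rope = true;
            cfg.qkv_bias = true; // in practice also detectable from bias tensor presence
        }
        "qwen2" => cfg.qkv_bias = true,
        "gemma2" => {
            cfg.sliding_window = Some(4096);
            cfg.geglu = true;
            cfg.post_norms = true;
            cfg.logit_softcap = Some(30.0); // illustrative cap value
        }
        _ => return None, // unsupported architecture
    }
    Some(cfg)
}
```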
Architecture-Specific Features
- Sliding window attention (Mistral, Gemma2) — attention is masked beyond a window size, reducing memory for long contexts
- Partial RoPE (Phi-3) — rotary embeddings applied to only part of each head dimension
- QKV bias (Qwen2, Phi-3) — bias terms added after QKV projection, detected from tensor presence
- GeGLU activation (Gemma2) — `GELU(gate) * up` instead of SwiGLU
- Post-attention and post-FFN norms (Gemma2) — extra normalization layers
- Logit softcapping (Gemma2) — `tanh(logits / cap) * cap` to bound final logits (see the sketch after this list)
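A minimal scalar sketch of three of these ops, assuming plain `f32` slices; the function names and shapes are illustrative, not Yule's internals:

```rust
/// GeGLU feed-forward gating (Gemma2): GELU(gate) * up,
/// versus SwiGLU's SiLU(gate) * up.
fn geglu(gate: &[f32], up: &[f32], out: &mut [f32]) {
    for ((o, &g), &u) in out.iter_mut().zip(gate.iter()).zip(up.iter()) {
        // tanh approximation of GELU: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))
        let inner = 0.7978845608_f32 * (g + 0.044715 * g * g * g);
        *o = 0.5 * g * (1.0 + inner.tanh()) * u;
    }
}

/// Logit softcapping (Gemma2): squashes logits into (-cap, cap).
fn softcap(logits: &mut [f32], cap: f32) {
    for l in logits.iter_mut() {
        *l = (*l / cap).tanh() * cap;
    }
}

/// Sliding window attention mask: a query at position `q_pos` may only
/// attend to keys within the last `window` positions, on top of the
/// usual causal mask.
fn attends(q_pos: usize, k_pos: usize, window: usize) -> bool {
    k_pos <= q_pos && q_pos - k_pos < window
}
```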
Quantization Types
| Type | Bits/Weight | AVX2 | Notes |
|---|---|---|---|
| Q4_0 | 4.5 | Yes (3.9x) | Basic 4-bit, 32-element blocks |
| Q4_K | 4.5 | Yes (1.9x) | K-quant 4-bit with super-blocks |
| Q5_K | 5.5 | Scalar | K-quant 5-bit |
| Q6_K | 6.5 | Scalar | K-quant 6-bit, good quality |
| Q8_0 | 8.5 | Yes (5.0x) | 8-bit, near-lossless |
| Q8_K | 8.5 | Scalar | K-quant 8-bit |
| Q2_K | 2.5 | Scalar | Aggressive compression |
| Q3_K | 3.4 | Scalar | K-quant 3-bit |
| F16 | 16 | Scalar | Half precision |
| F32 | 32 | Scalar | Full precision |
| BF16 | 16 | Scalar | Brain float |
The "AVX2" column shows speedup over scalar for the dot product kernel. AVX2 is auto-detected at startup and used when available on x86-64 CPUs.
Performance
Single-user decode is memory-bandwidth-bound. Approximate throughput on CPU:
| Model | Quant | RAM | Approx tok/s |
|---|---|---|---|
| TinyLlama 1.1B | Q4_0 | ~600MB | 3-5 |
| Llama 2 7B | Q4_K_M | ~4GB | 0.5-2 |
| Mistral 7B | Q4_0 | ~4GB | 0.5-2 |
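The bandwidth-bound claim gives a quick way to estimate the ceiling on these numbers: each decoded token reads every weight once, so tok/s is bounded by memory bandwidth divided by model size. A sketch of the arithmetic, with illustrative numbers (not measurements):

```rust
/// Rough decode-speed ceiling for a memory-bandwidth-bound model:
/// every weight is read once per token, so tok/s <= bandwidth / model size.
fn max_tokens_per_sec(model_bytes: f64, bandwidth_bytes_per_sec: f64) -> f64 {
    bandwidth_bytes_per_sec / model_bytes
}

fn main() {
    // Illustrative: a ~4 GB quantized model on a machine with ~20 GB/s of
    // effective memory bandwidth tops out around 5 tok/s; real throughput
    // is lower once compute and cache effects are included.
    let ceiling = max_tokens_per_sec(4.0e9, 20.0e9);
    println!("ceiling: {ceiling:.1} tok/s");
}
```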
GPU acceleration (Vulkan, Metal, CUDA) is planned and will improve throughput by 10-100x.
File Format
Yule reads GGUF (GPT-Generated Unified Format) files. This is the standard format used by llama.cpp and the HuggingFace ecosystem. Models are available on HuggingFace — search for any model name + "GGUF".
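For orientation, the fixed GGUF header is simple to parse: a 4-byte magic (`GGUF`), a u32 version, then tensor and metadata counts (u64 each in GGUF v2+), all little-endian. A minimal sketch, with error handling and the metadata key-value parsing (where `general.architecture` lives) omitted:

```rust
use std::fs::File;
use std::io::Read;

/// Read the fixed GGUF header: magic "GGUF", u32 version, u64 tensor count,
/// u64 metadata key-value count, all little-endian (counts are u64 in v2+).
fn read_gguf_header(path: &str) -> std::io::Result<(u32, u64, u64)> {
    let mut f = File::open(path)?;
    let mut buf = [0u8; 24];
    f.read_exact(&mut buf)?;
    assert_eq!(&buf[0..4], b"GGUF", "not a GGUF file");
    let version = u32::from_le_bytes(buf[4..8].try_into().unwrap());
    let tensor_count = u64::from_le_bytes(buf[8..16].try_into().unwrap());
    let metadata_kv_count = u64::from_le_bytes(buf[16..24].try_into().unwrap());
    Ok((version, tensor_count, metadata_kv_count))
}
```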
Safetensors support is planned but not yet implemented.