Supported Models
Architectures, quantizations, and SIMD acceleration
Architectures
Yule uses a unified TransformerRunner that handles multiple architectures through config flags rather than separate code paths:
| Architecture | Models | Special Features |
|---|---|---|
| Llama | Llama 2, Llama 3, TinyLlama, CodeLlama | Standard transformer |
| Mistral | Mistral 7B, Mixtral (partial) | Sliding window attention |
| Phi | Phi-3, Phi-3.5 | Partial RoPE, QKV bias |
| Qwen | Qwen2, Qwen2.5 | QKV bias |
| Gemma | Gemma 2 | GeGLU activation, post-norms, softcapping |
Architecture is auto-detected from GGUF metadata. No manual configuration needed.
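For illustration, here is roughly what flag-based detection can look like. This is a hypothetical sketch, not Yule's actual API: the struct, field names, and the specific default values (window sizes, softcap constants) are assumptions. The `general.architecture` metadata key is standard GGUF, though the exact string values vary by converter.

```rust
// Hypothetical sketch: map the GGUF `general.architecture` string to one
// config struct with per-architecture flags, instead of separate code paths.
#[derive(Debug, Default)]
struct ArchConfig {
    sliding_window: Option<usize>, // Mistral/Gemma2: mask attention beyond this window
    partial_rope: bool,            // Phi-3: RoPE on only part of each head dim
    qkv_bias: bool,                // Qwen2/Phi-3: bias terms after QKV projection
    geglu: bool,                   // Gemma2: GELU(gate) * up instead of SwiGLU
    post_norms: bool,              // Gemma2: post-attention / post-FFN norms
    logit_softcap: Option<f32>,    // Gemma2: tanh(logits / cap) * cap
}

fn detect_arch(gguf_architecture: &str) -> Option<ArchConfig> {
    let mut cfg = ArchConfig::default();
    match gguf_architecture {
        "llama" => {}                                 // standard transformer
        "mistral" => cfg.sliding_window = Some(4096), // illustrative window size
        "phi3" => {
            cfg.partial_rope = true;
            cfg.qkv_bias = true; // in practice also detectable from bias tensor presence
        }
        "qwen2" => cfg.qkv_bias = true,
        "gemma2" => {
            cfg.sliding_window = Some(4096);
            cfg.geglu = true;
            cfg.post_norms = true;
            cfg.logit_softcap = Some(30.0); // illustrative cap value
        }
        _ => return None, // unsupported architecture
    }
    Some(cfg)
}
```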
Architecture-Specific Features
- Sliding window attention (Mistral, Gemma2) — attention is masked beyond a window size, reducing memory for long contexts
- Partial RoPE (Phi-3) — rotary embeddings applied to only part of each head dimension
- QKV bias (Qwen2, Phi-3) — bias terms added after QKV projection, detected from tensor presence
- GeGLU activation (Gemma2) — `GELU(gate) * up` instead of SwiGLU
- Post-attention and post-FFN norms (Gemma2) — extra normalization layers
- Logit softcapping (Gemma2) — `tanh(logits / cap) * cap` to bound final logits (see the sketch after this list)
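A minimal scalar sketch of three of these ops, assuming plain `f32` slices; the function names and shapes are illustrative, not Yule's internals:

```rust
/// GeGLU feed-forward gating (Gemma2): GELU(gate) * up,
/// versus SwiGLU's SiLU(gate) * up.
fn geglu(gate: &[f32], up: &[f32], out: &mut [f32]) {
    for ((o, &g), &u) in out.iter_mut().zip(gate.iter()).zip(up.iter()) {
        // tanh approximation of GELU: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))
        let inner = 0.7978845608_f32 * (g + 0.044715 * g * g * g);
        *o = 0.5 * g * (1.0 + inner.tanh()) * u;
    }
}

/// Logit softcapping (Gemma2): squashes logits into (-cap, cap).
fn softcap(logits: &mut [f32], cap: f32) {
    for l in logits.iter_mut() {
        *l = (*l / cap).tanh() * cap;
    }
}

/// Sliding window attention mask: a query at position `q_pos` may only
/// attend to keys within the last `window` positions, on top of the
/// usual causal mask.
fn attends(q_pos: usize, k_pos: usize, window: usize) -> bool {
    k_pos <= q_pos && q_pos - k_pos < window
}
```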
Quantization Types
| Type | Bits/Weight | AVX2 | Notes |
|---|---|---|---|
| Q4_0 | 4.5 | Yes (3.9x) | Basic 4-bit, 32-element blocks |
| Q4_K | 4.5 | Yes (1.9x) | K-quant 4-bit with super-blocks |
| Q5_K | 5.5 | Scalar | K-quant 5-bit |
| Q6_K | 6.5 | Scalar | K-quant 6-bit, good quality |
| Q8_0 | 8.5 | Yes (5.0x) | 8-bit, near-lossless |
| Q8_K | 8.5 | Scalar | K-quant 8-bit |
| Q2_K | 2.5 | Scalar | Aggressive compression |
| Q3_K | 3.4 | Scalar | K-quant 3-bit |
| F16 | 16 | Scalar | Half precision |
| F32 | 32 | Scalar | Full precision |
| BF16 | 16 | Scalar | Brain float |
The "AVX2" column shows speedup over scalar for the dot product kernel. AVX2 is auto-detected at startup and used when available on x86-64 CPUs.
Performance
Single-user decode is memory-bandwidth-bound. Approximate throughput on CPU:
| Model | Quant | RAM | Approx tok/s |
|---|---|---|---|
| TinyLlama 1.1B | Q4_0 | ~600MB | 3-5 |
| Llama 2 7B | Q4_K_M | ~4GB | 0.5-2 |
| Mistral 7B | Q4_0 | ~4GB | 0.5-2 |
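The bandwidth-bound claim gives a quick way to estimate the ceiling on these numbers: each decoded token reads every weight once, so tok/s is bounded by memory bandwidth divided by model size. A sketch of the arithmetic, with illustrative numbers (not measurements):

```rust
/// Rough decode-speed ceiling for a memory-bandwidth-bound model:
/// every weight is read once per token, so tok/s <= bandwidth / model size.
fn max_tokens_per_sec(model_bytes: f64, bandwidth_bytes_per_sec: f64) -> f64 {
    bandwidth_bytes_per_sec / model_bytes
}

fn main() {
    // Illustrative: a ~4 GB quantized model on a machine with ~20 GB/s of
    // effective memory bandwidth tops out around 5 tok/s; real throughput
    // is lower once compute and cache effects are included.
    let ceiling = max_tokens_per_sec(4.0e9, 20.0e9);
    println!("ceiling: {ceiling:.1} tok/s");
}
```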
GPU acceleration (Vulkan, Metal, CUDA) is planned and will improve throughput by 10-100x.
File Format
Yule reads GGUF (GPT-Generated Unified Format) files. This is the standard format used by llama.cpp and the HuggingFace ecosystem. Models are available on HuggingFace — search for any model name + "GGUF".
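For orientation, the fixed GGUF header is simple to parse: a 4-byte magic (`GGUF`), a u32 version, then tensor and metadata counts (u64 each in GGUF v2+), all little-endian. A minimal sketch, with error handling and the metadata key-value parsing (where `general.architecture` lives) omitted:

```rust
use std::fs::File;
use std::io::Read;

/// Read the fixed GGUF header: magic "GGUF", u32 version, u64 tensor count,
/// u64 metadata key-value count, all little-endian (counts are u64 in v2+).
fn read_gguf_header(path: &str) -> std::io::Result<(u32, u64, u64)> {
    let mut f = File::open(path)?;
    let mut buf = [0u8; 24];
    f.read_exact(&mut buf)?;
    assert_eq!(&buf[0..4], b"GGUF", "not a GGUF file");
    let version = u32::from_le_bytes(buf[4..8].try_into().unwrap());
    let tensor_count = u64::from_le_bytes(buf[8..16].try_into().unwrap());
    let metadata_kv_count = u64::from_le_bytes(buf[16..24].try_into().unwrap());
    Ok((version, tensor_count, metadata_kv_count))
}
```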
Safetensors support is planned but not yet implemented.