Supported Models

Architectures, quantizations, and SIMD acceleration

Architectures

Yule uses a unified TransformerRunner that handles multiple architectures through config flags rather than separate code paths:

| Architecture | Models | Special Features |
|---|---|---|
| Llama | Llama 2, Llama 3, TinyLlama, CodeLlama | Standard transformer |
| Mistral | Mistral 7B, Mixtral (partial) | Sliding window attention |
| Phi | Phi-3, Phi-3.5 | Partial RoPE, QKV bias |
| Qwen | Qwen2, Qwen2.5 | QKV bias |
| Gemma | Gemma 2 | GeGLU activation, post-norms, softcapping |

Architecture is auto-detected from GGUF metadata. No manual configuration needed.
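In GGUF files the architecture name sits in the key-value metadata under `general.architecture` (typical tags include "llama", "phi3", "qwen2", "gemma2"). A minimal sketch of how that tag could be mapped to a preset, with hypothetical type and function names rather than Yule's actual API:

```rust
use std::collections::HashMap;

// Illustrative only: the enum and lookup are assumptions, not Yule's types.
#[derive(Debug, Clone, Copy)]
enum Arch { Llama, Phi, Qwen, Gemma }

fn detect_arch(metadata: &HashMap<String, String>) -> Option<Arch> {
    match metadata.get("general.architecture")?.as_str() {
        "llama" => Some(Arch::Llama),  // Mistral-family GGUFs commonly carry the "llama" tag too
        "phi3" => Some(Arch::Phi),
        "qwen2" => Some(Arch::Qwen),
        "gemma2" => Some(Arch::Gemma),
        _ => None,
    }
}
```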

Architecture-Specific Features

  • Sliding window attention (Mistral, Gemma2) — attention is masked beyond a window size, reducing memory for long contexts
  • Partial RoPE (Phi-3) — rotary embeddings applied to only part of each head dimension
  • QKV bias (Qwen2, Phi-3) — bias terms added after QKV projection, detected from tensor presence
  • GeGLU activation (Gemma2) — GELU(gate) * up instead of SwiGLU
  • Post-attention and post-FFN norms (Gemma2) — extra normalization layers
  • Logit softcapping (Gemma2) — tanh(logits / cap) * cap to bound final logits
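To make the "config flags rather than separate code paths" idea concrete, here is a hedged sketch of a per-architecture config covering the features above, plus the softcapping formula as code. Field and function names are illustrative assumptions, not Yule's actual types:

```rust
// Sketch: architecture differences expressed as flags consumed by one shared
// transformer loop. Field names are hypothetical.
struct ArchConfig {
    sliding_window: Option<usize>, // Some(n) for Mistral / Gemma 2 windowed attention
    rope_dims: usize,              // < head_dim for Phi-3's partial RoPE
    qkv_bias: bool,                // Qwen2 / Phi-3, detected from bias tensor presence
    geglu: bool,                   // Gemma 2: GELU(gate) * up instead of SwiGLU
    post_norms: bool,              // Gemma 2: extra post-attention / post-FFN norms
    logit_softcap: Option<f32>,    // Gemma 2: tanh(logits / cap) * cap
}

// Logit softcapping as described above: squashes logits into (-cap, cap).
fn softcap(logits: &mut [f32], cap: f32) {
    for x in logits.iter_mut() {
        *x = (*x / cap).tanh() * cap;
    }
}
```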

Quantization Types

| Type | Bits/Weight | AVX2 | Notes |
|---|---|---|---|
| Q4_0 | 4.5 | Yes (3.9x) | Basic 4-bit, 32-element blocks |
| Q4_K | 4.5 | Yes (1.9x) | K-quant 4-bit with super-blocks |
| Q5_K | 5.5 | Scalar | K-quant 5-bit |
| Q6_K | 6.5 | Scalar | K-quant 6-bit, good quality |
| Q8_0 | 8.5 | Yes (5.0x) | 8-bit, near-lossless |
| Q8_K | 8.5 | Scalar | K-quant 8-bit |
| Q2_K | 2.5 | Scalar | Aggressive compression |
| Q3_K | 3.4 | Scalar | K-quant 3-bit |
| F16 | 16 | Scalar | Half precision |
| F32 | 32 | Scalar | Full precision |
| BF16 | 16 | Scalar | Brain float |
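
The bits-per-weight figures follow from the block layouts. Q4_0, for example, packs 32 weights as 4-bit nibbles plus one 16-bit scale: (32 × 4 + 16) / 32 = 4.5 bits per weight. A minimal scalar dequantization sketch following the standard GGUF Q4_0 layout (names are illustrative, not Yule's kernels):

```rust
// Dequantize one GGUF Q4_0 block: 16 bytes of packed nibbles plus an f16
// scale (passed here already widened to f32). Each weight is (nibble - 8) * d.
fn dequant_q4_0_block(d: f32, qs: &[u8; 16]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for j in 0..16 {
        let lo = (qs[j] & 0x0F) as i32 - 8; // elements 0..15 live in the low nibbles
        let hi = (qs[j] >> 4) as i32 - 8;   // elements 16..31 live in the high nibbles
        out[j] = lo as f32 * d;
        out[j + 16] = hi as f32 * d;
    }
    out
}
```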

The "AVX2" column shows speedup over scalar for the dot product kernel. AVX2 is auto-detected at startup and used when available on x86-64 CPUs.

Performance

Single-user decode is memory-bandwidth-bound. Approximate throughput on CPU:

| Model | Quant | RAM | Approx. tok/s |
|---|---|---|---|
| TinyLlama 1.1B | Q4_0 | ~600 MB | 3-5 |
| Llama 3.2 7B | Q4_K_M | ~4 GB | 0.5-2 |
| Mistral 7B | Q4_0 | ~4 GB | 0.5-2 |
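
These figures follow from the bandwidth argument above: each decoded token streams essentially the whole weight set from RAM once, so throughput is bounded by memory bandwidth divided by model size. A rough back-of-the-envelope helper; the ~20 GB/s bandwidth figure is an assumed value for a typical dual-channel DDR4 desktop, not a measurement:

```rust
// Upper bound on decode speed when memory bandwidth is the only limit.
fn max_decode_tok_per_s(model_bytes: f64, mem_bw_bytes_per_s: f64) -> f64 {
    mem_bw_bytes_per_s / model_bytes
}

fn main() {
    // ~4 GB Q4 model on an assumed ~20 GB/s system: about 5 tok/s at best.
    // Real throughput is lower once compute, cache effects and the KV cache
    // are accounted for, which matches the 0.5-2 tok/s observed above.
    let bound = max_decode_tok_per_s(4.0e9, 20.0e9);
    println!("upper bound: {bound:.1} tok/s");
}
```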

GPU acceleration (Vulkan, Metal, CUDA) is planned and is expected to improve throughput by roughly 10-100x.

File Format

Yule reads GGUF (GPT-Generated Unified Format) files. This is the standard format used by llama.cpp and the HuggingFace ecosystem. Models are available on HuggingFace — search for any model name + "GGUF".

Safetensors support is planned but not yet implemented.
