Architecture Overview
Crate structure, inference thread model, and design decisions
Crate Structure
Yule is a 10-crate Cargo workspace:
| Crate | Purpose |
|---|---|
| yule-core | GGUF parser, dequant kernels, tokenizer, SIMD dispatch |
| yule-infer | TransformerRunner, KV cache, sampling, weight loading |
| yule-gpu | Compute backend abstraction (CPU done; Vulkan/Metal/CUDA planned) |
| yule-verify | Merkle tree, Ed25519 signatures |
| yule-attest | Attestation sessions, audit log |
| yule-sandbox | Process isolation (Job Object on Windows) |
| yule-api | Axum HTTP server, auth, routes, SSE streaming |
| yule-registry | Model download and cache (planned) |
| yule-cli | CLI entry point |
| yule-bench | Criterion benchmarks |
Inference Thread Model
The API server needs to be async (Axum/tokio) for HTTP handling, but TransformerRunner holds &[u8] references into the mmap that aren't Send, so it can't be moved into a tokio task.
Solution: a dedicated std::thread owns all model resources. Async HTTP handlers communicate with it via channels:
```
HTTP request
→ Axum handler
→ std::sync::mpsc::Sender<InferenceRequest>
→ Inference Thread (owns mmap + runner + tokenizer)
→ tokens via tokio::sync::mpsc::UnboundedSender<TokenEvent>
→ SSE stream or collected JSON response
```

The mmap is intentionally leaked via Box::leak to get a 'static lifetime. This is correct because the model stays loaded for the entire server lifetime.
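A minimal sketch of the thread side of this bridge is shown below. InferenceRequest, TokenEvent, spawn_inference_thread, and the placeholder generation loop are simplified stand-ins for illustration, not the actual yule-infer types:

```rust
use std::sync::mpsc;
use std::thread;

// Simplified stand-ins for the real yule-infer types.
pub struct InferenceRequest {
    pub prompt: String,
    // Tokens flow back over a tokio channel so async handlers can await them.
    pub reply: tokio::sync::mpsc::UnboundedSender<TokenEvent>,
}

pub enum TokenEvent {
    Token(String),
    Done,
}

/// Spawns the dedicated inference thread and returns the request sender.
/// `model_bytes` is the leaked mmap (something like `Box::leak(Box::new(mmap))`),
/// which is how it satisfies the 'static bound.
pub fn spawn_inference_thread(model_bytes: &'static [u8]) -> mpsc::Sender<InferenceRequest> {
    let (tx, rx) = mpsc::channel::<InferenceRequest>();
    thread::spawn(move || {
        // This thread exclusively owns the model resources; none of them
        // ever need to be Send across an await point.
        // let runner = TransformerRunner::new(model_bytes); // real construction elided
        let _ = model_bytes;

        // Blocking recv loop: requests are served strictly one at a time,
        // so concurrent HTTP requests simply queue on the channel.
        for req in rx {
            // Placeholder "generation": echo the prompt as a single token.
            let _ = req.reply.send(TokenEvent::Token(req.prompt.clone()));
            let _ = req.reply.send(TokenEvent::Done);
        }
    });
    tx
}
```

The returned Sender is cheaply cloneable, so each Axum handler can hold its own copy while all model state stays on the dedicated thread.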
Single request at a time. This is the right tradeoff for single-model CPU inference. Concurrent requests queue on the channel.
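On the async side, a handler only needs a clone of that Sender. Here is a hedged sketch of the collected (non-streaming) path, reusing the stand-in InferenceRequest / TokenEvent types from the sketch above; generate_collected and its shape are illustrative, not Yule's actual route handlers:

```rust
use std::sync::mpsc;

/// Collected-response path: queue a request, then await tokens without
/// blocking the tokio runtime. (The SSE path would instead wrap `reply_rx`
/// in a stream and emit each TokenEvent as a server-sent event.)
pub async fn generate_collected(
    infer_tx: mpsc::Sender<InferenceRequest>,
    prompt: String,
) -> String {
    let (reply_tx, mut reply_rx) = tokio::sync::mpsc::unbounded_channel();

    // If the inference thread is busy, this request simply waits in the
    // channel until the current one finishes.
    infer_tx
        .send(InferenceRequest { prompt, reply: reply_tx })
        .expect("inference thread has exited");

    // Awaiting here yields to the runtime; tokens arrive as the dedicated
    // thread produces them.
    let mut out = String::new();
    while let Some(event) = reply_rx.recv().await {
        match event {
            TokenEvent::Token(t) => out.push_str(&t),
            TokenEvent::Done => break,
        }
    }
    out
}
```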
Why Rust
Not just "memory safety". The specific wins:
- GGUF parsers in C/C++ have had CVEs. Buffer overflows in file format parsers are the most common vulnerability class. Rust eliminates them.
- The runtime executes arbitrary model weights. A tampered model could exploit parser bugs to achieve code execution. Rust's type system prevents this entire class of attack.
- Zero-cost abstractions for SIMD. AVX2 intrinsics via `std::arch` with no runtime overhead. The SIMD dispatch uses an `AtomicU8` flag checked once at startup (a sketch of this dispatch pattern follows the list).
- No GC pauses during inference. Token generation latency is predictable.
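The flag-checked dispatch could look roughly like the sketch below. The level encoding, init_simd, dot, and the scalar body inside dot_avx2 are illustrative assumptions, not Yule's actual kernels:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical encoding (not Yule's actual values): 1 = scalar fallback, 2 = AVX2.
static SIMD_LEVEL: AtomicU8 = AtomicU8::new(1);

#[cfg(target_arch = "x86_64")]
fn detect_level() -> u8 {
    if std::arch::is_x86_feature_detected!("avx2") { 2 } else { 1 }
}

#[cfg(not(target_arch = "x86_64"))]
fn detect_level() -> u8 {
    1
}

/// Probe CPU features once at startup; hot paths only do a relaxed load afterwards.
pub fn init_simd() {
    SIMD_LEVEL.store(detect_level(), Ordering::Relaxed);
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    // Illustrative body only; a real kernel would use the _mm256 intrinsics in std::arch.
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

#[cfg(target_arch = "x86_64")]
fn dot_simd(a: &[f32], b: &[f32]) -> Option<f32> {
    if SIMD_LEVEL.load(Ordering::Relaxed) == 2 {
        // SAFETY: level 2 is only stored after the AVX2 runtime check succeeded.
        Some(unsafe { dot_avx2(a, b) })
    } else {
        None
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn dot_simd(_a: &[f32], _b: &[f32]) -> Option<f32> {
    None
}

/// AVX2 path when available, portable scalar fallback otherwise.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    dot_simd(a, b).unwrap_or_else(|| a.iter().zip(b).map(|(x, y)| x * y).sum())
}
```

The point of the AtomicU8 is that the (relatively expensive) CPUID-based feature probe runs once, while every kernel invocation afterwards pays only a relaxed atomic load and a branch.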