Architecture Overview
Crate structure, inference thread model, and design decisions
Crate Structure
Yule is a 10-crate Cargo workspace:
| Crate | Purpose |
|---|---|
| yule-core | GGUF parser, dequant kernels, tokenizer, SIMD dispatch |
| yule-infer | TransformerRunner, KV cache, sampling, weight loading |
| yule-gpu | Compute backend abstraction (CPU done; Vulkan/Metal/CUDA planned) |
| yule-verify | Merkle tree, Ed25519 signatures |
| yule-attest | Attestation sessions, audit log |
| yule-sandbox | Process isolation (Job Object on Windows) |
| yule-api | Axum HTTP server, auth, routes, SSE streaming |
| yule-registry | Model download and cache (planned) |
| yule-cli | CLI entry point |
| yule-bench | Criterion benchmarks |
Inference Thread Model
The API server needs to be async (Axum/tokio) for HTTP handling, but TransformerRunner holds &[u8] references into the mmap that aren't Send, so it can't be moved into a tokio task.
Solution: a dedicated std::thread owns all model resources. Async HTTP handlers communicate with it via channels:
```
HTTP request
→ Axum handler
→ std::sync::mpsc::Sender<InferenceRequest>
→ Inference Thread (owns mmap + runner + tokenizer)
→ tokens via tokio::sync::mpsc::UnboundedSender<TokenEvent>
→ SSE stream or collected JSON response
```

The mmap is intentionally leaked via Box::leak to get a 'static lifetime. This is correct because the model stays loaded for the entire server lifetime.
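A minimal sketch of the thread side of this bridge is shown below. InferenceRequest, TokenEvent, spawn_inference_thread, and the placeholder generation loop are simplified stand-ins for illustration, not the actual yule-infer types:

```rust
use std::sync::mpsc;
use std::thread;

// Simplified stand-ins for the real yule-infer types.
pub struct InferenceRequest {
    pub prompt: String,
    // Tokens flow back over a tokio channel so async handlers can await them.
    pub reply: tokio::sync::mpsc::UnboundedSender<TokenEvent>,
}

pub enum TokenEvent {
    Token(String),
    Done,
}

/// Spawns the dedicated inference thread and returns the request sender.
/// `model_bytes` is the leaked mmap (something like `Box::leak(Box::new(mmap))`),
/// which is how it satisfies the 'static bound.
pub fn spawn_inference_thread(model_bytes: &'static [u8]) -> mpsc::Sender<InferenceRequest> {
    let (tx, rx) = mpsc::channel::<InferenceRequest>();
    thread::spawn(move || {
        // This thread exclusively owns the model resources; none of them
        // ever need to be Send across an await point.
        // let runner = TransformerRunner::new(model_bytes); // real construction elided
        let _ = model_bytes;

        // Blocking recv loop: requests are served strictly one at a time,
        // so concurrent HTTP requests simply queue on the channel.
        for req in rx {
            // Placeholder "generation": echo the prompt as a single token.
            let _ = req.reply.send(TokenEvent::Token(req.prompt.clone()));
            let _ = req.reply.send(TokenEvent::Done);
        }
    });
    tx
}
```

The returned Sender is cheaply cloneable, so each Axum handler can hold its own copy while all model state stays on the dedicated thread.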
Single request at a time. This is the right tradeoff for single-model CPU inference. Concurrent requests queue on the channel.
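On the async side, a handler only needs a clone of that Sender. Here is a hedged sketch of the collected (non-streaming) path, reusing the stand-in InferenceRequest / TokenEvent types from the sketch above; generate_collected and its shape are illustrative, not Yule's actual route handlers:

```rust
use std::sync::mpsc;

/// Collected-response path: queue a request, then await tokens without
/// blocking the tokio runtime. (The SSE path would instead wrap `reply_rx`
/// in a stream and emit each TokenEvent as a server-sent event.)
pub async fn generate_collected(
    infer_tx: mpsc::Sender<InferenceRequest>,
    prompt: String,
) -> String {
    let (reply_tx, mut reply_rx) = tokio::sync::mpsc::unbounded_channel();

    // If the inference thread is busy, this request simply waits in the
    // channel until the current one finishes.
    infer_tx
        .send(InferenceRequest { prompt, reply: reply_tx })
        .expect("inference thread has exited");

    // Awaiting here yields to the runtime; tokens arrive as the dedicated
    // thread produces them.
    let mut out = String::new();
    while let Some(event) = reply_rx.recv().await {
        match event {
            TokenEvent::Token(t) => out.push_str(&t),
            TokenEvent::Done => break,
        }
    }
    out
}
```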
Why Rust
Not just "memory safety". The specific wins:
- GGUF parsers in C/C++ have had CVEs. Buffer overflows in file format parsers are the most common vulnerability class. Rust eliminates them.
- The runtime executes arbitrary model weights. A tampered model could exploit parser bugs to achieve code execution. Rust's type system prevents this entire class of attack.
- Zero-cost abstractions for SIMD. AVX2 intrinsics via `std::arch` with no runtime overhead. The SIMD dispatch uses an `AtomicU8` flag checked once at startup (a sketch of this dispatch pattern follows the list).
- No GC pauses during inference. Token generation latency is predictable.
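The flag-checked dispatch could look roughly like the sketch below. The level encoding, init_simd, dot, and the scalar body inside dot_avx2 are illustrative assumptions, not Yule's actual kernels:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical encoding (not Yule's actual values): 1 = scalar fallback, 2 = AVX2.
static SIMD_LEVEL: AtomicU8 = AtomicU8::new(1);

#[cfg(target_arch = "x86_64")]
fn detect_level() -> u8 {
    if std::arch::is_x86_feature_detected!("avx2") { 2 } else { 1 }
}

#[cfg(not(target_arch = "x86_64"))]
fn detect_level() -> u8 {
    1
}

/// Probe CPU features once at startup; hot paths only do a relaxed load afterwards.
pub fn init_simd() {
    SIMD_LEVEL.store(detect_level(), Ordering::Relaxed);
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    // Illustrative body only; a real kernel would use the _mm256 intrinsics in std::arch.
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

#[cfg(target_arch = "x86_64")]
fn dot_simd(a: &[f32], b: &[f32]) -> Option<f32> {
    if SIMD_LEVEL.load(Ordering::Relaxed) == 2 {
        // SAFETY: level 2 is only stored after the AVX2 runtime check succeeded.
        Some(unsafe { dot_avx2(a, b) })
    } else {
        None
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn dot_simd(_a: &[f32], _b: &[f32]) -> Option<f32> {
    None
}

/// AVX2 path when available, portable scalar fallback otherwise.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    dot_simd(a, b).unwrap_or_else(|| a.iter().zip(b).map(|(x, y)| x * y).sum())
}
```

The point of the AtomicU8 is that the (relatively expensive) CPUID-based feature probe runs once, while every kernel invocation afterwards pays only a relaxed atomic load and a branch.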