
Architecture Overview

Crate structure, inference thread model, and design decisions

Crate Structure

Yule is a 10-crate Cargo workspace:

Crate           Purpose
yule-core       GGUF parser, dequant kernels, tokenizer, SIMD dispatch
yule-infer      TransformerRunner, KV cache, sampling, weight loading
yule-gpu        Compute backend abstraction (CPU done, Vulkan/Metal/CUDA planned)
yule-verify     Merkle tree, Ed25519 signatures
yule-attest     Attestation sessions, audit log
yule-sandbox    Process isolation (Job Object on Windows)
yule-api        Axum HTTP server, auth, routes, SSE streaming
yule-registry   Model download and cache (planned)
yule-cli        CLI entry point
yule-bench      Criterion benchmarks

Inference Thread Model

The API server needs to be async (Axum/tokio) for HTTP handling, but TransformerRunner holds &[u8] references into the mmap that aren't Send, so it can't be moved into a tokio task.

Solution: a dedicated std::thread owns all model resources. Async HTTP handlers communicate with it via channels:

HTTP request
  → Axum handler
    → std::sync::mpsc::Sender<InferenceRequest>
      → Inference Thread (owns mmap + runner + tokenizer)
        → tokens via tokio::sync::mpsc::UnboundedSender<TokenEvent>
          → SSE stream or collected JSON response
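
A minimal sketch of this topology, with illustrative type and field names rather than Yule's actual API: the std thread owns everything model-related, and each request carries a tokio sender so tokens can be streamed back into async code.

```rust
use std::sync::mpsc;
use std::thread;

use tokio::sync::mpsc::UnboundedSender;

/// Illustrative request type (not Yule's actual API).
struct InferenceRequest {
    prompt: String,
    max_tokens: usize,
    /// Back-channel into async land; the handler holds the receiving end.
    reply: UnboundedSender<TokenEvent>,
}

enum TokenEvent {
    Token(String),
    Done,
}

/// Spawns the dedicated inference thread and returns the request sender
/// that Axum handlers clone into their shared state.
fn spawn_inference_thread() -> mpsc::Sender<InferenceRequest> {
    let (tx, rx) = mpsc::channel::<InferenceRequest>();
    thread::spawn(move || {
        // The mmap, TransformerRunner, and tokenizer would be created here
        // and never leave this thread.
        for req in rx {
            // Placeholder generation loop: echoes prompt words instead of decoding.
            for piece in req.prompt.split_whitespace().take(req.max_tokens) {
                if req.reply.send(TokenEvent::Token(piece.to_string())).is_err() {
                    break; // client disconnected, stop generating
                }
            }
            let _ = req.reply.send(TokenEvent::Done);
        }
    });
    tx
}
```

On the async side, a handler creates a tokio::sync::mpsc::unbounded_channel(), sends an InferenceRequest through the returned Sender, and either turns the UnboundedReceiver into an SSE stream or collects it into a JSON response.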

The mmap is intentionally leaked via Box::leak to get a 'static lifetime. This is correct because the model stays loaded for the entire server lifetime.
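
A minimal sketch of that leak, assuming the memmap2 crate (the section only says "mmap"); names are illustrative:

```rust
use std::fs::File;

use memmap2::Mmap; // assumed crate; the section only says "mmap"

/// Maps the model file and deliberately leaks the mapping so the borrowed
/// byte slice is 'static. The OS reclaims the mapping when the process exits.
fn map_model(path: &str) -> std::io::Result<&'static [u8]> {
    let file = File::open(path)?;
    // Safety: the model file must not be truncated or modified while mapped.
    let mmap = unsafe { Mmap::map(&file)? };
    let leaked: &'static Mmap = Box::leak(Box::new(mmap));
    Ok(&leaked[..])
}
```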

The inference thread handles one request at a time; concurrent requests queue on the channel. For single-model CPU inference this is the right tradeoff: a single generation already saturates the available cores, so interleaving requests would only add latency.

Why Rust

Not just "memory safety". The specific wins:

  • GGUF parsers in C/C++ have had CVEs; buffer overflows in file-format parsers are among the most common vulnerability classes, and safe Rust rules them out.
  • The runtime loads and runs arbitrary model weights. A tampered model could exploit parser bugs to achieve code execution; Rust's memory-safety guarantees close off that class of attack.
  • Zero-cost abstractions for SIMD: AVX2 intrinsics via std::arch with no runtime overhead. The SIMD dispatch caches the detected CPU features in an AtomicU8 flag set once at startup (see the sketch after this list).
  • No GC pauses during inference. Token generation latency is predictable.
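
A sketch of that detect-once dispatch pattern, with illustrative names rather than yule-core's actual internals:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const UNKNOWN: u8 = 0;
const SCALAR: u8 = 1;
const AVX2: u8 = 2;

// Cached CPU capability level; written once at startup, read on the hot path.
static SIMD_LEVEL: AtomicU8 = AtomicU8::new(UNKNOWN);

/// Called once during startup, before any inference work.
fn init_simd_dispatch() {
    #[cfg(target_arch = "x86_64")]
    let level = if std::is_x86_feature_detected!("avx2") { AVX2 } else { SCALAR };
    #[cfg(not(target_arch = "x86_64"))]
    let level = SCALAR;
    SIMD_LEVEL.store(level, Ordering::Relaxed);
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if SIMD_LEVEL.load(Ordering::Relaxed) == AVX2 {
            // Safety: only reached after AVX2 was detected at runtime.
            return unsafe { dot_avx2(a, b) };
        }
    }
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    // A real kernel would use _mm256_* intrinsics from std::arch here;
    // the scalar body keeps the sketch self-contained.
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    init_simd_dispatch();
    println!("{}", dot(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]));
}
```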
