# API Server — Two Surfaces, One Engine

`yule run` is for humans. `yule serve` is for everything else — editors, chat UIs, automation pipelines, agents. I needed an HTTP API that's both standards-compatible (OpenAI format, because that's what every tool expects) and integrity-native (because I have things to prove that OpenAI's format can't express).

## Two Surfaces

### Yule-native (`/yule/*`)
Every response includes an integrity block:
```json
{
  "integrity": {
    "merkle_root": "a1b2c3...",
    "sandbox_active": true,
    "attestation_id": "17080...",
    "device_pubkey": "ed25519:..."
  }
}
```
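On the Rust side, that block maps naturally onto a small serde struct. A minimal sketch; the type and field names are mine, not necessarily Yule's:

```rust
use serde::Serialize;

// Hypothetical types for illustration; Yule's actual structs may differ.
#[derive(Serialize)]
struct IntegrityBlock {
    merkle_root: String,    // Merkle root over the model's tensor data
    sandbox_active: bool,   // was the sandbox active during inference?
    attestation_id: String, // ID of the attestation record for this response
    device_pubkey: String,  // e.g. "ed25519:..."
}

#[derive(Serialize)]
struct ChatResponse {
    content: String,
    integrity: IntegrityBlock,
}

fn main() {
    let resp = ChatResponse {
        content: "Hello".into(),
        integrity: IntegrityBlock {
            merkle_root: "a1b2c3...".into(),
            sandbox_active: true,
            attestation_id: "17080...".into(),
            device_pubkey: "ed25519:...".into(),
        },
    };
    println!("{}", serde_json::to_string_pretty(&resp).unwrap());
}
```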
Endpoints:

- `GET /yule/health` — status, uptime, model architecture, sandbox state
- `GET /yule/model` — full model info: architecture, params, context length, merkle root, tensor count
- `POST /yule/chat` — messages in, tokens out, with integrity proof
- `POST /yule/tokenize` — text in, token IDs out

### OpenAI-compatible (`/v1/*`)
Standard format so existing tools work out of the box:
- `POST /v1/chat/completions` — streaming and non-streaming
- `GET /v1/models` — list available models

Point any OpenAI-compatible client at `http://localhost:11434` and it works. The trade-off: you lose the integrity block because OpenAI's response format doesn't have a field for it.
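To make the layout concrete, here's a rough sketch of how the two surfaces could be nested under one Axum router (axum 0.7-style `serve`). The handler bodies and names are placeholders, not Yule's actual code:

```rust
use axum::{routing::{get, post}, Router};

// Placeholder handlers; the real ones run tokenization/inference and
// attach the integrity block on the /yule side.
async fn yule_health() -> &'static str { "ok" }
async fn yule_model() -> &'static str { "model info" }
async fn yule_chat() -> &'static str { "chat" }
async fn yule_tokenize() -> &'static str { "tokenize" }
async fn oai_chat_completions() -> &'static str { "completions" }
async fn oai_models() -> &'static str { "models" }

fn router() -> Router {
    // Yule-native surface: responses carry the integrity block.
    let yule = Router::new()
        .route("/health", get(yule_health))
        .route("/model", get(yule_model))
        .route("/chat", post(yule_chat))
        .route("/tokenize", post(yule_tokenize));

    // OpenAI-compatible surface: plain OpenAI response shapes.
    let openai = Router::new()
        .route("/chat/completions", post(oai_chat_completions))
        .route("/models", get(oai_models));

    Router::new().nest("/yule", yule).nest("/v1", openai)
}

#[tokio::main]
async fn main() {
    let listener = tokio::net::TcpListener::bind("127.0.0.1:11434").await.unwrap();
    axum::serve(listener, router()).await.unwrap();
}
```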
## Streaming
Both surfaces support SSE. For Yule-native, the stream sends typed events:
```
data: {"type":"token","content":"Hello"}
data: {"type":"token","content":" world"}
data: {"type":"done","token_count":42,"integrity":{...}}
```

The attestation record is created on the `done` event, after all tokens are collected. So the integrity proof covers the full output.
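That event shape falls out of an internally tagged serde enum. A sketch, assuming the events are serialized with serde_json; the enum and variant names are mine:

```rust
use serde::Serialize;

// One event on the Yule-native SSE stream. `tag = "type"` plus snake_case
// variant names produce the {"type":"token",...} / {"type":"done",...}
// shapes shown above. Illustrative names, not Yule's actual types.
#[derive(Serialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum StreamEvent {
    Token { content: String },
    Done { token_count: usize, integrity: serde_json::Value },
}

fn main() {
    let events = [
        StreamEvent::Token { content: "Hello".into() },
        StreamEvent::Done { token_count: 42, integrity: serde_json::json!({}) },
    ];
    for ev in &events {
        // Each SSE frame is "data: <json>" followed by a blank line.
        println!("data: {}\n", serde_json::to_string(ev).unwrap());
    }
}
```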
## The Inference Thread Problem

`TransformerRunner` holds `&[u8]` references into the memory-mapped model file. `Mmap` isn't `Send`, so the runner can't move across thread boundaries. But Axum handlers are async and run on the tokio runtime, which is multi-threaded.

Solution: the inference engine lives on a dedicated `std::thread`. HTTP handlers send requests via an `mpsc` channel and receive tokens back via another channel:
```
HTTP handler → InferenceRequest → mpsc::Sender → Inference Thread
                                                        │
HTTP handler ← token/done ← mpsc::Receiver ← ──────────┘
```

The inference thread owns the model, tokenizer, and mmap. It runs a blocking loop: receive request, tokenize, run forward passes, send tokens back one at a time. No locks needed.
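A stripped-down sketch of that pattern: a std channel carries requests into the owning thread, and each request carries a tokio channel back out so async handlers can await tokens (one possible split; the post doesn't say which mpsc flavors Yule uses). The request type and the stand-in "inference" loop are illustrative, not Yule's real code:

```rust
use std::sync::mpsc as std_mpsc;
use tokio::sync::mpsc as tokio_mpsc;

/// What a handler sends to the inference thread. The real request type
/// presumably carries chat messages and sampling params.
struct InferenceRequest {
    prompt: String,
    /// Channel the inference thread streams tokens back on.
    tokens: tokio_mpsc::Sender<String>,
}

/// Spawn the dedicated inference thread. It alone owns the (non-Send) model
/// state, so there are no locks: one thread, one blocking receive loop.
fn spawn_inference_thread() -> std_mpsc::Sender<InferenceRequest> {
    let (tx, rx) = std_mpsc::channel::<InferenceRequest>();
    std::thread::spawn(move || {
        // In the real server, the mmap, tokenizer and TransformerRunner
        // would be constructed here and held for the thread's lifetime.
        while let Ok(req) = rx.recv() {
            // Stand-in for tokenize + forward passes: echo the prompt word by word.
            for word in req.prompt.split_whitespace() {
                // blocking_send is fine here: this thread is not on the tokio runtime.
                if req.tokens.blocking_send(word.to_string()).is_err() {
                    break; // client went away, drop the rest
                }
            }
        }
    });
    tx
}

/// Handler side: send a request, then await tokens as they arrive.
async fn run_prompt(infer_tx: std_mpsc::Sender<InferenceRequest>, prompt: String) -> Vec<String> {
    let (tok_tx, mut tok_rx) = tokio_mpsc::channel(64);
    infer_tx
        .send(InferenceRequest { prompt, tokens: tok_tx })
        .expect("inference thread is alive");
    let mut out = Vec::new();
    while let Some(tok) = tok_rx.recv().await {
        out.push(tok);
    }
    out
}

#[tokio::main]
async fn main() {
    let infer_tx = spawn_inference_thread();
    println!("{:?}", run_prompt(infer_tx, "hello from the sketch".into()).await);
}
```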
## Authentication

Capability-token auth. On startup, the server either generates a random token via `getrandom` or accepts one via `--token`. The token is printed to stderr (not stdout, so piping doesn't leak it). The server stores the blake3 hash — if the server's memory is dumped, the attacker gets a hash, not the token.

All endpoints require `Authorization: Bearer <token>`. 401 on failure.
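A sketch of that token lifecycle, assuming the getrandom 0.2 API and the blake3 crate (whose `Hash` comparison is constant-time). Function names and formats here are guesses:

```rust
/// Generate a capability token and derive the hash the server keeps.
/// The plaintext is returned only so the demo below can exercise `authorize`.
fn issue_token() -> (String, blake3::Hash) {
    let mut raw = [0u8; 32];
    getrandom::getrandom(&mut raw).expect("OS randomness");
    let token: String = raw.iter().map(|b| format!("{b:02x}")).collect();
    // stderr, not stdout, so piping the server's output doesn't capture it.
    eprintln!("API token: {token}");
    (token, blake3::hash(token.as_bytes()))
}

/// Check an incoming `Authorization` header value against the stored hash.
fn authorize(stored: &blake3::Hash, header: Option<&str>) -> bool {
    match header.and_then(|h| h.strip_prefix("Bearer ")) {
        // blake3::Hash equality is constant-time, so this doesn't leak timing.
        Some(presented) => blake3::hash(presented.as_bytes()) == *stored,
        None => false, // missing or malformed header -> 401
    }
}

fn main() {
    let (token, stored) = issue_token();
    let good = format!("Bearer {token}");
    assert!(authorize(&stored, Some(good.as_str())));
    assert!(!authorize(&stored, Some("Bearer wrong")));
    assert!(!authorize(&stored, None));
}
```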
## Notes

- Axum over warp (too declarative) and actix-web (actor model is overkill here). Tower middleware handles auth, logging, and error handling.
- A dedicated thread instead of `spawn_blocking`, because `spawn_blocking` runs each request on whichever blocking-pool thread is free, and the model isn't thread-safe. One owning thread plus one channel is simpler than locks.
- Two API surfaces because OpenAI's response format has no field for merkle root, sandbox status, or attestation ID. The native surface carries integrity data; the OpenAI surface is for tool compatibility.