# API Server — Two Surfaces, One Engine

`yule run` is for humans. `yule serve` is for everything else — editors, chat UIs, automation pipelines, agents. I needed an HTTP API that's both standards-compatible (OpenAI format, because that's what every tool expects) and integrity-native (because I have things to prove that OpenAI's format can't express).

## Two Surfaces

### Yule-native (`/yule/*`)
Every response includes an integrity block:
```json
{
  "integrity": {
    "merkle_root": "a1b2c3...",
    "sandbox_active": true,
    "attestation_id": "17080...",
    "device_pubkey": "ed25519:..."
  }
}
```
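On the Rust side, that block maps naturally onto a small serde struct. A minimal sketch; the type and field names are mine, not necessarily Yule's:

```rust
use serde::Serialize;

// Hypothetical types for illustration; Yule's actual structs may differ.
#[derive(Serialize)]
struct IntegrityBlock {
    merkle_root: String,    // Merkle root over the model's tensor data
    sandbox_active: bool,   // was the sandbox active during inference?
    attestation_id: String, // ID of the attestation record for this response
    device_pubkey: String,  // e.g. "ed25519:..."
}

#[derive(Serialize)]
struct ChatResponse {
    content: String,
    integrity: IntegrityBlock,
}

fn main() {
    let resp = ChatResponse {
        content: "Hello".into(),
        integrity: IntegrityBlock {
            merkle_root: "a1b2c3...".into(),
            sandbox_active: true,
            attestation_id: "17080...".into(),
            device_pubkey: "ed25519:...".into(),
        },
    };
    println!("{}", serde_json::to_string_pretty(&resp).unwrap());
}
```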
Endpoints:

- `GET /yule/health` — status, uptime, model architecture, sandbox state
- `GET /yule/model` — full model info: architecture, params, context length, merkle root, tensor count
- `POST /yule/chat` — messages in, tokens out, with integrity proof
- `POST /yule/tokenize` — text in, token IDs out

### OpenAI-compatible (`/v1/*`)
Standard format so existing tools work out of the box:
- `POST /v1/chat/completions` — streaming and non-streaming
- `GET /v1/models` — list available models

Point any OpenAI-compatible client at `http://localhost:11434` and it works. The trade-off: you lose the integrity block because OpenAI's response format doesn't have a field for it.
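To make the layout concrete, here's a rough sketch of how the two surfaces could be nested under one Axum router (axum 0.7-style `serve`). The handler bodies and names are placeholders, not Yule's actual code:

```rust
use axum::{routing::{get, post}, Router};

// Placeholder handlers; the real ones run tokenization/inference and
// attach the integrity block on the /yule side.
async fn yule_health() -> &'static str { "ok" }
async fn yule_model() -> &'static str { "model info" }
async fn yule_chat() -> &'static str { "chat" }
async fn yule_tokenize() -> &'static str { "tokenize" }
async fn oai_chat_completions() -> &'static str { "completions" }
async fn oai_models() -> &'static str { "models" }

fn router() -> Router {
    // Yule-native surface: responses carry the integrity block.
    let yule = Router::new()
        .route("/health", get(yule_health))
        .route("/model", get(yule_model))
        .route("/chat", post(yule_chat))
        .route("/tokenize", post(yule_tokenize));

    // OpenAI-compatible surface: plain OpenAI response shapes.
    let openai = Router::new()
        .route("/chat/completions", post(oai_chat_completions))
        .route("/models", get(oai_models));

    Router::new().nest("/yule", yule).nest("/v1", openai)
}

#[tokio::main]
async fn main() {
    let listener = tokio::net::TcpListener::bind("127.0.0.1:11434").await.unwrap();
    axum::serve(listener, router()).await.unwrap();
}
```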
## Streaming
Both surfaces support SSE. For Yule-native, the stream sends typed events:
```
data: {"type":"token","content":"Hello"}
data: {"type":"token","content":" world"}
data: {"type":"done","token_count":42,"integrity":{...}}
```

The attestation record is created on the `done` event, after all tokens are collected. So the integrity proof covers the full output.
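That event shape falls out of an internally tagged serde enum. A sketch, assuming the events are serialized with serde_json; the enum and variant names are mine:

```rust
use serde::Serialize;

// One event on the Yule-native SSE stream. `tag = "type"` plus snake_case
// variant names produce the {"type":"token",...} / {"type":"done",...}
// shapes shown above. Illustrative names, not Yule's actual types.
#[derive(Serialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum StreamEvent {
    Token { content: String },
    Done { token_count: usize, integrity: serde_json::Value },
}

fn main() {
    let events = [
        StreamEvent::Token { content: "Hello".into() },
        StreamEvent::Done { token_count: 42, integrity: serde_json::json!({}) },
    ];
    for ev in &events {
        // Each SSE frame is "data: <json>" followed by a blank line.
        println!("data: {}\n", serde_json::to_string(ev).unwrap());
    }
}
```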
## The Inference Thread Problem

`TransformerRunner` holds `&[u8]` references into the memory-mapped model file. `Mmap` isn't `Send`, so the runner can't move across thread boundaries. But Axum handlers are async and run on the tokio runtime, which is multi-threaded.

Solution: the inference engine lives on a dedicated `std::thread`. HTTP handlers send requests via an `mpsc` channel and receive tokens back via another channel:
```
HTTP handler → InferenceRequest → mpsc::Sender → Inference Thread
                                                        │
HTTP handler ← token/done ← mpsc::Receiver ← ──────────┘
```

The inference thread owns the model, tokenizer, and mmap. It runs a blocking loop: receive request, tokenize, run forward passes, send tokens back one at a time. No locks needed.
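A stripped-down sketch of that pattern: a std channel carries requests into the owning thread, and each request carries a tokio channel back out so async handlers can await tokens (one possible split; the post doesn't say which mpsc flavors Yule uses). The request type and the stand-in "inference" loop are illustrative, not Yule's real code:

```rust
use std::sync::mpsc as std_mpsc;
use tokio::sync::mpsc as tokio_mpsc;

/// What a handler sends to the inference thread. The real request type
/// presumably carries chat messages and sampling params.
struct InferenceRequest {
    prompt: String,
    /// Channel the inference thread streams tokens back on.
    tokens: tokio_mpsc::Sender<String>,
}

/// Spawn the dedicated inference thread. It alone owns the (non-Send) model
/// state, so there are no locks: one thread, one blocking receive loop.
fn spawn_inference_thread() -> std_mpsc::Sender<InferenceRequest> {
    let (tx, rx) = std_mpsc::channel::<InferenceRequest>();
    std::thread::spawn(move || {
        // In the real server, the mmap, tokenizer and TransformerRunner
        // would be constructed here and held for the thread's lifetime.
        while let Ok(req) = rx.recv() {
            // Stand-in for tokenize + forward passes: echo the prompt word by word.
            for word in req.prompt.split_whitespace() {
                // blocking_send is fine here: this thread is not on the tokio runtime.
                if req.tokens.blocking_send(word.to_string()).is_err() {
                    break; // client went away, drop the rest
                }
            }
        }
    });
    tx
}

/// Handler side: send a request, then await tokens as they arrive.
async fn run_prompt(infer_tx: std_mpsc::Sender<InferenceRequest>, prompt: String) -> Vec<String> {
    let (tok_tx, mut tok_rx) = tokio_mpsc::channel(64);
    infer_tx
        .send(InferenceRequest { prompt, tokens: tok_tx })
        .expect("inference thread is alive");
    let mut out = Vec::new();
    while let Some(tok) = tok_rx.recv().await {
        out.push(tok);
    }
    out
}

#[tokio::main]
async fn main() {
    let infer_tx = spawn_inference_thread();
    println!("{:?}", run_prompt(infer_tx, "hello from the sketch".into()).await);
}
```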
## Authentication

Capability-token auth. On startup, the server either generates a random token via `getrandom` or accepts one via `--token`. The token is printed to stderr (not stdout, so piping doesn't leak it). The server stores the blake3 hash — if the server's memory is dumped, the attacker gets a hash, not the token.

All endpoints require `Authorization: Bearer <token>`. 401 on failure.
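A sketch of that token lifecycle, assuming the getrandom 0.2 API and the blake3 crate (whose `Hash` comparison is constant-time). Function names and formats here are guesses:

```rust
/// Generate a capability token and derive the hash the server keeps.
/// The plaintext is returned only so the demo below can exercise `authorize`.
fn issue_token() -> (String, blake3::Hash) {
    let mut raw = [0u8; 32];
    getrandom::getrandom(&mut raw).expect("OS randomness");
    let token: String = raw.iter().map(|b| format!("{b:02x}")).collect();
    // stderr, not stdout, so piping the server's output doesn't capture it.
    eprintln!("API token: {token}");
    (token, blake3::hash(token.as_bytes()))
}

/// Check an incoming `Authorization` header value against the stored hash.
fn authorize(stored: &blake3::Hash, header: Option<&str>) -> bool {
    match header.and_then(|h| h.strip_prefix("Bearer ")) {
        // blake3::Hash equality is constant-time, so this doesn't leak timing.
        Some(presented) => blake3::hash(presented.as_bytes()) == *stored,
        None => false, // missing or malformed header -> 401
    }
}

fn main() {
    let (token, stored) = issue_token();
    let good = format!("Bearer {token}");
    assert!(authorize(&stored, Some(good.as_str())));
    assert!(!authorize(&stored, Some("Bearer wrong")));
    assert!(!authorize(&stored, None));
}
```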
## Notes

- Axum over warp (too declarative) and actix-web (actor model is overkill here). Tower middleware handles auth, logging, and error handling.
- A dedicated thread instead of `spawn_blocking`, because `spawn_blocking` runs each request on whichever blocking-pool thread is free, and the model isn't thread-safe. One owning thread plus one channel is simpler than locks.
- Two API surfaces because OpenAI's response format has no field for merkle root, sandbox status, or attestation ID. The native surface carries integrity data; the OpenAI surface is for tool compatibility.