# yule serve

Start the local API server.

## Usage

```sh
yule serve <model> [options]
```

## Arguments
| Argument | Description |
|---|---|
| model | Path to a .gguf model file |
## Options
| Flag | Default | Description |
|---|---|---|
| --bind <addr> | 127.0.0.1:11434 | Address and port to listen on |
| --token <token> | Auto-generated | Use a specific auth token instead of generating one |
| --no-sandbox | false | Disable process sandboxing |
## What Happens on Start
- Model file is parsed and weights are memory-mapped
- Merkle tree is computed over all tensor data (blake3, 1MB leaves; see the sketch after this list)
- Inference thread spawns with the model loaded
- Auth token is generated (or the provided one is registered)
- HTTP server starts listening
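
The Merkle root is a fingerprint of the weights, and a prefix of it appears in the startup log. Below is a minimal Python sketch of the general technique, assuming the blake3 PyPI package; hashing the raw file (rather than per-tensor data) and the parent rule (blake3 of the concatenated child digests) are simplifications for illustration, not yule's actual scheme.

```python
# Sketch only: Merkle root over a file with blake3 and 1MB leaves.
# Assumes `pip install blake3`. Hashing the whole file and the
# parent rule used here are illustrative simplifications.
import sys
from blake3 import blake3

LEAF_SIZE = 1024 * 1024  # 1MB leaves, per the startup description

def merkle_root(path: str) -> bytes:
    # Leaf layer: one blake3 digest per 1MB chunk of the file.
    with open(path, "rb") as f:
        level = [blake3(chunk).digest()
                 for chunk in iter(lambda: f.read(LEAF_SIZE), b"")]
    # Combine adjacent pairs upward until one root remains
    # (a lone trailing node is rehashed by itself).
    while len(level) > 1:
        level = [blake3(b"".join(level[i:i + 2])).digest()
                 for i in range(0, len(level), 2)]
    return level[0] if level else blake3(b"").digest()

if __name__ == "__main__":
    print(merkle_root(sys.argv[1]).hex())
```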
The server prints the token and endpoint URLs to stderr:
```
loading model: ./model.gguf
model loaded: Llama (201 tensors, merkle: ffc7e1fd6016a6f9)
token: yule_b49913e2c05162951af4f87d62c2c9a6555eb91299c7fdcc
listening on 127.0.0.1:11434
yule api: http://127.0.0.1:11434/yule/health
openai: http://127.0.0.1:11434/v1/chat/completions
```

## Auth
Every request must include the token in an Authorization: Bearer header:
```sh
curl -H "Authorization: Bearer yule_b499..." http://localhost:11434/yule/health
```

Requests without a valid token get a 401 Unauthorized response.
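
The same header works for the OpenAI-compatible endpoint. Here is a minimal Python sketch using only the standard library; it assumes the usual chat-completions request shape, and the model value is a placeholder, since this page doesn't specify what yule expects in that field.

```python
# Sketch: authenticated request to the OpenAI-compatible endpoint.
# TOKEN is a placeholder -- paste the token printed by `yule serve`.
# The "model" value is also a placeholder.
import json
import urllib.request

TOKEN = "yule_..."  # replace with your real token

req = urllib.request.Request(
    "http://127.0.0.1:11434/v1/chat/completions",
    data=json.dumps({
        "model": "model.gguf",  # placeholder name
        "messages": [{"role": "user", "content": "Say hello."}],
    }).encode(),
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```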
## Sandbox
By default, the server process is placed in a Windows Job Object sandbox with memory limits, no child process spawning, and UI restrictions. Use --no-sandbox to disable this (not recommended for untrusted models).
See Security for details.
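
For a concrete sense of the mechanism (not yule's actual code), here is a ctypes sketch of the underlying Windows pattern: create a Job Object, set a process memory cap and an active-process limit of one, and assign the current process to it. The 8 GiB cap is an arbitrary example value, and the UI restrictions (set similarly via JobObjectBasicUIRestrictions) are omitted for brevity.

```python
# Sketch (Windows only): Job Object with a memory cap and no child
# processes -- illustrates the pattern, not yule's implementation.
import ctypes
import ctypes.wintypes as wt

class IO_COUNTERS(ctypes.Structure):
    _fields_ = [(n, ctypes.c_ulonglong) for n in (
        "ReadOperationCount", "WriteOperationCount", "OtherOperationCount",
        "ReadTransferCount", "WriteTransferCount", "OtherTransferCount")]

class JOBOBJECT_BASIC_LIMIT_INFORMATION(ctypes.Structure):
    _fields_ = [("PerProcessUserTimeLimit", wt.LARGE_INTEGER),
                ("PerJobUserTimeLimit", wt.LARGE_INTEGER),
                ("LimitFlags", wt.DWORD),
                ("MinimumWorkingSetSize", ctypes.c_size_t),
                ("MaximumWorkingSetSize", ctypes.c_size_t),
                ("ActiveProcessLimit", wt.DWORD),
                ("Affinity", ctypes.c_size_t),
                ("PriorityClass", wt.DWORD),
                ("SchedulingClass", wt.DWORD)]

class JOBOBJECT_EXTENDED_LIMIT_INFORMATION(ctypes.Structure):
    _fields_ = [("BasicLimitInformation", JOBOBJECT_BASIC_LIMIT_INFORMATION),
                ("IoInfo", IO_COUNTERS),
                ("ProcessMemoryLimit", ctypes.c_size_t),
                ("JobMemoryLimit", ctypes.c_size_t),
                ("PeakProcessMemoryUsed", ctypes.c_size_t),
                ("PeakJobMemoryUsed", ctypes.c_size_t)]

JOB_OBJECT_LIMIT_ACTIVE_PROCESS = 0x0008  # enforce ActiveProcessLimit
JOB_OBJECT_LIMIT_PROCESS_MEMORY = 0x0100  # enforce ProcessMemoryLimit
JobObjectExtendedLimitInformation = 9

k32 = ctypes.windll.kernel32
k32.CreateJobObjectW.restype = wt.HANDLE
k32.GetCurrentProcess.restype = wt.HANDLE
k32.SetInformationJobObject.argtypes = (wt.HANDLE, ctypes.c_int,
                                        ctypes.c_void_p, wt.DWORD)
k32.AssignProcessToJobObject.argtypes = (wt.HANDLE, wt.HANDLE)

job = k32.CreateJobObjectW(None, None)
info = JOBOBJECT_EXTENDED_LIMIT_INFORMATION()
info.BasicLimitInformation.LimitFlags = (
    JOB_OBJECT_LIMIT_ACTIVE_PROCESS | JOB_OBJECT_LIMIT_PROCESS_MEMORY)
info.BasicLimitInformation.ActiveProcessLimit = 1  # no child processes
info.ProcessMemoryLimit = 8 * 1024 ** 3            # example cap: 8 GiB

k32.SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                            ctypes.byref(info), ctypes.sizeof(info))
k32.AssignProcessToJobObject(job, k32.GetCurrentProcess())
```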
## Examples
```sh
# default settings
yule serve ./model.gguf

# custom bind address with a fixed token
yule serve ./model.gguf --bind 0.0.0.0:8080 --token my-secret-key

# no sandbox (development only)
yule serve ./model.gguf --no-sandbox
```