# yule serve

Start the local API server.

## Usage

```sh
yule serve <model> [options]
```

## Arguments
| Argument | Description |
|---|---|
| model | Path to a .gguf model file |
## Options
| Flag | Default | Description |
|---|---|---|
| --bind <addr> | 127.0.0.1:11434 | Address and port to listen on |
| --token <token> | Auto-generated | Use a specific auth token instead of generating one |
| --no-sandbox | false | Disable process sandboxing |
## What Happens on Start
- Model file is parsed and weights are memory-mapped
- Merkle tree is computed over all tensor data (blake3, 1MB leaves; see the sketch after this list)
- Inference thread spawns with the model loaded
- Auth token is generated (or the provided one is registered)
- HTTP server starts listening
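
The Merkle root is a fingerprint of the weights, and a prefix of it appears in the startup log. Below is a minimal Python sketch of the general technique, assuming the blake3 PyPI package; hashing the raw file (rather than per-tensor data) and the parent rule (blake3 of the concatenated child digests) are simplifications for illustration, not yule's actual scheme.

```python
# Sketch only: Merkle root over a file with blake3 and 1MB leaves.
# Assumes `pip install blake3`. Hashing the whole file and the
# parent rule used here are illustrative simplifications.
import sys
from blake3 import blake3

LEAF_SIZE = 1024 * 1024  # 1MB leaves, per the startup description

def merkle_root(path: str) -> bytes:
    # Leaf layer: one blake3 digest per 1MB chunk of the file.
    with open(path, "rb") as f:
        level = [blake3(chunk).digest()
                 for chunk in iter(lambda: f.read(LEAF_SIZE), b"")]
    # Combine adjacent pairs upward until one root remains
    # (a lone trailing node is rehashed by itself).
    while len(level) > 1:
        level = [blake3(b"".join(level[i:i + 2])).digest()
                 for i in range(0, len(level), 2)]
    return level[0] if level else blake3(b"").digest()

if __name__ == "__main__":
    print(merkle_root(sys.argv[1]).hex())
```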
The server prints the token and endpoint URLs to stderr:
```
loading model: ./model.gguf
model loaded: Llama (201 tensors, merkle: ffc7e1fd6016a6f9)
token: yule_b49913e2c05162951af4f87d62c2c9a6555eb91299c7fdcc
listening on 127.0.0.1:11434
yule api: http://127.0.0.1:11434/yule/health
openai: http://127.0.0.1:11434/v1/chat/completions
```

## Auth
Every request must include the token in an Authorization: Bearer header:
```sh
curl -H "Authorization: Bearer yule_b499..." http://localhost:11434/yule/health
```

Requests without a valid token get a 401 Unauthorized response.
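
The same header works for the OpenAI-compatible endpoint. Here is a minimal Python sketch using only the standard library; it assumes the usual chat-completions request shape, and the model value is a placeholder, since this page doesn't specify what yule expects in that field.

```python
# Sketch: authenticated request to the OpenAI-compatible endpoint.
# TOKEN is a placeholder -- paste the token printed by `yule serve`.
# The "model" value is also a placeholder.
import json
import urllib.request

TOKEN = "yule_..."  # replace with your real token

req = urllib.request.Request(
    "http://127.0.0.1:11434/v1/chat/completions",
    data=json.dumps({
        "model": "model.gguf",  # placeholder name
        "messages": [{"role": "user", "content": "Say hello."}],
    }).encode(),
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```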
## Sandbox
By default, the server process is placed in a Windows Job Object sandbox with memory limits, no child process spawning, and UI restrictions. Use --no-sandbox to disable this (not recommended for untrusted models).
See Security for details.
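
For a concrete sense of the mechanism (not yule's actual code), here is a ctypes sketch of the underlying Windows pattern: create a Job Object, set a process memory cap and an active-process limit of one, and assign the current process to it. The 8 GiB cap is an arbitrary example value, and the UI restrictions (set similarly via JobObjectBasicUIRestrictions) are omitted for brevity.

```python
# Sketch (Windows only): Job Object with a memory cap and no child
# processes -- illustrates the pattern, not yule's implementation.
import ctypes
import ctypes.wintypes as wt

class IO_COUNTERS(ctypes.Structure):
    _fields_ = [(n, ctypes.c_ulonglong) for n in (
        "ReadOperationCount", "WriteOperationCount", "OtherOperationCount",
        "ReadTransferCount", "WriteTransferCount", "OtherTransferCount")]

class JOBOBJECT_BASIC_LIMIT_INFORMATION(ctypes.Structure):
    _fields_ = [("PerProcessUserTimeLimit", wt.LARGE_INTEGER),
                ("PerJobUserTimeLimit", wt.LARGE_INTEGER),
                ("LimitFlags", wt.DWORD),
                ("MinimumWorkingSetSize", ctypes.c_size_t),
                ("MaximumWorkingSetSize", ctypes.c_size_t),
                ("ActiveProcessLimit", wt.DWORD),
                ("Affinity", ctypes.c_size_t),
                ("PriorityClass", wt.DWORD),
                ("SchedulingClass", wt.DWORD)]

class JOBOBJECT_EXTENDED_LIMIT_INFORMATION(ctypes.Structure):
    _fields_ = [("BasicLimitInformation", JOBOBJECT_BASIC_LIMIT_INFORMATION),
                ("IoInfo", IO_COUNTERS),
                ("ProcessMemoryLimit", ctypes.c_size_t),
                ("JobMemoryLimit", ctypes.c_size_t),
                ("PeakProcessMemoryUsed", ctypes.c_size_t),
                ("PeakJobMemoryUsed", ctypes.c_size_t)]

JOB_OBJECT_LIMIT_ACTIVE_PROCESS = 0x0008  # enforce ActiveProcessLimit
JOB_OBJECT_LIMIT_PROCESS_MEMORY = 0x0100  # enforce ProcessMemoryLimit
JobObjectExtendedLimitInformation = 9

k32 = ctypes.windll.kernel32
k32.CreateJobObjectW.restype = wt.HANDLE
k32.GetCurrentProcess.restype = wt.HANDLE
k32.SetInformationJobObject.argtypes = (wt.HANDLE, ctypes.c_int,
                                        ctypes.c_void_p, wt.DWORD)
k32.AssignProcessToJobObject.argtypes = (wt.HANDLE, wt.HANDLE)

job = k32.CreateJobObjectW(None, None)
info = JOBOBJECT_EXTENDED_LIMIT_INFORMATION()
info.BasicLimitInformation.LimitFlags = (
    JOB_OBJECT_LIMIT_ACTIVE_PROCESS | JOB_OBJECT_LIMIT_PROCESS_MEMORY)
info.BasicLimitInformation.ActiveProcessLimit = 1  # no child processes
info.ProcessMemoryLimit = 8 * 1024 ** 3            # example cap: 8 GiB

k32.SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                            ctypes.byref(info), ctypes.sizeof(info))
k32.AssignProcessToJobObject(job, k32.GetCurrentProcess())
```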
## Examples
```sh
# default settings
yule serve ./model.gguf

# custom bind address with a fixed token
yule serve ./model.gguf --bind 0.0.0.0:8080 --token my-secret-key

# no sandbox (development only)
yule serve ./model.gguf --no-sandbox
```