Getting Started
Quick Start
Run your first inference with Yule
Run Inference
```shell
yule run ./tinyllama.gguf --prompt "Explain gravity in one sentence"
```

Output streams to stdout as tokens are generated. Stats print to stderr when done:

```text
loaded: Llama (1.1B, 22 layers, dim 2048)
parse: 15.2ms
tokenizer: 32000 tokens, loaded in 5.4ms
weights: mapped in 0.1ms
prompt: 12 tokens
prefill: 6364.7ms (1.9 tok/s)
Gravity is the force that attracts objects with mass toward each other.
generated: 14 tokens in 4200.0ms (3.33 tok/s)
```

Start the API Server
```shell
yule serve ./tinyllama.gguf
```

The server prints a capability token to stderr:

```text
loading model: ./tinyllama.gguf
model loaded: Llama (201 tensors, merkle: ffc7e1fd6016a6f9)
token: yule_b49913e2c05162951af4f87d62c2c9a6555eb91299c7fdcc
listening on 127.0.0.1:11434
yule api: http://127.0.0.1:11434/yule/health
openai: http://127.0.0.1:11434/v1/chat/completions
```

All API requests require the token as a Bearer token:
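The `/v1/chat/completions` endpoint is OpenAI-compatible, so it presumably accepts a standard chat-completion request body. A minimal sketch in Python using only the standard library — the token and model name below are placeholders; substitute the values your `yule serve` instance printed at startup:

```python
import json
import urllib.request

# Placeholders: use the token and model your server actually printed.
BASE_URL = "http://127.0.0.1:11434"
TOKEN = "yule_b49913e2c05162951af4f87d62c2c9a6555eb91299c7fdcc"

# Standard OpenAI-style chat-completion body.
payload = {
    "model": "tinyllama",
    "messages": [{"role": "user", "content": "Explain gravity in one sentence"}],
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
)

# Uncomment to send once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request works from any OpenAI-compatible client library pointed at the server's base URL, as long as the capability token is passed as the API key.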
```shell
curl -H "Authorization: Bearer yule_b499..." http://localhost:11434/yule/health
```

```json
{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 27,
  "model": "tinyllama",
  "architecture": "Llama",
  "sandbox": false
}
```

What's Next
- CLI Reference for all flags and options
- API Reference for the full endpoint documentation
- Security for how sandbox and verification work