Run a server
pie serve starts a long-lived engine. Use it when you need multiple clients, persistent state across runs, or the ability to share an inferlet through the registry. Read this after Build and publish.
Start
pie serve
╭─ Pie Engine ─────────────────────────────────╮
│ Host    127.0.0.1:8080                       │
│ Model   default (Qwen/Qwen3-0.6B)            │
│ Driver  cuda_native                          │
│ Device  cuda:0                               │
╰──────────────────────────────────────────────╯
✓ Server ready at ws://127.0.0.1:8080
The server now accepts WebSocket clients. The model loads once at startup and serves every request from the same loaded weights.
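To sanity-check the endpoint before wiring up a real client, any WebSocket tool works; the snippet below assumes websocat is installed, and only verifies the handshake, not the Pie protocol itself:
# Confirm the endpoint accepts WebSocket connections (assumes websocat)
websocat ws://127.0.0.1:8080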
Common flags
| Option | Description |
|---|---|
| -c, --config <PATH> | Use a non-default config file. |
| --host <HOST> | Override [server].host. |
| --port <PORT> | Override [server].port. |
| --no-auth | Disable authentication for this run. |
| --debug | Show engine, driver, and server diagnostics. |
| -m, --monitor | Launch the TUI alongside the server. |
| --no-snapshot | Disable the host-side Python snapshot optimization. |
# Custom port, no auth, debug diagnostics (dev)
pie serve --port 9000 --no-auth --debug
# Alternate config (e.g. for staging)
pie serve -c ./staging.toml
See the CLI reference for the full list.
Monitor mode
pie serve -m
This opens a real-time TUI showing the running model, batch utilization, throughput, and per-process status. Press q to exit the TUI; the server keeps running until you press Ctrl-C.
Use the monitor to spot:
- Per-process credit balance and bid (the scheduling surface in action).
- Batch occupancy. A consistently low number suggests undersized request load or oversized tensor parallelism.
- Per-step latency. Spikes correlate with prefill bursts.
Dummy backend
For protocol testing without loading real weights:
pie config set model.0.driver.type dummy
pie serve
The engine returns random tokens but every other code path runs (scheduler, KV cache, decoders, sessions, registry). Useful for client integration tests.
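The same switch can live in the config file. A minimal sketch, assuming your model entry is the default one shown in the startup banner:
[[model]]
name = "default"
hf_repo = "Qwen/Qwen3-0.6B"

[model.driver]
type = "dummy"   # random tokens; no weights loaded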
Authorize clients
When [auth].enabled = true, only registered users can connect. Add a public key:
cat ~/.ssh/id_ed25519.pub | pie auth add alice laptop
pie auth list
pie auth remove alice [key_name] revokes a single key; omit the key name to revoke every key for that user.
For development without auth, pass --no-auth or set [auth].enabled = false in the config file. Leave auth on whenever the port is reachable from off-host.
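The config-file equivalent of --no-auth is a one-line toggle:
[auth]
enabled = false   # development only; re-enable before exposing the port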
Graceful shutdown
Ctrl-C:
^C
✓ Shutdown complete
The engine stops accepting new connections, lets running processes finish, tears down driver workers, and exits. In-flight processes get up to a configured grace period to complete.
For a hard kill (no grace period), send SIGKILL. Saved contexts and adapters survive a graceful shutdown but not a hard kill.
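One way to send that signal, assuming a single pie serve process on the host:
# Hard kill: skips the grace period; unsaved contexts and adapters are lost
pkill -KILL -f 'pie serve'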
Tuning runtime parallelism
The runtime schedules engine work on a Tokio worker pool. The default is fine for most deployments. To override:
[runtime]
worker_threads = 8
Set worker_threads to the number of CPU cores you want to dedicate to the runtime. Too low and the scheduler bottlenecks; too high and oversubscription hurts throughput. The bench scripts in Profiling help find the right value.
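Before benchmarking, it helps to know how much physical parallelism the host actually offers; on Linux:
# Available CPU cores: an upper bound worth knowing before setting worker_threads
nproc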
Multi-GPU on one host
CUDA-capable drivers run tensor-parallel inference across multiple GPUs:
[model.driver]
type = "cuda_native"
device = ["cuda:0", "cuda:1"]
tensor_parallel_size = 2
Restart pie serve to apply; the engine shards the model across the listed devices automatically. From the inferlet side the model handle is unchanged: the engine schedules forward passes across every listed device.
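To see which device names are available to list, enumerate the GPUs first (NVIDIA tooling assumed):
# List visible GPUs; index N corresponds to cuda:N unless CUDA_VISIBLE_DEVICES remaps it
nvidia-smi -L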
Multiple models
[[model]]
name = "small"
hf_repo = "Qwen/Qwen3-0.6B"
[model.driver]
type = "cuda_native"
device = ["cuda:0"]
[[model]]
name = "large"
hf_repo = "Qwen/Qwen2.5-7B-Instruct"
[model.driver]
type = "cuda_native"
device = ["cuda:1"]
Both load on startup. Each [[model]] block's device list must be disjoint from every other model's. Inside an inferlet, bind by name (Model::load("small") / Model::load("large")). See Loading and selecting models for the inferlet-side surface.
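For orientation, a minimal inferlet-side sketch. Model::load is the call named above; the import path and surrounding scaffolding are illustrative assumptions, so treat Loading and selecting models as the authoritative surface:
// Sketch only: Model::load is documented above; the import path and
// function shape are assumptions for illustration.
use inferlet::Model;

fn pick_models() {
    let small = Model::load("small"); // binds the [[model]] named "small"
    let large = Model::load("large"); // binds the [[model]] named "large"
    // Route light requests to small and heavy ones to large.
}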
Next
- Connecting clients: launch processes against this engine.
- Profiling: measure throughput and latency.
- Configuration: the full config schema.