# Configuration
Pie reads its configuration from `~/.pie/config.toml` by default. The file is split into five sections: `[server]`, `[auth]`, `[telemetry]`, `[runtime]`, and one or more `[[model]]` array-of-tables entries. `pie config init` writes a working default; this page documents every key.
## Schema

```toml
[server]
host = "127.0.0.1"
port = 8080
verbose = true
registry = "https://registry.pie-project.org/"
python_snapshot = true
# max_concurrent_processes = 64

[auth]
enabled = false

[telemetry]
enabled = false
endpoint = "http://localhost:4317"
service_name = "pie"

[runtime]
worker_threads = 8
wasm_max_instances = 1000
wasm_max_memory_mb = 4096
wasm_warm_memory_mb = 0
wasm_warm_slots = 100
allow_fs = false
fs_scratch_dir = "/tmp/pie"
allow_network = true
network_allowed_hosts = ["*"]
max_upload_mb = 256

# One or more model blocks. Use `[[model]]` (array-of-tables); the first
# entry is the implicit default for inferlets that don't specify a model.
[[model]]
name = "default"
hf_repo = "Qwen/Qwen3-0.6B"

[model.scheduler]
batch_policy = "adaptive"
request_timeout_secs = 120
default_endowment_pages = 64
admission_oversubscription_factor = 4.0
restore_pause_at_utilization = 0.85
# default_token_limit = 100000

[model.driver]
type = "cuda_native"
device = ["cuda:0"]
tensor_parallel_size = 1
activation_dtype = "bfloat16"
random_seed = 42

[model.driver.options]
gpu_mem_utilization = 0.85
max_batch_tokens = 10240
max_batch_size = 512
```
`[model.driver].type` is a discriminator. Embedded drivers (`portable`, `cuda_native`, `dummy`) run inside the pie process; Python drivers (`dev`, `vllm`, `sglang`) run as supervised subprocesses. Driver-specific knobs live in `[model.driver.options]`; see the per-driver pages (CUDA, Portable, vLLM, SGLang) for what each backend accepts.
To add a second model, append another `[[model]]` block with a unique `name` and a `device` list that does not overlap any other model's.
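For example, a two-model layout might look like the following sketch (the second model's name, repo, and device assignment are illustrative):

```toml
[[model]]                          # first entry: the implicit default
name = "default"
hf_repo = "Qwen/Qwen3-0.6B"

[model.driver]
type = "cuda_native"
device = ["cuda:0"]

[[model]]                          # second entry: unique name, disjoint devices
name = "small-llama"               # illustrative name
hf_repo = "meta-llama/Llama-3.2-1B"

[model.driver]
type = "cuda_native"
device = ["cuda:1"]                # must not overlap the default model's devices
```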
## Server keys

| Key | Default | Description |
|---|---|---|
| `host` | `127.0.0.1` | Bind address for the WebSocket server. |
| `port` | `8080` | TCP port for client connections. |
| `verbose` | `false` | Enable engine diagnostics (also reachable via `pie serve --debug`). |
| `registry` | `https://registry.pie-project.org/` | Inferlet registry the engine fetches from when `pie run <name>` is used without `--path`. |
| `python_snapshot` | `true` | Use the host-side Python snapshot to skip cold-start interpreter init for Python inferlets. Disable with `--no-snapshot` for debugging. |
| `max_concurrent_processes` | unset | Cap on simultaneously running inferlets. When set, an admission semaphore gates spawn requests; new processes wait for a slot. Unset means unlimited. Must be > 0. |
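For instance, to expose the server beyond loopback and cap concurrency, a sketch might look like this (the bind address, port, and cap are illustrative):

```toml
[server]
host = "0.0.0.0"                 # illustrative: listen on all interfaces
port = 9090
max_concurrent_processes = 32    # spawns beyond 32 wait on the admission semaphore
```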
## Auth keys

| Key | Default | Description |
|---|---|---|
| `enabled` | `true` (template default is `false`) | Require client public-key auth on the WebSocket. Disable with `pie serve --no-auth` for development. The keystore is managed via `pie auth add\|remove\|list`. |
## Telemetry keys

| Key | Default | Description |
|---|---|---|
| `enabled` | `false` | Emit OpenTelemetry traces and metrics. |
| `endpoint` | `http://localhost:4317` | OTLP gRPC endpoint. |
| `service_name` | `pie` | Service name reported in spans. |
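To ship traces to a shared collector, enable the block and point `endpoint` at its OTLP gRPC listener; a minimal sketch, with an illustrative collector hostname:

```toml
[telemetry]
enabled = true
endpoint = "http://otel-collector.internal:4317"  # illustrative collector host
service_name = "pie"
```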
## Runtime keys

The `[runtime]` block tunes the tokio worker pool, the wasmtime engine pool, the per-instance security policy (filesystem and network), and upload limits. All values have explicit defaults; pie pins them so its behavior is decoupled from upstream wasmtime/tokio version bumps.

| Key | Default | Description |
|---|---|---|
| `worker_threads` | `os.cpu_count()` | Number of tokio worker threads. On hosts with many logical cores under high request concurrency, the default can cause migration overhead; lowering to ~8 often improves throughput. Must be > 0. |
| `wasm_max_instances` | `1000` | Maximum number of concurrently instantiated wasm modules in the wasmtime pooling allocator. Must be > 0. |
| `wasm_max_memory_mb` | `4096` | Maximum linear memory per wasm instance, in MiB. This is a virtual reservation (only touched memory is mapped), but lower it (e.g. `64`) if you need tight RSS control and your inferlets fit. Must be > 0. |
| `wasm_warm_memory_mb` | `0` | Per-slot warm memory pre-mapped at pool startup, in MiB. Trades startup time for cold-start latency. Must be >= 0. |
| `wasm_warm_slots` | `100` | Number of pool slots to keep warm. Must be >= 0. |
| `allow_fs` | `false` | When true, every inferlet receives a preopened scratch directory at `/scratch` mapped to a per-process subdirectory under `fs_scratch_dir`. Inferlets without this flag have no host filesystem visibility. |
| `fs_scratch_dir` | `<tempdir>/pie` | Host root under which per-process `/scratch` directories are created when `allow_fs = true`. |
| `allow_network` | `true` | When false, `wasi:sockets` is denied. Useful for tight outbound HTTP control. |
| `network_allowed_hosts` | `["*"]` | Allowed-host filter for `wasi:sockets`. Supports CIDRs and host:port forms (e.g. `["10.0.0.0/8", "127.0.0.1"]`, `["10.0.0.0/8:443"]`). Filters `wasi:sockets` only; `wasi:http` bypasses the per-socket hook. |
| `max_upload_mb` | `256` | Maximum size of a single chunked upload accepted by the server, in MiB. Must be > 0. |
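Taken together, a locked-down runtime block might look like this sketch (the thread count, scratch path, and CIDR ranges are illustrative):

```toml
[runtime]
worker_threads = 8
# Filesystem: each inferlet sees /scratch, mapped to a per-process
# subdirectory under this host root.
allow_fs = true
fs_scratch_dir = "/var/lib/pie/scratch"
# Network: wasi:sockets restricted to an internal range on port 443 plus
# loopback. Remember that wasi:http bypasses this per-socket filter.
allow_network = true
network_allowed_hosts = ["10.0.0.0/8:443", "127.0.0.1"]
```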
## Per-model keys

Each `[[model]]` entry has only identity-level fields at the top level:

| Key | Default | Description |
|---|---|---|
| `name` | required | The inferlet-side lookup key for `Model.load("<name>")`. Must be unique across `[[model]]` entries. The first `[[model]]` is the implicit default. |
| `hf_repo` | required | HuggingFace repo id or local HuggingFace snapshot directory to load. Repo ids use the standard HuggingFace cache and download on a cache miss. |

Per-process admission and market policy live in `[model.scheduler]`, not at the top level.
### [model.scheduler]

Batch-firing and per-process admission/market knobs:

| Key | Default | Description |
|---|---|---|
| `batch_policy` | `"adaptive"` | Batch-firing policy. One of `"adaptive"`, `"eager"`, `"greedy"`. See `runtime/src/inference/adaptive_policy.rs`. |
| `request_timeout_secs` | `120` | Maximum wall time for a single inference request before the runtime drops it. Must be > 0. |
| `default_token_limit` | unset | Per-process compute cap (tokens) when the launch request doesn't specify one. Unset means no hard cap. Must be > 0 if set. |
| `default_endowment_pages` | `64` | Per-process initial endowment in KV pages. Sets the process's claim weight in the bidding market; bigger values guarantee more pages held under contention. Must be > 0. |
| `admission_oversubscription_factor` | `4.0` | How much logical KV memory the market overcommits relative to physical: Σ endowment ≤ total_pages × factor. `1.0` = no overbooking; `4.0` = sell 4× capacity, betting on non-peak duty cycles. Must be a finite number > 0. |
| `restore_pause_at_utilization` | `0.85` | Pause restoring suspended processes when any device exceeds this GPU page utilization (range `(0.0, 1.0]`). Prevents evict→restore→re-evict thrash. |
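To make the admission inequality concrete, here is a sketch with illustrative numbers (the physical page count depends on your GPU and model):

```toml
[model.scheduler]
default_endowment_pages = 64
admission_oversubscription_factor = 4.0
# Admission holds while: sum(endowments) <= total_pages * factor.
# With, say, 4096 physical KV pages, the cap is 4096 * 4.0 = 16384
# endowment pages, i.e. up to 16384 / 64 = 256 admitted processes
# at the default endowment.
```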
### [model.driver]

Driver discriminator plus universal driver fields:

| Key | Default | Description |
|---|---|---|
| `type` | required | Driver name. One of `portable`, `cuda_native`, `dummy`, `dev`, `vllm`, `sglang`. See the per-driver reference pages. |
| `device` | required | Device list for this model. A single string is accepted and normalized to a one-element list. The lists across all `[[model]]` entries must be disjoint. |
| `tensor_parallel_size` | `1` | TP world size. |
| `activation_dtype` | `"bfloat16"` | Activation dtype for forward passes. |
| `random_seed` | `42` | Seed used by samplers and dropout-free stochastic ops. |
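For instance, a two-GPU tensor-parallel layout might be sketched like this (device names are illustrative):

```toml
[model.driver]
type = "cuda_native"
device = ["cuda:0", "cuda:1"]   # must not overlap any other [[model]]'s devices
tensor_parallel_size = 2        # TP world size, sized to the device count here
activation_dtype = "bfloat16"
```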
### [model.driver.options]

Driver-specific knobs in the driver's own vocabulary. Embedded drivers are validated by pie before boot. Python drivers accept `venv` or `python` as standalone-only interpreter overrides and pass the remaining options to the driver subprocess. See the per-driver pages (CUDA, Portable, vLLM, SGLang) for each driver's option set.
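A hypothetical Python-driver block with an interpreter override might look like this (the venv path and the pass-through option are illustrative; consult the vLLM page for the actual option set):

```toml
[model.driver]
type = "vllm"
device = ["cuda:0"]

[model.driver.options]
venv = "/opt/venvs/vllm"   # interpreter override, consumed by pie itself
max_num_seqs = 256         # illustrative: forwarded to the driver subprocess
```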
## Editing the config

Use `pie config` rather than hand-editing when possible:

```bash
pie config show                  # pretty-print ~/.pie/config.toml
pie config set server.port 9090
pie config set auth.enabled true
pie config set model.0.hf_repo meta-llama/Llama-3.2-1B
pie config set model.0.driver.device "cuda:0,cuda:1"
```

`pie config init` regenerates the default file (and downloads the embedded Python runtime that hosts Python inferlets).

`pie config set` takes a dot-path under the section the key lives in (e.g. `server.port`, not `port`). Numeric segments index into TOML arrays: `model.0.hf_repo` targets the first `[[model]]` block. Comma-separated values become lists (`"cuda:0,cuda:1"` → `["cuda:0", "cuda:1"]`).
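After the last command above, the first `[[model]]` block's driver table would read roughly as follows (a sketch; surrounding keys unchanged):

```toml
[model.driver]
device = ["cuda:0", "cuda:1"]   # comma-separated value parsed into a list
```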
## Migration from the older schema

Pie previously used a dotted `[model.<name>]` shape with admission knobs at the top level and driver options under the driver-type discriminator. The current loader rejects the old shape with a migration error. The mapping:

| Old key | New location |
|---|---|
| `[model.<name>]` (table) | `[[model]]` array-of-tables with `name = "<name>"` |
| `[server].primary_model` | Removed. The first `[[model]]` is the implicit default. |
| `[server].allow_filesystem` | `[runtime].allow_fs` |
| `[model.<name>].default_token_budget` | `[model.scheduler].default_token_limit` |
| `[model.<name>].default_endowment_pages` | `[model.scheduler].default_endowment_pages` |
| `[model.<name>].oversubscription_factor` | `[model.scheduler].admission_oversubscription_factor` |
| `[model.<name>.scheduler].policy` | `[model.scheduler].batch_policy` |
| `[model.<name>.driver.<type>]` | `[model.driver.options]` |
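Concretely, a single-model migration might look like this sketch (the model name, values, and key placement in the old block are illustrative):

```toml
# Old shape (now rejected with a migration error):
# [model.llama]
# default_token_budget = 100000
# [model.llama.scheduler]
# policy = "adaptive"
# [model.llama.driver.cuda_native]
# gpu_mem_utilization = 0.85

# New shape:
[[model]]
name = "llama"
hf_repo = "meta-llama/Llama-3.2-1B"   # illustrative repo

[model.scheduler]
batch_policy = "adaptive"
default_token_limit = 100000

[model.driver]
type = "cuda_native"
device = ["cuda:0"]                   # illustrative device

[model.driver.options]
gpu_mem_utilization = 0.85
```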
Run `pie config init` to regenerate the file in the new shape.
## Environment variables

| Variable | Description |
|---|---|
| `PIE_HOME` | Override the default Pie home directory (`~/.pie`). |
| `HF_HOME` | HuggingFace cache directory (used by `pie model …`). |
| `PIE_SDK` | Override the SDK lookup path used by bakery / `pie build` for editable in-tree development. Without it, bakery searches its install location and the current working directory. |
## Registry authentication

`pie build` does not need authentication; it produces a local `.wasm` file. Publishing to the registry is done through the bakery toolchain.

`bakery login` walks a GitHub OAuth device-code flow and stores the resulting token under the user's bakery state directory. After login, `bakery inferlet publish` uploads a built `.wasm` plus its `Pie.toml` to the registry server configured in `[server].registry`. To revoke access, delete the stored token file or revoke the token from your GitHub account settings. See `bakery <cmd> --help` for the full publish/search/info surface.
## Related

- CLI reference: every `pie` command and flag.
- Per-driver options: CUDA, Portable, vLLM, SGLang.
- Supported architectures are listed on each driver page (e.g. the CUDA page's #supported-architectures section).