Loading and selecting models

Every inferlet starts by binding to a model. The engine has already loaded weights onto the GPU; your inferlet just acquires a handle. This page covers how to ask the engine which models are loaded and how to bind to one. Read this after Initial setup.

What Model::load does

Model::load(name) returns a handle to a model the engine has already loaded. It does not read weights from disk, copy parameters, or allocate GPU memory. Multiple inferlets can hold handles to the same model concurrently.

The name is the name field in a [[model]] block in ~/.pie/config.toml. The default config has one model, named default.

use inferlet::{model::Model, runtime, Result};

let names = runtime::models();
let model = Model::load(names.first().ok_or("no models")?)?;

runtime::models() returns the list of model names from the engine's config, in declaration order. If the config defines only one model, taking the first name is unambiguous. If it defines several, see Selecting by name below.

Selecting by name

Add more models to ~/.pie/config.toml:

[[model]]
name = "default"
hf_repo = "Qwen/Qwen3-0.6B"

[model.driver]
type = "cuda_native"
device = ["cuda:0"]

[[model]]
name = "large"
hf_repo = "Qwen/Qwen2.5-7B-Instruct"

[model.driver]
type = "cuda_native"
device = ["cuda:1"]

Restart the engine. Both models load on startup. Inside an inferlet, bind by name:

let small = Model::load("default")?;
let large = Model::load("large")?;

This is the pattern for the Oracle and Worker topology: a small fast model handles routing, a large model handles the chosen task. Both are running. Switching between them inside an inferlet is a Model::load call away.
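A sketch of that routing, with the actual inference left as placeholders (classify_with and answer_with are illustrative stand-ins, not part of the API; only Model::load is real here):

```rust
use inferlet::{model::Model, Result};

// Placeholder helpers: stand-ins for whatever inference logic your
// inferlet runs on each model. Not part of the inferlet API.
fn classify_with(_router: &Model, _question: &str) -> Result<bool> {
    unimplemented!("run a short routing prompt on the small model")
}
fn answer_with(_model: &Model, _question: &str) -> Result<String> {
    unimplemented!("generate the final answer")
}

fn route(question: &str) -> Result<String> {
    let router = Model::load("default")?; // small, fast: decides the path
    let worker = Model::load("large")?;   // large, capable: does the work

    // Ask the small model which path to take, then dispatch.
    let needs_worker = classify_with(&router, question)?;
    let chosen = if needs_worker { &worker } else { &router };
    answer_with(chosen, question)
}
```

Because Model::load only acquires a handle, binding both models up front costs nothing on the GPU side.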

What you can do with a Model

The Model handle is intentionally minimal. Two things hang off it:

  • model.tokenizer() returns the model's Tokenizer.
  • Context::new(&model) allocates a fresh KV cache against this model. See Context overview.
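The two calls together, against the default model:

```rust
use inferlet::{context::Context, model::Model, Result};

let model = Model::load("default")?;

let tokenizer = model.tokenizer();  // this model's Tokenizer
let ctx = Context::new(&model);     // fresh KV cache bound to this model
```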

A Model does not expose architecture details, vocab size, or parameter counts. Those live in the engine config and the HuggingFace repo card. If you need them inside an inferlet, request them through runtime::query(...).
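For example (a sketch; "vocab_size" is an assumed query key, and the set of supported keys is defined by the engine, not by this page):

```rust
use inferlet::runtime;

// Hypothetical: ask the engine for a property the Model handle does not
// expose. The key name here is an assumption; check runtime::query's
// documentation for the keys your engine actually supports.
let vocab_size = runtime::query("vocab_size")?;
```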

Naming conventions

The model name in the config is freely chosen. Conventions that work well:

Use case                   Suggested name
Single-model deployments   default
Two-model setups           small and large, or router and worker
Per-domain                 code, math, chat, vision

Inferlets that bind by hardcoded name (Model::load("large")) are tied to the engine config. Inferlets that take the name as input or use runtime::models().first() are portable across configs.
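A portable binding helper that combines both patterns, using only calls shown on this page (the function itself is a sketch, not part of the API):

```rust
use inferlet::{model::Model, runtime, Result};

// Bind by a caller-supplied name if one was given, otherwise fall back
// to the first model in the engine's config. Works under any config.
fn bind(requested: Option<&str>) -> Result<Model> {
    match requested {
        Some(name) => Model::load(name),
        None => {
            let names = runtime::models();
            Model::load(names.first().ok_or("no models loaded")?)
        }
    }
}
```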

Multiple devices

A single model can span multiple devices:

[model.driver]
type = "cuda_native"
device = ["cuda:0", "cuda:1"]
tensor_parallel_size = 2

CUDA-capable drivers handle tensor parallelism. From the inferlet's side, the model handle is the same; the engine schedules forward passes across the listed devices. See the CUDA driver page for the per-driver behavior.
