Loading and selecting models
Every inferlet starts by binding to a model. The engine has already loaded weights onto the GPU; your inferlet just acquires a handle. This page covers how to ask the engine which models are loaded and how to bind to one. Read this after Initial setup.
What Model::load does
Model::load(name) returns a handle to a model the engine has already loaded. It does not read weights from disk, copy parameters, or allocate GPU memory. Multiple inferlets can hold handles to the same model concurrently.
The name you pass is the name field of a [[model]] block in ~/.pie/config.toml. The default config declares a single model named default.
**Rust**

```rust
use inferlet::{model::Model, runtime, Result};

let names = runtime::models();
let model = Model::load(names.first().ok_or("no models")?)?;
```

**Python**

```python
from inferlet import Model, runtime

names = runtime.models()
model = Model.load(names[0])
```

**JavaScript**

```javascript
import { Model, runtime } from 'inferlet';

const names = runtime.models();
const model = Model.load(names[0]);
```
runtime::models() returns the list of names from the engine's config in declaration order. If the config has only one model, indexing the first element is safe. If it has several, see Selecting by name below.
Selecting by name
Add more models to ~/.pie/config.toml:
```toml
[[model]]
name = "default"
hf_repo = "Qwen/Qwen3-0.6B"

[model.driver]
type = "cuda_native"
device = ["cuda:0"]

[[model]]
name = "large"
hf_repo = "Qwen/Qwen2.5-7B-Instruct"

[model.driver]
type = "cuda_native"
device = ["cuda:1"]
```
Restart the engine. Both models load on startup. Inside an inferlet, bind by name:
**Rust**

```rust
let small = Model::load("default")?;
let large = Model::load("large")?;
```

**Python**

```python
small = Model.load("default")
large = Model.load("large")
```

**JavaScript**

```javascript
const small = Model.load('default');
const large = Model.load('large');
```
This is the pattern behind the Oracle and Worker topology: a small, fast model routes each request, and a large model handles the chosen task. Both are already resident, so switching between them inside an inferlet is just another Model::load call, as in the sketch below.
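Here is a minimal routing skeleton in Rust. The binding calls are the ones shown above; is_hard and answer are hypothetical placeholders for whatever generation API your inferlet uses (this page covers only binding; see Context overview for generation).

```rust
use inferlet::{model::Model, Result};

// Hypothetical helpers -- this page covers binding, not generation.
// Replace these with real Context-based generation (see Context overview).
fn is_hard(_oracle: &Model, _prompt: &str) -> Result<bool> {
    unimplemented!("run a short classification pass on the oracle model")
}

fn answer(_worker: &Model, _prompt: &str) -> Result<String> {
    unimplemented!("generate the final answer with the chosen model")
}

fn handle(prompt: &str) -> Result<String> {
    let small = Model::load("default")?; // oracle: routes the request
    let large = Model::load("large")?;   // worker: handles hard tasks

    // Route with the cheap model, then answer with whichever fits.
    if is_hard(&small, prompt)? {
        answer(&large, prompt)
    } else {
        answer(&small, prompt)
    }
}
```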
What you can do with a Model
The Model handle is intentionally minimal. Two things hang off it:
- model.tokenizer() returns the model's Tokenizer.
- Context::new(&model) allocates a fresh KV cache against this model. See Context overview.
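A minimal sketch combining the two. The Context import path and the encode signature are assumptions here; see the Tokenizers and Context overview pages for the real API.

```rust
use inferlet::{context::Context, model::Model, Result};

let model = Model::load("default")?;

// The two things a Model exposes:
let tokenizer = model.tokenizer(); // the model's Tokenizer
let ctx = Context::new(&model);    // a fresh KV cache bound to this model

// Hypothetical encode call -- see the Tokenizers page for the real signature.
let tokens = tokenizer.encode("Hello, world!");
```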
A Model does not expose architecture details, vocab size, or parameter counts. Those live in the engine config and the HuggingFace repo card. If you need them inside an inferlet, request them through runtime::query(...).
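For example, something along these lines; the key names below are invented for illustration, and the engine defines the real query keys.

```rust
use inferlet::runtime;

// Hypothetical query keys -- check the engine docs for the real ones.
let vocab_size = runtime::query("models.default.vocab_size");
let num_params = runtime::query("models.default.parameter_count");
```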
Naming conventions
The model name in the config is freely chosen. Conventions that work well:
| Use case | Suggested name |
|---|---|
| Single-model deployments | default |
| Two-model setups | small and large, or router and worker |
| Per-domain | code, math, chat, vision |
Inferlets that bind by a hardcoded name (Model::load("large")) are tied to a particular engine config. Inferlets that take the name as input or fall back to runtime::models().first() are portable across configs, as in the sketch below.
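A minimal portable-binding helper, assuming the caller supplies the model name as input (how an inferlet receives its input is out of scope here):

```rust
use inferlet::{model::Model, runtime, Result};

// Bind to `name` when the caller supplies one; otherwise fall back to
// the first model declared in the engine config.
fn bind(name: Option<&str>) -> Result<Model> {
    match name {
        Some(n) => Model::load(n),
        None => Model::load(runtime::models().first().ok_or("no models loaded")?),
    }
}
```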
Multiple devices
A single model can span multiple devices:
```toml
[model.driver]
type = "cuda_native"
device = ["cuda:0", "cuda:1"]
tensor_parallel_size = 2
```
CUDA-capable drivers handle tensor parallelism. From the inferlet's side, the model handle is the same; the engine schedules forward passes across the listed devices. See the CUDA driver page for the per-driver behavior.
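From inside the inferlet nothing changes; with the two-device config above, the bind is identical:

```rust
// Same call as before; the engine schedules forward passes across
// cuda:0 and cuda:1 transparently.
let model = Model::load("default")?;
```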
Next
- Tokenizers: encode and decode text, inspect special tokens.
- Context overview: allocate a KV context against a loaded model.
- CUDA driver: which architectures the CUDA drivers handle.