Rust SDK reference

The full inferlet SDK API for Rust. The Guide walks through how to use these APIs with runnable code; this page enumerates the surface.

Inferlet entry point

```rust
use inferlet::Result;
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct Input { prompt: String }

#[derive(Serialize)]
struct Output { text: String }

#[inferlet::main]
async fn main(input: Input) -> Result<Output> {
    // ...
}
```

The #[inferlet::main] macro generates the WebAssembly entry point and a JSON bridge. The function takes any Deserialize input type and returns any Serialize output type. inferlet::Result<T> aliases std::result::Result<T, String>; an Err(s) becomes the Error event the client receives.

Prelude

```rust
use inferlet::prelude::*;
```

Pulls in the common types: main, Context, Result, Schema, Model, runtime, messaging, Adapter, Forward, Output, SampleHandle, ProbeHandle, Sampler, Probe, Generator, GenStep, chat, reasoning, tools, Speculator, plus the ForwardPassExt, SubscriptionExt, and FutureStringExt extension traits.

Argument parsing

inferlet::Arguments re-exports pico_args::Arguments for callers that want richer flag parsing than serde_json against the input dict.

```rust
use inferlet::{parse_args, Arguments};

let mut args: Arguments = parse_args(raw_argv);
let n: usize = args.value_from_str("--n").unwrap_or(4);
```

Runtime

```rust
use inferlet::runtime;
```

| Function | Description |
|---|---|
| `runtime::models() -> Vec<String>` | Names of every model the engine has loaded. |
| `runtime::version() -> String` | Pie runtime version string. |
| `runtime::instance_id() -> String` | Unique identifier for this engine instance. |
| `runtime::username() -> String` | Username of the user who launched the inferlet. |

Model

```rust
use inferlet::model::Model;

let model = Model::load("default")?;
```

| Method | Description |
|---|---|
| `Model::load(name: &str) -> Result<Model>` | Bind to a model loaded by the engine. The name is the `[model.<name>]` key in `~/.pie/config.toml`. |
| `model.tokenizer() -> &Tokenizer` | The model's tokenizer. |

Tokenizer

| Method | Description |
|---|---|
| `tok.encode(text: &str) -> Vec<u32>` | Text to token IDs. |
| `tok.decode(ids: &[u32]) -> Result<String>` | Token IDs to text. |
| `tok.vocabs() -> (Vec<u32>, Vec<Vec<u8>>)` | All token IDs paired with their raw byte sequences. |
| `tok.special_tokens() -> (Vec<u32>, Vec<Vec<u8>>)` | Special token IDs (BOS, EOS, etc.). |
| `tok.split_regex() -> &str` | The split regex used during BPE pre-tokenization. |

Context

Construction and lifecycle

```rust
use inferlet::Context;

let mut ctx = Context::new(&model)?;
```

| Method | Description |
|---|---|
| `Context::new(&model) -> Result<Context>` | Fresh anonymous context. KV pages are released on drop. |
| `Context::open(&model, name: &str) -> Result<Context>` | Clone a saved snapshot. The snapshot stays. |
| `Context::take(&model, name: &str) -> Result<Context>` | Move a saved snapshot into a fresh context. The snapshot is removed. |
| `Context::delete(&model, name: &str) -> Result<()>` | Drop a saved snapshot. |
| `ctx.save(name: &str) -> Result<()>` | Snapshot under a user-chosen name. |
| `ctx.snapshot() -> Result<String>` | Snapshot under a runtime-generated name. Returns the name. |
| `ctx.fork() -> Result<Context>` | Copy-on-write clone. O(1). |

Saved snapshots persist across inferlet runs as long as the engine is up.

Filling

| Method | Description |
|---|---|
| `ctx.system(text: &str) -> &mut Context` | Add a system message. |
| `ctx.user(text: &str) -> &mut Context` | Add a user message. |
| `ctx.assistant(text: &str) -> &mut Context` | Add a pre-filled assistant turn. |
| `ctx.cue() -> &mut Context` | Mark the current position as the start of the model's reply. |
| `ctx.seal() -> &mut Context` | Close the current assistant turn. |
| `ctx.append(tokens: &[u32]) -> &mut Context` | Append raw tokens. |
| `ctx.flush() -> impl Future<Output = Result<()>>` | Run prefill on buffered tokens; commit pages. |
| `ctx.truncate(n: u32) -> Result<()>` | Drop the trailing `n` working-page tokens (rollback primitive). Pages already committed cannot be truncated through this API; go through `ctx.inner()` if you need to. |

Inspection

| Method | Description |
|---|---|
| `ctx.model() -> &Model` | The bound model. |
| `ctx.page_size() -> u32` | Tokens per KV page. |
| `ctx.seq_len() -> u32` | Total committed + working tokens. |
| `ctx.buffer() -> &[u32]` | SDK-side buffered tokens not yet flushed. |
| `ctx.inner() -> &RawContext` | Underlying resource for page-level ops. |

Page operations

Reach for these via ctx.inner() when implementing custom forward-pass loops, sliding windows, or speculation rollback.

| Method | Description |
|---|---|
| `raw.reserve_working_pages(n: u32) -> Result<()>` | Pre-allocate `n` working pages. |
| `raw.commit_working_pages(k: u32) -> Result<()>` | Promote `k` full working pages to committed. |
| `raw.release_working_pages(n: u32)` | Free `n` working pages. |
| `raw.truncate_working_page_tokens(n: u32)` | Drop the last `n` tokens (rollback). |
| `raw.committed_page_count() -> u32` | Number of committed pages. |
| `raw.working_page_count() -> u32` | Number of working pages. |
| `raw.working_page_token_count() -> u32` | Tokens in the trailing working page. |

Generator

```rust
let g = ctx.generate(Sampler::TopP { temperature: 0.6, p: 0.95 })
    .max_tokens(256);
```

ctx.generate(sampler) returns a Generator. Drive it with one of the collectors or step by step.

Collectors

| Method | Returns | Notes |
|---|---|---|
| `g.collect_text().await` | `Result<String>` | Drives the loop; decodes via the chat template. |
| `g.collect_tokens().await` | `Result<Vec<u32>>` | All accepted token IDs. |
| `g.collect_json::<T>().await` | `Result<T>` | Derives schema from `T: JsonSchema + Deserialize`, applies as constraint, parses output. |

Builder methods

| Method | Description |
|---|---|
| `.max_tokens(n)` | Stop after `n` accepted tokens. |
| `.stop(&[...])` / `.add_stop(&[...])` | Extra stop-token IDs (added to the model's EOS). |
| `.constrain(c)` | Apply a `Constrain` impl. |
| `.constrain_with(JsonSchema(s))?` | Apply a declarative schema. |
| `.speculator(s)` | Plug in a custom speculator. |
| `.system_speculation()` | Use the runtime's draft model. |
| `.adapter(&a)` | Apply a LoRA adapter. |
| `.zo_seed(seed: i64)` | Set an Evolution Strategies seed for every step. |
| `.horizon(n)` | Hint expected output length for bid planning. |
| `.probe_each_step(idx, p)` | Attach a probe to every step (returns a handle). |

Inspection

| Method | Description |
|---|---|
| `g.is_done() -> bool` | `true` after generation has terminated. |
| `g.tokens_generated() -> usize` | Tokens accepted so far. |

Per-step iteration

```rust
let mut g = ctx.generate(Sampler::Argmax).max_tokens(256);

while let Some(mut step) = g.next()? {
    let out = step.execute().await?;
    // Inspect or override; commit chosen tokens.
    g.accept(&[chosen_token]);
}
```

| Method | Description |
|---|---|
| `g.next() -> Result<Option<GenStep>>` | Yield the next step or end the loop. |
| `step.execute().await -> Result<Output>` | Run one forward pass. |
| `step.clear_sampler()` | Drop the auto-attached sampler so you can attach your own. |
| `step.probe(idx, probe)` | Attach a probe to this step. |
| `g.accept(tokens: &[u32])` | Commit chosen tokens to the generator's state. |

Forward

```rust
let mut fwd = ctx.forward();
fwd.input(&token_ids);
let h = fwd.sample(&[0], Sampler::Argmax);
let out = fwd.execute().await?;
let token = out.token(h);
```

ctx.forward() returns a Forward<'ctx>. Page reservation, position derivation, and post-execute commit happen automatically.

Builder methods

| Method | Description |
|---|---|
| `.input(&tokens)` | Token IDs with auto-derived sequential positions. |
| `.input_at(&tokens, &positions)` | Token IDs with explicit position IDs. |
| `.attention_mask(&[brle, ...])` | One BRLE mask per input token. |
| `.mask(&brle)` | Logit mask (BRLE over the vocabulary). |
| `.sample(&indices, sampler)` | Attach a sampler at output positions. Returns `SampleHandle`. |
| `.probe(index, probe)` | Attach a probe at one position. Returns `ProbeHandle<P>`. |
| `.adapter(&adapter)` | Use a LoRA adapter for this pass. |
| `.zo_seed(seed: i64)` | Set an Evolution Strategies seed for this pass. |
| `.execute().await -> Result<Output>` | Run the pass. |

Inspection

| Method | Description |
|---|---|
| `fwd.start_position() -> u32` | Position the first auto-input token will occupy. Equals the owning context's `seq_len()` at `forward()` time. |
| `fwd.page_size() -> u32` | Page size of the owning context, for sizing per-position structures (masks etc.) without re-querying. |

Output access

| Accessor | Returns | Use after |
|---|---|---|
| `out.token(h: SampleHandle)` | `Option<u32>` | A single-index sampler. |
| `out.tokens_at(h: SampleHandle)` | `Vec<u32>` | A multi-index sampler. |
| `out.distribution(h)` | `Option<(&[u32], &[f32])>` | A `Distribution { ... }` probe. |
| `out.logits(h)` | `Option<&[u8]>` (cast with bytemuck) | A `Logits` probe. |
| `out.logprobs(h)` | `Option<&[f32]>` | A `Logprob(t)` or `Logprobs(ts)` probe. |
| `out.entropy(h)` | `Option<f32>` | An `Entropy` probe. |
| `out.tokens` | `&[u32]` | Generator-accepted tokens this step (post stop / max-tokens truncation). Empty for raw `Forward::execute`. |
| `out.auto_sampler()` | `Option<SampleHandle>` | Handle for the Generator's auto-attached sampler. `None` for raw `Forward` and after `clear_sampler()`. |
| `out.raw()` | `&RawOutput` | Underlying slot list plus the speculative side channel. |

Mismatched access (a sampler slot through a probe handle, or vice versa) returns None.

Samplers

A Sampler chooses one token per slot.

```rust
use inferlet::sample::Sampler;
```

| Variant | Helper | Description |
|---|---|---|
| `Sampler::Argmax` | (use the variant directly) | Greedy. |
| `Sampler::TopP { temperature, p }` | `Sampler::top_p(t, p)` | Nucleus sampling. |
| `Sampler::TopK { temperature, k }` | `Sampler::top_k(t, k)` | Top-k sampling. |
| `Sampler::MinP { temperature, p }` | `Sampler::min_p(t, p)` | Min-p sampling. |
| `Sampler::TopKTopP { temperature, k, p }` | `Sampler::top_k_top_p(t, k, p)` | Top-k filter, then nucleus. |
| `Sampler::Multinomial { temperature, draws }` | `Sampler::multinomial(t, draws)` | Multinomial draws. |

The const fn helpers build the same variant. Either form is acceptable. There is no Sampler::argmax() helper; use Sampler::Argmax directly.
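For intuition, the nucleus rule behind `Sampler::TopP` can be sketched in plain Rust. This is an illustration only, not the engine's implementation; temperature scaling and tie handling are engine-defined:

```rust
/// Return the token indices kept by a top-p (nucleus) filter:
/// the smallest probability-sorted prefix whose mass reaches `p`.
fn nucleus(probs: &[f32], p: f32) -> Vec<usize> {
    // Sort token indices by probability, highest first (probs assumed NaN-free).
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

    let mut kept = Vec::new();
    let mut mass = 0.0f32;
    for i in idx {
        kept.push(i);
        mass += probs[i];
        if mass >= p {
            break; // nucleus complete
        }
    }
    kept
}

fn main() {
    // Tokens 0 and 1 together carry 0.8 of the mass, so p = 0.8 keeps just those two.
    println!("{:?}", nucleus(&[0.5, 0.3, 0.15, 0.05], 0.8)); // [0, 1]
}
```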

Probes

A Probe reads the model's distribution at a position without choosing a token.

```rust
use inferlet::sample::{Distribution, Entropy, Logits, Logprob, Logprobs};
```

| Probe | Output accessor | Returns | Notes |
|---|---|---|---|
| `Logits` | `out.logits(h)` | packed `&[u8]` (cast with bytemuck) | Pre-softmax. |
| `Distribution { temperature, k }` | `out.distribution(h)` | `(&[u32], &[f32])` | Temperature-scaled top-k. `k = 0` for the full vocabulary. |
| `Logprob(token_id)` | `out.logprobs(h)` | length-1 `&[f32]` | Log-probability of `token_id`. |
| `Logprobs(ids)` | `out.logprobs(h)` | length-K `&[f32]` | Multi-candidate logprobs in input order. |
| `Entropy` | `out.entropy(h)` | `f32` | Shannon entropy of the unscaled distribution. |

A single forward pass can mix samplers and probes at different slots.
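For intuition on the quantity the `Entropy` probe reports, here is a self-contained sketch of Shannon entropy over an explicit probability vector (computed in nats; the engine's choice of base is not stated on this page):

```rust
/// Shannon entropy, in nats, of a probability distribution.
fn shannon_entropy(probs: &[f32]) -> f32 {
    probs
        .iter()
        .filter(|&&p| p > 0.0) // 0 * ln 0 is taken as 0
        .map(|&p| -p * p.ln())
        .sum()
}

fn main() {
    // A uniform distribution over 4 tokens maximizes entropy: ln 4 ≈ 1.386.
    println!("{:.3}", shannon_entropy(&[0.25; 4]));
    // A sharply peaked distribution is close to 0.
    println!("{:.3}", shannon_entropy(&[0.97, 0.01, 0.01, 0.01]));
}
```

High entropy at a position signals model uncertainty, which is one common trigger for branching or re-sampling strategies.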

Constraints

Schema

Schema is a trait. Five built-in implementors are re-exported at the crate root:

```rust
use inferlet::{AnyJson, JsonSchema, Regex, Ebnf, Schema};

ctx.generate(sampler).constrain_with(JsonSchema(s))?;
```

| Type | Description |
|---|---|
| `AnyJson` | Any valid JSON. Unit struct. |
| `JsonSchema<'a>(pub &'a str)` | JSON matching the JSON Schema string. |
| `Regex<'a>(pub &'a str)` | Strings matching the regex. |
| `Ebnf<'a>(pub &'a str)` | Custom EBNF grammar (Lark format). |
| `&Grammar` | Pre-compiled grammar resource (the `Schema` impl is on `&Grammar`, so pass a borrow). |

User code can implement Schema for custom grammar sources by providing fn build_constraint(&self, model: &Model) -> Result<GrammarConstraint>.

constrain_with returns Result because parsing can fail. Multiple constraints AND together.

Constrain trait

```rust
pub trait Constrain: Send {
    /// Advance internal state with the tokens just accepted, then return the
    /// BRLE-encoded logit mask for the next position.
    /// Returning `&[]` means "no restriction".
    fn step(&mut self, accepted: &[u32]) -> &[u32];
}
```

accepted is &[] on the first call. The mask uses BRLE: [run_of_false, run_of_true, run_of_false, ...], where 1 = allowed, 0 = forbidden.
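As a concrete illustration of the mask format, here is a hypothetical `Constrain` impl that restricts sampling to a fixed allow-set on every step. The trait is restated locally so the sketch compiles on its own:

```rust
// The `Constrain` trait restated locally so this sketch stands alone.
pub trait Constrain: Send {
    fn step(&mut self, accepted: &[u32]) -> &[u32];
}

/// Hypothetical constraint: only tokens in `allowed` may ever be sampled.
struct AllowList {
    allowed: Vec<u32>, // allowed token IDs
    vocab_size: u32,
    mask: Vec<u32>,    // reused BRLE buffer
}

impl AllowList {
    fn new(allowed: Vec<u32>, vocab_size: u32) -> Self {
        Self { allowed, vocab_size, mask: Vec::new() }
    }
}

impl Constrain for AllowList {
    fn step(&mut self, _accepted: &[u32]) -> &[u32] {
        // Materialize the allow-set as booleans, then run-length encode,
        // alternating [run_of_false, run_of_true, ...] starting with a false run.
        let mut bits = vec![false; self.vocab_size as usize];
        for &t in &self.allowed {
            bits[t as usize] = true;
        }
        self.mask.clear();
        let mut expecting = false; // BRLE starts with a run of `false`
        let mut run = 0u32;
        for b in bits {
            if b == expecting {
                run += 1;
            } else {
                self.mask.push(run); // close the current run (may be 0 if token 0 is allowed)
                expecting = b;
                run = 1;
            }
        }
        self.mask.push(run);
        &self.mask
    }
}

fn main() {
    let mut c = AllowList::new(vec![2, 3], 5);
    // Tokens 0-1 forbidden, 2-3 allowed, 4 forbidden.
    println!("{:?}", c.step(&[])); // [2, 2, 1]
}
```

Plug such an impl in with `.constrain(c)` on the generator builder.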

GrammarConstraint

```rust
use inferlet::GrammarConstraint;

let gc = GrammarConstraint::from_ebnf(my_grammar, &model)?;
```

| Constructor | Description |
|---|---|
| `GrammarConstraint::json(&model)` | Free-form JSON. |
| `GrammarConstraint::from_grammar(&grammar, &model)` | Pre-compiled grammar. |
| `GrammarConstraint::from_json_schema(s, &model)?` | JSON Schema string. |
| `GrammarConstraint::from_regex(p, &model)?` | Regex pattern. |
| `GrammarConstraint::from_ebnf(g, &model)?` | EBNF grammar (Lark format). |

Matcher

A stateful walker over a compiled grammar automaton. Reach for it when implementing a hand-rolled Constrain that wraps a grammar but adds extra logic.

```rust
use inferlet::inference::{Grammar, Matcher};

let grammar = Grammar::from_ebnf(&grammar_src)?;
let mut m = Matcher::new(&grammar, &model.tokenizer());

m.accept_tokens(&prefix_tokens)?;
let mask = m.next_token_logit_mask();
let done = m.is_terminated();
m.reset();
```

Grammar constructors mirror GrammarConstraint: Grammar::from_json_schema, Grammar::json, Grammar::from_regex, Grammar::from_ebnf.

Speculative decoding

Speculation is off by default. Opt in by calling g.system_speculation() (runtime n-gram drafter) or g.speculator(s) (custom drafter) on the generator builder.

Speculator trait

```rust
use inferlet::spec::Speculator;

pub trait Speculator: Send {
    /// Produce draft tokens and their absolute positions for the next
    /// forward pass. Empty vec = "no speculation this step."
    fn draft(&mut self) -> (Vec<u32>, Vec<u32>);

    /// Called with the verifier's accepted token sequence. The first
    /// accepted token corresponds to the anchor's own next-token
    /// prediction; the rest (if any) are matched drafts.
    fn accept(&mut self, accepted: &[u32]);

    /// Roll back the last `n` drafted tokens. Default impl is a no-op.
    fn rollback(&mut self, n: u32) { let _ = n; }

    /// Reset to initial state. Default impl is a no-op.
    fn reset(&mut self) {}
}
```

Plug in with g.speculator(spec) on the generator builder. Only draft and accept are required — rollback and reset have empty default impls.
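As an illustration, here is a hypothetical drafter (with the trait restated locally so the sketch compiles on its own): it proposes the most recently accepted token `width` more times, tracking absolute positions with a counter that the caller would seed from the context's `seq_len()`:

```rust
// The `Speculator` trait restated locally so this sketch stands alone.
pub trait Speculator: Send {
    fn draft(&mut self) -> (Vec<u32>, Vec<u32>);
    fn accept(&mut self, accepted: &[u32]);
    fn rollback(&mut self, n: u32) { let _ = n; }
    fn reset(&mut self) {}
}

/// Toy drafter: guesses that the last accepted token repeats `width` times.
struct RepeatDrafter {
    last: Option<u32>, // most recently accepted token
    pos: u32,          // absolute position of the next token
    width: u32,        // draft tokens proposed per step
}

impl Speculator for RepeatDrafter {
    fn draft(&mut self) -> (Vec<u32>, Vec<u32>) {
        match self.last {
            // Draft `width` copies at the next `width` absolute positions.
            Some(t) => (
                vec![t; self.width as usize],
                (self.pos..self.pos + self.width).collect(),
            ),
            // Nothing accepted yet: skip speculation this step.
            None => (Vec::new(), Vec::new()),
        }
    }

    fn accept(&mut self, accepted: &[u32]) {
        self.pos += accepted.len() as u32;
        self.last = accepted.last().copied().or(self.last);
    }
}

fn main() {
    let mut s = RepeatDrafter { last: None, pos: 10, width: 2 };
    s.accept(&[7, 9]); // verifier accepted two tokens at positions 10 and 11
    println!("{:?}", s.draft()); // ([9, 9], [12, 13])
}
```

A real drafter would consult an n-gram table or a small model; only the `draft`/`accept` contract shown here is fixed by the trait.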

Adapters and fine-tuning

```rust
use inferlet::adapter::Adapter;

let adapter = Adapter::create(&model, "my-adapter")?;
```

| Method | Description |
|---|---|
| `Adapter::create(&model, name) -> Result<Adapter>` | Create a new LoRA overlay scoped to the model. |
| `Adapter::open(&model, name) -> Option<Adapter>` | Open an existing adapter; `None` if absent. |
| `adapter.fork(new_name) -> Adapter` | Copy under a new name. |
| `adapter.save(path) -> Result<()>` | Serialize to disk. |
| `adapter.load(path) -> Result<()>` | Load weights from disk. |
| `adapter.destroy()` | Drop the adapter. |

Apply at inference: g.adapter(&adapter) on a Generator, or fwd.adapter(&adapter) on a Forward.

Decoders

Decoders translate per-step tokens into normalized events. Pie ships three with the same shape: new(&model), feed(&tokens), reset().

chat::Decoder

```rust
use inferlet::chat;

let mut dec = chat::Decoder::new(&model);
match dec.feed(&tokens)? {
    chat::Event::Delta(s) => { /* visible text */ }
    chat::Event::Done(s) => { /* end of turn */ }
    chat::Event::Idle => { /* no semantic boundary */ }
    chat::Event::Interrupt(token_id) => { /* template control token */ }
}
```

| Variant | Payload | Meaning |
|---|---|---|
| `Delta(String)` | text chunk | Streaming visible text. |
| `Done(String)` | full reply | Model reached end-of-turn. |
| `Idle` | (none) | Batch produced no semantic boundary. |
| `Interrupt(u32)` | control token id | Template surfaced a control token without rendering it. |

reasoning::Decoder

```rust
use inferlet::reasoning;
```

| Variant | Payload | Meaning |
|---|---|---|
| `Idle` | (none) | No reasoning content yet. |
| `Start` | (none) | Entering a reasoning block. |
| `Delta(String)` | text chunk | Reasoning text. |
| `End(String)` | full reasoning text | Reasoning block closed. |

tools::Decoder

```rust
use inferlet::tools;
```

| Variant | Payload | Meaning |
|---|---|---|
| `Start` | (none) | A tool call is being assembled. |
| `Call(String, String)` | (name, args_json) | Tool call complete. |

Helpers:

| Function | Description |
|---|---|
| `tools::equip_prefix(&model, &[schema, ...]) -> Result<Vec<u32>>` | Prefix tokens that equip the model with a tool list. |
| `tools::answer_prefix(&model, name, result_json) -> Vec<u32>` | Prefix tokens that feed a tool result back. |
| `tools::parse_call(&model, text) -> Option<(String, String)>` | One-shot extraction of a single completed call from a finished string. |
| `tools::native_grammar(&model, &[schema, ...]) -> Option<Grammar>` | Compiled grammar over the model's native tool-call format. `None` if the model has no native template. |
| `tools::native_matcher(&model, &[schema, ...]) -> Option<Matcher>` | Stateful matcher over the model's native tool-call format. Pass via `GrammarConstraint::new(matcher)`. |

Decoders are stateful. Call reset() between turns or after a terminal event so the decoder is ready for the next one.

Scheduling and market hooks

Every running context places a bid in the engine's KV-page market. The SDK manages this automatically. Override or read it for custom admission control.

Context-level hooks

```rust
ctx.set_bid(2.5);    // override the next-step bid
let _g = ctx.idle(); // RAII guard: bid 0 until `_g` drops
ctx.suspend()?;      // return all pages to the pool; resume later
```

| Method | Description |
|---|---|
| `ctx.set_bid(value: f64)` | Override the auto-bid for the next forward pass. The Generator restores its auto-bid on the step after. |
| `ctx.idle() -> Idle<'_>` | RAII guard that holds the context out of the auction (bid 0). Drop the guard to resume normal bidding. |
| `ctx.suspend() -> Result<()>` | Return all pages to the pool. Suspended contexts are restored automatically (highest-bid first) when memory frees up. |

Reading the market

```rust
use inferlet::scheduling;
```

| Function | Returns | Use it for |
|---|---|---|
| `price()` | Cost in credits to allocate one new KV page. | Computing how much a planned context will cost. |
| `rent(&ctx)` | Clearing price from the most recent knapsack auction. | Detecting contention. |
| `dividend(&model)` | Endowment-proportional share of solver revenue. | Re-investing dividends into your own bid. |
| `latency(&ctx)` | Per-tick wall time in seconds. | Estimating tokens/sec or backing off. |
| `balance(&model)` | Current credit balance for this inferlet. | Deciding when to suspend or stop. |

For the formula behind the default bid, see the SOSP paper.

Session (user ↔ inferlet)

```rust
use inferlet::pie::core::session;
use inferlet::FutureStringExt; // for `.wait_async()` on FutureString
```

The session is the bidirectional channel between the inferlet and the client that launched it. Send and receive happen on the same channel; signals from process.signal(...) arrive through receive.

| Function | Description |
|---|---|
| `session::send(msg: &str)` | Send a text message to the client. Arrives as a Stdout event. |
| `session::send_file(data: &[u8])` | Send a binary blob. Arrives as a File event. |
| `session::receive() -> FutureString` | Wait for the next text message. Pair with `FutureStringExt::wait_async()`. |
| `session::receive_file() -> FutureBlob` | Wait for the next binary blob. |

A FutureString resolves when the next inbound payload arrives:

```rust
use inferlet::pie::core::session;
use inferlet::FutureStringExt;

let next = session::receive();
let msg: Option<String> = next.wait_async().await;
```

Messaging (inferlet ↔ inferlet)

```rust
use inferlet::messaging;
use inferlet::{FutureStringExt, SubscriptionExt};
```

Pub/sub and queues across inferlets running in the same engine. The bus is engine-local; messages do not cross instances.

| Function | Description |
|---|---|
| `messaging::broadcast(topic: &str, msg: &str)` | Publish to every subscriber of `topic`. Fire-and-forget. |
| `messaging::subscribe(topic: &str) -> Subscription` | Open a subscription. Holds messages until consumed. |
| `messaging::push(topic: &str, msg: &str)` | Push a message onto a queue. Each pull consumes one. |
| `messaging::pull(topic: &str) -> FutureString` | Wait for the next queued message. |

Subscription methods:

| Method | Description |
|---|---|
| `sub.pollable() -> Pollable` | WASI pollable for ready-state detection. |
| `sub.get() -> Option<String>` | Non-blocking poll for the next message. |
| `sub.get_async().await` | Async poll that yields until a message arrives. From `SubscriptionExt`. |
| `sub.unsubscribe()` | Drop the subscription. |

MCP (Model Context Protocol)

```rust
use inferlet::mcp;
```

The MCP client lets an inferlet call tools, read resources, and render prompts from MCP servers the host has registered. The client side of registration (telling the engine about a server) lives in the client SDK; this section is the inferlet-facing surface.

| Function | Description |
|---|---|
| `mcp::client::available_servers() -> Vec<String>` | Names of MCP servers registered by the client that launched this inferlet. |
| `mcp::client::connect(name: &str) -> Result<Session>` | Open a session against the named server. |

Session methods (all return JSON strings; parse with serde_json):

| Method | Returns | Description |
|---|---|---|
| `s.list_tools() -> Result<String>` | JSON `{"tools": [...]}` | Tools the server exposes. |
| `s.call_tool(name, args_json) -> Result<String>` | JSON `tools/call` result | Invoke a tool with JSON-encoded arguments. |
| `s.list_resources() -> Result<String>` | JSON `{"resources": [...]}` | Available resources. |
| `s.read_resource(uri) -> Result<String>` | JSON `resources/read` result | Fetch one resource by URI. |
| `s.list_prompts() -> Result<String>` | JSON `{"prompts": [...]}` | Prompt templates. |
| `s.get_prompt(name, args_json) -> Result<String>` | JSON `prompts/get` result | Render a prompt template. |

The session resource closes when dropped.