Rust SDK reference

The full inferlet SDK API for Rust. The Guide walks through how to use these APIs with runnable code; this page enumerates the surface.

Inferlet entry point

```rust
use inferlet::Result;
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct Input { prompt: String }

#[derive(Serialize)]
struct Output { text: String }

#[inferlet::main]
async fn main(input: Input) -> Result<Output> {
    // ...
}
```

The #[inferlet::main] macro generates the WebAssembly entry point and a JSON bridge. The function takes any Deserialize input type and returns any Serialize output type. inferlet::Result<T> aliases std::result::Result<T, String>; an Err(s) becomes the Error event the client receives.

Prelude

```rust
use inferlet::prelude::*;
```

Pulls in the common types: main, Context, Result, Schema, Model, runtime, messaging, Adapter, Forward, Output, SampleHandle, ProbeHandle, Sampler, Probe, Generator, GenStep, chat, reasoning, tools, Speculator, plus the ForwardPassExt, SubscriptionExt, and FutureStringExt extension traits.

Argument parsing

inferlet::Arguments re-exports pico_args::Arguments for callers that want richer flag parsing than serde_json against the input dict.

```rust
use inferlet::{parse_args, Arguments};

let mut args: Arguments = parse_args(raw_argv);
let n: usize = args.value_from_str("--n").unwrap_or(4);
```

Runtime

```rust
use inferlet::runtime;
```

| Function | Description |
|---|---|
| `runtime::models() -> Vec<String>` | Names of every model the engine has loaded. |
| `runtime::version() -> String` | Pie runtime version string. |
| `runtime::instance_id() -> String` | Unique identifier for this engine instance. |
| `runtime::username() -> String` | Username of the user who launched the inferlet. |

Model

```rust
use inferlet::model::Model;

let model = Model::load("default")?;
```

| Method | Description |
|---|---|
| `Model::load(name: &str) -> Result<Model>` | Bind to a model loaded by the engine. The name is the `[model.<name>]` key in `~/.pie/config.toml`. |
| `model.tokenizer() -> &Tokenizer` | The model's tokenizer. |

Tokenizer

| Method | Description |
|---|---|
| `tok.encode(text: &str) -> Vec<u32>` | Text to token IDs. |
| `tok.decode(ids: &[u32]) -> Result<String>` | Token IDs to text. |
| `tok.vocabs() -> (Vec<u32>, Vec<Vec<u8>>)` | All token IDs paired with their raw byte sequences. |
| `tok.special_tokens() -> (Vec<u32>, Vec<Vec<u8>>)` | Special token IDs (BOS, EOS, etc.). |
| `tok.split_regex() -> &str` | The split regex used during BPE pre-tokenization. |

Context

Construction and lifecycle

```rust
use inferlet::Context;

let mut ctx = Context::new(&model)?;
```

| Method | Description |
|---|---|
| `Context::new(&model) -> Result<Context>` | Fresh anonymous context. KV pages are released on drop. |
| `Context::open(&model, name: &str) -> Result<Context>` | Clone a saved snapshot. The snapshot stays. |
| `Context::take(&model, name: &str) -> Result<Context>` | Move a saved snapshot into a fresh context. The snapshot is removed. |
| `Context::delete(&model, name: &str) -> Result<()>` | Drop a saved snapshot. |
| `ctx.save(name: &str) -> Result<()>` | Snapshot under a user-chosen name. |
| `ctx.snapshot() -> Result<String>` | Snapshot under a runtime-generated name. Returns the name. |
| `ctx.fork() -> Result<Context>` | Copy-on-write clone. O(1). |

Saved snapshots persist across inferlet runs as long as the engine is up.

Filling

| Method | Description |
|---|---|
| `ctx.system(text: &str) -> &mut Context` | Add a system message. |
| `ctx.user(text: &str) -> &mut Context` | Add a user message. |
| `ctx.assistant(text: &str) -> &mut Context` | Add a pre-filled assistant turn. |
| `ctx.cue() -> &mut Context` | Mark the current position as the start of the model's reply. |
| `ctx.seal() -> &mut Context` | Close the current assistant turn. |
| `ctx.append(tokens: &[u32]) -> &mut Context` | Append raw tokens. |
| `ctx.flush() -> impl Future<Output = Result<()>>` | Run prefill on buffered tokens; commit pages. |
| `ctx.truncate(n: u32) -> Result<()>` | Drop the trailing `n` working-page tokens (rollback primitive). Pages already committed cannot be truncated through this API; go through `ctx.inner()` if you need to. |

Inspection

| Method | Description |
|---|---|
| `ctx.model() -> &Model` | The bound model. |
| `ctx.page_size() -> u32` | Tokens per KV page. |
| `ctx.seq_len() -> u32` | Total committed + working tokens. |
| `ctx.buffer() -> &[u32]` | SDK-side buffered tokens not yet flushed. |
| `ctx.inner() -> &RawContext` | Underlying resource for page-level ops. |

Page operations

Reach for these via ctx.inner() when implementing custom forward-pass loops, sliding windows, or speculation rollback.

| Method | Description |
|---|---|
| `raw.reserve_working_pages(n: u32) -> Result<()>` | Pre-allocate `n` working pages. |
| `raw.commit_working_pages(k: u32) -> Result<()>` | Promote `k` full working pages to committed. |
| `raw.release_working_pages(n: u32)` | Free `n` working pages. |
| `raw.truncate_working_page_tokens(n: u32)` | Drop the last `n` tokens (rollback). |
| `raw.committed_page_count() -> u32` | Number of committed pages. |
| `raw.working_page_count() -> u32` | Number of working pages. |
| `raw.working_page_token_count() -> u32` | Tokens in the trailing working page. |

Generator

```rust
let g = ctx.generate(Sampler::TopP { temperature: 0.6, p: 0.95 })
    .max_tokens(256);
```

ctx.generate(sampler) returns a Generator. Drive it with one of the collectors or step by step.

Collectors

| Method | Returns | Notes |
|---|---|---|
| `g.collect_text().await` | `Result<String>` | Drives the loop; decodes via the chat template. |
| `g.collect_tokens().await` | `Result<Vec<u32>>` | All accepted token IDs. |
| `g.collect_json::<T>().await` | `Result<T>` | Derives schema from `T: JsonSchema + Deserialize`, applies as constraint, parses output. |

Builder methods

| Method | Description |
|---|---|
| `.max_tokens(n)` | Stop after `n` accepted tokens. |
| `.stop(&[...])` / `.add_stop(&[...])` | Extra stop-token IDs (added to the model's EOS). |
| `.constrain(c)` | Apply a `Constrain` impl. |
| `.constrain_with(JsonSchema(s))?` | Apply a declarative schema. |
| `.speculator(s)` | Plug in a custom speculator. |
| `.system_speculation()` | Use the runtime's draft model. |
| `.adapter(&a)` | Apply a LoRA adapter. |
| `.zo_seed(seed: i64)` | Set an Evolution Strategies seed for every step. |
| `.horizon(n)` | Hint expected output length for bid planning. |
| `.probe_each_step(idx, p)` | Attach a probe to every step (returns a handle). |

Inspection

| Method | Description |
|---|---|
| `g.is_done() -> bool` | `true` after generation has terminated. |
| `g.tokens_generated() -> usize` | Tokens accepted so far. |

Per-step iteration

```rust
let mut g = ctx.generate(Sampler::Argmax).max_tokens(256);

while let Some(mut step) = g.next()? {
    let out = step.execute().await?;
    // Inspect or override; commit chosen tokens.
    g.accept(&[chosen_token]);
}
```

| Method | Description |
|---|---|
| `g.next() -> Result<Option<GenStep>>` | Yield the next step or end the loop. |
| `step.execute().await -> Result<Output>` | Run one forward pass. |
| `step.clear_sampler()` | Drop the auto-attached sampler so you can attach your own. |
| `step.probe(idx, probe)` | Attach a probe to this step. |
| `g.accept(tokens: &[u32])` | Commit chosen tokens to the generator's state. |

Forward

```rust
let mut fwd = ctx.forward();
fwd.input(&token_ids);
let h = fwd.sample(&[0], Sampler::Argmax);
let out = fwd.execute().await?;
let token = out.token(h);
```

ctx.forward() returns a Forward<'ctx>. Page reservation, position derivation, and post-execute commit happen automatically.

Builder methods

| Method | Description |
|---|---|
| `.input(&tokens)` | Token IDs with auto-derived sequential positions. |
| `.input_at(&tokens, &positions)` | Token IDs with explicit position IDs. |
| `.attention_mask(&[brle, ...])` | One BRLE mask per input token. |
| `.mask(&brle)` | Logit mask (BRLE over the vocabulary). |
| `.sample(&indices, sampler)` | Attach a sampler at output positions. Returns `SampleHandle`. |
| `.probe(index, probe)` | Attach a probe at one position. Returns `ProbeHandle<P>`. |
| `.adapter(&adapter)` | Use a LoRA adapter for this pass. |
| `.zo_seed(seed: i64)` | Set an Evolution Strategies seed for this pass. |
| `.execute().await -> Result<Output>` | Run the pass. |

Inspection

| Method | Description |
|---|---|
| `fwd.start_position() -> u32` | Position the first auto-input token will occupy. Equals the owning context's `seq_len()` at `forward()` time. |
| `fwd.page_size() -> u32` | Page size of the owning context, for sizing per-position structures (masks etc.) without re-querying. |

Output access

| Accessor | Returns | Use after |
|---|---|---|
| `out.token(h: SampleHandle)` | `Option<u32>` | A single-index sampler. |
| `out.tokens_at(h: SampleHandle)` | `Vec<u32>` | A multi-index sampler. |
| `out.distribution(h)` | `Option<(&[u32], &[f32])>` | A `Distribution { ... }` probe. |
| `out.logits(h)` | `Option<&[u8]>` (cast with bytemuck) | A `Logits` probe. |
| `out.logprobs(h)` | `Option<&[f32]>` | A `Logprob(t)` or `Logprobs(ts)` probe. |
| `out.entropy(h)` | `Option<f32>` | An `Entropy` probe. |
| `out.tokens` | `&[u32]` | Generator-accepted tokens this step (post stop / max-tokens truncation). Empty for raw `Forward::execute`. |
| `out.auto_sampler()` | `Option<SampleHandle>` | Handle for the Generator's auto-attached sampler. `None` for raw `Forward` and after `clear_sampler()`. |
| `out.raw()` | `&RawOutput` | Underlying slot list plus the speculative side channel. |

Mismatched access (a sampler slot through a probe handle, or vice versa) returns None.

Samplers

A Sampler chooses one token per slot.

```rust
use inferlet::sample::Sampler;
```

| Variant | Helper | Description |
|---|---|---|
| `Sampler::Argmax` | (use the variant directly) | Greedy. |
| `Sampler::TopP { temperature, p }` | `Sampler::top_p(t, p)` | Nucleus sampling. |
| `Sampler::TopK { temperature, k }` | `Sampler::top_k(t, k)` | Top-k sampling. |
| `Sampler::MinP { temperature, p }` | `Sampler::min_p(t, p)` | Min-p sampling. |
| `Sampler::TopKTopP { temperature, k, p }` | `Sampler::top_k_top_p(t, k, p)` | Top-k filter, then nucleus. |
| `Sampler::Multinomial { temperature, draws }` | `Sampler::multinomial(t, draws)` | Multinomial draws. |

The const fn helpers build the same variant. Either form is acceptable. There is no Sampler::argmax() helper; use Sampler::Argmax directly.
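For intuition, the nucleus rule behind `Sampler::TopP` can be sketched in plain Rust. This is an illustration only, not the engine's implementation; temperature scaling and tie handling are engine-defined:

```rust
/// Return the token indices kept by a top-p (nucleus) filter:
/// the smallest probability-sorted prefix whose mass reaches `p`.
fn nucleus(probs: &[f32], p: f32) -> Vec<usize> {
    // Sort token indices by probability, highest first (probs assumed NaN-free).
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

    let mut kept = Vec::new();
    let mut mass = 0.0f32;
    for i in idx {
        kept.push(i);
        mass += probs[i];
        if mass >= p {
            break; // nucleus complete
        }
    }
    kept
}

fn main() {
    // Tokens 0 and 1 together carry 0.8 of the mass, so p = 0.8 keeps just those two.
    println!("{:?}", nucleus(&[0.5, 0.3, 0.15, 0.05], 0.8)); // [0, 1]
}
```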

Probes

A Probe reads the model's distribution at a position without choosing a token.

```rust
use inferlet::sample::{Distribution, Entropy, Logits, Logprob, Logprobs};
```

| Probe | Output accessor | Returns | Notes |
|---|---|---|---|
| `Logits` | `out.logits(h)` | packed `&[u8]` (cast with bytemuck) | Pre-softmax. |
| `Distribution { temperature, k }` | `out.distribution(h)` | `(&[u32], &[f32])` | Temperature-scaled top-k. `k = 0` for the full vocabulary. |
| `Logprob(token_id)` | `out.logprobs(h)` | length-1 `&[f32]` | Log-probability of `token_id`. |
| `Logprobs(ids)` | `out.logprobs(h)` | length-K `&[f32]` | Multi-candidate logprobs in input order. |
| `Entropy` | `out.entropy(h)` | `f32` | Shannon entropy of the unscaled distribution. |

A single forward pass can mix samplers and probes at different slots.
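For intuition on the quantity the `Entropy` probe reports, here is a self-contained sketch of Shannon entropy over an explicit probability vector (computed in nats; the engine's choice of base is not stated on this page):

```rust
/// Shannon entropy, in nats, of a probability distribution.
fn shannon_entropy(probs: &[f32]) -> f32 {
    probs
        .iter()
        .filter(|&&p| p > 0.0) // 0 * ln 0 is taken as 0
        .map(|&p| -p * p.ln())
        .sum()
}

fn main() {
    // A uniform distribution over 4 tokens maximizes entropy: ln 4 ≈ 1.386.
    println!("{:.3}", shannon_entropy(&[0.25; 4]));
    // A sharply peaked distribution is close to 0.
    println!("{:.3}", shannon_entropy(&[0.97, 0.01, 0.01, 0.01]));
}
```

High entropy at a position signals model uncertainty, which is one common trigger for branching or re-sampling strategies.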

Constraints

Schema

Schema is a trait. Five built-in implementors are re-exported at the crate root:

```rust
use inferlet::{AnyJson, JsonSchema, Regex, Ebnf, Schema};

ctx.generate(sampler).constrain_with(JsonSchema(s))?;
```

| Type | Description |
|---|---|
| `AnyJson` | Any valid JSON. Unit struct. |
| `JsonSchema<'a>(pub &'a str)` | JSON matching the JSON Schema string. |
| `Regex<'a>(pub &'a str)` | Strings matching the regex. |
| `Ebnf<'a>(pub &'a str)` | Custom EBNF grammar (Lark format). |
| `&Grammar` | Pre-compiled grammar resource (the `Schema` impl is on `&Grammar`, so pass a borrow). |

User code can implement Schema for custom grammar sources by providing fn build_constraint(&self, model: &Model) -> Result<GrammarConstraint>.

constrain_with returns Result because parsing can fail. Multiple constraints AND together.

Constrain trait

```rust
pub trait Constrain: Send {
    /// Advance internal state with the tokens just accepted, then return the
    /// BRLE-encoded logit mask for the next position.
    /// Returning `&[]` means "no restriction".
    fn step(&mut self, accepted: &[u32]) -> &[u32];
}
```

accepted is &[] on the first call. The mask uses BRLE: [run_of_false, run_of_true, run_of_false, ...], where 1 = allowed, 0 = forbidden.
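As a concrete illustration of the mask format, here is a hypothetical `Constrain` impl that restricts sampling to a fixed allow-set on every step. The trait is restated locally so the sketch compiles on its own:

```rust
// The `Constrain` trait restated locally so this sketch stands alone.
pub trait Constrain: Send {
    fn step(&mut self, accepted: &[u32]) -> &[u32];
}

/// Hypothetical constraint: only tokens in `allowed` may ever be sampled.
struct AllowList {
    allowed: Vec<u32>, // allowed token IDs
    vocab_size: u32,
    mask: Vec<u32>,    // reused BRLE buffer
}

impl AllowList {
    fn new(allowed: Vec<u32>, vocab_size: u32) -> Self {
        Self { allowed, vocab_size, mask: Vec::new() }
    }
}

impl Constrain for AllowList {
    fn step(&mut self, _accepted: &[u32]) -> &[u32] {
        // Materialize the allow-set as booleans, then run-length encode,
        // alternating [run_of_false, run_of_true, ...] starting with a false run.
        let mut bits = vec![false; self.vocab_size as usize];
        for &t in &self.allowed {
            bits[t as usize] = true;
        }
        self.mask.clear();
        let mut expecting = false; // BRLE starts with a run of `false`
        let mut run = 0u32;
        for b in bits {
            if b == expecting {
                run += 1;
            } else {
                self.mask.push(run); // close the current run (may be 0 if token 0 is allowed)
                expecting = b;
                run = 1;
            }
        }
        self.mask.push(run);
        &self.mask
    }
}

fn main() {
    let mut c = AllowList::new(vec![2, 3], 5);
    // Tokens 0-1 forbidden, 2-3 allowed, 4 forbidden.
    println!("{:?}", c.step(&[])); // [2, 2, 1]
}
```

Plug such an impl in with `.constrain(c)` on the generator builder.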

GrammarConstraint

```rust
use inferlet::GrammarConstraint;

let gc = GrammarConstraint::from_ebnf(my_grammar, &model)?;
```

| Constructor | Description |
|---|---|
| `GrammarConstraint::json(&model)` | Free-form JSON. |
| `GrammarConstraint::from_grammar(&grammar, &model)` | Pre-compiled grammar. |
| `GrammarConstraint::from_json_schema(s, &model)?` | JSON Schema string. |
| `GrammarConstraint::from_regex(p, &model)?` | Regex pattern. |
| `GrammarConstraint::from_ebnf(g, &model)?` | EBNF grammar (Lark format). |

Matcher

A stateful walker over a compiled grammar automaton. Reach for it when implementing a hand-rolled Constrain that wraps a grammar but adds extra logic.

```rust
use inferlet::inference::{Grammar, Matcher};

let grammar = Grammar::from_ebnf(&grammar_src)?;
let mut m = Matcher::new(&grammar, &model.tokenizer());

m.accept_tokens(&prefix_tokens)?;
let mask = m.next_token_logit_mask();
let done = m.is_terminated();
m.reset();
```

Grammar constructors mirror GrammarConstraint: Grammar::from_json_schema, Grammar::json, Grammar::from_regex, Grammar::from_ebnf.

Speculative decoding

Speculation is off by default. Opt in by calling g.system_speculation() (runtime n-gram drafter) or g.speculator(s) (custom drafter) on the generator builder.

Speculator trait

```rust
use inferlet::spec::Speculator;

pub trait Speculator: Send {
    /// Produce draft tokens and their absolute positions for the next
    /// forward pass. Empty vec = "no speculation this step."
    fn draft(&mut self) -> (Vec<u32>, Vec<u32>);

    /// Called with the verifier's accepted token sequence. The first
    /// accepted token corresponds to the anchor's own next-token
    /// prediction; the rest (if any) are matched drafts.
    fn accept(&mut self, accepted: &[u32]);

    /// Roll back the last `n` drafted tokens. Default impl is a no-op.
    fn rollback(&mut self, n: u32) { let _ = n; }

    /// Reset to initial state. Default impl is a no-op.
    fn reset(&mut self) {}
}
```

Plug in with g.speculator(spec) on the generator builder. Only draft and accept are required — rollback and reset have empty default impls.
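As an illustration, here is a hypothetical drafter (with the trait restated locally so the sketch compiles on its own): it proposes the most recently accepted token `width` more times, tracking absolute positions with a counter that the caller would seed from the context's `seq_len()`:

```rust
// The `Speculator` trait restated locally so this sketch stands alone.
pub trait Speculator: Send {
    fn draft(&mut self) -> (Vec<u32>, Vec<u32>);
    fn accept(&mut self, accepted: &[u32]);
    fn rollback(&mut self, n: u32) { let _ = n; }
    fn reset(&mut self) {}
}

/// Toy drafter: guesses that the last accepted token repeats `width` times.
struct RepeatDrafter {
    last: Option<u32>, // most recently accepted token
    pos: u32,          // absolute position of the next token
    width: u32,        // draft tokens proposed per step
}

impl Speculator for RepeatDrafter {
    fn draft(&mut self) -> (Vec<u32>, Vec<u32>) {
        match self.last {
            // Draft `width` copies at the next `width` absolute positions.
            Some(t) => (
                vec![t; self.width as usize],
                (self.pos..self.pos + self.width).collect(),
            ),
            // Nothing accepted yet: skip speculation this step.
            None => (Vec::new(), Vec::new()),
        }
    }

    fn accept(&mut self, accepted: &[u32]) {
        self.pos += accepted.len() as u32;
        self.last = accepted.last().copied().or(self.last);
    }
}

fn main() {
    let mut s = RepeatDrafter { last: None, pos: 10, width: 2 };
    s.accept(&[7, 9]); // verifier accepted two tokens at positions 10 and 11
    println!("{:?}", s.draft()); // ([9, 9], [12, 13])
}
```

A real drafter would consult an n-gram table or a small model; only the `draft`/`accept` contract shown here is fixed by the trait.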

Adapters and fine-tuning

```rust
use inferlet::adapter::Adapter;

let adapter = Adapter::create(&model, "my-adapter")?;
```

| Method | Description |
|---|---|
| `Adapter::create(&model, name) -> Result<Adapter>` | Create a new LoRA overlay scoped to the model. |
| `Adapter::open(&model, name) -> Option<Adapter>` | Open an existing adapter; `None` if absent. |
| `adapter.fork(new_name) -> Adapter` | Copy under a new name. |
| `adapter.save(path) -> Result<()>` | Serialize to disk. |
| `adapter.load(path) -> Result<()>` | Load weights from disk. |
| `adapter.destroy()` | Drop the adapter. |

Apply at inference: g.adapter(&adapter) on a Generator, or fwd.adapter(&adapter) on a Forward.

Decoders

Decoders translate per-step tokens into normalized events. Pie ships three with the same shape: new(&model), feed(&tokens), reset().

chat::Decoder

```rust
use inferlet::chat;

let mut dec = chat::Decoder::new(&model);
match dec.feed(&tokens)? {
    chat::Event::Delta(s) => { /* visible text */ }
    chat::Event::Done(s) => { /* end of turn */ }
    chat::Event::Idle => { /* no semantic boundary */ }
    chat::Event::Interrupt(token_id) => { /* template control token */ }
}
```

| Variant | Payload | Meaning |
|---|---|---|
| `Delta(String)` | text chunk | Streaming visible text. |
| `Done(String)` | full reply | Model reached end-of-turn. |
| `Idle` | (none) | Batch produced no semantic boundary. |
| `Interrupt(u32)` | control token id | Template surfaced a control token without rendering it. |

reasoning::Decoder

```rust
use inferlet::reasoning;
```

| Variant | Payload | Meaning |
|---|---|---|
| `Idle` | (none) | No reasoning content yet. |
| `Start` | (none) | Entering a reasoning block. |
| `Delta(String)` | text chunk | Reasoning text. |
| `End(String)` | full reasoning text | Reasoning block closed. |

tools::Decoder

```rust
use inferlet::tools;
```

| Variant | Payload | Meaning |
|---|---|---|
| `Start` | (none) | A tool call is being assembled. |
| `Call(String, String)` | (name, args_json) | Tool call complete. |

Helpers:

| Function | Description |
|---|---|
| `tools::equip_prefix(&model, &[schema, ...]) -> Result<Vec<u32>>` | Prefix tokens that equip the model with a tool list. |
| `tools::answer_prefix(&model, name, result_json) -> Vec<u32>` | Prefix tokens that feed a tool result back. |
| `tools::parse_call(&model, text) -> Option<(String, String)>` | One-shot extraction of a single completed call from a finished string. |
| `tools::native_grammar(&model, &[schema, ...]) -> Option<Grammar>` | Compiled grammar over the model's native tool-call format. `None` if the model has no native template. |
| `tools::native_matcher(&model, &[schema, ...]) -> Option<Matcher>` | Stateful matcher over the model's native tool-call format. Pass via `GrammarConstraint::new(matcher)`. |

Decoders are stateful. Call reset() between turns or after a terminal event so the decoder is ready for the next one.

Scheduling and market hooks

Every running context places a bid in the engine's KV-page market. The SDK manages this automatically. Override or read it for custom admission control.

Context-level hooks

```rust
ctx.set_bid(2.5);    // override the next-step bid
let _g = ctx.idle(); // RAII guard: bid 0 until `_g` drops
ctx.suspend()?;      // return all pages to the pool; resume later
```

| Method | Description |
|---|---|
| `ctx.set_bid(value: f64)` | Override the auto-bid for the next forward pass. The Generator restores its auto-bid on the step after. |
| `ctx.idle() -> Idle<'_>` | RAII guard that holds the context out of the auction (bid 0). Drop the guard to resume normal bidding. |
| `ctx.suspend() -> Result<()>` | Return all pages to the pool. Suspended contexts are restored automatically (highest-bid first) when memory frees up. |

Reading the market

```rust
use inferlet::scheduling;
```

| Function | Returns | Use it for |
|---|---|---|
| `price()` | Cost in credits to allocate one new KV page. | Computing how much a planned context will cost. |
| `rent(&ctx)` | Clearing price from the most recent knapsack auction. | Detecting contention. |
| `dividend(&model)` | Endowment-proportional share of solver revenue. | Re-investing dividends into your own bid. |
| `latency(&ctx)` | Per-tick wall time in seconds. | Estimating tokens/sec or backing off. |
| `balance(&model)` | Current credit balance for this inferlet. | Deciding when to suspend or stop. |

For the formula behind the default bid, see the SOSP paper.

Session (user ↔ inferlet)

```rust
use inferlet::pie::core::session;
use inferlet::FutureStringExt; // for `.wait_async()` on FutureString
```

The session is the bidirectional channel between the inferlet and the client that launched it. Send and receive happen on the same channel; signals from process.signal(...) arrive through receive.

| Function | Description |
|---|---|
| `session::send(msg: &str)` | Send a text message to the client. Arrives as a Stdout event. |
| `session::send_file(data: &[u8])` | Send a binary blob. Arrives as a File event. |
| `session::receive() -> FutureString` | Wait for the next text message. Pair with `FutureStringExt::wait_async()`. |
| `session::receive_file() -> FutureBlob` | Wait for the next binary blob. |

A FutureString resolves when the next inbound payload arrives:

```rust
use inferlet::pie::core::session;
use inferlet::FutureStringExt;

let next = session::receive();
let msg: Option<String> = next.wait_async().await;
```

Messaging (inferlet ↔ inferlet)

```rust
use inferlet::messaging;
use inferlet::{FutureStringExt, SubscriptionExt};
```

Pub/sub and queues across inferlets running in the same engine. The bus is engine-local; messages do not cross instances.

| Function | Description |
|---|---|
| `messaging::broadcast(topic: &str, msg: &str)` | Publish to every subscriber of `topic`. Fire-and-forget. |
| `messaging::subscribe(topic: &str) -> Subscription` | Open a subscription. Holds messages until consumed. |
| `messaging::push(topic: &str, msg: &str)` | Push a message onto a queue. Each pull consumes one. |
| `messaging::pull(topic: &str) -> FutureString` | Wait for the next queued message. |

Subscription methods:

| Method | Description |
|---|---|
| `sub.pollable() -> Pollable` | WASI pollable for ready-state detection. |
| `sub.get() -> Option<String>` | Non-blocking poll for the next message. |
| `sub.get_async().await` | Async poll that yields until a message arrives. From `SubscriptionExt`. |
| `sub.unsubscribe()` | Drop the subscription. |

MCP (Model Context Protocol)

```rust
use inferlet::mcp;
```

The MCP client lets an inferlet call tools, read resources, and render prompts from MCP servers the host has registered. The client side of registration (telling the engine about a server) lives in the client SDK; this section is the inferlet-facing surface.

| Function | Description |
|---|---|
| `mcp::client::available_servers() -> Vec<String>` | Names of MCP servers registered by the client that launched this inferlet. |
| `mcp::client::connect(name: &str) -> Result<Session>` | Open a session against the named server. |

Session methods (all return JSON strings; parse with serde_json):

| Method | Returns | Description |
|---|---|---|
| `s.list_tools() -> Result<String>` | JSON `{"tools": [...]}` | Tools the server exposes. |
| `s.call_tool(name, args_json) -> Result<String>` | JSON `tools/call` result | Invoke a tool with JSON-encoded arguments. |
| `s.list_resources() -> Result<String>` | JSON `{"resources": [...]}` | Available resources. |
| `s.read_resource(uri) -> Result<String>` | JSON `resources/read` result | Fetch one resource by URI. |
| `s.list_prompts() -> Result<String>` | JSON `{"prompts": [...]}` | Prompt templates. |
| `s.get_prompt(name, args_json) -> Result<String>` | JSON `prompts/get` result | Render a prompt template. |

The session resource closes when dropped.