
Python SDK reference

The full inferlet SDK API for Python. The Guide walks through how to use these APIs with runnable code; this page enumerates the surface.

Inferlet entry point

```python
async def main(input: dict) -> str:
    name = input.get("name", "world")
    return f"hello {name}"
```

The Python inferlet runtime invokes a top-level async def main(input) coroutine. input is a parsed JSON dict whose shape matches the manifest's [parameters] block. The return value is JSON-serialized into the Return event the client receives.

Raise an exception to fail the run; the message becomes the Error event.

Runtime

```python
from inferlet import runtime
```

| Function | Returns | Description |
| --- | --- | --- |
| runtime.models() | list[str] | Names of every model the engine has loaded. |
| runtime.version() | str | Pie runtime version string. |
| runtime.instance_id() | str | Unique identifier for this engine instance. |
| runtime.username() | str | Username of the user who launched the inferlet. |

Model

```python
from inferlet import Model
```

| Method | Description |
| --- | --- |
| Model.load(name: str) -> Model | Bind to a loaded model. name is the [model.&lt;name&gt;] key in ~/.pie/config.toml. |
| model.tokenizer() -> Tokenizer | The model's tokenizer. |

Tokenizer

| Method | Returns | Description |
| --- | --- | --- |
| tk.encode(text) | list[int] | Text to token IDs. |
| tk.decode(tokens) | str | Token IDs to text. |
| tk.vocabs() | tuple[list[int], list[bytes]] | Full vocabulary as parallel lists. |
| tk.special_tokens() | tuple[list[int], list[bytes]] | Special token IDs (BOS, EOS, etc.). |
| tk.split_regex() | str | The split regex used during BPE pre-tokenization. |

Context

```python
from inferlet import Context

ctx = Context(model)
```

Construction and lifecycle

| Method | Description |
| --- | --- |
| Context(model) | Fresh anonymous context. KV pages release on drop. |
| Context.open(model, name) -> Context \| None | Clone a saved snapshot. Returns None if the name is absent. |
| Context.take(model, name) -> Context \| None | Take ownership of a snapshot (the snapshot is removed). |
| Context.delete(model, name) | Drop a saved snapshot. |
| ctx.fork() -> Context | Copy-on-write clone. O(1). |
| ctx.save(name) | Snapshot under a user-chosen name. |
| ctx.snapshot() -> str | Snapshot under a runtime-generated name. Returns the name. |
| ctx.release() | Force-destroy this context immediately. |

Context supports the context-manager protocol (with Context(model) as ctx:).

Filling

| Method | Description |
| --- | --- |
| ctx.system(text) -> Context | Add a system message. |
| ctx.user(text) -> Context | Add a user message. |
| ctx.assistant(text) -> Context | Add a pre-filled assistant turn. |
| ctx.cue() -> Context | Mark the current position as the model's start. |
| ctx.seal() -> Context | Close the current assistant turn. |
| ctx.append(tokens) -> Context | Append raw token IDs. |
| await ctx.flush() | Run prefill on buffered tokens; commit pages. |

Inspection

| Member | Kind | Type | Description |
| --- | --- | --- | --- |
| ctx.model | property | Model | The bound model. |
| ctx.page_size | property | int | Tokens per KV page. |
| ctx.seq_len | property | int | Total committed + working tokens (excludes the SDK buffer). |
| ctx.buffer() | method | list[int] | SDK-side buffered tokens not yet flushed. |

Truncate

| Method | Description |
| --- | --- |
| ctx.truncate(n) | Drop the trailing n working-page tokens. Rollback primitive. Pages already committed cannot be truncated through this API. |

The Python SDK does not expose ctx.inner() for raw page operations; those are Rust-only.

Generator

```python
from inferlet import Sampler

g = ctx.generate(Sampler.top_p(0.6, 0.95), max_tokens=256)
```

ctx.generate(sampler, **options) returns a Generator. Drive it with one of the collectors, with async for, or by calling __anext__ directly.

Constructor options

ctx.generate accepts keyword-only arguments matching the Rust builder methods:

| Kwarg | Type | Description |
| --- | --- | --- |
| max_tokens | int \| None | Stop after n accepted tokens. |
| stop | Iterable[int] \| None | Extra stop-token IDs. With auto_flush=True, defaults to the chat template's stop tokens. |
| constrain | Schema \| Constraint \| list[Schema \| Constraint] \| None | Attach one or more constraints. Multiple constraints compose by AND-ing their per-step BRLE masks. |
| logit_mask | list[int] \| None | Static BRLE mask applied every step. Composes with constrain like any other constraint. |
| speculator | Speculator \| None | Custom speculator (any class implementing the protocol). |
| system_speculation | bool | Use the runtime's built-in n-gram drafter. Default False. |
| adapter | Adapter \| None | LoRA adapter to apply on every step. |
| zo_seed | int \| None | Evolution Strategies seed for every forward pass. |
| horizon | int \| None | Hint expected output length for credit pacing. |
| auto_flush | bool | When True (default), append cue() to the buffer before returning the Generator and use chat-template stop tokens by default. Set False to inspect the buffer or call cue() yourself. |

speculator and system_speculation are mutually exclusive.
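Several options above (logit_mask, constraint masks) traffic in BRLE masks. This page does not define the encoding; the helpers below assume BRLE means alternating run lengths over a 0/1 bitmask, starting with the run of 0s (so a leading 0 entry means the mask begins with allowed tokens). Verify this convention against the runtime before relying on it:

```python
def brle_decode(runs: list[int], size: int) -> list[int]:
    """Expand a BRLE mask into per-token 0/1 flags.

    Assumed convention (verify against the runtime): runs alternate
    between lengths of 0-runs and 1-runs, starting with the 0-run.
    """
    bits, value = [], 0
    for run in runs:
        bits.extend([value] * run)
        value ^= 1
    bits.extend([0] * (size - len(bits)))  # implicit trailing 0s
    return bits[:size]


def brle_encode(bits: list[int]) -> list[int]:
    """Inverse of brle_decode: pack 0/1 flags into alternating run lengths."""
    runs, value, run = [], 0, 0
    for b in bits:
        if b == value:
            run += 1
        else:
            runs.append(run)
            value ^= 1
            run = 1
    runs.append(run)
    return runs
```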

Builder methods (each returns the Generator)

| Method | Description |
| --- | --- |
| g.max_tokens(n) | Hard cap on tokens. |
| g.stop(tokens) | Replace the stop set. |
| g.add_stop(tokens) | Append to the stop set. |
| g.constrain(c) | Add a constraint. Composes with previously attached ones. |
| g.horizon(n) | Hint expected output length. |
| g.adapter(a) | Apply an adapter. |
| g.zo_seed(seed) | Set an Evolution Strategies seed for every step. |
| g.probe_each_step(idx, probe) -> ProbeHandle | Attach a probe to every step. |

Inspection

| Property / method | Description |
| --- | --- |
| g.tokens_generated | Total tokens accepted so far. |
| g.is_done | True after generation has terminated. |

Collectors

| Method | Returns | Description |
| --- | --- | --- |
| await g.collect_tokens() | list[int] | Drain the loop; return all accepted tokens. |
| await g.collect_text() | str | Drain, run a chat decoder internally, return the assembled string. |
| await g.collect_json(*, schema=None, parse=None) | Any | Add a JSON-schema constraint, drain, parse the output. With parse=cb, run the callback on the JSON string. |

Per-step iteration

```python
async for step in g:
    out = await step.execute()
    # inspect; optionally call g.accept(...)
```

step is a GenStep. After step.execute(), the loop folds the result into the generator's state.

| GenStep method | Description |
| --- | --- |
| step.clear_sampler() -> GenStep | Drop the auto-attached sampler. The forward pass still runs; you must read a probe and pick a token yourself. |
| step.probe(idx, probe) -> ProbeHandle | Attach an extra probe for this iteration only. |
| await step.execute() -> Output | Run the forward pass and fold the result into generator state. |

| Generator method | Description |
| --- | --- |
| g.accept(tokens) -> list[int] | Register manually-sampled tokens. Returns the post-accept token list (after constraint reconciliation). |

Forward

```python
fwd = ctx.forward()
fwd.input(token_ids)
h = fwd.sample([0], Sampler.argmax())
out = await fwd.execute()
token = out.token(h)
```

ctx.forward() returns a Forward bound to the context. The builder reserves working pages, derives positions, and commits pages on execute.

Builder methods

| Method | Description |
| --- | --- |
| fwd.input(tokens) -> Forward | Token IDs with auto-derived positions. |
| fwd.input_at(tokens, positions) -> Forward | Token IDs with explicit position IDs. |
| fwd.attention_mask(masks) -> Forward | Per-input-token attention masks (BRLE). |
| fwd.mask(brle) -> Forward | Logit mask applied at every sampled position. |
| fwd.sample(indices, sampler) -> SampleHandle | Attach a sampler at output positions. |
| fwd.probe(idx, probe) -> ProbeHandle | Attach a probe at one position. |
| fwd.adapter(a) -> Forward | Apply an adapter. |
| fwd.zo_seed(seed) -> Forward | Set an Evolution Strategies seed for this pass. |
| await fwd.execute() -> Output | Run the pass. |

Inspection

| Method | Description |
| --- | --- |
| fwd.start_position() | Position the first auto-input token will occupy. Equal to ctx.seq_len at the time forward() was called. |

Output access

| Method | Returns | Use after |
| --- | --- | --- |
| out.token(h: SampleHandle) | int \| None | Single-index sampler. |
| out.tokens_at(h: SampleHandle) | list[int] | Multi-index sampler. |
| out.distribution(h) | tuple[list[int], list[float]] \| None | Distribution(...) probe. |
| out.logits(h) | bytes \| None | Logits() probe. Length vocab_size * 4, native-endian f32. |
| out.logprobs(h) | list[float] \| None | Logprob(t) or Logprobs(ts) probe. |
| out.entropy(h) | float \| None | Entropy() probe. |
| out.tokens | list[int] | Generator-accepted tokens this step (post stop / max-tokens truncation). Empty for raw Forward.execute(). |
| out.auto_sampler | SampleHandle \| None | Handle for the Generator's auto-attached sampler. None for raw Forward and after clear_sampler(). |
| out.raw | underlying WIT object | Property. The raw slot list + speculative side channel. |

Samplers

```python
from inferlet import Sampler
```

| Constructor | Description |
| --- | --- |
| Sampler.argmax() | Greedy. |
| Sampler.top_p(temperature, p) | Nucleus sampling. |
| Sampler.top_k(temperature, k) | Top-k sampling. |
| Sampler.min_p(temperature, p) | Min-p sampling. |
| Sampler.top_k_top_p(temperature, k, p) | Top-k filter, then nucleus. |
| Sampler.multinomial(temperature, draws) | Multinomial draws. |

Probes

```python
from inferlet import Logits, Distribution, Logprob, Logprobs, Entropy
```

| Probe | Output accessor | Returns |
| --- | --- | --- |
| Logits() | out.logits(h) | Native-endian f32 bytes (length vocab_size * 4). |
| Distribution(temperature, k) | out.distribution(h) | Top-k ids and probs. k=0 for full vocab. |
| Logprob(token_id) | out.logprobs(h) | Length-1 list. |
| Logprobs(token_ids) | out.logprobs(h) | Length-K list (input order). |
| Entropy() | out.entropy(h) | Shannon entropy. |

Constraints

Schema implementors

```python
from inferlet import AnyJson, JsonSchema, Regex, Ebnf
```

| Class | Description |
| --- | --- |
| AnyJson() | Any valid JSON. |
| JsonSchema(schema: str) | JSON conforming to a JSON Schema string. |
| Regex(pattern: str) | Strings matching the regex. |
| Ebnf(source: str) | Custom EBNF grammar (Lark format). |

All four are dataclasses. Pass them to ctx.generate(constrain=...) or g.constrain(...).

Custom Schema

```python
class MySchema:
    def build_constraint(self, model: Model) -> GrammarConstraint:
        return GrammarConstraint.json(model)
```

Any class with a build_constraint(model) method satisfies the Schema protocol.

Custom Constraint

```python
class MyConstraint:
    def step(self, accepted: list[int]) -> list[int]:
        return []  # no restriction this step
```

Any class with a step(accepted) method satisfies the Constraint protocol. Return [] for "no restriction this step."
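As a slightly fuller sketch, here is a constraint that forces generation to begin with a fixed token sequence. It assumes accepted is the full accepted-token history; adjust if the runtime passes only the newest batch:

```python
class ForcedPrefix:
    """Force generation to begin with a fixed token sequence, then
    impose no restriction.

    Assumes step() receives the full accepted-token history; returning
    [] means "no restriction this step".
    """

    def __init__(self, prefix: list[int]):
        self.prefix = prefix

    def step(self, accepted: list[int]) -> list[int]:
        if len(accepted) < len(self.prefix):
            # only the next forced token is allowed this step
            return [self.prefix[len(accepted)]]
        return []
```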

Grammar / GrammarConstraint / Matcher

```python
from inferlet import Grammar, GrammarConstraint, Matcher
```

| Class / method | Description |
| --- | --- |
| Grammar.from_json_schema(s) | Build from JSON Schema. |
| Grammar.json() | Free-form JSON. |
| Grammar.from_regex(p) | Regex pattern. |
| Grammar.from_ebnf(g) | EBNF (Lark) grammar. |
| GrammarConstraint.from_grammar(g, model) | Pre-compiled grammar. |
| GrammarConstraint.from_json_schema(s, model) | JSON Schema. |
| GrammarConstraint.json(model) | Free-form JSON. |
| GrammarConstraint.from_regex(p, model) | Regex. |
| GrammarConstraint.from_ebnf(g, model) | EBNF. |
| Matcher(grammar, tokenizer) | Stateful walker. |
| m.accept_tokens(ids) | Advance the matcher state. |
| m.next_token_logit_mask() | BRLE mask. |
| m.is_terminated | Whether the matcher reached a terminal state. |
| m.reset() | Reset to initial state. |

Speculative decoding

```python
from inferlet.spec import Speculator

class MySpec:
    def draft(self) -> tuple[list[int], list[int]]:
        return [], []

    def accept(self, tokens: list[int]) -> None:
        return None

    def rollback(self, n: int) -> None:
        return None

    def reset(self) -> None:
        return None
```

Any class implementing the four methods satisfies the protocol. Pass to ctx.generate(speculator=MySpec()).

system_speculation=True opts into the runtime's built-in n-gram drafter. Mutually exclusive with speculator.
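For a speculator that actually drafts something, the sketch below remembers bigrams from the accepted stream and proposes historical continuations of the last token. The meaning of draft()'s second list is not documented here; treating it as 0-based draft offsets is an assumption to verify against the runtime:

```python
class BigramSpeculator:
    """Sketch of the Speculator protocol: remember bigrams from the
    accepted stream and draft the historical continuation of the last
    token.

    The second list returned by draft() is assumed here to hold 0-based
    draft offsets; verify the expected shape against the runtime.
    """

    def __init__(self, width: int = 2):
        self.width = width            # max draft length per call
        self.history: list[int] = []  # accepted tokens, in order
        self.bigram: dict[int, int] = {}

    def draft(self) -> tuple[list[int], list[int]]:
        drafts = []
        cur = self.history[-1] if self.history else None
        while cur in self.bigram and len(drafts) < self.width:
            cur = self.bigram[cur]
            drafts.append(cur)
        return drafts, list(range(len(drafts)))

    def accept(self, tokens: list[int]) -> None:
        for t in tokens:
            if self.history:
                self.bigram[self.history[-1]] = t
            self.history.append(t)

    def rollback(self, n: int) -> None:
        if n:
            del self.history[-n:]

    def reset(self) -> None:
        self.history.clear()
```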

Adapters

```python
from inferlet import Adapter
```

| Method | Description |
| --- | --- |
| Adapter.create(model, name) -> Adapter | Create a new LoRA overlay. |
| Adapter.open(model, name) -> Adapter \| None | Open an existing one. |
| a.fork(new_name) -> Adapter | Copy under a new name. |
| a.save(path) | Serialize to disk. |
| a.load(path) | Load weights from disk. |

Adapter supports the context-manager protocol. Slot release happens on garbage collection; there is no explicit destroy().

Apply at inference: pass adapter=a to ctx.generate(...) or call fwd.adapter(a).

Decoders (parsers)

All three follow the same shape: Decoder(model) constructor, feed(tokens) returning an event, reset().

chat.Decoder

```python
from inferlet import chat

dec = chat.Decoder(model)
match dec.feed(tokens):
    case chat.Event.Delta(text=t): ...
    case chat.Event.Done(text=t): ...
    case chat.Event.Idle(): ...
    case chat.Event.Interrupt(token=tid): ...
```

| Event | Payload | Meaning |
| --- | --- | --- |
| Idle | (none) | Batch had no semantic boundary. |
| Delta(text) | text chunk | Streaming visible text. |
| Done(text) | full reply | End of turn. |
| Interrupt(token) | control token id | Template surfaced a control token. |

Helpers: chat.system(model, msg), chat.user(...), chat.assistant(...), chat.cue(model), chat.seal(model), chat.stop_tokens(model) — return token-ID lists for use with ctx.append(...).

reasoning.Decoder

```python
from inferlet import reasoning
```

| Event | Payload | Meaning |
| --- | --- | --- |
| Idle | (none) | No reasoning content this batch. |
| Start | (none) | Entering a reasoning block. |
| Delta(text) | text chunk | Reasoning text. |
| End(text) | full reasoning | Reasoning block closed. |

tools.Decoder

```python
from inferlet import tools
```

| Event | Payload | Meaning |
| --- | --- | --- |
| Start | (none) | Tool call assembling. |
| Call(name, args) | name, JSON-encoded args | Call complete. |

Helpers:

| Function | Description |
| --- | --- |
| tools.equip_prefix(model, schemas) | Tokens that register the tool schemas in the chat template. Append before your user message via Context.append. |
| tools.answer_prefix(model, name, value) | Tokens that frame a tool result for the next turn. value may be a string, dict, or list (non-strings are JSON-encoded). |
| tools.native_matcher(model, schemas) | Build a Matcher over the model's native tool-call format. Returns None if the model has no enforceable format; fall through to free-form generation plus your own parser. Wrap with GrammarConstraint to pass to Generator.constrain. |

I/O

Session

```python
from inferlet import session
```

| Function | Returns | Description |
| --- | --- | --- |
| session.send(message) |  | Send to the client. Strings go through verbatim; other types are JSON-encoded. |
| session.send_file(data: bytes) |  | Send a binary blob. |
| await session.receive() | str | Wait for the next inbound message. |
| await session.receive_file() | bytes | Wait for the next inbound file. |

Signals from process.signal(...) arrive through session.receive.

Messaging

```python
from inferlet import messaging
```

| Function | Description |
| --- | --- |
| messaging.broadcast(topic, message) | Publish to every subscriber. |
| messaging.subscribe(topic) -> Subscription | Open a subscription. |
| messaging.push(topic, message) | Push onto a queue. |
| await messaging.pull(topic) -> str | Wait for the next queued message. |

Subscription:

| Method | Description |
| --- | --- |
| await sub.next() -> str \| None | Wait for the next broadcast. None after unsubscribe. |
| sub.unsubscribe() | Drop the subscription. |
| async for msg in sub: | Async-iterable shorthand. |

MCP

```python
from inferlet import mcp
```

| Function | Returns | Description |
| --- | --- | --- |
| mcp.available_servers() | list[str] | Names of registered servers. |
| mcp.connect(name) | McpSession | Open a session. |

McpSession:

| Method | Returns | Description |
| --- | --- | --- |
| s.list_tools() | str (JSON) | Raw tools/list result. |
| s.call_tool(name, args) | str (JSON) | Raw tools/call result. |
| s.list_resources() | str (JSON) | Raw resources/list result. |
| s.read_resource(uri) | str (JSON) | Raw resources/read result. |
| s.list_prompts() | str (JSON) | Raw prompts/list result. |
| s.get_prompt(name, args) | str (JSON) | Raw prompts/get result. |

McpSession supports the context-manager protocol.

Scheduling

```python
from inferlet import scheduling
```

| Function | Returns | Description |
| --- | --- | --- |
| scheduling.balance(model) | float | Current credit balance for this inferlet. |
| scheduling.rent(ctx) | float | Clearing price from the most recent auction. |
| scheduling.dividend(model) | float | Endowment-proportional dividend last step. |
| scheduling.latency(ctx) | float | Per-tick wall time in seconds. |
| scheduling.price() | float | Cost in credits per new KV page. |

To override the default bid: ctx.set_bid(value). To skip bidding for a scope: with ctx.idle():. Both live on Context.

Filesystem and HTTP

The Python SDK does not currently expose Pie-specific HTTP or filesystem APIs. Use the standard library (open, os.makedirs) for filesystem access; paths resolve against the host-preopened /scratch directory. HTTP support is in progress; see I/O / HTTP for the intended API shape.