Skip to main content

Context overview

A Context is the unit of conversation state inside an inferlet. It owns a token sequence, a KV cache, and an optional name. Most generation goes through one. This page explains the model. Read this after Loading and selecting models.

What a context owns

Two things, both bound to a model:

  • A token sequence. The text you've fed in (system prompt, user turns, assistant turns) plus tokens generated so far, encoded as token IDs.
  • A KV cache. The transformer's per-layer key/value tensors for those tokens, organized as fixed-size pages.

Plus a small staging buffer. When you call ctx.user("..."), the SDK appends bytes to a pending buffer; the engine sees nothing until flush() or generate() runs prefill. The pending buffer is part of the context's logical state but does not yet exist as KV.

A context is bound to one model for its lifetime. To use two models in the same inferlet, allocate one context per model.

Create one

use inferlet::{Context, model::Model, runtime, Result};

let model = Model::load(runtime::models().first().ok_or("no models")?)?;
let mut ctx = Context::new(&model)?;

A fresh context is anonymous: its KV pages are released when the inferlet exits or when the handle drops.

Fill it

A context exposes the model's chat template directly. Methods return &mut self (Rust), self (Python), or this (JS) so they chain.

ctx.system("You are a helpful assistant.")
.user("What is the capital of France?")
.cue();
MethodDescription
system(text)Append a system message.
user(text)Append a user message.
assistant(text)Append a pre-filled assistant turn.
cue()Mark the position where the model takes over.
seal()Close the current assistant turn.
append(tokens)Append raw token IDs (skip the chat template).

The chat template (role tags, BOS, EOS, thinking-mode markers) is applied automatically. If you need raw token-level control, use append(token_ids) instead.

Run something on it

A filled context goes one of two places:

Both implicitly flush any pending tokens. You rarely call flush() directly; the case where you do is when you want to commit a long prefix once and then fork from it.

Lifecycle

A context lives as long as the handle exists, plus any time it spends as a named snapshot.

  • Anonymous: no name. Pages release when the handle drops or the inferlet exits.
  • Saved: ctx.save("my-prefix") creates a snapshot. The snapshot persists past the inferlet's exit until the engine restarts or you Context::delete it.
  • Forked: ctx.fork() is a copy-on-write clone. See Forking and saving.

Inspect

A context exposes its own metrics:

let len = ctx.seq_len(); // total tokens (committed + working)
let psize = ctx.page_size(); // tokens per page

For lower-level counts (committed vs working pages, working-page token count, the pending buffer), see the SDK reference.

Next

  • Pages: the page model, prefill, and the committed-vs-working distinction.
  • Forking and saving: branch a context, snapshot a prefix, share state across runs.
  • Scheduling and budgets: the credit auction that prices forward passes.