Context overview
A Context is the unit of conversation state inside an inferlet. It owns a token sequence, a KV cache, and an optional name. Most generation goes through one. This page explains the model. Read this after Loading and selecting models.
What a context owns
Two things, both bound to a model:
- A token sequence. The text you've fed in (system prompt, user turns, assistant turns) plus tokens generated so far, encoded as token IDs.
- A KV cache. The transformer's per-layer key/value tensors for those tokens, organized as fixed-size pages.
Plus a small staging buffer. When you call `ctx.user("...")`, the SDK appends bytes to a pending buffer; the engine sees nothing until `flush()` or `generate()` runs prefill. The pending buffer is part of the context's logical state but does not yet exist as KV.
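The staging behavior can be sketched with a toy model (the class, its fields, and the word-per-token "tokenizer" are all hypothetical, for illustration only; the real SDK's internals differ):

```python
class ToyContext:
    """Toy model of a context's token state: committed tokens that
    already exist as KV, plus a pending buffer the engine has not seen."""

    def __init__(self):
        self.committed = []  # tokens already prefilled into the KV cache
        self.pending = []    # staged tokens; no KV exists for these yet

    def user(self, text):
        # Stand-in tokenizer: one "token" per whitespace-separated word.
        self.pending.extend(text.split())
        return self          # chainable, like the real builder methods

    def flush(self):
        # Prefill: pending tokens become committed KV state.
        self.committed.extend(self.pending)
        self.pending.clear()
        return self

ctx = ToyContext()
ctx.user("What is the capital of France?")
assert ctx.committed == []   # the engine has seen nothing yet
ctx.flush()
assert ctx.pending == []     # buffer drained into committed state
```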
A context is bound to one model for its lifetime. To use two models in the same inferlet, allocate one context per model.
Create one
- Rust

```rust
use inferlet::{Context, model::Model, runtime, Result};

let model = Model::load(runtime::models().first().ok_or("no models")?)?;
let mut ctx = Context::new(&model)?;
```

- Python

```python
from inferlet import Context, Model, runtime

model = Model.load(runtime.models()[0])
ctx = Context(model)
```

- JavaScript

```javascript
import { Context, Model, runtime } from 'inferlet';

const model = Model.load(runtime.models()[0]);
const ctx = new Context(model);
```
A fresh context is anonymous: its KV pages are released when the inferlet exits or when the handle drops.
Fill it
A context exposes the model's chat template directly. Methods return `&mut self` (Rust), `self` (Python), or `this` (JS) so they chain.
- Rust

```rust
ctx.system("You are a helpful assistant.")
    .user("What is the capital of France?")
    .cue();
```

- Python

```python
(ctx
    .system("You are a helpful assistant.")
    .user("What is the capital of France?")
    .cue())
```

- JavaScript

```javascript
ctx.system('You are a helpful assistant.')
    .user('What is the capital of France?')
    .cue();
```
| Method | Description |
|---|---|
| `system(text)` | Append a system message. |
| `user(text)` | Append a user message. |
| `assistant(text)` | Append a pre-filled assistant turn. |
| `cue()` | Mark the position where the model takes over. |
| `seal()` | Close the current assistant turn. |
| `append(tokens)` | Append raw token IDs (skip the chat template). |
The chat template (role tags, BOS, EOS, thinking-mode markers) is applied automatically. If you need raw token-level control, use append(token_ids) instead.
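To see what "applied automatically" means, here is a toy rendering of how role methods plus `cue()` could map to a tagged token stream. The tag syntax is invented for illustration; real models each define their own template:

```python
def render(messages, cued):
    """Toy chat-template renderer: wrap each (role, text) pair in
    role tags, then open an assistant turn if the context is cued."""
    parts = []
    for role, text in messages:
        parts.append(f"<|{role}|>{text}<|end|>")
    if cued:
        parts.append("<|assistant|>")  # generation starts after this tag
    return "".join(parts)

msgs = [("system", "Be brief."), ("user", "Hi")]
print(render(msgs, cued=True))
# <|system|>Be brief.<|end|><|user|>Hi<|end|><|assistant|>
```

This is why `append(token_ids)` exists: it bypasses the rendering step entirely and writes token IDs as-is.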
Run something on it
A filled context goes one of two places:
- `ctx.generate(...)`: the high-level autoregressive loop. See Generation overview.
- `ctx.forward()`: a single forward pass. See The forward pass.
Both implicitly flush any pending tokens, so you rarely call `flush()` directly; the main reason to do so is to commit a long prefix once and then fork from it.
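The "commit once, fork many" pattern can be sketched with a toy copy-on-write model (class and method names are hypothetical; the real page machinery lives in the engine):

```python
class ToyFork:
    """Toy copy-on-write fork: committed pages are shared by reference,
    while each fork appends to its own private working pages."""

    def __init__(self, committed=None):
        self.committed = committed if committed is not None else []
        self.working = []  # private to this context

    def flush(self, tokens):
        self.committed.append(tuple(tokens))  # commit a page of KV once
        return self

    def fork(self):
        return ToyFork(committed=self.committed)  # share, don't copy

    def append(self, tokens):
        self.working.append(tuple(tokens))  # fork-local state
        return self

base = ToyFork().flush([1, 2, 3, 4])  # pay prefill for the prefix once
a = base.fork().append([5, 6])
b = base.fork().append([7, 8])
assert a.committed is b.committed     # prefix pages shared, not copied
assert a.working != b.working         # branches diverge independently
```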
Lifecycle
A context lives as long as the handle exists, plus any time it spends as a named snapshot.
- Anonymous: no name. Pages release when the handle drops or the inferlet exits.
- Saved: `ctx.save("my-prefix")` creates a snapshot. The snapshot persists past the inferlet's exit until the engine restarts or you `Context::delete` it.
- Forked: `ctx.fork()` is a copy-on-write clone. See Forking and saving.
Inspect
A context exposes its own metrics:
- Rust

```rust
let len = ctx.seq_len();     // total tokens (committed + working)
let psize = ctx.page_size(); // tokens per page
```

- Python

```python
len_ = ctx.seq_len    # total tokens (committed + working)
psize = ctx.page_size # tokens per page
```

- JavaScript

```javascript
const len = ctx.seqLen;     // total tokens (committed + working)
const psize = ctx.pageSize; // tokens per page
```
For lower-level counts (committed vs working pages, working-page token count, the pending buffer), see the SDK reference.
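Since pages are fixed-size, the two metrics above give a quick upper bound on page usage (a sketch under the assumption that the context's tokens pack pages densely; exact accounting is in the SDK reference):

```python
import math

def pages_needed(seq_len, page_size):
    # A sequence of seq_len tokens spans ceil(seq_len / page_size)
    # fixed-size KV pages; the last page may be only partially filled.
    return math.ceil(seq_len / page_size)

print(pages_needed(100, 16))  # 7 pages: 6 full plus 1 holding 4 tokens
```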
Next
- Pages: the page model, prefill, and the committed-vs-working distinction.
- Forking and saving: branch a context, snapshot a prefix, share state across runs.
- Scheduling and budgets: the credit auction that prices forward passes.