Forking and saving

Two operations let you reuse a context's state across branches and across runs: fork() clones in O(1), and save() snapshots under a name. This page covers both. Read this after Pages.

Running on the dummy driver?

Forks share KV pages correctly on the dummy, but each branch's next token is drawn independently — the parent and child diverge into unrelated random streams instead of producing related continuations from a shared prefix. The page-state machine is exercised faithfully; only the token semantics are absent.

Fork

ctx.fork() is a copy-on-write clone. Committed pages are shared between parent and child; only working pages and any new tokens the child writes cost extra memory.

let mut child = ctx.fork()?;
child.user("Compare Rust and Python in one sentence.").cue();
let reply = child.generate(sampler).max_tokens(256).collect_text().await?;

Pages 0-2 are committed and shared; parent and child each have their own working page.

Pages 0, 1, 2 are committed and shared read-only. Page 3' is a fresh copy of the parent's working page; divergent writes go there. The fork is O(1). Memory and compute scale with the divergent tokens, not the prompt length.

Use it for:

  • Best-of-N. Fork N times from a shared prefix and pick the best output. See the best-of-n inferlet.
  • Tree of thought. Explore continuations in parallel and prune low-scoring branches. See the tree-of-thought inferlet.
  • Tool-call retries. Fork before the call so a failure rolls back cleanly without losing the conversation.
  • Self-critique. Fork to run a critic against the same prefix without disturbing the original.
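
As a sketch of the best-of-N pattern using the API shown above — `score` is a hypothetical helper you supply, and the sampler is assumed to be cheap to clone:

```rust
// Best-of-N sketch: fork N branches from the shared prefix, generate in
// parallel, keep the highest-scoring reply. Each fork is O(1); only the
// divergent tokens cost memory. `score` is a hypothetical scoring helper,
// not part of the API shown on this page.
let n = 4;
let mut branches = Vec::new();
for _ in 0..n {
    let mut child = ctx.fork()?; // committed pages stay shared
    child.user("Compare Rust and Python in one sentence.").cue();
    let sampler = sampler.clone(); // assumes the sampler is Clone
    branches.push(async move {
        child.generate(sampler).max_tokens(256).collect_text().await
    });
}
let replies = futures::future::try_join_all(branches).await?;
let best = replies.into_iter().max_by(|a, b| score(a).total_cmp(&score(b)));
```

The same shape works for tree of thought: score intermediate branches, drop the losers, and fork again from the survivors.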

Save and reuse

Naming a context keeps its KV pages alive past the inferlet's exit. Other inferlets, or future runs of the same one, reattach by name. The most common use is prefix caching: compute a long shared prompt once, save it, and fork from the saved snapshot for each request.

// One-time setup: compute and cache the system prompt.
let mut prefix = Context::new(&model)?;
prefix.system("You are a helpful assistant. Answer with concise technical detail.");
prefix.flush().await?;
prefix.save("system-prompt-v1")?;

// Per-request:
let prefix = Context::open(&model, "system-prompt-v1")?;
let mut ctx = prefix.fork()?;

ctx.user("New user question").cue();
let resp = ctx.generate(sampler).collect_text().await?;

A saved snapshot named system-prompt-v1 with two requests forking from it.

Context::open clones the snapshot. The snapshot stays. Each request gets its own fork.
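
Because open never consumes the snapshot, a single opened handle can serve many requests; a minimal sketch, with the questions and loop structure invented for illustration:

```rust
// Sketch: open the snapshot once, fork per incoming request.
// The saved snapshot is never mutated; each request gets a private branch.
let prefix = Context::open(&model, "system-prompt-v1")?;
for question in ["What is a KV page?", "What does fork() share?"] {
    let mut ctx = prefix.fork()?; // per-request branch, O(1)
    ctx.user(question).cue();
    let resp = ctx.generate(sampler.clone()).collect_text().await?;
    println!("{resp}");
}
```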

The four operations on snapshots

  • ctx.save(name): snapshot the current context under name. The snapshot is immutable.
  • Context::open(model, name): open a snapshot. Equivalent to a fork from the snapshot; the snapshot stays.
  • Context::take(model, name): open a snapshot and remove it from the registry in one step.
  • Context::delete(model, name): drop a snapshot.

take is useful when you know exactly one inferlet will use the snapshot; it avoids leaving the snapshot lying around. delete is the cleanup primitive.
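
A sketch of the take pattern for a one-shot hand-off — the snapshot name and prompt text here are illustrative, not prescribed:

```rust
// Producer inferlet: compute a prefix and publish it once.
let mut prefix = Context::new(&model)?;
prefix.system("You are a helpful assistant.");
prefix.flush().await?;
prefix.save("handoff-prefix")?;

// Consumer inferlet: claim the snapshot and remove it from the registry
// in one step, so nothing is left behind after this run.
let mut ctx = Context::take(&model, "handoff-prefix")?;
ctx.user("First question").cue();
let resp = ctx.generate(sampler).collect_text().await?;
```

If a second consumer might race for the same snapshot, prefer open plus an explicit delete once you know the snapshot is no longer needed.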

Saved snapshots persist across inferlet runs as long as the engine is up. They do not survive a server restart. For persistent state across restarts, write to the filesystem or to an external store.

Persistent named contexts

A second pattern: keep a context alive across many inferlet runs as a session. The first run creates and saves; each subsequent run opens, appends a turn, generates, saves again.

let mut ctx = match Context::open(&model, "session-alice") {
    Ok(c) => c,
    Err(_) => {
        // First run: no saved session yet, so build it from scratch.
        let mut c = Context::new(&model)?;
        c.system("You are alice's assistant.");
        c.flush().await?;
        c
    }
};

ctx.user(&input.message).cue();
let reply = ctx.generate(sampler).collect_text().await?;

ctx.save("session-alice")?;

The session's KV cache stays warm between runs. The next user turn adds a few hundred tokens of prefill on top of the existing pages instead of re-prefilling the whole conversation.

When to fork vs. when to save

  • Branching within a single run: fork()
  • Prompt prefix shared across requests: save() once, open().fork() per request
  • User session that survives across requests: save() per turn, open() next turn
  • One-shot, no sharing: neither; just use the context directly

Forking is O(1) and is the right answer for branching workflows within a run. Saving is for state that must outlive the run.
