
Core Concepts

This page covers the mental model behind Pie inferlets: how contexts work, how forking enables parallel generation, how to control the KV cache, and how to customize sampling and decoding.

Context Lifecycle

Every interaction with a model flows through a context — a stateful object that holds a token buffer and a KV cache. The lifecycle follows three phases:

  1. Fill: fill_system(), fill_user(), and fill() append tokens to a pending buffer. Nothing is computed yet.
  2. Flush: flush() runs prefill on the pending tokens, populating the KV cache. This is the expensive step.
  3. Generate: generate() runs the autoregressive decode loop, repeatedly decoding one token, sampling it, and checking stop conditions.

let mut ctx = model.create_context();
ctx.fill_system("System prompt"); // Tokens buffered
ctx.fill_user("User message"); // More tokens buffered
ctx.flush().await; // Prefill: tokens processed, KV cache populated
let output = ctx.generate(sampler, stop).await; // Decode loop

Note: generate() processes any pending tokens as part of its decode loop, so you don't need to call flush() before generate(). However, an explicit flush() is needed when you want to commit tokens to the KV cache before forking or exporting KV state.

Forking (Copy-on-Write Contexts)

fork() creates a child context that shares the parent's KV cache via copy-on-write. This lets you branch a conversation without duplicating compute.

Key points:

  • Call flush() before fork() to commit tokens to KV cache — this becomes the shared prefix.
  • Each fork gets its own view. Filling or generating in one fork does not affect others.
  • Multiple forks from the same parent incur zero extra prefill cost for the shared prefix.
let mut common = model.create_context();
common.fill_system("You are a helpful assistant.");
common.flush().await; // Commit shared prefix to KV cache

let mut ctx1 = common.fork(); // Shares KV cache with common
let mut ctx2 = common.fork(); // Also shares — no extra prefill

ctx1.fill_user("Question A");
ctx2.fill_user("Question B");
// Both generate concurrently, sharing the system prompt KV cache
future::join(
    ctx1.generate(sampler, stop),
    ctx2.generate(sampler, stop),
).await;

See also: parallel generation and tree-of-thought examples.

KV Cache Control

Beyond the automatic KV management that contexts provide, Pie gives you explicit control over cache persistence and memory.

Export / Import

You can export KV pages after a flush() and reimport them in later requests. This avoids re-prefilling expensive system prompts across calls.

// First request: compute and cache the system prompt
let mut ctx = model.create_context();
ctx.fill_system(&long_system_prompt);
ctx.flush().await;

// Export KV pages for later reuse
ctx.queue().export_kv_pages(&ctx.kv_pages, "my-prefix");

// Save metadata to persistent key-value store
inferlet::store_set("prefix_state", &serialized_state);

// --- Later requests ---

// Restore from cache (no re-prefill needed!)
let pages = queue.import_kv_pages("my-prefix");
let mut ctx = Context::from_imported_state(&model, pages, token_ids, last_len);
ctx.fill_user("New question"); // Only this needs prefill
let output = ctx.generate(sampler, stop).await;

Masking / Eviction

For long-running conversations, you can mask tokens so the model ignores them in attention, then free the memory those KV pages occupied.

// Sliding window: keep initial sink tokens + recent window, evict the rest
let committed_len = ctx.token_ids.len();
if committed_len > sink_size + window_size {
    let evict_start = sink_size;
    let evict_end = committed_len - window_size;
    ctx.mask_token_range(evict_start, evict_end, true); // Mask old tokens
    ctx.drop_masked_kv_pages();                         // Free their memory
}

  • mask_token_range(start, end, masked) — marks tokens so the model ignores them in attention.
  • drop_masked_kv_pages() — frees memory for fully-masked KV pages.
  • Together, these enable bounded-memory generation for arbitrarily long sequences.

See also: prefix caching and attention sink examples.

Custom Sampling & Decoding

Built-in Samplers

Pie ships with common sampling strategies:

Sampler::greedy()                                // Always pick highest probability
Sampler::top_p(temperature, top_p)               // Nucleus sampling
Sampler::top_k(temperature, top_k)               // Top-K sampling
Sampler::min_p(temperature, min_p)               // Min-P sampling
Sampler::top_k_top_p(temperature, top_k, top_p)  // Combined Top-K + Top-P
Sampler::reasoning()                             // Preset: top_k_top_p(0.6, 20, 0.95)
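
For example, a nucleus sampler can be paired with a stop condition and passed straight to generate(); the parameter values here are illustrative, and stop is any stop condition such as those shown under Stop Conditions below:

let sampler = Sampler::top_p(0.7, 0.9); // temperature 0.7, nucleus threshold 0.9
let output = ctx.generate(sampler, stop).await;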

Custom Sampler

Implement the Sample trait to inject your own logic — token masking, grammar constraints, classifier-free guidance, etc.:

struct MySampler;

impl Sample for MySampler {
    fn sample(&self, ids: &[u32], probs: &[f32]) -> u32 {
        // Custom logic: mask tokens, adjust probabilities, etc.
        ids[0] // placeholder
    }
}

let sampler = Sampler::Custom {
    temperature: 0.0,
    sampler: Box::new(MySampler),
};

Stop Conditions

Stop conditions are composable with .or():

let stop = stop_condition::max_len(256)
    .or(stop_condition::ends_with_any(model.eos_tokens()));

Low-Level Decode

For fine-grained control, decode_step_dist() returns the full probability distribution before sampling:

let dist = ctx.decode_step_dist().await;
// dist.ids: Vec<u32> — token IDs
// dist.probs: Vec<f32> — corresponding probabilities

This is useful for implementing custom decoding strategies like best-of-N, beam search, or output validation.
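
As a small illustration of output validation, the snippet below reads the distribution directly instead of sampling. It relies only on the dist.ids and dist.probs fields shown above; yes_token_id is a hypothetical token ID you would look up through the model's tokenizer:

// Probability the model assigns to a specific token (e.g., a "yes" answer).
// yes_token_id is a placeholder obtained from the tokenizer.
let p_yes = dist.ids.iter()
    .position(|&id| id == yes_token_id)
    .map(|i| dist.probs[i])
    .unwrap_or(0.0);

// Pick the argmax token manually instead of delegating to a Sampler.
let (best_idx, _) = dist.probs.iter().enumerate()
    .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
    .unwrap();
let best_token = dist.ids[best_idx];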

Speculative Decoding

Implement the Drafter trait to propose multiple tokens at once, then let the model verify them in a single forward pass:

impl Drafter for MyDrafter {
    fn update(&mut self, context: &[u32]) { /* learn from accepted tokens */ }
    fn draft(&mut self) -> (Vec<u32>, Vec<u32>) { /* propose tokens + positions */ }
}

let output = ctx.generate_with_drafter(&mut drafter, &mut sampler, &mut stop, None).await;
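
The trait itself does not prescribe how drafts are produced; any heuristic can serve. Below is a deliberately minimal sketch of a drafter that proposes a fixed canned continuation. It assumes, beyond what is stated above, that update() receives the accepted token history and that the second vector returned by draft() holds the absolute positions the proposed tokens would occupy:

// Hypothetical minimal drafter: always proposes the same canned tokens.
struct CannedDrafter {
    canned: Vec<u32>,   // token IDs to propose on every step
    context_len: usize, // number of tokens accepted so far
}

impl Drafter for CannedDrafter {
    fn update(&mut self, context: &[u32]) {
        // Assumption: `context` is the accepted token history; remember its length.
        self.context_len = context.len();
    }

    fn draft(&mut self) -> (Vec<u32>, Vec<u32>) {
        // Assumption: positions are absolute indices continuing from the accepted context.
        let positions = (0..self.canned.len())
            .map(|i| (self.context_len + i) as u32)
            .collect();
        (self.canned.clone(), positions)
    }
}

A practical drafter would derive its proposals from the history seen in update(), for example by matching recent n-grams or consulting a smaller draft model.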

See also: constrained decoding, cacheback decoding, and output validation examples.