Context and KV cache

Inferlets that exercise the page-based KV cache — forking from a shared prefix, building prefix trees, sliding-window attention, and the runtime's automatic page-trim.

Running on the dummy driver?

parallel-generation and prefix-tree fork from a shared prefix and rely on each branch producing a related continuation. On the dummy driver, branches share KV pages correctly, but each branch's tokens are drawn independently: the page state machine is exercised, while the branch content is unrelated random text. windowed-attention, attention-sink, and page-trim-bench are unaffected; they exercise mask and page-trim plumbing, not token semantics.

| Inferlet | What it shows |
| --- | --- |
| parallel-generation | Forked contexts that share committed pages and decode in parallel. |
| prefix-tree | Prefix-tree caching with concurrent generation from one shared context. |
| windowed-attention | Sliding window: bounded-memory generation by masking and releasing pages. |
| attention-sink | Sink + sliding window (StreamingLLM). |
| page-trim-bench | Benchmarks the runtime's page-trim optimization on a sink+window mask. |
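To make the fork semantics that parallel-generation and prefix-tree exercise more concrete, here is a minimal, self-contained Python sketch of copy-on-write page sharing. It is not the inferlet API: every name (Page, Context, fork, PAGE_SIZE) is an illustrative assumption. It models the committed-vs-working distinction: committed pages are read-only and shared by reference count across forks, while each branch writes into its own working page.

```python
# Illustrative sketch only; these classes are NOT the inferlet API.
from dataclasses import dataclass, field

PAGE_SIZE = 4  # tokens per KV page (toy value)

@dataclass
class Page:
    tokens: list = field(default_factory=list)
    refcount: int = 1  # how many contexts reference this page

class Context:
    def __init__(self, committed=None):
        self.committed = committed or []  # immutable, shareable pages
        self.working = Page()             # private, mutable tail page

    def append(self, token):
        # Working pages are private to one branch, so appending
        # never mutates a shared page.
        if len(self.working.tokens) == PAGE_SIZE:
            self.commit()
        self.working.tokens.append(token)

    def commit(self):
        # A full working page becomes committed: read-only, shareable.
        self.committed.append(self.working)
        self.working = Page()

    def fork(self):
        # Copy-on-write: the child shares every committed page
        # (bumping refcounts) but gets a private working page.
        for page in self.committed:
            page.refcount += 1
        child = Context(committed=list(self.committed))
        # The partially filled working page is copied by value,
        # since both branches may extend it differently.
        child.working.tokens = list(self.working.tokens)
        return child

    def release(self):
        # Dropping a branch decrements refcounts; a page can be
        # freed once no context references it.
        for page in self.committed:
            page.refcount -= 1

root = Context()
for t in [1, 2, 3, 4, 5]:
    root.append(t)
a, b = root.fork(), root.fork()
a.append(6); b.append(7)  # branches diverge in private pages
assert root.committed[0] is a.committed[0] is b.committed[0]
print([p.refcount for p in root.committed])  # -> [3]
```

The real runtime adds pieces this toy omits (named snapshots, page eviction, the dummy driver's token sampling), but the shared-prefix refcounting above is the shape of behavior the two forking inferlets depend on.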
  • Pages: the page model and committed-vs-working distinction.
  • Forking and saving: copy-on-write branching and named snapshots.
  • Inputs: BRLE attention masks for sliding windows and sinks (see the mask sketch after this list).
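The sink+window masks used by attention-sink and page-trim-bench can be pictured with a short sketch. The alternating-runs convention below (the first run counts zeros) is an assumption made for illustration; the actual BRLE layout is defined on the Inputs page.

```python
# Hedged sketch: the first-run-counts-zeros convention is assumed,
# not taken from the Inputs page.

def sink_window_brle(pos, sink, window):
    """BRLE mask over positions 0..pos for one query at `pos`:
    attend to the first `sink` tokens and the last `window` tokens,
    mask everything in between. Runs alternate 0s, 1s, 0s, 1s."""
    length = pos + 1
    keep_tail = min(window, length)
    keep_head = min(sink, length - keep_tail)
    gap = length - keep_head - keep_tail
    # runs: [zeros, ones, zeros, ones]
    return [0, keep_head, gap, keep_tail]

def expand(brle):
    """Decode alternating-run BRLE back to an explicit 0/1 list."""
    bits, bit = [], 0
    for run in brle:
        bits.extend([bit] * run)
        bit ^= 1
    return bits

# Query at position 9 with a 2-token sink and a 4-token window:
# tokens 0-1 and 6-9 are visible, 2-5 are masked. Pages covered
# only by the masked middle are what page-trim can skip.
print(sink_window_brle(9, sink=2, window=4))  # [0, 2, 4, 4]
print(expand(sink_window_brle(9, 2, 4)))      # [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
```

The payoff of run-length encoding here is that a sink+window mask stays four runs long no matter how far generation proceeds, which is also the pattern page-trim-bench feeds the runtime.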