Context and KV cache
Inferlets that exercise the page-based KV cache: forking from a shared prefix, building prefix trees, sliding-window attention, and the runtime's automatic page-trim.
Running on the dummy driver?
`parallel-generation` and `prefix-tree` fork from a shared prefix and rely on each branch producing a related continuation. On the dummy driver, branches share KV pages correctly, but each branch's tokens are drawn independently: the page state machine is exercised, while the branch content is unrelated random text. `windowed-attention`, `attention-sink`, and `page-trim-bench` are unaffected; they exercise mask and page-trim plumbing, not token semantics.
| Inferlet | What it shows |
|---|---|
| `parallel-generation` | Forked contexts that share committed pages and decode in parallel. |
| `prefix-tree` | Prefix-tree caching with concurrent generation from one shared context. |
| `windowed-attention` | Sliding-window attention: bounded-memory generation by masking and releasing pages. |
| `attention-sink` | Attention sink plus sliding window (StreamingLLM). |
| `page-trim-bench` | Benchmarks the runtime's page-trim optimization on a sink+window mask. |
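For intuition, here is a minimal sketch of the keep-mask shape that a sink-plus-window scheme (as in `attention-sink`) produces: the next token attends to a few initial "sink" positions plus a trailing window, and everything in between is masked. The function name and signature are illustrative, not the inferlet's actual API.

```python
def sink_window_mask(seq_len: int, sink: int, window: int) -> list[bool]:
    """Boolean keep-mask over positions 0..seq_len-1 for the next token:
    keep the first `sink` positions and the last `window` positions;
    mask everything in between (those pages become trimmable)."""
    return [i < sink or i >= seq_len - window for i in range(seq_len)]

# With 8 tokens, 2 sinks, and a window of 3, positions 2..4 are masked:
# sink_window_mask(8, 2, 3) -> [True, True, False, False, False, True, True, True]
```

The runs of `False` in the middle are what the runtime's page-trim can reclaim: once every token in a page is masked, the page no longer needs to stay resident.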
Related guides
- Pages: the page model and committed-vs-working distinction.
- Forking and saving: copy-on-write branching and named snapshots.
- Inputs: BRLE attention masks for sliding windows and sinks.
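As a rough illustration of how a sink-plus-window mask might be run-length encoded, here is a sketch of binary run-length encoding (BRLE) that emits alternating run lengths, with the first run counting kept (`True`) positions. The name and the exact convention (first-run polarity, leading-zero rule) are assumptions for illustration, not the runtime's actual wire format; see the Inputs guide for the real encoding.

```python
def brle_encode(mask: list[bool]) -> list[int]:
    """Encode a boolean mask as alternating run lengths.
    By convention (assumed here), the first run counts True bits,
    so a mask that starts with False gets a leading 0."""
    runs: list[int] = []
    expected = True  # polarity the next emitted run is assumed to have
    i = 0
    while i < len(mask):
        j = i
        while j < len(mask) and mask[j] == mask[i]:
            j += 1  # extend the current run
        if mask[i] != expected:
            runs.append(0)  # empty run to keep polarity alternating
            expected = mask[i]
        runs.append(j - i)
        expected = not expected
        i = j
    return runs

# A sink(2) + window(3) mask over 8 tokens compresses to three runs:
# brle_encode([True, True, False, False, False, True, True, True]) -> [2, 3, 3]
```

A sliding-window mask is nearly ideal for run-length encoding: it is always at most three runs (sink, gap, window) regardless of sequence length.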