# Writing Inferlets
Inferlets are lightweight WebAssembly programs that define custom inference logic in Pie. Instead of treating the LLM as a black-box API, inferlets let you program the serving loop itself — controlling the KV cache, sampling, branching, and more from within the engine.
## What can you build?
- Prefix caching — cache a long system prompt's KV state and reuse it across requests, skipping redundant prefill
- Parallel generation — fork a context into independent branches that share KV cache and generate concurrently
- Sliding window attention — mask and evict old KV pages to bound memory for arbitrarily long generation
- Grammar-constrained output — inject a custom sampler that masks invalid tokens, guaranteeing valid JSON or other structured formats
- Speculative decoding — implement a drafter that proposes tokens for the model to verify in bulk
- Tree / Skeleton of Thought — explore reasoning branches in parallel using nested forks
- Multi-agent pipelines — coordinate multiple inferlets via broadcast/subscribe messaging
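To make the sliding-window idea above concrete, here is a minimal, self-contained sketch in plain Rust — not the inferlet API. The names (`SlidingWindow`, `push_page`) are invented for illustration; the point is the bookkeeping: a context keeps at most `window` live KV pages, and the oldest page is evicted as generation proceeds.

```rust
use std::collections::VecDeque;

/// Illustrative only: tracks which KV pages back the attention window.
struct SlidingWindow {
    window: usize,        // maximum number of live KV pages
    pages: VecDeque<u32>, // page IDs currently in the window, oldest first
}

impl SlidingWindow {
    fn new(window: usize) -> Self {
        Self { window, pages: VecDeque::new() }
    }

    /// Record a newly allocated KV page; returns the page to evict, if any.
    fn push_page(&mut self, page_id: u32) -> Option<u32> {
        self.pages.push_back(page_id);
        if self.pages.len() > self.window {
            self.pages.pop_front() // oldest page falls out of the window
        } else {
            None
        }
    }
}

fn main() {
    let mut w = SlidingWindow::new(3);
    let mut evicted = Vec::new();
    for page in 0..5u32 {
        if let Some(old) = w.push_page(page) {
            evicted.push(old);
        }
    }
    // With a window of 3 pages and 5 allocations, pages 0 and 1 age out.
    println!("{:?}", evicted);
}
```

In a real inferlet, the eviction decision would be paired with the engine's page-masking and deallocation calls; this sketch only shows the window arithmetic that bounds memory for arbitrarily long generation.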
## A taste of inferlet code
```rust
use inferlet::{Args, Result, Sampler, get_auto_model};
use inferlet::stop_condition::{self, StopCondition};

#[inferlet::main]
async fn main(mut args: Args) -> Result<()> {
    // Load whatever model the server has configured
    let model = get_auto_model();

    // Create a generation context (manages tokens, KV cache, attention)
    let mut ctx = model.create_context();

    // Build the conversation
    ctx.fill_system("You are a helpful assistant.");
    ctx.fill_user("How are you?");

    // Stop after 256 tokens or when the model emits an end-of-sequence token
    let stop = stop_condition::max_len(256)
        .or(stop_condition::ends_with_any(model.eos_tokens()));

    // Generate with nucleus sampling (temperature=0.6, top_p=0.95)
    let output = ctx.generate(Sampler::top_p(0.6, 0.95), stop).await;
    println!("{}", output);

    Ok(())
}
```
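The `Sampler::top_p` call above selects via nucleus sampling. As a rough, self-contained illustration of the idea — plain Rust, independent of Pie's actual implementation, with an invented `top_p_filter` name — the core filtering step keeps the smallest set of tokens, taken in descending probability order, whose cumulative probability reaches `top_p`:

```rust
/// Sketch of the nucleus-sampling filter: return the indices of the
/// smallest set of highest-probability tokens whose cumulative
/// probability is at least `top_p`. Not Pie's implementation.
fn top_p_filter(probs: &[f32], top_p: f32) -> Vec<usize> {
    let mut order: Vec<usize> = (0..probs.len()).collect();
    // Sort token indices by probability, highest first.
    order.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

    let mut kept = Vec::new();
    let mut cumulative = 0.0;
    for &i in &order {
        kept.push(i);
        cumulative += probs[i];
        if cumulative >= top_p {
            break; // the nucleus is complete
        }
    }
    kept
}

fn main() {
    // A toy distribution over 4 tokens.
    let probs = [0.5, 0.3, 0.15, 0.05];
    // With top_p = 0.8, tokens 0 and 1 (0.5 + 0.3) form the nucleus.
    println!("{:?}", top_p_filter(&probs, 0.8));
}
```

A full sampler would then renormalize the kept probabilities (after temperature scaling) and draw one token from them; a grammar-constrained sampler, as mentioned earlier, would additionally zero out tokens the grammar forbids before this step.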
Inferlets compile to WebAssembly, so you can write them in Rust (recommended), Python, or JavaScript/TypeScript.