Writing Inferlets

Inferlets are lightweight WebAssembly programs that define custom inference logic in Pie. Instead of treating the LLM as a black-box API, inferlets let you program the serving loop itself — controlling the KV cache, sampling, branching, and more from inside the engine.

What can you build?

  • Prefix caching — cache a long system prompt's KV state and reuse it across requests, skipping redundant prefill
  • Parallel generation — fork a context into independent branches that share KV cache and generate concurrently
  • Sliding window attention — mask and evict old KV pages to bound memory for arbitrarily long generation
  • Grammar-constrained output — inject a custom sampler that masks invalid tokens, guaranteeing valid JSON or other structured formats
  • Speculative decoding — implement a drafter that proposes tokens for the model to verify in bulk
  • Tree / Skeleton of Thought — explore reasoning branches in parallel using nested forks
  • Multi-agent pipelines — coordinate multiple inferlets via broadcast/subscribe messaging
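To make the sliding-window idea above concrete, here is a standalone sketch of the eviction bookkeeping, written in plain Rust with no dependency on the inferlet API. The `PageTracker` type and its methods are illustrative names, not part of Pie; the real engine manages KV pages internally.

```rust
use std::collections::VecDeque;

/// Toy sliding-window page tracker: keeps at most `window` KV pages
/// and reports which page must be evicted as new ones are appended.
/// (Illustrative only — `PageTracker` is not part of the inferlet API.)
struct PageTracker {
    window: usize,
    pages: VecDeque<u32>, // page ids, oldest at the front
}

impl PageTracker {
    fn new(window: usize) -> Self {
        Self { window, pages: VecDeque::new() }
    }

    /// Append a page; if the window overflows, evict and return the oldest.
    fn push(&mut self, page_id: u32) -> Option<u32> {
        self.pages.push_back(page_id);
        if self.pages.len() > self.window {
            self.pages.pop_front()
        } else {
            None
        }
    }
}

fn main() {
    let mut tracker = PageTracker::new(2);
    assert_eq!(tracker.push(0), None);
    assert_eq!(tracker.push(1), None);
    assert_eq!(tracker.push(2), Some(0)); // page 0 falls out of the window
    println!("live pages: {:?}", tracker.pages);
}
```

An inferlet implementing this pattern would apply the same policy to real KV pages, masking and freeing the evicted page so memory stays bounded no matter how long generation runs.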

A taste of inferlet code

use inferlet::{Args, Result, Sampler, get_auto_model};
use inferlet::stop_condition::{self, StopCondition};

#[inferlet::main]
async fn main(mut args: Args) -> Result<()> {
    // Load whatever model the server has configured
    let model = get_auto_model();
    // Create a generation context (manages tokens, KV cache, attention)
    let mut ctx = model.create_context();

    // Build the conversation
    ctx.fill_system("You are a helpful assistant.");
    ctx.fill_user("How are you?");

    // Stop after 256 tokens or when the model emits an end-of-sequence token
    let stop = stop_condition::max_len(256)
        .or(stop_condition::ends_with_any(model.eos_tokens()));

    // Generate with nucleus sampling (temperature=0.6, top_p=0.95)
    let output = ctx.generate(Sampler::top_p(0.6, 0.95), stop).await;
    println!("{}", output);
    Ok(())
}
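Conceptually, the combined stop condition in the example — stop at a maximum length *or* on an end-of-sequence token — behaves like the following self-contained sketch. The `should_stop` function is an illustration of the semantics, not the engine's actual implementation.

```rust
/// Toy version of `max_len(..).or(ends_with_any(..))`: stop when the
/// output reaches `max_len` tokens or its last token is any EOS token.
/// (Illustrative only — `should_stop` is not part of the inferlet API.)
fn should_stop(tokens: &[u32], max_len: usize, eos: &[u32]) -> bool {
    tokens.len() >= max_len
        || tokens.last().map_or(false, |t| eos.contains(t))
}

fn main() {
    let eos = [2u32];
    assert!(!should_stop(&[5, 7], 256, &eos));      // keep generating
    assert!(should_stop(&[5, 7, 2], 256, &eos));    // hit EOS
    assert!(should_stop(&vec![9; 256], 256, &eos)); // hit max length
    println!("stop conditions behave as expected");
}
```

Because stop conditions are ordinary values combined with `or`, you can build arbitrarily specific termination logic without touching the generation loop itself.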

Inferlets compile to WebAssembly. Write them in Rust (recommended), Python, or JavaScript/TypeScript.

Next steps