Writing Inferlets

Inferlets are lightweight WebAssembly programs that define custom inference logic in Pie. Instead of treating the LLM as a black-box API, inferlets let you program the serving loop itself — controlling the KV cache, sampling, branching, and more from inside the engine.

What can you build?

  • Prefix caching — cache a long system prompt's KV state and reuse it across requests, skipping redundant prefill
  • Parallel generation — fork a context into independent branches that share KV cache and generate concurrently
  • Sliding window attention — mask and evict old KV pages to bound memory for arbitrarily long generation
  • Grammar-constrained output — inject a custom sampler that masks invalid tokens, guaranteeing valid JSON or other structured formats
  • Speculative decoding — implement a drafter that proposes tokens for the model to verify in bulk
  • Tree / Skeleton of Thought — explore reasoning branches in parallel using nested forks
  • Multi-agent pipelines — coordinate multiple inferlets via broadcast/subscribe messaging
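To make the sliding-window idea above concrete, here is a standalone sketch of the eviction bookkeeping, written in plain Rust with no dependency on the inferlet API. The `PageTracker` type and its methods are illustrative names, not part of Pie; the real engine manages KV pages internally.

```rust
use std::collections::VecDeque;

/// Toy sliding-window page tracker: keeps at most `window` KV pages
/// and reports which page must be evicted as new ones are appended.
/// (Illustrative only — `PageTracker` is not part of the inferlet API.)
struct PageTracker {
    window: usize,
    pages: VecDeque<u32>, // page ids, oldest at the front
}

impl PageTracker {
    fn new(window: usize) -> Self {
        Self { window, pages: VecDeque::new() }
    }

    /// Append a page; if the window overflows, evict and return the oldest.
    fn push(&mut self, page_id: u32) -> Option<u32> {
        self.pages.push_back(page_id);
        if self.pages.len() > self.window {
            self.pages.pop_front()
        } else {
            None
        }
    }
}

fn main() {
    let mut tracker = PageTracker::new(2);
    assert_eq!(tracker.push(0), None);
    assert_eq!(tracker.push(1), None);
    assert_eq!(tracker.push(2), Some(0)); // page 0 falls out of the window
    println!("live pages: {:?}", tracker.pages);
}
```

An inferlet implementing this pattern would apply the same policy to real KV pages, masking and freeing the evicted page so memory stays bounded no matter how long generation runs.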

A taste of inferlet code

use inferlet::{Args, Result, Sampler, get_auto_model};
use inferlet::stop_condition::{self, StopCondition};

#[inferlet::main]
async fn main(mut args: Args) -> Result<()> {
    // Load whatever model the server has configured
    let model = get_auto_model();
    // Create a generation context (manages tokens, KV cache, attention)
    let mut ctx = model.create_context();

    // Build the conversation
    ctx.fill_system("You are a helpful assistant.");
    ctx.fill_user("How are you?");

    // Stop after 256 tokens or when the model emits an end-of-sequence token
    let stop = stop_condition::max_len(256)
        .or(stop_condition::ends_with_any(model.eos_tokens()));

    // Generate with nucleus sampling (temperature=0.6, top_p=0.95)
    let output = ctx.generate(Sampler::top_p(0.6, 0.95), stop).await;
    println!("{}", output);
    Ok(())
}
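Conceptually, the combined stop condition in the example — stop at a maximum length *or* on an end-of-sequence token — behaves like the following self-contained sketch. The `should_stop` function is an illustration of the semantics, not the engine's actual implementation.

```rust
/// Toy version of `max_len(..).or(ends_with_any(..))`: stop when the
/// output reaches `max_len` tokens or its last token is any EOS token.
/// (Illustrative only — `should_stop` is not part of the inferlet API.)
fn should_stop(tokens: &[u32], max_len: usize, eos: &[u32]) -> bool {
    tokens.len() >= max_len
        || tokens.last().map_or(false, |t| eos.contains(t))
}

fn main() {
    let eos = [2u32];
    assert!(!should_stop(&[5, 7], 256, &eos));      // keep generating
    assert!(should_stop(&[5, 7, 2], 256, &eos));    // hit EOS
    assert!(should_stop(&vec![9; 256], 256, &eos)); // hit max length
    println!("stop conditions behave as expected");
}
```

Because stop conditions are ordinary values combined with `or`, you can build arbitrarily specific termination logic without touching the generation loop itself.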

Inferlets compile to WebAssembly. Write them in Rust (recommended), Python, or JavaScript/TypeScript.

Next steps