What is Pie?
Pie is an LLM serving system. A serving system is the software that sits between your inference requests and the model's parameters on a GPU: it loads weights, batches forward passes, manages the KV cache, and exposes an interface to clients. Pie does the same, with one difference. Instead of accepting prompts and returning tokens through a fixed endpoint, Pie accepts programs that the engine runs next to the model. This page explains why that matters.
Pie is a research prototype under active development. APIs and features change without notice.
Why a new serving system?
Many serving systems already exist: vLLM, SGLang, and TensorRT-LLM, among others. These systems treat the model as a black box: the client sends a prompt, the engine returns tokens, and anything richer than that lives in a separate process on the client side.
However, three problems show up in practice:
- The KV cache is invisible to the application. A multi-turn agent can end up re-prefilling the same prefix between tool calls. In multi-agent systems with shared context, the KV cache can thrash because the engine is oblivious to the dependencies between cached entries. The engine alone cannot make optimal KV cache decisions.
- Tool calls are network round trips. Agentic systems generate tool requests, send them back to the client, wait for results, then resume inference. Every iteration of the agent loop pays for a round trip.
- Custom decoding requires modifying the engine. Speculative decoding, constrained decoding, and custom samplers cannot be expressed through the API. To add one, you patch the serving system itself.
These problems are well-known. They have been addressed in piecemeal ways, such as explicit prompt-caching APIs and tool-call APIs that allow limited in-engine execution. In Pie, the engine exposes its internal state through a small set of primitives, and the application uses those primitives to implement whatever logic it needs.
For interested readers, the motivation is described in depth in Pie: a programmable serving system for emerging LLM applications (SOSP '25) and Serve programs, not prompts (HotOS '25).
Pie serves programs, not prompts
Pie's primitive is the inferlet: a small program you write in Rust, Python, or TypeScript that compiles to WebAssembly and runs inside the engine. The inferlet has direct access to the KV cache, the token stream, and the forward pass. The client launches an inferlet by name; the inferlet runs to completion; tokens and events stream back through it.
Below is a short inferlet that runs a tool-calling agent loop entirely inside the engine. The Examples page lists more inferlets.
use inferlet::{Context, Result, model::Model, runtime, sample::Sampler, tool, tools};

/// Search the web for current information.
#[tool]
async fn web_search(query: String) -> Result<String> {
    // Real implementations hit MCP, HTTP, or the inferlet's own state.
    Ok(format!("(stub result for: {query})"))
}

#[inferlet::main]
async fn main(prompt: String) -> Result<String> {
    let model = Model::load(runtime::models().first().unwrap())?;
    let mut ctx = Context::new(&model)?;
    ctx.system("Use web_search if you need fresh facts, then answer.")
        .equip(&[&web_search])?
        .user(&prompt);

    loop {
        // Stream tokens through tools::Decoder so the loop aborts the
        // moment the model emits a tool-call structure. No wasted
        // tokens past it.
        let mut tdec = tools::Decoder::new(&model);
        let mut full = Vec::new();
        let call = {
            let mut g = ctx.generate(Sampler::Argmax).max_tokens(512);
            loop {
                let Some(step) = g.next()? else { break None };
                let out = step.execute().await?;
                full.extend_from_slice(&out.tokens);
                if let tools::Event::Call(name, args) = tdec.feed(&out.tokens)? {
                    break Some((name, args));
                }
            }
        };

        let Some((name, args)) = call else {
            return Ok(model.tokenizer().decode(&full)?);
        };

        // Hint to the engine that this inferlet is not using the KV
        // cache. Under contention, the engine evicts our pages for a
        // peer and restores them when the idle scope ends.
        let result = {
            let _idle = ctx.idle();
            match name.as_str() {
                "web_search" => web_search::call(&args).await?,
                _ => return Err(format!("unknown tool: {name}")),
            }
        };

        ctx.append(&tools::answer_prefix(&model, &name, &result));
    }
}
What Pie brings
The example above runs end-to-end inside the engine, without client round trips or engine patches. Three things follow from that:
- Application-specific optimization. The engine sees a stream of forward passes and doesn't know which KV pages will be reused. The inferlet does. A multi-turn agent pins a shared prefix across turns; a planner with cheap-to-recompute branches releases its pages under contention (see the first sketch after this list). Placement, prefetching, and eviction are policy decisions, and the right policy depends on what the application is doing.
- Programmatic model control. Speculative decoding, watermarking, constrained decoding, custom samplers, and adapter composition are loops over the forward pass and the token stream. You write them as inferlets (see the second sketch after this list). Tool calls run in the same process as decoding, so an agent loop completes without round-tripping to a client.
- Less engineering. The engine and the application logic evolve independently. The engine exposes a fixed set of primitives, and new behavior is built on top of them rather than patched into the engine. Agent loops, KV cache policy, decoding strategy, and tool execution live in one inferlet instead of separate processes.
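
To make the first point concrete, here is a minimal sketch of a multi-turn agent that keeps a single Context alive for the whole session, so the system prefix is prefilled once and its KV pages are reused on every turn, and that wraps the wait for the next user message in an idle scope so the engine may evict its pages under contention. It reuses the API surface from the example above, taken loosely; runtime::next_user_turn is a hypothetical stand-in for however the inferlet receives follow-up input.

use inferlet::{Context, Result, model::Model, runtime, sample::Sampler};

#[inferlet::main]
async fn main(prompt: String) -> Result<String> {
    let model = Model::load(runtime::models().first().unwrap())?;

    // The system prefix is prefilled once. Because this Context lives
    // for the whole session, its KV pages are reused on every turn
    // instead of being re-prefilled.
    let mut ctx = Context::new(&model)?;
    ctx.system("You are a concise assistant.");

    let mut next_msg = prompt;
    let mut last_answer = String::new();
    loop {
        ctx.user(&next_msg);

        // Decode this turn on top of the cached prefix.
        let mut tokens = Vec::new();
        let mut g = ctx.generate(Sampler::Argmax).max_tokens(256);
        while let Some(step) = g.next()? {
            let out = step.execute().await?;
            tokens.extend_from_slice(&out.tokens);
        }
        last_answer = model.tokenizer().decode(&tokens)?;

        // Waiting on the user is dead time for the GPU. Inside the
        // idle scope the engine may evict this inferlet's pages for a
        // peer and restore them when the scope ends.
        next_msg = {
            let _idle = ctx.idle();
            // Hypothetical call: receive the next user turn (not part
            // of the example above).
            runtime::next_user_turn().await?
        };
        if next_msg.is_empty() {
            return Ok(last_answer);
        }
    }
}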
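
The second point is easiest to see with a decoding strategy that a prompt-in, tokens-out API cannot express directly: draw several candidates and return only the best one, entirely in-engine. The sketch below again reuses the example's API surface loosely; the stochastic Sampler::TopP variant is an assumption (only Sampler::Argmax appears above), and the selection rule is a deliberately trivial placeholder for a real verifier or voting step.

use inferlet::{Context, Result, model::Model, runtime, sample::Sampler};

#[inferlet::main]
async fn main(prompt: String) -> Result<String> {
    let model = Model::load(runtime::models().first().unwrap())?;

    let mut best: Option<String> = None;
    for _ in 0..4 {
        // Each candidate gets its own context; all four decodes stay
        // inside the engine, so the client only sees the final answer.
        let mut ctx = Context::new(&model)?;
        ctx.system("Answer the question and end with 'ANSWER: <value>'.")
            .user(&prompt);

        let mut tokens = Vec::new();
        // Assumption: a stochastic sampler variant, so candidates differ.
        let mut g = ctx.generate(Sampler::TopP(0.9)).max_tokens(256);
        while let Some(step) = g.next()? {
            let out = step.execute().await?;
            tokens.extend_from_slice(&out.tokens);
        }

        let text = model.tokenizer().decode(&tokens)?;
        // Toy selection rule: keep the shortest candidate that reaches
        // a final answer marker.
        let better = text.contains("ANSWER:")
            && best.as_ref().map_or(true, |b| text.len() < b.len());
        if better {
            best = Some(text);
        }
    }
    best.ok_or_else(|| "no candidate produced an answer".to_string())
}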
The next page, Components, describes the engine, SDKs, and Bakery toolchain that make up the distribution.