
Announcing Pie


We are excited to announce the open-source release of Pie, along with our SOSP 2025 paper that details its design and implementation.

Pie is a programmable system for LLM serving. Pie lets you program the serving loop itself, not just send a prompt and wait. That shift makes today's agentic apps and complex reasoning strategies both easier to build and faster to serve.

Visit our GitHub repository to explore the code, documentation, and examples.


Accelerate your AI apps with Pie

Fast. Cut end‑to‑end latency and boost throughput with application‑aware optimization: integrated I/O and fine‑grained KV‑cache control (measured on Llama 3.2 1B on an L40 GPU).

Flexible. Compose APIs to implement custom decoding, resource policies, and per‑request optimizations—without patching or forking your serving stack.


What is programmable serving?

Existing LLM serving systems run a single monolithic decode loop with global policies (e.g., cache management, speculative decoding). Pie splits this into fine‑grained handlers—embed, forward, sample, KV‑page alloc/free, etc.—and lets your inferlets define custom decoding logic and application‑specific resource management.

Pie architecture diagram

Inferlets compile to WebAssembly, so you can write them in Rust, C++, Go, and more.
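
To make the handler model concrete, here is a minimal sketch of a hand-rolled decode loop written as an inferlet. The per-step method names (forward, sample_top_p, append_token, decode) are illustrative placeholders, not necessarily the exact Pie API; see the repository examples for the real interface.

use inferlet::{Args, Result, get_auto_model};

#[inferlet::main]
async fn main(_args: Args) -> Result<String> {
    let model = get_auto_model();
    let mut ctx = model.create_context();
    ctx.fill_user("Explain programmable serving in one sentence.");

    // Hand-rolled decode loop built from fine-grained handlers.
    // `forward`, `sample_top_p`, `append_token`, and `decode` are
    // illustrative names, not necessarily the exact Pie API.
    let mut output_tokens = Vec::new();
    for _ in 0..128 {
        let logits = ctx.forward().await;            // one forward pass
        let token = logits.sample_top_p(0.6, 0.95);  // custom sampling step
        if model.eos_tokens().contains(&token) {
            break;                                   // stop at EOS
        }
        ctx.append_token(token);                     // feed the token back in
        output_tokens.push(token);
    }

    Ok(model.decode(&output_tokens))                 // detokenize the output
}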


Why programmable serving?

Modern LLM apps demand far more than simple text completion. They're interactive, non-linear, and tool-heavy. Current LLM serving systems struggle with three limitations:

  • Inference inefficiency from missing application-level optimizations
  • Integration friction with external data and tools
  • Implementation challenges for custom generation workflows

Pie addresses all three:

  1. Application-specific KV control. Keep, reuse, split, or drop KV cache pages based on your workflow (trees/graphs of thought, multi-step plans, map–reduce summaries); see the sketch after this list.
  2. Integrated computation & I/O. Call APIs or run code without extra round-trips or re-prefills.
  3. Custom generation processes. Mix speculative/assisted decoding, MCTS, grammar constraints, safety filters, and watermarking at the user level, without modifying the serving system.
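
The sketch below illustrates the first two points under stated assumptions: a shared prefix is branched so both continuations reuse its KV pages instead of re-prefilling, and external calls can happen inside the inferlet. The fork and HTTP helpers shown here are illustrative names, not necessarily the exact Pie API.

use inferlet::stop_condition::{StopCondition, ends_with_any, max_len};
use inferlet::{Args, Result, Sampler, get_auto_model};

#[inferlet::main]
async fn main(_args: Args) -> Result<String> {
    let model = get_auto_model();
    let mut base = model.create_context();
    base.fill_system("You are a careful planner.");
    base.fill_user("Propose two different plans to reduce build times.");

    // Branch the shared prefix: both branches reuse its KV pages rather
    // than re-prefilling it. `fork` is an illustrative name.
    let mut branch_a = base.fork();
    let mut branch_b = base.fork();
    branch_a.fill_user("Focus on caching.");
    branch_b.fill_user("Focus on parallelism.");

    let plan_a = branch_a
        .generate(Sampler::top_p(0.6, 0.95),
                  max_len(256).or(ends_with_any(model.eos_tokens())))
        .await;
    let plan_b = branch_b
        .generate(Sampler::top_p(0.6, 0.95),
                  max_len(256).or(ends_with_any(model.eos_tokens())))
        .await;

    // Integrated I/O: an external call can happen here, inside the inferlet,
    // with no client round-trip or re-prefill. Illustrative helper name:
    // let ranking = inferlet::http::get("https://example.com/rank").await?;

    Ok(format!("Plan A:\n{plan_a}\n\nPlan B:\n{plan_b}"))
}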

How Pie works

1) Write inferlets with Pie APIs to control LLM resources (KV, embeddings), run inference (embed, forward, sample), and do lightweight I/O (HTTP, message bus).

use inferlet::stop_condition::{StopCondition, ends_with_any, max_len};
use inferlet::{Args, Result, Sampler, get_auto_model};

#[inferlet::main]
async fn main(mut args: Args) -> Result<String> {
    let model = get_auto_model();
    let mut ctx = model.create_context();

    // Fill the context with a system prompt and a user turn.
    ctx.fill_system("You are a helpful, respectful and honest assistant.");
    ctx.fill_user("How are you?");

    // Top-p sampling; stop at a token cap or when an EOS token is emitted.
    let max_num_outputs = 256; // example cap on generated tokens
    let sampler = Sampler::top_p(0.6, 0.95);
    let stop_cond = max_len(max_num_outputs).or(ends_with_any(model.eos_tokens()));

    let final_text = ctx.generate(sampler, stop_cond).await;
    Ok(final_text)
}

2) Compile to Wasm. Inferlets build into reusable Wasm binaries.

3) Submit to the Pie server. The server hosts and runs inferlets in a sandboxed Wasm runtime. Submit binaries via Python/JavaScript APIs or CLI, with custom args as needed. You can also interactively communicate with running inferlets.

How inferlets are served

Pie is a three-layer system with a clear separation of concerns between layers.

Pie architecture diagram

  • Application layer — your inferlets run in a sandboxed Wasm runtime.
  • Control layer — virtualizes resources (KV cache pages, embeddings) and manages batched scheduling.
  • Inference layer — executes batched API calls on GPUs.

Roadmap

Pie is under active development. We are continuously working on new features, optimizations, and bug fixes; please check out our roadmap for the latest updates on ongoing and planned work.


Further reading

Our HotOS 2025 paper motivates the vision that LLM serving systems are operating systems. Our SOSP 2025 paper details the design and implementation of Pie.