Overview

Pie is a programmable system for LLM serving. Pie lets you program the serving loop itself, not just send a prompt and wait. That shift makes today's agentic apps and complex reasoning strategies both easier to build and faster to serve.


Accelerate your AI apps with Pie

Fast. Cut end‑to‑end latency and boost throughput through application‑aware optimizations such as integrated I/O and fine‑grained KV‑cache control. (Measured on Llama 3.2 1B on an L40 GPU.)

Agentic workflows

Flexible. Compose APIs to implement custom decoding, resource policies, and per‑request optimizations—without patching or forking your serving stack.

Benchmarks


What is programmable serving?

Existing LLM serving systems run a single monolithic decode loop with global policies (e.g., cache management, speculative decoding). Pie splits this into fine‑grained handlers—embed, forward, sample, KV‑page alloc/free, etc.—and lets your inferlets define custom decoding logic and application‑specific resource management.

Pie architecture diagram

Inferlets compile to WebAssembly, so you can write them in Rust, C++, Go, and more.
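
For intuition, here is a minimal sketch of what handler-level control can look like inside an inferlet. Only get_auto_model, create_context, fill_user, Sampler, and eos_tokens mirror the example shown under "How Pie works" below; forward, sample, fill_token, and detokenize are assumed names standing in for the embed/forward/sample handlers described above, not the exact Pie API.

// Hypothetical sketch of a hand-rolled decode loop. forward, sample,
// fill_token, and detokenize are assumed names, not documented Pie calls.
use inferlet::{Args, Result, Sampler, get_auto_model};

#[inferlet::main]
async fn main(_args: Args) -> Result<String> {
    let model = get_auto_model();
    let mut ctx = model.create_context();
    ctx.fill_user("Explain programmable serving in one sentence.");

    let sampler = Sampler::top_p(0.6, 0.95);
    let mut output = String::new();

    // One explicit forward/sample step per token instead of a single generate() call.
    for _ in 0..64 {
        let dist = ctx.forward().await;               // assumed: one forward pass over pending tokens
        let token = sampler.sample(&dist);            // assumed: draw the next token from the distribution
        if model.eos_tokens().contains(&token) {
            break;                                    // stop on an end-of-sequence token
        }
        ctx.fill_token(token);                        // assumed: append the token for the next step
        output.push_str(&model.detokenize(&[token])); // assumed: token id -> text
    }
    Ok(output)
}

The point is that the decode loop itself lives in your code, so you can change its policy per request rather than per deployment.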


Why programmable serving?

Modern LLM apps demand more from LLMs than simple text completion. They're interactive, non-linear, and tool-heavy. Current LLM serving systems struggle with three limitations:

  • Inference inefficiency from missing application-level optimizations
  • Integration friction with external data and tools
  • Implementation challenges for custom generation workflows

Pie addresses all three:

  1. Application-specific KV control. Keep, reuse, split, or drop KV-cache pages based on your workflow (trees/graphs of thought, multi-step plans, map–reduce summaries); see the sketch after this list.
  2. Integrated computation & I/O. Call APIs or run code, without extra round-trips and re-prefills.
  3. Custom generation processes. Mix speculative/assisted decoding, MCTS, grammar constraints, safety filters, watermarking at the user level, without modifying the serving system.
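
As a concrete example of application-specific KV control (the sketch referenced in point 1), the following inferlet prefills a shared prompt once and forks it into two branches that reuse the same KV pages. The context and generate calls mirror the example under "How Pie works" below; ctx.fork() is an assumed name for duplicating a context that shares its existing KV-cache pages, not a documented call.

// Hypothetical sketch: reuse one prefilled prefix across two branches.
// ctx.fork() is an assumed name, not a documented Pie call.
use inferlet::stop_condition::{StopCondition, ends_with_any, max_len};
use inferlet::{Args, Result, Sampler, get_auto_model};

#[inferlet::main]
async fn main(_args: Args) -> Result<String> {
    let model = get_auto_model();
    let mut ctx = model.create_context();

    // Prefill the shared prefix once; its KV pages back both branches.
    ctx.fill_system("You are a careful planner.");
    ctx.fill_user("Propose a plan for onboarding a new engineer.");

    // Assumed: fork() clones the context without copying or re-prefilling its KV pages.
    let mut branch_a = ctx.fork();
    let mut branch_b = ctx.fork();
    branch_a.fill_user("Optimize the plan for speed.");
    branch_b.fill_user("Optimize the plan for depth.");

    let a = branch_a
        .generate(Sampler::top_p(0.6, 0.95),
                  max_len(256).or(ends_with_any(model.eos_tokens())))
        .await;
    let b = branch_b
        .generate(Sampler::top_p(0.6, 0.95),
                  max_len(256).or(ends_with_any(model.eos_tokens())))
        .await;

    Ok(format!("Plan A:\n{a}\n\nPlan B:\n{b}"))
}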

How Pie works

1) Write inferlets with Pie APIs to control LLM resources (KV, embeddings), run inference (embed, forward, sample), and do lightweight I/O (HTTP, message bus).

use inferlet::stop_condition::{StopCondition, ends_with_any, max_len};
use inferlet::{Args, Result, Sampler, get_auto_model};

#[inferlet::main]
async fn main(mut args: Args) -> Result<String> {
    // Load the model configured for this deployment and open a fresh context.
    let model = get_auto_model();
    let mut ctx = model.create_context();

    // Prefill the prompt.
    ctx.fill_system("You are a helpful, respectful and honest assistant.");
    ctx.fill_user("How are you?");

    // Cap the number of generated tokens (a fixed value here; could come from `args`).
    let max_num_outputs = 256;

    // Top-p sampling, stopping at the length cap or an end-of-sequence token.
    let sampler = Sampler::top_p(0.6, 0.95);
    let stop_cond = max_len(max_num_outputs).or(ends_with_any(model.eos_tokens()));

    // Generate until the stop condition fires and return the text to the client.
    let final_text = ctx.generate(sampler, stop_cond).await;
    Ok(final_text)
}

2) Compile to Wasm. Inferlets build into reusable Wasm binaries.

3) Submit to the Pie server. The server hosts and runs inferlets in a sandboxed Wasm runtime. Submit binaries via Python/JavaScript APIs or CLI, with custom args as needed. You can also interactively communicate with running inferlets.
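
For illustration only, here is a hypothetical sketch of submitting a compiled inferlet from the Rust client. Every pie_client item used below (Client, connect, submit_inferlet, wait), the server address, and the Wasm path are assumptions rather than the documented interface; see the Client API documentation for the real calls.

// Hypothetical sketch only: pie_client names, the address, and the path below
// are assumed for illustration, not the documented client API.
use pie_client::Client;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Connect to a running Pie server (example address).
    let client = Client::connect("http://localhost:8080").await?;

    // Submit a compiled inferlet binary with custom arguments (assumed method name).
    let run = client
        .submit_inferlet("target/wasm32-wasip2/release/chat.wasm",
                         &["--prompt", "How are you?"])
        .await?;

    // Wait for the inferlet's return value; interactive messaging is also possible.
    let output: String = run.wait().await?;
    println!("{output}");
    Ok(())
}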


System Architecture

Pie is organized into three layers, each with a distinct responsibility.

Pie system diagram

Layer          Responsibility
Application    Your inferlet runs in a sandboxed Wasm runtime
Control        Virtualizes resources (KV-cache pages, embeddings) and manages the batch scheduler
Inference      Executes batched API calls on GPUs

Client Libraries

Pie provides client libraries for:

  • Python: pip install pie-client
  • JavaScript: npm install @pie-project/client
  • Rust: cargo add pie-client

See the Client API documentation for details.


Standard Inferlets

Pie includes standard inferlets for common use cases:

Inferlet               Description
std/text-completion    Simple text completion
std/chat               Multi-turn chat with system prompts

Run them directly:

pie run text-completion -- --prompt "Once upon a time"
pie run chat

Further Reading

  • Our HotOS 2025 paper motivates the vision of LLM serving systems as operating systems.
  • Our SOSP 2025 paper details the design and implementation of Pie.

Next Steps