Pie: Programmable LLM Serving

Go beyond serving prompts.
Define, optimize, and deploy your custom LLM inference workflows with Pie.

Modern AI Demands A New Serving Paradigm

Current LLM serving systems use a rigid, monolithic loop, creating bottlenecks for complex applications. Pie replaces this with a fully programmable architecture built on sandboxed WebAssembly programs called inferlets.

Existing LLM Serving

Diagram of a monolithic LLM serving architecture

A one-size-fits-all process that limits innovation and forces inefficient workarounds for advanced use cases.

Programmable LLM Serving

Diagram of Pie’s programmable serving architecture

A flexible foundation of fine-grained APIs, giving you direct control to build application-specific optimizations.

Serve Programs, Not Prompts

By moving control from the system to developer-written inferlets, Pie unlocks new capabilities and optimizations for advanced LLM workflows.
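
To make this concrete, here is a minimal sketch of what an inferlet could look like, written against a hypothetical `inferlet` Python SDK. The module name and every call shown (get_model, create_context, fill, decode_one, return_result) are illustrative assumptions, not Pie's actual API; the point is only that the decode loop lives in developer code rather than in the serving system.

```python
# Hypothetical sketch: the `inferlet` module and every function shown here
# (get_model, create_context, fill, decode_one, return_result) are
# illustrative assumptions, not Pie's documented API.
import inferlet

def main():
    model = inferlet.get_model()                # handle to the model this engine serves
    ctx = model.create_context()                # per-request state, including KV cache

    ctx.fill("You are a concise assistant.\n")  # prefill a system prompt
    ctx.fill(inferlet.get_user_prompt())        # prefill the user's request

    # The inferlet, not the engine, owns the decode loop, so it can stop,
    # branch, or edit state at any token boundary.
    output = []
    for _ in range(256):
        token = ctx.decode_one(temperature=0.7)
        if token == model.eos_token:
            break
        output.append(token)

    inferlet.return_result(model.detokenize(output))

if __name__ == "__main__":
    main()
```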

Existing Systems

Inference inefficiency

Without application-level optimizations, workloads suffer wasted tokens, redundant compute, and rigid execution paths.

Pie

🎛️ Fine-grained KV cache control

Customize KV cache management around your reasoning pattern, prune or reuse states precisely, and avoid unnecessary recompute.
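
For instance, an inferlet that samples several reasoning branches could share one prefilled prefix and free each branch's cache as soon as it is scored. The sketch below reuses the hypothetical SDK from the earlier example; fork, score, and drop are assumed names for illustration.

```python
# Hypothetical sketch of explicit KV cache control; all names are assumptions.
import inferlet

model = inferlet.get_model()
base = model.create_context()
base.fill("Question: ...\nThink step by step.\n")  # shared prefix, prefilled once

best_text, best_score = None, float("-inf")
for _ in range(3):
    branch = base.fork()                  # new context that reuses the prefix's KV state
    text = branch.generate(max_tokens=128, temperature=1.0)
    score = branch.score(text)            # e.g. log-probability or a custom verifier
    if score > best_score:
        best_text, best_score = text, score
    branch.drop()                         # release the branch's KV pages immediately

inferlet.return_result(best_text)
```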

Existing Systems

Implementation challenges

Custom decoding and optimization methods require invasive system patches or forks, complicating maintenance and slowing development.

Pie

⚙️ Customizable generation

Define bespoke decoding algorithms and safety filters per request using programmable APIs. No system fork required.
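
As a sketch of a per-request decoding policy, the example below masks a small set of banned tokens in the raw logits before sampling. It again uses the hypothetical SDK from above; next_logits, sample, and append are assumed names, and the banned-word filter is a toy stand-in for a real safety policy.

```python
# Hypothetical sketch of a per-request decoding policy; all names are assumptions.
import inferlet

model = inferlet.get_model()
ctx = model.create_context()
ctx.fill(inferlet.get_user_prompt())

# Toy "safety filter": ban the first token of a few words.
banned = {model.tokenize(word)[0] for word in ["password", "ssn"]}

output = []
for _ in range(256):
    logits = ctx.next_logits()            # raw next-token distribution
    for token_id in banned:
        logits[token_id] = float("-inf")  # mask banned tokens before sampling
    token = inferlet.sample(logits, top_p=0.9)
    if token == model.eos_token:
        break
    ctx.append(token)                     # commit the sampled token to the context
    output.append(token)

inferlet.return_result(model.detokenize(output))
```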

Existing Systems

Integration friction

External data sources and tools sit outside the generation loop, adding round-trip latency and brittle glue code.

Pie

🔗 Seamless workflow integration

Call tools and data sources inside the serving engine to cut round-trips and keep state aligned with execution.
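
One way this could look is an inferlet that interleaves generation with a tool call inside the same context, so the KV cache never has to be rebuilt between steps. The sketch below is again written against the hypothetical SDK used above; stopped_on and http_get in particular are assumed helpers, not documented Pie functions.

```python
# Hypothetical sketch of calling a tool from inside the decode loop;
# all names (including stopped_on and http_get) are assumptions.
import inferlet

model = inferlet.get_model()
ctx = model.create_context()
ctx.fill(inferlet.get_user_prompt())

answer = []
while True:
    chunk = ctx.generate(stop=["<tool>"], max_tokens=512)
    answer.append(chunk)
    if not ctx.stopped_on("<tool>"):      # model finished without requesting a tool
        break
    query = ctx.generate(stop=["</tool>"], max_tokens=64)   # model writes its tool query
    result = inferlet.http_get("https://api.example.com/search", q=query)
    ctx.fill(f"<result>{result}</result>")  # splice the result back in; the KV cache stays warm

inferlet.return_result("".join(answer))
```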

Dive Deeper

Ready to see how it works? Check out our documentation and get started with Pie today.