A programmable LLM serving system.

High-performance inference engine where you write the loop.
Forward passes are library calls in your inferlet.

Serve programs, not prompts

In existing serving systems, the inference workflow is baked into the engine. In Pie, you write it.

Conventional serving systems

A conventional serving system. Prompts from users enter the engine and pass through a fixed pipeline of batch, embed, prefill or decode, and sample stages, with one global autoregressive loop.

Every request runs through the same fixed pipeline. Branching and tool calls live outside the engine.

Programmable serving system - Pie

Pie's serving model. Each application runs as an inferlet inside the engine, calling into the model's KV cache and forward pass through a control layer.

Each inferlet runs its own workflow inside the engine. It controls the KV cache, forward pass, and tool calls directly.

01

Program the forward pass

The decoder loop, sampler, and constraint matcher are exposed to user code. Write your own drafter, sampling rule, or grammar.
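The idea can be sketched as a decode loop owned by user code: the engine supplies the forward pass, and the inferlet decides how to sample from it. This is a minimal, self-contained toy in Python — `fake_forward`, `sample_top_k`, and `decode` are illustrative stand-ins, not Pie's actual API.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_top_k(logits, k=2, rng=random.Random(0)):
    # User-defined sampling rule: restrict sampling to the k highest-logit tokens.
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    total = sum(probs[i] for i in top)
    r = rng.random() * total
    for i in top:
        r -= probs[i]
        if r <= 0:
            return i
    return top[-1]

def fake_forward(tokens):
    # Stand-in for the engine's forward pass over an 8-token vocabulary:
    # it strongly favors token (last + 1) mod 8.
    nxt = (tokens[-1] + 1) % 8
    return [3.0 if t == nxt else 0.0 for t in range(8)]

def decode(prompt, max_new=4, eos=7):
    # The loop itself lives in user code: the inferlet calls the engine's
    # forward pass, applies its own sampler, and decides when to stop.
    tokens = list(prompt)
    for _ in range(max_new):
        logits = fake_forward(tokens)     # engine call
        tok = sample_top_k(logits, k=1)   # inferlet-owned sampling (k=1 is greedy)
        tokens.append(tok)
        if tok == eos:
            break
    return tokens

print(decode([1, 2]))  # → [1, 2, 3, 4, 5, 6]
```

Swapping in a drafter, a different sampling rule, or a grammar-constrained matcher means editing this loop, not the engine.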


02

Program the KV cache

Branch, rewind, persist, and reopen the cache by calling engine APIs from user code.
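The shape of those operations can be sketched with a toy cache in Python. The class and function names here are illustrative, assumed for the sketch — a real inferlet would invoke engine APIs rather than manage entries itself.

```python
class KVCache:
    """Toy stand-in for an engine-managed KV cache."""

    def __init__(self, entries=None):
        self.entries = list(entries or [])

    def append(self, token):
        self.entries.append(token)

    def branch(self):
        # Fork the cache so two continuations share a common prefix.
        return KVCache(self.entries)

    def rewind(self, n):
        # Drop the last n entries, e.g. to retry a rejected draft.
        del self.entries[-n:]

store = {}

def persist(name, cache):
    # Save cache contents under a name so a later request can reopen them.
    store[name] = list(cache.entries)

def reopen(name):
    return KVCache(store[name])

main = KVCache()
for t in [10, 11, 12]:
    main.append(t)

alt = main.branch()   # explore a second continuation from the same prefix
alt.append(99)
main.rewind(1)        # undo the last step on the main path

persist("session", main)
resumed = reopen("session")
print(main.entries, alt.entries, resumed.entries)
# → [10, 11] [10, 11, 12, 99] [10, 11]
```

Because these are ordinary calls from user code, patterns like tree search, backtracking, and cross-request prefix reuse become application logic instead of engine features.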


03

Program the I/O

Tool clients, persistent files, and message brokers run inside the engine. User code calls them like a library.
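A minimal sketch of that style of in-engine I/O: a tool registry plus a message queue stand in for Pie's tool clients and brokers. All names here are hypothetical, chosen for illustration.

```python
import json
import queue

TOOLS = {}

def tool(fn):
    # Register a callable so the inferlet can invoke it by name.
    TOOLS[fn.__name__] = fn
    return fn

@tool
def add(a, b):
    return a + b

def call_tool(request_json):
    # The inferlet parses a model-emitted tool call and dispatches it
    # without leaving the engine.
    req = json.loads(request_json)
    return TOOLS[req["name"]](**req["args"])

inbox = queue.Queue()   # stand-in for a message-broker channel
inbox.put('{"name": "add", "args": {"a": 2, "b": 3}}')

result = call_tool(inbox.get())
print(result)  # → 5
```

Keeping tool dispatch next to the decode loop avoids the round trip through an external orchestrator that conventional pipelines require.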


Ready to write the loop?

Install the engine, ship your first inferlet, and program the forward pass, KV cache, and I/O from your own code.