
A programmable LLM serving system.

A high-performance inference engine where you write the loop.
Forward passes are library calls in your inferlet.

Serve programs, not prompts

In existing serving systems, the inference workflow is baked into the engine. In Pie, you write it.

Conventional serving systems

A conventional serving system. Prompts from users enter the engine and pass through a fixed pipeline of batch, embed, prefill or decode, and sample stages, with one global autoregressive loop.

Every request runs through the same fixed pipeline. Branching and tool calls live outside the engine.

Programmable serving system - Pie

Pie's serving model. Each application runs as an inferlet inside the engine, calling into the model's KV cache and forward pass through a control layer.

Each inferlet runs its own workflow inside the engine. It controls the KV cache, forward pass, and tool calls directly.
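Concretely, that means an inferlet can drive decoding one forward pass at a time instead of asking the engine for a finished completion. The sketch below is illustrative only: prefill, decode_step, is_eos, and text are hypothetical names for the KV-cache and forward-pass calls the diagram alludes to, not APIs documented on this page.

use inferlet::{Context, Result, model::Model, runtime, sample::Sampler};

#[inferlet::main]
async fn main(prompt: String) -> Result<String> {
    let model = Model::load(runtime::models().first().ok_or("no models")?)?;
    let mut ctx = Context::new(&model)?;

    // Hypothetical lower-level calls; the real inferlet API may differ.
    ctx.prefill(&prompt).await?;          // populate the KV cache once
    let sampler = Sampler::TopP { temperature: 0.6, p: 0.95 };

    let mut out = String::new();
    loop {
        let tok = ctx.decode_step(&sampler).await?; // one forward pass
        if tok.is_eos() || out.len() >= 4096 {      // the stop rule is yours
            break;
        }
        out.push_str(tok.text());
    }
    Ok(out)
}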

More to optimize, more to customize

Pie opens up opportunities for both optimization and custom model behavior.
Each example below is an inferlet that customizes the inference loop in a different way.

use inferlet::{Context, Result, model::Model, runtime, sample::Sampler};

#[inferlet::main]
async fn main(prompt: String) -> Result<String> {
    // Load the first model the runtime advertises.
    let model = Model::load(runtime::models().first().ok_or("no models")?)?;
    let mut ctx = Context::new(&model)?;

    // Build the chat prefix and cue the assistant turn.
    ctx.system("You are a helpful assistant.")
        .user(&prompt)
        .cue();

    // Decode up to 256 tokens with nucleus (top-p) sampling.
    ctx.generate(Sampler::TopP { temperature: 0.6, p: 0.95 })
        .max_tokens(256)
        .collect_text()
        .await
}
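As a sketch of the kind of customization other inferlets might apply, the variant below samples two candidate answers in parallel from a shared prefix. fork is a hypothetical call for cloning a context while reusing its KV cache, and futures::join! stands in for whatever concurrency primitive the runtime provides; only the calls used in the example above appear on this page.

use inferlet::{Context, Result, model::Model, runtime, sample::Sampler};

#[inferlet::main]
async fn main(prompt: String) -> Result<String> {
    let model = Model::load(runtime::models().first().ok_or("no models")?)?;
    let mut ctx = Context::new(&model)?;

    ctx.system("You are a helpful assistant.")
        .user(&prompt)
        .cue();

    // Hypothetical: fork() clones the context but shares the cached
    // prefix, so the prompt is prefilled only once.
    let mut branch = ctx.fork()?;

    // Run a cautious and an exploratory branch concurrently.
    let (a, b) = futures::join!(
        ctx.generate(Sampler::TopP { temperature: 0.2, p: 0.95 })
            .max_tokens(256)
            .collect_text(),
        branch
            .generate(Sampler::TopP { temperature: 0.9, p: 0.95 })
            .max_tokens(256)
            .collect_text(),
    );

    // A real inferlet might rerank the branches; here we return both.
    Ok(format!("A: {}\n\nB: {}", a?, b?))
}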