
Constrained generation

Constrained decoding masks the model's logits each step so only tokens that keep the output valid against a grammar can be sampled. Pie supports JSON Schema, regex, EBNF, and reusable compiled grammars. Read this after Samplers and probabilities.

The high-level entry point is Schema. Pass any Schema value to constrain_with (Rust) or as the constrain argument (Python and JavaScript) on a generate call. The SDK compiles it into a stateful matcher and drives the matcher per generated token.
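The drive-the-matcher loop can be sketched outside the SDK. Everything below (the DigitPairMatcher, the four-token vocabulary, the logits) is invented for illustration; the real matcher is compiled from your Schema and runs over the model's full vocabulary.

```rust
// Toy version of the per-token loop: each step the stateful matcher
// reports which tokens keep the output valid, the mask restricts the
// pick, and the accepted token advances the matcher's state.
struct DigitPairMatcher {
    emitted: usize, // accepts exactly two ASCII digits
}

impl DigitPairMatcher {
    fn allowed(&self, vocab: &[char]) -> Vec<bool> {
        vocab
            .iter()
            .map(|c| self.emitted < 2 && c.is_ascii_digit())
            .collect()
    }

    fn advance(&mut self, _token: char) {
        self.emitted += 1;
    }
}

fn main() {
    let vocab = ['a', '7', '3', 'z'];
    let logits = [5.0_f32, 2.0, 4.0, 1.0]; // model favors 'a'
    let mut matcher = DigitPairMatcher { emitted: 0 };
    let mut out = String::new();
    while out.len() < 2 {
        let mask = matcher.allowed(&vocab);
        // Greedy pick restricted to tokens the matcher allows.
        let (best, _) = logits
            .iter()
            .enumerate()
            .filter(|(i, _)| mask[*i])
            .max_by(|(_, a), (_, b)| a.total_cmp(b))
            .unwrap();
        out.push(vocab[best]);
        matcher.advance(vocab[best]);
    }
    assert_eq!(out, "33"); // '3' outscores '7'; 'a' was never eligible
}
```

Note that the model's top token ('a') is never sampled: the mask removes it before the pick, which is the whole point of constraining at decode time.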

Running on the dummy driver?

Constraint masks are honored by the dummy driver, so output structure is in-grammar — but the content inside the grammar (string field values, numbers, tool arguments) is uniform random. Use a real driver if you need meaningful values.

Schemas

Each language exposes the same five built-in schema variants. Their constructors differ.

use inferlet::{AnyJson, JsonSchema, Regex, Ebnf, sample::Sampler};

// Force any valid JSON
let json: serde_json::Value = ctx
    .generate(Sampler::Argmax)
    .constrain_with(AnyJson)?
    .max_tokens(256)
    .collect_json::<serde_json::Value>()
    .await?;

// JSON Schema
let schema = r#"{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}"#;
let text = ctx
    .generate(Sampler::Argmax)
    .constrain_with(JsonSchema(schema))?
    .max_tokens(128)
    .collect_text()
    .await?;

// Regex
ctx.generate(Sampler::Argmax)
    .constrain_with(Regex(r"\d{4}-\d{2}-\d{2}"))?;

// EBNF (Lark format)
ctx.generate(Sampler::Argmax)
    .constrain_with(Ebnf(&my_grammar))?;

Schema is a trait. The built-in implementors JsonSchema, AnyJson, Regex, and Ebnf are tuple/unit structs. A pre-compiled &Grammar also implements Schema.

| Concept | Rust | Python | JavaScript |
| --- | --- | --- | --- |
| Any valid JSON | AnyJson | AnyJson() | anyJson() |
| JSON conforming to a schema string | JsonSchema(s) | JsonSchema(schema=s) | jsonSchema(s) |
| Strings matching a regex | Regex(p) | Regex(pattern=p) | regex(p) |
| Custom EBNF grammar | Ebnf(g) | Ebnf(source=g) | ebnf(g) |
| Pre-compiled grammar | &Grammar | GrammarConstraint.from_grammar(g, model) | grammar(g) |

In Rust, constrain_with returns Result because parsing and compiling the schema can fail. Multiple constraints AND together: every constraint contributes a mask, and the masks combine before each forward pass.

Pair grammars with greedy or low-temperature sampling

Once the mask narrows the distribution to a handful of valid tokens, stochastic sampling rarely improves quality and can lower speculative-decoding acceptance.
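This pairing can be made concrete with a sketch. Assuming the mask works by excluding invalid tokens from the pick (masked_argmax is illustrative, not the SDK's API):

```rust
// Greedy decoding under a mask: the pick is restricted to allowed
// tokens, so it is always in-grammar even when a forbidden token has
// the highest raw logit.
fn masked_argmax(logits: &[f32], allowed: &[bool]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .filter(|(i, _)| allowed[*i])
        .max_by(|(_, a), (_, b)| a.total_cmp(b))
        .map(|(i, _)| i)
}

fn main() {
    let logits = [3.0, 1.0, 2.5, 0.1];
    let allowed = [false, true, true, false]; // grammar forbids token 0
    assert_eq!(masked_argmax(&logits, &allowed), Some(2));
}
```

With only two valid candidates left, temperature sampling would mostly add noise between near-equivalent tokens.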

Typed JSON output

If you have a type for the result, the SDK can derive the schema and parse the output for you.

use schemars::JsonSchema;
use serde::Deserialize;

#[derive(Deserialize, JsonSchema)]
struct Plan { steps: Vec<String> }

let plan: Plan = ctx
    .generate(Sampler::Argmax)
    .max_tokens(256)
    .collect_json::<Plan>()
    .await?;

collect_json::<T> derives the JSON schema for T, applies it as a constraint, samples, and parses. It errors if you also pass constrain_with, since it owns the schema.

Reuse a compiled grammar

If the same grammar applies across many turns or streams, compile it once and reuse the constraint object. This amortizes the parser cost across calls. The compiled object is safe to reuse across forks of the same context.

use inferlet::inference::Grammar;

let grammar = Grammar::from_ebnf(&my_grammar)?;

let out = ctx
    .generate(Sampler::Argmax)
    .constrain_with(&grammar)? // &Grammar implements Schema
    .max_tokens(256)
    .collect_text()
    .await?;

Why constrain at the decoder

A common pattern wraps an LLM with a parser, a retry loop, and a prompt asking for JSON output. The model occasionally drifts. The parser breaks. The retry costs tokens.

Constrained decoding turns format compliance into a property of the decoder, not of the prompt:

  • Every token is valid. The mask is applied to the logits before the sampler picks, so an invalid token cannot be sampled.
  • No retries. A bad output is impossible by construction.
  • Cheap. The mask is BRLE-encoded and ANDed once per step. There is no rejection sampling and no extra forward pass.

This makes structured output appropriate for any task where the format is part of the contract: API responses, form filling, function arguments, downstream parsers.
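The "cheap" claim can be sketched concretely. Assuming BRLE means alternating run lengths over the vocabulary, starting with a run of disallowed (0) tokens (the SDK's actual wire format may differ), ANDing two masks is a decode, an elementwise AND, and a re-encode:

```rust
// Toy BRLE mask combination. Runs alternate 0s/1s, starting with 0s
// (disallowed tokens). This is an assumed encoding for illustration.
fn brle_decode(runs: &[u32], len: usize) -> Vec<bool> {
    let mut bits = Vec::with_capacity(len);
    let mut value = false; // first run covers disallowed tokens
    for &run in runs {
        bits.extend(std::iter::repeat(value).take(run as usize));
        value = !value;
    }
    bits.truncate(len);
    bits
}

fn brle_encode(bits: &[bool]) -> Vec<u32> {
    let mut runs = Vec::new();
    let mut value = false;
    let mut count = 0u32;
    for &b in bits {
        if b == value {
            count += 1;
        } else {
            runs.push(count);
            value = b;
            count = 1;
        }
    }
    runs.push(count);
    runs
}

// AND two masks over a vocabulary of `vocab` tokens.
fn and_masks(a: &[u32], b: &[u32], vocab: usize) -> Vec<u32> {
    let (da, db) = (brle_decode(a, vocab), brle_decode(b, vocab));
    let anded: Vec<bool> = da.iter().zip(&db).map(|(x, y)| *x && *y).collect();
    brle_encode(&anded)
}

fn main() {
    // Mask A allows tokens 2..6, mask B allows tokens 4..8 (vocab = 10).
    let a = vec![2, 4, 4];
    let b = vec![4, 4, 2];
    // The intersection allows only tokens 4..6.
    assert_eq!(and_masks(&a, &b, 10), vec![4, 2, 4]);
}
```

A production implementation would merge runs directly without decoding to a full bit vector, but the cost is the same order: one linear pass per step, no extra forward pass.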

Custom constraints

For non-grammar logic, implement the Constrain trait (Rust) / Constraint protocol (Python) / Constraint interface (JS). Each step the generator passes any newly accepted tokens (or empty on the first step) and gets back the BRLE-encoded logit mask for the next position. Returning an empty mask means "no restriction" and is treated as transparent during composition.

use inferlet::Constrain;

struct MyConstraint {
    cached: Vec<u32>,
    /* …state… */
}

impl Constrain for MyConstraint {
    fn step(&mut self, accepted: &[u32]) -> &[u32] {
        // Update internal state from `accepted`, then rebuild the cached mask.
        self.cached = recompute_brle_for_state(&*self);
        &self.cached
    }
}

let out = ctx
    .generate(Sampler::Argmax)
    .constrain(MyConstraint { cached: Vec::new() /* … */ })
    .max_tokens(256)
    .collect_text()
    .await?;

The trait has one method. There is no reset, no Result. Return &[] to indicate "no restriction this step." The returned slice can borrow from self.

The custom constraint composes with Schema constraints: each adds its own mask and the masks AND together.
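The composition rule, including the empty-mask-is-transparent case, can be sketched over decoded boolean masks (the SDK composes BRLE-encoded masks; this shape is for illustration only):

```rust
// Compose constraint masks: start fully permissive, AND each mask in,
// and treat an empty mask as "no restriction this step".
fn compose(masks: &[Vec<bool>], vocab: usize) -> Vec<bool> {
    let mut out = vec![true; vocab];
    for m in masks {
        if m.is_empty() {
            continue; // transparent constraint
        }
        for (o, &bit) in out.iter_mut().zip(m) {
            *o = *o && bit;
        }
    }
    out
}

fn main() {
    let schema_mask = vec![true, true, false, false];
    let custom_mask = vec![]; // custom constraint passes this step
    let combined = compose(&[schema_mask, custom_mask], 4);
    assert_eq!(combined, vec![true, true, false, false]);
}
```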

Next

  • Speculative decoding: how constraints interact with draft-and-verify.
  • Tool-call parser: tool-call schemas backed by the model's native grammar.
  • Inputs: apply masks at the forward-pass level instead of through Generator.