Chat parser

chat::Decoder is a stateful parser that turns the model's raw token stream into clean text deltas. It strips role tags, suppresses special tokens, and emits a terminal Done event when the model ends its turn. The SDK names the type Decoder; functionally it is a tokenizer-aware parser. Read this after Generator.

The parser does not generate. Pair it with a Generator (or any other source of token IDs) and feed the tokens in.

Stream tokens through the parser

The standard pattern: drive the Generator step by step and feed each step's tokens into the parser.

```rust
use inferlet::{chat, sample::Sampler};

let mut g = ctx
    .generate(Sampler::TopP { temperature: 0.6, p: 0.95 })
    .max_tokens(256);
let mut parser = chat::Decoder::new(&model);

while let Some(step) = g.next()? {
    let out = step.execute().await?;
    match parser.feed(&out.tokens)? {
        chat::Event::Delta(s) => print!("{s}"),
        chat::Event::Done(_) => break,
        _ => {}
    }
}
```

Each step.execute() runs one forward pass and returns the tokens accepted in that step (one without speculative decoding, several with it). The parser is stateful; reset it between turns with parser.reset().

If you only want the assembled string and don't care about streaming, the Generator's collect_text collector runs the same parser internally — see Generator.
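A minimal sketch of that non-streaming path, reusing the `ctx` and sampler settings from the example above:

```rust
// Sketch: collect the whole assembled reply; the Generator runs the
// chat parser internally, so no manual feed loop is needed.
let reply = ctx
    .generate(Sampler::TopP { temperature: 0.6, p: 0.95 })
    .max_tokens(256)
    .collect_text()
    .await?;
print!("{reply}");
```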

Events

| Event | When |
| --- | --- |
| `Idle` | The token batch produced no visible text (e.g. an opening role tag). |
| `Delta(text)` | New visible text. Append it to the UI. |
| `Done(text)` | The assistant turn ended. `text` is the full assembled reply. |
| `Interrupt(token_id)` | A control token the template did not render as visible text. Most callers ignore this. |

The parser yields one event per feed(...) call. For multi-token batches (e.g. when speculative decoding accepts several tokens in one step), the parser still emits a single event carrying the consolidated delta.
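To make the one-event-per-feed behavior concrete, here is a self-contained toy decoder (not the SDK type — the vocab, token ids, and struct are invented for illustration) that suppresses special tokens and consolidates a multi-token batch into one `Delta`:

```rust
use std::collections::HashSet;

/// Toy events mirroring a subset of the chat parser's interface.
#[derive(Debug, PartialEq)]
enum Event {
    Idle,
    Delta(String),
    Done(String),
}

/// Toy decoder: `vocab` maps token ids to text; `special` ids render
/// nothing; `eos` ends the turn. All state lives in `assembled`.
struct ToyDecoder {
    vocab: Vec<&'static str>,
    special: HashSet<u32>,
    eos: u32,
    assembled: String,
}

impl ToyDecoder {
    /// One event per call: Idle if the batch produced no visible text,
    /// Delta with the consolidated text otherwise, Done on EOS.
    fn feed(&mut self, tokens: &[u32]) -> Event {
        let mut delta = String::new();
        for &t in tokens {
            if t == self.eos {
                return Event::Done(self.assembled.clone());
            }
            if !self.special.contains(&t) {
                let s = self.vocab[t as usize];
                delta.push_str(s);
                self.assembled.push_str(s);
            }
        }
        if delta.is_empty() { Event::Idle } else { Event::Delta(delta) }
    }
}

fn main() {
    // vocab: 0="Hello", 1=", ", 2="world", 3=<role tag> (special), 4=<eos>
    let mut d = ToyDecoder {
        vocab: vec!["Hello", ", ", "world", "", ""],
        special: [3u32].into_iter().collect(),
        eos: 4,
        assembled: String::new(),
    };
    assert_eq!(d.feed(&[3]), Event::Idle); // role tag only: no visible text
    assert_eq!(d.feed(&[0, 1, 2]), Event::Delta("Hello, world".into())); // one consolidated delta
    assert_eq!(d.feed(&[4]), Event::Done("Hello, world".into()));
    println!("ok");
}
```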

Streaming back to the client

Inside an inferlet, stream each delta to the launching client with session::send:

```rust
use inferlet::pie::core::session;

while let Some(step) = g.next()? {
    let out = step.execute().await?;
    if let chat::Event::Delta(s) = parser.feed(&out.tokens)? {
        session::send(&s);
    }
}
```

The client receives each delta as a Stdout event. See User and inferlet for the full session API.

Multi-turn

After a turn finishes, seal() closes the assistant block. The next user() opens a new turn. The KV cache from earlier turns stays live, so nothing re-prefills.

```rust
ctx.system("You are a math tutor.");

ctx.user("What is 2+2?").cue();
let answer = ctx.generate(sampler.clone()).max_tokens(64).collect_text().await?;
ctx.seal();

ctx.user("Why?").cue();
let explain = ctx.generate(sampler).max_tokens(256).collect_text().await?;
ctx.seal();
```

The text the model produced is written back into the context (every accepted token commits to the KV cache), so the next turn already sees it.
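If you stream instead of collecting, the same multi-turn shape applies: reuse one Decoder across turns and reset it when each turn ends. A sketch combining the pieces above (the loop body is the same pattern as the streaming example, not a new API):

```rust
ctx.user("What is 2+2?").cue();
let mut g = ctx.generate(sampler.clone()).max_tokens(64);
while let Some(step) = g.next()? {
    let out = step.execute().await?;
    match parser.feed(&out.tokens)? {
        chat::Event::Delta(s) => session::send(&s),
        chat::Event::Done(_) => break,
        _ => {}
    }
}
ctx.seal();
parser.reset(); // the parser is stateful; clear it before the next turn

ctx.user("Why?").cue();
// ...same loop for the second turn...
```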

Custom stop tokens

Some workloads want to stop on a marker the chat template doesn't recognize as end-of-turn (e.g. a tool-call boundary). chat::stop_tokens(model) returns the model's standard stop list (EOS plus any role-end markers); add your own ids and pass the combined list to the Generator:

```rust
let mut stops = chat::stop_tokens(&model);
stops.extend(my_extra_stop_ids);

let g = ctx.generate(sampler).max_tokens(512).stop(&stops);
```

This stops the Generator. The parser sees the same token stream the Generator emitted — it doesn't need to know about your custom stop.
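The stop-set check itself is simple. A self-contained toy (the function and the token ids are invented for illustration; the real check happens inside the Generator):

```rust
use std::collections::HashSet;

/// Toy: consume a token stream, halting as soon as a sampled id is in
/// the stop set. Returns the tokens accepted before the stop.
fn generate_until_stop(stream: &[u32], stops: &HashSet<u32>) -> Vec<u32> {
    let mut out = Vec::new();
    for &t in stream {
        if stops.contains(&t) {
            break;
        }
        out.push(t);
    }
    out
}

fn main() {
    // Standard stop list (say EOS = 2) extended with a custom
    // tool-call marker (say 97) — ids are made up for the example.
    let mut stops: HashSet<u32> = [2u32].into_iter().collect();
    stops.insert(97);
    assert_eq!(generate_until_stop(&[5, 9, 97, 11], &stops), vec![5, 9]);
    println!("ok");
}
```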
