Chat parser
chat::Decoder is a stateful parser that turns the model's raw token stream into clean text deltas. It strips role tags, suppresses special tokens, and emits a terminal Done event when the model ends its turn. The SDK names the type Decoder; functionally it is a tokenizer-aware parser. Read this after Generator.
The parser does not generate. Pair it with a Generator (or any other source of token IDs) and feed the tokens in.
Stream tokens through the parser
The standard pattern: drive the Generator step by step and feed each step's tokens into the parser.
- Rust
- Python
- JavaScript
```rust
use inferlet::{chat, sample::Sampler};

let mut g = ctx.generate(Sampler::TopP { temperature: 0.6, p: 0.95 })
    .max_tokens(256);
let mut parser = chat::Decoder::new(&model);

while let Some(step) = g.next()? {
    let out = step.execute().await?;
    match parser.feed(&out.tokens)? {
        chat::Event::Delta(s) => print!("{s}"),
        chat::Event::Done(_) => break,
        _ => {}
    }
}
```
```python
from inferlet import Sampler, chat

g = ctx.generate(Sampler.top_p(0.6, 0.95), max_tokens=256)
parser = chat.Decoder(model)

async for step in g:
    out = await step.execute()
    match parser.feed(out.tokens):
        case chat.Event.Delta(text=t):
            print(t, end="")
        case chat.Event.Done(text=_):
            break
        case _:
            pass
```
```javascript
import { Sampler, chat } from 'inferlet';

const g = ctx.generate(Sampler.topP(0.6, 0.95), { maxTokens: 256 });
const parser = new chat.Decoder(model);

for await (const step of g) {
  const out = await step.execute();
  const ev = parser.feed(out.tokens);
  if (ev.type === 'delta') process.stdout.write(ev.text);
  else if (ev.type === 'done') break;
}
```
Each step.execute() runs one forward pass and returns the tokens accepted in that step (one without speculative decoding, several with it). The parser is stateful; reset it between turns with parser.reset().
If you only want the assembled string and don't care about streaming, the Generator's collect_text collector runs the same parser internally — see Generator.
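For reference, this is the same call shape the multi-turn examples below use:

```rust
// Collect the assembled reply in one call; the chat parser runs internally.
let reply = ctx.generate(sampler).max_tokens(256).collect_text().await?;
```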
Events
| Event | When |
|---|---|
| Idle | The token batch produced no visible text (e.g. an opening role tag). |
| Delta(text) | New visible text. Append it to the UI. |
| Done(text) | The assistant turn ended. text is the full assembled reply. |
| Interrupt(token_id) | A control token the template did not render as visible text. Most callers ignore this. |
The parser yields exactly one event per feed(...) call. Even when a step accepts several tokens at once (as with speculative decoding), the parser emits a single event carrying the consolidated delta.
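As an illustration, here is an exhaustive handler for the body of the generation loop above; the variants and payloads are exactly those in the table, and the logging on Interrupt is optional:

```rust
match parser.feed(&out.tokens)? {
    // No visible text in this batch (e.g. an opening role tag).
    chat::Event::Idle => {}
    // New visible text: stream it to the user.
    chat::Event::Delta(s) => print!("{s}"),
    // Turn over; `full` is the complete assembled reply.
    chat::Event::Done(full) => {
        println!("\n[done] {} chars", full.len());
        break;
    }
    // Control token the template didn't render; usually safe to ignore.
    chat::Event::Interrupt(id) => eprintln!("control token: {id}"),
}
```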
Streaming back to the client
Inside an inferlet, stream each delta to the launching client with session::send:
- Rust
- Python
- JavaScript
```rust
use inferlet::pie::core::session;

while let Some(step) = g.next()? {
    let out = step.execute().await?;
    if let chat::Event::Delta(s) = parser.feed(&out.tokens)? {
        session::send(&s);
    }
}
```
```python
from inferlet import session

async for step in g:
    out = await step.execute()
    if isinstance((ev := parser.feed(out.tokens)), chat.Event.Delta):
        session.send(ev.text)
```
```javascript
import { session } from 'inferlet';

for await (const step of g) {
  const out = await step.execute();
  const ev = parser.feed(out.tokens);
  if (ev.type === 'delta') session.send(ev.text);
}
```
The client receives each delta as a Stdout event. See User and inferlet for the full session API.
Multi-turn
After a turn finishes, seal() closes the assistant block. The next user() opens a new turn. The KV cache from earlier turns stays live, so nothing re-prefills.
- Rust
- Python
- JavaScript
ctx.system("You are a math tutor.");
ctx.user("What is 2+2?").cue();
let answer = ctx.generate(sampler.clone()).max_tokens(64).collect_text().await?;
ctx.seal();
ctx.user("Why?").cue();
let explain = ctx.generate(sampler).max_tokens(256).collect_text().await?;
ctx.seal();
ctx.system("You are a math tutor.")
ctx.user("What is 2+2?").cue()
answer = await ctx.generate(sampler, max_tokens=64).collect_text()
ctx.seal()
ctx.user("Why?").cue()
explain = await ctx.generate(sampler, max_tokens=256).collect_text()
ctx.seal()
```javascript
ctx.system('You are a math tutor.');
ctx.user('What is 2+2?').cue();
const answer = await ctx.generate(sampler, { maxTokens: 64 }).collectText();
ctx.seal();

ctx.user('Why?').cue();
const explain = await ctx.generate(sampler, { maxTokens: 256 }).collectText();
ctx.seal();
```
The text the model produced is written back into the context (every accepted token commits to the KV cache), so the next turn already sees it.
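To stream each turn instead of collecting it, the same pieces compose. A sketch reusing one parser across turns (note the parser.reset() between turns, as mentioned earlier):

```rust
let mut parser = chat::Decoder::new(&model);

ctx.user("What is 2+2?").cue();
let mut g = ctx.generate(sampler.clone()).max_tokens(64);
while let Some(step) = g.next()? {
    let out = step.execute().await?;
    match parser.feed(&out.tokens)? {
        chat::Event::Delta(s) => session::send(&s),
        chat::Event::Done(_) => break,
        _ => {}
    }
}
ctx.seal();

parser.reset(); // clear per-turn state before the next assistant turn
ctx.user("Why?").cue();
// ...drive the second turn's Generator through the same loop.
```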
Custom stop tokens
Some workloads want to stop on a marker the chat template doesn't recognize as end-of-turn (e.g. a tool-call boundary). chat::stop_tokens(model) returns the model's standard stop list (EOS plus any role-end markers); add your own and pass to the Generator:
- Rust
- Python
- JavaScript
```rust
let mut stops = chat::stop_tokens(&model);
stops.extend(my_extra_stop_ids);
let g = ctx.generate(sampler).max_tokens(512).stop(&stops);
```
```python
stops = chat.stop_tokens(model) + my_extra_stop_ids
g = ctx.generate(sampler, max_tokens=512, stop=stops)
```
```javascript
const stops = [...chat.stopTokens(model), ...myExtraStopIds];
const g = ctx.generate(sampler, { maxTokens: 512, stop: stops });
```
This stops the Generator. The parser sees the same token stream the Generator emitted — it doesn't need to know about your custom stop.
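How you obtain my_extra_stop_ids depends on your tokenizer access. One sketch, assuming a hypothetical model.tokenize helper (not a documented call here; substitute whatever your SDK version exposes):

```rust
// Hypothetical: derive stop IDs from a tool-call boundary marker.
// `model.tokenize` is an assumption, not a documented SDK call.
let my_extra_stop_ids: Vec<u32> = model.tokenize("</tool_call>");

let mut stops = chat::stop_tokens(&model);
stops.extend(my_extra_stop_ids);
let g = ctx.generate(sampler).max_tokens(512).stop(&stops);
```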
Next
- Reasoning parser: the parser for thinking-block events.
- Tool-call parser: the parser for tool calls.
- Generator: the multi-step source the parser composes with.
- Constrained generation: pair the chat parser with a JSON or regex constraint upstream.