Generator
Generator is the multi-step loop on top of Forward. You configure it once with a sampler and stop conditions, then either drain it with a collector or step it yourself.
Each step does the same thing: submit one forward pass, sample a token, fold the result into the loop's state, and decide whether to stop. The Generator handles the bookkeeping — advancing constraint matchers, accepting or rejecting speculative drafts, counting tokens against max_tokens, re-bidding credits — so you don't write any of that yourself.
Read this after Generation overview.
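In terms of the public API, a drained Generator behaves roughly like this hand-written Rust loop (a sketch built from the calls documented below; the real loop also handles speculation, constraints, and bidding):

let mut tokens: Vec<u32> = Vec::new();
while let Some(step) = g.next()? {
    let out = step.execute().await?;        // one forward pass; the sampler picks tokens
    tokens.extend_from_slice(&out.tokens);  // fold accepted tokens into the result
}
// g stops itself on max_tokens, a stop token, or a terminal constraint.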
Construct one
ctx.generate(sampler) returns a Generator<'ctx> (Rust) or a Generator-shaped object (Python / JS).
- Rust
- Python
- JavaScript
use inferlet::sample::Sampler;
let g = ctx.generate(Sampler::TopP { temperature: 0.6, p: 0.95 })
.max_tokens(256);
The Rust API is a builder. Methods take self and return Self, so calls chain. A collect_* call consumes the Generator; next() borrows it, so you can call next() repeatedly to step through.
from inferlet import Sampler
g = ctx.generate(
Sampler.top_p(0.6, 0.95),
max_tokens=256,
)
The Python API takes options as keyword arguments to generate. Every Rust builder method has an equivalent kwarg.
import { Sampler } from 'inferlet';
const g = ctx.generate(
Sampler.topP(0.6, 0.95),
{ maxTokens: 256 },
);
The JS API takes an options object. Every Rust builder method has an equivalent option key.
Builder surface
Every option below has the same shape across SDKs: a Rust builder method, a Python keyword argument, a JS option key. The behavior is the same.
| Rust builder | Python kwarg | JS option | Effect |
|---|---|---|---|
| .max_tokens(n) | max_tokens=n | maxTokens: n | Stop after n tokens have been generated. |
| .stop(&[id, ...]) | stop=[id, ...] | stop: [id, ...] | Stop when the model emits any of these token IDs. |
| .constrain(c) | constrain=c | constrain: c | Attach a Constrain / Constraint instance. |
| .constrain_with(schema) | constrain=schema | constrain: schema | Attach a Schema (the SDK compiles it into a constraint). |
| .speculator(s) | speculator=s | speculator: s | Use a custom speculator for drafting. |
| .system_speculation() | system_speculation=True | systemSpeculation: true | Use the runtime's built-in n-gram drafter. |
| .adapter(&a) | adapter=a | adapter: a | Apply a LoRA adapter on every step. |
| .zo_seed(seed) | zo_seed=seed | zoSeed: seed | Set an Evolution Strategies seed for every step. |
| .horizon(n) | horizon=n | horizon: n | Tell the credit bidder you expect about n more steps. |
| .probe_each_step(idx, probe) | g.probe_each_step(idx, probe) | g.probeEachStep(idx, probe) | Attach a probe at idx on every step; returns a typed handle. Builder method on the Generator itself (no constructor-option equivalent). |
Two compatibility notes:
- speculator and system_speculation are mutually exclusive. Setting both is a runtime error. See Speculative decoding.
- constrain and constrain_with compose. Every constraint contributes a logit mask; the masks AND together before each forward pass. See Constrained generation.
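Several of these knobs chained together, as a sketch (extra_stops and plan_schema are placeholder values, not SDK names):

let g = ctx.generate(Sampler::TopP { temperature: 0.6, p: 0.95 })
    .max_tokens(512)
    .stop(&extra_stops)           // added on top of the engine's default stop list
    .constrain_with(plan_schema)  // compiled into a logit-mask constraint
    .horizon(512);                // hint the credit bidder toward ~512 more steps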
Drain with a collector
A collector runs the loop to completion and returns the result in a particular shape. Three are built in. None of them is chat-specific.
- Rust
- Python
- JavaScript
let tokens: Vec<u32> = g.collect_tokens().await?; // raw IDs
let text: String = g.collect_text().await?; // string
let plan: Plan = g.collect_json::<Plan>().await?; // typed JSON
tokens = await g.collect_tokens() # list[int]
text = await g.collect_text() # str
data = await g.collect_json(schema=PLAN_SCHEMA) # parsed JSON
const tokens = await g.collectTokens(); // Uint32Array
const text = await g.collectText(); // string
const plan = await g.collectJson<Plan>({ schema, parse }); // typed JSON
| Collector | Returns | What it does |
|---|---|---|
| collect_tokens | All accepted token IDs. | Loops until done; accumulates out.tokens from each step. |
| collect_text | The assembled string. | Loops until done; runs a chat::Decoder internally to strip control tokens and assemble visible text. The chat parser is an implementation detail; the collector itself is not chat-specific. |
| collect_json | Parsed JSON, optionally typed. | Attaches a JSON-schema constraint (derived from T in Rust, supplied as a string in Python / JS), loops until done, parses the result. |
collect_text is the right call when you want a string and don't need streaming. For streaming deltas, step the Generator yourself and feed tokens through a parser. See Chat parser.
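As a minimal Rust sketch, assuming chat::Decoder exposes a new constructor and a feed method that returns the next visible delta (both assumptions; see Chat parser for the real interface):

use inferlet::chat;

let mut decoder = chat::Decoder::new(&model);  // assumed constructor
while let Some(step) = g.next()? {
    let out = step.execute().await?;           // auto-sampler stays attached
    if let Some(delta) = decoder.feed(&out.tokens) {
        print!("{delta}");                     // stream the visible delta
    }
}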
Step manually
For watermarking, custom sampling, mid-generation tool calls, or anything that needs to see (or replace) what the Generator does each step, drive it explicitly.
- Rust
- Python
- JavaScript
use inferlet::sample::{Sampler, Logits};
let mut g = ctx.generate(Sampler::TopP { temperature: 0.6, p: 0.95 })
.max_tokens(256);
while let Some(mut step) = g.next()? {
// Optional: drop the auto-sampler and pick the token yourself.
step.clear_sampler();
let h = step.probe(0, Logits);
let out = step.execute().await?;
let logits_bytes = out.logits(h).unwrap();
let chosen: u32 = pick_with_watermark(logits_bytes);
// Register the chosen token so the Generator advances state.
g.accept(&[chosen]);
}
from inferlet import Sampler, Logits
g = ctx.generate(Sampler.top_p(0.6, 0.95), max_tokens=256)
async for step in g:
step.clear_sampler()
h = step.probe(0, Logits())
out = await step.execute()
chosen = pick_with_watermark(out.logits(h))
g.accept([chosen])
import { Sampler, Logits } from 'inferlet';
const g = ctx.generate(Sampler.topP(0.6, 0.95), { maxTokens: 256 });
for await (const step of g) {
step.clearSampler();
const h = step.probe(0, Logits());
const out = await step.execute();
const chosen = pickWithWatermark(out.logits(h));
g.accept(new Uint32Array([chosen]));
}
Two pieces matter:
- step.clear_sampler() drops the Generator's auto-attached sampler. The forward pass still runs; what changes is that no token is sampled automatically. You read the distribution off a probe (Logits or Distribution) and pick a token yourself.
- g.accept(&[token]) registers a manually-sampled token with the Generator. Without it, the loop has no way to know what was chosen — max_tokens, stop, and constraint state never advance. Pass a slice with one token in the common case. (Speculative decoding accepts a run of tokens at once; same call shape.)
step.probe(idx, probe) adds a one-off probe for the current step. The handle works exactly like a probe attached to a Forward builder — see Samplers and probabilities.
Stop conditions
A Generator stops on the first of these:
- max_tokens reached.
- The model emits a token in the stop set.
- The model emits an end-of-turn marker the runtime recognizes (the engine's default stop list — your stop adds to this list, it does not replace it).
- An attached constraint reaches a terminal state.
Once stopped, g.next() returns Ok(None) (Rust) or the async iterator ends (Python / JS). Collectors return at that point.
Stop on a custom token
Pass a list of token IDs to stop. The chat-template stop set (EOS, role-end markers) is added on top automatically; you don't need to repeat it.
- Rust
- Python
- JavaScript
use inferlet::chat;
// Stop on EOS / role-end markers AND on whatever your custom marker is.
let mut stops = chat::stop_tokens(&model);
stops.extend(my_extra_stop_ids);
let g = ctx.generate(sampler).max_tokens(512).stop(&stops);
from inferlet import chat
stops = chat.stop_tokens(model) + my_extra_stop_ids
g = ctx.generate(sampler, max_tokens=512, stop=stops)
import { chat } from 'inferlet';
const stops = [...chat.stopTokens(model), ...myExtraStopIds];
const g = ctx.generate(sampler, { maxTokens: 512, stop: stops });
Inspect
g.is_done() (Rust) / g.is_done (Python) / g.isDone() (JavaScript) reports whether the Generator has terminated. Useful when you split the loop across functions.
g.tokens_generated() (Rust) / g.tokens_generated (Python) / g.tokensGenerated (JavaScript) is the running count of tokens accepted so far. Useful for progress reporting or for budget checks inside a custom loop.
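A sketch of both inside a custom loop, enforcing a soft budget below max_tokens:

while let Some(step) = g.next()? {
    step.execute().await?;
    eprintln!("generated {} tokens", g.tokens_generated());  // progress report
    if g.tokens_generated() >= 128 {
        break;  // soft budget reached before max_tokens
    }
}
let hit_stop_condition = g.is_done();  // false if we broke out early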
Next
- Chat parser. Stream visible chat text as deltas.
- Reasoning parser. Stream thinking-block events.
- Tool-call parser. Detect tool calls in the stream.
- Constrained generation. The Schema and Constrain API the Generator's constrain knob takes.
- Speculative decoding. The Speculator API the Generator's speculator knob takes.