
Tool-call parser

Pie has two paths for tool use:

  1. Hand-rolled agent loops. Prompt the model to emit a structured action, parse the response in code, run the tool, feed the observation back. Works with any model.
  2. Native tool-call templates. Some models build a tool-call format into their chat template (Llama 3.1+, Qwen 2.5+ instruct, others). The tools module wraps the equip / answer / decode flow, and tools::Decoder is the stream parser that detects calls in the model's output.

Read this after Chat parser. For external tool servers, see MCP.

Running on the dummy driver?

The dummy driver returns uniform random tokens, so native tool-call marker sequences (Llama 3.1 / Qwen 2.5 format) almost never appear in the stream — tools::Decoder will run to max_tokens without firing a Call event, and the hand-rolled ReAct parser will fail to match its action format. The "forcing valid tool calls with grammar" pattern further down does produce parseable tool-call JSON on the dummy because constraint masks are honored; only the field values are random.

Hand-rolled ReAct

Prompt-engineer the action format and parse the response yourself.

use inferlet::{Context, sample::Sampler};

const SYSTEM_PROMPT: &str = "\
You have these tools:
- Calculator[expr]: evaluate a math expression
- FinalAnswer[answer]: report the final answer

Respond in this format:
Thought: <your reasoning>
Action: ToolName[input]";

ctx.system(SYSTEM_PROMPT)
    .user("What is 15 * 37?")
    .cue();

for _ in 0..max_iterations {
    let response = ctx
        .generate(Sampler::Argmax)
        .max_tokens(512)
        .collect_text()
        .await?;

    match parse_action(&response) {
        Action::Tool(name, input) => {
            let result = run_tool(&name, &input);
            ctx.user(&format!("Observation: {result}")).cue();
        }
        Action::FinalAnswer(answer) => return Ok(answer),
        Action::None => break,
    }
}
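
The loop assumes parse_action and run_tool helpers that are not part of the SDK. A minimal sketch of parse_action for the Thought/Action format above (the Action enum and the parsing rules here are illustrative, not a Pie API):

```rust
#[derive(Debug, PartialEq)]
enum Action {
    Tool(String, String),
    FinalAnswer(String),
    None,
}

fn parse_action(response: &str) -> Action {
    // Take the last "Action:" line; the model may emit Thought lines first.
    let line = match response
        .lines()
        .rev()
        .find(|l| l.trim_start().starts_with("Action:"))
    {
        Some(l) => l.trim_start().trim_start_matches("Action:").trim(),
        None => return Action::None,
    };
    // Expect the ToolName[input] shape from the system prompt.
    let (open, close) = match (line.find('['), line.rfind(']')) {
        (Some(o), Some(c)) if o < c => (o, c),
        _ => return Action::None,
    };
    let name = line[..open].trim().to_string();
    let input = line[open + 1..close].to_string();
    if name == "FinalAnswer" {
        Action::FinalAnswer(input)
    } else {
        Action::Tool(name, input)
    }
}
```

Returning Action::None on any malformed output lets the loop bail out cleanly instead of looping on unparseable text.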

The agent-react and agent-codeact inferlets are full implementations.

Native tool-call template

For models with a built-in tool-call format, the tools module wraps the equip / answer / decode flow. You hand it a JSON schema of available tools; the model's chat template emits structured calls; the decoder parses them.

Equip and answer

Get the prefix tokens from tools::equip_prefix, append them to the context, and start generating. When the model emits a call, run it and feed the result back with tools::answer_prefix.

use inferlet::tools;

let calculator_schema = serde_json::json!({
    "name": "calculator",
    "description": "Evaluate a mathematical expression",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": { "type": "string" }
        },
        "required": ["expression"]
    }
}).to_string();

let prefix = tools::equip_prefix(&model, &[calculator_schema])?;
ctx.append(&prefix);

ctx.user("What is 15 * 37?").cue();

// After running the tool:
let answer = tools::answer_prefix(&model, "calculator", &result_json);
ctx.append(&answer);
ctx.cue();
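
The snippets leave run_tool to you. A minimal self-contained sketch for the calculator tool above, handling only "a <op> b" expressions (the dispatcher shape and the tiny evaluator are illustrative; a real tool set is workload-specific):

```rust
// Hypothetical run_tool dispatcher: maps a tool name plus its input to a
// result string to feed back via tools::answer_prefix.
fn run_tool(name: &str, input: &str) -> String {
    match name {
        "calculator" => eval_binary(input)
            .map(|v| v.to_string())
            .unwrap_or_else(|| "error: unsupported expression".to_string()),
        _ => format!("error: unknown tool {name}"),
    }
}

// Evaluates a single "a <op> b" expression; anything else returns None.
fn eval_binary(expr: &str) -> Option<f64> {
    for op in ['*', '/', '+', '-'] {
        if let Some((l, r)) = expr.split_once(op) {
            let a = l.trim().parse::<f64>().ok()?;
            let b = r.trim().parse::<f64>().ok()?;
            return Some(match op {
                '*' => a * b,
                '/' => a / b,
                '+' => a + b,
                _ => a - b,
            });
        }
    }
    None
}
```

Returning an error string rather than panicking matters here: the model sees the observation and can retry with a corrected call.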

Streaming detection

Feed each step's tokens to a tools::Decoder. It emits Start while a call is being assembled and Call(name, args_json) when the arguments close.

use inferlet::{tools, sample::Sampler};

let mut g = ctx.generate(Sampler::Argmax).max_tokens(512);
let mut dec = tools::Decoder::new(&model);

while let Some(step) = g.next()? {
    let out = step.execute().await?;
    match dec.feed(&out.tokens)? {
        tools::Event::Call(name, args_json) => {
            let result = run_tool(&name, &args_json);
            let answer = tools::answer_prefix(&model, &name, &result);
            ctx.append(&answer);
            ctx.cue();
            dec.reset();
        }
        tools::Event::Start => {} // call still being assembled
    }
}

Call dec.reset() after each Call event so the decoder is ready for the next call in the same context.

Parallel tool calls

When a turn produces multiple tool calls, run them concurrently instead of one at a time. The pattern: collect all Call events from one assistant turn, dispatch them as a fan-out (join_all in Rust), await all the results, append all the answers, then resume generation.

use futures::future;

let mut pending: Vec<(String, String)> = Vec::new();

while let Some(step) = g.next()? {
    let out = step.execute().await?;
    match dec.feed(&out.tokens)? {
        tools::Event::Call(name, args) => pending.push((name, args)),
        tools::Event::Start => {}
    }

    if turn_just_ended(&out) {
        let results: Vec<String> = future::join_all(
            pending.iter().map(|(n, a)| run_tool_async(n, a))
        ).await;

        for ((name, _), result) in pending.iter().zip(results.iter()) {
            let answer = tools::answer_prefix(&model, name, result);
            ctx.append(&answer);
        }
        ctx.cue();
        pending.clear();
        dec.reset();
    }
}

turn_just_ended is workload-specific: in the simplest form, the chat parser's Done event for the same step indicates the turn closed. With native tool-call templates, the template's end-of-turn marker is the signal.
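
With a native template, one way to sketch the check is to scan the step's tokens for the end-of-turn marker id. Everything here is an assumption to adapt: the marker id below is Llama 3.1's <|eot_id|> and must be looked up per model, and the real turn_just_ended in the loop above takes the whole step output rather than a token slice:

```rust
// Placeholder end-of-turn marker id (Llama 3.1 <|eot_id|>); look this up
// from your model's tokenizer rather than hard-coding it.
const EOT: u32 = 128009;

// Hypothetical helper: the turn is over once the marker appears in the
// step's decoded tokens.
fn turn_just_ended(tokens: &[u32]) -> bool {
    tokens.contains(&EOT)
}
```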

One-shot parsing (Rust only)

If you don't need streaming, the Rust SDK exposes tools::parse_call to extract a single completed call from a finished string. The Python and JavaScript SDKs do not ship a one-shot helper — feed the full token output through tools.Decoder and read the first Call event.

let response = ctx.generate(Sampler::Argmax).max_tokens(512).collect_text().await?;
if let Some((name, args_json)) = tools::parse_call(&model, &response) {
    let result = run_tool(&name, &args_json);
}
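
Once you have args_json, the calculator tool only needs its "expression" field. A dependency-free sketch for pulling one string field out of flat, well-formed argument JSON (the helper name is made up; it does not handle escaped quotes or nesting, so reach for serde_json in real code):

```rust
// Hypothetical helper: extract a top-level string field from flat argument
// JSON like {"expression": "15 * 37"}.
fn extract_string_field(args_json: &str, field: &str) -> Option<String> {
    let key = format!("\"{field}\"");
    let start = args_json.find(&key)? + key.len();
    let rest = &args_json[start..];
    let colon = rest.find(':')?;
    let rest = rest[colon + 1..].trim_start();
    let rest = rest.strip_prefix('"')?;
    let end = rest.find('"')?;
    Some(rest[..end].to_string())
}
```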

Forcing valid tool calls

To force the model to produce a syntactically valid call, attach the model's native grammar to the generator. tools::native_grammar returns a compiled grammar over the tool-call format; pass it through a schema constraint.

use inferlet::{tools, sample::Sampler};

if let Some(grammar) = tools::native_grammar(&model, &[calculator_schema]) {
    let response = ctx
        .generate(Sampler::Argmax)
        .max_tokens(512)
        .constrain_with(&grammar)? // &Grammar implements Schema
        .collect_text()
        .await?;
}

The Rust tools::native_grammar returns a compiled Grammar; the Python tools.native_matcher and JavaScript tools.nativeMatcher return a stateful Matcher (wrap with GrammarConstraint). All three return None / undefined for models without a native tool-call template — fall back to the hand-rolled approach in that case.
