Tool-call parser
Pie has two paths for tool use:
- Hand-rolled agent loops. Prompt the model to emit a structured action, parse the response in code, run the tool, feed the observation back. Works with any model.
- Native tool-call templates. Some models build a tool-call format into their chat template (Llama 3.1+, Qwen 2.5+ instruct, others). The tools module wraps the equip / answer / decode flow, and tools::Decoder is the stream parser that detects calls in the model's output.
Read this after Chat parser. For external tool servers, see MCP.
The dummy driver returns uniform random tokens, so native tool-call marker sequences (Llama 3.1 / Qwen 2.5 format) almost never appear in the stream — tools::Decoder will run to max_tokens without firing a Call event, and the hand-rolled ReAct parser will fail to match its action format. The "forcing valid tool calls with grammar" pattern further down does produce parseable tool-call JSON on the dummy because constraint masks are honored; only the field values are random.
Hand-rolled ReAct
Prompt-engineer the action format and parse the response yourself.
- Rust
- Python
- JavaScript
use inferlet::{Context, sample::Sampler};

const SYSTEM_PROMPT: &str = "\
You have these tools:
- Calculator[expr]: evaluate a math expression
- FinalAnswer[answer]: report the final answer
Respond in this format:
Thought: <your reasoning>
Action: ToolName[input]";

ctx.system(SYSTEM_PROMPT)
    .user("What is 15 * 37?")
    .cue();

for _ in 0..max_iterations {
    let response = ctx
        .generate(Sampler::Argmax)
        .max_tokens(512)
        .collect_text()
        .await?;

    match parse_action(&response) {
        Action::Tool(name, input) => {
            let result = run_tool(&name, &input);
            ctx.user(&format!("Observation: {result}")).cue();
        }
        Action::FinalAnswer(answer) => return Ok(answer),
        Action::None => break,
    }
}
from inferlet import Sampler

SYSTEM_PROMPT = """\
You have these tools:
- Calculator[expr]: evaluate a math expression
- FinalAnswer[answer]: report the final answer
Respond in this format:
Thought: <your reasoning>
Action: ToolName[input]"""

ctx.system(SYSTEM_PROMPT)
ctx.user("What is 15 * 37?")
ctx.cue()

for _ in range(max_iterations):
    response = await ctx.generate(
        Sampler.argmax(), max_tokens=512,
    ).collect_text()

    action = parse_action(response)
    if action.kind == "tool":
        result = run_tool(action.name, action.input)
        ctx.user(f"Observation: {result}").cue()
    elif action.kind == "final":
        return action.answer
    else:
        break
import { Sampler } from 'inferlet';

const SYSTEM_PROMPT = `\
You have these tools:
- Calculator[expr]: evaluate a math expression
- FinalAnswer[answer]: report the final answer
Respond in this format:
Thought: <your reasoning>
Action: ToolName[input]`;

ctx.system(SYSTEM_PROMPT)
  .user('What is 15 * 37?')
  .cue();

for (let i = 0; i < maxIterations; i++) {
  const response = await ctx
    .generate(Sampler.argmax(), { maxTokens: 512 })
    .collectText();

  const action = parseAction(response);
  if (action.kind === 'tool') {
    const result = runTool(action.name, action.input);
    ctx.user(`Observation: ${result}`).cue();
  } else if (action.kind === 'final') {
    return action.answer;
  } else {
    break;
  }
}
The agent-react and agent-codeact inferlets are full implementations.
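The loop above leaves parse_action and run_tool undefined. A minimal sketch of what they could look like in Rust, assuming the regex and meval crates and a hypothetical Action enum (the agent-react inferlet has its own, more complete versions):
use regex::Regex;

enum Action {
    Tool(String, String),
    FinalAnswer(String),
    None,
}

fn parse_action(response: &str) -> Action {
    // Match the last "Action: ToolName[input]" line the model produced.
    let re = Regex::new(r"Action:\s*(\w+)\[(.*)\]").unwrap();
    match re.captures_iter(response).last() {
        Some(caps) if &caps[1] == "FinalAnswer" => Action::FinalAnswer(caps[2].to_string()),
        Some(caps) => Action::Tool(caps[1].to_string(), caps[2].to_string()),
        None => Action::None,
    }
}

fn run_tool(name: &str, input: &str) -> String {
    match name {
        // meval is one option for evaluating arithmetic; swap in whatever your tool does.
        "Calculator" => meval::eval_str(input)
            .map(|v| v.to_string())
            .unwrap_or_else(|e| format!("error: {e}")),
        _ => format!("unknown tool: {name}"),
    }
}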
Native tool-call template
For models with a built-in tool-call format, the tools module wraps the equip / answer / decode flow. You hand it a JSON schema of available tools; the model's chat template emits structured calls; the decoder parses them.
Equip and answer
Get the prefix tokens from tools::equip_prefix, append them to the context, and start generating. When the model emits a call, run it and feed the result back with tools::answer_prefix.
- Rust
- Python
- JavaScript
use inferlet::tools;

let calculator_schema = serde_json::json!({
    "name": "calculator",
    "description": "Evaluate a mathematical expression",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": { "type": "string" }
        },
        "required": ["expression"]
    }
}).to_string();

let prefix = tools::equip_prefix(&model, &[calculator_schema])?;
ctx.append(&prefix);
ctx.user("What is 15 * 37?").cue();

// After running the tool:
let answer = tools::answer_prefix(&model, "calculator", &result_json);
ctx.append(&answer);
ctx.cue();
from inferlet import tools
import json

calculator_schema = {
    "name": "calculator",
    "description": "Evaluate a mathematical expression",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": {"type": "string"},
        },
        "required": ["expression"],
    },
}

prefix = tools.equip_prefix(model, [json.dumps(calculator_schema)])
ctx.append(prefix)
ctx.user("What is 15 * 37?").cue()

# After running the tool:
answer = tools.answer_prefix(model, "calculator", result_json)
ctx.append(answer)
ctx.cue()
import { tools } from 'inferlet';

const calculatorSchema = {
  name: 'calculator',
  description: 'Evaluate a mathematical expression',
  parameters: {
    type: 'object',
    properties: { expression: { type: 'string' } },
    required: ['expression'],
  },
};

const prefix = tools.equipPrefix(model, [JSON.stringify(calculatorSchema)]);
ctx.append(prefix);
ctx.user('What is 15 * 37?').cue();

// After running the tool:
const answer = tools.answerPrefix(model, 'calculator', JSON.stringify(result));
ctx.append(answer);
ctx.cue();
Streaming detection
Feed each step's tokens to a tools::Decoder. It emits Start while a call is being assembled and Call(name, args_json) when the arguments close.
- Rust
- Python
- JavaScript
use inferlet::{tools, sample::Sampler};

let mut g = ctx.generate(Sampler::Argmax).max_tokens(512);
let mut dec = tools::Decoder::new(&model);

while let Some(step) = g.next()? {
    let out = step.execute().await?;
    match dec.feed(&out.tokens)? {
        tools::Event::Call(name, args_json) => {
            let result = run_tool(&name, &args_json);
            let answer = tools::answer_prefix(&model, &name, &result);
            ctx.append(&answer);
            ctx.cue();
            dec.reset();
        }
        tools::Event::Start => {} // call still being assembled
    }
}
from inferlet import Sampler, tools

g = ctx.generate(Sampler.argmax(), max_tokens=512)
dec = tools.Decoder(model)

async for step in g:
    out = await step.execute()
    match dec.feed(out.tokens):
        case tools.Event.Call(name=n, args=a):
            result = run_tool(n, a)
            answer = tools.answer_prefix(model, n, result)
            ctx.append(answer)
            ctx.cue()
            dec.reset()
        case tools.Event.Start():
            pass
import { Sampler, tools } from 'inferlet';

const g = ctx.generate(Sampler.argmax(), { maxTokens: 512 });
const dec = new tools.Decoder(model);

for await (const step of g) {
  const out = await step.execute();
  const ev = dec.feed(out.tokens);
  if (ev.type === 'call') {
    const result = runTool(ev.name, ev.args);
    const answer = tools.answerPrefix(model, ev.name, JSON.stringify(result));
    ctx.append(answer);
    ctx.cue();
    dec.reset();
  }
  // ev.type === 'start': call still being assembled
}
Call dec.reset() after each Call event so the decoder is ready for the next call in the same context.
Parallel tool calls
When a turn produces multiple tool calls, run them concurrently with the language's standard async primitive instead of one at a time. The pattern: collect all Call events from one assistant turn, dispatch them as a fan-out, await all results, append all answers, resume generation.
- Rust
- Python
- JavaScript
use futures::future;

let mut pending: Vec<(String, String)> = Vec::new();

while let Some(step) = g.next()? {
    let out = step.execute().await?;
    match dec.feed(&out.tokens)? {
        tools::Event::Call(name, args) => pending.push((name, args)),
        tools::Event::Start => {}
    }

    if turn_just_ended(&out) {
        let results: Vec<String> = future::join_all(
            pending.iter().map(|(n, a)| run_tool_async(n, a))
        ).await;

        for ((name, _), result) in pending.iter().zip(results.iter()) {
            let answer = tools::answer_prefix(&model, name, result);
            ctx.append(&answer);
        }
        ctx.cue();
        pending.clear();
        dec.reset();
    }
}
import asyncio

pending: list[tuple[str, str]] = []

async for step in g:
    out = await step.execute()
    match dec.feed(out.tokens):
        case tools.Event.Call(name=n, args=a):
            pending.append((n, a))
        case tools.Event.Start():
            pass

    if turn_just_ended(out):
        results = await asyncio.gather(
            *(run_tool_async(n, a) for n, a in pending)
        )
        for (n, _), r in zip(pending, results):
            ctx.append(tools.answer_prefix(model, n, r))
        ctx.cue()
        pending.clear()
        dec.reset()
const pending: { name: string; args: string }[] = [];

for await (const step of g) {
  const out = await step.execute();
  const ev = dec.feed(out.tokens);
  if (ev.type === 'call') pending.push({ name: ev.name, args: ev.args });

  if (turnJustEnded(out)) {
    const results = await Promise.all(
      pending.map(p => runToolAsync(p.name, p.args)),
    );
    pending.forEach((p, i) => {
      ctx.append(tools.answerPrefix(model, p.name, JSON.stringify(results[i])));
    });
    ctx.cue();
    pending.length = 0;
    dec.reset();
  }
}
turn_just_ended is workload-specific: in the simplest form, the chat parser's Done event for the same step indicates the turn closed. With native tool-call templates, the template's end-of-turn marker is the signal.
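As a sketch of the simplest form, one could feed the same step tokens to a second, chat-level decoder and treat its Done event as the end-of-turn signal. The chat::Decoder and chat::Event::Done names below are assumptions modeled on the Chat parser page; check that page for the exact API.
use inferlet::chat; // assumed module; see the Chat parser page for the real names

let mut chat_dec = chat::Decoder::new(&model);

// Inside the generation loop, after `out` is produced:
let turn_closed = matches!(chat_dec.feed(&out.tokens), Ok(chat::Event::Done));
if turn_closed {
    // fan out `pending` exactly as in the parallel example above
}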
One-shot parsing (Rust only)
If you don't need streaming, the Rust SDK exposes tools::parse_call to extract a single completed call from a finished string. The Python and JavaScript SDKs do not ship a one-shot helper — feed the full token output through tools.Decoder and read the first Call event.
- Rust
- Python
- JavaScript
let response = ctx.generate(Sampler::Argmax).max_tokens(512).collect_text().await?;

if let Some((name, args_json)) = tools::parse_call(&model, &response) {
    let result = run_tool(&name, &args_json);
}
tokens = await ctx.generate(Sampler.argmax(), max_tokens=512).collect_tokens()

dec = tools.Decoder(model)
match dec.feed(tokens):
    case tools.Event.Call(name=n, args=a):
        result = run_tool(n, a)
    case _:
        pass
const tokens = await ctx.generate(Sampler.argmax(), { maxTokens: 512 }).collectTokens();

const dec = new tools.Decoder(model);
const ev = dec.feed(tokens);
if (ev.type === 'call') {
  const result = runTool(ev.name, ev.args);
}
Forcing valid tool calls
To force the model to produce a syntactically valid call, attach the model's native grammar to the generator. tools::native_grammar returns a compiled grammar over the tool-call format; pass it through a schema constraint.
- Rust
- Python
- JavaScript
use inferlet::{tools, sample::Sampler};

if let Some(grammar) = tools::native_grammar(&model, &[calculator_schema]) {
    let response = ctx
        .generate(Sampler::Argmax)
        .max_tokens(512)
        .constrain_with(&grammar)? // &Grammar implements Schema
        .collect_text()
        .await?;
}
from inferlet import Sampler, GrammarConstraint, tools

matcher = tools.native_matcher(model, [json.dumps(calculator_schema)])
if matcher:
    response = await ctx.generate(
        Sampler.argmax(),
        max_tokens=512,
        constrain=GrammarConstraint(matcher),
    ).collect_text()
import { Sampler, GrammarConstraint, tools } from 'inferlet';

const matcher = tools.nativeMatcher(model, [JSON.stringify(calculatorSchema)]);
if (matcher) {
  const response = await ctx.generate(Sampler.argmax(), {
    maxTokens: 512,
    constrain: new GrammarConstraint(matcher),
  }).collectText();
}
The Rust tools::native_grammar returns a compiled Grammar; the Python tools.native_matcher and JavaScript tools.nativeMatcher return a stateful Matcher (wrap with GrammarConstraint). All three return None / undefined for models without a native tool-call template — fall back to the hand-rolled approach in that case.
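One way to wire that fallback in Rust, reusing the pieces above (a sketch, not the only arrangement):
if let Some(grammar) = tools::native_grammar(&model, &[calculator_schema.clone()]) {
    // Native template: equip the tools and constrain decoding to the call grammar.
    let prefix = tools::equip_prefix(&model, &[calculator_schema])?;
    ctx.append(&prefix);
    ctx.user("What is 15 * 37?").cue();
    let response = ctx
        .generate(Sampler::Argmax)
        .max_tokens(512)
        .constrain_with(&grammar)?
        .collect_text()
        .await?;
    // Parse with tools::parse_call or stream through tools::Decoder.
} else {
    // No native template: fall back to the hand-rolled ReAct prompt and parser.
    ctx.system(SYSTEM_PROMPT).user("What is 15 * 37?").cue();
    // ...run the hand-rolled loop with parse_action / run_tool...
}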
Next
- MCP: connect to external tool servers.
- Constrained generation: grammar and schema constraints in detail.
- Tutorial: build a parallel research agent: parallel HTTP fetches as tool calls.