Interactive Chat
Oftentimes, the generation process is open-loop: the model must interact frequently with the user or with other agents. In this tutorial, we will learn how to write an inferlet that performs message passing across multiple turns of interaction, both with the user and with other running inferlets.
Message Passing
Inferlets may use two types of message passing APIs: (1) for user-inferlet interaction, and (2) for inferlet-inferlet interaction. They use slightly different message passing mechanisms, which we explain below.
User-Inferlet Interaction
User-inferlet interaction follows the Push/Pull pattern via the inferlet::send and inferlet::recv APIs. The send API sends a message to the user, while the recv API waits for a message from the user. The recv call is asynchronous, so it does not block the inferlet from doing other tasks while waiting for a message.
For instance, the following code snippet shows how a single round of message passing works:
use inferlet::{Args, Result, send, recv};

#[inferlet::main]
async fn main(mut args: Args) -> Result<()> {
    // Push a greeting to the user.
    send("Hello, user! How can I help you today?");
    // Suspend until the user replies; other tasks can run in the meantime.
    let user_message = recv().await;
    println!("User said: {}", user_message);
    Ok(())
}
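Because recv returns a future, the wait can be overlapped with other asynchronous work. The following is a minimal sketch, assuming the futures crate is available; prepare_greeting is a hypothetical helper introduced only for illustration:

use futures::join;
use inferlet::{Args, Result, send, recv};

#[inferlet::main]
async fn main(mut args: Args) -> Result<()> {
    // Hypothetical async helper, not part of the inferlet API.
    async fn prepare_greeting() -> String {
        "Hello, user!".to_string()
    }

    // Poll both futures concurrently within the same task: the inferlet keeps
    // making progress on prepare_greeting() while no user message has arrived.
    let (user_message, greeting) = join!(recv(), prepare_greeting());
    send(&greeting);
    println!("User said: {}", user_message);
    Ok(())
}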
On the client side, you can use the Python or JavaScript API to interact with the inferlet. For example, the following Python snippet launches the inferlet and exchanges one round of messages:
# Launch the inferlet and wait for its first message.
instance = await client.launch_instance(program_hash, arguments=[])
event, message = await instance.recv()
print(f"Received message '{message}'")
# Reply to the inferlet's pending recv() call.
instance.send("I need help with my homework.")
Inferlet-Inferlet Interaction
Inter-inferlet message passing follows the Pub/Sub pattern via the inferlet::broadcast and inferlet::subscribe APIs. The broadcast API publishes a message to a topic, while the subscribe API subscribes to a topic and waits for messages published to it. Like recv, the subscribe call is asynchronous, so it does not block the inferlet from doing other tasks while waiting for a message.
The following code snippet shows how two inferlets can interact with each other via message passing:
use inferlet::{Args, Result, broadcast, subscribe};

#[inferlet::main]
async fn main(mut args: Args) -> Result<()> {
    // Launch the same program twice with different ranks, e.g., --rank 0 and --rank 1.
    let rank = args.value_from_str(["-r", "--rank"]).unwrap_or(0);
    if rank == 0 {
        // Inferlet A publishes a message to "topic1".
        broadcast("Hello from Inferlet A!", "topic1");
    } else {
        // Inferlet B waits for a message published to "topic1".
        let msg = subscribe("topic1").await;
        println!("Inferlet B received: {}", msg);
    }
    Ok(())
}
Note that the launch order of the inferlets matters here: the inferlet that subscribes to a topic must be launched before the inferlet that broadcasts to it. Otherwise, the subscriber will wait indefinitely for a message that was published before it subscribed.
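If a topic carries more than one message, the subscriber can simply await again. Whether each subscribe call yields the next pending message on the topic is an assumption here; a minimal sketch:

// Assumes each subscribe("topic1") call resolves with the next message on the topic.
loop {
    let msg = subscribe("topic1").await;
    if msg == "done" {
        break; // illustrative sentinel chosen by the application
    }
    println!("received: {}", msg);
}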
Stateful Context Across Turns
Leveraging the message passing APIs, we can build a stateful chat inferlet that maintains its context across multiple turns of interaction. This is particularly useful for long conversations, where the context grows large and re-filling it from scratch incurs significant time-to-first-token (TTFT) latency.
use inferlet::{Args, Result, send, recv, get_auto_model};
use inferlet::stop_condition::{StopCondition, ends_with_any, max_len};
use inferlet::Sampler;

#[inferlet::main]
async fn main(mut args: Args) -> Result<()> {
    let model = get_auto_model();
    // The context (and its KV cache) lives across all turns of the conversation.
    let mut ctx = model.create_context();
    ctx.fill_system("You are a helpful, respectful and honest assistant.");

    send("Hello, user! How can I help you today?");
    loop {
        let user_message = recv().await;
        if user_message.trim().to_lowercase() == "exit" {
            send("Goodbye!");
            break;
        }
        // Append only the new user turn; earlier turns are already in the context.
        ctx.fill_user(&user_message);
        let sampler = Sampler::top_p(0.6, 0.95);
        let stop_cond = max_len(256).or(ends_with_any(model.eos_tokens()));
        let response = ctx.generate(sampler, stop_cond).await;
        send(&response);
    }
    Ok(())
}
In this example, we don't need to call ctx.fill_assistant after generating the response, because ctx.generate already fills the generated tokens into the context.
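Manual filling is only needed when the assistant text comes from somewhere other than ctx.generate. For instance, if another inferlet produced the reply, a sketch might look like this (the "replies" topic is a hypothetical example):

// The reply was generated elsewhere, so record it in the context ourselves.
let reply = subscribe("replies").await; // hypothetical topic name
ctx.fill_assistant(&reply);
send(&reply);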
Further Optimizations
The above inferlet trades off memory for latency. You can further optimize it with application-specific context-retention strategies. Viable strategies include:
- Adaptive strategy: Prefill latency is only noticeable when the context is very large, so you can use a threshold-based strategy that retains the context only when it exceeds a certain size (see the sketch after this list).
- Predictive strategy: If you have good statistics on when the user will send the next message (users take time to think and type!), you can drop the context first and proactively re-fill it when you predict the user is about to send a message.
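As an illustration, here is a minimal sketch of the adaptive strategy. It reuses the API from the chat example above; the Context and Model type names, the num_tokens accessor, and the transcript bookkeeping are assumptions made for illustration:

use inferlet::{Context, Model};

const RETAIN_THRESHOLD: usize = 4096; // illustrative cutoff; tune per application

// Rebuild a context from the saved transcript (cheap while the chat is short).
fn rebuild_context(model: &Model, transcript: &[(String, String)]) -> Context {
    let mut ctx = model.create_context();
    ctx.fill_system("You are a helpful, respectful and honest assistant.");
    for (user, assistant) in transcript {
        ctx.fill_user(user);
        ctx.fill_assistant(assistant);
    }
    ctx
}

// At the end of each turn, keep the context resident only once it is large.
// `num_tokens` is an assumed accessor for the context's current token count.
fn end_of_turn(ctx: Context) -> Option<Context> {
    if ctx.num_tokens() >= RETAIN_THRESHOLD {
        Some(ctx) // re-filling would dominate TTFT, so keep the KV cache
    } else {
        None // free the KV cache; rebuild_context() restores it next turn
    }
}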
These are just a few examples. The gist is that Pie gives you the flexibility to implement whatever strategy best fits your application.