Parallel Generation
Many deliberate prompting strategies, such as best-of-N, tree-of-thoughts, and graph-of-thoughts, require multiple parallel calls to the language model. Pie makes it highly efficient to implement these strategies.
Concurrency in inferlets
Asynchronous programming lets you write concurrent code that is both efficient and easy to read, and Pie supports it natively in inferlets. You can use Rust's async/await syntax to write asynchronous code, and the Pie runtime takes care of scheduling and executing the resulting tasks. You can also use primitives from the futures crate, such as futures::join! and futures::select!, to run multiple asynchronous tasks concurrently; both are illustrated below.
Here is an example that uses futures::join! to run multiple LLM calls in parallel:
use inferlet::{Args, Result, get_auto_model};
use inferlet::stop_condition::{StopCondition, ends_with_any, max_len};
use inferlet::Sampler;
use futures::future::join_all;

#[inferlet::main]
async fn main(_args: Args) -> Result<()> {
    let model = get_auto_model();

    let mut ctx_a = model.create_context();
    let mut ctx_b = model.create_context();
    let mut ctx_judge = model.create_context();

    ctx_a.fill_user("Write a warm, seasonal email subject line for a 20% off Fall skincare sale.");
    ctx_b.fill_user("Write a concise, urgency-driven email subject line for a 20% off Fall skincare sale.");

    // Run two generation tasks in parallel
    let (a, b) = futures::join!(
        ctx_a.generate(Sampler::top_p(0.7, 0.95), ends_with_any(model.eos_tokens())),
        ctx_b.generate(Sampler::top_k(30), ends_with_any(model.eos_tokens()))
    );

    ctx_judge.fill_user(&format!(
        "A: {} vs. B: {} \n Which will get more opens?",
        a.trim(), b.trim()
    ));

    let result = ctx_judge.generate(Sampler::greedy(), max_len(80)).await;
    println!("Judge decision: {}", result.trim());

    Ok(())
}
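When you only need the first result, futures::select! works the same way. The following is a minimal sketch, not taken from the Pie examples, that races two generations and keeps whichever finishes first; it assumes generate returns an ordinary future that composes with the standard combinators from the futures crate:

use futures::FutureExt; // for .fuse(), required by futures::select!

let mut ctx_fast = model.create_context();
let mut ctx_slow = model.create_context();
ctx_fast.fill_user("Summarize the Fall sale announcement in one sentence.");
ctx_slow.fill_user("Summarize the Fall sale announcement in one detailed paragraph.");

// select! needs fused, pinned futures.
let fut_fast = ctx_fast.generate(Sampler::greedy(), ends_with_any(model.eos_tokens())).fuse();
let fut_slow = ctx_slow.generate(Sampler::greedy(), ends_with_any(model.eos_tokens())).fuse();
futures::pin_mut!(fut_fast, fut_slow);

// Whichever generation completes first wins; the other future is dropped.
let first_done = futures::select! {
    text = fut_fast => text,
    text = fut_slow => text,
};
println!("First result: {}", first_done.trim());

The fuse()/pin_mut! combination is a requirement of futures::select! itself, which only accepts fused, Unpin futures.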
Context forks
In many scenarios, you may want to create multiple contexts that share the same prefix. For example, in best-of-N decoding, you may want N contexts that all start from the same user prompt. Pie provides a convenient fork API that creates a new context sharing the original context's initial prompt.
This is more efficient than creating a new context and filling the prompt again, because the fork reuses the same KV cache pages.
Here is a simple best-of-N example using the fork API:
// Start with a single base context that sets up the shared prefix once.
let mut base = model.create_context();
base.fill_user("Write a catchy, on-brand email subject line for a 20% off Fall skincare sale.");

// Fork N contexts that share the same prefix (KV cache) as `base`.
let n = 5usize;
let mut forks: Vec<_> = (0..n).map(|_| base.fork()).collect();

// Launch N generations in parallel. Each fork can tweak sampling a bit if desired.
let gens = forks.iter_mut().enumerate().map(|(i, ctx)| {
    // Slightly vary sampler parameters per fork to diversify candidates.
    // (Feel free to swap strategies: top_k, temperature, etc.)
    let sampler = if i % 2 == 0 {
        Sampler::top_p(0.7, 0.95)
    } else {
        Sampler::top_k(40)
    };
    ctx.generate(sampler, ends_with_any(model.eos_tokens()).and(max_len(64)))
});

let candidates: Vec<String> = join_all(gens)
    .await
    .into_iter()
    .map(|s| s.trim().to_string())
    .collect();

// Judge the candidates in a new context.
// Omitted for brevity; see previous example.
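The judging step is omitted above; for completeness, a minimal sketch that mirrors the first example might look like the following (the prompt wording is purely illustrative):

// Number the candidates and ask a fresh context to pick one.
let listing = candidates
    .iter()
    .enumerate()
    .map(|(i, c)| format!("{}. {}", i + 1, c))
    .collect::<Vec<_>>()
    .join("\n");

let mut ctx_judge = model.create_context();
ctx_judge.fill_user(&format!(
    "Candidate subject lines:\n{}\nWhich one will get the most opens? Answer with its number.",
    listing
));

let verdict = ctx_judge.generate(Sampler::greedy(), max_len(80)).await;
println!("Judge decision: {}", verdict.trim());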
What happens when forking a context?
When you call the fork method on a context, Pie first calls ctx.flush() to ensure that all pending tokens have been processed and the KV cache is up to date. It then creates a new context that shares the same KV cache pages as the original. The forked context therefore reads the same cached prefix as the original, but it keeps its own independent state for subsequent generation.
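As a minimal illustration of this behavior, using only the calls shown earlier, a parent context and its fork can decode concurrently from the shared prefix without interfering with each other:

let mut base = model.create_context();
base.fill_user("Write a tagline for a 20% off Fall skincare sale.");

// fork() flushes `base` so its KV cache is complete, then shares those pages.
let mut branch = base.fork();

// Both contexts decode from the same cached prefix, but each keeps its own
// generation state, so the outputs can diverge.
let (from_base, from_branch) = futures::join!(
    base.generate(Sampler::top_p(0.7, 0.95), max_len(32)),
    branch.generate(Sampler::top_k(40), max_len(32))
);
println!("base:   {}", from_base.trim());
println!("branch: {}", from_branch.trim());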