TensorRT-LLM

Experimental

The stable TensorRT-LLM driver path uses TensorRT-LLM's high-level LLM.generate API. It suits standard causal token generation, but not Pie workloads that need low-level external KV pages or arbitrary per-position logits.

type = "tensorrt_llm" (alias: "tensorrt-llm"). A subprocess Python driver that delegates token generation to TensorRT-LLM.

Capabilities

The driver keeps a replayable token history per Pie context and requests one new token from TensorRT-LLM for each token-producing sampler. It reports:

supports_user_attention_mask = false
supports_adapters = false
tensor_parallel_size = 1 only; use multiple devices as data-parallel replicas.

Unsupported Pie features fail explicitly: custom attention/logit masks, adapter math, raw-logit/distribution/logprob/entropy probes, speculative verification, and context forks/restores that arrive without a replayable token prefix.
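The replayable-history idea can be sketched as follows. This is a minimal illustration, not the driver's code: `ContextHistory`, `step`, and the `generate_one` callable are hypothetical names, and `generate_one` stands in for whatever backend call turns a token prefix into one new token.

```python
from dataclasses import dataclass, field


@dataclass
class ContextHistory:
    """Hypothetical per-context record: every token ID seen so far, in order."""

    tokens: list[int] = field(default_factory=list)

    def extend(self, new_tokens: list[int]) -> None:
        # Append prompt or previously generated tokens to the history.
        self.tokens.extend(new_tokens)

    def fork(self) -> "ContextHistory":
        # A fork stays replayable only because the full prefix is copied along.
        return ContextHistory(tokens=list(self.tokens))


def step(history: ContextHistory, generate_one) -> int:
    """Request exactly one new token for the full prefix and record it.

    `generate_one` is a stand-in (token IDs in, one token ID out),
    not a TensorRT-LLM API.
    """
    if not history.tokens:
        # Mirrors the documented failure for contexts without a replayable prefix.
        raise ValueError("context has no replayable token prefix")
    token = generate_one(list(history.tokens))
    history.tokens.append(token)
    return token
```

Because the backend only ever sees a plain token prefix, a forked context can be continued independently as long as its history was copied with it.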

Install

sudo apt-get install -y libopenmpi-dev openmpi-bin
pie driver tensorrt-llm install ~/.pie/venvs/tensorrt_llm --run
pie driver tensorrt-llm set venv ~/.pie/venvs/tensorrt_llm
pie driver tensorrt-llm doctor

This installs pie-driver-tensorrt-llm into a Python 3.12 virtual environment. The wheel pins TensorRT-LLM 1.2.1, the Torch 2.9.1 wheels built for CUDA 12.8, and the CUDA 13 cuBLAS wheel needed by TensorRT-LLM's native bindings. TensorRT-LLM imports mpi4py, so the driver also needs a system MPI runtime; NVIDIA's TensorRT-LLM container already includes one.
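To inspect these prerequisites by hand, a check along the following lines can be run with the virtual environment's interpreter. It is only a sketch: `check_environment` is a hypothetical helper, it is not what `pie driver tensorrt-llm doctor` does internally, and looking for `mpirun` on the PATH is an assumed proxy for "a system MPI runtime is installed".

```python
import importlib.util
import shutil
import sys


def check_environment(modules=("mpi4py", "tensorrt_llm")) -> dict[str, bool]:
    """Report whether the documented prerequisites appear to be present."""
    # The install step above targets a Python 3.12 virtual environment.
    report = {"python_3_12": sys.version_info[:2] == (3, 12)}
    for name in modules:
        # find_spec locates a top-level module without importing it.
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    # Assumption: an MPI runtime exposes `mpirun` on the PATH.
    report["mpirun_on_path"] = shutil.which("mpirun") is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```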

Configuration

[model.driver]
type = "tensorrt_llm"
device = ["cuda:0"]
tensor_parallel_size = 1
activation_dtype = "bfloat16"

[model.driver.options]
venv = "/home/me/.pie/venvs/tensorrt_llm"
trust_remote_code = true
skip_tokenizer_init = true
# backend = "pytorch"
# attn_backend = "TRTLLM"
# enable_chunked_prefill = true

[model.driver.options.llm_kwargs]
# Passed through to tensorrt_llm.LLM(..., **llm_kwargs)

| Key | Default | Description |
| --- | --- | --- |
| `venv` / `python` | unset | Optional per-model interpreter override. |
| `trust_remote_code` | `true` | Forwarded to `tensorrt_llm.LLM`. |
| `skip_tokenizer_init` | `true` | Pie sends token IDs and reads token IDs back, so tokenizer init is skipped by default. |
| `backend` | unset | Optional TensorRT-LLM backend override. |
| `attn_backend` | unset | Optional TensorRT-LLM attention backend override. |
| `enable_chunked_prefill` | unset | Optional TensorRT-LLM chunked-prefill flag. |
| `max_seq_len` | unset | Optional TensorRT-LLM runtime sequence length cap. Use this to match benchmark baselines such as `2048`; otherwise TensorRT-LLM may size runtime structures for the model's full context window. |
| `max_batch_size` | unset | Optional TensorRT-LLM runtime batch-size cap. |
| `max_num_tokens` | unset | Optional TensorRT-LLM runtime token cap. |
| `kv_cache_free_gpu_memory_fraction` | unset | Optional TensorRT-LLM KV-cache memory fraction. |
| `llm_kwargs` | `{}` | Version-specific passthrough to the TensorRT-LLM `LLM` constructor. |
| `execution_mode` | `"generate"` | `"generate"` uses public `LLM.generate`. `"pyexecutor"` is an experimental TensorRT-LLM 1.2.1 PyTorch-backend path that drives private `PyExecutor` primitives directly so TensorRT-owned KV stays resident across Pie decode steps. |
| `pyexecutor_max_tokens` | `4096` | Maximum continuation window for each private `PyExecutor` request before Pie recreates it from the full token history. |
| `pyexecutor_worker_stop_timeout_s` | `30.0` | Timeout while stopping TensorRT-LLM's background worker during `pyexecutor` initialization. |
| `lookahead_tokens` | `16` | Deterministic continuation chunk size, capped at `16` in the high-level driver. The driver returns one token per Pie step and buffers the rest; stochastic sampling still uses one-token calls. |
| `max_concurrent_requests` | `128` | Capability advertised to Pie's scheduler. |
| `max_batched_tokens` | `8192` | Capability advertised to Pie's scheduler. |
| `virtual_kv_page_size` | `16` | Virtual page size advertised to Pie. TensorRT-LLM owns its real KV cache. |
| `virtual_total_pages` | `65536` | Virtual page count advertised to Pie. TensorRT-LLM owns its real KV cache. |
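
The `lookahead_tokens` behavior can be pictured as a small buffer in front of the backend. The following is a rough sketch under stated assumptions, not the driver's implementation: `LookaheadBuffer` and the `generate_chunk(prefix, n)` callable are hypothetical, and discarding buffered tokens when a stochastic step arrives is an illustrative choice.

```python
from collections import deque

MAX_LOOKAHEAD = 16  # cap documented for the high-level driver


class LookaheadBuffer:
    """Hand out one token per step while fetching deterministic chunks."""

    def __init__(self, generate_chunk, lookahead_tokens: int = 16):
        # `generate_chunk(prefix, n)` is a stand-in returning n token IDs.
        self._generate_chunk = generate_chunk
        self._chunk = max(1, min(lookahead_tokens, MAX_LOOKAHEAD))
        self._pending: deque[int] = deque()

    def next_token(self, history: list[int], deterministic: bool) -> int:
        if deterministic:
            n = self._chunk
        else:
            # Stochastic sampling uses one-token calls; drop any buffered tokens.
            self._pending.clear()
            n = 1
        if not self._pending:
            # Buffer is empty: ask the backend for the next chunk.
            self._pending.extend(self._generate_chunk(list(history), n))
        token = self._pending.popleft()
        history.append(token)
        return token
```

With the default chunk size, sixteen deterministic steps cost one backend call instead of sixteen, while each Pie step still receives exactly one token.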

Supported architectures

TensorRT-LLM accepts models supported by its upstream LLM API. See the TensorRT-LLM documentation for the authoritative supported-model list and backend-specific requirements.
