
TensorRT-LLM

Experimental

The stable TensorRT-LLM driver path uses TensorRT-LLM's high-level LLM.generate API. It suits standard causal token generation, but not Pie workloads that need low-level external KV pages or arbitrary per-position logits.

type = "tensorrt_llm" (alias: "tensorrt-llm"). A subprocess Python driver that delegates token generation to TensorRT-LLM.

Capabilities

The driver keeps a replayable token history per Pie context and requests one new token from TensorRT-LLM for each token-producing sampler. It reports:

  • supports_user_attention_mask = false
  • supports_adapters = false
  • tensor_parallel_size = 1 only; use multiple devices as data-parallel replicas.

Unsupported Pie features fail explicitly: custom attention/logit masks, adapter math, raw-logit/distribution/logprob/entropy probes, speculative verification, and context forks/restores that arrive without a replayable token prefix.
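
The replay behavior described above can be illustrated with a minimal sketch. It is not the driver's implementation: ReplayableContexts, MissingPrefixError, and the next_token callable are hypothetical names, and next_token stands in for whatever one-token call the driver makes into TensorRT-LLM.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a one-token generation call: it receives the
# full token history and returns the next token ID.
NextToken = Callable[[List[int]], int]


class MissingPrefixError(RuntimeError):
    """Raised when a context has no replayable token prefix."""


class ReplayableContexts:
    """Keeps one replayable token history per context ID (illustrative only)."""

    def __init__(self, next_token: NextToken) -> None:
        self._next_token = next_token
        self._histories: Dict[int, List[int]] = {}

    def extend(self, context_id: int, tokens: List[int]) -> None:
        # Record prompt tokens so the context can be replayed later.
        self._histories.setdefault(context_id, []).extend(tokens)

    def history(self, context_id: int) -> List[int]:
        return list(self._histories.get(context_id, []))

    def fork(self, source_id: int, target_id: int) -> None:
        # A fork is only possible when the source prefix is known.
        if not self._histories.get(source_id):
            raise MissingPrefixError(f"context {source_id} has no token prefix")
        self._histories[target_id] = list(self._histories[source_id])

    def step(self, context_id: int) -> int:
        # Fail explicitly instead of generating from an unknown prefix.
        history = self._histories.get(context_id)
        if not history:
            raise MissingPrefixError(f"context {context_id} has no token prefix")
        token = self._next_token(list(history))  # request exactly one token
        history.append(token)
        return token
```

Because each context carries its full history, any step can be reproduced by replaying the stored tokens, which is why a fork or restore without a prefix has nothing to replay and is rejected in this sketch.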

Install

sudo apt-get install -y libopenmpi-dev openmpi-bin
pie driver tensorrt-llm install ~/.pie/venvs/tensorrt_llm --run
pie driver tensorrt-llm set venv ~/.pie/venvs/tensorrt_llm
pie driver tensorrt-llm doctor

This installs pie-driver-tensorrt-llm into a Python 3.12 virtual environment. The wheel pins TensorRT-LLM 1.2.1, the Torch 2.9.1 wheels built for CUDA 12.8, and the CUDA 13 cuBLAS wheel needed by TensorRT-LLM's native bindings. TensorRT-LLM imports mpi4py, so the driver also needs a system MPI runtime (the libopenmpi-dev and openmpi-bin packages installed above); NVIDIA's TensorRT-LLM container already includes one.
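
As a rough way to inspect those prerequisites by hand, the following sketch reports what the current interpreter can see. It is an illustration only and does not reproduce what pie driver tensorrt-llm doctor checks; check_environment is a hypothetical helper, and the distribution names tensorrt_llm and torch are assumptions about how the packages are registered.

```python
import importlib.metadata
import importlib.util
import shutil
import sys
from typing import Dict, Optional, Union


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return importlib.metadata.version(distribution)
    except importlib.metadata.PackageNotFoundError:
        return None


def check_environment() -> Dict[str, Union[str, bool, None]]:
    """Collect the facts the install notes mention (illustrative helper)."""
    return {
        # The install notes describe a Python 3.12 virtual environment.
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        # Assumed distribution names; None means the package was not found.
        "tensorrt_llm": installed_version("tensorrt_llm"),
        "torch": installed_version("torch"),
        # TensorRT-LLM imports mpi4py, which needs a system MPI runtime.
        "mpi4py_importable": importlib.util.find_spec("mpi4py") is not None,
        "mpirun_on_path": shutil.which("mpirun") is not None,
    }


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

Run it with the interpreter inside the driver's virtual environment so that it reflects the packages the driver would actually import.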

Configuration

[model.driver]
type = "tensorrt_llm"
device = ["cuda:0"]
tensor_parallel_size = 1
activation_dtype = "bfloat16"

[model.driver.options]
venv = "/home/me/.pie/venvs/tensorrt_llm"
trust_remote_code = true
skip_tokenizer_init = true
# backend = "pytorch"
# attn_backend = "TRTLLM"
# enable_chunked_prefill = true

[model.driver.options.llm_kwargs]
# Passed through to tensorrt_llm.LLM(..., **llm_kwargs)

| Key | Default | Description |
| --- | --- | --- |
| venv / python | unset | Optional per-model interpreter override. |
| trust_remote_code | true | Forwarded to tensorrt_llm.LLM. |
| skip_tokenizer_init | true | Pie sends token IDs and reads token IDs back, so tokenizer init is skipped by default. |
| backend | unset | Optional TensorRT-LLM backend override. |
| attn_backend | unset | Optional TensorRT-LLM attention backend override. |
| enable_chunked_prefill | unset | Optional TensorRT-LLM chunked-prefill flag. |
| max_seq_len | unset | Optional TensorRT-LLM runtime sequence length cap. Use this to match benchmark baselines such as 2048; otherwise TensorRT-LLM may size runtime structures for the model's full context window. |
| max_batch_size | unset | Optional TensorRT-LLM runtime batch-size cap. |
| max_num_tokens | unset | Optional TensorRT-LLM runtime token cap. |
| kv_cache_free_gpu_memory_fraction | unset | Optional TensorRT-LLM KV-cache memory fraction. |
| llm_kwargs | {} | Version-specific passthrough to the TensorRT-LLM LLM constructor. |
| execution_mode | "generate" | "generate" uses public LLM.generate. "pyexecutor" is an experimental TensorRT-LLM 1.2.1 PyTorch-backend path that drives private PyExecutor primitives directly so TensorRT-owned KV stays resident across Pie decode steps. |
| pyexecutor_max_tokens | 4096 | Maximum continuation window for each private PyExecutor request before Pie recreates it from the full token history. |
| pyexecutor_worker_stop_timeout_s | 30.0 | Timeout while stopping TensorRT-LLM's background worker during pyexecutor initialization. |
| lookahead_tokens | 16 | Deterministic continuation chunk size, capped at 16 in the high-level driver. The driver returns one token per Pie step and buffers the rest; stochastic sampling still uses one-token calls. |
| max_concurrent_requests | 128 | Capability advertised to Pie's scheduler. |
| max_batched_tokens | 8192 | Capability advertised to Pie's scheduler. |
| virtual_kv_page_size | 16 | Virtual page size advertised to Pie. TensorRT-LLM owns its real KV cache. |
| virtual_total_pages | 65536 | Virtual page count advertised to Pie. TensorRT-LLM owns its real Kv cache. |
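
The lookahead_tokens behavior can be sketched as a small buffer in front of the generation call. This is a simplified illustration, not the driver's code: LookaheadBuffer and the generate_chunk callable are hypothetical, and the sketch assumes the token history is only ever modified through next_token, so buffered tokens stay valid.

```python
from collections import deque
from typing import Callable, Deque, List

MAX_LOOKAHEAD = 16  # the high-level driver caps the chunk size at 16

# Hypothetical stand-in for a generation call: given the token history and
# a token budget, it returns up to that many continuation tokens.
GenerateChunk = Callable[[List[int], int], List[int]]


class LookaheadBuffer:
    """Returns one token per step, buffering deterministic continuations."""

    def __init__(self, generate_chunk: GenerateChunk, lookahead_tokens: int = 16) -> None:
        self._generate = generate_chunk
        self._chunk = max(1, min(lookahead_tokens, MAX_LOOKAHEAD))
        self._pending: Deque[int] = deque()

    def next_token(self, history: List[int], deterministic: bool) -> int:
        if deterministic:
            if not self._pending:
                # One call produces a whole chunk; later steps drain it.
                self._pending.extend(self._generate(history, self._chunk))
        else:
            # Stochastic sampling cannot reuse a precomputed continuation.
            self._pending.clear()
            self._pending.extend(self._generate(history, 1))
        if not self._pending:
            raise RuntimeError("generation returned no tokens")
        token = self._pending.popleft()
        history.append(token)
        return token
```

With a chunk size of 16, sixteen consecutive deterministic steps cost a single generation call in this sketch, while every stochastic step still costs one call.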

Supported architectures

TensorRT-LLM accepts models supported by its upstream LLM API. See the TensorRT-LLM documentation for the authoritative supported-model list and backend-specific requirements.