vLLM
The vLLM driver is a prototype. Interfaces, defaults, and capabilities change frequently.
type = "vllm". A subprocess Python driver that delegates the forward pass to vLLM. Pie wraps a curated subset of vLLM's EngineArgs; memory and batch capacity are resolved by the driver and reported back through DriverCapabilities.
Use this driver to run a model that the embedded cuda_native driver does not yet implement, or to run against vLLM's attention kernels.
Capabilities
The bridge runs vLLM's standard causal attention. It reports the following through DriverCapabilities:
supports_user_attention_mask = false. User-supplied attention masks are silently dropped.
supports_adapters = false. init_adapter, update_adapter, and load_adapter raise NotImplementedError.
Inferlets that need custom masks or adapter math must run on the cuda_native driver.
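Because masks are dropped silently rather than rejected, it can help to check requirements before dispatch. The sketch below is illustrative only: the DriverCapabilities dataclass is a stand-in that models just the two flags named above, and missing_features is a hypothetical helper, not part of Pie's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriverCapabilities:
    # Stand-in for the real structure; only the two flags named above are modeled.
    supports_user_attention_mask: bool
    supports_adapters: bool


# Values the vLLM driver reports, per the text above.
VLLM_CAPS = DriverCapabilities(
    supports_user_attention_mask=False,
    supports_adapters=False,
)


def missing_features(caps, needs_mask, needs_adapters):
    """Return the features an inferlet needs that the driver lacks."""
    missing = []
    if needs_mask and not caps.supports_user_attention_mask:
        missing.append("user_attention_mask")
    if needs_adapters and not caps.supports_adapters:
        missing.append("adapters")
    return missing
```

A non-empty result for an inferlet means it should be routed to the cuda_native driver instead.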
Install
pie driver vllm install ~/.pie/venvs/vllm --run
pie driver vllm set venv ~/.pie/venvs/vllm
pie driver vllm doctor
This installs pie-driver-vllm into a Python 3.12 virtual environment. vLLM, Torch, and FlashInfer pins are co-resolved. The install is large: roughly 5 to 10 GiB on disk.
Configuration
[model.driver]
type = "vllm"
device = ["cuda:0"]
activation_dtype = "bfloat16"
# Dedicated throughput workers can trade CPU for lower IPC wake latency.
# ipc_profile = "latency"
[model.driver.options]
venv = "/home/me/.pie/venvs/vllm"
attention_backend = "FLASHINFER" # FLASH_ATTN / TRITON_ATTN / FLEX_ATTENTION / …
gpu_memory_utilization = 0.9
enforce_eager = false # set to true to disable torch.compile and CUDA graphs
max_num_seqs = 256 # optional active sequence cap
max_num_batched_tokens = 8192 # optional vLLM per-step token budget
max_model_len = 2048 # optional context length cap
# n-gram speculative decoding (driver-side drafts)
spec_ngram_enabled = false
spec_ngram_num_drafts = 4
spec_ngram_min_n = 2
spec_ngram_max_n = 4
| Key | Default | Description |
|---|---|---|
| attention_backend | unset | FLASHINFER / FLASH_ATTN / TRITON_ATTN / FLEX_ATTENTION / etc. Unset lets vLLM auto-pick per platform. |
| gpu_memory_utilization | 0.9 | Fraction of free GPU memory for KV cache + activations. |
| enforce_eager | false | Disable torch.compile and CUDA graphs. |
| max_num_seqs | unset | Optional active sequence cap passed through to vLLM. |
| max_num_batched_tokens | unset | Optional per-step token budget passed through to vLLM. |
| max_model_len | unset | Optional context length cap passed through to vLLM. |
| spec_ngram_enabled | false | Enable n-gram speculative decoding. The driver maintains a per-session token history, proposes linear draft continuations, and the runtime verifies them in the shared batch path. The inferlet opts in to receiving drafts by calling output_speculative_tokens(true); otherwise drafts are dropped. |
| spec_ngram_num_drafts | 4 | Drafts proposed per accepted iteration. |
| spec_ngram_min_n | 2 | n-gram match window minimum. |
| spec_ngram_max_n | 4 | n-gram match window maximum. |
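To make the spec_ngram_* options concrete, here is a minimal sketch of a generic n-gram draft proposer. It is not the driver's implementation: the function name, the longest-match-first order, and the preference for the most recent match are assumptions chosen for illustration. Only the three parameters mirror the options in the table.

```python
def propose_ngram_drafts(history, num_drafts=4, min_n=2, max_n=4):
    """Propose up to num_drafts tokens by matching the history's suffix.

    Tries the longest suffix first (max_n tokens) down to min_n, looks for
    the most recent earlier occurrence of that suffix, and copies the tokens
    that followed it as a linear draft continuation.
    """
    # A suffix of length n needs at least one earlier position to match.
    for n in range(min(max_n, len(history) - 1), min_n - 1, -1):
        suffix = history[-n:]
        # Scan backwards so the most recent earlier match wins.
        for start in range(len(history) - n - 1, -1, -1):
            if history[start:start + n] == suffix:
                follow = history[start + n:start + n + num_drafts]
                if follow:
                    return follow
    # No n-gram in the window matched: propose nothing.
    return []


# The suffix [1, 2] also appears at the start, so the draft copies what followed it.
drafts = propose_ngram_drafts([1, 2, 3, 4, 1, 2])
```

In this sketch a wider window (larger spec_ngram_max_n) yields fewer but more specific matches, while a smaller spec_ngram_min_n proposes drafts more often at the cost of more rejected tokens during verification.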
The vLLM driver uses vLLM's resolved attention block size and scheduler limits,
then reports them in DriverCapabilities during startup. Leave
max_num_seqs and max_num_batched_tokens unset unless you need parity with a
specific standalone vLLM benchmark or deployment policy. Set max_model_len
when comparing with a standalone vLLM run that uses a shorter context length.
For dedicated throughput benchmarking, set [model.driver].ipc_profile = "latency". It uses the polling IPC path and can reduce scheduling overhead at
the cost of a busy CPU thread. Leave it unset for the default balanced
subprocess profile when CPU utilization matters.
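The "leave unset" guidance for the optional caps can be pictured as a simple pass-through filter. This is a sketch under assumptions: forwarded_caps is a hypothetical helper, and None stands in for "unset"; the driver's actual handling may differ.

```python
# Optional caps from the table above; None stands for "unset" in this sketch.
OPTIONAL_CAPS = ("max_num_seqs", "max_num_batched_tokens", "max_model_len")


def forwarded_caps(options):
    """Keep only the optional caps the user actually set.

    Hypothetical helper: omitted keys are simply not forwarded, so vLLM
    resolves those limits on its own.
    """
    return {
        key: options[key]
        for key in OPTIONAL_CAPS
        if options.get(key) is not None
    }


# Parity run against a standalone vLLM benchmark with a 2048-token context.
parity = forwarded_caps({"gpu_memory_utilization": 0.9, "max_model_len": 2048})
```

Here only max_model_len is forwarded; the sequence cap and token budget stay with vLLM's resolved values.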
Supported architectures
vLLM accepts anything in its model zoo. See the vLLM supported models list for the authoritative roster.
Quantization
vLLM has its own quantization knobs (quantization, kv_cache_dtype, etc.). Set them under [model.driver.options] exactly as vLLM expects.