vLLM
type = "vllm". Pie wraps vLLM's EngineArgs; field names mirror vLLM verbatim so values flow through.
Pick vLLM when you need a model that Pie's embedded drivers do not yet support but vLLM does, or when you want vLLM's mature decode-batch throughput on stock setups. Inferlet features that depend on raw forward-pass control (custom attention masks, page-trim, raw-logits samplers) are not available through vLLM; those require a Pie-controlled driver such as cuda_native or dev.
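If you do need those forward-pass hooks, the switch happens in the same [model.driver] table shown under Configuration below. A minimal sketch, assuming cuda_native accepts the same device and activation_dtype fields as the vLLM example (not verified here):
```toml
# Sketch only: selecting a Pie-controlled driver instead of vLLM.
# Field layout assumed to match the vLLM driver's; check the cuda_native docs.
[model.driver]
type = "cuda_native"          # or "dev"
device = ["cuda:0"]
activation_dtype = "bfloat16"
```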
Install
```bash
pie driver vllm install ~/.pie/venvs/vllm --run
pie driver vllm set venv ~/.pie/venvs/vllm
pie driver vllm doctor
```
This installs pie-driver-vllm into a Python 3.12 virtual environment. vLLM, Torch, and FlashInfer pins are co-resolved; expect a large install.
Configuration
```toml
[model.driver]
type = "vllm"
device = ["cuda:0"]
activation_dtype = "bfloat16"
[model.driver.options]
venv = "/home/me/.pie/venvs/vllm"
attention_backend = "FLASHINFER" # FLASH_ATTN / TRITON_ATTN / FLEX_ATTENTION / …
gpu_memory_utilization = 0.9
max_num_seqs = 256
# max_num_batched_tokens = 8192 # default: vllm picks
# block_size = 16 # default: vllm picks per attention backend
enforce_eager = false              # set true to disable CUDA graphs
# n-gram speculative decoding (driver-supplied drafts)
spec_ngram_enabled = false
spec_ngram_num_drafts = 4
spec_ngram_min_n = 2
spec_ngram_max_n = 4
```
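Most of these keys are optional. A minimal sketch that leans on vLLM's own defaults (path illustrative; the venv can also be registered once via pie driver vllm set venv, as in the Install step):
```toml
# Minimal sketch: unset options fall back to vLLM's defaults (see the table below).
[model.driver]
type = "vllm"
device = ["cuda:0"]

[model.driver.options]
venv = "/home/me/.pie/venvs/vllm"
```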
| Key | Default | Description |
|---|---|---|
| attention_backend | unset | FLASHINFER / FLASH_ATTN / TRITON_ATTN / FLEX_ATTENTION / etc. Unset lets vLLM auto-pick per platform. |
| gpu_memory_utilization | 0.9 | Fraction of free GPU memory for KV cache + activations. |
| max_num_seqs | 256 | Max concurrent sequences in a batch. |
| max_num_batched_tokens | unset | Max tokens (across all sequences) in a batch. Unset = vLLM's default. |
| block_size | unset | KV cache block size override. Unset = vLLM picks based on the attention backend's allowed sizes (FlashInfer: 16/32/64; FlashAttention: 16/32). |
| enforce_eager | false | Set true to disable torch.compile and CUDA graphs. |
| spec_ngram_enabled | false | Enable n-gram speculative decoding (vLLM-side drafting). |
| spec_ngram_num_drafts | 4 | Draft tokens proposed per decode iteration. |
| spec_ngram_min_n | 2 | Minimum n-gram match window. |
| spec_ngram_max_n | 4 | Maximum n-gram match window. |
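To show how the batching knobs combine, here is a throughput-leaning sketch; the numbers are illustrative placeholders, not recommendations, and the right values depend on GPU memory and sequence lengths:
```toml
# Throughput-leaning sketch (placeholder values): bigger decode batches,
# explicit token budget per step, CUDA graphs left enabled.
[model.driver.options]
venv = "/home/me/.pie/venvs/vllm"
attention_backend = "FLASHINFER"
gpu_memory_utilization = 0.95     # leave some headroom for other processes
max_num_seqs = 512                # more concurrent sequences per batch
max_num_batched_tokens = 16384    # cap on tokens scheduled per step
block_size = 32                   # must be one of FlashInfer's allowed sizes (16/32/64)
enforce_eager = false             # keep torch.compile / CUDA graphs on
```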
Supported architectures
vLLM accepts anything in its model zoo. See the vLLM supported models list for the authoritative roster.
Quantization
vLLM has its own quantization knobs (quantization, kv_cache_dtype, etc.). Set them under [model.driver.options] exactly as vLLM expects.
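For example, a sketch that loads a pre-quantized checkpoint and keeps the KV cache in FP8; the quantization value must match how the checkpoint was produced, and FP8 KV cache support depends on the attention backend and GPU:
```toml
# Sketch: vLLM's own quantization knobs, passed through untouched.
[model.driver.options]
venv = "/home/me/.pie/venvs/vllm"
quantization = "awq"        # must match the checkpoint (awq, gptq, fp8, ...)
kv_cache_dtype = "fp8"      # needs backend/GPU support for FP8 KV cache
```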