vLLM

type = "vllm". Pie wraps vLLM's EngineArgs; field names mirror vLLM verbatim so values flow through.

Pick vLLM when you need a model that Pie's embedded drivers do not support yet but vLLM does, or when you want vLLM's mature decode-batch throughput on a stock setup. Inferlet features that depend on raw forward-pass control (custom attention masks, page-trim, raw-logits samplers) are not available through vLLM; those require a Pie-controlled driver such as cuda_native or dev.

Install

pie driver vllm install ~/.pie/venvs/vllm --run
pie driver vllm set venv ~/.pie/venvs/vllm
pie driver vllm doctor

This installs pie-driver-vllm into a Python 3.12 virtual environment. vLLM, Torch, and FlashInfer pins are co-resolved; expect a large install.
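
If doctor flags the environment, a quick manual check is to ask the venv's interpreter for its version directly (the path below assumes the install location used above):

~/.pie/venvs/vllm/bin/python --version   # expect Python 3.12.x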

Configuration

[model.driver]
type = "vllm"
device = ["cuda:0"]
activation_dtype = "bfloat16"

[model.driver.options]
venv = "/home/me/.pie/venvs/vllm"
attention_backend = "FLASHINFER" # FLASH_ATTN / TRITON_ATTN / FLEX_ATTENTION / …
gpu_memory_utilization = 0.9
max_num_seqs = 256
# max_num_batched_tokens = 8192 # default: vllm picks
# block_size = 16 # default: vllm picks per attention backend
enforce_eager = false # true disables torch.compile and CUDA graphs

# n-gram speculative decoding (driver-supplied drafts)
spec_ngram_enabled = false
spec_ngram_num_drafts = 4
spec_ngram_min_n = 2
spec_ngram_max_n = 4

| Key | Default | Description |
| --- | --- | --- |
| attention_backend | unset | FLASHINFER / FLASH_ATTN / TRITON_ATTN / FLEX_ATTENTION / etc. Unset lets vLLM auto-pick per platform. |
| gpu_memory_utilization | 0.9 | Fraction of free GPU memory for KV cache + activations. |
| max_num_seqs | 256 | Max concurrent sequences in a batch. |
| max_num_batched_tokens | unset | Max tokens (across all sequences) in a batch. Unset = vLLM's default. |
| block_size | unset | KV cache block size override. Unset = vLLM picks from the attention backend's allowed sizes (FlashInfer: 16/32/64; FlashAttention: 16/32). |
| enforce_eager | false | Disable torch.compile and CUDA graphs. |
| spec_ngram_enabled | false | Enable n-gram speculative decoding (vLLM-side drafting). |
| spec_ngram_num_drafts | 4 | Drafts proposed per accepted iteration. |
| spec_ngram_min_n | 2 | Minimum n-gram match window. |
| spec_ngram_max_n | 4 | Maximum n-gram match window. |
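
As a sketch of how the speculative-decoding knobs combine, the options block below turns n-gram drafting on and widens the match window; the values are illustrative, not tuned recommendations:

[model.driver.options]
venv = "/home/me/.pie/venvs/vllm"
spec_ngram_enabled = true
spec_ngram_num_drafts = 4 # drafts proposed per iteration
spec_ngram_min_n = 2      # shortest context match that triggers a draft
spec_ngram_max_n = 6      # longest context match considered

n-gram drafting needs no separate draft model: candidates come from matching recent output against earlier context, so acceptance rates are best on repetitive or templated generations.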

Supported architectures

vLLM accepts anything in its model zoo. See the vLLM supported models list for the authoritative roster.

Quantization

vLLM has its own quantization knobs (quantization, kv_cache_dtype, etc.). Set them under [model.driver.options] exactly as vLLM expects.
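
As an illustrative sketch, loading an AWQ-quantized checkpoint with an FP8 KV cache could look like this; the value strings ("awq", "fp8_e5m2") are vLLM's own, and vLLM's quantization docs list the full set:

[model.driver.options]
venv = "/home/me/.pie/venvs/vllm"
quantization = "awq"        # weight quantization method, passed straight to vLLM
kv_cache_dtype = "fp8_e5m2" # store the KV cache in FP8 instead of the activation dtype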