vLLM

Experimental

The vLLM driver is a prototype. Interfaces, defaults, and capabilities change frequently.

type = "vllm". A subprocess Python driver that delegates the forward pass to vLLM. Pie wraps a curated subset of vLLM's EngineArgs; memory and batch capacity are resolved by the driver and reported back through DriverCapabilities.

Use this driver to run a model that the embedded cuda_native driver does not yet implement, or to run against vLLM's attention kernels.

Capabilities

The bridge runs vLLM's standard causal attention. It reports the following through DriverCapabilities:

  • supports_user_attention_mask = false. User-supplied attention masks are silently dropped.
  • supports_adapters = false. init_adapter, update_adapter, and load_adapter raise NotImplementedError.

Inferlets that need custom masks or adapter math must run on the cuda_native driver.
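
The capability flags above can drive a simple pre-flight check. The following is a minimal sketch in Python, not Pie's own API: the DriverCapabilities class here is a hypothetical stand-in that models only the two flags listed above, and missing_features is an illustrative helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriverCapabilities:
    """Hypothetical stand-in modelling only the two flags named above.

    The real structure also carries memory and batch limits.
    """
    supports_user_attention_mask: bool
    supports_adapters: bool


def missing_features(caps, needs_mask, needs_adapters):
    """Return the names of required features that the driver does not offer."""
    missing = []
    if needs_mask and not caps.supports_user_attention_mask:
        missing.append("user_attention_mask")
    if needs_adapters and not caps.supports_adapters:
        missing.append("adapters")
    return missing


# Values the vLLM driver reports, per the list above.
VLLM_CAPS = DriverCapabilities(
    supports_user_attention_mask=False,
    supports_adapters=False,
)

if __name__ == "__main__":
    # An inferlet that needs a custom mask should be routed to cuda_native.
    print(missing_features(VLLM_CAPS, needs_mask=True, needs_adapters=False))
```

Checking up front matters most for masks, because the vLLM driver drops them silently instead of raising an error.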

Install

pie driver vllm install ~/.pie/venvs/vllm --run
pie driver vllm set venv ~/.pie/venvs/vllm
pie driver vllm doctor

This installs pie-driver-vllm into a Python 3.12 virtual environment. The vLLM, Torch, and FlashInfer version pins are resolved together so that the three packages stay compatible. The install is large: roughly 5 to 10 GiB on disk.

Configuration

[model.driver]
type = "vllm"
device = ["cuda:0"]
activation_dtype = "bfloat16"
# Dedicated throughput workers can trade CPU for lower IPC wake latency.
# ipc_profile = "latency"

[model.driver.options]
venv = "/home/me/.pie/venvs/vllm"
attention_backend = "FLASHINFER" # FLASH_ATTN / TRITON_ATTN / FLEX_ATTENTION / …
gpu_memory_utilization = 0.9
enforce_eager = false # set true to disable torch.compile and CUDA graphs
max_num_seqs = 256 # optional active sequence cap
max_num_batched_tokens = 8192 # optional vLLM per-step token budget
max_model_len = 2048 # optional context length cap

# n-gram speculative decoding (driver-side drafts)
spec_ngram_enabled = false
spec_ngram_num_drafts = 4
spec_ngram_min_n = 2
spec_ngram_max_n = 4

| Key | Default | Description |
| --- | --- | --- |
| attention_backend | unset | FLASHINFER / FLASH_ATTN / TRITON_ATTN / FLEX_ATTENTION / etc. Unset lets vLLM auto-pick per platform. |
| gpu_memory_utilization | 0.9 | Fraction of free GPU memory for KV cache + activations. |
| enforce_eager | false | Disable torch.compile and CUDA graphs. |
| max_num_seqs | unset | Optional active sequence cap passed through to vLLM. |
| max_num_batched_tokens | unset | Optional per-step token budget passed through to vLLM. |
| max_model_len | unset | Optional context length cap passed through to vLLM. |
| spec_ngram_enabled | false | Enable n-gram speculative decoding. The driver maintains a per-session token history, proposes linear draft continuations, and the runtime verifies them in the shared batch path. The inferlet opts in to receiving drafts by calling output_speculative_tokens(true); otherwise drafts are dropped. |
| spec_ngram_num_drafts | 4 | Drafts proposed per accepted iteration. |
| spec_ngram_min_n | 2 | n-gram match window minimum. |
| spec_ngram_max_n | 4 | n-gram match window maximum. |
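
To make the spec_ngram_* options concrete, here is a minimal Python sketch of one common way to draft from an n-gram match over token history. It is an illustration, not the driver's actual code: the function name propose_ngram_drafts and the "longest window first, most recent match wins" policy are assumptions. Verification of the drafts by the runtime is left out.

```python
def propose_ngram_drafts(history, num_drafts=4, min_n=2, max_n=4):
    """Propose up to num_drafts tokens by matching the tail of the history.

    Assumed policy (not confirmed by the driver): try the longest window
    first, from max_n down to min_n. For each window size n, find the most
    recent earlier occurrence of the last n tokens and return the tokens
    that followed it. Return [] when no window matches.
    """
    if min_n < 1 or max_n < min_n or num_drafts < 1:
        raise ValueError("need 1 <= min_n <= max_n and num_drafts >= 1")

    # A window must leave at least one earlier token to match against.
    for n in range(min(max_n, len(history) - 1), min_n - 1, -1):
        tail = history[-n:]
        # Scan earlier start positions right to left, skipping the tail itself.
        for start in range(len(history) - n - 1, -1, -1):
            if history[start:start + n] == tail:
                return history[start + n:start + n + num_drafts]
    return []


if __name__ == "__main__":
    # The tail [1, 2, 3] appeared earlier and was followed by 4, 1, 2, 3.
    print(propose_ngram_drafts([1, 2, 3, 4, 1, 2, 3]))
```

Under this policy a larger spec_ngram_max_n demands a longer exact match before drafting, while spec_ngram_min_n sets how short a match is still trusted.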

The vLLM driver uses vLLM's resolved attention block size and scheduler limits, then reports them in DriverCapabilities during startup. Leave max_num_seqs and max_num_batched_tokens unset unless you need parity with a specific standalone vLLM benchmark or deployment policy. Set max_model_len when comparing with a standalone vLLM run that uses a shorter context length.
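
Option values can be sanity-checked before launch. The Python sketch below works on a plain dict that mirrors [model.driver.options]. The rules it applies (a fraction in (0, 1], positive integer caps, min_n no larger than max_n) are assumptions chosen for illustration, not the driver's own validation, and check_driver_options is a hypothetical helper name.

```python
def check_driver_options(options):
    """Return a list of human-readable problems; an empty list means none found.

    The rules here are illustrative assumptions, not the driver's own checks.
    """
    problems = []

    util = options.get("gpu_memory_utilization", 0.9)
    if not (0.0 < util <= 1.0):
        problems.append("gpu_memory_utilization must be in (0, 1]")

    # Optional caps: leaving them unset is fine; a set value must be a positive int.
    for key in ("max_num_seqs", "max_num_batched_tokens", "max_model_len"):
        value = options.get(key)
        if value is not None and (not isinstance(value, int) or value < 1):
            problems.append(f"{key} must be a positive integer when set")

    # The n-gram window is only checked when speculative decoding is enabled.
    if options.get("spec_ngram_enabled", False):
        min_n = options.get("spec_ngram_min_n", 2)
        max_n = options.get("spec_ngram_max_n", 4)
        if min_n < 1 or min_n > max_n:
            problems.append("need 1 <= spec_ngram_min_n <= spec_ngram_max_n")
        if options.get("spec_ngram_num_drafts", 4) < 1:
            problems.append("spec_ngram_num_drafts must be at least 1")

    return problems


if __name__ == "__main__":
    print(check_driver_options({"gpu_memory_utilization": 0.9, "max_model_len": 2048}))
```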

For dedicated throughput benchmarking, set [model.driver].ipc_profile = "latency". It uses the polling IPC path and can reduce scheduling overhead at the cost of a busy CPU thread. Leave it unset for the default balanced subprocess profile when CPU utilization matters.

Supported architectures

Any architecture in vLLM's model zoo can be loaded through this driver. See the vLLM supported models list for the authoritative roster.

Quantization

vLLM has its own quantization knobs (quantization, kv_cache_dtype, etc.). Set them under [model.driver.options] exactly as vLLM expects.