SGLang

type = "sglang". Option names match SGLang's ServerArgs fields verbatim, so values pass straight through to SGLang.

Pick SGLang when you want strong custom-mask support or its n-gram speculative decoding. As with vLLM, some inferlet features that depend on raw forward-pass control are not available through SGLang; those require a Pie-controlled driver such as cuda_native or dev.

Install

pie driver sglang install ~/.pie/venvs/sglang --run
pie driver sglang set venv ~/.pie/venvs/sglang
pie driver sglang doctor

This installs pie-driver-sglang into a Python 3.12 virtual environment. SGLang, Torch, and FlashInfer pins are co-resolved; expect a large install.

Configuration

[model.driver]
type = "sglang"
device = ["cuda:0"]
activation_dtype = "bfloat16"

[model.driver.options]
venv = "/home/me/.pie/venvs/sglang"
attention_backend       = "triton"       # triton / flashinfer / fa3 / fa4 / …
mem_fraction_static     = 0.65
page_size               = 16
disable_radix_cache     = true
disable_cuda_graph      = false
# cuda_graph_max_bs       = 256          # default: sglang auto-picks
# max_running_requests    = 256          # default: sglang auto-picks
# max_total_tokens        = 65536        # default: sglang auto-picks
# chunked_prefill_size    = 8192         # default: sglang auto-picks
kv_cache_dtype          = "auto"
trust_remote_code       = true
# context_length          = 8192         # default: read from HF config

# Pinned-host KV pool for D2H/H2D swap (GiB). 0 = disabled.
cpu_mem_budget_in_gb    = 0

# n-gram speculative decoding (driver-supplied drafts)
spec_ngram_enabled      = false
spec_ngram_num_drafts   = 4
spec_ngram_max_depth    = 18
spec_ngram_capacity     = 1_000_000

Key	Default	Description
`attention_backend`	`"triton"`	`triton` / `flashinfer` / `flex_attention` / `fa3` / `fa4` / `aiter` / `wave` / `torch_native` / etc. `triton` works on any NVIDIA GPU with SM 7.5 or newer and is stable across sglang versions.
`mem_fraction_static`	`0.65`	Fraction of free GPU memory reserved for KV cache + activations. Lower than sglang's standalone `0.88` because pie's KV-rebind allocates a parallel tensor in pie's canonical layout.
`page_size`	`16`	KV cache page size override. Unset lets sglang pick.
`disable_radix_cache`	`true`	Disable SGLang's radix cache. Pie owns prefix sharing through its scheduler, so the default avoids duplicated caching work while still allowing explicit experiments.
`disable_cuda_graph`	`false`	Set to `true` to run eagerly (no `torch.compile`, no CUDA graphs).
`cuda_graph_max_bs`	unset	Override the largest CUDA-graph batch-size bin sglang captures. Unset = sglang's auto-pick.
`max_running_requests`	unset	Cap on simultaneously-running requests. Unset = sglang's auto-pick based on `max_total_tokens`.
`max_total_tokens`	unset	Cap on total tokens (across requests) per fire_batch. Unset = sglang's auto-pick.
`chunked_prefill_size`	unset	Chunked-prefill size override.
`kv_cache_dtype`	`"auto"`	KV cache element dtype. `auto` inherits the activation dtype.
`trust_remote_code`	`true`	Trust user-supplied remote code in HF repos (needed for some models).
`context_length`	unset	Explicit context length cap. Unset reads from HF config.
`cpu_mem_budget_in_gb`	`0`	Pinned-host KV pool for D2H/H2D swap, in GiB. `0` disables swap. (Pie knob, not an sglang ServerArgs field.)
`spec_ngram_enabled`	`false`	Enable n-gram speculative decoding (sglang-side drafting). The inferlet opts in to seeing drafts via `output_speculative_tokens(true)` on its forward pass; otherwise they are dropped.
`spec_ngram_num_drafts`	`4`	Drafts proposed per accepted iteration.
`spec_ngram_max_depth`	`18`	Maximum n-gram trie depth.
`spec_ngram_capacity`	`1_000_000`	Approximate node budget for the trie.
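
To make the `spec_ngram_*` options more concrete, here is a toy illustration of n-gram drafting in Python. It is not SGLang's implementation: the NgramDrafter class is hypothetical, it uses a flat dictionary instead of a trie, and it returns one greedy draft rather than several candidates. It only shows the general idea of proposing tokens that followed the same context earlier in the sequence.

```python
class NgramDrafter:
    """Toy n-gram drafter (hypothetical, for illustration only)."""

    def __init__(self, max_context: int = 3):
        self.max_context = max_context
        # Maps a context tuple to the token that most recently followed it.
        self.table: dict[tuple[int, ...], int] = {}

    def observe(self, tokens: list[int]) -> None:
        """Record which token followed each context of length 1..max_context."""
        for i in range(1, len(tokens)):
            for n in range(1, self.max_context + 1):
                if i - n < 0:
                    break
                self.table[tuple(tokens[i - n:i])] = tokens[i]

    def draft(self, tokens: list[int], num_tokens: int) -> list[int]:
        """Greedily extend `tokens`, preferring the longest matching context."""
        out: list[int] = []
        history = list(tokens)
        for _ in range(num_tokens):
            nxt = None
            for n in range(min(self.max_context, len(history)), 0, -1):
                nxt = self.table.get(tuple(history[-n:]))
                if nxt is not None:
                    break
            if nxt is None:
                break  # no known continuation, stop drafting
            out.append(nxt)
            history.append(nxt)
        return out


drafter = NgramDrafter(max_context=3)
drafter.observe([1, 2, 3, 4, 1, 2, 3, 5])
print(drafter.draft([9, 1, 2], num_tokens=2))  # prints [3, 5]
```

Speculative decoding in general relies on the target model verifying drafted tokens, so a wrong draft costs throughput rather than correctness. In Pie, the inferlet sees the drafts only if it opts in via `output_speculative_tokens(true)`.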

Supported architectures

SGLang accepts anything in its model zoo. See the SGLang supported models list for the authoritative roster.

Quantization

SGLang has its own quantization knobs (quantization, kv_cache_dtype, etc.). Set them under [model.driver.options] exactly as SGLang expects.
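
As a rough guide to what kv_cache_dtype changes, each token stores two tensors (K and V) per layer, each holding num_kv_heads * head_dim elements. The Python sketch below works through that arithmetic. The model shape (32 layers, 8 KV heads, head dimension 128) is an assumed example, and fp8_e5m2 stands in for a 1-byte dtype; check SGLang's documentation for the values your version accepts.

```python
# Bytes per element for a few dtypes (assumed example set).
ELEMENT_BYTES = {"bfloat16": 2, "float16": 2, "fp8_e5m2": 1}


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype: str) -> int:
    """Bytes of KV cache one token occupies: K and V for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * ELEMENT_BYTES[dtype]


def tokens_per_gib(bytes_per_token: int) -> int:
    """How many tokens of KV cache fit in one GiB, ignoring paging overhead."""
    return (1 << 30) // bytes_per_token


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
for dtype in ("bfloat16", "fp8_e5m2"):
    per_token = kv_bytes_per_token(dtype=dtype, **shape)
    print(dtype, per_token, tokens_per_gib(per_token))
# bfloat16 131072 8192
# fp8_e5m2 65536 16384
```

Under these assumptions, halving the element size doubles the number of tokens that fit in the same KV budget, which is the main reason to change kv_cache_dtype from its default.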
