# SGLang
`type = "sglang"`. Mirrors SGLang's `ServerArgs`; option names match SGLang verbatim, so values flow straight through.
Pick SGLang when you want strong custom-mask support or its n-gram speculative decoding. As with vLLM, inferlet features that depend on raw forward-pass control are not available through SGLang; those require a Pie-controlled driver such as `cuda_native` or `dev`.
## Install
```sh
pie driver sglang install ~/.pie/venvs/sglang --run
pie driver sglang set venv ~/.pie/venvs/sglang
pie driver sglang doctor
```
This installs `pie-driver-sglang` into a Python 3.12 virtual environment. The SGLang, Torch, and FlashInfer pins are co-resolved; expect a large install.
## Configuration
```toml
[model.driver]
type = "sglang"
device = ["cuda:0"]
activation_dtype = "bfloat16"

[model.driver.options]
venv = "/home/me/.pie/venvs/sglang"

attention_backend = "triton"    # triton / flashinfer / fa3 / fa4 / …
mem_fraction_static = 0.65
page_size = 16
disable_radix_cache = true
disable_cuda_graph = false
# cuda_graph_max_bs = 256       # default: sglang auto-picks
# max_running_requests = 256    # default: sglang auto-picks
# max_total_tokens = 65536      # default: sglang auto-picks
# chunked_prefill_size = 8192   # default: sglang auto-picks
kv_cache_dtype = "auto"
trust_remote_code = true
# context_length = 8192         # default: read from HF config

# Pinned-host KV pool for D2H/H2D swap (GiB). 0 = disabled.
cpu_mem_budget_in_gb = 0

# n-gram speculative decoding (driver-supplied drafts)
spec_ngram_enabled = false
spec_ngram_num_drafts = 4
spec_ngram_max_depth = 18
spec_ngram_capacity = 1_000_000
```
| Key | Default | Description |
|---|---|---|
| `attention_backend` | `"triton"` | `triton` / `flashinfer` / `flex_attention` / `fa3` / `fa4` / `aiter` / `wave` / `torch_native` / etc. `triton` works on any NVIDIA GPU with SM 7.5+ and is stable across SGLang versions. |
| `mem_fraction_static` | `0.65` | Fraction of free GPU memory reserved for KV cache + activations. Lower than SGLang's standalone default of 0.88 because Pie's KV rebind allocates a parallel tensor in Pie's canonical layout. |
| `page_size` | `16` | KV cache page size override. Unset lets SGLang pick. |
| `disable_radix_cache` | `true` | Disable SGLang's radix cache. Pie owns prefix sharing through its scheduler, so the default avoids duplicated caching work while still allowing explicit experiments. |
| `disable_cuda_graph` | `false` | Run eager (no `torch.compile`, no CUDA graphs). |
| `cuda_graph_max_bs` | unset | Override the largest CUDA-graph batch-size bin SGLang captures. Unset = SGLang's auto-pick. |
| `max_running_requests` | unset | Cap on simultaneously running requests. Unset = SGLang's auto-pick based on `max_total_tokens`. |
| `max_total_tokens` | unset | Cap on total tokens (across requests) per `fire_batch`. Unset = SGLang's auto-pick. |
| `chunked_prefill_size` | unset | Chunked-prefill size override. |
| `kv_cache_dtype` | `"auto"` | KV cache element dtype. `auto` inherits the activation dtype. |
| `trust_remote_code` | `true` | Trust user-supplied remote code in HF repos (needed for some models). |
| `context_length` | unset | Explicit context-length cap. Unset reads from the HF config. |
| `cpu_mem_budget_in_gb` | `0` | Pinned-host KV pool for D2H/H2D swap, in GiB. `0` disables swap. (Pie knob, not an SGLang `ServerArgs` field.) |
| `spec_ngram_enabled` | `false` | Enable n-gram speculative decoding (SGLang-side drafting); see the sketch after this table. The inferlet opts in to seeing drafts via `output_speculative_tokens(true)` on its forward pass; otherwise they are dropped. |
| `spec_ngram_num_drafts` | `4` | Drafts proposed per accepted iteration. |
| `spec_ngram_max_depth` | `18` | Maximum n-gram trie depth. |
| `spec_ngram_capacity` | `1_000_000` | Approximate node budget for the trie. |
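For example, a minimal sketch of turning on n-gram speculation. The values shown are the defaults from the table above, spelled out as illustrative starting points rather than tuned recommendations:

```toml
[model.driver.options]
venv = "/home/me/.pie/venvs/sglang"

# Driver-side drafting: SGLang proposes up to 4 draft continuations per
# accepted iteration from an n-gram trie of depth <= 18, capped at
# roughly 1M nodes.
spec_ngram_enabled = true
spec_ngram_num_drafts = 4
spec_ngram_max_depth = 18
spec_ngram_capacity = 1_000_000
```

Remember that drafts are dropped unless the inferlet also opts in with `output_speculative_tokens(true)` on its forward pass.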
## Supported architectures
SGLang accepts anything in its model zoo. See the SGLang supported models list for the authoritative roster.
## Quantization
SGLang has its own quantization knobs (`quantization`, `kv_cache_dtype`, etc.). Set them under `[model.driver.options]` exactly as SGLang expects.
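For instance, a hedged sketch of passing quantization settings straight through. The `"fp8"` and `"fp8_e5m2"` values are illustrative choices, not recommendations; use whichever values your SGLang build, model, and hardware accept:

```toml
[model.driver.options]
venv = "/home/me/.pie/venvs/sglang"

# Forwarded verbatim to SGLang's ServerArgs. fp8 weight/activation
# quantization with an fp8 KV cache is one possible pairing; consult
# SGLang's docs for what your model and GPU support.
quantization = "fp8"
kv_cache_dtype = "fp8_e5m2"
```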