SGLang

type = "sglang". Mirrors SGLang's ServerArgs; field names mirror SGLang verbatim so values flow through.

Pick SGLang when you want strong custom-mask support or its n-gram speculative decoding. As with vLLM, some inferlet features that depend on raw forward-pass control are not available through SGLang; those require a Pie-controlled driver such as cuda_native or dev.
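
For comparison, a Pie-controlled driver is selected by changing type in the same [model.driver] table. A minimal sketch (see the cuda_native page for that driver's own options, which are not shown here):

[model.driver]
type = "cuda_native" # Pie-controlled: exposes raw forward-pass control to inferlets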

Install

pie driver sglang install ~/.pie/venvs/sglang --run
pie driver sglang set venv ~/.pie/venvs/sglang
pie driver sglang doctor

This installs pie-driver-sglang into a Python 3.12 virtual environment. SGLang, Torch, and FlashInfer pins are co-resolved; expect a large install.

Configuration

[model.driver]
type = "sglang"
device = ["cuda:0"]
activation_dtype = "bfloat16"

[model.driver.options]
venv = "/home/me/.pie/venvs/sglang"
attention_backend = "triton" # triton / flashinfer / fa3 / fa4 / …
mem_fraction_static = 0.65
page_size = 16
disable_radix_cache = true
disable_cuda_graph = false
# cuda_graph_max_bs = 256 # default: sglang auto-picks
# max_running_requests = 256 # default: sglang auto-picks
# max_total_tokens = 65536 # default: sglang auto-picks
# chunked_prefill_size = 8192 # default: sglang auto-picks
kv_cache_dtype = "auto"
trust_remote_code = true
# context_length = 8192 # default: read from HF config

# Pinned-host KV pool for D2H/H2D swap (GiB). 0 = disabled.
cpu_mem_budget_in_gb = 0

# n-gram speculative decoding (driver-supplied drafts)
spec_ngram_enabled = false
spec_ngram_num_drafts = 4
spec_ngram_max_depth = 18
spec_ngram_capacity = 1_000_000

| Key | Default | Description |
| --- | --- | --- |
| attention_backend | "triton" | triton / flashinfer / flex_attention / fa3 / fa4 / aiter / wave / torch_native / etc. triton works on any NVIDIA GPU with SM 7.5+ and is stable across SGLang versions. |
| mem_fraction_static | 0.65 | Fraction of free GPU memory reserved for KV cache + activations. Lower than SGLang's standalone default of 0.88 because Pie's KV-rebind allocates a parallel tensor in Pie's canonical layout. |
| page_size | 16 | KV cache page size override. Unset lets SGLang pick. |
| disable_radix_cache | true | Disable SGLang's radix cache. Pie owns prefix sharing through its scheduler, so the default avoids duplicated caching work while still allowing explicit experiments. |
| disable_cuda_graph | false | Run eager (no torch.compile, no CUDA graphs). |
| cuda_graph_max_bs | unset | Override the largest CUDA-graph batch-size bin SGLang captures. Unset = SGLang's auto-pick. |
| max_running_requests | unset | Cap on simultaneously running requests. Unset = SGLang's auto-pick based on max_total_tokens. |
| max_total_tokens | unset | Cap on total tokens (across requests) per fire_batch. Unset = SGLang's auto-pick. |
| chunked_prefill_size | unset | Chunked-prefill size override. |
| kv_cache_dtype | "auto" | KV cache element dtype. auto inherits the activation dtype. |
| trust_remote_code | true | Trust user-supplied remote code in HF repos (needed for some models). |
| context_length | unset | Explicit context-length cap. Unset reads from the HF config. |
| cpu_mem_budget_in_gb | 0 | Pinned-host KV pool for D2H/H2D swap, in GiB. 0 disables swap. (Pie knob, not an SGLang ServerArgs field.) |
| spec_ngram_enabled | false | Enable n-gram speculative decoding (SGLang-side drafting). The inferlet opts in to seeing drafts via output_speculative_tokens(true) on its forward pass; otherwise drafts are dropped. See the sketch after this table. |
| spec_ngram_num_drafts | 4 | Drafts proposed per accepted iteration. |
| spec_ngram_max_depth | 18 | Maximum n-gram trie depth. |
| spec_ngram_capacity | 1_000_000 | Approximate node budget for the trie. |
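
As a worked example, here is one way the Pie-specific knobs at the end of this table might be combined. This is a hedged sketch with arbitrary starting values, not tuned recommendations:

[model.driver.options]
venv = "/home/me/.pie/venvs/sglang"

# Reserve 16 GiB of pinned host memory so KV pages can be swapped
# D2H/H2D instead of evicted. 0 (the default) disables swap.
cpu_mem_budget_in_gb = 16

# Turn on driver-side n-gram drafting. Drafts only reach the inferlet
# if it calls output_speculative_tokens(true) on its forward pass;
# otherwise they are dropped.
spec_ngram_enabled = true
spec_ngram_num_drafts = 4
spec_ngram_max_depth = 18
spec_ngram_capacity = 1_000_000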

Supported architectures

SGLang accepts anything in its model zoo. See the SGLang supported models list for the authoritative roster.

Quantization

SGLang has its own quantization knobs (quantization, kv_cache_dtype, etc.). Set them under [model.driver.options] exactly as SGLang expects.
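
For instance, a minimal sketch of an FP8 setup. The accepted strings (and which checkpoints support them) depend on your SGLang version, so treat these values as illustrative and check SGLang's quantization docs:

[model.driver.options]
venv = "/home/me/.pie/venvs/sglang"

# Weight quantization method as understood by SGLang; other common
# values include "awq" and "gptq" for pre-quantized checkpoints.
quantization = "fp8"

# Store KV cache entries in FP8 rather than inheriting the
# activation dtype via "auto".
kv_cache_dtype = "fp8_e5m2"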