SGLang
The SGLang driver is a prototype. Interfaces, defaults, and capabilities change frequently.
type = "sglang". A subprocess Python driver that delegates the forward pass to SGLang. Pie wraps a curated subset of SGLang's ServerArgs; memory and batch capacity are resolved by the driver and reported back through DriverCapabilities.
Use this driver to run a model that the embedded cuda_native driver does not yet implement, or to run against SGLang's attention kernels and its n-gram speculative decoding.
Capabilities
The bridge runs SGLang's standard causal attention. It replaces SGLang's LogitsProcessor with a hidden-state capture hook and runs the stock forward pass; there is no path for arbitrary attention masks. It reports the following through DriverCapabilities:
- supports_user_attention_mask = false. User-supplied attention masks are silently dropped.
- supports_adapters = false. init_adapter, update_adapter, and load_adapter raise NotImplementedError.
Inferlets that need custom masks or adapter math must run on the cuda_native driver.
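Because masks are dropped silently rather than rejected, it can help to check the reported flags before dispatching work. The following is a minimal sketch of such a check. The `DriverCapabilities` dataclass here is a hypothetical stand-in that mirrors only the two flags named above, and `missing_features` is an illustrative helper, not part of Pie's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriverCapabilities:
    """Hypothetical stand-in: holds only the two flags named in this section."""
    supports_user_attention_mask: bool
    supports_adapters: bool


# Values the sglang driver reports, per the list above.
SGLANG_CAPS = DriverCapabilities(
    supports_user_attention_mask=False,
    supports_adapters=False,
)


def missing_features(caps, needs_mask, needs_adapters):
    """Return the features a workload needs that the driver does not offer."""
    missing = []
    if needs_mask and not caps.supports_user_attention_mask:
        missing.append("user attention mask")
    if needs_adapters and not caps.supports_adapters:
        missing.append("adapters")
    return missing
```

A non-empty result for an inferlet means it belongs on the cuda_native driver instead.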
Install
pie driver sglang install ~/.pie/venvs/sglang --run
pie driver sglang set venv ~/.pie/venvs/sglang
pie driver sglang doctor
This installs pie-driver-sglang into a Python 3.12 virtual environment. SGLang, Torch, and FlashInfer pins are co-resolved. The install is large: roughly 5 to 10 GiB on disk.
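Given the install size, a quick free-space check before running the installer can save a failed download. This is a small sketch using only the Python standard library; the helper name and the 10 GiB threshold (the upper end of the estimate above) are assumptions, and the check is not something the pie CLI is known to perform.

```python
import shutil
from pathlib import Path

GIB = 1024 ** 3


def has_room_for_install(venv_path, required_gib=10):
    """Return True if the filesystem holding `venv_path` has enough free space."""
    path = Path(venv_path).expanduser()
    # The venv may not exist yet, so walk up to the nearest existing parent.
    while not path.exists():
        path = path.parent
    return shutil.disk_usage(path).free >= required_gib * GIB
```

For example, `has_room_for_install("~/.pie/venvs/sglang")` checks the filesystem that will hold the virtual environment.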
Configuration
[model.driver]
type = "sglang"
device = ["cuda:0"]
activation_dtype = "bfloat16"
[model.driver.options]
venv = "/home/me/.pie/venvs/sglang"
attention_backend = "triton" # triton / flashinfer / fa3 / fa4 / …
mem_fraction_static = 0.65
disable_radix_cache = true
disable_cuda_graph = false
# cuda_graph_max_bs = 256 # default: sglang auto-picks
# chunked_prefill_size = 8192 # default: sglang auto-picks
kv_cache_dtype = "auto"
trust_remote_code = true
# context_length = 8192 # default: read from HF config
# Pinned-host KV pool for D2H/H2D swap (GiB). 0 disables swap.
cpu_mem_budget_in_gb = 0
# n-gram speculative decoding (driver-side drafts)
spec_ngram_enabled = false
spec_ngram_num_drafts = 4
spec_ngram_max_depth = 18
spec_ngram_capacity = 1_000_000
| Key | Default | Description |
|---|---|---|
| attention_backend | "triton" | triton / flashinfer / flex_attention / fa3 / fa4 / aiter / wave / torch_native / etc. triton works on any NVIDIA SM 7.5+ and is stable across SGLang versions. |
| mem_fraction_static | 0.65 | Fraction of free GPU memory reserved for KV cache + activations. Lower than SGLang's standalone 0.88 because Pie's KV-rebind allocates a parallel tensor in Pie's canonical layout. |
| disable_radix_cache | true | Disable SGLang's radix cache. Pie's scheduler owns prefix sharing, so the default avoids duplicated caching work. |
| disable_cuda_graph | false | When true, run eager (no torch.compile, no CUDA graphs). |
| cuda_graph_max_bs | unset | Override the largest CUDA-graph batch-size bin SGLang captures. Unset uses SGLang's auto-pick. |
| chunked_prefill_size | unset | Chunked-prefill size override. Unset uses SGLang's auto-pick. |
| kv_cache_dtype | "auto" | KV cache element dtype. auto inherits the activation dtype. |
| trust_remote_code | true | Trust user-supplied remote code in HF repos (needed for some models). |
| context_length | unset | Explicit context length cap. Unset reads from HF config. |
| cpu_mem_budget_in_gb | 0 | Pinned-host KV pool for D2H/H2D swap, in GiB. 0 disables swap. (Pie knob, not an SGLang ServerArgs field.) |
| spec_ngram_enabled | false | Enable n-gram speculative decoding. The driver maintains a per-session token history, proposes linear draft continuations, and the runtime verifies them in the shared batch path. The inferlet opts in to receiving drafts by calling output_speculative_tokens(true); otherwise drafts are dropped. |
| spec_ngram_num_drafts | 4 | Drafts proposed per accepted iteration. |
| spec_ngram_max_depth | 18 | Maximum n-gram trie depth. |
| spec_ngram_capacity | 1_000_000 | Approximate node budget for the trie. |
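To make the n-gram options concrete, here is a simplified sketch of proposing a linear draft from a session's token history. It is not the driver's implementation: it swaps the trie for a plain linear scan, ignores spec_ngram_capacity, and the function name `propose_drafts` is hypothetical. The `num_drafts` and `max_depth` parameters play the roles of spec_ngram_num_drafts and spec_ngram_max_depth.

```python
def propose_drafts(history, num_drafts=4, max_depth=18):
    """Propose a linear draft continuation from the session's own history.

    Finds the longest suffix of `history` (at most `max_depth` tokens) that
    also occurs earlier, then returns up to `num_drafts` tokens that followed
    the most recent earlier occurrence. Returns [] when nothing matches.
    """
    n = len(history)
    for depth in range(min(max_depth, n - 1), 0, -1):
        suffix = history[n - depth:]
        # Scan right-to-left so the most recent earlier match wins.
        for start in range(n - depth - 1, -1, -1):
            if history[start:start + depth] == suffix:
                follow = start + depth
                return history[follow:follow + num_drafts]
    return []
```

With the history `[1, 2, 3, 4, 5, 1, 2, 3]`, the suffix `[1, 2, 3]` matches the start of the sequence, so the proposed draft is `[4, 5, 1, 2]`. The drafts are only guesses; as described above, the runtime still verifies them before any are accepted.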
The SGLang driver uses SGLang's resolved page size and scheduler limits, then
reports them in DriverCapabilities during startup. Pie config does not accept
manual SGLang batch-capacity overrides such as max_running_requests or
max_total_tokens.
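A pre-flight check along these lines can catch such overrides before the driver starts. This is a sketch under assumptions: the helper name is hypothetical, the key set lists only the two examples named above (the real list may be longer), and treating the keys as a hard error is a guess at how rejection is surfaced.

```python
# The two keys this section names as not accepted; possibly not exhaustive.
REJECTED_KEYS = {"max_running_requests", "max_total_tokens"}


def check_driver_options(options):
    """Raise ValueError if `options` carries a batch-capacity override.

    `options` is the parsed [model.driver.options] table as a dict.
    Returns the dict unchanged when it is clean.
    """
    bad = sorted(REJECTED_KEYS.intersection(options))
    if bad:
        raise ValueError(
            "unsupported batch-capacity override(s): " + ", ".join(bad)
        )
    return options
```

Passing `{"mem_fraction_static": 0.65}` returns the dict untouched, while adding `max_total_tokens` raises `ValueError`.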
Supported architectures
SGLang accepts anything in its model zoo. See the SGLang supported models list for the authoritative roster.
Quantization
SGLang has its own quantization knobs (quantization, kv_cache_dtype, etc.). Set them under [model.driver.options] exactly as SGLang expects.