SGLang

Experimental

The SGLang driver is a prototype. Interfaces, defaults, and capabilities change frequently.

type = "sglang". A subprocess Python driver that delegates the forward pass to SGLang. Pie wraps a curated subset of SGLang's ServerArgs; memory and batch capacity are resolved by the driver and reported back through DriverCapabilities.

Use this driver to run a model that the embedded cuda_native driver does not yet implement, or to run against SGLang's attention kernels and its n-gram speculative decoding.

Capabilities

The bridge runs SGLang's standard causal attention. It replaces SGLang's LogitsProcessor with a hidden-state capture hook and runs the stock forward pass; there is no path for arbitrary attention masks. It reports the following through DriverCapabilities:

  • supports_user_attention_mask = false. User-supplied attention masks are silently dropped.
  • supports_adapters = false. init_adapter, update_adapter, and load_adapter raise NotImplementedError.

Inferlets that need custom masks or adapter math must run on the cuda_native driver.
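Because unsupported masks are dropped silently, it can help to check the reported flags before dispatching work. The sketch below is illustrative only: the DriverCapabilities dataclass is a hypothetical stand-in that carries just the two flags listed above, and missing_features is an invented helper, not part of Pie's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriverCapabilities:
    """Hypothetical stand-in holding only the two flags listed above."""
    supports_user_attention_mask: bool
    supports_adapters: bool


# Values the sglang driver reports, per the list above.
SGLANG_CAPS = DriverCapabilities(
    supports_user_attention_mask=False,
    supports_adapters=False,
)


def missing_features(caps: DriverCapabilities, *, needs_mask: bool,
                     needs_adapters: bool) -> list[str]:
    """Return the features an inferlet needs that the driver does not report."""
    missing = []
    if needs_mask and not caps.supports_user_attention_mask:
        missing.append("user attention mask")
    if needs_adapters and not caps.supports_adapters:
        missing.append("adapters")
    return missing
```

With the values above, an inferlet that needs a custom mask gets ["user attention mask"] back, which is the signal to run it on cuda_native instead.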

Install

pie driver sglang install ~/.pie/venvs/sglang --run
pie driver sglang set venv ~/.pie/venvs/sglang
pie driver sglang doctor

This installs pie-driver-sglang into a Python 3.12 virtual environment. The SGLang, Torch, and FlashInfer version pins are resolved together, so the three packages stay mutually compatible. The install is large: roughly 5 to 10 GiB on disk.

Configuration

[model.driver]
type = "sglang"
device = ["cuda:0"]
activation_dtype = "bfloat16"

[model.driver.options]
venv = "/home/me/.pie/venvs/sglang"
attention_backend = "triton" # triton / flashinfer / fa3 / fa4 / …
mem_fraction_static = 0.65
disable_radix_cache = true
disable_cuda_graph = false
# cuda_graph_max_bs = 256 # default: sglang auto-picks
# chunked_prefill_size = 8192 # default: sglang auto-picks
kv_cache_dtype = "auto"
trust_remote_code = true
# context_length = 8192 # default: read from HF config

# Pinned-host KV pool for D2H/H2D swap (GiB). 0 disables swap.
cpu_mem_budget_in_gb = 0

# n-gram speculative decoding (driver-side drafts)
spec_ngram_enabled = false
spec_ngram_num_drafts = 4
spec_ngram_max_depth = 18
spec_ngram_capacity = 1_000_000

Key | Default | Description
--- | --- | ---
attention_backend | "triton" | triton / flashinfer / flex_attention / fa3 / fa4 / aiter / wave / torch_native / etc. triton works on any NVIDIA GPU with SM 7.5 or newer and is stable across SGLang versions.
mem_fraction_static | 0.65 | Fraction of free GPU memory reserved for KV cache + activations. Lower than SGLang's standalone 0.88 because Pie's KV-rebind allocates a parallel tensor in Pie's canonical layout.
disable_radix_cache | true | Disable SGLang's radix cache. Pie's scheduler owns prefix sharing, so the default avoids duplicated caching work.
disable_cuda_graph | false | When true, run eager (no torch.compile, no CUDA graphs).
cuda_graph_max_bs | unset | Override the largest CUDA-graph batch-size bin SGLang captures. Unset uses SGLang's auto-pick.
chunked_prefill_size | unset | Chunked-prefill size override. Unset uses SGLang's auto-pick.
kv_cache_dtype | "auto" | KV cache element dtype. auto inherits the activation dtype.
trust_remote_code | true | Trust user-supplied remote code in HF repos (needed for some models).
context_length | unset | Explicit context length cap. Unset reads from HF config.
cpu_mem_budget_in_gb | 0 | Pinned-host KV pool for D2H/H2D swap, in GiB. 0 disables swap. (Pie knob, not an SGLang ServerArgs field.)
spec_ngram_enabled | false | Enable n-gram speculative decoding. The driver maintains a per-session token history, proposes linear draft continuations, and the runtime verifies them in the shared batch path. The inferlet opts in to receiving drafts by calling output_speculative_tokens(true); otherwise drafts are dropped.
spec_ngram_num_drafts | 4 | Drafts proposed per accepted iteration.
spec_ngram_max_depth | 18 | Maximum n-gram trie depth.
spec_ngram_capacity | 1_000_000 | Approximate node budget for the trie.
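
To make the spec_ngram_* options more concrete, here is a minimal sketch of how a linear draft can be proposed from a token history. It is not the driver's implementation: it swaps the trie for a plain backward scan over the history, ignores spec_ngram_capacity entirely, and assumes that a depth of max_depth allows at most max_depth - 1 context tokens plus one predicted token. The function name propose_drafts is hypothetical.

```python
def propose_drafts(history: list[int], num_drafts: int = 4,
                   max_depth: int = 18) -> list[int]:
    """Propose up to num_drafts tokens by matching the longest recent suffix.

    Simplification: linear scan instead of a trie, and no capacity limit.
    """
    n = len(history)
    # Try the longest context first. Assumption: the context uses at most
    # max_depth - 1 tokens so that context + first draft fits in max_depth.
    for ctx_len in range(min(max_depth - 1, n - 1), 0, -1):
        suffix = history[n - ctx_len:]
        # Look for the most recent earlier occurrence of that suffix.
        for start in range(n - ctx_len - 1, -1, -1):
            if history[start:start + ctx_len] == suffix:
                # The tokens that followed the earlier occurrence are the draft.
                end = start + ctx_len + num_drafts
                return history[start + ctx_len:end]
    # No earlier occurrence of any suffix: propose nothing.
    return []
```

For the history [1, 2, 3, 4, 5, 1, 2, 3], the suffix [1, 2, 3] also appears at the start, so the sketch proposes [4, 5, 1, 2]. In the real driver the runtime then verifies such drafts in the shared batch path, and only the verified prefix is kept.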

The SGLang driver uses SGLang's resolved page size and scheduler limits, then reports them in DriverCapabilities during startup. Pie config does not accept manual SGLang batch-capacity overrides such as max_running_requests or max_total_tokens.
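
A pre-flight check over the parsed options table can catch these keys before startup. This is a sketch under assumptions: find_rejected_options is a hypothetical helper, the options are assumed to be already parsed into a dict, and the key set contains only the two examples named above, which may not be the full list Pie rejects.

```python
# Only the two keys named above; the real set of rejected keys may be larger.
REJECTED_KEYS = {"max_running_requests", "max_total_tokens"}


def find_rejected_options(options: dict) -> list[str]:
    """Return rejected batch-capacity keys found in [model.driver.options]."""
    return sorted(options.keys() & REJECTED_KEYS)
```

An options table containing only mem_fraction_static and attention_backend passes with an empty result, while one that also sets max_total_tokens is flagged.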

Supported architectures

SGLang accepts anything in its model zoo. See the SGLang supported models list for the authoritative roster.

Quantization

SGLang has its own quantization knobs (quantization, kv_cache_dtype, etc.). Set them under [model.driver.options] exactly as SGLang expects.