Models
What you can run on Pie depends on three choices: the model architecture, the driver Pie routes inference through, and the precision and parallelism that driver supports for that architecture. This page is the at-a-glance matrix. For configuration and supported architectures of each driver, see CUDA, Portable, vLLM, and SGLang.
Drivers
cuda
Pie's first-party path on NVIDIA hardware, built on FlashInfer, with the full inferlet feature set: custom forward, page-trim, raw-logits sampling, LoRA. The default native backend runs in PyTorch; a standalone C++/CUDA binary, cuda_native, is also available (see the notes below the matrix).
portable
Standalone C++ binary on libggml. Loads HF safetensors and GGUF directly. GPU backend is selected at C++ build time.
vllm@0.16.0
Forwards inference to vLLM. Inherits vLLM's model zoo, decode kernels, and quantization stack.
sglang@0.5.9
Forwards inference to SGLang. Inherits SGLang's model zoo; speculative decoding is handled by SGLang.
Architecture support
| Architecture | cuda | portable | vllm | sglang | Checkpoints |
|---|---|---|---|---|---|
| Qwen | | | | | |
| Qwen 2 (`qwen2`) | stable | stable | upstream | upstream | |
| Qwen 2.5 (`qwen2`) | stable | stable | upstream | upstream | |
| Qwen 3 (`qwen3`) | stable | stable | upstream | upstream | |
| Qwen 3 MoE (`qwen3_moe`) | — | preview | upstream | upstream | |
| Qwen 3.5 (`qwen3_5`) | stable | preview | — | upstream | |
| Qwen 3.5 MoE (`qwen3_5_moe`) | — | preview | — | upstream | |
| Qwen 3.6 (`qwen3_5`) | stable | preview | — | upstream | |
| Qwen 3.6 MoE (`qwen3_5_moe`) | — | preview | — | upstream | |
| Llama | | | | | |
| Llama 3 (`llama`) | stable | stable | upstream | upstream | |
| Llama 3.1 (`llama`) | stable | stable | upstream | upstream | |
| Llama 3.2 (`llama`) | stable | stable | upstream | upstream | |
| Gemma | | | | | |
| Gemma 2 (`gemma2`) | stable | stable | upstream | upstream | |
| Gemma 3 (`gemma3`, `gemma3_text`) | stable | stable | upstream | upstream | |
| Gemma 3n (`gemma3n`, `gemma3n_text`) | — | stable | upstream | upstream | |
| Gemma 4 (`gemma4`, `gemma4_text`) | stable | stable | — | — | |
| Gemma 4 MoE (`gemma4`, with experts) | — | preview | — | — | |
| GPT-OSS | | | | | |
| GPT-OSS (`gpt_oss`, `gptoss`) | stable | preview | upstream | upstream | |
| Mistral | | | | | |
| Mistral (`mistral`) | — | stable | upstream | upstream | |
| Ministral 3 (`mistral3`) | stable | stable | — | upstream | |
| Mistral Small (`mistral3`) | stable | stable | upstream | upstream | |
| Mixtral (`mixtral`) | stable | preview | upstream | upstream | |
| Phi | | | | | |
| Phi-3 (`phi3`) | stable | stable | upstream | upstream | |
| Phi-3-small (`phi3small`) | — | preview | upstream | upstream | |
| Phi-3.5 / Phi-4 (`phi3`) | stable | stable | upstream | upstream | |
| Phi-3.5-MoE (`phimoe`) | — | stable | upstream | upstream | |
| OLMo | | | | | |
| OLMo 2 (`olmo2`) | — | stable | upstream | upstream | |
| OLMo 3 (`olmo3`) | stable | stable | upstream | upstream | |
The cuda column reflects the native (PyTorch) driver, which is the default. The cuda_native C++/CUDA driver covers a slightly different set (e.g. it adds qwen3_moe, qwen3_5_moe, gemma3n, olmo2, plain mistral); see the CUDA driver page for details.
The vllm and sglang columns reflect the pinned versions Pie ships against (vllm 0.16.0, sglang 0.5.9). Architectures merged upstream after those releases — currently Gemma 4 / Gemma 4 MoE in both, and Qwen 3.5 / 3.6 in vllm — show as unsupported here even when newer builds carry them.
Some inferlet features (custom attention masks, page-trim, raw-logits samplers) remain native-only and are not available through the delegate drivers.
Adding a new architecture
For a one-off, route the model through the vllm or sglang driver and rely on the upstream zoo. To add it as a first-party architecture, drop a pie/src/pie_driver/model/<name>.py file and register it in pie/src/pie_driver/model/__init__.py against the matching HF model_type (see the sketch below). If the target architecture is missing from the upstream backends as well, open an issue with a link to the Hugging Face checkpoint.
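As a rough sketch of the first-party route: assuming the registry in pie/src/pie_driver/model/__init__.py is a plain mapping from the HF config's model_type to a model class (the real base class, constructor signature, and registry name in pie_driver may differ), the two pieces look roughly like this.

```python
# pie/src/pie_driver/model/my_arch.py -- hypothetical architecture file
class MyArchForCausalLM:
    """Skeleton for a new first-party architecture (sketch only)."""

    def __init__(self, config):
        # Build embeddings, decoder layers, and the LM head from the HF config.
        self.config = config

    def forward(self, input_ids, kv_cache):
        # Run the decoder stack against the KV cache and return next-token logits.
        raise NotImplementedError


# pie/src/pie_driver/model/__init__.py -- hypothetical registry shape
from .my_arch import MyArchForCausalLM

MODEL_REGISTRY = {
    # Key must match the `model_type` field in the checkpoint's config.json.
    "my_arch": MyArchForCausalLM,
}
```

The key point is the model_type match: Pie picks the architecture class by reading the checkpoint's HF config, so the registry key must be exactly the string that appears in config.json.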
Next
- CUDA driver. Native + cuda_native config and architectures.
- Portable driver. ggml-backed standalone binary.
- vLLM driver. vLLM-backed inference.
- SGLang driver. SGLang-backed inference.
- Configuration. Full ~/.pie/config.toml reference.