
Models

What you can run on Pie depends on three choices: the model architecture, the driver Pie routes inference through, and the precision and parallelism that driver supports for that architecture. This page is the at-a-glance matrix. For configuration and supported architectures of each driver, see CUDA, Portable, vLLM, and SGLang.
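In other words, each (architecture, driver) pair resolves to one of the support levels in the matrix below. Purely as an illustration of how to read the matrix (the names `SUPPORT` and `support_level` are hypothetical, not Pie's API), here is a hand-encoded slice of it:

```python
# Illustrative only: a hand-encoded slice of the support matrix on this page.
# SUPPORT and support_level are hypothetical names, not part of Pie.
SUPPORT: dict[str, dict[str, str]] = {
    "qwen3":   {"cuda": "stable", "portable": "stable", "vllm": "upstream", "sglang": "upstream"},
    "gpt_oss": {"cuda": "stable", "portable": "preview", "vllm": "upstream", "sglang": "upstream"},
    "gemma4":  {"cuda": "stable", "portable": "stable"},  # not yet in the pinned vllm/sglang
}

def support_level(architecture: str, driver: str) -> str:
    """Return 'stable', 'preview', 'upstream', or 'unsupported'."""
    return SUPPORT.get(architecture, {}).get(driver, "unsupported")

assert support_level("qwen3", "cuda") == "stable"
assert support_level("gemma4", "vllm") == "unsupported"
```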

Drivers

cuda (first-party)

Standalone C++/CUDA binary on FlashInfer. Pie's first-party path on NVIDIA hardware, with the full inferlet feature set: custom forward, page-trim, raw-logits sampling, LoRA.

Hardware: CUDA SM 8.0+
Weights: bf16
Parallel: TP, DP, PP
portable (cross-platform)

Standalone C++ binary on libggml. Loads HF safetensors and GGUF directly. GPU backend is selected at C++ build time.

Hardware: CPU, CUDA, Metal, Vulkan, HIP, SYCL
Weights: bf16, fp16, Q4_K_M
Parallel: TP, DP, PP
vllm@0.16.0 (delegate)

Forwards inference to vLLM. Inherits vLLM's model zoo, decode kernels, and quantization stack.

Hardware: CUDA, ROCm
Weights: upstream
Parallel: TP, DP, PP
sglang@0.5.9 (delegate)

Forwards inference to SGLang. Speculative decoding is available through SGLang.

Hardware: CUDA, ROCm
Weights: upstream
Parallel: TP, DP, PP

Architecture support

Legend: stable = tested end-to-end; preview = implemented, not yet verified; upstream = covered by the upstream backend; an empty cell means not supported.
| Architecture | model_type | cuda | portable | vllm | sglang |
| --- | --- | --- | --- | --- | --- |
| Qwen 2 | qwen2 | stable | stable | upstream | upstream |
| Qwen 2.5 | qwen2 | stable | stable | upstream | upstream |
| Qwen 3 | qwen3 | stable | stable | upstream | upstream |
| Qwen 3 MoE | qwen3_moe | | preview | upstream | upstream |
| Qwen 3.5 | qwen3_5 | stable | preview | | upstream |
| Qwen 3.5 MoE | qwen3_5_moe | | preview | | upstream |
| Qwen 3.6 | qwen3_5 | stable | preview | | upstream |
| Qwen 3.6 MoE | qwen3_5_moe | | preview | | upstream |
| Llama 3 | llama | stable | stable | upstream | upstream |
| Llama 3.1 | llama | stable | stable | upstream | upstream |
| Llama 3.2 | llama | stable | stable | upstream | upstream |
| Gemma 2 | gemma2 | stable | stable | upstream | upstream |
| Gemma 3 | gemma3, gemma3_text | stable | stable | upstream | upstream |
| Gemma 3n | gemma3n, gemma3n_text | | stable | upstream | upstream |
| Gemma 4 | gemma4, gemma4_text | stable | stable | | |
| Gemma 4 MoE | gemma4 (with experts) | preview | | | |
| GPT-OSS | gpt_oss, gptoss | stable | preview | upstream | upstream |
| Mistral | mistral | | stable | upstream | upstream |
| Ministral 3 | mistral3 | stable | stable | upstream | |
| Mistral Small | mistral3 | stable | stable | upstream | upstream |
| Mixtral | mixtral | stable | preview | upstream | upstream |
| Phi-3 | phi3 | stable | stable | upstream | upstream |
| Phi-3-small | phi3small | preview | | upstream | upstream |
| Phi-3.5 / Phi-4 | phi3 | stable | stable | upstream | upstream |
| Phi-3.5-MoE | phimoe | stable | | upstream | upstream |
| OLMo 2 | olmo2 | | stable | upstream | upstream |
| OLMo 3 | olmo3 | stable | stable | upstream | upstream |

The cuda column reflects the native (PyTorch) driver, which is the default. The cuda_native C++/CUDA driver covers a slightly different set (e.g. it adds qwen3_moe, qwen3_5_moe, gemma3n, olmo2, plain mistral); see the CUDA driver page for details.

The vllm and sglang columns reflect the pinned versions Pie ships against (vllm 0.16.0, sglang 0.5.9). Architectures merged upstream after those releases — currently Gemma 4 / Gemma 4 MoE in both, and Qwen 3.5 / 3.6 in vllm — show as unsupported here even when newer builds carry them.
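To check whether an installed delegate matches the pin, a quick version probe works; this is a minimal sketch that assumes the standard vllm and sglang package names:

```python
from importlib.metadata import PackageNotFoundError, version

# The delegate versions Pie ships against, per this page.
PINS = {"vllm": "0.16.0", "sglang": "0.5.9"}

for pkg, pinned in PINS.items():
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        print(f"{pkg}: not installed (Pie pins {pinned})")
        continue
    if installed != pinned:
        print(f"{pkg}: installed {installed}, but Pie pins {pinned}; "
              f"architectures merged after {pinned} may still show as unsupported")
```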

Some inferlet features (custom attention masks, page-trim, raw-logits samplers) remain native-only and are not available through the delegate drivers.
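In practice, an inferlet's feature set dictates which drivers can host it. A toy illustration (`NATIVE_ONLY` and `requires_native_driver` are hypothetical names; the feature strings merely mirror the list above):

```python
# Hypothetical helper, not part of Pie's API; feature names mirror the list above.
NATIVE_ONLY = {"custom_attention_mask", "page_trim", "raw_logits_sampler"}

def requires_native_driver(features: set[str]) -> bool:
    """True if any requested feature is unavailable through the vllm/sglang delegates."""
    return bool(features & NATIVE_ONLY)

assert requires_native_driver({"page_trim"})
assert not requires_native_driver({"lora"})  # LoRA is not on the native-only list
```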

Adding a new architecture

For a one-off, route the model to the vllm or sglang driver and rely on the upstream zoo. To add it as a first-party architecture, drop a pie/src/pie_driver/model/<name>.py file and register it in pie/src/pie_driver/model/__init__.py against the matching HF model_type.
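The registry hookup might look roughly like this (a sketch under assumptions: `MODEL_REGISTRY` and the per-module `build` entry point are hypothetical names; check pie/src/pie_driver/model/__init__.py for the real convention):

```python
# pie/src/pie_driver/model/__init__.py  (hypothetical registry shape)
from . import llama, qwen3
from . import my_arch  # your new file: pie/src/pie_driver/model/my_arch.py

# Keyed by the HF model_type found in the checkpoint's config.json.
MODEL_REGISTRY = {
    "llama": llama.build,
    "qwen3": qwen3.build,
    "my_arch": my_arch.build,  # new architecture registered against its model_type
}
```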

If the target architecture is missing from the upstream backends as well, open an issue with a link to the Hugging Face model.
