
Models

What you can run on Pie depends on three choices: the model architecture, the driver Pie routes inference through, and the precision and parallelism that driver supports for that architecture. This page is the at-a-glance matrix. For configuration and supported architectures of each driver, see CUDA, Portable, vLLM, and SGLang.
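In other words, each (architecture, driver) pair resolves to one of the support levels in the matrix below. Purely as an illustration of how to read the matrix (the names `SUPPORT` and `support_level` are hypothetical, not Pie's API), here is a hand-encoded slice of it:

```python
# Illustrative only: a hand-encoded slice of the support matrix on this page.
# SUPPORT and support_level are hypothetical names, not part of Pie.
SUPPORT: dict[str, dict[str, str]] = {
    "qwen3":   {"cuda": "stable", "portable": "stable", "vllm": "upstream", "sglang": "upstream"},
    "gpt_oss": {"cuda": "stable", "portable": "preview", "vllm": "upstream", "sglang": "upstream"},
    "gemma4":  {"cuda": "stable", "portable": "stable"},  # not yet in the pinned vllm/sglang
}

def support_level(architecture: str, driver: str) -> str:
    """Return 'stable', 'preview', 'upstream', or 'unsupported'."""
    return SUPPORT.get(architecture, {}).get(driver, "unsupported")

assert support_level("qwen3", "cuda") == "stable"
assert support_level("gemma4", "vllm") == "unsupported"
```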

Drivers

cuda (first-party)

Standalone C++/CUDA binary on FlashInfer. Pie's first-party path on NVIDIA hardware, with the full inferlet feature set: custom forward, page-trim, raw-logits sampling, LoRA.

Hardware: CUDA SM 8.0+
Weights: bf16
Parallel: TP, DP, PP
portable (cross-platform)

Standalone C++ binary on libggml. Loads HF safetensors and GGUF directly. GPU backend is selected at C++ build time.

Hardware: CPU, CUDA, Metal, Vulkan, HIP, SYCL
Weights: bf16, fp16, Q4_K_M
Parallel: TP, DP, PP
vllm@0.16.0 (delegate)

Forwards inference to vLLM. Inherits vLLM's model zoo, decode kernels, and quantization stack.

Hardware: CUDA, ROCm
Weights: upstream
Parallel: TP, DP, PP
sglang@0.5.9 (delegate)

Forwards inference to SGLang. Speculative decoding is available through SGLang.

Hardware: CUDA, ROCm
Weights: upstream
Parallel: TP, DP, PP

Architecture support

Legend: stable = tested end-to-end; preview = implemented, not yet verified; upstream = covered by the upstream backend; an empty cell means not supported.
| Architecture | model_type | cuda | portable | vllm | sglang |
| --- | --- | --- | --- | --- | --- |
| Qwen 2 | qwen2 | stable | stable | upstream | upstream |
| Qwen 2.5 | qwen2 | stable | stable | upstream | upstream |
| Qwen 3 | qwen3 | stable | stable | upstream | upstream |
| Qwen 3 MoE | qwen3_moe | | preview | upstream | upstream |
| Qwen 3.5 | qwen3_5 | stable | preview | | upstream |
| Qwen 3.5 MoE | qwen3_5_moe | | preview | | upstream |
| Qwen 3.6 | qwen3_5 | stable | preview | | upstream |
| Qwen 3.6 MoE | qwen3_5_moe | | preview | | upstream |
| Llama 3 | llama | stable | stable | upstream | upstream |
| Llama 3.1 | llama | stable | stable | upstream | upstream |
| Llama 3.2 | llama | stable | stable | upstream | upstream |
| Gemma 2 | gemma2 | stable | stable | upstream | upstream |
| Gemma 3 | gemma3, gemma3_text | stable | stable | upstream | upstream |
| Gemma 3n | gemma3n, gemma3n_text | | stable | upstream | upstream |
| Gemma 4 | gemma4, gemma4_text | stable | stable | | |
| Gemma 4 MoE | gemma4 (with experts) | preview | | | |
| GPT-OSS | gpt_oss, gptoss | stable | preview | upstream | upstream |
| Mistral | mistral | | stable | upstream | upstream |
| Ministral 3 | mistral3 | stable | stable | upstream | |
| Mistral Small | mistral3 | stable | stable | upstream | upstream |
| Mixtral | mixtral | stable | preview | upstream | upstream |
| Phi-3 | phi3 | stable | stable | upstream | upstream |
| Phi-3-small | phi3small | preview | | upstream | upstream |
| Phi-3.5 / Phi-4 | phi3 | stable | stable | upstream | upstream |
| Phi-3.5-MoE | phimoe | stable | | upstream | upstream |
| OLMo 2 | olmo2 | | stable | upstream | upstream |
| OLMo 3 | olmo3 | stable | stable | upstream | upstream |

The cuda column reflects the native (PyTorch) driver, which is the default. The cuda_native C++/CUDA driver covers a slightly different set (e.g. it adds qwen3_moe, qwen3_5_moe, gemma3n, olmo2, plain mistral); see the CUDA driver page for details.

The vllm and sglang columns reflect the pinned versions Pie ships against (vllm 0.16.0, sglang 0.5.9). Architectures merged upstream after those releases — currently Gemma 4 / Gemma 4 MoE in both, and Qwen 3.5 / 3.6 in vllm — show as unsupported here even when newer builds carry them.
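To check whether an installed delegate matches the pin, a quick version probe works; this is a minimal sketch that assumes the standard vllm and sglang package names:

```python
from importlib.metadata import PackageNotFoundError, version

# The delegate versions Pie ships against, per this page.
PINS = {"vllm": "0.16.0", "sglang": "0.5.9"}

for pkg, pinned in PINS.items():
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        print(f"{pkg}: not installed (Pie pins {pinned})")
        continue
    if installed != pinned:
        print(f"{pkg}: installed {installed}, but Pie pins {pinned}; "
              f"architectures merged after {pinned} may still show as unsupported")
```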

Some inferlet features (custom attention masks, page-trim, raw-logits samplers) remain native-only and are not available through the delegate drivers.
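In practice, an inferlet's feature set dictates which drivers can host it. A toy illustration (`NATIVE_ONLY` and `requires_native_driver` are hypothetical names; the feature strings merely mirror the list above):

```python
# Hypothetical helper, not part of Pie's API; feature names mirror the list above.
NATIVE_ONLY = {"custom_attention_mask", "page_trim", "raw_logits_sampler"}

def requires_native_driver(features: set[str]) -> bool:
    """True if any requested feature is unavailable through the vllm/sglang delegates."""
    return bool(features & NATIVE_ONLY)

assert requires_native_driver({"page_trim"})
assert not requires_native_driver({"lora"})  # LoRA is not on the native-only list
```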

Adding a new architecture

For a one-off, route the model to the vllm or sglang driver and rely on the upstream zoo. To add it as a first-party architecture, drop a pie/src/pie_driver/model/<name>.py file and register it in pie/src/pie_driver/model/__init__.py against the matching HF model_type.
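The registry hookup might look roughly like this (a sketch under assumptions: `MODEL_REGISTRY` and the per-module `build` entry point are hypothetical names; check pie/src/pie_driver/model/__init__.py for the real convention):

```python
# pie/src/pie_driver/model/__init__.py  (hypothetical registry shape)
from . import llama, qwen3
from . import my_arch  # your new file: pie/src/pie_driver/model/my_arch.py

# Keyed by the HF model_type found in the checkpoint's config.json.
MODEL_REGISTRY = {
    "llama": llama.build,
    "qwen3": qwen3.build,
    "my_arch": my_arch.build,  # new architecture registered against its model_type
}
```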

If the target architecture is missing from the upstream backends as well, open an issue with a link to the Hugging Face model.
