Models

What you can run on Pie depends on three choices: the model architecture, the driver Pie routes inference through, and the precision and parallelism that driver supports for that architecture. This page is the at-a-glance matrix. For each driver's configuration and supported architectures, see its own page: CUDA, Portable, vLLM, SGLang, or TensorRT-LLM.

Drivers

cuda

Standalone C++/CUDA binary for high-throughput inference.

Hardware: SM 8.0+
Weights: bf16, int8, int4
Parallel: TP, DP, PP

portable

Standalone C++ binary for local inference based on ggml.

Hardware: CPU, Metal, Vulkan
Weights: bf16, int8, int4
Parallel: TP, DP, PP

vllm@0.16.0

Delegates forward pass execution to vLLM. Check the Inferlet API capability.

Hardware: CUDA, ROCm
Weights: upstream
Parallel: TP, DP, PP

sglang@0.5.9

Delegates forward pass execution to SGLang. Check the Inferlet API capability.

Hardware: CUDA, ROCm
Weights: upstream
Parallel: TP, DP, PP

tensorrt-llm@1.2.1

Delegates forward pass execution to TensorRT-LLM. Check the Inferlet API capability.

Hardware: CUDA
Weights: upstream
Parallel: TP, DP, PP
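
The driver cards above can be read as a small lookup table. The following is a minimal sketch of that idea in Python, not part of Pie: the `Driver` class, the `DRIVERS` dictionary, and the `drivers_for` helper are hypothetical names introduced for illustration, and the values are transcribed from the cards as listed on this page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Driver:
    """One driver card: hypothetical container, not a Pie type."""
    version: str                 # pinned backend version, "" if none is listed
    hardware: Tuple[str, ...]    # hardware targets listed on the card
    weights: Tuple[str, ...]     # weight formats; "upstream" defers to the backend
    parallel: Tuple[str, ...]    # parallelism modes listed on the card


# Values copied from the driver cards on this page.
DRIVERS = {
    "cuda": Driver("", ("SM 8.0+",), ("bf16", "int8", "int4"), ("TP", "DP", "PP")),
    "portable": Driver("", ("CPU", "Metal", "Vulkan"), ("bf16", "int8", "int4"), ("TP", "DP", "PP")),
    "vllm": Driver("0.16.0", ("CUDA", "ROCm"), ("upstream",), ("TP", "DP", "PP")),
    "sglang": Driver("0.5.9", ("CUDA", "ROCm"), ("upstream",), ("TP", "DP", "PP")),
    "tensorrt-llm": Driver("1.2.1", ("CUDA",), ("upstream",), ("TP", "DP", "PP")),
}


def drivers_for(hardware: str) -> List[str]:
    """Return the drivers whose card lists `hardware`, in card order."""
    return [name for name, card in DRIVERS.items() if hardware in card.hardware]


if __name__ == "__main__":
    for target in ("ROCm", "Metal", "CUDA"):
        print(target, "->", drivers_for(target))
```

The lookup matches the hardware label exactly as written on each card, so the `cuda` driver is found under "SM 8.0+" rather than "CUDA". Whether a given architecture runs on the selected driver still has to be checked against the architecture support matrix.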

Architecture support

Status: stable (tested end-to-end), preview (implemented, not yet verified), upstream (covered by upstream backend), or not supported.
Columns: Architecture, cuda, portable, vllm, sglang, trtllm, Checkpoints

Qwen
Qwen 2 (qwen2): stable, stable, upstream, upstream
Qwen 2.5 (qwen2): stable, stable, upstream, upstream
Qwen 3 (qwen3): stable, stable, upstream, upstream, upstream
Qwen 3 MoE (qwen3_moe): preview, preview, upstream, upstream
Qwen 3.5 (qwen3_5): stable, preview, upstream, upstream
Qwen 3.5 MoE (qwen3_5_moe): preview, upstream
Qwen 3.6 (qwen3_5): stable, preview, upstream, upstream
Qwen 3.6 MoE (qwen3_5_moe): preview, upstream
Qwen3-VL (qwen3_vl): stable, upstream, upstream

Llama
Llama 3 (llama): stable, stable, upstream, upstream
Llama 3.1 (llama): stable, stable, upstream, upstream
Llama 3.2 (llama): stable, stable, upstream, upstream

Gemma
Gemma 2 (gemma2): stable, stable, upstream, upstream
Gemma 3 (gemma3, gemma3_text): stable, stable, upstream, upstream
Gemma 3n (gemma3n, gemma3n_text): stable, upstream, upstream
Gemma 4 (gemma4, gemma4_text): stable, stable, upstream
Gemma 4 MoE (gemma4 with experts): preview

GPT-OSS
GPT-OSS (gpt_oss, gptoss): stable, preview, upstream, upstream

Mistral
Mistral (mistral): stable, upstream, upstream
Ministral 3 (mistral3): stable, stable, upstream
Mistral Small (mistral3): stable, stable, upstream, upstream
Mixtral (mixtral): stable, preview, upstream, upstream

Phi
Phi-3 (phi3): stable, stable, upstream, upstream
Phi-3-small (phi3small): preview, upstream, upstream
Phi-3.5 / Phi-4 (phi3): stable, stable, upstream, upstream
Phi-3.5-MoE (phimoe): stable, upstream, upstream

OLMo
OLMo 2 (olmo2): stable, upstream, upstream
OLMo 3 (olmo3): stable, stable, upstream, upstream

GLM
GLM-5.1 (glm_moe_dsa): preview, upstream, upstream

Nemotron
Nemotron-H (nemotron_h): stable

DeepSeek / Kimi
DeepSeek V3 (deepseek_v3): preview, upstream, upstream
Kimi K2 (kimi_k2): preview, upstream, upstream

Sesame
CSM (csm): preview

Multimodal models. Qwen3-VL (vision), Gemma 4 (vision + audio), Nemotron-3 Omni (vision + audio), and CSM (audio output) accept or produce non-text modalities. See Multimodal generation.