CUDA

Pie's CUDA path is cuda_native: an embedded C++/CUDA driver linked into the pie binary when the installed flavor includes CUDA.

Install

Install a CUDA-flavored pie binary:

curl -fsSL https://pie-project.org/install.sh | PIE_FLAVOR=cuda12.8 bash

Auto-detection chooses cuda13.0 for NVIDIA driver >= 580, cuda12.8 for driver >= 525, and portable otherwise. Valid CUDA flavors are cuda12.8, cuda13.0, and the matching portable-cuda* variants. CUDA 12.8 is the minimum toolkit (the FlashInfer Hopper/Mamba-SSU and FP4 kernels require 12.8+).
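The driver thresholds above can be sketched as a small lookup. This is only an illustration of the stated rule, not the installer's actual detection code: the function names `parse_driver_major` and `pick_flavor` are hypothetical, and the sketch assumes the driver version arrives as a dotted string such as the one `nvidia-smi` reports.

```python
from typing import Optional


def parse_driver_major(version: str) -> int:
    """Return the major component of a dotted driver version string."""
    return int(version.strip().split(".")[0])


def pick_flavor(driver_major: Optional[int]) -> str:
    """Map an NVIDIA driver major version to a pie flavor name.

    The thresholds (580 and 525) come from the text above. Passing None
    models the assumption that no NVIDIA driver was detected.
    """
    if driver_major is None:
        return "portable"
    if driver_major >= 580:
        return "cuda13.0"
    if driver_major >= 525:
        return "cuda12.8"
    return "portable"


if __name__ == "__main__":
    for version in ("580.65.06", "570.86.15", "470.82.01"):
        print(version, "->", pick_flavor(parse_driver_major(version)))
```

Setting PIE_FLAVOR explicitly, as in the install command above, skips auto-detection.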

cuda_native configuration

[model.driver]
type = "cuda_native"
device = ["cuda:0"]
tensor_parallel_size = 1
activation_dtype = "bfloat16"

[model.driver.options]
gpu_mem_utilization = 0.90
memory_profile = "auto"
kv_page_size = 32
kv_cache_dtype = "auto"
swap_pool_size = 0
weight_dtype = "bfloat16"
runtime_quant = ""
ready_timeout_s = 600.0
shutdown_timeout_s = 5.0
| Key | Default | Description |
|---|---|---|
| binary_path | "" | Accepted for older config compatibility, ignored by standalone pie; the driver is embedded. |
| gpu_mem_utilization | 0.90 | Fraction of total GPU memory the automatic planner may use after weights load, minus safety headroom. |
| memory_profile | "auto" | Planner profile: "auto", "latency", "balanced", "throughput", or "capacity". |
| kv_page_size | 32 | KV cache page size in tokens. |
| kv_cache_dtype | "auto" | KV cache format: auto, bf16, bfloat16, fp8_e4m3, fp8_e5m2, int8_per_token_head, fp8_per_token_head, fp4_e2m1, or nvfp4. |
| swap_pool_size | 0 | Pinned host KV-page count for swap-out. 0 disables swap. |
| weight_dtype | "bfloat16" | Weight precision. |
| runtime_quant | "" | Per-channel symmetric quantization of projection weights: "" (off), "fp8" (FP8 E4M3), or "int8". Norms, embeddings, and the LM head stay in weight_dtype. |
| mxfp4_moe | "auto" | GPT-OSS MXFP4 expert policy: "auto", "routed_dequant"/"packed", or "bf16"/"dequant". "auto" runs native packed MXFP4 GEMM on Blackwell and routed dequant elsewhere. |
| enable_system_speculation | false | Opt in to native MTP speculative decoding (off by default; see Speculative decoding). |
| mtp_num_drafts | 3 | Max MTP draft tokens per speculation step (0–32). |
| mtp_assistant_snapshot_dir | "" | Optional Gemma-4 MTP assistant checkpoint path; auto-discovered from the HF cache when empty. |
| ready_timeout_s | 600.0 | Seconds to wait for driver readiness. |
| shutdown_timeout_s | 5.0 | Seconds to wait for graceful shutdown. |
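Several options in the table above take a closed set of values or a bounded range, so a config can be sanity-checked before starting pie. The following is a minimal sketch, not pie's own validation: `check_options` is a hypothetical helper, the allowed values are copied from the table, and the `0 < gpu_mem_utilization <= 1` bound is an assumption based on the option being described as a fraction.

```python
# Allowed values copied from the option table above.
MEMORY_PROFILES = {"auto", "latency", "balanced", "throughput", "capacity"}
KV_CACHE_DTYPES = {
    "auto", "bf16", "bfloat16", "fp8_e4m3", "fp8_e5m2",
    "int8_per_token_head", "fp8_per_token_head", "fp4_e2m1", "nvfp4",
}
RUNTIME_QUANT = {"", "fp8", "int8"}


def check_options(opts):
    """Return a list of problems found in a [model.driver.options] dict.

    Missing keys fall back to the documented defaults.
    """
    problems = []

    def check_choice(key, allowed, default):
        value = opts.get(key, default)
        if value not in allowed:
            problems.append(f"{key}: {value!r} is not one of {sorted(allowed)}")

    check_choice("memory_profile", MEMORY_PROFILES, "auto")
    check_choice("kv_cache_dtype", KV_CACHE_DTYPES, "auto")
    check_choice("runtime_quant", RUNTIME_QUANT, "")

    # Documented range for MTP draft tokens is 0-32.
    drafts = opts.get("mtp_num_drafts", 3)
    if not 0 <= drafts <= 32:
        problems.append(f"mtp_num_drafts: {drafts} is outside 0-32")

    # Assumption: a "fraction of total GPU memory" lies in (0, 1].
    util = opts.get("gpu_mem_utilization", 0.90)
    if not 0.0 < util <= 1.0:
        problems.append(f"gpu_mem_utilization: {util} is outside (0, 1]")

    return problems


if __name__ == "__main__":
    print(check_options({"kv_cache_dtype": "fp16", "mtp_num_drafts": 40}))
```

An empty list means none of these checks failed; the driver may still reject a config for reasons this sketch does not cover.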

For GPU memory sizing, cuda_native accepts only planner inputs. At startup, the driver reports the values it derived from them: forward limits, KV page size, KV page count, arena size, and Qwen3.5 state slots.

CUDA-graph capture of the decode path is available (experimental); it engages automatically for graph-safe architectures running a native BF16 KV cache.

Supported architectures

cuda_native covers the architectures ported to driver/cuda/src/.

| Family | HF model_type | Notes |
|---|---|---|
| Llama 3.x / Mistral-compatible | llama | Instruct and base checkpoints. |
| Qwen 2.x | qwen2 | Qwen2 and Qwen2.5. |
| Qwen 3.x | qwen3 | Includes the default Qwen/Qwen3-0.6B. |
| Qwen 3.5 / 3.5-MoE | qwen3_5, qwen3_5_moe | Hybrid GDN linear attention; native MTP speculation. |
| Qwen3-VL | qwen3_vl | Multimodal: vision input. |
| Phi-3 | phi3 | Microsoft Phi-3 family. |
| Mixtral | mixtral | MoE path. |
| GPT-OSS | gptoss, gpt_oss | MoE; MXFP4 experts (see mxfp4_moe). |
| Gemma 2 / 3 / 4 | gemma2, gemma3_text, gemma3n, gemma4_text, gemma4 | Gemma-4 adds multimodal (vision + audio in) and native MTP. |
| Mistral 3 | mistral3 | Ministral-class checkpoints. |
| OLMo 3 | olmo3 | AI2 OLMo 3. |
| GLM-5.1 | glm_moe_dsa | MLA attention, DSA MoE; FP4 expert quantization. |
| Nemotron-H | nemotron_h | Hybrid Mamba/attention. |
| Kimi / DeepSeek | kimi_k2, deepseek_v2, deepseek_v3, deepseek_v4 | MLA-based MoE. deepseek_v4 is architecture-level support (no public checkpoint yet). |
| CSM | csm | Multimodal: audio output (text-to-speech). |

Run pie model list to see whether cached HuggingFace repos are compatible with your installed drivers. For the multimodal models (Qwen3-VL, Gemma-4, CSM), see Multimodal generation.
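For a quick offline check, a checkpoint's `model_type` field (stored in its HuggingFace `config.json`) can be compared against the table above. This is a rough sketch only: `model_type_supported` and `load_config` are hypothetical helpers, the set is copied from the table, and `pie model list` may apply checks this comparison does not.

```python
import json
from pathlib import Path

# model_type values copied from the supported-architectures table above.
SUPPORTED_MODEL_TYPES = {
    "llama", "qwen2", "qwen3", "qwen3_5", "qwen3_5_moe", "qwen3_vl",
    "phi3", "mixtral", "gptoss", "gpt_oss",
    "gemma2", "gemma3_text", "gemma3n", "gemma4_text", "gemma4",
    "mistral3", "olmo3", "glm_moe_dsa", "nemotron_h",
    "kimi_k2", "deepseek_v2", "deepseek_v3", "deepseek_v4", "csm",
}


def model_type_supported(config: dict) -> bool:
    """Return True if the config's model_type appears in the table."""
    return config.get("model_type") in SUPPORTED_MODEL_TYPES


def load_config(path) -> dict:
    """Read a HuggingFace config.json from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    print(model_type_supported({"model_type": "qwen3"}))
    print(model_type_supported({"model_type": "gpt2"}))
```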

Quantization

KV cache format is controlled by kv_cache_dtype under [model.driver.options]. The default auto preserves the native BF16 cache. Set it explicitly to opt into fp8_e4m3, fp8_e5m2, int8_per_token_head, fp8_per_token_head, fp4_e2m1, or nvfp4.

Weight quantization is separate: weight_dtype = "bfloat16" is the default, and runtime_quant quantizes projection weights to "fp8" or "int8". GPT-OSS experts use MXFP4 via mxfp4_moe, and GLM-5.1 routed experts support FP4. Quantization support varies by architecture.
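To make "per-channel symmetric quantization" concrete, here is a plain-Python sketch of the int8 case. It illustrates the general technique only and is not pie's kernel code: treating each row of the weight matrix as one channel is an assumption, and the function names are hypothetical.

```python
def quantize_per_channel_int8(weight):
    """Symmetric int8 quantization with one scale per row.

    weight is a list of non-empty rows of floats. Symmetric means there is
    no zero-point: each row maps [-max_abs, max_abs] onto [-127, 127].
    """
    quantized, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # An all-zero row gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_row = [max(-127, min(127, round(v / scale))) for v in row]
        quantized.append(q_row)
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Recover approximate float weights from int8 values and row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


if __name__ == "__main__":
    w = [[0.5, -1.0, 0.25], [0.02, 0.01, -0.03]]
    q, s = quantize_per_channel_int8(w)
    print(q)
    print(dequantize(q, s))
```

Because each row has its own scale, a row with small values keeps its resolution even when another row contains large ones, which is the usual motivation for per-channel over per-tensor scaling.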