CUDA

Pie has two CUDA-oriented paths:

cuda_native: an embedded C++/CUDA driver, linked into the pie binary when the installed flavor includes CUDA.
dev: a Python driver run as a subprocess, backed by PyTorch and FlashInfer; useful for development and model/kernel prototyping.

Use cuda_native for the standalone binary path. Use dev when you need the readable Python implementation or its platform extras.

Install

Install a CUDA-flavored pie binary:

curl -fsSL https://pie-project.org/install.sh | PIE_FLAVOR=cuda12.8 bash

Auto-detection chooses cuda13.0 for NVIDIA driver >= 580, cuda12.8 for driver >= 525, and portable otherwise. Valid CUDA flavors are cuda12.6, cuda12.8, cuda13.0, and the matching portable-cuda* variants.
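
The selection rule above can be sketched in a few lines. This is only an illustration of the stated thresholds, not the installer's actual code: the function names are hypothetical, and treating "no driver found" as `None` is an assumption. The version string could come from, for example, `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
from typing import Optional


def parse_driver_major(version: str) -> int:
    """Extract the major component from a string such as '570.124.06'."""
    return int(version.strip().split(".")[0])


def pick_flavor(driver_major: Optional[int]) -> str:
    """Map an NVIDIA driver major version to an install flavor.

    Thresholds follow the rule stated in the text. None stands for
    "no NVIDIA driver detected" (an assumption made for this sketch).
    """
    if driver_major is None:
        return "portable"
    if driver_major >= 580:
        return "cuda13.0"
    if driver_major >= 525:
        return "cuda12.8"
    return "portable"
```

For instance, `pick_flavor(parse_driver_major("570.124.06"))` falls in the 525 to 579 range and so selects cuda12.8. Setting `PIE_FLAVOR` explicitly, as in the install command above, names the flavor directly.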

For the Python dev driver:

pie driver dev install ~/.pie/venvs/dev --run
pie driver dev set venv ~/.pie/venvs/dev
pie driver dev doctor

The install recipe uses pie-driver-dev[cu128] by default. For source or manual installs, the driver wheel also exposes cu126, cu128, and metal extras.

`cuda_native` configuration

[model.driver]
type = "cuda_native"
device = ["cuda:0"]
tensor_parallel_size = 1
activation_dtype = "bfloat16"

[model.driver.options]
gpu_mem_utilization  = 0.85
kv_page_size         = 32
max_batch_tokens     = 10240
max_batch_size       = 512
max_num_kv_pages     = 1024
swap_pool_size       = 0
weight_dtype         = "bfloat16"
runtime_quant        = ""
ready_timeout_s      = 600.0
shutdown_timeout_s   = 5.0

Key	Default	Description
`binary_path`	`""`	Accepted for compatibility with older configs but ignored by standalone `pie`, because the driver is embedded.
`gpu_mem_utilization`	`0.85`	Fraction of free GPU memory the driver claims.
`kv_page_size`	`32`	KV cache page size in tokens.
`max_batch_tokens`	`10240`	Cap on tokens per batch.
`max_batch_size`	`512`	Cap on sequences per batch.
`max_num_kv_pages`	`1024`	KV cache page count. KV memory scales linearly with this value.
`swap_pool_size`	`0`	Pinned host KV-page count for swap-out. `0` disables swap.
`weight_dtype`	`"bfloat16"`	Weight precision.
`runtime_quant`	`""`	Empty disables runtime quantization; `"fp8"` enables the current FP8 path where supported.
`ready_timeout_s`	`600.0`	Seconds to wait for driver readiness.
`shutdown_timeout_s`	`5.0`	Seconds to wait for graceful shutdown.
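
To get a feel for how `max_num_kv_pages` and `kv_page_size` translate into memory, here is a rough estimate using the usual paged KV-cache formula. It is a sketch under stated assumptions: the model dimensions are illustrative placeholders, not values taken from pie or from any particular checkpoint, and the driver's real allocation may include overhead this ignores.

```python
def kv_cache_bytes(
    num_pages: int,
    page_size: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # 2 bytes per element for bfloat16/float16
) -> int:
    """Estimate KV cache size in bytes for a paged cache.

    Each token stores one key and one value vector per layer, hence the
    leading factor of 2.
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return num_pages * page_size * bytes_per_token


# Illustrative (hypothetical) model shape: 28 layers, 8 KV heads, head_dim 128.
default_bytes = kv_cache_bytes(
    num_pages=1024, page_size=32, num_layers=28, num_kv_heads=8, head_dim=128
)
print(f"{default_bytes / 2**30:.2f} GiB")
```

With these placeholder dimensions, the default 1024 pages of 32 tokens come to 3.5 GiB, and doubling `max_num_kv_pages` doubles the estimate.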

`dev` configuration

[model.driver]
type = "dev"
device = ["cuda:0"]
tensor_parallel_size = 1
activation_dtype = "bfloat16"

[model.driver.options]
venv = "/home/me/.pie/venvs/dev"
gpu_mem_utilization  = 0.8
max_batch_tokens     = 10240
max_batch_size       = 512
max_dist_size        = 32
max_num_embeds       = 128
max_num_adapters     = 32
max_adapter_rank     = 8
kv_page_size         = 16
weight_dtype         = "auto"
cpu_mem_budget_in_gb = 0

Key	Default	Description
`venv` / `python`	unset	Optional per-model interpreter override. When unset, the interpreter is resolved by checking, in order: the value stored by `pie driver dev set ...`, `PIE_PYTHON`, the active venv, and `python3`.
`gpu_mem_utilization`	`0.8`	Fraction of free GPU memory used as the KV budget.
`max_batch_tokens`	`10240`	Maximum tokens summed across all sequences in a batch.
`max_batch_size`	`512`	Maximum sequences per batch.
`max_dist_size`	`32`	Cap on `Distribution { k }` probe size.
`max_num_embeds`	`128`	Cap on embedding lookups per batch.
`max_num_adapters`	`32`	Adapter slot capacity.
`max_adapter_rank`	`8`	Maximum LoRA rank.
`kv_page_size`	`16`	KV cache page size in tokens.
`weight_dtype`	`"auto"`	`auto`, `float32`, `float16`, `bfloat16`, `int4`, `int8`, or `float8`.
`cpu_mem_budget_in_gb`	`0`	Pinned host pool for KV swap, in GiB. `0` disables swap.
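
The interpreter lookup order described for `venv` / `python` can be pictured as a first-match search. The sketch below is an illustration only: `resolve_python` and its parameters are hypothetical, and using the `VIRTUAL_ENV` environment variable as the signal for "the active venv" is an assumption about how that step is detected.

```python
from typing import Mapping, Optional, Tuple


def resolve_python(
    model_override: Optional[str],
    stored_setting: Optional[str],
    env: Mapping[str, str],
) -> Tuple[str, str]:
    """Return (source, value) for the first interpreter candidate that is set.

    model_override: the per-model `venv` / `python` option, if any.
    stored_setting: whatever `pie driver dev set ...` saved, if any
                    (how it is stored is not modeled here).
    env:            environment variables, e.g. os.environ.
    """
    candidates = [
        ("model option", model_override),
        ("pie driver dev set", stored_setting),
        ("PIE_PYTHON", env.get("PIE_PYTHON")),
        ("active venv", env.get("VIRTUAL_ENV")),  # assumed detection signal
    ]
    for source, value in candidates:
        if value:
            return source, value
    # Nothing configured: fall back to whatever python3 is on PATH.
    return "fallback", "python3"
```

Calling `resolve_python(None, None, os.environ)` in a shell with no configuration and no activated venv would return `("fallback", "python3")`.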

Supported architectures

cuda_native covers the architectures ported to driver/cuda/src/. The dev driver has the broader Python model roster under driver/dev/src/pie_driver_dev/model/.

Family	HF `model_type`	Notes
Llama 3.x / Mistral-compatible	`llama`	Instruct and base checkpoints.
Qwen 2.x	`qwen2`	Qwen2 and Qwen2.5.
Qwen 3.x	`qwen3`	Includes the default `Qwen/Qwen3-0.6B`.
Qwen 3.5	`qwen3_5`	Supported where the selected CUDA path has the graph implemented.
Phi-3	`phi3`	Microsoft Phi-3 family.
Mixtral	`mixtral`	MoE path.
Gemma 2 / 3 / 4	`gemma2`, `gemma3_text`, `gemma4_text`, `gemma4`	Text checkpoints.
Mistral 3	`mistral3`	Ministral-class checkpoints.
OLMo 3	`olmo3`	AI2 OLMo 3.
GPT-OSS	`gptoss`, `gpt_oss`	Community GPT-OSS variants.

Run pie model list to see whether cached HuggingFace repos are compatible with your installed drivers.
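
As a quick pre-check outside of pie, a checkpoint's `config.json` can be compared against the `model_type` values in the table above. This is a sketch, not how `pie model list` works internally: the helper name is hypothetical, and a listed `model_type` does not guarantee that a particular driver has that architecture implemented.

```python
import json

# model_type values copied from the table above.
LISTED_MODEL_TYPES = {
    "llama", "qwen2", "qwen3", "qwen3_5", "phi3", "mixtral",
    "gemma2", "gemma3_text", "gemma4_text", "gemma4",
    "mistral3", "olmo3", "gptoss", "gpt_oss",
}


def is_listed_model_type(config_json_text: str) -> bool:
    """Return True if the HF config's model_type appears in the table."""
    config = json.loads(config_json_text)
    return config.get("model_type") in LISTED_MODEL_TYPES


print(is_listed_model_type('{"model_type": "qwen3"}'))
```

In practice the input would be the text of the `config.json` file in a downloaded HuggingFace repo.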

Quantization

For dev, set weight_dtype under [model.driver.options]:

[model.driver.options]            # type = "dev"
weight_dtype = "float8"           # auto / float32 / float16 / bfloat16 / int4 / int8 / float8

For cuda_native, weight_dtype defaults to "bfloat16". Setting runtime_quant = "fp8" enables the current runtime quantization path where supported; leaving it empty disables runtime quantization.
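
A small sanity check for these options might look like the following. It is a hypothetical helper, not part of pie: it takes the already-parsed `[model.driver.options]` table as a dict and only knows the values documented on this page, so a value it flags may still be accepted by a newer driver.

```python
from typing import Any, List, Mapping

# Values documented on this page; the drivers may accept others.
DEV_WEIGHT_DTYPES = {
    "auto", "float32", "float16", "bfloat16", "int4", "int8", "float8",
}
CUDA_NATIVE_RUNTIME_QUANT = {"", "fp8"}


def check_quant_options(driver_type: str, options: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems: List[str] = []
    if driver_type == "dev":
        dtype = options.get("weight_dtype", "auto")
        if dtype not in DEV_WEIGHT_DTYPES:
            problems.append(f"dev: undocumented weight_dtype {dtype!r}")
    elif driver_type == "cuda_native":
        quant = options.get("runtime_quant", "")
        if quant not in CUDA_NATIVE_RUNTIME_QUANT:
            problems.append(f"cuda_native: undocumented runtime_quant {quant!r}")
    else:
        problems.append(f"unknown driver type {driver_type!r}")
    return problems
```

For example, `check_quant_options("dev", {"weight_dtype": "float8"})` returns an empty list, while passing `"fp8"` as a dev `weight_dtype` is reported, since the dev spelling is `float8`.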
