Portable

type = "portable". An embedded ggml-backed driver built from driver/portable/ on top of vendored llama.cpp/ggml. It loads Hugging Face safetensors directly, with no GGUF conversion. By default it picks the best backend compiled into the binary and falls back to CPU. The CUDA, Vulkan, Metal, HIP, and SYCL backends are selected at C++ build time via GGML_*=ON CMake flags.

Pick portable for CPU-only inference, non-NVIDIA GPUs (Vulkan / Metal / HIP / SYCL), or any host where you want a self-contained binary without a Python runtime.

Install

The default installer uses portable when no supported CUDA build is detected:

curl -fsSL https://pie-project.org/install.sh | bash

To force the portable build:

curl -fsSL https://pie-project.org/install.sh | PIE_FLAVOR=portable bash

From source, cargo install --path server builds pie with driver-portable enabled by default. CUDA-capable portable release artifacts are named portable-cuda12.6, portable-cuda12.8, and portable-cuda13.0.

Configuration

[model.driver]
type = "portable"
device = ["auto"]                 # logical local replica name
activation_dtype = "bfloat16"

[model.driver.options]
kv_page_size         = 32
max_num_kv_pages     = 1024
max_batch_tokens     = 10240
max_batch_size       = 512
cpu_pages            = 0
ready_timeout_s      = 120.0
shutdown_timeout_s   = 5.0

Key	Default	Description
`binary_path`	`""`	Accepted for older config compatibility, ignored by standalone `pie`; the driver is embedded.
`kv_page_size`	`32`	KV cache page size in tokens.
`max_num_kv_pages`	`1024`	KV pool size in pages. KV memory scales linearly with this value.
`max_batch_tokens`	`10240`	Cap on tokens per fire_batch.
`max_batch_size`	`512`	Cap on sequences per fire_batch.
`cpu_pages`	`0`	Host-side swap pool capacity. `0` = no swap.
`ready_timeout_s`	`120.0`	Seconds to wait for `READY` on stdout.
`shutdown_timeout_s`	`5.0`	Seconds to wait for graceful shutdown after SIGTERM.
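
Because KV memory scales linearly with max_num_kv_pages, a rough budget can be estimated before raising it. The following is a minimal sketch, not the driver's actual accounting: it uses the textbook per-token KV size for a dense-attention transformer (one key and one value vector per layer and KV head) and assumes the cache is stored at 2 bytes per element. The function name kv_cache_bytes and the model shape in the example (28 layers, 8 KV heads, head dimension 128) are hypothetical; read the real values from your model's config.json.

```python
def kv_cache_bytes(max_num_kv_pages, kv_page_size, num_layers,
                   num_kv_heads, head_dim, bytes_per_element=2):
    """Rough KV pool size in bytes (assumed layout, not the driver's own accounting)."""
    tokens = max_num_kv_pages * kv_page_size  # total token slots in the pool
    # Each token stores one key and one value vector per layer and KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return tokens * per_token


# Hypothetical model shape with the default pool settings from the table above.
size = kv_cache_bytes(max_num_kv_pages=1024, kv_page_size=32,
                      num_layers=28, num_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.2f} GiB")  # 3.50 GiB
```

Doubling max_num_kv_pages doubles the estimate. Models with linear-attention layers or shared KV will deviate from this formula, so treat the result as an upper-level guide rather than an exact figure.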

Supported architectures

Verified end-to-end on RTX PRO 6000 Blackwell (CUDA backend):

Family	HF `model_type`	Status	Notes
Llama 3.x	`llama`	stable	All sizes (1B-70B); base + instruct.
Qwen 2.x	`qwen2`	stable	Qwen2 / Qwen2.5 all sizes.
Qwen 3.x	`qwen3`	stable	Qwen3 incl. the 0.6B default.
Qwen 3 MoE	`qwen3_moe`	preview	Qwen3-30B-A3B verified to load.
Qwen 3.5 / 3.6	`qwen3_5`	preview	Linear-attention layers wired.
Qwen 3.5 MoE	`qwen3_5_moe`	preview	35B-A3B verified to load.
Phi-3	`phi3`	stable	mini / medium (instruct + base).
Phi-3-small	`phi3small`	preview	Loads + runs; v1 graph treats blocksparse as causal (correct for prompts ≤ 1024 tokens). End-to-end inferlet test blocked on tiktoken→tokenizer.json conversion.
Phi-3.5 / Phi-4	`phi3`	stable	Phi-3.5-mini-instruct, Phi-4 (14B), Phi-4-mini-instruct, Phi-4-reasoning, Phi-4-mini-reasoning all work via the existing `phi3` arch (canonical pangram on RTX PRO 6000 Blackwell).
Phi-3.5-MoE	`phimoe`	stable	42B-A6.6B sparse MoE. Mixtral-style per-expert w1/w2/w3 + LayerNorm-with-bias + Q/K/V/O biases + lm_head bias. End-to-end canonical pangram.
Mixtral	`mixtral`	preview	8x7B verified end-to-end.
Gemma 2	`gemma2`	stable	2B / 9B / 27B.
Gemma 3	`gemma3`, `gemma3_text`	stable	270M / 1B text-only AND 4B / 12B / 27B multimodal-wrapped.
Gemma 3n	`gemma3n`	stable	E2B-it / E4B-it (AltUp + PLE + Laurel + KV-share).
Gemma 4	`gemma4`, `gemma4_text`	stable for dense	Mobile (E2B / E4B), 31B (alt-attention + per-layer KV).
Gemma 4 MoE	`gemma4` (with experts)	—	26B-A4B: loader-only; expert dispatch graph not yet wired.
Mistral 7B / Mistral 3	`mistral`, `mistral3`	stable	Ministral 3 (3B / 8B / 14B), Mistral-Small 24B / 3.1-24B.
OLMo 2	`olmo2`	stable	Mapped to the `olmo3` arch via dispatch alias (`hf_config.cpp`).
OLMo 3	`olmo3`	stable	7B / 32B base.
GPT-OSS	`gptoss`, `gpt_oss`	preview	20B verified (MXFP4 dequant path).

Architectures listed as stable have produced canonical outputs on instruct-tuned variants (e.g., Phi-3-mini-instruct → "The quick brown fox jumps over the lazy dog. This sentence is a pangram…"). Base checkpoints tend to loop when wrapped in the inferlet's chat template; use the raw-completion inferlet for clean base-model output.
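
To check a checkpoint against the table above before loading it, the model_type field of its Hugging Face config.json can be looked up. This is a minimal sketch: the SUPPORT mapping is transcribed by hand from the table (so it can go stale), and the helper name portable_status is hypothetical, not part of Pie.

```python
import json

# Status per HF model_type, copied by hand from the table above.
SUPPORT = {
    "llama": "stable", "qwen2": "stable", "qwen3": "stable",
    "qwen3_moe": "preview", "qwen3_5": "preview", "qwen3_5_moe": "preview",
    "phi3": "stable", "phi3small": "preview", "phimoe": "stable",
    "mixtral": "preview", "gemma2": "stable",
    "gemma3": "stable", "gemma3_text": "stable", "gemma3n": "stable",
    "gemma4": "stable for dense; MoE is loader-only",
    "gemma4_text": "stable for dense",
    "mistral": "stable", "mistral3": "stable",
    "olmo2": "stable", "olmo3": "stable",
    "gptoss": "preview", "gpt_oss": "preview",
}


def portable_status(config_path):
    """Return (model_type, status) for a Hugging Face config.json file."""
    with open(config_path, encoding="utf-8") as f:
        model_type = json.load(f).get("model_type")
    return model_type, SUPPORT.get(model_type, "not listed")
```

Calling portable_status on a checkpoint's config.json returns the pair to compare against the table; a model_type that is absent from the mapping comes back as "not listed".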

Quantization

Weight quantization is not configurable via Pie. The portable driver loads weights at their stored dtype directly from safetensors; only activation_dtype (one level up, under [model.driver]) is configurable. For weight-side quantization, use the CUDA driver.
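
Since weights are loaded at their stored dtype, it can help to check what a checkpoint actually contains. The sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON table of tensors) and counts tensors per dtype. The helper name safetensors_dtypes is hypothetical, and the script is independent of Pie.

```python
import json
import struct
from collections import Counter


def safetensors_dtypes(path):
    """Count tensors per stored dtype by reading only the file header."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(entry["dtype"] for name, entry in header.items()
                   if name != "__metadata__")
```

For a sharded checkpoint, run it on each .safetensors shard. A checkpoint stored in bfloat16 reports its tensors under the BF16 key.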
