# Portable

type = "portable". An embedded ggml-backed driver built from driver/portable/ on top of vendored llama.cpp/ggml. Loads HuggingFace safetensors directly, no GGUF conversion. By default it picks the best backend compiled into the binary and falls back to CPU. CUDA, Vulkan, Metal, HIP, and SYCL backends are selected at C++ build time via GGML_*=ON CMake flags.

Pick portable for CPU-only inference, non-NVIDIA GPUs (Vulkan / Metal / HIP / SYCL), or any host where you want a self-contained binary without a Python runtime.

## Install

The default installer uses portable when no supported CUDA build is detected:

```sh
curl -fsSL https://pie-project.org/install.sh | bash
```

To force the portable build:

```sh
curl -fsSL https://pie-project.org/install.sh | PIE_FLAVOR=portable bash
```

From source, `cargo install --path server` builds pie with `driver-portable` enabled by default. CUDA-capable portable release artifacts are named `portable-cuda12.6`, `portable-cuda12.8`, and `portable-cuda13.0`.
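If you build from source and want a non-default GPU backend (for example Vulkan on a non-NVIDIA GPU), the relevant `GGML_*=ON` flag has to reach the vendored CMake build. As a minimal sketch, assuming the build forwards CMake flags through the common `CMAKE_ARGS` environment variable (check the repository's build documentation for the actual mechanism):

```sh
# Hypothetical: enable the Vulkan ggml backend at C++ build time.
# GGML_VULKAN is the standard ggml CMake option; the CMAKE_ARGS
# passthrough is an assumption about this project's build script.
CMAKE_ARGS="-DGGML_VULKAN=ON" cargo install --path server
```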

## Configuration

```toml
[model.driver]
type = "portable"
device = ["auto"]             # logical local replica name
activation_dtype = "bfloat16"

[model.driver.options]
kv_page_size = 32
max_num_kv_pages = 1024
max_batch_tokens = 10240
max_batch_size = 512
cpu_pages = 0
ready_timeout_s = 120.0
shutdown_timeout_s = 5.0
```
| Key | Default | Description |
|---|---|---|
| `binary_path` | `""` | Accepted for older-config compatibility; ignored by standalone pie (the driver is embedded). |
| `kv_page_size` | `32` | KV cache page size in tokens. |
| `max_num_kv_pages` | `1024` | KV pool size. KV memory scales linearly. |
| `max_batch_tokens` | `10240` | Cap on tokens per `fire_batch`. |
| `max_batch_size` | `512` | Cap on sequences per `fire_batch`. |
| `cpu_pages` | `0` | Host-side swap pool capacity. `0` = no swap. |
| `ready_timeout_s` | `120.0` | Seconds to wait for `READY` on stdout. |
| `shutdown_timeout_s` | `5.0` | Seconds to wait for graceful shutdown after SIGTERM. |
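For a concrete sizing example: with the defaults above, the KV pool can hold at most `kv_page_size` × `max_num_kv_pages` = 32 × 1024 = 32,768 cached tokens across all in-flight sequences, and since KV memory scales linearly with `max_num_kv_pages`, doubling it doubles the pool's memory footprint.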

## Supported architectures

Verified end-to-end on RTX PRO 6000 Blackwell (CUDA backend):

| Family | HF `model_type` | Status | Notes |
|---|---|---|---|
| Llama 3.x | `llama` | stable | All sizes (1B-70B); base + instruct. |
| Qwen 2.x | `qwen2` | stable | Qwen2 / Qwen2.5, all sizes. |
| Qwen 3.x | `qwen3` | stable | Qwen3, incl. the 0.6B default. |
| Qwen 3 MoE | `qwen3_moe` | preview | Qwen3-30B-A3B verified loads. |
| Qwen 3.5 / 3.6 | `qwen3_5` | preview | Linear-attention layers wired. |
| Qwen 3.5 MoE | `qwen3_5_moe` | preview | 35B-A3B verified loads. |
| Phi-3 | `phi3` | stable | mini / medium (instruct + base). |
| Phi-3-small | `phi3small` | preview | Loads + runs; v1 graph treats blocksparse as causal (correct for prompts ≤ 1024 tokens). End-to-end inferlet test blocked on tiktoken→tokenizer.json conversion. |
| Phi-3.5 / Phi-4 | `phi3` | stable | Phi-3.5-mini-instruct, Phi-4 (14B), Phi-4-mini-instruct, Phi-4-reasoning, Phi-4-mini-reasoning all work via the existing phi3 arch (canonical pangram on RTX PRO 6000 Blackwell). |
| Phi-3.5-MoE | `phimoe` | stable | 42B-A6.6B sparse MoE. Mixtral-style per-expert w1/w2/w3 + LayerNorm-with-bias + Q/K/V/O biases + lm_head bias. End-to-end canonical pangram. |
| Mixtral | `mixtral` | preview | 8x7B verified end-to-end. |
| Gemma 2 | `gemma2` | stable | 2B / 9B / 27B. |
| Gemma 3 | `gemma3`, `gemma3_text` | stable | 270M / 1B text-only and 4B / 12B / 27B multimodal-wrapped. |
| Gemma 3n | `gemma3n` | stable | E2B-it / E4B-it (AltUp + PLE + Laurel + KV-share). |
| Gemma 4 | `gemma4`, `gemma4_text` | stable for dense | Mobile (E2B / E4B), 31B (alt-attention + per-layer KV). |
| Gemma 4 MoE | `gemma4` (with experts) | loader-only | 26B-A4B: expert dispatch graph not yet wired. |
| Mistral 7B / Mistral 3 | `mistral`, `mistral3` | stable | Ministral 3 (3B / 8B / 14B), Mistral-Small 24B / 3.1-24B. |
| OLMo 2 | `olmo2` | stable | Mapped to the olmo3 arch via dispatch alias (`hf_config.cpp`). |
| OLMo 3 | `olmo3` | stable | 7B / 32B base. |
| GPT-OSS | `gptoss`, `gpt_oss` | preview | 20B verified (MXFP4 dequant path). |

Architectures listed as stable have produced canonical outputs on instruct-tuned variants (e.g., Phi-3-mini-instruct → "The quick brown fox jumps over the lazy dog. This sentence is a pangram…"). Base checkpoints loop on the inferlet's chat-template wrap; use the raw-completion inferlet for clean base-model output.

## Quantization

Weight quantization is not configurable through Pie. The portable driver loads weights directly from safetensors at their stored dtype; only `activation_dtype` (one level up, under `[model.driver]`) is configurable. For weight-side quantization, use the CUDA driver.