Roadmap

Last updated: October 6, 2025

note

This roadmap is directional. Priorities may evolve based on user feedback and research findings.

Legend

runtime · api · performance · tooling

In progress

  • Apple Metal backend (runtime)
    Enable GPU execution on Apple silicon via Metal for laptop development and demos.

  • Native C++ inference backend (runtime)
    Lean, low‑latency server core with a smaller footprint than polyglot runtimes.

  • PEFT adapters API, e.g., LoRA (api)
    Hot‑swap adapters per request/session with safe isolation.

  • Multi‑GPU tensor & pipeline parallelism (performance)
    Scale model size and throughput across devices.

  • 4‑/8‑bit quantization (performance)
    Reduce VRAM while preserving quality; per‑layer policies.

  • KV‑cache quantization (performance)
    Larger batches and longer contexts via cache compression.

  • WASI Preview 3 migration (runtime)
    Better asynchronous I/O and networking support.

  • Preemptive inferlet scheduling (runtime)
    Fairness and SLOs under load; pause/resume primitives.

  • Expanded model support (runtime)
    Broader LLM coverage across model families and sizes.

  • Inferlet standard libraries for Go, C++, Python, and JS (tooling)
    Ergonomic helpers for common patterns; batteries included.
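To make the two quantization items above concrete, here is a back‑of‑envelope sketch of the VRAM they target. The model shape (7B parameters, 32 layers, 32 KV heads, head dimension 128) is an illustrative assumption, not a statement about any specific supported model:

```python
# Rough VRAM estimates showing why weight and KV-cache quantization matter.
# All shapes and sizes below are illustrative assumptions.

def weight_bytes(n_params: int, bits: int) -> int:
    """Memory for model weights stored at the given precision."""
    return n_params * bits // 8

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bits: int) -> int:
    """Memory for the KV cache: two tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bits // 8

GIB = 1024 ** 3
n_params = 7_000_000_000

# Weights: fp16 vs. 4-bit
print(f"weights fp16:  {weight_bytes(n_params, 16) / GIB:.1f} GiB")   # 13.0 GiB
print(f"weights 4-bit: {weight_bytes(n_params, 4) / GIB:.1f} GiB")    # 3.3 GiB

# KV cache at batch 8 with a 4096-token context: fp16 vs. 8-bit
for bits in (16, 8):
    b = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8, bits=bits)
    print(f"kv cache @{bits}-bit: {b / GIB:.1f} GiB")                 # 16.0 / 8.0 GiB
```

Halving the cache precision directly doubles the batch size or context length that fits in the same VRAM budget, which is the "larger batches and longer contexts" payoff the roadmap item describes.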

We welcome proposals: open a GitHub issue with your use case or RFC.


Planned / exploratory

  • Inferlet package manager tooling
    Discover, version, and share inferlets and libraries.

Feature ideas / RFCs: please file a GitHub issue describing the problem, constraints, and expected API shape.
If something here would unblock you, let us know. Real workloads get priority.