Roadmap
Last updated: October 6, 2025
This roadmap is directional. Priorities may evolve based on user feedback and research findings.
Legend: each item is tagged by area (runtime, api, performance, or tooling).
In progress
- Apple Metal backend (runtime): Enable GPU execution on Apple silicon via Metal for laptop development and demos.
- Native C++ inference backend (runtime): A lean, low-latency server core with a smaller footprint than polyglot runtimes.
- PEFT adapters API, e.g., LoRA (api): Hot-swap adapters per request or session with safe isolation (see the sketch after this list).
- Multi-GPU tensor and pipeline parallelism (performance): Scale model size and throughput across devices.
- 4-/8-bit quantization (performance): Reduce VRAM while preserving quality, with per-layer policies (sketch below).
- KV-cache quantization (performance): Larger batches and longer contexts via cache compression (sketch below).
- WASI Preview 3 migration (runtime): Better asynchronous I/O and networking support.
- Preemptive inferlet scheduling (runtime): Fairness and SLOs under load; pause/resume primitives (sketch below).
- Expanded model support (runtime): Broader LLM coverage across model families and sizes.
- Inferlet standard libraries for Go / C++ / Python / JS (tooling): Ergonomic helpers for common patterns; batteries included.
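To make the PEFT adapters item concrete, here is a minimal, self-contained Python sketch of the hot-swap pattern: a shared base model plus per-request LoRA adapters, scoped so one session's adapter never leaks into another. Every name here (AdapterRegistry, Session, adapter, generate) is a hypothetical illustration, not the project's actual API.

```python
# A toy sketch of per-session adapter hot-swapping with safe isolation.
# All class and method names are hypothetical, not the real API.
from contextlib import contextmanager

class AdapterRegistry:
    """Loads each adapter once, then serves it from cache."""
    def __init__(self, loader):
        self._loader = loader          # e.g. would read LoRA weights from disk
        self._cache = {}

    def get(self, adapter_id):
        if adapter_id not in self._cache:
            self._cache[adapter_id] = self._loader(adapter_id)
        return self._cache[adapter_id]

class Session:
    """One inference session over a shared base model."""
    def __init__(self, registry):
        self._registry = registry
        self._active = None            # adapter for THIS session only

    @contextmanager
    def adapter(self, adapter_id):
        previous, self._active = self._active, self._registry.get(adapter_id)
        try:
            yield self
        finally:
            self._active = previous    # restore on exit: safe isolation

    def generate(self, prompt):
        tag = self._active or "base"
        return f"[{tag}] completion for: {prompt}"

# Usage: the adapter applies only inside the scoped block.
registry = AdapterRegistry(loader=lambda name: name)
s = Session(registry)
with s.adapter("support-lora-v2"):
    print(s.generate("How do I reset my password?"))
print(s.generate("Plain base-model request"))
```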
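For the 4-/8-bit quantization item, a back-of-envelope calculation shows the VRAM headroom at stake and what a per-layer policy might look like. The model size and the policy mapping are illustrative assumptions, not measurements or a real configuration format.

```python
# Illustrative arithmetic only: a 7B-parameter model, weights-only memory.
params = 7e9
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.1f} GB")
# fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB

# A per-layer policy could keep precision-sensitive layers wider, e.g.
# embeddings and the output head in 8-bit while inner blocks go to 4-bit.
# This mapping is a hypothetical illustration, not a real config schema.
policy = {"embed_tokens": "int8", "transformer_blocks": "int4", "lm_head": "int8"}
```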
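Similarly, for KV-cache quantization: at long contexts and large batches the cache, not the weights, often dominates memory, and halving bytes per entry roughly doubles the batch size or context length that fits in the same budget. The dimensions below are assumed for illustration.

```python
# Assumed dimensions, roughly a 7B-class model with grouped-query attention.
layers, kv_heads, head_dim = 32, 8, 128
seq_len, batch = 8192, 4

def kv_cache_bytes(bytes_per_elem):
    # The factor of 2 accounts for storing both keys and values per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

print(f"fp16 cache: {kv_cache_bytes(2) / 1e9:.1f} GB")  # ~4.3 GB
print(f"int8 cache: {kv_cache_bytes(1) / 1e9:.1f} GB")  # ~2.1 GB
```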
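Finally, the preemptive scheduling item is easiest to see as a toy round-robin loop: each inferlet runs for a fixed token budget, is paused, and resumes on a later turn, so one long request cannot starve short ones. Python generators stand in for pausable inferlets here; none of this is the actual scheduler.

```python
# A toy preemptive scheduler: run each inferlet for a token budget, then
# pause it and requeue. Names and structure are illustrative only.
from collections import deque

def run_round_robin(inferlets, budget=32):
    """inferlets: (name, generator) pairs; each generator yields per token."""
    queue = deque(inferlets)
    while queue:
        name, task = queue.popleft()
        for _ in range(budget):              # run until the budget is spent...
            try:
                next(task)                   # ...or the inferlet completes
            except StopIteration:
                print(f"{name}: finished")
                task = None
                break
        if task is not None:
            print(f"{name}: paused, requeued")
            queue.append((name, task))       # pause/resume: back of the line

def decode(n_tokens):
    for i in range(n_tokens):
        yield i                              # one yield per decoded token

run_round_robin([("long", decode(100)), ("short", decode(5))])
# Output interleaves: "long" pauses after 32 tokens, "short" finishes promptly.
```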
We welcome proposals—open a GitHub issue with your use case or RFC.
Planned / exploratory
- Inferlet package manager (tooling): Discover, version, and share inferlets and libraries.
Feature ideas / RFCs: please file a GitHub issue describing the problem, constraints, and expected API shape.
If something here would unblock you, let us know. Real workloads get priority.