Roadmap

Last updated: October 6, 2025

note

This roadmap is directional. Priorities may evolve based on user feedback and research findings.

Legend

runtime · api · performance · tooling

In progress

  • Apple Metal backend (runtime)
    Enable GPU execution on Apple silicon via Metal for laptop development and demos.

  • Native C++ inference backend (runtime)
    Lean, low‑latency server core with a smaller footprint than polyglot runtimes.

  • PEFT adapters API, e.g., LoRA (api)
    Hot‑swap adapters per request/session with safe isolation.

  • Multi‑GPU tensor & pipeline parallelism (performance)
    Scale model size and throughput across devices.

  • 4‑/8‑bit quantization (performance)
    Reduce VRAM while preserving quality; per‑layer policies.

  • KV‑cache quantization (performance)
    Larger batches and longer contexts via cache compression.

  • WASI Preview 3 migration (runtime)
    Better asynchronous I/O and networking support.

  • Preemptive inferlet scheduling (runtime)
    Fairness and SLOs under load; pause/resume primitives.

  • Expanded model support (runtime)
    Broader LLM coverage across model families and sizes.

  • Inferlet standard libraries for Go, C++, Python, and JS (tooling)
    Ergonomic helpers for common patterns; batteries included.
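To make the two quantization items above concrete, here is a back‑of‑envelope sketch of the VRAM they target. The model shape (7B parameters, 32 layers, 32 KV heads, head dimension 128) is an illustrative assumption, not a statement about any specific supported model:

```python
# Rough VRAM estimates showing why weight and KV-cache quantization matter.
# All shapes and sizes below are illustrative assumptions.

def weight_bytes(n_params: int, bits: int) -> int:
    """Memory for model weights stored at the given precision."""
    return n_params * bits // 8

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bits: int) -> int:
    """Memory for the KV cache: two tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bits // 8

GIB = 1024 ** 3
n_params = 7_000_000_000

# Weights: fp16 vs. 4-bit
print(f"weights fp16:  {weight_bytes(n_params, 16) / GIB:.1f} GiB")   # 13.0 GiB
print(f"weights 4-bit: {weight_bytes(n_params, 4) / GIB:.1f} GiB")    # 3.3 GiB

# KV cache at batch 8 with a 4096-token context: fp16 vs. 8-bit
for bits in (16, 8):
    b = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8, bits=bits)
    print(f"kv cache @{bits}-bit: {b / GIB:.1f} GiB")                 # 16.0 / 8.0 GiB
```

Halving the cache precision directly doubles the batch size or context length that fits in the same VRAM budget, which is the "larger batches and longer contexts" payoff the roadmap item describes.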

We welcome proposals: open a GitHub issue with your use case or RFC.


Planned / exploratory

  • Inferlet package manager tooling
    Discover, version, and share inferlets and libraries.

Feature ideas / RFCs: please file a GitHub issue describing the problem, constraints, and expected API shape.
If something here would unblock you, let us know. Real workloads get priority.