Roadmap

Note: This roadmap is directional. Priorities may evolve based on user feedback and research findings.

Legend

[runtime] [api] [performance] [tooling]

Released in 0.2.0 (2026-01-20)

  • Apple Metal support [runtime]
    Enable GPU execution on Apple silicon via Metal for laptop development and demos.

  • Multi-GPU support with mixed DP and TP [performance]
    Scale model size and throughput across devices using Data Parallelism and Tensor Parallelism.

  • Quantization support (float8/int8/float4) [performance]
    Reduce VRAM while preserving quality; per‑layer policies.

  • New model support [runtime]
    Support for gpt-oss, ministral3, gemma2, gemma3, olmo3.


In progress

  • PEFT adapters API (e.g., LoRA) [api]
    Hot‑swap adapters per request/session with safe isolation.

  • KV‑cache quantization [performance]
    Larger batches and longer contexts via cache compression.

  • Inferlet standard libraries (Python / JS) [tooling]
    Ergonomic helpers for common patterns; batteries included.

  • Inferlet package manager [tooling]
    Discover, version, and share inferlets and libraries.


We welcome proposals and RFCs: please open a GitHub issue describing your use case, the problem, constraints, and expected API shape.
If something here would unblock you, let us know. Real workloads get priority.