llmaz

llmaz is an advanced inference platform for large language models on Kubernetes. It simplifies model deployment, routing and autoscaling across heterogeneous clusters.

Overview

llmaz (pronounced /lima:z/) is a Kubernetes-native inference platform from InftyAI. It provides production-ready tooling and control plane components to deploy, orchestrate and serve large language models at scale.

Key features

  • Support for multiple inference backends, including vLLM, Text Generation Inference (TGI), llama.cpp and TensorRT-LLM; backend selection is shown in the sketch after this list.
  • Heterogeneous cluster and device support with model routing and scheduling.
  • Built-in integrations such as Open WebUI for chat, RAG and other common workflows.
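
To make the backend-selection point concrete, here is a minimal sketch of a Playground-style object pinning a model to one backend, written as a plain Python manifest dict. The API group/version and the field names (modelClaim, backendRuntimeConfig) are assumptions modeled on llmaz's CRD conventions, not a verbatim schema; consult the project docs for the exact fields.

```python
# Minimal sketch of a Playground manifest as a Python dict, assuming the
# inference.llmaz.io/v1alpha1 API group; field names are illustrative.
playground = {
    "apiVersion": "inference.llmaz.io/v1alpha1",  # assumed group/version
    "kind": "Playground",
    "metadata": {"name": "qwen2-demo", "namespace": "default"},
    "spec": {
        "replicas": 1,
        # modelClaim: reference to a separately declared model object.
        "modelClaim": {"modelName": "qwen2-0-5b"},
        # backendRuntimeConfig: choose one of the supported backends.
        "backendRuntimeConfig": {"backendName": "vllm"},
    },
}
```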

Use cases

  • Deploy LLM inference services on Kubernetes with standardized APIs for applications (see the client sketch after this list).
  • Distributed and elastic inference across GPUs/CPUs and mixed environments.
  • Source models from multiple providers and load them automatically as part of operational workflows.
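
On the application side, inference backends such as vLLM expose an OpenAI-compatible HTTP API, which is the kind of standardized interface applications would call once a service is running. The sketch below uses the real `requests` library, but the in-cluster service hostname, port and model name are placeholders, not values defined by llmaz.

```python
import requests

# Hypothetical client call against a served model over the
# OpenAI-compatible chat completions endpoint; URL and model
# name below are placeholders.
resp = requests.post(
    "http://qwen2-demo.default.svc.cluster.local:8080/v1/chat/completions",
    json={
        "model": "qwen2-0-5b",
        "messages": [{"role": "user", "content": "Summarize llmaz in one line."}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```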

Technical notes

  • CRD-based control plane for declarative model and service definitions.
  • Integrations with model hubs and secret management for private model access (see the sketch after this list).
  • Production-oriented features: HPA integration, Karpenter autoscaling and observability hooks.
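
As a sketch of the declarative flow, an OpenModel-style object can point at a model hub and reference a Secret for private access, then be created through the standard Kubernetes custom-objects API. The `llmaz.io/v1alpha1` group, the field names, the `secretRef` wiring and the plural `openmodels` are all assumptions here; only the `kubernetes` client calls are the library's real API.

```python
from kubernetes import client, config

# Illustrative OpenModel manifest: source weights from a model hub and
# reference a Secret holding an access token. Field names are assumptions
# modeled on llmaz's CRD style, not a verbatim schema.
open_model = {
    "apiVersion": "llmaz.io/v1alpha1",  # assumed group/version
    "kind": "OpenModel",
    "metadata": {"name": "qwen2-0-5b"},
    "spec": {
        "familyName": "qwen2",
        "source": {
            "modelHub": {
                "name": "Huggingface",                  # assumed hub identifier
                "modelID": "Qwen/Qwen2-0.5B-Instruct",  # example public model
            }
        },
        # Hypothetical reference to a Secret with a hub token for private models.
        "secretRef": {"name": "model-hub-token"},
    },
}

config.load_kube_config()  # or load_incluster_config() inside the cluster
api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="llmaz.io",
    version="v1alpha1",
    namespace="default",     # assuming the CRD is namespaced
    plural="openmodels",     # assumed plural form of the CRD
    body=open_model,
)
```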
Resource Info

🌱 Open Source · 🛠️ Dev Tools · 🛰️ Inference Service