Recently, I’ve been following the progress of GPU scheduling for China's domestic accelerators and of Kubernetes AI resource models. With the release of HAMi v2.9, I want to share several observations on how the AI Infra control plane is evolving.

Why Discuss This Topic Now
When DeepSeek R1 was released in early 2025, most people focused on the claim that a model competitive with OpenAI o1 had been trained for just $5.6 million. What struck me, however, was that as inference costs plummeted, GPU utilization would quickly come to the forefront.
As models became more useful and inference demand exploded, “one GPU per model” rapidly became a luxury. Meanwhile, the NVIDIA H200 export saga accelerated the adoption of domestic compute: first came a sales ban; then, at the end of 2025, a 25% tariff under Trump; and by January 2026, Chinese customs had cleared zero units. Policy now mandates that over 40% of data center chips be domestically produced by 2026.
The reality is harsh: not only are GPUs scarce, but you must also learn to use NVIDIA, Ascend, Cambricon, Hygon, and other very different platforms simultaneously.
That’s why I believe HAMi v2.9 is more significant than it appears on the surface.
GPUs Are No Longer Just About the Card
Kubernetes has always managed GPUs in a rather crude way:
    resources:
      limits:
        nvidia.com/gpu: 1

This was sufficient in 2019, when the main question was simply whether a Pod needed a GPU or not.
But that’s no longer enough. An inference service might only need 4GB of VRAM, multiple small models can share a single card, training jobs care about GPU topology and interconnect bandwidth, and multi-tenancy requires fault domain isolation. Treating GPUs as integer resources is like using an abacus for statistical analysis—not impossible, but the mental model is out of sync with reality.
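To make the mismatch concrete, here is a minimal Python sketch comparing whole-card allocation with memory-aware sharing. The 4 GB inference footprint comes from the paragraph above; the 80 GB card size, the workload mix, and the function names are illustrative assumptions. The packing considers VRAM only and ignores compute, topology, and isolation, so treat it as a rough intuition rather than a scheduler.

```python
from typing import List


def cards_needed_exclusive(requests_gb: List[int]) -> int:
    """Integer model: every workload gets a whole card, whatever its size."""
    return len(requests_gb)


def cards_needed_shared(requests_gb: List[int], card_gb: int) -> int:
    """Memory-aware model: first-fit-decreasing packing by VRAM only."""
    free: List[int] = []  # remaining VRAM on each card already in use
    for req in sorted(requests_gb, reverse=True):
        if req > card_gb:
            raise ValueError(f"request of {req} GB exceeds card size {card_gb} GB")
        for i, remaining in enumerate(free):
            if remaining >= req:
                free[i] = remaining - req
                break
        else:
            # No existing card has room, so open a new one.
            free.append(card_gb - req)
    return len(free)


# Hypothetical mix: ten 4 GB inference services and two 40 GB jobs.
requests = [4] * 10 + [40] * 2
exclusive = cards_needed_exclusive(requests)
shared = cards_needed_shared(requests, card_gb=80)
print(f"exclusive: {exclusive} cards, shared: {shared} cards")
```

Under these assumptions the integer model asks for twelve cards while memory-aware packing fits the same workloads on two, which is the gap the rest of this post is about.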
The most notable feature in HAMi v2.9 is the HAMi-core mode for the Ascend 910C. Previously, sharing Ascend cards relied on SR-IOV hardware virtualization, which was coarse-grained and inflexible. HAMi-core takes a different approach: it uses LD_PRELOAD to intercept ACL (Ascend Computing Language) calls in user space, enabling memory isolation at MB granularity and compute throttling by percentage.
In short: it’s managed by software, not hardware slicing.
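The accounting idea behind this kind of interception can be sketched in a few lines of Python. This is not HAMi-core code: a real interposer is a native shared library loaded through LD_PRELOAD that wraps the vendor runtime's allocation calls. Here the "driver" is a fake, and `with_memory_quota`, `QuotaExceeded`, and the 4096 MB cap are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, Dict


class QuotaExceeded(MemoryError):
    """Raised when an allocation would push the container past its cap."""


def with_memory_quota(real_malloc: Callable[[int], int],
                      real_free: Callable[[int], None],
                      limit_mb: int):
    """Wrap an allocator pair so callers cannot exceed limit_mb in total."""
    limit_bytes = limit_mb * 1024 * 1024
    live: Dict[int, int] = {}  # handle -> size in bytes

    def malloc(size: int) -> int:
        if sum(live.values()) + size > limit_bytes:
            raise QuotaExceeded(f"{size} bytes would exceed the {limit_mb} MB cap")
        handle = real_malloc(size)  # forward to the underlying allocator
        live[handle] = size
        return handle

    def free(handle: int) -> None:
        real_free(handle)
        live.pop(handle, None)

    def used_bytes() -> int:
        return sum(live.values())

    return malloc, free, used_bytes


# Stand-in for the vendor runtime: hands out increasing integer handles.
_handles = iter(range(1, 1_000_000))


def fake_device_malloc(size: int) -> int:
    return next(_handles)


def fake_device_free(handle: int) -> None:
    pass


malloc, free, used_bytes = with_memory_quota(
    fake_device_malloc, fake_device_free, limit_mb=4096)

a = malloc(3 * 1024**3)        # 3 GiB: within the 4 GiB cap
try:
    malloc(2 * 1024**3)        # would total 5 GiB: rejected before the driver sees it
    rejected = False
except QuotaExceeded:
    rejected = True
free(a)
b = malloc(2 * 1024**3)        # fits again once the first buffer is freed
```

The point of the design is that the application keeps calling what it believes is the normal allocator, while the limit is enforced by a layer the cluster controls.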
This is reminiscent of how SDN abstracted the network control plane from hardware devices—GPU partitioning is shifting from a hardware capability to a cluster control plane capability. Considering that Huawei shipped 810,000 Ascend 910C cards last year—nearly half of all domestic chips—this capability has significant real-world impact.
DRA: Kubernetes Finally Has a Robust Device Resource Model
Kubernetes v1.34 (August 2025) officially promoted the core of DRA (Dynamic Resource Allocation) to GA, and Red Hat OpenShift 4.21 followed suit. This is a big deal.
The Device Plugin solved “how to connect GPUs to K8s,” but not “how to express complex AI resource requirements.” Device Plugins only know how many cards are on a node, not how much VRAM you need, what topology, or what isolation level.
DRA standardizes device resource declaration, allocation, and management via ResourceClaim and DeviceClass. HAMi-DRA takes a pragmatic approach: it doesn’t require users to change how they declare resources. Instead, it uses a Mutating Webhook to automatically convert existing Device Plugin-style declarations into the DRA model. Legacy systems don’t need to change, but can still leverage new capabilities.
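The conversion idea can be sketched as a pure function, written here in Python. This is an assumption-heavy illustration, not HAMi-DRA's actual webhook: the extended-resource names, the `gpu.example.com` DeviceClass and driver, and the `memoryMB` / `coresPercent` parameter keys are all hypothetical. The output dict loosely follows the shape of a `resource.k8s.io/v1` ResourceClaim and has not been validated against an API server.

```python
from typing import Any, Dict, Optional, Tuple

# Assumed Device Plugin-style resource names; adjust to your cluster.
GPU_COUNT = "nvidia.com/gpu"
GPU_MEM_MB = "nvidia.com/gpumem"
GPU_CORES_PCT = "nvidia.com/gpucores"


def limits_to_claim(pod_name: str, limits: Dict[str, str]
                    ) -> Tuple[Optional[Dict[str, Any]], Dict[str, str]]:
    """Return (claim, remaining_limits); claim is None if no GPU was requested."""
    if GPU_COUNT not in limits:
        return None, dict(limits)

    params: Dict[str, int] = {}
    if GPU_MEM_MB in limits:
        params["memoryMB"] = int(limits[GPU_MEM_MB])
    if GPU_CORES_PCT in limits:
        params["coresPercent"] = int(limits[GPU_CORES_PCT])

    claim = {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceClaim",
        "metadata": {"name": f"{pod_name}-gpu"},
        "spec": {"devices": {
            "requests": [{
                "name": "gpu",
                "exactly": {
                    "deviceClassName": "gpu.example.com",  # hypothetical DeviceClass
                    "allocationMode": "ExactCount",
                    "count": int(limits[GPU_COUNT]),
                },
            }],
            # Vendor-specific knobs travel as opaque driver parameters.
            "config": [{
                "requests": ["gpu"],
                "opaque": {"driver": "gpu.example.com", "parameters": params},
            }],
        }},
    }
    handled = {GPU_COUNT, GPU_MEM_MB, GPU_CORES_PCT}
    remaining = {k: v for k, v in limits.items() if k not in handled}
    return claim, remaining


claim, remaining = limits_to_claim(
    "llm-inference",
    {"cpu": "2", "nvidia.com/gpu": "1",
     "nvidia.com/gpumem": "4096", "nvidia.com/gpucores": "30"},
)
```

A real mutating webhook would return this change as a patch on the admitted Pod and create the claim alongside it; the sketch only shows the translation step that lets existing manifests stay untouched.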
I liken this to what CSI did for storage: it didn’t eliminate vendor differences, but allowed Kubernetes to consume different storage capabilities in a unified way. DRA does the same for AI accelerators—NVIDIA, Ascend, AMD, Vastai cards will never be identical, but the scheduling layer should speak a common language.
A Complete Control Plane Path
If we look beyond individual features and consider HAMi-core, DRA, CDI, and the scheduler together, they actually correspond to different layers of GPU resource management:
- HAMi-core: How to partition and isolate devices internally
- DRA: How to declare, allocate, and bind resources
- CDI: How to standardize device injection into container runtimes
- Scheduler/Webhook: How to schedule, admit, and observe
Connected from top to bottom, these layers form a complete Kubernetes GPU control plane path. In v2.9, Volcano vGPU was upgraded to v0.19 with enhanced CDI support. While this may seem like a minor improvement in device injection, it actually completes a critical link in this chain.
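For readers who have not seen CDI, the Python sketch below builds a small spec of the kind a vendor component might write for the container runtime to consume. The top-level fields (`cdiVersion`, `kind`, `devices`, `containerEdits`) follow the Container Device Interface specification as I understand it; the vendor name, device paths, and the `VISIBLE_DEVICE` variable are made up for illustration and do not describe what HAMi or Volcano actually emit.

```python
import json
from typing import Any, Dict, List


def build_cdi_spec(kind: str, device_paths: Dict[str, str]) -> Dict[str, Any]:
    """Build a CDI-style spec: one entry per device, each with its container edits."""
    devices = [
        {
            "name": name,
            "containerEdits": {
                "deviceNodes": [{"path": path}],      # device file to expose
                "env": [f"VISIBLE_DEVICE={name}"],    # hypothetical env variable
            },
        }
        for name, path in device_paths.items()
    ]
    return {"cdiVersion": "0.6.0", "kind": kind, "devices": devices}


def qualified_names(spec: Dict[str, Any]) -> List[str]:
    """Fully qualified CDI device names have the form vendor/class=name."""
    return [f"{spec['kind']}={d['name']}" for d in spec["devices"]]


spec = build_cdi_spec(
    "vendor.example.com/gpu",  # hypothetical vendor and class
    {"gpu0": "/dev/example-gpu0", "gpu1": "/dev/example-gpu1"},
)
names = qualified_names(spec)
print(json.dumps(spec, indent=2))
```

The runtime only needs the qualified name to know which edits to apply, which is why a seemingly small change in injection can connect the allocation layers to the container.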
Heterogeneity Is the Main Battlefield for Domestic AI Infra
The reality for domestic AI clusters: you can’t build infrastructure around just one type of GPU.
Enterprise environments often have NVIDIA, Ascend, Biren, Cambricon, Hygon, MetaX (Muxi), Kunlunxin, Vastai, and other devices coexisting. Each card has different drivers, runtimes, virtualization capabilities, and monitoring methods. HAMi v2.9 adds support for Vastai and now covers more than ten types of heterogeneous compute devices. Across mixed training and inference, online and offline workloads, domestic and overseas GPUs, and multi-team, multi-tenant resource pools, unified scheduling is far more important than single-card performance.
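One way to picture "unified scheduling" is a vendor-neutral device record that the scheduler reasons about, leaving vendor details to lower layers. The Python sketch below is a toy under stated assumptions: the `Device` and `Request` types, the best-fit rule, and every memory figure are hypothetical, and a real scheduler would also weigh compute, topology, and tenancy.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Device:
    vendor: str
    node: str
    total_mem_mb: int
    used_mem_mb: int = 0

    @property
    def free_mem_mb(self) -> int:
        return self.total_mem_mb - self.used_mem_mb


@dataclass
class Request:
    mem_mb: int
    allowed_vendors: Optional[Set[str]] = None  # None means any vendor


def pick_device(devices: List[Device], req: Request) -> Optional[Device]:
    """Best fit: choose the eligible device that leaves the least memory unused."""
    candidates = [
        d for d in devices
        if (req.allowed_vendors is None or d.vendor in req.allowed_vendors)
        and d.free_mem_mb >= req.mem_mb
    ]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda d: d.free_mem_mb - req.mem_mb)
    chosen.used_mem_mb += req.mem_mb  # record the allocation
    return chosen


# Hypothetical mixed pool; sizes are illustrative only.
pool = [
    Device("nvidia", "node-a", total_mem_mb=81920, used_mem_mb=78000),
    Device("ascend", "node-b", total_mem_mb=65536, used_mem_mb=32768),
    Device("cambricon", "node-c", total_mem_mb=49152),
]

first = pick_device(pool, Request(mem_mb=4096))
second = pick_device(pool, Request(mem_mb=40000, allowed_vendors={"nvidia", "cambricon"}))
third = pick_device(pool, Request(mem_mb=60000))
```

The workload states what it needs and, optionally, which vendors it can run on; which card it lands on becomes a placement decision instead of something baked into the manifest.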
Key Judgments
GPU sharing will shift from a cost-saving measure to a default requirement. After the explosion of inference workloads, not every workload deserves exclusive access to an entire card. Exclusive allocation will increasingly become a luxury.
DRA is the way forward, but migration will be gradual. The Device Plugin ecosystem is too large to disappear overnight. HAMi-DRA’s compatibility layer shows the project team understands this.
Heterogeneous scheduling will become the core challenge for AI Infra. Whoever can abstract different vendor devices into a unified scheduling language will control the key position in the control plane.
Kubernetes will be reshaped by AI workloads. From scheduling semantics to resource models, AI requires much greater expressiveness than traditional web services. DRA, CDI, topology-aware scheduling—these are not isolated evolutions, but all point to one thing: Kubernetes is evolving from a container orchestrator to the control plane for AI computing.
Summary
The significance of HAMi v2.9 is not just in supporting a particular device or partitioning method, but in making one thing clear: the next generation of AI infrastructure competition is not just about model frameworks or GPU counts, but about the control plane.
GPUs are shifting from external devices on nodes to native resources within the Kubernetes control plane. Whoever defines the resource model for the AI era will define the long-term boundaries of AI Infra.
