Organization and Culture: How the Operating Model Changes
The compute governance closed loop is the foundational safeguard for sustainable innovation in AI-native organizations.
The FinOps Foundation states directly in “Scaling Kubernetes for AI/ML Workloads with FinOps” that Kubernetes elasticity can easily turn into a runaway cost problem. FinOps therefore cannot remain mere cost reporting; it must become a shared operating model in which every scaling decision answers two questions at once: are performance SLOs met, and is it affordable?
The Challenge of API-first “Implicit Assumptions” in the AI Era
The diagram below shows the boundary relationships and accountability chains between platform, ML, and security teams.
The intuitive API-first path is to make the interfaces and workflows work first, then gradually optimize performance and cost through engineering. In AI-native infrastructure this path often fails, because it relies on three implicit assumptions that no longer hold in the AI era.
Assumption 1: Resources are not the core scarcity
Traditional software treats engineering efficiency, throughput, and stability as the scarce resources; in AI-native infrastructure, scarcity comes primarily from hard asset boundaries such as GPUs, interconnect, and power. Scarcity no longer means "slow to scale" but "hard and expensive to scale," constrained by both the supply chain and datacenter conditions.
Assumption 2: Request costs are predictable
Traditional request cost distributions are relatively stable; AI requests are inherently long-tailed. Branching in agentic tasks, long-context inflation, and the chained amplification of tool calls all turn tokens and GPU time into random variables that cannot be linearly extrapolated. You think you are scaling "QPS," but what you are actually scaling is "the total cost of tail-probability events."
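To make "total cost of tail-probability events" concrete, here is a minimal, self-contained simulation. Every parameter in it (the base token distribution, the per-tool-call overhead, the probability and depth of a runaway branch) is an assumption chosen purely for illustration, not a measurement; the only point is how far the mean and p99 of per-request cost drift apart once agentic branching enters the picture.

```python
import random

random.seed(0)

def request_cost_tokens() -> float:
    """Illustrative model of one agentic request's token cost.

    Made-up parameters: a lognormal base prompt, a small chance of deep
    branching, and each extra tool call dragging definitions and
    intermediate results back into the context.
    """
    base = random.lognormvariate(mu=8.0, sigma=0.6)      # ~3k tokens typical
    tool_calls = random.randint(0, 3)
    if random.random() < 0.05:                            # rare runaway branch
        tool_calls += random.randint(10, 30)
    per_call_overhead = 1_500                              # tool defs + results
    return base + tool_calls * per_call_overhead

costs = sorted(request_cost_tokens() for _ in range(100_000))
mean = sum(costs) / len(costs)
p50 = costs[len(costs) // 2]
p99 = costs[int(len(costs) * 0.99)]

# Capacity planned from the median badly underestimates what the tail bills you for.
print(f"p50={p50:,.0f}  mean={mean:,.0f}  p99={p99:,.0f} tokens/request")
```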
Assumption 3: State is ephemeral and discardable
The cloud-native era emphasized stateless scaling with state pushed to external stores; on the inference side, however, inference state and context reuse often determine whether unit costs stay controllable. NVIDIA frames this in Rubin's ICMS (Inference Context Memory Storage) as the "context storage challenge brought by new inference paradigms": KV cache must be reused across sessions and services, growing sequence lengths inflate the KV cache linearly, and this forces persistence and shared access, forming a "new context tier." The accompanying TPS and energy-efficiency gains show that this is not a nice-to-have but a threshold for scalability.
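A quick worked example of that linear inflation, using the standard back-of-envelope KV cache estimate (2 × layers × KV heads × head dim × sequence length × bytes per element); the model shape below is a hypothetical 70B-class configuration rather than any specific product, and real engines add paging and quantization details on top.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_element: int = 2) -> int:
    """Approximate per-sequence KV cache size: keys + values for every layer.

    Illustrative formula only; real engines add paging/quantization overheads.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element

# Hypothetical 70B-class model with grouped-query attention (assumed numbers).
layers, kv_heads, head_dim = 80, 8, 128
for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(layers, kv_heads, head_dim, ctx) / 2**30
    print(f"{ctx:>7} tokens -> ~{gib:5.1f} GiB of KV cache per sequence")
```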
The Nature of Compute Governance: What is Being Governed
“Compute governance” is often misunderstood as “managing GPUs,” but what truly needs governance is the resource consequences of intent. More precisely, it’s governing the combined effects of four types of objects:
Token Economics
- Each request's or task's token consumption, context inflation, and the implicit token tax from tool definitions and intermediate results, all of which map directly to cost and latency.
Accelerator Time
- GPU time, memory footprint, batching strategies, and the impact of routing and cache hits on effective throughput. The key is not “whether there are GPUs,” but “whether output per unit GPU time is controllable.”
Interconnect and Storage (Fabric & Storage)
- Network and storage pressures from training all-reduce, inference KV/cache sharing, and cross-service data movement. AI performance and cost are often amplified by fabric, not by APIs.
Organizational Budget and Risk (Budget & Risk)
- Multi-tenant isolation, fairness, audit, compliance, and accountability. These determine whether the system can scale to multiple teams/business lines, not just scaling demos to more instances.
The FinOps Foundation also emphasizes that AI/ML cost drivers are not just GPUs: storage (checkpoints/embeddings/artifacts), network (distributed training, cross-AZ traffic), and additional licensing and marketplace fees often "quietly exceed compute." Governance must therefore cover the pipeline end to end, not just stare at the inference bill.
MCP/Agent: Amplification Effects Under Governance Gaps
MCP/Agent expand the "capability surface," but they also make the cost curve steeper, and when governance is missing the amplification can become exponential:
- More tools, more branches: Planning space expands, tail probability rises, cost volatility becomes uncontrollable.
- Tool definitions and intermediate results consume context: Directly consuming context window and tokens, translating to cost and latency.
- Stronger tool usage triggers more external I/O: External system calls, network round trips, and data movement all enter the overall cost function.
Anthropic states explicitly in “Code execution with MCP” that direct tool calls increase cost and latency because tool definitions and intermediate results consume the context window; once tool counts reach the hundreds or thousands, this becomes a scalability bottleneck, which is why it proposes code execution as a way to improve efficiency and reduce token consumption.
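A rough sketch of that token tax under stated assumptions: the per-definition and per-result token counts below are invented placeholders, but they show why pushing hundreds of tool definitions and intermediate results through the context window stops scaling long before the model itself does.

```python
# Back-of-envelope: how much context N tool definitions consume before the
# model has done any useful work. Token counts per item are assumptions.
AVG_TOKENS_PER_TOOL_DEF = 350      # name + description + JSON schema (assumed)
AVG_TOKENS_PER_INTERMEDIATE = 800  # tool result echoed back into context (assumed)

def context_tax(num_tools: int, calls_per_task: int) -> int:
    """Tokens spent on tool plumbing for a single agent task."""
    return (num_tools * AVG_TOKENS_PER_TOOL_DEF
            + calls_per_task * AVG_TOKENS_PER_INTERMEDIATE)

for tools in (10, 100, 1_000):
    print(f"{tools:>5} tools, 8 calls -> ~{context_tax(tools, 8):,} tokens of overhead per task")
```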
Minimal Implementation Path for “Compute Governance First”
You do not have to bind yourself to any particular vendor, but you must implement a "minimum viable governance stack." The goal is not perfection; it is giving the system controllable boundary conditions from day one.
Admission and Budget (Admission + Budget)
- Set budgets and priorities for workload types (training/inference/agent tasks).
- Include budget, max steps, max tokens, and max tool calls in policy-as-intent, and enforce them at the entry point (see the sketch below).
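A minimal sketch of what "policy as intent, enforced at the entry point" can look like, assuming a simple in-process admission check; the field names, limits, and the BudgetExceeded convention are illustrative, not a reference implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkloadPolicy:
    """Illustrative 'policy as intent': the budget envelope admitted with a task."""
    workload_type: str        # "training" | "inference" | "agent"
    priority: int             # higher preempts lower (assumed convention)
    max_budget_usd: float
    max_steps: int
    max_tokens: int
    max_tool_calls: int

class BudgetExceeded(Exception):
    pass

def admit(policy: WorkloadPolicy, est_tokens: int, est_tool_calls: int,
          est_cost_usd: float) -> None:
    """Enforce the envelope at the entry point, before any GPU time is spent."""
    if est_tokens > policy.max_tokens:
        raise BudgetExceeded(f"tokens {est_tokens} > {policy.max_tokens}")
    if est_tool_calls > policy.max_tool_calls:
        raise BudgetExceeded(f"tool calls {est_tool_calls} > {policy.max_tool_calls}")
    if est_cost_usd > policy.max_budget_usd:
        raise BudgetExceeded(f"cost ${est_cost_usd:.2f} > ${policy.max_budget_usd:.2f}")

# Usage: an agent task is admitted only if its declared envelope fits the policy.
agent_policy = WorkloadPolicy("agent", priority=2, max_budget_usd=5.0,
                              max_steps=50, max_tokens=200_000, max_tool_calls=30)
admit(agent_policy, est_tokens=120_000, est_tool_calls=12, est_cost_usd=2.4)
```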
End-to-End Metering and Attribution (Metering + Attribution)
- At minimum achieve one traceable chain: request/agent → tokens → GPU time/memory → network/storage → cost attribution (tenant/project/model/tool).
- Without attribution there is no governance, and without governance there is no enterprise-scale adoption, because costs and responsibilities cannot be aligned and the organization will burn itself out internally arguing over "who consumed the budget."
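One possible shape for that traceable chain, sketched with illustrative field names and prices; the point is only that every meter record carries both the resource quantities and the owner keys needed to roll cost up to a tenant and project.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class MeterRecord:
    """One hop in the chain: request -> tokens -> GPU time -> data moved -> cost."""
    request_id: str
    tenant: str
    project: str
    model: str
    tool: Optional[str]
    tokens: int
    gpu_seconds: float
    bytes_moved: int
    cost_usd: float

def attribute(records: list[MeterRecord]) -> dict[tuple[str, str], float]:
    """Roll raw meter records up to (tenant, project) so every bill has an owner."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for r in records:
        totals[(r.tenant, r.project)] += r.cost_usd
    return dict(totals)

records = [
    MeterRecord("req-1", "team-a", "search-agent", "model-x", "web_fetch",
                48_000, 12.5, 3_000_000, 0.92),
    MeterRecord("req-2", "team-b", "batch-train", "model-x", None,
                0, 3_600.0, 80_000_000, 41.00),
]
print(attribute(records))
```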
Isolation and Sharing (Isolation + Sharing)
- Sharing exists to improve utilization; isolation exists to reduce risk. Both must be present at the same time; this is not an either/or choice.
- CNCF’s Cloud Native AI report notes that GPU virtualization and sharing (MIG, MPS, DRA, and the like) can improve utilization and reduce cost, but they require careful orchestration and management and demand collaboration between AI and cloud-native engineering teams.
- The key to governance is not choosing between sharing and isolation, but making the choice an executable policy: who shares under what conditions, and who isolates under what conditions (sketched below).
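A toy version of such an executable policy, with made-up condition names and placement modes; the thresholds are illustrative policy choices rather than recommendations, but they show how "who shares and who isolates, under what conditions" becomes code instead of a standing argument.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlacementRequest:
    tenant: str
    workload_type: str        # "training" | "inference" | "agent"
    latency_sensitive: bool
    handles_regulated_data: bool

def placement_mode(req: PlacementRequest) -> str:
    """Turn 'who shares, who isolates, under what conditions' into an executable rule.

    The branches below are illustrative policy choices, not recommendations.
    """
    if req.handles_regulated_data:
        return "dedicated-gpu"          # isolation for compliance/risk
    if req.workload_type == "training":
        return "exclusive-partition"    # e.g. MIG-style hard partition
    if req.latency_sensitive:
        return "shared-with-guaranteed-slice"
    return "best-effort-shared"         # e.g. time-sliced / MPS-style sharing

print(placement_mode(PlacementRequest("team-a", "inference", True, False)))
```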
Topology and Network as First-Class Citizens (Topology + Fabric First)
- AI training and high-throughput inference are highly sensitive to network characteristics.
- Cisco’s AI-ready infrastructure design guides and the related CVD/Design Zone materials emphasize building a high-performance, lossless Ethernet fabric for AI/ML workloads, and they deliver reference architectures and deployment guidance through validated designs.
- This means topology is not “the datacenter team’s business,” but a core variable determining whether JCT, tail latency, and capacity models hold.
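A deliberately simplified sketch of why topology is a first-class scheduling input: the node names and fabric-domain map below are invented, and real schedulers use far richer models, but even counting cross-domain pairs shows how a compact gang placement differs from a scattered one.

```python
from itertools import combinations

# Illustrative "pod" map: which accelerator nodes share a low-hop fabric domain.
# Node names and domains are made up for the example.
FABRIC_DOMAIN = {
    "node-1": "pod-a", "node-2": "pod-a", "node-3": "pod-a", "node-4": "pod-a",
    "node-5": "pod-b", "node-6": "pod-b", "node-7": "pod-b", "node-8": "pod-b",
}

def cross_domain_pairs(nodes: list[str]) -> int:
    """Count node pairs whose collective traffic must cross fabric domains.

    A gang placement that minimizes this count keeps all-reduce traffic local,
    which is what lets JCT and tail-latency models hold.
    """
    return sum(1 for a, b in combinations(nodes, 2)
               if FABRIC_DOMAIN[a] != FABRIC_DOMAIN[b])

compact = ["node-1", "node-2", "node-3", "node-4"]
scattered = ["node-1", "node-3", "node-5", "node-7"]
print(cross_domain_pairs(compact), cross_domain_pairs(scattered))   # 0 vs 4
```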
Context/State Becomes a Governance Object (Context as a Governed Asset)
- When long-context and agentic workloads become mainstream, KV cache and inference-context reuse directly determine unit costs (a simple illustration follows this list).
- NVIDIA’s ICMS defines this as a “new context tier” for solving KV cache reuse and shared access, emphasizing TPS/energy efficiency gains.
- In this era, treating context as a temporary variable is actively relinquishing cost control.
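A hedged illustration of that unit-cost dependence: the prefill/decode cost split and the assumption that a context/KV-cache hit removes about 90% of prefill work are made-up numbers, used only to show how strongly the cache hit rate moves per-request cost.

```python
def unit_cost_per_request(prefill_cost: float, decode_cost: float,
                          cache_hit_rate: float) -> float:
    """Illustrative unit-cost model: a context/KV-cache hit skips most prefill work.

    prefill_cost / decode_cost are arbitrary $ per request; the assumption that a
    hit removes ~90% of prefill is made up for illustration.
    """
    effective_prefill = prefill_cost * (1 - 0.9 * cache_hit_rate)
    return effective_prefill + decode_cost

for hit_rate in (0.0, 0.5, 0.9):
    cost = unit_cost_per_request(prefill_cost=0.08, decode_cost=0.02,
                                 cache_hit_rate=hit_rate)
    print(f"hit rate {hit_rate:.0%} -> ${cost:.3f}/request")
```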
Anti-Pattern Checklist
The following anti-patterns are not just "engineering inelegance"; they lead to organizational loss of control and are worth guarding against.
API-first, treating governance as post-optimization
- Result: The system launches first, only to discover that unit costs and tail latency are uncontrollable; the only remaining lever is a "hard brake" through feature gating and rate limiting, which ultimately locks the product roadmap.
- Contrast: FinOps points out that elasticity easily turns into runaway cost, so cost governance must be pulled forward into architecture decisions.
Treating MCP/Agent as capability accelerators, not cost amplifiers
- Result: More tools make the system "smarter," but token and external-call costs rise exponentially, and engineering teams are forced to fight systemic amplification with "more complex prompts and rules."
- Contrast: Anthropic notes that tool definitions and intermediate results consume context and increase cost and latency, and proposes more efficient execution forms as the path to scalability.
Only buying GPUs, without sharing/isolation and orchestration
- Result: Low utilization, severe contention, exploding budgets, and an organization that descends into internal blame over "who is grabbing resources and who is burning money."
- Contrast: The CNCF Cloud Native AI report emphasizes that sharing and virtualization improve utilization, but only when paired with orchestration and collaboration mechanisms.
Ignoring network and topology, treating AI as ordinary microservices
- Result: Training JCT and inference tail latency are amplified by the network, capacity planning and cost models break down, and the more you scale the less stable the system becomes.
- Contrast: Cisco's AI-ready network designs and validated designs treat requirements such as a lossless Ethernet fabric as critical foundations for AI/ML.
Summary
The first-principles entry point for AI-native infrastructure is the compute governance closed loop: budget and admission, metering and attribution, sharing and isolation, topology and network, and context assetization. API/Agent/MCP remain important, but they must be constrained by this closed loop; otherwise the system can only oscillate between "smarter" and "more bankrupt."