Meta’s Capacity Efficiency Program runs AI agents that find and fix performance issues across hyperscale infrastructure. The system recovers hundreds of megawatts of power and compresses hours of manual regression investigation into minutes. The architecture shows that agent-driven cost optimization depends on unified abstractions, not just LLM calls.
The platform encodes senior efficiency engineer expertise into reusable, composable skills. Agents operate through a standardized tool interface that works across heterogeneous infrastructure. This lets the program scale megawatt delivery without proportionally scaling headcount.
The Two-Front Problem
Meta’s efficiency work splits into offense and defense:
Offense: Proactively finding optimization opportunities across product areas. Engineers manually identify wins but can only address a fraction of them.
Defense: Catching regressions that make it to production. FBDetect (Meta’s in-house regression detection tool) catches thousands of regressions weekly. Each unresolved regression compounds power waste across the fleet.
The agent platform addresses both. On defense, automated resolution stops regressions before their power waste compounds into megawatts. On offense, AI-assisted opportunity resolution expands to more product areas each half-year, handling a growing volume of wins that engineers would never reach manually.
Unified Tool Interface Architecture
The core abstraction is a standardized tool interface that encodes domain expertise. Instead of agents directly calling infrastructure APIs, they invoke tools that wrap domain knowledge.
Tool interface components:
- Skill definitions: Reusable actions like “analyze memory allocation pattern” or “adjust cache eviction policy”
- Parameter schemas: Typed inputs that constrain agent behavior (service name, metric threshold, rollback criteria)
- Execution context: Resource locks, approval gates, and rollback triggers
- Success metrics: Power savings, latency impact, error rate changes
The interface standardizes how agents interact with infrastructure regardless of the underlying system. An agent optimizing memory usage in a storage service uses the same tool interface as one tuning CPU scheduling in a compute cluster.
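The four components above can be pictured as a single typed object. The following is a minimal sketch under stated assumptions, not Meta's implementation: `ParamSpec`, `Tool`, the `requires_approval` flag, and the tool name in the usage note are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ParamSpec:
    """One typed input in a tool's parameter schema (hypothetical)."""
    name: str
    type_: type
    required: bool = True


@dataclass
class Tool:
    """A skill bundled with its schema, an approval gate, and metric names."""
    name: str
    params: List[ParamSpec]
    run: Callable[..., Dict[str, float]]      # returns measured metrics
    requires_approval: bool = False           # execution-context gate
    success_metrics: List[str] = field(default_factory=list)

    def validate(self, args: Dict[str, object]) -> None:
        """Reject calls that do not match the parameter schema."""
        unknown = set(args) - {p.name for p in self.params}
        if unknown:
            raise ValueError(f"unknown parameters: {sorted(unknown)}")
        for p in self.params:
            if p.name not in args:
                if p.required:
                    raise ValueError(f"missing parameter: {p.name}")
                continue
            if not isinstance(args[p.name], p.type_):
                raise TypeError(f"{p.name} must be {p.type_.__name__}")

    def invoke(self, args: Dict[str, object], approved: bool = False) -> Dict[str, float]:
        """Validate inputs, enforce the approval gate, then run the skill."""
        self.validate(args)
        if self.requires_approval and not approved:
            raise PermissionError(f"{self.name} needs human approval")
        return self.run(**args)
```

An agent would then call every tool the same way, for example `Tool(name="adjust_cache_eviction_policy", params=[ParamSpec("service", str)], run=some_function).invoke({"service": "storage-frontend"})`, regardless of which system `some_function` touches underneath.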
This abstraction layer solves three problems:
- Heterogeneity: Meta’s infrastructure spans custom hardware, open source systems, and proprietary services. The tool interface hides this complexity.
- Safety: Tools encode guardrails (resource limits, rollback conditions) that prevent runaway behavior.
- Composability: Skills combine into multi-step workflows without custom orchestration logic per task.
Orchestration and State Management
When multiple performance issues compete for the same resources, the orchestration layer coordinates agent actions.
Coordination mechanisms:
- Resource locking: Agents acquire locks before modifying infrastructure to prevent conflicting changes
- Priority queuing: High-impact regressions (measured in megawatts) preempt lower-priority optimizations
- Dependency tracking: Some optimizations require others to complete first (example: memory reallocation before cache tuning)
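Priority queuing and dependency tracking can be combined in one scheduling pass. The sketch below is an illustration under assumptions: the function name `schedule`, the task names, and the megawatt figures are hypothetical, and it models ordering only, not interruption of work that is already running.

```python
import heapq
from typing import Dict, List, Set, Tuple


def schedule(tasks: Dict[str, float], deps: Dict[str, Set[str]]) -> List[str]:
    """Order tasks by impact (largest megawatts first), respecting dependencies.

    tasks maps task name -> estimated impact in megawatts.
    deps maps task name -> names of tasks that must complete first.
    """
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    # heapq is a min-heap, so impact is negated to pop the largest first.
    ready: List[Tuple[float, str]] = [
        (-tasks[t], t) for t, waiting in remaining.items() if not waiting
    ]
    heapq.heapify(ready)
    order: List[str] = []
    while ready:
        _, task = heapq.heappop(ready)
        order.append(task)
        # Unblock any task that was waiting only on the one just finished.
        for other, waiting in remaining.items():
            if task in waiting:
                waiting.remove(task)
                if not waiting:
                    heapq.heappush(ready, (-tasks[other], other))
    if len(order) != len(tasks):
        raise ValueError("dependency cycle or unknown dependency")
    return order


plan = schedule(
    {"fix_regression": 4.0, "memory_reallocation": 0.5, "cache_tuning": 1.2},
    {"cache_tuning": {"memory_reallocation"}},
)
```

Here `cache_tuning` has more impact than `memory_reallocation` but still runs after it, because the dependency outranks the priority.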
State management tracks:
- In-flight changes: What each agent is modifying and expected completion time
- Rollback checkpoints: Snapshots of configuration and metrics before changes
- Success criteria: Target metrics and acceptable degradation bounds
The system uses a centralized state store that agents query before taking action. Together with resource locking, this prevents race conditions where two agents optimize the same service simultaneously.
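One detail matters for race prevention: checking the store and claiming the service must happen as a single atomic step, otherwise two agents can both see the service as free. A minimal in-memory sketch follows; the class `StateStore` and its method names are hypothetical, and a production store would be a shared service rather than a process-local dictionary.

```python
import threading
from typing import Dict, Optional


class StateStore:
    """In-memory stand-in for a centralized state store (hypothetical)."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._owners: Dict[str, str] = {}   # service -> agent holding it

    def try_acquire(self, service: str, agent: str) -> bool:
        """Atomically claim a service; return False if another agent holds it."""
        with self._guard:
            holder = self._owners.get(service)
            if holder is not None and holder != agent:
                return False
            self._owners[service] = agent
            return True

    def release(self, service: str, agent: str) -> None:
        """Release the claim, but only if this agent is the current holder."""
        with self._guard:
            if self._owners.get(service) == agent:
                del self._owners[service]

    def owner(self, service: str) -> Optional[str]:
        """Return the agent currently holding the service, if any."""
        with self._guard:
            return self._owners.get(service)
```

An agent that gets `False` from `try_acquire` backs off instead of proceeding, which is what keeps two optimizations from landing on the same service at once.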
Observability and Safety Gates
Production infrastructure agents need observability that answers: what did the agent decide, why, and what happened?
Observability stack:
- Decision logs: Agent reasoning traces showing which tools were considered and why one was selected
- Metric correlation: Pre-change and post-change metrics (power, latency, error rate) linked to agent actions
- Rollback triggers: Automated rollback when metrics exceed degradation thresholds
Safety gates:
| Gate Type | Trigger Condition | Action |
|---|---|---|
| Pre-execution approval | Change affects critical service | Human approval required |
| Error rate rollback | Error rate increase >5% | Automatic rollback |
| Latency rollback | Latency degradation >10% | Automatic rollback |
| Cumulative failure | Repeated rollbacks | Disable agent for manual review |
The system measures success by power savings and engineering time freed. Agents that consistently trigger rollbacks get disabled until engineers review the tool definitions or agent prompts.
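The gate table translates naturally into two small decision functions. This is a sketch under assumptions: the 5% and 10% limits come from the table, but `MAX_ROLLBACKS = 3` is an invented value (the table only says "repeated"), the percentages are treated as relative increases, and all function and enum names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    REQUIRE_APPROVAL = "require_approval"
    ROLLBACK = "rollback"
    DISABLE_AGENT = "disable_agent"


ERROR_RATE_LIMIT_PCT = 5.0    # from the table
LATENCY_LIMIT_PCT = 10.0      # from the table
MAX_ROLLBACKS = 3             # assumption: the table only says "repeated"


def pre_execution_gate(is_critical_service: bool, approved: bool) -> Action:
    """Critical services need a human sign-off before any change runs."""
    if is_critical_service and not approved:
        return Action.REQUIRE_APPROVAL
    return Action.PROCEED


def post_change_gate(
    error_rate_increase_pct: float,
    latency_increase_pct: float,
    prior_rollbacks: int,
) -> Action:
    """Decide what to do after comparing post-change metrics to pre-change ones."""
    breached = (
        error_rate_increase_pct > ERROR_RATE_LIMIT_PCT
        or latency_increase_pct > LATENCY_LIMIT_PCT
    )
    if not breached:
        return Action.PROCEED
    # This breach triggers a rollback too; count it toward the cumulative limit.
    if prior_rollbacks + 1 >= MAX_ROLLBACKS:
        return Action.DISABLE_AGENT
    return Action.ROLLBACK
```

In this sketch `DISABLE_AGENT` implies the change is rolled back as well; the agent is additionally taken out of rotation until someone reviews its tools or prompts.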
Failure Modes and Mitigations
Failure mode 1: Agent optimizes one metric at the expense of another
Example: Reducing memory usage increases CPU cycles, negating power savings.
Mitigation: Tools return multi-dimensional success metrics. Agents must satisfy all constraints (power, latency, error rate) or the change rolls back.
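A check of this kind can be expressed as a function that reports every violated constraint rather than a single pass or fail. The sketch below is illustrative: the function name, the metric keys, and the bound values are assumptions, with deltas expressed as percentage changes where negative means improvement.

```python
from typing import Dict, List


def violated_constraints(deltas: Dict[str, float], bounds: Dict[str, float]) -> List[str]:
    """Return the metrics whose change exceeds its allowed upper bound.

    A metric that has a bound but no measurement counts as a violation,
    so a missing signal fails closed instead of silently passing.
    """
    violations = []
    for metric, limit in bounds.items():
        if metric not in deltas or deltas[metric] > limit:
            violations.append(metric)
    return violations


# Hypothetical case from the example above: memory usage dropped, but the
# extra CPU work pushed net power up by 0.4%.
bounds = {"power": 0.0, "latency": 10.0, "error_rate": 5.0}
deltas = {"power": 0.4, "latency": 1.0, "error_rate": 0.0}
failed = violated_constraints(deltas, bounds)
```

Any non-empty result means the change rolls back, even though the metric the agent was targeting improved.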
Failure mode 2: Cascading rollbacks across dependent services
Example: Agent optimizes service A, which degrades service B, triggering a rollback in B that re-degrades A.
Mitigation: A dependency graph prevents agents from modifying services that have active downstream changes. Locks propagate through the dependency tree.
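Lock propagation amounts to a graph traversal: before touching a service, compute everything downstream of it and refuse if any of those services has a change in flight. The following sketch assumes the graph is available as a plain mapping from a service to its direct downstream services; the function names and the service names `A`, `B`, `C` are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_closure(service: str, downstream: Dict[str, List[str]]) -> Set[str]:
    """Return the service plus every service reachable downstream of it."""
    seen = {service}
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for nxt in downstream.get(current, []):
            if nxt not in seen:      # guards against cycles in the graph
                seen.add(nxt)
                queue.append(nxt)
    return seen


def can_modify(service: str, downstream: Dict[str, List[str]], active: Set[str]) -> bool:
    """Allow a change only if nothing in the downstream closure is mid-change."""
    return downstream_closure(service, downstream).isdisjoint(active)


graph = {"A": ["B"], "B": ["C"]}
blocked = can_modify("A", graph, active={"C"})
allowed = can_modify("A", graph, active=set())
```

With a change active on `C`, an agent is refused on `A` even though `A` and `C` are not directly connected, which is the case that produces the A/B ping-pong described above.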
Failure mode 3: Tool interface drift
Example: Infrastructure changes but tool definitions don’t update, causing agents to invoke obsolete APIs.
Mitigation: Tool definitions are versioned alongside the infrastructure they wrap. Agents query the tool registry for the latest version before execution.
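A version check before execution can be as small as comparing the agent's cached version with the registry's latest. This sketch assumes integer versions that only increase; `ToolRegistry`, `StaleToolError`, and `ensure_current` are hypothetical names, not part of any described API.

```python
from typing import Dict


class StaleToolError(RuntimeError):
    """Raised when an agent holds an outdated tool definition."""


class ToolRegistry:
    """Tracks the latest published version of each tool (hypothetical)."""

    def __init__(self) -> None:
        self._latest: Dict[str, int] = {}

    def publish(self, name: str, version: int) -> None:
        """Record a new version; versions must strictly increase."""
        if version <= self._latest.get(name, 0):
            raise ValueError(f"{name}: version must increase")
        self._latest[name] = version

    def latest(self, name: str) -> int:
        """Return the newest version; raises KeyError for unknown tools."""
        return self._latest[name]

    def ensure_current(self, name: str, cached_version: int) -> None:
        """Refuse execution if the agent's cached definition is not the latest."""
        latest = self.latest(name)
        if cached_version != latest:
            raise StaleToolError(f"{name}: cached v{cached_version}, latest v{latest}")
```

Failing loudly on a stale definition turns silent drift into an explicit refresh step for the agent.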
Potential Risks
Metric gaming: Agents could discover actions that temporarily improve metrics without fixing underlying issues (example: restarting a service for short-term gains). Similar systems mitigate this by requiring sustained improvement over extended windows before counting changes as successful.
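The "sustained improvement" rule can be sketched as a check that every sample in the observation window beats the baseline, not just the first few. The function name, the window length, and the 5% minimum gain below are illustrative assumptions, and the numbers are invented sample data for a lower-is-better metric such as power draw.

```python
from typing import Sequence


def sustained_improvement(
    baseline: float,
    samples: Sequence[float],
    min_samples: int,
    min_gain_pct: float,
) -> bool:
    """True only if every post-change sample beats baseline by min_gain_pct.

    Assumes a lower-is-better metric. Too few samples means the window has
    not elapsed yet, so the change is not counted as a success.
    """
    if baseline <= 0 or len(samples) < min_samples:
        return False
    threshold = baseline * (1 - min_gain_pct / 100.0)
    return all(s <= threshold for s in samples)


# A restart that helps briefly and then drifts back to baseline.
restart = sustained_improvement(100.0, [90.0, 91.0, 99.0, 100.0], min_samples=4, min_gain_pct=5.0)
# A change whose gain holds across the whole window.
real_fix = sustained_improvement(100.0, [92.0, 93.0, 92.0, 94.0], min_samples=4, min_gain_pct=5.0)
```

Requiring all samples to clear the threshold is deliberately strict; a median or percentile over the window would tolerate noise better at the cost of letting short regressions through.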
Optimization conflicts: Multiple agents operating across different layers (network, compute, storage) could create optimization conflicts where improvements in one layer degrade another. The dependency tracking and resource locking mechanisms address this, but complex multi-layer interactions remain a risk.
When Meta’s Approach Applies
This architecture solves a specific volume problem. FBDetect catches thousands of regressions weekly across Meta’s fleet. Manual investigation takes hours per regression. The agent platform works because the volume of opportunities and regressions exceeds what manual teams can address.
Meta’s system demonstrates three conditions that enable agent-driven optimization:
- Volume justifies automation: Meta’s post describes thousands of weekly regressions and a growing number of product areas with optimization opportunities that engineers cannot reach manually.
- Megawatt scale amplifies ROI: Recovering hundreds of megawatts translates into power cost savings that fund platform development and operation.
- Encoded expertise scales: Senior efficiency engineers encode patterns once (in tool definitions), then agents apply them across all services.
The approach depends on infrastructure that supports fast rollback and reliable metric collection, so agents can fail safely without causing prolonged outages.
Technical Verdict
Use this approach when:
- You face a volume problem where optimization opportunities or regressions exceed manual team capacity (hundreds or thousands per week, not dozens per month)
- Performance issues have measurable cost (power, compute waste, latency-driven revenue loss) that justifies platform investment
- Domain experts can encode tuning patterns into reusable tools without requiring deep application context for each decision
- Infrastructure supports fast rollback and reliable metric collection so agents can fail safely
Avoid this approach when:
- Manual optimization keeps up with the volume of opportunities (small infrastructure or infrequent regressions)
- Performance tuning requires application-specific business logic that changes frequently
- Rollback is expensive or risky (stateful systems where rollback loses data or triggers cascading failures)
- Metrics are unreliable or delayed, making it hard to attribute changes to agent actions
The unified tool interface is the key insight. Without it, agents would need custom logic for each infrastructure component, making the system unmaintainable. The interface lets Meta scale agent-driven optimization across a growing number of product areas without rewriting orchestration logic. The business model works when megawatt recovery and freed engineering time exceed platform operating costs.
Source Links
- Meta Engineering: Capacity Efficiency at Meta (Published April 16, 2026)