mech.app

Meta's Capacity Efficiency Agents: How Unified Tool Interfaces Turn Performance Tuning into Autonomous Infrastructure Work

Meta built AI agents that automate performance optimization across hyperscale infrastructure using standardized tool interfaces and encoded domain expertise.

Source: engineering.fb.com

Meta’s Capacity Efficiency Program runs AI agents that find and fix performance issues across hyperscale infrastructure. The system recovers hundreds of megawatts of power and compresses hours of manual regression investigation into minutes. The architecture shows that agent-driven cost optimization requires unified abstractions, not just LLM calls.

The platform encodes senior efficiency engineer expertise into reusable, composable skills. Agents operate through a standardized tool interface that works across heterogeneous infrastructure. This lets the program scale megawatt delivery without proportionally scaling headcount.

The Two-Front Problem

Meta’s efficiency work splits into offense and defense:

Offense: Proactively finding optimization opportunities across product areas. Engineers manually identify wins but can only address a fraction of them.

Defense: Catching regressions that make it to production. FBDetect (Meta’s in-house regression detection tool) catches thousands of regressions weekly. Each unresolved regression compounds power waste across the fleet.

The agent platform addresses both. On defense, automated resolution keeps wasted megawatts from compounding. On offense, AI-assisted opportunity resolution expands to more product areas every half-year, handling a growing volume of wins that engineers would never reach manually.

Unified Tool Interface Architecture

The core abstraction is a standardized tool interface that encodes domain expertise. Instead of agents directly calling infrastructure APIs, they invoke tools that wrap domain knowledge.

Tool interface components:

  • Skill definitions: Reusable actions like “analyze memory allocation pattern” or “adjust cache eviction policy”
  • Parameter schemas: Typed inputs that constrain agent behavior (service name, metric threshold, rollback criteria)
  • Execution context: Resource locks, approval gates, and rollback triggers
  • Success metrics: Power savings, latency impact, error rate changes

The interface standardizes how agents interact with infrastructure regardless of the underlying system. An agent optimizing memory usage in a storage service uses the same tool interface as one tuning CPU scheduling in a compute cluster.
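To make the four components concrete, here is a minimal sketch of what such a tool interface could look like. Every name in it (`ToolParams`, `ExecutionContext`, `Tool`, `METRIC_KEYS`, the two backend functions, and the numbers they return) is hypothetical and chosen for illustration; the source does not publish Meta's actual schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ToolParams:
    """Parameter schema: typed inputs that constrain agent behavior."""
    service_name: str
    metric_threshold: float
    rollback_criteria: Dict[str, float]  # metric -> max tolerated degradation (%)


@dataclass
class ExecutionContext:
    """Execution context: locks held and approval state for one invocation."""
    resource_locks: List[str] = field(default_factory=list)
    requires_approval: bool = False
    approved: bool = False


# Success metrics every tool must report, whatever system it wraps (assumed names).
METRIC_KEYS = ("power_savings_watts", "latency_change_pct", "error_rate_change_pct")


@dataclass
class Tool:
    """A skill definition exposed through one uniform call signature."""
    name: str
    run: Callable[[ToolParams, ExecutionContext], Dict[str, float]]

    def invoke(self, params: ToolParams, ctx: ExecutionContext) -> Dict[str, float]:
        # Approval gate: refuse to run until a human has signed off.
        if ctx.requires_approval and not ctx.approved:
            raise PermissionError(f"{self.name}: human approval required")
        result = self.run(params, ctx)
        # Enforce the shared success-metric contract.
        missing = [key for key in METRIC_KEYS if key not in result]
        if missing:
            raise ValueError(f"{self.name}: missing metrics {missing}")
        return result


def _tune_storage_memory(params: ToolParams, ctx: ExecutionContext) -> Dict[str, float]:
    # Placeholder values; a real tool would call the storage service here.
    return {"power_savings_watts": 120.0, "latency_change_pct": 0.4,
            "error_rate_change_pct": 0.0}


def _tune_compute_cpu(params: ToolParams, ctx: ExecutionContext) -> Dict[str, float]:
    # Placeholder values; a real tool would call the compute scheduler here.
    return {"power_savings_watts": 300.0, "latency_change_pct": -1.2,
            "error_rate_change_pct": 0.1}


storage_tool = Tool("analyze_memory_allocation_pattern", _tune_storage_memory)
compute_tool = Tool("tune_cpu_scheduling", _tune_compute_cpu)
```

In this sketch the agent only ever calls `invoke`, so swapping a storage backend for a compute backend changes nothing on the agent side.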

This abstraction layer solves three problems:

  1. Heterogeneity: Meta’s infrastructure spans custom hardware, open source systems, and proprietary services. The tool interface hides this complexity.
  2. Safety: Tools encode guardrails (resource limits, rollback conditions) that prevent runaway behavior.
  3. Composability: Skills combine into multi-step workflows without custom orchestration logic per task.

Orchestration and State Management

When multiple performance issues compete for the same resources, the orchestration layer coordinates agent actions.

Coordination mechanisms:

  • Resource locking: Agents acquire locks before modifying infrastructure to prevent conflicting changes
  • Priority queuing: High-impact regressions (measured in megawatts) preempt lower-priority optimizations
  • Dependency tracking: Some optimizations require others to complete first (example: memory reallocation before cache tuning)
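Priority queuing and dependency tracking can be sketched together as a scheduler that always picks the highest-megawatt task whose prerequisites have finished. This is an illustrative ordering algorithm, not Meta's orchestrator: the `Task` type, the `schedule` function, and the megawatt figures are assumptions, and the sketch models run order rather than true preemption of work already in flight.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Task:
    name: str
    megawatts: float                      # estimated impact, used as priority
    depends_on: Set[str] = field(default_factory=set)


def schedule(tasks: List[Task]) -> List[str]:
    """Return a run order: highest impact first, never before its dependencies."""
    by_name: Dict[str, Task] = {t.name: t for t in tasks}
    remaining: Dict[str, Set[str]] = {t.name: set(t.depends_on) for t in tasks}

    # heapq is a min-heap, so negate megawatts to pop the largest impact first.
    ready: List[Tuple[float, str]] = [
        (-t.megawatts, t.name) for t in tasks if not t.depends_on
    ]
    heapq.heapify(ready)

    order: List[str] = []
    while ready:
        _, name = heapq.heappop(ready)
        order.append(name)
        # Release any task that was waiting only on the one just completed.
        for other, deps in remaining.items():
            if name in deps:
                deps.remove(name)
                if not deps:
                    heapq.heappush(ready, (-by_name[other].megawatts, other))

    if len(order) != len(tasks):
        raise ValueError("cyclic or unknown dependency")
    return order
```

With a 3 MW regression fix, a 1 MW memory reallocation, and a 5 MW cache tuning that depends on the reallocation, the regression fix runs first and the cache tuning waits for its prerequisite despite its higher impact.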

State management tracks:

  • In-flight changes: What each agent is modifying and expected completion time
  • Rollback checkpoints: Snapshots of configuration and metrics before changes
  • Success criteria: Target metrics and acceptable degradation bounds

The system uses a centralized state store that agents query before taking action. This prevents race conditions where two agents optimize the same service simultaneously.
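The check-before-acting pattern can be illustrated with a small in-memory store. The source does not say how Meta's state store is implemented, so this sketch assumes a single process and a hypothetical `StateStore` class; a production system would need a distributed store with leases and timeouts.

```python
import threading
from typing import Dict, Optional


class StateStore:
    """Tracks which agent currently holds each service (in-memory sketch)."""

    def __init__(self) -> None:
        self._guard = threading.Lock()     # protects the ownership map itself
        self._owners: Dict[str, str] = {}  # service -> agent id

    def try_acquire(self, service: str, agent_id: str) -> bool:
        """Atomically claim a service; fail if another agent already holds it."""
        with self._guard:
            holder = self._owners.get(service)
            if holder is not None and holder != agent_id:
                return False
            self._owners[service] = agent_id
            return True

    def release(self, service: str, agent_id: str) -> None:
        """Release a service, but only if the caller is the current holder."""
        with self._guard:
            if self._owners.get(service) == agent_id:
                del self._owners[service]

    def owner(self, service: str) -> Optional[str]:
        with self._guard:
            return self._owners.get(service)
```

Doing the check and the claim inside one critical section is what closes the race: two agents cannot both observe the service as free.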

Observability and Safety Gates

Production infrastructure agents need observability that answers: what did the agent decide, why, and what happened?

Observability stack:

  • Decision logs: Agent reasoning traces showing which tools were considered and why one was selected
  • Metric correlation: Pre-change and post-change metrics (power, latency, error rate) linked to agent actions
  • Rollback triggers: Automated rollback when metrics exceed degradation thresholds

Safety gates:

| Gate Type | Trigger Condition | Action |
| --- | --- | --- |
| Pre-execution approval | Change affects critical service | Human approval required |
| Error rate rollback | Error rate increase >5% | Automatic rollback |
| Latency rollback | Latency degradation >10% | Automatic rollback |
| Cumulative failure | Repeated rollbacks | Disable agent for manual review |

The system measures success by power savings and engineering time freed. Agents that consistently trigger rollbacks get disabled until engineers review the tool definitions or agent prompts.
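The safety gates above can be expressed as two small decision functions. The 5% and 10% thresholds come from the table; everything else is an assumption made for this sketch, including the function names, the `Action` enum, and in particular `MAX_ROLLBACKS = 3`, since the source only says "repeated rollbacks" without giving a count.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    REQUIRE_APPROVAL = "require_approval"
    ROLLBACK = "rollback"
    DISABLE_AGENT = "disable_agent"


ERROR_RATE_LIMIT_PCT = 5.0   # from the gates table
LATENCY_LIMIT_PCT = 10.0     # from the gates table
MAX_ROLLBACKS = 3            # assumption: the source gives no number


def pre_execution_gate(affects_critical_service: bool, approved: bool) -> Action:
    """Block changes to critical services until a human approves them."""
    if affects_critical_service and not approved:
        return Action.REQUIRE_APPROVAL
    return Action.PROCEED


def post_change_gate(error_rate_increase_pct: float,
                     latency_degradation_pct: float,
                     prior_rollbacks: int) -> Action:
    """Decide what to do once post-change metrics are available."""
    breached = (error_rate_increase_pct > ERROR_RATE_LIMIT_PCT
                or latency_degradation_pct > LATENCY_LIMIT_PCT)
    if not breached:
        return Action.PROCEED
    # This rollback counts toward the cumulative-failure gate.
    if prior_rollbacks + 1 >= MAX_ROLLBACKS:
        return Action.DISABLE_AGENT
    return Action.ROLLBACK
```

In this sketch, disabling the agent implies rolling back the offending change as well; the comparison is strict, so a change that lands exactly on a threshold still proceeds.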

Failure Modes and Mitigations

Failure mode 1: Agent optimizes one metric at the expense of another

Example: Reducing memory usage increases CPU cycles, negating power savings.

Mitigation: Tools return multi-dimensional success metrics. Agents must satisfy all constraints (power, latency, error rate) or the change rolls back.
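A short worked example shows why a single metric is not enough. The sketch below nets per-component power deltas and then checks every constraint at once; the function names, metric names, and wattages are hypothetical.

```python
from typing import Dict, List


def net_power_delta_watts(component_deltas: Dict[str, float]) -> float:
    """Sum per-component power changes; a negative total means net savings."""
    return sum(component_deltas.values())


def violated_constraints(observed: Dict[str, float],
                         limits: Dict[str, float]) -> List[str]:
    """Return every metric that exceeds its limit; a missing metric counts as a violation."""
    return sorted(
        name for name, limit in limits.items()
        if observed.get(name, float("inf")) > limit
    )


# Memory tuning saves 50 W, but the extra CPU work costs 60 W.
deltas = {"memory": -50.0, "cpu": 60.0}
observed = {
    "net_power_delta_watts": net_power_delta_watts(deltas),
    "latency_change_pct": 1.0,
    "error_rate_change_pct": 0.0,
}
limits = {
    "net_power_delta_watts": 0.0,   # the change must not increase total power
    "latency_change_pct": 10.0,
    "error_rate_change_pct": 5.0,
}
failed = violated_constraints(observed, limits)
should_roll_back = bool(failed)
```

Here latency and error rate are within bounds, yet the change still fails because total power rose by 10 W.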

Failure mode 2: Cascading rollbacks across dependent services

Example: Agent optimizes service A, which degrades service B, triggering a rollback in B that re-degrades A.

Mitigation: A dependency graph prevents agents from modifying services with active downstream changes. Locks propagate through the dependency tree.
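One way to read "locks propagate through the dependency tree" is that claiming a service also claims everything transitively downstream of it. The sketch below takes that interpretation; the graph shape (service mapped to its direct downstream dependents), the function names, and the service names are assumptions.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def lock_set(graph: Dict[str, List[str]], service: str) -> Set[str]:
    """Return the service plus every service transitively downstream of it."""
    seen: Set[str] = {service}
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for downstream in graph.get(current, ()):
            if downstream not in seen:   # also guards against cycles
                seen.add(downstream)
                queue.append(downstream)
    return seen


def can_modify(graph: Dict[str, List[str]], service: str,
               active_changes: Iterable[str]) -> bool:
    """Allow a change only if nothing in the lock set is already being changed."""
    return lock_set(graph, service).isdisjoint(active_changes)


# Hypothetical topology: A feeds B, B feeds C, D is independent.
graph = {"A": ["B"], "B": ["C"], "C": [], "D": []}
```

With an active change on C, an agent may still modify D but not A, because A's lock set reaches C through B.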

Failure mode 3: Tool interface drift

Example: Infrastructure changes but tool definitions don’t update, causing agents to invoke obsolete APIs.

Mitigation: Tool definitions are versioned alongside infrastructure. Agents query the tool registry for the latest versions before execution.
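A registry lookup of this kind might look like the following sketch. The `ToolRegistry` class, its methods, the `(major, minor)` version tuples, and the endpoint strings are all hypothetical; the source does not describe the registry's real interface.

```python
from typing import Dict, Tuple

Version = Tuple[int, int]  # (major, minor), compared numerically


class ToolRegistry:
    """Maps each tool name to its published versions (in-memory sketch)."""

    def __init__(self) -> None:
        self._tools: Dict[str, Dict[Version, str]] = {}

    def register(self, name: str, version: Version, endpoint: str) -> None:
        self._tools.setdefault(name, {})[version] = endpoint

    def latest(self, name: str) -> Tuple[Version, str]:
        """Return the newest version and its endpoint for a tool."""
        versions = self._tools.get(name)
        if not versions:
            raise KeyError(f"unknown tool: {name}")
        newest = max(versions)  # tuple comparison, so (1, 10) > (1, 2)
        return newest, versions[newest]

    def is_stale(self, name: str, cached_version: Version) -> bool:
        """True if the agent's cached definition is older than the latest."""
        newest, _ = self.latest(name)
        return cached_version < newest


registry = ToolRegistry()
registry.register("adjust_cache_eviction_policy", (1, 2), "cache-api/v1.2")
registry.register("adjust_cache_eviction_policy", (1, 10), "cache-api/v1.10")
```

Storing versions as integer tuples rather than strings avoids the classic mistake of ranking "1.2" above "1.10".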

Potential Risks

Metric gaming: Agents could discover actions that temporarily improve metrics without fixing underlying issues (example: restarting a service for short-term gains). Similar systems mitigate this by requiring sustained improvement over extended windows before counting changes as successful.
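The sustained-improvement check can be sketched as a window test: a change counts only if every recent sample stays below the target, so a brief dip after a restart does not qualify. The function name, the window length, and the 10% requirement are assumptions for illustration; the source gives no concrete values.

```python
from typing import Sequence


def is_sustained_improvement(samples: Sequence[float],
                             baseline: float,
                             min_improvement_pct: float,
                             window: int) -> bool:
    """True if each of the last `window` samples beats baseline by the required margin.

    Assumes a lower-is-better metric such as power draw.
    """
    if window <= 0 or len(samples) < window:
        return False  # not enough evidence yet
    target = baseline * (1.0 - min_improvement_pct / 100.0)
    return all(value <= target for value in samples[-window:])


# A restart gives a short-lived dip that drifts back to the baseline of 100.
restart_like = [80.0, 85.0, 99.0, 100.0]
# A real fix holds the metric roughly 12% below baseline.
real_fix = [88.0, 89.0, 87.0, 88.0]
```

Against a baseline of 100 with a 10% requirement over four samples, the restart pattern is rejected and the real fix is accepted.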

Optimization conflicts: Multiple agents operating across different layers (network, compute, storage) could make changes where an improvement in one layer degrades another. Dependency tracking and resource locking address this, but complex multi-layer interactions remain a risk.

When Meta’s Approach Applies

This architecture solves a specific volume problem. FBDetect catches thousands of regressions weekly across Meta’s fleet. Manual investigation takes hours per regression. The agent platform works because the volume of opportunities and regressions exceeds what manual teams can address.

Meta’s system demonstrates three conditions that enable agent-driven optimization:

  • Volume justifies automation: The source describes thousands of weekly regressions and a growing number of product areas with optimization opportunities that engineers cannot manually reach.
  • Megawatt scale amplifies ROI: Recovering hundreds of megawatts translates into power cost savings that fund platform development and operation.
  • Encoded expertise scales: Senior efficiency engineers encode patterns once (in tool definitions), then agents apply them across all services.

The approach depends on infrastructure that supports fast rollback and reliable metric collection, so agents can fail safely without causing prolonged outages.

Technical Verdict

Use this approach when:

  • You face a volume problem where optimization opportunities or regressions exceed manual team capacity (hundreds or thousands per week, not dozens per month)
  • Performance issues have measurable cost (power, compute waste, latency-driven revenue loss) that justifies platform investment
  • Domain experts can encode tuning patterns into reusable tools without requiring deep application context for each decision
  • Infrastructure supports fast rollback and reliable metric collection so agents can fail safely

Avoid this approach when:

  • Manual optimization keeps up with the volume of opportunities (small infrastructure or infrequent regressions)
  • Performance tuning requires application-specific business logic that changes frequently
  • Rollback is expensive or risky (stateful systems where rollback loses data or triggers cascading failures)
  • Metrics are unreliable or delayed, making it hard to attribute changes to agent actions

The unified tool interface is the key insight. Without it, agents would need custom logic for each infrastructure component, making the system unmaintainable. The interface lets Meta scale agent-driven optimization across a growing number of product areas without rewriting orchestration logic. The business model works when megawatt recovery and freed engineering time exceed platform operating costs.