ITBench-AA: Why Frontier Models Score Below 50% on Enterprise IT Agent Tasks

Frontier models score below 50% on ITBench-AA, the first benchmark designed for agentic enterprise IT tasks. Claude Opus 4.7 leads at 47%, GPT-5.5 (xhigh) hits 46%, and Qwen3.7 Max reaches 42%. These numbers expose a critical gap between demo-quality agents and production-ready automation in regulated environments.

IBM Research and Artificial Analysis built ITBench-AA to measure model performance on real operational workflows: Kubernetes incident response, financial operations, and CISO-level security tasks. The initial release focuses on Site Reliability Engineering (SRE), where agents must diagnose live systems by reading logs, tracing dependencies, and identifying root-cause entities across complex infrastructure.

Why Enterprise IT Tasks Break Frontier Models

ITBench-AA differs from coding benchmarks like HumanEval or SWE-bench in three ways:

Multi-step workflows with state dependencies: Agents must maintain context across log queries, metric checks, and configuration reads. A single missed dependency breaks the entire diagnostic chain.
False positive traps: Models that over-investigate tend to report upstream fault-injection mechanisms or co-occurring symptoms as root causes. Gemini 3.1 Pro Preview averages 83 turns per task at 30% accuracy, spending many of those turns on noise.
No ground truth in the prompt: Unlike code generation tasks where the spec is explicit, SRE tasks require agents to infer system topology, normal behavior baselines, and failure boundaries from scattered telemetry.

The benchmark exposes a specific failure mode: models confuse correlation with causation. When a Kubernetes pod crashes, agents often flag the scheduler, the node, the network policy, and the application code as equally suspicious. Production SRE requires eliminating false positives before escalating to human operators.
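One way to separate correlated suspects from plausible causes is to check them against the dependency graph. The sketch below is illustrative and not part of ITBench-AA: it assumes a single root cause, and the topology, the service names, and the `prune_candidates` helper are all hypothetical. It keeps only candidates that every symptomatic service depends on, directly or transitively.

```python
from typing import Dict, Iterable, List, Set


def upstream(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return `start` plus everything it depends on, directly or transitively."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


def prune_candidates(
    graph: Dict[str, List[str]],
    symptomatic: Iterable[str],
    candidates: Iterable[str],
) -> List[str]:
    """Keep candidates that lie upstream of every symptomatic service."""
    symptomatic = list(symptomatic)
    if not symptomatic:
        return list(candidates)
    shared = set.intersection(*(upstream(graph, s) for s in symptomatic))
    return [c for c in candidates if c in shared]


# Hypothetical topology: each key depends on the entries in its list.
graph = {
    "checkout": ["payments", "cart"],
    "payments": ["postgres"],
    "cart": ["postgres", "redis"],
}
symptomatic = ["checkout", "payments"]
candidates = ["redis", "postgres", "payments", "node-7"]

print(prune_candidates(graph, symptomatic, candidates))
```

Here `redis` and `node-7` drop out because `payments` does not depend on them, so neither can explain both symptoms on its own. A real incident with several contributing factors would need a weaker filter than strict intersection.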

Architecture of ITBench-AA Evaluation

The benchmark runs agents against live Kubernetes clusters with injected faults. Each task provides:

A symptom description (e.g., “API latency spiked to 5 seconds”)
Access to kubectl, logs, metrics APIs, and configuration files
A turn limit (models average 31 to 83 turns depending on strategy)

Agents succeed only if they identify the specific entity (pod, service, config map) that caused the failure. Partial credit is not awarded. This mirrors real incident response where “the database might be slow” is not actionable.
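The all-or-nothing rule can be sketched as a scoring function. This is an assumption about how such a grader might look, not the benchmark's published implementation: in particular, it assumes that naming extra entities alongside the correct one counts as a failure, and the `Entity` type and both function names are hypothetical.

```python
from typing import FrozenSet, Iterable, Tuple

# Hypothetical entity representation: (kind, name), e.g. ("pod", "checkout-1").
Entity = Tuple[str, str]


def normalize(entities: Iterable[Entity]) -> FrozenSet[Entity]:
    """Case- and whitespace-insensitive set of entities."""
    return frozenset((kind.strip().lower(), name.strip().lower()) for kind, name in entities)


def score_task(predicted: Iterable[Entity], truth: Iterable[Entity]) -> int:
    """Return 1 only for an exact match; there is no partial credit."""
    return int(normalize(predicted) == normalize(truth))


def accuracy(scores: Iterable[int]) -> float:
    """Fraction of tasks scored 1."""
    scores = list(scores)
    return sum(scores) / len(scores) if scores else 0.0


truth = [("configmap", "payments-env")]
exact = score_task([("ConfigMap", "payments-env")], truth)
padded = score_task([("configmap", "payments-env"), ("pod", "payments-0")], truth)
print(exact, padded, accuracy([exact, padded]))
```

Under this assumption, an agent that lists the right config map plus a co-occurring pod scores the same as one that names nothing useful, which is what makes over-reporting costly.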

Model	Accuracy	Avg Turns	Strategy
Claude Opus 4.7 (Adaptive Reasoning, Max Effort)	47%	~40	Focused investigation, early pruning
GPT-5.5 (xhigh)	46%	31	Efficient traversal, fewer false positives
Qwen3.7 Max	42%	~50	Balanced exploration
Gemini 3.1 Pro Preview	30%	83	Over-investigation, high false positive rate

More turns do not buy more accuracy. GPT-5.5 completes tasks in 31 turns at 46%, while Gemini 3.1 Pro Preview uses 83 turns to reach 30%; in the table above, the two models with the highest turn counts also have the lowest accuracy. This suggests that efficient agents prune the search space early, while inefficient agents chase every correlated signal.
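A rough way to compare efficiency is to divide average turns by accuracy, giving the expected number of turns spent per correct diagnosis. The sketch below uses the figures from the table; the turn counts marked "~" there are approximate, and the calculation assumes failed tasks consume the same average number of turns as successful ones, which the source does not state.

```python
# (accuracy, average turns) taken from the table; ~40 and ~50 are approximate.
results = {
    "Claude Opus 4.7": (0.47, 40),
    "GPT-5.5 (xhigh)": (0.46, 31),
    "Qwen3.7 Max": (0.42, 50),
    "Gemini 3.1 Pro Preview": (0.30, 83),
}


def turns_per_correct(accuracy: float, avg_turns: float) -> float:
    """Expected turns spent for each correctly diagnosed task."""
    return avg_turns / accuracy


# Print models from most to least efficient under this metric.
for model, (acc, turns) in sorted(results.items(), key=lambda kv: turns_per_correct(*kv[1])):
    print(f"{model}: {turns_per_correct(acc, turns):.1f} turns per correct diagnosis")
```

On these numbers, Gemini 3.1 Pro Preview spends roughly four times as many turns per correct diagnosis as GPT-5.5 (about 277 versus about 67).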

Observability and Audit Requirements

Enterprise IT automation must satisfy compliance requirements that consumer-facing agents ignore:

Audit trails: Every tool call, log query, and configuration read must be logged with timestamps and justifications. ITBench-AA does not yet measure this, but production deployments will require it.
Approval gates: High-risk actions (restarting services, modifying configs) require human approval. Agents must pause execution and surface a decision point with context.
Rollback boundaries: If an agent’s remediation fails, the system must revert to the last known-good state. This requires transactional orchestration that most agent frameworks do not provide.
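The audit-trail requirement can be illustrated with a thin wrapper around every tool call. This is a minimal sketch under stated assumptions: the record fields, the `audited_call` helper, and the `fake_get_logs` tool are hypothetical, and a real deployment would write to an append-only store rather than an in-memory list.

```python
import json
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


def audited_call(
    log: List[str],
    tool_name: str,
    fn: Callable[..., Any],
    justification: str,
    **kwargs: Any,
) -> Any:
    """Run a tool and append one JSON line describing the call, even on failure."""
    record: Dict[str, Any] = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "tool": tool_name,
        "args": kwargs,
        "justification": justification,
    }
    try:
        result = fn(**kwargs)
        record["status"] = "ok"
        return result
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        # default=str keeps non-JSON argument values from breaking the audit write.
        log.append(json.dumps(record, sort_keys=True, default=str))


def fake_get_logs(pod: str, tail: int) -> str:
    """Hypothetical stand-in for a real log API."""
    return f"last {tail} lines of {pod}"


audit_log: List[str] = []
audited_call(
    audit_log,
    "get_logs",
    fake_get_logs,
    justification="Latency spike started on this pod",
    pod="checkout-1",
    tail=100,
)
print(audit_log[0])
```

Requiring the justification as an argument forces the agent to state why it is making each call, which is the part of the trail a reviewer usually needs most.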

The benchmark currently measures diagnostic accuracy, not operational safety. A 47% success rate on diagnosis suggests that autonomous remediation (restarting pods, scaling deployments) is not yet viable without human oversight.

Failure Modes in Multi-Step Workflows

ITBench-AA tasks expose three failure patterns:

Premature convergence: Agents lock onto the first plausible cause (e.g., high CPU on a node) without checking whether it explains all symptoms. This fails when the root cause is a cascading failure with multiple contributing factors.
Context window exhaustion: Long log files and metric dumps consume tokens. Agents that dump entire logs into context lose the ability to reason about earlier findings. Efficient agents summarize incrementally.
Tool call loops: Some agents retry the same kubectl command with minor variations, hoping for different output. This burns turns without advancing the investigation.
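Incremental summarization, the mitigation named for context window exhaustion, can be sketched as a running summary that is updated chunk by chunk so raw logs never need to sit in the prompt. The class below is a hypothetical illustration: it assumes log lines contain a plain-text level such as ERROR or WARN, and it uses simple counting in place of a model-generated summary.

```python
from collections import Counter
from typing import Iterable, List


class RunningLogSummary:
    """Keep a compact, bounded summary instead of the raw log text."""

    LEVELS = ("ERROR", "WARN", "INFO")

    def __init__(self, max_examples: int = 5) -> None:
        self.level_counts: Counter = Counter()
        self.examples: List[str] = []
        self.max_examples = max_examples

    def update(self, chunk: Iterable[str]) -> None:
        """Fold one chunk of log lines into the summary."""
        for line in chunk:
            level = next((lv for lv in self.LEVELS if lv in line), "OTHER")
            self.level_counts[level] += 1
            is_new_error = level == "ERROR" and line not in self.examples
            if is_new_error and len(self.examples) < self.max_examples:
                self.examples.append(line)

    def render(self) -> str:
        """Short text that can be placed in the agent's context."""
        counts = ", ".join(f"{k}={v}" for k, v in sorted(self.level_counts.items()))
        return f"levels: {counts}; sample errors: {self.examples}"


summary = RunningLogSummary(max_examples=2)
summary.update(["INFO started", "ERROR db timeout", "ERROR db timeout", "WARN slow query"])
summary.update(["ERROR oom killed", "ERROR disk full"])
print(summary.render())
```

The rendered summary stays the same size no matter how many chunks are processed, which is the property that protects earlier findings from being pushed out of context.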

Production agent frameworks need circuit breakers: if an agent retries the same tool three times, force a strategy change or escalate to a human.
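A circuit breaker of this kind can be sketched in a few lines. The version below is an assumption-laden illustration, not a feature of any particular framework: the `ToolLoopBreaker` class is hypothetical, and it treats two commands as "the same" when they match after flags are dropped, which is a crude way to catch retries with minor variations.

```python
from collections import Counter
from typing import Tuple


class ToolLoopBreaker:
    """Trip when the same command shape is issued more than `max_repeats` times."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self.counts: Counter = Counter()

    @staticmethod
    def signature(command: str) -> Tuple[str, ...]:
        """Drop tokens starting with '-' so '--tail=100' and '--tail=200' collapse."""
        return tuple(tok for tok in command.split() if not tok.startswith("-"))

    def allow(self, command: str) -> bool:
        """Record the call; return False once the repeat budget is exhausted."""
        sig = self.signature(command)
        self.counts[sig] += 1
        return self.counts[sig] <= self.max_repeats


breaker = ToolLoopBreaker(max_repeats=3)
for tail in (50, 100, 200, 500):
    cmd = f"kubectl logs checkout-1 --tail={tail}"
    if breaker.allow(cmd):
        print(f"run: {cmd}")
    else:
        print(f"blocked: {cmd} (change strategy or escalate to a human)")
```

The orchestrator would call `allow` before dispatching each tool call and route a `False` result to either a forced re-plan or a human escalation.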

Deployment Shape for Enterprise IT Agents

A production deployment for ITBench-AA-style tasks requires:

Orchestration layer: Manages tool call sequences, enforces turn limits, and logs all actions for audit. LangGraph, Temporal, or custom state machines work here.
Tool registry: Exposes kubectl, log APIs, metrics endpoints, and configuration stores as callable functions with schema validation. Tools must enforce least-privilege access (read-only by default, write requires approval).
State persistence: Agents must survive restarts. Store conversation history, tool call results, and intermediate hypotheses in a durable store (Postgres, Redis, or S3).
Human-in-the-loop hooks: Pause execution at approval gates. Surface a decision payload (current hypothesis, proposed action, risk assessment) to an operator via Slack, PagerDuty, or a custom UI.
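The tool registry and approval hook described above can be combined in one small sketch. Everything here is hypothetical (the `Tool`, `ToolRegistry`, and `ApprovalRequired` names, and the lambda tools standing in for real kubectl wrappers); it only shows the shape of a registry that is read-only by default and refuses write tools without a named approver.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


class ApprovalRequired(Exception):
    """Raised when a write tool is called without operator approval."""


@dataclass(frozen=True)
class Tool:
    name: str
    fn: Callable[..., Any]
    read_only: bool = True  # least privilege: tools are read-only unless declared otherwise


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, *, approved_by: Optional[str] = None, **kwargs: Any) -> Any:
        """Dispatch a tool call, pausing write tools that lack an approver."""
        tool = self._tools[name]
        if not tool.read_only and approved_by is None:
            raise ApprovalRequired(f"{name} needs operator approval")
        return tool.fn(**kwargs)


registry = ToolRegistry()
registry.register(Tool("get_logs", lambda pod: f"logs for {pod}"))
registry.register(Tool("restart_pod", lambda pod: f"restarted {pod}", read_only=False))

print(registry.call("get_logs", pod="checkout-1"))
try:
    registry.call("restart_pod", pod="checkout-1")
except ApprovalRequired as exc:
    print(f"paused: {exc}")
```

In a fuller system the `ApprovalRequired` handler would persist the agent's state and send the decision payload to an operator, then resume the call with `approved_by` set once a human signs off.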

Security boundaries matter. Agents should not have direct write access to production clusters. Instead, they generate remediation plans that humans review and execute.

When to Use ITBench-AA as a Selection Criterion

Use ITBench-AA scores when evaluating models for:

Incident response automation: If you need agents to diagnose Kubernetes, database, or network failures without human intervention, a 47% success rate is not production-ready. Plan for human oversight on every task.
Compliance-heavy environments: Financial services, healthcare, and government IT require audit trails and approval gates. ITBench-AA does not measure these yet, but low diagnostic accuracy suggests that full autonomy is premature.
Cost-sensitive operations: Models with high turn counts (Gemini at 83 turns) burn API costs on false positives. Efficient models (GPT-5.5 at 31 turns) reduce both latency and cost.

Avoid using ITBench-AA scores as the sole criterion for general-purpose coding or customer-facing chatbots. The benchmark measures operational reasoning under uncertainty, not code generation or conversational fluency.

Technical Verdict

Use ITBench-AA when: You are building agents for enterprise IT operations (SRE, FinOps, CISO workflows) and need to measure diagnostic accuracy on multi-step, state-dependent tasks. The benchmark exposes failure modes (false positives, context exhaustion, tool call loops) that coding benchmarks miss.

Avoid ITBench-AA when: You need agents for code generation, customer support, or single-step tasks. The benchmark’s focus on operational reasoning and live system diagnosis does not transfer to domains where ground truth is explicit and workflows are linear.

Current state: No frontier model is production-ready for autonomous enterprise IT tasks. A 47% success rate means agents require human oversight on every incident. Plan deployments with approval gates, audit logging, and rollback mechanisms. Expect to spend engineering effort on orchestration, tool security, and state management before agents deliver ROI.

Source Links

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks