mech.app
AI Agents

Uber's $1,500 Token Cap: What Real-World Agent Cost Controls Look Like in Production

Reverse-engineering Uber's per-tool token budget reveals the infrastructure needed to meter, limit, and audit agentic coding spend at scale.

Source: simonwillison.net

According to Simon Willison’s analysis of Bloomberg reporting, Uber’s $1,500 monthly cap per tool, per engineer, is one of the first public examples of granular cost governance for agentic coding tools at scale. Uber blew through its 2026 AI budget in four months. Their response exposes the plumbing required to meter, limit, and audit token spend across multiple providers without breaking developer workflows.

The policy is tool-isolated. Spending on Cursor does not affect your Claude Code budget. General LLM access (ChatGPT, Claude web) remains uncapped. The cap applies only to tools that burn tokens autonomously: agents that read codebases, generate diffs, run tests, and iterate without human approval for every API call.

Why This Matters for Agent Infrastructure

Most agent cost discussions focus on prompt optimization or model selection. Uber’s cap reveals the operational layer beneath: how do you enforce spending limits when agents call APIs directly, often from developer machines or CI runners, without a central orchestration layer you control?

The problem is not just budgeting. It is attribution, isolation, and enforcement across:

  • Multiple LLM providers (Anthropic, OpenAI, potentially others)
  • Multiple tools (Cursor, Claude Code, GitHub Copilot)
  • Multiple usage patterns (interactive coding, batch refactors, CI integrations)
  • Thousands of engineers with varying usage profiles

Simon Willison’s personal usage sits around $1,000/month per provider, but costs him about $100 because individual plans are subsidized. Uber pays full enterprise rates. At $1,500/month per tool, an engineer using two tools can spend up to $36,000/year ($1,500 × 2 tools × 12 months), roughly 11% of Uber’s median software engineer compensation ($330,000). That ratio suggests Uber views agent productivity gains as worth roughly one-tenth of an engineer’s compensation.
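The arithmetic behind those figures is easy to check. The short calculation below uses only the numbers quoted in this article; the variable names are illustrative.

```python
# Figures quoted in the article; variable names are illustrative.
CAP_PER_TOOL_PER_MONTH = 1_500   # USD
TOOLS_PER_ENGINEER = 2           # the two-tool scenario from the text
MEDIAN_COMPENSATION = 330_000    # USD per year, as quoted

annual_spend = CAP_PER_TOOL_PER_MONTH * TOOLS_PER_ENGINEER * 12
share_of_compensation = annual_spend / MEDIAN_COMPENSATION

print(f"Annual spend at the cap: ${annual_spend:,}")
print(f"Share of compensation: {share_of_compensation:.1%}")
```

This prints $36,000 and 10.9%, which the article rounds to 11%.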

Enforcement Architecture Options

To enforce per-user, per-tool token limits, you need three capabilities: metering, attribution, and blocking. Here are the likely implementation paths.

Proxy Layer

Route all LLM API traffic through an internal proxy that tracks tokens by user and tool. This works if tools support custom API endpoints or if you can intercept traffic at the network layer.

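# Sketch only; a production implementation also needs Redis connection
# pooling, circuit breakers, and an in-memory fallback cache.
from datetime import datetime, timezone

from redis.exceptions import ConnectionError as RedisConnectionError

USAGE_KEY_TTL_SECONDS = 60 * 24 * 60 * 60  # keep counters for about 60 days


def current_month():
    # Month bucket for the usage key, e.g. "2026-05"
    return datetime.now(timezone.utc).strftime("%Y-%m")


class TokenBudgetProxy:
    def __init__(self, redis_client, budget_service):
        self.redis = redis_client      # assumed: a redis.asyncio.Redis client
        self.budgets = budget_service  # assumed: returns limits in tokens

    async def handle_request(self, user_id, tool_id, request):
        key = f"usage:{user_id}:{tool_id}:{current_month()}"
        try:
            # Check budget before forwarding; Redis returns a string or None
            current = int(await self.redis.get(key) or 0)
            limit = await self.budgets.get_limit(user_id, tool_id)
            if current >= limit:
                return {"error": "Monthly token budget exceeded", "code": 429}
        except RedisConnectionError:
            # Allow-by-default: a metering outage should not block engineers
            pass

        # Forward to provider (forward_to_provider is deployment-specific)
        response = await self.forward_to_provider(request)

        # Meter response tokens
        try:
            await self.redis.incrby(key, response.usage.total_tokens)
            await self.redis.expire(key, USAGE_KEY_TTL_SECONDS)
        except RedisConnectionError:
            pass  # counter update lost; reconcile later from billing data
        return response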

Pros: Centralized metering, real-time enforcement, works across tools.

Cons: Adds latency (10-50ms per request), requires network-level interception or tool cooperation, becomes a single point of failure.

SDK Wrapper

Distribute an internal SDK that wraps provider SDKs and enforces limits before making API calls. Tools like Cursor and Claude Code would need to integrate this wrapper, or you fork the tools internally.

Pros: No proxy latency, client-side budget checks reduce backend load.

Cons: Requires tool vendor cooperation or internal forks, engineers can bypass by using upstream SDKs directly, harder to audit.
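A minimal sketch of the wrapper idea, assuming a hypothetical `BudgetedClient` class and a fake provider function in place of a real SDK. Nothing here reflects how Cursor, Claude Code, or Uber actually integrate; it only shows where a client-side check would sit.

```python
from dataclasses import dataclass
from typing import Callable


class BudgetExceeded(Exception):
    """Raised before any provider call when the local budget is spent."""


@dataclass
class BudgetedClient:
    # call_provider is any callable that sends a prompt and returns
    # (text, tokens_used); a real wrapper would delegate to a provider SDK.
    call_provider: Callable[[str], tuple[str, int]]
    token_limit: int
    tokens_used: int = 0

    def complete(self, prompt: str) -> str:
        # Client-side check: refuse before spending anything.
        if self.tokens_used >= self.token_limit:
            raise BudgetExceeded(
                f"used {self.tokens_used} of {self.token_limit} tokens"
            )
        text, tokens = self.call_provider(prompt)
        self.tokens_used += tokens
        return text


def fake_provider(prompt: str) -> tuple[str, int]:
    # Stand-in for a real SDK call; charges one token per word.
    return prompt.upper(), len(prompt.split())


client = BudgetedClient(call_provider=fake_provider, token_limit=5)
print(client.complete("refactor this module"))  # 3 tokens used
print(client.complete("add unit tests"))        # 3 more, now at 6
try:
    client.complete("one more call")
except BudgetExceeded as exc:
    print("blocked:", exc)
```

The second call overshoots the limit because the token count is only known after the response arrives. The counter also lives in the client process, so calling the upstream SDK directly bypasses it. These are the same weaknesses listed in the cons above.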

Billing Webhook

Use provider billing webhooks to track usage and cut off access retroactively. This approach relies on provider-side usage reporting rather than real-time metering.

Pros: No infrastructure changes, no latency, works with any tool.

Cons: Engineers can exceed limits before cutoff, requires account-level API key rotation to enforce, poor UX (hard stops mid-workflow).
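The reconciliation step can be sketched as a batch job over whatever usage export the provider offers. The record fields (`user`, `tool`, `cost_usd`) and the `find_overspenders` helper are hypothetical; real provider reports will be shaped differently.

```python
from collections import defaultdict

CAP_USD = 1_500.0  # per user, per tool, per month (from the article)


def find_overspenders(usage_records, cap_usd=CAP_USD):
    """Sum cost per (user, tool) and return the pairs at or above the cap."""
    totals = defaultdict(float)
    for record in usage_records:
        totals[(record["user"], record["tool"])] += record["cost_usd"]
    return {pair: spent for pair, spent in totals.items() if spent >= cap_usd}


# Hypothetical export; real provider usage reports will have other fields.
records = [
    {"user": "alice", "tool": "cursor", "cost_usd": 900.0},
    {"user": "alice", "tool": "cursor", "cost_usd": 700.0},
    {"user": "alice", "tool": "claude-code", "cost_usd": 400.0},
    {"user": "bob", "tool": "cursor", "cost_usd": 1_200.0},
]

for (user, tool), spent in find_overspenders(records).items():
    print(f"rotate key for {user}/{tool}: ${spent:,.2f} spent")
```

Here only alice's Cursor usage ($1,600) is flagged, and she is already $100 over the cap by the time the job runs. That gap is the retroactive-enforcement weakness described above.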

Uber likely uses a hybrid: proxy for real-time metering, billing webhooks for reconciliation, and account-level API key rotation for enforcement. The tool isolation requirement (Cursor spend does not affect Claude Code) suggests separate API keys per tool, tracked in a central budget service.

Observability Stack

Real-time token tracking across multiple providers requires:

| Component | Purpose | Implementation | Rationale |
| --- | --- | --- | --- |
| Metering proxy | Capture token counts per request | Envoy with Lua filters, or a custom Go proxy | Envoy is built for high request rates at low added latency; Lua filters allow custom metering logic without recompiling |
| Time-series DB | Store usage by user, tool, provider, timestamp | Prometheus or InfluxDB with 60-day retention | Thousands of engineers times a handful of tools is on the order of 10k series, which Prometheus can handle, though per-user labels (user_id, tool_id) need cardinality monitoring; 60-day retention balances audit needs with storage cost |
| Budget service | Enforce limits, handle overrides | PostgreSQL for limits, Redis for current usage counters | PostgreSQL provides ACID guarantees for budget policy changes; Redis TTLs auto-expire monthly counters without manual cleanup |
| Alerting | Notify users at 80% and 100% of budget | PagerDuty or Slack webhooks triggered by threshold queries | Early alerts reduce mid-workflow budget exhaustion; Slack integration allows self-service budget increase requests |
| Audit log | Track who spent what, when, and why | S3 or BigQuery for compliance and chargeback | Immutable append-only logs in S3 help satisfy SOC 2 audit requirements; BigQuery enables SQL queries for cost allocation and anomaly detection |

The latency budget for the proxy is tight. Agent tools make dozens of API calls per minute during active coding. If a loop makes about 20 sequential calls, adding 50ms to each turns a 2-second agent loop into a 3-second loop. Uber’s proxy likely runs close to the Anthropic and OpenAI endpoints, uses connection pooling, and caches budget checks in Redis with TTLs under 10 seconds.
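The short-TTL caching idea can be sketched as follows. This version keeps the cache in process memory rather than in Redis to stay self-contained; `CachedBudgetCheck` and `slow_lookup` are hypothetical names, and the 5-second TTL is just one value under the 10-second figure mentioned above.

```python
import time


class CachedBudgetCheck:
    """Caches 'is this user within budget?' answers for a few seconds."""

    def __init__(self, lookup, ttl_seconds=5.0, clock=time.monotonic):
        self.lookup = lookup          # slow authoritative check (hypothetical)
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._cache = {}              # (user, tool) -> (expires_at, allowed)

    def is_allowed(self, user_id, tool_id):
        key = (user_id, tool_id)
        now = self.clock()
        cached = self._cache.get(key)
        if cached is not None and cached[0] > now:
            return cached[1]          # fresh entry: skip the slow lookup
        allowed = self.lookup(user_id, tool_id)
        self._cache[key] = (now + self.ttl_seconds, allowed)
        return allowed


calls = []


def slow_lookup(user_id, tool_id):
    calls.append((user_id, tool_id))  # count trips to the backing store
    return True


check = CachedBudgetCheck(slow_lookup)
check.is_allowed("alice", "cursor")
check.is_allowed("alice", "cursor")   # served from the cache
print(len(calls))
```

The trade-off is staleness: a user who crosses the limit can keep spending until the cached answer expires, so the TTL bounds the overspend window as well as the lookup rate.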

Policy Design and Gaming Risks

The per-tool isolation is critical. With a single shared pool, heavy use of one tool would silently consume the budget for the others, making per-tool capacity planning impossible. But isolation creates new gaming vectors:

  • Account rotation: Create multiple accounts to multiply budgets. Uber likely ties budgets to corporate SSO identity (employee email or LDAP DN) rather than tool accounts, making account multiplication ineffective. Enforcement would then key on SSO tokens at the proxy layer, not on tool-level credentials.
  • Tool hopping: Burn through Cursor budget, switch to Claude Code. This is allowed under Uber’s policy, since each tool has its own cap, but it means total AI spend per engineer must still be tracked for chargeback.
  • Batch job abuse: Run expensive batch refactors in CI where attribution is harder. Requires CI runner identity to inherit engineer budgets or use separate pool.
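One way to picture the attribution rules above is a function that decides whose budget a request is charged to. The `RequestContext` fields, the `budget_owner` helper, and the pool name are all hypothetical; this is a sketch of the idea, not a description of Uber's system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestContext:
    sso_email: Optional[str]        # set for interactive sessions
    ci_triggered_by: Optional[str]  # engineer who triggered the CI job
    tool_account: str               # tool-level login, ignored for budgeting


def budget_owner(ctx: RequestContext, ci_pool: str = "ci-shared-pool") -> str:
    """Pick the identity whose budget a request is charged to."""
    if ctx.sso_email:
        return ctx.sso_email        # one budget per SSO identity
    if ctx.ci_triggered_by:
        return ctx.ci_triggered_by  # CI inherits the triggering engineer
    return ci_pool                  # unattributed CI work uses a pool


# Two tool accounts, same SSO identity: both map to one budget.
a = RequestContext("alice@example.com", None, "alice-personal")
b = RequestContext("alice@example.com", None, "alice-second-account")
print(budget_owner(a) == budget_owner(b))
print(budget_owner(RequestContext(None, None, "ci-bot")))
```

Because the tool account never enters the decision, creating extra tool accounts does not create extra budget.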

Uber’s policy only caps agentic tools, not general LLM access. This suggests they differentiate between human-in-the-loop usage (ChatGPT, Claude web) and autonomous agents. The risk: engineers use uncapped tools to generate code, then paste into capped tools for execution. Hard to prevent without monitoring clipboard activity, which crosses privacy lines.

Failure Modes

Likely failure modes based on Uber’s policy design (actual implementation details are not public):

  • Agent loops hang: Many clients treat HTTP 429 as a transient rate limit and retry with backoff. If the proxy returns 429 for an exhausted monthly budget, a tool like Cursor may retry indefinitely or fail silently.
  • Partial refactors: An agent starts a multi-file refactor, hits the budget limit halfway through, and leaves the codebase in an inconsistent state.
  • CI pipeline failures: If CI runners share engineer budgets, a single expensive test suite can block all other work.
  • Budget exhaustion races: Multiple agent sessions hit the budget simultaneously. Without distributed locking, the proxy may allow overspend before Redis counters sync.
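The race in the last bullet comes from checking the counter and incrementing it as two separate steps. A minimal sketch of the fix is to make the check and the increment one atomic operation. The example uses a `threading.Lock` as an in-memory stand-in; with Redis, the same effect could be achieved with a Lua script, which Redis runs atomically. `AtomicBudget` and `try_reserve` are hypothetical names.

```python
import threading


class AtomicBudget:
    """In-memory stand-in for an atomic counter shared by proxy workers."""

    def __init__(self, limit_tokens):
        self.limit_tokens = limit_tokens
        self.used_tokens = 0
        self._lock = threading.Lock()

    def try_reserve(self, tokens):
        # Check and increment under one lock so no two sessions can both
        # pass the check against the same stale counter value.
        with self._lock:
            if self.used_tokens + tokens > self.limit_tokens:
                return False
            self.used_tokens += tokens
            return True


budget = AtomicBudget(limit_tokens=1_000)
results = []


def session():
    results.append(budget.try_reserve(300))


threads = [threading.Thread(target=session) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results.count(True), budget.used_tokens)
```

Ten concurrent sessions each ask for 300 tokens; exactly three succeed and the counter stops at 900, never above the 1,000-token limit. Reserving an estimate up front and settling the difference after the response arrives is one way to apply this when the true token count is only known afterwards.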

Uber’s implementation likely includes grace periods (soft cap at $1,500, hard cap at $1,600), manual override workflows for critical work, and separate budget pools for CI runners to mitigate these risks. The observability stack must surface budget exhaustion early enough for engineers to adjust usage or request increases.
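The soft-cap and hard-cap idea can be sketched as a small classifier over month-to-date spend. The $1,500 and $1,600 thresholds are the speculative figures from the paragraph above, the 80% warning level is the alert threshold mentioned earlier, and the status names are hypothetical.

```python
def budget_status(spent_usd, soft_cap=1_500.0, hard_cap=1_600.0, warn_at=0.8):
    """Classify month-to-date spend against the soft and hard caps."""
    if spent_usd >= hard_cap:
        return "blocked"  # hard cap: reject new requests
    if spent_usd >= soft_cap:
        return "grace"    # over the soft cap, inside the grace band
    if spent_usd >= warn_at * soft_cap:
        return "warn"     # early warning so the engineer can adjust
    return "ok"


for spent in (500, 1_300, 1_550, 1_600):
    print(spent, budget_status(spent))
```

The grace band gives an agent room to finish an in-flight task after the soft cap is crossed, which addresses the partial-refactor failure mode without removing the hard stop.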

Cost vs. Value Trade-Offs

At $36,000/year per engineer (assuming two tools), Uber is betting agent productivity gains exceed 11% of compensation. That is roughly a 1.1x multiplier on output. Eleven percent of a 40-hour week is about 4.4 hours, which sits inside a plausible range of 2-6 hours saved per week depending on agent quality and engineer trust levels. Plausible for agents that handle boilerplate, tests, and refactors, but only if engineers trust the output enough to skip manual review.

The cap also signals Uber’s cost structure. It suggests that enterprise LLM pricing offers little volume discount even at Uber’s scale (thousands of engineers). If Anthropic charged $0.50 per million tokens instead of $3, the same $1,500 would buy six times as many tokens, the equivalent of $9,000/month per tool at today’s price. Uber’s policy is as much a negotiation tactic as a cost control.
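The pricing example can be worked through with the two per-token prices quoted above. The calculation is a sketch that treats all tokens as priced at a single rate, which is a simplification; variable names are illustrative.

```python
CAP_USD = 1_500          # monthly cap per tool (from the article)
PRICE_NOW = 3.00         # USD per million tokens (article's example)
PRICE_DISCOUNTED = 0.50  # USD per million tokens (article's example)

# Tokens the cap buys at each price, in millions.
tokens_now = CAP_USD / PRICE_NOW
tokens_discounted = CAP_USD / PRICE_DISCOUNTED

# What the discounted-price token volume would cost at today's price.
equivalent_spend_now = tokens_discounted * PRICE_NOW

# What today's full-cap token volume would cost at the discounted price.
same_volume_cost = tokens_now * PRICE_DISCOUNTED

print(f"{tokens_now:.0f}M tokens at ${PRICE_NOW:.2f}/M")
print(f"{tokens_discounted:.0f}M tokens at ${PRICE_DISCOUNTED:.2f}/M")
print(f"Equivalent spend at today's price: ${equivalent_spend_now:,.0f}")
print(f"Same volume at the lower price: ${same_volume_cost:,.0f}")
```

At $3 per million tokens the cap buys 500M tokens; at $0.50 it buys 3,000M, which would cost $9,000 at today's price. Read the other way, today's full-cap usage would cost only $250 at the lower price.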

Technical Verdict

Use this approach when:

  • You have thousands of engineers using multiple agentic tools
  • Token costs are unpredictable and growing faster than budget
  • You need per-user, per-tool attribution for chargeback or compliance
  • You can enforce API traffic routing through a proxy or SDK wrapper
  • Monthly token spend exceeds $50k or engineering headcount exceeds 500

Avoid this approach when:

  • Token spend is under $50k/month or engineering headcount is under 50 (manual budget reviews are cheaper)
  • Agent usage is centralized in a few orchestration services (meter at the service layer instead)
  • Tools do not support custom API endpoints and you cannot intercept network traffic
  • Latency sensitivity is extreme (high-frequency trading, real-time systems)

Uber’s cap is not a silver bullet. It shifts the problem from runaway costs to budget exhaustion and workflow interruptions. But it is one of the first public examples of production-grade agent cost governance, and the infrastructure patterns (proxy metering, tool isolation, real-time budgets) will likely become standard as agentic tools move from novelty to necessity.