According to Simon Willison’s analysis of Bloomberg reporting, Uber’s $1,500 monthly cap per tool, per engineer, is one of the first public examples of granular cost governance for agentic coding tools at scale. Uber blew through its 2026 AI budget in four months. Their response exposes the plumbing required to meter, limit, and audit token spend across multiple providers without breaking developer workflows.
The policy is tool-isolated. Spending on Cursor does not affect your Claude Code budget. General LLM access (ChatGPT, Claude web) remains uncapped. The cap applies only to tools that burn tokens autonomously: agents that read codebases, generate diffs, run tests, and iterate without human approval for every API call.
Why This Matters for Agent Infrastructure
Most agent cost discussions focus on prompt optimization or model selection. Uber’s cap reveals the operational layer beneath: how do you enforce spending limits when agents call APIs directly, often from developer machines or CI runners, without a central orchestration layer you control?
The problem is not just budgeting. It is attribution, isolation, and enforcement across:
- Multiple LLM providers (Anthropic, OpenAI, potentially others)
- Multiple tools (Cursor, Claude Code, GitHub Copilot)
- Multiple usage patterns (interactive coding, batch refactors, CI integrations)
- Thousands of engineers with varying usage profiles
By his own account, Simon Willison’s personal usage runs to around $1,000/month per provider, for which he pays about $100 because individual plans are subsidized. Uber pays full enterprise rates. At $1,500/month per tool, an engineer who exhausts the cap on two tools spends $36,000/year, roughly 11% of Uber’s median software engineer compensation ($330,000). That ratio suggests Uber values agent productivity gains at up to roughly one-tenth of an engineer’s compensation.
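The arithmetic behind that ratio is easy to check. The sketch below uses only the figures quoted above; the function and constant names are illustrative, and it assumes an engineer spends the full cap on every tool in every month.

```python
# Figures quoted in the text; names are illustrative.
MONTHLY_CAP_PER_TOOL_USD = 1_500
MEDIAN_COMPENSATION_USD = 330_000


def annual_cap_usd(tools: int, monthly_cap: int = MONTHLY_CAP_PER_TOOL_USD) -> int:
    """Maximum yearly spend for one engineer who exhausts every tool's cap."""
    return tools * monthly_cap * 12


def share_of_compensation(tools: int) -> float:
    """Annual cap as a fraction of the quoted median compensation."""
    return annual_cap_usd(tools) / MEDIAN_COMPENSATION_USD


if __name__ == "__main__":
    for tools in (1, 2, 3):
        print(f"{tools} tool(s): ${annual_cap_usd(tools):,}/year, "
              f"{share_of_compensation(tools):.1%} of median compensation")
```

The cap is a ceiling rather than an average, so the share that comes out is an upper bound on what Uber pays per engineer.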
Enforcement Architecture Options
To enforce per-user, per-tool token limits, you need three capabilities: metering, attribution, and blocking. Here are the likely implementation paths.
Proxy Layer
Route all LLM API traffic through an internal proxy that tracks tokens by user and tool. This works if tools support custom API endpoints or if you can intercept traffic at the network layer.
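    # Sketch only; a production proxy also needs Redis connection pooling,
    # circuit breakers, and a tested fallback path.
    from datetime import datetime, timezone

    from redis.exceptions import ConnectionError as RedisConnectionError

    USAGE_TTL_SECONDS = 60 * 24 * 60 * 60  # keep monthly counters for 60 days


    def current_month():
        """Return the current UTC month as YYYY-MM, used to bucket usage keys."""
        return datetime.now(timezone.utc).strftime("%Y-%m")


    class TokenBudgetProxy:
        def __init__(self, redis_client, budget_service):
            self.redis = redis_client  # async client, e.g. redis.asyncio.Redis
            self.budgets = budget_service

        async def forward_to_provider(self, request):
            raise NotImplementedError  # provider-specific HTTP call goes here

        async def handle_request(self, user_id, tool_id, request):
            key = f"usage:{user_id}:{tool_id}:{current_month()}"
            try:
                # Check budget before forwarding; GET returns a string or None
                current = int(await self.redis.get(key) or 0)
                limit = await self.budgets.get_limit(user_id, tool_id)
                if current >= limit:
                    return {"error": "Monthly token budget exceeded", "code": 429}
            except RedisConnectionError:
                pass  # Fail open (allow by default) when usage cannot be read

            # Forward to provider
            response = await self.forward_to_provider(request)

            try:
                # Meter response tokens; EXPIRE takes seconds, not days
                await self.redis.incrby(key, response.usage.total_tokens)
                await self.redis.expire(key, USAGE_TTL_SECONDS)
            except RedisConnectionError:
                pass  # Usage for this call is lost; reconcile from billing data
            return response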
Pros: Centralized metering, real-time enforcement, works across tools.
Cons: Adds latency (10-50ms per request), requires network-level interception or tool cooperation, becomes a single point of failure.
SDK Wrapper
Distribute an internal SDK that wraps provider SDKs and enforces limits before making API calls. Tools like Cursor and Claude Code would need to integrate this wrapper, or you fork the tools internally.
Pros: No proxy latency, client-side budget checks reduce backend load.
Cons: Requires tool vendor cooperation or internal forks, engineers can bypass by using upstream SDKs directly, harder to audit.
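A minimal sketch of the wrapper idea, assuming the provider call can be injected as a plain function. `BudgetedClient`, `BudgetExceeded`, and `InMemoryUsageStore` are hypothetical names for illustration, not part of any vendor SDK, and the in-memory store stands in for whatever shared backend a real deployment would use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


class BudgetExceeded(RuntimeError):
    """Raised before any provider call when the monthly limit is already spent."""


@dataclass
class InMemoryUsageStore:
    """Stand-in for a shared usage backend (hypothetical; a real one is remote)."""
    spent: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def get(self, user_id: str, tool_id: str) -> int:
        return self.spent.get((user_id, tool_id), 0)

    def add(self, user_id: str, tool_id: str, tokens: int) -> None:
        key = (user_id, tool_id)
        self.spent[key] = self.spent.get(key, 0) + tokens


@dataclass
class BudgetedClient:
    """Wraps any provider call; `send` returns (response_text, tokens_used)."""
    user_id: str
    tool_id: str
    limit_tokens: int
    store: InMemoryUsageStore
    send: Callable[[str], Tuple[str, int]]

    def complete(self, prompt: str) -> str:
        # Client-side check: refuse before spending anything further.
        if self.store.get(self.user_id, self.tool_id) >= self.limit_tokens:
            raise BudgetExceeded(f"{self.tool_id} budget exhausted for {self.user_id}")
        response, tokens_used = self.send(prompt)
        self.store.add(self.user_id, self.tool_id, tokens_used)
        return response


if __name__ == "__main__":
    def fake_provider(prompt: str) -> Tuple[str, int]:
        return f"echo: {prompt}", 60  # stub, not a real SDK call

    client = BudgetedClient("alice", "cursor", 100, InMemoryUsageStore(), fake_provider)
    print(client.complete("first"))   # allowed: nothing spent yet
    print(client.complete("second"))  # allowed: 60 spent, limit is 100
    try:
        client.complete("third")      # blocked: 120 spent
    except BudgetExceeded as exc:
        print("blocked:", exc)
```

Because the check runs in the client process, anyone who calls the upstream SDK directly skips it, which is the bypass risk noted above.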
Billing Webhook
Use provider-side usage reporting (billing webhooks where a provider offers them, or periodic polling of usage reports) to track spend and cut off access retroactively. This approach relies on the provider’s numbers rather than real-time metering.
Pros: No infrastructure changes, no latency, works with any tool.
Cons: Engineers can exceed limits before cutoff, requires account-level API key rotation to enforce, poor UX (hard stops mid-workflow).
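The reconciliation step can be sketched as a simple aggregation over usage records. The `UsageRecord` shape below is an assumption for illustration; real provider reports differ in fields and granularity, and attribution to a user and tool would typically come from the API key each record is billed against.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class UsageRecord(NamedTuple):
    """One line of a provider usage report (hypothetical shape)."""
    user_id: str
    tool_id: str
    cost_usd: float


def over_budget(records: Iterable[UsageRecord], cap_usd: float) -> List[Tuple[str, str, float]]:
    """Return (user, tool, total) for every pair whose month-to-date cost exceeds the cap."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for record in records:
        totals[(record.user_id, record.tool_id)] += record.cost_usd
    return sorted(
        (user, tool, total) for (user, tool), total in totals.items() if total > cap_usd
    )


if __name__ == "__main__":
    report = [
        UsageRecord("alice", "cursor", 900.0),
        UsageRecord("alice", "cursor", 700.0),
        UsageRecord("alice", "claude-code", 400.0),
        UsageRecord("bob", "cursor", 1_200.0),
    ]
    for user, tool, total in over_budget(report, cap_usd=1_500.0):
        print(f"revoke key for {user}/{tool}: ${total:,.2f} spent")
```

By the time a record shows up in a report the money is already spent, so this path can only bound the overshoot, not prevent it.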
Uber likely uses a hybrid: proxy for real-time metering, billing webhooks for reconciliation, and account-level API key rotation for enforcement. The tool isolation requirement (Cursor spend does not affect Claude Code) suggests separate API keys per tool, tracked in a central budget service.
Observability Stack
Real-time token tracking across multiple providers requires:
| Component | Purpose | Implementation | Rationale |
|---|---|---|---|
| Metering proxy | Capture token counts per request | Envoy with Lua filters, or a custom Go proxy | Envoy is designed for high-throughput, low-latency proxying; Lua filters allow custom metering logic without recompiling |
| Time-series DB | Store usage by user, tool, provider, timestamp | Prometheus or InfluxDB with 60-day retention | Per-user, per-tool series stay manageable while the label space is bounded (thousands of engineers times a handful of tools), but Prometheus guidance warns against unbounded label values, so per-request detail belongs in the audit log; 60-day retention balances audit needs with storage cost |
| Budget service | Enforce limits, handle overrides | PostgreSQL for limits, Redis for current usage counters | PostgreSQL provides ACID guarantees for budget policy changes; Redis TTLs auto-expire monthly counters without manual cleanup |
| Alerting | Notify users at 80% and 100% of budget | PagerDuty or Slack webhooks triggered by threshold queries | Early alerts reduce mid-workflow budget exhaustion; Slack integration allows self-service budget increase requests |
| Audit log | Track who spent what, when, and why | S3 or BigQuery for compliance and chargeback | Append-only logs in S3 (for example with Object Lock) can support SOC 2 audit evidence; BigQuery enables SQL queries for cost allocation and anomaly detection |
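The alerting row implies edge-triggered notifications: a user should hear about the 80% mark once, when usage crosses it, not on every later request. A minimal sketch of that check follows; the function name and the dollar figures in the demo are illustrative assumptions.

```python
from typing import List, Sequence


def crossed_thresholds(previous_usd: float, current_usd: float, limit_usd: float,
                       thresholds: Sequence[float] = (0.8, 1.0)) -> List[float]:
    """Return the thresholds passed between two usage readings, lowest first."""
    return [
        t for t in sorted(thresholds)
        if previous_usd < t * limit_usd <= current_usd
    ]


if __name__ == "__main__":
    limit = 1_500.0
    readings = [900.0, 1_250.0, 1_400.0, 1_550.0]
    for before, after in zip(readings, readings[1:]):
        for t in crossed_thresholds(before, after, limit):
            print(f"usage ${after:,.0f}: crossed {t:.0%} of ${limit:,.0f}")
```

Comparing consecutive readings keeps the alerting job stateless apart from the last reading it saw.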
The latency budget for the proxy is tight. Agent tools make dozens of API calls per minute during active coding. Adding 50ms per call adds a full second to a loop that makes 20 sequential calls, turning a 2-second agent loop into a 3-second loop. To stay inside that budget, Uber’s proxy would likely run close to the provider endpoints, use connection pooling, and cache budget checks in Redis with TTLs under 10 seconds.
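One way to keep budget checks off the hot path is a short-lived cache of allow/deny decisions. The sketch below uses an in-process dictionary rather than Redis to stay self-contained; `CachedBudgetCheck` and the injected `lookup` function are hypothetical, and the 5-second default TTL is an arbitrary value under the 10-second figure mentioned above.

```python
import time
from typing import Callable, Dict, Tuple


class CachedBudgetCheck:
    """Caches allow/deny decisions for a short TTL to avoid a lookup per request."""

    def __init__(self, lookup: Callable[[str, str], bool], ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._lookup = lookup      # slow, authoritative check (hypothetical)
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so tests can control time
        self._cache: Dict[Tuple[str, str], Tuple[float, bool]] = {}

    def is_allowed(self, user_id: str, tool_id: str) -> bool:
        key = (user_id, tool_id)
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]       # fresh enough: skip the authoritative lookup
        allowed = self._lookup(user_id, tool_id)
        self._cache[key] = (now, allowed)
        return allowed


if __name__ == "__main__":
    def slow_lookup(user_id: str, tool_id: str) -> bool:
        print(f"authoritative lookup for {user_id}/{tool_id}")
        return True

    check = CachedBudgetCheck(slow_lookup, ttl_seconds=5.0)
    check.is_allowed("alice", "cursor")  # triggers the lookup
    check.is_allowed("alice", "cursor")  # served from cache
```

The trade-off is staleness: a cached "allowed" decision lets spending continue for up to one TTL after the budget runs out, so the TTL bounds the overshoot.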
Policy Design and Gaming Risks
The per-tool isolation is critical. Without it, heavy use of one tool would eat into the budget for every other tool, and per-tool capacity planning would be impossible. But isolation creates new gaming vectors:
- Account rotation: Create multiple accounts to multiply budgets. Uber likely ties budgets to corporate SSO identity (employee email or LDAP DN) rather than tool accounts, making account multiplication ineffective. In that design, budget enforcement happens at the proxy layer using SSO tokens, not tool-level credentials.
- Tool hopping: Burn through the Cursor budget, then switch to Claude Code. This is permitted under Uber’s policy because each tool has its own cap, but it means an engineer’s total spend can reach the sum of the caps, so total AI spend per engineer still has to be tracked for chargeback.
- Batch job abuse: Run expensive batch refactors in CI, where attribution is harder. Preventing this requires CI runner identity to inherit the triggering engineer’s budget or to draw on a separate pool.
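The attribution rules in this list can be sketched as a single function that decides whose budget a request draws on. All field names and the `pool:ci` identifier are hypothetical; the point is that the tool-level account never enters the decision.

```python
from dataclasses import dataclass
from typing import Optional

CI_POOL = "pool:ci"  # hypothetical shared budget for unattributed CI work


@dataclass(frozen=True)
class RequestContext:
    sso_subject: Optional[str] = None      # corporate identity from the SSO token
    tool_account: Optional[str] = None     # tool-level login; ignored for budgeting
    ci_triggered_by: Optional[str] = None  # engineer who triggered the CI job, if known
    is_ci: bool = False


def budget_owner(ctx: RequestContext) -> str:
    """Return the identity whose budget this request should be charged to."""
    if ctx.is_ci:
        # CI work inherits the triggering engineer's budget, else the shared pool.
        return ctx.ci_triggered_by or CI_POOL
    if ctx.sso_subject is None:
        raise PermissionError("no corporate identity on request")
    return ctx.sso_subject


if __name__ == "__main__":
    first = RequestContext(sso_subject="alice@example.com", tool_account="alice-1")
    second = RequestContext(sso_subject="alice@example.com", tool_account="alice-2")
    print(budget_owner(first) == budget_owner(second))  # same owner despite two accounts
    print(budget_owner(RequestContext(is_ci=True)))     # falls back to the CI pool
```

Keying on the SSO subject is what makes account rotation pointless: a second tool account maps to the same owner.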
Uber’s policy only caps agentic tools, not general LLM access. This suggests they differentiate between human-in-the-loop usage (ChatGPT, Claude web) and autonomous agents. The risk: engineers use uncapped tools to generate code, then paste into capped tools for execution. Hard to prevent without monitoring clipboard activity, which crosses privacy lines.
Failure Modes
Likely failure modes based on Uber’s policy design (actual implementation details are not public):
- Agent loops hang: Tools like Cursor retry failed API calls. If the proxy returns 429, the tool may retry indefinitely or fail silently.
- Partial refactors: An agent starts a multi-file refactor, hits the budget limit halfway through, and leaves the codebase in an inconsistent state.
- CI pipeline failures: If CI runners share engineer budgets, a single expensive test suite can block all other work.
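- Budget exhaustion races: Multiple agent sessions hit the budget simultaneously. If the proxy checks the counter and increments it in separate steps, concurrent requests can all pass the check before any of them records its usage, allowing overspend. An atomic check-and-increment or a distributed lock closes the gap.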
Uber’s implementation likely includes grace periods (for example, a soft cap at $1,500 and a hard cap at $1,600), manual override workflows for critical work, and separate budget pools for CI runners to mitigate these risks. The observability stack must surface budget exhaustion early enough for engineers to adjust usage or request increases.
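The race and the grace period can be illustrated together. The sketch below is an in-memory stand-in, not Uber’s implementation: a lock makes the check and the increment a single step, and spend between the soft and hard caps is allowed but flagged. `BudgetLedger` and its return values are hypothetical; in a Redis-backed version the same atomicity would typically come from a server-side script or from incrementing first and comparing the returned total.

```python
import threading
from collections import Counter
from typing import Dict


class BudgetLedger:
    """In-memory stand-in for an atomic counter store."""

    def __init__(self, soft_cap: int, hard_cap: int):
        self.soft_cap = soft_cap
        self.hard_cap = hard_cap
        self._spent: Dict[str, int] = {}
        self._lock = threading.Lock()

    def reserve(self, key: str, amount: int) -> str:
        """Atomically check and record spend; returns 'ok', 'grace', or 'denied'."""
        with self._lock:  # check and increment happen as one step
            new_total = self._spent.get(key, 0) + amount
            if new_total > self.hard_cap:
                return "denied"
            self._spent[key] = new_total
            return "grace" if new_total > self.soft_cap else "ok"

    def spent(self, key: str) -> int:
        with self._lock:
            return self._spent.get(key, 0)


if __name__ == "__main__":
    ledger = BudgetLedger(soft_cap=1_500, hard_cap=1_600)
    results = []

    def worker() -> None:
        results.append(ledger.reserve("alice:cursor", 100))

    threads = [threading.Thread(target=worker) for _ in range(50)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print(Counter(results), "spent:", ledger.spent("alice:cursor"))
```

However the 50 concurrent sessions interleave, recorded spend never passes the hard cap, which is the property a separate check-then-increment cannot guarantee.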
Cost vs. Value Trade-Offs
At $36,000/year per engineer (assuming two tools), Uber is betting that agent productivity gains exceed roughly 11% of compensation. To break even, an engineer spending the full cap needs about a 1.11x multiplier on output, or roughly 4.4 hours saved in a 40-hour week; how much is actually saved depends on agent quality and engineer trust levels. That is plausible for agents that handle boilerplate, tests, and refactors, but only if engineers trust the output enough to skip manual review.
The cap also hints at Uber’s cost structure. It suggests that enterprise LLM pricing does not come with deep volume discounts, even at Uber’s scale (thousands of engineers). If Anthropic charged $0.50 per million tokens instead of $3, the token volume that $1,500 buys today would cost $250/month per tool; equivalently, the same $1,500 would buy six times as many tokens. Uber’s policy may be as much a negotiation tactic as a cost control.
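A dollar cap translates into a token allowance only through the price, so the sensitivity to pricing is worth computing explicitly. The sketch below assumes a single blended price per million tokens, as the paragraph above does; real pricing distinguishes input, output, and cached tokens, and the function names are illustrative.

```python
def tokens_for_budget(budget_usd: float, usd_per_million_tokens: float) -> float:
    """Tokens a dollar budget buys at a single blended price."""
    return budget_usd / usd_per_million_tokens * 1_000_000


def cost_of_tokens(tokens: float, usd_per_million_tokens: float) -> float:
    """Dollar cost of a token volume at a single blended price."""
    return tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    cap_usd = 1_500.0
    for price in (3.0, 0.5):
        allowance = tokens_for_budget(cap_usd, price)
        print(f"${price:.2f}/M tokens: ${cap_usd:,.0f} buys {allowance / 1e6:,.0f}M tokens")
    baseline = tokens_for_budget(cap_usd, 3.0)
    print(f"same volume at $0.50/M costs ${cost_of_tokens(baseline, 0.5):,.2f}")
```

Lower prices make a fixed dollar cap looser in token terms, so a cap negotiated at one price point needs revisiting whenever pricing changes.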
Technical Verdict
Use this approach when:
- You have thousands of engineers using multiple agentic tools
- Token costs are unpredictable and growing faster than budget
- You need per-user, per-tool attribution for chargeback or compliance
- You can enforce API traffic routing through a proxy or SDK wrapper
- Monthly token spend exceeds $50k or engineering headcount exceeds 500
Avoid this approach when:
- Token spend is under $50k/month or engineering headcount is under 500 (manual budget reviews are cheaper)
- Agent usage is centralized in a few orchestration services (meter at the service layer instead)
- Tools do not support custom API endpoints and you cannot intercept network traffic
- Latency sensitivity is extreme (high-frequency trading, real-time systems)
Uber’s cap is not a silver bullet. It shifts the problem from runaway costs to budget exhaustion and workflow interruptions. But it is one of the first public examples of production-grade agent cost governance, and the infrastructure patterns (proxy metering, tool isolation, real-time budgets) are likely to become standard as agentic tools move from novelty to necessity.