FutureSim: How Google Replays Real-World Events to Benchmark Adaptive AI Agents

Static benchmarks measure what agents know. FutureSim measures how fast they adapt when the world changes.

Google’s new evaluation framework replays real-world event streams in chronological order, forcing agents to forecast outcomes while information arrives over simulated time. The paper (arXiv 2605.15188v1) exposes a gap: frontier agents tested over a three-month replay achieved 25% accuracy at best, with many scoring worse than making no prediction at all.

This matters because production agents face streaming data, not frozen datasets. Your trading bot, incident responder, or supply chain optimizer must adapt as events unfold. FutureSim shows how to build evaluation infrastructure that respects temporal causality and catches agents that memorize rather than reason.

The Temporal Replay Problem

Most agent benchmarks present a static snapshot. The agent sees all context at once, retrieves what it needs, and answers. This setup hides three problems that only appear when information arrives over time:

Information arrival timing: Real events arrive in sequence. An agent that waits for complete information will miss deadlines.
Causal dependencies: Event B depends on Event A. If your replay shuffles order, you test retrieval, not reasoning.
Training corpus leakage: If the agent’s training data includes “future” events from your test period, you measure memorization, not adaptation.

FutureSim addresses these problems by replaying news articles and question resolutions in the exact order they occurred. The agent lives inside a simulated timeline where January 15 data arrives before January 16 data.

Architecture: Chronological Event Streams

The core infrastructure has three layers:

Event Store
Real-world data sources (news APIs, prediction markets, public datasets) indexed by publication timestamp. Each event carries:

Content (article text, market resolution, data point)
Timestamp (when it became knowable)
Dependencies (references to prior events)
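
As a rough illustration of the three fields above, an event record could be modeled as a small immutable dataclass. The Event class, its field names, and the validate_dependencies helper are hypothetical and not taken from the paper; the check only confirms that every dependency points at an event that exists and is not dated later than the event referencing it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One replayable item in the event store (hypothetical schema)."""
    event_id: str
    content: str              # article text, market resolution, or data point
    timestamp: datetime       # when the information became knowable
    dependencies: tuple = ()  # event_ids of prior events this one references


def validate_dependencies(events):
    """Return ids of events that reference a missing or later-dated event."""
    by_id = {e.event_id: e for e in events}
    bad = []
    for e in events:
        for dep in e.dependencies:
            parent = by_id.get(dep)
            if parent is None or parent.timestamp > e.timestamp:
                bad.append(e.event_id)
                break
    return bad


events = [
    Event("a", "Meeting announced", datetime(2026, 1, 15)),
    Event("b", "Decision published", datetime(2026, 1, 16), ("a",)),
]
print(validate_dependencies(events))
```

Running a check like this when loading the store would catch mislabeled timestamps before they can distort a replay.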

Simulation Clock
A virtual timeline that advances in discrete steps. At each tick:

Release all events with timestamp <= current_time
Allow agent to query, reason, and update forecasts
Advance clock to next event batch

Agent Harness
The evaluation wrapper that:

Enforces temporal boundaries (no future queries)
Logs tool calls, retrieval requests, and forecast updates
Measures latency between information arrival and forecast revision

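class TemporalSimulation:
    def __init__(self, event_store, agent, start_date, end_date):
        # event_store.filter() and agent.act() are interfaces this sketch assumes
        self.events = sorted(event_store.filter(start_date, end_date),
                             key=lambda e: e.timestamp)
        self.agent = agent
        self.current_time = start_date
        self.agent_state = {}
        self.forecast_log = []

    def next_event_time(self):
        # Earliest timestamp strictly after the clock, or None when the replay is over
        later = [e.timestamp for e in self.events if e.timestamp > self.current_time]
        return min(later) if later else None

    def step(self):
        # Release events up to current time
        available = [e for e in self.events if e.timestamp <= self.current_time]

        # Agent observes and acts
        observations = [(e.timestamp, e.content) for e in available]
        actions = self.agent.act(observations, self.agent_state)

        # Log forecasts with the simulated time so they can be scored later
        self.forecast_log.append((self.current_time, actions))

        # Advance clock to the next event; return False once the replay is over
        upcoming = self.next_event_time()
        if upcoming is None:
            return False
        self.current_time = upcoming
        return True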
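
The harness rule "no future queries" could be enforced at the retrieval layer rather than trusted to the agent. The sketch below is one possible way to do that, not FutureSim's implementation: TimeGatedIndex is a hypothetical class, and a naive substring match stands in for a real retriever.

```python
from datetime import datetime


class TimeGatedIndex:
    """Search wrapper that hides documents the simulated clock has not reached."""

    def __init__(self, documents):
        # documents: list of (published_at, text) tuples
        self.documents = sorted(documents, key=lambda d: d[0])

    def search(self, query, current_time):
        # Only documents published at or before the simulated time are visible.
        return [
            text
            for published_at, text in self.documents
            if published_at <= current_time and query.lower() in text.lower()
        ]


index = TimeGatedIndex([
    (datetime(2026, 1, 15), "Port strike talks begin"),
    (datetime(2026, 1, 16), "Port strike talks collapse"),
])
print(index.search("port strike", datetime(2026, 1, 15)))
```

Passing the simulated time into every search call keeps the boundary in one place, so the agent never has to be trusted to filter results itself.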

Data Leakage Controls

The hardest problem: preventing agents from using training data that includes “future” events. FutureSim uses three defenses:

Knowledge Cutoff Enforcement
Only test on periods after the agent’s training cutoff. If your model was trained through December 2025, replay starts January 2026.
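
A guard like the following could run before any replay starts. It is a minimal sketch: check_replay_window and its buffer_days parameter are hypothetical names, and the buffer is an assumption meant to cover uncertainty about the exact cutoff date.

```python
from datetime import date


def check_replay_window(training_cutoff, replay_start, buffer_days=0):
    """Return the gap in days, or raise if the replay starts too close to the cutoff."""
    gap = (replay_start - training_cutoff).days
    if gap <= buffer_days:
        raise ValueError(
            f"Replay starts {gap} day(s) after the training cutoff; "
            f"it must start more than {buffer_days} day(s) after it."
        )
    return gap


# Model trained through December 2025, replay starting January 2026.
print(check_replay_window(date(2025, 12, 31), date(2026, 1, 1)))
```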

Synthetic Future Construction
For controlled experiments, generate plausible event sequences that never appeared in any training corpus. Trade realism for isolation.

Retrieval Auditing
Log every tool call and retrieval query. Flag any that reference documents published after current_time in the simulation. This catches agents that bypass temporal boundaries through side channels.
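
The audit itself can be a simple pass over the harness log. The log format here (dicts with sim_time, query, and doc_published_at keys) is a hypothetical one chosen for the sketch; a real harness would adapt the field names to whatever it records.

```python
from datetime import datetime


def audit_retrievals(log):
    """Return log entries whose retrieved document postdates the simulated time."""
    return [
        entry for entry in log
        if entry["doc_published_at"] > entry["sim_time"]
    ]


log = [
    {"sim_time": datetime(2026, 1, 15), "query": "rate decision",
     "doc_published_at": datetime(2026, 1, 14)},
    {"sim_time": datetime(2026, 1, 15), "query": "rate decision outcome",
     "doc_published_at": datetime(2026, 1, 20)},
]
for violation in audit_retrievals(log):
    print("temporal violation:", violation["query"])
```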

Metrics That Matter

FutureSim tracks adaptation speed, not just final accuracy:

Metric	What It Measures	Why It Matters
Brier Skill Score	Forecast accuracy vs. baseline	Separates signal from noise
Adaptation Latency	Time from event to forecast update	Catches agents that wait for complete information
Forecast Revision Rate	How often agent changes predictions	Distinguishes reactive from static agents
Temporal Coherence	Predictions respect causal order	Detects retrieval shortcuts
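
To make two of these metrics concrete, here is one way they might be computed from a forecast log. The definitions are assumptions for illustration and may differ from the paper's: latency is taken as the time from an event to the first changed forecast at or after it, and revision rate as the share of consecutive forecasts that differ.

```python
from datetime import datetime, timedelta


def adaptation_latency(event_time, forecasts):
    """Time from an event to the first forecast revision at or after it.

    forecasts: chronologically sorted list of (time, probability) pairs.
    Returns None if the forecast never changes after the event.
    """
    previous = None
    for time, prob in forecasts:
        if time >= event_time and previous is not None and prob != previous:
            return time - event_time
        previous = prob
    return None


def revision_rate(forecasts):
    """Fraction of consecutive forecast pairs where the probability changed."""
    if len(forecasts) < 2:
        return 0.0
    changes = sum(1 for (_, a), (_, b) in zip(forecasts, forecasts[1:]) if a != b)
    return changes / (len(forecasts) - 1)


forecasts = [
    (datetime(2026, 1, 14), 0.5),
    (datetime(2026, 1, 15), 0.5),
    (datetime(2026, 1, 17), 0.7),
]
print(adaptation_latency(datetime(2026, 1, 15), forecasts))
print(revision_rate(forecasts))
```

Reading the two numbers together is the point: a low latency paired with a very high revision rate may indicate an agent that reacts to every article rather than one that adapts well.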

The paper reports that many agents had negative Brier skill scores, meaning they performed worse than a naive baseline. This suggests they struggle to weight new information against priors.
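
For reference, the Brier score is the mean squared error between forecast probabilities and binary outcomes, and the skill score is 1 minus the ratio of the agent's Brier score to a reference forecast's. The sketch below assumes a constant 0.5 forecast as the reference; the baseline used in the paper may differ.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes, baseline_probs):
    """1 is perfect, 0 matches the baseline, negative is worse than the baseline."""
    reference = brier_score(baseline_probs, outcomes)
    if reference == 0:
        raise ValueError("Baseline is perfect; the skill score is undefined.")
    return 1.0 - brier_score(probs, outcomes) / reference


outcomes = [1, 0, 1, 0]
baseline = [0.5] * len(outcomes)  # assumed reference: always answer 50%

calibrated = [0.9, 0.1, 0.8, 0.3]      # leans the right way each time
overconfident = [0.1, 0.9, 0.2, 0.7]   # confidently wrong each time

print(brier_skill_score(calibrated, outcomes, baseline))     # positive
print(brier_skill_score(overconfident, outcomes, baseline))  # negative
```

The second agent illustrates how a negative score arises: confident wrong forecasts are penalized more heavily than an uninformative 50% answer.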

Failure Modes Exposed

FutureSim revealed three common failure patterns:

Over-Indexing on Recency
Agents that update forecasts after every new article, ignoring base rates. Adaptation latency looks good, but accuracy suffers.

Retrieval Paralysis
Agents that wait to gather “enough” context before forecasting. They miss early signals and never catch up.

Memory Thrashing
Agents that forget earlier context when new events arrive. Their state management can’t handle long simulations.

Implementation Trade-offs

Building a temporal replay system forces hard choices:

Event Granularity
Fine-grained (minute-by-minute) replay is realistic but expensive. Coarse-grained (daily batches) is cheaper but hides intra-day adaptation. FutureSim uses daily batches for the three-month test.
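
Coarse-grained replay amounts to grouping events by calendar day before releasing them. A minimal sketch, assuming events are (timestamp, content) pairs; batch_by_day is a hypothetical helper, not part of FutureSim:

```python
from collections import defaultdict
from datetime import datetime


def batch_by_day(events):
    """Group (timestamp, content) pairs into chronologically ordered daily batches."""
    batches = defaultdict(list)
    for timestamp, content in sorted(events, key=lambda e: e[0]):
        batches[timestamp.date()].append(content)
    return sorted(batches.items())


events = [
    (datetime(2026, 1, 16, 9, 0), "Markets open lower"),
    (datetime(2026, 1, 15, 18, 30), "Earnings miss announced"),
    (datetime(2026, 1, 15, 8, 0), "Analyst preview published"),
]
for day, contents in batch_by_day(events):
    print(day, contents)
```

Within a batch the agent sees the morning and evening items together, which is exactly the intra-day ordering that daily granularity gives up.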

Simulation Speed
Real-time replay (one simulated day per real day) lets you test long-horizon reasoning but delays results. Accelerated replay (full simulation in hours) is faster but may stress agent memory systems differently.

Observation Formatting
Raw event streams (JSON blobs) are realistic but hard for agents to parse. Structured summaries are easier but may leak information through the summarization process.

When to Build Your Own

FutureSim is a research artifact, not a product. You should build similar infrastructure if:

Your agents make time-sensitive decisions (trading, incident response, logistics)
Training data leakage is a real risk (financial models, news summarization)
You need to measure adaptation speed, not just final accuracy

You can skip it if:

Your domain is static (document classification, batch ETL)
Temporal order doesn’t affect correctness (most RAG applications)
You’re testing reasoning, not adaptation (math proofs, code generation)

Technical Verdict

Use FutureSim-style replay when:

Agents must adapt to streaming information
Temporal causality matters for correctness
You need to separate memorization from reasoning
Production workloads involve forecasting or time-sensitive decisions

Avoid when:

Your domain is static or order-independent
Simulation infrastructure cost exceeds value of temporal fidelity
Agents don’t maintain state across interactions
You’re testing one-shot reasoning, not long-horizon adaptation

The core insight: most agent benchmarks test retrieval and reasoning in a frozen world. FutureSim tests whether your agent can keep up when the world moves. If your production workload involves streaming data, you need evaluation infrastructure that respects time.