Waymo runs two parallel agent systems. One drives the car. The other tests it in simulation. Both must enforce identical safety constraints, or the whole system falls apart.
Drago Anguelov, Waymo’s VP of Research, recently detailed this dual-challenge architecture on the Practical AI podcast. The conversation offered a rare look at the plumbing behind how production autonomous systems synchronize policy enforcement across physical and virtual execution environments.
The Two-Agent Problem
Most agentic systems run in one environment. Waymo’s onboard driver agent runs on hardware in a moving vehicle. The simulation test agent runs in a data center, replaying scenarios at scale. The challenge is keeping both agents aligned on what “safe” means.
Onboard driver agent:
- Runs real-time inference on sensor streams (lidar, camera, radar)
- Makes throttle, brake, and steering decisions at 10+ Hz
- Must respect hard latency budgets (milliseconds matter)
- Operates under power and thermal constraints
Simulation test agent:
- Replays recorded sensor data or synthetic scenarios
- Runs thousands of parallel instances
- Can slow down or speed up time
- Has a far larger compute budget for post-hoc analysis
The gap between these two environments creates a synchronization problem. If the onboard agent learns a new safety rule, the simulation must enforce it. If simulation discovers a failure mode, the onboard agent must be patched.
Safety Constraint Synchronization
Waymo’s approach treats safety constraints as first-class artifacts that travel with the agent code. This is not a configuration file. It is a versioned policy bundle that both environments load.
Key synchronization primitives:
- Shared policy repository: Safety rules live in a central repo. Both onboard and simulation agents pull from the same source of truth.
- Version pinning: Each deployed onboard agent has a policy version hash. Simulation runs must match that hash or explicitly test a newer version.
- Constraint validation: Before deployment, simulation runs a regression suite against the new policy. If any previously-safe scenario fails, the policy is rejected.
- Replay fidelity: Simulation must reproduce onboard sensor noise, timing jitter, and compute variability. Otherwise, a constraint that passes in simulation may fail on the road.
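The version-pinning primitive can be sketched in a few lines. This is a minimal illustration, assuming a policy bundle is a JSON-serializable dict identified by a truncated SHA-256 hash of its canonical form. The bundle layout and the helpers `policy_hash` and `check_pinning` are hypothetical, not Waymo’s format or API.

```python
import hashlib
import json


def policy_hash(bundle: dict) -> str:
    """Content hash of a (hypothetical) policy bundle.

    sort_keys and fixed separators make the JSON canonical, so the same rules
    hash identically regardless of key insertion order.
    """
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def check_pinning(sim_bundle: dict, onboard_hash: str, testing_newer: bool = False) -> bool:
    """Allow a simulation run only if it matches the onboard policy hash,
    or if the run is explicitly marked as testing a newer version."""
    return policy_hash(sim_bundle) == onboard_hash or testing_newer


onboard_bundle = {
    "pedestrian_crossing": {"max_distance_m": 10},
    "intersection_entry": {"max_gap_m": 5, "min_speed_mps": 10},
}
onboard = policy_hash(onboard_bundle)

reordered = dict(reversed(list(onboard_bundle.items())))  # same rules, different order
candidate = {**onboard_bundle, "pedestrian_crossing": {"max_distance_m": 12}}

print(check_pinning(reordered, onboard))                      # True
print(check_pinning(candidate, onboard))                      # False: silent drift is rejected
print(check_pinning(candidate, onboard, testing_newer=True))  # True: explicit test of a newer version
```

Hashing the canonical content, rather than a file name or timestamp, means the two environments agree on a version only when they hold identical rules.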
This is harder than it sounds. Simulation runs on GPUs with different floating-point behavior than the onboard hardware. Sensor data gets compressed for storage, losing fidelity. Time synchronization between multiple sensors introduces edge cases that only appear in the real world.
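One well-known source of numeric divergence is that floating-point addition is not associative: the same sum computed in a different order, as parallel reductions on a GPU often do, can round differently. A minimal Python illustration; the 0.6 m threshold is a made-up value placed on the boundary to show how a rounding difference can flip a constraint’s verdict.

```python
import math

# Three hypothetical distance contributions, summed in two different orders.
parts = [0.1, 0.2, 0.3]

left_to_right = (parts[0] + parts[1]) + parts[2]  # order a simple sequential loop uses
reassociated = parts[0] + (parts[1] + parts[2])   # order a parallel reduction might use

THRESHOLD_M = 0.6  # hypothetical constraint boundary

print(left_to_right, reassociated)                              # 0.6000000000000001 0.6
print(left_to_right > THRESHOLD_M, reassociated > THRESHOLD_M)  # True False
print(math.isclose(left_to_right, reassociated))                # True
```

The two results agree to within one rounding step, yet a strict comparison against the boundary gives opposite answers, which is one reason replay pipelines may need tolerances or deterministic arithmetic.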
Failure Mode Discovery Loop
When simulation finds a new failure mode, Waymo’s pipeline looks like this:
- Scenario extraction: The simulation agent logs the exact sensor inputs, agent state, and decision sequence that led to the failure.
- Root cause analysis: Engineers replay the scenario in a debugger, inspecting the agent’s internal reasoning.
- Constraint update: A new safety rule is added to the policy bundle (e.g., “do not enter an intersection if cross-traffic is within 5 meters and moving faster than 10 m/s”).
- Regression testing: The updated policy runs against the full scenario library. If it breaks existing safe behaviors, the constraint is refined.
- Onboard deployment: The new policy is pushed to the fleet via over-the-air update. The onboard agent now enforces the constraint in real time.
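The constraint-update and regression-testing steps can be sketched with the intersection rule quoted in the list. This is a minimal sketch: the `CrossTraffic` type, the scenario-library layout (scenario ID mapped to inputs and the expected allow/deny verdict), and `regression_gate` are hypothetical, and a real scenario library would replay full sensor logs rather than a pre-extracted list of vehicles.

```python
from dataclasses import dataclass


@dataclass
class CrossTraffic:
    distance_m: float
    speed_mps: float


def intersection_entry_constraint(cross_traffic) -> bool:
    """Allow entry unless cross-traffic is within 5 m and moving faster than 10 m/s."""
    return not any(c.distance_m <= 5 and c.speed_mps > 10 for c in cross_traffic)


def regression_gate(policy, scenario_library) -> list:
    """Return IDs of scenarios where the policy's verdict differs from the expected one."""
    return [sid for sid, (inputs, expected) in scenario_library.items()
            if policy(inputs) != expected]


scenario_library = {
    "clear_intersection": ([CrossTraffic(40.0, 12.0)], True),
    "slow_car_close": ([CrossTraffic(3.0, 2.0)], True),
    "fast_car_close": ([CrossTraffic(4.0, 15.0)], False),  # the newly discovered failure
}

print(regression_gate(intersection_entry_constraint, scenario_library))  # []


# An over-broad rule fixes the failure but breaks a previously safe scenario,
# so it would be sent back for refinement.
def too_strict(traffic) -> bool:
    return all(c.distance_m > 5 for c in traffic)


print(regression_gate(too_strict, scenario_library))  # ['slow_car_close']
```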
The feedback loop runs continuously. Waymo’s fleet generates petabytes of sensor data daily. Simulation agents replay edge cases at scale. New constraints flow back to the onboard agents.
Version Control for Agent Behavior
Waymo treats agent behavior like code. Every policy change gets a commit hash. Every deployed vehicle reports its policy version. This creates an audit trail.
Version control challenges:
| Challenge | Waymo’s Approach |
|---|---|
| Policy drift between onboard and simulation | Pin simulation runs to onboard policy version |
| Rollback after bad deployment | Keep last-known-good policy version, revert via OTA |
| A/B testing new constraints | Deploy new policy to subset of fleet, compare metrics |
| Debugging historical incidents | Replay with exact policy version from incident timestamp |
This is not Git for code. It is Git for agent decision-making. The diff between two policy versions shows which constraints changed. The blame view shows which engineer added a rule and why.
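The diff view, and the incident-replay row of the table, can be sketched under two assumptions: each policy version is a dict mapping constraint names to parameters, and a deployment log records when each version went live. `diff_policies`, `version_at`, and the version hashes are hypothetical.

```python
import bisect


def diff_policies(old: dict, new: dict) -> dict:
    """Constraints added, removed, or changed between two policy versions."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


def version_at(deploy_log, timestamp):
    """Return the policy version active at `timestamp`.

    deploy_log is a list of (deploy_time, version_hash) pairs sorted by time.
    """
    times = [t for t, _ in deploy_log]
    i = bisect.bisect_right(times, timestamp) - 1
    if i < 0:
        raise ValueError("no policy was deployed before this timestamp")
    return deploy_log[i][1]


v1 = {"pedestrian_crossing": {"max_distance_m": 10}}
v2 = {"pedestrian_crossing": {"max_distance_m": 12},
      "intersection_entry": {"max_gap_m": 5, "min_speed_mps": 10}}
print(diff_policies(v1, v2))
# {'added': ['intersection_entry'], 'removed': [], 'changed': ['pedestrian_crossing']}

deploy_log = [(1_000, "a3f9c2e"), (2_500, "b71d004")]  # (epoch seconds, version hash)
print(version_at(deploy_log, 1_800))  # a3f9c2e: replay the incident with this version
```

A blame view would additionally need per-constraint authorship metadata, which this sketch leaves out.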
Observability Primitives
Waymo’s observability stack must answer one question, both onboard and in simulation: “Why did the agent make this decision?”
Key primitives:
- Decision traces: Every agent action logs the constraints it evaluated, the sensor inputs it considered, and the internal state that led to the decision.
- Scenario replay: Engineers can rewind to any point in a drive, swap in a different policy version, and see how the agent would have behaved.
- Constraint coverage: Simulation tracks which safety rules were exercised in each scenario. If a constraint never fires, it may be dead code.
- Latency budgets: Onboard agents log how long each constraint evaluation took. If a rule exceeds its budget, it gets flagged for optimization.
The onboard agent cannot log everything because of bandwidth and storage limits. It logs decision summaries and uploads them when the vehicle returns to a depot. Simulation agents log exhaustively because they do not face the same bandwidth and storage limits.
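Decision traces and constraint coverage can be sketched together, assuming a trace records the policy version, the constraints evaluated, the ones that fired, and the chosen action. `DecisionTrace`, `summary`, and `coverage_report` are hypothetical names; `summary()` stands in for the compact record an onboard agent would keep, while simulation can retain the full sensor snapshot.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DecisionTrace:
    policy_version: str
    evaluated: list  # constraints checked for this decision
    fired: list      # constraints that blocked or altered the action
    action: str
    sensor_snapshot: dict = field(default_factory=dict)  # bulky; kept in simulation

    def summary(self) -> dict:
        """Compact record suitable for onboard storage and later upload."""
        return {"policy_version": self.policy_version,
                "fired": self.fired, "action": self.action}


def coverage_report(traces, all_constraints):
    """Count how often each constraint fired and list those that never fired."""
    fired_counts = Counter(name for t in traces for name in t.fired)
    never_fired = sorted(c for c in all_constraints if fired_counts[c] == 0)
    return fired_counts, never_fired


traces = [
    DecisionTrace("a3f9c2e", ["pedestrian_crossing", "intersection_entry"],
                  ["pedestrian_crossing"], "brake", {"lidar_points": 120_000}),
    DecisionTrace("a3f9c2e", ["pedestrian_crossing", "intersection_entry"],
                  [], "proceed"),
]
counts, never_fired = coverage_report(
    traces, ["pedestrian_crossing", "intersection_entry", "school_zone"])
print(never_fired)          # ['intersection_entry', 'school_zone']
print(traces[0].summary())  # no sensor_snapshot in the onboard summary
```

A constraint that never fires across a large scenario set is only a candidate for dead code; it may also mean the scenario library lacks the situations it guards against.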
Deployment Shape
Waymo’s deployment pipeline has three stages:
- Simulation validation: New policy runs against millions of synthetic and recorded scenarios. Pass rate must exceed 99.99%.
- Closed-course testing: Policy deploys to vehicles on a private test track. Human safety drivers monitor behavior.
- Public road deployment: Policy rolls out to a small subset of the fleet (e.g., 1% of vehicles in Phoenix). Metrics are monitored for anomalies.
If any stage fails, the pipeline halts. The policy is rolled back, and engineers debug.
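The halt-on-failure behavior of the three stages can be sketched as an ordered list of checks. The stage names and `run_pipeline` are hypothetical; the closed-course and public-road checks are stubs standing in for human sign-off and fleet metrics, and only the 99.99% threshold comes from the text.

```python
from typing import Callable, List, Tuple

PASS_RATE_THRESHOLD = 0.9999  # the text's "must exceed 99.99%"


def simulation_stage(passed: int, total: int) -> Callable[[str], bool]:
    """Build a check that passes only if the scenario pass rate exceeds the threshold."""
    return lambda version: total > 0 and passed / total > PASS_RATE_THRESHOLD


def run_pipeline(version: str,
                 stages: List[Tuple[str, Callable[[str], bool]]],
                 rollback: Callable[[str], None]) -> str:
    """Run stages in order; on the first failure, halt and roll back."""
    for name, check in stages:
        if not check(version):
            rollback(version)
            return f"halted at {name}"
    return "deployed"


rolled_back = []
stages = [
    ("simulation_validation", simulation_stage(passed=999_995, total=1_000_000)),
    ("closed_course", lambda v: True),      # stub: safety drivers signed off
    ("public_road_1pct", lambda v: False),  # stub: fleet metrics showed an anomaly
]
result = run_pipeline("b71d004", stages, rolled_back.append)
print(result)       # halted at public_road_1pct
print(rolled_back)  # ['b71d004']
```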
Deployment risks:
- Sensor drift: Onboard sensors degrade over time. A constraint that worked in simulation may fail if a camera is dirty.
- Edge case explosion: Real-world scenarios are infinite. Simulation cannot cover everything.
- Latency creep: Adding constraints increases compute load. The onboard agent may miss its latency budget.
- Policy conflicts: Two constraints may contradict each other in rare scenarios (e.g., “avoid pedestrians” vs. “do not stop in the middle of an intersection”).
Waymo’s pipeline includes automated checks for these risks, but humans still review every policy change before deployment.
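One way an automated policy-conflict check could work, sketched under the assumption that each constraint, in a given scenario, either demands an action or has no opinion. The action vocabulary, the `INCOMPATIBLE` table, and `find_conflicts` are hypothetical; the example reproduces the pedestrian-versus-intersection conflict from the risk list.

```python
from itertools import combinations

# Pairs of actions that cannot both be satisfied (hypothetical vocabulary).
INCOMPATIBLE = {frozenset({"stop", "keep_moving"})}


def find_conflicts(requirements: dict) -> list:
    """requirements maps constraint name -> required action, or None for no opinion.
    Returns pairs of constraints whose required actions are incompatible."""
    active = [(name, action) for name, action in requirements.items() if action is not None]
    return [(n1, n2) for (n1, a1), (n2, a2) in combinations(active, 2)
            if frozenset({a1, a2}) in INCOMPATIBLE]


# Scenario: a pedestrian steps out while the vehicle is inside the intersection.
requirements = {
    "avoid_pedestrians": "stop",
    "no_stopping_in_intersection": "keep_moving",
    "speed_limit": None,
}
print(find_conflicts(requirements))
# [('avoid_pedestrians', 'no_stopping_in_intersection')]
```

A checker like this only flags the conflict; deciding which rule takes precedence is the part humans still review.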
Code Example: Constraint Evaluation
Here is a simplified example of how Waymo’s onboard agent might evaluate a safety constraint:
```python
import logging
import time

logger = logging.getLogger(__name__)

EVAL_BUDGET_MS = 5.0  # per-constraint latency budget (illustrative value)


class SafetyConstraint:
    def __init__(self, name, version, evaluator):
        self.name = name
        self.version = version      # policy bundle version this rule came from
        self.evaluator = evaluator  # callable(sensor_data, agent_state) -> dict
        self.last_eval_time_ms = 0.0

    def evaluate(self, sensor_data, agent_state):
        start_time = time.monotonic()
        result = self.evaluator(sensor_data, agent_state)
        self.last_eval_time_ms = (time.monotonic() - start_time) * 1000
        if self.last_eval_time_ms > EVAL_BUDGET_MS:  # budget exceeded: flag for optimization
            logger.warning("Constraint %s took %.2f ms", self.name, self.last_eval_time_ms)
        return result


# Example constraint: do not proceed if a pedestrian is crossing nearby
# (distance assumed to be in meters)
def pedestrian_crossing_constraint(sensor_data, agent_state):
    for pedestrian in sensor_data.pedestrians:
        if pedestrian.is_crossing and pedestrian.distance < 10:
            return {"allowed": False, "reason": "pedestrian_crossing"}
    return {"allowed": True}


# Load constraints from the policy bundle
policy_version = "a3f9c2e"
constraints = [
    SafetyConstraint("pedestrian_crossing", policy_version, pedestrian_crossing_constraint),
    # ... more constraints
]


# Evaluate all constraints before making a decision; the first violation wins
def make_decision(sensor_data, agent_state):
    for constraint in constraints:
        result = constraint.evaluate(sensor_data, agent_state)
        if not result["allowed"]:
            return {"action": "brake", "reason": result["reason"]}
    return {"action": "proceed"}
```
This is a simplified sketch. Production implementations typically use compiled languages such as C++ or Rust and run on specialized hardware. But the structure is the same: constraints are versioned, evaluated in sequence, and logged for observability.
Likely Failure Modes
Simulation divergence: If simulation does not accurately model onboard hardware, constraints may pass in testing but fail on the road. Waymo mitigates this by injecting real-world noise into simulation (sensor jitter, timing delays, compute variability).
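Noise injection for replayed frames can be sketched as follows, assuming each frame carries a timestamp and a single range measurement. The `Frame` type, `inject_noise`, and the default noise magnitudes are hypothetical placeholders, not Waymo’s noise models; dropped frames stand in for compute variability causing missed cycles.

```python
import random
from dataclasses import dataclass


@dataclass
class Frame:
    timestamp_s: float
    range_m: float


def inject_noise(frames, rng, range_sigma_m=0.02, jitter_s=0.005, drop_prob=0.01):
    """Return perturbed copies: Gaussian range noise, bounded timestamp jitter,
    and randomly dropped frames."""
    noisy = []
    for f in frames:
        if rng.random() < drop_prob:
            continue  # simulate a frame the onboard stack did not process in time
        noisy.append(Frame(
            timestamp_s=f.timestamp_s + rng.uniform(-jitter_s, jitter_s),
            range_m=f.range_m + rng.gauss(0.0, range_sigma_m),
        ))
    return noisy


# One second of a 10 Hz stream; a seeded RNG keeps each noisy replay reproducible.
clean = [Frame(timestamp_s=i * 0.1, range_m=20.0) for i in range(10)]
noisy = inject_noise(clean, random.Random(42))
print(len(noisy), [round(f.range_m, 3) for f in noisy[:3]])
```

Seeding the generator per scenario lets engineers replay the exact same perturbed run when a constraint fails under noise.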
Policy version skew: If a vehicle misses an OTA update, it may run an outdated policy. Waymo’s fleet management system tracks policy versions and flags vehicles that are out of sync.
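Flagging out-of-sync vehicles reduces to comparing each vehicle’s reported policy hash against the expected set. A minimal sketch; the report format, vehicle IDs, and `out_of_sync` are hypothetical, and the `allowed` set models vehicles deliberately placed on a different version, such as an A/B test cohort.

```python
def out_of_sync(fleet_reports: dict, current: str, allowed=frozenset()) -> list:
    """fleet_reports maps vehicle ID -> policy hash the vehicle reports running.
    Returns vehicles on neither the current version nor an explicitly allowed one."""
    expected = {current} | set(allowed)
    return sorted(vid for vid, version in fleet_reports.items() if version not in expected)


reports = {
    "veh-001": "b71d004",
    "veh-002": "a3f9c2e",  # missed the last OTA update
    "veh-003": "b71d004",
    "veh-004": "c0ffee1",  # A/B cohort testing a candidate policy
}
print(out_of_sync(reports, current="b71d004", allowed={"c0ffee1"}))  # ['veh-002']
```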
Constraint conflicts: Two safety rules may contradict each other in rare scenarios. Waymo’s pipeline includes automated conflict detection, but humans still review edge cases.
Latency budget violations: Adding constraints increases compute load. If the onboard agent misses its latency budget, it may fail to react in time. Waymo profiles every constraint and optimizes hot paths.
Replay fidelity loss: Lossy compression of recorded sensor data discards detail, so simulation may miss edge cases that only appear in the raw data. Waymo uses lossless compression for critical scenarios.
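The difference between lossless and lossy storage is easy to demonstrate. In this sketch, `zlib` round-trips packed range values exactly, while rounding to centimeters serves as a crude stand-in for a lossy codec (real sensor codecs are far more sophisticated). The 4.0 m threshold is hypothetical.

```python
import struct
import zlib

ranges_m = [12.3456789, 4.0000001, 7.25]
fmt = f"<{len(ranges_m)}d"  # little-endian float64 values
raw = struct.pack(fmt, *ranges_m)

# Lossless: decompressing yields the identical bytes, hence the identical floats.
restored = list(struct.unpack(fmt, zlib.decompress(zlib.compress(raw))))

# Lossy stand-in: quantize to centimeters before storage.
quantized = [round(r * 100) / 100 for r in ranges_m]

print(restored == ranges_m)   # True
print(quantized == ranges_m)  # False

# A boundary check (hypothetical 4.0 m threshold) flips on the quantized data.
print(ranges_m[1] <= 4.0, quantized[1] <= 4.0)  # False True
```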
Technical Verdict
Use this architecture when:
- You have agents running in two environments (production and test) that must enforce identical policies.
- Failure modes discovered in one environment must propagate to the other.
- You need an audit trail for agent decisions (e.g., regulatory compliance, post-incident analysis).
- Your agents make high-stakes decisions where safety constraints cannot drift.
Avoid this architecture when:
- Your agents run in a single environment (no need for dual synchronization).
- Policy changes are infrequent and can be manually reviewed.
- Your system does not have hard latency budgets (you can afford heavyweight constraint evaluation).
- You do not need to replay historical decisions (observability is not critical).
Waymo’s dual-agent architecture is overkill for most systems. But if you are building agents that operate in the physical world, where failure means injury or death, this is the plumbing you need.