ESI-Bench: Why Embodied Agents Need a Perception-Action Loop to Understand Occluded Space

Most vision-language models process static observations. They see a room from one angle, describe what is visible, and stop. ESI-Bench exposes the architectural gap between passive perception and embodied spatial intelligence: real automation tasks require agents to move, manipulate objects, and actively uncover occluded structure.

The benchmark tests whether agents can reason about containment (what is inside a closed drawer), dynamics (how objects fall when released), and spatial relationships that cannot be resolved from a single viewpoint. The core insight is that spatial intelligence unfolds through a perception-action loop. Agents must decide which abilities to deploy (perception, locomotion, manipulation) and sequence them to accumulate task-relevant evidence.

Why Passive Multi-View Fails

Standard approaches to spatial reasoning collect multiple static images and feed them to a vision model. ESI-Bench shows that this strategy is inefficient: random multi-view sampling consumes far more images yet underperforms active exploration, because the agent has no hypothesis about what to look for or where to move, so most of the extra views carry little task-relevant evidence.

The failure mode is action blindness: poor action choices lead to poor observations, which cascade into reasoning errors. An agent that opens the wrong cabinet or looks at the wrong angle never acquires the evidence needed to answer the task.

Perception-Action Loop Architecture

An embodied agent's loop cycles through three stages:

Hypothesis formation: Generate a spatial question (is the object behind the door? inside the box?)
Action selection: Choose a movement or manipulation that tests the hypothesis
Observation update: Integrate new sensory data and revise the world model

This differs from standard vision pipelines in state management. Passive models process a batch of images in parallel. Active agents maintain a belief state that evolves as they explore. Each action is conditioned on prior observations, and each observation updates the belief state that drives the next action.
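To make the evolving belief state concrete, here is a minimal sketch. It assumes the target sits in exactly one of a few candidate locations and that inspecting a location is noiseless. The BeliefState class and the location names are hypothetical illustrations, not part of ESI-Bench or OmniGibson.

```python
from dataclasses import dataclass, field


@dataclass
class BeliefState:
    """Probability that the target is at each candidate location (hypothetical)."""
    probs: dict[str, float]
    history: list[tuple[str, bool]] = field(default_factory=list)

    def update(self, inspected: str, found: bool) -> None:
        """Revise the belief after fully inspecting one location."""
        self.history.append((inspected, found))
        if found:
            # Target seen: all probability mass moves to the inspected location.
            self.probs = {loc: float(loc == inspected) for loc in self.probs}
            return
        # Target not seen: rule the location out and renormalize the rest.
        self.probs[inspected] = 0.0
        total = sum(self.probs.values())
        if total > 0:
            self.probs = {loc: p / total for loc, p in self.probs.items()}

    def most_likely(self) -> str:
        """Return the location that currently holds the most probability mass."""
        return max(self.probs, key=self.probs.get)


belief = BeliefState({"left_drawer": 0.5, "right_drawer": 0.3, "cabinet": 0.2})
belief.update("left_drawer", found=False)
print(belief.most_likely(), belief.probs)
```

Unlike a fixed image batch, the state after each update depends on which location was inspected, so the order of actions matters.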

State Management Trade-offs

Approach	Observation Buffer	Action Dependency	Exploration Cost
Passive multi-view	Fixed batch	None	Low (pre-computed)
Random exploration	Growing buffer	None	High (wasted moves)
Active perception-action	Incremental update	Sequential	Medium (targeted)

The passive approach is cheap but blind to occlusion. Random exploration is expensive and noisy. Active perception-action loops balance cost and information gain by selecting actions that resolve uncertainty.
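One way to read "selecting actions that resolve uncertainty" is as choosing the action with the highest expected entropy reduction. The sketch below is an assumption-laden illustration, not the benchmark's method: it treats the belief as a distribution over candidate locations and assumes a noiseless inspection, so the target is seen if and only if it is there. The function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(belief, location):
    """Expected entropy reduction from inspecting one location (noiseless sensor)."""
    p_found = belief[location]
    prior = entropy(belief.values())
    if p_found >= 1.0:
        return prior  # belief is already certain, so prior entropy is zero
    # If the target is not found, the remaining mass is renormalized.
    rest = [p / (1.0 - p_found) for loc, p in belief.items() if loc != location]
    # If the target is found, posterior entropy is zero, so only the
    # "not found" branch contributes to the expected posterior entropy.
    expected_posterior = (1.0 - p_found) * entropy(rest)
    return prior - expected_posterior


belief = {"left_drawer": 0.5, "right_drawer": 0.3, "cabinet": 0.2}
gains = {loc: expected_information_gain(belief, loc) for loc in belief}
best = max(gains, key=gains.get)
print(best, gains)
```

Under these assumptions the gain for a location reduces to the binary entropy of its probability, so the agent prefers locations whose outcome is least predictable.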

OmniGibson Task Structure

ESI-Bench is built on OmniGibson, a physics-based simulator, and defines 10 task categories spanning 29 subcategories. Tasks include:

Containment reasoning: Determine what is inside closed containers
Occlusion handling: Identify objects hidden behind furniture
Dynamic prediction: Reason about how objects move when manipulated
Spatial relationships: Understand relative positions that change with viewpoint

Each task requires the agent to act before it can answer. Ground truth is not visible in the initial observation. The agent must open drawers, move around obstacles, or manipulate objects to reveal hidden state.

Preventing Infinite Exploration Loops

Active exploration introduces a risk: agents can wander indefinitely without converging on an answer. ESI-Bench addresses this with two mechanisms:

Action budgets: Limit the number of steps per task
Termination conditions: Agents must explicitly signal when they have enough evidence

The benchmark evaluates both task success and exploration efficiency. An agent that solves the task in three actions outperforms one that requires ten, even if both eventually succeed.
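The ordering described above can be sketched as a sort key: success first, then fewer actions. This is a hypothetical illustration of the comparison, not the scoring formula ESI-Bench actually uses. EpisodeResult and ranking_key are names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeResult:
    agent: str
    success: bool
    steps_used: int


def ranking_key(result):
    """Successful episodes sort first; among those, fewer steps sort first."""
    return (not result.success, result.steps_used)


results = [
    EpisodeResult("ten_step_agent", True, 10),
    EpisodeResult("three_step_agent", True, 3),
    EpisodeResult("failed_agent", False, 2),
]
ranked = sorted(results, key=ranking_key)
print([r.agent for r in ranked])
```

Keeping success and step count as separate fields avoids baking an arbitrary trade-off between them into a single number.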

Hypothesis Testing Flow

class EmbodiedSpatialAgent:
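    def __init__(self, world_model, action_space, environment, action_budget=10):
        self.world_model = world_model
        self.action_space = action_space
        # The environment must expose step(action) and return the next observation
        self.environment = environment
        self.belief_state = {}
        self.action_budget = action_budget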
    
    def solve_task(self, initial_obs, task_query):
        obs = initial_obs
        self.belief_state = self.world_model.initialize(obs)
        
        for step in range(self.action_budget):
            # Generate hypothesis about occluded structure
            hypothesis = self.world_model.propose_hypothesis(
                self.belief_state, task_query
            )
            
            # Select action to test hypothesis
            action = self.select_action(hypothesis, obs)
            
            # Execute and observe
            obs = self.environment.step(action)
            
            # Update belief state
            self.belief_state = self.world_model.update(
                self.belief_state, action, obs
            )
            
            # Check termination
            if self.world_model.has_sufficient_evidence(
                self.belief_state, task_query
            ):
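                return self.world_model.answer(self.belief_state, task_query)

        # Budget exhausted: fall back to the most likely answer under the current belief
        return self.world_model.best_guess(self.belief_state, task_query)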
    
    def select_action(self, hypothesis, obs):
        # Score actions by expected information gain
        candidates = self.action_space.sample_actions(obs)
        scores = [
            self.world_model.information_gain(hypothesis, a)
            for a in candidates
        ]
        return candidates[scores.index(max(scores))]

The key difference from passive models is the feedback loop: actions are chosen based on the current belief state, and observations update that state. The agent stops when it has sufficient evidence, not when it runs out of images.

Benchmarking Counterfactual Actions

A subtle challenge in embodied benchmarks is ground truth. The correct answer often depends on actions the agent did not take. If the agent opens the left drawer but the object is in the right drawer, how do you score the attempt?

ESI-Bench handles this by simulating counterfactual trajectories. The benchmark knows the full world state and can evaluate whether the agent’s action sequence was sufficient to resolve the task, even if the agent made suboptimal choices.

This requires the simulator to track:

Reachable state space: What the agent could have observed given its action budget
Information sufficiency: Whether the observations collected are enough to answer the query
Exploration efficiency: How many actions were wasted on uninformative moves
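A simplified sketch of how an evaluator with full world-state access might check these quantities. It assumes a containment query whose target is known to be in exactly one container. The function and field names are hypothetical and do not reflect the ESI-Bench implementation.

```python
def evaluate_trajectory(world_state, target, opened):
    """Score a sequence of opened containers against ground truth.

    world_state: dict mapping container name -> set of objects inside.
    target: the object the query asks about.
    opened: containers the agent opened, in order.
    """
    candidates = set(world_state)
    seen = set()
    sufficient_at = None
    for step, container in enumerate(opened, start=1):
        seen.add(container)
        found = any(target in world_state[c] for c in seen)
        # Elimination: with at most one container unopened, the location
        # is determined even if the target was never seen directly.
        eliminated = len(candidates - seen) <= 1
        if found or eliminated:
            sufficient_at = step
            break
    extra = 0 if sufficient_at is None else len(opened) - sufficient_at
    return {
        "sufficient": sufficient_at is not None,
        "steps_to_sufficiency": sufficient_at,
        "extra_actions": extra,
    }


world = {"left_drawer": set(), "right_drawer": {"mug"}, "cabinet": {"plate"}}
print(evaluate_trajectory(world, "mug", ["left_drawer"]))
print(evaluate_trajectory(world, "mug", ["left_drawer", "cabinet"]))
print(evaluate_trajectory(world, "mug", ["right_drawer", "left_drawer"]))
```

The first trajectory is insufficient because two containers remain unopened. The second is sufficient by elimination. The third finds the target immediately and then spends one extra action.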

Experimental Results

State-of-the-art multimodal LLMs show a clear hierarchy:

Active exploration: 68% task success
Passive multi-view (oracle viewpoints): 52% task success
Random multi-view: 47% task success
Single static view: 31% task success

The gap between active and passive narrows when the task does not involve occlusion. For tasks where all relevant information is visible in the initial frame, passive models perform comparably. But for containment, dynamics, and occlusion tasks, active exploration is necessary.

Agents also discover emergent spatial strategies without explicit instructions. For example, when asked if an object is inside a closed box, agents learn to approach the box, open it, and look inside rather than guessing from the exterior appearance.

Failure Modes

Most errors stem from action selection, not perception. Agents correctly identify objects in view but choose actions that do not reveal occluded structure. Common failure patterns:

Premature termination: Agent answers before collecting enough evidence
Inefficient exploration: Agent moves randomly instead of testing hypotheses
Action space confusion: Agent attempts invalid actions (trying to open a solid wall)
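Action space confusion in particular can be reduced by filtering candidate actions before scoring them. The sketch below assumes a hand-written affordance table. The table contents, object names, and the valid_actions helper are hypothetical and not drawn from OmniGibson's API.

```python
# Hypothetical affordance table: which verbs each object type supports.
AFFORDANCES = {
    "drawer": {"open", "close", "look_inside"},
    "cabinet": {"open", "close", "look_inside"},
    "box": {"open", "close", "look_inside", "pick_up"},
    "wall": set(),
}


def valid_actions(candidates, object_types):
    """Keep only (verb, object_name) pairs that the object's type affords."""
    return [
        (verb, name)
        for verb, name in candidates
        if verb in AFFORDANCES.get(object_types.get(name), set())
    ]


object_types = {"drawer_1": "drawer", "north_wall": "wall", "box_3": "box"}
candidates = [("open", "drawer_1"), ("open", "north_wall"), ("pick_up", "box_3")]
print(valid_actions(candidates, object_types))
```

Unknown objects and unknown types fall through to an empty affordance set, so they are dropped rather than attempted.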

Explicit 3D grounding helps on depth-sensitive tasks. Agents that maintain a spatial map of explored regions perform better than those that rely on image history alone.

Deployment Considerations

Embodied spatial intelligence is expensive. Each action requires:

Simulator step (10-50ms in OmniGibson)
Vision model inference (100-500ms for large MLLMs)
Action selection (50-200ms for hypothesis testing)

Summing these ranges gives roughly 160-750ms per step, so a task that requires five exploration steps consumes about 0.8-3.75 seconds of wall-clock time, compared to 100-500ms for a single passive inference. This makes active exploration viable for high-value tasks (robotics, inspection, search) but impractical for real-time applications.
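The per-episode bounds follow directly from the per-step ranges listed above. This short calculation assumes the three stages run sequentially with no overlap or batching.

```python
# Per-step latency ranges in milliseconds, taken from the list above.
STEP_LATENCY_MS = {
    "simulator_step": (10, 50),
    "vision_inference": (100, 500),
    "action_selection": (50, 200),
}


def episode_latency_ms(num_steps):
    """Return (best_case, worst_case) latency for an episode, in milliseconds."""
    low = sum(lo for lo, _ in STEP_LATENCY_MS.values())
    high = sum(hi for _, hi in STEP_LATENCY_MS.values())
    return num_steps * low, num_steps * high


print(episode_latency_ms(1))  # (160, 750)
print(episode_latency_ms(5))  # (800, 3750)
```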

The perception-action loop also introduces observability challenges. You need to log:

Belief state at each step
Hypothesis being tested
Expected vs. actual information gain
Termination reason (success, budget exhausted, failure)

Without this telemetry, debugging why an agent failed to find an object is nearly impossible.
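A minimal sketch of a per-step telemetry record covering the fields listed above, serialized as one JSON line per step. The StepRecord schema and field names are assumptions for illustration, not a format defined by ESI-Bench.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class StepRecord:
    step: int
    hypothesis: str
    action: str
    expected_gain: float
    actual_gain: float
    belief_state: dict
    # Set only on the final step: "success", "budget_exhausted", or "failure".
    termination_reason: Optional[str] = None


def to_log_line(record: StepRecord) -> str:
    """Serialize one step as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


record = StepRecord(
    step=1,
    hypothesis="mug is in left_drawer",
    action="open left_drawer",
    expected_gain=1.0,
    actual_gain=0.51,
    belief_state={"left_drawer": 0.0, "right_drawer": 0.6, "cabinet": 0.4},
)
print(to_log_line(record))
```

Logging expected and actual gain side by side makes it easier to spot steps where the agent's model of an action was badly calibrated.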

Technical Verdict

Use active perception-action loops when:

Tasks involve occluded structure that cannot be resolved from a single viewpoint
The cost of exploration is acceptable relative to task value
You can simulate or test actions safely before deployment
Ground truth requires reasoning about containment, dynamics, or hidden state

Avoid when:

All task-relevant information is visible in the initial observation
Latency requirements are under 500ms
The action space is too large to explore efficiently
You lack a simulator to test counterfactual trajectories

ESI-Bench clarifies the infrastructure gap between passive vision models and embodied agents. If your automation task requires understanding what is inside, behind, or underneath, you need an architecture that closes the perception-action loop.

Source Links

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop