EntityBench: How Multi-Shot Video Agents Track Identity Across 100+ Frames

Multi-shot video generation is moving from research demos to production pipelines, but entity consistency remains the hardest unsolved plumbing problem. A character who appears in shot 3 needs to look identical in shot 47, even when 43 other shots separate the two appearances. EntityBench exposes the gap between single-shot quality and multi-shot coherence by introducing the first systematic benchmark for measuring identity drift across long sequences.

The paper introduces 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously. The hardest tier includes sequences up to 50 shots with 13 cross-shot characters, 8 locations, 22 objects, and recurrence gaps spanning 48 shots. That is production scale: a single sequence can require tracking 13 characters whose appearances are separated by as many as 48 shots.

The State Persistence Problem

Video generation agents must solve three distinct problems:

Entity serialization: Extract and store appearance embeddings, spatial coordinates, and semantic labels after each shot.
Entity resolution: Match entities across shots when prompts use different names, angles, or contexts.
Consistency enforcement: Ensure the same entity renders identically across recurrence gaps of 10, 20, or 48 shots.

Existing single-shot models generate high-quality frames but lack memory. Each generation call starts from scratch. If you prompt “a woman in a red coat” in shot 3 and “the woman from earlier” in shot 47, the model has no mechanism to retrieve the original appearance. The agent orchestration layer must handle this.
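
To make the entity resolution step concrete, here is a minimal sketch, not the paper's method. It assumes the orchestration layer keeps a table of known aliases and falls back to nearest-reference matching on embeddings. The names `resolve_entity` and `aliases` and the 0.8 threshold are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def resolve_entity(mention, embedding, aliases, bank, threshold=0.8):
    """Map a mention to a stored entity_id, or return None if it looks new.

    aliases: mention text -> entity_id (hypothetical lookup table)
    bank: entity_id -> reference embedding
    embedding: embedding of the candidate, e.g. from a detected crop
    """
    # An exact alias hit is the cheapest and most reliable path.
    if mention in aliases:
        return aliases[mention]
    # Otherwise pick the most similar stored reference above the threshold.
    best_id, best_sim = None, threshold
    for entity_id, ref in bank.items():
        sim = cosine(embedding, ref)
        if sim >= best_sim:
            best_id, best_sim = entity_id, sim
    return best_id
```

A `None` result is the signal to create a new memory entry rather than reuse an existing one.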

EntityMem: Memory-Augmented Generation

The paper proposes EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. The architecture separates entity tracking from generation:

class EntityMemory:
    """Per-entity reference store plus a thin generation wrapper (illustrative)."""

    def __init__(self, model, detector, encoder):
        # All three collaborators are injected placeholders, not a specific library API.
        self.model = model        # generator exposing generate(prompt=..., entity_conditioning=...)
        self.detector = detector  # YOLO or similar, exposing detect(frame)
        self.encoder = encoder    # vision encoder exposing embed(frame, bbox)
        self.entity_bank = {}     # entity_id -> {'reference', 'last_seen', 'bbox'}
        self.appearance_log = []  # (shot_id, entity_id, bbox, embedding)

    def store_entity(self, entity_id, frame, bbox):
        """Log an appearance; the reference embedding is written only on first sight."""
        embedding = self.encoder.embed(frame, bbox)
        entry = self.entity_bank.setdefault(entity_id, {'reference': embedding})
        entry['last_seen'] = frame.shot_id
        entry['bbox'] = bbox
        self.appearance_log.append((frame.shot_id, entity_id, bbox, embedding))

    def retrieve_entity(self, entity_id, current_shot):
        """Return (reference_embedding, recurrence_gap), or None for an unseen entity."""
        ref = self.entity_bank.get(entity_id)
        if ref is None:
            return None
        return ref['reference'], current_shot - ref['last_seen']

    def detect_entities(self, frame):
        """Detect entities in a generated frame and return (entity_id, bbox) tuples."""
        return [(d.entity_id, d.bbox) for d in self.detector.detect(frame)]

    def generate_shot(self, prompt, entity_schedule, current_shot):
        """Generate one shot conditioned on stored references, then update memory."""
        # Entities with no stored reference yet are skipped rather than passed as None.
        entity_refs = {}
        for entity_id in entity_schedule:
            ref = self.retrieve_entity(entity_id, current_shot)
            if ref is not None:
                entity_refs[entity_id] = ref

        # Condition generation on retrieved embeddings (e.g. via cross-attention).
        frame = self.model.generate(prompt=prompt, entity_conditioning=entity_refs)

        # Record new appearances; existing references are left untouched.
        for entity_id, bbox in self.detect_entities(frame):
            self.store_entity(entity_id, frame, bbox)

        return frame

The key insight: entity memory is write-once, read-many. After the first appearance, the reference embedding becomes the source of truth. Subsequent generation calls condition on this embedding rather than re-extracting from intermediate frames.

Orchestration Flow

A multi-shot pipeline coordinates multiple model calls in sequence:

Generation: Text-to-video model produces raw frames.
Detection: Object detection model identifies entity bounding boxes.
Embedding extraction: Vision encoder extracts appearance embeddings from cropped regions.
Similarity scoring: Compare new embeddings against memory bank to verify identity.
Memory update: Store verified embeddings for future conditioning.

Each step introduces latency and potential failure modes. If detection misses an entity, the memory bank becomes stale. If embedding extraction drifts due to lighting changes, similarity scoring may reject valid matches.
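
The five steps and their two failure modes can be sketched as a single function. This is an illustration under assumptions, not the paper's pipeline: `generate`, `detect`, `embed`, and `similarity` are hypothetical callables supplied by the caller, and the 0.8 threshold is arbitrary.

```python
def run_shot(prompt, schedule, bank, generate, detect, embed, similarity, threshold=0.8):
    """Run generation, detection, embedding, scoring, and memory update for one shot.

    bank: entity_id -> reference embedding, updated in place for new entities.
    Returns (frame, rejected), where rejected lists scheduled entity_ids that
    were either not detected or failed similarity verification.
    """
    refs = {e: bank[e] for e in schedule if e in bank}
    frame = generate(prompt, refs)                # 1. generation
    detected = dict(detect(frame))                # 2. detection: entity_id -> bbox
    rejected = []
    for entity_id in schedule:
        if entity_id not in detected:
            rejected.append(entity_id)            # missed detection: memory not updated
            continue
        emb = embed(frame, detected[entity_id])   # 3. embedding extraction
        if entity_id not in bank:
            bank[entity_id] = emb                 # 5. first appearance becomes the reference
        elif similarity(emb, bank[entity_id]) < threshold:
            rejected.append(entity_id)            # 4. scoring rejected the match
    return frame, rejected
```

Returning the rejected entities, rather than silently dropping them, gives the orchestrator a hook for retries or manual review.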

Evaluation Architecture

EntityBench uses a three-pillar evaluation suite:

Pillar	Metric	Failure Mode
Intra-shot quality	FVD, aesthetic score	Blurry frames, artifacts
Prompt-following	CLIP alignment	Wrong actions, missing objects
Cross-shot consistency	Embedding cosine similarity	Identity drift, appearance changes

The fidelity gate is critical. Only entity appearances that match the prompt description enter cross-shot scoring. If the model generates a blonde character when the prompt specifies brunette, that appearance is excluded from consistency measurement. This prevents rewarding models that maintain consistency by ignoring prompts.
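
A minimal sketch of gated scoring for one entity follows. It assumes each appearance already carries a boolean gate result. Mean pairwise cosine similarity is an assumption made for illustration; the benchmark's exact aggregation may differ.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def gated_consistency(appearances):
    """Mean pairwise cosine similarity over appearances that pass the gate.

    appearances: list of (embedding, matches_prompt) pairs for one entity.
    Returns None when fewer than two appearances pass, since no pair exists.
    """
    kept = [emb for emb, ok in appearances if ok]
    pairs = list(combinations(kept, 2))
    if not pairs:
        return None
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

An appearance that fails the gate contributes nothing to the score, so a model cannot raise its consistency number by rendering the same wrong character every time.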

Consistency Drift Patterns

Experiments show cross-shot entity consistency degrades sharply with recurrence distance. A character who appears in shots 1, 5, and 10 maintains high similarity. The same character in shots 1, 25, and 50 shows significant drift.

The degradation is not linear. Most of the drop happens across the first recurrence gap, after which consistency plateaus. This suggests models struggle more with re-establishing an entity than with holding on to it: once an entity is “locked in” to the generation context, it persists reasonably well. The hard part is recovering identity after 20+ intervening shots.
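
To check the shape of this curve on your own runs, one option is to bucket similarity scores by recurrence gap. The sketch below assumes you have already logged (gap, similarity) pairs. The function name and the bucket size of 10 are illustrative choices.

```python
from collections import defaultdict

def similarity_by_gap(records, bucket_size=10):
    """Average similarity per recurrence-gap bucket.

    records: iterable of (recurrence_gap, similarity) pairs.
    Returns {bucket_start: mean_similarity}, ordered by bucket_start.
    """
    buckets = defaultdict(list)
    for gap, sim in records:
        # Gaps 0-9 land in bucket 0, 10-19 in bucket 10, and so on.
        buckets[(gap // bucket_size) * bucket_size].append(sim)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}
```

A steep fall between the first two buckets followed by a flat tail would match the plateau pattern described above.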

EntityMem shows the highest character fidelity among evaluated methods, with a Cohen’s d effect size of +2.33 (indicating strong improvement over baseline methods). Explicit per-entity memory outperforms implicit context-based approaches because it decouples entity tracking from prompt engineering. You don’t need to repeat “the woman in the red coat from shot 3” in every subsequent prompt. The memory bank handles retrieval.
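
For readers unfamiliar with the metric, Cohen's d is the difference between two group means divided by a pooled standard deviation. The sketch below uses the common pooled-variance form. Whether the paper uses exactly this variant is an assumption.

```python
import math
from statistics import mean, variance

def cohens_d(treatment, baseline):
    """Cohen's d using the pooled sample standard deviation.

    Both inputs need at least two values so the sample variance is defined.
    """
    n1, n2 = len(treatment), len(baseline)
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(baseline)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(baseline)) / math.sqrt(pooled_var)
```

By the usual convention a d of 0.8 already counts as a large effect, which puts +2.33 well beyond that mark.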

The size of EntityMem’s improvement reflects consistent gains across all three benchmark tiers (easy, medium, hard), with the largest effect observed in hard-tier sequences featuring 48-shot recurrence gaps. Note that Cohen’s d measures effect size, not statistical significance.

Production Deployment Patterns

EntityBench results show consistency degrades sharply after 20+ intervening shots. That finding, together with the scale of the hard tier, raises four practical concerns for production pipelines.

Entity merge conflicts: If two prompts describe the same character with slightly different details (“woman in red coat” vs “woman with red jacket”), the system must decide whether to create separate memory entries or merge them. Merge too aggressively and you conflate distinct entities. Merge too conservatively and you fragment memory.

Memory bank pruning: A 50-shot sequence with 13 characters can produce up to 650 character appearances (50 × 13), before objects and locations are counted. Storing every embedding is expensive. EntityBench’s 48-shot recurrence gaps suggest production systems need pruning strategies: keep only the most recent N appearances, or keep appearances at exponentially spaced intervals (shots 1, 2, 4, 8, 16, 32).
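
Both pruning strategies fit in a few lines. This is a sketch with hypothetical function names. Because an entity rarely appears at exactly shots 1, 2, 4, and so on, the exponential variant keeps the first appearance at or after each power of two.

```python
def keep_recent(shot_ids, n):
    """Keep the n most recent appearances."""
    if n <= 0:
        return []
    return sorted(shot_ids)[-n:]

def prune_exponential(shot_ids):
    """Keep the first appearance at or after each target shot 1, 2, 4, 8, ..."""
    kept = []
    target = 1
    for shot in sorted(shot_ids):
        if shot >= target:
            kept.append(shot)
            # Advance the target past this shot so each interval keeps one entry.
            while target <= shot:
                target *= 2
    return kept
```

For an entity present in every shot of a 50-shot sequence, `prune_exponential(range(1, 51))` keeps shots 1, 2, 4, 8, 16, and 32: six stored appearances instead of fifty. Unlike `keep_recent`, it always retains the earliest appearance.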

Embedding drift compensation: Lighting, camera angle, and occlusion shift embeddings even when the entity itself is unchanged. Systems need drift-tolerant similarity thresholds or embedding normalization so that valid matches are not rejected under varied visual conditions.
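
One drift-tolerant option, offered as an assumption and not something the paper prescribes, is to L2-normalize embeddings and accept a candidate if it matches any of several stored views of the entity. The function names and the 0.8 threshold are illustrative.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def matches_any_view(candidate, views, threshold=0.8):
    """True if the candidate is close enough to at least one stored view.

    After normalization the dot product equals cosine similarity, so
    differences in embedding magnitude do not affect the decision.
    """
    c = l2_normalize(candidate)
    return any(
        sum(a * b for a, b in zip(c, l2_normalize(v))) >= threshold
        for v in views
    )
```

Storing a few views per entity (front, profile, low light) costs more memory but lets the threshold stay strict for each individual comparison.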

Parallel generation: Generating 50 shots sequentially is slow. Parallel generation requires speculative entity memory: predict which entities will appear in future shots and pre-populate memory banks. If predictions are wrong, you waste compute. If predictions are right, latency falls roughly in proportion to the number of shots you can generate concurrently.
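
When the per-shot entity schedule is known up front, as it is in EntityBench, a simpler alternative to speculation is dependency-based scheduling: a shot can start once the first-appearance shot of each of its entities has finished. The sketch below groups shots into waves under that assumption. `plan_waves` is a hypothetical helper, not part of the paper.

```python
def plan_waves(schedule):
    """Group shots into waves that can be generated concurrently.

    schedule: list of entity-id lists, one per shot, in narrative order.
    Returns a list of waves, each a list of shot indices. A shot lands one
    wave after the latest wave that introduces any of its known entities.
    """
    first_seen = {}   # entity_id -> index of the shot that introduces it
    level = []        # level[i] = wave index of shot i
    for shot, entities in enumerate(schedule):
        deps = [level[first_seen[e]] for e in entities if e in first_seen]
        level.append(1 + max(deps) if deps else 0)
        for e in entities:
            first_seen.setdefault(e, shot)
    waves = [[] for _ in range(max(level, default=-1) + 1)]
    for shot, lvl in enumerate(level):
        waves[lvl].append(shot)
    return waves
```

For the schedule `[["a"], ["b"], ["a", "b"], ["a"], ["c"], ["c", "a"]]` this yields two waves, `[[0, 1, 4], [2, 3, 5]]`, so six shots need only two sequential generation rounds.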

State Management Trade-offs

Approach	Memory Cost	Lookup Speed	Drift Tolerance
Store all embeddings	Linear in shots × entities	Constant per entity	High
Store keyframes only	Constant per entity	Constant	Medium
Recompute on demand	Constant	Linear in shots	Low
Hierarchical indexing	Logarithmic in shots	Logarithmic in entities	Medium

EntityMem uses the “store all embeddings” approach for maximum fidelity. Production systems may prefer keyframe storage with periodic refresh to balance memory and accuracy.
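
Keyframe storage with periodic refresh can be sketched as follows. `KeyframeStore` and its default refresh interval of 10 shots are hypothetical and chosen only for illustration.

```python
class KeyframeStore:
    """Keep one embedding per entity and replace it once it is old enough."""

    def __init__(self, refresh_every=10):
        self.refresh_every = refresh_every
        self.keyframes = {}  # entity_id -> (shot_id, embedding)

    def observe(self, entity_id, shot_id, embedding):
        """Store the first appearance, then refresh after refresh_every shots.

        Returns True if the stored keyframe was written or replaced.
        """
        stored = self.keyframes.get(entity_id)
        if stored is None or shot_id - stored[0] >= self.refresh_every:
            self.keyframes[entity_id] = (shot_id, embedding)
            return True
        return False
```

Memory stays constant per entity, but each refresh replaces the reference with a generated appearance, so small errors can compound over a long sequence. That is the accuracy cost the “Medium” drift tolerance in the table reflects.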

Technical Verdict

Use EntityBench when:

You are building multi-shot video generation pipelines and need to measure entity consistency across long sequences.
You need a standardized benchmark to compare different memory architectures or entity tracking strategies.
You are evaluating the trade-off between generation quality and cross-shot coherence.

Avoid when:

You are generating single-shot videos or short clips (under 10 shots) where entity memory overhead exceeds benefit.
Your use case does not require strict entity consistency (abstract animations, non-narrative content).
You need real-time generation. The memory bank lookup and embedding extraction add 20-30% latency per shot.

EntityBench demonstrates that multi-shot coherence requires explicit entity memory and orchestration overhead that single-shot optimization does not address. The benchmark exposes a fundamental tension in agentic video generation: single-shot models optimize for quality, but multi-shot pipelines optimize for consistency. EntityMem shows that explicit memory beats implicit context, but the orchestration complexity is non-trivial. Production systems need pruning strategies, drift compensation, and merge conflict resolution.