Real-Time Agent Observability: How Claude Code's Hook Architecture Exposes Multi-Agent Coordination Bottlenecks

Source: github.com

When you run multiple agents in production, you need real-time visibility into what they’re doing. Logs from five minutes ago don’t help when an agent is stuck in a retry loop or waiting on a blocked resource. The problem is that every observability hook you add slows down the agents themselves.

Agents Observe started as an automation harness around Claude Code and hit this wall immediately. The developer needed real-time visibility into multi-agent teams, with filtering and search. But Claude Code’s hooks are blocking: every plugin that registers a hook adds latency to the agent’s execution path. In the Show HN discussion, the developer reported performance degradation once multiple plugins were using hooks.

This isn’t only a Claude Code problem. It’s the fundamental tension in agent observability: you want high-frequency event streams (per-tool-call, per-decision, per-token), but you can’t afford to block the agent on I/O every time it does something interesting.

Why Blocking Hooks Kill Agent Throughput

Claude Code exposes hooks for agent lifecycle events: tool calls, model responses, state transitions. When an event fires, Claude Code calls every registered hook synchronously. If your hook writes to a database, sends a network request, or updates a dashboard, the agent waits.

The overhead compounds quickly. As an illustrative example, assume the hooks run one after another: if an agent makes 10 tool calls per minute, each tool call fires 3 hooks (pre-call, post-call, result), and you have 3 plugins registered with each hook taking 50ms, that is 90 hook invocations and 4.5 seconds per minute spent waiting on observability, or 7.5% of your agent’s time. The cost scales linearly with every one of those factors: double the plugins or the hook latency and the share doubles, and every additional agent pays the same tax.
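
The arithmetic above can be checked in a few lines of Python. This is only a sketch under the same illustrative assumptions (hooks treated as running serially, every hook taking the same time); the function name `hook_overhead` is made up for this example and is not part of Claude Code.

```python
def hook_overhead(calls_per_min: int, hooks_per_call: int,
                  plugins: int, hook_ms: float) -> tuple[float, float]:
    """Return (seconds blocked per minute, fraction of each minute blocked).

    Assumes hooks run one after another and each takes hook_ms milliseconds.
    """
    blocked_s = calls_per_min * hooks_per_call * plugins * hook_ms / 1000
    return blocked_s, blocked_s / 60


blocked_s, share = hook_overhead(calls_per_min=10, hooks_per_call=3,
                                 plugins=3, hook_ms=50)
print(f"{blocked_s:.1f} s/min blocked ({share:.1%} of agent time)")
```

Plugging in your own measured hook latency is a quick way to decide whether a queue layer is worth the extra moving parts.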

The blocking model makes sense for single-agent debugging. You want deterministic ordering. You want to see every event before the next one fires. But it doesn’t scale to multi-agent production workloads.

Architecture: Async Queues vs. Separate Telemetry Channels

There are three ways to instrument agents without blocking them:

1. Async event queues

Hook writes to an in-memory queue and returns immediately. Background worker drains the queue and sends events to your observability backend.

Pros:

  • Minimal latency impact (queue write is microseconds)
  • Preserves event ordering within a single agent
  • Easy to add batching and compression

Cons:

  • Queue can fill up if backend is slow
  • Need backpressure strategy (drop, block, or sample)
  • Adds memory overhead

2. Fire-and-forget network calls

Hook spawns a thread or async task to send the event, doesn’t wait for response.

Pros:

  • Zero blocking time for the agent
  • Simple to implement

Cons:

  • No delivery guarantees
  • Can’t apply backpressure
  • Thread/task overhead adds up
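
A fire-and-forget emitter might look like the sketch below. It uses a small thread pool from the standard library; the class name `FireAndForgetEmitter` and the `send` callable (standing in for whatever transport you use) are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


class FireAndForgetEmitter:
    """Hands each event to a thread pool and never waits on the result."""

    def __init__(self, send: Callable[[dict[str, Any]], None],
                 max_workers: int = 4):
        self._send = send  # hypothetical transport, e.g. an HTTP POST wrapper
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def emit(self, event: dict[str, Any]) -> None:
        # submit() returns right away. The Future is discarded, so a failed
        # send goes unnoticed by the agent (no delivery guarantee).
        self._pool.submit(self._send, event)

    def close(self) -> None:
        # Wait for in-flight sends; call once at process shutdown.
        self._pool.shutdown(wait=True)


received: list[dict[str, Any]] = []
emitter = FireAndForgetEmitter(send=received.append)
emitter.emit({"agent_id": "agent-1", "event_type": "tool_call"})
emitter.close()
print(len(received))
```

The pool’s internal work queue is unbounded, which is exactly the missing-backpressure problem listed above: if `send` is slower than the agent, pending work accumulates in memory.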

3. Separate telemetry channel

Agent writes structured logs to stdout/stderr, separate process tails and ships them.

Pros:

  • Agent process stays clean
  • Standard Unix tooling (tail, grep, jq)
  • Easy to swap backends

Cons:

  • Parsing overhead in the shipper
  • Harder to correlate events across agents
  • Requires process supervision
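
As a rough sketch of the separate-channel approach, the agent can write one JSON object per line and a shipper can parse whatever it tails. The function names `log_event` and `parse_events` and the record fields are assumptions for illustration; the example writes to an in-memory buffer instead of stdout so it can run on its own.

```python
import io
import json
import sys
import time
from typing import Any, Iterable, Iterator, TextIO


def log_event(agent_id: str, event_type: str, payload: dict[str, Any],
              stream: TextIO = sys.stdout) -> None:
    """Agent side: write one JSON object per line and flush."""
    record = {"ts": time.time(), "agent_id": agent_id,
              "event_type": event_type, "payload": payload}
    stream.write(json.dumps(record) + "\n")
    stream.flush()


def parse_events(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Shipper side: yield structured events, skip everything else."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ordinary log output mixed into the stream
        if isinstance(record, dict) and "agent_id" in record:
            yield record


buffer = io.StringIO()
log_event("agent-1", "tool_call", {"tool": "search"}, stream=buffer)
buffer.write("plain text log line\n")
log_event("agent-2", "tool_result", {"ok": True}, stream=buffer)
events = list(parse_events(buffer.getvalue().splitlines()))
print([e["agent_id"] for e in events])
```

Including `agent_id` in every record is what makes cross-agent correlation possible later, since the shipper sees only interleaved lines.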

Agents Observe appears to use approach #1 (async queues), judging by its real-time dashboard requirement. In that design, events flow through a queue to a WebSocket server that pushes updates to the browser, and the queue acts as a buffer when the dashboard can’t keep up.

Event Granularity: What Actually Matters

Agent observability has a signal-to-noise problem. Agents produce thousands of events per minute. Most are irrelevant. The question is: what granularity do you actually need?

Granularity   | Use Case                                          | Event Volume   | Typical Latency Impact
--------------|---------------------------------------------------|----------------|--------------------------------------
Per-token     | Debugging hallucinations, prompt engineering      | 100-1000/sec   | High (not recommended for production)
Per-tool-call | Understanding agent behavior, debugging failures  | 1-10/sec       | Low with async queues
Per-decision  | High-level orchestration flow                     | 0.1-1/sec      | Minimal
Per-agent-run | Success/failure metrics only                      | 0.01-0.1/sec   | Negligible

Note: Latency impact estimates are based on typical multi-agent deployments with async queue architectures. Actual impact varies by implementation.

Most production systems need per-tool-call granularity. You want to see:

  • Which tool was called
  • What arguments were passed
  • What the tool returned
  • How long it took
  • Whether it succeeded or failed
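
One way to capture those five fields is a small wrapper around the tool invocation. This is a minimal sketch; `ToolCallRecord` and `record_tool_call` are hypothetical names, and a real agent would probably re-raise the exception after recording it.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolCallRecord:
    tool: str                     # which tool was called
    args: dict[str, Any]          # what arguments were passed
    result: Any                   # what the tool returned
    duration_ms: float            # how long it took
    ok: bool                      # whether it succeeded
    error: Optional[str] = None   # failure detail, if any


def record_tool_call(tool: str, fn: Callable[..., Any],
                     **args: Any) -> ToolCallRecord:
    """Run fn(**args) and capture the outcome as a ToolCallRecord."""
    start = time.perf_counter()
    try:
        result, ok, error = fn(**args), True, None
    except Exception as exc:  # record the failure instead of raising
        result, ok, error = None, False, repr(exc)
    duration_ms = (time.perf_counter() - start) * 1000
    return ToolCallRecord(tool, args, result, duration_ms, ok, error)


good = record_tool_call("add", lambda a, b: a + b, a=2, b=3)
bad = record_tool_call("div", lambda a, b: a / b, a=1, b=0)
print(good.result, good.ok, bad.ok, bad.error)
```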

Per-token is only useful during development. Per-decision is too coarse (you lose visibility into why the agent made that decision). Per-agent-run is fine for metrics but useless for debugging.

The filtering and search requirements in Agents Observe suggest the developer is dealing with high event volume. When you’re running multiple agents simultaneously, even per-tool-call granularity produces thousands of events per minute. You need to filter by agent ID, tool name, time range, or result status just to find the relevant events.
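
The filters named above compose naturally as optional predicates. The sketch below works on an in-memory list; the `Event` fields and the `filter_events` signature are assumptions for illustration, and a real dashboard would push the same criteria down into its storage query.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Event:
    ts: float       # seconds since the epoch
    agent_id: str
    tool: str
    ok: bool


def filter_events(events: Iterable[Event], *,
                  agent_id: Optional[str] = None,
                  tool: Optional[str] = None,
                  since: Optional[float] = None,
                  until: Optional[float] = None,
                  ok: Optional[bool] = None) -> Iterator[Event]:
    """Yield events that match every criterion that is not None."""
    for e in events:
        if agent_id is not None and e.agent_id != agent_id:
            continue
        if tool is not None and e.tool != tool:
            continue
        if since is not None and e.ts < since:
            continue
        if until is not None and e.ts > until:
            continue
        if ok is not None and e.ok != ok:
            continue
        yield e


events = [
    Event(100.0, "agent-1", "search", True),
    Event(101.0, "agent-2", "write_file", False),
    Event(102.0, "agent-1", "write_file", False),
    Event(200.0, "agent-1", "write_file", False),
]
failed = list(filter_events(events, agent_id="agent-1", ok=False, until=150.0))
print(failed)
```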

Implementation: Hook Architecture with Async Queues

Here’s what a non-blocking hook implementation looks like using thread-safe queues:

import asyncio
import queue
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Awaitable, Callable

@dataclass
class AgentEvent:
    timestamp: datetime
    agent_id: str
    event_type: str
    payload: dict[str, Any]

class AsyncObservabilityHook:
    def __init__(self, max_queue_size: int = 10000):
        # Thread-safe queue for sync/async boundary
        self.queue: queue.Queue[AgentEvent] = queue.Queue(maxsize=max_queue_size)
        self.handlers: list[Callable[[AgentEvent], Awaitable[None]]] = []
        # Count of events discarded because the queue was full
        # (approximate if several threads emit at once)
        self.dropped = 0
        self._shutdown = False

    def register_handler(self, handler: Callable[[AgentEvent], Awaitable[None]]):
        """Register an async handler that processes events."""
        self.handlers.append(handler)

    def emit(self, agent_id: str, event_type: str, payload: dict):
        """
        Non-blocking event emission from synchronous agent code.
        Drops events if queue is full rather than blocking.
        """
        event = AgentEvent(
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            timestamp=datetime.now(timezone.utc),
            agent_id=agent_id,
            event_type=event_type,
            payload=payload
        )
        try:
            self.queue.put_nowait(event)
        except queue.Full:
            # Queue is full: drop the event, but keep a count so the loss is visible
            self.dropped += 1

    async def start_worker(self):
        """Background worker that drains the queue asynchronously."""
        # Keep running until shutdown is requested and the queue is empty
        while not (self._shutdown and self.queue.empty()):
            try:
                # Non-blocking get: raises queue.Empty if nothing is queued
                event = self.queue.get_nowait()
            except queue.Empty:
                # Nothing to do; yield to the event loop before polling again
                await asyncio.sleep(0.01)
                continue

            # Fan out to all handlers concurrently; return_exceptions=True
            # keeps one failing handler from stopping the others
            tasks = [handler(event) for handler in self.handlers]
            await asyncio.gather(*tasks, return_exceptions=True)

    def shutdown(self):
        """Signal worker to stop."""
        self._shutdown = True
# Usage in agent code
hook = AsyncObservabilityHook()

# `websocket` and `db` below are placeholders for your own async clients
async def websocket_handler(event: AgentEvent):
    # Send to real-time dashboard
    await websocket.send_json({
        "timestamp": event.timestamp.isoformat(),
        "agent_id": event.agent_id,
        "event_type": event.event_type,
        "payload": event.payload
    })

async def database_handler(event: AgentEvent):
    # Persist to database
    await db.insert("events", {
        "timestamp": event.timestamp,
        "agent_id": event.agent_id,
        "event_type": event.event_type,
        "payload": event.payload
    })

hook.register_handler(websocket_handler)
hook.register_handler(database_handler)

# In synchronous agent execution code
# (`execute_tool` is a placeholder for your tool dispatcher)
def call_tool(tool_name: str, args: dict):
    hook.emit(agent_id="agent-1", event_type="tool_call", payload={
        "tool": tool_name,
        "args": args
    })
    result = execute_tool(tool_name, args)
    hook.emit(agent_id="agent-1", event_type="tool_result", payload={
        "tool": tool_name,
        "result": result
    })
    return result

# Start the background worker from inside a running event loop;
# asyncio.create_task() raises RuntimeError when no loop is running
async def main():
    worker = asyncio.create_task(hook.start_worker())
    # Run the synchronous agent in a thread so it cannot stall the worker
    await asyncio.to_thread(call_tool, "search", {"query": "example"})
    hook.shutdown()
    await worker  # wait for the worker loop to exit

asyncio.run(main())

The key is that emit() never blocks. It writes to a thread-safe queue and returns immediately. The background worker drains the queue and fans out to handlers (WebSocket, database, metrics backend). If the queue fills up, events are dropped. That’s the right trade-off: agent throughput matters more than perfect observability.

Multi-Agent Coordination Bottlenecks

When you run multiple agents, observability reveals coordination bottlenecks you didn’t know existed.

Shared resource contention: Two agents trying to write to the same file, call the same rate-limited API, or update the same database record. Without per-tool-call visibility, this looks like random slowdowns. With it, you see the retry loops and backoff timers.

Sequential dependencies: Agent A waits for Agent B to finish before starting. If Agent B is slow, Agent A is idle. The observability dashboard shows Agent A in a “waiting” state with no tool calls. That’s a signal to redesign the workflow or add parallelism.

Event storms: One agent produces a burst of events (e.g., processing a large file), overwhelming the observability backend. The queue fills up, events get dropped, and you lose visibility into other agents. The fix is either sampling (only emit 1 in N events) or separate queues per agent.
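
Both fixes can live in one small buffer class. The sketch below gives each agent its own bounded queue and optionally keeps only 1 in N events; `PerAgentBuffer` and its parameters are hypothetical names chosen for this example, not something taken from Agents Observe.

```python
import queue
import threading
from collections import defaultdict
from typing import Any


class PerAgentBuffer:
    """One bounded queue per agent, with optional 1-in-N sampling."""

    def __init__(self, max_per_agent: int = 1000, sample_every: int = 1):
        self._lock = threading.Lock()
        self._queues: dict[str, queue.Queue] = defaultdict(
            lambda: queue.Queue(maxsize=max_per_agent))
        self._seen: dict[str, int] = defaultdict(int)
        self._sample_every = sample_every
        self.dropped: dict[str, int] = defaultdict(int)

    def emit(self, agent_id: str, event: Any) -> bool:
        """Return True if queued, False if sampled out or dropped."""
        with self._lock:  # held only for in-memory bookkeeping, never I/O
            self._seen[agent_id] += 1
            if (self._seen[agent_id] - 1) % self._sample_every != 0:
                return False  # sampled out
            try:
                self._queues[agent_id].put_nowait(event)
                return True
            except queue.Full:
                self.dropped[agent_id] += 1  # only the noisy agent loses events
                return False

    def pending(self, agent_id: str) -> int:
        with self._lock:
            return self._queues[agent_id].qsize()


buf = PerAgentBuffer(max_per_agent=5)
for i in range(100):                       # agent-1 produces a burst
    buf.emit("agent-1", {"n": i})
quiet_ok = buf.emit("agent-2", {"n": 0})   # agent-2 is unaffected
print(buf.pending("agent-1"), buf.dropped["agent-1"], quiet_ok)
```

Because the burst can only fill its own queue, a storm from one agent costs that agent’s events and leaves the others visible. The per-agent drop counter also makes the loss itself observable.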

The Agents Observe project explicitly lists filtering and search as requirements, which is what you would expect from a system that has to cope with event storms. When you’re running multiple agents, you need to isolate the signal from one agent without drowning in noise from the others.

State Management and Replay

Real-time observability is useful during execution. But you also need to replay agent sessions after the fact. That requires persisting events in a queryable format.

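The simplest approach is to write events to a time-series database (InfluxDB, TimescaleDB) or a log aggregator (Elasticsearch, Loki). You get filtering, search, and time-range queries largely for free. Prometheus is a better fit for numeric metrics derived from those events (call counts, latencies, error rates) than for the event payloads themselves.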

The harder problem is state reconstruction. If you want to replay an agent session, you need to capture the full state at each step: conversation history, tool results, memory contents, and decision rationale. That’s a lot of data.

One pattern is to emit state snapshots at decision boundaries (after each tool call or model response). The snapshot includes everything needed to resume execution from that point. You can then replay by loading a snapshot and re-executing from there.
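
A minimal sketch of that pattern follows, assuming the agent’s state is JSON-serializable. `AgentState` and `SnapshotLog` are hypothetical names, and the in-memory list stands in for whatever store you persist snapshots to.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AgentState:
    history: list[dict[str, Any]] = field(default_factory=list)
    memory: dict[str, Any] = field(default_factory=dict)


class SnapshotLog:
    """Stores a JSON snapshot of the state at each decision boundary."""

    def __init__(self) -> None:
        self._snapshots: list[str] = []

    def capture(self, state: AgentState) -> int:
        """Serialize the state; return the step index it was stored under."""
        self._snapshots.append(json.dumps(
            {"history": state.history, "memory": state.memory}))
        return len(self._snapshots) - 1

    def restore(self, step: int) -> AgentState:
        """Rebuild an independent AgentState from the snapshot at `step`."""
        data = json.loads(self._snapshots[step])
        return AgentState(history=data["history"], memory=data["memory"])


log = SnapshotLog()
state = AgentState()
for tool, result in [("search", "3 hits"), ("read_file", "ok"),
                     ("write_file", "ok")]:
    state.history.append({"tool": tool, "result": result})
    state.memory["last_tool"] = tool
    log.capture(state)   # snapshot after every tool call

resumed = log.restore(1)  # state as it was after the second tool call
print(resumed.memory["last_tool"], len(resumed.history))
```

Serializing at capture time matters: storing a reference to the live state object would let later mutations rewrite earlier snapshots.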

The trade-off is storage cost. A single agent session might produce hundreds of snapshots, each several KB. For a multi-agent system running 24/7, that adds up quickly. You need a retention policy (for example, keep full snapshots for 7 days, then downsample to decision-level events only).

Security Boundaries in Observability

Agent observability exposes sensitive data. Tool arguments might include API keys, user credentials, or PII. Tool results might include confidential business data. If your observability backend is compromised, so is everything your agents touch.

The right approach is to scrub sensitive data before emitting events:

  • Redact known patterns (API keys, tokens, passwords)
  • Hash or encrypt PII (email addresses, phone numbers)
  • Truncate large payloads (only keep first/last N bytes)
  • Apply role-based access control to the observability dashboard
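
The first three items can be applied in a single pass over the payload before it is emitted. The sketch below is illustrative only: the key names in `SECRET_KEYS`, the email pattern, and the `MAX_LEN` threshold are assumptions you would replace with your own rules.

```python
import hashlib
import re
from typing import Any

SECRET_KEYS = {"api_key", "token", "password", "authorization"}  # assumed names
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MAX_LEN = 200  # assumed truncation threshold, in characters


def _hash_email(match: re.Match) -> str:
    digest = hashlib.sha256(match.group(0).lower().encode()).hexdigest()[:12]
    return f"<email:{digest}>"


def scrub(value: Any) -> Any:
    """Return a scrubbed copy: secrets redacted, emails hashed, long strings cut."""
    if isinstance(value, dict):
        return {k: "[REDACTED]" if str(k).lower() in SECRET_KEYS else scrub(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, str):
        value = EMAIL_RE.sub(_hash_email, value)
        if len(value) > MAX_LEN:
            half = MAX_LEN // 2  # keep the first and last `half` characters
            value = value[:half] + "...[truncated]..." + value[-half:]
    return value


payload = {"tool": "send_email",
           "args": {"api_key": "sk-123", "to": "Ana@example.com"},
           "result": "x" * 1000}
clean = scrub(payload)
print(clean["args"], len(clean["result"]))
```

A plain SHA-256 of an email address can be reversed by hashing a list of candidate addresses, so a keyed hash (HMAC with a secret) is the safer choice where that matters. Scrubbing belongs on the emit side, before the event reaches any queue or backend.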

Agents Observe is a local dashboard, so the security boundary is the developer’s machine. But if you’re shipping events to a centralized backend, you need to treat observability data as being as sensitive as the agent’s execution environment itself.

Technical Verdict

Use real-time agent observability when:

  • You’re running multi-agent teams and need to debug coordination failures or performance bottlenecks in production
  • You can implement async queues to decouple event emission from agent execution
  • You can afford the infrastructure (async queues, WebSocket servers, time-series database)
  • You have a strategy for handling event storms (sampling, filtering, backpressure)
  • You need filtering and search across high-volume event streams

Avoid it when:

  • You’re using Claude Code’s native hooks without a queue layer (blocking overhead will degrade performance with multiple plugins, as reported in the source material)
  • You’re running a single agent in a controlled environment (just use logs)
  • Your agents are latency-sensitive and can’t tolerate any overhead
  • You don’t have a plan for securing sensitive data in event streams
  • You’re not prepared to manage the storage and retention costs of high-frequency event capture

The blocking hook problem in Claude Code is a specific implementation detail, but the underlying tension is universal. Every observability system trades off visibility against performance. Async queues are the right answer for production multi-agent systems. Fire-and-forget is fine for low-volume use cases. Separate telemetry channels work if you already have a log aggregation pipeline.

The key insight from Agents Observe is that filtering and search are first-class requirements. If you can’t isolate the signal from one agent in a multi-agent system, your observability is useless.