mech.app
AI Agents

Deep Agent Evals: Five Patterns for Testing Multi-Step Reasoning Chains in Production

How to evaluate agents that make multiple LLM calls and tool invocations using offline pytest patterns and online LangSmith monitoring on AWS.

Source: aws.amazon.com

Testing a single LLM call is simple. You send a prompt, check the output, and measure accuracy. Testing an agent that chains five LLM calls, database queries, and retries is fundamentally different. Each step can fail silently. An incorrect tool call in step two corrupts the final answer, but your end-to-end test only sees the wrong result.

AWS published a practical guide that combines LangChain’s deep agent evaluation patterns with Anthropic’s agent eval framework. The reference implementation uses a text-to-SQL agent running on Amazon Bedrock with Nova 2 Lite. The agent takes natural language questions about a customer database, generates SQL, executes queries, and returns answers. The guide covers five evaluation patterns, offline pytest-based testing, and production monitoring with LangSmith tracing.

The Deep Agent Problem

A deep agent is any system that chains multiple LLM calls and tool invocations. The text-to-SQL example follows this flow:

  1. User asks “How many customers are from Canada?”
  2. Agent calls LLM to generate SQL query
  3. Agent validates SQL syntax
  4. Agent executes query against database
  5. Agent calls LLM to format the result
  6. Agent returns natural language answer

Traditional unit tests catch syntax errors in step three. They miss the case where step two generates valid SQL that queries the wrong table. They miss the case where step five formats a correct number as “approximately eight” when the answer is exactly eight. They miss the case where the agent retries three times with progressively worse SQL before giving up.

You need evaluation patterns that test intermediate steps, not just final outputs.
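
The six-step flow above can be sketched as a pipeline that records every step in a trace, which is what makes intermediate evaluation possible. This is a minimal sketch, not the AWS reference implementation: `Step`, `Trace`, `run_pipeline`, `make_test_db`, and the canned `fake_llm` are hypothetical names, and the model call is stubbed so the example runs offline against an in-memory SQLite database.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    output: object


@dataclass
class Trace:
    steps: list = field(default_factory=list)

    def record(self, name, output):
        """Store one step's output and pass it through unchanged."""
        self.steps.append(Step(name, output))
        return output

    def get_step(self, name):
        return next(s for s in self.steps if s.name == name)


def fake_llm(prompt):
    # Stand-in for a real model call; returns canned text for one question.
    if prompt.startswith("SQL:"):
        return "SELECT COUNT(*) FROM customers WHERE country = 'Canada'"
    count = prompt.split(":", 1)[1].strip()
    return f"There are {count} customers from Canada."


def run_pipeline(question, conn, llm=fake_llm):
    trace = Trace()
    sql = trace.record("generate_sql", llm(f"SQL: {question}"))
    # complete_statement only checks that the text forms a complete
    # statement; it is a cheap placeholder for real SQL validation.
    trace.record("validate_sql", sqlite3.complete_statement(sql + ";"))
    rows = trace.record("execute_query", conn.execute(sql).fetchall())
    answer = trace.record("format_result", llm(f"FORMAT: {rows[0][0]}"))
    return answer, trace


def make_test_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT)")
    rows = [("Canada",)] * 8 + [("Brazil",)] * 5
    conn.executemany("INSERT INTO customers (country) VALUES (?)", rows)
    conn.commit()
    return conn
```

Because each step's output lands in the trace, a test can assert on the generated SQL or the raw query rows independently of the final sentence the agent returns.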

Five Evaluation Patterns

The AWS guide defines five patterns for deep agent evaluation. Each pattern targets a different failure mode.

Pattern 1: Final Answer Correctness

Test the end result against ground truth. For “How many customers are from Canada?”, the expected answer is eight. This catches catastrophic failures but tells you nothing about why the agent failed.

def test_final_answer():
    result = agent.run("How many customers are from Canada?")
    assert result == "8"

This pattern works for deterministic queries. It breaks when the answer format varies (“8”, “eight”, “There are 8 customers”) or when the question has multiple valid answers.
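
One way to soften the format problem is to normalize the answer before comparing it. The sketch below is an assumption of this article, not part of the AWS guide: `extract_count` and `NUMBER_WORDS` are hypothetical names, and the helper only handles small whole numbers written as digits or English words.

```python
import re

NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12,
}


def extract_count(answer):
    """Return the first whole number in the answer, or None if there is none.

    Digits ("8") and number words up to twelve ("eight") are recognized.
    """
    for token in re.findall(r"\d+|[a-z]+", answer.lower()):
        if token.isdigit():
            return int(token)
        if token in NUMBER_WORDS:
            return NUMBER_WORDS[token]
    return None
```

A test can then assert `extract_count(result) == 8` instead of comparing strings. The helper returns the first number it finds, so an answer such as "8 of the 13 customers" still needs a stricter check, and it will not flag hedging words such as "approximately".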

Pattern 2: Intermediate Step Validation

Test each step in the agent chain. For the text-to-SQL agent, validate that the generated SQL is syntactically correct, queries the right table, and uses the right columns.

def test_sql_generation():
    trace = agent.run_with_trace("How many customers are from Canada?")
    sql_step = trace.get_step("generate_sql")
    
    assert "SELECT COUNT(*)" in sql_step.output
    assert "FROM customers" in sql_step.output
    assert "WHERE country = 'Canada'" in sql_step.output

This pattern catches errors early in the chain. It requires instrumentation to expose intermediate outputs. LangSmith provides trace-level access to every LLM call and tool invocation.
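
Substring assertions on SQL are brittle: an alias, different casing, or `COUNT(id)` instead of `COUNT(*)` fails the test even when the query is right. A complementary check, sketched here as an assumption rather than something the guide prescribes, runs the generated and expected queries against the test database and compares the rows. `same_result` and `build_customers_db` are hypothetical helper names.

```python
import sqlite3


def build_customers_db():
    """In-memory test database: 8 customers in Canada, 5 in Brazil."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT)")
    rows = [("Canada",)] * 8 + [("Brazil",)] * 5
    conn.executemany("INSERT INTO customers (country) VALUES (?)", rows)
    conn.commit()
    return conn


def same_result(conn, generated_sql, expected_sql):
    """True when both queries return the same rows, ignoring row order."""
    got = conn.execute(generated_sql).fetchall()
    want = conn.execute(expected_sql).fetchall()
    # Sort by repr so rows containing None or mixed types remain comparable.
    return sorted(got, key=repr) == sorted(want, key=repr)
```

Matching results on test data do not prove two queries are equivalent, so the fixture data should include rows that separate likely wrong queries from the right one. Here the Brazil rows make an unfiltered count return 13 instead of 8.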

Pattern 3: Tool Call Sequence Validation

Test that the agent calls tools in the right order. A text-to-SQL agent should generate SQL before executing it. An agent that executes first and generates later is broken, even if the final answer is correct.

def test_tool_sequence():
    trace = agent.run_with_trace("How many customers are from Canada?")
    tool_calls = [step.tool for step in trace.steps if step.tool]
    
    assert tool_calls == ["generate_sql", "validate_sql", "execute_query", "format_result"]

This pattern catches logic errors in the orchestration layer. It also catches cases where the agent skips validation or retries in an infinite loop.
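
An exact-match assertion on the tool list fails as soon as the agent performs one legitimate retry. The sketch below checks ordering rules instead. It is an illustration under assumptions: `sequence_violations` is a hypothetical helper, the tool names are the ones used in the test above, and the default of three attempts mirrors the retry limit described later in this article.

```python
def sequence_violations(tool_calls, max_attempts=3):
    """Return a list of rule violations for a sequence of tool names."""
    violations = []
    attempts = 0
    # True once validate_sql has been called after the latest generate_sql.
    # The tool list carries no outcomes, so this does not prove that
    # validation passed, only that it ran.
    validated = False
    for tool in tool_calls:
        if tool == "generate_sql":
            attempts += 1
            validated = False
        elif tool == "validate_sql":
            validated = attempts > 0
        elif tool == "execute_query" and not validated:
            violations.append("execute_query ran on unvalidated SQL")
    if attempts > max_attempts:
        violations.append(
            f"{attempts} generation attempts exceeds limit of {max_attempts}"
        )
    if not tool_calls or tool_calls[-1] != "format_result":
        violations.append("run did not end with format_result")
    return violations
```

A test asserts `sequence_violations(tool_calls) == []`. A run with one retry passes, while a run that skips validation or loops past the attempt limit returns a readable reason for the failure.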

Pattern 4: Retry and Error Handling

Test that the agent recovers from failures. Inject a bad SQL query and verify that the agent retries with a corrected query.

def test_retry_on_syntax_error(agent):
    # The test harness fails SQL validation on the first attempt and lets the
    # retry succeed, simulating a syntax error that the agent must correct.
    task = {
        "question": "How many customers are from Canada?",
        "force_error": "syntax_error",  # Failure the harness injects
        "expected_retries": 1
    }

    # Assumption: the harness accepts the failure to inject as a keyword
    # argument. Without it, force_error is never applied and no retry occurs.
    result = agent.run(task["question"], force_error=task["force_error"])
    trace = agent.get_trace()

    assert trace.retry_count == task["expected_retries"]
    assert result == "8"

This pattern exposes brittle error handling. Many agents retry with the same bad prompt, hit the retry limit, and return a generic error message.

Pattern 5: Latency and Cost Budgets

Test that the agent completes within acceptable time and token budgets. A correct answer that takes 30 seconds and costs $2 in API calls is a production failure.

import time

def test_performance_budget(agent):
    # perf_counter is a monotonic clock, so it is safe for measuring durations
    start = time.perf_counter()
    result = agent.run("How many customers are from Canada?")
    duration = time.perf_counter() - start

    trace = agent.get_trace()
    total_tokens = sum(step.tokens for step in trace.steps)

    assert duration < 5.0  # seconds
    assert total_tokens < 10000
    assert result == "8"

This pattern catches runaway agents that retry indefinitely or generate massive prompts.

Offline Evaluation with pytest

The AWS guide uses pytest to run offline evaluations before deploying to production. The test suite runs against a local SQLite database with known data. Each test is a task (input question, expected answer, success criteria). Each test run is a trial.

import os

import pytest
from langsmith import Client, trace

# TextToSQLAgent is the agent under test; import it from your own project.

@pytest.fixture
def agent():
    return TextToSQLAgent(
        model="amazon.nova-2-lite",
        database="test_customers.db"
    )

@pytest.fixture
def langsmith_client():
    return Client(api_key=os.getenv("LANGSMITH_API_KEY"))

def test_canada_customer_count(agent, langsmith_client):
    task = {
        "question": "How many customers are from Canada?",
        "expected_answer": "8",
        "expected_sql": "SELECT COUNT(*) FROM customers WHERE country = 'Canada'"
    }

    # langsmith.trace is the SDK's context manager for recording a run.
    # Check the call against the SDK version you have installed.
    with trace(name="canada_customer_count", inputs=task, client=langsmith_client) as run:
        result = agent.run(task["question"])
        sql = agent.get_generated_sql()
        run.end(outputs={"answer": result, "sql": sql})

    assert result == task["expected_answer"]
    assert task["expected_sql"] in sql

LangSmith captures every trace. You can review failures in the web UI, see the exact prompt and response for each LLM call, and replay the agent execution step by step.

Trace replay is a separate debugging step. After a test fails, you open the LangSmith dashboard, filter by test name, and inspect the full execution graph. This matters most for multi-step failures where the error appears in step five but originates in step two.

Online Monitoring in Production

Offline evals catch known failure modes. Production monitoring catches unknown failures. The AWS guide configures LangSmith to trace every agent invocation in production and alert on anomalies.

Key metrics to monitor:

  • Success rate: Percentage of tasks that return a valid answer
  • Retry rate: Percentage of tasks that require retries
  • Latency p95: 95th percentile response time
  • Token usage: Total tokens per task
  • Tool call distribution: Frequency of each tool invocation

LangSmith provides a dashboard for each metric. You can set alerts when success rate drops below 95%, when retry rate exceeds 10%, or when latency spikes above 10 seconds.
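
The thresholds above can also be checked outside the dashboard, for example in a nightly job over exported trace records. The sketch below is a plain-Python illustration, not LangSmith's alerting configuration: `TaskRecord`, `summarize`, and `alerts` are hypothetical names, and the record fields are assumed to be available from whatever export you use.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskRecord:
    succeeded: bool
    retries: int
    latency_s: float


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(records):
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    return {
        "success_rate": sum(r.succeeded for r in records) / n,
        "retry_rate": sum(r.retries > 0 for r in records) / n,
        "latency_p95_s": p95([r.latency_s for r in records]),
    }


def alerts(summary, min_success=0.95, max_retry=0.10, max_p95_s=10.0):
    """Return the names of the metrics that crossed their thresholds."""
    fired = []
    if summary["success_rate"] < min_success:
        fired.append("success_rate")
    if summary["retry_rate"] > max_retry:
        fired.append("retry_rate")
    if summary["latency_p95_s"] > max_p95_s:
        fired.append("latency_p95")
    return fired
```

The default thresholds match the ones in the text: success rate below 95%, retry rate above 10%, and p95 latency above 10 seconds.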

The production agent logs every trace to LangSmith. You can filter by error type, user ID, or question category. You can sample 10% of successful traces and 100% of failed traces to reduce logging costs.

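import os

from langsmith import Client, trace

# Assumption: recent SDK versions accept tracing_sampling_rate on the client
# (the LANGSMITH_TRACING_SAMPLING_RATE environment variable is an alternative).
# The rate applies to every trace. Keeping 100% of failed traces needs
# separate handling, so check the LangSmith docs for your SDK version.
client = Client(
    api_key=os.getenv("LANGSMITH_API_KEY"),
    tracing_sampling_rate=0.1,  # Log roughly 10% of traces
)

def run_agent_in_production(question):
    # `agent` is the deployed agent instance, created elsewhere.
    # If agent.run raises, the context manager records the error on the run
    # and the exception propagates to the caller.
    with trace(name="production_agent", inputs={"question": question}, client=client) as run:
        result = agent.run(question)
        run.end(outputs={"answer": result})
        return result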

Evaluation Trade-offs

| Pattern | What It Catches | What It Misses | Cost | When to Use |
| --- | --- | --- | --- | --- |
| Final Answer | Catastrophic failures where agent returns wrong data type or crashes | Root cause of failure, intermediate step errors | Low | Agent makes 1-2 LLM calls with deterministic outputs |
| Intermediate Steps | Bad tool outputs, incorrect SQL generation, wrong API parameters | Orchestration logic errors, tool call sequencing bugs | Medium | Agent makes 3+ LLM calls and you need to debug which step fails |
| Tool Sequence | Wrong tool order, skipped validation steps, infinite retry loops | Semantic errors in individual tool calls | Medium | Agent has complex orchestration with conditional branching |
| Retry Handling | Brittle error recovery, agents that retry with same bad prompt | Performance degradation from excessive retries | High | Agent operates in unreliable environments (flaky APIs, transient DB errors) |
| Performance Budget | Runaway costs from excessive retries or large prompts | Correctness issues unrelated to latency or tokens | Low | Agent has strict SLA requirements or operates at high volume |

You need all five patterns. Final answer tests catch regressions. Intermediate step tests catch model drift. Tool sequence tests catch orchestration bugs. Retry tests catch production outages. Performance tests catch cost explosions.

Deployment Shape on AWS

The reference architecture runs on:

  • Amazon Bedrock: Hosts the Nova 2 Lite model for SQL generation and result formatting
  • Amazon RDS: Stores the customer database
  • AWS Lambda: Runs the agent orchestration logic
  • LangSmith: Captures traces and provides the evaluation dashboard

The Lambda function's timeout is set to 15 minutes, the maximum AWS Lambda allows, but that ceiling is only a backstop. The reference architecture allocates a 5-second latency budget per task to keep p95 latency under the 10-second user-facing SLA. Adding a cold start (typically 1-2 seconds) and network overhead (500ms-1s) to the 5-second budget gives a worst case of roughly 6.5 to 8 seconds, which leaves 2 seconds or more of headroom. A task that runs anywhere near the 15-minute timeout has already failed the SLA.

The database connection pool has a maximum of 10 connections. If the agent retries too many times, it exhausts the pool and blocks other requests. The retry limit is set to 3 attempts per task.

Likely Failure Modes

SQL injection: The agent generates SQL from user input. If a user's question steers the LLM toward a malicious query, or the LLM hallucinates a destructive one, the agent can read or modify arbitrary data. The mitigation is to run queries in a read-only transaction and validate SQL against an allowlist of permitted tables.
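
For the SQLite database used in offline tests, one way to enforce a table allowlist is the authorizer hook in Python's `sqlite3` module, which SQLite consults while compiling each statement. This is a swapped-in technique for illustration, not the guide's implementation, and it does not carry over to Amazon RDS, where a read-only database role plays a similar part. `make_authorizer`, `run_guarded`, and `build_guarded_db` are hypothetical names.

```python
import sqlite3

# 31 is SQLite's documented action code for SQLITE_FUNCTION; the fallback
# covers Python builds that do not export the constant.
SQLITE_FUNCTION = getattr(sqlite3, "SQLITE_FUNCTION", 31)


def make_authorizer(allowed_tables):
    """Build a callback that permits SELECTs reading only the allowed tables."""
    def authorizer(action, arg1, arg2, db_name, source):
        if action in (sqlite3.SQLITE_SELECT, SQLITE_FUNCTION):
            return sqlite3.SQLITE_OK
        if action == sqlite3.SQLITE_READ and arg1 in allowed_tables:
            return sqlite3.SQLITE_OK  # arg1 is the table, arg2 the column
        return sqlite3.SQLITE_DENY    # writes, DDL, PRAGMA, other tables
    return authorizer


def run_guarded(conn, sql):
    """Execute sql; sqlite3 raises DatabaseError if the authorizer denies it."""
    return conn.execute(sql).fetchall()


def build_guarded_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);"
        "CREATE TABLE payments (id INTEGER PRIMARY KEY, card_number TEXT);"
    )
    rows = [("Canada",)] * 8 + [("Brazil",)] * 5
    conn.executemany("INSERT INTO customers (country) VALUES (?)", rows)
    conn.commit()
    # Install the guard only after setup, because it also blocks writes.
    conn.set_authorizer(make_authorizer({"customers"}))
    return conn
```

With the guard installed, the Canada count query still runs, while `SELECT * FROM payments`, `DELETE FROM customers`, and `DROP TABLE customers` all raise `sqlite3.DatabaseError` before touching any data.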

Infinite retry loops: The agent retries on syntax errors. If the LLM keeps generating the same bad SQL, the agent hits the retry limit and fails. The mitigation is to include the error message in the retry prompt and track retry count per error type.
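
The mitigation can be sketched as a retry loop that feeds the validator's error back into the next prompt and stops early when the same error type repeats. The names here are hypothetical (`generate_with_retries`, `RetryLimitExceeded`), and `generate` and `validate` are injected callables standing in for the LLM call and the SQL validator, so the sketch makes no claim about the reference agent's actual interfaces.

```python
from collections import Counter


class RetryLimitExceeded(Exception):
    pass


def generate_with_retries(question, generate, validate,
                          max_attempts=3, max_per_error=2):
    """Return (sql, attempts_used).

    generate(prompt) returns SQL text. validate(sql) returns None when the
    SQL is acceptable, or a short error type such as "syntax_error".
    """
    prompt = question
    error_counts = Counter()
    for attempt in range(1, max_attempts + 1):
        sql = generate(prompt)
        error = validate(sql)
        if error is None:
            return sql, attempt
        error_counts[error] += 1
        if error_counts[error] >= max_per_error:
            # The same error type came back again, so the feedback is not
            # helping; stop instead of spending the remaining attempts.
            raise RetryLimitExceeded(
                f"{error} repeated {error_counts[error]} times"
            )
        # Include the failed SQL and the error so the model can correct it.
        prompt = (
            f"{question}\nPrevious SQL: {sql}\nError: {error}\n"
            "Return corrected SQL."
        )
    raise RetryLimitExceeded(f"no valid SQL after {max_attempts} attempts")
```

Stopping on a repeated error type trades a small chance of late recovery for fewer wasted LLM calls and database connections; raise `max_per_error` if your model often needs two tries at the same kind of error.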

Context window overflow: The agent includes the database schema in every prompt. If the schema is too large, it exceeds the context window and the LLM truncates the input. The mitigation is to use a schema summarization tool that extracts column names and types without full table definitions, reducing schema size by 70% while preserving query generation accuracy.
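
A schema summary of this kind can be sketched for the SQLite test database with `PRAGMA table_info`, which lists each column's name and declared type. This is an illustrative assumption, not the guide's summarization tool: `summarize_schema` is a hypothetical name, and a production RDS engine would need its own catalog query.

```python
import sqlite3


def summarize_schema(conn):
    """Return one line per table in the form: table(col TYPE, col TYPE)."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    lines = []
    for table in tables:
        # table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        described = ", ".join(f"{col[1]} {col[2]}" for col in columns)
        lines.append(f"{table}({described})")
    return "\n".join(lines)
```

The summary drops constraints, defaults, and foreign keys. That keeps the prompt short, but it also removes join hints, so it is worth re-running the intermediate step tests after switching from full table definitions to a summary.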

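Cost runaway: The agent retries up to three times per task. If 10% of tasks use all three retries and each retry costs as much as the first attempt, token usage increases by 30%. The dollar impact depends on tokens per task. At $0.30 per 1M tokens (Nova 2 Lite pricing) and 10,000 tasks per day, tasks that use the full 10,000-token budget from Pattern 5 cost about $900 per month at baseline, so a 30% increase adds roughly $270 per month. The mitigation is to set a per-task token budget and fail fast when the budget is exceeded.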
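
A per-task token budget can be as small as a counter that raises once the limit is crossed. The sketch below uses hypothetical names (`TokenBudget`, `TokenBudgetExceeded`) and assumes the agent can report the token count of each LLM call as it completes.

```python
class TokenBudgetExceeded(Exception):
    pass


class TokenBudget:
    """Tracks token use for one task and fails fast past the limit."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def charge(self, tokens):
        """Add one call's tokens; raise once the running total exceeds the limit."""
        self.used += tokens
        if self.used > self.limit:
            raise TokenBudgetExceeded(
                f"used {self.used} of {self.limit} tokens"
            )
```

Creating one `TokenBudget(10000)` per task and calling `charge` after every LLM call stops a retrying agent at the first call that crosses the limit, rather than after all retries are spent.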

Stale execution state: The agent caches database schema metadata to reduce latency. If the schema changes (new columns, renamed tables), the agent generates SQL against an outdated schema. The mitigation is to set a cache TTL of 5 minutes and force a schema refresh on SQL execution errors that mention missing columns or tables.
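
The TTL plus forced-refresh mitigation can be sketched as a small cache wrapper. The names are hypothetical (`SchemaCache`, `is_schema_error`), the `loader` callable stands in for whatever fetches the schema, and the error strings matched are the ones SQLite emits; other database engines word these errors differently.

```python
import time


class SchemaCache:
    """Caches the schema for ttl_s seconds; invalidate() forces a reload."""

    def __init__(self, loader, ttl_s=300.0, clock=time.monotonic):
        self._loader = loader
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so tests can control time
        self._value = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        expired = (
            self._loaded_at is None
            or now - self._loaded_at >= self._ttl_s
        )
        if expired:
            self._value = self._loader()
            self._loaded_at = now
        return self._value

    def invalidate(self):
        self._loaded_at = None


def is_schema_error(message):
    """True for SQLite errors that point at a missing table or column."""
    text = message.lower()
    return "no such column" in text or "no such table" in text
```

On a query failure, the agent would call `cache.invalidate()` when `is_schema_error(str(exc))` is true, then regenerate the SQL against the refreshed schema instead of waiting out the 5-minute TTL.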

Agent-Specific Constraints

Real-time latency requirements: This pattern does not work for sub-100ms latency requirements. LangSmith’s logging overhead (50-100ms per trace) and the multi-step evaluation flow make this impractical for high-frequency operations. Use circuit breakers and pre-computed fallback responses instead.

Regulatory audit trails: LangSmith stores traces in a managed service. If your compliance framework requires on-premises audit logs (e.g., SEC Rule 17a-4 for trade execution logs, SOC 2 Type II for customer data access), you need to export LangSmith traces to your own storage or use a self-hosted alternative like OpenTelemetry with local trace storage.

Data residency requirements: LangSmith’s default deployment stores traces in US-based infrastructure. If you operate under GDPR or other data residency regulations, verify that LangSmith supports regional deployments or configure trace export to AWS S3 in your required region.

Air-gapped environments: LangSmith requires network access to log traces. If your agent runs in an air-gapped network (common in financial institutions with strict network segmentation or government deployments), you cannot use LangSmith’s cloud service. Use local trace storage with OpenTelemetry or a self-hosted LangSmith instance.

Technical Verdict

Use this approach for multi-step agents that meet these criteria:

Multi-step chains: The agent makes 3+ LLM calls or tool invocations per task. Single-call agents do not benefit from intermediate step validation.

Latency requirements: You need sub-5-second response times. LangSmith adds 50-100ms of logging overhead per trace, which is acceptable for most user-facing applications but not for high-frequency operations.

Audit trails: You need to inspect intermediate SQL queries, tool calls, or reasoning steps for compliance or debugging. LangSmith provides full trace replay, allowing you to inspect the exact SQL generated, tool calls made, and LLM reasoning for each query.

Cost control: You need to track token usage and prevent runaway costs. The performance budget pattern catches agents that exceed token limits before they drain your API budget.

Avoid this approach if:

  • Your agent is a single LLM call with no tools. The overhead of tracing and intermediate step validation is not worth it.
  • You need real-time evaluation with sub-100ms latency. LangSmith’s logging overhead makes this impractical.
  • Your agent runs on-device or in air-gapped environments without network access to LangSmith’s cloud service.

The text-to-SQL pattern applies directly to other agentic workflows: market data aggregators that fetch prices from multiple APIs, portfolio analyzers that run calculations and format reports, or compliance agents that query transaction logs and generate audit summaries. Any agent that chains LLM calls, database queries, and validation steps benefits from these five evaluation patterns.