Production Agent Failures: What 'Ask HN' War Stories Reveal About Durability, Observability, and the Build-vs-Buy Calculus

An Ask HN thread exposes the real problem: production agents fail when orchestration plumbing breaks under partial failure. The original poster spent a month rewriting jobs as durable execution on DBOS after cascading errors broke entire report generation workflows. The scenario is common: a parent agent fans out to subagents that process transcript data, each one making API calls and running LLM inference. When a run fails at step 9 of 12, the error cascades and the entire workflow breaks with almost no visibility.

The Failure Mode No One Talks About

The gap between prototype and production is orchestration under partial failure. This is not a prompt engineering problem. This is an infrastructure problem.

What Cascading Errors Look Like

Fan-out without isolation: One subagent OOMs, kills the parent process, orphans 47 other subagents mid-flight.
No idempotency: Retry logic re-runs steps 1-8, burns tokens, writes duplicate rows, confuses downstream consumers.
Invisible state: You know the job failed. You do not know if it failed before or after writing the summary to S3.
No progress tracking: Users see “processing” for 20 minutes, then “error.” No way to resume from step 9.
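One way to contain the first failure mode, at least for errors thrown inside the process, is to capture each subagent's outcome separately instead of letting one rejection abort the batch. The sketch below is a minimal illustration: fanOut, runSubagent, and SubagentResult are hypothetical names, not taken from the thread. It does not protect against a process-level crash such as an OOM kill, which needs separate worker processes or a durable runtime.

```typescript
// Hypothetical result type: every subagent reports success or failure on its own.
type SubagentResult<T> =
  | { id: string; ok: true; value: T }
  | { id: string; ok: false; error: string };

// Run all subagents to completion. A thrown error is captured per task,
// so one failing subagent does not reject the whole batch.
async function fanOut<T>(
  ids: string[],
  runSubagent: (id: string) => Promise<T>,
): Promise<SubagentResult<T>[]> {
  return Promise.all(
    ids.map(async (id): Promise<SubagentResult<T>> => {
      try {
        return { id, ok: true, value: await runSubagent(id) };
      } catch (err) {
        return { id, ok: false, error: String(err) };
      }
    }),
  );
}
```

The caller can then retry only the entries where ok is false, rather than rerunning the whole fan-out.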

The original poster’s solution was to rewrite jobs as durable execution primitives. That means checkpointing each step: persisting its result to a database and using a runtime that can resume from the last successful checkpoint. DBOS is one option. Temporal is another. The point is that you need a layer that treats agent steps as durable, retryable units of work.

Durable Execution vs. Traditional Job Queues

Durable execution is not the same as a job queue. A job queue (Celery, BullMQ, SQS) gives you retry logic and dead-letter handling. Durable execution gives you workflow-level guarantees: if step 9 fails, you can replay from step 9 without re-running steps 1-8.

Primitive	Failure Boundary	State Persistence	Replay Granularity	Use Case
Job Queue (Celery, SQS)	Job-level	External (you manage)	Entire job	Parallel subagent tasks with no inter-step dependencies
Workflow Engine (Temporal, Prefect)	Step-level	Built-in	Individual step	Multi-agent orchestration with step dependencies
Durable Execution (DBOS, Restate)	Function-level	Transactional	Individual function call	Agent workflows with stateful API calls and LLM steps

Durable execution frameworks treat each function call as a transaction. If the function succeeds, the result is persisted. If it fails, the runtime resumes from the last persisted result rather than from the beginning of the workflow. This is critical for agent workflows where each step might call an external API, run an LLM, or write to a database.
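The checkpoint-and-replay idea can be illustrated without any framework. The following sketch assumes an in-memory Map in place of a database. runStep and CheckpointStore are hypothetical names chosen for illustration, not the API of DBOS, Temporal, or Restate.

```typescript
// In-memory stand-in for the database table a real runtime would use.
type CheckpointStore = Map<string, unknown>;

// Run a step at most once per (workflowId, stepName) pair.
async function runStep<T>(
  store: CheckpointStore,
  workflowId: string,
  stepName: string,
  fn: () => Promise<T>,
): Promise<T> {
  const key = `${workflowId}:${stepName}`;
  if (store.has(key)) {
    // Replay path: the step already succeeded, so return its recorded result.
    return store.get(key) as T;
  }
  // If fn throws, nothing is recorded and a later replay runs the step again.
  const result = await fn();
  store.set(key, result);
  return result;
}
```

Calling the same workflow function again with the same workflowId after a failure skips every step that already has a recorded result and re-executes only from the first step without one.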

Example: Durable Execution Pattern

The following pseudocode is inspired by DBOS TypeScript SDK patterns. Refer to the official DBOS documentation at https://docs.dbos.dev for exact syntax and decorator names.

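// Pseudocode inspired by DBOS TypeScript SDK patterns
// See https://docs.dbos.dev for actual implementation
import { DBOS } from '@dbos-inc/dbos-sdk';

// Hypothetical placeholders, not part of DBOS: use your own API host and LLM client.
const API_BASE = 'https://example.com';
declare const llm: { complete(req: { prompt: string }): Promise<string> };

class ReportAgent {
  @DBOS.workflow()
  static async generateReport(transcriptIds: string[]): Promise<string> {
    const summaries: string[] = [];

    for (const id of transcriptIds) {
      // Each completed step is checkpointed: on recovery, the workflow
      // reuses recorded results instead of re-running earlier iterations
      const transcript = await ReportAgent.fetchTranscript(id);
      const summary = await ReportAgent.summarize(transcript);
      summaries.push(summary);
    }

    return await ReportAgent.combineReports(summaries);
  }

  @DBOS.step()
  static async fetchTranscript(id: string): Promise<string> {
    // API call; retry behavior is configured on the step (see the DBOS docs)
    const res = await fetch(`${API_BASE}/api/transcripts/${id}`);
    // Throw on HTTP errors so a failed response is never checkpointed as a result
    if (!res.ok) throw new Error(`Transcript ${id} request failed with ${res.status}`);
    return await res.text();
  }

  @DBOS.step()
  static async summarize(text: string): Promise<string> {
    // LLM call: a failure here affects this step only, not earlier checkpoints
    return await llm.complete({ prompt: text });
  }

  @DBOS.step()
  static async combineReports(summaries: string[]): Promise<string> {
    // Final aggregation step
    return summaries.join('\n\n');
  }
}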

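The @DBOS.workflow() decorator marks the top-level function as a durable workflow. Each @DBOS.step() is a checkpoint. If summarize() fails on transcript 9, only that step needs to run again, either through step retries (if you configure them) or when the workflow is recovered. The first 8 summaries are already persisted, so no tokens are spent regenerating them and completed steps do not repeat their writes.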

The Build-vs-Buy Calculus

The original poster asks the right question: how many engineer-weeks have you sunk into agent infrastructure versus actual agent logic? The ratio matters because orchestration plumbing is undifferentiated heavy lifting. You are not building a competitive advantage by writing your own retry logic, checkpoint serialization, or progress UI.

When to Build

You have specific durability requirements that existing tools do not support (e.g., custom state serialization, exotic failure modes).
You already run Temporal or Prefect for non-agent workflows and can reuse the infrastructure.
Your team has deep expertise in distributed systems and can maintain the plumbing long-term.

When to Buy

You are spending more than 2 engineer-weeks per quarter on orchestration bugs.
You need human-in-the-loop approvals, progress tracking, or audit logs and do not want to build the UI.
You want to focus on agent logic (prompts, tool calls, domain-specific reasoning) rather than infrastructure.

The original poster chose DBOS because it provides durable execution primitives without requiring a separate workflow engine. The Ask HN thread mentions other tools like Temporal for workflow orchestration, LangSmith for observability, and Braintrust for evals. The decision depends on where your pain is.

What Makes a Tool Worth a Line Item

From the thread and similar conversations, the tools that justify a paid line item share these traits:

Automatic checkpoint recovery: No manual state management, no custom serialization.
Step-level observability: You can see which step failed, what the input was, and what the error message said.
Progress visibility: Users see “processing step 9 of 12” instead of “processing…”
Human-in-the-loop primitives: Built-in approval flows, not ad-hoc Slack webhooks.
Audit trail: Every step, every retry, every decision is logged and queryable.

If you are building these features yourself, you are building a workflow engine. That is fine if it is your core competency. If it is not, you are burning time that could go into better prompts, smarter tool selection, or domain-specific agent logic.

Observability Gaps in Agent Workflows

The original poster mentions “almost no visibility” when a subagent fails. This is the observability gap that kills production agents. Traditional monitoring (logs, metrics, traces) does not map cleanly to agent workflows because the failure mode is not “service down” or “latency spike.” The failure mode is “step 9 returned garbage, step 10 used it as input, step 11 hallucinated, step 12 wrote the hallucination to the database.”

What You Need to See

Step-level input/output: What did the LLM receive? What did it return?
Tool call history: Which tools were invoked, with what arguments, and what did they return?
State snapshots: What was the workflow state before and after each step?
Error context: Not just “API call failed” but “API call to /transcripts/42 failed with 429 after 3 retries.”

Tools that address this gap include LangSmith for LangChain workflows, Braintrust for evals and production traces, and Temporal for workflow executions. If you are building in-house, you need to instrument every step, serialize every state transition, and build a UI to query it.
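For teams instrumenting in-house, the core of step-level observability is a record per step execution. The sketch below is one possible shape under stated assumptions: StepTrace and traced are hypothetical names, and the trace is an in-memory array where a real system would write to a queryable store.

```typescript
// Hypothetical per-step record: enough to answer "what went in, what came out, what failed".
interface StepTrace {
  step: string;
  input: unknown;
  output?: unknown;
  error?: string;
  startedAt: number;
  durationMs: number;
}

// Wrap a step so that both success and failure leave a trace entry.
async function traced<I, O>(
  trace: StepTrace[],
  step: string,
  input: I,
  fn: (input: I) => Promise<O>,
): Promise<O> {
  const startedAt = Date.now();
  try {
    const output = await fn(input);
    trace.push({ step, input, output, startedAt, durationMs: Date.now() - startedAt });
    return output;
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    trace.push({ step, input, error, startedAt, durationMs: Date.now() - startedAt });
    throw err; // Rethrow so the caller's retry or recovery logic still sees the failure.
  }
}
```

Recording the input alongside the error is what makes a downstream failure attributable: if step 10 fails on garbage, the trace for step 9 shows where the garbage came from.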

Partial Failure Recovery Patterns

When an agent fails at step 9 of 12, you have three options:

Retry the step: Re-run step 9 with the same input. Works if the failure was transient (rate limit, network blip).
Replay from checkpoint: Re-run steps 9-12 using the state saved after step 8. Works if steps 9-12 are idempotent.
Restart the workflow: Re-run steps 1-12. Expensive, but sometimes necessary if earlier steps produced bad state.

The right choice depends on whether your steps are idempotent and whether you can afford to re-run expensive LLM calls.
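The first option, retrying a single step on transient failure, is commonly implemented with exponential backoff. The sketch below is a generic illustration, not any framework's retry policy. retryStep, isTransient, and the default values are assumptions made for the example.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry fn while the caller-supplied predicate classifies the error as transient.
async function retryStep<T>(
  fn: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxAttempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Permanent errors and exhausted attempts go to the caller, which can
      // then fall back to replaying from a checkpoint or restarting.
      if (!isTransient(err) || attempt >= maxAttempts) throw err;
      // Delay doubles on each attempt: base, 2x base, 4x base, ...
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

Keeping the transient check as a parameter forces an explicit decision about which errors (for example, a 429 versus a 400) are worth spending another LLM call on.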

In the original poster’s scenario, if step 9 (summarize transcript 9) fails with OOM, durable execution retries only that step. Steps 1-8 (transcripts 1-8 already summarized) are cached. Without durable execution, the entire fan-out restarts, wasting tokens and time. If you have 100 transcripts and step 87 fails, you do not want to re-process transcripts 1-86.

Idempotency Matters

If step 9 writes a row to a database, retrying step 9 should not write a duplicate row. If step 9 writes a summary to S3, retrying step 9 should overwrite the same key, not create a duplicate. This means:

Use upsert semantics (in PostgreSQL, INSERT … ON CONFLICT … DO UPDATE).
Generate deterministic IDs (hash the input, do not use auto-increment).
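Check for existing state before writing (SELECT before INSERT), inside a transaction or backed by a unique constraint so that two concurrent retries cannot both insert.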

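Durable execution frameworks handle much of this by persisting the result of each step. If step 9 already succeeded, the framework returns the cached result instead of re-running the function. The cache only covers completed steps, though: a step that crashes after its side effect but before its result is persisted will run again, so writes inside a step should still be idempotent.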
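The deterministic-ID and upsert rules can be sketched together. This example assumes Node.js for the node:crypto import and uses an in-memory Map as a stand-in for a table with a unique key. deterministicId and SummaryTable are hypothetical names for illustration.

```typescript
import { createHash } from "node:crypto";

// Derive the same ID for the same logical write, so a retry targets the same row.
function deterministicId(workflowId: string, stepName: string, input: string): string {
  return createHash("sha256")
    .update(`${workflowId}\n${stepName}\n${input}`)
    .digest("hex");
}

// In-memory stand-in for a table keyed by a unique ID.
class SummaryTable {
  private rows = new Map<string, string>();

  // Upsert: writing the same ID twice overwrites instead of duplicating.
  upsert(id: string, summary: string): void {
    this.rows.set(id, summary);
  }

  get size(): number {
    return this.rows.size;
  }
}
```

The same ID can double as an S3 object key, so a retried upload overwrites the earlier object rather than creating a second one.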

The Hidden Cost of Ad-Hoc Progress Tracking

The original poster mentions coding progress tracking “ad-hoc.” This is a red flag. Ad-hoc progress tracking means:

Emitting events to a message queue and hoping a consumer updates the UI.
Writing progress percentages to a database and polling from the frontend.
Sending Slack messages with step numbers and hoping someone is watching.
Logging to stdout and grepping for “step 9 complete” in production.

None of these approaches gives you replayability, audit trails, or the ability to answer “what was the state when step 9 failed?” A proper workflow engine persists every state transition to a database. You can query the workflow history, see exactly what happened at each step, and resume from the last successful checkpoint without manual intervention.
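When state transitions are persisted, progress becomes a query over that history instead of a separate side channel. The sketch below assumes a sequential workflow and an in-memory array in place of a database table. Transition and progress are hypothetical names, and the status strings are illustrative.

```typescript
type StepStatus = "started" | "succeeded" | "failed";

// One row per state transition, as an append-only history.
interface Transition {
  workflowId: string;
  step: number;
  status: StepStatus;
  at: number;
}

// Derive a user-facing progress message from the recorded history.
function progress(log: Transition[], workflowId: string, totalSteps: number): string {
  const events = log.filter((e) => e.workflowId === workflowId);
  const last = events[events.length - 1];
  if (last === undefined) return `pending (0 of ${totalSteps})`;
  if (last.status === "failed") return `failed at step ${last.step} of ${totalSteps}`;
  // Count distinct succeeded steps so a retried step is not counted twice.
  const completed = new Set(
    events.filter((e) => e.status === "succeeded").map((e) => e.step),
  ).size;
  if (completed >= totalSteps) return `completed ${totalSteps} of ${totalSteps}`;
  return `processing step ${completed + 1} of ${totalSteps}`;
}
```

Because the message is computed from the same log used for recovery, the UI and the audit trail cannot drift apart the way ad-hoc progress events can.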

Technical Verdict

As rough rules of thumb, use durable execution (DBOS, Temporal, Restate) when:

Your agent workflows have more than 3 steps and any step can fail independently.
You have more than 5 concurrent subagents or fan-out patterns where partial failure is common.
Your step failure rate exceeds 2%, or you spend more than 2 engineer-weeks per quarter debugging cascading errors or building retry logic.
You need to show progress to users and cannot afford to restart from step 1 on every failure.
You need audit trails, human-in-the-loop approvals, or the ability to answer “what was the workflow state when step 9 failed?”

Avoid durable execution when:

Your agent workflows are stateless (every step is independent, no shared state between steps).
You can afford to restart the entire workflow on failure (cheap steps, no external side effects, no expensive LLM calls).
You are still in the prototype phase and do not yet know what your failure modes are.
Your workflows complete in under 10 seconds and failures are rare enough to tolerate full restarts.
You already run a workflow engine (Temporal, Prefect) for other use cases. In that case, reuse it with minimal additional engineering effort instead of adopting a second runtime.

The gap between prototype and production is surviving partial failure, recovering from checkpoints, and showing users what is happening. If you are building this plumbing yourself, you are building a workflow engine. If that is not your core competency, buy the infrastructure and focus on the agent logic.

Source Links

Ask HN: What are your worst war stories bringing agentic applications into prod