Trigger.dev V2: What a Temporal Alternative for TypeScript Reveals About Durable Execution Plumbing

Trigger.dev launched in February 2023 as a “developer-first Zapier alternative” and pulled 745 points on Hacker News. Eight months later, the team pivoted to V2 and repositioned as a “Temporal alternative for TypeScript.” That shift exposes a real infrastructure gap: the tooling that works for event-driven automation (webhook fires, function runs, done) breaks down when you need durable execution guarantees for multi-step agent workflows.

The V2 announcement drew 172 points and surfaced a pattern we see across agentic projects. Developers want TypeScript-native orchestration without the operational overhead of running a Temporal cluster, but they need the same guarantees: retries that survive process crashes, state persistence across long-running tasks, and execution resumability after timeouts or failures.

What Changed Between V1 and V2

V1 focused on replacing Zapier’s visual builder with code. You wrote TypeScript functions that responded to webhooks, API events, or schedules. The runtime handled triggers and basic retries, but execution was ephemeral. If a task took longer than the serverless timeout or the process died mid-run, you lost state.

V2 rebuilt the execution model around durable tasks. Each task is a unit of work that can:

Run for hours or days without hitting function timeouts
Survive worker crashes and resume from the last checkpoint
Retry individual steps without re-executing the entire workflow
Persist state at explicit await boundaries

This is the same problem Temporal solves with event sourcing, but Trigger.dev took a different path. Instead of replaying a full event log to reconstruct state, they checkpoint task state to Postgres at each async boundary and use a worker pool to resume execution.

Execution Guarantees and State Persistence

Trigger.dev tasks run in long-lived worker processes, not serverless functions. When you call await inside a task, the runtime serializes the execution context (variables, call stack, pending promises) and writes it to the database. If the worker dies, another worker picks up the task and deserializes the state to continue.

The state persistence boundary is explicit: only at await points. This means:

Synchronous code between awaits runs atomically (no mid-loop checkpoints)
You control granularity by choosing where to await
State size matters because the entire context gets serialized
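To get a feel for the last point, you can run a rough size check on whatever a task keeps in scope at an await. This is a minimal sketch, not part of the Trigger.dev SDK: checkpointBytes is a hypothetical helper, and JSON stands in for the runtime's actual serialization format, so treat the numbers as estimates only.

```typescript
// Hypothetical helper: approximate the size of a checkpoint by JSON-encoding
// the values a task keeps in scope. JSON is a stand-in for the real format.
function checkpointBytes(state: unknown): number {
  const json = JSON.stringify(state);
  return new TextEncoder().encode(json).length; // UTF-8 byte count
}

// A small result object versus a task that keeps extracted text in scope.
const small = { documentId: "doc_1", score: 0.8 };
const large = { documentId: "doc_1", text: "x".repeat(40_000) };

console.log(checkpointBytes(small));
console.log(checkpointBytes(large));
```

If a number like the second one shows up at every await, storing the large value externally and keeping only a reference in scope may be worth considering.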

Here’s what a durable task looks like:

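import { task } from "@trigger.dev/sdk/v3";

// fetchDocument, extractText, analyzeSentiment, and storeResults are
// application helpers assumed to be defined elsewhere.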

export const processDocument = task({
  id: "process-document",
  run: async (payload: { documentId: string }) => {
    // Step 1: Fetch document (checkpoint after this completes)
    const doc = await fetchDocument(payload.documentId);
    
    // Step 2: Extract text (checkpoint after this completes)
    const text = await extractText(doc.url);
    
    // Step 3: Analyze with LLM (checkpoint after this completes)
    const analysis = await analyzeSentiment(text);
    
    // Step 4: Store results (checkpoint after this completes)
    await storeResults(payload.documentId, analysis);
    
    return { documentId: payload.documentId, sentiment: analysis.score };
  },
});

If the worker crashes after step 2, the next worker resumes at step 3 with text already in scope. The runtime doesn’t re-fetch the document or re-extract text.
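The resume behavior can be illustrated with a much simpler technique than serializing a call stack: memoizing each step's result under an ID and skipping completed steps when the function is re-entered. The sketch below is that swapped-in technique, not Trigger.dev's implementation. The names CheckpointStore, step, and demo are hypothetical, and an in-memory Map stands in for Postgres.

```typescript
// Hypothetical in-memory stand-in for a checkpoint table; not the Trigger.dev API.
type CheckpointStore = Map<string, unknown>;

// Run `fn` the first time a step id is seen; afterwards return the saved result.
async function step<T>(
  store: CheckpointStore,
  id: string,
  fn: () => Promise<T>,
): Promise<T> {
  if (store.has(id)) return store.get(id) as T;
  const result = await fn();
  store.set(id, result); // "checkpoint" once the awaited work has completed
  return result;
}

// Records which step bodies actually executed, across both runs.
const executed: string[] = [];

async function processDocument(
  store: CheckpointStore,
  crashAfterExtract: boolean,
): Promise<number> {
  const doc = await step(store, "fetch", async () => {
    executed.push("fetch");
    return { url: "doc://example" };
  });
  const text = await step(store, "extract", async () => {
    executed.push("extract");
    return `text of ${doc.url}`;
  });
  if (crashAfterExtract) throw new Error("simulated worker crash");
  return step(store, "analyze", async () => {
    executed.push("analyze");
    return text.length;
  });
}

// First run dies after step 2; the second run reuses the stored results.
async function demo(): Promise<string[]> {
  const store: CheckpointStore = new Map();
  await processDocument(store, true).catch(() => undefined);
  await processDocument(store, false);
  return executed;
}
```

Calling demo() runs the fetch and extract bodies once each, even though processDocument is entered twice. The second entry finds both results in the store and goes straight to the analyze step.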

Retry Logic and Failure Modes

Trigger.dev distinguishes between three failure types:

Network failures: Transient errors from external APIs. Retries use exponential backoff with jitter. Default is 3 attempts over 10 seconds.
Code errors: Exceptions thrown in your task code. No automatic retry unless you wrap the block in a retry policy. Failures surface in the dashboard with stack traces.
Timeouts: Tasks have a maximum duration (default 1 hour, configurable up to 24 hours). Timeout failures do not retry automatically because they often indicate infinite loops or stuck external calls.

You configure retry behavior per task or per step:

import { task } from "@trigger.dev/sdk/v3";

export const flakeyApiCall = task({
  id: "flakey-api",
  retry: {
    maxAttempts: 5,
    factor: 2,
    minTimeoutInMs: 1000,
    maxTimeoutInMs: 30000,
  },
  run: async (payload: { url: string }) => {
    const result = await fetch(payload.url);
    if (!result.ok) {
      throw new Error(`API returned ${result.status}`);
    }
    return result.json();
  },
});

For agent workflows, this matters because tool calls often hit rate limits or return transient errors. You want retries on network failures but not on validation errors (which need different handling, like asking the LLM to fix the input).
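To show how a backoff policy and an error classification fit together, here is a minimal sketch in plain TypeScript. It is independent of the Trigger.dev SDK: BackoffOptions, backoffDelayMs, withJitter, and isRetryableStatus are hypothetical names, and treating 429 and 5xx as transient is an assumption you would adjust per API.

```typescript
// Hypothetical option names, loosely modelled on the retry block above.
interface BackoffOptions {
  factor: number;     // multiplier applied per attempt
  minDelayMs: number; // delay before the first retry
  maxDelayMs: number; // upper bound on any single delay
}

// Delay before retry number `attempt` (1-based), before jitter is applied.
function backoffDelayMs(attempt: number, opts: BackoffOptions): number {
  const raw = opts.minDelayMs * Math.pow(opts.factor, attempt - 1);
  return Math.min(raw, opts.maxDelayMs);
}

// Full jitter: pick a uniformly random delay in [0, delayMs).
// `random` is injectable so the function can be tested deterministically.
function withJitter(delayMs: number, random: () => number = Math.random): number {
  return Math.floor(random() * delayMs);
}

// Assumed classification: 429 and 5xx are transient; other 4xx are not,
// because a validation error will fail the same way on every retry.
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

const opts: BackoffOptions = { factor: 2, minDelayMs: 1000, maxDelayMs: 30000 };
const schedule = [1, 2, 3, 4, 5, 6].map((n) => backoffDelayMs(n, opts));
console.log(schedule); // [1000, 2000, 4000, 8000, 16000, 30000]
```

The cap takes effect on the sixth attempt, where the uncapped delay would be 32000 ms. A status that isRetryableStatus rejects is better routed back to the LLM for correction than retried.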

Deployment Model and Worker Architecture

Trigger.dev runs on a worker pool architecture, not serverless. You deploy workers as long-running containers (Docker, Kubernetes, or their managed cloud). Each worker:

Polls the task queue for new jobs
Executes tasks in isolated V8 contexts
Checkpoints state to Postgres at await boundaries
Reports metrics and logs to the dashboard

The managed cloud version handles worker scaling, but self-hosted deployments require you to manage the pool size. Workers are stateless (all state lives in Postgres), so you can scale horizontally by adding containers.

Cold starts are not a concern because workers stay warm. This is the opposite of serverless functions, where every invocation might hit a cold start. The trade-off: you pay for idle workers, but you get predictable latency.

Comparison: Trigger.dev vs. Temporal vs. Serverless Queues

Feature	Trigger.dev	Temporal	Serverless Queues (SQS + Lambda)
State persistence	Postgres checkpoints at await	Event sourcing with full replay	None (stateless functions)
Execution model	Long-lived workers	Long-lived workers	Ephemeral functions (15 min max)
Retry granularity	Per-step with configurable policies	Per-activity with replay	Per-message (entire function retries)
Language support	TypeScript only	Go, Java, TypeScript, Python, .NET, PHP	Any (but no durable execution)
Observability	Built-in dashboard with traces	Temporal UI with full history	CloudWatch logs (manual correlation)
Deployment complexity	Docker containers + Postgres	Temporal cluster (6+ services)	Managed (but no durability)
State size scaling	Varies with serialized execution context (typical agent workflows 10-50KB per checkpoint)	Linear with event count (full history)	N/A (no state)

Trigger.dev sits between serverless simplicity and Temporal’s full event sourcing. You get durable execution without running a Temporal cluster, but you lose polyglot support and the ability to replay arbitrary historical workflows.

Agent Orchestration Implications

For agentic workflows, durable execution solves three problems:

Tool call retries: If an agent calls a search API and gets a 429 rate limit, you want to retry that specific tool call without re-running the entire reasoning loop.
Human-in-the-loop: If an agent needs approval before executing a high-risk action, the workflow must pause for hours or days without losing state.
Cost control: If an LLM call costs $0.50 and fails halfway through a 10-step workflow, you don’t want to re-run the first 5 steps (and re-pay for those LLM calls).

Trigger.dev’s checkpoint model handles all three. You await the tool call, the approval webhook, or the LLM response, and the runtime persists state. If anything fails, only the failed step retries.

The limitation: you can’t change task code and replay old executions. Temporal’s event sourcing lets you fix a bug in step 3, redeploy, and replay all in-flight workflows with the new code. Trigger.dev checkpoints serialize the execution context, so code changes only affect new tasks.
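The cost-control point above can be worked through with the article's own numbers: a 10-step workflow at $0.50 per LLM call that fails after 5 completed steps. The sketch below assumes the failed call itself is not billed and that the failure happens exactly once. totalCostCents is a hypothetical helper, and amounts are in cents to avoid floating-point rounding.

```typescript
// Total spend for a workflow that fails once after `completedBeforeFailure`
// steps and is then run to completion.
// Assumes the failed call is not billed and every step costs the same.
function totalCostCents(
  totalSteps: number,
  completedBeforeFailure: number,
  costPerStepCents: number,
  checkpointed: boolean,
): number {
  const firstPass = completedBeforeFailure * costPerStepCents;
  // With checkpoints only the remaining steps run; without them, all steps re-run.
  const stepsOnResume = checkpointed
    ? totalSteps - completedBeforeFailure
    : totalSteps;
  return firstPass + stepsOnResume * costPerStepCents;
}

const withCheckpoints = totalCostCents(10, 5, 50, true);     // 250 + 250 = 500
const withoutCheckpoints = totalCostCents(10, 5, 50, false); // 250 + 500 = 750
console.log({ withCheckpoints, withoutCheckpoints });
```

Under these assumptions a single failure costs an extra $2.50 without checkpoints, which is 50% on top of the $5.00 baseline, and the gap widens the later the failure occurs.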

Observability and Debugging

The Trigger.dev dashboard shows:

Real-time task execution with step-by-step traces
Retry attempts and failure reasons
Execution duration and checkpoint timing
Logs streamed from task code

Each task run gets a unique ID. You can inspect the serialized state at any checkpoint, which helps debug why a task resumed with unexpected variable values.

For agent workflows, this visibility matters because LLM outputs are non-deterministic. If an agent workflow fails on step 7 of 10, you need to see what the LLM returned in step 6 to understand why step 7 broke.

When State Serialization Breaks

Trigger.dev serializes execution state using V8’s structured clone algorithm. This works for JSON-serializable data (objects, arrays, primitives) but fails for:

Open file handles or network sockets
Class instances with methods (only data properties serialize)
Closures that capture non-serializable variables
Streams or iterators

If your task holds a database connection or a WebSocket, it won’t survive a checkpoint. You need to re-establish connections after each await:

import { task } from "@trigger.dev/sdk/v3";

export const streamingTask = task({
  id: "streaming-task",
  run: async (payload: { userId: string }) => {
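    // Avoid: holding a connection in a local variable across awaits.
    // A live connection cannot be serialized when the runtime checkpoints.
    // const db = await connectToDatabase();

    // Prefer: query() is an assumed helper that opens a connection, runs the
    // statement, and releases the connection before returning.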
    const result1 = await query("SELECT * FROM users WHERE id = $1", [payload.userId]);
    
    // Reconnect for next query
    const result2 = await query("SELECT * FROM orders WHERE user_id = $1", [payload.userId]);
    
    return { users: result1, orders: result2 };
  },
});

For agent workflows, this means you can’t keep a persistent connection to a streaming LLM API across checkpoints. Each tool call or LLM request must be a discrete await, and every reconnect adds latency. In a workflow with 10+ sequential tool calls, that per-step overhead compounds. For latency-critical agents, consider batching tool calls or using Temporal’s activity-level granularity instead.
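The serialization limits listed earlier in this section can be observed directly with the structuredClone global (available in Node 17+ and modern browsers). This is a small sketch of the algorithm's general behavior. Session is a hypothetical class, and whether the runtime calls structuredClone in exactly this way is an assumption.

```typescript
// Hypothetical class with one data property and one method.
class Session {
  constructor(public userId: string) {}
  greet(): string {
    return `hello ${this.userId}`;
  }
}

const original = new Session("u_1");
// The clone is a plain object: own data properties only, no prototype chain.
const copy = structuredClone(original) as unknown as Record<string, unknown>;

console.log(copy.userId);             // "u_1": the data property survives
console.log(typeof copy.greet);       // "undefined": methods are not cloned
console.log(copy instanceof Session); // false

// Functions (and therefore closures) cannot be cloned at all.
let cloneErrorName = "none";
try {
  structuredClone({ onDone: () => {} });
} catch (err) {
  cloneErrorName = (err as { name?: string }).name ?? "unknown";
}
console.log(cloneErrorName); // "DataCloneError"
```

The practical consequence is that anything a task needs after an await should be plain data. Behavior such as methods, callbacks, or live handles has to be re-created from that data once the task resumes.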

Self-Hosting vs. Managed Cloud

Trigger.dev offers both. The managed cloud handles worker scaling, Postgres backups, and dashboard hosting. Self-hosting requires:

Postgres (for task state and queue)
Redis (for distributed locks and caching)
Docker or Kubernetes (for worker containers)
Object storage (for large payloads or logs)

The open-source version includes the full runtime and dashboard. You lose automatic scaling and managed observability, but you control data residency and avoid vendor lock-in.

For agent projects with sensitive data (healthcare, finance), self-hosting lets you keep LLM prompts and tool outputs inside your VPC.

Technical Verdict

Use Trigger.dev when:

Your agent calls a search API, gets a 429 rate limit, and you want to retry only that tool call without re-running the LLM reasoning loop that preceded it, preserving the LLM output across retries via per-await checkpoints.
You need human-in-the-loop approval workflows where an agent pauses for hours or days (e.g., waiting for a compliance officer to approve a financial transaction) without losing the conversation context or tool-call history.
Your tasks run longer than serverless timeouts (15 minutes) but complete within hours, not weeks. Examples: document processing pipelines, multi-step research agents, batch data enrichment.
You want built-in observability without stitching together CloudWatch logs. The dashboard shows which step failed, what the LLM returned, and how many retries occurred.
Your team already writes TypeScript and doesn’t need polyglot support. You avoid the operational overhead of running a Temporal cluster (6+ services, event-history pruning).

Avoid Trigger.dev when:

Your agent orchestrates 15+ sequential tool calls (e.g., multi-hop research with web search, document retrieval, summarization, fact-checking). Reconnecting to the database or LLM API after each await adds latency overhead that compounds across steps. Temporal’s activity batching or serverless SQS with Lambda may be faster for high-frequency tool-calling patterns.
You need to replay historical workflows with updated code. If you fix a bug in step 3 and want to reprocess all in-flight tasks with the new logic, Temporal’s event sourcing handles this because it stores discrete events rather than serialized execution state. Trigger.dev checkpoints serialize the execution context, so code changes only affect new tasks.
Your tasks require languages other than TypeScript (Go, Python, Java). Temporal supports polyglot workflows.
Your state includes non-serializable objects (open database connections, class instances with methods, WebSocket handles). You’ll spend time refactoring to reconnect after every checkpoint.
You need sub-second task latency. Worker polling adds overhead, and reconnection patterns compound this. If your agent must respond in under 500ms, consider in-memory state with Redis or a custom event loop.

For multi-step agent workflows requiring tool-call retries and human-in-the-loop pauses, Trigger.dev’s per-step checkpointing outperforms serverless queues but trades replay flexibility for TypeScript simplicity. If your agent needs to call 10 tools, wait for human approval, and resume hours later without losing state, this is the plumbing you need. Just watch for latency creep in high-frequency tool-calling patterns.