Trigger.dev launched in February 2023 as a “developer-first Zapier alternative” and pulled 745 points on Hacker News. Eight months later, the team pivoted to V2 and repositioned as a “Temporal alternative for TypeScript.” That shift exposes a real infrastructure gap: the tooling that works for event-driven automation (webhook fires, function runs, done) breaks down when you need durable execution guarantees for multi-step agent workflows.
The V2 announcement drew 172 points and surfaced a pattern we see across agentic projects. Developers want TypeScript-native orchestration without the operational overhead of running Temporal, but they need the same guarantees: retries that survive process crashes, state persistence across long-running tasks, and execution resumability after timeouts or failures.
What Changed Between V1 and V2
V1 focused on replacing Zapier’s visual builder with code. You wrote TypeScript functions that responded to webhooks, API events, or schedules. The runtime handled triggers and basic retries, but execution was ephemeral. If a task took longer than the serverless timeout or the process died mid-run, you lost state.
V2 rebuilt the execution model around durable tasks. Each task is a unit of work that can:
- Run for hours or days without hitting function timeouts
- Survive worker crashes and resume from the last checkpoint
- Retry individual steps without re-executing the entire workflow
- Persist state at explicit await boundaries
This is the same problem Temporal solves with event sourcing, but Trigger.dev took a different path. Instead of replaying a full event log to reconstruct state, they checkpoint task state to Postgres at each async boundary and use a worker pool to resume execution.
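To make the checkpoint-and-resume idea concrete, here is a toy model in plain TypeScript. It is a sketch under stated assumptions, not Trigger.dev's implementation: an in-memory `Map` stands in for the Postgres checkpoint table, and the names `runWithCheckpoints`, `checkpoints`, and `demo` are hypothetical.

```typescript
// Toy model only: a Map plays the role of the checkpoint table.
type State = Record<string, unknown>;
type Step = { name: string; run: (state: State) => Promise<unknown> };

const checkpoints = new Map<string, State>();

async function runWithCheckpoints(runId: string, steps: Step[]): Promise<State> {
  // Load whatever a previous worker managed to persist for this run.
  const state: State = { ...(checkpoints.get(runId) ?? {}) };
  for (const step of steps) {
    if (step.name in state) continue; // completed before the crash, skip it
    state[step.name] = await step.run(state);
    checkpoints.set(runId, { ...state }); // persist after every completed step
  }
  return state;
}

// Demo: the first "worker" crashes in step 3; the second resumes there.
const calls: string[] = [];
let crashOnce = true;
const steps: Step[] = [
  { name: "fetch", run: async () => { calls.push("fetch"); return "doc"; } },
  { name: "extract", run: async (s) => { calls.push("extract"); return `text of ${s.fetch}`; } },
  {
    name: "analyze",
    run: async (s) => {
      calls.push("analyze");
      if (crashOnce) { crashOnce = false; throw new Error("worker crashed"); }
      return `analysis of ${s.extract}`;
    },
  },
];

async function demo(): Promise<State> {
  try {
    await runWithCheckpoints("run-1", steps);
  } catch {
    // Simulated crash: the results of steps 1 and 2 are still in `checkpoints`.
  }
  return runWithCheckpoints("run-1", steps); // a second worker picks the run up
}
```

Calling `demo()` runs `fetch` and `extract` once each and `analyze` twice, which is the behavior the checkpoint model is meant to provide: only the step that failed is repeated.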
Execution Guarantees and State Persistence
Trigger.dev tasks run in long-lived worker processes, not serverless functions. When a task reaches an await, the runtime serializes the execution context (variables, call stack, pending promises) and writes it to the database. If the worker dies, another worker picks up the task and deserializes the state to continue.
The state persistence boundary is explicit: only at await points. This means:
- Synchronous code between awaits runs atomically (no mid-loop checkpoints)
- You control granularity by choosing where to await
- State size matters because the entire context gets serialized
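Because state size matters, it can help to estimate how much data a task keeps in scope across an await. The sketch below uses JSON as a stand-in for the real serializer, so the numbers are only a rough guide; `serializedSizeBytes`, the two sample contexts, and the storage URL are hypothetical.

```typescript
// Rough size estimate for the variables a task keeps alive across an await.
// JSON is a stand-in here; the runtime's own format and overhead may differ.
function serializedSizeBytes(context: unknown): number {
  return new TextEncoder().encode(JSON.stringify(context)).length;
}

// Hypothetical contexts: holding a whole document in scope versus a reference to it.
const heavy = { documentId: "doc-1", body: "x".repeat(1_000_000) };
const light = { documentId: "doc-1", bodyUrl: "s3://bucket/doc-1.txt" };

console.log("heavy context:", serializedSizeBytes(heavy), "bytes");
console.log("light context:", serializedSizeBytes(light), "bytes");
```

The design point is that passing references (IDs, URLs) between steps keeps each checkpoint small, while carrying full payloads in local variables makes every checkpoint pay for them again.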
Here’s what a durable task looks like:
```typescript
import { task } from "@trigger.dev/sdk/v3";

// fetchDocument, extractText, analyzeSentiment, and storeResults are
// application-defined async helpers (not shown here).
export const processDocument = task({
  id: "process-document",
  run: async (payload: { documentId: string }) => {
    // Step 1: Fetch document (checkpoint after this completes)
    const doc = await fetchDocument(payload.documentId);

    // Step 2: Extract text (checkpoint after this completes)
    const text = await extractText(doc.url);

    // Step 3: Analyze with LLM (checkpoint after this completes)
    const analysis = await analyzeSentiment(text);

    // Step 4: Store results (checkpoint after this completes)
    await storeResults(payload.documentId, analysis);

    return { documentId: payload.documentId, sentiment: analysis.score };
  },
});
```
If the worker crashes after step 2, the next worker resumes at step 3 with text already in scope. The runtime doesn’t re-fetch the document or re-extract text.
Retry Logic and Failure Modes
Trigger.dev distinguishes between three failure types:
- Network failures: Transient errors from external APIs. Retries use exponential backoff with jitter. Default is 3 attempts over 10 seconds.
- Code errors: Exceptions thrown in your task code. No automatic retry unless you wrap the block in a retry policy. Failures surface in the dashboard with stack traces.
- Timeouts: Tasks have a maximum duration (default 1 hour, configurable up to 24 hours). Timeout failures do not retry automatically because they often indicate infinite loops or stuck external calls.
You configure retry behavior per task or per step:
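```typescript
import { task } from "@trigger.dev/sdk/v3";

export const flakeyApiCall = task({
  id: "flakey-api",
  retry: {
    maxAttempts: 5,
    factor: 2, // each wait is twice as long as the previous one
    minTimeoutInMs: 1000, // shortest wait between attempts
    maxTimeoutInMs: 30000, // longest wait between attempts
  },
  run: async (payload: { url: string }) => {
    const result = await fetch(payload.url);
    if (!result.ok) {
      // Throwing marks the attempt as failed so the retry policy applies
      throw new Error(`API returned ${result.status}`);
    }
    return result.json();
  },
});
```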
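To see what a policy like the one above (5 attempts, factor 2, a 1-second floor, a 30-second ceiling) implies in practice, the following sketch computes the wait before each retry. The formula is the common exponential-backoff definition and is an assumption here; the runtime's exact formula and jitter strategy may differ. `backoffDelays` and `withFullJitter` are hypothetical helper names.

```typescript
interface BackoffOptions {
  maxAttempts: number;
  factor: number;
  minTimeoutMs: number;
  maxTimeoutMs: number;
}

// Assumed formula: delay = min(maxTimeoutMs, minTimeoutMs * factor^retryIndex).
function backoffDelays(opts: BackoffOptions): number[] {
  const delays: number[] = [];
  // Waits happen between attempts, so N attempts means N - 1 waits.
  for (let retry = 0; retry < opts.maxAttempts - 1; retry++) {
    delays.push(Math.min(opts.maxTimeoutMs, opts.minTimeoutMs * opts.factor ** retry));
  }
  return delays;
}

// "Full jitter" is one common option: pick a random wait in [0, delay).
function withFullJitter(delayMs: number, random: () => number = Math.random): number {
  return Math.floor(random() * delayMs);
}

const delays = backoffDelays({ maxAttempts: 5, factor: 2, minTimeoutMs: 1000, maxTimeoutMs: 30000 });
console.log(delays); // [1000, 2000, 4000, 8000]
```

With these settings the ceiling is never reached; it only starts to matter once the un-capped delay would exceed 30 seconds, which happens from the sixth wait onward.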
For agent workflows, this matters because tool calls often hit rate limits or return transient errors. You want retries on network failures but not on validation errors (which need different handling, like asking the LLM to fix the input).
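One way to express that split is to tag errors with a kind and retry only the transient ones. This is a minimal sketch in plain TypeScript, not an SDK feature: `ToolError`, `isTransient`, and `callWithRetry` are hypothetical names, and the loop omits backoff delays to stay short.

```typescript
type FailureKind = "transient" | "validation";

// Hypothetical error type that carries its own retry classification.
class ToolError extends Error {
  constructor(message: string, readonly kind: FailureKind) {
    super(message);
  }
}

function isTransient(err: unknown): boolean {
  return err instanceof Error && (err as ToolError).kind === "transient";
}

async function callWithRetry<T>(fn: () => Promise<T>, maxAttempts: number): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Validation and unknown errors go straight back to the caller.
      if (!isTransient(err)) throw err;
      lastError = err; // transient: loop around and try again
    }
  }
  throw lastError;
}
```

A caller can then catch the validation case separately, for example by feeding the error message back to the LLM and asking it to correct the tool input.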
Deployment Model and Worker Architecture
Trigger.dev runs on a worker pool architecture, not serverless. You deploy workers as long-running containers (Docker, Kubernetes, or their managed cloud). Each worker:
- Polls the task queue for new jobs
- Executes tasks in isolated V8 contexts
- Checkpoints state to Postgres at await boundaries
- Reports metrics and logs to the dashboard
The managed cloud version handles worker scaling, but self-hosted deployments require you to manage the pool size. Workers are stateless (all state lives in Postgres), so you can scale horizontally by adding containers.
Cold starts are not a concern because workers stay warm. This is the opposite of serverless functions, where every invocation might hit a cold start. The trade-off: you pay for idle workers, but you get predictable latency.
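The worker-pool idea can be illustrated with a toy in-process version. This is only a sketch: an array stands in for the task queue and `shift()` stands in for an atomic "claim the next job" operation, which a real queue would need to do transactionally. The names `Job`, `worker`, and `drain` are hypothetical.

```typescript
type Job = { id: number; run: () => Promise<string> };

// A worker holds no state of its own: it claims a job, runs it, and repeats.
async function worker(name: string, queue: Job[], log: string[]): Promise<void> {
  for (let job = queue.shift(); job !== undefined; job = queue.shift()) {
    const result = await job.run();
    log.push(`${name} finished job ${job.id}: ${result}`);
  }
}

// Scaling out means starting more workers against the same queue.
async function drain(jobs: Job[], workerCount: number): Promise<string[]> {
  const queue = [...jobs];
  const log: string[] = [];
  const workers = Array.from({ length: workerCount }, (_, i) =>
    worker(`worker-${i + 1}`, queue, log)
  );
  await Promise.all(workers);
  return log;
}
```

Because every worker pulls from the same queue and keeps nothing between jobs, changing `workerCount` changes throughput without changing which jobs get done.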
Comparison: Trigger.dev vs. Temporal vs. Serverless Queues
| Feature | Trigger.dev | Temporal | Serverless Queues (SQS + Lambda) |
|---|---|---|---|
| State persistence | Postgres checkpoints at await | Event sourcing with full replay | None (stateless functions) |
| Execution model | Long-lived workers | Long-lived workers | Ephemeral functions (15 min max) |
| Retry granularity | Per-step with configurable policies | Per-activity with replay | Per-message (entire function retries) |
| Language support | TypeScript only | Go, Java, TypeScript, Python | Any (but no durable execution) |
| Observability | Built-in dashboard with traces | Temporal UI with full history | CloudWatch logs (manual correlation) |
| Deployment complexity | Docker containers + Postgres | Temporal cluster (6+ services) | Managed (but no durability) |
| State size scaling | Varies with serialized execution context (typical agent workflows 10-50KB per checkpoint) | Linear with event count (full history) | N/A (no state) |
Trigger.dev sits between serverless simplicity and Temporal’s full event sourcing. You get durable execution without running a Temporal cluster, but you lose polyglot support and the ability to replay arbitrary historical workflows.
Agent Orchestration Implications
For agentic workflows, durable execution solves three problems:
- Tool call retries: If an agent calls a search API and gets a 429 rate limit, you want to retry that specific tool call without re-running the entire reasoning loop.
- Human-in-the-loop: If an agent needs approval before executing a high-risk action, the workflow must pause for hours or days without losing state.
- Cost control: If an LLM call costs $0.50 and fails halfway through a 10-step workflow, you don’t want to re-run the first 5 steps (and re-pay for those LLM calls).
Trigger.dev’s checkpoint model handles all three. You await the tool call, the approval webhook, or the LLM response, and the runtime persists state. If anything fails, only the failed step retries.
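The cost-control argument is simple arithmetic, sketched below. It assumes every step is a single LLM call at a fixed price, as in the $0.50 example above; `recoveryCost` is a hypothetical helper.

```typescript
// Cost, in dollars, of finishing a workflow after a failure at `failedStep` (1-based).
// Assumption: each step is one LLM call with the same fixed price.
function recoveryCost(
  totalSteps: number,
  failedStep: number,
  costPerStep: number,
  resumeFromCheckpoint: boolean
): number {
  // Without checkpoints the workflow starts over at step 1.
  const firstStepToRun = resumeFromCheckpoint ? failedStep : 1;
  return (totalSteps - firstStepToRun + 1) * costPerStep;
}

const fullRerun = recoveryCost(10, 6, 0.5, false); // 10 steps * $0.50 = $5.00
const resumed = recoveryCost(10, 6, 0.5, true); // steps 6..10 = 5 * $0.50 = $2.50
console.log(`saved per failure: $${(fullRerun - resumed).toFixed(2)}`);
```

The $2.50 difference is exactly the five completed steps that a full rerun would pay for a second time.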
The limitation: you can’t change task code and replay old executions. Temporal’s event sourcing lets you fix a bug in step 3, redeploy, and have in-flight workflows replay their history against the new code, provided the change stays deterministic with respect to that history (Temporal’s versioning APIs exist for this). Trigger.dev checkpoints serialize the execution context, so code changes only affect new tasks.
Observability and Debugging
The Trigger.dev dashboard shows:
- Real-time task execution with step-by-step traces
- Retry attempts and failure reasons
- Execution duration and checkpoint timing
- Logs streamed from task code
Each task run gets a unique ID. You can inspect the serialized state at any checkpoint, which helps debug why a task resumed with unexpected variable values.
For agent workflows, this visibility matters because LLM outputs are non-deterministic. If an agent workflow fails on step 7 of 10, you need to see what the LLM returned in step 6 to understand why step 7 broke.
When State Serialization Breaks
Trigger.dev serializes execution state using V8’s structured clone algorithm. This works for plain data (objects, arrays, primitives, and built-ins such as Date, Map, and Set) but fails or loses information for:
- Open file handles or network sockets
- Class instances with methods (only data properties serialize)
- Closures that capture non-serializable variables
- Streams or iterators
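The class-instance and closure cases in the list above can be reproduced with the global `structuredClone` function, which implements the structured clone algorithm in Node 17+ and modern browsers. This is a sketch of the algorithm's general behavior, not of Trigger.dev's internals; `DbClient` and its host name are hypothetical.

```typescript
class DbClient {
  constructor(public host: string) {}
  ping(): string {
    return `pong from ${this.host}`;
  }
}

const original = new DbClient("db.internal");
const copy = structuredClone(original);

// Own data properties survive the round trip...
const hostSurvived = copy.host === "db.internal";
// ...but the prototype, and with it every method, does not.
const isStillDbClient = copy instanceof DbClient;
const hasPing = typeof (copy as unknown as Record<string, unknown>).ping === "function";

// Functions cannot be cloned at all: this call throws.
let closureCloneFailed = false;
try {
  structuredClone({ onDone: () => original.ping() });
} catch {
  closureCloneFailed = true;
}

console.log({ hostSurvived, isStillDbClient, hasPing, closureCloneFailed });
```

The practical consequence is that a value can look intact after a resume (its fields are there) while calls to its methods fail, which is why plain data objects are the safer thing to keep in scope across an await.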
If your task holds a database connection or a WebSocket, it won’t survive a checkpoint. You need to re-establish connections after each await:
```typescript
import { task } from "@trigger.dev/sdk/v3";

// `query` is assumed to be an application-defined helper (not shown) that opens
// a connection, runs one statement, and closes the connection before returning.
export const streamingTask = task({
  id: "streaming-task",
  run: async (payload: { userId: string }) => {
    // Bad: a connection held in a local variable cannot be serialized,
    // so it would not survive the next checkpoint.
    // const db = await connectToDatabase();

    // Good: each query opens and closes its own connection
    const result1 = await query("SELECT * FROM users WHERE id = $1", [payload.userId]);

    // A fresh connection is opened for the next query
    const result2 = await query("SELECT * FROM orders WHERE user_id = $1", [payload.userId]);

    return { users: result1, orders: result2 };
  },
});
```
For agent workflows, this means you can’t keep a persistent connection to a streaming LLM API across checkpoints. Each tool call or LLM request must be a discrete await. Reconnecting after every await adds latency at each step, and across 10 or more sequential tool calls that overhead compounds. For latency-critical agents, consider batching tool calls or using Temporal’s activity-level granularity instead.
Self-Hosting vs. Managed Cloud
Trigger.dev offers both. The managed cloud handles worker scaling, Postgres backups, and dashboard hosting. Self-hosting requires:
- Postgres (for task state and queue)
- Redis (for distributed locks and caching)
- Docker or Kubernetes (for worker containers)
- Object storage (for large payloads or logs)
The open-source version includes the full runtime and dashboard. You lose automatic scaling and managed observability, but you control data residency and avoid vendor lock-in.
For agent projects with sensitive data (healthcare, finance), self-hosting lets you keep LLM prompts and tool outputs inside your VPC.
Technical Verdict
Use Trigger.dev when:
- Your agent calls a search API, gets a 429 rate limit, and you want to retry only that tool call without re-running the LLM reasoning loop that preceded it, preserving the LLM output across retries via per-await checkpoints.
- You need human-in-the-loop approval workflows where an agent pauses for hours or days (e.g., waiting for a compliance officer to approve a financial transaction) without losing the conversation context or tool-call history.
- Your tasks run longer than serverless timeouts (15 minutes) but complete within hours, not weeks. Examples: document processing pipelines, multi-step research agents, batch data enrichment.
- You want built-in observability without stitching together CloudWatch logs. The dashboard shows which step failed, what the LLM returned, and how many retries occurred.
- Your team already writes TypeScript and doesn’t need polyglot support. You avoid the operational overhead of running a Temporal cluster (6+ services, event-history pruning).
Avoid Trigger.dev when:
- Your agent orchestrates 15+ sequential tool calls (e.g., multi-hop research with web search, document retrieval, summarization, fact-checking). Reconnecting to the database or LLM API after each await adds latency that compounds across steps. Grouping several calls into a single Temporal activity, or using SQS with Lambda, may be faster for high-frequency tool-calling patterns.
- You need to replay historical workflows with updated code. If you fix a bug in step 3 and want to reprocess all in-flight tasks with the new logic, Temporal’s event sourcing handles this (within its determinism and versioning rules) because it stores discrete events rather than serialized execution state. Trigger.dev checkpoints serialize the execution context, so code changes only affect new tasks.
- Your tasks require languages other than TypeScript (Go, Python, Java). Temporal supports polyglot workflows.
- Your state includes non-serializable objects (open database connections, class instances with methods, WebSocket handles). You’ll spend time refactoring to reconnect after every checkpoint.
- You need sub-second task latency. Worker polling adds overhead, and reconnection patterns compound this. If your agent must respond in under 500ms, consider in-memory state with Redis or a custom event loop.
For multi-step agent workflows requiring tool-call retries and human-in-the-loop pauses, Trigger.dev’s per-step checkpointing outperforms serverless queues but trades replay flexibility for TypeScript simplicity. If your agent needs to call 10 tools, wait for human approval, and resume hours later without losing state, this is the plumbing you need. Just watch for latency creep in high-frequency tool-calling patterns.