Trigger.dev's Event-Driven Task Architecture: Code-First Orchestration for Long-Running Jobs

How Trigger.dev handles retries, state management, and observability for event-driven background tasks written directly in TypeScript code.

Source: trigger.dev
Trigger.dev positions itself as a developer-first alternative to Zapier, but the distinction matters. Zapier is a no-code workflow builder. Trigger.dev is a background task orchestration platform that embeds directly in your codebase. You write TypeScript functions, annotate them as tasks, and the platform handles execution, retries, observability, and scaling. The architecture matters because it exposes how modern agentic systems manage long-running workflows, state persistence, and failure recovery without forcing you into a visual editor or proprietary DSL.

The platform's launch post on Hacker News drew 745 points and 190 comments, and the team then pivoted from V1 to V2 based on months of developer feedback. V2 repositions Trigger.dev as a “Temporal alternative for TypeScript devs,” which is the more accurate description: it competes with Temporal, Inngest, and Restate, not with Zapier in the no-code automation market.

How Task Execution Works

Trigger.dev tasks are standard TypeScript functions wrapped in a task() call. The platform intercepts execution, routes the task to a worker, and manages the lifecycle. Here’s the execution flow:

  1. Task definition: You define a task in your codebase with an ID and a run function.
  2. Trigger invocation: An event, webhook, or manual call triggers the task.
  3. Worker assignment: The platform routes the task to an available worker (container or serverless function).
  4. Execution: The worker runs your function, capturing logs, errors, and state changes.
  5. Completion or retry: On success, the task completes. On failure, the retry policy kicks in.

The platform decouples task definition from execution. You write code that looks synchronous, but the runtime handles async execution, retries, and observability behind the scenes.

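import { task } from "@trigger.dev/sdk/v3";
// Application helpers assumed to live in your own codebase; "./orders" is a placeholder path.
import { fetchOrder, chargePayment, sendConfirmationEmail } from "./orders";

export const processOrder = task({
  id: "process-order",
  retry: {
    maxAttempts: 3, // Total attempts, including the first run
    factor: 2, // Multiplier for exponential backoff
    minTimeoutInMs: 1000, // Wait before the first retry
    maxTimeoutInMs: 10000, // Upper bound on any single wait
  },
  run: async (payload: { orderId: string }) => {
    const order = await fetchOrder(payload.orderId);
    await chargePayment(order.paymentMethod, order.total);
    await sendConfirmationEmail(order.email);
    return { status: "completed", orderId: order.id };
  },
});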

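The retry configuration uses exponential backoff with jitter. The factor parameter multiplies the wait time after each failed attempt, starting from the minimum timeout and capped at the maximum. Because maxAttempts counts the initial run, three attempts leave room for two retries: if chargePayment() throws an error, the platform waits about 1 second before the second attempt and about 2 seconds (1 × 2) before the third, with jitter shifting the exact values. After the third failure, the task moves to a dead-letter queue (a queue for permanently failed tasks that require manual intervention).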
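The backoff arithmetic can be sketched in a few lines. This is a minimal illustration, not the platform's scheduler: backoffDelays and the BackoffPolicy field names are hypothetical, jitter is left out so the output is deterministic, and the sketch assumes maxAttempts counts the initial run.

```typescript
// Hypothetical helper, not part of the Trigger.dev SDK: computes the wait
// (in milliseconds) before each retry for exponential backoff without jitter.
interface BackoffPolicy {
  maxAttempts: number; // total attempts, including the first run (assumption)
  factor: number; // multiplier applied after each failed attempt
  minTimeoutMs: number; // wait before the first retry
  maxTimeoutMs: number; // upper bound on any single wait
}

function backoffDelays(policy: BackoffPolicy): number[] {
  const waits: number[] = [];
  // N attempts leave room for N - 1 retries, so N - 1 waits.
  for (let retry = 0; retry < policy.maxAttempts - 1; retry++) {
    const raw = policy.minTimeoutMs * Math.pow(policy.factor, retry);
    waits.push(Math.min(raw, policy.maxTimeoutMs));
  }
  return waits;
}

const schedule = backoffDelays({
  maxAttempts: 6,
  factor: 2,
  minTimeoutMs: 1000,
  maxTimeoutMs: 10000,
});
console.log(schedule); // [1000, 2000, 4000, 8000, 10000]
```

The last wait would be 16 seconds uncapped; the maximum timeout clamps it to 10 seconds.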

State Management for Long-Running Workflows

Trigger.dev tasks can run for hours or days. According to the platform documentation, Trigger.dev checkpoints state after each await expression, storing intermediate results in its backend. On retry, the platform skips completed steps and resumes from the failed operation. This is critical for agentic workflows that call multiple APIs, wait for human input, or process large datasets.

import { task } from "@trigger.dev/sdk/v3";
import { search, browse, analyze } from "./agent-tools";

export const researchAgent = task({
  id: "research-agent",
  run: async ({ topic }: { topic: string }) => {
    // Checkpoint 1: Search the web
    const searchResults = await search(topic);
    
    // Checkpoint 2: Fetch full articles
    const articles = await Promise.all(
      searchResults.map(url => browse(url))
    );
    
    // Checkpoint 3: Analyze content
    const analysis = await analyze(articles);
    
    return { summary: analysis, sources: searchResults };
  },
});

If the analyze() call fails, the platform doesn’t re-run search() or browse(). It retries from analyze() with the cached results. This reduces API costs and execution time for workflows with expensive or rate-limited operations.
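The skip-on-retry behavior can be illustrated with a toy model. The sketch below is not Trigger.dev's internal mechanism: CheckpointStore, runWorkflow, and runWithRetry are hypothetical names, and the store lives in memory where a real platform would persist it. It only shows the idea that a step's result is saved once it succeeds, so a re-run returns the saved value instead of calling the step again.

```typescript
// Illustrative only: an in-memory checkpoint store, not the platform's implementation.
class CheckpointStore {
  private results = new Map<string, unknown>();

  // Runs `fn` at most once per stepId; later calls return the saved result.
  async step<T>(stepId: string, fn: () => Promise<T>): Promise<T> {
    if (this.results.has(stepId)) {
      return this.results.get(stepId) as T;
    }
    const value = await fn();
    this.results.set(stepId, value); // saved only after the step succeeds
    return value;
  }
}

interface CallCounts {
  search: number;
  analyze: number;
}

// A two-step workflow whose second step fails on its first call.
async function runWorkflow(store: CheckpointStore, calls: CallCounts): Promise<string> {
  const urls = await store.step("search", async () => {
    calls.search++;
    return ["https://example.com/a", "https://example.com/b"];
  });
  return store.step("analyze", async () => {
    calls.analyze++;
    if (calls.analyze === 1) throw new Error("transient failure");
    return `analyzed ${urls.length} sources`;
  });
}

// Re-runs the whole workflow function; completed steps come from the store.
async function runWithRetry(
  store: CheckpointStore,
  calls: CallCounts,
  maxAttempts: number
): Promise<string> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await runWorkflow(store, calls);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Calling runWithRetry(new CheckpointStore(), { search: 0, analyze: 0 }, 3) resolves on the second attempt, with the search step executed once and the analyze step twice.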

The state model differs from Temporal’s durable execution. Temporal replays the entire function from the beginning, using deterministic execution to skip already-completed steps. Trigger.dev checkpoints at each await, which is simpler but requires you to structure code with explicit async boundaries.

Retry Logic and Failure Modes

The retry system supports three strategies:

  • Exponential backoff: Multiplies the wait time between retries by a configurable factor, doubling it when the factor is 2 (default).
  • Fixed delay: Waits a constant time between retries.
  • Custom function: You provide a function that calculates the next retry delay based on attempt number and error type.

Each task can override the global retry policy. You can also configure per-error-type retries, so transient network errors retry immediately while rate-limit errors wait longer.

Failure Type          Retry Strategy         Max Attempts
Network timeout       Exponential backoff    5
Rate limit (429)      Fixed delay            10
Auth failure (401)    No retry               0
Validation error      No retry               0
Unknown error         Exponential backoff    3
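One way to express such a policy table in code is a lookup from error to rule. This is a sketch under stated assumptions: the TaskError shape, ruleFor, and shouldRetry are hypothetical and are not the SDK's configuration format; the values simply mirror the table.

```typescript
// Hypothetical policy lookup mirroring the failure types in the table above.
type Strategy = "exponential" | "fixed" | "none";

interface RetryRule {
  strategy: Strategy;
  maxAttempts: number;
}

interface TaskError {
  kind: "timeout" | "http" | "validation" | "unknown";
  status?: number; // HTTP status code, when kind is "http"
}

function ruleFor(error: TaskError): RetryRule {
  if (error.kind === "timeout") return { strategy: "exponential", maxAttempts: 5 };
  if (error.kind === "http" && error.status === 429) return { strategy: "fixed", maxAttempts: 10 };
  if (error.kind === "http" && error.status === 401) return { strategy: "none", maxAttempts: 0 };
  if (error.kind === "validation") return { strategy: "none", maxAttempts: 0 };
  // Anything unclassified falls back to a conservative default.
  return { strategy: "exponential", maxAttempts: 3 };
}

// True while the matching rule still allows another attempt.
function shouldRetry(error: TaskError, attemptsSoFar: number): boolean {
  const rule = ruleFor(error);
  return rule.strategy !== "none" && attemptsSoFar < rule.maxAttempts;
}

console.log(shouldRetry({ kind: "http", status: 429 }, 9)); // true
console.log(shouldRetry({ kind: "http", status: 401 }, 0)); // false
```

Keeping classification (ruleFor) separate from the retry decision (shouldRetry) makes it straightforward to add a new failure type without touching the retry loop.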

Dead-letter queues capture tasks that exhaust all retries. You can inspect failed tasks, modify the payload, and manually re-trigger them. For multi-step workflows where one API call succeeds but a downstream operation fails repeatedly, you can examine the checkpointed state to identify whether the issue is data validation, rate limiting, or service availability.

Observability and Debugging

The platform exposes a real-time dashboard that shows task execution traces, logs, and state transitions. Each task run gets a unique ID, and you can drill into individual steps to see input, output, and timing.

The observability layer captures:

  • Execution timeline: When each step started and finished.
  • Logs: Console output from your task function.
  • Errors: Stack traces and error messages.
  • State snapshots: The payload and return value at each checkpoint.
  • Retry history: How many times the task retried and why.

Unlike CloudWatch Logs or Datadog, which show raw output, Trigger.dev’s dashboard shows state snapshots at each checkpoint. You see the exact payload and return value when a step completed or failed. For agentic workflows with multiple LLM calls or API requests, this visibility is critical for debugging latency spikes or unexpected behavior.
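To make the value of per-step traces concrete, here is a small sketch of what a trace record might look like and how it could be queried. The StepTrace shape and both helper functions are assumptions for illustration; they do not describe the dashboard's actual data model.

```typescript
// Hypothetical trace record for one step of a task run.
interface StepTrace {
  name: string;
  startedAtMs: number; // offset from the start of the run
  finishedAtMs: number;
  error?: string; // set only when the step failed
}

function stepDurationMs(step: StepTrace): number {
  return step.finishedAtMs - step.startedAtMs;
}

// Returns the step that took longest, or undefined for an empty trace.
function slowestStep(steps: StepTrace[]): StepTrace | undefined {
  let slowest: StepTrace | undefined = undefined;
  for (const step of steps) {
    if (slowest === undefined || stepDurationMs(step) > stepDurationMs(slowest)) {
      slowest = step;
    }
  }
  return slowest;
}

// Names of the steps that recorded an error.
function failedSteps(steps: StepTrace[]): string[] {
  return steps.filter((s) => s.error !== undefined).map((s) => s.name);
}

const trace: StepTrace[] = [
  { name: "search", startedAtMs: 0, finishedAtMs: 1200 },
  { name: "browse", startedAtMs: 1200, finishedAtMs: 5200 },
  { name: "analyze", startedAtMs: 5200, finishedAtMs: 5900, error: "LLM timeout" },
];

console.log(slowestStep(trace)?.name); // "browse"
console.log(failedSteps(trace)); // ["analyze"]
```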

The platform also supports OpenTelemetry tracing, so you can correlate task execution with external services. If your task calls an LLM API, you can trace the request through your application, Trigger.dev, and the LLM provider.

Deployment and Scaling

Trigger.dev runs tasks in isolated containers or serverless functions, depending on your deployment mode. The hosted version uses AWS Lambda for short tasks and ECS containers for long-running jobs. The self-hosted version supports Docker Compose, Kubernetes, or any container orchestrator.

Each task runs in a separate execution environment, so a memory leak or infinite loop in one task doesn’t affect others. The platform scales workers based on queue depth, spinning up new containers when tasks are waiting and shutting them down when the queue is empty.
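Queue-depth scaling can be reduced to a simple rule of thumb. The function below is a hypothetical sketch, not the platform's autoscaler; tasksPerWorker and maxWorkers are illustrative parameters.

```typescript
// Hypothetical scaling rule: how many workers a given queue depth calls for.
function desiredWorkers(
  queueDepth: number,
  tasksPerWorker: number,
  maxWorkers: number
): number {
  if (queueDepth <= 0) return 0; // empty queue: scale down to zero
  const needed = Math.ceil(queueDepth / tasksPerWorker);
  return Math.min(needed, maxWorkers); // never exceed the configured ceiling
}

console.log(desiredWorkers(0, 5, 20)); // 0
console.log(desiredWorkers(12, 5, 20)); // 3
console.log(desiredWorkers(500, 5, 20)); // 20
```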

Concurrency controls let you limit how many instances of a task can run simultaneously. Concurrency limits prevent overwhelming rate-limited APIs like Stripe’s 100 requests per second threshold or database connection pools with fixed capacity. You can set global concurrency limits or per-task limits.

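import { task } from "@trigger.dev/sdk/v3";
// Email client assumed to live in your own codebase; "./email" is a placeholder path.
import { emailProvider } from "./email";

export const sendEmail = task({
  id: "send-email",
  queue: {
    concurrencyLimit: 10, // Max 10 concurrent executions
  },
  run: async (payload: { to: string; subject: string }) => {
    await emailProvider.send(payload);
  },
});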

The queue system uses a priority model. You can assign priority levels to tasks, and the platform processes high-priority tasks before low-priority ones. This is useful for user-facing workflows that need low latency.
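The ordering rule behind a priority model can be sketched as follows. PriorityRunQueue is a hypothetical in-memory structure, not the platform's queue, and it assumes that a larger number means higher priority and that runs with equal priority are served in arrival order.

```typescript
// Hypothetical in-memory queue: highest priority first, FIFO within a priority.
interface QueuedRun {
  id: string;
  priority: number; // assumption: larger number = more urgent
}

class PriorityRunQueue {
  private runs: QueuedRun[] = [];

  enqueue(id: string, priority: number): void {
    this.runs.push({ id, priority });
  }

  // Removes and returns the id of the next run, or undefined when empty.
  dequeue(): string | undefined {
    if (this.runs.length === 0) return undefined;
    let best = 0;
    for (let i = 1; i < this.runs.length; i++) {
      // Strict comparison keeps the earliest run among equal priorities.
      if (this.runs[i].priority > this.runs[best].priority) best = i;
    }
    return this.runs.splice(best, 1)[0].id;
  }
}

const queue = new PriorityRunQueue();
queue.enqueue("nightly-report", 1);
queue.enqueue("checkout-email", 10);
queue.enqueue("weekly-digest", 1);
console.log(queue.dequeue()); // "checkout-email"
console.log(queue.dequeue()); // "nightly-report"
```

A linear scan keeps the sketch short; a production queue would more likely use a heap or a sorted store.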

Event Routing and Triggers

Trigger.dev supports multiple trigger types:

  • Webhooks: External services POST to a unique URL, and the platform routes the payload to your task.
  • Scheduled tasks: Cron-style schedules trigger tasks at fixed intervals.
  • Manual invocation: You call tasks.trigger() from your application code.
  • Event listeners: Tasks subscribe to internal events (e.g., “user.created”) and run when the event fires.

The event routing system uses a pub/sub model. Multiple tasks can listen to the same event, and the platform fans out execution to all subscribers. If one task fails, it doesn’t block others.
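The fan-out behavior described here, where one failing subscriber does not block the others, can be modeled with a small in-process bus. EventBus is a hypothetical class for illustration only; the real platform dispatches each subscriber as a separate task run.

```typescript
// Hypothetical in-process pub/sub bus illustrating isolated fan-out.
type Handler = (payload: unknown) => Promise<void>;

class EventBus {
  private subscribers = new Map<string, Handler[]>();

  subscribe(event: string, handler: Handler): void {
    const handlers = this.subscribers.get(event) || [];
    handlers.push(handler);
    this.subscribers.set(event, handlers);
  }

  // Runs every subscriber concurrently; a rejection is recorded, not rethrown.
  async publish(
    event: string,
    payload: unknown
  ): Promise<{ succeeded: number; failed: number }> {
    const handlers = this.subscribers.get(event) || [];
    const outcomes = await Promise.all(
      handlers.map((handler) =>
        handler(payload).then(
          () => true,
          () => false
        )
      )
    );
    const succeeded = outcomes.filter((ok) => ok).length;
    return { succeeded, failed: outcomes.length - succeeded };
  }
}
```

With two subscribers on "user.created", one of which rejects, publish resolves to { succeeded: 1, failed: 1 } and the healthy subscriber still runs to completion.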

Event filtering happens at the routing layer. You can specify conditions (e.g., “only trigger if payload.amount > 1000”), and the platform skips execution if the condition isn’t met. This reduces unnecessary task runs and lowers costs.
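A routing-layer filter of this kind can be pictured as a declarative condition evaluated before any task starts. The Condition shape and matches function below are assumptions made for illustration and are not the platform's filter syntax.

```typescript
// Hypothetical declarative filter evaluated before a task run is created.
type Op = "gt" | "lt" | "eq";

interface Condition {
  field: string; // top-level payload key to inspect
  op: Op;
  value: number;
}

function matches(payload: { [key: string]: unknown }, cond: Condition): boolean {
  const actual = payload[cond.field];
  if (typeof actual !== "number") return false; // missing or non-numeric: skip
  if (cond.op === "gt") return actual > cond.value;
  if (cond.op === "lt") return actual < cond.value;
  return actual === cond.value;
}

const largePayment: Condition = { field: "amount", op: "gt", value: 1000 };

console.log(matches({ amount: 1500 }, largePayment)); // true
console.log(matches({ amount: 1000 }, largePayment)); // false
```

Because the check runs on the payload alone, a non-matching event never consumes a worker.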

Technical Verdict

Use Trigger.dev when your tasks average more than 30 seconds of execution time and call three or more external APIs. The checkpoint model saves money on retries by skipping already-completed steps, and the observability layer shows exactly where failures occur in multi-step workflows. It fits teams that write TypeScript and need automatic retry logic with exponential backoff for transient failures like network timeouts or rate limits.

Avoid it if you need sub-100ms latency for high-throughput tasks. The platform adds overhead for state persistence and observability that makes sense for long-running jobs but slows down simple operations. If your tasks complete in under five seconds and rarely fail, a traditional job queue like BullMQ or Celery will be faster and cheaper.

Also skip Trigger.dev if you need complex branching logic with conditional loops or parallel execution with intricate dependencies. Temporal’s replay-based execution model handles that better. Finally, if your team prefers visual workflow editors or needs non-technical users to build workflows, stick with Zapier, n8n, or Retool Workflows.

The platform shines for agentic systems that orchestrate multiple LLM calls, API requests, or data processing steps where each operation is expensive or rate-limited. The open-source model means you can self-host or fork the codebase if the hosted version doesn’t meet your needs.