Trigger.dev V2: TypeScript Workflow Engine Architecture Without Temporal's Go Runtime

How Trigger.dev builds durable execution, state persistence, and retry logic in TypeScript, competing with Temporal without event sourcing or Go.

Source: trigger.dev

Trigger.dev launched in February 2023 as a Zapier alternative, collected 745 points on Hacker News, then pivoted eight months later to durable execution primitives. The V2 architecture (172 points, October 2023) positions itself as a Temporal competitor for TypeScript developers. It offers workflow guarantees without requiring Go or event sourcing knowledge.

The shift matters because most workflow engines force you into one of two camps: JavaScript task queues with weak durability (BullMQ, Agenda) or strongly consistent systems that require polyglot runtimes (Temporal, Cadence). Trigger.dev tries to split the difference by offering Temporal-style semantics in a TypeScript-native execution model.

What Changed Between V1 and V2

V1 focused on integration connectors and webhook triggers. You wrote event handlers that responded to external services. V2 rebuilt the core around long-running tasks with explicit retry boundaries, state checkpoints, and scheduling primitives.

Key architectural differences:

  • V1: Event-driven handlers, no built-in state persistence, relied on external queues
  • V2: Durable execution model, automatic retries, checkpoint-based state recovery
  • Execution isolation: Moved from shared Node processes to containerized task runners
  • Observability: Added structured tracing with step-level replay and debugging hooks

The pivot came from user feedback. Developers wanted to run multi-hour jobs, handle transient failures gracefully, and debug workflow state without parsing logs. V1’s event model could not guarantee those properties.

State Persistence Without Event Sourcing

Temporal persists every workflow decision as an immutable event log. Replay reconstructs state by re-executing the workflow code against that log. Trigger.dev skips event sourcing and uses explicit checkpoints instead.

How it works:

  1. Checkpoint API: You call checkpoint(key, data) at explicit boundaries in your task code
  2. Postgres storage: Checkpoints write to a relational table with task ID, step index, and serialized state
  3. Retry recovery: On failure, the runtime reloads the last checkpoint and resumes from that step
  4. No replay: The task does not re-execute prior steps; it jumps directly to the failure point

This trades Temporal’s deterministic replay for a simpler mental model. You control when state gets persisted. The downside: if you forget to checkpoint after an expensive operation, a later crash forces you to redo that work.

Checkpoint Example

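// Assumes `task` and `checkpoint` are provided by the Trigger.dev SDK as described above,
// and that `db` (a Prisma-style client) and `stripe` are initialized elsewhere.
export const processOrder = task({
  id: "process-order",
  run: async ({ orderId }: { orderId: string }) => {
    // Fetch order data; fail fast if the order does not exist
    const order = await db.order.findUnique({ where: { id: orderId } });
    if (!order) throw new Error(`Order ${orderId} not found`);

    // Checkpoint after fetch
    await checkpoint("order-fetched", { order });

    // Call payment API (might fail). The idempotency key keeps a retry from
    // charging twice if a crash lands between the charge and the next checkpoint.
    const payment = await stripe.charges.create(
      { amount: order.total, currency: "usd", source: order.paymentToken },
      { idempotencyKey: `charge-${orderId}` },
    );

    // Checkpoint after payment
    await checkpoint("payment-complete", { payment });

    // Update inventory. A plain decrement is not idempotent: a crash after this
    // call but before the task returns would decrement again on retry.
    await db.inventory.decrement({ productId: order.productId });

    return { orderId, paymentId: payment.id };
  },
});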

If the Stripe call fails, the retry starts from the order-fetched checkpoint. The database fetch does not run again. If inventory decrement fails, you restart from payment-complete and skip the charge.
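The resume behavior can be approximated in plain TypeScript. The sketch below is not Trigger.dev's implementation: `CheckpointStore`, `step`, and `attempt` are hypothetical names, and an in-memory array stands in for the Postgres table. It also re-enters the function from the top and skips completed steps by reading stored results, which is one simple way to get the "do not redo prior work" effect.

```typescript
// Hypothetical in-memory stand-in for the Postgres checkpoint table.
type CheckpointRow = { taskId: string; stepIndex: number; key: string; state: string };

class CheckpointStore {
  private rows: CheckpointRow[] = [];

  save(taskId: string, key: string, data: unknown): void {
    const stepIndex = this.rows.filter((r) => r.taskId === taskId).length;
    // undefined is stored as null so the row always holds valid JSON.
    this.rows.push({ taskId, stepIndex, key, state: JSON.stringify(data ?? null) });
  }

  find(taskId: string, key: string): CheckpointRow | undefined {
    return this.rows.find((r) => r.taskId === taskId && r.key === key);
  }
}

// Skips `fn` if a checkpoint for (taskId, key) already exists;
// otherwise runs it and saves the result.
function step<T>(store: CheckpointStore, taskId: string, key: string, fn: () => T): T {
  const row = store.find(taskId, key);
  if (row) return JSON.parse(row.state) as T;
  const result = fn();
  store.save(taskId, key, result);
  return result;
}

// Simulated task: the fetch always succeeds, the charge fails when asked to.
const store = new CheckpointStore();
const calls = { fetch: 0, charge: 0 };

function attempt(chargeFails: boolean): string {
  const order = step(store, "run-1", "order-fetched", () => {
    calls.fetch += 1;
    return { total: 500 };
  });
  const payment = step(store, "run-1", "payment-complete", () => {
    calls.charge += 1;
    if (chargeFails) throw new Error("payment API unavailable");
    return { id: "pay_1", amount: order.total };
  });
  return payment.id;
}
```

Calling `attempt(true)` and then `attempt(false)` runs the fetch once and the charge twice, which mirrors the retry path described above.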

Concurrency and Scheduling Primitives

Trigger.dev exposes three execution modes:

| Mode | Trigger | Concurrency Control | Use Case |
| --- | --- | --- | --- |
| Scheduled | Cron expression | Global limit per task | Nightly ETL, report generation |
| Event-driven | Webhook, SDK call | Queue-based with max workers | User signup flows, API webhooks |
| Realtime | Frontend SDK | Per-connection stream | Live progress updates, chat agents |

Queue Backpressure

Event-driven tasks use a priority queue backed by Postgres. You configure maxConcurrency per task definition. When the queue depth exceeds the worker pool size, new tasks wait in PENDING state.

Backpressure handling:

  • Rate limiting: Exponential backoff for tasks that fail repeatedly
  • Dead letter queue: After N retries, tasks move to a manual review queue
  • Priority override: You can bump specific task instances to the front
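A rough sketch of the retry decision behind the first two bullets. The function name `nextRetry` and the default values (5 retries, 1 second base delay, 60 second cap) are assumptions for illustration, not documented Trigger.dev settings. Jitter is left out to keep the output deterministic.

```typescript
type RetryDecision =
  | { action: "retry"; delayMs: number }
  | { action: "dead-letter" };

// `failures` is the number of failed attempts so far (1 after the first failure).
// maxRetries, baseMs and capMs are illustrative defaults.
function nextRetry(
  failures: number,
  maxRetries = 5,
  baseMs = 1000,
  capMs = 60000,
): RetryDecision {
  // The first attempt plus maxRetries retries have all failed: stop retrying.
  if (failures > maxRetries) return { action: "dead-letter" };
  // Delay doubles with each failure: base, 2x base, 4x base, ... up to the cap.
  const delayMs = Math.min(capMs, baseMs * 2 ** (failures - 1));
  return { action: "retry", delayMs };
}
```

With the defaults, the delays run 1 s, 2 s, 4 s, 8 s, 16 s, and the sixth failure sends the task to the dead letter queue.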

The system avoids a separate distributed lock service to eliminate coordination overhead. The coordinator polls the queue table with SELECT ... FOR UPDATE SKIP LOCKED, so concurrent pollers skip rows that another transaction has already locked. That row lock lasts only as long as the claiming transaction, and the design accepts the risk of occasional duplicate task claims in exchange for simpler infrastructure. Workers run in separate containers and report heartbeats every 30 seconds.
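The claim step might look roughly like the following. The SQL string shows the general Postgres pattern; the table and column names (`task_queue`, `priority`, `enqueued_at`) are hypothetical. The `claimNext` function is an in-memory stand-in for the same logic, added so the ordering and the `maxConcurrency` cap can be exercised without a database.

```typescript
// Hypothetical table and column names; the FOR UPDATE SKIP LOCKED clause is standard Postgres.
const CLAIM_SQL = `
  SELECT id FROM task_queue
  WHERE status = 'PENDING'
  ORDER BY priority DESC, enqueued_at ASC
  LIMIT 1
  FOR UPDATE SKIP LOCKED`;

type QueuedTask = {
  id: string;
  priority: number;
  enqueuedAt: number;
  status: "PENDING" | "RUNNING";
  lockedBy?: string;
};

// In-memory equivalent: claim the highest-priority, oldest PENDING task,
// unless maxConcurrency tasks are already running.
function claimNext(
  queue: QueuedTask[],
  workerId: string,
  maxConcurrency: number,
): QueuedTask | undefined {
  const running = queue.filter((t) => t.status === "RUNNING").length;
  if (running >= maxConcurrency) return undefined; // backpressure: stay PENDING

  const [next] = queue
    .filter((t) => t.status === "PENDING")
    .sort((a, b) => b.priority - a.priority || a.enqueuedAt - b.enqueuedAt);
  if (!next) return undefined;

  next.status = "RUNNING";
  next.lockedBy = workerId;
  return next;
}
```

A priority override, in this model, is just a higher `priority` value on one row.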

Scheduled Task Guarantees

Cron schedules persist in the database with next-run timestamps. A background process scans for due tasks every 10 seconds and enqueues them. If the scheduler crashes, the next instance picks up missed runs on startup.

Missed execution behavior:

  • Default: Skip missed runs, schedule the next occurrence
  • Catchup mode: Enqueue all missed runs sequentially
  • Idempotency key: Prevent duplicate execution if scheduler restarts mid-enqueue

No distributed cron coordination. If you run multiple scheduler instances, you need external leader election (not included).
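The skip and catchup behaviors can be sketched as a pure function. To stay self-contained, the example uses a fixed interval instead of parsing cron expressions, and it assumes that "skip" means enqueueing only the most recent due slot while dropping older ones. The names `dueRuns`, `runKey`, and `enqueueOnce` are hypothetical.

```typescript
type MissedRunPolicy = "skip" | "catchup";

// Returns the run slots to enqueue now and the next-run timestamp to persist.
function dueRuns(
  nextRunAt: number,
  intervalMs: number,
  now: number,
  policy: MissedRunPolicy,
): { enqueue: number[]; nextRunAt: number } {
  if (intervalMs <= 0) throw new Error("intervalMs must be positive");
  const due: number[] = [];
  let slot = nextRunAt;
  while (slot <= now) {
    due.push(slot);
    slot += intervalMs;
  }
  // Assumption: "skip" keeps only the latest due slot; "catchup" keeps all of them in order.
  const enqueue = policy === "catchup" ? due : due.slice(-1);
  return { enqueue, nextRunAt: slot };
}

// One key per schedule and slot, so a restart mid-enqueue cannot add the same run twice.
function runKey(scheduleId: string, slot: number): string {
  return `${scheduleId}:${slot}`;
}

function enqueueOnce(seen: Set<string>, scheduleId: string, slots: number[]): string[] {
  const added: string[] = [];
  for (const slot of slots) {
    const key = runKey(scheduleId, slot);
    if (!seen.has(key)) {
      seen.add(key);
      added.push(key);
    }
  }
  return added;
}
```

In a real deployment the `seen` set would be a unique constraint in the database, not process memory.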

Execution Isolation Model

V2 runs each task in a dedicated Docker container. The runtime spins up a container from a pre-built image that includes your task code, executes the function, then tears down the container.

Isolation boundaries:

  • Filesystem: Ephemeral, wiped after task completion
  • Network: Outbound allowed, inbound blocked except for health checks
  • Memory: Configurable limit (default 512MB), OOM kills trigger retries
  • CPU: Shared, no hard limits unless you configure cgroups

The container runtime uses a sidecar proxy to intercept checkpoint calls and forward them to the coordinator API. Your task code never talks directly to Postgres.

Cold Start Mitigation

Container startup adds 2-5 seconds of latency. Trigger.dev keeps a warm pool of containers for frequently executed tasks. When a task completes, the container stays alive for 60 seconds. If another instance of the same task arrives, it reuses the warm container.

Warm pool sizing:

  • Per-task limit: Max 5 warm containers per task definition
  • Global limit: Max 50 warm containers across all tasks
  • Eviction policy: LRU, with priority boost for tasks with high execution frequency

You can disable warm pools for tasks that need strict isolation or have large memory footprints.
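A minimal model of the pool, assuming the limits quoted above as constructor defaults. `WarmPool` and its method names are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and the frequency-based priority boost is omitted, so eviction here is plain least-recently-idled.

```typescript
type WarmContainer = { containerId: string; taskId: string; idleSince: number };

class WarmPool {
  // Ordered oldest-idle first, assuming `now` never goes backwards.
  private idle: WarmContainer[] = [];
  private readonly perTaskLimit: number;
  private readonly globalLimit: number;
  private readonly ttlMs: number;

  constructor(perTaskLimit = 5, globalLimit = 50, ttlMs = 60000) {
    this.perTaskLimit = perTaskLimit;
    this.globalLimit = globalLimit;
    this.ttlMs = ttlMs;
  }

  // Drop containers that have been idle for the full TTL.
  private expire(now: number): void {
    this.idle = this.idle.filter((c) => now - c.idleSince < this.ttlMs);
  }

  private remove(victim: WarmContainer | undefined): void {
    if (victim) this.idle = this.idle.filter((c) => c !== victim);
  }

  size(now: number): number {
    this.expire(now);
    return this.idle.length;
  }

  // Called when a task finishes: keep the container warm, evicting if a limit is hit.
  release(taskId: string, containerId: string, now: number): void {
    this.expire(now);
    const sameTask = this.idle.filter((c) => c.taskId === taskId);
    if (sameTask.length >= this.perTaskLimit) this.remove(sameTask[0]);
    if (this.idle.length >= this.globalLimit) this.remove(this.idle[0]);
    this.idle.push({ containerId, taskId, idleSince: now });
  }

  // Returns a warm container id for the task, or undefined (cold start needed).
  acquire(taskId: string, now: number): string | undefined {
    this.expire(now);
    const matches = this.idle.filter((c) => c.taskId === taskId);
    const hit = matches[matches.length - 1]; // most recently idled
    this.remove(hit);
    return hit ? hit.containerId : undefined;
  }
}
```

Disabling the warm pool for a task corresponds to never calling `release` for it.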

Observability and Debugging

Every task execution generates a trace with step-level spans. The UI shows:

  • Timeline view: Visual breakdown of checkpoint boundaries and retry attempts
  • State inspector: JSON view of checkpoint payloads at each step
  • Log aggregation: Structured logs with correlation IDs across retries
  • Replay mode: Re-run a failed task from any checkpoint with modified input

Trace IDs propagate through HTTP headers if your task calls external APIs. You can link Trigger.dev traces to OpenTelemetry spans in your own services.
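The article does not say which headers carry the trace ID. The sketch below assumes the W3C Trace Context `traceparent` header, which OpenTelemetry uses by default; the helper names `buildTraceparent` and `parseTraceparent` are hypothetical.

```typescript
const TRACE_ID = /^[0-9a-f]{32}$/;
const SPAN_ID = /^[0-9a-f]{16}$/;
const ALL_ZERO = /^0+$/;

// Format: version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags.
function buildTraceparent(traceId: string, spanId: string, sampled = true): string {
  if (!TRACE_ID.test(traceId) || ALL_ZERO.test(traceId)) throw new Error("invalid trace id");
  if (!SPAN_ID.test(spanId) || ALL_ZERO.test(spanId)) throw new Error("invalid span id");
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

function parseTraceparent(
  header: string,
): { traceId: string; spanId: string; sampled: boolean } | undefined {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return undefined;
  const [, traceId, spanId, flags] = m;
  // The lowest bit of the flags byte is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

A task would attach the built value as the `traceparent` header on outbound requests, and the receiving service would parse it to continue the same trace.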

Failure Mode Visibility

The dashboard flags common failure patterns:

  • Timeout loops: Tasks that hit max duration repeatedly
  • Retry storms: High failure rate across multiple task instances
  • Checkpoint gaps: Long execution spans without checkpoints (risk of wasted retries)
  • Memory leaks: Containers that grow memory usage across warm starts

No automatic remediation. The system surfaces the pattern and lets you decide whether to adjust retry limits, add checkpoints, or refactor task logic.
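As an illustration of the "checkpoint gaps" pattern, a detector could compare checkpoint timestamps against a threshold. This is a guess at the shape of such a check, not the dashboard's actual logic; `findCheckpointGaps` and the threshold value are hypothetical.

```typescript
type Gap = { fromMs: number; toMs: number; durationMs: number };

// Flags every stretch between consecutive marks (start, checkpoints, end)
// that is longer than thresholdMs.
function findCheckpointGaps(
  startedAt: number,
  checkpointTimes: number[],
  endedAt: number,
  thresholdMs: number,
): Gap[] {
  const sorted = [...checkpointTimes].sort((a, b) => a - b);
  const gaps: Gap[] = [];
  let prev = startedAt;
  for (const mark of [...sorted, endedAt]) {
    const durationMs = mark - prev;
    if (durationMs > thresholdMs) gaps.push({ fromMs: prev, toMs: mark, durationMs });
    prev = mark;
  }
  return gaps;
}
```

Each reported gap is a span where a crash would force the retry to repeat that much work.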

Comparison: Trigger.dev vs. Temporal

| Dimension | Trigger.dev | Temporal |
| --- | --- | --- |
| Language | TypeScript only | Polyglot (Go, Java, Python, TypeScript) |
| State model | Explicit checkpoints | Event sourcing with replay |
| Execution | Containerized tasks | Worker processes with sticky queues |
| Scheduling | Cron + event triggers | Timers, signals, child workflows |
| Observability | Built-in UI with replay | Requires external tracing setup |
| Self-hosting | Docker Compose, Kubernetes | Kubernetes with Cassandra/Postgres |
| Learning curve | Low (familiar async/await) | High (workflow determinism rules) |

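Temporal’s deterministic replay makes workflow logic behave as if it ran exactly once, although its activities are still at-least-once by default and need to be idempotent. Trigger.dev guarantees at-least-once execution and relies on idempotency keys. If your task has side effects (API calls, database writes), you must handle deduplication yourself.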
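Deduplication on the receiving side can be as small as remembering which idempotency keys have already been applied. The sketch below uses a hypothetical in-memory `Inventory` class to show how a non-idempotent decrement becomes safe to retry; in a real database the key insert and the decrement would need to share one transaction.

```typescript
class Inventory {
  private stock = new Map<string, number>();
  private applied = new Set<string>();

  setStock(productId: string, qty: number): void {
    this.stock.set(productId, qty);
  }

  getStock(productId: string): number {
    return this.stock.get(productId) ?? 0;
  }

  // Returns true if the decrement was applied, false if this key was already seen.
  decrementOnce(idempotencyKey: string, productId: string, qty = 1): boolean {
    if (this.applied.has(idempotencyKey)) return false; // duplicate delivery: no-op
    const current = this.getStock(productId);
    if (current < qty) throw new Error(`insufficient stock for ${productId}`);
    this.stock.set(productId, current - qty);
    this.applied.add(idempotencyKey);
    return true;
  }
}
```

A task retried after a crash would call `decrementOnce` with the same key, for example one derived from the order ID, and the second call would leave the stock unchanged.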

Deployment Shape

Trigger.dev runs as three services:

  1. Coordinator: Handles task scheduling, queue management, checkpoint storage
  2. Worker pool: Spins up containers and executes task code
  3. API gateway: Exposes SDK endpoints for triggering tasks and querying state

Managed cloud deployment uses AWS ECS for workers and RDS Postgres for state. Self-hosted setup provides Docker Compose and Helm charts.

Resource requirements (self-hosted):

  • Coordinator: 1 CPU, 2GB RAM, scales horizontally
  • Worker pool: 2 CPU, 4GB RAM per worker node, autoscales based on queue depth
  • Postgres: 2 CPU, 8GB RAM, replication recommended for production

No external dependencies beyond Postgres and Docker. No Kafka, no Redis, no Elasticsearch.

Technical Verdict

Use Trigger.dev if:

  • Your entire stack is TypeScript and you want to avoid polyglot workflow engines
  • Tasks run for minutes to hours, not milliseconds
  • You need built-in observability and replay without configuring distributed tracing infrastructure
  • At-least-once execution is acceptable and you can add idempotency keys to side effects
  • You want simpler mental models than event sourcing and deterministic replay
  • Cold start latency of 2-5 seconds is tolerable for your use case

Avoid Trigger.dev if:

  • You need exactly-once guarantees for financial transactions, inventory updates, or other critical state changes
  • Your workflows require complex branching, parallel execution, or saga compensation patterns
  • You already run Temporal and have invested in event sourcing patterns across your organization
  • You need sub-second task latency (container cold starts add unavoidable overhead)
  • Your team works in multiple languages and needs polyglot workflow support
  • You require advanced workflow primitives like signals, queries, or child workflow orchestration

Trigger.dev fits AI agent orchestration, ETL pipelines, and async API workflows where tasks have clear boundaries and idempotency is manageable. It struggles with high-throughput event processing or workflows that need strong consistency across distributed transactions without manual coordination.

The V2 pivot from Zapier-style integrations to workflow primitives appears to have found an audience. The 172-point Show HN and open-source traction suggest developers want TypeScript-native orchestration that does not force them into Go or Java ecosystems. The checkpoint-based state model trades Temporal’s replay guarantees for operational simplicity and a gentler learning curve.