Swarm: Linux Process Isolation for Multi-Agent Coding Systems

Most multi-agent coding systems share memory and state across agents. This creates race conditions, resource contention, and cascading failures. Swarm takes a different approach. It treats each coding agent as a separate Linux process with explicit OS-level boundaries. No shared heap, no cross-contamination, no need for application-level state management.

The tool appeared on Show HN with modest traction (5 points, 3 comments) but the architecture is worth examining. Instead of building yet another agent framework with custom isolation logic, Swarm delegates to Linux primitives that container runtimes have relied on in production for years. Namespaces prevent agents from seeing each other’s processes. Cgroups enforce CPU and memory limits. Filesystem isolation (via overlayfs or bind mounts) lets agents work on the same codebase without conflicts.

This is orchestration through OS primitives, not application code. The coordinator spawns agents as child processes, sets up their isolation boundaries, and communicates over IPC. When an agent crashes, the kernel reclaims its memory, file descriptors, and namespace. When an agent hangs, the coordinator kills it and the kernel performs the same cleanup. Either way, the coordinator just restarts it.

Architecture: OS-Level Isolation Primitives

Swarm uses three Linux features to isolate agents:

Isolation Layer	Mechanism	Purpose	Failure Mode
Process	PID namespaces	Agents cannot see or signal each other’s processes	A stray kill or signal stays inside the agent’s own namespace
Resources	cgroups v2	CPU shares, memory limits, I/O bandwidth per agent	OOM killer terminates agent, not system
Filesystem	overlayfs or bind mounts	Each agent sees its own workspace copy	Write conflicts isolated to agent’s layer

PID namespaces give each agent its own process tree. From the agent’s perspective, it is PID 1. It cannot enumerate or send signals to processes outside its namespace. This prevents agents from interfering with each other or the coordinator.

Cgroups enforce resource quotas. The coordinator creates a cgroup for each agent and sets limits (e.g., 2GB RAM, 50% of one CPU). If an agent’s memory use reaches its limit and the kernel cannot reclaim enough pages, the OOM killer terminates a process inside that cgroup rather than anywhere else on the system. The coordinator detects the exit and restarts the agent. Other agents keep running.
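The limits described above map onto two cgroup v2 interface files, memory.max and cpu.max. The following is a minimal sketch of how the values could be formatted and written. The helper names (cpu_max_value, memory_max_value, write_limits) are hypothetical and not taken from Swarm, and the sketch assumes the cgroup directory already exists with the memory and cpu controllers enabled.

```python
from pathlib import Path
from typing import Optional


def cpu_max_value(cpu_fraction: float, period_us: int = 100_000) -> str:
    """Format a cpu.max value as '<quota_us> <period_us>'.

    A fraction of 0.5 means half of one CPU; values above 1.0 allow
    more than one CPU's worth of time per period.
    """
    if cpu_fraction <= 0:
        raise ValueError("cpu_fraction must be positive")
    return f"{round(cpu_fraction * period_us)} {period_us}"


def memory_max_value(limit_bytes: Optional[int]) -> str:
    """Format a memory.max value in bytes; None means no limit ('max')."""
    if limit_bytes is None:
        return "max"
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return str(limit_bytes)


def write_limits(cgroup_dir: Path, limit_bytes: Optional[int], cpu_fraction: float) -> None:
    """Write both limits into an existing cgroup v2 directory."""
    (cgroup_dir / "memory.max").write_text(memory_max_value(limit_bytes))
    (cgroup_dir / "cpu.max").write_text(cpu_max_value(cpu_fraction))
```

On a real system cgroup_dir would be a directory under /sys/fs/cgroup and the call needs sufficient privileges. Keeping the formatting separate from the write makes the values easy to check without touching the kernel interface.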

Filesystem isolation is the trickiest part. Multiple agents need to read the same codebase but write different files. Swarm uses overlayfs to create a layered filesystem. The base layer is the original codebase (read-only). Each agent gets its own upper layer (read-write). Changes in the upper layer shadow the base layer. Agents see their own modifications without affecting the base or other agents.

IPC Plumbing: Pipes and Sockets for Coordination

Agents communicate with the coordinator over Unix domain sockets or anonymous pipes. The coordinator creates the channel first (a socket pair or a pair of pipes), then spawns the agent with fork() and exec() inside its isolation boundaries so that the child inherits one end. The result is a private, bidirectional link per agent.

The protocol is line-delimited JSON. Agents send requests (read file, write file, run tool, query LLM). The coordinator validates the request against the agent’s capability set, executes it, and sends back a response.
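To make the framing concrete, here is a minimal sketch of line-delimited JSON encoding and decoding. The message fields ("id", "op", "args") and the names encode_message and LineDecoder are assumptions for illustration; the source does not specify Swarm's actual message schema.

```python
import json


def encode_message(message: dict) -> bytes:
    """Serialize one message as a single newline-terminated line.

    json.dumps escapes newlines inside strings, so the only raw
    newline in the output is the terminator.
    """
    return json.dumps(message, separators=(",", ":")).encode("utf-8") + b"\n"


class LineDecoder:
    """Reassemble complete messages from arbitrary byte chunks."""

    def __init__(self) -> None:
        self._buffer = b""

    def feed(self, data: bytes) -> list:
        """Add received bytes and return every message completed so far."""
        self._buffer += data
        # Everything after the last newline is an incomplete line; keep it.
        *lines, self._buffer = self._buffer.split(b"\n")
        return [json.loads(line) for line in lines if line.strip()]
```

The decoder matters because a stream socket or pipe does not preserve message boundaries: one read may return half a request or several requests at once.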

# Reference implementation showing isolation setup pattern
# Requires root, cgroup v2, unshare (util-linux) and cgexec (libcgroup-tools,
# in a build with cgroup v2 support)
import os
import subprocess
import sys


def spawn_agent(agent_id, cgroup_path, workspace_path):
    # Create cgroup for resource limits. The memory and cpu controllers must
    # already be enabled in the parent's cgroup.subtree_control, otherwise
    # memory.max and cpu.max do not exist.
    cgroup_dir = f"/sys/fs/cgroup/{cgroup_path}"
    os.makedirs(cgroup_dir, exist_ok=True)
    with open(f"{cgroup_dir}/memory.max", "w") as f:
        f.write("2G")
    with open(f"{cgroup_dir}/cpu.max", "w") as f:
        f.write("50000 100000")  # 50 ms per 100 ms period = 50% of one CPU

    # Set up overlayfs for workspace isolation
    upper = f"/tmp/agent-{agent_id}/upper"
    work = f"/tmp/agent-{agent_id}/work"
    merged = f"/tmp/agent-{agent_id}/merged"
    for path in (upper, work, merged):
        os.makedirs(path, exist_ok=True)

    # check=True raises CalledProcessError if the mount fails, instead of
    # silently starting the agent in an empty, unmounted directory
    subprocess.run([
        "mount", "-t", "overlay", "overlay",
        "-o", f"lowerdir={workspace_path},upperdir={upper},workdir={work}",
        merged
    ], check=True)

    # Spawn agent in new PID namespace with cgroup.
    # --kill-child makes unshare kill the namespace's init process (and with
    # it every process in the namespace) when unshare itself is killed.
    proc = subprocess.Popen(
        ["cgexec", "-g", f"memory,cpu:{cgroup_path}",
         "unshare", "--pid", "--fork", "--kill-child", "--mount-proc",
         sys.executable, "agent.py", str(agent_id)],  # argv entries must be strings
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        cwd=merged
    )

    return proc

The coordinator multiplexes requests from multiple agents. Each agent has its own socket. The coordinator uses select() or epoll to wait for messages from any agent, dispatches the request, and sends the response back to the originating socket.
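The multiplexing loop can be sketched with Python's selectors module, which picks epoll on Linux where available. This is an illustrative sketch rather than Swarm's code: AgentChannel, poll_agents, and the handler signature are hypothetical names, and responses are written with a blocking sendall for simplicity.

```python
import json
import selectors
import socket


class AgentChannel:
    """Per-agent state stored alongside the coordinator's end of the socket."""

    def __init__(self, agent_id: str) -> None:
        self.agent_id = agent_id
        self.buffer = b""


def poll_agents(selector, handler, timeout: float = 1.0) -> int:
    """Service every agent socket that is readable right now.

    handler(agent_id, request) returns the response dict.
    Returns the number of requests answered.
    """
    answered = 0
    for key, _mask in selector.select(timeout):
        sock, channel = key.fileobj, key.data
        chunk = sock.recv(4096)
        if not chunk:  # agent closed its end of the socket
            selector.unregister(sock)
            sock.close()
            continue
        channel.buffer += chunk
        # Keep any trailing partial line for the next read.
        *lines, channel.buffer = channel.buffer.split(b"\n")
        for line in lines:
            if not line.strip():
                continue
            response = handler(channel.agent_id, json.loads(line))
            # Reply on the same socket the request arrived on.
            sock.sendall(json.dumps(response).encode("utf-8") + b"\n")
            answered += 1
    return answered
```

A coordinator would register one socket per agent with selector.register(sock, selectors.EVENT_READ, AgentChannel(agent_id)) and call poll_agents in a loop. Because each response goes back on the socket the request came from, no agent ever sees another agent's traffic.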

Agents cannot communicate directly. If Agent A needs to coordinate with Agent B, it sends a message to the coordinator requesting coordination. The coordinator brokers the interaction. This prevents agents from bypassing capability checks or creating hidden communication channels.

Filesystem Isolation: Overlayfs for Parallel Execution

The overlayfs setup is critical. Without it, agents would either:

Work on separate copies of the codebase (wasting disk space and making merge painful)
Share a single workspace (creating write conflicts and race conditions)

Overlayfs solves this with copy-on-write semantics. The base layer is the original codebase. When an agent reads a file, it sees the base layer. When it writes a file, the kernel creates a copy in the agent’s upper layer. Subsequent reads see the modified version. Other agents still see the base layer.

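When an agent finishes its task, the coordinator inspects the upper layer to see what changed. If the changes are valid (pass tests, meet acceptance criteria), the coordinator merges them into the base layer. If not, it discards the upper layer and restarts the agent. Overlayfs leaves the behavior of a mounted overlay undefined if its lower layer changes underneath it, so the merge has to wait until the overlays built on that base are unmounted, or the remaining agents have to be restarted on fresh mounts afterward.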
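Inspecting an upper layer amounts to walking its directory tree. The sketch below classifies each entry as added, modified, or deleted; the function name upper_layer_changes is hypothetical. It relies on the overlayfs convention that a deleted file appears in the upper layer as a whiteout, a character device with device number 0/0. It does not handle opaque directories, and "modified" only means the file was copied up, which can also happen for metadata-only changes.

```python
import os
import stat
from pathlib import Path


def upper_layer_changes(upper: Path, base: Path) -> dict:
    """Classify every non-directory entry in an overlayfs upper layer."""
    changes = {"added": [], "modified": [], "deleted": []}
    for root, _dirs, files in os.walk(upper):
        for name in files:
            path = Path(root) / name
            rel = path.relative_to(upper).as_posix()
            info = path.lstat()
            if stat.S_ISCHR(info.st_mode) and info.st_rdev == 0:
                # Whiteout: the agent deleted a file that exists in the base.
                changes["deleted"].append(rel)
            elif (base / rel).exists():
                changes["modified"].append(rel)
            else:
                changes["added"].append(rel)
    return {kind: sorted(paths) for kind, paths in changes.items()}
```

The result gives the coordinator a concrete change set to run tests against before deciding whether to merge or discard the layer.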

This is similar to how Docker containers layer filesystems, but applied to coding agents instead of application deployments.

Capability Boundaries and Security

Swarm does not sandbox agents in the security sense. Agents run as the user, with full access to the filesystem within their namespace and the network. The isolation prevents accidents (agent A overwrites agent B’s files) and limits blast radius (agent crash does not kill the coordinator), but it does not stop malicious code.

The coordinator enforces capability boundaries at the IPC layer. Each agent declares its required capabilities when it starts:

Filesystem read/write (scoped to workspace)
Network access (allow/deny)
Tool invocations (which external commands it can run)
LLM API calls (which models it can query)

The coordinator checks every request against the agent’s capability set. If an agent tries to read a file outside its workspace or run a tool it did not declare, the coordinator rejects the request.

This is a whitelist model. Agents cannot escalate privileges or discover new capabilities at runtime. If an agent needs a new capability, it must restart with an updated declaration.
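A capability check of this kind can be sketched as a frozen set of declarations plus a deny-by-default lookup. The class name CapabilitySet, the operation names ("read_file", "write_file", "run_tool", "query_llm", "network"), and the request shape are all assumptions for illustration; the source does not describe Swarm's actual schema.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CapabilitySet:
    """Capabilities an agent declared at startup; immutable afterward."""

    workspace: Path
    allow_network: bool = False
    tools: frozenset = frozenset()
    models: frozenset = frozenset()

    def allows(self, request: dict) -> bool:
        op = request.get("op")
        args = request.get("args", {})
        if op in ("read_file", "write_file"):
            return self._inside_workspace(args.get("path", ""))
        if op == "run_tool":
            return args.get("tool") in self.tools
        if op == "query_llm":
            return args.get("model") in self.models
        if op == "network":
            return self.allow_network
        return False  # anything not explicitly declared is denied

    def _inside_workspace(self, raw_path: str) -> bool:
        # Resolve ".." segments and symlinks before comparing, so that
        # "../secret" and absolute paths cannot escape the workspace.
        root = self.workspace.resolve()
        target = (root / raw_path).resolve()
        return target == root or root in target.parents
```

Making the dataclass frozen mirrors the rule that capabilities cannot change at runtime: granting a new one means constructing a new set, which corresponds to restarting the agent with an updated declaration.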

Failure Modes and Recovery

Agents fail in predictable ways:

Memory exhaustion: The cgroup memory limit triggers the OOM killer. The kernel sends SIGKILL to the agent. The coordinator sees the agent terminate abnormally (killed by a signal rather than exiting with a status code), logs the failure, and restarts the agent with the same task.
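Two signals help a coordinator tell an OOM kill from other failures: the way the child terminated and the oom_kill counter in the cgroup's memory.events file. The sketch below shows both; describe_exit and oom_kill_count are hypothetical helper names. Python's subprocess reports a child killed by signal N as return code -N. When the agent runs under wrapper commands, the observed exit may be the wrapper's, so the memory.events counter is likely the more reliable indicator.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a human-readable cause."""
    if returncode == 0:
        return "exited cleanly"
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:  # signal number without a symbolic name
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def oom_kill_count(memory_events_text: str) -> int:
    """Extract the oom_kill counter from the contents of memory.events."""
    for line in memory_events_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "oom_kill":
            return int(value)
    return 0
```

Comparing the counter before and after an agent run shows whether the kernel killed anything in that cgroup for exceeding its memory limit.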

CPU starvation: If an agent enters an infinite loop, it consumes its CPU quota but cannot starve other agents. The coordinator sets a wall-clock timeout (e.g., 5 minutes). If the agent does not complete its task, the coordinator kills it and retries.
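The timeout-and-retry behavior can be sketched with subprocess alone. The function name run_with_timeout and the retry count are illustrative assumptions. Note that kill() only signals the direct child; cleaning up a whole process tree depends on the namespace setup described elsewhere in this article.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s: float, max_attempts: int = 2):
    """Run a command, killing and retrying it if it exceeds timeout_s.

    Returns (returncode, attempts). returncode is None if every
    attempt timed out.
    """
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.Popen(argv)
        try:
            return proc.wait(timeout=timeout_s), attempt
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL: a stuck agent cannot ignore it
            proc.wait()  # reap the child so it does not linger as a zombie
    return None, max_attempts


if __name__ == "__main__":
    # A task that sleeps longer than its wall-clock budget is killed and retried.
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(stuck, timeout_s=0.5))
```

The wall-clock budget is independent of the CPU quota: an agent throttled by cpu.max still burns real time, and the timeout is what bounds it.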

Filesystem conflicts: If two agents try to modify the same file, their changes go to separate overlayfs layers. The coordinator detects the conflict when merging and either picks one version, merges them (if possible), or asks a human.
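Detecting such conflicts reduces to finding paths that appear in more than one agent's change set. A minimal sketch follows, assuming the coordinator has already collected the set of changed paths per agent; the function name find_conflicts and the input shape are hypothetical.

```python
from collections import defaultdict


def find_conflicts(changes_by_agent: dict) -> dict:
    """Map each path written by more than one agent to those agents.

    changes_by_agent maps an agent ID to the set of relative paths
    found in that agent's upper layer.
    """
    writers = defaultdict(set)
    for agent_id, paths in changes_by_agent.items():
        for path in paths:
            writers[path].add(agent_id)
    return {
        path: sorted(agents)
        for path, agents in writers.items()
        if len(agents) > 1
    }
```

Paths absent from the result were touched by a single agent and can be merged without review. The remaining ones need one of the resolution strategies above.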

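Process hangs: The coordinator monitors agent liveness through the IPC socket. If an agent stops responding, the coordinator kills the agent’s namespace init process. When PID 1 of a PID namespace dies, the kernel sends SIGKILL to every other process in that namespace, so no child processes are left behind. The coordinator then restarts the agent.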
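Liveness monitoring can be sketched as a table of last-seen timestamps, updated whenever a message arrives on an agent's socket. The class name LivenessTracker and the idea of an injectable clock are assumptions for illustration; the source does not say how Swarm tracks liveness internally.

```python
import time


class LivenessTracker:
    """Track when each agent was last heard from."""

    def __init__(self, timeout_s: float, clock=time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so tests need not sleep
        self._last_seen = {}

    def record(self, agent_id: str) -> None:
        """Call whenever any message arrives from the agent."""
        self._last_seen[agent_id] = self._clock()

    def forget(self, agent_id: str) -> None:
        """Call when an agent exits or is about to be restarted."""
        self._last_seen.pop(agent_id, None)

    def stale_agents(self) -> list:
        """Return agents silent for longer than timeout_s."""
        now = self._clock()
        return sorted(
            agent_id
            for agent_id, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```

Using a monotonic clock avoids false alarms when the system's wall-clock time is adjusted while agents are running.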

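The coordinator itself is a single point of failure. If it crashes, every agent loses its IPC channel and can make no further progress. Linux does not kill children when their parent exits: orphaned agents are reparented and keep running until they notice the closed channel, unless they were started with a parent-death signal (prctl(PR_SET_PDEATHSIG)). Swarm does not currently implement coordinator failover, so this is a hard stop. For production use, you would need a supervisor (systemd, Kubernetes) to restart the coordinator.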

Observability: Tracing Agent Execution

Each agent writes logs to a separate file (one per agent ID). The coordinator aggregates these logs and correlates them with IPC messages. You can trace a request from coordinator to agent and back.

The coordinator also exposes metrics:

Agent spawn/exit counts
CPU and memory usage per agent (from cgroups)
IPC message latency (time from request to response)
Filesystem layer sizes (how much each agent wrote)

These metrics feed into Prometheus or a similar system. You can alert on agents that consume too much memory, take too long to respond, or write excessive data.
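The per-agent CPU and memory figures can be read straight from cgroup v2 interface files: memory.current holds a single byte count and cpu.stat holds one "key value" pair per line. The sketch below parses both; the function names and the returned field names are hypothetical, and exporting the values to a metrics system is left out.

```python
from pathlib import Path


def parse_flat_keyed(text: str) -> dict:
    """Parse cgroup 'key value' lines into a dict of integers."""
    result = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            result[key] = int(value)
    return result


def read_agent_usage(cgroup_dir: Path) -> dict:
    """Read current memory use and cumulative CPU time for one agent."""
    cpu = parse_flat_keyed((cgroup_dir / "cpu.stat").read_text())
    memory_bytes = int((cgroup_dir / "memory.current").read_text())
    return {
        "memory_bytes": memory_bytes,
        "cpu_usage_usec": cpu["usage_usec"],
    }
```

Since usage_usec is cumulative, a CPU utilization figure comes from sampling it twice and dividing the difference by the elapsed time.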

The overlayfs layers are inspectable. You can diff an agent’s upper layer against the base to see exactly what it changed. This is useful for debugging (why did the agent modify this file?) and auditing (what did the agent actually do?).
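For a single text file, the diff can be produced with the standard library. This is a minimal sketch assuming the upper and base directories are readable as ordinary directories; the function name diff_against_base and the "base/" and "agent/" labels are illustrative choices.

```python
import difflib
from pathlib import Path


def diff_against_base(base: Path, upper: Path, rel_path: str) -> str:
    """Return a unified diff of one text file, base version vs. agent version."""
    old_file = base / rel_path
    new_file = upper / rel_path
    # A file missing from the base is new, so diff against empty content.
    old_lines = old_file.read_text().splitlines(keepends=True) if old_file.exists() else []
    new_lines = new_file.read_text().splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=f"base/{rel_path}",
            tofile=f"agent/{rel_path}",
        )
    )
```

Running this over every changed path in an upper layer yields a patch-style record of the agent's work that can be stored next to its logs.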

Comparison to Shared-State Approaches

Most multi-agent frameworks run agents in the same process or share state through a database. LangGraph and AutoGen typically run agents as in-process threads or async tasks with shared memory. CrewAI runs agents in the same Python interpreter with a shared context object. This is simpler to implement but creates failure coupling. If one agent leaks memory, the whole system slows down. If one agent deadlocks, others may block waiting for a shared resource.

Swarm trades simplicity for isolation. You pay the cost of process spawning, IPC serialization, and filesystem layering. But you get independent failure domains, enforceable resource limits, and no shared state bugs.

Container orchestrators (Kubernetes, Docker Swarm) also use process isolation, but they target long-lived services, not short-lived coding tasks. Swarm is optimized for agents that spawn, execute a task, and exit in seconds or minutes.

Technical Verdict

Use Swarm’s approach when:

You need hard resource limits per agent (memory, CPU, I/O)
Agents are semi-trusted (LLM-generated code, third-party tools)
Failure isolation matters more than millisecond latency
You want filesystem-level conflict detection without custom merge logic
You are deploying on Linux servers with cgroups v2 and overlayfs support

Avoid it when:

You need sub-millisecond coordination between agents (IPC overhead is 1-10ms)
You are deploying to non-Linux environments (macOS, Windows, containers without privileged mode)
All agents are first-party code you fully trust
You need agents to share large in-memory data structures (serialization cost is prohibitive)
Your orchestrator must survive coordinator crashes (Swarm has no failover)

Swarm makes the case that OS-level isolation is a viable orchestration primitive for multi-agent systems. Instead of reinventing process management in application code, delegate to the kernel. The result is simpler failure handling, enforceable resource limits, and fewer shared-state bugs. The trade-off is Linux-only deployment and higher IPC overhead, but for most coding agents (which spend their time waiting on LLMs and filesystems), this cost is negligible.