Pi Agent Harness: Unified LLM API and Dockerized vLLM Pods for Self-Extensible Coding Agents

Pi Agent Harness hit #2 on GitHub Trending for TypeScript with a clear pitch: self-hosted coding agents that run on a $6 VPS instead of vendor-locked platforms. The architecture has three parts: a unified multi-provider LLM API (@earendil-works/pi-ai), an agent runtime with tool calling and state management (@earendil-works/pi-agent-core), and a deployment layer that splits a Go backend and a Next.js frontend into separate Docker containers. The project explicitly rejects Vercel and Supabase in favor of commodity hardware deployment.

The interesting plumbing is how Pi abstracts provider differences while preserving tool-calling semantics, how blocking plugin hooks degrade performance, and what “self-extensible” actually means for agent security boundaries.

Unified LLM API: Provider Abstraction Without Semantic Loss

@earendil-works/pi-ai wraps OpenAI, Anthropic, Google, and other providers behind a single interface. The challenge is not just HTTP client unification. Tool calling, streaming behavior, and error semantics differ across providers.

Key design decisions:

Normalized tool schema: Each provider has its own function-calling format. Pi translates a canonical tool definition into provider-specific JSON schemas at request time.
Streaming normalization: OpenAI sends delta chunks, Anthropic sends content_block_delta events. Pi’s abstraction layer emits a unified stream interface so agent code doesn’t branch on provider.
Error boundary mapping: Rate limits, context length errors, and refusal messages have different HTTP status codes and body shapes. Pi maps these to a common error taxonomy.
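
The schema translation can be sketched as follows. This is a minimal illustration, not Pi's actual code: CanonicalTool and both converter functions are hypothetical names. The two target shapes follow the public OpenAI Chat Completions and Anthropic Messages tool formats.

```typescript
// Hypothetical canonical shape; Pi's real type names are not shown in the source.
interface CanonicalTool {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // a JSON Schema object
}

// OpenAI Chat Completions nests the definition under `function`.
function toOpenAITool(tool: CanonicalTool) {
  return {
    type: "function" as const,
    function: {
      name: tool.name,
      description: tool.description,
      parameters: tool.parameters,
    },
  };
}

// Anthropic Messages uses a flat object and names the schema `input_schema`.
function toAnthropicTool(tool: CanonicalTool) {
  return {
    name: tool.name,
    description: tool.description,
    input_schema: tool.parameters,
  };
}

// Example tool, defined once and translated at request time.
const readFile: CanonicalTool = {
  name: "read_file",
  description: "Read a UTF-8 text file from the workspace",
  parameters: {
    type: "object",
    properties: { path: { type: "string" } },
    required: ["path"],
  },
};
```

The tool is defined once, and the provider-specific shape is derived at request time, so the registry never stores provider formats.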

The trade-off is latency. Every request passes through a translation layer. For high-throughput agents, this can add on the order of 5-15ms per call, which accumulates across thousands of requests. For interactive coding sessions, it's negligible next to model response time.

Why this matters for agent harnesses:

Agents that switch providers mid-session (for cost, latency, or capability reasons) need stable tool-calling semantics. If your agent uses Claude for planning and GPT-4 for code generation, the tool registry and state machine must not care which model executes a given step.
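
To make "the state machine must not care" concrete, here is a hedged sketch of normalizing tool-call responses. ToolCall, fromOpenAI, and fromAnthropic are hypothetical names. The input shapes are subsets of the public OpenAI and Anthropic response formats, where OpenAI delivers arguments as a JSON string and Anthropic delivers a parsed object.

```typescript
// Hypothetical canonical tool call consumed by the agent's state machine.
interface ToolCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

// Subset of an OpenAI Chat Completions assistant message.
interface OpenAIMessage {
  tool_calls?: {
    id: string;
    type: "function";
    function: { name: string; arguments: string }; // arguments is a JSON string
  }[];
}

// Subset of Anthropic Messages response content blocks.
type AnthropicBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };

function fromOpenAI(msg: OpenAIMessage): ToolCall[] {
  return (msg.tool_calls ?? []).map((c) => ({
    id: c.id,
    name: c.function.name,
    args: JSON.parse(c.function.arguments) as Record<string, unknown>,
  }));
}

function fromAnthropic(blocks: AnthropicBlock[]): ToolCall[] {
  const calls: ToolCall[] = [];
  for (const b of blocks) {
    // Text blocks are skipped; only tool_use blocks become tool calls.
    if (b.type === "tool_use") calls.push({ id: b.id, name: b.name, args: b.input });
  }
  return calls;
}
```

Downstream code only ever sees ToolCall values, so a planning step on one provider and a code-generation step on another dispatch through the same registry.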

Agent Runtime: Tool Calling and State Persistence

@earendil-works/pi-agent-core is the orchestration layer. It manages:

Tool registry: Functions the agent can invoke (file read, shell exec, API calls).
State machine: Tracks conversation history, tool results, and pending actions.
Execution loop: Calls LLM, parses tool requests, executes tools, feeds results back.
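
The loop described above can be sketched in a few lines. This is an illustrative assumption about the shape, not Pi's implementation: runAgent, Llm, Tool, and the message types are hypothetical, and the model is reduced to a function that may request one tool per turn.

```typescript
type Message =
  | { role: "user" | "assistant"; content: string }
  | { role: "tool"; toolName: string; content: string };

interface LlmReply {
  text: string;
  toolCall?: { name: string; args: Record<string, unknown> };
}

type Llm = (history: Message[]) => Promise<LlmReply>;
type Tool = (args: Record<string, unknown>) => Promise<string>;

// Call the model, run any requested tool, feed the result back, repeat.
async function runAgent(
  llm: Llm,
  tools: Map<string, Tool>,
  prompt: string,
  maxSteps = 10, // hard cap so a confused model cannot loop forever
): Promise<Message[]> {
  const history: Message[] = [{ role: "user", content: prompt }];
  for (let step = 0; step < maxSteps; step++) {
    const reply = await llm(history);
    history.push({ role: "assistant", content: reply.text });
    if (!reply.toolCall) break; // no tool requested: the turn is complete

    const tool = tools.get(reply.toolCall.name);
    const result = tool
      ? await tool(reply.toolCall.args)
      : `error: unknown tool ${reply.toolCall.name}`;
    history.push({ role: "tool", toolName: reply.toolCall.name, content: result });
  }
  return history;
}
```

Unknown tools are reported back to the model as a tool result rather than thrown, which gives the model a chance to correct itself on the next step.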

The runtime is stateful. Each agent session has a unique ID, and the state persists across Docker container restarts. This is not a serverless function model. It’s a long-running process model.

State persistence shape:

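// Message and ToolResult are the runtime's own types; their fields are not shown here.
interface AgentSession {
  id: string;                            // unique per agent session
  messages: Message[];                   // full conversation history
  toolResults: Map<string, ToolResult>;  // a Map must be converted (e.g. to entries) before JSON.stringify
  activeProvider: 'openai' | 'anthropic' | 'google';
  checkpointTimestamp: number;           // time of the last persisted checkpoint
}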

Sessions serialize to disk or a database. On container restart, the agent runtime rehydrates from the last checkpoint. This enables crash recovery without losing multi-turn context.
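
A checkpoint round trip might look like the sketch below. The function names and the fields of Message and ToolResult are assumptions for illustration. The one detail worth showing is that JSON.stringify serializes a Map as an empty object, so toolResults has to be converted explicitly.

```typescript
// Hypothetical minimal shapes; Pi's real Message and ToolResult are not shown in the source.
interface Message { role: "user" | "assistant" | "tool"; content: string }
interface ToolResult { output: string; ok: boolean }

interface AgentSession {
  id: string;
  messages: Message[];
  toolResults: Map<string, ToolResult>;
  activeProvider: "openai" | "anthropic" | "google";
  checkpointTimestamp: number;
}

// Convert the Map to entry pairs so its contents survive JSON.stringify.
function serializeSession(s: AgentSession): string {
  return JSON.stringify({ ...s, toolResults: Array.from(s.toolResults.entries()) });
}

// Rebuild the Map from the stored entry pairs.
function rehydrateSession(json: string): AgentSession {
  const raw = JSON.parse(json) as Omit<AgentSession, "toolResults"> & {
    toolResults: [string, ToolResult][];
  };
  return { ...raw, toolResults: new Map(raw.toolResults) };
}
```

A production version would also validate the parsed JSON before trusting it, since a truncated checkpoint file from a crash mid-write would otherwise fail at parse time or produce a malformed session.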

Blocking hooks degrade performance:

The README warns that “Claude code hooks are blocking” and performance degrades with many plugins. This reveals a concurrency bottleneck. If tool execution is synchronous and plugins register blocking hooks (e.g., linting, type-checking, security scans), the agent waits for each hook before proceeding.

Mitigation strategies:

Async tool execution: Run tools in parallel when they don’t depend on each other.
Hook timeouts: Kill hooks that exceed a budget (e.g., 500ms).
Lazy evaluation: Defer expensive hooks (like full codebase indexing) until the agent explicitly requests them.
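
The first two mitigations can be combined in a short sketch: run hooks concurrently and give each one a time budget. Hook, withTimeout, and runHooks are hypothetical names, not Pi APIs. Note that a plain promise cannot be killed; the timeout below only stops waiting, so a real implementation would also need a cancellation mechanism such as an AbortSignal or a child process it can terminate.

```typescript
type Hook = (toolName: string) => Promise<void>;

// Reject if `work` has not settled within `ms`. The underlying work keeps
// running in the background; this only stops the caller from waiting on it.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`hook exceeded ${ms}ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Run all hooks concurrently and return a message for each failure or timeout.
async function runHooks(hooks: Hook[], toolName: string, budgetMs = 500): Promise<string[]> {
  const results = await Promise.allSettled(
    hooks.map((hook) => withTimeout(hook(toolName), budgetMs)),
  );
  return results.flatMap((r) => (r.status === "rejected" ? [String(r.reason)] : []));
}
```

With concurrent execution, total hook latency approaches the slowest hook (capped by the budget) instead of the sum of all hooks.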

Pi does not yet implement parallel tool execution. This is a known performance cliff for agents with 10+ plugins.

Docker Deployment: Isolation and Portability

Pi ships as two Docker containers:

Go backend: Runs the agent runtime, manages state, proxies LLM requests.
Next.js frontend: Serves the web UI, handles WebSocket connections for streaming.

Why separate containers?

Isolation: Frontend crashes don’t kill agent state.
Scaling: You can run multiple frontend replicas behind a load balancer while keeping a single stateful backend.
Security boundary: The backend never serves user-uploaded assets. The frontend never touches LLM API keys.

Deployment shape:

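services:
  pi-backend:
    image: earendil-works/pi-backend:latest
    environment:
      # No value given: Compose passes these through from the host environment
      - OPENAI_API_KEY
      - ANTHROPIC_API_KEY
    volumes:
      # Session checkpoints live on the host so they survive container restarts
      - ./sessions:/data/sessions
    ports:
      - "8080:8080"

  pi-frontend:
    image: earendil-works/pi-frontend:latest
    environment:
      # The service name resolves over the default Compose network
      - BACKEND_URL=http://pi-backend:8080
    ports:
      - "3000:3000"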

The backend exposes a REST API for session management and a WebSocket endpoint for streaming. The frontend is a thin client. All agent logic lives in the backend.

Session recovery:

When the backend restarts, it scans /data/sessions for serialized state. Each session file is a JSON blob with message history and tool results. The agent runtime rehydrates the session and resumes from the last checkpoint.

This is simpler than distributed state (Redis, etcd) but limits you to single-node deployments. For multi-node setups, you need shared storage (a network filesystem such as NFS, or an object store such as S3) or a database backend.

Self-Extensibility: Can the Agent Modify Its Own Tool Registry?

Pi calls itself a “self-extensible coding agent.” In practice, this means the agent can:

Write new tool definitions: Generate a TypeScript function, save it to disk, and register it in the tool registry.
Install dependencies: Run npm install to add libraries the new tool needs.
Reload the runtime: Restart the agent process to load the new tool.

Security boundaries:

The agent has shell access. It can execute arbitrary code. Pi does not sandbox tool execution. If you run Pi on a shared server, a malicious prompt could install a backdoor.

Mitigation options:

Read-only filesystem: Mount the tool directory as read-only. The agent can propose new tools but cannot install them without human approval.
Approval workflow: Queue tool modifications in a pending state. A human reviews and approves before the agent reloads.
Sandboxed execution: Run tools in a separate container with limited capabilities (no network, no filesystem write outside /tmp).
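
The approval workflow is the easiest of the three to prototype. The sketch below is a hypothetical in-memory queue (ApprovalQueue and ToolProposal are illustrative names, not part of Pi): the agent can only propose, a human reviews, and the runtime loads nothing that is not approved.

```typescript
type Status = "pending" | "approved" | "rejected";

interface ToolProposal {
  id: number;
  name: string;
  source: string; // the generated tool code, held for review
  status: Status;
}

class ApprovalQueue {
  private proposals: ToolProposal[] = [];
  private nextId = 1;

  // Called by the agent: records the proposal but does not activate it.
  propose(name: string, source: string): ToolProposal {
    const proposal: ToolProposal = { id: this.nextId++, name, source, status: "pending" };
    this.proposals.push(proposal);
    return proposal;
  }

  // Called by a human reviewer. A decision is final once made.
  review(id: number, approve: boolean): void {
    const proposal = this.proposals.find((p) => p.id === id);
    if (!proposal) throw new Error(`no proposal with id ${id}`);
    if (proposal.status !== "pending") throw new Error(`proposal ${id} already reviewed`);
    proposal.status = approve ? "approved" : "rejected";
  }

  // The only list the runtime should load tools from on reload.
  approved(): ToolProposal[] {
    return this.proposals.filter((p) => p.status === "approved");
  }
}
```

This only helps if the agent cannot write to the queue's storage directly, which is why it pairs naturally with the read-only tool directory above.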

Pi does not implement any of these by default. It assumes you trust the agent or run it in an isolated environment.

vLLM Pods: Self-Hosted Model Inference

Pi includes vLLM pod definitions for running open-weight models (Llama, Mistral, CodeLlama) locally. The pod is a Docker container running vLLM’s OpenAI-compatible API server.

Why vLLM instead of Ollama or llama.cpp?

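Batching: vLLM uses continuous batching to serve multiple requests concurrently. Ollama can serve parallel requests in recent versions, but it targets single-user local use rather than high-throughput serving.
Quantization: vLLM supports GPTQ, AWQ, and bitsandbytes quantization. At 4 bits per weight, a 70B model still needs roughly 35GB for weights alone, so it does not fit on a single 24GB card; 24GB is realistic for models up to about 30B parameters, while 70B needs around 48GB or multiple GPUs.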
OpenAI compatibility: Pi’s unified LLM API treats vLLM as another provider. No special-casing required.
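
Whether a quantized model fits a given card is mostly arithmetic on weight size. The helper below is a rough sketch with hypothetical names; it counts weights only, and the 4GB default headroom for KV cache and runtime overhead is an assumption that varies with context length and batch size.

```typescript
// Weight memory only: parameters × bits per weight ÷ 8 bits per byte.
// With parameters in billions, the result is in GB (1e9 bytes).
function weightGigabytes(paramsBillions: number, bitsPerWeight: number): number {
  return (paramsBillions * bitsPerWeight) / 8;
}

// headroomGb is an assumed allowance for KV cache, activations, and overhead.
function fitsInVram(
  paramsBillions: number,
  bitsPerWeight: number,
  vramGb: number,
  headroomGb = 4,
): boolean {
  return weightGigabytes(paramsBillions, bitsPerWeight) + headroomGb <= vramGb;
}

console.log(weightGigabytes(70, 4));  // 35 GB of weights for a 70B model at 4-bit
console.log(fitsInVram(70, 4, 24));   // false
console.log(fitsInVram(34, 4, 24));   // true: 17 GB of weights plus headroom
```

Long contexts and large batches grow the KV cache well past a fixed headroom, so treat a "fits" result as a lower bound on what is needed.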

Deployment trade-offs:

Approach	Latency	Cost	Control
OpenAI API	200-500ms	$0.01-0.06/1K tokens	None
Anthropic API	300-700ms	$0.008-0.024/1K tokens	None
vLLM on VPS	1-3s (cold start), 200-800ms (warm)	$0.10-0.50/hour	Full
vLLM on GPU instance	100-300ms	$1-4/hour	Full

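For interactive coding, 200-800ms latency is acceptable. For high-throughput batch jobs (e.g., generating test cases for 1,000 functions), vLLM on a dedicated GPU instance can be cheaper than API calls, provided the job keeps the GPU busy: the instance is billed by the hour whether or not it is serving tokens.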
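
The cost comparison comes down to a break-even throughput. The sketch below uses a hypothetical helper and two prices picked from within the table's ranges ($2/hour for the GPU, $0.02 per 1K tokens for the API). It ignores differences in model quality and assumes the GPU can actually sustain the required rate.

```typescript
// Tokens per hour at which an hourly-billed GPU costs the same as a per-token API.
function breakEvenTokensPerHour(gpuDollarsPerHour: number, apiDollarsPer1kTokens: number): number {
  return (gpuDollarsPerHour / apiDollarsPer1kTokens) * 1000;
}

const perHour = breakEvenTokensPerHour(2, 0.02); // about 100,000 tokens per hour
const perSecond = perHour / 3600;                // about 28 tokens per second

console.log(`break-even: ${Math.round(perHour)} tokens/hour (~${perSecond.toFixed(1)}/s)`);
```

Below that sustained rate the API is cheaper; above it, each additional token on the GPU is effectively free until the hardware saturates.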

Observability: What Happens When the Agent Gets Stuck?

Pi logs every LLM request, tool execution, and state transition. Logs are JSON lines written to stdout. You pipe them to a log aggregator (Loki, Elasticsearch, CloudWatch).

Key observability signals:

Tool execution time: Histogram of how long each tool takes. Identifies slow plugins.
LLM token usage: Tracks input/output tokens per request. Useful for cost attribution.
Error rate by provider: Counts 429s, 500s, and timeouts per LLM provider.
Session duration: How long agents run before completing or timing out.

Missing observability:

Tool call graph: No visualization of which tools the agent invoked and in what order.
State size over time: No metric for how large the session state grows. Long-running agents can accumulate megabytes of message history.
Checkpoint frequency: No metric for how often the agent persists state. If checkpoints are too infrequent, you lose work on crash.

You can add these by instrumenting the agent runtime with OpenTelemetry spans.

Performance Characteristics and Failure Modes

Blocking hooks:

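The README explicitly warns that blocking hooks degrade performance. Because hooks run one after another, their latencies add up: if you register 10 plugins and each hook takes 100ms, every tool call adds 1 second of latency. Across a task with dozens of tool calls, the agent becomes unusable.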

State bloat:

Long-running sessions accumulate message history. After 1,000 turns, session state can reach 5-10MB, depending on how large the tool outputs are. Serializing and deserializing the full state on every checkpoint slows the agent.

Provider rate limits:

If the agent hits a rate limit, Pi retries with exponential backoff. But if the agent is in a loop (e.g., repeatedly calling a broken tool), it burns through your rate limit budget in minutes.
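
A capped retry policy limits how much budget a single call can burn. The sketch below is illustrative rather than Pi's retry code: the names, the 500ms base delay, the 30s cap, and the five-attempt limit are all assumptions, and jitter is omitted for brevity. It treats any error object with status 429 as a rate limit.

```typescript
// Assumed error shape: provider clients commonly expose the HTTP status on the error.
function isRateLimit(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: number }).status === 429;
}

// Delay doubles per attempt: 500ms, 1s, 2s, ... up to capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 5,
  // Injectable sleep so the policy can be tested without real waiting.
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Give up on non-rate-limit errors and once the attempt budget is spent.
      if (!isRateLimit(err) || attempt + 1 >= maxAttempts) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Backoff only protects a single call. An agent stuck repeating the same broken tool call needs a separate guard at the loop level, such as a cap on steps per task.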

Docker restart behavior:

If the backend container crashes mid-turn, the agent loses the current LLM response. It resumes from the last checkpoint, which might be 10-30 seconds stale. The agent might repeat the same tool call.

Technical Verdict

Use Pi when:

You need a portable coding agent that runs on commodity hardware (VPS, homelab, on-prem).
You want to switch LLM providers without rewriting agent logic.
You need crash recovery and session persistence for long-running tasks.
You are comfortable with shell-level access for the agent (or can add sandboxing yourself).

Avoid Pi when:

You need sub-100ms latency for tool execution (the blocking hook model is too slow).
You require multi-node horizontal scaling (the single-backend architecture doesn’t support it).
You need built-in sandboxing or approval workflows for self-extensibility (Pi assumes you trust the agent).
You want a managed platform with zero ops (Pi is self-hosted infrastructure).

Pi is plumbing for developers who want control over agent deployment and provider choice. It is not a turnkey SaaS. If you are building an agent that needs to run in air-gapped environments, on customer infrastructure, or with strict cost controls, Pi gives you the primitives to do that. If you want a hosted agent platform, use a managed offering such as LangGraph Cloud instead.