Agents that write code need somewhere safe to run it. The Model Context Protocol (MCP) gives you tools, but when an agent needs to process data across multiple tool calls, you hit a choice: let the model hold state in its context window, or give it a sandboxed execution environment where it can write small programs.
The ptc_runner project ships an MCP server that exposes a Lisp REPL as a tool. The agent writes Lisp, the server evaluates it in a sandbox, and only the final result comes back. This is not about Lisp advocacy. It is about what happens when you treat agent-generated code as untrusted input and design the execution boundary accordingly.
Why Code Mode Exists
When an agent fetches email, parses it, groups by sender, and counts threads, doing that work in the model’s context burns tokens on intermediate state. The model is not a runtime. It is a planner.
Code mode flips the pattern. The agent writes a program that:
- Calls MCP tools directly in the execution environment
- Processes results locally without round-tripping through the context window
- Returns only the final answer
Cloudflare and Anthropic both shipped variations on this. Cloudflare’s Code Mode gives agents a TypeScript API. Anthropic’s MCP code execution does the same with Python. Both keep intermediate results out of the conversation.
The ptc_runner MCP server does this with Lisp. The reason is not nostalgia. It is that Lisp code is data, which makes sandboxing and introspection cheaper.
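To make "code is data" concrete, here is a minimal sketch in Python (not ptc_runner's implementation). It parses an s-expression into nested lists and then inspects the program without running it. The `ALLOWED` set and the helper names are hypothetical, and the tokenizer is deliberately naive: it does not handle string literals that contain spaces.

```python
# Sketch: an s-expression becomes nested Python lists, so a server can
# inspect a program (which functions does it call?) before evaluating it.

def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Consume tokens from the front and return one expression."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == "(":
        form = []
        while tokens and tokens[0] != ")":
            form.append(parse(tokens))
        if not tokens:
            raise SyntaxError("missing ')'")
        tokens.pop(0)  # drop the closing ")"
        return form
    if tok == ")":
        raise SyntaxError("unexpected ')'")
    return tok

def called_symbols(expr):
    """Collect the head symbol of every list form (a naive notion of 'call')."""
    found = set()
    if isinstance(expr, list) and expr:
        if isinstance(expr[0], str):
            found.add(expr[0])
        for sub in expr:
            found |= called_symbols(sub)
    return found

# Hypothetical allowlist of bindings the sandbox exposes.
ALLOWED = {"fetch-emails", "group-by"}

tree = parse(tokenize("(group-by (fetch-emails :limit 50) :key sender :count t)"))
calls = called_symbols(tree)

bad = parse(tokenize('(open-file "/etc/passwd")'))
disallowed = called_symbols(bad) - ALLOWED
```

The whole check is a tree walk over plain lists. Doing the same for Python or JavaScript source means working against a much larger grammar.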
Sandboxing Agent-Generated Code
When you let an agent write code, you are accepting untrusted input. The sandbox must enforce:
- No filesystem access
- No network access
- No access to parent process memory
- Resource limits on CPU and memory
The ptc_runner sandbox uses a restricted Lisp evaluator. The environment exposes only the functions and bindings you explicitly allow. There is no eval escape hatch, no FFI, no syscalls.
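A restricted evaluator of this kind can be very small. The following Python sketch is an illustration under assumptions, not ptc_runner's code: the environment `ENV`, the stubbed `fetch-emails` binding, and `SandboxError` are all hypothetical. The point it shows is that the only callable things are the entries you put in the environment.

```python
import operator

class SandboxError(Exception):
    """Raised for anything the evaluator refuses to do."""

def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    # No error handling for malformed input; this is a sketch.
    tok = tokens.pop(0)
    if tok == "(":
        form = []
        while tokens[0] != ")":
            form.append(parse(tokens))
        tokens.pop(0)
        return form
    try:
        return int(tok)
    except ValueError:
        return tok  # a symbol

def evaluate(expr, env):
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        # Symbols resolve only through the allowlisted environment.
        if expr not in env:
            raise SandboxError(f"unbound symbol: {expr}")
        return env[expr]
    if not expr:
        raise SandboxError("empty form")
    fn = evaluate(expr[0], env)
    if not callable(fn):
        raise SandboxError(f"not a function: {expr[0]}")
    return fn(*[evaluate(arg, env) for arg in expr[1:]])

# Hypothetical environment: everything the agent can reach is listed here.
ENV = {
    "+": operator.add,
    "*": operator.mul,
    "count": len,
    "fetch-emails": lambda: ["mail-1", "mail-2", "mail-3"],  # stub tool
}

def run(src, env=ENV):
    return evaluate(parse(tokenize(src)), env)
```

With this environment, `run("(count (fetch-emails))")` returns 3, while `run('(open-file "/etc/passwd")')` raises `SandboxError` because `open-file` was never bound. There is no code path that reaches the filesystem.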
What the Sandbox Allows
The REPL tool provides:
- Basic arithmetic and list operations
- Bindings to MCP tools (email, calendar, phone)
- A small standard library for data manipulation
- Persistent state between invocations within a session
The agent writes something like:
(let ((emails (fetch-emails :limit 50)))
  (group-by emails :key sender :count t))
The sandbox evaluates this, calls the fetch-emails tool, processes the list, and returns a summary. The raw email bodies never enter the model’s context.
What It Blocks
The evaluator rejects:
- File I/O primitives
- Network sockets
- Process spawning
- Reflection that could leak sandbox internals
If the agent tries (open-file "/etc/passwd"), the evaluator throws an error before touching the filesystem. The error message goes back to the agent, which can retry with a valid operation.
State Management Between Invocations
The MCP server maintains a session for each agent conversation. Within that session, the REPL environment persists bindings.
If the agent defines a variable in one invocation:
(define email-stats (compute-stats (fetch-emails)))
That binding is available in the next invocation:
(filter email-stats (lambda (s) (> (count s) 10)))
State serialization is straightforward because Lisp data structures are already symbolic. The server serializes the environment as an s-expression and deserializes it on the next call. No JSON schema mapping, no type coercion.
When the conversation ends, the session is discarded. There is no cross-conversation persistence.
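The round trip can be sketched in a few lines of Python. This is an assumption-laden illustration, not the server's actual format: it handles only integers, strings without embedded double quotes, and nested lists, and it writes binding names as quoted strings. `save_session` and `load_session` are hypothetical names.

```python
import re

# Strings, parentheses, or bare atoms.
TOKEN = re.compile(r'"[^"]*"|[()]|[^\s()"]+')

def dumps(value):
    """Write ints, simple strings, and nested lists as an s-expression."""
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        if '"' in value:
            raise ValueError("embedded quotes are not supported in this sketch")
        return f'"{value}"'
    if isinstance(value, list):
        return "(" + " ".join(dumps(v) for v in value) + ")"
    raise TypeError(f"cannot serialize {type(value).__name__}")

def _read(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        items = []
        while tokens[0] != ")":
            items.append(_read(tokens))
        tokens.pop(0)
        return items
    if tok.startswith('"'):
        return tok[1:-1]
    return int(tok)

def loads(text):
    return _read(TOKEN.findall(text))

def save_session(env):
    # The environment becomes a list of (name value) pairs.
    return dumps([[name, value] for name, value in env.items()])

def load_session(text):
    return {name: value for name, value in loads(text)}

env = {
    "email-stats": [["alice@example.com", 12], ["bob@example.com", 3]],
    "threshold": 10,
}
restored = load_session(save_session(env))
```

The serialized form is itself readable Lisp data, so the stored session can be inspected with the same tools as the programs.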
Error Handling and Timeouts
Agent-generated code will loop forever, divide by zero, and try to allocate gigabytes of memory. The sandbox must handle this without crashing the MCP server.
The evaluator runs in a separate thread with:
- A wall-clock timeout (default 5 seconds)
- A memory allocation limit
- A recursion depth limit
If evaluation exceeds any limit, the thread is killed and an error is returned to the agent. The error includes the limit that was hit and the partial result if available.
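Here is one way such limits can be wired into an evaluator, sketched in Python. Note the swap: Python threads cannot be killed from outside, so this sketch uses cooperative checks inside the evaluation loop instead of the thread kill described above. It covers the wall-clock and recursion limits only. `Budget`, `LimitExceeded`, and `safe_eval` are hypothetical names.

```python
import time

class LimitExceeded(Exception):
    def __init__(self, limit):
        super().__init__(f"limit exceeded: {limit}")
        self.limit = limit

class Budget:
    """Per-invocation limits, checked on every evaluation step."""
    def __init__(self, timeout_s=5.0, max_depth=100):
        self.deadline = time.monotonic() + timeout_s
        self.max_depth = max_depth

    def check(self, depth):
        if depth > self.max_depth:
            raise LimitExceeded("recursion_depth")
        if time.monotonic() > self.deadline:
            raise LimitExceeded("wall_clock")

def evaluate(expr, env, budget, depth=0):
    budget.check(depth)
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return env[expr]
    fn = evaluate(expr[0], env, budget, depth + 1)
    args = [evaluate(arg, env, budget, depth + 1) for arg in expr[1:]]
    return fn(*args)

def safe_eval(expr, env, **limits):
    """Return a result or a structured error naming the limit that was hit."""
    try:
        return {"ok": evaluate(expr, env, Budget(**limits))}
    except LimitExceeded as exc:
        return {"error": "LimitExceeded", "limit": exc.limit}

env = {"inc": lambda x: x + 1}

# Build (inc (inc (... 1))) nested 50 levels deep, as already-parsed lists.
deep = 1
for _ in range(50):
    deep = ["inc", deep]

within_limits = safe_eval(deep, env)
too_deep = safe_eval(deep, env, max_depth=10)
timed_out = safe_eval(["inc", 1], env, timeout_s=-1)  # deadline already past
```

A cooperative check cannot interrupt a single long-running builtin, which is why a real server needs a mechanism that can preempt the evaluation.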
Syntax errors and runtime exceptions are caught and returned as structured error messages. The agent sees:
{
"error": "UnboundVariable",
"symbol": "fetch-emials",
"suggestion": "Did you mean fetch-emails?"
}
The suggestion comes from a Levenshtein distance check against the available tool bindings. This cuts down on retry loops.
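A suggestion check like this fits in a few lines. The Python sketch below uses the standard dynamic-programming Levenshtein distance. The `max_distance` threshold of 2 and the binding names other than `fetch-emails` are assumptions for illustration.

```python
def levenshtein(a, b):
    """Edit distance with insert, delete, and substitute, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = cur
    return prev[-1]

def suggest(symbol, bindings, max_distance=2):
    """Return the closest binding name, or None if nothing is close enough."""
    best = min(bindings, key=lambda name: levenshtein(symbol, name), default=None)
    if best is not None and levenshtein(symbol, best) <= max_distance:
        return best
    return None

def unbound_error(symbol, bindings):
    error = {"error": "UnboundVariable", "symbol": symbol}
    hint = suggest(symbol, bindings)
    if hint is not None:
        error["suggestion"] = f"Did you mean {hint}?"
    return error

# Hypothetical set of tool bindings.
bindings = ["fetch-emails", "fetch-events", "group-by"]
result = unbound_error("fetch-emials", bindings)
```

Plain Levenshtein counts a swapped pair of letters as two edits, so a threshold below 2 would miss the typo in the example.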
Tool Boundary Design
The MCP server exposes two tools to the agent:
- lisp_eval: Evaluate a Lisp expression and return the result
- lisp_env: Inspect the current environment bindings
The agent does not call fetch-emails directly. It calls lisp_eval with a program that calls fetch-emails. The distinction matters because it changes where data lives.
In a traditional MCP setup:
Agent -> fetch_emails tool -> JSON result -> Agent context -> filter tool -> JSON result -> Agent context
In code mode:
Agent -> lisp_eval tool -> (fetch-emails + filter in sandbox) -> final result -> Agent context
The intermediate email bodies never serialize to JSON and never enter the conversation. The agent writes the data pipeline once and gets back only what it asked for.
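The difference is easy to see with a toy simulation. In this Python sketch, `fetch_emails` is a hypothetical stand-in that fabricates placeholder messages, and character counts serve as a rough proxy for tokens. Both paths compute the same per-sender counts. Only the amount of text that reaches the conversation differs.

```python
import json
from collections import Counter

def fetch_emails(limit):
    """Hypothetical stand-in for the fetch-emails tool; bodies are filler."""
    senders = ["alice@example.com", "bob@example.com", "carol@example.com"]
    return [
        {"sender": senders[i % 3], "body": "lorem ipsum " * 40}
        for i in range(limit)
    ]

def traditional(limit):
    # The raw tool result is serialized into the conversation,
    # then a second step produces the summary.
    raw = json.dumps(fetch_emails(limit))
    counts = Counter(email["sender"] for email in json.loads(raw))
    summary = json.dumps(counts)
    return summary, len(raw) + len(summary)

def code_mode(limit):
    # Fetch and aggregate inside the sandbox; only the summary leaves it.
    counts = Counter(email["sender"] for email in fetch_emails(limit))
    summary = json.dumps(counts)
    return summary, len(summary)

trad_summary, trad_chars = traditional(50)
code_summary, code_chars = code_mode(50)
```

Here `trad_chars` grows with the number and size of the emails, while `code_chars` depends only on the number of distinct senders.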
Why Lisp Instead of Python or JavaScript
Python and JavaScript are the default choices for agent code execution. They have large ecosystems and the model has seen more training examples. But they are harder to sandbox.
| Concern | Python/JavaScript | Lisp |
|---|---|---|
| Sandboxing | Requires restricted execution environments (RestrictedPython, vm2) with known escapes | Evaluator is a 500-line interpreter with no FFI or syscalls |
| State serialization | Pickle, JSON, or custom serialization with type mapping | S-expressions serialize natively |
| Introspection | Reflection can leak sandbox internals | Code is data, introspection is safe by default |
| Error recovery | Stack traces reference stdlib internals | Errors are symbolic and easy to parse |
| Dependency management | Agent might try to import unavailable modules | No import mechanism, all tools are explicit bindings |
The tradeoff is that the model writes worse Lisp than Python. But the model is not writing production code. It is writing short-lived data pipelines. Lisp’s simplicity makes the sandbox smaller and the failure modes easier to reason about.
Deployment Shape
The MCP server runs as a sidecar process. The agent (Claude Desktop, custom orchestrator, etc.) connects over stdio or HTTP.
Typical setup:
┌─────────────┐
│    Agent    │
│  (Claude)   │
└──────┬──────┘
       │ MCP protocol
       │
┌──────▼──────┐
│ ptc_runner  │
│ MCP Server  │
├─────────────┤
│   Sandbox   │
│   (Lisp)    │
└──────┬──────┘
       │ Tool calls
       │
┌──────▼──────┐
│ Gmail API   │
│ Calendar API│
│ Phone API   │
└─────────────┘
The server is stateless except for session data. You can run multiple instances behind a load balancer if you serialize sessions to a shared store (Redis, Postgres).
Resource limits are enforced per invocation, not per server. A runaway evaluation in one session does not affect others.
Observability
The server logs:
- Each lisp_eval invocation with the submitted code
- Evaluation time and memory usage
- Errors and timeouts
- Tool calls made from within the sandbox
Structured logs go to stdout as JSON. You can pipe them to any log aggregator.
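One JSON object per line is enough for most aggregators. The Python sketch below wraps an evaluation and emits such a record. The field names (`event`, `session`, `status`, `elapsed_ms`) are assumptions, not ptc_runner's log schema, and memory usage and nested tool calls are left out.

```python
import io
import json
import sys
import time

def log_eval(session_id, code, run, stream=sys.stdout):
    """Run `code` through `run` and write one JSON log line describing it."""
    started = time.monotonic()
    record = {"event": "lisp_eval", "session": session_id, "code": code}
    try:
        record["result"] = run(code)
        record["status"] = "ok"
    except Exception as exc:
        # Log the error type; the server itself keeps running.
        record["status"] = "error"
        record["error"] = type(exc).__name__
    record["elapsed_ms"] = round((time.monotonic() - started) * 1000, 3)
    # default=str keeps logging from failing on non-JSON result values.
    stream.write(json.dumps(record, default=str) + "\n")
    return record

def failing_run(code):
    raise ZeroDivisionError("division by zero")

# Capture output in a buffer here; a server would leave stream as stdout.
buffer = io.StringIO()
log_eval("session-1", "(+ 1 2)", lambda code: 3, stream=buffer)
log_eval("session-1", "(/ 1 0)", failing_run, stream=buffer)
lines = [json.loads(line) for line in buffer.getvalue().splitlines()]
```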
The lisp_env tool gives the agent introspection. It can ask “what bindings are available?” and get back a list of tool names and their signatures. This helps the agent recover from errors without human intervention.
Failure Modes
The sandbox does not protect against:
- Logic bugs in agent-generated code (the agent might write a valid program that does the wrong thing)
- Denial of service via many small requests (rate limiting is outside the sandbox)
- Prompt injection that convinces the agent to exfiltrate data through tool calls (the sandbox cannot distinguish malicious intent from legitimate use)
The sandbox does protect against:
- Filesystem or network access
- Memory exhaustion
- Infinite loops
- Crashes that take down the MCP server
The threat model assumes the agent is not adversarial, but the code it writes is untrusted. This is the right model for personal assistants and internal tools. It is not sufficient for multi-tenant systems where agents might be actively hostile.
Technical Verdict
Use a sandboxed REPL when:
- Your agent needs to process large datasets returned by MCP tools
- You want to keep intermediate results out of the context window
- You control the deployment environment and can enforce resource limits
- The agent is writing short-lived data pipelines, not production software
Avoid it when:
- The agent needs to call tools sequentially with human-in-the-loop approval (code mode hides individual tool calls)
- You need every tool invocation to appear in the client-side audit trail (the client sees one lisp_eval call that wraps many internal tool calls, and those internal calls show up only in the server's own logs)
- The model has no training data for your execution language (Lisp is a harder sell than Python)
- You are building a multi-tenant system where agents might be adversarial (you need stronger isolation than an interpreter-level sandbox running inside the server process)
The ptc_runner approach is not the only way to do code mode. But it shows the plumbing: isolation boundaries, state management, error handling, and tool boundary design. If you are building MCP servers that execute agent-generated code, these are the problems you will hit.
Source Links
- The Right Tool for Code Mode (primary source)
- ptc_runner on GitHub