Sandboxed Lisp REPLs for MCP Agents: Why Code Execution Needs Isolation Boundaries

Agents that write code need somewhere safe to run it. The Model Context Protocol (MCP) gives you tools, but when an agent needs to process data across multiple tool calls, you hit a choice: let the model hold state in its context window, or give it a sandboxed execution environment where it can write small programs.

The ptc_runner project ships an MCP server that exposes a Lisp REPL as a tool. The agent writes Lisp, the server evaluates it in a sandbox, and only the final result comes back. This is not about Lisp advocacy. It is about what happens when you treat agent-generated code as untrusted input and design the execution boundary accordingly.

Why Code Mode Exists

When an agent fetches email, parses it, groups by sender, and counts threads, doing that work in the model’s context burns tokens on intermediate state. The model is not a runtime. It is a planner.

Code mode flips the pattern. The agent writes a program that:

Calls MCP tools directly in the execution environment
Processes results locally without round-tripping through the context window
Returns only the final answer

Cloudflare and Anthropic both shipped variations on this. Cloudflare’s Code Mode gives agents a TypeScript API. Anthropic’s MCP code execution does the same with Python. Both keep intermediate results out of the conversation.

The ptc_runner MCP server does this with Lisp. The reason is not nostalgia. It is that Lisp code is data, which makes sandboxing and introspection cheaper.

Sandboxing Agent-Generated Code

When you let an agent write code, you are accepting untrusted input. The sandbox must enforce:

No filesystem access
No network access
No access to parent process memory
Resource limits on CPU and memory

The ptc_runner sandbox uses a restricted Lisp evaluator. The environment exposes only the functions and bindings you explicitly allow. There is no eval escape hatch, no FFI, no syscalls.
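The allowlist idea can be sketched in a few dozen lines of Python. This is a minimal illustration, not the ptc_runner evaluator: the names `SandboxError`, `ALLOWED`, and `run` are hypothetical, and the only types supported are integers and symbols. The point it shows is that a symbol missing from the environment cannot be reached at all.

```python
import operator

class SandboxError(Exception):
    """Raised for anything the evaluator refuses to do."""

def tokenize(src):
    # Pad parens with spaces so a plain split() yields tokens.
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Turn a token list into nested Python lists (one expression)."""
    if not tokens:
        raise SandboxError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == "(":
        form = []
        while tokens and tokens[0] != ")":
            form.append(parse(tokens))
        if not tokens:
            raise SandboxError("missing closing paren")
        tokens.pop(0)  # drop ")"
        return form
    if tok == ")":
        raise SandboxError("unexpected )")
    try:
        return int(tok)
    except ValueError:
        return tok  # anything else is treated as a symbol

def evaluate(form, env):
    if isinstance(form, int):
        return form
    if isinstance(form, str):
        if form not in env:
            raise SandboxError(f"unbound symbol: {form}")
        return env[form]
    if not form:
        raise SandboxError("empty application")
    # The operator is resolved first, so an unknown function fails
    # before any argument is evaluated or any side effect can happen.
    fn = evaluate(form[0], env)
    if not callable(fn):
        raise SandboxError(f"not a function: {form[0]}")
    args = [evaluate(arg, env) for arg in form[1:]]
    return fn(*args)

# The whole capability surface: nothing outside this dict is reachable.
ALLOWED = {
    "+": operator.add,
    "*": operator.mul,
    "list": lambda *xs: list(xs),
    "length": len,
}

def run(src):
    return evaluate(parse(tokenize(src)), dict(ALLOWED))
```

With this sketch, `run("(+ 1 (* 2 3))")` evaluates normally, while `run('(open-file "/etc/passwd")')` raises `SandboxError` because `open-file` was never bound.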

What the Sandbox Allows

The REPL tool provides:

Basic arithmetic and list operations
Bindings to MCP tools (email, calendar, phone)
A small standard library for data manipulation
Persistent state between invocations within a session

The agent writes something like:

(let ((emails (fetch-emails :limit 50)))
  (group-by emails :key sender :count t))

The sandbox evaluates this, calls the fetch-emails tool, processes the list, and returns a summary. The raw email bodies never enter the model’s context.
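To make the data flow concrete, here is a Python sketch of roughly what happens inside the sandbox for an expression like that one. The `fetch_emails` function is a hypothetical stand-in that returns canned records, and the record shape (`sender`, `body`) is an assumption for illustration.

```python
from collections import Counter

def fetch_emails(limit=50):
    """Hypothetical stand-in for the real tool binding; returns canned records."""
    inbox = [
        {"sender": "ana@example.com", "body": "long body text"},
        {"sender": "bo@example.com", "body": "long body text"},
        {"sender": "ana@example.com", "body": "long body text"},
    ]
    return inbox[:limit]

def group_by_sender_count(emails):
    """Count messages per sender; the bodies are dropped from the result."""
    return dict(Counter(email["sender"] for email in emails))

# Only this small summary would be returned to the model.
summary = group_by_sender_count(fetch_emails(limit=50))
```

The summary holds one count per sender and nothing else, which is the part that would be returned to the agent.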

What It Blocks

The evaluator rejects:

File I/O primitives
Network sockets
Process spawning
Reflection that could leak sandbox internals

If the agent tries (open-file "/etc/passwd"), the evaluator throws an error before touching the filesystem. The error message goes back to the agent, which can retry with a valid operation.

State Management Between Invocations

The MCP server maintains a session for each agent conversation. Within that session, the REPL environment persists bindings.

If the agent defines a variable in one invocation:

(define email-stats (compute-stats (fetch-emails)))

That binding is available in the next invocation:

(filter email-stats (lambda (s) (> (count s) 10)))

State serialization is straightforward because Lisp data structures are already symbolic. The server serializes the environment as an s-expression and deserializes it on the next call. No JSON schema mapping, no type coercion.

When the conversation ends, the session is discarded. There is no cross-conversation persistence.
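A round trip through s-expression text might look like the Python sketch below. It assumes the environment holds only integers, strings, and nested lists; how ptc_runner handles function values or other types is not covered here, and `dump_env` and `load_env` are hypothetical names.

```python
import re

# Tokens: parens, double-quoted strings with backslash escapes, bare atoms.
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')

def dump(value):
    """Serialize ints, strings, and lists as s-expression text."""
    if isinstance(value, bool):
        raise TypeError("booleans are not supported in this sketch")
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
    if isinstance(value, list):
        return "(" + " ".join(dump(item) for item in value) + ")"
    raise TypeError(f"cannot serialize {type(value).__name__}")

def _read(tokens, i):
    """Read one value starting at tokens[i]; return (value, next index)."""
    tok = tokens[i]
    if tok == "(":
        items = []
        i += 1
        while tokens[i] != ")":
            item, i = _read(tokens, i)
            items.append(item)
        return items, i + 1
    if tok.startswith('"'):
        # Strip the quotes and undo the backslash escapes.
        return re.sub(r"\\(.)", r"\1", tok[1:-1]), i + 1
    return int(tok), i + 1

def load(text):
    value, _ = _read(TOKEN.findall(text), 0)
    return value

def dump_env(env):
    """Store the environment as a list of (name value) pairs."""
    return dump([[name, value] for name, value in env.items()])

def load_env(text):
    return {name: value for name, value in load(text)}
```

The text produced by `dump_env` is what a server could keep between invocations, in memory or in an external store.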

Error Handling and Timeouts

Agent-generated code will eventually loop forever, divide by zero, or try to allocate gigabytes of memory. The sandbox must handle each of these without crashing the MCP server.

The evaluator runs in a separate thread with:

A wall-clock timeout (default 5 seconds)
A memory allocation limit
A recursion depth limit

If evaluation exceeds any limit, the thread is killed and an error is returned to the agent. The error includes the limit that was hit and the partial result if available.
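One way to enforce two of these limits is sketched below in Python. Note the swap: Python cannot kill a thread, so this sketch checks the limits cooperatively inside the evaluator instead of killing anything. The memory limit and partial results are left out, and the names `Budget`, `LimitExceeded`, and `run` are hypothetical.

```python
import time

class LimitExceeded(Exception):
    def __init__(self, limit):
        super().__init__(f"{limit} limit exceeded")
        self.limit = limit

class Budget:
    """Tracks the wall-clock deadline and recursion depth for one evaluation."""

    def __init__(self, timeout_s=5.0, max_depth=200):
        self.deadline = time.monotonic() + timeout_s
        self.max_depth = max_depth
        self.depth = 0

    def enter(self):
        self.depth += 1
        if self.depth > self.max_depth:
            raise LimitExceeded("recursion-depth")
        if time.monotonic() > self.deadline:
            raise LimitExceeded("wall-clock")

    def leave(self):
        self.depth -= 1

def evaluate(form, budget):
    """Evaluate nested ("+", a, b) tuples, charging the budget at every step."""
    budget.enter()
    try:
        if isinstance(form, int):
            return form
        op, left, right = form
        if op != "+":
            raise ValueError(f"unsupported operator: {op}")
        return evaluate(left, budget) + evaluate(right, budget)
    finally:
        budget.leave()

def run(form, **limits):
    """Run one evaluation with a fresh budget and report which limit was hit."""
    try:
        return {"ok": evaluate(form, Budget(**limits))}
    except LimitExceeded as exc:
        return {"error": "LimitExceeded", "limit": exc.limit}
```

Because every invocation gets its own `Budget`, a runaway evaluation is charged only against its own limits.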

Syntax errors and runtime exceptions are caught and returned as structured error messages. The agent sees:

{
  "error": "UnboundVariable",
  "symbol": "fetch-emials",
  "suggestion": "Did you mean fetch-emails?"
}

The suggestion comes from a Levenshtein distance check against the available tool bindings. This cuts down on retry loops.
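A suggestion check of this kind can be sketched in Python as follows. The `max_distance` threshold and the function names are assumptions; the source does not say what cutoff, if any, the server applies.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def unbound_variable_error(symbol, bindings, max_distance=2):
    """Build the structured error, adding a suggestion when a close name exists."""
    error = {"error": "UnboundVariable", "symbol": symbol}
    best = min(bindings, key=lambda name: levenshtein(symbol, name), default=None)
    if best is not None and levenshtein(symbol, best) <= max_distance:
        error["suggestion"] = f"Did you mean {best}?"
    return error
```

Plain Levenshtein counts the swapped letters in `fetch-emials` as two substitutions, so a cutoff below 2 would miss this particular typo.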

Tool Boundary Design

The MCP server exposes two tools to the agent:

lisp_eval: Evaluate a Lisp expression and return the result
lisp_env: Inspect the current environment bindings

The agent does not call fetch-emails directly. It calls lisp_eval with a program that calls fetch-emails. The distinction matters because it changes where data lives.

In a traditional MCP setup:

Agent -> fetch_emails tool -> JSON result -> Agent context -> filter tool -> JSON result -> Agent context

In code mode:

Agent -> lisp_eval tool -> (fetch-emails + filter in sandbox) -> final result -> Agent context

The intermediate email bodies never serialize to JSON and never enter the conversation. The agent writes the data pipeline once and gets back only what it asked for.
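The difference in what reaches the model can be estimated with a small Python sketch. Everything here is hypothetical: `fetch_emails` returns synthetic data, and serialized character counts stand in for tokens.

```python
import json

def fetch_emails():
    """Hypothetical tool: 50 emails from 5 senders, each with a 2000-char body."""
    return [{"sender": f"user{i % 5}@example.com", "body": "x" * 2000}
            for i in range(50)]

def traditional_flow():
    """Each tool result is serialized to JSON and handed back to the model."""
    emails_json = json.dumps(fetch_emails())       # enters the context
    senders = sorted({e["sender"] for e in json.loads(emails_json)})
    result_json = json.dumps(senders)              # enters the context again
    return len(emails_json) + len(result_json)

def code_mode_flow():
    """One eval-style call: the pipeline runs server-side, only the result leaves."""
    senders = sorted({e["sender"] for e in fetch_emails()})
    return len(json.dumps(senders))
```

In this synthetic case the traditional flow pushes over 100,000 characters through the conversation, while the code-mode flow returns a list of five addresses.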

Why Lisp Instead of Python or JavaScript

Python and JavaScript are the default choices for agent code execution. They have large ecosystems, and models have seen far more Python and JavaScript than Lisp in training. But they are harder to sandbox.

| Concern | Python/JavaScript | Lisp |
| --- | --- | --- |
| Sandboxing | Requires restricted execution environments (RestrictedPython, vm2) with known escapes | Evaluator is a 500-line interpreter with no FFI or syscalls |
| State serialization | Pickle, JSON, or custom serialization with type mapping | S-expressions serialize natively |
| Introspection | Reflection can leak sandbox internals | Code is data, introspection is safe by default |
| Error recovery | Stack traces reference stdlib internals | Errors are symbolic and easy to parse |
| Dependency management | Agent might try to import unavailable modules | No import mechanism, all tools are explicit bindings |

The tradeoff is that the model writes worse Lisp than Python. But the model is not writing production code. It is writing short-lived data pipelines. Lisp’s simplicity makes the sandbox smaller and the failure modes easier to reason about.

Deployment Shape

The MCP server runs as a sidecar process. The agent (Claude Desktop, custom orchestrator, etc.) connects over stdio or HTTP.

Typical setup:

┌─────────────┐
│   Agent     │
│  (Claude)   │
└──────┬──────┘
       │ MCP protocol
       │
┌──────▼──────┐
│ ptc_runner  │
│ MCP Server  │
├─────────────┤
│  Sandbox    │
│  (Lisp)     │
└──────┬──────┘
       │ Tool calls
       │
┌──────▼──────┐
│  Gmail API  │
│ Calendar API│
│  Phone API  │
└─────────────┘

The server is stateless except for session data. You can run multiple instances behind a load balancer if you serialize sessions to a shared store (Redis, Postgres).

Resource limits are enforced per invocation, not per server. A runaway evaluation in one session does not affect others.

Observability

The server logs:

Each lisp_eval invocation with the submitted code
Evaluation time and memory usage
Errors and timeouts
Tool calls made from within the sandbox

Structured logs go to stdout as JSON. You can pipe them to any log aggregator.
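A log record of this kind could be emitted as in the Python sketch below. The field names are assumptions chosen to mirror the list above, not the server's actual schema.

```python
import json
import sys
import time

def log_eval(session_id, code, elapsed_ms, memory_bytes, tool_calls,
             error=None, stream=sys.stdout):
    """Write one JSON object per line so any log aggregator can parse it."""
    record = {
        "event": "lisp_eval",
        "ts": time.time(),
        "session": session_id,
        "code": code,                  # the submitted program
        "elapsed_ms": elapsed_ms,
        "memory_bytes": memory_bytes,
        "tool_calls": tool_calls,      # tools invoked from inside the sandbox
        "error": error,                # None on success
    }
    line = json.dumps(record)
    stream.write(line + "\n")
    return line
```

One object per line keeps the output friendly to tools that expect newline-delimited JSON.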

The lisp_env tool gives the agent introspection. It can ask “what bindings are available?” and get back a list of tool names and their signatures. This helps the agent recover from errors without human intervention.
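An environment listing like this can be sketched in Python with `inspect.signature`. The bindings and the output format are hypothetical; the source does not show what lisp_env actually returns.

```python
import inspect

def fetch_emails(limit=50):
    """Hypothetical tool binding."""
    return []

def group_by(items, key):
    """Hypothetical standard-library function."""
    return {}

def describe_env(env):
    """List each binding: functions with their signature, data with its value."""
    lines = []
    for name, value in env.items():
        if callable(value):
            lines.append(f"{name}{inspect.signature(value)}")
        else:
            lines.append(f"{name} = {value!r}")
    return sorted(lines)

env = {"fetch-emails": fetch_emails, "group-by": group_by, "email-stats": [1, 2]}
listing = describe_env(env)
```

Returning signatures alongside names gives the agent enough to correct a wrong argument list as well as a misspelled function name.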

Failure Modes

The sandbox does not protect against:

Logic bugs in agent-generated code (the agent might write a valid program that runs cleanly and still does the wrong thing)
Denial of service via many small requests (rate limiting is outside the sandbox)
Prompt injection that convinces the agent to exfiltrate data through tool calls (the sandbox cannot distinguish malicious intent from legitimate use)

The sandbox does protect against:

Direct filesystem or network access from evaluated code (the only route out is through the tool bindings you expose)
Memory exhaustion
Infinite loops
Crashes that take down the MCP server

The threat model assumes the agent is not adversarial, but the code it writes is untrusted. This is the right model for personal assistants and internal tools. It is not sufficient for multi-tenant systems where agents might be actively hostile.

Technical Verdict

Use a sandboxed REPL when:

Your agent needs to process large datasets returned by MCP tools
You want to keep intermediate results out of the context window
You control the deployment environment and can enforce resource limits
The agent is writing short-lived data pipelines, not production software

Avoid it when:

The agent needs to call tools sequentially with human-in-the-loop approval (code mode hides individual tool calls)
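You need every tool invocation to appear as its own MCP call in the client's audit trail (the client sees one lisp_eval call that wraps many internal tool calls, and the inner calls show up only in the server's own logs)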
The model has no training data for your execution language (Lisp is a harder sell than Python)
You are building a multi-tenant system where agents might be adversarial (you need stronger isolation than an in-process interpreter sandbox)

The ptc_runner approach is not the only way to do code mode. But it shows the plumbing: isolation boundaries, state management, error handling, and tool boundary design. If you are building MCP servers that execute agent-generated code, these are the problems you will hit.

Source Links

The Right Tool for Code Mode (primary source)
ptc_runner on GitHub