mech.app
Dev Tools

Recursive Language Models: Breaking Context Limits with Sandboxed Orchestration

How Amazon Bedrock AgentCore uses Code Interpreter as persistent memory to orchestrate sub-LLM calls from Python, processing unbounded documents through...

Source: aws.amazon.com
Recursive Language Models: Breaking Context Limits with Sandboxed Orchestration

Context windows are hard limits. When you need to analyze a 500-page annual report plus supplementary filings, you either exceed the model’s token budget or hit the “lost in the middle” problem where the model loses track of information buried in long inputs. Prompt engineering does not fix this.

Recursive Language Models (RLM) solve the problem by treating documents as external environments instead of context. The parent LLM orchestrates sub-LLM calls from within a sandboxed Python environment, using Code Interpreter as persistent working memory. Each sub-call analyzes a specific section, and the parent aggregates results across recursive decomposition steps.

AWS just published implementation guidance using Bedrock AgentCore and the Strands SDK. This is not a bigger context window. It is an orchestration pattern that decouples document size from model limits.

How Recursive Language Models Work

The parent LLM does not see the entire document. It sees a task description and a set of tools. One tool is a sub-LLM call that accepts a document section and a focused question. The parent decides what to analyze, the sandboxed Python environment executes the sub-call, and the result lands in persistent memory.

The flow:

  1. Parent LLM receives a high-level task (e.g., “Compare revenue growth across two annual reports”).
  2. Parent generates Python code to load document sections and invoke sub-LLM calls.
  3. Code Interpreter executes the Python in a sandbox, making sub-LLM API calls as needed.
  4. Sub-LLM results are stored in variables within the sandbox session.
  5. Parent LLM reads the results from memory and decides the next step (more sub-calls, aggregation, or final answer).

The parent never loads the full document into its context. It reasons over summaries and intermediate results stored in the sandbox.

Orchestration Boundary and State Management

The orchestration boundary sits between the parent LLM’s planning loop and the sandboxed Python runtime. The parent emits Python code. The Code Interpreter executes it. The parent reads the output.

State lives in the sandbox session. Variables persist across tool calls within a single session. This means the parent can:

  • Load a document section into a variable.
  • Call a sub-LLM to analyze it.
  • Store the result in another variable.
  • Repeat for the next section.
  • Aggregate results from all variables.

The sandbox does not leak state between sessions. Each new task starts with a clean environment. This isolation prevents cross-contamination when processing multiple documents or users.

The parent LLM does not maintain state in its own context. It delegates state management to the sandbox. This keeps the parent’s context window small and focused on orchestration logic.

Code Interpreter as Persistent Memory

Code Interpreter is not just a tool executor. It is working memory. The parent LLM writes intermediate results to variables, and those variables persist for the duration of the session.

Example flow for a two-document comparison:

# Parent LLM generates this code
doc1_section1 = load_section("report_2024.pdf", pages=(1, 50))
doc1_summary = sub_llm_call(doc1_section1, "Summarize revenue trends")

doc2_section1 = load_section("report_2025.pdf", pages=(1, 50))
doc2_summary = sub_llm_call(doc2_section1, "Summarize revenue trends")

comparison = sub_llm_call(
    f"Doc1: {doc1_summary}\nDoc2: {doc2_summary}",
    "Compare revenue growth between the two years"
)

print(comparison)

The parent never sees the raw document text. It sees the summaries stored in doc1_summary and doc2_summary. The sandbox holds the intermediate state, and the parent orchestrates the next step based on what it reads from memory.

This pattern scales to arbitrary document lengths. The parent can loop over sections, accumulate results in a list, and aggregate at the end. The only limit is the sandbox session timeout and memory allocation.

Sub-LLM Call Mechanics

A sub-LLM call is a function invocation from within the sandbox. The Strands SDK provides a wrapper that:

  • Accepts a document chunk and a question.
  • Sends the chunk to a Bedrock model (e.g., Claude).
  • Returns the model’s response as a string.

The parent LLM does not make the API call directly. It generates Python code that calls the wrapper function. The Code Interpreter executes the code and returns the result.

This indirection matters for two reasons:

  1. The parent’s context stays small. It never holds the document chunk.
  2. The sandbox enforces execution boundaries. If a sub-LLM call times out, the sandbox catches the exception and returns an error message to the parent.

The parent can handle errors by retrying with a smaller chunk, skipping the section, or escalating to a human operator.

Failure Modes and Mitigation

Failure ModeSymptomMitigation
Sub-LLM timeoutSandbox returns timeout error after 60sParent retries with smaller chunk or skips section
Malformed sub-LLM responseJSON parse error or unexpected formatParent logs error, uses fallback summary, or re-prompts with stricter format instructions
Sandbox memory exhaustionCode Interpreter crashes mid-sessionParent splits document into smaller batches, processes sequentially
Parent hallucinationParent generates invalid Python or incorrect orchestration logicAdd validation layer that checks generated code syntax before execution
State corruptionVariable overwritten by mistake in generated codeUse immutable data structures or append-only lists for intermediate results

The most common failure is sub-LLM timeout. Large document sections take time to process. The parent should chunk aggressively and set conservative timeout thresholds.

Malformed responses happen when the sub-LLM returns prose instead of structured data. The parent’s prompt to the sub-LLM should specify exact output format (e.g., “Return JSON with keys: summary, key_metrics, risks”).

Sandbox memory exhaustion is rare but possible with very large documents. The parent should process in batches and clear variables after aggregation.

Deployment Shape

The deployment has three components:

  1. Parent LLM: A Bedrock model (e.g., Claude Sonnet) that orchestrates the workflow.
  2. Code Interpreter: A Bedrock AgentCore tool that executes Python in a sandboxed environment.
  3. Sub-LLM pool: One or more Bedrock models invoked from within the sandbox via the Strands SDK.

The parent and sub-LLMs can be the same model or different models. Using a smaller, faster model for sub-calls reduces latency and cost. Using a larger model for the parent improves orchestration quality.

The Code Interpreter runs in AWS-managed infrastructure. You do not provision servers. You pay per execution second and per API call.

The Strands SDK is a Python library that wraps Bedrock API calls. You install it in the Code Interpreter environment by including it in the parent’s tool definition.

Observability and Debugging

You need visibility into three layers:

  1. Parent LLM traces: What code did the parent generate? What decisions did it make?
  2. Sandbox execution logs: What Python code ran? What errors occurred?
  3. Sub-LLM call logs: What chunks were sent? What responses came back?

Bedrock AgentCore logs Code Interpreter execution to CloudWatch. You see the generated Python, stdout, stderr, and execution time.

The Strands SDK can log sub-LLM calls to a structured format (e.g., JSON lines). You pipe these logs to S3 or a monitoring service.

The parent LLM’s reasoning is harder to observe. You can add explicit logging steps in the parent’s prompt (e.g., “Before each sub-call, print your reasoning”). This adds tokens but improves debuggability.

When a workflow fails, start with the sandbox logs. Most failures are Python errors or sub-LLM timeouts. If the sandbox logs look clean, check the parent’s generated code for logic errors.

Security Boundaries

The sandbox enforces strict boundaries:

  • No network access except to Bedrock API endpoints.
  • No file system writes outside the session directory.
  • No access to AWS credentials or environment variables.

The parent LLM cannot exfiltrate data by generating code that writes to external services. The sandbox blocks outbound connections.

The sub-LLM calls go through Bedrock’s API, which enforces IAM policies. You control which models the sandbox can invoke and which data the models can access.

The parent’s generated code runs in an isolated environment. If the parent generates malicious code (e.g., infinite loop), the sandbox terminates the session after a timeout.

You should still validate the parent’s generated code before execution. Add a static analysis step that checks for dangerous patterns (e.g., eval(), exec(), os.system()).

When to Use This Pattern

Use Recursive Language Models when:

  • Document size exceeds the largest available context window.
  • You need to compare or aggregate information across multiple long documents.
  • Latency is acceptable (sub-LLM calls add overhead).
  • You can tolerate occasional sub-call failures and need retry logic.

Avoid this pattern when:

  • Documents fit comfortably in a single context window (simpler to just send the whole thing).
  • You need real-time responses (orchestration adds seconds to minutes of latency).
  • Your task requires holistic reasoning over the entire document (recursive decomposition may miss cross-section dependencies).
  • You lack observability infrastructure to debug multi-layer failures.

Technical Verdict

Recursive Language Models shift the bottleneck from context window size to orchestration complexity. You trade a hard limit for a soft one. The parent LLM must be smart enough to decompose the task correctly, and you must handle sub-call failures gracefully.

This pattern works best for tasks that naturally decompose into independent sub-questions (e.g., “Summarize each section, then compare summaries”). It works poorly for tasks that require holistic reasoning (e.g., “Identify subtle contradictions across the entire document”).

The Code Interpreter as persistent memory is the key insight. Without it, you would need to pass intermediate results back through the parent’s context, which reintroduces the context window problem. The sandbox keeps the parent’s context small and focused.

If you are hitting context limits and your task can be decomposed, this is a practical alternative to waiting for bigger models. If your task requires holistic reasoning, you still need a bigger context window or a different approach (e.g., retrieval-augmented generation with careful chunking).