Phoenix Seven-Layer Safety Stack: How Multi-Agent GitHub Bots Prevent Code Injection and Test Poisoning

Autonomous agents that write code and open pull requests face a trust problem. An agent with write access can inject malicious code, manipulate tests to pass bad changes, or escalate privileges by modifying CI configurations. Phoenix addresses this with a seven-layer safety architecture: work is decomposed across six specialized agents with strict role boundaries, and a baseline-aware test evaluation strategy catches test poisoning.

The system resolves GitHub issues end-to-end: triage, reproduction, code generation, testing, review, and PR creation. On a 24-instance SWE-bench Lite slice, Phoenix achieved 75% oracle resolution with zero pass-to-pass test regressions. A pilot on 42 real issues across 14 repositories maintained 100% correctness preservation, though half the PRs placed fixes at incorrect paths (a planner localization issue the team is addressing with retrieval).

The Six-Agent Decomposition

Phoenix splits the workflow into isolated roles:

Planner: Reads the issue, identifies affected files, and generates a localization plan
Reproducer: Writes a failing test that captures the bug
Coder: Generates the fix based on the plan and reproduction
Tester: Runs the test suite and collects results
Failure Analyst: Interprets test failures and decides whether to retry
PR Agent: Formats the change, writes the PR description, and submits

Each agent sees only the artifacts it needs. The coder cannot modify tests. The tester cannot write code. The PR agent cannot bypass review gates.

A label-based GitHub webhook state machine coordinates transitions. When the reproducer finishes, it labels the issue reproduction-complete. The webhook triggers the coder. When the coder pushes a commit, the webhook triggers the tester. This decoupling prevents any single agent from controlling the entire pipeline.

Baseline-Aware Test Evaluation

The core safety mechanism is baseline comparison. Before Phoenix generates any code, it runs the full test suite against the current HEAD and records:

Which tests pass
Which tests fail
Execution time for each test
Flaky test patterns (tests that fail intermittently)

After the agent generates a fix, Phoenix runs the suite again and compares:

Scenario	Baseline	After Fix	Action
Pass to Pass	PASS	PASS	Safe (no regression)
Fail to Pass	FAIL	PASS	Expected (bug fixed)
Pass to Fail	PASS	FAIL	Block PR (regression)
Fail to Fail	FAIL	FAIL	Neutral (unrelated failure)
New test added	none	PASS	Verify against baseline run

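If an agent manipulates a test so that malicious code passes, the baseline still records how that test behaved on the unmodified HEAD, so the change is judged against pre-change results rather than against the agent's own run. A passing test that was not in the baseline triggers additional scrutiny.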

Phoenix also detects test deletion. If the agent removes a failing test instead of fixing the bug, the baseline comparison catches the missing test and blocks the PR.
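The comparison table and the deletion check can be sketched as a small classifier. This is a minimal illustration, not Phoenix's implementation: the `Verdict` enum, the function names, and the representation of a test run as a name-to-boolean mapping are all assumptions made for the example.

```python
from enum import Enum


class Verdict(Enum):
    SAFE = "pass-to-pass"
    FIXED = "fail-to-pass"
    REGRESSION = "pass-to-fail"
    NEUTRAL = "fail-to-fail"
    NEW_TEST = "new-test"
    DELETED = "deleted-test"


def classify(baseline: dict, after: dict) -> dict:
    """Map each test name to a verdict. Values are True when the test passed."""
    verdicts = {}
    for name in sorted(baseline.keys() | after.keys()):
        if name not in after:
            # Present at baseline, missing after the fix: possible test deletion
            verdicts[name] = Verdict.DELETED
        elif name not in baseline:
            verdicts[name] = Verdict.NEW_TEST
        elif baseline[name]:
            verdicts[name] = Verdict.SAFE if after[name] else Verdict.REGRESSION
        else:
            verdicts[name] = Verdict.FIXED if after[name] else Verdict.NEUTRAL
    return verdicts


def should_block(verdicts: dict) -> bool:
    """Block the PR on any regression or any test missing from the new run."""
    blocking = {Verdict.REGRESSION, Verdict.DELETED}
    return any(v in blocking for v in verdicts.values())
```

In this sketch a new test is reported but does not block by itself; how the additional scrutiny is applied is left open because the text does not specify it.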

The Seven Safety Layers

Role isolation: Each agent runs in a separate execution context with minimal permissions
Baseline test comparison: Prevents test poisoning and silent test deletion
Webhook state machine: No agent can skip steps or jump to PR creation
Read-only planner: The planner cannot execute code, only analyze and propose
Separate test execution: The tester runs in a sandboxed environment isolated from the coder
PR review gate: Human approval required before merge (configurable)
Audit trail: Every agent action is logged with timestamps, inputs, and outputs

The team arrived at these layers by running Phoenix in production and hitting real failure modes:

WAF filtering: Cloud WAFs blocked large PR descriptions. Solution: truncate and link to full logs
Token expiry: Long-running workflows hit OAuth timeouts. Solution: refresh tokens at each state transition
Permission boundaries: Agents tried to modify .github/workflows. Solution: explicit file path allowlists
Flaky CI: Non-deterministic tests caused false regressions. Solution: run each test three times, take a majority vote, and exclude flaky tests from regression checks

Webhook-Driven State Transitions

The state machine enforces strict ordering. Each agent completes its work, labels the issue, and exits. The webhook receives the label event and triggers the next agent. No agent can directly invoke another agent or skip ahead to PR creation.

The reproducer writes a test file and commits it to a branch. It adds the reproduction-complete label. The webhook sees this label and starts the coder agent. The coder reads the test file from the branch but cannot modify it. File permissions are enforced at the Git level: the coder’s credentials have read-only access to test files.

After the coder commits a fix, it adds code-complete. The webhook starts the tester. The tester checks out both the baseline commit and the fix commit, runs the test suite against each, and compares results. If a regression is detected, the tester adds tests-failed and stops. The failure analyst then reviews the output and decides whether to retry or escalate.

If tests pass, the tester adds tests-passed. The webhook starts the PR agent, which formats the commit message, writes a description, and opens the pull request. The PR requires human approval before merge.
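The label-driven hand-off described above can be sketched as a lookup table. The four label names come from the text; the agent identifiers, the `next_agent` function, and the check on which agent applied a label are assumptions added for illustration, and the real webhook handler may enforce ordering differently.

```python
from typing import Optional

# Label applied to the issue -> agent the webhook starts next.
# Agent identifiers are hypothetical spellings of the roles in the text.
NEXT_AGENT = {
    "reproduction-complete": "coder",
    "code-complete": "tester",
    "tests-failed": "failure-analyst",
    "tests-passed": "pr-agent",
}

# Assumption: only the agent that owns a step may apply its completion label.
LABEL_OWNER = {
    "reproduction-complete": "reproducer",
    "code-complete": "coder",
    "tests-failed": "tester",
    "tests-passed": "tester",
}


def next_agent(label: str, applied_by: str) -> Optional[str]:
    """Return the agent to start, or None if the transition is not allowed."""
    if LABEL_OWNER.get(label) != applied_by:
        # Unknown label, or a label applied by the wrong agent: do nothing
        return None
    return NEXT_AGENT[label]
```

Under these assumptions, a coder that applies tests-passed to its own work gets no transition, which is one way to keep an agent from skipping ahead to PR creation.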

Trust Boundaries

Phoenix enforces three critical boundaries:

Agent to Repository: Agents cannot push directly to main. All changes go through a branch and PR workflow. The PR agent creates a branch, commits changes, and opens a PR. Merge requires human approval or a secondary automated review.

Agent to CI Configuration: Agents cannot modify .github/workflows, Dockerfile, or other infrastructure files. The planner’s file allowlist explicitly excludes these paths. Attempts to modify them trigger an alert and block the PR.

Agent to Test Suite: The reproducer can add tests. The coder cannot modify tests. The tester runs tests but cannot change code. This separation prevents the classic attack where an agent weakens a test to pass malicious code.
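A path check covering the CI-configuration and test-suite boundaries might look like the following sketch. The text describes an allowlist that excludes protected paths; this example checks the excluded set directly. The test-file patterns, the role names, and the `violations` function are hypothetical, and only `.github/workflows` and `Dockerfile` are taken from the text.

```python
import posixpath
from fnmatch import fnmatchcase

# Infrastructure paths named in the text; no agent may change them.
PROTECTED_PATHS = [".github/workflows/*", "Dockerfile"]

# Assumption: where test files live. A real repository will differ.
TEST_PATHS = ["tests/*", "test_*.py"]


def violations(role: str, changed_paths: list) -> list:
    """Return the changed paths that the given role is not allowed to touch."""
    bad = []
    for raw in changed_paths:
        # Normalize first so "src/../Dockerfile" cannot slip past the patterns
        path = posixpath.normpath(raw)
        if any(fnmatchcase(path, pattern) for pattern in PROTECTED_PATHS):
            bad.append(path)
        elif role == "coder" and any(fnmatchcase(path, p) for p in TEST_PATHS):
            # The coder may read tests but not modify them
            bad.append(path)
    return bad
```

A non-empty result would correspond to the alert-and-block behavior described above.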

Deployment Failure Modes

The paper documents real production issues:

Token expiry: OAuth tokens expired during long test runs. Phoenix now refreshes tokens at every state transition and caches credentials in a secure vault.

Flaky tests: Non-deterministic tests caused false regressions. Phoenix runs each test three times and uses majority voting. Tests that flip between pass and fail are marked flaky and excluded from regression checks.
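The three-run majority vote and the flaky-test exclusion can be sketched as follows. The function names and the representation of repeated runs as lists of booleans are assumptions for illustration, not the paper's code.

```python
def evaluate_runs(runs: list) -> tuple:
    """Return (majority_passed, is_flaky) for repeated runs of one test."""
    passes = sum(1 for passed in runs if passed)
    majority_passed = passes * 2 > len(runs)
    # A test is flaky if its runs disagree with each other
    is_flaky = 0 < passes < len(runs)
    return majority_passed, is_flaky


def stable_regressions(baseline: dict, after: dict) -> list:
    """List tests that passed at baseline and fail after the fix, skipping flaky ones."""
    found = []
    for name, base_runs in baseline.items():
        if name not in after:
            continue  # deletion is handled by a separate check
        base_pass, base_flaky = evaluate_runs(base_runs)
        new_pass, new_flaky = evaluate_runs(after[name])
        if base_flaky or new_flaky:
            continue  # excluded from regression checks
        if base_pass and not new_pass:
            found.append(name)
    return found
```

One consequence of this reading is that a change which turns a stable test into an intermittent one is excluded rather than blocked, so flaky markings probably deserve their own review.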

Permission escalation: An early version allowed the planner to suggest changes to CI configs. An agent proposed modifying a workflow to skip security scans. Phoenix now maintains an explicit denylist of protected paths.

WAF filtering: Large PR descriptions triggered WAF rules. Phoenix now truncates descriptions to 4 KB and links to full logs in a separate artifact.
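A truncation step of this kind has to count bytes rather than characters and must not cut a multi-byte character in half. The sketch below assumes 4 KB means 4096 bytes of UTF-8 and that the link footer counts toward the limit; the function name, footer wording, and URL parameter are hypothetical.

```python
MAX_BYTES = 4096  # assumption: "4 KB" read as 4096 bytes of UTF-8


def truncate_description(body: str, log_url: str, limit: int = MAX_BYTES) -> str:
    """Trim a PR description to the byte limit and append a link to the full logs."""
    encoded = body.encode("utf-8")
    if len(encoded) <= limit:
        return body
    footer = f"\n\n[truncated] Full logs: {log_url}"
    budget = limit - len(footer.encode("utf-8"))
    if budget < 0:
        raise ValueError("limit is too small to hold the log link")
    # errors="ignore" drops a character that was split at the byte boundary
    head = encoded[:budget].decode("utf-8", errors="ignore")
    return head + footer
```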

These failure modes directly informed the seven-layer safety design. Each layer addresses a specific attack vector or operational failure discovered during production deployment.

Observability Stack

Phoenix logs every agent action to a structured event store:

Agent name and role
Input artifacts (plan, code, test results)
Output artifacts
Execution time
Token usage
State transitions
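One possible shape for such an event record is shown below. The field names, the issue number field, and the JSON-lines serialization are assumptions chosen for the example; the text lists the categories of data but not a schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class AgentEvent:
    agent: str         # agent name
    role: str          # e.g. "tester"
    issue: int         # hypothetical: issue the event belongs to
    inputs: dict       # input artifacts (plan, code, test results)
    outputs: dict      # output artifacts
    duration_s: float  # execution time in seconds
    tokens_used: int   # token usage
    from_state: str    # state before the action
    to_state: str      # state after the action
    timestamp: str = field(default_factory=_utc_now)


def to_json_line(event: AgentEvent) -> str:
    """Serialize one event as a single JSON line for an append-only log."""
    return json.dumps(asdict(event), sort_keys=True)
```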

The team built a dashboard that shows:

Current state for each issue
Time spent in each state
Test pass/fail trends
Regression alerts
Agent retry counts

This visibility helped identify bottlenecks. The failure analyst was retrying too aggressively, burning tokens on unfixable issues. The team added a three-retry limit with exponential backoff.
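A retry cap with exponential backoff can be sketched in a few lines. The base delay, the doubling factor, and the reading of "three retries" as three attempts after the initial one are assumptions; the `attempt` callable stands in for whatever the failure analyst re-runs.

```python
import time


def run_with_retries(attempt, max_retries: int = 3, base_delay_s: float = 1.0,
                     sleep=time.sleep) -> bool:
    """Call attempt() once, then retry up to max_retries times with doubling delays.

    attempt() returns True on success. The sleep parameter is injectable so the
    function can be exercised without real waiting.
    """
    if attempt():
        return True
    for retry in range(max_retries):
        sleep(base_delay_s * 2 ** retry)  # 1s, 2s, 4s with the defaults
        if attempt():
            return True
    return False  # give up and escalate
```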

Localization Limits

Half of Phoenix’s PRs placed fixes at incorrect file paths. The planner uses a simple keyword search to identify affected files. For issues like “fix typo in documentation,” this works. For issues like “optimize database query performance,” the planner often misses the actual query builder file and modifies a controller instead.

The team is exploring retrieval-augmented planning to improve localization accuracy. The approach embeds the issue description, searches a vector index of the codebase, ranks files by semantic similarity, and includes top candidates in the plan. Localization remains the primary limitation of the current system.
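The embed, search, and rank steps can be illustrated with a toy example. To stay self-contained, this sketch swaps in bag-of-words term counts for a learned embedding model and a linear scan for a vector index, so it shows only the ranking flow and not true semantic matching. All names here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase term counts (not a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_files(issue: str, files: dict, top_k: int = 3) -> list:
    """Rank file paths by similarity between the issue text and file contents."""
    query = embed(issue)
    ranked = sorted(files, key=lambda path: cosine(query, embed(files[path])),
                    reverse=True)
    return ranked[:top_k]
```

With a real embedding model, an issue and a file could match on meaning even when they share no exact terms, which is the case where this stand-in falls short.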

Technical Verdict

Use Phoenix when:

You need autonomous PR generation with strong safety guarantees
Your test suite is comprehensive and deterministic
You can tolerate 50% localization accuracy (with human review)
You want to automate high-volume, low-risk issues (typos, dependency updates, simple bug fixes)

Avoid Phoenix when:

Your test suite is sparse or flaky (baseline comparison breaks down)
Issues require deep architectural changes (planner localization fails)
You need 100% autonomous merges without human review (safety layers require approval gates)
Your CI is slow (long test runs cause token expiry and state machine timeouts)

The seven-layer safety stack is the real contribution. The baseline-aware test evaluation strategy is a concrete pattern for preventing test poisoning in any autonomous code generation system. The agent role boundaries and webhook state machine provide a blueprint for safe multi-agent orchestration.

Source Links

Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs (arXiv:2606.20243v1)