Your CI pipeline was built for a human who can squint at a Grafana panel, remember last Tuesday’s outage, and feel uneasy. An LLM agent opening pull requests has no scar tissue. It sees green checkmarks and merges. Then you roll back four out of six PRs in a week.
The problem is not model intelligence. The problem is that every layer of your CI tooling assumes a human will interpret unstructured logs, click through dashboards, and apply context the agent does not have. Exit codes, red text in GitHub Actions, and manual approval gates are human interfaces. Agents need machine-readable verdicts, deterministic infrastructure, and hard blast-radius limits.
This is the gap people are calling an agent harness: the runtime layer that wraps the model, gives it ephemeral infra to test changes, enforces policy boundaries, and returns a structured pass/fail signal the agent can act on.
Why Traditional CI Breaks for Agents
Standard CI/CD pipelines expose five failure modes when the consumer is an LLM:
Flaky preview environments. Agents cannot distinguish between “the test failed because your code is bad” and “the test failed because the staging database was restarting.” A human checks Slack, sees the incident channel, and re-runs the build. The agent retries the same broken code three times, then escalates to a human anyway.
Unstructured log output. GitHub Actions streams text. Agents parse it with regex or ask the LLM to summarize, which adds latency and hallucination risk. You need JSON artifacts with explicit success/failure states, not ANSI color codes.
No blast-radius enforcement. A human sees a Terraform plan that touches 47 resources and pauses. The agent sees terraform apply succeeded and moves on. You need policy-as-code gates (OPA, Sentinel, Kyverno) that reject changes exceeding a resource diff threshold before the agent ever sees the result.
Missing replay traffic. Unit tests pass. Integration tests pass. The agent merges. Production traffic hits a code path your test suite never exercised. Humans know to run a subset of prod traffic through the preview environment. Agents do not unless you wire it into the pipeline.
Approval gates designed for Slack. Your pipeline has a manual approval step that posts to #deployments and waits for a thumbs-up emoji. The agent cannot read Slack. It cannot interpret “looks good but wait until after the sales demo.” You need programmatic approval: a webhook the agent calls with a structured payload, and a decision service that returns approved: true or approved: false with a reason code.
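A programmatic approval gate of this kind could be sketched as follows. This is a minimal illustration, not a reference implementation: the `ApprovalRequest` fields, the `freeze_active` flag (standing in for "wait until after the sales demo"), and the reason codes are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ApprovalRequest:
    pr: int
    confidence: float     # harness confidence score in [0, 1]
    policy_allowed: bool  # result of the policy gate
    freeze_active: bool   # hypothetical flag: True while a deploy freeze is in effect


def decide(req: ApprovalRequest, min_confidence: float = 0.8) -> dict:
    """Return a structured decision with a machine-readable reason code."""
    if req.freeze_active:
        return {"approved": False, "reason_code": "DEPLOY_FREEZE"}
    if not req.policy_allowed:
        return {"approved": False, "reason_code": "POLICY_VIOLATION"}
    if req.confidence < min_confidence:
        return {"approved": False, "reason_code": "LOW_CONFIDENCE"}
    return {"approved": True, "reason_code": "OK"}


# A freeze blocks the merge even though the change itself looks healthy.
print(decide(ApprovalRequest(pr=1234, confidence=0.94,
                             policy_allowed=True, freeze_active=True)))
```

The point of the reason code is that the agent can branch on it without interpreting free text.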
What an Agent Harness Actually Does
An agent harness is infrastructure that sits between the model and your production systems. It provides:
- Ephemeral preview environments that spin up per-PR, run the full stack, and tear down after the verdict.
- Deterministic test execution with retries, flake detection, and structured result schemas.
- Policy enforcement that blocks changes violating blast-radius, security, or cost rules before the agent sees success.
- Replay traffic from production logs so the preview environment exercises real request patterns.
- Machine-readable verdicts in JSON with pass/fail, confidence score, policy violations, and next actions.
The harness does not make the model smarter. It makes the environment legible to the model.
Architecture: Harness Components
A working harness has four layers:
1. Ephemeral Infrastructure Controller
Spins up a Kubernetes namespace or AWS account per pull request. Uses Terraform or Crossplane to provision:
- Application pods
- Backing services (Postgres, Redis, message queues)
- Network policies isolating the preview from production
- Observability stack (Prometheus, Jaeger) scoped to the namespace
Teardown happens on PR close or after a TTL (usually 2 hours). Cost per slot: $180/month on AWS for a microservices stack with 8 services and a small RDS instance.
| Cost Component | Monthly Cost | Notes |
|---|---|---|
| EKS compute (t3.medium nodes) | $73 | 2 nodes, 50% utilization |
| RDS (db.t3.micro) | $15 | Preview database per namespace |
| EBS storage | $12 | 100GB across preview namespaces |
| Data transfer | $8 | Intra-VPC traffic |
| Load balancer | $18 | ALB for preview ingress |
| CloudWatch logs | $6 | 10GB/month retention |
| Ephemeral snapshots | $48 | S3 storage for Terraform state |
| Total per agent slot | $180 | Assumes 2-hour TTL, 8 PRs/week |
Cost scales linearly with agent slots: three concurrent slots cost $540/month.
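The arithmetic behind these figures can be checked with a short script. It uses only the numbers from the table; the average of 52/12 weeks per month is an assumption made here to turn the monthly slot cost into a per-PR figure.

```python
# Monthly cost components per agent slot, copied from the table above (USD).
COMPONENTS = {
    "eks_compute": 73,
    "rds": 15,
    "ebs_storage": 12,
    "data_transfer": 8,
    "load_balancer": 18,
    "cloudwatch_logs": 6,
    "ephemeral_snapshots": 48,
}

SLOT_COST_PER_MONTH = sum(COMPONENTS.values())
PRS_PER_WEEK = 8            # table assumption
WEEKS_PER_MONTH = 52 / 12   # assumption: average weeks per month


def monthly_cost(agent_slots: int) -> int:
    """Total monthly cost, scaling linearly with the number of slots."""
    return agent_slots * SLOT_COST_PER_MONTH


def cost_per_pr(prs_per_week: float = PRS_PER_WEEK) -> float:
    """Slot cost amortized over the PRs that pass through it in a month."""
    return SLOT_COST_PER_MONTH / (prs_per_week * WEEKS_PER_MONTH)


print(SLOT_COST_PER_MONTH)      # 180
print(monthly_cost(3))          # 540
print(round(cost_per_pr(), 2))  # 5.19
```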
2. Policy Gate
Runs before tests execute. Uses Open Policy Agent or Kyverno to evaluate:
- Terraform plan diffs (max 10 resources changed)
- Kubernetes manifest changes (no privilege escalation, no host network)
- Cost estimates (AWS Cost Explorer API or Infracost)
- Security posture (Checkov, Trivy scans)
Returns a structured decision:
```json
{
  "allowed": false,
  "violations": [
    {
      "policy": "blast-radius/terraform-diff",
      "severity": "high",
      "message": "Plan modifies 47 resources, limit is 10"
    }
  ]
}
```
If allowed: false, the pipeline stops. The agent never sees test results because the change is out-of-bounds.
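As a rough sketch of that control flow, the gate can be modeled as a function that collects violations and a pipeline step that refuses to run tests when the gate says no. The function names, the `max_cost_delta` default, and the `cost/monthly-delta` policy name are hypothetical; a real gate would delegate the rules to OPA or Kyverno rather than hard-code them.

```python
from typing import Callable


def evaluate_gate(resources_changed: int, monthly_cost_delta: float,
                  max_resources: int = 10, max_cost_delta: float = 100.0) -> dict:
    """Build a decision document in the shape shown above."""
    violations = []
    if resources_changed > max_resources:
        violations.append({
            "policy": "blast-radius/terraform-diff",
            "severity": "high",
            "message": f"Plan modifies {resources_changed} resources, "
                       f"limit is {max_resources}",
        })
    if monthly_cost_delta > max_cost_delta:
        violations.append({
            "policy": "cost/monthly-delta",  # hypothetical policy name
            "severity": "medium",
            "message": f"Estimated cost increase ${monthly_cost_delta:.2f}/month, "
                       f"limit is ${max_cost_delta:.2f}",
        })
    return {"allowed": not violations, "violations": violations}


def run_pipeline(decision: dict, run_tests: Callable[[], dict]) -> dict:
    """Stop at the policy stage when the gate rejects; tests never run."""
    if not decision["allowed"]:
        return {"stage": "policy", "decision": decision}
    return {"stage": "tests", "results": run_tests()}


decision = evaluate_gate(resources_changed=47, monthly_cost_delta=0.0)
print(run_pipeline(decision, run_tests=lambda: {"passed": 0}))
```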
3. Test Orchestrator
Runs tests in the ephemeral environment. Key differences from human-oriented CI:
- Flake detection: Reruns failures up to 3 times. If a test passes on retry, marks it flaky: true in the result JSON.
- Replay traffic: Pulls a sample of production HTTP logs (anonymized), replays them through the preview environment, and diffs responses against the production baseline.
- Structured output: JUnit XML is not enough. You need:
```json
{
  "verdict": "fail",
  "confidence": 0.72,
  "test_results": {
    "unit": {"passed": 145, "failed": 2, "flaky": 1},
    "integration": {"passed": 23, "failed": 0},
    "replay": {"passed": 891, "failed": 3, "drift_pct": 0.3}
  },
  "failures": [
    {
      "test": "test_user_auth",
      "type": "integration",
      "error": "AssertionError: expected 200, got 401",
      "logs_url": "https://logs.example.com/trace/abc123"
    }
  ]
}
```
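The flake-detection rule from the list above (rerun a failure up to 3 times, tag it flaky if a rerun passes) might look like the sketch below. The function name and the shape of the returned dict are assumptions for illustration; a real orchestrator would wrap its test runner rather than a bare callable.

```python
from typing import Callable


def run_with_flake_detection(test: Callable[[], bool], max_reruns: int = 3) -> dict:
    """Run a test once; on failure, rerun it up to max_reruns times.

    A test that fails first and then passes on a rerun counts as passed
    but is tagged flaky so the instability stays visible downstream.
    """
    if test():
        return {"passed": True, "flaky": False, "attempts": 1}
    for rerun in range(1, max_reruns + 1):
        if test():
            return {"passed": True, "flaky": True, "attempts": rerun + 1}
    return {"passed": False, "flaky": False, "attempts": max_reruns + 1}


# Simulated test that fails once, then passes.
outcomes = iter([False, True])
print(run_with_flake_detection(lambda: next(outcomes)))
```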
The confidence score combines test pass rate, flake rate, and replay drift. The formula:
confidence = (unit_pass_rate × 0.2) + (integration_pass_rate × 0.3) + (replay_pass_rate × 0.5) - (flake_count × 0.05)
Below 0.8, the harness returns verdict: fail even if all tests technically passed. Replay traffic gets the highest weight (50%) because it exercises real production patterns. Flake count penalizes unstable tests.
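To make the formula concrete, here is a small worked calculation. The clamp to the range 0 to 1 is an assumption borrowed from the verdict code later in the article. With the counts from the sample payload above, this sketch yields roughly 0.95, so the 0.72 in that payload is best read as illustrative rather than computed.

```python
def pass_rate(passed: int, failed: int) -> float:
    """Fraction of tests that passed; 0.0 when the suite is empty."""
    total = passed + failed
    return passed / total if total else 0.0


def confidence(unit: tuple, integration: tuple, replay: tuple, flake_count: int) -> float:
    """Weighted score: unit 0.2, integration 0.3, replay 0.5, minus 0.05 per flaky test."""
    score = (0.2 * pass_rate(*unit)
             + 0.3 * pass_rate(*integration)
             + 0.5 * pass_rate(*replay)
             - 0.05 * flake_count)
    return max(0.0, min(1.0, score))  # clamp to [0, 1]


# Counts taken from the sample payload: (passed, failed) per suite, 1 flaky test.
score = confidence(unit=(145, 2), integration=(23, 0), replay=(891, 3), flake_count=1)
print(round(score, 3))  # 0.946

# Every test passes, but five flaky tests pull the score under the 0.8 threshold.
all_green_but_flaky = confidence((10, 0), (10, 0), (10, 0), flake_count=5)
print(round(all_green_but_flaky, 2))  # 0.75
```

The second call shows how a fully green run can still be rejected: the flake penalty alone is enough to drop it below 0.8.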
4. Verdict API
The agent calls POST /harness/verdict with the PR number. The API returns:
```json
{
  "pr": 1234,
  "verdict": "pass",
  "confidence": 0.94,
  "policy_check": "pass",
  "tests": "pass",
  "replay_drift": 0.1,
  "merge_allowed": true,
  "next_action": "merge"
}
```
If merge_allowed: false, the response includes next_action: "fix" or next_action: "escalate" with a reason. The agent uses this to decide whether to iterate on the code or hand off to a human.
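On the agent side, consuming this response can be as simple as the sketch below. The `fix_attempts` counter and the cap of three attempts are assumptions added for the example; the fail-safe default of escalating on anything unexpected is likewise a design choice, not something the API mandates.

```python
def next_step(verdict: dict, fix_attempts: int, max_fix_attempts: int = 3) -> str:
    """Map a verdict response to one of: "merge", "fix", "escalate"."""
    if verdict.get("merge_allowed") is True:
        return "merge"
    # Iterate on the code only while the attempt budget lasts.
    if verdict.get("next_action") == "fix" and fix_attempts < max_fix_attempts:
        return "fix"
    # Missing or unknown actions, and exhausted budgets, go to a human.
    return "escalate"


print(next_step({"merge_allowed": True, "next_action": "merge"}, fix_attempts=0))
print(next_step({"merge_allowed": False, "next_action": "fix"}, fix_attempts=1))
print(next_step({"merge_allowed": False, "next_action": "fix"}, fix_attempts=3))
```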
Implementation: Wiring It Together
Assume you already have:
- A Kubernetes cluster (EKS, GKE, AKS)
- Terraform managing your infra
- A GitOps controller (ArgoCD, Flux)
Step 1: Ephemeral Namespace per PR
Use a GitHub Actions workflow that triggers on pull_request:
```yaml
name: Agent Harness
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  provision:
    runs-on: ubuntu-latest
    env:
      NS: pr-${{ github.event.pull_request.number }}
    steps:
      - uses: actions/checkout@v4
      # Assumes cluster credentials are configured in an earlier step (not shown).
      - name: Provision preview namespace
        run: |
          # Idempotent: "synchronize" events re-run this job against an existing namespace.
          kubectl create namespace "$NS" --dry-run=client -o yaml | kubectl apply -f -
          kubectl label namespace "$NS" --overwrite \
            pr=${{ github.event.pull_request.number }} \
            ttl=2h
      - name: Apply manifests
        run: |
          kustomize build ./k8s/overlays/preview | \
            kubectl apply -n "$NS" -f -
```
A separate cron job garbage-collects namespaces older than 2 hours.
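The selection logic for that garbage collector is small enough to sketch. This version only decides which namespaces have expired; it assumes the caller has already fetched each namespace's creation timestamp (for example from `kubectl get namespaces -o json`) and will issue the deletes itself. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expired_namespaces(namespaces: dict[str, datetime], now: datetime,
                       ttl: timedelta = timedelta(hours=2)) -> list[str]:
    """Return preview namespaces (pr-*) whose age exceeds the TTL."""
    return sorted(
        name for name, created in namespaces.items()
        if name.startswith("pr-") and now - created > ttl
    )


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
seen = {
    "pr-101": now - timedelta(hours=3),      # past the TTL
    "pr-102": now - timedelta(minutes=30),   # still fresh
    "kube-system": now - timedelta(days=30), # never touched: not a preview namespace
}
print(expired_namespaces(seen, now))  # ['pr-101']
```

Filtering on the `pr-` prefix keeps the collector from ever considering system namespaces, whatever their age.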
Step 2: Policy Gate with OPA
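Deploy OPA as a service your CI jobs can reach over HTTP. Write a policy that checks Terraform plans by counting the entries in the plan's resource_changes array whose action is not a no-op:
```rego
package terraform.blast_radius

import rego.v1

# Count only resources whose planned action is not a no-op.
changed := [rc | some rc in input.resource_changes; rc.change.actions != ["no-op"]]

deny contains msg if {
    count(changed) > 10
    msg := sprintf("Plan modifies %d resources, limit is 10", [count(changed)])
}
```
In your CI, run terraform plan -out=plan.tfplan, then terraform show -json plan.tfplan > plan.json (the -out file itself is binary, not JSON). Wrap the plan in the input envelope that OPA's Data API expects and POST it:
```bash
jq '{input: .}' plan.json \
  | curl -s -X POST http://opa:8181/v1/data/terraform/blast_radius \
      -H 'Content-Type: application/json' -d @- \
  | jq -e '.result.deny | length == 0'
```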
Exit code 1 stops the pipeline.
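For local debugging it can help to mirror the blast-radius check in plain Python. The sketch below assumes the plan has been exported in Terraform's JSON plan representation, where each entry in `resource_changes` carries a `change.actions` list; the function names are hypothetical.

```python
import json


def count_changed_resources(plan: dict) -> int:
    """Count planned resource changes, ignoring no-op entries."""
    return sum(
        1 for rc in plan.get("resource_changes", [])
        if rc.get("change", {}).get("actions") != ["no-op"]
    )


def opa_request_body(plan: dict) -> str:
    """Wrap the plan in the {"input": ...} envelope used by OPA's Data API."""
    return json.dumps({"input": plan})


plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["create"]}},
        {"address": "aws_iam_role.app", "change": {"actions": ["no-op"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["update"]}},
    ]
}
print(count_changed_resources(plan))  # 2
```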
Step 3: Replay Traffic
Use GoReplay or a similar tool to capture production HTTP traffic:
```bash
gor --input-raw :8080 \
  --output-file requests.gor \
  --output-file-size-limit 100m
```
In the preview environment, replay the captured traffic:
```bash
gor --input-file requests.gor \
  --output-http http://preview-service.pr-1234.svc.cluster.local:8080 \
  --output-http-track-response \
  --output-http-stats
```
Diff response codes and latencies against production baseline. If drift exceeds 5%, fail the verdict.
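The drift check itself can be sketched independently of the replay tool. This simplified version compares status codes only (latency is left out), and it assumes the baseline and preview responses have already been paired up request by request; both function names are hypothetical.

```python
def drift_pct(baseline: list[int], preview: list[int]) -> float:
    """Percentage of replayed requests whose status code differs from baseline."""
    if len(baseline) != len(preview):
        raise ValueError("baseline and preview must cover the same requests")
    if not baseline:
        return 0.0
    mismatches = sum(b != p for b, p in zip(baseline, preview))
    return 100.0 * mismatches / len(baseline)


def replay_verdict(baseline: list[int], preview: list[int],
                   max_drift_pct: float = 5.0) -> str:
    """Fail when drift exceeds the threshold (5% in the text above)."""
    return "fail" if drift_pct(baseline, preview) > max_drift_pct else "pass"


baseline = [200] * 97 + [404] * 3
preview = [200] * 94 + [500] * 3 + [404] * 3  # three requests now return 500
print(drift_pct(baseline, preview))       # 3.0
print(replay_verdict(baseline, preview))  # pass
```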
Step 4: Structured Verdict
Aggregate results into JSON and expose via API:
```python
from fastapi import FastAPI

app = FastAPI()

# check_policy, run_tests, and replay_traffic are assumed to be defined elsewhere
# in the harness and to return dicts shaped like the JSON examples above.


def pass_rate(counts: dict) -> float:
    """Fraction of passed tests; 0.0 when nothing ran."""
    total = counts["passed"] + counts["failed"]
    return counts["passed"] / total if total > 0 else 0.0


@app.post("/harness/verdict")
def get_verdict(pr: int):
    policy = check_policy(pr)
    tests = run_tests(pr)
    replay = replay_traffic(pr)
    # A suite may omit "flaky" when it has no flaky tests, so default to 0.
    flake_count = tests["unit"].get("flaky", 0) + tests["integration"].get("flaky", 0)
    # Weighted confidence: unit 0.2, integration 0.3, replay 0.5, minus 0.05 per flaky test.
    confidence = (
        pass_rate(tests["unit"]) * 0.2
        + pass_rate(tests["integration"]) * 0.3
        + pass_rate(replay) * 0.5
        - flake_count * 0.05
    )
    confidence = max(0.0, min(1.0, confidence))  # Clamp to [0, 1]
    # Scores below 0.8 fail, so exactly 0.8 still passes.
    verdict = "pass" if confidence >= 0.8 and policy["allowed"] else "fail"
    return {
        "pr": pr,
        "verdict": verdict,
        "confidence": confidence,
        "policy_check": "pass" if policy["allowed"] else "fail",
        "merge_allowed": verdict == "pass",
    }
```
The agent polls this endpoint after the CI run completes.
Trade-offs: What You Gain and Lose
| Aspect | Human-Oriented CI | Agent Harness |
|---|---|---|
| Log format | Unstructured text, ANSI colors | JSON with explicit pass/fail |
| Flake handling | Human re-runs manually | Automatic retry + flake tagging |
| Blast radius | Human reviews Terraform plan | OPA policy blocks oversized diffs |
| Approval gates | Slack emoji, manual click | Webhook with structured decision |
| Cost per PR | $0 (shared staging) | $5-$6 (ephemeral infra, amortized) |
| Setup complexity | Low (GitHub Actions template) | High (Kubernetes, OPA, replay tooling) |
| Agent autonomy | Low (needs human for edge cases) | High (can merge safely without human) |
The harness adds cost and complexity. You pay for ephemeral infrastructure and spend a day wiring policy gates. In return, you get agents that can merge code without human babysitting.
Failure Modes and Observability
Even with a harness, agents fail in predictable ways:
Policy drift. Your OPA rules allow 10 resource changes. The agent learns to split one logical change into three PRs, each under the limit. Solution: Add a policy that checks for related PRs opened within 1 hour.
Replay coverage gaps. Your replay traffic does not include the edge case that breaks in production. Solution: Augment replay with fuzz testing (Hypothesis, Schemathesis) that generates requests outside your prod traffic distribution.
Confidence score gaming. The agent learns that adding more unit tests raises the confidence score, even if the tests are trivial. Solution: Weight integration and replay results higher than unit tests in the confidence calculation, as the 0.2/0.3/0.5 weights in the formula above do.
Harness downtime. The ephemeral infra controller crashes. PRs pile up. Solution: Expose harness health as a Prometheus metric. If the controller is down for more than 5 minutes, route agent PRs to a human review queue.
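The related-PR check suggested under policy drift could be prototyped along these lines. It is a sketch under stated assumptions: each PR record carries a `number`, an `opened_at` timestamp, and a `resources_changed` count (hypothetical field names), and all records belong to the same agent.

```python
from datetime import datetime, timedelta


def flag_split_changes(prs: list[dict], limit: int = 10,
                       window: timedelta = timedelta(hours=1)) -> list[int]:
    """Flag PRs whose combined diff with nearby PRs exceeds the per-PR limit."""
    flagged = []
    for pr in prs:
        # PRs opened within the window of this one, including itself.
        related = [p for p in prs
                   if abs(p["opened_at"] - pr["opened_at"]) <= window]
        total = sum(p["resources_changed"] for p in related)
        if len(related) > 1 and total > limit:
            flagged.append(pr["number"])
    return sorted(flagged)


prs = [
    {"number": 1, "opened_at": datetime(2024, 1, 1, 10, 0), "resources_changed": 8},
    {"number": 2, "opened_at": datetime(2024, 1, 1, 10, 20), "resources_changed": 9},
    {"number": 3, "opened_at": datetime(2024, 1, 1, 10, 40), "resources_changed": 7},
    {"number": 4, "opened_at": datetime(2024, 1, 1, 14, 0), "resources_changed": 4},
]
print(flag_split_changes(prs))  # [1, 2, 3]
```

Each of the first three PRs stays under the limit of 10 on its own, but together they change 24 resources within an hour, which is the pattern the check is meant to catch.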
Instrument the harness with:
- Verdict latency: Time from PR open to verdict available (target: under 10 minutes)
- Confidence distribution: Histogram of confidence scores (watch for clustering at threshold)
- Policy rejection rate: Percentage of PRs blocked by policy (a sustained rate over 30% suggests the rules are too strict)
- Rollback rate: PRs merged by agent that later rolled back (target: under 2%)
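The thresholds in this list translate directly into alert conditions. The sketch below checks three of them against measured values; the function name and alert labels are hypothetical, and in practice these rules would more likely live in Prometheus alerting configuration than in application code.

```python
def harness_alerts(verdict_latency_min: float,
                   policy_rejection_rate: float,
                   rollback_rate: float) -> list[str]:
    """Compare measured values (rates as fractions of 1) to the targets above."""
    alerts = []
    if verdict_latency_min >= 10:     # target: under 10 minutes
        alerts.append("verdict_latency_over_target")
    if policy_rejection_rate > 0.30:  # over 30% suggests rules are too strict
        alerts.append("policy_rejection_rate_high")
    if rollback_rate >= 0.02:         # target: under 2%
        alerts.append("rollback_rate_over_target")
    return alerts


print(harness_alerts(verdict_latency_min=7.5,
                     policy_rejection_rate=0.35,
                     rollback_rate=0.01))
```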
Technical Verdict
Use an agent harness when:
- Agents open more than 5 PRs per week AND you are rolling back more than 1 in 10 agent-merged PRs
- Rollback cost per incident exceeds $500 (engineer time, customer impact, SLA credits)
- You already run Kubernetes or have Terraform managing infrastructure (setup cost drops to 1-2 days)
- You can budget $180-$300 per agent slot per month for ephemeral environments
- Your test suite has integration coverage above 60% (harness amplifies existing tests, does not replace them)
Skip it when:
- Agent PR volume is under 3 per week (human review is cheaper)
- Your stack is a monolith or tightly coupled (ephemeral environments cost $1000+/month)
- Current rollback rate with human-only CI is already under 5% (you do not have a merge quality problem)
- Your test suite has under 40% integration coverage (fix test coverage first, then add harness)
- You cannot dedicate 2-3 engineer-days to initial setup and ongoing policy tuning
The harness is infrastructure for teams that have moved past “can we get agents to write code” and into “can we trust agents to ship code.” If you are still debugging basic agent output quality, fix that first. If you are rolling back agent PRs weekly, the harness pays for itself in two weeks.
Source Links
- [Original article: Stop Building CI Pipelines For Humans. Your AI Agents Need A Harness.](https://dev.to/mateenali66/stop-