Agent-generated code is moving from experiment to production. The question is no longer whether AI can write refactoring PRs. The question is what security and quality signals separate safe automated changes from risky ones. A new empirical study [2605.21453v1] of Python refactoring pull requests from the AIDev dataset provides the first systematic look at what happens when agent-authored code hits real GitHub repositories.
The findings are mixed. Agentic commits improve a quality attribute in 22.5% of the studied changes [2605.21453v1], with usability improving most frequently (36.5%). But 24.17% of modified files introduce new Pylint issues, and 4.7% add new Bandit security findings. The merge rate tells the real story: 73.5% of these PRs get merged anyway, including cases that introduce new lint or security findings [2605.21453v1].
What the Study Measured
Researchers analyzed agent-authored Python refactoring PRs using three tools:
- PyQu: ML-based quality assessment across five attributes (maintainability, usability, reliability, efficiency, security)
- Pylint: Domain-independent static analysis for code quality
- Bandit: Security-focused static analysis for Python
The analysis ran on both the pre-change and post-change state of each file, capturing the delta introduced by the agent. The dataset came from AIDev, a collection of real-world agent contributions to GitHub projects [2605.21453v1].
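The pre-change versus post-change comparison can be sketched in a few lines of Python. This is an illustration only, not the paper's tooling: it assumes findings are already parsed into dicts with `path` and `symbol` keys (the field names used by Pylint's JSON reporter), and `finding_delta` is a hypothetical helper name.

```python
from collections import Counter


def finding_delta(before, after):
    """Return (introduced, resolved) Counters keyed by (path, symbol).

    Line numbers are ignored on purpose, because refactoring shifts them.
    """
    def key(finding):
        return (finding["path"], finding["symbol"])

    old = Counter(key(f) for f in before)
    new = Counter(key(f) for f in after)
    # Counter subtraction keeps only positive counts.
    return new - old, old - new


# Hypothetical findings for one file before and after an agent's change.
before = [
    {"path": "app.py", "symbol": "unused-import"},
    {"path": "app.py", "symbol": "line-too-long"},
]
after = [
    {"path": "app.py", "symbol": "line-too-long"},
    {"path": "app.py", "symbol": "line-too-long"},
    {"path": "app.py", "symbol": "broad-exception-caught"},
]

introduced, resolved = finding_delta(before, after)
print("introduced:", dict(introduced))
print("resolved:", dict(resolved))
```

Keying on the rule rather than the line makes the delta robust to code moving around, at the cost of not distinguishing two findings of the same rule in the same file.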
Security and Quality Patterns
The study identified 24 recurring change operations that agents perform during refactoring [2605.21453v1]. The most common new issues are convention-level violations: long lines (C0301), import order, and naming conventions. These rarely block merges. Security findings cluster around exception handling patterns and hardcoded values.
Usability is the attribute that improves most often (36.5%), while new security findings appear in 4.7% of modified files. Quality impact also varies by refactoring type: structural changes carry a lower risk profile than defensive coding additions [2605.21453v1].
The Review Pipeline Gap
The 73.5% merge rate [2605.21453v1] suggests that current review workflows may not be equipped to evaluate agent contributions against static analysis deltas. PRs that introduce new Pylint issues, predominantly convention-level violations such as C0301 (line too long), are still merged at a high rate, which indicates that maintainers treat convention violations as acceptable technical debt rather than merge blockers.
The study does not provide CI configuration data, but the merge behavior implies that static analysis is advisory, not mandatory. This creates a trust boundary problem: agents can introduce low-severity issues that accumulate over time.
Gating Agent Contributions
Given the observability gap revealed by the 73.5% merge rate, a practical approach is to compare static analysis findings before and after each change. The following GitHub Actions workflow shows one way to implement delta-based gates (this is an example implementation derived from the study’s findings, not taken from the paper itself):
# .github/workflows/agent-pr-gate.yml
name: Agent PR Quality Gate
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  static-analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'
      - name: Install analyzers
        run: pip install pylint bandit
      - name: Get changed Python files
        id: changed-files
        run: |
          # --diff-filter=d drops deleted files, which have no post-change state.
          git diff --name-only --diff-filter=d origin/${{ github.base_ref }}...HEAD \
            | grep '\.py$' > changed_files.txt || true
      - name: Run Pylint on new code
        run: |
          # Pylint needs at least one target, so fall back to an empty report.
          if [ -s changed_files.txt ]; then
            pylint --exit-zero --output-format=json \
              $(cat changed_files.txt) > pylint_new.json
          else
            echo '[]' > pylint_new.json
          fi
      - name: Checkout base branch
        uses: actions/checkout@v4
        with:
          ref: ${{ github.base_ref }}
          path: base-branch
      - name: List changed files that exist on the base branch
        run: |
          # Files added by the PR have no baseline, so leave them out.
          while read -r f; do
            if [ -f "base-branch/$f" ]; then echo "$f"; fi
          done < changed_files.txt > base_files.txt
      - name: Run Pylint on base code
        working-directory: base-branch
        run: |
          if [ -s ../base_files.txt ]; then
            pylint --exit-zero --output-format=json \
              $(cat ../base_files.txt) > ../pylint_old.json
          else
            echo '[]' > ../pylint_old.json
          fi
      - name: Run Bandit on new code
        run: |
          # Bandit exits non-zero when it finds issues; --exit-zero keeps the
          # step green so the report is not overwritten by a fallback.
          if [ -s changed_files.txt ]; then
            bandit --exit-zero -f json -o bandit_new.json $(cat changed_files.txt)
          else
            echo '{"results":[]}' > bandit_new.json
          fi
      - name: Run Bandit on base code
        working-directory: base-branch
        run: |
          if [ -s ../base_files.txt ]; then
            bandit --exit-zero -f json -o ../bandit_old.json $(cat ../base_files.txt)
          else
            echo '{"results":[]}' > ../bandit_old.json
          fi
      - name: Compare findings
        run: |
          python - <<'EOF'
          import json
          import sys

          def load_findings(path):
              with open(path) as f:
                  return json.load(f)

          pylint_old = load_findings('pylint_old.json')
          pylint_new = load_findings('pylint_new.json')
          bandit_old = load_findings('bandit_old.json')
          bandit_new = load_findings('bandit_new.json')

          # Net increase in Bandit findings across the changed files
          new_bandit = len(bandit_new.get('results', [])) - len(bandit_old.get('results', []))

          # Net increase in error-level Pylint issues
          old_errors = sum(1 for m in pylint_old if m.get('type') == 'error')
          new_errors = sum(1 for m in pylint_new if m.get('type') == 'error')
          error_delta = new_errors - old_errors

          if new_bandit > 0:
              print(f"FAIL: net increase of {new_bandit} Bandit security findings")
              sys.exit(1)
          if error_delta > 0:
              print(f"FAIL: net increase of {error_delta} Pylint errors")
              sys.exit(1)
          print("PASS: No net increase in security findings or error-level issues")
          EOF
The key is comparing before and after states. Blocking on absolute finding counts will reject PRs that improve code quality. Blocking on new security findings and net increases in error-level issues provides a practical gate.
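The difference between the two gating styles can be made concrete with a small sketch. The function names and the example counts are hypothetical; the point is only that an absolute threshold rejects a PR that reduces findings in a legacy file, while a delta check accepts it.

```python
def absolute_gate(after_count, max_allowed=0):
    """Pass only if the post-change code has at most max_allowed findings."""
    return after_count <= max_allowed


def delta_gate(before_count, after_count):
    """Pass as long as the change does not increase the number of findings."""
    return after_count <= before_count


# Hypothetical cleanup PR on a legacy file: 5 findings before, 3 after.
before_count, after_count = 5, 3
print("absolute gate passes:", absolute_gate(after_count))            # False
print("delta gate passes:", delta_gate(before_count, after_count))    # True
```

A net count can still hide a swap in which one finding is fixed and a different one is introduced; comparing counts per rule is a stricter variant of the same idea.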
Change Operation Categories
The paper documents 24 recurring refactoring operations [2605.21453v1]. They fall into several meta-categories; the operations listed under each are representative rather than the full set:
Structural refactors (no semantic change):
- Extract method
- Inline variable
- Rename symbol
- Reorder imports
Defensive coding (adds safety):
- Add exception handling
- Add type hints
- Add input validation
- Add logging
Cleanup (removes code):
- Remove unused imports
- Remove dead code
- Remove duplicate logic
- Consolidate string literals
Modernization (updates idioms):
- Replace string formatting
- Replace deprecated APIs
- Use context managers
- Use comprehensions
Configuration (externalizes values):
- Extract constants
- Move to config file
- Replace hardcoded paths
- Parameterize queries
Defensive coding and configuration changes introduce the most security findings. Exception handling refactors often add bare-except or overly broad catches. Configuration extraction can expose secrets if not paired with secret management.
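The exception-handling pattern is easy to illustrate. The code below is a hypothetical example, not taken from the dataset: the first function uses the try/except/pass shape that Bandit reports as B110, and the second catches only the error it expects.

```python
import json


def load_settings_broad(text):
    """Swallows every failure, including ones the caller should see."""
    try:
        return json.loads(text)
    except Exception:  # Bandit B110: try/except/pass
        pass
    return {}


def load_settings_narrow(text):
    """Handles only malformed JSON; any other error propagates."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {}


# Passing None is a caller bug (TypeError), not malformed input.
print(load_settings_broad(None))  # {} : the TypeError is silently hidden

try:
    load_settings_narrow(None)
except TypeError as exc:
    print("narrow handler surfaced:", type(exc).__name__)
```

Both versions look "safer" than unguarded code in a diff, which is part of why a broad catch added during a defensive refactor can pass review.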
Failure Modes
Based on the paper’s change operation taxonomy and finding patterns, agent refactorings fail in predictable ways:
Over-generalization: Extracting constants that should remain inline, such as literals whose meaning is clear only at the point of use.
Context loss: Removing comments or docstrings during cleanup operations.
Incomplete refactors: Renaming a symbol in some files but not others, breaking the build.
Security regressions: Adding exception handling that swallows security-relevant errors.
Style drift: Introducing formatting that conflicts with project conventions.
The study shows these failures often get merged because the PR improves other quality attributes [2605.21453v1]. The net effect is positive, but the introduced issues become technical debt.
Observability Requirements
To safely deploy agent refactoring, you need visibility into:
- Change operation classification: Tag each PR with the refactor type (structural, defensive, cleanup, etc.).
- Static analysis delta: Track new findings, removed findings, and net change per severity level.
- Test coverage impact: Measure whether refactored code has adequate test coverage.
- Semantic equivalence: Verify that behavior is preserved (requires integration tests).
- Merge decision rationale: Capture why maintainers accepted PRs with new findings.
The last point is critical. If maintainers are accepting convention violations as acceptable debt, that policy should be explicit and automated. If they are missing security findings, the gate is broken.
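One way to capture these signals is a per-PR record that can be logged or stored. The sketch below is an assumption about what such a record might look like; the class name, field names, and example values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RefactorPRRecord:
    pr_number: int
    change_category: str  # e.g. "structural", "defensive", "cleanup"
    introduced: dict = field(default_factory=dict)  # severity -> new findings
    resolved: dict = field(default_factory=dict)    # severity -> removed findings
    coverage_delta: float = 0.0                     # percentage points
    behavior_tests_passed: Optional[bool] = None    # None if no tests ran
    merged: bool = False
    merge_rationale: str = ""

    def net_change(self):
        """Net change in findings per severity (positive means more findings)."""
        severities = sorted(set(self.introduced) | set(self.resolved))
        return {
            s: self.introduced.get(s, 0) - self.resolved.get(s, 0)
            for s in severities
        }


record = RefactorPRRecord(
    pr_number=101,
    change_category="defensive",
    introduced={"convention": 2},
    resolved={"warning": 1, "convention": 1},
    behavior_tests_passed=True,
    merged=True,
    merge_rationale="line-length debt accepted; tracked for batch cleanup",
)
print(record.net_change())
print(json.dumps(asdict(record), indent=2))
```

Storing the rationale next to the delta is what turns an ad-hoc merge decision into something that can be audited later.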
Technical Verdict
Findings apply when:
- Your CI gates can enforce static analysis deltas: the study found 4.7% of modified files introduced new Bandit findings. A zero-tolerance policy for new security findings would reject these PRs.
- You can measure the Pylint issues that the study found in 24.17% of modified files and route them by severity (block errors, allow conventions under an explicit policy).
- Your refactoring pipeline can classify change operations (structural, defensive, cleanup, modernization, configuration) and apply different review rules to each category, since the study shows defensive coding and configuration changes carry higher security risk.
- You have integration tests to verify semantic equivalence for the 22.5% of changes that improve quality attributes, ensuring gains are real and not offset by behavior regressions.
- You can track usability improvements (36.5% of quality gains) separately from security impacts to make informed trade-off decisions.
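The severity and category rules in the list above could be combined into a single routing function. This is a sketch of one possible policy, not a recommendation from the paper: the set contents, the `route` function, and the three outcome labels are assumptions, while the `type` values match the message types in Pylint's JSON output.

```python
# Assumed policy: new error or fatal messages always block.
BLOCKING_TYPES = {"error", "fatal"}
# Assumed policy: categories the article associates with more security findings.
HUMAN_REVIEW_CATEGORIES = {"defensive", "configuration"}


def route(category, new_pylint_messages, new_bandit_count):
    """Return 'block', 'human-review', or 'auto-merge-eligible' for one PR."""
    if new_bandit_count > 0:
        return "block"
    if any(m["type"] in BLOCKING_TYPES for m in new_pylint_messages):
        return "block"
    if category in HUMAN_REVIEW_CATEGORIES:
        return "human-review"
    # Convention-level messages are tolerated under an explicit debt policy.
    return "auto-merge-eligible"


line_too_long = {"type": "convention", "symbol": "line-too-long"}
undefined_name = {"type": "error", "symbol": "undefined-variable"}

print(route("structural", [line_too_long], 0))
print(route("defensive", [], 0))
print(route("cleanup", [undefined_name], 0))
print(route("configuration", [], 1))
```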
Findings do not apply when:
- You lack the observability to measure static analysis deltas: the 73.5% merge rate in the study occurred despite new findings, indicating workflows that cannot systematically compare pre- and post-refactor states.
- Your project treats all static analysis as advisory rather than mandatory, allowing the accumulation of convention violations and low-severity issues that become technical debt.
- You cannot distinguish between agent and human contributions in your metrics, making it impossible to track whether the 4.7% security finding rate is acceptable for your risk profile.
- Your codebase has security-critical paths that require manual review: the study’s taxonomy shows exception handling and configuration refactors introduce the most security findings, and these need human judgment.
- You lack explicit policies for acceptable technical debt categories: if maintainers are merging PRs with new findings (as the 73.5% rate suggests), that decision should be automated and auditable, not ad-hoc.
The observability gap revealed by the 73.5% merge rate is the real problem: without systematic measurement of static analysis deltas, you cannot distinguish safe refactorings from risky ones. Delta-based static analysis gates that track new Bandit findings and net increases in Pylint errors provide a practical starting point. Convention violations can be allowed if you track them for batch cleanup and have explicit debt policies.