AI Refactoring PRs: Security Signals in Agent-Generated Code Changes

Agent-generated code is moving from experiment to production. The question is no longer whether AI can write refactoring PRs. The question is what security and quality signals separate safe automated changes from risky ones. A new empirical study [2605.21453v1] of Python refactoring pull requests from the AIDev dataset provides the first systematic look at what happens when agent-authored code hits real GitHub repositories.

The findings are mixed. Agentic commits improve a quality attribute in 22.5% of the studied changes [2605.21453v1], with usability improving most frequently (36.5%). But 24.17% of modified files introduce new Pylint issues, and 4.7% add new Bandit security findings. The merge rate tells the real story: 73.5% of these PRs get merged anyway, including cases that introduce new lint or security findings [2605.21453v1].

What the Study Measured

Researchers analyzed agent-authored Python refactoring PRs using three tools:

PyQu: ML-based quality assessment across five attributes (maintainability, usability, reliability, efficiency, security)
Pylint: Domain-independent static analysis for code quality
Bandit: Security-focused static analysis for Python

The analysis ran on both the pre-change and post-change state of each file, capturing the delta introduced by the agent. The dataset came from AIDev, a collection of real-world agent contributions to GitHub projects [2605.21453v1].

Security and Quality Patterns

The study identified 24 recurring change operations that agents perform during refactoring [2605.21453v1]. The most common new issues are convention-level violations: long lines (C0301), import order, and naming conventions. These rarely block merges. Security findings cluster around exception handling patterns and hardcoded values.

The paper reports that usability is the most frequently improved attribute (36.5% of quality gains), while new security findings appear in 4.7% of modified files. The distribution of quality impacts varies by refactoring type, with structural changes showing lower risk profiles than defensive coding additions [2605.21453v1].

The Review Pipeline Gap

The 73.5% merge rate [2605.21453v1] suggests that current review workflows are not equipped to evaluate agent contributions systematically against static analysis deltas. PRs that introduce new Pylint issues, predominantly convention-level violations such as C0301 (line too long), are still merged at high rates, which indicates that maintainers treat these violations as acceptable technical debt rather than merge blockers.

The study does not provide CI configuration data, but the merge behavior implies that static analysis is advisory, not mandatory. This creates a trust boundary problem: agents can introduce low-severity issues that accumulate over time.

Gating Agent Contributions

Given the observability gap revealed by the 73.5% merge rate, a practical approach is to compare static analysis findings before and after each change. The following GitHub Actions workflow shows one way to implement delta-based gates (this is an example implementation derived from the study’s findings, not from the paper itself):

# .github/workflows/agent-pr-gate.yml
name: Agent PR Quality Gate
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  static-analysis:
    runs-on: ubuntu-latest
    steps:
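      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      # Pylint and Bandit must be installed before the analysis steps run
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install analysis tools
        run: python -m pip install pylint bandit
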
      - name: Get changed Python files
        id: changed-files
        run: |
          git diff --name-only origin/${{ github.base_ref }}...HEAD \
            | grep '\.py$' > changed_files.txt || true
      
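      - name: Run Pylint on new code
        run: |
          # With no changed files Pylint prints usage text and exits non-zero,
          # so fall back to an empty JSON list
          pylint --exit-zero --output-format=json \
            $(cat changed_files.txt) > pylint_new.json || echo '[]' > pylint_new.json
      
      - name: Checkout base branch
        uses: actions/checkout@v4
        with:
          ref: ${{ github.base_ref }}
          path: base-branch
      
      - name: Run Pylint on base code
        working-directory: base-branch
        run: |
          # Pylint's JSON report is a list, so the fallback is '[]' rather than '{}'
          pylint --exit-zero --output-format=json \
            $(cat ../changed_files.txt) > ../pylint_old.json || echo '[]' > ../pylint_old.json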
      
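      - name: Run Bandit on new code
        run: |
          # Bandit exits non-zero when it finds issues; without --exit-zero the
          # fallback below would overwrite a report that contains real findings
          bandit --exit-zero -f json -o bandit_new.json \
            $(cat changed_files.txt) || echo '{"results":[]}' > bandit_new.json
      
      - name: Run Bandit on base code
        working-directory: base-branch
        run: |
          bandit --exit-zero -f json -o ../bandit_old.json \
            $(cat ../changed_files.txt) || echo '{"results":[]}' > ../bandit_old.json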
      
      - name: Compare findings
        run: |
          # Quote the heredoc delimiter so the shell does not expand the script
          python - <<'EOF'
          import json
          import sys

          def load_findings(path, default):
              # Fall back to an empty report if a tool wrote nothing parseable
              try:
                  with open(path) as f:
                      return json.load(f)
              except (FileNotFoundError, json.JSONDecodeError):
                  return default

          pylint_old = load_findings('pylint_old.json', [])
          pylint_new = load_findings('pylint_new.json', [])
          bandit_old = load_findings('bandit_old.json', {'results': []})
          bandit_new = load_findings('bandit_new.json', {'results': []})

          # Net increase in Bandit security findings
          bandit_delta = len(bandit_new.get('results', [])) - len(bandit_old.get('results', []))

          # Net increase in error-level Pylint issues
          old_errors = sum(1 for m in pylint_old if m.get('type') == 'error')
          new_errors = sum(1 for m in pylint_new if m.get('type') == 'error')
          error_delta = new_errors - old_errors

          if bandit_delta > 0:
              print(f"FAIL: net increase of {bandit_delta} Bandit security findings")
              sys.exit(1)

          if error_delta > 0:
              print(f"FAIL: net increase of {error_delta} Pylint errors")
              sys.exit(1)

          print("PASS: No net increase in security findings or error-level issues")
          EOF

The key is comparing before and after states. Blocking on absolute finding counts would reject PRs that improve code quality but leave pre-existing findings in place. Blocking on new security findings and net increases in error-level issues provides a practical gate.
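
One caveat with a net count: a PR that resolves one finding and introduces a different one nets to zero and passes. The sketch below compares findings by identity instead. It assumes each entry in Bandit's JSON results list carries filename and test_id fields; the helper names finding_key and diff_findings are hypothetical and come from neither Bandit nor the paper. Line numbers are left out of the key so that findings shifted by unrelated edits are not counted as new.

```python
from collections import Counter

def finding_key(result):
    # Identify a finding by file and Bandit test ID, ignoring line numbers
    return (result["filename"], result["test_id"])

def diff_findings(old_results, new_results):
    """Return (introduced, resolved) multisets of finding keys."""
    old = Counter(finding_key(r) for r in old_results)
    new = Counter(finding_key(r) for r in new_results)
    # Counter subtraction keeps only positive counts
    return new - old, old - new

# Illustrative data: one finding resolved, a different one introduced
old_results = [{"filename": "./app/db.py", "test_id": "B608", "line_number": 40}]
new_results = [{"filename": "./app/db.py", "test_id": "B110", "line_number": 52}]

introduced, resolved = diff_findings(old_results, new_results)
net = len(new_results) - len(old_results)

print(net)               # 0, so a net-count gate would pass
print(dict(introduced))  # {('./app/db.py', 'B110'): 1}
print(dict(resolved))    # {('./app/db.py', 'B608'): 1}
```

A gate built on this would fail when introduced is non-empty, regardless of how many findings were resolved in the same PR.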

Change Operation Categories

The paper documents 24 recurring refactoring operations [2605.21453v1]. They can be grouped into several meta-categories; representative operations from each are listed below:

Structural refactors (no semantic change):

Extract method
Inline variable
Rename symbol
Reorder imports

Defensive coding (adds safety):

Add exception handling
Add type hints
Add input validation
Add logging

Cleanup (removes code):

Remove unused imports
Remove dead code
Remove duplicate logic
Consolidate string literals

Modernization (updates idioms):

Replace string formatting
Replace deprecated APIs
Use context managers
Use comprehensions

Configuration (externalizes values):

Extract constants
Move to config file
Replace hardcoded paths
Parameterize queries

Defensive coding and configuration changes introduce the most security findings. Exception handling refactors often add bare except clauses or overly broad catches. Configuration extraction can expose secrets if it is not paired with secret management.
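
To make the exception handling risk concrete, the sketch below contrasts a broad catch with a narrow one. Both functions are hypothetical illustrations, not examples taken from the paper. The first matches the pattern that Bandit's B110 check (try_except_pass) reports.

```python
DEFAULT_PORT = 8080  # illustrative default, not from the paper

def parse_port_broad(raw):
    # Broad catch with a bare pass: every failure is silently swallowed,
    # including ones that signal a caller bug (for example raw=None)
    try:
        return int(raw)
    except Exception:
        pass
    return DEFAULT_PORT

def parse_port_narrow(raw):
    # Catch only the expected failure; unrelated errors still surface
    try:
        return int(raw)
    except ValueError:
        return DEFAULT_PORT

print(parse_port_broad("abc"))   # 8080
print(parse_port_broad(None))    # 8080, the TypeError is hidden
print(parse_port_narrow("abc"))  # 8080
```

Calling parse_port_narrow(None) raises TypeError, which is the behavior a reviewer usually wants: the unexpected input is reported instead of being masked by a default.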

Failure Modes

Based on the paper’s change operation taxonomy and finding patterns, agent refactorings fail in predictable ways:

Over-generalization: Agents extract constants that should remain inline (magic numbers with semantic meaning).

Context loss: Removing comments or docstrings during cleanup operations.

Incomplete refactors: Renaming a symbol in some files but not others, breaking the build.

Security regressions: Adding exception handling that swallows security-relevant errors.

Style drift: Introducing formatting that conflicts with project conventions.

The study shows these failures often get merged because the PR improves other quality attributes [2605.21453v1]. The net effect is positive, but the introduced issues become technical debt.

Observability Requirements

To safely deploy agent refactoring, you need visibility into:

Change operation classification: Tag each PR with the refactor type (structural, defensive, cleanup, etc.).
Static analysis delta: Track new findings, removed findings, and net change per severity level.
Test coverage impact: Measure whether refactored code has adequate test coverage.
Semantic equivalence: Verify that behavior is preserved (requires integration tests).
Merge decision rationale: Capture why maintainers accepted PRs with new findings.

The last point is critical. If maintainers are accepting convention violations as acceptable debt, that policy should be explicit and automated. If they are missing security findings, the gate is broken.
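
An explicit debt policy can be as small as a lookup table. The sketch below routes newly introduced Pylint messages by their type field, which Pylint's JSON output provides. The POLICY mapping, the action names, and the route_messages and decide helpers are assumptions made for illustration, not a policy recommended by the paper.

```python
from collections import Counter

# Hypothetical policy: the action each Pylint message type triggers
POLICY = {
    "fatal": "block",
    "error": "block",
    "warning": "review",
    "refactor": "track",
    "convention": "track",
}

def route_messages(new_messages):
    """Count newly introduced Pylint messages per policy action."""
    actions = Counter()
    for msg in new_messages:
        # Unknown types go to a human rather than being waved through
        actions[POLICY.get(msg["type"], "review")] += 1
    return actions

def decide(actions):
    """Turn action counts into a single merge decision."""
    if actions["block"]:
        return "fail"
    if actions["review"]:
        return "needs-human-review"
    return "pass-with-debt" if actions["track"] else "pass"

# Illustrative messages in the shape of Pylint's JSON output
new_messages = [
    {"type": "convention", "symbol": "line-too-long", "message-id": "C0301"},
    {"type": "error", "symbol": "undefined-variable", "message-id": "E0602"},
]

print(decide(route_messages(new_messages)))      # fail
print(decide(route_messages(new_messages[:1])))  # pass-with-debt
```

Recording the decision string alongside the PR gives an auditable trail for the "merge decision rationale" item above.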

Technical Verdict

Findings apply when:

Your CI gates can enforce static analysis deltas: the study found 4.7% of modified files introduced new Bandit findings. A zero-tolerance policy for new security findings would reject these PRs.
You can measure and route the 24.17% of changes that introduce Pylint issues based on severity (block errors, allow conventions with explicit policy).
Your refactoring pipeline can classify change operations (structural, defensive, cleanup, modernization, configuration) and apply different review rules to each category, since the study shows defensive coding and configuration changes carry higher security risk.
You have integration tests to verify semantic equivalence for the 22.5% of changes that improve quality attributes, ensuring gains are real and not offset by behavior regressions.
You can track usability improvements (36.5% of quality gains) separately from security impacts to make informed trade-off decisions.

Findings do not apply when:

You lack the observability to measure static analysis deltas: the 73.5% merge rate in the study occurred despite new findings, indicating workflows that cannot systematically compare pre- and post-refactor states.
Your project treats all static analysis as advisory rather than mandatory, allowing the accumulation of convention violations and low-severity issues that become technical debt.
You cannot distinguish between agent and human contributions in your metrics, making it impossible to track whether the 4.7% security finding rate is acceptable for your risk profile.
Your codebase has security-critical paths that require manual review: the study’s taxonomy shows exception handling and configuration refactors introduce the most security findings, and these need human judgment.
You lack explicit policies for acceptable technical debt categories: if maintainers are merging PRs with new findings (as the 73.5% rate suggests), that decision should be automated and auditable, not ad-hoc.

The observability gap revealed by the 73.5% merge rate is the real problem: without systematic measurement of static analysis deltas, you cannot distinguish safe refactorings from risky ones. Delta-based static analysis gates that track new Bandit findings and net increases in Pylint errors provide a practical starting point. Convention violations can be allowed if you track them for batch cleanup and have explicit debt policies.