Debloating AI-Generated Codebases: Why Agent-Written Code Needs a Cleanup Pass

AI coding agents ship features fast. They also ship a lot of extra code. Maxim Saplin’s weekend experiment removed 31.7% of total app code (19,772 lines to 13,509) and 37.4% of Dart-specific code (15,859 to 9,924) from an AI-grown Flutter codebase without breaking functionality. The two percentages differ because the total includes native platform code (iOS/Android) that was not refactored; the Dart-only figure is the better measure of the agent-generated bloat that was removed.

This exposes a pattern: agent-written code has a distinctive smell, marked by verbose READMEs, half-finished abstractions, comments that explain nothing, and layers introduced for problems that no longer exist.

The AI Bloat Pattern

Agent-generated code exhibits local competence but global entropy. Each feature works in isolation. The codebase as a whole becomes fragile. Saplin’s Flutter media player app worked, but bug fixes were partial and new agent additions increased entropy even when features landed.

The overhead shows up in three layers:

Phantom abstractions: Classes and interfaces introduced for a problem that was refactored away three commits ago, but the abstraction stayed.
Copy-paste safety: Agents duplicate logic instead of refactoring shared behavior because duplication is safer than risking a broken import.
Verbose scaffolding: Agents write defensive code with extensive error handling, logging, and validation even when the call path is internal and predictable.
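
To make two of these layers concrete, here is a small hypothetical before/after sketch. The names (VolumeNormalizer, DefaultVolumeNormalizer, clamp_volume) are invented for illustration and do not come from Saplin's codebase.

```python
from abc import ABC, abstractmethod


# --- Before: a shape agent-written code often takes (hypothetical example) ---
class VolumeNormalizer(ABC):
    """Phantom abstraction: an interface with exactly one implementation."""

    @abstractmethod
    def normalize(self, level: float) -> float: ...


class DefaultVolumeNormalizer(VolumeNormalizer):
    def normalize(self, level: float) -> float:
        # Verbose scaffolding: validation plus a catch-and-re-raise
        # on a call path that is internal and predictable.
        try:
            if level is None:
                raise ValueError("level must not be None")
            return max(0.0, min(1.0, level))
        except ValueError:
            raise


def set_player_volume_bloated(level: float) -> float:
    return DefaultVolumeNormalizer().normalize(level)


# --- After: one function with the same result for valid input ---
def clamp_volume(level: float) -> float:
    """Clamp a volume level to the range [0.0, 1.0]."""
    return max(0.0, min(1.0, level))


# Spot-check that both versions agree on a few valid inputs.
for sample in (-0.5, 0.0, 0.3, 1.0, 2.5):
    assert set_player_volume_bloated(sample) == clamp_volume(sample)
```

The two versions agree only for valid input. Passing None raises ValueError in the first and TypeError in the second, which is the kind of small behavioral difference a cleanup pass has to decide whether to accept.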

Saplin kept 335 tests green and the analyzer clean through the entire refactoring. The app still worked. The codebase just had 37% less Dart code to maintain.

Why Human Review Fails at Scale

The cognitive load of reviewing agent code is asymmetric. Once you start looking, you notice weird layers, half-fixes, old ideas still wired through the system. Saplin describes the trap: “The choice becomes: do I stop and rewrite this? Do I spend the weekend paying down debt I only discovered because I looked?”

Most teams keep shipping around it, so the fragility compounds with each agent-made addition. This is not a review problem. It is a structural problem that needs automated cleanup passes.

Debloat Pipeline Architecture

The solution is a structured, automated approach. A debloat pipeline runs after agent code generation and before merge. It is not a linter (linters flag style issues). It is not a test suite (tests validate behavior). It is a refactoring pass that prunes unnecessary code while preserving functionality.

The pipeline has three stages:

Static analysis: Detect overhead patterns (unused imports, duplicate logic, over-abstracted interfaces).
Automated refactoring: Apply safe transformations (inline single-use functions, collapse redundant classes, remove dead code).
Validation: Run tests, check analyzer output, verify build artifacts.
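
The three stages can be sketched as a small driver loop. This is a minimal sketch, not a reference design: run_debloat_pipeline, PipelineResult, and the four callables are hypothetical names, and the demo uses in-memory stand-ins where a real pipeline would call an analyzer, a refactoring tool, and a test runner.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PipelineResult:
    findings: List[str] = field(default_factory=list)
    applied: List[str] = field(default_factory=list)
    rolled_back: List[str] = field(default_factory=list)


def run_debloat_pipeline(
    analyze: Callable[[], List[str]],
    refactor: Callable[[str], None],
    validate: Callable[[], bool],
    rollback: Callable[[str], None],
) -> PipelineResult:
    """Run analysis once, then refactor and validate one finding at a time."""
    result = PipelineResult()
    result.findings = analyze()              # Stage 1: static analysis
    for finding in result.findings:
        refactor(finding)                    # Stage 2: automated refactoring
        if validate():                       # Stage 3: validation
            result.applied.append(finding)
        else:
            rollback(finding)
            result.rolled_back.append(finding)
    return result


# Demo with in-memory stand-ins (assumption: "inline:risky_helper" breaks tests).
broken = set()


def demo_refactor(finding: str) -> None:
    if finding == "inline:risky_helper":
        broken.add(finding)


def demo_rollback(finding: str) -> None:
    broken.discard(finding)


demo = run_debloat_pipeline(
    analyze=lambda: ["dead:old_cache", "inline:risky_helper"],
    refactor=demo_refactor,
    validate=lambda: not broken,
    rollback=demo_rollback,
)
print(demo.applied, demo.rolled_back)
```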

Stage 1: Static Analysis for Bloat Patterns

You need custom rules beyond standard linters. Agent code has characteristic smells:

Unused abstractions: Interfaces with a single implementation, base classes with no subclasses.
Duplicate logic: Functions in different files whose bodies are more than 80% similar.
Over-defensive code: Try-catch blocks around internal calls, null checks on non-nullable types.
Verbose comments: Multi-line comments that restate the function signature.

Example detection rule for unused abstractions:

# detect_unused_abstractions.py
# Detects abstract classes with zero or one implementation.
# Run with: python detect_unused_abstractions.py <source_file.py>
# With no argument, the script analyzes the built-in sample below.

import ast
import sys


def base_name(base):
    """Return the class name of a base expression: ABC and abc.ABC both give 'ABC'."""
    if isinstance(base, ast.Name):
        return base.id
    if isinstance(base, ast.Attribute):
        return base.attr
    return None


class AbstractionAnalyzer(ast.NodeVisitor):
    def __init__(self):
        self.interfaces = []      # names of abstract classes, in source order
        self.concrete_bases = []  # base names of every non-abstract class

    def visit_ClassDef(self, node):
        bases = [base_name(base) for base in node.bases]
        # Treat a class as abstract if it directly inherits from ABC or Protocol
        if any(name in ('ABC', 'Protocol') for name in bases):
            self.interfaces.append(node.name)
        else:
            self.concrete_bases.extend(bases)
        self.generic_visit(node)

    def get_unused(self):
        # Count implementations after the walk so declaration order does not matter
        counts = {name: 0 for name in self.interfaces}
        for name in self.concrete_bases:
            if name in counts:
                counts[name] += 1
        return [name for name, count in counts.items() if count <= 1]


SAMPLE_SOURCE = """
from abc import ABC

class PaymentProcessor(ABC):
    def process(self): pass

class StripeProcessor(PaymentProcessor):
    def process(self): return "stripe"
"""

if __name__ == "__main__":
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as source_file:
            source_code = source_file.read()
    else:
        source_code = SAMPLE_SOURCE

    analyzer = AbstractionAnalyzer()
    analyzer.visit(ast.parse(source_code))
    # For the sample, prints: Unused abstractions: ['PaymentProcessor']
    print(f"Unused abstractions: {analyzer.get_unused()}")

This flags interfaces with zero or one implementation. Those are candidates for inlining.
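
The duplicate-logic smell can be approached in a similar way. The sketch below is one possible heuristic, not an established rule: it compares the AST dumps of function bodies with difflib and reports pairs above the 80% threshold. The function names and the sample source are hypothetical, and for brevity it scans a single source string, whereas a real rule would collect functions across files.

```python
import ast
import difflib
from itertools import combinations


def function_fingerprints(source: str) -> dict:
    """Map each top-level function name to a dump of its body's AST."""
    fingerprints = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            # Dump only the body so the function's own name does not count.
            body = ast.Module(body=node.body, type_ignores=[])
            fingerprints[node.name] = ast.dump(body)
    return fingerprints


def find_near_duplicates(source: str, threshold: float = 0.8) -> list:
    """Return (name_a, name_b, ratio) for function pairs above the threshold."""
    prints = function_fingerprints(source)
    pairs = []
    for (name_a, dump_a), (name_b, dump_b) in combinations(prints.items(), 2):
        # autojunk=False keeps the ratio reliable for long, repetitive dumps.
        matcher = difflib.SequenceMatcher(None, dump_a, dump_b, autojunk=False)
        ratio = matcher.ratio()
        if ratio > threshold:
            pairs.append((name_a, name_b, round(ratio, 2)))
    return pairs


SAMPLE = '''
def fetch_user(client, user_id):
    response = client.get("/users/" + str(user_id), timeout=30)
    if response.status != 200:
        raise RuntimeError("request failed")
    payload = response.json()
    return payload["data"]

def fetch_order(client, order_id):
    response = client.get("/orders/" + str(order_id), timeout=30)
    if response.status != 200:
        raise RuntimeError("request failed")
    payload = response.json()
    return payload["data"]

def version():
    return 1
'''

duplicates = find_near_duplicates(SAMPLE)
for name_a, name_b, ratio in duplicates:
    print(f"{name_a} and {name_b} look like duplicates (similarity {ratio})")
```

Character-level similarity on AST dumps is crude, and pairwise comparison is quadratic in the number of functions, so treat flagged pairs as candidates for review rather than as proof of duplication.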

Stage 2: Automated Refactoring

Safe transformations preserve behavior while reducing lines of code. The key is to apply transformations that have low risk of breaking tests:

Inline single-use functions: If a function is called once, inline it at the call site.
Collapse redundant classes: If a class has no state and one method, replace it with a function.
Remove dead code: If a function is never called, delete it.
Simplify error handling: If a try-catch block catches and re-raises, remove it.

Tools like rope (Python refactoring library), jscodeshift (JavaScript codemod toolkit), or dart fix (Dart’s command-line tool for applying analyzer-suggested fixes) can automate some of these transformations; the rest need custom codemods. The key is to run them in a specific order:

Remove dead code first (reduces surface area for later transformations).
Inline single-use functions (exposes more dead code).
Collapse redundant classes (simplifies inheritance trees).
Simplify error handling (reduces noise).
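
As an illustration of one such transformation, the sketch below simplifies error handling with Python's standard ast module. It unwraps a try block only when every handler is a bare raise and there is no else or finally clause. The class name RemoveReraiseOnlyTry and the sample functions are hypothetical, and ast.unparse requires Python 3.9 or later.

```python
import ast


class RemoveReraiseOnlyTry(ast.NodeTransformer):
    """Unwrap try/except blocks whose handlers do nothing but re-raise."""

    @staticmethod
    def _only_reraises(handler: ast.ExceptHandler) -> bool:
        # Matches a handler whose whole body is a bare `raise`.
        return (
            len(handler.body) == 1
            and isinstance(handler.body[0], ast.Raise)
            and handler.body[0].exc is None
        )

    def visit_Try(self, node: ast.Try):
        self.generic_visit(node)  # simplify nested try blocks first
        removable = (
            bool(node.handlers)
            and all(self._only_reraises(h) for h in node.handlers)
            and not node.orelse
            and not node.finalbody
        )
        # Returning a list of statements replaces the try node with its body.
        return node.body if removable else node


SOURCE = '''
def load(path):
    try:
        data = read(path)
    except OSError:
        raise
    return data

def parse(text):
    try:
        return int(text)
    except ValueError:
        return 0
'''

tree = RemoveReraiseOnlyTry().visit(ast.parse(SOURCE))
ast.fix_missing_locations(tree)
cleaned = ast.unparse(tree)
print(cleaned)
```

Here load loses its try block while parse keeps its own, because that handler changes behavior. Note that ast.unparse discards comments and original formatting, so a production pass would more likely rely on a formatting-preserving refactoring tool.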

Stage 3: Validation

After each transformation, run the full test suite and check analyzer output. If tests fail or the analyzer reports new errors, roll back the transformation and flag it for human review.
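
A minimal sketch of the apply, validate, roll back step for a single file follows. The helper apply_with_rollback and the demo transformations are hypothetical, and the check command stands in for a real test suite and analyzer run. A real pipeline would more likely snapshot and restore through version control than through an in-memory copy.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def apply_with_rollback(
    path: Path,
    transform: Callable[[str], str],
    check_cmd: Sequence[str],
) -> bool:
    """Apply transform to path; restore the original text if check_cmd fails."""
    original = path.read_text(encoding="utf-8")
    path.write_text(transform(original), encoding="utf-8")
    passed = subprocess.run(check_cmd, capture_output=True).returncode == 0
    if not passed:
        path.write_text(original, encoding="utf-8")  # roll back
    return passed  # the caller can flag False results for human review


def strip_comments(src: str) -> str:
    """Safe demo transformation: drop whole-line comments."""
    kept = [line for line in src.splitlines() if not line.strip().startswith("#")]
    return "\n".join(kept) + "\n"


def break_behavior(src: str) -> str:
    """Unsafe demo transformation: changes what the function returns."""
    return src.replace("* 2", "* 3")


with tempfile.TemporaryDirectory() as tmp:
    module = Path(tmp) / "pricing.py"
    module.write_text("def double(x):\n    # doubles x\n    return x * 2\n", encoding="utf-8")
    # Stand-in for the test suite: load the module and check one behavior.
    check = [
        sys.executable, "-c",
        "import runpy, sys; ns = runpy.run_path(sys.argv[1]); assert ns['double'](2) == 4",
        str(module),
    ]
    safe_kept = apply_with_rollback(module, strip_comments, check)
    unsafe_kept = apply_with_rollback(module, break_behavior, check)
    final_source = module.read_text(encoding="utf-8")

print(safe_kept, unsafe_kept)
```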

This is where the pipeline needs observability. You want to know:

Which transformations were applied.
Which transformations were rolled back.
How much code was removed in each stage.
How long the pipeline took.

Emit structured logs:

{
  "stage": "inline_single_use_functions",
  "transformations_applied": 47,
  "transformations_rolled_back": 3,
  "lines_removed": 892,
  "duration_ms": 12400,
  "test_status": "pass",
  "timestamp": "2024-06-01T20:15:32Z",
  "git_commit_hash": "a3f7c9d"
}
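
One way to produce records of this shape is a small dataclass serialized to one JSON object per line. This is a sketch: StageLog and utc_now_iso are hypothetical names, and the field names simply mirror the example record above.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def utc_now_iso() -> str:
    """Current UTC time in the same format as the example record."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@dataclass
class StageLog:
    stage: str
    transformations_applied: int
    transformations_rolled_back: int
    lines_removed: int
    duration_ms: int
    test_status: str
    git_commit_hash: str
    timestamp: str = field(default_factory=utc_now_iso)

    def to_json_line(self) -> str:
        """Serialize to a single line so log collectors can parse each record."""
        return json.dumps(asdict(self), sort_keys=True)


# Reproduce the example record; the timestamp is passed explicitly for the demo.
entry = StageLog(
    stage="inline_single_use_functions",
    transformations_applied=47,
    transformations_rolled_back=3,
    lines_removed=892,
    duration_ms=12400,
    test_status="pass",
    git_commit_hash="a3f7c9d",
    timestamp="2024-06-01T20:15:32Z",
)
line = entry.to_json_line()
print(line)
```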

Trade-offs: Agent Pass vs. CI/CD Gate

You have three options for where to run the debloat pipeline:

Approach	Pros	Cons	When to Use
Separate agent pass	Agent can explain changes, preserves git history	Slower, requires agent context window	Teams with <5 developers; projects <6 months old; internal tools where refactoring risk is low
Human review step	Developer has final say, catches edge cases	Does not scale, cognitive load is high	Small teams (<3 devs); critical codebases with <70% test coverage; regulated environments
CI/CD gate with auto-refactor	Fast, consistent, no human bottleneck	Risk of breaking changes, hard to debug	Mature pipelines with >85% test coverage; agent-generated code >30% of codebase; CI/CD latency <5 min

Saplin’s approach was a manual weekend experiment. He read the code, identified cruft, and refactored by hand. This works for a one-time cleanup but does not scale to continuous agent development.

For production use, the CI/CD gate approach is most practical. Run the debloat pipeline on every PR from an agent. If transformations pass validation, auto-commit them. If they fail, flag the PR for human review.

Failure Modes

Debloat pipelines fail in predictable ways:

False positives: Flagging code as dead when it is called via reflection or dynamic dispatch.
Test brittleness: Tests that depend on implementation details break when code is refactored.
Semantic drift: Automated refactoring changes behavior in subtle ways (for example, inlining a function that had side effects).
Pipeline slowdown: Running the full test suite after every transformation makes the pipeline too slow for CI/CD.

Mitigations:

Use type information to reduce false positives, and track dynamic calls (reflection, dynamic dispatch) so that code reached only through them is not flagged as dead.
Enforce test isolation (unit tests should not depend on implementation details).
Run semantic equivalence checks (compare output before and after refactoring).
Batch transformations and run tests once per batch instead of once per transformation.
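
Batching raises the question of what to do when a batch fails. One option, sketched below, is to split a failing batch in half and retry, so one bad transformation does not sink the whole batch. The function apply_in_batches and the demo predicate are hypothetical, and the sketch assumes passes_tests can evaluate the codebase with any given list of transformations applied.

```python
from typing import Callable, List, Tuple


def apply_in_batches(
    transformations: List[str],
    passes_tests: Callable[[List[str]], bool],
) -> Tuple[List[str], List[str]]:
    """Return (kept, rejected), bisecting any batch that fails the tests."""
    kept: List[str] = []
    rejected: List[str] = []

    def process(batch: List[str]) -> None:
        if not batch:
            return
        if passes_tests(kept + batch):   # one test run for the whole batch
            kept.extend(batch)
            return
        if len(batch) == 1:              # isolated the failing transformation
            rejected.extend(batch)
            return
        mid = len(batch) // 2
        process(batch[:mid])
        process(batch[mid:])

    process(transformations)
    return kept, rejected


# Demo: 16 transformations, where "t13" is assumed to break the tests.
test_runs = 0


def demo_passes_tests(applied: List[str]) -> bool:
    global test_runs
    test_runs += 1
    return "t13" not in applied


all_transformations = [f"t{i}" for i in range(16)]
kept, rejected = apply_in_batches(all_transformations, demo_passes_tests)
print(f"kept={len(kept)} rejected={rejected} test_runs={test_runs}")
```

In this demo the bisection needs 9 test runs instead of 16. The kept list always corresponds to a state that passed the tests, but when transformations interact, a transformation may be rejected that would have passed on its own.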

Technical Verdict

Use a debloat pipeline when:

Test coverage exceeds 80% and tests are isolated from implementation details. Below this threshold, false positives will block legitimate PRs and require constant manual intervention.
Agent-generated code represents more than 30% of your codebase. Below this, manual review is more efficient than pipeline maintenance.
CI/CD pipeline latency is under 5 minutes for a full test run. Longer pipelines make the debloat stage a bottleneck.
You have observability into what the pipeline changes (structured logs, diff reports, rollback tracking). Without this, debugging pipeline failures becomes guesswork.
Your team trusts automated refactoring tools and has experience with tools like rope, jscodeshift, or dart fix.

Avoid it when:

Test coverage is below 70% or tests are brittle (depend on private methods, mock internals, or have flaky assertions). The pipeline will break tests more often than it removes technical debt.
The codebase is small enough for manual review (fewer than 10,000 lines or fewer than 3 active developers). The overhead of maintaining custom static analysis rules outweighs the benefit.
You lack observability into pipeline behavior. If you cannot see which transformations were applied or rolled back, you cannot debug failures or tune the pipeline.
Your team does not trust automated refactoring. If developers routinely override or disable the pipeline, it adds friction without value.
Agent-generated code has low test coverage or high semantic complexity (such as state machines, async workflows, or code with side effects). The risk of semantic drift (where refactoring changes behavior) is too high.
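
The thresholds above can be collected into a simple checklist function. This is a sketch with hypothetical names (ProjectProfile, recommend). It omits the team-size and semantic-complexity criteria for brevity, and the "borderline" outcome for projects that fall between the two lists (for example, coverage between 70% and 80%) is an assumption, since the criteria above do not address that range.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ProjectProfile:
    test_coverage: float          # fraction of code covered, 0.0 to 1.0
    agent_code_share: float       # fraction of the codebase written by agents
    full_test_run_minutes: float
    lines_of_code: int
    tests_isolated: bool          # tests do not depend on implementation details
    has_observability: bool       # structured logs, diff reports, rollback tracking
    team_trusts_tooling: bool


def recommend(profile: ProjectProfile) -> Tuple[str, List[str]]:
    """Return ("use" | "borderline" | "avoid", reasons)."""
    blockers = []
    if profile.test_coverage < 0.70:
        blockers.append("test coverage below 70%")
    if not profile.tests_isolated:
        blockers.append("tests depend on implementation details")
    if profile.lines_of_code < 10_000:
        blockers.append("codebase small enough for manual review")
    if not profile.has_observability:
        blockers.append("no observability into pipeline behavior")
    if not profile.team_trusts_tooling:
        blockers.append("team does not trust automated refactoring")
    if blockers:
        return "avoid", blockers

    gaps = []
    if profile.test_coverage <= 0.80:
        gaps.append("test coverage does not exceed 80%")
    if profile.agent_code_share <= 0.30:
        gaps.append("agent-generated code is 30% or less of the codebase")
    if profile.full_test_run_minutes >= 5:
        gaps.append("full test run takes 5 minutes or longer")
    if gaps:
        return "borderline", gaps
    return "use", []


example = ProjectProfile(
    test_coverage=0.90, agent_code_share=0.40, full_test_run_minutes=3,
    lines_of_code=50_000, tests_isolated=True, has_observability=True,
    team_trusts_tooling=True,
)
print(recommend(example))
```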

The debloat pipeline becomes a pressure relief valve: agents can be verbose and defensive, and the pipeline cleans up after them. This lets teams keep shipping agent-generated features without accumulating cognitive debt.

Source Links

Debloating The AI-Grown Codebase (Dev.to)