Self-Improving Tax Agents: How Codex Learns from IRS Rejections

OpenAI just published a case study showing how Thrive and Crete built a production tax agent that files returns, receives IRS rejection codes, and regenerates corrected filings automatically. This is not a demo. It is a deployed system handling legally binding documents with a feedback loop that improves accuracy over time.

The plumbing matters because tax compliance is a hard constraint problem. You cannot hallucinate a deduction. You cannot guess at depreciation schedules. The agent must parse structured rejection codes, trace them back to the code that generated the filing, and regenerate the document without introducing new errors.

How the Feedback Loop Works

The agent operates in three phases: generation, validation, and correction.

Generation phase:

Codex reads client financial data (income, expenses, deductions)
Generates Python code that populates IRS forms (1040, Schedule C, etc.)
Executes the code to produce XML or PDF filings

Validation phase:

Submits the filing to IRS e-file system
Receives acceptance or rejection with error codes
Parses rejection codes into structured failure reasons

Correction phase:

Maps rejection codes to specific form fields or calculation logic
Regenerates the code that produced the error
Resubmits the corrected filing

The key insight is that rejection codes are machine-readable. The IRS returns codes like F1040-502-01 (dependent SSN mismatch) or SEIC-F1040-003 (earned income calculation error). The agent maintains a mapping from each rejection code to the generation prompt that produced the faulty output, so a rejection points directly at the logic that needs to be regenerated.
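A minimal sketch of such a mapping, assuming a hand-maintained table. The two codes are the ones quoted above; the `RejectionRule` record, the field labels, and the prompt IDs are hypothetical names chosen for illustration and do not come from the case study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RejectionRule:
    """Hypothetical record linking a rejection code to its likely cause."""
    code: str
    form_field: str  # illustrative label, not an official IRS field name
    prompt_id: str   # assumed ID of the generation prompt to revisit
    summary: str


# Illustrative entries only; a real table would be built from IRS rule listings.
REJECTION_MAP = {
    rule.code: rule
    for rule in [
        RejectionRule("F1040-502-01", "dependent_ssn", "dependents_v1",
                      "dependent SSN mismatch"),
        RejectionRule("SEIC-F1040-003", "earned_income", "eic_v1",
                      "earned income calculation error"),
    ]
}


def lookup_rejection(code: str) -> Optional[RejectionRule]:
    """Return the rule for a code, or None so the caller can escalate to a human."""
    return REJECTION_MAP.get(code.strip().upper())
```

Returning None for an unknown code, rather than guessing, keeps unmapped rejections out of the automated correction path.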

State Management Across Filing Cycles

Tax positions span multiple years. A net operating loss in 2024 carries forward to 2025 and beyond. Depreciation schedules can run for five, fifteen, or more years depending on the asset class. The agent needs durable state to track these positions.

State storage requirements:

Historical filings and their acceptance status
Multi-year carryforward amounts (NOLs, credits, depreciation)
Rejection history and the corrections that resolved them
Client-specific preferences (aggressive vs. conservative positions)

The case study does not specify the state backend, but the requirements suggest a document store with versioning. Each filing is a snapshot. Each correction is a new version. Queries need to retrieve “all carryforwards from 2023 that affect 2026 Schedule C.”
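One way to picture these requirements is a small in-memory stand-in for a versioned store. This is a sketch under stated assumptions, not the backend used in the case study: `FilingStore`, `Carryforward`, and the `expires_after` field are hypothetical, and the amounts are made up.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Dict, List


@dataclass(frozen=True)
class Carryforward:
    kind: str           # e.g. "NOL", "credit", "depreciation"
    origin_year: int    # tax year the position was created
    expires_after: int  # last tax year the amount can apply (assumed field)
    schedule: str       # form or schedule the amount flows into
    amount: Decimal


@dataclass
class FilingStore:
    """Hypothetical in-memory stand-in for a versioned document store."""
    versions: Dict[int, List[dict]] = field(default_factory=dict)
    carryforwards: List[Carryforward] = field(default_factory=list)

    def save_version(self, tax_year: int, filing: dict) -> int:
        """Append a snapshot; earlier versions are never overwritten."""
        history = self.versions.setdefault(tax_year, [])
        history.append(dict(filing))
        return len(history)  # 1-based version number

    def carryforwards_affecting(self, origin_year: int, target_year: int,
                                schedule: str) -> List[Carryforward]:
        """Positions created in origin_year that still apply in target_year."""
        return [
            c for c in self.carryforwards
            if c.origin_year == origin_year
            and c.expires_after >= target_year
            and c.schedule == schedule
        ]
```

With this shape, the example query becomes `store.carryforwards_affecting(2023, 2026, "Schedule C")`, and every correction shows up as a new version number for the same tax year.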

Sandboxing Code Generation for Legal Documents

When the output is a legally binding tax return, you cannot let the agent execute arbitrary code. The sandbox must enforce:

Read-only access to client financial data. No writes to source records.
Deterministic execution. Same inputs produce same outputs. No network calls during generation.
Audit trails. Every line item on the return traces back to a specific code block and prompt.

The agent likely runs generated code in a restricted Python environment (no os, subprocess, or requests modules). The execution environment captures all intermediate calculations so auditors can reconstruct how the agent arrived at a specific deduction or credit.
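A rough sketch of what such a restricted environment could look like, assuming an import check over the syntax tree plus a reduced set of builtins. The module list and the helper names `check_imports` and `run_generated` are illustrative. In-process restrictions like this are not a real security boundary in Python, so a production system would also need process or container isolation.

```python
import ast

# Assumed deny-list, based on the modules named above plus socket.
FORBIDDEN_MODULES = {"os", "subprocess", "requests", "socket"}

# Only these builtins are visible to generated code. __import__ is absent,
# so any import statement that slips past the check still fails at runtime.
SAFE_BUILTINS = {"abs": abs, "len": len, "max": max, "min": min,
                 "range": range, "round": round, "sum": sum}


def check_imports(source: str) -> None:
    """Raise ValueError if the generated code imports a forbidden module."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name.split(".")[0] in FORBIDDEN_MODULES:
                raise ValueError(f"forbidden import: {name}")


def run_generated(source: str, inputs: dict) -> dict:
    """Run generated code on a copy of the inputs and return every variable it set."""
    check_imports(source)
    namespace = {"__builtins__": SAFE_BUILTINS, "inputs": dict(inputs)}
    exec(compile(source, "<generated>", "exec"), namespace)
    return {k: v for k, v in namespace.items()
            if k not in ("__builtins__", "inputs")}
```

Returning the full namespace is what gives auditors the intermediate calculations: every variable the generated code assigned is captured alongside the final value.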

Observability for Financial Agents

You need to answer: “Why did the agent claim a $12,000 home office deduction?” The observability stack must capture:

The prompt that generated the code
The code that calculated the deduction
The financial records that fed the calculation
The IRS form field that received the value

This goes beyond standard LLM observability. Token counts and request latency are secondary. The primary goal is an audit trail that satisfies IRS scrutiny.

Observability hooks:

Layer	What to Capture	Why
Prompt	Full context sent to Codex	Reproduce the generation
Code	Generated Python with line numbers	Trace calculations
Execution	All variable values during run	Debug incorrect math
Filing	Final XML/PDF with metadata	Match return to agent version
Rejection	IRS error codes and timestamps	Measure correction latency

The agent likely logs every decision to a structured event store. Each event links to the filing it affected. When the IRS audits a return three years later, you can replay the exact agent state that produced it.
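A minimal sketch of such an event store, assuming an append-only in-memory log. The names `AuditEvent` and `EventLog` are hypothetical, and the layer labels mirror the table above. Each event gets a content digest so a later replay can confirm the record was not altered.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AuditEvent:
    filing_id: str
    layer: str     # "prompt", "code", "execution", "filing", or "rejection"
    payload: dict  # JSON-serializable details for that layer


def event_digest(event: AuditEvent) -> str:
    """SHA-256 over a canonical JSON form, so equal events hash equally."""
    canonical = json.dumps(asdict(event), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class EventLog:
    """Append-only log; events are never updated or removed."""

    def __init__(self) -> None:
        self._events: List[Tuple[str, AuditEvent]] = []

    def append(self, event: AuditEvent) -> str:
        digest = event_digest(event)
        self._events.append((digest, event))
        return digest

    def for_filing(self, filing_id: str) -> List[AuditEvent]:
        """All events linked to one filing, in the order they were logged."""
        return [e for _, e in self._events if e.filing_id == filing_id]
```

Answering a question like the home office one then reduces to calling `for_filing` with the return's ID and reading the prompt, code, and execution events in order.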

Failure Modes and Mitigations

Failure mode 1: Cascading errors. The agent fixes one rejection but introduces a new error in a dependent field. Mitigation: run a full validation suite after each correction, not just the field that failed.
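As an illustration of that mitigation, the sketch below reruns every check against the whole filing after a change. The field names (`wages`, `business_income`, `total_income`, `withholding`) and the two checks are hypothetical stand-ins, not real IRS validation rules.

```python
from typing import Callable, Dict, List

Filing = Dict[str, float]
Check = Callable[[Filing], List[str]]


def check_total_income(filing: Filing) -> List[str]:
    """A dependent field: the total must agree with its components."""
    expected = filing["wages"] + filing["business_income"]
    if abs(filing["total_income"] - expected) >= 0.005:
        return ["total_income does not equal wages + business_income"]
    return []


def check_non_negative(filing: Filing) -> List[str]:
    """Fields assumed here to never be negative."""
    return [f"{name} is negative"
            for name in ("wages", "withholding") if filing[name] < 0]


ALL_CHECKS: List[Check] = [check_total_income, check_non_negative]


def validate_all(filing: Filing, checks: List[Check] = ALL_CHECKS) -> List[str]:
    """Run every check on the full filing, not only the field that was corrected."""
    problems: List[str] = []
    for check in checks:
        problems.extend(check(filing))
    return problems
```

If a correction changes `wages` without recomputing `total_income`, `validate_all` reports the mismatch before the filing is resubmitted.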

Failure mode 2: Ambiguous rejection codes. Some IRS codes map to multiple possible causes. Mitigation: maintain a decision tree of likely fixes ranked by historical success rate.
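A simplified sketch of the ranking part, using a flat ranked list per code rather than a full decision tree. The `FixRanker` class and the fix names are hypothetical. Add-one smoothing is an assumption made here so that a fix tried once does not outrank one with a long track record.

```python
from collections import defaultdict
from typing import List


class FixRanker:
    """Tracks which candidate fix resolved each rejection code in the past."""

    def __init__(self) -> None:
        # code -> fix name -> [successes, attempts]
        self._stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def record(self, code: str, fix: str, succeeded: bool) -> None:
        stats = self._stats[code][fix]
        stats[0] += int(succeeded)
        stats[1] += 1

    def ranked(self, code: str) -> List[str]:
        """Fixes for a code, best smoothed success rate first, ties by name."""
        fixes = self._stats.get(code, {})

        def sort_key(fix: str):
            successes, attempts = fixes[fix]
            return (-(successes + 1) / (attempts + 2), fix)

        return sorted(fixes, key=sort_key)
```

The agent would try `ranked(code)[0]` first and fall through to the next candidate only if the resubmission is rejected with the same code.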

Failure mode 3: Client data quality. Garbage in, garbage out. The agent cannot fix missing or contradictory financial records. Mitigation: validate input data before generation. Flag anomalies for human review.
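A small sketch of a pre-generation check, assuming each financial record is a dict with `id`, `date`, `category`, and `amount` keys. That schema and the `find_anomalies` helper are hypothetical. It flags missing fields and contradictory duplicates and leaves the decision to a human reviewer.

```python
from typing import Dict, List

REQUIRED_FIELDS = ("id", "date", "category", "amount")  # assumed record schema


def find_anomalies(records: List[dict]) -> List[str]:
    """Return human-readable flags; an empty list means these checks passed."""
    flags: List[str] = []
    seen: Dict[str, dict] = {}
    for index, record in enumerate(records):
        missing = [name for name in REQUIRED_FIELDS
                   if record.get(name) in (None, "")]
        if missing:
            flags.append(f"record {index}: missing {', '.join(missing)}")
            continue
        previous = seen.get(record["id"])
        if previous is not None and previous != record:
            # Same ID with different contents: the records contradict each other.
            flags.append(
                f"record {index}: id {record['id']} conflicts with an earlier record")
        seen.setdefault(record["id"], record)
    return flags
```

Generation would proceed only when the returned list is empty. Anything else goes to review with the flags attached.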

Failure mode 4: Regulatory changes. Tax law changes every year. A prompt that worked in 2025 may produce incorrect filings in 2026. Mitigation: version prompts by tax year. Review and update them against new IRS publications before each filing season.
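Versioning by tax year can be as simple as keying prompts on a (form, year) pair. The sketch below is illustrative: the `PROMPTS` table, its placeholder text, and `prompt_for` are hypothetical. It fails loudly when a year has no reviewed prompt instead of silently reusing last year's.

```python
from typing import Dict, Tuple

# Placeholder prompt text; real prompts would be reviewed each filing season.
PROMPTS: Dict[Tuple[str, int], str] = {
    ("schedule_c", 2025): "Schedule C prompt reviewed for tax year 2025",
    ("schedule_c", 2026): "Schedule C prompt reviewed for tax year 2026",
}


def prompt_for(form: str, tax_year: int) -> str:
    """Return the prompt reviewed for this form and tax year, or raise LookupError."""
    try:
        return PROMPTS[(form, tax_year)]
    except KeyError:
        raise LookupError(f"no reviewed prompt for {form} in {tax_year}") from None
```

Refusing to fall back to an older year is a deliberate choice in this sketch: a missing entry blocks generation until someone has reviewed the prompt against that year's rules.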

Architecture Sketch

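# Sketch only. load_state, load_rejection_map, build_prompt, codex,
# execute_sandboxed, and irs_efile stand in for real integrations.
class TaxAgent:
    def __init__(self, client_id, tax_year, max_attempts=3):
        self.client_id = client_id
        self.tax_year = tax_year
        self.max_attempts = max_attempts  # cap on submissions before escalating
        self.state = load_state(client_id, tax_year)
        self.rejection_map = load_rejection_map()

    def generate_filing(self):
        # Build the prompt from current state, then run the generated code in the sandbox.
        prompt = build_prompt(self.state)
        code = codex.generate(prompt)
        return execute_sandboxed(code, self.state)

    def submit_and_validate(self, filing):
        # Returns (accepted, errors) so the caller corrects only this attempt's errors.
        response = irs_efile.submit(filing)
        if response.accepted:
            self.state.mark_accepted(filing)
            return True, []
        self.state.log_rejection(response.errors)
        return False, response.errors

    def correct_filing(self, errors):
        # Apply one correction per rejection, then regenerate the whole filing.
        for error in errors:
            fix_prompt = self.rejection_map.get_fix(error.code)
            corrected_code = codex.generate(fix_prompt, context=error)
            self.state.apply_correction(corrected_code)
        return self.generate_filing()

    def run(self):
        # Returns the accepted filing, or None if every attempt was rejected.
        filing = self.generate_filing()
        for attempt in range(self.max_attempts):
            accepted, errors = self.submit_and_validate(filing)
            if accepted:
                return filing
            if attempt < self.max_attempts - 1:
                filing = self.correct_filing(errors)
        return None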

This is simplified, but it shows the loop. Generate, submit, check, correct, repeat.

Technical Verdict

Use this pattern when:

You have a structured validation API that returns machine-readable error codes
The domain has clear rules (tax law, insurance underwriting, regulatory filings)
You can afford the latency of multiple submission attempts
You need an audit trail that satisfies external regulators

Avoid this pattern when:

Validation is subjective or requires human judgment
Errors are not actionable (vague rejection messages)
You cannot sandbox code execution safely
The cost of a failed submission is too high (one-shot opportunities)

The self-improving loop works because the IRS provides structured feedback. If your validation layer is a black box, you cannot build the correction mapping. If your domain lacks clear rules, the agent will thrash.

Source Links

Building self-improving tax agents with Codex