mech.app

Self-Improving Tax Agents: How Codex Learns from IRS Rejections

OpenAI, Thrive, and Crete built a tax agent that parses IRS rejection codes, regenerates filings, and closes the loop on compliance automation.

Source: openai.com

OpenAI just published a case study showing how Thrive and Crete built a production tax agent that files returns, receives IRS rejection codes, and regenerates corrected filings automatically. This is not a demo. It is a deployed system handling legally binding documents with a feedback loop that improves accuracy over time.

The plumbing matters because tax compliance is a hard constraint problem. You cannot hallucinate a deduction. You cannot guess at depreciation schedules. The agent must parse structured rejection codes, trace them back to the code that generated the filing, and regenerate the document without introducing new errors.

How the Feedback Loop Works

The agent operates in three phases: generation, validation, and correction.

Generation phase:

  • Codex reads client financial data (income, expenses, deductions)
  • Generates Python code that populates IRS forms (1040, Schedule C, etc.)
  • Executes the code to produce XML or PDF filings

Validation phase:

  • Submits the filing to the IRS e-file system
  • Receives acceptance or rejection with error codes
  • Parses rejection codes into structured failure reasons

Correction phase:

  • Maps rejection codes to specific form fields or calculation logic
  • Regenerates the code that produced the error
  • Resubmits the corrected filing

The key insight is that rejection codes are machine-readable. The IRS returns codes like F1040-502-01 (dependent SSN mismatch) or SEIC-F1040-003 (earned income calculation error). The agent maintains a mapping between these rejection codes and the prompts that generated the faulty Python, so each rejection points at the logic that needs to be regenerated.
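
The parsing step can be sketched as a lookup from rejection code to a structured failure reason. This is a minimal sketch, not the case study's implementation: `Rejection`, `FAILURE_REASONS`, and `parse_rejections` are hypothetical names, and the raw response shape (a list of dicts with `code` and `message` keys) is an assumption. The two codes and their descriptions are the ones quoted in the text, not an authoritative rule list.

```python
from dataclasses import dataclass

# Hypothetical mapping: rejection code -> (target field or calculation, reason).
# Codes and descriptions are the examples quoted in the article.
FAILURE_REASONS = {
    "F1040-502-01": ("dependent_ssn", "Dependent SSN mismatch"),
    "SEIC-F1040-003": ("earned_income", "Earned income calculation error"),
}

@dataclass(frozen=True)
class Rejection:
    code: str
    target: str   # form field or calculation the fix should touch
    reason: str
    known: bool   # False means the code is unmapped and needs human triage

def parse_rejections(raw_errors):
    """Turn raw error dicts (assumed shape) into structured rejections."""
    parsed = []
    for err in raw_errors:
        code = err["code"].strip().upper()
        if code in FAILURE_REASONS:
            target, reason = FAILURE_REASONS[code]
            parsed.append(Rejection(code, target, reason, known=True))
        else:
            # Keep the raw message so a reviewer can see what came back.
            parsed.append(Rejection(code, "unknown", err.get("message", ""), known=False))
    return parsed

rejections = parse_rejections([
    {"code": "f1040-502-01", "message": "raw text from the response"},
    {"code": "X-999", "message": "unmapped rule"},
])
```

Marking unmapped codes as `known=False` instead of raising keeps the loop running while still routing unfamiliar rejections to a person.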

State Management Across Filing Cycles

Tax positions span multiple years. A net operating loss in 2024 carries forward to 2025. Depreciation schedules commonly run five, seven, or fifteen years, and longer for real property. The agent needs durable state to track these positions.

State storage requirements:

  • Historical filings and their acceptance status
  • Multi-year carryforward amounts (NOLs, credits, depreciation)
  • Rejection history and the corrections that resolved them
  • Client-specific preferences (aggressive vs. conservative positions)

The case study does not specify the state backend, but the requirements suggest a document store with versioning. Each filing is a snapshot. Each correction is a new version. Queries need to retrieve “all carryforwards from 2023 that affect 2026 Schedule C.”
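
Since the backend is unspecified, the following is only an in-memory sketch of the snapshot-and-version idea and the carryforward query. `Carryforward`, `FilingStore`, and every field name are hypothetical, and amounts are held as integer cents to avoid floating-point rounding.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Carryforward:
    kind: str           # e.g. "NOL", "credit", "depreciation"
    origin_year: int
    amount_cents: int
    expires_year: int   # last tax year in which the amount can be used
    form: str           # form the amount flows into

@dataclass
class FilingStore:
    # (client_id, tax_year) -> append-only list of filing snapshots
    versions: dict = field(default_factory=dict)
    carryforwards: list = field(default_factory=list)

    def save_version(self, client_id, tax_year, filing):
        history = self.versions.setdefault((client_id, tax_year), [])
        history.append(filing)   # never overwrite an earlier snapshot
        return len(history)      # 1-based version number

    def carryforwards_affecting(self, origin_year, target_year, form):
        """Carryforwards from origin_year still usable on `form` in target_year."""
        return [
            c for c in self.carryforwards
            if c.origin_year == origin_year
            and c.origin_year < target_year <= c.expires_year
            and c.form == form
        ]

store = FilingStore()
store.carryforwards.append(Carryforward("depreciation", 2023, 400_000, 2027, "Schedule C"))
store.carryforwards.append(Carryforward("credit", 2023, 50_000, 2025, "Schedule C"))
v1 = store.save_version("client-1", 2026, {"status": "rejected"})
v2 = store.save_version("client-1", 2026, {"status": "accepted"})
hits = store.carryforwards_affecting(2023, 2026, "Schedule C")
```

In this example only the depreciation entry is returned, because the credit expires before 2026. A production store would persist these records and index them by client.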

Sandboxing Generated Code

When the output is a legally binding tax return, you cannot let the agent execute arbitrary code. The sandbox must enforce:

  • Read-only access to client financial data. No writes to source records.
  • Deterministic execution. Same inputs produce same outputs. No network calls during generation.
  • Audit trails. Every line item on the return traces back to a specific code block and prompt.

The agent likely runs generated code in a restricted Python environment (no os, subprocess, or requests modules). The execution environment captures all intermediate calculations so auditors can reconstruct how the agent arrived at a specific deduction or credit.
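
One inexpensive layer of such a restriction is a static check that rejects generated code importing blocked modules before it ever runs. The sketch below uses Python's standard `ast` module; `BLOCKED_MODULES` and `find_blocked_imports` are hypothetical names. It is a pre-execution filter, not a sandbox: dynamic imports can bypass it, so real isolation still needs process-level or container-level controls.

```python
import ast

# Assumed deny list, based on the modules named in the text plus socket.
BLOCKED_MODULES = {"os", "subprocess", "requests", "socket"}

def find_blocked_imports(source):
    """Return the blocked top-level modules that the source imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]   # module is None for "from . import x"
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in BLOCKED_MODULES:
                found.add(root)
    return sorted(found)

generated = "import os.path\nfrom subprocess import run\nimport math\n"
violations = find_blocked_imports(generated)
```

Here `violations` lists `os` and `subprocess`, so the generated code would be rejected and regenerated rather than executed.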

Observability for Financial Agents

You need to answer: “Why did the agent claim a $12,000 home office deduction?” The observability stack must capture:

  • The prompt that generated the code
  • The code that calculated the deduction
  • The financial records that fed the calculation
  • The IRS form field that received the value

This goes beyond standard LLM observability. Token counts and latency are secondary. The primary goal is an audit trail that withstands IRS scrutiny.

Observability hooks:

| Layer | What to Capture | Why |
|---|---|---|
| Prompt | Full context sent to Codex | Reproduce the generation |
| Code | Generated Python with line numbers | Trace calculations |
| Execution | All variable values during run | Debug incorrect math |
| Filing | Final XML/PDF with metadata | Match return to agent version |
| Rejection | IRS error codes and timestamps | Measure correction latency |

The agent likely logs every decision to a structured event store. Each event links to the filing it affected. When the IRS audits a return three years later, you can replay the exact agent state that produced it.
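
A single audit event for one line item might look like the sketch below. The schema is an assumption, not something the case study describes: `AuditEvent`, `record_line_item`, and the field names are hypothetical. Hashing the prompt and code lets the event reference exact artifacts stored elsewhere without copying them into every log line.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

def digest(text):
    """SHA-256 hex digest, used as a stable reference to a stored artifact."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass(frozen=True)
class AuditEvent:
    filing_id: str
    form_field: str
    value_cents: int
    prompt_sha256: str
    code_sha256: str
    source_record_ids: tuple

def record_line_item(filing_id, form_field, value_cents, prompt, code, record_ids):
    event = AuditEvent(
        filing_id, form_field, value_cents,
        digest(prompt), digest(code), tuple(sorted(record_ids)),
    )
    # sort_keys gives a stable serialization for identical events.
    return event, json.dumps(asdict(event), sort_keys=True)

event, line = record_line_item(
    "filing-001", "home_office_deduction", 1_200_000,
    "prompt text sent to the model", "deduction_cents = 1_200_000",
    ["rec-2", "rec-1"],
)
```

With events like this, the question about a $12,000 deduction becomes a lookup by `filing_id` and `form_field`.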

Failure Modes and Mitigations

Failure mode 1: Cascading errors. The agent fixes one rejection but introduces a new error in a dependent field. Mitigation: run a full validation suite after each correction, not just the field that failed.
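
A full validation pass can be as simple as running every registered check and collecting the failures. In this sketch the checks, the `VALIDATORS` list, and the filing's dictionary shape are all hypothetical stand-ins for whatever cross-field rules a real system enforces.

```python
def check_totals(filing):
    if filing["total_income"] != sum(filing["income_items"]):
        return "total_income does not match income_items"
    return None

def check_dependents(filing):
    if filing["dependents_claimed"] != len(filing["dependent_ids"]):
        return "dependent count does not match dependent list"
    return None

VALIDATORS = [check_totals, check_dependents]

def validate_all(filing):
    """Run every check, not only the one tied to the last rejection."""
    return [msg for check in VALIDATORS if (msg := check(filing)) is not None]

# A correction fixed the income total but left the dependent list inconsistent.
filing = {
    "income_items": [50_000, 2_500],
    "total_income": 52_500,
    "dependents_claimed": 2,
    "dependent_ids": ["dep-1"],
}
problems = validate_all(filing)
```

Resubmission would be gated on `problems` being empty, which catches the dependent mismatch locally instead of through another rejection.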

Failure mode 2: Ambiguous rejection codes. Some IRS codes map to multiple possible causes. Mitigation: maintain a decision tree of likely fixes ranked by historical success rate.
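
The ranking part of that mitigation can be sketched as a per-code success counter. This is a flat ranking rather than a full decision tree, and `FixRanker` and the fix names are hypothetical.

```python
from collections import defaultdict

class FixRanker:
    def __init__(self):
        # rejection code -> fix name -> [successes, attempts]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def record(self, code, fix, succeeded):
        entry = self.stats[code][fix]
        entry[0] += int(succeeded)
        entry[1] += 1

    def ranked(self, code):
        """Candidate fixes for a code, best historical success rate first."""
        fixes = self.stats.get(code, {})
        # Ties are broken alphabetically so the order is deterministic.
        return sorted(fixes, key=lambda f: (-(fixes[f][0] / fixes[f][1]), f))

ranker = FixRanker()
ranker.record("F1040-502-01", "recheck_ssn_digits", True)
ranker.record("F1040-502-01", "recheck_ssn_digits", True)
ranker.record("F1040-502-01", "recheck_name_control", True)
ranker.record("F1040-502-01", "recheck_name_control", False)
order = ranker.ranked("F1040-502-01")
```

The agent would try fixes in `order` and record each outcome, so the ranking shifts as evidence accumulates. With few attempts per fix the raw rate is noisy, so a real system would likely add smoothing.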

Failure mode 3: Client data quality. Garbage in, garbage out. The agent cannot fix missing or contradictory financial records. Mitigation: validate input data before generation. Flag anomalies for human review.
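
Input validation before generation might look like the sketch below. The required fields and the anomaly rules are illustrative assumptions, not rules from the case study, and `find_anomalies` is a hypothetical name.

```python
# Assumed minimum set of fields the generator needs.
REQUIRED_FIELDS = ("wages", "business_income", "business_expenses")

def find_anomalies(record):
    """Return human-readable issues; an empty list means safe to generate."""
    issues = []
    for name in REQUIRED_FIELDS:
        if record.get(name) is None:
            issues.append(f"missing field: {name}")
    income = record.get("business_income")
    expenses = record.get("business_expenses")
    if income is not None and expenses is not None:
        if expenses < 0:
            issues.append("business_expenses is negative")
        if income == 0 and expenses > 0:
            # Not necessarily wrong, but worth a human look.
            issues.append("expenses reported with no business income")
    return issues

flagged = find_anomalies({"wages": 60_000, "business_income": 0, "business_expenses": 900})
```

Any non-empty result would route the client record to human review instead of the generation phase.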

Failure mode 4: Regulatory changes. Tax law changes every year. A prompt that worked in 2025 may produce incorrect filings in 2026. Mitigation: version prompts by tax year. Review and update the prompts and the rejection mapping against new IRS publications before each filing season.
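
Versioning prompts by tax year can be sketched as a registry keyed on form and year. `PromptRegistry` is a hypothetical name, and raising on a missing year is a design assumption: a hard failure is safer than silently reusing last year's prompt.

```python
class PromptRegistry:
    def __init__(self):
        self._prompts = {}   # (form, tax_year) -> prompt template

    def register(self, form, tax_year, template):
        self._prompts[(form, tax_year)] = template

    def get(self, form, tax_year):
        # Fail loudly rather than fall back to another year's prompt.
        try:
            return self._prompts[(form, tax_year)]
        except KeyError:
            raise LookupError(f"no reviewed prompt for {form} in {tax_year}") from None

registry = PromptRegistry()
registry.register("schedule_c", 2025, "template reviewed against 2025 rules")
prompt_2025 = registry.get("schedule_c", 2025)

try:
    registry.get("schedule_c", 2026)
    missing_year_blocked = False
except LookupError:
    missing_year_blocked = True
```

The 2026 lookup fails until someone registers a reviewed template for that year, which turns the annual review into a deployment gate.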

Architecture Sketch

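# Sketch only: load_state, load_rejection_map, build_prompt, codex,
# execute_sandboxed, and irs_efile are placeholders for components the
# case study does not describe.
class TaxAgent:
    def __init__(self, client_id, tax_year, max_attempts=3):
        self.client_id = client_id
        self.tax_year = tax_year
        self.max_attempts = max_attempts
        self.state = load_state(client_id, tax_year)
        self.rejection_map = load_rejection_map()

    def generate_filing(self):
        # Build the prompt from durable state, generate code, run it sandboxed.
        prompt = build_prompt(self.state)
        code = codex.generate(prompt)
        return execute_sandboxed(code, self.state)

    def submit_and_validate(self, filing):
        # Return (accepted, errors) so the caller can act on the rejection codes.
        response = irs_efile.submit(filing)
        if response.accepted:
            self.state.mark_accepted(filing)
            return True, []
        self.state.log_rejection(response.errors)
        return False, response.errors

    def correct_filing(self, errors):
        # Apply one fix per rejection code, then regenerate the whole filing.
        for error in errors:
            fix_prompt = self.rejection_map.get_fix(error.code)
            corrected_code = codex.generate(fix_prompt, context=error)
            self.state.apply_correction(corrected_code)
        return self.generate_filing()

    def run(self):
        # Generate, submit, check, correct, repeat, with a bounded retry count.
        filing = self.generate_filing()
        for attempt in range(self.max_attempts):
            accepted, errors = self.submit_and_validate(filing)
            if accepted:
                return filing
            if attempt < self.max_attempts - 1:
                filing = self.correct_filing(errors)
        return None  # Still rejected after max_attempts: escalate to a human.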

This is simplified, but it shows the loop. Generate, submit, check, correct, repeat.

Technical Verdict

Use this pattern when:

  • You have a structured validation API that returns machine-readable error codes
  • The domain has clear rules (tax law, insurance underwriting, regulatory filings)
  • You can afford the latency of multiple submission attempts
  • You need an audit trail that satisfies external regulators

Avoid this pattern when:

  • Validation is subjective or requires human judgment
  • Errors are not actionable (vague rejection messages)
  • You cannot sandbox code execution safely
  • The cost of a failed submission is too high (one-shot opportunities)

The self-improving loop works because the IRS provides structured feedback. If your validation layer is a black box, you cannot build the correction mapping. If your domain lacks clear rules, the agent will thrash.