OpenAI just published a case study showing how Thrive and Crete built a production tax agent that files returns, receives IRS rejection codes, and regenerates corrected filings automatically. This is not a demo. It is a deployed system handling legally binding documents with a feedback loop that improves accuracy over time.
The plumbing matters because tax compliance is a hard constraint problem. You cannot hallucinate a deduction. You cannot guess at depreciation schedules. The agent must parse structured rejection codes, trace them back to the code that generated the filing, and regenerate the document without introducing new errors.
How the Feedback Loop Works
The agent operates in three phases: generation, validation, and correction.
Generation phase:
- Codex reads client financial data (income, expenses, deductions)
- Generates Python code that populates IRS forms (1040, Schedule C, etc.)
- Executes the code to produce XML or PDF filings
Validation phase:
- Submits the filing to the IRS e-file system
- Receives acceptance or rejection with error codes
- Parses rejection codes into structured failure reasons
Correction phase:
- Maps rejection codes to specific form fields or calculation logic
- Regenerates the code that produced the error
- Resubmits the corrected filing
The key insight is that rejection codes are machine-readable. The IRS returns codes like F1040-502-01 (dependent SSN mismatch) or SEIC-F1040-003 (earned income calculation error). The agent maintains a mapping from each rejection code to the generation prompt that produced the faulty output.
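A minimal sketch of that mapping, assuming a plain dictionary keyed by rejection code. The `RejectionFix` fields, the prompt IDs, and the field paths are hypothetical. The two codes and their descriptions are the ones quoted above, used purely as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RejectionFix:
    form_field: str  # field or calculation the rejection points at (hypothetical path)
    prompt_id: str   # hypothetical ID of the prompt that generated that logic
    hint: str        # short instruction appended to the regeneration prompt


# Illustrative entries only, not an actual rule table.
REJECTION_MAP = {
    "F1040-502-01": RejectionFix(
        "dependents.ssn", "prompt/dependents",
        "Re-check each dependent SSN against the source records.",
    ),
    "SEIC-F1040-003": RejectionFix(
        "eic.earned_income", "prompt/earned_income",
        "Recompute earned income used for the credit.",
    ),
}


def plan_corrections(codes):
    """Split rejection codes into known fixes and codes that need human review."""
    known, unknown = [], []
    for code in codes:
        fix = REJECTION_MAP.get(code)
        if fix is None:
            unknown.append(code)
        else:
            known.append((code, fix))
    return known, unknown
```

Codes with no entry are returned separately rather than guessed at, so an unmapped rejection ends up with a human instead of in another submission.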
State Management Across Filing Cycles
Tax positions span multiple years. A net operating loss in 2024 carries forward to 2025. Depreciation schedules can run for five, fifteen, or more years. The agent needs durable state to track these positions.
State storage requirements:
- Historical filings and their acceptance status
- Multi-year carryforward amounts (NOLs, credits, depreciation)
- Rejection history and the corrections that resolved them
- Client-specific preferences (aggressive vs. conservative positions)
The case study does not specify the state backend, but the requirements suggest a document store with versioning. Each filing is a snapshot. Each correction is a new version. Queries need to retrieve “all carryforwards from 2023 that affect 2026 Schedule C.”
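Since the backend is unspecified, here is an in-memory sketch of the two behaviors described: append-only filing versions and a carryforward query. `Carryforward`, `ClientState`, and every field name are assumptions, and the amounts are made up.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)
class Carryforward:
    kind: str                 # e.g. "NOL", "credit", "depreciation"
    origin_year: int          # tax year the position was created
    amount: float
    forms: Tuple[str, ...]    # forms or schedules the amount flows into
    expires_year: Optional[int] = None  # None means no expiry is tracked


@dataclass
class ClientState:
    carryforwards: list = field(default_factory=list)
    filing_versions: dict = field(default_factory=dict)  # tax_year -> list of snapshots

    def record_filing(self, tax_year, snapshot):
        """Append a new version; earlier versions are never overwritten."""
        versions = self.filing_versions.setdefault(tax_year, [])
        versions.append(dict(snapshot))
        return len(versions)  # 1-based version number

    def carryforwards_affecting(self, origin_year, target_year, form):
        """Carryforwards from origin_year still live in target_year for a form."""
        return [
            c for c in self.carryforwards
            if c.origin_year == origin_year
            and c.origin_year < target_year
            and form in c.forms
            and (c.expires_year is None or c.expires_year >= target_year)
        ]


# Made-up example data.
state = ClientState(carryforwards=[
    Carryforward("depreciation", 2023, 4200.0, ("Schedule C",)),
    Carryforward("credit", 2023, 900.0, ("Schedule C",), expires_year=2025),
    Carryforward("NOL", 2024, 15000.0, ("Form 1040",)),
])
affecting_2026 = state.carryforwards_affecting(2023, 2026, "Schedule C")
```

A production store would persist these records and version them on disk, but the query shape stays the same.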
Sandboxing Code Generation for Legal Documents
When the output is a legally binding tax return, you cannot let the agent execute arbitrary code. The sandbox must enforce:
- Read-only access to client financial data. No writes to source records.
- Deterministic execution. Same inputs produce same outputs. No network calls during generation.
- Audit trails. Every line item on the return traces back to a specific code block and prompt.
The agent likely runs generated code in a restricted Python environment (no os, subprocess, or requests modules). The execution environment captures all intermediate calculations so auditors can reconstruct how the agent arrived at a specific deduction or credit.
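One way to approximate that restriction is to reject blocked imports statically and run the code with a minimal set of builtins. This is a sketch under assumptions, not the system's actual sandbox. `find_blocked_imports`, `run_restricted`, and the `result` and `inputs` conventions are hypothetical. In-process checks like this can be bypassed, so real isolation would still need an OS-level boundary such as a container or a separate locked-down process.

```python
import ast

# Modules named in the text above.
BLOCKED_MODULES = {"os", "subprocess", "requests"}

# Small allowlist of builtins the generated code may call (an assumption).
SAFE_BUILTINS = {"sum": sum, "round": round, "min": min, "max": max, "len": len}


def find_blocked_imports(source):
    """Return the sorted names of blocked modules that `source` imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in BLOCKED_MODULES:
                found.add(root)
    return sorted(found)


def run_restricted(source, inputs):
    """Execute generated code with limited builtins and a copy of the inputs."""
    blocked = find_blocked_imports(source)
    if blocked:
        raise PermissionError(f"blocked imports: {blocked}")
    # A dict copy keeps top-level keys of the caller's data from being replaced.
    namespace = {"__builtins__": SAFE_BUILTINS, "inputs": dict(inputs)}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace.get("result")  # convention: generated code assigns `result`


total = run_restricted("result = sum(inputs['expenses'])", {"expenses": [100, 250]})
```

Because `__import__` is missing from the allowed builtins, any import statement inside the generated code fails at runtime even if it slips past the static check.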
Observability for Financial Agents
You need to answer: “Why did the agent claim a $12,000 home office deduction?” The observability stack must capture:
- The prompt that generated the code
- The code that calculated the deduction
- The financial records that fed the calculation
- The IRS form field that received the value
This is not standard LLM observability. Token counts and latency are secondary. The primary goal is an audit trail that satisfies IRS scrutiny.
Observability hooks:
| Layer | What to Capture | Why |
|---|---|---|
| Prompt | Full context sent to Codex | Reproduce the generation |
| Code | Generated Python with line numbers | Trace calculations |
| Execution | All variable values during run | Debug incorrect math |
| Filing | Final XML/PDF with metadata | Match return to agent version |
| Rejection | IRS error codes and timestamps | Measure correction latency |
The agent likely logs every decision to a structured event store. Each event links to the filing it affected. When the IRS audits a return three years later, you can replay the exact agent state that produced it.
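A sketch of what one such event could look like, assuming an in-memory log. `AuditEvent`, `AuditLog`, and the field names are hypothetical. The SHA-256 fingerprint is one possible way to detect later tampering with a stored event, not something the case study describes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditEvent:
    filing_id: str
    form_field: str          # form field that received the value
    value: float
    prompt: str              # prompt that generated the code
    code: str                # code that calculated the value
    input_record_ids: tuple  # financial records that fed the calculation

    def fingerprint(self):
        """Stable SHA-256 hash of the event contents."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


class AuditLog:
    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)
        return event.fingerprint()

    def explain(self, filing_id, form_field):
        """Answer 'why does this field have this value?' for one filing."""
        return [
            e for e in self._events
            if e.filing_id == filing_id and e.form_field == form_field
        ]


log = AuditLog()
event = AuditEvent(
    filing_id="filing-001",
    form_field="schedule_c.home_office",
    value=12000.0,
    prompt="(prompt text)",
    code="(generated code)",
    input_record_ids=("rec-17", "rec-42"),
)
receipt = log.record(event)
```

Calling `log.explain("filing-001", "schedule_c.home_office")` returns the event that links the $12,000 value to its prompt, code, and source records.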
Failure Modes and Mitigations
Failure mode 1: Cascading errors. The agent fixes one rejection but introduces a new error in a dependent field. Mitigation: run a full validation suite after each correction, not just the field that failed.
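A minimal sketch of that mitigation: run every check before and after the correction, and reject the correction if it introduces a failure that was not there before. The check names and filing fields are made up for illustration.

```python
# Hypothetical whole-filing checks; each takes the filing and returns True on pass.
CHECKS = {
    "total_income_adds_up": lambda f: f["wages"] + f["business_income"] == f["total_income"],
    "dependents_have_ssn": lambda f: all(d.get("ssn") for d in f["dependents"]),
}


def validate_all(filing, checks=CHECKS):
    """Run every check and return the names of those that fail."""
    return [name for name, check in checks.items() if not check(filing)]


def accept_correction(before, after, checks=CHECKS):
    """Accept a correction only if it introduces no new failures."""
    failures_before = set(validate_all(before, checks))
    failures_after = set(validate_all(after, checks))
    new_failures = sorted(failures_after - failures_before)
    return (not new_failures, new_failures)


before = {"wages": 50000, "business_income": 10000, "total_income": 60000,
          "dependents": [{"ssn": ""}]}
# The correction fills in the SSN but also changes business income
# without updating the total, which breaks a dependent field.
after = {"wages": 50000, "business_income": 12000, "total_income": 60000,
         "dependents": [{"ssn": "000-00-0000"}]}
ok, new_failures = accept_correction(before, after)
```

Here the original rejection is resolved, yet the correction is refused because the income total no longer adds up.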
Failure mode 2: Ambiguous rejection codes. Some IRS codes map to multiple possible causes. Mitigation: maintain a decision tree of likely fixes ranked by historical success rate.
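The ranking part can be sketched as a flat list ordered by success rate, which is simpler than a full decision tree. `FixRanker` is hypothetical. The add-one smoothing is my assumption, included so a fix that worked once does not outrank one with a long track record.

```python
from collections import defaultdict


class FixRanker:
    """Rank candidate fixes for a rejection code by smoothed success rate."""

    def __init__(self):
        # rejection code -> fix name -> [successes, attempts]
        self._stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def record(self, code, fix, succeeded):
        stats = self._stats[code][fix]
        stats[1] += 1
        if succeeded:
            stats[0] += 1

    def ranked(self, code):
        def smoothed_rate(item):
            successes, attempts = item[1]
            return (successes + 1) / (attempts + 2)  # add-one smoothing

        ordered = sorted(self._stats[code].items(), key=smoothed_rate, reverse=True)
        return [fix for fix, _ in ordered]


ranker = FixRanker()
ranker.record("F1040-502-01", "swap_first_last_name", True)   # 1 of 1
for outcome in [True] * 8 + [False] * 2:                      # 8 of 10
    ranker.record("F1040-502-01", "reread_ssn_from_source", outcome)
order = ranker.ranked("F1040-502-01")
```

With smoothing, the fix that succeeded 8 times out of 10 scores 0.75 and ranks ahead of the single success at about 0.67.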
Failure mode 3: Client data quality. Garbage in, garbage out. The agent cannot fix missing or contradictory financial records. Mitigation: validate input data before generation. Flag anomalies for human review.
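A pre-generation check might look like the sketch below. The record shape, the required fields, and the three anomaly rules are assumptions chosen for illustration, not a complete data-quality policy.

```python
REQUIRED_FIELDS = ("date", "category", "amount")


def find_anomalies(records):
    """Return (index, reason) pairs for records that need human review."""
    anomalies = []
    seen = set()
    for index, record in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
        if missing:
            anomalies.append((index, "missing: " + ", ".join(missing)))
            continue
        key = (record["date"], record["category"], record["amount"])
        if key in seen:
            anomalies.append((index, "possible duplicate"))
        seen.add(key)
        if record["category"] == "expense" and record["amount"] < 0:
            anomalies.append((index, "negative expense"))
    return anomalies


records = [
    {"date": "2025-01-10", "category": "expense", "amount": 120.0},
    {"date": "2025-01-10", "category": "expense", "amount": 120.0},
    {"date": "2025-02-01", "category": "income", "amount": None},
    {"date": "2025-03-01", "category": "expense", "amount": -40.0},
]
flagged = find_anomalies(records)
```

Generation would proceed only when `flagged` is empty or a reviewer has cleared each entry.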
Failure mode 4: Regulatory changes. Tax law changes every year. A prompt that worked in 2025 may produce incorrect filings in 2026. Mitigation: version prompts by tax year. Review and update them against new IRS publications before each filing season.
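Versioning by tax year can be as simple as a registry keyed by year that fails closed. The registry layout, version string, and template text are hypothetical.

```python
# Hypothetical registry: one reviewed prompt template per tax year.
PROMPTS_BY_TAX_YEAR = {
    2025: {
        "version": "2025.1",
        "template": "Generate Python that populates {form} for tax year {tax_year}.",
    },
}


def prompt_for(tax_year, form):
    """Return (version, prompt text) or fail if the year has no reviewed prompt."""
    entry = PROMPTS_BY_TAX_YEAR.get(tax_year)
    if entry is None:
        # Fail closed: never fall back silently to a previous year's prompt.
        raise LookupError(f"no reviewed prompt for tax year {tax_year}")
    text = entry["template"].format(form=form, tax_year=tax_year)
    return entry["version"], text


version, text = prompt_for(2025, "Schedule C")
```

Asking for 2026 raises `LookupError` until someone adds and reviews a 2026 entry, which is the point: a stale prompt should block filing, not produce one.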
Architecture Sketch
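```python
# Placeholders, not real APIs: load_state, load_rejection_map, build_prompt,
# execute_sandboxed, codex, and irs_efile stand in for the components described above.
# max_attempts is an added safeguard, not something the case study specifies.

class TaxAgent:
    def __init__(self, client_id, tax_year, max_attempts=3):
        self.client_id = client_id
        self.tax_year = tax_year
        self.max_attempts = max_attempts  # cap submissions so the loop always ends
        self.state = load_state(client_id, tax_year)
        self.rejection_map = load_rejection_map()

    def generate_filing(self):
        """Generate filing code from the current state and run it in the sandbox."""
        prompt = build_prompt(self.state)
        code = codex.generate(prompt)
        return execute_sandboxed(code, self.state)

    def submit_and_validate(self, filing):
        """Submit the filing and return (accepted, errors)."""
        response = irs_efile.submit(filing)
        if response.accepted:
            self.state.mark_accepted(filing)
            return True, []
        self.state.log_rejection(response.errors)
        return False, response.errors

    def correct_filing(self, errors):
        """Apply one targeted correction per rejection, then regenerate the filing."""
        for error in errors:
            fix_prompt = self.rejection_map.get_fix(error.code)
            corrected_code = codex.generate(fix_prompt, context=error)
            self.state.apply_correction(corrected_code)
        return self.generate_filing()

    def run(self):
        """Return the accepted filing, or None if every attempt is rejected."""
        filing = self.generate_filing()
        for attempt in range(self.max_attempts):
            accepted, errors = self.submit_and_validate(filing)
            if accepted:
                return filing
            if attempt + 1 < self.max_attempts:
                filing = self.correct_filing(errors)
        return None
```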
This is simplified, but it shows the loop. Generate, submit, check, correct, repeat.
Technical Verdict
Use this pattern when:
- You have a structured validation API that returns machine-readable error codes
- The domain has clear rules (tax law, insurance underwriting, regulatory filings)
- You can afford the latency of multiple submission attempts
- You need an audit trail that satisfies external regulators
Avoid this pattern when:
- Validation is subjective or requires human judgment
- Errors are not actionable (vague rejection messages)
- You cannot sandbox code execution safely
- The cost of a failed submission is too high (one-shot opportunities)
The self-improving loop works because the IRS provides structured feedback. If your validation layer is a black box, you cannot build the correction mapping. If your domain lacks clear rules, the agent will thrash.