ClinEnv: Why Medical AI Agents Need Multi-Stage Environments Instead of Multiple-Choice Benchmarks

Most medical AI benchmarks give agents a paragraph and five radio buttons. ClinEnv gives them a longitudinal inpatient admission, four specialized sub-agents, and a sequence of irreversible decision commits. The difference exposes plumbing gaps that static benchmarks cannot measure.

The Problem with Multiple-Choice Medical Benchmarks

A physician does not diagnose by selecting option C. Clinical practice is incremental information gathering across heterogeneous data streams (labs, imaging, patient history) followed by sequential, irreversible decisions under uncertainty. You order a CT scan. You start antibiotics. You cannot rewind.

Static benchmarks evaluate outcome correctness but ignore process quality. They cannot measure:

Whether the agent gathered the right information before committing
How many redundant queries it issued
Whether it understood the irreversibility of medication orders
How state management degrades across multiple decision stages

ClinEnv addresses this by modeling real inpatient admissions as ordered sequences of decision stages. At each stage, the agent must query specialized sub-agents, then commit to medications, procedures, and diagnoses. The environment scores both what the agent decides and how it gathers information.

Architecture: Multi-Stage Decision Workflow

ClinEnv constructs each case as a directed acyclic graph of decision stages that the agent traverses in a fixed order. Each stage represents a temporal slice of the admission (for example, admission, day 2, or discharge). The agent cannot backtrack.

Stage structure:

State accumulation: Labs, vitals, imaging results, and prior decisions persist across stages
Query phase: Agent calls four specialized sub-agents (lab interpreter, imaging reader, history summarizer, medication advisor)
Commit phase: Agent submits medications, procedures, and diagnoses
Transition: Environment advances to next stage, making prior commits immutable
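
The four-step stage structure above can be sketched as a simple forward-only loop. This is a minimal illustration, not ClinEnv's implementation: `Stage`, `run_case`, `query_fn`, and `commit_fn` are hypothetical names introduced here, and the state is reduced to two lists.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Stage:
    """One temporal slice of the admission (hypothetical structure)."""
    stage_id: int
    observations: dict                      # labs, vitals, imaging released at this stage
    queries: list = field(default_factory=list)
    commit: Optional[dict] = None           # set once, at the end of the stage


def run_case(stages: list,
             query_fn: Callable[[dict], list],
             commit_fn: Callable[[dict, list], dict]) -> list:
    """Walk the stages strictly in order; there is no path back to an earlier stage."""
    state = {"observations": [], "commits": []}
    for stage in stages:
        # State accumulation: new observations are appended, never replaced.
        state["observations"].append(stage.observations)
        # Query phase: the agent decides what to ask the sub-agents.
        stage.queries = query_fn(state)
        # Commit phase: the agent submits its decisions for this stage.
        stage.commit = commit_fn(state, stage.queries)
        # Transition: a copy of the commit joins the history and is not revisited.
        state["commits"].append(dict(stage.commit))
    return state["commits"]
```

The only point of the sketch is the ordering: observations accumulate, queries precede the commit, and the loop never steps backward.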

Sub-agent interface:

Each sub-agent exposes a single function with a structured schema. The agent can query each sub-agent multiple times per stage, but the environment tracks redundancy.

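from typing import Literal, TypedDict


class LabInterpretation(TypedDict):
    """Structured result returned by the lab interpreter."""
    value: float
    reference_range: tuple
    clinical_significance: str
    trend: Literal["improving", "worsening", "stable"]


class LabInterpreterAgent:
    def query(self, lab_name: str, context: dict) -> LabInterpretation:
        """
        Returns interpretation of specified lab result
        given current patient context.
        """
        # Interface only: the backing implementation is not shown here.
        raise NotImplementedError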

Commit schema:

The agent submits a structured decision object at the end of each stage. The environment validates against ontology-grounded vocabularies (ICD-10, RxNorm, CPT).

{
    "medications": [
        {"name": "vancomycin", "dose": "1g", "route": "IV", "frequency": "q12h"}
    ],
    "procedures": ["central_line_placement", "chest_xray"],
    "diagnoses": ["sepsis", "pneumonia"]
}

The environment compares submitted decisions to ground truth using deterministic ontology matching, not fuzzy string similarity.
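
A minimal sketch of what deterministic matching could look like once decisions are expressed as canonical codes. The vocabulary and identifiers below are placeholders invented for illustration (they are not real ICD-10 codes), and `validate` and `compare` are hypothetical helpers, not ClinEnv functions.

```python
# Hypothetical vocabulary: placeholder identifiers, not real ICD-10 codes.
DIAGNOSIS_VOCAB = {"DX:SEPSIS", "DX:PNEUMONIA", "DX:UTI"}


def validate(codes: list, vocab: set) -> None:
    """Reject any submitted code that is not in the controlled vocabulary."""
    unknown = [c for c in codes if c not in vocab]
    if unknown:
        raise ValueError(f"codes outside vocabulary: {unknown}")


def compare(submitted: list, truth: list) -> dict:
    """Exact set comparison on canonical codes; no fuzzy string similarity."""
    s, t = set(submitted), set(truth)
    return {
        "matched": sorted(s & t),
        "missed": sorted(t - s),
        "extra": sorted(s - t),
    }
```

Because the comparison is plain set arithmetic over codes, the same submission always receives the same result, which is the property the deterministic approach is after.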

State Management Across Stages

ClinEnv maintains a cumulative state object that grows as the case progresses. Each stage appends new observations without removing prior data.

State schema:

{
    "patient_id": str,
    "current_stage": int,
    "stage_history": [
        {
            "stage_id": int,
            "timestamp": datetime,
            "vitals": dict,
            "labs": dict,
            "imaging": dict,
            "queries_issued": list,
            "decisions_committed": dict
        }
    ],
    "active_medications": list,
    "active_diagnoses": list,
    "irreversible_commits": list
}

The agent receives the full state at the start of each stage. It cannot modify prior stages. This mirrors real clinical workflows where you cannot un-administer a medication.
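
One way to get the append-only behavior described here is to freeze each stage record as it enters the history. The sketch below is an assumption about how this could be done, not ClinEnv's code; `CaseState` and `advance` are hypothetical names, and the field names follow the state schema above.

```python
import copy
from types import MappingProxyType


class CaseState:
    """Append-only case history (hypothetical sketch)."""

    def __init__(self, patient_id: str):
        self.patient_id = patient_id
        self._history = []

    def advance(self, stage_record: dict) -> None:
        # Deep-copy so later edits to the caller's dict cannot leak in,
        # then wrap in a read-only view. Note: the view protects top-level
        # keys only; nested dicts would need the same treatment.
        self._history.append(MappingProxyType(copy.deepcopy(stage_record)))

    @property
    def current_stage(self) -> int:
        return len(self._history)

    @property
    def stage_history(self) -> tuple:
        # A tuple prevents callers from reordering or deleting stages.
        return tuple(self._history)
```

Attempting `state.stage_history[0]["labs"] = {}` raises a `TypeError`, which is the programmatic equivalent of not being able to un-administer a medication.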

Evaluation: Decoupling Process from Outcome

ClinEnv scores two dimensions independently:

Decision quality (outcome):

F1 score for medications, procedures, and diagnoses at each stage
Weighted by clinical importance (missing sepsis antibiotics costs more than missing a vitamin supplement)
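
The source does not give the weighting formula, so the following is only one plausible reading: weight each item by a clinical-importance score and compute precision and recall over summed weights instead of counts. The function name, the default weight of 1.0, and the example weights are all assumptions.

```python
def weighted_f1(predicted: set, truth: set, weights: dict) -> float:
    """F1 where each item counts in proportion to its importance weight."""

    def total(items):
        # Items without an explicit weight default to 1.0 (assumption).
        return sum(weights.get(i, 1.0) for i in items)

    tp = total(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / total(predicted)
    recall = tp / total(truth)
    return 2 * precision * recall / (precision + recall)


# Illustrative weights only: the antibiotic matters five times as much.
weights = {"vancomycin": 5.0, "vitamin_d": 1.0}
truth = {"vancomycin", "vitamin_d"}
missing_antibiotic = weighted_f1({"vitamin_d"}, truth, weights)   # 2/7
missing_vitamin = weighted_f1({"vancomycin"}, truth, weights)     # 10/11
```

Under these made-up weights, omitting the antibiotic scores about 0.29 while omitting the vitamin scores about 0.91, so the penalty tracks clinical importance rather than item count.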

Information-gathering quality (process):

Query efficiency: ratio of unique queries to total queries
Query relevance: whether queries preceded related decisions
Query timing: whether critical information was gathered early
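
Query efficiency as defined above (unique queries over total queries) is straightforward to compute. In this sketch a query is represented as a hashable `(sub_agent, argument)` tuple, and returning 1.0 for an empty query list is a convention chosen here, not something the source specifies.

```python
def query_efficiency(queries: list) -> float:
    """Ratio of unique queries to total queries issued."""
    if not queries:
        return 1.0  # convention for "no queries issued" (assumption)
    return len(set(queries)) / len(queries)


issued = [
    ("lab_interpreter", "lactate"),
    ("lab_interpreter", "lactate"),   # exact repeat
    ("imaging_reader", "chest_xray"),
    ("lab_interpreter", "wbc"),
]
efficiency = query_efficiency(issued)  # 3 unique / 4 total = 0.75
```

This only catches exact repeats. Near-identical queries would need some normalization of parameters before deduplication.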

The ClinEnv paper reports that outcome quality and process quality are sharply decoupled: models can reach a reasonable discharge-diagnosis F1 (0.51) while still issuing redundant queries, and they fare far worse on management decisions (0.17 F1).

Performance Breakdown by Stage and Decision Type

Decision Type	Early-Stage F1	Late-Stage F1	Gap (Late - Early)
Diagnoses	0.51	0.51	0.00
Medications	0.23	0.17	-0.06
Procedures	0.19	0.14	-0.05

Difficulty concentrates in management decisions (medications, procedures) and later stages. Models continue to issue redundant queries as cases progress, suggesting poor state tracking.

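The strongest model tested reaches only 0.31 overall decision F1. The failure modes below suggest that much of this gap comes from plumbing (state tracking, query bookkeeping, commit handling) rather than from raw model capability alone.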

Failure Modes Exposed by Multi-Stage Environments

Redundant querying:

Agents repeatedly query the same sub-agent with identical or near-identical parameters. This suggests they are not maintaining a query cache or tracking which information they have already gathered.
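
On the agent side, a small cache keyed by stage, sub-agent, and parameters is one way to avoid re-issuing identical queries. This is a hedged sketch: `QueryCache` and `get_or_call` are hypothetical names, and including the stage in the key is an assumption that results may change as new data arrives.

```python
import json


class QueryCache:
    """Remember sub-agent answers so identical queries are not re-issued."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(stage: int, agent: str, params: dict) -> tuple:
        # sort_keys makes the key independent of dict insertion order.
        return (stage, agent, json.dumps(params, sort_keys=True))

    def get_or_call(self, stage: int, agent: str, params: dict, call):
        key = self._key(stage, agent, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = call(**params)   # the real sub-agent call happens only on a miss
        self._store[key] = result
        return result
```

A cache like this handles identical parameters only. It does nothing for near-identical queries, which need the agent to track what it already knows at a semantic level.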

Premature commits:

Agents commit to diagnoses or medications before querying relevant sub-agents. The environment tracks query-decision ordering and flags commits made without supporting information.
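
A rough sketch of how query-decision ordering could be checked from an event log. The mapping from decision type to supporting sub-agents is an assumption made for this example (the source does not say which queries count as support), and `premature_commits` is a hypothetical helper.

```python
# Hypothetical mapping: which sub-agents count as support for each decision type.
SUPPORTING_AGENTS = {
    "medications": {"medication_advisor", "lab_interpreter"},
    "diagnoses": {"lab_interpreter", "imaging_reader", "history_summarizer"},
    "procedures": {"imaging_reader"},
}


def premature_commits(events: list) -> list:
    """
    events is an ordered list of ("query", sub_agent) or ("commit", decision_type).
    Returns the decision types committed before any supporting query was issued.
    """
    queried = set()
    flagged = []
    for kind, name in events:
        if kind == "query":
            queried.add(name)
        elif kind == "commit":
            support = SUPPORTING_AGENTS.get(name, set())
            if not (support & queried):
                flagged.append(name)
    return flagged


log = [
    ("query", "lab_interpreter"),
    ("commit", "diagnoses"),      # supported by the lab query
    ("commit", "procedures"),     # no imaging query yet, so flagged
    ("query", "imaging_reader"),  # arrives too late
]
flags = premature_commits(log)
```

Here `flags` contains only `"procedures"`, because the imaging query was issued after the procedure commit.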

State drift:

Performance degrades across stages even when new information should clarify the case. This suggests agents are not effectively accumulating state or are losing context as the state object grows.

Irreversibility blindness:

Agents attempt to “revise” prior decisions by issuing contradictory commits in later stages. The environment rejects these, but the attempts reveal that agents do not model decision irreversibility.

Implementation Considerations

Ontology grounding:

ClinEnv uses ICD-10 for diagnoses, RxNorm for medications, and CPT for procedures. Exact string matching on free-text names fails, so you need a mapping layer that resolves synonyms, abbreviations, and hierarchical relationships to canonical codes before comparison.
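
A toy version of such a mapping layer is sketched below. The concept identifiers and the parent link are placeholders invented for illustration, not real ICD-10 content, and a production system would load these tables from the actual terminologies. `normalize` and `is_a` are hypothetical helper names.

```python
from typing import Optional

# Placeholder tables (assumptions): surface forms -> concept IDs, and child -> parent.
SYNONYMS = {
    "sepsis": "DX:SEPSIS",
    "pneumonia": "DX:PNEUMONIA",
    "pna": "DX:PNEUMONIA",                      # common abbreviation
    "community-acquired pneumonia": "DX:PNEUMONIA_CA",
    "cap": "DX:PNEUMONIA_CA",
}
PARENT = {"DX:PNEUMONIA_CA": "DX:PNEUMONIA"}


def normalize(term: str) -> Optional[str]:
    """Map a free-text term to a concept ID, or None if it is unknown."""
    return SYNONYMS.get(term.strip().lower())


def is_a(code: Optional[str], ancestor: str) -> bool:
    """True if code equals ancestor or descends from it in the hierarchy."""
    while code is not None:
        if code == ancestor:
            return True
        code = PARENT.get(code)
    return False
```

With these tables, `normalize(" PNA ")` resolves to the pneumonia concept, and `is_a(normalize("CAP"), "DX:PNEUMONIA")` is true because the more specific concept rolls up to its parent. Whether a child concept should count as a match for its parent is a scoring policy decision the source does not spell out.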

Sub-agent latency:

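Each query to a sub-agent adds latency. If the agent issues 20 queries per stage across 5 stages, that is 100 API calls, and issuing them serially makes wall-clock time scale with the call count. Batching or parallel execution of independent queries reduces that cost.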
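
For queries that do not depend on each other, parallel execution can be sketched with `asyncio`. The `query_sub_agent` coroutine below is a stand-in that simulates latency with a sleep; a real sub-agent client, its name, and its parameters are assumptions. The semaphore caps concurrency so the agent does not flood the backing service.

```python
import asyncio


async def query_sub_agent(agent: str, params: dict) -> dict:
    """Stand-in for a network call to a sub-agent (hypothetical)."""
    await asyncio.sleep(0.05)  # simulated latency
    return {"agent": agent, "params": params}


async def gather_queries(requests: list, max_concurrency: int = 4) -> list:
    """Run independent queries concurrently, at most max_concurrency at a time."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(agent: str, params: dict) -> dict:
        async with sem:
            return await query_sub_agent(agent, params)

    # gather preserves the order of the input requests in its results.
    return await asyncio.gather(*(bounded(a, p) for a, p in requests))


requests = [("lab_interpreter", {"lab_name": n})
            for n in ["lactate", "wbc", "creatinine", "crp"]]
results = asyncio.run(gather_queries(requests))
```

This only helps when queries are independent. If the agent chooses its next query based on the previous answer, those calls remain serial by nature.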

State serialization:

The cumulative state object grows linearly with the number of stages. For long admissions (10+ stages), this can exceed context windows. You need summarization or hierarchical state compression.
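
One simple compression scheme keeps the most recent stages verbatim and replaces older ones with short summaries that still carry the committed decisions. This is an illustrative sketch, not something the source describes: `compress_history`, `default_summary`, and the `keep_last` parameter are hypothetical, and the field names follow the state schema earlier in the article.

```python
def default_summary(stage: dict) -> dict:
    """Drop raw observations but always retain what was committed."""
    return {
        "stage_id": stage["stage_id"],
        "decisions_committed": stage["decisions_committed"],
        "summary": f"{len(stage.get('labs', {}))} labs, "
                   f"{len(stage.get('queries_issued', []))} queries",
    }


def compress_history(stage_history: list, keep_last: int = 2,
                     summarize=default_summary) -> list:
    """Summarize all but the last keep_last stages; recent stages stay intact."""
    cutoff = max(len(stage_history) - keep_last, 0)
    older = [summarize(s) for s in stage_history[:cutoff]]
    return older + stage_history[cutoff:]
```

In practice the summarizer would more likely be a model call or a clinically informed reducer than a count of fields. The design point is that commits survive compression, because later decisions depend on them.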

Ground truth construction:

ClinEnv automatically constructs cases from real EHR data, but ground truth labels require clinical expertise. The paper does not detail the labeling process, but this is a bottleneck for scaling.

When to Use Multi-Stage Environments

Use ClinEnv-style environments when:

You need to evaluate sequential decision-making under uncertainty
Process quality matters as much as outcome quality
Decisions are irreversible and order-dependent
Agents must gather information from multiple heterogeneous sources
You want to measure state management across long horizons

Avoid when:

You only care about final outcome correctness
Decisions are independent and can be evaluated in isolation
You need high-throughput evaluation (multi-stage environments are slow)
Ground truth for intermediate stages is unavailable

Technical Verdict

ClinEnv exposes plumbing gaps that static benchmarks cannot measure. If you are building agents for domains where decisions are sequential, irreversible, and information-dependent (clinical, legal, financial), you need environments that model those constraints.

The sharp decoupling between outcome quality and process quality is the key finding. An agent can get the right answer for the wrong reasons, and you will not know unless you measure how it gathered information.

For medical AI specifically, the 0.17 F1 on management decisions is a red flag. Diagnosis is pattern matching. Management is workflow orchestration. Current agents are bad at orchestration.

If you are building clinical decision support, do not evaluate on multiple-choice benchmarks. Build or use environments that model incremental information gathering, irreversible commits, and state accumulation across stages. Otherwise, you are optimizing for a task that does not exist in production.