MeDxAgent: Multi-Agent Consultation for Interactive Medical Diagnosis

Most medical AI benchmarks treat diagnosis like a quiz: model sees complete patient data, picks from multiple choices, done. MeDxAgent (arXiv 2606.03416v1) shows why that approach misses the plumbing of real clinical workflows. Diagnosis is interactive. Doctors ask follow-up questions, route cases to specialists, and refine hypotheses across multiple turns. The paper introduces a multi-agent consultation system that orchestrates this loop and a benchmark (MeDxBench) with 4,421 cases across 20 specialties to test it.

The core insight: single-shot evals hide the state management, turn-taking, and specialist coordination problems that break agent systems in production. If you are building health decision support, this paper exposes the orchestration gaps you will hit.

Why Static Benchmarks Break

Clinical diagnosis is not a batch job. A patient presents with incomplete information. The physician asks targeted questions, orders tests, consults specialists, and iterates. Most LLM evaluations skip this loop entirely:

Complete information upfront: All symptoms, labs, and history provided in a single prompt.
Multiple-choice output: Model selects from a fixed set of diagnoses.
No interaction: No mechanism for the model to ask clarifying questions or request additional data.

This setup optimizes for benchmark leaderboards but fails to model the decision tree of real consultations. MeDxAgent reframes diagnosis as a multi-turn orchestration problem with three agent roles: a primary care agent, specialist agents, and a coordinator that manages state and routing.

MeDxAgent Architecture

The system uses a consultation flow that mirrors clinical practice:

Demographics collection: Primary care agent gathers age, sex, and basic history first.
Symptom exploration: Agent asks open-ended questions to build a symptom profile.
Specialist routing: Coordinator routes case to relevant specialist agents based on symptom category.
Targeted questioning: Specialists ask domain-specific follow-up questions.
Diagnosis synthesis: Coordinator aggregates specialist outputs and produces final diagnosis.

Each agent maintains its own context window but receives a summarized dialogue history to avoid token bloat. The coordinator tracks consultation state (current turn, active specialist, collected data) and enforces turn limits to prevent infinite loops.

State Management Pattern

MeDxAgent uses a shared state object that persists across turns:

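from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Turn, Diagnosis, primary_care_agent, coordinator, and summarize_history are
# assumed to be defined elsewhere; this sketch only shows how state flows.

@dataclass
class ConsultationState:
    demographics: Dict[str, Any] = field(default_factory=dict)
    symptoms: List[str] = field(default_factory=list)
    dialogue_history: List["Turn"] = field(default_factory=list)
    active_specialist: Optional[str] = None
    candidate_diagnoses: List[str] = field(default_factory=list)
    confidence_scores: Dict[str, float] = field(default_factory=dict)
    turn_count: int = 0
    max_turns: int = 10

    def should_route_to_specialist(self) -> bool:
        # Route once enough symptoms are known and no specialist is active
        return len(self.symptoms) >= 3 and not self.active_specialist

    def is_consultation_complete(self) -> bool:
        # default=0.0 avoids a ValueError before any confidence score exists
        return (
            self.turn_count >= self.max_turns
            or max(self.confidence_scores.values(), default=0.0) > 0.85
        )

def run_consultation(patient_input: str) -> "Diagnosis":
    state = ConsultationState()

    # Primary care agent collects demographics
    state = primary_care_agent.collect_demographics(patient_input, state)

    while not state.is_consultation_complete():
        if state.should_route_to_specialist():
            # Route to specialist with summarized history
            specialist = coordinator.select_specialist(state.symptoms)
            # Pass the state in so the specialist can return an updated version
            state = specialist.ask_questions(
                state,
                summarize_history(state.dialogue_history),
                state.candidate_diagnoses,
            )
        else:
            # Continue with primary care agent
            state = primary_care_agent.explore_symptoms(state)

        state.turn_count += 1

    return coordinator.synthesize_diagnosis(state)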

The state object prevents context loss when switching between agents. Demographics and symptoms are collected before specialist routing to give the specialist agent enough signal to ask useful questions. The paper found that collecting demographics first improved accuracy by 3.2% compared to interleaving demographic and symptom questions.

Specialist Coordination

When the primary care agent identifies a likely specialty (cardiology, neurology, etc.), the coordinator spawns a specialist agent. The specialist receives:

Summarized dialogue history (not raw transcript)
Current symptom list
Top 3 candidate diagnoses from primary care agent

The specialist asks targeted questions and updates the candidate diagnosis list. If confidence remains low after specialist consultation, the coordinator can route to a second specialist or escalate to a multi-specialist consensus round.
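The handoff described above can be sketched as a small payload object. This is a minimal illustration, not the paper's implementation: `SpecialistHandoff`, `build_handoff`, and the sample patient data are all hypothetical names and values chosen for the example. It assumes the candidate list is ranked by the confidence scores already held in the consultation state.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SpecialistHandoff:
    """Hypothetical container for what a specialist agent receives."""
    history_summary: str
    symptoms: List[str]
    candidate_diagnoses: List[str]


def build_handoff(
    history_summary: str,
    symptoms: List[str],
    confidence_scores: Dict[str, float],
    top_k: int = 3,
) -> SpecialistHandoff:
    # Rank candidates by confidence (highest first) and keep only the top_k.
    ranked = sorted(confidence_scores, key=confidence_scores.get, reverse=True)
    return SpecialistHandoff(
        history_summary=history_summary,
        symptoms=list(symptoms),
        candidate_diagnoses=ranked[:top_k],
    )


# Made-up example data, for illustration only.
handoff = build_handoff(
    "54-year-old with exertional chest pain for two weeks.",
    ["chest pain", "shortness of breath", "fatigue"],
    {"stable angina": 0.55, "GERD": 0.20, "costochondritis": 0.15, "anxiety": 0.10},
)
print(handoff.candidate_diagnoses)
```

Freezing the payload keeps a specialist from mutating shared state by accident; any updates would go back through the coordinator.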

Consensus Logic and Failure Modes

Multi-agent diagnosis creates a consensus problem. When specialists disagree, how do you aggregate their outputs? MeDxAgent tested three strategies:

Strategy	Accuracy	Failure Mode
Majority vote	68.4%	Ties and low-confidence majorities produce random selection
Confidence-weighted average	71.2%	Overconfident specialists dominate even when wrong
Coordinator synthesis	73.8%	Coordinator can ignore specialist input or hallucinate consensus

The paper found that coordinator synthesis (a separate LLM call that reviews all specialist outputs and produces a final diagnosis) performed best but introduced a new failure mode: the coordinator sometimes ignored specialist input entirely and reverted to the primary care agent’s initial hypothesis. Adding explicit prompts to “cite which specialist influenced your decision” reduced this behavior but did not eliminate it.
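To make the first two failure modes concrete, here is a minimal sketch of majority voting and a confidence-weighted vote. The function names and the `(diagnosis, confidence)` tuple shape are assumptions made for illustration. The weighted variant sums confidence per diagnosis, which is one plausible reading of "confidence-weighted average"; the paper's exact formula may differ.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Each specialist opinion is (diagnosis, confidence in [0, 1]).
Opinion = Tuple[str, float]


def majority_vote(opinions: List[Opinion]) -> Optional[str]:
    """Return the most common diagnosis, or None on a tie or empty input."""
    counts = Counter(diagnosis for diagnosis, _ in opinions).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # surface the tie instead of picking at random
    return counts[0][0]


def confidence_weighted(opinions: List[Opinion]) -> Optional[str]:
    """Return the diagnosis with the highest summed confidence."""
    totals: Dict[str, float] = {}
    for diagnosis, confidence in opinions:
        totals[diagnosis] = totals.get(diagnosis, 0.0) + confidence
    return max(totals, key=totals.get) if totals else None


# Made-up opinions: two moderately confident votes against one very confident one.
opinions = [("migraine", 0.40), ("migraine", 0.45), ("stroke", 0.95)]
print(majority_vote(opinions))
print(confidence_weighted(opinions))
```

With these inputs the majority vote returns migraine while the weighted vote returns stroke, which is the "overconfident specialist dominates" behavior listed in the table. Returning `None` on a tie lets the caller escalate rather than select at random.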

Observability Gaps

Debugging a failed consultation requires tracing the decision path across multiple agent turns. The paper does not provide implementation details for observability tooling, but the architecture implies you need:

Turn-level logs: Which agent was active, what question was asked, what data was collected.
State snapshots: Consultation state after each turn to identify where context was lost.
Confidence traces: How confidence scores changed as new information arrived.
Routing decisions: Why the coordinator chose a specific specialist or ended the consultation.

Without these traces, a misdiagnosis looks like a model failure when it might be a routing bug (wrong specialist) or a state management bug (lost symptom data between turns).
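One way to capture the four trace types is a per-turn record serialized as JSON lines. The sketch below is an assumption about what such tooling could look like, since the paper gives no implementation; `TurnTrace`, `ConsultationTrace`, and every field name are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class TurnTrace:
    """Hypothetical per-turn trace record; field names are illustrative."""
    turn: int
    agent: str
    question: str
    collected: Dict[str, Any]
    confidence_scores: Dict[str, float]
    routing_reason: str = ""


@dataclass
class ConsultationTrace:
    records: List[TurnTrace] = field(default_factory=list)

    def log(self, record: TurnTrace) -> None:
        self.records.append(record)

    def to_jsonl(self) -> str:
        # One JSON object per line so traces can be searched or diffed per turn.
        return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in self.records)

    def confidence_history(self, diagnosis: str) -> List[float]:
        # Confidence for one diagnosis across turns (0.0 where it was absent).
        return [r.confidence_scores.get(diagnosis, 0.0) for r in self.records]


trace = ConsultationTrace()
trace.log(TurnTrace(1, "primary_care", "Where is the pain?",
                    {"symptoms": ["chest pain"]}, {"angina": 0.3}))
trace.log(TurnTrace(2, "cardiology", "Does it worsen with exertion?",
                    {"symptoms": ["exertional pain"]}, {"angina": 0.7},
                    routing_reason="keyword: chest pain"))
print(trace.to_jsonl())
```

Recording `routing_reason` alongside the confidence scores is what lets you tell a routing bug from a model error after the fact.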

Benchmark Results

MeDxAgent achieved 73.8% accuracy on MeDxBench, compared to 63.5% for a single-shot baseline (all information provided upfront, no interaction). The interactive consultation closed 52.3% of the gap to a full-information oracle (the oracle had access to complete patient records and the ground-truth diagnosis).
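As a quick sanity check on these figures, the sketch below back-solves the oracle accuracy they imply. It assumes the gap is measured from the single-shot baseline, that is, gap closed = (system - baseline) / (oracle - baseline); the paper may define the gap differently, so treat the result as a derived estimate rather than a reported number. The function name is hypothetical.

```python
def implied_oracle_accuracy(baseline: float, system: float, gap_closed: float) -> float:
    """Solve gap_closed = (system - baseline) / (oracle - baseline) for oracle."""
    return baseline + (system - baseline) / gap_closed


# Accuracies in percent as quoted above; gap_closed is a fraction.
oracle = implied_oracle_accuracy(baseline=63.5, system=73.8, gap_closed=0.523)
print(f"absolute gain: {73.8 - 63.5:.1f} points")
print(f"implied oracle accuracy: {oracle:.1f}%")
```

Under that assumption the absolute gain is 10.3 points and the implied oracle accuracy is roughly 83.2%.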

Key findings:

Demographics-first collection improved accuracy by 3.2%.
Summarized dialogue history (vs. raw transcript) improved accuracy by 2.1% and reduced token usage by 40%.
Feeding candidate diagnoses to specialists improved accuracy by 4.7% by focusing specialist questions on differential diagnosis.

The paper also found that accuracy gains from individual design choices (demographics-first, summarized history, candidate diagnoses) only emerged fully when combined. Each choice in isolation produced marginal gains, but the orchestration pattern mattered more than any single component.

Implementation Considerations

If you are building a multi-agent consultation system, here are the plumbing decisions MeDxAgent exposes:

Turn limits: Set a hard cap on consultation turns to prevent infinite loops. MeDxAgent used 10 turns. Too low and you cut off useful questioning. Too high and you burn tokens on diminishing returns.

Specialist routing logic: Use symptom keywords or a classifier to route cases. MeDxAgent used a simple keyword match (chest pain → cardiology). More sophisticated routing (embedding similarity, multi-label classification) might improve accuracy but adds latency and complexity.
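A keyword router of this kind might look like the sketch below. The keyword table and the `select_specialist` function are illustrative assumptions; only the chest pain to cardiology mapping comes from the text above.

```python
from typing import Dict, List, Optional

# Illustrative keyword table; apart from "chest pain", these entries are assumptions.
SPECIALTY_KEYWORDS: Dict[str, List[str]] = {
    "cardiology": ["chest pain", "palpitations"],
    "neurology": ["headache", "numbness", "seizure"],
    "gastroenterology": ["abdominal pain", "nausea"],
}


def select_specialist(symptoms: List[str]) -> Optional[str]:
    """Return the specialty with the most keyword hits, or None if no match."""
    hits: Dict[str, int] = {}
    for symptom in symptoms:
        text = symptom.lower()
        for specialty, keywords in SPECIALTY_KEYWORDS.items():
            if any(keyword in text for keyword in keywords):
                hits[specialty] = hits.get(specialty, 0) + 1
    # Ties resolve to whichever specialty was matched first.
    return max(hits, key=hits.get) if hits else None


print(select_specialist(["Chest pain on exertion", "fatigue"]))
print(select_specialist(["rash"]))
```

Returning `None` for unmatched symptoms gives the coordinator an explicit signal to stay with the primary care agent instead of guessing a specialty.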

Confidence thresholds: Define when to end consultation early. MeDxAgent used 0.85 as a cutoff. Lower thresholds reduce consultation length but increase misdiagnosis risk. Higher thresholds burn tokens on cases where the model is already correct.

Context window management: Summarize dialogue history to fit within token limits. MeDxAgent used a separate summarization call after every 3 turns. This adds latency but prevents context overflow in long consultations.
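The every-3-turns pattern can be sketched as a rolling summary plus a buffer of recent raw turns. This is an assumption about the mechanics, not the paper's code: `compact_history` is a hypothetical helper, and `naive_summarize` is a stand-in for the LLM summarization call.

```python
from typing import Callable, List, Tuple

Summarizer = Callable[[str, List[str]], str]


def compact_history(
    summary: str, recent_turns: List[str], summarize: Summarizer, every: int = 3
) -> Tuple[str, List[str]]:
    """Fold raw turns into the running summary once `every` turns accumulate."""
    if len(recent_turns) < every:
        return summary, recent_turns
    return summarize(summary, recent_turns), []


def naive_summarize(summary: str, turns: List[str]) -> str:
    # Stand-in for an LLM call: keeps the first 40 characters of each turn.
    clipped = "; ".join(turn[:40] for turn in turns)
    return f"{summary} | {clipped}" if summary else clipped


summary, recent = "", []
for turn in ["age 54, male", "chest pain for 2 weeks", "worse on exertion", "no fever"]:
    recent.append(turn)
    summary, recent = compact_history(summary, recent, naive_summarize)

print(summary)
print(recent)
```

After four turns the first three have been folded into the summary and only the latest turn remains in raw form, so the prompt size stays bounded regardless of consultation length.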

Failure recovery: Define what happens when a specialist returns low-confidence output. MeDxAgent routed to a second specialist. You could also escalate to a human reviewer or return a “diagnosis uncertain” flag instead of forcing a guess.
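A minimal sketch of that recovery path follows, assuming each specialist can be modeled as a callable that returns a diagnosis and a confidence. The `diagnose_with_fallback` name, the result dictionary shape, and the reuse of 0.85 as the threshold are all assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A specialist is modeled as a callable returning (diagnosis, confidence).
Specialist = Callable[[], Tuple[str, float]]


def diagnose_with_fallback(
    specialists: List[Specialist], min_confidence: float = 0.85
) -> Dict[str, object]:
    """Try specialists in order; flag the case as uncertain if none is confident."""
    attempts: List[Tuple[str, float]] = []
    for ask in specialists:
        diagnosis, confidence = ask()
        attempts.append((diagnosis, confidence))
        if confidence > min_confidence:
            return {"status": "confident", "diagnosis": diagnosis, "attempts": attempts}
    # No forced guess: return the attempts so a human reviewer can take over.
    return {"status": "uncertain", "diagnosis": None, "attempts": attempts}


# Made-up specialists for illustration.
result = diagnose_with_fallback([
    lambda: ("angina", 0.60),
    lambda: ("pericarditis", 0.90),
])
print(result["status"], result["diagnosis"])
```

Keeping the full list of attempts in the result means the uncertain path still carries everything a reviewer needs, rather than just a null diagnosis.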

Security and Safety Boundaries

Note: This section infers safety requirements not explicitly discussed in the paper. Production systems should validate these constraints independently.

A production system needs:

Input validation: Reject prompts that attempt to manipulate agent behavior (jailbreaks, prompt injection).
Output filtering: Block diagnoses that fall outside the model’s training distribution or require imaging/lab confirmation.
Human-in-the-loop gates: Flag cases where confidence is low or diagnosis has high-stakes consequences (cancer, rare diseases).
Audit logs: Persist full consultation history for regulatory compliance and post-incident review.

Medical AI systems operate under stricter liability and regulatory constraints than general-purpose agents. A misdiagnosis is not just a bad user experience; it is a potential malpractice claim. The orchestration layer needs to enforce safety boundaries that single-shot models can ignore.
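As with the rest of this section, the following is an inferred sketch rather than anything from the paper. It shows one way a human-in-the-loop gate could be expressed: a function that returns the reasons a case needs review. `review_reasons`, the term list, and the 0.85 threshold are assumptions; a real deployment would take its criteria from clinical policy, not substring matching.

```python
from typing import List

# Illustrative list only; real criteria would come from clinical policy.
HIGH_STAKES_TERMS = ("cancer", "carcinoma", "stroke", "sepsis")


def review_reasons(
    diagnosis: str, confidence: float, min_confidence: float = 0.85
) -> List[str]:
    """Return reasons a case needs human review; an empty list means none found."""
    reasons: List[str] = []
    if confidence <= min_confidence:
        reasons.append("low confidence")
    if any(term in diagnosis.lower() for term in HIGH_STAKES_TERMS):
        reasons.append("high-stakes diagnosis")
    return reasons


print(review_reasons("Tension headache", 0.92))
print(review_reasons("Lung cancer", 0.95))
```

Returning reasons instead of a boolean gives the audit log a record of why each case was escalated.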

Technical Verdict

Use MeDxAgent’s consultation pattern when:

You are building decision support for domains with incomplete information and iterative refinement (medical, legal, financial advisory).
Your users expect to provide information incrementally rather than all at once.
You need to route cases to domain specialists and aggregate their input.
You can tolerate 10+ LLM calls per session and have budget for multi-turn orchestration.

Avoid this approach when:

You have complete information upfront and no need for follow-up questions.
Latency requirements are strict (each turn adds 1-3 seconds of inference time).
Your domain lacks clear specialist boundaries, so routing logic would need manual curation for every case type.
You cannot implement turn-level observability infrastructure to trace state changes and routing decisions across agent handoffs.
Token budgets are constrained and you cannot afford summarization overhead or multi-agent consensus calls.

The real contribution here is not the benchmark or the accuracy gain. It is the exposure of orchestration problems that static evals hide: state management across turns, specialist routing logic, consensus aggregation, and confidence thresholds. If you are moving LLMs from batch prediction to interactive decision support, these are the plumbing problems you will hit.