Sequential Diagnosis Using Large Language Models

In the 2025 paper titled "Sequential Diagnosis with Language Models" (Nori et al., 2025), researchers have unveiled a powerful new direction for artificial intelligence in medicine—one that doesn’t just answer medical trivia but performs real-time, step-by-step clinical reasoning. This research introduces SDBench (Sequential Diagnosis Benchmark), a benchmark that mimics real-world diagnostic practice, and MAI-DxO (Medical AI Diagnostic Orchestrator), an AI orchestrator that simulates a team of expert physicians. Together, they signal a shift in how we evaluate and deploy AI in medicine—not as a static reference tool, but as a dynamic, decision-making partner.

The study argues that static, one-shot evaluations “overstate model competence” and hide critical flaws, such as premature conclusions or unnecessary testing. Real-world cases often involve messy, incomplete data that an AI must handle step by step.

This study, alongside other pioneering works, highlights how AI is reshaping healthcare diagnostics.

Why Traditional Benchmarks Fall Short

Many studies have looked at how large language models (LLMs) can be used in medicine, but most of them focus broadly on various clinical uses, like helping before a doctor’s visit, supporting treatment decisions, or educating patients. A scoping review by Zhang, L., et al. (2024), Large Language Models for Disease Diagnosis: A Scoping Review, suggests that adding data such as X-rays or patient records could make benchmarks even more realistic. Existing frameworks, however, don’t specifically examine how LLMs perform at diagnosing diseases, which is a critical and complex task: most medical AI models are tested on static clinical quizzes, vignettes, or multiple-choice questions.

Medical diagnosis is inherently sequential and probabilistic. Clinicians rarely arrive at a definitive diagnosis in a single step. Instead, they:

  • Develop an initial differential diagnosis (a ranked list of possibilities),
  • Order tests to gather evidence,
  • Reassess the differential based on results,
  • Iterate until a confident diagnosis is reached.

This process involves domain expertise, pattern recognition, Bayesian thinking, and risk-benefit analysis — all within a dynamic, uncertain environment. Automating this process is one of the most ambitious goals in clinical AI.
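As a concrete illustration of the “Bayesian thinking” step, here is a toy sketch of updating a differential diagnosis after a test result. It is not from the paper; the priors and likelihood values below are invented purely for illustration:

```python
# Toy Bayesian update over a differential diagnosis.
# Priors and likelihoods are illustrative, not clinical values.

def update_differential(priors, likelihoods):
    """Posterior P(dx | evidence) is proportional to P(evidence | dx) * P(dx)."""
    unnorm = {dx: priors[dx] * likelihoods[dx] for dx in priors}
    total = sum(unnorm.values())
    return {dx: p / total for dx, p in unnorm.items()}

# Initial ranked possibilities for acute chest pain (illustrative priors).
priors = {"ACS": 0.40, "PE": 0.25, "GERD": 0.20, "anxiety": 0.15}

# How likely an elevated troponin would be under each hypothesis (illustrative).
likelihoods = {"ACS": 0.90, "PE": 0.30, "GERD": 0.02, "anxiety": 0.01}

posterior = update_differential(priors, likelihoods)
leading = max(posterior, key=posterior.get)  # "ACS" dominates after the result
```

Each new test result can be folded in the same way, which is why the loop of ordering a test and reassessing the differential naturally converges toward a confident diagnosis.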

Enter SDBench: Diagnosis as a Dialogue


Modern LLMs (such as GPT-4, Med-PaLM, or LLaMA) can understand context, generate coherent reasoning chains, and adapt to new information. These capabilities make them strong candidates for supporting or simulating sequential diagnosis. But to move from impressive single-turn QA to robust diagnostic reasoning, we need a structured framework.

To fix this, the team created SDBench, a new diagnostic benchmark built from 304 real-world cases from the New England Journal of Medicine. Each case unfolds sequentially—just like a real consultation. AI agents must:

  • Ask targeted questions
  • Order diagnostic tests
  • Adjust hypotheses as new data comes in
  • Eventually, commit to a final diagnosis

Every step is costed (e.g., £250 per doctor visit), and performance is evaluated not only on accuracy but also on efficiency and clinical realism.
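To make the protocol concrete, here is a minimal sketch of an SDBench-style episode. The class and method names are hypothetical and the prices illustrative; the benchmark’s actual interface and cost model are defined in the paper, not reproduced here:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a sequential-diagnosis episode: the agent either
# asks a question, orders a test, or commits to a final diagnosis, and
# every action accrues cost (illustrative figures).
VISIT_COST = 250                              # cost per consultation step
TEST_COSTS = {"ECG": 100, "troponin": 150}    # illustrative test prices

@dataclass
class Episode:
    spent: int = 0
    log: list = field(default_factory=list)

    def ask(self, question):
        self.spent += VISIT_COST
        self.log.append(("question", question))

    def order_test(self, test):
        self.spent += TEST_COSTS.get(test, 0)
        self.log.append(("test", test))

    def diagnose(self, dx):
        self.log.append(("final", dx))
        return dx

ep = Episode()
ep.ask("Any radiation of the chest pain?")
ep.order_test("ECG")
ep.order_test("troponin")
final = ep.diagnose("STEMI")
# ep.spent == 500: one visit (250) + ECG (100) + troponin (150)
```

Scoring an agent then means weighing the final diagnosis against the cumulative `spent`, which is what lets SDBench penalize both wrong answers and wasteful testing.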

MAI-DxO: A Virtual Panel of Five Doctors

At the heart of the study is MAI-DxO, a model-agnostic orchestrator that brings teamwork to AI. It combines the reasoning styles of five simulated physicians:

  • Dr. Hypothesis: Builds and ranks diagnostic possibilities
  • Dr. Test-Chooser: Recommends the most informative, cost-effective tests
  • Dr. Challenger: Spots premature assumptions and proposes “what-if” challenges
  • Dr. Stewardship: Keeps an eye on medical costs
  • Dr. Checklist: Verifies every action for accuracy and clinical logic

This multi-agent orchestration enables deeper reasoning, reduces wasteful testing, and reflects real-world diagnostic teamwork. A notable feature of MAI-DxO is its budget-tracking capability, which estimates cumulative costs and assigns financial values to each diagnostic action. This allows the system to cancel or revise test selections in real time, ensuring cost containment without compromising clinical rigor.
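A minimal sketch of how two of these roles might interact under a budget cap. The role logic here is stubbed with illustrative numbers; in MAI-DxO each role is an LLM persona, not a hand-coded rule:

```python
# Hypothetical sketch of MAI-DxO-style role interplay with budget tracking.
BUDGET = 2000  # illustrative spending cap per case

def dr_test_chooser(differential):
    # Propose (test, cost, information value) candidates.
    # Values are invented for illustration.
    return [("ECG", 100, 0.9), ("troponin", 150, 0.8), ("CT angiogram", 1200, 0.3)]

def dr_stewardship(proposals, spent, budget=BUDGET):
    # Veto tests that would blow the budget or add little information.
    kept = []
    for test, cost, value in proposals:
        if spent + cost <= budget and value >= 0.5:
            kept.append((test, cost))
            spent += cost
    return kept, spent

proposals = dr_test_chooser(["ACS", "PE", "GERD"])
approved, spent = dr_stewardship(proposals, spent=0)
# The low-value CT angiogram is vetoed; ECG and troponin are approved.
```

The key design idea is the separation of concerns: one role maximizes diagnostic information, while another independently enforces cost discipline, so neither objective silently dominates.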

Sequential Reasoning in Practice: A Clinical Example

Let’s take a real-world scenario:

Patient Presentation: A 45-year-old male with chest pain, sweating, and shortness of breath.

  1. Dr. Hypothesis: Differential includes acute coronary syndrome, pulmonary embolism, GERD, and anxiety.
  2. Dr. Test-Chooser: Orders ECG and troponin tests.
  3. Results: ECG shows ST elevation; troponin is elevated.
  4. Dr. Challenger: Re-examines whether this could be a mimicker like pericarditis.
  5. Dr. Hypothesis (updated): High probability of STEMI.
  6. Final consensus: The panel confirms the STEMI diagnosis and recommends immediate intervention.

This step-by-step dialogue can be entirely simulated by an LLM framework, aligning with clinical workflows and enabling better decision support.

Ensemble Reasoning for Enhanced Accuracy & Cost

An advanced configuration of MAI-DxO introduces model ensembling, simulating multiple diagnostic panels operating in parallel. These panels independently analyze cases before a final consensus is reached—mirroring the collaborative practices of high-stakes clinical environments. This approach has demonstrated measurable gains in accuracy, with ensemble variants achieving up to 85.5% diagnostic accuracy, while still maintaining cost-efficiency relative to baseline models.
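One simple way to picture panel-level consensus is majority voting across independent runs. This is an assumption for illustration; the paper’s actual aggregation method may differ:

```python
from collections import Counter

# Sketch of ensemble consensus: several independently run diagnostic
# panels each return a diagnosis, and the final answer is the majority vote.
def consensus(panel_answers):
    counts = Counter(panel_answers)
    diagnosis, _ = counts.most_common(1)[0]
    return diagnosis

# Five hypothetical panels analyzing the chest-pain case from above.
answers = ["STEMI", "STEMI", "pericarditis", "STEMI", "STEMI"]
final = consensus(answers)  # "STEMI" wins 4 votes to 1
```

Even this crude scheme shows why ensembling helps: a single panel’s error (here, the pericarditis call) is outvoted as long as most panels reason correctly.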

The results? Nothing short of remarkable:

Model                        | Accuracy | Avg. Cost per Case
Generalist Physicians        | 19.9%    | £2,350
GPT-4o (baseline)            | 49.3%    | £2,175
o3 (baseline)                | 78.6%    | £6,200
MAI-DxO + o3                 | 80%      | £1,900
MAI-DxO (Max Accuracy Mode)  | 85.5%    | £5,670

That’s 4x higher accuracy than physicians—while reducing costs by up to 70% compared to raw large models.

Robust and Transferable Across AI Models

MAI-DxO works across multiple LLMs—OpenAI, Gemini, Claude, Grok, DeepSeek, Llama—delivering consistent performance boosts, especially for less powerful models. Even when tested on unseen post-training cases, the orchestrator maintained high accuracy, indicating true reasoning ability—not memorization.
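Model-agnosticism here amounts to the orchestrator needing only a text-in/text-out callable for each role, so any backend can be swapped in. A sketch with stand-in functions (these are not real API calls; any actual integration would wrap each provider’s own client):

```python
# Each role only needs a callable: prompt string in, answer string out.
# The backends below are stand-ins that mimic an LLM's interface.

def fake_gpt(prompt):      # stand-in for an OpenAI-backed model
    return "differential: ACS, PE"

def fake_claude(prompt):   # stand-in for an Anthropic-backed model
    return "differential: ACS, PE"

def run_role(llm, role, case):
    # The orchestrator composes the persona prompt and delegates to any backend.
    return llm(f"You are {role}. Case: {case}")

outputs = [run_role(backend, "Dr. Hypothesis", "45M with chest pain")
           for backend in (fake_gpt, fake_claude)]
```

Because the orchestration logic never touches provider-specific details, the same panel structure can sit on top of OpenAI, Gemini, Claude, Grok, DeepSeek, or Llama models unchanged.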

Real-World Evidence: AI Already Surpassing Human Benchmarks

The broader medical AI ecosystem is echoing this evolution.

In a landmark study led by Google researchers in collaboration with Massachusetts General Hospital and Northwestern University, AI demonstrated remarkable proficiency in lung cancer screening. The system achieved an AUC of 94.4%, a performance metric indicating strong predictive accuracy in identifying malignant lung nodules from low-dose CT scans. While the study did not report a direct “accuracy” figure, the AI consistently outperformed six experienced radiologists in certain diagnostic scenarios, notably reducing false positives and false negatives. This advancement highlights how AI can assist radiologists by streamlining routine assessments and allowing greater focus on complex, patient-specific cases.

Similarly, a South Korean study published in The Lancet Digital Health showcased AI’s capabilities in breast cancer detection. The AI system achieved 88.8% sensitivity, surpassing radiologists who averaged 75.3%. In early-stage breast cancer detection, the AI reached 91% accuracy, while radiologists achieved 74%. These results underline the potential of AI to augment diagnostic precision, especially in early detection where timely intervention is critical.

Implications: A Glimpse into AI-Powered Healthcare


This isn’t just about pushing AI performance metrics. It’s about scaling expert-level diagnosis to the entire world—especially in areas where access to specialists is limited. With proper safeguards, we may even see smartphone-based triage tools that emulate specialist-level care. 

While promising, sequential diagnosis with LLMs is not without hurdles:

  • Hallucinations: LLMs may generate confident but incorrect reasoning.
  • Clinical Validation: Needs rigorous testing in real-world environments.
  • Regulatory Approval: Transparent, explainable AI is crucial for clinical deployment.
  • Bias and Generalization: Trained models must be robust across populations and healthcare systems.

Final Thoughts: Toward an AI-Augmented Healthcare System

Nori et al.’s study doesn’t merely present a better benchmark or a smarter model—it reimagines how AI should be evaluated and integrated into medicine. It brings us closer to systems that reason like doctors, collaborate like teams, and scale like technology.

As we continue to refine these frameworks, the future of healthcare increasingly points toward hybrid intelligence—where LLMs serve as clinical co-pilots, augmenting rather than replacing human expertise. The goal is not to remove doctors from the loop, but to embed AI into the diagnostic journey in a way that is trustworthy, iterative, and explainable.

Emerging leaders in this space are now focused on developing agentic AI systems—models that emulate human-like clinical reasoning through multi-role orchestration and adaptive decision-making. At the forefront of this evolution, platforms like those developed at Exascale are pioneering deep-tech solutions that seamlessly connect advanced AI frameworks with healthtech transformation, enabling smarter and safer diagnostics across diverse care settings.

“The future of clinical diagnosis may not lie in choosing between man and machine—but in orchestrating both.”


Source: https://doi.org/10.48550/arXiv.2506.22405

