A clinical environment simulator for dynamic AI evaluation


  • Goh, E. et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw. Open 7, e2440969 (2024).

  • McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature 642, 451–457 (2025).

  • Cabral, S. et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern. Med. 184, 581–583 (2024).

  • Goh, E. et al. GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nat. Med. 31, 1233–1238 (2025).

  • Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025).

  • Gao, S. et al. TxAgent: an AI agent for therapeutic reasoning across a universe of tools. Preprint at https://doi.org/10.48550/arXiv.2503.10970 (2025).

  • Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).

  • Sandmann, S. et al. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat. Med. 31, 2546–2549 (2025).

  • Tordjman, M. et al. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat. Med. 31, 2550–2555 (2025).

  • Liu, X. et al. A generalist medical language model for disease diagnosis assistance. Nat. Med. 31, 932–942 (2025).

  • Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning 248–260 (PMLR, 2022).

  • Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (eds Inui, K. et al.) 2567–2577 (ACL, 2019).

  • Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).

  • Schmidgall, S. et al. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments. Preprint at https://doi.org/10.48550/arXiv.2405.07960 (2024).

  • Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).

  • Fan, Z. et al. AI Hospital: benchmarking large language models in a multi-agent medical interaction simulator. In Proc. 31st International Conference on Computational Linguistics 10183–10213 (ACL, 2025).

  • Li, J. et al. Agent Hospital: a simulacrum of hospital with evolvable medical agents. Preprint at https://doi.org/10.48550/arXiv.2405.02957 (2024).

  • Bedi, S. et al. Holistic evaluation of large language models for medical tasks with MedHELM. Nat. Med. https://doi.org/10.1038/s41591-025-04151-2 (2026).

  • Zhang, S. et al. Rethinking human-AI collaboration in complex medical decision making: a case study in sepsis diagnosis. In Proc. 2024 CHI Conference on Human Factors in Computing Systems 445, 1–18 (ACM, 2024).

  • Nori, H. et al. Sequential diagnosis with language models. Preprint at https://doi.org/10.48550/arXiv.2506.22405 (2025).

  • Bedi, S., Mlauzi, I., Shin, D., Koyejo, S. & Shah, N. H. The optimization paradox in clinical AI multi-agent systems. Preprint at https://doi.org/10.48550/arXiv.2506.06574 (2025).

  • Rosenthal, J. T., Beecy, A. & Sabuncu, M. R. Rethinking clinical trials for medical AI with dynamic deployments of adaptive systems. NPJ Digit. Med. 8, 252 (2025).

  • Palepu, A. et al. Towards conversational AI for disease management. Preprint at https://doi.org/10.48550/arXiv.2503.06074 (2025).

  • Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).

  • Johnson, A. E. W. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 10, 1 (2023).

  • Kansal, A., Chen, E., Jin, B. T., Rajpurkar, P. & Kim, D. A. MC-MED, multimodal clinical monitoring in the emergency department. Sci. Data 12, 1094 (2025).

  • Lazic, D. A., Grujic, V. & Tanaskovic, M. The role of flight simulation in flight training of pilots for crisis management. SFJD 3, 3624–3636 (2022).

  • Allerton, D. J. The impact of flight simulation in aerospace. Aeronaut. J. 114, 747–756 (2010).

  • Mahmood, F. A benchmarking crisis in biomedical machine learning. Nat. Med. 31, 1060 (2025).

  • Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).

  • Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

  • Page, B., Irving, D., Amalberti, R. & Vincent, C. Health services under pressure: a scoping review and development of a taxonomy of adaptive strategies. BMJ Qual. Saf. 33, 738–747 (2024).

  • Morley, C., Unwin, M., Peterson, G. M., Stankovich, J. & Kinsman, L. Emergency department crowding: a systematic review of causes, consequences and solutions. PLoS ONE 13, e0203316 (2018).

  • Pines, J. M. et al. The impact of emergency department crowding measures on time to antibiotics for patients with community-acquired pneumonia. Ann. Emerg. Med. 50, 510–516 (2007).

  • Bernstein, S. L. et al. The effect of emergency department crowding on clinically oriented outcomes. Acad. Emerg. Med. 16, 1–10 (2009).

  • Emanuel, E. J. et al. Fair allocation of scarce medical resources in the time of Covid-19. N. Engl. J. Med. 382, 2049–2055 (2020).

  • Johri, S. et al. An evaluation framework for clinical use of large language models in patient interaction tasks. Nat. Med. 31, 77–86 (2025).

  • Arora, R. K. et al. HealthBench: evaluating large language models towards improved human health. Preprint at https://doi.org/10.48550/arXiv.2505.08775 (2025).

  • Jiang, Y. et al. MedAgentBench: a virtual EHR environment to benchmark medical LLM agents. NEJM AI 2, 9 (2025).

  • Zhang, C. et al. API agents vs. GUI agents: divergence and convergence. In ICML 2025 Workshop on Computer Use Agents (ICML, 2025).

  • Finlayson, S. G. et al. Adversarial attacks on medical machine learning. Science 363, 1287–1289 (2019).

  • Javed, H., El-Sappagh, S. & Abuhmed, T. Robustness in deep learning models for medical diagnostics: security and adversarial challenges towards robust AI applications. Artif. Intell. Rev. 58, 12 (2024).

  • Kumar, A. et al. OrderRex clinical user testing: a randomized trial of recommender system decision support on simulated cases. J. Am. Med. Inform. Assoc. 27, 1850–1859 (2020).

  • Elendu, C. et al. The impact of simulation-based training in medical education: a review. Medicine 103, e38813 (2024).

  • Sinsky, C. et al. Allocation of physician time in ambulatory practice: a time and motion study in 4 specialties. Ann. Intern. Med. 165, 753–760 (2016).

  • Tierney, A. A. et al. Ambient artificial intelligence scribes: learnings after 1 year and over 2.5 million uses. NEJM Catal. Innov. Care Deliv. https://doi.org/10.1056/CAT.25.0040 (2025).


