AI in Healthcare: New Stanford Benchmark Measures Real-World Performance



Stanford researchers built a virtual EHR environment to test AI agents, measuring how well models like Claude 3.5 can assist doctors with routine healthcare tasks.

In a rush? Here are the quick facts:

  • AI agents can perform tasks like ordering tests and prescribing medications.
  • Claude 3.5 Sonnet v2 achieved the highest success rate at 70%.
  • Many AI models struggled with complex workflows and system interoperability.

Stanford researchers are setting new evaluation criteria to determine whether AI systems are able to perform real-world medical tasks. While AI has demonstrated potential across various medical fields, experts warn it still needs further testing.

“Working on this project convinced me that AI won’t replace doctors anytime soon,” said Kameron Black, co-author and Clinical Informatics Fellow at Stanford Health Care.

To investigate, the team developed MedAgentBench, a virtual electronic health record (EHR) environment built to assess how well AI agents perform the medical tasks doctors handle every day.

Unlike chatbots, AI agents can act autonomously, handling complex, multistep tasks: pulling patient data, ordering tests, and prescribing medications.

“Chatbots say things. AI agents can do things,” said Jonathan Chen, associate professor of medicine and biomedical data science and senior author. “This means they could theoretically directly retrieve patient information from the electronic medical record, reason about that information, and take action by directly entering in orders for tests and medications. This is a much higher bar for autonomy in the high-stakes world of medical care. We need a benchmark to establish the current state of AI capability on reproducible tasks that we can optimize toward,” Chen added.
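Chen's description maps onto a simple retrieve-reason-act loop. The sketch below illustrates that idea in Python against a FHIR-style sandbox EHR, the kind of interface benchmarks like this typically expose; the endpoint URL, the `llm.decide` call, and the helper names are hypothetical stand-ins, not the actual MedAgentBench interface.

```python
import requests

# Hypothetical FHIR-style sandbox endpoint; a stand-in, not MedAgentBench's API.
EHR_BASE = "http://localhost:8080/fhir"

def get_patient_labs(patient_id: str) -> list:
    """Retrieve recent lab observations for a patient (step 1: retrieve)."""
    resp = requests.get(f"{EHR_BASE}/Observation", params={"patient": patient_id})
    resp.raise_for_status()
    return resp.json().get("entry", [])

def order_test(patient_id: str, test_code: str) -> dict:
    """Place a lab order by POSTing a ServiceRequest resource (step 3: act)."""
    payload = {
        "resourceType": "ServiceRequest",
        "subject": {"reference": f"Patient/{patient_id}"},
        "code": {"coding": [{"code": test_code}]},
    }
    resp = requests.post(f"{EHR_BASE}/ServiceRequest", json=payload)
    resp.raise_for_status()
    return resp.json()

def agent_step(llm, patient_id: str, task: str) -> dict:
    """One retrieve-reason-act cycle for a single clinical task."""
    labs = get_patient_labs(patient_id)             # retrieve
    decision = llm.decide(task=task, context=labs)  # reason (hypothetical model call)
    if decision.get("action") == "order_test":      # act
        return order_test(patient_id, decision["test_code"])
    return {"action": "none"}
```

The benchmark's job is then to check, after such a cycle, whether the resulting EHR state matches what a clinician would have done.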

To populate the virtual system, the researchers drew on 100 patient profiles comprising roughly 785,000 records, then tested about a dozen large language models (LLMs) on 300 clinical tasks.
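As a rough illustration of how a harness like this can score models, here is a minimal sketch; the task list, checker functions, and dummy agent are toy stand-ins written under our own assumptions, not the published benchmark code.

```python
from typing import Callable

def evaluate(run_task: Callable[[str], dict], tasks: list) -> float:
    """Score an agent over clinical tasks: fraction whose checker passes."""
    passed = sum(1 for t in tasks if t["checker"](run_task(t["prompt"])))
    return passed / len(tasks)

# Toy usage with two stand-in tasks and a dummy agent, just to show the shape.
tasks = [
    {"prompt": "Order an HbA1c for patient 001",
     "checker": lambda state: state.get("ordered") == "HbA1c"},
    {"prompt": "Record blood pressure for patient 002",
     "checker": lambda state: "bp" in state},
]
dummy_agent = lambda prompt: {"ordered": "HbA1c"}  # always places the same order
print(evaluate(dummy_agent, tasks))  # 0.5 -> one of two tasks passed
```

A per-model success rate computed this way is the metric behind the figures reported below.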

The results showed that Claude 3.5 Sonnet v2 was the top performer with a 70% success rate, but many models struggled with complex workflows and system integration.

“We hope this benchmark can help model developers track progress and further advance agent capabilities,” said Yixing Jiang, PhD student and co-author.

The researchers expect AI agents to take over basic clinical administrative work, potentially easing physician burnout without removing human doctors from practice.

“I’m passionate about finding solutions to clinician burnout,” Black said. “I hope that by working on agentic AI applications in healthcare that augment our workforce, we can help offload burden from clinicians and divert this impending crisis,” Black added.
