The Chatbot Will See You Now
The idea that AI will transform healthcare has been around for about as long as the technology itself. By and large, though, the industry has remained stubbornly unchanged, with Baumol’s cost disease still keeping costs cripplingly high across the world. A recent paper from Wharton suggests that things might be about to change.
A fresh approach
The paper took a novel approach to assessing how AI performs in medicine. Most previous studies have treated AI almost as a parlour trick: show it an X-ray, ask for a diagnosis, and award marks. Suffice it to say, real medicine isn’t quite like this. It’s messy and unfolds over time. Information arrives in fragments, and patients can get better or worse. Doctors must decide not just what is wrong, but when to act, which tests to order, and how to manage a patient who may be deteriorating while they deliberate. The researchers wanted to test whether Google’s Gemini 2.5 Pro could cope with all of this at once.
They deployed the AI in BodyInteract, a high-fidelity medical simulation used to train and certify real physicians, across four clinical scenarios: hypoglycaemia managed at home, plus three emergency-room presentations of pneumonia, ischaemic stroke, and congestive heart failure. The AI had to navigate the same interface as human users: it ordered and interpreted tests, listened to lung sounds, read CT scans, administered medications, and called for specialist help.
Its performance was compared with around 14,000 simulation runs performed by medical students, as well as a benchmark set by an experienced emergency physician.
The results were encouraging: across the three emergency cases, the AI completed scenarios successfully in 95% of runs, compared with 89% for the medical students. It was also 37% faster, which matters in time-pressed acute care, and it matched the students’ diagnostic accuracy.
How AI reasons
Arguably, the most interesting part of the paper isn’t the outputs but the process: how the AI reasons, rather than the raw performance of the technology. The researchers logged the model’s evolving probability estimates for each possible diagnosis at each step in the case, creating a real-time window into its decision-making. What they found is that the machine behaves, to an impressive degree, like a rational Bayesian agent.
At the start of each case, the AI tends to order the kinds of tests that will produce the biggest shift in its diagnostic understanding. As the case progresses and uncertainty falls, it stops ordering them, as if it recognizes when it has enough information.
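To make that idea concrete, here is a minimal sketch of what “rational Bayesian agent” means in this setting: keep a probability distribution over candidate diagnoses, update it with Bayes’ rule as results arrive, and prefer the test whose result is expected to shrink uncertainty the most. This is an illustration, not the paper’s actual methodology; the diagnoses, tests, and numbers below are invented.

```python
import math

# Illustrative prior over candidate diagnoses (made-up numbers).
priors = {"pneumonia": 0.4, "heart_failure": 0.35, "stroke": 0.25}

# P(test comes back positive | diagnosis), also illustrative.
likelihoods = {
    "chest_xray": {"pneumonia": 0.90, "heart_failure": 0.60, "stroke": 0.10},
    "head_ct":    {"pneumonia": 0.05, "heart_failure": 0.05, "stroke": 0.85},
}

def entropy(dist):
    """Shannon entropy (bits) of a probability distribution: our uncertainty."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def posterior(prior, lik, positive=True):
    """Bayes' rule: update the diagnosis distribution on one test result."""
    unnorm = {d: p * (lik[d] if positive else 1 - lik[d]) for d, p in prior.items()}
    z = sum(unnorm.values())
    return {d: v / z for d, v in unnorm.items()}

def expected_information_gain(prior, lik):
    """How much, on average, would ordering this test reduce our uncertainty?"""
    p_pos = sum(prior[d] * lik[d] for d in prior)
    post_pos = posterior(prior, lik, positive=True)
    post_neg = posterior(prior, lik, positive=False)
    expected_post_entropy = p_pos * entropy(post_pos) + (1 - p_pos) * entropy(post_neg)
    return entropy(prior) - expected_post_entropy

# Early in a case the prior is spread out, so informative tests score highly;
# once one diagnosis dominates, every test's expected gain shrinks toward zero.
for test, lik in likelihoods.items():
    print(test, round(expected_information_gain(priors, lik), 3))
```

The behaviour the researchers describe falls out of this kind of calculation: when the distribution is flat, almost any discriminating test is worth ordering; once it has collapsed onto one diagnosis, further tests buy almost nothing.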
The AI’s confidence estimates also mirrored its actual accuracy well. When it ended a case assigning 80-100% probability to a diagnosis, it was correct every time; when its confidence was low, errors were more likely. This is unusual, as AI models are often confidently wrong.
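“Well calibrated” has a simple operational meaning: group the runs by the model’s final confidence and check that accuracy within each band tracks the band itself. A small sketch, with invented runs standing in for the simulation data:

```python
from collections import defaultdict

# Invented (final_confidence, was_correct) pairs, purely for illustration.
runs = [(0.95, True), (0.88, True), (0.55, False), (0.62, True),
        (0.91, True), (0.45, False), (0.83, True), (0.58, True)]

buckets = defaultdict(list)
for confidence, correct in runs:
    # Group runs into 20%-wide confidence bands, e.g. 0.8-1.0.
    band = min(int(confidence * 5) / 5, 0.8)
    buckets[band].append(correct)

# A calibrated model's accuracy in each band tracks the band itself:
# runs it was ~90% sure about should be right ~90% of the time.
for band in sorted(buckets):
    hits = buckets[band]
    print(f"confidence {band:.0%}-{band + 0.2:.0%}: "
          f"accuracy {sum(hits) / len(hits):.0%} ({len(hits)} runs)")
```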
Better, up to a point
As with most studies, there is a caveat, and here it lies in what the researchers call the “thoroughness-efficiency frontier.” The human learners spent far more time talking to the virtual patients and were much better at asking questions. By contrast, the AI moved fast to stabilize the patient, then moved on. That makes it an effective machine for what the authors term “patient stabilization,” but a less convincing one for the fuller job of medicine.
This matters, as patient communication isn’t a “nice to have.” It’s often how clinicians gather the kind of contextual insights that aren’t in lab reports or clinical records. It’s how they detect when patients are downplaying symptoms, and how they build the trust that makes patients willing to return.
It’s an area in which the AI scored poorly: across the three complex cases, it completed only 22% of recommended dialogue actions, compared with 62% for the human students. The picture is also mixed on costs. The AI ordered more tests than the expert physician, nearly doubling the cost per session.
Handle with care
Of course, the study has important limitations. The simulation is based on a model of disease progression rather than the messy reality of actual patients. The 20-minute time period also bears little resemblance to a real emergency department where patients could be managed over hours with dozens of staff involved.
The researchers aren’t arguing that physicians should be replaced by AI. What they are saying is that a general-purpose language model, without any medical fine-tuning or bespoke engineering, can perform well on a clinical workflow from beginning to end.
It’s likely that AI will best be deployed in areas where human experts aren’t accessible, such as in remote and rural areas or even in overwhelmed emergency departments. In such contexts, an AI that can stabilize a patient and triage intelligently can supplement clinical teams rather than replace them.
It’s clear that we’ve moved beyond asking whether AI can pass medical board exams; that bar was cleared a while ago. The question now is whether these systems can manage the workflow, integrate information across time and modality, and make good decisions under pressure. This paper suggests the technology is making good progress there, too.
The doctor, it seems, need not worry about being replaced. But she may soon find herself with a very capable, if somewhat taciturn, colleague.

