Discoveries in Health Policy: NEJM AI: With Complex Cases, GPT-4 Usually Beats Humans (Eriksen et al.)

Monday, November 27, 2023

NEJM AI: With Complex Cases, GPT-4 Usually Beats Humans (Eriksen et al.)

Last April, NEJM announced it would launch ai.NEJM, a new AI-focused journal (see coverage at MedPageToday). At the time, it published Lee et al., a Microsoft-based article on the "benefits, limits, risks of GPT-4" in medicine (here).

In today's news, NEJM publishes Eriksen et al., on how often GPT-4 can correctly "diagnose complex clinical cases." Based on online data, the authors found that GPT-4 (which was right 57% of the time) outperformed more than 99% of simulated human readers.

See the article here:

https://onepub-media.nejmgroup-production.org/ai/media/000de933-9406-4f17-87b5-8e28c5cf5da7.pdf?

See open-access coverage at MobiHealthNews by Jessica Hagen here:

https://www.mobihealthnews.com/news/gpt-4-outperformed-9998-simulated-human-readers-diagnosing-complex-clinical-cases

Glitch: The article refers to supplemental information, including methods, but at least on the day of release, I couldn't find any at NEJM.

Here's a summary from Claude.ai:

Eriksen et al. Use of GPT-4 to Diagnose Complex Clinical Cases. AI.NEJM.ORG.

In this perspective article published on ai.nejm.org, Eriksen and colleagues assessed the performance of the AI system GPT-4 in diagnosing complex real-world medical cases. They compared GPT-4's diagnostic accuracy to that of medical journal readers answering the same questions online.

The authors utilized 38 open-access clinical case challenges with comprehensive patient information and a multiple choice question on the diagnosis. They provided the cases to GPT-4 along with instructions to solve the case and select the most likely diagnosis. They ran GPT-4 on the cases multiple times to assess reproducibility.

GPT-4 correctly diagnosed 21.8 out of the 38 cases on average (57%), with very high reproducibility across runs. In comparison, simulated medical journal readers answering the same cases (based on 248,000 online answers) diagnosed only 13.7 cases correctly on average (36%).

Based on distributions of reader answers, GPT-4 performed better than 99.98% of a simulated population of readers.

The authors conclude that GPT-4 performed surprisingly well on these complex real-world cases, even outperforming most medical journal readers. However, GPT-4 still missed almost half of diagnoses. More research is needed before considering clinical implementation of such AI systems. Specialized medical AI models, additional data sources beyond text, transparency of commercial models like GPT-4, and further evaluation of safety and validity are still required first. But these early findings indicate AI could become a powerful supportive tool for clinical diagnosis in the future.
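
To make the "simulated readers" comparison in the summary concrete, here is a rough Python sketch of how such a percentile could be estimated. It is not the authors' method (the supplemental methods were not available to me). It assumes, purely for illustration, that every case has the same chance of being answered correctly, set to the 36% reader average; the function names and the 10,000-reader population size are my own. Because real cases vary in difficulty, this simplification should not be expected to reproduce the paper's 99.98% figure exactly.

```python
import random

# Figures taken from the summary above.
N_CASES = 38
GPT4_MEAN_CORRECT = 21.8
READER_MEAN_CORRECT = 13.7


def simulate_reader_scores(p_correct_per_case, n_readers, seed=0):
    """Simulate readers who each answer every case once.

    p_correct_per_case: chance of a correct answer for each case.
    Returns one total score (number of correct cases) per reader.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_readers):
        score = sum(1 for p in p_correct_per_case if rng.random() < p)
        scores.append(score)
    return scores


def fraction_below(scores, threshold):
    """Share of simulated readers scoring strictly below the threshold."""
    return sum(1 for s in scores if s < threshold) / len(scores)


# Sanity check on the percentages quoted in the summary.
gpt4_pct = round(100 * GPT4_MEAN_CORRECT / N_CASES)      # 57
reader_pct = round(100 * READER_MEAN_CORRECT / N_CASES)  # 36

# ASSUMPTION (not from the paper): all 38 cases are equally hard, each
# with the average reader success rate. The real per-case rates would
# come from the 248,000 online answers.
p_per_case = [READER_MEAN_CORRECT / N_CASES] * N_CASES

scores = simulate_reader_scores(p_per_case, n_readers=10_000)
share_beaten = fraction_below(scores, GPT4_MEAN_CORRECT)

print(f"GPT-4: {gpt4_pct}% correct; readers: {reader_pct}% correct")
print(f"Share of simulated readers scoring below GPT-4: {share_beaten:.2%}")
```

With actual per-case answer distributions in place of the flat 36% assumption, the same two functions would give a closer analogue of the comparison described in the article.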