
What Does GPT-4's 86% USMLE Success Mean for Healthcare?

27th June 2024

Large Language Models (LLMs) like GPT-4 are changing how we think about AI's role in medicine. One key example is how well these models have performed on the United States Medical Licensing Examination (USMLE), a tough test for medical graduates. This success shows us that AI can understand and use medical knowledge in ways similar to human doctors.

In this blog post, we're going to look at what this means for the future. We'll focus on the results of GPT-4 and talk about how AI could help doctors. This includes making diagnoses more accurate, improving patient care, and making medical tasks more efficient. Our goal is to understand how AI, especially through models like GPT-4, might work alongside doctors to make healthcare better.

What is USMLE?

The United States Medical Licensing Examination (USMLE) is a three-step examination for medical licensure in the United States and is one of the most critical milestones in a medical student's career. It assesses a physician's ability to apply knowledge, concepts, and principles, and to demonstrate fundamental patient-centered skills that are important in health and disease and constitute the basis of safe and effective patient care.

Example Question and Analysis

Let's dive into an example question that illustrates the complexity and depth of knowledge required to succeed in the USMLE exams. This type of question demonstrates why AI might struggle, as it requires not only factual knowledge but also the application of that knowledge in a clinical context, understanding of patient history, and critical thinking to arrive at the correct diagnosis and treatment plan.

Question: A 25-year-old woman comes to the physician because of a two-month history of joint pain in her hands. She reports that the pain is worse in the morning. Physical examination shows swelling and tenderness in the metacarpophalangeal and proximal interphalangeal joints. Which of the following is the most likely diagnosis?

A) Osteoarthritis

B) Rheumatoid arthritis

C) Psoriatic arthritis

D) Gout

Answer: B) Rheumatoid arthritis

Explanation: The patient's symptoms of pain in the hands that is worse in the morning, along with physical findings of swelling and tenderness in the metacarpophalangeal and proximal interphalangeal joints, are characteristic of rheumatoid arthritis (RA). RA is an autoimmune disorder that typically affects the small joints in a symmetrical pattern.

Possible Mistakes:

A) Osteoarthritis would be incorrect because osteoarthritis typically affects older individuals, causes only brief morning stiffness, and usually spares the metacarpophalangeal joints.

C) Psoriatic arthritis could be misleading if one focuses solely on joint pain without considering the absence of psoriasis or the typical asymmetric and distal joint involvement.

D) Gout usually presents with acute onset and affects the big toe (podagra) initially, not the hands.

This question requires an understanding of disease mechanisms, patient symptoms, and diagnostic criteria, areas where AI might not fully grasp the nuances of human medical judgment or the subtleties of patient presentations. While AI can store and retrieve vast amounts of information, the clinical reasoning and empathy involved in medicine, especially in interpreting complex cases and understanding patient experiences, remain uniquely human traits.

Well, at least that's what we assumed until recent developments 😲

Capabilities of GPT-4 on USMLE

Researchers conducted a comprehensive evaluation of GPT-4 on medical competency exams and benchmark datasets, and the results revealed significant advances in the capabilities of large language models (LLMs) in the medical domain. Here's a summary of the key statistics and comparisons, including model performances and human benchmarking where available:

GPT-4 achieved 86.65% on USMLE

  • GPT-3.5, by comparison, scored 53.61% and 58.78% on the two USMLE question sets the paper evaluated.
  • GPT-4 significantly outperforms its predecessor, GPT-3.5, as well as Flan-PaLM 540B and its medically tuned variant Med-PaLM, on the United States Medical Licensing Examination (USMLE).
  • The passing threshold for USMLE exams is approximately 60% correct on multiple-choice questions, so GPT-4 does not merely pass: it clears the threshold by more than 25 percentage points.
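
To make the gap concrete, here is a minimal Python sketch of the arithmetic behind the figures above. The scores are the ones quoted in this post and the 60% threshold is the approximate figure mentioned above; the names `PASSING_THRESHOLD`, `scores`, and `margin_over_threshold` are made up for illustration.

```python
# Scores quoted in this post (percent correct).
# The 60% passing threshold is approximate, as noted above.
PASSING_THRESHOLD = 60.0

scores = {
    "GPT-4": [86.65],
    "GPT-3.5": [53.61, 58.78],
}


def margin_over_threshold(score, threshold=PASSING_THRESHOLD):
    """Return how many percentage points a score sits above (+) or below (-) the threshold."""
    return round(score - threshold, 2)


# Margin for every quoted score, grouped by model.
margins = {
    model: [margin_over_threshold(s) for s in values]
    for model, values in scores.items()
}

for model, values in margins.items():
    for m in values:
        status = "passes" if m >= 0 else "falls short"
        print(f"{model}: {m:+.2f} points ({status})")
```

Both GPT-3.5 results land below the approximate threshold, while GPT-4 sits well above it.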

The full evaluation can be found here: https://arxiv.org/pdf/2303.13375.pdf

Performance Limitations

1. Questions Involving Images and Media:

The text-only version of GPT-4 was evaluated on USMLE questions, some of which relied on images. GPT-4 still performed well on these questions despite not having access to the visual media, but it is reasonable to infer that its performance on them might not be as robust as on purely text-based questions. The paper notes that GPT-4 was able to use logical reasoning to answer questions intended to be answered with visual aids, but this indirect method could lead to errors in cases where visual interpretation is critical.

2. Calibration and Probabilities:

The paper highlights GPT-4's improved calibration over previous models, indicating it has a better understanding of when its answers are likely to be correct. However, the need to emphasize calibration suggests that there might still be issues with overconfidence or underconfidence in its answers. In high-stakes fields like medicine, any misestimation of confidence could lead to incorrect conclusions or recommendations, which can be seen as a form of failure.
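
For readers who want to see what "calibration" means in practice, here is a rough Python sketch of one common way to measure it, the expected calibration error (ECE). This is not necessarily how the paper measured calibration; the function name, the bin count, and the toy numbers are all assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin answers by stated confidence, then average the gap between
    accuracy and mean confidence in each bin, weighted by bin size."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Toy example (made-up numbers): a model that says "90% sure" four times
# but is right only three times out of four is overconfident.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(f"ECE: {toy_ece:.2f}")
```

An ECE of zero means the stated confidence matches how often the model is actually right; larger values mean the model is overconfident or underconfident.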

3. Memorization Concerns:

The researchers investigated whether GPT-4's performance could be attributed to memorization of exam content. They found no evidence of direct memorization, indicating its performance is likely due to understanding rather than recall. However, the paper discusses the potential for memorization or leakage from training data as a limitation of LLMs, implying that if GPT-4 had relied on memorization, it could fail in situations requiring novel application of knowledge or when faced with content it hadn't been exposed to during training.
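
To give a feel for what a memorization check can look like, here is a simplified Python sketch: show a model the first half of a question, ask it to continue, and measure how closely the continuation matches the real second half. This is a generic illustration and not the procedure used in the paper. The `complete` callable stands in for a model, and the 0.95 threshold and the two fake models are assumptions.

```python
from difflib import SequenceMatcher


def split_in_half(text):
    """Split text into two halves on a word boundary."""
    words = text.split()
    mid = len(words) // 2
    return " ".join(words[:mid]), " ".join(words[mid:])


def looks_memorized(text, complete, threshold=0.95):
    """Return True if the model's continuation of the first half
    nearly reproduces the real second half."""
    prefix, actual_rest = split_in_half(text)
    generated = complete(prefix)
    # Compare only as many characters as the real continuation contains.
    similarity = SequenceMatcher(None, generated[: len(actual_rest)], actual_rest).ratio()
    return similarity >= threshold


QUESTION = (
    "A 25-year-old woman comes to the physician because of a two-month "
    "history of joint pain in her hands. She reports that the pain is "
    "worse in the morning."
)


# Hypothetical stand-ins for a real model, for demonstration only.
def parrot_model(prefix):
    # Reproduces the original text verbatim, as a memorizing model would.
    return split_in_half(QUESTION)[1]


def fresh_model(prefix):
    # Writes a plausible but different continuation.
    return "the patient was referred for imaging and asked to return in two weeks."


print(looks_memorized(QUESTION, parrot_model))
print(looks_memorized(QUESTION, fresh_model))
```

A near-verbatim continuation is a warning sign that the question was in the training data; a different continuation does not prove the opposite, which is why checks like this can only rule memorization in, not out.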

4. Richer Prompting Strategies:

The paper also mentions the exploration of more sophisticated prompting strategies, such as chain-of-thought prompting, which did not yield significant performance benefits for GPT-4 in this context. This might suggest that while GPT-4 is highly capable, there could be specific types of reasoning or problem-solving where its approach is not optimized and its performance suffers.
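
To show the difference between a plain prompt and a chain-of-thought prompt, here is a small Python sketch using the example question from earlier in this post. The prompt wording and the function names are assumptions for illustration and are not the prompts used in the paper.

```python
def format_question(stem, options):
    """Render a multiple-choice question as plain text."""
    lines = [f"Question: {stem}"]
    lines += [f"{letter}) {text}" for letter, text in options.items()]
    return "\n".join(lines)


def direct_prompt(stem, options):
    """Ask for the answer straight away."""
    return format_question(stem, options) + "\nAnswer:"


def chain_of_thought_prompt(stem, options):
    """Ask the model to reason before committing to an answer."""
    return (
        format_question(stem, options)
        + "\nLet's think step by step, then state the final answer as a single letter."
    )


STEM = (
    "A 25-year-old woman comes to the physician because of a two-month "
    "history of joint pain in her hands. Which of the following is the "
    "most likely diagnosis?"
)
OPTIONS = {
    "A": "Osteoarthritis",
    "B": "Rheumatoid arthritis",
    "C": "Psoriatic arthritis",
    "D": "Gout",
}

print(direct_prompt(STEM, OPTIONS))
print()
print(chain_of_thought_prompt(STEM, OPTIONS))
```

The two prompts contain the same question; the only difference is whether the model is invited to write out its reasoning first.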

Implications for Clinical Practice

The implications of GPT-4's USMLE performance extend far beyond the realm of academic testing. As healthcare seeks to embrace AI, GPT-4's capabilities hint at a future where AI can:

  • Reduce Administrative Burden: Automating documentation and other time-consuming tasks, allowing clinicians to focus more on patient care.
  • Enhance Diagnostic Accuracy: Offering second opinions and alerting clinicians to potential diagnostic and treatment errors.
  • Personalize Patient Care: Analyzing vast datasets to tailor treatments to individual patient profiles, potentially improving outcomes.
  • Facilitate Continuous Learning: Acting as an ever-ready study and reference tool for healthcare professionals at all stages of their careers.

Conclusion

As we navigate the intersection of AI and healthcare, GPT-4's remarkable performance on the USMLE illuminates a path toward a future where AI not only complements but significantly enhances medical practice. Despite facing challenges with visual questions, calibration accuracy, and the intricacies of human reasoning, the potential of AI to alleviate administrative burdens, improve diagnostic precision, and personalize care is undeniable. By judiciously integrating AI tools like GPT-4 alongside human expertise, the medical community can harness these advancements to enrich patient care, optimize clinical workflows, and foster a learning environment that keeps pace with the rapid evolution of medicine. This journey requires careful consideration of the limitations and ethical implications of deploying AI in healthcare settings, ensuring that as we step into this new era, we do so with the goal of enhancing the well-being of patients and empowering healthcare professionals worldwide.

© Copyright 2024 Notewand