Performance of GPT-4 on the American College of Radiology In-training Examination: Evaluating Accuracy, Model Drift, and Fine-tuning

David L. Payne*, Kush Purohit, Walter Morales Borrero, Katherine Chung, Max Hao, Mutshipay Mpoy, Michael Jin, Prateek Prasanna, Virginia Hill

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

Rationale and Objectives: We evaluated GPT-4's performance on the American College of Radiology (ACR) 2022 Diagnostic Radiology In-Training Examination (DXIT). We ran the experiment at multiple time points to assess model drift, and again after fine-tuning to assess changes in accuracy.

Materials and Methods: Questions were input sequentially into GPT-4 with a standardized prompt. Each answer was recorded, and overall accuracy, logic-adjusted accuracy, and accuracy on image-based questions were calculated. The experiment was repeated several months later to assess model drift, and then again after fine-tuning to assess changes in GPT-4's performance.

Results: GPT-4 achieved 58.5% overall accuracy, lower than the PGY-3 average (61.9%) but higher than the PGY-2 average (52.8%). Logic-adjusted accuracy was 52.8%. GPT-4 reported significantly higher confidence (p = 0.012) for correct answers (87.1%) than for incorrect ones (84.0%). Performance on image-based questions (45.4%) was significantly poorer (p < 0.001) than on text-only questions (80.0%), with logic-adjusted accuracy for image-based questions of 36.4%. When the questions were repeated, GPT-4 chose a different answer 25.5% of the time, with no change in overall accuracy. Fine-tuning did not improve accuracy.

Conclusion: GPT-4 performed between PGY-2 and PGY-3 levels on the 2022 DXIT, performed significantly worse on image-based questions, and showed large variability in answer choices across time points. Exploratory fine-tuning experiments did not improve performance. These findings underscore both the potential and the risks of using minimally prompted general-purpose AI models to interpret radiologic images as a diagnostic tool. Implementers of general AI radiology systems should exercise caution given the possibility of spurious yet confident responses.
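The scoring described in Materials and Methods (recording each answer, then tallying overall accuracy and accuracy split by image-based versus text-only questions) can be sketched as below. This is a minimal illustrative sketch, not the authors' actual pipeline: the record fields, helper names, and example data are all assumptions made for demonstration.

```python
# Hypothetical sketch of the accuracy tallying described in the abstract.
# Field names and example responses are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Response:
    chosen: str        # answer letter the model selected
    correct: str       # answer letter from the exam key
    image_based: bool  # whether the question includes an image
    confidence: float  # model's self-reported confidence (0-100)

def accuracy(responses):
    """Fraction of responses answered correctly; 0.0 for an empty list."""
    if not responses:
        return 0.0
    return sum(r.chosen == r.correct for r in responses) / len(responses)

def split_by_modality(responses):
    """Return (image-based accuracy, text-only accuracy)."""
    image = [r for r in responses if r.image_based]
    text = [r for r in responses if not r.image_based]
    return accuracy(image), accuracy(text)

# Illustrative example with made-up data (not the study's results)
demo = [
    Response("A", "A", False, 90.0),
    Response("B", "C", True, 85.0),
    Response("D", "D", True, 88.0),
    Response("C", "C", False, 91.0),
]
overall = accuracy(demo)                 # 3 of 4 correct -> 0.75
img_acc, txt_acc = split_by_modality(demo)  # 0.5 image-based, 1.0 text-only
```

Repeating the same tally on responses collected at a later time point, or after fine-tuning, gives the drift and fine-tuning comparisons reported in the Results.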

Original language: English (US)
Pages (from-to): 3046-3054
Number of pages: 9
Journal: Academic Radiology
Volume: 31
Issue number: 7
DOIs
State: Published - Jul 2024

Keywords

  • AI Safety
  • Artificial Intelligence
  • Radiology
  • Residency

ASJC Scopus subject areas

  • Radiology Nuclear Medicine and imaging
