Abstract
Rationale and Objectives: We evaluate GPT-4's performance on the American College of Radiology (ACR) 2022 Diagnostic Radiology In-Training Examination (DXIT). We repeat the experiment across time points to assess for model drift, and again after fine-tuning to assess for changes in accuracy.

Materials and Methods: Questions were sequentially input into GPT-4 with a standardized prompt. Each answer was recorded, and overall accuracy, logic-adjusted accuracy, and accuracy on image-based questions were calculated. The experiment was repeated several months later to assess for model drift, then again after fine-tuning to assess for changes in GPT-4's performance.

Results: GPT-4 achieved 58.5% overall accuracy, lower than the PGY-3 average (61.9%) but higher than the PGY-2 average (52.8%). Adjusted accuracy was 52.8%. GPT-4 showed significantly higher confidence (p = 0.012) for correct answers (87.1%) than for incorrect answers (84.0%). Performance on image-based questions was significantly poorer (p < 0.001) at 45.4% compared with text-only questions (80.0%), with adjusted accuracy for image-based questions of 36.4%. When the questions were repeated, GPT-4 chose a different answer 25.5% of the time, with no change in accuracy. Fine-tuning did not improve accuracy.

Conclusion: GPT-4 performed between PGY-2 and PGY-3 levels on the 2022 DXIT, performed significantly worse on image-based questions, and showed large variability in answer choices across time points. Exploratory fine-tuning experiments did not improve performance. This study underscores the potential and risks of using minimally prompted general AI models to interpret radiologic images as a diagnostic tool. Implementers of general AI radiology systems should exercise caution given the possibility of spurious yet confident responses.
Original language | English (US) |
---|---|
Pages (from-to) | 3046-3054 |
Number of pages | 9 |
Journal | Academic Radiology |
Volume | 31 |
Issue number | 7 |
State | Published - Jul 2024 |
Keywords
- AI Safety
- Artificial Intelligence
- Radiology
- Residency
ASJC Scopus subject areas
- Radiology, Nuclear Medicine and Imaging