Development and External Validation of an Artificial Intelligence Model for Identifying Radiology Reports Containing Recommendations for Additional Imaging

AJR Am J Roentgenol. 2023 Sep;221(3):377-385. doi: 10.2214/AJR.23.29120. Epub 2023 Apr 19.

Abstract

BACKGROUND. Reported rates of recommendations for additional imaging (RAIs) in radiology reports are low. Bidirectional encoder representations from transformers (BERT), a deep learning model pretrained to understand language context and ambiguity, has potential for identifying RAIs and thereby assisting large-scale quality improvement efforts.

OBJECTIVE. The purpose of this study was to develop and externally validate an artificial intelligence (AI)-based model for identifying radiology reports containing RAIs.

METHODS. This retrospective study was performed at a multisite health center. A total of 6300 radiology reports generated at one site from January 1, 2015, to June 30, 2021, were randomly selected and split in a 4:1 ratio to create training (n = 5040) and test (n = 1260) sets. A total of 1260 reports generated at the center's other sites (including academic and community hospitals) from April 1 to April 30, 2022, were randomly selected as an external validation group. Referring practitioners and radiologists of varying subspecialties manually reviewed report impressions for the presence of RAIs. A BERT-based technique for identifying RAIs was developed using the training set. Performance of the BERT-based model and of a previously developed traditional machine learning (TML) model was assessed in the test set. Finally, performance was assessed in the external validation set. The code for the BERT-based RAI model is publicly available.

RESULTS. Among a total of 7419 unique patients (4133 women, 3286 men; mean age, 58.8 years), 10.0% of 7560 reports contained an RAI. In the test set, the BERT-based model had 94.4% precision, 98.5% recall, and an F1 score of 96.4%, whereas the TML model had 69.0% precision, 65.4% recall, and an F1 score of 67.2%. In the test set, accuracy was greater for the BERT-based model than for the TML model (99.2% vs 93.1%, p < .001). In the external validation set, the BERT-based model had 99.2% precision, 91.6% recall, an F1 score of 95.2%, and 99.0% accuracy.

CONCLUSION. The BERT-based AI model accurately identified reports with RAIs, outperforming the TML model. High performance in the external validation set suggests the potential for other health systems to adapt the model without requiring institution-specific training.

CLINICAL IMPACT. The model could potentially be used for real-time electronic health record (EHR) monitoring for RAIs and other improvement initiatives to help ensure timely performance of clinically necessary recommended follow-up.
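
Below is a minimal, illustrative sketch of the general approach described in METHODS: fine-tuning a BERT model for binary classification of report impressions and scoring it with the same metrics reported in RESULTS (precision, recall, F1 score, and accuracy). It assumes the Hugging Face transformers and datasets libraries plus scikit-learn; the checkpoint name (bert-base-uncased), file names, column names, and hyperparameters are assumptions for demonstration, and this is not the authors' publicly released code.

    import numpy as np
    from datasets import load_dataset
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Illustrative sketch, not the authors' released code. Checkpoint, file names,
    # column names, and hyperparameters are assumptions for demonstration only.
    MODEL_NAME = "bert-base-uncased"  # a clinical BERT variant could be substituted

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

    # Hypothetical CSVs with an "impression" text column and a binary "label" column
    # (1 = impression contains a recommendation for additional imaging).
    data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

    def tokenize(batch):
        return tokenizer(batch["impression"], truncation=True, max_length=256)

    data = data.map(tokenize, batched=True)

    def compute_metrics(eval_pred):
        # Report the metrics used in the abstract: precision, recall, F1, accuracy.
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        precision, recall, f1, _ = precision_recall_fscore_support(
            labels, preds, average="binary")
        return {"precision": precision, "recall": recall, "f1": f1,
                "accuracy": accuracy_score(labels, preds)}

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="rai_bert", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=data["train"],
        eval_dataset=data["test"],
        tokenizer=tokenizer,  # enables dynamic padding via the default collator
        compute_metrics=compute_metrics,
    )
    trainer.train()
    print(trainer.evaluate())  # precision, recall, F1, and accuracy on the held-out set

Once fine-tuned, the same classifier can score new report impressions as they are signed, which corresponds to the real-time EHR monitoring scenario described under CLINICAL IMPACT.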

Keywords: artificial intelligence; external validation; natural language processing; quality and safety improvement; recommendations for additional imaging.

MeSH terms

  • Artificial Intelligence*
  • Diagnostic Imaging
  • Female
  • Humans
  • Male
  • Middle Aged
  • Natural Language Processing
  • Radiography
  • Radiology*
  • Retrospective Studies