Extracting forced vital capacity from the electronic health record through natural language processing in rheumatoid arthritis-associated interstitial lung disease

Pharmacoepidemiol Drug Saf. 2024 Jan;33(1):e5744. doi: 10.1002/pds.5744. Epub 2023 Dec 19.

Abstract

Purpose: To develop a natural language processing (NLP) tool to extract forced vital capacity (FVC) values from electronic health record (EHR) notes in patients with rheumatoid arthritis-interstitial lung disease (RA-ILD).

Methods: We selected RA-ILD patients (n = 7485) in the Veterans Health Administration (VA) between 2000 and 2020 using validated ICD-9/10 codes. In EHR clinical notes, we identified numeric values in proximity to FVC string patterns. We then applied processing steps to account for variability in note structure, related pulmonary function test (PFT) output, and values copied across notes, and assigned dates from linked administrative procedure records. NLP-derived FVC values were compared with values recorded directly from PFT equipment, which were available for a subset of patients.
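The extraction and de-duplication steps described in the Methods might look roughly like the sketch below. It is only an illustration under stated assumptions: the abstract does not give the actual string patterns, proximity window, or copy-detection rules, so the regular expressions, the 30-character window, and the function names `extract_fvc_candidates` and `drop_copied_values` are all hypothetical.

```python
import re

# Hypothetical pattern: an FVC mention not preceded by "/" so that the
# FEV1/FVC ratio (related PFT output) is skipped. Not the authors' pattern.
FVC_PATTERN = re.compile(r"(?<!/)\b(?:FVC|forced vital capacity)\b", re.IGNORECASE)

# A number that does not start inside a word or another number,
# so the "1" in "FEV1" is not picked up.
NUMBER_PATTERN = re.compile(r"(?<![A-Za-z\d.])\d+(?:\.\d+)?")


def extract_fvc_candidates(note_text, window=30):
    """Return numeric values found within `window` characters after an FVC mention.

    The window size is an assumed parameter for illustration only.
    """
    candidates = []
    for match in FVC_PATTERN.finditer(note_text):
        following = note_text[match.end():match.end() + window]
        for num in NUMBER_PATTERN.finditer(following):
            candidates.append(float(num.group()))
    return candidates


def drop_copied_values(records):
    """Keep only the earliest note for each (patient_id, value) pair.

    `records` holds (patient_id, note_date, value) tuples with ISO-format
    date strings. Treating any repeated value as a copy is a simplifying
    assumption; a genuine repeat measurement would also be dropped.
    """
    seen = set()
    kept = []
    for patient_id, note_date, value in sorted(records, key=lambda r: (r[0], r[1])):
        key = (patient_id, value)
        if key not in seen:
            seen.add(key)
            kept.append((patient_id, note_date, value))
    return kept
```

A real pipeline would still need to tell liters from %-predicted values and to assign test dates from linked procedure records; neither step is attempted here.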

Results: We identified 5911 FVC values (n = 1844 patients) from PFT equipment and 15 383 values (n = 4982 patients) by NLP. Among 2610 date-matched FVC values from NLP and PFT equipment, 95.8% of values were within 5% predicted. The mean (SD) difference was 0.09% (5.9), and values strongly correlated (r = 0.94, p < 0.001), with a precision of 0.87 (95% CI 0.86, 0.88). NLP captured more patients with longitudinal FVC values (n = 3069 vs. n = 1164). Mean (SD) change in FVC %-predicted per year was similar between sources (-1.5 [30.0] NLP vs. -0.9 [16.6] PFT equipment; standardized response mean = 0.05 for both).
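As a rough illustration of how agreement statistics of this kind can be computed for date-matched pairs, consider the sketch below. It makes two assumptions: that "within 5% predicted" means an absolute difference of at most 5 percentage points of FVC %-predicted, and that the standardized response mean is the absolute mean change divided by the SD of change. The function names are hypothetical, and the reported precision is left out because the abstract does not define it.

```python
from math import sqrt
from statistics import mean, stdev


def agreement_summary(nlp_values, pft_values, tolerance=5.0):
    """Summarize agreement between paired NLP and PFT-equipment values.

    Both inputs are equal-length lists of FVC %-predicted for date-matched
    pairs. `tolerance` is the assumed agreement threshold in percentage points.
    """
    diffs = [n - p for n, p in zip(nlp_values, pft_values)]
    prop_within = sum(abs(d) <= tolerance for d in diffs) / len(diffs)

    # Pearson correlation computed from its definition.
    mean_n, mean_p = mean(nlp_values), mean(pft_values)
    cov = sum((n - mean_n) * (p - mean_p) for n, p in zip(nlp_values, pft_values))
    ss_n = sum((n - mean_n) ** 2 for n in nlp_values)
    ss_p = sum((p - mean_p) ** 2 for p in pft_values)

    return {
        "mean_diff": mean(diffs),
        "sd_diff": stdev(diffs),
        "pearson_r": cov / sqrt(ss_n * ss_p),
        "prop_within": prop_within,
    }


def standardized_response_mean(mean_change, sd_change):
    """Absolute mean change divided by the SD of change (assumed definition)."""
    return abs(mean_change) / sd_change
```

Under that definition, the reported annual changes give 1.5 / 30.0 = 0.05 for NLP and 0.9 / 16.6 ≈ 0.05 for PFT equipment, consistent with the standardized response mean stated for both sources.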

Conclusions: NLP of EHR notes increases the capture of accurate, longitudinal FVC values approximately three-fold compared with PFT equipment. This NLP tool can facilitate pharmacoepidemiologic research in RA-ILD and other lung diseases by capturing this critical measure of disease severity.

Keywords: electronic health record; forced vital capacity; interstitial lung disease; natural language processing; pulmonary function test; rheumatoid arthritis.

MeSH terms

  • Arthritis, Rheumatoid* / complications
  • Arthritis, Rheumatoid* / epidemiology
  • Electronic Health Records
  • Humans
  • Lung Diseases, Interstitial* / diagnosis
  • Lung Diseases, Interstitial* / epidemiology
  • Lung Diseases, Interstitial* / etiology
  • Natural Language Processing
  • Vital Capacity