A clinically-guided unsupervised clustering approach to recommend symptoms of disease associated with diagnostic opportunities

Diagnosis (Berl). 2022 Sep 21;10(1):43-53. doi: 10.1515/dx-2022-0044. eCollection 2023 Feb 1.

Abstract

Objectives: A first step in studying diagnostic delays is to select the signs, symptoms and alternative diseases that represent missed diagnostic opportunities. Because this step is labor intensive, requiring exhaustive literature reviews, we developed machine learning approaches to mine administrative data sources and recommend conditions for consideration. We propose a methodological approach to find diagnostic codes that exhibit known patterns of diagnostic delays and apply it to tuberculosis and appendicitis.

Methods: We used the IBM MarketScan Research Databases and considered the initial symptoms of cough before tuberculosis and abdominal pain before appendicitis. We analyzed diagnosis codes recorded during healthcare visits before the index diagnosis and used k-means clustering to recommend conditions that exhibit trends similar to those of the initial symptoms provided. We evaluated the clinical plausibility of the recommended conditions and the corresponding number of possible diagnostic delays based on these diseases.
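
The clustering step can be illustrated with a minimal sketch. It is not the study's implementation: the visit counts below are hypothetical, the choice to normalize each trend to sum to one is an assumption, and the helper names (`normalize`, `kmeans`, `recommend`) and the deterministic farthest-first initialization are introduced here only for illustration. Each condition is represented by its visit counts in successive periods before the index diagnosis, and conditions that fall in the same cluster as the seed symptom are recommended.

```python
def normalize(series):
    """Scale a visit-count trend to sum to 1 so clusters reflect shape, not volume."""
    total = sum(series)
    return [x / total for x in series] if total else list(series)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length trends."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100):
    """Plain k-means (Lloyd's algorithm) with farthest-first initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center for each point.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def recommend(trends, seed_code, k=2):
    """Return conditions clustered together with the seed symptom."""
    codes = sorted(trends)
    points = [normalize(trends[c]) for c in codes]
    labels = kmeans(points, k)
    target = labels[codes.index(seed_code)]
    return [c for c, lab in zip(codes, labels) if lab == target and c != seed_code]


# Hypothetical visit counts per period, oldest to most recent before diagnosis.
TRENDS = {
    "cough":        [10, 11, 10, 12, 15, 22, 40, 80],
    "fever":        [5, 5, 6, 5, 8, 12, 20, 45],
    "hemoptysis":   [1, 1, 1, 1, 2, 3, 6, 14],
    "ankle_sprain": [9, 10, 9, 10, 10, 9, 10, 9],
    "routine_exam": [20, 19, 21, 20, 20, 21, 19, 20],
}

print(recommend(TRENDS, "cough"))
```

In this toy data, conditions whose visits rise sharply just before the index diagnosis group with the seed symptom, while conditions with flat trends form a separate cluster. Real administrative data would call for many more codes and a considered choice of the number of clusters.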

Results: For both diseases of interest, the clustering approach suggested a large number of clinically-plausible conditions to consider (e.g., fever, hemoptysis, and pneumonia before tuberculosis). The recommended conditions had a high degree of precision in terms of clinical plausibility: >70% for tuberculosis and >90% for appendicitis. Including these additional clinically-plausible conditions resulted in more than twice the number of possible diagnostic delays identified.
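
Precision in this sense is the share of recommended conditions that reviewers judge clinically plausible. The sketch below shows the arithmetic only; the condition lists are hypothetical and are not the study's data, and the function name `precision` is introduced here for illustration.

```python
def precision(recommended, plausible):
    """Fraction of recommended conditions that appear in the plausible set."""
    if not recommended:
        raise ValueError("no recommended conditions to evaluate")
    hits = sum(1 for condition in recommended if condition in plausible)
    return hits / len(recommended)


# Hypothetical recommendations and reviewer judgments (not study results).
recommended = ["fever", "hemoptysis", "pneumonia", "ankle sprain"]
plausible = {"fever", "hemoptysis", "pneumonia"}

print(precision(recommended, plausible))
```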

Conclusions: Our approach can mine administrative datasets to detect patterns of diagnostic delay and help investigators avoid under-identifying potential missed diagnostic opportunities. In addition, the methods we describe can be used to discover less-common presentations of diseases that are frequently misdiagnosed.

Keywords: administrative data; diagnostic delay; machine learning.

Publication types

  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Appendicitis* / diagnosis
  • Cluster Analysis
  • Delayed Diagnosis
  • Delivery of Health Care
  • Humans
  • Tuberculosis* / diagnosis
  • Tuberculosis* / epidemiology