A pilot study using machine learning and domain knowledge to facilitate comparative effectiveness review updating

Siddhartha R Dalal; Paul G Shekelle; Susanne Hempel; Sydne J Newberry; Aneesa Motala; Kanaka D Shetty

doi:10.1177/0272989X12457243

Med Decis Making. 2013 Apr;33(3):343-55. doi: 10.1177/0272989X12457243. Epub 2012 Sep 7.

Authors

Siddhartha R Dalal¹, Paul G Shekelle¹ ², Susanne Hempel¹, Sydne J Newberry¹, Aneesa Motala¹, Kanaka D Shetty¹

Affiliations

¹ Southern California Evidence-based Practice Center, RAND Corporation, Santa Monica, CA (SRD, PGS, SH, SJN, AM, KDS)
² Greater Los Angeles Veterans Affairs Healthcare System, Los Angeles, CA (PGS).

PMID: 22961102
DOI: 10.1177/0272989X12457243

Abstract

Background: Comparative effectiveness and systematic reviews require frequent and time-consuming updating. Results of earlier screening should be useful in reducing the effort needed to screen relevant articles.

Methods: We collected 16,707 PubMed citation classification decisions from 2 comparative effectiveness reviews: interventions to prevent fractures in low bone density (LBD) and off-label uses of atypical antipsychotic drugs (AAP). We used previously written search strategies to guide extraction of a limited number of explanatory variables pertaining to the intervention, outcome, and study design. We empirically derived statistical models (based on a sparse generalized linear model with convex penalties [GLMnet] and a gradient boosting machine [GBM]) that predicted article relevance. We evaluated model sensitivity, positive predictive value (PPV), and screening workload reductions using 11,003 PubMed citations retrieved for the LBD and AAP updates.

Results: GLMnet-based models performed slightly better than GBM-based models. When attempting to maximize sensitivity for all relevant articles, GLMnet-based models achieved high sensitivities (0.99 and 1.0 for AAP and LBD, respectively) while reducing projected screening by 55.4% and 63.2%. The GLMnet-based model yielded sensitivities of 0.921 and 0.905 and PPVs of 0.185 and 0.102 when predicting articles relevant to the AAP and LBD efficacy/effectiveness analyses, respectively (using a threshold of P ≥ 0.02). GLMnet performed better when identifying adverse effect relevant articles for the AAP review (sensitivity = 0.981) than for the LBD review (0.685). The system currently requires MEDLINE-indexed articles.

Conclusions: We evaluated statistical classifiers that used previous classification decisions and explanatory variables derived from MEDLINE indexing terms to predict inclusion decisions. This pilot system reduced workload associated with screening 2 simulated comparative effectiveness review updates by more than 50% with minimal loss of relevant articles.
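
Illustrative sketch: the evaluation metrics named in the abstract (sensitivity, PPV, and screening workload reduction at a probability threshold such as P ≥ 0.02) can be computed from predicted relevance probabilities and reviewer decisions roughly as follows. This is a minimal sketch, not the authors' code. The function name `screening_metrics`, the toy data, and the definition of workload reduction as the fraction of citations falling below the threshold are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def screening_metrics(
    probs: Sequence[float],
    relevant: Sequence[bool],
    threshold: float = 0.02,
) -> Dict[str, float]:
    """Summarize a screening classifier at one probability threshold.

    probs     -- predicted probability that each citation is relevant
    relevant  -- reviewer decision for each citation (True = relevant)
    threshold -- citations with probability >= threshold go to manual review
    """
    if len(probs) != len(relevant):
        raise ValueError("probs and relevant must have the same length")
    if not probs:
        raise ValueError("at least one citation is required")

    # A citation is flagged for manual screening when P >= threshold.
    flagged = [p >= threshold for p in probs]

    tp = sum(1 for f, r in zip(flagged, relevant) if f and r)
    fp = sum(1 for f, r in zip(flagged, relevant) if f and not r)
    fn = sum(1 for f, r in zip(flagged, relevant) if not f and r)

    n_flagged = tp + fp
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / n_flagged if n_flagged else float("nan")

    # Assumed definition: share of citations that would not be screened by hand.
    workload_reduction = 1.0 - n_flagged / len(probs)

    return {
        "sensitivity": sensitivity,
        "ppv": ppv,
        "workload_reduction": workload_reduction,
    }


# Hypothetical toy data, not taken from the study.
probs = [0.90, 0.30, 0.05, 0.01, 0.015, 0.001, 0.40, 0.005, 0.03, 0.002]
relevant = [True, False, False, False, True, False, True, False, False, False]

metrics = screening_metrics(probs, relevant, threshold=0.02)
print(metrics)
```

Lowering the threshold flags more citations, which raises sensitivity at the cost of PPV and workload reduction; that trade-off is the one the reported figures describe.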

Publication types

Comparative Study
Research Support, Non-U.S. Gov't

MeSH terms

Artificial Intelligence*
Pilot Projects