Using data mining techniques to characterize participation in observational studies

Ariel Linden; Paul R Yarnold

doi:10.1111/jep.12515

J Eval Clin Pract. 2016 Dec;22(6):835-843. doi: 10.1111/jep.12515. Epub 2016 Jan 25.

Authors

Ariel Linden¹ ², Paul R Yarnold³

Affiliations

¹ Linden Consulting Group, LLC, Ann Arbor, MI, USA.
² Division of General Medicine, Medical School, University of Michigan, Ann Arbor, MI, USA.
³ Optimal Data Analysis, LLC, Chicago, IL, USA.

PMID: 26805004
DOI: 10.1111/jep.12515

Abstract

Data mining techniques are gaining popularity among health researchers for an array of purposes, such as improving diagnostic accuracy, identifying high-risk patients and extracting concepts from unstructured data. In this paper, we describe how these techniques can be applied to another area in the health research domain: identifying characteristics of individuals who do and do not choose to participate in observational studies. In contrast to randomized studies, where individuals have no control over their treatment assignment, participants in observational studies self-select into the treatment arm and may therefore differ in their characteristics from those who elect not to participate. These differences may explain part, or all, of the difference in the observed outcome, making it crucial to assess whether there is differential participation based on observed characteristics. Compared with traditional approaches to this assessment, data mining offers a more precise understanding of these differences. To describe and illustrate the application of data mining in this domain, we use data from a primary care-based medical home pilot programme and compare the performance of commonly used classification approaches (logistic regression, support vector machines, random forests and classification tree analysis (CTA)) in correctly classifying participants and non-participants. We find that CTA is substantially more accurate than the other models. Moreover, unlike the other models, CTA offers transparency in its computational approach and ease of interpretation via the decision rules it produces, and it provides statistical results familiar to health researchers. Beyond their application to research, data mining techniques could help administrators identify new candidates for participation who may benefit most from the intervention.
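
To make the idea of a decision rule that separates participants from non-participants concrete, the following is a minimal sketch of a single-split rule on one covariate, the simplest building block of a classification tree. It is not the CTA procedure used in the paper. The data are synthetic, and the covariate (`age`), the function names (`balanced_accuracy`, `fit_stump`) and the choice of balanced accuracy as the split criterion are all assumptions made for illustration.

```python
from typing import List, Tuple


def balanced_accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Mean of sensitivity and specificity (1 = participant, 0 = non-participant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def fit_stump(x: List[float], y: List[int]) -> Tuple[float, int, float]:
    """Search every midpoint between adjacent distinct values of x.

    Returns (cut, direction, score). direction = 1 means "predict participant
    when x > cut"; direction = -1 means "predict participant when x <= cut".
    """
    xs = sorted(set(x))
    cuts = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = (cuts[0], 1, -1.0)
    for cut in cuts:
        for direction in (1, -1):
            pred = [int((v > cut) == (direction == 1)) for v in x]
            score = balanced_accuracy(y, pred)
            if score > best[2]:  # keep the first cut that reaches the best score
                best = (cut, direction, score)
    return best


# Synthetic, hypothetical data: NOT taken from the medical home pilot programme.
age = [34, 41, 45, 52, 58, 61, 63, 67, 70, 74]
participated = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

cut, direction, score = fit_stump(age, participated)
rule = "age > {}".format(cut) if direction == 1 else "age <= {}".format(cut)
print("Predict participation when", rule, "| balanced accuracy =", round(score, 3))
```

A rule of this form can be read directly ("predict participation when the covariate exceeds the cut-point"), which illustrates the kind of interpretability the abstract attributes to tree-based decision rules. A full tree would apply such splits recursively and should be validated on data not used to fit it.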

Keywords: data mining; machine learning; observational studies; observed characteristics; selection; selection bias.

MeSH terms

Adult
Data Mining / methods*
Female
Humans
Machine Learning
Male
Middle Aged
Observational Studies as Topic*
Selection Bias