Sample selection in the face of design constraints: Use of clustering to define sample strata for qualitative research

Health Serv Res. 2019 Apr;54(2):509-517. doi: 10.1111/1475-6773.13100. Epub 2018 Dec 11.

Abstract

Objective: To select 40 physician organizations for qualitative interviews, stratified on the basis of longitudinal cost of care measures, in order to describe the range of care delivery structures and processes being deployed to influence the total costs of caring for patients.

Data sources: Three years of physician organization-level total cost of care data (n = 156 in California) from the Integrated Healthcare Association's value-based pay-for-performance program.

Study design: We fit total cost of care data using mixture and K-means clustering algorithms to segment the population of physician organizations into sampling strata based on 3-year cost trajectories (ie, cost curves).
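As a rough illustration of the K-means half of this design, the sketch below clusters 3-year cost trajectories with a plain Lloyd's-algorithm K-means. It is a minimal sketch, not the authors' implementation: the function name `kmeans`, the choice of two clusters, and the six cost curves are all hypothetical, and the mixture-model fits are not shown.

```python
import random


def kmeans(curves, k, n_iter=100, seed=0):
    """Cluster equal-length trajectories with Lloyd's algorithm.

    `curves` is a list of lists (one cost value per year). Returns the
    cluster label for each curve and the final cluster centers.
    """
    rng = random.Random(seed)
    # Initialize centers at k distinct observed curves.
    centers = [list(c) for c in rng.sample(curves, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = []
        for c in curves:
            dists = [sum((x - m) ** 2 for x, m in zip(c, center))
                     for center in centers]
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the year-by-year mean of its members.
        for j in range(k):
            members = [c for c, lab in zip(curves, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical 3-year total cost of care curves (made-up values).
curves = [
    [300, 310, 320], [305, 312, 318], [298, 309, 322],  # low, flat
    [500, 540, 580], [510, 545, 590], [495, 538, 575],  # high, rising
]
labels, centers = kmeans(curves, k=2)
print(labels)
```

Because K-means assigns each curve by distance to a cluster mean alone, it has no parameter for how tightly or loosely a cluster's members scatter around that mean.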

Principal findings: A mixture of multivariate normal distributions can classify physician organization cost curves into clusters defined by total cost level, shape, and within-cluster variation. K-means clustering does not accommodate differing levels of within-cluster variation and allocates more clusters to unstable cost curves. A mixture of regressions approach focuses overly on anomalous trajectories and is sensitive to model coding.

Conclusions: Statistical clustering can be used to form sampling strata when longitudinal measures are of primary interest. Many clustering algorithms are available; the choice of the clustering algorithm can strongly impact the resulting strata because various algorithms focus on different aspects of the observed data.
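Once cluster labels are in hand, they can serve as sampling strata. The sketch below shows one possible way to draw a sample of 40 from 156 organizations, assuming proportional allocation with largest-remainder rounding; the abstract does not state which allocation rule was used, and the stratum names and sizes here are hypothetical.

```python
import random
from collections import defaultdict


def allocate(stratum_sizes, total):
    """Split `total` across strata in proportion to size (largest remainder)."""
    n = sum(stratum_sizes.values())
    quotas = {s: total * size / n for s, size in stratum_sizes.items()}
    alloc = {s: int(q) for s, q in quotas.items()}
    leftover = total - sum(alloc.values())
    # Hand the remaining slots to the strata with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


def stratified_sample(labels, total, seed=0):
    """Draw a simple random sample within each stratum defined by `labels`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for idx, lab in enumerate(labels):
        strata[lab].append(idx)
    alloc = allocate({s: len(m) for s, m in strata.items()}, total)
    return {s: sorted(rng.sample(members, alloc[s]))
            for s, members in strata.items()}


# Hypothetical cluster labels for 156 organizations (sizes are made up).
cluster_labels = ["A"] * 70 + ["B"] * 50 + ["C"] * 36
allocation = allocate({"A": 70, "B": 50, "C": 36}, total=40)
sample = stratified_sample(cluster_labels, total=40)
print(allocation)
```

Other allocation rules, such as equal allocation per stratum or oversampling small clusters, would change only the `allocate` step.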

Keywords: biostatistical methods; health care costs; sampling.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Cluster Analysis*
  • Health Care Costs / statistics & numerical data*
  • Health Services Research / methods*
  • Humans
  • Longitudinal Studies
  • Models, Statistical*
  • Qualitative Research*