De-identifying Socioeconomic Data at the Census Tract Level for Medical Research Through Constraint-based Clustering

AMIA Annu Symp Proc. 2022 Feb 21;2021:793-802. eCollection 2021.

Abstract

Numerous studies have shown that a person's health status is closely related to their socioeconomic status. Incorporating socioeconomic data associated with a patient's geographic area of residence into clinical datasets can therefore support medical research. However, most socioeconomic variables are unique in combination and are tied to small geographic regions (e.g., census tracts) that often contain fewer than 20,000 people. Thus, sharing such tract-level data can violate the Safe Harbor implementation of de-identification under the Health Insurance Portability and Accountability Act of 1996 (HIPAA). In this paper, we introduce a constraint-based k-means clustering approach to generate census tract-level socioeconomic data that is de-identification compliant. Our experimental analysis with data from the American Community Survey illustrates that the approach generates a protected dataset with high similarity to the unaltered values and achieves substantially better data utility than the HIPAA Safe Harbor recommendation of 3-digit ZIP codes.
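The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes each tract has a population count and a numeric feature vector, runs a plain population-weighted k-means, and then merges any cluster below the 20,000-person threshold into its nearest neighbor (a post-hoc merge heuristic swapped in for a true in-loop constraint). All function names, the synthetic data, and the choice of k are hypothetical.

```python
import random

MIN_POPULATION = 20_000  # threshold cited in the abstract


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_centroid(members, features, population):
    """Population-weighted mean of the feature vectors of the given tracts."""
    total = sum(population[i] for i in members)
    dims = len(features[0])
    return [sum(features[i][d] * population[i] for i in members) / total
            for d in range(dims)]


def kmeans(features, population, k, iters=20, seed=0):
    """Plain (unconstrained) k-means; returns the non-empty clusters as index lists."""
    rng = random.Random(seed)
    centroids = [list(features[i]) for i in rng.sample(range(len(features)), k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for i, f in enumerate(features):
            nearest = min(range(k), key=lambda c: sq_dist(f, centroids[c]))
            clusters[nearest].append(i)
        centroids = [weighted_centroid(m, features, population) if m else centroids[c]
                     for c, m in enumerate(clusters)]
    return [m for m in clusters if m]


def enforce_min_population(clusters, features, population, min_pop=MIN_POPULATION):
    """Assumed heuristic: merge the smallest cluster into its nearest neighbor
    until every cluster reaches min_pop (or only one cluster remains)."""
    clusters = [list(m) for m in clusters]
    while len(clusters) > 1:
        sizes = [sum(population[i] for i in m) for m in clusters]
        smallest = min(range(len(clusters)), key=sizes.__getitem__)
        if sizes[smallest] >= min_pop:
            break
        small = clusters.pop(smallest)
        c_small = weighted_centroid(small, features, population)
        target = min(
            range(len(clusters)),
            key=lambda c: sq_dist(
                c_small, weighted_centroid(clusters[c], features, population)),
        )
        clusters[target].extend(small)
    return clusters


def deidentify(features, population, k, min_pop=MIN_POPULATION):
    """Replace each tract's values with its cluster's weighted centroid."""
    clusters = enforce_min_population(
        kmeans(features, population, k), features, population, min_pop)
    released = [None] * len(features)
    for members in clusters:
        centroid = weighted_centroid(members, features, population)
        for i in members:
            released[i] = centroid
    return clusters, released


# Synthetic stand-in for tract-level data (not real American Community Survey values):
# two made-up features per tract, e.g. median income in $1000s and percent in poverty.
rng = random.Random(42)
population = [rng.randint(1_500, 8_000) for _ in range(60)]
features = [[rng.uniform(20, 120), rng.uniform(2, 40)] for _ in range(60)]

clusters, released = deidentify(features, population, k=12)
cluster_pops = [sum(population[i] for i in m) for m in clusters]
```

In this sketch every released tract carries the values of a group whose combined population is at least 20,000, provided the total population itself exceeds that threshold. In practice the features would likely need to be standardized before clustering so that no single variable dominates the distance.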

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Biomedical Research*
  • Census Tract*
  • Cluster Analysis
  • Health Insurance Portability and Accountability Act
  • Humans
  • Social Class
  • United States