Variable selection for latent class analysis in the presence of missing data with application to record linkage

Huiping Xu; Xiaochun Li; Zuoyi Zhang; Shaun Grannis

doi:10.1177/09622802241242317

Stat Methods Med Res. 2024 Apr 9:9622802241242317. doi: 10.1177/09622802241242317. Online ahead of print.

Authors

Huiping Xu¹, Xiaochun Li¹, Zuoyi Zhang², Shaun Grannis³

Affiliations

¹ Department of Biostatistics and Health Data Science, Indiana University, Indianapolis, IN, USA.
² AbbVie Inc., North Chicago, IL, USA.
³ Regenstrief Institute Inc., Indianapolis, IN, USA.

PMID: 38592341

Abstract

The Fellegi-Sunter model is a latent class model widely used in probabilistic record linkage to identify records that belong to the same entity. Record linkage practitioners typically include all available matching fields in the model, on the premise that more fields convey more information about the true match status and hence improve match performance. In the context of model-based clustering, it is well known that this premise is incorrect and that including noisy variables can compromise the clustering. Variable selection procedures have therefore been developed to remove noisy variables. Although these procedures have the potential to improve record matching, they cannot be applied directly because missing data are ubiquitous in record linkage applications. In this paper, we modify the stepwise variable selection procedure proposed by Fop, Smart, and Murphy and extend it to account for the missing data common in record linkage. In simulation studies, our proposed method selects the correct set of matching fields across various settings, leading to better-performing algorithms. The improved match performance is also seen in a real-world application. We therefore recommend the use of our proposed selection procedure to identify informative matching fields for probabilistic record linkage algorithms.
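
For readers unfamiliar with the Fellegi-Sunter model as a latent class model, the following is a minimal sketch rather than the authors' procedure. It fits a two-class model to binary field-agreement patterns by EM, assuming conditional independence of fields given match status and assuming that missing comparisons can be skipped in the likelihood (an ignorable-missingness assumption made here for illustration). All function and parameter names (`fit_fellegi_sunter`, `simulate_pairs`, `m`, `u`, `p`) are hypothetical, and the variable selection step described in the abstract is not implemented.

```python
import math
import random


def _clamp(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) so logs stay finite."""
    return min(max(x, eps), 1 - eps)


def fit_fellegi_sunter(patterns, n_iter=100, p=0.1, m_init=0.9, u_init=0.1):
    """EM for a two-class latent class model on agreement patterns.

    patterns: list of rows; each entry is 1 (agree), 0 (disagree),
              or None (comparison missing, skipped in the likelihood).
    Returns (p, m, u, weights): match prevalence, per-field agreement
    probabilities among matches and non-matches, and the posterior match
    probabilities from the final E-step.
    """
    n_fields = len(patterns[0])
    m = [m_init] * n_fields
    u = [u_init] * n_fields
    weights = []
    for _ in range(n_iter):
        # E-step: posterior probability that each record pair is a match.
        weights = []
        for row in patterns:
            log_match, log_non = math.log(p), math.log(1 - p)
            for k, a in enumerate(row):
                if a is None:  # assumption: missing fields are ignorable
                    continue
                log_match += math.log(m[k] if a else 1 - m[k])
                log_non += math.log(u[k] if a else 1 - u[k])
            diff = min(log_non - log_match, 700.0)  # avoid exp overflow
            weights.append(1.0 / (1.0 + math.exp(diff)))
        # M-step: weighted proportions over the observed comparisons only.
        p = _clamp(sum(weights) / len(weights))
        for k in range(n_fields):
            obs = [(w, row[k]) for w, row in zip(weights, patterns)
                   if row[k] is not None]
            w_match = sum(w for w, _ in obs)
            w_non = sum(1 - w for w, _ in obs)
            if w_match > 0:
                m[k] = _clamp(sum(w * a for w, a in obs) / w_match)
            if w_non > 0:
                u[k] = _clamp(sum((1 - w) * a for w, a in obs) / w_non)
    return p, m, u, weights


def simulate_pairs(n, p, m, u, miss_rate, seed=0):
    """Simulate agreement patterns with fields missing completely at random."""
    rng = random.Random(seed)
    patterns = []
    for _ in range(n):
        probs = m if rng.random() < p else u
        patterns.append([None if rng.random() < miss_rate
                         else int(rng.random() < q) for q in probs])
    return patterns


if __name__ == "__main__":
    data = simulate_pairs(2000, 0.2, [0.95, 0.9, 0.85, 0.9],
                          [0.05, 0.1, 0.15, 0.1], miss_rate=0.2)
    p_hat, m_hat, u_hat, _ = fit_fellegi_sunter(data)
    print("estimated match prevalence:", round(p_hat, 3))
    print("m:", [round(x, 3) for x in m_hat])
    print("u:", [round(x, 3) for x in u_hat])
```

In this sketch a field whose estimated `m` and `u` are nearly equal carries little information about match status, which is the kind of noisy matching field the abstract argues should be removed.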

Keywords: Fellegi–Sunter model; missing data; model-based clustering; patient matching; probabilistic record linkage.