Compositional Data Analysis using Kernels in mass cytometry data

Pratyaydipta Rudra; Ryan Baxter; Elena W Y Hsieh; Debashis Ghosh

doi:10.1093/bioadv/vbac003

Bioinform Adv. 2022 Feb 11;2(1):vbac003. doi: 10.1093/bioadv/vbac003. eCollection 2022.

Authors

Pratyaydipta Rudra¹, Ryan Baxter², Elena W Y Hsieh²,³, Debashis Ghosh⁴

Affiliations

¹ Department of Statistics, Oklahoma State University, Stillwater, OK 74078, USA.
² Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA.
³ Department of Pediatrics, Section of Allergy and Immunology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA.
⁴ Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA.

Abstract

Motivation: Cell-type abundance data arising from mass cytometry experiments are compositional in nature. Classical association tests do not apply to compositional data because such data are non-Euclidean. Existing methods for analyzing cell-type abundance data have several limitations when applied to high-dimensional mass cytometry data, especially when the sample size is small.

Results: We propose a new multivariate statistical learning methodology, Compositional Data Analysis using Kernels (CODAK), based on the kernel distance covariance (KDC) framework, to test the association of cell-type compositions with important predictors (categorical or continuous) such as disease status. CODAK scales well to high-dimensional data and performs satisfactorily for small sample sizes (n < 25). We conducted simulation studies to compare the method with existing methods for analyzing cell-type abundance data from mass cytometry studies. We also applied the method to a high-dimensional dataset containing several subgroups of subjects, including Systemic Lupus Erythematosus (SLE) patients and healthy controls.
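
To make the idea of a distance-covariance association test on compositions more concrete, the following is a minimal Python sketch, not the CODAK implementation (which is in R). It assumes that compositions are compared through the Aitchison distance (Euclidean distance after a centered log-ratio transform), that zero counts are handled with a pseudocount of 0.5, and that the p-value comes from permuting the predictor. These choices and all function names (`clr`, `pairwise_dist`, `dcov_stat`, `dcov_perm_test`) are illustrative assumptions and may differ from what the paper does.

```python
import numpy as np


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of a (samples x cell types) count matrix.

    The pseudocount is an assumed way of handling zero counts.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    props = x / x.sum(axis=1, keepdims=True)
    logp = np.log(props)
    return logp - logp.mean(axis=1, keepdims=True)


def pairwise_dist(z):
    """Euclidean distance matrix between the rows of z (1-D input is one column)."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def double_center(d):
    """Subtract row and column means and add back the grand mean."""
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def dcov_stat(dx, dy):
    """Squared sample distance covariance (V-statistic) from two distance matrices."""
    return (double_center(dx) * double_center(dy)).mean()


def dcov_perm_test(counts, predictor, n_perm=999, seed=0):
    """Permutation test of association between compositions and a predictor."""
    rng = np.random.default_rng(seed)
    dx = pairwise_dist(clr(counts))   # Aitchison distances between samples
    dy = pairwise_dist(predictor)     # distances between predictor values
    observed = dcov_stat(dx, dy)
    n = dx.shape[0]
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(n)      # shuffle which predictor goes with which sample
        if dcov_stat(dx, dy[np.ix_(idx, idx)]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Simulated data: 10 "controls" and 10 "cases" with different cell-type proportions.
    rng = np.random.default_rng(1)
    group = np.repeat([0, 1], 10)
    p_control = np.array([0.50, 0.30, 0.15, 0.05])
    p_case = np.array([0.20, 0.30, 0.30, 0.20])
    counts = np.vstack(
        [rng.multinomial(1000, p_case if g else p_control) for g in group]
    )
    stat, p_value = dcov_perm_test(counts, group, n_perm=199)
    print(f"dCov^2 = {stat:.4f}, permutation p-value = {p_value:.3f}")
```

Because the test only needs a distance matrix on each side, the same sketch accepts a binary group label or a continuous predictor without modification, which mirrors the categorical-or-continuous setting described above.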

Availability and implementation: CODAK is implemented in R. The code and data used in this manuscript are available at http://github.com/GhoshLab/CODAK/.

Contact: prudra@okstate.edu.

Supplementary information: Supplementary data are available at Bioinformatics Advances online.

Grants and funding

K23 AR070897/AR/NIAMS NIH HHS/United States