Quantifying the seed sensitivity of cancer subclonal reconstruction algorithms

bioRxiv [Preprint]. 2024 Feb 8:2024.02.05.579021. doi: 10.1101/2024.02.05.579021.

Abstract

Background: Intra-tumoural heterogeneity complicates cancer prognosis and impairs treatment success. One way subclonal reconstruction (SRC) quantifies intra-tumoural heterogeneity is by estimating the number of subclones present in bulk DNA sequencing data. SRC algorithms are probabilistic and must be initialized with a random seed. However, the seeds used in bioinformatics algorithms are rarely reported in the literature, and the impact of the initializing seed on SRC solutions has not been studied. To address this gap, we generated a set of ten random seeds to systematically benchmark the seed sensitivity of three probabilistic SRC algorithms: PyClone-VI, DPClust, and PhyloWGS.
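
As an illustration of what generating and reporting a seed set can look like, the following minimal Python sketch draws ten distinct seeds from a recorded master seed and prints them in a form that could be included in a methods section. The function name `generate_seeds`, the master seed value, and the seed range are assumptions made for this example; the abstract does not state how the study's seeds were generated.

```python
import json
import random


def generate_seeds(n_seeds=10, master_seed=2024, upper=2**31 - 1):
    """Draw n_seeds distinct integer seeds from a recorded master seed.

    All default values are illustrative assumptions, not the study's settings.
    """
    rng = random.Random(master_seed)  # local generator; global state is untouched
    return rng.sample(range(1, upper), n_seeds)  # sampling without replacement


seeds = generate_seeds()

# Record the master seed and the derived seeds so that a run can be repeated.
print(json.dumps({"master_seed": 2024, "seeds": seeds}))
```

Each seed in the list would then be passed to the SRC algorithm through whatever seed option that tool exposes.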

Results: We characterized the seed sensitivity of the three algorithms across fourteen whole-genome sequences of head and neck squamous cell carcinoma and nine SRC pipelines, each composed of a single nucleotide variant caller, a copy number aberration caller and an SRC algorithm. This yielded a total of 1470 subclonal reconstructions, including 1260 single-region and 210 multi-region reconstructions. The number of subclones estimated per patient varies across SRC pipelines, and all three SRC algorithms show substantial seed sensitivity: subclone estimates vary across seeds for the same input and the same SRC algorithm. No seed consistently estimated the modal number of subclones across all patients for any SRC algorithm.
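
The kind of comparison described above can be sketched in a few lines of Python. This is not the authors' analysis code: the data layout (`{patient: {seed: number of subclones}}` for a single pipeline), the function names, and the toy values are all assumptions chosen for illustration. The sketch computes the modal subclone count per patient and then checks whether any seed reproduces the mode for every patient.

```python
from collections import Counter


def modal_count(estimates):
    """Most frequent value; ties are broken by taking the smallest value."""
    counts = Counter(estimates)
    top = max(counts.values())
    return min(value for value, n in counts.items() if n == top)


def seed_sensitivity(results):
    """Summarize per-patient variation across seeds for one pipeline."""
    summary = {}
    for patient, by_seed in results.items():
        values = list(by_seed.values())
        mode = modal_count(values)
        summary[patient] = {
            "mode": mode,
            "n_distinct": len(set(values)),  # more than 1 means seed-sensitive
            "seeds_at_mode": {s for s, n in by_seed.items() if n == mode},
        }
    return summary


def seeds_always_modal(summary):
    """Seeds whose estimate equals the mode for every patient."""
    sets = [info["seeds_at_mode"] for info in summary.values()]
    return set.intersection(*sets) if sets else set()


# Toy values invented for illustration only; they are not study data.
toy = {
    "patient_A": {1: 2, 2: 2, 3: 3},
    "patient_B": {1: 4, 2: 1, 3: 1},
    "patient_C": {1: 5, 2: 3, 3: 5},
}
summary = seed_sensitivity(toy)
always_modal = seeds_always_modal(summary)
```

In this toy input every patient has two distinct estimates across three seeds, and `always_modal` is empty because no single seed matches the mode for all three patients.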

Conclusions: These findings highlight the variability in quantifying intra-tumoural heterogeneity introduced by the seed sensitivity of probabilistic SRC algorithms. We recommend that authors, reviewers and editors adopt guidelines to both report and randomize seed choices. It may also be valuable to consider seed sensitivity when benchmarking newly developed SRC algorithms. These findings may be of interest in other areas of bioinformatics where seeded probabilistic algorithms are used, and they suggest that formal seed reporting standards should be considered to enhance reproducibility.

Publication types

  • Preprint