False Positive Detection for scRNA-Seq Variant Calling Tools
Thomas Pepas
Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University.
Date/Time: August 22nd, 2024 at 2:00pm.
Abstract: Single-cell RNA-Seq is a rich source of putative cell-specific transcript variants. Single-cell expressed single nucleotide variants (sce-SNVs) manifest as non-reference bases in the scRNA-Seq reads, and novel loci with cell-specific variation can be discovered by searching for these mismatched bases in the alignments of multiple cells’ reads. However, sequencing errors, alignment artifacts, and transcript homology can also produce read alignments with one or more mismatched bases; these artifacts yield false-positive sce-SNV calls that do not represent cell-specific variants. Widely used variant analysis tools, such as Strelka2 from Illumina and GATK’s HaplotypeCaller, are designed for bulk genomics NGS data. They apply a variety of strategies to detect and eliminate false-positive variants but do not account for the properties of single-cell, transcript-based reads, which leads both to overzealous removal of genuine sce-SNVs and to undetected false positives.
We first reviewed the false-positive detection techniques described in the Strelka2 and HaplotypeCaller manuscripts. Guided by these, we defined and extracted reference-sequence and alignment features from scRNA-Seq reads at putative sce-SNV loci. In particular, we employed nucleotide bit scores to describe the complexity of the sequence near each locus, counted the number of nearby loci with high proportions of mismatched bases, and filtered reads based on mapping quality and on the distance from each variant locus to the ends of genomic alignments. Working with single-cell RNA-Seq data generated with the 10x Genomics Chromium v2 protocol (NCBI BioProject PRJNA662503), we applied an in-house scRNA-Seq variant calling tool to propose sce-SNV loci on chromosome 22 for one of the project’s samples (SAMN16086830) and asked expert curators from the Horvath Lab at GWU to annotate false positives. With the 145 curated loci and their extracted features serving as labeled data for supervised machine learning, we trained various classifiers, evaluated them with stratified 20-fold cross-validation, and assessed the utility of each feature by inspecting erroneous classifications. The highest-performing model, a random forest, was exported for use in a script, called scQCLoci, for integration with the in-house scRNA-Seq variant calling tools.
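The three kinds of features named above can be illustrated with a minimal Python sketch. It is not the scQCLoci implementation: the "bit score" is read here as Shannon entropy of base composition (one plausible interpretation), reads are simplified to a single ungapped alignment block, and every function name, window size, and threshold is a hypothetical placeholder.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Dict


def sequence_entropy_bits(seq: str) -> float:
    """Shannon entropy (bits per base) of the base composition of seq.

    Assumption: stands in for the abstract's 'nucleotide bit score';
    low values indicate low-complexity sequence such as homopolymers.
    """
    counts = Counter(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in counts.values())


@dataclass
class Read:
    """Simplified read: one ungapped block, no splicing (assumption)."""
    ref_start: int  # 0-based first aligned reference position
    ref_end: int    # 0-based, exclusive end of the alignment
    mapq: int       # mapping quality


def keep_read(read: Read, locus: int,
              min_mapq: int = 20, min_end_dist: int = 5) -> bool:
    """Keep a read only if it maps confidently and the locus is not
    close to either end of its alignment. Thresholds are hypothetical."""
    if read.mapq < min_mapq:
        return False
    if not (read.ref_start <= locus < read.ref_end):
        return False
    dist_to_start = locus - read.ref_start
    dist_to_end = read.ref_end - 1 - locus
    return min(dist_to_start, dist_to_end) >= min_end_dist


def nearby_mismatch_loci(mismatch_fraction: Dict[int, float], locus: int,
                         window: int = 50, min_fraction: float = 0.2) -> int:
    """Count other positions within `window` bases of `locus` whose
    fraction of mismatched bases is at least `min_fraction`."""
    return sum(
        1
        for pos, frac in mismatch_fraction.items()
        if pos != locus and abs(pos - locus) <= window and frac >= min_fraction
    )
```

In this sketch, each candidate locus would yield one feature vector (entropy of the flanking reference sequence, count of nearby high-mismatch loci, and statistics over the reads that pass `keep_read`), which could then be passed to a classifier.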
To further validate the model, sce-SNV loci called from the scRNA-Seq project’s other sample (SAMN16086829) will be evaluated independently by scQCLoci and by the expert curators at the Horvath Lab, and the resulting classifications of each locus will be compared. Successful filtering of false positives from called sce-SNV loci by scQCLoci will enable the use of fast, in-house sce-SNV calling tools, significantly increase the sce-SNV analysis bandwidth of the Horvath Lab, and expand the coverage of the Single-Cell Expressed SNV Catalog published by the Edwards and Horvath labs.