Internship Presentations

Cell-specific SNV discovery using barcode-stratified scRNA-Seq data

Ni Gao

Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University.

Date/Time: May 3rd, 2022 at 12:00pm.

Abstract: Currently, the detection of single nucleotide variants (SNVs) from single-cell RNA sequencing (scRNA-Seq) data is usually performed on the sequencing reads pooled across all cells in a sample. However, since post-zygotically occurring SNVs exist in only a small proportion of cells, the variant allele frequency (VAF) of these mutations is usually so low in pooled scRNA-Seq data that it does not reach the VAF threshold required by SNV discovery software designed for genomic NGS data. As a result, discovering post-zygotic SNVs, such as somatic mutations, in pooled scRNA-Seq data is challenging. In addition, traditional SNV discovery software, such as GATK and Strelka, does not account for cellular barcodes, so it cannot provide cell-specific detection of SNVs. Therefore, a new workflow has been developed that detects SNVs in scRNA-Seq data by stratifying the data by cellular barcode. Barcode-stratified analysis of scRNA-Seq data using SNV discovery software identifies many more SNVs than the same tools applied to pooled scRNA-Seq or bulk RNA-Seq alignments. Unlike the analysis of pooled scRNA-Seq data, barcode-stratified analysis enables cell-specific observations of expressed SNVs across cells and cell types. The barcode-stratified analysis is enriched for low (cellular) frequency variants, representing cellular heterogeneity and potential somatic mutations, which are typically under-reported in pooled scRNA-Seq analyses. Furthermore, cell-specific SNV observations can be associated with cell type to provide a more comprehensive understanding of single-cell heterogeneity.
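The dilution effect described above can be illustrated with a small worked example. This is a minimal sketch on made-up data, not the workflow itself: the function names, the toy read list, and the 0.05 VAF cutoff are all assumptions chosen for illustration, and real SNV callers apply considerably more elaborate criteria than a single VAF threshold.

```python
from collections import defaultdict


def pooled_vaf(reads):
    """VAF at one locus computed over all reads, ignoring cell barcodes."""
    alt = sum(1 for _, allele in reads if allele == "alt")
    return alt / len(reads) if reads else 0.0


def per_cell_vaf(reads):
    """VAF at one locus computed separately for each cell barcode."""
    counts = defaultdict(lambda: [0, 0])  # barcode -> [alt reads, total reads]
    for barcode, allele in reads:
        counts[barcode][1] += 1
        if allele == "alt":
            counts[barcode][0] += 1
    return {bc: alt / total for bc, (alt, total) in counts.items()}


# Toy data (hypothetical): 100 cells with 10 reads each at the locus;
# only the first two cells carry the variant, in 5 of their 10 reads.
reads = []
for i in range(100):
    barcode = f"CELL{i:03d}"
    n_alt = 5 if i < 2 else 0
    reads += [(barcode, "alt")] * n_alt + [(barcode, "ref")] * (10 - n_alt)

MIN_VAF = 0.05  # hypothetical cutoff, not taken from any specific tool

pooled = pooled_vaf(reads)
cells = per_cell_vaf(reads)
passing = sorted(bc for bc, vaf in cells.items() if vaf >= MIN_VAF)
print(pooled, passing)
```

In this toy case the pooled VAF is 10/1000 = 0.01, below the assumed cutoff, while the two variant-carrying cells each show a VAF of 0.5 once the reads are stratified by barcode.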

Unfortunately, the barcode-stratified SNV discovery analysis of scRNA-Seq data is time-consuming. We explore techniques to speed up the analysis and further study the characteristics of the observed SNVs to improve our understanding of single-cell expressed SNVs. We developed an algorithm to focus SNV discovery tools on regions with sufficient read depth for variants to be observed. Coupled with SCExecute, a tool for barcode-stratified analysis of scRNA-Seq data, the algorithm substantially reduces the work carried out by SNV discovery tools applied to the relatively sparse barcode-stratified aligned reads. In the second part of the project, we analyzed the pooled scRNA-Seq data to study SNVs observed only by the barcode-stratified analysis strategy. We used SCReadCounts, a tool for estimating cell-level SNV expression from scRNA-Seq data, to count cell-specific reads with the reference and variant alleles of each SNV in an unbiased manner. Unlike SNV discovery tools, SCReadCounts output provides read counts for monoallelic reference transcripts, distinguishing loci with abundant transcripts but low (cellular) frequency variant alleles from low-abundance transcripts observed in few cells. We calculated and compared allele expression for SNVs from both analysis strategies. In addition, we applied the SingleR tool to assign cell types to cell barcodes based on gene expression values computed by STARsolo.
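The abstract does not spell out how the region-focusing algorithm works, so the following is only one plausible sketch of the general idea: scan per-position read depth for a single cell's alignments and keep the intervals where depth meets a minimum, so that an SNV caller could be restricted to those intervals. The function name, the `min_depth` and `max_gap` parameters, and the toy depth vector are assumptions made for illustration.

```python
def covered_regions(depths, min_depth, max_gap=0):
    """Return half-open (start, end) intervals where depth >= min_depth.

    depths  : per-position read depth along one contig (0-based positions)
    max_gap : merge neighbouring intervals separated by at most this many
              low-depth positions (hypothetical tuning parameter)
    """
    regions = []
    start = None
    for pos, depth in enumerate(depths):
        if depth >= min_depth:
            if start is None:
                start = pos          # open a new interval
        elif start is not None:
            regions.append((start, pos))  # close the current interval
            start = None
    if start is not None:
        regions.append((start, len(depths)))

    # Merge intervals separated by short low-depth gaps, which keeps the
    # interval list short at the cost of scanning a few extra positions.
    merged = []
    for s, e in regions:
        if merged and s - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


# Toy depth vector for one barcode (hypothetical values).
depths = [0, 0, 5, 6, 7, 0, 8, 9, 0, 0, 0, 3]
strict = covered_regions(depths, min_depth=5)
relaxed = covered_regions(depths, min_depth=5, max_gap=1)
print(strict, relaxed)
```

Because barcode-stratified alignments are sparse, most positions fall below any reasonable depth cutoff, which is why restricting the caller to intervals like these can remove much of the work.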

We found that the algorithm to focus SNV discovery tools on regions with sufficient depth was extremely effective in speeding up the analysis, with little to no change in the SNVs observed. We also found, as expected, that the barcode-stratified scRNA-Seq analysis yielded more SNVs present in only a small subset of cells than the pooled scRNA-Seq analysis, as well as more novel SNVs missing from dbSNP. Surprisingly, we found the majority of SNV loci expressed as monoallelic reference or monoallelic variant, with relatively little biallelic expression, suggesting substantial imprinting. Finally, we explored the relationship between cell types and the cell-specific observation of transcripts with expressed variants, finding a significant improvement in expressed SNV discovery in minority cell types compared with the pooled analysis strategy.
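The monoallelic versus biallelic distinction reported above can be made concrete with a small sketch that classifies each cell at one SNV locus from its reference and variant read counts. This is an illustration under stated assumptions, not the analysis actually performed: the category labels, the `min_reads` cutoff, and the toy counts are hypothetical, and the input is a plain dictionary rather than any particular tool's output format. A minimum read count is used because a cell with very few reads can appear monoallelic by chance.

```python
from collections import Counter


def classify_cell(ref_count, alt_count, min_reads=3):
    """Label one cell's expression at one SNV locus from its allele counts."""
    total = ref_count + alt_count
    if total < min_reads:
        return "insufficient"            # too few reads to call either way
    if alt_count == 0:
        return "monoallelic_reference"
    if ref_count == 0:
        return "monoallelic_variant"
    return "biallelic"


def summarize_locus(cell_counts, min_reads=3):
    """Count cells per category; cell_counts maps barcode -> (ref, alt)."""
    return Counter(
        classify_cell(ref, alt, min_reads) for ref, alt in cell_counts.values()
    )


# Toy counts for one locus (hypothetical barcodes and values).
toy = {
    "CELL_A": (12, 0),
    "CELL_B": (0, 7),
    "CELL_C": (4, 5),
    "CELL_D": (1, 0),
}
summary = summarize_locus(toy)
print(dict(summary))
```

Here each of the four categories receives exactly one cell; raising `min_reads` moves low-coverage cells into the "insufficient" category rather than counting them as monoallelic.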

Spring 2022