Bioinformatics Internship Presentation

Nawaf Alomran (Mentor: Dr. Anelia Horvath, Department of Biochemistry and Molecular Medicine, George Washington University)

September 1st, 2015, 3:40-4:00pm, Room 1300, Harris Building

With the advancement of next-generation sequencing (NGS) technologies, integrating sequencing data from matching genome/exome and transcriptome has become increasingly feasible. Such integration allows comparative studies of genetic variants at the encoded (DNA) and expressed (RNA) levels. It therefore provides a means of identifying single nucleotide variants (SNVs) possibly implicated in regulatory processes such as RNA editing, allele-specific expression and loss, somatic mutagenesis, and loss of heterozygosity. SNVs are considered the most common type of genetic variation, and many of them have been directly associated with a variety of diseases, including cancer.

Identifying SNVs implicated in the events listed above from NGS data requires a comparative, quantitative assessment of the distribution of variant and reference alleles at the DNA and RNA levels. Here, we present RNA2DNAlign, a robust probabilistic framework, implemented in the Python programming language, that assesses experimentally derived NGS data for quantitative variant imbalances indicative of regulatory events. RNA2DNAlign computes the confidence of each observation using a binomial test for differential distribution of variant and reference alleles between the corresponding datasets from one individual.
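As a rough illustration of the kind of binomial test described above, the following minimal sketch compares the variant read count in RNA against the variant fraction observed in the matching DNA. It is not the RNA2DNAlign implementation: the function names, the choice of a two-sided exact test, the use of the DNA variant fraction as the null expectation, and the `eps` floor are all assumptions made for this example.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_test_two_sided(k, n, p):
    """Exact two-sided binomial test.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one under the null success rate p.
    """
    observed = binom_pmf(k, n, p)
    # Relative tolerance guards against floating-point ties.
    cutoff = observed * (1 + 1e-7)
    total = sum(q for q in (binom_pmf(i, n, p) for i in range(n + 1)) if q <= cutoff)
    return min(1.0, total)


def allelic_imbalance_pvalue(dna_ref, dna_var, rna_ref, rna_var, eps=0.01):
    """P-value for RNA variant reads deviating from the DNA variant fraction.

    eps is a hypothetical floor/ceiling on the expected fraction so that a
    homozygous DNA site (fraction 0 or 1) does not yield a degenerate null.
    """
    dna_total = dna_ref + dna_var
    rna_total = rna_ref + rna_var
    if dna_total == 0 or rna_total == 0:
        raise ValueError("both datasets need at least one read at the locus")
    expected = min(max(dna_var / dna_total, eps), 1 - eps)
    return binom_test_two_sided(rna_var, rna_total, expected)


# Illustrative read counts only (not taken from the study).
balanced = allelic_imbalance_pvalue(dna_ref=20, dna_var=20, rna_ref=20, rna_var=20)
skewed = allelic_imbalance_pvalue(dna_ref=20, dna_var=20, rna_ref=2, rna_var=38)
print(f"balanced locus p = {balanced:.3g}")
print(f"skewed locus   p = {skewed:.3g}")
```

In this sketch, a heterozygous DNA site whose RNA reads are split evenly gives a p-value near 1, while a site whose RNA reads are almost all variant gives a very small p-value, which is the sort of imbalance that could point to allele-specific expression.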

RNA2DNAlign comprises three core modules. First, variants are called with the mpileup utility of SAMtools and filtered to retain only exon-positioned SNVs, so that RNA-to-DNA comparisons are meaningful. Second, quality re-assessment, accompanied by local realignment of the reads, is performed on the alignment files (Binary Alignment/Map, .bam files) to refine the reference and variant read counts at each SNV locus and to generate p-values based on the binomial test. The computed p-values are then logarithmically transformed (log10) and corrected for multiple testing (FDR, False Discovery Rate). The third module tests whether the variant and reference read distribution at each locus classifies the SNV as associated with any of the events described above. For RNA editing and allele-specific expression/loss, events are categorized as tumor-specific when they are confined to the tumor sample. RNA2DNAlign generates eight outcomes (including the tumor-specific categories) listing the SNVs implicated in the above events, with their confidence scores.
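The multiple-testing step can be sketched as follows. The abstract does not say which FDR procedure RNA2DNAlign uses, so the Benjamini-Hochberg procedure here is an assumption, as are the function names and the order of operations (adjust first, then take the negative log10 as a score).

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def neg_log10(p, floor=1e-300):
    """Negative log10 of a p-value; floor avoids log(0). Hypothetical helper."""
    return -math.log10(max(p, floor))


# Illustrative p-values only (not taken from the study).
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
scores = [neg_log10(q) for q in adjusted]
for p, q, s in zip(raw, adjusted, scores):
    print(f"p = {p:.3f}  adjusted = {q:.3f}  -log10 = {s:.2f}")
```

Larger scores correspond to stronger evidence after correction; any cutoff applied to them would be a separate analysis choice that this sketch does not make.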

Using RNA2DNAlign, a total of 360 germline and tumor exomes and transcriptomes from 90 breast cancer patients were analyzed. All SNVs were then annotated with the SeattleSeq Variation tool to associate them with their functional and biological attributes. Additionally, the RNA-editing variants were intersected with the DARNED database to distinguish previously reported from novel RNA-editing variants. Similarly, to annotate novel and reported somatic mutations, the data were compared against the COSMIC database.
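Separating previously reported from novel variants amounts to a set intersection on variant coordinates. The sketch below assumes each variant is keyed by a (chromosome, position, reference allele, variant allele) tuple; that key, the function name, and the example coordinates are placeholders for illustration and do not reflect the actual DARNED or COSMIC file formats.

```python
def split_known_novel(variants, known_sites):
    """Partition variants into those present in known_sites and those absent."""
    known_lookup = set(known_sites)
    known = [v for v in variants if v in known_lookup]
    novel = [v for v in variants if v not in known_lookup]
    return known, novel


# Made-up coordinates for illustration only.
called = [
    ("chr1", 1000, "A", "G"),
    ("chr2", 2500, "C", "T"),
    ("chr7", 4200, "A", "G"),
]
database = {
    ("chr1", 1000, "A", "G"),
    ("chr9", 9900, "T", "C"),
}

known, novel = split_known_novel(called, database)
print("previously reported:", known)
print("novel:", novel)
```

Matching on the full tuple, rather than on position alone, keeps a site with a different variant allele from being counted as previously reported.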

In summary, we present RNA2DNAlign, a novel, robust, and efficient tool for the identification of SNVs implicated in regulatory processes. Using RNA2DNAlign, we identified numerous previously reported and novel functional SNVs.