Developing a reproducible computational pipeline for detecting lncRNA based on RNA-seq data


Minghan Wu

Mentors: Dr. Matthew McCoy, Innovation Center for Biomedical Informatics (ICBI) and Department of Oncology, Georgetown University Medical Center; and Dr. Sreejith Nair, Oncology Academic Department, SOC, Georgetown University

Date/Time: May 3rd, 2022 at 12:00pm.

Abstract: Long noncoding RNAs (lncRNAs) carry out their biological roles as RNA molecules rather than serving as templates for protein synthesis. LncRNAs play an indispensable role in many biological processes, such as epigenetic regulation of gene expression, cell cycle regulation, and cell differentiation. Recent studies have also implicated specific lncRNAs in the etiology of a diverse array of cancers.

The objective of this internship project is to build a general, reproducible computational workflow that integrates Precision Run-On sequencing (PRO-seq) data and RNA-seq data to identify annotated and novel lncRNAs, with increased sensitivity and accuracy for low-abundance lncRNAs. The output of this pipeline can also be used for downstream estimation of differentially expressed coding and non-coding genes, for visualization, and as a foundation for in vivo experiments.

The pipeline performs raw-data quality control (QC), read alignment, transcriptome assembly, and filtering (based on transcript length, read coverage, predicted protein-coding potential, and evidence of nascent transcripts). After a comparative assessment of the efficiency and accuracy of the software tools available for each step, and calibration of their parameters, the workflow was validated and found to identify novel lncRNAs activated in response to various ligands (e.g., estrogen, tamoxifen) that bind to the estrogen receptor in breast cancer cells. LncRNA expression is compared across three contrasts (Estrogen vs. Control, Tamoxifen vs. Control, and Estrogen vs. Tamoxifen) to identify differentially expressed lncRNAs. The next step is to assess the cancer relevance of the differentially expressed lncRNAs by checking them against validated public databases and through wet lab experiments.
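The filtering stage described above can be illustrated with a minimal Python sketch. This is not the project's actual code: the `Transcript` record, the field names, and every threshold except the conventional 200-nucleotide minimum length for lncRNAs are hypothetical placeholders chosen for illustration. In a real run, the values would come from the assembler output, a coding-potential predictor, and PRO-seq read counts, and the cutoffs would be calibrated as the abstract describes.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """One assembled transcript with the four features used for filtering.

    All field names are hypothetical; they stand in for values parsed
    from assembler output, a coding-potential tool, and PRO-seq counts.
    """
    transcript_id: str
    length: int                # transcript length in nucleotides
    coverage: float            # mean RNA-seq read coverage
    coding_probability: float  # predicted protein-coding probability, 0..1
    proseq_reads: int          # PRO-seq reads overlapping the transcript


# Assumed cutoffs. Only MIN_LENGTH reflects the conventional lncRNA
# definition (at least 200 nt); the others are illustrative placeholders.
MIN_LENGTH = 200
MIN_COVERAGE = 1.0
MAX_CODING_PROBABILITY = 0.5
MIN_PROSEQ_READS = 5


def is_candidate_lncrna(t: Transcript) -> bool:
    """Return True if the transcript passes all four filters."""
    return (
        t.length >= MIN_LENGTH
        and t.coverage >= MIN_COVERAGE
        and t.coding_probability < MAX_CODING_PROBABILITY
        and t.proseq_reads >= MIN_PROSEQ_READS
    )


def filter_candidates(transcripts):
    """Keep only transcripts that pass every filter, preserving order."""
    return [t for t in transcripts if is_candidate_lncrna(t)]


# Toy data: only tx1 satisfies all four criteria.
toy = [
    Transcript("tx1", length=1500, coverage=4.2, coding_probability=0.05, proseq_reads=40),
    Transcript("tx2", length=150, coverage=9.0, coding_probability=0.02, proseq_reads=60),   # too short
    Transcript("tx3", length=2200, coverage=0.3, coding_probability=0.10, proseq_reads=12),  # low coverage
    Transcript("tx4", length=3000, coverage=7.5, coding_probability=0.95, proseq_reads=80),  # likely coding
    Transcript("tx5", length=900, coverage=2.1, coding_probability=0.08, proseq_reads=0),    # no nascent evidence
]

kept = filter_candidates(toy)
print([t.transcript_id for t in kept])
```

Requiring nascent-transcript support alongside steady-state coverage is the point where PRO-seq data enters the decision; a low-abundance transcript with modest RNA-seq coverage can still be retained if the coverage cutoff is relaxed and PRO-seq signal is present.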