Developing a reproducible, adaptable data analysis pipeline for Nanopore RNA-seq data with Snakemake, a Python-based workflow engine

Vincent Miller

Mentor: Dr. Vishal Koparde, Frederick National Laboratory for Cancer Research.

Date/Time: August 24, 2021 at 3:20pm

Abstract: Over the past decade, so-called ‘third-generation’ sequencers have emerged that can produce substantially longer reads than previous sequencing technologies. Compared with traditional short-read sequencers such as Illumina, longer reads promise to reduce the computational challenges that surround genome assembly and transcript reconstruction. Third-generation technologies such as those marketed by Oxford Nanopore Technologies (ONT) allow for the direct sequencing of gene isoforms, as well as detection of epigenetic markers such as DNA methylation. Such platforms also offer potential improvements in portability and sequencing speed; the recently commercialized MinION sequencer from ONT is only slightly larger than a USB flash drive and can sequence anywhere, processing about 500 bases per second per pore. However, third-generation sequencing technologies like those developed by ONT are not without their limitations. They face an important challenge in accurately identifying nucleotide bases and experience higher error rates than next-generation sequencing (NGS) technologies like Illumina. This is primarily due to the general instability of the molecular machinery involved. Additionally, since the sequencing process occurs rapidly, the signals given off by individual nucleotides can be blurred by nearby bases. These drawbacks present novel computational challenges for deciphering base signals and inferring the underlying sequence. To this end, a set of new tools is being actively developed to perform analyses on the output of third-generation sequencers and to meet the unique computational challenges this data poses. Further, some programs traditionally used for short-read data have been extended with options better suited to long-read input.

My primary objective for this internship was to explore the tools currently available for processing and analyzing third-generation RNA-seq data, and to develop a reproducible, scalable, and adaptable data processing pipeline using the Python-based workflow management system Snakemake. Once developed, the pipeline could be executed on a high-performance computing (HPC) cluster or in other compute environments. Overall, my project was to develop and fine-tune an automated pipeline for processing long-read RNA-seq data by iterating through the stages of the pipeline development cycle: exploring tools, building the workflow, benchmarking, and validating.

The resulting workflow performs raw read QC, read filtering, mapping, transcript assembly and quantification, and downstream analysis of long-read data, including differential expression and visualization of results. After benchmarking individual steps of the pipeline and comparing candidate tools for sensitivity and accuracy on test data, the pipeline was validated with a full-sized data set obtained from the Singapore Nanopore Expression Consortium on the NIH HPC cluster, Biowulf. This shows that the pipeline runs successfully in a compute environment like Biowulf, has resource allocations reasonable for real-world datasets, and is highly adaptable to various long-read RNA-seq data analyses.
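
As a rough illustration of how a workflow engine such as Snakemake decides what to run, the stages named above can be modeled as a dependency graph and resolved backwards from a requested target. The sketch below is plain Python, not the pipeline's actual Snakefile; the stage identifiers and the strictly linear dependencies between them are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stage graph (not taken from the actual pipeline):
# each key is a stage, each value is the set of stages it depends on.
STAGES = {
    "raw_read_qc": set(),
    "read_filtering": {"raw_read_qc"},
    "mapping": {"read_filtering"},
    "transcript_assembly_quantification": {"mapping"},
    "differential_expression": {"transcript_assembly_quantification"},
    "visualization": {"differential_expression"},
}


def execution_order(stages):
    """Return all stages in an order that respects every dependency."""
    return list(TopologicalSorter(stages).static_order())


def stages_needed(target, stages):
    """Return only the stages required to produce `target`, in runnable order."""
    needed = set()
    stack = [target]
    while stack:
        stage = stack.pop()
        if stage not in needed:
            needed.add(stage)
            stack.extend(stages[stage])  # walk back through upstream stages
    return [s for s in execution_order(stages) if s in needed]


if __name__ == "__main__":
    # Requesting an intermediate target runs only the upstream stages.
    for stage in stages_needed("mapping", STAGES):
        print(stage)
```

Requesting an intermediate target such as `mapping` selects only the stages upstream of it, which is the same target-driven behavior that makes it practical to rerun or swap individual steps of a pipeline.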

Tagged: Summer 2021