Building a Relational Database for the Aedes albopictus Transcriptome and Application to Identification of microRNA Binding Sites

Bioinformatics Internship Presentation

Bridget Tripp (Mentor: Dr. Peter Armbruster, Department of Biology, Georgetown University)

August 26th, 2pm-4:30pm, Room 1300, Harris Building.

High-throughput DNA sequencing is revolutionizing the ability to interrogate genome-wide expression data and elucidate the molecular underpinnings of a wide range of biological processes. However, storing large volumes of data in a user-friendly format is a common challenge associated with high-throughput sequence data. These data are commonly stored in flat files and Excel spreadsheets, which limits querying capabilities and increases the risk of compromising data consistency. Relational database management systems address these challenges by providing a central repository in which disparate data elements can be collected and maintained while preserving referential integrity. The Aedes albopictus transcriptome was sequenced on a high-throughput Illumina platform and annotated using a reference Dipteran protein set and the Aedes aegypti genome sequence. The results were contained in multiple Excel spreadsheets and FASTA flat files, which limited the accessibility and practical research application of these data. A relational database, which ultimately contained data pertaining to 14,213 different gene models across six life stages and two ecological conditions, was designed to integrate these disparate documents so that future parsing, manipulation, and querying of these data are seamless and consistency is ensured.

Relating data for thousands of gene models across varying life stages presented a unique challenge. The primary goal of the Ae. albopictus transcriptome database is to query data by gene identification number (gene id), so the gene id became the common thread. Using this as a guiding principle, a list of the necessary data attributes was drafted. The attributes were then divided into five associated groups. Each group defined the attributes of one input table for the database, and the tables were linked to one another by gene id. Using the relational database management system SQLite and the SQLite Manager user interface, the database was constructed and the data tables were loaded. The database relationships and querying capabilities were then validated and proved successful.
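As a rough illustration of linking tables by gene id in SQLite, the sketch below builds a small in-memory database with Python's standard sqlite3 module. It is not the schema used in this project: the two tables, their column names, the gene ids, and the sample values are all hypothetical, and only two of the five table groups are imitated.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the tables used in the actual transcriptome database.
SCHEMA = """
CREATE TABLE gene (
    gene_id    TEXT PRIMARY KEY,
    annotation TEXT
);
CREATE TABLE expression (
    gene_id       TEXT NOT NULL REFERENCES gene(gene_id),
    life_stage    TEXT NOT NULL,
    eco_condition TEXT NOT NULL,
    read_count    INTEGER NOT NULL,
    PRIMARY KEY (gene_id, life_stage, eco_condition)
);
"""


def build_demo_db():
    """Create an in-memory database and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    with conn:  # commit both inserts together
        conn.executemany(
            "INSERT INTO gene VALUES (?, ?)",
            [("gene_0001", "hypothetical protein A"),
             ("gene_0002", "hypothetical protein B")],
        )
        conn.executemany(
            "INSERT INTO expression VALUES (?, ?, ?, ?)",
            [("gene_0001", "egg", "cond_1", 120),
             ("gene_0001", "larva", "cond_1", 45),
             ("gene_0002", "egg", "cond_1", 7)],
        )
    return conn


def expression_report(conn, gene_id):
    """Join the two tables on gene_id and return rows for one gene."""
    query = """
        SELECT g.annotation, e.life_stage, e.eco_condition, e.read_count
        FROM gene AS g
        JOIN expression AS e ON e.gene_id = g.gene_id
        WHERE g.gene_id = ?
        ORDER BY e.life_stage, e.eco_condition
    """
    return conn.execute(query, (gene_id,)).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for row in expression_report(demo, "gene_0001"):
        print(row)
```

With foreign keys enabled, an attempt to insert an expression row for a gene id that is absent from the gene table raises sqlite3.IntegrityError. This is one way the referential integrity described above can be enforced by the database itself.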

When massive quantities of data must be retrieved according to systematic query criteria, a relational database is an essential research tool. The querying capabilities of the transcriptome relational database were confirmed by its ability to generate reports that were instrumental in determining whether differentially expressed transcripts were enriched for miRNA binding sites. Through the development of a relational database for the Ae. albopictus transcriptome, future data parsing, manipulation, and retrieval have become easier and more efficient, all while assuring referential integrity.