Is NoSQL for Bioinformatics?
Shiyang Yuan (Mentor: Dr. Sheng-Chih Chen, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University)
August 26th, 2pm-4:30pm Room 1300, Harris Building.
Starting from a relational database management system (RDBMS) is virtually
an axiom of software development. However, NoSQL shows that relational
databases are not always the only answer. The main strengths of NoSQL are
ease of deployment and performance. Widely used NoSQL databases, such as
CouchDB and MongoDB, let users insert data without first defining a schema.
Under this schema-less approach the database acts as a document store rather
than a store of fixed tables, as in a traditional database. The goal of this
project is to identify the differences in implementation between relational
and NoSQL databases.
First, I downloaded the 2.5 GB Swiss-Prot database as an XML file from the
UniProt website. I then set up a data flow in SSIS (an ETL tool) to load the
XML data into Microsoft SQL Server (a relational database management system)
and tested create, read, update, and delete (CRUD) operations on the data.
After studying the XML file and its schema, I created an ER diagram in
Microsoft SQL Server to show the relationships among the tables. Because
CouchDB is a document-oriented NoSQL database, I used a Python program instead
of SSIS to convert the XML file to a JSON file, and then imported it directly
into CouchDB with curl (a command-line tool). Finally, I created some views in
CouchDB and verified that I could perform the same CRUD operations as I did
with SQL Server.
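
The XML-to-JSON conversion step can be sketched roughly as follows. This is
a minimal illustration, not the program used in the project: the sample XML is
a heavily simplified, made-up stand-in for a real Swiss-Prot entry, the
accession values are invented, and the helper names (local, entry_to_doc,
xml_to_docs) are hypothetical. It assumes the first accession of each entry
is used as the CouchDB document id.

```python
import io
import json
import xml.etree.ElementTree as ET

# Simplified, made-up sample; real Swiss-Prot entries carry many more fields.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot">
    <accession>ACC0001</accession>
    <accession>ACC0002</accession>
    <name>EXAMPLE_HUMAN</name>
    <sequence length="5">MKTAY</sequence>
  </entry>
  <entry dataset="Swiss-Prot">
    <accession>ACC0003</accession>
    <name>SAMPLE_MOUSE</name>
    <sequence length="4">MSLK</sequence>
  </entry>
</uniprot>"""


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def entry_to_doc(entry):
    """Turn one <entry> element into a JSON-ready dict."""
    accessions = [c.text for c in entry if local(c.tag) == "accession"]
    # Assumption: the first accession serves as the CouchDB document id.
    doc = {"_id": accessions[0], "accessions": accessions,
           "dataset": entry.get("dataset")}
    for child in entry:
        name = local(child.tag)
        if name == "name":
            doc["name"] = child.text
        elif name == "sequence":
            doc["sequence"] = child.text
    return doc


def xml_to_docs(source):
    """Stream entries with iterparse instead of loading the whole tree at once."""
    docs = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) == "entry":
            docs.append(entry_to_doc(elem))
            elem.clear()  # release the children of the entry just converted
    return docs


docs = xml_to_docs(io.BytesIO(SAMPLE_XML.encode("utf-8")))
payload = json.dumps({"docs": docs}, indent=2)  # body shape used by _bulk_docs
print(payload)
```

If the payload were saved as entries.json, an import with curl might look
like the command below, assuming a local CouchDB on its default port and a
database named swissprot (both assumptions; a secured server would also need
credentials). For a file of several gigabytes the documents would more
likely be written and posted in batches rather than as one payload.

    curl -X POST http://127.0.0.1:5984/swissprot/_bulk_docs \
         -H "Content-Type: application/json" -d @entries.json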
Unlike relational databases, CouchDB does not store data in tables with
uniformly sized fields for each record. Instead, each record is stored as a
document with its own characteristics. Any number of fields of any length
can be added to each document, and a field can contain multiple pieces of
data. Moreover, relational databases use long-established, standard SQL
queries to capture data and create views to retrieve it.
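
The contrast between the two models can be illustrated with a small sketch.
It rests on several assumptions: Python's built-in sqlite3 module stands in
for SQL Server, the table, column, and view names are hypothetical, the
records are invented, and keywords_view is only a Python emulation of the
map step of a CouchDB view (real CouchDB views are defined inside the
database itself).

```python
import sqlite3

# Relational side: columns are declared up front, and a multi-valued field
# (keywords) needs its own table. sqlite3 stands in for SQL Server here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE keyword (accession TEXT REFERENCES protein(accession), term TEXT);
    CREATE VIEW protein_keywords AS
        SELECT p.name, k.term FROM protein p JOIN keyword k USING (accession);
""")
conn.execute("INSERT INTO protein VALUES ('ACC0001', 'EXAMPLE_HUMAN')")
conn.executemany("INSERT INTO keyword VALUES (?, ?)",
                 [("ACC0001", "Kinase"), ("ACC0001", "Membrane")])
rows = conn.execute(
    "SELECT name, term FROM protein_keywords ORDER BY term").fetchall()

# Document side: each record is a self-describing dict. Fields may differ
# between documents, and one field can hold several values.
documents = [
    {"_id": "ACC0001", "name": "EXAMPLE_HUMAN", "keywords": ["Kinase", "Membrane"]},
    {"_id": "ACC0003", "name": "SAMPLE_MOUSE", "organism": "Mus musculus"},
]


def keywords_view(doc):
    """Emulated map step: yield one (key, value) pair per keyword in a document."""
    for term in doc.get("keywords", []):
        yield term, doc["name"]


# Collect and sort the emitted pairs, as a view index would be ordered by key.
index = sorted(pair for doc in documents for pair in keywords_view(doc))
print(rows)
print(index)
```

In the relational version, adding a new attribute such as organism would
require altering a table or adding another one, whereas in the document
version the second record simply carries the extra field.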