Curation Workflow Development for SkateBase: A Resource for Little Skate Genome Annotation

Bioinformatics Internship Presentation

Sara Robinson (Mentor: Dr. Karen Ross, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University)

May 12th, 2:00pm, Room 1202, Harris Building

The Little Skate Genome Project, an ongoing effort of the North East Bioinformatics Collaborative (NEBC) of the North East Cyberinfrastructure Consortium (NECC), was initiated to accelerate studies of comparative physiology, based largely on the relevance of the little skate (Leucoraja erinacea) as a biomedical model. The scientific community cannot derive value from large sequencing efforts unless the information is easily accessible, shareable, and translated into knowledge. SkateBase (http://skatebase.org) has been established as a public portal for dissemination of little skate genome curation. In the 2014 research article introducing the SkateBase elasmobranch genome project (Wyffels et al.), the authors emphasized the importance of data sharing and measured its value by the number of derived publications. In 2014, 19 publications in refereed journals cited data downloaded from the SkateBase public web portal. That number continues to increase (29 as of May 2017) and, because molecular data are the foundation of genetic investigations and experimental reagent development, the information hosted on SkateBase must be current. Considerable progress has been made on skate gene annotation through student projects in the bioinformatics programs at Georgetown and the University of Delaware; however, most of this information has not been added to SkateBase. Getting the information into SkateBase is challenging because the curation workflow is outdated and does not take advantage of automation.

My project was designed to address the problem of data availability and access by expanding the amount of data available on the SkateBase Gene Table. A collection of gene curation reports completed by University of Delaware students (2012-2015) and Georgetown University students (2013-2015) served as the source of the new data. The annotation evidence from each student report was independently verified using NCBI databases, UniProtKB, protein family and domain databases such as Pfam and SMART, and literature databases such as PubMed. Procedures for generating other data for SkateBase also needed to be updated. For example, the Integrative Genomics Viewer (IGV) tool was used to create gene cartoons from a General Feature Format (GFF) file describing genomic features, and Sequin, a stand-alone software tool developed by the NCBI for submitting and updating entries in the GenBank sequence database, was used to create customized GenBank records for each gene.
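To illustrate the kind of information a GFF file carries for a gene cartoon, the following is a minimal Python sketch that reads GFF3-style lines (nine tab-separated columns) and collects exon coordinates. It is not part of the project's actual tooling: the sample records, the identifier gene1, and the function names are hypothetical, and matching exons directly on a Parent attribute is a simplification.

```python
from collections import namedtuple

# The nine standard GFF3 columns.
Feature = namedtuple(
    "Feature", "seqid source type start end score strand phase attributes"
)

# Hypothetical sample records, for illustration only.
SAMPLE_GFF = (
    "##gff-version 3\n"
    "contig_1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1\n"
    "contig_1\texample\texon\t500\t900\t.\t+\t.\tParent=gene1\n"
    "contig_1\texample\texon\t100\t300\t.\t+\t.\tParent=gene1\n"
)


def parse_gff_line(line):
    """Parse one feature line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    # Column 9 holds semicolon-separated key=value attributes.
    attrs = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                   cols[5], cols[6], cols[7], attrs)


def exon_spans(gff_text, parent_id):
    """Return sorted (start, end) pairs for exons whose Parent is parent_id."""
    spans = []
    for line in gff_text.splitlines():
        feature = parse_gff_line(line)
        if feature is None or feature.type != "exon":
            continue
        if feature.attributes.get("Parent") == parent_id:
            spans.append((feature.start, feature.end))
    return sorted(spans)
```

Calling exon_spans(SAMPLE_GFF, "gene1") returns the exon intervals in genomic order, which is the information a viewer such as IGV draws as boxes along the gene.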

The data were collected and organized in an Excel spreadsheet. Previously, the data were formatted manually to SkateBase standards for upload to the Gene Table, which was time-consuming and error-prone. A custom Python script was written to handle the data more efficiently and facilitate the transition from spreadsheet to public portal.
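The script itself is not reproduced in this abstract. As a rough sketch of what automating the formatting step can look like, the Python example below reads spreadsheet data exported as CSV, normalizes whitespace, flags rows with missing values, and emits tab-delimited lines. The column names, the tab-delimited output, and the function name are assumptions made for illustration; they are not the actual SkateBase Gene Table specification.

```python
import csv
import io

# Hypothetical column names; the real Gene Table fields are not given here.
REQUIRED_COLUMNS = ["gene_symbol", "gene_name", "contig", "evidence"]


def format_gene_table(csv_text):
    """Convert CSV text to tab-delimited rows.

    Returns (rows, problems): rows are formatted strings, and problems is a
    list of (line_number, empty_columns) for rows that were skipped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    rows, problems = [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, record in enumerate(reader, start=2):
        # Collapse runs of whitespace and trim each cell.
        cleaned = {
            c: " ".join((record[c] or "").split()) for c in REQUIRED_COLUMNS
        }
        empty = [c for c in REQUIRED_COLUMNS if not cleaned[c]]
        if empty:
            problems.append((line_no, empty))
            continue
        rows.append("\t".join(cleaned[c] for c in REQUIRED_COLUMNS))
    return rows, problems
```

Returning the skipped rows alongside the formatted ones lets a curator correct the spreadsheet rather than silently uploading incomplete records, which addresses the error-prone nature of manual formatting.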

Finally, documentation with curation guidelines is being written to make it easier for other curators to continue this work while maintaining data collection and formatting standards. Indeed, a significant barrier to practical data sharing is the variable persistence of data and software tools. Available resources must be current, and curation workflows must be robust and well documented and must exploit automation as much as possible.

Wyffels J, King BL, Vincent J, Chen C, Wu CH, Polson SW. SkateBase, an elasmobranch genome project and collection of molecular resources for chondrichthyan fishes. F1000Res. 2014 Aug 12;3:191. doi: 10.12688/f1000research.4996.1. eCollection 2014. PubMed PMID: 25309735; PubMed Central PMCID: PMC4184313.