Detection of Annotation Errors: Sequence or Structural Similarity - Which One to Rely On?

Bioinformatics Internship Presentation

Kholud Almarzoki (Mentor: Dr. C.R. Vinayaka, Protein Information Resource, Georgetown University)

May 9th, 10:30am, Room 1202, Harris Building.

INTRODUCTION: Propagation of annotation in UniProt by sequence comparison is very efficient. It is based on hidden Markov models built from multiple sequence alignments of protein families. However, about 20% of these entries are misannotated. The purpose of this project is to search for misannotations in the UniProt Protein Knowledgebase that are due to sequence alignment and sequence comparison errors, and to use any available structural data to correct them.

METHOD: The dataset was collected from the UniProtKB database and consists of entries that have structural data and whose active site carries the description “Charge relay system” transferred from a template. It contains 150 enzymes. Each entry was investigated by comparing the active site position given in the UniProt entry with the active site position in the PDB structure, using Chimera. When a misannotation was found, we corrected it by identifying the right position, confirmed the correction with a multiple sequence alignment, and supported it with the research study cited in the entry.
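The per-entry comparison described above can be illustrated with a minimal sketch. It assumes the catalytic residues seen in the structure have already been mapped onto UniProt sequence numbering (for example, after inspecting the PDB entry in Chimera). The names `Site` and `compare_active_sites`, the issue labels, and the toy sequence are hypothetical and are not part of the project's actual workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    position: int  # 1-based position in the UniProt sequence
    residue: str   # one-letter amino acid code


def compare_active_sites(sequence, annotated, structural):
    """Return (issue, position) pairs for one entry.

    annotated:  active-site residues as listed in the UniProt entry
    structural: active-site residues observed in the structure,
                already mapped to UniProt numbering (assumption)
    """
    issues = []
    annotated_positions = {s.position for s in annotated}
    structural_positions = {s.position for s in structural}

    for site in annotated:
        in_range = 1 <= site.position <= len(sequence)
        actual = sequence[site.position - 1] if in_range else None
        if actual != site.residue:
            # The sequence does not carry the annotated residue here.
            issues.append(("residue_mismatch", site.position))
        elif site.position not in structural_positions:
            # Residue type matches, but the structure places the site elsewhere.
            issues.append(("not_in_structure", site.position))

    for site in structural:
        if site.position not in annotated_positions:
            # The structure shows a catalytic residue the entry does not list.
            issues.append(("missing_residue", site.position))

    return issues


# Toy example (hypothetical data): His3, Asp6 and Ser10 form the site.
toy_sequence = "AAHAADAAASAA"
from_structure = [Site(3, "H"), Site(6, "D"), Site(10, "S")]
from_entry = [Site(3, "H"), Site(7, "D"), Site(10, "S")]
print(compare_active_sites(toy_sequence, from_entry, from_structure))
```

An empty result means the annotation agrees with the structure. Any reported issue would still need the manual confirmation by multiple sequence alignment and literature described above.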

RESULTS: At the start of the investigation, we expected to find at least 5% of the entries misannotated. By the end of the project, 10% of the entries were found to be misannotated: they gave a wrong position for the active site residues, annotated the wrong residue, or omitted some of the residues that form the active site.

CONCLUSION: Based on the results of this project, we conclude that structural similarity can be more reliable than sequence similarity, since structural cores evolve three to ten times more slowly than sequence. Correcting the entries using structural evidence will improve the quality of the data in UniProt and make it more reliable for users.