Faster enumeration of amino-acid tags for peptide tandem mass-spectra

Bioinformatics Internship Presentation

Gabriele Dani (Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University)

December 18th, 2015, 2:00pm, Room 1300, Harris Building

Tandem mass spectrometry combined with sequence database search is a widely used method for protein identification. In currently available tools, many peptides are scored against every spectrum, at a high computational cost. The aim of this project is to stop the search early without losing too many correct peptide identifications. If peptide-spectrum pairs are scored in quality order, the number of pairs that must be scored can be reduced, and spectra that ultimately go unidentified may not need to have any peptides scored against them at all.

The first step in testing this hypothesis was to parse a spectra file and build a database of edges. Each edge records a pair of peaks from a spectrum and the amino acid represented by the difference (delta) between the m/z values of those peaks. The quality of each peak, measured as its intensity relative to the intensity of the base peak, was also recorded in the database so that edges could be considered in quality order while looking for tags. The input file is a set of high-accuracy CID peptide fragmentation spectra from Freese et al. 2011.
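The edge-building step can be sketched as follows. This is a minimal illustration, not the project's code: the `Edge` record, the `build_edges` function, the 0.01 Da tolerance, the assumption of singly charged fragment ions, and the choice to score an edge by the weaker of its two peaks are all assumptions made for the example. The mass table lists only five residues to keep the sketch short.

```python
from collections import namedtuple

# Monoisotopic residue masses in Da for a few amino acids.
# A full table would list all twenty residues.
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "P": 97.05276,
    "V": 99.06841,
}

# Hypothetical edge record: two peak m/z values, the amino acid that
# explains their difference, and a quality score in (0, 1].
Edge = namedtuple("Edge", "lo hi aa quality")


def build_edges(peaks, tol=0.01):
    """Build edges from one spectrum.

    peaks: list of (mz, intensity) pairs, assumed singly charged.
    tol:   assumed mass tolerance in Da.
    Returns edges sorted with the highest quality first.
    """
    base = max(intensity for _, intensity in peaks)  # base peak intensity
    peaks = sorted(peaks)
    edges = []
    for a, (mz_a, int_a) in enumerate(peaks):
        for mz_b, int_b in peaks[a + 1:]:
            delta = mz_b - mz_a
            for aa, mass in RESIDUE_MASS.items():
                if abs(delta - mass) <= tol:
                    # Assumption: an edge is only as good as its weaker peak.
                    quality = min(int_a, int_b) / base
                    edges.append(Edge(mz_a, mz_b, aa, quality))
    edges.sort(key=lambda e: -e.quality)
    return edges
```

Sorting the edges by relative intensity up front means the later tag search can simply take edges from the front of the list.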

To avoid scoring spectra that are not identifiable, a second data structure, a suffix tree, was built from the human proteome database. This structure is used to look up tags as they are extended, so that tags which do not occur as part of any protein can be removed from the queue.
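The lookup role of this structure can be illustrated with a naive suffix trie. This is a simplification chosen for brevity, not the suffix tree used in the project: a trie built this way needs space quadratic in the sequence length, which a compressed suffix tree avoids on a database the size of the human proteome. The function names `build_suffix_trie` and `contains` are hypothetical.

```python
def build_suffix_trie(proteins):
    """Insert every suffix of every protein sequence into a nested-dict trie."""
    root = {}
    for seq in proteins:
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node.setdefault(ch, {})
    return root


def contains(trie, tag):
    """Return True if tag occurs as a substring of any indexed protein."""
    node = trie
    for ch in tag:
        node = node.get(ch)
        if node is None:
            return False
    return True
```

Because every substring of a sequence is a prefix of one of its suffixes, walking the trie from the root answers the substring query in time proportional to the tag length, independent of the database size.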

The next step was to construct an algorithm that selects the highest-quality edges of each spectrum from the database and places them in a queue, where they serve as the starting points for forming tags. The queue is always ordered by the quality of its members (by relative intensity). At each step, the best-quality tag in the queue is extended in both directions with its adjacent edges, and the extended tags are added back to the queue. An extended tag is added only if it occurs as a substring in the suffix tree.
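The queue-driven extension can be sketched with Python's `heapq` module. Several details here are assumptions made for the example rather than taken from the project: the function name `grow_tags`, the `seed_count` and `max_len` parameters, scoring a tag by its weakest edge, and a plain substring test over a list of proteins standing in for the suffix-tree lookup.

```python
import heapq


def grow_tags(edges, proteins, seed_count=2, max_len=4):
    """Grow sequence tags best-first.

    edges:    (lo_mz, hi_mz, amino_acid, quality) tuples from one spectrum.
    proteins: protein sequences; a substring test stands in for the suffix tree.
    Returns (tag, quality) pairs in the order they left the queue.
    """
    def in_db(tag):
        return any(tag in p for p in proteins)

    seeds = sorted(edges, key=lambda e: -e[3])[:seed_count]
    # heapq is a min-heap, so quality is negated to pop the best tag first.
    heap = [(-q, aa, lo, hi) for lo, hi, aa, q in seeds if in_db(aa)]
    heapq.heapify(heap)
    seen, out = set(), []
    while heap:
        neg_q, tag, lo, hi = heapq.heappop(heap)
        if (tag, lo, hi) in seen:
            continue  # the same tag can be reached from either end
        seen.add((tag, lo, hi))
        out.append((tag, -neg_q))
        if len(tag) >= max_len:
            continue
        for e_lo, e_hi, aa, q in edges:
            # Edges share peak m/z values, so exact comparison is safe here.
            if e_lo == hi:      # extend to the right
                cand = (tag + aa, lo, e_hi)
            elif e_hi == lo:    # extend to the left
                cand = (aa + tag, e_lo, hi)
            else:
                continue
            if in_db(cand[0]):  # drop tags absent from every protein
                # Assumption: tag quality is that of its weakest edge.
                heapq.heappush(heap, (max(neg_q, -q),) + cand)
    return out
```

With three chained edges G, A, P and the toy database `["MGAPK"]`, seeding from the single best edge grows the tag outward; against `["MGAK"]` the extensions containing P are never queued, which is the pruning effect described above.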

We were able to demonstrate the tradeoffs of stopping the search “early” and to analyze the quality of the outcome. With this algorithm we were able to score the correct peptide early and avoid scoring spectra that would most likely not be identified. Effort on those spectra was avoided by using the thresholds and, above all, the protein database as a cut-off for tags.