Propagation of glycopeptide identifications using spectra similarity networks

Bioinformatics Internship Presentation

Junyuan Zheng (Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University)

August 26th, 2:20pm-2:40pm Room 1300, Harris Building.

Protein glycosylation is a common post-translational modification that plays an important role in many biological processes and diseases. This study focuses on N-glycosylation, in which glycans attach to Asn at NXS/T motifs. In previous studies, a computational strategy to identify glycopeptides from tandem mass spectra (MS/MS) was developed, and the notion of spectral similarity for glycopeptide spectra was explored. In this project, we focus on similar glycopeptide spectra whose precursors differ by a single monosaccharide mass in order to propagate glycopeptide identifications.
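As an illustration of the precursor relationship described above, the following minimal sketch tests whether two precursor neutral masses differ by one monosaccharide. It is not the project's code: the function names, the set of four monosaccharides, and the 0.02 Da tolerance are assumptions chosen for the example. The masses are standard monoisotopic residue masses.

```python
# Monoisotopic residue masses (Da) of common N-glycan monosaccharides.
# The choice of these four residues is an assumption for this sketch.
MONOSACCHARIDE_MASS = {
    "Hex": 162.0528,
    "HexNAc": 203.0794,
    "Fuc": 146.0579,
    "NeuAc": 291.0954,
}

PROTON_MASS = 1.007276  # Da


def neutral_mass(mz, charge):
    """Convert an observed precursor m/z and charge to a neutral mass."""
    return (mz - PROTON_MASS) * charge


def monosaccharide_delta(mass_a, mass_b, tol=0.02):
    """Return (name, sign) if mass_b - mass_a matches one monosaccharide.

    sign is +1 when spectrum b carries the extra residue and -1 when
    spectrum a does. Returns None when no single residue explains the
    difference within the (hypothetical) tolerance `tol`, in Da.
    """
    delta = mass_b - mass_a
    for name, mass in MONOSACCHARIDE_MASS.items():
        if abs(abs(delta) - mass) <= tol:
            return name, (1 if delta > 0 else -1)
    return None
```

For example, two precursors with neutral masses 3000.0 and 3162.0528 would be reported as differing by one Hex, with the second spectrum carrying the extra residue.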

We developed a Python-based program that reads in spectral similarity relationships and initial glycopeptide labels for the spectra; executes a glycopeptide identification propagation algorithm; and outputs a modified network with new glycopeptide labels for visualization using Cytoscape. We apply a voting strategy in which a spectrum’s neighbors vote on its glycopeptide label. We use the notion of quorum in the voting algorithm and explore a variety of heuristics for label propagation. The first phase of the algorithm, “cleanup”, requires labeled spectra to be consistent with their network neighborhood, while the second phase, “propagation”, iteratively labels unlabeled spectra by neighbor voting until no more spectra can be labeled.
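The two phases described above can be sketched roughly as follows. This is an illustrative simplification, not the program itself: labels are treated as opaque strings and each neighbor votes with its own label, whereas in the setting described here a neighbor's vote would first be adjusted by the monosaccharide separating the two precursors. The function names, the quorum of 2, the strict-majority rule, and the choice to keep a label when no quorum is reached are all assumptions.

```python
from collections import Counter


def winning_label(node, graph, labels, quorum, majority):
    """Return the label elected by a node's labeled neighbors, or None.

    A vote is held only if at least `quorum` neighbors are labeled, and a
    label wins only with a vote share strictly above `majority`.
    """
    votes = Counter(labels[n] for n in graph.get(node, ()) if n in labels)
    total = sum(votes.values())
    if total < quorum:
        return None  # too few labeled neighbors to hold a vote
    label, count = votes.most_common(1)[0]
    return label if count / total > majority else None


def cleanup(graph, labels, quorum=2, majority=0.5):
    """Phase 1: drop labels contradicted by a quorate neighborhood.

    Assumption: a label is kept when no vote can be held or no label wins.
    All votes are evaluated against the original labels, so the result
    does not depend on iteration order.
    """
    kept = {}
    for node, label in labels.items():
        winner = winning_label(node, graph, labels, quorum, majority)
        if winner is None or winner == label:
            kept[node] = label
    return kept


def propagate(graph, labels, quorum=2, majority=0.5):
    """Phase 2: label unlabeled nodes by neighbor vote until nothing changes."""
    labels = dict(labels)
    while True:
        new = {}
        for node in graph:
            if node in labels:
                continue
            winner = winning_label(node, graph, labels, quorum, majority)
            if winner is not None:
                new[node] = winner
        if not new:
            return labels
        labels.update(new)  # apply one round of votes at a time


# Toy network: adjacency lists keyed by hypothetical spectrum ids.
graph = {
    "s1": ["s2", "s3", "s4"],
    "s2": ["s1", "s3", "s4"],
    "s3": ["s1", "s2", "s4", "s5"],
    "s4": ["s1", "s2", "s3", "s5"],
    "s5": ["s3", "s4", "s6"],
    "s6": ["s5"],
}
initial = {"s1": "A", "s2": "A", "s3": "B"}

cleaned = cleanup(graph, initial)   # s3's label "B" is outvoted and removed
final = propagate(graph, cleaned)   # s3, s4, then s5 receive label "A"
```

In this toy network, s6 stays unlabeled because its single labeled neighbor never reaches the quorum of 2, which is the role the quorum plays in limiting propagation from weak evidence.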

We apply the glycopeptide identification propagation program to 3288 MS/MS spectra from the tryptic digest of human haptoglobin, with an initial set of 496 glycopeptide identifications. The spectral similarity network joined 2202 spectra with 74731 edges. We use the number of spectra with glycopeptide labels and the estimated false discovery rate (FDR) to measure the effectiveness of our algorithm. The first phase, “cleanup”, reduced the FDR from 22.3% to 2.9%, removing many false positive labels. The second phase, “propagation”, then increased the number of labeled spectra from 162 to 557, while holding the FDR to 6.9%. The software takes less than a minute to run. Overall, the program more than triples the number of labeled spectra remaining after cleanup while keeping the FDR well below its initial value.