Prioritizing PRIDE Studies Related to Deglycosylated N-Linked Glycan Attachment Sites Using Text Mining

Zihan Yang

Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center.

Date/Time: May 8th, 2026 at 10:00 AM.

Abstract: Public proteomics repositories such as PRIDE contain thousands of studies, but they do not provide a simple label for identifying experiments related to deglycosylated N-linked glycan attachment sites. This project developed a text-mining workflow to help prioritize PRIDE studies that may be relevant to this biological question. Each PRIDE study was treated as a study-level text document using fields such as the title, project description, sample-processing protocol, data-processing protocol, and keywords. An initial curated positive set was built through PRIDE–GlyGen mapping, keyword searching, and manual review, using glycoproteomics-related terms such as PNGase F, deamidation, deglycosylation, N-glycosylation, glycopeptide, and glycan.
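The document-building and keyword-screening steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field names in TEXT_FIELDS and the helper names study_text and matched_terms are assumptions made for the example, and the matching is a loose case-insensitive substring check intended only to flag candidates for manual review.

```python
# Hypothetical field names for a PRIDE study record; the real metadata keys may differ.
TEXT_FIELDS = (
    "title",
    "projectDescription",
    "sampleProcessingProtocol",
    "dataProcessingProtocol",
    "keywords",
)

# Glycoproteomics-related terms listed in the abstract.
TERMS = (
    "PNGase F",
    "deamidation",
    "deglycosylation",
    "N-glycosylation",
    "glycopeptide",
    "glycan",
)


def study_text(record):
    """Join the available text fields of one study into a single document."""
    parts = []
    for field in TEXT_FIELDS:
        value = record.get(field)
        if not value:
            continue  # skip missing or empty fields
        if isinstance(value, (list, tuple)):
            parts.extend(str(item) for item in value)
        else:
            parts.append(str(value))
    return " ".join(parts)


def matched_terms(text):
    """Return the terms found in the text (case-insensitive substring match).

    Substring matching is deliberately loose: "glycan" also matches
    "proteoglycan", so hits are candidates for review, not confirmed positives.
    """
    lowered = text.lower()
    return [term for term in TERMS if term.lower() in lowered]


# Made-up record used only to show the call pattern.
example = {
    "title": "N-glycosylation site mapping",
    "sampleProcessingProtocol": "Peptides were treated with PNGase F.",
    "keywords": ["glycopeptide", "LC-MS"],
}
example_hits = matched_terms(study_text(example))
```

In this sketch a study with any hit would be passed on to manual review rather than labeled positive automatically, which mirrors the curated approach described in the abstract.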

Several machine-learning approaches were tested, including TF-IDF features, semantic embeddings, and a hybrid model combining both feature types. Logistic regression was used to assign each PRIDE study a probability of resembling the curated positive studies. In the final corrected setup, 57 curated positive studies were compared against a randomly sampled PRIDE background at a 1:25 positive-to-background ratio. The hybrid model generated a ranked list of unlabeled PRIDE candidates, but no unlabeled study passed the conservative predict_proba >= 0.9 threshold; the highest unlabeled score was approximately 0.639. The final model therefore did not identify any new confirmed positives, but it did provide a reproducible triage workflow for ranking studies for future manual review. Overall, this project shows that text mining can support biological dataset curation in PRIDE, while also highlighting the difficulty of automatic discovery when positive examples are limited and repository text is noisy.
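The background sampling and threshold-based triage described above might look something like the sketch below. It assumes the probability scores have already been produced by a fitted classifier; the function names sample_background and triage, the fixed seed, and the example scores other than the reported maximum of about 0.639 are illustrative assumptions, not the project's implementation.

```python
import random


def sample_background(positive_ids, all_ids, ratio=25, seed=0):
    """Draw a random background of `ratio` studies per curated positive.

    Positives are excluded from the pool; the seed is fixed so the
    sample is reproducible (the seed value itself is an assumption).
    """
    positives = set(positive_ids)
    pool = sorted(i for i in all_ids if i not in positives)
    k = min(len(pool), ratio * len(positives))
    return random.Random(seed).sample(pool, k)


def triage(scores, threshold=0.9):
    """Rank unlabeled studies by score and list those meeting the threshold.

    `scores` maps a study identifier to its predicted probability.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    passing = [study for study, score in ranked if score >= threshold]
    return ranked, passing


# Made-up identifiers and scores; only the top value echoes the reported ~0.639.
example_scores = {"study_a": 0.639, "study_b": 0.41, "study_c": 0.20}
ranked, passing = triage(example_scores)
# With a top score below 0.9, `passing` is empty, but `ranked` still gives
# an ordering that can guide manual review.
```

Keeping the ranked list separate from the thresholded list reflects the outcome reported here: an empty high-confidence set does not prevent the ranking from being used to prioritize studies for review.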

Tagged: Spring 2026