Automated Extraction & Standardization of Proteoform Descriptions from UniProtKB

Nasser Almoammar

Mentor: Dr. Darren Natale, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center.

Date/Time: April 28th, 2026 at 12:00 PM.

Abstract: UniProtKB entries describe post-translational modifications (PTMs) in non-standardized, human-readable free text, which provides no machine-readable referents for the proteoforms being characterized. The Protein Ontology (PRO) offers a framework for standardized proteoform descriptions, but curation has traditionally depended on expert manual reading of the primary literature, a process that scales poorly with the volume of PTM data available in UniProtKB. This work describes an improved automated pipeline that extracts PTM statements from UniProtKB flat files, classifies them by confidence and complexity, and generates standardized PRO term templates suitable for database submission.

The improved pipeline incorporates feature-table-based site validation, quote-specific parsing, microbiome entry exclusion, alternate handling, and ontology-aware corrections. Compared to the baseline, these changes reduce incorrectly discarded entries by ~25%, reduce unresolved MOD ID cases, and increase output by 18%. Applied to the human reference proteome, the pipeline processes over 3,000 PTM entries across 30+ PTM types, with an estimated scope of ~8,800 human and ~67,000 cross-organism extractable proteoforms, providing a scalable foundation for standardization between PRO and UniProtKB.
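As an illustration of what feature-table-based site validation might look like, the sketch below checks whether residue positions mentioned in a PTM comment (CC lines) are backed by a MOD_RES feature (FT lines) in the same flat-file entry. It is not the pipeline's actual code: the function names, the regular expressions, the restricted residue list, and the sample entry are all assumptions made for this example, and it handles only single-position features with a one-line /note qualifier.

```python
import re

# Feature key line, e.g. "FT   MOD_RES         15" (assumed layout)
FT_KEY = re.compile(r"^FT   (\S+)\s+(\S+)\s*$")
# Qualifier line, e.g. 'FT                   /note="Phosphoserine"'
FT_NOTE = re.compile(r'^FT\s+/note="([^"]*)"')
# Residue-position mentions in free text, e.g. "Ser-15" (illustrative subset)
SITE = re.compile(r"\b(Ser|Thr|Tyr|Lys|Arg|Cys|Asn)-(\d+)\b")


def mod_res_sites(entry_text):
    """Return {position: note} for single-residue MOD_RES features."""
    sites = {}
    current = None
    for line in entry_text.splitlines():
        key = FT_KEY.match(line)
        if key:
            current = None  # a new feature starts; forget the previous one
            if key.group(1) == "MOD_RES" and key.group(2).isdigit():
                current = int(key.group(2))
                sites[current] = ""
            continue
        note = FT_NOTE.match(line)
        if note and current is not None and not sites[current]:
            sites[current] = note.group(1)
    return sites


def ptm_comment_sites(entry_text):
    """Return (residue, position) pairs mentioned in CC PTM comments."""
    in_ptm = False
    found = []
    for line in entry_text.splitlines():
        if line.startswith("CC   -!- "):
            # A new comment topic begins; track whether it is a PTM block
            in_ptm = line.startswith("CC   -!- PTM:")
        elif not line.startswith("CC       "):
            in_ptm = False  # not a comment continuation line
        if in_ptm:
            found += [(res, int(pos)) for res, pos in SITE.findall(line)]
    return found


def validate(entry_text):
    """Flag each free-text site as supported (True) or not by the feature table."""
    features = mod_res_sites(entry_text)
    return [(res, pos, pos in features)
            for res, pos in ptm_comment_sites(entry_text)]


# Hypothetical entry fragment, invented for this example
ENTRY = """\
CC   -!- PTM: Phosphorylated at Ser-15 and Thr-20 by a hypothetical kinase.
FT   MOD_RES         15
FT                   /note="Phosphoserine"
FT   CHAIN           1..120
FT                   /note="Example protein"
"""

for residue, position, supported in validate(ENTRY):
    print(residue, position, "supported" if supported else "unsupported")
```

In this toy entry, Ser-15 is matched by a MOD_RES feature while Thr-20 is not, so a validation step of this kind could route the second site to a lower-confidence class rather than discarding the whole statement.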

Tagged: Spring 2026