LLM Summarization of Plant Protein Phosphorylation Information for UniProt Community Submission
Xingchen Liu
Mentor: Dr. Karen Ross, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University.
Date/Time: August 22nd, 2024 at 1:00pm.
Abstract: Phosphorylation of plant proteins plays a vital role in regulating plant cell signaling, which governs growth, development, environmental responses, and metabolic pathways. However, annotations of plant protein phosphorylation in biological databases such as UniProt are limited, which restricts scientists’ understanding of plant signaling pathways and hinders broader applications in plant science research.
To expand annotations in UniProt beyond what is possible with limited curator resources, UniProt has implemented a system whereby the user community can submit new citations and annotations, which are included in the bibliography section of protein entry pages. Community submissions to date have been limited, both because the system relies on users to initiate submissions and because user-supplied annotations often require significant editing by curators to conform to UniProt style and content guidelines.
To address these limitations, we have applied text-mining tools and large language models (LLMs) to automatically summarize literature on plant protein phosphorylation. Specifically, we use the text-mining tools pGenN (which detects and normalizes plant protein mentions) and RLIMS-P (which detects phosphorylation events) to find and extract plant phosphorylation-related information from PubMed abstracts, and we employ LLMs to summarize this information into draft UniProt-style community submissions.
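As an illustration of how the two text-mining outputs might be combined before summarization, the following minimal Python sketch groups phosphorylation events with the normalized protein they refer to, per abstract. The record types `ProteinMention` and `PhosphoEvent`, their fields, and the function `collect_evidence` are hypothetical stand-ins introduced here; they are not the actual output formats of pGenN or RLIMS-P, and matching on the substrate name is an assumed simplification.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ProteinMention:
    """Hypothetical stand-in for one normalized protein mention."""
    pmid: str
    name: str
    uniprot_ac: str


@dataclass(frozen=True)
class PhosphoEvent:
    """Hypothetical stand-in for one extracted phosphorylation event."""
    pmid: str
    substrate: str
    kinase: Optional[str]
    site: Optional[str]
    sentence: str


@dataclass
class Evidence:
    """Everything the summarization step would need for one protein in one abstract."""
    pmid: str
    uniprot_ac: str
    protein: str
    events: List[PhosphoEvent] = field(default_factory=list)


def collect_evidence(mentions, events):
    """Attach events to the protein mention with the same PMID and substrate name."""
    by_key = defaultdict(list)
    for ev in events:
        by_key[(ev.pmid, ev.substrate.lower())].append(ev)

    records = []
    for m in mentions:
        matched = by_key.get((m.pmid, m.name.lower()), [])
        if matched:  # proteins without any phosphorylation event are skipped
            records.append(Evidence(m.pmid, m.uniprot_ac, m.name, list(matched)))
    return records
```

Each `Evidence` record would then hold the sentences passed to the LLM for one protein and one abstract, so that a draft annotation stays tied to a single citation.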
By constraining the responses generated by the LLMs, we restrict the output to phosphorylation-related annotation information, avoiding references to other proteins or to post-translational modifications other than phosphorylation. We optimized the prompt design using the CO-STAR framework so that the generated text meets professional standards.
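The prompt structure can be sketched as follows. CO-STAR (also written COSTAR) organizes a prompt into Context, Objective, Style, Tone, Audience, and Response sections. The section wording, the helper `build_costar_prompt`, and the keyword check `mentions_other_ptm` below are illustrative assumptions, not the prompts or filters actually used in this project.

```python
import re

COSTAR_FIELDS = ("Context", "Objective", "Style", "Tone", "Audience", "Response")


def build_costar_prompt(protein, evidence_sentences):
    """Assemble a CO-STAR style prompt for one protein (wording is hypothetical)."""
    sections = {
        "Context": "Sentences about protein phosphorylation extracted from a PubMed abstract.",
        "Objective": (
            f"Summarize only the phosphorylation of {protein}. Do not describe "
            "other proteins or other post-translational modifications."
        ),
        "Style": "Concise database annotation.",
        "Tone": "Neutral and factual.",
        "Audience": "Database curators and plant biologists.",
        "Response": "One or two sentences of plain text.",
    }
    parts = [f"# {name.upper()} #\n{sections[name]}" for name in COSTAR_FIELDS]
    evidence = "\n".join(f"- {s}" for s in evidence_sentences)
    return "\n\n".join(parts) + "\n\n# EVIDENCE #\n" + evidence


# Assumed, deliberately incomplete list of off-topic modification terms.
OTHER_PTM = re.compile(
    r"\b(ubiquitin\w*|acetylat\w*|methylat\w*|sumoylat\w*|glycosylat\w*)\b",
    re.IGNORECASE,
)


def mentions_other_ptm(text):
    """Return True if a draft mentions a modification other than phosphorylation."""
    return OTHER_PTM.search(text) is not None
```

A keyword check of this kind could flag drafts for manual review. It cannot confirm that a summary is accurate, so it would complement human review rather than replace it.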
After checking an initial set of generated samples, we submitted the resulting annotations to a UniProt curator, who judged their overall quality to be good. We are currently using APIs to scale up annotation generation. Ultimately, we plan to send these draft annotations to the papers’ authors for approval and display them as community submissions on the UniProt website. This effort is a pioneering application of LLMs to the systematic summarization of specialized biological literature, and we expect it to substantially increase plant phosphorylation annotation in UniProt and improve the scientific community’s understanding of plant biology.