Large Language Models (LLMs) Validation of Text-Mined Information for UniProt Computationally Mapped Bibliography
Amber Wu
Mentor: Dr. Karen Ross, Program Co-Director, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University
Date/Time: April 29th, 2025 at 12:00pm
Abstract: Alzheimer’s disease and related dementias are complex conditions whose mechanisms are still being unraveled. eMIND, a BERT-based text-mining system, was developed to extract information on the impact of variants on these diseases with good recall and precision. However, although eMIND performs well, it still lags behind manual expert curation in identifying diseases and the impacts of variants.
Large Language Models (LLMs), by contrast, represent a powerful new approach to natural language processing tasks in bioinformatics. Trained on vast amounts of text data, these AI systems can interpret complex relationships described in scientific literature and evaluate claims about biological mechanisms with a contextual awareness that traditional text-mining tools often lack.
This project aims to harness LLMs to bridge this gap, improving the reliability and utility of eMIND’s outputs and enhancing the tool’s accuracy for integration into the UniProt Computationally Mapped Bibliography.
The validation process follows a structured pipeline, and our results show high validation accuracy (92%+ across all tests). This automated pipeline significantly reduces manual curation time while maintaining high data quality, making it an effective approach for improving computational annotations in biological databases with LLM technology.