Improving Variant-Impact Annotation with LLM Validation and Negated Data Handling
Yandi Xu
Mentor: Dr. Karen Ross, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center.
Date/Time: December 9th, 2025 at 12:00 PM.
Abstract: This project developed and scaled an automated bioinformatics pipeline that integrates eMIND, a deep-learning text-mining system, with GPT-4o-mini to improve the functional annotation of Alzheimer's disease (AD)-related genetic variants for UniProt. The pipeline passed the variant-impact relationships that eMIND extracted from PubMed literature through four LLM validation tests, each with a customized prompt: a main validation test confirming the variant-impact association, population-stratification detection, variant-combination identification, and negation analysis to capture null results. Starting from raw eMIND output, the workflow retrieved PubMed abstracts via NCBI Entrez, split the data into disease-specific and general impact categories, and applied the four tests. Results were then filtered to remove negated evidence (NEGATION="yes"), low-confidence assertions (MAIN="no"), and sentences with inherent negation markers. Finally, the pipeline aggregated evidence sentences by grouping identical variant-disease-impact combinations and formatted the output as ABB-compatible annotations for UniProt display. A novel contribution was the explicit handling of negated results (studies reporting no significant association), which are typically excluded from annotation databases despite their scientific value. The project represents 280+ hours of development and yielded a robust, documented pipeline that processes thousands of variant annotations while maintaining quality through systematic validation checks. This work demonstrates how combining deep-learning text mining with modern LLMs can improve both the precision and the comprehensiveness of large-scale biomedical knowledge curation, ensuring that both positive findings and informative negative results are preserved for the AD research community.
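The filtering and aggregation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the project's actual code: only the `MAIN` and `NEGATION` field names come from the abstract, while the other field names (`variant`, `disease`, `impact`, `sentence`), the negation-marker list, and the toy records are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical negation cues; the abstract does not list the real markers.
NEGATION_MARKERS = re.compile(
    r"\b(no|not|neither|nor|without|lack of|failed to)\b", re.IGNORECASE
)


def keep(record):
    """Return True if a record survives the three filters named in the abstract."""
    if record["NEGATION"].strip().lower() == "yes":   # negated evidence
        return False
    if record["MAIN"].strip().lower() == "no":        # low-confidence assertion
        return False
    if NEGATION_MARKERS.search(record["sentence"]):   # inherent negation marker
        return False
    return True


def aggregate(records):
    """Group surviving evidence sentences by (variant, disease, impact)."""
    grouped = defaultdict(list)
    for r in records:
        if not keep(r):
            continue
        key = (r["variant"], r["disease"], r["impact"])
        if r["sentence"] not in grouped[key]:          # skip verbatim duplicates
            grouped[key].append(r["sentence"])
    return dict(grouped)


# Toy records with placeholder variants (VAR1..VAR4); not real findings.
records = [
    {"variant": "VAR1", "disease": "AD", "impact": "increased risk",
     "MAIN": "yes", "NEGATION": "no",
     "sentence": "VAR1 was associated with increased risk."},
    {"variant": "VAR1", "disease": "AD", "impact": "increased risk",
     "MAIN": "yes", "NEGATION": "no",
     "sentence": "Carriers of VAR1 showed increased risk."},
    {"variant": "VAR1", "disease": "AD", "impact": "increased risk",
     "MAIN": "yes", "NEGATION": "no",
     "sentence": "VAR1 was associated with increased risk."},
    {"variant": "VAR2", "disease": "AD", "impact": "decreased risk",
     "MAIN": "no", "NEGATION": "no",
     "sentence": "VAR2 carriers showed decreased risk."},
    {"variant": "VAR3", "disease": "AD", "impact": "increased risk",
     "MAIN": "yes", "NEGATION": "yes",
     "sentence": "VAR3 carriers showed increased risk."},
    {"variant": "VAR4", "disease": "AD", "impact": "altered risk",
     "MAIN": "yes", "NEGATION": "no",
     "sentence": "VAR4 was not associated with altered risk."},
]

result = aggregate(records)
```

In this toy run only the VAR1 group survives, with its duplicate sentence collapsed. A production pipeline would presumably route the dropped negated records to separate handling rather than discard them, in line with the abstract's emphasis on preserving null results.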
- Tagged
- Fall 2025