Systematic Curation and Large Language Model-Based Extraction of Alzheimer’s Disease Biomarkers

Alma Ogunsina

Mentors: Dr. Raja Mazumder, Department of Biochemistry and Molecular Medicine, George Washington University, and Daniall Masood, PhD, George Washington University.

Date/Time: August 21st, 2025 at 2:15 PM.

Abstract: Biomarkers are measurable indicators of normal or abnormal biological processes, pathogenic conditions, or responses to therapeutic interventions. They can be biological molecules, imaging results, or physiological signals, and they are widely used in disease diagnosis, prognosis, and personalized medicine. However, biomarker data is dispersed across numerous publications and public resources, making it difficult for researchers and clinicians to access this information efficiently. The Biomarker Partnership project addressed this problem by integrating biomarker data into BiomarkerKB, a comprehensive, structured biomarker knowledgebase. The work presented here focused on manually curating Alzheimer’s disease (AD) biomarkers and outlining an automated extraction pipeline based on a large language model (LLM).

The project was conducted in two phases. During Phase 1, over 300 AD biomarkers were identified from existing databases and from scientific literature retrieved through targeted PubMed searches. For each biomarker, key fields such as assessed biomarker entity, condition, best biomarker role, specimen, and evidence were extracted and standardized into a structured table. A quality-control script was implemented to ensure data consistency across all biomarker entries.
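The kind of consistency check described above might look like the following minimal sketch. It is not the project's actual script: the column names, the two checks (missing required fields and duplicate rows), and the sample rows are all assumptions chosen for illustration, and the rows are placeholders rather than entries from the curated dataset.

```python
# Hypothetical column names, modeled on the fields named in the abstract.
REQUIRED_FIELDS = (
    "assessed_biomarker_entity",
    "condition",
    "best_biomarker_role",
    "specimen",
    "evidence",
)


def normalize_record(record):
    """Trim each field and collapse internal runs of whitespace."""
    return {key: " ".join(str(value or "").split()) for key, value in record.items()}


def find_issues(records):
    """Return (row_index, message) pairs for missing fields and duplicate rows."""
    issues = []
    seen = set()
    for index, raw in enumerate(records):
        record = normalize_record(raw)
        for field in REQUIRED_FIELDS:
            if not record.get(field):
                issues.append((index, f"missing {field}"))
        # Rows are compared case-insensitively after whitespace normalization.
        key = tuple(record.get(field, "").lower() for field in REQUIRED_FIELDS)
        if key in seen:
            issues.append((index, "duplicate row"))
        seen.add(key)
    return issues


# Placeholder rows for illustration only (not taken from the curated table).
sample_rows = [
    {"assessed_biomarker_entity": "example protein A", "condition": "Alzheimer's disease",
     "best_biomarker_role": "diagnostic", "specimen": "cerebrospinal fluid",
     "evidence": "PMID:0000000"},
    {"assessed_biomarker_entity": "Example  Protein A ", "condition": "Alzheimer's Disease",
     "best_biomarker_role": "Diagnostic", "specimen": "cerebrospinal fluid",
     "evidence": "PMID:0000000"},
    {"assessed_biomarker_entity": "example protein B", "condition": "Alzheimer's disease",
     "best_biomarker_role": "prognostic", "specimen": "",
     "evidence": "PMID:0000000"},
]

issues = find_issues(sample_rows)
for row_index, message in issues:
    print(f"row {row_index}: {message}")
```

Normalizing before comparing means that rows differing only in spacing or capitalization are flagged as duplicates, which is one plausible reading of "data consistency"; a real script would likely also validate values against controlled vocabularies.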

Phase 2 focused on model development. Although model training was not completed, a workflow was designed to support future fine-tuning of a BioBERT model for Named Entity Recognition (NER) and Relation Extraction (RE), with the goal of identifying biomarker-related entities and capturing the relationships between them. The proposed workflow consists of tokenizing the raw biomarker data with the AutoTokenizer class from the Hugging Face Transformers library, applying BIO tags to each token for NER, labeling relation types between entity pairs for RE, and training the model with Hugging Face’s Trainer API. Model performance would be evaluated using standard metrics such as precision and F1 score.
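The two labeling steps in this workflow can be illustrated with a small, dependency-free sketch. It makes several assumptions: whitespace splitting stands in for the subword tokenizer (in the proposed workflow, tags would instead be aligned to AutoTokenizer output), and the entity labels, relation names, and example sentence are hypothetical rather than drawn from the project's annotation scheme.

```python
from itertools import permutations


def bio_tag(tokens, entity_spans):
    """Assign BIO tags given (start, end, label) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in entity_spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


def label_entity_pairs(entity_spans, gold_relations):
    """Label every ordered entity pair; unannotated pairs get NO_RELATION."""
    pairs = []
    for head, tail in permutations(range(len(entity_spans)), 2):
        relation = gold_relations.get((head, tail), "NO_RELATION")
        pairs.append((head, tail, relation))
    return pairs


# Illustrative sentence; whitespace tokens stand in for subword tokens.
sentence = "Plasma p-tau217 is elevated in Alzheimer's disease"
tokens = sentence.split()

# Hypothetical entity labels, indexed by token position.
spans = [(0, 1, "SPECIMEN"), (1, 2, "BIOMARKER"), (5, 7, "CONDITION")]
tags = bio_tag(tokens, spans)

# Hypothetical relation types, keyed by (head entity index, tail entity index).
gold = {(1, 2): "BIOMARKER_OF", (1, 0): "MEASURED_IN"}
pairs = label_entity_pairs(spans, gold)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Enumerating all ordered pairs and assigning an explicit `NO_RELATION` label is one common way to frame RE as classification; other framings are possible, and the abstract does not specify which one the project would adopt.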

This work contributes a high-quality curated dataset of AD biomarkers for inclusion in the knowledgebase and proposes a scalable framework for automating biomarker extraction from unstructured biomedical literature, ultimately improving the accessibility of biomarker information for research and clinical use.

Tagged: Summer 2025; Summer 2025 #1