Internship Presentations

LLM Validation of Variant-Drug Relationships in Cancer

Lillian Wallace

Mentor: Dr. Karen Ross, Program Co-Director, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University.

Date/Time: August 21st, 2025 at 3:15 PM.

Abstract: Research has documented many somatic variants that affect cancers and other diseases; however, only a portion of these variants has been curated and stored in relevant databases. To improve curation efficiency, a BERT-based text-mining tool was developed to identify and extract information about these variants from PubMed abstracts and full-length PMC articles.

The goals of this project were twofold: to improve the accuracy of the text-mined data and to produce formatted UniProt annotated bibliography entries for the validated data. A workflow was developed to automate both tasks by using a large language model (LLM) to validate the data and to generate summaries for the bibliography. The workflow was designed for and tested on mined data describing variant-drug relationships identified in a cancer; here, all rows concerned non-small cell lung cancer.

Out of 5,467 rows, 1,497 (27.4%) passed validation. Validation accuracy was evaluated on 100 rows that passed validation and 50 rows that failed; both sets were randomly selected. The evaluation showed high precision (95%) and somewhat lower recall (80.5%). The workflow successfully provided high-quality, validated data, which will save a substantial amount of curation time.
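As a rough illustration of how the figures above relate, the sketch below recomputes the validation rate from the reported row counts and shows the standard precision and recall formulas. The true-positive and false-positive counts follow from the reported 95% precision on 100 sampled rows. The false-negative count is not reported in the abstract; the value used here is an assumption, chosen only because it reproduces the stated recall under a simple unweighted calculation, and the project may have computed recall differently.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of rows passing validation that are truly correct."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of truly correct rows that passed validation."""
    return tp / (tp + fn)


# Reported in the abstract: 1,497 of 5,467 rows passed validation.
total_rows = 5467
validated_rows = 1497
validation_rate = validated_rows / total_rows

# 95% precision on the 100 sampled passing rows implies these counts.
tp = 95
fp = 5

# ASSUMPTION: the abstract does not report the false-negative count.
# 23 is a hypothetical value that matches the stated 80.5% recall
# when recall is computed directly from the sampled rows.
fn_assumed = 23

print(f"validation rate: {validation_rate:.1%}")
print(f"precision:       {precision(tp, fp):.1%}")
print(f"recall:          {recall(tp, fn_assumed):.1%}")
```

Because the two evaluation samples were drawn from groups of very different sizes (1,497 passing rows versus 3,970 failing rows), a recall estimate for the full dataset would normally weight each sample by the size of its group; the unweighted form is shown only because it is the simplest reading of the reported numbers.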

However, there are still interactions that can be obtained from the data that failed validation. A large portion of interactions (35.4%) failed because they required secondary drugs or variants, which the original table cannot support. There are also a number of false negatives, many from more complex paragraphs that the LLM may not have interpreted correctly. This workflow is a promising start, but it remains to be seen whether curation of more complex interactions and passages can be automated.
