Internship Presentations

Automated Classification of CIViC Evidence Types Using ChatGPT-4o mini

Karim El Khoury

Mentor: Dr. Karen Ross, Program Co-Director, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University.

Date/Time: August 21st, 2025 at 1:30 PM.

Abstract: The Clinical Interpretation of Variants in Cancer (CIViC) is a community-driven knowledgebase that curates evidence from peer-reviewed literature to support the clinical interpretation of somatic variants in cancer. CIViC curators extract and annotate relevant information from scientific papers, assigning evidence types such as Predictive, Diagnostic, Prognostic, Predisposing, Oncogenic, and Functional based on the role of a variant in cancer.

In this project, we applied a Large Language Model (LLM), ChatGPT-4o mini, to help automate this evidence classification process. We focused on scientific publications that CIViC curators had already identified as potentially relevant sources. For each paper, we retrieved the PubTator Central XML, extracted the abstract and the relevant passages from the results, and identified the variants mentioned using PubTator’s concept annotations. We then prompted ChatGPT-4o mini to classify the evidence type(s) for each variant and to justify each classification based on the variant, gene, and surrounding context.
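The extraction and prompting steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the XML fragment is hand-written in a BioC-like layout, the section labels (`ABSTRACT`, `RESULTS`) and annotation type labels (`Mutation`, `Variant`, `Gene`) are assumptions about how the PubTator export is tagged, and the prompt wording and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written BioC-style fragment (not real PubTator output).
SAMPLE_XML = """<collection>
  <document>
    <passage>
      <infon key="section_type">ABSTRACT</infon>
      <text>BRAF V600E predicted response to vemurafenib in melanoma.</text>
      <annotation id="1"><infon key="type">Gene</infon><text>BRAF</text></annotation>
      <annotation id="2"><infon key="type">Mutation</infon><text>V600E</text></annotation>
    </passage>
    <passage>
      <infon key="section_type">METHODS</infon>
      <text>Tumor DNA was sequenced.</text>
    </passage>
    <passage>
      <infon key="section_type">RESULTS</infon>
      <text>Patients with V600E had longer progression-free survival.</text>
      <annotation id="3"><infon key="type">Mutation</infon><text>V600E</text></annotation>
    </passage>
  </document>
</collection>"""

# Evidence types listed in the abstract.
EVIDENCE_TYPES = ["Predictive", "Diagnostic", "Prognostic",
                  "Predisposing", "Oncogenic", "Functional"]
KEEP_SECTIONS = {"ABSTRACT", "RESULTS"}  # assumed section labels
VARIANT_TYPES = {"Mutation", "Variant"}  # assumed annotation type labels


def infon(element, key):
    """Return the text of the direct <infon key=...> child, or None if absent."""
    for node in element.findall("infon"):
        if node.get("key") == key:
            return node.text
    return None


def extract_context(xml_text):
    """Collect passage texts plus unique variant and gene mentions from kept sections."""
    root = ET.fromstring(xml_text)
    texts, variants, genes = [], [], []
    for passage in root.iter("passage"):
        if infon(passage, "section_type") not in KEEP_SECTIONS:
            continue
        texts.append((passage.findtext("text") or "").strip())
        for ann in passage.findall("annotation"):
            mention = (ann.findtext("text") or "").strip()
            kind = infon(ann, "type")
            if kind in VARIANT_TYPES and mention and mention not in variants:
                variants.append(mention)
            elif kind == "Gene" and mention and mention not in genes:
                genes.append(mention)
    return texts, variants, genes


def build_prompt(texts, genes, variant):
    """Assemble a classification prompt for one variant (wording is illustrative)."""
    return (
        "Classify the CIViC evidence type(s) supported by the text below.\n"
        f"Allowed types: {', '.join(EVIDENCE_TYPES)}.\n"
        f"Variant: {variant}\n"
        f"Gene(s): {', '.join(genes) or 'not annotated'}\n"
        "Give a brief justification for each type you assign.\n\n"
        + "\n\n".join(texts)
    )


if __name__ == "__main__":
    texts, variants, genes = extract_context(SAMPLE_XML)
    for variant in variants:
        # The resulting string would be sent to the LLM; that call is omitted here.
        print(build_prompt(texts, genes, variant))
```

Restricting the context to the abstract and results passages keeps the prompt short, and building one prompt per variant makes it straightforward to pair each justification with the variant it refers to.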

To test the approach, we applied it to 58 manually curated CIViC entries. The process took approximately 2.5 minutes and produced outputs that included the extracted variants, the predicted evidence type(s), and an LLM-generated justification for each. If this approach proves effective, we plan to scale it up and apply it to the full CIViC evidence table (the complete set of annotated papers and variant interpretations hosted on the CIViC website) to aid curators and reduce manual workload.
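One way to judge whether the approach is effective is to compare the predicted evidence types with the types the curators assigned. The sketch below is an assumption about how such a comparison could be done, not a method described in this abstract: it treats each entry as a set of labels and reports the exact-match rate and the mean Jaccard overlap. The pairs are toy values invented for illustration and are not project results.

```python
def jaccard(predicted, curated):
    """Overlap between two label sets: |intersection| / |union|."""
    p, c = set(predicted), set(curated)
    if not p and not c:
        return 1.0  # two empty sets agree trivially
    return len(p & c) / len(p | c)


def summarize(pairs):
    """Exact-match rate and mean Jaccard over (predicted, curated) label pairs."""
    exact = sum(1 for p, c in pairs if set(p) == set(c))
    mean_j = sum(jaccard(p, c) for p, c in pairs) / len(pairs)
    return {"n": len(pairs), "exact_match": exact / len(pairs), "mean_jaccard": mean_j}


# Toy (predicted, curated) pairs, invented for illustration only.
TOY_PAIRS = [
    (["Predictive"], ["Predictive"]),
    (["Predictive", "Prognostic"], ["Prognostic"]),
    (["Diagnostic"], ["Oncogenic"]),
    (["Functional"], ["Functional"]),
]

if __name__ == "__main__":
    print(summarize(TOY_PAIRS))
```

Jaccard overlap gives partial credit when the model predicts one correct type alongside an extra one, which the exact-match rate alone would count as a miss.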
