Building a Virtual Glycobiology Teaching Assistant using Retrieval Augmented Generation
Eliot Kmiec
Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University.
Date/Time: August 22nd, 2024 at 3:00pm.
Abstract: Generative AI chat-bots, based on Large Language Models (LLMs), have demonstrated a remarkable ability to coherently respond to user queries written in natural language, often generating accurate and cogent answers to questions that cannot be readily answered using traditional search engines. Although LLMs are built on a vast amount of information, they have limitations when it comes to knowledge not included in their initial training, especially in specialized fields (Soudani et al. 2024). Glycobiology is one such specialized field, where an appropriately trained generative AI chat-bot could help those new to the field navigate the complex biology of glycans. Training a new, domain-specific chat-bot from scratch, however, is too expensive, so we explored a popular alternative strategy, called Retrieval Augmented Generation (RAG), to provide existing general LLM chat-bots with domain-specific information and improve the accuracy of their responses. RAG uses an LLM-based semantic search strategy to find domain-specific text related to the user’s query and sends this specialized text alongside the query to the chat-bot, thereby improving domain-specific responses. We compared RAG-based LLM responses to unassisted LLM responses on several different tasks, using a curated set of questions with human-written answers as ground truth and reference information retrieved from the textbook “Essentials of Glycobiology, 4th edition” (Varki et al. 2022) as a source of reliable glycobiology information. Our study found that RAG significantly improves the factual content of the AI’s responses in this specialized area by incorporating specific phrases and facts from the textbook.
Finally, we used what we learned to design a RAG-based Teaching Assistant, GlyBot, using the OpenAI API and a Python package called LlamaIndex. GlyBot can use information from PubMed, the “Essentials of Glycobiology” textbook, and GlyGen, a glycomics knowledgebase, to get coherent glycobiology information from ChatGPT. Quipped a leader in the glycobiology field: “Gave me answers that I would be happy to get from my graduate students. I guess it is, in fact, time to retire.”
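
The retrieve-then-prompt idea behind RAG, as described in the abstract, can be illustrated with a minimal Python sketch. This is not GlyBot's implementation: it uses a toy bag-of-words similarity as a stand-in for LLM-based semantic search, and the passages, function names (embed, cosine, retrieve, build_prompt), and prompt wording are all hypothetical placeholders rather than textbook excerpts or LlamaIndex calls.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder passages; a real system would index textbook text.
PASSAGES = [
    "N-glycans are attached to asparagine residues of proteins.",
    "O-glycans are attached to serine or threonine residues.",
    "Retrieval systems rank text by similarity to a query.",
]

def embed(text):
    """Toy stand-in for an LLM embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages, k=2):
    """Send the retrieved text alongside the query, as RAG does."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return (
        "Use the following reference text to answer.\n"
        f"{context}\n\nQuestion: {query}"
    )

if __name__ == "__main__":
    print(build_prompt("Which residues carry N-glycans?", PASSAGES, k=1))
```

In an actual deployment, the string returned by build_prompt would be passed to the chat-bot in place of the bare question, and embed would be replaced by a real embedding model so that retrieval matches on meaning rather than shared words.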