Improvement of glycan structure inference from glycan images using deep-learning-based object detection

Michelle Vesser

Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University

Date/Time: August 23rd, 2022 at 3:40pm.

Abstract: Glycan structures are typically provided as images in scientific papers, presentations, and posters and in online glycoinformatics resources. While humans can easily understand glycan structures implied by glycan images, specific structural details cannot easily be extracted in a computable form. Matching the glycans in an image to known structures facilitates the integration of new knowledge with existing glycoinformatics data resources.

Glycan image recognition using deep-learning-based object detection has been demonstrated previously, but it fails on certain classes of images and glycan structures. The open-source YOLO: Real-Time Object Detection project can be trained to find the bounding boxes of specific items in images and was previously applied to recognize glycans in glycan-containing figures from a diverse set of manuscripts. However, the limited size of the initial training set resulted in “blind spots”. We worked to improve glycan recognition and the extraction of glycan details by refactoring the code, building new object detection models, and expanding the glycan image training set.

To improve the training data, we added a more comprehensive set of automatically generated glycan images, each containing a single glycan on a white background. The existing training data, in contrast, consisted of complex figures from published manuscripts with many glycans per image. We also removed problematic examples from the training set. In addition, we increased the size of the “ground-truth” bounding boxes to provide more space around each glycan, mitigating YOLO’s tendency to predict bounding boxes that were too small. We re-established the end-to-end glycan image processing pipeline, including YOLO training, and modularized the existing codebase into self-contained components, making it possible to evaluate and improve the individual elements of the analysis piece by piece. These changes resulted in a two-fold improvement in glycan recall on a problematic testing set of images containing single glycans.
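The bounding-box enlargement described above can be illustrated with a minimal sketch. It assumes labels in the common YOLO text format (box center, width, and height, all normalized to the image size); the function name `pad_yolo_box` and the 10% default padding are hypothetical and are not taken from the project's code.

```python
def pad_yolo_box(xc, yc, w, h, pad=0.1):
    """Expand a normalized YOLO box on every side and clip it to the image.

    xc, yc, w, h are fractions of the image size (0..1).
    pad is the fraction of the box width/height added on each side
    (an illustrative value, not the one used in the original work).
    """
    # Convert center/size to corner coordinates, add padding, clip to [0, 1].
    x0 = max(0.0, xc - w / 2 - pad * w)
    y0 = max(0.0, yc - h / 2 - pad * h)
    x1 = min(1.0, xc + w / 2 + pad * w)
    y1 = min(1.0, yc + h / 2 + pad * h)
    # Convert back to center/size, which is what a YOLO label file stores.
    return ((x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0)
```

Clipping matters for the generated single-glycan images, where a glycan may sit close to the image border: the padded box shrinks on the clipped side and its center shifts accordingly.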

Following this success, we explored using YOLO to recognize glycans together with their orientation, with the aim of solving a long-standing issue in the previously developed processing pipeline. While initial training results looked promising, this approach did not appear to improve upon the existing heuristic for glycan orientation inference.

We also worked to improve the identification of individual monosaccharides in glycan images for semantic glycan structure extraction. We replaced the existing approach, which detected monosaccharide shape and color directly, with a YOLO-based object detection approach. This resulted in a 38% improvement in recall of individual monosaccharides. The YOLO approach improves the identification of fucose in particular, as well as of monosaccharides in glycan structures drawn using non-standard color schemes. Identifying each monosaccharide in a glycan image is crucial for the successful semantic extraction of monosaccharide composition and glycan structure.
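For readers unfamiliar with the metric, recall for monosaccharide detection can be computed roughly as follows. This is a sketch under stated assumptions, not the evaluation code used in this work: boxes are `(x0, y0, x1, y1)` tuples, a detection counts as correct when its label matches and its intersection-over-union (IoU) with the ground-truth box is at least 0.5, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall(truth, preds, threshold=0.5):
    """Fraction of ground-truth items matched by a prediction.

    truth and preds are lists of (label, box) pairs. Each prediction can
    match at most one ground-truth item. The 0.5 threshold is an assumed,
    commonly used value.
    """
    used = set()
    found = 0
    for label, box in truth:
        best, best_iou = None, threshold
        for i, (plabel, pbox) in enumerate(preds):
            if i in used or plabel != label:
                continue
            score = iou(box, pbox)
            if score >= best_iou:
                best, best_iou = i, score
        if best is not None:
            used.add(best)
            found += 1
    return found / len(truth) if truth else 0.0


# Illustrative data only: one fucose found, one galactose mislabeled.
truth = [("Fuc", (0, 0, 10, 10)), ("Gal", (20, 0, 30, 10))]
preds = [("Fuc", (1, 1, 10, 10)), ("Man", (20, 0, 30, 10))]
print(recall(truth, preds))  # 0.5
```

Requiring the label to match means a correctly located but misclassified monosaccharide still counts as a miss, which is the relevant notion when the goal is extracting composition and structure.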

This work improves the existing glycan extraction pipeline, making it a more viable option for glycan curation. During this work, we refactored and modularized the code involved in the glycan extraction pipeline, developed a more thorough and flexible approach to training YOLO detection models, and developed tools to evaluate and compare the performance of different models. We applied YOLO to multiple problems related to glycan extraction, including glycan orientation and monosaccharide identification. Ultimately, we have improved the identification of glycans, the extraction of individual monosaccharide types and locations, and the inference of glycan topology, leading to more successful extraction of glycan structures from input images overall.

Tagged: Summer 2022