Semantic Link Extraction from Glycan Structure Images using Deep-Learning

Xinyu Hu

Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center

Date/Time: July 26th, 2023 at 10:00am.

Abstract: Glycan structures are typically provided as images in manuscripts and online resources – easy for humans to understand, but difficult for computers to work with. Unfortunately, there is often no computer-readable description of the structure, or even a consistent presentation of the glycans in the images. An automated system for extracting the semantic details from glycan images will make glycan knowledge curation from manuscripts easier and more reliable.

Previous work has shown the effectiveness of deep-learning-based object detection models for locating glycan structures in complex figures and monosaccharides in single glycan images. The YOLO object detection model is trained using bounding boxes around the objects it should recognize – for monosaccharides, the trained YOLO model draws boxes around the monosaccharide symbols it recognizes in a glycan image. With the monosaccharides identified, linkages between them are extracted using image processing heuristics, an approach that has proven somewhat unreliable. To improve the performance of linkage detection, we explore whether a YOLO object detection model can also be applied to linkages in glycan images.

We avoid the manual annotation bottleneck in creating training data by developing scripts that automatically generate glycan images with known semantics, monosaccharide positions, and linkage information, rendered with randomized style, size, color, orientation, and blurring. From the semantic and positional information, bounding boxes for glycan linkages, suitable for training YOLO, can be readily constructed. Testing images with known true linkage bounding boxes were generated using the same method. Two YOLO models were trained: a big model with 9,000 images trained for 10,000 iterations, and a small model with 1,000 images trained for 6,000 iterations.
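As an illustration of how a linkage bounding box might be derived from known monosaccharide positions, here is a minimal Python sketch. It is not the project's actual script: the function names, the choice of an axis-aligned box spanning the two monosaccharide centers, and the `pad` margin are all assumptions made for this example. The output line follows the common YOLO label convention of a class index followed by the box center, width, and height, each normalized by the image size.

```python
def link_box(p, q, pad=10.0):
    """Axis-aligned box spanning two monosaccharide centers.

    p, q: (x, y) pixel coordinates of the two linked monosaccharides.
    pad:  hypothetical margin in pixels, so that horizontal or vertical
          links do not produce a zero-height or zero-width box.
    """
    (x1, y1), (x2, y2) = p, q
    return (min(x1, x2) - pad, min(y1, y2) - pad,
            max(x1, x2) + pad, max(y1, y2) + pad)


def to_yolo_label(box, img_w, img_h, class_id=0):
    """Convert an (xmin, ymin, xmax, ymax) pixel box to a YOLO label line."""
    xmin, ymin, xmax, ymax = box
    # Clip the box to the image so normalized values stay within [0, 1].
    xmin, ymin = max(xmin, 0.0), max(ymin, 0.0)
    xmax, ymax = min(xmax, float(img_w)), min(ymax, float(img_h))
    xc = (xmin + xmax) / 2 / img_w
    yc = (ymin + ymax) / 2 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


# Example: two monosaccharides on a horizontal line in a 400x200 image.
box = link_box((100, 50), (200, 50), pad=10)
label = to_yolo_label(box, img_w=400, img_h=200)
```

Because the generator already knows which monosaccharides are linked and where they were drawn, one label line per linkage can be written alongside each generated image without any manual annotation.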

These models were first evaluated using the intersection over union (IOU) metric to determine how well YOLO predicted linkage bounding boxes. Precision-recall plots were used to compare the models – the big model showed better precision and recall than the small model at all levels of prediction confidence, and, surprisingly, decreasing the IOU threshold improved performance. Next, we extracted semantic linkage information from the predicted boxes using the known coordinates of the monosaccharides. For semantic link extraction, we found very few false positives with high recall.
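The two evaluation styles described above can be sketched in Python as follows. The IOU function is the standard definition for axis-aligned boxes. The link extraction rule shown here (a predicted box is assigned to a link when exactly two known monosaccharide centers fall inside it, within a tolerance `tol`) is only one plausible reading and is an assumption of this example, as are all function and parameter names.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def extract_link(pred_box, centers, tol=5.0):
    """Map a predicted linkage box to a pair of monosaccharide ids.

    centers: dict of monosaccharide id -> (x, y) center in pixels.
    Returns a sorted pair of ids, or None when the box does not contain
    exactly two centers (an assumed, deliberately strict rule).
    """
    x1, y1, x2, y2 = pred_box
    inside = sorted(
        name for name, (x, y) in centers.items()
        if x1 - tol <= x <= x2 + tol and y1 - tol <= y <= y2 + tol
    )
    return tuple(inside) if len(inside) == 2 else None


def precision_recall(predicted, truth):
    """Precision and recall of predicted links against the true links."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


# Example with three monosaccharides on a line and one predicted box.
centers = {"a": (100, 50), "b": (200, 50), "c": (300, 50)}
link = extract_link((90, 40, 210, 60), centers)
```

Under a rule like this, a predicted box that is somewhat misplaced can still yield the correct semantic link, which is one reason a semantics-based evaluation can look better than a strict box-overlap evaluation.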

This project provides a demonstration of YOLO applied to linkage extraction. First, we generalized an approach to extracting semantic and positional information alongside randomly generated glycan images and used it to create bounding boxes for glycan links. Second, we trained YOLO models for linkage detection, and they appear to perform very well both on test data and on real images from scientific papers. Lastly, we developed both box-based and semantics-based model evaluation methods and applied them to these models to understand their performance. Overall, the YOLO-based object detection model does very well at linkage extraction. We plan to ultimately integrate it with the existing pipeline to more reliably extract complete glycan structures from their images.

Tagged: Summer 2023