Machine Learning Models for Link Prediction in Glycan Images

Campbell Ross

Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center.

Date/Time: August 21st, 2025 at 1:00 PM.

Abstract: Manual curation of semantic information in the figures of glycobiology manuscripts is time-consuming. The GlycanImageExtractor web application uses deep-learning-based image analysis to identify glycans and extract their semantic information. We seek to extend this image analysis to support the extraction of glycosidic link information – anomeric configuration and carbon bond positions – annotated on some glycan images.

Methods: We fine-tuned YOLO-v3 models to identify and classify links in glycan images. Training and testing images were generated by uniformly sampling GlyTouCan accessions with common monosaccharides and link types. Images were rendered either with explicitly annotated links, including “?” for missing values, or without any annotations, in a 9:1 ratio. Because some link annotations are substantially less common than others in GlyTouCan structures, we sought to make the annotations in the training images more balanced: in a subset of training images, we replaced the link information with randomly sampled anomeric configuration and carbon bond values. As each training and evaluation image was generated, we also extracted its true link annotations to serve as ground truth. Training datasets contained 1k to 40k images (0-10% unannotated, 0-17% with random links), and evaluation datasets contained 2k to 4k images (0-10% unannotated).
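
The link randomization step can be illustrated with a minimal sketch. This is not the project's code: the label vocabulary, the "a1-4"-style label format, the dictionary layout of a link, and the function names are all assumptions made for illustration. The abstract does not list the link types actually used.

```python
import random

# Hypothetical label vocabulary (assumption): the abstract does not state
# which anomeric configurations or carbon positions were sampled.
ANOMERS = ["a", "b", "?"]
CHILD_POSITIONS = ["1", "2", "?"]
PARENT_POSITIONS = ["2", "3", "4", "6", "?"]


def random_link_label(rng):
    """Draw one link annotation uniformly at random, e.g. 'a1-4' or '?1-?'."""
    anomer = rng.choice(ANOMERS)
    child = rng.choice(CHILD_POSITIONS)
    parent = rng.choice(PARENT_POSITIONS)
    return f"{anomer}{child}-{parent}"


def randomize_links(links, rng):
    """Return copies of the links with their labels replaced by random ones.

    Each link is assumed to be a dict with a 'box' (bounding box) and a
    'label'. Boxes are kept, so the new labels can serve as ground truth
    for the re-rendered image.
    """
    return [{**link, "label": random_link_label(rng)} for link in links]
```

Under these assumptions, the randomized labels would be handed to the image renderer and also written out as the ground truth for that image, so rare link types appear about as often as common ones in the randomized portion of the training set.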

Results: We observed minor performance improvements with increased training data size, and modest improvements in the correct prediction of links for unannotated images when such images were added to the training data. For a glycan to be extracted correctly, every link must be correct. Therefore, we evaluated performance on a whole-image basis (a single link error rendered the structure incorrect). Recall and precision increased from the 1k image dataset to the 20k image dataset (89.9% to 98.5% recall at 96% and 99% precision) and plateaued thereafter. When unannotated images were included in the training set, performance on unannotated evaluation images improved (96.0% to 99.9% recall at 99% precision).
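
A whole-image criterion of this kind can be sketched as follows. This is an illustrative sketch only, and its definitions are assumptions rather than the evaluation code behind the figures above: each image's links are assumed to be already matched to ground-truth link identifiers, a prediction of None stands for an image where no structure was reported (for example, after confidence thresholding), precision is taken over reported images, and recall over all images.

```python
def image_correct(predicted, truth):
    """An image counts as correct only if every link label matches.

    Both arguments are assumed to be dicts mapping a link id to its label,
    so a single wrong, missing, or extra link makes the image incorrect.
    """
    return predicted == truth


def whole_image_metrics(predictions, truths):
    """Compute (precision, recall) on a whole-image basis.

    predictions: one dict per image, or None if no structure was reported.
    truths: one ground-truth dict per image, in the same order.
    """
    reported = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    correct = sum(image_correct(p, t) for p, t in reported)
    precision = correct / len(reported) if reported else 0.0
    recall = correct / len(truths) if truths else 0.0
    return precision, recall
```

For example, with four images of which three are reported and two are fully correct, this sketch gives a precision of 2/3 and a recall of 1/2, even if the one incorrect image has only a single mislabeled link.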

Conclusion: The final model identified and classified links in glycan images with high recall and precision. Adding link information creates a more complete semantic representation of the glycan images extracted by the GlycanImageExtractor tool. Continued improvements in automating curation will increase the scope and quality of data within the CFDE, ultimately accelerating scientific progress.

Tagged: Summer 2025; Summer 2025 #1