Glycan Structure Extraction from Scientific Literature

Nhat Duong

Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center.

Date/Time: August 25, 2020 at 1:40pm

Abstract: The extraction and representation of accepted glycobiology knowledge is challenging due to the widespread use of stylized images for describing glycans in published literature. In the absence of an explicit computer-readable glycan sequence or accession number, human curation is required to extract published glycosylation knowledge for our glycomics data resources. Glycan structure images in published literature are highly stylized but poorly standardized, despite efforts to make them more consistent. Automated extraction of glycan structures from images in published literature will ease the human curation effort.

The first half of the project focuses on extracting figures from PDF files, identifying the bounding boxes of potential glycan images, and saving the images for further processing. A list of publications associated with manually curated glycan knowledge in the UniCarbKB and GlyGen resources was selected to drive the development of the approach. Two Python modules were adopted for working with PDF files: PDFPlumber to extract the coordinates of figures in the PDF file, and PyMuPDF to annotate the PDF file with information about each identified glycan. To identify and extract glycan objects in images, we used an object detection algorithm (YOLOv3) integrated with an open-source neural network framework (Darknet). A training set was built by manually drawing bounding boxes around more than 1000 glycans across 300 images extracted from the literature. Using a script that imports both the YOLOv3 algorithm and the Darknet framework, training was carried out on the free Google Colab infrastructure. To improve the accuracy of the model, we additionally fine-tuned the existing library parameters before applying them to manuscript figures. This effort successfully identified glycan structures and extracted them for detailed analysis, while also highlighting the extracted glycans and making them clickable in the PDF file.
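
To make a detected glycan clickable in the PDF, a detection box in figure-image coordinates has to be mapped back onto the page. The following is a minimal sketch of that mapping, not the project's actual code. It assumes detections arrive in the normalized YOLO form (center x, center y, width, height, each as a fraction of the image size) and that the figure rectangle and the page both use a top-left origin with y increasing downward. The names Box, yolo_to_pixels, and pixels_to_page, and all the numbers, are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle: (x0, y0) is top-left, (x1, y1) is bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float


def yolo_to_pixels(cx, cy, w, h, img_w, img_h):
    """Convert a normalized YOLO box (center x/y, width, height) to pixel corners."""
    return Box(
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )


def pixels_to_page(box, img_w, img_h, figure):
    """Map a pixel box inside a figure image onto the figure's rectangle on the page."""
    # Scale factors from image pixels to page units (assumes no rotation or cropping).
    sx = (figure.x1 - figure.x0) / img_w
    sy = (figure.y1 - figure.y0) / img_h
    return Box(
        figure.x0 + box.x0 * sx,
        figure.y0 + box.y0 * sy,
        figure.x0 + box.x1 * sx,
        figure.y0 + box.y1 * sy,
    )


# Hypothetical figure: an 800x400 pixel image placed at (100, 200)-(500, 400) on the page.
figure_on_page = Box(100.0, 200.0, 500.0, 400.0)
detection = yolo_to_pixels(0.5, 0.5, 0.25, 0.5, img_w=800, img_h=400)
page_box = pixels_to_page(detection, 800, 400, figure_on_page)
```

The resulting page_box is the rectangle that a highlight or link annotation would cover. Rotated or cropped figures would need a fuller transform than this linear scaling.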

The second half of the project seeks to analyze the glycan images to extract the glycan structure, including the monosaccharides and their linkages. The open-source library OpenCV was used to apply a variety of image processing algorithms to the glycan images extracted from the manuscripts’ figures. Using OpenCV, the distinctive colors and shapes of monosaccharides were recognized, making it possible to count individual monosaccharides and construct a URL to the GNOme Glycan Structure Browser, suitable for embedding in the original PDF file. The linkages between monosaccharides were determined by extending lines from each monosaccharide and identifying where they intersect the bounding boxes of neighboring monosaccharides. Finally, we determined the orientation of the glycan images and the common N-glycan core to identify the reducing-end monosaccharide of the structure. Together, these properties describe the semantic topology of the glycan, suitable for conversion to the GlycoCT glycan sequence format using the PyGly module provided by Dr. Edwards’ lab.
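
The line-against-bounding-box test used for linkage determination can be illustrated with a short, self-contained sketch. It is one possible formulation under stated assumptions, not the project's implementation: a line of fixed length is extended from a monosaccharide's center in a given direction, and the nearest neighbor whose bounding box it enters is reported as linked. The clipping routine is a standard slab (Liang-Barsky style) test. The function names, the reach parameter, and the example layout are all hypothetical.

```python
def segment_entry(p, q, box):
    """Return t in [0, 1] where segment p->q first enters box, or None if it misses.

    box is (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
    """
    (px, py), (qx, qy) = p, q
    x0, y0, x1, y1 = box
    t_enter, t_exit = 0.0, 1.0
    # Clip the segment against the x slab, then the y slab.
    for delta, start, lo, hi in ((qx - px, px, x0, x1), (qy - py, py, y0, y1)):
        if delta == 0:
            if start < lo or start > hi:
                return None  # parallel to this slab and outside it
            continue
        t1, t2 = (lo - start) / delta, (hi - start) / delta
        if t1 > t2:
            t1, t2 = t2, t1
        t_enter, t_exit = max(t_enter, t1), min(t_exit, t2)
        if t_enter > t_exit:
            return None
    return t_enter


def linked_neighbor(center, direction, reach, neighbors):
    """Return the name of the nearest neighbor box hit by the extended line, or None."""
    end = (center[0] + direction[0] * reach, center[1] + direction[1] * reach)
    hits = []
    for name, box in neighbors.items():
        t = segment_entry(center, end, box)
        if t is not None:
            hits.append((t, name))
    return min(hits)[1] if hits else None


# Hypothetical layout in image coordinates (y grows downward): a monosaccharide
# centered at (100, 50) with one neighbor above it and two to its left.
neighbors = {
    "Man": (40, 40, 60, 60),
    "Gal": (0, 40, 20, 60),
    "Fuc": (90, 0, 110, 20),
}
left = linked_neighbor((100, 50), (-1, 0), 120, neighbors)
up = linked_neighbor((100, 50), (0, -1), 120, neighbors)
right = linked_neighbor((100, 50), (1, 0), 120, neighbors)
```

Taking the smallest entry parameter means a line that passes through several boxes links only to the closest one, so "Gal" is not reported as a direct neighbor in the leftward test even though the line also reaches it.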

We developed an effective tool for extracting glycan structures from the figures of published manuscripts. The method successfully records the position of glycans on all pages, extracts their topology information, and annotates them in place in the PDF file with links to tools that can be used to identify the specific structure. The tool can annotate most manuscripts in less than a minute and analyzes about 3 glycans per second. Topology (linkage) determination is still under active development. Nevertheless, this prototype demonstrates the potential utility of automated extraction of glycan structures from published manuscript figures, which could significantly lower the curation burden for the representation of glycosylation knowledge in glycomics resources.

Tagged: Summer 2020