Semantic Glycan Data Modelling using Semantic MediaWiki

Xuewei Chen (Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University)

August 28, 2018, 2:00pm, Room 1300, Harris Building

We explore a Semantic MediaWiki-based data modeling approach as a simple, self-contained infrastructure for semantic sharing of glycan data. We require a database system that stores the attributes of simple objects, such as Glycans and Species, and the relationships between them, using concepts and terms from a variety of RDF ontologies, such as GlycoRDF. We also require a simple web-based front end for searching and browsing the stored objects, and a sophisticated query language for answering more complex questions. The semantic-web community has adopted RDF triples and the SPARQL query language for these data-modeling tasks; integrated with the Semantic MediaWiki front end, they satisfy each of these requirements.

RDF (Resource Description Framework) represents information as triples, each composed of a subject, a predicate, and an object. A subject is a URI (Uniform Resource Identifier), an object can be a URI or a literal, and predicates are URIs of elements from RDF ontologies. Ontologies define the object classes and predicates for a particular application domain. For glycomics data, we use GlycoRDF as the standard ontology for describing the data in RDF. GlycoRDF has five main object classes, including glycan, source, and reference compound.
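The triple model can be illustrated with a minimal, standard-library-only sketch. It is not the project's code and does not use a real triple store: the `Triple`, `Literal`, and `match` names are invented for illustration, and every URI except the standard `rdf:type` predicate is a hypothetical placeholder rather than a real GlycoRDF term. The `match` function mimics, in a very reduced form, how a single SPARQL triple pattern selects triples, with `None` playing the role of a variable.

```python
from typing import List, NamedTuple, Optional, Union


class Literal(NamedTuple):
    """A plain literal value, as opposed to a URI."""
    value: str


class Triple(NamedTuple):
    subject: str              # URI of the thing being described
    predicate: str            # URI of an ontology element
    obj: Union[str, Literal]  # URI or literal


# rdf:type is the standard RDF predicate; every other URI below is a
# hypothetical placeholder, not a real GlycoRDF term.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ONT = "http://example.org/ontology#"
DATA = "http://example.org/data/"

TRIPLES = [
    Triple(DATA + "glycan/1", RDF_TYPE, ONT + "Glycan"),
    Triple(DATA + "glycan/1", ONT + "foundIn", DATA + "species/1"),
    Triple(DATA + "glycan/2", RDF_TYPE, ONT + "Glycan"),
    Triple(DATA + "species/1", RDF_TYPE, ONT + "Species"),
    Triple(DATA + "species/1", ONT + "label", Literal("example species")),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[Union[str, Literal]] = None,
) -> List[Triple]:
    """Return the triples matching a pattern; None matches anything."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# All subjects typed as (hypothetical) Glycan objects.
glycans = [t.subject for t in match(TRIPLES, predicate=RDF_TYPE, obj=ONT + "Glycan")]
```

Here `glycans` holds the two placeholder glycan URIs. A real SPARQL query joins several such patterns through shared variables, which this sketch does not attempt.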

Semantic MediaWiki is software that lets users create, search, and browse information, offering both machine-readable formats and human-readable presentations. In this project, we run it on a Bitnami virtual machine and configure a triple store to hold object properties and to execute complex SPARQL queries over the triples. Using this infrastructure, we create a GlycoRDF data model, load data into Semantic MediaWiki and the triple store, and execute SPARQL queries against the triple store as needed. For bulk loading of the data warehouse, we interact programmatically with the Semantic MediaWiki server using the mwclient Python module and with the Fuseki SPARQL endpoint using the rdflib Python module. A series of Python classes handles the mechanics of interacting with the Semantic MediaWiki data store.
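The kind of helper class such bulk loading might rely on can be sketched as follows. This is an assumption-laden illustration, not the project's actual classes: `WikiObject`, its fields, and the property names are hypothetical. It only renders an object's properties as page text using Semantic MediaWiki's `#set` parser function and a category link. Sending that text to the wiki (for example through mwclient) is deliberately left out.

```python
from typing import Dict, List


class WikiObject:
    """One stored object (e.g. a glycan) rendered as Semantic MediaWiki page text.

    Hypothetical helper for illustration; not the project's actual class.
    """

    def __init__(self, title: str, category: str, properties: Dict[str, List[str]]):
        self.title = title            # wiki page title for this object
        self.category = category      # object class, stored as a wiki category
        self.properties = properties  # property name -> list of values

    def to_wikitext(self) -> str:
        """Build the page body: a #set call for the properties, then the category."""
        assignments = [
            name + "=" + value
            for name, values in sorted(self.properties.items())
            for value in values
        ]
        parts = []
        if assignments:
            # {{#set: a=1 |b=2 }} annotates the page without displaying anything.
            parts.append("{{#set:\n " + "\n |".join(assignments) + "\n}}")
        parts.append("[[Category:" + self.category + "]]")
        return "\n".join(parts)


# Property names and values here are hypothetical placeholders.
page = WikiObject(
    title="G00001",
    category="Glycan",
    properties={"Has species": ["Homo sapiens", "Mus musculus"]},
)
text = page.to_wikitext()
```

Keeping rendering separate from network access means the generated text can be inspected or tested without a running wiki, which may help when diagnosing bulk-load problems.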

While this approach to constructing a semantic web application was largely successful, we encountered a number of issues when loading a few thousand objects, including difficulty deleting objects and executing large-scale SPARQL queries. Further work is needed to fully understand these issues with the Semantic MediaWiki / Jena Fuseki implementation and to determine whether this infrastructure remains a viable platform for implementing a semantic web data store.