Latifat Braimah (Mentor: Dr. Deborah Berry, Histopathology and Tissue Shared Resource, Georgetown University)
August 29, 2017, 2:30pm, Room 341, Basic Science
A database is a way of structuring a data set so that data retrieval (querying) is straightforward. A well designed and implemented database not only makes querying accessible; it also supports inserting new records and updating or deleting existing ones. As host of the Human Tissue Bank, HTSR periodically receives, along with the collected tissue samples, the associated clinical/pathological data from the Tumor Registry (TR), usually as data exports in Excel workbook format. As a means of support to the Translational and Biomedical Research community, HTSR needed a queryable cohort database (covering the 150 patients present in the Tissue Microarray (TMA) experiments in addition to the clinical breast cancer data from TR), to which access can be granted to Clinical/Principal Investigators (CI/PI) who may be interested in the clinical history of a patient. Achieving this required three steps: processing the received TR data exports thoroughly; identifying database fields and their types based on the output; and mapping/uploading the processed data into the database while offering querying support to the CI/PI users of the database.
The supporting platform for designing and implementing the database is REDCap (Research Electronic Data Capture). REDCap is a web-based application used for designing and managing databases and surveys. Projects can be designed either online or offline. Because it is web-based, it requires a web server and a database server; these requirements must be met before installing the application on the server of a host institution (a consortium member such as Georgetown). This self-hosted setup allows REDCap to be customized (for example, to include optimized security features). REDCap allows for longitudinal database design. Underneath, it uses the functionality of a Relational Database Management System (RDBMS, provided by the database server) to store data in tables with well defined relationships (using primary and foreign key constraints). Designing a REDCap database includes defining a Data Dictionary (identifying and creating data fields). This can be done directly in the web application (online mode) or in an Excel file (offline mode) that is then uploaded into REDCap. The data security and interoperability (i.e. extensibility through APIs and DTS) provided by REDCap make it an excellent platform for this project.
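As an illustration of the offline mode, a data dictionary can be drafted as a CSV file before upload. The sketch below is a minimal example under stated assumptions, not the dictionary used in this project: the field names (`record_id`, `age_at_dx`, `laterality`), the form name, and the choice codes are hypothetical, and the column headers are a subset of those recalled from the REDCap data dictionary template. A real upload should start from the template downloaded from the REDCap instance, which contains additional columns.

```python
import csv
import io

# Assumed subset of the REDCap data dictionary column headers.
# Verify against the template exported from your own REDCap instance.
HEADERS = [
    "Variable / Field Name",
    "Form Name",
    "Field Type",
    "Field Label",
    "Choices, Calculations, OR Slider Labels",
]

# Hypothetical fields, for illustration only.
FIELDS = [
    ("record_id", "clinical", "text", "Record ID", ""),
    ("age_at_dx", "clinical", "text", "Age at diagnosis", ""),
    ("laterality", "clinical", "radio", "Laterality", "1, Left | 2, Right"),
]


def build_data_dictionary(fields):
    """Return the data dictionary as CSV text, one row per field."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(HEADERS)
    writer.writerows(fields)
    return buf.getvalue()


if __name__ == "__main__":
    print(build_data_dictionary(FIELDS))
```

Keeping the field definitions in a script rather than editing the spreadsheet by hand makes it easier to regenerate the dictionary when fields are added.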
During the course of this project, the TR data dump was processed before being uploaded into the new REDCap database. The following procedures were used for data processing. First, we resolved redundancies and formatting issues. Next, we de-identified the data to exclude sensitive patient information (for example, removing names, converting dates of birth to ages, and replacing medical record numbers with index numbers). On reviewing the resulting output, we realized that the TR data dump has fields containing long physician notes (text strings) describing the different types of treatment procedures (Ancillary Therapy, Chemotherapy, Hormone Therapy, Immunotherapy, Radiation Therapy, and Treatment). We also observed that these notes often included identifiable patient information along with relevant clinical data (drug regimens and doses, treatment site, diagnosis location, etc.). For this reason, we set out to explore the information embedded in these text string fields: the String Extraction phase.
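The de-identification step described above can be sketched as follows. This is a minimal illustration, not the project's script: the record keys (`name`, `mrn`, `dob`, `diagnosis`) are hypothetical, and the reference date used to compute age is an assumption, since the abstract does not say whether age was taken at diagnosis, at collection, or at export.

```python
from datetime import date


def age_at(dob, reference):
    """Whole years elapsed between dob and the reference date."""
    had_birthday = (reference.month, reference.day) >= (dob.month, dob.day)
    return reference.year - dob.year - (0 if had_birthday else 1)


def deidentify(records, reference):
    """Drop names, replace MRNs with index numbers, and convert DOB to age.

    Returns the cleaned records plus the MRN-to-index key, which would be
    stored separately from the de-identified data.
    """
    index_by_mrn = {}
    cleaned = []
    for rec in records:
        # The same MRN always maps to the same index number.
        idx = index_by_mrn.setdefault(rec["mrn"], len(index_by_mrn) + 1)
        cleaned.append({
            "index": idx,
            "age": age_at(rec["dob"], reference),
            "diagnosis": rec["diagnosis"],
        })
    return cleaned, index_by_mrn


if __name__ == "__main__":
    # Made-up example records.
    sample = [
        {"name": "A", "mrn": "M100", "dob": date(1960, 6, 15), "diagnosis": "IDC"},
        {"name": "B", "mrn": "M200", "dob": date(1975, 1, 2), "diagnosis": "DCIS"},
    ]
    rows, key = deidentify(sample, reference=date(2017, 6, 14))
    print(rows)
```

This handles only the structured fields; free-text notes can still carry identifiers, which is the problem the String Extraction phase addresses.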
At the onset of String Extraction, we divided the bulk data into two categories: the Non-Text String data and the Text Strings. We created reusable Python scripts for extracting information from the strings. These scripts use a list of terms and/or phrases to search for similar words in the strings. Once a match is found, it is pulled from the string into a new field determined by the type of the matched term or phrase (e.g. drug terms were pulled into a DRUG field, treatment site terms into a SITE field, and so on). At the end of each extraction round, a manual quality check (QC) was done on the output to evaluate the quality of the extracted information. This extraction process was highly iterative: we expanded the terms/phrases list, and the quality of the resulting output improved with every iteration. While string extraction was ongoing, database fields and their types (text, integer, dropdown, radio button, etc.) were being identified for the purpose of creating the data dictionary. Since the scope of the data is limited to breast cancer, database fields (along with their corresponding data values) were restricted accordingly (560 fields were identified in total). At the end of the data processing, outputs (saved in CSV files) were formatted, mapped to the data dictionary, and bulk uploaded into REDCap.
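The term-list approach described above can be sketched as follows. This is a simplified illustration rather than the project's scripts: the term lists are small hypothetical examples, and the matching here is exact and case-insensitive, whereas the original scripts are described as searching for similar words and may have been more tolerant of spelling variation.

```python
import re

# Hypothetical term lists keyed by target field; the real lists were
# expanded over many extraction rounds.
TERMS = {
    "DRUG": ["tamoxifen", "doxorubicin", "cyclophosphamide"],
    "SITE": ["left breast", "right breast", "chest wall"],
}


def compile_patterns(terms):
    """Build one case-insensitive, whole-word pattern per target field."""
    patterns = {}
    for field, words in terms.items():
        # Longest phrases first so they win over shorter overlapping terms.
        ordered = sorted(words, key=len, reverse=True)
        alternation = "|".join(re.escape(w) for w in ordered)
        patterns[field] = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return patterns


def extract(note, patterns):
    """Return matched terms per field, lowercased, in order of first appearance."""
    result = {}
    for field, pattern in patterns.items():
        found = []
        for match in pattern.findall(note):
            term = match.lower()
            if term not in found:
                found.append(term)
        result[field] = found
    return result


if __name__ == "__main__":
    patterns = compile_patterns(TERMS)
    note = "Started Tamoxifen 20 mg daily; XRT to Left Breast and chest wall."
    print(extract(note, patterns))
```

Keeping the term lists as data, separate from the matching code, fits the iterative workflow: each QC round adds terms without touching the extraction logic.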
Completion of this project resulted in a database with a rich (more relevant information extracted from the strings) and robust (formatted for optimal queryability) clinical data set for the consented breast cancer patients in TR, including the 150 patients present on the TMA. In anticipation of receiving more data dumps from TR in the future, all the data processing procedures were scripted, and a 25-page semi-automation document describing how to follow these procedures was created. This effort was to ensure that querying the database, inserting new records, and updating or deleting existing records can be done with little effort.