Integrating Clinical Breast Cancer Data into a Longitudinal REDCap Project Database

Bioinformatics Internship Presentation

Gaelle Sop-Kamga (Mentors: Dr. Deborah Berry and Krysta Chaldekas, Histopathology & Tissue Shared Resource, Lombardi Comprehensive Cancer Center, Georgetown University)

July 25th, 2016, 1:00pm, Room 1202, Harris Building

In Georgetown’s Histopathology and Tissue Shared Resource (HTSR), clinical and patient data have been collected alongside the specimens gathered for the tissue bank and stored in an unstandardized database. Clinical investigators associated with the Lombardi Comprehensive Cancer Center can be given permission to use these specimen data when designing studies and clinical study cohorts, but querying the dataset for these purposes is significantly limited by abundant redundancies in the data and by unclear definitions of entities, attributes, and relationships within the dataset.

The current method of storing this data in the HTSR makes no use of REDCap, a database solution for the collection and translational use of clinical data in research studies. REDCap is a centrally hosted, browser-based software solution distributed by Vanderbilt University to REDCap Consortium partners, such as Georgetown University. No installation is required to use the software, and because it is metadata-driven, users define the structure of their database (the REDCap project structure) with a CSV file of metadata: field names, their characteristics, and the instruments they are assigned to. In this project, we integrated the HTSR clinical datasets into REDCap by designing and implementing a longitudinal REDCap project, one in which data is organized by patient but partitioned chronologically. In addition to designing a structure suited to the format of the data gathered in the HTSR, we wrote Python programs to transform the raw data into the normalized REDCap project structure and prepare it for loading.
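As an illustration of the metadata-driven approach described above, the following is a minimal sketch of how a project's metadata CSV could be generated. The field names, instrument names, and choice lists are hypothetical and are not taken from the HTSR project; the column headers are assumed to follow REDCap's data dictionary template, and only a subset of the template's columns is shown.

```python
import csv
import io

# Subset of data dictionary columns. The header text is an assumption
# based on REDCap's template; a real upload needs the full column set.
COLUMNS = [
    "Variable / Field Name",
    "Form Name",
    "Field Type",
    "Field Label",
    "Choices, Calculations, OR Slider Labels",
]

# Hypothetical fields: (name, instrument, type, label, choices)
FIELDS = [
    ("patient_id", "demographics", "text", "Patient ID", []),
    ("dx_laterality", "diagnosis", "radio", "Laterality",
     [(1, "Left"), (2, "Right"), (3, "Bilateral")]),
]


def format_choices(choices):
    """Render (code, label) pairs in the 'code, label | code, label' form."""
    return " | ".join(f"{code}, {label}" for code, label in choices)


def build_data_dictionary(fields):
    """Return the metadata CSV as a string, one row per field."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(COLUMNS)
    for name, form, ftype, label, choices in fields:
        writer.writerow([name, form, ftype, label, format_choices(choices)])
    return buf.getvalue()


if __name__ == "__main__":
    print(build_data_dictionary(FIELDS))
```

Keeping the field definitions in one Python structure means the same choice lists can later be reused to convert labels to coded values during data transformation.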

A subset of 10 patients, each with multiple treatment events and diagnosis information, was used to determine the scope of fields and categories needed to adequately capture a breast cancer clinical dataset. Based on the structure of this dataset, a REDCap project was built with 7 instruments and 89 fields, plus the 7 fields indicating the completion status of each instrument. The longitudinal dimension of the database consists of 53 event partitions: 1 demographic event and 1 decease event, 2 possible remission events, 4 possible diagnosis events, 10 procedure events, 6 surgery events, 10 chemotherapy events, 6 radiotherapy events, 3 immunotherapy events, 8 hormonal therapy events, and 2 possible events for other forms of therapy. REDCap’s API did not prove significantly useful for transforming the data and performing a bulk upload to the project; instead, we wrote three sequential Python programs to transform the raw data into a project-compatible format. This included partitioning the raw dataset into further events, removing duplicated information, setting the event name designations, and converting multiple-choice labels into the numeric values that REDCap expects on upload. Following the transformation and upload, the project contained a total of 182 individual event records within 10 patient records.
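The transformation steps listed above (removing duplicates, assigning event names, and converting labels to coded values) can be sketched roughly as follows. This is not the original internship code: the raw column names (`patient_id`, `event_type`, `date`, `laterality`), the event-name pattern, and the coded values are all assumptions made for illustration. Only the per-type event counts are taken from the abstract.

```python
# Event slots per type, as listed in the abstract (they sum to 53).
EVENT_SLOTS = {
    "demographic": 1, "decease": 1, "remission": 2, "diagnosis": 4,
    "procedure": 10, "surgery": 6, "chemotherapy": 10, "radiotherapy": 6,
    "immunotherapy": 3, "hormonal_therapy": 8, "other_therapy": 2,
}

# Hypothetical label-to-code mapping for one multiple-choice field.
LATERALITY_CODES = {"left": 1, "right": 2, "bilateral": 3}


def event_name(kind, index, arm=1):
    """Build an event name such as 'diagnosis_2_arm_1' (assumed pattern)."""
    if index > EVENT_SLOTS[kind]:
        raise ValueError(f"no slot {index} for event type {kind!r}")
    return f"{kind}_{index}_arm_{arm}"


def transform(raw_rows):
    """Deduplicate raw rows and emit one import row per patient event."""
    seen = set()    # exact duplicates already emitted
    counters = {}   # next slot index per (patient, event type)
    out = []
    # ISO dates sort correctly as strings, so events come out in time order.
    for row in sorted(raw_rows, key=lambda r: (r["patient_id"], r["date"])):
        key = (row["patient_id"], row["event_type"],
               row["date"], row["laterality"])
        if key in seen:
            continue
        seen.add(key)
        slot = (row["patient_id"], row["event_type"])
        counters[slot] = counters.get(slot, 0) + 1
        out.append({
            "patient_id": row["patient_id"],
            "redcap_event_name": event_name(row["event_type"], counters[slot]),
            "event_date": row["date"],
            "laterality": LATERALITY_CODES[row["laterality"].strip().lower()],
        })
    return out
```

Raising an error when a patient has more events of one type than the project defines makes it visible when the event structure, rather than the data, needs to be extended.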

Further validation of the uploaded test data can be performed prior to approval of the project structure and its deployment to production, where real patient data would be securely stored and queried through more customized reports.