Development of a Python Program for De-identification of Breast Cancer Patient Data

Bioinformatics Internship Presentation

Abrar Albahrani (Mentors: Dr. Deborah Berry and Krysta Chaldekas, Histopathology & Tissue Shared Resource, Lombardi Comprehensive Cancer Center, Georgetown University)

August 30th, 2016, 2:00pm, Room 1300, Harris Building

To build a breast cancer database, the Histopathology and Tissue Shared Resource (HTSR) at Georgetown Lombardi Comprehensive Cancer Center collects breast cancer patients’ data to retain a record of their treatment history. The database is intended to make treatment history available to authorized investigators for studies that could alter treatment approaches and/or improve treatment response for future patients. Investigators conduct studies to analyze patient outcomes following various types of interventions, including but not limited to medical devices, drugs, and surgeries. Data collection for research purposes is permitted with patient consent; however, to be used under the HTSR’s existing IRB-approved protocol, the data must be de-identified before it is uploaded to a database and shared with investigators. The Health Insurance Portability and Accountability Act (HIPAA) specifies eighteen identifiers of protected health information (PHI) that must be de-identified before information gathered from paper charts or electronic medical records may be used for research. These identifiers carry direct information about individuals that could compromise their right to privacy. De-identification either removes a data element or transforms it so that it can no longer be linked to the individual while still retaining clinically relevant information.

In order to utilize clinical patient data for research, and to comply with existing regulatory and institutional approvals, we implemented a Python program to (a) de-identify a breast cancer data file and (b) prepare it for upload to a database.

The dataset is a Microsoft Excel file with seventy-three columns capturing information about 1500 patients and their treatment. We identified four types of PHI across twenty columns: (1) names, (2) medical record numbers, (3) geographical identifiers, and (4) dates.

We wrote a Python program to de-identify dates by converting them to ages, and we ran the program on a sample CSV file with a limited number of patients. Because some columns in the dataset use a range of two dates to report a period of treatment, the program calculates decimal ages so that two dates only days or months apart remain clearly distinguishable. Patients’ names and medical record numbers are de-identified by removing them completely from the dataset; to substitute for medical record numbers, each patient is assigned a unique code, known as an index number, that identifies their record in the dataset.
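The date-to-age conversion and the index numbers described above can be sketched as follows. This is a minimal illustration, not the program itself: it assumes that age is measured from the patient's date of birth, that a year is 365.25 days, and that index numbers are assigned sequentially. The function names `decimal_age` and `assign_index_numbers` are hypothetical.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: average year length, including leap years


def decimal_age(birth_date: date, event_date: date) -> float:
    """Return the age at event_date in decimal years, rounded to two places."""
    if event_date < birth_date:
        raise ValueError("event date precedes birth date")
    return round((event_date - birth_date).days / DAYS_PER_YEAR, 2)


def assign_index_numbers(record_numbers):
    """Map each distinct medical record number to a sequential index number."""
    index_by_mrn = {}
    for mrn in record_numbers:
        if mrn not in index_by_mrn:
            index_by_mrn[mrn] = len(index_by_mrn) + 1
    return index_by_mrn


# Hypothetical patient: a treatment period reported as two dates.
birth = date(1960, 1, 15)
start_age = decimal_age(birth, date(2010, 3, 1))
end_age = decimal_age(birth, date(2010, 6, 15))
print(start_age, end_age)  # two distinct ages even though the dates share a year
```

A whole-number age would collapse both ends of this treatment period into the same value, which is why a decimal age is useful here. The mapping from record numbers to index numbers would have to be kept separate from the shared dataset, since it allows re-identification.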

The Python program recognizes and removes physicians’ names when they occur directly before or after a medical prefix or suffix, or when they match a list of names supplied in the program. Finally, geographical identifiers, which refer to the health facilities or hospitals where patients receive treatment, are replaced with a unique code. As with physicians’ names, a list of facilities is supplied in the program, and any match found in the dataset file is removed.
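One way to approach this kind of matching is with regular expressions plus lookup lists. The sketch below is an assumption about how it might look, not the program's actual code: the physician and facility lists, the titles and suffixes, the `FAC01`-style codes, and the function name `scrub_text` are all hypothetical.

```python
import re

# Hypothetical lists; a real program would supply its own names and facilities.
KNOWN_PHYSICIANS = ["Smith", "Garcia"]
KNOWN_FACILITIES = ["General Hospital", "Riverside Clinic"]

# A title such as "Dr." followed by one or two capitalized words.
PREFIX_PATTERN = re.compile(r"\b(?:Dr|Prof)\.?\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?")
# One or two capitalized words followed by a suffix such as ", MD".
SUFFIX_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?,?\s+(?:MD|DO|PhD)\b")


def scrub_text(text: str) -> str:
    """Replace listed facilities with codes and remove physician names."""
    # Replace each listed facility with a code based on its list position.
    for position, facility in enumerate(KNOWN_FACILITIES, start=1):
        text = re.sub(re.escape(facility), f"FAC{position:02d}", text,
                      flags=re.IGNORECASE)
    # Remove names attached to a title or suffix, then any listed surname.
    text = PREFIX_PATTERN.sub("", text)
    text = SUFFIX_PATTERN.sub("", text)
    for name in KNOWN_PHYSICIANS:
        text = re.sub(rf"\b{re.escape(name)}\b", "", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s{2,}", " ", text).strip()


print(scrub_text("Seen by Dr. Smith at General Hospital"))
```

Patterns like these miss names that appear without a title or suffix and are absent from the list, so the list-based pass serves as a fallback rather than a guarantee.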

Once the program can de-identify one subset of the data, another subset is provided for testing and improvement.

The output CSV file contains sixty-eight columns and is de-identified of names, medical record numbers, geographical identifiers, and dates. Two Python scripts produce the output file: one calculates decimal ages and removes names and geographical identifiers; the other outputs the columns and rows with respect to index number. Although complete de-identification of physician names is not guaranteed, updates such as expanding the list of physician names would make the detection process more thorough. Moreover, because the output file may still contain undetected PHI, it must go through quality control to ensure that the result is completely de-identified.
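Part of such a quality control pass could be automated by flagging cells that still look like dates or record numbers, leaving the final judgment to a human reviewer. The sketch below is illustrative only; the patterns, the seven-digit threshold, and the function name `find_suspect_cells` are assumptions rather than part of the existing scripts.

```python
import csv
import io
import re

# Assumed patterns: dates like 08/30/2016 or 2016-08-30, and digit runs of
# seven or more characters that could be medical record numbers.
SUSPECT_PATTERNS = {
    "date": re.compile(r"\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})\b"),
    "long_number": re.compile(r"\b\d{7,}\b"),
}


def find_suspect_cells(csv_file):
    """Yield (row_number, column_name, pattern_label) for cells needing review."""
    reader = csv.DictReader(csv_file)
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, value in row.items():
            for label, pattern in SUSPECT_PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    yield row_number, column, label


# Hypothetical de-identified output with one date left in a free-text column.
sample = io.StringIO(
    "index,age,notes\n"
    "1,54.21,follow-up on 08/30/2016\n"
    "2,61.07,no issues\n"
)
for hit in find_suspect_cells(sample):
    print(hit)
```

A scan like this can only catch PHI that follows a recognizable pattern; names and facilities in free text would still need manual review.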