Is Clinical Proteomics Data Personally Identifiable Information?

Bioinformatics Internship Presentation

Safiyah Murray (Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University)

December 5th, 1:00pm-1:30pm, Room 1202, Harris Building.


It is possible to match an individual DNA sample to genomic data, compromising a person’s or patient’s anonymity.  For this reason, genome-specific data is no longer publicly available.  It may also be possible to match an individual to their publicly available proteomic data, such as the individual-specific sample data maintained by the Clinical Proteomic Tumor Analysis Consortium, thus compromising that person’s identity.  However, proteomic data belonging to a person or patient is still publicly available.  It is therefore necessary to determine whether publicly available proteomic data can be processed and then mapped back to the original individual sample.  The first step in this process is identifying variant peptides and determining from which samples the peptide(s) originate.  A specific variant peptide will be found only in certain individuals within a population, and therefore partitions the population, separating the few individuals from whom the samples came from those without the variant peptide.  This project uses and assesses the tools currently available to bioinformaticians to accomplish this first step: identifying variant peptides, identifying the corresponding wild type peptides, and determining zygosity, i.e. which allele(s) (wild type and/or variant) for an expressed trait the data indicate are present in the individual sample.


The dataset and the Aspera Connect software tool for high-speed data transfers were downloaded from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) website.  In particular, 4 of the 95 tumor samples from The Cancer Genome Atlas (TCGA) Colorectal Cancer proteome study were used in this experiment: TCGA-A6-3807-01A-22_Proteome_VU_20121019; TCGA-A6-3808-01A-22_Proteome_VU_20121205; TCGA-A6-3810-01A-22_Proteome_VU_20121029; and TCGA-AA-3518-01A-11_Proteome_VU_20120915.  Variant peptides were identified using the PepArML peptide identification platform, and the results were parsed with Python code.  Prior knowledge of variant peptides identified in CPTAC/TCGA colon cancer data came from the Nature article "Proteomic characterization of human colon and rectal cancer" (Zhang, et al., 2014), supplemental table 2 (S2_variant_peptides), an Excel spreadsheet of prevalence data and the single amino acid variants (SAAVs) identified by the authors.  Prior knowledge of variant peptides also came in the form of sample-specific FASTA files containing all the known variant peptides for that TCGA sample, as well as the loci of the SNPs.  The downloaded sample-specific FASTA files were used to check and validate each variant peptide indicated by the PepArML search.  Variant peptides were taken to be the peptides without a match in the search database selected for the PepArML search, in this case the CPTAC RefSeq Human Colon SNPs database.  This database, which PepArML searched against, contains all the known documented peptides with SNP/variant sequences in human colon cancer tissue samples.  Variants without a match in the PepArML search were then searched with BLAST (blastp) against the Human RefSeq database, and the results were parsed using Python code in order to find wild type matches for the variants.  A wild type peptide sequence was defined as a sequence that matched the full length of the variant peptide with exactly one point mutation.  
Python code was also used to search the PepArML results for instances of wild type and/or variant peptide sequences in individual samples.  Some variant peptides are found in only one sample or in very few samples, separating the samples with the variant from those without it and possibly identifying the sample(s). 
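
The wild type criterion described above (a full-length match to the variant peptide with exactly one point mutation) can be sketched in a few lines of Python.  This is a minimal illustration and not the code used in the project: the function names and the toy peptide strings are hypothetical, and the candidate sequences are assumed to have already been extracted from the blastp results.

```python
def single_substitution(variant, candidate):
    """Return (position, wild_residue, variant_residue) if the two peptides
    have the same length and differ at exactly one position, else None.
    The position is 1-based."""
    if len(variant) != len(candidate):
        return None  # not a full-length match
    diffs = [(i, c, v)
             for i, (v, c) in enumerate(zip(variant, candidate))
             if v != c]
    if len(diffs) != 1:
        return None  # identical, or more than one substitution
    i, wild, var = diffs[0]
    return (i + 1, wild, var)


def find_wild_type(variant, candidates):
    """Keep only the candidates that satisfy the one-substitution rule."""
    matches = []
    for cand in candidates:
        sub = single_substitution(variant, cand)
        if sub is not None:
            matches.append((cand, sub))
    return matches


# Toy sequences for illustration only; these are not peptides from the study.
variant = "PEPTIDEK"
candidates = ["PEPTIDEK", "PEPSIDEK", "PEPSIDER", "PEPTIDE"]
print(find_wild_type(variant, candidates))
# [('PEPSIDEK', (4, 'S', 'T'))]
```

Only the second candidate is kept: the first is identical to the variant, the third differs at two positions, and the fourth does not cover the full length.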


Variant peptides identified in this experiment were recorded in a spreadsheet and checked for validity, meaning they were confirmed as true variants by the Zhang, et al. (2014) article in Nature and/or the sample-specific TCGA FASTA files.  Wild type and variant spectral count data from the PepArML search results were collected and analyzed.  Heterozygous samples show both wild type and variant peptide matches.  Wild type matches were expected in heterozygous samples; however, we did not observe similar spectral counts for the two forms, which may reflect differing expression levels of the wild type and variant alleles.  Homozygous samples show either wild type peptide matches or variant peptide matches in the spectral counts, not both.  Variant matches were expected in homozygous samples.  The prevalence of variant peptides in this experiment was also consistent with the data provided by Zhang, et al. (2014), e.g. variant peptide: MVAVGICR, spectra: TCGA-A6-3807, instances: 2.  In the initial data from this small dataset, there are instances where a homozygous variant peptide is rare, meaning it has only one instance in only one sample.  In theory, these rare peptides, in conjunction with other variant peptides, should point to only a few samples, possibly identifying the original sample from which they came.  More tests are needed to determine the prevalence of the variant peptide(s) in all 95 of the TCGA Colorectal samples.  Nevertheless, the variant peptides identified in this experiment may not have needed to be validated against the Zhang, et al. (2014) article in Nature or the sample-specific TCGA FASTA files: although errors and misreads are possible, the presence of a variant peptide can be identified through the methods used in this experiment alone.
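
The zygosity reasoning and the partitioning idea described above can be illustrated with a short Python sketch.  It is offered under stated assumptions and is not the project's code: the function names, the sample labels (S1 to S4), the peptide labels (VAR_A, VAR_B), and all counts are hypothetical, and a call is based only on whether wild type and/or variant spectra were observed.  Because spectral counts also depend on expression level, the absence of one form should be read as suggestive only.

```python
from collections import defaultdict


def call_zygosity(wild_count, variant_count):
    """Naive call from spectral counts of the wild type and variant forms."""
    if wild_count > 0 and variant_count > 0:
        return "heterozygous"
    if variant_count > 0:
        return "homozygous variant"
    if wild_count > 0:
        return "homozygous wild type"
    return "not observed"


def samples_with_variant(counts):
    """Map each variant peptide to the set of samples in which it was seen.
    counts: {sample: {peptide: (wild_count, variant_count)}}"""
    carriers = defaultdict(set)
    for sample, peptides in counts.items():
        for peptide, (_wild, variant) in peptides.items():
            if variant > 0:
                carriers[peptide].add(sample)
    return dict(carriers)


def candidate_samples(carriers, observed_variants, all_samples):
    """Intersect carrier sets: the samples consistent with every variant."""
    remaining = set(all_samples)
    for peptide in observed_variants:
        remaining &= carriers.get(peptide, set())
    return remaining


# Hypothetical counts for illustration only; not data from the study.
counts = {
    "S1": {"VAR_A": (3, 2), "VAR_B": (5, 0)},
    "S2": {"VAR_A": (0, 4), "VAR_B": (2, 0)},
    "S3": {"VAR_A": (6, 0), "VAR_B": (0, 1)},
    "S4": {"VAR_A": (1, 0), "VAR_B": (4, 0)},
}
carriers = samples_with_variant(counts)
print(call_zygosity(*counts["S1"]["VAR_A"]))                  # heterozygous
print(sorted(candidate_samples(carriers, ["VAR_A"], counts)))  # ['S1', 'S2']
print(sorted(candidate_samples(carriers, ["VAR_B"], counts)))  # ['S3']
```

In this toy example a variant seen in a single sample (VAR_B) narrows the candidates to that one sample, which is the sense in which rare variant peptides could point back to the original sample.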