Indirect Observation of Single Nucleotide Variants in Proteomics Data

Bioinformatics Internship Presentation

Bixuan Wang (Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University)

May 13th, 2016, 10:00pm, Room 1300, Harris Building

Bottom-up proteomics is a widely used workflow for protein identification and quantification. Tandem mass spectrometry (MS/MS) is unbiased with respect to protein isoforms, but unexpected isoforms, such as those produced by coding single nucleotide polymorphisms (cSNPs), cannot be identified by standard database search approaches because they are missing from the sequence database. One solution is to use RNA-Seq to catalogue potential cSNPs. To avoid this additional analysis, an alternative is to seek evidence of protein isoforms directly in the MS/MS data. The hypothesis of this project is that the existence of cSNPs can be inferred by analyzing peptide abundance.

We expect the abundances of peptides from the same gene to be correlated across samples. A cSNP should alter the abundance of the affected peptide and therefore reduce its correlation with the other peptides of that gene. First, for each gene, we computed the correlation between every pair of peptides and looked for reduced correlation indicating a potential cSNP locus. To evaluate this approach, we used published cSNP data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) and dbSNP as ground truth. To better understand the putative cSNP identifications, we developed visualization tools that produce heat maps of correlations and histograms of peptide pair correlations. We also analyzed other potential explanations for lack of correlation, including post-translational modification, splicing and glycosylation. Finally, the published cSNP data allow false positive and false negative rates to be computed for the indirectly observed cSNPs.
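The pairwise correlation step described above can be sketched as follows. This is a minimal illustration and not the project's actual code: the function names, the peptide labels, the abundance values and the 0.3 threshold are all assumptions made for the example.

```python
from itertools import combinations

import numpy as np


def pairwise_peptide_correlations(abundance):
    """Pearson r for every pair of peptides belonging to one gene.

    abundance: dict mapping a peptide label to a list of per-sample
    abundances (every list uses the same sample order).
    """
    result = {}
    for pep_a, pep_b in combinations(sorted(abundance), 2):
        a = np.asarray(abundance[pep_a], dtype=float)
        b = np.asarray(abundance[pep_b], dtype=float)
        result[(pep_a, pep_b)] = float(np.corrcoef(a, b)[0, 1])
    return result


def mean_correlation_per_peptide(pair_corr):
    """Average each peptide's correlation with all other peptides of the gene."""
    collected = {}
    for (pep_a, pep_b), r in pair_corr.items():
        collected.setdefault(pep_a, []).append(r)
        collected.setdefault(pep_b, []).append(r)
    return {pep: float(np.mean(rs)) for pep, rs in collected.items()}


def flag_low_correlation(pair_corr, threshold=0.3):
    """Peptides whose mean correlation is below a (hypothetical) threshold."""
    means = mean_correlation_per_peptide(pair_corr)
    return sorted(pep for pep, m in means.items() if m < threshold)


# Made-up abundances for three peptides of one gene across five samples.
example_gene = {
    "PEPA": [10, 20, 30, 40, 50],
    "PEPB": [12, 19, 33, 41, 48],
    "PEPC": [30, 5, 28, 6, 31],  # does not track the other two peptides
}
correlations = pairwise_peptide_correlations(example_gene)
candidates = flag_low_correlation(correlations)
```

In this toy data, PEPA and PEPB rise together across samples while PEPC does not, so PEPC is the only peptide flagged as a candidate for closer inspection. A low correlation alone does not establish a cSNP, since other mechanisms can produce the same pattern.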

Pearson correlation analysis showed that the correlations between peptide pairs within a gene and the correlations between peptide pairs across genes are both normally distributed, and that the two distributions differ significantly from each other. We generated heatmaps from these correlations. In addition, we performed Fisher’s Exact Test, treating each peptide in each sample individually, to compute log odds, from which we generated a second set of heatmaps. Some genes’ heatmaps did provide a true indication of cSNPs. However, a significant number of heatmaps appeared to be false positives or false negatives. We think the reason may be that lack of correlation can also be caused by alternative splicing, enzymatic digestion, phosphorylation, etc.
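One way the per-peptide, per-sample Fisher’s Exact Test and log odds could be set up is sketched below. The abstract does not specify the contingency table, so the 2x2 layout used here (this peptide versus the other peptides of the gene, this sample versus all other samples) is an assumption, as are the function names and the 0.5 pseudocount.

```python
from math import comb, log2


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (not stated in the abstract):
      a = spectral count of this peptide in this sample
      b = spectral count of the gene's other peptides in this sample
      c = spectral count of this peptide in all other samples
      d = spectral count of the gene's other peptides in all other samples
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def log_odds(a, b, c, d, pseudocount=0.5):
    """log2 odds ratio; the pseudocount avoids division by zero for empty cells."""
    return log2(
        ((a + pseudocount) * (d + pseudocount))
        / ((b + pseudocount) * (c + pseudocount))
    )
```

Under this layout, a strongly negative log odds means the peptide is under-represented in that sample relative to the rest of its gene, which is the pattern one would look for in a heatmap of peptides by samples.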

To analyze the factors that can affect wild type peptide abundance, we used variant peptide spectral counts to check whether there is a relation between the variant peptide and the wild type peptide when both are detected in the same sample. The hypothesis of this step is that, in heterozygotes, the wild type allele is expressed so that the wild type peptide reaches at least the same abundance as the variant peptide, as a mechanism to maintain the normal phenotype. We evaluated the ratio of variant count to wild type count for each cSNP record and found that, for samples with both variant and wild type peptides, the log ratio was usually equal to or below 0, indicating that in heterozygotes the wild type peptide is expressed at a level not lower than that of the variant peptide. We then added the information of whether each sample was germline, somatic or non-variant to test whether the same behavior can also be observed in non-variant samples. Some examples showed that the relation between variant and wild type counts can indicate whether or not a sample carries the SNP.
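The variant to wild type comparison described above can be illustrated with a short sketch. The record fields (`sample`, `status`, `variant`, `wild_type`), the use of base-2 logarithms and the example counts are assumptions made for illustration and are not taken from the project data.

```python
from math import log2


def log_ratios(records):
    """log2(variant / wild type) for samples where both peptides were detected."""
    ratios = []
    for rec in records:
        variant, wild_type = rec["variant"], rec["wild_type"]
        if variant > 0 and wild_type > 0:
            ratios.append((rec["sample"], rec["status"], log2(variant / wild_type)))
    return ratios


def fraction_at_or_below_zero(ratios):
    """Per sample status, the fraction of log ratios that are <= 0."""
    flags_by_status = {}
    for _sample, status, value in ratios:
        flags_by_status.setdefault(status, []).append(value <= 0)
    return {
        status: sum(flags) / len(flags)
        for status, flags in flags_by_status.items()
    }


# Made-up spectral counts for one cSNP across five samples.
example_records = [
    {"sample": "S1", "status": "germline", "variant": 3, "wild_type": 6},
    {"sample": "S2", "status": "germline", "variant": 4, "wild_type": 4},
    {"sample": "S3", "status": "somatic", "variant": 2, "wild_type": 8},
    {"sample": "S4", "status": "non-variant", "variant": 1, "wild_type": 0},
    {"sample": "S5", "status": "somatic", "variant": 6, "wild_type": 3},
]
example_ratios = log_ratios(example_records)
example_summary = fraction_at_or_below_zero(example_ratios)
```

A log ratio at or below 0 means the wild type count is at least as large as the variant count. Sample S4 is skipped because its wild type peptide was not detected, so the ratio is undefined there.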

The correlations between peptide pairs within one gene differ from the correlations between peptide pairs in different genes, but the difference cannot be fully explained by the existence of cSNPs and isoforms. However, the observation that wild type peptide abundance does not drop in variant samples seems to provide a clue about the relation between variant and wild type in heterozygotes: the wild type allele is expressed at the same level as the variant allele, or even higher. We hope that this can serve as an indication of whether a sample with variant peptide counts is really a heterozygote or just an MS/MS false positive.