Prediction of Relative Protein Abundance from mRNA Expression Data

Christopher Nguyen

Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University.

Date/Time: August 22nd, 2024 at 1:40pm.

Abstract: The expression of transcripts in cells has a substantial influence on protein abundance. However, much remains unknown about the relationship between mRNA expression and protein abundance. For example, correlations between mRNA and corresponding protein levels vary widely from gene to gene, with some CPTAC studies reporting median correlations between actual and predicted protein abundances of approximately 0.26. To investigate this relationship, multiple groups have turned to machine learning to predict protein abundance from RNA expression data. Using data supplied by the NCI-CPTAC DREAM Proteogenomics Challenge, a competitive machine learning challenge, we likewise constructed machine learning models to predict relative protein abundance, as published by the CPTAC TCGA Breast Cancer and Ovarian Cancer studies. Relative protein abundance was predicted from the RNA-Seq-based expression data from the corresponding TCGA studies for the same samples.

We first processed the Breast Cancer study data by filtering the protein set to those with mRNA expression data and eliminating proteins with more than 50% missing values. Only samples present in both the mRNA expression data and the protein data were retained. The resulting protein training data give the relative abundance of 8234 proteins across 77 samples, while the mRNA data give the expression of 8234 genes across the same 77 samples. Under 10-fold cross-validation, we fit a Random Forest Regression model for each protein and used it to predict relative abundance for the withheld samples. For each protein, we computed the correlation between predicted and provided relative protein abundance across samples. The statistic used to evaluate the overall performance of our training strategies was the median of these per-protein prediction correlations.
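The evaluation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: it assumes scikit-learn's RandomForestRegressor and KFold, Pearson correlation (the correlation type is not specified above), and a rule of simply dropping samples whose protein value is missing. The function names, the number of trees, and the synthetic demo data are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold


def cv_protein_correlation(X, y, n_splits=10, seed=0):
    """Correlation between out-of-fold predictions and observed abundance for one protein.

    X: samples x genes mRNA matrix; y: relative abundance of one protein (NaN = missing).
    """
    keep = ~np.isnan(y)  # assumption: samples with a missing value are dropped
    X, y = X[keep], y[keep]
    preds = np.empty_like(y, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        preds[test_idx] = model.predict(X[test_idx])  # predict withheld samples only
    # Pearson correlation is an assumption of this sketch.
    return float(np.corrcoef(preds, y)[0, 1])


def median_protein_correlation(mrna, protein, **kwargs):
    """Median over proteins of the per-protein cross-validated correlation."""
    cors = [cv_protein_correlation(mrna, protein[:, j], **kwargs)
            for j in range(protein.shape[1])]
    return float(np.median(cors))


# Synthetic demo: 40 samples, 6 genes, 3 proteins driven by the first 3 genes.
rng = np.random.default_rng(0)
demo_mrna = rng.normal(size=(40, 6))
demo_protein = demo_mrna[:, :3] + 0.3 * rng.normal(size=(40, 3))
demo_score = median_protein_correlation(demo_mrna, demo_protein)
```

Because every prediction comes from a model that never saw the corresponding sample, the per-protein correlation reflects held-out performance rather than fit to the training data.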

We explored the use of different feature selection methods and sought to understand whether the transcript expression from a protein's own gene is more useful for predicting its abundance than the transcript expression of other genes. Additionally, since protein abundance in mass spectrometry-based studies is measured relative to an internal reference sample, we attempted to incorporate this into our models by adding a feature to each protein model that represented this unknown reference value. Since the data from the Breast and Ovarian Cancer studies were collected at three CPTAC labs with different internal reference samples, we also added features representing the collection site of each sample in the combined dataset.

In constructing our models, we determined that overall performance improves when each protein's abundance is predicted using its corresponding mRNA expression as a feature, indicating that the two are closely linked even when other genes are more strongly correlated. Additionally, using a small number of genes selected by F-statistic can help improve model performance, while using all genes leads to overfitting. To test the protein models objectively, we trained on a combined Breast Cancer and Ovarian Cancer dataset (filtered to 156 samples x 5665 proteins), from which 50 samples had been withheld, and saved the models to a file. An independent auditor ran the models on the withheld samples and obtained a median protein correlation of 0.395, based on relative abundance predictions for 5665 proteins. Notably, this exceeds the average median correlation reported by participants in the CPTAC Challenge but falls short of the results achieved by the challenge winners and by other groups that have repeated the work. Though the source of this performance gap is unclear, we hypothesize that the size of our dataset is a significant factor. Accounting for the internal reference affected our results only slightly, potentially indicating that random forest regression is capable of accounting for differences in internal reference.
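One way the per-protein feature set discussed above might be assembled is sketched below: the protein's own gene is always kept, a small number of additional genes are chosen by F-statistic, and indicator columns mark the collection site. This is an illustrative sketch only. It assumes scikit-learn's f_regression as the F-statistic, one-hot encoding for the site features, and a hypothetical function name, site labels, and value of k; none of these details are confirmed by the abstract.

```python
import numpy as np
from sklearn.feature_selection import f_regression


def build_features(mrna, y, own_gene_idx, site_labels, k=10):
    """Return (feature matrix, selected gene indices) for one protein.

    mrna: samples x genes; y: protein abundance (no missing values in this sketch);
    own_gene_idx: column of the protein's own gene; site_labels: one label per sample.
    """
    f_stat, _ = f_regression(mrna, y)
    f_stat = np.where(np.isnan(f_stat), -np.inf, f_stat)  # rank undefined scores last
    ranked = np.argsort(f_stat)[::-1]
    # Always keep the protein's own gene, then fill up to k with the top-ranked others.
    others = [int(g) for g in ranked if g != own_gene_idx][: k - 1]
    genes = np.array([own_gene_idx] + others)
    # One indicator column per collection site (assumed encoding).
    labels = np.asarray(site_labels)
    sites = np.unique(labels)
    site_onehot = (labels[:, None] == sites[None, :]).astype(float)
    return np.hstack([mrna[:, genes], site_onehot]), genes


# Synthetic demo with hypothetical site labels.
rng = np.random.default_rng(0)
demo_mrna = rng.normal(size=(30, 20))
demo_y = 2.0 * demo_mrna[:, 5] + 0.1 * rng.normal(size=30)
demo_sites = np.array(["site_A", "site_B", "site_C"] * 10)
demo_X, demo_genes = build_features(demo_mrna, demo_y, own_gene_idx=2,
                                    site_labels=demo_sites, k=4)
```

In a cross-validated setting, the gene ranking would normally be computed on the training folds only, so that the withheld samples do not influence which genes are selected.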

Tagged: Summer 2024; Summer 2024 #3