Predicting Protein Abundance from mRNA Data Using Machine Learning

Atsede Siba

Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University.

Date/Time: December 10th, 2024 at 12:00pm.

Abstract: Accurately predicting protein abundance is critical for understanding how mRNA expression relates to protein levels, for gaining insight into biological processes, and for advancing proteomics research. Traditional protein detection techniques, such as mass spectrometry (MS), are often expensive and are limited by sensitivity, detection thresholds, and quantitation inconsistencies. To overcome these limitations, this study uses data from the CPTAC DREAM Proteogenomics Challenge to develop machine learning models that predict protein abundance from mRNA expression data.

RNA-Seq and protein data were derived from breast and ovarian cancer studies conducted at three sites: the Broad Institute, Johns Hopkins University (JHU), and Pacific Northwest National Laboratory (PNNL). Each site used distinct experimental setups and internal reference samples, contributing to potential variability in protein quantitation. Features representing these collection sites were incorporated into the models to address this variability. Preprocessing steps included removing proteins with more than 50% missing values, removing RNA-Seq genes with any missing entries, and retaining only overlapping genes and proteins. After preprocessing, the RNA-Seq data comprised 132 samples with 11,835 genes, and the protein data comprised 156 samples with 6,816 proteins, of which 5,661 had a matching gene in the RNA-Seq data (the protein's self-gene).
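The preprocessing steps above can be sketched in pandas as follows. This is a minimal illustration, not the study's actual pipeline: the table layout (rows are samples, columns are gene symbols), the function name `preprocess`, and the toy values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def preprocess(rna: pd.DataFrame, protein: pd.DataFrame, max_missing: float = 0.5):
    """Filter and align RNA-Seq and protein tables.

    Assumed layout (hypothetical): rows are samples, columns are gene symbols.
    """
    # Drop proteins that are missing in more than `max_missing` of the samples.
    protein = protein.loc[:, protein.isna().mean(axis=0) <= max_missing]
    # Drop RNA-Seq genes that have any missing entry.
    rna = rna.dropna(axis=1, how="any")
    # Keep proteins whose own gene ("self-gene") survived in the RNA-Seq table,
    # and keep only samples measured in both tables.
    shared_genes = protein.columns.intersection(rna.columns)
    shared_samples = protein.index.intersection(rna.index)
    return rna.loc[shared_samples], protein.loc[shared_samples, shared_genes]


# Toy data, invented purely for illustration.
rna = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [0.5, np.nan, 1.5, 2.0], "C": [2.0, 1.0, 0.0, 1.0]},
    index=["s1", "s2", "s3", "s4"],
)
protein = pd.DataFrame(
    {"A": [0.1, np.nan, 0.3, 0.4], "C": [np.nan, np.nan, np.nan, 0.2], "D": [1.0, 1.1, 1.2, 1.3]},
    index=["s1", "s2", "s3", "s5"],
)

rna_clean, protein_clean = preprocess(rna, protein)
print(rna_clean.shape, protein_clean.shape)
```

In the toy tables, protein C is dropped for exceeding the missing-value threshold, gene B is dropped for having a missing RNA-Seq entry, and protein D is dropped because it has no self-gene in the RNA-Seq table. Collection-site features could then be appended as indicator columns, for example with `pd.get_dummies` on a site label.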

Using the combined dataset, feature selection techniques such as F-regression were applied to identify the most relevant genes for prediction and to reduce overfitting caused by irrelevant features. Results showed that the strongest predictor of a protein's abundance is the mRNA expression level of its corresponding gene. Linear regression models were trained using 10-fold cross-validation, and 50 samples were withheld for model validation. The models achieved a median correlation of 0.46.
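One way to combine F-regression feature selection, linear regression, 10-fold cross-validation, and a held-out validation set is sketched below with scikit-learn. It is an illustration under stated assumptions, not the study's code: the data are synthetic, gene 0 stands in for a protein's self-gene, and the number of selected features (`k=5`) and the hold-out size are arbitrary choices for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for one protein: 120 samples, 200 genes.
# Gene 0 plays the role of the self-gene (an assumption for this demo).
n_samples, n_genes = 120, 200
X = rng.normal(size=(n_samples, n_genes))
y = 0.8 * X[:, 0] + rng.normal(scale=0.5, size=n_samples)

# Withhold samples for validation (30 here; the size is illustrative).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=30, random_state=0
)

# Putting selection inside the pipeline refits it within every fold,
# so held-out samples never influence which genes are chosen.
model = make_pipeline(SelectKBest(score_func=f_regression, k=5), LinearRegression())

# 10-fold cross-validated predictions on the training samples.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
cv_pred = cross_val_predict(model, X_train, y_train, cv=cv)
cv_r = np.corrcoef(y_train, cv_pred)[0, 1]

# Refit on all training samples and score on the withheld samples.
model.fit(X_train, y_train)
test_r = np.corrcoef(y_test, model.predict(X_test))[0, 1]
selected = model.named_steps["selectkbest"].get_support(indices=True)

print("selected gene indices:", selected)
print("cross-validated Pearson r:", round(cv_r, 2))
print("held-out Pearson r:", round(test_r, 2))
```

Repeating this procedure for each protein and taking the median of the per-protein correlations would give a summary statistic of the same kind as the one reported above, assuming that figure is the correlation between predicted and observed abundance.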

These findings highlight the potential of machine learning to integrate transcriptomic and proteomic data for improving protein abundance predictions. Furthermore, the results underscore the importance of careful feature selection in predictive modelling.

Tagged: Fall 2024