AI model prediction of LiY deletion and validation

Enlei Zhu

Mentor: Dr. Markus Hoffmann, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center.

Date/Time: August 21st, 2025 at 3:30 PM.

Abstract: The JAK-Stat pathway is a crucial cell signaling pathway involved in a variety of cellular processes, such as growth and immune response(3). The Stat family of proteins (Signal Transducers and Activators of Transcription) is essential to this pathway and regulates the expression of genes vital to cell growth. Stat5a and Stat5b are two important Stat family proteins that regulate processes such as growth, mammary gland development, and lipid metabolism(7). Previous studies have shown that a mutation in the SH2 domain of the Stat5b protein can cause dramatic changes in the level of Stat5 phosphorylation as well as significant gain or loss of function(6). Observations in mice have shown that engineered mice carrying the LiY deletion (Leu 666, Ile 667, and Tyr 668) completely lose the ability to produce milk after giving birth, despite an otherwise normal pregnancy. This usually results in the death of the offspring, as the pups cannot be fed. In this project, we aim to explore the mechanism behind the failure of lactation in mice with the LiY deletion and to explain both the structural and functional differences between a normal Stat5a protein and one that contains the LiY deletion. To achieve this, we use a variety of tools to explore and validate the connection between the mutation and the inability to lactate. These include several mutation-pathogenicity scoring tools to predict the effects of the mutation, as well as AlphaFold 3 (AF3), a widely used and highly accurate AI model, to predict the structural change of the protein and its ability to dimerize, which is crucial for its ability to regulate downstream processes that may be important for milk production.

Material and Method

A variety of pathogenicity-scoring tools were used to generate an initial estimate of the effect of the LiY deletion. These include both web-based prediction tools and locally deployed tools that run either on a Windows desktop or on a Linux virtual machine. All the tools used in this section take the same input: a VCF file that contains the genome name (GRCm39) and the location of the deletion, sourced from the UCSC Genome Browser(8). Four tools were used: CADD, MMSplice, SpliceAI, and SnpEff. Most tools were run with default input parameters, with minor alterations for SnpEff discussed below. CADD integrates diverse annotation sources, including protein language models and regulatory scores, and assigns a PHRED-like score to the deletion, reflecting its predicted deleteriousness compared to variants observed in healthy individuals(10). The PHRED-like score is scaled so that each 10-point increase corresponds to a 10-fold smaller fraction of variants predicted to be at least as deleterious: a score of 10 corresponds to the top 10 percent of variants, and a score of 20 to the top 1 percent. MMSplice models splicing outcomes based on the local sequence context. The tool produces a delta logit PSI score, indicating potential alterations to exon inclusion levels(1). To complement MMSplice, we also used SpliceAI, a deep learning model trained to predict cryptic splice site gains and losses from long-range sequence context. Although SpliceAI is optimized for SNVs, we formatted the deletion appropriately to obtain predictions of acceptor/donor gain or loss probabilities(5). Lastly, to annotate the variant at the gene level specifically for mouse, we used SnpEff with the GRCm38.99 mouse transcript database, since a database for the newer GRCm39 assembly was not available for SnpEff. SnpEff provided a functional classification for the deletion (e.g., in-frame deletion, missense) and linked it to specific Stat5a isoforms(2). Note that because SnpEff uses a command-line interface, the genome database (GRCm38.99) must be specified on the command line in addition to the default input.
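
As an illustration of the shared VCF input described above, the following is a minimal Python sketch of how a multi-base deletion is conventionally written in VCF: POS points at the base immediately before the deleted bases, REF holds that anchor base plus the deleted bases, and ALT holds the anchor base alone. The chromosome name, coordinates, and sequence below are invented toy values, not the actual Stat5a locus, and the function names are hypothetical.

```python
def deletion_record(chrom, ref_seq, seq_start, del_start, del_len):
    """Build the eight fixed VCF columns for a deletion.

    ref_seq   : reference bases covering the deletion plus one base upstream
    seq_start : 1-based genomic coordinate of ref_seq[0]
    del_start : 1-based genomic coordinate of the first deleted base
    del_len   : number of deleted bases (9 for a three-codon in-frame deletion)
    """
    offset = del_start - seq_start          # 0-based index of first deleted base
    if offset < 1:
        raise ValueError("ref_seq must include one base before the deletion")
    deleted = ref_seq[offset:offset + del_len]
    if len(deleted) != del_len:
        raise ValueError("ref_seq does not cover the whole deletion")
    anchor = ref_seq[offset - 1]            # base kept in both REF and ALT
    pos = del_start - 1                     # VCF POS is the anchor base
    return (chrom, pos, ".", anchor + deleted, anchor, ".", ".", ".")


def to_vcf(records, reference="GRCm39"):
    """Render records as VCF text with a minimal header."""
    lines = [
        "##fileformat=VCFv4.2",
        f"##reference={reference}",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for rec in records:
        lines.append("\t".join(str(field) for field in rec))
    return "\n".join(lines) + "\n"


# Toy example only: placeholder chromosome, coordinates, and bases.
toy_ref = "GACCTGATCTACGTG"
record = deletion_record("chrToy", toy_ref, seq_start=100, del_start=103, del_len=9)
vcf_text = to_vcf([record])
```

A deletion whose length is a multiple of three, as here, removes whole codons and is what annotation tools report as an in-frame deletion.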

After the initial pathogenicity prediction for the deletion of interest, we used AF3 to predict the structural effect of the deletion. To achieve this, we used ChimeraX, which offers the option to create a temporary Google Colab notebook that takes the exact sequence to be predicted as input. Since the sequence is relatively long, a virtual environment is needed. The result of this prediction was then imported into ChimeraX and color coded by domain. These steps were repeated for both the wild-type sequence and the sequence containing the LiY deletion. Lastly, to evaluate the AF3 prediction, we analyzed RNA-sequencing data from mice using nf-core's RNA-seq processing pipeline(3), followed by DESeq2 for normalization and differential expression analysis and biomaRt for converting Ensembl identifiers to gene names. The data contained RNA expression levels for both wild-type and LiY deletion mice at two stages: day 18 of pregnancy and day 1 of lactation. For this analysis, an AWS virtual machine was used, and a sample sheet was generated that pairs the RNA data files according to their .fasta file names. All required packages were then installed according to the nf-core guidelines for running the pipeline, and the standard command was executed from a command-line interface. Note that the nf-core/rnaseq pipeline was executed with the GRCm38 genome rather than the newer GRCm39 genome. This is because the AWS iGenomes S3 bucket does not currently contain the GRCm39 genome assembly, and downloading the whole genome to the virtual machine and referencing it in multiple processes of the pipeline would take up too much RAM and disk space(4). After the pipeline finished, the output of star-salmon, a step included in the pipeline, was extracted for all 24 samples and collected in a single folder.
The resulting files were then processed in R with packages such as biomaRt and DESeq2. The files were first combined, then the default Ensembl identifiers were converted to gene names via biomaRt, and expression levels were combined where multiple Ensembl identifiers mapped to the same gene name. They were then processed with DESeq2, which tests for differential expression using negative binomial generalized linear models. The results are presented as two key statistics: log2FoldChange, which represents the difference in mRNA expression between WT and LiY deletion mice for a given gene, and the adjusted p-value, which indicates how confident we can be that this difference is statistically significant after correction for multiple testing. The results were then filtered by gene family of interest. These include the Stat family of genes and the Csn family of genes, the latter of which encodes casein, an essential milk protein whose genes are known to be regulated by Stat proteins. These expression values were then converted into a heatmap that shows the extent of up- or downregulation of these genes in relation to the LiY deletion.
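
The identifier-collapsing step can be sketched as follows. The analysis itself was carried out in R with biomaRt; this Python version is only a hedged illustration of the idea, and it assumes that "combining" means summing counts per sample, which the text does not state explicitly. The identifiers, gene names, and counts are toy values, and `collapse_by_gene` is a hypothetical helper.

```python
def collapse_by_gene(counts, id_to_name):
    """Merge rows whose Ensembl IDs map to the same gene name.

    counts     : {ensembl_id: [count for each sample]}
    id_to_name : {ensembl_id: gene_name}; unmapped IDs keep their own ID
    Returns {gene_name: [summed count for each sample]}.
    """
    collapsed = {}
    for ens_id, values in counts.items():
        name = id_to_name.get(ens_id, ens_id)
        if name in collapsed:
            # Assumption: duplicates are combined by summing per sample.
            collapsed[name] = [a + b for a, b in zip(collapsed[name], values)]
        else:
            collapsed[name] = list(values)
    return collapsed


# Toy data: two placeholder IDs share one gene name, one ID is unmapped.
toy_counts = {
    "ENS_TOY_0001": [10, 20, 30],
    "ENS_TOY_0002": [1, 2, 3],
    "ENS_TOY_0003": [5, 5, 5],
}
toy_map = {"ENS_TOY_0001": "GeneA", "ENS_TOY_0002": "GeneA"}
merged = collapse_by_gene(toy_counts, toy_map)
```

Falling back to the original identifier for unmapped rows keeps every row in the table, so no counts are silently dropped before differential expression testing.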

Results and Discussion

Scoring Tools

SpliceAI predicted only modest splice-altering effects, with the highest delta scores being 0.04 for donor gain at +681 bp and 0.03 for acceptor loss at -33 bp. SpliceAI reports four delta scores that range between 0 and 1 (donor gain, donor loss, acceptor gain, and acceptor loss), each representing the probability that the variant is splice-altering. Both values indicate a weak chance that this variant is splice-altering, as they fall well below the 0.5-0.6 threshold for confident splice-altering events. These low scores suggest that this deletion is unlikely to disrupt canonical splice sites, although the weak donor/acceptor signals leave open a subtle impact on splicing. MMSplice predicted a consistent delta logit PSI of 0.1331 across all affected transcripts, indicating a modest predicted effect on exon inclusion. The variant's predicted pathogenicity score of 0.3208 further suggests a non-negligible but not strongly deleterious effect on splicing regulation. From a broader functional perspective, CADD scored the deletion with a PHRED-scaled score of 19.76, placing it at approximately the top 1% of potentially deleterious variants in the human genome. This supports the hypothesis that the LiY deletion may have functional relevance, especially given its classification as an in-frame coding deletion in a conserved domain (ConsScore: 6, PhyloP: 0.997). Finally, SnpEff, applied to the mouse genome for comparative purposes, annotated the deletion as an in-frame deletion with moderate impact in the Stat5a transcript.
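
The relationship between a PHRED-scaled score and the "top fraction" of variants can be checked with a short calculation. The sketch below assumes the standard PHRED convention, score = -10 * log10(fraction); the function names are hypothetical and are not part of the CADD software.

```python
import math


def phred_to_top_fraction(score):
    """Fraction of variants ranked at least as deleterious as this score."""
    return 10 ** (-score / 10)


def top_fraction_to_phred(fraction):
    """Inverse mapping: PHRED-scaled score for a given top fraction."""
    return -10 * math.log10(fraction)


# Score reported for the LiY deletion in the text above.
liy_fraction = phred_to_top_fraction(19.76)
print(f"PHRED 19.76 -> top {liy_fraction * 100:.2f}% of variants")
```

Under this convention a score of 19.76 sits just short of the 20-point mark, which is why the deletion is described as roughly, rather than exactly, in the top 1%.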

AlphaFold 3

AF3 multimer predictions reveal a complete divergence between the wild-type Stat5a dimer and the mutant carrying the LiY deletion. In the wild-type, three of the five predicted models show a stable SH2–SH2 interaction, with model #3 forming a compact dimer that resembles a "lung-like" geometry. This conformation aligns with known active forms of Stat dimers, where SH2 domain contacts are critical for cytokine signaling, DNA binding, and nuclear translocation. For both the WT and LiY deletion sequences, AF3 provided five models from differently trained neural networks, which account for variation and uncertainty in the structural prediction. Interestingly, all five predicted structures of the LiY deletion mutant fail to reproduce this SH2–SH2 interface and the lung-like shape. Instead, the SH2 domains remain spatially separated, resulting in an open, extended structure that more closely resembles the boat-like configuration seen in model 1 of the wild-type. This predicted structure of the LiY Stat5a dimer is similar to the structure of the unphosphorylated STAT5a dimer reported in a 2005 paper(9), which dimerizes through its beta barrel domains and forms a "boat-like" shape with the SH2 and linker domains of both monomers pointing to the same side. Since this form of the Stat5a dimer has very little SH2 domain interaction and is structurally very different from the active form, it is reasonable to argue that the LiY deletion likely causes severe damage to the protein's ability to regulate and activate downstream transcription of genes such as the Csn family. The Csn family of genes codes for casein, which accounts for the majority of the protein content in most mammalian milk. Expression of the Csn genes is therefore essential for proper mammary gland development and lactation.

nf-core/rnaseq and gene mapping

When comparing WT and LiY deletion mice at the two stages of pregnancy/lactation, we identified significant differences between the two groups. The Jak family of genes was significantly upregulated in the lactation day 1 (L1) comparison, whereas a similar difference was not observed in the pregnancy day 18 (P18) group. However, at both stages, the Csn2 and Csn3 genes are severely downregulated in LiY deletion mice compared to WT. Across all samples, the log2FC values for Csn2, Csn3, and Csn1s2b average around -3 to -4, corresponding to a reduction in RNA levels of roughly 8- to 16-fold, or on the order of 10-fold. This is consistent with previous knowledge that Stat5 binds to the promoter regions of Csn2 and Csn3, so the effect of this deletion (as represented by log2FoldChange) is greatly amplified in Csn2 and Csn3 expression compared to the expression of the Stat genes themselves. This result is also consistent with our understanding of the JAK-Stat pathway: lower expression of Stat5 RNAs leads to lower levels of Stat5a/Stat5b complexes, which may ultimately lead to higher levels of Jak transcription, as the Jak family of proteins are upstream regulators of the Stat family.
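
Because log2 fold changes are on a logarithmic scale, converting them back to linear fold changes makes the size of the effect easier to read. The following is a minimal sketch of that arithmetic; the helper name is hypothetical, and the inputs are the approximate range quoted above rather than per-gene values from the dataset.

```python
def fold_reduction(log2fc):
    """Convert a negative log2 fold change into a linear fold reduction."""
    return 2 ** (-log2fc)


# Approximate range quoted in the text for Csn2, Csn3, and Csn1s2b.
low = fold_reduction(-3)     # 8-fold reduction
high = fold_reduction(-4)    # 16-fold reduction
mid = fold_reduction(-3.5)   # about 11.3-fold reduction
print(f"log2FC of -3 to -4 corresponds to a {low:.0f}- to {high:.0f}-fold reduction")
```

The midpoint of the range, a log2FC of -3.5, corresponds to about an 11-fold reduction, which is where the "on the order of 10-fold" summary comes from.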

Limitations and Challenges

Due to the lack of available tools that score the pathogenicity of multi-base-pair deletions, the tools we originally proposed for predicting the effect of this deletion, such as PolyPhen2, were unable to produce any valid result. A new set of tools was therefore used to evaluate the effect of this deletion. However, all of these tools except SnpEff were designed and trained on the human genome, so their predictions for this deletion in mice might be inaccurate.

Result Discussion

Even though the different pathogenicity-scoring tools generated only mild to moderate scores for the LiY deletion, both AF3 and the experimentally obtained RNA expression data suggest a very significant alteration in both the structure and function of the Stat5a dimer. This is consistent with the observation that mice with this deletion were unable to produce milk at the stage when wild-type mice would. The structural change predicted by AF3, together with the analyzed RNA expression data, offers an explanation for the failure of lactation observed in vivo and strongly connects the structural disruption of Stat5a dimerization to the inability of these mice to activate casein gene transcription during lactation. We can reasonably suggest that, for mutations with similar structural and functional consequences, AF3 may provide a more accurate picture of the consequences of the mutation than traditional scoring tools. Together, these results highlight the important role of SH2-mediated dimerization of Stat5a in mammary gland development and milk production. Looking forward, the methodology discussed here, which uses AI models to predict structural shifts in different variants and RNA expression data analysis as validation, could serve as a cheaper and more efficient approach to analyzing the consequences of nontrivial variants.

References:

1. Cheng, Jun, Thi Yen Duong Nguyen, Kamil J. Cygan, Muhammed Hasan Çelik, William G. Fairbrother, Žiga Avsec, and Julien Gagneur. 2019. “MMSplice: Modular Modeling Improves the Predictions of Genetic Variant Effects on Splicing.” Genome Biology 20 (1): 48.

2. Cingolani, Pablo, Adrian Platts, Le Lily Wang, Melissa Coon, Tung Nguyen, Luan Wang, Susan J. Land, Xiangyi Lu, and Douglas M. Ruden. 2012. “A Program for Annotating and Predicting the Effects of Single Nucleotide Polymorphisms, SnpEff: SNPs in the Genome of Drosophila Melanogaster Strain W1118; Iso-2; Iso-3.” Fly 6 (2): 80–92.

3. Hu, Xiaoyi, Jing Li, Maorong Fu, Xia Zhao, and Wei Wang. 2021. “The JAK/STAT Signaling Pathway: From Bench to Clinic.” Signal Transduction and Targeted Therapy 6 (1): 402.

4. “iGenomes.” n.d. Accessed August 17, 2025. https://support.illumina.com/sequencing/sequencing_software/igenome.html.

5. Jaganathan, Kishore, Sofia Kyriazopoulou Panagiotopoulou, Jeremy F. McRae, Siavash Fazel Darbandi, David Knowles, Yang I. Li, Jack A. Kosmicki, et al. 2019. “Predicting Splicing from Primary Sequence with Deep Learning.” Cell 176 (3): 535-548.e24.

6. Lee, Hye Kyung, Jichun Chen, Rachael L. Philips, Sung-Gwon Lee, Xingmin Feng, Zhijie Wu, Chengyu Liu, et al. 2024. “STAT5B Leukemic Mutations, Altering SH2 Tyrosine 665, Have Opposing Impacts on Immune Gene Programs.” bioRxiv. https://doi.org/10.1101/2024.12.20.629685.

7. Liu, X., G. W. Robinson, K. U. Wagner, L. Garrett, A. Wynshaw-Boris, and L. Hennighausen. 1997. “Stat5a Is Mandatory for Adult Mammary Gland Development and Lactogenesis.” Genes & Development 11 (2): 179–86.

8. “Mouse Mm10 Chr4:100,642,083-101,775,172 UCSC Genome Browser V485.” n.d. Accessed August 17, 2025. https://genome.ucsc.edu/cgi-bin/hgTracks?db=mm10&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr4:100642083-101775172&hgsid=2932749352_jk1pEz6jmEvawjt3RVF945FCrmnD.

9. Neculai, Dante, Ana Mirela Neculai, Sophie Verrier, Kenneth Straub, Klaus Klumpp, Edith Pfitzner, and Stefan Becker. 2005. “Structure of the Unphosphorylated STAT5a Dimer.” The Journal of Biological Chemistry 280 (49): 40782–87.

10. Schubach, Max, Thorben Maass, Lusiné Nazaretyan, Sebastian Röner, and Martin Kircher. 2024. “CADD v1.7: Using Protein Language Models, Regulatory CNNs and Other Nucleotide-Level Scores to Improve Genome-Wide Variant Predictions.” Nucleic Acids Research 52 (D1): D1143–54.

Tagged: Summer 2025; Summer 2025 #2