Analysis of STAT-GAS Motif Binding Prediction Accuracy with AlphaFold3

Aster Rajesh

Mentor: Dr. Markus Hoffmann, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University.

Date/Time: August 22nd, 2025 at 9:30 AM.

Abstract: Signal transducer and activator of transcription (STAT) proteins are transcription factors that relay signals from cytokines and growth factors to the nucleus, where they regulate immune-response gene expression. Upon activation by Janus kinase (JAK)-mediated phosphorylation, STATs dimerize and bind gamma-activated sequence (GAS) motifs, which have the consensus sequence TTCN₃GAA, in promoters and enhancers. Although STAT family members share structural features, they differ in activation triggers, biological functions, and DNA target specificity. In this study, canonical and isoform sequences of human STAT1, STAT3, and STAT5A were used to assess AlphaFold3 binding predictions against GAS motifs with the AlphaFold3 server’s multi-chain protein–DNA modeling capability. The interaction is examined specifically in the mammary gland, and binding may differ in other tissues. While AlphaFold3 has demonstrated high accuracy in protein structure prediction, benchmarking for DNA-binding specificity is limited, particularly for distinguishing verified binding sequences from non-binding variants. Candidate GAS motif sites were identified using FIMO (Find Individual Motif Occurrences) from the Hoffmann et al. 2025 dataset on the hg38 human reference genome. FIMO, a tool in the MEME Suite, scans the hg38 genome for instances of the GAS motif that these STAT proteins can locate and bind, which provides a list of potential binding sites to work with. The STAT protein sequences were retrieved from UniProt under accessions P42224 (STAT1), P40763 (STAT3), and P42229 (STAT5A, referred to below as STAT5). UniProt houses canonical and isoform sequences under the same accession, with a numeric suffix such as -1 or -2 denoting the isoform version; the isoform entries used here were P42224-1, P40763-1, and P42229-1 for STAT1, STAT3, and STAT5, respectively.
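As a rough illustration of what searching for the GAS consensus TTCN₃GAA means, the sketch below matches the consensus literally with a regular expression. It is not FIMO, which scores position weight matrices and reports p-values; the function name `find_gas_sites` and the example sequence are hypothetical. Because the consensus is its own reverse complement, scanning one strand also covers the other.

```python
import re

# GAS consensus TTCN3GAA: TTC, any three bases, then GAA.
# The lookahead lets overlapping sites be reported as well.
GAS_PATTERN = re.compile(r"(?=(TTC[ACGT]{3}GAA))")


def find_gas_sites(sequence):
    """Return (start, end, site) for each consensus match (0-based, end-exclusive)."""
    seq = sequence.upper()
    return [(m.start(1), m.end(1), m.group(1)) for m in GAS_PATTERN.finditer(seq)]


# Hypothetical example sequence, not taken from the study data.
example = "ggaTTCCCGGAAacgtTTCTAAGAAtt"
for start, end, site in find_gas_sites(example):
    print(start, end, site)
```

A literal consensus match treats every position as equally important, so it can only approximate the candidate list that a matrix-based scanner would produce.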
The FIMO hits were flanked by ±200 bp to preserve local sequence context. Experimental STAT ChIP-seq narrowPeak files were then processed to quantify binding confidence. A narrowPeak file can be likened to a GPS, with each entry giving the address of a binding site. Each entry has 10 fields, and the 9th field holds the q value (stored, per the narrowPeak format, as -log10 of the q value), which is essentially a confidence level for that entry’s binding site. The files also contain peak scores, numerical values that correspond to the strength and confidence of the detected binding site. These scores reflect signal enrichment (how strong the signal was compared to noise), sequencing depth coverage (how many reads aligned to that region), and binding confidence. The narrowPeak files for each STAT type were filtered into three groups based on the q-value field: >20, 0–20, and 0, corresponding to high-confidence binding, low-confidence binding, and no-confidence binding. The cutoff of 20 was chosen because it is far beyond common cutoffs (on the -log10 scale, q = 0.05 corresponds to about 1.3), so peaks at this level have a very high chance of being significant. This conservative threshold may miss weaker sites but should produce fewer false positives. Peaks in the 0–20 group were essentially “maybe” regions: binding was not confirmed, but there was lower-confidence evidence for it. This group also makes it possible to probe whether AlphaFold3 can discriminate between borderline motif/peak cases. The final group, with a value of 0, was intended as a control: its q value is nonsignificant, so these sites could be assumed not to bind, at least under this condition and in mammary tissue.
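The flanking and stratification steps described above can be sketched as follows. This is a minimal illustration, not the study's actual script: the function names and the three example records are hypothetical, and sending a value of exactly 20 to the low-confidence group is an assumption, since the text gives the groups only as >20, 0–20, and 0. The narrowPeak format uses -1 in the q-value field when no q value was assigned, so such records are skipped here.

```python
def flank(start, end, pad=200):
    """Extend a 0-based interval by `pad` bp on each side, clamping at 0."""
    return max(0, start - pad), end + pad


def stratify_narrowpeak(lines, cutoff=20.0):
    """Group narrowPeak records by the 9th field (-log10 q value)."""
    groups = {"high": [], "low": [], "zero": []}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        q_score = float(fields[8])  # 9th column, index 8
        if q_score > cutoff:
            key = "high"
        elif q_score > 0:
            key = "low"  # assumption: exactly 20 counts as low confidence
        elif q_score == 0:
            key = "zero"
        else:
            continue  # -1 means no q value was assigned
        groups[key].append((chrom, start, end, q_score))
    return groups


# Hypothetical records for illustration only.
records = [
    "chr1\t1000\t1200\tpeak1\t500\t.\t12.5\t30.1\t25.3\t100",
    "chr2\t5000\t5300\tpeak2\t120\t.\t4.2\t8.0\t6.7\t150",
    "chr3\t100\t400\tpeak3\t10\t.\t1.1\t0.5\t0\t90",
]
groups = stratify_narrowpeak(records)
print({name: len(items) for name, items in groups.items()})
print(flank(1000, 1009))
```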
Overlapping the flanked FIMO hits with these stratified narrowPeak datasets allowed the extraction of genomic regions with high-confidence (>20) or low-confidence (0–20) support for STAT binding. For the 0 group, unlike the >20 and 0–20 groups, no entries were extracted from the narrowPeak files, which indicates that the files contained no sites for STAT1, STAT3, or STAT5 at which binding was not expected under this condition or in this tissue sample (mammary gland). The extracted regions were then converted into FASTA format to serve as inputs for AlphaFold3, keeping the ±200 bp flank around each overlapping STAT–GAS motif, producing a comprehensive dataset of 1,200 total protein–DNA predictions (100 sequences × three STAT types × two binding-score conditions × canonical and isoform variants). The predictions were summarized in confusion matrices, from which accuracy, precision, and sensitivity were calculated. Accuracy is (TP + TN) / (TP + FP + FN + TN), precision is TP / (TP + FP), and sensitivity is TP / (TP + FN), where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives. In the context of this study, accuracy is the fraction of all binding and non-binding calls that are correct, precision is the fraction of predicted binding sites that are true binding sites (it falls only when non-binding sites are modeled as binding), and sensitivity is the fraction of true binding sites that are correctly modeled as binding. Results were generated by submitting 100 randomly selected overlapping FASTA sequences for each condition, high confidence (>20) and low confidence (0–20), in combination with each STAT type (STAT1, STAT3, and STAT5). The no-confidence group (0), as mentioned previously, yielded no usable output from the narrowPeak filtering and so provided no AlphaFold3 input.
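The overlap and random-selection steps can be illustrated in plain Python. This is a sketch under stated assumptions, not the study's pipeline: the workflow used BEDTools for the overlap, whereas this version does a simple quadratic scan; the function names, coordinates, and fixed seed are hypothetical; and intervals are assumed to be 0-based and half-open, as in BED-style files.

```python
import random


def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def hits_in_peaks(flanked_hits, peaks):
    """Keep each flanked motif hit that overlaps at least one peak."""
    return [hit for hit in flanked_hits if any(overlaps(hit, p) for p in peaks)]


def sample_regions(regions, n=100, seed=0):
    """Draw up to n regions without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(regions, min(n, len(regions)))


# Hypothetical coordinates for illustration only.
flanked_hits = [("chr1", 800, 1209), ("chr1", 5000, 5409), ("chr2", 800, 1209)]
peaks = [("chr1", 1100, 1300), ("chr2", 2000, 2200)]

supported = hits_in_peaks(flanked_hits, peaks)
print(supported)
print(len(sample_regions(supported)))
```

Fixing the seed makes the choice of 100 regions repeatable, which matters when the same inputs have to be resubmitted over many days.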
For the canonical sequences, 100 overlapping FASTA sequences from each of the high-confidence and low-confidence groups were submitted to AlphaFold3 with each of the three STAT types, for a total of 600 test cases. The results were counted and represented in confusion matrices. With the high-confidence sequences, canonical STAT1, STAT3, and STAT5 each yielded 100/100 predicted binding: AlphaFold3 predicted that every sequence would bind, consistent with the affinity of these STATs for the GAS motifs present in the overlapping sequences. With the low-confidence sequences, canonical STAT1, STAT3, and STAT5 yielded 33/100, 11/100, and 46/100 predicted binding, respectively, so AlphaFold3 predicted some binding in the low-confidence group but far less than in the high-confidence group. The confusion matrices were also used to calculate accuracy, precision, and sensitivity for each STAT type and confidence group. STAT1 had an accuracy of 1.00, a precision of 1.00, and a sensitivity of 1.00 in the high-confidence condition, and an accuracy of 0.665, a precision of 1.00, and a sensitivity of 0.33 in the low-confidence condition. STAT3 had an accuracy of 1.00, a precision of 1.00, and a sensitivity of 1.00 in the high-confidence condition, and an accuracy of 0.555, a precision of 1.00, and a sensitivity of 0.11 in the low-confidence condition. Finally, STAT5 had an accuracy of 1.00, a precision of 1.00, and a sensitivity of 1.00 in the high-confidence condition, and an accuracy of 0.730, a precision of 1.00, and a sensitivity of 0.46 in the low-confidence condition.
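The reported low-confidence statistics can be reproduced from the accuracy, precision, and sensitivity formulas with a short calculation. The abstract gives only the number of predicted binders per 100 sequences, so the confusion-matrix counts used here are an assumption: FP = 0 and TN = 100 are inferred because they reproduce the reported accuracies (for example, (33 + 100) / 200 = 0.665), not because the source states them. The function name is hypothetical.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, sensitivity) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, precision, sensitivity


# Predicted binders out of 100 low-confidence sequences (canonical proteins).
predicted_binding = {"STAT1": 33, "STAT3": 11, "STAT5": 46}

for name, tp in predicted_binding.items():
    # Assumption: FP = 0 and TN = 100, inferred from the reported accuracies.
    acc, prec, sens = confusion_metrics(tp=tp, fp=0, fn=100 - tp, tn=100)
    print(f"{name}: accuracy={acc:.3f} precision={prec:.2f} sensitivity={sens:.2f}")
```

Under these counts precision stays at 1.0 whenever at least one binder is predicted, because precision falls only when false positives occur; the drop in the low-confidence group comes entirely from false negatives.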
For the high-confidence group, these three statistics indicate that AlphaFold3 models binding reliably, labeling binding sites as binding without false positives. For the low-confidence group, accuracy and sensitivity are lower while precision remains 1.0, indicating that AlphaFold3 produced no false positives but correctly labeled far fewer of the binding sites as binding (fewer true positives). Overall, for the canonical STAT sequences, accuracy is perfect in the high-confidence group and moderate in the low-confidence group (0.555 to 0.730), precision is high throughout, showing that AlphaFold3 is able to avoid false positives, and sensitivity is low in the low-confidence group specifically, which could mean that AlphaFold3 struggles with weak or borderline STAT–GAS binding interactions. For the isoform proteins, which are truncated versions of the canonical sequences, all 600 test cases were submitted to AlphaFold3 and the results were compiled into confusion matrices; binding was predicted in 0/600 cases, across all three STAT types and both confidence groups. The confusion matrix summaries indicate that AlphaFold3 robustly identifies high-confidence canonical STAT–DNA interactions but struggles to recognize low-confidence binding and fails entirely to recognize isoform–DNA interactions, likely due to structural deviations affecting DNA contact regions. It is worth noting that these STAT–GAS interactions were drawn from the mammary gland, so the binding trend above may not hold in other cell types, where binding accuracy could be quite different.
The use of three canonical and three truncated isoform sequences yields the key observation that AlphaFold3 did not model the isoforms with the specificity or accuracy seen for the three canonical STAT protein sequences. Another concept of interest is phantom peaks, false-positive ChIP-seq signals that appear consistently in certain genomic regions and produce artificially high enrichment scores. In this study, phantom peaks could mislead the STAT–GAS motif overlap analysis and cause AlphaFold3 to model non-biological, artifactual binding events as if they were genuine. Several key challenges were encountered in the workflow. One was the uncertainty in interpreting AlphaFold3’s confidence metrics for DNA–protein docking, which led to adopting the same confidence intervals used in the Hoffmann et al. paper. The cutoff of 20 worked well in that paper because it gave an appropriate amount of coverage and depth; a peak of 20 or higher is a safe interval for high-confidence binding because its q value is highly significant, making it a generally reliable binding indicator. Another difficulty was conceptualizing the overlap of motifs and STAT peaks with BEDTools, extracting the sequences, and the coding process in general. The biggest challenge was the sheer number of test cases needed to yield a usable amount of results for each STAT type (isoform or canonical) and confidence interval: with 100 test cases per combination, 1,200 inputs to AlphaFold3 were required. This volume exposed the main computational constraint on running large-scale predictions, the limit of 30 tokens per day, which meant that only 30 sequences could be run per day and the full set needed at least 40 days of submissions.
Future work could involve targeted cutting of binding sites followed by RNA-seq in model organisms such as mice, to test whether neighboring genes are upregulated or downregulated, which would indicate a functional site. Such experiments could take multiple years but could validate the computational predictions. In summary, this workflow from raw genomic data to structural modeling (integrating motif scanning, ChIP-seq peak filtering, sequence extraction, and AlphaFold3 multi-chain prediction) demonstrates that canonical STAT sequences mostly follow expected binding trends across confidence thresholds; for the mammary gland, however, no binding should have been observed at the lower confidence threshold. Isoforms, by contrast, are not recognized as binding-competent under current AlphaFold3 modeling parameters. These findings highlight both the promise and the present limitations of structure-prediction tools such as AlphaFold3 in computational DNA-binding specificity analysis.
References:
AlphaFold Protein Structure Database. n.d. “AlphaFold Protein Structure Database.” Accessed August 17, 2025. https://www.alphafold.ebi.ac.uk.
Ambrosio, Raffaele, Giorgia Fimiani, Jlenia Monfregola, Emma Sanzari, Nicola De Felice, Maria Carolina Salerno, Claudio Pignata, Michele D’Urso, and Matilde Valeria Ursini. 2002. “The Structure of Human STAT5A and B Genes Reveals Two Regions of Nearly Identical Sequence and an Alternative Tissue Specific STAT5B Promoter.” Gene 285 (1–2): 311–18.
Bailey, Timothy L., Mikael Boden, Fabian A. Buske, Martin Frith, Charles E. Grant, Luca Clementi, Jingyuan Ren, Wilfred W. Li, and William S. Noble. 2009. “MEME SUITE: Tools for Motif Discovery and Searching.” Nucleic Acids Research 37 (Web Server issue): W202-8.
Barrett, T., S. E. Wilhite, P. Ledoux, C. Evangelista, I. F. Kim, M. Tomashevsky, … R. Edgar. 2013. “NCBI Gene Expression Omnibus (GEO): Archive for Functional Genomics Data Sets—Update.” Nucleic Acids Research 41 (D1): D991–D995. https://www.ncbi.nlm.nih.gov/geo.
Bonham, Andrew J., Nikola Wenta, Leah M. Osslund, Aaron J. Prussin 2nd, Uwe Vinkemeier, and Norbert O. Reich. 2013. “STAT1:DNA Sequence-Dependent Binding Modulation by Phosphorylation, Protein:Protein Interactions and Small-Molecule Inhibition.” Nucleic Acids Research 41 (2): 754–63.
Crooks, Gavin E., Gary Hon, John-Marc Chandonia, and Steven E. Brenner. 2004. “WebLogo: A Sequence Logo Generator.” Genome Research 14 (6): 1188–90.
DeepMind. 2024. AlphaFold Protein Structure Database. Computer software. European Bioinformatics Institute. https://www.alphafold.ebi.ac.uk.
ENCODE Project Consortium. 2012. “An Integrated Encyclopedia of DNA Elements in the Human Genome.” Nature 489 (7414): 57–74.
Hoffmann, Markus, Tiago Vaz, Shreeti Chhatrala, and Lothar Hennighausen. 2025. “Data-Driven Projections of Candidate Enhancer-Activating SNPs in Immune Regulation.” BMC Genomics 26 (1): 197.
Levy, David E., and J. E. Darnell Jr. 2002. “Stats: Transcriptional Control and Biological Impact.” Nature Reviews. Molecular Cell Biology 3 (9): 651–62.
Lewis, H. Dan, Ashley Winter, Thomas F. Murphy, Snehlata Tripathi, Virendra N. Pandey, and Beverly E. Barton. 2008. “STAT3 Inhibition in Prostate and Pancreatic Cancer Lines by STAT3 Binding Sequence Oligonucleotides: Differential Activity between 5’ and 3’ Ends.” Molecular Cancer Therapeutics 7 (6): 1543–50.
OpenAI. 2025. ChatGPT (GPT-5). Large language model. https://chat.openai.com.
Schindler, C., and J. E. Darnell Jr. 1995. “Transcriptional Responses to Polypeptide Ligands: The JAK-STAT Pathway.” Annual Review of Biochemistry 64 (1): 621–51.
The UniProt Consortium. 2025. UniProt: The Universal Protein Knowledgebase. Database. https://www.uniprot.org.
Zhang, Yong, Tao Liu, Clifford A. Meyer, Jérôme Eeckhoute, David S. Johnson, Bradley E. Bernstein, Chad Nusbaum, et al. 2008. “Model-Based Analysis of ChIP-Seq (MACS).” Genome Biology 9 (9): R137.

Tagged: Summer 2025; Summer 2025 #2