Internship Presentations

Machine Learning to Predict Host Specificity and Geographic Origin of Salmonella Kentucky

Lauren McAllister

Mentor: Dr. Bradd Haley, Environmental Microbial & Food Safety Laboratory, U.S. Department of Agriculture.

Date/Time: August 21st, 2025 at 2:45 PM.

Abstract: Salmonella Kentucky is a polyphyletic serovar frequently isolated from food-producing animals in the United States. It is a significant cause of salmonellosis worldwide, and cases within the U.S. are typically associated with travel abroad. Rapid methods for identifying epidemiologically relevant characteristics, such as animal host and geographic origin, remain underdeveloped.

This study used machine learning and statistical methods to predict the animal host (bovine or poultry) and geographic origin (North America or not) of S. Kentucky isolates and to identify genomic features linked to host specificity. Core-genome SNPs, gene presence/absence, and intergenic features were extracted from genome assemblies and used to train ML algorithms. Various algorithms were tested, including Random Forest, Support Vector Machine, Logistic Regression, XGBoost, and K-Nearest Neighbor.

All five machine learning algorithms and genomic feature types achieved high performance for both classification tasks. Although the differences in performance between models were not statistically significant, XGBoost on core-genome SNPs scored highest on all performance metrics. For host prediction, the top model achieved F1 scores of 0.943 for poultry and 0.891 for bovine. The area under the ROC curve (AUC) was 0.923, indicating a strong ability to distinguish between classes. For geographic origin prediction, the top model achieved F1 scores of 0.981 for North America and 0.982 for not North America. The ROC AUC was near perfect at 0.994.
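For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how a per-class F1 score and a ROC AUC can be computed for a two-class host prediction task. It is not the study's code: the function names, the labels, the 0.5 decision threshold, and the six toy isolates are all assumptions made purely for illustration, and the numbers it produces are unrelated to the reported results.

```python
# Illustrative only: toy labels and scores, not data from the study.

def per_class_f1(y_true, y_pred, label):
    """F1 score for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores, positive):
    """ROC AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true hosts and model scores (predicted probability of poultry).
y_true = ["poultry", "poultry", "poultry", "bovine", "bovine", "poultry"]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7]

# Assumed decision threshold of 0.5 to turn scores into class labels.
y_pred = ["poultry" if s >= 0.5 else "bovine" for s in scores]

f1_poultry = per_class_f1(y_true, y_pred, "poultry")
f1_bovine = per_class_f1(y_true, y_pred, "bovine")
auc = roc_auc(y_true, scores, "poultry")
print(f1_poultry, f1_bovine, auc)
```

F1 is reported per class because it depends on which class is treated as positive, whereas the ROC AUC summarizes how well the scores rank the two classes across all thresholds, which is why a single AUC accompanies two F1 values for each task.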

The top machine learning models were used to estimate the proportion of human S. Kentucky cases attributed to different hosts and geographic origins. The majority of the U.S. human clinical cases were predicted to be from poultry, though bovine sources remained prevalent. In ST198, the majority of U.S. human clinical cases were predicted to be acquired from outside of the U.S.

These findings demonstrate the strong capability of machine learning models to predict the animal host and geographic origin of Salmonella isolates from core-genome SNPs. These models may assist in tracing sources of S. Kentucky contamination in produce and human infections. Statistically significant genomic features associated with host specificity were also identified; this information may be used to identify genomic targets for S. Kentucky carriage mitigation.
