Seth Commichaux (Mentor: Dr. Karen Ross, Biochemistry and Molecular & Cellular Biology, Georgetown University)
August 26th, 2016, 2:00pm, Room 1202, Harris Building
Our preliminary research found that many classification systems for human genetic diseases, such as the World Health Organization's International Classification of Diseases (ICD-10) and the Disease Ontology, are organized around diagnosis. While this is useful for medical practitioners, who need to distinguish diseases based on symptoms and practical tests, it often provides little insight into the molecular and pathway relationships of genetic diseases to one another. Efforts have been made to elucidate these relationships; one example is the Human Disease Network (Goh et al., 2007), a network of human genetic diseases in which nodes are diseases and edges represent affected genes that two diseases have in common. As an alternative to current classification methods, we examined human genetic diseases at the level of UniProt-annotated protein features, in the hope of revealing relationships between diseases that would otherwise go unnoticed. The data came from a Protein Information Resource (PIR) project that mapped ClinVar disease-associated SNPs to UniProt protein features. The methods employed include hierarchical clustering, heatmap analysis, chi-square and Fisher tests, network analysis, data filtering, and review of the scientific literature. Only diseases with OMIM identifiers were used; to help reveal general patterns in the data, these OMIM entries were grouped according to the general disease categories defined by the Human Disease Network. Fisher tests with multiple-testing correction were used to compare the distribution of protein features in our pathological dataset against the overall distribution of protein features in UniProt.
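The enrichment test described above can be sketched as follows. This is a minimal illustration and not the study's actual pipeline: it assumes that "Fisher test" refers to a two-sided Fisher's exact test on a 2x2 table per protein feature and that the multiple-testing correction is Benjamini-Hochberg, neither of which is stated in the abstract. The function names and all counts in `toy_counts` are hypothetical and invented for illustration only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables no more probable than the observed one
    # (small tolerance so that exact ties are counted).
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts, invented for illustration (NOT results from this study).
# Each tuple: (disease-category SNPs in the feature, disease-category SNPs elsewhere,
#              background sites in the feature, background sites elsewhere)
toy_counts = {
    "calcium-binding": (12, 88, 30, 970),
    "active site": (5, 95, 45, 955),
    "transmembrane": (20, 80, 200, 800),
}

features = list(toy_counts)
raw_p = [fisher_exact_two_sided(*toy_counts[f]) for f in features]
adj_p = benjamini_hochberg(raw_p)

for feature, p, q in zip(features, raw_p, adj_p):
    print(f"{feature}: p={p:.3g}, BH-adjusted p={q:.3g}")
```

A feature would be called enriched or depleted for a disease category when its adjusted p-value falls below a chosen threshold; the direction follows from comparing the two proportions in its table.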
We found that several SNP-affected protein features were enriched in particular disease categories: for example, calcium-binding and intramembrane features in cardiovascular diseases, active sites in metabolic diseases, and nucleotide-binding features in neurological diseases. We also found that the cytoplasmic and extracellular domains of membrane proteins were consistently under-represented in the pathological dataset. Long QT Syndrome was chosen as a use case for feature-level classification. Whereas the subtypes of Long QT were named, presumably in order of discovery, as each new disease-involved gene was found, we used hierarchical clustering to show relationships between the subtypes according to the protein features affected in each. Overall, we present a potential new method for classifying human genetic diseases at the protein feature level, some evidence that certain protein features are enriched for certain disease categories, and an indication that this method might be useful for classifying the subtypes of a disease.
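The feature-level clustering of disease subtypes can be illustrated with a small sketch. It assumes Jaccard distance between the sets of affected protein features and average-linkage agglomeration; the abstract does not state which distance or linkage was used. The subtype labels and feature sets are hypothetical placeholders, not the Long QT data.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of affected protein features."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (left, right, distance)."""
    clusters = [(name,) for name in sorted(profiles)]
    merges = []

    def cluster_distance(c1, c2):
        # Average pairwise distance between members of the two clusters.
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(jaccard_distance(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        left, right = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        merges.append((left, right, cluster_distance(left, right)))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(tuple(sorted(left + right)))
    return merges


# Hypothetical feature profiles, invented for illustration (NOT the study's data).
toy_profiles = {
    "subtype_A": {"transmembrane", "intramembrane"},
    "subtype_B": {"transmembrane", "intramembrane", "nucleotide binding"},
    "subtype_C": {"calcium-binding"},
    "subtype_D": {"calcium-binding", "repeat"},
}

merges = average_linkage(toy_profiles)
for left, right, dist in merges:
    print(f"merge {left} + {right} at distance {dist:.3f}")
```

Subtypes that merge at a small distance share most of their affected features, which is the kind of grouping a feature-level classification would surface regardless of the order in which the subtypes were named.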