Ortholog Inference using Multiple-Sequence Alignment
Zekai Ding
Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center.
Date/Time: August 25, 2020 at 4:20pm
Abstract: The goal of this project is to infer orthologous genes from contig-based transcripts of the mosquito species Wyeomyia smithii, which lacks functional gene annotations. We do so by using multiple-sequence alignments with sequences from the well-annotated, related mosquito species Culex (pipiens) quinquefasciatus, Aedes albopictus, Anopheles gambiae and Aedes aegypti. Because these species are not very closely related, the initial alignments require relaxed search parameters; using multiple-sequence alignments to verify sequence membership in orthologous clusters is expected to compensate for this.
We downloaded complete transcripts of the well-annotated species from VectorBase. Starting from a seed transcript, we use tblastx to retrieve homologous sequences from the other species and then carry out a multiple-sequence alignment using ClustalX. The resulting guide tree can then be checked for consistency with the expected phylogenetic relationships between the species. For the four well-annotated species, we use VectorBase ortholog sequence clusters to develop the procedure, to establish a global expected distance matrix for these species, and to define a criterion, based on the ClustalX guide tree distance matrix, for identifying multiple-sequence alignments that are consistent with the expected phylogeny. We then use known, highly conserved genes to establish an expected distance matrix that includes W. smithii.
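As a rough illustration of the distance-matrix check described above, the sketch below compares the pairwise distances observed for one sequence cluster against an expected matrix. It is a minimal sketch under stated assumptions, not the procedure used in the project: the root-mean-square criterion, the function names, the threshold, and every distance value are hypothetical placeholders chosen only to reflect the general expectation that the two Aedes species are closest and Anopheles gambiae is the most distant.

```python
from itertools import combinations
from math import sqrt

# Species labels used as matrix keys (abbreviations are our own).
SPECIES = ["Ae. aegypti", "Ae. albopictus", "Cx. quinquefasciatus", "An. gambiae"]


def pair(a, b):
    """Order-independent key for a species pair."""
    return frozenset((a, b))


def rms_deviation(observed, expected, species=SPECIES):
    """Root-mean-square difference between two pairwise distance matrices.

    Both matrices are dicts keyed by frozenset({species_a, species_b}).
    """
    pairs = [pair(a, b) for a, b in combinations(species, 2)]
    total = sum((observed[p] - expected[p]) ** 2 for p in pairs)
    return sqrt(total / len(pairs))


def is_consistent(observed, expected, threshold=0.1):
    """Hypothetical criterion: accept a cluster whose distances stay close
    to the expected matrix. The threshold is an illustrative assumption."""
    return rms_deviation(observed, expected) <= threshold


# Placeholder expected distances (NOT measured values).
EXPECTED = {
    pair("Ae. aegypti", "Ae. albopictus"): 0.2,
    pair("Ae. aegypti", "Cx. quinquefasciatus"): 0.4,
    pair("Ae. albopictus", "Cx. quinquefasciatus"): 0.4,
    pair("Ae. aegypti", "An. gambiae"): 0.6,
    pair("Ae. albopictus", "An. gambiae"): 0.6,
    pair("Cx. quinquefasciatus", "An. gambiae"): 0.6,
}

# A cluster that tracks the expected phylogeny (every distance shifted by 0.05).
good_cluster = {p: d + 0.05 for p, d in EXPECTED.items()}

# A cluster in which the two Aedes sequences are unexpectedly distant,
# as might happen if one of them is a paralog rather than an ortholog.
bad_cluster = dict(EXPECTED)
bad_cluster[pair("Ae. aegypti", "Ae. albopictus")] = 0.7

print(is_consistent(good_cluster, EXPECTED))
print(is_consistent(bad_cluster, EXPECTED))
```

In practice the observed matrix would be read from the ClustalX guide tree output for each cluster, and the expected matrix would be derived from the VectorBase ortholog clusters rather than typed in by hand.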
Our experiments demonstrate that the multiple-sequence alignment guide tree can readily distinguish good ortholog sequence clusters, which exhibit the expected phylogeny, from sequence clusters that do not. We expect this process to aid in the annotation of transcripts for sequenced species that lack a well-annotated, closely related organism.
Tagged: Summer 2020