Replacing Pedigree Error Analysis, Using Phased Linked Reads to Identify False Genotypes

Bioinformatics Internship Presentation

Zhezhen Wang (Mentors: Dr. Justin Zook, NIST and Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University)

August 30th, 2016, 2:00pm, Room 1300, Harris Building

As a project of the Genome in a Bottle Consortium, this research develops methods to identify errors in next-generation sequencing (NGS) data. In previous research by Platinum Genomes, errors were identified by sequencing and phasing a three-generation family comprising 4 grandparents, 2 parents and 11 grandchildren. This method is expensive and requires a large family, which is usually not available. For clinical sequencing, which requires high-accuracy data, identifying errors in individually sequenced genomes is important. We hypothesize that phased linked reads from 10X Genomics in a single individual can be used to identify many of the same errors identified by phased pedigree analysis.

In this project, we use phased linked reads from the mother in the Platinum Genomes pedigree (NA12878). There is a low coverage set and a high coverage set for this sample. Two callsets from the Platinum Genomes pedigree analysis were used for training. The “good calls” are phased variants that were inherited as expected in all family members. The “bad calls” are variants with inconsistent inheritance haplotypes, which may be sequencing errors. These two sets are further categorized by zygosity and by consistency with the rest of the pedigree. To separate false genotypes from true ones, several methods were tried on a training set from chromosome 22 with cross validation. Of these methods, random forest, a machine learning method, gave the best results.

Using random forest and 10-fold cross validation, with counts, proportions or both as features in the model, 90% of “bad calls” among homozygous variants can be separated without classifying any “good calls” as “bad”, for both the low coverage set and the high coverage set. About 80% of “bad calls” among heterozygous variants can be separated without classifying more than 0.5% of “good calls” as “bad”. Adding distance and angle to counts and proportions improves performance slightly for high coverage sites. To identify false genotypes without first splitting the dataset by zygosity, we proposed two models: one uses the minimum of the random forest scores from models trained separately on homozygous variants and on heterozygous variants; the other uses a mix of homozygous and heterozygous variants as the training set. Using chromosome 21 as a test set, both models perform well: for the low coverage set, they identify over 80% of “bad calls” at high coverage sites while classifying less than 0.5% of “good calls” as “bad”. For the high coverage set, over 85% of “bad calls” at high coverage sites can be separated without classifying any “good calls” as “bad”. These results indicate that phased linked reads can be used to identify false genotypes as we hoped, so the methods are worth developing further. Future work could wrap this process and extend the SNP-based analysis to indels.
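The minimum-score model and the two figures reported above (the fraction of “bad calls” caught and the fraction of “good calls” wrongly flagged) can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the project. It assumes that each random forest score is the estimated probability that a call is “good”, so that taking the minimum flags a call whenever either model distrusts it. The function names, the toy scores and the 0.5 threshold are all hypothetical.

```python
def combined_score(hom_score, het_score):
    """Combine the two per-model scores by taking their minimum.

    Assumption: a higher score means the call is more likely "good",
    so the minimum is the more pessimistic of the two opinions.
    """
    return min(hom_score, het_score)


def evaluate(scores, is_bad, threshold):
    """Return (fraction of bad calls flagged, fraction of good calls flagged).

    A call is flagged as "bad" when its score falls below the threshold.
    """
    flagged = [s < threshold for s in scores]
    n_bad = sum(is_bad)
    n_good = len(is_bad) - n_bad
    caught = sum(1 for f, b in zip(flagged, is_bad) if f and b)
    false_alarms = sum(1 for f, b in zip(flagged, is_bad) if f and not b)
    bad_rate = caught / n_bad if n_bad else 0.0
    good_rate = false_alarms / n_good if n_good else 0.0
    return bad_rate, good_rate


# Toy, made-up scores for five variants (not project data).
hom_scores = [0.95, 0.90, 0.20, 0.85, 0.60]  # model trained on homozygous variants
het_scores = [0.90, 0.97, 0.70, 0.10, 0.92]  # model trained on heterozygous variants
is_bad = [False, False, True, True, False]   # pedigree-derived labels

combined = [combined_score(a, b) for a, b in zip(hom_scores, het_scores)]
bad_rate, good_rate = evaluate(combined, is_bad, threshold=0.5)
print(combined)
print(bad_rate, good_rate)
```

In this toy example each “bad call” is distrusted by only one of the two models, yet both fall below the threshold once the minimum is taken. Lowering the threshold trades fewer wrongly flagged “good calls” for fewer “bad calls” caught, which is the trade-off the percentages above describe.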