Analysis of PySam Versioning Changes

Patrick Chen

Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center.

Date/Time: August 25, 2020 at 2:20pm

Abstract: ReadCounts is a computational framework for counting the reads that carry particular nucleotides at target genomic positions. It uses a statistical test to recognize the allele counts expected for homozygous and heterozygous loci. PySam is the Python module that readCounts uses to read and handle mapped short-read sequence data stored in SAM/BAM format; it is built on the samtools (htslib) software suite. The latest version of readCounts produces results that are inconsistent with those of previous versions. We hypothesized that changes in PySam’s API are responsible for the inconsistency, which casts doubt on the results computed by older readCounts releases.

To test our hypothesis, we first verified that the number of reads observed at each locus changed when different versions of the PySam module were used. Second, we created a test script to list all the reads returned by each version of PySam at the target loci, and a comparison script to identify whether each read was common to both versions or unique to one. We also reviewed the changes described in the PySam and samtools documentation and GitHub code repositories, and attempted to understand the effect of new parameters and defaults. We built a simple machine learning (Naive Bayes) model to determine the most common properties of the reads that were dropped or added by the new version of PySam. Lastly, we built a script that showed the detailed properties of the reads returned by each PySam version, to further identify which reads were handled inconsistently.
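The comparison step described above can be sketched as a set comparison over read identifiers. This is only an illustrative sketch, not the project's actual script: the function name `compare_reads`, the result keys, and the identifiers in the demo are hypothetical, and it assumes each read is represented by a string that is unique per alignment (for example, read name plus mate number).

```python
def compare_reads(reads_old, reads_new):
    """Classify read identifiers as common to both lists or unique to one.

    reads_old / reads_new: iterables of identifier strings, e.g. one list
    per PySam version for the same locus (hypothetical input format).
    """
    old, new = set(reads_old), set(reads_new)
    return {
        "common": old & new,    # returned by both versions
        "only_old": old - new,  # dropped by the newer version
        "only_new": new - old,  # added by the newer version
    }


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    old_version = ["read1/1", "read2/1", "read3/2"]
    new_version = ["read1/1", "read3/2"]
    result = compare_reads(old_version, new_version)
    for label in ("common", "only_old", "only_new"):
        print(label, sorted(result[label]))
```

Reads that land in `only_old` are the ones whose properties are worth inspecting in detail, since they are the reads the newer version no longer returns.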

We discovered a specific parameter, min_base_quality, whose default value changed from zero to 13, significantly reducing the number of reads returned by the PySam API. We also explored the effect of the stepper parameter and the properties of the specific reads filtered out when the samtools stepper was used. With this new understanding of the parameters of the PySam API, we were able to get identical results from both versions of PySam and from the samtools mpileup command-line tool. We have updated readCounts to expose these key parameters so that users can set them, and we are working with collaborators to establish the most appropriate parameters for a given NGS data type and experimental design.
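As a rough illustration of setting these parameters explicitly rather than relying on version-dependent defaults, the sketch below counts the bases covering one position. It is not the readCounts implementation: the function name `count_bases` and the file name in the usage comment are hypothetical, the keyword names follow the `pileup` method of recent PySam releases, and older releases may not accept `min_base_quality` at all.

```python
from collections import Counter


def count_bases(bam, contig, pos, min_base_quality=0, stepper="nofilter"):
    """Count the bases covering one 0-based position.

    `bam` is assumed to behave like an open, indexed pysam.AlignmentFile.
    """
    counts = Counter()
    columns = bam.pileup(
        contig, pos, pos + 1,
        truncate=True,                      # only the requested column
        stepper=stepper,                    # "nofilter", "all" or "samtools"
        min_base_quality=min_base_quality,  # newer PySam defaults to 13
    )
    for column in columns:
        for read in column.pileups:
            if read.is_del or read.is_refskip:
                continue  # this read has no base at the position
            base = read.alignment.query_sequence[read.query_position]
            counts[base] += 1
    return counts


# Usage sketch (requires pysam and an indexed BAM; file name is made up):
#   import pysam
#   with pysam.AlignmentFile("sample.bam") as bam:
#       print(count_bases(bam, "chr1", 12345, min_base_quality=0))
#       print(count_bases(bam, "chr1", 12345, min_base_quality=13))
```

Passing both values explicitly makes the counts independent of whichever defaults the installed PySam version happens to use, which is the same idea as exposing the parameters to readCounts users.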

Tagged: Summer 2020