ChIP-Seq data analysis pipeline development in a breast cancer study

Bioinformatics Internship Presentation

Nan Hu (Mentors: Dr. Robert Clarke and Lu Jin, Department of Oncology and Lombardi Comprehensive Cancer Center, Georgetown University)

August 30th, 2016, 2:00pm, Room 1300, Harris Building

According to the American Cancer Society, estrogen receptor-α (ERα) positive breast cancers account for over two-thirds of the breast cancers diagnosed in the United States. Antiestrogens, such as ICI 182,780 (ICI, also called fulvestrant), are currently the most widely used treatment for postmenopausal women. However, more than half of all ERα-positive breast tumors show resistance to antiestrogen therapy, which eventually leads to tumor recurrence. A better understanding of the mechanisms of endocrine resistance is therefore critical. To that end, chromatin immunoprecipitation (ChIP) can be used to pull down specific protein:DNA complexes. These complexes reveal the binding sites occupied by specific transcription factors (TFs), cofactors, or other chromatin-associated proteins, as well as the locations of histone modifications. Next-generation sequencing, in turn, offers enormous potential for assaying the DNA fragments recovered from those complexes. Combining the two, ChIP-seq can rapidly transform our ability to understand in vivo protein-DNA interactions on a genome-wide scale.

In this study, three treatments (Vehicle, E2, and ICI) were applied to two breast cancer cell lines: LCC1 (antiestrogen sensitive) and LCC9 (antiestrogen resistant). Under each treatment, ChIP experiments were performed with four different immunoprecipitation (IP) targets: H3K27ac, H3K4me1, H3K4me3, and ERα. Two replicates were included for each IP. The input was a mixture of the four different IP samples, and this input data was used as the control in the subsequent analysis. Even though ChIP-seq is superior to the ChIP-chip method, problems such as counting short reads and other biases intrinsic to the experimental procedures still exist. It is thus critical to develop an effective analysis workflow for processing ChIP-seq data in order to ensure correct inference of biologically meaningful information.
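The experimental design described above can be enumerated programmatically, which is a convenient way to check that no library is missing before analysis. The following is a minimal sketch, not the sample sheet actually used in the study: the `name` pattern is a hypothetical naming scheme, and the input (control) libraries are left out because their number is not stated here.

```python
from itertools import product

# Factors taken from the study design; the spelling "ERa" is used for file-name safety.
CELL_LINES = ["LCC1", "LCC9"]                        # sensitive / resistant
TREATMENTS = ["Vehicle", "E2", "ICI"]
IP_TARGETS = ["H3K27ac", "H3K4me1", "H3K4me3", "ERa"]
REPLICATES = [1, 2]


def sample_sheet():
    """Return one record per IP library in the full factorial design."""
    return [
        {
            "cell_line": cell,
            "treatment": treatment,
            "ip": ip,
            "replicate": rep,
            # Hypothetical naming scheme, for illustration only.
            "name": f"{cell}_{treatment}_{ip}_rep{rep}",
        }
        for cell, treatment, ip, rep in product(
            CELL_LINES, TREATMENTS, IP_TARGETS, REPLICATES
        )
    ]


samples = sample_sheet()
print(len(samples))  # 2 cell lines x 3 treatments x 4 IPs x 2 replicates = 48
```

Each record can then be matched against the FASTQ files on disk, so that a missing or duplicated library is caught before quality control starts.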

The first step in this analysis workflow was a quality check of the raw FASTQ files using FastQC (version 0.11.5), a quality control tool designed for high-throughput sequence data. Repeated adaptor sequences were observed in almost all samples. Thus, a Perl script called Trim Galore! was used to trim the adaptor sequences from the raw FASTQ files in order to improve data quality. A second quality check was performed after trimming to make sure that no redundant sequences remained. Data with too much contamination were excluded from the subsequent analysis. Second, Bowtie2 (version 2.2.9) was used to map reads from the trimmed FASTQ files to the reference genome, hg19 (Feb. 2009, GRCh37). The reference genome was downloaded from the UCSC Genome Browser and ordered from chromosome 1 to 22, followed by the sex chromosomes X and Y. All random chromosomes and haplotypes were excluded. Third, enriched regions were identified with peak calling software. Numerous programs have been developed for different research purposes, such as SISSRs and MACS for transcription factor binding sites, CCAT for histone modifications, and QuEST. In this study, MACS2 (version 2.1.0) was used, with parameters adjusted to suit our data. Finally, bedmap (a module in BEDOPS v2.4.20) was used to annotate the peak calling results. Comparing the annotated results across the different samples will subsequently be used to validate the differentially expressed candidate genes identified in a previous study. The workflow described above has also been implemented as a semi-automatic shell script. Further experiments should be performed to confirm the results.
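The order of the steps above can be sketched as a small command builder. This is an illustration under stated assumptions, not the shell script developed in this work: it assumes single-end, gzipped FASTQ input, an input control that has already been aligned, default MACS2 settings (the adjusted parameters are not given here), and a pre-sorted gene annotation BED file. The function names, file names, and directory layout are all hypothetical.

```python
import shlex
import subprocess
from pathlib import Path


def build_commands(sample, fastq, control_sam, bt2_index, genes_bed, outdir="results"):
    """Return the pipeline steps as (argv, stdout_path) pairs without running them."""
    out = Path(outdir)
    # Assumption: gzipped single-end input, for which Trim Galore usually
    # writes <basename>_trimmed.fq.gz into the output directory.
    stem = Path(fastq).name
    for suffix in (".fastq.gz", ".fq.gz"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
            break
    trimmed = out / f"{stem}_trimmed.fq.gz"
    sam = out / f"{sample}.sam"
    peaks = out / f"{sample}_peaks.narrowPeak"  # MACS2 narrow-peak output name
    sorted_peaks = out / f"{sample}_peaks.sorted.bed"
    annotated = out / f"{sample}_annotated.bed"
    return [
        (["fastqc", "-o", str(out), str(fastq)], None),      # QC on raw reads
        (["trim_galore", "-o", str(out), str(fastq)], None),  # adaptor trimming
        (["fastqc", "-o", str(out), str(trimmed)], None),     # QC after trimming
        (["bowtie2", "-x", str(bt2_index), "-U", str(trimmed), "-S", str(sam)], None),
        (["macs2", "callpeak", "-t", str(sam), "-c", str(control_sam),
          "-f", "SAM", "-g", "hs", "-n", sample, "--outdir", str(out)], None),
        (["sort-bed", str(peaks)], sorted_peaks),             # bedmap needs sorted input
        (["bedmap", "--echo", "--echo-map-id-uniq", "--delim", "\t",
          str(sorted_peaks), str(genes_bed)], annotated),     # annotate peaks with gene IDs
    ]


def run(commands, outdir="results"):
    """Execute the steps in order, stopping at the first failure."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    for argv, stdout in commands:
        if stdout is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout, "w") as handle:
                subprocess.run(argv, check=True, stdout=handle)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    steps = build_commands(
        "LCC1_E2_ERa_rep1",                 # hypothetical sample name
        "reads/LCC1_E2_ERa_rep1.fastq.gz",  # hypothetical FASTQ path
        "results/LCC1_E2_input.sam",        # hypothetical aligned input control
        "hg19",                             # hypothetical Bowtie2 index prefix
        "hg19_genes.sorted.bed",            # hypothetical sorted annotation file
    )
    for argv, stdout in steps:
        print(shlex.join(argv) + (f" > {stdout}" if stdout else ""))
```

Keeping command construction separate from execution makes a dry run possible, which suits a semi-automatic workflow in which the quality reports are inspected by hand before mapping and peak calling proceed.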