Correction of Tn5 sequence bias from ATAC-seq data
Next we will correct paired-end ATAC-seq data from the Greenleaf group (Buenrostro et al., 2013) and naked DNA ATAC-seq generated by our group. Note that ATAC-seq uses Illumina’s Nextera kit to directly transpose sequencing adapters into accessible chromatin. ATAC-seq is unique among enzymatic accessibility assays because each transposition event inserts two adapters into the chromatin. Each Tn5 molecule can be pre-loaded with any combination of the paired-end 1 and paired-end 2 adapter.
Downloading and processing ATAC-seq data
First we will download SRA files from naked ATAC-seq, which is a genomic DNA isolation followed by standard ATAC-seq protocols. These data are paired-end, which necessitates splitting the SRA file into two fastq files. Next we exclude all instances of reads that align to chrM–this is not a problem with these data, but typical ATAC-seq on chromatin can yield a high fraction of reads aligning to chrM. We will treat the reads that align to the plus and minus strand differently, because of how the Tn5 recognition site is distinct for plus and minus reads. Note that we optimized the k-mer mask for ATAC-seq.
mkdir ~/ATAC_Walavalkar
cd ~/ATAC_Walavalkar
#ATAC Naked
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX243/SRX2438155/SRR5123141/SRR5123141.sra
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX243/SRX2438156/SRR5123142/SRR5123142.sra
fastq-dump --split-3 SRR5123141.sra
fastq-dump --split-3 SRR5123142.sra
mv SRR5123141_1.fastq C1_gDNA_PE1_rep1.fastq
mv SRR5123141_2.fastq C1_gDNA_PE2_rep1.fastq
mv SRR5123142_1.fastq C1_gDNA_PE1_rep2.fastq
mv SRR5123142_2.fastq C1_gDNA_PE2_rep2.fastq
gzip *fastq
wget https://raw.githubusercontent.com/igvteam/igv/master/genomes/sizes/hg38.chrom.sizes
plus_mask=NXNXXXCXXNNXNNNXXN
minus_mask=NXXNNNXNNXXCXXXNXN
for fq in *_PE1_rep1.fastq.gz do
name=$(echo $fq | awk -F"_PE1_rep1.fastq.gz" '{print $1}')
echo $name
bowtie2 -x ~/DNase_ENCODE/hg38 -1 $fq,${name}_PE1_rep2.fastq.gz -2 ${name}_PE2_rep1.fastq.gz,${name}_PE2_rep2.fastq.gz -S $name.sam
grep -v '\tchrM\t' $name.sam > $name.chrM.sam
samtools view -b $name.chrM.sam | samtools sort - $name
rm *sam
samtools view -bh -F 20 ${name}.bam > ${name}_plus.bam
samtools view -bh -f 0x10 ${name}.bam > ${name}_minus.bam
seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_plus.bam --kmer-mask ${plus_mask} --bw=${name}_plus_${plus_mask}-mer.bigWig --shift-counts --read-size=75 seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_minus.bam --kmer-mask ${minus_mask} --bw=${name}_minus_${minus_mask}-mer.bigWig --shift-counts --read-size=75 seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_minus.bam --no-scale --bw=${name}_no_scale_minus.bigWig --shift-counts --read-size=75
seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_plus.bam --no-scale --bw=${name}_no_scale_plus.bigWig --shift-counts --read-size=75
bigWigMerge ${name}_plus_${plus_mask}-mer.bigWig ${name}_minus_${minus_mask}-mer.bigWig ${name}_${plus_mask}_${minus_mask}_merged.bedGraph
bigWigMerge ${name}_no_scale_plus.bigWig ${name}_no_scale_minus.bigWig ${name}_no_scale_merged.bedGraph
sort -k1,1 -k2,2n ${name}_${plus_mask}_${minus_mask}_merged.bedGraph > ${name}_${plus_mask}_${minus_mask}_merged.sorted.bedGraph
sort -k1,1 -k2,2n ${name}_no_scale_merged.bedGraph > ${name}_no_scale_merged.sorted.bedGraph
bedGraphToBigWig ${name}_${plus_mask}_${minus_mask}_merged.sorted.bedGraph hg38.chrom.sizes ${name}_${plus_mask}_${minus_mask}_merged.bigWig bedGraphToBigWig ${name}_no_scale_merged.sorted.bedGraph hg38.chrom.sizes ${name}_no_scale_merged.bigWig
done
mkdir Naked
mv C1*merged.bigWig Naked
Processing GM12878 ATAC-seq and TF binding data for GM12878 cells
Perform the same processes for the original ATAC-seq data (Buenrostro et al., 2013). As we did in Section 2.4, we want to retrieve ChIP-seq data for GM12878 cells to plot composite profiles of ATAC signal.
#GM12878 Greenleaf
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX298/SRX298000/SRR891268/SRR891268.sra fastq-dump --split-3 SRR891268.sra
mv SRR891268_1.fastq ATAC_GM12878_PE1.fastq
mv SRR891268_2.fastq ATAC_GM12878_PE2.fastq
gzip ATAC_GM12878_PE1.fastq gzip ATAC_GM12878_PE2.fastq
plus_mask=NXNXXXCXXNNXNNNXXN minus_mask=NXXNNNXNNXXCXXXNXN
for fq in *_PE1.fastq.gz
do
name=$(echo $fq | awk -F"_PE1.fastq.gz" '{print $1}')
echo $name
bowtie2 -x ~/DNase_ENCODE/hg38 -1 $fq -2 ${name}_PE2.fastq.gz -S $name.sam grep -v '\tchrM\t' $name.sam > $name.chrM.sam
samtools view -b $name.chrM.sam | samtools sort - $name
samtools view -bh -F 20 ${name}.bam > ${name}_plus.bam
samtools view -bh -f 0x10 ${name}.bam > ${name}_minus.bam
seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_plus.bam --kmer-mask ${plus_mask} --bw=${name}_plus_${plus_mask}-mer.bigWig --shift-counts --read-size=50 seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_minus.bam --kmer-mask ${minus_mask} --bw=${name}_minus_${minus_mask}-mer.bigWig --shift-counts --read-size=50 seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_minus.bam --no-scale --bw=${name}_no_scale_minus.bigWig --shift-counts --read-size=50
seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_plus.bam --no-scale --bw=${name}_no_scale_plus.bigWig --shift-counts --read-size=50
bigWigMerge ${name}_plus_${plus_mask}-mer.bigWig ${name}_minus_${minus_mask}-mer.bigWig ${name}_${plus_mask}_${minus_mask}_merged.bedGraph bigWigMerge ${name}_no_scale_plus.bigWig ${name}_no_scale_minus.bigWig ${name}_no_scale_merged.bedGraph
sort -k1,1 -k2,2n ${name}_${plus_mask}_${minus_mask}_merged.bedGraph > ${name}_${plus_mask}_${minus_mask}_merged.sorted.bedGraph
sort -k1,1 -k2,2n ${name}_no_scale_merged.bedGraph > ${name}_no_scale_merged.sorted.bedGraph
bedGraphToBigWig ${name}_${plus_mask}_${minus_mask}_merged.sorted.bedGraph hg38.chrom.sizes ${name}_${plus_mask}_${minus_mask}_merged.bigWig bedGraphToBigWig ${name}_no_scale_merged.sorted.bedGraph hg38.chrom.sizes ${name}_no_scale_merged.bigWig
done
mkdir gm12878
mv *GM12878*merged.bigWig gm12878
url=http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhTfbs/
wget ${url}wgEncodeSydhTfbsGm12878Corestsc30189IggmusPk.narrowPeak.gz
wget ${url}wgEncodeSydhTfbsGm12878Ebf1sc137065StdPk.narrowPeak.gz
wget ${url}wgEncodeSydhTfbsGm12878Ctcfsc15914c20StdPk.narrowPeak.gz
url=http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/
wget ${url}wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep1.broadPeak.gz
grep -i -A 25 'MOTIF MA0138.2 REST' ~/DNase_ENCODE/motif_databases/JASPAR/JASPAR_CORE_2016.meme > REST_meme_temp.txt
grep -i -A 18 'MOTIF MA0154.3 EBF1' ~/DNase_ENCODE/motif_databases/JASPAR/JASPAR_CORE_2016.meme > EBF1_meme_temp.txt
grep -i -A 15 'MOTIF MA0079.3 SP1' ~/DNase_ENCODE/motif_databases/JASPAR/JASPAR_CORE_2016.meme > SP1_meme_temp.txt
grep -i -A 23 'MOTIF MA0139.1 CTCF' ~/DNase_ENCODE/motif_databases/JASPAR/JASPAR_CORE_2016.meme > CTCF_meme_temp.txt
head -9 ~/DNase_ENCODE/motif_databases/JASPAR/JASPAR_CORE_2016_vertebrates.meme > header_meme_temp.txt
for i in *meme_temp.txt
do
name=$(echo $i | awk -F"_meme" '{print $1}')
cat header_meme_temp.txt $i > ${name}_minimal_meme.txt
done
rm *temp.txt
for peak in *Gm12878*Peak.gz
do
name=$(echo $peak | awk -F"TfbsGm12878" '{print $NF}' | awk -F"." '{print $1}')
unz=$(echo $peak | awk -F".gz" '{print $1}')
echo $name
gunzip $peak
echo $unz
liftOver $unz ~/DNase_ENCODE/hg19ToHg38.over.chain $name.hg38.narrowPeak $name.hg38.narrow.unmapped.txt -bedPlus=6 fastaFromBed -fi ~/DNase_ENCODE/hg38.fa -bed $name.hg38.narrowPeak -fo $name.hg38.fasta
gzip ${name}*Peak
done
mv Ctcfsc15914c20StdPk.hg38.fasta CTCF.hg38.fasta
mv Sp1Pcr1xPkRep1.hg38.fasta SP1.hg38.fasta
mv Ebf1sc137065StdPk.hg38.fasta EBF1.hg38.fasta
mv Corestsc30189IggmusPk.hg38.fasta REST.hg38.fasta
for meme in *.hg38.fasta
do
name=$(echo $meme | awk -F".hg38.fasta" '{print $1}')
echo $name
mast ${name}_minimal_meme.txt $meme -hit_list -mt 0.0005 > ${name}_mast.txt
done
for meme in *.hg38.fasta
do
name=$(echo $meme | awk -F".hg38.fasta" '{print $1}')
echo $name
fimo --thresh 0.0001 --text ${name}_minimal_meme.txt ~/DNase_ENCODE/hg38.fa > ${name}_fimo.txt
grep -v chrM ${name}_fimo.txt > ${name}_noM_fimo.txt
rm ${name}_fimo.txt
mv ${name}_noM_fimo.txt ${name}_fimo.txt
done
Plotting ATAC composites using R
For each transcription factor motif we will plot the ATAC signal using the GM12878 and Naked DNA data as we did in Section 2.5 for DNase data. This section necessitates the loading of functions from Section 2.5. source
('https://raw.githubusercontent.com/guertinlab/seqOutBias/master/docs/R/seqOutBias_functions.R')
setwd('~/ATAC_Walavalkar/')
all.composites.ATAC = cycle.fimo.new.not.hotspots(path.dir.mast = '~/ATAC_Walavalkar/', path.dir.bigWig = '/Users/guertinlab/ATAC_Walavalkar/gm12878/', window = 30, exp = 'GM12878_ATAC')
all.composites.ATAC.naked = cycle.fimo.new.not.hotspots(path.dir.fimo = '~/ATAC_Walavalkar/', path.dir.bigWig = '/Users/guertinlab/ATAC_Walavalkar/Naked/', window = 30, exp = 'Naked_ATAC')
save(all.composites.ATAC, all.composites.ATAC.naked, file = 'ATAC_naked_composites.Rdata')
all.composites.ATAC.naked$cond = gsub("C1_gDNA_no_scale_merged", "Raw", all.composites.ATAC.naked$cond) all.composites.ATAC.naked$cond = gsub("C1_gDNA_NXNXXXCXXNNXNNNXXN_NXXNNNXNNXXCXXXNXN_merged", "Corrected",
all.composites.ATAC.naked$cond)
composites.func.panels.naked.chromatin(all.composites.ATAC.naked[(all.composites.ATAC.naked$grp != 'CTCF'),], fact = paste('ATAC Naked', sep= ' '), summit = 'Motif', num = 24,
col.lines = rev(c(rgb(0,0,1,1/2), rgb(0,0,0,1/2))), fill.poly = rev(c(rgb(0,0,1,1/4), rgb(0,0,0,1/4))))
all.composites.ATAC$cond = gsub("ATAC_GM12878_no_scale_merged", "Raw", all.composites.ATAC$cond)
all.composites.ATAC$cond = gsub("ATAC_GM12878_NXNXXXCXXNNXNNNXXN_NXXNNNXNNXXCXXXNXN_merged", "Corrected", all.composites.ATAC$cond)
composites.func.panels.naked.chromatin(all.composites.ATAC[all.composites.ATAC$grp != 'CTCF',],
fact = paste('ATAC GM12878', sep= ' '), summit = 'Motif', num = 24,
col.lines = rev(c(rgb(0,0,1,1/2), rgb(0,0,0,1/2))), fill.poly = rev(c(rgb(0,0,1,1/4), rgb(0,0,0,1/4))))