seqOutBias

Vignette

by Guertin Lab

⇦ Contents ⇨

Correction of Tn5 sequence bias from ATAC-seq data

Next we will correct paired-end ATAC-seq data from the Greenleaf group (Buenrostro et al., 2013) and naked DNA ATAC-seq generated by our group. Note that ATAC-seq uses Illumina’s Nextera kit to directly transpose sequencing adapters into accessible chromatin. ATAC-seq is unique among enzymatic accessibility assays because each transposition event inserts two adapters into the chromatin. Each Tn5 molecule can be pre-loaded with any combination of the paired-end 1 and paired-end 2 adapter.

Downloading and processing ATAC-seq data

First we will download SRA files from naked ATAC-seq, which is a genomic DNA isolation followed by standard ATAC-seq protocols. These data are paired-end, which necessitates splitting the SRA file into two fastq files. Next we exclude all instances of reads that align to chrM–this is not a problem with these data, but typical ATAC-seq on chromatin can yield a high fraction of reads aligning to chrM. We will treat the reads that align to the plus and minus strand differently, because of how the Tn5 recognition site is distinct for plus and minus reads. Note that we optimized the k-mer mask for ATAC-seq.

mkdir ~/ATAC_Walavalkar
cd ~/ATAC_Walavalkar

#ATAC Naked
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX243/SRX2438155/SRR5123141/SRR5123141.sra
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX243/SRX2438156/SRR5123142/SRR5123142.sra

fastq-dump --split-3 SRR5123141.sra
fastq-dump --split-3 SRR5123142.sra

mv SRR5123141_1.fastq C1_gDNA_PE1_rep1.fastq
mv SRR5123141_2.fastq C1_gDNA_PE2_rep1.fastq
mv SRR5123142_1.fastq C1_gDNA_PE1_rep2.fastq
mv SRR5123142_2.fastq C1_gDNA_PE2_rep2.fastq

gzip *fastq

wget https://raw.githubusercontent.com/igvteam/igv/master/genomes/sizes/hg38.chrom.sizes
plus_mask=NXNXXXCXXNNXNNNXXN
minus_mask=NXXNNNXNNXXCXXXNXN 
for fq in *_PE1_rep1.fastq.gz do
    name=$(echo $fq | awk -F"_PE1_rep1.fastq.gz" '{print $1}')
    echo $name
    bowtie2 -x ~/DNase_ENCODE/hg38 -1 $fq,${name}_PE1_rep2.fastq.gz -2 ${name}_PE2_rep1.fastq.gz,${name}_PE2_rep2.fastq.gz -S $name.sam
    grep -v '\tchrM\t' $name.sam > $name.chrM.sam
    samtools view -b $name.chrM.sam | samtools sort - $name
    rm *sam
    samtools view -bh -F 20 ${name}.bam > ${name}_plus.bam
    samtools view -bh -f 0x10 ${name}.bam > ${name}_minus.bam
    seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_plus.bam --kmer-mask ${plus_mask} --bw=${name}_plus_${plus_mask}-mer.bigWig --shift-counts --read-size=75 seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_minus.bam --kmer-mask ${minus_mask} --bw=${name}_minus_${minus_mask}-mer.bigWig --shift-counts --read-size=75 seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_minus.bam --no-scale --bw=${name}_no_scale_minus.bigWig --shift-counts --read-size=75
    seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_plus.bam --no-scale --bw=${name}_no_scale_plus.bigWig --shift-counts --read-size=75
    bigWigMerge ${name}_plus_${plus_mask}-mer.bigWig ${name}_minus_${minus_mask}-mer.bigWig ${name}_${plus_mask}_${minus_mask}_merged.bedGraph
    bigWigMerge ${name}_no_scale_plus.bigWig ${name}_no_scale_minus.bigWig ${name}_no_scale_merged.bedGraph
    sort -k1,1 -k2,2n ${name}_${plus_mask}_${minus_mask}_merged.bedGraph > ${name}_${plus_mask}_${minus_mask}_merged.sorted.bedGraph
    sort -k1,1 -k2,2n ${name}_no_scale_merged.bedGraph > ${name}_no_scale_merged.sorted.bedGraph
    bedGraphToBigWig ${name}_${plus_mask}_${minus_mask}_merged.sorted.bedGraph hg38.chrom.sizes ${name}_${plus_mask}_${minus_mask}_merged.bigWig bedGraphToBigWig ${name}_no_scale_merged.sorted.bedGraph hg38.chrom.sizes ${name}_no_scale_merged.bigWig
done
mkdir Naked
mv C1*merged.bigWig Naked

Processing GM12878 ATAC-seq and TF binding data for GM12878 cells

Perform the same processes for the original ATAC-seq data (Buenrostro et al., 2013). As we did in Section 2.4, we want to retrieve ChIP-seq data for GM12878 cells to plot composite profiles of ATAC signal.

#GM12878 Greenleaf
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX298/SRX298000/SRR891268/SRR891268.sra fastq-dump --split-3 SRR891268.sra
mv SRR891268_1.fastq ATAC_GM12878_PE1.fastq
mv SRR891268_2.fastq ATAC_GM12878_PE2.fastq

gzip ATAC_GM12878_PE1.fastq gzip ATAC_GM12878_PE2.fastq
plus_mask=NXNXXXCXXNNXNNNXXN minus_mask=NXXNNNXNNXXCXXXNXN

for fq in *_PE1.fastq.gz
do
    name=$(echo $fq | awk -F"_PE1.fastq.gz" '{print $1}')
    echo $name
    bowtie2 -x ~/DNase_ENCODE/hg38 -1 $fq -2 ${name}_PE2.fastq.gz -S $name.sam grep -v '\tchrM\t' $name.sam > $name.chrM.sam
    samtools view -b $name.chrM.sam | samtools sort - $name
    samtools view -bh -F 20 ${name}.bam > ${name}_plus.bam
    samtools view -bh -f 0x10 ${name}.bam > ${name}_minus.bam
    seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_plus.bam --kmer-mask ${plus_mask} --bw=${name}_plus_${plus_mask}-mer.bigWig --shift-counts --read-size=50 seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_minus.bam --kmer-mask ${minus_mask} --bw=${name}_minus_${minus_mask}-mer.bigWig --shift-counts --read-size=50 seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_minus.bam --no-scale --bw=${name}_no_scale_minus.bigWig --shift-counts --read-size=50
    seqOutBias ~/DNase_ENCODE/hg38.fa ${name}_plus.bam --no-scale --bw=${name}_no_scale_plus.bigWig --shift-counts --read-size=50
    bigWigMerge ${name}_plus_${plus_mask}-mer.bigWig ${name}_minus_${minus_mask}-mer.bigWig ${name}_${plus_mask}_${minus_mask}_merged.bedGraph bigWigMerge ${name}_no_scale_plus.bigWig ${name}_no_scale_minus.bigWig ${name}_no_scale_merged.bedGraph
    sort -k1,1 -k2,2n ${name}_${plus_mask}_${minus_mask}_merged.bedGraph > ${name}_${plus_mask}_${minus_mask}_merged.sorted.bedGraph
    sort -k1,1 -k2,2n ${name}_no_scale_merged.bedGraph > ${name}_no_scale_merged.sorted.bedGraph
    bedGraphToBigWig ${name}_${plus_mask}_${minus_mask}_merged.sorted.bedGraph hg38.chrom.sizes ${name}_${plus_mask}_${minus_mask}_merged.bigWig bedGraphToBigWig ${name}_no_scale_merged.sorted.bedGraph hg38.chrom.sizes ${name}_no_scale_merged.bigWig
done

mkdir gm12878
mv *GM12878*merged.bigWig gm12878

url=http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhTfbs/
wget ${url}wgEncodeSydhTfbsGm12878Corestsc30189IggmusPk.narrowPeak.gz
wget ${url}wgEncodeSydhTfbsGm12878Ebf1sc137065StdPk.narrowPeak.gz
wget ${url}wgEncodeSydhTfbsGm12878Ctcfsc15914c20StdPk.narrowPeak.gz 
url=http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/
wget ${url}wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep1.broadPeak.gz

grep -i -A 25 'MOTIF MA0138.2 REST' ~/DNase_ENCODE/motif_databases/JASPAR/JASPAR_CORE_2016.meme > REST_meme_temp.txt
grep -i -A 18 'MOTIF MA0154.3 EBF1' ~/DNase_ENCODE/motif_databases/JASPAR/JASPAR_CORE_2016.meme > EBF1_meme_temp.txt
grep -i -A 15 'MOTIF MA0079.3 SP1' ~/DNase_ENCODE/motif_databases/JASPAR/JASPAR_CORE_2016.meme > SP1_meme_temp.txt
grep -i -A 23 'MOTIF MA0139.1 CTCF' ~/DNase_ENCODE/motif_databases/JASPAR/JASPAR_CORE_2016.meme > CTCF_meme_temp.txt

head -9 ~/DNase_ENCODE/motif_databases/JASPAR/JASPAR_CORE_2016_vertebrates.meme > header_meme_temp.txt

for i in *meme_temp.txt
do
    name=$(echo $i | awk -F"_meme" '{print $1}')
    cat header_meme_temp.txt $i > ${name}_minimal_meme.txt
done

rm *temp.txt

for peak in *Gm12878*Peak.gz
do
    name=$(echo $peak | awk -F"TfbsGm12878" '{print $NF}' | awk -F"." '{print $1}')
    unz=$(echo $peak | awk -F".gz" '{print $1}')
    echo $name
    gunzip $peak
    echo $unz
    liftOver $unz ~/DNase_ENCODE/hg19ToHg38.over.chain $name.hg38.narrowPeak $name.hg38.narrow.unmapped.txt -bedPlus=6 fastaFromBed -fi ~/DNase_ENCODE/hg38.fa -bed $name.hg38.narrowPeak -fo $name.hg38.fasta
    gzip ${name}*Peak
done

mv Ctcfsc15914c20StdPk.hg38.fasta CTCF.hg38.fasta
mv Sp1Pcr1xPkRep1.hg38.fasta SP1.hg38.fasta
mv Ebf1sc137065StdPk.hg38.fasta EBF1.hg38.fasta
mv Corestsc30189IggmusPk.hg38.fasta REST.hg38.fasta

for meme in *.hg38.fasta
do
    name=$(echo $meme | awk -F".hg38.fasta" '{print $1}')
    echo $name
    mast ${name}_minimal_meme.txt $meme -hit_list -mt 0.0005 > ${name}_mast.txt
done

for meme in *.hg38.fasta
do
    name=$(echo $meme | awk -F".hg38.fasta" '{print $1}')
    echo $name
    fimo --thresh 0.0001 --text ${name}_minimal_meme.txt ~/DNase_ENCODE/hg38.fa > ${name}_fimo.txt 
    grep -v chrM ${name}_fimo.txt > ${name}_noM_fimo.txt
    rm ${name}_fimo.txt
    mv ${name}_noM_fimo.txt ${name}_fimo.txt
done

Plotting ATAC composites using `R`

For each transcription factor motif we will plot the ATAC signal using the GM12878 and Naked DNA data as we did in Section 2.5 for DNase data. This section necessitates the loading of functions from Section 2.5. source

('https://raw.githubusercontent.com/guertinlab/seqOutBias/master/docs/R/seqOutBias_functions.R')

setwd('~/ATAC_Walavalkar/')

all.composites.ATAC = cycle.fimo.new.not.hotspots(path.dir.mast = '~/ATAC_Walavalkar/', path.dir.bigWig = '/Users/guertinlab/ATAC_Walavalkar/gm12878/', window = 30, exp = 'GM12878_ATAC')
all.composites.ATAC.naked = cycle.fimo.new.not.hotspots(path.dir.fimo = '~/ATAC_Walavalkar/', path.dir.bigWig = '/Users/guertinlab/ATAC_Walavalkar/Naked/', window = 30, exp = 'Naked_ATAC')

save(all.composites.ATAC, all.composites.ATAC.naked, file = 'ATAC_naked_composites.Rdata')

all.composites.ATAC.naked$cond = gsub("C1_gDNA_no_scale_merged", "Raw", all.composites.ATAC.naked$cond) all.composites.ATAC.naked$cond = gsub("C1_gDNA_NXNXXXCXXNNXNNNXXN_NXXNNNXNNXXCXXXNXN_merged", "Corrected",
all.composites.ATAC.naked$cond)

composites.func.panels.naked.chromatin(all.composites.ATAC.naked[(all.composites.ATAC.naked$grp != 'CTCF'),], fact = paste('ATAC Naked', sep= ' '), summit = 'Motif', num = 24,
col.lines = rev(c(rgb(0,0,1,1/2), rgb(0,0,0,1/2))), fill.poly = rev(c(rgb(0,0,1,1/4), rgb(0,0,0,1/4))))

all.composites.ATAC$cond = gsub("ATAC_GM12878_no_scale_merged", "Raw", all.composites.ATAC$cond)
all.composites.ATAC$cond = gsub("ATAC_GM12878_NXNXXXCXXNNXNNNXXN_NXXNNNXNNXXCXXXNXN_merged", "Corrected", all.composites.ATAC$cond)

composites.func.panels.naked.chromatin(all.composites.ATAC[all.composites.ATAC$grp != 'CTCF',],
    fact = paste('ATAC GM12878', sep= ' '), summit = 'Motif', num = 24,
    col.lines = rev(c(rgb(0,0,1,1/2), rgb(0,0,0,1/2))), fill.poly = rev(c(rgb(0,0,1,1/4), rgb(0,0,0,1/4))))

Correction of Tn5 sequence bias from ATAC-seq data

Downloading and processing ATAC-seq data

Processing GM12878 ATAC-seq and TF binding data for GM12878 cells

Plotting ATAC composites using R

Plotting ATAC composites using `R`