This is an analysis vignette accompanying the manuscript entitled “TRPS1 modulates chromatin accessibility to regulate estrogen receptor alpha (ER) binding and ER target gene expression in luminal breast cancer cells” (https://doi.org/10.1101/2023.07.03.547524). The reader should be able to follow these steps to download the raw data and reproduce all the results in the manuscript.
Here we just focus on chromosome 8 (where TRPS1 is located) and a subset of columns to keep the file size manageable. I uploaded this file to LocusZoom to make the plot.
wget https://bcac.ccge.medschl.cam.ac.uk/files/icogs_onco_gwas_meta_overall_breast_cancer_summary_level_statistics.txt.zip
unzip icogs_onco_gwas_meta_overall_breast_cancer_summary_level_statistics.txt.zip
head -n 1 icogs_onco_gwas_meta_overall_breast_cancer_summary_level_statistics.txt > Breast_cancer_GWAS_PMID_32424353_chr8.txt
grep '^8_' icogs_onco_gwas_meta_overall_breast_cancer_summary_level_statistics.txt >> Breast_cancer_GWAS_PMID_32424353_chr8.txt
awk '{OFS = "\t"; print $9, $10, $41, $40, $46}' Breast_cancer_GWAS_PMID_32424353_chr8.txt > Breast_cancer_GWAS_PMID_32424353_chr8_subset.txt
I downloaded the 22Q1 data release from the DepMap portal, with the TRPS1 and ESR1 CRISPR gene effect (Chronos) scores for all cell lines. I also downloaded the list of annotated luminal breast cancer cell lines from the DepMap portal as of March 13, 2022.
Set up in R. R code chunks will be in light blue, to distinguish from bash code chunks in grey.
setwd("~/Library/CloudStorage/Box-Box/GuertinLab/DepMap/")
library(tidyverse)
library(dplyr)
library(ggrepel)
library(ggsignif)
`%notin%` <- Negate(`%in%`)
Read in the data.
Chronos <- read_csv("Chronos.csv")
colnames(Chronos)[c(2,3)] <- c("ESR1", "TRPS1")
Luminal <- read_csv("cell lines in luminal.csv")
Chronos_luminal_last <- rbind.data.frame(Chronos[Chronos$`DepMap ID` %notin% Luminal$`Depmap Id`,],
Chronos[Chronos$`DepMap ID` %in% Luminal$`Depmap Id`,])
cols = ifelse(Chronos_luminal_last$`DepMap ID` %in% Luminal$`Depmap Id`, "red", "black")
Plot.
pdf("Chronos_ESR1_vs_TRPS1_Luminal.pdf")
ggplot(Chronos_luminal_last, aes(x = ESR1, y = TRPS1, color = cols)) +
geom_point() +
scale_color_manual(values = c("black", "red")) +
geom_text_repel(aes(label=ifelse(`DepMap ID` %in% Luminal$`Depmap Id`,
as.character(`Cell Line Name`),'')),
max.overlaps = nrow(Chronos_luminal_last)) +
xlab("ESR1 knockout score") +
ylab("TRPS1 knockout score") +
coord_fixed(ratio = 1, xlim = c(-1.2, 0.5), ylim = c(-1.2, 0.5)) +
theme_bw() +
theme(text = element_text(size = 20), legend.position="none")
dev.off()
pdf("Chronos_TRPS1_score_by_class.pdf")
ggplot(Chronos_luminal_last, aes(x = cols, y = TRPS1, color = cols)) +
geom_violin() +
geom_boxplot(width=.1) +
#geom_signif(comparisons = list(c("black", "red")), map_signif_level=F)
geom_hline(yintercept = 0, linetype="dashed") +
scale_color_manual(values = c("black", "red")) +
ylim(-1.2, 0.5) +
ylab("TRPS1 knockout score") +
xlab("Cell line class") +
theme_bw() +
theme(legend.position="none")
dev.off()
T-tests.
#0.002235
t.test(Chronos_luminal_last$TRPS1[cols == "red"],
Chronos_luminal_last$TRPS1[cols == "black"])
#3.204e-05
wilcox.test(Chronos_luminal_last$TRPS1[cols == "red"],
Chronos_luminal_last$TRPS1[cols == "black"])
Log in to the cluster to perform the processing for each sample in parallel.
ssh -Y ts2hx@rivanna.hpc.virginia.edu
ijob -A gioeli_lab -p standard --cpus-per-task=10 --time=8:00:00
module load gcc/11.4.0 sratoolkit/3.0.3
vdb-config -i #press x to exit
Make directories for the later scripts to write files.
title=ChIP
scripts=/scratch/ts2hx/scripts
main=/scratch/ts2hx/${title}/results
mkdir -p /scratch/ts2hx/${title}/logs
mkdir -p /scratch/ts2hx/${title}/raw_data
mkdir -p ${main}/fastqc
mkdir -p ${main}/bowtie2/controls
mkdir -p ${main}/macs2
mkdir -p ${main}/meme
mkdir -p ${main}/chipqc/meta
mkdir -p ${main}/chipqc/bams
mkdir -p ${main}/counts
mkdir -p ${main}/bigWigs
Get the required script for bigWig merging and move to a directory that will end up in the $PATH.
wget https://raw.githubusercontent.com/guertinlab/Nascent_RNA_Methods/main/normalize_bedGraph.py
chmod +x normalize_bedGraph.py
mv normalize_bedGraph.py $HOME/bin/
Get the ENCODE blacklist.
wget https://raw.githubusercontent.com/Boyle-Lab/Blacklist/master/lists/hg38-blacklist.v2.bed.gz -P $HOME/hg38/
gunzip $HOME/hg38/hg38-blacklist.v2.bed.gz
mv $HOME/hg38/hg38-blacklist.v2.bed $HOME/hg38/hg38.blacklist.bed
Download the fastq’s and gzip.
cd /scratch/ts2hx/${title}/raw_data
for i in {676..761}
do
sbatch -p standard -A gioeli_lab -t 1:00:00 -N 1 -n 1 --cpus-per-task=10 --job-name fasterq-dump \
-o /scratch/ts2hx/${title}/logs/SRR25076${i}_fasterq_dump.log \
--wrap="module load gcc/11.4.0 sratoolkit/3.0.3; fasterq-dump SRR25076${i}; gzip SRR25076${i}_*.fastq"
sleep 1
done
Combine the reads and rename
cat SRR25076713_1.fastq.gz SRR25076725_1.fastq.gz > T47D_Clone28_IgG_DMSO_rep1_PE1.fastq.gz
cat SRR25076712_1.fastq.gz SRR25076726_1.fastq.gz > T47D_Clone28_IgG_DMSO_rep2_PE1.fastq.gz
cat SRR25076703_1.fastq.gz SRR25076711_1.fastq.gz > T47D_Clone28_IgG_DMSO_rep3_PE1.fastq.gz
cat SRR25076700_1.fastq.gz SRR25076701_1.fastq.gz > T47D_Clone28_IgG_DMSO_rep4_PE1.fastq.gz
cat SRR25076698_1.fastq.gz SRR25076699_1.fastq.gz > T47D_Clone28_IgG_dTAG_rep1_PE1.fastq.gz
cat SRR25076696_1.fastq.gz SRR25076697_1.fastq.gz > T47D_Clone28_IgG_dTAG_rep2_PE1.fastq.gz
cat SRR25076694_1.fastq.gz SRR25076695_1.fastq.gz > T47D_Clone28_IgG_dTAG_rep3_PE1.fastq.gz
cat SRR25076692_1.fastq.gz SRR25076693_1.fastq.gz > T47D_Clone28_IgG_dTAG_rep4_PE1.fastq.gz
cat SRR25076752_1.fastq.gz SRR25076753_1.fastq.gz > T47D_Clone28_ER_DMSO_rep1_PE1.fastq.gz
cat SRR25076754_1.fastq.gz SRR25076755_1.fastq.gz > T47D_Clone28_ER_DMSO_rep2_PE1.fastq.gz
cat SRR25076751_1.fastq.gz SRR25076756_1.fastq.gz > T47D_Clone28_ER_DMSO_rep3_PE1.fastq.gz
cat SRR25076750_1.fastq.gz SRR25076757_1.fastq.gz > T47D_Clone28_ER_DMSO_rep4_PE1.fastq.gz
cat SRR25076749_1.fastq.gz SRR25076758_1.fastq.gz > T47D_Clone28_ER_dTAG_rep1_PE1.fastq.gz
cat SRR25076748_1.fastq.gz SRR25076759_1.fastq.gz > T47D_Clone28_ER_dTAG_rep2_PE1.fastq.gz
cat SRR25076747_1.fastq.gz SRR25076760_1.fastq.gz > T47D_Clone28_ER_dTAG_rep3_PE1.fastq.gz
cat SRR25076738_1.fastq.gz SRR25076761_1.fastq.gz > T47D_Clone28_ER_dTAG_rep4_PE1.fastq.gz
cat SRR25076743_1.fastq.gz SRR25076744_1.fastq.gz SRR25076745_1.fastq.gz SRR25076746_1.fastq.gz > T47D_Clone28_HA_DMSO_rep1_PE1.fastq.gz
cat SRR25076739_1.fastq.gz SRR25076740_1.fastq.gz SRR25076741_1.fastq.gz SRR25076742_1.fastq.gz > T47D_Clone28_HA_DMSO_rep2_PE1.fastq.gz
cat SRR25076736_1.fastq.gz SRR25076737_1.fastq.gz > T47D_Clone28_HA_DMSO_rep3_PE1.fastq.gz
cat SRR25076734_1.fastq.gz SRR25076735_1.fastq.gz > T47D_Clone28_HA_DMSO_rep4_PE1.fastq.gz
cat SRR25076718_1.fastq.gz SRR25076719_1.fastq.gz SRR25076720_1.fastq.gz SRR25076721_1.fastq.gz > T47D_Clone28_HA_dTAG_rep1_PE1.fastq.gz
cat SRR25076716_1.fastq.gz SRR25076722_1.fastq.gz > T47D_Clone28_HA_dTAG_rep2_PE1.fastq.gz
cat SRR25076715_1.fastq.gz SRR25076723_1.fastq.gz > T47D_Clone28_HA_dTAG_rep3_PE1.fastq.gz
cat SRR25076714_1.fastq.gz SRR25076724_1.fastq.gz > T47D_Clone28_HA_dTAG_rep4_PE1.fastq.gz
cat SRR25076713_2.fastq.gz SRR25076725_2.fastq.gz > T47D_Clone28_IgG_DMSO_rep1_PE2.fastq.gz
cat SRR25076712_2.fastq.gz SRR25076726_2.fastq.gz > T47D_Clone28_IgG_DMSO_rep2_PE2.fastq.gz
cat SRR25076703_2.fastq.gz SRR25076711_2.fastq.gz > T47D_Clone28_IgG_DMSO_rep3_PE2.fastq.gz
cat SRR25076700_2.fastq.gz SRR25076701_2.fastq.gz > T47D_Clone28_IgG_DMSO_rep4_PE2.fastq.gz
cat SRR25076698_2.fastq.gz SRR25076699_2.fastq.gz > T47D_Clone28_IgG_dTAG_rep1_PE2.fastq.gz
cat SRR25076696_2.fastq.gz SRR25076697_2.fastq.gz > T47D_Clone28_IgG_dTAG_rep2_PE2.fastq.gz
cat SRR25076694_2.fastq.gz SRR25076695_2.fastq.gz > T47D_Clone28_IgG_dTAG_rep3_PE2.fastq.gz
cat SRR25076692_2.fastq.gz SRR25076693_2.fastq.gz > T47D_Clone28_IgG_dTAG_rep4_PE2.fastq.gz
cat SRR25076752_2.fastq.gz SRR25076753_2.fastq.gz > T47D_Clone28_ER_DMSO_rep1_PE2.fastq.gz
cat SRR25076754_2.fastq.gz SRR25076755_2.fastq.gz > T47D_Clone28_ER_DMSO_rep2_PE2.fastq.gz
cat SRR25076751_2.fastq.gz SRR25076756_2.fastq.gz > T47D_Clone28_ER_DMSO_rep3_PE2.fastq.gz
cat SRR25076750_2.fastq.gz SRR25076757_2.fastq.gz > T47D_Clone28_ER_DMSO_rep4_PE2.fastq.gz
cat SRR25076749_2.fastq.gz SRR25076758_2.fastq.gz > T47D_Clone28_ER_dTAG_rep1_PE2.fastq.gz
cat SRR25076748_2.fastq.gz SRR25076759_2.fastq.gz > T47D_Clone28_ER_dTAG_rep2_PE2.fastq.gz
cat SRR25076747_2.fastq.gz SRR25076760_2.fastq.gz > T47D_Clone28_ER_dTAG_rep3_PE2.fastq.gz
cat SRR25076738_2.fastq.gz SRR25076761_2.fastq.gz > T47D_Clone28_ER_dTAG_rep4_PE2.fastq.gz
cat SRR25076743_2.fastq.gz SRR25076744_2.fastq.gz SRR25076745_2.fastq.gz SRR25076746_2.fastq.gz > T47D_Clone28_HA_DMSO_rep1_PE2.fastq.gz
cat SRR25076739_2.fastq.gz SRR25076740_2.fastq.gz SRR25076741_2.fastq.gz SRR25076742_2.fastq.gz > T47D_Clone28_HA_DMSO_rep2_PE2.fastq.gz
cat SRR25076736_2.fastq.gz SRR25076737_2.fastq.gz > T47D_Clone28_HA_DMSO_rep3_PE2.fastq.gz
cat SRR25076734_2.fastq.gz SRR25076735_2.fastq.gz > T47D_Clone28_HA_DMSO_rep4_PE2.fastq.gz
cat SRR25076718_2.fastq.gz SRR25076719_2.fastq.gz SRR25076720_2.fastq.gz SRR25076721_2.fastq.gz > T47D_Clone28_HA_dTAG_rep1_PE2.fastq.gz
cat SRR25076716_2.fastq.gz SRR25076722_2.fastq.gz > T47D_Clone28_HA_dTAG_rep2_PE2.fastq.gz
cat SRR25076713_2.fastq.gz SRR25076723_2.fastq.gz > T47D_Clone28_HA_dTAG_rep3_PE2.fastq.gz
cat SRR25076714_2.fastq.gz SRR25076724_2.fastq.gz > T47D_Clone28_HA_dTAG_rep4_PE2.fastq.gz
The script used in the next code chunk (230322_chipseq_analysis_on_input_file.sh) is below:
#!/bin/bash/
# This script takes a gzip'd fastq file of ChIP-seq data, runs FastQC, removes adapters,
# and outputs a bam file for peak calling, a bed file for counting reads in peaks,
# and a bigWig file for visualization.
# USAGE: sh chipseq_analysis_on_input_file.sh <path to the gzip'd PE1 file> <path to main results directory>
cores=$SLURM_CPUS_PER_TASK
#For fastq_pair
PATH=$HOME/bin:$PATH
#Change this to your main results directory
main=$2
#Change this to the location of the blacklist
blacklist=$HOME/hg38/hg38.blacklist.bed
# grab base of filename for naming outputs
name=`basename $1 _PE1.fastq.gz`
echo "Sample name is $name"
# grab the directory
dir=`dirname $1`
cd $dir
# directory with bowtie genome index
genome=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome
#Set up more variables for additional directories to help clean up the results folder
fastqc_out=${main}/fastqc/
bowtie_results=${main}/bowtie2
chipqc_out=${main}/chipqc/bams
counts_out=${main}/counts
bigWigs_out=${main}/bigWigs
#Set up the software environment
module load fastqc/0.11.5
module load cutadapt/3.4
module load gcc/9.2.0 pigz/2.4
module load bowtie2/2.2.9
module load samtools/1.12
module load wigtobigwig/2.8
module load bedtools/2.29.2
#Unzip
gunzip ${name}_PE1.fastq.gz
gunzip ${name}_PE2.fastq.gz
#Run FastQC and move output to the appropriate folder
fastqc ${name}_PE1.fastq
fastqc ${name}_PE2.fastq
mv ${dir}/${name}_PE*_fastqc.* $fastqc_out
#Count raw reads for complexity estimate
raw_reads=$(wc -l ${name}_PE1.fastq | awk '{print $1/4}')
echo "Raw reads: $raw_reads"
#Remove adapters
cutadapt -j $cores -m 10 -O 1 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA ${name}_PE1.fastq -o ${name}_PE1_noadap.fastq
cutadapt -j $cores -m 10 -O 1 -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT ${name}_PE2.fastq -o ${name}_PE2_noadap.fastq
#Pair reads
reads=$(wc -l ${name}_PE1_noadap.fastq | awk '{print $1/4}')
fastq_pair -t $reads ${name}_PE1_noadap.fastq ${name}_PE2_noadap.fastq
#Align to the genome and keep concordantly aligned, unique reads
#Also print statistics on duplication rate and estimated library complexity
cd $bowtie_results
bowtie2 -p $((cores-5)) --maxins 1000 -x $genome \
-1 ${dir}/${name}_PE1_noadap.fastq.paired.fq -2 ${dir}/${name}_PE2_noadap.fastq.paired.fq | \
samtools view -bS -f 0x2 - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | \
samtools markdup -s -r - ${name}.bam
samtools index ${name}.bam
#Isolate PE1 for ChIPQC
samtools view -f 0x42 ${name}.bam -o ${chipqc_out}/${name}_ChIPQC.bam
samtools index ${chipqc_out}/${name}_ChIPQC.bam ${chipqc_out}/${name}_ChIPQC.bam.bai
#Convert bam to bed for counting
samtools sort -n ${name}.bam | bedtools bamtobed -i stdin -bedpe | \
awk 'BEGIN { OFS = "\t"} { print $1, $2, $6, ".", "1" }' |
grep -v "random" | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | grep -v "alt" | \
intersectBed -v -a stdin -b $blacklist | sort -k1,1 -k2,2n > ${counts_out}/${name}_not_scaled.bed
#Count resultant reads for complexity estimate
resultant_reads=$(wc -l ${counts_out}/${name}_not_scaled.bed | awk '{print $1}')
echo "Resultant reads: $resultant_reads"
#Zip and delete files
cd $dir
pigz -p $cores ${name}_PE1.fastq
pigz -p $cores ${name}_PE2.fastq
rm ${name}_*noadap*
#Deeptools wasn't loading python/numpy before, but purging and reloading helped (this may be unnecessary now)
module purge
module load deeptools/3.5.1
#Convert bam to bigWig for coverage visualization
#We re-do this later with a larger bin size, so this is not necessary here (should change in future)
bamCoverage --bam ${bowtie_results}/${name}.bam -o ${bigWigs_out}/${name}_not_scaled.bw -p $cores \
--binSize 1 --extendReads --centerReads
Run the above script on each sample in parallel.
chmod +x ${scripts}/230322_chipseq_analysis_on_input_file.sh
for fastq in *_HA*_PE1.fastq.gz *_ER_*_PE1.fastq.gz *_IgG_*_PE1.fastq.gz
do
name=`basename $fastq _PE1.fastq.gz`
fastq=/scratch/ts2hx/${title}/raw_data/${fastq}
sbatch -p standard -A gioeli_lab -t 8:00:00 -N 1 -n 1 --cpus-per-task=10 --job-name chipseq-analysis \
-o /scratch/ts2hx/${title}/logs/${name}_chipseq.log \
--wrap="sh ${scripts}/230322_chipseq_analysis_on_input_file.sh $fastq $main"
sleep 1
done
Move IgG samples to subdirectory.
cd ${main}/bowtie2/
mv *IgG*.bam* controls/
The script used in the next code chunk (230308_macs2.sh) is below:
#!/bin/bash
#Get the variables from the arguments
samples=$1
control=$2
factor=$3
main=$4
blacklist=$HOME/hg38/hg38.blacklist.bed
module load macs2/2.2.7.1 gcc/7.1.0 openmpi/3.1.4 intel/18.0 intelmpi/18.0 R/4.1.1 bedtools/2.26.0
macs2 callpeak -t ${samples}_*.bam -c ${control}_*.bam -f BAMPE -g hs -n $factor --keep-dup all \
--outdir ${main}/macs2
#Remove noncanonical chromosomes and blacklisted regions, and make a window of 50bp on either side of the summit
grep -v "random" ${main}/macs2/${factor}_summits.bed | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | \
intersectBed -v -a stdin -b $blacklist | slopBed -b 50 -i stdin -g $HOME/hg38/hg38.chrom.sizes > \
${main}/macs2/${factor}_summit_window.bed
Run the above script on each sample in parallel.
#Call peaks for each factor
chmod +x ${scripts}/230308_macs2.sh
for bam in *_DMSO_rep1.bam
do
name=`basename $bam _DMSO_rep1.bam`
factor=$(echo $name | awk -F"_" '{print $3}')
samples=T47D_Clone28_${factor}
control=controls/T47D_Clone28_IgG
sbatch -p largemem -A gioeli_lab -t 4:00:00 -N 1 -n 1 --mem-per-cpu=64000 --job-name macs2 \
-o /scratch/ts2hx/${title}/logs/${factor}_macs2.log \
--wrap="sh ${scripts}/230308_macs2.sh $samples $control $factor $main"
sleep 1
done
Remove noncanonical chromosomes and blacklisted regions. Could have added this to the macs2 script when I did it for the summit window.
blacklist=$HOME/hg38/hg38.blacklist.bed
module load gcc/9.2.0 bedtools/2.29.2
for bam in *_DMSO_rep1.bam
do
name=`basename $bam _DMSO_rep1.bam`
factor=$(echo $name | awk -F"_" '{print $3}')
grep -v "random" ${main}/macs2/${factor}_peaks.narrowPeak | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | \
intersectBed -v -a stdin -b $blacklist > ${factor}_tmp.txt
mv ${factor}_tmp.txt ${main}/macs2/${factor}_peaks.narrowPeak
grep -v "random" ${main}/macs2/${factor}_summits.bed | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | \
intersectBed -v -a stdin -b $blacklist > ${factor}_tmp.txt
mv ${factor}_tmp.txt ${main}/macs2/${factor}_summits.bed
done
Count reads that overlap any part of the full narrowPeak region.
cd ${main}/counts/
for bed in `ls *_not_scaled.bed | grep -v IgG`
do
name=`basename $bed _not_scaled.bed`
factor=$(echo $name | awk -F"_" '{print $3}')
sbatch -p standard -A gioeli_lab -t 10:00 -N 1 -n 1 --job-name ${name}_mapBed \
-o /scratch/ts2hx/${title}/logs/${name}_mapBed.log \
--wrap="module load gcc/9.2.0 bedtools/2.29.2; \
mapBed -null "0" -a ${main}/macs2/${factor}_peaks.narrowPeak -b $bed > ${name}_peak_counts.txt"
sleep 1 # wait 1 second between each job submission
done
Just keep the last column (the counts). I should do this in the above jobs.
for counts in T47D_Clone28_*_peak_counts.txt
do
name=`basename $counts _peak_counts.txt`
awk '{print $NF}' ${name}_peak_counts.txt > ${name}_peak_counts_only.txt
echo $name | cat - ${name}_peak_counts_only.txt > ${name}_peak_counts.txt
rm ${name}_peak_counts_only.txt
done
Combine into 1 file per factor.
for peaks in ../macs2/*_peaks.narrowPeak
do
factor=`basename $peaks _peaks.narrowPeak`
awk '{OFS = "\t"} {print $1, $2, $3}' $peaks > peaks.bed
echo -e "chr\tstart\tend" | cat - peaks.bed > peaks_header.bed
paste -d'\t' peaks_header.bed T47D_Clone28_${factor}_*_peak_counts.txt > Combined_${factor}_peak_counts.txt
rm peaks*
done
The script used in the next code chunk (230322_macs2.sh) is below:
#!/bin/bash
samples=$1
control=$2
factor=$3
main=$4
blacklist=$HOME/hg38/hg38.blacklist.bed
module load macs2/2.2.7.1 gcc/7.1.0 openmpi/3.1.4 intel/18.0 intelmpi/18.0 R/4.1.1 bedtools/2.26.0
macs2 callpeak --call-summits -t ${samples}_*.bam -c ${control}_*.bam -f BAMPE -g hs -n $factor --keep-dup all \
--outdir ${main}/macs2
grep -v "random" ${main}/macs2/${factor}_peaks.narrowPeak | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | \
grep -v "alt" | intersectBed -v -a stdin -b $blacklist > ${factor}_tmp.txt
mv ${factor}_tmp.txt ${main}/macs2/${factor}_peaks.narrowPeak
grep -v "random" ${main}/macs2/${factor}_summits.bed | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | \
grep -v "alt" | intersectBed -v -a stdin -b $blacklist > ${factor}_tmp.txt
mv ${factor}_tmp.txt ${main}/macs2/${factor}_summits.bed
slopBed -b 50 -i ${main}/macs2/${factor}_summits.bed -g $HOME/hg38/hg38.chrom.sizes > \
${main}/macs2/${factor}_summit_window.bed
Run the above script on the HA samples.
cd ${main}/bowtie2/
factor=HA_DMSO_vs_dTAG
samples=T47D_Clone28_HA_DMSO
control=T47D_Clone28_HA_dTAG
sbatch -p largemem -A gioeli_lab -t 4:00:00 -N 1 -n 1 --mem-per-cpu=64000 --job-name macs2 \
-o /scratch/ts2hx/${title}/logs/${factor}_macs2.log \
--wrap="sh ${scripts}/230322_macs2.sh $samples $control $factor $main"
Count reads in HA peaks.
cd ${main}/counts/
#Just count read fragments that overlap the summit
peaks=${main}/macs2/${factor}_summits.bed
for bed in T47D_Clone28_HA_*_not_scaled.bed T47D_Clone28_IgG_*_not_scaled.bed
do
name=`basename $bed _not_scaled.bed`
echo $name
name=${name}_${factor}
mapBed -null "0" -a $peaks -b $bed > ${name}_peak_counts.txt
awk '{print $NF}' ${name}_peak_counts.txt > ${name}_peak_counts_only.txt
echo $name | cat - ${name}_peak_counts_only.txt > ${name}_peak_counts.txt
rm ${name}_peak_counts_only.txt
done
echo -e "chr\tstart\tend\tname\tqvalue" | cat - $peaks | \
paste -d'\t' - *${factor}_peak_counts.txt > Combined_${factor}_peak_counts.txt
rm T47D_Clone28_*${factor}_peak_counts.txt
I ran the following interactively to make sure I had the right samplesheet before running ChIPQC.
module load gcc/9.2.0 openmpi/3.1.6 R/4.2.1
R
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
library(BiocManager)
if (!require("ChIPQC", quietly = TRUE))
BiocManager::install("ChIPQC")
library(ChIPQC)
if (!require("TxDb.Hsapiens.UCSC.hg38.knownGene", quietly = TRUE))
install.packages("TxDb.Hsapiens.UCSC.hg38.knownGene")
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
samplesheet <- data.frame(SampleID=character(),
Tissue=character(),
Factor=character(),
Condition=character(),
Replicate=integer(),
bamReads=character(),
ControlID=character(),
bamControl=character(),
Peaks=character(),
PeakCaller=character(),
stringsAsFactors=FALSE)
path = "/scratch/ts2hx/ChIP/results/bowtie2/"
samples <- dir(path, pattern = ".bam.bai")
samples
setwd("/scratch/ts2hx/ChIP/results/")
for (i in 1:length(samples)) {
samplesheet[i,1] <- strsplit(strsplit(samples[i], split = ".bam")[[1]][1], split = "T47D_Clone28_")[[1]][2]
samplesheet[i,2] <- strsplit(samples[i], split = "_")[[1]][1]
samplesheet[i,3] <- strsplit(samples[i], split = "_")[[1]][3]
samplesheet[i,4] <- strsplit(samples[i], split = "_")[[1]][4]
samplesheet[i,5] <- strsplit(strsplit(samples[i], split = ".bam")[[1]][1], split = "_rep")[[1]][2]
samplesheet[i,6] <- paste("chipqc/bams/", strsplit(samples[i], split = ".bam")[[1]][1], "_ChIPQC.bam", sep = "")
samplesheet[i,7] <- paste("T47D_Clone28_IgG_", samplesheet[i,4], "_rep", samplesheet[i,5], sep = "")
samplesheet[i,8] <- paste("chipqc/bams/T47D_Clone28_IgG_", samplesheet[i,4], "_rep", samplesheet[i,5], "_ChIPQC.bam", sep = "")
samplesheet[i,9] <- paste("macs2/", samplesheet[i,3], "_peaks.narrowPeak", sep = "")
samplesheet[i,10] <- "narrow"
}
head(samplesheet)
samplesheet[9:16,9] <- "macs2/HA_DMSO_vs_dTAG_peaks.narrowPeak"
write.csv(samplesheet, file = "/scratch/ts2hx/ChIP/results/chipqc/meta/samplesheet.csv", row.names = FALSE)
q()
The script used in the next code chunk (230313_ChIPQC.R) is below:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
library(BiocManager)
if (!require("ChIPQC", quietly = TRUE))
BiocManager::install("ChIPQC")
library(ChIPQC)
if (!require("TxDb.Hsapiens.UCSC.hg38.knownGene", quietly = TRUE))
install.packages("TxDb.Hsapiens.UCSC.hg38.knownGene")
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
setwd("/scratch/ts2hx/ChIP/results/")
samples <- read.csv("/scratch/ts2hx/ChIP/results/chipqc/meta/samplesheet.csv")
register(SerialParam())
chipObj <- ChIPQC(samples, annotation="hg38")
save(chipObj, file = "/scratch/ts2hx/ChIP/results/chipqc/chipObj.R")
load("/scratch/ts2hx/ChIP/results/chipqc/chipObj.R")
setwd("/scratch/ts2hx/ChIP/results/chipqc")
ChIPQCreport(chipObj, reportName="ChIPQC", reportFolder="ChIPQCreport")
Run the above script to generate the ChIPQC object.
sbatch -p standard -A gioeli_lab -t 2:00:00 -N 1 -n 1 --job-name chipqc -o /scratch/ts2hx/${title}/logs/chipqc.log \
--wrap="module load gcc/9.2.0 openmpi/3.1.6 R/4.2.1; Rscript ${scripts}/230313_ChIPQC.R"
First make size factors based on total mapped reads (divide each by the geometric mean of all of them like in DESeq2).
cd ${main}/bowtie2
module load samtools/1.12
for i in HA ER
do
echo $i
> ${i}_header.txt
> ${i}_reads.txt
for j in T47D_Clone28_${i}_*.bam
do
echo $j
name=$(echo $j | awk -F".bam" '{print $1}')
echo $name | paste ${i}_header.txt - > ${i}_tmp.txt
mv ${i}_tmp.txt ${i}_header.txt
reads=`samtools view -c $j`
echo $reads | paste ${i}_reads.txt - > ${i}_tmp.txt
mv ${i}_tmp.txt ${i}_reads.txt
done
cat ${i}_header.txt ${i}_reads.txt > ${i}_tmp.txt
mv ${i}_tmp.txt ${i}_reads.txt
rm ${i}_header.txt
done
Combine these into one file.
module load gcc/9.2.0 openmpi/3.1.6 R/4.2.1
R
library(DESeq2)
setwd("/scratch/ts2hx/ChIP/results/bowtie2/")
samples <- dir("/scratch/ts2hx/ChIP/results/bowtie2/", pattern = "_reads.txt")
for (i in samples) {
x <- read.table(i, sep = '\t', header = TRUE)[,-1]
factor = sapply(strsplit(colnames(x)[1], "_"), "[", 3)
write.table(estimateSizeFactorsForMatrix(x), file = paste0("/scratch/ts2hx/ChIP/results/bigWigs/", factor, '_sizeFactor.txt'), quote =F, col.names=F)
}
q()
Generate the binned bigWigs.
for i in *.bam
do
name=$(echo $i | awk -F".bam" '{print $1}')
sbatch -p standard -A gioeli_lab -t 1:00:00 --cpus-per-task=10 --job-name bamCoverage -o ${name}_bamCoverage.log \
--export=ALL,name=$name,main=$main --wrap="module load deeptools/3.5.1 samtools/1.12; samtools index ${name}.bam; \
bamCoverage --bam ${name}.bam -o ${main}/bigWigs/${name}_binned_not_scaled.bw -p 10 --extendReads --centerReads;"
sleep 1
done
The script used in the next code chunk (230329_ChIP_scale_bigWig_binned.sh) is below. It scales each bigWig based on the read depth.
#! /bin/bash
factor=$2
directory=$3
PATH=$HOME/bin:$PATH
cd $directory
module load ucsc-tools/3.7.4
name=$(echo $1 | awk -F"_binned_not_scaled.bw" '{print $1}')
invscale1=$(grep ${name} ${factor}_sizeFactor.txt | awk -F" " '{print $2}')
invscale=$(echo $invscale1 | bc)
scaletrue=$(bc <<< "scale=4 ; 1.0 / $invscale")
name=${name}_binned
echo $name
echo $scaletrue
echo file_to_scale
echo ${name}_not_scaled.bw
bigWigToBedGraph ${name}_not_scaled.bw ${name}.bg
echo normalizing
normalize_bedGraph.py -i ${name}.bg -s $scaletrue -o ${name}_scaled.bg
LC_COLLATE=C sort -k1,1 -k2,2n ${name}_scaled.bg > ${name}_scaled_sorted.bg
cat ${name}_scaled_sorted.bg | grep -v "random" | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM"> \
${name}_scaled.bg
bedGraphToBigWig ${name}_scaled.bg $HOME/hg38/hg38.chrom.sizes ${name}_scaled.bigWig
rm ${name}_scaled.bg
rm ${name}_scaled_sorted.bg
rm ${name}.bg
Run the above script on each sample in parallel.
cd ${main}/bigWigs
for i in *_binned_not_scaled.bw
do
name=$(echo $i | awk -F"_binned_not_scaled.bw" '{print $1}')
factor=$(echo $name | awk -F"_" '{print $3}')
sbatch -p standard -A gioeli_lab -t 1:00:00 --cpus-per-task=10 --job-name scale_bigWig \
-o ../../logs/${name}_bamCoverage.log --wrap="sh ${scripts}/230329_ChIP_scale_bigWig_binned.sh \
${name}_binned_not_scaled.bw $factor ${main}/bigWigs"
sleep 1
done
The script used in the next code chunk (230329_ChIP_to_browser_binned.sh) is below. It merges the scaled bigWigs across replicates into one bigWig and one bedGraph per condition.
#! /bin/bash
directory=$2
PATH=$HOME/bin:$PATH
cd $directory
module load ucsc-tools/3.7.4
name=$(echo $1 | awk -F"T47D_Clone28_" '{print $2}' | awk -F"_rep1" '{print $1}')
echo $name
files=$(ls T47D_Clone28_${name}_rep*_binned_scaled.bigWig)
echo $files
echo "track type=bedGraph name=${name} alwaysZero=on visibility=full" > T47D_Clone28_${name}_binned_temp.txt
count=$(ls T47D_Clone28_${name}_rep*_binned_scaled.bigWig | wc -l | bc)
scaleall=$(bc <<< "scale=4 ; 1.0 / $count")
name=T47D_Clone28_${name}_binned
echo $count
echo $scaleall
bigWigMerge $files ${name}_tmp.bg
normalize_bedGraph.py -i ${name}_tmp.bg -s $scaleall -o ${name}_scaled.bg
LC_COLLATE=C sort -k1,1 -k2,2n ${name}_scaled.bg > ${name}_scaled_sorted.bg
cat ${name}_temp.txt ${name}_scaled_sorted.bg | \
grep -v "random" | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | grep -v "alt" > \
${name}_scaled.bedGraph
rm ${name}_tmp.bg
rm ${name}_temp.txt
rm ${name}_scaled.bg
rm ${name}_scaled_sorted.bg
head ${name}_scaled.bedGraph
bedGraphToBigWig ${name}_scaled.bedGraph $HOME/hg38/hg38.chrom.sizes \
${name}_combined_normalized.bigWig
gzip ${name}_scaled.bedGraph
Run the above script on each condition in parallel.
for i in T47D_Clone28_*_rep1_binned_scaled.bigWig
do
name=$(echo $i | awk -F"_rep1_scaled.bigWig" '{print $1}')
sbatch -p largemem -A gioeli_lab -t 4:00:00 -N 1 -n 1 --job-name to_browser -o ../../logs/${name}_to_browser.log \
--wrap="sh ${scripts}/230329_ChIP_to_browser_binned.sh $i ${main}/bigWigs"
sleep 1
done
exit
exit
title=ChIP
main=/scratch/ts2hx/${title}/results
cd /Users/TScott/Library/CloudStorage/Box-Box/GuertinLab/${title}/results
scp -r ts2hx@rivanna.hpc.virginia.edu:${main}/fastqc/ .
scp -r ts2hx@rivanna.hpc.virginia.edu:${main}/macs2/*_summit* macs2/
scp -r ts2hx@rivanna.hpc.virginia.edu:${main}/macs2/*peaks.narrowPeak macs2/
scp -r ts2hx@rivanna.hpc.virginia.edu:${main}/counts/Combined* counts/
scp -r ts2hx@rivanna.hpc.virginia.edu:${main}/bigWigs/*_not_scaled.bw bigWigs/
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/ChIP/results/bigWigs/*_binned_scaled.bedGraph.gz bigWigs/
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/ChIP/results/bigWigs/*_binned_combined_normalized.bigWig bigWigs/
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/ChIP/results/bigWigs/*_sizeFactor.txt counts/
scp ts2hx@rivanna.hpc.virginia.edu:${main}/chipqc/chipObj.R chipqc
Here we plot the fraction of reads in peaks, as well as the peak-calling-independent metric of cross-strand correlation.
library(ChIPQC)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
setwd("/Users/TScott/Library/CloudStorage/Box-Box/GuertinLab/ChIP/results/chipqc/")
load("chipObj.R")
pdf("ChIP_CC.pdf")
plotCC(chipObj, facetBy = c("Factor", "Condition"))
dev.off()
plotFrip(chipObj)
frip = frip(chipObj)[1:16]
Factor = sapply(strsplit(names(frip), "_"), "[", 1)
Condition = sapply(strsplit(names(frip), "_"), "[", 2)
Replicate = sapply(strsplit(names(frip), "_"), "[", 3)
FRiP = as.vector(frip)*100
Sample = names(frip)
frip = cbind.data.frame(Sample, FRiP, Factor, Condition, Replicate)
pdf("ChIP_FRiP.pdf", height = 5)
ggplot(frip, aes(x = Sample, y = FRiP, fill = Factor, alpha = Condition)) +
geom_bar(stat="identity") +
scale_fill_manual(values = c("grey40", "darkmagenta")) +
scale_alpha_manual(values = c(1, 0.5)) +
ylab("FRiP (%)") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
dev.off()
Set up.
ssh -Y ts2hx@rivanna.hpc.virginia.edu
ijob -c 1 -A gioeli_lab -p standard --cpus-per-task=20 --time=12:00:00
main=/scratch/ts2hx/ChIP/results
cd $main
mkdir -p heatmaps
cd heatmaps
module load deeptools/3.5.1
Define functional TRPS1 peaks by MACS2 q-value < 10^(-30). I chose this based on making CDFs for TRPS1 proximity to activated genes using the PRO-seq data. More liberal q-values gave peaks that were promoter-proximal and not differentially close to activated genes compared to unchanged genes.
awk '$5 > 30' /scratch/ts2hx/ChIP/results/macs2/HA_DMSO_vs_dTAG_summits.bed > \
Functional_TRPS1_summits.bed
Make the matrix (this can be used for a composite profile as well).
computeMatrix reference-point --referencePoint center -b 500 -a 500 -p 20 --missingDataAsZero \
-R Functional_TRPS1_summits.bed \
-S /scratch/ts2hx/ChIP/results/bigWigs/T47D_Clone28_HA_DMSO_combined_normalized.bigWig \
/scratch/ts2hx/ChIP/results/bigWigs/T47D_Clone28_HA_dTAG_combined_normalized.bigWig \
-o matrix_HA_ChIP_TRPS1_peaks.gz --outFileSortedRegions TRPS1_peaks_sorted_for_heatmap.bed
Make the heatmap.
plotHeatmap -m matrix_HA_ChIP_TRPS1_peaks.gz -out heatmap_HA_ChIP_TRPS1_peaks.pdf --heatmapHeight 14 \
--regionsLabel "Functional TRPS1 peaks" --xAxisLabel "Distance from summit" \
--samplesLabel "DMSO" "dTAG" --colorMap Purples --whatToShow "heatmap and colorbar"
Leave the cluster environment, and copy files to your local computer.
exit
exit
cd ~/Library/CloudStorage/Box-Box/GuertinLab/ChIP/results
scp -r ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/ChIP/results/heatmaps .
First get the complexity estimate file for size factors (total aligned reads).
title=ChIP
cd /Users/TScott/Library/CloudStorage/Box-Box/GuertinLab/${title}/
scp -r ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/${title}/logs/*chipseq.log logs/
scp -r ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/${title}/results/chipqc/chipObj.R results/chipqc/
cd logs
echo -e "name\traw_reads\taligned_reads\tunique_reads\testimated_size\tresultant_reads" > \
230322_complexity_estimate.txt
for i in *_HA_*_chipseq.log *_ER_*_chipseq.log *_IgG_*_chipseq.log
do
name=`basename $i _chipseq.log`
raw_reads=`grep "Raw reads:" $i | awk '{print $3}'`
aligned_reads=`grep "READ:" $i | awk '{print $2/2}'`
unique_reads=`grep "WRITTEN:" $i | awk '{print $2/2}'`
estimated_size=`grep "ESTIMATED_LIBRARY_SIZE:" $i | awk '{print $2}'`
resultant_reads=`grep "Resultant reads:" $i | awk '{print $3}'`
echo -e "$name\t$raw_reads\t$aligned_reads\t$unique_reads\t$estimated_size\t$resultant_reads" >> \
230322_complexity_estimate.txt
done
Now use DESeq2 to identify differentially bound peaks (in this case all of them).
direc = '~/Library/CloudStorage/Box-Box/GuertinLab/ChIP/results/counts/'
setwd(direc)
library(DESeq2)
library(lattice)
library(viridis)
library(DEGreport)
library(lattice)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)
Get read depth.
complexity_estimate <- read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ChIP/logs/230322_complexity_estimate.txt", header = T)
name = sapply(strsplit(complexity_estimate$name, "Clone28_"), "[", 2)
complexity_estimate = t(complexity_estimate$unique_reads)
colnames(complexity_estimate) = name
complexity_estimate = complexity_estimate[,-grep("ER", colnames(complexity_estimate))]
Read in the counts table and change the column names.
x <- read.table("Combined_HA_DMSO_vs_dTAG_peak_counts.txt", sep = '\t', header = TRUE)
rownames(x) = paste0(x[,1], ":", x[,2], "-", x[,3])
peak_name = x[,4]
peak_qvalue = x[,5]
x = x[,-c(1:6)]
colnames(x) = sapply(strsplit(sapply(strsplit(colnames(x), "Clone28_"), "[", 2), "_HA_DMSO"), "[", 1)
Subtract the read depth normalized background signal.
background <- sweep(x[,grep("IgG", colnames(x))], 2, complexity_estimate[grep("IgG", names(complexity_estimate))], FUN = "/")
background = rowMeans(background)
background = cbind.data.frame(background, background, background, background, background, background, background, background)
background <- sweep(as.data.frame(background), 2, complexity_estimate[grep("HA", names(complexity_estimate))], FUN = "*")
background = round(background, digits = 0)
x = x[,grep("HA", colnames(x))] - background
x[x < 0] = 0
Run Deseq2.
treatment = factor(sapply(strsplit(colnames(x), '_'), '[', 2), levels = c("DMSO", "dTAG"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(treatment, rep), ~ treatment)
factors <- read.table("HA_sizeFactor.txt")
size_factors = factors$V2
names(size_factors) = factors$V1
sizeFactors(deseq.df) <- size_factors
deseq.df = DESeq(deseq.df)
Plot.
pdf("MA_HA_DMSO_vs_dTAG_subtract_IgG_qvalue_30.pdf")
res.deseq = results(deseq.df[peak_qvalue > 30,], contrast = c("treatment", "dTAG", "DMSO"))
save(res.deseq, file = "res.deseq.HA.q30.Rdata")
plotMA(res.deseq, ylim = c(-4,4))
dev.off()
Log in to the cluster to perform the processing for each sample in parallel.
ssh -Y ts2hx@rivanna.hpc.virginia.edu
ijob -c 1 -A gioeli_lab -p standard --cpus-per-task=10 --time=8:00:00
Make the main directory (should make nice subdirectories like I did for ChIP).
main=/scratch/ts2hx/TRPS1_timecourse_ATAC
mkdir -p $main
Get the required script for bigWig merging and move to a directory that will end up in the $PATH.
wget https://raw.githubusercontent.com/guertinlab/Nascent_RNA_Methods/main/normalize_bedGraph.py
chmod +x normalize_bedGraph.py
mv normalize_bedGraph.py $HOME/bin/
Get the ENCODE blacklist.
wget https://raw.githubusercontent.com/Boyle-Lab/Blacklist/master/lists/hg38-blacklist.v2.bed.gz -P $HOME/hg38/
gunzip $HOME/hg38/hg38-blacklist.v2.bed.gz
mv $HOME/hg38/hg38-blacklist.v2.bed $HOME/hg38/hg38.blacklist.bed
Get chrM for pre-alignment.
module load gcc/9.2.0 bowtie2/2.2.9
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz
tar -xzf hg38.chromFa.tar.gz
mkdir -p $HOME/hg38
mv chroms/chrM.fa $HOME/hg38
rm -r chroms
rm hg38.chromFa.tar.gz
ls $HOME/hg38
bowtie2-build $HOME/hg38/chrM.fa $HOME/hg38/chrM_hg38
Make tallymer and table for seqOutBias (this takes a while).
cd $HOME/seqOutBias/
seqOutBias seqtable /project/genomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome.fa --read-size=62
Download and gzip.
module load gcc/11.4.0 sratoolkit/3.0.3
cd $main
for i in {46..69}
do
sbatch -p standard -A gioeli_lab -t 1:00:00 -N 1 -n 1 --cpus-per-task=10 --job-name fasterq-dump \
-o /scratch/ts2hx/${title}/logs/SRR250739${i}_fasterq_dump.log \
--wrap="module load gcc/11.4.0 sratoolkit/3.0.3; fasterq-dump SRR250739${i}; gzip SRR250739${i}_*.fastq"
sleep 1
done
Rename.
mv SRR25073969_1.fastq.gz T47D_Clone28_0hour_rep1_PE1.fastq.gz
mv SRR25073969_2.fastq.gz T47D_Clone28_0hour_rep1_PE2.fastq.gz
mv SRR25073968_1.fastq.gz T47D_Clone28_0hour_rep2_PE1.fastq.gz
mv SRR25073968_2.fastq.gz T47D_Clone28_0hour_rep2_PE2.fastq.gz
mv SRR25073967_1.fastq.gz T47D_Clone28_0hour_rep3_PE1.fastq.gz
mv SRR25073967_2.fastq.gz T47D_Clone28_0hour_rep3_PE2.fastq.gz
mv SRR25073966_1.fastq.gz T47D_Clone28_0hour_rep4_PE1.fastq.gz
mv SRR25073966_2.fastq.gz T47D_Clone28_0hour_rep4_PE2.fastq.gz
mv SRR25073965_1.fastq.gz T47D_Clone28_0.5hour_rep1_PE1.fastq.gz
mv SRR25073965_2.fastq.gz T47D_Clone28_0.5hour_rep1_PE2.fastq.gz
mv SRR25073964_1.fastq.gz T47D_Clone28_0.5hour_rep2_PE1.fastq.gz
mv SRR25073964_2.fastq.gz T47D_Clone28_0.5hour_rep2_PE2.fastq.gz
mv SRR25073963_1.fastq.gz T47D_Clone28_0.5hour_rep3_PE1.fastq.gz
mv SRR25073963_2.fastq.gz T47D_Clone28_0.5hour_rep3_PE2.fastq.gz
mv SRR25073962_1.fastq.gz T47D_Clone28_0.5hour_rep4_PE1.fastq.gz
mv SRR25073962_2.fastq.gz T47D_Clone28_0.5hour_rep4_PE2.fastq.gz
mv SRR25073961_1.fastq.gz T47D_Clone28_1hour_rep1_PE1.fastq.gz
mv SRR25073961_2.fastq.gz T47D_Clone28_1hour_rep1_PE2.fastq.gz
mv SRR25073960_1.fastq.gz T47D_Clone28_1hour_rep2_PE1.fastq.gz
mv SRR25073960_2.fastq.gz T47D_Clone28_1hour_rep2_PE2.fastq.gz
mv SRR25073959_1.fastq.gz T47D_Clone28_1hour_rep3_PE1.fastq.gz
mv SRR25073959_2.fastq.gz T47D_Clone28_1hour_rep3_PE2.fastq.gz
mv SRR25073958_1.fastq.gz T47D_Clone28_1hour_rep4_PE1.fastq.gz
mv SRR25073958_2.fastq.gz T47D_Clone28_1hour_rep4_PE2.fastq.gz
mv SRR25073957_1.fastq.gz T47D_Clone28_2hour_rep1_PE1.fastq.gz
mv SRR25073957_2.fastq.gz T47D_Clone28_2hour_rep1_PE2.fastq.gz
mv SRR25073956_1.fastq.gz T47D_Clone28_2hour_rep2_PE1.fastq.gz
mv SRR25073956_2.fastq.gz T47D_Clone28_2hour_rep2_PE2.fastq.gz
mv SRR25073955_1.fastq.gz T47D_Clone28_2hour_rep3_PE1.fastq.gz
mv SRR25073955_2.fastq.gz T47D_Clone28_2hour_rep3_PE2.fastq.gz
mv SRR25073954_1.fastq.gz T47D_Clone28_2hour_rep4_PE1.fastq.gz
mv SRR25073954_2.fastq.gz T47D_Clone28_2hour_rep4_PE2.fastq.gz
mv SRR25073953_1.fastq.gz T47D_Clone28_4hour_rep1_PE1.fastq.gz
mv SRR25073953_2.fastq.gz T47D_Clone28_4hour_rep1_PE2.fastq.gz
mv SRR25073952_1.fastq.gz T47D_Clone28_4hour_rep2_PE1.fastq.gz
mv SRR25073952_2.fastq.gz T47D_Clone28_4hour_rep2_PE2.fastq.gz
mv SRR25073951_1.fastq.gz T47D_Clone28_4hour_rep3_PE1.fastq.gz
mv SRR25073951_2.fastq.gz T47D_Clone28_4hour_rep3_PE2.fastq.gz
mv SRR25073950_1.fastq.gz T47D_Clone28_4hour_rep4_PE1.fastq.gz
mv SRR25073950_2.fastq.gz T47D_Clone28_4hour_rep4_PE2.fastq.gz
mv SRR25073949_1.fastq.gz T47D_Clone28_24hour_rep1_PE1.fastq.gz
mv SRR25073949_2.fastq.gz T47D_Clone28_24hour_rep1_PE2.fastq.gz
mv SRR25073948_1.fastq.gz T47D_Clone28_24hour_rep2_PE1.fastq.gz
mv SRR25073948_2.fastq.gz T47D_Clone28_24hour_rep2_PE2.fastq.gz
mv SRR25073947_1.fastq.gz T47D_Clone28_24hour_rep3_PE1.fastq.gz
mv SRR25073947_2.fastq.gz T47D_Clone28_24hour_rep3_PE2.fastq.gz
mv SRR25073946_1.fastq.gz T47D_Clone28_24hour_rep4_PE1.fastq.gz
mv SRR25073946_2.fastq.gz T47D_Clone28_24hour_rep4_PE2.fastq.gz
The script used in the next code chunk (230417_ATACseq_analysis_on_input_file.sh) is below:
#! /bin/bash
directory=/scratch/$USER/TRPS1_timecourse_ATAC
read_size=62
cores=$SLURM_CPUS_PER_TASK
chrM=$HOME/hg38/chrM_hg38
genome_index=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome
tallymer=$HOME/seqOutBias/genome.tal_${read_size}.gtTxt.gz
table=$HOME/seqOutBias/genome_${read_size}.4.2.2.tbl
PATH=$HOME/bin:$PATH
cd $directory
module load cutadapt/3.4
module load gcc/9.2.0 pigz/2.4
module load bowtie2/2.2.9
module load samtools/1.12
module load wigtobigwig/2.8
module load bedtools/2.29.2
module load deeptools/3.5.1
filename=$1
name=$(echo $filename | awk -F"_PE1.fastq.gz" '{print $1}')
echo $name
gunzip ${name}_PE1.fastq.gz
gunzip ${name}_PE2.fastq.gz
#Remove adapters (also trim PE1 reads to the same length as PE2 (that was unnecessary))
cutadapt -j $cores -l $read_size -m 10 -O 1 -a CTGTCTCTTATACACATCT ${name}_PE1.fastq -o ${name}_PE1_no_adapt.fastq
cutadapt -j $cores -m 10 -O 1 -a CTGTCTCTTATACACATCT ${name}_PE2.fastq -o ${name}_PE2_no_adapt.fastq
#Align to chrM first and remove aligned reads
bowtie2 -p $((cores-2)) -x $chrM -U ${name}_PE1_no_adapt.fastq | \
samtools sort -n - | samtools fastq -f 0x4 - > ${name}_PE1.chrM.fastq
#Pair reads
reads=$(wc -l ${name}_PE1.chrM.fastq | awk '{print $1/4}')
fastq_pair -t $reads ${name}_PE1.chrM.fastq ${name}_PE2_no_adapt.fastq
#Align to the hg38 genome and remove duplicates
bowtie2 -p $((cores-5)) --maxins 800 -x $genome_index \
-1 ${name}_PE1.chrM.fastq.paired.fq -2 ${name}_PE2_no_adapt.fastq.paired.fq | \
samtools view -bS - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | \
samtools markdup -s -r - ${name}.hg38.bam
#Convert BAM to bed for counting (previously created the tallymer with `seqOutBias seqtable`)
seqOutBias scale $table ${name}.hg38.bam --no-scale --shift-counts --skip-bw \
--read-size=$read_size --tallymer=$tallymer --bed=$name.bed
#Remove noncanonical chromosomes and sort for use with mapBed
grep -v "random" ${name}_not_scaled.bed | grep -v "chrUn" | grep -v "chrEBV" | grep -v "alt" | \
sort -k1,1 -k2,2n > ${name}_tmp.txt && mv ${name}_tmp.txt ${name}_not_scaled.bed
#Convert BAM to bigWig for visualization; extends the reads from PE1 to PE2;
#uses a bin size of 10 for a smoother output; normalizes as bins per million, analagous to TPM for transcripts
bamCoverage --bam ${name}.hg38.bam -bs 10 -p 10 -e --normalizeUsing BPM -bl $HOME/hg38/hg38.blacklist.bed \
-o ${name}_scaled.bigWig
#Zip fastq's and delete intermediate files
pigz -p $cores ${name}_PE1.fastq
pigz -p $cores ${name}_PE2.fastq
rm ${name}_PE1_no_adapt.fastq
rm ${name}_PE2_no_adapt.fastq
rm ${name}_PE1.chrM.fastq
rm ${name}_PE1.chrM.fastq.paired.fq
rm ${name}_PE1.chrM.fastq.single.fq
rm ${name}_PE2_no_adapt.fastq.paired.fq
rm ${name}_PE2_no_adapt.fastq.single.fq
Run the above script on each sample in parallel.
chmod +x /scratch/ts2hx/scripts/230417_*.sh
for filename in *_PE1.fastq.gz
do
name=`basename $filename _PE1.fastq.gz`
sbatch -p standard -A gioeli_lab -t 1-00:00:00 -N 1 -n 1 --cpus-per-task=10 --job-name ${name}_atacseq_analysis \
-o ${name}_atacseq.log --wrap="sh /scratch/ts2hx/scripts/230417_ATACseq_analysis_on_input_file.sh $filename"
sleep 1 # wait 1 second between each job submission
done
#Should include this indexing step in the above script
for i in *.hg38.bam
do
name=`basename $filename .hg38.bam`
sbatch -p standard -A gioeli_lab -t 1:00:00 -N 1 -n 1 --cpus-per-task=10 --job-name ${name}_index \
-o ${name}_index.log --wrap="module load samtools/1.12; samtools index $i"
sleep 1
done
This code uses all bam files, calls sub-peak summits, and makes each fragment 200bp centered on the original 5’ end of the read.
name=TRPS1_timecourse_ATAC
sbatch -p largemem -A gioeli_lab -t 1-00:00:00 -N 1 -n 1 --mem-per-cpu=64000 --job-name macs2 -o ${name}_macs2.log \
--wrap="module load macs2/2.2.7.1; \
macs2 callpeak --call-summits -t *.hg38.bam -n TRPS1_timecourse_ATAC -g hs -q 0.01 --keep-dup all \
-f BAM --nomodel --shift -100 --extsize 200"
Remove noncanonical chromosomes and blacklisted regions.
blacklist=$HOME/hg38/hg38.blacklist.bed
module load gcc/9.2.0 bedtools/2.29.2
grep -v "random" ${name}_peaks.narrowPeak | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | \
grep -v "alt" | intersectBed -v -a stdin -b $blacklist > ${name}_tmp.txt
mv ${name}_tmp.txt ${name}_peaks.narrowPeak
grep -v "random" ${name}_summits.bed | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | \
grep -v "alt" | intersectBed -v -a stdin -b $blacklist > ${name}_tmp.txt
mv ${name}_tmp.txt ${name}_summits.bed
Make a window of 100bp on either side of the summit.
slopBed -b 100 -i ${name}_summits.bed -g $HOME/hg38/hg38.chrom.sizes > ${name}_summit_window.bed
Count reads where the cut site is within these windows.
peaks=${name}_summit_window.bed
for bed in *_not_scaled.bed
do
name=`basename $bed _not_scaled.bed`
sbatch -p standard -A gioeli_lab -t 1:00:00 -N 1 -n 1 --job-name ${name}_mapBed -o ${name}_mapBed.log \
--wrap="module load gcc/9.2.0 bedtools/2.29.2; mapBed -null '0' -a $peaks -b $bed > ${name}_peak_counts.txt;"
sleep 1
done
Just keep the last column (the counts). I should do this in the above jobs.
for counts in T47D_Clone28_*_peak_counts.txt
do
name=`basename $counts _peak_counts.txt`
awk '{print $NF}' ${name}_peak_counts.txt > ${name}_peak_counts_only.txt
echo $name | cat - ${name}_peak_counts_only.txt > ${name}_peak_counts.txt
rm ${name}_peak_counts_only.txt
done
Combine into 1 file per factor.
echo -e "chr\tstart\tend\tname\tqvalue" | cat - $peaks | \
paste -d'\t' - T47D_Clone28_*_peak_counts.txt > Combined_ATAC_peak_counts.txt
rm T47D_Clone28_*_peak_counts.txt
Divide each by the geometric mean of all of them.
module load samtools/1.12
i=ATAC
> ${i}_header.txt
> ${i}_reads.txt
for j in T47D_Clone28_*.hg38.bam
do
echo $j
name=$(echo $j | awk -F".hg38.bam" '{print $1}')
echo $name | paste ${i}_header.txt - > ${i}_tmp.txt
mv ${i}_tmp.txt ${i}_header.txt
reads=`samtools view -c $j`
echo $reads | paste ${i}_reads.txt - > ${i}_tmp.txt
mv ${i}_tmp.txt ${i}_reads.txt
done
cat ${i}_header.txt ${i}_reads.txt > ${i}_tmp.txt
mv ${i}_tmp.txt ${i}_reads.txt
rm ${i}_header.txt
Combine these into one file.
module load gcc/9.2.0 openmpi/3.1.6 R/4.2.1
R
library(DESeq2)
setwd("/scratch/ts2hx/TRPS1_timecourse_ATAC/")
samples <- "ATAC_reads.txt"
x <- read.table(samples, sep = '\t', header = TRUE)[,-1]
write.table(estimateSizeFactorsForMatrix(x), file = paste0("/scratch/ts2hx/TRPS1_timecourse_ATAC/ATAC_sizeFactor.txt"), quote =F, col.names=F)
q()
The script used in the next code chunk (230417_ATAC_to_browser.sh) is below. It merges the scaled bigWigs across replicates into one bigWig and one bedGraph per condition.
#! /bin/bash
directory=/scratch/$USER/TRPS1_timecourse_ATAC
PATH=$HOME/bin:$PATH
cd $directory
module load ucsc-tools/3.7.4
name=$(echo $1 | awk -F"T47D_Clone28_" '{print $2}' | awk -F"_rep1" '{print $1}')
echo $name
files=$(ls T47D_Clone28_${name}_rep*_scaled.bigWig)
echo $files
touch T47D_Clone28_${name}_temp.txt
echo "track type=bedGraph name=${name} alwaysZero=on visibility=full" >> T47D_Clone28_${name}_temp.txt
count=$(ls T47D_Clone28_${name}_rep*_scaled.bigWig | wc -l | bc)
scaleall=$(bc <<< "scale=4 ; 1.0 / $count")
name=T47D_Clone28_${name}
echo $count
echo $scaleall
bigWigMerge $files ${name}_tmp.bg
normalize_bedGraph.py -i ${name}_tmp.bg -s $scaleall -o ${name}_scaled.bg
LC_COLLATE=C sort -k1,1 -k2,2n ${name}_scaled.bg > ${name}_scaled_sorted.bg
cat ${name}_temp.txt ${name}_scaled_sorted.bg > ${name}_scaled.bedGraph
rm ${name}_tmp.bg
rm ${name}_temp.txt
rm ${name}_scaled.bg
rm ${name}_scaled_sorted.bg
head ${name}_scaled.bedGraph
bedGraphToBigWig ${name}_scaled.bedGraph hg38.chrom.sizes ${name}_combined_normalized.bigWig
gzip -f ${name}_scaled.bedGraph
Run the above script on each sample in parallel.
for i in T47D_Clone28_*_rep1_scaled.bigWig
do
name=$(echo $i | awk -F"_rep1_scaled.bigWig" '{print $1}')
sbatch -p largemem -A gioeli_lab -t 4:00:00 -N 1 -n 1 --job-name to_browser -o ${name}_to_browser.log \
--wrap="sh /scratch/ts2hx/scripts/230417_ATAC_to_browser.sh $i"
sleep 1
done
exit
exit
#Copy files
cd ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_ATAC/Combined_ATAC_peak_counts.txt .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_ATAC/ATAC_sizeFactor.txt .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_ATAC/T47D_Clone28_*.hg38.bam* \
/Volumes/External/Users/TScott/TRPS1_timecourse_ATAC_bams
Here we plot the distribution of fragment lengths, as well as the enrichment of signal around transcription start sites.
direc = '/Volumes/External/Users/TScott/TRPS1_timecourse_ATAC_bams'
setwd(direc)
library(BiocManager)
BiocManager::install(c("ATACseqQC", "ChIPpeakAnno", "MotifDb", "GenomicAlignments",
"BSgenome.Hsapiens.UCSC.hg38", "TxDb.Hsapiens.UCSC.hg38.knownGene",
"phastCons100way.UCSC.hg38"))
library(ATACseqQC)
library(Rsamtools)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(GenomicAlignments)
#input the bamFiles
bamfile <- list.files(path = ".", pattern = "*.hg38.bam$", all.files = FALSE, full.names = FALSE, recursive = FALSE, ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
#change the order to reflect the time points
bamfile = bamfile[c(5:8, 1:4, 9:12, 17:24, 13:16)]
#change names to remove excess
bamfile.labels <- gsub(".hg38.bam", "", basename(bamfile))
#get TSS's
txs <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
#get tags
possibleTag <- list("integer"=c("AM", "AS", "CM", "CP", "FI", "H0", "H1", "H2",
"HI", "IH", "MQ", "NH", "NM", "OP", "PQ", "SM",
"TC", "UQ"),
"character"=c("BC", "BQ", "BZ", "CB", "CC", "CO", "CQ", "CR",
"CS", "CT", "CY", "E2", "FS", "LB", "MC", "MD",
"MI", "OA", "OC", "OQ", "OX", "PG", "PT", "PU",
"Q2", "QT", "QX", "R2", "RG", "RX", "SA", "TS",
"U2"))
#get fragment size distributions (modified from ATACseqQC for easier plotting)
bamFiles = bamfile
bamFiles.labels = bamfile.labels
index = bamFiles
ylim = NULL
pe <- mapply(testPairedEndBam, bamFiles, index)
if (any(!pe)) {
stop(paste(bamFiles[!pe], collapse = ", "), "is not Paired-End file.")
}
summaryFunction <- function(seqname, seqlength, bamFile, ind, ...) {
param <- ScanBamParam(what = c("isize"), which = GRanges(seqname, IRanges(1, seqlength)), flag = scanBamFlag(isSecondaryAlignment = FALSE, isUnmappedQuery = FALSE, isNotPassingQualityControls = FALSE)) table(abs(unlist(sapply(scanBam(bamFile, index = ind, ..., param = param), `[[`, "isize"), use.names = FALSE)))
}
idxstats <- unique(do.call(rbind, mapply(function(.ele, .ind) idxstatsBam(.ele, index = .ind)[, c("seqnames", "seqlength")], bamFiles, index, SIMPLIFY = FALSE)))
idxstats <- idxstats[idxstats[, "seqnames"] != "*", , drop = FALSE]
seqnames <- as.character(idxstats[, "seqnames"])
seqlen <- as.numeric(idxstats[, "seqlength"])
seqlen <- checkMaxChrLength(seqlen)
fragment.len <- mapply(function(bamFile, ind) summaryFunction(seqname = seqnames, seqlength = seqlen, bamFile, ind), bamFiles, index, SIMPLIFY = FALSE)
names(fragment.len) <- bamFiles.labels
minor.ticks.axis <- function(ax, n = 9, t.ratio = 0.5, mn, mx, ...) {
lims <- par("usr")
lims <- if (ax %in% c(1, 3))
lims[1:2]
else lims[3:4]
major.ticks <- pretty(lims, n = 5)
if (missing(mn))
mn <- min(major.ticks)
if (missing(mx))
mx <- max(major.ticks)
major.ticks <- major.ticks[major.ticks >= mn & major.ticks <= mx]
labels <- sapply(major.ticks, function(i) as.expression(bquote(10^.(i))))
axis(ax, at = major.ticks, labels = labels, las = ifelse(ax %in% c(2, 4), 2, 1), ...)
n <- n + 2
minors <- log10(pretty(10^major.ticks[1:2], n)) - major.ticks[1]
minors <- minors[-c(1, n)]
minor.ticks = c(outer(minors, major.ticks, `+`))
minor.ticks <- minor.ticks[minor.ticks > mn & minor.ticks < mx]
axis(ax, at = minor.ticks, tcl = par("tcl") * t.ratio, labels = FALSE)
}
#save for future use
save(fragment.len, file = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/ATAC_fragment_lengths.RData")
#plot
pdf("~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/ATAC_fragSize.pdf", height = 10, width = 15)
par(mfcol=c(4,6))
null <- mapply(function(frag.len, frag.name) {
x <- 1:1010
frag.len <- frag.len[match(x, names(frag.len))]
frag.len[is.na(frag.len)] <- 0
y <- frag.len/sum(frag.len)
y <- as.numeric(y)
names(y) <- x
plot(x, y * 10^3, main = frag.name,
xlim = c(0, 1010), ylim = c(0,13), xlab = "Fragment length (bp)",
ylab = expression(Normalized ~ read ~ density ~ x ~ 10^-3), type = "l")
}, fragment.len, sapply(strsplit(names(fragment.len), "T47D_Clone28_"), "[", 2))
dev.off()
#get TSS enrichments
#initialize
seqlev <- "chr1" ## subsample data for quick run
seqinformation <- seqinfo(TxDb.Hsapiens.UCSC.hg38.knownGene)
which <- as(seqinformation[seqlev], "GRanges")
#make an empty list to fill with each sample
TSSEs <- sapply(bamfile.labels,function(x) NULL)
for (i in 1:length(bamfile)) {
print(bamfile[i])
#get reads with tags
bamTop100 <- scanBam(BamFile(bamfile[i], yieldSize = 100),
param = ScanBamParam(tag=unlist(possibleTag)))[[1]]$tag
tags <- names(bamTop100)[lengths(bamTop100)>0]
tags
#shift the coordinates of 5'ends of alignments in the bam file
gal <- readBamFile(bamfile[i], tag=tags, which=which, asMates=TRUE, bigFile=TRUE)
rm(gal1)
gal1 <- shiftGAlignmentsList(gal)
#calculate TSS enrichment and append to the list
tsse <- TSSEscore(gal1, txs)
TSSEs[[i]] = tsse$values
save(TSSEs, file = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/ATAC_TSSEs.RData")
print(tsse$values)
#plot enrichment around TSS's
plot(100*(-9:10-.5), tsse$values, type="b",
xlim=c(-1000, 1000),
xlab="distance to TSS",
ylab="aggregate TSS score",
main=sapply(strsplit(sapply(strsplit(bamfile[i], "T47D_Clone28_"), "[", 2), ".hg38.bam"), "[", 1))
}
#save for future use
save(TSSEs, file = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/ATAC_TSSEs.RData")
load("~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/ATAC_TSSEs.RData")
#plot
pdf("~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/ATAC_TSS_enrichment.pdf", height = 10, width = 15)
par(mfcol=c(4,6))
for (i in 1:length(TSSEs)) {
plot(100*(-9:10-.5), TSSEs[[i]], type="b",
xlim=c(-1000, 1000),
ylim=c(0,75),
xlab="distance to TSS",
ylab="aggregate TSS score",
main=sapply(strsplit(names(TSSEs)[i], "T47D_Clone28_"), "[", 2))
}
dev.off()
Set up libraries and functions in R.
direc = '~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/'
setwd(direc)
library(DESeq2)
library(lattice)
library(viridis)
library(DEGreport)
library(lattice)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)
library(MatchIt)
tighten_summit_window <- function(res.deseq, b = 100) {
chr = sapply(strsplit(rownames(res.deseq), ':'), '[', 1)
start = as.numeric(sapply(strsplit(sapply(strsplit(rownames(res.deseq), ':'), '[', 2), "-"), "[", 1)) + b
end = as.numeric(sapply(strsplit(rownames(res.deseq), '-'), '[', 2)) - b
df = cbind.data.frame(chr, start, end)
return(df)
}
`%notin%` <- Negate(`%in%`)
Read in the counts table and change the column names.
x <- read.table("Combined_ATAC_peak_counts.txt", sep = '\t', header = TRUE)
rownames(x) = paste0(x[,1], ":", x[,2], "-", x[,3])
peak_name = x[,4]
peak_qvalue = x[,5]
x = x[,-c(1:5)]
colnames(x) = sapply(strsplit(colnames(x), "Clone28_"), "[", 2)
Run DESeq2.
time = factor(sapply(strsplit(colnames(x), '_'), '[', 1), levels = c("0hour", "0.5hour", "1hour", "2hour", "4hour", "24hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(time, rep), ~ rep + time)
factors <- read.table("ATAC_sizeFactor.txt")
size_factors = factors$V2
names(size_factors) = factors$V1
sizeFactors(deseq.df) <- size_factors
deseq.df = DESeq(deseq.df)
save(deseq.df, file = "deseq.df.ATAC.Rdata")
load("deseq.df.ATAC.Rdata")
Identify differentially accessible peaks.
i = "0.5hour"
res.deseq = results(deseq.df[peak_qvalue > 100,], contrast = c("time", i, "0hour"))
sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0) #472
sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0) #36
Match the unchanged peaks to activated peaks based on accessibility.
activated = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]
unchanged$treatment = 0
activated$treatment = 1
df.deseq.effects.lattice = rbind(unchanged, activated)
out = matchit(treatment ~ baseMean, data = df.deseq.effects.lattice, method = "optimal", ratio = 1)
unchanged = df.deseq.effects.lattice[rownames(df.deseq.effects.lattice) %in% out$match.matrix,]
repressed = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0,]
df.deseq.effects.lattice = res.deseq
df.deseq.effects.lattice$status = "Other"
df.deseq.effects.lattice$status[rownames(df.deseq.effects.lattice) %in% rownames(activated)] = "Increased"
df.deseq.effects.lattice$status[rownames(df.deseq.effects.lattice) %in% rownames(unchanged)] = "Unchanged"
df.deseq.effects.lattice$status[rownames(df.deseq.effects.lattice) %in% rownames(repressed)] = "Decreased"
df.deseq.effects.lattice$status = factor(df.deseq.effects.lattice$status, levels = c("Other", "Increased", "Unchanged", "Decreased"))
Plot.
pdf(paste0("MA_ATAC_q100_", i, "_for_figure.pdf"), width = 7, height = 3.5)
ggplot(as.data.frame(df.deseq.effects.lattice) %>% arrange(status), aes(x = log(baseMean, 10), y = log2FoldChange, color = status)) +
geom_point() +
scale_color_manual(values = c("grey90", "red", "grey60", "blue")) +
geom_hline(yintercept = 0, linetype="dashed") +
ylim(-1.2, 1.2) +
xlim(1.5, 3.5) +
ylab(expression("log"[2]~"(fold change in ATAC signal)")) +
xlab(expression("log"[10]~"(mean of normalized intensity)")) +
ggtitle("Change in chromatin accessibility upon TRPS1 depletion") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
labs(color=NULL)
dev.off()
Write regions as tables.
increased = as.data.frame(activated)
increased$region = rownames(increased)
increased$chr = sapply(strsplit(increased$region, ":"), "[", 1)
increased$start = sapply(strsplit(sapply(strsplit(increased$region, ":"), "[", 2), "-"), "[", 1)
increased$end = sapply(strsplit(sapply(strsplit(increased$region, ":"), "[", 2), "-"), "[", 2)
write.table(increased[,c("chr", "start", "end")], file = "increased_ATAC_30min_summits.bed", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
decreased = as.data.frame(repressed)
decreased$region = rownames(decreased)
decreased$chr = sapply(strsplit(decreased$region, ":"), "[", 1)
decreased$start = sapply(strsplit(sapply(strsplit(decreased$region, ":"), "[", 2), "-"), "[", 1)
decreased$end = sapply(strsplit(sapply(strsplit(decreased$region, ":"), "[", 2), "-"), "[", 2)
write.table(decreased[,c("chr", "start", "end")], file = "decreased_ATAC_30min_summits.bed", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
Repeat above with ATAC-seq peaks that are distal to sites of bidirectional transcription. The dREG elements are generated below in this vignette in the PRO-seq data pre-processing section.
#Get dREG elements
dREG = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/dTAG_TRPS1_timecourse_dREG/dTAG_TRPS1_timecourse_dREG.dREG.peak.full.bed")
#Get distal peaks
res.deseq$chr = sapply(strsplit(rownames(res.deseq), ":"), "[", 1)
res.deseq$start = sapply(strsplit(sapply(strsplit(rownames(res.deseq), ":"), "[", 2), "-"), "[", 1)
res.deseq$end = sapply(strsplit(sapply(strsplit(rownames(res.deseq), ":"), "[", 2), "-"), "[", 2)
res.deseq = res.deseq[, c(7:9,1:6)]
res.deseq$start = as.numeric(res.deseq$start) + 100
res.deseq$end = as.numeric(res.deseq$end) - 100
dREG_distal_ATAC_peaks = bedTools.intersect(bed1 = res.deseq, bed2 = dREG, opt.string = "-v")
res.deseq = dREG_distal_ATAC_peaks
#Match the unchanged peaks to activated peaks based on accessibility.
activated = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]
unchanged$treatment = 0
activated$treatment = 1
df.deseq.effects.lattice = rbind(unchanged, activated)
out = matchit(treatment ~ baseMean, data = df.deseq.effects.lattice, method = "optimal", ratio = 1)
unchanged = df.deseq.effects.lattice[rownames(df.deseq.effects.lattice) %in% out$match.matrix,]
repressed = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0,]
df.deseq.effects.lattice = res.deseq
df.deseq.effects.lattice$status = "Other"
df.deseq.effects.lattice$status[rownames(df.deseq.effects.lattice) %in% rownames(activated)] = "Increased"
df.deseq.effects.lattice$status[rownames(df.deseq.effects.lattice) %in% rownames(unchanged)] = "Unchanged"
df.deseq.effects.lattice$status[rownames(df.deseq.effects.lattice) %in% rownames(repressed)] = "Decreased"
df.deseq.effects.lattice$status = factor(df.deseq.effects.lattice$status, levels = c("Other", "Increased", "Unchanged", "Decreased"))
#Plot
pdf("MA_dREG_distal_ATAC_peaks.pdf", width = 7, height = 3.5)
ggplot(as.data.frame(df.deseq.effects.lattice) %>% arrange(status), aes(x = log(baseMean, 10), y = log2FoldChange, color = status)) +
geom_point() +
scale_color_manual(values = c("grey90", "red", "grey60", "blue")) +
geom_hline(yintercept = 0, linetype="dashed") +
ylim(-1.2, 1.2) +
xlim(1.5, 3.5) +
ylab(expression("log"[2]~"(fold change in ATAC signal)")) +
xlab(expression("log"[10]~"(mean of normalized intensity)")) +
ggtitle("Change in chromatin accessibility upon TRPS1 depletion distal to dREG elements") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
labs(color=NULL)
dev.off()
For each time point, identify differentially accessible peaks and match based on accessibility. Should have called these “increased” and “decreased” for consistency.
for (i in levels(time)[-1]) {
res.deseq = results(deseq.df[peak_qvalue > 100,], contrast = c("time", i, "0hour"))
sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj))
activated = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
write.table(tighten_summit_window(activated, b = 100), file = paste0(i, '_activated_peaks.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]
unchanged$treatment = 0
activated$treatment = 1
df.deseq.effects.lattice = rbind(unchanged, activated)
out = matchit(treatment ~ baseMean, data = df.deseq.effects.lattice, method = "optimal", ratio = 1)
unchanged = df.deseq.effects.lattice[rownames(df.deseq.effects.lattice) %in% out$match.matrix,]
write.table(tighten_summit_window(unchanged, b = 100), file = paste0(i, '_activated_matched_peaks.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
repressed = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0,]
write.table(tighten_summit_window(repressed, b = 100), file = paste0(i, '_repressed_peaks.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
unchanged = res.deseq[rownames(res.deseq) %notin% rownames(repressed) & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
unchanged$treatment = 0
repressed$treatment = 1
df.deseq.effects.lattice = rbind(unchanged, repressed)
out = matchit(treatment ~ baseMean, data = df.deseq.effects.lattice, method = "optimal", ratio = 1)
unchanged = df.deseq.effects.lattice[rownames(df.deseq.effects.lattice) %in% out$match.matrix,]
write.table(tighten_summit_window(unchanged, b = 100), file = paste0(i, '_repressed_matched_peaks.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
}
For each time point, run MEME on the increased and decreased peaks.
cd ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC
genome=~/Library/CloudStorage/Box-Box/GuertinLab/hg38/hg38.fa
for bed in *_activated_peaks.bed *_repressed_peaks.bed
do
name=`basename $bed .bed`
fastaFromBed -fi $genome -bed ${name}.bed -fo ${name}.fasta
sites=`wc -l ${name}.fasta | awk '{print $1/2}'`
if [ $sites -lt 100 ]
then
minsites=2
else
minsites=`echo $sites | awk '{print $1/50}'`
fi
meme -oc ${name}_meme_output -nmotifs 1000 -objfun classic -csites 20000 -evt 0.01 \
-searchsize 0 -minw 6 -maxw 18 -revcomp -dna -markov_order 3 -maxsize 100000000 -minsites $minsites \
${name}.fasta
meme -oc ${name}_meme_short_output -nmotifs 1000 -objfun classic -csites 20000 -evt 0.01 \
-searchsize 0 -minw 6 -maxw 8 -revcomp -dna -markov_order 3 -maxsize 100000000 -minsites $minsites \
${name}.fasta
done
Report results at FDR 0.1 (1401 total in the above database).
cd ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/
motifs=~/Library/CloudStorage/Box-Box/GuertinLab/Motif_databases/tomtom_db/homer_uniprobe_jaspar_edited.txt
n_motifs=140
Make fasta files for each bed file.
genome=~/Library/CloudStorage/Box-Box/GuertinLab/hg38/hg38.fa
for bed in *hour_activated*_peaks.bed *hour_repressed*_peaks.bed
do
name=`basename $bed .bed`
echo $name
fastaFromBed -fi $genome -bed ${name}.bed -fo ${name}.fasta
done
Run AME.
for fasta in *hour_activated_peaks.fasta *hour_repressed_peaks.fasta
do
name=`basename $fasta _peaks.fasta`
echo $name
ame --verbose 1 --oc ${name}_ame --scoring avg --method fisher --hit-lo-fraction 0.25 \
--evalue-report-threshold $n_motifs --control ${name}_matched_peaks.fasta ${name}_peaks.fasta $motifs
done
Configure the AME output for use in LaTex. Remove underscores, keep the columns we need, and convert to CSV.
cd ~/Library/CloudStorage/Box-Box/guertinlab/TRPS1_timecourse_ATAC/0.5hour_repressed_ame
cat ame.tsv | \
awk 'BEGIN {FS="\t"} {print $1, $3, $5, $15, $17, $7}' | \
awk '{sub(/_homer.*/,"-Homer",$2)} 1' | \
awk '{sub(/_jaspar.*/,"-Jaspar",$2)} 1' | \
awk '{sub(/_secondary_uniprobe.*/,"-Secondary-Uniprobe",$2)} 1' | \
awk '{sub(/_primary_uniprobe.*/,"-Secondary-Uniprobe",$2)} 1' | \
awk '{sub(/_uniprobe.*/,"-Uniprobe",$2)} 1' | \
awk 'BEGIN { OFS="," } {$1=$1; print}' |
awk '/^$/{exit}1' | sed 's/%/\percent/g' > \
decreased_ame.csv
cd ~/Library/CloudStorage/Box-Box/guertinlab/TRPS1_timecourse_ATAC/0.5hour_activated_ame
cat ame.tsv | \
awk 'BEGIN {FS="\t"} {print $1, $3, $5, $15, $17, $7}' | \
awk '{sub(/_homer.*/,"-Homer",$2)} 1' | \
awk '{sub(/_jaspar.*/,"-Jaspar",$2)} 1' | \
awk '{sub(/_secondary_uniprobe.*/,"-Secondary-Uniprobe",$2)} 1' | \
awk '{sub(/_primary_uniprobe.*/,"-Secondary-Uniprobe",$2)} 1' | \
awk '{sub(/_uniprobe.*/,"-Uniprobe",$2)} 1' | \
awk 'BEGIN { OFS="," } {$1=$1; print}' |
awk '/^$/{exit}1' | sed 's/%/\percent/g' |
head -n 50 > \
increased_ame.csv
The script used in the next code chunk (220914_FIMO.sh) is below:
#! /bin/bash
directory=/scratch/$USER/TRPS1_timecourse_ATAC
genome=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome.fa
cd $directory
module load gcc/9.2.0 openmpi/3.1.6 meme/5.3.3 bedtools/2.29.2
fimo --thresh 0.001 --text $i $genome > ${name}_fimo.txt
wc -l ${name}_fimo.txt
score=$(tail -n +2 ${name}_fimo.txt | sort -nrk6,6 | awk 'FNR == 2000000 {print $6}')
echo $score
tail -n +2 ${name}_fimo.txt | awk -v sco="$score" '{ if ($6 >= sco) { print } }' | \
awk '{OFS="\t";} {print $2,$3,$4,$7,$6,$5,$8}' > ${name}_2M.txt
Activated=`wc -l 24hour_activated_peaks.bed | awk '{print $1}'`
Activated_with=`intersectBed -u -a 24hour_activated_peaks.bed -b ${name}_2M.txt | wc -l | awk '{print $1}'`
Activated_without=$(echo "$Activated - $Activated_with" | bc)
Unchanged=`wc -l 24hour_unchanged_peaks.bed | awk '{print $1}'`
Unchanged_with=`intersectBed -u -a 24hour_unchanged_peaks.bed -b ${name}_2M.txt | wc -l | awk '{print $1}'`
Unchanged_without=$(echo "$Unchanged - $Unchanged_with" | bc)
Repressed=`wc -l 24hour_repressed_peaks.bed | awk '{print $1}'`
Repressed_with=`intersectBed -u -a 24hour_repressed_peaks.bed -b ${name}_2M.txt | wc -l | awk '{print $1}'`
Repressed_without=$(echo "$Repressed - $Repressed_with" | bc)
echo -e \
"$name\t$Activated_with\t$Activated_without\t$Unchanged_with\t$Unchanged_without\t$Repressed_with\t$Repressed_without" > \
${name}_Motif_overlaps.txt
Find individual motif instances in the genome with FIMO and overlap with peak sets. I need to provide the motif files or indicate how to generate them.
ssh -Y ts2hx@rivanna.hpc.virginia.edu
ijob -c 1 -A gioeli_lab -p standard --time=1-00:00:00
cd /scratch/ts2hx/TRPS1_timecourse_ATAC/
for i in individual_memes/*_meme.txt
do
name=`basename $i _meme.txt`
sbatch -p standard --export=ALL,i=$i,name=$name --output=${name}_FIMO.out /scratch/ts2hx/scripts/220914_FIMO.sh
sleep 1
done
Combine into one file.
echo -e \
"Motif\tActivated_with\tActivated_without\tUnchanged_with\tUnchanged_without\tRepressed_with\tRepressed_without" \
> Motif_overlaps.txt
cat *_Motif_overlaps.txt >> Motif_overlaps.txt
exit
exit
cd ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_ATAC/Motif_overlaps.txt .
Set up in R.
direc = '~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/'
setwd(direc)
library(tidyverse)
library(tidyr)
library(lattice)
library(tidyr)
ImmediatePeaksWithMotif <- function(motif_overlaps, motif)
{
result = motif_overlaps[rownames(motif_overlaps) == motif,]
result = cbind(rbind(result[,3], result[,4]), rbind(result[,1], result[,2]), rbind(result[,7], result[,8]))
colnames(result) <- c("Increased", "Unchanged", "Decreased")
rownames(result) <- c("With motif", "Without motif")
print(chisq.test(result))
#Can remove this line if you want the actual counts
result = 100*sweep(result, 2, colSums(result), "/")
barplot(result, col = c("#F84C1E", "#232D4B"), main = motif, legend.text = TRUE, args.legend = list(x = "topright", bg = "white", cex = 1.5), cex.axis=1.5, cex.names=1.5)
return(result)
}
Plot barcharts for a representative GATA motif and ER half site.
motif_overlaps = read.table("Overlaps.txt", header = T, row.names = 1)
pdf("ATAC_GATA5_motif_heatmap_immediate.pdf")
ImmediatePeaksWithMotif(motif_overlaps, motif = "GATA5_jaspar")
dev.off()
pdf("ATAC_NR2F2_motif_heatmap_immediate.pdf")
ImmediatePeaksWithMotif(motif_overlaps, motif = "NR2F2_jaspar")
dev.off()
Set up in R.
direc = '~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/'
setwd(direc)
library(DESeq2)
library(lattice)
library(viridis)
library(DEGreport)
library(lattice)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)
library(MatchIt)
library(EnhancedVolcano)
library(pheatmap)
library(dendsort)
library(RColorBrewer)
sort_hclust <- function(...) as.hclust(dendsort(as.dendrogram(...)))
Read in the counts table and change the column names. Here we focus on the early time points.
x <- read.table("Combined_ATAC_peak_counts.txt", sep = '\t', header = TRUE)
rownames(x) = paste0(x[,1], ":", x[,2], "-", x[,3])
peak_name = x[,4]
peak_qvalue = x[,5]
x = x[,-c(1:5)]
colnames(x) = sapply(strsplit(colnames(x), "Clone28_"), "[", 2)
x = x[,!grepl("24hour", colnames(x))]
Run DESeq2.
time = factor(sapply(strsplit(colnames(x), '_'), '[', 1), levels = c("0hour", "0.5hour", "1hour", "2hour", "4hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(time, rep), ~ rep + time)
factors <- read.table("ATAC_sizeFactor.txt")
size_factors = factors$V2
names(size_factors) = factors$V1
size_factors = size_factors[,!grepl("24hour", names(size_factors))]
sizeFactors(deseq.df) <- size_factors
deseq.df = DESeq(deseq.df)
save(deseq.df, file = "deseq.df.ATAC.no24.Rdata")
load("deseq.df.ATAC.no24.Rdata")
Run the likelihood ratio test to identify dynamic peaks over the time course.
dds.lrt = DESeq(deseq.df, test="LRT", reduced = ~ rep)
save(dds.lrt, file = "dds.lrt.ATAC.no24.Rdata")
load("dds.lrt.ATAC.no24.Rdata")
res.lrt = results(dds.lrt[peak_qvalue > 100,])
sum(res.lrt$padj < 0.1 & !is.na(res.lrt$padj))
Get the normalized counts.
wide_counts = counts(deseq.df[rownames(deseq.df) %in% rownames(res.lrt)[res.lrt$padj < .01 & !is.na(res.lrt$padj)],], normalized = TRUE)
Assign the colors.
anno <- data.frame(Time = as.factor(time))
rownames(anno) = colnames(wide_counts)
ann_colors = list(
Time = c(`0hour` = viridis(5)[1],
`0.5hour` = viridis(5)[2],
`1hour` = viridis(5)[3],
`2hour` = viridis(5)[4],
`4hour` = viridis(5)[5]))
paletteLength <- 50
myColor <- colorRampPalette(c("blue", "white", "red"))(paletteLength)
myBreaks <- c(seq(min(scale(t(wide_counts))), 0, length.out=ceiling(paletteLength/2) + 1),
seq(max(scale(t(wide_counts)))/paletteLength, max(scale(t(wide_counts))), length.out=floor(paletteLength/2)))
Move the order of columns to be in time order (legal rotating of the dendrogram).
cluster_cols = hclust(dist(scale(t(wide_counts))))
cluster_cols$order = cluster_cols$order[c(1:12, 17:20, 13:16)]
Plot.
pdf("ATAC_counts_heatmap_FDR_0.01.pdf", width = 8, height = 5)
pheatmap(wide_counts, scale = "row", show_rownames = F, show_colnames = F,
color = myColor, breaks=myBreaks,
annotation_col = anno, annotation_colors = ann_colors,
cluster_cols = cluster_cols)
dev.off()
Set up in R.
direc = '~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/'
setwd(direc)
library(tidyverse)
library(tidyr)
library(lattice)
library(tidyr)
PeaksWithMotif <- function(motif_overlaps, motif)
{
result = motif_overlaps[rownames(motif_overlaps) == motif,]
times = c("0.5hour", "1hour", "2hour", "4hour")
conditions = c("activated", "activated_matched", "repressed_matched", "repressed")
df = as.data.frame(matrix(nrow = length(conditions), ncol = length(times)))
colnames(df) = times
rownames(df) = conditions
for (i in 1:length(times)) {
for (j in 1:length(conditions)) {
df[j,i] = result[,paste0("X", times[i], "_", conditions[j], "_with")] /
(result[,paste0("X", times[i], "_", conditions[j], "_without")] +
result[,paste0("X", times[i], "_", conditions[j], "_with")])
}
}
df = df[-3,]
rownames(df) = c("Increased", "Unchanged", "Decreased")
colnames(df) = sapply(strsplit(colnames(df), "hour"), "[", 1)
df$class = factor(rownames(df), levels = c("Increased", "Unchanged", "Decreased"))
df_long <- gather(df, time, Proportion, -class, factor_key=TRUE)
print(ggplot(df_long, aes(y = time, x = class, fill = Proportion)) +
geom_tile() +
geom_text(aes(label = round(Proportion, 2))) +
scale_fill_gradient(low="white", high="black", limits=c(0, 1)) +
xlab(NULL) +
ylab("Time (hours)")) +
theme_bw() +
theme(text = element_text(size = 20))
}
Plot.
motif_overlaps = read.table("Overlaps.txt", header = T, row.names = 1)
head(motif_overlaps)
pdf("ATAC_TRPS1_motif_heatmap_timecourse.pdf", height = 3.5)
PeaksWithMotif(motif_overlaps, motif = "TRPS1_homer_uniprobe_jaspar_edited.txt")
dev.off()
For the initial basic processing steps, we used a pipeline we have thoroughly described elsewhere (https://github.com/guertinlab/Nascent_RNA_Methods). Please see the above GitHub repository for detailed explanations of the steps, as well as software dependencies and preprocessing steps (e.g., gene annotation file downloads).
Log in to the cluster to perform the processing for each sample in parallel.
ssh -Y ts2hx@rivanna.hpc.virginia.edu
ijob -c 1 -A gioeli_lab -p standard --time=1-00:00:00
Make a directory.
mkdir -p /scratch/ts2hx/TRPS1_timecourse_PRO_3
The script used in the next code chunk (220927_tallymer_generation.sh) is below:
#! /bin/bash
module load genometools/1.5.10
read_size=40
genome=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa
mkdir -p $HOME/seqOutBias
cd $HOME/seqOutBias
$HOME/bin/seqOutBias seqtable $genome --read-size=$read_size
Make tallymer and table for seqOutBias (this takes a while).
sbatch -p largemem -A gioeli_lab -t 1-00:00:00 --job-name tallymer -o tallymer.log \
/scratch/ts2hx/scripts/220927_tallymer_generation.sh
We sequenced the libraries 3 times, first on a NextSeq2000, and the next two runs on a NextSeq550. We found that the NextSeq2000 gave us many duplicate reads, which we attribute to the patterned flow cell and our small fragment sizes for PRO-seq. We first download the fastq’s and gzip.
cd /scratch/ts2hx/TRPS1_timecourse_PRO_3/
mkdir -p logs/
for i in {24..92}
do
sbatch -p standard -A gioeli_lab -t 1:00:00 -N 1 -n 1 --cpus-per-task=10 --job-name fasterq-dump \
-o logs/SRR250747${i}_fasterq_dump.log \
--wrap="module load gcc/11.4.0 sratoolkit/3.0.3; fasterq-dump SRR250747${i}; gzip SRR250747${i}_*.fastq"
sleep 1
done
Combine the NextSeq550 reads to generate representative quality control metrics.
cat SRR25074724_1.fastq.gz SRR25074726_1.fastq.gz > T47D_Clone28_24hour_rep4_NextSeq550_PE1.fastq.gz
cat SRR25074727_1.fastq.gz SRR25074729_1.fastq.gz > T47D_Clone28_24hour_rep3_NextSeq550_PE1.fastq.gz
cat SRR25074730_1.fastq.gz SRR25074732_1.fastq.gz > T47D_Clone28_24hour_rep2_NextSeq550_PE1.fastq.gz
cat SRR25074733_1.fastq.gz SRR25074735_1.fastq.gz > T47D_Clone28_4hour_rep4_NextSeq550_PE1.fastq.gz
cat SRR25074736_1.fastq.gz SRR25074738_1.fastq.gz > T47D_Clone28_4hour_rep3_NextSeq550_PE1.fastq.gz
cat SRR25074739_1.fastq.gz SRR25074741_1.fastq.gz > T47D_Clone28_4hour_rep2_NextSeq550_PE1.fastq.gz
cat SRR25074742_1.fastq.gz SRR25074744_1.fastq.gz > T47D_Clone28_4hour_rep1_NextSeq550_PE1.fastq.gz
cat SRR25074745_1.fastq.gz SRR25074747_1.fastq.gz > T47D_Clone28_2hour_rep4_NextSeq550_PE1.fastq.gz
cat SRR25074748_1.fastq.gz SRR25074750_1.fastq.gz > T47D_Clone28_2hour_rep3_NextSeq550_PE1.fastq.gz
cat SRR25074751_1.fastq.gz SRR25074753_1.fastq.gz > T47D_Clone28_2hour_rep2_NextSeq550_PE1.fastq.gz
cat SRR25074754_1.fastq.gz SRR25074756_1.fastq.gz > T47D_Clone28_2hour_rep1_NextSeq550_PE1.fastq.gz
cat SRR25074757_1.fastq.gz SRR25074759_1.fastq.gz > T47D_Clone28_1hour_rep4_NextSeq550_PE1.fastq.gz
cat SRR25074760_1.fastq.gz SRR25074762_1.fastq.gz > T47D_Clone28_1hour_rep3_NextSeq550_PE1.fastq.gz
cat SRR25074763_1.fastq.gz SRR25074765_1.fastq.gz > T47D_Clone28_1hour_rep2_NextSeq550_PE1.fastq.gz
cat SRR25074766_1.fastq.gz SRR25074768_1.fastq.gz > T47D_Clone28_1hour_rep1_NextSeq550_PE1.fastq.gz
cat SRR25074769_1.fastq.gz SRR25074771_1.fastq.gz > T47D_Clone28_0.5hour_rep4_NextSeq550_PE1.fastq.gz
cat SRR25074772_1.fastq.gz SRR25074774_1.fastq.gz > T47D_Clone28_0.5hour_rep3_NextSeq550_PE1.fastq.gz
cat SRR25074775_1.fastq.gz SRR25074777_1.fastq.gz > T47D_Clone28_0.5hour_rep2_NextSeq550_PE1.fastq.gz
cat SRR25074778_1.fastq.gz SRR25074780_1.fastq.gz > T47D_Clone28_0.5hour_rep1_NextSeq550_PE1.fastq.gz
cat SRR25074781_1.fastq.gz SRR25074783_1.fastq.gz > T47D_Clone28_0hour_rep4_NextSeq550_PE1.fastq.gz
cat SRR25074784_1.fastq.gz SRR25074786_1.fastq.gz > T47D_Clone28_0hour_rep3_NextSeq550_PE1.fastq.gz
cat SRR25074787_1.fastq.gz SRR25074789_1.fastq.gz > T47D_Clone28_0hour_rep2_NextSeq550_PE1.fastq.gz
cat SRR25074790_1.fastq.gz SRR25074792_1.fastq.gz > T47D_Clone28_0hour_rep1_NextSeq550_PE1.fastq.gz
cat SRR25074724_2.fastq.gz SRR25074726_2.fastq.gz > T47D_Clone28_24hour_rep4_NextSeq550_PE2.fastq.gz
cat SRR25074727_2.fastq.gz SRR25074729_2.fastq.gz > T47D_Clone28_24hour_rep3_NextSeq550_PE2.fastq.gz
cat SRR25074730_2.fastq.gz SRR25074732_2.fastq.gz > T47D_Clone28_24hour_rep2_NextSeq550_PE2.fastq.gz
cat SRR25074733_2.fastq.gz SRR25074735_2.fastq.gz > T47D_Clone28_4hour_rep4_NextSeq550_PE2.fastq.gz
cat SRR25074736_2.fastq.gz SRR25074738_2.fastq.gz > T47D_Clone28_4hour_rep3_NextSeq550_PE2.fastq.gz
cat SRR25074739_2.fastq.gz SRR25074741_2.fastq.gz > T47D_Clone28_4hour_rep2_NextSeq550_PE2.fastq.gz
cat SRR25074742_2.fastq.gz SRR25074744_2.fastq.gz > T47D_Clone28_4hour_rep1_NextSeq550_PE2.fastq.gz
cat SRR25074745_2.fastq.gz SRR25074747_2.fastq.gz > T47D_Clone28_2hour_rep4_NextSeq550_PE2.fastq.gz
cat SRR25074748_2.fastq.gz SRR25074750_2.fastq.gz > T47D_Clone28_2hour_rep3_NextSeq550_PE2.fastq.gz
cat SRR25074751_2.fastq.gz SRR25074753_2.fastq.gz > T47D_Clone28_2hour_rep2_NextSeq550_PE2.fastq.gz
cat SRR25074754_2.fastq.gz SRR25074756_2.fastq.gz > T47D_Clone28_2hour_rep1_NextSeq550_PE2.fastq.gz
cat SRR25074757_2.fastq.gz SRR25074759_2.fastq.gz > T47D_Clone28_1hour_rep4_NextSeq550_PE2.fastq.gz
cat SRR25074760_2.fastq.gz SRR25074762_2.fastq.gz > T47D_Clone28_1hour_rep3_NextSeq550_PE2.fastq.gz
cat SRR25074763_2.fastq.gz SRR25074765_2.fastq.gz > T47D_Clone28_1hour_rep2_NextSeq550_PE2.fastq.gz
cat SRR25074766_2.fastq.gz SRR25074768_2.fastq.gz > T47D_Clone28_1hour_rep1_NextSeq550_PE2.fastq.gz
cat SRR25074769_2.fastq.gz SRR25074771_2.fastq.gz > T47D_Clone28_0.5hour_rep4_NextSeq550_PE2.fastq.gz
cat SRR25074772_2.fastq.gz SRR25074774_2.fastq.gz > T47D_Clone28_0.5hour_rep3_NextSeq550_PE2.fastq.gz
cat SRR25074775_2.fastq.gz SRR25074777_2.fastq.gz > T47D_Clone28_0.5hour_rep2_NextSeq550_PE2.fastq.gz
cat SRR25074778_2.fastq.gz SRR25074780_2.fastq.gz > T47D_Clone28_0.5hour_rep1_NextSeq550_PE2.fastq.gz
cat SRR25074781_2.fastq.gz SRR25074783_2.fastq.gz > T47D_Clone28_0hour_rep4_NextSeq550_PE2.fastq.gz
cat SRR25074784_2.fastq.gz SRR25074786_2.fastq.gz > T47D_Clone28_0hour_rep3_NextSeq550_PE2.fastq.gz
cat SRR25074787_2.fastq.gz SRR25074789_2.fastq.gz > T47D_Clone28_0hour_rep2_NextSeq550_PE2.fastq.gz
cat SRR25074790_2.fastq.gz SRR25074792_2.fastq.gz > T47D_Clone28_0hour_rep1_NextSeq550_PE2.fastq.gz
Combine all reads for use in downstream analysis.
cat SRR25074724_1.fastq.gz SRR25074725_1.fastq.gz SRR25074726_1.fastq.gz > T47D_Clone28_24hour_rep4_combined_PE1.fastq.gz
cat SRR25074727_1.fastq.gz SRR25074728_1.fastq.gz SRR25074729_1.fastq.gz > T47D_Clone28_24hour_rep3_combined_PE1.fastq.gz
cat SRR25074730_1.fastq.gz SRR25074731_1.fastq.gz SRR25074732_1.fastq.gz > T47D_Clone28_24hour_rep2_combined_PE1.fastq.gz
cat SRR25074733_1.fastq.gz SRR25074734_1.fastq.gz SRR25074735_1.fastq.gz > T47D_Clone28_4hour_rep4_combined_PE1.fastq.gz
cat SRR25074736_1.fastq.gz SRR25074737_1.fastq.gz SRR25074738_1.fastq.gz > T47D_Clone28_4hour_rep3_combined_PE1.fastq.gz
cat SRR25074739_1.fastq.gz SRR25074740_1.fastq.gz SRR25074741_1.fastq.gz > T47D_Clone28_4hour_rep2_combined_PE1.fastq.gz
cat SRR25074742_1.fastq.gz SRR25074743_1.fastq.gz SRR25074744_1.fastq.gz > T47D_Clone28_4hour_rep1_combined_PE1.fastq.gz
cat SRR25074745_1.fastq.gz SRR25074746_1.fastq.gz SRR25074747_1.fastq.gz > T47D_Clone28_2hour_rep4_combined_PE1.fastq.gz
cat SRR25074748_1.fastq.gz SRR25074749_1.fastq.gz SRR25074750_1.fastq.gz > T47D_Clone28_2hour_rep3_combined_PE1.fastq.gz
cat SRR25074751_1.fastq.gz SRR25074752_1.fastq.gz SRR25074753_1.fastq.gz > T47D_Clone28_2hour_rep2_combined_PE1.fastq.gz
cat SRR25074754_1.fastq.gz SRR25074755_1.fastq.gz SRR25074756_1.fastq.gz > T47D_Clone28_2hour_rep1_combined_PE1.fastq.gz
cat SRR25074757_1.fastq.gz SRR25074758_1.fastq.gz SRR25074759_1.fastq.gz > T47D_Clone28_1hour_rep4_combined_PE1.fastq.gz
cat SRR25074760_1.fastq.gz SRR25074761_1.fastq.gz SRR25074762_1.fastq.gz > T47D_Clone28_1hour_rep3_combined_PE1.fastq.gz
cat SRR25074763_1.fastq.gz SRR25074764_1.fastq.gz SRR25074765_1.fastq.gz > T47D_Clone28_1hour_rep2_combined_PE1.fastq.gz
cat SRR25074766_1.fastq.gz SRR25074767_1.fastq.gz SRR25074768_1.fastq.gz > T47D_Clone28_1hour_rep1_combined_PE1.fastq.gz
cat SRR25074769_1.fastq.gz SRR25074770_1.fastq.gz SRR25074771_1.fastq.gz > T47D_Clone28_0.5hour_rep4_combined_PE1.fastq.gz
cat SRR25074772_1.fastq.gz SRR25074773_1.fastq.gz SRR25074774_1.fastq.gz > T47D_Clone28_0.5hour_rep3_combined_PE1.fastq.gz
cat SRR25074775_1.fastq.gz SRR25074776_1.fastq.gz SRR25074777_1.fastq.gz > T47D_Clone28_0.5hour_rep2_combined_PE1.fastq.gz
cat SRR25074778_1.fastq.gz SRR25074779_1.fastq.gz SRR25074780_1.fastq.gz > T47D_Clone28_0.5hour_rep1_combined_PE1.fastq.gz
cat SRR25074781_1.fastq.gz SRR25074782_1.fastq.gz SRR25074783_1.fastq.gz > T47D_Clone28_0hour_rep4_combined_PE1.fastq.gz
cat SRR25074784_1.fastq.gz SRR25074785_1.fastq.gz SRR25074786_1.fastq.gz > T47D_Clone28_0hour_rep3_combined_PE1.fastq.gz
cat SRR25074787_1.fastq.gz SRR25074788_1.fastq.gz SRR25074789_1.fastq.gz > T47D_Clone28_0hour_rep2_combined_PE1.fastq.gz
cat SRR25074790_1.fastq.gz SRR25074791_1.fastq.gz SRR25074792_1.fastq.gz > T47D_Clone28_0hour_rep1_combined_PE1.fastq.gz
cat SRR25074724_2.fastq.gz SRR25074725_2.fastq.gz SRR25074726_2.fastq.gz > T47D_Clone28_24hour_rep4_combined_PE2.fastq.gz
cat SRR25074727_2.fastq.gz SRR25074728_2.fastq.gz SRR25074729_2.fastq.gz > T47D_Clone28_24hour_rep3_combined_PE2.fastq.gz
cat SRR25074730_2.fastq.gz SRR25074731_2.fastq.gz SRR25074732_2.fastq.gz > T47D_Clone28_24hour_rep2_combined_PE2.fastq.gz
cat SRR25074733_2.fastq.gz SRR25074734_2.fastq.gz SRR25074735_2.fastq.gz > T47D_Clone28_4hour_rep4_combined_PE2.fastq.gz
cat SRR25074736_2.fastq.gz SRR25074737_2.fastq.gz SRR25074738_2.fastq.gz > T47D_Clone28_4hour_rep3_combined_PE2.fastq.gz
cat SRR25074739_2.fastq.gz SRR25074740_2.fastq.gz SRR25074741_2.fastq.gz > T47D_Clone28_4hour_rep2_combined_PE2.fastq.gz
cat SRR25074742_2.fastq.gz SRR25074743_2.fastq.gz SRR25074744_2.fastq.gz > T47D_Clone28_4hour_rep1_combined_PE2.fastq.gz
cat SRR25074745_2.fastq.gz SRR25074746_2.fastq.gz SRR25074747_2.fastq.gz > T47D_Clone28_2hour_rep4_combined_PE2.fastq.gz
cat SRR25074748_2.fastq.gz SRR25074749_2.fastq.gz SRR25074750_2.fastq.gz > T47D_Clone28_2hour_rep3_combined_PE2.fastq.gz
cat SRR25074751_2.fastq.gz SRR25074752_2.fastq.gz SRR25074753_2.fastq.gz > T47D_Clone28_2hour_rep2_combined_PE2.fastq.gz
cat SRR25074754_2.fastq.gz SRR25074755_2.fastq.gz SRR25074756_2.fastq.gz > T47D_Clone28_2hour_rep1_combined_PE2.fastq.gz
cat SRR25074757_2.fastq.gz SRR25074758_2.fastq.gz SRR25074759_2.fastq.gz > T47D_Clone28_1hour_rep4_combined_PE2.fastq.gz
cat SRR25074760_2.fastq.gz SRR25074761_2.fastq.gz SRR25074762_2.fastq.gz > T47D_Clone28_1hour_rep3_combined_PE2.fastq.gz
cat SRR25074763_2.fastq.gz SRR25074764_2.fastq.gz SRR25074765_2.fastq.gz > T47D_Clone28_1hour_rep2_combined_PE2.fastq.gz
cat SRR25074766_2.fastq.gz SRR25074767_2.fastq.gz SRR25074768_2.fastq.gz > T47D_Clone28_1hour_rep1_combined_PE2.fastq.gz
cat SRR25074769_2.fastq.gz SRR25074770_2.fastq.gz SRR25074771_2.fastq.gz > T47D_Clone28_0.5hour_rep4_combined_PE2.fastq.gz
cat SRR25074772_2.fastq.gz SRR25074773_2.fastq.gz SRR25074774_2.fastq.gz > T47D_Clone28_0.5hour_rep3_combined_PE2.fastq.gz
cat SRR25074775_2.fastq.gz SRR25074776_2.fastq.gz SRR25074777_2.fastq.gz > T47D_Clone28_0.5hour_rep2_combined_PE2.fastq.gz
cat SRR25074778_2.fastq.gz SRR25074779_2.fastq.gz SRR25074780_2.fastq.gz > T47D_Clone28_0.5hour_rep1_combined_PE2.fastq.gz
cat SRR25074781_2.fastq.gz SRR25074782_2.fastq.gz SRR25074783_2.fastq.gz > T47D_Clone28_0hour_rep4_combined_PE2.fastq.gz
cat SRR25074784_2.fastq.gz SRR25074785_2.fastq.gz SRR25074786_2.fastq.gz > T47D_Clone28_0hour_rep3_combined_PE2.fastq.gz
cat SRR25074787_2.fastq.gz SRR25074788_2.fastq.gz SRR25074789_2.fastq.gz > T47D_Clone28_0hour_rep2_combined_PE2.fastq.gz
cat SRR25074790_2.fastq.gz SRR25074791_2.fastq.gz SRR25074792_2.fastq.gz > T47D_Clone28_0hour_rep1_combined_PE2.fastq.gz
The script used in the next code chunk (220927_PROseq_analysis_on_input_file.sh) is below:
#! /bin/bash
UMI_length=8
read_size=40
cores=$SLURM_CPUS_PER_TASK
annotation_prefix=$HOME/gene_annotations/Homo_sapiens.GRCh38.104
genome_index=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome
prealign_rdna_index=$HOME/human_rDNA/human_rDNA
tallymer=$HOME/seqOutBias/genome.tal_${read_size}.gtTxt.gz
table=$HOME/seqOutBias/genome_${read_size}.4.2.2.tbl
module load cutadapt/3.4
module load intel/18.0 intelmpi/18.0 R/4.1.1
module load gcc/9.2.0 bowtie2/2.2.9
module load samtools/1.12
module load wigtobigwig/2.8
module load bedtools/2.29.2
module load pigz/2.4
PATH=$HOME/bin:$PATH
echo $name
gunzip ${name}_PE1.fastq.gz
gunzip ${name}_PE2.fastq.gz
echo 'removing dual adapter ligations and calculating the fraction of adapter/adapters in' $name
cutadapt --cores=$cores -m $((UMI_length+2)) -O 1 -a TGGAATTCTCGGGTGCCAAGG ${name}_PE1.fastq \
-o ${name}_PE1_noadap.fastq --too-short-output ${name}_PE1_short.fastq > ${name}_PE1_cutadapt.txt
cutadapt --cores=$cores -m $((UMI_length+10)) -O 1 -a GATCGTCGGACTGTAGAACTCTGAAC ${name}_PE2.fastq \
-o ${name}_PE2_noadap.fastq --too-short-output ${name}_PE2_short.fastq > ${name}_PE2_cutadapt.txt
PE1_total=$(wc -l ${name}_PE1.fastq | awk '{print $1/4}')
PE1_w_Adapter=$(wc -l ${name}_PE1_short.fastq | awk '{print $1/4}')
AAligation=$(echo "scale=2 ; $PE1_w_Adapter / $PE1_total" | bc)
echo -e "value\texperiment\tthreshold\tmetric" > ${name}_QC_metrics.txt
echo -e "$AAligation\t$name\t0.80\tAdapter/Adapter" >> ${name}_QC_metrics.txt
echo 'removing short RNA insertions in' $name
seqtk seq -L $((UMI_length+10)) ${name}_PE1_noadap.fastq > ${name}_PE1_noadap_trimmed.fastq
echo 'removing PCR duplicates from' $name
fqdedup -i ${name}_PE1_noadap_trimmed.fastq -o ${name}_PE1_dedup.fastq
PE1_noAdapter=$(wc -l ${name}_PE1_dedup.fastq | awk '{print $1/4}')
fastq_pair -t $PE1_noAdapter ${name}_PE1_dedup.fastq ${name}_PE2_noadap.fastq
echo 'calculating and plotting RNA insert sizes from' $name
flash -q --compress-prog=gzip --suffix=gz ${name}_PE1_dedup.fastq.paired.fq \
${name}_PE2_noadap.fastq.paired.fq -o ${name}
insert_size.R ${name}.hist ${UMI_length}
echo 'trimming off the UMI from' $name
seqtk trimfq -b ${UMI_length} ${name}_PE1_dedup.fastq | seqtk seq -r - > ${name}_PE1_processed.fastq
seqtk trimfq -e ${UMI_length} ${name}_PE2_noadap.fastq | seqtk seq -r - > ${name}_PE2_processed.fastq
echo 'aligning' $name 'to rDNA and removing aligned reads'
bowtie2 -p $((cores-2)) -x $prealign_rdna_index -U ${name}_PE1_processed.fastq 2>${name}_bowtie2_rDNA.log | \
samtools sort -n - | samtools fastq -f 0x4 - > ${name}_PE1.rDNA.fastq
reads=$(wc -l ${name}_PE1.rDNA.fastq | awk '{print $1/4}')
fastq_pair -t $reads ${name}_PE1.rDNA.fastq ${name}_PE2_processed.fastq
echo 'aligning' $name 'to the genome'
bowtie2 -p $((cores-2)) --maxins 1000 -x $genome_index --rf -1 ${name}_PE1.rDNA.fastq.paired.fq \
-2 ${name}_PE2_processed.fastq.paired.fq 2>${name}_bowtie2.log | \
samtools view -b - | samtools sort - -o ${name}.bam
PE1_prior_rDNA=$(wc -l ${name}_PE1_processed.fastq | awk '{print $1/4}')
PE1_post_rDNA=$(wc -l ${name}_PE1.rDNA.fastq | awk '{print $1/4}')
total_rDNA=$(echo "$(($PE1_prior_rDNA-$PE1_post_rDNA))")
concordant_pe1=$(samtools view -c -f 0x42 ${name}.bam)
total=$(echo "$(($concordant_pe1+$total_rDNA))")
rDNA_alignment=$(echo "scale=2 ; $total_rDNA / $total" | bc)
echo -e "$rDNA_alignment\t$name\t0.10\trDNA Alignment Rate" >> ${name}_QC_metrics.txt
map_pe1=$(samtools view -c -f 0x42 ${name}.bam)
pre_alignment=$(wc -l ${name}_PE1.rDNA.fastq.paired.fq | awk '{print $1/4}')
alignment_rate=$(echo "scale=2 ; $map_pe1 / $pre_alignment" | bc)
echo -e "$alignment_rate\t$name\t0.80\tAlignment Rate" >> ${name}_QC_metrics.txt
echo 'plotting and calculating complexity for' $name
fqComplexity -i ${name}_PE1_noadap_trimmed.fastq
echo 'calculating and plotting theoretical sequencing depth'
echo 'to achieve a defined number of concordantly aligned reads for' $name
PE1_total=$(wc -l ${name}_PE1.fastq | awk '{print $1/4}')
PE1_noadap_trimmed=$(wc -l ${name}_PE1_noadap_trimmed.fastq | awk '{print $1/4}')
factorX=$(echo "scale=2 ; $PE1_noadap_trimmed / $PE1_total" | bc)
echo fraction of reads that are not adapter/adapter ligation products or below 10 base inserts
echo $factorX
PE1_dedup=$(wc -l ${name}_PE1_dedup.fastq | awk '{print $1/4}')
factorY=$(echo "scale=2 ; $concordant_pe1 / $PE1_dedup" | bc)
fqComplexity -i ${name}_PE1_noadap_trimmed.fastq -x $factorX -y $factorY
echo 'Separating paired end reads and creating genomic BED and bigWig intensity files for' $name
mkdir ${name}_seqOutBias
cd ${name}_seqOutBias
seqOutBias scale $table ../${name}.bam --no-scale --stranded --bed-stranded-positive --only-paired --tail-edge \
--bw=$name.bigWig --bed=$name.bed --out-split-pairends --read-size=$read_size --tallymer=$tallymer
cd ..
mv ${name}_seqOutBias/${name}* .
rm -r ${name}_seqOutBias
grep -v "random" ${name}_not_scaled_PE1.bed | grep -v "chrUn" | grep -v "chrEBV" | sort -k1,1 -k2,2n > \
${name}_tmp.txt && mv ${name}_tmp.txt ${name}_not_scaled_PE1.bed
mapBed -null "0" -s -a $annotation_prefix.pause.bed -b ${name}_not_scaled_PE1.bed | \
awk '$7>0' | sort -k5,5 -k7,7nr | sort -k5,5 -u > ${name}_pause.bed
join -1 5 -2 5 ${name}_pause.bed $annotation_prefix.bed | \
awk '{OFS="\t";} $2==$8 && $6==$12 {print $2, $3, $4, $1, $6, $7, $9, $10}' | \
awk '{OFS="\t";} $5 == "+" {print $1,$2+480,$8,$4,$6,$5} $5 == "-" {print $1,$7,$2 - 380,$4,$6,$5}' | \
awk '{OFS="\t";} $3>$2 {print $1,$2,$3,$4,$5,$6}' | \
sort -k1,1 -k2,2n > ${name}_pause_counts_body_coordinates.bed
mapBed -null "0" -s -a ${name}_pause_counts_body_coordinates.bed -b ${name}_not_scaled_PE1.bed | \
awk '$7>0' | awk '{OFS="\t";} {print $1,$2,$3,$4,$5,$6,$7,$5/100,$7/($3 - $2)}' | \
awk '{OFS="\t";} {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$8/$9}' > ${name}_pause_body.bed
pause_index.R ${name}_pause_body.bed
echo 'Calculating exon density / intron density as a metric for nascent RNA purity for' $name
mapBed -null "0" -s -a $annotation_prefix.introns.bed -b ${name}_not_scaled_PE1.bed | awk '$7>0' | \
awk '{OFS="\t";} {print $1,$2,$3,$5,$5,$6,$7,($3 - $2)}' > ${name}_intron_counts.bed
mapBed -null "0" -s -a $annotation_prefix.no.first.exons.named.bed -b ${name}_not_scaled_PE1.bed | awk '$7>0' | \
awk '{OFS="\t";} {print $1,$2,$3,$4,$4,$6,$7,($3 - $2)}' > ${name}_exon_counts.bed
exon_intron_ratio.R ${name}_exon_counts.bed ${name}_intron_counts.bed
echo 'Counting reads in genes for' $name
echo -e "\t${name}" > ${name}_gene_counts.txt
mapBed -null "0" -s -a ${annotation_prefix}_sorted.bed -b ${name}_not_scaled_PE1.bed | \
awk '{OFS="\t";} {print $4,$7}' >> ${name}_gene_counts.txt
rm ${name}_PE1_short.fastq
rm ${name}_PE2_short.fastq
rm ${name}_PE1_noadap.fastq
rm ${name}_PE2_noadap.fastq
rm ${name}_PE1_noadap_trimmed.fastq
rm ${name}_PE1_dedup.fastq
rm ${name}_PE1_processed.fastq
rm ${name}_PE2_processed.fastq
rm ${name}_PE1_dedup.fastq.paired.fq
rm ${name}_PE2_noadap.fastq.paired.fq
rm ${name}_PE1_dedup.fastq.single.fq
rm ${name}_PE2_noadap.fastq.single.fq
rm ${name}_PE1.rDNA.fastq.paired.fq
rm ${name}_PE1.rDNA.fastq.single.fq
rm ${name}_PE2_processed.fastq.paired.fq
rm ${name}_PE2_processed.fastq.single.fq
rm ${name}.extendedFrags.fastq.gz
rm ${name}.notCombined_1.fastq.gz
rm ${name}.notCombined_2.fastq.gz
pigz ${name}_PE1.fastq
pigz ${name}_PE2.fastq
Run the above script on each sample in parallel.
for i in *_NextSeq550_PE1.fastq.gz
do
name=`basename $i _PE1.fastq.gz`
sbatch -p standard -A gioeli_lab -t 4:00:00 -N 1 -n 1 --cpus-per-task=20 \
--job-name proseq-analysis -o ${name}_proseq.log --export=ALL,i=$i,name=$name \
/scratch/ts2hx/scripts/220927_PROseq_analysis_on_input_file.sh
sleep 1
done
Plot the QC metrics.
module load intel/18.0 intelmpi/18.0 R/4.1.1
PATH=$HOME/bin:$PATH
rm TRPS1_timecourse_PRO_NextSeq550_QC_metrics.txt
cat *QC_metrics.txt | awk '!x[$0]++' > TRPS1_timecourse_PRO_NextSeq550_QC_metrics.txt
plot_all_metrics.R TRPS1_timecourse_PRO_NextSeq550_QC_metrics.txt TRPS1_timecourse_PRO_NextSeq550
paste -d'\t' *NextSeq550_gene_counts.txt > TRPS1_timecourse_PRO_gene_counts_NextSeq550.txt
The script used in the next code chunk (231221_PROseq_analysis_on_input_file_no_QC_cut_PE1_to_48.sh) is below:
#! /bin/bash
UMI_length=8
read_size=40
cores=$SLURM_CPUS_PER_TASK
annotation_prefix=$HOME/gene_annotations/Homo_sapiens.GRCh38.104
genome_index=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome
prealign_rdna_index=$HOME/human_rDNA/human_rDNA
tallymer=$HOME/seqOutBias/genome.tal_${read_size}.gtTxt.gz
table=$HOME/seqOutBias/genome_${read_size}.4.2.2.tbl
module load cutadapt/3.4
module load gcc/11.4.0 openmpi/4.1.4 R/4.3.1
module load bowtie2/2.5.1
module load samtools/1.17
module load wigtobigwig/2.8
module load bedtools/2.30.0
module load pigz/2.7
PATH=$HOME/bin:$PATH
echo $name
gunzip ${name}_PE1.fastq.gz
gunzip ${name}_PE2.fastq.gz
echo 'removing dual adapter ligations and calculating the fraction of adapter/adapters in' $name
cutadapt --cores=$cores -m $((UMI_length+10)) -l 48 -O 1 -a TGGAATTCTCGGGTGCCAAGG ${name}_PE1.fastq \
-o ${name}_PE1_noadap_trimmed.fastq
cutadapt --cores=$cores -m $((UMI_length+10)) -O 1 -a GATCGTCGGACTGTAGAACTCTGAAC ${name}_PE2.fastq \
-o ${name}_PE2_noadap.fastq
pigz ${name}_PE1.fastq
pigz ${name}_PE2.fastq
echo 'removing PCR duplicates from' $name
fqdedup -i ${name}_PE1_noadap_trimmed.fastq -o ${name}_PE1_dedup.fastq
rm ${name}_PE1_noadap_trimmed.fastq
echo 'trimming off the UMI from' $name
seqtk trimfq -b ${UMI_length} ${name}_PE1_dedup.fastq | seqtk seq -r - > ${name}_PE1_processed.fastq
seqtk trimfq -e ${UMI_length} ${name}_PE2_noadap.fastq | seqtk seq -r - > ${name}_PE2_processed.fastq
rm ${name}_PE2_noadap.fastq
rm ${name}_PE1_dedup.fastq
echo 'aligning' $name 'to rDNA and removing aligned reads'
bowtie2 -p $((cores-2)) -x $prealign_rdna_index -U ${name}_PE1_processed.fastq 2>${name}_bowtie2_rDNA.log | \
samtools sort -n - | samtools fastq -f 0x4 - > ${name}_PE1.rDNA.fastq
rm ${name}_PE1_processed.fastq
reads=$(wc -l ${name}_PE1.rDNA.fastq | awk '{print $1/4}')
fastq_pair -t $reads ${name}_PE1.rDNA.fastq ${name}_PE2_processed.fastq
rm ${name}_PE1.rDNA.fastq
rm ${name}_PE2_processed.fastq
rm ${name}_PE1.rDNA.fastq.single.fq
rm ${name}_PE2_processed.fastq.single.fq
echo 'aligning' $name 'to the genome'
bowtie2 -p $((cores-2)) --maxins 1000 -x $genome_index --rf -1 ${name}_PE1.rDNA.fastq.paired.fq \
-2 ${name}_PE2_processed.fastq.paired.fq 2>${name}_bowtie2.log | \
samtools view -b - | samtools sort - -o ${name}.bam
rm ${name}_PE1.rDNA.fastq.paired.fq
rm ${name}_PE2_processed.fastq.paired.fq
echo 'Separating paired end reads and creating genomic BED and bigWig intensity files for' $name
mkdir ${name}_seqOutBias
cd ${name}_seqOutBias
seqOutBias scale $table ../${name}.bam --no-scale --stranded --bed-stranded-positive --only-paired --tail-edge \
--bw=$name.bigWig --bed=$name.bed --out-split-pairends --read-size=$read_size --tallymer=$tallymer
cd ..
mv ${name}_seqOutBias/${name}* .
rm -r ${name}_seqOutBias
grep -v "random" ${name}_not_scaled_PE1.bed | grep -v "chrUn" | grep -v "chrEBV" | sort -k1,1 -k2,2n > \
${name}_tmp.txt && mv ${name}_tmp.txt ${name}_not_scaled_PE1.bed
echo 'Counting reads in genes for' $name
echo -e "\t${name}" > ${name}_gene_counts.txt
mapBed -null "0" -s -a ${annotation_prefix}_sorted.bed -b ${name}_not_scaled_PE1.bed | \
awk '{OFS="\t";} {print $4,$7}' >> ${name}_gene_counts.txt
Run the above script on each sample in parallel.
for i in *_combined_PE1.fastq.gz
do
name=`basename $i _PE1.fastq.gz`
sbatch -p standard -A gioeli_lab -t 4:00:00 -N 1 -n 1 --cpus-per-task=20 \
--job-name proseq-analysis -o ${name}_proseq.log --export=ALL,i=$i,name=$name \
/scratch/ts2hx/scripts/231221_PROseq_analysis_on_input_file_no_QC_cut_PE1_to_48.sh
sleep 1
done
Combine the reads into one file.
paste -d'\t' *combined_gene_counts.txt > TRPS1_timecourse_PRO_gene_counts_combined.txt
Here we merge the bigWigs to create T47D_all_plus_PE1.bigWig and T47D_all_minus_PE1.bigWig to upload to the dREG portal (https://dreg.dnasequence.org/). The output includes a dTAG_TRPS1_timecourse_dREG.dREG.peak.full.bed file that represents the genomic coordinates of each bidirectionally transcribed putative regulatory element. We use this output above in this vignette to separate ATAC-seq peaks that are distal to bidirectional transcription.
#This only worked for the positive strand because the counts are positive
for i in T47D_Clone28_0hour_rep1_combined_*_PE1.bigWig
do
strand=$(echo $i | awk -F"T47D_Clone28_0hour_rep1_combined_" '{print $2}' | awk -F"_PE1.bigWig" '{print $1}')
echo $strand
files=$(ls T47D_Clone28_*combined*_${strand}_PE1.bigWig)
echo $files
bigWigMerge $files tmp.bg
LC_COLLATE=C sort -k1,1 -k2,2n tmp.bg > tmp_sorted.bg
bedGraphToBigWig tmp_sorted.bg hg38.chrom.sizes T47D_all_${strand}_PE1.bigWig
rm tmp.bg
rm tmp_sorted.bg
done
#For minus strand
#Could have done this for loop as independent jobs
for i in T47D_*_combined_minus_PE1.bigWig
do
name=$(echo $i | awk -F"_PE1.bigWig" '{print $1}')
echo $name
bigWigToBedGraph $i tmp.bg
echo normalizing
normalize_bedGraph.py -i tmp.bg -s -1 -o tmp2.bg
bedGraphToBigWig tmp2.bg hg38.chrom.sizes ${name}_positive_PE1.bigWig
rm tmp.bg
rm tmp2.bg
done
files=$(ls T47D_*_minus_positive_PE1.bigWig)
echo $files
bigWigMerge $files tmp.bg
LC_COLLATE=C sort -k1,1 -k2,2n tmp.bg > tmp_sorted.bg
bedGraphToBigWig tmp_sorted.bg hg38.chrom.sizes T47D_all_minus_PE1.bigWig
rm tmp.bg
rm tmp_sorted.bg
The script used in the next code chunk (230403_PRO_scale_bigWig_whole_fragment.sh) is below. It scales each bigWig based on the read depth, broken down by strand.
#! /bin/bash
module load samtools/1.12 deeptools/3.5.1
name=$(echo $1 | awk -F".bam" '{print $1}')
# include reads that are 1st in a pair (64) and properly paired (2);
# exclude reads that are mapped to the reverse strand (16)
samtools view -b -f 66 -F 16 ${name}.bam > ${name}.fwd1.bam
# include reads that are mapped to the reverse strand (16),
# second in a pair (128), and properly paired (2)
samtools view -b -f 146 ${name}.bam > ${name}.fwd2.bam
# combine the temporary files
samtools merge -f ${name}.fwd.bam ${name}.fwd1.bam ${name}.fwd2.bam
# index the filtered BAM file
samtools index ${name}.fwd.bam
# run bamCoverage
bamCoverage --bam ${name}.fwd.bam -bs 10 -p 10 -e --normalizeUsing BPM -o ${name}_whole_fragment_plus_scaled.bigWig
# remove the temporary files
rm ${name}.fwd*
# include reads that are 2nd in a pair (128) and properly paired (2);
# exclude reads that are mapped to the reverse strand (16)
samtools view -b -f 130 -F 16 ${name}.bam > ${name}.rev1.bam
# include reads that are mapped to the reverse strand (16),
# first in a pair (64), and properly paired (2)
samtools view -b -f 82 ${name}.bam > ${name}.rev2.bam
# combine the temporary files
samtools merge -f ${name}.rev.bam ${name}.rev1.bam ${name}.rev2.bam
# index the filtered BAM file
samtools index ${name}.rev.bam
# run bamCoverage
bamCoverage --bam ${name}.rev.bam -bs 10 -p 10 -e --normalizeUsing BPM -o ${name}_whole_fragment_minus_scaled.bigWig
# remove the temporary files
rm ${name}.rev*
Run the above script on each sample in parallel.
for i in *_combined.bam
do
name=$(echo $i | awk -F".bam" '{print $1}')
sbatch -p standard -A gioeli_lab -t 1:00:00 --cpus-per-task=10 --job-name bamCoverage -o ${name}_bamCoverage.log \
--wrap="sh /scratch/ts2hx/scripts/230403_PRO_scale_bigWig_whole_fragment.sh $i"
sleep 1
done
The script used in the next code chunk (230403_PRO_to_browser_whole_fragment.sh) is below. It merges the scaled bigWigs across replicates into one bigWig and one bedGraph per condition.
#! /bin/bash
directory=/scratch/$USER/TRPS1_timecourse_PRO_3
PATH=$HOME/bin:$PATH
cd $directory
module load ucsc-tools/3.7.4
name=$(echo $1 | awk -F"T47D_Clone28_" '{print $2}' | awk -F"_rep2" '{print $1}')
strand=$(echo $1 | awk -F"rep2_combined_whole_fragment_" '{print $2}' | awk -F"_scaled.bigWig" '{print $1}')
echo $name
echo $strand
files=$(ls T47D_Clone28_${name}_rep*_combined_whole_fragment_${strand}_scaled.bigWig)
echo $files
if [ "$strand" == "plus" ]
then
echo "track type=bedGraph name=T47D_PRO_${name}_plus color=255,0,0 alwaysZero=on visibility=full" > T47D_${name}_${strand}_temp.txt
fi
if [ "$strand" == "minus" ]
then
echo "track type=bedGraph name=T47D_PRO_${name}_minus color=0,0,255 alwaysZero=on visibility=full" > T47D_${name}_${strand}_temp.txt
fi
count=$(ls T47D_Clone28_${name}_rep*_combined_whole_fragment_${strand}_scaled.bigWig | wc -l | bc)
scaleall=$(bc <<< "scale=4 ; 1.0 / $count")
echo $count
echo $scaleall
name=T47D_${name}
bigWigMerge $files ${name}_${strand}_tmp.bg
normalize_bedGraph.py -i ${name}_${strand}_tmp.bg -s $scaleall -o ${name}_${strand}_scaled.bg
LC_COLLATE=C sort -k1,1 -k2,2n ${name}_${strand}_scaled.bg > ${name}_${strand}_scaled_sorted.bg
cat ${name}_${strand}_temp.txt ${name}_${strand}_scaled_sorted.bg | \
grep -v "random" | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | grep -v "alt" > \
${name}_${strand}_scaled.bedGraph
rm ${name}_${strand}_tmp.bg
rm ${name}_${strand}_temp.txt
rm ${name}_${strand}_scaled.bg
rm ${name}_${strand}_scaled_sorted.bg
head ${name}_${strand}_scaled.bedGraph
bedGraphToBigWig ${name}_${strand}_scaled.bedGraph hg38.chrom.sizes ${name}_${strand}_combined_normalized.bigWig
gzip -f ${name}_${strand}_scaled.bedGraph
Run the above script on each sample in parallel.
for i in T47D_Clone28_*_rep2_combined_whole_fragment_*_scaled.bigWig
do
name=$(echo $i | awk -F"_scaled.bigWig" '{print $1}')
sbatch -p largemem -A gioeli_lab -t 4:00:00 -N 1 -n 1 --job-name to_browser -o ${name}_to_browser.log \
--wrap="sh /scratch/ts2hx/scripts/230403_PRO_to_browser_whole_fragment.sh $i"
sleep 1
done
exit
exit
mkdir -p ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/
cd ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_PRO_3/*.pdf .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_PRO_3/TRPS1_timecourse_PRO_NextSeq550_QC_metrics.txt .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_PRO_3/TRPS1_timecourse_PRO_gene_counts_NextSeq550.txt .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_PRO_3/TRPS1_timecourse_PRO_gene_counts_combined.txt .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_PRO_3/*us_scaled.bedGraph.gz .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_PRO_3/*us_combined_normalized.bigWig .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_PRO_3/T47D_all_*us_PE1.bigWig .
As above, here we pre-process the raw data from the three clone PRO-seq experiment used to assess the robustness of the changes in ER target gene expression upon TRPS1 depletion.
Log in to the cluster to perform the processing for each sample in parallel.
ssh -Y ts2hx@rivanna.hpc.virginia.edu
ijob -c 1 -A gioeli_lab -p standard --time=1-00:00:00
Make a directory.
mkdir -p /scratch/ts2hx/TRPS1
Make tallymer and table for seqOutBias (this takes a while).
module load genometools/1.5.10
read_size=62
genome=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa
mkdir -p $HOME/seqOutBias
cd $HOME/seqOutBias
$HOME/bin/seqOutBias seqtable $genome --read-size=$read_size
main=/scratch/ts2hx/TRPS1
cd $main
for i in {882..905}
do
sbatch -p standard -A gioeli_lab -t 1:00:00 -N 1 -n 1 --cpus-per-task=10 --job-name fasterq-dump \
-o logs/SRR250747${i}_fasterq_dump.log \
--wrap="module load gcc/11.4.0 sratoolkit/3.0.3; fasterq-dump SRR27313${i}"
sleep 1
done
Rename the files
mv SRR27313882_1.fastq.gz T47D_Clone39_dTAG13_rep4_PE1.fastq.gz
mv SRR27313883_1.fastq.gz T47D_Clone39_dTAG13_rep3_PE1.fastq.gz
mv SRR27313884_1.fastq.gz T47D_Clone39_dTAG13_rep2_PE1.fastq.gz
mv SRR27313885_1.fastq.gz T47D_Clone39_dTAG13_rep1_PE1.fastq.gz
mv SRR27313886_1.fastq.gz T47D_Clone39_DMSO_rep4_PE1.fastq.gz
mv SRR27313887_1.fastq.gz T47D_Clone39_DMSO_rep3_PE1.fastq.gz
mv SRR27313888_1.fastq.gz T47D_Clone39_DMSO_rep2_PE1.fastq.gz
mv SRR27313889_1.fastq.gz T47D_Clone39_DMSO_rep1_PE1.fastq.gz
mv SRR27313890_1.fastq.gz T47D_Clone35_dTAG13_rep4_PE1.fastq.gz
mv SRR27313891_1.fastq.gz T47D_Clone35_dTAG13_rep3_PE1.fastq.gz
mv SRR27313892_1.fastq.gz T47D_Clone35_dTAG13_rep2_PE1.fastq.gz
mv SRR27313905_1.fastq.gz T47D_Clone35_dTAG13_rep1_PE1.fastq.gz
mv SRR27313904_1.fastq.gz T47D_Clone35_DMSO_rep4_PE1.fastq.gz
mv SRR27313903_1.fastq.gz T47D_Clone35_DMSO_rep3_PE1.fastq.gz
mv SRR27313902_1.fastq.gz T47D_Clone35_DMSO_rep2_PE1.fastq.gz
mv SRR27313901_1.fastq.gz T47D_Clone35_DMSO_rep1_PE1.fastq.gz
mv SRR27313900_1.fastq.gz T47D_Clone28_dTAG13_rep4_PE1.fastq.gz
mv SRR27313899_1.fastq.gz T47D_Clone28_dTAG13_rep3_PE1.fastq.gz
mv SRR27313898_1.fastq.gz T47D_Clone28_dTAG13_rep2_PE1.fastq.gz
mv SRR27313897_1.fastq.gz T47D_Clone28_dTAG13_rep1_PE1.fastq.gz
mv SRR27313896_1.fastq.gz T47D_Clone28_DMSO_rep4_PE1.fastq.gz
mv SRR27313895_1.fastq.gz T47D_Clone28_DMSO_rep3_PE1.fastq.gz
mv SRR27313894_1.fastq.gz T47D_Clone28_DMSO_rep2_PE1.fastq.gz
mv SRR27313893_1.fastq.gz T47D_Clone28_DMSO_rep1_PE1.fastq.gz
mv SRR27313882_2.fastq.gz T47D_Clone39_dTAG13_rep4_PE2.fastq.gz
mv SRR27313883_2.fastq.gz T47D_Clone39_dTAG13_rep3_PE2.fastq.gz
mv SRR27313884_2.fastq.gz T47D_Clone39_dTAG13_rep2_PE2.fastq.gz
mv SRR27313885_2.fastq.gz T47D_Clone39_dTAG13_rep1_PE2.fastq.gz
mv SRR27313886_2.fastq.gz T47D_Clone39_DMSO_rep4_PE2.fastq.gz
mv SRR27313887_2.fastq.gz T47D_Clone39_DMSO_rep3_PE2.fastq.gz
mv SRR27313888_2.fastq.gz T47D_Clone39_DMSO_rep2_PE2.fastq.gz
mv SRR27313889_2.fastq.gz T47D_Clone39_DMSO_rep1_PE2.fastq.gz
mv SRR27313890_2.fastq.gz T47D_Clone35_dTAG13_rep4_PE2.fastq.gz
mv SRR27313891_2.fastq.gz T47D_Clone35_dTAG13_rep3_PE2.fastq.gz
mv SRR27313892_2.fastq.gz T47D_Clone35_dTAG13_rep2_PE2.fastq.gz
mv SRR27313905_2.fastq.gz T47D_Clone35_dTAG13_rep1_PE2.fastq.gz
mv SRR27313904_2.fastq.gz T47D_Clone35_DMSO_rep4_PE2.fastq.gz
mv SRR27313903_2.fastq.gz T47D_Clone35_DMSO_rep3_PE2.fastq.gz
mv SRR27313902_2.fastq.gz T47D_Clone35_DMSO_rep2_PE2.fastq.gz
mv SRR27313901_2.fastq.gz T47D_Clone35_DMSO_rep1_PE2.fastq.gz
mv SRR27313900_2.fastq.gz T47D_Clone28_dTAG13_rep4_PE2.fastq.gz
mv SRR27313899_2.fastq.gz T47D_Clone28_dTAG13_rep3_PE2.fastq.gz
mv SRR27313898_2.fastq.gz T47D_Clone28_dTAG13_rep2_PE2.fastq.gz
mv SRR27313897_2.fastq.gz T47D_Clone28_dTAG13_rep1_PE2.fastq.gz
mv SRR27313896_2.fastq.gz T47D_Clone28_DMSO_rep4_PE2.fastq.gz
mv SRR27313895_2.fastq.gz T47D_Clone28_DMSO_rep3_PE2.fastq.gz
mv SRR27313894_2.fastq.gz T47D_Clone28_DMSO_rep2_PE2.fastq.gz
mv SRR27313893_2.fastq.gz T47D_Clone28_DMSO_rep1_PE2.fastq.gz
The script used in the next code chunk (220408_PROseq_analysis_on_input_file.sh) is below:
#! /bin/bash
directory=/scratch/$USER/TRPS1
annotation_prefix=$HOME/gene_annotations/Homo_sapiens.GRCh38.104
UMI_length=8
read_size=62
cores=$SLURM_CPUS_PER_TASK
genome_index=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome
prealign_rdna_index=$HOME/human_rDNA/human_rDNA
tallymer=$HOME/seqOutBias/genome.tal_${read_size}.gtTxt.gz
table=$HOME/seqOutBias/genome_${read_size}.4.2.2.tbl
PATH=$HOME/bin:$PATH
cd $directory
module load cutadapt/3.4
module load intel/18.0 intelmpi/18.0 R/4.1.1
module load gcc/9.2.0 bowtie2/2.2.9
module load samtools/1.12
module load wigtobigwig/2.8
module load bedtools/2.29.2
filename=$1
name=$(echo $filename | awk -F"_PE1.fastq" '{print $1}')
echo $name
echo 'removing dual adapter ligations and calculating the fraction of adapter/adapters in' $name
cutadapt --cores=$cores -m $((UMI_length+2)) -O 1 -a TGGAATTCTCGGGTGCCAAGG ${name}_PE1.fastq \
-o ${name}_PE1_noadap.fastq --too-short-output ${name}_PE1_short.fastq > ${name}_PE1_cutadapt.txt
cutadapt --cores=$cores -m $((UMI_length+10)) -O 1 -a GATCGTCGGACTGTAGAACTCTGAAC ${name}_PE2.fastq \
-o ${name}_PE2_noadap.fastq --too-short-output ${name}_PE2_short.fastq > ${name}_PE2_cutadapt.txt
PE1_total=$(wc -l ${name}_PE1.fastq | awk '{print $1/4}')
PE1_w_Adapter=$(wc -l ${name}_PE1_short.fastq | awk '{print $1/4}')
AAligation=$(echo "scale=2 ; $PE1_w_Adapter / $PE1_total" | bc)
echo -e "value\texperiment\tthreshold\tmetric" > ${name}_QC_metrics.txt
echo -e "$AAligation\t$name\t0.80\tAdapter/Adapter" >> ${name}_QC_metrics.txt
echo 'removing short RNA insertions in' $name
seqtk seq -L $((UMI_length+10)) ${name}_PE1_noadap.fastq > ${name}_PE1_noadap_trimmed.fastq
echo 'removing PCR duplicates from' $name
fqdedup -i ${name}_PE1_noadap_trimmed.fastq -o ${name}_PE1_dedup.fastq
PE1_noAdapter=$(wc -l ${name}_PE1_dedup.fastq | awk '{print $1/4}')
fastq_pair -t $PE1_noAdapter ${name}_PE1_dedup.fastq ${name}_PE2_noadap.fastq
echo 'calculating and plotting RNA insert sizes from' $name
flash -q --compress-prog=gzip --suffix=gz ${name}_PE1_dedup.fastq.paired.fq \
${name}_PE2_noadap.fastq.paired.fq -o ${name}
insert_size.R ${name}.hist ${UMI_length}
echo 'trimming off the UMI from' $name
seqtk trimfq -b ${UMI_length} ${name}_PE1_dedup.fastq | seqtk seq -r - > ${name}_PE1_processed.fastq
seqtk trimfq -e ${UMI_length} ${name}_PE2_noadap.fastq | seqtk seq -r - > ${name}_PE2_processed.fastq
echo 'aligning' $name 'to rDNA and removing aligned reads'
bowtie2 -p $cores -x $prealign_rdna_index -U ${name}_PE1_processed.fastq 2>${name}_bowtie2_rDNA.log | \
samtools sort -n - | samtools fastq -f 0x4 - > ${name}_PE1.rDNA.fastq
reads=$(wc -l ${name}_PE1.rDNA.fastq | awk '{print $1/4}')
fastq_pair -t $reads ${name}_PE1.rDNA.fastq ${name}_PE2_processed.fastq
echo 'aligning' $name 'to the genome'
bowtie2 -p $cores --maxins 1000 -x $genome_index --rf -1 ${name}_PE1.rDNA.fastq.paired.fq \
-2 ${name}_PE2_processed.fastq.paired.fq 2>${name}_bowtie2.log | samtools view -b - | \
samtools sort - -o ${name}.bam
PE1_prior_rDNA=$(wc -l ${name}_PE1_processed.fastq | awk '{print $1/4}')
PE1_post_rDNA=$(wc -l ${name}_PE1.rDNA.fastq | awk '{print $1/4}')
total_rDNA=$(echo "$(($PE1_prior_rDNA-$PE1_post_rDNA))")
concordant_pe1=$(samtools view -c -f 0x42 ${name}.bam)
total=$(echo "$(($concordant_pe1+$total_rDNA))")
rDNA_alignment=$(echo "scale=2 ; $total_rDNA / $total" | bc)
echo -e "$rDNA_alignment\t$name\t0.10\trDNA Alignment Rate" >> ${name}_QC_metrics.txt
map_pe1=$(samtools view -c -f 0x42 ${name}.bam)
pre_alignment=$(wc -l ${name}_PE1.rDNA.fastq.paired.fq | awk '{print $1/4}')
alignment_rate=$(echo "scale=2 ; $map_pe1 / $pre_alignment" | bc)
echo -e "$alignment_rate\t$name\t0.80\tAlignment Rate" >> ${name}_QC_metrics.txt
echo 'plotting and calculating complexity for' $name
fqComplexity -i ${name}_PE1_noadap_trimmed.fastq
echo 'calculating and plotting theoretical sequencing depth'
echo 'to achieve a defined number of concordantly aligned reads for' $name
PE1_total=$(wc -l ${name}_PE1.fastq | awk '{print $1/4}')
PE1_noadap_trimmed=$(wc -l ${name}_PE1_noadap_trimmed.fastq | awk '{print $1/4}')
factorX=$(echo "scale=2 ; $PE1_noadap_trimmed / $PE1_total" | bc)
echo fraction of reads that are not adapter/adapter ligation products or below 10 base inserts
echo $factorX
PE1_dedup=$(wc -l ${name}_PE1_dedup.fastq | awk '{print $1/4}')
factorY=$(echo "scale=2 ; $concordant_pe1 / $PE1_dedup" | bc)
fqComplexity -i ${name}_PE1_noadap_trimmed.fastq -x $factorX -y $factorY
echo 'Separating paired end reads and creating genomic BED and bigWig intensity files for' $name
seqOutBias scale $table ${name}.bam --no-scale --stranded --bed-stranded-positive \
--bw=$name.bigWig --bed=$name.bed --out-split-pairends --only-paired \
--tail-edge --read-size=$read_size --tallymer=$tallymer
coverageBed -counts -s -a $annotation_prefix.pause.bed -b ${name}_not_scaled_PE1.bed | \
awk '$7>0' | sort -k5,5 -k7,7nr | sort -k5,5 -u > ${name}_pause.bed
join -1 5 -2 5 ${name}_pause.bed $annotation_prefix.bed | \
awk '{OFS="\t";} $2==$8 && $6==$12 {print $2, $3, $4, $1, $6, $7, $9, $10}' | \
awk '{OFS="\t";} $5 == "+" {print $1,$2+480,$8,$4,$6,$5} $5 == "-" {print $1,$7,$2 - 380,$4,$6,$5}' | \
awk '{OFS="\t";} $3>$2 {print $1,$2,$3,$4,$5,$6}' > ${name}_pause_counts_body_coordinates.bed
coverageBed -counts -s -a ${name}_pause_counts_body_coordinates.bed \
-b ${name}_not_scaled_PE1.bed | awk '$7>0' | \
awk '{OFS="\t";} {print $1,$2,$3,$4,$5,$6,$7,$5/100,$7/($3 - $2)}' | \
awk '{OFS="\t";} {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$8/$9}' > ${name}_pause_body.bed
pause_index.R ${name}_pause_body.bed
echo 'Calculating exon density / intron density as a metric for nascent RNA purity for' $name
coverageBed -counts -s -a $annotation_prefix.introns.bed \
-b ${name}_not_scaled_PE1.bed | awk '$7>0' | \
awk '{OFS="\t";} {print $1,$2,$3,$5,$5,$6,$7,($3 - $2)}' > ${name}_intron_counts.bed
coverageBed -counts -s -a $annotation_prefix.no.first.exons.named.bed \
-b ${name}_not_scaled_PE1.bed | awk '$7>0' | \
awk '{OFS="\t";} {print $1,$2,$3,$4,$4,$6,$7,($3 - $2)}' > ${name}_exon_counts.bed
exon_intron_ratio.R ${name}_exon_counts.bed ${name}_intron_counts.bed
echo 'Counting reads in genes for' $name
echo -e "\t${name}" > ${name}_gene_counts.txt
coverageBed -counts -s -a $annotation_prefix.bed -b ${name}_not_scaled_PE1.bed | \
awk '{OFS="\t";} {print $4,$7}' >> ${name}_gene_counts.txt
rm ${name}_PE1_short.fastq
rm ${name}_PE2_short.fastq
rm ${name}_PE1_noadap.fastq
rm ${name}_PE2_noadap.fastq
rm ${name}_PE1_noadap_trimmed.fastq
rm ${name}_PE1_dedup.fastq
rm ${name}_PE1_processed.fastq
rm ${name}_PE2_processed.fastq
rm ${name}_PE1_dedup.fastq.paired.fq
rm ${name}_PE2_noadap.fastq.paired.fq
rm ${name}_PE1_dedup.fastq.single.fq
rm ${name}_PE2_noadap.fastq.single.fq
rm ${name}_PE1.rDNA.fastq.paired.fq
rm ${name}_PE1.rDNA.fastq.single.fq
rm ${name}_PE2_processed.fastq.paired.fq
rm ${name}_PE2_processed.fastq.single.fq
rm ${name}.extendedFrags.fastq.gz
rm ${name}.notCombined_1.fastq.gz
rm ${name}.notCombined_2.fastq.gz
gzip ${name}_PE1.fastq
gzip ${name}_PE2.fastq
The script used in the next code chunk (220408_PROseq_analysis_on_all_files_for_slurm.sh) is below:
#! /bin/bash
#Change this variable as well as the wrapped string
main=/scratch/ts2hx/TRPS1
cd $main
for filename in *_PE1.fastq
do
name=`basename $filename _PE1.fastq`
sbatch -p standard -A gioeli_lab -t 1-00:00:00 -N 1 -n 1 --cpus-per-task=4 --job-name proseq-analysis \
-o ${main}/${name}_proseq.log \
--wrap="sh /scratch/ts2hx/TRPS1/220408_PROseq_analysis_on_input_file.sh $filename"
sleep 1 # wait 1 second between each job submission
done
Run the above script to process each sample in parallel.
./220408_PROseq_analysis_on_all_files_for_slurm.sh
Plot QC metrics and combine the reads into one file.
cat *_QC_metrics.txt | awk '!x[$0]++' > TRPS1_QC_metrics.txt
module load intel/18.0 intelmpi/18.0 R/4.1.1
plot_all_metrics.R TRPS1_QC_metrics.txt TRPS1_PRO
paste -d'\t' *_gene_counts.txt > TRPS1_PRO_gene_counts.txt
exit
exit
mkdir -p ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1
cd ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1/*_QC_metrics.txt .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1/*_gene_counts.txt .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1/*.pdf .
Set up.
ssh -Y ts2hx@rivanna.hpc.virginia.edu
ijob -c 1 -A gioeli_lab -p standard --cpus-per-task=20 --time=12:00:00
main=/scratch/ts2hx/ChIP/results
cd $main
mkdir -p heatmaps
cd heatmaps
module load deeptools/3.5.1
Make the matrix (starting with the bed file from the matrix generated for ChIP).
computeMatrix reference-point --referencePoint center -b 500 -a 500 -p 20 --missingDataAsZero \
-R TRPS1_peaks_sorted_for_heatmap.bed \
-S /scratch/ts2hx/TRPS1_timecourse_PRO_3/T47D_0hour_plus_combined_normalized.bigWig \
/scratch/ts2hx/TRPS1_timecourse_PRO_3/T47D_0hour_minus_combined_normalized.bigWig \
/scratch/ts2hx/TRPS1_timecourse_PRO_3/T47D_0.5hour_plus_combined_normalized.bigWig \
/scratch/ts2hx/TRPS1_timecourse_PRO_3/T47D_0.5hour_minus_combined_normalized.bigWig \
-o matrix_PRO_TRPS1_peaks.gz
Plot the composites.
plotProfile -m matrix_PRO_TRPS1_peaks.gz -out profile_PRO_TRPS1_peaks.pdf \
--regionsLabel "Functional TRPS1 peaks" --perGroup --colors "#8B0000" "#00008B" "#FF7F7F" "#7F7FFF" \
--samplesLabel "DMSO plus strand" "DMSO minus strand" "dTAG plus strand" "dTAG minus strand"
Count PRO reads in each TRPS1 peak for use in DESeq2
cd /scratch/ts2hx/TRPS1_timecourse_PRO_3/
module load bedtools/2.30.0
slopBed -b 500 -i /scratch/ts2hx/ChIP/results/heatmaps/Functional_TRPS1_summits.bed \
-g $HOME/hg38/hg38.chrom.sizes > \
/scratch/ts2hx/ChIP/results/heatmaps/Functional_TRPS1_summits_window.bed
for i in *_combined_PE1.fastq.gz
do
name=`basename $i _PE1.fastq.gz`
echo -e "\t${name}" > ${name}_TRPS1_peak_counts.txt
mapBed -null "0" -a /scratch/ts2hx/ChIP/results/heatmaps/Functional_TRPS1_summits_window.bed \
-b ${name}_not_scaled_PE1.bed | awk '{OFS="\t";} {print $4, $NF}' >> ${name}_TRPS1_peak_counts.txt
done
paste -d'\t' *_TRPS1_peak_counts.txt > \
/scratch/ts2hx/ChIP/results/heatmaps/TRPS1_timecourse_PRO_TRPS1_peak_counts.txt
Leave the cluster environment, and copy files to your local computer.
exit
exit
cd ~/Library/CloudStorage/Box-Box/GuertinLab/ChIP/results
scp -r ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/ChIP/results/heatmaps/ .
Set up in R.
library(lattice)
library(DESeq2)
library(tidyverse)
library(latticeExtra)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)
Provide size factors based on reads in genes
x = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/TRPS1_timecourse_PRO_gene_counts_combined.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]
x = x[,grepl("0", colnames(x))]
time = factor(sapply(strsplit(colnames(x), '_'), '[', 3), levels = c("0hour", "0.5hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(time, rep), ~ rep + time)
deseq.df = estimateSizeFactors(deseq.df)
size_factors = sizeFactors(deseq.df)
Identify differentially transcribed peaks.
x = read.table("TRPS1_timecourse_PRO_TRPS1_peak_counts.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]
time = factor(sapply(strsplit(colnames(x), '_'), '[', 3), levels = c("0hour", "0.5hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(time, rep), ~ rep + time)
sizeFactors(deseq.df) <- size_factors
deseq.df = DESeq(deseq.df)
res.deseq = results(deseq.df, contrast = c("time", "0.5hour", "0hour"))
sum(!is.na(res.deseq$log2FoldChange)) #12621
sum(!is.na(res.deseq$log2FoldChange) & res.deseq$log2FoldChange > 0) #7900
sum(!is.na(res.deseq$log2FoldChange) & res.deseq$log2FoldChange < 0) #4721
sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0) #9
sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0) #0
sum((colSums(x)/size_factors)[1:4])/sum((colSums(x)/size_factors)[5:8]) #1.065969
t.test(res.deseq$log2FoldChange) # p-value < 2.2e-16, mean 0.1317621
df.deseq.effects.lattice = res.deseq
df.deseq.effects.lattice$status = "Other"
df.deseq.effects.lattice$status[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0] = "Activated"
df.deseq.effects.lattice$status[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0] = "Repressed"
df.deseq.effects.lattice$status = factor(df.deseq.effects.lattice$status, levels = c("Other", "Activated", "Repressed"))
Plot.
pdf("MA_PRO_at_TRPS1_peaks.pdf", width = 7, height = 3.5)
ggplot(as.data.frame(df.deseq.effects.lattice) %>% arrange(status), aes(x = log(baseMean, 10), y = log2FoldChange, color = status)) +
geom_point(aes(size = status)) +
scale_size_manual(values = c(0.1, 1, 1), guide = 'none') +
scale_color_manual(values = c("grey60", "red", "blue")) +
geom_hline(yintercept = 0, linetype="dashed") +
ylab(expression("log"[2]~"(fold change in PRO signal)")) +
xlab(expression("log"[10]~"(mean of normalized intensity)")) +
ggtitle("Bidirectional transcription at TRPS1 peaks 30 minutes after TRPS1 depletion") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
labs(color=NULL)
dev.off()
ANOVA to determine if dTAG reproducibly increases transcription at these sites
library(tidyr)
wide_counts = as.data.frame(counts(deseq.df, normalized = T))
long_counts <- gather(wide_counts, condition, counts, factor_key=TRUE)
long_counts$rep = factor(sapply(strsplit(as.character(long_counts$condition), '_'), '[', 4))
long_counts$time = factor(sapply(strsplit(as.character(long_counts$condition), '_'), '[', 3))
model = lm(log(counts + 1, 10) ~ rep + time, data = long_counts)
anova(model) #p-value < 2.2e-16
Set up in R.
direc = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/"
setwd(direc)
library(lattice)
library(DESeq2)
library(viridis)
library(tidyverse)
library(latticeExtra)
library(msigdbr)
library(fgsea)
library(enrichplot)
library(EnhancedVolcano)
library(clusterProfiler)
library(pheatmap)
library(dendsort)
library(RColorBrewer)
library(DEGreport)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)
source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')
cdf.deseq.df <- function(df, genes = gene.file, chip.peaks = chip.peaks) {
bed.tss.activated = filter.deseq.into.bed(df, genes, cat = "Activated")
paste(head(bed.tss.activated))
bed.tss.unchanged = filter.deseq.into.bed(df, genes, cat = "Unchanged")
bed.tss.repressed = filter.deseq.into.bed(df, genes, cat = "Repressed")
peaks = chip.peaks
print(head(peaks))
act.distance = bedTools.closest(bed1 = bed.tss.activated, bed2 = peaks[,c(1:3)], opt.string = '-D a')
unreg.distance = bedTools.closest(bed1 = bed.tss.unchanged, bed2 = peaks[,c(1:3)], opt.string = '-D a')
rep.distance = bedTools.closest(bed1 = bed.tss.repressed, bed2 = peaks[,c(1:3)], opt.string = '-D a')
df.up.can = cbind(act.distance[,c(4, 10)], "Activated")
df.un.can = cbind(unreg.distance[,c(4, 10)], "Unchanged")
df.dn.can = cbind(rep.distance[,c(4, 10)], "Repressed")
colnames(df.up.can) = c(colnames(df.up.can)[1:2], 'status')
colnames(df.un.can) = c(colnames(df.up.can)[1:2], 'status')
colnames(df.dn.can) = c(colnames(df.up.can)[1:2], 'status')
df.all = rbind(df.up.can, df.un.can, df.dn.can)
return(df.all)
}
filter.deseq.into.bed <- function(deseq.df, gene.file, cat = 'R1881 Activated') {
deseq.df = deseq.df[deseq.df$status == cat,]
#print(dim(deseq.df))
#scientific notation was messing this up occasionally
x = gene.file$V4
#print(length(x))
y = gene.file[x %in% rownames(deseq.df),]
#print(dim(y))
z = get.tss(y)
#print(dim(z))
return(z)
}
#bedTools.closest <- function(functionstring="/anaconda3/bin/closestBed",bed1,bed2,opt.string="") {
bedTools.closest <- function(functionstring="/usr/local/anaconda3/bin/closestBed",bed1,bed2,opt.string="") {
options(scipen =99) # not to use scientific notation when writing out
#write bed formatted dataframes to tempfile
write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
# create the command string and call the command using system()
command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
cat(command1,"\n")
try(system(command1))
command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
cat(command2,"\n")
try(system(command2))
command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
cat(command,"\n")
try(system(command))
res=read.table('out.file.bed',header=F, comment.char='')
command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
cat(command3,"\n")
try(system(command3))
colnames(res) = c(colnames(bed1), colnames(bed2)[1:ncol(bed2)], 'dis' )
return(res)
}
#Viridis color scheme
col = viridis(6)
#For converting gene names
gene.file = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/Homo_sapiens.GRCh38.104.bed", sep = '\t', header = FALSE)
`%notin%` <- Negate(`%in%`)
gene.file = gene.file[gene.file$V5 %notin% c("havana", "havana_tagene", "ensembl_havana"),]
gene.symbol = gene.file[,c(4,5)]
colnames(gene.symbol) = c("gene", "symbol")
Read in the counts table and change the column names. Here we focus on the early time points.
#DESeq
x = read.table("TRPS1_timecourse_PRO_gene_counts_combined.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]
x = x[,!grepl("24hour", colnames(x))]
Run DESeq2.
time = factor(sapply(strsplit(colnames(x), '_'), '[', 3), levels = c("0hour", "0.5hour", "1hour", "2hour", "4hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(time, rep), ~ rep + time)
deseq.df = DESeq(deseq.df)
save(deseq.df, file = "TRPS1_timecourse_PRO_deseq.df.RData")
Run the likelihood ratio test to identify dynamic peaks over the time course.
dds.lrt = DESeq(deseq.df, test="LRT", reduced = ~ rep)
res.lrt = results(dds.lrt)
sum(res.lrt$padj < 0.1 & !is.na(res.lrt$padj))
Cluster the genes.
rld = rlogTransformation(dds.lrt)
rld_mat <- assay(rld)
siglrt.re = res.lrt[res.lrt$padj < 0.1 & !is.na(res.lrt$padj),]
nrow(siglrt.re) #1425
cluster_rlog = rld_mat[rownames(siglrt.re),]
meta = cbind.data.frame(time, rep)
rownames(meta) = colnames(cluster_rlog)
clusters_lrt <- degPatterns(cluster_rlog, metadata = meta, minc = 0)
Set up a data frame for plotting.
plot.df = clusters_lrt$normalized
plot.df$time = as.character(plot.df$time)
plot.df$time[plot.df$time == '0hour'] = 0
plot.df$time[plot.df$time == '0.5hour'] = 0.5
plot.df$time[plot.df$time == '1hour'] = 1
plot.df$time[plot.df$time == '2hour'] = 2
plot.df$time[plot.df$time == '4hour'] = 4
plot.df$time = as.numeric(plot.df$time)
plot.df = plot.df[order(plot.df$genes),]
plot.df = plot.df[order(plot.df$time),]
plot.df$cluster = paste('cluster', as.character(plot.df$cluster), sep = '')
head(plot.df)
save(plot.df, file = "plot.df.pro.no.24hour.Rdata")
Generate a dendrogram to check where we can merge clusters.
x = as.data.table(plot.df)
plot.df.cluster = dcast(x, genes + cluster ~ time, value.var="value")
avg.clusters = as.data.frame(matrix(nrow = 0, ncol = 4))
for (i in unique(plot.df.cluster$cluster)) {
z = data.frame(matrix(colMeans(plot.df.cluster[plot.df.cluster$cluster == i,3:6]),
ncol = 4, nrow = 1))
rownames(z) = c(i)
colnames(z) = as.character(colnames(plot.df.cluster)[3:6])
avg.clusters = rbind(avg.clusters, z)
}
dd = dist(avg.clusters)
hc = hclust(dd, method = "complete")
pdf('Dendrogram_PRO_no_24hour.pdf', width=8, height=5)
plot(hc, xlab = "Clusters", main = ' ', hang = -1)
abline(h = 1.85, lty = 2)
dev.off()
Merge clusters.
gradual.down = plot.df[plot.df$cluster == 'cluster2' |
plot.df$cluster == 'cluster7',]
transient.down = plot.df[plot.df$cluster == 'cluster3' |
plot.df$cluster == 'cluster9' |
plot.df$cluster == 'cluster12' |
plot.df$cluster == 'cluster4' |
plot.df$cluster == 'cluster18' |
plot.df$cluster == 'cluster16' |
plot.df$cluster == 'cluster17',]
oscillating.up = plot.df[plot.df$cluster == 'cluster11' |
plot.df$cluster == 'cluster6' |
plot.df$cluster == 'cluster19',]
immediate.up = plot.df[plot.df$cluster == 'cluster10' |
plot.df$cluster == 'cluster14' |
plot.df$cluster == 'cluster15',]
transient.up = plot.df[plot.df$cluster == 'cluster1' |
plot.df$cluster == 'cluster8',]
gradual.up = plot.df[plot.df$cluster == 'cluster5' |
plot.df$cluster == 'cluster13',]
gradual.down$supercluster = 'Gradual Decrease'
transient.down$supercluster = 'Transient Decrease'
oscillating.up$supercluster = 'Oscillating Increase'
immediate.up$supercluster = 'Immediate Increase'
transient.up$supercluster = 'Transient Increase'
gradual.up$supercluster = 'Gradual Increase'
Make a data frame for plotting the superclusters.
plot.df.pro = rbind(gradual.down,
transient.down,
oscillating.up,
immediate.up,
transient.up,
gradual.up)
plot.df.pro$supercluster <- factor(plot.df.pro$supercluster,
levels = c("Oscillating Increase", "Immediate Increase", "Gradual Increase",
"Transient Increase", "Gradual Decrease", "Transient Decrease"))
plot.df.pro$genes = as.factor(plot.df.pro$genes)
Assign the colors.
cat.colours = plot.df.pro[plot.df.pro$merge == 'one_group0hour',]
cat.colours$genes <- as.factor(cat.colours$genes)
cat.colours$supercluster <- as.factor(cat.colours$supercluster)
cat.colours$colour[grep("Increase", cat.colours$supercluster)] <- '#FF000016'
cat.colours$colour[grep("Decrease", cat.colours$supercluster)] <- '#0000FF16'
cat.colours$colour <- as.factor(cat.colours$colour)
cat.colours <- cat.colours[match(levels(plot.df.pro$genes), cat.colours$genes), ]
Plot.
pdf('pro_superclusters_transient_gradual_only.pdf', width=10, height=10)
trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
plot.symbol = list(col="black", lwd=1.0, pch ='.'))
print(xyplot(value ~ factor(time) | supercluster, group = genes,
data = plot.df.pro[plot.df.pro$supercluster %in% levels(plot.df.pro$supercluster)[3:6],], type = c('l'),
scales=list(x=list(cex=2,relation = "free", rot = 45),
y =list(cex=2, relation="free")),
aspect=1.0,
between=list(y=0.5, x=0.5),
layout = c(2,2),
ylab = list(label = 'Scaled normalized PRO signal', cex=3),
xlab = list(label = 'Time (hours)', cex =3),
par.strip.text=list(cex=2),
par.settings = list(superpose.symbol = list(pch = c(16), col=c('grey20'), cex =1),
strip.background=list(col="grey80"),
superpose.line = list(col = as.character(cat.colours$colour), lwd=c(1), lty = c(1))),
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.bwplot(x, y, pch = '|', horizontal = FALSE, box.width = .5,
do.out = F)
panel.spline(x, y, col = 'black', lwd =3.0, ...)
})
)
dev.off()
Activated genes.
genes = plot.df.pro$genes[(plot.df.pro$supercluster == "Gradual Increase" | plot.df.pro$supercluster == "Transient Increase") & plot.df.pro$merge == "one_group0hour"]
m_t2g <- msigdbr(species = "Homo sapiens", category = "H") %>%
dplyr::select(gs_name, ensembl_gene)
set.seed(0)
em <- enricher(genes, TERM2GENE=m_t2g, pvalueCutoff = 0.1)
em$ID
em$p.adjust
fgseaResTidy <- em %>%
as_tibble() %>%
arrange(p.adjust)
print(fgseaResTidy, n = 100)
gse_top = fgseaResTidy
gse_top$Ratio_ratio =
(as.numeric(sapply(strsplit(gse_top$GeneRatio, "/"), "[", 1)) /
as.numeric(sapply(strsplit(gse_top$GeneRatio, "/"), "[", 2))) /
(as.numeric(sapply(strsplit(gse_top$BgRatio, "/"), "[", 1)) /
as.numeric(sapply(strsplit(gse_top$BgRatio, "/"), "[", 2)))
pdf("Hallmark_ORA_Combined_Increase.pdf", width = 14)
ggplot(gse_top, aes(reorder(ID, Ratio_ratio), Ratio_ratio, fill = log(p.adjust, base = 10))) +
geom_col() +
coord_flip() +
scale_fill_gradient(low="#750000", high="#FF8A8A") +
labs(y="Observed / Expected", x = NULL) +
theme_minimal() +
labs(fill = "log(padj)") +
theme(text = element_text(size = 20),
axis.text.x= element_text(size = 30))
dev.off()
Repressed genes.
genes = plot.df.pro$genes[(plot.df.pro$supercluster == "Gradual Decrease" | plot.df.pro$supercluster == "Transient Decrease") & plot.df.pro$merge == "one_group0hour"]
m_t2g <- msigdbr(species = "Homo sapiens", category = "H") %>%
dplyr::select(gs_name, ensembl_gene)
set.seed(0)
em <- enricher(genes, TERM2GENE=m_t2g, pvalueCutoff = 0.1)
em$ID
em$p.adjust
fgseaResTidy <- em %>%
as_tibble() %>%
arrange(p.adjust)
print(fgseaResTidy, n = 100)
gse_top = fgseaResTidy
gse_top$Ratio_ratio =
(as.numeric(sapply(strsplit(gse_top$GeneRatio, "/"), "[", 1)) /
as.numeric(sapply(strsplit(gse_top$GeneRatio, "/"), "[", 2))) /
(as.numeric(sapply(strsplit(gse_top$BgRatio, "/"), "[", 1)) /
as.numeric(sapply(strsplit(gse_top$BgRatio, "/"), "[", 2)))
pdf("Hallmark_ORA_Combined_Decrease.pdf", width = 14)
ggplot(gse_top, aes(reorder(ID, Ratio_ratio), Ratio_ratio, fill = log(p.adjust, base = 10))) +
geom_col() +
coord_flip() +
scale_fill_gradient(low="#000075", high="#8A8AFF") +
labs(y="Observed / Expected", x = NULL) +
theme_minimal() +
labs(fill = "log(padj)") +
theme(text = element_text(size = 20),
axis.text.x= element_text(size = 30))
dev.off()
Write the gene names as tables.
genes = plot.df.pro$genes[plot.df.pro$supercluster == "Gradual Increase" & plot.df.pro$merge == "one_group0hour"]
names = gene.symbol$symbol[gene.symbol$gene %in% genes]
write.table(names, file = "gradually_activated_genes.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
genes = plot.df.pro$genes[plot.df.pro$supercluster == "Transient Increase" & plot.df.pro$merge == "one_group0hour"]
names = gene.symbol$symbol[gene.symbol$gene %in% genes]
write.table(names, file = "transiently_activated_genes.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
genes = plot.df.pro$genes[plot.df.pro$supercluster == "Gradual Decrease" & plot.df.pro$merge == "one_group0hour"]
names = gene.symbol$symbol[gene.symbol$gene %in% genes]
write.table(names, file = "gradually_repressed_genes.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
genes = plot.df.pro$genes[plot.df.pro$supercluster == "Transient Decrease" & plot.df.pro$merge == "one_group0hour"]
names = gene.symbol$symbol[gene.symbol$gene %in% genes]
write.table(names, file = "transiently_repressed_genes.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
Make a data frame with dynamic genes and unchanged genes.
#CDFs separated by TRPS1 peaks overlapping increased vs decreased ATAC peaks
genes = plot.df.pro$genes[(plot.df.pro$supercluster == "Gradual Increase" | plot.df.pro$supercluster == "Transient Increase") & plot.df.pro$merge == "one_group0hour"]
activated = cbind.data.frame(genes, status = "Activated")
genes = plot.df.pro$genes[(plot.df.pro$supercluster == "Gradual Decrease" | plot.df.pro$supercluster == "Transient Decrease") & plot.df.pro$merge == "one_group0hour"]
repressed = cbind.data.frame(genes, status = "Repressed")
genes = rownames(res.lrt)[res.lrt$padj > 0.5 & !is.na(res.lrt$padj)]
unchanged = cbind.data.frame(genes, status = "Unchanged")
df.deseq.effects.lattice = rbind.data.frame(unchanged, activated, repressed)
rownames(df.deseq.effects.lattice) = df.deseq.effects.lattice$genes
Get the dynamic ATAC peaks.
x <- read.table("~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/Combined_ATAC_peak_counts.txt", sep = '\t', header = TRUE)
peak_qvalue = x[,5]
load("~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/deseq.df.ATAC.Rdata")
i="0.5hour"
res.deseq = results(deseq.df[peak_qvalue > 100,], contrast = c("time", i, "0hour"))
res.deseq$chr = sapply(strsplit(rownames(res.deseq), ":"), "[", 1)
res.deseq$start = sapply(strsplit(sapply(strsplit(rownames(res.deseq), ":"), "[", 2), "-"), "[", 1)
res.deseq$end = sapply(strsplit(sapply(strsplit(rownames(res.deseq), ":"), "[", 2), "-"), "[", 2)
res.deseq = res.deseq[, c(7:9,1:6)]
res.deseq$start = as.numeric(res.deseq$start) + 100
res.deseq$end = as.numeric(res.deseq$end) - 100
increasing = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
decreasing = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0,]
Overlap with TRPS1 peaks.
chip.peaks = read.table('~/Library/CloudStorage/Box-Box/GuertinLab/ChIP/results/macs2/HA_DMSO_vs_dTAG_summits.bed')
chip.peaks = chip.peaks[chip.peaks$V5 > 30,]
TRPS1_increasing_ATAC_peaks = bedTools.intersect(bed1 = chip.peaks, bed2 = increasing, opt.string = "-wa -wb")
TRPS1_decreasing_ATAC_peaks = bedTools.intersect(bed1 = chip.peaks, bed2 = decreasing, opt.string = "-wa -wb")
Plot.
df.all = cdf.deseq.df(df = df.deseq.effects.lattice, genes = gene.file, chip.peaks = TRPS1_increasing_ATAC_peaks)
pdf("CDF_HA_ChIP_increasing_ATAC_by_gene_superclusters.pdf", height = 3.5, width = 4)
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.all,
auto.key = list(lines=TRUE, points=FALSE, space = "right"),
col = col.lines,
aspect = 1,
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'TRPS1 Distance from TSS'),
between=list(y = 1.0),
type = 'a',
#xlim = c(0, 7.5),
lwd = 2,
par.settings = list(superpose.line = list(col = col.lines, lwd=3),
strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 200, lty =2)
panel.ecdfplot(...)
})
dev.off()
df.all = cdf.deseq.df(df = df.deseq.effects.lattice, genes = gene.file, chip.peaks = TRPS1_decreasing_ATAC_peaks)
df.all = df.all[df.all$dis != -1,]
pdf("CDF_HA_ChIP_decreasing_ATAC_by_gene_superclusters.pdf", height = 3.5, width = 4)
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.all,
auto.key = list(lines=TRUE, points=FALSE, space = "right"),
col = col.lines,
aspect = 1,
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'TRPS1 Distance from TSS'),
between=list(y = 1.0),
type = 'a',
#xlim = c(0, 7.5),
lwd = 2,
par.settings = list(superpose.line = list(col = col.lines, lwd=3),
strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 200, lty =2)
panel.ecdfplot(...)
})
dev.off()
Set up in R.
setwd("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R")
library(lattice)
library(DESeq2)
library(ggplot2)
cdf.deseq.df <- function(df, genes = gene.file, chip.peaks = chip.peaks) {
bed.tss.activated = filter.deseq.into.bed(df, genes, cat = "Activated")
bed.tss.unchanged = filter.deseq.into.bed(df, genes, cat = 'Not Activated')
act.distance = bedTools.closest(bed1 = bed.tss.activated, bed2 = chip.peaks[,c(1:3)], opt.string = '-D a')
unreg.distance = bedTools.closest(bed1 = bed.tss.unchanged, bed2 = chip.peaks[,c(1:3)], opt.string = '-D a')
df.up.can = cbind(act.distance[,c(4, 10)], "Activated")
df.un.can = cbind(unreg.distance[,c(4, 10)], "Not Activated")
colnames(df.up.can) = c(colnames(df.up.can)[1:2], 'status')
colnames(df.un.can) = c(colnames(df.up.can)[1:2], 'status')
df.all = rbind(df.up.can, df.un.can)
df.all$status = factor(df.all$status, levels = c("Activated", "Not Activated"))
return(df.all)
}
filter.deseq.into.bed <- function(deseq.df, gene.file, cat = 'R1881 Activated') {
deseq.df = deseq.df[deseq.df$treatment == cat,]
#print(dim(deseq.df))
#scientific notation was messing this up occasionally
x = gene.file$V4
#print(length(x))
y = gene.file[x %in% rownames(deseq.df),]
#print(dim(y))
z = get.tss(y)
#print(dim(z))
return(z)
}
plot_cdf <- function(df.all, tf = "TRPS1", col.lines = c("#FF0000", "grey60", "#0000FF")) {
ecdfplot(~log(abs(as.numeric(dis)), base = 10), groups = status, data = df.all,
auto.key = list(lines=TRUE, points=FALSE),
col = col.lines,
aspect = 1,
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = bquote("log"[10]~.(tf)~"Distance from TSS"),
between=list(y=1.0),
type = 'a',
xlim = c(0,7.5),
lwd=2,
par.settings = list(superpose.line = list(col = col.lines, lwd=3),
strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 200, lty =2)
panel.ecdfplot(...)
})
}
col.lines = c("#FF0000", "grey60")
bedTools.closest <- function(functionstring="/usr/local/anaconda3/bin/closestBed",bed1,bed2,opt.string="") {
options(scipen =99) # not to use scientific notation when writing out
#write bed formatted dataframes to tempfile
write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
# create the command string and call the command using system()
command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
cat(command1,"\n")
try(system(command1))
command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
cat(command2,"\n")
try(system(command2))
command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
cat(command,"\n")
try(system(command))
res=read.table('out.file.bed',header=F, comment.char='')
command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
cat(command3,"\n")
try(system(command3))
colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
return(res)
}
#For converting gene names
gene.file = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/Homo_sapiens.GRCh38.104.bed", sep = '\t', header = FALSE)
`%notin%` <- Negate(`%in%`)
gene.file = gene.file[gene.file$V5 %notin% c("havana", "havana_tagene", "ensembl_havana"),]
gene.symbol = gene.file[,c(4,5)]
colnames(gene.symbol) = c("gene", "symbol")
We used the counts table (Estrogen_treatment_PRO_gene_counts.txt) generated from a previous analysis, extensively described here: https://github.com/guertinlab/Nascent_RNA_Methods. Run DESeq2 with these counts.
x = read.table("Estrogen_treatment_PRO_gene_counts.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]
rep = factor(sapply(strsplit(colnames(x), '_rep'), '[', 2))
sample.conditions = factor(sapply(strsplit(sapply(strsplit(colnames(x), 'T47D_'), '[', 2), '_rep'), '[', 1))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(rep, sample.conditions), ~ 0 + rep + sample.conditions)
deseq.df = DESeq(deseq.df)
res.deseq.estrogen = results(deseq.df, contrast = c("sample.conditions", "Starved_Estrogen", "Starved_DMSO"))
res.deseq.estrogen = res.deseq.estrogen[rownames(res.deseq.estrogen) %in% gene.symbol$gene,]
ma.df = as.data.frame(res.deseq.estrogen)
Plot.
pdf("MA_plot_Estrogen_vs_DMSO_black_activated.pdf", width=3.83, height=3.83)
print(xyplot(ma.df$log2FoldChange ~ log(ma.df$baseMean, base=10),
groups=(ma.df$padj < .1 & ma.df$log2FoldChange > 0 & !is.na(ma.df$padj)),
col=c("grey60","black"), main="Differential PRO Expression", scales="free", aspect=1, pch=20, cex=0.25,
ylab=expression("log"[2]~"PRO fold change"), xlab=expression("log"[10]~"Mean of Normalized Counts"),
par.settings=list(par.xlab.text=list(cex=1.1,font=2), par.ylab.text=list(cex=1.1,font=2))))
dev.off()
Set up in R.
direc = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/"
setwd(direc)
library(lattice)
library(DESeq2)
library(viridis)
library(tidyverse)
library(latticeExtra)
library(msigdbr)
library(clusterProfiler)
library(fgsea)
library(enrichplot)
library(EnhancedVolcano)
library(clusterProfiler)
library(pheatmap)
library(dendsort)
library(RColorBrewer)
library(DEGreport)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)
source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')
#Viridis color scheme
col = viridis(6)
#For converting gene names
gene.file = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/Homo_sapiens.GRCh38.104.bed", sep = '\t', header = FALSE)
`%notin%` <- Negate(`%in%`)
gene.file = gene.file[gene.file$V5 %notin% c("havana", "havana_tagene", "ensembl_havana"),]
gene.symbol = gene.file[,c(4,5)]
colnames(gene.symbol) = c("gene", "symbol")
Load the DESeq2 object and the counts table to get the variable names.
load("TRPS1_timecourse_PRO_deseq.df.RData")
x = read.table("TRPS1_timecourse_PRO_gene_counts_combined.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]
x = x[,!grepl("24hour", colnames(x))]
time = factor(sapply(strsplit(colnames(x), '_'), '[', 3), levels = c("0hour", "0.5hour", "1hour", "2hour", "4hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
Run the likelihood ratio test to identify dynamic peaks over the time course.
dds.lrt = DESeq(deseq.df, test="LRT", reduced = ~ rep)
res.lrt = results(dds.lrt)
rld = rlogTransformation(dds.lrt)
rld_mat <- assay(rld)
sum(res.lrt$padj < 0.1 & !is.na(res.lrt$padj))
siglrt.re = res.lrt[res.lrt$padj < 0.1 & !is.na(res.lrt$padj),]
Intersect with ER target genes.
estrogen_response_genes = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/estrogen_activated_gene_names.txt", header = FALSE)$V1
estrogen_response_genes = gene.symbol$gene[gene.symbol$symbol %in% estrogen_response_genes]
siglrt.re.sets = siglrt.re[rownames(siglrt.re) %in% estrogen_response_genes,]
Separate those that are activated at 30 minutes from those that are repressed.
res.deseq = results(deseq.df, contrast = c("time", "0.5hour", "0hour"))
sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj))
sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & rownames(res.deseq) %in% estrogen_response_genes)
act.res.deseq = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0 & rownames(res.deseq) %in% estrogen_response_genes,]
rep.res.deseq = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0 & rownames(res.deseq) %in% estrogen_response_genes,]
sig.res.deseq = rbind(act.res.deseq, rep.res.deseq)
Cluster the genes.
dim(sig.res.deseq)
cluster_rlog = rld_mat[rownames(sig.res.deseq),]
meta = cbind.data.frame(time, rep)
rownames(meta) = colnames(cluster_rlog)
clusters_lrt <- degPatterns(cluster_rlog, metadata = meta, minc = 0)
Set up a data frame for plotting.
plot.df = clusters_lrt$normalized
plot.df$Class[plot.df$genes %in% rownames(act.res.deseq)] <- "Activated at 30 minutes"
plot.df$Class[plot.df$genes %in% rownames(rep.res.deseq)] <- "Repressed at 30 minutes"
plot.df$time = as.character(plot.df$time)
plot.df$time[plot.df$time == '0hour'] = 0
plot.df$time[plot.df$time == '0.5hour'] = 0.5
plot.df$time[plot.df$time == '1hour'] = 1
plot.df$time[plot.df$time == '2hour'] = 2
plot.df$time[plot.df$time == '4hour'] = 4
plot.df$time[plot.df$time == '24hour'] = 24
plot.df$time = as.numeric(plot.df$time)
plot.df = plot.df[order(plot.df$genes),]
plot.df = plot.df[order(plot.df$time),]
plot.df$cluster = paste('cluster', as.character(plot.df$cluster), sep = '')
head(plot.df)
save(plot.df, file = "LRT_estrogen_response_gene_set_significant_at_30min_plot.df.RData")
load("LRT_estrogen_response_gene_set_significant_at_30min_plot.df.RData")
Assign the colors.
cat.colours = plot.df
cat.colours$colour[plot.df$Class == "Activated at 30 minutes"] <- '#FF000064'
cat.colours$colour[plot.df$Class == "Repressed at 30 minutes"] <- '#0000FF64'
cat.colours$colour <- as.factor(cat.colours$colour)
sum(plot.df$Class == "Activated at 30 minutes" & plot.df$time == 0) #65
sum(plot.df$Class == "Repressed at 30 minutes" & plot.df$time == 0) #58
Plot.
pdf('Normalized_counts_estrogen_response_gene_set_significant_at_30min_red_blue.pdf')
trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
plot.symbol = list(col="black", lwd=1.0, pch ='.'))
print(
xyplot(value ~ factor(time) | Class, group = genes, data = plot.df, type = c('l'),
par.strip.text=list(cex=1),
scales=list(x=list(cex=1, relation = "free", rot = 45), y =list(cex=1, relation="free")),
aspect=1.0,
between=list(y=0.5, x=0.5),
ylab = list(label = 'Scaled normalized\nPRO signal', cex = 2),
xlab = list(label = 'Time (hours)', cex = 2),
par.settings = list(superpose.symbol = list(pch = c(16), col=c('grey20'), cex =0.5),
strip.background=list(col="grey80"),
superpose.line = list(col = as.character(cat.colours$colour), lwd=c(1), lty = c(1))),
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.bwplot(x, y, pch = '|', horizontal = FALSE, box.width = .5, do.out = FALSE)
panel.loess(x, y, ..., col = "black", lwd =2.0, span = 1/2, degree = 1, family = c("gaussian"))
})
)
dev.off()
Write the gene names as tables.
activated = plot.df$genes[plot.df$Class == "Activated at 30 minutes"]
activated = gene.symbol$symbol[gene.symbol$gene %in% activated]
write.table(activated, file = "activated_ER_target_genes.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
repressed = plot.df$genes[plot.df$Class == "Repressed at 30 minutes"]
repressed = gene.symbol$symbol[gene.symbol$gene %in% repressed]
write.table(repressed, file = "repressed_ER_target_genes.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
Set up in R.
direc = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1/"
setwd(direc)
library(DESeq2)
library(pheatmap)
categorize.deseq.df.mods <- function(df, fdr = 0.05, log2fold = 0.0, treat = 'Estrogen')
{
df.activated = df[df$padj < fdr & !is.na(df$padj) & df$log2FoldChange > log2fold,]
df.repressed = df[df$padj < fdr & !is.na(df$padj) & df$log2FoldChange < -log2fold,]
df.unchanged = df[df$padj > 0.5 & !is.na(df$padj) & abs(df$log2FoldChange) < 0.1,]
df.dregs = df[!(df$padj < fdr & !is.na(df$padj) & df$log2FoldChange > log2fold) &
!(df$padj < fdr & !is.na(df$padj) & df$log2FoldChange < -log2fold) &
!(df$padj > 0.5 & !is.na(df$padj) & abs(df$log2FoldChange) < 0.1), ]
df.unchanged$treatment = paste(treat, 'Unchanged')
df.activated$treatment = paste(treat, 'Activated')
df.repressed$treatment = paste(treat, 'Repressed')
df.dregs$treatment = paste(treat, 'All Other Genes')
df.effects.lattice =
rbind(df.activated,
df.unchanged,
df.repressed,
df.dregs)
df.effects.lattice$treatment = factor(df.effects.lattice$treatment)
df.effects.lattice$treatment = relevel(df.effects.lattice$treatment, ref = paste(treat, 'Activated'))
df.effects.lattice$treatment = relevel(df.effects.lattice$treatment, ref = paste(treat, 'Repressed'))
df.effects.lattice$treatment = relevel(df.effects.lattice$treatment, ref = paste(treat, 'Unchanged'))
df.effects.lattice$treatment = relevel(df.effects.lattice$treatment, ref = paste(treat, 'All Other Genes'))
return(df.effects.lattice)
}
rwb <- colorRampPalette(colors = c("red", "white", "blue"))
#For converting gene names
gene.file = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/Homo_sapiens.GRCh38.104.bed", sep = '\t', header = FALSE)
`%notin%` <- Negate(`%in%`)
gene.file = gene.file[gene.file$V5 %notin% c("havana", "havana_tagene", "ensembl_havana"),]
gene.symbol = gene.file[,c(4,5)]
colnames(gene.symbol) = c("gene", "symbol")
Identify differentially expressed genes across the three clones
#DESeq2
x = read.table("TRPS1_PRO_gene_counts.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]
treatment = factor(sapply(strsplit(colnames(x), '_'), '[', 3))
clone = factor(sapply(strsplit(colnames(x), '_'), '[', 2))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(treatment, clone, rep), ~ rep + clone + treatment)
deseq.df = DESeq(deseq.df)
res.deseq.treatment = results(deseq.df, contrast = c("treatment", "dTAG13", "DMSO"))
res.deseq.treatment = res.deseq.treatment[rownames(res.deseq.treatment) %in% gene.symbol$gene,]
sum(res.deseq.treatment$padj < 0.1 & !is.na(res.deseq.treatment$padj))
#Gene classes
df.deseq.effects.lattice =
categorize.deseq.df.mods(res.deseq.treatment, fdr = 0.1, log2fold = 0.0, treat = 'dTAG13')
Get dynamic ER target genes
dTAG_activated_ER_targets = read.table(file = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/activated_ER_target_genes.txt")$V1
dTAG_activated_ER_targets = gene.symbol$gene[gene.symbol$symbol %in% dTAG_activated_ER_targets]
dTAG_repressed_ER_targets = read.table(file = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/repressed_ER_target_genes.txt")$V1
dTAG_repressed_ER_targets = gene.symbol$gene[gene.symbol$symbol %in% dTAG_repressed_ER_targets]
Plot heatmaps
#Activated heatmap
wide_counts = counts(deseq.df[rownames(deseq.df) %in% dTAG_activated_ER_targets,], normalized = TRUE)
wide_counts = as.data.frame(wide_counts)
wide_counts = cbind.data.frame("Clone28" = log(rowSums(wide_counts[,5:8])/rowSums(wide_counts[,1:4]), base = 2),
"Clone35" = log(rowSums(wide_counts[,13:16])/rowSums(wide_counts[,9:12]), base = 2),
"Clone39" = log(rowSums(wide_counts[,21:24])/rowSums(wide_counts[,17:20]), base = 2))
wide_counts = merge.data.frame(wide_counts, gene.symbol, by.x = "row.names", by.y = "gene")
rownames(wide_counts) = wide_counts$symbol
wide_counts = wide_counts[,2:4]
head(wide_counts)
pdf("dTAG_activated_ER_targets_heatmap.pdf", height = 10)
pheatmap(wide_counts, scale = "none", angle_col = 0,
color = rev(rwb(100)),
breaks = seq(-.6, .6, 1.2/100))
dev.off()
#Repressed heatmap
wide_counts = counts(deseq.df[rownames(deseq.df) %in% dTAG_repressed_ER_targets,], normalized = TRUE)
wide_counts = as.data.frame(wide_counts)
wide_counts = cbind.data.frame("Clone28" = log(rowSums(wide_counts[,5:8])/rowSums(wide_counts[,1:4]), base = 2),
"Clone35" = log(rowSums(wide_counts[,13:16])/rowSums(wide_counts[,9:12]), base = 2),
"Clone39" = log(rowSums(wide_counts[,21:24])/rowSums(wide_counts[,17:20]), base = 2))
wide_counts = merge.data.frame(wide_counts, gene.symbol, by.x = "row.names", by.y = "gene")
rownames(wide_counts) = wide_counts$symbol
wide_counts = wide_counts[,2:4]
head(wide_counts)
pdf("dTAG_repressed_ER_targets_heatmap.pdf", height = 10)
pheatmap(wide_counts, scale = "none", angle_col = 0,
color = rev(rwb(100)),
breaks = seq(-.6, .6, 1.2/100))
dev.off()
Set up in R.
direc = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/"
setwd(direc)
library(lattice)
library(latticeExtra)
library(DESeq2)
library(viridis)
library(tidyverse)
library(ggplot2)
library(readxl)
source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')
filter.deseq.into.bed <- function(deseq.df, gene.file) {
x = gene.file$V4
y = gene.file[x %in% rownames(deseq.df),]
z = get.tss(y)
return(z)
}
bedTools.intersect <- function(functionstring="/usr/local/anaconda3/bin/intersectBed",bed1,bed2,opt.string="-wa -wb") {
options(scipen =99) # not to use scientific notation when writing out
#write bed formatted dataframes to tempfile
write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
# create the command string and call the command using system()
command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
cat(command1,"\n")
try(system(command1))
command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
cat(command2,"\n")
try(system(command2))
command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
cat(command,"\n")
try(system(command))
res=read.table('out.file.bed',header=F, comment.char='', sep = "\t")
command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
cat(command3,"\n")
try(system(command3))
colnames(res) = c(colnames(bed1), colnames(bed2))
return(res)
}
bedTools.closest <- function(functionstring="/usr/local/anaconda3/bin/closestBed",bed1,bed2,opt.string="") {
options(scipen =99) # not to use scientific notation when writing out
#write bed formatted dataframes to tempfile
write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
# create the command string and call the command using system()
command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
cat(command1,"\n")
try(system(command1))
command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
cat(command2,"\n")
try(system(command2))
command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
cat(command,"\n")
try(system(command))
res=read.table('out.file.bed',header=F, comment.char='', sep = "\t")
command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
cat(command3,"\n")
try(system(command3))
colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
return(res)
}
plot_cdf <- function(df.all, col.lines = c("#FF0000", "grey60", "#0000FF")) {
ecdfplot(~log(abs(as.numeric(dis)), base = 10), groups = Class, data = df.all,
auto.key = list(lines=TRUE, points=FALSE),
col = col.lines,
aspect = 1,
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~" TRPS1 summit distance from ER summit"),
between=list(y=1.0),
type = 'a',
xlim = c(0,7.5),
lwd=2,
par.settings = list(superpose.line = list(col = col.lines, lwd=3),
strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 200, lty =2)
panel.ecdfplot(...)
})
}
#For converting gene names
gene.file = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/Homo_sapiens.GRCh38.104.bed", sep = '\t', header = FALSE)
`%notin%` <- Negate(`%in%`)
gene.file = gene.file[gene.file$V5 %notin% c("havana", "havana_tagene", "ensembl_havana"),]
gene.symbol = gene.file[,c(4,5)]
colnames(gene.symbol) = c("gene", "symbol")
Get read depth.
complexity_estimate <- read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ChIP/logs/230322_complexity_estimate.txt", header = T)
name = sapply(strsplit(complexity_estimate$name, "Clone28_"), "[", 2)
complexity_estimate = t(complexity_estimate$unique_reads)
colnames(complexity_estimate) = name
complexity_estimate = complexity_estimate[,-grep("HA", colnames(complexity_estimate))]
Read in the counts table and change the column names.
x <- read.table("Combined_ER_peak_counts.txt", sep = '\t', header = TRUE)
rownames(x) = paste0(x[,1], ":", x[,2], "-", x[,3])
peak_name = x[,4]
peak_qvalue = x[,5]
x = x[,-c(1:5)]
colnames(x) = sapply(strsplit(sapply(strsplit(colnames(x), "Clone28_"), "[", 2), "_ER"), "[", 1)
Subtract the read depth normalized background signal.
background <- sweep(x[,grep("IgG", colnames(x))], 2, complexity_estimate[grep("IgG", names(complexity_estimate))], FUN = "/")
background = rowMeans(background)
background = cbind.data.frame(background, background, background, background, background, background, background, background)
background <- sweep(as.data.frame(background), 2, complexity_estimate[grep("ER", names(complexity_estimate))], FUN = "*")
background = round(background, digits = 0)
x = x[,grep("ER", colnames(x))] - background
x[x < 0] = 0
Run DESeq2.
treatment = factor(sapply(strsplit(colnames(x), '_'), '[', 2), levels = c("DMSO", "dTAG"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(treatment, rep), ~ treatment)
factors <- read.table("ER_sizeFactor.txt")
size_factors = factors$V2
names(size_factors) = factors$V1
sizeFactors(deseq.df) <- size_factors
deseq.df = DESeq(deseq.df)
save(deseq.df, file = "deseq.df.ER.Rdata")
load("deseq.df.ER.Rdata")
Get normalized counts for each condition.
counts = as.data.frame(counts(deseq.df[peak_qvalue > 200,], normalized = TRUE))
counts$chr = sapply(strsplit(rownames(counts), ":"), "[", 1)
coord = sapply(strsplit(rownames(counts), ":"), "[", 2)
counts$start = sapply(strsplit(coord, "-"), "[", 1)
counts$end = sapply(strsplit(coord, "-"), "[", 2)
counts$DMSO = rowMeans(counts[,grep("DMSO", colnames(counts))])
counts$dTAG = rowMeans(counts[,grep("dTAG", colnames(counts))])
counts = counts[,9:13]
rownames(counts) = NULL
Get dynamic ER target genes.
load("TRPS1_timecourse_PRO_deseq.df.RData")
res.deseq = results(deseq.df, contrast = c("time", "0.5hour", "0hour"))
estrogen_response_genes = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/estrogen_activated_gene_names.txt", header = FALSE)$V1
estrogen_response_genes = gene.symbol$gene[gene.symbol$symbol %in% estrogen_response_genes]
res.deseq = res.deseq[rownames(res.deseq) %in% estrogen_response_genes,]
Set up a window of 100kbp around each transcription start site.
tss.df = filter.deseq.into.bed(res.deseq, gene.file)
colnames(tss.df) = c("tss.chr", "tss.start", "tss.end", "gene", "symbol", "strand")
tss.df = merge.data.frame(tss.df, res.deseq, by.x = "gene", by.y = "row.names")
tss.df[,1:4] = tss.df[,c(2:4,1)]
colnames(tss.df)[1:4] = colnames(tss.df)[c(2:4,1)]
tss.df$tss.start = tss.df$tss.start - 100000
tss.df$tss.end = tss.df$tss.end + 100000
Generate transcripts per million.
counts.mat = counts(deseq.df, normalized = F)
counts.mat = merge(counts.mat, gene.file, by.x = 0, by.y = "V4")
rownames(counts.mat) = counts.mat$Row.names
gene.length = counts.mat$V3 - counts.mat$V2
gene.full.length = gene.length
x <- counts.mat[,grep("T47D", colnames(counts.mat))] / gene.length
tpm.mat <- as.data.frame(t( t(x) * 1e6 / colSums(x) ))
tpm.mat$DMSO_TPM = log(rowMeans(tpm.mat[,grep("0hour", colnames(tpm.mat))]), 10)
tpm.mat$dTAG_TPM = log(rowMeans(tpm.mat[,grep("0.5hour", colnames(tpm.mat))]), 10)
tpm.mat$gene_length = gene.full.length
tss.df = merge(tss.df, tpm.mat[, c("DMSO_TPM", "dTAG_TPM", "gene_length")], by.x = "gene", by.y = 0)
tss.df[,1:4] = tss.df[,c(2:4,1)]
colnames(tss.df)[1:4] = colnames(tss.df)[c(2:4,1)]
Match unchanged genes based on expression.
tss.df$Match = "Controls"
tss.df$Class = "Other Genes"
tss.df$Class[!is.na(tss.df$padj) & tss.df$padj < 0.1 & tss.df$log2FoldChange > 0] = "Activated Genes"
tss.df$Class[!is.na(tss.df$padj) & tss.df$padj < 0.1 & tss.df$log2FoldChange < 0] = "Repressed Genes"
tss.df$Match[tss.df$Class == "Activated Genes"] = "Cases"
tss.df$Match = relevel(factor(tss.df$Match), ref = "Controls")
tpm.mat$DMSO_TPM = log(tpm.mat$DMSO_TPM, 10)
tpm.mat$dTAG_TPM = log(tpm.mat$dTAG_TPM, 10)
out = matchit(Match ~ DMSO_TPM, data = tss.df[tss.df$Class != "Repressed Genes",], method = "optimal", ratio = 1)
out2 = matchit(Match ~ DMSO_TPM, data = tss.df[tss.df$Class != "Other Genes",], method = "optimal", ratio = 1)
tss.df = tss.df[tss.df$Match == "Cases" | rownames(tss.df) %in% out$match.matrix | rownames(tss.df) %in% out2$match.matrix,]
rownames(tss.df) = NULL
Get ER peaks near the genes.
peaks_near_genes = bedTools.intersect(bed1 = tss.df, bed2 = counts, opt.string = "-wa -wb")
peaks_near_genes$Class = factor(peaks_near_genes$Class, levels = c("Activated Genes", "Other Genes", "Repressed Genes"))
peaks_near_genes = peaks_near_genes[,c(1:22)]
Plot.
pdf("ER_q200_ChIP_intensity_near_dynamic_ER_target_genes_100kb.pdf", width = 7, height = 3.5)
ggplot(data = peaks_near_genes, aes(x = Class, y = log(dTAG/DMSO, base = 2), color = Class),) +
theme(legend.position = "none") +
geom_violin() +
geom_boxplot(width=.1, outlier.shape = NA) +
geom_jitter(size = .01) +
geom_hline(yintercept = 0, linetype="dashed") +
scale_color_manual(values = c("red", "grey60", "blue")) +
xlab(NULL) +
ylab(expression("log"[2]~"(fold change in ER binding intensity)")) +
xlab("Genes class defined by change in expression upon TRPS1 depletion") +
ggtitle("Change in ER binding intensity in peaks near ER target genes") +
theme_bw() +
theme(legend.position="none", plot.title = element_text(hjust = 0.5))
dev.off()
T-tests.
Activated = peaks_near_genes[peaks_near_genes$Class == "Activated Genes",]
t.test(log(Activated$dTAG), log(Activated$DMSO), paired = T) #0.000000000000002274
Unchanged = peaks_near_genes[peaks_near_genes$Class == "Other Genes",]
t.test(log(Unchanged$dTAG), log(Unchanged$DMSO), paired = T) #0.3518
Repressed = peaks_near_genes[peaks_near_genes$Class == "Repressed Genes",]
t.test(log(Repressed$dTAG), log(Repressed$DMSO), paired = T) #0.000003224
Get the distance from each ER peak summit to the nearest TRPS1 peak summit.
q = 30
peaks = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ChIP/results/macs2/HA_DMSO_vs_dTAG_summits.bed", sep = "\t")
x <- read.table("Combined_HA_DMSO_vs_dTAG_peak_counts.txt", sep = '\t', header = TRUE)
peak_qvalue = x[,5]
peaks = bedTools.closest(bed1 = counts, bed2 = peaks[peak_qvalue > q,], opt.string = "-d")
peaks$status[!is.na(peaks$padj) & peaks$padj < 0.1 & peaks$log2FoldChange > 0] = "Increasing"
peaks$status[!is.na(peaks$padj) & peaks$padj < 0.1 & peaks$log2FoldChange < 0] = "Decreasing"
peaks$status[!is.na(peaks$padj) & peaks$padj >= 0.1] = "Not significantly changing"
peaks$status = factor(peaks$status, levels = c("Increasing", "Not significantly changing", "Decreasing"))
colnames(peaks)[colnames(peaks) == "V5"] = "q_value"
Bin the ER peaks by proximity to TRPS1 peaks.
peaks$Class = NULL
peaks$Class[peaks$dis < 200] = "< 200bp"
peaks$Class[peaks$dis >= 200 & peaks$dis <= 500] = "200 - 500bp"
peaks$Class[peaks$dis > 500] = "> 500bp"
peaks$Class = factor(peaks$Class, levels = c("< 200bp", "200 - 500bp", "> 500bp"))
Plot.
pdf("ER_ChIP_intensity_by_binned_HA_proximity.pdf", width = 7, height = 3.5)
ggplot(data = peaks[peaks$Class %in% levels(peaks$Class),], aes(x = Class, y = log(dTAG/DMSO, base = 2), color = Class),) +
theme(legend.position = "none") +
geom_hline(yintercept = 0, linetype="dashed") +
geom_violin() +
geom_boxplot(width=.1, outlier.shape = NA) +
geom_jitter(size = .01) +
geom_hline(yintercept = 0, linetype="dashed") +
scale_color_manual(values = c("#FF0000", "grey60", "#0000FF")) +
xlab("Distance to nearest TRPS1 peak") +
ylab(expression("log"[2]~"(fold change in ER binding intensity)")) +
ggtitle("Change in ER binding intensity by proximity to TRPS1") +
theme_bw() +
theme(legend.position="none", plot.title = element_text(hjust = 0.5))
dev.off()
T-tests.
Close = peaks[peaks$Class == "< 200bp",]
t.test(log(Close$dTAG), log(Close$DMSO), paired = T) #<0.00000000000000022
Intermediate = peaks[peaks$Class == "200 - 500bp",]
t.test(log(Intermediate$dTAG), log(Intermediate$DMSO), paired = T) #0.4901
Far = peaks[peaks$Class == "> 500bp",]
t.test(log(Far$dTAG), log(Far$DMSO), paired = T) #0.00001017
Set up in R.
direc = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/"
setwd(direc)
library(lattice)
library(DESeq2)
library(viridis)
library(tidyverse)
library(latticeExtra)
library(msigdbr)
library(clusterProfiler)
library(fgsea)
library(enrichplot)
library(clusterProfiler)
library(DEGreport)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)
source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')
filter.deseq.into.bed <- function(deseq.df, gene.file, cat = 'R1881 Activated') {
deseq.df = deseq.df[deseq.df$status == cat,]
#print(dim(deseq.df))
#scientific notation was messing this up occasionally
x = gene.file$V4
#print(length(x))
y = gene.file[x %in% rownames(deseq.df),]
#print(dim(y))
z = get.tss(y)
#print(dim(z))
return(z)
}
#bedTools.closest <- function(functionstring="/anaconda3/bin/closestBed",bed1,bed2,opt.string="") {
bedTools.closest <- function(functionstring="/usr/local/anaconda3/bin/closestBed",bed1,bed2,opt.string="") {
options(scipen =99) # not to use scientific notation when writing out
#write bed formatted dataframes to tempfile
write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
# create the command string and call the command using system()
command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
cat(command1,"\n")
try(system(command1))
command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
cat(command2,"\n")
try(system(command2))
command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
cat(command,"\n")
try(system(command))
res=read.table('out.file.bed',header=F, comment.char='')
command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
cat(command3,"\n")
try(system(command3))
colnames(res) = c(colnames(bed1), colnames(bed2)[1:ncol(bed2)], 'dis' )
return(res)
}
#color scheme
col.lines = c("red", "blue", "grey60")
#For converting gene names
gene.file = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/Homo_sapiens.GRCh38.104.bed", sep = '\t', header = FALSE)
`%notin%` <- Negate(`%in%`)
gene.file = gene.file[gene.file$V5 %notin% c("havana", "havana_tagene", "ensembl_havana"),]
gene.symbol = gene.file[,c(4,5)]
colnames(gene.symbol) = c("gene", "symbol")
Identify differentially expressed genes at 24 hours, shrinking the fold changes.
x = read.table("TRPS1_timecourse_PRO_gene_counts_combined.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]
time = factor(sapply(strsplit(colnames(x), '_'), '[', 3), levels = c("0hour", "0.5hour", "1hour", "2hour", "4hour", "24hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(time, rep), ~ rep + time)
deseq.df = DESeq(deseq.df)
i = "24hour"
res.deseq = results(deseq.df, contrast = c("time", i, "0hour"))
res.deseq = res.deseq[rownames(res.deseq) %in% gene.symbol$gene,]
res.deseq = merge(as.data.frame(res.deseq), gene.symbol, by.x = 0, by.y = "gene")
res.deseq.shrink <- lfcShrink(deseq.df, coef = "time_24hour_vs_0hour", type="ashr")
res.deseq.shrink = res.deseq.shrink[rownames(res.deseq.shrink) %in% gene.symbol$gene,]
sum(res.deseq.shrink$padj < 0.1 & !is.na(res.deseq.shrink$padj))
df.deseq.effects.lattice = res.deseq.shrink
df.deseq.effects.lattice$status = "Other"
df.deseq.effects.lattice$status[res.deseq.shrink$padj < 0.1 & !is.na(res.deseq.shrink$padj) & res.deseq.shrink$log2FoldChange > 0] = "Activated"
df.deseq.effects.lattice$status[res.deseq.shrink$padj < 0.1 & !is.na(res.deseq.shrink$padj) & res.deseq.shrink$log2FoldChange < 0] = "Repressed"
df.deseq.effects.lattice$status = factor(df.deseq.effects.lattice$status, levels = c("Other", "Activated", "Repressed"))
Plot.
pdf("MA_PRO_24hour_ashr_for_figure.pdf", width = 7, height = 3.5)
ggplot(as.data.frame(df.deseq.effects.lattice) %>% arrange(status), aes(x = log(baseMean, 10), y = log2FoldChange, color = status)) +
geom_point() +
scale_color_manual(values = c("grey60", "red", "blue")) +
geom_hline(yintercept = 0, linetype="dashed") +
ylim(-1, 1) +
xlim(0, 5) +
ylab(expression("Shrunken log"[2]~"(fold change in PRO signal)")) +
xlab(expression("log"[10]~"(mean of normalized intensity)")) +
ggtitle("Change in expression 24 hours after TRPS1 depletion") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
labs(color=NULL)
dev.off()
Rank genes based on shrunken fold change.
original_gene_list <- res.deseq.shrink$log2FoldChange
names(original_gene_list) <- rownames(res.deseq.shrink)
gene_list<-na.omit(original_gene_list)
gene_list = sort(gene_list, decreasing = TRUE)
h_gene_sets = msigdbr(species = "Homo sapiens", category = "H")
msigdbr_list = split(x = h_gene_sets$ensembl_gene, f = h_gene_sets$gs_name)
set.seed(0)
fgseaRes <- fgsea(msigdbr_list, gene_list, eps = 0)
fgseaResTidy <- fgseaRes %>%
as_tibble() %>%
arrange(desc(NES))
print(fgseaResTidy[fgseaResTidy$padj < 0.1,], n = 100)
gse_top = fgseaResTidy[fgseaResTidy$padj < 0.1,]
Plot.
#padj 2.45e-19
pdf("Mountain_plot_24hour_E2F_targets.pdf", height = 5)
plotEnrichment(msigdbr_list[["HALLMARK_E2F_TARGETS"]], gene_list) +
ylim(-.9, .1) +
labs(title="Hallmark E2F Targets")
dev.off()
Set up in R.
direc = "~/Library/CloudStorage/Box-Box/GuertinLab/Cell_Counts/"
setwd(direc)
library(readxl)
library(tidyr)
library(plyr)
library(dplyr)
library(ggplot2)
library(viridis)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95, .drop=TRUE) {
# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
# This does the summary. For each group's data frame, return a vector with
# N, mean, and sd
datac <- ddply(data, groupvars, .drop=.drop,
.fun = function(xx, col) {
c(N = length2(xx[[col]], na.rm=na.rm),
mean = mean (xx[[col]], na.rm=na.rm),
sd = sd (xx[[col]], na.rm=na.rm)
)
},
measurevar
)
# Rename the "mean" column
datac <- plyr::rename(datac, c("mean" = measurevar))
datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult <- qt(conf.interval/2 + .5, datac$N-1)
datac$ci <- datac$se * ciMult
return(datac)
}
Read in the counts data.
Cell_Counts <- read_excel("220831_dTAG_TRPS1_reps_3_to_6.xlsx")
Convert to cells/well.
Dilution_factor <- 10000*Cell_Counts$mL/9
Cell_Counts <- Cell_Counts[,-6]
Cell_Counts[,3:5] = Cell_Counts[,3:5]*Dilution_factor
Average the technical replicates.
Cell_Counts <- pivot_longer(Cell_Counts, colnames(Cell_Counts)[3:5], names_to = "Condition", values_to = "Count")
Cell_Counts <- summarySE(Cell_Counts, measurevar="Count", groupvars=c("Condition", "Day", "Rep"))
Cell_Counts <- Cell_Counts[,c(1,2,3,5)]
Cell_Counts <- na.omit(Cell_Counts)
Calculate mean and standard error for the biological replicates.
Summary <- summarySE(Cell_Counts, measurevar="Count", groupvars=c("Condition", "Day"))
Fit a linear model to the data with and without an interaction term for the effect of dTAG.
pd <- position_dodge(0.1)
ylim <- c(0, 500000)
fit = lm(log(Count, base = 2) ~ Day*Condition, data = Cell_Counts[Cell_Counts$Condition != "Untreated",])
fit_no_interaction = lm(log(Count, base = 2) ~ Day + Condition, data = Cell_Counts[Cell_Counts$Condition != "Untreated",])
anova(fit_no_interaction, fit) # p = 1.01e-05
predict = predict(fit, interval = "confidence")
Cell_Counts$lwr[Cell_Counts$Condition != "Untreated"] = predict[,2]
Cell_Counts$upr[Cell_Counts$Condition != "Untreated"] = predict[,3]
Plot.
pdf("dTAG_TRPS1_reps3_to_6_curve_fit_confidence.pdf", width = 14)
ggplot(Cell_Counts, aes(x=Day, y=Count, colour=Condition, fill = Condition)) +
geom_point(data=Cell_Counts, size=3, shape=21) +
stat_function(fun=function(x) 2^(as.numeric(fit$coefficients[1]) + as.numeric(fit$coefficients[2])*x), color = "Black") +
stat_function(fun=function(x) 2^(as.numeric(fit$coefficients[1] + fit$coefficients[3]) + as.numeric(fit$coefficients[2] + fit$coefficients[4])*x), color = "Red") +
geom_errorbar(aes(ymin=2^lwr, ymax=2^upr), width=.1) +
scale_colour_manual(values = c("Black", "Red", "Grey"), labels = c("DMSO", "dTAG", "Untreated")) +
scale_fill_manual(values = c("Black", "Red", "Grey"), labels = c("DMSO", "dTAG", "Untreated")) +
scale_x_continuous(breaks = c(0, 3, 7, 10, 14), minor_breaks = seq(0,14,1)) +
ylab("Cell Number") +
theme_bw() +
theme(legend.justification = "center",
legend.position = c(0.2, 0.7),
legend.title = element_blank(),
plot.title = element_text(hjust = 0.5),
legend.text = element_text(size=24),
axis.text=element_text(size=12),
axis.title=element_text(size=24,face="bold"))
dev.off()
Repeat with parental T47D counts.
#Read data
Cell_Counts <- read_excel("231128_dTAG_T47D.xlsx")
#Convert to cells/well
Dilution_factor <- 10000*2*Cell_Counts$mL/9
Cell_Counts <- Cell_Counts[,-6]
Cell_Counts[,3:5] = Cell_Counts[,3:5]*Dilution_factor
#Average technical replicates
Cell_Counts <- pivot_longer(Cell_Counts, colnames(Cell_Counts)[3:5], names_to = "Condition", values_to = "Count")
Cell_Counts <- summarySE(Cell_Counts, measurevar="Count", groupvars=c("Condition", "Day", "Rep"))
Cell_Counts <- Cell_Counts[,c(1,2,3,5)]
Cell_Counts <- na.omit(Cell_Counts)
#Mean and SEM for biological replicates
Summary <- summarySE(Cell_Counts, measurevar="Count", groupvars=c("Condition", "Day"))
#Plot
pd <- position_dodge(0.1)
ylim <- c(0, 500000)
fit = lm(log(Count, base = 2) ~ Day*Condition, data = Cell_Counts[Cell_Counts$Condition != "Untreated",])
fit_no_interaction = lm(log(Count, base = 2) ~ Day + Condition, data = Cell_Counts[Cell_Counts$Condition != "Untreated",])
anova(fit_no_interaction, fit) # p = 0.9411
predict = predict(fit, interval = "confidence")
Cell_Counts$lwr[Cell_Counts$Condition != "Untreated"] = predict[,2]
Cell_Counts$upr[Cell_Counts$Condition != "Untreated"] = predict[,3]
pdf("dTAG_T47D_curve_fit_confidence.pdf", width = 14)
ggplot(Cell_Counts, aes(x=Day, y=Count, colour=Condition, fill = Condition)) +
geom_point(data=Cell_Counts, size=3, shape=21) +
stat_function(fun=function(x) 2^(as.numeric(fit$coefficients[1]) + as.numeric(fit$coefficients[2])*x), color = "Black") +
stat_function(fun=function(x) 2^(as.numeric(fit$coefficients[1] + fit$coefficients[3]) + as.numeric(fit$coefficients[2] + fit$coefficients[4])*x), color = "Red") +
geom_errorbar(aes(ymin=2^lwr, ymax=2^upr), width=.1) +
scale_colour_manual(values = c("Black", "Red", "Grey"), labels = c("DMSO", "dTAG", "Untreated")) +
scale_fill_manual(values = c("Black", "Red", "Grey"), labels = c("DMSO", "dTAG", "Untreated")) +
scale_x_continuous(breaks = c(0, 3, 7, 10, 14), minor_breaks = seq(0,14,1)) +
ylab("Cell Number") +
theme_bw() +
theme(legend.justification = "center",
legend.position = c(0.2, 0.7),
legend.title = element_blank(),
plot.title = element_text(hjust = 0.5),
legend.text = element_text(size=24),
axis.text=element_text(size=12),
axis.title=element_text(size=24,face="bold"))
dev.off()
Set up in R.
direc = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/"
setwd(direc)
library(DESeq2)
library(RTN)
library(RTNsurvival)
library(pheatmap)
library(Fletcher2013b)
#For converting gene names
gene.file = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/Homo_sapiens.GRCh38.104.bed", sep = '\t', header = FALSE)
`%notin%` <- Negate(`%in%`)
gene.file = gene.file[gene.file$V5 %notin% c("havana", "havana_tagene", "ensembl_havana"),]
gene.symbol = gene.file[,c(4,5)]
colnames(gene.symbol) = c("gene", "symbol")
Identify primary TRPS1-responsive genes.
load("TRPS1_timecourse_PRO_deseq.df.RData")
i = "0.5hour"
res.deseq = results(deseq.df, contrast = c("time", i, "0hour"))
res.deseq = res.deseq[rownames(res.deseq) %in% gene.symbol$gene,]
res.deseq = merge(as.data.frame(res.deseq), gene.symbol, by.x = 0, by.y = "gene")
activated = res.deseq$symbol[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0]
repressed = res.deseq$symbol[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0]
Define the TRPS1 regulon based on our own primary gene set.
data("rtni1st")
TRPS1 = rtni1st@regulatoryElements["TRPS1"]
ILMN_SYMBOL = rtni1st@rowAnnotation
rtni1st@results$tn.dpi[,TRPS1] <- 0
rtni1st@results$tn.dpi[,TRPS1][names(rtni1st@results$tn.dpi[,TRPS1]) %in%
ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% activated]] <- -1
rtni1st@results$tn.dpi[,TRPS1][names(rtni1st@results$tn.dpi[,TRPS1]) %in%
ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% repressed]] <- 1
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time", event = "event", keycovar =c("Age","Grade"))
tns1st <-tnsGSEA2(tns1st)
data("rtni2nd")
TRPS1 = rtni2nd@regulatoryElements["TRPS1"]
ILMN_SYMBOL = rtni2nd@rowAnnotation
rtni2nd@results$tn.dpi[,TRPS1] <- 0
rtni2nd@results$tn.dpi[,TRPS1][names(rtni2nd@results$tn.dpi[,TRPS1]) %in%
ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% activated]] <- -1
rtni2nd@results$tn.dpi[,TRPS1][names(rtni2nd@results$tn.dpi[,TRPS1]) %in%
ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% repressed]] <- 1
tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time", event = "event", keycovar =c("Age", "Grade"))
tns2nd <-tnsGSEA2(tns2nd)
Merge the two halves of the cohort.
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
tns2nd@results$regulonActivity$status)
Get regulon activity and sample attributes.
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")
Run Kaplan-Meier analysis.
tns_both <-tnsKM(tns_both)
Identify the patient with the highest TRPS1 activity score.
max = rownames(tns_both@results$regulonActivity$differential)[tns_both@results$regulonActivity$differential == max(tns_both@results$regulonActivity$differential)]
Plot.
pdf("RTNsurvival_GSEA2_TRPS1_high_regulon_merged_METABRIC_30minutes_FDR_0.1.pdf", height = 3, width = 3)
tnsPlotGSEA2(tns_both, max, regs = "TRPS1")
dev.off()
Plot.
#Logrank p-value 8.3e-6
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()
#Just ER positive
data("rtni1st")
TRPS1 = rtni1st@regulatoryElements["TRPS1"]
ILMN_SYMBOL = rtni1st@rowAnnotation
rtni1st@results$tn.dpi[,TRPS1] <- 0
rtni1st@results$tn.dpi[,TRPS1][names(rtni1st@results$tn.dpi[,TRPS1]) %in%
ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% activated]] <- -1
rtni1st@results$tn.dpi[,TRPS1][names(rtni1st@results$tn.dpi[,TRPS1]) %in%
ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% repressed]] <- 1
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time",
event = "event", keycovar =c("Age","Grade")
, excludeAttribs = "ER-"
)
tns1st <-tnsGSEA2(tns1st)
data("rtni2nd")
TRPS1 = rtni2nd@regulatoryElements["TRPS1"]
ILMN_SYMBOL = rtni2nd@rowAnnotation
rtni2nd@results$tn.dpi[,TRPS1] <- 0
rtni2nd@results$tn.dpi[,TRPS1][names(rtni2nd@results$tn.dpi[,TRPS1]) %in%
ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% activated]] <- -1
rtni2nd@results$tn.dpi[,TRPS1][names(rtni2nd@results$tn.dpi[,TRPS1]) %in%
ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% repressed]] <- 1
tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time",
event = "event", keycovar =c("Age", "Grade")
, excludeAttribs = "ER-"
)
tns2nd <-tnsGSEA2(tns2nd)
#Merge
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
tns2nd@results$regulonActivity$status)
#Get regulon activity and sample attributes
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")
#Run KM and Cox analysis
tns_both <-tnsKM(tns_both)
tns_both <-tnsCox(tns_both)
#Logrank p-value 3.86e-6
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint_just_ER_positive.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()
#Just ER negative
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time",
event = "event", keycovar =c("Age","Grade"),
excludeAttribs = "ER+")
tns1st <-tnsGSEA2(tns1st)
tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time",
event = "event", keycovar =c("Age","Grade"),
excludeAttribs = "ER+")
tns2nd <-tnsGSEA2(tns2nd)
#Merge
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
tns2nd@results$regulonActivity$status)
#Get regulon activity and sample attributes
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")
#Run KM and Cox analysis
tns_both <-tnsKM(tns_both)
tns_both <-tnsCox(tns_both)
#Logrank p-value
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint_just_ER_negative.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()
#Just Basal
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time",
event = "event", keycovar =c("Age","Grade"),
excludeAttribs = c("LumA", "LumB", "Her2", "Normal"))
tns1st <-tnsGSEA2(tns1st)
tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time",
event = "event", keycovar =c("Age","Grade"),
excludeAttribs = c("LumA", "LumB", "Her2", "Normal"))
tns2nd <-tnsGSEA2(tns2nd)
#Merge
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
tns2nd@results$regulonActivity$status)
#Get regulon activity and sample attributes
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")
#Run KM and Cox analysis
tns_both <-tnsKM(tns_both)
tns_both <-tnsCox(tns_both)
#Logrank p-value
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint_just_Basal.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()
#Just Normal
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time",
event = "event", keycovar =c("Age","Grade"),
excludeAttribs = c("LumA", "LumB", "Her2", "Basal"))
tns1st <-tnsGSEA2(tns1st)
tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time",
event = "event", keycovar =c("Age","Grade"),
excludeAttribs = c("LumA", "LumB", "Her2", "Basal"))
tns2nd <-tnsGSEA2(tns2nd)
#Merge
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
tns2nd@results$regulonActivity$status)
#Get regulon activity and sample attributes
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")
#Run KM and Cox analysis
tns_both <-tnsKM(tns_both)
tns_both <-tnsCox(tns_both)
#Logrank p-value
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint_just_Normal.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()
#Just Her2-amplified
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time",
event = "event", keycovar =c("Age","Grade"),
excludeAttribs = c("LumA", "LumB", "Normal", "Basal"))
tns1st <-tnsGSEA2(tns1st)
tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time",
event = "event", keycovar =c("Age","Grade"),
excludeAttribs = c("LumA", "LumB", "Normal", "Basal"))
tns2nd <-tnsGSEA2(tns2nd)
#Merge
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
tns2nd@results$regulonActivity$status)
#Get regulon activity and sample attributes
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")
#Run KM and Cox analysis
tns_both <-tnsKM(tns_both)
tns_both <-tnsCox(tns_both)
#Logrank p-value
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint_just_HER2.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()
#Just luminal A
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time",
event = "event", keycovar =c("Age","Grade"),
excludeAttribs = c("Basal", "Her2", "Normal", "LumB"))
tns1st <-tnsGSEA2(tns1st)
tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time",
event = "event", keycovar =c("Age","Grade"),
excludeAttribs = c("Basal", "Her2", "Normal", "LumB"))
tns2nd <-tnsGSEA2(tns2nd)
#Merge
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
tns2nd@results$regulonActivity$status)
#Get regulon activity and sample attributes
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")
#Run KM and Cox analysis
tns_both <-tnsKM(tns_both)
tns_both <-tnsCox(tns_both)
#Logrank p-value 4.23e-4
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint_just_luminal_A.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()
#Just luminal B
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time",
event = "event", keycovar =c("Age","Grade"),
excludeAttribs = c("Basal", "Her2", "Normal", "LumA"))
tns1st <-tnsGSEA2(tns1st)
tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time",
event = "event", keycovar =c("Age","Grade"),
excludeAttribs = c("Basal", "Her2", "Normal", "LumA"))
tns2nd <-tnsGSEA2(tns2nd)
#Merge
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
tns2nd@results$regulonActivity$status)
#Get regulon activity and sample attributes
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")
#Run KM and Cox analysis
tns_both <-tnsKM(tns_both)
tns_both <-tnsCox(tns_both)
#Logrank p-value 2.34e-1
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint_just_luminal_B.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()
Identify TRPS1-responsive genes.
x = read.table("TRPS1_timecourse_PRO_gene_counts_combined.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]
time = factor(sapply(strsplit(colnames(x), '_'), '[', 3), levels = c("0hour", "0.5hour", "1hour", "2hour", "4hour", "24hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(time, rep), ~ rep + time)
deseq.df = DESeq(deseq.df)
i = "24hour"
res.deseq = results(deseq.df, contrast = c("time", i, "0hour"))
res.deseq = res.deseq[rownames(res.deseq) %in% gene.symbol$gene,]
res.deseq = merge(as.data.frame(res.deseq), gene.symbol, by.x = 0, by.y = "gene")
Define the TRPS1 regulon based on this gene set.
data("rtni1st")
TRPS1 = rtni1st@regulatoryElements["TRPS1"]
ILMN_SYMBOL = rtni1st@rowAnnotation
rtni1st@results$tn.dpi[,TRPS1] <- 0
rtni1st@results$tn.dpi[,TRPS1][names(rtni1st@results$tn.dpi[,TRPS1]) %in%
ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% activated]] <- -1
rtni1st@results$tn.dpi[,TRPS1][names(rtni1st@results$tn.dpi[,TRPS1]) %in%
ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% repressed]] <- 1
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time", event = "event", keycovar =c("Age","Grade"))
tns1st <-tnsGSEA2(tns1st)
data("rtni2nd")
TRPS1 = rtni2nd@regulatoryElements["TRPS1"]
ILMN_SYMBOL = rtni2nd@rowAnnotation
rtni2nd@results$tn.dpi[,TRPS1] <- 0
rtni2nd@results$tn.dpi[,TRPS1][names(rtni2nd@results$tn.dpi[,TRPS1]) %in%
ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% activated]] <- -1
rtni2nd@results$tn.dpi[,TRPS1][names(rtni2nd@results$tn.dpi[,TRPS1]) %in%
ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% repressed]] <- 1
tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time", event = "event", keycovar =c("Age", "Grade"))
tns2nd <-tnsGSEA2(tns2nd)
Merge the two halves of the cohort.
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
tns2nd@results$regulonActivity$status)
Get regulon activity and sample attributes.
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")
Run Kaplan-Meier analysis.
tns_both <-tnsKM(tns_both)
Plot.
#Logrank p-value 4.99e-4
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_24hour_FDR_0.1_no_endpoint.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()
As it turns out, our degron insertion was heterozygous, and T47D cells in fact have four genomic copies of TRPS1. We have determined that we inadvertently created two to three knockout alleles in each of the three clones. However, we were more concerned with TRPS1 protein expression than genomic copies edited. In Figure 2A, the tagged alleles are visualized with the TRPS1 antibody, and there is no detectable untagged TRPS1. Even though we knocked out 2 or 3 alleles, we found that TRPS1 negatively regulates its own expression (TRPS1 is the most significantly activated gene upon TRPS1 depletion as measured by PRO-seq after TRPS1 degradation at any time point). This is a possible explanation for why the protein levels are comparable to the progenitor despite having fewer than 4 functional alleles.
We used primers flanking the insertion site for PCR of genomic DNA from each of the three clones, along with the parental T47D cells. We performed TOPO cloning and picked colonies for Sanger sequencing. The predicted amplicon for the unedited parental cells is as follows:
TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT
The predicted amplicon for a perfectly tagged allele is an insertion directly after the start codon, where Cas9 cuts.
TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGAAAAAGCCTGAACTCACCGCGACGTCTGTCGAGAAGTTTCTGATCGAAAAGTTCGACAGCGTCTCCGACCTGATGCAGCTCTCGGAGGGCGAAGAATCTCGTGCTTTCAGCTTCGATGTAGGAGGGCGTGGATATGTCCTGCGGGTAAATAGCTGCGCCGATGGTTTCTACAAAGATCGTTATGTTTATCGGCACTTTGCATCGGCCGCGCTCCCGATTCCGGAAGTGCTTGACATTGGGGAATTCAGCGAGAGCCTGACCTATTGCATCTCCCGCCGTGCACAGGGTGTCACGTTGCAAGACCTGCCTGAAACCGAACTGCCCGCTGTTCTGCAGCCGGTCGCGGAGGCCATGGATGCGATCGCTGCGGCCGATCTTAGCCAGACGAGCGGGTTCGGCCCATTCGGACCGCAAGGAATCGGTCAATACACTACATGGCGTGATTTCATATGCGCGATTGCTGATCCCCATGTGTATCACTGGCAAACTGTGATGGACGACACCGTCAGTGCGTCCGTCGCGCAGGCTCTCGATGAGCTGATGCTTTGGGCCGAGGACTGCCCCGAAGTCCGGCACCTCGTGCACGCGGATTTCGGCTCCAACAATGTCCTGACGGACAATGGCCGCATAACAGCGGTCATTGACTGGAGCGAGGCGATGTTCGGGGATTCCCAATACGAGGTCGCCAACATCTTCTTCTGGAGGCCGTGGTTGGCTTGTATGGAGCAGCAGACGCGCTACTTCGAGCGGAGGCATCCGGAGCTTGCAGGATCGCCGCGGCTCCGGGCGTATATGCTCCGCATTGGTCTTGACCAACTCTATCAGAGCTTGGTTGACGGCAATTTCGATGATGCAGCTTGGGCGCAGGGTCGATGCGACGCAATCGTCCGATCCGGAGCCGGGACTGTCGGGCGTACACAAATCGCCCGCAGAAGCGCGGCCGTCTGGACCGATGGCTGTGTAGAAGTACTCGCCGATAGTGGAAACCGACGCCCCAGCACTCGTCCGGATCGGGAGATGGGGGAGGCTAACGGAAGCGGAGCTACTAACTTCAGCCTGCTGAAGCAGGCTGGAGACGTGGAGGAGAACCCTGGACCTTACCCCTACGACGTGCCCGACTACGCCGGCTATCCGTATGATGTCCCGGACTATGCAGGAAGCGGAGGAGTGCAGGTGGAAACCATCTCCCCAGGAGACGGGCGCACCTTCCCCAAGCGCGGCCAGACCTGCGTGGTGCACTACACCGGGATGCTTGAAGATGGAAAGAAAGTTGATTCCTCCCGGGACAGAAACAAGCCCTTTAAGTTTATGCTAGGCAAGCAGGAGGTGATCCGAGGCTGGGAAGAAGGGGTTGCCCAGATGAGTGTGGGTCAGAGAGCCAAACTGACTATATCTCCAGATTATGCCTATGGTGCCACTGGGCACCCAGGCATCATCCCACCACATGCCACTCTCGTCTTCGATGTGGAGCTTCTAAAACTGGAAAGATCTGGTGGATCTGGAGGTTCAGGTTCAGGTGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT
The first allele is a knockout. There is an 18-base deletion of (TTTATCTTTGCAGATATG), the last 3 bases of which are the start codon of TRPS1.
TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT
The second allele is a knockout. There is a 70-base insertion between the TRPS1 start codon and the second codon. The first 6 bases are GCGTCA, and the next are from the px458 (Cas9-expressing) plasmid, AGGCCACCTCGTCCACGATGTTGCCGAAGATGGGGTGCCGCTCGTGCTTCTTATCCTCTTCCAC.
TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGGCGTCAAGGCCACCTCGTCCACGATGTTGCCGAAGATGGGGTGCCGCTCGTGCTTCTTATCCTCTTCCACGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT
The third allele produces the goal dTAG-TRPS1 protein. The allele differs from the perfect product above, missing two T’s in a string of 13 T’s over 100 bases upstream from the start codon, and missing amino acids 2-77 from the Hygromycin resistance (HygR) gene.
TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGCTTGACATTGGGGAATTCAGCGAGAGCCTGACCTATTGCATCTCCCGCCGTGCACAGGGTGTCACGTTGCAAGACCTGCCTGAAACCGAACTGCCCGCTGTTCTGCAGCCGGTCGCGGAGGCCATGGATGCGATCGCTGCGGCCGATCTTAGCCAGACGAGCGGGTTCGGCCCATTCGGACCGCAAGGAATCGGTCAATACACTACATGGCGTGATTTCATATGCGCGATTGCTGATCCCCATGTGTATCACTGGCAAACTGTGATGGACGACACCGTCAGTGCGTCCGTCGCGCAGGCTCTCGATGAGCTGATGCTTTGGGCCGAGGACTGCCCCGAAGTCCGGCACCTCGTGCACGCGGATTTCGGCTCCAACAATGTCCTGACGGACAATGGCCGCATAACAGCGGTCATTGACTGGAGCGAGGCGATGTTCGGGGATTCCCAATACGAGGTCGCCAACATCTTCTTCTGGAGGCCGTGGTTGGCTTGTATGGAGCAGCAGACGCGCTACTTCGAGCGGAGGCATCCGGAGCTTGCAGGATCGCCGCGGCTCCGGGCGTATATGCTCCGCATTGGTCTTGACCAACTCTATCAGAGCTTGGTTGACGGCAATTTCGATGATGCAGCTTGGGCGCAGGGTCGATGCGACGCAATCGTCCGATCCGGAGCCGGGACTGTCGGGCGTACACAAATCGCCCGCAGAAGCGCGGCCGTCTGGACCGATGGCTGTGTAGAAGTACTCGCCGATAGTGGAAACCGACGCCCCAGCACTCGTCCGGATCGGGAGATGGGGGAGGCTAACGGAAGCGGAGCTACTAACTTCAGCCTGCTGAAGCAGGCTGGAGACGTGGAGGAGAACCCTGGACCTTACCCCTACGACGTGCCCGACTACGCCGGCTATCCGTATGATGTCCCGGACTATGCAGGAAGCGGAGGAGTGCAGGTGGAAACCATCTCCCCAGGAGACGGGCGCACCTTCCCCAAGCGCGGCCAGACCTGCGTGGTGCACTACACCGGGATGCTTGAAGATGGAAAGAAAGTTGATTCCTCCCGGGACAGAAACAAGCCCTTTAAGTTTATGCTAGGCAAGCAGGAGGTGATCCGAGGCTGGGAAGAAGGGGTTGCCCAGATGAGTGTGGGTCAGAGAGCCAAACTGACTATATCTCCAGATTATGCCTATGGTGCCACTGGGCACCCAGGCATCATCCCACCACATGCCACTCTCGTCTTCGATGTGGAGCTTCTAAAACTGGAAAGATCTGGTGGATCTGGAGGTTCAGGTTCAGGTGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT
We did not identify a fourth unique allele in this pool of cloned PCR products. We turned to our three-clone PRO-seq data to determine if there was a fourth unique allele. We first made FLASH
-merged .fastq
files from overlapping reads to extend our effective read length.
The script used in the next code chunk (231226_grep_prep.sh) is below:
#! /bin/bash
main=/scratch/$USER/TRPS1_PRO
cd $main
UMI_length=8
filename=$1
module load cutadapt/3.4 pigz/2.7; PATH=$HOME/bin:$PATH
name=`basename $filename _PE1.fastq.gz`
gunzip ${name}_PE1.fastq.gz
gunzip ${name}_PE2.fastq.gz
cutadapt -m $((UMI_length+2)) -O 1 -a TGGAATTCTCGGGTGCCAAGG ${name}_PE1.fastq \
-o ${name}_PE1_noadap.fastq --too-short-output ${name}_PE1_short.fastq > ${name}_PE1_cutadapt.txt
cutadapt -m $((UMI_length+10)) -O 1 -a GATCGTCGGACTGTAGAACTCTGAAC ${name}_PE2.fastq \
-o ${name}_PE2_noadap.fastq --too-short-output ${name}_PE2_short.fastq > ${name}_PE2_cutadapt.txt
seqtk seq -L $((UMI_length+10)) ${name}_PE1_noadap.fastq > ${name}_PE1_noadap_trimmed.fastq
fqdedup -i ${name}_PE1_noadap_trimmed.fastq -o ${name}_PE1_dedup.fastq
fastq_pair ${name}_PE1_dedup.fastq ${name}_PE2_noadap.fastq
flash -q ${name}_PE2_noadap.fastq.paired.fq ${name}_PE1_dedup.fastq.paired.fq -o ${name}
seqtk trimfq -e ${UMI_length} ${name}.extendedFrags.fastq > ${name}_for_grep.fastq
rm ${name}_*short.fastq ${name}_*noadap* ${name}_*dedup* ${name}.hist* ${name}.notCombined* \
${name}.extendedFrags.fastq
pigz ${name}_PE[1,2].fastq
Run the above script on each file in parallel.
ssh -Y ts2hx@rivanna.hpc.virginia.edu
ijob -p standard -A gioeli_lab -t 4:00:00 -N 1 -n 1 --cpus-per-task=1
chmod +x /scratch/$USER/scripts/231226_grep_prep.sh
main=/scratch/$USER/TRPS1_PRO
cd $main
for filename in *_PE1.fastq.gz
do
name=`basename $filename _PE1.fastq.gz`
sbatch -p standard -A gioeli_lab -t 4:00:00 --cpus-per-task=1 --job-name ${name}_grep_prep \
-o ${name}_grep_prep.log --wrap="sh /scratch/$USER/scripts/231226_grep_prep.sh $filename"
sleep 1 # wait 1 second between each job submission
done
Next we used grep
to search for a unique sequence specific to the desired product and not found in the above alleles. CTTTGCAGATATGAAAAAGC is the designed junction between the 5’UTR and the start of the HygR gene. This produced a single unique result, with a duplication of 33 bases (the 30 bases upstream from the start codon and including the start codon) directly after the start codon in the ideally tagged allele from above.
grep "CTTTGCAGATATGAAAAAGC" T47D_Clone28_*_for_grep.fastq
We search upstream from this junction to find reads that are specific to this allele. We can see as far as 73 bases of the 5’ UTR.
grep "CTTTATCTTTGCAGATATGT" T47D_Clone28_*_for_grep.fastq
We next search downstream from the insertion junction to find reads that are specific to this allele. We can see as far as Leucine 28 of HygR.
grep "GAAAAAGCCTGAACTCACCG" T47D_Clone28_*_for_grep.fastq
We look downstream from L28, which should still include only reads that are specific to this allele. We can see as far as Phenylalanine 57.
grep "TCTCCGACCTGATGCAGCTC" T47D_Clone28_*_for_grep.fastq
We look downstream from F57, which should still include only reads that are specific to this allele. We can see as far as Serine 84.
grep "ATAGCTGCGCCGATGGTTTC" T47D_Clone28_*_for_grep.fastq
We look downstream from Valine 77, which should still include only reads that are specific to this allele. We can see as far as Aspartic acid 102.
grep "GCTTGACATTGGGGAATTCA" T47D_Clone28_*_for_grep.fastq
So from our PRO-seq reads, we can produce the following continuous sequence specific to this allele:
ATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGAAAAAGCCTGAACTCACCGCGACGTCTGTCGAGAAGTTTCTGATCGAAAAGTTCGACAGCGTCTCCGACCTGATGCAGCTCTCGGAGGGCGAAGAATCTCGTGCTTTCAGCTTCGATGTAGGAGGGCGTGGATATGTCCTGCGGGTAAATAGCTGCGCCGATGGTTTCTACAAAGATCGTTATGTTTATCGGCACTTTGCATCGGCCGCGCTCCCGATTCCGGAAGTGCTTGACATTGGGGAATTCAGCGAGAGCCTGACCTATTGCATCTCCCGCCGTGCACAGGGTGTCACGTTGCAAGAC
We designed primers predicted to anneal to each strand of the duplication event that should be specific for this allele (forward TTTATCTTTGCAGATATGTATCT and reverse GGAAAAGGCAGATACATATC). We also designed primers flanking the designed insertion site, closer to the insertion site than the original primers to extend how far we can sequence into the insertion (forward CCTCCGCTTCCTGAAAGTGA and reverse TCCTCCCTTACTGGGGCTTT). We used these primers for PCR of genomic DNA from Clone 28 and the parental T47D cells as a control. The reverse clone-specific primer and the forward flanking primer produced a band in Clone 28 and not in the parental cells. We submitted this purified PCR product for Sanger sequencing with the primers used in the PCR. However, this product appears to have amplified the third allele from above, as we found the two missing T’s in the string of 13 T’s.
Using two different annealing temperatures, we could not use the forward clone-specific primer and the reverse flanking primer to generate a PCR product specifically in Clone 28 and not in the parental cells. We could attempt to further optimize this PCR. However,wWe realized this is the only allele that has the beginning of the HygR gene. We performed a PCR using the HDR template PCR primer, the 3’ end of which contains the beginning of the HygR gene, along with the reverse flanking primer. We submitted this purified PCR product for Sanger sequencing with the primers used in the PCR. From the 5’ end, we can see the entirety of the HygR gene and into the P2A site.
GCGACGTCTGTCGAGAAGTTTCTGATCGAAAAGTTCGACAGCGTCTCCGACCTGATGCAGCTCTCGGAGGGCGAAGAATCTCGTGCTTTCAGCTTCGATGTAGGAGGGCGTGGATATGTCCTGCGGGTAAATAGCTGCGCCGATGGTTTCTACAAAGATCGTTATGTTTATCGGCACTTTGCATCGGCCGCGCTCCCGATTCCGGAAGTGCTTGACATTGGGGAATTCAGCGAGAGCCTGACCTATTGCATCTCCCGCCGTGCACAGGGTGTCACGTTGCAAGACCTGCCTGAAACCGAACTGCCCGCTGTTCTGCAGCCGGTCGCGGAGGCCATGGATGCGATCGCTGCGGCCGATCTTAGCCAGACGAGCGGGTTCGGCCCATTCGGACCGCAAGGAATCGGTCAATACACTACATGGCGTGATTTCATATGCGCGATTGCTGATCCCCATGTGTATCACTGGCAAACTGTGATGGACGACACCGTCAGTGCGTCCGTCGCGCAGGCTCTCGATGAGCTGATGCTTTGGGCCGAGGACTGCCCCGAAGTCCGGCACCTCGTGCACGCGGATTTCGGCTCCAACAATGTCCTGACGGACAATGGCCGCATAACAGCGGTCATTGACTGGAGCGAGGCGATGTTCGGGGATTCCCAATACGAGGTCGCCAACATCTTCTTCTGGAGGCCGTGGTTGGCTTGTATGGAGCAGCAGACGCGCTACTTCGAGCGGAGGCATCCGGAGCTTGCAGGATCGCCGCGGCTCCGGGCGTATATGCTCCGCATTGGTCTTGACCAACTCTATCAGAGCTTGGTTGACGGCAATTTCGATGATGCAGCTTGGGCGCAGGGTCGATGCGACGCAATCGTCCGATCCGGAGCCGGGACTGTCGGGCGTACACAAATCGCCCGCAGAAGCGCGGCCGTCTGGACCGATGGCTGTGTAGAAGTACTCGCCGATAGTGGAAACCGACGCCCCAGCACTCGTCCGGATCGGGAGATGGGGGAGGCTAACGGAAGCGGAGCTACTAACTTCAGCCTGCTGAAGCAGGCTGGAGACGTGGAGGAGAACC
We are awaiting sequencing of this PCR product from the other direction. In future, allele-specific primers can be designed to anneal to either strand of the 5’ end of the HygR gene (corresponding to amino acids 2-77), which should only be found in this allele.
The first allele is a knockout. Reading from the 5’ end, we see the 5’UTR except missing the last 4bp, then TRPS1 missing the first 242bp (including the start codon).
TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT
The second allele is a knockout. Reading from the 5’ end, we see the 5’UTR except missing the last 7bp, then TRPS1 missing the start codon.
TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT
We are still looking for up to two more unique alleles. Searching the PRO-seq data, we do see the designed junction between the 5’ UTR and the start of the HygR gene.
grep "CTTTGCAGATATGAAAAAGC" T47D_Clone35_*_for_grep.fastq
Upstream from this, we can see two distinct sequences, one with an unknown sequence (GCCCCTGGAACG) upstream from the last 50 bases of the 5’ UTR, and one with the end of dTAG and the first 2 bases of the linker (CGATGTGGAGCTTCTAAAACTGGAAAG) upstream from the last 50 bases of the 5’ UTR. Notably, the dTAG would not be in frame with HygR, so this is probably a knockout allele (unless there is another dTAG downstream on this allele and translation initiates at the HygR start codon).
We search for each of these unique junctions. The first only produced the read we saw before. This may actually be mis-sequencing, as there are similarities in the sequence, the junction is exactly the same, and the quality scores for the different bases are lower. Thus we hypothesize that these sequences correspond to one unique allele.
grep "CCCTGGAACGGTAA" T47D_Clone35_*_for_grep.fastq
grep -B 1 -A 2 "CCCTGGAACGGTAA" T47D_Clone35_*_for_grep.fastq
We search around the junction to find reads that are specific to this allele. Upstream we see through glycine 86 of FKBP12. Downstream we see through 16 bases of HygR.
grep "AACTGGAAAGGTAACTTTCA" T47D_Clone35_*_for_grep.fastq
More upstream we see through one base beyond Serine 77 of FKBP12.
grep "AGCTTCTAAAACTGGAAAGG" T47D_Clone35_*_for_grep.fastq
More downstream the farthest we can see is the read above that extends through 16 bases of HygR.
grep "GGTAACTTTCAGATAACACT" T47D_Clone35_*_for_grep.fastq
Looking for a fourth unique allele, we search for the junction of the linker and TRPS1, we find it and can see back through Valine 98 of FKBP12 and forward through one base beyond Arginine 10 of TRPS1. This would be in frame, as long as there is translation of an upstream HygR gene that is correctly initiated.
grep "AGGTTCAGGTGTCCGGAAAA" T47D_Clone35_*_for_grep.fastq
We search downstream from the start of TRPS1, requiring the upstream sequence be the linker. Here we find reads indicating that, after 35 bases of TRPS1, the sequence switches to an unknown sequence (GGCCTCAGTGAGCGAGCGAGCGCGCAGCTGCCTGC).
grep "GGTGTCCGGAAAAAGAACCC" T47D_Clone35_*_for_grep.fastq
We search for this sequence but only find the reads above.
grep "GAAACGTTGCGGCCTCAGT" T47D_Clone35_*_for_grep.fastq
The first allele is a knockout. There is a single base deletion of the first G after the ATG start codon.
TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT
The second allele is a knockout that involves a weird rearrangement. From the 5’ end, beginning where the HDR primer starts (what is before this?), we see the start codon of TRPS1, then an unknown 14bp sequence (CCTTCGGAATCGGG), then an inverted 5’ end of TRPS1 (the first 233bp after (not including) the start codon), then an unknown 34bp sequence (TTAACTTTCAGATACAGTGTTATCTGAAAGTTAA), and then the 5’ end of TRPS1 as expected after the start codon.
GTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGCCTTCGGAATCGGGACTGCGCTTTTCAAGTCCTTCTTACTGCTAGAAGATGGATCTTGAACATGCAAGCTATGTTCCTCCTTATGATTTAGTTCTGCAGCATCACTCTGATCCGTATTTTCTGACATCTGATCTGCAGAAAATTCTTTGTTCTTTCCAGATACCTTGCTTTCTGTACCTATAGGCTCCAGGATCTGGCCCTCGCCTTCACTTGCAACGTTTCTCAGAGGGGGGTTCTTTTTCCGGACTTAACTTTCAGATACAGTGTTATCTGAAAGTTAAGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT
The third allele is a knockout. From the 5’ end, the 5’ UTR extends through 31 bases upstream from the start codon, then an unknown 50bp sequence (CAGGCTGATGACGTGGAGGAGATCCCTGGTCCTTACCCCTACGACGTGCC), then an inversion of (48bp of the 5’UTR (the wrong end of the HDR template) into the first 319bp of the HygR gene), then the 5’ end of TRPS1 as expected after the start codon.
TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGCAGGCTGATGACGTGGAGGAGATCCCTGGTCCTTACCCCTACGACGTGCCCGGTTTCAGGCAGGTCTTGCAACGTGACACCCTGTGCACGGCGGGAGATGCAATAGGTCAGGCTCTCGCTGAATTCCCCAATGTCAAGCACTTCCGGAATCGGGAGCGCGGCCGATGCAAAGTGCCGATAAACATAACGATCTTTGTAGAAACCATCGGCGCAGCTATTTACCCGCAGGACATATCCACGCCCTCCTACATCGAAGCTGAAAGCACGAGATTCTTCGCCCTCCGAGAGCTGCATCAGGTCGGAGACGCTGTCGAACTTTTCGATCAGAAACTTCTCGACAGACGTCGCGGTGAGTTCAGGCTTTTTCATATCTGCAAAGATAAAGGGAAAAGGCAGATACAGTGTTATCTGAAAGTTGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT
We are still looking for up to one more unique allele. Searching the PRO-seq data for the junction of the linker and TRPS1, we find it and can see back through one base beyond Histidine 87 of FKBP12 and forward through two bases beyond Lysine 29 of TRPS1. This would be in frame, as long as there is translation of an upstream HygR gene that is correctly initiated.
grep "AGGTTCAGGTGTCCGGAAAA" T47D_Clone39_*_for_grep.fastq
So from our PRO-seq reads, we can produce the following continuous sequence specific to this allele. In future, allele-specific primers can be designed to anneal to either strand of this sequence, which should only be found in this allele.
GCACCCAGGCATCATCCCACCACATGCCACTCTCGTCTTCGATGTGGAGCTTCTAAAACTGGAAAGATCTGGTGGATCTGGAGGTTCAGGTTCAGGTGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGT