Contents

1 Introduction

This is an analysis vignette accompanying the manuscript entitled “TRPS1 modulates chromatin accessibility to regulate estrogen receptor alpha (ER) binding and ER target gene expression in luminal breast cancer cells” (https://doi.org/10.1101/2023.07.03.547524). The reader should be able to follow these steps to download the raw data and reproduce all the results in the manuscript.

2 GWAS summary statistics processing

2.1 Make the file to upload to LocusZoom (Figure 1A)

Here we just focus on chromosome 8 (where TRPS1 is located) and a subset of columns to keep the file size manageable. I uploaded this file to LocusZoom to make the plot.

wget https://bcac.ccge.medschl.cam.ac.uk/files/icogs_onco_gwas_meta_overall_breast_cancer_summary_level_statistics.txt.zip
unzip icogs_onco_gwas_meta_overall_breast_cancer_summary_level_statistics.txt.zip
head -n 1 icogs_onco_gwas_meta_overall_breast_cancer_summary_level_statistics.txt > Breast_cancer_GWAS_PMID_32424353_chr8.txt
grep '^8_' icogs_onco_gwas_meta_overall_breast_cancer_summary_level_statistics.txt >> Breast_cancer_GWAS_PMID_32424353_chr8.txt
awk '{OFS = "\t"; print $9, $10, $41, $40, $46}' Breast_cancer_GWAS_PMID_32424353_chr8.txt > Breast_cancer_GWAS_PMID_32424353_chr8_subset.txt

3 Cancer Dependency Map data processing

3.1 Download the data

I downloaded the 22Q1 data release from the DepMap portal, with the TRPS1 and ESR1 CRISPR gene effect (Chronos) scores for all cell lines. I also downloaded the list of annotated luminal breast cancer cell lines from the DepMap portal as of March 13, 2022.

3.2 Plot the data (Figure 1B,C)

Set up in R. R code chunks will be in light blue, to distinguish from bash code chunks in grey.

setwd("~/Library/CloudStorage/Box-Box/GuertinLab/DepMap/")
library(tidyverse)
library(dplyr)
library(ggrepel)
library(ggsignif)
`%notin%` <- Negate(`%in%`)

Read in the data.

Chronos <- read_csv("Chronos.csv")
colnames(Chronos)[c(2,3)] <- c("ESR1", "TRPS1")
Luminal <- read_csv("cell lines in luminal.csv")
Chronos_luminal_last <- rbind.data.frame(Chronos[Chronos$`DepMap ID` %notin% Luminal$`Depmap Id`,],
                                         Chronos[Chronos$`DepMap ID` %in% Luminal$`Depmap Id`,])
cols = ifelse(Chronos_luminal_last$`DepMap ID` %in% Luminal$`Depmap Id`, "red", "black")

Plot.

pdf("Chronos_ESR1_vs_TRPS1_Luminal.pdf")
ggplot(Chronos_luminal_last, aes(x = ESR1, y = TRPS1, color = cols)) +
  geom_point() +
  scale_color_manual(values = c("black", "red")) +
  geom_text_repel(aes(label=ifelse(`DepMap ID` %in% Luminal$`Depmap Id`, 
                                   as.character(`Cell Line Name`),'')),
                  max.overlaps = nrow(Chronos_luminal_last)) + 
  xlab("ESR1 knockout score") +
  ylab("TRPS1 knockout score") +
  coord_fixed(ratio = 1, xlim = c(-1.2, 0.5), ylim = c(-1.2, 0.5)) +
  theme_bw() +
  theme(text = element_text(size = 20), legend.position="none")
dev.off()

pdf("Chronos_TRPS1_score_by_class.pdf")
ggplot(Chronos_luminal_last, aes(x = cols, y = TRPS1, color = cols)) +
  geom_violin() +
  geom_boxplot(width=.1) +
  #geom_signif(comparisons = list(c("black", "red")), map_signif_level=F)
  geom_hline(yintercept = 0, linetype="dashed") +
  scale_color_manual(values = c("black", "red")) +
  ylim(-1.2, 0.5) +
  ylab("TRPS1 knockout score") +
  xlab("Cell line class") +
  theme_bw() +
  theme(legend.position="none")
dev.off()

T-tests.

#0.002235
t.test(Chronos_luminal_last$TRPS1[cols == "red"],
       Chronos_luminal_last$TRPS1[cols == "black"])
#3.204e-05
wilcox.test(Chronos_luminal_last$TRPS1[cols == "red"],
       Chronos_luminal_last$TRPS1[cols == "black"])

4 ChIP-seq data pre-processing

4.1 Set up

Log in to the cluster to perform the processing for each sample in parallel.

ssh -Y ts2hx@rivanna.hpc.virginia.edu
ijob -A gioeli_lab -p standard --cpus-per-task=10 --time=8:00:00
module load gcc/11.4.0 sratoolkit/3.0.3
vdb-config -i #press x to exit

Make directories for the later scripts to write files.

title=ChIP
scripts=/scratch/ts2hx/scripts
main=/scratch/ts2hx/${title}/results

mkdir -p /scratch/ts2hx/${title}/logs
mkdir -p /scratch/ts2hx/${title}/raw_data
mkdir -p ${main}/fastqc
mkdir -p ${main}/bowtie2/controls
mkdir -p ${main}/macs2
mkdir -p ${main}/meme
mkdir -p ${main}/chipqc/meta
mkdir -p ${main}/chipqc/bams
mkdir -p ${main}/counts
mkdir -p ${main}/bigWigs

Get the required script for bigWig merging and move to a directory that will end up in the $PATH.

wget https://raw.githubusercontent.com/guertinlab/Nascent_RNA_Methods/main/normalize_bedGraph.py
chmod +x normalize_bedGraph.py
mv normalize_bedGraph.py $HOME/bin/

Get the ENCODE blacklist.

wget https://raw.githubusercontent.com/Boyle-Lab/Blacklist/master/lists/hg38-blacklist.v2.bed.gz -P $HOME/hg38/
gunzip $HOME/hg38/hg38-blacklist.v2.bed.gz
mv $HOME/hg38/hg38-blacklist.v2.bed $HOME/hg38/hg38.blacklist.bed

4.2 Download the raw data from both sequencing runs, rename the files, and combine the reads

Download the fastq’s and gzip.

cd /scratch/ts2hx/${title}/raw_data

for i in {676..761}
do
  sbatch -p standard -A gioeli_lab -t 1:00:00 -N 1 -n 1 --cpus-per-task=10 --job-name fasterq-dump \
    -o /scratch/ts2hx/${title}/logs/SRR25076${i}_fasterq_dump.log \
    --wrap="module load gcc/11.4.0 sratoolkit/3.0.3; fasterq-dump SRR25076${i}; gzip SRR25076${i}_*.fastq"
  sleep 1
done

Combine the reads and rename

cat SRR25076713_1.fastq.gz SRR25076725_1.fastq.gz > T47D_Clone28_IgG_DMSO_rep1_PE1.fastq.gz
cat SRR25076712_1.fastq.gz SRR25076726_1.fastq.gz > T47D_Clone28_IgG_DMSO_rep2_PE1.fastq.gz
cat SRR25076703_1.fastq.gz SRR25076711_1.fastq.gz > T47D_Clone28_IgG_DMSO_rep3_PE1.fastq.gz
cat SRR25076700_1.fastq.gz SRR25076701_1.fastq.gz > T47D_Clone28_IgG_DMSO_rep4_PE1.fastq.gz

cat SRR25076698_1.fastq.gz SRR25076699_1.fastq.gz > T47D_Clone28_IgG_dTAG_rep1_PE1.fastq.gz
cat SRR25076696_1.fastq.gz SRR25076697_1.fastq.gz > T47D_Clone28_IgG_dTAG_rep2_PE1.fastq.gz
cat SRR25076694_1.fastq.gz SRR25076695_1.fastq.gz > T47D_Clone28_IgG_dTAG_rep3_PE1.fastq.gz
cat SRR25076692_1.fastq.gz SRR25076693_1.fastq.gz > T47D_Clone28_IgG_dTAG_rep4_PE1.fastq.gz

cat SRR25076752_1.fastq.gz SRR25076753_1.fastq.gz > T47D_Clone28_ER_DMSO_rep1_PE1.fastq.gz
cat SRR25076754_1.fastq.gz SRR25076755_1.fastq.gz > T47D_Clone28_ER_DMSO_rep2_PE1.fastq.gz
cat SRR25076751_1.fastq.gz SRR25076756_1.fastq.gz > T47D_Clone28_ER_DMSO_rep3_PE1.fastq.gz
cat SRR25076750_1.fastq.gz SRR25076757_1.fastq.gz > T47D_Clone28_ER_DMSO_rep4_PE1.fastq.gz

cat SRR25076749_1.fastq.gz SRR25076758_1.fastq.gz > T47D_Clone28_ER_dTAG_rep1_PE1.fastq.gz
cat SRR25076748_1.fastq.gz SRR25076759_1.fastq.gz > T47D_Clone28_ER_dTAG_rep2_PE1.fastq.gz
cat SRR25076747_1.fastq.gz SRR25076760_1.fastq.gz > T47D_Clone28_ER_dTAG_rep3_PE1.fastq.gz
cat SRR25076738_1.fastq.gz SRR25076761_1.fastq.gz > T47D_Clone28_ER_dTAG_rep4_PE1.fastq.gz

cat SRR25076743_1.fastq.gz SRR25076744_1.fastq.gz SRR25076745_1.fastq.gz SRR25076746_1.fastq.gz > T47D_Clone28_HA_DMSO_rep1_PE1.fastq.gz
cat SRR25076739_1.fastq.gz SRR25076740_1.fastq.gz SRR25076741_1.fastq.gz SRR25076742_1.fastq.gz > T47D_Clone28_HA_DMSO_rep2_PE1.fastq.gz
cat SRR25076736_1.fastq.gz SRR25076737_1.fastq.gz > T47D_Clone28_HA_DMSO_rep3_PE1.fastq.gz
cat SRR25076734_1.fastq.gz SRR25076735_1.fastq.gz > T47D_Clone28_HA_DMSO_rep4_PE1.fastq.gz

cat SRR25076718_1.fastq.gz SRR25076719_1.fastq.gz SRR25076720_1.fastq.gz SRR25076721_1.fastq.gz > T47D_Clone28_HA_dTAG_rep1_PE1.fastq.gz
cat SRR25076716_1.fastq.gz SRR25076722_1.fastq.gz > T47D_Clone28_HA_dTAG_rep2_PE1.fastq.gz
cat SRR25076715_1.fastq.gz SRR25076723_1.fastq.gz > T47D_Clone28_HA_dTAG_rep3_PE1.fastq.gz
cat SRR25076714_1.fastq.gz SRR25076724_1.fastq.gz > T47D_Clone28_HA_dTAG_rep4_PE1.fastq.gz

cat SRR25076713_2.fastq.gz SRR25076725_2.fastq.gz > T47D_Clone28_IgG_DMSO_rep1_PE2.fastq.gz
cat SRR25076712_2.fastq.gz SRR25076726_2.fastq.gz > T47D_Clone28_IgG_DMSO_rep2_PE2.fastq.gz
cat SRR25076703_2.fastq.gz SRR25076711_2.fastq.gz > T47D_Clone28_IgG_DMSO_rep3_PE2.fastq.gz
cat SRR25076700_2.fastq.gz SRR25076701_2.fastq.gz > T47D_Clone28_IgG_DMSO_rep4_PE2.fastq.gz

cat SRR25076698_2.fastq.gz SRR25076699_2.fastq.gz > T47D_Clone28_IgG_dTAG_rep1_PE2.fastq.gz
cat SRR25076696_2.fastq.gz SRR25076697_2.fastq.gz > T47D_Clone28_IgG_dTAG_rep2_PE2.fastq.gz
cat SRR25076694_2.fastq.gz SRR25076695_2.fastq.gz > T47D_Clone28_IgG_dTAG_rep3_PE2.fastq.gz
cat SRR25076692_2.fastq.gz SRR25076693_2.fastq.gz > T47D_Clone28_IgG_dTAG_rep4_PE2.fastq.gz

cat SRR25076752_2.fastq.gz SRR25076753_2.fastq.gz > T47D_Clone28_ER_DMSO_rep1_PE2.fastq.gz
cat SRR25076754_2.fastq.gz SRR25076755_2.fastq.gz > T47D_Clone28_ER_DMSO_rep2_PE2.fastq.gz
cat SRR25076751_2.fastq.gz SRR25076756_2.fastq.gz > T47D_Clone28_ER_DMSO_rep3_PE2.fastq.gz
cat SRR25076750_2.fastq.gz SRR25076757_2.fastq.gz > T47D_Clone28_ER_DMSO_rep4_PE2.fastq.gz

cat SRR25076749_2.fastq.gz SRR25076758_2.fastq.gz > T47D_Clone28_ER_dTAG_rep1_PE2.fastq.gz
cat SRR25076748_2.fastq.gz SRR25076759_2.fastq.gz > T47D_Clone28_ER_dTAG_rep2_PE2.fastq.gz
cat SRR25076747_2.fastq.gz SRR25076760_2.fastq.gz > T47D_Clone28_ER_dTAG_rep3_PE2.fastq.gz
cat SRR25076738_2.fastq.gz SRR25076761_2.fastq.gz > T47D_Clone28_ER_dTAG_rep4_PE2.fastq.gz

cat SRR25076743_2.fastq.gz SRR25076744_2.fastq.gz SRR25076745_2.fastq.gz SRR25076746_2.fastq.gz > T47D_Clone28_HA_DMSO_rep1_PE2.fastq.gz
cat SRR25076739_2.fastq.gz SRR25076740_2.fastq.gz SRR25076741_2.fastq.gz SRR25076742_2.fastq.gz > T47D_Clone28_HA_DMSO_rep2_PE2.fastq.gz
cat SRR25076736_2.fastq.gz SRR25076737_2.fastq.gz > T47D_Clone28_HA_DMSO_rep3_PE2.fastq.gz
cat SRR25076734_2.fastq.gz SRR25076735_2.fastq.gz > T47D_Clone28_HA_DMSO_rep4_PE2.fastq.gz

cat SRR25076718_2.fastq.gz SRR25076719_2.fastq.gz SRR25076720_2.fastq.gz SRR25076721_2.fastq.gz > T47D_Clone28_HA_dTAG_rep1_PE2.fastq.gz
cat SRR25076716_2.fastq.gz SRR25076722_2.fastq.gz > T47D_Clone28_HA_dTAG_rep2_PE2.fastq.gz
cat SRR25076713_2.fastq.gz SRR25076723_2.fastq.gz > T47D_Clone28_HA_dTAG_rep3_PE2.fastq.gz
cat SRR25076714_2.fastq.gz SRR25076724_2.fastq.gz > T47D_Clone28_HA_dTAG_rep4_PE2.fastq.gz

4.3 Process each sample

The script used in the next code chunk (230322_chipseq_analysis_on_input_file.sh) is below:

#!/bin/bash/

# This script takes a gzip'd fastq file of ChIP-seq data, runs FastQC, removes adapters,
# and outputs a bam file for peak calling, a bed file for counting reads in peaks, 
# and a bigWig file for visualization.
# USAGE: sh chipseq_analysis_on_input_file.sh <path to the gzip'd PE1 file> <path to main results directory>

cores=$SLURM_CPUS_PER_TASK

#For fastq_pair
PATH=$HOME/bin:$PATH

#Change this to your main results directory
main=$2

#Change this to the location of the blacklist
blacklist=$HOME/hg38/hg38.blacklist.bed

# grab base of filename for naming outputs
name=`basename $1 _PE1.fastq.gz`
echo "Sample name is $name"

# grab the directory
dir=`dirname $1`
cd $dir

# directory with bowtie genome index
genome=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome 

#Set up more variables for additional directories to help clean up the results folder
fastqc_out=${main}/fastqc/
bowtie_results=${main}/bowtie2
chipqc_out=${main}/chipqc/bams
counts_out=${main}/counts
bigWigs_out=${main}/bigWigs

#Set up the software environment
module load fastqc/0.11.5
module load cutadapt/3.4
module load gcc/9.2.0 pigz/2.4
module load bowtie2/2.2.9 
module load samtools/1.12
module load wigtobigwig/2.8
module load bedtools/2.29.2

#Unzip
gunzip ${name}_PE1.fastq.gz
gunzip ${name}_PE2.fastq.gz

#Run FastQC and move output to the appropriate folder
fastqc ${name}_PE1.fastq
fastqc ${name}_PE2.fastq
mv ${dir}/${name}_PE*_fastqc.* $fastqc_out

#Count raw reads for complexity estimate
raw_reads=$(wc -l ${name}_PE1.fastq | awk '{print $1/4}')
echo "Raw reads: $raw_reads"

#Remove adapters
cutadapt -j $cores -m 10 -O 1 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA ${name}_PE1.fastq -o ${name}_PE1_noadap.fastq
cutadapt -j $cores -m 10 -O 1 -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT ${name}_PE2.fastq -o ${name}_PE2_noadap.fastq

#Pair reads
reads=$(wc -l ${name}_PE1_noadap.fastq | awk '{print $1/4}')
fastq_pair -t $reads ${name}_PE1_noadap.fastq ${name}_PE2_noadap.fastq

#Align to the genome and keep concordantly aligned, unique reads
#Also print statistics on duplication rate and estimated library complexity
cd $bowtie_results
bowtie2 -p $((cores-5)) --maxins 1000 -x $genome \
  -1 ${dir}/${name}_PE1_noadap.fastq.paired.fq -2 ${dir}/${name}_PE2_noadap.fastq.paired.fq | \
  samtools view -bS -f 0x2 - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | \
  samtools markdup -s -r - ${name}.bam
samtools index ${name}.bam

#Isolate PE1 for ChIPQC
samtools view -f 0x42 ${name}.bam -o ${chipqc_out}/${name}_ChIPQC.bam
samtools index ${chipqc_out}/${name}_ChIPQC.bam ${chipqc_out}/${name}_ChIPQC.bam.bai
  
#Convert bam to bed for counting
samtools sort -n ${name}.bam | bedtools bamtobed -i stdin -bedpe | \
  awk 'BEGIN { OFS = "\t"} { print $1, $2, $6, ".", "1" }' |
  grep -v "random" | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | grep -v "alt" | \
  intersectBed -v -a stdin -b $blacklist | sort -k1,1 -k2,2n > ${counts_out}/${name}_not_scaled.bed

#Count resultant reads for complexity estimate
resultant_reads=$(wc -l ${counts_out}/${name}_not_scaled.bed | awk '{print $1}')
echo "Resultant reads: $resultant_reads"

#Zip and delete files
cd $dir
pigz -p $cores ${name}_PE1.fastq
pigz -p $cores ${name}_PE2.fastq
rm ${name}_*noadap*

#Deeptools wasn't loading python/numpy before, but purging and reloading helped (this may be unnecessary now)
module purge
module load deeptools/3.5.1

#Convert bam to bigWig for coverage visualization
#We re-do this later with a larger bin size, so this is not necessary here (should change in future)
bamCoverage --bam ${bowtie_results}/${name}.bam -o ${bigWigs_out}/${name}_not_scaled.bw -p $cores \
  --binSize 1 --extendReads --centerReads

Run the above script on each sample in parallel.

chmod +x ${scripts}/230322_chipseq_analysis_on_input_file.sh
for fastq in *_HA*_PE1.fastq.gz *_ER_*_PE1.fastq.gz *_IgG_*_PE1.fastq.gz
do
  name=`basename $fastq _PE1.fastq.gz`
  fastq=/scratch/ts2hx/${title}/raw_data/${fastq}
  sbatch -p standard -A gioeli_lab -t 8:00:00 -N 1 -n 1 --cpus-per-task=10 --job-name chipseq-analysis \
    -o /scratch/ts2hx/${title}/logs/${name}_chipseq.log  \
    --wrap="sh ${scripts}/230322_chipseq_analysis_on_input_file.sh $fastq $main"
  sleep 1
done

Move IgG samples to subdirectory.

cd ${main}/bowtie2/
mv *IgG*.bam* controls/

4.4 Call peaks

The script used in the next code chunk (230308_macs2.sh) is below:

#!/bin/bash
#Get the variables from the arguments
samples=$1
control=$2
factor=$3
main=$4

blacklist=$HOME/hg38/hg38.blacklist.bed

module load macs2/2.2.7.1 gcc/7.1.0 openmpi/3.1.4 intel/18.0 intelmpi/18.0 R/4.1.1 bedtools/2.26.0

macs2 callpeak -t ${samples}_*.bam -c ${control}_*.bam -f BAMPE -g hs -n $factor --keep-dup all \
  --outdir ${main}/macs2 

#Remove noncanonical chromosomes and blacklisted regions, and make a window of 50bp on either side of the summit
grep -v "random" ${main}/macs2/${factor}_summits.bed | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | \
  intersectBed -v -a stdin -b $blacklist | slopBed -b 50 -i stdin -g $HOME/hg38/hg38.chrom.sizes > \
  ${main}/macs2/${factor}_summit_window.bed

Run the above script on each sample in parallel.

#Call peaks for each factor 
chmod +x ${scripts}/230308_macs2.sh
for bam in *_DMSO_rep1.bam
do
  name=`basename $bam _DMSO_rep1.bam`
  factor=$(echo $name | awk -F"_" '{print $3}')
  samples=T47D_Clone28_${factor}
  control=controls/T47D_Clone28_IgG
  sbatch -p largemem -A gioeli_lab -t 4:00:00 -N 1 -n 1 --mem-per-cpu=64000 --job-name macs2 \
    -o /scratch/ts2hx/${title}/logs/${factor}_macs2.log \
    --wrap="sh ${scripts}/230308_macs2.sh $samples $control $factor $main"
  sleep 1
done

Remove noncanonical chromosomes and blacklisted regions. Could have added this to the macs2 script when I did it for the summit window.

blacklist=$HOME/hg38/hg38.blacklist.bed
module load gcc/9.2.0 bedtools/2.29.2
for bam in *_DMSO_rep1.bam
do
  name=`basename $bam _DMSO_rep1.bam`
  factor=$(echo $name | awk -F"_" '{print $3}')
  grep -v "random" ${main}/macs2/${factor}_peaks.narrowPeak | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | \
    intersectBed -v -a stdin -b $blacklist > ${factor}_tmp.txt
  mv ${factor}_tmp.txt ${main}/macs2/${factor}_peaks.narrowPeak 
  grep -v "random" ${main}/macs2/${factor}_summits.bed | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | \
    intersectBed -v -a stdin -b $blacklist > ${factor}_tmp.txt
  mv ${factor}_tmp.txt ${main}/macs2/${factor}_summits.bed 
done

4.5 Count reads in peaks

Count reads that overlap any part of the full narrowPeak region.

cd ${main}/counts/
for bed in `ls *_not_scaled.bed | grep -v IgG`
do
  name=`basename $bed _not_scaled.bed`
  factor=$(echo $name | awk -F"_" '{print $3}')
  sbatch -p standard -A gioeli_lab -t 10:00 -N 1 -n 1 --job-name ${name}_mapBed \
    -o /scratch/ts2hx/${title}/logs/${name}_mapBed.log \
    --wrap="module load gcc/9.2.0 bedtools/2.29.2; \
      mapBed -null "0" -a ${main}/macs2/${factor}_peaks.narrowPeak -b $bed > ${name}_peak_counts.txt"
  sleep 1 # wait 1 second between each job submission
done

Just keep the last column (the counts). I should do this in the above jobs.

for counts in T47D_Clone28_*_peak_counts.txt
do
  name=`basename $counts _peak_counts.txt`
  awk '{print $NF}' ${name}_peak_counts.txt > ${name}_peak_counts_only.txt
  echo $name | cat - ${name}_peak_counts_only.txt > ${name}_peak_counts.txt
  rm ${name}_peak_counts_only.txt
done

Combine into 1 file per factor.

for peaks in ../macs2/*_peaks.narrowPeak
do
  factor=`basename $peaks _peaks.narrowPeak`
  awk '{OFS = "\t"} {print $1, $2, $3}' $peaks > peaks.bed
  echo -e "chr\tstart\tend" | cat - peaks.bed > peaks_header.bed
  paste -d'\t' peaks_header.bed T47D_Clone28_${factor}_*_peak_counts.txt > Combined_${factor}_peak_counts.txt
  rm peaks*
done

4.6 Call TRPS1 peaks with the dTAG samples as a control

The script used in the next code chunk (230322_macs2.sh) is below:

#!/bin/bash

samples=$1
control=$2
factor=$3
main=$4

blacklist=$HOME/hg38/hg38.blacklist.bed

module load macs2/2.2.7.1 gcc/7.1.0 openmpi/3.1.4 intel/18.0 intelmpi/18.0 R/4.1.1 bedtools/2.26.0

macs2 callpeak --call-summits -t ${samples}_*.bam -c ${control}_*.bam -f BAMPE -g hs -n $factor --keep-dup all \
  --outdir ${main}/macs2 
grep -v "random" ${main}/macs2/${factor}_peaks.narrowPeak | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | \
  grep -v "alt" | intersectBed -v -a stdin -b $blacklist > ${factor}_tmp.txt
mv ${factor}_tmp.txt ${main}/macs2/${factor}_peaks.narrowPeak 
grep -v "random" ${main}/macs2/${factor}_summits.bed | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | \
    grep -v "alt" | intersectBed -v -a stdin -b $blacklist > ${factor}_tmp.txt
mv ${factor}_tmp.txt ${main}/macs2/${factor}_summits.bed 
slopBed -b 50 -i ${main}/macs2/${factor}_summits.bed -g $HOME/hg38/hg38.chrom.sizes > \
  ${main}/macs2/${factor}_summit_window.bed

Run the above script on the HA samples.

cd ${main}/bowtie2/
factor=HA_DMSO_vs_dTAG
samples=T47D_Clone28_HA_DMSO
control=T47D_Clone28_HA_dTAG
sbatch -p largemem -A gioeli_lab -t 4:00:00 -N 1 -n 1 --mem-per-cpu=64000 --job-name macs2 \
  -o /scratch/ts2hx/${title}/logs/${factor}_macs2.log \
  --wrap="sh ${scripts}/230322_macs2.sh $samples $control $factor $main"

Count reads in HA peaks.

cd ${main}/counts/
#Just count read fragments that overlap the summit
peaks=${main}/macs2/${factor}_summits.bed
for bed in T47D_Clone28_HA_*_not_scaled.bed T47D_Clone28_IgG_*_not_scaled.bed
do
  name=`basename $bed _not_scaled.bed`
  echo $name
  name=${name}_${factor}
  mapBed -null "0" -a $peaks -b $bed > ${name}_peak_counts.txt
  awk '{print $NF}' ${name}_peak_counts.txt > ${name}_peak_counts_only.txt
  echo $name | cat - ${name}_peak_counts_only.txt > ${name}_peak_counts.txt
  rm ${name}_peak_counts_only.txt
done

echo -e "chr\tstart\tend\tname\tqvalue" | cat - $peaks | \
  paste -d'\t' - *${factor}_peak_counts.txt > Combined_${factor}_peak_counts.txt
rm T47D_Clone28_*${factor}_peak_counts.txt

4.7 Generate the ChIPQC object for plotting QC metrics

I ran the following interactively to make sure I had the right samplesheet before running ChIPQC.

module load gcc/9.2.0  openmpi/3.1.6 R/4.2.1
R
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
library(BiocManager)
if (!require("ChIPQC", quietly = TRUE))
  BiocManager::install("ChIPQC")
library(ChIPQC)
if (!require("TxDb.Hsapiens.UCSC.hg38.knownGene", quietly = TRUE))
  install.packages("TxDb.Hsapiens.UCSC.hg38.knownGene")
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

samplesheet <- data.frame(SampleID=character(),
                          Tissue=character(),
                          Factor=character(),
                          Condition=character(),
                          Replicate=integer(),
                          bamReads=character(),
                          ControlID=character(),
                          bamControl=character(),
                          Peaks=character(),
                          PeakCaller=character(),
                          stringsAsFactors=FALSE)

path = "/scratch/ts2hx/ChIP/results/bowtie2/"
samples <- dir(path, pattern = ".bam.bai")
samples

setwd("/scratch/ts2hx/ChIP/results/")
for (i in 1:length(samples)) {
  samplesheet[i,1] <- strsplit(strsplit(samples[i], split = ".bam")[[1]][1], split = "T47D_Clone28_")[[1]][2]
  samplesheet[i,2] <- strsplit(samples[i], split = "_")[[1]][1]
  samplesheet[i,3] <- strsplit(samples[i], split = "_")[[1]][3]
  samplesheet[i,4] <- strsplit(samples[i], split = "_")[[1]][4]
  samplesheet[i,5] <- strsplit(strsplit(samples[i], split = ".bam")[[1]][1], split = "_rep")[[1]][2]
  samplesheet[i,6] <- paste("chipqc/bams/", strsplit(samples[i], split = ".bam")[[1]][1], "_ChIPQC.bam", sep = "")
  samplesheet[i,7] <- paste("T47D_Clone28_IgG_", samplesheet[i,4], "_rep", samplesheet[i,5], sep = "")
  samplesheet[i,8] <- paste("chipqc/bams/T47D_Clone28_IgG_", samplesheet[i,4], "_rep", samplesheet[i,5], "_ChIPQC.bam", sep = "")
  samplesheet[i,9] <- paste("macs2/", samplesheet[i,3], "_peaks.narrowPeak", sep = "")
  samplesheet[i,10] <- "narrow"
}
head(samplesheet)
samplesheet[9:16,9] <- "macs2/HA_DMSO_vs_dTAG_peaks.narrowPeak"

write.csv(samplesheet, file = "/scratch/ts2hx/ChIP/results/chipqc/meta/samplesheet.csv", row.names = FALSE)
q()

The script used in the next code chunk (230313_ChIPQC.R) is below:

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
library(BiocManager)
if (!require("ChIPQC", quietly = TRUE))
  BiocManager::install("ChIPQC")
library(ChIPQC)
if (!require("TxDb.Hsapiens.UCSC.hg38.knownGene", quietly = TRUE))
  install.packages("TxDb.Hsapiens.UCSC.hg38.knownGene")
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
setwd("/scratch/ts2hx/ChIP/results/")
samples <- read.csv("/scratch/ts2hx/ChIP/results/chipqc/meta/samplesheet.csv")
register(SerialParam())
chipObj <- ChIPQC(samples, annotation="hg38")
save(chipObj, file = "/scratch/ts2hx/ChIP/results/chipqc/chipObj.R")
load("/scratch/ts2hx/ChIP/results/chipqc/chipObj.R")
setwd("/scratch/ts2hx/ChIP/results/chipqc")
ChIPQCreport(chipObj, reportName="ChIPQC", reportFolder="ChIPQCreport")

Run the above script to generate the ChIPQC object.

sbatch -p standard -A gioeli_lab -t 2:00:00 -N 1 -n 1 --job-name chipqc -o /scratch/ts2hx/${title}/logs/chipqc.log \
  --wrap="module load gcc/9.2.0  openmpi/3.1.6 R/4.2.1; Rscript ${scripts}/230313_ChIPQC.R"

4.8 Scale and merge bigWigs for use in the genome browser

First make size factors based on total mapped reads (divide each by the geometric mean of all of them like in DESeq2).

cd ${main}/bowtie2
module load samtools/1.12
for i in HA ER
do
  echo $i
  > ${i}_header.txt
  > ${i}_reads.txt
  for j in T47D_Clone28_${i}_*.bam
  do
    echo $j
    name=$(echo $j | awk -F".bam" '{print $1}')
    echo $name | paste ${i}_header.txt - > ${i}_tmp.txt 
    mv ${i}_tmp.txt ${i}_header.txt
    reads=`samtools view -c $j`
    echo $reads | paste ${i}_reads.txt - > ${i}_tmp.txt 
    mv ${i}_tmp.txt ${i}_reads.txt
  done  
  cat ${i}_header.txt ${i}_reads.txt > ${i}_tmp.txt
  mv ${i}_tmp.txt ${i}_reads.txt
  rm ${i}_header.txt
done 

Combine these into one file.

module load gcc/9.2.0  openmpi/3.1.6 R/4.2.1
R
library(DESeq2)
setwd("/scratch/ts2hx/ChIP/results/bowtie2/")
samples <- dir("/scratch/ts2hx/ChIP/results/bowtie2/", pattern = "_reads.txt")
for (i in samples) {
  x <- read.table(i, sep = '\t', header = TRUE)[,-1]
  factor = sapply(strsplit(colnames(x)[1], "_"), "[", 3)
  write.table(estimateSizeFactorsForMatrix(x), file = paste0("/scratch/ts2hx/ChIP/results/bigWigs/", factor, '_sizeFactor.txt'), quote =F, col.names=F)
}
q()

Generate the binned bigWigs.

for i in *.bam
do
  name=$(echo $i | awk -F".bam" '{print $1}')
  sbatch -p standard -A gioeli_lab -t 1:00:00 --cpus-per-task=10 --job-name bamCoverage -o ${name}_bamCoverage.log \
    --export=ALL,name=$name,main=$main --wrap="module load deeptools/3.5.1 samtools/1.12; samtools index ${name}.bam; \
    bamCoverage --bam ${name}.bam -o ${main}/bigWigs/${name}_binned_not_scaled.bw -p 10 --extendReads --centerReads;"
  sleep 1
done

The script used in the next code chunk (230329_ChIP_scale_bigWig_binned.sh) is below. It scales each bigWig based on the read depth.

#! /bin/bash
factor=$2
directory=$3

PATH=$HOME/bin:$PATH

cd $directory

module load ucsc-tools/3.7.4

name=$(echo $1 | awk -F"_binned_not_scaled.bw" '{print $1}')
invscale1=$(grep ${name} ${factor}_sizeFactor.txt | awk -F" " '{print $2}')
invscale=$(echo $invscale1 | bc)
scaletrue=$(bc <<< "scale=4 ; 1.0 / $invscale")
name=${name}_binned
echo $name
echo $scaletrue
echo file_to_scale
echo ${name}_not_scaled.bw
bigWigToBedGraph ${name}_not_scaled.bw ${name}.bg
echo normalizing
normalize_bedGraph.py -i ${name}.bg -s $scaletrue -o ${name}_scaled.bg
LC_COLLATE=C sort -k1,1 -k2,2n ${name}_scaled.bg > ${name}_scaled_sorted.bg
cat ${name}_scaled_sorted.bg | grep -v "random" | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM"> \
  ${name}_scaled.bg
bedGraphToBigWig ${name}_scaled.bg $HOME/hg38/hg38.chrom.sizes ${name}_scaled.bigWig
rm ${name}_scaled.bg
rm ${name}_scaled_sorted.bg
rm ${name}.bg

Run the above script on each sample in parallel.

cd ${main}/bigWigs
for i in *_binned_not_scaled.bw
do
  name=$(echo $i | awk -F"_binned_not_scaled.bw" '{print $1}')
  factor=$(echo $name | awk -F"_" '{print $3}')
  sbatch -p standard -A gioeli_lab -t 1:00:00 --cpus-per-task=10 --job-name scale_bigWig \
    -o ../../logs/${name}_bamCoverage.log --wrap="sh ${scripts}/230329_ChIP_scale_bigWig_binned.sh \
      ${name}_binned_not_scaled.bw $factor ${main}/bigWigs"
  sleep 1
done

The script used in the next code chunk (230329_ChIP_to_browser_binned.sh) is below. It merges the scaled bigWigs across replicates into one bigWig and one bedGraph per condition.

#! /bin/bash
directory=$2

PATH=$HOME/bin:$PATH

cd $directory

module load ucsc-tools/3.7.4

name=$(echo $1 | awk -F"T47D_Clone28_" '{print $2}' | awk -F"_rep1" '{print $1}')
echo $name
files=$(ls T47D_Clone28_${name}_rep*_binned_scaled.bigWig)
echo $files
echo "track type=bedGraph name=${name} alwaysZero=on visibility=full" > T47D_Clone28_${name}_binned_temp.txt
count=$(ls T47D_Clone28_${name}_rep*_binned_scaled.bigWig | wc -l | bc)
scaleall=$(bc <<< "scale=4 ; 1.0 / $count")
name=T47D_Clone28_${name}_binned
echo $count
echo $scaleall
bigWigMerge $files ${name}_tmp.bg
normalize_bedGraph.py -i ${name}_tmp.bg -s $scaleall -o ${name}_scaled.bg
LC_COLLATE=C sort -k1,1 -k2,2n ${name}_scaled.bg > ${name}_scaled_sorted.bg
cat ${name}_temp.txt ${name}_scaled_sorted.bg | \
  grep -v "random" | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | grep -v "alt" > \
  ${name}_scaled.bedGraph
rm ${name}_tmp.bg
rm ${name}_temp.txt
rm ${name}_scaled.bg
rm ${name}_scaled_sorted.bg
head ${name}_scaled.bedGraph
bedGraphToBigWig ${name}_scaled.bedGraph $HOME/hg38/hg38.chrom.sizes \
  ${name}_combined_normalized.bigWig
gzip ${name}_scaled.bedGraph

Run the above script on each condition in parallel.

for i in T47D_Clone28_*_rep1_binned_scaled.bigWig
do
  name=$(echo $i | awk -F"_rep1_scaled.bigWig" '{print $1}')
  sbatch -p largemem -A gioeli_lab -t 4:00:00 -N 1 -n 1 --job-name to_browser -o ../../logs/${name}_to_browser.log \
  --wrap="sh ${scripts}/230329_ChIP_to_browser_binned.sh $i ${main}/bigWigs"
  sleep 1
done

4.9 Leave the cluster environment, and copy files to your local computer

exit
exit

title=ChIP
main=/scratch/ts2hx/${title}/results
cd /Users/TScott/Library/CloudStorage/Box-Box/GuertinLab/${title}/results
scp -r ts2hx@rivanna.hpc.virginia.edu:${main}/fastqc/ .
scp -r ts2hx@rivanna.hpc.virginia.edu:${main}/macs2/*_summit* macs2/
scp -r ts2hx@rivanna.hpc.virginia.edu:${main}/macs2/*peaks.narrowPeak macs2/
scp -r ts2hx@rivanna.hpc.virginia.edu:${main}/counts/Combined* counts/
scp -r ts2hx@rivanna.hpc.virginia.edu:${main}/bigWigs/*_not_scaled.bw bigWigs/
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/ChIP/results/bigWigs/*_binned_scaled.bedGraph.gz bigWigs/
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/ChIP/results/bigWigs/*_binned_combined_normalized.bigWig bigWigs/
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/ChIP/results/bigWigs/*_sizeFactor.txt counts/
scp ts2hx@rivanna.hpc.virginia.edu:${main}/chipqc/chipObj.R chipqc

4.10 ChIPQC (Figures S1,2)

Here we plot the fraction of reads in peaks, as well as the peak-calling-independent metric of cross-strand correlation.

library(ChIPQC)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

setwd("/Users/TScott/Library/CloudStorage/Box-Box/GuertinLab/ChIP/results/chipqc/")
load("chipObj.R")

pdf("ChIP_CC.pdf")
plotCC(chipObj, facetBy = c("Factor", "Condition"))
dev.off()

plotFrip(chipObj)
frip = frip(chipObj)[1:16]
Factor = sapply(strsplit(names(frip), "_"), "[", 1)
Condition = sapply(strsplit(names(frip), "_"), "[", 2)
Replicate = sapply(strsplit(names(frip), "_"), "[", 3)
FRiP = as.vector(frip)*100
Sample = names(frip)

frip = cbind.data.frame(Sample, FRiP, Factor, Condition, Replicate)

pdf("ChIP_FRiP.pdf", height = 5)
ggplot(frip, aes(x = Sample, y = FRiP, fill = Factor, alpha = Condition)) +
  geom_bar(stat="identity") +
  scale_fill_manual(values = c("grey40", "darkmagenta")) +
  scale_alpha_manual(values = c(1, 0.5)) +
  ylab("FRiP (%)") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
dev.off()

5 TRPS1 ChIP-seq analysis

5.1 Heatmap of TRPS1 peak intensity in dTAG vs. DMSO (Figure 2C)

Set up.

ssh -Y ts2hx@rivanna.hpc.virginia.edu 
ijob -c 1 -A gioeli_lab -p standard --cpus-per-task=20 --time=12:00:00
main=/scratch/ts2hx/ChIP/results
cd $main
mkdir -p heatmaps
cd heatmaps

module load deeptools/3.5.1

Define functional TRPS1 peaks by MACS2 q-value < 10^(-30). I chose this based on making CDFs for TRPS1 proximity to activated genes using the PRO-seq data. More liberal q-values gave peaks that were promoter-proximal and not differentially close to activated genes compared to unchanged genes.

awk '$5 > 30' /scratch/ts2hx/ChIP/results/macs2/HA_DMSO_vs_dTAG_summits.bed > \
  Functional_TRPS1_summits.bed

Make the matrix (this can be used for a composite profile as well).

computeMatrix reference-point --referencePoint center -b 500 -a 500 -p 20 --missingDataAsZero \
  -R Functional_TRPS1_summits.bed \
  -S /scratch/ts2hx/ChIP/results/bigWigs/T47D_Clone28_HA_DMSO_combined_normalized.bigWig \
  /scratch/ts2hx/ChIP/results/bigWigs/T47D_Clone28_HA_dTAG_combined_normalized.bigWig \
  -o matrix_HA_ChIP_TRPS1_peaks.gz --outFileSortedRegions TRPS1_peaks_sorted_for_heatmap.bed

Make the heatmap.

plotHeatmap -m matrix_HA_ChIP_TRPS1_peaks.gz -out heatmap_HA_ChIP_TRPS1_peaks.pdf --heatmapHeight 14 \
   --regionsLabel "Functional TRPS1 peaks" --xAxisLabel "Distance from summit" \
   --samplesLabel "DMSO" "dTAG" --colorMap Purples --whatToShow "heatmap and colorbar" 

Leave the cluster environment, and copy files to your local computer.

exit
exit

cd ~/Library/CloudStorage/Box-Box/GuertinLab/ChIP/results
scp -r ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/ChIP/results/heatmaps .

5.2 MA plot of TRPS1 peak intensity in dTAG vs. DMSO (Figure 2D)

First get the complexity estimate file for size factors (total aligned reads).

title=ChIP
cd /Users/TScott/Library/CloudStorage/Box-Box/GuertinLab/${title}/
scp -r ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/${title}/logs/*chipseq.log logs/
scp -r ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/${title}/results/chipqc/chipObj.R results/chipqc/

cd logs
echo -e "name\traw_reads\taligned_reads\tunique_reads\testimated_size\tresultant_reads" > \
  230322_complexity_estimate.txt
for i in *_HA_*_chipseq.log *_ER_*_chipseq.log *_IgG_*_chipseq.log
do
  name=`basename $i _chipseq.log`
  raw_reads=`grep "Raw reads:" $i | awk '{print $3}'`
  aligned_reads=`grep "READ:" $i | awk '{print $2/2}'`
  unique_reads=`grep "WRITTEN:" $i | awk '{print $2/2}'`
  estimated_size=`grep "ESTIMATED_LIBRARY_SIZE:" $i | awk '{print $2}'`
  resultant_reads=`grep "Resultant reads:" $i | awk '{print $3}'`
  echo -e "$name\t$raw_reads\t$aligned_reads\t$unique_reads\t$estimated_size\t$resultant_reads" >> \
    230322_complexity_estimate.txt
done

Now use DESeq2 to identify differentially bound peaks (in this case all of them).

direc = '~/Library/CloudStorage/Box-Box/GuertinLab/ChIP/results/counts/'
setwd(direc)

library(DESeq2)
library(lattice)
library(viridis)
library(DEGreport)
library(lattice)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)

Get read depth.

complexity_estimate <- read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ChIP/logs/230322_complexity_estimate.txt", header = T)
name = sapply(strsplit(complexity_estimate$name, "Clone28_"), "[", 2)
complexity_estimate = t(complexity_estimate$unique_reads)
colnames(complexity_estimate) = name
complexity_estimate = complexity_estimate[,-grep("ER", colnames(complexity_estimate))]

Read in the counts table and change the column names.

x <- read.table("Combined_HA_DMSO_vs_dTAG_peak_counts.txt", sep = '\t', header = TRUE)
rownames(x) = paste0(x[,1], ":", x[,2], "-", x[,3])
peak_name = x[,4]
peak_qvalue = x[,5]
x = x[,-c(1:6)]
colnames(x) = sapply(strsplit(sapply(strsplit(colnames(x), "Clone28_"), "[", 2), "_HA_DMSO"), "[", 1)

Subtract the read depth normalized background signal.

background <- sweep(x[,grep("IgG", colnames(x))], 2, complexity_estimate[grep("IgG", names(complexity_estimate))], FUN = "/")
background = rowMeans(background)
background = cbind.data.frame(background, background, background, background, background, background, background, background)
background <- sweep(as.data.frame(background), 2, complexity_estimate[grep("HA", names(complexity_estimate))], FUN = "*")
background = round(background, digits = 0)

x = x[,grep("HA", colnames(x))] - background
x[x < 0] = 0

Run Deseq2.

treatment = factor(sapply(strsplit(colnames(x), '_'), '[', 2), levels = c("DMSO", "dTAG"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(treatment, rep), ~ treatment)
factors <- read.table("HA_sizeFactor.txt")
size_factors = factors$V2
names(size_factors) = factors$V1
sizeFactors(deseq.df) <- size_factors
deseq.df = DESeq(deseq.df)

Plot.

pdf("MA_HA_DMSO_vs_dTAG_subtract_IgG_qvalue_30.pdf")
res.deseq = results(deseq.df[peak_qvalue > 30,], contrast = c("treatment", "dTAG", "DMSO"))
save(res.deseq, file = "res.deseq.HA.q30.Rdata")
plotMA(res.deseq, ylim = c(-4,4))
dev.off()

6 ATAC-seq data pre-processing

6.1 Set up

Log in to the cluster to perform the processing for each sample in parallel.

ssh -Y ts2hx@rivanna.hpc.virginia.edu 
ijob -c 1 -A gioeli_lab -p standard --cpus-per-task=10 --time=8:00:00

Make the main directory (should make nice subdirectories like I did for ChIP).

main=/scratch/ts2hx/TRPS1_timecourse_ATAC
mkdir -p $main

Get the required script for bigWig merging and move to a directory that will end up in the $PATH.

wget https://raw.githubusercontent.com/guertinlab/Nascent_RNA_Methods/main/normalize_bedGraph.py
chmod +x normalize_bedGraph.py
mv normalize_bedGraph.py $HOME/bin/

Get the ENCODE blacklist.

wget https://raw.githubusercontent.com/Boyle-Lab/Blacklist/master/lists/hg38-blacklist.v2.bed.gz -P $HOME/hg38/
gunzip $HOME/hg38/hg38-blacklist.v2.bed.gz
mv $HOME/hg38/hg38-blacklist.v2.bed $HOME/hg38/hg38.blacklist.bed

Get chrM for pre-alignment.

module load gcc/9.2.0 bowtie2/2.2.9 
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz
tar -xzf hg38.chromFa.tar.gz
mkdir -p $HOME/hg38
mv chroms/chrM.fa $HOME/hg38
rm -r chroms
rm hg38.chromFa.tar.gz
ls $HOME/hg38
bowtie2-build $HOME/hg38/chrM.fa $HOME/hg38/chrM_hg38

Make tallymer and table for seqOutBias (this takes a while).

cd $HOME/seqOutBias/
seqOutBias seqtable /project/genomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome.fa --read-size=62

6.2 Download the raw data and rename the files

Download and gzip.

module load gcc/11.4.0 sratoolkit/3.0.3

cd $main

for i in {46..69}
do
  sbatch -p standard -A gioeli_lab -t 1:00:00 -N 1 -n 1 --cpus-per-task=10 --job-name fasterq-dump \
    -o /scratch/ts2hx/${title}/logs/SRR250739${i}_fasterq_dump.log \
    --wrap="module load gcc/11.4.0 sratoolkit/3.0.3; fasterq-dump SRR250739${i}; gzip SRR250739${i}_*.fastq"
  sleep 1
done

Rename.

mv SRR25073969_1.fastq.gz T47D_Clone28_0hour_rep1_PE1.fastq.gz
mv SRR25073969_2.fastq.gz T47D_Clone28_0hour_rep1_PE2.fastq.gz
mv SRR25073968_1.fastq.gz T47D_Clone28_0hour_rep2_PE1.fastq.gz
mv SRR25073968_2.fastq.gz T47D_Clone28_0hour_rep2_PE2.fastq.gz
mv SRR25073967_1.fastq.gz T47D_Clone28_0hour_rep3_PE1.fastq.gz
mv SRR25073967_2.fastq.gz T47D_Clone28_0hour_rep3_PE2.fastq.gz
mv SRR25073966_1.fastq.gz T47D_Clone28_0hour_rep4_PE1.fastq.gz
mv SRR25073966_2.fastq.gz T47D_Clone28_0hour_rep4_PE2.fastq.gz

mv SRR25073965_1.fastq.gz T47D_Clone28_0.5hour_rep1_PE1.fastq.gz
mv SRR25073965_2.fastq.gz T47D_Clone28_0.5hour_rep1_PE2.fastq.gz
mv SRR25073964_1.fastq.gz T47D_Clone28_0.5hour_rep2_PE1.fastq.gz
mv SRR25073964_2.fastq.gz T47D_Clone28_0.5hour_rep2_PE2.fastq.gz
mv SRR25073963_1.fastq.gz T47D_Clone28_0.5hour_rep3_PE1.fastq.gz
mv SRR25073963_2.fastq.gz T47D_Clone28_0.5hour_rep3_PE2.fastq.gz
mv SRR25073962_1.fastq.gz T47D_Clone28_0.5hour_rep4_PE1.fastq.gz
mv SRR25073962_2.fastq.gz T47D_Clone28_0.5hour_rep4_PE2.fastq.gz

mv SRR25073961_1.fastq.gz T47D_Clone28_1hour_rep1_PE1.fastq.gz
mv SRR25073961_2.fastq.gz T47D_Clone28_1hour_rep1_PE2.fastq.gz
mv SRR25073960_1.fastq.gz T47D_Clone28_1hour_rep2_PE1.fastq.gz
mv SRR25073960_2.fastq.gz T47D_Clone28_1hour_rep2_PE2.fastq.gz
mv SRR25073959_1.fastq.gz T47D_Clone28_1hour_rep3_PE1.fastq.gz
mv SRR25073959_2.fastq.gz T47D_Clone28_1hour_rep3_PE2.fastq.gz
mv SRR25073958_1.fastq.gz T47D_Clone28_1hour_rep4_PE1.fastq.gz
mv SRR25073958_2.fastq.gz T47D_Clone28_1hour_rep4_PE2.fastq.gz

mv SRR25073957_1.fastq.gz T47D_Clone28_2hour_rep1_PE1.fastq.gz
mv SRR25073957_2.fastq.gz T47D_Clone28_2hour_rep1_PE2.fastq.gz
mv SRR25073956_1.fastq.gz T47D_Clone28_2hour_rep2_PE1.fastq.gz
mv SRR25073956_2.fastq.gz T47D_Clone28_2hour_rep2_PE2.fastq.gz
mv SRR25073955_1.fastq.gz T47D_Clone28_2hour_rep3_PE1.fastq.gz
mv SRR25073955_2.fastq.gz T47D_Clone28_2hour_rep3_PE2.fastq.gz
mv SRR25073954_1.fastq.gz T47D_Clone28_2hour_rep4_PE1.fastq.gz
mv SRR25073954_2.fastq.gz T47D_Clone28_2hour_rep4_PE2.fastq.gz

mv SRR25073953_1.fastq.gz T47D_Clone28_4hour_rep1_PE1.fastq.gz
mv SRR25073953_2.fastq.gz T47D_Clone28_4hour_rep1_PE2.fastq.gz
mv SRR25073952_1.fastq.gz T47D_Clone28_4hour_rep2_PE1.fastq.gz
mv SRR25073952_2.fastq.gz T47D_Clone28_4hour_rep2_PE2.fastq.gz
mv SRR25073951_1.fastq.gz T47D_Clone28_4hour_rep3_PE1.fastq.gz
mv SRR25073951_2.fastq.gz T47D_Clone28_4hour_rep3_PE2.fastq.gz
mv SRR25073950_1.fastq.gz T47D_Clone28_4hour_rep4_PE1.fastq.gz
mv SRR25073950_2.fastq.gz T47D_Clone28_4hour_rep4_PE2.fastq.gz

mv SRR25073949_1.fastq.gz T47D_Clone28_24hour_rep1_PE1.fastq.gz
mv SRR25073949_2.fastq.gz T47D_Clone28_24hour_rep1_PE2.fastq.gz
mv SRR25073948_1.fastq.gz T47D_Clone28_24hour_rep2_PE1.fastq.gz
mv SRR25073948_2.fastq.gz T47D_Clone28_24hour_rep2_PE2.fastq.gz
mv SRR25073947_1.fastq.gz T47D_Clone28_24hour_rep3_PE1.fastq.gz
mv SRR25073947_2.fastq.gz T47D_Clone28_24hour_rep3_PE2.fastq.gz
mv SRR25073946_1.fastq.gz T47D_Clone28_24hour_rep4_PE1.fastq.gz
mv SRR25073946_2.fastq.gz T47D_Clone28_24hour_rep4_PE2.fastq.gz

6.3 Process each sample

The script used in the next code chunk (230417_ATACseq_analysis_on_input_file.sh) is below:

#! /bin/bash
directory=/scratch/$USER/TRPS1_timecourse_ATAC
read_size=62
cores=$SLURM_CPUS_PER_TASK
chrM=$HOME/hg38/chrM_hg38
genome_index=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome
tallymer=$HOME/seqOutBias/genome.tal_${read_size}.gtTxt.gz
table=$HOME/seqOutBias/genome_${read_size}.4.2.2.tbl

PATH=$HOME/bin:$PATH

cd $directory

module load cutadapt/3.4
module load gcc/9.2.0 pigz/2.4
module load bowtie2/2.2.9 
module load samtools/1.12
module load wigtobigwig/2.8
module load bedtools/2.29.2
module load deeptools/3.5.1

filename=$1
name=$(echo $filename | awk -F"_PE1.fastq.gz" '{print $1}')
echo $name
gunzip ${name}_PE1.fastq.gz
gunzip ${name}_PE2.fastq.gz
#Remove adapters (also trim PE1 reads to the same length as PE2 (that was unnecessary))
cutadapt -j $cores -l $read_size -m 10 -O 1 -a CTGTCTCTTATACACATCT ${name}_PE1.fastq -o ${name}_PE1_no_adapt.fastq
cutadapt -j $cores -m 10 -O 1 -a CTGTCTCTTATACACATCT ${name}_PE2.fastq -o ${name}_PE2_no_adapt.fastq
#Align to chrM first and remove aligned reads    
bowtie2 -p $((cores-2)) -x $chrM -U ${name}_PE1_no_adapt.fastq | \
    samtools sort -n - | samtools fastq -f 0x4 - > ${name}_PE1.chrM.fastq
#Pair reads
reads=$(wc -l ${name}_PE1.chrM.fastq | awk '{print $1/4}')
fastq_pair -t $reads ${name}_PE1.chrM.fastq ${name}_PE2_no_adapt.fastq
#Align to the hg38 genome and remove duplicates
bowtie2 -p $((cores-5)) --maxins 800 -x $genome_index \
  -1 ${name}_PE1.chrM.fastq.paired.fq -2 ${name}_PE2_no_adapt.fastq.paired.fq | \
  samtools view -bS - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | \
  samtools markdup -s -r - ${name}.hg38.bam
#Convert BAM to bed for counting (previously created the tallymer with `seqOutBias seqtable`)
seqOutBias scale $table ${name}.hg38.bam --no-scale --shift-counts --skip-bw \
  --read-size=$read_size --tallymer=$tallymer --bed=$name.bed
#Remove noncanonical chromosomes and sort for use with mapBed
grep -v "random" ${name}_not_scaled.bed | grep -v "chrUn" | grep -v "chrEBV" | grep -v "alt" | \
  sort -k1,1 -k2,2n > ${name}_tmp.txt && mv ${name}_tmp.txt ${name}_not_scaled.bed 
#Convert BAM to bigWig for visualization; extends the reads from PE1 to PE2; 
#uses a bin size of 10 for a smoother output; normalizes as bins per million, analagous to TPM for transcripts
bamCoverage --bam ${name}.hg38.bam -bs 10 -p 10 -e --normalizeUsing BPM -bl $HOME/hg38/hg38.blacklist.bed \
  -o ${name}_scaled.bigWig 
#Zip fastq's and delete intermediate files
pigz -p $cores ${name}_PE1.fastq
pigz -p $cores ${name}_PE2.fastq
rm ${name}_PE1_no_adapt.fastq
rm ${name}_PE2_no_adapt.fastq
rm ${name}_PE1.chrM.fastq
rm ${name}_PE1.chrM.fastq.paired.fq
rm ${name}_PE1.chrM.fastq.single.fq
rm ${name}_PE2_no_adapt.fastq.paired.fq
rm ${name}_PE2_no_adapt.fastq.single.fq

Run the above script on each sample in parallel.

chmod +x /scratch/ts2hx/scripts/230417_*.sh
for filename in *_PE1.fastq.gz
do
name=`basename $filename _PE1.fastq.gz`
sbatch -p standard -A gioeli_lab -t 1-00:00:00 -N 1 -n 1 --cpus-per-task=10 --job-name ${name}_atacseq_analysis \
  -o ${name}_atacseq.log  --wrap="sh /scratch/ts2hx/scripts/230417_ATACseq_analysis_on_input_file.sh $filename"
sleep 1 # wait 1 second between each job submission
done

#Should include this indexing step in the above script
for i in *.hg38.bam
do
  name=`basename $filename .hg38.bam`
  sbatch -p standard -A gioeli_lab -t 1:00:00 -N 1 -n 1 --cpus-per-task=10 --job-name ${name}_index \
    -o ${name}_index.log  --wrap="module load samtools/1.12; samtools index $i"
  sleep 1
done

6.4 Call peaks

This code uses all bam files, calls sub-peak summits, and makes each fragment 200bp centered on the original 5’ end of the read.

name=TRPS1_timecourse_ATAC
sbatch -p largemem -A gioeli_lab -t 1-00:00:00 -N 1 -n 1 --mem-per-cpu=64000 --job-name macs2 -o ${name}_macs2.log \
  --wrap="module load macs2/2.2.7.1; \
    macs2 callpeak --call-summits -t *.hg38.bam -n TRPS1_timecourse_ATAC -g hs -q 0.01 --keep-dup all \
    -f BAM --nomodel --shift -100 --extsize 200"

Remove noncanonical chromosomes and blacklisted regions.

blacklist=$HOME/hg38/hg38.blacklist.bed
module load gcc/9.2.0 bedtools/2.29.2
grep -v "random" ${name}_peaks.narrowPeak | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | \
  grep -v "alt" | intersectBed -v -a stdin -b $blacklist > ${name}_tmp.txt
mv ${name}_tmp.txt ${name}_peaks.narrowPeak 
grep -v "random" ${name}_summits.bed | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | \
    grep -v "alt" | intersectBed -v -a stdin -b $blacklist > ${name}_tmp.txt
mv ${name}_tmp.txt ${name}_summits.bed 

Make a window of 100bp on either side of the summit.

slopBed -b 100 -i ${name}_summits.bed -g $HOME/hg38/hg38.chrom.sizes > ${name}_summit_window.bed

6.5 Count reads in peaks

Count reads where the cut site is within these windows.

peaks=${name}_summit_window.bed
for bed in *_not_scaled.bed
do
  name=`basename $bed _not_scaled.bed`
  sbatch -p standard -A gioeli_lab -t 1:00:00 -N 1 -n 1 --job-name ${name}_mapBed -o ${name}_mapBed.log \
    --wrap="module load gcc/9.2.0 bedtools/2.29.2; mapBed -null '0' -a $peaks -b $bed > ${name}_peak_counts.txt;"
  sleep 1
done

Just keep the last column (the counts). I should do this in the above jobs.

for counts in T47D_Clone28_*_peak_counts.txt
do
  name=`basename $counts _peak_counts.txt`
  awk '{print $NF}' ${name}_peak_counts.txt > ${name}_peak_counts_only.txt
  echo $name | cat - ${name}_peak_counts_only.txt > ${name}_peak_counts.txt
  rm ${name}_peak_counts_only.txt
done

Combine into 1 file per factor.

echo -e "chr\tstart\tend\tname\tqvalue" | cat - $peaks | \
  paste -d'\t' - T47D_Clone28_*_peak_counts.txt > Combined_ATAC_peak_counts.txt
rm T47D_Clone28_*_peak_counts.txt

6.6 Make size factors for DESeq2 based on total mapped reads

Divide each by the geometric mean of all of them.

module load samtools/1.12
i=ATAC
> ${i}_header.txt
> ${i}_reads.txt
for j in T47D_Clone28_*.hg38.bam
do
  echo $j
  name=$(echo $j | awk -F".hg38.bam" '{print $1}')
  echo $name | paste ${i}_header.txt - > ${i}_tmp.txt 
  mv ${i}_tmp.txt ${i}_header.txt
  reads=`samtools view -c $j`
  echo $reads | paste ${i}_reads.txt - > ${i}_tmp.txt 
  mv ${i}_tmp.txt ${i}_reads.txt
done  
cat ${i}_header.txt ${i}_reads.txt > ${i}_tmp.txt
mv ${i}_tmp.txt ${i}_reads.txt
rm ${i}_header.txt

Combine these into one file.

module load gcc/9.2.0  openmpi/3.1.6 R/4.2.1
R
library(DESeq2)
setwd("/scratch/ts2hx/TRPS1_timecourse_ATAC/")
samples <- "ATAC_reads.txt"
x <- read.table(samples, sep = '\t', header = TRUE)[,-1]
write.table(estimateSizeFactorsForMatrix(x), file = paste0("/scratch/ts2hx/TRPS1_timecourse_ATAC/ATAC_sizeFactor.txt"), quote =F, col.names=F)
q()

6.7 Merge scaled bigWigs for use in the genome browser

The script used in the next code chunk (230417_ATAC_to_browser.sh) is below. It merges the scaled bigWigs across replicates into one bigWig and one bedGraph per condition.

#! /bin/bash
directory=/scratch/$USER/TRPS1_timecourse_ATAC

PATH=$HOME/bin:$PATH

cd $directory

module load ucsc-tools/3.7.4

name=$(echo $1 | awk -F"T47D_Clone28_" '{print $2}' | awk -F"_rep1" '{print $1}')
echo $name
files=$(ls T47D_Clone28_${name}_rep*_scaled.bigWig)
echo $files
touch T47D_Clone28_${name}_temp.txt
echo "track type=bedGraph name=${name} alwaysZero=on visibility=full" >> T47D_Clone28_${name}_temp.txt
count=$(ls T47D_Clone28_${name}_rep*_scaled.bigWig | wc -l | bc)
scaleall=$(bc <<< "scale=4 ; 1.0 / $count")
name=T47D_Clone28_${name}
echo $count
echo $scaleall
bigWigMerge $files ${name}_tmp.bg
normalize_bedGraph.py -i ${name}_tmp.bg -s $scaleall -o ${name}_scaled.bg
LC_COLLATE=C sort -k1,1 -k2,2n ${name}_scaled.bg > ${name}_scaled_sorted.bg
cat ${name}_temp.txt ${name}_scaled_sorted.bg > ${name}_scaled.bedGraph
rm ${name}_tmp.bg
rm ${name}_temp.txt
rm ${name}_scaled.bg
rm ${name}_scaled_sorted.bg
head ${name}_scaled.bedGraph
bedGraphToBigWig ${name}_scaled.bedGraph hg38.chrom.sizes ${name}_combined_normalized.bigWig
gzip -f ${name}_scaled.bedGraph

Run the above script on each sample in parallel.

for i in T47D_Clone28_*_rep1_scaled.bigWig
do
  name=$(echo $i | awk -F"_rep1_scaled.bigWig" '{print $1}')
  sbatch -p largemem -A gioeli_lab -t 4:00:00 -N 1 -n 1 --job-name to_browser -o ${name}_to_browser.log \
  --wrap="sh /scratch/ts2hx/scripts/230417_ATAC_to_browser.sh $i"
  sleep 1
done

6.8 Leave the cluster environment, and copy files to your local computer

exit
exit

#Copy files
cd ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_ATAC/Combined_ATAC_peak_counts.txt .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_ATAC/ATAC_sizeFactor.txt .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_ATAC/T47D_Clone28_*.hg38.bam* \
  /Volumes/External/Users/TScott/TRPS1_timecourse_ATAC_bams

6.9 ATACseqQC (Figures S3,4)

Here we plot the distribution of fragment lengths, as well as the enrichment of signal around transcription start sites.

direc = '/Volumes/External/Users/TScott/TRPS1_timecourse_ATAC_bams'
setwd(direc)

library(BiocManager)
BiocManager::install(c("ATACseqQC", "ChIPpeakAnno", "MotifDb", "GenomicAlignments",
                       "BSgenome.Hsapiens.UCSC.hg38", "TxDb.Hsapiens.UCSC.hg38.knownGene",
                       "phastCons100way.UCSC.hg38"))
library(ATACseqQC)
library(Rsamtools)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(GenomicAlignments)

#input the bamFiles
bamfile <- list.files(path = ".", pattern = "*.hg38.bam$", all.files = FALSE, full.names = FALSE, recursive = FALSE, ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
#change the order to reflect the time points
bamfile = bamfile[c(5:8, 1:4, 9:12, 17:24, 13:16)]
#change names to remove excess
bamfile.labels <- gsub(".hg38.bam", "", basename(bamfile))

#get TSS's
txs <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)

#get tags
possibleTag <- list("integer"=c("AM", "AS", "CM", "CP", "FI", "H0", "H1", "H2", 
                                "HI", "IH", "MQ", "NH", "NM", "OP", "PQ", "SM",
                                "TC", "UQ"), 
                    "character"=c("BC", "BQ", "BZ", "CB", "CC", "CO", "CQ", "CR",
                                  "CS", "CT", "CY", "E2", "FS", "LB", "MC", "MD",
                                  "MI", "OA", "OC", "OQ", "OX", "PG", "PT", "PU",
                                  "Q2", "QT", "QX", "R2", "RG", "RX", "SA", "TS",
                                  "U2"))

#get fragment size distributions (modified from ATACseqQC for easier plotting)
bamFiles = bamfile
bamFiles.labels = bamfile.labels
index = bamFiles
ylim = NULL
pe <- mapply(testPairedEndBam, bamFiles, index)
if (any(!pe)) {
  stop(paste(bamFiles[!pe], collapse = ", "), "is not Paired-End file.")
}
summaryFunction <- function(seqname, seqlength, bamFile, ind, ...) {
  param <- ScanBamParam(what = c("isize"), which = GRanges(seqname, IRanges(1, seqlength)), flag = scanBamFlag(isSecondaryAlignment = FALSE, isUnmappedQuery = FALSE, isNotPassingQualityControls = FALSE)) table(abs(unlist(sapply(scanBam(bamFile, index = ind, ..., param = param), `[[`, "isize"), use.names = FALSE)))
}
idxstats <- unique(do.call(rbind, mapply(function(.ele, .ind) idxstatsBam(.ele, index = .ind)[, c("seqnames", "seqlength")], bamFiles, index, SIMPLIFY = FALSE)))
idxstats <- idxstats[idxstats[, "seqnames"] != "*", , drop = FALSE]
seqnames <- as.character(idxstats[, "seqnames"])
seqlen <- as.numeric(idxstats[, "seqlength"])
seqlen <- checkMaxChrLength(seqlen)
fragment.len <- mapply(function(bamFile, ind) summaryFunction(seqname = seqnames, seqlength = seqlen, bamFile, ind), bamFiles, index, SIMPLIFY = FALSE)
names(fragment.len) <- bamFiles.labels
minor.ticks.axis <- function(ax, n = 9, t.ratio = 0.5, mn, mx, ...) {
  lims <- par("usr")
  lims <- if (ax %in% c(1, 3)) 
    lims[1:2]
  else lims[3:4]
  major.ticks <- pretty(lims, n = 5)
  if (missing(mn)) 
    mn <- min(major.ticks)
  if (missing(mx)) 
    mx <- max(major.ticks)
  major.ticks <- major.ticks[major.ticks >= mn & major.ticks <=  mx]
  labels <- sapply(major.ticks, function(i) as.expression(bquote(10^.(i))))
  axis(ax, at = major.ticks, labels = labels, las = ifelse(ax %in%  c(2, 4), 2, 1), ...)
  n <- n + 2
  minors <- log10(pretty(10^major.ticks[1:2], n)) - major.ticks[1]
  minors <- minors[-c(1, n)]
  minor.ticks = c(outer(minors, major.ticks, `+`))
  minor.ticks <- minor.ticks[minor.ticks > mn & minor.ticks < mx]
  axis(ax, at = minor.ticks, tcl = par("tcl") * t.ratio, labels = FALSE)
}

#save for future use
save(fragment.len, file = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/ATAC_fragment_lengths.RData")

#plot
pdf("~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/ATAC_fragSize.pdf", height = 10, width = 15)
par(mfcol=c(4,6))
null <- mapply(function(frag.len, frag.name) {
  x <- 1:1010
  frag.len <- frag.len[match(x, names(frag.len))]
  frag.len[is.na(frag.len)] <- 0
  y <- frag.len/sum(frag.len)
  y <- as.numeric(y)
  names(y) <- x
  plot(x, y * 10^3, main = frag.name, 
       xlim = c(0, 1010), ylim = c(0,13), xlab = "Fragment length (bp)", 
       ylab = expression(Normalized ~ read ~ density ~ x ~ 10^-3), type = "l")
}, fragment.len, sapply(strsplit(names(fragment.len), "T47D_Clone28_"), "[", 2))
dev.off()

#get TSS enrichments
#initialize
seqlev <- "chr1" ## subsample data for quick run
seqinformation <- seqinfo(TxDb.Hsapiens.UCSC.hg38.knownGene)
which <- as(seqinformation[seqlev], "GRanges")

#make an empty list to fill with each sample
TSSEs <- sapply(bamfile.labels,function(x) NULL)

for (i in 1:length(bamfile)) {
  
  print(bamfile[i])
  
  #get reads with tags
  bamTop100 <- scanBam(BamFile(bamfile[i], yieldSize = 100), 
                       param = ScanBamParam(tag=unlist(possibleTag)))[[1]]$tag
  tags <- names(bamTop100)[lengths(bamTop100)>0]
  tags
  
  #shift the coordinates of 5'ends of alignments in the bam file
  gal <- readBamFile(bamfile[i], tag=tags, which=which, asMates=TRUE, bigFile=TRUE)
  rm(gal1)
  gal1 <- shiftGAlignmentsList(gal)
  
  #calculate TSS enrichment and append to the list
  tsse <- TSSEscore(gal1, txs)
  TSSEs[[i]] = tsse$values
  save(TSSEs, file = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/ATAC_TSSEs.RData")
  print(tsse$values)
  
  #plot enrichment around TSS's
  plot(100*(-9:10-.5), tsse$values, type="b", 
       xlim=c(-1000, 1000),
       xlab="distance to TSS",
       ylab="aggregate TSS score",
       main=sapply(strsplit(sapply(strsplit(bamfile[i], "T47D_Clone28_"), "[", 2), ".hg38.bam"), "[", 1))
}

#save for future use
save(TSSEs, file = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/ATAC_TSSEs.RData")
load("~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/ATAC_TSSEs.RData")

#plot
pdf("~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/ATAC_TSS_enrichment.pdf", height = 10, width = 15)
par(mfcol=c(4,6))
for (i in 1:length(TSSEs)) {
plot(100*(-9:10-.5), TSSEs[[i]], type="b", 
     xlim=c(-1000, 1000),
     ylim=c(0,75),
     xlab="distance to TSS",
     ylab="aggregate TSS score",
     main=sapply(strsplit(names(TSSEs)[i], "T47D_Clone28_"), "[", 2))
}
dev.off()

7 ATAC-seq analysis

7.1 MA plot of chromatin accessibility at 30 minutes dTAG vs. DMSO (Figures 3A,S6 and Tables S1,2)

Set up libraries and functions in R.

direc = '~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/'
setwd(direc)

library(DESeq2)
library(lattice)
library(viridis)
library(DEGreport)
library(lattice)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)
library(MatchIt)

tighten_summit_window <- function(res.deseq, b = 100) {
  chr = sapply(strsplit(rownames(res.deseq), ':'), '[', 1)
  start = as.numeric(sapply(strsplit(sapply(strsplit(rownames(res.deseq), ':'), '[', 2), "-"), "[", 1)) + b
  end = as.numeric(sapply(strsplit(rownames(res.deseq), '-'), '[', 2)) - b
  df = cbind.data.frame(chr, start, end)
  return(df)
}

`%notin%` <- Negate(`%in%`)

Read in the counts table and change the column names.

x <- read.table("Combined_ATAC_peak_counts.txt", sep = '\t', header = TRUE)
rownames(x) = paste0(x[,1], ":", x[,2], "-", x[,3])
peak_name = x[,4]
peak_qvalue = x[,5]
x = x[,-c(1:5)]
colnames(x) = sapply(strsplit(colnames(x), "Clone28_"), "[", 2)

Run DESeq2.

time = factor(sapply(strsplit(colnames(x), '_'), '[', 1), levels = c("0hour", "0.5hour", "1hour", "2hour", "4hour", "24hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(time, rep), ~ rep + time)
factors <- read.table("ATAC_sizeFactor.txt")
size_factors = factors$V2
names(size_factors) = factors$V1
sizeFactors(deseq.df) <- size_factors
deseq.df = DESeq(deseq.df)
save(deseq.df, file = "deseq.df.ATAC.Rdata")
load("deseq.df.ATAC.Rdata")

Identify differentially accessible peaks.

i = "0.5hour"
res.deseq = results(deseq.df[peak_qvalue > 100,], contrast = c("time", i, "0hour"))
sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0) #472
sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0) #36

Match the unchanged peaks to activated peaks based on accessibility.

activated = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]
unchanged$treatment = 0
activated$treatment = 1
df.deseq.effects.lattice = rbind(unchanged, activated)
out = matchit(treatment ~ baseMean, data = df.deseq.effects.lattice, method = "optimal", ratio = 1)
unchanged = df.deseq.effects.lattice[rownames(df.deseq.effects.lattice) %in% out$match.matrix,]
repressed = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0,]

df.deseq.effects.lattice = res.deseq
df.deseq.effects.lattice$status = "Other"
df.deseq.effects.lattice$status[rownames(df.deseq.effects.lattice) %in% rownames(activated)] = "Increased"
df.deseq.effects.lattice$status[rownames(df.deseq.effects.lattice) %in% rownames(unchanged)] = "Unchanged"
df.deseq.effects.lattice$status[rownames(df.deseq.effects.lattice) %in% rownames(repressed)] = "Decreased"
df.deseq.effects.lattice$status = factor(df.deseq.effects.lattice$status, levels = c("Other", "Increased", "Unchanged", "Decreased"))

Plot.

pdf(paste0("MA_ATAC_q100_", i, "_for_figure.pdf"), width = 7, height = 3.5)
ggplot(as.data.frame(df.deseq.effects.lattice) %>% arrange(status), aes(x = log(baseMean, 10), y = log2FoldChange, color = status)) +
  geom_point() +
  scale_color_manual(values = c("grey90", "red", "grey60", "blue")) +
  geom_hline(yintercept = 0, linetype="dashed") +
  ylim(-1.2, 1.2) +
  xlim(1.5, 3.5) +
  ylab(expression("log"[2]~"(fold change in ATAC signal)")) +
  xlab(expression("log"[10]~"(mean of normalized intensity)")) +
  ggtitle("Change in chromatin accessibility upon TRPS1 depletion") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(color=NULL)
dev.off()

Write regions as tables.

increased = as.data.frame(activated)
increased$region = rownames(increased)
increased$chr = sapply(strsplit(increased$region, ":"), "[", 1)
increased$start = sapply(strsplit(sapply(strsplit(increased$region, ":"), "[", 2), "-"), "[", 1)
increased$end = sapply(strsplit(sapply(strsplit(increased$region, ":"), "[", 2), "-"), "[", 2)
write.table(increased[,c("chr", "start", "end")], file = "increased_ATAC_30min_summits.bed", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

decreased = as.data.frame(repressed)
decreased$region = rownames(decreased)
decreased$chr = sapply(strsplit(decreased$region, ":"), "[", 1)
decreased$start = sapply(strsplit(sapply(strsplit(decreased$region, ":"), "[", 2), "-"), "[", 1)
decreased$end = sapply(strsplit(sapply(strsplit(decreased$region, ":"), "[", 2), "-"), "[", 2)
write.table(decreased[,c("chr", "start", "end")], file = "decreased_ATAC_30min_summits.bed", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

Repeat above with ATAC-seq peaks that are distal to sites of bidirectional transcription. The dREG elements are generated below in this vignette in the PRO-seq data pre-processing section.

#Get dREG elements
dREG = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/dTAG_TRPS1_timecourse_dREG/dTAG_TRPS1_timecourse_dREG.dREG.peak.full.bed")

#Get distal peaks
res.deseq$chr = sapply(strsplit(rownames(res.deseq), ":"), "[", 1)
res.deseq$start = sapply(strsplit(sapply(strsplit(rownames(res.deseq), ":"), "[", 2), "-"), "[", 1)
res.deseq$end = sapply(strsplit(sapply(strsplit(rownames(res.deseq), ":"), "[", 2), "-"), "[", 2)
res.deseq = res.deseq[, c(7:9,1:6)]
res.deseq$start = as.numeric(res.deseq$start) + 100
res.deseq$end = as.numeric(res.deseq$end) - 100
dREG_distal_ATAC_peaks = bedTools.intersect(bed1 = res.deseq, bed2 = dREG, opt.string = "-v")
res.deseq = dREG_distal_ATAC_peaks

#Match the unchanged peaks to activated peaks based on accessibility.
activated = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]
unchanged$treatment = 0
activated$treatment = 1
df.deseq.effects.lattice = rbind(unchanged, activated)
out = matchit(treatment ~ baseMean, data = df.deseq.effects.lattice, method = "optimal", ratio = 1)
unchanged = df.deseq.effects.lattice[rownames(df.deseq.effects.lattice) %in% out$match.matrix,]
repressed = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0,]

df.deseq.effects.lattice = res.deseq
df.deseq.effects.lattice$status = "Other"
df.deseq.effects.lattice$status[rownames(df.deseq.effects.lattice) %in% rownames(activated)] = "Increased"
df.deseq.effects.lattice$status[rownames(df.deseq.effects.lattice) %in% rownames(unchanged)] = "Unchanged"
df.deseq.effects.lattice$status[rownames(df.deseq.effects.lattice) %in% rownames(repressed)] = "Decreased"
df.deseq.effects.lattice$status = factor(df.deseq.effects.lattice$status, levels = c("Other", "Increased", "Unchanged", "Decreased"))

#Plot
pdf("MA_dREG_distal_ATAC_peaks.pdf", width = 7, height = 3.5)
ggplot(as.data.frame(df.deseq.effects.lattice) %>% arrange(status), aes(x = log(baseMean, 10), y = log2FoldChange, color = status)) +
  geom_point() +
  scale_color_manual(values = c("grey90", "red", "grey60", "blue")) +
  geom_hline(yintercept = 0, linetype="dashed") +
  ylim(-1.2, 1.2) +
  xlim(1.5, 3.5) +
  ylab(expression("log"[2]~"(fold change in ATAC signal)")) +
  xlab(expression("log"[10]~"(mean of normalized intensity)")) +
  ggtitle("Change in chromatin accessibility upon TRPS1 depletion distal to dREG elements") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(color=NULL)
dev.off()

7.2 de novo identification of enriched motifs in increased or decreased peaks using MEME (Figure 3B)

For each time point, identify differentially accessible peaks and match based on accessibility. Should have called these “increased” and “decreased” for consistency.

for (i in levels(time)[-1]) {
  res.deseq = results(deseq.df[peak_qvalue > 100,], contrast = c("time", i, "0hour"))
  sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj))
  
  activated = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
  write.table(tighten_summit_window(activated, b = 100), file = paste0(i, '_activated_peaks.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
  
  unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]
  unchanged$treatment = 0
  activated$treatment = 1
  df.deseq.effects.lattice = rbind(unchanged, activated)
  out = matchit(treatment ~ baseMean, data = df.deseq.effects.lattice, method = "optimal", ratio = 1)
  unchanged = df.deseq.effects.lattice[rownames(df.deseq.effects.lattice) %in% out$match.matrix,]
  write.table(tighten_summit_window(unchanged, b = 100), file = paste0(i, '_activated_matched_peaks.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
  
  repressed = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0,]
  write.table(tighten_summit_window(repressed, b = 100), file = paste0(i, '_repressed_peaks.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
  
  unchanged = res.deseq[rownames(res.deseq) %notin% rownames(repressed) & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
  unchanged$treatment = 0
  repressed$treatment = 1
  df.deseq.effects.lattice = rbind(unchanged, repressed)
  out = matchit(treatment ~ baseMean, data = df.deseq.effects.lattice, method = "optimal", ratio = 1)
  unchanged = df.deseq.effects.lattice[rownames(df.deseq.effects.lattice) %in% out$match.matrix,]
  write.table(tighten_summit_window(unchanged, b = 100), file = paste0(i, '_repressed_matched_peaks.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
}

For each time point, run MEME on the increased and decreased peaks.

cd ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC
genome=~/Library/CloudStorage/Box-Box/GuertinLab/hg38/hg38.fa

for bed in *_activated_peaks.bed *_repressed_peaks.bed
do
  name=`basename $bed .bed`
  fastaFromBed -fi $genome -bed ${name}.bed -fo ${name}.fasta
  sites=`wc -l ${name}.fasta | awk '{print $1/2}'`
  if [ $sites -lt 100 ] 
  then
    minsites=2
  else
    minsites=`echo $sites | awk '{print $1/50}'`
  fi
  meme -oc ${name}_meme_output -nmotifs 1000 -objfun classic -csites 20000 -evt 0.01 \
    -searchsize 0 -minw 6 -maxw 18 -revcomp -dna -markov_order 3 -maxsize 100000000 -minsites $minsites \
    ${name}.fasta
  meme -oc ${name}_meme_short_output -nmotifs 1000 -objfun classic -csites 20000 -evt 0.01 \
    -searchsize 0 -minw 6 -maxw 8 -revcomp -dna -markov_order 3 -maxsize 100000000 -minsites $minsites \
    ${name}.fasta
done

7.3 Identify differentially enriched motifs in increased or decreased peaks relative to unchanged peaks using AME (Tables S3,4)

Report results at FDR 0.1 (1401 total in the above database).

cd ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/
motifs=~/Library/CloudStorage/Box-Box/GuertinLab/Motif_databases/tomtom_db/homer_uniprobe_jaspar_edited.txt
n_motifs=140

Make fasta files for each bed file.

genome=~/Library/CloudStorage/Box-Box/GuertinLab/hg38/hg38.fa
for bed in *hour_activated*_peaks.bed *hour_repressed*_peaks.bed
do
  name=`basename $bed .bed`
  echo $name
  fastaFromBed -fi $genome -bed ${name}.bed -fo ${name}.fasta
done

Run AME.

for fasta in *hour_activated_peaks.fasta *hour_repressed_peaks.fasta
do
  name=`basename $fasta _peaks.fasta`
  echo $name
  ame --verbose 1 --oc ${name}_ame --scoring avg --method fisher --hit-lo-fraction 0.25 \
    --evalue-report-threshold $n_motifs --control ${name}_matched_peaks.fasta ${name}_peaks.fasta $motifs
done

Configure the AME output for use in LaTex. Remove underscores, keep the columns we need, and convert to CSV.

cd ~/Library/CloudStorage/Box-Box/guertinlab/TRPS1_timecourse_ATAC/0.5hour_repressed_ame

cat ame.tsv | \
  awk 'BEGIN {FS="\t"} {print $1, $3, $5, $15, $17, $7}' | \
  awk '{sub(/_homer.*/,"-Homer",$2)} 1' | \
  awk '{sub(/_jaspar.*/,"-Jaspar",$2)} 1' | \
  awk '{sub(/_secondary_uniprobe.*/,"-Secondary-Uniprobe",$2)} 1' | \
  awk '{sub(/_primary_uniprobe.*/,"-Secondary-Uniprobe",$2)} 1' | \
  awk '{sub(/_uniprobe.*/,"-Uniprobe",$2)} 1' | \
  awk 'BEGIN { OFS="," } {$1=$1; print}' |
  awk '/^$/{exit}1' | sed 's/%/\percent/g' > \
  decreased_ame.csv  
  
cd ~/Library/CloudStorage/Box-Box/guertinlab/TRPS1_timecourse_ATAC/0.5hour_activated_ame

cat ame.tsv | \
  awk 'BEGIN {FS="\t"} {print $1, $3, $5, $15, $17, $7}' | \
  awk '{sub(/_homer.*/,"-Homer",$2)} 1' | \
  awk '{sub(/_jaspar.*/,"-Jaspar",$2)} 1' | \
  awk '{sub(/_secondary_uniprobe.*/,"-Secondary-Uniprobe",$2)} 1' | \
  awk '{sub(/_primary_uniprobe.*/,"-Secondary-Uniprobe",$2)} 1' | \
  awk '{sub(/_uniprobe.*/,"-Uniprobe",$2)} 1' | \
  awk 'BEGIN { OFS="," } {$1=$1; print}' |
  awk '/^$/{exit}1' | sed 's/%/\percent/g' |
  head -n 50 > \
  increased_ame.csv  

7.4 Count individual motif instances in dynamic peaks at 30 minutes (Figure 3C,D)

The script used in the next code chunk (220914_FIMO.sh) is below:

#! /bin/bash
directory=/scratch/$USER/TRPS1_timecourse_ATAC
genome=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome.fa

cd $directory
module load gcc/9.2.0 openmpi/3.1.6 meme/5.3.3 bedtools/2.29.2

fimo --thresh 0.001 --text $i $genome > ${name}_fimo.txt
wc -l ${name}_fimo.txt
score=$(tail -n +2 ${name}_fimo.txt | sort -nrk6,6 | awk 'FNR == 2000000 {print $6}')
echo $score
tail -n +2 ${name}_fimo.txt | awk -v sco="$score" '{ if ($6 >= sco) { print } }' | \
    awk '{OFS="\t";} {print $2,$3,$4,$7,$6,$5,$8}' > ${name}_2M.txt
Activated=`wc -l 24hour_activated_peaks.bed | awk '{print $1}'`
Activated_with=`intersectBed -u -a 24hour_activated_peaks.bed -b ${name}_2M.txt | wc -l | awk '{print $1}'`
Activated_without=$(echo "$Activated - $Activated_with" | bc)
Unchanged=`wc -l 24hour_unchanged_peaks.bed | awk '{print $1}'`
Unchanged_with=`intersectBed -u -a 24hour_unchanged_peaks.bed -b ${name}_2M.txt | wc -l | awk '{print $1}'`
Unchanged_without=$(echo "$Unchanged - $Unchanged_with" | bc)
Repressed=`wc -l 24hour_repressed_peaks.bed | awk '{print $1}'`
Repressed_with=`intersectBed -u -a 24hour_repressed_peaks.bed -b ${name}_2M.txt | wc -l | awk '{print $1}'`
Repressed_without=$(echo "$Repressed - $Repressed_with" | bc)
echo -e \
  "$name\t$Activated_with\t$Activated_without\t$Unchanged_with\t$Unchanged_without\t$Repressed_with\t$Repressed_without" > \
  ${name}_Motif_overlaps.txt

Find individual motif instances in the genome with FIMO and overlap with peak sets. I need to provide the motif files or indicate how to generate them.

ssh -Y ts2hx@rivanna.hpc.virginia.edu 
ijob -c 1 -A gioeli_lab -p standard --time=1-00:00:00
cd /scratch/ts2hx/TRPS1_timecourse_ATAC/

for i in individual_memes/*_meme.txt
do
  name=`basename $i _meme.txt`
  sbatch -p standard --export=ALL,i=$i,name=$name --output=${name}_FIMO.out /scratch/ts2hx/scripts/220914_FIMO.sh
  sleep 1
done

Combine into one file.

echo -e \
  "Motif\tActivated_with\tActivated_without\tUnchanged_with\tUnchanged_without\tRepressed_with\tRepressed_without" \
  > Motif_overlaps.txt
cat *_Motif_overlaps.txt >> Motif_overlaps.txt

exit
exit

cd ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_ATAC/Motif_overlaps.txt .

Set up in R.

direc = '~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/'
setwd(direc)

library(tidyverse)
library(tidyr)
library(lattice)
library(tidyr)

ImmediatePeaksWithMotif <- function(motif_overlaps, motif)
{
  result = motif_overlaps[rownames(motif_overlaps) == motif,]
  result = cbind(rbind(result[,3], result[,4]), rbind(result[,1], result[,2]), rbind(result[,7], result[,8]))
  colnames(result) <- c("Increased", "Unchanged", "Decreased")
  rownames(result) <- c("With motif", "Without motif")
  print(chisq.test(result))
  #Can remove this line if you want the actual counts
  result = 100*sweep(result, 2, colSums(result), "/")
  barplot(result, col = c("#F84C1E", "#232D4B"), main = motif, legend.text = TRUE, args.legend = list(x = "topright", bg = "white", cex = 1.5), cex.axis=1.5, cex.names=1.5)
  return(result)
}

Plot barcharts for a representative GATA motif and ER half site.

motif_overlaps = read.table("Overlaps.txt", header = T, row.names = 1)

pdf("ATAC_GATA5_motif_heatmap_immediate.pdf")
ImmediatePeaksWithMotif(motif_overlaps, motif = "GATA5_jaspar")
dev.off()

pdf("ATAC_NR2F2_motif_heatmap_immediate.pdf")
ImmediatePeaksWithMotif(motif_overlaps, motif = "NR2F2_jaspar")
dev.off()

7.5 Cluster dynamic peaks and plot the heatmap (Figure 3E)

Set up in R.

direc = '~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/'
setwd(direc)

library(DESeq2)
library(lattice)
library(viridis)
library(DEGreport)
library(lattice)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)
library(MatchIt)
library(EnhancedVolcano)
library(pheatmap)
library(dendsort)
library(RColorBrewer)

sort_hclust <- function(...) as.hclust(dendsort(as.dendrogram(...)))

Read in the counts table and change the column names. Here we focus on the early time points.

x <- read.table("Combined_ATAC_peak_counts.txt", sep = '\t', header = TRUE)
rownames(x) = paste0(x[,1], ":", x[,2], "-", x[,3])
peak_name = x[,4]
peak_qvalue = x[,5]
x = x[,-c(1:5)]
colnames(x) = sapply(strsplit(colnames(x), "Clone28_"), "[", 2)
x = x[,!grepl("24hour", colnames(x))]

Run DESeq2.

time = factor(sapply(strsplit(colnames(x), '_'), '[', 1), levels = c("0hour", "0.5hour", "1hour", "2hour", "4hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(time, rep), ~ rep + time)
factors <- read.table("ATAC_sizeFactor.txt")
size_factors = factors$V2
names(size_factors) = factors$V1
size_factors = size_factors[,!grepl("24hour", names(size_factors))]
sizeFactors(deseq.df) <- size_factors
deseq.df = DESeq(deseq.df)
save(deseq.df, file = "deseq.df.ATAC.no24.Rdata")
load("deseq.df.ATAC.no24.Rdata")

Run the likelihood ratio test to identify dynamic peaks over the time course.

dds.lrt = DESeq(deseq.df, test="LRT", reduced = ~ rep)
save(dds.lrt, file = "dds.lrt.ATAC.no24.Rdata")
load("dds.lrt.ATAC.no24.Rdata")
res.lrt = results(dds.lrt[peak_qvalue > 100,])
sum(res.lrt$padj < 0.1 & !is.na(res.lrt$padj))

Get the normalized counts.

wide_counts = counts(deseq.df[rownames(deseq.df) %in% rownames(res.lrt)[res.lrt$padj < .01 & !is.na(res.lrt$padj)],], normalized = TRUE)

Assign the colors.

anno <- data.frame(Time = as.factor(time))
rownames(anno) = colnames(wide_counts)
ann_colors = list(
  Time = c(`0hour` = viridis(5)[1], 
           `0.5hour` = viridis(5)[2],
           `1hour` = viridis(5)[3],
           `2hour` = viridis(5)[4],
           `4hour` = viridis(5)[5]))

paletteLength <- 50
myColor <- colorRampPalette(c("blue", "white", "red"))(paletteLength)
myBreaks <- c(seq(min(scale(t(wide_counts))), 0, length.out=ceiling(paletteLength/2) + 1), 
              seq(max(scale(t(wide_counts)))/paletteLength, max(scale(t(wide_counts))), length.out=floor(paletteLength/2)))

Move the order of columns to be in time order (legal rotating of the dendrogram).

cluster_cols = hclust(dist(scale(t(wide_counts))))
cluster_cols$order = cluster_cols$order[c(1:12, 17:20, 13:16)]

Plot.

pdf("ATAC_counts_heatmap_FDR_0.01.pdf", width = 8, height = 5)
pheatmap(wide_counts, scale = "row", show_rownames = F, show_colnames = F, 
         color = myColor, breaks=myBreaks,
         annotation_col = anno, annotation_colors = ann_colors,
         cluster_cols = cluster_cols)
dev.off()

7.6 Count GATA motif instances in dynamic peaks across time points (Figure 3F)

Set up in R.

direc = '~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/'
setwd(direc)

library(tidyverse)
library(tidyr)
library(lattice)
library(tidyr)

PeaksWithMotif <- function(motif_overlaps, motif)
{
  result = motif_overlaps[rownames(motif_overlaps) == motif,]
  times = c("0.5hour", "1hour", "2hour", "4hour")
  conditions = c("activated", "activated_matched", "repressed_matched", "repressed")
  df = as.data.frame(matrix(nrow = length(conditions), ncol = length(times)))
  colnames(df) = times
  rownames(df) = conditions
  for (i in 1:length(times)) {
    for (j in 1:length(conditions)) {
      df[j,i] = result[,paste0("X", times[i], "_", conditions[j], "_with")] / 
        (result[,paste0("X", times[i], "_", conditions[j], "_without")] +
           result[,paste0("X", times[i], "_", conditions[j], "_with")])
    }
  }
  df = df[-3,]
  rownames(df) = c("Increased", "Unchanged", "Decreased")
  colnames(df) = sapply(strsplit(colnames(df), "hour"), "[", 1)
  df$class = factor(rownames(df), levels = c("Increased", "Unchanged", "Decreased"))
  df_long <- gather(df, time, Proportion, -class, factor_key=TRUE)
  print(ggplot(df_long, aes(y = time, x = class, fill = Proportion)) + 
    geom_tile() +
    geom_text(aes(label = round(Proportion, 2))) +
    scale_fill_gradient(low="white", high="black", limits=c(0, 1)) +
    xlab(NULL) +
    ylab("Time (hours)")) +
    theme_bw() +
    theme(text = element_text(size = 20))
}

Plot.

motif_overlaps = read.table("Overlaps.txt", header = T, row.names = 1)
head(motif_overlaps)
pdf("ATAC_TRPS1_motif_heatmap_timecourse.pdf", height = 3.5)
PeaksWithMotif(motif_overlaps, motif = "TRPS1_homer_uniprobe_jaspar_edited.txt")
dev.off()

8 Time course PRO-seq data pre-processing

For the initial basic processing steps, we used a pipeline we have thoroughly described elsewhere (https://github.com/guertinlab/Nascent_RNA_Methods). Please see the above GitHub repository for detailed explanations of the steps, as well as software dependencies and preprocessing steps (e.g., gene annotation file downloads).

8.1 Set up

Log in to the cluster to perform the processing for each sample in parallel.

ssh -Y ts2hx@rivanna.hpc.virginia.edu 
ijob -c 1 -A gioeli_lab -p standard --time=1-00:00:00

Make a directory.

mkdir -p /scratch/ts2hx/TRPS1_timecourse_PRO_3

The script used in the next code chunk (220927_tallymer_generation.sh) is below:

#! /bin/bash

module load genometools/1.5.10
read_size=40
genome=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa

mkdir -p $HOME/seqOutBias
cd $HOME/seqOutBias
$HOME/bin/seqOutBias seqtable $genome --read-size=$read_size

Make tallymer and table for seqOutBias (this takes a while).

sbatch -p largemem -A gioeli_lab -t 1-00:00:00 --job-name tallymer -o tallymer.log \
    /scratch/ts2hx/scripts/220927_tallymer_generation.sh

8.2 Download the raw data from the three sequencing runs, rename the files, and combine the reads

We sequenced the libraries 3 times, first on a NextSeq2000, and the next two runs on a NextSeq550. We found that the NextSeq2000 gave us many duplicate reads, which we attribute to the patterned flow cell and our small fragment sizes for PRO-seq. We first download the fastq’s and gzip.

cd /scratch/ts2hx/TRPS1_timecourse_PRO_3/
mkdir -p logs/

for i in {24..92}
do
  sbatch -p standard -A gioeli_lab -t 1:00:00 -N 1 -n 1 --cpus-per-task=10 --job-name fasterq-dump \
    -o logs/SRR250747${i}_fasterq_dump.log \
    --wrap="module load gcc/11.4.0 sratoolkit/3.0.3; fasterq-dump SRR250747${i}; gzip SRR250747${i}_*.fastq"
  sleep 1
done

Combine the NextSeq550 reads to generate representative quality control metrics.

cat SRR25074724_1.fastq.gz SRR25074726_1.fastq.gz > T47D_Clone28_24hour_rep4_NextSeq550_PE1.fastq.gz
cat SRR25074727_1.fastq.gz SRR25074729_1.fastq.gz > T47D_Clone28_24hour_rep3_NextSeq550_PE1.fastq.gz
cat SRR25074730_1.fastq.gz SRR25074732_1.fastq.gz > T47D_Clone28_24hour_rep2_NextSeq550_PE1.fastq.gz

cat SRR25074733_1.fastq.gz SRR25074735_1.fastq.gz > T47D_Clone28_4hour_rep4_NextSeq550_PE1.fastq.gz
cat SRR25074736_1.fastq.gz SRR25074738_1.fastq.gz > T47D_Clone28_4hour_rep3_NextSeq550_PE1.fastq.gz
cat SRR25074739_1.fastq.gz SRR25074741_1.fastq.gz > T47D_Clone28_4hour_rep2_NextSeq550_PE1.fastq.gz
cat SRR25074742_1.fastq.gz SRR25074744_1.fastq.gz > T47D_Clone28_4hour_rep1_NextSeq550_PE1.fastq.gz

cat SRR25074745_1.fastq.gz SRR25074747_1.fastq.gz > T47D_Clone28_2hour_rep4_NextSeq550_PE1.fastq.gz
cat SRR25074748_1.fastq.gz SRR25074750_1.fastq.gz > T47D_Clone28_2hour_rep3_NextSeq550_PE1.fastq.gz
cat SRR25074751_1.fastq.gz SRR25074753_1.fastq.gz > T47D_Clone28_2hour_rep2_NextSeq550_PE1.fastq.gz
cat SRR25074754_1.fastq.gz SRR25074756_1.fastq.gz > T47D_Clone28_2hour_rep1_NextSeq550_PE1.fastq.gz

cat SRR25074757_1.fastq.gz SRR25074759_1.fastq.gz > T47D_Clone28_1hour_rep4_NextSeq550_PE1.fastq.gz
cat SRR25074760_1.fastq.gz SRR25074762_1.fastq.gz > T47D_Clone28_1hour_rep3_NextSeq550_PE1.fastq.gz
cat SRR25074763_1.fastq.gz SRR25074765_1.fastq.gz > T47D_Clone28_1hour_rep2_NextSeq550_PE1.fastq.gz
cat SRR25074766_1.fastq.gz SRR25074768_1.fastq.gz > T47D_Clone28_1hour_rep1_NextSeq550_PE1.fastq.gz

cat SRR25074769_1.fastq.gz SRR25074771_1.fastq.gz > T47D_Clone28_0.5hour_rep4_NextSeq550_PE1.fastq.gz
cat SRR25074772_1.fastq.gz SRR25074774_1.fastq.gz > T47D_Clone28_0.5hour_rep3_NextSeq550_PE1.fastq.gz
cat SRR25074775_1.fastq.gz SRR25074777_1.fastq.gz > T47D_Clone28_0.5hour_rep2_NextSeq550_PE1.fastq.gz
cat SRR25074778_1.fastq.gz SRR25074780_1.fastq.gz > T47D_Clone28_0.5hour_rep1_NextSeq550_PE1.fastq.gz

cat SRR25074781_1.fastq.gz SRR25074783_1.fastq.gz > T47D_Clone28_0hour_rep4_NextSeq550_PE1.fastq.gz
cat SRR25074784_1.fastq.gz SRR25074786_1.fastq.gz > T47D_Clone28_0hour_rep3_NextSeq550_PE1.fastq.gz
cat SRR25074787_1.fastq.gz SRR25074789_1.fastq.gz > T47D_Clone28_0hour_rep2_NextSeq550_PE1.fastq.gz
cat SRR25074790_1.fastq.gz SRR25074792_1.fastq.gz > T47D_Clone28_0hour_rep1_NextSeq550_PE1.fastq.gz


cat SRR25074724_2.fastq.gz SRR25074726_2.fastq.gz > T47D_Clone28_24hour_rep4_NextSeq550_PE2.fastq.gz
cat SRR25074727_2.fastq.gz SRR25074729_2.fastq.gz > T47D_Clone28_24hour_rep3_NextSeq550_PE2.fastq.gz
cat SRR25074730_2.fastq.gz SRR25074732_2.fastq.gz > T47D_Clone28_24hour_rep2_NextSeq550_PE2.fastq.gz

cat SRR25074733_2.fastq.gz SRR25074735_2.fastq.gz > T47D_Clone28_4hour_rep4_NextSeq550_PE2.fastq.gz
cat SRR25074736_2.fastq.gz SRR25074738_2.fastq.gz > T47D_Clone28_4hour_rep3_NextSeq550_PE2.fastq.gz
cat SRR25074739_2.fastq.gz SRR25074741_2.fastq.gz > T47D_Clone28_4hour_rep2_NextSeq550_PE2.fastq.gz
cat SRR25074742_2.fastq.gz SRR25074744_2.fastq.gz > T47D_Clone28_4hour_rep1_NextSeq550_PE2.fastq.gz

cat SRR25074745_2.fastq.gz SRR25074747_2.fastq.gz > T47D_Clone28_2hour_rep4_NextSeq550_PE2.fastq.gz
cat SRR25074748_2.fastq.gz SRR25074750_2.fastq.gz > T47D_Clone28_2hour_rep3_NextSeq550_PE2.fastq.gz
cat SRR25074751_2.fastq.gz SRR25074753_2.fastq.gz > T47D_Clone28_2hour_rep2_NextSeq550_PE2.fastq.gz
cat SRR25074754_2.fastq.gz SRR25074756_2.fastq.gz > T47D_Clone28_2hour_rep1_NextSeq550_PE2.fastq.gz

cat SRR25074757_2.fastq.gz SRR25074759_2.fastq.gz > T47D_Clone28_1hour_rep4_NextSeq550_PE2.fastq.gz
cat SRR25074760_2.fastq.gz SRR25074762_2.fastq.gz > T47D_Clone28_1hour_rep3_NextSeq550_PE2.fastq.gz
cat SRR25074763_2.fastq.gz SRR25074765_2.fastq.gz > T47D_Clone28_1hour_rep2_NextSeq550_PE2.fastq.gz
cat SRR25074766_2.fastq.gz SRR25074768_2.fastq.gz > T47D_Clone28_1hour_rep1_NextSeq550_PE2.fastq.gz

cat SRR25074769_2.fastq.gz SRR25074771_2.fastq.gz > T47D_Clone28_0.5hour_rep4_NextSeq550_PE2.fastq.gz
cat SRR25074772_2.fastq.gz SRR25074774_2.fastq.gz > T47D_Clone28_0.5hour_rep3_NextSeq550_PE2.fastq.gz
cat SRR25074775_2.fastq.gz SRR25074777_2.fastq.gz > T47D_Clone28_0.5hour_rep2_NextSeq550_PE2.fastq.gz
cat SRR25074778_2.fastq.gz SRR25074780_2.fastq.gz > T47D_Clone28_0.5hour_rep1_NextSeq550_PE2.fastq.gz

cat SRR25074781_2.fastq.gz SRR25074783_2.fastq.gz > T47D_Clone28_0hour_rep4_NextSeq550_PE2.fastq.gz
cat SRR25074784_2.fastq.gz SRR25074786_2.fastq.gz > T47D_Clone28_0hour_rep3_NextSeq550_PE2.fastq.gz
cat SRR25074787_2.fastq.gz SRR25074789_2.fastq.gz > T47D_Clone28_0hour_rep2_NextSeq550_PE2.fastq.gz
cat SRR25074790_2.fastq.gz SRR25074792_2.fastq.gz > T47D_Clone28_0hour_rep1_NextSeq550_PE2.fastq.gz

Combine all reads for use in downstream analysis.

cat SRR25074724_1.fastq.gz SRR25074725_1.fastq.gz SRR25074726_1.fastq.gz > T47D_Clone28_24hour_rep4_combined_PE1.fastq.gz
cat SRR25074727_1.fastq.gz SRR25074728_1.fastq.gz SRR25074729_1.fastq.gz > T47D_Clone28_24hour_rep3_combined_PE1.fastq.gz
cat SRR25074730_1.fastq.gz SRR25074731_1.fastq.gz SRR25074732_1.fastq.gz > T47D_Clone28_24hour_rep2_combined_PE1.fastq.gz

cat SRR25074733_1.fastq.gz SRR25074734_1.fastq.gz SRR25074735_1.fastq.gz > T47D_Clone28_4hour_rep4_combined_PE1.fastq.gz
cat SRR25074736_1.fastq.gz SRR25074737_1.fastq.gz SRR25074738_1.fastq.gz > T47D_Clone28_4hour_rep3_combined_PE1.fastq.gz
cat SRR25074739_1.fastq.gz SRR25074740_1.fastq.gz SRR25074741_1.fastq.gz > T47D_Clone28_4hour_rep2_combined_PE1.fastq.gz
cat SRR25074742_1.fastq.gz SRR25074743_1.fastq.gz SRR25074744_1.fastq.gz > T47D_Clone28_4hour_rep1_combined_PE1.fastq.gz

cat SRR25074745_1.fastq.gz SRR25074746_1.fastq.gz SRR25074747_1.fastq.gz > T47D_Clone28_2hour_rep4_combined_PE1.fastq.gz
cat SRR25074748_1.fastq.gz SRR25074749_1.fastq.gz SRR25074750_1.fastq.gz > T47D_Clone28_2hour_rep3_combined_PE1.fastq.gz
cat SRR25074751_1.fastq.gz SRR25074752_1.fastq.gz SRR25074753_1.fastq.gz > T47D_Clone28_2hour_rep2_combined_PE1.fastq.gz
cat SRR25074754_1.fastq.gz SRR25074755_1.fastq.gz SRR25074756_1.fastq.gz > T47D_Clone28_2hour_rep1_combined_PE1.fastq.gz

cat SRR25074757_1.fastq.gz SRR25074758_1.fastq.gz SRR25074759_1.fastq.gz > T47D_Clone28_1hour_rep4_combined_PE1.fastq.gz
cat SRR25074760_1.fastq.gz SRR25074761_1.fastq.gz SRR25074762_1.fastq.gz > T47D_Clone28_1hour_rep3_combined_PE1.fastq.gz
cat SRR25074763_1.fastq.gz SRR25074764_1.fastq.gz SRR25074765_1.fastq.gz > T47D_Clone28_1hour_rep2_combined_PE1.fastq.gz
cat SRR25074766_1.fastq.gz SRR25074767_1.fastq.gz SRR25074768_1.fastq.gz > T47D_Clone28_1hour_rep1_combined_PE1.fastq.gz

cat SRR25074769_1.fastq.gz SRR25074770_1.fastq.gz SRR25074771_1.fastq.gz > T47D_Clone28_0.5hour_rep4_combined_PE1.fastq.gz
cat SRR25074772_1.fastq.gz SRR25074773_1.fastq.gz SRR25074774_1.fastq.gz > T47D_Clone28_0.5hour_rep3_combined_PE1.fastq.gz
cat SRR25074775_1.fastq.gz SRR25074776_1.fastq.gz SRR25074777_1.fastq.gz > T47D_Clone28_0.5hour_rep2_combined_PE1.fastq.gz
cat SRR25074778_1.fastq.gz SRR25074779_1.fastq.gz SRR25074780_1.fastq.gz > T47D_Clone28_0.5hour_rep1_combined_PE1.fastq.gz

cat SRR25074781_1.fastq.gz SRR25074782_1.fastq.gz SRR25074783_1.fastq.gz > T47D_Clone28_0hour_rep4_combined_PE1.fastq.gz
cat SRR25074784_1.fastq.gz SRR25074785_1.fastq.gz SRR25074786_1.fastq.gz > T47D_Clone28_0hour_rep3_combined_PE1.fastq.gz
cat SRR25074787_1.fastq.gz SRR25074788_1.fastq.gz SRR25074789_1.fastq.gz > T47D_Clone28_0hour_rep2_combined_PE1.fastq.gz
cat SRR25074790_1.fastq.gz SRR25074791_1.fastq.gz SRR25074792_1.fastq.gz > T47D_Clone28_0hour_rep1_combined_PE1.fastq.gz


cat SRR25074724_2.fastq.gz SRR25074725_2.fastq.gz SRR25074726_2.fastq.gz > T47D_Clone28_24hour_rep4_combined_PE2.fastq.gz
cat SRR25074727_2.fastq.gz SRR25074728_2.fastq.gz SRR25074729_2.fastq.gz > T47D_Clone28_24hour_rep3_combined_PE2.fastq.gz
cat SRR25074730_2.fastq.gz SRR25074731_2.fastq.gz SRR25074732_2.fastq.gz > T47D_Clone28_24hour_rep2_combined_PE2.fastq.gz

cat SRR25074733_2.fastq.gz SRR25074734_2.fastq.gz SRR25074735_2.fastq.gz > T47D_Clone28_4hour_rep4_combined_PE2.fastq.gz
cat SRR25074736_2.fastq.gz SRR25074737_2.fastq.gz SRR25074738_2.fastq.gz > T47D_Clone28_4hour_rep3_combined_PE2.fastq.gz
cat SRR25074739_2.fastq.gz SRR25074740_2.fastq.gz SRR25074741_2.fastq.gz > T47D_Clone28_4hour_rep2_combined_PE2.fastq.gz
cat SRR25074742_2.fastq.gz SRR25074743_2.fastq.gz SRR25074744_2.fastq.gz > T47D_Clone28_4hour_rep1_combined_PE2.fastq.gz

cat SRR25074745_2.fastq.gz SRR25074746_2.fastq.gz SRR25074747_2.fastq.gz > T47D_Clone28_2hour_rep4_combined_PE2.fastq.gz
cat SRR25074748_2.fastq.gz SRR25074749_2.fastq.gz SRR25074750_2.fastq.gz > T47D_Clone28_2hour_rep3_combined_PE2.fastq.gz
cat SRR25074751_2.fastq.gz SRR25074752_2.fastq.gz SRR25074753_2.fastq.gz > T47D_Clone28_2hour_rep2_combined_PE2.fastq.gz
cat SRR25074754_2.fastq.gz SRR25074755_2.fastq.gz SRR25074756_2.fastq.gz > T47D_Clone28_2hour_rep1_combined_PE2.fastq.gz

cat SRR25074757_2.fastq.gz SRR25074758_2.fastq.gz SRR25074759_2.fastq.gz > T47D_Clone28_1hour_rep4_combined_PE2.fastq.gz
cat SRR25074760_2.fastq.gz SRR25074761_2.fastq.gz SRR25074762_2.fastq.gz > T47D_Clone28_1hour_rep3_combined_PE2.fastq.gz
cat SRR25074763_2.fastq.gz SRR25074764_2.fastq.gz SRR25074765_2.fastq.gz > T47D_Clone28_1hour_rep2_combined_PE2.fastq.gz
cat SRR25074766_2.fastq.gz SRR25074767_2.fastq.gz SRR25074768_2.fastq.gz > T47D_Clone28_1hour_rep1_combined_PE2.fastq.gz

cat SRR25074769_2.fastq.gz SRR25074770_2.fastq.gz SRR25074771_2.fastq.gz > T47D_Clone28_0.5hour_rep4_combined_PE2.fastq.gz
cat SRR25074772_2.fastq.gz SRR25074773_2.fastq.gz SRR25074774_2.fastq.gz > T47D_Clone28_0.5hour_rep3_combined_PE2.fastq.gz
cat SRR25074775_2.fastq.gz SRR25074776_2.fastq.gz SRR25074777_2.fastq.gz > T47D_Clone28_0.5hour_rep2_combined_PE2.fastq.gz
cat SRR25074778_2.fastq.gz SRR25074779_2.fastq.gz SRR25074780_2.fastq.gz > T47D_Clone28_0.5hour_rep1_combined_PE2.fastq.gz

cat SRR25074781_2.fastq.gz SRR25074782_2.fastq.gz SRR25074783_2.fastq.gz > T47D_Clone28_0hour_rep4_combined_PE2.fastq.gz
cat SRR25074784_2.fastq.gz SRR25074785_2.fastq.gz SRR25074786_2.fastq.gz > T47D_Clone28_0hour_rep3_combined_PE2.fastq.gz
cat SRR25074787_2.fastq.gz SRR25074788_2.fastq.gz SRR25074789_2.fastq.gz > T47D_Clone28_0hour_rep2_combined_PE2.fastq.gz
cat SRR25074790_2.fastq.gz SRR25074791_2.fastq.gz SRR25074792_2.fastq.gz > T47D_Clone28_0hour_rep1_combined_PE2.fastq.gz

8.3 Generate a plot of quality control metrics (Figure S5)

The script used in the next code chunk (220927_PROseq_analysis_on_input_file.sh) is below:

#! /bin/bash
UMI_length=8
read_size=40
cores=$SLURM_CPUS_PER_TASK
annotation_prefix=$HOME/gene_annotations/Homo_sapiens.GRCh38.104 
genome_index=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome
prealign_rdna_index=$HOME/human_rDNA/human_rDNA
tallymer=$HOME/seqOutBias/genome.tal_${read_size}.gtTxt.gz
table=$HOME/seqOutBias/genome_${read_size}.4.2.2.tbl

module load cutadapt/3.4
module load intel/18.0 intelmpi/18.0 R/4.1.1
module load gcc/9.2.0 bowtie2/2.2.9 
module load samtools/1.12
module load wigtobigwig/2.8
module load bedtools/2.29.2
module load pigz/2.4

PATH=$HOME/bin:$PATH

echo $name
gunzip ${name}_PE1.fastq.gz
gunzip ${name}_PE2.fastq.gz
echo 'removing dual adapter ligations and calculating the fraction of adapter/adapters in' $name
cutadapt --cores=$cores -m $((UMI_length+2)) -O 1 -a TGGAATTCTCGGGTGCCAAGG ${name}_PE1.fastq \
    -o ${name}_PE1_noadap.fastq --too-short-output ${name}_PE1_short.fastq > ${name}_PE1_cutadapt.txt
cutadapt --cores=$cores -m $((UMI_length+10)) -O 1 -a GATCGTCGGACTGTAGAACTCTGAAC ${name}_PE2.fastq \
    -o ${name}_PE2_noadap.fastq --too-short-output ${name}_PE2_short.fastq > ${name}_PE2_cutadapt.txt
PE1_total=$(wc -l ${name}_PE1.fastq | awk '{print $1/4}')
PE1_w_Adapter=$(wc -l ${name}_PE1_short.fastq | awk '{print $1/4}')
AAligation=$(echo "scale=2 ; $PE1_w_Adapter / $PE1_total" | bc)
echo -e  "value\texperiment\tthreshold\tmetric" > ${name}_QC_metrics.txt
echo -e "$AAligation\t$name\t0.80\tAdapter/Adapter" >> ${name}_QC_metrics.txt
echo 'removing short RNA insertions in' $name
seqtk seq -L $((UMI_length+10)) ${name}_PE1_noadap.fastq > ${name}_PE1_noadap_trimmed.fastq
echo 'removing PCR duplicates from' $name
fqdedup -i ${name}_PE1_noadap_trimmed.fastq -o ${name}_PE1_dedup.fastq
PE1_noAdapter=$(wc -l ${name}_PE1_dedup.fastq | awk '{print $1/4}')
fastq_pair -t $PE1_noAdapter ${name}_PE1_dedup.fastq ${name}_PE2_noadap.fastq
echo 'calculating and plotting RNA insert sizes from' $name
flash -q --compress-prog=gzip --suffix=gz ${name}_PE1_dedup.fastq.paired.fq \
    ${name}_PE2_noadap.fastq.paired.fq -o ${name}
insert_size.R ${name}.hist ${UMI_length}
echo 'trimming off the UMI from' $name
seqtk trimfq -b ${UMI_length} ${name}_PE1_dedup.fastq | seqtk seq -r - > ${name}_PE1_processed.fastq
seqtk trimfq -e ${UMI_length} ${name}_PE2_noadap.fastq | seqtk seq -r - > ${name}_PE2_processed.fastq
echo 'aligning' $name 'to rDNA and removing aligned reads'
bowtie2 -p $((cores-2)) -x $prealign_rdna_index -U ${name}_PE1_processed.fastq 2>${name}_bowtie2_rDNA.log | \
    samtools sort -n - | samtools fastq -f 0x4 - > ${name}_PE1.rDNA.fastq
reads=$(wc -l ${name}_PE1.rDNA.fastq | awk '{print $1/4}')
fastq_pair -t $reads ${name}_PE1.rDNA.fastq ${name}_PE2_processed.fastq
echo 'aligning' $name 'to the genome'
bowtie2 -p $((cores-2)) --maxins 1000 -x $genome_index --rf -1 ${name}_PE1.rDNA.fastq.paired.fq \
    -2 ${name}_PE2_processed.fastq.paired.fq 2>${name}_bowtie2.log | \
    samtools view -b - | samtools sort - -o ${name}.bam
PE1_prior_rDNA=$(wc -l ${name}_PE1_processed.fastq | awk '{print $1/4}')
PE1_post_rDNA=$(wc -l ${name}_PE1.rDNA.fastq | awk '{print $1/4}')
total_rDNA=$(echo "$(($PE1_prior_rDNA-$PE1_post_rDNA))")
concordant_pe1=$(samtools view -c -f 0x42 ${name}.bam)
total=$(echo "$(($concordant_pe1+$total_rDNA))")
rDNA_alignment=$(echo "scale=2 ; $total_rDNA / $total" | bc)
echo -e "$rDNA_alignment\t$name\t0.10\trDNA Alignment Rate" >> ${name}_QC_metrics.txt
map_pe1=$(samtools view -c -f 0x42 ${name}.bam)
pre_alignment=$(wc -l ${name}_PE1.rDNA.fastq.paired.fq | awk '{print $1/4}')
alignment_rate=$(echo "scale=2 ; $map_pe1 / $pre_alignment" | bc)
echo -e "$alignment_rate\t$name\t0.80\tAlignment Rate" >> ${name}_QC_metrics.txt
echo 'plotting and calculating complexity for' $name
fqComplexity -i ${name}_PE1_noadap_trimmed.fastq
echo 'calculating and plotting theoretical sequencing depth'
echo 'to achieve a defined number of concordantly aligned reads for' $name
PE1_total=$(wc -l ${name}_PE1.fastq | awk '{print $1/4}')
PE1_noadap_trimmed=$(wc -l ${name}_PE1_noadap_trimmed.fastq | awk '{print $1/4}')
factorX=$(echo "scale=2 ; $PE1_noadap_trimmed / $PE1_total" | bc)
echo fraction of reads that are not adapter/adapter ligation products or below 10 base inserts
echo $factorX
PE1_dedup=$(wc -l ${name}_PE1_dedup.fastq | awk '{print $1/4}')
factorY=$(echo "scale=2 ; $concordant_pe1 / $PE1_dedup" | bc)
fqComplexity -i ${name}_PE1_noadap_trimmed.fastq -x $factorX -y $factorY
echo 'Separating paired end reads and creating genomic BED and bigWig intensity files for' $name
mkdir ${name}_seqOutBias
cd ${name}_seqOutBias
seqOutBias scale $table ../${name}.bam --no-scale --stranded --bed-stranded-positive --only-paired --tail-edge \
    --bw=$name.bigWig --bed=$name.bed --out-split-pairends --read-size=$read_size --tallymer=$tallymer
cd ..
mv ${name}_seqOutBias/${name}* .
rm -r ${name}_seqOutBias
grep -v "random" ${name}_not_scaled_PE1.bed | grep -v "chrUn" | grep -v "chrEBV" | sort -k1,1 -k2,2n > \
    ${name}_tmp.txt && mv ${name}_tmp.txt ${name}_not_scaled_PE1.bed 
mapBed -null "0" -s -a $annotation_prefix.pause.bed -b ${name}_not_scaled_PE1.bed | \
    awk '$7>0' | sort -k5,5 -k7,7nr | sort -k5,5 -u > ${name}_pause.bed
join -1 5 -2 5 ${name}_pause.bed $annotation_prefix.bed | \
    awk '{OFS="\t";} $2==$8 && $6==$12 {print $2, $3, $4, $1, $6, $7, $9, $10}' | \
    awk '{OFS="\t";} $5 == "+" {print $1,$2+480,$8,$4,$6,$5} $5 == "-" {print $1,$7,$2 - 380,$4,$6,$5}' | \
    awk  '{OFS="\t";} $3>$2 {print $1,$2,$3,$4,$5,$6}' | \
    sort -k1,1 -k2,2n > ${name}_pause_counts_body_coordinates.bed
mapBed -null "0" -s -a ${name}_pause_counts_body_coordinates.bed -b ${name}_not_scaled_PE1.bed | \
    awk '$7>0' | awk '{OFS="\t";} {print $1,$2,$3,$4,$5,$6,$7,$5/100,$7/($3 - $2)}' | \
    awk '{OFS="\t";} {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$8/$9}' > ${name}_pause_body.bed
pause_index.R ${name}_pause_body.bed
echo 'Calculating exon density / intron density as a metric for nascent RNA purity for' $name
mapBed -null "0" -s -a $annotation_prefix.introns.bed -b ${name}_not_scaled_PE1.bed | awk '$7>0' | \
    awk '{OFS="\t";} {print $1,$2,$3,$5,$5,$6,$7,($3 - $2)}' > ${name}_intron_counts.bed
mapBed -null "0" -s -a $annotation_prefix.no.first.exons.named.bed -b ${name}_not_scaled_PE1.bed | awk '$7>0' | \
    awk '{OFS="\t";} {print $1,$2,$3,$4,$4,$6,$7,($3 - $2)}' > ${name}_exon_counts.bed
exon_intron_ratio.R ${name}_exon_counts.bed ${name}_intron_counts.bed
echo 'Counting reads in genes for' $name
echo -e "\t${name}" > ${name}_gene_counts.txt
mapBed -null "0" -s -a ${annotation_prefix}_sorted.bed -b ${name}_not_scaled_PE1.bed | \
    awk '{OFS="\t";} {print $4,$7}' >> ${name}_gene_counts.txt
rm ${name}_PE1_short.fastq
rm ${name}_PE2_short.fastq
rm ${name}_PE1_noadap.fastq
rm ${name}_PE2_noadap.fastq
rm ${name}_PE1_noadap_trimmed.fastq
rm ${name}_PE1_dedup.fastq
rm ${name}_PE1_processed.fastq
rm ${name}_PE2_processed.fastq
rm ${name}_PE1_dedup.fastq.paired.fq
rm ${name}_PE2_noadap.fastq.paired.fq
rm ${name}_PE1_dedup.fastq.single.fq
rm ${name}_PE2_noadap.fastq.single.fq
rm ${name}_PE1.rDNA.fastq.paired.fq
rm ${name}_PE1.rDNA.fastq.single.fq
rm ${name}_PE2_processed.fastq.paired.fq
rm ${name}_PE2_processed.fastq.single.fq
rm ${name}.extendedFrags.fastq.gz
rm ${name}.notCombined_1.fastq.gz
rm ${name}.notCombined_2.fastq.gz
pigz ${name}_PE1.fastq
pigz ${name}_PE2.fastq

Run the above script on each sample in parallel.

for i in *_NextSeq550_PE1.fastq.gz
do
  name=`basename $i _PE1.fastq.gz`
  sbatch -p standard -A gioeli_lab -t 4:00:00 -N 1 -n 1 --cpus-per-task=20 \
    --job-name proseq-analysis -o ${name}_proseq.log --export=ALL,i=$i,name=$name \
    /scratch/ts2hx/scripts/220927_PROseq_analysis_on_input_file.sh
  sleep 1
done

Plot the QC metrics.

module load intel/18.0 intelmpi/18.0 R/4.1.1
PATH=$HOME/bin:$PATH
rm TRPS1_timecourse_PRO_NextSeq550_QC_metrics.txt
cat *QC_metrics.txt | awk '!x[$0]++' > TRPS1_timecourse_PRO_NextSeq550_QC_metrics.txt 
plot_all_metrics.R TRPS1_timecourse_PRO_NextSeq550_QC_metrics.txt TRPS1_timecourse_PRO_NextSeq550
paste -d'\t' *NextSeq550_gene_counts.txt > TRPS1_timecourse_PRO_gene_counts_NextSeq550.txt

8.4 Process the combined reads

The script used in the next code chunk (231221_PROseq_analysis_on_input_file_no_QC_cut_PE1_to_48.sh) is below:

#! /bin/bash
UMI_length=8
read_size=40
cores=$SLURM_CPUS_PER_TASK
annotation_prefix=$HOME/gene_annotations/Homo_sapiens.GRCh38.104 
genome_index=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome
prealign_rdna_index=$HOME/human_rDNA/human_rDNA
tallymer=$HOME/seqOutBias/genome.tal_${read_size}.gtTxt.gz
table=$HOME/seqOutBias/genome_${read_size}.4.2.2.tbl

module load cutadapt/3.4
module load gcc/11.4.0  openmpi/4.1.4 R/4.3.1
module load bowtie2/2.5.1
module load samtools/1.17
module load wigtobigwig/2.8
module load bedtools/2.30.0
module load pigz/2.7

PATH=$HOME/bin:$PATH

echo $name
gunzip ${name}_PE1.fastq.gz
gunzip ${name}_PE2.fastq.gz
echo 'removing dual adapter ligations and calculating the fraction of adapter/adapters in' $name
cutadapt --cores=$cores -m $((UMI_length+10)) -l 48 -O 1 -a TGGAATTCTCGGGTGCCAAGG ${name}_PE1.fastq \
    -o ${name}_PE1_noadap_trimmed.fastq
cutadapt --cores=$cores -m $((UMI_length+10)) -O 1 -a GATCGTCGGACTGTAGAACTCTGAAC ${name}_PE2.fastq \
    -o ${name}_PE2_noadap.fastq
pigz ${name}_PE1.fastq
pigz ${name}_PE2.fastq
echo 'removing PCR duplicates from' $name
fqdedup -i ${name}_PE1_noadap_trimmed.fastq -o ${name}_PE1_dedup.fastq
rm ${name}_PE1_noadap_trimmed.fastq
echo 'trimming off the UMI from' $name
seqtk trimfq -b ${UMI_length} ${name}_PE1_dedup.fastq | seqtk seq -r - > ${name}_PE1_processed.fastq
seqtk trimfq -e ${UMI_length} ${name}_PE2_noadap.fastq | seqtk seq -r - > ${name}_PE2_processed.fastq
rm ${name}_PE2_noadap.fastq
rm ${name}_PE1_dedup.fastq
echo 'aligning' $name 'to rDNA and removing aligned reads'
bowtie2 -p $((cores-2)) -x $prealign_rdna_index -U ${name}_PE1_processed.fastq 2>${name}_bowtie2_rDNA.log | \
    samtools sort -n - | samtools fastq -f 0x4 - > ${name}_PE1.rDNA.fastq
rm ${name}_PE1_processed.fastq
reads=$(wc -l ${name}_PE1.rDNA.fastq | awk '{print $1/4}')
fastq_pair -t $reads ${name}_PE1.rDNA.fastq ${name}_PE2_processed.fastq
rm ${name}_PE1.rDNA.fastq
rm ${name}_PE2_processed.fastq
rm ${name}_PE1.rDNA.fastq.single.fq
rm ${name}_PE2_processed.fastq.single.fq
echo 'aligning' $name 'to the genome'
bowtie2 -p $((cores-2)) --maxins 1000 -x $genome_index --rf -1 ${name}_PE1.rDNA.fastq.paired.fq \
    -2 ${name}_PE2_processed.fastq.paired.fq 2>${name}_bowtie2.log | \
    samtools view -b - | samtools sort - -o ${name}.bam
rm ${name}_PE1.rDNA.fastq.paired.fq
rm ${name}_PE2_processed.fastq.paired.fq
echo 'Separating paired end reads and creating genomic BED and bigWig intensity files for' $name
mkdir ${name}_seqOutBias
cd ${name}_seqOutBias
seqOutBias scale $table ../${name}.bam --no-scale --stranded --bed-stranded-positive --only-paired --tail-edge \
    --bw=$name.bigWig --bed=$name.bed --out-split-pairends --read-size=$read_size --tallymer=$tallymer
cd ..
mv ${name}_seqOutBias/${name}* .
rm -r ${name}_seqOutBias
grep -v "random" ${name}_not_scaled_PE1.bed | grep -v "chrUn" | grep -v "chrEBV" | sort -k1,1 -k2,2n > \
    ${name}_tmp.txt && mv ${name}_tmp.txt ${name}_not_scaled_PE1.bed 
echo 'Counting reads in genes for' $name
echo -e "\t${name}" > ${name}_gene_counts.txt
mapBed -null "0" -s -a ${annotation_prefix}_sorted.bed -b ${name}_not_scaled_PE1.bed | \
    awk '{OFS="\t";} {print $4,$7}' >> ${name}_gene_counts.txt

Run the above script on each sample in parallel.

for i in *_combined_PE1.fastq.gz
do
  name=`basename $i _PE1.fastq.gz`
  sbatch -p standard -A gioeli_lab -t 4:00:00 -N 1 -n 1 --cpus-per-task=20 \
    --job-name proseq-analysis -o ${name}_proseq.log --export=ALL,i=$i,name=$name \
    /scratch/ts2hx/scripts/231221_PROseq_analysis_on_input_file_no_QC_cut_PE1_to_48.sh
  sleep 1
done

Combine the reads into one file.

paste -d'\t' *combined_gene_counts.txt > TRPS1_timecourse_PRO_gene_counts_combined.txt

8.5 Merge bigWigs to upload to dREG

Here we merge the bigWigs to create T47D_all_plus_PE1.bigWig and T47D_all_minus_PE1.bigWig to upload to the dREG portal (https://dreg.dnasequence.org/). The output includes a dTAG_TRPS1_timecourse_dREG.dREG.peak.full.bed file that represents the genomic coordinates of each bidirectionally transcribed putative regulatory element. We use this output above in this vignette to separate ATAC-seq peaks that are distal to bidirectional transcription.

#This only worked for the positive strand because the counts are positive
for i in T47D_Clone28_0hour_rep1_combined_*_PE1.bigWig
do
  strand=$(echo $i | awk -F"T47D_Clone28_0hour_rep1_combined_" '{print $2}' | awk -F"_PE1.bigWig" '{print $1}')
  echo $strand
  files=$(ls T47D_Clone28_*combined*_${strand}_PE1.bigWig)
  echo $files
  bigWigMerge $files tmp.bg
  LC_COLLATE=C sort -k1,1 -k2,2n tmp.bg > tmp_sorted.bg
  bedGraphToBigWig tmp_sorted.bg hg38.chrom.sizes T47D_all_${strand}_PE1.bigWig
  rm tmp.bg
  rm tmp_sorted.bg
done

#For minus strand
#Could have done this for loop as independent jobs
for i in T47D_*_combined_minus_PE1.bigWig
do
  name=$(echo $i | awk -F"_PE1.bigWig" '{print $1}')
  echo $name
  bigWigToBedGraph $i tmp.bg
  echo normalizing
  normalize_bedGraph.py -i tmp.bg -s -1 -o tmp2.bg
  bedGraphToBigWig tmp2.bg hg38.chrom.sizes ${name}_positive_PE1.bigWig
  rm tmp.bg
  rm tmp2.bg
done
files=$(ls T47D_*_minus_positive_PE1.bigWig)
echo $files
bigWigMerge $files tmp.bg
LC_COLLATE=C sort -k1,1 -k2,2n tmp.bg > tmp_sorted.bg
bedGraphToBigWig tmp_sorted.bg hg38.chrom.sizes T47D_all_minus_PE1.bigWig
rm tmp.bg
rm tmp_sorted.bg

8.6 Scale and merge bigWigs for use in the genome browser

The script used in the next code chunk (230403_PRO_scale_bigWig_whole_fragment.sh) is below. It scales each bigWig based on the read depth, broken down by strand.

#! /bin/bash
module load samtools/1.12 deeptools/3.5.1 

name=$(echo $1 | awk -F".bam" '{print $1}')

# include reads that are 1st in a pair (64) and properly paired (2);
# exclude reads that are mapped to the reverse strand (16)
samtools view -b -f 66 -F 16 ${name}.bam > ${name}.fwd1.bam

# include reads that are mapped to the reverse strand (16),
# second in a pair (128), and properly paired (2)
samtools view -b -f 146 ${name}.bam > ${name}.fwd2.bam

# combine the temporary files
samtools merge -f ${name}.fwd.bam ${name}.fwd1.bam ${name}.fwd2.bam

# index the filtered BAM file
samtools index ${name}.fwd.bam

# run bamCoverage
bamCoverage --bam ${name}.fwd.bam -bs 10 -p 10 -e --normalizeUsing BPM -o ${name}_whole_fragment_plus_scaled.bigWig

# remove the temporary files
rm ${name}.fwd*


# include reads that are 2nd in a pair (128) and properly paired (2);
# exclude reads that are mapped to the reverse strand (16)
samtools view -b -f 130 -F 16 ${name}.bam > ${name}.rev1.bam

# include reads that are mapped to the reverse strand (16),
# first in a pair (64), and properly paired (2)
samtools view -b -f 82 ${name}.bam > ${name}.rev2.bam

# combine the temporary files
samtools merge -f ${name}.rev.bam ${name}.rev1.bam ${name}.rev2.bam

# index the filtered BAM file
samtools index ${name}.rev.bam

# run bamCoverage
bamCoverage --bam ${name}.rev.bam -bs 10 -p 10 -e --normalizeUsing BPM -o ${name}_whole_fragment_minus_scaled.bigWig  

# remove the temporary files
rm ${name}.rev*

Run the above script on each sample in parallel.

for i in *_combined.bam
do
  name=$(echo $i | awk -F".bam" '{print $1}')
  sbatch -p standard -A gioeli_lab -t 1:00:00 --cpus-per-task=10 --job-name bamCoverage -o ${name}_bamCoverage.log \
    --wrap="sh /scratch/ts2hx/scripts/230403_PRO_scale_bigWig_whole_fragment.sh $i"
  sleep 1
done

The script used in the next code chunk (230403_PRO_to_browser_whole_fragment.sh) is below. It merges the scaled bigWigs across replicates into one bigWig and one bedGraph per condition.

#! /bin/bash
directory=/scratch/$USER/TRPS1_timecourse_PRO_3

PATH=$HOME/bin:$PATH

cd $directory

module load ucsc-tools/3.7.4

name=$(echo $1 | awk -F"T47D_Clone28_" '{print $2}' | awk -F"_rep2" '{print $1}')
strand=$(echo $1 | awk -F"rep2_combined_whole_fragment_" '{print $2}' | awk -F"_scaled.bigWig" '{print $1}')
echo $name
echo $strand
files=$(ls T47D_Clone28_${name}_rep*_combined_whole_fragment_${strand}_scaled.bigWig)
echo $files
if [ "$strand" == "plus" ]
then
echo "track type=bedGraph name=T47D_PRO_${name}_plus color=255,0,0 alwaysZero=on visibility=full" > T47D_${name}_${strand}_temp.txt
fi
if [ "$strand" == "minus" ]
then
echo "track type=bedGraph name=T47D_PRO_${name}_minus color=0,0,255 alwaysZero=on visibility=full" > T47D_${name}_${strand}_temp.txt
fi
count=$(ls T47D_Clone28_${name}_rep*_combined_whole_fragment_${strand}_scaled.bigWig | wc -l | bc)
scaleall=$(bc <<< "scale=4 ; 1.0 / $count")
echo $count
echo $scaleall
name=T47D_${name}
bigWigMerge $files ${name}_${strand}_tmp.bg
normalize_bedGraph.py -i ${name}_${strand}_tmp.bg -s $scaleall -o ${name}_${strand}_scaled.bg
LC_COLLATE=C sort -k1,1 -k2,2n ${name}_${strand}_scaled.bg > ${name}_${strand}_scaled_sorted.bg
cat ${name}_${strand}_temp.txt ${name}_${strand}_scaled_sorted.bg | \
  grep -v "random" | grep -v "chrUn" | grep -v "chrEBV" | grep -v "chrM" | grep -v "alt" > \
  ${name}_${strand}_scaled.bedGraph
rm ${name}_${strand}_tmp.bg
rm ${name}_${strand}_temp.txt
rm ${name}_${strand}_scaled.bg
rm ${name}_${strand}_scaled_sorted.bg
head ${name}_${strand}_scaled.bedGraph
bedGraphToBigWig ${name}_${strand}_scaled.bedGraph hg38.chrom.sizes ${name}_${strand}_combined_normalized.bigWig
gzip -f ${name}_${strand}_scaled.bedGraph

Run the above script on each sample in parallel.

for i in T47D_Clone28_*_rep2_combined_whole_fragment_*_scaled.bigWig
do
  name=$(echo $i | awk -F"_scaled.bigWig" '{print $1}')
  sbatch -p largemem -A gioeli_lab -t 4:00:00 -N 1 -n 1 --job-name to_browser -o ${name}_to_browser.log \
  --wrap="sh /scratch/ts2hx/scripts/230403_PRO_to_browser_whole_fragment.sh $i"
  sleep 1
done

8.7 Leave the cluster environment, and copy files to your local computer

exit
exit

mkdir -p ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/
cd ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/

scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_PRO_3/*.pdf .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_PRO_3/TRPS1_timecourse_PRO_NextSeq550_QC_metrics.txt .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_PRO_3/TRPS1_timecourse_PRO_gene_counts_NextSeq550.txt .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_PRO_3/TRPS1_timecourse_PRO_gene_counts_combined.txt .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_PRO_3/*us_scaled.bedGraph.gz .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_PRO_3/*us_combined_normalized.bigWig .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1_timecourse_PRO_3/T47D_all_*us_PE1.bigWig .

9 Three clone PRO-seq data pre-processing

As above, here we pre-process the raw data from the three clone PRO-seq experiment used to assess the robustness of the changes in ER target gene expression upon TRPS1 depletion.

9.1 Set up

Log in to the cluster to perform the processing for each sample in parallel.

ssh -Y ts2hx@rivanna.hpc.virginia.edu 
ijob -c 1 -A gioeli_lab -p standard --time=1-00:00:00

Make a directory.

mkdir -p /scratch/ts2hx/TRPS1

Make tallymer and table for seqOutBias (this takes a while).

module load genometools/1.5.10
read_size=62
genome=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa

mkdir -p $HOME/seqOutBias
cd $HOME/seqOutBias
$HOME/bin/seqOutBias seqtable $genome --read-size=$read_size    

9.2 Download the raw data and rename the files

main=/scratch/ts2hx/TRPS1
cd $main

for i in {882..905}
do
  sbatch -p standard -A gioeli_lab -t 1:00:00 -N 1 -n 1 --cpus-per-task=10 --job-name fasterq-dump \
    -o logs/SRR250747${i}_fasterq_dump.log \
    --wrap="module load gcc/11.4.0 sratoolkit/3.0.3; fasterq-dump SRR27313${i}"
  sleep 1
done

Rename the files

mv SRR27313882_1.fastq.gz T47D_Clone39_dTAG13_rep4_PE1.fastq.gz
mv SRR27313883_1.fastq.gz T47D_Clone39_dTAG13_rep3_PE1.fastq.gz
mv SRR27313884_1.fastq.gz T47D_Clone39_dTAG13_rep2_PE1.fastq.gz
mv SRR27313885_1.fastq.gz T47D_Clone39_dTAG13_rep1_PE1.fastq.gz

mv SRR27313886_1.fastq.gz T47D_Clone39_DMSO_rep4_PE1.fastq.gz
mv SRR27313887_1.fastq.gz T47D_Clone39_DMSO_rep3_PE1.fastq.gz
mv SRR27313888_1.fastq.gz T47D_Clone39_DMSO_rep2_PE1.fastq.gz
mv SRR27313889_1.fastq.gz T47D_Clone39_DMSO_rep1_PE1.fastq.gz

mv SRR27313890_1.fastq.gz T47D_Clone35_dTAG13_rep4_PE1.fastq.gz
mv SRR27313891_1.fastq.gz T47D_Clone35_dTAG13_rep3_PE1.fastq.gz
mv SRR27313892_1.fastq.gz T47D_Clone35_dTAG13_rep2_PE1.fastq.gz
mv SRR27313905_1.fastq.gz T47D_Clone35_dTAG13_rep1_PE1.fastq.gz

mv SRR27313904_1.fastq.gz T47D_Clone35_DMSO_rep4_PE1.fastq.gz
mv SRR27313903_1.fastq.gz T47D_Clone35_DMSO_rep3_PE1.fastq.gz
mv SRR27313902_1.fastq.gz T47D_Clone35_DMSO_rep2_PE1.fastq.gz
mv SRR27313901_1.fastq.gz T47D_Clone35_DMSO_rep1_PE1.fastq.gz

mv SRR27313900_1.fastq.gz T47D_Clone28_dTAG13_rep4_PE1.fastq.gz
mv SRR27313899_1.fastq.gz T47D_Clone28_dTAG13_rep3_PE1.fastq.gz
mv SRR27313898_1.fastq.gz T47D_Clone28_dTAG13_rep2_PE1.fastq.gz
mv SRR27313897_1.fastq.gz T47D_Clone28_dTAG13_rep1_PE1.fastq.gz

mv SRR27313896_1.fastq.gz T47D_Clone28_DMSO_rep4_PE1.fastq.gz
mv SRR27313895_1.fastq.gz T47D_Clone28_DMSO_rep3_PE1.fastq.gz
mv SRR27313894_1.fastq.gz T47D_Clone28_DMSO_rep2_PE1.fastq.gz
mv SRR27313893_1.fastq.gz T47D_Clone28_DMSO_rep1_PE1.fastq.gz


mv SRR27313882_2.fastq.gz T47D_Clone39_dTAG13_rep4_PE2.fastq.gz
mv SRR27313883_2.fastq.gz T47D_Clone39_dTAG13_rep3_PE2.fastq.gz
mv SRR27313884_2.fastq.gz T47D_Clone39_dTAG13_rep2_PE2.fastq.gz
mv SRR27313885_2.fastq.gz T47D_Clone39_dTAG13_rep1_PE2.fastq.gz

mv SRR27313886_2.fastq.gz T47D_Clone39_DMSO_rep4_PE2.fastq.gz
mv SRR27313887_2.fastq.gz T47D_Clone39_DMSO_rep3_PE2.fastq.gz
mv SRR27313888_2.fastq.gz T47D_Clone39_DMSO_rep2_PE2.fastq.gz
mv SRR27313889_2.fastq.gz T47D_Clone39_DMSO_rep1_PE2.fastq.gz

mv SRR27313890_2.fastq.gz T47D_Clone35_dTAG13_rep4_PE2.fastq.gz
mv SRR27313891_2.fastq.gz T47D_Clone35_dTAG13_rep3_PE2.fastq.gz
mv SRR27313892_2.fastq.gz T47D_Clone35_dTAG13_rep2_PE2.fastq.gz
mv SRR27313905_2.fastq.gz T47D_Clone35_dTAG13_rep1_PE2.fastq.gz

mv SRR27313904_2.fastq.gz T47D_Clone35_DMSO_rep4_PE2.fastq.gz
mv SRR27313903_2.fastq.gz T47D_Clone35_DMSO_rep3_PE2.fastq.gz
mv SRR27313902_2.fastq.gz T47D_Clone35_DMSO_rep2_PE2.fastq.gz
mv SRR27313901_2.fastq.gz T47D_Clone35_DMSO_rep1_PE2.fastq.gz

mv SRR27313900_2.fastq.gz T47D_Clone28_dTAG13_rep4_PE2.fastq.gz
mv SRR27313899_2.fastq.gz T47D_Clone28_dTAG13_rep3_PE2.fastq.gz
mv SRR27313898_2.fastq.gz T47D_Clone28_dTAG13_rep2_PE2.fastq.gz
mv SRR27313897_2.fastq.gz T47D_Clone28_dTAG13_rep1_PE2.fastq.gz

mv SRR27313896_2.fastq.gz T47D_Clone28_DMSO_rep4_PE2.fastq.gz
mv SRR27313895_2.fastq.gz T47D_Clone28_DMSO_rep3_PE2.fastq.gz
mv SRR27313894_2.fastq.gz T47D_Clone28_DMSO_rep2_PE2.fastq.gz
mv SRR27313893_2.fastq.gz T47D_Clone28_DMSO_rep1_PE2.fastq.gz

9.3 Process the combined reads

The script used in the next code chunk (220408_PROseq_analysis_on_input_file.sh) is below:

#! /bin/bash
directory=/scratch/$USER/TRPS1
annotation_prefix=$HOME/gene_annotations/Homo_sapiens.GRCh38.104 
UMI_length=8
read_size=62
cores=$SLURM_CPUS_PER_TASK
genome_index=/project/genomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome
prealign_rdna_index=$HOME/human_rDNA/human_rDNA
tallymer=$HOME/seqOutBias/genome.tal_${read_size}.gtTxt.gz
table=$HOME/seqOutBias/genome_${read_size}.4.2.2.tbl

PATH=$HOME/bin:$PATH

cd $directory 

module load cutadapt/3.4
module load intel/18.0 intelmpi/18.0 R/4.1.1
module load gcc/9.2.0 bowtie2/2.2.9 
module load samtools/1.12
module load wigtobigwig/2.8
module load bedtools/2.29.2

filename=$1
name=$(echo $filename | awk -F"_PE1.fastq" '{print $1}')
echo $name
echo 'removing dual adapter ligations and calculating the fraction of adapter/adapters in' $name
cutadapt --cores=$cores -m $((UMI_length+2)) -O 1 -a TGGAATTCTCGGGTGCCAAGG ${name}_PE1.fastq \
    -o ${name}_PE1_noadap.fastq --too-short-output ${name}_PE1_short.fastq > ${name}_PE1_cutadapt.txt
cutadapt --cores=$cores -m $((UMI_length+10)) -O 1 -a GATCGTCGGACTGTAGAACTCTGAAC ${name}_PE2.fastq \
    -o ${name}_PE2_noadap.fastq --too-short-output ${name}_PE2_short.fastq > ${name}_PE2_cutadapt.txt
PE1_total=$(wc -l ${name}_PE1.fastq | awk '{print $1/4}')
PE1_w_Adapter=$(wc -l ${name}_PE1_short.fastq | awk '{print $1/4}')
AAligation=$(echo "scale=2 ; $PE1_w_Adapter / $PE1_total" | bc)
echo -e  "value\texperiment\tthreshold\tmetric" > ${name}_QC_metrics.txt
echo -e "$AAligation\t$name\t0.80\tAdapter/Adapter" >> ${name}_QC_metrics.txt
echo 'removing short RNA insertions in' $name
seqtk seq -L $((UMI_length+10)) ${name}_PE1_noadap.fastq > ${name}_PE1_noadap_trimmed.fastq 
echo 'removing PCR duplicates from' $name
fqdedup -i ${name}_PE1_noadap_trimmed.fastq -o ${name}_PE1_dedup.fastq
PE1_noAdapter=$(wc -l ${name}_PE1_dedup.fastq | awk '{print $1/4}')
fastq_pair -t $PE1_noAdapter ${name}_PE1_dedup.fastq ${name}_PE2_noadap.fastq
echo 'calculating and plotting RNA insert sizes from' $name
flash -q --compress-prog=gzip --suffix=gz ${name}_PE1_dedup.fastq.paired.fq \
    ${name}_PE2_noadap.fastq.paired.fq -o ${name}
insert_size.R ${name}.hist ${UMI_length}
echo 'trimming off the UMI from' $name
seqtk trimfq -b ${UMI_length} ${name}_PE1_dedup.fastq | seqtk seq -r - > ${name}_PE1_processed.fastq
seqtk trimfq -e ${UMI_length} ${name}_PE2_noadap.fastq | seqtk seq -r - > ${name}_PE2_processed.fastq
echo 'aligning' $name 'to rDNA and removing aligned reads'
bowtie2 -p $cores -x $prealign_rdna_index -U ${name}_PE1_processed.fastq 2>${name}_bowtie2_rDNA.log | \
    samtools sort -n - | samtools fastq -f 0x4 - > ${name}_PE1.rDNA.fastq
reads=$(wc -l ${name}_PE1.rDNA.fastq | awk '{print $1/4}')
fastq_pair -t $reads ${name}_PE1.rDNA.fastq ${name}_PE2_processed.fastq
echo 'aligning' $name 'to the genome'
bowtie2 -p $cores --maxins 1000 -x $genome_index --rf -1 ${name}_PE1.rDNA.fastq.paired.fq \
    -2 ${name}_PE2_processed.fastq.paired.fq 2>${name}_bowtie2.log | samtools view -b - | \
    samtools sort - -o ${name}.bam
PE1_prior_rDNA=$(wc -l ${name}_PE1_processed.fastq | awk '{print $1/4}')
PE1_post_rDNA=$(wc -l ${name}_PE1.rDNA.fastq | awk '{print $1/4}')
total_rDNA=$(echo "$(($PE1_prior_rDNA-$PE1_post_rDNA))") 
concordant_pe1=$(samtools view -c -f 0x42 ${name}.bam)
total=$(echo "$(($concordant_pe1+$total_rDNA))")
rDNA_alignment=$(echo "scale=2 ; $total_rDNA / $total" | bc)
echo -e "$rDNA_alignment\t$name\t0.10\trDNA Alignment Rate" >> ${name}_QC_metrics.txt
map_pe1=$(samtools view -c -f 0x42 ${name}.bam)
pre_alignment=$(wc -l ${name}_PE1.rDNA.fastq.paired.fq | awk '{print $1/4}')
alignment_rate=$(echo "scale=2 ; $map_pe1 / $pre_alignment" | bc)
echo -e "$alignment_rate\t$name\t0.80\tAlignment Rate" >> ${name}_QC_metrics.txt
echo 'plotting and calculating complexity for' $name
fqComplexity -i ${name}_PE1_noadap_trimmed.fastq
echo 'calculating and plotting theoretical sequencing depth' 
echo 'to achieve a defined number of concordantly aligned reads for' $name
PE1_total=$(wc -l ${name}_PE1.fastq | awk '{print $1/4}')
PE1_noadap_trimmed=$(wc -l ${name}_PE1_noadap_trimmed.fastq | awk '{print $1/4}')
factorX=$(echo "scale=2 ; $PE1_noadap_trimmed / $PE1_total" | bc)
echo fraction of reads that are not adapter/adapter ligation products or below 10 base inserts
echo $factorX 
PE1_dedup=$(wc -l ${name}_PE1_dedup.fastq | awk '{print $1/4}')
factorY=$(echo "scale=2 ; $concordant_pe1 / $PE1_dedup" | bc)
fqComplexity -i ${name}_PE1_noadap_trimmed.fastq -x $factorX -y $factorY
echo 'Separating paired end reads and creating genomic BED and bigWig intensity files for' $name
seqOutBias scale $table ${name}.bam --no-scale --stranded --bed-stranded-positive \
    --bw=$name.bigWig --bed=$name.bed --out-split-pairends --only-paired \
    --tail-edge --read-size=$read_size --tallymer=$tallymer
coverageBed -counts -s -a $annotation_prefix.pause.bed -b ${name}_not_scaled_PE1.bed | \
    awk '$7>0' | sort -k5,5 -k7,7nr | sort -k5,5 -u > ${name}_pause.bed
join -1 5 -2 5 ${name}_pause.bed $annotation_prefix.bed | \
    awk '{OFS="\t";} $2==$8 && $6==$12 {print $2, $3, $4, $1, $6, $7, $9, $10}' | \
    awk '{OFS="\t";} $5 == "+" {print $1,$2+480,$8,$4,$6,$5} $5 == "-" {print $1,$7,$2 - 380,$4,$6,$5}' | \
    awk  '{OFS="\t";} $3>$2 {print $1,$2,$3,$4,$5,$6}' > ${name}_pause_counts_body_coordinates.bed
coverageBed -counts -s -a ${name}_pause_counts_body_coordinates.bed \
    -b ${name}_not_scaled_PE1.bed | awk '$7>0' | \
    awk '{OFS="\t";} {print $1,$2,$3,$4,$5,$6,$7,$5/100,$7/($3 - $2)}' | \
    awk '{OFS="\t";} {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$8/$9}' > ${name}_pause_body.bed
pause_index.R ${name}_pause_body.bed
echo 'Calculating exon density / intron density as a metric for nascent RNA purity for' $name
coverageBed -counts -s -a $annotation_prefix.introns.bed \
    -b ${name}_not_scaled_PE1.bed | awk '$7>0' | \
    awk '{OFS="\t";} {print $1,$2,$3,$5,$5,$6,$7,($3 - $2)}' > ${name}_intron_counts.bed
coverageBed -counts -s -a $annotation_prefix.no.first.exons.named.bed \
    -b ${name}_not_scaled_PE1.bed | awk '$7>0' | \
    awk '{OFS="\t";} {print $1,$2,$3,$4,$4,$6,$7,($3 - $2)}' > ${name}_exon_counts.bed
exon_intron_ratio.R ${name}_exon_counts.bed ${name}_intron_counts.bed
echo 'Counting reads in genes for' $name
echo -e  "\t${name}" > ${name}_gene_counts.txt
coverageBed -counts -s -a $annotation_prefix.bed -b ${name}_not_scaled_PE1.bed | \
    awk '{OFS="\t";} {print $4,$7}' >> ${name}_gene_counts.txt
rm ${name}_PE1_short.fastq
rm ${name}_PE2_short.fastq
rm ${name}_PE1_noadap.fastq
rm ${name}_PE2_noadap.fastq
rm ${name}_PE1_noadap_trimmed.fastq
rm ${name}_PE1_dedup.fastq
rm ${name}_PE1_processed.fastq
rm ${name}_PE2_processed.fastq
rm ${name}_PE1_dedup.fastq.paired.fq   
rm ${name}_PE2_noadap.fastq.paired.fq
rm ${name}_PE1_dedup.fastq.single.fq
rm ${name}_PE2_noadap.fastq.single.fq
rm ${name}_PE1.rDNA.fastq.paired.fq
rm ${name}_PE1.rDNA.fastq.single.fq
rm ${name}_PE2_processed.fastq.paired.fq
rm ${name}_PE2_processed.fastq.single.fq
rm ${name}.extendedFrags.fastq.gz
rm ${name}.notCombined_1.fastq.gz
rm ${name}.notCombined_2.fastq.gz
gzip ${name}_PE1.fastq
gzip ${name}_PE2.fastq

The script used in the next code chunk (220408_PROseq_analysis_on_all_files_for_slurm.sh) is below:

#! /bin/bash

#Change this variable as well as the wrapped string
main=/scratch/ts2hx/TRPS1
cd $main

for filename in *_PE1.fastq
do
  name=`basename $filename _PE1.fastq`
  sbatch -p standard -A gioeli_lab -t 1-00:00:00 -N 1 -n 1 --cpus-per-task=4 --job-name proseq-analysis \
    -o ${main}/${name}_proseq.log  \
    --wrap="sh /scratch/ts2hx/TRPS1/220408_PROseq_analysis_on_input_file.sh $filename"
  sleep 1       # wait 1 second between each job submission
done

Run the above script to process each sample in parallel.

./220408_PROseq_analysis_on_all_files_for_slurm.sh

Plot QC metrics and combine the reads into one file.

cat *_QC_metrics.txt | awk '!x[$0]++' > TRPS1_QC_metrics.txt 

module load intel/18.0  intelmpi/18.0 R/4.1.1
plot_all_metrics.R TRPS1_QC_metrics.txt TRPS1_PRO
paste -d'\t' *_gene_counts.txt > TRPS1_PRO_gene_counts.txt

9.4 Leave the cluster environment, and copy files to your local computer

exit
exit

mkdir -p ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1
cd ~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1

scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1/*_QC_metrics.txt .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1/*_gene_counts.txt .
scp ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/TRPS1/*.pdf .

10 PRO-seq analysis

10.1 Composite plot of PRO signal around TRPS1 ChIP-seq peaks (Figure 3G)

Set up.

ssh -Y ts2hx@rivanna.hpc.virginia.edu 
ijob -c 1 -A gioeli_lab -p standard --cpus-per-task=20 --time=12:00:00
main=/scratch/ts2hx/ChIP/results
cd $main
mkdir -p heatmaps
cd heatmaps

module load deeptools/3.5.1

Make the matrix (starting with the bed file from the matrix generated for ChIP).

computeMatrix reference-point --referencePoint center -b 500 -a 500 -p 20 --missingDataAsZero \
  -R TRPS1_peaks_sorted_for_heatmap.bed \
  -S /scratch/ts2hx/TRPS1_timecourse_PRO_3/T47D_0hour_plus_combined_normalized.bigWig \
  /scratch/ts2hx/TRPS1_timecourse_PRO_3/T47D_0hour_minus_combined_normalized.bigWig \
  /scratch/ts2hx/TRPS1_timecourse_PRO_3/T47D_0.5hour_plus_combined_normalized.bigWig \
  /scratch/ts2hx/TRPS1_timecourse_PRO_3/T47D_0.5hour_minus_combined_normalized.bigWig \
  -o matrix_PRO_TRPS1_peaks.gz

Plot the composites.

plotProfile -m matrix_PRO_TRPS1_peaks.gz -out profile_PRO_TRPS1_peaks.pdf \
    --regionsLabel "Functional TRPS1 peaks" --perGroup --colors "#8B0000" "#00008B" "#FF7F7F" "#7F7FFF" \
    --samplesLabel "DMSO plus strand" "DMSO minus strand" "dTAG plus strand" "dTAG minus strand" 

Count PRO reads in each TRPS1 peak for use in DESeq2

cd /scratch/ts2hx/TRPS1_timecourse_PRO_3/

module load bedtools/2.30.0

slopBed -b 500 -i /scratch/ts2hx/ChIP/results/heatmaps/Functional_TRPS1_summits.bed \
  -g $HOME/hg38/hg38.chrom.sizes > \
  /scratch/ts2hx/ChIP/results/heatmaps/Functional_TRPS1_summits_window.bed

for i in *_combined_PE1.fastq.gz
do
  name=`basename $i _PE1.fastq.gz`
  echo -e "\t${name}" > ${name}_TRPS1_peak_counts.txt
  mapBed -null "0" -a /scratch/ts2hx/ChIP/results/heatmaps/Functional_TRPS1_summits_window.bed \
    -b ${name}_not_scaled_PE1.bed | awk '{OFS="\t";} {print $4, $NF}' >> ${name}_TRPS1_peak_counts.txt
done

paste -d'\t' *_TRPS1_peak_counts.txt > \
  /scratch/ts2hx/ChIP/results/heatmaps/TRPS1_timecourse_PRO_TRPS1_peak_counts.txt

Leave the cluster environment, and copy files to your local computer.

exit
exit

cd ~/Library/CloudStorage/Box-Box/GuertinLab/ChIP/results
scp -r ts2hx@rivanna.hpc.virginia.edu:/scratch/ts2hx/ChIP/results/heatmaps/ .

10.2 Identify differentially transcribed TRPS1 peaks (Figure S?)

Set up in R.

library(lattice)
library(DESeq2)
library(tidyverse)
library(latticeExtra)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)

Provide size factors based on reads in genes

x = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/TRPS1_timecourse_PRO_gene_counts_combined.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]
x = x[,grepl("0", colnames(x))]
time = factor(sapply(strsplit(colnames(x), '_'), '[', 3), levels = c("0hour", "0.5hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(time, rep), ~ rep + time)
deseq.df = estimateSizeFactors(deseq.df)
size_factors = sizeFactors(deseq.df)

Identify differentially transcribed peaks.

x = read.table("TRPS1_timecourse_PRO_TRPS1_peak_counts.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]

time = factor(sapply(strsplit(colnames(x), '_'), '[', 3), levels = c("0hour", "0.5hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(time, rep), ~ rep + time)
sizeFactors(deseq.df) <- size_factors
deseq.df = DESeq(deseq.df)

res.deseq = results(deseq.df, contrast = c("time", "0.5hour", "0hour"))

sum(!is.na(res.deseq$log2FoldChange)) #12621
sum(!is.na(res.deseq$log2FoldChange) & res.deseq$log2FoldChange > 0) #7900
sum(!is.na(res.deseq$log2FoldChange) & res.deseq$log2FoldChange < 0) #4721

sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0) #9
sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0) #0
sum((colSums(x)/size_factors)[1:4])/sum((colSums(x)/size_factors)[5:8]) #1.065969

t.test(res.deseq$log2FoldChange) # p-value < 2.2e-16, mean 0.1317621

df.deseq.effects.lattice = res.deseq
df.deseq.effects.lattice$status = "Other"
df.deseq.effects.lattice$status[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0] = "Activated"
df.deseq.effects.lattice$status[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0] = "Repressed"
df.deseq.effects.lattice$status = factor(df.deseq.effects.lattice$status, levels = c("Other", "Activated", "Repressed"))

Plot.

pdf("MA_PRO_at_TRPS1_peaks.pdf", width = 7, height = 3.5)
ggplot(as.data.frame(df.deseq.effects.lattice) %>% arrange(status), aes(x = log(baseMean, 10), y = log2FoldChange, color = status)) +
  geom_point(aes(size = status)) +
  scale_size_manual(values = c(0.1, 1, 1), guide = 'none') +
  scale_color_manual(values = c("grey60", "red", "blue")) +
  geom_hline(yintercept = 0, linetype="dashed") +
  ylab(expression("log"[2]~"(fold change in PRO signal)")) +
  xlab(expression("log"[10]~"(mean of normalized intensity)")) +
  ggtitle("Bidirectional transcription at TRPS1 peaks 30 minutes after TRPS1 depletion") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(color=NULL)
dev.off()

ANOVA to determine if dTAG reproducibly increases transcription at these sites

library(tidyr)
wide_counts = as.data.frame(counts(deseq.df, normalized = T))
long_counts <- gather(wide_counts, condition, counts, factor_key=TRUE)
long_counts$rep = factor(sapply(strsplit(as.character(long_counts$condition), '_'), '[', 4))
long_counts$time = factor(sapply(strsplit(as.character(long_counts$condition), '_'), '[', 3))

model = lm(log(counts + 1, 10) ~ rep + time, data = long_counts)
anova(model) #p-value < 2.2e-16

10.3 Cluster dynamic genes (Figure 4A,B)

Set up in R.

direc = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/"
setwd(direc)

library(lattice)
library(DESeq2)
library(viridis)
library(tidyverse)
library(latticeExtra)
library(msigdbr)
library(fgsea)
library(enrichplot)
library(EnhancedVolcano)
library(clusterProfiler)
library(pheatmap)
library(dendsort)
library(RColorBrewer)
library(DEGreport)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)

source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')

cdf.deseq.df <- function(df, genes = gene.file, chip.peaks = chip.peaks) {
  bed.tss.activated = filter.deseq.into.bed(df, genes, cat = "Activated")
  paste(head(bed.tss.activated))
  bed.tss.unchanged = filter.deseq.into.bed(df, genes, cat = "Unchanged")
  bed.tss.repressed = filter.deseq.into.bed(df, genes, cat = "Repressed")
  
  peaks = chip.peaks
  print(head(peaks))
  act.distance = bedTools.closest(bed1 = bed.tss.activated, bed2 = peaks[,c(1:3)], opt.string = '-D a')
  unreg.distance = bedTools.closest(bed1 = bed.tss.unchanged, bed2 = peaks[,c(1:3)], opt.string = '-D a')
  rep.distance = bedTools.closest(bed1 = bed.tss.repressed, bed2 = peaks[,c(1:3)], opt.string = '-D a')
  
  df.up.can = cbind(act.distance[,c(4, 10)], "Activated")
  df.un.can = cbind(unreg.distance[,c(4, 10)], "Unchanged")
  df.dn.can = cbind(rep.distance[,c(4, 10)], "Repressed")
  
  colnames(df.up.can) = c(colnames(df.up.can)[1:2], 'status')
  colnames(df.un.can) = c(colnames(df.up.can)[1:2], 'status')
  colnames(df.dn.can) = c(colnames(df.up.can)[1:2], 'status')
  
  df.all = rbind(df.up.can, df.un.can, df.dn.can)
  return(df.all)
}

filter.deseq.into.bed <- function(deseq.df, gene.file, cat = 'R1881 Activated') {
  deseq.df = deseq.df[deseq.df$status == cat,]
  #print(dim(deseq.df))
  #scientific notation was messing this up occasionally
  x = gene.file$V4
  #print(length(x))
  y = gene.file[x %in% rownames(deseq.df),]
  #print(dim(y))
  z = get.tss(y)
  #print(dim(z))
  return(z)
}

#bedTools.closest <- function(functionstring="/anaconda3/bin/closestBed",bed1,bed2,opt.string="") {
bedTools.closest <- function(functionstring="/usr/local/anaconda3/bin/closestBed",bed1,bed2,opt.string="") {
  
  options(scipen =99) # not to use scientific notation when writing out
  
  #write bed formatted dataframes to tempfile
  write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  
  # create the command string and call the command using system()
  command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
  cat(command1,"\n")
  try(system(command1))
  command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
  cat(command2,"\n")
  try(system(command2))
  
  command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
  cat(command,"\n")
  try(system(command))
  
  res=read.table('out.file.bed',header=F, comment.char='')
  
  command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
  cat(command3,"\n")
  try(system(command3))
  
  colnames(res) = c(colnames(bed1), colnames(bed2)[1:ncol(bed2)], 'dis' )
  return(res)
}

#Viridis color scheme
col = viridis(6)

#For converting gene names
gene.file = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/Homo_sapiens.GRCh38.104.bed", sep = '\t', header = FALSE)
`%notin%` <- Negate(`%in%`)
gene.file = gene.file[gene.file$V5 %notin% c("havana", "havana_tagene", "ensembl_havana"),]
gene.symbol = gene.file[,c(4,5)]
colnames(gene.symbol) = c("gene", "symbol")

Read in the counts table and change the column names. Here we focus on the early time points.

#DESeq
x = read.table("TRPS1_timecourse_PRO_gene_counts_combined.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]
x = x[,!grepl("24hour", colnames(x))]

Run DESeq2.

time = factor(sapply(strsplit(colnames(x), '_'), '[', 3), levels = c("0hour", "0.5hour", "1hour", "2hour", "4hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(time, rep), ~ rep + time)
deseq.df = DESeq(deseq.df)
save(deseq.df, file = "TRPS1_timecourse_PRO_deseq.df.RData")

Run the likelihood ratio test to identify dynamic peaks over the time course.

dds.lrt = DESeq(deseq.df, test="LRT", reduced = ~ rep)
res.lrt = results(dds.lrt)
sum(res.lrt$padj < 0.1 & !is.na(res.lrt$padj))

Cluster the genes.

rld = rlogTransformation(dds.lrt)
rld_mat <- assay(rld)
siglrt.re = res.lrt[res.lrt$padj < 0.1 & !is.na(res.lrt$padj),]
nrow(siglrt.re) #1425
cluster_rlog = rld_mat[rownames(siglrt.re),]
meta = cbind.data.frame(time, rep)
rownames(meta) = colnames(cluster_rlog)
clusters_lrt <- degPatterns(cluster_rlog, metadata = meta, minc = 0)

Set up a data frame for plotting.

plot.df = clusters_lrt$normalized 
plot.df$time = as.character(plot.df$time)
plot.df$time[plot.df$time == '0hour'] = 0
plot.df$time[plot.df$time == '0.5hour'] = 0.5
plot.df$time[plot.df$time == '1hour'] = 1
plot.df$time[plot.df$time == '2hour'] = 2
plot.df$time[plot.df$time == '4hour'] = 4
plot.df$time = as.numeric(plot.df$time)
plot.df = plot.df[order(plot.df$genes),]
plot.df = plot.df[order(plot.df$time),]
plot.df$cluster = paste('cluster', as.character(plot.df$cluster), sep = '')
head(plot.df)
save(plot.df, file = "plot.df.pro.no.24hour.Rdata")

Generate a dendrogram to check where we can merge clusters.

x = as.data.table(plot.df)
plot.df.cluster = dcast(x, genes + cluster ~ time, value.var="value")
avg.clusters = as.data.frame(matrix(nrow = 0, ncol = 4))
for (i in unique(plot.df.cluster$cluster)) {
  z = data.frame(matrix(colMeans(plot.df.cluster[plot.df.cluster$cluster == i,3:6]), 
                        ncol = 4, nrow = 1))
  rownames(z) = c(i)
  colnames(z) = as.character(colnames(plot.df.cluster)[3:6])
  avg.clusters = rbind(avg.clusters, z)
}
dd = dist(avg.clusters)
hc = hclust(dd, method = "complete")
pdf('Dendrogram_PRO_no_24hour.pdf', width=8, height=5)
plot(hc, xlab = "Clusters", main = ' ', hang = -1)
abline(h = 1.85, lty = 2)
dev.off()

Merge clusters.

gradual.down = plot.df[plot.df$cluster == 'cluster2' | 
                         plot.df$cluster == 'cluster7',]
transient.down = plot.df[plot.df$cluster == 'cluster3' |
                            plot.df$cluster == 'cluster9' |
                            plot.df$cluster == 'cluster12' |
                            plot.df$cluster == 'cluster4' |
                            plot.df$cluster == 'cluster18' |
                            plot.df$cluster == 'cluster16' |
                            plot.df$cluster == 'cluster17',]
oscillating.up = plot.df[plot.df$cluster == 'cluster11' | 
                       plot.df$cluster == 'cluster6' |
                       plot.df$cluster == 'cluster19',]
immediate.up = plot.df[plot.df$cluster == 'cluster10' |
                       plot.df$cluster == 'cluster14' |
                       plot.df$cluster == 'cluster15',]
transient.up = plot.df[plot.df$cluster == 'cluster1' |
                       plot.df$cluster == 'cluster8',]
gradual.up = plot.df[plot.df$cluster == 'cluster5' |
                       plot.df$cluster == 'cluster13',]

gradual.down$supercluster = 'Gradual Decrease' 
transient.down$supercluster = 'Transient Decrease' 
oscillating.up$supercluster = 'Oscillating Increase' 
immediate.up$supercluster = 'Immediate Increase' 
transient.up$supercluster = 'Transient Increase' 
gradual.up$supercluster = 'Gradual Increase' 

Make a data frame for plotting the superclusters.

plot.df.pro = rbind(gradual.down,
                    transient.down,
                    oscillating.up,
                    immediate.up,
                    transient.up,
                    gradual.up)
plot.df.pro$supercluster <- factor(plot.df.pro$supercluster, 
                                   levels = c("Oscillating Increase", "Immediate Increase", "Gradual Increase",
                                              "Transient Increase", "Gradual Decrease", "Transient Decrease"))
plot.df.pro$genes = as.factor(plot.df.pro$genes)

Assign the colors.

cat.colours = plot.df.pro[plot.df.pro$merge == 'one_group0hour',]
cat.colours$genes <- as.factor(cat.colours$genes)
cat.colours$supercluster <- as.factor(cat.colours$supercluster)
cat.colours$colour[grep("Increase", cat.colours$supercluster)] <- '#FF000016'
cat.colours$colour[grep("Decrease", cat.colours$supercluster)] <- '#0000FF16'
cat.colours$colour <- as.factor(cat.colours$colour)
cat.colours <- cat.colours[match(levels(plot.df.pro$genes), cat.colours$genes), ]

Plot.

pdf('pro_superclusters_transient_gradual_only.pdf', width=10, height=10)
trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
                box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
                plot.symbol = list(col="black", lwd=1.0, pch ='.'))
print(xyplot(value ~  factor(time) | supercluster, group = genes,
             data = plot.df.pro[plot.df.pro$supercluster %in% levels(plot.df.pro$supercluster)[3:6],], type = c('l'),
             scales=list(x=list(cex=2,relation = "free", rot = 45),
                         y =list(cex=2, relation="free")),
             aspect=1.0, 
             between=list(y=0.5, x=0.5), 
             layout = c(2,2), 
             ylab = list(label = 'Scaled normalized PRO signal', cex=3), 
             xlab = list(label = 'Time (hours)', cex =3), 
             par.strip.text=list(cex=2),
             par.settings = list(superpose.symbol = list(pch = c(16), col=c('grey20'), cex =1),
                                 strip.background=list(col="grey80"),
                                 superpose.line = list(col = as.character(cat.colours$colour), lwd=c(1), lty = c(1))),
             panel = function(x, y, ...) {
               panel.xyplot(x, y, ...)
               panel.bwplot(x, y, pch = '|', horizontal = FALSE, box.width = .5,
                            do.out = F)
               panel.spline(x, y, col = 'black', lwd =3.0, ...) 
               
             })
)
dev.off()

10.4 Over-representation analysis (Figure 4C,D, Tables S5-8)

Activated genes.

genes = plot.df.pro$genes[(plot.df.pro$supercluster == "Gradual Increase" | plot.df.pro$supercluster == "Transient Increase") & plot.df.pro$merge == "one_group0hour"]
m_t2g <- msigdbr(species = "Homo sapiens", category = "H") %>% 
  dplyr::select(gs_name, ensembl_gene)
set.seed(0)
em <- enricher(genes, TERM2GENE=m_t2g, pvalueCutoff = 0.1)
em$ID
em$p.adjust
fgseaResTidy <- em %>%
  as_tibble() %>%
  arrange(p.adjust)
print(fgseaResTidy, n = 100)
gse_top = fgseaResTidy
gse_top$Ratio_ratio = 
  (as.numeric(sapply(strsplit(gse_top$GeneRatio, "/"), "[", 1)) / 
     as.numeric(sapply(strsplit(gse_top$GeneRatio, "/"), "[", 2))) / 
  (as.numeric(sapply(strsplit(gse_top$BgRatio, "/"), "[", 1)) / 
     as.numeric(sapply(strsplit(gse_top$BgRatio, "/"), "[", 2)))
pdf("Hallmark_ORA_Combined_Increase.pdf", width = 14)
ggplot(gse_top, aes(reorder(ID, Ratio_ratio), Ratio_ratio, fill = log(p.adjust, base = 10))) +
  geom_col() +
  coord_flip() +
  scale_fill_gradient(low="#750000", high="#FF8A8A") +
  labs(y="Observed / Expected", x = NULL) + 
  theme_minimal() + 
  labs(fill = "log(padj)") +
  theme(text = element_text(size = 20),
        axis.text.x= element_text(size = 30)) 
dev.off()

Repressed genes.

genes = plot.df.pro$genes[(plot.df.pro$supercluster == "Gradual Decrease" | plot.df.pro$supercluster == "Transient Decrease") & plot.df.pro$merge == "one_group0hour"]
m_t2g <- msigdbr(species = "Homo sapiens", category = "H") %>% 
  dplyr::select(gs_name, ensembl_gene)
set.seed(0)
em <- enricher(genes, TERM2GENE=m_t2g, pvalueCutoff = 0.1)
em$ID
em$p.adjust
fgseaResTidy <- em %>%
  as_tibble() %>%
  arrange(p.adjust)
print(fgseaResTidy, n = 100)
gse_top = fgseaResTidy
gse_top$Ratio_ratio = 
  (as.numeric(sapply(strsplit(gse_top$GeneRatio, "/"), "[", 1)) / 
     as.numeric(sapply(strsplit(gse_top$GeneRatio, "/"), "[", 2))) / 
  (as.numeric(sapply(strsplit(gse_top$BgRatio, "/"), "[", 1)) / 
     as.numeric(sapply(strsplit(gse_top$BgRatio, "/"), "[", 2)))
pdf("Hallmark_ORA_Combined_Decrease.pdf", width = 14)
ggplot(gse_top, aes(reorder(ID, Ratio_ratio), Ratio_ratio, fill = log(p.adjust, base = 10))) +
  geom_col() +
  coord_flip() +
  scale_fill_gradient(low="#000075", high="#8A8AFF") +
  labs(y="Observed / Expected", x = NULL) + 
  theme_minimal() + 
  labs(fill = "log(padj)") +
  theme(text = element_text(size = 20),
        axis.text.x= element_text(size = 30)) 
dev.off()

Write the gene names as tables.

genes = plot.df.pro$genes[plot.df.pro$supercluster == "Gradual Increase" & plot.df.pro$merge == "one_group0hour"]
names = gene.symbol$symbol[gene.symbol$gene %in% genes]
write.table(names, file = "gradually_activated_genes.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

genes = plot.df.pro$genes[plot.df.pro$supercluster == "Transient Increase" & plot.df.pro$merge == "one_group0hour"]
names = gene.symbol$symbol[gene.symbol$gene %in% genes]
write.table(names, file = "transiently_activated_genes.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

genes = plot.df.pro$genes[plot.df.pro$supercluster == "Gradual Decrease" & plot.df.pro$merge == "one_group0hour"]
names = gene.symbol$symbol[gene.symbol$gene %in% genes]
write.table(names, file = "gradually_repressed_genes.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

genes = plot.df.pro$genes[plot.df.pro$supercluster == "Transient Decrease" & plot.df.pro$merge == "one_group0hour"]
names = gene.symbol$symbol[gene.symbol$gene %in% genes]
write.table(names, file = "transiently_repressed_genes.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

11 Integrative analysis

11.1 CDFs of distance between dynamic genes and dynamic TRPS1-proximal ATAC peaks (Figure 4E,F)

Make a data frame with dynamic genes and unchanged genes.

#CDFs separated by TRPS1 peaks overlapping increased vs decreased ATAC peaks 
genes = plot.df.pro$genes[(plot.df.pro$supercluster == "Gradual Increase" | plot.df.pro$supercluster == "Transient Increase") & plot.df.pro$merge == "one_group0hour"]
activated = cbind.data.frame(genes, status = "Activated")
genes = plot.df.pro$genes[(plot.df.pro$supercluster == "Gradual Decrease" | plot.df.pro$supercluster == "Transient Decrease") & plot.df.pro$merge == "one_group0hour"]
repressed = cbind.data.frame(genes, status = "Repressed")
genes = rownames(res.lrt)[res.lrt$padj > 0.5 & !is.na(res.lrt$padj)]
unchanged = cbind.data.frame(genes, status = "Unchanged")

df.deseq.effects.lattice = rbind.data.frame(unchanged, activated, repressed)
rownames(df.deseq.effects.lattice) = df.deseq.effects.lattice$genes

Get the dynamic ATAC peaks.

x <- read.table("~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/Combined_ATAC_peak_counts.txt", sep = '\t', header = TRUE)
peak_qvalue = x[,5]

load("~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_ATAC/deseq.df.ATAC.Rdata")
i="0.5hour"
res.deseq = results(deseq.df[peak_qvalue > 100,], contrast = c("time", i, "0hour"))
res.deseq$chr = sapply(strsplit(rownames(res.deseq), ":"), "[", 1)
res.deseq$start = sapply(strsplit(sapply(strsplit(rownames(res.deseq), ":"), "[", 2), "-"), "[", 1)
res.deseq$end = sapply(strsplit(sapply(strsplit(rownames(res.deseq), ":"), "[", 2), "-"), "[", 2)
res.deseq = res.deseq[, c(7:9,1:6)]
res.deseq$start = as.numeric(res.deseq$start) + 100
res.deseq$end = as.numeric(res.deseq$end) - 100
increasing = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
decreasing = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0,]

Overlap with TRPS1 peaks.

chip.peaks = read.table('~/Library/CloudStorage/Box-Box/GuertinLab/ChIP/results/macs2/HA_DMSO_vs_dTAG_summits.bed')
chip.peaks = chip.peaks[chip.peaks$V5 > 30,]
TRPS1_increasing_ATAC_peaks = bedTools.intersect(bed1 = chip.peaks, bed2 = increasing, opt.string = "-wa -wb")
TRPS1_decreasing_ATAC_peaks = bedTools.intersect(bed1 = chip.peaks, bed2 = decreasing, opt.string = "-wa -wb")

Plot.

df.all = cdf.deseq.df(df = df.deseq.effects.lattice, genes = gene.file, chip.peaks = TRPS1_increasing_ATAC_peaks)
pdf("CDF_HA_ChIP_increasing_ATAC_by_gene_superclusters.pdf", height = 3.5, width = 4) 
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.all,
         auto.key = list(lines=TRUE, points=FALSE, space = "right"),
         col = col.lines,
         aspect = 1,
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('log'[10]~'TRPS1 Distance from TSS'),
         between=list(y = 1.0),
         type = 'a',
         #xlim = c(0, 7.5),
         lwd = 2,
         par.settings = list(superpose.line = list(col = col.lines, lwd=3), 
                             strip.background=list(col="grey85")),
         panel = function(...) {
           panel.abline(v= 200, lty =2)
           panel.ecdfplot(...)
         })
dev.off()

df.all = cdf.deseq.df(df = df.deseq.effects.lattice, genes = gene.file, chip.peaks = TRPS1_decreasing_ATAC_peaks)
df.all = df.all[df.all$dis != -1,]
pdf("CDF_HA_ChIP_decreasing_ATAC_by_gene_superclusters.pdf", height = 3.5, width = 4) 
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.all,
         auto.key = list(lines=TRUE, points=FALSE, space = "right"),
         col = col.lines,
         aspect = 1,
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('log'[10]~'TRPS1 Distance from TSS'),
         between=list(y = 1.0),
         type = 'a',
         #xlim = c(0, 7.5),
         lwd = 2,
         par.settings = list(superpose.line = list(col = col.lines, lwd=3), 
                             strip.background=list(col="grey85")),
         panel = function(...) {
           panel.abline(v= 200, lty =2)
           panel.ecdfplot(...)
         })
dev.off()

11.2 Identification of ER target genes (Figure S7)

Set up in R.

setwd("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R")

library(lattice)
library(DESeq2)
library(ggplot2)

cdf.deseq.df <- function(df, genes = gene.file, chip.peaks = chip.peaks) {
  bed.tss.activated = filter.deseq.into.bed(df, genes, cat = "Activated")
  bed.tss.unchanged = filter.deseq.into.bed(df, genes, cat = 'Not Activated')
  act.distance = bedTools.closest(bed1 = bed.tss.activated, bed2 = chip.peaks[,c(1:3)], opt.string = '-D a')
  unreg.distance = bedTools.closest(bed1 = bed.tss.unchanged, bed2 = chip.peaks[,c(1:3)], opt.string = '-D a')

  df.up.can = cbind(act.distance[,c(4, 10)], "Activated")
  df.un.can = cbind(unreg.distance[,c(4, 10)], "Not Activated")

  colnames(df.up.can) = c(colnames(df.up.can)[1:2], 'status')
  colnames(df.un.can) = c(colnames(df.up.can)[1:2], 'status')

  df.all = rbind(df.up.can, df.un.can)
  df.all$status = factor(df.all$status, levels = c("Activated", "Not Activated"))
  return(df.all)
}

filter.deseq.into.bed <- function(deseq.df, gene.file, cat = 'R1881 Activated') {
  deseq.df = deseq.df[deseq.df$treatment == cat,]
  #print(dim(deseq.df))
  #scientific notation was messing this up occasionally
  x = gene.file$V4
  #print(length(x))
  y = gene.file[x %in% rownames(deseq.df),]
  #print(dim(y))
  z = get.tss(y)
  #print(dim(z))
  return(z)
}

plot_cdf <- function(df.all, tf = "TRPS1", col.lines = c("#FF0000", "grey60", "#0000FF")) {
  ecdfplot(~log(abs(as.numeric(dis)), base = 10), groups = status, data = df.all,
           auto.key = list(lines=TRUE, points=FALSE),
           col = col.lines,
           aspect = 1,
           scales=list(relation="free",alternating=c(1,1,1,1)),
           ylab = 'Cumulative Distribution Function',
           xlab = bquote("log"[10]~.(tf)~"Distance from TSS"),
           between=list(y=1.0),
           type = 'a',
           xlim = c(0,7.5),
           lwd=2,
           par.settings = list(superpose.line = list(col = col.lines, lwd=3), 
                               strip.background=list(col="grey85")),
           panel = function(...) {
             panel.abline(v= 200, lty =2)
             panel.ecdfplot(...)
           })
}

col.lines = c("#FF0000", "grey60")

bedTools.closest <- function(functionstring="/usr/local/anaconda3/bin/closestBed",bed1,bed2,opt.string="") {
  
  options(scipen =99) # not to use scientific notation when writing out
  
  #write bed formatted dataframes to tempfile
  write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  
  # create the command string and call the command using system()
  command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
  cat(command1,"\n")
  try(system(command1))
  command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
  cat(command2,"\n")
  try(system(command2))
  
  command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
  cat(command,"\n")
  try(system(command))
  
  res=read.table('out.file.bed',header=F, comment.char='')
  
  command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
  cat(command3,"\n")
  try(system(command3))
  
  colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
  return(res)
}

#For converting gene names
gene.file = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/Homo_sapiens.GRCh38.104.bed", sep = '\t', header = FALSE)
`%notin%` <- Negate(`%in%`)
gene.file = gene.file[gene.file$V5 %notin% c("havana", "havana_tagene", "ensembl_havana"),]
gene.symbol = gene.file[,c(4,5)]
colnames(gene.symbol) = c("gene", "symbol")

We used the counts table (Estrogen_treatment_PRO_gene_counts.txt) generated from a previous analysis, extensively described here: https://github.com/guertinlab/Nascent_RNA_Methods. Run DESeq2 with these counts.

x = read.table("Estrogen_treatment_PRO_gene_counts.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]
rep = factor(sapply(strsplit(colnames(x), '_rep'), '[', 2)) 
sample.conditions = factor(sapply(strsplit(sapply(strsplit(colnames(x), 'T47D_'), '[', 2), '_rep'), '[', 1)) 
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(rep, sample.conditions), ~ 0 + rep + sample.conditions)
deseq.df = DESeq(deseq.df)

res.deseq.estrogen = results(deseq.df, contrast = c("sample.conditions", "Starved_Estrogen", "Starved_DMSO"))
res.deseq.estrogen = res.deseq.estrogen[rownames(res.deseq.estrogen) %in% gene.symbol$gene,]
ma.df = as.data.frame(res.deseq.estrogen)

Plot.

pdf("MA_plot_Estrogen_vs_DMSO_black_activated.pdf", width=3.83, height=3.83)
print(xyplot(ma.df$log2FoldChange ~ log(ma.df$baseMean, base=10),
             groups=(ma.df$padj < .1 & ma.df$log2FoldChange > 0 & !is.na(ma.df$padj)),
             col=c("grey60","black"), main="Differential PRO Expression", scales="free", aspect=1, pch=20, cex=0.25,
             ylab=expression("log"[2]~"PRO fold change"), xlab=expression("log"[10]~"Mean of Normalized Counts"),
             par.settings=list(par.xlab.text=list(cex=1.1,font=2), par.ylab.text=list(cex=1.1,font=2))))
dev.off()

11.3 Cluster of dynamic ER target genes (Figure 5A, Tables S9,10)

Set up in R.

direc = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/"
setwd(direc)

library(lattice)
library(DESeq2)
library(viridis)
library(tidyverse)
library(latticeExtra)
library(msigdbr)
library(clusterProfiler)
library(fgsea)
library(enrichplot)
library(EnhancedVolcano)
library(clusterProfiler)
library(pheatmap)
library(dendsort)
library(RColorBrewer)
library(DEGreport)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)

source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')

#Viridis color scheme
col = viridis(6)

#For converting gene names
gene.file = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/Homo_sapiens.GRCh38.104.bed", sep = '\t', header = FALSE)
`%notin%` <- Negate(`%in%`)
gene.file = gene.file[gene.file$V5 %notin% c("havana", "havana_tagene", "ensembl_havana"),]
gene.symbol = gene.file[,c(4,5)]
colnames(gene.symbol) = c("gene", "symbol")

Load the DESeq2 object and the counts table to get the variable names.

load("TRPS1_timecourse_PRO_deseq.df.RData")

x = read.table("TRPS1_timecourse_PRO_gene_counts_combined.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]
x = x[,!grepl("24hour", colnames(x))]

time = factor(sapply(strsplit(colnames(x), '_'), '[', 3), levels = c("0hour", "0.5hour", "1hour", "2hour", "4hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))

Run the likelihood ratio test to identify dynamic peaks over the time course.

dds.lrt = DESeq(deseq.df, test="LRT", reduced = ~ rep)
res.lrt = results(dds.lrt)
rld = rlogTransformation(dds.lrt)
rld_mat <- assay(rld)

sum(res.lrt$padj < 0.1 & !is.na(res.lrt$padj))
siglrt.re = res.lrt[res.lrt$padj < 0.1 & !is.na(res.lrt$padj),]

Intersect with ER target genes.

estrogen_response_genes = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/estrogen_activated_gene_names.txt", header = FALSE)$V1
estrogen_response_genes = gene.symbol$gene[gene.symbol$symbol %in% estrogen_response_genes]
siglrt.re.sets = siglrt.re[rownames(siglrt.re) %in% estrogen_response_genes,] 

Separate those that are activated at 30 minutes from those that are repressed.

res.deseq = results(deseq.df, contrast = c("time", "0.5hour", "0hour"))
sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj))
sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & rownames(res.deseq) %in% estrogen_response_genes)
act.res.deseq = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0 & rownames(res.deseq) %in% estrogen_response_genes,]
rep.res.deseq = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0 & rownames(res.deseq) %in% estrogen_response_genes,]
sig.res.deseq = rbind(act.res.deseq, rep.res.deseq)

Cluster the genes.

dim(sig.res.deseq)      
cluster_rlog = rld_mat[rownames(sig.res.deseq),]
meta = cbind.data.frame(time, rep)
rownames(meta) = colnames(cluster_rlog)
clusters_lrt <- degPatterns(cluster_rlog, metadata = meta, minc = 0)

Set up a data frame for plotting.

plot.df = clusters_lrt$normalized 
plot.df$Class[plot.df$genes %in% rownames(act.res.deseq)] <- "Activated at 30 minutes"
plot.df$Class[plot.df$genes %in% rownames(rep.res.deseq)] <- "Repressed at 30 minutes"
plot.df$time = as.character(plot.df$time)
plot.df$time[plot.df$time == '0hour'] = 0
plot.df$time[plot.df$time == '0.5hour'] = 0.5
plot.df$time[plot.df$time == '1hour'] = 1
plot.df$time[plot.df$time == '2hour'] = 2
plot.df$time[plot.df$time == '4hour'] = 4
plot.df$time[plot.df$time == '24hour'] = 24
plot.df$time = as.numeric(plot.df$time)
plot.df = plot.df[order(plot.df$genes),]
plot.df = plot.df[order(plot.df$time),]
plot.df$cluster = paste('cluster', as.character(plot.df$cluster), sep = '')
head(plot.df)

save(plot.df, file = "LRT_estrogen_response_gene_set_significant_at_30min_plot.df.RData")
load("LRT_estrogen_response_gene_set_significant_at_30min_plot.df.RData")

Assign the colors.

cat.colours = plot.df
cat.colours$colour[plot.df$Class == "Activated at 30 minutes"] <- '#FF000064'
cat.colours$colour[plot.df$Class == "Repressed at 30 minutes"] <- '#0000FF64'
cat.colours$colour <- as.factor(cat.colours$colour)
sum(plot.df$Class == "Activated at 30 minutes" & plot.df$time == 0) #65
sum(plot.df$Class == "Repressed at 30 minutes" & plot.df$time == 0) #58

Plot.

pdf('Normalized_counts_estrogen_response_gene_set_significant_at_30min_red_blue.pdf') 
trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
                box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
                plot.symbol = list(col="black", lwd=1.0, pch ='.'))
print(
  xyplot(value ~  factor(time) | Class, group = genes, data = plot.df, type = c('l'),
         par.strip.text=list(cex=1),
         scales=list(x=list(cex=1, relation = "free", rot = 45), y =list(cex=1, relation="free")),
         aspect=1.0,
         between=list(y=0.5, x=0.5),
         ylab = list(label = 'Scaled normalized\nPRO signal', cex = 2),
         xlab = list(label = 'Time (hours)', cex = 2),
         par.settings = list(superpose.symbol = list(pch = c(16), col=c('grey20'), cex =0.5),
                             strip.background=list(col="grey80"),
                             superpose.line = list(col = as.character(cat.colours$colour), lwd=c(1), lty = c(1))),
         panel = function(x, y, ...) {
           panel.xyplot(x, y, ...)
           panel.bwplot(x, y, pch = '|', horizontal = FALSE, box.width = .5, do.out = FALSE)
           panel.loess(x, y, ..., col = "black", lwd =2.0,  span = 1/2, degree = 1, family = c("gaussian"))
           
         })
)
dev.off()

Write the gene names as tables.

activated = plot.df$genes[plot.df$Class == "Activated at 30 minutes"]
activated = gene.symbol$symbol[gene.symbol$gene %in% activated]
write.table(activated, file = "activated_ER_target_genes.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

repressed = plot.df$genes[plot.df$Class == "Repressed at 30 minutes"]
repressed = gene.symbol$symbol[gene.symbol$gene %in% repressed]
write.table(repressed, file = "repressed_ER_target_genes.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

11.4 Plot fold changes for these dynamic ER target genes across three independent clones (Figure S8)

Set up in R.

direc = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1/"
setwd(direc)

library(DESeq2)
library(pheatmap)

categorize.deseq.df.mods <- function(df, fdr = 0.05, log2fold = 0.0, treat = 'Estrogen') 
{
  df.activated = df[df$padj < fdr & !is.na(df$padj) & df$log2FoldChange > log2fold,]
  df.repressed = df[df$padj < fdr & !is.na(df$padj) & df$log2FoldChange < -log2fold,]
  df.unchanged = df[df$padj > 0.5 & !is.na(df$padj) & abs(df$log2FoldChange) < 0.1,]
  df.dregs = df[!(df$padj < fdr & !is.na(df$padj) & df$log2FoldChange > log2fold) &
                  !(df$padj < fdr & !is.na(df$padj) & df$log2FoldChange < -log2fold) &
                  !(df$padj > 0.5 & !is.na(df$padj) & abs(df$log2FoldChange) < 0.1), ]
  df.unchanged$treatment = paste(treat, 'Unchanged')
  df.activated$treatment = paste(treat, 'Activated')
  df.repressed$treatment = paste(treat, 'Repressed')
  df.dregs$treatment = paste(treat, 'All Other Genes')
  df.effects.lattice =
    rbind(df.activated,
          df.unchanged,
          df.repressed,
          df.dregs)
  df.effects.lattice$treatment = factor(df.effects.lattice$treatment)
  df.effects.lattice$treatment = relevel(df.effects.lattice$treatment, ref = paste(treat, 'Activated'))
  df.effects.lattice$treatment = relevel(df.effects.lattice$treatment, ref = paste(treat, 'Repressed'))
  df.effects.lattice$treatment = relevel(df.effects.lattice$treatment, ref = paste(treat, 'Unchanged'))
  df.effects.lattice$treatment = relevel(df.effects.lattice$treatment, ref = paste(treat, 'All Other Genes'))
  return(df.effects.lattice)
}

rwb <- colorRampPalette(colors = c("red", "white", "blue"))

#For converting gene names
gene.file = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/Homo_sapiens.GRCh38.104.bed", sep = '\t', header = FALSE)
`%notin%` <- Negate(`%in%`)
gene.file = gene.file[gene.file$V5 %notin% c("havana", "havana_tagene", "ensembl_havana"),]
gene.symbol = gene.file[,c(4,5)]
colnames(gene.symbol) = c("gene", "symbol")

Identify differentially expressed genes across the three clones

#DESeq2
x = read.table("TRPS1_PRO_gene_counts.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]

treatment = factor(sapply(strsplit(colnames(x), '_'), '[', 3))
clone = factor(sapply(strsplit(colnames(x), '_'), '[', 2))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(treatment, clone, rep), ~ rep + clone + treatment)
deseq.df = DESeq(deseq.df)

res.deseq.treatment = results(deseq.df, contrast = c("treatment", "dTAG13", "DMSO"))
res.deseq.treatment = res.deseq.treatment[rownames(res.deseq.treatment) %in% gene.symbol$gene,]
sum(res.deseq.treatment$padj < 0.1 & !is.na(res.deseq.treatment$padj))

#Gene classes
df.deseq.effects.lattice = 
  categorize.deseq.df.mods(res.deseq.treatment, fdr = 0.1, log2fold = 0.0, treat = 'dTAG13')

Get dynamic ER target genes

dTAG_activated_ER_targets = read.table(file = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/activated_ER_target_genes.txt")$V1
dTAG_activated_ER_targets = gene.symbol$gene[gene.symbol$symbol %in% dTAG_activated_ER_targets]

dTAG_repressed_ER_targets = read.table(file = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/repressed_ER_target_genes.txt")$V1
dTAG_repressed_ER_targets = gene.symbol$gene[gene.symbol$symbol %in% dTAG_repressed_ER_targets]

Plot heatmaps

#Activated heatmap
wide_counts = counts(deseq.df[rownames(deseq.df) %in% dTAG_activated_ER_targets,], normalized = TRUE)
wide_counts = as.data.frame(wide_counts)
wide_counts = cbind.data.frame("Clone28" = log(rowSums(wide_counts[,5:8])/rowSums(wide_counts[,1:4]), base = 2), 
                               "Clone35" = log(rowSums(wide_counts[,13:16])/rowSums(wide_counts[,9:12]), base = 2),
                               "Clone39" = log(rowSums(wide_counts[,21:24])/rowSums(wide_counts[,17:20]), base = 2))
wide_counts = merge.data.frame(wide_counts, gene.symbol, by.x = "row.names", by.y = "gene")
rownames(wide_counts) = wide_counts$symbol
wide_counts = wide_counts[,2:4]
head(wide_counts)

pdf("dTAG_activated_ER_targets_heatmap.pdf", height = 10)
pheatmap(wide_counts, scale = "none", angle_col = 0,
         color = rev(rwb(100)),
         breaks = seq(-.6, .6, 1.2/100))
dev.off()

#Repressed heatmap
wide_counts = counts(deseq.df[rownames(deseq.df) %in% dTAG_repressed_ER_targets,], normalized = TRUE)
wide_counts = as.data.frame(wide_counts)
wide_counts = cbind.data.frame("Clone28" = log(rowSums(wide_counts[,5:8])/rowSums(wide_counts[,1:4]), base = 2), 
                               "Clone35" = log(rowSums(wide_counts[,13:16])/rowSums(wide_counts[,9:12]), base = 2),
                               "Clone39" = log(rowSums(wide_counts[,21:24])/rowSums(wide_counts[,17:20]), base = 2))
wide_counts = merge.data.frame(wide_counts, gene.symbol, by.x = "row.names", by.y = "gene")
rownames(wide_counts) = wide_counts$symbol
wide_counts = wide_counts[,2:4]
head(wide_counts)

pdf("dTAG_repressed_ER_targets_heatmap.pdf", height = 10)
pheatmap(wide_counts, scale = "none", angle_col = 0,
         color = rev(rwb(100)),
         breaks = seq(-.6, .6, 1.2/100))
dev.off()

11.5 ER binding intensity near dynamic ER target genes (Figure 5B)

Set up in R.

direc = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/"
setwd(direc)

library(lattice)
library(latticeExtra)
library(DESeq2)
library(viridis)
library(tidyverse)
library(ggplot2)
library(readxl)

source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')

filter.deseq.into.bed <- function(deseq.df, gene.file) {
  x = gene.file$V4
  y = gene.file[x %in% rownames(deseq.df),]
  z = get.tss(y)
  return(z)
}

bedTools.intersect <- function(functionstring="/usr/local/anaconda3/bin/intersectBed",bed1,bed2,opt.string="-wa -wb") {
  
  options(scipen =99) # not to use scientific notation when writing out
  
  #write bed formatted dataframes to tempfile
  write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  
  # create the command string and call the command using system()
  command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
  cat(command1,"\n")
  try(system(command1))
  command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
  cat(command2,"\n")
  try(system(command2))
  
  command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
  cat(command,"\n")
  try(system(command))
  
  res=read.table('out.file.bed',header=F, comment.char='', sep = "\t")
  
  command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
  cat(command3,"\n")
  try(system(command3))
  
  colnames(res) = c(colnames(bed1), colnames(bed2))
  return(res)
}

bedTools.closest <- function(functionstring="/usr/local/anaconda3/bin/closestBed",bed1,bed2,opt.string="") {
  
  options(scipen =99) # not to use scientific notation when writing out
  
  #write bed formatted dataframes to tempfile
  write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  
  # create the command string and call the command using system()
  command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
  cat(command1,"\n")
  try(system(command1))
  command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
  cat(command2,"\n")
  try(system(command2))
  
  command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
  cat(command,"\n")
  try(system(command))
  
  res=read.table('out.file.bed',header=F, comment.char='', sep = "\t")
  
  command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
  cat(command3,"\n")
  try(system(command3))
  
  colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
  return(res)
}

plot_cdf <- function(df.all, col.lines = c("#FF0000", "grey60", "#0000FF")) {
  ecdfplot(~log(abs(as.numeric(dis)), base = 10), groups = Class, data = df.all,
           auto.key = list(lines=TRUE, points=FALSE),
           col = col.lines,
           aspect = 1,
           scales=list(relation="free",alternating=c(1,1,1,1)),
           ylab = 'Cumulative Distribution Function',
           xlab = expression('log'[10]~" TRPS1 summit distance from ER summit"),
           between=list(y=1.0),
           type = 'a',
           xlim = c(0,7.5),
           lwd=2,
           par.settings = list(superpose.line = list(col = col.lines, lwd=3), 
                               strip.background=list(col="grey85")),
           panel = function(...) {
             panel.abline(v= 200, lty =2)
             panel.ecdfplot(...)
           })
}

#For converting gene names
gene.file = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/Homo_sapiens.GRCh38.104.bed", sep = '\t', header = FALSE)
`%notin%` <- Negate(`%in%`)
gene.file = gene.file[gene.file$V5 %notin% c("havana", "havana_tagene", "ensembl_havana"),]
gene.symbol = gene.file[,c(4,5)]
colnames(gene.symbol) = c("gene", "symbol")

Get read depth.

complexity_estimate <- read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ChIP/logs/230322_complexity_estimate.txt", header = T)
name = sapply(strsplit(complexity_estimate$name, "Clone28_"), "[", 2)
complexity_estimate = t(complexity_estimate$unique_reads)
colnames(complexity_estimate) = name
complexity_estimate = complexity_estimate[,-grep("HA", colnames(complexity_estimate))]

Read in the counts table and change the column names.

x <- read.table("Combined_ER_peak_counts.txt", sep = '\t', header = TRUE)
rownames(x) = paste0(x[,1], ":", x[,2], "-", x[,3])
peak_name = x[,4]
peak_qvalue = x[,5]
x = x[,-c(1:5)]
colnames(x) = sapply(strsplit(sapply(strsplit(colnames(x), "Clone28_"), "[", 2), "_ER"), "[", 1)

Subtract the read depth normalized background signal.

background <- sweep(x[,grep("IgG", colnames(x))], 2, complexity_estimate[grep("IgG", names(complexity_estimate))], FUN = "/")
background = rowMeans(background)
background = cbind.data.frame(background, background, background, background, background, background, background, background)
background <- sweep(as.data.frame(background), 2, complexity_estimate[grep("ER", names(complexity_estimate))], FUN = "*")
background = round(background, digits = 0)

x = x[,grep("ER", colnames(x))] - background
x[x < 0] = 0

Run DESeq2.

treatment = factor(sapply(strsplit(colnames(x), '_'), '[', 2), levels = c("DMSO", "dTAG"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(treatment, rep), ~ treatment)
factors <- read.table("ER_sizeFactor.txt")
size_factors = factors$V2
names(size_factors) = factors$V1
sizeFactors(deseq.df) <- size_factors
deseq.df = DESeq(deseq.df)
save(deseq.df, file = "deseq.df.ER.Rdata")
load("deseq.df.ER.Rdata")

Get normalized counts for each condition.

counts = as.data.frame(counts(deseq.df[peak_qvalue > 200,], normalized = TRUE))
counts$chr = sapply(strsplit(rownames(counts), ":"), "[", 1)
coord = sapply(strsplit(rownames(counts), ":"), "[", 2)
counts$start = sapply(strsplit(coord, "-"), "[", 1)
counts$end = sapply(strsplit(coord, "-"), "[", 2)
counts$DMSO = rowMeans(counts[,grep("DMSO", colnames(counts))])
counts$dTAG = rowMeans(counts[,grep("dTAG", colnames(counts))])
counts = counts[,9:13]
rownames(counts) = NULL

Get dynamic ER target genes.

load("TRPS1_timecourse_PRO_deseq.df.RData")
res.deseq = results(deseq.df, contrast = c("time", "0.5hour", "0hour"))
estrogen_response_genes = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/estrogen_activated_gene_names.txt", header = FALSE)$V1
estrogen_response_genes = gene.symbol$gene[gene.symbol$symbol %in% estrogen_response_genes]
res.deseq = res.deseq[rownames(res.deseq) %in% estrogen_response_genes,]

Set up a window of 100kbp around each transcription start site.

tss.df = filter.deseq.into.bed(res.deseq, gene.file)
colnames(tss.df) = c("tss.chr", "tss.start", "tss.end", "gene", "symbol", "strand")
tss.df = merge.data.frame(tss.df, res.deseq, by.x = "gene", by.y = "row.names")
tss.df[,1:4] = tss.df[,c(2:4,1)]
colnames(tss.df)[1:4] = colnames(tss.df)[c(2:4,1)]
tss.df$tss.start = tss.df$tss.start - 100000
tss.df$tss.end = tss.df$tss.end + 100000

Generate transcripts per million.

counts.mat = counts(deseq.df, normalized = F)
counts.mat = merge(counts.mat, gene.file, by.x = 0, by.y = "V4")
rownames(counts.mat) = counts.mat$Row.names
gene.length = counts.mat$V3 - counts.mat$V2
gene.full.length = gene.length
x <- counts.mat[,grep("T47D", colnames(counts.mat))] / gene.length
tpm.mat <- as.data.frame(t( t(x) * 1e6 / colSums(x) ))
tpm.mat$DMSO_TPM = log(rowMeans(tpm.mat[,grep("0hour", colnames(tpm.mat))]), 10)
tpm.mat$dTAG_TPM = log(rowMeans(tpm.mat[,grep("0.5hour", colnames(tpm.mat))]), 10)
tpm.mat$gene_length = gene.full.length
tss.df = merge(tss.df, tpm.mat[, c("DMSO_TPM", "dTAG_TPM", "gene_length")], by.x = "gene", by.y = 0)
tss.df[,1:4] = tss.df[,c(2:4,1)]
colnames(tss.df)[1:4] = colnames(tss.df)[c(2:4,1)]

Match unchanged genes based on expression.

tss.df$Match = "Controls"
tss.df$Class = "Other Genes"
tss.df$Class[!is.na(tss.df$padj) & tss.df$padj < 0.1 & tss.df$log2FoldChange > 0] = "Activated Genes"
tss.df$Class[!is.na(tss.df$padj) & tss.df$padj < 0.1 & tss.df$log2FoldChange < 0] = "Repressed Genes"
tss.df$Match[tss.df$Class == "Activated Genes"] = "Cases"
tss.df$Match = relevel(factor(tss.df$Match), ref = "Controls")

tpm.mat$DMSO_TPM = log(tpm.mat$DMSO_TPM, 10)
tpm.mat$dTAG_TPM = log(tpm.mat$dTAG_TPM, 10)
out = matchit(Match ~ DMSO_TPM, data = tss.df[tss.df$Class != "Repressed Genes",], method = "optimal", ratio = 1)
out2 = matchit(Match ~ DMSO_TPM, data = tss.df[tss.df$Class != "Other Genes",], method = "optimal", ratio = 1)
tss.df = tss.df[tss.df$Match == "Cases" | rownames(tss.df) %in% out$match.matrix | rownames(tss.df) %in% out2$match.matrix,]
rownames(tss.df) = NULL

Get ER peaks near the genes.

peaks_near_genes = bedTools.intersect(bed1 = tss.df, bed2 = counts, opt.string = "-wa -wb")
peaks_near_genes$Class = factor(peaks_near_genes$Class, levels = c("Activated Genes", "Other Genes", "Repressed Genes"))
peaks_near_genes = peaks_near_genes[,c(1:22)]

Plot.

pdf("ER_q200_ChIP_intensity_near_dynamic_ER_target_genes_100kb.pdf", width = 7, height = 3.5)
ggplot(data = peaks_near_genes, aes(x = Class, y = log(dTAG/DMSO, base = 2), color = Class),) +
  theme(legend.position = "none") +
  geom_violin() +
  geom_boxplot(width=.1, outlier.shape = NA) +
  geom_jitter(size = .01) +
  geom_hline(yintercept = 0, linetype="dashed") +
  scale_color_manual(values = c("red", "grey60", "blue")) +
  xlab(NULL) +
  ylab(expression("log"[2]~"(fold change in ER binding intensity)")) +
  xlab("Genes class defined by change in expression upon TRPS1 depletion") +
  ggtitle("Change in ER binding intensity in peaks near ER target genes") +
  theme_bw() +
  theme(legend.position="none", plot.title = element_text(hjust = 0.5))
dev.off()

T-tests.

Activated = peaks_near_genes[peaks_near_genes$Class == "Activated Genes",]
t.test(log(Activated$dTAG), log(Activated$DMSO), paired = T) #0.000000000000002274
Unchanged = peaks_near_genes[peaks_near_genes$Class == "Other Genes",]
t.test(log(Unchanged$dTAG), log(Unchanged$DMSO), paired = T) #0.3518
Repressed = peaks_near_genes[peaks_near_genes$Class == "Repressed Genes",]
t.test(log(Repressed$dTAG), log(Repressed$DMSO), paired = T) #0.000003224

11.6 ER binding intensity by TRPS1 proximity (Figure 5C)

Get the distance from each ER peak summit to the nearest TRPS1 peak summit.

q = 30
peaks = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ChIP/results/macs2/HA_DMSO_vs_dTAG_summits.bed", sep = "\t")
x <- read.table("Combined_HA_DMSO_vs_dTAG_peak_counts.txt", sep = '\t', header = TRUE)
peak_qvalue = x[,5]
peaks = bedTools.closest(bed1 = counts, bed2 = peaks[peak_qvalue > q,], opt.string = "-d")
peaks$status[!is.na(peaks$padj) & peaks$padj < 0.1 & peaks$log2FoldChange > 0] = "Increasing"
peaks$status[!is.na(peaks$padj) & peaks$padj < 0.1 & peaks$log2FoldChange < 0] = "Decreasing"
peaks$status[!is.na(peaks$padj) & peaks$padj >= 0.1] = "Not significantly changing"
peaks$status = factor(peaks$status, levels = c("Increasing", "Not significantly changing", "Decreasing"))
colnames(peaks)[colnames(peaks) == "V5"] = "q_value"

Bin the ER peaks by proximity to TRPS1 peaks.

peaks$Class = NULL
peaks$Class[peaks$dis < 200] = "< 200bp"
peaks$Class[peaks$dis >= 200 & peaks$dis <= 500] = "200 - 500bp"
peaks$Class[peaks$dis > 500] = "> 500bp"
peaks$Class = factor(peaks$Class, levels = c("< 200bp", "200 - 500bp", "> 500bp"))

Plot.

pdf("ER_ChIP_intensity_by_binned_HA_proximity.pdf", width = 7, height = 3.5)
ggplot(data = peaks[peaks$Class %in% levels(peaks$Class),], aes(x = Class, y = log(dTAG/DMSO, base = 2), color = Class),) +
  theme(legend.position = "none") +
  geom_hline(yintercept = 0, linetype="dashed") +
  geom_violin() +
  geom_boxplot(width=.1, outlier.shape = NA) +
  geom_jitter(size = .01) +
  geom_hline(yintercept = 0, linetype="dashed") +
  scale_color_manual(values = c("#FF0000", "grey60", "#0000FF")) +
  xlab("Distance to nearest TRPS1 peak") +
  ylab(expression("log"[2]~"(fold change in ER binding intensity)")) +
  ggtitle("Change in ER binding intensity by proximity to TRPS1") +
  theme_bw() +
  theme(legend.position="none", plot.title = element_text(hjust = 0.5))
dev.off()

T-tests.

Close = peaks[peaks$Class == "< 200bp",]
t.test(log(Close$dTAG), log(Close$DMSO), paired = T) #<0.00000000000000022
Intermediate = peaks[peaks$Class == "200 - 500bp",]
t.test(log(Intermediate$dTAG), log(Intermediate$DMSO), paired = T) #0.4901
Far = peaks[peaks$Class == "> 500bp",]
t.test(log(Far$dTAG), log(Far$DMSO), paired = T) #0.00001017

12 Outcomes associations

12.1 MA plot of gene expression at 24 hours dTAG vs. DMSO (Figure 6A)

Set up in R.

direc = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/"
setwd(direc)

library(lattice)
library(DESeq2)
library(viridis)
library(tidyverse)
library(latticeExtra)
library(msigdbr)
library(clusterProfiler)
library(fgsea)
library(enrichplot)
library(clusterProfiler)
library(DEGreport)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)

source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')

filter.deseq.into.bed <- function(deseq.df, gene.file, cat = 'R1881 Activated') {
  deseq.df = deseq.df[deseq.df$status == cat,]
  #print(dim(deseq.df))
  #scientific notation was messing this up occasionally
  x = gene.file$V4
  #print(length(x))
  y = gene.file[x %in% rownames(deseq.df),]
  #print(dim(y))
  z = get.tss(y)
  #print(dim(z))
  return(z)
}

#bedTools.closest <- function(functionstring="/anaconda3/bin/closestBed",bed1,bed2,opt.string="") {
bedTools.closest <- function(functionstring="/usr/local/anaconda3/bin/closestBed",bed1,bed2,opt.string="") {
  
  options(scipen =99) # not to use scientific notation when writing out
  
  #write bed formatted dataframes to tempfile
  write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  
  # create the command string and call the command using system()
  command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
  cat(command1,"\n")
  try(system(command1))
  command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
  cat(command2,"\n")
  try(system(command2))
  
  command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
  cat(command,"\n")
  try(system(command))
  
  res=read.table('out.file.bed',header=F, comment.char='')
  
  command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
  cat(command3,"\n")
  try(system(command3))
  
  colnames(res) = c(colnames(bed1), colnames(bed2)[1:ncol(bed2)], 'dis' )
  return(res)
}

#color scheme
col.lines = c("red", "blue", "grey60")

#For converting gene names
gene.file = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/Homo_sapiens.GRCh38.104.bed", sep = '\t', header = FALSE)
`%notin%` <- Negate(`%in%`)
gene.file = gene.file[gene.file$V5 %notin% c("havana", "havana_tagene", "ensembl_havana"),]
gene.symbol = gene.file[,c(4,5)]
colnames(gene.symbol) = c("gene", "symbol")

Identify differentially expressed genes at 24 hours, shrinking the fold changes.

x = read.table("TRPS1_timecourse_PRO_gene_counts_combined.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]

time = factor(sapply(strsplit(colnames(x), '_'), '[', 3), levels = c("0hour", "0.5hour", "1hour", "2hour", "4hour", "24hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(time, rep), ~ rep + time)
deseq.df = DESeq(deseq.df)

i = "24hour"
res.deseq = results(deseq.df, contrast = c("time", i, "0hour"))
res.deseq = res.deseq[rownames(res.deseq) %in% gene.symbol$gene,]
res.deseq = merge(as.data.frame(res.deseq), gene.symbol, by.x = 0, by.y = "gene")

res.deseq.shrink <- lfcShrink(deseq.df, coef = "time_24hour_vs_0hour", type="ashr")
res.deseq.shrink = res.deseq.shrink[rownames(res.deseq.shrink) %in% gene.symbol$gene,]
sum(res.deseq.shrink$padj < 0.1 & !is.na(res.deseq.shrink$padj))

df.deseq.effects.lattice = res.deseq.shrink
df.deseq.effects.lattice$status = "Other"
df.deseq.effects.lattice$status[res.deseq.shrink$padj < 0.1 & !is.na(res.deseq.shrink$padj) & res.deseq.shrink$log2FoldChange > 0] = "Activated"
df.deseq.effects.lattice$status[res.deseq.shrink$padj < 0.1 & !is.na(res.deseq.shrink$padj) & res.deseq.shrink$log2FoldChange < 0] = "Repressed"
df.deseq.effects.lattice$status = factor(df.deseq.effects.lattice$status, levels = c("Other", "Activated", "Repressed"))

Plot.

pdf("MA_PRO_24hour_ashr_for_figure.pdf", width = 7, height = 3.5)
ggplot(as.data.frame(df.deseq.effects.lattice) %>% arrange(status), aes(x = log(baseMean, 10), y = log2FoldChange, color = status)) +
  geom_point() +
  scale_color_manual(values = c("grey60", "red", "blue")) +
  geom_hline(yintercept = 0, linetype="dashed") +
  ylim(-1, 1) +
  xlim(0, 5) +
  ylab(expression("Shrunken log"[2]~"(fold change in PRO signal)")) +
  xlab(expression("log"[10]~"(mean of normalized intensity)")) +
  ggtitle("Change in expression 24 hours after TRPS1 depletion") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(color=NULL)
dev.off()

12.2 24 hour gene set enrichment analysis (Figure 6B)

Rank genes based on shrunken fold change.

original_gene_list <- res.deseq.shrink$log2FoldChange
names(original_gene_list) <- rownames(res.deseq.shrink)
gene_list<-na.omit(original_gene_list)
gene_list = sort(gene_list, decreasing = TRUE)

h_gene_sets = msigdbr(species = "Homo sapiens", category = "H")
msigdbr_list = split(x = h_gene_sets$ensembl_gene, f = h_gene_sets$gs_name)
set.seed(0)
fgseaRes <- fgsea(msigdbr_list, gene_list, eps = 0)

fgseaResTidy <- fgseaRes %>%
  as_tibble() %>%
  arrange(desc(NES))

print(fgseaResTidy[fgseaResTidy$padj < 0.1,], n = 100)
gse_top = fgseaResTidy[fgseaResTidy$padj < 0.1,]

Plot.

#padj 2.45e-19
pdf("Mountain_plot_24hour_E2F_targets.pdf", height = 5)
plotEnrichment(msigdbr_list[["HALLMARK_E2F_TARGETS"]], gene_list) + 
  ylim(-.9, .1) + 
  labs(title="Hallmark E2F Targets")
dev.off()

12.3 Growth curves (Figures 6C and S10)

Set up in R.

direc = "~/Library/CloudStorage/Box-Box/GuertinLab/Cell_Counts/"
setwd(direc)

library(readxl)
library(tidyr)
library(plyr)
library(dplyr)
library(ggplot2)
library(viridis)

summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95, .drop=TRUE) {
  # New version of length which can handle NA's: if na.rm==T, don't count them
  length2 <- function (x, na.rm=FALSE) {
    if (na.rm) sum(!is.na(x))
    else       length(x)
  }
  # This does the summary. For each group's data frame, return a vector with
  # N, mean, and sd
  datac <- ddply(data, groupvars, .drop=.drop,
                 .fun = function(xx, col) {
                   c(N    = length2(xx[[col]], na.rm=na.rm),
                     mean = mean   (xx[[col]], na.rm=na.rm),
                     sd   = sd     (xx[[col]], na.rm=na.rm)
                   )
                 },
                 measurevar
  )
  # Rename the "mean" column    
  datac <- plyr::rename(datac, c("mean" = measurevar))
  datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean
  # Confidence interval multiplier for standard error
  # Calculate t-statistic for confidence interval: 
  # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
  ciMult <- qt(conf.interval/2 + .5, datac$N-1)
  datac$ci <- datac$se * ciMult
  return(datac)
}

Read in the counts data.

Cell_Counts <- read_excel("220831_dTAG_TRPS1_reps_3_to_6.xlsx")

Convert to cells/well.

Dilution_factor <- 10000*Cell_Counts$mL/9
Cell_Counts <- Cell_Counts[,-6]
Cell_Counts[,3:5] = Cell_Counts[,3:5]*Dilution_factor

Average the technical replicates.

Cell_Counts <- pivot_longer(Cell_Counts, colnames(Cell_Counts)[3:5], names_to = "Condition", values_to = "Count")
Cell_Counts <- summarySE(Cell_Counts, measurevar="Count", groupvars=c("Condition", "Day", "Rep"))
Cell_Counts <- Cell_Counts[,c(1,2,3,5)]
Cell_Counts <- na.omit(Cell_Counts)

Calculate mean and standard error for the biological replicates.

Summary <- summarySE(Cell_Counts, measurevar="Count", groupvars=c("Condition", "Day"))

Fit a linear model to the data with and without an interaction term for the effect of dTAG.

pd <- position_dodge(0.1)
ylim <- c(0, 500000)
fit = lm(log(Count, base = 2) ~ Day*Condition, data = Cell_Counts[Cell_Counts$Condition != "Untreated",])
fit_no_interaction = lm(log(Count, base = 2) ~ Day + Condition, data = Cell_Counts[Cell_Counts$Condition != "Untreated",])
anova(fit_no_interaction, fit) # p = 1.01e-05
predict = predict(fit, interval = "confidence")
Cell_Counts$lwr[Cell_Counts$Condition != "Untreated"] = predict[,2]
Cell_Counts$upr[Cell_Counts$Condition != "Untreated"] = predict[,3]

Plot.

pdf("dTAG_TRPS1_reps3_to_6_curve_fit_confidence.pdf", width = 14)
ggplot(Cell_Counts, aes(x=Day, y=Count, colour=Condition, fill = Condition)) + 
  geom_point(data=Cell_Counts, size=3, shape=21) +
  stat_function(fun=function(x) 2^(as.numeric(fit$coefficients[1]) + as.numeric(fit$coefficients[2])*x), color = "Black") +
  stat_function(fun=function(x) 2^(as.numeric(fit$coefficients[1] + fit$coefficients[3]) + as.numeric(fit$coefficients[2] + fit$coefficients[4])*x), color = "Red") +
  geom_errorbar(aes(ymin=2^lwr, ymax=2^upr), width=.1) +
  scale_colour_manual(values = c("Black", "Red", "Grey"), labels = c("DMSO", "dTAG", "Untreated")) +   
  scale_fill_manual(values = c("Black", "Red", "Grey"), labels = c("DMSO", "dTAG", "Untreated")) +    
  scale_x_continuous(breaks = c(0, 3, 7, 10, 14), minor_breaks = seq(0,14,1)) +
  ylab("Cell Number") +
  theme_bw() +
  theme(legend.justification = "center",
        legend.position = c(0.2, 0.7),
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5),
        legend.text = element_text(size=24),
        axis.text=element_text(size=12),
        axis.title=element_text(size=24,face="bold"))      
dev.off()

Repeat with parental T47D counts.

#Read data
Cell_Counts <- read_excel("231128_dTAG_T47D.xlsx")

#Convert to cells/well
Dilution_factor <- 10000*2*Cell_Counts$mL/9
Cell_Counts <- Cell_Counts[,-6]
Cell_Counts[,3:5] = Cell_Counts[,3:5]*Dilution_factor

#Average technical replicates
Cell_Counts <- pivot_longer(Cell_Counts, colnames(Cell_Counts)[3:5], names_to = "Condition", values_to = "Count")
Cell_Counts <- summarySE(Cell_Counts, measurevar="Count", groupvars=c("Condition", "Day", "Rep"))
Cell_Counts <- Cell_Counts[,c(1,2,3,5)]
Cell_Counts <- na.omit(Cell_Counts)

#Mean and SEM for biological replicates
Summary <- summarySE(Cell_Counts, measurevar="Count", groupvars=c("Condition", "Day"))

#Plot
pd <- position_dodge(0.1)
ylim <- c(0, 500000)
fit = lm(log(Count, base = 2) ~ Day*Condition, data = Cell_Counts[Cell_Counts$Condition != "Untreated",])
fit_no_interaction = lm(log(Count, base = 2) ~ Day + Condition, data = Cell_Counts[Cell_Counts$Condition != "Untreated",])
anova(fit_no_interaction, fit) # p = 0.9411
predict = predict(fit, interval = "confidence")
Cell_Counts$lwr[Cell_Counts$Condition != "Untreated"] = predict[,2]
Cell_Counts$upr[Cell_Counts$Condition != "Untreated"] = predict[,3]
pdf("dTAG_T47D_curve_fit_confidence.pdf", width = 14)
ggplot(Cell_Counts, aes(x=Day, y=Count, colour=Condition, fill = Condition)) + 
  geom_point(data=Cell_Counts, size=3, shape=21) +
  stat_function(fun=function(x) 2^(as.numeric(fit$coefficients[1]) + as.numeric(fit$coefficients[2])*x), color = "Black") +
  stat_function(fun=function(x) 2^(as.numeric(fit$coefficients[1] + fit$coefficients[3]) + as.numeric(fit$coefficients[2] + fit$coefficients[4])*x), color = "Red") +
  geom_errorbar(aes(ymin=2^lwr, ymax=2^upr), width=.1) +
  scale_colour_manual(values = c("Black", "Red", "Grey"), labels = c("DMSO", "dTAG", "Untreated")) +   
  scale_fill_manual(values = c("Black", "Red", "Grey"), labels = c("DMSO", "dTAG", "Untreated")) +    
  scale_x_continuous(breaks = c(0, 3, 7, 10, 14), minor_breaks = seq(0,14,1)) +
  ylab("Cell Number") +
  theme_bw() +
  theme(legend.justification = "center",
        legend.position = c(0.2, 0.7),
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5),
        legend.text = element_text(size=24),
        axis.text=element_text(size=12),
        axis.title=element_text(size=24,face="bold"))      
dev.off()

12.4 Plot example patient with high TRPS1 activity score (Figure 6D)

Set up in R.

direc = "~/Library/CloudStorage/Box-Box/GuertinLab/TRPS1_timecourse_PRO_3/"
setwd(direc)

library(DESeq2)
library(RTN)
library(RTNsurvival)
library(pheatmap)
library(Fletcher2013b)

#For converting gene names
gene.file = read.table("~/Library/CloudStorage/Box-Box/GuertinLab/ER_Antagonists_R/Homo_sapiens.GRCh38.104.bed", sep = '\t', header = FALSE)
`%notin%` <- Negate(`%in%`)
gene.file = gene.file[gene.file$V5 %notin% c("havana", "havana_tagene", "ensembl_havana"),]
gene.symbol = gene.file[,c(4,5)]
colnames(gene.symbol) = c("gene", "symbol")

Identify primary TRPS1-responsive genes.

load("TRPS1_timecourse_PRO_deseq.df.RData")
i = "0.5hour"
res.deseq = results(deseq.df, contrast = c("time", i, "0hour"))
res.deseq = res.deseq[rownames(res.deseq) %in% gene.symbol$gene,]
res.deseq = merge(as.data.frame(res.deseq), gene.symbol, by.x = 0, by.y = "gene")

activated = res.deseq$symbol[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0] 
repressed = res.deseq$symbol[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0] 

Define the TRPS1 regulon based on our own primary gene set.

data("rtni1st")
TRPS1 = rtni1st@regulatoryElements["TRPS1"]
ILMN_SYMBOL = rtni1st@rowAnnotation
rtni1st@results$tn.dpi[,TRPS1] <- 0
rtni1st@results$tn.dpi[,TRPS1][names(rtni1st@results$tn.dpi[,TRPS1]) %in% 
                                 ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% activated]] <- -1
rtni1st@results$tn.dpi[,TRPS1][names(rtni1st@results$tn.dpi[,TRPS1]) %in% 
                                 ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% repressed]] <- 1

tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time", event = "event", keycovar =c("Age","Grade"))
tns1st <-tnsGSEA2(tns1st)

data("rtni2nd")
TRPS1 = rtni2nd@regulatoryElements["TRPS1"]
ILMN_SYMBOL = rtni2nd@rowAnnotation
rtni2nd@results$tn.dpi[,TRPS1] <- 0
rtni2nd@results$tn.dpi[,TRPS1][names(rtni2nd@results$tn.dpi[,TRPS1]) %in% 
                                 ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% activated]] <- -1
rtni2nd@results$tn.dpi[,TRPS1][names(rtni2nd@results$tn.dpi[,TRPS1]) %in% 
                                 ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% repressed]] <- 1

tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time", event = "event", keycovar =c("Age", "Grade"))
tns2nd <-tnsGSEA2(tns2nd)

Merge the two halves of the cohort.

tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
                                                      tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
                                                  tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
                                                  tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
                                                tns2nd@results$regulonActivity$status)

Get regulon activity and sample attributes.

regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")

Run Kaplan-Meier analysis.

tns_both <-tnsKM(tns_both)

Identify the patient with the highest TRPS1 activity score.

max = rownames(tns_both@results$regulonActivity$differential)[tns_both@results$regulonActivity$differential == max(tns_both@results$regulonActivity$differential)]

Plot.

pdf("RTNsurvival_GSEA2_TRPS1_high_regulon_merged_METABRIC_30minutes_FDR_0.1.pdf", height = 3, width = 3)
tnsPlotGSEA2(tns_both, max, regs = "TRPS1")
dev.off()

12.5 Plot distribution of TRPS1 activity scores and Kaplan-Meier curves binned by TRPS1 activity score (Figure 6E,F)

Plot.

#Logrank p-value 8.3e-6
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()

12.6 Plot Kaplan-Meier curves subdivided by ER-positivity or intrinsic subtype (Figure S10)

#Just ER positive
data("rtni1st")
TRPS1 = rtni1st@regulatoryElements["TRPS1"]
ILMN_SYMBOL = rtni1st@rowAnnotation
rtni1st@results$tn.dpi[,TRPS1] <- 0
rtni1st@results$tn.dpi[,TRPS1][names(rtni1st@results$tn.dpi[,TRPS1]) %in% 
                                 ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% activated]] <- -1
rtni1st@results$tn.dpi[,TRPS1][names(rtni1st@results$tn.dpi[,TRPS1]) %in% 
                                 ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% repressed]] <- 1

tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time", 
                           event = "event", keycovar =c("Age","Grade")
                           , excludeAttribs = "ER-"
                           )
tns1st <-tnsGSEA2(tns1st)

data("rtni2nd")
TRPS1 = rtni2nd@regulatoryElements["TRPS1"]
ILMN_SYMBOL = rtni2nd@rowAnnotation
rtni2nd@results$tn.dpi[,TRPS1] <- 0
rtni2nd@results$tn.dpi[,TRPS1][names(rtni2nd@results$tn.dpi[,TRPS1]) %in% 
                                 ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% activated]] <- -1
rtni2nd@results$tn.dpi[,TRPS1][names(rtni2nd@results$tn.dpi[,TRPS1]) %in% 
                                 ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% repressed]] <- 1

tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time", 
                           event = "event", keycovar =c("Age", "Grade")
                           , excludeAttribs = "ER-"
                           )
tns2nd <-tnsGSEA2(tns2nd)

#Merge
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
                                                      tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
                                                  tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
                                                  tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
                                                tns2nd@results$regulonActivity$status)

#Get regulon activity and sample attributes
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")

#Run KM and Cox analysis
tns_both <-tnsKM(tns_both)
tns_both <-tnsCox(tns_both)

#Logrank p-value 3.86e-6
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint_just_ER_positive.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()


#Just ER negative
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time", 
                           event = "event", keycovar =c("Age","Grade"), 
                           excludeAttribs = "ER+")
tns1st <-tnsGSEA2(tns1st)

tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time", 
                           event = "event", keycovar =c("Age","Grade"), 
                           excludeAttribs = "ER+")
tns2nd <-tnsGSEA2(tns2nd)

#Merge
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
                                                      tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
                                                  tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
                                                  tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
                                                tns2nd@results$regulonActivity$status)

#Get regulon activity and sample attributes
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")

#Run KM and Cox analysis
tns_both <-tnsKM(tns_both)
tns_both <-tnsCox(tns_both)

#Logrank p-value 
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint_just_ER_negative.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()




#Just Basal
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time", 
                           event = "event", keycovar =c("Age","Grade"), 
                           excludeAttribs = c("LumA", "LumB", "Her2", "Normal"))
tns1st <-tnsGSEA2(tns1st)

tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time", 
                           event = "event", keycovar =c("Age","Grade"), 
                           excludeAttribs = c("LumA", "LumB", "Her2", "Normal"))
tns2nd <-tnsGSEA2(tns2nd)

#Merge
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
                                                      tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
                                                  tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
                                                  tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
                                                tns2nd@results$regulonActivity$status)

#Get regulon activity and sample attributes
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")

#Run KM and Cox analysis
tns_both <-tnsKM(tns_both)
tns_both <-tnsCox(tns_both)

#Logrank p-value 
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint_just_Basal.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()



#Just Normal
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time", 
                           event = "event", keycovar =c("Age","Grade"), 
                           excludeAttribs = c("LumA", "LumB", "Her2", "Basal"))
tns1st <-tnsGSEA2(tns1st)

tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time", 
                           event = "event", keycovar =c("Age","Grade"), 
                           excludeAttribs = c("LumA", "LumB", "Her2", "Basal"))
tns2nd <-tnsGSEA2(tns2nd)

#Merge
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
                                                      tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
                                                  tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
                                                  tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
                                                tns2nd@results$regulonActivity$status)

#Get regulon activity and sample attributes
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")

#Run KM and Cox analysis
tns_both <-tnsKM(tns_both)
tns_both <-tnsCox(tns_both)

#Logrank p-value 
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint_just_Normal.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()



#Just Her2-amplified
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time", 
                           event = "event", keycovar =c("Age","Grade"), 
                           excludeAttribs = c("LumA", "LumB", "Normal", "Basal"))
tns1st <-tnsGSEA2(tns1st)

tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time", 
                           event = "event", keycovar =c("Age","Grade"), 
                           excludeAttribs = c("LumA", "LumB", "Normal", "Basal"))
tns2nd <-tnsGSEA2(tns2nd)

#Merge
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
                                                      tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
                                                  tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
                                                  tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
                                                tns2nd@results$regulonActivity$status)

#Get regulon activity and sample attributes
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")

#Run KM and Cox analysis
tns_both <-tnsKM(tns_both)
tns_both <-tnsCox(tns_both)

#Logrank p-value 
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint_just_HER2.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()




#Just luminal A
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time", 
                           event = "event", keycovar =c("Age","Grade"), 
                           excludeAttribs = c("Basal", "Her2", "Normal", "LumB"))
tns1st <-tnsGSEA2(tns1st)

tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time", 
                           event = "event", keycovar =c("Age","Grade"), 
                           excludeAttribs = c("Basal", "Her2", "Normal", "LumB"))
tns2nd <-tnsGSEA2(tns2nd)

#Merge
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
                                                      tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
                                                  tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
                                                  tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
                                                tns2nd@results$regulonActivity$status)

#Get regulon activity and sample attributes
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")

#Run KM and Cox analysis
tns_both <-tnsKM(tns_both)
tns_both <-tnsCox(tns_both)

#Logrank p-value 4.23e-4
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint_just_luminal_A.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()



#Just luminal B
tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time", 
                           event = "event", keycovar =c("Age","Grade"), 
                           excludeAttribs = c("Basal", "Her2", "Normal", "LumA"))
tns1st <-tnsGSEA2(tns1st)

tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time", 
                           event = "event", keycovar =c("Age","Grade"), 
                           excludeAttribs = c("Basal", "Her2", "Normal", "LumA"))
tns2nd <-tnsGSEA2(tns2nd)

#Merge
tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
                                                      tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
                                                  tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
                                                  tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
                                                tns2nd@results$regulonActivity$status)

#Get regulon activity and sample attributes
regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")

#Run KM and Cox analysis
tns_both <-tnsKM(tns_both)
tns_both <-tnsCox(tns_both)

#Logrank p-value 2.34e-1
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_30minutes_FDR_0.1_no_endpoint_just_luminal_B.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()

12.7 Plot distribution of TRPS1 activity scores and Kaplan-Meier curves binned by TRPS1 activity score using genes differentially expressed at 24 hours (Figure S11)

Identify TRPS1-responsive genes.

x = read.table("TRPS1_timecourse_PRO_gene_counts_combined.txt", sep = '\t', header = TRUE)
rownames(x) = x[,1]
x = x[,seq(2,to=ncol(x),by=2)]

time = factor(sapply(strsplit(colnames(x), '_'), '[', 3), levels = c("0hour", "0.5hour", "1hour", "2hour", "4hour", "24hour"))
rep = factor(sapply(strsplit(colnames(x), 'rep'), '[', 2))
deseq.df = DESeqDataSetFromMatrix(x, cbind.data.frame(time, rep), ~ rep + time)
deseq.df = DESeq(deseq.df)

i = "24hour"
res.deseq = results(deseq.df, contrast = c("time", i, "0hour"))
res.deseq = res.deseq[rownames(res.deseq) %in% gene.symbol$gene,]
res.deseq = merge(as.data.frame(res.deseq), gene.symbol, by.x = 0, by.y = "gene")

Define the TRPS1 regulon based on this gene set.

data("rtni1st")
TRPS1 = rtni1st@regulatoryElements["TRPS1"]
ILMN_SYMBOL = rtni1st@rowAnnotation
rtni1st@results$tn.dpi[,TRPS1] <- 0
rtni1st@results$tn.dpi[,TRPS1][names(rtni1st@results$tn.dpi[,TRPS1]) %in% 
                                 ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% activated]] <- -1
rtni1st@results$tn.dpi[,TRPS1][names(rtni1st@results$tn.dpi[,TRPS1]) %in% 
                                 ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% repressed]] <- 1

tns1st <-tni2tnsPreprocess(tni = rtni1st, regulatoryElements = c("TRPS1"), time = "time", event = "event", keycovar =c("Age","Grade"))
tns1st <-tnsGSEA2(tns1st)

data("rtni2nd")
TRPS1 = rtni2nd@regulatoryElements["TRPS1"]
ILMN_SYMBOL = rtni2nd@rowAnnotation
rtni2nd@results$tn.dpi[,TRPS1] <- 0
rtni2nd@results$tn.dpi[,TRPS1][names(rtni2nd@results$tn.dpi[,TRPS1]) %in% 
                                 ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% activated]] <- -1
rtni2nd@results$tn.dpi[,TRPS1][names(rtni2nd@results$tn.dpi[,TRPS1]) %in% 
                                 ILMN_SYMBOL$PROBEID[ILMN_SYMBOL$SYMBOL %in% repressed]] <- 1

tns2nd <-tni2tnsPreprocess(tni = rtni2nd, regulatoryElements = c("TRPS1"), time = "time", event = "event", keycovar =c("Age", "Grade"))
tns2nd <-tnsGSEA2(tns2nd)

Merge the two halves of the cohort.

tns_both = tns1st
tns_both@survivalData <- rbind(tns1st@survivalData[,-ncol(tns1st@survivalData)], tns2nd@survivalData)
tns_both@results$regulonActivity$differential = rbind(tns1st@results$regulonActivity$differential,
                                                      tns2nd@results$regulonActivity$differential)
tns_both@results$regulonActivity$positive = rbind(tns1st@results$regulonActivity$positive,
                                                  tns2nd@results$regulonActivity$positive)
tns_both@results$regulonActivity$negative = rbind(tns1st@results$regulonActivity$negative,
                                                  tns2nd@results$regulonActivity$negative)
tns_both@results$regulonActivity$status = rbind(tns1st@results$regulonActivity$status,
                                                tns2nd@results$regulonActivity$status)

Get regulon activity and sample attributes.

regact_gsea <-tnsGet(tns_both, "regulonActivity")$dif
sdata <-tnsGet(tns_both, "survivalData")
attribs <-c("ER+", "ER-","LumA","LumB","Basal","Her2","Normal")

Run Kaplan-Meier analysis.

tns_both <-tnsKM(tns_both)

Plot.

#Logrank p-value 4.99e-4
pdf("RTNsurvival_TRPS1_regulon_merged_METABRIC_after_GSEA_24hour_FDR_0.1_no_endpoint.pdf", height = 3, width = 6)
tnsPlotKM(tns_both, regs = "TRPS1", attribs = attribs, panelWidths=c(3,1,4), width = 6)
dev.off()

13 Allele sequencing

As it turns out, our degron insertion was heterozygous, and T47D cells in fact have four genomic copies of TRPS1. We have determined that we inadvertently created two to three knockout alleles in each of the three clones. However, we were more concerned with TRPS1 protein expression than genomic copies edited. In Figure 2A, the tagged alleles are visualized with the TRPS1 antibody, and there is no detectable untagged TRPS1. Even though we knocked out 2 or 3 alleles, we found that TRPS1 negatively regulates its own expression (TRPS1 is the most significantly activated gene upon TRPS1 depletion as measured by PRO-seq after TRPS1 degradation at any time point). This is a possible explanation for why the protein levels are comparable to the progenitor despite having fewer than 4 functional alleles.

We used primers flanking the insertion site for PCR of genomic DNA from each of the three clones, along with the parental T47D cells. We performed TOPO cloning and picked colonies for Sanger sequencing. The predicted amplicon for the unedited parental cells is as follows:

TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT

The predicted amplicon for a perfectly tagged allele is an insertion directly after the start codon, where Cas9 cuts.

TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGAAAAAGCCTGAACTCACCGCGACGTCTGTCGAGAAGTTTCTGATCGAAAAGTTCGACAGCGTCTCCGACCTGATGCAGCTCTCGGAGGGCGAAGAATCTCGTGCTTTCAGCTTCGATGTAGGAGGGCGTGGATATGTCCTGCGGGTAAATAGCTGCGCCGATGGTTTCTACAAAGATCGTTATGTTTATCGGCACTTTGCATCGGCCGCGCTCCCGATTCCGGAAGTGCTTGACATTGGGGAATTCAGCGAGAGCCTGACCTATTGCATCTCCCGCCGTGCACAGGGTGTCACGTTGCAAGACCTGCCTGAAACCGAACTGCCCGCTGTTCTGCAGCCGGTCGCGGAGGCCATGGATGCGATCGCTGCGGCCGATCTTAGCCAGACGAGCGGGTTCGGCCCATTCGGACCGCAAGGAATCGGTCAATACACTACATGGCGTGATTTCATATGCGCGATTGCTGATCCCCATGTGTATCACTGGCAAACTGTGATGGACGACACCGTCAGTGCGTCCGTCGCGCAGGCTCTCGATGAGCTGATGCTTTGGGCCGAGGACTGCCCCGAAGTCCGGCACCTCGTGCACGCGGATTTCGGCTCCAACAATGTCCTGACGGACAATGGCCGCATAACAGCGGTCATTGACTGGAGCGAGGCGATGTTCGGGGATTCCCAATACGAGGTCGCCAACATCTTCTTCTGGAGGCCGTGGTTGGCTTGTATGGAGCAGCAGACGCGCTACTTCGAGCGGAGGCATCCGGAGCTTGCAGGATCGCCGCGGCTCCGGGCGTATATGCTCCGCATTGGTCTTGACCAACTCTATCAGAGCTTGGTTGACGGCAATTTCGATGATGCAGCTTGGGCGCAGGGTCGATGCGACGCAATCGTCCGATCCGGAGCCGGGACTGTCGGGCGTACACAAATCGCCCGCAGAAGCGCGGCCGTCTGGACCGATGGCTGTGTAGAAGTACTCGCCGATAGTGGAAACCGACGCCCCAGCACTCGTCCGGATCGGGAGATGGGGGAGGCTAACGGAAGCGGAGCTACTAACTTCAGCCTGCTGAAGCAGGCTGGAGACGTGGAGGAGAACCCTGGACCTTACCCCTACGACGTGCCCGACTACGCCGGCTATCCGTATGATGTCCCGGACTATGCAGGAAGCGGAGGAGTGCAGGTGGAAACCATCTCCCCAGGAGACGGGCGCACCTTCCCCAAGCGCGGCCAGACCTGCGTGGTGCACTACACCGGGATGCTTGAAGATGGAAAGAAAGTTGATTCCTCCCGGGACAGAAACAAGCCCTTTAAGTTTATGCTAGGCAAGCAGGAGGTGATCCGAGGCTGGGAAGAAGGGGTTGCCCAGATGAGTGTGGGTCAGAGAGCCAAACTGACTATATCTCCAGATTATGCCTATGGTGCCACTGGGCACCCAGGCATCATCCCACCACATGCCACTCTCGTCTTCGATGTGGAGCTTCTAAAACTGGAAAGATCTGGTGGATCTGGAGGTTCAGGTTCAGGTGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT

13.1 Clone 28

The first allele is a knockout. There is an 18-base deletion of (TTTATCTTTGCAGATATG), the last 3 bases of which are the start codon of TRPS1.

TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT

The second allele is a knockout. There is a 70-base insertion between the TRPS1 start codon and the second codon. The first 6 bases are GCGTCA, and the next are from the px458 (Cas9-expressing) plasmid, AGGCCACCTCGTCCACGATGTTGCCGAAGATGGGGTGCCGCTCGTGCTTCTTATCCTCTTCCAC.

TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGGCGTCAAGGCCACCTCGTCCACGATGTTGCCGAAGATGGGGTGCCGCTCGTGCTTCTTATCCTCTTCCACGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT

The third allele produces the goal dTAG-TRPS1 protein. The allele differs from the perfect product above, missing two T’s in a string of 13 T’s over 100 bases upstream from the start codon, and missing amino acids 2-77 from the Hygromycin resistance (HygR) gene.

TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGCTTGACATTGGGGAATTCAGCGAGAGCCTGACCTATTGCATCTCCCGCCGTGCACAGGGTGTCACGTTGCAAGACCTGCCTGAAACCGAACTGCCCGCTGTTCTGCAGCCGGTCGCGGAGGCCATGGATGCGATCGCTGCGGCCGATCTTAGCCAGACGAGCGGGTTCGGCCCATTCGGACCGCAAGGAATCGGTCAATACACTACATGGCGTGATTTCATATGCGCGATTGCTGATCCCCATGTGTATCACTGGCAAACTGTGATGGACGACACCGTCAGTGCGTCCGTCGCGCAGGCTCTCGATGAGCTGATGCTTTGGGCCGAGGACTGCCCCGAAGTCCGGCACCTCGTGCACGCGGATTTCGGCTCCAACAATGTCCTGACGGACAATGGCCGCATAACAGCGGTCATTGACTGGAGCGAGGCGATGTTCGGGGATTCCCAATACGAGGTCGCCAACATCTTCTTCTGGAGGCCGTGGTTGGCTTGTATGGAGCAGCAGACGCGCTACTTCGAGCGGAGGCATCCGGAGCTTGCAGGATCGCCGCGGCTCCGGGCGTATATGCTCCGCATTGGTCTTGACCAACTCTATCAGAGCTTGGTTGACGGCAATTTCGATGATGCAGCTTGGGCGCAGGGTCGATGCGACGCAATCGTCCGATCCGGAGCCGGGACTGTCGGGCGTACACAAATCGCCCGCAGAAGCGCGGCCGTCTGGACCGATGGCTGTGTAGAAGTACTCGCCGATAGTGGAAACCGACGCCCCAGCACTCGTCCGGATCGGGAGATGGGGGAGGCTAACGGAAGCGGAGCTACTAACTTCAGCCTGCTGAAGCAGGCTGGAGACGTGGAGGAGAACCCTGGACCTTACCCCTACGACGTGCCCGACTACGCCGGCTATCCGTATGATGTCCCGGACTATGCAGGAAGCGGAGGAGTGCAGGTGGAAACCATCTCCCCAGGAGACGGGCGCACCTTCCCCAAGCGCGGCCAGACCTGCGTGGTGCACTACACCGGGATGCTTGAAGATGGAAAGAAAGTTGATTCCTCCCGGGACAGAAACAAGCCCTTTAAGTTTATGCTAGGCAAGCAGGAGGTGATCCGAGGCTGGGAAGAAGGGGTTGCCCAGATGAGTGTGGGTCAGAGAGCCAAACTGACTATATCTCCAGATTATGCCTATGGTGCCACTGGGCACCCAGGCATCATCCCACCACATGCCACTCTCGTCTTCGATGTGGAGCTTCTAAAACTGGAAAGATCTGGTGGATCTGGAGGTTCAGGTTCAGGTGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT

We did not identify a fourth unique allele in this pool of cloned PCR products. We turned to our three-clone PRO-seq data to determine if there was a fourth unique allele. We first made FLASH-merged .fastq files from overlapping reads to extend our effective read length.

The script used in the next code chunk (231226_grep_prep.sh) is below:

#! /bin/bash
main=/scratch/$USER/TRPS1_PRO
cd $main

UMI_length=8
filename=$1

module load cutadapt/3.4 pigz/2.7; PATH=$HOME/bin:$PATH

name=`basename $filename _PE1.fastq.gz`
gunzip ${name}_PE1.fastq.gz 
gunzip ${name}_PE2.fastq.gz
cutadapt -m $((UMI_length+2)) -O 1 -a TGGAATTCTCGGGTGCCAAGG ${name}_PE1.fastq \
  -o ${name}_PE1_noadap.fastq --too-short-output ${name}_PE1_short.fastq > ${name}_PE1_cutadapt.txt
cutadapt -m $((UMI_length+10)) -O 1 -a GATCGTCGGACTGTAGAACTCTGAAC ${name}_PE2.fastq \
  -o ${name}_PE2_noadap.fastq --too-short-output ${name}_PE2_short.fastq > ${name}_PE2_cutadapt.txt
seqtk seq -L $((UMI_length+10)) ${name}_PE1_noadap.fastq > ${name}_PE1_noadap_trimmed.fastq
fqdedup -i ${name}_PE1_noadap_trimmed.fastq -o ${name}_PE1_dedup.fastq
fastq_pair ${name}_PE1_dedup.fastq ${name}_PE2_noadap.fastq
flash -q ${name}_PE2_noadap.fastq.paired.fq ${name}_PE1_dedup.fastq.paired.fq -o ${name}
seqtk trimfq -e ${UMI_length} ${name}.extendedFrags.fastq > ${name}_for_grep.fastq
rm ${name}_*short.fastq ${name}_*noadap* ${name}_*dedup* ${name}.hist* ${name}.notCombined* \
  ${name}.extendedFrags.fastq
pigz ${name}_PE[1,2].fastq

Run the above script on each file in parallel.

ssh -Y ts2hx@rivanna.hpc.virginia.edu 
ijob -p standard -A gioeli_lab -t 4:00:00 -N 1 -n 1 --cpus-per-task=1

chmod +x /scratch/$USER/scripts/231226_grep_prep.sh

main=/scratch/$USER/TRPS1_PRO
cd $main

for filename in *_PE1.fastq.gz
do
  name=`basename $filename _PE1.fastq.gz`
  sbatch -p standard -A gioeli_lab -t 4:00:00 --cpus-per-task=1 --job-name ${name}_grep_prep \
    -o ${name}_grep_prep.log  --wrap="sh /scratch/$USER/scripts/231226_grep_prep.sh $filename"
  sleep 1 # wait 1 second between each job submission
done

Next we used grep to search for a unique sequence specific to the desired product and not found in the above alleles. CTTTGCAGATATGAAAAAGC is the designed junction between the 5’UTR and the start of the HygR gene. This produced a single unique result, with a duplication of 33 bases (the 30 bases upstream from the start codon and including the start codon) directly after the start codon in the ideally tagged allele from above.

grep "CTTTGCAGATATGAAAAAGC" T47D_Clone28_*_for_grep.fastq

We search upstream from this junction to find reads that are specific to this allele. We can see as far as 73 bases of the 5’ UTR.

grep "CTTTATCTTTGCAGATATGT" T47D_Clone28_*_for_grep.fastq

We next search downstream from the insertion junction to find reads that are specific to this allele. We can see as far as Leucine 28 of HygR.

grep "GAAAAAGCCTGAACTCACCG" T47D_Clone28_*_for_grep.fastq

We look downstream from L28, which should still include only reads that are specific to this allele. We can see as far as Phenylalanine 57.

grep "TCTCCGACCTGATGCAGCTC" T47D_Clone28_*_for_grep.fastq

We look downstream from F57, which should still include only reads that are specific to this allele. We can see as far as Serine 84.

grep "ATAGCTGCGCCGATGGTTTC" T47D_Clone28_*_for_grep.fastq

We look downstream from Valine 77, which should still include only reads that are specific to this allele. We can see as far as Aspartic acid 102.

grep "GCTTGACATTGGGGAATTCA" T47D_Clone28_*_for_grep.fastq

So from our PRO-seq reads, we can produce the following continuous sequence specific to this allele:

ATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGAAAAAGCCTGAACTCACCGCGACGTCTGTCGAGAAGTTTCTGATCGAAAAGTTCGACAGCGTCTCCGACCTGATGCAGCTCTCGGAGGGCGAAGAATCTCGTGCTTTCAGCTTCGATGTAGGAGGGCGTGGATATGTCCTGCGGGTAAATAGCTGCGCCGATGGTTTCTACAAAGATCGTTATGTTTATCGGCACTTTGCATCGGCCGCGCTCCCGATTCCGGAAGTGCTTGACATTGGGGAATTCAGCGAGAGCCTGACCTATTGCATCTCCCGCCGTGCACAGGGTGTCACGTTGCAAGAC

We designed primers predicted to anneal to each strand of the duplication event that should be specific for this allele (forward TTTATCTTTGCAGATATGTATCT and reverse GGAAAAGGCAGATACATATC). We also designed primers flanking the designed insertion site, closer to the insertion site than the original primers to extend how far we can sequence into the insertion (forward CCTCCGCTTCCTGAAAGTGA and reverse TCCTCCCTTACTGGGGCTTT). We used these primers for PCR of genomic DNA from Clone 28 and the parental T47D cells as a control. The reverse clone-specific primer and the forward flanking primer produced a band in Clone 28 and not in the parental cells. We submitted this purified PCR product for Sanger sequencing with the primers used in the PCR. However, this product appears to have amplified the third allele from above, as we found the two missing T’s in the string of 13 T’s.

Using two different annealing temperatures, we could not use the forward clone-specific primer and the reverse flanking primer to generate a PCR product specifically in Clone 28 and not in the parental cells. We could attempt to further optimize this PCR. However,wWe realized this is the only allele that has the beginning of the HygR gene. We performed a PCR using the HDR template PCR primer, the 3’ end of which contains the beginning of the HygR gene, along with the reverse flanking primer. We submitted this purified PCR product for Sanger sequencing with the primers used in the PCR. From the 5’ end, we can see the entirety of the HygR gene and into the P2A site.

GCGACGTCTGTCGAGAAGTTTCTGATCGAAAAGTTCGACAGCGTCTCCGACCTGATGCAGCTCTCGGAGGGCGAAGAATCTCGTGCTTTCAGCTTCGATGTAGGAGGGCGTGGATATGTCCTGCGGGTAAATAGCTGCGCCGATGGTTTCTACAAAGATCGTTATGTTTATCGGCACTTTGCATCGGCCGCGCTCCCGATTCCGGAAGTGCTTGACATTGGGGAATTCAGCGAGAGCCTGACCTATTGCATCTCCCGCCGTGCACAGGGTGTCACGTTGCAAGACCTGCCTGAAACCGAACTGCCCGCTGTTCTGCAGCCGGTCGCGGAGGCCATGGATGCGATCGCTGCGGCCGATCTTAGCCAGACGAGCGGGTTCGGCCCATTCGGACCGCAAGGAATCGGTCAATACACTACATGGCGTGATTTCATATGCGCGATTGCTGATCCCCATGTGTATCACTGGCAAACTGTGATGGACGACACCGTCAGTGCGTCCGTCGCGCAGGCTCTCGATGAGCTGATGCTTTGGGCCGAGGACTGCCCCGAAGTCCGGCACCTCGTGCACGCGGATTTCGGCTCCAACAATGTCCTGACGGACAATGGCCGCATAACAGCGGTCATTGACTGGAGCGAGGCGATGTTCGGGGATTCCCAATACGAGGTCGCCAACATCTTCTTCTGGAGGCCGTGGTTGGCTTGTATGGAGCAGCAGACGCGCTACTTCGAGCGGAGGCATCCGGAGCTTGCAGGATCGCCGCGGCTCCGGGCGTATATGCTCCGCATTGGTCTTGACCAACTCTATCAGAGCTTGGTTGACGGCAATTTCGATGATGCAGCTTGGGCGCAGGGTCGATGCGACGCAATCGTCCGATCCGGAGCCGGGACTGTCGGGCGTACACAAATCGCCCGCAGAAGCGCGGCCGTCTGGACCGATGGCTGTGTAGAAGTACTCGCCGATAGTGGAAACCGACGCCCCAGCACTCGTCCGGATCGGGAGATGGGGGAGGCTAACGGAAGCGGAGCTACTAACTTCAGCCTGCTGAAGCAGGCTGGAGACGTGGAGGAGAACC

We are awaiting sequencing of this PCR product from the other direction. In future, allele-specific primers can be designed to anneal to either strand of the 5’ end of the HygR gene (corresponding to amino acids 2-77), which should only be found in this allele.

13.2 Clone 35

The first allele is a knockout. Reading from the 5’ end, we see the 5’UTR except missing the last 4bp, then TRPS1 missing the first 242bp (including the start codon).

TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT

The second allele is a knockout. Reading from the 5’ end, we see the 5’UTR except missing the last 7bp, then TRPS1 missing the start codon.

TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT

We are still looking for up to two more unique alleles. Searching the PRO-seq data, we do see the designed junction between the 5’ UTR and the start of the HygR gene.

grep "CTTTGCAGATATGAAAAAGC" T47D_Clone35_*_for_grep.fastq

Upstream from this, we can see two distinct sequences, one with an unknown sequence (GCCCCTGGAACG) upstream from the last 50 bases of the 5’ UTR, and one with the end of dTAG and the first 2 bases of the linker (CGATGTGGAGCTTCTAAAACTGGAAAG) upstream from the last 50 bases of the 5’ UTR. Notably, the dTAG would not be in frame with HygR, so this is probably a knockout allele (unless there is another dTAG downstream on this allele and translation initiates at the HygR start codon).

We search for each of these unique junctions. The first only produced the read we saw before. This may actually be mis-sequencing, as there are similarities in the sequence, the junction is exactly the same, and the quality scores for the different bases are lower. Thus we hypothesize that these sequences correspond to one unique allele.

grep "CCCTGGAACGGTAA" T47D_Clone35_*_for_grep.fastq
grep -B 1 -A 2 "CCCTGGAACGGTAA" T47D_Clone35_*_for_grep.fastq

We search around the junction to find reads that are specific to this allele. Upstream we see through glycine 86 of FKBP12. Downstream we see through 16 bases of HygR.

grep "AACTGGAAAGGTAACTTTCA" T47D_Clone35_*_for_grep.fastq

More upstream we see through one base beyond Serine 77 of FKBP12.

grep "AGCTTCTAAAACTGGAAAGG" T47D_Clone35_*_for_grep.fastq

More downstream the farthest we can see is the read above that extends through 16 bases of HygR.

grep "GGTAACTTTCAGATAACACT" T47D_Clone35_*_for_grep.fastq

Looking for a fourth unique allele, we search for the junction of the linker and TRPS1, we find it and can see back through Valine 98 of FKBP12 and forward through one base beyond Arginine 10 of TRPS1. This would be in frame, as long as there is translation of an upstream HygR gene that is correctly initiated.

grep "AGGTTCAGGTGTCCGGAAAA" T47D_Clone35_*_for_grep.fastq

We search downstream from the start of TRPS1, requiring the upstream sequence be the linker. Here we find reads indicating that, after 35 bases of TRPS1, the sequence switches to an unknown sequence (GGCCTCAGTGAGCGAGCGAGCGCGCAGCTGCCTGC).

grep "GGTGTCCGGAAAAAGAACCC" T47D_Clone35_*_for_grep.fastq

We search for this sequence but only find the reads above.

grep "GAAACGTTGCGGCCTCAGT" T47D_Clone35_*_for_grep.fastq

13.3 Clone 39

The first allele is a knockout. There is a single base deletion of the first G after the ATG start codon.

TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT

The second allele is a knockout that involves a weird rearrangement. From the 5’ end, beginning where the HDR primer starts (what is before this?), we see the start codon of TRPS1, then an unknown 14bp sequence (CCTTCGGAATCGGG), then an inverted 5’ end of TRPS1 (the first 233bp after (not including) the start codon), then an unknown 34bp sequence (TTAACTTTCAGATACAGTGTTATCTGAAAGTTAA), and then the 5’ end of TRPS1 as expected after the start codon.

GTAACTTTCAGATAACACTGTATCTGCCTTTTCCCTTTATCTTTGCAGATATGCCTTCGGAATCGGGACTGCGCTTTTCAAGTCCTTCTTACTGCTAGAAGATGGATCTTGAACATGCAAGCTATGTTCCTCCTTATGATTTAGTTCTGCAGCATCACTCTGATCCGTATTTTCTGACATCTGATCTGCAGAAAATTCTTTGTTCTTTCCAGATACCTTGCTTTCTGTACCTATAGGCTCCAGGATCTGGCCCTCGCCTTCACTTGCAACGTTTCTCAGAGGGGGGTTCTTTTTCCGGACTTAACTTTCAGATACAGTGTTATCTGAAAGTTAAGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT

The third allele is a knockout. From the 5’ end, the 5’ UTR extends through 31 bases upstream from the start codon, then an unknown 50bp sequence (CAGGCTGATGACGTGGAGGAGATCCCTGGTCCTTACCCCTACGACGTGCC), then an inversion of (48bp of the 5’UTR (the wrong end of the HDR template) into the first 319bp of the HygR gene), then the 5’ end of TRPS1 as expected after the start codon.

TTTGTCATCCTCCGCTTCCTGAAAGTGATGAGGCAGTTTTCACTCTGTTTATAACATAGCTGTTTCTGTCTTTTTGACTATGAAGATATTTTGTTTGAAACTATATTTACATCCCGGTTACATAGGGTAACAATATCGAATTCCAATTTATGTTATTTATGTATACATATATGTACCATATAAATAAATGTGATGTATTCTATAACCATAAACACATATATGTATTTTATAAACATAACATGTGTACATGTTATATATAATATAGACCATATATAAATATATAAGCATGCATATATATTGTTAGCTACAATCAAAGGAAGAAATTGGGCAAAAATATATTTTCAGGAAAGAATAAGTAATAGGGAGGGTAAACCTAGTAATAAAAATAAATCCTTCATTATTCCTAAAACCACCTTAAAATAATAAATGAGCCAGTCTTGGTTACAAAAGTAATTCTAAATGTAAATTGTTTGTTTTAGTCTCAACTTTTTGCTGTTTTTAGAATGGGTTTTTTTTTTTTTCCTAGAGGAAGCATTTGATTATGCACCTTAAAAAACTCACATAAGTAAAATATTCCAGATTCAACTGTATCTCTGTAACTTTCAGATAACACTGCAGGCTGATGACGTGGAGGAGATCCCTGGTCCTTACCCCTACGACGTGCCCGGTTTCAGGCAGGTCTTGCAACGTGACACCCTGTGCACGGCGGGAGATGCAATAGGTCAGGCTCTCGCTGAATTCCCCAATGTCAAGCACTTCCGGAATCGGGAGCGCGGCCGATGCAAAGTGCCGATAAACATAACGATCTTTGTAGAAACCATCGGCGCAGCTATTTACCCGCAGGACATATCCACGCCCTCCTACATCGAAGCTGAAAGCACGAGATTCTTCGCCCTCCGAGAGCTGCATCAGGTCGGAGACGCTGTCGAACTTTTCGATCAGAAACTTCTCGACAGACGTCGCGGTGAGTTCAGGCTTTTTCATATCTGCAAAGATAAAGGGAAAAGGCAGATACAGTGTTATCTGAAAGTTGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGTATCTGGAAAGAACAAAGAATTTTCTGCAGATCAGATGTCAGAAAATACGGATCAGAGTGATGCTGCAGAACTAAATCATAAGGAGGAACATAGCTTGCATGTTCAAGATCCATCTTCTAGCAGTAAGAAGGACTTGAAAAGCGCAGTTCTGAGTGAGAAGGCTGGCTTCAATTATGAAAGCCCCAGTAAGGGAGGAAACTTTCCCTCCTTTCCGCATGAT

We are still looking for up to one more unique allele. Searching the PRO-seq data for the junction of the linker and TRPS1, we find it and can see back through one base beyond Histidine 87 of FKBP12 and forward through two bases beyond Lysine 29 of TRPS1. This would be in frame, as long as there is translation of an upstream HygR gene that is correctly initiated.

grep "AGGTTCAGGTGTCCGGAAAA" T47D_Clone39_*_for_grep.fastq

So from our PRO-seq reads, we can produce the following continuous sequence specific to this allele. In future, allele-specific primers can be designed to anneal to either strand of this sequence, which should only be found in this allele.

GCACCCAGGCATCATCCCACCACATGCCACTCTCGTCTTCGATGTGGAGCTTCTAAAACTGGAAAGATCTGGTGGATCTGGAGGTTCAGGTTCAGGTGTCCGGAAAAAGAACCCCCCTCTGAGAAACGTTGCAAGTGAAGGCGAGGGCCAGATCCTGGAGCCTATAGGTACAGAAAGCAAGGT