Contents

1 Review of analysis pipeline

ChIP-seq analysis workflow

Figure 1: ChIP-seq analysis workflow

2 TODAY: de novo Motif searching with MEME

An important piece of information that can be gained from genomic binding data is the discovery of the underlying DNA motif that is bound by the transcription factor. There are many tools out there that will allow you to discover motifs. The MEME analysis suite is a popular and powerful set of tools that can allow you to perform several useful analyses on DNA sequences of interest, such as:

2.1 MEME: Multiple Em for Motif Elucidation

MEME Website: http://meme-suite.org/

Several steps must be performed prior to MEME analysis. Here is a summary list:

1. Create bed file with peak locations to analyze.
- Short windows around summits.
- Only most intense peaks needed.
2. retrieve DNA sequences from regions to analyze
3. Create a background base frequency model (part of the MEME suite)

MEME requires DNA sequences (in fasta format) of your regions of interest. Long sequences can decrease accuracy and take a long time to search through, so it is best to choose narrow regions. When finding motifs within ChIP-seq peaks we can use our peak summits to center our analysis since most TFs will be found in the immediate vicinity of their peak summits. We can use bedtools slop to create small windows around our summits:

bedSlop

Figure 2: bedSlop

Let’s get 100bp window centered on peak summits (/home/FCAM/meds5420/data/ATF1/ATF1_peaks/ATF1_summits.bed) as follows (technically this is a 101 base window):

bedtools slop -i SUMMITS.BED -g <chrom_sizes> -b 50 > SUMMITS_100bp.bed
#-g is chromosome sizes file 
#-b The number of basebairs to add to both sides
#-l The number of base pairs to subtract from the start coordinate
#-r The number of base pairs to add from the end coordinate

Another way to reduce time for running MEME and to increase accuracy is to run it on a subset of the most intense peaks. The peak intensity is in column 5 of the peak summits file. The summits .bed file can be sorted and parsed as follows:

sort -n -r -k5 SUMMITS_100bp.bed | head -200 > SUMMITS_100bp_top200.bed 
# Performs numeric sort on column 5, reports it in 
# reverse (-r) order (descending) and outputs the first 200 to output file.

Next, we can use bedtools getfasta to get the sequence within the desired intervals and return them in fasta format.

fastaFromBed -fi GENOME.fa -bed SUMMITS_100bp.bed -fo SUMMITS_100bp.fa
#-name: Will use the 'name' column from bed file to name sequence
#-fi: Input fasta sequences to retrieve interval sequences from
#-bed: Input bed file
#-fo: Output fasta file

Fasta formatted genome sequences and the genome sizes file for this class can be found at:

/home/FCAM/meds5420/genomes/
If you get the error: could not open index file /home/FCAM/meds5420/genomes/hg38.fa.fai for writing!, then just copy the file to your directory and operate directly on the local file.

3 Loading meme

On Xanadu you should load meme version 5.4.1 every time. I worked with the HPC staff to ensure that meme/5.4.1 can be run with the -p option, which allows for parallelization.

module load meme/5.4.1

3.1 Running meme

Now we can run MEME on the .fasta formatted sequences.

I recommend running the classic function for more sensitive and wider motifs, but it takes longer. I employ -markov_order 3 in the command, which incorporates a background model on the fly using the file that specifies all k-mer frequencies up to the specified value. I typically just calculate it from the input file as opposed to random regions or the entire genome.

name=NAME_of EXPERIMENT
meme -oc ${name}.meme_output -objfun classic -evt 0.01 -searchsize 0 -minw 5 -maxw 18 -revcomp -dna -markov_order 3 -maxsize 100000000 SUMMITS_100bp.fa

There are a number of options to consider here:

  • -oc: Folder to create where files will be written. Allows overwriting of previously made folders (use -o to disable this)

  • -dna: Tell MEME to interpret symbols as nucleic acids (as opposed to amino acids)

  • -nmotifs: number of motifs to search for

  • -evt: stop searching for motifs when a motif cannot be found with a samller e-value (if nmotifs and evt are invoked, then evt will supercede the nmotifs option)  

  • -minw: minumum motif width in bases

  • -maxw: maximum motif width in bases

  • -revcomp: look for motif on both strands of DNA (do not assume strandedness)

  • -objfun: classic is a sensitive algorithm to detect motifs, but it can be slow

  • There are more options available. See manual or http://meme-suite.org/doc/meme.html?man_type=web.

The following files are placed in the designated output folder:

  • .xml file of output (machine readable)
  • .txt file of output (human readable)
  • .html file of output browser viewable file of output
  • .eps file of sequence logo for motif(s). Both forward and reverse complement

The output file contains:

  • The found motifs
  • A list of the sequences with the motifs with associated significance of the motifs
  • A Position Specific Weight Matrix (PSWM) of the motif that can be used to generate the sequence logo as well as search for the motif in other sequences.

3.2 Finding occurences of motifs with MAST: Motif Alignment and Search Tool

Given a motif position weight matrix, one can search other for sequences for the occurance of that motif. For instance, finding a motif under a ChIP-seq peak with MEME does not mean that the motif is present at every binding site. We can use MAST to determine the number of sequences or found binding sites that have a specific motif.

For usage go here: http://meme-suite.org/tools/mast
Click on manual bar on left and view command line version for MAST.

The basic usage is simple:

mast MEME_output.txt SUMMITS_100bp.fa -oc mast_OUT_FOLDER
# The first file is the meme output with motifs that you want to search for
# The second file is the fasta formatted ChIP-seq regions we want to search for motifs within
# The third is the output folder as in MEME

The MAST output format is similar to meme and is placed in the designated output folder:

  • .xml file of output (machine readable)
  • .txt file of output (human readable)
  • .html file of output browser viewable file of output

The output file contains the names of the sequences that have significant occurences of the motif.

Using the -hit_list option in mast creates a list of matches in a more easily processible format. However, this is sent to stdout and so must be redirected to a file.

#sending hits list to file by redirect
mast MEME_output.txt -hit_list SUMMITS_100bp.fa > mast_hits.txt

3.3 Scanning whole genomes for presence of motifs with FIMO: Find Individual Motif Occurrences

MAST is useful for determining if a sequence (or lists of them) have a specific motif. FIMO, is good for finding ALL significant occurrences of a motif within a sequence (i.e. a whole chromosome or genome).

You can then run FIMO to scan sequences for all occurrences of a motif. I usually set --max-stored-scores quite high and --max-strand provides the best match to identically overlapping sequences (only difference being the strand):

fimo --max-strand --max-stored-scores 10000000 --oc fimo_OUT_FOLDER MEME_output.txt hg38.fa

This is a basic run, but there are several options one can implement: http://meme-suite.org/doc/fimo.html?man_type=web

FIMO returns .html, .gff3, and .tsv (tab-separated values) files. I usually invoke the option --text and redirect (>) the stdout to a file name to create a text file output that contains a list of the motifs, their locations, and significance.

Here is a run that I recently:

fimo --max-strand --text --max-stored-scores 10000000 --oc fimo_atf1_out meme.txt hg38.fa > fimo_output.txt

After going through this workflow, there are a number of things one can do with these locations and bedtools.

  • Determine the number of peaks with motifs - intersect peak file with mast output using bedtools.
  • List of motifs not within peaks - intersect peak file with fimo output using bedtools, choose non-overlapping.
MAST vs FIMO

Figure 3: MAST vs FIMO

4 In class exercise 1:

Starting with the ChIP-seq summits file, try the following:

5 Inspect MEME results

First, let’s look at the results from our meme analysis. You can copy it to from the server at this location:
/home/FCAM/meds5420/motif
Open the .html and .txt files to see the results

5.1 Motif Matrices Motif

Position Frequency Matrix (PFM): represents the frequency of each base occurrence at each position within the motif (Raw data):

Position frequency matrix

Figure 4: Position frequency matrix

Position Weight Matrix (PWM): Score for probablility that a base will be present at a given position. Considers the numbers of sequences and background frequency of bases. PWMs are a more realistic reflection of the binding strength of a protein for a given sequence.

Position weight matrix

Figure 5: Position weight matrix

Matrixes can be in many formats, see: http://meme-suite.org/doc/overview.html#motif_conversion_utilities.

You can create Sequence Logos from your enriched sequences with weblogo: http://weblogo.berkeley.edu/logo.cgi or a vector image using ceqlogo:

ceqlogo -i meme.txt -m 2 -f EPS -o atf1.eps

5.2 Comparing Your Motif to Databases

Motif databases:
Jeff Vierstra: https://resources.altius.org/~jvierstra/projects/motif-clustering-v2.1beta/consensus_pwms.meme
JASPAR:http://jaspardev.genereg.net/.
CisBP: http://cisbp.ccbr.utoronto.ca/TFTools.php
Transfac: http://www.gene-regulation.com/pub/databases.html
HOCOMOCO: http://hocomoco11.autosome.ru/

Notes:
Jeff did a geat job consolidating redundant motifs into consensus motifs.
JASPAR DB is highly curated from a number of sources.
CisBP is a largely single experimental effort: see: http://www.sciencedirect.com/science/article/pii/S0092867414010368
HOMER: http://homer.ucsd.edu/homer/custom.motifs Transfac in NOT open access. However, UConn recently purchased University-wide licenses for GeneXplain which accesses the Transfac database: http://genexplain.com/
HOCOMOCO was made from the motif search tool ChIPmunk (not covered in this course):http://autosome.ru/ChIPMunk/

5.3 Motif database access and aquisition

NOTE: Database must be in meme format. Several formats can be converted using tools from the MEME tools suite.
see: http://meme-suite.org/doc/overview.html#motif_conversion_utilities.

For instance, the JASPAR motifs look like this:

head ./JASPAR_all_matrix.txt
## >MA0001.1 AGL3
## A  [ 0  3 79 40 66 48 65 11 65  0 ]
## C  [94 75  4  3  1  2  5  2  3  3 ]
## G  [ 1  0  3  4  1  0  5  3 28 88 ]
## T  [ 2 19 11 50 29 47 22 81  1  6 ]
## >MA0002.1 RUNX1
## A  [10 12  4  1  2  2  0  0  0  8 13 ]
## C  [ 2  2  7  1  0  8  0  0  1  2  2 ]
## G  [ 3  1  1  0 23  0 26 26  0  0  4 ]
## T  [11 11 14 24  1 16  0  0 25 16  7 ]

To convert a whole directory of JASPAR motif files:

jaspar2meme -pfm DIRECTORY_INPUT > jaspar.meme

-pfm: specifies input format

head -28 ./jaspar.meme | tail -20
## 
## MOTIF CN0001.1 LM1
## 
## letter-probability matrix: alength= 4 w= 16 nsites= 5332 E= 0
##   0.168230     0.079895    0.383721    0.368155  
##   0.045949     0.003938    0.918792    0.031320  
##   0.009377     0.051013    0.010315    0.929295  
##   0.031508     0.018942    0.009002    0.940548  
##   0.107839     0.052138    0.721868    0.118155  
##   0.005064     0.874156    0.014254    0.106527  
##   0.003563     0.904351    0.000750    0.091335  
##   0.918980     0.022881    0.021943    0.036197  
##   0.066954     0.029632    0.039197    0.864216  
##   0.207802     0.000750    0.788635    0.002813  
##   0.019505     0.006377    0.969242    0.004876  
##   0.509002     0.249625    0.051013    0.190360  
##   0.744561     0.018192    0.197299    0.039947  
##   0.955551     0.010128    0.023631    0.010690  
##   0.015191     0.915229    0.010503    0.059077  
##   0.368530     0.346774    0.064891    0.219805  

A complete list of JASPAR and other database motifs can be found and downloaded from the MEME website:
https://meme-suite.org/meme/db/motifs

I typically download the databases directly from meme: https://meme-suite.org/meme/meme-software/Databases/motifs/motif_databases.12.22.tgz

You can find a few database files here: /home/FCAM/meds5420/TF_db/JASPAR/
I found that you need to copy them locally to use the files as input for TOMTOM.

6 TOMTOM usage

We can use TOMTOM to compare our discovered motif to databases of known motifs. See documentation: http://meme-suite.org/doc/tomtom.html?man_type=web Output is a .html file and a text file. The file shows the name of the motifs (database ID), the significance of the match and the relevant consensus sequences.

Basic usage:

tomtom -eps -oc tomtom_OUTPUT meme.txt DATABASE.meme

Here’s an example usage with more options:

tomtom -no-ssc -oc tomtom_OUPUT -verbosity 1 -min-overlap 5 -mi 1 -evalue -thresh 0.05 meme.txt DATABASE.meme

7 Motif similarity / redundancy

7.1 Question . . . Why do we get so many hits for our motifs from TF databases?

Many transcription factors are part of families of transcription factors that have arisen through genome or local duplication events, or in rare cases through convergent evolution. TFs in families provide redundancy, but sequence divergence amongst family members allows certain TFs to interact with different partners and / or respond to different signaling queues. However, the DNA binding domains are often the most conserved part of these proteins, which results in overlapping and similar binding sites for seemingly distinct TFs.

Paralogous TF DBDs. The heat map shows the degree of protein sequence conservation amongst transcription factor families that bind similar concensus motifs (indicated by the sequence logo at the top). Note that Twist and ZNF appear to be an example of convergent evolution.

Figure 6: Paralogous TF DBDs
The heat map shows the degree of protein sequence conservation amongst transcription factor families that bind similar concensus motifs (indicated by the sequence logo at the top). Note that Twist and ZNF appear to be an example of convergent evolution.

8 In Class Exercise 2:

  1. There are converted JASPAR databases in the following location
    /home/FCAM/meds5420/TF_db/JASPAR/. Use either database to compare to your motif using tomtom.

  2. View the beginning of the tomtom.tsv output—this is just a tab-separated values text file: .tsv. Notice that the protein ID is in the format like MA0604.1 Use grep on the original JASPAR.meme file to find out the common transcription factor name of some of your top hits. Use grep to see if any Atf1 motifs are found

  3. Copy the .html file to your local computer and view it in the browser. Any surprises regarding the TFs found?

  4. Try searching for your motif using the web version of TOMTOM. Do you see any differences in the results?

9 Answers to In Class Exercises:

9.1 exercise 1:

module load meme/5.4.1
module load bedtools

#Let's set a variable to the chromInfo file so we don't have to type it every time:
genome=/home/FCAM/meds5420/genomes/hg38.chrom.sizes 
peaks=/home/FCAM/meds5420/data/ATF1/ATF1_peaks/ATF1_summits.bed

mkdir in_class_lec14
cd in_class_lec14

#starting from your own ATF1/peaks directory make a window around the summits
slopBed -i $peaks -g $genome -b 50 > ATF1_summit_101bp.bed
# get the top 200 peaks
sort -k5nr ATF1_summit_101bp.bed | head -n 200 > ATF1_summit_101bp_top200.bed

#you will have to cp the hg38.fa file because an intermediate file is made
cp /home/FCAM/meds5420/genomes/hg38.fa ./

fastaFromBed -fi hg38.fa -bed ATF1_summit_101bp_top200.bed -fo ATF1_summit_101bp_top200.fasta

meme ATF1_summit_101bp_top200.fasta -oc meme_ATF1 -nmotifs 2 -objfun classic -searchsize 0 -minw 5 -maxw 10 -revcomp -dna -markov_order 3 -maxsize 10000000

#Now we must get the fasta sequence for all peaks in order to use MAST on them
fastaFromBed -fi hg38.fa -bed ATF1_summit_101bp.bed -fo ATF1_summit_101bp.fasta

# Now run MAST on them to ID peaks with the MEME motifs.  
# Check the manual for the options.

mast -hit_list meme_ATF1/meme.txt ATF1_summit_101bp.fasta > mast_ATF1_hits.txt

# To then run FIMO we should create a background file,

# Using our chromosome, which you can get from UCSC, or use hg38.fa
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz 
tar -xzf hg38.chromFa.tar.gz
cp /home/FCAM/meds5420/genomes/chroms/chr17.fa ./
# recall I put them here /home/FCAM/meds5420/genomes/chroms

fasta-get-markov -m 3 chr17.fa > chr17_bkgrnd.txt

#Next run FIMO with the background file, MEME output and your chr.fasta to search
fimo --text --max-strand --max-stored-scores 10000000 --bgfile chr17_bkgrnd.txt ./meme_ATF1/meme.txt chr17.fa > fimo_ATF1_chr17.txt

#Convert FIMO output to bed

awk '{OFS="\t";} NR>1 {print $3, $4, $5, $7, $8, $6}' fimo_ATF1_chr17.txt > fimo_ATF1_chr17.bed

#Intersect FIMO output with peaks
intersectBed -wo -a fimo_ATF1_chr17.bed -b ATF1_summit_101bp.bed -g $genome > ATF1_chr17_motifs_in_peaks.txt

# to get all the fimo output files, note the tsv files is nearly identical to the txt file above
fimo --max-strand --max-stored-scores 10000000 -oc fimo_ATF1 --bgfile chr17_bkgrnd.txt ./meme_ATF1/meme.txt chr17.fa 

# run meme with bfile on random intervals
randomBed -n 1000 -l 101 -g $genome > hg38_random_intervals.bed

fastaFromBed -fi hg38.fa -bed hg38_random_intervals.bed -fo hg38_random_intervals.fasta

fasta-get-markov -m 3 hg38_random_intervals.fasta > hg38_random_intervals_markov.txt
 
meme ATF1_summit_101bp_top200.fasta -oc meme_ATF1_bfile -nmotifs 2 -objfun classic -evt 0.01 -searchsize 0 -minw 5 -maxw 10 -revcomp -dna -bfile hg38_random_intervals_markov.txt -maxsize 10000000


Reporting:

#How many peaks are there on chr17?
grep -w chr17 ATF1_summit_101bp.bed > ATF1_chr17_peaks.bed
wc -l ATF1_chr17_peaks.bed

#What is the number of peaks (summit -/+ 50bp) on your chromosome with MAST called motifs:

grep "chr17:" mast_ATF1_hits.txt | cut -f1 -d " " | sort | uniq | wc -l 

#how many FIMO motifs on chr17:
cat fimo_ATF1/fimo.tsv | grep "#" -v | grep "motif_id" -v | wc -l 

#how many FIMO motifs in peaks?
wc -l ATF1_chr17_motifs_in_peaks.txt  

#How many unique peaks in FIMO overlap?
cat ATF1_chr17_motifs_in_peaks.txt | cut -f 10 | sort | uniq | wc -l 

# MAST and FIMO call motif instances a bit differently

9.2 exercise 2:

running TomTom

#switch to directory above where meme was performed:
tomtom -no-ssc -oc tomtom_OUPUT -verbosity 1 -min-overlap 5 -mi 1 -evalue -thresh 0.05 ATF1_classic.meme_output/meme.txt JASPAR2022_CORE_vertebrates_non-redundant.meme

grep out some top hits:

grep MA1131.1 /home/FCAM/meds5420/TF_db/JASPAR/JASPAR2022_CORE_vertebrates_redundant.meme
grep MA0604.1 /home/FCAM/meds5420/TF_db/JASPAR/JASPAR2022_CORE_vertebrates_redundant.meme

#the - means that stdin is passed to grep:
grep -i atf1 JASPAR2022_CORE_vertebrates_redundant.meme | cut -d ' ' -f2 | grep -f - tomtom.tsv 

You can also search JASPAR online.