Contents

Readings: Practical Computing for Biologists: Chapter 6.

Review basic commands and server access from UConn_Unix_basics

1 Scripting in the shell.

Scripts are a logially ordered set of commands used to process files. They can be simple routines or complex programs. There are three main things one needs when writing scripts in general:
1. The commands themselves

2. Information about what language should be used to interpret the commands

3. Permission to execute commands in script file.

The first line is the Shebang line:

#! /bin/sh

or sometimes

#! /usr/bin/sh

to find out where your shell is type:

which sh

Let’s try a simple script:

ls -la
echo "Above are the directory listings for the following folder"
pwd

Create a new folder in your MEDS5420 folder called ‘scripts’
Save your script as ls.sh
Go to the directory where ls.sh is and try to execute it:

./ls.sh

In order to run this script we need to give the proper permissions. To see permissions for files, type:

ls -la

The columns are:

1. permissions
2. owner
3. group
4. size
5. last modifiation
6. name  

In permissions: ‘d’=directory, ‘-’ = file, ‘r’ = read, ‘w’ = write, ‘x’ = execute.
Three sets of permissions are shown: User (owner), Group, Other users.

To give permission to execute type:

chmod +x ls.sh

Now use ls -la to view permissions and then try to execute.

Other ways to designate permissions:

Permissions

Figure 1: Permissions


To give permission for everone to read, write, and execute a script use:

chmod 777 ls.sh

1.0.1 Scripting with a loop

list="1 2 4 6"
for x in $list
do
   echo Hello people of MEDS_5420
done
## Hello people of MEDS_5420
## Hello people of MEDS_5420
## Hello people of MEDS_5420
## Hello people of MEDS_5420

The numbers mean nothing here. They are just placeholders such that every time an item is encountered the loop repeats itself. For instance:

list="a b c"
for x in $list
do
   echo Hello people of MEDS_5420
done
## Hello people of MEDS_5420
## Hello people of MEDS_5420
## Hello people of MEDS_5420

1.1 Exercise 1:

Create a shell script, called grad_folders.sh that does the following:
1. Create a variable that lists the following items: notebook, raw_data, figures, manuscripts.
2. Create a folder for each item that is named the same as each item.
3. Print to the screen what is happening (i.e. that you are creating a folder).
4. Copy the haiku.txt file into each folder.

2 What we’ve learned so far:

1. How to navigate your computer from your terminal and create or find files and folders (cd, ls, mv, rm, mkdir, touch, find)
2. How to view the content of files (head, tail, less), and search for specific lines or items (grep)
3. How to combine multiple files together (cat) and redirect terminal output into new or existing files (> or > > ).
4. How to string several commands together with pipes (‘|’)
5. The importance of quoting syntax.
6. The beginnings of shell scripting.

3 Retrieving files from a URL using curl or wget

Download chr_coordinates.bed from the Lecture 4 folder in GitHub and move this file to your MEDS5420 folder.

It is not immediately apparent how to download the file from GitHub, but you do have access to the Raw file by clicking Raw. Copy the URL https://raw.githubusercontent.com/guertinlab/meds5420/refs/heads/main/Lecture_4/chr_coordinates.bed We can use a command in the Terminal to directly retrive this raw file without having to click Save As in the browser.

If you have a Mac, use curl:

curl -O https://path/to/the/raw/file/on/github.bed

Linux OS have wget:

wget https://path/to/the/raw/file/on/github.bed

4 More commands

1. Learn a few more useful terminal commands: sort, cut, uniq

4.1 Creating path and filename shortcuts with variables

Download color-table.txt from the Lecture 4 folder in GitHub and move this file to your MEDS5420 folder.

If you have a Mac, then curl to retrieve files from URLs:

curl -O https://raw.githubusercontent.com/guertinlab/meds5420/main/Lecture_4/color-table.txt
#the manual will tell you what the -O (the letter, not a zero) option does 

Linux OS use wget:

wget https://raw.githubusercontent.com/guertinlab/meds5420/main/Lecture_4/color-table.txt

Let’s tuck the color-table.txt file away into some new directories:

#start from the MEDS5420 folder:
mkdir ./in_class
mkdir ./in_class/colors
mv color-table.txt ./in_class/colors
table="./in_class/colors/color-table.txt"
head $table
## This 1   red
## is   2   orange
## a    4   yellow  
## test 4   green   
## this 7   blue
## is   6   purple
## only 80  red
## a    19  orange
## test 100 yellow
## if   6   green

4.2 String splitting and manipulation

The cut command is useful for splitting strings based on user-defined delimiters. For instance, if I want to extract only the time from the date command you can try this:

# selects the 4th item after the line is split by spaces (" ")
echo "the date and time is:"
echo $(date)

echo "the time is:"
echo $(date) | cut -d " " -f 4  
## the date and time is:
## Mon Feb 2 17:32:07 UTC 2026
## the time is:
## 17:32:07

-d: is the delimiter set by user. Default is tab: \t
-f: the fields to select for after splitting. Can list each (1,2,3) or list a range (1-3)

reversing a string:

# returns backwards string

echo `date` | rev
## 6202 CTU 70:23:71 2 beF noM

Can be done on any part of a pipe

echo $(date)

echo "the reverse time is:"
echo $(date) | cut -d " " -f 4 |rev
## Mon Feb 2 17:32:07 UTC 2026
## the reverse time is:
## 70:23:71

Extracting columns from tables: Cut can also be used to extract columns from tables.

Let’s just get the first column of color-table.txt:

table="./in_class/colors/color-table.txt"
cat $table | cut -f 1
## This
## is
## a
## test
## this
## is
## only
## a
## test
## if 
## this
## had
## been
## a
## real 
## emergency

4.3 Sorting columns

Simple sorting of columns

sort -k 2 color-table.txt | head -n 5

# k followed by a number represents the column to sort by.
##  
## This 1   red
## test 100 yellow
## a    19  orange
## is   2   orange

Notice how numbers are handled. They are handled as a string of numbers and each position in evaluated separately. If you want a true numeric sort, try this:

#numerical sort on column 2

sort -k 2n color-table.txt | head -n 5
##  
## This 1   red
## is   2   orange
## a    4   yellow  
## real     4   yellow

The option -k 2 sorts on column 2, but if column 2 is identical, the row will continue to be sorted until a distinct character can differentiate. The following will only sort on column 2 and retain the original relative order of row that have identical column 2 values. The -s refers to a stable sort and -k 2,2 says only use column 2 to sort.

#numerical sort on column 2
sort -s -k 2,2n color-table.txt | head -n 5
##  
## This 1   red
## is   2   orange
## a    4   yellow  
## test 4   green   

Numerical sorts are ascending, to return a descending sort, try the following:

#numerical sort on column 2

sort -k 2nr color-table.txt | head -n 5
## test 100 yellow
## only 80  red
## a    60  orange
## had  54  purple
## been 23  red

4.4 Finding unique items in list.

You can use uniq to determine how many times an item appears in a list.

# -c prints the number of each item
cat color-table.txt | cut -f 3 | uniq -c | head 
##    1 red
##    1 orange
##    1 yellow
##    1 green
##    1 blue
##    1 purple
##    1 red
##    1 orange
##    1 yellow
##    1 green

One pitfall is that it only operates on adjacent items, so lists must be sorted first:

# sorting first gives true number of unique items

cat color-table.txt | cut -f 3 |sort | uniq -c | head  
##    1 
##    2 blue
##    3 green
##    3 orange
##    2 purple
##    3 red
##    3 yellow

4.5 In class exercise 2: Extract the last item from a string of unknown length.

Consider this filename at the end of the path: /tempdata3/MEDS5420/data/raw/chip_repA.txt

Devise a way to split the string and report the filename chip_repA.txt without referencing the exact position. This come in handy if you want to get the last item in the path without knowing how long the path is.

5 Manipulation / parsing of tables with awk

One bad thing about cut is that you cannot reorder columns (e.g. cut -f 3,1 table does not work)

A convenient way to reorder columns, especially with large files is with another language called awk.
Awk user manual: https://www.cs.unibo.it/~renzo/doc/awk/nawkA4.pdf

Here’s an example:

cat color-table.txt | awk '{print $3, "\t", $2, "\t", $1}' > newTable.txt
## red   1   This
## orange    2   is
## yellow    4   a
## green     4   test
## blue      7   this
## purple    6   is
## red   80      only
## orange    19      a
## yellow    100     test
## green     6   if

In the example above:

awk has very cryptic syntax. Therefore, I recommend learning as needed for specific tasks and keeping track of useful operations by creating a repository to store them. For instance, I have an “awk one-liners” file where I keep useful operations.

5.1 Feedback: drop the awk

The only class feedback I ignored from previous evaluations is to not teach awk. I view computational analyses for genomics in two parts: 1) perform basic, routine, data processing that can usually be automated by using shell scripts; 2) exploratory data analyses in your favorite programming language, such as python or R. awk is extremely useful for shell scripting and you will learn 16 years of my accumulated (and frequently forgotten) awk knowledge from this class. Exploratory data analysis in R or python would be another class entirely. It is true that many tasks that awk performs can be performed in R or python and awk is more awkward. However, awk is faster and easier to implement into workflows.

5.2 More awk usage and syntax:

Outer Field Separator (OFS) can be used to specify the delimiter

cat color-table.txt | awk '{OFS="\t";} {print $3, $1, $2}' > newTable.txt
head -3 newTable.txt
## red  This    1
## orange   is  2
## yellow   a   4

Input files can be passed in at the end of command instead of using a pipe | and cat:

awk '{OFS="\t";} {print $3, $1, $2}' < color-table.txt > newTable.txt
head -3 newTable.txt
## red  This    1
## orange   is  2
## yellow   a   4

left caret/arrow/less than sign < before input file is not necessary, but can be used to avoid ambiguity.

One can quickly parse columns with awk using “if” statements:

cat color-table.txt | awk '{ if($3 == "yellow") print $0}'
## a    4   yellow  
## test 100 yellow
## real     4   yellow

In this example, the if statement is followed by a test (in parentheses), followed by the desired action if the test result is true.
- $0 prints the entire line instead of a specific column.
- The double equal sign == signifies that you are asking if the two items are equal to each other rather than setting the value of a variable.

5.3 Passing shell variables to awk:

There are times when you want to use a variable created in shell in an awk command. Try:

y=yellow
cat color-table.txt | awk '{ if($3 == $y) print $0}'

As you can see, awk does not recognize shell variables, but they can be “passed” into awk as follows:

y=yellow
cat color-table.txt | awk -v cols="$y" '{ if($3 == cols) print $0}' #readable to awk

In this case, the -v option allows you to create an awk variable from a shell variable. Note: this is done before the rest of the awk statement is wrapped in '{}'. I randomly named the variable cols within awk, but we can use any designation, including y.

5.4 Running simple tests before function in awk:

Example: sometimes we will want to work on the beginning of a file in order to add, remove, or alter the header for columns. We can run a simple test to determine which lines to work with in awk using the NR (row number) argument before we start awk functions.

head -3 mm9_genes.txt
## genename geneID  chr strand  start   stop
## 4930594M22Rik    AK157947    chr14   +   123312252   123328664
## Zfp85-rs1    NM_001001130    chr13   -   67848736    67856071
cat mm9_genes.txt | awk 'NR>1{ print $0}' | head -3
## 4930594M22Rik    AK157947    chr14   +   123312252   123328664
## Zfp85-rs1    NM_001001130    chr13   -   67848736    67856071
## Scap NM_001001144    chr9    +   110235821   110287450

5.5 Common uses for awk we will use in this course:

1. Reordering columns
2. Simple math with columns
3. Printing certain rows or columns
4. Adding rows or columns to large files
5. Adding/removing headers from large files

5.6 In class exercise 3: Splitting strings and parsing files.

Consider the example path to the mm9_genes.txt:  /users/tempdata3/MEDS5420/annotations/mm9-genes.txt

1. Use cut to get the file name without the extension (.txt); modify the code extract the file name (minus the extension) regardless of how many directories deep the file is located.
Download the mm9_genes.txt file from GitHub Lecture 4 and move it to your MEDS5420 folder.
2. Count the number of rows with gene information in the file.
3. The number of genes names in the list (column 1) is not the same as the number of gene IDs (column 2). Determine how many redundant gene names are listed in the table.
4. Use awk to move genes on the plus strand to another file - call it PlusStrandGenes.txt
5. Use awk to create a file with another column that has the gene length - call it mm9GeneLengths.txt
6. Create a .bed file (used later in course) by reordering the columns as follows and separate the columns with a tab: chromosome, start, end, geneID, strand

6 Answers to exercises:

7 Answers to in class questions:

7.1 In class exercise 1:

#! /bin/sh
folders="notebook raw_data figures manuscripts"

for x in $folders
do
   echo "creating the following directory": $x
   mkdir $x
   cp ~/path/to/haiku.txt $x
done
  

7.2 In class exercise 2:

Retrieve the last item in a string of unknown length

file="/tempdata3/MEDS5420/data/raw/chip_repA.txt"
echo $file | rev | cut -d "/" -f 1 | rev

7.3 In class exercise 3:

Consider the path to the mm9_genes.txt in the MEDS5420 folder on a server: /archive/MEDS5420/annotations/mm9_genes.txt

1. Use cut (alone or in combination with other functions) to retrieve the file name without the extension (.txt):

file=/archive/MEDS5420/annotations/mm9_genes.txt
echo $file | cut -d "/" -f 4 | cut -d "." -f 1

# OR independent of positional information:

echo $file | rev | cut -d "/" -f 1 | rev | cut -d "." -f 1


2. Count the number of lines in the file:

tail +2 mm9_genes.txt | wc -l
# or
awk 'NR>1{ print $0 }'  mm9_genes.txt | wc -l 


3. The number of genes in the list is not the same as the number of genetic loci. Determine how many redundant gene names are listed in the table:



# to get number of unique gene names
cat mm9_genes.txt | cut -f 1| sort | uniq| wc -l 

# to get the number of duplicated gene names
cat mm9_genes.txt | cut -f 1| sort | uniq -d| wc -l 

4. Use awk to move genes on the plus strand to another file:

cat mm9_genes.txt | awk '{if($4 == "+") print $0}' > PlusStrandGenes.txt

5. Use awk to create another column that has the gene length
Use awk to create another column that has the gene length

cat mm9_genes.txt | awk '{print $0, $6-$5}' > mm9_plus_genes.txt
# Above command works, but the header is not correct. Try it and see.

cat mm9_genes.txt | awk 'NR<2{ print $0 "\t" "geneLength"} NR>1 {print $0 "\t" $6-$5}'
#Above commands use "NR"" to create the proper header and data running commands on specific rows.

6. Create a .bed. file by reordering columns and separate the columns with a tab delimiter

cat mm9_genes.txt | awk '{OFS="\t";} {print $3, $5, $6, $2, $4}' > mm9_genes.bed