1 A review of wget, curl, and cut

I asked you to complete section 5.6 from Lecture 4 for homework, so let’s review these commands.

1.1 Retrieving files from a URL using curl or wget

Download chr_coordinates.bed from the Lecture 5 folder in GitHub and move this file to your MEDS5420 folder.

If you have a Mac, use curl:

curl -O https://path/to/the/raw/file/on/github.bed 

Linux systems typically have wget:

wget https://path/to/the/raw/file/on/github.bed

1.2 Move the bed file to a new directory

#start from the MEDS5420 folder:
mkdir ./in_class/coordinates
mv chr_coordinates.bed ./in_class/coordinates
cd ./in_class/coordinates

1.3 String splitting and manipulation with cut

Review the cut command by splitting the file name chr_coordinates.bed on the r character and then the c character to output oo. Remember to use echo to interpret the file name as a string and not a file.

fileName="chr_coordinates.bed"
echo $fileName | cut -f 2 -d 'r' | cut -f 2 -d 'c' 
## oo
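
The same approach works with any single-character delimiter. As a minimal sketch, the lines below split the file name on the period to separate the base name from the extension. The variable names base and ext are chosen here only for illustration.

```shell
fileName="chr_coordinates.bed"
# field 1 is everything before the first period
base=$(echo "$fileName" | cut -d '.' -f 1)
# field 2 is the text after the first period (here, the extension)
ext=$(echo "$fileName" | cut -d '.' -f 2)
echo "$base"
echo "$ext"
## chr_coordinates
## bed
```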

2 Logical tests in Shell

We’ve learned some basics of shell scripting with loops. Let’s add more sophistication by adding conditional statements.

2.1 Example: if/else

snow="1 2 3 4 6 8 10"
for x in $snow
  do
    echo "There are ${x} inches of snow"
    if [ $x -lt 3 ]
      then
      echo 'stay calm'
    else
      echo 'panic'
    fi
  done
## There are 1 inches of snow
## stay calm
## There are 2 inches of snow
## stay calm
## There are 3 inches of snow
## panic
## There are 4 inches of snow
## panic
## There are 6 inches of snow
## panic
## There are 8 inches of snow
## panic
## There are 10 inches of snow
## panic

*Note: the test statement must be separated from the square brackets by a space.
End if statements with fi.

To get a list of operators for numerical tests use:

man test
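
The integer comparison operators listed there include -eq (equal), -ne (not equal), -lt (less than), -le (less than or equal), -gt (greater than), and -ge (greater than or equal). The sketch below tries three of them; the function name compare_to_three is hypothetical and exists only to make the example easy to rerun with different numbers.

```shell
# classify a number with integer comparison operators
compare_to_three() {
  if [ "$1" -eq 3 ]      # -eq: equal to
    then
    echo "equal to 3"
  elif [ "$1" -lt 3 ]    # -lt: less than
    then
    echo "less than 3"
  elif [ "$1" -gt 3 ]    # -gt: greater than
    then
    echo "greater than 3"
  fi
}
compare_to_three 2
compare_to_three 3
compare_to_three 8
## less than 3
## equal to 3
## greater than 3
```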

What if I want to add another layer of contingency? Use elif:

snow="1 2 3 4 6 8 10"
type=windy
for x in $snow
  do
    echo "There are ${x} inches of snow"
    if [ $x -lt 3 ]
      then
      echo 'stay calm'
    elif [ $x -lt 8 ] && [ ${type} = windy ]
      then
      echo 'it is windy, take cover'
    else
      echo 'ignore the wind and grab your sled'
    fi
  done
## There are 1 inches of snow
## stay calm
## There are 2 inches of snow
## stay calm
## There are 3 inches of snow
## it is windy, take cover
## There are 4 inches of snow
## it is windy, take cover
## There are 6 inches of snow
## it is windy, take cover
## There are 8 inches of snow
## ignore the wind and grab your sled
## There are 10 inches of snow
## ignore the wind and grab your sled

Note that I also included a variable (type) that is used in the test.

&& represents "and": the combined test passes only if both tests are true.
|| represents "or": the combined test passes if either test is true.
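
The examples above only use &&, so here is a minimal sketch of an "or" test. The function name advice, the 8 inch cutoff, and the messages are made up for illustration.

```shell
# $1 is inches of snow, $2 is the wind condition
advice() {
  if [ "$1" -gt 8 ] || [ "$2" = windy ]
    then
    echo 'stay inside'
  else
    echo 'go outside'
  fi
}
advice 9 calm    # first test is true
advice 2 windy   # second test is true
advice 2 calm    # neither test is true
## stay inside
## stay inside
## go outside
```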

3 Passing files and options into scripts from command line:

I uploaded a file with rain data to the Lecture 5 directory in GitHub, and I want to process it with this script.

cat rain_data.txt
## Inches   1
## Inches   2
## Inches   3
## Inches   4
## Inches   6
## Inches   8
## Inches   10

I could read it in directly:

rain=$(cat rain_data.txt | cut -f 2)
cond=windy
for x in $rain
  do
    echo "There are ${x} inches of rain"
    if [ $x -lt 3 ]
      then
      echo 'stay calm'
    elif [ $x -gt 5 ] && [ ${cond} = windy ]
      then
      echo 'it is windy, take cover'
    else
      echo 'get in a boat'
    fi
  done
## There are 1 inches of rain
## stay calm
## There are 2 inches of rain
## stay calm
## There are 3 inches of rain
## get in a boat
## There are 4 inches of rain
## get in a boat
## There are 6 inches of rain
## it is windy, take cover
## There are 8 inches of rain
## it is windy, take cover
## There are 10 inches of rain
## it is windy, take cover

Or, I could replace the hard-coded file name, column, and condition with positional parameters, as below, and save this as a script named rain.sh.

#!/bin/bash

rain=$(cat "$1" | cut -f "$2")
cond=$3
for x in $rain
  do
    echo "There are ${x} inches of rain"
    if [ $x -lt 3 ]
      then
      echo 'stay calm'
    elif [ $x -gt 5 ] && [ "$cond" = windy ]
      then
      echo 'it is windy, take cover'
    else
      echo 'get in a boat'
    fi
  done

Note: $1, $2, and $3 are positional parameters that hold the arguments supplied on the command line, so the user can choose the input file and options when running the script. The usage would then be: <script_name> ARG1 ARG2 ARG3
$1 refers to the first argument, here rain_data.txt
$2 refers to the second argument, here the number 2, which the script uses to parse out the second column
$3 refers to the third argument, the condition

Then I would run:

# script input1 input2 input3
bash rain.sh rain_data.txt 2 calm
# OR
bash rain.sh rain_data.txt 2 windy
# OR
chmod +x rain.sh
./rain.sh rain_data.txt 2 windy

More arguments can be added and the order of the arguments sets the substitution order.
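
Because the order matters, it can help to check the number of arguments before using them. The sketch below uses the special variable $#, which holds the argument count. The function name check_args and the wording of the usage message are assumptions for illustration, not part of rain.sh.

```shell
# print a usage message unless exactly three arguments are given
check_args() {
  if [ "$#" -ne 3 ]
    then
    echo "USAGE: bash rain.sh <INPUT_FILE> <COLUMN> <CONDITION>"
    return 1
  fi
  echo "file=$1 column=$2 condition=$3"
}
check_args rain_data.txt 2 windy
## file=rain_data.txt column=2 condition=windy
```

Inside a script, the same test would go at the top, with exit 1 in place of return 1.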

3.1 Reviewing how to pass shell variables to awk:

Question: What if we have a shell variable that we want to use or pass to awk?

Try this:

list="1 2 3"
echo $list | awk '{print $list}'


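This prints 1 2 3, but only by accident. Inside single quotes the shell does not expand $list, so awk sees its own uninitialized variable named list, which evaluates to 0, and $0 is the entire input line. A variable made in the shell cannot inherently be read by awk. You have to pass the variable to awk. Here’s how: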

list="1 2 3"
echo $list| awk -v nums="$list" '{print nums}'
## 1 2 3

Note the -v option at the beginning of the awk command, which assigns the value of the shell variable to the awk variable nums.
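
A passed variable can also be used in a test inside awk. As a minimal sketch, the lines below print only the values in column 2 that exceed a shell-defined cutoff. The names threshold, min, and result, and the three input lines, are made up for illustration.

```shell
threshold=3
# min is the awk-side name for the shell variable threshold
result=$(printf 'Inches\t1\nInches\t4\nInches\t8\n' | awk -v min="$threshold" '$2 > min {print $2}')
echo "$result"
## 4
## 8
```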

3.2 In class exercise 1:

In class last week we used color-table.txt and learned how to isolate and parse different columns and rows with cut, uniq, and awk. Now, try writing a script that uses the color column to write each row to a file containing only rows of the same color. That is, all the ‘red’ rows should go to a ‘red.txt’ file, all the ‘blue’ rows to a ‘blue.txt’ file, etc.

3.2.1 Proper script structure and annotation

Even though your code may have worked, the script is not considered finished as it stands. We need to use indentation and add annotation to the code for several reasons:
1. To make the code more readable through proper indentation of loops.
2. To provide USAGE instructions.
3. To describe the steps being taken. This is important to remind yourself what your coding steps were and to help others who might want to modify your script.
4. To track the progress of the script. This is most important for debugging, so that one can know where in the code a script failed.

Editing in a text editor with syntax highlighting will help you construct a readable script.

4 Logging into the Xanadu cluster

To access the cluster you need to log in with ssh (secure shell):

ssh <user_name>@xanadu-submit-ext.cam.uchc.edu 

# your user name looks like this:
ssh meds5420usr17@xanadu-submit-ext.cam.uchc.edu 

5 Transferring data to and from the cluster

5.1 For PC users:

You can use scp from a terminal window, or you can use WinSCP, which is a convenient file transfer client with a graphical user interface for moving data between computers. Below are some links with tutorials for downloading, installing, and using WinSCP.

https://winscp.net/eng/docs/guide_connect

https://www.youtube.com/watch?v=58KmUBaEW34

To move files between computers you can log in with sftp or use scp (secure copy):

5.2 sftp:

ftp stands for “File Transfer Protocol”; sftp is “Secure File Transfer Protocol”, which runs over an encrypted ssh connection. With sftp, a user account and password are required.


sftp  <your_username>@<host_name>

For the Xanadu cluster, there is a special partition for transferring data:


sftp <your_username>@transfer.cam.uchc.edu

1. You can then navigate to the directory you want to transfer files to or from.
2. put uploads a file from your computer to the server; get downloads a file from the server to your computer.

put /Users/guertinlab/MEDS5420/color-table.txt

get 

5.3 scp

scp can be used without logging in provided you know the exact location where your file of interest is or will go. We will primarily use sftp in this course.

# for copying TO the server
scp -r <path_to_directory> <your_username>@transfer.cam.uchc.edu:~/path/to/target/folder

You should be prompted for a password. If not, the transfer probably failed.

# for copying FROM the server
scp -r <your_username>@<host_name>:<target_directory> <path_to_local_directory>

5.3.1 How do we know if transfer was complete?

There’s a program called md5 (mac) or md5sum (linux) that can help us with this. It returns a compact digital fingerprint for each file. Any change to the file will result in a different fingerprint.

on a mac:

md5 ./data-shell.tar
## MD5 (./data-shell.tar) = f20d68f53260e594e0ffb26263894b6f

on Linux:

md5sum ./data-shell.tar
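
Comparing two long strings by eye is error prone, so the comparison can be scripted. The sketch below is one way to do it, under these assumptions: the helper name checksum is hypothetical, and two small temporary files stand in for the original and the transferred copy. On a Mac it relies on the -q option of md5, which prints only the hash.

```shell
# print only the MD5 string, using whichever program is installed
checksum() {
  if command -v md5sum > /dev/null 2>&1
    then
    md5sum "$1" | cut -d ' ' -f 1   # Linux output is: <hash>  <file>
  else
    md5 -q "$1"                     # Mac: -q prints only the hash
  fi
}

# stand-in files: a small original and a copy of it
original=$(mktemp)
copy=$(mktemp)
echo "example data" > "$original"
cp "$original" "$copy"

if [ "$(checksum "$original")" = "$(checksum "$copy")" ]
  then
  echo 'checksums match'
else
  echo 'checksums differ'
fi
## checksums match
```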

5.4 In class exercise 2: Inspecting, retrieving, and checking files from server

1 Log onto the server using ssh
2 Navigate to the MEDS5420 folder in /home/FCAM/meds5420/in_class
3 View the contents of the data-shell.tar file without unbundling it.
4 View the checksum string for the file.
5 Log out and return to your home directory or open a new terminal window (command-t)
6 Transfer the file to your computer using sftp
7 Confirm that the transfer was complete

6 Answers to in class exercises:

6.1 In class exercise 1:


colors=$(cat "$1"| cut -f 3 | sort | uniq)

for col in $colors
    do  
        touch ${col}_rows.txt
        cat "$1" | awk -v col="$col" '{ if ($3 == col) print $0}' >> ${col}_rows.txt
    done

Here’s another version of the script with decent annotation. The input file would be the color-table.txt file. The first argument then replaces the “$1” wherever it appears in the script.

# This script will parse unique items from column 3 to separate files
# USAGE: bash parse_colors.sh <INPUT_FILE>

#Create uniq list of colors.  Note that input file is first argument
colors=$(cat "$1"| cut -f 3 | sort | uniq)
echo $colors

#iterate through list of colors and parse into new files
for col in $colors
    do  
        echo parsing ${col} # this prints the progress to the screen
        touch ${col}_rows.txt
        cat "$1" | awk -v col="$col" '{ if ($3 == col) print $0}' >> ${col}_rows.txt
        echo ${col} parsed  # this prints the progress to the screen
    done

This version requires a Google search to figure out how to pass multiple variables to awk. The script is general: you can use it to parse any file based on the unique elements in any user-defined column. For example, if column 7 records whether genes are activated, unchanged, or repressed, this script can produce one file for each gene category.


#! /bin/sh
# USAGE: sh <script_name> <INPUT_FILE> <COLUMN>
# positional argument 1 is the file with the color data
# positional argument 2 is the column with the color 
color=$(cat "$1" | cut -f "$2" | sort | uniq)
for x in $color
  do
    echo "parsing the ${x} file"
    # column holds a column number, so $column in awk refers to that field
    cat "$1" | awk -v col="$x" -v column="$2" '{if ($column == col) print $0}' > "${x}.txt"
  done
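
As a sketch of the gene-category case, the lines below wrap the same logic in a function and run it on a small made-up table. The file genes.txt, its contents, the function name parse_by_column, and the use of column 2 (rather than 7) are all assumptions for illustration.

```shell
# work in a scratch directory so no existing files are overwritten
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# hypothetical tab-delimited table: gene name, category
printf 'geneA\tactivated\ngeneB\trepressed\ngeneC\tactivated\n' > genes.txt

# $1 is the input file, $2 is the column to split on
parse_by_column() {
  categories=$(cut -f "$2" "$1" | sort | uniq)
  for x in $categories
    do
      # print rows whose chosen column equals the current category
      awk -v val="$x" -v column="$2" '$column == val' "$1" > "${x}.txt"
    done
}

parse_by_column genes.txt 2
cut -f 1 activated.txt
## geneA
## geneC
```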

6.2 In class exercise 2:

6.2.1 ssh

ssh meds5420usr17@xanadu-submit-ext.cam.uchc.edu
#the 17 refers to your user number
cd /home/FCAM/meds5420/in_class
tar -tvf data-shell.tar
md5sum data-shell.tar
# a174bf3795d25f39891f43571ba1c678 
exit

Note that none of these commands demand significant compute resources.

6.2.2 sftp

cd ~
sftp meds5420usr17@transfer.cam.uchc.edu
cd /home/FCAM/meds5420/in_class
get data-shell.tar
exit
md5 data-shell.tar # mac
#OR
md5sum data-shell.tar # linux