1 Readings

Practical Computing for Biologists: Chapters 1, 4, 5, Appendix 3.

Unix Basics from UConn CBC: http://bioinformatics.uconn.edu/unix-basics/

Software Carpentry Shell Novice lesson: Episodes 1-4: https://swcarpentry.github.io/shell-novice/

2 Lesson overview

Command   Function
pwd       Print Working Directory
cd        Change Directory (cd, cd -, cd ~/<directory>, cd ..)
ls        List files (ls -F, ls --help)
man       Manual page for a command
mkdir     Make a new directory
nano      A rudimentary command-line text editor
rm        Delete file(s); use rm -r <directory> to delete directory and contents
touch     Create empty file
cp        Copy files and directories
mv        Move or rename files and directories
wc        Word Count (wc -l)
cat       Print an entire file, or concatenate multiple files
less      Read a file, one page at a time
sort      Sort lines (sort -n, sort -r)
head      Read beginning lines of a file (head -n #)
tail      Read last few lines of a file (tail -n #)

2.1 Concepts:

  • Terminology and basic commands
  • Directory structure
  • Navigating through command line: relative vs absolute paths
  • Examining file contents
  • Using the * wildcard to select multiple files in a directory, using the [] wildcard to select one or more letters, e.g. *[AB].txt for file names ending in A or B.
  • Standard output (stdout) from one command can be passed as standard input (stdin) to the next command using pipes (|).
  • Redirect stdout to files (>).

2.2 Why learn the command-line?

  • The majority of bioinformatics and computational software is developed for command-line use from the shell
  • Low system resource use (processor and memory)
  • Easier to automate and more adaptable than GUIs
  • Web-based tools are at risk of becoming obsolete (e.g. Galaxy, GenomeSpace) as more scientists develop command-line competence

Notes about reading these documents:

Sections highlighted in grey are shell input or “standard input” (stdin).
Lines following it prefixed by '##' denote shell output or “standard output” (stdout):

x='Welcome to MEDS 5420'
echo $x
## Welcome to MEDS 5420

3 Interacting with your computer: the terminal

You will communicate with the operating system (OS) by typing commands into the terminal window. You can use the terminal window to:

4 Bash, Shell, Terminal, Command Line: What’s the difference?

Command Line is the most general and refers to typing commands directly into a terminal that can be executed by the computer.
Shell (sh) is a specific program (and language, written by Steve Bourne while at Bell Labs) that processes commands and returns output.
Bash stands for Bourne Again Shell and is an updated version of the Shell language. This is the most popular Shell.
Terminal is a user interface that takes input and provides an output in text format; the interface runs the input through Shell or Bash to process the command.

6 Dealing with files and text:

We’re going to start using more system utilities or command line utilities.  The general format is:

command [options] target_file(s)
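
As a minimal sketch of this format, the block below runs ls three ways: the command alone, with an option, and with an option plus a target. The directory and file names (sub, notes.txt, inner.txt) are hypothetical and are created in a throwaway directory so nothing in your own folders is touched.

```shell
# Work in a temporary directory (names below are made up for illustration).
demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir sub
touch notes.txt sub/inner.txt

plain=$(ls)            # command only: list the current directory
marked=$(ls -F)        # command + option: -F appends / to directory names
inside=$(ls -F sub)    # command + option + target: list the sub directory

echo "$plain"
echo "$marked"
echo "$inside"
```

The target is optional for many commands; ls falls back to the current directory when none is given.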

First we will make a MEDS5420 folder in our home directory:

cd ~
mkdir MEDS5420

Go to GitHub /guertinlab/meds5420/Lecture2_command_line/ and download the lec02_files.zip. If your browser automatically unzips compressed files, you need to change this preference (on Safari: Settings > General > uncheck open “safe” files after downloading)

Use the Terminal window to list the contents of the downloads folder to confirm the download.

Let’s move the downloaded file to the ‘MEDS5420’ folder you created:

mv ~/Downloads/lec02_files.zip ~/MEDS5420/

If you are using Ubuntu in Windows, you can access your Windows C drive in the Ubuntu Terminal at the path /mnt/c/. The download location is usually /mnt/c/Users/<username>/Downloads. The following command will move the file to your MEDS5420 directory.

mv /mnt/c/Users/<username>/Downloads/lec02_files.zip ~/MEDS5420

Move (mv) can also be used to rename files:

mv <old_name> <new_name>
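
The two uses of mv can be tried safely in a scratch directory. This is a small sketch with hypothetical names (draft.txt, final.txt, archive): when the second argument is an existing directory the file is moved into it, otherwise the file is renamed.

```shell
# Scratch directory so no real files are affected (all names are hypothetical).
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch draft.txt
mkdir archive

mv draft.txt final.txt     # rename: second argument is a new file name
mv final.txt archive/      # move: second argument is an existing directory

ls archive                 # final.txt now lives inside archive/
```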

Now switch (navigate) to the MEDS5420 folder.

cd ~/MEDS5420/

To unzip the file, use:

unzip -v lec02_files.zip

The format is:

unzip [options] [-d <target_directory>] <file.zip>

Check the contents of the folder to see the results.
What happens if you run this without the ‘-v’ option and without specifying the target directory?

*Note on unzip usage: Depending on your OS, the ‘-d’ option may be needed to unzip the contents into a specific folder. In that case you also need to give the name of the output directory into which the files will be unpacked. Example:

unzip -d lec02_files lec02_files.zip

Viewing file content

Data from HTS experiments are generally in the form of large text files. These files can crash your computer if you try to open them with standard GUI programs (gedit, TextEdit or Word). There are lots of ways to get around this.

To view the beginning of a file:

head Wonderful_world.txt
## What A Wonderful World
## 
## By Bob Thiele, George David White
## 
## I see trees of green
## Red roses too
## I see them bloom
## For me and you
## And I think to myself
## What a wonderful world

To view the end of a file:

tail -n 3 Wonderful_world.txt
## 
## Yes, I think to myself
## What a wonderful world
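
Both head and tail print 10 lines by default, and -n changes that number. The sketch below shows this on a small made-up file (five_lines.txt is hypothetical) so the result is easy to check by eye.

```shell
# Build a five-line example file in a scratch directory.
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'line1\nline2\nline3\nline4\nline5\n' > five_lines.txt

top=$(head -n 2 five_lines.txt)      # first two lines
bottom=$(tail -n 2 five_lines.txt)   # last two lines

echo "$top"
echo "$bottom"
```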

You can incrementally load parts of a file with less:

less the_raven.txt

When using less you can navigate with the following commands (see Appendix 3 for more):

Print entire contents to screen:

cat the_raven.txt

* If you accidentally print a large file to the screen, stop it with control-c to Cancel it.

Getting information about files

How many lines, words, or characters does my file have:

wc the_raven.txt
##      127    1073    6906 the_raven.txt

Just count the number of lines:

wc -l the_raven.txt
##      127 the_raven.txt
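
As a sketch of the other counting options, -w reports words and -c reports bytes. The file tiny.txt is a hypothetical example. The last three commands feed the file to wc through standard input with <, which is an addition to what is shown above; wc then prints the number without a file name, which is handy when you want to store the count.

```shell
# Three lines, six words, twelve bytes (each line is "x y" plus a newline).
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'a b\nc d\ne f\n' > tiny.txt

wc -w tiny.txt     # word count followed by the file name
wc -c tiny.txt     # byte count followed by the file name

# Reading from stdin: only the number is printed.
n_lines=$(wc -l < tiny.txt)
n_words=$(wc -w < tiny.txt)
n_bytes=$(wc -c < tiny.txt)
echo "lines:" $n_lines "words:" $n_words "bytes:" $n_bytes
```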

Have a look at the manual for wc to see other output options.

Basic file manipulation

Use touch to create an empty file:

touch empty_file.txt

Anything that is printed to screen can be saved in a file using the redirection operator (>):

cat the_raven.txt > raven_copy.txt

Alternatively, the command cp copies the file to a new directory or a new file name. If the copy is within the current directory, the second positional argument is the new file name.

cp the_raven.txt raven_cp.txt

If the copy should go into a new folder/directory, the second positional argument is the relative or absolute path to the directory. Below we make the raven_files directory.

mkdir raven_files
cp the_raven.txt raven_files/

Screen output can also be appended to the end of an existing file:

cat the_raven.txt >> empty_file.txt

Multiple files can be pooled in this way:

cat the_raven.txt Wonderful_world.txt > pool.txt
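
The difference between > and >> is easy to miss: > replaces whatever the file already contains, while >> adds to the end. A minimal sketch, using a hypothetical file log.txt:

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

echo "first"  >  log.txt       # creates log.txt
echo "second" >> log.txt       # appends a second line
appended=$(cat log.txt)

echo "third"  >  log.txt       # > overwrites: the earlier lines are gone
overwritten=$(cat log.txt)

echo "$appended"
echo "$overwritten"
```

Because > overwrites without asking, double-check the file name on the right-hand side before pressing enter.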

6.1 Exercise 2: Copy with Multiple Filenames

What does cp do when given several filenames and a directory name, as in:

mkdir backup
cp the_raven.txt Thoreau_quotes.txt backup

What does cp do when given three or more filenames, as in:

cp the_raven.txt Thoreau_quotes.txt animal.txt

7 Pipes, filtering with wildcards, redirecting outputs to files

One can select multiple files using the * wildcard. Navigate to the ~/MEDS5420/lec02_files directory and type:

wc *.txt

Instead of seeing the 3 columns of numbers for the number of lines, words and characters, we can limit the wc command to only show us the number of lines using the -l argument:

wc -l *.txt

One can also add some specificity to wild cards using brackets: []

wc -l [Wt]*.txt
    # this is equivalent to saying files that start with a "W" or "t"
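
To see which files a wildcard selects before using it with a real command, you can hand the pattern to echo. The sketch below uses hypothetical file names in a scratch directory; the shell replaces each pattern with the matching names before echo runs.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch sampleA.txt sampleB.txt sampleC.txt notes.md

all_txt=$(echo *.txt)         # every name ending in .txt
a_or_b=$(echo *[AB].txt)      # names ending in A.txt or B.txt

echo "$all_txt"
echo "$a_or_b"
```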

Let’s find which file is shortest. Let’s save the wc output to disk with the redirection > operator; then we can verify that the contents of lengths.txt are the same as what wc produces using cat or less:

wc -l *.txt > lengths.txt
cat lengths.txt
less lengths.txt

To find the shortest file, we sort the lengths numerically using the sort command. The shortest file is then on the first line, which we pick out using head -n 1:

sort -n lengths.txt > sorted-lengths.txt
head -n 1 sorted-lengths.txt

Using the intermediate files can be confusing, especially in more complex problems. We can save a lot of messy files and typing using pipes (|):

wc -l *.txt | sort -n | head -n 1
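
The same idea can find the longest file, with one catch: when wc is given several files it adds a "total" line, which sorts to the bottom. The sketch below, using three hypothetical files, takes the last two lines and then keeps the first of them.

```shell
# Three example files with 1, 2 and 3 lines.
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'a\n'       > short.txt
printf 'a\nb\n'    > medium.txt
printf 'a\nb\nc\n' > long.txt

shortest=$(wc -l *.txt | sort -n | head -n 1)
# The final line after sorting is the total, so the longest file is second to last.
longest=$(wc -l *.txt | sort -n | tail -n 2 | head -n 1)

echo "$shortest"
echo "$longest"
```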

7.1 Exercise 3: Pipe Reading Comprehension

A file called animals.txt contains the following data:

deer
rabbit
raccoon
rabbit
deer
fox
rabbit
bear

7.1.1 Part 1:

What text passes through each of the pipes and the final redirect in the pipeline below? Manually rearrange and parse the input before you run or deconstruct the command.

cat animals.txt | head -n 5 | tail -n 3 | sort > final.txt

7.1.2 Part 2:

Alter the commands so that the final output contains only the three rabbit lines.

8 Additional Commands:

8.1 File Compression

Command   Function
gzip      compression/decompression tool using Lempel-Ziv coding (LZ77)
tar       Bundling files in folders

8.2 Finding things:

  • Files in directories
  • Words in files

Command   Function
grep      Global Regular Expression Print (useful flags: -w, -i, -v, -n)
find      Recursively list all files and directories and filter
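
A short sketch of the grep flags listed above and of find with a name filter. The directory layout and file contents are hypothetical. The output of find is piped through sort only to make the order predictable.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir -p data/raw
printf 'Deer\nrabbit\nrabbits\nfox\n' > data/animals.txt
touch data/raw/reads.txt

numbered=$(grep -n rabbit data/animals.txt)     # -n: prefix matches with line numbers
whole_word=$(grep -w rabbit data/animals.txt)   # -w: match whole words only
any_case=$(grep -i deer data/animals.txt)       # -i: ignore upper/lower case
inverted=$(grep -v rabbit data/animals.txt)     # -v: lines that do NOT match

# find searches the current directory (.) and everything below it.
found=$(find . -name '*.txt' | sort)

echo "$numbered"; echo "$whole_word"; echo "$any_case"; echo "$inverted"; echo "$found"
```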

8.3 Concepts:

1. Variables (creating and printing to screen).
2. Basics of shell scripts.
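
As a preview of both concepts, here is a minimal sketch. The variable names and the script name hello.sh are hypothetical. A variable is created with name=value (no spaces around =) and read back with $name; a shell script is simply a text file of commands that bash runs from top to bottom.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

course='MEDS 5420'                 # create a variable
greeting="Welcome to $course"      # double quotes let $course expand
echo "$greeting"

# Write a two-line script to a file (you could also type it in nano).
printf '%s\n' '#!/bin/bash' 'echo "Script says: $1"' > hello.sh

# Run the script; "world" becomes $1 inside it.
script_output=$(bash hello.sh world)
echo "$script_output"
```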

9 Dealing with compressed files (archives)

Download and move the data-shell.tar from GitHub to your MEDS5420 folder. See the third code chunk of section 6 of Lecture 2 for how to accomplish this for Windows OS.

We already unzipped a file using unzip:

unzip -d Example_files Example_files.zip

Other types of archives you will encounter:
.tar # bundles multiple files or folders
.gz # compressed file (created with gzip)

Figure 4: XKCD: valid tar command

To view contents of archive:

tar -tvf data-shell.tar # displays tar contents

To extract contents of archive:

tar -xvf data-shell.tar # extracts contents into original directories

To combine contents of a directory:

tar -cvf data-shell_retar.tar data-shell 

#format is <target.tar> <directory-to-be-tarred>
#For directories, execute command in parent directory (one level up). 
#Don't use absolute path. 
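
The three tar operations can be practiced as a round trip on a small hypothetical directory (project). The -v flag is left out here to keep the output short, and -C, which is not used above, tells tar to extract into a different directory; it is available in both GNU tar and the tar shipped with macOS.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir -p project/notes
echo "hello" > project/notes/a.txt

tar -cf project.tar project          # c: create an archive from the directory
listing=$(tar -tf project.tar)       # t: list the contents without extracting
echo "$listing"

mkdir unpacked
tar -xf project.tar -C unpacked      # x: extract, here into unpacked/
restored=$(cat unpacked/project/notes/a.txt)
echo "$restored"
```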

Compressing files with gzip:

gzip filename #  compresses file

Let’s look at a specific example in the writing folder within data-shell:

cd ./data-shell/writing/leisure/

ls

To view contents of a gzipped file (Linux):

zcat haiku.txt.gz | head

On a Mac use this instead:

gunzip -c haiku.txt.gz | head

Or use gzcat on a Mac:

gzcat haiku.txt.gz | head

Note: These commands are useful because they let you glance at or access the contents of large compressed files without taking the time to decompress them.

To extract gzipped files:

gunzip haiku.txt.gz #decompresses file
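
A sketch of the full gzip cycle on a hypothetical file, poem.txt. It uses gunzip -c for peeking because that form works on both Linux and Mac. Note that gzip replaces the original file with the .gz version, and gunzip does the reverse.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'old pond\nfrog leaps in\nwater sound\n' > poem.txt

gzip poem.txt                                 # poem.txt is replaced by poem.txt.gz
after_gzip=$(ls)

peek=$(gunzip -c poem.txt.gz | head -n 1)     # look inside; the file stays compressed
echo "$peek"

gunzip poem.txt.gz                            # poem.txt.gz is replaced by poem.txt
after_gunzip=$(ls)
echo "$after_gzip -> $after_gunzip"
```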

10 Next Time

10.1 TO DO: Get a scripting text editor

MAC USERS:
BBedit: https://www.barebones.com/products/bbedit/

PC USERS: download Visual Studio: https://visualstudio.microsoft.com/downloads/
or
download notepad++ here: https://notepad-plus-plus.org/


Note: You can also use emacs or other command line editors such as nano or vim. We will be using nano when we work on the server soon.

11 Answers to Exercises

11.1 Answers to Exercise 1

  1. No: . stands for the current directory
  2. No: / stands for the root directory
  3. Yes: Dr. McClintock’s home directory is /home/mcclintock
  4. No: this goes up two levels, i.e. ends in /home
  5. Yes: ~ stands for the home directory, /home/mcclintock
  6. No: this would navigate into a directory home
  7. Yes: unnecessarily complicated, but correct
  8. Yes: shortcut to go back to the home directory
  9. Yes: goes up one level

11.2 Answers to Exercise 2

In the first instance, cp will make a copy of each of the files, the_raven.txt and Thoreau_quotes.txt, in the directory backup/.

In the second instance, cp gives an error when we provide 3 file names as arguments. To understand the error, see the output of cp --help or man cp. The usage line towards the top indicates that the last argument must be a directory when we are providing more than 2 arguments.
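
Both cases can be reproduced without touching the lecture files. This sketch uses hypothetical empty files; the error message from the second cp is hidden with 2>/dev/null (it differs between systems) and only the success or failure of the command is recorded.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch a.txt b.txt c.txt
mkdir backup

# Last argument is a directory: both files are copied into it.
cp a.txt b.txt backup
ls backup

# Last argument is a regular file: cp refuses and returns a non-zero exit status.
if cp a.txt b.txt c.txt 2>/dev/null; then
    status="copied"
else
    status="refused"
fi
echo "$status"
```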

11.3 Answers to Exercise 3

Part 1:
cat prints all the contents of animals.txt and passes it on to head. Standard output from cat (or standard input to head):

deer
rabbit
raccoon
rabbit
deer
fox
rabbit
bear

head reads the first 5 lines of that output and passes it on to tail. Standard output from head (or standard input to tail):

deer
rabbit
raccoon
rabbit
deer

tail reads the last 3 lines of the output and passes it on to sort. Standard output from tail (or standard input to sort):

raccoon
rabbit
deer

sort rearranges the lines in alphabetical order (you can read the man pages of sort to discern the arguments, including -r which is reverse alphabetical) and saves them into final.txt. Standard output from sort (or contents of final.txt)

deer
rabbit
raccoon

Part 2:

cat animals.txt | sort | tail -n 4 | head -n 3
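
To check both answers yourself, the sketch below recreates animals.txt in a scratch directory and runs the two pipelines. In Part 2 the file is passed straight to sort, which is equivalent to piping it in with cat.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '%s\n' deer rabbit raccoon rabbit deer fox rabbit bear > animals.txt

# Part 1: the original pipeline, written to final.txt.
cat animals.txt | head -n 5 | tail -n 3 | sort > final.txt
part1=$(cat final.txt)

# Part 2: sort first so the rabbits are adjacent, then trim from both ends.
part2=$(sort animals.txt | tail -n 4 | head -n 3)

echo "$part1"
echo "$part2"
```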