Practical Computing for Biologists: Chapters 1, 4, 5, Appendix 3.
Unix Basics from UConn CBC: http://bioinformatics.uconn.edu/unix-basics/
Software Carpentry Shell Novice lesson: Episodes 1-4: https://swcarpentry.github.io/shell-novice/
| Command | Function |
|---|---|
| pwd | Print Working Directory |
| cd | Change Directory (cd, cd -, cd ~/<directory>, cd ..) |
| ls | List files (ls -F, ls --help) |
| man | Manual page for a command |
| mkdir | Make a new directory |
| nano | A rudimentary command-line text editor |
| rm | Delete file(s); use rm -r <directory> to delete directory and contents |
| touch | Create empty file |
| cp | Copy files and directories |
| mv | Move or rename files and directories |
| wc | Word Count (wc -l) |
| cat | Print an entire file, or concatenate multiple files |
| less | Read a file, one page at a time |
| sort | Sort lines (sort -n, sort -r) |
| head | Read beginning lines of a file (head -n #) |
| tail | Read last few lines of a file (tail -n #) |
* wildcard to select multiple files in a directory; the []
wildcard matches any one of the enclosed characters, e.g. *[AB].txt for file names ending
in A or B.
Notes about reading these documents:
Sections highlighted in grey are shell input or “standard input” (stdin).
Lines following it prefixed by '##' denote shell output or “standard output” (stdout):
x='Welcome to MEDS 5420'
echo $x
## Welcome to MEDS 5420
You will communicate with the operating system (OS) by typing commands into the terminal window. You can use the terminal window to:
Command line is the most general term; it refers to typing commands directly into a terminal for the computer to execute.
Shell (sh) is a specific program (and language) written by Steve Bourne while at Bell Labs that processes commands and returns output.
Bash stands for Bourne Again Shell and is an updated version of the Shell language. It is the most popular shell.
Terminal is a user interface that takes text input and displays text output; it passes the input to a shell such as sh or Bash, which processes the command.
We’re going to start using more system utilities or command line utilities. The general format is:
command [options] target_file(s)
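As a rough illustration of this format, the sketch below runs ls with one option (-F) on one target. The scratch directory and the file names in it are hypothetical, created only for this example.

```shell
# Work in a throwaway directory (hypothetical names; not part of the course files)
scratch=$(mktemp -d)
touch "$scratch/a.txt" "$scratch/b.txt"
mkdir "$scratch/subdir"

# command = ls, option = -F, target = the scratch directory
# -F appends a / to directory names so they stand out from files
listing=$(ls -F "$scratch")
echo "$listing"

# Remove the scratch directory and its contents
rm -r "$scratch"
```

Here ls is the command, -F is the option, and the directory path is the target.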
First we will make a MEDS5420 folder in our home directory:
cd ~
mkdir MEDS5420
Go to GitHub /guertinlab/meds5420/Lecture2_command_line/ and download the lec02_files.zip. If your browser automatically unzips compressed files, you need to change this preference (on Safari: Settings > General > uncheck open “safe” files after downloading)
Use the Terminal window to list the contents of the downloads folder to confirm the download.
Let’s move the downloaded file to the ‘MEDS5420’ folder you created:
mv ~/Downloads/lec02_files.zip ~/MEDS5420/
If you are using Ubuntu on Windows, you can reach your Windows C drive from the Ubuntu terminal under the path /mnt/c/. Downloaded files are usually in /mnt/c/Users/<username>/Downloads. The following command moves the file to your MEDS5420 directory:
mv /mnt/c/Users/<username>/Downloads/lec02_files.zip ~/MEDS5420
Move (mv) can also be used to rename files:
mv <old_name> <new_name>
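A minimal sketch of both uses of mv, run in a scratch directory so that no real files are touched. The names draft.txt, final.txt, and archive are made up for illustration.

```shell
scratch=$(mktemp -d)
touch "$scratch/draft.txt"

# Rename: source and destination are in the same directory
mv "$scratch/draft.txt" "$scratch/final.txt"

# Move: the destination is an existing directory, so the file keeps its name
mkdir "$scratch/archive"
mv "$scratch/final.txt" "$scratch/archive/"

moved=$(ls "$scratch/archive")
echo "$moved"
# prints: final.txt

rm -r "$scratch"
```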
Now switch (navigate) to the MEDS5420 folder.
cd ~/MEDS5420/
To unzip the file, use:
unzip -v lec02_files.zip
The format is:
unzip [options] [-d <target_directory>] <file.zip>
Check the contents of the folder to see the results.
What happens if you run this without the ‘-v’ option and without specifying the target directory?
*Note on unzip usage: Depending on your OS, the ‘-d’ option may be needed to unzip the contents into a specific folder. In that case, follow -d with the name of the output directory where the files will be unpacked. Example:
unzip -d lec02_files lec02_files.zip
Viewing file content
Data from HTS experiments are generally in the form of large text files. These files can crash your computer if you try to open them with standard GUI programs (gEdit, TextEdit, or Word). There are several ways to get around this.
To view the beginning of a file:
head Wonderful_world.txt
## What A Wonderful World
##
## By Bob Thiele, George David White
##
## I see trees of green
## Red roses too
## I see them bloom
## For me and you
## And I think to myself
## What a wonderful world
To view the end of a file:
tail -n 3 Wonderful_world.txt
##
## Yes, I think to myself
## What a wonderful world
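The -n option works the same way for head and tail, and without it both commands print 10 lines. The sketch below assumes a small made-up file, numbered.txt, with twelve numbered lines, which makes it easy to see which lines each command returns.

```shell
scratch=$(mktemp -d)
# printf repeats its format once per argument, giving "line 1" ... "line 12"
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$scratch/numbered.txt"

first_two=$(head -n 2 "$scratch/numbered.txt")   # first 2 lines
last_two=$(tail -n 2 "$scratch/numbered.txt")    # last 2 lines
# With no -n, head prints 10 lines; tr strips padding some versions of wc add
default_count=$(head "$scratch/numbered.txt" | wc -l | tr -d ' ')

echo "$first_two"
echo "$last_two"
echo "$default_count"

rm -r "$scratch"
```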
You can incrementally load parts of a file with less:
less the_raven.txt
When using less you can navigate with the following commands (see Appendix 3 for more):
Print entire contents to screen:
cat the_raven.txt
* If you accidentally print a large file to the screen, press control-c to cancel it.
Getting information about files
How many lines, words, or characters does my file have:
wc the_raven.txt
## 127 1073 6906 the_raven.txt
Just count the number of lines:
wc -l the_raven.txt
## 127 the_raven.txt
Have a look at the manual for wc to see other output options.
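As a sketch of those other options, the example below counts lines, words, and bytes separately on a tiny made-up file (sample.txt is not one of the course files). Reading the file through < keeps the file name out of the output, and tr strips the padding that some versions of wc add.

```shell
scratch=$(mktemp -d)
printf 'one two\nthree\n' > "$scratch/sample.txt"

lines=$(wc -l < "$scratch/sample.txt" | tr -d ' ')   # newline count
words=$(wc -w < "$scratch/sample.txt" | tr -d ' ')   # whitespace-separated words
bytes=$(wc -c < "$scratch/sample.txt" | tr -d ' ')   # bytes (equal to characters for plain ASCII)

echo "$lines $words $bytes"
# prints: 2 3 14

rm -r "$scratch"
```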
Basic file manipulation
Use touch to create an empty file:
touch empty_file.txt
Anything that is printed to screen can be saved in a file using the redirection operator (>):
cat the_raven.txt > raven_copy.txt
Alternatively, the command cp copies the file to a new directory or a new file name. If the copy is within the current directory, the second positional argument is the new file name.
cp the_raven.txt raven_cp.txt
If the copy should go into a new folder/directory, the second positional argument is the relative or absolute path to the directory. Below we make the raven_files directory.
mkdir raven_files
cp the_raven.txt raven_files/
Screen output can also be appended to the end of an existing file:
cat the_raven.txt >> empty_file.txt
Multiple files can be pooled in this way:
cat the_raven.txt Wonderful_world.txt > pool.txt
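The difference between > and >> is easy to miss: > replaces whatever the file held, while >> adds to the end. A minimal sketch with a made-up file, log.txt:

```shell
scratch=$(mktemp -d)

echo "first"  >  "$scratch/log.txt"   # creates the file
echo "second" >  "$scratch/log.txt"   # overwrites: "first" is gone
echo "third"  >> "$scratch/log.txt"   # appends after "second"

contents=$(cat "$scratch/log.txt")
echo "$contents"
# prints:
# second
# third

rm -r "$scratch"
```

Because > overwrites without warning, it is worth double-checking the file name on the right-hand side before pressing return.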
What does cp do when given several filenames and a directory
name, as in:
mkdir backup
cp the_raven.txt Thoreau_quotes.txt backup
What does cp do when given three or more filenames, as in:
cp the_raven.txt Thoreau_quotes.txt animal.txt
One can select multiple files using the * wildcard. Navigate to the ~/MEDS5420/lec02_files directory and type:
wc *.txt
Instead of seeing the 3 columns of numbers for the number of lines,
words and characters, we can limit the wc command to only show us
the number of lines using the -l argument:
wc -l *.txt
One can also add specificity to wildcards using brackets ([]):
wc -l [Wt]*.txt
# this is equivalent to saying files that start with a "W" or "t"
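To see exactly what a bracket pattern selects, the sketch below creates four empty files with made-up names and counts the matches. Matching is case-sensitive, so a name starting with an uppercase T is not matched by the lowercase t in [Wt].

```shell
scratch=$(mktemp -d)
touch "$scratch/Wonder.txt" "$scratch/the.txt" "$scratch/Thoreau.txt" "$scratch/animals.txt"

# [Wt] matches exactly one character, either W or t; * matches the rest of the name
set -- "$scratch"/[Wt]*.txt
match_count=$#

printf '%s\n' "$@"      # the paths of Wonder.txt and the.txt
echo "$match_count"     # 2

rm -r "$scratch"
```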
Let’s find which file is shortest. Let’s save the wc output to disk
with the redirection > operator; then we can verify that the contents of
lengths.txt are the same as what wc produces using cat or less:
wc -l *.txt > lengths.txt
cat lengths.txt
less lengths.txt
To find the shortest file, we then sort the lengths numerically using the sort
command with -n. The shortest file is now on the first line, so we pick it using head -n 1:
sort -n lengths.txt > sorted-lengths.txt
head -n 1 sorted-lengths.txt
Using the intermediate files can be confusing, especially in more
complex problems. We can save a lot of messy files and typing using
pipes (|):
wc -l *.txt | sort -n | head -n 1
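The pipeline can be tried on files of known length. The sketch below builds three made-up files with 1, 2, and 3 lines and runs the same pipeline, with one addition that is not in the lecture: awk keeps only the file name column. Note that wc also prints a "total" line when given several files; it holds the largest number, so it sorts to the bottom.

```shell
scratch=$(mktemp -d)
printf 'a\nb\nc\n' > "$scratch/three.txt"
printf 'a\n'       > "$scratch/one.txt"
printf 'a\nb\n'    > "$scratch/two.txt"

# cd runs inside the command substitution, so the current directory is unchanged afterwards
shortest=$(cd "$scratch" && wc -l *.txt | sort -n | head -n 1 | awk '{print $2}')
echo "$shortest"
# prints: one.txt

rm -r "$scratch"
```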
A file called animals.txt contains the following data:
deer
rabbit
raccoon
rabbit
deer
fox
rabbit
bear
What text passes through each of the pipes and the final redirect in the pipeline below? Work through the input by hand before you run or deconstruct the command.
cat animals.txt | head -n 5 | tail -n 3 | sort > final.txt
Alter the commands so that the final output is only the three rabbits.
| Command | Function |
|---|---|
| gzip | Compression/decompression tool using Lempel-Ziv coding (LZ77) |
| tar | Bundling files in folders |

| Command | Function |
|---|---|
| grep | Global Regular Expression Print (useful flags: -w, -i, -v, -n) |
| find | Recursively list all files and directories and filter |
1. Variables (creating and printing to screen).
2. Basics of shell scripts.
Download and move the data-shell.tar from GitHub to your MEDS5420 folder. See the third code chunk of section 6 of Lecture 2 for how to accomplish this for Windows OS.
We already unzipped a file using unzip:
unzip -d Example_files Example_files.zip
Other types of archives you will encounter:
.tar # bundles multiple files or folders
.gz # compressed file (created with gzip)
Figure 4: XKCD: valid tar command
To view contents of archive:
tar -tvf data-shell.tar # displays tar contents
To extract contents of archive:
tar -xvf data-shell.tar # extracts contents into original directories
To combine contents of a directory:
tar -cvf data-shell_retar.tar data-shell
#format is <target.tar> <directory-to-be-tarred>
#For directories, execute command in parent directory (one level up).
#Don't use absolute path.
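The three tar operations can be tried as a round trip on a small made-up directory (project and restore are hypothetical names). The sketch omits -v to keep the output short and uses -C, which both GNU tar and the tar shipped with macOS accept, to extract into a separate directory.

```shell
scratch=$(mktemp -d)
start_dir=$PWD
cd "$scratch"

mkdir project
echo "hello" > project/notes.txt

tar -cf project.tar project        # c = create, f = archive file name
listing=$(tar -tf project.tar)     # t = list contents without extracting
echo "$listing"

mkdir restore
tar -xf project.tar -C restore     # x = extract; -C sets the directory to extract into
restored=$(cat restore/project/notes.txt)
echo "$restored"
# prints: hello

cd "$start_dir"
rm -r "$scratch"
```

Running tar from the parent directory with a relative path, as the comments above recommend, is what makes the archive unpack as project/notes.txt rather than as a long absolute path.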
Compressing files with gzip:
gzip filename # compresses file
Let’s look at a specific example in the writing folder within data-shell:
cd ./data-shell/writing/leisure/
ls
To view the contents of a gzipped file (Linux):
zcat haiku.txt.gz | head
On a Mac, use this instead:
gunzip -c haiku.txt.gz | head
or use gzcat:
gzcat haiku.txt.gz | head
Note: These commands are useful because they let you glance at or access the contents of large compressed files without spending the time to decompress them.
To extract gzipped files:
gunzip haiku.txt.gz #decompresses file
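A short round trip shows what gzip and gunzip do to the file on disk. The sketch uses a made-up file, words.txt, and gunzip -c for the peek because it works on both Linux and macOS.

```shell
scratch=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$scratch/words.txt"

gzip "$scratch/words.txt"                               # words.txt is replaced by words.txt.gz
after_gzip=$(ls "$scratch")

peek=$(gunzip -c "$scratch/words.txt.gz" | head -n 1)   # read without changing the file on disk

gunzip "$scratch/words.txt.gz"                          # words.txt.gz is replaced by words.txt
after_gunzip=$(ls "$scratch")

echo "$after_gzip"     # words.txt.gz
echo "$peek"           # alpha
echo "$after_gunzip"   # words.txt

rm -r "$scratch"
```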
MAC USERS:
BBedit:
https://www.barebones.com/products/bbedit/
PC USERS:
download Visual Studio: https://visualstudio.microsoft.com/downloads/
or
download notepad++ here: https://notepad-plus-plus.org/
Note: You can also use emacs or other command line editors such as nano or vim. We will be using nano when we work on the server soon.
. stands for the current directory
/ stands for the root directory
~ stands for the home directory, /home/mcclintock
In the first instance, cp makes a copy of each of the files,
the_raven.txt and Thoreau_quotes.txt, in the directory backup/.
In the second instance, cp gives an error when we provide three files
as arguments. To understand the error, see the output of cp --help
or man cp. The usage line towards the top indicates that the last
argument must be a directory when we provide more than two
arguments.
Part 1:
cat prints all the contents of animals.txt and passes it on to
head. Standard output from cat (or standard input to head):
deer
rabbit
raccoon
rabbit
deer
fox
rabbit
bear
head reads the first 5 lines of that output and passes it on to
tail. Standard output from head (or standard input to tail):
deer
rabbit
raccoon
rabbit
deer
tail reads the last 3 lines of the output and passes it on to
sort. Standard output from tail (or standard input to sort):
raccoon
rabbit
deer
sort rearranges the lines in alphabetical order (see the man page of sort
for its options, including -r for reverse alphabetical order), and the redirect saves them into final.txt. Standard output from sort (or contents of final.txt):
deer
rabbit
raccoon
Part 2:
cat animals.txt | sort | tail -4 | head -3
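Both answers can be checked by recreating animals.txt from the listing in the exercise and running the two pipelines. This is only a verification sketch: the scratch directory is an assumption, the redirect to final.txt is replaced by capturing the output in a variable, and tail -n 4 and head -n 3 are the explicit spellings of tail -4 and head -3.

```shell
scratch=$(mktemp -d)
printf '%s\n' deer rabbit raccoon rabbit deer fox rabbit bear > "$scratch/animals.txt"

# Part 1: first 5 lines, then the last 3 of those, then sorted
part1=$(cat "$scratch/animals.txt" | head -n 5 | tail -n 3 | sort)

# Part 2: sort everything, keep the last 4 lines, then drop the final one (raccoon)
part2=$(cat "$scratch/animals.txt" | sort | tail -n 4 | head -n 3)

echo "$part1"
echo "$part2"

rm -r "$scratch"
```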