shamshadpbg.github.io

An interactive course on learning genomics/bioinformatics tasks in Linux.

View on GitHub

Task3 - Genome Annotation

This task is a tutorial on genome annotation using prokka and other tools.

You will learn how to perform basic genome annotation, and also how to extract specific regions of interest from your genome sequence.

Requirements

Graphical software indicated by (*)

Installation

If you do not already have access to a GUI running the graphical software listed above, please install the software on your local machine. Once locally installed, you can download results off the linux server and locally visualize them on your own system.


Getting Started

mkdir task3  #creates folder
cd task3 #enters into folder

Retrieving the raw data

cp ../task2/abyss-assembly-contigs.fa . 

Exploring the assembly using artemis

Now that we have generated a good quality assembly, let’s explore the genome sequence itself and do some very basic annotation using artemis.

Visualize the genome you have produced (using abyss) with the Artemis application. Note: you will need to have artemis installed on your local machine. You will also need to download your contigs to your local machine. Open your contigs.fa file in artemis.

question Q1) How many ORFs are there of length >= 100 nucleotides?

question Q2) How many ORFs are theref of length >= 50 nucleotides?

Annotating your genome from Task2 using prokka

By marking the ORFs in your genome (given a min size threshold), you have essentially performed a simple gene finding algorithm. However, there are more advanced ways of gene-finding that take additional criteria into account.

A popular genome annotation tool for prokaryotic genomes is prokka. prokka automates a series of genome annotation tools and is simple to run. It has been installed for you on the server.

prokka abyss-assembly-contigs.fa

Note: This will generate a folder called PROKKA-XXXXXXXX where XXXXXXXX is the current date. It will be different for you than in the examples below.

question Q3) You will notice that there are vertical black lines in the middle of predicted ORFs. What do these lines represent?

question Q4) Re-start artemis and change your artemis ‘Options’ to better reflect the source of this genome. Which source did you choose (e.g., “Standard”, “Vertebrate Mitochondrial”, etc.)?

When prokka is run without any parameters, it selects ‘bacteria’ as the default taxonomy.

Look at the --kingdom options in prokka -h and re-run prokka to use the correct annotation mode. You will also need to use --outdir to specify a folder for your new results.

question Q5) Has anything changed in this genome annotation? Examine the CDSs, tRNAs, and rRNAs, and their annotations.

Annotation of an E. coli genome using prokka

Next, let’s perform genome annotation on a larger scale.

https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task1/e-coli-k12-genome.fasta.gz

Next, explore the files produced by prokka. Start with the .txt file.

question Q6) How many genes, rRNAs, tRNAs, and CRISPR loci were predicted? What is the size of the genome in Mb?

Prokka also annotates genes based on COGs and also E.C. (enzyme commission) numbers. This information can be found in the .tbl file.

Column 6 of this .tsv file lists the COGs. To print out only column 6, you can use the cut command as follows (replace “yourPROKKAoutput”):

cut -f6 yourPROKKAoutput.tsv

Using commands such as cut, sort, grep, uniq, and wc answer the following two questions (Q7 and Q8).

e.g., this line below will count the number of unique entries in column 3 of file.txt

cut -f3 file.txt | sort | uniq | wc -l

question Q7) How many genes were annotated with COGs?

question Q8) How many unique enzymatic activities (E.C. numbers) were assigned to the E. coli genome? Note: 1.-.-.- and 1.1.1.17 would count as two separate E.C. numbers.

Assigning GO terms

Next, we will be assigning Gene Ontology (GO) terms to your predicted genes/proteins.

Prokka identifies homologs of your proteins within the UniProtKB database. Since there are already pre-computed GO terms for all proteins in UniProtKB, we can map these GO terms over using the following commands:

#extract the predicted proteins that have been mapped to entries in UniProt
cat yourPROKKAoutput.gff | grep -o "UniProtKB.*;" | awk -F'[:;=]' '{print $4" "$2}' >uniProts.txt

#assign GO annotations from a uniprot-GO database table
uniprot2go.py -i uniProts.txt -d /data/uniprot2go/uniprot-vs-go-db.sl3 >go.annotations

This will generate an go.annotations file, which contains your predicted functional annotations.

This one-liner will extract column 3 (GO terms), and list the top 20 according to their frequency in your proteome.

cat go.annotations | awk '{print $3}' | tr "," "\n" | sort | uniq -c | sort -n -r | head -20

Now, there is a lot you can explore using your predicted GO terms for your genome. e.g., Suppose you want to find all the predicted DNA binding proteins. Look here to find the GO ID for “DNA binding”.

question Q9) How many proteins were annotated with the GO TERM for “DNA binding”?

After annotation: Extracting genes and regions of interest

Once your genome has been assembled and annotated, you may be interested in identifying and extracting specific genes or regions of interest.

Extracting genes of interest

For example, suppose you are interested in the “trpE” gene from E. coli. You can see whether this gene exists in the predictions like this:

grep "trpE" yourPROKKAoutput.tsv

This will output

JKMANJED_01263 CDS 1563 trpE 4.1.3.27 COG0147 Anthranilate synthase component 1

… which tells you that “trpE” has been assigned to the gene labeled “JKMANJED_01263”

You can then extract this gene sequence from the gene predictions file (.ffn) like this:

# index the .ffn file so we can extract from it
makeblastdb -in yourPROKKAoutput.ffn -dbtype 'nucl' -parse_seqids

blastdbcmd -entry JKMANJED_01263 -db yourPROKKAoutput.ffn

This will output:

>>JKMANJED_01263 Anthranilate synthase component 1 ATGCAAACACAAAAACCGACTCTCGAACTGCTAACCTGCGAAGGCGCTTATCGCGACAATCCCACCGCGCTTTTTCACCA GTTGTGTGGGGATCGTCCGGCAACGCTGCTGCTGGAATCCGCAGATATCGACAGCAAAGATGATTTAAAAAGCCTGCTGC TGGTAGACAGTGCGCTGCGCATTACAGCTTTAGGTGACACTGTCACAATCCAGGCACTTTCCGGCAACGGCGAAGCCCTC CTGGCACTACTGGATAACGCCCTGCCTGCGGGTGTGGAAAGTGAACAATCACCAAACTGCCGTGTGCTGCGCTTCCCCCC …

Note: in this example we searched the annotations with a text query “trpE”. However, the best way of finding your gene of interest is to do a BLAST search since it may not be labeled correctly in your annotations

Extracting regions of interest

Next, suppose are interested in extracting the promoter of the “trp operon”. See here for some background information. The trp operon regulatory sequences (operator and promoter) can be found upstream of the trpE gene. These regions are not in the annotations files so you will need to locate them yourself.

First, let’s see where the trpE gene is located in the genome:

grep "trpE" yourPROKKAoutput.gff

This will output:

U00096.3 Prodigal:2.6 CDS 1321384 1322946 . - 0 ID=JKMANJED_01263;eC_number=4.1.3.27;Name=trpE;db_xref=COG:COG0147;gene=trpE;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P00895;locus_tag=JKMANJED_01263;product=Anthranilate synthase component 1

This tells us that trpE is located in entry “U00096.3” at chromosome position “1321384 to 1322946” and encoded on the minus (-) strand.

To extract the sequence for these coordinates, we can use blastdbcmd against the genome as follows:

# index the genome so we can extract regions from it
makeblastdb -in yourPROKKAoutput.fna -dbtype 'nucl' -parse_seqids

blastdbcmd -entry U00096.3 -db yourPROKKAoutput.fna -range 1321384-1322946 -strand minus

This should produce a FASTA sequence output of the gene identical to that in the above example.

But you are not interested in the gene sequence; you actually want the upstream regulatory region. Suppose you want to identify the 30-nucleotide long region upstream (before but not including the start codon) of the trpE coding sequence. By modifying the code above, answer the following question.

question Q10) What is the 30-nucleotide long sequence immediately upstream of the TrpE coding sequence?

Extracting the rRNAs predicted by barrnap

Sometimes you may be interested in extracting multiple genes or regions at once. E.g., suppose you want to extract all of the regions corresponding to predicted 16S rRNA sequences. In prokka, rRNA genes are predicted for you using the barrnap tool.

Here is a two-liner to extract the 16S rRNAs predicted by barrnap.

cat *.gff | grep "barrnap" | awk '{ if ($7 == "-") {print $1" "$4"-"$5" minus"} else {print $1" "$4"-"$5" plus"} }' >rRNAs.txt
blastdbcmd -db yourPROKKAoutput.fna -entry_batch rRNAs.txt > rRNAs.fa

Now, to predict taxonomy, we can BLAST these rRNA sequences against the NCBI 16S database for example using web-BLAST. Note, that there may be multiple rRNAs and some of them may be partial sequences.

Analyzing a mystery genome of unknown source

And now for something a little more difficult.

https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task3/mysteryGenome.fna.gz

question Q11) Based on 16S rRNA sequences, what is the taxonomic origin of this genome (genus and species)? e.g., “Escherichia coli”


ASSIGNMENT QUESTIONS

Please answer questions 1-11 above on LEARN under Quizzes.

#

Congratulations. You have now completed Task 3.