shamshadpbg.github.io

An interactive course on learning genomics/bioinformatics tasks in Linux.

View on GitHub

Task5 - Comparative genomics - gene set comparison

This task is a tutorial on comparative genomics with a focus on gene set comparison.

You are going to download, annotate and compare the genomes of a enterohemorrhagic E. coli (strain O157:H7) versus a non-pathogenic E. coli (strain K12). You are then going to identify genes and gene duplications that are unique to each organism and biologically interpret your results.

Requirements

Getting Started

Retrieving the raw data

You will be comparing two genomes of E. coli - strain K12 (non-pathogenic lab strain) and O157H7 (pathogenic E. coli associated with disease outbreaks).

ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_O157H7_EDL933_uid259/AE005174.fna
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_K_12_substr__DH10B_uid20079/CP000948.fna

question Q1 - Look within the ftp directories for these bacterial genome projects. i.e., go to: https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_O157H7_EDL933_uid259

What do the other files contain?

Annotating both genomes

#this may take a while so be patient... Remember, these are full bacterial genomes as opposed to small mitochondrial contigs
prokka O157H7.fna --outdir O157H7 --norrna --notrna
prokka K12.fna --outdir K12 --norrna --notrna

Generating gene lists

#generate a gene list text file by grepping the gene names from the .tbl file
cat PROKKA*.tbl | awk '{if ($1 == "gene") {print $2}}' | awk -F'_' '{print $1}' | sort 

question Q2 - How many genes are present in each genome?

Comparing gene lists

Now, let’s compare these lists to find genes that are common to both, duplicated or unique to either

A lot of this work can be done simply using the comm and uniq functions within the shell. Explore the command-line usage and options of these commands using man

Comparison including gene duplicates

We can compare both gene lists like this in your task5 folder:

comm O157H7/genelist_O157H7.txt K12/genelist_K12.txt >geneListComparison.txt

Examine the output of geneListComparison.txt using less

question Q3 - What do the genes in column 1, column 2, and column 3 represent?

Now, suppose we want to output the genes in column 1 (ignoring spaces). We can do so like this:

cat geneListComparison.txt | awk -F'\t' '{print $1}' | grep -v -e '^$'

… and we can then count them by piping this command to wc -l

cat geneListComparison.txt | awk -F'\t' '{print $1}' | grep -v -e '^$' | wc -l

question Q4 - How many genes are only in the O157H7 genome? Only in the K12 genome? In both?

Comparison without gene duplicates (finding unique genes)

Construct a simple Venn diagram illustrating the number of shared vs genome-specific genes. In this Venn diagram, if a gene G is duplicated (2 copies) in genome A and not in genome B (1 copy), then gene G will contribute to the unique genes in genome A. This is not really true, since genome B has a copy of gene G.

Let’s account for this by “removing gene redundancy”. This can be done by first filtering each initial gene list using the tool uniq

cd /O157H7/
uniq genelist_O157H7.txt > unique_genelist_O157H7.txt

cd ../K12
uniq genelist_K12.txt > unique_genelist_K12.txt

cd ..

Now, when we compare these lists using comm, we will only be comparing single copies of each gene. Therefore, the result of comm should show us only those genes that are unique to genome 1, unique to genome 2, or shared between both

comm O157H7/unique_genelist_O157H7.txt K12/unique_genelist_K12.txt > uniqueGeneListComparison.txt

question Q5 - How many unique genes are only in the O157H7 genome? Only in the K12 genome? In both?

Going further: inspecting your duplicated and unique genes within each organism

Now, let’s examine the output of the lists above.

Examine geneListComparison.txt to find the gene expansions specific to enterohemorrhagic E. coli O157:H7. The following code will sort the O157:H7-specific genes by their copy number. This will identify those that have undergone the most pathogen-specific duplication.

cat geneListComparison.txt | awk -F'\t' '{print $1}' | sort | uniq -c | sort -n -r | head -20

Examine your result carefully. Column 1 states the copy number and copy 2 states the gene name.

question Q6 - Which gene in O157H7 occurs the most times. In K12?


ASSIGNMENT QUESTIONS

The questions for this task are indicated by the lines starting with question above. Please submit your answers using the quiz on LEARN.