shamshadpbg.github.io

An interactive course on learning genomics/bioinformatics tasks in Linux.

View on GitHub

Task8 - Analysis of 16S amplicon sequencing data using Kraken2/Bracken

In this lab, you will be analyzing 16S and metagenomic data from a study by Lobb et al. (2020) here. This study examined the microbial communities of decomposing fish in local rivers near Waterloo, ON, Canada.

There are 48 16S rRNA samples with the following metadata.

Sample ID Name
SRS6112281 WW1.1
SRS6112282 WW1.2
SRS6112283 WW1.3
SRS6112267 EE1.1
SRS6112268 EE1.2
SRS6112279 EE1.3
SRS6112285 WW4.1
SRS6112284 WW4.2
SRS6112286 WW4.3
SRS6112293 WE4.1
SRS6112295 WE4.2
SRS6112296 WE4.3
SRS6112290 EE4.1
SRS6112302 EE4.2
SRS6112304 EE4.3
SRS6112271 EW4.1
SRS6112272 EW4.2
SRS6112273 EW4.3
SRS6112287 WW8.1
SRS6112288 WW8.2
SRS6112289 WW8.3
SRS6112297 WE8.1
SRS6112298 WE8.2
SRS6112299 WE8.3
SRS6112305 EE8.1
SRS6112306 EE8.2
SRS6112307 EE8.3
SRS6112274 EW8.1
SRS6112275 EW8.2
SRS6112276 EW8.3
SRS6112291 WW10.1
SRS6112292 WW10.2
SRS6112294 WW10.3
SRS6112300 WE10.1
SRS6112301 WE10.2
SRS6112303 WE10.3
SRS6112308 EE10.1
SRS6112269 EE10.2
SRS6112270 EE10.3
SRS6112277 EW10.1
SRS6112278 EW10.2
SRS6112280 EW10.3

Our goal will be to perform taxonomic profiling of these 16S rRNA datasets using Kraken2/Bracken.

Requirements

Command-line tools

Graphical tools

You will also need to download and install R on your own machine with the following packages

Getting Started

mkdir task8  #creates folder
cd task8 #enters into folder

Retrieving the raw data

The data has already been downloaded for you, and is located in the /fsys/data/lobb-et-al/ folder

If you’re curious, the original data was downloaded from the NCBI SRA using this command:

fastq-dump --split-files SRS6112303 SRS6112301 SRS6112300 SRS6112299 SRS6112298 SRS6112297 SRS6112296 SRS6112295 SRS6112293 SRS6112294 SRS6112292 SRS6112291 SRS6112289 SRS6112288 SRS6112287 SRS6112286 SRS6112284 SRS6112285 SRS6112283 SRS6112282 SRS6112281 SRS6112280 SRS6112278 SRS6112277 SRS6112276 SRS6112275 SRS6112274 SRS6112273 SRS6112272 SRS6112271 SRS6112270 SRS6112269 SRS6112308 SRS6112307 SRS6112306 SRS6112305 SRS6112304 SRS6112302 SRS6112290 SRS6112279 SRS6112268 SRS6112267 SRS6098991 SRS6098990 SRS6098989 SRS6098988 SRS6098999 SRS6098998 SRS6098997 SRS6098996 SRS6098995 SRS6098994 SRS6098993 SRS6098992 SRS6098987 SRS6098986

Quality filtering

We have previously covered the use of tools such as fastqc and trimmomatic to quality filter our dataset. QC is a required step for any high-throughput sequencing pipeline, but for simplicity we will skip it for the purposes of this tutorial.

Taxonomic classification of 16S reads using Kraken2

Analyzing single samples

Tools such as QIIME2 and Mothur are common for analyzing 16S rRNA sequences. For this tutorial, we will be using a different tool called Kraken2.

Suppose we wanted to analyze a single sample (e.g., SRS6112303). We can do so with the following Kraken2 command:

CLASSIFICATION_LVL=G  # this will set an environmental variable for the taxonomic level of classification desired (G = "Genus", S = "Species", etc.)
krakenDB=/data/krakendb/16S_Greengenes_k2db/  #this is the location of the kraken2 database you want to use for classification
fastq1=/fsys1/data/lobb-et-al/SRS6112303_1.fastq
fastq2=/fsys1/data/lobb-et-al/SRS6112303_2.fastq

kraken2 --db $krakenDB --paired --report report.txt --output kraken.out $fastq1 $fastq2

bracken -d $krakenDB -l $CLASSIFICATION_LVL -i report.txt -o bracken.out

bracken.out will look like this:

name	taxonomy_id	taxonomy_lvl	kraken_assigned_reads	added_reads	new_est_reads	fraction_total_reads
Parabacteroides distasonis	2601	S	8	53	61	0.00554
Parabacteroides gordonii	2602	S	1	37	38	0.00347
Bacteroides fragilis	2596	S	11	2044	2055	0.18386
Bacteroides uniformis	2599	S	1	9	10	0.00094
Clostridium pasteurianum	2769	S	278	346	624	0.05586
Clostridium perfringens	2770	S	146	986	1132	0.10128
Clostridium subterminale	2772	S	78	2011	2089	0.18691
Clostridium bowmanii	2764	S	13	64	77	0.00692
Clostridium butyricum	2765	S	10	102	112	0.01009
Clostridium neonatale	2768	S	3	33	36	0.00328
Alkaliphilus transvaalensis	2762	S	3	1	4	0.00037
Sporomusa polytropa	2800	S	4	98	102	0.00913
Veillonella dispar	2801	S	22	255	277	0.02478
...
...

Analyzing many samples

But these commands run Kraken2/Bracken on only a single sample. What do we do if we want to run them on all samples?

First, you need to have a list of the samples you want to analyze. This has been done for you with the file at /fsys1/data/lobb-et-al/files.txt.

Then, we will create a bash script called runAll.bash with the following contents.

#!/bin/bash

# $1 is the file containing the list of samples
# $2 is the classification level

CLASSIFICATION_LVL=$2

while IFS=$'\t' read sample
do 
    echo "processing sample $sample"

    kraken2 --db /data/krakendb/16S_Greengenes_k2db/ --paired --report $sample.$CLASSIFICATION_LVL.kraken --output $sample.$CLASSIFICATION_LVL.kraken.out /fsys1/data/lobb-et-al/${sample}_1.fastq /fsys1/data/lobb-et-al/${sample}_2.fastq

    bracken -d /data/krakendb/16S_Greengenes_k2db -l $CLASSIFICATION_LVL -i $sample.$CLASSIFICATION_LVL.kraken -o $sample.$CLASSIFICATION_LVL.bracken.out

done < $1

Let’s now run runAll.bash to apply Kraken2 and Bracken to all of our 16S samples.

#first let's create a new folder
mkdir order_classification

cd order_classification

# this will perform taxonomic classification at the Order level
bash runAll.bash /fsys1/data/lobb-et-al/files.txt O  # now wait a while....

Once completed, you will see that your folder is full of .bracken output files. To merge these together into a single file containing bracken output for all your samples, do the following:

python2.7 /usr/local/bin/Bracken-2.5/analysis_scripts/combine_bracken_outputs.py --files $(ls *.bracken.out) -o combined.order.out

Plotting results in R

Now, download your combined.order.out file to your local computer and load R for further analysis and plotting.

First load these libraries. Install these first if they are not already installed.

library(ggplot2)
library(reshape2)
library(viridisLite)

Next, load your data

tb = read.delim("combined.order.out",header=T,row.names=1)

#Bracken output has _frac and _num columns. We will just analyze the _num columns.
tbp = tb[,grep("_num",colnames(tb))]

#Transpose the table
tbp <- t(tbp)

#Convert to proportions
tb_prop<-as.data.frame(round(prop.table(as.matrix(tbp), 1) * 100,1))

#Choose a selection of taxa with a % > 3 (Note: might have to play around with this until you get a reasonable number of taxa to display)
tb_sub <- tb_prop[,apply(tb_prop, 2, function(x) max(x, na.rm = TRUE))>3]

For plotting, we have to do a few more modifications to the data matrix

#Melt the dataframe for plotting
tbm <- as.data.frame(melt(as.matrix(tb_sub)))

#fix labels
tbm[,1] = within(tbm, Var1<-data.frame(do.call('rbind', strsplit(as.character(Var1), '.', fixed=TRUE))))[,1][,1]

#Turn 0s into NAs
tbm[tbm == 0] <- NA

#Set the order of the taxa on the plot (Note: optional)
tbm$Var2 <- factor(tbm$Var2, levels = row.names(as.table(sort(colMeans(tb_sub)))))

Now, we can plot using ggplot2. Note: the following ggplot command is very parameter-rich, and it can be a lot simpler than this.

ggplot(tbm, aes(Var1,Var2,size = value,fill=value), colsep=c(1:100), rowsep=(1:100), sepwidth=c(5,1)) + geom_point(shape = 21, alpha=0.4) + ggtitle("") + xlab("") + ylab("") + theme(axis.text = element_text(colour= "black", size = 12), text = element_text(size=15), axis.text.x=element_text(angle=90, vjust = 0.5, hjust = 1))+ scale_size_area(max_size = 15,guide="none") + labs(fill="Relative\nfrequency (%)") + scale_fill_viridis_c()

This should produce the following plot:

We can also create a barplot by doing the following:

ggplot(tbm, aes(fill=Var2, y=value, x=Var1)) + 
    geom_bar(position="fill", stat="identity", col="grey50") +
	scale_y_continuous(labels=scales::percent) +
	xlab("") + ylab("Relative frequency") + labs(fill="Order") +
	theme(axis.text.x= element_text(angle = 90, hjust = 1))

… which should produce:

Adding in metadata annotations

The last plot contains sample names, but let’s replace these names with annotations from the metadata that are more informative.

First, make sure you have a metadata.txt text file that contains the 48 samples (column 1) and names (column 2) listed at the top of the page. It should look like this:

SRS6112281 WW1.1
SRS6112282 WW1.2
SRS6112283 WW1.3
SRS6112267 EE1.1
SRS6112268 EE1.2
SRS6112279 EE1.3
...

Now, load in your metadata

metadata = read.table("metadata.txt")

Now, let’s subset our data matrix to include only the metadata samples, and let’s also re-order the variables so that they plot in the desired order

#subset tbm to only include those samples in metadata
tbm = tbm[which(tbm[,1] %in% metadata[,1]),]
tbm[,1] = metadata[match(tbm[,1],metadata[,1]),2]
tbm$Var1 = factor(tbm$Var1,levels= metadata[,2])

ggplot(tbm, aes(fill=Var2, y=value, x=Var1)) + 
    geom_bar(position="fill", stat="identity", col="grey50") +
	scale_y_continuous(labels=scales::percent) +
	xlab("") + ylab("Relative frequency") + labs(fill="Order") +
	theme(axis.text.x= element_text(angle = 90, hjust = 1))

This should produce:

How does this result compare to the result from Lobb et al. (2020) here ?

Lastly, let’s create a heatmap and add in an annotation category

library(pheatmap)

# convert the tbm table back to a 2D matrix
tb = acast(tbm, Var1 ~ Var2,value.var='value',fill=0)
tb = t(tb)  #transpose

# let's split the names (EE, WW, etc.) into a matrix that we can use as annotations
annot = data.frame(do.call("rbind", strsplit(as.character(metadata[,2]), "", fixed = TRUE)))[,c(1,2)]
rownames(annot) = metadata[,2]
colnames(annot) = c("Fish_Origin","Water_Origin")

# specify the colors
ann_colors = list(
    Fish_Origin = c(W = "#EBEBEB", E = "#424242"),
    Water_Origin = c(W = "#EBEBEB", E = "#424242")
)

# plot
pheatmap(tb,annotation_col=annot,cluster_cols=F,annotation_colors=ann_colors,color = viridis(1000))

This should produce: