shamshadpbg.github.io

An interactive course on learning genomics/bioinformatics tasks in Linux.

View on GitHub

Task 2 - Genome Assembly

In this lab, you will download raw sequencing data, perform genome assembly, visualize and analyze your assemblies, and compare the assembled genome sequence to the database using BLAST.

Requirements

Installation

If you do not already have access to a GUI running the graphical software listed above, please install the software on your local machine. Once locally installed, you can download results off the linux server and locally visualize them on your own system.

All software used are available for Mac/Windows/Linux.


Getting Started

Login to your Linux environment as you did in task1, and create a new folder for your task2 work

mkdir task2  #creates folder
cd task2 #enters into folder

Retrieving the raw data

Download the raw sequencing data from https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task2/mt_reads.fastq.gz into your folder and then uncompress it.

Explore the file using less.

Next, consider how would you return the sequence of the nth read using only head and tail or grep? Note that each read consists of a fixed number of lines and the first line of a read starts with a specific character.

question Q1) What is the sequence of the third read in the file? Make sure to remove all spaces.

question Q2) How many reads are in the file? Hint: the command grep "^X" reports all lines starting with the character X.

Data preprocessing

Before we can assemble a genome, we need to:

1) Assess the quality of the sequencing data 2) Demultiplex the data 3) Trim barcodes 4) Filter out low-quality reads (this is called quality filtering)

Quality assessment

For a quick quality report, you can use the program fastqc. The command below will analyze the mt_reads.fastq file and produce an .html results file.

fastqc mt_reads.fastq

Download the ‘fastqc_report.html’ file to your local machine and open it in a web browser. Tip: find the path to your file with realpath yourFile.txt

Explore and inspect the FastQC report for mt_reads.fastq.

question Q3) Which of the following statements is correct?

question Q4) The per-base sequence quality is lowest at the _____ of the reads.

question Q5) Most of the reads were assigned a quality (Phred) score of _____ .

question Q6) Examine the per-base sequence content. The base composition is unusual/unexpected for position _____ to position _____ of the reads.

question Q7) This unexpected composition may be due to the inclusion of _____ .

Splitting the barcodes (demultiplexing)

Sequencing data may be barcoded. In this case, the data contains two different samples, each with a unique barcode. This allows us to split the data by sample. Sometimes, sequencing data can have tens or hundreds of barcodes. See multiplexing

We will use a standard script from the fastx toolkit to split the data by its known barcodes (defined already for you in the file downloaded below)

#first download the barcodes file
wget https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task2/mt_barcodes.txt

#now split
fastx_barcode_splitter.pl <mt_reads.fastq --bcfile mt_barcodes.txt --bol --suffix .fastq --prefix splitData_

There are now two .fastq files; one for each barcode. There is also an unmatched.fasta file which should be empty. We will be focusing on the first sample, ie. the one now in the file ‘splitData_mt1.fastq’.

Barcode trimming

Barcode trimming is needed to remove the barcode sequences from the beginning of each read. The Q33 is required due to differences in sanger and illumina encoding.

fastx_trimmer -i splitData_mt1.fastq -f 9 -o trimmed_mt1.fastq -Q33

Quality filtering

Next, we need to remove low quality sequences. This will increase the accuracy of the assembly.

fastq_quality_filter -i trimmed_mt1.fastq -q 25 -p 80 -o qual_trim_mt1.fastq -Q33 -v

question Q8) What percentage of reads were removed?

question Q9) Compare FastQC reports from before and after trimming and quality filtering. Which of the measures improved from a warning/fail to a pass?

Genome Assembly

Now we are ready to assemble a genome.

Assembly with Velvet

To start we are going to try using the popular velvet assembler. Like many assemblers, velvet performs genome assembly using de bruijn graphs. This means that we must choose a value of k to define the k-mers (sequence fragments of length k) to be used in constructing the graph.

Read more: velvet. de bruijn graphs de novo assemblers

The commands below will compute the graph. The first parameter is the folder name (you choose) and the second parameter is the value of k. So below, we are assembling the genome from the trimmed and quality-filtered reads using a k-mer value of 21.

velveth out_21 21 -short -fastq qual_trim_mt1.fastq

Next, to compute the actual contig sequences from the graph, run the following:

velvetg out_21/ -scaffolding no -read_trkg yes -amos_file yes

question Q10) How many nodes are there in the graph that was produced?

Inspect the contigs.fa file that has been produced (will be in out_21 folder).

question Q11) How many contigs do you get using k=21?

question Q12) How many contigs do you get using k=31?

Assembly visualization

Assembly visualization with Tablet

Velvet has the option of keeping track of where the reads map to the assembly using the -read_trkg flag. This will produce a velvet_asm.afg file.

Download this file (from your k=31 assembly) to your local machine. Then open it in tablet. Tablet is a great program to explore how reads map to assemblies and genomes.

question Q13) What is the average contig length?

question Q14) What is the N50 value?

question Q15) Examine the read coverage across the longest contig. Does the coverage distribution match that shown here?

Explore tablet more on your own. We will be using it later in the course.

Assembly visualization with Bandage (optional)

Velvet and other de bruijn assemblers produce a graph that can be visualized. bandage is an excellent tool for this purpose.

If you are interested, locate the ‘lastgraph’ file produced by velvet and download it to your local machine.

Open this file in the Bandage application, and explore further.

Generating an improved assembly with ABYSS

As you can see based on the results from above, velvet (with the parameters we chose) did not yield a high quality assembly. It is too fragmented.

Often in genomics it is useful to try numerous parameters and different assemblers. Let’s try to assemble this genome again but with a different assembler. This time, we will be using the popular abyss assembler and we’ll keep the value of k = 21. And see here for info on stats reported by an Abyss assembly.

abyss-pe k=21 in='qual_trim_mt1.fastq' name=abyss-assembly

question Q16) How long (# bases) is your assembly?

question Q17) What is the N50 value (# bases)?

question Q18) How many contigs did abyss generate?

What is the taxonomic source of your genome? Explore with BLAST

You still do not know the source of this genome. Is it eukaryotic? bacterial? Is it a nuclear genome, a plasmid, or something else?

To investigate this question, do a BLAST search using the online BLAST tool. Use the Abyss assembly as your query.

question Q19) Based on the BLAST result, describe the most likely source of this DNA sequence?

ASSIGNMENT QUESTIONS

Please answer questions 1-19 above on LEARN under Quizzes.

#

Congratulations. You have now completed Task 2.