SARS-CoV-2: Assembler

A new assembly for your files.

Overview

One of the common analyses performed on a newly sequenced genome is a de novo assembly, which helps to understand the genome structure in terms of the location and function of its features. This pipeline takes the FASTQ files uploaded by the user and performs the assembly.

Workflow

The raw data, in the form of FASTQ files collected from sequencing experiments, are taken as input and checked for quality. The quality-checked, adapter-trimmed reads are mapped to the hg19 human reference genome (GRCh37.75). The unaligned reads are then taken for de novo assembly with the SPAdes program, and the assembled contigs are evaluated with the QUAST program. The assembled genome can be viewed in programs such as Bandage. The contigs are subjected to gene/ORF prediction, and the resulting sequences are further annotated using Prokka.
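The stages described above can be sketched as a list of command lines. This is only an illustration for paired-end short reads, not the pipeline's actual implementation: the file names, the index prefix and the function name workflow_commands are hypothetical, and the options shown are limited to basic ones from each tool's command-line help.

```python
def workflow_commands(r1, r2, host_index, outdir="sample_assembly", threads=16):
    """Return one argv list per workflow stage, in execution order.

    All intermediate file names below are hypothetical placeholders.
    """
    sam = "host_aligned.sam"
    contigs = f"{outdir}/contigs.fasta"  # SPAdes writes its contigs here
    return [
        # 1. Quality filtering and adapter trimming of the paired short reads.
        ["fastp", "-i", r1, "-I", r2, "-o", "trim_1.fq.gz", "-O", "trim_2.fq.gz"],
        # 2. Map the trimmed reads to the human reference (BWA index prefix).
        ["bwa", "mem", "-t", str(threads), "-o", sam, host_index,
         "trim_1.fq.gz", "trim_2.fq.gz"],
        # 3. Keep only pairs in which both mates are unmapped (flags 4 + 8 = 12).
        ["samtools", "fastq", "-f", "12",
         "-1", "unmapped_1.fq", "-2", "unmapped_2.fq", sam],
        # 4. De novo assembly of the reads that did not align to the host.
        ["spades.py", "-1", "unmapped_1.fq", "-2", "unmapped_2.fq", "-o", outdir],
        # 5. Evaluate the assembled contigs.
        ["quast.py", contigs, "-o", "quast_report"],
        # 6. Predict and annotate genes on the contigs.
        ["prokka", "--outdir", "annotation", "--prefix", "sample", contigs],
    ]


if __name__ == "__main__":
    for cmd in workflow_commands("sample_R1.fastq.gz", "sample_R2.fastq.gz", "hg19"):
        print(" ".join(cmd))
```

Each list could be passed to subprocess.run(cmd, check=True) on a machine where the tools are installed; here the commands are only printed.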

User Journey

As you can see, this pipeline shares the same structure as the others; using it takes only a few simple steps:

  1. once you have selected your pipeline of interest, upload your fastq files;

  2. select the parameters;

  3. run the pipeline;

then, you will be redirected to the results page.

As you can see, the pipeline design is divided into two parts: the first is dedicated to an overview of the pipeline itself, where you can name your analysis and view the estimated cost of use, and the second is concerned with data and parameters.

Let's take a look at the parameters.

The first thing to do is to upload your FASTQ files containing the Short Reads.

If you have Long Reads, switch on the appropriate button and upload your file.

Finally, upload your metadata file containing:

  • sample name;

  • run;

  • short reads;

  • long reads.
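As an illustration of how such a metadata file might be read, here is a minimal sketch. It assumes a comma-separated file with a header row; the column names (sample_name, run, short_reads, long_reads) and the use of ";" to separate paired short-read files are assumptions made for this example, not the platform's documented format.

```python
import csv
import io

# Hypothetical example file; the exact layout expected by the platform may differ.
EXAMPLE = """sample_name,run,short_reads,long_reads
sampleA,run1,sampleA_R1.fastq.gz;sampleA_R2.fastq.gz,sampleA_ont.fastq.gz
sampleB,run1,sampleB_R1.fastq.gz;sampleB_R2.fastq.gz,
"""

REQUIRED = ("sample_name", "run", "short_reads", "long_reads")


def read_metadata(handle):
    """Parse the metadata table into a list of per-sample dictionaries."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing metadata columns: {missing}")
    samples = []
    for row in reader:
        samples.append({
            "sample_name": row["sample_name"].strip(),
            "run": row["run"].strip(),
            # Assumed convention: paired files separated by ';'.
            "short_reads": [p.strip() for p in row["short_reads"].split(";") if p.strip()],
            # Long reads are optional, so an empty cell becomes None.
            "long_reads": (row["long_reads"] or "").strip() or None,
        })
    return samples


if __name__ == "__main__":
    for sample in read_metadata(io.StringIO(EXAMPLE)):
        print(sample)
```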

As for quality control, you can enable QC for short reads (FastQC) and for long reads (Nanoplot) by switching on the corresponding options.

For Adapter Trimming, fastp is used for Short Reads and Nanoplot for Long Reads.

The reference genome chosen for these analyses is hg19 (Homo sapiens (human) genome assembly GRCh37 (hg19) from the Genome Reference Consortium).

The reference genome index that will be applied will be:

  • for Short Reads: bwa;

  • for Long Reads: genome.fa.mmi.

The Alignment tool involved in this process will be:

  • for Short Reads: bwa_mem;

  • for Long Reads: minimap2.

The number of threads for alignment is set to 16, and you can save your .bam file by switching on the corresponding option.

As already mentioned, the de novo assembly is performed with the SPAdes program.

It is possible to choose whether or not to include the Long Reads and also whether to perform error correction.

The flag rna should be used when assembling RNA-Seq data sets.

K-mers are subsequences of length k contained within a biological sequence; here you can have the k-mer size automatically chosen or you can insert it manually.
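To make the definition concrete, the following small sketch lists every k-mer of a sequence. The function name kmers is chosen for this example only; a sequence of length L contains L - k + 1 overlapping k-mers.

```python
def kmers(sequence, k):
    """Return all overlapping subsequences of length k, in order of occurrence."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence shorter than k contains no k-mers, so the range is empty.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    print(kmers("ACGTA", 3))
```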

Default Parameters Set

Input Data:

  • Long Reads: switch on.

QC:

  • FastQC: switch on;

  • Nanoplot: switch on.

Adapter Trimming:

  • Program for Short Reads: fastp;

  • Program for Long Reads: Nanoplot.

Reference genomes and indexes:

  • Reference genome: hg19;

  • Reference genome index for Short Reads: bwa;

  • Reference genome index for Long Reads: genome.fa.mmi.

Alignment:

  • Alignment program for Short Reads: bwa_mem;

  • Alignment program for Long Reads: minimap2;

  • Threads: 16;

  • Save the alignment files: switch on.

Assembly:

  • De novo assembly tool: SPADES;

  • Include Long Reads: switch on;

  • Output directory: sample_assembly;

  • Error Correction: switch on;

  • rna: switch off;

  • K-mer size - automatic: switch on.
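As a rough sketch of how the assembly defaults above could translate into a SPAdes command line, consider the following. The dictionary keys and the function spades_args are hypothetical, and treating the long reads as Nanopore data (hence --nanopore) is an assumption; the platform's real mapping from switches to options is not documented here.

```python
# Hypothetical representation of the default assembly parameters listed above.
DEFAULTS = {
    "include_long_reads": True,
    "output_dir": "sample_assembly",
    "error_correction": True,
    "rna": False,
    "kmer_auto": True,
    "kmer_sizes": [],  # only used when kmer_auto is False
}


def spades_args(r1, r2, long_reads=None, **overrides):
    """Build a spades.py argument list from the defaults plus any overrides."""
    p = {**DEFAULTS, **overrides}
    args = ["spades.py", "-1", r1, "-2", r2, "-o", p["output_dir"]]
    if p["include_long_reads"] and long_reads:
        # Assumption: the long reads come from a Nanopore instrument.
        args += ["--nanopore", long_reads]
    if not p["error_correction"]:
        # Skips the read error-correction stage and runs assembly only.
        args.append("--only-assembler")
    if p["rna"]:
        args.append("--rna")
    if not p["kmer_auto"]:
        ks = sorted(p["kmer_sizes"])
        # SPAdes expects odd k-mer sizes smaller than 128.
        if not ks or any(k % 2 == 0 or k >= 128 for k in ks):
            raise ValueError("k-mer sizes must be odd and below 128")
        args += ["-k", ",".join(str(k) for k in ks)]
    return args


if __name__ == "__main__":
    print(" ".join(spades_args("r1.fq", "r2.fq", "ont.fq")))
    print(" ".join(spades_args("r1.fq", "r2.fq", kmer_auto=False, kmer_sizes=[55, 21, 33])))
```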

Results

Once you have selected the dataset to be used, chosen the pipeline and set all the parameters, you can start your analysis using the Run Analysis box. You will then be redirected to this page, where you can see which jobs are In Progress and which are Completed, and choose to carry out a new analysis.

By clicking on your JobName, you will have access to this page, where you can monitor all the processes involved in your analysis:

Now, selecting the Results box on the right, let's take a look at the demo results obtained using the Default Parameters Set:

Sequence Counts

Sequence counts for each sample. Duplicate read counts are an estimate only.

Raw Data Summary

GC Content (or guanine-cytosine content) is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine (G) or cytosine (C). This measure gives the proportion of G and C bases out of the four possible bases, the other two being:

  • adenine and thymine in DNA,

  • adenine and uracil in RNA.
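The definition above corresponds to a simple calculation, sketched here. The function name gc_content is chosen for this example, and ignoring ambiguous symbols such as N is a design choice of the sketch, not necessarily what the report does.

```python
def gc_content(sequence):
    """Return the percentage of G and C among the A, C, G, T and U bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    # Count only unambiguous bases; symbols such as N are ignored.
    total = gc + seq.count("A") + seq.count("T") + seq.count("U")
    if total == 0:
        return 0.0
    return 100.0 * gc / total


if __name__ == "__main__":
    print(gc_content("ATGCGC"))
```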

Alignment Stats

This graph is obtained from the BWA-MEM algorithm, one of the three algorithms of the Burrows-Wheeler Alignment Tool (BWA). BWA-MEM is generally recommended for high-quality queries because it is faster and more accurate than the other two (BWA-backtrack and BWA-SW).

BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome.

The BWA-MEM algorithm performs local alignment and may produce multiple primary alignments for different parts of a query sequence; this is a crucial feature for long sequences.

The execution of the algorithm is as follows:

  • seeding alignments with maximal exact matches (MEMs);

  • extending seeds with the affine-gap Smith-Waterman algorithm (SW).
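The two steps above can be illustrated with a toy sketch. It is not how BWA is implemented (BWA finds seeds with an FM-index rather than by brute force), and the function names and scoring values are illustrative assumptions. The first function finds maximal exact matches by exhaustive comparison; the second computes a Smith-Waterman local alignment score with affine gaps, where a gap of length n costs gap_open + n * gap_extend.

```python
def maximal_exact_matches(read, ref, min_len=3):
    """Return (read_pos, ref_pos, length) for each maximal exact match."""
    seeds = []
    for i in range(len(read)):
        for j in range(len(ref)):
            if read[i] != ref[j]:
                continue
            # Skip matches that could still be extended to the left.
            if i > 0 and j > 0 and read[i - 1] == ref[j - 1]:
                continue
            length = 0
            # Extend to the right for as long as the bases agree.
            while (i + length < len(read) and j + length < len(ref)
                   and read[i + length] == ref[j + length]):
                length += 1
            if length >= min_len:
                seeds.append((i, j, length))
    return seeds


def local_affine_score(a, b, match=1, mismatch=-4, gap_open=6, gap_extend=1):
    """Best local alignment score of a and b with affine gap penalties."""
    neg = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]    # best score ending at (i, j)
    E = [[neg] * cols for _ in range(rows)]  # alignment ends with a gap in a
    F = [[neg] * cols for _ in range(rows)]  # alignment ends with a gap in b
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            E[i][j] = max(E[i][j - 1] - gap_extend,
                          H[i][j - 1] - gap_open - gap_extend)
            F[i][j] = max(F[i - 1][j] - gap_extend,
                          H[i - 1][j] - gap_open - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a score never drops below zero.
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(maximal_exact_matches("ACGT", "TTACGTTT"))
    print(local_affine_score("ACGTACGT", "ACGTACGT"))
```

In a real aligner the extension step starts from the seeds instead of filling the whole matrix, which is what makes the seed-and-extend approach fast on a genome-sized reference.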

Assembly Stats

Here is an example of the first rows of the assembly stats.

By clicking on the Interactive Graphs option, another way of displaying the results is made available:

Finally, using the Export box, you will be able to download the results of your analysis in a .pdf format file.

Pipeline reference: This pipeline is mainly based on the Galaxy Project workflow "Assembly of SARS-CoV-2 from pre-processed reads":

https://github.com/galaxyproject/SARS-CoV-2/tree/master/genomics/2-Assembly

References