# SARS-CoV-2: Assembler

### Overview

One of the most common analyses performed on a newly sequenced genome is a ***de novo* assembly**, which reconstructs the genome and helps describe its structure in terms of the location and function of its features. This pipeline takes the user-uploaded *fastq files* and performs the assembly.

### Workflow

![](/files/-M5aPoHzqpO1wRGnj1Vr)

The raw data from sequencing experiments, provided as fastq files, are taken as input and checked for quality. The quality-checked, adapter-trimmed reads are mapped to the *hg19 human reference genome (GRCh37.75)* so that reads of human origin can be identified. The unaligned reads are then passed to *de novo* assembly with the *SPAdes* program, and the assembled contigs are evaluated with the *QUAST* program. The assembled genome can be viewed in programs such as Bandage. The contigs are then subjected to gene/ORF prediction, and the resulting sequences are annotated using Prokka.
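The step that separates unaligned reads from aligned ones can be illustrated with a minimal Python sketch. It assumes the alignments are available as SAM text, where the second column is a bitwise FLAG and bit `0x4` means "segment unmapped". The read names and records are made up, and whether the pipeline filters reads exactly this way is an assumption; in practice a tool such as samtools (for example `samtools view -f 4`) does this job.

```python
def is_unmapped(sam_line: str) -> bool:
    """Return True if a SAM alignment line has the 'segment unmapped' flag (0x4) set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2 of a SAM record is the bitwise FLAG
    return bool(flag & 0x4)


def unmapped_read_names(sam_lines):
    """Collect the names of reads that did not align; header lines start with '@'."""
    names = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        if is_unmapped(line):
            names.append(line.split("\t")[0])
    return names


# Toy records with hypothetical read names; only the first two columns matter here.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "read_human\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read_viral\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
]
print(unmapped_read_names(example))
```

Reads kept by this filter are the ones that would go on to the assembly step.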

#### User Journey

This pipeline shares the same structure as the others. To use it, follow a few simple steps:

1. once you have selected your pipeline of interest, upload your fastq files;
2. select the parameters;
3. run the pipeline;

then, you will be redirected to the results page.

The pipeline page is divided into two parts: the first gives an overview of the pipeline, where you can name your analysis and view the estimated cost of use; the second covers data and parameters.

![](/files/-MDyRE1vJ0Bw0HeoQLKy)

Let's take a look at the **parameters**.

![](/files/-MDyRf8iMyFJoPYZr7VX)

The first thing to do is to upload your **Fastq Files** containing the Short Reads.

If you have Long Reads, *switch on* the appropriate button and upload your file.

Finally, upload your metadata file containing:

* sample name;
* run;
* short reads;
* long reads.
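To make the metadata layout concrete, here is a minimal Python sketch that reads such a table and checks that the four fields are present. The comma-separated format, the column names (`sample_name`, `run`, `short_reads`, `long_reads`) and the file names are assumptions for illustration; check the platform for the exact header spelling it expects.

```python
import csv
import io

# Hypothetical column names; the platform's actual header spelling may differ.
EXPECTED_COLUMNS = ["sample_name", "run", "short_reads", "long_reads"]


def read_metadata(handle):
    """Parse a comma-separated metadata table and check the expected columns exist."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"metadata file is missing columns: {missing}")
    return list(reader)


# In-memory example with made-up sample and file names.
example = io.StringIO(
    "sample_name,run,short_reads,long_reads\n"
    "sampleA,run1,sampleA_R1.fastq.gz,sampleA_ont.fastq.gz\n"
)
rows = read_metadata(example)
print(rows[0]["sample_name"], rows[0]["long_reads"])
```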

As for **quality control**, *switch on* the corresponding option to run QC on Short Reads (**FastQC**), on Long Reads (**Nanoplot**), or on both.

For **Adapter Trimming**, *fastp* is used for Short Reads and *Nanoplot* for Long Reads.

![](/files/-MDyRoOKSQ4nNdwvIYrU)

The **reference genome** chosen for these analyses is *hg19* (Homo sapiens (human) genome assembly GRCh37 (*hg19*) from the Genome Reference Consortium).

The reference genome **index** applied is:

* for Short Reads: *bwa*;
* for Long Reads: *genome.fa.mmi*.

The **Alignment** tool used in this process is:

* for Short Reads: *bwa\_mem*;
* for Long Reads: *minimap2*.

The number of **threads** for alignment is set to *16*, and you can save your .bam file by *switching on* the corresponding option.
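As a rough illustration of what these alignment settings correspond to on the command line, the sketch below builds argument lists for `bwa mem` and `minimap2` with 16 threads. The file names and index paths are hypothetical, and the exact commands the pipeline runs are not documented here, so treat this as an assumption-based sketch. Both tools write SAM to standard output, which would then be converted to a .bam file.

```python
def short_read_alignment_cmd(index_prefix, reads, threads=16):
    """Argument list for `bwa mem`: -t sets threads, then index prefix and 1 or 2 FASTQ files."""
    return ["bwa", "mem", "-t", str(threads), index_prefix, *reads]


def long_read_alignment_cmd(index_file, reads, threads=16):
    """Argument list for `minimap2`: -a requests SAM output, -t sets threads."""
    return ["minimap2", "-a", "-t", str(threads), index_file, reads]


# Hypothetical file names, for illustration only.
short_cmd = short_read_alignment_cmd(
    "bwa_index/genome.fa", ["sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz"]
)
long_cmd = long_read_alignment_cmd("genome.fa.mmi", "sampleA_ont.fastq.gz")
print(" ".join(short_cmd))
print(" ".join(long_cmd))
```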

![](/files/-MDyRuKpbZHXRR-nwzC-)

As mentioned above, *de novo* assembly is performed with the *SPAdes* program.

It is possible to choose whether or not to include the Long Reads and also whether to perform **error correction**.

The flag **rna** should be used when assembling RNA-Seq data sets.

**K-mers** are subsequences of length *k* contained within a biological sequence; here you can have the k-mer size chosen automatically or enter it manually.
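To illustrate the definition, a short Python sketch that lists every k-mer of a sequence and counts how often each one occurs. The sequences are toy examples, not pipeline data.

```python
from collections import Counter


def kmers(sequence, k):
    """Return all overlapping subsequences of length k, in order of occurrence."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("ACGTAC", 3))

# Counting k-mers shows which subsequences are repeated in a read.
counts = Counter(kmers("ACGTACGT", 3))
print(counts.most_common(2))
```

A sequence of length *n* contains *n - k + 1* overlapping k-mers, so a larger *k* gives fewer but more specific k-mers.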

#### Default Parameters Set

Input Data:

* Long Reads: *switch on.*

QC:

* FastQC: *switch on;*
* Nanoplot: *switch on.*

Adapter Trimming:

* Program for Short Reads: *fastp;*
* Program for Long Reads: *Nanoplot.*

Reference genomes and indexes:

* Reference genome: *hg19;*
* Reference genome index for Short Reads: *bwa;*
* Reference genome index for Long Reads: *genome.fa.mmi.*

Alignment:

* Alignment program for Short Reads: *bwa\_mem;*
* Alignment program for Long Reads: *minimap2;*
* Threads: *16;*
* Save the alignment files: *switch on*.

Assembly:

* De novo assembly tool: *SPADES;*
* Include Long Reads: *switch on;*
* Output directory: *sample\_assembly;*
* Error Correction: *switch on;*
* rna: *switch off;*
* K-mer size - automatic: *switch on.*
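As an illustration of how the assembly defaults could translate into a SPAdes call, the sketch below maps each switch to a `spades.py` option: `--only-assembler` skips read error correction, `--rna` selects RNA-Seq mode, and `-k` takes a comma-separated list of k-mer sizes when they are set manually. The file names are hypothetical, and passing the Long Reads with `--nanopore` is an assumption (SPAdes also has a `--pacbio` option); the command the pipeline really runs may differ.

```python
def spades_cmd(short_reads, out_dir="sample_assembly", long_reads=None,
               error_correction=True, rna=False, kmer_sizes=None):
    """Build a spades.py argument list from the assembly switches described above."""
    cmd = ["spades.py"]
    if len(short_reads) == 2:
        cmd += ["-1", short_reads[0], "-2", short_reads[1]]  # paired-end reads
    else:
        cmd += ["-s", short_reads[0]]  # single-end reads
    if long_reads:
        cmd += ["--nanopore", long_reads]  # assumption: Long Reads are Nanopore
    if not error_correction:
        cmd.append("--only-assembler")  # skip the read error correction stage
    if rna:
        cmd.append("--rna")
    if kmer_sizes:  # None keeps the automatic k-mer choice
        cmd += ["-k", ",".join(str(k) for k in kmer_sizes)]
    cmd += ["-o", out_dir]
    return cmd


# Default parameter set, with hypothetical file names.
default_cmd = spades_cmd(
    ["sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz"],
    long_reads="sampleA_ont.fastq.gz",
)
print(" ".join(default_cmd))
```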

### Results

Once you have selected the dataset, chosen the pipeline and set all the parameters, you can start your analysis using the *Run Analysis* box. You will then be redirected to this page, where you can see which jobs are *In Progress* and which are *Completed*, and choose to start a new analysis.

![](/files/-ME3S3HQbJfzDdiKFdRV)

By clicking on your *JobName*, you will have access to this page, where you can monitor all the processes involved in your analysis:

![](/files/-ME3Rj_M3uekxa18wcRx)

Now, selecting the *Results* box on the right, let's take a look at the demo results obtained using the Default Parameters Set:

**Sequence Counts**

Sequence counts for each sample. Duplicate read counts are an estimate only.

![](/files/-M5aanerBAWOv6HqdkWd)

**Raw Data Summary**

![](/files/-M6-yG_Bu6cka1dUj1bq)

**GC Content** (or **guanine-cytosine content**)

GC content is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine (G) or cytosine (C). This measure indicates the proportion of G and C bases out of an implied four total bases, the other two being:

* adenine and thymine in DNA,
* adenine and uracil in RNA.
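The definition above can be written as a few lines of Python. This is a minimal sketch on toy sequences; ignoring ambiguous symbols such as `N` is a choice made here for illustration and is not necessarily how the report computes the value.

```python
def gc_content(sequence):
    """Percentage of G and C among the A, C, G, T and U bases of a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T") + seq.count("U")
    if total == 0:
        return 0.0  # no recognised bases, avoid dividing by zero
    return 100.0 * gc / total


print(gc_content("ATGC"))  # DNA example
print(gc_content("AUGC"))  # RNA example
```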

![](/files/-M6-yOmtL0HsDQRPL5Ct)

**Alignment Stats**

![](/files/-M6-y_mZc1UfjIA_LMuZ)

This graph is obtained from the BWA-MEM algorithm, one of the three algorithms of the Burrows-Wheeler Alignment Tool (BWA). BWA-MEM is generally recommended for high-quality queries as it is faster and more accurate than the other two.

BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome.

The BWA-MEM algorithm performs local alignment and may produce multiple primary alignments for different parts of a query sequence, which is a crucial feature for long sequences.

The algorithm works as follows:

* seeding alignments with maximal exact matches (MEMs);
* extending seeds with the affine-gap Smith-Waterman algorithm (SW).
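The two steps can be illustrated with a toy Python sketch. It is not BWA-MEM's implementation: seeding is simplified to fixed-length exact matches instead of true maximal exact matches, and the extension step scores the whole pair of sequences with the affine-gap Smith-Waterman recurrences (Gotoh's formulation) instead of extending from a seed. The scoring values are arbitrary assumptions, with `gap_open` being the cost of the first gapped base and `gap_extend` the cost of each further one.

```python
def exact_seeds(query, reference, k):
    """Fixed-length exact matches as (query_pos, reference_pos); a simplification of MEMs."""
    index = {}
    for j in range(len(reference) - k + 1):
        index.setdefault(reference[j:j + k], []).append(j)
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds


def affine_local_score(a, b, match=2, mismatch=-3, gap_open=3, gap_extend=1):
    """Best local alignment score of a and b with affine gap penalties."""
    neg = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]    # best score ending at (i, j)
    E = [[neg] * cols for _ in range(rows)]  # alignments ending with a gap in a
    F = [[neg] * cols for _ in range(rows)]  # alignments ending with a gap in b
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            E[i][j] = max(E[i][j - 1] - gap_extend, H[i][j - 1] - gap_open)
            F[i][j] = max(F[i - 1][j] - gap_extend, H[i - 1][j] - gap_open)
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, E[i][j], F[i][j])  # 0 restarts a local alignment
            best = max(best, H[i][j])
    return best


print(exact_seeds("ACGT", "TTACGT", 4))
print(affine_local_score("ACGTACGT", "ACGTTACGT"))
```

In the second call the reference carries one extra base, so the best alignment matches all eight query bases and pays for a single one-base gap.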

**Assembly Stats**

![An example of the first rows of assembly stats.](/files/-M6UxE4Fo0KsObGWS6wW)

![](/files/-M5bavv5rgYp67QAvxdC)

By clicking on the Interactive Graphs option, another way of displaying the results is made available:

![](/files/-M60KRB5ivS_UKqpudtq)

![](/files/-M60Lmgi0QGVzps3w7N7)

![](/files/-M60LvdNZ3ewteDh94p9)

Finally, using the *Export* box, you can download the results of your analysis as a *.pdf* file.

**Pipeline reference:** This pipeline is mainly based on the Galaxy Project workflow *Assembly of SARS-CoV-2 from pre-processed reads*:

<https://github.com/galaxyproject/SARS-CoV-2/tree/master/genomics/2-Assembly>

#### References

1. [**Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM**](https://arxiv.org/abs/1303.3997), Li H., 2013
2. [**Assembling genomes and mini-metagenomes from highly chimeric reads**](https://link.springer.com/chapter/10.1007/978-3-642-37195-0_13), Nurk S. et al, 2013


