SARS-CoV-2: Alignment

The variants study.

Overview

One of the most common analyses performed on a newly sequenced genome is to align it against an available reference genome and identify the variants present in the sample. This pipeline calls the variants and annotates them.

Variants are genetic differences between:

  • healthy and diseased tissue,

  • individuals of a population,

  • strains of an organism

that can provide mechanistic insight into disease processes and the natural function of affected genes.

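Normalizing a variant representation in VCF involves two properties: parsimony, which concerns the length of the variant, and left alignment, which concerns its position. A variant is parsimonious when it is represented in as few nucleotides as possible without any allele having length zero, and left-aligned when its position cannot be shifted any further to the left without changing the allele lengths.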
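To make the two properties concrete, here is a minimal Python sketch of a normalization routine for a single biallelic variant. It is an illustration only, not the code the pipeline runs: the function name `normalize_variant`, its arguments and the toy reference sequence are all assumptions made for this example.

```python
def normalize_variant(ref_seq, pos, ref, alt):
    """Return a parsimonious, left-aligned (pos, ref, alt).

    ref_seq is the reference sequence as a string; pos is the 1-based
    position of the first base of `ref` (the VCF convention).
    Hypothetical helper written for illustration only.
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical: not a variant")

    changed = True
    while changed:
        changed = False
        # Parsimony on the right: drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # Left alignment: if an allele became empty, extend both
        # alleles by one reference base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of position 1")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True

    # Parsimony on the left: drop shared leading bases, keeping
    # at least one base in each allele.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CA repeat; the deletion of one CA unit is
# first written at position 6 and ends up at the start of the repeat.
toy_reference = "GGGCACACACAGGG"
normalized = normalize_variant(toy_reference, 6, "CAC", "C")
```

In this toy case the same deletion could be written at several positions inside the repeat; normalization picks the leftmost one, so that identical variants from different samples compare as equal.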

Workflow

The figure above shows the analysis of variation between the samples. The raw data from sequencing experiments are taken as input and checked for quality; the quality-checked reads are then mapped to the hg19 human reference genome. The reads that remain unaligned are then aligned to the coronavirus reference genome (NC_045512). The alignments are checked for duplicates and realigned using picard. The variant calling tool (lofreq) is run on these realigned files to obtain the variants. Finally, the variants are annotated with SnpEff.
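If it helps to picture the workflow as commands, the sketch below builds a rough command-line equivalent of each step. It only assembles strings and runs nothing. The file names, the single-end layout, the use of samtools to sort and to extract unmapped reads, and the SnpEff database name are assumptions made for this example; the realignment step is left out because the command behind it is not described here. The platform's actual invocations may differ.

```python
def build_commands(sample, threads=16,
                   host_ref="hg19.fa",
                   virus_ref="NC_045512.fa",
                   snpeff_db="NC_045512.2"):
    """Return illustrative shell commands for one single-end sample.

    All file names and the snpeff_db value are hypothetical.
    """
    return [
        # Adapter trimming
        f"fastp -i {sample}.fastq -o {sample}.trimmed.fastq",
        # Map to the human reference and sort the alignments
        f"bwa mem -t {threads} {host_ref} {sample}.trimmed.fastq"
        f" | samtools sort -o {sample}.host.bam -",
        # Keep only reads that did not map to the human genome (flag 4)
        f"samtools fastq -f 4 {sample}.host.bam > {sample}.unmapped.fastq",
        # Map the remaining reads to the coronavirus reference
        f"bwa mem -t {threads} {virus_ref} {sample}.unmapped.fastq"
        f" | samtools sort -o {sample}.virus.bam -",
        # Mark duplicate reads
        f"picard MarkDuplicates I={sample}.virus.bam"
        f" O={sample}.dedup.bam M={sample}.metrics.txt",
        # Call variants
        f"lofreq call -f {virus_ref} -o {sample}.vcf {sample}.dedup.bam",
        # Annotate variants
        f"snpEff ann {snpeff_db} {sample}.vcf > {sample}.annotated.vcf",
    ]


for command in build_commands("sampleA"):
    print(command)
```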

User Journey

As you can see, this pipeline shares the same structure as the others; using it takes just a few simple steps:

  1. once you have selected your pipeline of interest, upload your fastq files;

  2. select the parameters;

  3. run the pipeline;

then, you will be redirected to the result page.

As you can see, the pipeline design is divided into two parts: the first gives an overview of the pipeline itself, where you can name your analysis and view the estimated cost of use; the second covers data and parameters.

Let's take a look at the parameters.

The first thing to do is to upload your Fastq Files containing the Short Reads; then, upload your metadata file containing:

  • sample name;

  • run;

  • short reads;

  • long reads.
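The exact layout of the metadata file is not specified here, so the following is only a sketch of how such a file could be checked before upload. It assumes a comma-separated file; the column names `sample_name`, `run`, `short_reads` and `long_reads`, and the example row, are hypothetical.

```python
import csv
import io

# Hypothetical column names matching the four fields listed above.
REQUIRED_COLUMNS = ["sample_name", "run", "short_reads", "long_reads"]

# Made-up example content; long_reads is left empty for this sample.
EXAMPLE_METADATA = """sample_name,run,short_reads,long_reads
sampleA,run1,sampleA.fastq.gz,
"""


def read_metadata(handle):
    """Parse a metadata file and check that every expected column exists."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing metadata columns: {missing}")
    return list(reader)


rows = read_metadata(io.StringIO(EXAMPLE_METADATA))
```

Each row comes back as a dictionary keyed by column name, so `rows[0]["short_reads"]` gives the short-read file listed for the first sample.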

Quality control is optional: to enable it, just switch it on.

For Adapter Trimming, the program used will be fastp tool.

The chosen reference genome for these analyses will be hg19 (Homo sapiens (human) genome assembly GRCh37 (hg19) from Genome Reference Consortium).

The reference genome index that will be applied will be bwa.

The Alignment tool involved in this process will be bwa_mem, the fastest and most accurate of the Burrows-Wheeler Alignment tool algorithms for high-quality queries.

The number of threads for alignment is set to 16, and you can save your .bam file by switching the option on.

The SARS-CoV-2 Reference Genome chosen to be aligned is NC_045512.

As already mentioned, the programs used are:

  • for mark duplicates: picard;

  • for variant calling on realigned files: lofreq;

  • for variant annotation on VCF files: SnpEff.

Default Parameters Set

QC:

  • FastQC: switch on.

Adapter Trimming:

  • Program for Short Reads: fastp.

Reference genomes, indexes and alignment:

  • Reference genome: hg19;

  • Reference genome index for Short Reads: bwa;

  • Alignment program for Short Reads: bwa_mem;

  • Threads: 16;

  • Save the alignment files: switch on.

Alignment with Coronavirus Reference Genome:

  • Duplicate Mark Program: picard;

  • Variant Calling Program: lofreq;

  • Variant Annotation Program: SnpEff.

Results

Once you have selected the dataset to be used, chosen the pipeline and set all the parameters, you can start your analysis using the Run Analysis box. At this point you will be redirected to this page, where you can see which jobs are In Progress and which are Completed, and choose to carry out a new analysis.

By clicking on your JobName, you will have access to this page, where you can monitor all the processes involved in your analysis:

Now, selecting the Results box on the right, let's take a look at the demo results obtained using the Default Parameters Set:

Sequence Counts

Sequence counts for each sample. Duplicate read counts are an estimate only.

GC Content

GC content (or guanine-cytosine content) is the percentage of nitrogenous bases that are guanine (G) or cytosine (C) in a DNA or RNA molecule. This measure indicates the proportion of G and C bases out of an implied four total bases, the other two being:

  • adenine and thymine in DNA,

  • adenine and uracil in RNA.

This module measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content.
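As a small illustration of the measure itself, the sketch below computes the GC percentage of a sequence and tallies the per-read values into a distribution. The function names and the example reads are made up for this example, and the comparison against a modelled distribution is not reproduced.

```python
from collections import Counter


def gc_content(sequence):
    """Return the percentage of G and C among the A, C, G, T and U bases."""
    sequence = sequence.upper()
    counted = sum(sequence.count(base) for base in "ACGTU")
    if counted == 0:
        # No recognised bases (for example a read made only of N).
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return 100.0 * gc / counted


def gc_distribution(reads):
    """Count how many reads fall at each rounded GC percentage."""
    return Counter(round(gc_content(read)) for read in reads)


example_reads = ["ATGC", "GGCC", "AATT", "AUGC"]
distribution = gc_distribution(example_reads)
```

Ambiguous bases such as N are ignored in the denominator here; that is a design choice of this sketch rather than a statement about how the report handles them.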

Hg19 Alignment Stats

This graph is obtained from the BWA-MEM algorithm, one of the three algorithms of the Burrows-Wheeler Alignment Tool (the other two being BWA-backtrack and BWA-SW). BWA-MEM is generally recommended for high-quality queries as it is faster and more accurate than the other two.

BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome.

The BWA-MEM algorithm performs local alignment and may produce multiple primary alignments for different parts of a query sequence: this is a crucial feature for long sequences.

The execution of the algorithm is as follows:

  • seeding alignments with maximal exact matches (MEMs);

  • extending seeds with the affine-gap Smith-Waterman algorithm (SW).
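To give an idea of what the extension step computes, here is a compact sketch of local alignment scoring with affine gap penalties (Smith-Waterman in the Gotoh formulation). It is not BWA's implementation: it fills the full scoring matrices, skips seeding entirely and returns only the best score. The function name is hypothetical, and the default scoring values are assumptions chosen to resemble typical short-read settings, where a gap of length k costs gap_open + k * gap_extend.

```python
def smith_waterman_affine(a, b, match=1, mismatch=-4,
                          gap_open=-6, gap_extend=-1):
    """Return the best local alignment score between strings a and b."""
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # best[i][j]: best score of a local alignment ending at a[i-1], b[j-1]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    # gap_a / gap_b: best score ending with a gap in a / in b
    gap_a = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    gap_b = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    top_score = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either extend an existing gap or open a new one.
            gap_a[i][j] = max(gap_a[i][j - 1] + gap_extend,
                              best[i][j - 1] + gap_open + gap_extend)
            gap_b[i][j] = max(gap_b[i - 1][j] + gap_extend,
                              best[i - 1][j] + gap_open + gap_extend)
            step = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment restart anywhere.
            best[i][j] = max(0, best[i - 1][j - 1] + step,
                             gap_a[i][j], gap_b[i][j])
            top_score = max(top_score, best[i][j])
    return top_score


identical_score = smith_waterman_affine("ACGT", "ACGT")
```

With affine penalties, opening a gap is expensive and lengthening it is cheap, so one longer insertion or deletion is preferred over several scattered single-base gaps.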

Corona Alignment Stats

Variant Stats

Here is an example of the first rows of the variant stats.

By clicking on the Interactive Graphs option, another way of displaying the results is available:

Finally, using the Export box, you will be able to download the results of your analysis as a .pdf file.

Pipeline reference:

This pipeline is based on analysis of variants between samples.

https://github.com/galaxyproject/SARS-CoV-2/tree/master/genomics/4-Variation

Reference