Variant Caller: Sarek

Overview

This pipeline is designed to run analyses on whole genome or targeted sequencing data.

It is based on the nf-core/sarek pipeline.

In this pipeline, the raw reads are quality-checked and adapter-trimmed. The preprocessed reads are then aligned to the human reference genome and recalibrated. Variants are called with GATK best-practice tools as implemented in the Sarek pipeline, and are then annotated with SnpEff.

Workflow

User Journey

As you can see, this pipeline shares the same structure as the others:

  1. Once you have selected your pipeline of interest, upload your fastq files;

  2. Select the parameters;

  3. Run the pipeline;

You will then be redirected to the results page.

As you can see, the pipeline design is divided into two parts. The first part gives an overview of the pipeline itself, where you can name your analysis and view the estimated cost; the second part covers the data and parameters.

Let's take a look at the parameters.

First, upload your FASTQ files under Short Reads; then upload your metadata file, which contains:

  • sample name;

  • run;

  • short reads;

  • long reads.
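As an illustration, a metadata file with these four fields could be assembled as a CSV file. The column names, the example file names and the CSV layout below are assumptions made for this sketch; check the platform's own template for the exact format it expects.

```python
import csv
import io

# Hypothetical column names; the platform's real header may differ.
COLUMNS = ["sample_name", "run", "short_reads", "long_reads"]


def build_metadata(rows):
    """Return CSV text with a header line and one line per sample/run."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Example values are invented for illustration only.
example_rows = [
    {
        "sample_name": "sampleA",
        "run": "1",
        "short_reads": "sampleA_R1.fastq.gz",
        "long_reads": "",
    },
]

metadata_text = build_metadata(example_rows)
print(metadata_text)
```

Leaving the long reads field empty, as in the example row, is also an assumption about how a short-read-only sample would be described.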

Quality control is optional: to run it, simply switch it on.

The reference genome for this analysis is hg19 (Homo sapiens genome assembly GRCh37, from the Genome Reference Consortium).

The alignment tool used in this process is bwa_mem (BWA-MEM), generally the fastest and most accurate of the Burrows-Wheeler Alignment Tool algorithms.

The programs used are:

  • for marking duplicates: GATK;

  • for germline variant calling: GATK4 HaplotypeCaller;

  • for variant annotation of VCF files: SnpEff.

Default Parameters Set

QC:

  • FastQC: switch on.

Alignment:

  • Reference genome: hg19;

  • Alignment program: bwa;

  • Program for mark duplicates: GATK;

  • Program for variant calling for germline: GATK4 HaplotypeCaller;

  • Program for variant annotation on VCF files: SnpEff.

Results

Once you have selected the dataset, chosen the pipeline and set all the parameters, you can start your analysis using the Run Analysis box. You will then be redirected to this page, where you can see which jobs are In Progress and which are Completed, and choose to start a new analysis.

By clicking on your JobName, you will have access to this page, where you can monitor all the processes involved in your analysis.

Now, selecting the Results box on the right, let's take a look at the demo results obtained using the Default Parameters Set:

Sequence Counts

Sequence counts for each sample. Duplicate read counts are an estimate only.
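The idea behind these counts can be sketched as follows. This is a simplified illustration, not the method used by the report: it counts exact duplicates over every read held in memory, whereas FastQC derives its duplication figure from a subset of the reads, which is why the report describes it as an estimate. The function name and the example reads are made up for the sketch.

```python
from collections import Counter


def count_sequences(fastq_text):
    """Count total, unique and duplicate reads in FASTQ-formatted text."""
    lines = fastq_text.strip().splitlines()
    # In a four-line FASTQ record, the sequence is the second line.
    sequences = lines[1::4]
    counts = Counter(sequences)
    total = len(sequences)
    unique = len(counts)
    return {"total": total, "unique": unique, "duplicate": total - unique}


# Three invented reads, two of which share the same sequence.
example = "\n".join([
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "IIII",
    "@r3", "GGCC", "+", "IIII",
])

summary = count_sequences(example)
print(summary)
```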

GC Content

GC content (guanine-cytosine content) is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine (G) or cytosine (C). It expresses the proportion of G and C bases out of the four possible bases, the other two being:

  • adenine and thymine in DNA,

  • adenine and uracil in RNA.

This module measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content.
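As a minimal sketch of the measure itself, the GC percentage of a single read can be computed as below. The function name is hypothetical, and ignoring ambiguous bases such as N is an assumption of this example rather than a description of how the report treats them.

```python
def gc_percent(sequence):
    """Percentage of G and C among the A, C, G, T/U bases of a sequence."""
    seq = sequence.upper()
    gc = sum(seq.count(base) for base in "GC")
    total = sum(seq.count(base) for base in "ACGTU")
    # Ambiguous bases such as N are ignored (an assumption of this sketch).
    return 100.0 * gc / total if total else 0.0


# Invented reads for illustration.
reads = ["GGCC", "ATAT", "ACGT", "GCNA"]
per_read = [gc_percent(read) for read in reads]
print(per_read)
```

Collecting the per-read values into a histogram gives the kind of distribution that the module then compares against the expected curve.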

Hg19 Alignment Stats

This graph is obtained from the BWA-MEM algorithm, one of the three algorithms of the Burrows-Wheeler Alignment Tool (BWA). BWA-MEM is generally recommended for high-quality queries as it is faster and more accurate than the other two.

BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome.

The BWA-MEM algorithm performs local alignment and may produce multiple primary alignments for different parts of a query sequence, which is a crucial feature for long sequences.

The execution of the algorithm is as follows:

  • seeding alignments with maximal exact matches (MEMs);

  • extending seeds with the affine-gap Smith-Waterman algorithm (SW).
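The two steps above can be illustrated with a toy sketch. It is not BWA's implementation: BWA-MEM locates its seeds through an index of the reference rather than by the brute-force scan used here, and it extends from the seeds instead of filling a full matrix. The function names and the scoring values are illustrative assumptions.

```python
def exact_match_seeds(query, ref, min_len=3):
    """Return (query_start, ref_start, length) for exact matches that
    cannot be extended in either direction (brute-force scan)."""
    seeds = []
    for i in range(len(query)):
        for j in range(len(ref)):
            if query[i] != ref[j]:
                continue
            # Skip positions that merely continue a match started earlier.
            if i > 0 and j > 0 and query[i - 1] == ref[j - 1]:
                continue
            length = 0
            while (i + length < len(query) and j + length < len(ref)
                   and query[i + length] == ref[j + length]):
                length += 1
            if length >= min_len:
                seeds.append((i, j, length))
    return seeds


def smith_waterman_affine(query, ref, match=1, mismatch=-4,
                          gap_open=-6, gap_extend=-1):
    """Best local alignment score with affine gap penalties.

    A gap of length k costs gap_open + k * gap_extend.
    """
    neg_inf = float("-inf")
    rows, cols = len(query) + 1, len(ref) + 1
    # H: best score ending at (i, j); E: ending with a gap in the query;
    # F: ending with a gap in the reference.
    H = [[0] * cols for _ in range(rows)]
    E = [[neg_inf] * cols for _ in range(rows)]
    F = [[neg_inf] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            E[i][j] = max(E[i][j - 1] + gap_extend,
                          H[i][j - 1] + gap_open + gap_extend)
            F[i][j] = max(F[i - 1][j] + gap_extend,
                          H[i - 1][j] + gap_open + gap_extend)
            step = match if query[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + step, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


# Invented sequences: the reference carries one extra T in the middle.
query = "ACGTACGT"
reference = "ACGTTACGT"
seeds = exact_match_seeds(query, reference, min_len=4)
score = smith_waterman_affine(query, reference)
print(seeds, score)
```

With these penalties, opening a gap costs more than the matches it would connect, so the best local alignment here is the longest ungapped match. Lowering the gap penalties makes gapped alignments competitive.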

Variant Stats

By clicking on the Interactive Graphs option, another way of displaying the results is made available:

Finally, using the Export box, you can download the results of your analysis as a PDF file.

Reference