Data Preparation

Pre-processing.

Overview

The COVID19 pipelines require the following input files:

  • Short Reads;

  • Long Reads;

  • samplesheet;

  • parameters.

In DNA sequencing, a read is an inferred sequence of base pairs corresponding to all or part of a single DNA fragment.

Long and short reads differ in read length and in the underlying sequencing technology. Short reads (100-200 bp) are limited in their ability to resolve complex regions that contain repetitive or heterozygous sequences, whereas long reads (on the order of kilobases) span more of each region and therefore carry more information. However, long reads tend to have higher error rates than short reads.

FASTQ is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores. For an Illumina run, a FASTQ file contains the sequence data from the clusters that pass filter on a flow cell: for each cluster that passes filter, a single sequence is written to the corresponding sample’s R1 FASTQ file and, for a paired-end run, a single sequence is also written to the sample’s R2 FASTQ file. Each entry in a FASTQ file consists of 4 lines:

  1. a sequence identifier with information about the sequencing run and the cluster. The exact contents of this line vary depending on the BCL (individual base call) to FASTQ conversion software used;

  2. the sequence (the base calls; A, C, T, G and N);

  3. a separator;

  4. the base call quality scores.
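The four-line layout described above can be read with a few lines of Python. The following is only a minimal sketch, not part of the pipeline: the names `FastqRecord`, `read_fastq` and `phred_scores` are hypothetical, the example read is made up, and the quality decoding assumes the Phred+33 encoding, which this document does not state.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading '@'
    sequence: str    # line 2, the base calls (A, C, T, G, N)
    quality: str     # line 4, one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines in an open text handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between entries
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"identifier line must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"separator line must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality characters to integers (assumes Phred+33 by default)."""
    return [ord(char) - offset for char in quality]


# Made-up entry used only to exercise the parser.
example = "@read1 hypothetical\nACGTN\n+\nIIII#\n"
records = list(read_fastq(io.StringIO(example)))
scores = phred_scores(records[0].quality)
```

For the zipped (.gz) files used as pipeline input, the same function can be given a handle opened in text mode with `gzip.open(path, "rt")` from the Python standard library.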

Workflow

Preparation

The Short Read FASTQ files are one of the inputs to the pipeline. They contain the reads sequenced on an Illumina platform (HiSeq 2500/HiSeq 4000/MiSeq) and are provided in zipped format (.gz). Because the reads are paired-end, each sample has two files: an R1 and an R2 FASTQ file.
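Since every short-read sample is expected to have both an R1 and an R2 file, it can be useful to check the pairing before filling in the sample sheet. The sketch below is an illustration only: the function `pair_fastq_files` is hypothetical, and it assumes that file names carry an `_R1` or `_R2` tag (optionally followed by a numeric suffix such as `_001`) and end in `.fastq.gz` or `.fq.gz`. Neither the naming scheme nor the example file names come from this document.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed naming scheme: <sample>_R1[_NNN].fastq.gz and <sample>_R2[_NNN].fastq.gz
READ_TAG = re.compile(r"^(?P<sample>.+?)_(?P<read>R[12])(?:_\d+)?\.(?:fastq|fq)\.gz$")


def pair_fastq_files(names: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Group file names by sample and return (R1, R2) for each sample."""
    found: Dict[str, Dict[str, str]] = defaultdict(dict)
    for name in names:
        match = READ_TAG.match(name)
        if match is None:
            continue  # not a paired-end FASTQ name under the assumed scheme
        found[match["sample"]][match["read"]] = name

    pairs: Dict[str, Tuple[str, str]] = {}
    for sample, reads in found.items():
        if set(reads) != {"R1", "R2"}:
            raise ValueError(f"sample {sample!r} is missing its R1 or R2 file")
        pairs[sample] = (reads["R1"], reads["R2"])
    return pairs


# Hypothetical file names used only for illustration.
names = [
    "sampleA_R1.fastq.gz",
    "sampleA_R2.fastq.gz",
    "sampleB_S1_L001_R2_001.fastq.gz",
    "sampleB_S1_L001_R1_001.fastq.gz",
    "notes.txt",
]
pairs = pair_fastq_files(names)
```

A sample with only one of the two files raises a `ValueError`, so an incomplete pair is caught before the pipeline is started.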

Illumina sequencing technology uses clonal array formation and proprietary reversible terminator technology for fast and accurate large-scale sequencing. The platform supports a wide range of applications across omics fields such as genomics, transcriptomics, and epigenomics.

The Long Read FASTQ files are generated on an Oxford Nanopore Technologies (ONT) platform and are also provided in zipped format. Unlike the Short Read FASTQ files, there is a single FASTQ file per sample.

Nanopore sequencing is a scalable technology that enables direct, real-time analysis of long DNA or RNA fragments. It monitors changes in electrical current as nucleic acids pass through a protein nanopore, and the resulting signal is decoded to give the specific DNA or RNA sequence.

Each analysis requires a sample metadata sheet, which specifies the FASTQ files for each sample together with other associated metadata.

A template metadata file can be accessed here; it also describes each column and gives example values.