GWAS Pipeline

The genome-wide study of variants.

GWAS - Genome-Wide Association Study

GWAS - also known as Whole Genome Association Study (WGAS) - is an observational study of a genome-wide set of genetic variants in different individuals to see if any of them is associated with a trait.

Genome-wide association studies are used to identify associations between single nucleotide polymorphisms (SNPs) and phenotypic traits. GWAS aims to identify SNPs whose allele frequencies vary systematically as a function of phenotypic trait values. Identification of trait-associated SNPs may reveal new insights into the biological mechanisms underlying these phenotypes.
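As an illustration of what "allele frequencies vary as a function of the trait" means for a case-control trait, the sketch below runs a basic allelic chi-square test (1 degree of freedom) on a 2x2 table of allele counts for a single SNP. The function name and the counts are made up for the example; this is not the pipeline's own implementation, which relies on PLINK.

```python
import math


def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Rows are cases/controls, columns are alternate/reference alleles.
    Every row and column total must be non-zero.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Illustrative counts: the alternate allele is more frequent in cases (30%)
# than in controls (20%).
chi2, p = allelic_chi2(case_alt=300, case_ref=700, ctrl_alt=200, ctrl_ref=800)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

A GWAS repeats a test of this kind (or a regression-based equivalent) for every SNP, which is why multiple-testing thresholds matter when reading the results.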

Workflow

Data quality control in GWAS

Data quality assessment and control steps are typically carried out in case-control association studies to identify and remove DNA samples and markers that introduce bias. To reduce this bias (and with it the number of false-positive associations), several QC steps are taken to remove individuals or markers with high error rates. All of these analyses can be performed with the PLINK tool by providing the corresponding option for each type of analysis. Other tools that may be required are SmartPCA and R. It is generally advisable to perform QC on a per-individual basis before QC on a per-marker basis, in order to maximise the number of markers remaining in the study.

Per-individual QC of GWAS data consists of:

  • identification of individuals with discordant sex information (correct for sex);

  • identification of individuals with outlying missing genotype or heterozygosity rate;

  • identification of duplicated or related individuals.
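The missingness and heterozygosity checks above can be sketched in a few lines. The example below assumes genotypes coded as 0, 1 or 2 copies of the alternate allele (None for a missing call); the function names and the thresholds (3% missingness, 3 standard deviations from the mean heterozygosity rate) are illustrative assumptions, not the values used by the pipeline, which delegates these steps to PLINK.

```python
import math
import statistics


def sample_qc_metrics(genotypes):
    """Return (missing_rate, het_rate) for one individual.

    `genotypes` holds one value per SNP: 0, 1 or 2 copies of the alternate
    allele, or None when the genotype call is missing.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return missing_rate, float("nan")
    het_rate = sum(1 for g in called if g == 1) / len(called)
    return missing_rate, het_rate


def flag_outliers(samples, max_missing=0.03, het_sd=3.0):
    """Flag individuals with outlying missingness or heterozygosity.

    `samples` maps a sample ID to its genotype list. Both thresholds are
    assumptions chosen for this sketch.
    """
    metrics = {sid: sample_qc_metrics(g) for sid, g in samples.items()}
    het_values = [het for _, het in metrics.values() if not math.isnan(het)]
    mean_het = statistics.mean(het_values) if het_values else float("nan")
    sd_het = statistics.stdev(het_values) if len(het_values) > 1 else 0.0

    flagged = {}
    for sid, (missing, het) in metrics.items():
        reasons = []
        if missing > max_missing:
            reasons.append("missingness")
        if sd_het > 0 and abs(het - mean_het) > het_sd * sd_het:
            reasons.append("heterozygosity")
        if reasons:
            flagged[sid] = reasons
    return flagged


# Toy data: 20 unremarkable samples plus two problematic ones.
samples = {f"S{i}": [0, 1, 0, 2, 1, 0, 0, 1, 2, 0] for i in range(20)}
samples["HIGH_HET"] = [1] * 9 + [0]
samples["LOW_CALL"] = [None, None, 0, 1, 0, 2, 1, 0, 0, 1]
print(flag_outliers(samples))
```

Note that a standard-deviation rule only becomes meaningful with a reasonable number of samples; with a handful of individuals no value can lie 3 standard deviations from the mean.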

User Journey

A few simple steps are needed to run this pipeline:

  1. search the dataset using a therapeutic tag of interest in the search dataset page;

  2. select the data that you want to perform the analysis on;

  3. select GWAS pipeline card;

  4. choose the parameters from the configuration page;

  5. provide the job name and run the analysis;

then, you will be redirected to the results page.

The estimated cost of this analysis is $0.00, since the GWAS pipeline is provided free of charge; give your job a name and then move on to choosing the parameters.

List of parameters

  • pheno: this is the therapeutic tag that the user has provided at the time of searching the data;

  • covariates: a comma-separated list of phenotypes that you want to use as covariates in the association test. These can be selected from the dropdown list;

  • mperm: whether to run permutation testing and, if so, how many permutations to perform. Choose the number here; by default, this is 1000;

  • thin: you can set this to a floating-point number in the range (0, 1] and then the PLINK data files are thinned leaving only that proportion of the SNPs. This allows the pipeline to be tested with a small proportion of the data; this is probably only needed for debugging purposes and usually, this should not be set.

  • chrom: only do testing on this chosen chromosome;

  • chi2: if a chi2 test has to be used, switch on;

  • logistic: if a logistic regression has to be used, switch on;

  • assoc: if an association test has to be done, switch on;

  • linear: if a linear regression has to be used, switch on;

  • fisher: if a Fisher exact test has to be used, switch on;

  • model: if the number of SNPs exceeds 950,000, a random subset of 950,000 SNPs is chosen to build the model;

  • gemma: switch on to use GEMMA (Genome-wide Efficient Mixed Model Association), an alternative tool for association testing;

  • adjust: switch on to have PLINK report p-values adjusted for multiple testing (Bonferroni and the other corrections PLINK provides).
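To make the constraints in this list concrete, here is a minimal sketch of how a parameter set could be represented and checked before submitting a job. The `GwasParams` class, its method names and its default values are hypothetical (only the mperm default of 1000 and the (0, 1] range for thin come from the list above); the platform's own configuration page performs its own validation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GwasParams:
    """Hypothetical container mirroring the parameters listed above."""

    pheno: str
    covariates: str = ""          # comma-separated phenotype names
    mperm: int = 1000             # default stated in the parameter list
    thin: Optional[float] = None  # None means "do not thin" (assumption)
    chrom: Optional[str] = None   # None means "all chromosomes" (assumption)
    chi2: bool = False
    logistic: bool = False
    assoc: bool = False
    linear: bool = False
    fisher: bool = False
    gemma: bool = False
    adjust: bool = False

    def covariate_list(self) -> List[str]:
        """Split the comma-separated covariates, dropping empty entries."""
        return [c.strip() for c in self.covariates.split(",") if c.strip()]

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the set looks valid."""
        errors = []
        if not self.pheno:
            errors.append("pheno must not be empty")
        # Treating 0 as "no permutation testing" is an assumption of this sketch.
        if self.mperm < 0:
            errors.append("mperm must be zero or a positive integer")
        if self.thin is not None and not (0 < self.thin <= 1):
            errors.append("thin must lie in the range (0, 1]")
        return errors


params = GwasParams(pheno="diabetes", covariates="age, sex,,bmi", thin=0.1)
print(params.covariate_list())
print(params.validate())
```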

Results

Once you have selected the dataset to be used, chosen the pipeline and set all the parameters, you can start your analysis using the Run Analysis box. At this point, you will be redirected to a page where you can see which jobs are In Progress and which are Completed, and where you can choose to carry out a new analysis.

By clicking on your JobName, you will have access to this page, where you can monitor all the processes involved in your analysis:

Now, selecting the Results box on the right, let's take a look at the demo results obtained using the Default Parameters Set:

Principal Component Analysis (PCA) is a technique used to capture population structure. The leading principal components (eigenvectors) that are computed are then used as covariates in the association model to correct for it.

QQ Plot - using an association test from PLINK - A QQ plot is a popular way to check for residual population structure in GWAS. The rationale behind this visualisation is to verify whether the observed test statistics deviate from the expected null distribution. Early separation of the observed values from the expected ones may be due to population stratification.
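The points of such a plot, and a single-number summary of the deviation, can be computed as in the sketch below. It assumes a plain list of p-values in the range (0, 1]; the function names are illustrative, and the genomic inflation factor (lambda) shown here is a commonly used companion statistic rather than something this pipeline is documented to report.

```python
import math
from statistics import NormalDist, median


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, smallest p-value first.

    Expected values come from the uniform distribution that p-values
    follow under the null hypothesis.
    """
    observed = sorted(p_values)
    n = len(observed)
    return [
        (-math.log10((rank - 0.5) / n), -math.log10(p))
        for rank, p in enumerate(observed, start=1)
    ]


def genomic_inflation(p_values):
    """Median observed chi-square (1 df) over the median expected under the null."""
    z = NormalDist()
    # A two-sided p-value maps back to a 1-df chi-square via the normal quantile.
    chi2_observed = [z.inv_cdf(p / 2) ** 2 for p in p_values]
    chi2_null_median = z.inv_cdf(0.75) ** 2  # roughly 0.455
    return median(chi2_observed) / chi2_null_median


# Perfectly uniform p-values lie on the diagonal of the QQ plot.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
points = qq_points(uniform_p)
print(points[0], genomic_inflation(uniform_p))
```

A lambda close to 1 suggests little inflation, whereas values well above 1 are consistent with the early separation described above.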

Manhattan Plot - using an association test from PLINK - In a GWAS Manhattan plot, genomic coordinates are displayed along the X-axis and the negative base-10 logarithm of the association p-value for each single nucleotide polymorphism (SNP) is displayed on the Y-axis, so that each dot on the plot represents one SNP.
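The mapping from association results to plot coordinates can be sketched as follows. The input format (tuples of chromosome number, base-pair position and p-value) and the function name are assumptions made for the example; the offsets simply place chromosomes side by side along the X-axis.

```python
import math


def manhattan_coordinates(snps):
    """Map (chrom, position, p_value) records to (x, y) plot coordinates.

    x is a cumulative genomic coordinate so that chromosomes sit side by
    side; y is -log10(p), so smaller p-values plot higher.
    """
    chrom_order = sorted({chrom for chrom, _, _ in snps})
    max_pos = {
        c: max(pos for chrom, pos, _ in snps if chrom == c)
        for c in chrom_order
    }

    # Each chromosome starts where the previous one ended.
    offsets, running = {}, 0
    for c in chrom_order:
        offsets[c] = running
        running += max_pos[c]

    return [(offsets[chrom] + pos, -math.log10(p)) for chrom, pos, p in snps]


# Toy input: two SNPs on chromosome 1 and one on chromosome 2.
snps = [(1, 100, 0.5), (1, 900, 1e-8), (2, 50, 0.01)]
coords = manhattan_coordinates(snps)
print(coords)
```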

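Genome-wide significance is conventionally set at p < 5 × 10^-8; a more lenient threshold such as 5 × 10^-5 is sometimes used to flag suggestive associations. SNPs whose p-values fall below the chosen threshold show a statistically significant association between the locus and the phenotype.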
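The 5 × 10^-8 figure is commonly motivated as a Bonferroni correction of a 0.05 significance level for roughly one million independent tests. The short sketch below works through that arithmetic and converts the threshold to the height at which it would appear on a -log10(p) axis; the function names are illustrative, and the figure of one million tests is the usual rule of thumb rather than a property of any particular dataset.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def is_significant(p_value, threshold):
    """True when a p-value falls below the chosen significance threshold."""
    return p_value < threshold


# Assumed rule of thumb: about one million independent common variants.
threshold = bonferroni_threshold(0.05, 1_000_000)
line_height = -math.log10(threshold)  # where the line sits on a Manhattan plot

print(f"threshold = {threshold:.1e}, -log10 = {line_height:.2f}")
print(is_significant(3e-9, threshold), is_significant(1e-6, threshold))
```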

By clicking on the Interactive Graphs option, you can also view your results like this:

Finally, use the Export option to download the results of your analysis as a .pdf file.
