PRS Pipeline

The best prediction score.

PRS - Polygenic Risk Score

A PRS, also known as a genetic risk score or genome-wide score, is a value computed from variation at multiple genetic loci and their associated weights. It gives the best prediction for the trait that can be made when taking into account variation in multiple genetic variants.

Single-variant association analysis requires very large sample sizes to detect more than a few SNPs for many complex traits. In contrast, PRS analysis aggregates genetic risk across the genome into a single polygenic score per individual for a trait of interest. In this approach, a large discovery sample is required to determine how much each SNP is expected to contribute to the polygenic score of a specific trait. Subsequently, in an independent target sample, which can be more modest in size, polygenic scores are computed from each individual's genotype profile and these weights.
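As a minimal sketch of this idea, a polygenic score can be written as a weighted sum of effect-allele counts. The SNP identifiers, weights and genotype counts below are made up for illustration, and the function name polygenic_score is hypothetical rather than part of the platform.

```python
from typing import Dict


def polygenic_score(weights: Dict[str, float], dosages: Dict[str, int]) -> float:
    """Sum the effect-allele counts multiplied by their per-SNP weights."""
    # SNPs absent from the individual's profile are simply skipped here.
    return sum(w * dosages[snp] for snp, w in weights.items() if snp in dosages)


# Hypothetical weights estimated in a discovery sample (e.g. betas).
weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 0.125}
# Hypothetical effect-allele counts (0, 1 or 2) for one target individual.
dosages = {"rs1": 2, "rs2": 1, "rs3": 0}

score = polygenic_score(weights, dosages)
print(score)
```

The weights come from the discovery sample and the allele counts from the target sample, which is why the two datasets must be independent.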

Workflow

User Journey

A few simple steps are needed to run this pipeline:

search the dataset using a therapeutic tag of interest in the search dataset page;
select the data that you want to perform the analysis on;
select PRS pipeline card;
choose the parameters from the configuration page;
provide the job name and run the analysis.

You will then be redirected to the results page.

The page design is the same as in the previous pipelines, since they all share the same characteristics.

You can find an in-depth explanation of GWAS parameters here.

You can either use a Shivom Dataset by switching the option on, or upload your own dataset; in the latter case, the uploaded file must contain the GWAS summary statistics.

There are two ways to provide this dataset, depending on whether the GWAS was run with the Shivom GWAS pipeline (or another Plink-based program) or with non-Plink software:

directly from the GWAS output, /results/assoc/xxx.assoc (extension .assoc);
by uploading a file from your own GWAS computed with non-Plink software; in this case, the format of the GWAS text file must be stated.

To upload a file from your own GWAS, switch the ‘index’ toggle ON.

Then specify the columns of the base file using a 0-based index (the first column is 0): you can do this from the drop-down menu, which lists the names of the columns.
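To make the 0-based convention concrete, here is a small sketch that picks fields from a whitespace-delimited summary statistics file by column index. The function name read_base_by_index, the chosen field names and the sample rows are assumptions for illustration; the header follows the usual Plink case/control .assoc layout.

```python
import io
from typing import Dict, List


def read_base_by_index(handle, columns: Dict[str, int],
                       has_header: bool = True) -> List[Dict[str, str]]:
    """Select fields from a whitespace-delimited file by 0-based column index."""
    rows = []
    for line_no, line in enumerate(handle):
        if has_header and line_no == 0:
            continue  # skip the header row
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rows.append({name: fields[idx] for name, idx in columns.items()})
    return rows


# Illustrative content only; the values are made up.
ASSOC_TEXT = """CHR SNP BP A1 F_A F_U A2 CHISQ P OR
1 rs1 1000 A 0.2 0.1 G 5.1 0.02 2.25
1 rs2 2000 C 0.3 0.3 T 0.0 0.98 1.00
"""

# Column 0 is CHR, so SNP is 1, A1 is 3, P is 8 and OR is 9.
columns = {"snp": 1, "a1": 3, "p": 8, "stat": 9}
rows = read_base_by_index(io.StringIO(ASSOC_TEXT), columns)
print(rows[0])
```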

The Target Dataset Folder will contain the bed, bim and fam files (Plink Binary Files).

Target Dataset: the PRS weights derived from the Base dataset will be applied to a different dataset, for the same trait or a completely different trait. The target dataset can be either a Shivom Dataset or third-party GWAS summary statistics.

Alternatively, if this pipeline is connected to the GWAS pipeline, the target dataset can be selected automatically. The flag to set the file input for the target dataset is stated below.

A phenotype file associated with the target dataset is required. Its format is the same as the GWAS pheno file: the first two columns are ‘FID’ and ‘IID’. If the Target Dataset is Shivom’s GWAS, the pheno file can be associated automatically from the platform; otherwise, a file upload button is provided.

If you want to ignore the FID column, switch on the button: when ignore-fid is set, the first column should be the IID. The remaining columns can hold the phenotype(s).

To specify a trait within the phenotype file, the column name for the trait can be specified using Column name, provided that the phenotype file contains a header.
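The layout described above can be sketched as a small parser. This is an illustration only: the function name read_phenotype, its ignore_fid argument and the sample values are hypothetical, and the platform's own parsing may differ.

```python
import io
from typing import Dict, Tuple


def read_phenotype(handle, trait: str,
                   ignore_fid: bool = False) -> Dict[Tuple[str, ...], str]:
    """Map sample identifiers to the value of one named trait column."""
    header = handle.readline().split()
    n_id = 1 if ignore_fid else 2       # IID only, or FID followed by IID
    trait_idx = header.index(trait)     # raises ValueError if the trait is absent
    if trait_idx < n_id:
        raise ValueError("trait column cannot be an identifier column")
    values = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        values[tuple(fields[:n_id])] = fields[trait_idx]
    return values


with_fid = read_phenotype(
    io.StringIO("FID IID Height BMI\nfam1 id1 170 22.5\nfam2 id2 182 27.1\n"),
    trait="BMI",
)
without_fid = read_phenotype(
    io.StringIO("IID BMI\nid1 22.5\n"), trait="BMI", ignore_fid=True
)
print(with_fid, without_fid)
```

Selecting the trait by name only works when the file has a header row, which is why the header is read unconditionally in this sketch.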

If you would like to use covariates as part of your calculation, you have to upload a Covariate File. Its format is the same as the phenotype file: FID and IID as the first two columns, followed by the covariate columns.

You can choose between the Clumping and no-Clumping options. Clumping is an integral part of pre-processing the dataset: it ensures that the SNPs with the greatest signal (lowest p-value) are kept in all calculations.

Clumping is a procedure in which only the most significant SNP in each LD block is identified and selected for further analyses. This reduces the correlation between the remaining SNPs, while retaining SNPs with the strongest statistical evidence.
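The procedure can be sketched as a greedy loop over SNPs sorted by p-value. This is a simplified illustration for a single chromosome, not the platform's implementation: the function name clump, the r2 lookup callback and the example SNPs are all assumptions.

```python
from typing import Callable, List, Tuple


def clump(snps: List[Tuple[str, int, float]],
          r2: Callable[[str, str], float],
          kb: float = 250, r2_threshold: float = 0.1,
          p_threshold: float = 1.0) -> List[str]:
    """Greedy clumping; snps holds (id, position_bp, p_value) tuples."""
    # Keep SNPs passing the p-value threshold, most significant first.
    candidates = sorted((s for s in snps if s[2] <= p_threshold),
                        key=lambda s: s[2])
    removed = set()
    index_snps: List[str] = []
    for snp_id, pos, _ in candidates:
        if snp_id in removed:
            continue
        index_snps.append(snp_id)  # most significant SNP left in its block
        for other_id, other_pos, _ in candidates:
            if other_id == snp_id or other_id in removed or other_id in index_snps:
                continue
            close = abs(other_pos - pos) <= kb * 1000
            if close and r2(snp_id, other_id) >= r2_threshold:
                removed.add(other_id)  # correlated with a stronger SNP
    return index_snps


# Made-up pairwise r2 values; unlisted pairs are treated as uncorrelated.
R2 = {frozenset(("rs1", "rs2")): 0.8, frozenset(("rs1", "rs3")): 0.9}


def lookup_r2(a: str, b: str) -> float:
    return R2.get(frozenset((a, b)), 0.0)


snps = [("rs1", 1000, 1e-8), ("rs2", 5000, 1e-4),
        ("rs3", 900000, 1e-3), ("rs4", 6000, 0.5)]
kept = clump(snps, lookup_r2, kb=250, r2_threshold=0.1, p_threshold=0.1)
print(kept)
```

In this example rs2 is dropped because it is close to and correlated with rs1, rs3 survives because it lies outside the 250 kb window, and rs4 never enters because its p-value exceeds the threshold.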

For MissingData, select from the dropdown menu how you would like to handle missing genotype data:

MEAN_IMPUTE;
SET_ZERO;
CENTER.
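The sketch below shows one plausible reading of these three options for the genotype counts of a single SNP. It is an assumption made for illustration: the exact behaviour of each option on the platform is not described here, and the function name handle_missing is hypothetical.

```python
from typing import List, Optional


def handle_missing(dosages: List[Optional[float]], method: str) -> List[float]:
    """Fill missing genotype counts (None) for one SNP across samples."""
    observed = [d for d in dosages if d is not None]
    mean = sum(observed) / len(observed) if observed else 0.0
    if method == "MEAN_IMPUTE":
        # Assumed: a missing genotype takes the mean of the observed counts.
        return [mean if d is None else float(d) for d in dosages]
    if method == "SET_ZERO":
        # Assumed: a missing genotype contributes nothing to the score.
        return [0.0 if d is None else float(d) for d in dosages]
    if method == "CENTER":
        # Assumed: counts are shifted to mean zero; missing values sit at the mean.
        return [0.0 if d is None else d - mean for d in dosages]
    raise ValueError(f"unknown method: {method}")


example = [0, 2, None, 1]
for name in ("MEAN_IMPUTE", "SET_ZERO", "CENTER"):
    print(name, handle_missing(example, name))
```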

If the target dataset is small (fewer than 500 samples), it is useful to provide an LD reference file, such as 1000 Genomes, to improve clumping.

All Linkage Disequilibrium parameters accept any integer or float as input, except the LD info value, which must be between 0.0 and 1.0.

Linkage disequilibrium (LD) is a measure of non‐random association between alleles at different loci on the same chromosome in a given population. SNPs are in LD when the frequency of association of their alleles is higher than expected under random assortment. LD concerns patterns of correlations between SNPs.
Minor allele frequency (MAF) is the frequency of the least often occurring allele at a specific location. Most studies are underpowered to detect associations with SNPs with a low MAF and therefore exclude these SNPs.
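Both quantities can be estimated from genotype counts, as in the sketch below. Note the assumption: the r2 here is the squared correlation between genotype counts, a common approximation of the haplotype-based LD measure, and the function names and data are illustrative only.

```python
from typing import Sequence


def minor_allele_frequency(dosages: Sequence[int]) -> float:
    """MAF from counts (0, 1 or 2) of one allele in diploid samples."""
    freq = sum(dosages) / (2 * len(dosages))
    return min(freq, 1 - freq)  # the minor allele is the rarer of the two


def ld_r2(x: Sequence[int], y: Sequence[int]) -> float:
    """Squared Pearson correlation between the counts of two SNPs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    return cov * cov / (vx * vy)


snp_a = [0, 1, 2, 1]
snp_b = [0, 1, 2, 1]   # identical pattern: complete LD in this sample
snp_c = [1, 0, 1, 2]   # unrelated pattern
print(minor_allele_frequency([0, 1, 0, 0]), ld_r2(snp_a, snp_b), ld_r2(snp_a, snp_c))
```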

The next section requires configuration of the parameters that will be used to compute the analysis.

If you switch off the No Regression option, you can choose from the following Regression models:

add: additive model;
dom: dominant model;
rec: recessive model;
het: heterozygous only.
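These models differ in how the number of effect alleles carried by a sample (0, 1 or 2) is recoded before regression. The mapping below is the conventional coding for such models, shown as a sketch; the names MODEL_CODING and encode_genotype are hypothetical.

```python
from typing import Dict, Tuple

# Recoded value for 0, 1 and 2 copies of the effect allele.
MODEL_CODING: Dict[str, Tuple[int, int, int]] = {
    "add": (0, 1, 2),  # additive: each copy adds the same amount
    "dom": (0, 1, 1),  # dominant: one copy is enough
    "rec": (0, 0, 1),  # recessive: two copies are required
    "het": (0, 1, 0),  # heterozygous only: exactly one copy
}


def encode_genotype(allele_count: int, model: str = "add") -> int:
    """Recode an effect-allele count under the chosen genetic model."""
    return MODEL_CODING[model][allele_count]


for model in MODEL_CODING:
    print(model, [encode_genotype(g, model) for g in (0, 1, 2)])
```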

You can also select one of these Scoring methods:

avg: average;
std: standardized;
sum.
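A rough sketch of how the three scoring methods could differ is given below. It rests on assumptions: avg is taken as the weighted sum divided by the number of alleles (two per SNP), and std as the weighted sum standardized across samples with the sample standard deviation. The function name score_samples and the data are illustrative.

```python
from statistics import mean, stdev
from typing import List, Sequence


def score_samples(weights: Sequence[float],
                  genotypes: Sequence[Sequence[float]],
                  method: str = "avg") -> List[float]:
    """Score each sample; genotypes holds one list of allele counts per sample."""
    sums = [sum(w * g for w, g in zip(weights, sample)) for sample in genotypes]
    if method == "sum":
        return sums
    if method == "avg":
        n_alleles = 2 * len(weights)  # assumed diploid: two alleles per SNP
        return [s / n_alleles for s in sums]
    if method == "std":
        if len(sums) < 2:
            return [0.0] * len(sums)
        sd = stdev(sums)
        if sd == 0:
            return [0.0] * len(sums)  # identical scores cannot be standardized
        mu = mean(sums)
        return [(s - mu) / sd for s in sums]
    raise ValueError(f"unknown scoring method: {method}")


weights = [0.5, 0.25]
genotypes = [[2, 0], [0, 2], [1, 1]]
for method in ("sum", "avg", "std"):
    print(method, score_samples(weights, genotypes, method))
```

The sum grows with the number of SNPs included, whereas avg and std stay on a comparable scale across analyses that use different numbers of SNPs.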

You can also set additional parameters: the number of permutations to perform, whether to produce the Quantile plot, and whether to print the first 10 rows of the SNP file.
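Permutations are typically used to obtain an empirical p-value: the phenotype is shuffled many times and the observed statistic is compared with the shuffled ones. The sketch below illustrates the general technique with a squared correlation as the statistic; the function names and data are hypothetical, and the statistic used by the pipeline may differ.

```python
import random
from typing import List, Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov * cov / (vx * vy)


def empirical_p_value(observed: float, null_stats: Sequence[float]) -> float:
    """Share of null statistics at least as extreme, with the +1 correction."""
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)


def permutation_test(scores: Sequence[float], phenotype: Sequence[float],
                     n_perm: int, seed: int = 0) -> float:
    """Shuffle the phenotype n_perm times and return the empirical p-value."""
    rng = random.Random(seed)
    observed = r_squared(scores, phenotype)
    shuffled = list(phenotype)
    null_stats: List[float] = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any real score-phenotype link
        null_stats.append(r_squared(scores, shuffled))
    return empirical_p_value(observed, null_stats)


scores = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.9, 2.2]
phenotype = [1.0, 1.3, 1.2, 2.1, 2.4, 2.9, 3.5, 4.1]
p_value = permutation_test(scores, phenotype, n_perm=999)
print(p_value)
```

With N permutations the smallest attainable p-value is 1/(N+1), which is one reason to run a large number of permutations.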

The recommended default parameter set is:

Target Dataset Folder:

Ignore FID Column: switch off.

Clumping (if ON):

Clumping distance in kb: 250;
Clumping r2 value: 0.1;
Clumping p-value threshold: 0.1.

Missing Data:

PRS calc. Method for missing samples: mean_impute.

LD parameters:

LD info value: must be between 0.0 and 1.0.

Polygenic Risk Calculations:

No Regression: switch off;
Model for Regression: Add - additive model;
Scoring Method: Avg;
Score for individual samples: switch off.

Additional parameters:

Number of permutations: enter at least 10000;
Print the SNP to the file: switch on;
Print the Quantile Plot: switch on.
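For reference, the defaults listed above can be captured in a small configuration structure with basic checks. Everything about this sketch is an assumption: the key names and the build_params helper are hypothetical and do not correspond to an actual platform API. The LD info value has no stated default, so it is passed in explicitly.

```python
from typing import Any, Dict

# Hypothetical key names mirroring the recommended defaults.
DEFAULTS: Dict[str, Any] = {
    "ignore_fid": False,
    "clump_kb": 250,
    "clump_r2": 0.1,
    "clump_p": 0.1,
    "missing": "MEAN_IMPUTE",
    "no_regression": False,
    "model": "add",
    "score": "avg",
    "per_sample_scores": False,
    "permutations": 10000,
    "print_snp": True,
    "quantile_plot": True,
}

ALLOWED = {
    "missing": {"MEAN_IMPUTE", "SET_ZERO", "CENTER"},
    "model": {"add", "dom", "rec", "het"},
    "score": {"avg", "std", "sum"},
}


def build_params(ld_info: float, **overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the defaults and validate the restricted fields."""
    if not 0.0 <= ld_info <= 1.0:
        raise ValueError("LD info value must be between 0.0 and 1.0")
    params = {**DEFAULTS, **overrides, "ld_info": ld_info}
    for key, allowed in ALLOWED.items():
        if params[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")
    return params


params = build_params(0.8, score="sum")
print(params["clump_kb"], params["score"], params["ld_info"])
```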

Results

Once you have selected the dataset to be used, chosen the pipeline and set all the parameters, you can start your analysis using the Run Analysis box. You will then be redirected to a page where you can see which jobs are In Progress and which are Completed, and choose to carry out a new analysis.

By clicking on your JobName, you will have access to a page where you can monitor all the processes involved in your analysis:

Now, selecting the Results box on the right, let's take a look at the demo results obtained using the Default Parameters Set:

Bar Plot

High Resolution Plot

Quantile Plot

By clicking on the Interactive Graphs option, another way of displaying the results is made available:

Finally, using the Export box, you can download the results of your analysis as a .pdf file.

Reference

Pipeline Reference
Data quality control in genetic case-control association studies, Anderson CA et al, 2010
A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis, Marees AT et al, 2018
PRSice-2: Polygenic Risk Score software for biobank-scale data, Choi SW et al, 2019
