This cookbook is for running genome-wide eQTL mapping in a consortium setting, using HASE (Roshchupkin et al., 2016; original code repo). The HASE method enables genome-wide high-dimensional association analyses in consortium settings while limiting the size of the shared data and preserving participant privacy.
In this cookbook we walk through the steps required for the eQTL analysis: data QC, removal of related samples, genotype phasing and imputation, conversion of the genotype data to the required format, preparation of covariates, conversion of the expression and covariate data to the required format, data encoding, calculation of partial derivatives, preparation of files for permuted data, and organization of the non-personal encoded data and partial derivatives for sharing with the central site. We also include the pipelines that will be used for running the meta-analyses at the central site.
For eQTLGen phase II we perform this analysis for whole blood and PBMC tissues and in the individual datasets with at least 100 samples.
Classical meta-analysis requires each cohort to perform a full GWAS for every gene (~20,000 genes). The full summary statistics are large: every cohort would need to share several terabytes of data with the central site, which is technically very complicated.
In contrast, the HASE method calculates matrices of aggregate statistics (called partial derivatives) which are needed for running the predefined models and, additionally, encoded genotype and expression matrices. All these matrices are stored in an efficient file format and are relatively small, making them feasible to share. Encoding means multiplying the original data with a random matrix F (or its inverse, in the case of the expression matrix), so that no person-level information can be recovered from the genotype and expression matrices after encoding. Sharing the partial derivatives and encoded matrices with the central site still enables running pre-defined association tests between every variant and every gene for every dataset, and, finally, running the meta-analysis over datasets. An additional merit of HASE is that covariates can be dropped adaptively in the encoded association test (by slicing the aggregate partial-derivative matrices), which enables informed decisions on the most suitable set of covariates to include in the final meta-analysis. This method also shifts most of the computational burden to the central site, so individual cohorts do not need to run ~20,000 GWASes on their respective HPCs.
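Schematically, the encoding preserves exactly the quantities needed for the regression models while scrambling the individual-level rows. A minimal sketch, assuming G is the samples-by-variants genotype matrix, E the samples-by-genes expression matrix and F a random non-singular matrix (the exact orientation and transpose conventions used by HASE may differ):

$$(F\,G)^\top \left(F^{-\top} E\right) = G^\top F^\top F^{-\top} E = G^\top E$$

Because these cross-products are unchanged, the central site can still fit the pre-defined association models from the encoded matrices and the partial derivatives alone.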
You can see the specifics of the method here: Roshchupkin et al., 2016
In case of any issues when running this cookbook please contact Urmo Võsa (urmo.vosa at gmail.com) and Robert Warmerdam (c.a.warmerdam at umcg.nl).
The analysis consists of four Nextflow pipelines which are used to perform all needed tasks in the correct order. For each pipeline, we provide a Slurm script template which should be complemented/adjusted with your input paths and a few dataset-specific settings. Those pipelines are:
- DataQC: genotype and gene expression quality control.
- eQTLGenImpute: genotype phasing and imputation against the 1000G 30x WGS reference panel.
- ConvertVcf2Hdf5: conversion of the imputed genotypes into the .hdf5 format.
- PerCohortDataPreparations: data encoding, calculation of partial derivatives, and organization of the files to share with the central site.

We will provide detailed instructions for each pipeline below; however, for the most optimal setup, we recommend first organizing your analysis folder as follows:
Make an analysis root folder for this project, e.g. eQTLGen_phase2.
mkdir eQTLGen_phase2
Inside this folder, make another folder for the Nextflow executable, called tools. This path is specified in all the template scripts so that the scripts can find the Nextflow executable.
mkdir eQTLGen_phase2/tools
Inside eQTLGen_phase2/tools, download and self-install the Nextflow executable, as specified here. You might need to load Java >=11 before running the self-install (e.g. module load [Java >=11 module name in your HPC]).
cd eQTLGen_phase2/tools
wget -qO- https://get.nextflow.io | bash
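To check that the self-install succeeded, you can ask the executable to print its version (a quick sanity check; requires Java to be loaded):

```bash
# Should print the Nextflow version and build information
./nextflow -version
```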
❗ If this does not work, it might be that your HPC restricts internet access. If this is the case, please use these instructions to try to install Nextflow offline. Please contact the lead analysts in case additional help is needed.
Inside eQTLGen_phase2, make separate folders for each step of this analysis plan:
eQTLGen_phase2/1_DataQC
eQTLGen_phase2/2_Imputation
eQTLGen_phase2/3_ConvertVcf2Hdf5
eQTLGen_phase2/4_PerCohortPreparations
cd ../
mkdir 1_DataQC 2_Imputation 3_ConvertVcf2Hdf5 4_PerCohortPreparations
Inside each of those folders, clone/download the corresponding pipeline. If git is available on your HPC, the easiest way is the command git clone [repo of the pipeline]. This yields a pipeline folder; e.g. for the first pipeline it will be eQTLGen_phase2/1_DataQC/DataQC.
cd 1_DataQC
git clone https://github.com/eQTLGen/DataQC.git
cd ../2_Imputation
git clone https://github.com/eQTLGen/eQTLGenImpute.git
cd ../3_ConvertVcf2Hdf5
git clone https://github.com/eQTLGen/ConvertVcf2Hdf5.git
cd ../4_PerCohortPreparations
git clone https://github.com/eQTLGen/PerCohortDataPreparations.git
We recommend specifying an output (and, if needed, input etc.) folder for each step, e.g. eQTLGen_phase2/1_DataQC/output.
cd ..
mkdir 1_DataQC/output 2_Imputation/output 3_ConvertVcf2Hdf5/output 4_PerCohortPreparations/output
Inside each pipeline folder there is a script template using the name format submit_*_template.sh, e.g. eQTLGen_phase2/1_DataQC/DataQC/submit_DataQc_pipeline_template.sh. You should adjust it according to your data (e.g. specify the path to your input folder, add the required HPC modules) and save it. For better tracking, we recommend also renaming it, e.g. eQTLGen_phase2/1_DataQC/DataQC/submit_DataQc_pipeline_EstBB_HT12v3.sh.
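For example (the dataset-specific file name below is just the example used throughout this cookbook):

```bash
cd eQTLGen_phase2/1_DataQC/DataQC
# Keep the original template and edit a dataset-specific copy
cp submit_DataQc_pipeline_template.sh submit_DataQc_pipeline_EstBB_HT12v3.sh
```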
❗ To ease the process, we have pre-filled some paths in the templates, assuming that you use the recommended folder structure.
For example, the QCd genotype files in eQTLGen_phase2/1_DataQC/output/[Your Cohort Name]/outputfolder_gen/gen_data_QCd/ are the input for the second pipeline (imputation).

On some HPCs internet access is restricted, which causes error messages when installing and running Nextflow. This is because, by default, the Nextflow executable tries to contact the internet to check whether a newer version of Nextflow is available. Also, by default, the first Data QC pipeline tries to automatically download plink executables and the 1000G reference panel. To get the setup working offline, please follow these instructions. Please consult the lead analysts in case the abovementioned instructions do not solve the issues.
Example commands here are based on the Slurm scheduler. However, this can be easily changed to other schedulers.
In order to run the Nextflow pipelines, you have to modify each template script and load a couple of modules which enable running Nextflow and Singularity containers. These very common modules should be available on your HPC; however, the exact names and versions might differ: Java >=11 and Singularity. You might also need a few extra modules to run Singularity without problems. If you think this might be the case, we recommend checking with your HPC documentation and/or tech support which modules are required. E.g. in the University of Tartu HPC, a module named squashfs is also needed to run Singularity without problems.
When you submit the job, e.g. sbatch submit_DataQc_pipeline_EstBB_HT12v3.sh, this initiates the pipeline, builds the analysis environment (using a Singularity container) and automatically submits all the pipeline steps in the correct order and in a parallelized way. A separate work directory is made in the pipeline folder and will contain all the interim files of the pipeline.
Monitoring:
Monitor the slurm-***.out log file or the .nextflow.log file and check that all the steps finish without errors. Trick: the command watch tail -n 20 slurm-***.out helps you to interactively monitor the status of the jobs.
Use squeue -u [YourUserName] to see if individual tasks are in the queue.
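For convenience, the monitoring commands can be run from the pipeline folder, e.g.:

```bash
# Interactively follow the scheduler log (Ctrl+C to stop)
watch tail -n 20 slurm-*.out
# Or follow the Nextflow log directly
tail -f .nextflow.log
# List your queued/running tasks
squeue -u $USER
```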
If the pipeline has crashed (e.g. due to wall time), you can just resubmit the same script after any required changes are implemented. Nextflow does not rerun completed steps and continues only from the steps which had not been completed.
When the work has finished, download and check the job report for potential errors or warnings. This file is automatically written to the pipeline_info subfolder of your output folder, e.g. output/pipeline_info/DataQcReport.html.
When you need to do some debugging, you can use the last section of the aforementioned report to figure out in which subfolder of the work folder the actual step was run. You can then navigate to this folder and investigate the following hidden files:
.command.sh: the script which was submitted
.command.log: log file for seeing the analysis outputs/errors
.command.err: file which lists the errors, if any
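For example, from the pipeline folder (the work subdirectory below is hypothetical; take the real one from the report):

```bash
cd work/4a/3f2b1c   # hypothetical hash; copy the actual path from the report
cat .command.sh     # the script which was submitted
less .command.log   # analysis outputs and errors
less .command.err   # errors only, if any
```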
Hopefully these three files give you some handle on what might have gone wrong. Contact us if you think you need any help to resolve the issue.
We have specified reasonable computational resources for each pipeline step. However, in some cases these might not suffice, depending on the HPC or dataset size. Resources for each pipeline step are specified in config/base.conf. This file lists the currently specified resources (RAM, CPU, walltime) for each step of the pipeline. You can try to increase those if the pipeline crashes due to insufficient resources. You can also adjust cluster-specific options by modifying the clusterOptions flag for each process. For example, you can modify clusterOptions = '--job-name=GenotypeQC' into clusterOptions = '--job-name=GenotypeQC -p long' to ensure that this specific process runs on the partition named long.
For some analyses, the work directory can become quite large. So, when the pipeline has successfully finished and you have checked the output, you can delete the work directory in order to save storage space on your HPC. But this means that the pipeline will restart from scratch if you ever need to rerun it.
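For example, from the pipeline folder:

```bash
# Check how much space the interim files take, then remove them
# only after you have verified the pipeline output
du -sh work/
rm -rf work/
```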
Nextflow and Singularity by default create cache folders in your home directory, which may unexpectedly fill your home. Therefore, we manually set and export the environment variables responsible for the cache folders within each Nextflow pipeline to ../../singularitycache and ../../nextflowcache. If you followed the naming scheme from the setup chapter and ran the pipelines from within the pipeline directories, these evaluate to eQTLGen_phase2/singularitycache and eQTLGen_phase2/nextflowcache. Be aware that these might differ from your defaults if you have sourced the submit script from a different directory.
For this analysis plan you need the following information and data:
An informative name for your dataset is one of the required inputs for the majority of the pipelines in this cookbook. This could be the combination of the cohort name and expression platform. If you contribute multiple datasets from e.g. several batches, you could specify it by adding _batch1. If you contribute multiple datasets from several ancestries, you could specify this by adding the ancestry, e.g. _EUR.
Example 1: For the EstBB cohort I have 2 datasets: EstBB_HT12v3 and EstBB_RNAseq.

Example 2: If there were two EstBB RNAseq datasets, I would use EstBB_RNAseq_batch1 and EstBB_RNAseq_batch2.
Unimputed genotype data in PLINK binary format (.bed/.bim/.fam). The .fam file should contain information about the reported sex of the samples. This information is used in the automatic data quality control step. If this information is not available, you can still use those files, but this specific quality control step will be skipped.

❗ Standard pre-imputation genotype and sample quality control will be done by our pipeline, so there is no need for thorough QC beforehand.
A gene expression matrix whose feature identifiers are either ENSEMBL gene IDs (RNA-seq), Illumina ArrayAddress IDs for probes, or probe names for the corresponding Affymetrix array.

❗ For RNA-seq data we expect that the raw RNA-seq data has already been appropriately processed and QC'd prior to generating the count matrix (e.g. proper settings for the read mapping, etc.).
❗ For Affymetrix arrays we expect that all the standard data preprocessing steps for this array type have already been applied (e.g. RMA normalization). On top of those, our pipeline just applies an inverse normal transformation to each gene (forcing the gene expression distribution to be normal).
❗ For Affymetrix arrays we have currently used probe-to-gene linking as used in eQTLGen phase 1. It is possible that you have some other format for probe IDs: if this is the case, please contact us and we will update the pipeline accordingly.
If you participated in eQTLGen phase 1 analyses, this file should already be available.
As standard, within the DataQC step a number of default covariates are automatically constructed from both the genotype data and the expression data. Respectively, we calculate 10 genotype PCs and 100 expression PCs from the data. The genotype-inferred sex is used as an additional covariate. Optionally, an additional set of covariates can be supplied if you know that these are relevant for the given dataset. For that, please make a file with additional cohort-specific covariates:
The first column of this file should be named SampleID.

We assume that potential sample mixups have already been identified, resolved or removed in previous research; e.g. see details from here.
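As an illustration, such an additional covariate file could look like the sketch below (the covariate names are just examples and a tab-delimited layout is assumed here; only the SampleID column name is required):

```
SampleID    Covariate1    Covariate2
Sample1     0.52          1
Sample2     -1.20         0
```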
In this step, we prepare the genotype and gene expression data for the next steps: the pipeline performs standard pre-imputation genotype and sample QC, expression QC and normalization, checks for sample mixups/contamination, constructs the default covariates (10 genotype PCs, 100 expression PCs and genotype-inferred sex), and prepares the QCd files used by the following pipelines.
❗ Because pre-phasing of genotypes benefits from larger sample sizes and many eQTL datasets have modest sample sizes (N<1,000), it is not advisable to prefilter the unimputed genotype data to include only the eQTL samples. Our pipeline will extract the eQTL samples itself (based on the genotype-to-expression file) and, if additional samples are available, includes up to 5,000 additional genotype samples which are kept until the pre-phasing step of the analysis. If genotype data is available for <5,000 samples, the pipeline uses only those samples which are available.
❗ For Affymetrix arrays: if you did not participate in eQTLGen phase 1, it might be that your probe name format does not align with our pipeline. Please contact urmo.vosa at gmail.com with some information on how the probe naming and annotation was done in your data. We will then update the pipeline accordingly.
The required inputs are the unimputed genotype data (--bfile), the gene expression matrix (--expfile), the genotype-to-expression linking file (--gte) and, optionally, a file with additional cohort-specific covariates (--AdditionalCovariates); the first column of the covariate file should be named SampleID.

Go into eQTLGen_phase2/1_DataQC.

Clone the DataQC pipeline:
git clone https://github.com/eQTLGen/DataQC.git

Make folder output.

Go into eQTLGen_phase2/1_DataQC/DataQC and adjust the script template with the needed modules, input paths and settings.

Aside from the paths to your data files, you must specify:

Expression platform (--exp_platform). Options: Illumina arrays: HT12v3, HT12v4 or HuRef8; RNA-seq: RNAseq; Affymetrix arrays: AffyU219 or AffyHumanExon.

Cohort name (--cohort_name): your unique dataset name used by all pipelines.

The genome build of your dataset (--genome_build): hg19, GRCh37, hg38 or GRCh38 (GRCh37 prefilled).

If your genotype data is in VCF format, you can replace the --bfile argument with the --vcf argument. The --vcf argument expects a full path to a vcf dataset. Pathname extension using globbing is allowed (using * or ?), but the path should be provided without pathname extension.

You can select the scheduler type by adjusting the profile as following:
-profile slurm,singularity
-profile pbs,singularity
-profile sge,singularity

If your HPC has no internet access, you can point the pipeline to locally downloaded dependencies (--plink_executable, --plink2_executable and --reference_1000g_folder). Contact the lead analysts for further guidance, if needed.

This is the template for the Slurm scheduler (submit_DataQC_pipeline_template.sh):
#!/bin/bash
#SBATCH --time=48:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=6G
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --job-name="DataQc"
# These are needed modules in UT HPC to get singularity and Nextflow running. Replace with appropriate ones for your HPC.
module load java/11.0.2
module load singularity/3.5.3
module load squashfs/4.4
# If you follow the eQTLGen phase II cookbook and analysis folder structure,
# some of the following paths are pre-filled.
# https://github.com/eQTLGen/eQTLGen-phase-2-cookbook/wiki/eQTLGen-phase-II-cookbook
# We set the following variables for nextflow to prevent writing to your home directory (and potentially filling it completely)
# Feel free to change these as you wish.
export SINGULARITY_CACHEDIR=../../singularitycache
export NXF_HOME=../../nextflowcache
# Disable pathname expansion. Nextflow handles pathname expansion by itself.
set -f
# Define paths
nextflow_path=../../tools # folder where Nextflow executable is
# Genotype data
bfile_path=[full path to your input genotype files without .bed/.bim/.fam extension]
# Other data
exp_path=[full path to your gene expression matrix]
gte_path=[full path to your genotype-to-expression file]
exp_platform=[expression platform name: HT12v3/HT12v4/HuRef8/RNAseq/AffyU219/AffyHumanExon]
cohort_name=[name of the cohort]
genome_build="GRCh37"
output_path=../output # Output path
# Additional settings and optional arguments for the command
# --GenOutThresh [numeric threshold]
# --GenSdThresh [numeric threshold]
# --ExpSdThresh [numeric threshold]
# --ContaminationSlope [number between 0 and 90, default 45]
# --ContaminationArea [number between 0 and 90, default 30]
# --AdditionalCovariates [file with additional covariates]
# --InclusionList [file with the list of samples to restrict the analysis]
# --ExclusionList [file with the list of samples to remove from the analysis]
# --preselected_sex_check_vars "data/Affy6_pruned_chrX_variant_positions.txt"
# --AdditionalCovariates [file with additional covariates. First column should be `SampleID`]
# --gen_qc_steps 'WGS'
# --fam [PLINK .fam file. Takes precedence over .fam file supplied with `--bfile`]
# --plink_executable [path to plink executable (PLINK v1.90b6.26 64-bit)]
# --plink2_executable [path to plink2 executable (PLINK v2.00a3.7LM 64-bit Intel)]
# --reference_1000g_folder [path to folder with 1000G reference data]
# Command:
NXF_VER=21.10.6 ${nextflow_path}/nextflow run DataQC.nf \
--bfile ${bfile_path} \
--expfile ${exp_path} \
--gte ${gte_path} \
--exp_platform ${exp_platform} \
--cohort_name ${cohort_name} \
--genome_build ${genome_build} \
--outdir ${output_path} \
-profile slurm,singularity \
-resume
Submit the job.
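For example, using the dataset-specific script name from the setup section above:

```bash
sbatch submit_DataQc_pipeline_EstBB_HT12v3.sh
```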
Go into the output folder, then download and investigate the thorough report file Report_DataQc_[your cohort name].html. You will find diagnostic plots and instructions on how to proceed in this report.
According to the instructions in the QC report, it is likely that you need to adjust some of the QC thresholds. In your Slurm script, you can add some or all of the arguments --GenOutThresh, --GenSdThresh, --ExpSdThresh, --ContaminationSlope and --ContaminationArea, and specify the appropriate value for each of those.
In WGS datasets, many samples might not fit the thresholds set on the X-chromosome heterozygosity (F<0.2 and F>0.8). To resolve this, you can add the argument --preselected_sex_check_vars ${sex_check_vars_path}, which will make the method use a preselected set of variants selected based on array data.
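For example, reusing the example value from the commented-out option in the submit script template (the path is assumed to be relative to the DataQC pipeline folder):

```bash
# Preselected chrX variants shipped with the DataQC pipeline (see the template comments)
sex_check_vars_path="data/Affy6_pruned_chrX_variant_positions.txt"
# Then append to the nextflow command in the submit script:
#   --preselected_sex_check_vars ${sex_check_vars_path} \
```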
In some cases, you might also need to split the data into several batches, according to the ancestry. Then you should run this and following pipelines for each of the batches separately.
Resubmit the job.
Go into the output folder, download and investigate the report file
Report_DataQc_[your cohort name].html
. Now QC plots should look as
expected.
If some of the plots still indicate issues, adjust the arguments and re-run the pipeline. You might need to do so a couple of times until there are no more apparent issues.
🎉 Done!
The pipeline writes its output into the specified directory. The important files are the following.
output/outputfolder_gen/gen_data_QCd/*_ToImputation.bed, output/outputfolder_gen/gen_data_QCd/*_ToImputation.bim and output/outputfolder_gen/gen_data_QCd/*_ToImputation.fam are the filtered and QCd genotype files which are the input for the next, imputation pipeline. You need to specify the path to the base name (no .bed/.bim/.fam extension) as an input.

The output folder should be specified as one of the inputs for the per-cohort preparations pipeline. This pipeline automatically uses the processed, QCd expression data and covariate file to run data encoding and partial-derivative calculation. It then organises the encoded matrices for sharing with the central site. It also extracts some QC files, plots and summaries for sharing with the central site.

❗ While continuing with the next step, now is also a good time to send the Report_DataQc_[your cohort name].html to Urmo Võsa (urmo.vosa at gmail.com) and Robert Warmerdam (c.a.warmerdam at umcg.nl). This way we can have an initial look at the quality of the data and let you know if something seems suboptimal.
In this step we impute the QC’d genotype data to the recently released 1000G 30X WGS reference panel.
Specifically, this pipeline performs the following steps:
- Harmonizes the QCd genotype data to the hg38 reference genome.
- Phases the genotypes (Eagle) and imputes them (Minimac) against the 1000G 30x WGS reference panel.
- Filters the imputed variants (MAF>0.01) and writes the output in .vcf format.

Required input: unimputed, QCd genotype data in plink .bed/.bim/.fam format. These files are in the output of the previous data QC pipeline, in the folder output/outputfolder_gen/gen_data_QCd. You need to specify the path to the base name (no .bed/.bim/.fam extension) as an input.

Go into eQTLGen_phase2/2_Imputation.
Clone the eQTLGenImpute pipeline:
git clone https://github.com/eQTLGen/eQTLGenImpute.git
Make folder output.
Download the zipped reference:
wget https://www.dropbox.com/s/6g58ygjg9d2fvbi/eQTLGenReferenceFiles.tar.gz?dl=1
Because this file is ~30GB and can get corrupted during the download, please also download the corresponding md5sum file:
wget https://www.dropbox.com/s/ekfciajzevn6o1l/eQTLGenReferenceFiles.tar.gz.md5?dl=1
Check that the download was successful:
md5sum --check eQTLGenReferenceFiles.tar.gz.md5
You should see this message in your terminal: eQTLGenReferenceFiles.tar.gz: OK
Unpack the reference file:
tar -xzvf eQTLGenReferenceFiles.tar.gz
This yields a folder named hg38 which contains all the needed reference files.
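You can quickly check that the expected reference subfolders are present (the folder names below are taken from the paths used in the templates of this cookbook):

```bash
ls hg38/
# Expected, among others: genome_reference, harmonizing_reference,
# phasing_reference, imputation_reference, hase_reference
```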
Go into the eQTLGen_phase2/2_Imputation/eQTLGenImpute pipeline folder and adjust the imputation script template with the needed modules and inputs:

Path to the QCd genotype data: path pointing to the unimputed and QC'd genotype files in plink .bed/.bim/.fam format, ending with *_ToImputation. These files are in the output of the previous data QC pipeline (ending with *_ToImputation.*) in the folder output/outputfolder_gen/gen_data_QCd. This path is pre-filled.
Path to reference folder (pre-filled).
Output path: the output folder you made (pre-filled).

Cohort name: this should be the same as in DataQC.
Path to the folder with reference files (pre-filled).
The genome build of your dataset (--genome_build): hg19, GRCh37, hg38 or GRCh38 (hg19/GRCh37 by default).
You can select scheduler type by adjusting the profile as following:
-profile slurm,singularity
-profile pbs,singularity
-profile sge,singularity
This is the template for the Slurm scheduler (submit_imputation_pipeline_template.sh):
#!/bin/bash
#SBATCH --time=72:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=6G
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --job-name="ImputeGenotypes"
# These are needed modules in UT HPC to get Singularity and Nextflow running.
# Replace with appropriate ones for your HPC.
module load java/11.0.2
module load singularity/3.5.3
module load squashfs/4.4
# If you follow the eQTLGen phase II cookbook and analysis folder structure,
# some of the following paths are pre-filled.
# https://github.com/eQTLGen/eQTLGen-phase-2-cookbook/wiki/eQTLGen-phase-II-cookbook
# We set the following variables for nextflow to prevent writing to your home directory (and potentially filling it completely)
# Feel free to change these as you wish.
export SINGULARITY_CACHEDIR=../../singularitycache
export NXF_HOME=../../nextflowcache
# Define paths and arguments
nextflow_path=../../tools # folder where Nextflow executable is.
reference_path=../hg38 # folder where you unpacked the reference files.
cohort_name=[name of your cohort]
qc_input_folder=../../1_DataQC/output/ # folder with QCd genotype and expression data, output of DataQC pipeline.
output_path=../output/ # Output path.
genome_build="GRCh37"
# Command
NXF_VER=21.10.6 ${nextflow_path}/nextflow run eQTLGenImpute.nf \
--qcdata ${qc_input_folder} \
--target_ref ${reference_path}/genome_reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--ref_panel_hg38 ${reference_path}/harmonizing_reference/30x-GRCh38_NoSamplesSorted \
--eagle_genetic_map ${reference_path}/phasing_reference/genetic_map/genetic_map_hg38_withX.txt.gz \
--eagle_phasing_reference ${reference_path}/phasing_reference/phasing/ \
--minimac_imputation_reference ${reference_path}/imputation_reference/ \
--cohort_name ${cohort_name} \
--genome_build ${genome_build} \
--outdir ${output_path} \
-profile slurm,singularity \
-resume
Submit the job.
Check the output/pipeline_info/imputation_report.html.
🎉 Done!
The pipeline writes its output into the specified directory. The important files are the following.
output/postimpute/ should be specified as the input for the next step, genotype conversion. This folder contains the imputed .vcf.gz files, filtered by MAF>0.01.

output/preimpute/ contains the formatted, filtered, and QCd genotype data before imputation (.vcf.gz). This is not directly needed for this cookbook: it might be used for other applications or for debugging the imputation, or it can just be deleted.

In this step, we convert the imputed genotype files from .vcf.gz format into the efficient .hdf5 format. This file format is native to HASE and is needed for running the following HASE steps (encoding, partial-derivative computation) in a more efficient way.
Specifically, this pipeline performs the following steps:
- Splits the imputed .vcf.gz files into chunks of 25,000 variants for faster parallel processing.
- Converts the genotype chunks into .hdf5 format.
- Extracts per-variant quality metrics; these will be used for performing additional per-variant QC at the central site.

Required input: imputed genotype data in .vcf.gz format (the output of the imputation pipeline).

Go into eQTLGen_phase2/3_ConvertVcf2Hdf5.
Clone the ConvertVcf2Hdf5 pipeline:
git clone https://github.com/eQTLGen/ConvertVcf2Hdf5.git
Make folder output
Go into 3_ConvertVcf2Hdf5/ConvertVcf2Hdf5 and adjust the genotype conversion script template with the needed modules and inputs.
Imputed genotype path (pre-filled).
Output path (pre-filled).
You can select scheduler type by adjusting the profile as following:
-profile slurm,singularity
-profile pbs,singularity
-profile sge,singularity
This is the template for the Slurm scheduler (submit_genotype_conversion_template.sh):
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5G
#SBATCH --job-name="ConvertVcf2Hdf5"
# These are needed modules in UT HPC to get Singularity and Nextflow running.
# Replace with appropriate ones for your HPC.
module load java/11.0.2
module load singularity/3.5.3
module load squashfs/4.4
# If you follow the eQTLGen phase II cookbook and analysis folder structure,
# some of the following paths are pre-filled.
# https://github.com/eQTLGen/eQTLGen-phase-2-cookbook/wiki/eQTLGen-phase-II-cookbook
# We set the following variables for nextflow to prevent writing to your home directory (and potentially filling it completely)
# Feel free to change these as you wish.
export SINGULARITY_CACHEDIR=../../singularitycache
export NXF_HOME=../../nextflowcache
nextflow_path=../../tools/ # Path to Nextflow executable, no need to adjust if folder structure is same as recommended in cookbook.
cohortname=[name of your cohort]
genopath=../../2_Imputation/output/postimpute # Folder with input genotype files in .vcf.gz format, no need to adjust if the folder structure is same as recommended in cookbook.
outputpath=../output/ # Path to output folder, no need to adjust if the folder structure is same as recommended in cookbook.
NXF_VER=21.10.6 ${nextflow_path}/nextflow run ConvertVcf2Hdf5.nf \
--vcf ${genopath} \
--outdir ${outputpath} \
--cohort_name ${cohortname} \
-profile slurm,singularity \
-resume
Submit the job.
Check the output/pipeline_info/ConvertGenotype_report.html.
🎉 Done!
After successful completion of the pipeline, there should be an hdf5 genotype file structure in your output folder which, in addition to pipeline_info, contains folders named individuals, probes, genotype and SNPQC. For eQTLGen phase 2 analyses, this folder is one of the two inputs of the per-cohort data preparations pipeline.
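A quick way to check this structure from the ConvertVcf2Hdf5 pipeline folder (assuming the recommended ../output path):

```bash
ls ../output/
# Expected subfolders: genotype  individuals  pipeline_info  probes  SNPQC
```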
This step prepares the data and performs all the steps which are needed for running encoded HASE meta-analysis in the central site.
Specifically, this pipeline performs the following steps:
- Encodes the genotype and expression matrices.
- Calculates the partial-derivative matrices.
- Prepares the files for permuted data.
- Organises the non-personal encoded data, logs and reports for sharing with the central site.
- Calculates an md5sum for all the shared files, so the integrity of the upload can be checked.

Required inputs: the output folder of the data QC pipeline (QCd expression data and covariates) and the genotype data in .hdf5 format; the latter folder is the output of the genotype conversion pipeline.

Go into eQTLGen_phase2/4_PerCohortPreparations.
Make folder output
Clone the PerCohortDataPreparations pipeline:
git clone https://github.com/eQTLGen/PerCohortDataPreparations.git
Go into 4_PerCohortPreparations/PerCohortDataPreparations.
Put the genotype reference file into the 4_PerCohortPreparations/PerCohortDataPreparations/bin/hase/data/ folder. This was already downloaded together with the imputation references in the Imputation step, so you can copy it to the correct location like this:
cp ../../2_Imputation/hg38/hase_reference/* bin/hase/data/.
Adjust the template script.
Path to the genotypes in hdf5 format (pre-filled)
Path to QC’d expression data (pre-filled)
Output path (pre-filled)
You can select scheduler type by adjusting the profile as following:
-profile slurm,singularity
-profile pbs,singularity
-profile sge,singularity
This is the template script for the Slurm scheduler (submit_per_cohort_preparations_template.sh):
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=5G
#SBATCH --job-name="RunDataPreparations"
# These are needed modules in UT HPC to get Singularity and Nextflow running.
# Replace with appropriate ones for your HPC.
module load java/11.0.2
module load singularity/3.5.3
module load squashfs/4.4
# If you follow the eQTLGen phase II cookbook and analysis folder structure,
# some of the following paths are pre-filled.
# https://github.com/eQTLGen/eQTLGen-phase-2-cookbook/wiki/eQTLGen-phase-II-cookbook
# We set the following variables for nextflow to prevent writing to your home directory (and potentially filling it completely)
# Feel free to change these as you wish.
export SINGULARITY_CACHEDIR=../../singularitycache
export NXF_HOME=../../nextflowcache
nextflow_path=../../tools
genotypes_hdf5=../../3_ConvertVcf2Hdf5/output # Folder with genotype files in .hdf5 format
qc_data_folder=../../1_DataQC/output # Folder containing QCd data, inc. expression and covariates
output_path=../output
NXF_VER=21.10.6 ${nextflow_path}/nextflow run PerCohortDataPreparations.nf \
--hdf5 ${genotypes_hdf5} \
--qcdata ${qc_data_folder} \
--outdir ${output_path} \
-profile slurm,singularity \
-resume
Submit the job.
Check the output/pipeline_info/PerCohortDataPreparations_report.html.
🎉 Done!
In the output folder there is a subfolder called [YourCohortName]_IntermediateFilesEncoded_to_upload. This folder contains all the non-personal files, logs and reports, and it should be shared with the central site.

In the output folder there is also the file [YourCohortName]_IntermediateFilesEncoded_to_upload.md5. This should also be shared with the central site, in order to check the integrity of the uploaded files.

Instructions for getting an SFTP account in the UT HPC: send an e-mail to urmo.vosa at gmail.com and we will arrange the account. We will also instruct you on how to upload the needed data to the SFTP server.
The output folder from the per-cohort preparations pipeline, named [YourCohortName]_IntermediateFilesEncoded_to_upload, and its accompanying .md5 file should be uploaded to our SFTP server.
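If you want, you can verify the files locally against the checksum file before (or after) uploading; a sketch, assuming the .md5 file is in standard md5sum format and sits next to the folder it describes:

```bash
cd ../output   # folder containing the *_IntermediateFilesEncoded_to_upload folder and its .md5 file
md5sum --check [YourCohortName]_IntermediateFilesEncoded_to_upload.md5
```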
❗ You might want to inform your HPC team about this project and the planned upload. Although the sheer file size is much smaller than it would be for a classical meta-analysis, it is still ~50GB in the case of 500 samples and RNA-seq. For the few largest datasets, the upload might be up to 500GB and might raise some red flags when monitoring the data traffic ;).
🎉 Thank you, you have shared the data necessary for global eQTL mapping! We will keep you updated and will re-contact if any additional information is needed.
The following pipelines are for performing the global trans-eQTL meta-analysis at the central location and are outlined here FYI. If you are a cohort analyst, no further action is needed. However, feel free to use this code and these instructions to conduct further analyses in your respective cohort.
Pipeline for running encoded meta-analysis - this is for running encoded eQTL meta-analysis on one or several datasets.
Pipeline for extracting subsets of data from full summary statistics - this is for extracting data subsets for subsequent interpretation.
We have been working on more convenient output formats and pipelines that allow for quicker extractions of parts of the data. These updates are not yet included in the github repositories. These will be updated soon.
Here are some runtime benchmarks for each of the steps in this cookbook; these might give you a hint of how much time running the pipelines might take.

All benchmarks are for an RNA-seq dataset with the following features:
- Samples in the .bed genotype file: 1,052
- Variants in the .bed genotype file: 700,078

Data QC:
- Size of the output directory: 344MB
- Size of the work subdirectory: 1.8GB
- Note: the benchmark refers to the initial run with default settings; you probably need to re-run at least once after adjusting the settings.

Imputation:
- Size of the output directory: ~13GB
- Size of the work subdirectory: ~100GB

Genotype conversion:
- .vcf.gz genotype data: ~4,000
- Size of the output directory: ~4GB
- Size of the work subdirectory: ~100GB

Per-cohort data preparations:
- Size of the output directory: ~50GB
- Size of the work subdirectory: ~100GB

This cookbook utilizes HASE (https://github.com/roshchupkin/hase) and some of its helper scripts, originally developed by:
Gennady V. Roshchupkin (Department of Epidemiology, Radiology and Medical Informatics, Erasmus MC, Rotterdam, Netherlands)
Hieab H. Adams (Department of Epidemiology, Erasmus MC, Rotterdam, Netherlands).
Robert Warmerdam (Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, Netherlands) modified the original HASE and fixed some bugs. He also added the classic-meta option to HASE analyses, which enables performing (encoded) inverse-variance weighted meta-analysis, and has started implementing methods for encoded interaction analysis.
Urmo Võsa (Institute of Genomics, University of Tartu, Tartu, Estonia) incorporated it into the Nextflow pipelines and applied some minor customizations to parts of the code.
Changes:
- Fixed the --intercept option having no effect.
- Changed self.chunk_size=10000 --> self.chunk_size=20000.

Original method paper for the HASE framework: Roshchupkin et al. HASE: Framework for efficient high-dimensional association analyses. Scientific Reports (2016).

1000G 30x imputation reference: Byrska-Bishop et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell (2022).

Nextflow workflow management tool: Di Tommaso et al. Nextflow enables reproducible computational workflows. Nature Biotechnology (2017).
For this cookbook: Urmo Võsa (urmo.vosa at gmail.com) and Robert Warmerdam (c.a.warmerdam at umcg.nl)
For the method of HASE, please find contacts from here: https://github.com/roshchupkin/hase